mlbench_core.models

pytorch

Kuang Liu <https://github.com/kuangliu/pytorch-cifar> has already implemented many classical neural network models, so we use their implementation directly for

  • VGG

linear_models

LogisticRegression

class mlbench_core.models.pytorch.linear_models.LogisticRegression(n_features)[source]

Logistic regression implementation

Parameters

n_features (int) – Number of features
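As a rough illustration of what a logistic regression model over n_features inputs computes, the plain-Python sketch below applies a sigmoid to a weighted sum of the features. The names sigmoid, predict_proba, weights and bias are hypothetical and are not part of mlbench_core; the library class is a PyTorch module and may differ in detail.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(features, weights, bias=0.0):
    """Probability of the positive class for one sample (hypothetical helper)."""
    if len(features) != len(weights):
        raise ValueError("expected one weight per feature")
    z = sum(x * w for x, w in zip(features, weights)) + bias
    return sigmoid(z)


n_features = 3
weights = [0.0] * n_features          # untrained model: zero weights
prob = predict_proba([1.0, 2.0, 3.0], weights)
```

With all-zero weights the weighted sum is zero, so the sketch assigns equal probability to both classes.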

LinearRegression

class mlbench_core.models.pytorch.linear_models.LinearRegression(n_features)[source]

Ridge regression implementation

Parameters

n_features (int) – Number of features

resnet

Contains definitions for Residual Networks.

Residual networks were originally proposed in [HZRS16a] and later improved by the same authors in [HZRS16b]. Here we refer to the settings in [HZRS16a] as v1 and those in [HZRS16b] as v2.

torchvision already implements

  • ResNet-18

  • ResNet-34

  • ResNet-50

  • ResNet-101

  • ResNet-152

for ImageNet. Here we implement only the remaining models

  • ResNet-20

  • ResNet-32

  • ResNet-44

  • ResNet-56

for the CIFAR-10 dataset. In addition, the torchvision implementation uses projection shortcuts by default.
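The CIFAR-10 depths listed above follow the 6n + 2 rule from [HZRS16a]: three stages of n basic blocks with two convolutions each, plus the first convolution and the final fully connected layer. The sketch below recovers n from the depth; the helper name blocks_per_stage is hypothetical and not part of mlbench_core.

```python
def blocks_per_stage(resnet_size):
    """Return n such that resnet_size == 6 * n + 2 (hypothetical helper)."""
    if resnet_size < 8 or (resnet_size - 2) % 6 != 0:
        raise ValueError("CIFAR ResNet depth must have the form 6n + 2")
    return (resnet_size - 2) // 6


cifar_depths = [20, 32, 44, 56]
stage_sizes = {depth: blocks_per_stage(depth) for depth in cifar_depths}
```

Depths that do not fit the rule, such as 18, are rejected by the sketch with a ValueError.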

ResNetCIFAR

class mlbench_core.models.pytorch.resnet.ResNetCIFAR(resnet_size, bottleneck, num_classes, version=_DEFAULT_RESNETCIFAR_VERSION)[source]

Basic ResNet implementation.

Parameters
  • resnet_size (int) – Number of layers

  • bottleneck (bool) – Whether to use a bottleneck layer (Not Implemented)

  • num_classes (int) – Number of output classes

  • version (int) – Resnet version (1 or 2). Default: 1

RNN

Google Neural Machine Translation

Model
class mlbench_core.models.pytorch.gnmt.GNMT(vocab_size, hidden_size=1024, num_layers=4, dropout=0.2, share_embedding=True, fusion=True)[source]

GNMT v2 model

Parameters
  • vocab_size (int) – size of vocabulary (number of tokens)

  • hidden_size (int) – internal hidden size of the model

  • num_layers (int) – number of layers, applies to both encoder and decoder

  • dropout (float) – probability of dropout (in encoder and decoder)

  • share_embedding (bool) – if True embeddings are shared between encoder and decoder

decode(self, inputs, context, inference=False)

Applies the decoder to inputs, given the context from the encoder.

Parameters
  • inputs (torch.tensor) – tensor with inputs (seq_len, batch)

  • context – context from the encoder

  • inference – if True inference mode, if False training mode

Returns

torch.tensor

encode(self, inputs, lengths)

Applies the encoder to inputs with the given input sequence lengths.

Parameters
  • inputs (torch.tensor) – tensor with inputs (seq_len, batch)

  • lengths – vector with sequence lengths (excluding padding)

Returns

torch.tensor

generate(self, inputs, context, beam_size)

Autoregressive generator, works with SequenceGenerator class. Executes decoder (in inference mode), applies log_softmax and topK for inference with beam search decoding.

Parameters
  • inputs – tensor with inputs to the decoder

  • context – context from the encoder

  • beam_size – beam size for the generator

Returns

(words, logprobs, scores, new_context)

  • words – indices of topK tokens

  • logprobs – log probabilities of topK tokens

  • scores – scores from the attention module (for coverage penalty)

  • new_context – new decoder context, includes new hidden states for decoder RNN cells
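The log_softmax and topK steps mentioned for generate can be sketched in plain Python for a single decoding step. This is only an illustration on a toy four-token vocabulary; the function names and the list-based representation are assumptions, whereas the library operates on torch tensors.

```python
import math


def log_softmax(logits):
    """Log-probabilities from raw scores, using the max-shift trick for stability."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]


def top_k(logprobs, k):
    """Return (indices, values) of the k largest log-probabilities."""
    order = sorted(range(len(logprobs)), key=lambda i: logprobs[i], reverse=True)[:k]
    return order, [logprobs[i] for i in order]


logits = [2.0, 0.5, 3.0, -1.0]        # decoder scores for a toy vocabulary
logprobs = log_softmax(logits)
words, word_logprobs = top_k(logprobs, k=2)   # candidates for a beam of size 2
```

Here words holds the token indices that a beam search would expand next, and word_logprobs holds their log-probabilities.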

BahdanauAttention
Encoder
class mlbench_core.models.pytorch.gnmt.encoder.ResidualRecurrentEncoder(vocab_size, hidden_size=1024, num_layers=4, dropout=0.2, embedder=None, init_weight=0.1)[source]

Encoder with Embedding, LSTM layers, residual connections and optional dropout.

The first LSTM layer is bidirectional and uses the variable sequence length API; the remaining (num_layers-1) layers are unidirectional. Residual connections are enabled after the third LSTM layer, and dropout is applied on inputs to LSTM layers.

Parameters
  • vocab_size – size of vocabulary

  • hidden_size – hidden size for LSTM layers

  • num_layers – number of LSTM layers, 1st layer is bidirectional

  • dropout – probability of dropout (on input to LSTM layers)

  • embedder – instance of nn.Embedding, if None constructor will create new embedding layer

  • init_weight – range for the uniform initializer

forward(self, inputs, lengths)[source]

Execute the encoder.

Parameters
  • inputs – tensor with indices from the vocabulary

  • lengths – vector with sequence lengths (excluding padding)

Returns

tensor with encoded sequences

Decoder
class mlbench_core.models.pytorch.gnmt.decoder.RecurrentAttention(input_size=1024, context_size=1024, hidden_size=1024, num_layers=1, dropout=0.2, init_weight=0.1, fusion=True)[source]

LSTM wrapped with an attention module.

Parameters
  • input_size (int) – number of features in input tensor

  • context_size (int) – number of features in output from encoder

  • hidden_size (int) – internal hidden size

  • num_layers (int) – number of layers in LSTM

  • dropout (float) – probability of dropout (on input to LSTM layer)

  • init_weight (float) – range for the uniform initializer

forward(self, inputs, hidden, context, context_len)[source]

Execute RecurrentAttention.

Parameters
  • inputs (int) – tensor with inputs

  • hidden (int) – hidden state for LSTM layer

  • context – context tensor from encoder

  • context_len – vector of encoder sequence lengths

Returns

(rnn_outputs, hidden, attn_output, attn_scores)

class mlbench_core.models.pytorch.gnmt.decoder.Classifier(in_features, out_features, init_weight=0.1)[source]

Fully-connected classifier

Parameters
  • in_features (int) – number of input features

  • out_features (int) – number of output features (size of vocabulary)

  • init_weight (float) – range for the uniform initializer

forward(self, x)[source]

Execute the classifier.

Parameters

x (torch.tensor) –

Returns

torch.tensor

class mlbench_core.models.pytorch.gnmt.decoder.ResidualRecurrentDecoder(vocab_size, hidden_size=1024, num_layers=4, dropout=0.2, embedder=None, init_weight=0.1, fusion=True)[source]

Decoder with Embedding, LSTM layers, attention, residual connections and optional dropout.

Attention implemented in this module is different than the attention discussed in the GNMT arxiv paper. In this model the output from the first LSTM layer of the decoder goes into the attention module, then the re-weighted context is concatenated with inputs to all subsequent LSTM layers in the decoder at the current timestep.

Residual connections are enabled after the 3rd LSTM layer, and dropout is applied on inputs to LSTM layers.

Parameters
  • vocab_size (int) – size of vocabulary

  • hidden_size (int) – hidden size for LSTM layers

  • num_layers (int) – number of LSTM layers

  • dropout (float) – probability of dropout (on input to LSTM layers)

  • embedder (nn.Embedding) – if None constructor will create new embedding layer

  • init_weight (float) – range for the uniform initializer
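The residual wiring described for this decoder can be pictured with a toy stack of layers. The sketch assumes that "after the 3rd LSTM layer" means the third layer and every later one add their input to their output; attention and dropout are left out, and run_stack, halve and first_residual_layer are hypothetical names.

```python
def halve(v):
    """Stand-in for one recurrent layer: simply scales its input."""
    return 0.5 * v


def run_stack(x, layers, first_residual_layer=2):
    """Apply layers in order; layers at index >= first_residual_layer add their input back."""
    for index, layer in enumerate(layers):
        out = [layer(v) for v in x]
        if index >= first_residual_layer:
            out = [o + v for o, v in zip(out, x)]   # residual connection
        x = out
    return x


layers = [halve] * 4
result = run_stack([8.0], layers)     # 8 -> 4 -> 2 -> 1 + 2 = 3 -> 1.5 + 3 = 4.5
```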

append_hidden(self, h)[source]

Appends the hidden vector h to the list of internal hidden states.

Parameters

h – hidden vector

forward(self, inputs, context, inference=False)[source]

Execute the decoder.

Parameters
  • inputs – tensor with inputs to the decoder

  • context – state of encoder, encoder sequence lengths and hidden state of decoder’s LSTM layers

  • inference – if True stores and repackages hidden state

Returns:

init_hidden(self, hidden)[source]

Converts flattened hidden state (from sequence generator) into a tuple of hidden states.

Parameters

hidden – None or flattened hidden state for decoder RNN layers

package_hidden(self)[source]

Flattens the hidden state from all LSTM layers into one tensor (for the sequence generator).
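The round trip between init_hidden and package_hidden can be pictured with plain Python containers. The sketch below assumes one (h, c) pair per LSTM layer and uses strings in place of tensors; flatten_states and unflatten_states are hypothetical stand-ins, not the library methods.

```python
from itertools import chain


def flatten_states(states):
    """Flatten [(h, c), (h, c), ...] into one flat list."""
    return list(chain.from_iterable(states))


def unflatten_states(flat, num_layers):
    """Rebuild per-layer (h, c) pairs; None means no state for any layer."""
    if flat is None:
        return [None] * num_layers
    if len(flat) != 2 * num_layers:
        raise ValueError("expected one (h, c) pair per layer")
    return [(flat[2 * i], flat[2 * i + 1]) for i in range(num_layers)]


states = [("h0", "c0"), ("h1", "c1")]
flat = flatten_states(states)
restored = unflatten_states(flat, num_layers=2)
```

A flat representation of this kind is convenient for a sequence generator, which can then reorder or select beams on a single object.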

Transformer Model for Translation

Model
class mlbench_core.models.pytorch.transformer.TransformerModel(args, src_dict, trg_dict)[source]

Transformer model

This model uses MultiHeadAttention as described in [VSP+17]

Parameters
forward(self, src_tokens, src_lengths, prev_output_tokens)

Run the forward pass of the transformer model.

Parameters
  • src_tokens (torch.Tensor) – Source tokens

  • src_lengths (torch.Tensor) – Source sentence lengths

  • prev_output_tokens (torch.Tensor) – Previous output tokens

Returns

The model output, and attention weights if needed

Return type

(torch.Tensor, Optional[torch.Tensor])

Encoder
class mlbench_core.models.pytorch.transformer.encoder.TransformerEncoder(args, dictionary, embed_tokens, left_pad=True)[source]

Transformer encoder consisting of args.encoder_layers layers. Each layer is a TransformerEncoderLayer.

Parameters
  • args – Arguments of model. All arguments should be accessible via __getattribute__ method

  • dictionary (mlbench_core.dataset.nlp.pytorch.wmt17.Dictionary) – encoding dictionary

  • embed_tokens (torch.nn.Embedding) – input embedding

  • left_pad (bool) – Pad sources to the left (True) or right (False). Default: True
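The effect of left_pad can be illustrated on a small batch of token-index lists. This is a minimal sketch: pad_batch and pad_idx are hypothetical names, and the library pads tensors rather than Python lists.

```python
def pad_batch(sequences, pad_idx, left_pad):
    """Pad token-index lists to equal length, on the left or on the right."""
    width = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        fill = [pad_idx] * (width - len(seq))
        padded.append(fill + list(seq) if left_pad else list(seq) + fill)
    return padded


batch = [[5, 6, 7], [8, 9]]
left = pad_batch(batch, pad_idx=1, left_pad=True)     # encoder default
right = pad_batch(batch, pad_idx=1, left_pad=False)   # decoder default
```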

forward(self, src_tokens)[source]

Forward function of encoder

Parameters

src_tokens (torch.Tensor) – Source tokens

Returns

{encoder_out (torch.Tensor), encoder_padding_mask (torch.Tensor)}

Return type

(dict)

Decoder
class mlbench_core.models.pytorch.transformer.decoder.TransformerDecoder(args, dictionary, embed_tokens, no_encoder_attn=False, left_pad=False)[source]

Transformer decoder consisting of args.decoder_layers layers. Each layer is a TransformerDecoderLayer.

Parameters
  • args – Arguments of model. All arguments should be accessible via __getattribute__ method

  • dictionary (mlbench_core.dataset.nlp.pytorch.wmt17.Dictionary) – decoding dictionary

  • embed_tokens (torch.nn.Embedding) – output embedding

  • no_encoder_attn (bool, optional) – if True, the decoder does not attend to encoder outputs (default: False).

  • left_pad (bool) – Pad targets to the left (True) or right (False). Default: False

Layers
class mlbench_core.models.pytorch.transformer.modules.TransformerEncoderLayer(args)[source]

Encoder layer block.

In the original paper each operation (multi-head attention or FFN) is postprocessed with: dropout -> add residual -> layernorm. In the tensor2tensor code they suggest that learning is more robust when preprocessing each layer with layernorm and postprocessing with: dropout -> add residual. We default to the approach in the paper, but the tensor2tensor approach can be enabled by setting args.encoder_normalize_before to True.

Parameters

args (argparse.Namespace) – parsed command-line arguments
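The two orderings described for the encoder layer differ only in where layer normalization sits relative to the residual connection. The sketch below shows both on a plain list of floats, with dropout omitted and without learned scale or shift; layer_norm, sublayer_block and double are hypothetical names, and double merely stands in for attention or the FFN.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned parameters)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def double(vec):
    """Stand-in for a sub-layer such as attention or the FFN."""
    return [2.0 * v for v in vec]


def sublayer_block(x, sublayer, normalize_before):
    """Apply one sub-layer with a residual connection (dropout omitted)."""
    residual = x
    if normalize_before:                  # tensor2tensor style: norm -> op -> add
        x = layer_norm(x)
    x = [a + b for a, b in zip(sublayer(x), residual)]
    if not normalize_before:              # paper style: op -> add -> norm
        x = layer_norm(x)
    return x


post_norm = sublayer_block([1.0, 2.0, 3.0], double, normalize_before=False)
pre_norm = sublayer_block([1.0, 2.0, 3.0], double, normalize_before=True)
```

In the post-norm case the block output is itself normalized, while in the pre-norm case the unnormalized residual passes straight through to the output.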

class mlbench_core.models.pytorch.transformer.modules.TransformerDecoderLayer(args, no_encoder_attn=False)[source]

Decoder layer block.

In the original paper each operation (multi-head attention, encoder attention or FFN) is postprocessed with: dropout -> add residual -> layernorm. In the tensor2tensor code they suggest that learning is more robust when preprocessing each layer with layernorm and postprocessing with: dropout -> add residual. We default to the approach in the paper, but the tensor2tensor approach can be enabled by setting args.decoder_normalize_before to True.

Parameters
  • args (argparse.Namespace) – parsed command-line arguments

  • no_encoder_attn (bool, optional) – if True, the layer does not attend to encoder outputs (default: False).

SequenceGenerator
class mlbench_core.models.pytorch.transformer.sequence_generator.SequenceGenerator(model, src_dict, trg_dict, beam_size=1, minlen=1, maxlen=None, stop_early=True, normalize_scores=True, len_penalty=1, retain_dropout=False, sampling=False, sampling_topk=-1, sampling_temperature=1)[source]

Generates translations of a given source sentence.

Parameters
  • model (torch.nn.Module) – The model to predict on. Should be instance of TransformerModel

  • src_dict (mlbench_core.dataset.nlp.pytorch.wmt17.Dictionary) – Source dictionary

  • trg_dict (mlbench_core.dataset.nlp.pytorch.wmt17.Dictionary) – Target dictionary

  • beam_size (int) – Size of the beam. Default 1

  • minlen (int) – Minimum generation length. Default 1

  • maxlen (int) – Maximum generation length. If None, takes value of model.max_decoder_positions(). Default None

  • stop_early (bool) – Stop generation immediately after we finalize beam_size hypotheses, even though longer hypotheses might have better normalized scores. Default True

  • normalize_scores (bool) – Normalize scores by the length of the output. Default True

  • len_penalty (float) – length penalty: <1.0 favors shorter, >1.0 favors longer sentences. Default 1

  • retain_dropout (bool) – Keep dropout layers. Default False

  • sampling (bool) – sample hypotheses instead of using beam search. Default False

  • sampling_topk (int) – sample from top K likely next words instead of all words. Default -1

  • sampling_temperature (int) – temperature for random sampling. Default 1
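How normalize_scores and len_penalty interact can be sketched as follows, assuming a hypothesis score is its summed log-probability divided by length ** len_penalty. That formula is an assumption made for illustration and is not confirmed by this page; normalized_score is a hypothetical helper.

```python
def normalized_score(sum_logprob, length, len_penalty=1.0, normalize_scores=True):
    """Divide a hypothesis' total log-probability by length ** len_penalty."""
    if not normalize_scores:
        return sum_logprob
    return sum_logprob / (length ** len_penalty)


short = (-4.0, 4)     # (total log-probability, length in tokens)
long = (-6.0, 8)

# With len_penalty=1 the scores are per-token averages: -1.0 versus -0.75.
prefers_long = normalized_score(*long) > normalized_score(*short)

# With len_penalty=0 nothing is rescaled, so the shorter hypothesis wins.
prefers_short = (
    normalized_score(*short, len_penalty=0.0)
    > normalized_score(*long, len_penalty=0.0)
)
```

Under this assumption, larger len_penalty values shrink the negative score of long hypotheses more strongly, which matches the documented tendency to favor longer sentences.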

generate(self, src_tokens, src_lengths, maxlen=None, prefix_tokens=None)[source]

Generate a batch of translations.

generate_batch_translations(self, batch, maxlen_a=0.0, maxlen_b=None, prefix_size=0)[source]

Yield individual translations of a batch.

Parameters
  • batch (dict) – The model input batch. Must have keys net_input, target and ntokens

  • maxlen_a (float) –

  • maxlen_b (Optional[int]) – Generate sequences of max lengths maxlen_a*x + maxlen_b where x = input sentence length

  • prefix_size (int) – Prefix size

translate_batch(self, batch, maxlen_a=1.0, maxlen_b=50, prefix_size=0, remove_bpe=None, nbest=1, ignore_case=True)[source]
Parameters
  • batch (dict) – The model input batch. Must have keys net_input, target and ntokens

  • maxlen_a (float) – Default 1.0

  • maxlen_b (Optional[int]) – Generate sequences of max lengths maxlen_a*x + maxlen_b where x = input sentence length. Default 50

  • prefix_size (int) – Prefix size. Default 0

  • remove_bpe (Optional[str]) – BPE token. Default None

  • nbest (int) – Number of hypotheses to output. Default 1

  • ignore_case (bool) – Ignore case during online eval. Default True

Returns

The translations and their targets for the given batch

Return type

(list[str], list[str])
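The length bound maxlen_a*x + maxlen_b used by translate_batch can be worked through with its default values. The helper name max_output_length is hypothetical, and truncating the result to an integer is an assumption of this sketch.

```python
def max_output_length(src_len, maxlen_a=1.0, maxlen_b=50):
    """Longest output allowed for a source sentence of src_len tokens."""
    return int(maxlen_a * src_len + maxlen_b)


limits = [max_output_length(n) for n in (5, 20)]
```

With the defaults, a 5-token source may produce up to 55 tokens and a 20-token source up to 70.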

References

HZRS16a

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 770–778. 2016.

HZRS16b

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European conference on computer vision, 630–645. Springer, 2016.

VSP+17

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc., 2017. URL: http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf.

NLP

LSTM Language Model

class mlbench_core.models.pytorch.nlp.RNNLM(ntoken, ninp, nhid, nlayers, dropout=0.5, tie_weights=False, weight_norm=False, batch_first=False)[source]

Container module with an encoder, a recurrent module, and a decoder.

repackage_hidden(self, h)[source]

Wraps hidden states in new Tensors, to detach them from their history.

tensorflow

resnet

Contains definitions for Residual Networks. Residual networks (‘v1’ ResNets) were originally proposed in [HZRS16a]. The full preactivation ‘v2’ ResNet variant was introduced by [HZRS16b]. The key difference of the full preactivation ‘v2’ variant compared to the ‘v1’ variant in [HZRS16a] is the use of batch normalization before every weight layer rather than after.

mlbench_core.models.tensorflow.resnet_model.fixed_padding(inputs, kernel_size, data_format)[source]

Pads the input along the spatial dimensions independently of input size.

Parameters
  • inputs (tf.Tensor) – A tensor of size [batch, channels, height_in, width_in] or [batch, height_in, width_in, channels] depending on data_format.

  • kernel_size (int) – The kernel to be used in the conv2d or max_pool2d operation. Should be a positive integer.

  • data_format (str) – The input format (‘channels_last’ or ‘channels_first’).

Returns

A tensor with the same format as the input with the data either intact (if kernel_size == 1) or padded (if kernel_size > 1).
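Padding that depends only on kernel_size, and not on the input size, can be sketched for one spatial dimension as below. The split of kernel_size - 1 into a leading and a trailing part is an assumption about how fixed_padding divides the padding; fixed_padding_amounts and pad_row are hypothetical helpers working on Python lists rather than tensors.

```python
def fixed_padding_amounts(kernel_size):
    """Return (pad_beg, pad_end) for one spatial dimension."""
    if kernel_size < 1:
        raise ValueError("kernel_size must be a positive integer")
    pad_total = kernel_size - 1
    pad_beg = pad_total // 2
    pad_end = pad_total - pad_beg
    return pad_beg, pad_end


def pad_row(row, kernel_size, fill=0):
    """Zero-pad a 1-D list by the same amounts whatever its length."""
    pad_beg, pad_end = fixed_padding_amounts(kernel_size)
    return [fill] * pad_beg + list(row) + [fill] * pad_end


unchanged = pad_row([7], kernel_size=1)       # kernel_size == 1: data left intact
padded = pad_row([1, 2], kernel_size=3)       # kernel_size > 1: data padded
```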

mlbench_core.models.tensorflow.resnet_model.conv2d_fixed_padding(inputs, filters, kernel_size, strides, data_format)[source]

Strided 2-D convolution with explicit padding.

mlbench_core.models.tensorflow.resnet_model.block_layer(inputs, filters, bottleneck, block_fn, blocks, strides, training, name, data_format)[source]

Creates one layer of blocks for the ResNet model.

Parameters
  • inputs (tf.Tensor) – A tensor of size [batch, channels, height_in, width_in] or [batch, height_in, width_in, channels] depending on data_format.

  • filters (int) – The number of filters for the first convolution of the layer.

  • bottleneck (bool) – Whether the block created is a bottleneck block.

  • block_fn (callable) – The block to use within the model, either building_block or bottleneck_block.

  • blocks (int) – The number of blocks contained in the layer.

  • strides (int) – The stride to use for the first convolution of the layer. If greater than 1, this layer will ultimately downsample the input.

  • training (bool) – Either True or False, whether we are currently training the model. Needed for batch norm.

  • name (str) – A string name for the tensor output of the block layer.

  • data_format (str) – The input format (‘channels_last’ or ‘channels_first’).

Returns

The output tensor of the block layer.

mlbench_core.models.tensorflow.resnet_model.batch_norm(inputs, training, data_format)[source]

Performs a batch normalization using a standard set of parameters.

Model

class mlbench_core.models.tensorflow.resnet_model.Model(resnet_size, bottleneck, num_classes, num_filters, kernel_size, conv_stride, first_pool_size, first_pool_stride, block_sizes, block_strides, resnet_version=DEFAULT_VERSION, data_format=None, dtype=DEFAULT_DTYPE)[source]

Base class for building the Resnet Model.

Cifar10Model

class mlbench_core.models.tensorflow.resnet_model.Cifar10Model(resnet_size, data_format=None, num_classes=10, resnet_version=2, dtype=tf.float32)[source]

Model class with appropriate defaults for CIFAR-10 data.