Change Log

MLBench Core

v3.0.0

v3.0.0 (2020-12-07)

Full Changelog

Implemented enhancements:

  • Support multiple clusters in CLI #91

  • Add notebook/code to visualize results #72

  • Support AWS in CLI #33

  • Fix rnn language model #303 (ehoelzl)

  • Transformer language translation #99 (ehoelzl)

Fixed bugs:

  • Training code keeps running for PyTorch after training is done #26

Closed issues:

  • Remove loss argument for metric computation #295

  • Update PyTorch to 1.7 #286

  • Refactor optimizer and choose more appropriate names #284

  • fails to create kind cluster #277

  • Refactor CLI #253

  • Dependabot couldn’t authenticate with https://pypi.python.org/simple/ #252

  • Unify requirements/setup.py versions #244

  • isort failing on all PRs #227

  • torch.div is not supported in PyTorch 1.6 #223

  • Refactor common functionality for tiller and helm #108

  • Add GPU support for AWS in CLI #104

  • Change CPU limit to #CPUs - 1 #101

  • Add --version flag #97

  • Cluster creation/deletion errors with non-default zone #94

  • Add command to list runs #86

  • RefreshError from gcloud #83

  • Run new benchmarks and document costs #82

  • Make nvidia k80 default GPU #80

  • Fix random seeds #79

  • benchmark against torch.nn.parallel.DistributedDataParallel MPSG #75

  • upgrade to pytorch 1.5 #74

  • Provide comparison to competitors #66

  • Add some integration tests #64

  • Remove stale branches #62

  • Add PowerSGD optimizer #59

  • Add RNN Language Model #54

  • Use torch.nn.DataParallel for intra-node computation #46

  • Add CLI support for DIND #42

  • Port over functionality from Language Model benchmark to the core library #34

  • make results reproducible from command-line #24

  • Contribution and docs section on README.md #17

  • test new torch.distributed #15

v2.4.0

v2.4.0 (2020-04-20)

Full Changelog

Implemented enhancements:

  • Switch to black for code formatting #35

Closed issues:

  • Travis tests run only for Python 3.6 #65

  • Downloading results fails if --output option is not provided #57

  • Remember user input in mlbench run #56

  • Aggregate the gradients by model, instead of by layers. #45

  • Update docker images to CUDA10, mlbench-core module to newest #43

  • Upgrade PyTorch to 1.4 #40

v2.3.2

v2.3.2 (2020-04-07)

Full Changelog

Implemented enhancements:

  • Add NCCL & GLOO Backend support #49

  • Add NCCL & GLOO Backend support #47 (giorgiosav)

Fixed bugs:

  • math ValueError with 1-node cluster #38

v2.3.1

v2.3.1 (2020-03-09)

Full Changelog

Implemented enhancements:

  • Customize Communication Scheme For Sparsified/Quantized/Decentralized scenarios #12

v2.3.0

v2.3.0 (2019-12-23)

Full Changelog

v2.2.1

v2.2.1 (2019-12-16)

Full Changelog

v2.2.0

v2.2.0 (2019-11-11)

Full Changelog

Implemented enhancements:

  • initialize_backends can now be called as a context manager

  • Improved CLI to run multiple runs in parallel

v2.1.1

v2.1.1 (2019-11-11)

Full Changelog

v2.1.0

v2.1.0 (2019-11-04)

Full Changelog

Implemented enhancements:

  • Added CLI for MLBench runs

v2.0.0

v2.0.0 (2019-06-13)

Full Changelog

v1.4.4

v1.4.4 (2019-05-28)

Full Changelog

v1.4.3

v1.4.3 (2019-05-23)

Full Changelog

v1.4.2

v1.4.2 (2019-05-21)

Full Changelog

v1.4.1

v1.4.1 (2019-05-16)

Full Changelog

v1.4.0

v1.4.0 (2019-05-02)

Full Changelog

Implemented enhancements:

  • Split Train and Validation in Tensorflow #22

v1.3.4

v1.3.4 (2019-03-20)

Full Changelog

Implemented enhancements:

  • in controlflow, don’t mix train and validation #20

Fixed bugs:

  • Add metrics logging for Tensorflow #19

v1.3.3

v1.3.3 (2019-02-26)

Full Changelog

v1.3.2

v1.3.2 (2019-02-13)

Full Changelog

v1.3.1

v1.3.1 (2019-02-13)

Full Changelog

v1.3.0

v1.3.0 (2019-02-12)

Full Changelog

v1.2.1

v1.2.1 (2019-01-31)

Full Changelog

v1.2.0

v1.2.0 (2019-01-30)

Full Changelog

v1.1.1

v1.1.1 (2019-01-09)

Full Changelog

v1.1.0

v1.1.0 (2018-12-06)

Full Changelog

Fixed bugs:

  • Bug when saving checkpoints #13

Implemented enhancements:

  • Adds Tensorflow Controlflow, Dataset and Model code

  • Adds Pytorch linear models

  • Adds sparsified and decentralized optimizers

v1.0.0

v1.0.0 (2018-11-15)

Implemented enhancements:

  • Add API Client to mlbench-core #6

  • Move to google-style docs #4

  • Add Imagenet Dataset for pytorch #3

  • Move worker code to mlbench-core repo #1

MLBench Helm

v3.0.0

v3.0.0 (2020-12-07)

Full Changelog

Implemented enhancements:

  • Add DIND Setup Script #4

  • Add Amazon Cloud setup script #3

Closed issues:

  • Add integration tests for newer versions of Kubernetes #23

  • Add deployment on KIND rather than Minikube #21

  • Use of GCloud script #19

  • Can not configure NVIDIA on AWS #17

  • Migrate to Kubernetes API v1 #15

  • Deployment on minikube requires kubernetes 1.15 #13

  • Remove obsolete info in values.yaml #12

  • mlbench worker pods not created #11

v2.0.0

Implemented enhancements:

  • Added GKE and AWS Setup Scripts

MLBench Dashboard

v3.0.0

v3.0.0 (2020-12-07)

Full Changelog

Implemented enhancements:

  • Allow running of custom code #9

  • Define Job resource for mpirun execution #2

  • Create Kubernetes Job to execute mpirun #1

Closed issues:

  • Add integration tests #86

  • Dependabot couldn’t authenticate with https://pypi.python.org/simple/ #74

  • Fix dashboard scheduling #49

  • Add ability to stop run before end #48

  • Make sure all results are well zipped #44

  • Prevent user from inserting invalid run names #28

  • Travis tests run only for Python 3.6 #24

  • Remove stale branches #23

v2.0.0

Implemented enhancements:

  • Added Download of Task Goals

  • Fixed some performance issues

v1.1.0

Implemented enhancements:

  • Added new Tensorflow Benchmark Image

  • Remove Bandwidth limiting

  • Added ability to run custom images in dashboard

MLBench Benchmarks

v3.0.0

v3.0.0 (2020-12-07)

Full Changelog

Implemented enhancements:

  • Update PyTorch base to 1.7 #64

  • Add NLP/machine translation Transformer benchmark task #33

  • Repair Logistic regression Model #30

  • Add NLP/machine translation RNN benchmark task #27

  • Add NLP benchmark images & task #24

  • Add Gloo support to PyTorch images #23

  • Add NCCL support to PyTorch images #22

  • documentation: clearly link ref code to benchmark tasks #14

  • Add time-to-accuracy speedup plot #7

  • Update GKE documentation to use kubernetes version 1.10.9 #4

  • Add tensorflow cifar10 benchmark #3

  • Transformer language translation #51 (ehoelzl)

Fixed bugs:

  • Change Tensorflow Benchmark to use OpenMPI #8

Closed issues:

  • Clean-up tasks #63

  • Support for local run #59

  • task implementations: delete choco, name tasks nlp/language-model and nlp/translation #55

  • remove open/closed division distinction #47

  • [Not an Issue] Comparing 3 backends on multi-node single-gpu env #44

  • Create light version of the base image for development #43

  • No unit tests #40

  • Remove stale branches #39

  • Remove Communication backend from image name #36

  • pytorch 1.4 #34

  • create light version (in addition to full) for resource heavy benchmark tasks #19

  • add script to compute official results from raw results (time to acc for example) #18

v2.0.0

Implemented enhancements:

  • Added Goals to PyTorch Benchmark

  • Updated PyTorch Tutorial code

  • Changed all images to newest mlbench-core version.

v1.1.0

Implemented enhancements:

  • Added Tensorflow Benchmark