Change Log

MLBench Core

v3.0.0

v3.0.0 (2020-12-07)

Full Changelog

Implemented enhancements:

Support multiple clusters in CLI #91
Add notebook/code to visualize results #72
Support AWS in CLI #33
Fix rnn language model #303 (ehoelzl)
Transformer language translation #99 (ehoelzl)

Fixed bugs:

Training code keeps running for PyTorch after training is done #26

Closed issues:

Remove loss argument for metric computation #295
Update PyTorch to 1.7 #286
Refactor optimizer and choose more appropriate names #284
fails to create kind cluster #277
Refactor CLI #253
Dependabot couldn’t authenticate with https://pypi.python.org/simple/ #252
Unify requirements/setup.py versions #244
isort failing on all PRs #227
torch.div is not supported in PyTorch 1.6 #223
Refactor common functionality for tiller and helm #108
Add GPU support for AWS in CLI #104
Change CPU limit to #CPUs - 1 #101
Add --version flag #97
Cluster creation/deletion errors with non-default zone #94
Add command to list runs #86
RefreshError from gcloud #83
Run new benchmarks and document costs #82
Make nvidia k80 default GPU #80
Fix random seeds #79
benchmark against torch.nn.parallel.DistributedDataParallel MPSG #75
upgrade to pytorch 1.5 #74
Provide comparison to competitors #66
Add some integration tests #64
Remove stale branches #62
Add PowerSGD optimizer #59
Add RNN Language Model #54
Use torch.nn.DataParallel for intra-node computation #46
Add CLI support for DIND #42
Port over functionality from Language Model benchmark to the core library #34
make results reproducible from command-line #24
Contribution and docs section on README.md #17
test new torch.distributed #15

Merged pull requests:

Bugfix KIND cli #307 (ehoelzl)
Update README.md to show new badge #306 (ehoelzl)
Create manual.yml #305 (ehoelzl)
Switch to github actions #304 (ehoelzl)
Bump sphinx from 3.3.0 to 3.3.1 #301 (dependabot[bot])
Remove loss from metric argument #297 (ehoelzl)
Fix translators #294 (ehoelzl)
Update pytorch #292 (ehoelzl)
Bump sphinx from 3.2.1 to 3.3.0 in /docs #288 (dependabot[bot])
Refactor optimizers #285 (ehoelzl)
Bump isort from 5.5.4 to 5.6.4 #283 (dependabot[bot])
Bump sphinx-autoapi from 1.5.0 to 1.5.1 #280 (dependabot[bot])
Add gpu functionality on AWS #278 (mmilenkoski)
Catch exceptions when creating/deleting clusters #276 (ehoelzl)
Fix doc #275 (ehoelzl)
Fix AWS deployment #274 (mmilenkoski)
Create dependabot.yml #260 (ehoelzl)
Merge requirements & Update doc #259 (ehoelzl)
Bump google-api-python-client from 1.9.3 to 1.12.1 #246 (dependabot-preview[bot])
Bump numpy from 1.19.0 to 1.19.2 #245 (dependabot-preview[bot])
Bump boto3 from 1.14.6 to 1.14.50 #234 (dependabot-preview[bot])
Fix isort errors #233 (mmilenkoski)
Bump pytest-mock from 3.1.1 to 3.3.1 #231 (dependabot-preview[bot])
Bump isort from 4.3.21 to 5.4.2 #221 (dependabot-preview[bot])
Bump sphinx from 3.0.4 to 3.2.1 #220 (dependabot-preview[bot])
Bump grpcio from 1.29.0 to 1.31.0 #207 (dependabot-preview[bot])
Bump spacy from 2.3.0 to 2.3.2 #182 (dependabot-preview[bot])
Downgrade Sphinx #162 (ehoelzl)
Add developer docs #161 (Panaetius)
Fp optimizer changes #160 (ehoelzl)
Bump wcwidth from 0.1.9 to 0.2.5 #156 (dependabot-preview[bot])
Bump all versions and add doc test #152 (Panaetius)
Bump torchvision from 0.6.0 to 0.6.1 #151 (dependabot-preview[bot])
Bump numpy from 1.18.5 to 1.19.0 #150 (dependabot-preview[bot])
Bump torch from 1.5.0 to 1.5.1 #148 (dependabot-preview[bot])
Bump google-auth from 1.17.2 to 1.18.0 #147 (dependabot-preview[bot])
Bump sphinx-rtd-theme from 0.4.3 to 0.5.0 #144 (dependabot-preview[bot])
Bump spacy from 2.2.4 to 2.3.0 #142 (dependabot-preview[bot])
Bump sphinx from 3.1.0 to 3.1.1 #140 (dependabot-preview[bot])
Bump dill from 0.3.1.1 to 0.3.2 #138 (dependabot-preview[bot])
Update dependencies #137 (Panaetius)
Bump spacy from 2.2.3 to 2.2.4 #135 (dependabot-preview[bot])
Bump numpy from 1.16.6 to 1.18.5 #133 (dependabot-preview[bot])
Bump freezegun from 0.3.12 to 0.3.15 #129 (dependabot-preview[bot])
Bump tabulate from 0.8.6 to 0.8.7 #128 (dependabot-preview[bot])
Bump deprecation from 2.0.6 to 2.1.0 #125 (dependabot-preview[bot])
Bump pytest-black from 0.3.8 to 0.3.9 #124 (dependabot-preview[bot])
Bump sphinx-rtd-theme from 0.4.2 to 0.4.3 #123 (dependabot-preview[bot])
Bump sphinx from 1.8.1 to 3.1.0 #121 (dependabot-preview[bot])
Bump pytest-mock from 1.10.0 to 3.1.1 #120 (dependabot-preview[bot])
Bump torchtext from 0.5.0 to 0.6.0 #118 (dependabot-preview[bot])
Bump torchvision from 0.5.0 to 0.6.0 #117 (dependabot-preview[bot])
Adds support for multiple clusters #115 (Panaetius)
Bump click from 7.0 to 7.1.2 #114 (dependabot-preview[bot])
Bump google-cloud-container from 0.3.0 to 0.5.0 #113 (dependabot-preview[bot])
Bump appdirs from 1.4.3 to 1.4.4 #112 (dependabot-preview[bot])
Bump sphinxcontrib-bibtex from 0.4.0 to 1.0.0 #111 (dependabot-preview[bot])
Bump sphinx-autoapi from 1.3.0 to 1.4.0 #110 (dependabot-preview[bot])
Remove unused arguments in create_aws #109 (mmilenkoski)
Fix Random seeds, Add new tracker stats #107 (ehoelzl)
Add return_code check in test_cli #106 (mmilenkoski)
Add AWS support in CLI #103 (mmilenkoski)
Update test_cli.py #100 (giorgiosav)
Adds a chart command to cli #95 (Panaetius)
Add support for kind cluster creation in the CLI #93 (mmilenkoski)

v2.4.0

v2.4.0 (2020-04-20)

Full Changelog

Implemented enhancements:

Switch to black for code formatting #35

Closed issues:

Travis tests run only for Python 3.6 #65
Downloading results fails if --output option is not provided #57
Remember user input in mlbench run #56
Aggregate the gradients by model, instead of by layers. #45
Update docker images to CUDA10, mlbench-core module to newest #43
Upgrade PyTorch to 1.4 #40

Merged pull requests:

Pytorch v1.4.0 #68 (ehoelzl)
Fix ci #67 (ehoelzl)
Add aggregation by model #61 (ehoelzl)
Remember user input in mlbench run #60 (mmilenkoski)
Add default name of output file in CLI #58 (mmilenkoski)
Cli adaptation #55 (ehoelzl)
Update tags and patch version to 2.3.2 #52 (ehoelzl)
Add get_optimizer to create optimizer object #48 (mmilenkoski)

v2.3.2

v2.3.2 (2020-04-07)

Full Changelog

Implemented enhancements:

Add NCCL & GLOO Backend support #49
Add NCCL & GLOO Backend support #47 (giorgiosav)

Fixed bugs:

math ValueError with 1-node cluster #38

Merged pull requests:

num_workers fix #51 (giorgiosav)
Adds centralized Adam implementation #41 (mmilenkoski)

v2.3.1

v2.3.1 (2020-03-09)

Full Changelog

Implemented enhancements:

Customize Communication Scheme For Sparsified/Quantized/Decentralized scenarios #12

v2.3.0

v2.3.0 (2019-12-23)

Full Changelog

v2.2.1

v2.2.1 (2019-12-16)

Full Changelog

v2.2.0

v2.2.0 (2019-11-11)

Full Changelog

Implemented enhancements:

initialize_backends can now be called as a context manager
Improved CLI to run multiple runs in parallel

v2.1.1

v2.1.1 (2019-11-11)

Full Changelog

v2.1.0

v2.1.0 (2019-11-04)

Full Changelog

Implemented enhancements:

Added CLI for MLBench runs

v2.0.0

v2.0.0 (2019-06-13)

Full Changelog

v1.4.4

v1.4.4 (2019-05-28)

Full Changelog

v1.4.3

v1.4.3 (2019-05-23)

Full Changelog

v1.4.2

v1.4.2 (2019-05-21)

Full Changelog

v1.4.1

v1.4.1 (2019-05-16)

Full Changelog

v1.4.0

v1.4.0 (2019-05-02)

Full Changelog

Implemented enhancements:

Split Train and Validation in Tensorflow #22

v1.3.4

v1.3.4 (2019-03-20)

Full Changelog

Implemented enhancements:

in controlflow, don’t mix train and validation #20

Fixed bugs:

Add metrics logging for Tensorflow #19

v1.3.3

v1.3.3 (2019-02-26)

Full Changelog

v1.3.2

v1.3.2 (2019-02-13)

Full Changelog

v1.3.1

v1.3.1 (2019-02-13)

Full Changelog

v1.3.0

v1.3.0 (2019-02-12)

Full Changelog

v1.2.1

v1.2.1 (2019-01-31)

Full Changelog

v1.2.0

v1.2.0 (2019-01-30)

Full Changelog

v1.1.1

v1.1.1 (2019-01-09)

Full Changelog

v1.1.0

v1.1.0 (2018-12-06)

Full Changelog

Fixed bugs:

Bug when saving checkpoints #13

Implemented enhancements:

Adds TensorFlow Controlflow, Dataset and Model code
Adds PyTorch linear models
Adds sparsified and decentralized optimizers

v1.0.0

v1.0.0 (2018-11-15)

Implemented enhancements:

Add API Client to mlbench-core #6
Move to google-style docs #4
Add Imagenet Dataset for pytorch #3
Move worker code to mlbench-core repo #1

MLBench Helm

v3.0.0

v3.0.0 (2020-12-07)

Full Changelog

Implemented enhancements:

Add DIND Setup Script #4
Add Amazon Cloud setup script #3

Closed issues:

Add integration tests for newer versions of Kubernetes #23
Add deployment on KIND rather than Minikube #21
Use of GCloud script #19
Can not configure NVIDIA on AWS #17
Migrate to Kubernetes API v1 #15
Deployment on minikube requires kubernetes 1.15 #13
Remove obsolete info in values.yaml #12
mlbench worker pods not created #11

Merged pull requests:

Add workflow #25 (ehoelzl)
Update to v1 #24 (ehoelzl)
Update doc requirements #22 (ehoelzl)
Remove AWS and GCloud scripts #20 (ehoelzl)
Removes unused entries from values.yaml #18 (Panaetius)
Switch to eksctl for aws deployment #16 (mmilenkoski)
Add setup script for kind with local registry #14 (mmilenkoski)

v2.0.0

Implemented enhancements:

Added GKE and AWS Setup Scripts

MLBench Dashboard

v3.0.0

v3.0.0 (2020-12-07)

Full Changelog

Implemented enhancements:

Allow running of custom code #9
Define Job resource for mpirun execution #2
Create Kubernetes Job to execute mpirun #1

Closed issues:

Add integration tests #86
Dependabot couldn’t authenticate with https://pypi.python.org/simple/ #74
Fix dashboard scheduling #49
Add ability to stop run before end #48
Make sure all results are well zipped #44
Prevent user from inserting invalid run names #28
Travis tests run only for Python 3.6 #24
Remove stale branches #23

Merged pull requests:

Switch to actions #121 (ehoelzl)
Bump sphinx from 3.3.0 to 3.3.1 in /docs #120 (dependabot[bot])
Fix stream disconnection #115 (ehoelzl)
Update images #114 (ehoelzl)
Fix integration tests #113 (ehoelzl)
Bump rq-scheduler from 0.8.3 to 0.10.0 #109 (dependabot[bot])
Bump sphinx from 3.2.1 to 3.3.0 in /docs #108 (dependabot[bot])
Bump fakeredis from 1.4.3 to 1.4.4 #102 (dependabot-preview[bot])
Bump pytest from 6.0.2 to 6.1.2 #101 (dependabot-preview[bot])
Bump pytest-django from 3.10.0 to 4.1.0 #100 (dependabot-preview[bot])
Bump tox from 3.20.0 to 3.20.1 #96 (dependabot-preview[bot])
Change ‘Benchmarks’ to ‘Benchmark Implementations’ #93 (ehoelzl)
Add integration tests #91 (ehoelzl)
Bump pytest-kind from 20.5.3 to 20.10.0 #89 (dependabot-preview[bot])
Add tests #75 (ehoelzl)
Bugfix #60 (ehoelzl)
Bump watchdog from 0.8.3 to 0.10.3 #58 (dependabot-preview[bot])
Bump uwsgi from 2.0.17 to 2.0.19.1 #57 (dependabot-preview[bot])
Bump sphinx from 1.7.1 to 3.1.1 #52 (dependabot-preview[bot])
Bump tox from 2.9.1 to 3.15.2 #46 (dependabot-preview[bot])
Bump sphinx-rtd-theme from 0.4.0 to 0.4.3 #45 (dependabot-preview[bot])
Bump django-constance from 2.2.0 to 2.6.0 #43 (dependabot-preview[bot])
Bump pytest-black from 0.3.8 to 0.3.9 #42 (dependabot-preview[bot])
Bump flake8 from 3.5.0 to 3.8.3 #40 (dependabot-preview[bot])
Bump redis from 2.10.6 to 3.5.3 #38 (dependabot-preview[bot])
Bump pip from 10.0.1 to 20.1.1 #37 (dependabot-preview[bot])
Bump bumpversion from 0.5.3 to 0.6.0 #34 (dependabot-preview[bot])
Bump django from 2.2.12 to 2.2.13 #33 (dependabot[bot])
Bump django from 2.2.12 to 2.2.13 in /Docker #32 (dependabot[bot])
Add backend benchmark #31 (ehoelzl)
Add transformer image #30 (ehoelzl)

v2.0.0

Implemented enhancements:

Added Download of Task Goals
Fixed some performance issues

v1.1.0

Implemented enhancements:

Added new TensorFlow Benchmark Image
Remove Bandwidth limiting
Added ability to run custom images in dashboard

MLBench Benchmarks

v3.0.0

v3.0.0 (2020-12-07)

Full Changelog

Implemented enhancements:

Update PyTorch base to 1.7 #64
Add NLP/machine translation Transformer benchmark task #33
Repair Logistic regression Model #30
Add NLP/machine translation RNN benchmark task #27
Add NLP benchmark images & task #24
Add Gloo support to PyTorch images #23
Add NCCL support to PyTorch images #22
documentation: clearly link ref code to benchmark tasks #14
Add time-to-accuracy speedup plot #7
Update GKE documentation to use kubernetes version 1.10.9 #4
Add tensorflow cifar10 benchmark #3
Transformer language translation #51 (ehoelzl)

Fixed bugs:

Change Tensorflow Benchmark to use OpenMPI #8

Closed issues:

Clean-up tasks #63
Support for local run #59
task implementations: delete choco, name tasks nlp/language-model and nlp/translation #55
remove open/closed division distinction #47
[Not an Issue] Comparing 3 backends on multi-node single-gpu env #44
Create light version of the base image for development #43
No unit tests #40
Remove stale branches #39
Remove Communication backend from image name #36
pytorch 1.4 #34
create light version (in addition to full) for resource heavy benchmark tasks #19
add script to compute official results from raw results (time to acc for example) #18

Merged pull requests:

Add workflow #68 (ehoelzl)
Fix rnn language model #67 (ehoelzl)
Update pytorch #65 (ehoelzl)
Adapt optimizer imports #62 (ehoelzl)
Translation changes #61 (ehoelzl)
Change ‘Benchmarks’ to ‘Benchmark Implementations’ #60 (ehoelzl)
Add generic worker #58 (ehoelzl)
Rename tasks #57 (ehoelzl)
Add link to task description #56 (ehoelzl)
Fix tasks #54 (ehoelzl)
Add backend benchmark code and image #53 (ehoelzl)
Update nccl #52 (ehoelzl)
Remove open/closed division from benchmarks #49 (mmilenkoski)
Pytorch 1.5.0 #48 (giorgiosav)
Refactor controlflow #46 (ehoelzl)
Add Image Recognition Benchmark with DistributedDataParallel #42 (mmilenkoski)
Pytorch v1.4.0 #41 (ehoelzl)
Add aggregation by model #38 (ehoelzl)
Add NCCL & GLOO support to images #35 (giorgiosav)
Rnn language translation #32 (ehoelzl)
Linear model #28 (ehoelzl)
Fix ci #26 (ehoelzl)
[WIP]Add LSTM language model #25 (Panaetius)

v2.0.0

Implemented enhancements:

Added Goals to PyTorch Benchmark
Updated PyTorch Tutorial code
Changed all images to newest mlbench-core version.

v1.1.0

Implemented enhancements:

Added TensorFlow Benchmark