Installation

Make sure to read Prerequisites before installing mlbench.

Then, the library can be installed directly using pip:

$ pip install mlbench-core

This installs the mlbench CLI into the current environment and allows creating and deleting clusters, as well as starting benchmark runs.

$ mlbench --help
   Usage: mlbench [OPTIONS] COMMAND [ARGS]...

   Console script for mlbench_cli.

   Options:
     --version  Print mlbench version
     --help     Show this message and exit.

   Commands:
     charts             Chart the results of benchmark runs Save generated...
     create-cluster     Create a new cluster.
     delete             Delete a benchmark run
     delete-cluster     Delete a cluster.
     download           Download the results of a benchmark run
     get-dashboard-url  Returns the dashboard URL of the current cluster
     list-clusters      List all currently configured clusters.
     run                Start a new run for a benchmark image
     set-cluster        Set the current cluster to use.
     status             Get the status of a benchmark run, or all runs if no...
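
For example, a clean installation into a fresh virtual environment might look like this (a minimal sketch; the environment name mlbench-env is arbitrary):

$ python3 -m venv mlbench-env            # create an isolated environment
$ source mlbench-env/bin/activate        # activate it
$ pip install mlbench-core               # install the library and CLI
$ mlbench --version                      # confirm the CLI is on the PATH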

Cluster & Run Deployment (using CLI)

One can easily deploy a cluster on either AWS or GCloud using the mlbench CLI.

For example, one can create a GCloud cluster by running:

$ mlbench create-cluster gcloud 3 my-cluster
[...]
MLBench successfully deployed

This creates a cluster called my-cluster-3 with 3 nodes (see mlbench create-cluster gcloud --help for more options).
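
To double-check the result from the GCloud side, one can, for instance, list the clusters in the current project (this assumes the gcloud CLI is configured for the same project):

$ gcloud container clusters list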

Once created, experiments can be run using:

$ mlbench run my-run 2

Benchmark:

 [0]     PyTorch Cifar-10 ResNet-20
 [1]     PyTorch Cifar-10 ResNet-20 (Scaling LR)
 [2]     PyTorch Linear Logistic Regression
 [3]     PyTorch Machine Translation GNMT
 [4]     PyTorch Machine Translation Transformer
 [5]     Tensorflow Cifar-10 ResNet-20 Open-MPI
 [6]     PyTorch Distributed Backend benchmarking
 [7]     Custom Image


Selection [0]: 1

[...]

Run started with name my-run-2

A few handy commands for quickstart:

  • To obtain the dashboard URL: mlbench get-dashboard-url.

  • To see the state of the experiment: mlbench status my-run-2.

  • To download the results of the experiment: mlbench download my-run-2.

  • To delete the cluster: mlbench delete-cluster gcloud my-cluster-3
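
Put together, a typical quickstart session for the run and cluster created above might look like this:

$ mlbench get-dashboard-url                    # print the dashboard URL
$ mlbench status my-run-2                      # check the state of the experiment
$ mlbench download my-run-2                    # fetch the results once finished
$ mlbench delete-cluster gcloud my-cluster-3   # tear down the cluster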

Manual helm chart deployment (Optional)

Helm Chart installation

The manual deployment requires the mlbench-helm repository to be cloned and Helm to be installed (see Helm (Required) in the Prerequisites).

MLBench’s Helm charts can also be deployed manually on a running Kubernetes cluster. This requires the cluster credentials to be present in the kubectl config. For example, to obtain the credentials for a GCloud Kubernetes cluster, run:

$ gcloud container clusters get-credentials --zone ${MACHINE_ZONE} ${CLUSTER_NAME}

This configures kubectl to talk to the cluster.
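
To verify that kubectl now points at the right cluster, one can, for example, check the current context and list the nodes:

$ kubectl config current-context
$ kubectl get nodes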

Then, to deploy the dashboard on the running cluster, we first need to set up Helm with a service account that has cluster-admin rights:

$ kubectl --namespace kube-system create sa tiller
$ kubectl create clusterrolebinding tiller --clusterrole cluster-admin --serviceaccount=kube-system:tiller
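
To confirm that both objects were created, one can run:

$ kubectl get serviceaccount tiller --namespace kube-system
$ kubectl get clusterrolebinding tiller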

Then, we install the chart on the cluster:

$ cd mlbench-helm
$ helm upgrade --wait --recreate-pods -f values.yaml \
     --timeout 900s --install ${RELEASE_NAME} . \
     --set limits.workers=${NUM_NODES-1} \
     --set limits.gpu=${NUM_GPUS} \
     --set limits.cpu=${NUM_CPUS-1}

Where:
  • RELEASE_NAME represents the cluster name (called my-cluster-3 in the example above)

  • NUM_NODES is the maximum number of worker nodes available. This sets the maximum number of nodes that can be chosen for an experiment in the UI/CLI.

  • NUM_GPUS is the number of GPUs requested by each worker pod.

  • NUM_CPUS is the maximum number of CPUs (cores) available on each worker node. Uses Kubernetes notation (8 or 8000m for 8 CPUs/cores). This is also the maximum number of cores that can be selected for an experiment in the UI.

This deploys the Helm charts, with the corresponding images, to each node and sets the hardware limits.
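
For instance, for the 3-node example cluster above, assuming no GPUs and 4-core worker VMs (keeping one core free for system pods, see the Caution further below), the call might look like this; the values are purely illustrative and must match your actual cluster:

$ export RELEASE_NAME=my-cluster-3
$ helm upgrade --wait --recreate-pods -f values.yaml \
     --timeout 900s --install ${RELEASE_NAME} . \
     --set limits.workers=2 \
     --set limits.gpu=0 \
     --set limits.cpu=3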

Note

Get the application URL by running these commands:
$ export NODE_PORT=$(kubectl get --namespace default -o jsonpath="{.spec.ports[0].nodePort}" services ${RELEASE_NAME}-mlbench-master)
$ export NODE_IP=$(gcloud compute instances list|grep $(kubectl get nodes --namespace default -o jsonpath="{.items[0].status.addresses[0].address}") |awk '{print $5}')
$ gcloud compute firewall-rules create --quiet mlbench --allow tcp:$NODE_PORT
$ echo http://$NODE_IP:$NODE_PORT

Danger

The firewall-rules command above opens the dashboard port to the internet on Google Cloud. Make sure to delete the rule once it is no longer needed:

$ gcloud compute firewall-rules delete --quiet mlbench

One can also create a new myvalues.yml file with custom limits:

limits:
  workers:
  cpu:
  gpu:

gcePersistentDisk:
  enabled:
  pdName:

  • limits.workers is the maximum number of worker nodes available to mlbench. This sets the maximum number of nodes that can be chosen for an experiment in the UI. By default, mlbench starts 2 workers.

  • limits.cpu is the maximum number of CPUs (cores) available on each worker node. Uses Kubernetes notation (8 or 8000m for 8 CPUs/cores). This is also the maximum number of cores that can be selected for an experiment in the UI.

  • limits.gpu is the number of GPUs requested by each worker pod.

  • gcePersistentDisk.enabled creates the resources related to the NFS persistentVolume and persistentVolumeClaim.

  • gcePersistentDisk.pdName is the name of an existing persistent disk in GKE.
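
A filled-in myvalues.yml might, for example, look as follows (illustrative values only; my-pd-name stands for a disk created as in the Note further below). It can then be passed to helm in addition to the default values.yaml:

$ cat > myvalues.yml <<EOF
limits:
  workers: 2           # maximum number of worker nodes usable by mlbench
  cpu: 4               # cores per worker node (Kubernetes notation, e.g. 4 or 4000m)
  gpu: 0               # GPUs requested by each worker pod

gcePersistentDisk:
  enabled: true        # create the NFS persistentVolume/persistentVolumeClaim
  pdName: my-pd-name   # name of an existing GCE persistent disk
EOF
$ helm upgrade --wait --recreate-pods -f values.yaml -f myvalues.yml \
     --timeout 900s --install ${RELEASE_NAME} .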

Caution

If workers, cpu or gpu are set higher than what is available in the cluster, Kubernetes will not be able to allocate nodes to mlbench and the deployment will hang indefinitely without reporting an error; Kubernetes simply waits until nodes that fit the requirements become available. So make sure the cluster actually meets the requested limits.
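
If the deployment does hang, the pending worker pods and the scheduler’s reason for not placing them can be inspected with standard kubectl commands, e.g. (replace the placeholder with one of the pods listed as Pending):

$ kubectl get pods
$ kubectl describe pod <pending-pod-name>   # the Events section shows why scheduling fails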

Note

To use GPUs in the cluster, the NVIDIA device plugin must be installed. See Plugins for details.

Note

Use a command like gcloud compute disks create --size=10G --zone=europe-west1-b my-pd-name to create the persistent disk.

Note

The GCE persistent disk will be mounted to the /datasets/ directory on each worker.

Caution

Google installs several pods on each node by default, limiting the available CPU. These can take up to 0.5 CPU cores per node, so make sure to provision VMs that have at least one more core than the number of cores you want to use for your mlbench experiment. See here for further details on node limits.
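
To see how much CPU is actually left for mlbench on a given node, one can compare its capacity and allocatable resources, for example:

$ kubectl describe node <node-name> | grep -A 6 Allocatable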

Plugins

In values.yaml, one can optionally install Kubernetes plugins by turning the corresponding flags on or off.

Kubernetes-in-Docker (KIND)

Kubernetes-in-Docker allows simulating multiple nodes locally on a single machine. This approach should be used only for local development and testing. It is not a recommended way to measure benchmark results.

To use KIND, you need to set up a local registry and start a KIND cluster. We provide the script kind-with-registry.sh, which starts a local registry and a local cluster with one master and two worker nodes.
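
Assuming the script is available in the current directory, the registry and cluster can be brought up and checked like this (the exact cluster and node names depend on the script’s configuration):

$ ./kind-with-registry.sh
$ kind get clusters          # should list the newly created cluster
$ kubectl get nodes          # one control-plane node and two workers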

To push an image to the local registry, follow the procedure below. We use the image mlbench/pytorch-cifar10-resnet-scaling:2.3.0 for illustration, but you can use any image of your choice.

  1. Pull (or build) an image on your local machine:

$ docker pull mlbench/pytorch-cifar10-resnet-scaling:2.3.0

  2. Tag the image to use the local registry:

$ docker tag mlbench/pytorch-cifar10-resnet-scaling:2.3.0 localhost:5000/pytorch-cifar10-resnet-scaling:2.3.0

  3. Push the image to the local registry:

$ docker push localhost:5000/pytorch-cifar10-resnet-scaling:2.3.0

  4. Now you can use the image as a custom image when starting a run on your cluster. Please make sure to specify the new tag of the image (localhost:5000/pytorch-cifar10-resnet-scaling:2.3.0 in the running example).

Next, you need to install Helm (see Prerequisites) and follow the steps in Manual helm chart deployment (Optional) above.

Finally, to install mlbench on your local cluster, run the following command (you can replace rel with a release name of your choice):

$ helm upgrade --wait --recreate-pods -f values.yaml --timeout 900s --install rel .
[...]
NOTES:
1. Get the application URL by running these commands:
   export NODE_PORT=$(kubectl get --namespace default -o jsonpath="{.spec.ports[0].nodePort}" services rel-mlbench-master)
   export NODE_IP=$(kubectl get nodes --namespace default -o jsonpath="{.items[0].status.addresses[0].address}")
   echo http://$NODE_IP:$NODE_PORT

Run the three commands printed in the NOTES output above. The third one prints the URL where you can access the MLBench Dashboard. From there, you can start and monitor runs on your local cluster.