Installation

Make sure to read Prerequisites before installing mlbench.

All guides assume you have checked out the mlbench-helm github repository and have a terminal open in the checked-out mlbench-helm directory.
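
If you still need to check out the chart, a minimal way to do so (assuming the repository is hosted at github.com/mlbench/mlbench-helm) is:

$ git clone https://github.com/mlbench/mlbench-helm.git
$ cd mlbench-helm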

Google Cloud and Cluster Setup

This project provides a script that performs all of the Google Cloud and cluster setup. To use it, run the following commands:

$ ./google_cloud_setup.sh create-cluster
$ ./google_cloud_setup.sh install-chart

To delete the cluster and clean up:

$ ./google_cloud_setup.sh delete-cluster

To uninstall the chart:

$ ./google_cloud_setup.sh uninstall-chart

For general information on the available commands, please run:

$ ./google_cloud_setup.sh help

Helm Chart values

Since every Kubernetes cluster is different, there are no reasonable defaults for some values, so the following properties have to be set. You can save them in a YAML file of your choosing; this guide assumes you saved them in myvalues.yaml. For a reference of all configurable values, copy the values.yaml file to myvalues.yaml. An example with illustrative values follows the list below.

limits:
  workers:
  cpu:
  gpu:

gcePersistentDisk:
  enabled:
  pdName:

  • limits.workers is the maximum number of worker nodes available to mlbench. This sets the maximum number of nodes that can be chosen for an experiment in the UI. By default, mlbench starts 2 workers on startup.
  • limits.cpu is the maximum number of CPUs (cores) available on each worker node, in Kubernetes notation (8 or 8000m for 8 CPUs/cores). This is also the maximum number of cores that can be selected for an experiment in the UI.
  • limits.gpu is the number of GPUs requested by each worker pod.
  • gcePersistentDisk.enabled creates the resources related to the NFS persistentVolume and persistentVolumeClaim.
  • gcePersistentDisk.pdName is the name of an existing persistent disk in GKE.
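
A minimal myvalues.yaml might look like the following; the numbers and the disk name are purely illustrative and have to match what your cluster actually provides:

limits:
  workers: 2        # up to 2 worker nodes selectable in the UI
  cpu: 4000m        # 4 cores per worker node
  gpu: 0            # no GPUs requested per worker pod

gcePersistentDisk:
  enabled: false    # set to true and provide pdName to mount a GCE disk at /datasets/
  pdName: my-pd-name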

Caution

If you set workers, cpu or gpu higher than what is available in your cluster, Kubernetes will not be able to allocate nodes to mlbench and the deployment will hang indefinitely without throwing an exception; Kubernetes simply waits until nodes that fit the requirements become available. So make sure your cluster actually has the resources you requested.

Note

To use GPUs in the cluster, the NVIDIA device plugin has to be installed. See Plugins for details.
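
Once the device plugin is running, a quick sanity check is to verify that the nodes advertise GPU resources, e.g.:

$ kubectl describe nodes | grep "nvidia.com/gpu"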

Note

Use a command like gcloud compute disks create --size=10GB --zone=europe-west1-b my-pd-name to create a persistent disk.

Note

The GCE persistent disk will be mounted at the /datasets/ directory on each worker.

Basic Install

Set the Helm Chart values

Use helm to install the mlbench chart (Replace ${RELEASE_NAME} with a name of your choice):

$ helm upgrade --wait --recreate-pods -f myvalues.yaml --timeout 900 --install ${RELEASE_NAME} .

Follow the instructions at the end of the helm install to get the dashboard URL. E.g.:

$ helm upgrade --wait --recreate-pods -f myvalues.yaml --timeout 900 --install rel .
  [...]
  NOTES:
  1. Get the application URL by running these commands:
     export NODE_PORT=$(kubectl get --namespace default -o jsonpath="{.spec.ports[0].nodePort}" services rel-mlbench-master)
     export NODE_IP=$(kubectl get nodes --namespace default -o jsonpath="{.items[0].status.addresses[0].address}")
     echo http://$NODE_IP:$NODE_PORT

This prints the URL at which the dashboard is accessible.
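
Before opening the dashboard, you can check that the deployment actually came up; pod and service names are prefixed with your release name:

$ kubectl get pods --namespace default
$ kubectl get svc --namespace default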

Plugins

In values.yaml, one can optionally install Kubernetes plugins by turning the corresponding flags on or off.
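
The available flag names are defined in the chart's values.yaml. As a purely illustrative sketch of the pattern (the key below is hypothetical, check values.yaml for the real names):

# hypothetical plugin flag, see values.yaml for the actual keys
nvidiaDevicePlugin:
  enabled: true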

Google Cloud / Google Kubernetes Engine

Set the Helm Chart values

Important

Make sure to read the prerequisites for Google Cloud

Please make sure that kubectl is configured correctly.

Caution

Google installs several pods on each node by default, which limits the available CPU; this can take up to 0.5 CPU cores per node. So make sure to provision VMs that have at least one more core than the number of cores you want to use for your mlbench experiment. See here for further details on node limits.
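
To see how much CPU is actually allocatable on a node (as opposed to its raw capacity), you can inspect the node objects, e.g.:

$ kubectl describe nodes | grep -A 6 "Allocatable"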

Install mlbench (Replace ${RELEASE_NAME} with a name of your choice):

$ helm upgrade --wait --recreate-pods -f myvalues.yaml --timeout 900 --install ${RELEASE_NAME} .

To access mlbench, run these commands and open the URL that is returned (note: the default instructions printed by helm on the command line only return the internal cluster IP):

$ export NODE_PORT=$(kubectl get --namespace default -o jsonpath="{.spec.ports[0].nodePort}" services ${RELEASE_NAME}-mlbench-master)
$ export NODE_IP=$(gcloud compute instances list|grep $(kubectl get nodes --namespace default -o jsonpath="{.items[0].status.addresses[0].address}") |awk '{print $5}')
$ gcloud compute firewall-rules create --quiet mlbench --allow tcp:$NODE_PORT
$ echo http://$NODE_IP:$NODE_PORT

Danger

The gcloud firewall-rules command above opens a port to the public internet. Make sure to delete the rule once it is no longer needed:

$ gcloud compute firewall-rules delete --quiet mlbench

Minikube

Minikube allows running a single-node Kubernetes cluster inside a VM on your laptop, for users looking to try out Kubernetes or to develop with it.

Installing mlbench on minikube:

Set the Helm Chart values

Start the minikube cluster:

$ minikube start

Next, install or upgrade the helm chart with the desired configuration under the release name ${RELEASE_NAME}:

$ helm init --kube-context minikube --wait
$ helm upgrade --wait --recreate-pods -f myvalues.yaml --timeout 900 --install ${RELEASE_NAME} .

Note

Minikube runs a single-node Kubernetes cluster inside a VM, so replicaCount has to be fixed to 1 in myvalues.yaml.
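
For example, in myvalues.yaml:

replicaCount: 1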

Once the installation is finished, one can obtain the URL:

$ export NODE_PORT=$(kubectl get --namespace default -o jsonpath="{.spec.ports[0].nodePort}" services ${RELEASE_NAME}-mlbench-master)
$ export NODE_IP=$(kubectl get nodes --namespace default -o jsonpath="{.items[0].status.addresses[0].address}")
$ echo http://$NODE_IP:$NODE_PORT

Now the mlbench dashboard should be available at http://${NODE_IP}:${NODE_PORT}.

Note

To access http://$NODE_IP:$NODE_PORT outside minikube, run the following command on the host:

$ ssh -i ${MINIKUBE_HOME}/.minikube/machines/minikube/id_rsa -N -f -L localhost:${NODE_PORT}:${NODE_IP}:${NODE_PORT} docker@$(minikube ip)

where ${MINIKUBE_HOME} is by default ${HOME}. One can then view the mlbench dashboard at http://localhost:${NODE_PORT}.

Docker-in-Docker (DIND)

Docker-in-Docker allows simulating multiple nodes locally on a single machine. This is useful for development.

Hint

For development purposes, it makes sense to use a local docker registry as well with DIND.

Describing how to set up a local registry would be too long for this guide, so here are some pointers (a minimal sketch of starting a registry follows the list):

  • You can find a guide here.
  • This page details setting up an image pull secret.
  • This details adding an image pull secret to a kubernetes service account.
  • You can use dind-proxy.sh in the mlbench repository to forward the registry port (5000) to kubernetes DIND.
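
As a minimal sketch (assuming Docker runs on the host, the registry listens on the default port 5000, and you have built the mlbench_master image locally), a local registry can be started and an image pushed to it like this:

$ docker run -d -p 5000:5000 --restart=always --name registry registry:2
$ docker tag mlbench_master:latest localhost:5000/mlbench_master:latest
$ docker push localhost:5000/mlbench_master:latest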

Download the kubeadm-dind-cluster script:

$ wget https://cdn.rawgit.com/kubernetes-sigs/kubeadm-dind-cluster/master/fixed/dind-cluster-v1.11.sh
$ chmod +x dind-cluster-v1.11.sh

For networking to work in DIND, we need to set a CNI Plugin. In our experience, weave works well with DIND.

$ export CNI_PLUGIN=weave

Now we can start the local cluster with

$ ./dind-cluster-v1.11.sh up

This might take a couple of minutes.
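
Once the script finishes, a quick way to check that the simulated nodes are up:

$ kubectl get nodes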

Hint

If you’re using a local docker registry, run dind-proxy.sh after the previous step.

Install helm (See Prerequisites) and set the Helm Chart values.

Hint

For a local registry, make sure you have an imagePullSecret added to the kubernetes serviceaccount and set the repository and secret in your myvalues.yaml file (the secret is called regcred in this example). A sketch of creating the secret follows the YAML below:

master:
  imagePullSecret: regcred

  image:
    repository: localhost:5000/mlbench_master
    tag: latest
    pullPolicy: Always


worker:
  imagePullSecret: regcred

  image:
    repository: localhost:5000/mlbench_worker
    tag: latest
    pullPolicy: Always
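
A minimal sketch of creating such a secret and attaching it to the default service account (the registry address and credentials are placeholders; an insecure local registry may not require credentials at all):

$ kubectl create secret docker-registry regcred --docker-server=localhost:5000 --docker-username=<user> --docker-password=<password> --docker-email=<email>
$ kubectl patch serviceaccount default -p '{"imagePullSecrets": [{"name": "regcred"}]}'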

Install mlbench (Replace ${RELEASE_NAME} with a name of your choice):

$ helm upgrade --wait --recreate-pods -f myvalues.yaml --timeout 900 --install rel .
  [...]
  NOTES:
  1. Get the application URL by running these commands:
     export NODE_PORT=$(kubectl get --namespace default -o jsonpath="{.spec.ports[0].nodePort}" services rel-mlbench-master)
     export NODE_IP=$(kubectl get nodes --namespace default -o jsonpath="{.items[0].status.addresses[0].address}")
     echo http://$NODE_IP:$NODE_PORT

Run the three commands printed in the NOTES output. They print the URL at which the dashboard is accessible.