MLBench Dashboard Documentation

MLBench Dashboard

MLBench comes with a dashboard to manage and monitor the cluster and jobs.

Dashboard Functionality

Main Page

Dashboard Main Page

The main view shows all MLBench worker nodes and their current status.

Runs Page

Dashboard Runs Page

The Runs page allows you to start a new experiment on the worker nodes. You can select how many workers to use and how many CPU cores each worker can utilize.

Run Details Page

Dashboard Run Details Page

The Run Details page shows the progress and result of an experiment. You can track metrics such as train loss and validation accuracy, and view the stdout and stderr logs of all workers.

It also lets you download the metrics of a run, as well as the resource usage of all workers participating in the run, as JSON files.

REST API

MLBench provides a basic REST API through which most of the dashboard functionality can also be used. It is accessible through the /api/ endpoints on the dashboard URL.
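
As a rough illustration of how the endpoints hang off the dashboard URL, the following Python sketch joins a dashboard URL with an endpoint name. The helper name `api_url` and the host `http://example.com` are assumptions made for this example and are not part of MLBench.

```python
def api_url(dashboard_url: str, endpoint: str) -> str:
    """Return the full URL of an endpoint below <dashboard_url>/api/.

    Hypothetical helper for illustration only; not provided by MLBench.
    """
    # Normalize slashes so callers may pass "pods", "/pods" or "pods/".
    base = dashboard_url.rstrip("/") + "/api/"
    # The documented endpoints end in a slash, so always append one.
    return base + endpoint.strip("/") + "/"


pods_url = api_url("http://example.com", "pods")
run_url = api_url("http://example.com/", "/runs/1/")
```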

Pods

GET /api/pods/

Lists all worker pods available in the cluster, including status information.

Example request:

GET /api/pods HTTP/1.1
Host: example.com
Accept: application/json, text/javascript

Example response:

HTTP/1.1 200 OK
Vary: Accept
Content-Type: text/javascript

[
  {
    "name":"worn-mouse-mlbench-worker-55bbdd4d8c-4mxh5",
    "labels":"{'app': 'mlbench', 'component': 'worker', 'pod-template-hash': '1166880847', 'release': 'worn-mouse'}",
    "phase":"Running",
    "ip":"10.244.2.58"
  },
  {
    "name":"worn-mouse-mlbench-worker-55bbdd4d8c-bwwsp",
    "labels":"{'app': 'mlbench', 'component': 'worker', 'pod-template-hash': '1166880847', 'release': 'worn-mouse'}",
    "phase":"Running",
    "ip":"10.244.3.57"
  }
]
Request Headers
  • Accept – the response content type depends on Accept header

Response Headers
Status Codes
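
In the example response, `labels` is a string that holds a Python-style dict literal (single quotes) rather than a nested JSON object, so a JSON decoder alone leaves it as a string. The sketch below shows one possible way to decode it with `ast.literal_eval`. The function name `parse_pods` is hypothetical, and the sample body is copied from the example response (one pod only).

```python
import ast
import json

# Body taken from the example response above, shortened to one pod.
SAMPLE = """
[
  {
    "name": "worn-mouse-mlbench-worker-55bbdd4d8c-4mxh5",
    "labels": "{'app': 'mlbench', 'component': 'worker', 'pod-template-hash': '1166880847', 'release': 'worn-mouse'}",
    "phase": "Running",
    "ip": "10.244.2.58"
  }
]
"""


def parse_pods(body: str) -> list:
    """Decode a pods response and turn each labels string into a dict."""
    pods = json.loads(body)
    for pod in pods:
        # literal_eval only evaluates Python literals, so it parses the
        # single-quoted dict without executing arbitrary code.
        pod["labels"] = ast.literal_eval(pod["labels"])
    return pods


pods = parse_pods(SAMPLE)
# Names of all pods whose phase is "Running".
running = [pod["name"] for pod in pods if pod["phase"] == "Running"]
```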

Metrics

GET /api/metrics/

Gets metrics (CPU, memory, etc.) for all worker pods.

Example request:

GET /api/metrics HTTP/1.1
Host: example.com
Accept: application/json, text/javascript

Example response:

HTTP/1.1 200 OK
Vary: Accept
Content-Type: text/javascript

{
  "quiet-mink-mlbench-worker-0": {
      "container_cpu_usage_seconds_total": [
          {
              "date": "2018-08-03T09:21:38.594282Z",
              "value": "0.188236813"
          },
          {
              "date": "2018-08-03T09:21:50.244277Z",
              "value": "0.215950298"
          }
      ]
  },
  "quiet-mink-mlbench-worker-1": {
      "container_cpu_usage_seconds_total": [
          {
              "date": "2018-08-03T09:21:29.347960Z",
              "value": "0.149286015"
          },
          {
              "date": "2018-08-03T09:21:44.266181Z",
              "value": "0.15325329"
          }
      ],
      "container_cpu_user_seconds_total": [
          {
              "date": "2018-08-03T09:21:29.406238Z",
              "value": "0.1"
          },
          {
              "date": "2018-08-03T09:21:44.331823Z",
              "value": "0.1"
          }
      ]
  }
}
Request Headers
  • Accept – the response content type depends on Accept header

Response Headers
Status Codes
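
The metric values in the example arrive as strings with ISO 8601 timestamps. The sketch below converts them and estimates average CPU usage between the first and last sample. It assumes that `container_cpu_usage_seconds_total` is a cumulative counter of CPU seconds, as the name suggests; the helper names `parse_date` and `cpu_rate` are hypothetical.

```python
import json
from datetime import datetime

# Body taken from the example response above, shortened to one pod.
SAMPLE = """
{
  "quiet-mink-mlbench-worker-0": {
    "container_cpu_usage_seconds_total": [
      {"date": "2018-08-03T09:21:38.594282Z", "value": "0.188236813"},
      {"date": "2018-08-03T09:21:50.244277Z", "value": "0.215950298"}
    ]
  }
}
"""


def parse_date(text: str) -> datetime:
    """Parse the timestamp format used in the example responses."""
    return datetime.strptime(text, "%Y-%m-%dT%H:%M:%S.%fZ")


def cpu_rate(samples: list) -> float:
    """CPU seconds used per wall-clock second between first and last sample.

    Assumes the samples are ordered by date and the metric is cumulative.
    """
    first, last = samples[0], samples[-1]
    elapsed = (parse_date(last["date"]) - parse_date(first["date"])).total_seconds()
    if elapsed <= 0:
        raise ValueError("need two samples with increasing dates")
    return (float(last["value"]) - float(first["value"])) / elapsed


metrics = json.loads(SAMPLE)
samples = metrics["quiet-mink-mlbench-worker-0"]["container_cpu_usage_seconds_total"]
rate = cpu_rate(samples)
```
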
GET /api/metrics/(str: pod_name_or_run_id)/

Gets metrics (CPU, memory, etc.) for a single worker pod or run.

Example request:

GET /api/metrics/quiet-mink-mlbench-worker-1/ HTTP/1.1
Host: example.com
Accept: application/json, text/javascript

Example response:

HTTP/1.1 200 OK
Vary: Accept
Content-Type: text/javascript

{
  "container_cpu_usage_seconds_total": [
      {
          "date": "2018-08-03T09:21:29.347960Z",
          "value": "0.149286015"
      },
      {
          "date": "2018-08-03T09:21:44.266181Z",
          "value": "0.15325329"
      }
  ],
  "container_cpu_user_seconds_total": [
      {
          "date": "2018-08-03T09:21:29.406238Z",
          "value": "0.1"
      },
      {
          "date": "2018-08-03T09:21:44.331823Z",
          "value": "0.1"
      }
  ]
}
Query Parameters
  • since – only get metrics newer than this date (Default: 1970-01-01T00:00:00.000000Z)

  • metric_type – one of pod or run; determines what kind of metric to get (Default: pod)

Request Headers
  • Accept – the response content type depends on Accept header

Response Headers
Status Codes
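
The query parameters can be combined as in the sketch below, which builds the request URL with the standard library. The function name `metrics_url` and the host `http://example.com` are assumptions for illustration; only `since` and `metric_type` come from the parameter list above.

```python
from urllib.parse import quote, urlencode


def metrics_url(base_url, pod_name_or_run_id, since=None, metric_type="pod"):
    """Build the URL for the per-pod or per-run metrics endpoint.

    Hypothetical helper; it does not send the request.
    """
    if metric_type not in ("pod", "run"):
        raise ValueError("metric_type must be 'pod' or 'run'")
    params = {"metric_type": metric_type}
    if since is not None:
        # Leave since out to fall back to the server-side default.
        params["since"] = since
    # Percent-encode the identifier so it is safe as a path segment.
    path = quote(str(pod_name_or_run_id), safe="")
    return f"{base_url.rstrip('/')}/api/metrics/{path}/?{urlencode(params)}"


pod_url = metrics_url(
    "http://example.com",
    "quiet-mink-mlbench-worker-1",
    since="2018-08-03T09:21:00.000000Z",
)
run_url = metrics_url("http://example.com", 2, metric_type="run")
```
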
POST /api/metrics

Save metrics. “pod_name” and “run_id” are mutually exclusive. The fields of metrics and their types are defined in mlbench/api/models/kubemetrics.py.

Example request:

POST /api/metrics HTTP/1.1
Host: example.com
Accept: application/json, text/javascript

{
  "pod_name": "quiet-mink-mlbench-worker-1",
  "name": "accuracy",
  "date": "2018-08-03T09:21:44.331823Z",
  "value": "0.7845",
  "cumulative": false,
  "metadata": "some additional data"
}

Example response:

HTTP/1.1 201 CREATED
Vary: Accept
Content-Type: text/javascript

{
  "pod_name": "quiet-mink-mlbench-worker-1",
  "name": "accuracy",
  "date": "2018-08-03T09:21:44.331823Z",
  "value": "0.7845",
  "cumulative": false,
  "metadata": "some additional data"
}
Request Headers
  • Accept – the response content type depends on Accept header

Response Headers
Status Codes
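
A request body for this endpoint could be assembled as in the sketch below, which rejects a payload that sets both `pod_name` and `run_id`, since the two are mutually exclusive. The function name `metric_payload` is hypothetical. Whether the server accepts a payload with neither field is not stated here, so the sketch does not check for that case.

```python
import json


def metric_payload(name, date, value, pod_name=None, run_id=None,
                   cumulative=False, metadata=""):
    """Serialize one metric to a JSON request body.

    Hypothetical helper; field names follow the example request above.
    """
    if pod_name is not None and run_id is not None:
        raise ValueError("pod_name and run_id are mutually exclusive")
    payload = {
        "name": name,
        "date": date,
        "value": str(value),       # the examples send values as strings
        "cumulative": cumulative,  # json.dumps writes this as true/false
        "metadata": metadata,
    }
    if pod_name is not None:
        payload["pod_name"] = pod_name
    if run_id is not None:
        payload["run_id"] = run_id
    return json.dumps(payload)


body = metric_payload(
    "accuracy",
    "2018-08-03T09:21:44.331823Z",
    "0.7845",
    pod_name="quiet-mink-mlbench-worker-1",
    metadata="some additional data",
)
```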

Runs

GET /api/runs/

Gets all active/failed/finished runs

Example request:

GET /api/runs/ HTTP/1.1
Host: example.com
Accept: application/json, text/javascript

Example response:

HTTP/1.1 200 OK
Vary: Accept
Content-Type: text/javascript

[
  {
    "id": 1,
    "name": "Name of the run",
    "created_at": "2018-08-03T09:21:29.347960Z",
    "state": "STARTED",
    "job_id": "5ec9f286-e12d-41bc-886e-0174ef2bddae",
    "job_metadata": {...}
  },
  {
    "id": 2,
    "name": "Another run",
    "created_at": "2018-08-02T08:11:22.123456Z",
    "state": "FINISHED",
    "job_id": "add4de0f-9705-4618-93a1-00bbc8d9498e",
    "job_metadata": {...}
  }
]
Request Headers
  • Accept – the response content type depends on Accept header

Response Headers
Status Codes
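
Because the response mixes active, failed and finished runs, a client will often group them by `state`. The sketch below shows one way to do so. The function name `runs_by_state` is hypothetical, and the sample omits `job_metadata`, whose contents are elided in the example response.

```python
import json
from collections import defaultdict

# Based on the example response above, without the elided job_metadata.
SAMPLE = """
[
  {"id": 1, "name": "Name of the run",
   "created_at": "2018-08-03T09:21:29.347960Z", "state": "STARTED",
   "job_id": "5ec9f286-e12d-41bc-886e-0174ef2bddae"},
  {"id": 2, "name": "Another run",
   "created_at": "2018-08-02T08:11:22.123456Z", "state": "FINISHED",
   "job_id": "add4de0f-9705-4618-93a1-00bbc8d9498e"}
]
"""


def runs_by_state(body: str) -> dict:
    """Map each state to the ids of the runs currently in that state."""
    grouped = defaultdict(list)
    for run in json.loads(body):
        grouped[run["state"]].append(run["id"])
    return dict(grouped)


grouped = runs_by_state(SAMPLE)
```
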
GET /api/runs/(int: run_id)/

Gets a run by id

Example request:

GET /api/runs/1/ HTTP/1.1
Host: example.com
Accept: application/json, text/javascript

Example response:

HTTP/1.1 200 OK
Vary: Accept
Content-Type: text/javascript

{
  "id": 1,
  "name": "Name of the run",
  "created_at": "2018-08-03T09:21:29.347960Z",
  "state": "STARTED",
  "job_id": "5ec9f286-e12d-41bc-886e-0174ef2bddae",
  "job_metadata": {...}
}

Parameters
  • run_id (int) – The id of the run

Request Headers
  • Accept – the response content type depends on Accept header

Response Headers
Status Codes
POST /api/runs/

Starts a new Run

Example request:

POST /api/runs/ HTTP/1.1
Host: example.com
Accept: application/json, text/javascript
Request JSON Object
  • name (string) – Name of the run

  • num_workers (int) – Number of worker nodes for the run

  • num_cpus (json) – Number of cores utilized by each worker

Example response:

HTTP/1.1 200 OK
Vary: Accept
Content-Type: text/javascript

{
  "id": 1,
  "name": "Name of the run",
  "created_at": "2018-08-03T09:21:29.347960Z",
  "state": "STARTED",
  "job_id": "5ec9f286-e12d-41bc-886e-0174ef2bddae",
  "job_metadata": {...}
}
Request Headers
  • Accept – the response content type depends on Accept header

Response Headers
Status Codes
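
The example request above shows no body, so the sketch below illustrates one way the documented JSON fields might be sent. It only builds the request object and does not send it. The function name `build_start_run_request`, the host, the `Content-Type: application/json` header, and the concrete values (including a plain number for `num_cpus`) are all assumptions.

```python
import json
import urllib.request


def build_start_run_request(base_url, name, num_workers, num_cpus):
    """Prepare (but do not send) a POST request that starts a new run."""
    body = json.dumps({
        "name": name,
        "num_workers": num_workers,
        "num_cpus": num_cpus,
    }).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/runs/",
        data=body,
        method="POST",
        headers={
            "Accept": "application/json",
            # Assumed: the body is sent as JSON.
            "Content-Type": "application/json",
        },
    )


request = build_start_run_request("http://example.com", "Name of the run", 2, 1)
# urllib.request.urlopen(request) would send it to a reachable dashboard.
```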
