mlbench_core.api

MLBench Master/Dashboard API Client Functionality

mlbench_core.api.MLBENCH_IMAGES

Dict of official benchmark images

Note

Format: {name: (image_name, command, run_on_all, GPU_supported)}
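As a minimal sketch of how an entry in this format can be unpacked, the snippet below iterates over a dict with the same tuple layout. The entry shown is a made-up placeholder, not one of the official images; in practice the dict would be mlbench_core.api.MLBENCH_IMAGES.

```python
# Hypothetical entry following the documented format:
# {name: (image_name, command, run_on_all, GPU_supported)}
example_images = {
    "example-benchmark": (
        "example/registry:latest",  # image_name (placeholder)
        "/usr/bin/run-benchmark",   # command (placeholder)
        True,                       # run_on_all
        False,                      # GPU_supported
    ),
}


def describe(images):
    """Return one human-readable line per image entry, sorted by name."""
    lines = []
    for name, (image_name, command, run_on_all, gpu_supported) in sorted(images.items()):
        lines.append(
            f"{name}: image={image_name}, command={command}, "
            f"run_on_all={run_on_all}, gpu={gpu_supported}"
        )
    return lines


for line in describe(example_images):
    print(line)
```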

class mlbench_core.api.ApiClient(max_workers=5, in_cluster=True, label_selector='component=master,app=mlbench', k8s_namespace='default', url=None, load_config=True)

Client for the mlbench Master/Dashboard REST API

When used inside a cluster, will use the API Pod IP directly for communication. When used outside of a cluster, will try to figure out how to access the API depending on the K8s service type, if it’s accessible. Endpoint URL can also be set manually.

All requests are executed in a separate process to ensure non-blocking execution. Results are returned as concurrent.futures.Future objects wrapping requests responses.
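The Future-based calling pattern can be sketched without a running cluster. StubClient and FakeResponse below are hypothetical stand-ins written for this illustration, not part of mlbench_core, and the payload fields are invented. The stub uses a thread pool for simplicity, whereas the real client is documented as using separate processes.

```python
from concurrent.futures import ThreadPoolExecutor


class FakeResponse:
    """Stand-in for a requests response; only .json() is modelled."""

    def __init__(self, payload):
        self._payload = payload

    def json(self):
        return self._payload


class StubClient:
    """Hypothetical stand-in mimicking the Future-returning style of ApiClient."""

    def __init__(self, max_workers=5):
        # Assumption: a thread pool is enough to show the calling pattern.
        self._executor = ThreadPoolExecutor(max_workers=max_workers)

    def get_runs(self):
        # Invented payload; the real API response format is not shown here.
        return self._executor.submit(
            lambda: FakeResponse([{"id": "1", "name": "demo"}])
        )

    def close(self):
        self._executor.shutdown()


client = StubClient()
future = client.get_runs()               # returns immediately with a Future
runs = future.result(timeout=10).json()  # blocks until the response is ready
print(runs)
client.close()
```

Calling result() is the only blocking step, so several requests can be issued first and their results collected afterwards.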

Expects K8s credentials to be set correctly (automatically inside a cluster, through kubectl outside of it).

Parameters
  • max_workers (int) – maximum number of processes to run in parallel

  • in_cluster (bool) – Whether the client is run inside the K8s cluster or not

  • label_selector (str) – K8s label selectors to find the master pod when running inside a cluster. Default: component=master,app=mlbench

  • k8s_namespace (str) – K8s namespace mlbench is running in. Default: default

  • url (str) – ip:port/path or hostname:port/path that overrides automatic endpoint detection, pointing to the root of the master/dashboard node. Default: None

create_run(self, name, num_workers, num_cpus=2.0, max_bandwidth=1000, image=None, backend=None, custom_image_name=None, custom_image_command=None, custom_backend=None, run_all_nodes=False, gpu_enabled=False, light_target=False)

Create a new benchmark run.

Available official benchmarks can be found in the mlbench_core.api.MLBENCH_IMAGES dict.

Parameters
  • name (str) – The name of the run

  • num_workers (int) – The number of worker nodes to use

  • num_cpus (float) – The number of CPU Cores per worker to utilize. Default: 2.0

  • max_bandwidth (int) – Maximum bandwidth available for communication between worker nodes in Mbps. Default: 1000

  • image (str) – Name of the official benchmark image to use (see mlbench_core.api.MLBENCH_IMAGES keys). Default: None

  • backend (str) – Name of the backend to use (see mlbench_core.api.MLBENCH_BACKENDS). Default: None

  • custom_image_name (str) – The name of a custom Docker image to run. Can be a dockerhub or private Docker repository url. Default: None

  • custom_image_command (str) – Command to run on the custom image. Default: None

  • custom_backend (str) – Custom backend to use. Default: None

  • run_all_nodes (bool) – Whether to run custom_image_command on all worker nodes or only the rank 0 node. Default: False

  • gpu_enabled (bool) – Enable GPU acceleration. Default: False

  • light_target (bool) – Use the light target goal. Default: False

Returns

A concurrent.futures.Future object wrapping a requests.Response object. Get the result by calling return_value.result().json().
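The parameters above fall into two groups: those for official images (image, backend) and those for custom images (custom_image_name, custom_image_command, custom_backend). The sketch below builds keyword arguments for each case. It assumes the two groups are alternatives, which the parameter list suggests but does not state. The helper functions and all values are hypothetical placeholders.

```python
def official_run_kwargs(name, num_workers, image, backend):
    """Keyword arguments for a run based on an official image."""
    return {
        "name": name,
        "num_workers": num_workers,
        "image": image,
        "backend": backend,
    }


def custom_run_kwargs(name, num_workers, image_name, command, backend,
                      run_all_nodes=False, gpu_enabled=False):
    """Keyword arguments for a run based on a custom Docker image."""
    return {
        "name": name,
        "num_workers": num_workers,
        "custom_image_name": image_name,
        "custom_image_command": command,
        "custom_backend": backend,
        "run_all_nodes": run_all_nodes,
        "gpu_enabled": gpu_enabled,
    }


kwargs = custom_run_kwargs(
    name="my-custom-run",
    num_workers=2,
    image_name="example/registry:latest",  # placeholder image
    command="/usr/bin/run-benchmark",      # placeholder command
    backend="example-backend",             # placeholder backend
    run_all_nodes=True,
)
print(kwargs)

# With a configured ApiClient instance `client`, the call would then be:
# future = client.create_run(**kwargs)
# print(future.result().json())
```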

delete_run(self, run_id)

Delete a benchmark run.

Parameters

run_id (str) – The id of the run to delete

Returns

A concurrent.futures.Future object wrapping a requests.Response object. Get the result by calling return_value.result().json().

download_run_metrics(self, run_id, since=None, summarize=None)

Get all metrics for a run as zip.

Parameters
  • run_id (str) – The id of the run to get metrics for

  • since (datetime) – Only get metrics newer than this date. Default: None

  • summarize (int) – If set, metrics are summarized to at most this many entries by averaging the metrics. Default: None

Returns

A concurrent.futures.Future object wrapping a requests.Response object. Get the result by calling return_value.result().json().

get_all_metrics(self)

Get all metrics ever recorded by the master node.

Returns

A concurrent.futures.Future object wrapping a requests.Response object. Get the result by calling return_value.result().json().

get_pod_metrics(self, pod_id, since=None, summarize=None)

Get all metrics for a worker pod.

Parameters
  • pod_id (str) – The id of the pod to get metrics for

  • since (datetime) – Only get metrics newer than this date. Default: None

  • summarize (int) – If set, metrics are summarized to at most this many entries by averaging the metrics. Default: None

Returns

A concurrent.futures.Future object wrapping a requests.Response object. Get the result by calling return_value.result().json().

get_run(self, run_id)

Get a specific benchmark run.

Parameters

run_id (str) – The id of the run to get

Returns

A concurrent.futures.Future object wrapping a requests.Response object. Get the result by calling return_value.result().json().

get_run_metrics(self, run_id, since=None, summarize=None, metric_filter=None, last_n=None)

Get all metrics for a run.

Parameters
  • run_id (str) – The id of the run to get metrics for

  • since (datetime) – Only get metrics newer than this date. Default: None

  • summarize (int) – If set, metrics are summarized to at most this many entries by averaging the metrics. Default: None

Returns

A concurrent.futures.Future object wrapping a requests.Response object. Get the result by calling return_value.result().json().
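To illustrate what summarizing to at most a given number of entries by averaging could look like, here is a minimal sketch that averages consecutive chunks of values. This is only one plausible reading of the summarize parameter. The algorithm actually used by the master node is not described here and may differ, and the function name is hypothetical.

```python
from statistics import mean


def summarize_values(values, max_entries):
    """Reduce values to at most max_entries by averaging consecutive chunks."""
    if max_entries <= 0:
        raise ValueError("max_entries must be positive")
    n = len(values)
    if n <= max_entries:
        return list(values)
    chunk = -(-n // max_entries)  # ceiling division: chunk size per output entry
    return [mean(values[i:i + chunk]) for i in range(0, n, chunk)]


values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
print(summarize_values(values, 3))
```

Ten values summarized to at most three entries give chunks of four, four, and two values, each replaced by its mean.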

get_runs(self)

Get all active, finished and failed benchmark runs.

Returns

A concurrent.futures.Future object wrapping a requests.Response object. Get the result by calling return_value.result().json().

get_worker_pods(self)

Get information on all worker nodes.

Returns

A concurrent.futures.Future object wrapping a requests.Response object. Get the result by calling return_value.result().json().

post_metric(self, run_id, name, value, cumulative=False, metadata='', date=None)

Save a metric to the master node for a run.

Parameters
  • run_id (str) – The id of the run to save a metric for

  • name (str) – The name of the metric, e.g. accuracy

  • value (Number) – The metric value to save

  • cumulative (bool, optional) – Whether this metric is cumulative or not. Cumulative metrics are values that increment over time, i.e. current_value = previous_value + value_difference. Non-cumulative values are discrete values at a certain time. Default: False

  • metadata (dict) – Optional metadata to attach to a metric. Default: '' (empty string)

  • date (datetime) – The date the metric was gathered. Default: datetime.now

Returns

A concurrent.futures.Future object wrapping a requests.Response object. Get the result by calling return_value.result().json().
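The relation current_value = previous_value + value_difference for cumulative metrics can be illustrated with plain Python. The numbers are invented for this sketch, and how the master node stores or displays cumulative metrics is not covered here.

```python
from itertools import accumulate

# Per-step increments, e.g. samples processed in each interval (made-up numbers).
differences = [5, 3, 0, 2]

# Cumulative series: each value is the previous value plus the new difference.
cumulative_series = list(accumulate(differences))
print(cumulative_series)

# Going the other way: recover the per-step differences from the cumulative series.
recovered = [
    current - previous
    for previous, current in zip([0] + cumulative_series, cumulative_series)
]
print(recovered)
```

A non-cumulative metric such as accuracy has no such relation between consecutive values; each value stands on its own at the time it was recorded.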