mlbench_core.api

MLBench Master/Dashboard API Client Functionality

mlbench_core.api.MLBENCH_IMAGES: Dict of official benchmark images

Note

Format: {name: (image_name, command, run_on_all, GPU_supported)}
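The tuple layout above can be unpacked positionally. The following is a minimal sketch only: EXAMPLE_IMAGES and its entries are hypothetical placeholders that follow the documented format, not the actual contents of mlbench_core.api.MLBENCH_IMAGES, and gpu_capable is a helper introduced for illustration.

```python
# Hypothetical entries that follow the documented format
# {name: (image_name, command, run_on_all, GPU_supported)}.
# These are NOT the real contents of mlbench_core.api.MLBENCH_IMAGES.
EXAMPLE_IMAGES = {
    "example_benchmark": ("example/benchmark:latest", "/run.sh", True, False),
    "example_gpu_benchmark": ("example/gpu-benchmark:latest", "/run_gpu.sh", False, True),
}


def gpu_capable(images):
    """Return the sorted names of all images whose GPU_supported flag is set."""
    return sorted(
        name
        for name, (_image_name, _command, _run_on_all, gpu_supported) in images.items()
        if gpu_supported
    )


print(gpu_capable(EXAMPLE_IMAGES))
```

With the real dict, the same unpacking would presumably work by passing mlbench_core.api.MLBENCH_IMAGES instead of the placeholder.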

class mlbench_core.api.ApiClient(max_workers=5, in_cluster=True, label_selector='component=master,app=mlbench', k8s_namespace='default', url=None, load_config=True)

Client for the mlbench Master/Dashboard REST API

When used inside a cluster, the client uses the API Pod IP directly for communication. When used outside of a cluster, it tries to determine how to reach the API based on the K8s service type, provided the service is accessible. The endpoint URL can also be set manually.

All requests are executed in a separate process to ensure non-blocking execution. Results are returned as concurrent.futures.Future objects wrapping requests responses.

The client expects K8s credentials to be set correctly (automatic inside a cluster, through kubectl outside of it).
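The Future-based calling pattern described above can be sketched without a running cluster. This is an illustration under stated assumptions: StubResponse, fake_request and wait_for_json are hypothetical names, and a thread pool is used here as a lightweight stand-in for the separate process the client is described as using.

```python
from concurrent.futures import ThreadPoolExecutor


class StubResponse:
    """Hypothetical stand-in for a requests response; only .json() is modelled."""

    def __init__(self, payload):
        self._payload = payload

    def json(self):
        return self._payload


def fake_request(payload):
    """Pretend to perform an HTTP request and return a response-like object."""
    return StubResponse(payload)


def wait_for_json(future, timeout=10):
    """Block until the future resolves, then decode the response body."""
    return future.result(timeout=timeout).json()


# Submitting returns immediately with a Future; the caller decides when to block.
with ThreadPoolExecutor(max_workers=5) as executor:
    future = executor.submit(fake_request, {"status": "ok"})
    body = wait_for_json(future)

print(body)
```

The same result().json() chain is what the Returns entries of the methods below refer to.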

Parameters

max_workers (int) – Maximum number of processes to run in parallel. Default: 5
in_cluster (bool) – Whether the client is run inside the K8s cluster or not. Default: True
label_selector (str) – K8s label selectors to find the master pod when running inside a cluster. Default: component=master,app=mlbench
k8s_namespace (str) – K8s namespace mlbench is running in. Default: default
service_name (str) – Name of the master service, usually something like release-mlbench-master. Only needed when running outside of a cluster. Default: None
url (str) – ip:port/path or hostname:port/path that overrides automatic endpoint detection, pointing to the root of the master/dashboard node. Default: None

create_run(self, name, num_workers, num_cpus=2.0, max_bandwidth=1000, image=None, backend=None, custom_image_name=None, custom_image_command=None, custom_backend=None, run_all_nodes=False, gpu_enabled=False, light_target=False)

Create a new benchmark run.

Available official benchmarks can be found in the mlbench_core.api.MLBENCH_IMAGES dict.

Parameters

name (str) – The name of the run
num_workers (int) – The number of worker nodes to use
num_cpus (float) – The number of CPU Cores per worker to utilize. Default: 2.0
max_bandwidth (int) – Maximum bandwidth available for communication between worker nodes in Mbps. Default: 1000
image (str) – Name of the official benchmark image to use (see mlbench_core.api.MLBENCH_IMAGES keys). Default: None
backend (str) – Name of the backend to use (see mlbench_core.api.MLBENCH_BACKENDS). Default: None
custom_image_name (str) – The name of a custom Docker image to run. Can be a dockerhub or private Docker repository url. Default: None
custom_image_command (str) – Command to run on the custom image. Default: None
custom_backend (str) – Custom backend to use. Default: None
run_all_nodes (bool) – Whether to run custom_image_command on all worker nodes or only the rank 0 node. Default: False
gpu_enabled (bool) – Enable GPU acceleration. Default: False
light_target (bool) – Use the light target goal. Default: False

Returns

A concurrent.futures.Future object wrapping a requests.Response object. Get the result by calling return_value.result().json()
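As a rough sketch of how the custom-image parameters fit together, the helper below forwards them to create_run by keyword. RecordingClient is a hypothetical stand-in for ApiClient that only records its arguments, start_custom_run is an illustrative helper, and the image name and command are placeholders. Whether a custom image also requires custom_backend is not stated here, so it is left out.

```python
from concurrent.futures import Future


class RecordingClient:
    """Hypothetical stand-in for ApiClient that records create_run arguments."""

    def __init__(self):
        self.calls = []

    def create_run(self, name, num_workers, **kwargs):
        self.calls.append((name, num_workers, kwargs))
        future = Future()
        future.set_result(None)  # a real client would resolve to a response
        return future


def start_custom_run(client, name, num_workers, image_name, command):
    """Start a run from a custom Docker image on all worker nodes."""
    return client.create_run(
        name,
        num_workers,
        num_cpus=2.0,
        max_bandwidth=1000,
        custom_image_name=image_name,
        custom_image_command=command,
        run_all_nodes=True,
        gpu_enabled=False,
    )


client = RecordingClient()
future = start_custom_run(client, "demo-run", 2, "example/image:latest", "/run.sh")
print(client.calls[0])
```

Passing the options by keyword keeps the call readable given the long signature.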

delete_run(self, run_id)

Delete a benchmark run.

Parameters: run_id (str) – The id of the run to delete
Returns: A concurrent.futures.Future object wrapping a requests.Response object. Get the result by calling return_value.result().json()

download_run_metrics(self, run_id, since=None, summarize=None)

Get all metrics for a run as zip.

Parameters

run_id (str) – The id of the run to get metrics for
since (datetime) – Only get metrics newer than this date. Default: None
summarize (int) – If set, metrics are summarized to at most this many entries by averaging the metrics. Default: None

Returns

A concurrent.futures.Future object wrapping a requests.Response object. Get the result by calling return_value.result().json()

get_all_metrics(self)

Get all metrics ever recorded by the master node.

Returns: A concurrent.futures.Future object wrapping a requests.Response object. Get the result by calling return_value.result().json()

get_pod_metrics(self, pod_id, since=None, summarize=None)

Get all metrics for a worker pod.

Parameters

pod_id (str) – The id of the pod to get metrics for
since (datetime) – Only get metrics newer than this date. Default: None
summarize (int) – If set, metrics are summarized to at most this many entries by averaging the metrics. Default: None

Returns

A concurrent.futures.Future object wrapping a requests.Response object. Get the result by calling return_value.result().json()
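The since parameter takes a datetime, so a "last N minutes" query can be built with timedelta. The sketch below is illustrative: StubClient is a hypothetical stand-in that echoes its arguments instead of contacting the master node, and recent_pod_metrics is a helper introduced here. Whether the server expects naive or timezone-aware datetimes is not specified, so a naive local time is assumed.

```python
from datetime import datetime, timedelta


class StubClient:
    """Hypothetical stand-in for ApiClient; it only echoes its arguments."""

    def get_pod_metrics(self, pod_id, since=None, summarize=None):
        return {"pod_id": pod_id, "since": since, "summarize": summarize}


def recent_pod_metrics(client, pod_id, minutes, max_entries=None, now=None):
    """Request pod metrics recorded within the last `minutes` minutes."""
    now = now or datetime.now()  # assumption: naive local time is acceptable
    since = now - timedelta(minutes=minutes)
    return client.get_pod_metrics(pod_id, since=since, summarize=max_entries)


# A fixed reference time keeps the example deterministic.
fixed_now = datetime(2020, 1, 1, 12, 0, 0)
request = recent_pod_metrics(
    StubClient(), "pod-0", minutes=30, max_entries=100, now=fixed_now
)
print(request["since"])
```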

get_run(self, run_id)

Get a specific benchmark run.

Parameters: run_id (str) – The id of the run to get
Returns: A concurrent.futures.Future object wrapping a requests.Response object. Get the result by calling return_value.result().json()

get_run_metrics(self, run_id, since=None, summarize=None, metric_filter=None, last_n=None)

Get all metrics for a run.

Parameters

run_id (str) – The id of the run to get metrics for
since (datetime) – Only get metrics newer than this date. Default: None
summarize (int) – If set, metrics are summarized to at most this many entries by averaging the metrics. Default: None

Returns

A concurrent.futures.Future object wrapping a requests.Response object. Get the result by calling return_value.result().json()
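How the master node averages metrics for summarize is not specified here. The following sketch shows one plausible interpretation, assuming consecutive values are grouped into equal-sized buckets and each bucket is replaced by its mean. The function name summarize_by_averaging and the bucketing scheme are assumptions for illustration, not the server's implementation.

```python
import math


def summarize_by_averaging(values, max_entries):
    """Reduce `values` to at most `max_entries` items by averaging buckets."""
    if max_entries <= 0:
        raise ValueError("max_entries must be positive")
    if len(values) <= max_entries:
        return list(values)
    # Choose the smallest bucket size that yields at most max_entries buckets.
    bucket_size = math.ceil(len(values) / max_entries)
    buckets = (
        values[start:start + bucket_size]
        for start in range(0, len(values), bucket_size)
    )
    return [sum(bucket) / len(bucket) for bucket in buckets]


print(summarize_by_averaging([1, 2, 3, 4, 5, 6], 3))
```

The last bucket may hold fewer values than the others when the input length is not a multiple of the bucket size.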

get_runs(self)

Get all active, finished and failed benchmark runs.

Returns: A concurrent.futures.Future object wrapping a requests.Response object. Get the result by calling return_value.result().json()

get_worker_pods(self)

Get information on all worker nodes.

Returns: A concurrent.futures.Future object wrapping a requests.Response object. Get the result by calling return_value.result().json()

post_metric(self, run_id, name, value, cumulative=False, metadata='', date=None)

Save a metric to the master node for a run.

Parameters

run_id (str) – The id of the run to save a metric for
name (str) – The name of the metric, e.g. accuracy
value (Number) – The metric value to save
cumulative (bool, optional) – Whether this metric is cumulative or not. Cumulative metrics are values that increment over time, i.e. current_value = previous_value + value_difference. Non-cumulative values are discrete values at a certain time. Default: False
metadata (dict) – Optional metadata to attach to a metric. Default: '' (empty string)
date (datetime) – The date the metric was gathered. Default: datetime.now

Returns

A concurrent.futures.Future object wrapping a requests.Response object. Get the result by calling return_value.result().json()
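The relation current_value = previous_value + value_difference for cumulative metrics can be made concrete with a small conversion in both directions. This is a sketch of the arithmetic only. The helper names are hypothetical, and it assumes the series starts from zero. It does not show which of the two forms post_metric expects as value.

```python
from itertools import accumulate


def to_cumulative(differences):
    """Turn per-step increments into a running (cumulative) series."""
    return list(accumulate(differences))


def to_differences(cumulative):
    """Recover per-step increments from a cumulative series."""
    previous = 0  # assumption: the series starts from zero
    result = []
    for current in cumulative:
        result.append(current - previous)
        previous = current
    return result


steps = [3, 2, 5]
running = to_cumulative(steps)
print(running)
print(to_differences(running))
```

A non-cumulative metric such as accuracy would not go through this conversion, since each value stands on its own at its point in time.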