mlbench_core.api
MLBench Master/Dashboard API Client Functionality
-
mlbench_core.api.MLBENCH_IMAGES
Dict of official benchmark images.
Note
Format:
{name: (image_name, command, run_on_all, GPU_supported)}
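The tuple layout in the note can be unpacked positionally. The following is a minimal sketch that uses a made-up dict in the documented format; the entry names, image names and commands are hypothetical placeholders, not actual MLBENCH_IMAGES entries.

```python
# Hypothetical stand-in for mlbench_core.api.MLBENCH_IMAGES. The keys and
# values are invented and only mirror the documented tuple layout:
# {name: (image_name, command, run_on_all, GPU_supported)}
EXAMPLE_IMAGES = {
    "example-cpu": ("example/repo:cpu", "/run.sh", True, False),
    "example-gpu": ("example/repo:gpu", "/run.sh", True, True),
}


def gpu_capable(images):
    """Return the sorted names of all entries whose GPU_supported flag is True."""
    names = []
    for name, (image_name, command, run_on_all, gpu_supported) in images.items():
        if gpu_supported:
            names.append(name)
    return sorted(names)


print(gpu_capable(EXAMPLE_IMAGES))
```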
-
class mlbench_core.api.ApiClient(max_workers=5, in_cluster=True, label_selector='component=master,app=mlbench', k8s_namespace='default', url=None, load_config=True)
Client for the mlbench Master/Dashboard REST API.
When used inside a cluster, the client uses the API Pod IP directly for communication. When used outside of a cluster, it tries to determine how to reach the API based on the K8s service type, provided the service is accessible. The endpoint URL can also be set manually.
All requests are executed in a separate process to ensure non-blocking execution. Results are returned as concurrent.futures.Future objects wrapping requests responses.
Expects K8s credentials to be set correctly (automatic inside a cluster, through kubectl outside of it).
- Parameters
max_workers (int) – Maximum number of processes to run in parallel
in_cluster (bool) – Whether the client is run inside the K8s cluster or not
label_selector (str) – K8s label selectors to find the master pod when running inside a cluster. Default: component=master,app=mlbench
k8s_namespace (str) – K8s namespace mlbench is running in. Default: default
service_name (str) – Name of the master service, usually something like release-mlbench-master. Only needed when running outside of a cluster. Default: None
url (str) – ip:port/path or hostname:port/path that overrides automatic endpoint detection, pointing to the root of the master/dashboard node. Default: None
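Because every client method returns a future rather than a finished response, callers typically submit several requests and collect the results afterwards. The sketch below illustrates only that consumption pattern. It does not use the real client: FakeResponse and fake_request are hypothetical stand-ins, and a thread pool is used instead of separate processes purely so the example runs anywhere.

```python
from concurrent.futures import ThreadPoolExecutor


class FakeResponse:
    """Hypothetical stand-in for a requests response; only .json() is modeled."""

    def __init__(self, payload):
        self._payload = payload

    def json(self):
        return self._payload


def fake_request(payload):
    """Pretend to perform an HTTP call and return a response-like object."""
    return FakeResponse(payload)


def submit_all(payloads, max_workers=5):
    """Submit every payload without blocking, then collect the JSON results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(fake_request, p) for p in payloads]
        # .result() blocks until that particular call has finished.
        return [f.result().json() for f in futures]


print(submit_all([{"id": 1}, {"id": 2}]))
```

The results come back in submission order because the futures are read in the order they were created, regardless of which call finishes first.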
-
create_run(self, name, num_workers, num_cpus=2.0, max_bandwidth=1000, image=None, backend=None, custom_image_name=None, custom_image_command=None, custom_backend=None, run_all_nodes=False, gpu_enabled=False, light_target=False)
Create a new benchmark run.
Available official benchmarks can be found in the mlbench_core.api.MLBENCH_IMAGES dict.
- Parameters
name (str) – The name of the run
num_workers (int) – The number of worker nodes to use
num_cpus (float) – The number of CPU cores per worker to utilize. Default: 2.0
max_bandwidth (int) – Maximum bandwidth available for communication between worker nodes, in Mbps. Default: 1000
image (str) – Name of the official benchmark image to use (see mlbench_core.api.MLBENCH_IMAGES keys). Default: None
backend (str) – Name of the backend to use (see mlbench_core.api.MLBENCH_BACKENDS). Default: None
custom_image_name (str) – The name of a custom Docker image to run. Can be a Docker Hub or private Docker repository URL. Default: None
custom_image_command (str) – Command to run on the custom image. Default: None
custom_backend (str) – Custom backend to use. Default: None
run_all_nodes (bool) – Whether to run custom_image_command on all worker nodes or only the rank 0 node. Default: False
gpu_enabled (bool) – Enable GPU acceleration. Default: False
light_target (bool) – Use the light target goal. Default: False
- Returns
A concurrent.futures.Future object wrapping a requests.Response object. Get the result by calling return_value.result().json()
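The signature separates arguments for official images (image, backend) from arguments for custom images (custom_image_name, custom_image_command, custom_backend). The sketch below shows one way to assemble keyword arguments for either case. build_run_kwargs is a hypothetical helper, not part of mlbench_core, and it assumes, without confirmation from this reference, that a run uses either an official image or a custom one but not both.

```python
def build_run_kwargs(name, num_workers, image=None, backend=None,
                     custom_image_name=None, custom_image_command=None,
                     custom_backend=None, **extra):
    """Hypothetical helper: collect keyword arguments for create_run.

    Assumption (not stated in the reference): exactly one of image or
    custom_image_name is given.
    """
    if (image is None) == (custom_image_name is None):
        raise ValueError("pass exactly one of image or custom_image_name")

    kwargs = {"name": name, "num_workers": num_workers}
    if image is not None:
        # Official benchmark: image is expected to be a MLBENCH_IMAGES key.
        kwargs.update(image=image, backend=backend)
    else:
        # Custom benchmark: pass the Docker image, its command and backend.
        kwargs.update(
            custom_image_name=custom_image_name,
            custom_image_command=custom_image_command,
            custom_backend=custom_backend,
        )
    # Remaining documented options such as num_cpus or gpu_enabled.
    kwargs.update(extra)
    return kwargs


print(build_run_kwargs("demo", 2, image="example-image", backend="example-backend",
                       gpu_enabled=True))
```

The resulting dict could then be passed on as client.create_run(**kwargs); the image and backend names shown here are placeholders.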
-
delete_run(self, run_id)
Delete a benchmark run.
- Parameters
run_id (str) – The id of the run to delete
- Returns
A concurrent.futures.Future object wrapping a requests.Response object. Get the result by calling return_value.result().json()
-
download_run_metrics(self, run_id, since=None, summarize=None)
Get all metrics for a run as a zip archive.
- Parameters
run_id (str) – The id of the run to get metrics for
since (datetime) – Only get metrics newer than this date. Default: None
summarize (int) – If set, metrics are summarized to at most this many entries by averaging the metrics. Default: None
- Returns
A concurrent.futures.Future object wrapping a requests.Response object. Get the result by calling return_value.result().json()
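Since the metrics are described as arriving in a zip archive, a caller will probably want to read the archive in memory. The sketch below assumes the raw zip bytes are already in hand, for example from the content attribute of the requests response; that is an assumption, not something this reference states. The file name and CSV layout inside the example archive are invented so that the code runs without a server.

```python
import io
import zipfile


def list_zip_members(zip_bytes):
    """Return {member name: decoded text} for every file in an in-memory zip."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return {n: archive.read(n).decode("utf-8") for n in archive.namelist()}


def make_example_zip():
    """Build a small in-memory zip; its contents are purely illustrative."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("example_metric.csv", "timestamp,value\n0,0.5\n")
    return buffer.getvalue()


print(list_zip_members(make_example_zip()))
```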
-
get_all_metrics(self)
Get all metrics ever recorded by the master node.
- Returns
A concurrent.futures.Future object wrapping a requests.Response object. Get the result by calling return_value.result().json()
-
get_pod_metrics(self, pod_id, since=None, summarize=None)
Get all metrics for a worker pod.
- Parameters
pod_id (str) – The id of the pod to get metrics for
since (datetime) – Only get metrics newer than this date. Default: None
summarize (int) – If set, metrics are summarized to at most this many entries by averaging the metrics. Default: None
- Returns
A concurrent.futures.Future object wrapping a requests.Response object. Get the result by calling return_value.result().json()
-
get_run(self, run_id)
Get a specific benchmark run.
- Parameters
run_id (str) – The id of the run to get
- Returns
A concurrent.futures.Future object wrapping a requests.Response object. Get the result by calling return_value.result().json()
-
get_run_metrics(self, run_id, since=None, summarize=None, metric_filter=None, last_n=None)
Get all metrics for a run.
- Parameters
run_id (str) – The id of the run to get metrics for
since (datetime) – Only get metrics newer than this date. Default: None
summarize (int) – If set, metrics are summarized to at most this many entries by averaging the metrics. Default: None
- Returns
A concurrent.futures.Future object wrapping a requests.Response object. Get the result by calling return_value.result().json()
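The summarize parameter reduces a metric series to a bounded number of entries by averaging. How the master node groups the values is not specified here, so the following is only an illustrative sketch of one way to do it: average consecutive, equally sized chunks. The function name and the chunking strategy are assumptions.

```python
def summarize(values, max_entries):
    """Average consecutive chunks so that at most max_entries values remain."""
    if max_entries < 1:
        raise ValueError("max_entries must be at least 1")
    if len(values) <= max_entries:
        return list(values)

    # Ceiling division gives the chunk size needed to stay within the limit.
    chunk = -(-len(values) // max_entries)
    result = []
    for start in range(0, len(values), chunk):
        part = values[start:start + chunk]
        result.append(sum(part) / len(part))
    return result


print(summarize([1, 2, 3, 4, 5, 6], 3))  # [1.5, 3.5, 5.5]
```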
-
get_runs(self)
Get all active, finished and failed benchmark runs.
- Returns
A concurrent.futures.Future object wrapping a requests.Response object. Get the result by calling return_value.result().json()
-
get_worker_pods(self)
Get information on all worker nodes.
- Returns
A concurrent.futures.Future object wrapping a requests.Response object. Get the result by calling return_value.result().json()
-
post_metric(self, run_id, name, value, cumulative=False, metadata='', date=None)
Save a metric to the master node for a run.
- Parameters
run_id (str) – The id of the run to save a metric for
name (str) – The name of the metric, e.g. accuracy
value (Number) – The metric value to save
cumulative (bool, optional) – Whether this metric is cumulative or not. Cumulative metrics are values that increment over time, i.e. current_value = previous_value + value_difference. Non-cumulative values are discrete values at a certain time. Default: False
metadata (dict) – Optional metadata to attach to a metric. Default: None
date (datetime) – The date the metric was gathered. Default: datetime.now
- Returns
A concurrent.futures.Future object wrapping a requests.Response object. Get the result by calling return_value.result().json()
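The relationship current_value = previous_value + value_difference can be made concrete with a small sketch. The helper names below are hypothetical, and which of the two forms post_metric expects as value when cumulative=True is not stated in this reference; the code only illustrates how per-step differences and cumulative totals relate.

```python
from itertools import accumulate


def to_cumulative(differences):
    """Turn per-step differences into running totals."""
    return list(accumulate(differences))


def to_differences(cumulative):
    """Recover per-step differences from running totals (starting from 0)."""
    previous = [0] + list(cumulative[:-1])
    return [cur - prev for prev, cur in zip(previous, cumulative)]


print(to_cumulative([3, 2, 5]))    # [3, 5, 10]
print(to_differences([3, 5, 10]))  # [3, 2, 5]
```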