API Reference¶

Packages¶

batch.tensorstack.dev/v1beta1

batch.tensorstack.dev/v1beta1¶

Package v1beta1 contains API Schema definitions for the batch v1beta1 API group

Resource Types¶

DeepSpeedJob
DeepSpeedJobList

Config¶

Configuration for running a DeepSpeed job. See the official DeepSpeed documentation (https://www.deepspeed.ai/getting-started/) for details.

Appears in: - DeepSpeedJobSpec

Field	Description
`customCommand` string	Custom launch command. When set, all other options in Config except `slotsPerWorker` have no effect.
`slotsPerWorker` integer	The number of slots for each worker/replica. This is normally set to the number of GPUs requested for each replica.
`localRank` boolean	Whether the `local_rank` parameter should be passed to training programs.
`autotune` AutotuneType	Parameters for running the autotuning process to find configurations for a training job on a particular cluster/machine.
`run` RunType	Mechanism to start the training program.
`otherArgs` string array	Other command-line arguments for the DeepSpeed job.
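
As a minimal sketch of the precedence rule for `customCommand`, the following Python models a `Config` as a plain dictionary and returns only the fields that would still take effect. The helper `effective_config` and the sample values are hypothetical and are not part of the API.

```python
# Hypothetical helper (not part of the API): models the rule that a set
# `customCommand` overrides every other Config option except `slotsPerWorker`.
def effective_config(config: dict) -> dict:
    """Return the subset of Config fields that take effect."""
    if config.get("customCommand"):
        kept = ("customCommand", "slotsPerWorker")
        return {k: v for k, v in config.items() if k in kept}
    return dict(config)


# Sample values below are assumptions chosen for illustration.
config = {
    "customCommand": "deepspeed train.py",
    "slotsPerWorker": 4,
    "localRank": True,
    "otherArgs": ["--verbose"],
}
print(effective_config(config))
```

Here `localRank` and `otherArgs` are dropped because `customCommand` is set, while `slotsPerWorker` is kept.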

DeepSpeedJob¶

DeepSpeedJob defines the schema for the DeepSpeedJob API.

Appears in: - DeepSpeedJobList

Field	Description
`apiVersion` string	`batch.tensorstack.dev/v1beta1`
`kind` string	`DeepSpeedJob`
`metadata` ObjectMeta	Refer to Kubernetes API documentation for fields of `metadata`.
`spec` DeepSpeedJobSpec
`status` DeepSpeedJobStatus
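
To show how the top-level fields fit together, here is a rough sketch that assembles a DeepSpeedJob manifest as a Python dictionary and prints it as JSON. The job name, container name, image, and script arguments are placeholders, and the layout of the `python` array (script followed by its arguments) is an assumption. `status` is omitted because it describes observed state.

```python
import json

# Names, image, and script arguments are placeholders, not values from the API.
manifest = {
    "apiVersion": "batch.tensorstack.dev/v1beta1",
    "kind": "DeepSpeedJob",
    "metadata": {"name": "example-job"},
    "spec": {
        "config": {
            "slotsPerWorker": 1,
            "localRank": True,
            # Assumption: the array holds the script followed by its arguments.
            "run": {"python": ["train.py", "--epochs", "1"]},
        },
        "worker": {
            "replicas": 2,
            # `template` follows the Kubernetes PodTemplateSpec layout.
            "template": {
                "spec": {
                    "containers": [
                        {"name": "worker", "image": "example.com/train:latest"}
                    ]
                }
            },
        },
    },
}

print(json.dumps(manifest, indent=2))
```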

DeepSpeedJobList¶

DeepSpeedJobList contains a list of DeepSpeedJob objects.

Field	Description
`apiVersion` string	`batch.tensorstack.dev/v1beta1`
`kind` string	`DeepSpeedJobList`
`metadata` ListMeta	Refer to Kubernetes API documentation for fields of `metadata`.
`items` DeepSpeedJob array

DeepSpeedJobSpec¶

DeepSpeedJobSpec outlines the intended configuration and execution parameters for a DeepSpeedJob.

Appears in: - DeepSpeedJob

Field	Description
`scheduler` SchedulePolicy	Identifies the preferred scheduler for allocating resources to replicas. Defaults to the cluster's default scheduler.
`runPolicy` RunPolicy	Execution policy configurations governing the behavior of the distributed training job.
`runMode` RunMode	Job's execution behavior. If omitted, defaults to `Immediate` mode, and tasks are executed immediately upon submission.
`elastic` ElasticConfig	Configuration for launching elastic training.
`config` Config	Key configurations for executing DeepSpeed training jobs.
`disableCustomEnv` boolean	Controls whether environment variables set in the job spec are forwarded to the training processes. DeepSpeed passes environment variables through an env file, which the launcher distributes to each worker process. Some scenarios require disabling this automatic behavior, and this flag provides that control.
`false`: (default) The environment variables set in the job spec are used in the training processes. The controller automatically writes them to the env file so that the launcher can send them to each worker.
`true`: The environment variables set in the job spec are used only to start the container entrypoint program; the training program does not need them.
`worker` Worker	Specifications for the worker replicas.
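
The two settings of `disableCustomEnv` can be modeled roughly as follows. This is only an illustration of the described behavior: the function `env_for_training` is hypothetical, and the real controller writes an env file for the launcher rather than returning a dictionary.

```python
# Hypothetical model of `disableCustomEnv`; the real controller writes an env
# file for the DeepSpeed launcher, which this sketch does not attempt to do.
def env_for_training(job_env: dict, disable_custom_env: bool = False) -> dict:
    """Return the job-spec variables that the launcher forwards to workers."""
    if disable_custom_env:
        # Variables are only used to start the container entrypoint program.
        return {}
    # Default: every variable from the job spec reaches the training processes.
    return dict(job_env)


job_env = {"NCCL_DEBUG": "INFO"}  # sample variable, chosen for illustration
print(env_for_training(job_env))
print(env_for_training(job_env, disable_custom_env=True))
```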

DeepSpeedJobStatus¶

DeepSpeedJobStatus represents the observed state of a DeepSpeedJob.

Appears in: - DeepSpeedJob

Field	Description
`tasks` Tasks array
`aggregate` Aggregate
`phase` JobPhase	Provides a simple, high-level summary of where the Job is in its lifecycle. Note that this is NOT intended to be a comprehensive state machine.
`backoffCount` integer	The number of restarts performed so far.
`conditions` JobCondition array	The latest available observations of an object's current state.

ElasticConfig¶

Configuration governing the elastic scaling behavior of the job.

Appears in: - DeepSpeedJobSpec

Field	Description
`enabled` boolean	Set to `true` to use elastic training.
`minReplicas` integer	The minimum number of replicas required to run this elastic job. The autoscaler cannot scale an elastic job below this number. This value cannot be changed once the job is created.
`maxReplicas` integer	The maximum number of replicas allowed to run this elastic job. The autoscaler cannot scale an elastic job above this number. This value cannot be changed once the job is created.
`expectedReplicas` integer	Number of replicas to be created. This number can be set to an initial value upon creation. This value can be modified dynamically by an external entity, such as a user or an autoscaler, to scale the job up or down.
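
The relationship between the three replica fields can be sketched as a bounds check. The helper names are hypothetical, and the sketch assumes that a request outside the `minReplicas`/`maxReplicas` range is clamped; the reference does not say whether the controller clamps or rejects such a request.

```python
# Hypothetical helpers illustrating the replica bounds of an ElasticConfig.
def validate_elastic(elastic: dict) -> None:
    """Raise ValueError if the bounds are inconsistent."""
    if elastic["minReplicas"] > elastic["maxReplicas"]:
        raise ValueError("minReplicas must not exceed maxReplicas")


def clamp_expected(elastic: dict, requested: int) -> int:
    """Keep a requested replica count inside [minReplicas, maxReplicas].

    Assumption: out-of-range requests are clamped rather than rejected.
    """
    validate_elastic(elastic)
    return max(elastic["minReplicas"], min(elastic["maxReplicas"], requested))


elastic = {"enabled": True, "minReplicas": 2, "maxReplicas": 8, "expectedReplicas": 4}
# An autoscaler asking for 16 replicas is held at maxReplicas.
elastic["expectedReplicas"] = clamp_expected(elastic, 16)
print(elastic["expectedReplicas"])
```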

RunPolicy¶

RunPolicy encapsulates various runtime policies of the distributed training job, for example how to clean up resources and how long the job can stay active.

Appears in: - DeepSpeedJobSpec

Field	Description
`activeDeadlineSeconds` integer	The duration in seconds, relative to `startTime`, that the job may be active before the system tries to terminate it. The value must be a positive integer.
`backoffLimit` integer	Optional number of retries before marking this job failed.
`cleanUpPolicy` CleanUpPolicy	Policy for cleaning up the tasks after the training job finishes.
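
The `activeDeadlineSeconds` rule can be illustrated with a small check. The function `deadline_exceeded` is hypothetical, and treating the exact boundary (elapsed time equal to the deadline) as exceeded is an assumption.

```python
from datetime import datetime, timedelta, timezone


# Hypothetical helper: compares the job's active time against the deadline.
def deadline_exceeded(start_time: datetime, now: datetime,
                      active_deadline_seconds: int) -> bool:
    """Return True once the job has been active for the full deadline."""
    if active_deadline_seconds <= 0:
        raise ValueError("activeDeadlineSeconds must be a positive integer")
    # Assumption: reaching the deadline exactly counts as exceeded.
    return now - start_time >= timedelta(seconds=active_deadline_seconds)


start = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
print(deadline_exceeded(start, start + timedelta(minutes=30), 3600))
print(deadline_exceeded(start, start + timedelta(hours=2), 3600))
```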

RunType¶

How the training program should be started. Exactly one of the 3 choices should be set.

Appears in: - Config

Field	Description
`python` string array	Use a Python script.
`module` string array	Use a Python module.
`exec` string array	Use an executable program.
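
The "exactly one of the 3 choices" rule can be checked before submitting a job. The following is a sketch only; `selected_run_choice` is a hypothetical helper, and treating an empty array as unset is an assumption.

```python
RUN_CHOICES = ("python", "module", "exec")


# Hypothetical helper: enforces that exactly one RunType field is set.
def selected_run_choice(run: dict) -> str:
    """Return the name of the single field that is set, or raise ValueError."""
    # Assumption: a missing key and an empty array both count as "not set".
    chosen = [name for name in RUN_CHOICES if run.get(name)]
    if len(chosen) != 1:
        raise ValueError(
            f"exactly one of {RUN_CHOICES} must be set, got {chosen or 'none'}"
        )
    return chosen[0]


print(selected_run_choice({"module": ["my_package.train"]}))
```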

Worker¶

Worker defines the configurations for DeepSpeedJob worker replicas.

Appears in: - DeepSpeedJobSpec

Field	Description
`replicas` integer	The number of workers to launch.
`template` PodTemplateSpec	Describes the pod that will be created for this replica. Note that `RestartPolicy` in `PodTemplateSpec` will always be set to `Never` as the job controller will decide if restarts are desired.
`restartPolicy` RestartPolicy	Restart policy for all replicas owned by the job. One of `Always`, `OnFailure`, `Never`, or `ExitCode`. Defaults to `OnFailure`.
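
As a sketch of how a Worker definition might be checked, the code below validates `restartPolicy` against the four listed values and counts the GPUs requested per replica, which is the number `slotsPerWorker` is normally set to. Both helpers are hypothetical, the image is a placeholder, and the resource name `nvidia.com/gpu` is an assumption about the cluster's device plugin.

```python
RESTART_POLICIES = {"Always", "OnFailure", "Never", "ExitCode"}


# Hypothetical helper: resolves and validates the worker restart policy.
def restart_policy(worker: dict) -> str:
    policy = worker.get("restartPolicy", "OnFailure")  # documented default
    if policy not in RESTART_POLICIES:
        raise ValueError(f"unsupported restartPolicy: {policy!r}")
    return policy


# Hypothetical helper: sums GPU limits across the pod template's containers.
def gpus_per_replica(worker: dict, resource: str = "nvidia.com/gpu") -> int:
    containers = worker["template"]["spec"]["containers"]
    return sum(
        int(c.get("resources", {}).get("limits", {}).get(resource, 0))
        for c in containers
    )


worker = {
    "replicas": 2,
    "template": {
        "spec": {
            "containers": [
                {
                    "name": "worker",
                    "image": "example.com/train:latest",  # placeholder image
                    "resources": {"limits": {"nvidia.com/gpu": 4}},
                }
            ]
        }
    },
}
print(restart_policy(worker), gpus_per_replica(worker))
```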