API Reference

Packages

batch.tensorstack.dev/v1beta1

Package v1beta1 contains API Schema definitions for the batch v1beta1 API group.

Resource Types

Config

Configuration for running a DeepSpeed job. See the official DeepSpeed documentation (https://www.deepspeed.ai/getting-started/) for details.

Appears in: DeepSpeedJobSpec

Field | Description
customCommand (string) | Custom launch commands. When enabled, the other options in Config, except slotsPerWorker, do not take effect.
slotsPerWorker (integer) | The number of slots for each worker/replica. This is normally set to the number of GPUs requested for each replica.
localRank (boolean) | Whether the local_rank parameter should be passed to the training program.
autotune (AutotuneType) | Parameters for running the autotuning process to find configurations for a training job on a particular cluster/machine.
run (RunType) | Mechanism to start the training program.
otherArgs (string array) | Other command-line arguments for the DeepSpeed job.
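
The customCommand rule can be illustrated with a small Python sketch that models Config as a plain dictionary and returns the fields that would still take effect. The helper name effective_config and the sample command are hypothetical, and treating any non-empty customCommand as "enabled" is an assumption rather than documented controller behavior.

```python
# Field names are taken from the Config table; everything else is illustrative.
CONFIG_FIELDS = ("customCommand", "slotsPerWorker", "localRank",
                 "autotune", "run", "otherArgs")


def effective_config(config):
    """Return only the Config fields that would take effect (hypothetical helper)."""
    known = {k: v for k, v in config.items() if k in CONFIG_FIELDS}
    # Assumption: a non-empty customCommand counts as "enabled".
    if known.get("customCommand"):
        keep = ("customCommand", "slotsPerWorker")
        return {k: v for k, v in known.items() if k in keep}
    return known


config = {
    "customCommand": "deepspeed train.py",  # hypothetical command
    "slotsPerWorker": 4,
    "localRank": True,
    "otherArgs": ["--epochs", "3"],
}
print(effective_config(config))
```

In this sketch localRank and otherArgs are dropped because customCommand is present, while slotsPerWorker is kept.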

DeepSpeedJob

DeepSpeedJob defines the schema for the DeepSpeedJob API.

Appears in: DeepSpeedJobList

Field | Description
apiVersion (string) | batch.tensorstack.dev/v1beta1
kind (string) | DeepSpeedJob
metadata (ObjectMeta) | Refer to Kubernetes API documentation for fields of metadata.
spec (DeepSpeedJobSpec) |
status (DeepSpeedJobStatus) |
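
As a rough illustration of how these top-level fields fit together, the following Python sketch assembles a minimal DeepSpeedJob manifest as a dictionary and prints it as JSON. The apiVersion, kind, and field names under spec come from this reference. The function name, job name, image, script name, container name, and the nvidia.com/gpu resource key are assumptions made for the example, and the real controller may require additional fields.

```python
import json


def minimal_deepspeed_job(name, image, replicas, gpus_per_replica):
    """Build a DeepSpeedJob manifest as a plain dictionary (illustrative only)."""
    return {
        "apiVersion": "batch.tensorstack.dev/v1beta1",
        "kind": "DeepSpeedJob",
        "metadata": {"name": name},
        "spec": {
            "config": {
                # Normally the number of GPUs requested for each replica.
                "slotsPerWorker": gpus_per_replica,
                # Exactly one of python / module / exec; script name is hypothetical.
                "run": {"python": ["train.py"]},
            },
            "worker": {
                "replicas": replicas,
                "template": {
                    "spec": {
                        "containers": [{
                            "name": "worker",  # hypothetical container name
                            "image": image,
                            "resources": {
                                # Assumes GPUs are exposed as nvidia.com/gpu.
                                "limits": {"nvidia.com/gpu": gpus_per_replica},
                            },
                        }],
                    },
                },
            },
        },
    }


job = minimal_deepspeed_job("demo-job", "example.com/train:latest",
                            replicas=2, gpus_per_replica=4)
print(json.dumps(job, indent=2))
```

Because Kubernetes tooling generally accepts JSON as well as YAML manifests, the printed output could be saved to a file and submitted with the usual tools.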

DeepSpeedJobList

DeepSpeedJobList contains a list of DeepSpeedJob objects.

Field | Description
apiVersion (string) | batch.tensorstack.dev/v1beta1
kind (string) | DeepSpeedJobList
metadata (ListMeta) | Refer to Kubernetes API documentation for fields of metadata.
items (DeepSpeedJob array) |

DeepSpeedJobSpec

DeepSpeedJobSpec outlines the intended configuration and execution parameters for a DeepSpeedJob.

Appears in: DeepSpeedJob

Field | Description
scheduler (SchedulePolicy) | Identifies the preferred scheduler for allocating resources to replicas. Defaults to the cluster default scheduler.
runPolicy (RunPolicy) | Execution policy configurations governing the behavior of the distributed training job.
runMode (RunMode) | Job's execution behavior. If omitted, defaults to Immediate mode, in which tasks are executed immediately upon submission.
elastic (ElasticConfig) | Configurations for how to launch elastic training.
config (Config) | Key configurations for executing DeepSpeed training jobs.
disableCustomEnv (boolean) | Controls whether environment variables set in the job spec are forwarded to the training processes. Setting environment variables for DeepSpeed training requires an env file that stores the desired variables; the launcher then distributes these variables to each worker process. Some scenarios require disabling this automatic behavior. false (default): the environment variables set in the job spec are used in the training processes, and the controller automatically writes them into the env file so that the launcher can send them to each worker. true: the environment variables set in the job spec are used only to start the container entrypoint program, and the training program does not need them.
worker (Worker) | Specifications for the worker replicas.
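
The effect of disableCustomEnv can be sketched in Python as follows. The sketch assumes the env file uses one VAR=VAL entry per line, as described in the DeepSpeed documentation. The function name render_env_file is hypothetical, and the rule for skipping variables without a literal value is an assumption; this reference does not specify how the controller actually builds the file.

```python
def render_env_file(env, disable_custom_env=False):
    """Return env-file text for the launcher, or an empty string when disabled."""
    if disable_custom_env:
        # true: variables only reach the container entrypoint; nothing is forwarded.
        return ""
    # false (default): forward every variable that carries a literal value.
    # Assumption: entries without a literal "value" (e.g. valueFrom) are skipped.
    return "".join(
        f"{item['name']}={item['value']}\n" for item in env if "value" in item
    )


# Container-style env list, as it would appear in a pod template.
env = [
    {"name": "NCCL_DEBUG", "value": "INFO"},
    {"name": "TOKEN", "valueFrom": {"secretKeyRef": {"name": "s", "key": "k"}}},
]
print(render_env_file(env), end="")
```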

DeepSpeedJobStatus

DeepSpeedJobStatus represents the observed state of a DeepSpeedJob.

Appears in: DeepSpeedJob

Field | Description
tasks (Tasks array) |
aggregate (Aggregate) |
phase (JobPhase) | Provides a simple, high-level summary of where the Job is in its lifecycle. Note that this is NOT intended to be a comprehensive state machine.
backoffCount (integer) | The number of restarts that have been performed.
conditions (JobCondition array) | The latest available observations of an object's current state.

ElasticConfig

Configuration governing the elastic scaling behavior of the job.

Appears in: DeepSpeedJobSpec

Field | Description
enabled (boolean) | Set to true to use elastic training.
minReplicas (integer) | The minimum number of replicas needed to run this elastic job. The autoscaler cannot scale an elastic job below this number. This value cannot be changed once the job is created.
maxReplicas (integer) | The maximum number of replicas that can run this elastic job. The autoscaler cannot scale an elastic job above this number. This value cannot be changed once the job is created.
expectedReplicas (integer) | The number of replicas to be created. An initial value can be set upon creation, and an external entity, such as a user or an autoscaler, can modify the value dynamically to scale the job up or down.
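
The replica bounds and the immutability of minReplicas and maxReplicas can be expressed as two small Python helpers. Both function names are hypothetical. Clamping a requested value into range is one possible way to honor the bounds and is an assumption; the actual system may reject out-of-range values instead.

```python
def clamp_expected_replicas(requested, min_replicas, max_replicas):
    """Bound a requested replica count to [min_replicas, max_replicas]."""
    if min_replicas > max_replicas:
        raise ValueError("minReplicas must not exceed maxReplicas")
    return max(min_replicas, min(requested, max_replicas))


def immutable_field_changes(old, new):
    """List ElasticConfig fields that changed although they are fixed after creation."""
    return [f for f in ("minReplicas", "maxReplicas") if old.get(f) != new.get(f)]


old = {"enabled": True, "minReplicas": 2, "maxReplicas": 8, "expectedReplicas": 4}
new = {"enabled": True, "minReplicas": 2, "maxReplicas": 16, "expectedReplicas": 10}

print(clamp_expected_replicas(new["expectedReplicas"],
                              old["minReplicas"], old["maxReplicas"]))
print(immutable_field_changes(old, new))
```

Here the requested count of 10 is bounded by the original maxReplicas of 8, and the attempted change to maxReplicas is reported as a violation.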

RunPolicy

RunPolicy encapsulates various runtime policies of the distributed training job, for example how to clean up resources and how long the job can stay active.

Appears in: DeepSpeedJobSpec

Field | Description
activeDeadlineSeconds (integer) | The duration in seconds, relative to the startTime, that the job may be active before the system tries to terminate it. The value must be a positive integer.
backoffLimit (integer) | Optional number of retries before marking this job as failed.
cleanUpPolicy (CleanUpPolicy) | Policy for cleaning up the tasks after the training job has finished.
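
The activeDeadlineSeconds semantics can be sketched as a comparison between elapsed time and the configured duration. The function name deadline_exceeded is hypothetical, and treating an omitted value (None) as "no deadline" and using an inclusive comparison are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone


def deadline_exceeded(start_time, active_deadline_seconds, now):
    """Return True once the job has been active for the configured duration."""
    if active_deadline_seconds is None:
        return False  # Assumption: an omitted field means no deadline.
    if active_deadline_seconds <= 0:
        raise ValueError("activeDeadlineSeconds must be a positive integer")
    # The duration is measured relative to the job's startTime.
    return now - start_time >= timedelta(seconds=active_deadline_seconds)


start = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
print(deadline_exceeded(start, 3600, start + timedelta(minutes=30)))  # False
print(deadline_exceeded(start, 3600, start + timedelta(hours=2)))     # True
```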

RunType

How the training program should be started. Exactly one of the three choices should be set.

Appears in: Config

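Field | Description
python (string array) | Start the training program using a Python script.
module (string array) | Start the training program using a Python module.
exec (string array) | Start the training program using an executable program.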
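
The "exactly one of" constraint can be checked with a few lines of Python. The helper name selected_run_type and the sample arguments are hypothetical, and treating an empty array the same as an unset field is an assumption.

```python
RUN_CHOICES = ("python", "module", "exec")


def selected_run_type(run):
    """Return the single RunType field that is set, or raise if not exactly one."""
    # Assumption: an empty array counts as "not set".
    chosen = [name for name in RUN_CHOICES if run.get(name)]
    if len(chosen) != 1:
        raise ValueError(
            f"exactly one of {RUN_CHOICES} must be set, got {chosen or 'none'}"
        )
    return chosen[0]


print(selected_run_type({"python": ["train.py", "--lr", "0.1"]}))  # python
```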

Worker

Worker defines the configurations for DeepSpeedJob worker replicas.

Appears in: DeepSpeedJobSpec

Field | Description
replicas (integer) | The number of workers to launch.
template (PodTemplateSpec) | Describes the pod that will be created for this replica. Note that RestartPolicy in PodTemplateSpec will always be set to Never, as the job controller will decide if restarts are desired.
restartPolicy (RestartPolicy) | Restart policy for all replicas owned by the job. One of Always, OnFailure, Never, or ExitCode. Defaults to OnFailure.
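
The defaulting and allowed values of restartPolicy can be sketched as below. The helper name resolve_restart_policy is hypothetical, and the sketch assumes the policy is written as a plain string, which this reference does not confirm; the RestartPolicy type may have a richer structure.

```python
VALID_RESTART_POLICIES = ("Always", "OnFailure", "Never", "ExitCode")


def resolve_restart_policy(worker):
    """Return the worker's restart policy, applying the documented default."""
    # Assumption: restartPolicy is expressed as a plain string.
    policy = worker.get("restartPolicy") or "OnFailure"
    if policy not in VALID_RESTART_POLICIES:
        raise ValueError(f"restartPolicy must be one of {VALID_RESTART_POLICIES}")
    return policy


print(resolve_restart_policy({"replicas": 2}))                           # OnFailure
print(resolve_restart_policy({"replicas": 2, "restartPolicy": "Never"}))  # Never
```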