API Reference
Packages
batch.tensorstack.dev/v1beta1
Package v1beta1 contains API Schema definitions for the batch v1beta1 API group
Resource Types
Config
Configuration for running a DeepSpeed job. See the official DeepSpeed documentation (https://www.deepspeed.ai/getting-started/) for details.
Appears in: DeepSpeedJobSpec
Field | Description |
---|---|
customCommand string | Custom launch command. When set, all other options in Config except slotsPerWorker are ignored. |
slotsPerWorker integer | The number of slots for each worker/replica. This is normally set to the number of GPUs requested for each replica. |
localRank boolean | Whether to pass the local_rank parameter to the training program. |
autotune AutotuneType | Parameters for running the autotuning process to find configurations for a training job on a particular cluster/machine. |
run RunType | Mechanism to start the training program. |
otherArgs string array | Additional command-line arguments for the deepspeed job. |
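As a rough illustration of how these fields combine, the fragment below sketches a config block as it might appear under the spec of a DeepSpeedJob. Only the field names come from the table above. The script path, its arguments, and all values are hypothetical, and the layout of the python array (script first, then its arguments) is an assumption that this reference does not confirm.

```yaml
# Sketch only: values are hypothetical, field names are from the Config table.
config:
  slotsPerWorker: 1     # normally the number of GPUs requested per replica
  localRank: true       # pass local_rank to the training program
  run:
    python:             # exactly one of python, module, exec should be set
      - train.py        # assumed layout: script path first, then its arguments
      - --epochs=3      # hypothetical argument of the hypothetical script
```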
DeepSpeedJob
DeepSpeedJob defines the schema for the DeepSpeedJob API.
Appears in: DeepSpeedJobList
Field | Description |
---|---|
apiVersion string | batch.tensorstack.dev/v1beta1 |
kind string | DeepSpeedJob |
metadata ObjectMeta | Refer to Kubernetes API documentation for fields of metadata. |
spec DeepSpeedJobSpec | |
status DeepSpeedJobStatus | |
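A minimal manifest might look like the sketch below. The apiVersion and kind values are taken from the table above. The resource name, container name, image, script, and replica count are hypothetical placeholders, and the sketch assumes the controller imposes no extra requirements on the pod template beyond a standard Kubernetes PodTemplateSpec.

```yaml
# Sketch only: names, image, and values are hypothetical.
apiVersion: batch.tensorstack.dev/v1beta1
kind: DeepSpeedJob
metadata:
  name: deepspeed-example              # hypothetical job name
spec:
  config:
    slotsPerWorker: 1
    run:
      python:
        - train.py                     # hypothetical training script
  worker:
    replicas: 2
    template:
      spec:
        containers:
          - name: worker               # hypothetical container name
            image: example.com/deepspeed-train:latest   # hypothetical image
            resources:
              limits:
                nvidia.com/gpu: 1      # matches slotsPerWorker above
```

The status field is omitted because it reports observed state and is not something a user writes when creating the job.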
DeepSpeedJobList
DeepSpeedJobList contains a list of DeepSpeedJob objects.
Field | Description |
---|---|
apiVersion string | batch.tensorstack.dev/v1beta1 |
kind string | DeepSpeedJobList |
metadata ListMeta | Refer to Kubernetes API documentation for fields of metadata. |
items DeepSpeedJob array | |
DeepSpeedJobSpec
DeepSpeedJobSpec outlines the intended configuration and execution parameters for a DeepSpeedJob.
Appears in: DeepSpeedJob
Field | Description |
---|---|
scheduler SchedulePolicy | Identifies the preferred scheduler for allocating resources to replicas. Defaults to the cluster's default scheduler. |
runPolicy RunPolicy | Execution policy configurations governing the behavior of the distributed training job. |
runMode RunMode | Job's execution behavior. If omitted, defaults to Immediate mode, and tasks are executed immediately upon submission. |
elastic ElasticConfig | Configurations for how to launch an elastic training. |
config Config | Key configurations for executing DeepSpeed training jobs. |
disableCustomEnv boolean | Controls whether the environment variables set in the job spec are forwarded to the training processes. DeepSpeed passes environment variables to workers through an env file, which the launcher distributes to each worker process. false (default): the controller automatically writes the environment variables set in the job spec into the env file, so the launcher sends them to each worker and the training processes can use them. true: the environment variables set in the job spec are used only to start the container entrypoint program and are not passed to the training program. |
worker Worker | Specifications for the worker replicas. |
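To make the disableCustomEnv behavior more concrete, the sketch below sets an environment variable on the worker container and leaves disableCustomEnv at its default. The variable name, container name, and image are hypothetical. Per the description above, the controller would be expected to write the variable into the env file so that the launcher forwards it to each worker.

```yaml
# Sketch only: the variable, container name, and image are hypothetical.
spec:
  disableCustomEnv: false          # default: env vars reach the training processes
  worker:
    replicas: 1
    template:
      spec:
        containers:
          - name: worker
            image: example.com/deepspeed-train:latest
            env:
              - name: EXAMPLE_VAR  # hypothetical variable
                value: "1"
```

Setting disableCustomEnv to true in the same spec would, by the description above, leave EXAMPLE_VAR visible only to the container entrypoint program.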
DeepSpeedJobStatus
DeepSpeedJobStatus represents the observed state of a DeepSpeedJob.
Appears in: DeepSpeedJob
Field | Description |
---|---|
tasks Tasks array | |
aggregate Aggregate | |
phase JobPhase | Provides a simple, high-level summary of where the Job is in its lifecycle. Note that this is NOT intended to be a comprehensive state machine. |
backoffCount integer | The number of restarts performed so far. |
conditions JobCondition array | The latest available observations of an object's current state. |
ElasticConfig
Configuration governing the elastic scaling behavior of the job.
Appears in: DeepSpeedJobSpec
Field | Description |
---|---|
enabled boolean | Set true to use elastic training. |
minReplicas integer | The minimum number of replicas for this elastic job. The autoscaler cannot scale an elastic job below this number. This value cannot be changed once the job is created. |
maxReplicas integer | The maximum number of replicas for this elastic job. The autoscaler cannot scale an elastic job above this number. This value cannot be changed once the job is created. |
expectedReplicas integer | Number of replicas to be created. This number can be set to an initial value upon creation. This value can be modified dynamically by an external entity, such as a user or an autoscaler, to scale the job up or down. |
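The relationship between the three replica counts can be sketched as follows. The numbers are hypothetical, and the sketch assumes expectedReplicas should lie between minReplicas and maxReplicas, which this reference implies but does not state explicitly.

```yaml
# Sketch only: the replica counts are hypothetical.
elastic:
  enabled: true
  minReplicas: 2        # lower bound for scaling; fixed after creation
  maxReplicas: 8        # upper bound for scaling; fixed after creation
  expectedReplicas: 4   # initial size; a user or autoscaler may change it later
```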
RunPolicy
RunPolicy encapsulates various runtime policies of the distributed training job, for example how to clean up resources and how long the job can stay active.
Appears in: DeepSpeedJobSpec
Field | Description |
---|---|
activeDeadlineSeconds integer | The duration in seconds, relative to startTime, that the job may be active before the system tries to terminate it. The value must be a positive integer. |
backoffLimit integer | Optional number of retries before marking this job failed. |
cleanUpPolicy CleanUpPolicy | Policy for cleaning up tasks after the training job finishes. |
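A runPolicy block using the two integer fields might look like the sketch below. The values are hypothetical. cleanUpPolicy is left out because this reference does not list the values that CleanUpPolicy accepts.

```yaml
# Sketch only: the values are hypothetical.
runPolicy:
  activeDeadlineSeconds: 3600   # the system tries to terminate the job after one hour
  backoffLimit: 3               # retries allowed before the job is marked failed
```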
RunType
How the training program should be started. Exactly one of the 3 choices should be set.
Appears in: Config
Field | Description |
---|---|
python string array | Start the training program from a Python script. |
module string array | Start the training program from a Python module. |
exec string array | Start the training program from an executable program. |
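Since exactly one of the three fields should be set, a run block contains a single key. The sketch below uses module. The module name and argument are hypothetical, and the layout of the array (module name first, then its arguments) is an assumption not confirmed by this reference.

```yaml
# Sketch only: the module name and argument are hypothetical.
run:
  module:
    - my_package.train   # assumed layout: module name first, then its arguments
    - --epochs=3
```

Setting python or exec instead would replace the module key rather than sit alongside it.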
Worker
Worker defines the configurations for DeepSpeedJob worker replicas.
Appears in: DeepSpeedJobSpec
Field | Description |
---|---|
replicas integer | The number of workers to launch. |
template PodTemplateSpec | Describes the pod that will be created for this replica. Note that RestartPolicy in PodTemplateSpec will always be set to Never as the job controller will decide if restarts are desired. |
restartPolicy RestartPolicy | Restart policy for all replicas owned by the job. One of Always, OnFailure, Never, or ExitCode. Defaults to OnFailure. |