API Reference
Packages
batch.tensorstack.dev/v1beta1
Package v1beta1 contains API schema definitions for the batch v1beta1 API group.
Resource Types
Config
Configuration for running a DeepSpeed job. See the official DeepSpeed documentation (https://www.deepspeed.ai/getting-started/) for details.
Appears in: DeepSpeedJobSpec
| Field | Description | 
|---|---|
| customCommand (string) | Custom launch command. When set, all other options in Config except slotsPerWorker are ignored. |
| slotsPerWorker (integer) | The number of slots for each worker/replica. This is normally set to the number of GPUs requested for each replica. |
| localRank (boolean) | Whether the local_rank parameter should be passed to the training program. |
| autotune (AutotuneType) | Parameters for running the autotuning process, which searches for configurations for a training job on a particular cluster/machine. |
| run (RunType) | Mechanism used to start the training program. |
| otherArgs (string array) | Additional command-line arguments for the deepspeed job. |
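The override rule for customCommand can be sketched as follows. This is a minimal illustration, not the controller's implementation: the helper name `effective_fields`, the script name, the command string, and the assumption that a non-empty customCommand counts as "set" are all hypothetical.

```python
def effective_fields(config):
    """Return the Config field names that take effect, per the table above."""
    # Assumption: a non-empty customCommand string counts as "set".
    if config.get("customCommand"):
        # Only customCommand itself and slotsPerWorker remain in effect.
        return sorted(k for k in config if k in ("customCommand", "slotsPerWorker"))
    return sorted(config)


# Illustrative values only; the script name and its arguments are hypothetical.
config = {
    "slotsPerWorker": 4,
    "localRank": True,
    "run": {"python": ["train.py", "--epochs", "3"]},
}

# Hypothetical custom launch command added on top of the same config.
custom = {**config, "customCommand": "deepspeed train.py"}

print(effective_fields(config))
print(effective_fields(custom))
```

With customCommand present, localRank and run are still stored in the object but would no longer influence how the job is launched.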
DeepSpeedJob
DeepSpeedJob defines the schema for the DeepSpeedJob API.
Appears in: DeepSpeedJobList
| Field | Description | 
|---|---|
| apiVersion (string) | batch.tensorstack.dev/v1beta1 |
| kind (string) | DeepSpeedJob |
| metadata (ObjectMeta) | Refer to Kubernetes API documentation for fields of metadata. |
| spec (DeepSpeedJobSpec) | |
| status (DeepSpeedJobStatus) | |
DeepSpeedJobList
DeepSpeedJobList contains a list of DeepSpeedJob objects.
| Field | Description | 
|---|---|
| apiVersion (string) | batch.tensorstack.dev/v1beta1 |
| kind (string) | DeepSpeedJobList |
| metadata (ListMeta) | Refer to Kubernetes API documentation for fields of metadata. |
| items (DeepSpeedJob array) | |
DeepSpeedJobSpec
DeepSpeedJobSpec outlines the intended configuration and execution parameters for a DeepSpeedJob.
Appears in: DeepSpeedJob
| Field | Description | 
|---|---|
| scheduler (SchedulePolicy) | Identifies the preferred scheduler for allocating resources to replicas. Defaults to the cluster's default scheduler. |
| runPolicy (RunPolicy) | Execution policy configurations governing the behavior of the distributed training job. |
| runMode (RunMode) | Job’s execution behavior. If omitted, defaults to Immediate mode, and tasks are executed immediately upon submission. |
| elastic (ElasticConfig) | Configurations for how to launch an elastic training job. |
| config (Config) | Key configurations for executing DeepSpeed training jobs. |
| disableCustomEnv (boolean) | Setting environment variables for DeepSpeed training requires an env file that stores the desired variables; the launcher then distributes them to each worker process. This flag controls whether the controller fills that file automatically. false (default): the environment variables set in the job spec are used in the training processes, and the controller automatically writes them into the env file so that the launcher can send them to each worker. true: the environment variables set in the job spec are used only to start the container entrypoint program, and the training program does not receive them. |
| worker (Worker) | Specifications for the worker replicas. |
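As a rough illustration of how these fields fit together, the sketch below assembles a DeepSpeedJob manifest as a Python dict and serializes it to JSON. The helper name `build_deepspeed_job`, the job name, the image, the script name, the container name, and the use of the `nvidia.com/gpu` resource are assumptions made for the example, not values prescribed by this API.

```python
import json


def build_deepspeed_job(name, image, replicas, gpus_per_worker):
    """Assemble a DeepSpeedJob manifest as a plain dict (illustrative only)."""
    return {
        "apiVersion": "batch.tensorstack.dev/v1beta1",
        "kind": "DeepSpeedJob",
        "metadata": {"name": name},
        "spec": {
            # scheduler and runMode are omitted, so the documented defaults apply.
            "config": {
                # Normally the number of GPUs requested for each replica.
                "slotsPerWorker": gpus_per_worker,
                "localRank": True,
                "run": {"python": ["train.py"]},  # hypothetical script name
            },
            # false is the documented default; spelled out here for clarity.
            "disableCustomEnv": False,
            "worker": {
                "replicas": replicas,
                "template": {
                    "spec": {
                        "containers": [
                            {
                                "name": "worker",  # hypothetical container name
                                "image": image,
                                # Assumes GPUs are exposed as nvidia.com/gpu.
                                "resources": {
                                    "limits": {"nvidia.com/gpu": gpus_per_worker}
                                },
                            }
                        ]
                    }
                },
            },
        },
    }


job = build_deepspeed_job(
    "demo-job", "example.com/train:latest", replicas=2, gpus_per_worker=4
)
manifest = json.dumps(job, indent=2)
print(manifest)
```

Because Kubernetes accepts JSON manifests, output of this shape could in principle be piped to `kubectl apply -f -`, provided the DeepSpeedJob CRD is installed in the cluster.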
DeepSpeedJobStatus
DeepSpeedJobStatus represents the observed state of a DeepSpeedJob.
Appears in: DeepSpeedJob
| Field | Description | 
|---|---|
| tasks (Tasks array) | |
| aggregate (Aggregate) | |
| phase (JobPhase) | Provides a simple, high-level summary of where the Job is in its lifecycle. Note that this is NOT intended to be a comprehensive state machine. |
| backoffCount (integer) | The number of restarts that have been performed. |
| conditions (JobCondition array) | The latest available observations of an object’s current state. |
ElasticConfig
Configuration governing the elastic scaling behavior of the job.
Appears in: DeepSpeedJobSpec
| Field | Description | 
|---|---|
| enabled (boolean) | Set to true to use elastic training. |
| minReplicas (integer) | The minimum number of replicas for this elastic job. The autoscaler cannot scale an elastic job down below this number. This value cannot be changed once the job is created. |
| maxReplicas (integer) | The maximum number of replicas for this elastic job. The autoscaler cannot scale an elastic job up above this number. This value cannot be changed once the job is created. |
| expectedReplicas (integer) | The number of replicas to be created. An initial value can be set at creation, and an external entity such as a user or an autoscaler can modify it later to scale the job up or down. |
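The constraints in this table can be expressed as a small client-side check. This is only a sketch of the documented rules, not the controller's actual validation; the function name `check_elastic_update` is hypothetical, and treating expectedReplicas as bounded by minReplicas and maxReplicas is an assumption inferred from the autoscaler limits described above.

```python
def check_elastic_update(old, new):
    """Return a list of rule violations when `old` ElasticConfig becomes `new`."""
    errors = []
    # minReplicas and maxReplicas are fixed once the job is created.
    for field in ("minReplicas", "maxReplicas"):
        if old[field] != new[field]:
            errors.append(f"{field} cannot be changed once the job is created")
    # Assumption: expectedReplicas must stay within [minReplicas, maxReplicas].
    if not new["minReplicas"] <= new["expectedReplicas"] <= new["maxReplicas"]:
        errors.append("expectedReplicas must lie between minReplicas and maxReplicas")
    return errors


created = {"enabled": True, "minReplicas": 2, "maxReplicas": 8, "expectedReplicas": 4}

# Scaling within the bounds is allowed.
scaled_up = {**created, "expectedReplicas": 6}
# Asking for more replicas than maxReplicas violates the upper bound.
too_many = {**created, "expectedReplicas": 10}

print(check_elastic_update(created, scaled_up))
print(check_elastic_update(created, too_many))
```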
RunPolicy
RunPolicy encapsulates various runtime policies of the distributed training job, for example how to clean up resources and how long the job can stay active.
Appears in: DeepSpeedJobSpec
| Field | Description | 
|---|---|
| activeDeadlineSeconds (integer) | The duration in seconds, relative to the startTime, that the job may be active before the system tries to terminate it. The value must be a positive integer. |
| backoffLimit (integer) | Optional number of retries before marking this job failed. |
| cleanUpPolicy (CleanUpPolicy) | Policy for cleaning up the tasks after the training job finishes. |
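To make the activeDeadlineSeconds semantics concrete, here is a minimal sketch of the deadline arithmetic. The function name `deadline_exceeded` is hypothetical, and whether the real controller treats the exact boundary as exceeded is not stated in this reference, so the strict comparison below is an assumption.

```python
from datetime import datetime, timedelta, timezone


def deadline_exceeded(start_time, active_deadline_seconds, now):
    """Return True once the job has been active longer than its deadline."""
    if active_deadline_seconds <= 0:
        raise ValueError("activeDeadlineSeconds must be a positive integer")
    # The deadline is measured relative to the job's startTime.
    return now - start_time > timedelta(seconds=active_deadline_seconds)


# Illustrative timestamps only.
start = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
half_hour_in = start + timedelta(minutes=30)
past_deadline = start + timedelta(minutes=61)

print(deadline_exceeded(start, 3600, half_hour_in))
print(deadline_exceeded(start, 3600, past_deadline))
```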
RunType
How the training program should be started. Exactly one of the 3 choices should be set.
Appears in: Config
| Field | Description | 
|---|---|
| python (string array) | Start the training program from a Python script. |
| module (string array) | Start the training program from a Python module. |
| exec (string array) | Start the training program from an executable program. |
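The "exactly one of the 3 choices" rule can be checked before submitting a job. The following is a minimal sketch; the helper name `selected_launch_mode` and the sample script and module names are hypothetical, and an empty or missing array is assumed to mean "not set".

```python
RUN_TYPE_FIELDS = ("python", "module", "exec")


def selected_launch_mode(run):
    """Return the single RunType field that is set, or raise ValueError."""
    # Assumption: a missing key or an empty array means the choice is not set.
    chosen = [field for field in RUN_TYPE_FIELDS if run.get(field)]
    if len(chosen) != 1:
        raise ValueError(
            f"exactly one of {RUN_TYPE_FIELDS} must be set, got {chosen}"
        )
    return chosen[0]


# Hypothetical script name, for illustration.
print(selected_launch_mode({"python": ["train.py"]}))
```

Passing a dict that sets both `python` and `module`, or none of the three fields, raises `ValueError` in this sketch.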
Worker
Worker defines the configurations for DeepSpeedJob worker replicas.
Appears in: DeepSpeedJobSpec
| Field | Description | 
|---|---|
| replicas (integer) | The number of workers to launch. |
| template (PodTemplateSpec) | Describes the pod that will be created for this replica. Note that RestartPolicy in PodTemplateSpec will always be set to Never, as the job controller decides whether restarts are desired. |
| restartPolicy (RestartPolicy) | Restart policy for all replicas owned by the job. One of Always, OnFailure, Never, or ExitCode. Defaults to OnFailure. |