API Reference

Packages

batch.tensorstack.dev/v1beta1

Package v1beta1 contains API Schema definitions for the batch v1beta1 API group

Resource Types

Config

Configuration for running a DeepSpeed job. See the official DeepSpeed documentation (https://www.deepspeed.ai/getting-started/) for details.

Appears in: - DeepSpeedJobSpec

Field Description
customCommand string Custom launch command. When set, all other options in Config except slotsPerWorker are ignored.
slotsPerWorker integer The number of slots for each worker/replica. This is normally set to the number of GPUs requested for each replica.
localRank boolean Whether the local_rank parameter should be passed to the training program.
autotune AutotuneType Parameters for running the autotuning process to find configurations for a training job on a particular cluster/machine.
run RunType Mechanism to start the training program.
otherArgs string array Additional command-line arguments for the DeepSpeed job.
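
The precedence rule for customCommand can be sketched in Python. This is an illustration only: the helper effective_options is hypothetical and not part of the API, the Config object is modeled as a plain dict using the field names from the table above, and the customCommand value is a placeholder, since the expected command format is not specified here.

```python
# Hypothetical helper, not part of the DeepSpeedJob API.
# It models which Config fields take effect, based on the field table above.

def effective_options(config: dict) -> dict:
    """Return the subset of Config fields that take effect."""
    if config.get("customCommand"):
        # With customCommand set, only slotsPerWorker is still honored.
        kept = {"customCommand", "slotsPerWorker"}
        return {key: value for key, value in config.items() if key in kept}
    # Without customCommand, every field applies.
    return dict(config)


config = {
    "customCommand": "deepspeed train.py",  # placeholder value
    "slotsPerWorker": 4,
    "localRank": True,
    "otherArgs": ["--epochs", "3"],
}
result = effective_options(config)
print(result)
```

In this sketch, localRank and otherArgs are dropped from the result because customCommand is set.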

DeepSpeedJob

DeepSpeedJob defines the schema for the DeepSpeedJob API.

Appears in: - DeepSpeedJobList

Field Description
apiVersion string batch.tensorstack.dev/v1beta1
kind string DeepSpeedJob
metadata ObjectMeta Refer to Kubernetes API documentation for fields of metadata.
spec DeepSpeedJobSpec
status DeepSpeedJobStatus
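
As a rough sketch of how these fields fit together, the following Python snippet builds a DeepSpeedJob manifest as a dict and serializes it to JSON. Only apiVersion and kind are taken from the table above; the job name, container name, image, and script name are placeholders invented for illustration, and a real job will likely need more fields. The status field is left out because it holds observed state rather than user input.

```python
import json

# Sketch of a DeepSpeedJob manifest. apiVersion and kind come from the
# field table; all other values are illustrative placeholders.
job = {
    "apiVersion": "batch.tensorstack.dev/v1beta1",
    "kind": "DeepSpeedJob",
    "metadata": {"name": "example-job"},  # placeholder name
    "spec": {
        "config": {
            "slotsPerWorker": 1,
            "run": {"python": ["train.py"]},  # placeholder script
        },
        "worker": {
            "replicas": 2,
            "template": {
                "spec": {
                    "containers": [
                        # placeholder container name and image
                        {"name": "worker", "image": "example.com/train:latest"}
                    ]
                }
            },
        },
    },
}

manifest = json.dumps(job, indent=2)
print(manifest)
```

The JSON output can be saved to a file and passed to kubectl apply -f, which accepts JSON as well as YAML.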

DeepSpeedJobList

DeepSpeedJobList contains a list of DeepSpeedJob objects.

Field Description
apiVersion string batch.tensorstack.dev/v1beta1
kind string DeepSpeedJobList
metadata ListMeta Refer to Kubernetes API documentation for fields of metadata.
items DeepSpeedJob array

DeepSpeedJobSpec

DeepSpeedJobSpec outlines the intended configuration and execution parameters for a DeepSpeedJob.

Appears in: - DeepSpeedJob

Field Description
scheduler SchedulePolicy Identifies the preferred scheduler for allocating resources to replicas. Defaults to the cluster's default scheduler.
runPolicy RunPolicy Execution policy configurations governing the behavior of the distributed training job.
runMode RunMode Job's execution behavior. If omitted, defaults to Immediate mode, and tasks are executed immediately upon submission.
elastic ElasticConfig Configuration for launching elastic training.
config Config Key configurations for executing DeepSpeed training jobs.
disableCustomEnv boolean Whether to disable automatic propagation of environment variables to worker processes. DeepSpeed passes environment variables to training processes through an env file, which the launcher distributes to each worker. By default, the controller writes the environment variables from the job spec into this file; some scenarios require turning this behavior off.
false (default): The environment variables set in the job spec are used in the training processes. The controller automatically puts them into the env file so that the launcher can send them to each worker.
true: The environment variables set in the job spec are used only to start the container entrypoint program; the training program does not receive them.
worker Worker Specifications for the worker replicas.

DeepSpeedJobStatus

DeepSpeedJobStatus represents the observed state of a DeepSpeedJob.

Appears in: - DeepSpeedJob

Field Description
tasks Tasks array
aggregate Aggregate
phase JobPhase Provides a simple, high-level summary of where the Job is in its lifecycle. Note that this is NOT intended to be a comprehensive state machine.
backoffCount integer The number of restarts performed so far.
conditions JobCondition array The latest available observations of an object's current state.

ElasticConfig

Configuration governing the elastic scaling behavior of the job.

Appears in: - DeepSpeedJobSpec

Field Description
enabled boolean Set to true to use elastic training.
minReplicas integer The minimum number of replicas with which this elastic job can run. The autoscaler cannot scale an elastic job down below this number. This value cannot be changed once the job is created.
maxReplicas integer The maximum number of replicas with which this elastic job can run. The autoscaler cannot scale an elastic job up above this number. This value cannot be changed once the job is created.
expectedReplicas integer Number of replicas to be created. This number can be set to an initial value upon creation. This value can be modified dynamically by an external entity, such as a user or an autoscaler, to scale the job up or down.
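
The relationship between the three replica counts can be illustrated with a small Python sketch. The function clamp_expected_replicas is hypothetical. The table above only states that the autoscaler cannot go below minReplicas or above maxReplicas; whether an out-of-range expectedReplicas is clamped or rejected is not specified, so clamping is an assumption made here.

```python
# Hypothetical helper, not part of the API. It assumes that an
# out-of-range request is clamped into [minReplicas, maxReplicas].

def clamp_expected_replicas(requested: int, min_replicas: int, max_replicas: int) -> int:
    """Keep a requested replica count within the elastic bounds."""
    if min_replicas > max_replicas:
        raise ValueError("minReplicas must not exceed maxReplicas")
    return max(min_replicas, min(requested, max_replicas))


elastic = {"enabled": True, "minReplicas": 2, "maxReplicas": 8, "expectedReplicas": 4}

# A scale-up request beyond maxReplicas is held at the upper bound.
scaled = clamp_expected_replicas(12, elastic["minReplicas"], elastic["maxReplicas"])
print(scaled)
```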

RunPolicy

RunPolicy encapsulates various runtime policies of the distributed training job, for example how to clean up resources and how long the job can stay active.

Appears in: - DeepSpeedJobSpec

Field Description
activeDeadlineSeconds integer The duration in seconds, relative to startTime, that the job may be active before the system tries to terminate it. The value must be a positive integer.
backoffLimit integer Optional number of retries before marking this job failed.
cleanUpPolicy CleanUpPolicy Policy for cleaning up tasks after the training job finishes.
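
The way activeDeadlineSeconds relates to startTime can be shown with a short Python sketch. Both functions are hypothetical and only model the arithmetic described above; how and when the controller actually checks the deadline is not specified here, and treating the exact deadline instant as expired is an assumption.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical helpers that model the deadline arithmetic only.

def job_deadline(start_time: datetime, active_deadline_seconds: int) -> datetime:
    """Return the instant after which the job may be terminated."""
    if active_deadline_seconds <= 0:
        raise ValueError("activeDeadlineSeconds must be a positive integer")
    return start_time + timedelta(seconds=active_deadline_seconds)


def is_past_deadline(start_time: datetime, active_deadline_seconds: int, now: datetime) -> bool:
    """Assumption: the deadline instant itself counts as expired."""
    return now >= job_deadline(start_time, active_deadline_seconds)


start = datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc)
deadline = job_deadline(start, 3600)
print(deadline.isoformat())
```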

RunType

How the training program should be started. Exactly one of the 3 choices should be set.

Appears in: - Config

Field Description
python string array Start the training program using a Python script.
module string array Start the training program using a Python module.
exec string array Start the training program using an executable program.
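
The "exactly one of three" rule lends itself to a client-side check before submitting a job. The sketch below is a hypothetical validator, not the controller's own logic. It treats a missing key and an empty list as "not set", which is an assumption.

```python
# Hypothetical validator for a RunType object modeled as a dict.
RUN_CHOICES = ("python", "module", "exec")


def selected_run_choice(run: dict) -> str:
    """Return the single launch mechanism that is set, or raise ValueError."""
    # Assumption: a missing key or an empty list means "not set".
    chosen = [name for name in RUN_CHOICES if run.get(name)]
    if len(chosen) != 1:
        raise ValueError(
            f"exactly one of {RUN_CHOICES} must be set, got {chosen or 'none'}"
        )
    return chosen[0]


choice = selected_run_choice({"python": ["train.py", "--lr", "0.001"]})
print(choice)
```

Running such a check before submission reports a misconfigured run block early, instead of waiting for the job to be rejected or to fail.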

Worker

Worker defines the configurations for DeepSpeedJob worker replicas.

Appears in: - DeepSpeedJobSpec

Field Description
replicas integer The number of workers to launch.
template PodTemplateSpec Describes the pod that will be created for this replica. Note that RestartPolicy in PodTemplateSpec will always be set to Never as the job controller will decide if restarts are desired.
restartPolicy RestartPolicy Restart policy for all replicas owned by the job. One of Always, OnFailure, Never, or ExitCode. Defaults to OnFailure.
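
The defaulting of the worker-level restartPolicy can be sketched as follows. The function resolve_restart_policy is hypothetical; it only mirrors the allowed values and the default listed above. Note that this is the job-level policy, distinct from the RestartPolicy inside template, which is always set to Never.

```python
# Hypothetical helper mirroring the allowed values and default of
# Worker.restartPolicy as listed in the field table.
ALLOWED_RESTART_POLICIES = ("Always", "OnFailure", "Never", "ExitCode")
DEFAULT_RESTART_POLICY = "OnFailure"


def resolve_restart_policy(worker: dict) -> str:
    """Return the effective restart policy for a Worker modeled as a dict."""
    policy = worker.get("restartPolicy", DEFAULT_RESTART_POLICY)
    if policy not in ALLOWED_RESTART_POLICIES:
        raise ValueError(f"unsupported restartPolicy: {policy!r}")
    return policy


worker = {"replicas": 4}  # restartPolicy omitted, so the default applies
policy = resolve_restart_policy(worker)
print(policy)
```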