API Reference

Packages

Package v1beta1 contains API Schema definitions for the batch v1beta1 API group

ElasticConfig is the configuration governing the elastic scaling behavior of the job.

Appears in: PyTorchTrainingJobSpec

Field	Description
`enabled` boolean	Set to true to enable elastic training.
`minReplicas` integer	The minimum number of replicas with which this elastic job can run. The autoscaler cannot scale an elastic job down below this number. This value cannot be changed once the job is created.
`maxReplicas` integer	The maximum number of replicas with which this elastic job can run. The autoscaler cannot scale an elastic job up above this number. This value cannot be changed once the job is created.
`expectedReplicas` integer	Number of replicas to be created. It can be given an initial value at creation and later modified dynamically by an external entity, such as a user or an autoscaler, to scale the job up or down.
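
As a rough illustration of how the three replica counts relate, the following Python sketch checks an `elastic` block before submission. The helper name `check_elastic` is hypothetical, and the rule that `expectedReplicas` should lie between `minReplicas` and `maxReplicas` is an assumption inferred from the field descriptions above, not documented validation behavior of the controller.

```python
def check_elastic(elastic):
    """Return a list of problems found in an `elastic` block (hypothetical helper)."""
    problems = []
    if not elastic.get("enabled", False):
        return problems  # nothing to check when elastic training is off

    lo = elastic.get("minReplicas")
    hi = elastic.get("maxReplicas")
    expected = elastic.get("expectedReplicas")

    if lo is None or hi is None:
        problems.append("minReplicas and maxReplicas should both be set")
        return problems
    if lo > hi:
        problems.append("minReplicas is greater than maxReplicas")
    # Assumption: a sensible expectedReplicas stays within the autoscaler bounds.
    if expected is not None and not (lo <= expected <= hi):
        problems.append("expectedReplicas is outside [minReplicas, maxReplicas]")
    return problems


# Sample values are illustrative only.
elastic = {"enabled": True, "minReplicas": 2, "maxReplicas": 8, "expectedReplicas": 4}
print(check_elastic(elastic))  # prints []
```

Because `minReplicas` and `maxReplicas` cannot be changed after creation, a check like this is most useful before the job is submitted.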

PyTorchTrainingJob is the Schema for the pytorchtrainingjobs API.

Appears in: PyTorchTrainingJobList

Field	Description
`apiVersion` string	`batch.tensorstack.dev/v1beta1`
`kind` string	`PyTorchTrainingJob`
`metadata` ObjectMeta	Refer to Kubernetes API documentation for fields of `metadata`.
`spec` PyTorchTrainingJobSpec
`status` PyTorchTrainingJobStatus
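
The top-level fields can be assembled as a plain dictionary before being serialized and sent to the cluster. The sketch below is a minimal illustration; the helper name `new_job`, the job name, and the empty spec are placeholders, and only `apiVersion` and `kind` are taken from the table above.

```python
import json


def new_job(name, spec):
    """Wrap a spec dict in the top-level PyTorchTrainingJob fields (hypothetical helper)."""
    return {
        "apiVersion": "batch.tensorstack.dev/v1beta1",
        "kind": "PyTorchTrainingJob",
        "metadata": {"name": name},  # `name` is a standard Kubernetes ObjectMeta field
        "spec": spec,
        # `status` is left out: it holds observed state rather than desired state.
    }


# Placeholder name and spec, for illustration only.
job = new_job("example-job", {"replicaSpecs": []})
print(json.dumps(job, indent=2))
```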

PyTorchTrainingJobList contains a list of PyTorchTrainingJob objects.

Field	Description
`apiVersion` string	`batch.tensorstack.dev/v1beta1`
`kind` string	`PyTorchTrainingJobList`
`metadata` ListMeta	Refer to Kubernetes API documentation for fields of `metadata`.
`items` PyTorchTrainingJob array

PyTorchTrainingJobSpec outlines the intended configuration and execution parameters for a PyTorchTrainingJob.

Appears in: PyTorchTrainingJob

Field	Description
`replicaSpecs` ReplicaSpec array	An array of ReplicaSpec. Specifies the PyTorch cluster configuration.
`elastic` ElasticConfig	Configuration for launching elastic training. Elastic training takes effect only in torchrun mode.
`torchrunConfig` TorchrunConfig	Whether and how to use torchrun to launch a training process.
`runMode` RunMode	Job’s execution behavior. If omitted, defaults to `Immediate` mode, and tasks are executed immediately upon submission.
`runPolicy` RunPolicy	Execution policy configurations governing the behavior of a PyTorchTrainingJob.
`tensorboardSpec` TensorBoardSpec	If specified, the controller creates a TensorBoard instance for showing training logs.
`scheduler` SchedulePolicy	Identifies the preferred scheduler for allocating resources to replicas. Defaults to the cluster's default scheduler.
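
Since elastic training takes effect only in torchrun mode, it can be worth checking that the two blocks agree. The following sketch assumes that "torchrun mode" means `torchrunConfig.enabled` is true; that reading, the helper name `elastic_is_effective`, and the sample values are assumptions rather than documented behavior.

```python
def elastic_is_effective(spec):
    """True if elastic training is enabled and torchrun is enabled as well."""
    elastic_on = bool(spec.get("elastic", {}).get("enabled", False))
    # Assumption: torchrun mode corresponds to torchrunConfig.enabled == true.
    torchrun_on = bool(spec.get("torchrunConfig", {}).get("enabled", False))
    return elastic_on and torchrun_on


# Illustrative spec fragment: elastic is requested but torchrun is off.
spec = {
    "elastic": {"enabled": True, "minReplicas": 1, "maxReplicas": 4, "expectedReplicas": 2},
    "torchrunConfig": {"enabled": False},
}

if spec["elastic"]["enabled"] and not elastic_is_effective(spec):
    print("elastic is enabled but torchrunConfig.enabled is false")
```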

PyTorchTrainingJobStatus defines the observed state of PyTorchTrainingJob.

Appears in: PyTorchTrainingJob

Field	Description
`tasks` Tasks array	The status details of individual tasks.
`tensorboard` DependentStatus	The status of the TensorBoard.
`backoffCount` integer	The number of restarts performed so far.
`aggregate` Aggregate	The number of tasks in each state.
`conditions` JobCondition array	The latest available observations of an object’s current state.
`phase` JobPhase	Provides a simple, high-level summary of where the Job is in its lifecycle. Note that this is NOT intended to be a comprehensive state machine.
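
A client that has already fetched the job object can read these fields directly. The sketch below is illustrative only: the helper name `summarize` is hypothetical, and the sample phase value is a placeholder because the set of `JobPhase` values is not listed here.

```python
def summarize(status):
    """Build a one-line summary from a status dict (hypothetical helper)."""
    phase = status.get("phase", "<unknown>")  # high-level summary, not a full state machine
    restarts = status.get("backoffCount", 0)
    return "phase={} restarts={}".format(phase, restarts)


# Placeholder values; real ones come from the job's `status` field.
status = {"phase": "Running", "backoffCount": 1}
print(summarize(status))  # prints phase=Running restarts=1
```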

ReplicaSpec is a description of the job replica.

Appears in: PyTorchTrainingJobSpec

Field	Description
`type` string	ReplicaType is the type of the replica.
`replicas` integer	The desired number of replicas of the current template. Defaults to 1.
`scalingWeight` integer	Scaling weight of the current replica used in elastic training.
`template` PodTemplateSpec	Describes the pod that will be created for this replica. Note that `RestartPolicy` in `PodTemplateSpec` will always be set to `Never` as the job controller will decide if restarts are desired.
`restartPolicy` RestartPolicy	Restart policy for all replicas within the job. One of `Always`, `OnFailure`, `Never`, or `ExitCode`.
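
Two behaviors in this table are easy to overlook: `replicas` defaults to 1, and any `RestartPolicy` set inside the pod template is replaced with `Never`. The sketch below mirrors both on the client side; the helper name `normalize_replica_spec`, the replica type `worker`, and the image name are hypothetical, and the code only imitates the documented behavior rather than reproducing the controller's logic.

```python
def normalize_replica_spec(rs):
    """Apply the documented default and report settings that will be overridden."""
    out = dict(rs)
    out.setdefault("replicas", 1)  # documented default

    notes = []
    pod_spec = out.get("template", {}).get("spec", {})
    # The job controller always sets the pod-level restart policy to Never.
    if pod_spec.get("restartPolicy", "Never") != "Never":
        notes.append("template.spec.restartPolicy will be overridden to Never")
    return out, notes


# Hypothetical replica type and image, for illustration only.
replica = {
    "type": "worker",
    "template": {
        "spec": {
            "restartPolicy": "Always",
            "containers": [{"name": "pytorch", "image": "example.com/train:latest"}],
        }
    },
}
normalized, notes = normalize_replica_spec(replica)
print(normalized["replicas"], notes)
```

Restart behavior for the job is therefore configured through the replica-level `restartPolicy` field, not through the pod template.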

TorchrunConfig describes how to launch PyTorch training with torchrun.

Appears in: PyTorchTrainingJobSpec

Field	Description
`enabled` boolean	Set to true to launch PyTorch training with torchrun.
`maxRestarts` integer
`procPerNode` string	Number of processes to be started on every replica.
`rdzvBackend` string	Communication backend used for the group. Defaults to `c10d`.
`extraOptions` string array	Extra options for torchrun.
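
To make the fields more concrete, the sketch below shows one plausible way they could translate into a torchrun command line. The mapping is an assumption: how the controller actually builds the command is not stated here, and the helper name `torchrun_args` and the script name `train.py` are hypothetical. The flags `--nproc_per_node`, `--max_restarts`, `--rdzv_backend`, and `--monitor_interval` are existing torchrun options.

```python
def torchrun_args(cfg, script):
    """Translate a torchrunConfig dict into an argument list (assumed mapping)."""
    if not cfg.get("enabled", False):
        return ["python", script]  # assumption: plain launch when torchrun is off

    args = ["torchrun"]
    if "procPerNode" in cfg:
        # A string in the API; torchrun accepts an integer or values such as "gpu".
        args.append("--nproc_per_node={}".format(cfg["procPerNode"]))
    if "maxRestarts" in cfg:
        args.append("--max_restarts={}".format(cfg["maxRestarts"]))
    args.append("--rdzv_backend={}".format(cfg.get("rdzvBackend", "c10d")))  # documented default
    args.extend(cfg.get("extraOptions", []))
    args.append(script)
    return args


# Illustrative configuration values.
cfg = {
    "enabled": True,
    "procPerNode": "2",
    "maxRestarts": 3,
    "extraOptions": ["--monitor_interval=5"],
}
print(" ".join(torchrun_args(cfg, "train.py")))
# prints torchrun --nproc_per_node=2 --max_restarts=3 --rdzv_backend=c10d --monitor_interval=5 train.py
```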