API Reference

Packages

batch.tensorstack.dev/v1beta1

Package v1beta1 contains API Schema definitions for the batch v1beta1 API group.

Resource Types

ElasticConfig

Configuration governing the elastic scaling behavior of the job.

Appears in: PyTorchTrainingJobSpec

| Field | Description |
| --- | --- |
| enabled boolean | Set to true to enable elastic training. |
| minReplicas integer | The minimum number of replicas for this elastic job. The autoscaler cannot scale the job below this number. This value cannot be changed once the job is created. |
| maxReplicas integer | The maximum number of replicas for this elastic job. The autoscaler cannot scale the job above this number. This value cannot be changed once the job is created. |
| expectedReplicas integer | The number of replicas to be created. An initial value can be set at creation, and an external entity such as a user or an autoscaler can modify it dynamically to scale the job up or down. |
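
As a rough illustration of how these fields fit together, the fragment below sketches an elastic block as it might appear under the spec of a PyTorchTrainingJob. The replica counts are made-up values, chosen so that expectedReplicas sits between minReplicas and maxReplicas; this reference does not say how a value outside that range is handled.

```yaml
# Hypothetical values, for illustration only.
elastic:
  enabled: true
  minReplicas: 2        # lower bound; cannot be changed after the job is created
  maxReplicas: 8        # upper bound; cannot be changed after the job is created
  expectedReplicas: 4   # initial size; a user or an autoscaler may update it later
```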

PyTorchTrainingJob

PyTorchTrainingJob is the Schema for the pytorchtrainingjobs API.

Appears in: PyTorchTrainingJobList

| Field | Description |
| --- | --- |
| apiVersion string | batch.tensorstack.dev/v1beta1 |
| kind string | PyTorchTrainingJob |
| metadata ObjectMeta | Refer to Kubernetes API documentation for fields of metadata. |
| spec PyTorchTrainingJobSpec | |
| status PyTorchTrainingJobStatus | |
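
A minimal manifest built from these fields might look like the sketch below. The job name, replica type, container name, image, and command are all placeholders and are not taken from this reference; in particular, the values accepted for type are not listed here. The status field is left out because it describes observed state rather than desired configuration.

```yaml
apiVersion: batch.tensorstack.dev/v1beta1
kind: PyTorchTrainingJob
metadata:
  name: example-job                    # hypothetical name
spec:
  replicaSpecs:
    - type: worker                     # assumed replica type
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch            # hypothetical container name
              image: example.com/training:latest   # placeholder image
              command: ["python", "train.py"]      # placeholder entrypoint
```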

PyTorchTrainingJobList

PyTorchTrainingJobList contains a list of PyTorchTrainingJob objects.

| Field | Description |
| --- | --- |
| apiVersion string | batch.tensorstack.dev/v1beta1 |
| kind string | PyTorchTrainingJobList |
| metadata ListMeta | Refer to Kubernetes API documentation for fields of metadata. |
| items PyTorchTrainingJob array | |

PyTorchTrainingJobSpec

PyTorchTrainingJobSpec outlines the intended configuration and execution parameters for a PyTorchTrainingJob.

Appears in: PyTorchTrainingJob

| Field | Description |
| --- | --- |
| replicaSpecs ReplicaSpec array | An array of ReplicaSpec. Specifies the PyTorch cluster configuration. |
| elastic ElasticConfig | Configuration for launching elastic training. Elastic training is effective only in torchrun mode. |
| torchrunConfig TorchrunConfig | Whether and how to use torchrun to launch a training process. |
| runMode RunMode | The job's execution behavior. If omitted, defaults to Immediate mode, in which tasks are executed immediately upon submission. |
| runPolicy RunPolicy | Execution policy configuration governing the behavior of a PyTorchTrainingJob. |
| tensorboardSpec TensorBoardSpec | If specified, the controller creates a TensorBoard for showing training logs. |
| scheduler SchedulePolicy | Identifies the preferred scheduler for allocating resources to replicas. Defaults to the cluster's default scheduler. |
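
Because elastic training is effective only in torchrun mode, a spec that wants elastic scaling would presumably enable both blocks. The fragment below is a sketch of that pairing only; the remaining spec fields are omitted, and the replica counts are illustrative rather than recommended.

```yaml
# Fragment of a PyTorchTrainingJob spec; replicaSpecs and other fields omitted.
spec:
  torchrunConfig:
    enabled: true         # elastic training takes effect only in torchrun mode
  elastic:
    enabled: true
    minReplicas: 1        # illustrative value
    maxReplicas: 4        # illustrative value
    expectedReplicas: 2   # illustrative value
```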

PyTorchTrainingJobStatus

PyTorchTrainingJobStatus defines the observed state of PyTorchTrainingJob.

Appears in: PyTorchTrainingJob

| Field | Description |
| --- | --- |
| tasks Tasks array | The status details of individual tasks. |
| tensorboard DependentStatus | The status of the TensorBoard. |
| backoffCount integer | The number of restarts that have been performed. |
| aggregate Aggregate | The number of tasks in each state. |
| conditions JobCondition array | The latest available observations of an object's current state. |
| phase JobPhase | Provides a simple, high-level summary of where the Job is in its lifecycle. Note that this is NOT intended to be a comprehensive state machine. |

ReplicaSpec

ReplicaSpec is a description of the job replica.

Appears in: PyTorchTrainingJobSpec

| Field | Description |
| --- | --- |
| type string | ReplicaType is the type of the replica. |
| replicas integer | The desired number of replicas of the current template. Defaults to 1. |
| scalingWeight integer | Scaling weight of the current replica used in elastic training. |
| template PodTemplateSpec | Describes the pod that will be created for this replica. Note that RestartPolicy in PodTemplateSpec will always be set to Never as the job controller will decide if restarts are desired. |
| restartPolicy RestartPolicy | Restart policy for all replicas within the job. One of Always, OnFailure, Never, or ExitCode. |

TorchrunConfig

Describes how to launch PyTorch training with torchrun.

Appears in: PyTorchTrainingJobSpec

| Field | Description |
| --- | --- |
| enabled boolean | Set to true to launch PyTorch training with torchrun. |
| maxRestarts integer | |
| procPerNode string | Number of processes to be started on every replica. |
| rdzvBackend string | Communication backend used for the group. Defaults to c10d. |
| extraOptions string array | Extra options for torchrun. |
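
The fragment below sketches a torchrunConfig block that sets every field. All values are illustrative. procPerNode is typed as a string, so it is quoted here. This reference gives no description for maxRestarts, and it does not specify how each extraOptions entry is passed to torchrun, so both usages are assumptions.

```yaml
# Hypothetical values, for illustration only.
torchrunConfig:
  enabled: true
  maxRestarts: 3            # undocumented here; assumed to mirror torchrun's --max-restarts
  procPerNode: "2"          # a string, not an integer
  rdzvBackend: c10d         # the documented default, written out explicitly
  extraOptions:
    - "--log-dir=/tmp/torchrun-logs"   # assumed to be forwarded to torchrun unchanged
```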