API Reference
Packages
batch.tensorstack.dev/v1beta1
Package v1beta1 contains API Schema definitions for the batch v1beta1 API group.
Resource Types
ElasticConfig
Configuration governing the elastic scaling behavior of the job.
Appears in: PyTorchTrainingJobSpec
| Field | Description | 
|---|---|
| enabled boolean | Set to true to use elastic training. |
| minReplicas integer | The minimum number of replicas required to run this elastic job. The autoscaler cannot scale an elastic job down below this number. This value cannot be changed once the job is created. |
| maxReplicas integer | The maximum number of replicas allowed to run this elastic job. The autoscaler cannot scale an elastic job up above this number. This value cannot be changed once the job is created. |
| expectedReplicas integer | The number of replicas to be created. An initial value can be set upon creation. An external entity, such as a user or an autoscaler, can modify this value dynamically to scale the job up or down. |
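
The relationship between the three replica fields can be sketched as a small client-side check. This is an illustration only: the rule that expectedReplicas should lie between minReplicas and maxReplicas is inferred from the field descriptions above, and the helper name `validate_elastic` is hypothetical, not part of the API.

```python
def validate_elastic(cfg: dict) -> list[str]:
    """Return a list of problems found in an ElasticConfig-shaped dict.

    Assumption: expectedReplicas is meant to stay within
    [minReplicas, maxReplicas]; the controller's real validation may differ.
    """
    problems = []
    lo = cfg.get("minReplicas")
    hi = cfg.get("maxReplicas")
    expected = cfg.get("expectedReplicas")

    if lo is not None and hi is not None and lo > hi:
        problems.append("minReplicas must not exceed maxReplicas")
    if expected is not None:
        if lo is not None and expected < lo:
            problems.append("expectedReplicas is below minReplicas")
        if hi is not None and expected > hi:
            problems.append("expectedReplicas is above maxReplicas")
    return problems


# Hypothetical values chosen for illustration.
elastic = {"enabled": True, "minReplicas": 2, "maxReplicas": 8, "expectedReplicas": 4}
print(validate_elastic(elastic))
```

Running a check like this before submitting or patching a job may catch out-of-range values early, since minReplicas and maxReplicas cannot be changed after creation.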
PyTorchTrainingJob
PyTorchTrainingJob is the Schema for the pytorchtrainingjobs API.
Appears in: PyTorchTrainingJobList
| Field | Description | 
|---|---|
| apiVersion string | batch.tensorstack.dev/v1beta1 |
| kind string | PyTorchTrainingJob |
| metadata ObjectMeta | Refer to Kubernetes API documentation for fields of metadata. |
| spec PyTorchTrainingJobSpec | |
| status PyTorchTrainingJobStatus | |
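
As a rough illustration of how these top-level fields fit together, the sketch below builds a minimal manifest as a Python dict and prints it as JSON. The job name, replica type, container name, image, and command are all placeholders invented for the example; only apiVersion, kind, and the field names come from this reference.

```python
import json

# A minimal PyTorchTrainingJob-shaped manifest.
# Everything marked "hypothetical" is an assumption for illustration.
manifest = {
    "apiVersion": "batch.tensorstack.dev/v1beta1",
    "kind": "PyTorchTrainingJob",
    "metadata": {"name": "example-job"},  # hypothetical name
    "spec": {
        "replicaSpecs": [
            {
                "type": "worker",  # hypothetical replica type
                "replicas": 2,
                "template": {  # a standard Kubernetes PodTemplateSpec
                    "spec": {
                        "containers": [
                            {
                                "name": "pytorch",  # hypothetical
                                "image": "example.com/train:latest",  # hypothetical
                                "command": ["python", "train.py"],  # hypothetical
                            }
                        ]
                    }
                },
            }
        ]
    },
}

print(json.dumps(manifest, indent=2))
```

The status field is omitted because it describes observed state and is normally filled in by the controller rather than by the submitter.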
PyTorchTrainingJobList
PyTorchTrainingJobList contains a list of PyTorchTrainingJob objects.
| Field | Description | 
|---|---|
| apiVersion string | batch.tensorstack.dev/v1beta1 |
| kind string | PyTorchTrainingJobList |
| metadata ListMeta | Refer to Kubernetes API documentation for fields of metadata. |
| items PyTorchTrainingJob array | |
PyTorchTrainingJobSpec
PyTorchTrainingJobSpec outlines the intended configuration and execution parameters for a PyTorchTrainingJob.
Appears in: PyTorchTrainingJob
| Field | Description | 
|---|---|
| replicaSpecs ReplicaSpec array | An array of ReplicaSpec. Specifies the PyTorch cluster configuration. |
| elastic ElasticConfig | Configuration for launching elastic training. Elastic training takes effect only in torchrun mode. |
| torchrunConfig TorchrunConfig | Whether and how to use torchrun to launch a training process. |
| runMode RunMode | Job’s execution behavior. If omitted, defaults to Immediate mode, and tasks are executed immediately upon submission. |
| runPolicy RunPolicy | Execution policy configurations governing the behavior of a PyTorchTrainingJob. |
| tensorboardSpec TensorBoardSpec | If specified, the controller creates a TensorBoard for showing training logs. |
| scheduler SchedulePolicy | Identifies the preferred scheduler for allocating resources to replicas. Defaults to the cluster's default scheduler. |
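
Because elastic training takes effect only in torchrun mode, a spec that enables elastic without enabling torchrunConfig is likely a mistake. The following sketch shows one way to check for that before submission; the function name `elastic_is_effective` is hypothetical, and treating a missing `enabled` field as false is an assumption.

```python
def elastic_is_effective(spec: dict) -> bool:
    """Report whether elastic training would take effect for this spec.

    Assumption: a missing or null `enabled` field counts as false.
    """
    elastic = spec.get("elastic") or {}
    torchrun = spec.get("torchrunConfig") or {}
    return bool(elastic.get("enabled")) and bool(torchrun.get("enabled"))


# Hypothetical spec fragments for illustration.
with_torchrun = {
    "elastic": {"enabled": True, "minReplicas": 2, "maxReplicas": 8, "expectedReplicas": 4},
    "torchrunConfig": {"enabled": True},
}
without_torchrun = {"elastic": {"enabled": True}}

print(elastic_is_effective(with_torchrun))
print(elastic_is_effective(without_torchrun))
```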
PyTorchTrainingJobStatus
PyTorchTrainingJobStatus defines the observed state of PyTorchTrainingJob.
Appears in: PyTorchTrainingJob
| Field | Description | 
|---|---|
| tasks Tasks array | The status details of individual tasks. |
| tensorboard DependentStatus | The status of the TensorBoard. |
| backoffCount integer | The number of restarts that have been performed. |
| aggregate Aggregate | The number of tasks in each state. |
| conditions JobCondition array | The latest available observations of an object’s current state. |
| phase JobPhase | Provides a simple, high-level summary of where the Job is in its lifecycle. Note that this is NOT intended to be a comprehensive state machine. |
ReplicaSpec
ReplicaSpec is a description of the job replica.
Appears in: PyTorchTrainingJobSpec
| Field | Description | 
|---|---|
| type string | ReplicaType is the type of the replica. |
| replicas integer | The desired number of replicas of the current template. Defaults to 1. |
| scalingWeight integer | Scaling weight of the current replica used in elastic training. |
| template PodTemplateSpec | Describes the pod that will be created for this replica. Note that RestartPolicy in PodTemplateSpec will always be set to Never, as the job controller decides whether restarts are desired. |
| restartPolicy RestartPolicy | Restart policy for all replicas within the job. One of Always, OnFailure, Never, or ExitCode. |
TorchrunConfig
Describes how to launch PyTorch training with torchrun.
Appears in: PyTorchTrainingJobSpec
| Field | Description | 
|---|---|
| enabled boolean | Set to true to launch PyTorch training with torchrun. |
| maxRestarts integer | |
| procPerNode string | Number of processes to be started on every replica. |
| rdzvBackend string | Communication backend used for the group. Defaults to c10d. |
| extraOptions string array | Extra options for torchrun. |
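
To make the fields more concrete, the sketch below shows how a TorchrunConfig might translate into a torchrun command line. The mapping is an assumption based on the similarly named torchrun flags (`--nproc_per_node`, `--max_restarts`, `--rdzv_backend`); how the controller actually assembles the command is not documented here, and the helper `torchrun_args` and its `nnodes` parameter are hypothetical.

```python
def torchrun_args(cfg: dict, nnodes: str) -> list[str]:
    """Build a torchrun argument list from a TorchrunConfig-shaped dict.

    Assumption: each field maps onto the torchrun flag of the same meaning.
    `nnodes` is passed separately, e.g. "2" or "2:8" for an elastic range.
    """
    args = ["torchrun", f"--nnodes={nnodes}"]
    if "procPerNode" in cfg:
        args.append(f"--nproc_per_node={cfg['procPerNode']}")
    if "maxRestarts" in cfg:
        args.append(f"--max_restarts={cfg['maxRestarts']}")
    # rdzvBackend defaults to c10d according to the table above.
    args.append(f"--rdzv_backend={cfg.get('rdzvBackend', 'c10d')}")
    # extraOptions are passed through unchanged.
    args.extend(cfg.get("extraOptions", []))
    return args


# Hypothetical configuration for illustration.
cfg = {
    "enabled": True,
    "procPerNode": "4",
    "maxRestarts": 3,
    "extraOptions": ["--monitor_interval=5"],
}
print(" ".join(torchrun_args(cfg, nnodes="2:8")))
```

Note that procPerNode is a string rather than an integer, which is consistent with torchrun accepting non-numeric values such as `gpu` or `auto` for `--nproc_per_node`.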