API Reference
Packages
batch.tensorstack.dev/v1beta1
Package v1beta1 contains API Schema definitions for the batch v1beta1 API group
Resource Types
ElasticConfig
Configuration governing the elastic scaling behavior of the job.
Appears in:
Field | Description |
---|---|
enabled boolean | Set to true to enable elastic training. |
minReplicas integer | The minimum number of replicas for this elastic job. The autoscaler cannot scale an elastic job below this number. This value cannot be changed once the job is created. |
maxReplicas integer | The maximum number of replicas for this elastic job. The autoscaler cannot scale an elastic job above this number. This value cannot be changed once the job is created. |
expectedReplicas integer | Number of replicas to be created. This number can be set to an initial value upon creation. This value can be modified dynamically by an external entity, such as a user or an autoscaler, to scale the job up or down. |
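As a minimal sketch of how these fields fit together, the following Python snippet builds an `elastic` block as a plain dictionary and checks it on the client side. The helper name `make_elastic_config` is hypothetical. The rule that `expectedReplicas` must lie between `minReplicas` and `maxReplicas` is an assumption inferred from the autoscaler limits described above, not a documented validation rule.

```python
def make_elastic_config(min_replicas, max_replicas, expected_replicas):
    """Build an ElasticConfig-shaped dict (hypothetical helper, not part of the API)."""
    # Assumption: the initial replica count should stay within the autoscaler bounds.
    if not (min_replicas <= expected_replicas <= max_replicas):
        raise ValueError(
            f"expectedReplicas={expected_replicas} is outside "
            f"[{min_replicas}, {max_replicas}]"
        )
    return {
        "enabled": True,
        "minReplicas": min_replicas,      # cannot be changed after creation
        "maxReplicas": max_replicas,      # cannot be changed after creation
        "expectedReplicas": expected_replicas,  # may be modified later to scale
    }


elastic = make_elastic_config(min_replicas=2, max_replicas=8, expected_replicas=4)
```

Because `minReplicas` and `maxReplicas` are fixed at creation, checking them before submission can avoid having to recreate the job.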
PyTorchTrainingJob
PyTorchTrainingJob is the Schema for the pytorchtrainingjobs API.
Appears in:
Field | Description |
---|---|
apiVersion string | batch.tensorstack.dev/v1beta1 |
kind string | PyTorchTrainingJob |
metadata ObjectMeta | Refer to Kubernetes API documentation for fields of metadata . |
spec PyTorchTrainingJobSpec | |
status PyTorchTrainingJobStatus | |
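The top-level fields above can be sketched as a Python dictionary serialized to JSON, which Kubernetes accepts as a manifest format alongside YAML. The job name `example-job` is a hypothetical placeholder, and `spec` is left empty here; this is an illustration of the object shape, not a complete, submittable job.

```python
import json

job = {
    "apiVersion": "batch.tensorstack.dev/v1beta1",
    "kind": "PyTorchTrainingJob",
    "metadata": {"name": "example-job"},  # hypothetical name
    "spec": {},  # PyTorchTrainingJobSpec fields go here
    # "status" is omitted: it describes observed state rather than desired state.
}

manifest = json.dumps(job, indent=2)
print(manifest)
```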
PyTorchTrainingJobList
PyTorchTrainingJobList contains a list of PyTorchTrainingJob objects.
Field | Description |
---|---|
apiVersion string | batch.tensorstack.dev/v1beta1 |
kind string | PyTorchTrainingJobList |
metadata ListMeta | Refer to Kubernetes API documentation for fields of metadata . |
items PyTorchTrainingJob array | |
PyTorchTrainingJobSpec
PyTorchTrainingJobSpec outlines the intended configuration and execution parameters for a PyTorchTrainingJob.
Appears in:
Field | Description |
---|---|
replicaSpecs ReplicaSpec array | An array of ReplicaSpec. Specifies the PyTorch cluster configuration. |
elastic ElasticConfig | Configuration for launching elastic training. Elastic training is effective only in torchrun mode. |
torchrunConfig TorchrunConfig | Whether and how to use torchrun to launch a training process. |
runMode RunMode | Job's execution behavior. If omitted, defaults to Immediate mode, and tasks are executed immediately upon submission. |
runPolicy RunPolicy | Execution policy configurations governing the behavior of a PyTorchTrainingJob. |
tensorboardSpec TensorBoardSpec | If specified, the controller will create a TensorBoard for showing training logs. |
scheduler SchedulePolicy | Identifies the preferred scheduler for allocating resources to replicas. Defaults to cluster default scheduler. |
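The `elastic` description notes that elastic training is effective only in torchrun mode. A client-side check for that condition might look like the sketch below. The function name `elastic_is_effective` is hypothetical, and treating a missing `enabled` field as false is an assumption.

```python
def elastic_is_effective(spec):
    """Return True only when both elastic training and torchrun are enabled.

    Hypothetical helper: it mirrors the documented note that elastic
    training takes effect only in torchrun mode.
    """
    # Assumption: an absent block or absent `enabled` field means "disabled".
    elastic_on = spec.get("elastic", {}).get("enabled", False)
    torchrun_on = spec.get("torchrunConfig", {}).get("enabled", False)
    return elastic_on and torchrun_on


spec_ok = {
    "elastic": {"enabled": True, "minReplicas": 1, "maxReplicas": 4, "expectedReplicas": 2},
    "torchrunConfig": {"enabled": True},
}
spec_ignored = {"elastic": {"enabled": True}}  # torchrunConfig missing

print(elastic_is_effective(spec_ok))       # True
print(elastic_is_effective(spec_ignored))  # False
```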
PyTorchTrainingJobStatus
PyTorchTrainingJobStatus defines the observed state of PyTorchTrainingJob.
Appears in:
Field | Description |
---|---|
tasks Tasks array | The status details of individual tasks. |
tensorboard DependentStatus | The status of the TensorBoard. |
backoffCount integer | The number of restarts that have been performed. |
aggregate Aggregate | The number of tasks in each state. |
conditions JobCondition array | The latest available observations of an object's current state. |
phase JobPhase | Provides a simple, high-level summary of where the Job is in its lifecycle. Note that this is NOT intended to be a comprehensive state machine. |
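As a rough illustration of reading these fields, the sketch below summarizes a status dictionary such as one parsed from the JSON form of a job. The helper `summarize_status` is hypothetical, the phase value `Running` in the sample is an assumed example rather than a documented JobPhase value, and `Unknown` is only a local fallback string.

```python
def summarize_status(status):
    """Produce a one-line summary from a PyTorchTrainingJobStatus-shaped dict."""
    phase = status.get("phase", "Unknown")      # local fallback, not an API value
    restarts = status.get("backoffCount", 0)    # number of restarts performed
    task_count = len(status.get("tasks", []))   # entries in the tasks array
    return f"phase={phase} restarts={restarts} tasks={task_count}"


# Sample data for illustration only; real task entries carry more detail.
sample_status = {"phase": "Running", "backoffCount": 1, "tasks": [{}, {}]}
summary = summarize_status(sample_status)
print(summary)
```

Since `phase` is described as a high-level summary and not a comprehensive state machine, a tool that needs precise state would likely inspect `conditions` as well.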
ReplicaSpec
ReplicaSpec is a description of the job replica.
Appears in:
Field | Description |
---|---|
type string | ReplicaType is the type of the replica. |
replicas integer | The desired number of replicas of the current template. Defaults to 1. |
scalingWeight integer | Scaling weight of the current replica used in elastic training. |
template PodTemplateSpec | Describes the pod that will be created for this replica. Note that RestartPolicy in PodTemplateSpec will always be set to Never as the job controller will decide if restarts are desired. |
restartPolicy RestartPolicy | Restart policy for all replicas within the job. One of Always , OnFailure , Never , or ExitCode . |
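A single `replicaSpecs` entry could be assembled as in the following sketch. The helper `make_replica_spec`, the replica type `worker`, the container name, and the image are all hypothetical placeholders. The snippet also assumes that `restartPolicy` serializes as a plain string.

```python
ALLOWED_RESTART_POLICIES = {"Always", "OnFailure", "Never", "ExitCode"}


def make_replica_spec(replica_type, image, replicas=1, restart_policy="OnFailure"):
    """Build a ReplicaSpec-shaped dict (hypothetical helper)."""
    if restart_policy not in ALLOWED_RESTART_POLICIES:
        raise ValueError(f"unsupported restartPolicy: {restart_policy!r}")
    return {
        "type": replica_type,
        "replicas": replicas,  # documented default is 1
        "restartPolicy": restart_policy,
        "template": {
            # No pod-level restartPolicy is set here: per the table above,
            # it is always set to Never and the job controller decides on restarts.
            "spec": {"containers": [{"name": "pytorch", "image": image}]},
        },
    }


replica = make_replica_spec("worker", "example.com/train:latest", replicas=4)
```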
TorchrunConfig
Describes how to launch PyTorch training with torchrun.
Appears in:
Field | Description |
---|---|
enabled boolean | Set to true to use torchrun to launch PyTorch training. |
maxRestarts integer | |
procPerNode string | Number of processes to be started on every replica. |
rdzvBackend string | Communication backend used for the group. Defaults to c10d . |
extraOptions string array | Extra options for torchrun. |
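To give a feel for what these fields express, the sketch below maps a TorchrunConfig-shaped dictionary onto the similarly named `torchrun` command-line options (`--nproc_per_node`, `--max_restarts`, `--rdzv_backend`). This reference does not specify how the controller actually builds the launch command, so the mapping and the helper name `torchrun_args` are assumptions made for illustration.

```python
def torchrun_args(config):
    """Translate a TorchrunConfig-shaped dict into torchrun arguments.

    Assumed mapping for illustration; the controller's real behavior
    is not described in this reference.
    """
    if not config.get("enabled", False):
        return []  # torchrun is not used
    args = ["torchrun"]
    if "procPerNode" in config:
        # procPerNode is a string in the API; torchrun accepts an integer
        # or a keyword such as "gpu" for this option.
        args.append(f"--nproc_per_node={config['procPerNode']}")
    if "maxRestarts" in config:
        args.append(f"--max_restarts={config['maxRestarts']}")
    # rdzvBackend defaults to c10d according to the table above.
    args.append(f"--rdzv_backend={config.get('rdzvBackend', 'c10d')}")
    args.extend(config.get("extraOptions", []))
    return args


example_args = torchrun_args({"enabled": True, "procPerNode": "2", "maxRestarts": 3})
print(example_args)
```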