API Reference
Packages
batch.tensorstack.dev/v1beta1
Package v1beta1 contains API Schema definitions for the batch v1beta1 API group
Resource Types
ColossalAIJob
ColossalAIJob is the Schema for the colossalaijobs API
Appears in: ColossalAIJobList
Field | Description |
---|---|
apiVersion string | batch.tensorstack.dev/v1beta1 |
kind string | ColossalAIJob |
metadata ObjectMeta | Refer to Kubernetes API documentation for fields of metadata . |
spec ColossalAIJobSpec | |
status ColossalAIJobStatus | |
ColossalAIJobList
ColossalAIJobList contains a list of ColossalAIJob.
Field | Description |
---|---|
apiVersion string | batch.tensorstack.dev/v1beta1 |
kind string | ColossalAIJobList |
metadata ListMeta | Refer to Kubernetes API documentation for fields of metadata . |
items ColossalAIJob array | |
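As a small illustration of the list shape, the sketch below walks a ColossalAIJobList that has already been decoded into a Python dict (for example from JSON) and collects each item's name. The helper name `job_names` is hypothetical; `metadata.name` is the standard Kubernetes ObjectMeta field.

```python
def job_names(job_list):
    """Return metadata.name for every ColossalAIJob in a decoded ColossalAIJobList.

    `job_list` is assumed to be a plain dict with the fields from the table
    above (apiVersion, kind, metadata, items).
    """
    if job_list.get("kind") != "ColossalAIJobList":
        raise ValueError("expected an object of kind ColossalAIJobList")
    # Items without a name yield None rather than raising.
    return [item.get("metadata", {}).get("name") for item in job_list.get("items", [])]
```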
ColossalAIJobSpec
ColossalAIJobSpec defines the configurations of a ColossalAI training job.
Appears in: ColossalAIJob
Field | Description |
---|---|
ssh SSHConfig | SSH configs. |
runMode RunMode | The desired running mode of the job, defaults to Immediate . |
runPolicy RunPolicy | Controls the handling of completed replicas and other related processes. |
scheduler SchedulePolicy | Specifies the scheduler from which to request resources. Defaults to the cluster's default scheduler. |
torchConfig TorchConfig | Describes how to start the ColossalAI job. |
replicaSpecs ReplicaSpec array | List of replica specs belonging to the job. There must be at least one replica defined for a Job. |
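To make the spec fields concrete, here is a minimal sketch that assembles a ColossalAIJob manifest as a Python dict. Only the field names and the documented defaults come from this reference. The helper name, the container name, and in particular the replica type value `worker` are assumptions, since the allowed ReplicaType values are not listed here. `runMode` and `scheduler` are omitted so that their documented defaults apply.

```python
def build_colossalai_job(name, image, script, workers=1, proc_per_worker=1):
    """Return a ColossalAIJob manifest as a plain dict (illustrative only)."""
    if workers < 1:
        # The reference requires at least one replica to be defined for a job.
        raise ValueError("at least one replica must be defined")
    return {
        "apiVersion": "batch.tensorstack.dev/v1beta1",
        "kind": "ColossalAIJob",
        "metadata": {"name": name},
        "spec": {
            "ssh": {"authMountPath": "/root/.ssh"},   # documented default
            "runPolicy": {"cleanUpWorkers": False},   # documented default
            "torchConfig": {
                "procPerWorker": proc_per_worker,
                "script": list(script),               # command used to start the workers
                "extraArgs": [],                      # extra torchrun arguments, none here
            },
            "replicaSpecs": [
                {
                    "type": "worker",                 # ASSUMPTION: value not given in this reference
                    "replicas": workers,
                    "restartPolicy": {"policy": "OnFailure", "limit": 0},
                    # Standard Kubernetes PodTemplateSpec; container name is a placeholder.
                    "template": {
                        "spec": {"containers": [{"name": "main", "image": image}]}
                    },
                }
            ],
        },
    }
```

The resulting dict can be serialized with `json.dumps` or a YAML library before being submitted to the cluster.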
ColossalAIJobStatus
ColossalAIJobStatus describes the observed state of ColossalAIJob.
Appears in: ColossalAIJob
Field | Description |
---|---|
tasks Tasks array | The statuses of individual tasks. |
aggregate Aggregate | The number of replicas in each phase. |
phase JobPhase | Provides a simple, high-level summary of where the Job is in its lifecycle. Note that this is NOT intended to be a comprehensive state machine. |
conditions JobCondition array | The latest available observations of an object's current state. |
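A client reading the status might reduce it to a short summary, as in the sketch below. It relies only on the top-level field names from the table. The inner structure of Tasks, Aggregate, and JobCondition is not described in this reference, so those values are passed through or counted rather than interpreted. The helper name `summarize_status` is hypothetical.

```python
def summarize_status(job):
    """Summarize the status of a decoded ColossalAIJob dict.

    Returns the phase (None when no status has been recorded) together with
    the number of task entries and conditions.
    """
    status = job.get("status") or {}
    return {
        "phase": status.get("phase"),
        "tasks": len(status.get("tasks") or []),
        "conditions": len(status.get("conditions") or []),
    }
```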
ReplicaSpec
ReplicaSpec defines the desired state of replicas.
Appears in: ColossalAIJobSpec
Field | Description |
---|---|
type ReplicaType | Replica type. |
replicas integer | The desired number of replicas of this replica type. Defaults to 1. |
restartPolicy RestartPolicy | Restart policy for replicas of this replica type. One of Always, OnFailure, Never. Optional; defaults to OnFailure. |
template PodTemplateSpec | Defines the template used to create pods. |
ReplicaType
Underlying type: string
Appears in: ReplicaSpec
RestartPolicy
RestartPolicy describes how the replica should be restarted.
Appears in: ReplicaSpec
Field | Description |
---|---|
policy RestartPolicyType | The policy for restarting finished replicas. |
limit integer | The maximum number of restarts. Optional; defaults to 0. |
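The documented defaults can be illustrated with a small normalization helper. This is a sketch rather than the controller's actual defaulting logic: it fills in `OnFailure` and `0` when the fields are absent and accepts only the three policy names listed for ReplicaSpec. Rejecting negative limits is an assumption not stated in this reference, and the helper name is hypothetical.

```python
ALLOWED_POLICIES = ("Always", "OnFailure", "Never")


def normalize_restart_policy(restart_policy=None):
    """Return a RestartPolicy dict with the documented defaults applied."""
    given = dict(restart_policy or {})
    policy = given.get("policy", "OnFailure")  # documented default
    limit = given.get("limit", 0)              # documented default
    if policy not in ALLOWED_POLICIES:
        raise ValueError(f"policy must be one of {ALLOWED_POLICIES}, got {policy!r}")
    # ASSUMPTION: a negative restart limit is treated as invalid.
    if isinstance(limit, bool) or not isinstance(limit, int) or limit < 0:
        raise ValueError("limit must be a non-negative integer")
    return {"policy": policy, "limit": limit}
```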
RestartPolicyType
Underlying type: string
Appears in: RestartPolicy
RunPolicy
RunPolicy dictates specific actions to be taken by the controller upon job completion.
Appears in: ColossalAIJobSpec
Field | Description |
---|---|
cleanUpWorkers boolean | Defaults to false. |
SSHConfig
SSHConfig specifies various configurations for running the SSH daemon (sshd).
Appears in: ColossalAIJobSpec
Field | Description |
---|---|
authMountPath string | The directory where SSH keys are mounted. Defaults to "/root/.ssh". |
sshdPath string | The location of the sshd executable file. |
TorchConfig
TorchConfig describes how to start the ColossalAI job.
Appears in: ColossalAIJobSpec
Field | Description |
---|---|
procPerWorker integer | The number of processes per worker. Defaults to 1. |
script string array | Specifies the command used to start the workers. |
extraArgs string array | Additional arguments for torchrun. |
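One way to reason about `procPerWorker` is to estimate how many training processes a spec would launch in total. The sketch below multiplies the replica count by `procPerWorker`, applying the documented default of 1 for each. It assumes that every replica in `replicaSpecs` is a worker running exactly `procPerWorker` processes, which this reference does not state explicitly. The helper name is hypothetical.

```python
def total_processes(spec):
    """Estimate the total process count implied by a ColossalAIJobSpec dict."""
    proc_per_worker = (spec.get("torchConfig") or {}).get("procPerWorker", 1)
    # ASSUMPTION: all replica specs describe workers; replicas defaults to 1.
    replicas = sum(r.get("replicas", 1) for r in spec.get("replicaSpecs", []))
    return replicas * proc_per_worker
```

For example, two replicas with `procPerWorker` set to 4 would give 8 processes under this assumption.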