API Reference

Packages

batch.tensorstack.dev/v1beta1

Package v1beta1 contains API schema definitions for the batch v1beta1 API group.

Resource Types

ColossalAIJob

ColossalAIJob is the schema for the colossalaijobs API.

Appears in: ColossalAIJobList

| Field | Description |
| --- | --- |
| apiVersion string | batch.tensorstack.dev/v1beta1 |
| kind string | ColossalAIJob |
| metadata ObjectMeta | Refer to Kubernetes API documentation for fields of metadata. |
| spec ColossalAIJobSpec | |
| status ColossalAIJobStatus | |
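
To illustrate how these top-level fields fit together, the following Python sketch builds a minimal ColossalAIJob manifest as a plain dictionary and serializes it to JSON, a format that kubectl accepts alongside YAML. The job name, container image, script value, and the replica type "worker" are hypothetical placeholders and are not confirmed by this reference.

```python
import json

# Minimal ColossalAIJob manifest assembled from the documented fields.
# ASSUMPTIONS: the job name, image, script value, and the replica type
# "worker" are illustrative placeholders, not values from this reference.
manifest = {
    "apiVersion": "batch.tensorstack.dev/v1beta1",
    "kind": "ColossalAIJob",
    "metadata": {"name": "example-job"},
    "spec": {
        "torchConfig": {
            "procPerWorker": 1,
            "script": ["train.py"],
        },
        "replicaSpecs": [
            {
                "type": "worker",
                "replicas": 2,
                "template": {
                    "spec": {
                        "containers": [
                            {
                                "name": "main",
                                "image": "example.com/colossalai:latest",
                            }
                        ]
                    }
                },
            }
        ],
    },
}

# Serialize the manifest; the result can be written to a file and applied.
manifest_json = json.dumps(manifest, indent=2)
print(manifest_json)
```

Optional fields such as ssh, runMode, runPolicy, and scheduler are omitted here so that their documented defaults apply.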

ColossalAIJobList

ColossalAIJobList contains a list of ColossalAIJob.

| Field | Description |
| --- | --- |
| apiVersion string | batch.tensorstack.dev/v1beta1 |
| kind string | ColossalAIJobList |
| metadata ListMeta | Refer to Kubernetes API documentation for fields of metadata. |
| items ColossalAIJob array | |

ColossalAIJobSpec

ColossalAIJobSpec defines the configurations of a ColossalAI training job.

Appears in: ColossalAIJob

| Field | Description |
| --- | --- |
| ssh SSHConfig | SSH configs. |
| runMode RunMode | The desired running mode of the job. Defaults to Immediate. |
| runPolicy RunPolicy | Controls the handling of completed replicas and other related processes. |
| scheduler SchedulePolicy | Specifies the scheduler from which to request resources. Defaults to the cluster's default scheduler. |
| torchConfig TorchConfig | Describes how to start the ColossalAI job. |
| replicaSpecs ReplicaSpec array | List of replica specs belonging to the job. At least one replica must be defined for a job. |
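
The constraint on replicaSpecs and the documented defaults can be checked before a job is submitted. The sketch below is a client-side illustration only: the helper names `validate_spec` and `total_processes` are hypothetical, and the process count assumes every replica runs procPerWorker processes, which this reference does not state explicitly.

```python
def validate_spec(spec):
    """Raise ValueError if the spec defines no replica specs."""
    if not spec.get("replicaSpecs"):
        raise ValueError("replicaSpecs must define at least one replica")


def total_processes(spec):
    """Estimate the total process count of a job.

    ASSUMPTION: every replica starts procPerWorker processes. The defaults
    used here (replicas = 1, procPerWorker = 1) are the documented ones.
    """
    validate_spec(spec)
    procs_per_worker = spec.get("torchConfig", {}).get("procPerWorker", 1)
    replica_count = sum(r.get("replicas", 1) for r in spec["replicaSpecs"])
    return replica_count * procs_per_worker


# Hypothetical spec: two replicas of type "worker", four processes each.
example_spec = {
    "torchConfig": {"procPerWorker": 4, "script": ["train.py"]},
    "replicaSpecs": [{"type": "worker", "replicas": 2}],
}
print(total_processes(example_spec))  # prints 8
```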

ColossalAIJobStatus

ColossalAIJobStatus describes the observed state of ColossalAIJob.

Appears in: ColossalAIJob

| Field | Description |
| --- | --- |
| tasks Tasks array | The statuses of individual tasks. |
| aggregate Aggregate | The number of replicas in each phase. |
| phase JobPhase | Provides a simple, high-level summary of where the Job is in its lifecycle. Note that this is NOT intended to be a comprehensive state machine. |
| conditions JobCondition array | The latest available observations of an object's current state. |

ReplicaSpec

ReplicaSpec defines the desired state of replicas.

Appears in: ColossalAIJobSpec

| Field | Description |
| --- | --- |
| type ReplicaType | Replica type. |
| replicas integer | The desired number of replicas of this replica type. Defaults to 1. |
| restartPolicy RestartPolicy | Restart policy for replicas of this replica type. One of Always, OnFailure, Never. Optional; defaults to OnFailure. |
| template PodTemplateSpec | Defines the template used to create pods. |

ReplicaType

Underlying type: string

Appears in: ReplicaSpec

RestartPolicy

RestartPolicy describes how the replica should be restarted.

Appears in: ReplicaSpec

| Field | Description |
| --- | --- |
| policy RestartPolicyType | The policy for restarting finished replicas. |
| limit integer | The maximum number of restarts. Optional; defaults to 0. |
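
One plausible reading of how policy and limit interact is sketched below. It assumes that the policy names carry their usual Kubernetes meaning and that limit caps the total number of restarts, so the default of 0 allows none. The function `should_restart` and its parameters are hypothetical, and the controller's actual behavior is not specified in this reference.

```python
VALID_POLICIES = ("Always", "OnFailure", "Never")


def should_restart(policy, limit, exit_code, restarts_so_far):
    """Decide whether a finished replica would be restarted.

    ASSUMPTIONS: "Always" restarts on any exit, "OnFailure" only on a
    non-zero exit code, "Never" does not restart, and `limit` caps the
    total number of restarts. None of this is confirmed by the reference.
    """
    if policy not in VALID_POLICIES:
        raise ValueError(f"unknown restart policy: {policy!r}")
    if restarts_so_far >= limit:
        return False  # restart budget exhausted
    if policy == "Always":
        return True
    if policy == "OnFailure":
        return exit_code != 0
    return False  # policy == "Never"


# A replica that failed once, with up to three restarts allowed.
print(should_restart("OnFailure", limit=3, exit_code=1, restarts_so_far=1))
```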

RestartPolicyType

Underlying type: string

Appears in: RestartPolicy

RunPolicy

RunPolicy dictates specific actions to be taken by the controller upon job completion.

Appears in: ColossalAIJobSpec

| Field | Description |
| --- | --- |
| cleanUpWorkers boolean | Defaults to false. |

SSHConfig

SSHConfig specifies various configurations for running the SSH daemon (sshd).

Appears in: ColossalAIJobSpec

| Field | Description |
| --- | --- |
| authMountPath string | The directory where SSH keys are mounted. Defaults to "/root/.ssh". |
| sshdPath string | The location of the sshd executable file. |

TorchConfig

TorchConfig describes how to start the ColossalAI job.

Appears in: ColossalAIJobSpec

| Field | Description |
| --- | --- |
| procPerWorker integer | The number of processes per worker. Defaults to 1. |
| script string array | Specifies the command used to start the workers. |
| extraArgs string array | Extra arguments passed to torchrun. |
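
To make the relationship between these three fields concrete, the sketch below shows one way they might combine into a torchrun invocation. How the controller actually assembles the command is not documented here, so the ordering of arguments and the mapping of procPerWorker to torchrun's `--nproc_per_node` flag are assumptions, and `build_torchrun_command` is a hypothetical helper.

```python
def build_torchrun_command(torch_config):
    """Compose a torchrun argument list from a TorchConfig-like dict.

    ASSUMPTIONS: procPerWorker maps to --nproc_per_node, extraArgs are
    placed before the script, and the script entries come last. The real
    controller may assemble the command differently.
    """
    nproc = torch_config.get("procPerWorker", 1)  # documented default: 1
    command = ["torchrun", f"--nproc_per_node={nproc}"]
    command += list(torch_config.get("extraArgs", []))
    command += list(torch_config["script"])
    return command


# Hypothetical configuration: four processes per worker, one extra flag.
example_config = {
    "procPerWorker": 4,
    "script": ["train.py"],
    "extraArgs": ["--rdzv_backend=c10d"],
}
print(" ".join(build_torchrun_command(example_config)))
# prints: torchrun --nproc_per_node=4 --rdzv_backend=c10d train.py
```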