API Reference

Packages

batch.tensorstack.dev/v1beta1

Package v1beta1 contains API schema definitions for the batch v1beta1 API group.

Resource Types

Aggregate

Aggregate records the number of replica pods at each phase.

Appears in: GenericJobStatus

Field (type): Description
creating (integer): Pod has been created, but resources have not been scheduled.
pending (integer): Pod has been accepted by the system, but one or more of the containers has not been started. This includes time before being bound to a node, as well as time spent pulling images onto the host.
running (integer): Pod has been bound to a node and all of the containers have been started. At least one container is still running or is in the process of being restarted.
succeeded (integer): All containers in the pod have voluntarily terminated with a container exit code of 0, and the system is not going to restart any of these containers.
failed (integer): All containers in the pod have terminated, and at least one container has terminated in failure (exited with a non-zero exit code or was stopped by the system).
unknown (integer): For some reason the state of the pod could not be obtained, typically due to an error in communicating with the host of the pod.
deleted (integer): Pod has been deleted.
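
The counts in Aggregate can be read as a tally over the phases of the replica pods. The following is a minimal sketch in Python, assuming phases are available as plain strings that match the field names above case-insensitively. The `aggregate_phases` helper is hypothetical and is not part of the API.

```python
from collections import Counter

# Field names of Aggregate, as listed in the table above.
AGGREGATE_FIELDS = ("creating", "pending", "running", "succeeded",
                    "failed", "unknown", "deleted")


def aggregate_phases(phases):
    """Tally replica pod phases into an Aggregate-shaped dict (hypothetical helper)."""
    counts = Counter(phase.lower() for phase in phases)
    unexpected = set(counts) - set(AGGREGATE_FIELDS)
    if unexpected:
        raise ValueError(f"unrecognized phases: {sorted(unexpected)}")
    # Every field is always present, so phases with no pods report 0.
    return {field: counts.get(field, 0) for field in AGGREGATE_FIELDS}


print(aggregate_phases(["Running", "Running", "Succeeded", "Pending"]))
```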

CleanUpPolicy

Underlying type: string

CleanUpPolicy specifies the collection of replicas that are to be deleted upon job completion.

Appears in: GenericJobSpec

ContainerStatus

ContainerStatus defines the observed state of the container.

Appears in: ReplicaStatus

DebugMode

DebugMode configures whether and how to start a job in debug mode.

Appears in: RunMode

Field (type): Description
enabled (boolean): Whether to enable debug mode.
replicaSpecs (ReplicaDebugSet array): If provided, these specs override the corresponding values of the job's replicas.

FinishRule

A finishRule is a condition used to check if the job has finished. A finishRule identifies a set of replicas, and the controller determines the job's status by checking the status of all of these replicas.

Appears in: GenericJobSpec

GenericJob

GenericJob represents the schema for a general-purpose batch job API. While it offers less automation compared to specialized APIs like PyTorchTrainingJob, it allows for greater flexibility in specifying parallel replicas/pods. This design serves as a comprehensive job definition mechanism when more specialized APIs are not applicable or available.

Appears in: GenericJobList

Field (type): Description
apiVersion (string): batch.tensorstack.dev/v1beta1
kind (string): GenericJob
metadata (ObjectMeta): Refer to Kubernetes API documentation for fields of metadata.
spec (GenericJobSpec)
status (GenericJobStatus)
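
To show how these fields fit together, here is a minimal sketch that builds a GenericJob manifest as a plain Python dict and prints it as JSON, which `kubectl apply -f` accepts alongside YAML. The `make_generic_job` helper, the job name, the replica type `worker`, the container name and image, and the restart limit are all illustrative assumptions. Only the field names come from this reference.

```python
import json


def make_generic_job(name, replica_type, replicas, image, command):
    """Build a GenericJob manifest as a plain dict (hypothetical helper)."""
    return {
        "apiVersion": "batch.tensorstack.dev/v1beta1",
        "kind": "GenericJob",
        "metadata": {"name": name},
        "spec": {
            # At least one replica spec is required.
            "replicaSpecs": [
                {
                    "type": replica_type,
                    "replicas": replicas,
                    # Illustrative values; see RestartPolicy for the fields.
                    "restartPolicy": {"policy": "OnFailure", "limit": 3},
                    # A standard Kubernetes PodTemplateSpec.
                    "template": {
                        "spec": {
                            "containers": [
                                {"name": "main", "image": image, "command": command}
                            ]
                        }
                    },
                }
            ]
        },
    }


job = make_generic_job("example-job", "worker", 2, "busybox", ["sh", "-c", "echo hello"])
print(json.dumps(job, indent=2))
```

Optional fields such as successRules, failureRules, runMode, and scheduler are left out here, so their documented defaults apply.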

GenericJobList

GenericJobList contains a list of GenericJob objects.

Field (type): Description
apiVersion (string): batch.tensorstack.dev/v1beta1
kind (string): GenericJobList
metadata (ListMeta): Refer to Kubernetes API documentation for fields of metadata.
items (GenericJob array)

GenericJobSpec

GenericJobSpec defines the desired state of GenericJob.

Appears in: GenericJob

Field (type): Description
successRules (FinishRule array): Rules used to check whether a generic job has succeeded. The job succeeds when any one of the successRules is fulfilled. Each item of successRules refers to a set of replicas, and the rule is fulfilled only if all of the replicas in that set have completed successfully.
failureRules (FinishRule array): Rules used to check whether a generic job has failed. The job fails when any one of the failureRules is fulfilled. Each item of failureRules refers to a set of replicas, and the rule is fulfilled only if all of the replicas in that set have failed.
service (ServiceOption): Details of the v1/Service for replica pods. Optional: defaults to empty, in which case no service is created.
runMode (RunMode): Job running mode. Defaults to Immediate mode.
cleanUpPolicy (CleanUpPolicy): To avoid wasting resources on completed tasks, the controller reclaims resources according to one of the following policies: None: (default) no resources are reclaimed; Unfinished: only unfinished pods are deleted; All: all pods are deleted.
scheduler (SchedulePolicy): If specified, the pods are dispatched by the specified scheduler. Otherwise, the pods are dispatched by the default scheduler.
replicaSpecs (ReplicaSpec array): List of replica specs belonging to the job. At least one replica spec must be defined for a job.
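
The any-of / all-of semantics of successRules and failureRules can be sketched as follows. This reference does not list the fields of FinishRule, so the sketch assumes each rule is simply a set of replica names and that replica phases are given as a name-to-phase dict. Both representations, the function names, and the choice to check failure rules first are assumptions made for illustration, not documented controller behavior.

```python
def rule_fulfilled(rule, phases, wanted_phase):
    """A rule is fulfilled only if every replica it names is in wanted_phase."""
    # An empty rule is treated as never fulfilled (assumption).
    return bool(rule) and all(phases.get(name) == wanted_phase for name in rule)


def job_outcome(success_rules, failure_rules, phases):
    """Return "Failed", "Succeeded", or None if no rule is fulfilled yet."""
    # Any single fulfilled rule decides the outcome.
    if any(rule_fulfilled(rule, phases, "Failed") for rule in failure_rules):
        return "Failed"
    if any(rule_fulfilled(rule, phases, "Succeeded") for rule in success_rules):
        return "Succeeded"
    return None


success_rules = [{"worker-0", "worker-1"}]
failure_rules = [{"ps-0"}]

phases = {"worker-0": "Succeeded", "worker-1": "Running", "ps-0": "Running"}
print(job_outcome(success_rules, failure_rules, phases))

phases["worker-1"] = "Succeeded"
print(job_outcome(success_rules, failure_rules, phases))
```

In the first call only one of the two workers named by the success rule has succeeded, so no rule is fulfilled. After the second worker succeeds, the rule is fulfilled and the job counts as succeeded.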

GenericJobStatus

GenericJobStatus defines the observed state of GenericJob.

Appears in: GenericJob

Field (type): Description
tasks (Tasks array): An array of the statuses of individual tasks.
phase (JobPhase): Provides a simple, high-level summary of where the job is in its lifecycle. Note that this is NOT intended to be a comprehensive state machine.
aggregate (Aggregate): Records the number of replicas in each phase.
conditions (JobCondition array): The latest available observations of a job's current state.

JobCondition

JobCondition describes the current state of a job.

Appears in: GenericJobStatus

Field (type): Description
type (JobConditionType): Type of job condition: Complete or Failed.
status (ConditionStatus): Status of the condition, one of True, False, Unknown.
lastTransitionTime (Time): Last time the condition transitioned from one status to another.
reason (string): Brief reason for the condition's last transition.
message (string): Human-readable message indicating details about the last transition.
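
A client reading a job's status typically scans the conditions array for a given type whose status is True. Here is a minimal sketch, assuming the status has already been fetched and decoded into Python dicts. The `condition_is_true` helper and the sample conditions are hypothetical. Note that the status values are the strings "True", "False", and "Unknown", not booleans.

```python
def condition_is_true(conditions, condition_type):
    """Return True if a condition of the given type has status "True"."""
    return any(
        c.get("type") == condition_type and c.get("status") == "True"
        for c in conditions or []
    )


# Sample data for illustration only.
conditions = [
    {"type": "Running", "status": "False", "reason": "JobFinished"},
    {"type": "Failed", "status": "True", "reason": "ReplicaFailed",
     "message": "replica worker-0 exited with a non-zero code"},
]

print(condition_is_true(conditions, "Failed"))
print(condition_is_true(conditions, "Running"))
```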

JobConditionType

Underlying type: string

JobConditionType defines all possible types of JobStatus. Can be one of: Initialized, Running, ReplicaFailure, Completed, or Failed.

Appears in: JobCondition

JobPhase

Underlying type: string

Appears in: GenericJobStatus

PauseMode

PauseMode configures whether and how to start a job in pause mode.

Appears in: RunMode

Field (type): Description
enabled (boolean): Whether to enable pause mode.
resumeSpecs (ResumeSpec array): If provided, these specs override the corresponding values of the job's replicas when resuming.

ReplicaDebugSet

ReplicaDebugSet describes how to start replicas in debug mode.

Appears in: DebugMode

Field (type): Description
type (string): Replica type.
skipInitContainer (boolean): Skips creation of initContainer, if true.
command (string): Entrypoint array. Optional: defaults to ["sleep", "inf"].
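
The effect of debug mode on a replica's entrypoint can be sketched as follows. The sketch assumes that runMode is available as a plain dict, that commands are lists of strings, and that a replica type without a matching ReplicaDebugSet entry also falls back to the default ["sleep", "inf"]. That last point, like the `debug_command` helper itself, is an assumption and not something this reference states.

```python
DEFAULT_DEBUG_COMMAND = ["sleep", "inf"]


def debug_command(run_mode, replica_type, normal_command):
    """Pick the entrypoint a replica would run, given the job's runMode."""
    debug = (run_mode or {}).get("debug") or {}
    if not debug.get("enabled"):
        # Debug mode is off: the replica runs its regular command.
        return normal_command
    for spec in debug.get("replicaSpecs") or []:
        if spec.get("type") == replica_type and spec.get("command"):
            return spec["command"]
    # No override for this replica type: use the documented default.
    return list(DEFAULT_DEBUG_COMMAND)


run_mode = {
    "debug": {
        "enabled": True,
        "replicaSpecs": [
            {"type": "worker", "skipInitContainer": True,
             "command": ["sleep", "3600"]},
        ],
    }
}

print(debug_command(run_mode, "worker", ["python", "train.py"]))
print(debug_command(run_mode, "ps", ["python", "serve.py"]))
```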

ReplicaSpec

ReplicaSpec defines the desired state of replicas.

Appears in: GenericJobSpec

Field (type): Description
type (string): Replica type.
replicas (integer): The desired number of replicas of this replica type. Defaults to 1.
restartPolicy (RestartPolicy): Restart policy for replicas of this replica type. One of Always, OnFailure, Never. Optional: defaults to OnFailure.
template (PodTemplateSpec): Defines the template used to create pods.

ReplicaStatus

ReplicaStatus defines the observed state of the pod.

Appears in: Tasks

Field (type): Description
name (string): Pod name.
uid (UID): Pod uid.
phase (PodPhase): Pod phase. The phase of a Pod is a simple, high-level summary of where the Pod is in its lifecycle.
containers (ContainerStatus array): Containers status.

RestartPolicy

RestartPolicy describes how the replica should be restarted.

Appears in: ReplicaSpec

Field (type): Description
policy (RestartPolicyType): The policy for restarting finished replicas.
limit (integer): The maximum number of restarts. Optional: defaults to 0.

RestartPolicyType

Underlying type: string

Appears in: RestartPolicy

ResumeSpec

ResumeSpec describes how to resume replicas from pause mode.

Appears in: PauseMode

Field (type): Description
type (string): Replica type.
skipInitContainer (boolean): Skips creation of initContainer, if true.
command (string): Entrypoint array. If provided, it overrides the entrypoint; otherwise, the value from immediate mode is used.
args (string): Arguments to the entrypoint. If not provided, the arguments from immediate mode are used.
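
The fallback rule for command and args on resume can be sketched like this. The sketch assumes resumeSpecs is a list of dicts and that command and args are lists of strings. The `resume_entrypoint` helper is hypothetical.

```python
def resume_entrypoint(resume_specs, replica_type, command, args):
    """Return the (command, args) a replica would use when resuming.

    `command` and `args` are the values the replica uses in immediate mode.
    """
    for spec in resume_specs or []:
        if spec.get("type") == replica_type:
            # Each field falls back independently to its immediate-mode value.
            return (spec.get("command") or command, spec.get("args") or args)
    return (command, args)


resume_specs = [{"type": "worker", "args": ["--resume-from", "latest"]}]

print(resume_entrypoint(resume_specs, "worker", ["python", "train.py"], ["--epochs", "10"]))
```

Here the worker keeps its immediate-mode command and only its arguments are replaced.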

RunMode

RunMode defines the job's execution behavior:
Immediate mode: (Default) Tasks are executed immediately upon submission.
Debug mode: Job pods are created, but regular executions are replaced with null operations (e.g., sleep) for convenient debugging purposes.
Pause mode: Job execution is halted, and pods are deleted to reclaim resources. A graceful pod termination process is initiated to allow pods to exit cleanly.

Appears in: GenericJobSpec

Field (type): Description
debug (DebugMode): Debug mode.
pause (PauseMode): Pause mode.

SchedulePolicy

SchedulePolicy signals to K8s how the job should be scheduled.

Appears in: GenericJobSpec

Field (type): Description
t9kScheduler (T9kScheduler): T9k Scheduler. TODO: link to t9k scheduler docs.

ServiceOption

Details of a replica's service.

Appears in: GenericJobSpec

Field (type): Description
ports (ServicePort array): The list of ports that are exposed by this service.

T9kScheduler

T9kScheduler provides additional configurations needed for the scheduling process.

Appears in: SchedulePolicy

Field (type): Description
queue (string): Specifies the name of the queue that should be used for running this workload. TODO: link to t9k scheduler docs.
priority (integer): Indicates the priority of the PodGroup; valid range: [0, 100]. Optional: defaults to 0.
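
When generating manifests programmatically, it can help to check the priority range before submitting the job. Here is a minimal sketch that builds the value of the scheduler field of GenericJobSpec. The `t9k_scheduler` helper and the queue name `default` are illustrative assumptions. Only the field names and the [0, 100] range come from this reference.

```python
def t9k_scheduler(queue, priority=0):
    """Build a SchedulePolicy dict that selects the T9k scheduler."""
    if not isinstance(priority, int) or not 0 <= priority <= 100:
        raise ValueError("priority must be an integer in [0, 100]")
    return {"t9kScheduler": {"queue": queue, "priority": priority}}


# The result is meant to be placed under spec.scheduler of a GenericJob.
print(t9k_scheduler("default", priority=50))
```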

Tasks

Tasks defines the observed state of a task.

Appears in: GenericJobStatus

Field (type): Description
type (string): Replica type.
restartCount (integer): The number of restarts that have been performed.
replicas (ReplicaStatus array): Replicas status array.
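
To illustrate how the tasks array nests ReplicaStatus entries, the following sketch condenses a decoded GenericJobStatus into a per-type summary. The `summarize_tasks` helper and the sample status are hypothetical. Only the field names come from this reference.

```python
def summarize_tasks(status):
    """Map each replica type to its restart count and per-pod phases."""
    summary = {}
    for task in status.get("tasks") or []:
        replicas = task.get("replicas") or []
        summary[task["type"]] = {
            "restartCount": task.get("restartCount", 0),
            "phases": {r["name"]: r.get("phase") for r in replicas},
        }
    return summary


# Sample data for illustration only.
status = {
    "tasks": [
        {
            "type": "worker",
            "restartCount": 1,
            "replicas": [
                {"name": "example-job-worker-0", "phase": "Running"},
                {"name": "example-job-worker-1", "phase": "Pending"},
            ],
        }
    ]
}

print(summarize_tasks(status))
```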