API Reference

Packages

batch.tensorstack.dev/v1beta1

Package v1beta1 contains API Schema definitions for the batch v1beta1 API group

Resource Types

Aggregate

Aggregate records the number of replica pods at each phase.

Appears in: - GenericJobStatus

Field Description
creating integer Pod has been created, but resources have not been scheduled.
pending integer Pod has been accepted by the system, but one or more of the containers has not been started. This includes time before being bound to a node, as well as time spent pulling images onto the host.
running integer Pod has been bound to a node and all of the containers have been started. At least one container is still running or is in the process of being restarted.
succeeded integer All containers in the pod have voluntarily terminated with a container exit code of 0, and the system is not going to restart any of these containers.
failed integer All containers in the pod have terminated, and at least one container has terminated in failure (exited with a non-zero exit code or was stopped by the system).
unknown integer For some reason the state of the pod could not be obtained, typically due to an error in communicating with the host of the pod.
deleted integer Pod has been deleted.
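
As a rough illustration of what Aggregate holds, the following Python sketch tallies a list of replica phases into the seven counters listed above. The helper name `aggregate_phases` and the case-insensitive matching of phase names are assumptions made for this example; the controller's actual implementation is not part of this reference.

```python
from collections import Counter

# Field names taken from the Aggregate table above.
AGGREGATE_FIELDS = ("creating", "pending", "running", "succeeded",
                    "failed", "unknown", "deleted")


def aggregate_phases(phases):
    """Count replica pods per phase (hypothetical helper, not part of the API).

    `phases` is an iterable of phase names; matching is case-insensitive.
    Phases that are not Aggregate fields raise ValueError.
    """
    counts = Counter(p.lower() for p in phases)
    unexpected = set(counts) - set(AGGREGATE_FIELDS)
    if unexpected:
        raise ValueError(f"unexpected phases: {sorted(unexpected)}")
    # Always emit every field so that absent phases show up as 0.
    return {field: counts.get(field, 0) for field in AGGREGATE_FIELDS}


if __name__ == "__main__":
    print(aggregate_phases(["Running", "Running", "Pending", "Succeeded"]))
```

Emitting every field, including zeros, keeps the result shaped like the table above regardless of which phases are present.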

CleanUpPolicy

Underlying type: string

CleanUpPolicy specifies the collection of replicas that are to be deleted upon job completion.

Appears in: - GenericJobSpec

ContainerStatus

ContainerStatus defines the observed state of the container.

Appears in: - ReplicaStatus

DebugMode

DebugMode configures whether and how to start a job in debug mode.

Appears in: - RunMode

Field Description
enabled boolean Whether to enable debug mode.
replicaSpecs ReplicaDebugSet array If provided, these specs override the corresponding values of the job replicas.

FinishRule

A finishRule is a condition used to check if the job has finished. A finishRule identifies a set of replicas, and the controller determines the job's status by checking the status of all of these replicas.

Appears in: - GenericJobSpec
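
The fields of FinishRule are not listed in this reference, so the sketch below models a rule simply as a list of replica names, which is an assumption. It illustrates the evaluation described for the successRules and failureRules fields of GenericJobSpec: a job outcome is reached when any one rule is fulfilled, and a rule is fulfilled only when all replicas it identifies are in the wanted phase. Checking success rules before failure rules, and treating an empty rule as unfulfilled, are also assumptions of this example.

```python
def rule_fulfilled(rule, replica_phases, wanted_phase):
    """True if every replica named by the rule is in `wanted_phase`.

    `rule` is a list of replica names (hypothetical representation);
    `replica_phases` maps replica name -> phase string.
    An empty rule is treated as not fulfilled (an assumption).
    """
    return bool(rule) and all(
        replica_phases.get(name) == wanted_phase for name in rule
    )


def job_outcome(success_rules, failure_rules, replica_phases):
    """Return "Succeeded", "Failed", or None if no rule is fulfilled yet."""
    if any(rule_fulfilled(r, replica_phases, "Succeeded") for r in success_rules):
        return "Succeeded"
    if any(rule_fulfilled(r, replica_phases, "Failed") for r in failure_rules):
        return "Failed"
    return None
```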

GenericJob

GenericJob represents the schema for a general-purpose batch job API. While it offers less automation compared to specialized APIs like PyTorchTrainingJob, it allows for greater flexibility in specifying parallel replicas/pods. This design serves as a comprehensive job definition mechanism when more specialized APIs are not applicable or available.

Appears in: - GenericJobList

Field Description
apiVersion string batch.tensorstack.dev/v1beta1
kind string GenericJob
metadata ObjectMeta Refer to Kubernetes API documentation for fields of metadata.
spec GenericJobSpec
status GenericJobStatus

GenericJobList

GenericJobList contains a list of GenericJob

Field Description
apiVersion string batch.tensorstack.dev/v1beta1
kind string GenericJobList
metadata ListMeta Refer to Kubernetes API documentation for fields of metadata.
items GenericJob array

GenericJobSpec

GenericJobSpec defines the desired state of GenericJob

Appears in: - GenericJob

Field Description
successRules FinishRule array Rules used to check if a generic job has succeeded. The job succeeds when any one of the successRules is fulfilled. Each item of successRules refers to a series of replicas, and the rule is fulfilled only if all of the replicas in this series complete successfully.
failureRules FinishRule array Rules used to check if a generic job has failed. The job fails when any one of the failureRules is fulfilled. Each item of failureRules refers to a series of replicas, and the rule is fulfilled only if all of the replicas in this series fail.
service ServiceOption Details of v1/Service for replica pods. Optional: Defaults to empty and no service will be created.
runMode RunMode Job running mode. Defaults to Immediate mode.
cleanUpPolicy CleanUpPolicy To avoid wasting resources on completed jobs, the controller reclaims resources according to one of the following policies: None: (default) no resources are reclaimed; Unfinished: only unfinished pods are deleted; All: all pods are deleted.
scheduler SchedulePolicy If specified, the pod will be dispatched by the specified scheduler. Otherwise, the pod will be dispatched by the default scheduler.
replicaSpecs ReplicaSpec array List of replica specs belonging to the job. There must be at least one replica defined for a Job.
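
To show how these fields fit together, here is a minimal sketch that builds a GenericJob manifest as a Python dict and serializes it to JSON, which Kubernetes tooling such as kubectl accepts alongside YAML. The job name, the replica type `worker`, the container name, image, and command are placeholder values chosen for illustration. successRules and failureRules are left out because the fields of FinishRule are not documented in this reference; whether a spec without them is accepted is not stated here.

```python
import json


def make_generic_job(name, image, replicas=1):
    """Build a GenericJob manifest (placeholder values, illustration only)."""
    return {
        "apiVersion": "batch.tensorstack.dev/v1beta1",
        "kind": "GenericJob",
        "metadata": {"name": name},
        "spec": {
            "cleanUpPolicy": "All",
            "replicaSpecs": [
                {
                    "type": "worker",  # hypothetical replica type
                    "replicas": replicas,
                    "restartPolicy": {"policy": "OnFailure", "limit": 3},
                    # PodTemplateSpec, as in the Kubernetes core API.
                    "template": {
                        "spec": {
                            "containers": [
                                {
                                    "name": "main",
                                    "image": image,
                                    "command": ["python", "train.py"],
                                }
                            ]
                        }
                    },
                }
            ],
        },
    }


if __name__ == "__main__":
    job = make_generic_job("demo-job", "python:3.11", replicas=2)
    print(json.dumps(job, indent=2))
```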

GenericJobStatus

GenericJobStatus defines the observed state of GenericJob

Appears in: - GenericJob

Field Description
tasks Tasks array An array of status of individual tasks.
phase JobPhase Provides a simple, high-level summary of where the Job is in its lifecycle. Note that this is NOT intended to be a comprehensive state machine.
aggregate Aggregate Records the number of replicas at each phase.
conditions JobCondition array The latest available observations of a job's current state.

JobCondition

JobCondition describes the current state of a job.

Appears in: - GenericJobStatus

Field Description
type JobConditionType Type of job condition. One of Initialized, Running, ReplicaFailure, Completed, or Failed.
status ConditionStatus Status of the condition, one of True, False, Unknown.
lastTransitionTime Time Last time the condition transitioned from one status to another.
reason string Brief reason for the condition's last transition.
message string Human-readable message indicating details about the last transition.
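
A client can inspect conditions to decide whether a job has reached a terminal state. The sketch below assumes a status dict shaped like GenericJobStatus and uses the type names Completed and Failed from the JobConditionType list in this reference; the helper name `finished_condition` is hypothetical.

```python
def finished_condition(status):
    """Return the type of the first terminal condition whose status is "True".

    `status` is a GenericJobStatus-shaped dict. Returns "Completed",
    "Failed", or None when the job has no terminal condition yet.
    """
    for cond in status.get("conditions", []):
        is_terminal = cond.get("type") in ("Completed", "Failed")
        if is_terminal and cond.get("status") == "True":
            return cond["type"]
    return None
```

For example, a status whose conditions contain only Running with status True yields None, while one that also contains Failed with status True yields "Failed".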

JobConditionType

Underlying type: string

JobConditionType defines all possible types of JobStatus. Can be one of: Initialized, Running, ReplicaFailure, Completed, or Failed.

Appears in: - JobCondition

JobPhase

Underlying type: string

Appears in: - GenericJobStatus

PauseMode

PauseMode configures whether and how to start a job in pause mode.

Appears in: - RunMode

Field Description
enabled boolean Whether to enable pause mode.
resumeSpecs ResumeSpec array If provided, these specs override the corresponding values of the job replicas when resuming.

ReplicaDebugSet

ReplicaDebugSet describes how to start replicas in debug mode.

Appears in: - DebugMode

Field Description
type string Replica type.
skipInitContainer boolean Skips creation of initContainer, if true.
command string Entrypoint array. Optional: Defaults to ["sleep", "inf"].
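
The following sketch builds a `runMode` fragment that enables debug mode with one ReplicaDebugSet entry per replica type. The helper name `debug_run_mode` and the replica type names passed to it are placeholders; the command is given as a list of strings, matching the form of the documented default.

```python
def debug_run_mode(replica_types, command=None, skip_init_container=False):
    """Build a `runMode` fragment that enables debug mode.

    One ReplicaDebugSet entry is emitted per replica type. When `command`
    is None the field is left out so that the documented default applies.
    """
    specs = []
    for rtype in replica_types:
        entry = {"type": rtype, "skipInitContainer": skip_init_container}
        if command is not None:
            entry["command"] = list(command)
        specs.append(entry)
    return {"debug": {"enabled": True, "replicaSpecs": specs}}
```

The returned dict can be placed under `spec.runMode` of a GenericJob manifest.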

ReplicaSpec

ReplicaSpec defines the desired state of replicas.

Appears in: - GenericJobSpec

Field Description
type string Replica type.
replicas integer The desired number of replicas of this replica type. Defaults to 1.
restartPolicy RestartPolicy Restart policy for replicas of this replica type. One of Always, OnFailure, Never. Optional: Defaults to OnFailure.
template PodTemplateSpec Defines the template used to create pods.

ReplicaStatus

ReplicaStatus defines the observed state of the pod.

Appears in: - Tasks

Field Description
name string Pod name.
uid UID Pod uid.
phase PodPhase Pod phase. The phase of a Pod is a simple, high-level summary of where the Pod is in its lifecycle.
containers ContainerStatus array Containers status.

RestartPolicy

RestartPolicy describes how the replica should be restarted.

Appears in: - ReplicaSpec

Field Description
policy RestartPolicyType The policy to restart finished replica.
limit integer The maximum number of restarts. Optional: Defaults to 0.
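
One plausible reading of how policy and limit interact is sketched below. It assumes that limit is a hard cap on the number of restarts (so the default of 0 would mean no restarts) and that the policy values are Always, OnFailure, and Never as listed for ReplicaSpec. The controller's actual behavior is not specified in this reference and may differ.

```python
def should_restart(policy, limit, restart_count, exited_ok):
    """Decide whether a finished replica would be restarted (illustrative).

    policy        -- "Always", "OnFailure", or "Never"
    limit         -- maximum number of restarts (assumed to be a hard cap)
    restart_count -- restarts already performed
    exited_ok     -- True if the replica finished successfully
    """
    if policy not in ("Always", "OnFailure", "Never"):
        raise ValueError(f"unknown restart policy: {policy!r}")
    if policy == "Never" or restart_count >= limit:
        return False
    if policy == "OnFailure":
        return not exited_ok
    return True  # "Always"
```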

RestartPolicyType

Underlying type: string

Appears in: - RestartPolicy

ResumeSpec

ResumeSpec describes how to resume replicas from pause mode.

Appears in: - PauseMode

Field Description
type string Replica type.
skipInitContainer boolean Skips creation of initContainer, if true.
command string Entrypoint array. If provided, overrides the entrypoint used in immediate mode; otherwise, the immediate-mode entrypoint is used.
args string Arguments to the entrypoint. If not provided, the arguments from immediate mode are used.

RunMode

RunMode defines the job's execution behavior:
- Immediate mode: (Default) Tasks are executed immediately upon submission.
- Debug mode: Job pods are created, but regular executions are replaced with null operations (e.g., sleep) for convenient debugging purposes.
- Pause mode: Job execution is halted, and pods are deleted to reclaim resources. A graceful pod termination process is initiated to allow pods to exit cleanly.

Appears in: - GenericJobSpec

Field Description
debug DebugMode Debug mode.
pause PauseMode Pause mode.
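
The sketch below derives the effective mode from a RunMode-shaped dict. It treats a missing or empty `runMode` as Immediate mode, as documented. This reference does not say what happens when both debug and pause are enabled, so the example rejects that combination, which is an assumption; the helper name `effective_run_mode` is hypothetical.

```python
def effective_run_mode(run_mode):
    """Return "Immediate", "Debug", or "Pause" for a RunMode-shaped dict.

    `run_mode` may be None or empty, which means Immediate mode.
    Enabling both debug and pause is rejected because precedence
    between the two is not documented (an assumption of this sketch).
    """
    run_mode = run_mode or {}
    debug = bool((run_mode.get("debug") or {}).get("enabled", False))
    pause = bool((run_mode.get("pause") or {}).get("enabled", False))
    if debug and pause:
        raise ValueError("debug and pause are both enabled")
    if debug:
        return "Debug"
    if pause:
        return "Pause"
    return "Immediate"
```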

SchedulePolicy

SchedulePolicy signals to K8s how the job should be scheduled.

Appears in: - GenericJobSpec

Field Description
t9kScheduler T9kScheduler T9k Scheduler. TODO: link to t9k scheduler docs.

ServiceOption

Details of the replicas' service.

Appears in: - GenericJobSpec

Field Description
ports ServicePort array The list of ports that are exposed by this service.

T9kScheduler

T9kScheduler provides additional configurations needed for the scheduling process.

Appears in: - SchedulePolicy

Field Description
queue string Specifies the name of the queue that should be used for running this workload. TODO: link to t9k scheduler docs.
priority integer Indicates the priority of the PodGroup; valid range: [0, 100]. Optional: Defaults to 0.
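
A client-side check of the documented priority range might look like the sketch below, which builds the `scheduler` fragment of a GenericJobSpec. The helper name `t9k_scheduler` is hypothetical, and rejecting an empty queue name is an assumption of this example rather than a documented rule.

```python
def t9k_scheduler(queue, priority=0):
    """Build a `scheduler` fragment for GenericJobSpec.

    Raises ValueError if priority is outside the documented range [0, 100]
    or if the queue name is empty (the latter check is an assumption).
    """
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(priority, bool) or not isinstance(priority, int):
        raise TypeError("priority must be an integer")
    if not 0 <= priority <= 100:
        raise ValueError(f"priority {priority} is outside [0, 100]")
    if not queue:
        raise ValueError("queue name must not be empty")
    return {"t9kScheduler": {"queue": queue, "priority": priority}}
```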

Tasks

Tasks defines the observed state of a task.

Appears in: - GenericJobStatus

Field Description
type string Replica type.
restartCount integer The number of restarts that have been performed.
replicas ReplicaStatus array Replicas status array.
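
As an illustration of how the `tasks` array of GenericJobStatus can be consumed, the sketch below summarizes each task's restart count and the phases of its replicas. The helper name `summarize_tasks` is hypothetical, and falling back to "Unknown" when a replica has no phase is an assumption of this example.

```python
def summarize_tasks(status):
    """Summarize `status.tasks` per replica type (hypothetical helper).

    Returns {type: {"restartCount": int, "phases": {phase: count}}}.
    """
    summary = {}
    for task in status.get("tasks", []):
        phases = {}
        for replica in task.get("replicas", []):
            phase = replica.get("phase", "Unknown")
            phases[phase] = phases.get(phase, 0) + 1
        summary[task["type"]] = {
            "restartCount": task.get("restartCount", 0),
            "phases": phases,
        }
    return summary
```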