Skip to main content

Timeouts and Retries

Furiko provides several mechanisms to impose timeouts and retrying of failed Jobs.

Task-level Timeouts

The following timeouts apply to individual tasks in a Job.

pendingTimeoutSeconds

Specifies the maximum duration that a single task can be pending for. This includes the time taken for scheduling, image pull and container creation.

If not specified, defaults to the global controller defaultTaskPendingTimeoutSeconds value.

tip

If the Pod cannot pull the container image, it will remain in ImagePullBackOff indefinitely. A pending timeout helps to stop these Jobs eventually.

runningTimeoutSeconds

Specifies the maximum duration that a single task can be running for. The time starts once the task starts running.

If not specified, defaults to no timeout.

info

Alternatively, you can use timeout(1) available on Unix systems, which provides several additional mechanisms to control the exit code and signals being sent on timeout.

Another way is to also use timeouts in your application directly, and read the value from a Job Option.

note

This timeout is not yet implemented.

taskTemplate.pod.spec.activeDeadlineSeconds

You can also set the Pod's activeDeadlineSeconds directly, which is the duration relative to the Pod's startTime before Kubelet will actively try to kill associated containers.

Job-level Timeouts

The following timeouts apply to the entire Job across all tasks.

jobTimeoutSeconds

Specifies a global timeout for the entire Job across all tasks. The time starts when the Job is started, and is inclusive of the retry delay and time spent waiting for the tasks to start running.

note

This timeout is not yet implemented.

Retries

Furiko retries failed Jobs by creating a new Task.

If a Node is misconfigured or has some host-level issue, using restartPolicy: OnFailure to recreate the container would not be sufficient to avoid spurious Job failures which may only be resolved by running the Task on a different Node. As such, Furiko recommends using Job-level retries, which recreates an entirely new Pod.

maxAttempts

Specifies the maximum number of task attempts.

If the job is a parallel job, this corresponds to the maximum number of attempts for each parallel index.

If not specified, defaults to 1 (i.e. no retries). Must be a positive integer.

retryDelaySeconds

Specifies the duration between the last failed task and creation of the next task.

If the job is a parallel job, the retry delay is from the time of the last failed task with the same parallel index. That is, if there are two parallel tasks - index 0 and index 1 - which failed at t=0 and t=15, with retryDelaySeconds of 30, the controller will only create the next attempts at t=30 and t=45 respectively.

If not specified, it means there is no delay between creating task attempts.

restartPolicy

If the Job uses both restartPolicy: OnFailure in conjunction with Furiko Job-level tries, Jobs may take a longer time before finally terminating in failure.

If a Job is in a CrashLoopBackOff, it will be deemed to be still "pending", and if it remains/transitions to this state even after its pending timeout, it will be killed. The next Task will be only created after retryDelaySeconds, which results in the creation of a brand-new Pod.

This means that the total time taken before terminating in failure would be roughly around pendingTimeoutSeconds * maxAttempts + retryDelaySeconds * (maxAttempts - 1), rather than simply the sum of all tasks' running durations.