Kubernetes CronJobs are a surprisingly unintuitive way to run scheduled batch workloads, often leading to confusion about their precise execution and lifecycle.
Let’s see one in action. Imagine we want to run a simple script every minute. A real script might check the health of a service; this one just prints the date and a greeting.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: hello-cron
spec:
  schedule: "*/1 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: hello
            image: busybox
            imagePullPolicy: IfNotPresent
            command: ["/bin/sh", "-c", "date; echo Hello from the Kubernetes cluster; sleep 5"]
          restartPolicy: OnFailure
```

When this CronJob is created, Kubernetes doesn’t immediately run the job. Instead, it schedules it based on the schedule field, which follows the standard cron format. At the designated time, Kubernetes creates a Job object. This Job object, in turn, is responsible for creating one or more Pods to execute the actual workload defined in jobTemplate.spec.template.
The core problem a CronJob solves is the reliable, scheduled execution of ephemeral tasks. Unlike long-running services, these tasks start, do some work, and then terminate. Kubernetes Jobs ensure that the defined number of completions is reached, and CronJobs add the scheduling layer.
Internally, a CronJob controller watches the Kubernetes API for CronJob objects. For each CronJob, it works out the next scheduled time from the schedule field and the time of the last scheduled run (recorded in status.lastScheduleTime). When the current time reaches or passes that next scheduled time, the controller creates a Job resource. The Job resource then takes over, managing the creation and lifecycle of the Pod(s) that perform the actual work. The startingDeadlineSeconds field (if specified) does not change the schedule itself; it only limits how late a missed run may still be started.
The jobTemplate is where you define the blueprint for the Job that will be created. This includes the container image, command, environment variables, and crucially, the restartPolicy. For Jobs created by CronJobs, restartPolicy must be OnFailure or Never. Always is not a valid option because a Job is intended to complete, not run indefinitely.
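To make the two allowed restartPolicy values concrete, here is a minimal sketch of the same CronJob using Never instead of OnFailure. The name hello-cron-never and the backoffLimit value are illustrative assumptions, not part of the original example.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: hello-cron-never          # hypothetical name
spec:
  schedule: "*/1 * * * *"
  jobTemplate:
    spec:
      backoffLimit: 2             # assumed value: give up after 2 retries
      template:
        spec:
          containers:
          - name: hello
            image: busybox
            command: ["/bin/sh", "-c", "date; echo Hello from the Kubernetes cluster"]
          # Never: a failed Pod is left as-is and the Job creates a new Pod to retry.
          # OnFailure: the failed container is restarted inside the same Pod.
          restartPolicy: Never
```

Never tends to be easier to debug, because each failed attempt leaves its own Pod and logs behind.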
Let’s dive into some specific configurations you’ll interact with. The schedule field is paramount. */1 * * * * means "run at minute 0, 1, 2… of every hour, every day." If you wanted to run a job every 5 minutes, you’d use */5 * * * *.
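As a quick reference, the sketch below annotates the five cron fields and shows the every-5-minutes schedule in a full manifest. The name hello-cron-5min and the extra schedule strings in the comments are illustrative assumptions.

```yaml
# Field order of the schedule string:
#   minute (0-59)  hour (0-23)  day-of-month (1-31)  month (1-12)  day-of-week (0-6, Sunday = 0)
apiVersion: batch/v1
kind: CronJob
metadata:
  name: hello-cron-5min           # hypothetical name
spec:
  schedule: "*/5 * * * *"         # minutes 0, 5, 10, ... 55 of every hour
  # Other illustrative values:
  #   "0 * * * *"    at minute 0 of every hour
  #   "30 2 * * 1"   at 02:30 every Monday
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: hello
            image: busybox
            command: ["/bin/sh", "-c", "date"]
          restartPolicy: OnFailure
```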
The startingDeadlineSeconds field is also critical for understanding reliability. If a CronJob misses its scheduled time (e.g., due to a Kubernetes outage, or because the previous Job is still running and the concurrencyPolicy prevents a new one), startingDeadlineSeconds specifies how many seconds after the scheduled time the Job can still start. If it misses this deadline, the Job for that missed schedule is skipped. If the field is not set, there is no deadline.
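A minimal sketch of a CronJob with a start deadline might look like the following. The name hello-cron-deadline and the 120-second value are assumptions chosen for illustration.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: hello-cron-deadline       # hypothetical name
spec:
  schedule: "*/5 * * * *"
  # A run scheduled for 12:00:00 may still be started until 12:02:00.
  # If it has not started by then, that run is skipped.
  startingDeadlineSeconds: 120    # assumed value
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: hello
            image: busybox
            command: ["/bin/sh", "-c", "date"]
          restartPolicy: OnFailure
```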
The concurrencyPolicy dictates how to handle overlapping runs of Jobs created by the same CronJob. Allow (the default) means multiple Jobs can run simultaneously. Forbid means that if a previous Job is still running, the new one is skipped. Replace means that if a previous Job is still running, the controller terminates it and starts the new one. Choosing the right concurrencyPolicy is vital for preventing unintended resource consumption or data corruption.
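To see where the policy matters, consider a hypothetical task that takes longer than its schedule interval. In this sketch the name slow-task and the 90-second sleep are assumptions used only to force an overlap.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: slow-task                 # hypothetical name
spec:
  schedule: "*/1 * * * *"         # scheduled every 60 seconds
  # Forbid:  a run that comes due while the previous Job is active is skipped.
  # Allow:   (default) the new Job starts alongside the old one.
  # Replace: the running Job is terminated and a new one is started.
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: slow
            image: busybox
            # Takes about 90 seconds, longer than the schedule interval.
            command: ["/bin/sh", "-c", "date; sleep 90"]
          restartPolicy: OnFailure
```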
Consider the successfulJobsHistoryLimit and failedJobsHistoryLimit. These settings control how many successful and failed Job objects are retained in the API server. By default, Kubernetes keeps 3 successful Jobs and 1 failed Job; setting a limit to 0 keeps none of that kind. The retained Jobs are important for debugging and auditing, but keeping too many clutters the namespace and adds load on the API server.
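Both limits are plain integers on the CronJob spec. A minimal sketch, with the name hello-cron-history and the values 5 and 3 chosen purely for illustration:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: hello-cron-history        # hypothetical name
spec:
  schedule: "*/5 * * * *"
  successfulJobsHistoryLimit: 5   # assumed value; the default is 3
  failedJobsHistoryLimit: 3       # assumed value; the default is 1
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: hello
            image: busybox
            command: ["/bin/sh", "-c", "date"]
          restartPolicy: OnFailure
```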
A common point of confusion is the difference between the CronJob object and the Job objects it creates. The CronJob is the scheduler; it creates Jobs. The Jobs then create Pods to do the actual work. You’ll often find yourself troubleshooting Pods that are stuck or failing, but the root cause might be in the Job definition or the CronJob’s scheduling.
When a CronJob is deleted, by default, all the Jobs and Pods it created are also deleted, because they are owned by the CronJob and are garbage-collected along with it. Setting suspend: true does not change this. It only stops new Jobs from being created, while Jobs that have already started keep running and are retained according to the history limits. If you want to preserve the Jobs even after deleting the CronJob, delete it without cascading, for example with kubectl delete cronjob hello-cron --cascade=orphan.
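For reference, suspending is a one-field change. The sketch below reuses the hello-cron example with suspend set to true; applying it is assumed to update the existing object rather than create a new one.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: hello-cron
spec:
  schedule: "*/1 * * * *"
  # true: no new Jobs are created; Jobs already started are not affected.
  # Set back to false (the default) to resume scheduling.
  suspend: true
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: hello
            image: busybox
            imagePullPolicy: IfNotPresent
            command: ["/bin/sh", "-c", "date; echo Hello from the Kubernetes cluster; sleep 5"]
          restartPolicy: OnFailure
```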
The next thing you’ll likely grapple with is managing the state and dependencies between scheduled jobs.