Terraform modules are the key to provisioning GKE clusters repeatably, but they can also be a source of insidious drift and configuration debt.
Let’s see what a basic GKE cluster looks like when provisioned with Terraform. We’ll start with a minimal main.tf:
provider "google" {
  project = var.gcp_project_id
  region  = var.gcp_region
}

module "gke_cluster" {
  source  = "terraform-google-modules/kubernetes-engine/google"
  version = "20.0.0" # Specify a version!

  name       = "my-reproducible-cluster"
  project_id = var.gcp_project_id
  region     = var.gcp_region
  network    = "default" # Or your VPC network
  subnetwork = "default" # Or your subnetwork

  # Node pool configuration
  node_pools = [
    {
      name = "default-pool"
      # ... other node pool settings
    }
  ]

  # Cluster configuration
  # The module expects the NAMES of secondary ranges on the subnetwork here,
  # not CIDR blocks. The names below are placeholders.
  ip_range_pods     = "pods-range"     # e.g. a range covering 10.10.0.0/16
  ip_range_services = "services-range" # e.g. a range covering 10.20.0.0/16
  # ... other cluster settings
}
And a variables.tf to keep things clean:
variable "gcp_project_id" {
  description = "The GCP project ID."
  type        = string
  default     = "my-gcp-project-id"
}

variable "gcp_region" {
  description = "The GCP region for the cluster."
  type        = string
  default     = "us-central1"
}
Running terraform init and terraform apply will spin up a GKE cluster with these specifications. The source attribute points to the Terraform Google Kubernetes Engine module in the registry, and the version attribute pins it to a specific release, ensuring that the structure and module-level defaults of the cluster configuration are consistent across applies. This is the foundation of reproducibility.
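Pinning the module is one lever; the Terraform CLI and the google provider can be constrained in the same spirit. Here is a minimal sketch of a versions.tf. The file name and both constraints are illustrative assumptions, and the provider constraint has to be compatible with whatever the pinned module release itself declares.

```hcl
# versions.tf (hypothetical file name; constraints are illustrative assumptions)
terraform {
  # Refuse to run with a Terraform CLI older than this.
  required_version = ">= 1.3.0"

  required_providers {
    google = {
      source = "hashicorp/google"
      # Stay within one major version of the provider.
      version = "~> 4.0"
    }
  }
}
```

After terraform init, the exact provider version that was selected is recorded in .terraform.lock.hcl. Committing that file keeps every machine on the same provider build.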
The real power comes from understanding that the module abstracts away hundreds of individual google_container_cluster and google_container_node_pool resource arguments. Instead of managing a sprawling main.tf, you’re managing a concise module block. This makes your configuration easier to read, write, and maintain.
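To make the abstraction concrete, here is a heavily simplified sketch of what a hand-written equivalent of one cluster and one node pool might look like using the raw resources. It is not the module's actual internals; the resource labels (primary, default_pool) and all values are assumptions chosen to mirror the example above.

```hcl
# Hand-written sketch of a small part of what the module manages for you.
resource "google_container_cluster" "primary" {
  name     = "my-reproducible-cluster"
  location = "us-central1"

  # Drop the built-in default pool so node pools are managed separately.
  remove_default_node_pool = true
  initial_node_count       = 1
}

resource "google_container_node_pool" "default_pool" {
  name     = "default-pool"
  cluster  = google_container_cluster.primary.name
  location = google_container_cluster.primary.location

  autoscaling {
    min_node_count = 1
    max_node_count = 5
  }

  management {
    auto_repair  = true
    auto_upgrade = true
  }

  node_config {
    machine_type = "e2-medium"
    disk_size_gb = 100
    disk_type    = "pd-standard"
    oauth_scopes = ["https://www.googleapis.com/auth/cloud-platform"]
  }
}
```

Even this trimmed version leaves out networking, IP allocation, service accounts, and add-ons, which is the part the module block condenses into a handful of inputs.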
Here’s the mental model:
- Module as a Blueprint: The GKE module is a pre-defined, opinionated blueprint for creating a Kubernetes cluster. It bundles common configurations and best practices.
- Input Variables as Parameters: You control the blueprint by passing values into the module’s input variables (e.g., name, region, node_pools).
- Terraform as the Builder: Terraform reads your module configuration and the module’s internal code. It then calls the Google Cloud API to provision the exact resources defined by the blueprint and your parameters.
- State as the Record: Terraform’s state file (terraform.tfstate) records the actual resources created in GCP, linking them back to your configuration. This is crucial for subsequent apply operations and for detecting drift.
Let’s look at a more complex node pool configuration. The module expects each entry in the node_pools variable to be a flat map of settings, with OAuth scopes supplied through the companion node_pools_oauth_scopes variable:
node_pools = [
  {
    name           = "default-pool"
    machine_type   = "e2-medium"
    node_locations = "us-central1-a,us-central1-b" # Comma-separated string
    min_count      = 1
    max_count      = 5
    disk_size_gb   = 100
    disk_type      = "pd-standard"
    auto_repair    = true
    auto_upgrade   = true
    preemptible    = false
  },
  {
    name              = "gpu-pool"
    machine_type      = "n1-standard-1"
    node_locations    = "us-central1-c"
    min_count         = 0
    max_count         = 2
    disk_size_gb      = 50
    disk_type         = "pd-ssd"
    accelerator_type  = "nvidia-tesla-t4"
    accelerator_count = 1
  }
]

# OAuth scopes are keyed by node pool name; "all" applies to every pool.
node_pools_oauth_scopes = {
  all = [
    "https://www.googleapis.com/auth/cloud-platform",
  ]
}
This structure allows you to define multiple node pools with distinct characteristics, such as machine types, GPU availability, autoscaling parameters, and disk configurations, all within a single, manageable variable.
The most surprising truth about using modules for GKE is that the module itself doesn’t guarantee your cluster is identical to another cluster provisioned with the same module version if you rely on defaults that GCP can change. The module provides a consistent interface and structure, but GCP’s underlying API might evolve behavior or introduce new default options that aren’t yet reflected in the module’s version. The module’s version attribute locks down the module’s code, not the GCP API’s behavior.
To truly achieve reproducibility, you must explicitly define every setting that matters to you, rather than relying on module defaults or GCP defaults. For example, instead of letting a node pool’s minimum node count default to 1, explicitly set it to 1. If a setting is left unset, its value comes from the module’s default or, failing that, from whatever GCP applies when the resource is created. If GCP later changes that default, a cluster built from the same code can differ from one built earlier, and an existing cluster may show an unexpected diff on the next terraform plan, even though your code hasn’t changed.
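As a sketch of what "explicit" can look like in practice, the module block below spells out the node pool settings instead of inheriting them. The flat key names follow the module's documented node_pools inputs as I understand them, so verify them against the release you pin. The secondary range names, the release channel choice, and the image type are illustrative assumptions.

```hcl
# Sketch only: values are illustrative assumptions, not recommendations.
module "gke_cluster" {
  source  = "terraform-google-modules/kubernetes-engine/google"
  version = "20.0.0"

  name       = "my-reproducible-cluster"
  project_id = var.gcp_project_id
  region     = var.gcp_region
  network    = "default"
  subnetwork = "default"

  # The module takes secondary range names here (hypothetical names).
  ip_range_pods     = "pods-range"
  ip_range_services = "services-range"

  # Chosen explicitly instead of inherited.
  release_channel = "REGULAR"

  node_pools = [
    {
      name         = "default-pool"
      machine_type = "e2-medium"
      image_type   = "COS_CONTAINERD"
      min_count    = 1 # Same value as the default, but now recorded in code
      max_count    = 5
      disk_size_gb = 100
      disk_type    = "pd-standard"
      auto_repair  = true
      auto_upgrade = true
      preemptible  = false
    }
  ]
}
```

The block is longer than the minimal one, but every value a reviewer cares about is now visible in the diff when it changes.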
The next step is to integrate this reproducible cluster definition into a larger infrastructure-as-code strategy, perhaps by defining Kubernetes resources (Deployments, Services) within the same Terraform configuration using the kubernetes provider.
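A minimal sketch of that wiring, assuming the module block is named gke_cluster as above and exposes endpoint and ca_certificate outputs (as the module's documentation describes). The namespace name is a hypothetical placeholder.

```hcl
# Short-lived access token for the identity Terraform is running as.
data "google_client_config" "default" {}

# Point the kubernetes provider at the cluster created by the module.
provider "kubernetes" {
  host                   = "https://${module.gke_cluster.endpoint}"
  token                  = data.google_client_config.default.access_token
  cluster_ca_certificate = base64decode(module.gke_cluster.ca_certificate)
}

# Hypothetical example resource managed through that provider.
resource "kubernetes_namespace" "demo" {
  metadata {
    name = "demo"
  }
}
```

Configuring a provider from resources created in the same apply is a known source of ordering problems, so many teams keep the cluster and its Kubernetes resources in separate configurations or separate apply stages.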