The surprising thing about GKE logging and monitoring is that it is not only about seeing what your cluster is doing; it is also about influencing the cluster’s behavior through automated responses.
Let’s look at a typical GKE cluster running a simple web application.
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: my-web-app
    spec:
      replicas: 3
      selector:
        matchLabels:
          app: my-web-app
      template:
        metadata:
          labels:
            app: my-web-app
        spec:
          containers:
          - name: web
            image: nginx:latest
            ports:
            - containerPort: 80
When you deploy this, GKE automatically starts sending stdout and stderr from the nginx container to Cloud Logging. It also collects Kubernetes events and metrics like CPU and memory usage from the nodes and pods.
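Because everything written to stdout ends up in Cloud Logging, the shape of those lines matters. The following is a minimal sketch of structured logging, assuming the GKE logging agent treats a single-line JSON object as a structured entry and reads its `severity` field, which is the commonly documented behavior. The helper names `format_log_entry` and `log`, and the extra fields, are hypothetical.

```python
import json
import sys


def format_log_entry(severity: str, message: str, **fields) -> str:
    """Return one JSON log line; extra keyword arguments become extra fields."""
    entry = {"severity": severity, "message": message}
    entry.update(fields)
    # json.dumps escapes newlines, so the entry always stays on a single line.
    return json.dumps(entry)


def log(severity: str, message: str, **fields) -> None:
    """Write the entry to stdout, where the node's logging agent picks it up."""
    print(format_log_entry(severity, message, **fields), file=sys.stdout, flush=True)


if __name__ == "__main__":
    log("INFO", "request served", path="/", status=200, latency_ms=12)
    log("ERROR", "upstream timeout", path="/api", status=504)
```

Logging this way should make fields such as status filterable in Logs Explorer instead of being buried in free text.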
The magic happens when you tie this data to Cloud Monitoring. You can use Metrics Explorer and dashboards to visualize your application’s performance:
- Request Latency: A custom metric exported by your application (e.g., via Prometheus client libraries) can be collected into Cloud Monitoring, typically through Google Cloud Managed Service for Prometheus. You’d see a time-series graph showing average latency over the last hour.
- Pod Restarts: The built-in metric kubernetes.io/container/restart_count shows how many times containers have restarted. You can graph this per deployment by filtering on the pod name prefix.
- Node CPU Utilization: The kubernetes.io/node/cpu/allocatable_utilization metric shows how busy your underlying GKE nodes are.
But this isn’t just about dashboards. You can set up alerting policies based on these metrics. For instance, if the average request latency for my-web-app exceeds 500ms for 5 minutes, an alert fires. This alert can trigger a notification to your Slack channel or PagerDuty, or it can even initiate an automated action.
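To make the "exceeds 500ms for 5 minutes" rule concrete, here is a rough sketch of how a threshold condition with a duration is evaluated: the alert fires only if every sample in the duration window violates the threshold. It assumes one averaged latency sample per minute; the function name `condition_met` and its parameters are made up for illustration and are not part of any Cloud Monitoring API.

```python
from typing import Sequence


def condition_met(samples_ms: Sequence[float], threshold_ms: float = 500.0,
                  required_points: int = 5) -> bool:
    """True when the most recent `required_points` samples all exceed the threshold."""
    if len(samples_ms) < required_points:
        return False  # not enough data to cover the whole duration window
    recent = samples_ms[-required_points:]
    return all(value > threshold_ms for value in recent)


if __name__ == "__main__":
    # Five consecutive minutes above 500 ms: the condition is met.
    print(condition_met([420, 510, 530, 560, 610, 590]))
    # A single dip below the threshold inside the window resets it.
    print(condition_met([510, 530, 480, 610, 590]))
```

The duration is what keeps a single slow minute from paging anyone.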
This automated action is where logging and monitoring truly become operational. You can route the alert to a Pub/Sub topic or webhook notification channel, which in turn invokes a Cloud Function or a Cloud Run service. Imagine an alert firing because your my-web-app deployment is experiencing high error rates (e.g., HTTP 5xx). The triggered Cloud Function could then automatically scale up the my-web-app deployment by updating its spec.replicas field.
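The decision logic of such a function might look like the sketch below. It assumes the alert arrives through a Pub/Sub notification channel as a base64-encoded JSON payload with an "incident" object whose "state" is "open" or "closed". The function names, the step size, and the replica cap are hypothetical, and the actual Kubernetes API call that patches spec.replicas is deliberately left out.

```python
import base64
import json


def parse_incident(pubsub_data: str) -> dict:
    """Decode the base64 Pub/Sub payload and return its "incident" object."""
    payload = json.loads(base64.b64decode(pubsub_data).decode("utf-8"))
    return payload.get("incident", {})


def desired_replicas(current: int, incident: dict, step: int = 2,
                     max_replicas: int = 10) -> int:
    """Add `step` replicas while an incident is open, never exceeding the cap."""
    if incident.get("state") != "open":
        return current  # closed or unknown state: leave the Deployment alone
    # Never scale *down* here, even if current is already above the cap.
    return max(current, min(current + step, max_replicas))


def handle_alert(event: dict, current_replicas: int) -> int:
    """Entry-point sketch: `event` mimics a Pub/Sub-triggered function event."""
    incident = parse_incident(event["data"])
    target = desired_replicas(current_replicas, incident)
    # A real handler would now patch spec.replicas on the my-web-app
    # Deployment through the Kubernetes API; that call is omitted here.
    return target
```

Capping the replica count keeps a flapping alert from scaling the deployment without bound.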
Here’s how you’d enable this basic logging and monitoring setup if it weren’t already enabled:
- Enable Cloud Operations for GKE: When you create a GKE cluster, ensure "Enable Cloud Operations for GKE" is checked. If it’s an existing cluster, you can enable it by updating the cluster configuration:

      gcloud container clusters update CLUSTER_NAME \
        --zone COMPUTE_ZONE \
        --logging=SYSTEM,WORKLOAD \
        --monitoring=SYSTEM

  This deploys GKE’s managed logging and metrics agents into your cluster (the Ops Agent, by contrast, is meant for Compute Engine VMs). The agents collect logs and metrics and send them to Cloud Logging and Cloud Monitoring, respectively. The logging agent runs as a DaemonSet, ensuring an agent pod is present on each node.
- View Logs: Navigate to the Cloud Logging section in the Google Cloud Console. You can filter logs by GKE cluster, namespace, and workload. For example, to see logs from your my-web-app deployment in the default namespace:
  - Go to Logging > Logs Explorer.
  - In the query builder, select Resource type: Kubernetes Container and then filter by Cluster: your-cluster-name, Namespace: default, Pod: my-web-app-*.
  - The query would look something like:

        resource.type="k8s_container"
        resource.labels.cluster_name="your-cluster-name"
        resource.labels.namespace_name="default"
        resource.labels.pod_name:"my-web-app-"
- View Metrics: Go to Monitoring > Metrics Explorer.
  - Select Resource type: Kubernetes Container.
  - Select Metric: CPU limit utilization (or CPU request utilization).
  - Filter by Cluster name: your-cluster-name, Namespace name: default, and a Pod name that starts with my-web-app. You can then choose to aggregate by mean, max, etc.
- Create an Alerting Policy: In Monitoring > Alerting, click "Create Policy".
  - Add Condition: Choose Kubernetes Container as the resource type and Restart count as the metric. Because the restart count is cumulative, evaluate its change (delta) over a 5-minute rolling window and set the threshold to > 0. This means that if any container in your deployment restarts at least once within a 5-minute window, the alert will trigger.
  - Configure Notifications: Add a notification channel (e.g., email, Slack).
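The same restart alert can be described as data instead of console clicks. The sketch below builds a request body in the shape of the Cloud Monitoring API’s AlertPolicy resource. It is an illustration under assumptions: the function name `restart_alert_policy`, the display names, and the choice of a delta over a 5-minute alignment period with a zero-second duration are mine, and notification channels are left out.

```python
import json


def restart_alert_policy(cluster: str, namespace: str = "default",
                         window_seconds: int = 300) -> dict:
    """Build an AlertPolicy body that fires when a container restarts in the window."""
    metric_filter = " AND ".join([
        'metric.type="kubernetes.io/container/restart_count"',
        'resource.type="k8s_container"',
        f'resource.labels.cluster_name="{cluster}"',
        f'resource.labels.namespace_name="{namespace}"',
        'resource.labels.pod_name=starts_with("my-web-app-")',
    ])
    return {
        "displayName": "my-web-app container restarts",
        "combiner": "OR",
        "conditions": [{
            "displayName": "Restart count increased",
            "conditionThreshold": {
                "filter": metric_filter,
                # restart_count is cumulative, so compare its change per window.
                "aggregations": [{
                    "alignmentPeriod": f"{window_seconds}s",
                    "perSeriesAligner": "ALIGN_DELTA",
                }],
                "comparison": "COMPARISON_GT",
                "thresholdValue": 0,
                "duration": "0s",
            },
        }],
    }


if __name__ == "__main__":
    print(json.dumps(restart_alert_policy("your-cluster-name"), indent=2))
```

Keeping the policy in a file like this makes it reviewable and repeatable across clusters.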
By default, GKE’s logging agent collects stdout and stderr from containers. It also collects system logs from nodes, including kubelet logs. Control plane logs, such as those from the Kubernetes API server, can be enabled separately. For more advanced scenarios, like collecting application-specific metrics in Prometheus format or tracing data, you’d enable Managed Service for Prometheus or deploy separate collectors.
What many people miss is the ability to ingest custom metrics directly from their applications and leverage them for alerting and automated remediation. If your application, for example, exposes a /metrics endpoint in Prometheus format, you can configure Google Cloud Managed Service for Prometheus (for example, with a PodMonitoring resource) to scrape this endpoint and then use those custom metrics in Cloud Monitoring just like any other built-in metric. This allows for highly granular monitoring of application-specific behavior and the creation of alerts that directly reflect your application’s health from its own perspective.
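For a sense of what such a /metrics endpoint serves, here is a minimal standard-library sketch that renders counters in the Prometheus text exposition format. In practice you would more likely use a Prometheus client library. The metric name `my_web_app_requests_total`, the in-memory counter dictionary, and port 8080 are all assumptions made for this example.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counters, keyed by HTTP status code.
REQUEST_COUNTS = {"200": 0, "500": 0}


def render_metrics(counts: dict) -> str:
    """Render the counters in the Prometheus text exposition format."""
    lines = [
        "# HELP my_web_app_requests_total Requests served, by status code.",
        "# TYPE my_web_app_requests_total counter",
    ]
    for code, value in sorted(counts.items()):
        lines.append(f'my_web_app_requests_total{{code="{code}"}} {value}')
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUEST_COUNTS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    # Port 8080 is an arbitrary choice for this sketch.
    HTTPServer(("", 8080), MetricsHandler).serve_forever()
```

A scraper that polls this endpoint sees one time series per status code, which is what lets an alert target the 5xx rate specifically.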
The next step is often implementing distributed tracing to understand request flows across multiple services within your GKE cluster.