A Loki ingestion rate limit error (HTTP 429 Too Many Requests) means Loki's push API is rejecting log batches because your agents are sending data faster than Loki is configured, or provisioned, to accept it.
Here are the common causes and how to fix them:
1. Insufficient Ingestion Streams
Loki’s ingestion pipeline is designed to handle multiple streams of logs concurrently. If all your logs are being sent as a single stream, or very few distinct streams, you’re bottlenecking the processing.
Diagnosis: Check your agent configuration. Look at how labels are assigned in Promtail, Fluentd, or Fluent Bit (in Promtail, for example, the `labels` block and `relabel_configs` of each scrape config). If every log source ends up with the same one or two label sets, this is likely the issue.
Fix: Ensure your log sources are labeled so that distinct sources land in distinct streams. For example, instead of just `job="my-app"`, use `job="my-app"`, `instance="{{.Instance}}"`, and `namespace="{{.Namespace}}"` (for Kubernetes). This allows Loki to parallelize ingestion across more streams. Keep the label values bounded, since unbounded values cause the problem described in cause 3.
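Why it works: Each unique combination of label names and values is a separate stream within Loki. More streams mean more opportunities for parallel processing and higher overall throughput. Loki also enforces a per-stream rate limit (`per_stream_rate_limit` in `limits_config`), so one very busy stream can be throttled even when the tenant-wide limit has headroom.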
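To make the "one label set, one stream" idea concrete, here is a minimal Python sketch. It does not talk to Loki; the label sets and the helper names `stream_key` and `count_streams` are hypothetical and exist only to show how adding a bounded label such as `instance` spreads the same log lines across more streams.

```python
from collections import Counter

def stream_key(labels):
    """A stream is identified by its full label set, regardless of key order."""
    return tuple(sorted(labels.items()))

def count_streams(entries):
    """Count log entries per distinct label set (that is, per stream)."""
    return Counter(stream_key(labels) for labels in entries)

# Hypothetical label sets attached to six log lines.
job_only = [{"job": "my-app"} for _ in range(6)]
job_and_instance = [
    {"job": "my-app", "instance": f"pod-{i % 3}"} for i in range(6)
]

print(len(count_streams(job_only)))          # all six lines share one stream
print(len(count_streams(job_and_instance)))  # lines spread over three streams
```

Running the same count over the label sets your agents actually produce is a quick way to see whether most of your volume is funneled into a handful of streams.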
2. Under-provisioned Ingester Pods (for self-hosted Loki)
If you’re running Loki yourself, the ingester components are responsible for receiving and processing logs. If there aren’t enough of these, or they don’t have enough resources, they’ll get overwhelmed.
Diagnosis:
- Kubernetes: Run `kubectl get pods -l app=loki,component=ingester -n <loki-namespace>` and check the `READY` column. If pods are crashing or not ready, or if they consistently sit at their CPU or memory limits (`kubectl top pods -l app=loki,component=ingester -n <loki-namespace>`), this is a strong indicator.
- Docker Compose/Binary: Check the logs of your ingester process for errors, and check the host or container for sustained high CPU or memory usage.
Fix:
- Kubernetes:
  - Increase the replica count for the ingester workload: `kubectl scale deployment loki-ingester --replicas=4 -n <loki-namespace>` (adjust `4` as needed; if your chart runs the ingesters as a StatefulSet, scale the `statefulset` instead).
  - Increase resource requests/limits for the ingester pods in your Helm chart `values.yaml` or Kubernetes manifest. For example, raise the values under `resources.requests` from `cpu: "500m", memory: "1Gi"` to `cpu: "1000m", memory: "2Gi"`.
- Docker Compose/Binary: Update your `docker-compose.yml` to increase the number of ingester services or adjust the resource limits for the ingester container.
Why it works: More ingester instances or more powerful ingester instances can handle a greater volume of incoming log data simultaneously.
3. Inefficient Labeling Strategy
While diverse labels are good, excessively high cardinality labels (labels with millions of unique values) can also overwhelm Loki’s index and ingestion pipeline.
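Diagnosis: Use Loki's internal metrics. Query `sum(rate(loki_ingester_labels_unique_total[5m])) by (label_name)` to see which labels are generating the most unique values, after confirming on the ingester `/metrics` endpoint that your Loki version exposes this metric. Alternatively, `logcli series '{}' --analyze-labels` reports the number of unique values per label. If any label shows a consistently high number, it's a candidate.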
Fix: Reduce the cardinality of problematic labels. This might involve:
- Removing labels that aren’t essential for querying.
- Using techniques like label hashing or aggregation in your agent configuration to reduce the number of unique label combinations. For example, in Promtail, you could use `pipeline_stages` to drop or re-label certain fields.
Why it works: High-cardinality labels require Loki to maintain larger, more complex indexes, which slows down ingestion and querying. Reducing cardinality lightens this load.
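The cardinality check described above can also be approximated offline. The following is a rough Python sketch, assuming you have exported a sample of label sets (for example from your agent's debug output); the function names, the `request_id` label, and the threshold are all hypothetical.

```python
from collections import defaultdict

def label_cardinality(label_sets):
    """Return {label_name: number of distinct values seen for that label}."""
    values = defaultdict(set)
    for labels in label_sets:
        for name, value in labels.items():
            values[name].add(value)
    return {name: len(seen) for name, seen in values.items()}

def high_cardinality(label_sets, threshold):
    """Label names whose distinct-value count exceeds the threshold, worst first."""
    counts = label_cardinality(label_sets)
    flagged = [(name, count) for name, count in counts.items() if count > threshold]
    return sorted(flagged, key=lambda item: item[1], reverse=True)

# Hypothetical sample: request_id is unique per line, namespace is not.
sample = [
    {"job": "my-app", "namespace": f"ns-{i % 2}", "request_id": f"req-{i}"}
    for i in range(1000)
]

print(high_cardinality(sample, threshold=100))  # only request_id is flagged
```

A label flagged this way is usually better kept in the log line itself and filtered at query time than promoted to a stream label.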
4. Network Bandwidth or Latency Issues
Your agents might be capable of sending logs, but the network path to Loki could be the bottleneck.
Diagnosis:
- Check network metrics on your agent hosts and Loki pods. Look for packet loss, high latency, or saturation of available bandwidth.
- Use tools like `ping` and `traceroute` from your agent hosts to your Loki endpoint to check latency and network hops.
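Where ICMP is blocked, or where you want to measure the path to the actual Loki port, a TCP connect timing can complement the tools above. This is a minimal sketch using only the Python standard library; `connect_latency_ms` is a hypothetical helper, and the host name in the comment is a placeholder for your own endpoint.

```python
import socket
import statistics
import time

def connect_latency_ms(host, port, samples=5, timeout=2.0):
    """Median time in milliseconds to open a TCP connection to host:port."""
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        # Open and immediately close the connection; only the handshake is timed.
        with socket.create_connection((host, port), timeout=timeout):
            pass
        timings.append((time.perf_counter() - start) * 1000)
    return statistics.median(timings)

# Example call (placeholder host; 3100 is Loki's default HTTP port):
#   connect_latency_ms("loki.example.internal", 3100)
```

This measures connection setup only, not sustained throughput, so treat it as a latency indicator rather than a bandwidth test.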
Fix:
- Increase network bandwidth between your agents and Loki.
- Optimize network routing.
- If using a cloud provider, ensure your Loki instances are in the same region and availability zone as your log sources for lower latency.
Why it works: Log ingestion relies on efficient data transfer. Network limitations directly cap the rate at which logs can reach Loki.
5. Loki Configuration Issues (e.g., ingestion_rate_mb or ingestion_burst_size_mb)
Loki has built-in rate limiting configurations that might be set too low.
Diagnosis: Examine your Loki configuration file (`loki.yaml` or Helm chart `values.yaml`). Look for `ingestion_rate_mb` and `ingestion_burst_size_mb` in the `limits_config` section, and for per-tenant overrides if you use them.
Fix: Increase these values. For example, if `ingestion_rate_mb: 10` is set, try increasing it to `ingestion_rate_mb: 20` or `ingestion_rate_mb: 50`, and raise `ingestion_burst_size_mb` to at least the same value. Restart or roll out the Loki components that load this configuration so the new limits take effect.
Why it works: These parameters directly control the maximum per-tenant throughput Loki will accept before returning 429 errors. The limit is enforced by the distributors, which sit in front of the ingesters.
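The effect of a rate setting plus a burst allowance can be illustrated with a generic token bucket. The Python sketch below is a simplified model, not Loki's actual limiter; the class, the `rejected` helper, and the traffic pattern are assumptions for illustration. `rate` plays the role of `ingestion_rate_mb` and `burst` the role of `ingestion_burst_size_mb`.

```python
MB = 1024 * 1024

class TokenBucket:
    """Simplified limiter: refills at `rate` bytes/second, holds at most `burst` bytes."""

    def __init__(self, rate, burst):
        self.rate = rate
        self.burst = burst
        self.tokens = burst  # start with a full bucket
        self.last = 0.0      # time of the previous request, in seconds

    def allow(self, size, now):
        """Return True if a push of `size` bytes at time `now` fits the limit."""
        elapsed = now - self.last
        self.last = now
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        if size <= self.tokens:
            self.tokens -= size
            return True
        return False  # a real server would answer 429 here

def rejected(rate_mb, burst_mb, push_mb, pushes_per_sec, seconds):
    """Count rejected pushes for a steady stream of equally sized pushes."""
    bucket = TokenBucket(rate_mb * MB, burst_mb * MB)
    total = pushes_per_sec * seconds
    return sum(
        not bucket.allow(push_mb * MB, now=i / pushes_per_sec)
        for i in range(total)
    )

# Hypothetical load: ten 2 MB pushes per second (20 MB/s) for ten seconds.
print(rejected(10, 10, 2, 10, 10) > 0)  # a 10 MB/s limit rejects part of the load
print(rejected(20, 20, 2, 10, 10))      # a 20 MB/s limit rejects nothing
```

In this model the burst allowance only absorbs short spikes. A sustained rate above the limit keeps producing rejections until the rate itself is raised or the load drops.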
6. Underlying Storage Performance
The performance of your object storage (S3, GCS, Azure Blob Storage) or block storage (if using local disks) can impact how quickly Loki can write processed data.
Diagnosis:
- Monitor your object storage’s performance metrics (e.g., S3 request latency, throughput, error rates).
- If using local disks, monitor disk I/O (IOPS, throughput) on the Loki ingester nodes.
Fix:
- Ensure your object storage bucket is in a region close to your Loki deployment.
- For S3, consider using S3 Transfer Acceleration if applicable.
- If using local disks, ensure they are fast (e.g., SSDs) and that the disk is not saturated. Consider using more performant storage solutions.
Why it works: Loki needs to write data to its backend storage. Slow storage will cause the ingestion pipeline to back up and eventually hit rate limits.
The next error you’ll likely encounter after fixing ingestion rate limits is related to query performance, specifically context deadline exceeded errors if your queries are too broad or your query frontend is under-provisioned.