Neon’s autoscaling is a bit like a restaurant kitchen that magically hires and fires cooks based on how many customers walk in. It’s designed to keep your database responsive during peak times without costing you a fortune when things are quiet.

Let’s see it in action. Imagine you have a Neon compute branch with a min_scale of 1 and a max_scale of 5. Its configuration might look like this:

{
  "id": "branch-xxxx",
  "name": "main",
  "project_id": "project-yyyy",
  "config": {
    "autoscaling": {
      "min_scale": 1,
      "max_scale": 5
    },
    "compute": {
      "scale_count": 1,
      "scale_limit": 5
    }
  },
  "state": "running",
  "created_at": "2023-10-27T10:00:00Z",
  "updated_at": "2023-10-27T10:05:00Z"
}

If your application suddenly gets a surge of traffic, Neon notices, because it monitors the load on your compute instances. When a metric such as average CPU utilization or the number of active connections crosses a threshold (which Neon manages internally), it starts provisioning new compute instances. So, as load climbs, scale_count might go from 1 to 2, then 3, and so on, up to your max_scale of 5. This usually happens within a minute or two, so your application stays snappy.

Conversely, when the traffic subsides, Neon detects the reduced load. After a sustained period of low load (how long is also decided by Neon’s internal heuristics), it starts scaling down, and scale_count decreases back towards your min_scale of 1. This is where the cost savings come in: you’re not paying for idle compute capacity.
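To make the up-and-down movement concrete, here is a minimal sketch of one way such a loop could behave. It is not Neon's actual algorithm: the `high` and `low` thresholds, the one-instance-per-tick step, and the function names are all assumptions made for illustration. Only min_scale, max_scale, and scale_count come from the example above.

```python
def step_scale_count(current, utilization, min_scale=1, max_scale=5,
                     high=0.75, low=0.25):
    """Return the instance count after one monitoring tick.

    `high` and `low` are hypothetical utilization thresholds (0.0 to 1.0);
    the real thresholds are internal to Neon.
    """
    if utilization > high:
        target = current + 1      # under pressure: add one instance
    elif utilization < low:
        target = current - 1      # quiet: remove one instance
    else:
        target = current          # in the comfortable band: do nothing
    # Never leave the [min_scale, max_scale] range.
    return max(min_scale, min(max_scale, target))


def simulate(trace, start=1, **kwargs):
    """Apply step_scale_count to each utilization sample in order."""
    counts = []
    current = start
    for utilization in trace:
        current = step_scale_count(current, utilization, **kwargs)
        counts.append(current)
    return counts


# A made-up traffic spike followed by a quiet period.
trace = [0.9, 0.9, 0.9, 0.9, 0.9, 0.5, 0.1, 0.1, 0.1, 0.1, 0.1]
print(simulate(trace))
# [2, 3, 4, 5, 5, 5, 4, 3, 2, 1, 1]
```

Notice that the count stops at 5 during the spike and at 1 during the lull, even though the load would otherwise push it further in both directions.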

The core problem Neon autoscaling solves is the classic trade-off between performance and cost in cloud databases. Traditionally, you’d either over-provision to handle the worst-case traffic, leading to wasted money during quiet periods, or under-provision and risk performance degradation or outages during spikes. Autoscaling automates this balancing act.

Internally, Neon uses a sophisticated monitoring system that tracks key metrics for each compute instance. It’s not just a simple on/off switch. It looks at a combination of factors like CPU utilization, memory usage, and active query load. When these metrics consistently indicate high demand, new instances are launched. When they consistently indicate low demand for a sustained period, instances are terminated.

The min_scale and max_scale parameters are your guardrails, defining the absolute lower and upper bounds of how many compute instances Neon can manage for this branch. min_scale ensures you always have at least one instance running, providing a baseline level of availability and performance, while max_scale prevents runaway costs by setting a ceiling on how many instances can be active simultaneously.
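Here is a rough sketch of how several metrics and the two guardrails could fit together. Everything in it is an assumption for illustration: the `Metrics` type, taking the busiest metric as the overall "pressure", the 0.8 and 0.3 thresholds, and treating "consistently" as "every sample in a recent window". Neon's real heuristics are internal and may look nothing like this.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    """One monitoring sample; each value is a fraction from 0.0 to 1.0."""
    cpu: float
    memory: float
    query_load: float  # active queries relative to a hypothetical capacity


def pressure(sample):
    # Assumption: the busiest resource decides how loaded the instance is.
    return max(sample.cpu, sample.memory, sample.query_load)


def desired_instances(samples, current, min_scale, max_scale,
                      high=0.8, low=0.3):
    """Pick a target instance count from a window of recent samples.

    Scale up only if every sample is above `high`, scale down only if
    every sample is below `low`, then clamp to the guardrails.
    """
    pressures = [pressure(s) for s in samples]
    target = current
    if pressures and all(p > high for p in pressures):
        target = current + 1
    elif pressures and all(p < low for p in pressures):
        target = current - 1
    return max(min_scale, min(max_scale, target))


busy = [Metrics(cpu=0.9, memory=0.4, query_load=0.5)] * 3
mixed = [Metrics(0.9, 0.4, 0.5), Metrics(0.2, 0.2, 0.1)]

print(desired_instances(busy, current=2, min_scale=1, max_scale=5))   # 3
print(desired_instances(mixed, current=2, min_scale=1, max_scale=5))  # 2
print(desired_instances(busy, current=5, min_scale=1, max_scale=5))   # 5
```

The last call shows the ceiling at work: demand is high, but the target never exceeds max_scale.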

The scale_count field is the current number of active compute instances for that branch. Neon adjusts this value dynamically based on the observed workload, always keeping it between your configured min_scale and max_scale. The scale_limit field mirrors max_scale in this example: both express the same upper bound.
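If you want to sanity-check a configuration like the one shown earlier, a small script can read the fields and report how much room is left to grow. This is only a sketch: it assumes the JSON shape and field names used in this article (which may not match what Neon's API actually returns), and `scaling_summary` is a made-up helper name.

```python
import json

# The "config" portion of the example branch from this article.
BRANCH_JSON = """
{
  "name": "main",
  "config": {
    "autoscaling": {"min_scale": 1, "max_scale": 5},
    "compute": {"scale_count": 1, "scale_limit": 5}
  }
}
"""


def scaling_summary(branch):
    """Summarize where a branch sits between its scaling bounds."""
    autoscaling = branch["config"]["autoscaling"]
    compute = branch["config"]["compute"]
    lo, hi = autoscaling["min_scale"], autoscaling["max_scale"]
    count, limit = compute["scale_count"], compute["scale_limit"]

    # The relationships described in the text, checked explicitly.
    if not lo <= count <= hi:
        raise ValueError(f"scale_count {count} is outside [{lo}, {hi}]")
    if limit != hi:
        raise ValueError(f"scale_limit {limit} does not match max_scale {hi}")

    return {
        "current": count,
        "headroom": hi - count,          # instances that can still be added
        "can_scale_down": count > lo,    # False when already at the floor
    }


summary = scaling_summary(json.loads(BRANCH_JSON))
print(summary)
# {'current': 1, 'headroom': 4, 'can_scale_down': False}
```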

One thing most people don’t realize is how much care goes into scaling down. Autoscaling isn’t just about adding capacity; it’s also about reclaiming resources efficiently. If a compute instance’s workload metrics have been consistently low for a sustained duration, Neon initiates the shutdown process. This isn’t immediate: there’s a grace period so that a temporary lull in traffic doesn’t trigger an unnecessary scale-down, only for traffic to pick up again moments later. This cooldown period is crucial for maintaining stability while still achieving cost efficiency.
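The grace period idea is easy to sketch. The class below is a hypothetical illustration, not Neon's implementation: the name `ScaleDownCooldown`, the 300-second grace period, and the 0.25 "low load" threshold are all invented for the example. It only approves a scale-down once the load has stayed low for the whole grace period, and any busy sample resets the clock.

```python
class ScaleDownCooldown:
    """Approve a scale-down only after a sustained quiet period."""

    def __init__(self, grace_seconds, low=0.25):
        self.grace_seconds = grace_seconds
        self.low = low              # hypothetical "low load" threshold
        self._idle_since = None     # when the current quiet streak began

    def observe(self, now, utilization):
        """Record one sample taken at time `now` (in seconds).

        Returns True once utilization has stayed below `low` for at
        least `grace_seconds` without interruption.
        """
        if utilization >= self.low:
            self._idle_since = None         # traffic is back: reset
            return False
        if self._idle_since is None:
            self._idle_since = now          # quiet streak starts here
        return now - self._idle_since >= self.grace_seconds


cooldown = ScaleDownCooldown(grace_seconds=300)
print(cooldown.observe(0, 0.10))     # False: the quiet streak just started
print(cooldown.observe(200, 0.10))   # False: only 200 s of quiet so far
print(cooldown.observe(250, 0.60))   # False: a burst resets the clock
print(cooldown.observe(300, 0.10))   # False: a new streak starts at 300
print(cooldown.observe(600, 0.10))   # True: 300 s of uninterrupted quiet
```

The burst at 250 seconds is exactly the "temporary lull" case: without the reset, the instance would have been shut down right before traffic returned.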

The next concept you’ll want to understand is how Neon manages connection pooling across these dynamically scaling compute instances.
