Running batch predictions on Vertex AI can slash your inference costs, but it’s not just about throwing more data at the model.

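Let’s walk through a batch prediction job for an image classification model. We’ll use the Vertex AI SDK for Python to set up the job, pointing it at our input data and at a trained model registered in Vertex AI. Batch prediction runs against the Model resource directly, so no deployed Endpoint is needed. The key arguments are the input source (gcs_source or bigquery_source) and the output destination (gcs_destination_prefix or bigquery_destination_prefix).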

from google.cloud import aiplatform

aiplatform.init(project='your-gcp-project-id', location='us-central1')

# Look up the trained model by its full resource name.
# Batch prediction reads its input from the source passed to batch_predict(),
# so no Dataset object is needed here.
model = aiplatform.Model('projects/your-gcp-project-id/locations/us-central1/models/your-model-id')

job = model.batch_predict(
    job_display_name='my-image-classification-batch-prediction',
    # JSONL instance file in Cloud Storage (placeholder path), matching instances_format below.
    gcs_source='gs://your-gcp-bucket/batch_inputs/instances.jsonl',
    gcs_destination_prefix='gs://your-gcp-bucket/batch_predictions',
    instances_format='jsonl',
    predictions_format='jsonl',
    machine_type='n1-standard-4',
    accelerator_type='NVIDIA_TESLA_T4',
    accelerator_count=1,
    starting_replica_count=1,
    max_replica_count=2
)

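# batch_predict() creates the job and, by default (sync=True), blocks until it
# finishes, so there is no separate run() step.
print(job.state)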

This looks straightforward, but the magic is in how Vertex AI manages the underlying infrastructure. For batch prediction, Vertex AI provisions machines, distributes your data across them, runs predictions in parallel, and then aggregates the results. You don’t need to manage clusters or scaling yourself. The machine_type, accelerator_type, accelerator_count, starting_replica_count, and max_replica_count are your primary knobs for performance and cost.

The problem this solves is paying for a real-time API call on every single prediction. When you have thousands or millions of data points that don’t need an immediate response, batch prediction is usually much cheaper. Machines run only for the duration of the job, so you are not keeping a costly endpoint alive for sporadic requests. The instances_format and predictions_format specify how your input data and output predictions are structured, commonly jsonl for ease of processing.
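To make the jsonl input format concrete, here is a minimal sketch that builds an instance file with one JSON object per line. It assumes the AutoML-style image instance layout (a `content` Cloud Storage URI plus a `mimeType`); a custom model may expect a different schema. The helper name `build_instances_jsonl` and the bucket paths are illustrative, not part of the SDK.

```python
import json


def build_instances_jsonl(image_uris, mime_type="image/jpeg"):
    """Return JSONL text with one prediction instance per line.

    Assumes the AutoML-style image schema; adjust the keys for a custom model.
    """
    lines = []
    for uri in image_uris:
        if not uri.startswith("gs://"):
            raise ValueError(f"expected a Cloud Storage URI, got: {uri}")
        lines.append(json.dumps({"content": uri, "mimeType": mime_type}))
    return "\n".join(lines) + "\n"


# Placeholder image paths, for illustration only.
jsonl_text = build_instances_jsonl([
    "gs://your-gcp-bucket/images/cat_001.jpg",
    "gs://your-gcp-bucket/images/dog_002.jpg",
])
print(jsonl_text, end="")
```

You would upload the resulting text to Cloud Storage and pass its URI as the job’s input source.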

Internally, Vertex AI takes your input data (e.g., images in GCS or records in BigQuery), splits it into shards, and assigns each shard to a worker instance. These workers load your model and process their assigned data. The machine_type dictates the CPU and memory, while accelerator_type and accelerator_count specify GPUs. The starting_replica_count and max_replica_count define the autoscaling range for your workers. Vertex AI handles the distribution and aggregation automatically.
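The shard-and-process pattern described above can be illustrated with a small local sketch. This is a conceptual toy, not Vertex AI’s actual implementation: the round-robin split, the thread pool, and the `fake_predict` stand-in are all assumptions chosen to keep the example self-contained.

```python
from concurrent.futures import ThreadPoolExecutor


def make_shards(records, num_workers):
    """Split records into num_workers roughly equal shards (round-robin)."""
    return [records[i::num_workers] for i in range(num_workers)]


def fake_predict(record):
    """Hypothetical stand-in for a model call; 'predicts' the record's length."""
    return {"instance": record, "prediction": len(record)}


def run_sharded(records, num_workers):
    """Process each shard on its own worker, then gather all outputs."""
    shards = make_shards(records, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        per_shard = list(pool.map(lambda shard: [fake_predict(r) for r in shard], shards))
    # Flatten per-shard outputs; the result is grouped by shard, not input order.
    return [row for shard_rows in per_shard for row in shard_rows]


results = run_sharded(["a.jpg", "bb.jpg", "ccc.jpg", "dddd.jpg", "eeeee.jpg"], num_workers=2)
print(len(results))
```

Note that the gathered output is grouped by shard rather than by input order, which is why each output row carries its own instance alongside the prediction.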

The most surprising cost-saver, and an often overlooked one, is the machine_type and accelerator_type selection. People often default to an entry-level GPU (like NVIDIA_TESLA_T4) and a modest machine type, assuming that more parallelism will compensate. However, for many models, especially those with significant CPU-bound preprocessing or postprocessing, a larger machine type (n1-standard-8 or n1-standard-16) paired with a single T4 can be far more efficient than multiple T4s on small machines. A GPU that waits on an undersized CPU sits idle while you pay for it, and every extra replica pays its own startup and model-loading cost, so extreme parallelism does not help if the individual workers aren’t balanced. Experimentation is key, but don’t dismiss beefier CPUs.
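A back-of-the-envelope estimate makes this trade-off easier to reason about. The sketch below is a simplification: it models cost as (startup time + processing time) × replicas × hourly price. Every number in it (throughput, prices, startup time) is hypothetical and should be replaced with figures measured on your own model and current pricing.

```python
def estimate_job_cost(num_items, items_per_sec_per_replica, replicas,
                      hourly_price_per_replica, startup_minutes=5.0):
    """Rough batch job cost: (startup + processing hours) x replicas x price."""
    processing_hours = num_items / (items_per_sec_per_replica * replicas) / 3600
    wall_hours = startup_minutes / 60 + processing_hours
    return wall_hours * replicas * hourly_price_per_replica


NUM_ITEMS = 1_000_000

# Hypothetical: two small machines, each CPU-bound at 20 items/s.
cost_small = estimate_job_cost(NUM_ITEMS, items_per_sec_per_replica=20,
                               replicas=2, hourly_price_per_replica=0.60)

# Hypothetical: one larger machine that keeps its GPU fed at 60 items/s.
cost_large = estimate_job_cost(NUM_ITEMS, items_per_sec_per_replica=60,
                               replicas=1, hourly_price_per_replica=1.20)

print(f"2 x small: ${cost_small:.2f}  1 x large: ${cost_large:.2f}")
```

With these made-up inputs the single larger machine comes out cheaper, but the conclusion flips easily if the model is GPU-bound, which is why measuring throughput per configuration matters.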

Once your batch prediction job completes successfully, the next hurdle is often analyzing and integrating the massive output files.
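As a starting point for that analysis, here is a minimal sketch that reads prediction lines and keeps the top label per image. It assumes each output line is a JSON object with `instance` and `prediction` keys, and that the prediction holds parallel `displayNames` and `confidences` lists, as AutoML image classification output does; a custom model’s prediction payload may look different. The sample lines are invented for illustration.

```python
import json


def top_predictions(jsonl_lines):
    """Return (image_uri, best_label, confidence) for each prediction line."""
    results = []
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        row = json.loads(line)
        pred = row["prediction"]
        confidences = pred["confidences"]
        best = max(range(len(confidences)), key=confidences.__getitem__)
        results.append((row["instance"]["content"], pred["displayNames"][best], confidences[best]))
    return results


# Invented sample lines in the assumed output layout.
sample_lines = [
    '{"instance": {"content": "gs://your-gcp-bucket/images/cat_001.jpg", "mimeType": "image/jpeg"}, '
    '"prediction": {"displayNames": ["cat", "dog"], "confidences": [0.91, 0.09]}}',
    '{"instance": {"content": "gs://your-gcp-bucket/images/dog_002.jpg", "mimeType": "image/jpeg"}, '
    '"prediction": {"displayNames": ["cat", "dog"], "confidences": [0.2, 0.8]}}',
]
summary = top_predictions(sample_lines)
for uri, label, confidence in summary:
    print(uri, label, confidence)
```

In practice you would stream the lines from the result files under your output prefix instead of an in-memory list.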
