Load testing the Gemini API isn’t just about how fast it answers. It is about how predictably it answers under pressure, and that predictability is what reveals bottlenecks in your application and in the underlying infrastructure.
Let’s see it in action. Imagine you’re building a real-time content moderation system. You’ve got incoming user posts, and each one needs to be checked by Gemini for policy violations.

    import time
    from concurrent.futures import ThreadPoolExecutor

    import google.generativeai as genai

    # Configure your API key
    genai.configure(api_key="YOUR_API_KEY")

    # Define the model
    model = genai.GenerativeModel('gemini-pro')

    # Sample prompt for content moderation
    prompt = "Please review the following text for hate speech: 'This is a sample text.'"

    # --- Configuration for Load Test ---
    num_requests = 100
    concurrency_level = 10  # Maximum number of requests in flight at once


    def send_request(_):
        """Send one request; return its latency in seconds, or None on failure."""
        # perf_counter is monotonic, so it is safer than time.time() for intervals
        start_time = time.perf_counter()
        try:
            model.generate_content(prompt)
        except Exception as e:
            print(f"Request failed: {e}")
            return None
        return time.perf_counter() - start_time


    # --- Running the Load Test ---
    # The pool caps in-flight requests at concurrency_level.
    # Leaving the with-block waits for every request to complete.
    with ThreadPoolExecutor(max_workers=concurrency_level) as pool:
        results = list(pool.map(send_request, range(num_requests)))

    # --- Data Collection ---
    # Results are gathered in the main thread, so no shared counters or locks are needed.
    latencies = [r for r in results if r is not None]
    successful_requests = len(latencies)
    failed_requests = num_requests - successful_requests

    # --- Results ---
    print(f"Total requests sent: {num_requests}")
    print(f"Successful requests: {successful_requests}")
    print(f"Failed requests: {failed_requests}")
    if latencies:
        avg_latency = sum(latencies) / len(latencies)
        p95_latency = sorted(latencies)[int(len(latencies) * 0.95)]
        print(f"Average latency: {avg_latency:.4f} seconds")
        print(f"95th percentile latency: {p95_latency:.4f} seconds")
    else:
        print("No successful requests to calculate latency.")

This script sends 100 moderation requests with a target concurrency of 10. It records the latency of each successful request and reports the average and 95th percentile latency alongside success and failure counts.
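Average and P95 latency leave out throughput and the rest of the distribution. A minimal sketch of a fuller summary follows, assuming you also record wall-clock time around the whole run. `percentile` and `summarize` are hypothetical helper names, the percentile uses the nearest-rank method (one of several common definitions), and the latencies in the example are made up.

```python
import math


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already-sorted, non-empty list (pct is an integer 1..100)."""
    rank = math.ceil(pct * len(sorted_values) / 100)  # 1-based rank
    return sorted_values[max(rank, 1) - 1]


def summarize(latencies, wall_seconds):
    """Return throughput and latency percentiles for one load-test run."""
    ordered = sorted(latencies)
    return {
        "throughput_rps": len(ordered) / wall_seconds,
        "p50": percentile(ordered, 50),
        "p95": percentile(ordered, 95),
        "p99": percentile(ordered, 99),
    }


# Example with made-up latencies (seconds) from a 2-second run.
stats = summarize([0.2, 0.4, 0.3, 0.9, 0.5], wall_seconds=2.0)
print(stats)
```

Throughput here counts only successful requests per second of wall-clock time, so a run with many failures will show a low figure even if each individual success was fast.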
The core problem this addresses is understanding how your application behaves when it hits the Gemini API with a high volume of requests. You might have a perfectly fine application for single requests, but as concurrency increases, network saturation, API rate limits, or even your own application’s resource contention can cause performance to degrade dramatically. This isn’t just about Gemini; it’s about the entire system from your user to the Gemini model and back.
Internally, the Gemini API is designed to handle massive scale. It uses a distributed architecture where requests are routed to available model instances. When you send a request, it goes through Google’s infrastructure, gets processed by a Gemini model shard, and the response is sent back. The latency you observe is the sum of: network travel time (your client to Google’s ingress, then back), request queuing within Google’s systems, model inference time, and any processing time on your end. Throughput is limited by the rate at which these requests can be successfully processed by the API without error.
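A quick way to relate the two quantities is Little’s law: with a fixed number of requests in flight, sustained throughput is roughly concurrency divided by mean latency. The sketch below is a back-of-the-envelope estimate under that assumption, not a description of how the API schedules work. `estimated_throughput` is a hypothetical helper and the numbers are illustrative.

```python
def estimated_throughput(concurrency, mean_latency_seconds):
    """Little's law: requests per second sustained by a closed loop of workers."""
    return concurrency / mean_latency_seconds


# 10 workers that each wait about 2 s per request complete about 5 requests/s.
print(estimated_throughput(10, 2.0))  # 5.0

# If latency doubles under load, throughput at the same concurrency halves.
print(estimated_throughput(10, 4.0))  # 2.5
```

If measured throughput falls well below this estimate, the gap is worth investigating: failed requests, rate limiting, and client-side time spent between calls are common suspects.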
The key levers you control are:
- Concurrency (`concurrency_level`): How many requests are in flight at any given moment. Higher concurrency means more requests hitting the API simultaneously, which tests its capacity.
- Request Complexity (Prompt and Response Size): Longer prompts or requests for more detailed responses naturally take longer to process. Varying this in your load test is crucial.
- Batching (if applicable): For certain tasks, you might be able to group multiple smaller requests into a single API call, which can be more efficient. Gemini’s `generate_content` doesn’t directly support batching multiple distinct prompts in one call, but you can pass a list of `contents` to `generate_content` when the items belong to a single, continuous request.
- Error Handling and Retries: A robust retry strategy with exponential backoff is vital for handling transient API errors or rate limits, and it affects your perceived throughput and latency.
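The retry lever can be sketched as follows. This is a minimal illustration of exponential backoff with full jitter, not the SDK’s own retry mechanism. `call_with_retries` and its parameters are hypothetical, and it retries on any exception for brevity. In practice you would retry only errors you know to be transient, and which exception types those are depends on the client library version.

```python
import random
import time


def call_with_retries(fn, max_attempts=5, base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Call fn(); on exception, wait with exponential backoff plus jitter and retry."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the last error to the caller.
            # Full jitter: pick a random delay up to the capped exponential bound.
            bound = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, bound))


# Stand-in for an API call that fails twice and then succeeds.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"


# The demo passes a no-op sleep so it finishes instantly.
result = call_with_retries(flaky, sleep=lambda seconds: None)
print(result, calls["n"])
```

In a load test, the wrapped call might be `lambda: model.generate_content(prompt)`. Time the whole wrapped call, backoff included, if you want the latency a user would actually perceive.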
When load testing, many developers focus solely on the tail latency (P95 or P99) of the API response itself. However, the total latency your application experiences also includes the time it spends preparing the request, handling the response, and any internal queuing or processing. If your application is slow to send requests or slow to process responses, it becomes the bottleneck even when the Gemini API itself responds quickly: the API call may account for only a small share of a high user-facing latency.
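One way to see where the time goes is to time each phase of a request separately. The sketch below assumes the request path splits cleanly into three callables. `timed_phases`, `prepare`, `call_api`, and `handle` are hypothetical names, and the stand-ins use `time.sleep` so the example runs without network access.

```python
import time


def timed_phases(prepare, call_api, handle):
    """Run one request end to end and return seconds spent in each phase."""
    t0 = time.perf_counter()
    request = prepare()
    t1 = time.perf_counter()
    response = call_api(request)
    t2 = time.perf_counter()
    handle(response)
    t3 = time.perf_counter()
    return {"prepare": t1 - t0, "api": t2 - t1, "handle": t3 - t2, "total": t3 - t0}


def fake_api(request):
    time.sleep(0.05)  # Stand-in for the network round trip and model inference.
    return "verdict"


timings = timed_phases(
    prepare=lambda: "sample post",
    call_api=fake_api,
    handle=lambda response: time.sleep(0.01),  # Stand-in for post-processing.
)
print({name: round(seconds, 3) for name, seconds in timings.items()})
```

The `api` figure is still measured from the client, so it includes network time and any server-side queuing. What it isolates is the share of total latency your own code is responsible for.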
After you’ve optimized your load testing and observed your application’s behavior under stress, the next logical step is to investigate adaptive request routing based on observed latency and error rates.