Storage Transfer Service can move petabytes of data into Cloud Storage, but its real value is that it handles network congestion and retries so you don’t have to.

Let’s see it in action. Imagine you’ve got a massive dataset on Amazon S3 and you want to migrate it to a Cloud Storage bucket. You’ll define a transfer job in the Google Cloud Console or via the gcloud CLI.

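Here’s a simplified gcloud command to get you started. It assumes a recent gcloud release, where transfer jobs live under the gcloud transfer command group, and a local aws-creds.json file that holds your AWS accessKeyId and secretAccessKey:

gcloud transfer jobs create s3://my-s3-bucket gs://my-gcs-bucket/migrated-data/ \
  --project=your-gcp-project-id \
  --description="S3 to GCS Migration" \
  --source-creds-file=aws-creds.json \
  --include-modified-before-absolute=2023-01-01T00:00:00Z \
  --overwrite-when=always \
  --schedule-starts=2024-07-29T10:00:00Z

The same job can also be written out as a TransferJob resource, which is what you send when you create jobs through the API or a client library. A transfer_config.json in that shape might look something like this (field names are shown in snake_case, as in the client libraries; the REST API spells them in camelCase):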

{
  "description": "Migrate data from S3 bucket my-s3-bucket to GCS bucket my-gcs-bucket",
  "project_id": "your-gcp-project-id",
  "status": "ENABLED",
  "transfer_spec": {
    "aws_s3_data_source": {
      "bucket_name": "my-s3-bucket",
      "aws_access_key": {
        "access_key_id": "AKIAIOSFODNN7EXAMPLE",
        "secret_access_key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
      }
    },
    "gcs_data_sink": {
      "bucket_name": "my-gcs-bucket",
      "path": "migrated-data/"
    },
    "object_conditions": {
      "last_modified_before": "2023-01-01T00:00:00Z"
    },
    "transfer_options": {
      "overwrite_objects_already_existing_in_sink": true,
      "delete_objects_from_source_after_transfer": false
    }
  },
  "schedule": {
    "schedule_start_date": {
      "year": 2024,
      "month": 7,
      "day": 29
    },
    "start_time_of_day": {
      "hours": 10,
      "minutes": 0,
      "seconds": 0
    }
  }
}

This configuration tells Storage Transfer Service to:

  • Read from my-s3-bucket in AWS.
  • Use the provided AWS access key for authentication.
  • Write to my-gcs-bucket in Google Cloud, specifically into a migrated-data subdirectory.
  • Only transfer objects last modified before January 1st, 2023.
  • Overwrite any existing objects in the destination that have the same name.
  • Start the transfer on July 29th, 2024, at 10:00 AM UTC.
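
The filter and schedule in that list can be sketched in a few lines of Python. This is only a local illustration of the rules, not a call into the service. The helper names (parse_utc, scheduled_start, matches_conditions) are made up for this example, and it assumes the config has been loaded into a dict shaped like the JSON above.

```python
from datetime import datetime, timezone

# Trimmed copy of the job config above (only the fields used here).
config = {
    "transfer_spec": {
        "object_conditions": {"last_modified_before": "2023-01-01T00:00:00Z"},
    },
    "schedule": {
        "schedule_start_date": {"year": 2024, "month": 7, "day": 29},
        "start_time_of_day": {"hours": 10, "minutes": 0, "seconds": 0},
    },
}


def parse_utc(timestamp: str) -> datetime:
    """Parse a timestamp like 2023-01-01T00:00:00Z into an aware UTC datetime."""
    parsed = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def scheduled_start(schedule: dict) -> datetime:
    """Combine the start date and time of day into one UTC datetime."""
    d = schedule["schedule_start_date"]
    t = schedule["start_time_of_day"]
    return datetime(d["year"], d["month"], d["day"],
                    t["hours"], t["minutes"], t["seconds"], tzinfo=timezone.utc)


def matches_conditions(last_modified: datetime, conditions: dict) -> bool:
    """True if an object passes the last_modified_before filter (when one is set)."""
    cutoff = conditions.get("last_modified_before")
    return cutoff is None or last_modified < parse_utc(cutoff)


conditions = config["transfer_spec"]["object_conditions"]
print(scheduled_start(config["schedule"]).isoformat())
print(matches_conditions(parse_utc("2022-12-31T23:59:59Z"), conditions))  # older object
print(matches_conditions(parse_utc("2023-06-01T00:00:00Z"), conditions))  # newer object
```

An object modified on New Year's Eve 2022 passes the filter, while one from June 2023 is left behind in S3.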

Behind the scenes, Storage Transfer Service isn’t just making a simple API call per object. For cloud sources like S3, it’s a managed, distributed service: Google-run workers pull data from your source in parallel, so there is nothing for you to deploy (self-hosted agents only come into play for sources such as on-premises file systems). It batches requests, handles transient network errors, and retries failed transfers automatically. It can also take advantage of Google’s own network backbone when moving data between cloud providers.
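
To make "retrying failed transfers automatically" concrete, here is a minimal sketch of the general pattern: retry with exponential backoff and jitter. It is not the service's actual implementation, which is not public. TransientError, retry_with_backoff, and flaky_copy are hypothetical names invented for the illustration.

```python
import random
import time


class TransientError(Exception):
    """Stand-in for a retryable failure such as a timeout or an HTTP 503."""


def retry_with_backoff(operation, max_attempts=5, base_delay=0.5, max_delay=30.0,
                       sleep=time.sleep):
    """Call operation() until it succeeds or max_attempts is reached.

    Before each retry, wait a random time between 0 and
    min(max_delay, base_delay * 2**attempt) so that many workers
    do not all retry at the same instant.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(random.uniform(0, min(max_delay, base_delay * 2 ** attempt)))


# Usage: a fake copy that fails twice, then succeeds.
calls = {"count": 0}


def flaky_copy():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("simulated network hiccup")
    return "copied"


# Pass a no-op sleep so the example runs instantly.
result = retry_with_backoff(flaky_copy, sleep=lambda seconds: None)
print(result, "after", calls["count"], "attempts")
```

The point of a managed service is that this loop, along with the bookkeeping about which objects still need another attempt, lives on Google's side rather than in your migration scripts.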

You also have some control over how hard the service pushes. You can schedule runs for quieter hours on your source system, and for agent-based transfers (for example, from on-premises file systems) you can cap the bandwidth an agent pool is allowed to use. For example, to limit a pool to 50 MB/s:

gcloud transfer agent-pools update your-agent-pool \
  --bandwidth-limit=50

The bandwidth-limit value is in megabytes per second and is shared by all agents in the pool. A cap like this helps avoid overwhelming your source storage or saturating a link that other workloads share. Egress charges from your original cloud provider are a different matter: they depend on how many bytes leave, not how fast, so prefix filters and object_conditions are the tools for keeping them in check.
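
A bandwidth cap also puts a floor under how long a job can take, and the arithmetic is worth doing before picking a number. The sketch below is a back-of-the-envelope estimate that assumes the cap is fully used for the whole run and ignores per-object overhead. The function names are made up for this example.

```python
def mbps_to_megabytes_per_second(mbps: float) -> float:
    """Convert megabits per second to megabytes per second (8 bits per byte)."""
    return mbps / 8


def transfer_hours(total_bytes: float, mbps: float) -> float:
    """Best-case hours to move total_bytes at a sustained rate of mbps."""
    bits = total_bytes * 8
    seconds = bits / (mbps * 1_000_000)
    return seconds / 3600


one_terabyte = 10 ** 12  # decimal terabyte, in bytes

print(mbps_to_megabytes_per_second(100))            # 12.5 MB/s
print(round(transfer_hours(one_terabyte, 100), 1))  # 22.2 hours per TB at 100 Mbps
print(round(transfer_hours(one_terabyte, 400), 1))  # 5.6 hours per TB at 400 Mbps (50 MB/s)
```

At 100 Mbps, a petabyte would take roughly two and a half years, so tight caps only make sense for small datasets or for limited windows during the migration.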

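What most people miss is the ability to perform incremental transfers. Once an initial job completes, you can simply run the same job again or put it on a repeating schedule: by default, Storage Transfer Service skips objects that already exist in the destination with matching contents and copies only what is new or has changed. Leave overwrite_objects_already_existing_in_sink set to false for this, because setting it to true forces every matching object to be copied again. Alternatively, create a new job with the same source and destination and narrow it with object_conditions such as last_modified_since. Keep in mind that a fixed cutoff like the last_modified_before filter in the config above will keep excluding newer objects until you change or remove it. Either way, the service picks up only new or modified files, which makes it efficient for ongoing migrations or keeping datasets synchronized.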
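
Conceptually, an incremental run boils down to comparing a listing of the source against a listing of the destination. The sketch below shows that idea with plain dicts. It is a simplification under stated assumptions: the real service decides what counts as "changed" from object metadata and checksums on its side, and plan_incremental_copy is a hypothetical helper, not part of any Google library.

```python
def plan_incremental_copy(source: dict, sink: dict) -> list:
    """Return names of source objects that are missing from, or differ in, the sink.

    Both arguments map object name -> (size_in_bytes, checksum).
    """
    to_copy = []
    for name, fingerprint in sorted(source.items()):
        if sink.get(name) != fingerprint:  # absent (None) or changed
            to_copy.append(name)
    return to_copy


source = {
    "logs/a.csv": (120, "md5-aaa"),
    "logs/b.csv": (300, "md5-bbb2"),  # modified since the first run
    "logs/c.csv": (50, "md5-ccc"),    # new since the first run
}
sink = {
    "logs/a.csv": (120, "md5-aaa"),
    "logs/b.csv": (280, "md5-bbb1"),
}

print(plan_incremental_copy(source, sink))  # ['logs/b.csv', 'logs/c.csv']
```

Only the modified and the new object are scheduled for copying. The unchanged logs/a.csv is skipped, which is why repeat runs over a mostly static dataset finish quickly.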

After a successful transfer, you’ll likely want to explore Cloud Storage features like lifecycle management to automatically delete older versions of your data.
