Avro vs. Protobuf: Serialization Showdown

Avro and Protobuf are both powerful tools for serializing structured data, but they tackle the problem from fundamentally different angles.

Let’s see how this looks in practice. Imagine we have a simple User message with an id and a name.

Protobuf:

First, you define your schema in a .proto file:

syntax = "proto3";

message User {
  int64 id = 1;
  string name = 2;
}

Then, you use the Protobuf compiler (protoc) to generate code for your language. For Python:

protoc --python_out=. user.proto

This creates a user_pb2.py file. Now you can use it:

from user_pb2 import User

# Create a user object
user = User(id=123, name="Alice")

# Serialize it
serialized_data = user.SerializeToString()
print(f"Serialized (Protobuf): {serialized_data}")

# Deserialize it
new_user = User()
new_user.ParseFromString(serialized_data)
print(f"Deserialized (Protobuf): ID={new_user.id}, Name={new_user.name}")

Avro:

Avro works with two schemas: the writer’s schema, which was used to encode the data, and the reader’s schema, which the consuming application expects. This is a key difference from Protobuf.

First, define the writer’s schema in a JSON file, saved here as user.avsc:

{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "name", "type": "string"}
  ]
}

Assume for now that the reader’s schema is identical. You’d typically use an Avro library for your language. For Python:

import avro.schema
from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader, DatumWriter

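# Load the writer's schema (text mode, since the parser expects a JSON string).
# Older avro-python3 releases spell this function avro.schema.Parse.
with open("user.avsc") as schema_file:
    schema = avro.schema.parse(schema_file.read())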

# Data to write
user_data = {"id": 123, "name": "Alice"}

# Serialize
writer = DataFileWriter(open("users.avro", "wb"), DatumWriter(), schema)
writer.append(user_data)
writer.close()

# Deserialize (the container file carries the writer's schema in its header)
reader = DataFileReader(open("users.avro", "rb"), DatumReader())
deserialized_users = list(reader)
reader.close()
print(f"Deserialized (Avro): {deserialized_users[0]}")

A notable property of Avro is that it embeds neither field names nor field tags in the serialized records; it relies entirely on the schema to interpret the byte stream.
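
To make this concrete, here is a minimal hand-rolled sketch of Avro's binary encoding for the User record. It is not the avro library's code; the helper names zigzag_varint and encode_user are made up for illustration, and it assumes the two-field schema shown earlier (a long followed by a string).

```python
def zigzag_varint(n: int) -> bytes:
    """Encode a signed 64-bit integer the way Avro encodes int/long values."""
    z = (n << 1) ^ (n >> 63)  # zigzag: small magnitudes map to small unsigned values
    out = bytearray()
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_user(user_id: int, name: str) -> bytes:
    """Encode a User record: values in schema order, with no names or tags."""
    raw_name = name.encode("utf-8")
    # A string is its byte length (encoded as a long) followed by the UTF-8 bytes.
    return zigzag_varint(user_id) + zigzag_varint(len(raw_name)) + raw_name


payload = encode_user(123, "Alice")
print(payload.hex(" "))  # f6 01 0a 41 6c 69 63 65
```

The eight bytes hold only the id, the string length, and the characters of the name. Nothing in them says which field is which.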

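Because field names are left out, each serialized record is very compact: just a sequence of values in schema order. The flip side is that a reader always needs the writer’s schema to decode the bytes. Container files such as users.avro above store it in the file header; raw messages need it supplied some other way, for example through a schema registry. If the reader’s schema differs from the writer’s (e.g., an optional field was added on one side), Avro applies its schema resolution rules to reconcile them: fields missing from the data are filled from the reader’s defaults, and fields the reader does not know about are skipped. This flexibility is crucial for systems where schemas change over time, like microservices.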
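
The two resolution rules mentioned here (fill in defaults, ignore unknown fields) can be sketched in plain Python. This is a deliberately simplified illustration rather than the avro library's implementation: real resolution works on the byte stream with both schemas and also covers aliases, type promotion, and unions. The resolve_record helper and the email and nickname fields are hypothetical.

```python
def resolve_record(written: dict, reader_schema: dict) -> dict:
    """Project an already-decoded record onto the reader's schema (simplified)."""
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in written:
            resolved[name] = written[name]      # field present in the data
        elif "default" in field:
            resolved[name] = field["default"]   # missing in the data: use reader's default
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return resolved  # fields known only to the writer are dropped


# Hypothetical reader schema: adds an optional "email" field with a default.
reader_schema = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
        {"name": "email", "type": ["null", "string"], "default": None},
    ],
}

old_record = {"id": 123, "name": "Alice"}                    # written before "email" existed
newer_record = {"id": 7, "name": "Bob", "nickname": "Bobby"}  # writer has an extra field

print(resolve_record(old_record, reader_schema))
print(resolve_record(newer_record, reader_schema))
```

In both cases the reader ends up with exactly the fields its own schema declares, which is what lets producers and consumers upgrade at different times.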

Protobuf, on the other hand, prefixes every value with a tag that holds its field number (not its name) and its wire type. This makes the data self-describing to a degree: a parser can split a message into numbered fields without the .proto file, although it needs the schema to recover field names and exact types. The tags typically make payloads slightly larger than the equivalent Avro record. Protobuf’s strength lies in its speed and simplicity, and it suits applications where schema evolution is managed through stable field numbers rather than by reconciling a writer’s and a reader’s schema.
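
As a rough illustration of that partial self-description, the sketch below walks the Protobuf wire format by hand, without user_pb2 or the .proto file. The function names are made up for this example, and it handles only the two wire types the User message uses (0 for varints, 2 for length-delimited values). The input bytes are what the proto3 wire format yields for User(id=123, name="Alice").

```python
from typing import List, Tuple


def read_varint(buf: bytes, pos: int) -> Tuple[int, int]:
    """Read a base-128 varint starting at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear: last byte of this varint
            return result, pos
        shift += 7


def decode_fields(buf: bytes) -> List[Tuple[int, int, object]]:
    """Return (field_number, wire_type, raw_value) triples for wire types 0 and 2."""
    fields = []
    pos = 0
    while pos < len(buf):
        tag, pos = read_varint(buf, pos)
        field_number, wire_type = tag >> 3, tag & 0x07
        if wire_type == 0:    # varint (int64, bool, enum, ...)
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:  # length-delimited (string, bytes, nested message)
            length, pos = read_varint(buf, pos)
            value = buf[pos:pos + length]
            pos += length
        else:
            raise ValueError(f"wire type {wire_type} is not handled in this sketch")
        fields.append((field_number, wire_type, value))
    return fields


wire_bytes = bytes.fromhex("087b1205416c696365")  # User(id=123, name="Alice")
print(decode_fields(wire_bytes))  # [(1, 0, 123), (2, 2, b'Alice')]
```

The decoder recovers field numbers and raw values, but it cannot tell that field 1 is called id or that field 2 is a string and not a nested message. The message is nine bytes, one more than the same record in Avro's binary encoding, because of the two tag bytes.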

The core difference boils down to this: Avro is schema-centric, prioritizing schema evolution and compact data by separating the schema from the data. Protobuf is message-centric, embedding schema information (field numbers) within the data for faster parsing and a degree of self-description, at the cost of slightly larger messages.

The next concept you’ll likely grapple with is schema evolution and how each format handles it differently.