LlamaIndex’s PII Redaction module doesn’t just find sensitive data; it actively rewrites your documents to remove it, making RAG systems safer without losing crucial context.
Let’s see it in action. Imagine you have a document with personal information that you want to use in a RAG system, but you need to protect privacy.
from llama_index.core import Document
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.postprocessor import MetadataReplacementPostProcessor
from llama_index.core.extractors import PIIExtractor
# Sample document with PII
doc_text = "John Doe lives at 123 Main St, Anytown, CA 91234. His email is john.doe@example.com and his phone number is (555) 123-4567. He works at Acme Corp."
doc = Document(text=doc_text)
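# Initialize the PII extractor.
# This extractor uses spaCy's NER model to identify PII entities.
# We can specify which entity types to look for.
# Class and argument names here follow this walkthrough; check them against
# your installed llama_index version, which also ships PIINodePostprocessor
# and NERPIINodePostprocessor in llama_index.core.postprocessor.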
pii_extractor = PIIExtractor(
    # Example: extract PERSON, LOC, ORG, EMAIL, and PHONE_NUMBER.
    # By default, it extracts a broader set.
    entity_types=["PERSON", "LOC", "ORG", "EMAIL", "PHONE_NUMBER"]
)
# Initialize a sentence splitter to break down the document
splitter = SentenceSplitter(chunk_size=512, chunk_overlap=20)
# Set up the ingestion pipeline
pipeline = IngestionPipeline(
    transformations=[
        splitter,
        pii_extractor,
        MetadataReplacementPostProcessor(
            # This post-processor replaces the original document content
            # with a redacted version based on the extracted PII.
            # It uses a placeholder like "[REDACTED_PERSON]" for identified entities.
            replacement_dict={
                "PERSON": "[REDACTED_PERSON]",
                "LOC": "[REDACTED_LOCATION]",
                "ORG": "[REDACTED_ORGANIZATION]",
                "EMAIL": "[REDACTED_EMAIL]",
                "PHONE_NUMBER": "[REDACTED_PHONE]",
            }
        ),
    ]
)
# Run the pipeline
nodes = pipeline.run(documents=[doc])
# The 'nodes' will now contain documents where PII has been replaced.
# Let's print the text of the first node to see the result.
print(nodes[0].get_content())
Here’s what the output would look like:
[REDACTED_PERSON] lives at [REDACTED_LOCATION]. His email is [REDACTED_EMAIL] and his phone number is [REDACTED_PHONE]. He works at [REDACTED_ORGANIZATION].
The PII Redaction module in LlamaIndex is designed to tackle the critical challenge of data privacy in RAG systems. When you feed documents into a RAG pipeline, sensitive information like names, addresses, phone numbers, or credit card details can be inadvertently exposed during retrieval or generation. Traditional methods might involve pre-filtering documents, which can be blunt and lose valuable non-PII context. LlamaIndex’s approach integrates directly into the ingestion process.
The core idea is to identify and then mask or replace Personally Identifiable Information (PII) before it’s indexed. This takes a combination of components. First, a PIIExtractor scans the text. It relies on Natural Language Processing (NLP) models, often based on Named Entity Recognition (NER), to flag specific types of PII. You can configure which entity types (such as PERSON, EMAIL, PHONE_NUMBER, CREDIT_CARD, LOC, or ORG) the extractor should look for.
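To make the detection step concrete, here is a minimal, library-independent sketch. It uses two regular expressions as a stand-in for an NER model, so it only catches emails and US-style phone numbers. The `find_pii` helper, the `PATTERNS` table, and the `(start, end, label)` span layout are assumptions made for illustration and are not part of LlamaIndex.

```python
import re

# Hypothetical stand-in for an NER model: one regex per entity type.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE_NUMBER": re.compile(r"\(\d{3}\)\s*\d{3}-\d{4}"),
}


def find_pii(text, entity_types):
    """Return sorted (start, end, label) spans for the requested entity types."""
    spans = []
    for label in entity_types:
        pattern = PATTERNS.get(label)
        if pattern is None:
            continue  # this sketch has no detector for that label
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label))
    return sorted(spans)


text = "His email is john.doe@example.com and his phone number is (555) 123-4567."
spans = find_pii(text, ["EMAIL", "PHONE_NUMBER"])
for start, end, label in spans:
    print(label, text[start:end])
```

Restricting `entity_types` to `["EMAIL"]` would leave the phone number unflagged, which mirrors how narrowing the configured entity types narrows what gets redacted later.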
Once PII is identified, the MetadataReplacementPostProcessor (or a similar mechanism within the pipeline) takes over. This component is responsible for the actual redaction. It examines the identified PII entities and replaces them in the document’s text with predefined placeholders. For instance, a detected name like "Jane Smith" might be replaced with [REDACTED_PERSON], an email address with [REDACTED_EMAIL], and so on. The crucial part is that these placeholders retain some semantic meaning, allowing a downstream RAG model to understand that a person’s name or a location was present, without knowing the specific details. This preserves the structural integrity and general meaning of the document for retrieval purposes.
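The replacement step itself can be sketched in a few lines of plain Python. This is an illustration of the idea rather than LlamaIndex’s implementation: the `redact` function and the hand-built `spans` list are hypothetical, and the spans are assumed to come from whatever detector ran earlier.

```python
def redact(text, spans, replacement_dict):
    """Replace each (start, end, label) span with the placeholder for its label."""
    # Work right to left so earlier offsets stay valid after each substitution.
    for start, end, label in sorted(spans, reverse=True):
        placeholder = replacement_dict.get(label)
        if placeholder is None:
            continue  # no rule for this label: leave the text untouched
        text = text[:start] + placeholder + text[end:]
    return text


text = "Jane Smith can be reached at jane@example.com."
name, email = "Jane Smith", "jane@example.com"
spans = [
    (text.index(name), text.index(name) + len(name), "PERSON"),
    (text.index(email), text.index(email) + len(email), "EMAIL"),
]
replacement_dict = {"PERSON": "[REDACTED_PERSON]", "EMAIL": "[REDACTED_EMAIL]"}

redacted = redact(text, spans, replacement_dict)
print(redacted)
```

Processing spans from the end of the string backwards avoids recomputing offsets after each replacement changes the text length.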
The IngestionPipeline orchestrates this entire process. You define a sequence of transformations: first, perhaps a SentenceSplitter to break down large documents into manageable chunks (nodes). Then, the PIIExtractor to find the sensitive data within these chunks. Finally, the MetadataReplacementPostProcessor to perform the actual redaction on the text of these nodes. The output of the pipeline is a set of nodes ready for indexing, where the sensitive information has been systematically scrubbed.
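The orchestration pattern, where each transformation consumes the previous one’s output, can be shown with a toy pipeline. The `run_pipeline`, `split_sentences`, and `redact_emails` functions below are hypothetical helpers written for this sketch; they only mimic the ordering described above and do not use LlamaIndex classes.

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def split_sentences(chunks):
    """Split every chunk on sentence-ending punctuation followed by whitespace."""
    return [s for chunk in chunks for s in re.split(r"(?<=[.!?])\s+", chunk) if s]


def redact_emails(chunks):
    """Replace every email address in every chunk with a placeholder."""
    return [EMAIL.sub("[REDACTED_EMAIL]", chunk) for chunk in chunks]


def run_pipeline(chunks, transformations):
    """Apply each transformation in order, feeding its output to the next one."""
    for transform in transformations:
        chunks = transform(chunks)
    return chunks


nodes = run_pipeline(
    ["Contact john.doe@example.com today. Thanks for reading."],
    [split_sentences, redact_emails],
)
print(nodes)
```

Order matters in this design: anything that should never see raw PII, such as an embedding or indexing step, has to come after the redaction step in the list.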
The benefit here is twofold: enhanced privacy and maintained utility. By redacting PII during ingestion, you ensure that the indexed data itself is anonymized. This significantly reduces the risk of accidental data leakage when users query the RAG system. At the same time, because the redaction replaces PII with descriptive placeholders rather than removing the whole sentence or paragraph, the RAG model can still retrieve relevant documents based on the context surrounding the PII. For example, a query about "employee phone number" can still retrieve a chunk that now reads "His email is [REDACTED_EMAIL] and his phone number is [REDACTED_PHONE]," because the surrounding words are intact. A query that depends on the redacted value itself, such as a specific city or company name, will no longer match on that value.
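A toy keyword-overlap score illustrates why the surrounding context still supports retrieval. This is not how a vector retriever ranks results; the `tokens` and `overlap` helpers are assumptions made purely for the demonstration.

```python
import re


def tokens(text):
    """Lowercase the text and return its set of word-like tokens."""
    return set(re.findall(r"[a-z_]+", text.lower()))


def overlap(query, chunk):
    """Count how many distinct tokens the query and the chunk share."""
    return len(tokens(query) & tokens(chunk))


redacted = "His email is [REDACTED_EMAIL] and his phone number is [REDACTED_PHONE]."

context_score = overlap("employee phone number", redacted)
value_score = overlap("john.doe@example.com", redacted)
print(context_score, value_score)
```

The context query still shares the tokens "phone" and "number" with the redacted chunk, while the query for the raw email address shares nothing with it.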
Most people focus on the PIIExtractor and the MetadataReplacementPostProcessor as separate steps. However, the real power comes from how they interact within the IngestionPipeline and how the replacement_dict in the post-processor is designed. The keys in this dictionary ("PERSON", "LOC", etc.) must precisely match the entity types that the PIIExtractor is configured to find and output. If your PIIExtractor is set to find PERSON but your replacement_dict only has a key for NAME, the replacement won’t happen for persons. It’s a direct mapping where the output labels from the NER model become the keys for the redaction replacement strategy.
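A quick consistency check between the two configurations can catch this kind of mismatch before ingestion. The `missing_replacements` helper below is a hypothetical utility written for this example, not a LlamaIndex function.

```python
def missing_replacements(entity_types, replacement_dict):
    """Return entity types that have no matching key in the replacement mapping."""
    return sorted(set(entity_types) - set(replacement_dict))


entity_types = ["PERSON", "LOC", "ORG", "EMAIL", "PHONE_NUMBER"]
replacement_dict = {
    "NAME": "[REDACTED_PERSON]",  # wrong key: the extractor emits PERSON
    "LOC": "[REDACTED_LOCATION]",
    "ORG": "[REDACTED_ORGANIZATION]",
    "EMAIL": "[REDACTED_EMAIL]",
    "PHONE_NUMBER": "[REDACTED_PHONE]",
}

missing = missing_replacements(entity_types, replacement_dict)
print(missing)
```

An empty result means every configured entity type has a placeholder; anything listed would pass through the pipeline unredacted.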
The next step after redacting PII is typically to implement a retrieval strategy that can effectively use these anonymized documents, perhaps by focusing on the non-PII context or by using specialized LLMs trained to handle masked entities.