Processing Streaming Multi-Document YAML Inputs
When working with YAML in real-world systems, you will often encounter files that contain multiple YAML documents separated by the --- delimiter. This is common in configuration management tools, CI/CD pipelines, and Kubernetes manifests where a single file defines multiple resources. Processing these streaming multi-document inputs requires a different approach than reading a single YAML document.
Context: What Are Multi-Document YAML Files?
A multi-document YAML file contains several independent YAML documents within a single file, each separated by a line containing three dashes (---). An optional trailing ... can mark the end of a document.
Example of a multi-document YAML structure:
- Document 1: A server configuration
- Document 2: A database configuration
- Document 3: A load balancer configuration
Each document is processed independently, and they do not share any context or references unless explicitly linked through external logic.
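To make the structure concrete, here is a minimal sketch of such a stream, assuming PyYAML is installed. The keys and values (role, host, port, backends) are made up for illustration and do not come from any real configuration.

```python
import yaml

# Hypothetical three-document stream. "---" starts a document and the
# optional "..." marks the end of one.
STREAM = """\
---
role: server
host: web01
port: 8080
...
---
role: database
host: db01
port: 5432
...
---
role: load_balancer
backends: [web01, web02]
"""

documents = list(yaml.safe_load_all(STREAM))
roles = [doc["role"] for doc in documents]
print(roles)  # ['server', 'database', 'load_balancer']
```

Each element of documents is an independent Python object; nothing in one document is visible to the others.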
The Core Challenge: Streaming vs. Loading
The standard yaml.safe_load() function expects exactly one document. If you call it on a multi-document stream, PyYAML does not silently return the first document; it raises a ComposerError ("expected a single document in the stream").
What happens with each loader:
- yaml.safe_load(file) → Raises yaml.composer.ComposerError when the stream contains more than one document
- yaml.safe_load_all(file) → Returns a generator that yields each document one at a time
The safe_load_all() function is the key tool for processing streaming multi-document inputs. It treats the file as a stream of documents rather than a single block of data.
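A quick way to see the difference is to run both functions on the same input. The sketch below assumes PyYAML; in current PyYAML releases, safe_load() on a two-document stream raises a ComposerError, which is a subclass of yaml.YAMLError.

```python
import yaml

TWO_DOCS = "name: Alice\n---\nname: Bob\n"

# safe_load expects exactly one document in the stream.
try:
    yaml.safe_load(TWO_DOCS)
    single_result = "loaded"
except yaml.YAMLError as exc:
    single_result = type(exc).__name__

# safe_load_all yields every document in order.
all_docs = list(yaml.safe_load_all(TWO_DOCS))

print(single_result)  # ComposerError
print(all_docs)       # [{'name': 'Alice'}, {'name': 'Bob'}]
```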
How to Process Multi-Document YAML Inputs
Step 1: Open the YAML file normally
Open the file using Python's built-in open() function in read mode. This gives you a file object that can be passed to the YAML parser.
Step 2: Use safe_load_all() instead of safe_load()
Pass the file object to yaml.safe_load_all(). This returns a generator object, not a list. Each iteration of the generator yields one complete YAML document as a Python object: usually a dictionary, but a document can also be a list, a scalar, or None if it is empty.
Step 3: Iterate through the documents
Use a for loop to process each document individually. Inside the loop, you can inspect, transform, or validate each document as needed.
Step 4: Handle the end of stream
The generator will automatically stop when there are no more documents. You do not need to check for a termination marker.
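The four steps above can be sketched as one small helper. The function name load_documents and the file contents are hypothetical; the example writes a temporary file only so that it is self-contained.

```python
import os
import tempfile

import yaml


def load_documents(path):
    """Yield each YAML document stored in the file at `path`."""
    # Step 1: open the file normally, inside a context manager.
    with open(path, "r", encoding="utf-8") as handle:
        # Steps 2-4: safe_load_all returns a generator that stops by
        # itself once the stream has no more documents.
        yield from yaml.safe_load_all(handle)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "configs.yaml")
    with open(path, "w", encoding="utf-8") as out:
        out.write("env: staging\n---\nenv: production\n")
    envs = [doc["env"] for doc in load_documents(path)]

print(envs)  # ['staging', 'production']
```

Because load_documents is itself a generator, the file stays open only while documents are being consumed and is closed once the stream is exhausted.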
Comparison: Single Document vs. Multi-Document Processing
| Feature | Single Document | Multi-Document |
|---|---|---|
| Function used | yaml.safe_load() | yaml.safe_load_all() |
| Return type | Single Python object (often a dictionary) | Generator of Python objects |
| Memory usage | Builds the whole document at once | Builds one document at a time |
| Use case | Simple config files | Kubernetes manifests, CI pipelines |
| Error handling | Raises on the first parse error | Processing errors can be isolated per document; a parse error still ends the stream |
Common Patterns for Streaming Multi-Document YAML
Pattern 1: Counting documents
You can count how many documents are in a file by iterating through the generator and incrementing a counter. This is useful for validation or reporting.
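A minimal sketch of the counting pattern, assuming PyYAML. The helper name count_documents is illustrative.

```python
import yaml


def count_documents(stream):
    """Count the documents in a YAML stream without keeping them in memory."""
    return sum(1 for _ in yaml.safe_load_all(stream))


total = count_documents("a: 1\n---\nb: 2\n---\nc: 3\n")
empty_total = count_documents("")
print(total, empty_total)  # 3 0
```

Note that every document is still fully parsed; the saving is that no parsed document is retained after it has been counted.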
Pattern 2: Filtering documents by type
Each document often contains a key like kind or type that identifies what it represents. You can check this key inside the loop and only process documents that match your criteria.
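The filtering pattern might look like the following sketch. The manifest is a simplified, hypothetical Kubernetes-style input, and documents_of_kind is an illustrative helper name.

```python
import yaml

# Hypothetical, heavily simplified manifest with three resources.
MANIFEST = """\
apiVersion: v1
kind: Service
metadata:
  name: web
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: web-config
"""


def documents_of_kind(stream, kind):
    """Yield only the mapping documents whose 'kind' key equals `kind`."""
    for doc in yaml.safe_load_all(stream):
        # Skip empty documents (None) and non-mapping documents.
        if isinstance(doc, dict) and doc.get("kind") == kind:
            yield doc


deployments = list(documents_of_kind(MANIFEST, "Deployment"))
names = [doc["metadata"]["name"] for doc in deployments]
print(names)  # ['web']
```

Using doc.get("kind") rather than doc["kind"] keeps the filter from failing on documents that lack the key.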
Pattern 3: Collecting all documents into a list
If you need to access all documents later, you can convert the generator to a list using list(yaml.safe_load_all(file)). Be cautious with large files as this loads everything into memory.
Pattern 4: Processing documents with error isolation
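Since each document is independent, you can wrap the processing of each document in a try-except block so that one document failing validation or transformation does not crash the entire pipeline. This isolates processing errors only: if the parser itself hits malformed YAML, safe_load_all() raises a yaml.YAMLError during iteration and the generator cannot resume with later documents.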
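The sketch below illustrates both kinds of failure, assuming PyYAML. The names process_all and get_port are hypothetical. An inner try-except isolates errors raised while handling a document, and an outer one catches a parser error, after which the remaining documents in the stream are not available.

```python
import yaml


def process_all(stream, handler):
    """Apply `handler` to each document, collecting results and errors."""
    results, errors = [], []
    try:
        for index, doc in enumerate(yaml.safe_load_all(stream)):
            try:
                results.append(handler(doc))
            except Exception as exc:
                # Processing failure: record it and continue with the next document.
                errors.append((index, type(exc).__name__))
    except yaml.YAMLError:
        # Parser failure: the generator is finished, later documents are lost.
        errors.append(("parse", "YAMLError"))
    return results, errors


def get_port(doc):
    return doc["port"]


# The second document lacks 'port': only that document is skipped.
ok_results, ok_errors = process_all(
    "port: 80\n---\nname: no-port\n---\nport: 443\n", get_port
)
# The second document is not valid YAML: the stream ends there.
bad_results, bad_errors = process_all(
    "port: 80\n---\nport: @80\n---\nport: 443\n", get_port
)
print(ok_results, ok_errors)
print(bad_results, bad_errors)
```

If later documents must survive a syntax error in an earlier one, one option is to split the raw text on the separator lines yourself and parse each chunk separately, at the cost of handling the YAML document markers by hand.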
Practical Tips for Engineers
Always use safe_load_all() for multi-document files
Never assume a YAML file contains only one document. Using safe_load_all() as a default practice ensures your code works with both single and multi-document inputs.
Be aware of the generator behavior
The generator from safe_load_all() can only be iterated once. If you need to process the documents multiple times, convert to a list first or restructure your logic.
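A short sketch of the one-pass behavior, assuming PyYAML. The second comprehension sees nothing because the generator is already exhausted.

```python
import yaml

STREAM = "id: 1\n---\nid: 2\n"

docs_iter = yaml.safe_load_all(STREAM)
first_pass = [doc["id"] for doc in docs_iter]
second_pass = [doc["id"] for doc in docs_iter]  # generator already consumed

# Materialize once if several passes are needed.
docs_list = list(yaml.safe_load_all(STREAM))
ids_a = [doc["id"] for doc in docs_list]
ids_b = [doc["id"] for doc in docs_list]

print(first_pass, second_pass)  # [1, 2] []
print(ids_a, ids_b)             # [1, 2] [1, 2]
```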
Handle empty files gracefully
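If the YAML file is empty or contains no documents, the generator yields nothing and your loop body never runs, which can silently hide a missing or truncated input. An empty document between two --- separators is a different case: it is yielded as None. Add explicit checks for both cases.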
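One possible guard is sketched below, assuming PyYAML. The helper name load_nonempty_documents and the choice to raise ValueError are illustrative, and the helper collects documents into a list, so it trades streaming for a simple emptiness check.

```python
import yaml


def load_nonempty_documents(stream):
    """Return all non-empty documents, or raise if the input has none."""
    # Empty documents (for example, two consecutive '---' lines) load as None.
    docs = [doc for doc in yaml.safe_load_all(stream) if doc is not None]
    if not docs:
        raise ValueError("no YAML documents found in input")
    return docs


docs = load_nonempty_documents("a: 1\n---\n---\nb: 2\n")
print(docs)  # [{'a': 1}, {'b': 2}]

try:
    load_nonempty_documents("")
except ValueError as exc:
    message = str(exc)
    print(message)  # no YAML documents found in input
```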
Use context managers for file handling
Always use the with open() statement to ensure the file is properly closed after reading, even if an error occurs during processing.
When to Use Multi-Document YAML Processing
Use this approach when:
- Reading Kubernetes manifest files that define multiple resources
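- Processing CI/CD pipeline definitions that are split into several documents in one file
- Handling configuration bundles where each document is a separate component
- Working with tools that write several documents to one stream, such as helm template output (an Ansible playbook, by contrast, is normally a single document containing a list of plays)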
- Parsing log or event streams formatted as YAML documents
Avoid this approach when:
- The YAML file is extremely large and you need random access to documents
- Documents are interdependent and must be validated together
- You need to preserve comments or formatting from the original file
- The input is guaranteed to be a single document, in which case yaml.safe_load() is simpler and returns the object directly
Summary Checklist
- Use yaml.safe_load_all() for any file that may contain multiple YAML documents
- Iterate through the generator with a for loop to process each document
- Handle each document independently to isolate errors
- Convert to a list only when you need multiple passes over the data
- Always use with open() for safe file handling
- Check for empty files or zero-document scenarios
Multi-document YAML processing is a fundamental skill for working with modern infrastructure tooling. By mastering the streaming approach, you can handle complex configuration files with confidence and build robust data pipelines that gracefully handle any number of documents.
This technique reads multiple YAML documents separated by --- from a single stream or file.
Example 1: Reading two simple YAML documents from a string
This shows how to load multiple YAML documents from a single string using yaml.safe_load_all.
import yaml
data = """
name: Alice
age: 30
---
name: Bob
age: 25
"""
documents = list(yaml.safe_load_all(data))
print(documents)
Output: [{'name': 'Alice', 'age': 30}, {'name': 'Bob', 'age': 25}]
Example 2: Iterating through documents without loading all into memory
This demonstrates processing each document one at a time using a generator.
import yaml
data = """
item: apple
quantity: 5
---
item: banana
quantity: 3
---
item: cherry
quantity: 8
"""
# Each loop iteration parses and yields exactly one document.
for doc in yaml.safe_load_all(data):
    print(doc["item"], "->", doc["quantity"])
Output:
apple -> 5
banana -> 3
cherry -> 8
Example 3: Handling documents with different structures
This shows that each document in a stream can have a completely different schema.
import yaml
data = """
server: web01
port: 8080
---
database:
  host: localhost
  name: prod
---
enabled: true
"""
docs = list(yaml.safe_load_all(data))
print(docs[0]["server"])
print(docs[1]["database"]["host"])  # nested mapping under "database"
print(docs[2]["enabled"])
Output:
web01
localhost
True
Example 4: Reading multi-document YAML from a file
This demonstrates loading documents from an external .yaml file using a context manager.
import yaml
# The file is closed automatically when the with block exits.
with open("configs.yaml", "r") as f:
    for doc in yaml.safe_load_all(f):
        print(doc)
Output (for a configs.yaml that contains these two documents):
{'env': 'staging', 'debug': False}
{'env': 'production', 'debug': False}
Example 5: Filtering documents during streaming
This shows how to process only documents that match a condition while streaming.
import yaml
data = """
name: alpha
status: active
---
name: beta
status: inactive
---
name: gamma
status: active
"""
for doc in yaml.safe_load_all(data):
    # Skip every document whose status is not "active".
    if doc["status"] == "active":
        print(f"Active: {doc['name']}")
Output:
Active: alpha
Active: gamma
Comparison: Loading Single vs. Multi-Document YAML
| Feature | Single Document (yaml.safe_load) | Multi-Document (yaml.safe_load_all) |
|---|---|---|
| Input format | One YAML document | Multiple documents separated by --- |
| Return type | Single Python object | Generator yielding multiple objects |
| Memory usage | Loads entire document at once | Processes one document at a time |
| Use case | Simple config files | Log files, batch records, stream data |