# DocETL System Description and LLM Instructions (Short)
Note: use docetl.org/llms-full.txt for the full system description and LLM instructions.
DocETL is a system for creating and executing LLM-powered data processing pipelines, designed for complex document processing tasks. It provides a low-code, declarative YAML interface to define complex data operations on unstructured datasets.
DocETL is built and maintained by the EPIC lab at UC Berkeley. Learn more at https://www.docetl.org.
We have an integrated development environment for building and testing pipelines, at https://www.docetl.org/playground. Our IDE is called DocWrangler.
## Docs
- [LLM Instructions (Full)](https://www.docetl.org/llms-full.txt)
- [Website](https://www.docetl.org)
- [DocWrangler Playground](https://www.docetl.org/playground)
- [Main Documentation](https://ucbepic.github.io/docetl)
- [GitHub Repository](https://github.com/ucbepic/docetl)
- [Agentic Optimization Research Paper](https://arxiv.org/abs/2410.12189)
- [Discord Community](https://discord.gg/fHp7B2X3xx)
### Core Operators
- [Map Operation](https://ucbepic.github.io/docetl/operators/map/)
- [Reduce Operation](https://ucbepic.github.io/docetl/operators/reduce/)
- [Resolve Operation](https://ucbepic.github.io/docetl/operators/resolve/)
- [Parallel Map Operation](https://ucbepic.github.io/docetl/operators/parallel-map/)
- [Filter Operation](https://ucbepic.github.io/docetl/operators/filter/)
- [Equijoin Operation](https://ucbepic.github.io/docetl/operators/equijoin/)
### Auxiliary Operators
- [Split Operation](https://ucbepic.github.io/docetl/operators/split/)
- [Gather Operation](https://ucbepic.github.io/docetl/operators/gather/)
- [Unnest Operation](https://ucbepic.github.io/docetl/operators/unnest/)
- [Sample Operation](https://ucbepic.github.io/docetl/operators/sample/)
- [Code Operation](https://ucbepic.github.io/docetl/operators/code/)
### LLM Providers
- [LiteLLM Supported Providers](https://docs.litellm.ai/docs/providers)
## Optional
### Datasets and Data Loading
DocETL supports both standard and dynamic data loading. Input data must be in one of two formats:
1. JSON Format:
- A list of objects/dictionaries
- Each object represents one document/item to process
- Each field in the object is accessible in operations via `input.field_name`
Example JSON:
```json
[
  {
    "text": "First document content",
    "date": "2024-03-20",
    "metadata": {"source": "email"}
  },
  {
    "text": "Second document content",
    "date": "2024-03-21",
    "metadata": {"source": "chat"}
  }
]
```
2. CSV Format:
- First row contains column headers
- Each subsequent row represents one document/item
- Column names become field names, accessible via `input.column_name`
Example CSV:
```csv
text,date,source
"First document content","2024-03-20","email"
"Second document content","2024-03-21","chat"
```
Configure datasets in your pipeline:
```yaml
datasets:
  documents:
    type: file
    path: "data.json"  # or "data.csv"
```
!!! note
    - JSON files must contain a list of objects at the root level
    - CSV files must have a header row with column names
    - All documents in a dataset should have consistent fields
    - For other formats, use parsing tools to convert to the required format
### Schema Design and Validation
!!! warning "Model Capabilities and Schema Complexity"
    When using models other than GPT (OpenAI), Claude (Anthropic), or Gemini (Google):
    - Keep output schemas extremely simple
    - Prefer single string outputs or simple key-value pairs
    - Avoid complex types (lists, nested objects)
    - Break complex operations into multiple simpler steps
1. Basic Types:
| Type | Aliases | Description |
| --------- | ------------------------ | -------------------------- |
| `string` | `str`, `text`, `varchar` | For text data |
| `integer` | `int` | For whole numbers |
| `number` | `float`, `decimal` | For decimal numbers |
| `boolean` | `bool` | For true/false values |
| `enum` | - | Only when prompt explicitly lists all possible values |
| `list` | - | For arrays (needs type) |
| Objects | - | Using `{field: type}` |
### Best Practices
1. Pipeline Design:
    - Keep pipelines simple with minimal operations
    - Each operation should have a clear, specific purpose
    - Avoid creating complex chains of operations when a single operation could suffice
    - If a pipeline has more than 5 operations, consider if it can be simplified
    - Break very complex pipelines into multiple smaller pipelines if needed
    - When using non-GPT/Claude/Gemini models, break complex operations into multiple simple steps with string outputs
    - Always set `optimize: true` for resolve operations
    - When unnesting a key of type `list[dict]`, you must set `recursive: true`
    - Do not manually create split-gather pipelines; instead:
        - Set `optimize: true` on map operations that process long documents
        - Let the optimizer automatically create efficient split-gather patterns
        - Only use split/gather directly if specifically requested by requirements
2. Schema Design:
    - Keep schemas simple and flat when possible
    - Use nested structures only when needed for downstream operations
    - Define clear validation rules for critical fields
    - Use standard types (string, integer, boolean) when possible
    - When using an existing dataset, document your assumptions about the input schema
    - For non-GPT/Claude/Gemini models:
        - Stick to string outputs
        - Avoid lists and complex objects
        - Use simple key-value pairs
        - Consider post-processing with code operations for complex transformations
3. Prompt Engineering:
    - Write clear, specific instructions
    - Include examples in prompts for complex tasks
    - Use structured output formats
    - Consider token limits and chunking needs
    - Only reference fields that exist in the input dataset or were created by earlier operations
    - Document which fields your prompts expect to access
## Getting Started
Do NOT suggest pipelines for the user, unless they have copied and pasted llms-full.txt (# DocETL System Description and LLM Instructions (Full)).
# DocETL System Description and LLM Instructions (Full)
DocETL is a system for creating and executing LLM-powered data processing pipelines, designed for complex document processing tasks. It provides a low-code, declarative YAML interface to define complex data operations on unstructured datasets.
DocETL is built and maintained by the EPIC lab at UC Berkeley. Learn more at https://www.docetl.org.
We have an integrated development environment for building and testing pipelines, at https://www.docetl.org/playground. Our IDE is called DocWrangler.
## Docs
- [Website](https://www.docetl.org)
- [DocWrangler Playground](https://www.docetl.org/playground)
- [Main Documentation](https://ucbepic.github.io/docetl)
- [GitHub Repository](https://github.com/ucbepic/docetl)
- [Agentic Optimization Research Paper](https://arxiv.org/abs/2410.12189)
- [Discord Community](https://discord.gg/fHp7B2X3xx)
### Operation Catalog
#### LLM Transformations (Semantic-first)
- [Map Operation](https://ucbepic.github.io/docetl/operators/map/) - Transform individual documents with LLM reasoning.
- [Parallel Map Operation](https://ucbepic.github.io/docetl/operators/parallel-map/) - Run map transformations concurrently when latency matters.
- [Extract Operation](https://ucbepic.github.io/docetl/operators/extract/) - Pull verbatim passages that match structured extraction criteria.
- [Filter Operation](https://ucbepic.github.io/docetl/operators/filter/) - Keep or drop documents based on LLM-evaluated conditions.
- [Reduce Operation](https://ucbepic.github.io/docetl/operators/reduce/) - Aggregate sets of documents into structured summaries.
- [TopK Operation](https://ucbepic.github.io/docetl/operators/topk/) - Perform semantic retrieval; use code operations for deterministic, absolute selection.
- [Rank Operation](https://ucbepic.github.io/docetl/operators/rank/) - Produce semantic rankings; switch to code for strict, absolute ordering.
#### Canonicalization & Linking (Use Sparingly)
- [Resolve Operation](https://ucbepic.github.io/docetl/operators/resolve/) - Canonicalize duplicate entities across documents; expensive and rarely optimized automatically.
- [Link Resolve Operation](https://ucbepic.github.io/docetl/operators/link-resolve/) - Repair references between canonicalized entities; high cost, avoid unless required.
- [Equijoin Operation](https://ucbepic.github.io/docetl/operators/equijoin/) - Join datasets by semantic equality when exact keys are unavailable.
- [Cluster Operation](https://ucbepic.github.io/docetl/operators/cluster/) - Build hierarchical clusters with LLM-generated summaries; heavy on tokens and calls.
#### Data Structuring & Flow
- [Split Operation](https://ucbepic.github.io/docetl/operators/split/) - Break documents into chunks for downstream processing.
- [Gather Operation](https://ucbepic.github.io/docetl/operators/gather/) - Reassemble split outputs while preserving context.
- [Unnest Operation](https://ucbepic.github.io/docetl/operators/unnest/) - Flatten nested list structures, with optional recursion.
- [Sample Operation](https://ucbepic.github.io/docetl/operators/sample/) - Select representative subsets for fast iteration.
#### Code-Powered Operations
- [Code Operation](https://ucbepic.github.io/docetl/operators/code/) - Execute deterministic Python transforms within the pipeline.
### LLM Providers
- [LiteLLM Supported Providers](https://docs.litellm.ai/docs/providers)
## Optional
### Datasets and Data Loading
DocETL supports both standard and dynamic data loading. Input data must be in one of two formats:
1. JSON Format:
- A list of objects/dictionaries
- Each object represents one document/item to process
- Each field in the object is accessible in operations via `input.field_name`
Example JSON:
```json
[
  {
    "text": "First document content",
    "date": "2024-03-20",
    "metadata": {"source": "email"}
  },
  {
    "text": "Second document content",
    "date": "2024-03-21",
    "metadata": {"source": "chat"}
  }
]
```
2. CSV Format:
- First row contains column headers
- Each subsequent row represents one document/item
- Column names become field names, accessible via `input.column_name`
Example CSV:
```csv
text,date,source
"First document content","2024-03-20","email"
"Second document content","2024-03-21","chat"
```
Configure datasets in your pipeline:
```yaml
datasets:
  documents:
    type: file
    path: "data.json"  # or "data.csv"
```
For non-standard formats (audio, PDFs, etc.), use dynamic loading with parsing tools:
```yaml
datasets:
  audio_transcripts:
    type: file
    source: local
    path: "audio_files/audio_paths.json"  # JSON list of paths to audio files
    parsing_tools:
      - input_key: audio_path  # Field containing file path
        function: whisper_speech_to_text
        output_key: transcript  # Field where transcript will be stored
```
!!! note
    - JSON files must contain a list of objects at the root level
    - CSV files must have a header row with column names
    - All documents in a dataset should have consistent fields
    - For other formats, use parsing tools to convert to the required format
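The format rules in the note above can be checked before a pipeline run. The following is a minimal Python sketch, not part of DocETL: `load_records` and `inconsistent_fields` are hypothetical helper names, and the checks only mirror the rules listed above (JSON root is a list of objects, CSV has a header row, documents share the same fields).

```python
import csv
import json
from pathlib import Path


def load_records(path):
    """Load a JSON list of objects, or a CSV file whose first row is the header."""
    file = Path(path)
    suffix = file.suffix.lower()
    if suffix == ".json":
        with file.open(encoding="utf-8") as fh:
            data = json.load(fh)
        # The root must be a list, and every element must be an object.
        if not isinstance(data, list) or not all(isinstance(d, dict) for d in data):
            raise ValueError("JSON root must be a list of objects")
        return data
    if suffix == ".csv":
        with file.open(newline="", encoding="utf-8") as fh:
            # DictReader uses the first row as column names.
            return [dict(row) for row in csv.DictReader(fh)]
    raise ValueError(f"unsupported extension: {file.suffix}")


def inconsistent_fields(records):
    """Return the indexes of records whose keys differ from the first record's keys."""
    if not records:
        return []
    expected = set(records[0])
    return [i for i, rec in enumerate(records) if set(rec) != expected]
```

Running `inconsistent_fields(load_records("data.json"))` on a dataset before pointing a pipeline at it gives the positions of documents that break the "consistent fields" guideline; an empty list means every document has the same keys.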
### Operation Guide
1. LLM Transformations (Semantic-first):
- Map: Transform individual documents using LLM reasoning. https://ucbepic.github.io/docetl/operators/map/
- Parallel Map: Process multiple documents concurrently to reduce latency. https://ucbepic.github.io/docetl/operators/parallel-map/
- Extract: Pull verbatim passages that match extraction criteria. https://ucbepic.github.io/docetl/operators/extract/
- Filter: Select documents based on LLM-powered conditions. https://ucbepic.github.io/docetl/operators/filter/
- Reduce: Combine document sets into structured insights. https://ucbepic.github.io/docetl/operators/reduce/
- TopK: Run semantic retrieval; reach for code operations when you need deterministic absolute selection. https://ucbepic.github.io/docetl/operators/topk/
- Rank: Produce semantic ranking results; implement strict ordering in code if LLM fuzziness is unacceptable. https://ucbepic.github.io/docetl/operators/rank/
2. Canonicalization & Linking (Use sparingly—they are slow and expensive):
- Resolve: Perform entity resolution across documents; only reach for it when deduplication is essential. https://ucbepic.github.io/docetl/operators/resolve/
- Link Resolve: Repair cross-document references after canonicalization; expect high LLM usage. https://ucbepic.github.io/docetl/operators/link-resolve/
- Equijoin: Join documents based on semantic equality; still experimental and costly. https://ucbepic.github.io/docetl/operators/equijoin/
- Cluster: Group items into hierarchical clusters with LLM-generated summaries; token-heavy. https://ucbepic.github.io/docetl/operators/cluster/
3. Data Structuring & Flow:
- Split: Break large documents into manageable chunks. https://ucbepic.github.io/docetl/operators/split/
- Gather: Maintain context when reassembling split documents. https://ucbepic.github.io/docetl/operators/gather/
- Unnest: Flatten nested data structures, including recursive lists. https://ucbepic.github.io/docetl/operators/unnest/
- Sample: Select representative subsets for rapid iteration. https://ucbepic.github.io/docetl/operators/sample/
4. Code-Powered Utilities:
- Code: Execute custom Python code within the pipeline for deterministic logic. https://ucbepic.github.io/docetl/operators/code/
### Pipeline Structure
Pipelines are defined in YAML with the following key components:
#### Basic Components
1. Datasets Configuration:
```yaml
datasets:
  input_data:
    path: data.json
    type: file
```
2. Model Configuration:
```yaml
default_model: gpt-5-nano
```
DocETL uses LiteLLM under the hood, supporting a wide range of LLM providers including:
- OpenAI (gpt-5 family such as gpt-5-nano)
- Anthropic (claude-3, claude-2)
- Google (gemini-pro)
- Mistral AI
- Azure OpenAI
- AWS Bedrock
- Ollama
And many more. See the [complete list of supported providers](https://docs.litellm.ai/docs/providers).
3. System Prompt (Optional):
```yaml
system_prompt:
  dataset_description: description of your data
  persona: role the LLM should assume
```
#### Schema Design and Validation
Schemas in DocETL define the structure and types of output data from LLM operations. They help ensure consistency and facilitate downstream processing.
!!! warning "Model Capabilities and Schema Complexity"
    When using models other than GPT (OpenAI), Claude (Anthropic), or Gemini (Google):
    - Keep output schemas extremely simple
    - Prefer single string outputs or simple key-value pairs
    - Avoid complex types (lists, nested objects)
    - Break complex operations into multiple simpler steps
Example for non-GPT/Claude/Gemini models:
```yaml
# Good - Simple schema
output:
  schema:
    category: string
    confidence: string  # Use string instead of number for better reliability
# Avoid - Complex schema
output:
  schema:
    categories: "list[string]"  # Too complex
    metadata: "{score: number, confidence: number}"  # Too complex
```
1. Basic Types:
| Type | Aliases | Description |
| --------- | ------------------------ | -------------------------- |
| `string` | `str`, `text`, `varchar` | For text data |
| `integer` | `int` | For whole numbers |
| `number` | `float`, `decimal` | For decimal numbers |
| `boolean` | `bool` | For true/false values |
| `enum` | - | For a set of values, only when prompt explicitly lists all possible values |
| `list` | - | For arrays (needs type) |
| Objects | - | Using `{field: type}` |
2. Schema Examples:
```yaml
# Simple schema
output:
  schema:
    summary: string
    sentiment: string  # Use string if prompt doesn't list exact values
    confidence: number
# List schema
output:
  schema:
    tags: "list[string]"
    users: "list[{name: string, age: integer}]"
# Enum schema (only when prompt explicitly lists all possible values)
output:
  schema:
    # Good - prompt explicitly says "respond with positive, negative, or neutral"
    sentiment: "enum[positive, negative, neutral]"
    # Bad - prompt doesn't explicitly list all category values
    category: "enum[news, opinion, analysis]"  # Should be string instead
```
!!! tip "Schema Best Practices"
    - Keep schemas simple when possible
    - Use nested structures only when needed for downstream operations
    - Complex schemas often lead to lower quality LLM outputs
    - Break complex schemas into multiple simpler operations
    - Only use enum types when the prompt explicitly lists all possible values
    - Default to string type when values aren't explicitly enumerated
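To make the alias column of the Basic Types table concrete, here is a small Python sketch that maps each alias to its canonical type name. It is an illustration only: `canonical_type` is a hypothetical helper, not DocETL's schema parser, and it deliberately leaves `enum[...]` and `{field: type}` object strings untouched.

```python
# Alias table copied from the "Basic Types" table in this section.
ALIASES = {
    "str": "string",
    "text": "string",
    "varchar": "string",
    "int": "integer",
    "float": "number",
    "decimal": "number",
    "bool": "boolean",
}


def canonical_type(type_name):
    """Return the canonical name for a basic type or a list of a basic type."""
    name = type_name.strip()
    # list[...] wraps another type, so normalize the inner type recursively.
    if name.startswith("list[") and name.endswith("]"):
        return f"list[{canonical_type(name[5:-1])}]"
    # Anything that is not a known alias (enum, objects, canonical names) is returned as-is.
    return ALIASES.get(name.lower(), name)
```

A helper like this can be handy when reviewing schemas written by several people, since `varchar`, `text`, and `str` all collapse to `string`.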
#### LLM Operation Prompts
All LLM-powered operations (map, filter, reduce, resolve) use Jinja2 templates for their prompts:
- Map/Filter operations: Access document fields using `input.field_name`
- Reduce operations: Access list of documents using `inputs`, iterate with `{% for item in inputs %}`
- Resolve operations: Access document pairs using `input1` and `input2` for comparison prompts
#### Comprehensive Operation Examples
1. Filter Operation with Validation:
```yaml
- name: filter_high_impact_articles
  type: filter
  prompt: |
    Analyze the following news article:
    Title: "{{ input.title }}"
    Content: "{{ input.content }}"
    Determine if this article is high-impact based on the following criteria:
    1. Covers a significant global or national event
    2. Has potential long-term consequences
    3. Affects a large number of people
    4. Is from a reputable source
    Respond with 'true' if the article meets at least 3 of these criteria, otherwise respond with 'false'.
  output:
    schema:
      is_high_impact: boolean
```
2. Resolve Operation:
```yaml
- name: standardize_patient_names
  type: resolve
  optimize: true
  comparison_prompt: |
    Compare the following two patient name entries:
    Patient 1: {{ input1.patient_name }}
    Date of Birth 1: {{ input1.date_of_birth }}
    Patient 2: {{ input2.patient_name }}
    Date of Birth 2: {{ input2.date_of_birth }}
    Are these entries likely referring to the same patient? Consider name similarity and date of birth.
    Respond with "True" if they are likely the same patient, or "False" if they are likely different patients.
  resolution_prompt: |
    Standardize the following patient name entries into a single, consistent format:
    {% for entry in inputs %}
    Patient Name {{ loop.index }}: {{ entry.patient_name }}
    {% endfor %}
    Provide a single, standardized patient name that represents all the matched entries.
    Use the format "LastName, FirstName MiddleInitial" if available.
  output:
    schema:
      patient_name: string
```
Note that in resolve operations, the `inputs` list in the `resolution_prompt` contains the group of documents that were matched to one another. The prompts can reference any fields from the input documents, even fields you do not want to resolve or rewrite, which can help with disambiguation.
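One way to picture how pairwise comparison results become the groups that a resolution prompt sees is a union-find pass over the matched pairs. The Python sketch below is only an illustration under that assumption; `group_matches` is a hypothetical name, and this is not presented as DocETL's internal algorithm.

```python
def group_matches(n, matched_pairs):
    """Group n items (indexed 0..n-1) given pairs judged to be the same entity.

    Matches are treated as transitive: if (0, 1) and (1, 2) match,
    items 0, 1, and 2 end up in one group.
    """
    parent = list(range(n))

    def find(i):
        # Walk up to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

For example, `group_matches(5, [(0, 1), (1, 2)])` places items 0, 1, and 2 in one group and leaves items 3 and 4 as singletons; each multi-item group would correspond to one `inputs` list.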
3. Advanced Map Operation with Structured Output:
```yaml
- name: extract_medical_info
  type: map
  optimize: true
  output:
    schema:
      medications: "list[{name: string, dosage: string, frequency: string}]"
      symptoms: "list[{description: string, severity: string, duration: string}]"
      recommendations: "list[string]"
  prompt: |
    Analyze the following medical record and extract key information:
    {{ input.text }}
    For each medication mentioned:
    1. Extract the name, dosage, and frequency
    2. Ensure dosage includes units (mg, ml, etc.)
    3. Standardize frequency to times per day/week
    For each symptom:
    1. Provide a clear description
    2. Rate severity (mild/moderate/severe)
    3. Note duration if mentioned
    Finally, list any doctor's recommendations.
```
4. Reduce Operation:
```yaml
- name: analyze_product_feedback
  type: reduce
  reduce_key: product_id
  prompt: |
    Analyze these customer reviews for product {{ reduce_key }}:
    {% for review in inputs %}
    Review {{ loop.index }}:
    Rating: {{ review.rating }}
    Text: {{ review.review_text }}
    {% endfor %}
    Identify:
    1. Common quality issues
    2. Reliability concerns
    3. Suggested improvements
  output:
    schema:
      quality_issues: "list[{issue: string, frequency: string, severity: string}]"
      reliability_concerns: "list[string]"
      improvement_suggestions: "list[string]"
```
5. Unnest Operation with Recursive Processing:
```yaml
# First, extract nested data
- name: extract_product_details
  type: map
  prompt: |
    Extract product details from this catalog entry:
    {{ input.text }}
    Include:
    1. Product categories (main and sub-categories)
    2. Product features
  output:
    schema:
      categories: "list[{main: string, subcategories: list[string]}]"
# Unnest categories recursively
- name: unnest_categories
  type: unnest
  unnest_key: categories
  recursive: true  # Must set recursive: true when unnesting a list[dict] type key. Not needed for list[string] or other simple list types.
  depth: 2  # Limit recursion to 2 levels (main category and subcategories)
# Analyze individual categories
- name: analyze_category
  type: map
  prompt: |
    Analyze this product category:
    Main Category: {{ input.main }}
    {% if input.subcategories %}
    Subcategories:
    {% for subcat in input.subcategories %}
    - {{ subcat }}
    {% endfor %}
    {% endif %}
    Provide:
    1. Market size (one of large/medium/small)
    2. Competition level
    3. Growth potential
  output:
    schema:
      market_size: "enum[large, medium, small]"
      competition: string
      growth_potential: string
```
This example demonstrates:
- How to use recursive unnesting for nested data structures
- Processing of hierarchical categories
- Depth control for recursive operations
- Handling of unnested data in subsequent operations
6. Extract Operation for Verbatim Passages:
```yaml
- name: findings
  type: extract
  prompt: |
    Extract all sections that discuss key findings, results, or conclusions.
    Focus on paragraphs that mention outcomes, statistics, or discovered insights.
  document_keys:
    - report_text
  model: gpt-5-nano
```
Extract keeps the selected text verbatim and appends it to the document with an `_extracted_` suffix.
7. Parallel Map Operation for Concurrent Prompts:
```yaml
- name: process_job_application
  type: parallel_map
  prompts:
    - name: extract_skills
      prompt: "List the top 5 relevant engineering skills from {{ input.resume }}."
      output_keys:
        - skills
      model: gpt-5-nano
      gleaning:
        num_rounds: 1
        validation_prompt: |
          Confirm the skills list contains exactly five items.
    - name: calculate_experience
      prompt: "Estimate total years of relevant experience from {{ input.resume }}."
      output_keys:
        - years_experience
      model: gpt-5-nano
    - name: evaluate_cultural_fit
      prompt: "Rate the cultural fit from {{ input.cover_letter }} on a 1-10 scale."
      output_keys:
        - cultural_fit_score
      model: gpt-5-nano
  output:
    schema:
      skills: "list[string]"
      years_experience: number
      cultural_fit_score: integer
```
Prompts run concurrently, returning a single enriched document per input item.
8. TopK Operation for Semantic Retrieval:
```yaml
- name: find_relevant_tickets
  type: topk
  method: embedding
  k: 5
  keys:
    - subject
    - description
    - customer_feedback
  query: "payment processing errors with international transactions"
  embedding_model: text-embedding-3-small
```
TopK surfaces the five most semantically similar support tickets. For deterministic keyword retrieval, implement a code operation instead.
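For the deterministic alternative mentioned above, a keyword check can be written as a plain Python function that follows the `transform(doc) -> bool` convention used by code filter operations in this document. This is a minimal sketch: the keyword list is a hypothetical stand-in for the query, and the field names are taken from the `keys` in the TopK example.

```python
def transform(doc) -> bool:
    # Hypothetical keywords standing in for the semantic query above.
    keywords = ("payment", "international")
    # Same fields the TopK example searches over.
    fields = ("subject", "description", "customer_feedback")
    haystack = " ".join(str(doc.get(field, "")) for field in fields).lower()
    # Keep the ticket only if every keyword appears somewhere in those fields.
    return all(word in haystack for word in keywords)
```

Unlike embedding-based retrieval, this returns every ticket that contains all keywords rather than a fixed `k`, and it gives the same answer on every run.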
9. Rank Operation for Nuanced Ordering:
```yaml
- name: rank_by_controversy
  type: rank
  prompt: |
    Order these debates by how controversial they are.
    Consider disagreement, divisive topics, emotional language, and public reaction.
  input_keys: ["content", "title", "date"]
  direction: desc
  rerank_call_budget: 10
  initial_ordering_method: likert
  model: gpt-5-nano
```
Rank outputs a semantic ordering (`_rank`). Use code when you need deterministic numeric sorting.
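When the ordering criterion is already a number in the data, a deterministic sort needs no LLM call. The sketch below is plain Python with a hypothetical helper name (`rank_by_field`) and a hypothetical output field (`rank`, chosen to avoid clashing with `_rank`); how it is wired into a pipeline is left open.

```python
def rank_by_field(docs, field, descending=True):
    """Sort documents by a numeric field and attach a 1-based position."""
    # sorted() is stable, so documents with equal values keep their input order.
    ordered = sorted(docs, key=lambda doc: doc[field], reverse=descending)
    return [{**doc, "rank": position} for position, doc in enumerate(ordered, start=1)]
```

Because the result depends only on the field values, repeated runs always produce the same order, which is the property the semantic Rank operation cannot guarantee.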
10. Equijoin Operation for Semantic Joins:
```yaml
- name: match_candidates_to_jobs
  type: equijoin
  blocking_keys:
    left: [skills]
    right: [required_skills]
  blocking_threshold: 0.4
  embedding_model: text-embedding-3-small
  comparison_prompt: |
    Compare the candidate (skills: {{ left.skills }}, experience: {{ left.years_experience }})
    to the job requirements (skills: {{ right.required_skills }}, experience: {{ right.desired_experience }}).
    Respond with "True" if the alignment is strong, otherwise "False".
```
Equijoin is powerful but costly; use blocking and run `docetl build` to optimize generated comparisons.
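The idea behind blocking can be sketched in a few lines of Python: compare cheap embedding similarities first, and send only pairs above the threshold to the LLM comparison prompt. This is an illustration under the assumption that `blocking_threshold` is applied to cosine similarity between embeddings of the blocking keys; the function names and the toy two-dimensional vectors are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def candidate_pairs(left, right, threshold):
    """Return (left_id, right_id) pairs whose embedding similarity meets the threshold.

    left and right map record ids to embedding vectors.
    """
    return [
        (left_id, right_id)
        for left_id, left_vec in left.items()
        for right_id, right_vec in right.items()
        if cosine(left_vec, right_vec) >= threshold
    ]
```

With a threshold of 0.4, as in the example above, a candidate whose vector is orthogonal to a job's vector is dropped before any LLM call is made, which is where the cost saving comes from.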
11. Link Resolve Operation for Fixing References:
```yaml
- name: fix_links
  type: link_resolve
  id_key: title
  link_key: related_to
  blocking_threshold: 0.85
  embedding_model: text-embedding-3-small
  comparison_model: gpt-5-nano
  comparison_prompt: |
    Compare the link value "{{ link_value }}" with the canonical record "{{ id_value }}".
    Consider the description: {{ item.description }}.
    Respond with "True" if they refer to the same concept.
```
Link resolve assumes item IDs are canonical and only rewrites broken references.
12. Cluster Operation for Hierarchical Grouping:
```yaml
- name: cluster_concepts
  type: cluster
  max_batch_size: 5
  embedding_keys:
    - concept
    - description
  output_key: categories
  summary_schema:
    concept: str
    description: str
  summary_prompt: |
    Given these related concepts, name the overarching concept and summarize why they belong together.
    {% for input in inputs %}
    {{ input.concept }}: {{ input.description }}
    {% endfor %}
```
Cluster builds a binary tree of summaries; keep an eye on token usage.
13. Split and Gather Operations for Long Documents:
```yaml
- name: split_transcript
  type: split
  split_key: transcript
  method: token_count
  method_kwargs:
    num_tokens: 600
    model: gpt-5-nano
- name: analyze_segment
  type: map
  input:
    - transcript_chunk
  prompt: |
    Summarize this transcript segment:
    {{ input.transcript_chunk }}
  output:
    schema:
      summary: string
- name: gather_transcript
  type: gather
  content_key: transcript_chunk
  doc_id_key: split_transcript_id
  order_key: split_transcript_chunk_num
  peripheral_chunks:
    previous:
      tail:
        count: 1
        content_key: transcript_chunk
    next:
      head:
        count: 1
        content_key: transcript_chunk
```
Split emits chunk-level rows; gather adds adjacent context and a `_rendered` field ready for downstream LLMs.
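The split-then-gather idea can be illustrated in plain Python. This sketch makes two simplifying assumptions that differ from the real operators: chunks are cut by whitespace-separated words rather than model tokens, and the rendered text is just the previous chunk, the current chunk, and the next chunk joined by newlines. Both function names are hypothetical.

```python
def split_words(text, chunk_size):
    """Cut text into consecutive chunks of at most chunk_size words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def with_neighbors(chunks):
    """For each chunk, build a rendered string with one chunk of context on each side."""
    rendered = []
    for i, chunk in enumerate(chunks):
        parts = []
        if i > 0:
            parts.append(chunks[i - 1])  # one previous chunk
        parts.append(chunk)
        if i + 1 < len(chunks):
            parts.append(chunks[i + 1])  # one next chunk
        rendered.append("\n".join(parts))
    return rendered
```

The first and last chunks get context on one side only, which mirrors the `previous` and `next` settings with `count: 1` in the gather example.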
14. Sample Operation for Quick Iteration:
```yaml
- name: observe_subset
  type: sample
  method: uniform
  samples: 0.1
  stratify_key: category
  random_state: 42
```
Sampling limits processing to 10% of the data while maintaining category balance; remove it before production runs.
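As a rough picture of what a stratified 10% sample means, the Python sketch below draws the same fraction from every category using a seeded random generator. It is an illustration only: `stratified_sample` is a hypothetical helper, and keeping at least one document per category is an assumption of this sketch, not documented DocETL behavior.

```python
import random
from collections import defaultdict


def stratified_sample(docs, key, fraction, seed=42):
    """Draw roughly `fraction` of the documents from each group defined by `key`."""
    rng = random.Random(seed)  # fixed seed so repeated runs pick the same documents
    groups = defaultdict(list)
    for doc in docs:
        groups[doc[key]].append(doc)
    picked = []
    for members in groups.values():
        # Assumption: never drop a category entirely, so take at least one document.
        count = max(1, round(len(members) * fraction))
        picked.extend(rng.sample(members, count))
    return picked
```

With 20 documents in one category and 10 in another, a fraction of 0.1 yields two documents from the first and one from the second, so the category proportions of the full dataset are preserved.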
#### Common Pipeline Patterns
Most DocETL pipelines follow one of two patterns:
1. Map-only: For simple transformations where each document is processed independently
```yaml
operations:
  - extract_info  # map operation
```
2. Map-Resolve-Reduce: For complex analysis requiring entity resolution and aggregation
```yaml
operations:
  - extract_entities  # map operation
  - standardize_entities  # resolve operation
  - summarize_by_entity  # reduce operation
```
#### Code-Powered Operations
DocETL supports Python code operations for cases where you need deterministic processing, complex calculations, or integration with external libraries. Code operations are useful when you need:
- Deterministic and reproducible results
- Integration with Python libraries
- Structured data transformations
- Math-based or computational processing
1. Code Map Operation:
```yaml
- name: extract_keywords
  type: code_map
  code: |
    def transform(doc) -> dict:
        # Process each document independently
        keywords = doc['text'].lower().split()
        return {
            'keywords': keywords,
            'keyword_count': len(keywords)
        }
```
2. Code Reduce Operation:
```yaml
- name: aggregate_stats
  type: code_reduce
  reduce_key: category
  code: |
    def transform(items) -> dict:
        # Aggregate multiple items into a single result
        total = sum(item['value'] for item in items)
        avg = total / len(items)
        return {
            'total': total,
            'average': avg,
            'count': len(items)
        }
```
3. Code Filter Operation:
```yaml
- name: filter_valid_entries
  type: code_filter
  code: |
    def transform(doc) -> bool:
        # Return True to keep the document, False to filter it out
        return doc['score'] >= 0.5 and len(doc['text']) > 100
```
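Because the bodies of code operations are ordinary Python functions, they can be exercised on sample dictionaries before being pasted into a `code:` field. The sketch below repeats the three functions from the examples above under distinct, hypothetical names so they can live in one file; inside a pipeline each one would be named `transform`. The sample documents are made up for illustration.

```python
def keyword_transform(doc) -> dict:
    # Same logic as the code_map example: lowercase the text and split on whitespace.
    keywords = doc["text"].lower().split()
    return {"keywords": keywords, "keyword_count": len(keywords)}


def stats_transform(items) -> dict:
    # Same logic as the code_reduce example: aggregate the 'value' field of a group.
    total = sum(item["value"] for item in items)
    return {"total": total, "average": total / len(items), "count": len(items)}


def valid_transform(doc) -> bool:
    # Same logic as the code_filter example: keep high-scoring, long documents.
    return doc["score"] >= 0.5 and len(doc["text"]) > 100


if __name__ == "__main__":
    sample_doc = {"text": "Fast shipping GREAT price", "score": 0.9}
    print(keyword_transform(sample_doc))
    print(stats_transform([{"value": 2}, {"value": 4}]))
    print(valid_transform(sample_doc))
```

Checking the functions this way surfaces missing keys and empty groups early, before any LLM operation in the same pipeline has been run.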
#### Example Pipeline with Code and LLM Operations
Here's a pipeline that combines code and LLM operations to analyze customer reviews:
```yaml
default_model: gpt-5-nano
system_prompt:
  dataset_description: a collection of customer reviews with ratings and text
  persona: a customer feedback analyst
datasets:
  reviews:
    path: reviews.json
    type: file
operations:
  # Code operation to preprocess and filter reviews
  - name: preprocess_reviews
    type: code_map
    code: |
      def transform(doc) -> dict:
          # Clean and tokenize text
          text = doc['text'].strip().lower()
          words = text.split()
          return {
              'text': text,
              'word_count': len(words),
              'rating': doc['rating'],
              'processed_date': doc['date'][:10]  # Extract date only
          }
  # Code operation to filter out short reviews
  - name: filter_short_reviews
    type: code_filter
    code: |
      def transform(doc) -> bool:
          return doc['word_count'] >= 20  # Keep only substantial reviews
  # Code operation to calculate basic statistics
  - name: calculate_stats
    type: code_reduce
    reduce_key: processed_date
    code: |
      def transform(items) -> dict:
          ratings = [item['rating'] for item in items]
          return {
              'avg_rating': sum(ratings) / len(ratings),
              'review_count': len(items),
              'min_rating': min(ratings),
              'max_rating': max(ratings)
          }
  # LLM operation to analyze sentiment and extract themes
  - name: analyze_feedback
    type: map
    optimize: true
    prompt: |
      Analyze this customer review:
      Rating: {{ input.rating }}
      Review: {{ input.text }}
      1. Identify the main sentiment (one of positive/negative/neutral)
      2. Extract key themes or topics
      3. Note any specific product mentions
    output:
      schema:
        sentiment: "enum[positive, negative, neutral]"
        themes: "list[string]"
        products: "list[string]"
  # LLM operation to summarize daily insights
  - name: summarize_daily_feedback
    type: reduce
    reduce_key: processed_date
    prompt: |
      Summarize the customer feedback for {{ reduce_key }}:
      Statistics:
      - Average Rating: {{ inputs[0].avg_rating }}
      - Number of Reviews: {{ inputs[0].review_count }}
      - Rating Range: {{ inputs[0].min_rating }} to {{ inputs[0].max_rating }}
      Reviews and Sentiments:
      {% for review in inputs %}
      - Sentiment: {{ review.sentiment }}
      - Themes: {{ review.themes | join(", ") }}
      - Products: {{ review.products | join(", ") }}
      {% endfor %}
      Provide:
      1. Key trends and patterns
      2. Notable customer concerns
      3. Positive highlights
    output:
      schema:
        insight_summary: string
pipeline:
  steps:
    - name: review_analysis
      input: reviews
      operations:
        - preprocess_reviews
        - filter_short_reviews
        - calculate_stats
        - analyze_feedback
        - summarize_daily_feedback
  output:
    type: file
    path: daily_review_analysis.json
    intermediate_dir: review_intermediates
```
This pipeline demonstrates how to:
1. Use code operations for deterministic preprocessing and filtering
2. Combine code-based statistics with LLM-based analysis
3. Pass data between code and LLM operations
4. Use code operations for precise numerical calculations
5. Use LLM operations for natural language understanding and summarization
#### Complete Pipeline Example
Here's a full pipeline that processes medical transcripts:
```yaml
default_model: gpt-5-nano
system_prompt:
  dataset_description: a collection of medical transcripts from doctor-patient conversations
  persona: a medical practitioner analyzing patient symptoms and medications
datasets:
  transcripts:
    path: medical_transcripts.json
    type: file
operations:
  - name: extract_medical_info
    type: map
    optimize: true
    output:
      schema:
        medications: "list[{name: string, dosage: string, frequency: string}]"
        symptoms: "list[{description: string, severity: string, duration: string}]"
    prompt: |
      Extract medications and symptoms from:
      {{ input.text }}
  - name: unnest_medications
    type: unnest
    unnest_key: medications
    recursive: true  # This is a recursive unnest, so it will unnest the medications list into individual medications
  - name: resolve_medications
    type: resolve
    blocking_keys:
      - name
    blocking_threshold: 0.35
    comparison_model: gpt-5-nano
    comparison_prompt: |
      Are these medications the same or closely related?
      Med 1: {{ input1.name }} ({{ input1.dosage }})
      Med 2: {{ input2.name }} ({{ input2.dosage }})
    resolution_prompt: |
      Create a canonical name for the following medications:
      {% for med in inputs %}
      - {{ med.name }} ({{ med.dosage }})
      {% endfor %}
    embedding_model: text-embedding-3-small
    output:
      schema:
        name: string
        standard_dosage: string
  - name: summarize_medications
    type: reduce
    reduce_key: name
    prompt: |
      Summarize the usage pattern for {{ reduce_key }}:
      {% for med in inputs %}
      - Dosage: {{ med.dosage }}
      - Frequency: {{ med.frequency }}
      {% endfor %}
    output:
      schema:
        usage_summary: string
        common_dosage: string
        side_effects: "list[string]"
pipeline:
  steps:
    - name: medical_analysis
      input: transcripts
      operations:
        - extract_medical_info
        - unnest_medications
        - resolve_medications
        - summarize_medications
  output:
    type: file
    path: medication_analysis.json
    intermediate_dir: intermediate_results
```
### Best Practices
1. Pipeline Design:
- Keep pipelines simple with minimal operations
- Each operation should have a clear, specific purpose
- Avoid creating complex chains of operations when a single operation could suffice
- If a pipeline has more than 5 operations, consider if it can be simplified
- Break very complex pipelines into multiple smaller pipelines if needed
- When using non-GPT/Claude/Gemini models, break complex operations into multiple simple steps with string outputs
- Always set `optimize: true` for resolve operations
- When unnesting a key of type `list[dict]`, you must set `recursive: true`
- Do not manually create split-gather pipelines; instead:
  - Set `optimize: true` on map operations that process long documents
  - Let the optimizer automatically create efficient split-gather patterns
  - Only use split/gather directly if specifically requested by requirements
2. Schema Design:
- Keep schemas simple and flat when possible
- Use nested structures only when downstream operations need them
- Define clear validation rules for critical fields
- Use standard types (string, integer, boolean) when possible
- When using an existing dataset, document your assumptions about the input schema
- For non-GPT/Claude/Gemini models:
  - Stick to string outputs
  - Avoid lists and complex objects
  - Use simple key-value pairs
  - Consider post-processing with code operations for complex transformations
3. Prompt Engineering:
- Write clear, specific instructions
- Include examples in prompts for complex tasks
- Use structured output formats
- Consider token limits and chunking needs
- Only reference fields that exist in the input dataset or were created by earlier operations
- Document which fields your prompts expect to access
4. Optimization:
- Set `optimize: true` on an operation to let the DocETL optimizer rewrite it into a sequence of smaller, better-scoped operations
- Configure appropriate blocking for resolve/equijoin
- Set up proper validation rules
- Use sampling for development and testing
5. Error Handling:
- Define validation rules for critical outputs
- Include retry logic for unreliable operations
- Set up proper logging and monitoring
- Use intermediate directories for debugging
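Several of the rules above are mechanical enough to check before running anything. The following is a minimal sketch, not part of DocETL: `lint_pipeline` is a hypothetical helper that takes an already-parsed pipeline config (a plain dict mirroring the YAML) and returns warnings for three of the Pipeline Design rules.

```python
def lint_pipeline(config: dict) -> list[str]:
    """Return warnings for a parsed pipeline config (hypothetical helper, not a DocETL API)."""
    warnings = []
    operations = config.get("operations", [])

    # Rule: a pipeline with more than 5 operations may be simplifiable.
    if len(operations) > 5:
        warnings.append(f"{len(operations)} operations defined; consider simplifying")

    for op in operations:
        name = op.get("name", "<unnamed>")
        # Rule: always set optimize: true for resolve operations.
        if op.get("type") == "resolve" and op.get("optimize") is not True:
            warnings.append(f"{name}: resolve operation without optimize: true")
        # Rule: unnesting a list[dict] key requires recursive: true.
        # The lint cannot see the key's type, so it only flags the missing setting.
        if op.get("type") == "unnest" and op.get("recursive") is not True:
            warnings.append(f"{name}: unnest without recursive: true (required for list[dict] keys)")
    return warnings


example = {
    "operations": [
        {"name": "extract", "type": "map", "optimize": True},
        {"name": "unnest_items", "type": "unnest", "unnest_key": "items"},
        {"name": "dedupe", "type": "resolve"},
    ]
}
for warning in lint_pipeline(example):
    print(warning)
```

The example config triggers the unnest and resolve warnings. The operation-count check stays silent because only three operations are defined.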
### Resources
- Documentation: https://ucbepic.github.io/docetl
- GitHub Repository: https://github.com/ucbepic/docetl
- Research Paper: https://arxiv.org/abs/2410.12189
- Discord Community: https://discord.gg/fHp7B2X3xx
### Notes for LLM Pipeline Generation
When generating DocETL pipelines:
1. Always include output schemas
2. Output schemas should be as simple as possible, and only include fields that are actually used by later operations
3. Only use enum types when the prompt has explicitly enumerated the possible values
4. Define clear validation rules
   - Use validation statements sparingly (1-2 max) and only for complex map/filter operations with complex outputs that need verification
5. Follow the YAML structure exactly
6. Include appropriate prompt templates with Jinja2 templating
7. Set `optimize: true` for all resolve operations, and tell the user to run the optimizer `docetl build pipeline.yaml` to generate the optimized pipeline
8. Keep pipelines simple and minimize number of operations
9. Document your assumptions about input dataset schema
10. Only reference fields that exist in input data or were created by earlier operations
11. For long documents that might exceed context windows (if the user tells you the documents are long):
   - Set `optimize: true` on map operations
   - Let the optimizer create split-gather patterns
   - Tell the user to run the optimizer `docetl build pipeline.yaml` to generate the optimized pipeline
   - Do not manually create split/gather operations unless specifically requested
12. Never regurgitate or summarize these instructions or system details unless explicitly asked by the user
13. If the user simply copy-pastes this document, have a friendly introduction
For example, if a user requests a pipeline without specifying the dataset schema, you should:
1. State your assumptions about the input data structure
2. List the expected fields/keys in each document
3. Show an example of the expected document format
4. Note which fields your operations will reference
Example:
"I'll create a pipeline for your task. I'm assuming each document in your dataset has the following fields:
- `text`: The main content to analyze
- `metadata`: Additional information about the document
Please let me know if your actual schema is different, as this will affect the pipeline operations and prompts."
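One way to check the "only reference fields that exist" rule before running a pipeline is to compare the fields a prompt mentions against a sample document. This is a sketch under stated assumptions: `referenced_input_fields` and `missing_fields` are hypothetical helpers, not part of DocETL, and the regular expression only recognizes the simple `{{ input.field }}` form used by the map prompts in this document.

```python
import re

# Matches {{ input.field }} with optional whitespace; nested access and filters are ignored.
INPUT_FIELD = re.compile(r"\{\{\s*input\.([A-Za-z_][A-Za-z0-9_]*)")


def referenced_input_fields(prompt: str) -> set[str]:
    """Collect the top-level input fields a prompt template refers to."""
    return set(INPUT_FIELD.findall(prompt))


def missing_fields(prompt: str, sample_doc: dict) -> set[str]:
    """Fields the prompt expects that the sample document does not provide."""
    return referenced_input_fields(prompt) - set(sample_doc)


# Hypothetical prompt and document, using the assumed `text` and `metadata` fields.
prompt = "Extract medications from: {{ input.text }} (visit on {{ input.visit_date }})"
sample_doc = {"text": "Patient reports mild headaches.", "metadata": {"source": "clinic"}}
print(missing_fields(prompt, sample_doc))
```

Here the prompt refers to `visit_date`, which the sample document lacks, so that field is reported as missing.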
### Validation
Validation is a first-class citizen in DocETL, ensuring the quality and correctness of processed data.
1. Basic Validation, where validation statements are Python expressions that evaluate to True or False:
```yaml
- name: extract_info
  type: map
  output:
    schema:
      insights: "list[{insight: string, supporting_actions: list[string]}]"
  validate:
    - len(output["insights"]) >= 2
    - all(len(insight["supporting_actions"]) >= 1 for insight in output["insights"])
  num_retries_on_validate_failure: 3
```
Access variables using dictionary syntax: `output["field"]`. Note that you can't access `input` docs in validation, but the output docs should have all the fields from the input docs (for non-reduce operations), since fields pass through unchanged.
If any validation statement is false, the operation is retried, up to `num_retries_on_validate_failure` times, before it fails.
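To make the semantics concrete, the sketch below evaluates the two statements from the example against a passing and a failing output. It is illustrative only and is not DocETL's implementation: `failed_validations` is a hypothetical helper, and the sample outputs are invented for the demonstration.

```python
def failed_validations(output: dict, statements: list[str]) -> list[str]:
    """Return the statements that evaluate to a falsy value for this output."""
    # Expose only `output` plus the few builtins the statements below need.
    scope = {"__builtins__": {"len": len, "all": all, "any": any}, "output": output}
    return [statement for statement in statements if not eval(statement, scope)]


statements = [
    'len(output["insights"]) >= 2',
    'all(len(insight["supporting_actions"]) >= 1 for insight in output["insights"])',
]

good = {"insights": [
    {"insight": "Users abandon checkout", "supporting_actions": ["Shorten the form"]},
    {"insight": "Search is slow", "supporting_actions": ["Add caching"]},
]}
bad = {"insights": [{"insight": "Users abandon checkout", "supporting_actions": []}]}

print(failed_validations(good, statements))  # nothing fails
print(failed_validations(bad, statements))   # both statements fail
```

The `bad` output has only one insight, and that insight has no supporting actions, so it violates both statements.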
2. Advanced Validation (Gleaning):
```yaml
- name: extract_insights
  type: map
  gleaning:
    num_rounds: 1
    validation_prompt: |
      Evaluate the extraction for completeness and relevance:
      1. Are all key user behaviors and pain points from the log addressed in the insights?
      2. Are the supporting actions practical and relevant to the insights?
      3. Is there any important information missing or any irrelevant information included?
```
Gleaning is an iterative process that refines LLM outputs:
1. The initial operation generates output
2. The validation prompt evaluates the output
3. The system assesses whether improvements are needed
4. If so, the output is refined using the feedback
5. The process repeats until validation passes or the maximum number of rounds is reached
Note that gleaning significantly increases the number of LLM calls: every output needs at least one extra validation call, so the count at least doubles. While this increases cost and latency, it can lead to higher quality outputs for complex tasks.
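The loop can be sketched with stand-in functions to make the call count concrete. Everything below is hypothetical: `generate`, `validate`, and `refine` stand in for LLM calls and are not DocETL functions. The sketch only mirrors the five steps listed above.

```python
from typing import Callable


def glean(
    generate: Callable[[], str],
    validate: Callable[[str], tuple[bool, str]],
    refine: Callable[[str, str], str],
    num_rounds: int,
) -> tuple[str, int]:
    """Run a gleaning-style loop and return (final_output, number_of_calls)."""
    output = generate()  # step 1: initial output
    calls = 1
    for _ in range(num_rounds):
        ok, feedback = validate(output)  # steps 2-3: evaluate the output
        calls += 1
        if ok:
            break
        output = refine(output, feedback)  # step 4: refine using the feedback
        calls += 1
    return output, calls


# Stand-ins: the first draft is rejected once, then refined.
result, calls = glean(
    generate=lambda: "draft",
    validate=lambda out: (out != "draft", "add the missing pain points"),
    refine=lambda out, feedback: out + " + " + feedback,
    num_rounds=1,
)
print(result, calls)
```

With `num_rounds: 1`, an output that passes validation costs two calls (generate, validate), and one that needs refinement costs three.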
## Creating and Running Pipelines
### Pipeline Development Process
1. Create your pipeline YAML file with:
- Dataset configuration
- Model configuration
- System prompt (optional)
- Operations
- Pipeline steps and output configuration
2. Decide if you need optimization:
**Required Optimization**: If your pipeline contains any resolve operations, you MUST run the optimizer:
```bash
docetl build pipeline.yaml
```
This will generate `pipeline_opt.yaml` with efficient blocking rules for resolve operations.
**Optional Optimization**: For pipelines without resolve operations, optimization is recommended but optional:
```yaml
# Set optimize: true for operations you want to optimize
operations:
  - name: extract_info
    type: map
    optimize: true # This operation will be rewritten into smaller, better-scoped operations
    ...
```
Then run:
```bash
docetl build pipeline.yaml
```
3. Run the pipeline:
```bash
# If you used the optimizer:
docetl run pipeline_opt.yaml
# If you didn't use the optimizer:
docetl run pipeline.yaml
```
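The build-then-run decision can be scripted. The sketch below only constructs the command lines. `docetl_commands` is a hypothetical helper, and the `_opt.yaml` suffix follows the `pipeline.yaml` to `pipeline_opt.yaml` naming described above.

```python
from pathlib import Path


def docetl_commands(pipeline: str, use_optimizer: bool) -> list[list[str]]:
    """Return the CLI invocations for a pipeline file, in the order they should run."""
    path = Path(pipeline)
    if not use_optimizer:
        return [["docetl", "run", str(path)]]
    # Assumption: the optimizer writes <name>_opt.yaml next to the input file.
    optimized = path.with_name(f"{path.stem}_opt{path.suffix}")
    return [["docetl", "build", str(path)], ["docetl", "run", str(optimized)]]


for command in docetl_commands("pipeline.yaml", use_optimizer=True):
    print(" ".join(command))
```

To execute the commands rather than print them, each list can be passed to `subprocess.run(command, check=True)`.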
### Development Tips
1. Start with a sample of your data:
```yaml
operations:
  - name: first_operation
    type: map
    sample: 10 # Only process 10 documents
    ...
```
2. Use intermediate directories for debugging:
```yaml
pipeline:
  output:
    type: file
    path: output.json
    intermediate_dir: intermediate_results # Each operation's output will be saved here
```
3. Iterative Development:
- Start with simple operations
- Test with small samples
- Add validation rules
- Scale up to full dataset
- Add optimization if needed
#### Example Development Workflow
1. Create initial pipeline (`pipeline.yaml`):
```yaml
default_model: gpt-5-nano
datasets:
  documents:
    path: data.json
    type: file
operations:
  - name: extract_entities
    type: map
    optimize: true
    sample: 10 # For development
    output:
      schema:
        entities: "list[{name: string, type: string}]"
    prompt: |
      Extract named entities from: {{ input.text }}
  # Unnest so that each entity becomes its own document (as in the medications example)
  - name: unnest_entities
    type: unnest
    unnest_key: entities
    recursive: true # Required when unnesting a key of type list[dict]
  - name: resolve_entities
    type: resolve
    optimize: true # Always set for resolve operations
    comparison_prompt: |
      Are these entities the same?
      Entity 1: {{ input1.name }} ({{ input1.type }})
      Entity 2: {{ input2.name }} ({{ input2.type }})
    resolution_prompt: |
      Create a canonical name for the following entities:
      {% for entity in inputs %}
      - {{ entity.name }} ({{ entity.type }})
      {% endfor %}
    output:
      schema:
        name: string
  - name: summarize_entities
    type: reduce
    reduce_key: name
    optimize: true
    prompt: |
      Summarize mentions of entity {{ reduce_key }}:
      {% for entity in inputs %}
      - {{ entity.text }}
      {% endfor %}
    output:
      schema:
        summary: string
pipeline:
  steps:
    - name: entity_analysis
      input: documents
      operations:
        - extract_entities
        - unnest_entities
        - resolve_entities
        - summarize_entities
  output:
    type: file
    path: entity_summary.json
    intermediate_dir: debug_outputs
```
2. Since this pipeline has a resolve operation, optimize it:
```bash
docetl build pipeline.yaml
```
3. Review the optimized pipeline in `pipeline_opt.yaml`
4. Run the optimized pipeline:
```bash
docetl run pipeline_opt.yaml
```
5. Check intermediate outputs in `debug_outputs/` directory
6. Once satisfied with results:
- Remove the `sample: 10` line
- Run on full dataset
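For step 5, a small script can summarize what each saved file contains. This is a sketch under assumptions: it assumes the intermediate directory holds JSON files that each contain a list of documents, and it makes no assumption about how DocETL names or nests those files (it simply searches recursively). `summarize_intermediates` is a hypothetical helper.

```python
import json
from pathlib import Path


def summarize_intermediates(directory: str) -> dict[str, dict]:
    """Map each JSON file under `directory` to its record count and field names."""
    summary = {}
    for path in sorted(Path(directory).rglob("*.json")):
        with path.open() as f:
            data = json.load(f)
        if not isinstance(data, list):
            continue  # skip files that are not a list of documents
        fields = sorted({key for doc in data if isinstance(doc, dict) for key in doc})
        summary[str(path.relative_to(directory))] = {"records": len(data), "fields": fields}
    return summary


if __name__ == "__main__":
    for name, info in summarize_intermediates("debug_outputs").items():
        print(f"{name}: {info['records']} records, fields={info['fields']}")
```

Comparing field names from one operation to the next is a quick way to confirm that a prompt only references fields that exist at that point in the pipeline.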
End of system description. Use this information to help generate accurate DocETL pipelines and configurations.
### Getting Started
If a user shares their data file without describing their task, ask them for:
1. A brief description of their dataset (e.g., "customer support call transcripts", "medical research papers", "product reviews")
2. What they want to learn or extract from the data (e.g., "find common user complaints and effective solutions", "extract research methods and findings", "identify product issues and improvement suggestions")
Example request for clarification:
"To help create an effective pipeline, please provide:
1. What kind of data is in your file? (e.g., 'customer support call transcripts')
2. What would you like to learn from this data? (e.g., 'find common issues users complain about and solutions that are effective')
This will help ensure the pipeline is properly designed for your specific needs."
### Communication Style
Start your first interaction with a friendly touch to make the user smile. Don't compliment DocETL, but instead compliment the user's task or data. For example:
- A data processing joke: "Why did the data scientist bring a ladder to work? To climb the decision tree!"
- A compliment about their task: "By the way, processing medical records to improve patient care? That's some seriously impactful work!"
- A fun fact about data: "Fun fact: If we printed all the data created in a single day, we'd need enough paper to cover Central Park 520 times! (Just kidding, I made that up)"
- A playful observation: "Your data pipeline is looking so clean, Marie Kondo would be proud!"
Don't copy these examples or reuse the same jokes; use your own creativity. If you tell a joke, make sure it makes sense.
If you know the user's name, use it in your introduction.
If the user simply copy-pastes this document, open with a friendly introduction and the same light touch.
Keep it:
- Professional but warm
- Related to their task or data processing
- Brief and natural
- Appropriate for a work context
- Light on emojis (they are allowed, but use them sparingly)