Extended Format Support
Potato supports annotation of various document formats beyond plain text, including PDF documents, Word files, spreadsheets, Markdown, and source code.
Overview
The format support system consists of two main components:
- Format Handlers - Extract text and coordinate mappings from documents
- Display Components - Render documents in the annotation interface
Supported Formats
| Format | Extensions | Display Type | Span Annotation |
|---|---|---|---|
pdf |
Yes | ||
| Word | .docx | document |
Yes |
| Markdown | .md, .markdown | document |
Yes |
| Excel/CSV | .xlsx, .csv, .tsv | spreadsheet |
Yes (row/cell) |
| Source Code | .py, .js, .java, etc. | code |
Yes (line-based) |
Installation
Extended format support requires additional dependencies. Install them with:
pip install pdfplumber python-docx mammoth mistune pygments openpyxl
Or install specific dependencies for the formats you need:
# PDF support
pip install pdfplumber
# Word document support
pip install python-docx mammoth
# Markdown support
pip install mistune pygments
# Excel support
pip install openpyxl
Configuration
PDF Documents
Display PDF documents with text extraction and span annotation support:
instance_display:
fields:
- key: document_path
type: pdf
label: "Document"
span_target: true
display_options:
view_mode: scroll # scroll, paginated, or side-by-side
max_height: 700 # Maximum viewer height in pixels
text_layer: true # Enable text selection
show_page_controls: true
initial_page: 1
zoom: auto # auto, page-fit, page-width, or percentage
View Modes:
- scroll - Continuous scrolling through pages (default)
- paginated - One page at a time with navigation
- side-by-side - Multiple pages visible
PDF Bounding Box Annotation
For image-based annotation on PDF pages (e.g., marking figures, tables, charts), use the bounding box annotation mode:
instance_display:
fields:
- key: document_path
type: pdf
label: "Document"
display_options:
annotation_mode: bounding_box # Enable bounding box drawing
view_mode: paginated # Auto-set to paginated for bbox
bbox_min_size: 10 # Minimum box size in pixels
show_bbox_labels: true # Show label on boxes
show_page_controls: true
Bounding Box Features: - Draw mode: Click and drag to create bounding boxes - Select mode: Click on boxes to select and manage them - Delete: Remove selected bounding boxes - Page navigation: Navigate between pages while maintaining box data - Label assignment: Associate labels with each bounding box
Bounding Box Coordinates:
Bounding boxes are stored with page information and normalized coordinates:
{
"format_coords": {
"format": "bounding_box",
"page": 2,
"bbox": [0.1, 0.2, 0.3, 0.15],
"bbox_pixels": [100, 200, 300, 150],
"label": "FIGURE"
}
}
The bbox field contains normalized coordinates [x, y, width, height] where
all values are 0-1 relative to page dimensions. The optional bbox_pixels
field contains the original pixel coordinates.
Word Documents (DOCX)
Display and annotate Word documents:
instance_display:
fields:
- key: document_path
type: document
label: "Word Document"
span_target: true
display_options:
collapsible: false
max_height: 500
show_outline: true # Show table of contents
style_theme: default # default, minimal, or print
HTML/Document Bounding Box Annotation
For image-based region annotation on rendered HTML documents (e.g., marking headers, paragraphs, images, tables), use the bounding box annotation mode:
instance_display:
fields:
- key: html_content
type: document
label: "Document Layout"
display_options:
annotation_mode: bounding_box # Enable bounding box drawing
bbox_min_size: 10 # Minimum box size in pixels
show_bbox_labels: true # Show labels on boxes
style_theme: default
Bounding Box Features: - Draw mode: Click and drag to create bounding boxes on the document - Select mode: Click on boxes to select and manage them - Delete: Remove selected bounding boxes - Label assignment: Associate labels with each bounding box
Use Cases: - Document layout analysis (identifying headers, paragraphs, figures) - Web page region annotation - Form field detection - UI element labeling
Spreadsheets (Excel/CSV)
Display tabular data with row-based or cell-based annotation:
instance_display:
fields:
- key: data_table
type: spreadsheet
label: "Data Table"
display_options:
annotation_mode: row # row, cell, or range
show_headers: true
max_height: 400
striped: true # Alternating row colors
hoverable: true # Highlight on hover
sortable: true # Enable column sorting
selectable: true # Enable row/cell selection
compact: false # Compact table styling
Annotation Modes:
- row - Annotate entire rows (default)
- cell - Annotate individual cells
- range - Select cell ranges
Source Code
Display syntax-highlighted code with line-level annotation:
instance_display:
fields:
- key: code
type: code
label: "Source Code"
span_target: true
display_options:
language: python # Auto-detect if not specified
show_line_numbers: true
max_height: 500
wrap_lines: false
highlight_lines: [5, 10] # Highlight specific lines
theme: default # default or dark
copy_button: true
Supported Languages: Python, JavaScript, TypeScript, Java, C/C++, C#, Go, Rust, Ruby, PHP, Swift, Kotlin, Scala, R, SQL, Bash, YAML, JSON, XML, HTML, CSS, and more.
Markdown
Display rendered Markdown with syntax highlighting:
instance_display:
fields:
- key: markdown_content
type: document
label: "Documentation"
span_target: true
display_options:
style_theme: minimal
show_outline: true
Format Handling Configuration
Configure global format handling options:
format_handling:
enabled: true
default_format: auto # Auto-detect from file extension
pdf:
extraction_mode: text # text, ocr, or hybrid
cache_extracted: true
spreadsheet:
annotation_mode: row
max_rows: 1000
Data Structure
File Path References
Reference external files in your data:
{
"id": "doc_001",
"document_path": "/path/to/document.pdf",
"text": "Extracted text for search..."
}
Embedded Table Data
Embed table data directly:
{
"id": "table_001",
"table_data": {
"headers": ["Column A", "Column B", "Column C"],
"rows": [
["Value 1", "Value 2", "Value 3"],
["Value 4", "Value 5", "Value 6"]
]
}
}
Embedded Code
Embed source code directly:
{
"id": "code_001",
"code": "def hello():\n print('Hello, World!')",
"language": "python"
}
Coordinate Mapping
When span annotations are made on format documents, Potato stores both: 1. Character offsets (standard for all text) 2. Format-specific coordinates
PDF Coordinates
{
"start": 245,
"end": 258,
"format_coords": {
"format": "pdf",
"page": 2,
"bbox": [120.5, 340.2, 185.3, 352.8]
}
}
Spreadsheet Coordinates
{
"start": 0,
"end": 15,
"format_coords": {
"format": "spreadsheet",
"row": 5,
"col": 2,
"cell_ref": "B5"
}
}
Code Coordinates
{
"start": 100,
"end": 150,
"format_coords": {
"format": "code",
"line": 10,
"column": 5
}
}
Example Projects
See the following example projects:
examples/image/pdf-annotation/- PDF entity annotationexamples/advanced/spreadsheet-annotation/- Tabular data quality reviewexamples/advanced/code-annotation/- Code quality review
Troubleshooting
PDF not displaying
- Ensure PDF.js can access the file (CORS for remote URLs)
- Check browser console for JavaScript errors
- Try enabling
text_layer: falseto test rendering
Missing dependencies
If you see import errors, install the required packages:
pip install pdfplumber python-docx mammoth mistune pygments openpyxl
Slow extraction
For large documents:
1. Use max_pages option for PDFs
2. Use max_rows option for spreadsheets
3. Consider pre-extracting content
API Reference
Format Handler Registry
from potato.format_handlers import format_handler_registry
# Check if a file can be handled
if format_handler_registry.can_handle("document.pdf"):
output = format_handler_registry.extract("document.pdf")
# Access extracted content
text = output.text
html = output.rendered_html
coords = output.coordinate_map
metadata = output.metadata
# List supported formats
formats = format_handler_registry.get_supported_formats()
Coordinate Mapper
from potato.format_handlers import CoordinateMapper, PDFCoordinate
# Create mapper
mapper = CoordinateMapper()
# Add mappings
mapper.add_mapping(0, 100, PDFCoordinate(page=1, bbox=[10, 20, 200, 30]))
# Get coordinates for a character range
coords = mapper.get_coords_for_range(50, 75)
Bounding Box Coordinates
from potato.format_handlers import BoundingBoxCoordinate
# Create from normalized coordinates
bbox = BoundingBoxCoordinate(
page=2,
bbox=[0.1, 0.2, 0.3, 0.15], # [x, y, width, height] normalized 0-1
label="FIGURE"
)
# Create from pixel coordinates
bbox = BoundingBoxCoordinate.from_pixel_coords(
page=2,
x=100, y=200, width=300, height=150,
page_width=1000, page_height=800,
label="TABLE"
)
# Convert to pixel coordinates
pixels = bbox.to_pixel_coords(1000, 800) # [x, y, width, height]