Skip to content

Multimodal-Agent Annotation

Agents increasingly act in modalities beyond text and static images — they drive GUIs, watch video, hold spoken conversations. These schemas (the M-series, multimodal half) give human raters surfaces purpose-built for those traces, beyond Potato's existing image, audio, video, and web-agent displays.

GUI / computer-use trajectory (gui_trajectory)

Evaluate a computer-use / GUI / OS agent step by step (OSWorld, NeurIPS 2024; ScreenSpot-Pro; AndroidWorld). Each step shows the screenshot the agent saw and the action it took; the annotator judges the action (correct / wrong element / wrong action / hallucinated) and, when the step carries click coordinates, sees a grounding marker on the screenshot to check whether the click landed on the right element. Generalizes the web-agent display to any pixel/DOM GUI agent.

annotation_schemes:
  - annotation_type: gui_trajectory
    name: gui_review
    description: "For each step: was the action correct and did the click land right?"
    steps_key: steps
    screenshot_key: screenshot   # field on each step holding an image URL / data-URI
    action_key: action           # field holding the action text
    coord_space: normalized      # normalized (0..1) | pixels — for the x/y grounding marker
    verdict_options: [correct, wrong_element, wrong_action, hallucinated]

Each step may provide screenshot, action, and optional x/y (or a nested click: {x, y}) for the grounding marker. Stored as a list of {index, step, verdict, notes}, keyed by index. Example: examples/agent-traces/gui-trajectory/ (uses self-contained inline-SVG screenshots).

Voice / full-duplex interaction (voice_interaction)

Annotate a spoken human↔agent conversation for turn-taking and barge-in handling (Full-Duplex-Bench v1–v3, 2503.04721…; τ-Voice, 2603.13686). A dual-track timeline (user lane + agent lane) places each turn by its start/end time and highlights overlap regions where both speakers talk at once; the annotator classifies each overlap (agent should respond / should resume / backchannel / uncertain) and rates the overall turn-taking. The source audio plays inline when an audio URL is provided.

annotation_schemes:
  - annotation_type: voice_interaction
    name: turn_taking
    description: "Classify each barge-in/overlap and rate the overall turn-taking."
    turns_key: turns           # list of {speaker, start, end, text} (seconds)
    speaker_key: speaker
    user_speakers: [user, human, caller]   # everything else is treated as the agent
    overlap_labels: [agent_should_respond, agent_should_resume, backchannel, uncertain]
    rating_scale: 5
    # audio_key: audio         # optional per-instance audio URL to enable the player

Overlaps between turns of different speakers are computed at render time (no manual setup). Stored as {"overlaps": {idx: label}, "rating": int}. Example: examples/agent-traces/voice-interaction/.

Interleaved multimodal reasoning (multimodal_reasoning)

Rate an interleaved text ↔ image ↔ tool ↔ action reasoning trace step by step (Multimodal RewardBench 2, 2512.16899; Zebra-CoT). Each step is a typed block, rendered in-line by its type; the annotator judges each step's coherence — does the reasoning follow from the image and prior steps, or is the visual hallucinated?

annotation_schemes:
  - annotation_type: multimodal_reasoning
    name: reasoning_review
    description: "Judge each step: coherent reasoning and grounded visuals?"
    steps_key: steps
    type_key: type     # each step's 'type': text | image | tool | action (inferred if absent)
    verdict_options: [coherent, incoherent, visual_hallucination, uncertain]

Each step may carry text/content, image/image_url (+caption), or tool/args. Stored as a list of {index, step, type, verdict, notes}, keyed by index. Example: examples/agent-traces/multimodal-reasoning/ (uses inline-SVG images, including a deliberate visual-hallucination case to annotate).

Video temporal grounding (temporal_grounding)

Mark event time intervals in a video for temporal-grounding evaluation (ET-Bench; TimeScope, 2509.26360). For each event prompt the annotator sets the gold [start, end] — by capturing the playhead ("set in/out") or typing seconds — and, when the data carries a model's predicted interval, sees a live IoU and a two-bar mini-timeline (predicted vs. gold). Purpose-built for predicted-vs-gold localization scoring, distinct from the general segment labeling in video_annotation.

annotation_schemes:
  - annotation_type: temporal_grounding
    name: grounding
    description: "Mark the gold start/end interval for each event. IoU vs prediction updates live."
    video_key: video           # per-instance video URL
    events_key: events         # list of {prompt, predicted: {start, end}} (predicted optional)
    # duration: 120            # optional fixed timeline scale (else inferred from the video)

Stored as {"events": {idx: {start, end}}}. Example: examples/agent-traces/temporal-grounding/.

Aligned-transcript speech errors (speech_transcript)

Annotate a time-aligned speech transcript segment by segment for ASR/TTS and speech-quality errors (Speak&Improve 2025, 2412.11986; NVSpeech). Each segment {start, end, text, speaker?} is a card showing its timestamp and text; the annotator tags errors (ASR error / TTS artifact / mispronunciation / disfluency …) and can type the corrected transcript. Segment-level complement to the turn-taking view in voice_interaction.

annotation_schemes:
  - annotation_type: speech_transcript
    name: speech_errors
    description: "Tag speech errors on each segment and correct the transcript where needed."
    segments_key: segments       # list of {start, end, text, speaker?}
    error_types: [asr_error, tts_artifact, mispronunciation, disfluency]
    allow_correction: true
    # audio_key: audio           # optional per-item audio URL to enable the player

Stored as a list of {index, start, end, errors, correction}, keyed by index. Example: examples/agent-traces/speech-transcript/.

Table-grid structure (table_grid)

Annotate the cell structure of a table image — the document-specific piece that plain bounding boxes can't capture (OmniDocBench, CVPR 2025; RealHiTBench). The annotator sets the grid dimensions and clicks cells to mark their role (data / column-header / row-header / empty). Per-page region boxes (table / figure / header) are already covered by running image_annotation per page, so this schema focuses on the structure those boxes can't express.

annotation_schemes:
  - annotation_type: table_grid
    name: structure
    description: "Set the grid size, then click cells to mark headers and empty cells."
    image_key: image           # per-instance table image URL / data-URI
    rows_key: rows             # optional initial dims from the data
    cols_key: cols
    roles: [data, col_header, row_header, empty]   # click cycles through these

Stored as {rows, cols, cells: {"r,c": role}} (only non-data cells stored). Example: examples/agent-traces/table-grid/.