Embedding Dashboard Visualization

The Embedding Visualization feature provides an interactive 2D visualization of your annotation data in the admin dashboard. Using UMAP dimensionality reduction on text embeddings, it allows you to:

Explore data patterns: See clustering and distribution of your instances
Track annotation progress: Visualize annotated vs. unannotated items
Prioritize annotation: Select regions to prioritize for annotation
Understand label distribution: Color points by predicted labels (MACE or majority vote)

Requirements

The embedding visualization requires the following dependencies:

# Required dependencies
pip install "umap-learn>=0.5.0"  # Quoted so the shell does not treat ">" as a redirect
pip install sentence-transformers  # Already required for diversity_ordering
pip install scikit-learn  # Already required for diversity_ordering

Additionally, diversity ordering must be enabled in your configuration, as the visualization uses the embeddings computed by the DiversityManager.
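
Before enabling the feature, it can help to confirm that the three packages are importable. The snippet below is a minimal sketch and not part of the tool itself: `check_dependencies` is a hypothetical helper name, and it only checks that the import names `umap`, `sentence_transformers`, and `sklearn` resolve in the current environment.

```python
from importlib.util import find_spec

# Map each pip package name to the module name it installs.
REQUIRED_MODULES = {
    "umap-learn": "umap",
    "sentence-transformers": "sentence_transformers",
    "scikit-learn": "sklearn",
}


def check_dependencies(modules=REQUIRED_MODULES):
    """Return {pip_name: bool}, True when the module can be located."""
    return {
        pip_name: find_spec(module) is not None
        for pip_name, module in modules.items()
    }


for pip_name, available in check_dependencies().items():
    print(f"{pip_name}: {'ok' if available else 'MISSING'}")
```

Run it with the same Python interpreter that starts the annotation server, since a package installed in a different environment will not be found.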

Configuration

Add the embedding_visualization section to your YAML config file:

# Required: Enable diversity ordering for embeddings
diversity_ordering:
  enabled: true
  model_name: "all-MiniLM-L6-v2"
  num_clusters: 10

# Optional: Configure embedding visualization
embedding_visualization:
  enabled: true                    # Enable/disable visualization (default: true)
  sample_size: 1000               # Max instances to visualize (default: 1000)
  include_all_annotated: true     # Always include annotated items (default: true)
  embedding_model: "all-MiniLM-L6-v2"  # Text embedding model
  label_source: "mace"            # "mace" or "majority" (default: "mace")

  umap:                           # UMAP projection settings
    n_neighbors: 15               # Number of neighbors (default: 15)
    min_dist: 0.1                 # Minimum distance (default: 0.1)
    metric: "cosine"              # Distance metric (default: "cosine")

Configuration Options

Option	Type	Default	Description
`enabled`	bool	`true`	Enable/disable the visualization
`sample_size`	int	1000	Maximum instances to visualize (for performance)
`include_all_annotated`	bool	`true`	Always include all annotated instances in the sample
`embedding_model`	string	"all-MiniLM-L6-v2"	Sentence-transformer model for text
`label_source`	string	"mace"	Label source: "mace" (MACE predictions) or "majority" (majority vote)
`umap.n_neighbors`	int	15	UMAP: Number of neighbors to consider
`umap.min_dist`	float	0.1	UMAP: Minimum distance between points (0-1)
`umap.metric`	string	"cosine"	UMAP: Distance metric (cosine, euclidean, manhattan, correlation)
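
To make the defaults in the table concrete, here is a rough sketch of how user settings could be overlaid on them. `resolve_viz_config`, `DEFAULTS`, and the validation rules are illustrative assumptions, not the tool's actual loader; in particular, the metric check only accepts the four metrics listed in the table.

```python
from copy import deepcopy

# Defaults copied from the options table above.
DEFAULTS = {
    "enabled": True,
    "sample_size": 1000,
    "include_all_annotated": True,
    "embedding_model": "all-MiniLM-L6-v2",
    "label_source": "mace",
    "umap": {"n_neighbors": 15, "min_dist": 0.1, "metric": "cosine"},
}
VALID_LABEL_SOURCES = {"mace", "majority"}
# Assumption: only the metrics named in the table are accepted.
VALID_METRICS = {"cosine", "euclidean", "manhattan", "correlation"}


def resolve_viz_config(user_config=None):
    """Overlay user settings on the documented defaults and sanity-check them."""
    resolved = deepcopy(DEFAULTS)
    for key, value in (user_config or {}).items():
        if key == "umap":
            resolved["umap"].update(value or {})  # merge nested UMAP settings
        else:
            resolved[key] = value

    if resolved["label_source"] not in VALID_LABEL_SOURCES:
        raise ValueError(f"label_source must be one of {sorted(VALID_LABEL_SOURCES)}")
    if resolved["umap"]["metric"] not in VALID_METRICS:
        raise ValueError(f"umap.metric must be one of {sorted(VALID_METRICS)}")
    if not 0 <= resolved["umap"]["min_dist"] <= 1:
        raise ValueError("umap.min_dist must be between 0 and 1")
    if resolved["sample_size"] < 1:
        raise ValueError("sample_size must be a positive integer")
    return resolved


print(resolve_viz_config({"sample_size": 500, "umap": {"n_neighbors": 5}}))
```

Only the keys you set are overridden; everything else falls back to the documented default.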

Using the Visualization

Accessing the Dashboard

Navigate to the Admin Dashboard (/admin)
Enter your admin API key (found in your task directory as admin_api_key.txt or set in config)
Click the "Embeddings" tab

Understanding the Visualization

The scatter plot shows your instances projected into 2D space:

Position: Similar instances appear close together based on their text embeddings
Color: Points are colored by their predicted label
  If using MACE: Colors reflect MACE's best prediction
  If using majority vote: Colors reflect the most common annotation
  Gray points are unannotated
Hover: Hover over a point to see the instance preview in the side panel

Selection Tools

Use the Plotly.js selection tools to select instances:

Lasso Selection: Click and drag to draw a free-form selection
Box Selection: Click and drag to select a rectangular region
Click: Click individual points to add them to your selection

Priority Queue

The selection panel allows you to create a priority queue for annotation:

Make a selection using lasso or box tool
Click "Add to Queue" to add the selection as a priority group
Repeat to add multiple priority groups
Click "Apply Reordering" to reorder the annotation queue

When you apply reordering:

Selected instances are moved to the front of the annotation queue
Multiple selections are interleaved by priority
Lower priority numbers come first in each round

Interleaving Example

If you select two regions:

Region 1 (Priority 1): [A, B, C]
Region 2 (Priority 2): [X, Y]

The resulting order will be: A, X, B, Y, C

This ensures diverse annotation coverage even if annotators only complete part of the queue.
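
The interleaving described above amounts to a round-robin over the priority groups. The following is a minimal sketch of that idea, not the tool's actual implementation; `interleave_by_priority` is a hypothetical name, and dropping duplicate IDs when selections overlap is an assumption.

```python
from itertools import zip_longest

_MISSING = object()  # sentinel marking slots where a shorter group has run out


def interleave_by_priority(selections):
    """Round-robin over groups, visiting lower priority numbers first in each round."""
    groups = [
        s["instance_ids"]
        for s in sorted(selections, key=lambda s: s["priority"])
    ]
    ordered, seen = [], set()
    for round_items in zip_longest(*groups, fillvalue=_MISSING):
        for instance_id in round_items:
            # Assumption: an ID selected in two regions is queued only once.
            if instance_id is _MISSING or instance_id in seen:
                continue
            seen.add(instance_id)
            ordered.append(instance_id)
    return ordered


selections = [
    {"instance_ids": ["X", "Y"], "priority": 2},
    {"instance_ids": ["A", "B", "C"], "priority": 1},
]
print(interleave_by_priority(selections))  # ['A', 'X', 'B', 'Y', 'C']
```

Each round takes one item from every group that still has items, which is why the shorter Region 2 is exhausted before Region 1.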

API Endpoints

The visualization is powered by these admin API endpoints:

GET /admin/api/embedding_viz/data

Returns visualization data including 2D coordinates, labels, and colors.

Query Parameters:

force_refresh: If "true", recompute UMAP projection

Response:

{
  "points": [
    {
      "instance_id": "item_001",
      "x": 0.234,
      "y": -1.456,
      "label": "Positive",
      "label_source": "mace",
      "preview": "This is the text content...",
      "preview_type": "text",
      "annotated": true,
      "annotation_count": 3
    }
  ],
  "labels": ["Positive", "Negative", "Neutral", null],
  "label_colors": {
    "Positive": "#22c55e",
    "Negative": "#ef4444",
    "Neutral": "#eab308",
    "null": "#94a3b8"
  },
  "stats": {
    "total_instances": 1000,
    "visualized_instances": 500,
    "annotated_instances": 342,
    "unannotated_instances": 658
  }
}
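
As an illustration of consuming this response, the sketch below counts points per label and looks up each label's color. It works on a trimmed-down payload in the documented shape; `summarize_points` is a hypothetical helper, not part of the API.

```python
import json
from collections import Counter

# Trimmed-down payload in the shape documented above.
raw = """
{
  "points": [
    {"instance_id": "item_001", "x": 0.234, "y": -1.456, "label": "Positive", "annotated": true},
    {"instance_id": "item_002", "x": 1.1, "y": 0.3, "label": null, "annotated": false},
    {"instance_id": "item_003", "x": 0.2, "y": -1.2, "label": "Positive", "annotated": true}
  ],
  "label_colors": {"Positive": "#22c55e", "null": "#94a3b8"}
}
"""


def summarize_points(payload):
    """Count points per label and attach the color assigned to that label."""
    counts = Counter(point["label"] for point in payload["points"])
    colors = payload["label_colors"]
    # JSON object keys are strings, so the unlabeled color is stored under "null"
    # even though unlabeled points carry a real null (None in Python).
    return {
        label: {"count": n, "color": colors.get("null" if label is None else label)}
        for label, n in counts.items()
    }


summary = summarize_points(json.loads(raw))
print(summary)
```

Note the mismatch the helper works around: a point's `label` can be a JSON null, while the matching `label_colors` key is the string "null".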

POST /admin/api/embedding_viz/reorder

Reorder the annotation queue based on selections.

Request Body:

{
  "selections": [
    {
      "instance_ids": ["item_005", "item_012", "item_023"],
      "priority": 1
    },
    {
      "instance_ids": ["item_101", "item_102"],
      "priority": 2
    }
  ],
  "interleave": true
}

Response:

{
  "success": true,
  "reordered_count": 5,
  "new_order_preview": ["item_005", "item_101", "item_012", "item_102", "item_023"]
}
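
A reorder request can be assembled with the standard library alone. This sketch builds the request without sending it; `build_reorder_request` is a hypothetical helper and the server address is an assumed local one. How the admin API key is attached is not specified in this section, so pass whatever header your deployment expects through `headers`.

```python
import json
from urllib.request import Request

REORDER_PATH = "/admin/api/embedding_viz/reorder"


def build_reorder_request(base_url, groups, interleave=True, headers=None):
    """Build (but do not send) the POST request; groups are listed highest priority first."""
    body = {
        "selections": [
            {"instance_ids": list(ids), "priority": rank}
            for rank, ids in enumerate(groups, start=1)  # priorities 1, 2, 3, ...
        ],
        "interleave": interleave,
    }
    all_headers = {"Content-Type": "application/json", **(headers or {})}
    return Request(
        base_url.rstrip("/") + REORDER_PATH,
        data=json.dumps(body).encode("utf-8"),
        headers=all_headers,
        method="POST",
    )


request = build_reorder_request(
    "http://localhost:8000",  # assumed address of a locally running server
    [["item_005", "item_012", "item_023"], ["item_101", "item_102"]],
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

To actually send it, pass the returned object to `urllib.request.urlopen` and decode the JSON response.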

POST /admin/api/embedding_viz/refresh

Force re-computation of embeddings and UMAP projection.

Request Body:

{
  "force_recompute": true
}

GET /admin/api/embedding_viz/stats

Returns statistics about the embedding visualization system.

Response:

{
  "enabled": true,
  "umap_available": true,
  "numpy_available": true,
  "embeddings_available": true,
  "embedding_count": 1000,
  "cache_valid": true,
  "config": {
    "sample_size": 1000,
    "include_all_annotated": true,
    "label_source": "mace",
    "umap_n_neighbors": 15,
    "umap_min_dist": 0.1
  }
}
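
The flags in this response map fairly directly onto the troubleshooting cases later in this document. The sketch below turns a stats payload into a list of hints; `diagnose` is a hypothetical helper, and the wording of each hint is an interpretation of the flag names rather than documented behavior.

```python
def diagnose(stats):
    """Translate the flags of a stats response into human-readable hints."""
    hints = []
    if not stats.get("enabled", False):
        hints.append("Visualization is disabled: check embedding_visualization.enabled.")
    if not stats.get("umap_available", False):
        hints.append("umap-learn is not installed.")
    if not stats.get("numpy_available", False):
        hints.append("numpy is not installed.")
    if not stats.get("embeddings_available", False) or stats.get("embedding_count", 0) == 0:
        hints.append("No embeddings yet: check diversity_ordering and data loading.")
    if not stats.get("cache_valid", False):
        hints.append("No valid cached projection: the next load may be slow.")
    return hints


healthy = {
    "enabled": True,
    "umap_available": True,
    "numpy_available": True,
    "embeddings_available": True,
    "embedding_count": 1000,
    "cache_valid": True,
}
print(diagnose(healthy))                              # []
print(diagnose(dict(healthy, umap_available=False)))  # ['umap-learn is not installed.']
```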

Performance Considerations

Large Datasets

For datasets with many instances:

Sampling: The visualization automatically samples instances based on sample_size
Include Annotated: Setting include_all_annotated: true ensures annotated items are always shown
UMAP Parameters: Lower n_neighbors values compute faster but emphasize local neighborhoods, so broad global structure may be lost
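
One way to picture how sample_size and include_all_annotated interact is the sketch below. It is an illustration under stated assumptions, not the tool's sampling code: `sample_for_visualization` is a hypothetical name, the sample is drawn uniformly at random, and annotated items are kept even when they alone exceed sample_size.

```python
import random


def sample_for_visualization(instance_ids, annotated_ids, sample_size=1000,
                             include_all_annotated=True, seed=0):
    """Pick ids to plot, optionally keeping every annotated id."""
    rng = random.Random(seed)  # fixed seed keeps the sample stable between calls
    if not include_all_annotated:
        return rng.sample(list(instance_ids), min(sample_size, len(instance_ids)))

    annotated = [i for i in instance_ids if i in annotated_ids]
    unannotated = [i for i in instance_ids if i not in annotated_ids]
    # Assumption: annotated items are never dropped, so the result can exceed
    # sample_size when there are more annotated items than the limit.
    remaining = max(sample_size - len(annotated), 0)
    return annotated + rng.sample(unannotated, min(remaining, len(unannotated)))


ids = [f"item_{i:03d}" for i in range(100)]
annotated = set(ids[:10])
picked = sample_for_visualization(ids, annotated, sample_size=25)
print(len(picked), annotated <= set(picked))  # 25 True
```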

Caching

UMAP projections are cached after first computation
Cache is invalidated when new embeddings are added
Use the "Refresh" button or /refresh endpoint to force recomputation
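
The caching rules above can be sketched as a small wrapper around the projection step. This is a simplified illustration, not the tool's cache: `ProjectionCache` is a hypothetical class, a change in the number of embeddings stands in for "new embeddings were added", and `fake_projection` replaces UMAP so the example runs without extra dependencies.

```python
class ProjectionCache:
    """Cache a 2D projection and reuse it until the embedding set changes."""

    def __init__(self, compute_projection):
        self._compute = compute_projection  # callable: list of vectors -> list of (x, y)
        self._projection = None
        self._cached_count = None

    def get(self, embeddings, force_refresh=False):
        # Assumption: a different embedding count means new embeddings were added.
        stale = self._cached_count != len(embeddings)
        if force_refresh or stale or self._projection is None:
            self._projection = self._compute(embeddings)
            self._cached_count = len(embeddings)
        return self._projection


calls = []


def fake_projection(embeddings):
    """Stand-in for UMAP: keeps the first two coordinates of each vector."""
    calls.append(len(embeddings))
    return [(v[0], v[1]) for v in embeddings]


cache = ProjectionCache(fake_projection)
vectors = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
cache.get(vectors)                                          # computed
cache.get(vectors)                                          # served from cache
cache.get(vectors + [[0.7, 0.8, 0.9]])                      # recomputed: count changed
cache.get(vectors + [[0.7, 0.8, 0.9]], force_refresh=True)  # recomputed on request
print(calls)  # [2, 3, 3]
```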

Troubleshooting

"Embedding visualization not enabled"

Cause: Missing dependencies or disabled in config.

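Solution: Install the missing dependency, and check that embedding_visualization.enabled is not set to false in your config: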

pip install "umap-learn>=0.5.0"

"Diversity manager not available"

Cause: diversity_ordering is not enabled in config.

Solution: Add to your config:

diversity_ordering:
  enabled: true

"No embeddings available"

Cause: Embeddings haven't been computed yet.

Solution: Embeddings are computed when items are loaded. Ensure your data files are loaded and wait for embedding computation to complete.

Visualization is slow

Cause: Large number of instances or first-time computation.

Solutions:

Reduce sample_size in config
Reduce umap.n_neighbors
Wait for initial computation to complete (subsequent loads use cache)

Related Documentation

Diversity Ordering - Required for embeddings
MACE Adjudication - Label predictions used for coloring
Admin Dashboard - Dashboard overview
