Solo Mode Advanced Features

This page documents advanced subsystems available in Solo Mode that go beyond the core 10-phase workflow described in the Solo Mode guide. These features enable automated quality improvement, cost optimization, and deeper analysis of annotation patterns.

Edge Case Rule Discovery

Inspired by the Co-DETECT framework, the edge case rule system automatically discovers annotation rules from instances where the LLM has low confidence. Rules are extracted, clustered into categories, reviewed by the human annotator, and injected back into the annotation prompt.

How It Works

Rule extraction: When the LLM labels an instance with confidence below confidence_threshold, it generates a generalizable rule of the form "When <condition> → <action>".
Clustering: Once enough rules accumulate (min_rules_for_clustering), they are clustered by semantic similarity using sentence embeddings and K-Means. Each cluster is summarized into a single category by the LLM.
Human review: Categories are presented to the annotator for approval or rejection.
Prompt injection: Approved categories are injected into the annotation prompt, either via LLM-assisted integration or by appending an "Edge Case Guidelines" section.
Re-annotation: Instances previously labeled with low confidence under old prompts are re-annotated with the improved prompt.
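
The gating and prompt-injection steps above can be sketched as follows. This is a minimal illustration, not the actual implementation: the class and function names are hypothetical, the rule text is invented for the example, and only the simpler "append a section" injection path is shown (the LLM-assisted integration path is not).

```python
from dataclasses import dataclass


@dataclass
class EdgeCaseRuleConfig:
    # Hypothetical container; the values mirror the documented defaults.
    confidence_threshold: float = 0.75
    min_rules_for_clustering: int = 10


def should_extract_rule(confidence: float, cfg: EdgeCaseRuleConfig) -> bool:
    """A rule is extracted only when LLM confidence is below the threshold."""
    return confidence < cfg.confidence_threshold


def should_cluster(unclustered_rules: int, cfg: EdgeCaseRuleConfig) -> bool:
    """Clustering triggers once enough unclustered rules have accumulated."""
    return unclustered_rules >= cfg.min_rules_for_clustering


def append_edge_case_guidelines(prompt: str, approved: list[str]) -> str:
    """Append approved category summaries as an 'Edge Case Guidelines' section."""
    if not approved:
        return prompt
    bullets = "\n".join(f"- {rule}" for rule in approved)
    return f"{prompt.rstrip()}\n\nEdge Case Guidelines\n{bullets}\n"


cfg = EdgeCaseRuleConfig()
print(should_extract_rule(0.62, cfg))
print(append_edge_case_guidelines(
    "Classify the sentiment of the text.",
    ["When sarcasm is present → label by the intended meaning"],
))
```

Rejected categories are simply left out of the `approved` list, so they never reach the prompt.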

Configuration

solo_mode:
  edge_case_rules:
    enabled: true
    confidence_threshold: 0.75
    min_rules_for_clustering: 10
    target_cluster_size: 15
    auto_extract_on_labeling: true
    reannotation_enabled: true
    reannotation_confidence_threshold: 0.60
    max_reannotations_per_instance: 2

Option	Default	Description
`enabled`	`true`	Enable/disable the edge case rule system
`confidence_threshold`	`0.75`	Extract rules when LLM confidence is below this value
`min_rules_for_clustering`	`10`	Minimum unclustered rules before triggering clustering
`target_cluster_size`	`15`	Target number of rules per cluster (Co-DETECT recommends 10–20)
`auto_extract_on_labeling`	`true`	Automatically extract rules during LLM labeling
`reannotation_enabled`	`true`	Re-annotate low-confidence instances after prompt updates
`reannotation_confidence_threshold`	`0.60`	Only re-annotate instances with confidence below this
`max_reannotations_per_instance`	`2`	Maximum times a single instance can be re-annotated
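
As a rough sketch of how the three re-annotation options interact, the check below combines them into one eligibility test. The function name and the record layout are assumptions made for illustration; only the option names and defaults come from the table above.

```python
def eligible_for_reannotation(
    confidence: float,
    times_reannotated: int,
    *,
    reannotation_enabled: bool = True,
    reannotation_confidence_threshold: float = 0.60,
    max_reannotations_per_instance: int = 2,
) -> bool:
    """Return True if an instance may be re-annotated after a prompt update."""
    if not reannotation_enabled:
        return False
    return (
        confidence < reannotation_confidence_threshold
        and times_reannotated < max_reannotations_per_instance
    )


# Hypothetical records: (instance_id, confidence, times_reannotated)
records = [("a", 0.45, 0), ("b", 0.58, 2), ("c", 0.72, 0)]
queue = [iid for iid, conf, n in records if eligible_for_reannotation(conf, n)]
print(queue)  # ['a']
```

Instance "b" is skipped because it has already used both allowed re-annotations, and "c" is skipped because its confidence is above the re-annotation threshold.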

Instance Selection Weight

You can direct the instance selector to prioritize instances matching edge case rules:

solo_mode:
  instance_selection:
    edge_case_rule_weight: 0.1  # default: 0.0

Increase this to route more items from edge case rule clusters to the human annotator.

Labeling Functions

Inspired by ALCHEmist (NeurIPS 2024), this system extracts reusable labeling functions from high-confidence LLM predictions. These functions can label new instances via keyword matching and majority voting — without additional API calls.

How It Works

Extraction: From predictions where the LLM reports high confidence (min_confidence), the system asks the LLM to identify generalizable patterns (keywords, conditions). Falls back to keyword frequency analysis if the LLM is unavailable.
Application: For each new instance, all enabled labeling functions vote on a label using confidence-weighted majority voting.
Acceptance: If vote agreement exceeds vote_threshold, the label is accepted without calling the LLM. Otherwise the instance is passed through to the normal LLM labeling pipeline.
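
The application and acceptance steps can be illustrated with a small confidence-weighted vote. This is a sketch under assumptions: the `LabelingFunction` class, substring keyword matching, and the definition of agreement as the winning label's share of total vote weight are all hypothetical choices, not the documented internals.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class LabelingFunction:
    # Hypothetical shape: a keyword pattern that votes for one label.
    keywords: tuple[str, ...]
    label: str
    confidence: float
    enabled: bool = True

    def matches(self, text: str) -> bool:
        lowered = text.lower()
        return any(keyword in lowered for keyword in self.keywords)


def vote(text: str, functions: list[LabelingFunction], vote_threshold: float = 0.5):
    """Return the winning label, or None to fall through to the LLM pipeline."""
    totals: dict[str, float] = defaultdict(float)
    for fn in functions:
        if fn.enabled and fn.matches(text):
            totals[fn.label] += fn.confidence  # each vote is weighted by confidence
    total = sum(totals.values())
    if total == 0:
        return None  # no function matched
    best = max(totals, key=totals.get)
    agreement = totals[best] / total
    return best if agreement > vote_threshold else None


functions = [
    LabelingFunction(("refund", "broken"), "negative", 0.90),
    LabelingFunction(("terrible",), "negative", 0.86),
    LabelingFunction(("love", "great"), "positive", 0.88),
]
print(vote("The screen arrived broken and the packaging was terrible", functions))
```

A `None` result means the instance should be sent to the normal LLM labeling pipeline, either because nothing matched or because the functions disagreed too much.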

Configuration

solo_mode:
  labeling_functions:
    enabled: true
    min_confidence: 0.85
    min_coverage: 3
    max_functions: 50
    auto_extract: true
    vote_threshold: 0.5

Option	Default	Description
`enabled`	`true`	Enable/disable labeling functions
`min_confidence`	`0.85`	Minimum LLM confidence for a prediction to be used for function extraction
`min_coverage`	`3`	Minimum instances a pattern must match to become a function
`max_functions`	`50`	Maximum number of active labeling functions
`auto_extract`	`true`	Automatically extract functions from high-confidence predictions
`vote_threshold`	`0.5`	Minimum vote agreement required to accept a labeling function result

Cost Savings

Labeling functions are most effective when your data contains recurring patterns. The stats endpoint reports:

- coverage: Fraction of instances labeled by functions (avoiding LLM calls)
- accuracy: Agreement between function labels and human labels (when available)

Confusion Analysis

Enriches the standard confusion matrix with example instances, LLM reasoning, and optional root cause analysis with guideline suggestions.

How It Works

Pattern detection: Groups human-LLM disagreements by (predicted, actual) label pairs, filtering to pairs that occur at least min_instances_for_pattern times.
Enrichment: Each confusion pattern includes up to 5 example instances with the original text, LLM reasoning, and confidence score.
Root cause analysis (optional): Uses the LLM to explain why a specific confusion pattern occurs.
Guideline suggestions (optional): Uses the LLM to propose a concise guideline to disambiguate the confused labels.
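
The pattern detection step amounts to counting disagreements by label pair. The sketch below assumes comparisons are available as (predicted, actual) tuples; the function name is hypothetical, and sorting patterns by frequency before applying max_patterns is an assumption.

```python
from collections import Counter


def find_confusion_patterns(comparisons, min_instances_for_pattern=3, max_patterns=20):
    """Group human-LLM disagreements by (predicted, actual) pair.

    comparisons: iterable of (predicted_label, actual_label) tuples.
    Returns [((predicted, actual), count), ...], most frequent first.
    """
    counts = Counter((pred, actual) for pred, actual in comparisons if pred != actual)
    frequent = [
        (pair, n) for pair, n in counts.most_common() if n >= min_instances_for_pattern
    ]
    return frequent[:max_patterns]


comparisons = (
    [("positive", "neutral")] * 4
    + [("neutral", "negative")] * 2
    + [("positive", "positive")] * 10
)
print(find_confusion_patterns(comparisons))  # [(('positive', 'neutral'), 4)]
```

The neutral/negative pair is dropped because it occurs only twice, below the default minimum of 3.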

Configuration

solo_mode:
  confusion_analysis:
    enabled: true
    min_instances_for_pattern: 3
    max_patterns: 20
    auto_suggest_guidelines: false

Option	Default	Description
`enabled`	`true`	Enable/disable confusion analysis
`min_instances_for_pattern`	`3`	Minimum disagreements for a label pair to be reported as a pattern
`max_patterns`	`20`	Maximum number of confusion patterns to report
`auto_suggest_guidelines`	`false`	Automatically generate LLM guideline suggestions for each pattern

API

GET /solo/api/confusion-analysis

Returns confusion matrix data with heatmap-ready cell values, per-label accuracy, and enriched patterns.

Disagreement Explorer

Provides rich aggregated data for visual exploration of human-LLM disagreements, including scatter plots, temporal timelines, per-label breakdowns, and a filterable disagreement list.

How It Works

The explorer is read-only — it aggregates data from the validation tracker and predictions without modifying any state.

Scatter plot: Each compared instance is plotted as (confidence, agrees/disagrees), revealing whether high-confidence predictions tend to be correct.

Timeline: Comparisons are bucketed into windows (default size 10). Each bucket shows its agreement rate. The overall trend is classified as improving, declining, or stable by comparing the agreement rate of the first half with that of the second half; a difference of more than 5% counts as improving or declining.

Label breakdown: Per-label statistics including total comparisons, agreement rate, and top confused-with labels.

Disagreement list: Sorted by confidence descending (most surprising disagreements first), filterable by label.
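
The timeline logic can be approximated as below. Treat this as a sketch with stated assumptions: the halves are taken over individual comparisons rather than buckets, the 5% threshold is read as 0.05 in absolute agreement rate, and the function names are hypothetical.

```python
def bucket_agreement(agreements, bucket_size=10):
    """Split agree/disagree flags into windows and return each window's agreement rate."""
    rates = []
    for start in range(0, len(agreements), bucket_size):
        chunk = agreements[start:start + bucket_size]
        rates.append(sum(chunk) / len(chunk))
    return rates


def classify_trend(agreements, threshold=0.05):
    """Compare first-half and second-half agreement rates."""
    if len(agreements) < 2:
        return "stable"
    mid = len(agreements) // 2
    first = sum(agreements[:mid]) / mid
    second = sum(agreements[mid:]) / (len(agreements) - mid)
    diff = second - first
    if diff > threshold:
        return "improving"
    if diff < -threshold:
        return "declining"
    return "stable"


# True means the human and the LLM agreed on that comparison (oldest first).
history = [False, True, False, True, True, True, True, True]
print(bucket_agreement(history, bucket_size=4))  # [0.5, 1.0]
print(classify_trend(history))                   # improving
```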

API

GET /solo/api/disagreement-explorer

Returns scatter_points, disagreements, label_breakdown, and summary data. Accepts an optional ?label=<label> query parameter to filter to one label.

GET /solo/api/disagreement-timeline

Returns buckets (per-window stats) and trend classification. Accepts ?bucket_size=<N> (clamped to [2, 100], default 10).
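
A small client-side helper for building these request URLs might look like the following. The base URL is a placeholder for wherever your server runs, and the helper names are hypothetical; the clamp simply mirrors the documented [2, 100] range so the client sees the value the server would use.

```python
from urllib.parse import urlencode


def clamp_bucket_size(value: int, lo: int = 2, hi: int = 100) -> int:
    """Mirror the documented clamping of bucket_size to [2, 100]."""
    return max(lo, min(hi, value))


def timeline_url(base_url: str, bucket_size: int = 10) -> str:
    query = urlencode({"bucket_size": clamp_bucket_size(bucket_size)})
    return f"{base_url.rstrip('/')}/solo/api/disagreement-timeline?{query}"


def explorer_url(base_url: str, label=None) -> str:
    url = f"{base_url.rstrip('/')}/solo/api/disagreement-explorer"
    if label is not None:
        url += "?" + urlencode({"label": label})  # handles spaces and special characters
    return url


BASE = "http://localhost:8000"  # placeholder; substitute your server address
print(timeline_url(BASE, bucket_size=500))
print(explorer_url(BASE, label="very positive"))
```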

Note: GET /solo/api/disagreements (without the -explorer suffix) is a separate, simpler endpoint that returns aggregate counts only (total, pending, resolved, pending_ids).

Refinement Loop

Orchestrates an automated cycle of confusion analysis → guideline suggestions → prompt revision → re-annotation. Monitors agreement rate trends and stops when metrics plateau.

How It Works

Trigger: After every trigger_interval human annotations, the loop checks whether a refinement cycle should run.
Analyze: Runs confusion analysis on current disagreement patterns.
Suggest: For each significant confusion pattern, generates a guideline suggestion.
Apply: If auto_apply_suggestions is true, suggestions are applied immediately. Otherwise the cycle pauses in awaiting_approval status for human review.
Re-annotate: Affected instances are re-annotated with the updated prompt.
Evaluate: After re-annotation, the improvement in agreement rate is measured.

Stop Conditions

The loop automatically stops when:

- Maximum cycles reached (max_cycles)
- Improvement plateaus: patience consecutive cycles with improvement below min_improvement
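
The two stop conditions can be expressed as a single check over the per-cycle improvements. This is a sketch only: the function name and return values are hypothetical, and checking max_cycles before the plateau condition is an assumption about precedence.

```python
def stop_reason(improvements, max_cycles=5, patience=2, min_improvement=0.02):
    """Return why the loop should stop, or None if it may continue.

    improvements: agreement-rate gain of each completed cycle, oldest first.
    """
    if len(improvements) >= max_cycles:
        return "max_cycles"
    recent = improvements[-patience:] if patience > 0 else []
    if len(recent) == patience and recent and all(d < min_improvement for d in recent):
        return "plateau"
    return None


print(stop_reason([0.05, 0.01, 0.00]))  # plateau
print(stop_reason([0.05, 0.01]))        # None
```

In the second call only one of the last two cycles fell below min_improvement, so the patience counter has not run out yet.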

Configuration

solo_mode:
  refinement_loop:
    enabled: true
    trigger_interval: 50
    min_improvement: 0.02
    max_cycles: 5
    patience: 2
    auto_apply_suggestions: false

Option	Default	Description
`enabled`	`true`	Enable/disable the refinement loop
`trigger_interval`	`50`	Check for refinement every N human annotations
`min_improvement`	`0.02`	Minimum agreement rate improvement to count as progress
`max_cycles`	`5`	Maximum number of refinement cycles
`patience`	`2`	Consecutive cycles without improvement before stopping
`auto_apply_suggestions`	`false`	Apply guideline suggestions without human review

API

GET /solo/api/refinement-status

Returns enabled state, cycle count, running/stopped state, stop reason, patience countdown, and full cycle history.

Related refinement endpoints (all require authentication):

POST /solo/api/refinement/trigger     # Manually trigger a refinement cycle
GET  /solo/api/refinement/log         # Full cycle history (validated framework)
GET  /solo/api/refinement/pending     # Candidates awaiting admin approval
GET  /solo/api/refinement/strategies  # List available refinement strategies
GET  /solo/api/reannotation-report    # Before/after accuracy on re-annotated instances

Admin-only (require X-API-Key):

POST /solo/api/refinement/reset       # Reset refinement loop state
POST /solo/api/refinement/approve     # Apply a pending refinement candidate
POST /solo/api/refinement/reject      # Reject a pending refinement candidate

Confidence Routing

Implements cascaded model escalation: a cheap/fast model tries first, and if its confidence is below the tier threshold, the instance escalates to a more expensive/capable model. If all tiers fail, the instance is routed to the human.

How It Works

For each instance:

1. The first (cheapest) tier model labels the instance.
2. If confidence ≥ tier threshold → accept the label.
3. If confidence < threshold → escalate to the next tier, keeping the best result so far.
4. If all tiers are exhausted → route to the human annotation queue.
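
The escalation steps can be sketched as a simple loop over tiers. The tier tuple layout, the result dictionary, and the stub labelers are assumptions for illustration; real tiers would call the configured model endpoints.

```python
def route(text, tiers):
    """Cascade through tiers, cheapest first.

    tiers: list of (name, labeler, confidence_threshold), where
    labeler(text) returns (label, confidence).
    """
    best = None  # best (label, confidence, tier_name) seen so far
    for name, labeler, threshold in tiers:
        label, confidence = labeler(text)
        if best is None or confidence > best[1]:
            best = (label, confidence, name)
        if confidence >= threshold:
            return {"label": label, "confidence": confidence,
                    "tier": name, "needs_human": False}
    # All tiers fell short: keep the best guess and route to the human queue.
    label, confidence, name = best if best else (None, 0.0, None)
    return {"label": label, "confidence": confidence,
            "tier": name, "needs_human": True}


# Stub labelers standing in for real model calls.
def fast(text):
    return ("positive", 0.62)


def accurate(text):
    return ("positive", 0.91)


tiers = [("fast", fast, 0.85), ("accurate", accurate, 0.70)]
print(route("Great battery life", tiers))
```

Here the fast tier falls below its 0.85 threshold, so the instance escalates and is accepted by the accurate tier.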

Configuration

solo_mode:
  confidence_routing:
    enabled: true
    tiers:
      - name: "fast"
        model:
          endpoint_type: "openai"
          model: "gpt-4o-mini"
          api_key: "${OPENAI_API_KEY}"
        confidence_threshold: 0.85
      - name: "accurate"
        model:
          endpoint_type: "anthropic"
          model: "claude-3-5-sonnet-20241022"
          api_key: "${ANTHROPIC_API_KEY}"
        confidence_threshold: 0.70

Option	Default	Description
`enabled`	`false`	Enable/disable confidence routing (replaces single-model labeling)
`tiers`	`[]`	Ordered list of model tiers, cheapest first
`tiers[].name`	`""`	Human-readable tier name for stats reporting
`tiers[].model`	—	Model configuration (same format as `labeling_models` entries)
`tiers[].confidence_threshold`	`0.8`	Minimum confidence to accept a label at this tier

Per-Tier Statistics

The stats endpoint reports per-tier metrics:

- instances_attempted: Total instances routed to this tier
- instances_accepted: Instances where confidence met the threshold
- instances_escalated: Instances passed to the next tier
- acceptance_rate: Fraction accepted at this tier
- avg_confidence: Mean confidence of accepted predictions
- avg_latency_ms: Mean response time

Plus global stats: total_routed and human_routed_count.

API

Per-tier and global confidence-routing stats are included in the response of GET /solo/api/status under llm_stats.confidence_routing when routing is enabled. There is no dedicated routing-stats endpoint.

Prompt Optimizer

DSPy-style automatic prompt optimization using labeled examples. The optimizer analyzes correct and incorrect predictions to iteratively improve the annotation prompt.

How It Works

Collect examples: Gathers labeled instances — both correctly and incorrectly predicted by the LLM.
Optimize: Sends the current prompt along with sample correct (up to 5) and incorrect (up to 10) examples to the LLM. The LLM returns an improved prompt with a list of changes and rationale.
Validate: Checks that the optimized prompt differs from the original and is within length limits.
Apply: Updates the annotation prompt.

Optimization can run on a timer in the background or be triggered on-demand.

Smallest Model Search

When find_smallest_model is enabled, the optimizer tests available models (smallest first) against labeled examples and selects the smallest model that meets target_accuracy. This reduces API costs by using the cheapest sufficient model.
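
Conceptually, the search is a linear scan over models ordered from smallest to largest. The sketch below assumes a `predict(model, text)` callable and plain accuracy over the labeled examples; both are hypothetical stand-ins for the optimizer's internals.

```python
def find_smallest_model(models, examples, predict, target_accuracy=0.85):
    """Return the first (smallest) model whose accuracy meets the target.

    models: model names ordered smallest first.
    examples: list of (text, gold_label) pairs.
    predict: callable (model, text) -> label.
    """
    if not examples:
        return None  # nothing to evaluate against
    for model in models:
        correct = sum(predict(model, text) == gold for text, gold in examples)
        if correct / len(examples) >= target_accuracy:
            return model
    return None  # no model met the target


# Stub predictions standing in for real model calls.
answers = {
    "small": {"t1": "a", "t2": "x"},
    "large": {"t1": "a", "t2": "b"},
}
examples = [("t1", "a"), ("t2", "b")]
chosen = find_smallest_model(["small", "large"], examples, lambda m, t: answers[m][t])
print(chosen)  # large
```

Because the scan stops at the first model that qualifies, larger models are never called once a smaller one is good enough.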

Configuration

solo_mode:
  prompt_optimization:
    enabled: true
    find_smallest_model: true
    target_accuracy: 0.85
    optimization_interval_seconds: 300
    accuracy_weight: 0.7
    length_weight: 0.2
    consistency_weight: 0.1

Option	Default	Description
`enabled`	`true`	Enable/disable prompt optimization
`find_smallest_model`	`true`	Search for the cheapest model that meets accuracy targets
`target_accuracy`	`0.85`	Target accuracy threshold
`optimization_interval_seconds`	`300`	Seconds between background optimization runs
`accuracy_weight`	`0.7`	Weight for accuracy in optimization scoring
`length_weight`	`0.2`	Weight for prompt brevity
`consistency_weight`	`0.1`	Weight for prediction consistency

API

POST /solo/api/optimize-prompt

Triggers on-demand optimization. (Requires authentication.)

Optimization history is exposed via GET /solo/api/prompts, which returns the prompt version history including who/what produced each version (user_setup, manual_edit, prompt_optimizer, etc.).

Edge Case Synthesizer

Proactively generates synthetic edge case examples to test and refine annotation prompts before large-scale labeling begins. Unlike edge case rules (which are discovered reactively from low-confidence predictions), the synthesizer creates hypothetical boundary examples.

How It Works

Synthesis: The LLM generates examples that lie on label boundaries, have ambiguous signals, require careful interpretation, and test specific guideline aspects.
Labeling: The human annotator labels the synthesized examples.
Prompt revision: Labeled edge cases feed into the prompt revision system, providing concrete examples of how ambiguous cases should be handled.
Aspect tracking: The system tracks which guideline aspects have been tested, helping identify gaps in prompt coverage.

Configuration

Edge case synthesis is part of the core Solo Mode workflow (phases 3–5) and uses the labeling_models and revision_models settings. No separate configuration section is required.

API

GET /solo/api/edge-cases

Returns all synthesized edge cases with their labeling status (counts of total, labeled, and unlabeled cases).

Edge case synthesis runs automatically when the workflow enters the edge_case_synthesis phase — there is no separate trigger endpoint. Labeling individual edge cases happens through the /solo/edge-cases page route (POST with case_id, label, and optional notes), not via a dedicated REST endpoint.

Schema-Specific Thresholds

Solo Mode supports schema-specific agreement thresholds for annotation types where exact match is too strict. These are configured in the thresholds section:

solo_mode:
  thresholds:
    # Core thresholds
    end_human_annotation_agreement: 0.90
    minimum_validation_sample: 50
    confidence_low: 0.5
    confidence_high: 0.8
    periodic_review_interval: 100

    # Schema-specific thresholds
    likert_tolerance: 1
    multiselect_jaccard_threshold: 0.5
    textbox_embedding_threshold: 0.7
    span_overlap_threshold: 0.5

Option	Default	Description
`likert_tolerance`	`1`	Maximum allowed difference between human and LLM Likert ratings to count as agreement (e.g., tolerance of 1 means a human rating of 3 agrees with LLM ratings of 2, 3, or 4)
`multiselect_jaccard_threshold`	`0.5`	Minimum Jaccard similarity between human and LLM multiselect label sets to count as agreement
`textbox_embedding_threshold`	`0.7`	Minimum cosine similarity between human and LLM text embeddings to count as agreement
`span_overlap_threshold`	`0.5`	Minimum token-level overlap (IoU) between human and LLM spans to count as agreement
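
Three of these agreement checks are simple enough to sketch directly. The function names are hypothetical, spans are assumed to be half-open (start, end) token ranges, and treating two empty multiselect sets as full agreement is an assumption. The textbox check is omitted because it depends on an embedding model.

```python
def likert_agrees(human: int, llm: int, likert_tolerance: int = 1) -> bool:
    """Ratings agree when they differ by at most the tolerance."""
    return abs(human - llm) <= likert_tolerance


def jaccard(a, b) -> float:
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumption: two empty selections count as identical
    return len(a & b) / len(a | b)


def multiselect_agrees(human, llm, multiselect_jaccard_threshold: float = 0.5) -> bool:
    return jaccard(human, llm) >= multiselect_jaccard_threshold


def span_iou(a, b) -> float:
    """Token-level intersection over union of two half-open (start, end) spans."""
    intersection = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union else 0.0


def span_agrees(human, llm, span_overlap_threshold: float = 0.5) -> bool:
    return span_iou(human, llm) >= span_overlap_threshold


print(likert_agrees(3, 4), likert_agrees(3, 5))         # True False
print(multiselect_agrees({"a", "b"}, {"a", "b", "c"}))  # True
print(span_agrees((0, 4), (1, 4)))                      # True
```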

Instance Selection Weights

The instance selector uses a weighted mixture to choose which instances the human should annotate next. Beyond the core weights documented in the Solo Mode guide, two weights support the advanced features: edge_case_rule_weight and cartography_weight. The example below enables both:

solo_mode:
  instance_selection:
    low_confidence_weight: 0.3
    diversity_weight: 0.2
    random_weight: 0.2
    disagreement_weight: 0.1
    edge_case_rule_weight: 0.1
    cartography_weight: 0.1

Weight	Default	Description
`low_confidence_weight`	`0.4`	Prioritize instances where the LLM is uncertain
`diversity_weight`	`0.3`	Prioritize instances from different embedding clusters
`random_weight`	`0.2`	Random sample for calibration
`disagreement_weight`	`0.1`	Prioritize instances with prior human-LLM disagreement
`edge_case_rule_weight`	`0.0`	Prioritize instances matching discovered edge case rules
`cartography_weight`	`0.0`	Prioritize instances based on dataset cartography (training dynamics)

Weights are automatically normalized to sum to 1.0. A warning is logged when the configured values do not already sum to 1.0.
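
Normalization of this kind is typically a division by the total. The sketch below is illustrative only: the function name, the logger name, and the choice to reject a non-positive total are assumptions.

```python
import logging
import math

logger = logging.getLogger("solo_mode.sketch")  # hypothetical logger name


def normalize_weights(weights: dict) -> dict:
    """Scale weights so they sum to 1.0, warning if they did not already."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        logger.warning("Instance selection weights sum to %.3f; normalizing", total)
    return {name: value / total for name, value in weights.items()}


normalized = normalize_weights({
    "low_confidence_weight": 0.4,
    "diversity_weight": 0.3,
    "random_weight": 0.2,
    "disagreement_weight": 0.1,
    "edge_case_rule_weight": 0.25,  # pushes the total to 1.25
    "cartography_weight": 0.0,
})
print(round(normalized["edge_case_rule_weight"], 3))  # 0.2
```

Raising one weight therefore lowers the effective share of every other weight, which is worth keeping in mind when enabling edge_case_rule_weight or cartography_weight.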

Complete Configuration Reference

Below is a comprehensive YAML configuration showing all advanced Solo Mode options with their defaults:

solo_mode:
  enabled: true

  # LLM models for annotation labeling (tried in order)
  labeling_models:
    - endpoint_type: "anthropic"
      model: "claude-3-5-sonnet-20241022"
      api_key: "${ANTHROPIC_API_KEY}"
      max_tokens: 1000
      temperature: 0.1

  # LLM models for prompt revision (defaults to labeling_models if empty)
  revision_models:
    - endpoint_type: "anthropic"
      model: "claude-3-5-sonnet-20241022"

  # Embedding model for diversity and similarity
  embedding:
    model_name: "all-MiniLM-L6-v2"

  # Uncertainty estimation
  uncertainty:
    strategy: "direct_confidence"      # direct_confidence | direct_uncertainty | token_entropy | sampling_diversity
    num_samples: 5                     # For sampling_diversity
    sampling_temperature: 1.0          # For sampling_diversity

  # Agreement and quality thresholds
  thresholds:
    end_human_annotation_agreement: 0.90
    minimum_validation_sample: 50
    confidence_low: 0.5
    confidence_high: 0.8
    periodic_review_interval: 100
    likert_tolerance: 1
    multiselect_jaccard_threshold: 0.5
    textbox_embedding_threshold: 0.7
    span_overlap_threshold: 0.5

  # Instance selection weights (auto-normalized to sum to 1.0)
  instance_selection:
    low_confidence_weight: 0.4
    diversity_weight: 0.3
    random_weight: 0.2
    disagreement_weight: 0.1
    edge_case_rule_weight: 0.0
    cartography_weight: 0.0

  # Batch sizes
  batches:
    llm_labeling_batch: 50
    max_parallel_labels: 200

  # Prompt optimization
  prompt_optimization:
    enabled: true
    find_smallest_model: true
    target_accuracy: 0.85
    optimization_interval_seconds: 300
    accuracy_weight: 0.7
    length_weight: 0.2
    consistency_weight: 0.1

  # Edge case rule discovery (Co-DETECT)
  edge_case_rules:
    enabled: true
    confidence_threshold: 0.75
    min_rules_for_clustering: 10
    target_cluster_size: 15
    auto_extract_on_labeling: true
    reannotation_enabled: true
    reannotation_confidence_threshold: 0.60
    max_reannotations_per_instance: 2

  # Labeling functions (ALCHEmist)
  labeling_functions:
    enabled: true
    min_confidence: 0.85
    min_coverage: 3
    max_functions: 50
    auto_extract: true
    vote_threshold: 0.5

  # Confidence routing (cascaded escalation)
  confidence_routing:
    enabled: false
    tiers: []

  # Confusion analysis
  confusion_analysis:
    enabled: true
    min_instances_for_pattern: 3
    max_patterns: 20
    auto_suggest_guidelines: false

  # Automated refinement loop
  refinement_loop:
    enabled: true
    trigger_interval: 50
    min_improvement: 0.02
    max_cycles: 5
    patience: 2
    auto_apply_suggestions: false

Related Documentation

Solo Mode: Core workflow and getting started
Solo Mode Developer Guide: Architecture and extension points
AI Support: General AI endpoint configuration
Active Learning: ML-based instance prioritization (non-Solo Mode)
