Solo Mode Advanced Features
This page documents advanced subsystems available in Solo Mode that go beyond the core 10-phase workflow described in the Solo Mode guide. These features enable automated quality improvement, cost optimization, and deeper analysis of annotation patterns.
Edge Case Rule Discovery
Inspired by the Co-DETECT framework, the edge case rule system automatically discovers annotation rules from instances where the LLM has low confidence. Rules are extracted, clustered into categories, reviewed by the human annotator, and injected back into the annotation prompt.
How It Works
- Rule extraction: When the LLM labels an instance with confidence below
confidence_threshold, it generates a generalizable rule of the form "When \<condition> → \<action>". - Clustering: Once enough rules accumulate (
min_rules_for_clustering), they are clustered by semantic similarity using sentence embeddings and K-Means. Each cluster is summarized into a single category by the LLM. - Human review: Categories are presented to the annotator for approval or rejection.
- Prompt injection: Approved categories are injected into the annotation prompt, either via LLM-assisted integration or by appending an "Edge Case Guidelines" section.
- Re-annotation: Instances previously labeled with low confidence under old prompts are re-annotated with the improved prompt.
Configuration
solo_mode:
edge_case_rules:
enabled: true
confidence_threshold: 0.75
min_rules_for_clustering: 10
target_cluster_size: 15
auto_extract_on_labeling: true
reannotation_enabled: true
reannotation_confidence_threshold: 0.60
max_reannotations_per_instance: 2
| Option | Default | Description |
|---|---|---|
enabled |
true |
Enable/disable the edge case rule system |
confidence_threshold |
0.75 |
Extract rules when LLM confidence is below this value |
min_rules_for_clustering |
10 |
Minimum unclustered rules before triggering clustering |
target_cluster_size |
15 |
Target number of rules per cluster (Co-DETECT recommends 10–20) |
auto_extract_on_labeling |
true |
Automatically extract rules during LLM labeling |
reannotation_enabled |
true |
Re-annotate low-confidence instances after prompt updates |
reannotation_confidence_threshold |
0.60 |
Only re-annotate instances with confidence below this |
max_reannotations_per_instance |
2 |
Maximum times a single instance can be re-annotated |
Instance Selection Weight
You can direct the instance selector to prioritize instances matching edge case rules:
solo_mode:
instance_selection:
edge_case_rule_weight: 0.1 # default: 0.0
Increase this to route more items from edge case rule clusters to the human annotator.
Labeling Functions
Inspired by ALCHEmist (NeurIPS 2024), this system extracts reusable labeling functions from high-confidence LLM predictions. These functions can label new instances via keyword matching and majority voting — without additional API calls.
How It Works
- Extraction: From predictions where the LLM reports high confidence (
min_confidence), the system asks the LLM to identify generalizable patterns (keywords, conditions). Falls back to keyword frequency analysis if the LLM is unavailable. - Application: For each new instance, all enabled labeling functions vote on a label using confidence-weighted majority voting.
- Acceptance: If vote agreement exceeds
vote_threshold, the label is accepted without calling the LLM. Otherwise the instance is passed through to the normal LLM labeling pipeline.
Configuration
solo_mode:
labeling_functions:
enabled: true
min_confidence: 0.85
min_coverage: 3
max_functions: 50
auto_extract: true
vote_threshold: 0.5
| Option | Default | Description |
|---|---|---|
enabled |
true |
Enable/disable labeling functions |
min_confidence |
0.85 |
Minimum LLM confidence for a prediction to be used for function extraction |
min_coverage |
3 |
Minimum instances a pattern must match to become a function |
max_functions |
50 |
Maximum number of active labeling functions |
auto_extract |
true |
Automatically extract functions from high-confidence predictions |
vote_threshold |
0.5 |
Minimum vote agreement required to accept a labeling function result |
Cost Savings
Labeling functions are most effective when your data contains recurring patterns. The stats endpoint reports: - coverage: Fraction of instances labeled by functions (avoiding LLM calls) - accuracy: Agreement between function labels and human labels (when available)
Confusion Analysis
Enriches the standard confusion matrix with example instances, LLM reasoning, and optional root cause analysis with guideline suggestions.
How It Works
- Pattern detection: Groups human-LLM disagreements by (predicted, actual) label pairs, filtering to pairs that occur at least
min_instances_for_patterntimes. - Enrichment: Each confusion pattern includes up to 5 example instances with the original text, LLM reasoning, and confidence score.
- Root cause analysis (optional): Uses the LLM to explain why a specific confusion pattern occurs.
- Guideline suggestions (optional): Uses the LLM to propose a concise guideline to disambiguate the confused labels.
Configuration
solo_mode:
confusion_analysis:
enabled: true
min_instances_for_pattern: 3
max_patterns: 20
auto_suggest_guidelines: false
| Option | Default | Description |
|---|---|---|
enabled |
true |
Enable/disable confusion analysis |
min_instances_for_pattern |
3 |
Minimum disagreements for a label pair to be reported as a pattern |
max_patterns |
20 |
Maximum number of confusion patterns to report |
auto_suggest_guidelines |
false |
Automatically generate LLM guideline suggestions for each pattern |
API
GET /solo/api/confusion-analysis
Returns confusion matrix data with heatmap-ready cell values, per-label accuracy, and enriched patterns.
Disagreement Explorer
Provides rich aggregated data for visual exploration of human-LLM disagreements, including scatter plots, temporal timelines, per-label breakdowns, and a filterable disagreement list.
How It Works
The explorer is read-only — it aggregates data from the validation tracker and predictions without modifying any state.
Scatter plot: Each compared instance is plotted as (confidence, agrees/disagrees), revealing whether high-confidence predictions tend to be correct.
Timeline: Comparisons are bucketed into windows (default size 10). Each bucket shows its agreement rate, and the overall trend is classified as improving, declining, or stable based on first-half vs. second-half agreement rate difference (>5% threshold).
Label breakdown: Per-label statistics including total comparisons, agreement rate, and top confused-with labels.
Disagreement list: Sorted by confidence descending (most surprising disagreements first), filterable by label.
API
GET /solo/api/disagreement-explorer
Returns scatter_points, disagreements, label_breakdown, and summary data.
Accepts an optional ?label=<label> query parameter to filter to one label.
GET /solo/api/disagreement-timeline
Returns buckets (per-window stats) and trend classification.
Accepts ?bucket_size=<N> (clamped to [2, 100], default 10).
Note:
GET /solo/api/disagreements(without the-explorersuffix) is a separate, simpler endpoint that returns aggregate counts only (total,pending,resolved,pending_ids).
Refinement Loop
Orchestrates an automated cycle of confusion analysis → guideline suggestions → prompt revision → re-annotation. Monitors agreement rate trends and stops when metrics plateau.
How It Works
- Trigger: After every
trigger_intervalhuman annotations, the loop checks whether a refinement cycle should run. - Analyze: Runs confusion analysis on current disagreement patterns.
- Suggest: For each significant confusion pattern, generates a guideline suggestion.
- Apply: If
auto_apply_suggestionsis true, suggestions are applied immediately. Otherwise the cycle pauses inawaiting_approvalstatus for human review. - Re-annotate: Affected instances are re-annotated with the updated prompt.
- Evaluate: After re-annotation, the improvement in agreement rate is measured.
Stop Conditions
The loop automatically stops when:
- Maximum cycles reached (max_cycles)
- Improvement plateaus: patience consecutive cycles with improvement below min_improvement
Configuration
solo_mode:
refinement_loop:
enabled: true
trigger_interval: 50
min_improvement: 0.02
max_cycles: 5
patience: 2
auto_apply_suggestions: false
| Option | Default | Description |
|---|---|---|
enabled |
true |
Enable/disable the refinement loop |
trigger_interval |
50 |
Check for refinement every N human annotations |
min_improvement |
0.02 |
Minimum agreement rate improvement to count as progress |
max_cycles |
5 |
Maximum number of refinement cycles |
patience |
2 |
Consecutive cycles without improvement before stopping |
auto_apply_suggestions |
false |
Apply guideline suggestions without human review |
API
GET /solo/api/refinement-status
Returns enabled state, cycle count, running/stopped state, stop reason, patience countdown, and full cycle history.
Related refinement endpoints (all require authentication):
POST /solo/api/refinement/trigger # Manually trigger a refinement cycle
GET /solo/api/refinement/log # Full cycle history (validated framework)
GET /solo/api/refinement/pending # Candidates awaiting admin approval
GET /solo/api/refinement/strategies # List available refinement strategies
GET /solo/api/reannotation-report # Before/after accuracy on re-annotated instances
Admin-only (require X-API-Key):
POST /solo/api/refinement/reset # Reset refinement loop state
POST /solo/api/refinement/approve # Apply a pending refinement candidate
POST /solo/api/refinement/reject # Reject a pending refinement candidate
Confidence Routing
Implements cascaded model escalation: a cheap/fast model tries first, and if its confidence is below the tier threshold, the instance escalates to a more expensive/capable model. If all tiers fail, the instance is routed to the human.
How It Works
For each instance: 1. The first (cheapest) tier model labels the instance. 2. If confidence ≥ tier threshold → accept the label. 3. If confidence < threshold → escalate to the next tier, keeping the best result so far. 4. If all tiers are exhausted → route to the human annotation queue.
Configuration
solo_mode:
confidence_routing:
enabled: true
tiers:
- name: "fast"
model:
endpoint_type: "openai"
model: "gpt-4o-mini"
api_key: "${OPENAI_API_KEY}"
confidence_threshold: 0.85
- name: "accurate"
model:
endpoint_type: "anthropic"
model: "claude-3-5-sonnet-20241022"
api_key: "${ANTHROPIC_API_KEY}"
confidence_threshold: 0.70
| Option | Default | Description |
|---|---|---|
enabled |
false |
Enable/disable confidence routing (replaces single-model labeling) |
tiers |
[] |
Ordered list of model tiers, cheapest first |
tiers[].name |
"" |
Human-readable tier name for stats reporting |
tiers[].model |
— | Model configuration (same format as labeling_models entries) |
tiers[].confidence_threshold |
0.8 |
Minimum confidence to accept a label at this tier |
Per-Tier Statistics
The stats endpoint reports per-tier metrics: - instances_attempted: Total instances routed to this tier - instances_accepted: Instances where confidence met the threshold - instances_escalated: Instances passed to the next tier - acceptance_rate: Fraction accepted at this tier - avg_confidence: Mean confidence of accepted predictions - avg_latency_ms: Mean response time
Plus global stats: total_routed and human_routed_count.
API
Per-tier and global confidence-routing stats are included in the response of
GET /solo/api/status under llm_stats.confidence_routing when routing is
enabled. There is no dedicated routing-stats endpoint.
Prompt Optimizer
DSPy-style automatic prompt optimization using labeled examples. The optimizer analyzes correct and incorrect predictions to iteratively improve the annotation prompt.
How It Works
- Collect examples: Gathers labeled instances — both correctly and incorrectly predicted by the LLM.
- Optimize: Sends the current prompt along with sample correct (up to 5) and incorrect (up to 10) examples to the LLM. The LLM returns an improved prompt with a list of changes and rationale.
- Validate: Checks that the optimized prompt differs from the original and is within length limits.
- Apply: Updates the annotation prompt.
Optimization can run on a timer in the background or be triggered on-demand.
Smallest Model Search
When find_smallest_model is enabled, the optimizer tests available models (smallest first) against labeled examples and selects the smallest model that meets target_accuracy. This reduces API costs by using the cheapest sufficient model.
Configuration
solo_mode:
prompt_optimization:
enabled: true
find_smallest_model: true
target_accuracy: 0.85
optimization_interval_seconds: 300
accuracy_weight: 0.7
length_weight: 0.2
consistency_weight: 0.1
| Option | Default | Description |
|---|---|---|
enabled |
true |
Enable/disable prompt optimization |
find_smallest_model |
true |
Search for the cheapest model that meets accuracy targets |
target_accuracy |
0.85 |
Target accuracy threshold |
optimization_interval_seconds |
300 |
Seconds between background optimization runs |
accuracy_weight |
0.7 |
Weight for accuracy in optimization scoring |
length_weight |
0.2 |
Weight for prompt brevity |
consistency_weight |
0.1 |
Weight for prediction consistency |
API
POST /solo/api/optimize-prompt
Triggers on-demand optimization. (Requires authentication.)
Optimization history is exposed via GET /solo/api/prompts, which returns
the prompt version history including who/what produced each version
(user_setup, manual_edit, prompt_optimizer, etc.).
Edge Case Synthesizer
Proactively generates synthetic edge case examples to test and refine annotation prompts before large-scale labeling begins. Unlike edge case rules (which are discovered reactively from low-confidence predictions), the synthesizer creates hypothetical boundary examples.
How It Works
- Synthesis: The LLM generates examples that lie on label boundaries, have ambiguous signals, require careful interpretation, and test specific guideline aspects.
- Labeling: The human annotator labels the synthesized examples.
- Prompt revision: Labeled edge cases feed into the prompt revision system, providing concrete examples of how ambiguous cases should be handled.
- Aspect tracking: The system tracks which guideline aspects have been tested, helping identify gaps in prompt coverage.
Configuration
Edge case synthesis is part of the core Solo Mode workflow (phases 3–5) and uses the labeling_models and revision_models settings. No separate configuration section is required.
API
GET /solo/api/edge-cases
Returns all synthesized edge cases with their labeling status (counts of total, labeled, and unlabeled cases).
Edge case synthesis runs automatically when the workflow enters the
edge_case_synthesis phase — there is no separate trigger endpoint.
Labeling individual edge cases happens through the /solo/edge-cases
page route (POST with case_id, label, and optional notes), not via
a dedicated REST endpoint.
Schema-Specific Thresholds
Solo Mode supports schema-specific agreement thresholds for annotation types where exact match is too strict. These are configured in the thresholds section:
solo_mode:
thresholds:
# Core thresholds
end_human_annotation_agreement: 0.90
minimum_validation_sample: 50
confidence_low: 0.5
confidence_high: 0.8
periodic_review_interval: 100
# Schema-specific thresholds
likert_tolerance: 1
multiselect_jaccard_threshold: 0.5
textbox_embedding_threshold: 0.7
span_overlap_threshold: 0.5
| Option | Default | Description |
|---|---|---|
likert_tolerance |
1 |
Maximum allowed difference between human and LLM Likert ratings to count as agreement (e.g., tolerance of 1 means a human rating of 3 agrees with LLM ratings of 2, 3, or 4) |
multiselect_jaccard_threshold |
0.5 |
Minimum Jaccard similarity between human and LLM multiselect label sets to count as agreement |
textbox_embedding_threshold |
0.7 |
Minimum cosine similarity between human and LLM text embeddings to count as agreement |
span_overlap_threshold |
0.5 |
Minimum token-level overlap (IoU) between human and LLM spans to count as agreement |
Instance Selection Weights
The instance selector uses a weighted mixture to choose which instances the human should annotate next. In addition to the core weights documented in the Solo Mode guide, two additional weights are available for advanced features:
solo_mode:
instance_selection:
low_confidence_weight: 0.3
diversity_weight: 0.2
random_weight: 0.2
disagreement_weight: 0.1
edge_case_rule_weight: 0.1
cartography_weight: 0.1
| Weight | Default | Description |
|---|---|---|
low_confidence_weight |
0.4 |
Prioritize instances where the LLM is uncertain |
diversity_weight |
0.3 |
Prioritize instances from different embedding clusters |
random_weight |
0.2 |
Random sample for calibration |
disagreement_weight |
0.1 |
Prioritize instances with prior human-LLM disagreement |
edge_case_rule_weight |
0.0 |
Prioritize instances matching discovered edge case rules |
cartography_weight |
0.0 |
Prioritize instances based on dataset cartography (training dynamics) |
Weights are automatically normalized to sum to 1.0 (a warning is logged if they don't).
Complete Configuration Reference
Below is a comprehensive YAML configuration showing all advanced Solo Mode options with their defaults:
solo_mode:
enabled: true
# LLM models for annotation labeling (tried in order)
labeling_models:
- endpoint_type: "anthropic"
model: "claude-3-5-sonnet-20241022"
api_key: "${ANTHROPIC_API_KEY}"
max_tokens: 1000
temperature: 0.1
# LLM models for prompt revision (defaults to labeling_models if empty)
revision_models:
- endpoint_type: "anthropic"
model: "claude-3-5-sonnet-20241022"
# Embedding model for diversity and similarity
embedding:
model_name: "all-MiniLM-L6-v2"
# Uncertainty estimation
uncertainty:
strategy: "direct_confidence" # direct_confidence | direct_uncertainty | token_entropy | sampling_diversity
num_samples: 5 # For sampling_diversity
sampling_temperature: 1.0 # For sampling_diversity
# Agreement and quality thresholds
thresholds:
end_human_annotation_agreement: 0.90
minimum_validation_sample: 50
confidence_low: 0.5
confidence_high: 0.8
periodic_review_interval: 100
likert_tolerance: 1
multiselect_jaccard_threshold: 0.5
textbox_embedding_threshold: 0.7
span_overlap_threshold: 0.5
# Instance selection weights (auto-normalized to sum to 1.0)
instance_selection:
low_confidence_weight: 0.4
diversity_weight: 0.3
random_weight: 0.2
disagreement_weight: 0.1
edge_case_rule_weight: 0.0
cartography_weight: 0.0
# Batch sizes
batches:
llm_labeling_batch: 50
max_parallel_labels: 200
# Prompt optimization
prompt_optimization:
enabled: true
find_smallest_model: true
target_accuracy: 0.85
optimization_interval_seconds: 300
accuracy_weight: 0.7
length_weight: 0.2
consistency_weight: 0.1
# Edge case rule discovery (Co-DETECT)
edge_case_rules:
enabled: true
confidence_threshold: 0.75
min_rules_for_clustering: 10
target_cluster_size: 15
auto_extract_on_labeling: true
reannotation_enabled: true
reannotation_confidence_threshold: 0.60
max_reannotations_per_instance: 2
# Labeling functions (ALCHEmist)
labeling_functions:
enabled: true
min_confidence: 0.85
min_coverage: 3
max_functions: 50
auto_extract: true
vote_threshold: 0.5
# Confidence routing (cascaded escalation)
confidence_routing:
enabled: false
tiers: []
# Confusion analysis
confusion_analysis:
enabled: true
min_instances_for_pattern: 3
max_patterns: 20
auto_suggest_guidelines: false
# Automated refinement loop
refinement_loop:
enabled: true
trigger_interval: 50
min_improvement: 0.02
max_cycles: 5
patience: 2
auto_apply_suggestions: false
Related Documentation
- Solo Mode — Core workflow and getting started
- Solo Mode Developer Guide — Architecture and extension points
- AI Support — General AI endpoint configuration
- Active Learning — ML-based instance prioritization (non-Solo Mode)