Active Learning Administrator Guide
This guide provides comprehensive instructions for administrators on how to configure and use active learning in the Potato annotation platform. Active learning uses machine learning to intelligently prioritize annotation tasks, helping you get the most value from your annotation budget.
Table of Contents
- Overview
- Basic Configuration
- Advanced Configuration
- LLM Integration
- Model Persistence
- Multi-Schema Support
- Monitoring and Metrics
- Best Practices
- Troubleshooting
- Examples
Overview
Active learning in Potato automatically reorders annotation instances based on machine learning predictions, prioritizing items where the model is most uncertain. This helps you:
- Maximize annotation efficiency by focusing on the most informative instances
- Reduce annotation costs by requiring fewer annotations for the same model performance
- Improve model quality by ensuring diverse and representative training data
- Scale annotation workflows with intelligent instance prioritization
How It Works
- Training: A machine learning classifier is trained on existing annotations
- Prediction: The model predicts confidence scores for unannotated instances
- Reordering: Instances are reordered based on uncertainty (lowest confidence first)
- Annotation: Annotators work on the most uncertain instances
- Retraining: The model is retrained periodically as new annotations are added
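The reordering step above can be illustrated with a minimal sketch. It assumes each unannotated instance already has a confidence score for its top predicted label. The function name `reorder_by_uncertainty` and the instance IDs are hypothetical and are not part of Potato's API.

```python
def reorder_by_uncertainty(confidences):
    """Return instance IDs sorted so the least-confident come first.

    `confidences` maps an instance ID to the model's confidence in its
    top predicted label (a float in [0, 1]).
    """
    # Sort ascending by confidence; ties are broken by ID for determinism.
    return sorted(confidences, key=lambda iid: (confidences[iid], iid))


scores = {"doc-1": 0.95, "doc-2": 0.51, "doc-3": 0.70}
print(reorder_by_uncertainty(scores))  # ['doc-2', 'doc-3', 'doc-1']
```

Here `doc-2` is shown to annotators first because the model is closest to guessing on it.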
Basic Configuration
Enabling Active Learning
Add the active_learning section to your YAML configuration file:
active_learning:
  enabled: true
  schema_names: ["sentiment", "topic"]
  min_annotations_per_instance: 2
  min_instances_for_training: 10
  update_frequency: 5
  max_instances_to_reorder: 50
Core Parameters
| Parameter | Description | Default | Recommended |
|---|---|---|---|
| enabled | Enable/disable active learning | false | true |
| schema_names | List of annotation schemas to use | [] | All schemas |
| min_annotations_per_instance | Minimum annotations needed per instance | 1 | 2-3 |
| min_instances_for_training | Minimum instances needed before training | 10 | 20-50 |
| update_frequency | How often to retrain (in annotations) | 5 | 5-10 |
| max_instances_to_reorder | Maximum instances to reorder | 100 | 50-200 |
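One plausible reading of how the three threshold parameters interact is sketched below. It is an assumption for illustration, not Potato's actual implementation: the function `should_retrain` and its arguments are hypothetical, and the real manager may count instances or annotations differently.

```python
def should_retrain(annotation_counts, new_annotations_since_last_training,
                   min_annotations_per_instance=1,
                   min_instances_for_training=10,
                   update_frequency=5):
    """Decide whether a (re)training run is due.

    `annotation_counts` maps an instance ID to the number of annotations
    it has received so far.
    """
    # An instance counts toward training once it has enough annotations.
    eligible = sum(1 for n in annotation_counts.values()
                   if n >= min_annotations_per_instance)
    if eligible < min_instances_for_training:
        return False
    # Retrain only after enough new annotations have accumulated.
    return new_annotations_since_last_training >= update_frequency


counts = {f"doc-{i}": 2 for i in range(10)}
print(should_retrain(counts, 5, min_annotations_per_instance=2))  # True
print(should_retrain(counts, 5, min_annotations_per_instance=3))  # False
```

Under this reading, raising `min_annotations_per_instance` delays the first training run even when many instances have been seen, which is worth remembering when training does not start as expected.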
Example Basic Configuration
# Basic active learning setup
active_learning:
  enabled: true
  schema_names: ["sentiment"]
  min_annotations_per_instance: 2
  min_instances_for_training: 20
  update_frequency: 10
  max_instances_to_reorder: 100
  random_sample_percent: 20
  resolution_strategy: "majority_vote"
Advanced Configuration
Classifier Configuration
Choose and configure your machine learning classifier:
active_learning:
  enabled: true
  classifier_name: "sklearn.ensemble.RandomForestClassifier"
  classifier_kwargs:
    n_estimators: 100
    max_depth: 10
    random_state: 42
  vectorizer_name: "sklearn.feature_extraction.text.TfidfVectorizer"
  vectorizer_kwargs:
    max_features: 1000
    ngram_range: [1, 2]
    stop_words: "english"
Supported Classifiers
| Classifier | Use Case | Pros | Cons |
|---|---|---|---|
| LogisticRegression | Binary/multi-class | Fast, interpretable | Linear only |
| RandomForestClassifier | Complex patterns | Robust, handles non-linear | Slower training |
| SVC | High-dimensional data | Good with sparse data | Memory intensive |
| MultinomialNB | Text classification | Very fast | Assumes independence |
Supported Vectorizers
| Vectorizer | Use Case | Pros | Cons |
|---|---|---|---|
| CountVectorizer | Simple text features | Fast, simple | No word importance |
| TfidfVectorizer | Text with word importance | Better performance | Slightly slower |
| HashingVectorizer | Large datasets | Memory efficient | No feature names |
Resolution Strategies
When multiple annotators label the same instance, choose how to resolve conflicts:
active_learning:
  resolution_strategy: "majority_vote" # Options: majority_vote, consensus, random
| Strategy | Description | Use When |
|---|---|---|
| majority_vote | Most common label wins | Multiple annotators, clear disagreements |
| consensus | All annotators must agree | High-quality requirements |
| random | Randomly select one annotation | Quick testing, simple workflows |
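The three strategies in the table can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Potato's code: `resolve_labels` is a hypothetical helper, and returning `None` when consensus is not reached (so the instance is skipped for training) is an assumed behavior.

```python
import random
from collections import Counter


def resolve_labels(labels, strategy="majority_vote", rng=None):
    """Collapse several annotators' labels for one instance into one label."""
    if not labels:
        return None
    if strategy == "majority_vote":
        # The most frequent label wins.
        return Counter(labels).most_common(1)[0][0]
    if strategy == "consensus":
        # Only accept the label if every annotator agrees (assumed behavior).
        return labels[0] if len(set(labels)) == 1 else None
    if strategy == "random":
        # Pick one annotation at random; pass a seeded rng for repeatability.
        return (rng or random).choice(labels)
    raise ValueError(f"unknown resolution strategy: {strategy!r}")


print(resolve_labels(["positive", "negative", "positive"]))  # positive
print(resolve_labels(["positive", "negative"], "consensus"))  # None
```

With `consensus`, disagreement removes the instance from the training set in this sketch, which is why that strategy tends to need more annotations before training can start.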
LLM Integration
Enabling LLM Support
Integrate Large Language Models for advanced confidence scoring:
active_learning:
  enabled: true
  llm_enabled: true
  llm_config:
    endpoint_url: "http://localhost:8000"
    model_name: "llama-2-7b"
    use_mock: false
    max_tokens: 100
    temperature: 0.1
LLM Configuration Options
| Parameter | Description | Default | Example |
|---|---|---|---|
| endpoint_url | vLLM endpoint URL | Required | http://localhost:8000 |
| model_name | Model name on server | Required | llama-2-7b |
| use_mock | Use mock for testing | false | true |
| max_tokens | Maximum response tokens | 100 | 50-200 |
| temperature | Response randomness | 0.1 | 0.0-1.0 |
Mock Mode for Testing
Use mock mode during development and testing:
active_learning:
  llm_enabled: true
  llm_config:
    use_mock: true
    endpoint_url: "http://localhost:8000" # Not used in mock mode
    model_name: "test-model"
Model Persistence
Enabling Model Persistence
Save trained models for reuse and analysis:
active_learning:
  enabled: true
  model_persistence_enabled: true
  model_save_directory: "/path/to/models"
  model_retention_count: 2
Persistence Configuration
| Parameter | Description | Default | Recommended |
|---|---|---|---|
| model_persistence_enabled | Enable model saving | false | true |
| model_save_directory | Directory to save models | Required | /models/ |
| model_retention_count | Number of models to keep | 2 | 3-5 |
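To make the effect of `model_retention_count` concrete, the sketch below saves pickled models into a directory and prunes all but the newest few. The helper `save_model` and the `<schema>.v<NNNNNN>.pkl` file naming are assumptions made for this example; Potato's own file layout may differ.

```python
import pickle
import tempfile
from pathlib import Path


def save_model(model, save_dir, schema_name, retention_count=2):
    """Pickle `model` as a new version and keep only the newest versions."""
    if retention_count < 1:
        raise ValueError("retention_count must be at least 1")
    save_dir = Path(save_dir)
    save_dir.mkdir(parents=True, exist_ok=True)
    # Zero-padded version numbers make lexical order equal version order.
    existing = sorted(save_dir.glob(f"{schema_name}.v*.pkl"))
    next_version = 1 + max(
        (int(p.stem.rsplit(".v", 1)[1]) for p in existing), default=0)
    path = save_dir / f"{schema_name}.v{next_version:06d}.pkl"
    with path.open("wb") as fh:
        pickle.dump(model, fh)
    # Delete everything except the `retention_count` newest files.
    for old in (existing + [path])[:-retention_count]:
        old.unlink()
    return path


with tempfile.TemporaryDirectory() as tmp:
    for i in range(4):
        save_model({"round": i}, tmp, "sentiment", retention_count=2)
    kept = sorted(p.name for p in Path(tmp).glob("*.pkl"))
print(kept)  # ['sentiment.v000003.pkl', 'sentiment.v000004.pkl']
```

Because pickle files can execute arbitrary code when loaded, the save directory should only be writable by trusted users.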
Database Integration
For large-scale deployments, enable database persistence:
active_learning:
  enabled: true
  database_enabled: true
  database_config:
    host: "localhost"
    port: 3306
    database: "potato_al"
    username: "potato_user"
    password: "secure_password"
Multi-Schema Support
Schema Cycling
Configure active learning to cycle through multiple annotation schemas:
active_learning:
  enabled: true
  schema_names: ["sentiment", "topic", "urgency"]
  min_annotations_per_instance: 2
  min_instances_for_training: 15
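Schema cycling can be pictured as a round-robin over `schema_names`, so that each schema gets a turn at driving training. The snippet below is only a sketch of that idea using the standard library; how Potato actually schedules schemas is not specified here, and the variable names are hypothetical.

```python
from itertools import cycle, islice

schema_names = ["sentiment", "topic", "urgency"]

# cycle() yields the schemas in order and starts over after the last one.
training_order = list(islice(cycle(schema_names), 5))
print(training_order)
# ['sentiment', 'topic', 'urgency', 'sentiment', 'topic']
```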
Schema-Specific Configuration
Configure different parameters for each schema:
active_learning:
  enabled: true
  schema_names: ["sentiment", "topic"]
  schema_configs:
    sentiment:
      min_annotations_per_instance: 3
      classifier_name: "sklearn.linear_model.LogisticRegression"
    topic:
      min_annotations_per_instance: 2
      classifier_name: "sklearn.ensemble.RandomForestClassifier"
Monitoring and Metrics
Accessing Metrics
Active learning metrics are available through the admin interface. They can also be read programmatically from the manager:
# Get active learning statistics
from potato.active_learning_manager import get_active_learning_manager
manager = get_active_learning_manager()
stats = manager.get_stats()
print(f"Training count: {stats['training_count']}")
print(f"Models trained: {stats['models_trained']}")
print(f"Last training time: {stats['last_training_time']}")
print(f"LLM enabled: {stats['llm_enabled']}")
Key Metrics
| Metric | Description | What to Monitor |
|---|---|---|
| training_count | Number of training cycles | Training frequency |
| models_trained | Schemas with trained models | Coverage across schemas |
| last_training_time | Time since last training | Training recency |
| llm_enabled | LLM integration status | LLM availability |
| training_accuracy | Model accuracy scores | Model performance |
Performance Monitoring
Monitor training performance and adjust parameters:
active_learning:
  enabled: true
  # Adjust these based on performance monitoring
  update_frequency: 5 # Increase if training is too frequent
  min_instances_for_training: 20 # Decrease if the first training run starts too late
  max_instances_to_reorder: 50 # Adjust based on dataset size
Best Practices
Configuration Best Practices
- Start Simple: Begin with basic configuration and add complexity gradually
- Monitor Performance: Track training times and model accuracy
- Balance Parameters: Adjust update_frequency and min_instances_for_training based on your workflow
- Use Appropriate Classifiers: Choose classifiers based on your data characteristics
- Enable Persistence: Save models for analysis and debugging
Workflow Best Practices
- Sufficient Initial Data: Ensure you have enough initial annotations before enabling active learning
- Regular Monitoring: Check metrics regularly to ensure optimal performance
- Quality Control: Use resolution strategies appropriate for your quality requirements
- Scalability: Adjust parameters for large datasets
- Testing: Use mock mode during development and testing
Performance Optimization
- Fast Classifiers: Use fast classifiers for real-time annotation workflows
- Feature Limits: Limit vectorizer features to maintain speed
- Update Frequency: Balance between responsiveness and computational cost
- Memory Management: Monitor memory usage with large datasets
Troubleshooting
Common Issues
Training Not Triggering
Problem: Active learning training is not being triggered.
Solutions:
- Check min_instances_for_training is not too high
- Verify min_annotations_per_instance is met
- Ensure update_frequency is appropriate
- Check that annotations are being added correctly
active_learning:
  min_instances_for_training: 10 # Reduce if too high
  min_annotations_per_instance: 1 # Reduce if too high
  update_frequency: 5 # Reduce for more frequent training
Slow Training
Problem: Training is taking too long.
Solutions:
- Use faster classifiers (LogisticRegression, MultinomialNB)
- Limit vectorizer features
- Increase update_frequency
- Use simpler vectorizers
active_learning:
  classifier_name: "sklearn.linear_model.LogisticRegression"
  vectorizer_kwargs:
    max_features: 500 # Reduce feature count
  update_frequency: 10 # Train less frequently
LLM Integration Issues
Problem: LLM integration is not working.
Solutions:
- Verify the vLLM endpoint is running and accessible
- Check the endpoint URL and model name
- Use mock mode for testing
- Verify network connectivity
active_learning:
  llm_config:
    use_mock: true # Use mock for testing
    endpoint_url: "http://localhost:8000" # Verify URL
    model_name: "llama-2-7b" # Verify model name
Debug Mode
Enable debug logging for troubleshooting:
import logging
logging.basicConfig()  # installs a handler if none is configured yet (no-op otherwise)
logging.getLogger('potato.active_learning_manager').setLevel(logging.DEBUG)
Examples
Basic Sentiment Analysis
# Basic sentiment analysis with active learning
annotation_schemes:
  - annotation_type: radio
    name: sentiment
    description: What is the sentiment of this text?
    labels:
      - positive
      - negative
      - neutral
active_learning:
  enabled: true
  schema_names: ["sentiment"]
  min_annotations_per_instance: 2
  min_instances_for_training: 20
  update_frequency: 10
  classifier_name: "sklearn.linear_model.LogisticRegression"
  vectorizer_name: "sklearn.feature_extraction.text.TfidfVectorizer"
  vectorizer_kwargs:
    max_features: 1000
    stop_words: "english"
  resolution_strategy: "majority_vote"
  random_sample_percent: 20
Multi-Schema Classification
# Multi-schema classification with active learning
annotation_schemes:
  - annotation_type: radio
    name: sentiment
    description: What is the sentiment?
    labels: [positive, negative, neutral]
  - annotation_type: multiselect
    name: topics
    description: What topics are mentioned?
    labels: [politics, technology, sports, entertainment]
active_learning:
  enabled: true
  schema_names: ["sentiment", "topics"]
  min_annotations_per_instance: 2
  min_instances_for_training: 30
  update_frequency: 15
  classifier_name: "sklearn.ensemble.RandomForestClassifier"
  classifier_kwargs:
    n_estimators: 100
    max_depth: 10
  vectorizer_name: "sklearn.feature_extraction.text.TfidfVectorizer"
  vectorizer_kwargs:
    max_features: 2000
    ngram_range: [1, 2]
  resolution_strategy: "majority_vote"
  random_sample_percent: 15
  model_persistence_enabled: true
  model_save_directory: "./models"
Advanced LLM Integration
# Advanced configuration with LLM integration
active_learning:
  enabled: true
  schema_names: ["sentiment", "intent"]
  min_annotations_per_instance: 3
  min_instances_for_training: 50
  update_frequency: 20
  # Traditional ML classifier
  classifier_name: "sklearn.ensemble.RandomForestClassifier"
  classifier_kwargs:
    n_estimators: 200
    max_depth: 15
  vectorizer_name: "sklearn.feature_extraction.text.TfidfVectorizer"
  vectorizer_kwargs:
    max_features: 3000
    ngram_range: [1, 3]
  # LLM integration
  llm_enabled: true
  llm_config:
    endpoint_url: "http://localhost:8000"
    model_name: "llama-2-7b"
    use_mock: false
    max_tokens: 150
    temperature: 0.1
  # Model persistence
  model_persistence_enabled: true
  model_save_directory: "/data/potato/models"
  model_retention_count: 5
  # Database integration
  database_enabled: true
  database_config:
    host: "localhost"
    port: 3306
    database: "potato_al"
    username: "potato_user"
    password: "secure_password"
  resolution_strategy: "majority_vote"
  random_sample_percent: 10
  max_instances_to_reorder: 200
Query Strategies
Potato supports multiple query strategies beyond basic uncertainty sampling. For a comprehensive reference with mathematical formulations and citations, see Active Learning Query Strategies.
| Strategy | Description | Config Value |
|---|---|---|
| Uncertainty Sampling | Select least-confident instances | uncertainty |
| Diversity Sampling | Maximize feature-space coverage | diversity |
| BADGE | Uncertainty-weighted diversity | badge |
| BALD | Ensemble disagreement | bald |
| Hybrid | Weighted combination | hybrid |
active_learning:
  query_strategy: "hybrid"
  hybrid_weights:
    uncertainty: 0.7
    diversity: 0.3
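As a rough illustration of the hybrid strategy, the sketch below combines per-instance uncertainty and diversity scores with the configured weights. It assumes both scores are already normalized to [0, 1] with higher meaning "more worth annotating"; the function `hybrid_scores` is hypothetical, and Potato's actual scoring may be computed differently.

```python
import math


def hybrid_scores(uncertainty, diversity, weights):
    """Weighted sum of two per-instance scores; higher means annotate sooner."""
    w_u, w_d = weights["uncertainty"], weights["diversity"]
    # Reject weights that do not sum to 1.0.
    if not math.isclose(w_u + w_d, 1.0):
        raise ValueError("hybrid_weights must sum to 1.0")
    return {iid: w_u * uncertainty[iid] + w_d * diversity[iid]
            for iid in uncertainty}


uncertainty = {"a": 0.9, "b": 0.2, "c": 0.5}
diversity = {"a": 0.1, "b": 0.8, "c": 0.9}
scores = hybrid_scores(uncertainty, diversity,
                       {"uncertainty": 0.7, "diversity": 0.3})
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # ['a', 'c', 'b']
```

With the 0.7/0.3 weights, instance `a` still ranks first because of its high uncertainty, while `c` overtakes `b` thanks to its diversity score.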
Cold-Start with LLM Selection
Before enough annotations exist for classifier training, Potato can use an LLM to identify the most informative instances. Instances with moderate LLM confidence (near the decision boundary) are prioritized.
active_learning:
  cold_start_strategy: "llm"
  cold_start_batch_size: 20
  llm:
    enabled: true
    endpoint_url: "http://localhost:8080/v1/chat/completions"
    model_name: "your-model"
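The idea of prioritizing instances with moderate LLM confidence can be sketched as sorting by distance from the decision boundary. The example assumes a binary task where each instance has an LLM confidence in [0, 1] and 0.5 is the boundary; `cold_start_batch` and its `boundary` argument are hypothetical names, and treating `cold_start_batch_size` as the number of instances returned is an assumption.

```python
def cold_start_batch(llm_confidence, batch_size=20, boundary=0.5):
    """Pick the instances whose LLM confidence is closest to the boundary."""
    ranked = sorted(
        llm_confidence,
        # Smallest distance from the boundary first; ties broken by ID.
        key=lambda iid: (abs(llm_confidence[iid] - boundary), iid),
    )
    return ranked[:batch_size]


confidence = {"a": 0.97, "b": 0.52, "c": 0.10, "d": 0.45}
print(cold_start_batch(confidence, batch_size=2))  # ['b', 'd']
```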
Sentence-Transformer Embeddings
For tasks where TF-IDF features are insufficient, use pre-trained sentence-transformer embeddings:
active_learning:
  vectorizer_name: "sentence-transformers"
  vectorizer_params:
    model_name: "all-MiniLM-L6-v2"
Requires: pip install sentence-transformers
Probability Calibration
Raw predict_proba outputs from sklearn classifiers are often poorly calibrated. Potato wraps classifiers with CalibratedClassifierCV by default:
active_learning:
  calibrate_probabilities: true # default
Set to false to disable calibration (e.g., for debugging, or when using RandomForest, whose probabilities are often reasonably well calibrated without it).
Configuration Reference (New Fields)
| Field | Type | Default | Description |
|---|---|---|---|
| query_strategy | string | "uncertainty" | Query strategy: uncertainty, diversity, badge, bald, hybrid |
| hybrid_weights | dict | {uncertainty: 0.7, diversity: 0.3} | Weights for hybrid strategy (must sum to 1.0) |
| bald_params | dict | {n_estimators: 5, bootstrap_fraction: 0.8} | BALD ensemble parameters |
| calibrate_probabilities | bool | true | Wrap classifier with CalibratedClassifierCV |
| cold_start_strategy | string | "random" | Cold-start strategy: random, llm |
| cold_start_batch_size | int | 20 | Instances to sample for LLM cold-start |
| classifier_params | dict | {} | Extra parameters passed to classifier constructor |
| vectorizer_params | dict | {} | Extra parameters passed to vectorizer constructor |
| use_icl_ensemble | bool | false | Blend ICL predictions with classifier |
| annotation_routing | bool | false | Enable LLM-based annotation routing |
Architecture Overview
The active learning system uses a modular ActiveLearningManager class (singleton pattern) with these key components:
- Asynchronous training: Model training runs in background threads to avoid blocking annotation workflows
- Modular classifiers/vectorizers: Pluggable sklearn classifiers and vectorizers configured via YAML
- LLM integration: Optional vLLM endpoint integration for advanced confidence scoring, with mock mode for testing
- Model persistence: Pickle-based model storage with configurable retention policies and training history tracking
- Schema cycling: Balanced training across multiple annotation schemas
Key source files:
- potato/active_learning_manager.py — Main manager class
- potato/ai/llm_active_learning.py — LLM integration module
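The asynchronous-training component listed above can be pictured with a small background-thread wrapper. This is a generic sketch, not the code in `potato/active_learning_manager.py`: the class `BackgroundTrainer` and its methods are hypothetical, and the real manager may use a different concurrency design.

```python
import threading
import time


class BackgroundTrainer:
    """Run a training function in a background thread, one run at a time."""

    def __init__(self, train_fn):
        self._train_fn = train_fn
        self._lock = threading.Lock()
        self._thread = None
        self.training_count = 0

    def request_training(self):
        """Start training unless a run is in progress; never blocks on it."""
        with self._lock:
            if self._thread is not None and self._thread.is_alive():
                return False  # a run is already underway, skip this request
            self._thread = threading.Thread(target=self._run, daemon=True)
            self._thread.start()
            return True

    def _run(self):
        self._train_fn()
        with self._lock:
            self.training_count += 1

    def wait(self, timeout=None):
        """Block until the current run (if any) finishes."""
        thread = self._thread
        if thread is not None:
            thread.join(timeout)


trainer = BackgroundTrainer(lambda: time.sleep(0.05))
trainer.request_training()  # returns immediately; training runs in background
trainer.wait()
print(trainer.training_count)  # 1
```

Skipping a request while a run is active keeps annotation requests responsive; the next annotation that crosses the retraining threshold can trigger a fresh run.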
Conclusion
Active learning in Potato provides powerful capabilities for optimizing annotation workflows. By following this guide and best practices, administrators can configure and use active learning effectively to improve annotation efficiency and model quality.
For additional support and advanced configurations, refer to:
- Active Learning Query Strategies: Detailed strategy reference with citations
- AI Support: LLM endpoint configuration