F1 Score
The harmonic mean of precision and recall, providing a single metric that balances both types of classification error, ranging from 0 (worst) to 1 (perfect).
The F1 score combines precision and recall into a single number using the harmonic mean: F1 = 2 * (precision * recall) / (precision + recall). The harmonic mean penalizes extreme imbalances more than the arithmetic mean: a model with 100% precision but 10% recall gets an F1 of 0.18, not 0.55 as the arithmetic mean would suggest. This makes F1 a useful default metric when you need to balance both error types.
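The formula and the 0.18-versus-0.55 comparison above can be checked directly with a minimal sketch:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Perfect precision but poor recall is penalized heavily by the harmonic mean:
print(round(f1_score(1.0, 0.10), 2))   # 0.18
print(round((1.0 + 0.10) / 2, 2))      # 0.55 (arithmetic mean, for contrast)
```

The guard for `precision + recall == 0` matters in practice: a model that predicts no positives at all has zero precision and recall, and the formula would otherwise divide by zero.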
Variants include the weighted F1 (per-class F1 scores averaged by class frequency, useful for imbalanced datasets), macro F1 (the unweighted average across classes, treating all classes equally), and the F-beta score (a generalization in which beta controls the relative importance of recall versus precision, with beta=2 weighting recall twice as heavily as precision).
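A short sketch of the variants, using hypothetical per-class scores and class counts to show how macro and weighted averaging diverge on an imbalanced dataset:

```python
def fbeta(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score: beta > 1 favors recall, beta < 1 favors precision."""
    if precision == recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical per-class F1 scores and class frequencies:
per_class_f1 = {"cat": 0.90, "dog": 0.80, "bird": 0.40}
counts       = {"cat": 800,  "dog": 150,  "bird": 50}

macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)
total = sum(counts.values())
weighted_f1 = sum(per_class_f1[c] * counts[c] / total for c in counts)

print(round(macro_f1, 2))     # 0.70 -- the rare "bird" class drags it down
print(round(weighted_f1, 2))  # 0.86 -- dominated by the frequent "cat" class

# F-beta with beta=2 penalizes low recall more than F1 does:
print(round(fbeta(0.8, 0.4, beta=1), 2))  # 0.53
print(round(fbeta(0.8, 0.4, beta=2), 2))  # 0.44
```

The gap between 0.70 and 0.86 is the point: weighted F1 can look healthy while a minority class performs badly, which is exactly when macro F1 is the more honest summary.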
For practical applications, F1 is a good starting metric but should not be the only one. It implicitly assumes equal cost for false positives and false negatives, which is rarely true in business contexts. A churn model where missing a churning customer costs $1,000 but a false alert costs $10 should optimize for a metric that reflects these asymmetric costs. Use F1 for initial model comparison, then switch to a business-value-weighted metric for production optimization.
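The churn example can be made concrete. The sketch below uses hypothetical confusion-matrix counts for two models and the asymmetric costs from the text ($1,000 per missed churner, $10 per false alert) to show that the model with the better F1 is not the one with the lower expected cost:

```python
COST_FN = 1_000  # cost of missing a churning customer
COST_FP = 10     # cost of a false alert

def f1(tp: int, fp: int, fn: int) -> float:
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def expected_cost(fp: int, fn: int) -> int:
    return fn * COST_FN + fp * COST_FP

# Hypothetical counts: 100 true churners in each evaluation set.
model_a = {"tp": 60, "fp": 100, "fn": 40}  # conservative: few false alerts
model_b = {"tp": 90, "fp": 900, "fn": 10}  # aggressive: catches most churners

print(round(f1(**model_a), 2))                         # 0.46
print(round(f1(**model_b), 2))                         # 0.17
print(expected_cost(model_a["fp"], model_a["fn"]))     # 41000
print(expected_cost(model_b["fp"], model_b["fn"]))     # 19000
```

Model A wins decisively on F1, yet Model B costs less than half as much to run, because F1 treats a $1,000 error and a $10 error identically.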
Related Terms
RAG (Retrieval-Augmented Generation)
A technique that grounds LLM responses in external data by retrieving relevant documents at query time and injecting them into the prompt context.
Embeddings
Dense vector representations of text, images, or other data that capture semantic meaning in a high-dimensional space, enabling similarity search and clustering.
Vector Database
A specialized database optimized for storing, indexing, and querying high-dimensional vector embeddings with sub-millisecond similarity search.
LLM (Large Language Model)
A neural network trained on massive text corpora that can generate, understand, and transform natural language for tasks like summarization, classification, and conversation.
Fine-Tuning
The process of further training a pre-trained LLM on a domain-specific dataset to specialize its behavior, style, or knowledge for a particular task.
Prompt Engineering
The practice of designing and iterating on LLM input instructions to reliably produce desired outputs for a specific task.