F1 Score
The harmonic mean of precision and recall, providing a single metric that balances both types of classification error, ranging from 0 (worst) to 1 (perfect).
The F1 score combines precision and recall into a single number using the harmonic mean: F1 = 2 * (precision * recall) / (precision + recall). The harmonic mean penalizes extreme imbalances more than the arithmetic mean: a model with 100% precision but 10% recall gets an F1 of 0.18, not 0.55 as the arithmetic mean would suggest. This makes F1 a useful default metric when you need to balance both error types.
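The formula and the 0.18-versus-0.55 comparison above can be checked directly with a minimal sketch:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Perfect precision but poor recall is penalized heavily by the harmonic mean:
print(round(f1_score(1.0, 0.10), 2))   # 0.18
print(round((1.0 + 0.10) / 2, 2))      # 0.55 (arithmetic mean, for contrast)
```

The guard for `precision + recall == 0` matters in practice: a model that predicts no positives at all has zero precision and recall, and the formula would otherwise divide by zero.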
Variants include the weighted F1 (per-class F1 scores averaged by class frequency, useful for imbalanced datasets), macro F1 (the unweighted average across classes, treating all classes equally), and the F-beta score (a generalization in which beta controls the relative importance of recall versus precision, with beta=2 weighting recall twice as heavily as precision).
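A short sketch of the variants, using hypothetical per-class scores and class counts to show how macro and weighted averaging diverge on an imbalanced dataset:

```python
def fbeta(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score: beta > 1 favors recall, beta < 1 favors precision."""
    if precision == recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical per-class F1 scores and class frequencies:
per_class_f1 = {"cat": 0.90, "dog": 0.80, "bird": 0.40}
counts       = {"cat": 800,  "dog": 150,  "bird": 50}

macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)
total = sum(counts.values())
weighted_f1 = sum(per_class_f1[c] * counts[c] / total for c in counts)

print(round(macro_f1, 2))     # 0.70 -- the rare "bird" class drags it down
print(round(weighted_f1, 2))  # 0.86 -- dominated by the frequent "cat" class

# F-beta with beta=2 penalizes low recall more than F1 does:
print(round(fbeta(0.8, 0.4, beta=1), 2))  # 0.53
print(round(fbeta(0.8, 0.4, beta=2), 2))  # 0.44
```

The gap between 0.70 and 0.86 is the point: weighted F1 can look healthy while a minority class performs badly, which is exactly when macro F1 is the more honest summary.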
For practical applications, F1 is a good starting metric but should not be the only one. It implicitly assumes equal cost for false positives and false negatives, which is rarely true in business contexts. A churn model where missing a churning customer costs $1,000 but a false alert costs $10 should optimize for a metric that reflects these asymmetric costs. Use F1 for initial model comparison, then switch to a business-value-weighted metric for production optimization.
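The churn example can be made concrete. The sketch below uses hypothetical confusion-matrix counts for two models and the asymmetric costs from the text ($1,000 per missed churner, $10 per false alert) to show that the model with the better F1 is not the one with the lower expected cost:

```python
COST_FN = 1_000  # cost of missing a churning customer
COST_FP = 10     # cost of a false alert

def f1(tp: int, fp: int, fn: int) -> float:
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def expected_cost(fp: int, fn: int) -> int:
    return fn * COST_FN + fp * COST_FP

# Hypothetical counts: 100 true churners in each evaluation set.
model_a = {"tp": 60, "fp": 100, "fn": 40}  # conservative: few false alerts
model_b = {"tp": 90, "fp": 900, "fn": 10}  # aggressive: catches most churners

print(round(f1(**model_a), 2))                         # 0.46
print(round(f1(**model_b), 2))                         # 0.17
print(expected_cost(model_a["fp"], model_a["fn"]))     # 41000
print(expected_cost(model_b["fp"], model_b["fn"]))     # 19000
```

Model A wins decisively on F1, yet Model B costs less than half as much to run, because F1 treats a $1,000 error and a $10 error identically.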
Related Terms
RAG (Retrieval-Augmented Generation)
A technique that grounds LLM responses in external data by retrieving relevant documents at query time and injecting them into the prompt context.
Embeddings
Dense vector representations of text, images, or other data that capture semantic meaning in a high-dimensional space, enabling similarity search and clustering.
Vector Database
A specialized database optimized for storing, indexing, and querying high-dimensional vector embeddings with sub-millisecond similarity search.
LLM (Large Language Model)
A neural network trained on massive text corpora that can generate, understand, and transform natural language for tasks like summarization, classification, and conversation.
Fine-Tuning
The process of further training a pre-trained LLM on a domain-specific dataset to specialize its behavior, style, or knowledge for a particular task.
Prompt Engineering
The practice of designing and iterating on LLM input instructions to reliably produce desired outputs for a specific task.