Scorers
Viteval provides a comprehensive set of pre-built scorers for common evaluation scenarios, powered by the autoevals
library.
Import
import { scorers } from 'viteval';
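Every scorer plugs into an eval's scorers array. A minimal sketch of that wiring, assuming Viteval's evaluate() entry point and its data/task/scorers/threshold options; check your version's docs for the exact task signature:

import { evaluate, scorers } from 'viteval';

evaluate('capital cities', {
  data: async () => [
    { input: 'What is the capital of France?', expected: 'Paris' },
  ],
  task: async ({ input }) => {
    // Replace with a real model call; a constant keeps the sketch self-contained.
    return 'Paris';
  },
  scorers: [scorers.exactMatch],
  threshold: 1,
});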
Text Similarity
exactMatch
Returns 1 if the output exactly matches the expected value, 0 otherwise.
scorers.exactMatch
Use cases: Exact answers, structured outputs, code generation
Example:
// Input: "What is 2+2?"
// Output: "4"
// Expected: "4"
// Score: 1.0
// Output: "Four"
// Expected: "4"
// Score: 0.0
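Conceptually this is nothing more than a strict equality check:

// The idea behind exactMatch, as a plain function.
function exactMatchScore(output: string, expected: string): number {
  return output === expected ? 1 : 0;
}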
levenshtein
Measures text similarity using the Levenshtein edit-distance algorithm.
scorers.levenshtein
Range: 0.0 - 1.0 (higher = more similar)
Use cases: Approximate text matching, typo tolerance
Example:
// Output: "The sky is blue"
// Expected: "The sky is bleu"
// Score: ~0.92 (small typo)
// Output: "Blue sky"
// Expected: "The sky is blue"
// Score: ~0.66 (partial match)
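The scores above follow the common normalization score = 1 - distance / max(output.length, expected.length); the exact formula is an implementation detail of autoevals, so treat this as an illustration:

// "blue" -> "bleu" needs two substitutions, so the edit distance is 2.
const distance = 2;
const score =
  1 - distance / Math.max('The sky is blue'.length, 'The sky is bleu'.length);
// => 1 - 2 / 15 ≈ 0.87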
answerSimilarity
Measures semantic similarity between the output and the expected answer using embeddings.
scorers.answerSimilarity
Range: 0.0 - 1.0
Use cases: Meaning-based comparison, paraphrasing
Example:
// Output: "The capital of France is Paris"
// Expected: "Paris is France's capital city"
// Score: ~0.95 (same meaning, different wording)
embeddingSimilarity
Measures semantic similarity using embeddings.
scorers.embeddingSimilarity
Range: 0.0 - 1.0
Use cases: Semantic search, content matching
Features:
- Compares embeddings of the texts rather than the raw strings
- Captures semantic meaning beyond keyword overlap
- Language-agnostic comparison
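Embedding scorers of this kind embed both texts and compare the resulting vectors, typically with cosine similarity. A sketch of that comparison (the embedding call itself is provider-specific and omitted here):

// Cosine similarity between two embedding vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}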
Content Quality
factual
Evaluates factual accuracy against ground truth.
scorers.factual
Range: 0.0 - 1.0
Use cases: Knowledge-based QA, educational content
Example:
// Output: "Paris is the capital of France"
// Expected: "Paris"
// Score: 1.0 (factually correct)
// Output: "London is the capital of France"
// Expected: "Paris"
// Score: 0.0 (factually incorrect)
summary
Evaluates summary quality and completeness.
scorers.summary
Range: 0.0 - 1.0
Use cases: Text summarization, content condensation
Criteria:
- Relevance to source material
- Completeness of key points
- Conciseness
- Accuracy
translation
Assesses translation accuracy between languages.
scorers.translation
Range: 0.0 - 1.0
Use cases: Machine translation, multilingual content
Criteria:
- Meaning preservation
- Grammar correctness
- Cultural appropriateness
- Fluency
Answer Quality
answerCorrectness
Measures how correct an answer is compared to the expected answer.
scorers.answerCorrectness
Range: 0.0 - 1.0
Use cases: QA systems, educational assessments
answerRelevancy
Evaluates how relevant the answer is to the question.
scorers.answerRelevancy
Range: 0.0 - 1.0
Use cases: Chat systems, search results
Example:
// Question: "What's the weather like?"
// Output: "It's sunny and 75°F"
// Score: 1.0 (highly relevant)
// Output: "I like pizza"
// Score: 0.0 (not relevant)
Context Evaluation
contextEntityRecall
Checks whether all expected entities are mentioned in the context.
scorers.contextEntityRecall
Range: 0.0 - 1.0
Use cases: Information extraction, entity recognition
contextPrecision
Measures precision of context usage in responses.
scorers.contextPrecision
Range: 0.0 - 1.0
Use cases: RAG systems, context-aware generation
contextRecall
Evaluates how well context information is recalled.
scorers.contextRecall
Range: 0.0 - 1.0
Use cases: Reading comprehension, context utilization
contextRelevancy
Assesses relevance of provided context to the task.
scorers.contextRelevancy
Range: 0.0 - 1.0
Use cases: Context filtering, relevance ranking
Safety and Moderation
moderation
Detects harmful, inappropriate, or unsafe content.
scorers.moderation
Range: 0.0 - 1.0 (1.0 = safe, 0.0 = unsafe)
Use cases: Content filtering, safety checks
Categories detected:
- Hate speech
- Violence
- Self-harm
- Sexual content
- Harassment
Example:
// Output: "Here's a helpful recipe for cookies"
// Score: 1.0 (safe content)
// Output: "Instructions for harmful activities"
// Score: 0.0 (unsafe content)
Structured Data
jsonDiff
Compares JSON structures for differences.
scorers.jsonDiff
Range: 0.0 - 1.0
Use cases: API responses, structured output validation
Example:
// Output: '{"name": "John", "age": 30}'
// Expected: '{"name": "John", "age": 30}'
// Score: 1.0 (exact match)
// Output: '{"name": "John"}'
// Expected: '{"name": "John", "age": 30}'
// Score: 0.5 (partial match)
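One way to picture the partial score: the fraction of expected fields the output gets right. This flat-object sketch only illustrates the idea; autoevals' actual diff recurses into nested structures and weights differences:

// Hypothetical flat-object version of the idea behind jsonDiff.
function flatJsonScore(
  output: Record<string, unknown>,
  expected: Record<string, unknown>
): number {
  const keys = Object.keys(expected);
  if (keys.length === 0) return 1;
  const matched = keys.filter((k) => output[k] === expected[k]).length;
  return matched / keys.length;
}

// flatJsonScore({ name: 'John' }, { name: 'John', age: 30 }) => 0.5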
sql
Validates SQL query syntax and correctness.
scorers.sql
Range: 0.0 - 1.0
Use cases: SQL generation, query validation
Criteria:
- Syntax correctness
- Logical structure
- Expected results
Numeric and List Comparison
numericDiff
Scores the numeric difference between the output and the expected value.
scorers.numericDiff
Range: 0.0 - 1.0
Use cases: Mathematical calculations, numeric predictions
Example:
// Output: "42"
// Expected: "40"
// Score: ~0.95 (small difference)
// Output: "100"
// Expected: "40"
// Score: ~0.0 (large difference)
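One normalization consistent with the example scores above is the relative difference against the expected value, clamped at zero; treat this as an illustration, not autoevals' exact formula:

// Hypothetical normalization matching the example scores above.
function numericScore(output: number, expected: number): number {
  return Math.max(0, 1 - Math.abs(output - expected) / Math.abs(expected));
}

// numericScore(42, 40)  => 0.95
// numericScore(100, 40) => 0 (the difference exceeds the expected value)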
listContains
Verifies whether a list contains the expected items.
scorers.listContains
Range: 0.0 - 1.0
Use cases: List generation, item retrieval
Example:
// Output: ["apple", "banana", "cherry"]
// Expected: ["apple", "banana"]
// Score: 1.0 (contains all expected items)
// Output: ["apple"]
// Expected: ["apple", "banana"]
// Score: 0.5 (contains half of expected items)
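The score behaves like recall over the expected items: the fraction of expected entries found in the output. A sketch of that idea (whether the library also applies fuzzy matching is an implementation detail):

// Hypothetical recall-style scoring over expected list items.
function listContainsScore(output: string[], expected: string[]): number {
  if (expected.length === 0) return 1;
  const found = expected.filter((item) => output.includes(item)).length;
  return found / expected.length;
}

// listContainsScore(['apple', 'banana', 'cherry'], ['apple', 'banana']) => 1
// listContainsScore(['apple'], ['apple', 'banana'])                     => 0.5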
Specialized Scorers
possible
Checks if an answer is logically possible.
scorers.possible
Range: 0.0 - 1.0
Use cases: Logical reasoning, plausibility checking
Example:
// Question: "How many wheels does a bicycle have?"
// Output: "2"
// Score: 1.0 (possible)
// Output: "7"
// Score: 0.0 (not possible)
humor
Evaluates humor quality and appropriateness.
scorers.humor
Range: 0.0 - 1.0
Use cases: Creative writing, entertainment content
Criteria:
- Humor presence
- Appropriateness
- Cleverness
- Timing