`scorers`

Viteval provides a comprehensive set of pre-built scorers for common evaluation scenarios, powered by the autoevals library.

Import

import { scorers } from 'viteval';

Text Similarity

`exactMatch`

Returns 1 if output exactly matches expected, 0 otherwise.

scorers.exactMatch

Use cases: Exact answers, structured outputs, code generation

Example:

// Input: "What is 2+2?"
// Output: "4"
// Expected: "4"
// Score: 1.0

// Output: "Four" 
// Expected: "4"
// Score: 0.0

`levenshtein`

Measures text similarity using edit distance algorithm.

scorers.levenshtein

Range: 0.0 - 1.0 (higher = more similar)
Use cases: Approximate text matching, typo tolerance

Example:

// Output: "The sky is blue"
// Expected: "The sky is bleu"  
// Score: ~0.92 (small typo)

// Output: "Blue sky"
// Expected: "The sky is blue"
// Score: ~0.66 (partial match)

`answerSimilarity`

Semantic similarity using embeddings.

scorers.answerSimilarity

Range: 0.0 - 1.0
Use cases: Meaning-based comparison, paraphrasing

Example:

// Output: "The capital of France is Paris"
// Expected: "Paris is France's capital city"
// Score: ~0.95 (same meaning, different wording)

`embeddingSimilarity`

Measures semantic similarity using embeddings.

scorers.embeddingSimilarity

Range: 0.0 - 1.0
Use cases: Semantic search, content matching

Features:

Uses state-of-the-art embedding models
Captures semantic meaning beyond keywords
Language-agnostic comparison

Content Quality

`factual`

Evaluates factual accuracy against ground truth.

scorers.factual

Range: 0.0 - 1.0
Use cases: Knowledge-based QA, educational content

Example:

// Output: "Paris is the capital of France"
// Expected: "Paris"
// Score: 1.0 (factually correct)

// Output: "London is the capital of France"  
// Expected: "Paris"
// Score: 0.0 (factually incorrect)

`summary`

Evaluates summary quality and completeness.

scorers.summary

Range: 0.0 - 1.0
Use cases: Text summarization, content condensation

Criteria:

Relevance to source material
Completeness of key points
Conciseness
Accuracy

`translation`

Assesses translation accuracy between languages.

scorers.translation

Range: 0.0 - 1.0
Use cases: Machine translation, multilingual content

Criteria:

Meaning preservation
Grammar correctness
Cultural appropriateness
Fluency

Answer Quality

`answerCorrectness`

Measures how correct an answer is compared to expected.

scorers.answerCorrectness

Range: 0.0 - 1.0
Use cases: QA systems, educational assessments

`answerRelevancy`

Evaluates how relevant the answer is to the question.

scorers.answerRelevancy

Range: 0.0 - 1.0
Use cases: Chat systems, search results

Example:

// Question: "What's the weather like?"
// Output: "It's sunny and 75°F"
// Score: 1.0 (highly relevant)

// Output: "I like pizza"
// Score: 0.0 (not relevant)

Context Evaluation

`contextEntityRecall`

Checks if all expected entities are mentioned in context.

scorers.contextEntityRecall

Range: 0.0 - 1.0
Use cases: Information extraction, entity recognition

`contextPrecision`

Measures precision of context usage in responses.

scorers.contextPrecision

Range: 0.0 - 1.0
Use cases: RAG systems, context-aware generation

`contextRecall`

Evaluates how well context information is recalled.

scorers.contextRecall

Range: 0.0 - 1.0
Use cases: Reading comprehension, context utilization

`contextRelevancy`

Assesses relevance of provided context to the task.

scorers.contextRelevancy

Range: 0.0 - 1.0
Use cases: Context filtering, relevance ranking

Safety and Moderation

`moderation`

Detects harmful, inappropriate, or unsafe content.

scorers.moderation

Range: 0.0 - 1.0 (1.0 = safe, 0.0 = unsafe)
Use cases: Content filtering, safety checks

Categories detected:

Hate speech
Violence
Self-harm
Sexual content
Harassment

Example:

// Output: "Here's a helpful recipe for cookies"
// Score: 1.0 (safe content)

// Output: "Instructions for harmful activities"
// Score: 0.0 (unsafe content)

Structured Data

`jsonDiff`

Compares JSON structures for differences.

scorers.jsonDiff

Range: 0.0 - 1.0
Use cases: API responses, structured output validation

Example:

// Output: '{"name": "John", "age": 30}'
// Expected: '{"name": "John", "age": 30}'
// Score: 1.0 (exact match)

// Output: '{"name": "John"}'
// Expected: '{"name": "John", "age": 30}'
// Score: 0.5 (partial match)

`sql`

Validates SQL query syntax and correctness.

scorers.sql

Range: 0.0 - 1.0
Use cases: SQL generation, query validation

Criteria:

Syntax correctness
Logical structure
Expected results

Numeric and List Comparison

`numericDiff`

Calculates numerical difference between outputs.

scorers.numericDiff

Range: 0.0 - 1.0
Use cases: Mathematical calculations, numeric predictions

Example:

// Output: "42"
// Expected: "40"
// Score: ~0.95 (small difference)

// Output: "100"
// Expected: "40"  
// Score: ~0.0 (large difference)

`listContains`

Verifies if a list contains expected items.

scorers.listContains

Range: 0.0 - 1.0
Use cases: List generation, item retrieval

Example:

// Output: ["apple", "banana", "cherry"]
// Expected: ["apple", "banana"]
// Score: 1.0 (contains all expected items)

// Output: ["apple"]
// Expected: ["apple", "banana"]
// Score: 0.5 (contains half of expected items)

Specialized Scorers

`possible`

Checks if an answer is logically possible.

scorers.possible

Range: 0.0 - 1.0
Use cases: Logical reasoning, plausibility checking

Example:

// Question: "How many wheels does a bicycle have?"
// Output: "2"
// Score: 1.0 (possible)

// Output: "7"
// Score: 0.0 (not possible)

`humor`

Evaluates humor quality and appropriateness.

scorers.humor

Range: 0.0 - 1.0
Use cases: Creative writing, entertainment content

Criteria:

Humor presence
Appropriateness
Cleverness
Timing

scorers ​

Import ​

Text Similarity ​

exactMatch ​

levenshtein ​

answerSimilarity ​

embeddingSimilarity ​

Content Quality ​

factual ​

summary ​

translation ​

Answer Quality ​

answerCorrectness ​

answerRelevancy ​

Context Evaluation ​

contextEntityRecall ​

contextPrecision ​

contextRecall ​

contextRelevancy ​

Safety and Moderation ​

moderation ​

Structured Data ​

jsonDiff ​

sql ​

Numeric and List Comparison ​

numericDiff ​

listContains ​

Specialized Scorers ​

possible ​

humor ​

`scorers`

Import

Text Similarity

`exactMatch`

`levenshtein`

`answerSimilarity`

`embeddingSimilarity`

Content Quality

`factual`

`summary`

`translation`

Answer Quality

`answerCorrectness`

`answerRelevancy`

Context Evaluation

`contextEntityRecall`

`contextPrecision`

`contextRecall`

`contextRelevancy`

Safety and Moderation

`moderation`

Structured Data

`jsonDiff`

`sql`

Numeric and List Comparison

`numericDiff`

`listContains`

Specialized Scorers

`possible`

`humor`