Assertions and Expectations

PromptCanary compares responses using structural rules plus optional semantic similarity.

Assertion types

`format`

Valid values:

- bullet_points
- numbered_list
- json
- plain_text
- markdown

Example:

yaml

expect:
  format: json

`min_length` and `max_length`

Use character-length bounds to catch truncated or excessively long outputs.

yaml

expect:
  min_length: 40
  max_length: 500

Validation rule: min_length must be less than or equal to max_length.
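
The validation rule and the bounds check can be sketched together. This is a standalone illustration, not PromptCanary's code: the names `LengthBounds`, `validateBounds`, and `withinLength` are hypothetical, and treating both bounds as inclusive and measuring length in UTF-16 code units are assumptions.

```typescript
// Hypothetical shape for illustration; not PromptCanary's actual types.
interface LengthBounds {
  min_length?: number;
  max_length?: number;
}

// Mirrors the validation rule: min_length must not exceed max_length.
function validateBounds(bounds: LengthBounds): void {
  const { min_length, max_length } = bounds;
  if (min_length !== undefined && max_length !== undefined && min_length > max_length) {
    throw new RangeError('min_length must be less than or equal to max_length');
  }
}

// Returns true when the character count falls inside the bounds.
// Assumption: both bounds are inclusive.
function withinLength(content: string, bounds: LengthBounds): boolean {
  validateBounds(bounds);
  const length = content.length; // UTF-16 code units (assumption)
  if (bounds.min_length !== undefined && length < bounds.min_length) return false;
  if (bounds.max_length !== undefined && length > bounds.max_length) return false;
  return true;
}
```

Rejecting an inverted range up front means a misconfigured test fails loudly instead of silently never passing.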

`must_contain` and `must_not_contain`

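Use `must_contain` for terms that have to appear in the response and `must_not_contain` for terms that must never appear. Together they act as contract-like checks.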

yaml

expect:
  must_contain: [refund policy, SLA]
  must_not_contain: [I cannot help, error]

`startsWith` and `endsWith`

Check if content begins or ends with a specific string (case-insensitive).

typescript

// Programmatic API
assertions.startsWith('Hello World', 'Hello'); // { passed: true, ... }
assertions.endsWith('Hello World', 'World'); // { passed: true, ... }

// With runAll()
assertions.runAll(content, [
  { type: 'starts_with', value: 'Dear Customer' },
  { type: 'ends_with', value: 'Best regards' },
]);

Both functions are case-insensitive, matching the behavior of contains().
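
A minimal sketch of the case-insensitive comparison described above, written as standalone functions rather than PromptCanary's implementation. The names `startsWithSketch` and `endsWithSketch` and the `details` wording are hypothetical; the sketch also assumes that surrounding whitespace is not trimmed before comparing.

```typescript
interface SketchResult {
  passed: boolean;
  details: string;
}

// Lower-case both sides so 'Dear Customer' matches 'dear customer'.
function startsWithSketch(content: string, prefix: string): SketchResult {
  const passed = content.toLowerCase().startsWith(prefix.toLowerCase());
  return {
    passed,
    details: passed ? `starts with "${prefix}"` : `does not start with "${prefix}"`,
  };
}

function endsWithSketch(content: string, suffix: string): SketchResult {
  const passed = content.toLowerCase().endsWith(suffix.toLowerCase());
  return {
    passed,
    details: passed ? `ends with "${suffix}"` : `does not end with "${suffix}"`,
  };
}
```

Under the no-trimming assumption, a response with a trailing newline would fail an `endsWith` check, so it may be worth normalizing content before asserting on its ending.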

`containsAll` and `containsAny`

Check if content contains multiple substrings (all or any).

typescript

// Programmatic API
assertions.containsAll('Hello World Test', ['Hello', 'World']); // { passed: true, ... }
assertions.containsAny('Hello World', ['Goodbye', 'Hello']); // { passed: true, ... }

// With runAll()
assertions.runAll(content, [
  { type: 'contains_all', value: ['refund', 'policy', 'days'] },
  { type: 'contains_any', value: ['error', 'warning', 'failed'] },
]);

containsAll(content, substrings): Passes if ALL substrings are found (case-insensitive). Reports missing substrings in details.
containsAny(content, substrings): Passes if AT LEAST ONE substring is found (case-insensitive). Reports which one matched or that none did.

Both functions are case-insensitive, matching the behavior of contains().
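
The all/any semantics can be sketched as follows. These are standalone, hypothetical functions (`containsAllSketch`, `containsAnySketch`), and the `missing` and `matched` fields are simplified stand-ins for whatever the real `details` payload contains.

```typescript
// Passes only if every substring is present; reports the ones that are not.
function containsAllSketch(
  content: string,
  substrings: string[],
): { passed: boolean; missing: string[] } {
  const haystack = content.toLowerCase();
  const missing = substrings.filter((s) => !haystack.includes(s.toLowerCase()));
  return { passed: missing.length === 0, missing };
}

// Passes if at least one substring is present; reports the first match.
function containsAnySketch(
  content: string,
  substrings: string[],
): { passed: boolean; matched: string | undefined } {
  const haystack = content.toLowerCase();
  const matched = substrings.find((s) => haystack.includes(s.toLowerCase()));
  return { passed: matched !== undefined, matched };
}
```

In this sketch an empty list passes `containsAllSketch` vacuously and fails `containsAnySketch`; how the library treats an empty list is not stated here.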

`caseSensitiveContains`

Check if content contains a substring with exact case matching.

typescript

// Programmatic API
assertions.caseSensitiveContains('Hello World', 'Hello'); // { passed: true, ... }
assertions.caseSensitiveContains('Hello World', 'hello'); // { passed: false, ... }

// With runAll()
assertions.runAll(content, [{ type: 'case_sensitive_contains', value: 'Error' }]);

Unlike contains(), which is case-insensitive, caseSensitiveContains() requires exact case matching. Use it when you need to verify that specific capitalization is present in the response.

`wordCount`

Check that the number of words in the content falls within optional min/max bounds.

typescript

// Programmatic API
assertions.wordCount('one two three four five', { min: 3, max: 10 }); // { passed: true, ... }
assertions.wordCount('hello world', { min: 5 }); // { passed: false, ... }

// With runAll()
assertions.runAll(content, [
  { type: 'word_count', value: { min: 10, max: 100 } },
  { type: 'word_count', value: { max: 50 } },
]);

wordCount(content, options): Counts words by splitting on whitespace. Passes if word count is within specified bounds.
- options.min: Minimum word count (optional)
- options.max: Maximum word count (optional)
- If neither is specified, any word count passes
- Empty or whitespace-only content counts as 0 words
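
The counting rules above can be sketched like this. It is a standalone illustration with hypothetical names (`countWords`, `wordCountSketch`), and it assumes both bounds are inclusive.

```typescript
interface WordCountOptions {
  min?: number;
  max?: number;
}

// Split on runs of whitespace; empty or whitespace-only content is 0 words.
function countWords(content: string): number {
  const trimmed = content.trim();
  return trimmed === '' ? 0 : trimmed.split(/\s+/).length;
}

// Passes when the count is within the bounds that were supplied.
// Assumption: min and max are inclusive.
function wordCountSketch(
  content: string,
  options: WordCountOptions = {},
): { passed: boolean; count: number } {
  const count = countWords(content);
  const aboveMin = options.min === undefined || count >= options.min;
  const belowMax = options.max === undefined || count <= options.max;
  return { passed: aboveMin && belowMax, count };
}
```

Trimming before splitting matters: splitting `'  hi  '` directly on whitespace would produce empty strings at both ends and inflate the count.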

`levenshtein`

Compute normalized Levenshtein (edit distance) similarity between response content and expected text. Returns a score from 0.0 (completely different) to 1.0 (identical). Works locally with no API calls.

typescript

// Programmatic API — score only (always passes)
assertions.levenshtein('hello world', 'hello world'); // { passed: true, score: 1.0, ... }

// With threshold — pass/fail based on minimum score
assertions.levenshtein('hello world', 'hello wrold', { threshold: 0.8 });
// { passed: true, score: 0.818, ... } (a swap costs 2 edits: 1 - 2/11)

assertions.levenshtein('abc', 'xyz', { threshold: 0.5 });
// { passed: false, score: 0.0, ... }

// With runAll()
assertions.runAll(content, [
  { type: 'levenshtein', value: { expected: 'expected output', threshold: 0.7 } },
]);

levenshtein(content, expected, options?): Normalized edit distance scorer.
- expected: The reference string to compare against
- options.threshold: Minimum score to pass (optional — without it, always passes)
- Formula: 1 - editDistance / max(content.length, expected.length)
- Score of 1.0 = identical strings, 0.0 = completely different
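
The formula above can be sketched as a standalone scorer. This is an illustration of normalized Levenshtein similarity, not PromptCanary's source: `editDistance` and `levenshteinSimilarity` are hypothetical names, and treating two empty strings as identical (score 1.0) is an assumption.

```typescript
// Classic dynamic-programming Levenshtein distance using two rows.
// Insertions, deletions, and substitutions each cost 1.
function editDistance(a: string, b: string): number {
  // prev[j] = distance between the first i-1 chars of a and the first j chars of b
  let prev: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1, // delete from a
        curr[j - 1] + 1, // insert into a
        prev[j - 1] + cost, // substitute (or keep when equal)
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

// 1 - editDistance / max(length), as in the formula above.
function levenshteinSimilarity(content: string, expected: string): number {
  const longest = Math.max(content.length, expected.length);
  if (longest === 0) return 1; // assumption: two empty strings are identical
  return 1 - editDistance(content, expected) / longest;
}
```

For example, `'kitten'` and `'sitting'` are 3 edits apart, and the longer string has 7 characters, so the similarity is 1 - 3/7, roughly 0.571.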

`rouge1`

Compute ROUGE-1 (unigram recall) score between response content and reference text. Measures how many reference words appear in the output. Good for summarization testing.

typescript

// Programmatic API — score only (always passes)
assertions.rouge1('the cat sat on the mat', 'the cat sat');
// { passed: true, score: 1.0, ... } — all reference tokens found

// With threshold
assertions.rouge1('the cat', 'the cat sat on the mat', { threshold: 0.5 });
// { passed: false, score: 0.333, ... }

// With runAll()
assertions.runAll(content, [
  { type: 'rouge1', value: { expected: 'key points from reference', threshold: 0.6 } },
]);

rouge1(content, reference, options?): Recall-based unigram overlap.
- reference: The reference text to compare against
- options.threshold: Minimum score to pass (optional)
- Case-insensitive tokenization on whitespace
- Score = matched reference tokens / total reference tokens
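
A minimal sketch of recall-based unigram overlap, under stated assumptions. The names `tokenize` and `rouge1Recall` are hypothetical. The sketch uses clipped matching (each output token can satisfy at most one reference token), which is consistent with the 0.333 example above but is not confirmed as the library's behavior, and it returns 0 for an empty reference as an arbitrary choice.

```typescript
// Lower-case and split on whitespace, dropping empty tokens.
function tokenize(text: string): string[] {
  return text.toLowerCase().split(/\s+/).filter((t) => t.length > 0);
}

// Score = matched reference tokens / total reference tokens.
function rouge1Recall(content: string, reference: string): number {
  const refTokens = tokenize(reference);
  if (refTokens.length === 0) return 0; // assumption: empty reference scores 0

  // Count how many times each token is available in the output.
  const available = new Map<string, number>();
  for (const token of tokenize(content)) {
    available.set(token, (available.get(token) ?? 0) + 1);
  }

  // Each reference token consumes one matching output token, if any remain.
  let matched = 0;
  for (const token of refTokens) {
    const remaining = available.get(token) ?? 0;
    if (remaining > 0) {
      matched++;
      available.set(token, remaining - 1);
    }
  }
  return matched / refTokens.length;
}
```

Because the score is recall only, a long output that happens to include every reference word scores 1.0 regardless of how much extra text it contains; pairing it with a length or word-count bound guards against that.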

`bleu`

Compute BLEU score between response content and reference text. Uses modified n-gram precision (1-4 grams) with brevity penalty. Good for translation and rephrasing testing.

typescript

// Programmatic API
assertions.bleu('the cat sat on the mat', 'the cat sat on the mat');
// { passed: true, score: 1.0, ... }

// Shorter output gets brevity penalty
assertions.bleu('the cat', 'the cat sat on the mat', { threshold: 0.5 });
// { passed: false, score: ~0.15, ... }

// With runAll()
assertions.runAll(content, [
  { type: 'bleu', value: { expected: 'reference translation', threshold: 0.4 } },
]);

bleu(content, reference, options?): Precision-based n-gram scorer with brevity penalty.
- reference: The reference text to compare against
- options.threshold: Minimum score to pass (optional)
- Uses n-grams from 1 to 4 with equal weights
- Applies brevity penalty when output is shorter than reference
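
The pieces listed above (clipped n-gram precision, equal weights, brevity penalty) can be sketched as follows. This is a simplified single-reference illustration, not the library's implementation: `tokenize`, `ngramCounts`, and `bleuSketch` are hypothetical names, no smoothing is applied, and n-gram orders longer than the output are skipped rather than counted as zero. That last choice is an assumption; the library's handling of short outputs is not documented here.

```typescript
function tokenize(text: string): string[] {
  return text.toLowerCase().split(/\s+/).filter((t) => t.length > 0);
}

// Count every n-gram of the given order, keyed by its space-joined tokens.
function ngramCounts(tokens: string[], n: number): Map<string, number> {
  const counts = new Map<string, number>();
  for (let i = 0; i + n <= tokens.length; i++) {
    const gram = tokens.slice(i, i + n).join(' ');
    counts.set(gram, (counts.get(gram) ?? 0) + 1);
  }
  return counts;
}

function bleuSketch(content: string, reference: string, maxN = 4): number {
  const cand = tokenize(content);
  const ref = tokenize(reference);
  if (cand.length === 0 || ref.length === 0) return 0;

  let logSum = 0;
  let orders = 0;
  for (let n = 1; n <= maxN; n++) {
    const refCounts = ngramCounts(ref, n);
    let total = 0;
    let clipped = 0;
    ngramCounts(cand, n).forEach((count, gram) => {
      total += count;
      // Clip each n-gram's credit to its count in the reference.
      clipped += Math.min(count, refCounts.get(gram) ?? 0);
    });
    if (total === 0) break; // output too short for this order (assumption: skip it)
    if (clipped === 0) return 0; // no smoothing: one zero precision zeroes the score
    logSum += Math.log(clipped / total);
    orders++;
  }

  // Penalize outputs shorter than the reference.
  const brevity = cand.length >= ref.length ? 1 : Math.exp(1 - ref.length / cand.length);
  // Geometric mean of the precisions, equally weighted.
  return brevity * Math.exp(logSum / orders);
}
```

With these assumptions, the short-output example `'the cat'` against `'the cat sat on the mat'` has perfect unigram and bigram precision and a brevity penalty of exp(1 - 6/2), giving about 0.135. That is in the neighborhood of the ~0.15 quoted above, but the difference suggests the library smooths or weights differently.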

`custom`

Run a user-defined scorer function for domain-specific assertion logic. Supports both sync and async scorers.

typescript

// Sync scorer
const citationResult = await assertions.custom(content, {
  scorer: (output) => {
    const hasCitation = /\[\d+\]/.test(output);
    return {
      score: hasCitation ? 1.0 : 0.0,
      pass: hasCitation,
      reason: hasCitation ? 'Contains citation' : 'Missing citation reference',
    };
  },
});

// Async scorer with input context
// (checkGrammar stands in for your own async helper)
const grammarResult = await assertions.custom(content, {
  scorer: async (output, input) => {
    const grammarScore = await checkGrammar(output);
    return {
      score: grammarScore,
      pass: grammarScore > 0.8,
      reason: `Grammar score: ${grammarScore}`,
    };
  },
  input: 'original prompt',
});

custom(content, options): Async custom scorer function.
- options.scorer: Function (output, input?) => { score, pass, reason } — can be sync or async
- options.input: Optional input context passed as second argument to scorer
- Returns AssertionResult with the scorer's score, pass/fail, and reason

Operational assertions

Operational assertions check performance and resource metrics from the test result, not the response content.

`latency`

Check if response latency is within acceptable limits.

typescript

// Programmatic API
const result = await testPrompt({
  /* ... */
});
assertions.latency(result.latencyMs, { max: 500 }); // { passed: true, ... }

// Typical usage in tests
it('responds within 500ms', async () => {
  const result = await testPrompt({
    provider: 'openai',
    model: 'gpt-4o-mini',
    messages: [{ role: 'user', content: 'Hello' }],
  });
  expect(assertions.latency(result.latencyMs, { max: 500 }).passed).toBe(true);
});

latency(latencyMs, options): Checks response latency in milliseconds.
- latencyMs: The latency value from TestPromptResult.latencyMs
- options.max: Maximum acceptable latency in milliseconds (required)
- Passes if latencyMs <= max

`tokenCount`

Check if token usage is within acceptable limits.

typescript

// Programmatic API
const result = await testPrompt({
  /* ... */
});
assertions.tokenCount(result.tokenUsage, { max: 1000 }); // { passed: true, ... }

// Check specific token types
assertions.tokenCount(result.tokenUsage, {
  maxPrompt: 500,
  maxCompletion: 500,
});

// Check all limits simultaneously
assertions.tokenCount(result.tokenUsage, {
  max: 1000,
  maxPrompt: 500,
  maxCompletion: 500,
});

// Typical usage in tests
it('uses tokens efficiently', async () => {
  const result = await testPrompt({
    provider: 'openai',
    model: 'gpt-4o-mini',
    messages: [{ role: 'user', content: 'Summarize this...' }],
  });
  expect(assertions.tokenCount(result.tokenUsage, { max: 1000 }).passed).toBe(true);
});

tokenCount(tokenUsage, options): Checks token usage metrics.
- tokenUsage: The token usage object from TestPromptResult.tokenUsage with { prompt: number, completion: number }
- options.max: Maximum total tokens (prompt + completion) (optional)
- options.maxPrompt: Maximum prompt tokens (optional)
- options.maxCompletion: Maximum completion tokens (optional)
- Passes if all specified limits are met
- At least one limit must be specified
- Useful for monitoring API costs and ensuring efficient prompt design
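
The limit rules above can be sketched as a standalone function. The names `TokenUsage`, `TokenLimits`, and `tokenCountSketch` and the `violations` field are hypothetical. Whether the library throws or returns a failed result when no limit is given is not stated here, so throwing is an assumption, as is treating each limit as inclusive.

```typescript
interface TokenUsage {
  prompt: number;
  completion: number;
}

interface TokenLimits {
  max?: number; // total = prompt + completion
  maxPrompt?: number;
  maxCompletion?: number;
}

function tokenCountSketch(
  usage: TokenUsage,
  limits: TokenLimits,
): { passed: boolean; violations: string[] } {
  const { max, maxPrompt, maxCompletion } = limits;
  if (max === undefined && maxPrompt === undefined && maxCompletion === undefined) {
    // Assumption: a missing limit is a configuration error.
    throw new Error('At least one limit must be specified');
  }

  const total = usage.prompt + usage.completion;
  const violations: string[] = [];
  if (max !== undefined && total > max) {
    violations.push(`total ${total} exceeds ${max}`);
  }
  if (maxPrompt !== undefined && usage.prompt > maxPrompt) {
    violations.push(`prompt ${usage.prompt} exceeds ${maxPrompt}`);
  }
  if (maxCompletion !== undefined && usage.completion > maxCompletion) {
    violations.push(`completion ${usage.completion} exceeds ${maxCompletion}`);
  }
  // Passes only if every specified limit is met.
  return { passed: violations.length === 0, violations };
}
```

Collecting every violation, rather than stopping at the first, makes it easier to see whether the prompt, the completion, or both are over budget.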

`tone`

Allowed values:

- professional
- casual
- technical
- friendly
- formal

Example:

yaml

expect:
  tone: professional

`semantic_similarity`

Checks meaning similarity against a baseline response using embeddings.

yaml

expect:
  semantic_similarity:
    baseline: The service is available and healthy.
    threshold: 0.8

- baseline: required reference text
- threshold: number from 0 to 1 (default 0.8)
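
To make the threshold concrete, here is a minimal sketch of comparing two embedding vectors with cosine similarity. The vectors stand in for embeddings of the response and the baseline; how PromptCanary obtains them is outside this sketch. The names `cosineSimilarity` and `passesSimilarity` are hypothetical, and returning 0 for a zero-length vector is an assumption.

```typescript
// Cosine similarity: dot(a, b) / (|a| * |b|).
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length === 0 || a.length !== b.length) {
    throw new Error('Embeddings must be non-empty and of equal length');
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // assumption: zero vectors match nothing
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Passes when similarity reaches the threshold (0.8 mirrors the default above).
function passesSimilarity(
  responseEmbedding: number[],
  baselineEmbedding: number[],
  threshold = 0.8,
): boolean {
  return cosineSimilarity(responseEmbedding, baselineEmbedding) >= threshold;
}
```

Cosine similarity ignores vector magnitude, so two embeddings pointing in the same direction score 1.0 even if one is a scaled copy of the other.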

Three-layer comparison pipeline

PromptCanary evaluates responses in three layers:

- Structural assertions (format, length, required/forbidden terms, tone).
- Semantic similarity via cosine similarity between response and baseline embeddings.
- Drift detection against historical semantic scores.

Drift detection flags significant score drops using a moving average baseline:

- More than 2 standard deviations below historical average, or
- More than 10 percent below average (whichever is more lenient)
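
One way to read the two rules above is sketched below. It interprets "whichever is more lenient" as using the lower of the two cutoffs, so a score is flagged only when it falls below both. That reading is an assumption, as are the hypothetical name `detectDrift`, the window size of 20, the use of population standard deviation, and the choice to never flag with fewer than two historical scores.

```typescript
function detectDrift(
  history: number[],
  current: number,
  windowSize = 20, // assumption: size of the moving-average window
): { drifted: boolean; cutoff: number } {
  const recent = history.slice(-windowSize);
  if (recent.length < 2) {
    // Not enough history to estimate spread; never flag (assumption).
    return { drifted: false, cutoff: Number.NEGATIVE_INFINITY };
  }

  const mean = recent.reduce((sum, x) => sum + x, 0) / recent.length;
  const variance = recent.reduce((sum, x) => sum + (x - mean) ** 2, 0) / recent.length;
  const std = Math.sqrt(variance);

  const sigmaCutoff = mean - 2 * std; // 2 standard deviations below average
  const percentCutoff = mean * 0.9; // 10 percent below average
  // "More lenient" is read here as the lower cutoff (assumption).
  const cutoff = Math.min(sigmaCutoff, percentCutoff);

  return { drifted: current < cutoff, cutoff };
}
```

Under this reading, a very stable history (near-zero deviation) falls back to the 10 percent rule, while a noisy history relies on the 2-sigma rule, which keeps ordinary fluctuation from raising alerts.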

Reliability tips

- Use several assertion types together instead of one strict check.
- Keep baselines short and stable for semantic comparisons.
- Avoid brittle must_contain strings that depend on exact phrasing.
- Set realistic thresholds; tune from observed production runs.
- Use provider-specific tests when behavior intentionally differs across models.
