autoevals

AutoEvals is a tool to quickly and easily evaluate AI model outputs.

Quickstart

npm install autoevals

Example

Use AutoEvals to model-grade an example LLM completion using the factuality prompt.

import { Factuality } from "autoevals";

(async () => {
  // The question, the model's answer to grade, and the reference answer.
  const input = "Which country has the highest population?";
  const output = "People's Republic of China";
  const expected = "China";

  // Awaiting works whether the scorer returns a Score or a Promise<Score>.
  const result = await Factuality({ output, expected, input });
  console.log(`Factuality score: ${result.score}`);
  console.log(`Factuality metadata: ${result.metadata?.rationale}`);
})();

Functions

AnswerCorrectness

▸ AnswerCorrectness(args): Score | Promise<Score>

Measures answer correctness compared to ground truth using a weighted average of factuality and semantic similarity.

Parameters

Name	Type
`args`	`ScorerArgs`<`string`, { `context?`: `string` \| `string`[] ; `input?`: `string` ; `model?`: `string` } & { `maxTokens?`: `number` ; `temperature?`: `number` } & `OpenAIAuth` & { `answerSimilarity?`: `Scorer`<`string`, {}> ; `answerSimilarityWeight?`: `number` ; `factualityWeight?`: `number` }>

Defined in

node_modules/.pnpm/@braintrust+core@0.0.8/node_modules/@braintrust/core/dist/index.d.ts:21
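
The weighted average mentioned in the description can be pictured with a small, self-contained sketch. It does not call the library: `weightedCorrectness` is a hypothetical helper, its parameter names only mirror `factualityWeight` and `answerSimilarityWeight` from the signature above, and the weights and sub-scores are made up for illustration.

```typescript
// Hypothetical helper (not part of autoevals): combines two sub-scores
// into a single weighted average.
function weightedCorrectness(
  factuality: number,
  similarity: number,
  factualityWeight: number,
  answerSimilarityWeight: number,
): number {
  const totalWeight = factualityWeight + answerSimilarityWeight;
  if (totalWeight <= 0) {
    throw new Error("At least one weight must be positive");
  }
  return (
    (factuality * factualityWeight + similarity * answerSimilarityWeight) /
    totalWeight
  );
}

// Illustrative values only: a fully factual answer with moderate similarity,
// weighting factuality three times as heavily as similarity.
const combinedScore = weightedCorrectness(1.0, 0.5, 3, 1);
```

Dividing by the total weight keeps the result in the same range as the two sub-scores, whatever weights are chosen.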

AnswerRelevancy

▸ AnswerRelevancy(args): Score | Promise<Score>

Scores the relevancy of the generated answer to the given question. Answers with incomplete, redundant or unnecessary information are penalized.

Parameters

Name	Type
`args`	`ScorerArgs`<`string`, { `context?`: `string` \| `string`[] ; `input?`: `string` ; `model?`: `string` } & { `maxTokens?`: `number` ; `temperature?`: `number` } & `OpenAIAuth` & { `strictness?`: `number` }>

Defined in

node_modules/.pnpm/@braintrust+core@0.0.8/node_modules/@braintrust/core/dist/index.d.ts:21

AnswerSimilarity

▸ AnswerSimilarity(args): Score | Promise<Score>

Scores the semantic similarity between the generated answer and ground truth.

Parameters

Name	Type
`args`	`ScorerArgs`<`string`, `RagasArgs`>

Defined in

node_modules/.pnpm/@braintrust+core@0.0.8/node_modules/@braintrust/core/dist/index.d.ts:21

Battle

▸ Battle(args): Score | Promise<Score>

Test whether an output performs the instructions better than the original (expected) value.

Parameters

Name	Type
`args`	`ScorerArgs`<`string`, `LLMClassifierArgs`<{ `instructions`: `string` }>>

Defined in

node_modules/.pnpm/@braintrust+core@0.0.8/node_modules/@braintrust/core/dist/index.d.ts:21

ClosedQA

▸ ClosedQA(args): Score | Promise<Score>

Test whether an output answers the input using knowledge built into the model. You can specify criteria to further constrain the answer.

Parameters

Name	Type
`args`	`ScorerArgs`<`string`, `LLMClassifierArgs`<{ `criteria`: `any` ; `input`: `string` }>>

Defined in

node_modules/.pnpm/@braintrust+core@0.0.8/node_modules/@braintrust/core/dist/index.d.ts:21

ContextEntityRecall

▸ ContextEntityRecall(args): Score | Promise<Score>

Estimates context recall by estimating true positives (TP) and false negatives (FN) from the annotated answer and the retrieved context.

Parameters

Name	Type
`args`	`ScorerArgs`<`string`, { `context?`: `string` \| `string`[] ; `input?`: `string` ; `model?`: `string` } & { `maxTokens?`: `number` ; `temperature?`: `number` } & `OpenAIAuth` & { `pairwiseScorer?`: `Scorer`<`string`, {}> }>

Defined in

node_modules/.pnpm/@braintrust+core@0.0.8/node_modules/@braintrust/core/dist/index.d.ts:21
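
As a rough illustration of the TP/FN idea, recall can be computed as TP / (TP + FN) over entity lists. The sketch below is not the library's implementation: `entityRecall` is a hypothetical helper, the entity lists are given directly rather than extracted by a model, and exact case-insensitive matching stands in for whatever `pairwiseScorer` would do.

```typescript
// Hypothetical helper (not part of autoevals): recall = TP / (TP + FN), where
//   TP = expected entities that also appear in the retrieved context
//   FN = expected entities missing from the retrieved context
// Exact, case-insensitive matching is an assumption of this sketch.
function entityRecall(
  expectedEntities: string[],
  contextEntities: string[],
): number {
  const contextSet = new Set(contextEntities.map((e) => e.toLowerCase()));
  const uniqueExpected = Array.from(
    new Set(expectedEntities.map((e) => e.toLowerCase())),
  );
  if (uniqueExpected.length === 0) return 0;
  const truePositives = uniqueExpected.filter((e) => contextSet.has(e)).length;
  const falseNegatives = uniqueExpected.length - truePositives;
  return truePositives / (truePositives + falseNegatives);
}

// Two of the three expected entities appear in the context.
const recallExample = entityRecall(
  ["Paris", "France", "Seine"],
  ["paris", "France", "Louvre"],
);
```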

ContextPrecision

▸ ContextPrecision(args): Score | Promise<Score>

Parameters

Name	Type
`args`	`ScorerArgs`<`string`, `RagasArgs`>

Defined in

node_modules/.pnpm/@braintrust+core@0.0.8/node_modules/@braintrust/core/dist/index.d.ts:21

ContextRecall

▸ ContextRecall(args): Score | Promise<Score>

Parameters

Name	Type
`args`	`ScorerArgs`<`string`, `RagasArgs`>

Defined in

node_modules/.pnpm/@braintrust+core@0.0.8/node_modules/@braintrust/core/dist/index.d.ts:21

ContextRelevancy

▸ ContextRelevancy(args): Score | Promise<Score>

Parameters

Name	Type
`args`	`ScorerArgs`<`string`, `RagasArgs`>

Defined in

node_modules/.pnpm/@braintrust+core@0.0.8/node_modules/@braintrust/core/dist/index.d.ts:21

EmbeddingSimilarity

▸ EmbeddingSimilarity(args): Score | Promise<Score>

A scorer that uses cosine similarity to compare two strings.

Parameters

Name	Type
`args`	`ScorerArgs`<`string`, { `expectedMin?`: `number` ; `model?`: `string` ; `prefix?`: `string` } & `OpenAIAuth`>

Returns

Score | Promise<Score>

A score between 0 and 1, where 1 is a perfect match.

Defined in

node_modules/.pnpm/@braintrust+core@0.0.8/node_modules/@braintrust/core/dist/index.d.ts:21
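
For readers unfamiliar with cosine similarity, here is a minimal sketch of the computation on two numeric vectors. It assumes the two strings have already been turned into embedding vectors by some model; the vectors below are made up, and `cosineSimilarity` is a hypothetical helper rather than a library export.

```typescript
// Hypothetical helper (not part of autoevals): cosine of the angle between
// two vectors, i.e. their dot product divided by the product of their norms.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // A zero vector has no direction; treat it as dissimilar to everything.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Made-up "embeddings": parallel vectors score about 1, orthogonal ones 0.
const parallelScore = cosineSimilarity([1, 2, 3], [2, 4, 6]);
const orthogonalScore = cosineSimilarity([1, 0], [0, 1]);
```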

Factuality

▸ Factuality(args): Score | Promise<Score>

Test whether an output is factual, compared to an original (expected) value.

Parameters

Name	Type
`args`	`ScorerArgs`<`string`, `LLMClassifierArgs`<{ `expected?`: `string` ; `input`: `string` ; `output`: `string` }>>

Defined in

node_modules/.pnpm/@braintrust+core@0.0.8/node_modules/@braintrust/core/dist/index.d.ts:21

Faithfulness

▸ Faithfulness(args): Score | Promise<Score>

Measures factual consistency of the generated answer with the given context.

Parameters

Name	Type
`args`	`ScorerArgs`<`string`, `RagasArgs`>

Defined in

node_modules/.pnpm/@braintrust+core@0.0.8/node_modules/@braintrust/core/dist/index.d.ts:21

Humor

▸ Humor(args): Score | Promise<Score>

Test whether an output is funny.

Parameters

Name	Type
`args`	`ScorerArgs`<`string`, `LLMClassifierArgs`<{}>>

Defined in

node_modules/.pnpm/@braintrust+core@0.0.8/node_modules/@braintrust/core/dist/index.d.ts:21

JSONDiff

▸ JSONDiff(args): Score | Promise<Score>

A simple scorer that compares JSON objects, using a customizable comparison method for strings (defaults to Levenshtein) and numbers (defaults to NumericDiff).

Parameters

Name	Type
`args`	`ScorerArgs`<`any`, { `numberScorer?`: `Scorer`<`number`, {}> ; `stringScorer?`: `Scorer`<`string`, {}> }>

Defined in

node_modules/.pnpm/@braintrust+core@0.0.8/node_modules/@braintrust/core/dist/index.d.ts:21
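
One way to picture a JSON comparison with pluggable leaf scorers is the recursive sketch below. It is an assumption about the general approach, not the library's code: `jsonDiffSketch`, `LeafScorer`, and the exact-match stand-in scorers are all hypothetical, and averaging child scores is a design choice of this sketch.

```typescript
// Hypothetical leaf scorer type: returns a similarity in [0, 1].
type LeafScorer<T> = (output: T, expected: T) => number;

// Stand-in leaf scorers (exact match). The description above says the real
// defaults are Levenshtein for strings and NumericDiff for numbers.
const exactString: LeafScorer<string> = (o, e) => (o === e ? 1 : 0);
const exactNumber: LeafScorer<number> = (o, e) => (o === e ? 1 : 0);

function isPlainObject(v: unknown): v is Record<string, unknown> {
  return typeof v === "object" && v !== null && !Array.isArray(v);
}

// Two empty containers are identical, so an empty list of scores averages to 1.
function average(scores: number[]): number {
  return scores.length === 0
    ? 1
    : scores.reduce((sum, x) => sum + x, 0) / scores.length;
}

function jsonDiffSketch(
  output: unknown,
  expected: unknown,
  stringScorer: LeafScorer<string> = exactString,
  numberScorer: LeafScorer<number> = exactNumber,
): number {
  if (isPlainObject(output) && isPlainObject(expected)) {
    const out = output;
    const exp = expected;
    // Score every key present on either side; a missing key scores 0.
    const keys = Array.from(
      new Set([...Object.keys(out), ...Object.keys(exp)]),
    );
    return average(
      keys.map((k) => jsonDiffSketch(out[k], exp[k], stringScorer, numberScorer)),
    );
  }
  if (Array.isArray(output) && Array.isArray(expected)) {
    const len = Math.max(output.length, expected.length);
    const scores: number[] = [];
    for (let i = 0; i < len; i++) {
      scores.push(
        jsonDiffSketch(output[i], expected[i], stringScorer, numberScorer),
      );
    }
    return average(scores);
  }
  if (typeof output === "string" && typeof expected === "string") {
    return stringScorer(output, expected);
  }
  if (typeof output === "number" && typeof expected === "number") {
    return numberScorer(output, expected);
  }
  return output === expected ? 1 : 0;
}

// Two of three top-level fields match exactly.
const diffExample = jsonDiffSketch(
  { user: "ada", age: 30, tags: ["x", "y"] },
  { user: "ada", age: 31, tags: ["x", "y"] },
);
```

Swapping in softer leaf scorers would let near-misses such as 30 versus 31 earn partial credit instead of 0.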

LLMClassifierFromSpec

▸ LLMClassifierFromSpec<RenderArgs>(name, spec): Scorer<any, LLMClassifierArgs<RenderArgs>>

Type parameters

Name
`RenderArgs`

Parameters

Name	Type
`name`	`string`
`spec`	`Object`
`spec.choice_scores`	`Record`<`string`, `number`>
`spec.model?`	`string`
`spec.prompt`	`string`
`spec.temperature?`	`number`
`spec.use_cot?`	`boolean`

Returns

Scorer<any, LLMClassifierArgs<RenderArgs>>
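
To make the shape of `spec` concrete, the sketch below mirrors the fields from the parameter table in a local interface and shows how a choice label could map to a numeric score through `choice_scores`. The interface name, the prompt text, the choices, and the `scoreForChoice` helper are all hypothetical; how the library renders the prompt and picks a choice is not shown here.

```typescript
// Local mirror of the `spec` fields listed in the parameter table above.
interface SpecSketch {
  prompt: string;
  choice_scores: Record<string, number>;
  model?: string;
  temperature?: number;
  use_cot?: boolean;
}

// Hypothetical spec: the prompt text and choices are made up for illustration.
const toneSpec: SpecSketch = {
  prompt: "Is the following reply polite? Answer Y, N, or Unsure.",
  choice_scores: { Y: 1, N: 0, Unsure: 0.5 },
  use_cot: true,
};

// Hypothetical helper: looks up the numeric score for the choice a model picked.
function scoreForChoice(spec: SpecSketch, choice: string): number {
  const value = Object.prototype.hasOwnProperty.call(spec.choice_scores, choice)
    ? spec.choice_scores[choice]
    : undefined;
  if (value === undefined) {
    throw new Error(`Unknown choice: ${choice}`);
  }
  return value;
}

const unsureScore = scoreForChoice(toneSpec, "Unsure");
```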

LLMClassifierFromSpecFile

▸ LLMClassifierFromSpecFile<RenderArgs>(name, templateName): Scorer<any, LLMClassifierArgs<RenderArgs>>

Type parameters

Name
`RenderArgs`

Parameters

Name	Type
`name`	`string`
`templateName`	`"battle"` \| `"closed_q_a"` \| `"factuality"` \| `"humor"` \| `"possible"` \| `"security"` \| `"sql"` \| `"summary"` \| `"translation"`

Returns

Scorer<any, LLMClassifierArgs<RenderArgs>>

LLMClassifierFromTemplate

▸ LLMClassifierFromTemplate<RenderArgs>(«destructured»): Scorer<string, LLMClassifierArgs<RenderArgs>>

Type parameters

Name
`RenderArgs`

Parameters

Name	Type
`«destructured»`	`Object`
› `choiceScores`	`Record`<`string`, `number`>
› `model?`	`string`
› `name`	`string`
› `promptTemplate`	`string`
› `temperature?`	`number`
› `useCoT?`	`boolean`

Returns

Scorer<string, LLMClassifierArgs<RenderArgs>>

Levenshtein

▸ Levenshtein(args): Score | Promise<Score>

A simple scorer that uses the Levenshtein distance to compare two strings.

Parameters

Name	Type
`args`	`Object`

Defined in

node_modules/.pnpm/@braintrust+core@0.0.8/node_modules/@braintrust/core/dist/index.d.ts:21
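
The Levenshtein distance counts the single-character insertions, deletions, and substitutions needed to turn one string into another. The sketch below computes it with the standard dynamic-programming recurrence and then normalizes it into a 0 to 1 score. The normalization (dividing by the longer string's length) is an assumption of this sketch, and both helper names are hypothetical.

```typescript
// Classic dynamic-programming edit distance, keeping only two rows in memory.
function levenshteinDistance(a: string, b: string): number {
  // prev[j] = distance between the first i-1 chars of `a` and first j chars of `b`.
  let prev: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1, // deletion
        curr[j - 1] + 1, // insertion
        prev[j - 1] + cost, // substitution (or match)
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

// Assumed normalization: 1 means identical, 0 means nothing in common.
function levenshteinScore(output: string, expected: string): number {
  const maxLen = Math.max(output.length, expected.length);
  if (maxLen === 0) return 1;
  return 1 - levenshteinDistance(output, expected) / maxLen;
}

const kittenDistance = levenshteinDistance("kitten", "sitting");
const kittenScore = levenshteinScore("kitten", "sitting");
```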

LevenshteinScorer

▸ LevenshteinScorer(args): Score | Promise<Score>

Parameters

Name	Type
`args`	`Object`

Defined in

node_modules/.pnpm/@braintrust+core@0.0.8/node_modules/@braintrust/core/dist/index.d.ts:21

ListContains

▸ ListContains(args): Score | Promise<Score>

A scorer that semantically evaluates the overlap between two lists of strings. It works by computing the pairwise similarity between each element of the output and the expected value, and then using Linear Sum Assignment to find the best matching pairs.

Parameters

Name	Type
`args`	`ScorerArgs`<`string`[], { `allowExtraEntities?`: `boolean` ; `pairwiseScorer?`: `Scorer`<`string`, {}> }>

Returns

Score | Promise<Score>

Defined in

node_modules/.pnpm/@braintrust+core@0.0.8/node_modules/@braintrust/core/dist/index.d.ts:21
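
The matching step described above can be sketched as follows. This is a toy version under stated assumptions: an exhaustive search stands in for a proper Linear Sum Assignment solver, a case-insensitive exact match stands in for the semantic pairwise scorer, and dividing by the longer list is this sketch's own way of penalizing missing and extra items. All names are hypothetical.

```typescript
type PairScorer = (a: string, b: string) => number;

// Stand-in pairwise scorer: case-insensitive exact match.
const exactMatch: PairScorer = (a, b) =>
  a.toLowerCase() === b.toLowerCase() ? 1 : 0;

// Best total similarity over all one-to-one pairings, by exhaustive search.
// Exponential time: fine for a handful of items, but a real solver
// (for example the Hungarian algorithm) is needed for longer lists.
function bestAssignmentTotal(
  sim: number[][],
  row: number = 0,
  usedCols: Set<number> = new Set<number>(),
): number {
  if (row === sim.length) return 0;
  // Option 1: leave this row unmatched.
  let best = bestAssignmentTotal(sim, row + 1, usedCols);
  // Option 2: match it with any column that is still free.
  for (let col = 0; col < sim[row].length; col++) {
    if (usedCols.has(col)) continue;
    usedCols.add(col);
    best = Math.max(
      best,
      sim[row][col] + bestAssignmentTotal(sim, row + 1, usedCols),
    );
    usedCols.delete(col);
  }
  return best;
}

function listOverlapScore(
  output: string[],
  expected: string[],
  pairwise: PairScorer = exactMatch,
): number {
  if (output.length === 0 && expected.length === 0) return 1;
  // sim[i][j] = similarity between output[i] and expected[j].
  const sim = output.map((o) => expected.map((e) => pairwise(o, e)));
  return bestAssignmentTotal(sim) / Math.max(output.length, expected.length);
}

// Two items pair up; "cherry" is an extra item with no partner.
const overlapExample = listOverlapScore(
  ["apple", "banana", "cherry"],
  ["Banana", "apple"],
);
```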

Moderation

▸ Moderation(args): Score | Promise<Score>

A scorer that uses OpenAI's moderation API to determine whether an AI response contains any flagged content.

Parameters

Name	Type
`args`	`ScorerArgs`<`string`, { `threshold?`: `number` } & `OpenAIAuth`>

Returns

Score | Promise<Score>

A score between 0 and 1, where 1 means content passed all moderation checks.

Defined in

node_modules/.pnpm/@braintrust+core@0.0.8/node_modules/@braintrust/core/dist/index.d.ts:21
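
A plausible reading of the optional `threshold` is sketched below, with no network call involved. Everything here is an assumption made for illustration: the `ModerationResultSketch` shape, the category names and scores, and the rule that a score above the threshold counts as flagged.

```typescript
// Hypothetical shape of one moderation result: a `flagged` boolean plus
// per-category scores in [0, 1]. The values used below are made up.
interface ModerationResultSketch {
  flagged: boolean;
  categoryScores: Record<string, number>;
}

// Returns 1 when the content passes every check and 0 otherwise.
function moderationScore(
  result: ModerationResultSketch,
  threshold?: number,
): number {
  if (threshold === undefined) {
    // No custom threshold: rely on the overall `flagged` verdict.
    return result.flagged ? 0 : 1;
  }
  const limit = threshold;
  const exceeds = Object.keys(result.categoryScores).some(
    (category) => result.categoryScores[category] > limit,
  );
  return exceeds ? 0 : 1;
}

const sampleModeration: ModerationResultSketch = {
  flagged: false,
  categoryScores: { harassment: 0.2, violence: 0.05 },
};

const defaultVerdict = moderationScore(sampleModeration);
const strictVerdict = moderationScore(sampleModeration, 0.1);
```

Under these assumptions a lower threshold is stricter: the same response passes by default but fails once any category score exceeds 0.1.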

NumericDiff

▸ NumericDiff(args): Score | Promise<Score>

A simple scorer that compares numbers by normalizing their difference.

Parameters

Name	Type
`args`	`Object`

Returns

Score | Promise<Score>

Defined in

node_modules/.pnpm/@braintrust+core@0.0.8/node_modules/@braintrust/core/dist/index.d.ts:21
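
One common way to normalize a numeric difference into a 0 to 1 score is to divide the absolute difference by the sum of the absolute values. The sketch below uses that formula as an assumption; the source does not spell out the exact normalization, and `numericDiffScore` is a hypothetical helper.

```typescript
// Assumed normalization: 1 - |expected - output| / (|expected| + |output|).
// Identical numbers score 1; numbers of opposite sign score 0.
function numericDiffScore(output: number, expected: number): number {
  // Avoid dividing by zero when both values are 0 (they are equal, so score 1).
  if (output === 0 && expected === 0) return 1;
  return (
    1 - Math.abs(expected - output) / (Math.abs(expected) + Math.abs(output))
  );
}

const closeScore = numericDiffScore(90, 100); // 1 - 10 / 190
const oppositeScore = numericDiffScore(-1, 1); // 1 - 2 / 2
```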

OpenAIClassifier

▸ OpenAIClassifier<RenderArgs, Output>(args): Promise<Score>

Type parameters

Name
`RenderArgs`
`Output`

Parameters

Name	Type
`args`	`ScorerArgs`<`Output`, `OpenAIClassifierArgs`<`RenderArgs`>>

Returns

Promise<Score>

Defined in

autoevals/js/llm.ts:84

Possible

▸ Possible(args): Score | Promise<Score>

Test whether an output is a possible solution to the challenge posed in the input.

Parameters

Name	Type
`args`	`ScorerArgs`<`string`, `LLMClassifierArgs`<{ `input`: `string` }>>

Returns

Score | Promise<Score>

Defined in

node_modules/.pnpm/@braintrust+core@0.0.8/node_modules/@braintrust/core/dist/index.d.ts:21

Security

▸ Security(args): Score | Promise<Score>

Test whether an output is malicious.

Parameters

Name	Type
`args`	`ScorerArgs`<`string`, `LLMClassifierArgs`<{}>>

Returns

Score | Promise<Score>

Defined in

node_modules/.pnpm/@braintrust+core@0.0.8/node_modules/@braintrust/core/dist/index.d.ts:21

Sql

▸ Sql(args): Score | Promise<Score>

Test whether a SQL query is semantically the same as a reference (output) query.

Parameters

Name	Type
`args`	`ScorerArgs`<`string`, `LLMClassifierArgs`<{ `input`: `string` }>>

Returns

Score | Promise<Score>

Defined in

node_modules/.pnpm/@braintrust+core@0.0.8/node_modules/@braintrust/core/dist/index.d.ts:21

Summary

▸ Summary(args): Score | Promise<Score>

Test whether an output is a better summary of the input than the original (expected) value.

Parameters

Name	Type
`args`	`ScorerArgs`<`string`, `LLMClassifierArgs`<{ `input`: `string` }>>

Returns

Score | Promise<Score>

Defined in

node_modules/.pnpm/@braintrust+core@0.0.8/node_modules/@braintrust/core/dist/index.d.ts:21

Translation

▸ Translation(args): Score | Promise<Score>

Test whether an output is as good a translation of the input in the specified language as an expert (expected) value.

Parameters

Name	Type
`args`	`ScorerArgs`<`string`, `LLMClassifierArgs`<{ `input`: `string` ; `language`: `string` }>>

Returns

Score | Promise<Score>

Defined in

node_modules/.pnpm/@braintrust+core@0.0.8/node_modules/@braintrust/core/dist/index.d.ts:21

ValidJSON

▸ ValidJSON(args): Score | Promise<Score>

A binary scorer that evaluates the validity of JSON output, optionally validating against a JSON Schema definition (see https://json-schema.org/learn/getting-started-step-by-step#create).

Parameters

Name	Type
`args`	`ScorerArgs`<`string`, { `schema?`: `any` }>

Returns

Score | Promise<Score>

Defined in

node_modules/.pnpm/@braintrust+core@0.0.8/node_modules/@braintrust/core/dist/index.d.ts:21
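
The binary nature of this scorer can be illustrated with a minimal parse check. The sketch leaves out the optional JSON Schema validation entirely, and it treats only objects and arrays as valid structured output, which is an assumption; `validJsonScore` is a hypothetical helper, not the library function.

```typescript
// Returns 1 if `output` parses as a JSON object or array, 0 otherwise.
// Schema validation is intentionally not covered by this sketch.
function validJsonScore(output: string): number {
  try {
    const parsed: unknown = JSON.parse(output);
    // Assumption: bare scalars such as 42 or "text" do not count as valid.
    return typeof parsed === "object" && parsed !== null ? 1 : 0;
  } catch {
    // JSON.parse throws a SyntaxError on malformed input.
    return 0;
  }
}

const validObjectScore = validJsonScore('{"a": 1}');
const malformedScore = validJsonScore("{a: 1}");
```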

buildClassificationTools

▸ buildClassificationTools(useCoT, choiceStrings): ChatCompletionTool[]

Parameters

Name	Type
`useCoT`	`boolean`
`choiceStrings`	`string`[]

Returns

ChatCompletionTool[]

Defined in

autoevals/js/llm.ts:50

makePartial

▸ makePartial<Output, Extra>(fn, name?): ScorerWithPartial<Output, Extra>

Type parameters

Name
`Output`
`Extra`

Parameters

Name	Type
`fn`	`Scorer`<`Output`, `Extra`>
`name?`	`string`

Returns

ScorerWithPartial<Output, Extra>

Defined in

autoevals/js/partial.ts:11

Type Aliases

LLMArgs

Ƭ LLMArgs: { maxTokens?: number ; temperature?: number } & OpenAIAuth

Defined in

autoevals/js/llm.ts:19

LLMClassifierArgs

Ƭ LLMClassifierArgs<RenderArgs>: { model?: string ; useCoT?: boolean } & LLMArgs & RenderArgs

Type parameters

Name
`RenderArgs`

Defined in

autoevals/js/llm.ts:189

ModelGradedSpec

Ƭ ModelGradedSpec: z.infer<typeof modelGradedSpecSchema>

Defined in

autoevals/js/templates.ts:22

OpenAIClassifierArgs

Ƭ OpenAIClassifierArgs<RenderArgs>: { cache?: ChatCache ; choiceScores: Record<string, number> ; classificationTools: ChatCompletionTool[] ; messages: ChatCompletionMessageParam[] ; model: string ; name: string } & LLMArgs & RenderArgs

Type parameters

Name
`RenderArgs`

Defined in

autoevals/js/llm.ts:74

Variables

DEFAULT_MODEL

• Const DEFAULT_MODEL: "gpt-4o"

Defined in

autoevals/js/llm.ts:24

Evaluators

• Const Evaluators: { label: string ; methods: AutoevalMethod[] }[]

Defined in

autoevals/js/manifest.ts:37

modelGradedSpecSchema

• Const modelGradedSpecSchema: ZodObject<{ choice_scores: ZodRecord<ZodString, ZodNumber> ; model: ZodOptional<ZodString> ; prompt: ZodString ; temperature: ZodOptional<ZodNumber> ; use_cot: ZodOptional<ZodBoolean> }, "strip", ZodTypeAny, { choice_scores: Record<string, number> ; model?: string ; prompt: string ; temperature?: number ; use_cot?: boolean }, { choice_scores: Record<string, number> ; model?: string ; prompt: string ; temperature?: number ; use_cot?: boolean }>

Defined in

autoevals/js/templates.ts:14

templates

• Const templates: Record<"battle" | "closed_q_a" | "factuality" | "humor" | "possible" | "security" | "sql" | "summary" | "translation", { choice_scores: Record<string, number> ; model?: string ; prompt: string ; temperature?: number ; use_cot?: boolean }>

Defined in

autoevals/js/templates.ts:36