Pairwise String Comparison

Often you will want to compare predictions of an LLM, Chain, or Agent for a given input. The StringComparison evaluators facilitate this so you can answer questions like:

  • Which LLM or prompt produces a preferred output for a given question?
  • Which examples should I include for few-shot example selection?
  • Which output is better to include for fine-tuning?

The simplest and often most reliable automated way to choose a preferred prediction for a given input is to use the labeled_pairwise_string evaluator.

With References

The ChatAnthropic example later on this page requires the @langchain/anthropic package:

npm install @langchain/anthropic

import { loadEvaluator } from "langchain/evaluation";

const chain = await loadEvaluator("labeled_pairwise_string", {
  criteria: "correctness",
});

const res = await chain.evaluateStringPairs({
  prediction: "there are three dogs",
  predictionB: "4",
  input: "how many dogs are in the park?",
  reference: "four",
});

console.log(res);

/*
  {
    reasoning: 'Both responses attempt to answer the question about the number of dogs in the park. However, Response A states that there are three dogs, which is incorrect according to the reference answer. Response B, on the other hand, correctly states that there are four dogs, which matches the reference answer. Therefore, Response B is more accurate.Final Decision: [[B]]',
    value: 'B',
    score: 0
  }
*/

Methods

The pairwise string evaluator can be called using the evaluateStringPairs method, which accepts:

  • prediction (string) – The predicted response of the first model, chain, or prompt.
  • predictionB (string) – The predicted response of the second model, chain, or prompt.
  • input (string) – The input question, prompt, or other text.
  • reference (string) – (Only for the labeled_pairwise_string variant) The reference response.

The method returns an object with the following values:

  • value: 'A' or 'B', indicating whether prediction or predictionB is preferred, respectively.
  • score: Integer 0 or 1 mapped from the 'value', where a score of 1 means the first prediction is preferred and a score of 0 means predictionB is preferred.
  • reasoning: String chain-of-thought reasoning generated by the LLM before it produces the score.
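
For reference, the result shape can be sketched as a TypeScript type. This is illustrative only (it is not a type exported by langchain); the field names and value mapping come from the description above.

// Illustrative sketch of the documented result shape (not an exported langchain type).
interface PairwiseStringEvalResult {
  value: "A" | "B"; // which prediction is preferred
  score: 0 | 1; // 1 => prediction (A) preferred, 0 => predictionB (B) preferred
  reasoning: string; // the LLM's reasoning produced before the verdict
}

// Example: map a result back to the preferred prediction text.
function preferredText(
  res: PairwiseStringEvalResult,
  prediction: string,
  predictionB: string
): string {
  return res.value === "A" ? prediction : predictionB;
}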

Without References

When references aren't available, you can still predict the preferred response. The results will reflect the evaluation model's preference, which is less reliable and may result in preferences that are factually incorrect.

import { loadEvaluator } from "langchain/evaluation";

const chain = await loadEvaluator("pairwise_string", {
  criteria: "conciseness",
});

const res = await chain.evaluateStringPairs({
  prediction: "Addition is a mathematical operation.",
  predictionB:
    "Addition is a mathematical operation that adds two numbers to create a third number, the 'sum'.",
  input: "What is addition?",
});

console.log({ res });

/*
  {
    res: {
      reasoning: 'Response A is concise, but it lacks detail. Response B, while slightly longer, provides a more complete and informative answer by explaining what addition does. It is still concise and to the point.Final decision: [[B]]',
      value: 'B',
      score: 0
    }
  }
*/

Defining the Criteria

By default, the LLM is instructed to select the 'preferred' response based on helpfulness, relevance, correctness, and depth of thought. You can customize the criteria by passing in a criteria argument, which can take any of the following forms:

  • Criteria - use one of the default criteria and its description
  • Constitutional principle - use any of the constitutional principles defined in LangChain (a sketch follows this list)
  • Dictionary - a set of custom criteria, where each key is the name of a criterion and the value is its description
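
As a sketch of the constitutional-principle form, you can pass one of the predefined principles as the criteria. This assumes the PRINCIPLES map exported from "langchain/chains"; check the criteria evaluator documentation for the exact import in your version.

import { loadEvaluator } from "langchain/evaluation";
import { PRINCIPLES } from "langchain/chains";

// Prefer the response that better satisfies a predefined constitutional principle.
const principleChain = await loadEvaluator("pairwise_string", {
  criteria: PRINCIPLES.harmful1,
});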

Below is an example for determining preferred writing responses based on a custom style.

import { loadEvaluator } from "langchain/evaluation";

const customCriterion = {
  simplicity: "Is the language straightforward and unpretentious?",
  clarity: "Are the sentences clear and easy to understand?",
  precision: "Is the writing precise, with no unnecessary words or details?",
  truthfulness: "Does the writing feel honest and sincere?",
  subtext: "Does the writing suggest deeper meanings or themes?",
};

const chain = await loadEvaluator("pairwise_string", {
  criteria: customCriterion,
});

const res = await chain.evaluateStringPairs({
  prediction:
    "Every cheerful household shares a similar rhythm of joy; but sorrow, in each household, plays a unique, haunting melody.",
  predictionB:
    "Where one finds a symphony of joy, every domicile of happiness resounds in harmonious, identical notes; yet, every abode of despair conducts a dissonant orchestra, each playing an elegy of grief that is peculiar and profound to its own existence.",
  input: "Write some prose about families.",
});

console.log(res);

/*
  {
    reasoning: "Response A is simple, clear, and precise. It uses straightforward language to convey a deep and universal truth about families. The metaphor of joy and sorrow as music is effective and easy to understand. Response B, on the other hand, is more complex and less clear. It uses more sophisticated language and a more elaborate metaphor, which may make it harder for some readers to understand. It also includes unnecessary words and details that don't add to the overall meaning of the prose.Both responses are truthful and sincere, and both suggest deeper meanings about the nature of family life. However, Response A does a better job of conveying these meanings in a simple, clear, and precise way.Therefore, the better response is [[A]].",
    value: 'A',
    score: 1
  }
*/

Customize the LLM

By default, the loader uses gpt-4 in the evaluation chain. You can customize this when loading.

import { loadEvaluator } from "langchain/evaluation";
import { ChatAnthropic } from "@langchain/anthropic";

const model = new ChatAnthropic({ temperature: 0 });

const chain = await loadEvaluator("labeled_pairwise_string", { llm: model });

const res = await chain.evaluateStringPairs({
  prediction: "there are three dogs",
  predictionB: "4",
  input: "how many dogs are in the park?",
  reference: "four",
});

console.log(res);

/*
  {
    reasoning: 'Here is my assessment:Response B is more correct and accurate compared to Response A. Response B simply states "4", which matches the ground truth reference answer of "four". Meanwhile, Response A states "there are three dogs", which is incorrect according to the reference. In terms of following instructions and directly answering the question "how many dogs are in the park?", Response B gives the precise numerical answer, while Response A provides an incomplete sentence. Overall, Response B is more accurate and better followed the instructions to directly answer the question.[[B]]',
    value: 'B',
    score: 0
  }
*/

Customize the Evaluation Prompt

You can use your own custom evaluation prompt to add more task-specific instructions or to instruct the evaluator to score the output.

Note: If you use a prompt that generates a result in a unique format, you may also have to pass in a custom output parser (outputParser=yourParser()) instead of the default PairwiseStringResultOutputParser.

import { loadEvaluator } from "langchain/evaluation";
import { PromptTemplate } from "@langchain/core/prompts";

const promptTemplate = PromptTemplate.fromTemplate(
  `Given the input context, which do you prefer: A or B?
Evaluate based on the following criteria:
{criteria}
Reason step by step and finally, respond with either [[A]] or [[B]] on its own line.

DATA
----
input: {input}
reference: {reference}
A: {prediction}
B: {predictionB}
---
Reasoning:
`
);

const chain = await loadEvaluator("labeled_pairwise_string", {
  chainOptions: {
    prompt: promptTemplate,
  },
});

const res = await chain.evaluateStringPairs({
  prediction: "The dog that ate the ice cream was named fido.",
  predictionB: "The dog's name is spot",
  input: "What is the name of the dog that ate the ice cream?",
  reference: "The dog's name is fido",
});

console.log(res);

/*
  {
    reasoning: 'Helpfulness: Both A and B are helpful as they provide a direct answer to the question.Relevance: Both A and B refer to the question, but only A matches the reference text.Correctness: Only A is correct as it matches the reference text.Depth: Both A and B are straightforward and do not demonstrate depth of thought.Based on these criteria, the preferred response is A. ',
    value: 'A',
    score: 1
  }
*/
