evaluation

Evaluation chains for grading LLM and Chain outputs.

This module contains off-the-shelf evaluation chains for grading the output of LangChain primitives such as language models and chains.

Loading an evaluator

To load an evaluator, you can use the load_evaluators or load_evaluator functions with the names of the evaluators to load.

from langchain.evaluation import load_evaluator

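# "qa" loads a QAEvalChain; a specific grading model can be supplied via the llm keyword.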
evaluator = load_evaluator("qa")
evaluator.evaluate_strings(
    prediction="We sold more than 40,000 units last week",
    input="How many units did we sell last week?",
    reference="We sold 32,378 units",
)

The evaluator must correspond to one of the EvaluatorType values.
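For example, a minimal sketch loading several evaluators in one call with load_evaluators; EvaluatorType members can be passed in place of raw strings:

from langchain.evaluation import EvaluatorType, load_evaluators

# Each returned evaluator is ready to use via evaluate_strings(...).
evaluators = load_evaluators([EvaluatorType.QA, EvaluatorType.CRITERIA])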

Datasets

To load one of the LangChain HuggingFace datasets, you can use the load_dataset function with the name of the dataset to load.

from langchain.evaluation import load_dataset

ds = load_dataset("llm-math")
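load_dataset requires the HuggingFace datasets package and returns the examples as a list of dicts. Continuing from the snippet above, a quick way to inspect them (field names vary by dataset):

print(len(ds))  # number of examples
print(ds[0])    # each example is a dict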

Some common use cases for evaluation include:

  • Grading the accuracy of a response against ground-truth answers: QAEvalChain

  • Comparing the output of two models: PairwiseStringEvalChain, or LabeledPairwiseStringEvalChain when there is additionally a reference label

  • Judging the efficacy of an agent's tool usage: TrajectoryEvalChain

  • Checking whether an output complies with a set of criteria: CriteriaEvalChain, or LabeledCriteriaEvalChain when there is additionally a reference label

  • Computing semantic difference between a prediction and a reference: EmbeddingDistanceEvalChain, or between two predictions: PairwiseEmbeddingDistanceEvalChain

  • Measuring string distance between a prediction and a reference: StringDistanceEvalChain, or between two predictions: PairwiseStringDistanceEvalChain

Low-level API

These evaluators implement one of the following interfaces:

  • StringEvaluator: Evaluate a prediction string against a reference label and/or input context.

  • PairwiseStringEvaluator: Evaluate two prediction strings against each other. Useful for scoring preferences, measuring similarity between the outputs of two chains or LLMs, or comparing outputs on similar inputs.

  • AgentTrajectoryEvaluator: Evaluate the full sequence of actions taken by an agent.

These interfaces enable easier composability and usage within a higher-level evaluation framework; a minimal custom StringEvaluator is sketched below.
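As an illustration, here is a minimal sketch of a custom string evaluator. It assumes only the interface contract that subclasses implement _evaluate_strings and return a dict with a "score" key; the class name and scoring rule are hypothetical.

from typing import Any, Optional

from langchain.evaluation import StringEvaluator


class LengthDiffEvaluator(StringEvaluator):
    """Hypothetical evaluator: score 1 if prediction and reference are close in length."""

    @property
    def requires_reference(self) -> bool:
        # This evaluator cannot score without a reference string.
        return True

    def _evaluate_strings(
        self,
        *,
        prediction: str,
        reference: Optional[str] = None,
        input: Optional[str] = None,
        **kwargs: Any,
    ) -> dict:
        diff = abs(len(prediction) - len(reference or ""))
        return {"score": int(diff <= 10)}


evaluator = LengthDiffEvaluator()
evaluator.evaluate_strings(prediction="hello world", reference="hello there")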

Classes

evaluation.agents.trajectory_eval_chain.TrajectoryEval

A named tuple containing the score and reasoning for a trajectory.

evaluation.agents.trajectory_eval_chain.TrajectoryEvalChain

A chain for evaluating ReAct-style agents.

evaluation.agents.trajectory_eval_chain.TrajectoryOutputParser

Trajectory output parser.

evaluation.comparison.eval_chain.LabeledPairwiseStringEvalChain

Pairwise string evaluation chain that uses a reference label.

evaluation.comparison.eval_chain.PairwiseStringEvalChain

A chain for pairwise evaluation of two output strings.

evaluation.comparison.eval_chain.PairwiseStringResultOutputParser

A parser for the output of the PairwiseStringEvalChain.

evaluation.criteria.eval_chain.Criteria(value)

A Criteria to evaluate.

evaluation.criteria.eval_chain.CriteriaEvalChain

LLM Chain for evaluating runs against criteria.

evaluation.criteria.eval_chain.CriteriaResultOutputParser

A parser for the output of the CriteriaEvalChain.

evaluation.criteria.eval_chain.LabeledCriteriaEvalChain

Criteria evaluation chain that requires references.

evaluation.embedding_distance.base.EmbeddingDistance(value)

Embedding Distance Metric.

evaluation.embedding_distance.base.EmbeddingDistanceEvalChain

Embedding distance evaluation chain.

evaluation.embedding_distance.base.PairwiseEmbeddingDistanceEvalChain

Use embedding distances to score semantic difference between two predictions.

evaluation.exact_match.base.ExactMatchStringEvaluator(*)

Compute an exact match between the prediction and the reference (see the usage sketch after this list).

evaluation.parsing.base.JsonEqualityEvaluator([...])

Evaluate whether the prediction equals the reference when both are parsed as JSON.

evaluation.parsing.base.JsonValidityEvaluator(**_)

Evaluate whether the prediction is valid JSON.

evaluation.parsing.json_distance.JsonEditDistanceEvaluator([...])

An evaluator that calculates the edit distance between JSON strings.

evaluation.parsing.json_schema.JsonSchemaEvaluator(**_)

An evaluator that validates a JSON prediction against a JSON schema reference.

evaluation.qa.eval_chain.ContextQAEvalChain

LLM Chain for evaluating QA answers based on context, without ground-truth labels.

evaluation.qa.eval_chain.CotQAEvalChain

LLM Chain for evaluating QA using chain of thought reasoning.

evaluation.qa.eval_chain.QAEvalChain

LLM Chain for evaluating question answering.

evaluation.qa.generate_chain.QAGenerateChain

LLM Chain for generating examples for question answering.

evaluation.regex_match.base.RegexMatchStringEvaluator(*)

Compute a regex match between the prediction and the reference.

evaluation.schema.AgentTrajectoryEvaluator()

Interface for evaluating agent trajectories.

evaluation.schema.EvaluatorType(value)

The types of the evaluators.

evaluation.schema.LLMEvalChain

A base class for evaluators that use an LLM.

evaluation.schema.PairwiseStringEvaluator()

Compare the output of two models (or two outputs of the same model).

evaluation.schema.StringEvaluator()

String evaluator interface.

evaluation.scoring.eval_chain.LabeledScoreStringEvalChain

A chain for scoring the output of a model on a scale of 1-10, with a reference label.

evaluation.scoring.eval_chain.ScoreStringEvalChain

A chain for scoring the output of a model on a scale of 1-10.

evaluation.scoring.eval_chain.ScoreStringResultOutputParser

A parser for the output of the ScoreStringEvalChain.

evaluation.string_distance.base.PairwiseStringDistanceEvalChain

Compute string edit distances between two predictions.

evaluation.string_distance.base.StringDistance(value)

Distance metric to use.

evaluation.string_distance.base.StringDistanceEvalChain

Compute string distances between the prediction and the reference.
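Several of the evaluators above run without an LLM and can be instantiated directly. A minimal sketch using the exact-match and JSON-validity evaluators; the ignore_case keyword is an assumption about the constructor signature.

from langchain.evaluation.exact_match.base import ExactMatchStringEvaluator
from langchain.evaluation.parsing.base import JsonValidityEvaluator

exact = ExactMatchStringEvaluator(ignore_case=True)  # ignore_case: assumed kwarg
exact.evaluate_strings(prediction="LangChain", reference="langchain")  # {'score': 1}

valid = JsonValidityEvaluator()
valid.evaluate_strings(prediction='{"a": 1}')  # {'score': 1} for parseable JSON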

Functions

evaluation.comparison.eval_chain.resolve_pairwise_criteria(...)

Resolve the criteria for the pairwise evaluator.

evaluation.criteria.eval_chain.resolve_criteria(...)

Resolve the criteria to evaluate.

evaluation.loading.load_dataset(uri)

Load a dataset from the LangChainDatasets on HuggingFace.

evaluation.loading.load_evaluator(evaluator, *)

Load the requested evaluation chain specified by a string.

evaluation.loading.load_evaluators(evaluators, *)

Load evaluators specified by a list of evaluator types.

evaluation.scoring.eval_chain.resolve_criteria(...)

Resolve the criteria for the score string evaluator.
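Putting the loading functions together, a hedged end-to-end sketch that grades a model's answers over a loaded dataset. The my_model stub and the "question"/"answer" field names are illustrative assumptions, not guaranteed by the dataset schema.

from langchain.evaluation import load_dataset, load_evaluator


def my_model(question: str) -> str:
    """Stand-in for the model under evaluation."""
    return "42"


ds = load_dataset("llm-math")
evaluator = load_evaluator("qa")

for example in ds[:3]:
    # Field names are assumed; inspect ds[0] to confirm the schema.
    result = evaluator.evaluate_strings(
        prediction=my_model(example["question"]),
        input=example["question"],
        reference=example["answer"],
    )
    print(result)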