evaluation
Evaluation chains for grading LLM and Chain outputs.
This module contains off-the-shelf evaluation chains for grading the output of LangChain primitives such as language models and chains.
Loading an evaluator

To load an evaluator, you can use the load_evaluators or load_evaluator functions with the names of the evaluators to load.
from langchain.evaluation import load_evaluator

evaluator = load_evaluator("qa")
evaluator.evaluate_strings(
    prediction="We sold more than 40,000 units last week",
    input="How many units did we sell last week?",
    reference="We sold 32,378 units",
)
The evaluator must be one of EvaluatorType.
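You can also pass the EvaluatorType enum member itself instead of the string name. A minimal sketch, equivalent to loading "qa" above (like the first example, it assumes an LLM such as ChatOpenAI is available as the default grader):

from langchain.evaluation import EvaluatorType, load_evaluator

# EvaluatorType.QA is equivalent to the string "qa"; using the enum
# surfaces typos at import time rather than at load time.
evaluator = load_evaluator(EvaluatorType.QA)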
Datasets

To load one of the LangChain HuggingFace datasets, you can use the load_dataset function with the name of the dataset to load.
from langchain.evaluation import load_dataset
ds = load_dataset("llm-math")
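load_dataset wraps the Hugging Face datasets client (the datasets package must be installed) and returns the examples as a list of dictionaries. A minimal sketch of inspecting the result; the exact keys vary by dataset:

# Requires: pip install datasets
print(len(ds))  # number of examples in the dataset
print(ds[0])    # each example is a plain dict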
Some common use cases for evaluation include:

- Grading the accuracy of a response against ground truth answers: QAEvalChain
- Comparing the output of two models: PairwiseStringEvalChain, or LabeledPairwiseStringEvalChain when there is additionally a reference label
- Judging the efficacy of an agent's tool usage: TrajectoryEvalChain
- Checking whether an output complies with a set of criteria: CriteriaEvalChain, or LabeledCriteriaEvalChain when there is additionally a reference label (a sketch follows this list)
- Computing the semantic difference between a prediction and a reference: EmbeddingDistanceEvalChain, or between two predictions: PairwiseEmbeddingDistanceEvalChain
- Measuring the string distance between a prediction and a reference: StringDistanceEvalChain, or between two predictions: PairwiseStringDistanceEvalChain
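As an illustration of the criteria use case above, here is a minimal sketch using CriteriaEvalChain via load_evaluator. It assumes an OpenAI key is configured, since load_evaluator falls back to a default OpenAI chat model when no llm is passed; the example strings are invented:

from langchain.evaluation import load_evaluator

# "conciseness" is one of the built-in criteria; no reference label is needed.
evaluator = load_evaluator("criteria", criteria="conciseness")
result = evaluator.evaluate_strings(
    prediction="We sold 32,378 units last week.",
    input="How many units did we sell last week?",
)
# result is a dict with "score" (0 or 1), "value" ("Y"/"N"), and "reasoning".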
Low-level API

These evaluators implement one of the following interfaces:

- StringEvaluator: Evaluate a prediction string against a reference label and/or input context.
- PairwiseStringEvaluator: Evaluate two prediction strings against each other. Useful for scoring preferences, measuring similarity between two chains or LLMs, or comparing outputs on similar inputs.
- AgentTrajectoryEvaluator: Evaluate the full sequence of actions taken by an agent.

These interfaces enable easier composability and usage within a higher-level evaluation framework.
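For example, a custom evaluator only needs to subclass one of these interfaces and implement its private evaluation method. A minimal sketch of a StringEvaluator; the class name and scoring logic are invented for illustration:

from typing import Any, Optional

from langchain.evaluation import StringEvaluator


class KeywordMatchEvaluator(StringEvaluator):
    """Hypothetical evaluator: score 1 if the reference appears in the prediction."""

    @property
    def requires_reference(self) -> bool:
        # evaluate_strings will reject calls that omit the reference.
        return True

    def _evaluate_strings(
        self,
        *,
        prediction: str,
        reference: Optional[str] = None,
        input: Optional[str] = None,
        **kwargs: Any,
    ) -> dict:
        return {"score": int(reference.lower() in prediction.lower())}


evaluator = KeywordMatchEvaluator()
evaluator.evaluate_strings(prediction="We sold 32,378 units", reference="32,378")
# -> {"score": 1}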
Classes

TrajectoryEval | A named tuple containing the score and reasoning for a trajectory.
TrajectoryEvalChain | A chain for evaluating ReAct style agents.
TrajectoryOutputParser | Trajectory output parser.
LabeledPairwiseStringEvalChain | A chain for comparing two outputs, such as the outputs of two models, prompts, or outputs of a single model on similar inputs, with labeled preferences.
PairwiseStringEvalChain | A chain for comparing two outputs, such as the outputs of two models, prompts, or outputs of a single model on similar inputs.
PairwiseStringResultOutputParser | A parser for the output of the PairwiseStringEvalChain.
Criteria | A Criteria to evaluate.
CriteriaEvalChain | LLM Chain for evaluating runs against criteria.
CriteriaResultOutputParser | A parser for the output of the CriteriaEvalChain.
LabeledCriteriaEvalChain | Criteria evaluation chain that requires references.
EmbeddingDistance | Embedding distance metric.
EmbeddingDistanceEvalChain | Use embedding distances to score semantic difference between a prediction and reference.
PairwiseEmbeddingDistanceEvalChain | Use embedding distances to score semantic difference between two predictions.
ExactMatchStringEvaluator | Compute an exact match between the prediction and the reference.
JsonEqualityEvaluator | Evaluate whether the prediction is equal to the reference after parsing both as JSON.
JsonValidityEvaluator | Evaluate whether the prediction is valid JSON.
JsonEditDistanceEvaluator | An evaluator that calculates the edit distance between JSON strings.
JsonSchemaEvaluator | An evaluator that validates a JSON prediction against a JSON schema reference.
ContextQAEvalChain | LLM Chain for evaluating QA without ground truth answers, based on context.
CotQAEvalChain | LLM Chain for evaluating QA using chain-of-thought reasoning.
QAEvalChain | LLM Chain for evaluating question answering.
QAGenerateChain | LLM Chain for generating examples for question answering.
RegexMatchStringEvaluator | Compute a regex match between the prediction and the reference.
AgentTrajectoryEvaluator | Interface for evaluating agent trajectories.
EvaluatorType | The types of the evaluators.
LLMEvalChain | A base class for evaluators that use an LLM.
PairwiseStringEvaluator | Compare the output of two models (or two outputs of the same model).
StringEvaluator | Grade, tag, or otherwise evaluate predictions relative to their inputs and/or reference labels.
LabeledScoreStringEvalChain | A chain for scoring the output of a model on a scale of 1-10, with a reference label.
ScoreStringEvalChain | A chain for scoring the output of a model on a scale of 1-10.
ScoreStringResultOutputParser | A parser for the output of the ScoreStringEvalChain.
PairwiseStringDistanceEvalChain | Compute string edit distances between two predictions.
StringDistance | String distance metric to use.
StringDistanceEvalChain | Compute string distances between the prediction and the reference.
Functions

resolve_pairwise_criteria | Resolve the criteria for the pairwise evaluator.
resolve_criteria (criteria.eval_chain) | Resolve the criteria to evaluate.
load_dataset | Load a dataset from the LangChainDatasets on HuggingFace.
load_evaluator | Load the requested evaluation chain specified by a string.
load_evaluators | Load evaluators specified by a list of evaluator types.
resolve_criteria (scoring.eval_chain) | Resolve the criteria for the scoring evaluator.