arun_on_dataset#
- async langchain.smith.evaluation.runner_utils.arun_on_dataset(client: Client | None, dataset_name: str, llm_or_chain_factory: Callable[[], Chain | Runnable] | BaseLanguageModel | Callable[[dict], Any] | Runnable | Chain, *, evaluation: RunEvalConfig | None = None, dataset_version: datetime | str | None = None, concurrency_level: int = 5, project_name: str | None = None, project_metadata: Dict[str, Any] | None = None, verbose: bool = False, revision_id: str | None = None, **kwargs: Any) → Dict[str, Any] [source]#
Run the Chain or language model on a dataset and store traces to the specified project name.
- Parameters:
dataset_name (str) – Name of the dataset to run the chain on.
llm_or_chain_factory (Callable[[], Chain | Runnable] | BaseLanguageModel | Callable[[dict], Any] | Runnable | Chain) – Language model or Chain constructor to run over the dataset. The Chain constructor is used to permit independent calls on each example without carrying over state.
evaluation (RunEvalConfig | None) – Configuration for evaluators to run on the results of the chain.
concurrency_level (int) – The number of async tasks to run concurrently.
project_name (str | None) – Name of the project to store the traces in. Defaults to {dataset_name}-{chain class name}-{datetime}.
project_metadata (Dict[str, Any] | None) – Optional metadata to add to the project. Useful for storing information about the test variant (e.g. prompt version, model version).
client (Client | None) – LangSmith client to use to access the dataset and to log feedback and run traces.
verbose (bool) – Whether to print progress.
tags – Tags to add to each run in the project.
revision_id (str | None) – Optional revision identifier to assign this test run, for tracking the performance of different versions of your system.
dataset_version (datetime | str | None) – Optional version of the dataset to evaluate on, as a timestamp or version tag.
kwargs (Any) – Additional keyword arguments.
- Returns:
A dictionary containing the run's project name and the resulting model outputs.
- Return type:
Dict[str, Any]
For the synchronous version of this function, see run_on_dataset().

Examples
    from langsmith import Client

    from langchain.chains import LLMChain
    from langchain.smith import RunEvalConfig, arun_on_dataset
    from langchain_openai import ChatOpenAI

    # Chains may have memory. Passing in a constructor function lets the
    # evaluation framework avoid cross-contamination between runs.
    def construct_chain():
        llm = ChatOpenAI(temperature=0)
        chain = LLMChain.from_string(
            llm,
            "What's the answer to {your_input_key}",
        )
        return chain

    # Load off-the-shelf evaluators via config or the EvaluatorType (string or enum)
    evaluation_config = RunEvalConfig(
        evaluators=[
            "qa",  # "Correctness" against a reference answer
            "embedding_distance",
            RunEvalConfig.Criteria("helpfulness"),
            RunEvalConfig.Criteria({
                "fifth-grader-score": "Do you have to be smarter than a fifth grader to answer this question?"
            }),
        ]
    )

    client = Client()
    await arun_on_dataset(
        client,
        dataset_name="<my_dataset_name>",
        llm_or_chain_factory=construct_chain,
        evaluation=evaluation_config,
    )
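A minimal sketch of the optional arguments described in the parameter list above; the project name, metadata values, and revision identifier below are placeholders, not required values.

    # Sketch only: project name, metadata values, and revision id are placeholders.
    results = await arun_on_dataset(
        client,
        dataset_name="<my_dataset_name>",
        llm_or_chain_factory=construct_chain,
        evaluation=evaluation_config,
        concurrency_level=10,  # number of examples evaluated concurrently
        project_name="my-eval-project",  # otherwise defaults to {dataset_name}-{chain class name}-{datetime}
        project_metadata={"prompt_version": "v2"},  # recorded on the test project
        revision_id="2024-06-01",  # tag this run with a system revision
    )
    # The returned dict contains the run's project name and the resulting model outputs.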
You can also create custom evaluators by subclassing the StringEvaluator or LangSmith's RunEvaluator classes.

    from typing import Optional

    from langchain.evaluation import StringEvaluator


    class MyStringEvaluator(StringEvaluator):

        @property
        def requires_input(self) -> bool:
            return False

        @property
        def requires_reference(self) -> bool:
            return True

        @property
        def evaluation_name(self) -> str:
            return "exact_match"

        def _evaluate_strings(self, prediction, reference=None, input=None, **kwargs) -> dict:
            return {"score": prediction == reference}


    evaluation_config = RunEvalConfig(
        custom_evaluators=[MyStringEvaluator()],
    )

    await arun_on_dataset(
        client,
        dataset_name="<my_dataset_name>",
        llm_or_chain_factory=construct_chain,
        evaluation=evaluation_config,
    )
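The example above subclasses StringEvaluator, which only sees the prediction, reference, and input strings. An evaluator that needs access to the full traced run can instead subclass LangSmith's RunEvaluator. The sketch below is illustrative only: OutputLengthEvaluator is a hypothetical evaluator, and it assumes the chain's output is stored under the "text" key (the default output key of LLMChain).

    from typing import Optional

    from langsmith.evaluation import EvaluationResult, RunEvaluator
    from langsmith.schemas import Example, Run

    from langchain.smith import RunEvalConfig


    class OutputLengthEvaluator(RunEvaluator):
        """Hypothetical evaluator: scores each run by the length of its output text."""

        def evaluate_run(
            self, run: Run, example: Optional[Example] = None
        ) -> EvaluationResult:
            # run.outputs holds the chain's outputs for this example; "text" is
            # assumed to be the output key (true for LLMChain by default).
            text = str((run.outputs or {}).get("text", ""))
            return EvaluationResult(key="output_length", score=len(text))


    evaluation_config = RunEvalConfig(
        custom_evaluators=[OutputLengthEvaluator()],
    )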