TensorflowDatasets#

class langchain_community.utilities.tensorflow_datasets.TensorflowDatasets[source]#

Bases: BaseModel

Access to TensorFlow Datasets.

The current implementation works only with datasets that fit in memory.

TensorFlow Datasets is a collection of datasets ready to use with TensorFlow or other Python ML frameworks, such as JAX. All datasets are exposed as tf.data.Datasets. To get started, see the guide: https://www.tensorflow.org/datasets/overview and the list of datasets: https://www.tensorflow.org/datasets/catalog/overview#all_datasets
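
For orientation, the underlying TFDS API loads a split as a tf.data.Dataset; a minimal sketch, assuming the tensorflow_datasets package is installed (the dataset name "mlqa/en" is only an illustration):

import tensorflow_datasets as tfds

# Load the train split of a dataset as a tf.data.Dataset.
ds = tfds.load("mlqa/en", split="train")

# Each element is a dict of tensors in the dataset-specific format.
for example in ds.take(1):
    print(example.keys())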

You have to provide the sample_to_document_function: a function that converts a sample from the dataset-specific format to a Document.

dataset_name#

The name of the dataset to load.

split_name#

The name of the split to load. Defaults to "train".

load_max_docs#

A limit on the number of loaded documents. Defaults to 100.

sample_to_document_function#

A function that converts a dataset sample to a Document.

Example

import tensorflow as tf

from langchain_core.documents import Document
from langchain_community.utilities import TensorflowDatasets

MAX_DOCS = 100

def decode_to_str(item: tf.Tensor) -> str:
    # TFDS string fields arrive as byte tensors; decode them to Python str.
    return item.numpy().decode("utf-8")

def mlqaen_example_to_document(example: dict) -> Document:
    # Convert one mlqa/en sample to a Document, keeping the context as
    # page_content and the remaining fields as metadata.
    return Document(
        page_content=decode_to_str(example["context"]),
        metadata={
            "id": decode_to_str(example["id"]),
            "title": decode_to_str(example["title"]),
            "question": decode_to_str(example["question"]),
            "answer": decode_to_str(example["answers"]["text"][0]),
        },
    )

tsds_client = TensorflowDatasets(
    dataset_name="mlqa/en",
    split_name="train",
    load_max_docs=MAX_DOCS,
    sample_to_document_function=mlqaen_example_to_document,
)

Create a new model by parsing and validating input data from keyword arguments.

Raises a pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

param dataset_name: str = ''#
param load_max_docs: int = 100#
param sample_to_document_function: Callable[[Dict], Document] | None = None#
param split_name: str = 'train'#
lazy_load() β†’ Iterator[Document][source]#

Download a selected dataset lazily.

Returns:

An iterator of Documents.

Return type:

Iterator[Document]
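
A minimal usage sketch, assuming the tsds_client from the example above has been constructed:

for doc in tsds_client.lazy_load():
    # Each Document is one converted sample; the iterator stops after
    # load_max_docs samples.
    print(doc.metadata["title"])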

load() β†’ List[Document][source]#

Download a selected dataset.

Returns:

A list of Documents.

Return type:

List[Document]
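
A minimal usage sketch, again assuming the tsds_client from the example above; load() materializes the documents in memory:

docs = tsds_client.load()
print(len(docs))  # at most load_max_docs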