TensorflowDatasets#

class langchain_community.utilities.tensorflow_datasets.TensorflowDatasets[source]#

Bases: BaseModel

Access to TensorFlow Datasets.

The current implementation works only with datasets that fit in memory.

TensorFlow Datasets is a collection of datasets ready to use with TensorFlow or other Python ML frameworks, such as JAX. All datasets are exposed as tf.data.Datasets. To get started, see the guide: https://www.tensorflow.org/datasets/overview and the list of datasets: https://www.tensorflow.org/datasets/catalog/overview#all_datasets
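
For orientation, the underlying TFDS API loads a split as a tf.data.Dataset; a minimal sketch, assuming the tensorflow_datasets package is installed (the dataset name "mlqa/en" is only an illustration):

import tensorflow_datasets as tfds

# Load the train split of a dataset as a tf.data.Dataset.
ds = tfds.load("mlqa/en", split="train")

# Each element is a dict of tensors in the dataset-specific format.
for example in ds.take(1):
    print(example.keys())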

You have to provide the sample_to_document_function: a function that converts a sample from the dataset-specific format to a Document.

dataset_name#

The name of the dataset to load.

split_name#

The name of the split to load. Defaults to "train".

load_max_docs#

A limit on the number of loaded documents. Defaults to 100.

sample_to_document_function#

A function that converts a dataset sample to a Document.

Example

import tensorflow as tf

from langchain_core.documents import Document
from langchain_community.utilities import TensorflowDatasets

MAX_DOCS = 100

def decode_to_str(item: tf.Tensor) -> str:
    # TFDS string fields arrive as byte tensors; decode them to Python str.
    return item.numpy().decode("utf-8")

def mlqaen_example_to_document(example: dict) -> Document:
    # Convert one mlqa/en sample to a Document, keeping the context as
    # page_content and the remaining fields as metadata.
    return Document(
        page_content=decode_to_str(example["context"]),
        metadata={
            "id": decode_to_str(example["id"]),
            "title": decode_to_str(example["title"]),
            "question": decode_to_str(example["question"]),
            "answer": decode_to_str(example["answers"]["text"][0]),
        },
    )

tsds_client = TensorflowDatasets(
    dataset_name="mlqa/en",
    split_name="train",
    load_max_docs=MAX_DOCS,
    sample_to_document_function=mlqaen_example_to_document,
)

Create a new model by parsing and validating input data from keyword arguments.

Raises a pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

param dataset_name: str = ''#
param load_max_docs: int = 100#
param sample_to_document_function: Callable[[Dict], Document] | None = None#
param split_name: str = 'train'#
lazy_load() β†’ Iterator[Document][source]#

Download a selected dataset lazily.

Returns:

An iterator of Documents.

Return type:

Iterator[Document]
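
A minimal usage sketch, assuming the tsds_client from the example above has been constructed:

for doc in tsds_client.lazy_load():
    # Each Document is one converted sample; the iterator stops after
    # load_max_docs samples.
    print(doc.metadata["title"])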

load() β†’ List[Document][source]#

Download a selected dataset.

Returns:

A list of Documents.

Return type:

List[Document]
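
A minimal usage sketch, again assuming the tsds_client from the example above; load() materializes the documents in memory:

docs = tsds_client.load()
print(len(docs))  # at most load_max_docs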