HuggingFaceDatasetLoader#

class langchain_community.document_loaders.hugging_face_dataset.HuggingFaceDatasetLoader(path: str, page_content_column: str = 'text', name: str | None = None, data_dir: str | None = None, data_files: str | Sequence[str] | Mapping[str, str | Sequence[str]] | None = None, cache_dir: str | None = None, keep_in_memory: bool | None = None, save_infos: bool = False, use_auth_token: bool | str | None = None, num_proc: int | None = None)[source]#

Load from Hugging Face Hub datasets.

Initialize the HuggingFaceDatasetLoader.

Parameters:
  • path (str) – Path or name of the dataset.

  • page_content_column (str) – Page content column name. Default is "text".

  • name (str | None) – Name of the dataset configuration.

  • data_dir (str | None) – Data directory of the dataset configuration.

  • data_files (str | Sequence[str] | Mapping[str, str | Sequence[str]] | None) – Path(s) to source data file(s).

  • cache_dir (str | None) – Directory to read/write data.

  • keep_in_memory (bool | None) – Whether to copy the dataset in-memory.

  • save_infos (bool) – Save the dataset information (checksums/size/splits/…). Default is False.

  • use_auth_token (bool | str | None) – Bearer token for remote files on the Datasets Hub.

  • num_proc (int | None) – Number of processes.
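
For illustration, a minimal construction sketch. The dataset path "imdb", the repo-style path "some_user/some_dataset", and the data_files mapping below are hypothetical placeholders, not values the loader requires:

from langchain_community.document_loaders import HuggingFaceDatasetLoader

# Simplest case: a Hub dataset path plus the column that holds the text.
loader = HuggingFaceDatasetLoader(path="imdb", page_content_column="text")

# With a configuration name and explicit data files (illustrative values):
configured_loader = HuggingFaceDatasetLoader(
    path="some_user/some_dataset",
    page_content_column="text",
    name="default",
    data_files={"train": "train.jsonl"},
)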

Methods

__init__(path[, page_content_column, name, ...])

Initialize the HuggingFaceDatasetLoader.

alazy_load()

A lazy loader for Documents.

aload()

Load data into Document objects.

lazy_load()

Load documents lazily.

load()

Load data into Document objects.

load_and_split([text_splitter])

Load Documents and split into chunks.

parse_obj(page_content)

Convert the page content to a string.

__init__(path: str, page_content_column: str = 'text', name: str | None = None, data_dir: str | None = None, data_files: str | Sequence[str] | Mapping[str, str | Sequence[str]] | None = None, cache_dir: str | None = None, keep_in_memory: bool | None = None, save_infos: bool = False, use_auth_token: bool | str | None = None, num_proc: int | None = None)[source]#

Initialize the HuggingFaceDatasetLoader.

Parameters:
  • path (str) – Path or name of the dataset.

  • page_content_column (str) – Page content column name. Default is "text".

  • name (str | None) – Name of the dataset configuration.

  • data_dir (str | None) – Data directory of the dataset configuration.

  • data_files (str | Sequence[str] | Mapping[str, str | Sequence[str]] | None) – Path(s) to source data file(s).

  • cache_dir (str | None) – Directory to read/write data.

  • keep_in_memory (bool | None) – Whether to copy the dataset in-memory.

  • save_infos (bool) – Save the dataset information (checksums/size/splits/…). Default is False.

  • use_auth_token (bool | str | None) – Bearer token for remote files on the Datasets Hub.

  • num_proc (int | None) – Number of processes.

async alazy_load() → AsyncIterator[Document]#

A lazy loader for Documents.

Return type:

AsyncIterator[Document]

async aload() → List[Document]#

Load data into Document objects.

Return type:

List[Document]
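
A sketch of driving the async entry points with asyncio; the dataset name is illustrative, and the default async implementations inherited from the base loader are assumed:

import asyncio

from langchain_community.document_loaders import HuggingFaceDatasetLoader

async def main() -> None:
    loader = HuggingFaceDatasetLoader(path="imdb")  # illustrative dataset
    # aload awaits the documents and collects them into a list.
    docs = await loader.aload()
    print(len(docs))

asyncio.run(main())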

lazy_load() → Iterator[Document][source]#

Load documents lazily.

Return type:

Iterator[Document]
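
A usage sketch (dataset name is illustrative): lazy_load yields one Document per row, so a large dataset need not be materialized as a list up front.

from langchain_community.document_loaders import HuggingFaceDatasetLoader

loader = HuggingFaceDatasetLoader(path="imdb")  # illustrative dataset

# Iterate without building the full list of Documents in memory.
for doc in loader.lazy_load():
    print(doc.page_content[:80])
    break  # inspect only the first row here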

load() → List[Document]#

Load data into Document objects.

Return type:

List[Document]
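
A minimal eager-loading sketch, again with an illustrative dataset name:

from langchain_community.document_loaders import HuggingFaceDatasetLoader

loader = HuggingFaceDatasetLoader(path="imdb")  # illustrative dataset

# load() collects every row into a list of Documents at once.
docs = loader.load()
print(len(docs))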

load_and_split(text_splitter: TextSplitter | None = None) → List[Document]#

Load Documents and split them into chunks. Chunks are returned as Documents.

Do not override this method; it should be considered deprecated.

Parameters:

text_splitter (Optional[TextSplitter]) – TextSplitter instance to use for splitting documents. Defaults to RecursiveCharacterTextSplitter.

Returns:

List of Documents.

Return type:

List[Document]
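
A sketch of supplying an explicit splitter, assuming the langchain-text-splitters package provides RecursiveCharacterTextSplitter; the chunk sizes are illustrative:

from langchain_community.document_loaders import HuggingFaceDatasetLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

loader = HuggingFaceDatasetLoader(path="imdb")  # illustrative dataset

# Pass a configured splitter instead of relying on the default.
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = loader.load_and_split(text_splitter=splitter)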

parse_obj(page_content: str | object) → str[source]#

Convert the page content to a string.

Parameters:

page_content (str | object) –

Return type:

str
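
A cautious sketch of calling parse_obj directly; that non-string values are serialized (e.g. as JSON) is an assumption here, not documented behavior:

from langchain_community.document_loaders import HuggingFaceDatasetLoader

loader = HuggingFaceDatasetLoader(path="imdb")  # illustrative dataset

# A non-string column value is normalized to a string so it can serve
# as Document.page_content (assumed JSON-style serialization).
text = loader.parse_obj({"label": 1, "review": "great"})
assert isinstance(text, str)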

Examples using HuggingFaceDatasetLoader
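
For instance, a hedged end-to-end sketch (dataset path and column name are illustrative): each row becomes a Document whose page content is taken from page_content_column, with the remaining columns expected to land in metadata.

from langchain_community.document_loaders import HuggingFaceDatasetLoader

loader = HuggingFaceDatasetLoader(
    path="imdb",                 # illustrative Hub dataset path
    page_content_column="text",  # column used as Document.page_content
)
docs = loader.load()

print(docs[0].page_content[:100])
print(docs[0].metadata)  # remaining columns (assumed carried as metadata)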