GenericLoader

class langchain_community.document_loaders.generic.GenericLoader( blob_loader: BlobLoader, blob_parser: BaseBlobParser, )

Generic Document Loader.

A generic document loader that allows combining an arbitrary blob loader with a blob parser.

Examples

Parse a specific PDF file:

 from langchain_community.document_loaders import GenericLoader
 from langchain_community.document_loaders.parsers.pdf import PyPDFParser

 # Load a single PDF file and parse it with PyPDFParser.
 loader = GenericLoader.from_filesystem(
     "my_lovely_pdf.pdf",
     parser=PyPDFParser()
 )

Lazily load the PDF files in a directory:

 from langchain_community.document_loaders import GenericLoader

 # Match all non-hidden files recursively, then keep only those ending in ".pdf".
 loader = GenericLoader.from_filesystem(
     path="path/to/directory",
     glob="**/[!.]*",
     suffixes=[".pdf"],
     show_progress=True,
 )

 # lazy_load() returns an iterator; next() parses and returns the first document.
 docs = loader.lazy_load()
 next(docs)

Example instantiations to change which files are loaded:

# Recursively load all text files in a directory.
loader = GenericLoader.from_filesystem("/path/to/dir", glob="**/*.txt")

# Recursively load all non-hidden files in a directory.
loader = GenericLoader.from_filesystem("/path/to/dir", glob="**/[!.]*")

# Load all files in a directory without recursion.
loader = GenericLoader.from_filesystem("/path/to/dir", glob="*")

Example instantiations to change which parser is used:

from langchain_community.document_loaders.parsers.pdf import PyPDFParser

# Recursively load all PDF files in a directory, parsing them with PyPDFParser.
loader = GenericLoader.from_filesystem(
    "/path/to/dir",
    glob="**/*.pdf",
    parser=PyPDFParser()
)

A generic document loader.

Parameters:

blob_loader (BlobLoader) – A blob loader which knows how to yield blobs
blob_parser (BaseBlobParser) – A blob parser which knows how to parse blobs into documents

Methods

`__init__`(blob_loader, blob_parser)	A generic document loader.
`alazy_load`()	A lazy loader for Documents.
`aload`()	Load data into Document objects.
`from_filesystem`(path, *[, glob, exclude, ...])	Create a generic document loader using a filesystem blob loader.
`get_parser`(**kwargs)	Override this method to associate a default parser with the class.
`lazy_load`()	Load documents lazily.
`load`()	Load data into Document objects.
`load_and_split`([text_splitter])	Load all documents and split them into chunks.

__init__( blob_loader: BlobLoader, blob_parser: BaseBlobParser, ) → None

A generic document loader.

Parameters:

blob_loader (BlobLoader) – A blob loader which knows how to yield blobs
blob_parser (BaseBlobParser) – A blob parser which knows how to parse blobs into documents

Return type:

None

async alazy_load() → AsyncIterator[Document]

A lazy loader for Documents.

Return type: AsyncIterator[Document]

async aload() → list[Document]

Load data into Document objects.

Return type: list[Document]

classmethod from_filesystem( path: str | Path, *, glob: str = '**/[!.]*', exclude: Sequence[str] = (), suffixes: Sequence[str] | None = None, show_progress: bool = False, parser: Literal['default'] | BaseBlobParser = 'default', parser_kwargs: dict | None = None, ) → GenericLoader

Create a generic document loader using a filesystem blob loader.

Parameters:

path (str | Path) – The path to the directory to load documents from, or the path to a single file to load. If this is a file, glob, exclude, and suffixes are ignored.
glob (str) – The glob pattern to use to find documents.
suffixes (Sequence[str] | None) – The suffixes to use to filter documents. If None, all files matching the glob will be loaded.
exclude (Sequence[str]) – A list of patterns to exclude from the loader.
show_progress (bool) – Whether to show a progress bar or not (requires tqdm). Proxies to the file system loader.
parser (Literal['default'] | BaseBlobParser) – A blob parser which knows how to parse blobs into documents. If not provided, a default parser is instantiated. The default can be overridden either by passing a parser or by setting the class attribute blob_parser (the latter should be used with inheritance).
parser_kwargs (dict | None) – Keyword arguments to pass to the parser.

Returns:

A generic document loader.

Return type:

GenericLoader

static get_parser( **kwargs: Any, ) → BaseBlobParser

Override this method to associate a default parser with the class.

Parameters: kwargs (Any)
Return type: BaseBlobParser

lazy_load() → Iterator[Document]

Load documents lazily, yielding them one at a time. Use this when working at a large scale, where holding every document in memory at once is impractical.

Return type: Iterator[Document]

load() → list[Document]

Load data into Document objects.

Return type: list[Document]

load_and_split( text_splitter: TextSplitter | None = None, ) → List[Document]

Load all documents and split them into chunks using the given text splitter.

Parameters: text_splitter (Optional[TextSplitter])
Return type: List[Document]

Examples using GenericLoader