GenericLoader
- class langchain_community.document_loaders.generic.GenericLoader(blob_loader: BlobLoader, blob_parser: BaseBlobParser)
Generic Document Loader.
A generic document loader that allows combining an arbitrary blob loader with a blob parser.
Examples
Parse a specific PDF file:
    from langchain_community.document_loaders import GenericLoader
    from langchain_community.document_loaders.parsers.pdf import PyPDFParser

    # Parse a single PDF file.
    loader = GenericLoader.from_filesystem(
        "my_lovely_pdf.pdf",
        parser=PyPDFParser()
    )

Recursively load the non-hidden PDF files in a directory and fetch the first document lazily:

    from langchain_community.document_loaders import GenericLoader

    loader = GenericLoader.from_filesystem(
        path="path/to/directory",
        glob="**/[!.]*",
        suffixes=[".pdf"],
        show_progress=True,
    )

    docs = loader.lazy_load()
    next(docs)
Example instantiations to change which files are loaded:
    # Recursively load all text files in a directory.
    loader = GenericLoader.from_filesystem("/path/to/dir", glob="**/*.txt")

    # Recursively load all non-hidden files in a directory.
    loader = GenericLoader.from_filesystem("/path/to/dir", glob="**/[!.]*")

    # Load all files in a directory without recursion.
    loader = GenericLoader.from_filesystem("/path/to/dir", glob="*")
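To see what the three glob patterns above select, the following sketch builds a small temporary directory and matches it with `pathlib.Path.glob`. It assumes the filesystem blob loader resolves patterns the way `pathlib` does and keeps only regular files; `match_files` and the file names are hypothetical and not part of the library.

```python
import tempfile
from pathlib import Path


def match_files(root: Path, pattern: str) -> list:
    """Return sorted relative paths of regular files matching `pattern`."""
    return sorted(
        p.relative_to(root).as_posix()
        for p in root.glob(pattern)
        if p.is_file()  # Directories can match a glob too; skip them.
    )


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    for name in ["notes.txt", ".DS_Store", "sub/more.txt", "sub/report.pdf"]:
        (root / name).write_text("x")

    recursive_txt = match_files(root, "**/*.txt")  # text files at any depth
    non_hidden = match_files(root, "**/[!.]*")     # names not starting with "."
    top_level = match_files(root, "*")             # direct children only

print(recursive_txt)
print(non_hidden)
print(top_level)
```

Note that `[!.]` only tests the first character of the file name itself, so it filters hidden files but says nothing about files that live inside hidden directories.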
Example instantiations to change which parser is used:
    from langchain_community.document_loaders.parsers.pdf import PyPDFParser

    # Recursively load all PDF files in a directory.
    loader = GenericLoader.from_filesystem(
        "/path/to/dir",
        glob="**/*.pdf",
        parser=PyPDFParser()
    )
A generic document loader.
- Parameters:
blob_loader (BlobLoader) – A blob loader which knows how to yield blobs
blob_parser (BaseBlobParser) – A blob parser which knows how to parse blobs into documents
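The division of labour between the two parameters can be illustrated with a minimal, standard-library-only sketch. Every class below (`Blob`, `Document`, `InMemoryBlobLoader`, `LineParser`, `SketchLoader`) is a hypothetical stand-in written for this illustration, not the library's implementation; the method names `yield_blobs` and `lazy_parse` are chosen only to resemble the roles described above.

```python
from dataclasses import dataclass, field
from typing import Iterable, Iterator


@dataclass
class Blob:
    """Stand-in for raw data plus a record of where it came from."""
    data: str
    source: str


@dataclass
class Document:
    """Stand-in for a parsed document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class InMemoryBlobLoader:
    """Hypothetical blob loader: yields blobs from a name -> text mapping."""

    def __init__(self, files: dict):
        self.files = files

    def yield_blobs(self) -> Iterable[Blob]:
        for name, text in self.files.items():
            yield Blob(data=text, source=name)


class LineParser:
    """Hypothetical blob parser: emits one document per non-empty line."""

    def lazy_parse(self, blob: Blob) -> Iterator[Document]:
        for number, line in enumerate(blob.data.splitlines(), start=1):
            if line.strip():
                yield Document(line, {"source": blob.source, "line": number})


class SketchLoader:
    """Chains a blob loader and a blob parser, one blob at a time."""

    def __init__(self, blob_loader, blob_parser):
        self.blob_loader = blob_loader
        self.blob_parser = blob_parser

    def lazy_load(self) -> Iterator[Document]:
        for blob in self.blob_loader.yield_blobs():
            yield from self.blob_parser.lazy_parse(blob)


loader = SketchLoader(
    InMemoryBlobLoader({"a.txt": "first\n\nsecond", "b.txt": "third"}),
    LineParser(),
)
docs = list(loader.lazy_load())
print([d.page_content for d in docs])
```

Because the loader only depends on "something that yields blobs" and "something that parses a blob", either side can be swapped without touching the other.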
Methods
__init__(blob_loader, blob_parser) – A generic document loader.
alazy_load() – A lazy loader for Documents.
aload() – Load data into Document objects.
from_filesystem(path, *[, glob, exclude, ...]) – Create a generic document loader using a filesystem blob loader.
get_parser(**kwargs) – Override this method to associate a default parser with the class.
lazy_load() – Load documents lazily.
load() – Load data into Document objects.
load_and_split([text_splitter]) – Load all documents and split them into sentences.
- __init__(blob_loader: BlobLoader, blob_parser: BaseBlobParser) → None
A generic document loader.
- Parameters:
blob_loader (BlobLoader) – A blob loader which knows how to yield blobs
blob_parser (BaseBlobParser) – A blob parser which knows how to parse blobs into documents
- Return type:
None
- async alazy_load() → AsyncIterator[Document]
A lazy loader for Documents.
- Return type:
AsyncIterator[Document]
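An `AsyncIterator` is consumed with `async for` inside a coroutine. The sketch below shows the consumption pattern only; `fake_alazy_load` and `take` are hypothetical stand-ins that yield plain strings, so nothing here reflects how the library produces its documents.

```python
import asyncio
from typing import AsyncIterator


async def fake_alazy_load() -> AsyncIterator[str]:
    """Hypothetical stand-in that yields documents one at a time."""
    for text in ["doc one", "doc two", "doc three"]:
        await asyncio.sleep(0)  # Hand control back to the event loop.
        yield text


async def take(stream: AsyncIterator[str], limit: int) -> list:
    """Collect at most `limit` items from the stream, then stop."""
    items = []
    async for item in stream:
        items.append(item)
        if len(items) >= limit:
            break
    return items


first_two = asyncio.run(take(fake_alazy_load(), 2))
print(first_two)
```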
- classmethod from_filesystem(path: str | Path, *, glob: str = '**/[!.]*', exclude: Sequence[str] = (), suffixes: Sequence[str] | None = None, show_progress: bool = False, parser: Literal['default'] | BaseBlobParser = 'default', parser_kwargs: dict | None = None) → GenericLoader
Create a generic document loader using a filesystem blob loader.
- Parameters:
path (str | Path) – The path to the directory to load documents from, or the path to a single file to load. If this is a file, glob, exclude, and suffixes are ignored.
glob (str) – The glob pattern to use to find documents.
suffixes (Sequence[str] | None) – The suffixes to use to filter documents. If None, all files matching the glob will be loaded.
exclude (Sequence[str]) – A list of patterns to exclude from the loader.
show_progress (bool) – Whether to show a progress bar (requires tqdm). Proxies to the file system loader.
parser (Literal['default'] | BaseBlobParser) – A blob parser which knows how to parse blobs into documents. If not provided, a default parser is instantiated. The default can be overridden either by passing a parser or by setting the class attribute blob_parser (the latter should be used with inheritance).
parser_kwargs (dict | None) – Keyword arguments to pass to the parser.
- Returns:
A generic document loader.
- Return type:
GenericLoader
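How path, glob, exclude, and suffixes interact can be sketched with `pathlib` alone. The helper `select_paths` is hypothetical: it mirrors the parameter descriptions above (a single file bypasses every filter; otherwise glob selects candidates, then exclude and suffixes narrow them), and the order in which the filters run is an assumption rather than a statement about the library's internals.

```python
import tempfile
from pathlib import Path
from typing import Iterator, Optional, Sequence


def select_paths(
    path: Path,
    glob: str = "**/[!.]*",
    exclude: Sequence[str] = (),
    suffixes: Optional[Sequence[str]] = None,
) -> Iterator[Path]:
    """Hypothetical helper mirroring the documented filter semantics."""
    if path.is_file():
        # A single file bypasses glob, exclude, and suffixes.
        yield path
        return
    for candidate in sorted(path.glob(glob)):
        if not candidate.is_file():
            continue
        if any(candidate.match(pattern) for pattern in exclude):
            continue
        if suffixes is not None and candidate.suffix not in suffixes:
            continue
        yield candidate


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ["a.pdf", "draft.pdf", "b.txt"]:
        (root / name).write_text("x")

    # Directory: keep PDFs, but drop anything whose name starts with "draft".
    filtered = [
        p.name for p in select_paths(root, suffixes=[".pdf"], exclude=["draft*"])
    ]
    # Single file: the suffix filter is ignored.
    single = [p.name for p in select_paths(root / "b.txt", suffixes=[".pdf"])]

print(filtered)
print(single)
```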
- static get_parser(**kwargs: Any) → BaseBlobParser
Override this method to associate a default parser with the class.
- Parameters:
kwargs (Any)
- Return type:
BaseBlobParser
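The override pattern described here can be sketched without the library. In the illustration below, `SketchLoader`, `ShoutingLoader`, `UpperCaseParser`, and the `create` factory are all hypothetical; `create` only imitates the idea of a factory that resolves `parser='default'` through a static hook and forwards `parser_kwargs` to it, which is an assumption about the general shape, not the library's code.

```python
from typing import Any, Optional


class UpperCaseParser:
    """Hypothetical parser used only for this illustration."""

    def __init__(self, prefix: str = ""):
        self.prefix = prefix

    def parse(self, text: str) -> str:
        return self.prefix + text.upper()


class SketchLoader:
    """Stand-in base class that has no default parser of its own."""

    def __init__(self, parser: Any):
        self.parser = parser

    @staticmethod
    def get_parser(**kwargs: Any) -> Any:
        raise NotImplementedError("No default parser for this class")

    @classmethod
    def create(cls, parser: Any = "default", parser_kwargs: Optional[dict] = None):
        if isinstance(parser, str) and parser == "default":
            # Resolve the default through the (possibly overridden) hook.
            parser = cls.get_parser(**(parser_kwargs or {}))
        return cls(parser)


class ShoutingLoader(SketchLoader):
    """Subclass that associates a default parser with the class."""

    @staticmethod
    def get_parser(**kwargs: Any) -> UpperCaseParser:
        return UpperCaseParser(**kwargs)


default_loader = ShoutingLoader.create(parser_kwargs={"prefix": "> "})
explicit_loader = ShoutingLoader.create(parser=UpperCaseParser())
print(default_loader.parser.parse("hi"))
print(explicit_loader.parser.parse("hi"))
```

Passing an explicit parser always wins; the hook is consulted only when the caller asks for the default.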
- lazy_load() → Iterator[Document]
Load documents lazily. Use this when working at a large scale.
- Return type:
Iterator[Document]
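The benefit of an iterator at scale is that work happens only for the items actually requested. The sketch below demonstrates this with a hypothetical generator, `fake_lazy_load`, which records every item it produces; it stands in for any lazy loader and does not use the library.

```python
from itertools import islice
from typing import Iterator

produced = []  # Records which items were actually generated.


def fake_lazy_load(total: int) -> Iterator[str]:
    """Hypothetical stand-in for a lazy loader over `total` documents."""
    for index in range(total):
        produced.append(index)
        yield f"document {index}"


# Ask for three documents out of a million; only three are ever built.
first_three = list(islice(fake_lazy_load(1_000_000), 3))
print(first_three)
print(len(produced))
```

Calling `list(...)` on the full iterator would instead materialize every document in memory at once, which is what lazy loading is meant to avoid.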
- load_and_split(text_splitter: TextSplitter | None = None) → List[Document]
Load all documents and split them into sentences.
- Parameters:
text_splitter (Optional[TextSplitter])
- Return type:
List[Document]