CloudBlobLoader

class langchain_community.document_loaders.blob_loaders.cloud_blob_loader.CloudBlobLoader(url: str | AnyPath, *, glob: str = '**/[!.]*', exclude: Sequence[str] = (), suffixes: Sequence[str] | None = None, show_progress: bool = False)

Load blobs from a cloud URL or a file: URL.

Example:

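from langchain_community.document_loaders.blob_loaders.cloud_blob_loader import CloudBlobLoader

loader = CloudBlobLoader("s3://mybucket/id")

# Iterate over every blob that matches the default glob pattern
for blob in loader.yield_blobs():
    print(blob)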

Initialize with a url and how to glob over it.

Path handling relies on [CloudPathLib](https://cloudpathlib.drivendata.org/).

Parameters:

url (str | AnyPath) – Cloud URL to load from. Supports the s3://, az://, gs://, and file:// schemes. If no scheme is provided, the URL is treated as a local path. If the URL points to a single file, glob, exclude, and suffixes are ignored.
glob (str) – Glob pattern relative to the specified path. The default picks up all non-hidden files.
exclude (Sequence[str]) – Patterns to exclude from the results, in glob syntax.
suffixes (Sequence[str] | None) – If provided, keep only files with these suffixes. Useful for keeping files with several different suffixes. Each suffix must include the dot, e.g. ".txt".
show_progress (bool) – If True, show a progress bar as the files are loaded. This forces an iteration through all matching files to count them before loading.
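
The way glob, exclude, and suffixes combine can be approximated on a local directory with the standard library alone. The sketch below is not the loader's implementation: `select_paths` is a hypothetical helper, and it assumes that exclude patterns are matched against each candidate path and that suffixes are compared with the file's final extension.

```python
import tempfile
from pathlib import Path
from typing import Iterator, Optional, Sequence


def select_paths(
    root: Path,
    glob: str = "**/[!.]*",
    exclude: Sequence[str] = (),
    suffixes: Optional[Sequence[str]] = None,
) -> Iterator[Path]:
    """Hypothetical helper: yield files under root that pass all three filters."""
    for path in root.glob(glob):
        # Drop anything that matches an exclude pattern (glob syntax).
        if any(path.match(pattern) for pattern in exclude):
            continue
        # Directories can match the glob too; keep regular files only.
        if not path.is_file():
            continue
        # Suffixes include the leading dot, e.g. ".txt".
        if suffixes and path.suffix not in suffixes:
            continue
        yield path


# Build a small throwaway tree to try the filters on.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    for name in ("a.txt", "b.md", ".hidden", "sub/c.txt", "sub/d.log"):
        (root / name).write_text("x")

    names = sorted(
        p.name for p in select_paths(root, exclude=("*.log",), suffixes=(".txt",))
    )

print(names)  # ['a.txt', 'c.txt']
```

Here `.hidden` is rejected by the default glob, `d.log` by the exclude pattern, and `b.md` by the suffix filter.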

Methods

`__init__`(url, *[, glob, exclude, suffixes, ...])	Initialize with a url and how to glob over it.
`count_matching_files`()	Count files that match the pattern without loading them.
`from_path`(path, *[, encoding, mime_type, ...])	Load the blob from a path like object.
`yield_blobs`()	Yield blobs that match the requested pattern.

__init__(url: str | AnyPath, *, glob: str = '**/[!.]*', exclude: Sequence[str] = (), suffixes: Sequence[str] | None = None, show_progress: bool = False) → None

Initialize with a url and how to glob over it.

Path handling relies on [CloudPathLib](https://cloudpathlib.drivendata.org/).

Parameters:

url (str | AnyPath) – Cloud URL to load from. Supports the s3://, az://, gs://, and file:// schemes. If no scheme is provided, the URL is treated as a local path. If the URL points to a single file, glob, exclude, and suffixes are ignored.
glob (str) – Glob pattern relative to the specified path. The default picks up all non-hidden files.
exclude (Sequence[str]) – Patterns to exclude from the results, in glob syntax.
suffixes (Sequence[str] | None) – If provided, keep only files with these suffixes. Useful for keeping files with several different suffixes. Each suffix must include the dot, e.g. ".txt".
show_progress (bool) – If True, show a progress bar as the files are loaded. This forces an iteration through all matching files to count them before loading.
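
The scheme rules for url can be illustrated with a small classifier. This is a simplified sketch built on `urllib.parse`, not the library's own logic: `classify_url` and `CLOUD_SCHEMES` are hypothetical names, and Windows drive letters (which parse as a scheme) are deliberately not handled.

```python
from typing import Tuple
from urllib.parse import urlparse

# Cloud schemes listed in the description of the url parameter.
CLOUD_SCHEMES = {"s3", "az", "gs"}


def classify_url(url: str) -> Tuple[str, str]:
    """Hypothetical helper: return (kind, location) for a loader URL."""
    parsed = urlparse(url)
    if parsed.scheme in CLOUD_SCHEMES:
        return "cloud", url
    if parsed.scheme == "file":
        # file:///tmp/data -> /tmp/data
        return "local", parsed.path
    if parsed.scheme == "":
        # No scheme: treated as a local path.
        return "local", url
    raise ValueError(f"Unsupported scheme: {parsed.scheme!r}")


print(classify_url("s3://mybucket/id"))   # ('cloud', 's3://mybucket/id')
print(classify_url("file:///tmp/data"))   # ('local', '/tmp/data')
print(classify_url("data/docs"))          # ('local', 'data/docs')
```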

Return type:

None

count_matching_files() → int

Count files that match the pattern without loading them.

Return type:

int
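
Counting without loading amounts to walking the path listing and never opening a file, which is also what makes a count-first progress display possible. The following is a stdlib-only sketch of that two-pass pattern on a local directory; `matching_paths` and `count_matching` are hypothetical helpers, not methods of the loader.

```python
import tempfile
from pathlib import Path
from typing import Iterator


def matching_paths(root: Path, glob: str = "**/[!.]*") -> Iterator[Path]:
    """Hypothetical helper: yield matching files without opening them."""
    return (p for p in root.glob(glob) if p.is_file())


def count_matching(root: Path, glob: str = "**/[!.]*") -> int:
    # Counting consumes only the path listing; no file content is read.
    return sum(1 for _ in matching_paths(root, glob))


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("a.txt", "b.txt", "c.md"):
        (root / name).write_text("hello")

    total = count_matching(root)  # first pass: count only
    sizes = []
    for done, path in enumerate(sorted(matching_paths(root)), start=1):
        sizes.append(len(path.read_bytes()))  # second pass: actually load
        print(f"{done}/{total} {path.name}")
```

The cost of the extra pass is one additional listing of the matching files, which can be noticeable on large buckets.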

classmethod from_path(path: AnyPath, *, encoding: str = 'utf-8', mime_type: str | None = None, guess_type: bool = True, metadata: dict | None = None) → Blob

Load the blob from a path-like object.

Parameters:

path (AnyPath) – Path-like object pointing to the file to read. Supports the s3://, az://, gs://, and file:// schemes. If no scheme is provided, the path is treated as a local file.
encoding (str) – Encoding to use when decoding the bytes into a string.
mime_type (str | None) – If provided, set as the mime-type of the data.
guess_type (bool) – If True and no mime-type was provided, the mime-type is guessed from the file extension.
metadata (dict | None) – Metadata to associate with the blob.

Returns:

Blob instance

Return type:

Blob
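
The precedence between mime_type and guess_type described above can be sketched with the standard `mimetypes` module. `resolve_mime_type` is a hypothetical helper written for illustration; it assumes that guessing is based on the file extension only, as the parameter description says.

```python
import mimetypes
from typing import Optional


def resolve_mime_type(
    path: str, mime_type: Optional[str] = None, guess_type: bool = True
) -> Optional[str]:
    """Hypothetical helper mirroring the documented parameter precedence."""
    if mime_type is not None:
        return mime_type  # an explicit mime-type always wins
    if guess_type:
        return mimetypes.guess_type(path)[0]  # guess from the file extension
    return None  # nothing provided and guessing disabled


print(resolve_mime_type("report.txt"))
print(resolve_mime_type("report.txt", mime_type="application/x-custom"))
print(resolve_mime_type("report.txt", guess_type=False))
```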

yield_blobs() → Iterable[Blob]

Yield blobs that match the requested pattern.

Return type:

Iterable[Blob]