ReadTheDocsLoader
- class langchain_community.document_loaders.readthedocs.ReadTheDocsLoader(path: str | Path, encoding: str | None = None, errors: str | None = None, custom_html_tag: Tuple[str, dict] | None = None, patterns: Sequence[str] = ('*.htm', '*.html'), exclude_links_ratio: float = 1.0, **kwargs: Any | None)
Load ReadTheDocs documentation directory.
Initialize ReadTheDocsLoader
The loader walks every file under path and extracts the actual content of each file by looking up a main HTML tag. The default main tags are <main id="main-content">, <div role="main">, and <article role="main">. You can also define your own tag by passing custom_html_tag, e.g. ("div", {"class": "main"}). The custom tag, if given, is tried first, followed by the default tags. The search stops at the first tag that is not empty, and the content is taken from that tag.
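The lookup order described above can be sketched as follows. This is a minimal, standard-library-only illustration of the described behavior, not the loader's actual implementation (which relies on bs4.BeautifulSoup). The names extract_main_text and _TagText are hypothetical, and attribute matching here is plain string equality.

```python
from html.parser import HTMLParser

# Default candidates named in the description above.
DEFAULT_TAGS = [
    ("main", {"id": "main-content"}),
    ("div", {"role": "main"}),
    ("article", {"role": "main"}),
]


class _TagText(HTMLParser):
    """Collect the text inside the first element matching (name, attrs)."""

    def __init__(self, name, attrs):
        super().__init__()
        self.name, self.want = name, attrs
        self.depth = 0      # > 0 while inside the matched element
        self.done = False   # True once the matched element has closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth:
            # Track nested elements of the same name so we close correctly.
            if tag == self.name:
                self.depth += 1
        elif tag == self.name and self.want.items() <= dict(attrs).items():
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == self.name:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_main_text(html, custom_html_tag=None):
    """Try the custom tag first, then the defaults; first non-empty wins."""
    candidates = ([custom_html_tag] if custom_html_tag else []) + DEFAULT_TAGS
    for name, attrs in candidates:
        parser = _TagText(name, attrs)
        parser.feed(html)
        text = "".join(parser.chunks).strip()
        if text:
            return text
    return ""
```

With the loader itself, the corresponding choice is made by passing custom_html_tag=("div", {"class": "main"}) to the constructor.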
- Parameters:
path (Union[str, Path]) – The location of the downloaded ReadTheDocs folder.
encoding (Optional[str]) – The encoding with which to open the documents.
errors (Optional[str]) – How encoding and decoding errors are handled. This cannot be used in binary mode.
custom_html_tag (Optional[Tuple[str, dict]]) – Optional custom HTML tag, given as a tag name and an attribute dict, from which to retrieve the content.
patterns (Sequence[str]) – The file patterns to load, passed to pathlib.Path.rglob.
exclude_links_ratio (float) – Threshold on the ratio of link text to content; pages above this ratio are excluded. This reduces how often index pages make their way into retrieved results. Recommended: 0.5.
kwargs (Optional[Any]) – Named arguments passed to bs4.BeautifulSoup.
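To illustrate what a links-to-content ratio measures, here is a rough sketch using only the standard library. It assumes the ratio is the length of text inside <a> elements divided by the length of all text, and that a page is kept while its ratio does not exceed exclude_links_ratio. Both the exact formula and the names link_ratio and keep_page are assumptions for illustration, not the library's code.

```python
from html.parser import HTMLParser


class _LinkTextCounter(HTMLParser):
    """Count characters of text inside <a> elements versus all text."""

    def __init__(self):
        super().__init__()
        self.in_link = 0       # nesting depth of open <a> elements
        self.link_chars = 0
        self.total_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_link += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.in_link:
            self.in_link -= 1

    def handle_data(self, data):
        n = len(data.strip())
        self.total_chars += n
        if self.in_link:
            self.link_chars += n


def link_ratio(html):
    """Assumed definition: link text length / total text length."""
    counter = _LinkTextCounter()
    counter.feed(html)
    if counter.total_chars == 0:
        return 0.0
    return counter.link_chars / counter.total_chars


def keep_page(html, exclude_links_ratio=1.0):
    """Keep the page unless its link ratio is above the threshold."""
    return link_ratio(html) <= exclude_links_ratio
```

Under this reading, the default of 1.0 excludes nothing, while a lower threshold such as the recommended 0.5 drops pages that consist mostly of links, for example a table of contents.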
Methods
__init__(path[, encoding, errors, ...]) – Initialize ReadTheDocsLoader.
alazy_load() – A lazy loader for Documents.
aload() – Load data into Document objects.
lazy_load() – A lazy loader for Documents.
load() – Load data into Document objects.
load_and_split([text_splitter]) – Load Documents and split into chunks.
- __init__(path: str | Path, encoding: str | None = None, errors: str | None = None, custom_html_tag: Tuple[str, dict] | None = None, patterns: Sequence[str] = ('*.htm', '*.html'), exclude_links_ratio: float = 1.0, **kwargs: Any | None)
Initialize ReadTheDocsLoader. The parameters and extraction behavior are identical to those documented for the class above.
- async alazy_load() → AsyncIterator[Document]
A lazy loader for Documents.
- Return type:
AsyncIterator[Document]
- lazy_load() → Iterator[Document]
A lazy loader for Documents.
- Return type:
Iterator[Document]
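Lazy loading here means that files are read one at a time as the caller iterates, instead of all up front. The pattern can be sketched as below, assuming the file patterns are applied with pathlib.Path.rglob. The helper iter_html_files is hypothetical and yields raw (path, text) pairs rather than Document objects.

```python
from pathlib import Path


def iter_html_files(root, patterns=("*.htm", "*.html"), encoding=None, errors=None):
    """Yield (path, text) for each matching file under root, one at a time."""
    for pattern in patterns:
        # rglob searches root and all of its subdirectories.
        for p in Path(root).rglob(pattern):
            if p.is_file():
                # The file is only opened when the caller asks for the next item.
                yield p, p.read_text(encoding=encoding, errors=errors)
```

Because this is a generator, a caller can stop early or process a large documentation tree without holding every page in memory.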
- load_and_split(text_splitter: TextSplitter | None = None) → List[Document]
Load Documents and split into chunks. Chunks are returned as Documents.
Do not override this method. It should be considered deprecated!
- Parameters:
text_splitter (Optional[TextSplitter]) – TextSplitter instance to use for splitting documents. Defaults to RecursiveCharacterTextSplitter.
- Returns:
List of Documents.
- Return type:
List[Document]
Examples using ReadTheDocsLoader