MWDumpLoader
- class langchain_community.document_loaders.mediawikidump.MWDumpLoader(file_path: str | Path, encoding: str | None = 'utf8', namespaces: Sequence[int] | None = None, skip_redirects: bool | None = False, stop_on_error: bool | None = True)
Load MediaWiki dump from an XML file.
Example
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import MWDumpLoader

loader = MWDumpLoader(
    file_path="myWiki.xml",
    encoding="utf8"
)
docs = loader.load()
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=0
)
texts = text_splitter.split_documents(docs)
- Parameters:
file_path (str | Path) – Path to the local XML dump file
encoding (str, optional) – Character encoding of the file, defaults to “utf8”
namespaces (Sequence[int], optional) – The namespaces of the pages you want to parse. See https://www.mediawiki.org/wiki/Help:Namespaces#Localisation for a list of all common namespaces
skip_redirects (bool, optional) – True to skip pages that redirect to other pages, False to keep them. False by default
stop_on_error (bool, optional) – False to skip over pages that cause parsing errors, True to stop. True by default
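The optional parameters above can be combined to filter a dump. The following is a minimal sketch rather than an official example: `MAIN_NAMESPACE`, `LOADER_OPTIONS`, and `make_loader` are hypothetical names chosen for illustration, and it assumes namespace 0 is the main (article) namespace as listed on the MediaWiki help page linked above.

```python
from pathlib import Path

# Assumption: 0 is the main (article) namespace on a standard MediaWiki install.
MAIN_NAMESPACE = 0

# Hypothetical option set: keep only articles, drop redirects,
# and skip pages that fail to parse instead of stopping.
LOADER_OPTIONS = {
    "encoding": "utf8",
    "namespaces": [MAIN_NAMESPACE],
    "skip_redirects": True,
    "stop_on_error": False,
}


def make_loader(dump_path):
    """Build an MWDumpLoader for `dump_path` using LOADER_OPTIONS."""
    # Imported lazily so this module can be imported even when
    # langchain_community is not installed.
    from langchain_community.document_loaders import MWDumpLoader

    return MWDumpLoader(file_path=Path(dump_path), **LOADER_OPTIONS)
```

With langchain_community and its MediaWiki parsing dependencies installed, `make_loader("myWiki.xml").load()` should return Documents for main-namespace pages only.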
Methods
__init__(file_path[, encoding, namespaces, ...])
alazy_load() – A lazy loader for Documents.
aload() – Load data into Document objects.
lazy_load() – Lazy load from a file path.
load() – Load data into Document objects.
load_and_split([text_splitter]) – Load Documents and split into chunks.
- __init__(file_path: str | Path, encoding: str | None = 'utf8', namespaces: Sequence[int] | None = None, skip_redirects: bool | None = False, stop_on_error: bool | None = True)
- Parameters:
file_path (str | Path)
encoding (str | None)
namespaces (Sequence[int] | None)
skip_redirects (bool | None)
stop_on_error (bool | None)
- async alazy_load() → AsyncIterator[Document]
A lazy loader for Documents.
- Return type:
AsyncIterator[Document]
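Because `alazy_load()` returns an async iterator, it is consumed with `async for` inside a coroutine. The sketch below assumes only that the loader exposes `alazy_load()` as documented here; `count_documents` is a hypothetical helper, not part of the library.

```python
import asyncio


async def count_documents(loader):
    """Count the documents a loader yields, one at a time."""
    count = 0
    # Each document is pulled from the async iterator as it becomes available.
    async for _doc in loader.alazy_load():
        count += 1
    return count
```

From synchronous code, this could be driven with `asyncio.run(count_documents(loader))`.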
- lazy_load() → Iterator[Document]
Lazy load from a file path.
- Return type:
Iterator[Document]
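Since `lazy_load()` returns an iterator, a caller can stop early instead of materializing every page of a large dump. The following is a minimal sketch under that assumption; `first_documents` is a hypothetical helper name, not a library function.

```python
from itertools import islice


def first_documents(loader, limit):
    """Return at most `limit` documents from the loader's iterator."""
    # islice stops pulling from the iterator once `limit` items are taken,
    # so the rest of the dump is never consumed.
    return list(islice(loader.lazy_load(), limit))
```

This can be handy for previewing a few pages before committing to a full `load()`.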
- load_and_split(text_splitter: TextSplitter | None = None) → list[Document]
Load Documents and split into chunks. Chunks are returned as Documents.
Do not override this method. It should be considered deprecated.
- Parameters:
text_splitter (Optional[TextSplitter]) – TextSplitter instance to use for splitting documents. Defaults to RecursiveCharacterTextSplitter.
- Returns:
List of Documents.
- Return type:
list[Document]