PlaywrightURLLoader
- class langchain_community.document_loaders.url_playwright.PlaywrightURLLoader(urls: List[str], continue_on_failure: bool = True, headless: bool = True, remove_selectors: List[str] | None = None, evaluator: PlaywrightEvaluator | None = None, proxy: Dict[str, str] | None = None)
Load HTML pages with Playwright and parse with Unstructured.
This is useful for loading pages that require JavaScript to render.
- Parameters:
urls (List[str])
continue_on_failure (bool)
headless (bool)
remove_selectors (List[str] | None)
evaluator (PlaywrightEvaluator | None)
proxy (Dict[str, str] | None)
- urls
List of URLs to load.
- Type:
List[str]
- continue_on_failure
If True, continue loading other URLs on failure.
- Type:
bool
- headless
If True, the browser will run in headless mode.
- Type:
bool
- proxy
If set, the browser will access URLs through the specified proxy.
- Type:
Optional[Dict[str, str]]
Example
from langchain_community.document_loaders import PlaywrightURLLoader

urls = ["https://api.ipify.org/?format=json"]
proxy = {
    "server": "https://xx.xx.xx:15818",  # https://<host>:<port>
    "username": "username",
    "password": "password",
}
loader = PlaywrightURLLoader(urls, proxy=proxy)
data = loader.load()
Load a list of URLs using Playwright.
Methods
- __init__(urls[, continue_on_failure, ...]): Load a list of URLs using Playwright.
- alazy_load(): Load the specified URLs with Playwright and create Documents asynchronously.
- aload(): Load the specified URLs with Playwright and create Documents asynchronously.
- lazy_load(): Load the specified URLs using Playwright and create Document instances.
- load(): Load data into Document objects.
- load_and_split([text_splitter]): Load Documents and split into chunks.
- __init__(urls: List[str], continue_on_failure: bool = True, headless: bool = True, remove_selectors: List[str] | None = None, evaluator: PlaywrightEvaluator | None = None, proxy: Dict[str, str] | None = None)
Load a list of URLs using Playwright.
- Parameters:
urls (List[str])
continue_on_failure (bool)
headless (bool)
remove_selectors (List[str] | None)
evaluator (PlaywrightEvaluator | None)
proxy (Dict[str, str] | None)
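As a rough sketch of how these constructor arguments might be combined, the helper below collects keyword arguments that match the signature above. The names `make_loader_kwargs` and `make_loader` are hypothetical and not part of the library. The source does not describe `remove_selectors`; the sketch assumes it holds CSS selectors for elements to drop before parsing. The import is deferred so the helper can be defined without Playwright installed.

```python
from typing import Any, Dict, List, Optional


def make_loader_kwargs(
    remove_selectors: Optional[List[str]] = None,
    proxy: Optional[Dict[str, str]] = None,
    headless: bool = True,
    continue_on_failure: bool = True,
) -> Dict[str, Any]:
    """Collect keyword arguments matching the __init__ signature (hypothetical helper)."""
    kwargs: Dict[str, Any] = {
        "continue_on_failure": continue_on_failure,
        "headless": headless,
    }
    # Pass the optional arguments only when they were actually provided.
    if remove_selectors:
        kwargs["remove_selectors"] = list(remove_selectors)
    if proxy:
        kwargs["proxy"] = dict(proxy)
    return kwargs


def make_loader(urls: List[str], **options: Any):
    """Build a PlaywrightURLLoader from the collected options (hypothetical wrapper)."""
    # Deferred import: needs langchain-community, playwright and unstructured.
    from langchain_community.document_loaders import PlaywrightURLLoader

    return PlaywrightURLLoader(urls, **make_loader_kwargs(**options))
```

With the dependencies installed, `make_loader(["https://example.com"], remove_selectors=["header", "footer"])` would return a loader configured with the defaults plus the two selectors; the URL and selectors here are placeholders.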
- async alazy_load() → AsyncIterator[Document]
Load the specified URLs with Playwright and create Documents asynchronously. Use this function when in a Jupyter notebook environment.
- Returns:
An async iterator over Document instances with loaded content.
- Return type:
AsyncIterator[Document]
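Because `alazy_load` returns an async iterator, it is consumed with `async for` rather than indexed like a list. A minimal sketch, assuming each Document carries its URL under the `source` metadata key; `collect_sources` and `loaded_sources` are hypothetical helper names:

```python
import asyncio
from typing import Any, AsyncIterator, List


async def collect_sources(docs: AsyncIterator[Any]) -> List[str]:
    """Record the source of each Document as it arrives from the iterator."""
    sources: List[str] = []
    async for doc in docs:
        # Assumption: the loader stores the page URL under metadata["source"].
        sources.append(doc.metadata.get("source", ""))
    return sources


async def loaded_sources(urls: List[str]) -> List[str]:
    """Stream Documents from a PlaywrightURLLoader and return their sources."""
    # Deferred import: needs langchain-community, playwright and unstructured.
    from langchain_community.document_loaders import PlaywrightURLLoader

    loader = PlaywrightURLLoader(urls)
    return await collect_sources(loader.alazy_load())
```

In a notebook cell this could be called as `await loaded_sources(urls)`; in a plain script, `asyncio.run(loaded_sources(urls))` would drive the same coroutine.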
- async aload() → List[Document]
Load the specified URLs with Playwright and create Documents asynchronously. Use this function when in a Jupyter notebook environment.
- Returns:
A list of Document instances with loaded content.
- Return type:
List[Document]
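The notebook advice above most likely reflects that Jupyter already runs an event loop, which the synchronous loading path cannot share. The sketch below shows one way to tell the two situations apart; `in_running_loop` and `load_all` are hypothetical names, and the loader import is deferred.

```python
import asyncio
from typing import Any, List


def in_running_loop() -> bool:
    """Return True when called from inside a running asyncio event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        return False
    return True


async def load_all(urls: List[str]) -> List[Any]:
    """Load every URL through the asynchronous API."""
    # Deferred import: needs langchain-community, playwright and unstructured.
    from langchain_community.document_loaders import PlaywrightURLLoader

    loader = PlaywrightURLLoader(urls)
    return await loader.aload()
```

When `in_running_loop()` is True, as in a notebook cell, the call would be `docs = await load_all(urls)`. In a regular script it would be `docs = asyncio.run(load_all(urls))`.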
- lazy_load() → Iterator[Document]
Load the specified URLs using Playwright and create Document instances.
- Returns:
An iterator over Document instances with loaded content.
- Return type:
Iterator[Document]
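Since `lazy_load` yields Documents one at a time, each page can be processed as soon as it is parsed. A minimal sketch, assuming Documents expose `page_content` and a `source` metadata entry; `summarize` and `summarize_urls` are hypothetical helpers:

```python
from typing import Any, Iterable, Iterator, List, Tuple


def summarize(docs: Iterable[Any]) -> Iterator[Tuple[str, int]]:
    """Yield (source, character count) for each Document, one at a time."""
    for doc in docs:
        # Assumption: the loader stores the page URL under metadata["source"].
        yield doc.metadata.get("source", ""), len(doc.page_content)


def summarize_urls(urls: List[str]) -> Iterator[Tuple[str, int]]:
    """Stream Documents from a PlaywrightURLLoader and summarize each one."""
    # Deferred import: needs langchain-community, playwright and unstructured.
    from langchain_community.document_loaders import PlaywrightURLLoader

    # continue_on_failure=True keeps going when a single URL fails to load.
    loader = PlaywrightURLLoader(urls, continue_on_failure=True)
    return summarize(loader.lazy_load())
```

Iterating with `for source, size in summarize_urls(urls): ...` would handle pages incrementally instead of holding the full list in memory.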
- load_and_split(text_splitter: TextSplitter | None = None) → List[Document]
Load Documents and split into chunks. Chunks are returned as Documents.
Do not override this method. It should be considered deprecated.
- Parameters:
text_splitter (Optional[TextSplitter]): TextSplitter instance to use for splitting documents. Defaults to RecursiveCharacterTextSplitter.
- Returns:
List of Documents.
- Return type:
List[Document]
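Given the deprecation note, the same result can be written out as two explicit steps. This sketch assumes `load_and_split` behaves like `load()` followed by the splitter's `split_documents`, and that `RecursiveCharacterTextSplitter` is importable from the `langchain_text_splitters` package. The helper names and the chunk sizes are illustrative assumptions.

```python
from typing import Any, List


def split_loaded(loader: Any, splitter: Any) -> List[Any]:
    """Load every document, then split the results into chunks."""
    documents = loader.load()
    return splitter.split_documents(documents)


def load_chunks(urls: List[str], chunk_size: int = 1000, chunk_overlap: int = 100) -> List[Any]:
    """Load URLs with Playwright and split them with an explicit splitter."""
    # Deferred imports: need langchain-community, langchain-text-splitters,
    # playwright and unstructured.
    from langchain_community.document_loaders import PlaywrightURLLoader
    from langchain_text_splitters import RecursiveCharacterTextSplitter

    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=chunk_overlap
    )
    return split_loaded(PlaywrightURLLoader(urls), splitter)
```

Keeping the two steps separate makes it easy to swap in a different splitter or to inspect the unsplit Documents first.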