URL

This example covers how to load HTML documents from a list of URLs into the Document format that we can use downstream.

Unstructured URL Loader

You have to install the unstructured library:

!pip install -U unstructured

from langchain_community.document_loaders import UnstructuredURLLoader

urls = [
    "https://www.understandingwar.org/backgrounder/russian-offensive-campaign-assessment-february-8-2023",
    "https://www.understandingwar.org/backgrounder/russian-offensive-campaign-assessment-february-9-2023",
]

If loading fails with an SSL verification error, pass ssl_verify=False together with headers=headers (a dict of HTTP request headers) when constructing the loader.

loader = UnstructuredURLLoader(urls=urls)

data = loader.load()
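
loader.load() returns a list of Document objects, each carrying the extracted text in page_content and a metadata dict. As a rough sketch of how that result might be inspected, the helper below returns one summary line per document. preview_documents is a hypothetical name, not part of LangChain. The sketch assumes the loader records the originating URL under the metadata key "source", and it uses SimpleNamespace stand-ins so that it runs without network access.

```python
from types import SimpleNamespace


def preview_documents(docs, width=60):
    """Return one summary line per document: its source and the start of its text."""
    lines = []
    for doc in docs:
        # Assumption: the loader stores the originating URL under "source".
        source = doc.metadata.get("source", "<unknown>")
        # Collapse runs of whitespace so the preview stays on a single line.
        text = " ".join(doc.page_content.split())
        lines.append(f"{source}: {text[:width]}")
    return lines


# Stand-in objects with the same two attributes as a loaded Document.
# With a real loader you would pass `data` from `loader.load()` instead.
sample = [
    SimpleNamespace(
        page_content="First   page\n\ntext",
        metadata={"source": "https://example.com/a"},
    ),
    SimpleNamespace(page_content="Second page text", metadata={}),
]

for line in preview_documents(sample):
    print(line)
```

Because the helper relies only on the page_content and metadata attributes, the same call should work on the output of any of the loaders on this page.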

Selenium URL Loader

This covers how to load HTML documents from a list of URLs using the SeleniumURLLoader.

Using Selenium allows us to load pages that require JavaScript to render.

To use the SeleniumURLLoader, you have to install selenium and unstructured.

!pip install -U selenium unstructured

from langchain_community.document_loaders import SeleniumURLLoader

urls = [
    "https://www.youtube.com/watch?v=dQw4w9WgXcQ",
    "https://goo.gl/maps/NDSHwePEyaHMFGwh8",
]

loader = SeleniumURLLoader(urls=urls)

data = loader.load()
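
A page that fails to load does not necessarily raise an exception: with the continue_on_failure option (assumed here to be left at its default of True), the loader logs the error and skips the page, so data can hold fewer documents than urls. A minimal sketch of how the skipped URLs might be detected follows. missing_urls is a hypothetical helper, and it assumes each document keeps the requested URL under the metadata key "source".

```python
from types import SimpleNamespace


def missing_urls(urls, docs):
    """Return the URLs, in order, that have no document with a matching source."""
    # Assumption: each document records the requested URL under "source".
    loaded = {doc.metadata.get("source") for doc in docs}
    return [url for url in urls if url not in loaded]


requested = ["https://example.com/ok", "https://example.com/broken"]

# Stand-in for the result of `loader.load()` when the second page failed.
loaded_docs = [
    SimpleNamespace(
        page_content="Page text",
        metadata={"source": "https://example.com/ok"},
    ),
]

print(missing_urls(requested, loaded_docs))
```

With a real loader, the call would be missing_urls(urls, data).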

Playwright URL Loader

This covers how to load HTML documents from a list of URLs using the PlaywrightURLLoader.

Playwright enables reliable end-to-end testing for modern web apps.

As in the Selenium case, Playwright allows us to load pages that require JavaScript to render.

To use the PlaywrightURLLoader, you have to install playwright and unstructured. Additionally, you have to install the browser binaries that Playwright drives, including Chromium:

!pip install -U playwright unstructured

!playwright install

from langchain_community.document_loaders import PlaywrightURLLoader

urls = [
    "https://www.youtube.com/watch?v=dQw4w9WgXcQ",
    "https://goo.gl/maps/NDSHwePEyaHMFGwh8",
]

loader = PlaywrightURLLoader(urls=urls, remove_selectors=["header", "footer"])

data = loader.load()
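
The remove_selectors argument lists CSS selectors whose matching elements are dropped from the page before its text is extracted, which is useful for navigation headers and footers. The sketch below only illustrates that idea with Python's standard html.parser module. It is not how PlaywrightURLLoader works internally, it matches plain tag names rather than full CSS selectors, and text_without_tags and its parser class are hypothetical names.

```python
from html.parser import HTMLParser


class _TagSkippingParser(HTMLParser):
    """Collect text, ignoring anything nested inside the given tag names."""

    def __init__(self, remove_tags):
        super().__init__()
        self.remove_tags = set(remove_tags)
        self.skip_depth = 0  # greater than 0 while inside a removed element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.remove_tags:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.remove_tags and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text only when we are outside every removed element.
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def text_without_tags(html, remove_tags):
    """Return the text of `html` with the listed elements left out."""
    parser = _TagSkippingParser(remove_tags)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page = (
    "<html><body><header><h1>Site menu</h1></header>"
    "<main><p>Article text.</p></main>"
    "<footer>Copyright notice</footer></body></html>"
)
print(text_without_tags(page, ["header", "footer"]))
```

Passing an empty list keeps the header and footer text, which shows the boilerplate that ends up in a document when nothing is removed.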
