Skip to main content

URL

This example covers how to load HTML documents from a list of URLs into the Document format that we can use downstream.

Unstructured URL Loader

You have to install the unstructured library:

!pip install -U unstructured
from langchain_community.document_loaders import UnstructuredURLLoader
urls = [
"https://www.understandingwar.org/backgrounder/russian-offensive-campaign-assessment-february-8-2023",
"https://www.understandingwar.org/backgrounder/russian-offensive-campaign-assessment-february-9-2023",
]

Pass in ssl_verify=False with headers=headers to get past ssl_verification error.

loader = UnstructuredURLLoader(urls=urls)
data = loader.load()

Selenium URL Loader

This covers how to load HTML documents from a list of URLs using the SeleniumURLLoader.

Using Selenium allows us to load pages that require JavaScript to render.

To use the SeleniumURLLoader, you have to install selenium and unstructured.

!pip install -U selenium unstructured
from langchain_community.document_loaders import SeleniumURLLoader
urls = [
"https://www.youtube.com/watch?v=dQw4w9WgXcQ",
"https://goo.gl/maps/NDSHwePEyaHMFGwh8",
]
loader = SeleniumURLLoader(urls=urls)
data = loader.load()

Playwright URL Loader

This covers how to load HTML documents from a list of URLs using the PlaywrightURLLoader.

Playwright enables reliable end-to-end testing for modern web apps.

As in the Selenium case, Playwright allows us to load and render the JavaScript pages.

To use the PlaywrightURLLoader, you have to install playwright and unstructured. Additionally, you have to install the Playwright Chromium browser:

!pip install -U playwright unstructured
!playwright install
from langchain_community.document_loaders import PlaywrightURLLoader
urls = [
"https://www.youtube.com/watch?v=dQw4w9WgXcQ",
"https://goo.gl/maps/NDSHwePEyaHMFGwh8",
]
loader = PlaywrightURLLoader(urls=urls, remove_selectors=["header", "footer"])
data = loader.load()

Help us out by providing feedback on this documentation page: