HtmlLinkExtractor#

class langchain_community.graph_vectorstores.extractors.html_link_extractor.HtmlLinkExtractor(*, kind: str = 'hyperlink', drop_fragments: bool = True)[source]#

Beta

This feature is in beta. It is actively being worked on, so the API may change.

Extract hyperlinks from HTML content.

Expects the input to be an HTML string or a BeautifulSoup object.

Example:

extractor = HtmlLinkExtractor()
results = extractor.extract_one(HtmlInput(html, url))
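To make the input and output of extract_one() more concrete, here is a minimal, standard-library-only sketch of the general idea: collect the href of every <a> tag and resolve it against the page URL. It is not the library's implementation (HtmlLinkExtractor relies on BeautifulSoup); HrefCollector, collect_hrefs, and the example.com URLs are hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class HrefCollector(HTMLParser):
    """Hypothetical helper: gathers absolute URLs from <a href="..."> tags."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.hrefs: set[str] = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the URL of the page itself.
                self.hrefs.add(urljoin(self.base_url, value))


def collect_hrefs(html: str, base_url: str) -> set[str]:
    """Return the set of absolute link targets found in an HTML string."""
    parser = HrefCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.hrefs


html = '<p><a href="/docs/b.html">B</a> and <a href="https://example.org/c">C</a></p>'
hrefs = collect_hrefs(html, "https://example.com/docs/a.html")
print(sorted(hrefs))
```

The page URL plays the same role here as the second argument of HtmlInput in the example above: without it, a relative link such as /docs/b.html could not be turned into a full URL.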

How to link Documents on hyperlinks in HTML#

Preliminaries#

Install the beautifulsoup4 package:

pip install -q langchain_community beautifulsoup4

Usage#

For this example, we’ll use AsyncHtmlLoader to scrape two HTML pages, one of which has a hyperlink to the other. Then we use HtmlLinkExtractor to create the links in the documents.

Using extract_one()#

We can use extract_one() on a document to get the links and add the links to the document metadata with add_links():

from langchain_community.document_loaders import AsyncHtmlLoader
from langchain_community.graph_vectorstores.extractors import (
    HtmlInput,
    HtmlLinkExtractor,
)
from langchain_community.graph_vectorstores.links import add_links
from langchain_core.documents import Document

loader = AsyncHtmlLoader(
    [
        "https://python.langchain.com/v0.2/docs/integrations/providers/astradb/",
        "https://docs.datastax.com/en/astra/home/astra.html",
    ]
)

documents = loader.load()

html_extractor = HtmlLinkExtractor()

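for doc in documents:
    # AsyncHtmlLoader stores each page's URL under the "source" metadata key.
    url = doc.metadata["source"]
    links = html_extractor.extract_one(HtmlInput(doc.page_content, url))
    add_links(doc, links)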

documents[0].metadata["links"][:5]

[Link(kind='hyperlink', direction='out', tag='https://python.langchain.com/v0.2/docs/integrations/providers/spreedly/'),
 Link(kind='hyperlink', direction='out', tag='https://python.langchain.com/v0.2/docs/integrations/providers/nvidia/'),
 Link(kind='hyperlink', direction='out', tag='https://python.langchain.com/v0.2/docs/integrations/providers/ray_serve/'),
 Link(kind='hyperlink', direction='out', tag='https://python.langchain.com/v0.2/docs/integrations/providers/bageldb/'),
 Link(kind='hyperlink', direction='out', tag='https://python.langchain.com/v0.2/docs/introduction/')]

Using as_document_extractor()#

If you use a document loader that returns the raw HTML and sets the source key in the document metadata, such as AsyncHtmlLoader, you can simplify this by using as_document_extractor(), which takes a Document directly as input:

from langchain_community.document_loaders import AsyncHtmlLoader
from langchain_community.graph_vectorstores.extractors import HtmlLinkExtractor
from langchain_community.graph_vectorstores.links import add_links

loader = AsyncHtmlLoader(
    [
        "https://python.langchain.com/v0.2/docs/integrations/providers/astradb/",
        "https://docs.datastax.com/en/astra/home/astra.html",
    ]
)
documents = loader.load()
html_extractor = HtmlLinkExtractor().as_document_extractor()

for document in documents:
    links = html_extractor.extract_one(document)
    add_links(document, links)

documents[0].metadata["links"][:5]

[Link(kind='hyperlink', direction='out', tag='https://python.langchain.com/v0.2/docs/integrations/providers/spreedly/'),
 Link(kind='hyperlink', direction='out', tag='https://python.langchain.com/v0.2/docs/integrations/providers/nvidia/'),
 Link(kind='hyperlink', direction='out', tag='https://python.langchain.com/v0.2/docs/integrations/providers/ray_serve/'),
 Link(kind='hyperlink', direction='out', tag='https://python.langchain.com/v0.2/docs/integrations/providers/bageldb/'),
 Link(kind='hyperlink', direction='out', tag='https://python.langchain.com/v0.2/docs/introduction/')]

Using LinkExtractorTransformer#

Using the LinkExtractorTransformer, we can simplify the link extraction:

from langchain_community.document_loaders import AsyncHtmlLoader
from langchain_community.graph_vectorstores.extractors import (
    HtmlLinkExtractor,
    LinkExtractorTransformer,
)
from langchain_community.graph_vectorstores.links import add_links

loader = AsyncHtmlLoader(
    [
        "https://python.langchain.com/v0.2/docs/integrations/providers/astradb/",
        "https://docs.datastax.com/en/astra/home/astra.html",
    ]
)

documents = loader.load()
transformer = LinkExtractorTransformer([HtmlLinkExtractor().as_document_extractor()])
documents = transformer.transform_documents(documents)

documents[0].metadata["links"][:5]

[Link(kind='hyperlink', direction='out', tag='https://python.langchain.com/v0.2/docs/integrations/providers/spreedly/'),
 Link(kind='hyperlink', direction='out', tag='https://python.langchain.com/v0.2/docs/integrations/providers/nvidia/'),
 Link(kind='hyperlink', direction='out', tag='https://python.langchain.com/v0.2/docs/integrations/providers/ray_serve/'),
 Link(kind='hyperlink', direction='out', tag='https://python.langchain.com/v0.2/docs/integrations/providers/bageldb/'),
 Link(kind='hyperlink', direction='out', tag='https://python.langchain.com/v0.2/docs/introduction/')]

We can check that there is a link from the first document to the second:

for doc_to in documents:
    for link_to in doc_to.metadata["links"]:
        # Only incoming links can be the target side of an edge.
        if link_to.direction != "in":
            continue
        for doc_from in documents:
            for link_from in doc_from.metadata["links"]:
                if link_from.direction == "out" and link_to.tag == link_from.tag:
                    print(
                        f"Found link from {doc_from.metadata['source']} to {doc_to.metadata['source']}."
                    )

Found link from https://python.langchain.com/v0.2/docs/integrations/providers/astradb/ to https://docs.datastax.com/en/astra/home/astra.html.
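The nested loops above compare every pair of links. With many documents, one alternative is to index the outgoing links by tag first and then look up each incoming link. The sketch below uses a stand-in Link tuple and a plain dictionary instead of real Document objects, so it runs without LangChain installed. It assumes, as the output above suggests, that each document carries an "in" link tagged with its own URL; find_edges and the example.com URLs are made up for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class Link(NamedTuple):
    """Stand-in with the same three fields as the Link objects shown above."""

    kind: str
    direction: str
    tag: str


def find_edges(docs: dict[str, set[Link]]) -> set[tuple[str, str]]:
    """Return (source, target) pairs where an 'out' tag equals an 'in' tag."""
    # Index: tag -> sources of the documents with an outgoing link to that tag.
    outgoing = defaultdict(set)
    for source, links in docs.items():
        for link in links:
            if link.direction == "out":
                outgoing[link.tag].add(source)

    edges = set()
    for target, links in docs.items():
        for link in links:
            if link.direction == "in":
                for source in outgoing.get(link.tag, ()):
                    edges.add((source, target))
    return edges


# Hypothetical data: page "a" links to page "b".
docs = {
    "https://example.com/a": {
        Link("hyperlink", "in", "https://example.com/a"),
        Link("hyperlink", "out", "https://example.com/b"),
    },
    "https://example.com/b": {
        Link("hyperlink", "in", "https://example.com/b"),
    },
}

edges = find_edges(docs)
for source, target in sorted(edges):
    print(f"Found link from {source} to {target}.")
```

Like the loop above, this matches on the tag only; it visits each link a constant number of times instead of once per pair.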

The documents with URL links can then be added to a GraphVectorStore:

from langchain_community.graph_vectorstores import CassandraGraphVectorStore

store = CassandraGraphVectorStore.from_documents(documents=documents, embedding=...)

Parameters:

kind (str) – The kind of edge to extract. Defaults to 'hyperlink'.
drop_fragments (bool) – Whether fragments in URLs and links should be dropped. Defaults to True.
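The effect of drop_fragments can be pictured with the standard library's urllib.parse.urldefrag. This is only an illustration of the concept, under the assumption that "fragment" means the part of a URL after "#"; it does not reproduce the extractor's internal code, and normalize is a hypothetical helper.

```python
from urllib.parse import urldefrag


def normalize(url: str, drop_fragments: bool = True) -> str:
    """Hypothetical helper: optionally strip the '#fragment' part of a URL."""
    return urldefrag(url).url if drop_fragments else url


a = normalize("https://example.com/guide.html#install")
b = normalize("https://example.com/guide.html#usage")
c = normalize("https://example.com/guide.html#usage", drop_fragments=False)

# With fragments dropped, both anchors collapse to the same page URL,
# so they would yield one link tag rather than two.
print(a == b)
print(c)
```

Dropping fragments is usually what you want when links should connect whole pages: two anchors into the same page then point at the same target.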

Methods

`__init__`(*[, kind, drop_fragments])	Extract hyperlinks from HTML content.
`as_document_extractor`([url_metadata_key])	Return a LinkExtractor that applies to documents.
`extract_many`(inputs)	Extract the links from each input.
`extract_one`(input)	Extract the links from a single input.

__init__(*, kind: str = 'hyperlink', drop_fragments: bool = True)[source]#

Extract hyperlinks from HTML content.

Expects the input to be an HTML string or a BeautifulSoup object.

Parameters:

kind (str) – The kind of edge to extract. Defaults to 'hyperlink'.
drop_fragments (bool) – Whether fragments in URLs and links should be dropped. Defaults to True.

as_document_extractor(url_metadata_key: str = 'source') → LinkExtractor[Document][source]#

Return a LinkExtractor that applies to documents.

Note

Since the HtmlLinkExtractor parses HTML, if you use it together with other similar link extractors it may be more efficient to parse the HTML once and call the link extractors directly on the parsed BeautifulSoup object.

Parameters:: url_metadata_key (str) – The name of the field in the document metadata that holds the URL of the document.
Return type:: LinkExtractor[Document]

extract_many(inputs: Iterable[InputT]) → Iterable[Set[Link]]#

Extract the links from each input.

Parameters:: inputs (Iterable[InputT]) – The input content to extract edges from.
Returns:: Iterable over the set of links extracted from the input.
Return type:: Iterable[Set[Link]]

extract_one(input: HtmlInput) → Set[Link][source]#

Extract the links from a single input.

Parameters:: input (HtmlInput) – The input content to extract edges from.
Returns:: Set of links extracted from the input.
Return type:: Set[Link]
