HtmlLinkExtractor#
- class langchain_community.graph_vectorstores.extractors.html_link_extractor.HtmlLinkExtractor(
-     *,
-     kind: str = 'hyperlink',
-     drop_fragments: bool = True,
- )
Beta
This feature is in beta. It is actively being worked on, so the API may change.
Extract hyperlinks from HTML content.
Expects the input to be an HTML string or a BeautifulSoup object.
Example:
extractor = HtmlLinkExtractor()
results = extractor.extract_one(HtmlInput(html, url))
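Conceptually, the extractor finds `<a href>` values in the HTML, resolves them against the page URL, and (by default) drops `#fragments`. The following is an illustrative standard-library sketch of that behavior, not the extractor's actual implementation (which uses BeautifulSoup); `extract_hyperlinks` is a hypothetical helper:

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class _HrefCollector(HTMLParser):
    """Collect href attributes from <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def extract_hyperlinks(html: str, page_url: str, drop_fragments: bool = True) -> set:
    """Resolve each href against the page URL, optionally dropping #fragments."""
    parser = _HrefCollector()
    parser.feed(html)
    urls = {urljoin(page_url, href) for href in parser.hrefs}
    if drop_fragments:
        urls = {urldefrag(url).url for url in urls}
    return urls


html = '<a href="/docs/intro">Intro</a> <a href="https://example.com/a#x">A</a>'
print(sorted(extract_hyperlinks(html, "https://example.com/home")))
```

Relative hrefs such as `/docs/intro` resolve against the page URL, which is why `HtmlInput` (and `as_document_extractor()`) need the document's URL alongside its HTML.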
How to link Documents on hyperlinks in HTML#
Preliminaries#
Install the langchain_community and beautifulsoup4 packages:
pip install -q langchain_community beautifulsoup4
Usage#
For this example, we'll scrape two HTML pages, one of which links to the other, using an AsyncHtmlLoader. Then we use the HtmlLinkExtractor to create the links in the documents.
Using extract_one()#
We can use extract_one() on a document to get the links and add them to the document metadata with add_links():
from langchain_community.document_loaders import AsyncHtmlLoader
from langchain_community.graph_vectorstores.extractors import (
    HtmlInput,
    HtmlLinkExtractor,
)
from langchain_community.graph_vectorstores.links import add_links

loader = AsyncHtmlLoader(
    [
        "https://python.langchain.com/docs/integrations/providers/astradb/",
        "https://docs.datastax.com/en/astra/home/astra.html",
    ]
)
documents = loader.load()

html_extractor = HtmlLinkExtractor()

for doc in documents:
    links = html_extractor.extract_one(HtmlInput(doc.page_content, doc.metadata["source"]))
    add_links(doc, links)

documents[0].metadata["links"][:5]
[Link(kind='hyperlink', direction='out', tag='https://python.langchain.com/docs/integrations/providers/spreedly/'), Link(kind='hyperlink', direction='out', tag='https://python.langchain.com/docs/integrations/providers/nvidia/'), Link(kind='hyperlink', direction='out', tag='https://python.langchain.com/docs/integrations/providers/ray_serve/'), Link(kind='hyperlink', direction='out', tag='https://python.langchain.com/docs/integrations/providers/bageldb/'), Link(kind='hyperlink', direction='out', tag='https://python.langchain.com/docs/introduction/')]
Using as_document_extractor()#
If you use a document loader that returns raw HTML and sets the source key in the document metadata, such as AsyncHtmlLoader, you can simplify this by using as_document_extractor(), which takes a Document directly as input:
from langchain_community.document_loaders import AsyncHtmlLoader
from langchain_community.graph_vectorstores.extractors import HtmlLinkExtractor
from langchain_community.graph_vectorstores.links import add_links

loader = AsyncHtmlLoader(
    [
        "https://python.langchain.com/docs/integrations/providers/astradb/",
        "https://docs.datastax.com/en/astra/home/astra.html",
    ]
)
documents = loader.load()

html_extractor = HtmlLinkExtractor().as_document_extractor()

for document in documents:
    links = html_extractor.extract_one(document)
    add_links(document, links)

documents[0].metadata["links"][:5]
[Link(kind='hyperlink', direction='out', tag='https://python.langchain.com/docs/integrations/providers/spreedly/'), Link(kind='hyperlink', direction='out', tag='https://python.langchain.com/docs/integrations/providers/nvidia/'), Link(kind='hyperlink', direction='out', tag='https://python.langchain.com/docs/integrations/providers/ray_serve/'), Link(kind='hyperlink', direction='out', tag='https://python.langchain.com/docs/integrations/providers/bageldb/'), Link(kind='hyperlink', direction='out', tag='https://python.langchain.com/docs/introduction/')]
Using LinkExtractorTransformer#
Using the LinkExtractorTransformer, we can simplify the link extraction further:
from langchain_community.document_loaders import AsyncHtmlLoader
from langchain_community.graph_vectorstores.extractors import (
    HtmlLinkExtractor,
    LinkExtractorTransformer,
)

loader = AsyncHtmlLoader(
    [
        "https://python.langchain.com/docs/integrations/providers/astradb/",
        "https://docs.datastax.com/en/astra/home/astra.html",
    ]
)
documents = loader.load()

transformer = LinkExtractorTransformer([HtmlLinkExtractor().as_document_extractor()])
documents = transformer.transform_documents(documents)

documents[0].metadata["links"][:5]
[Link(kind='hyperlink', direction='out', tag='https://python.langchain.com/docs/integrations/providers/spreedly/'), Link(kind='hyperlink', direction='out', tag='https://python.langchain.com/docs/integrations/providers/nvidia/'), Link(kind='hyperlink', direction='out', tag='https://python.langchain.com/docs/integrations/providers/ray_serve/'), Link(kind='hyperlink', direction='out', tag='https://python.langchain.com/docs/integrations/providers/bageldb/'), Link(kind='hyperlink', direction='out', tag='https://python.langchain.com/docs/introduction/')]
We can check that there is a link from the first document to the second:
for doc_to in documents:
    for link_to in doc_to.metadata["links"]:
        if link_to.direction == "in":
            for doc_from in documents:
                for link_from in doc_from.metadata["links"]:
                    if (
                        link_from.direction == "out"
                        and link_to.tag == link_from.tag
                    ):
                        print(
                            f"Found link from {doc_from.metadata['source']} to {doc_to.metadata['source']}."
                        )
Found link from https://python.langchain.com/docs/integrations/providers/astradb/ to https://docs.datastax.com/en/astra/home/astra.html.
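The nested-loop check above compares every incoming link against every outgoing link, which is quadratic in the number of links. For larger collections, outgoing links can be indexed by tag first. A pure-Python sketch of that idea, using a stand-in Link tuple rather than the library's Link class:

```python
from collections import namedtuple

# Stand-in for langchain's Link; only the fields used here.
Link = namedtuple("Link", ["kind", "direction", "tag"])

# Hypothetical documents: page_a links out to the URL that identifies page_b.
docs = {
    "page_a": [Link("hyperlink", "out", "https://example.com/b")],
    "page_b": [Link("hyperlink", "in", "https://example.com/b")],
}

# Index outgoing links by tag so each incoming link is matched in O(1).
out_index = {}
for source, links in docs.items():
    for link in links:
        if link.direction == "out":
            out_index.setdefault(link.tag, []).append(source)

found = []
for target, links in docs.items():
    for link in links:
        if link.direction == "in":
            for source in out_index.get(link.tag, []):
                found.append((source, target))

print(found)
```

This mirrors how the extractor represents pages: each document gets one "in" link for its own URL and "out" links for its hrefs, so a match between an "out" tag and an "in" tag is a hyperlink from one document to the other.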
The documents with URL links can then be added to a GraphVectorStore:
from langchain_community.graph_vectorstores import CassandraGraphVectorStore

store = CassandraGraphVectorStore.from_documents(documents=documents, embedding=...)
- param kind:
    The kind of edge to extract. Defaults to 'hyperlink'.
- param drop_fragments:
    Whether fragments in URLs and links should be dropped. Defaults to True.
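To illustrate what drop_fragments controls: with the default True, the #fragment part of each URL is removed, so links that differ only by fragment collapse to the same tag. A standard-library sketch of that behavior (not the extractor's code):

```python
from urllib.parse import urldefrag

url = "https://docs.datastax.com/en/astra/home/astra.html#get-started"

# urldefrag splits off the fragment; .url is the URL without it.
print(urldefrag(url).url)
```

With drop_fragments=False, links to different fragments of the same page would be kept as distinct tags.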
Methods
__init__(*[, kind, drop_fragments])
    Extract hyperlinks from HTML content.
as_document_extractor([url_metadata_key])
    Return a LinkExtractor that applies to documents.
extract_many(inputs)
    Add edges from each input to the corresponding documents.
extract_one(input)
    Add edges from each input to the corresponding documents.
- __init__(
-     *,
-     kind: str = 'hyperlink',
-     drop_fragments: bool = True,
- )
Extract hyperlinks from HTML content.
Expects the input to be an HTML string or a BeautifulSoup object.
Example:
extractor = HtmlLinkExtractor()
results = extractor.extract_one(HtmlInput(html, url))
- Parameters:
kind (str)
drop_fragments (bool)
- as_document_extractor(
-     url_metadata_key: str = 'source',
- )
Return a LinkExtractor that applies to documents.
Note
Since the HtmlLinkExtractor parses HTML, if you use it alongside other similar link extractors it may be more efficient to call the extractors directly on the parsed BeautifulSoup object.
- Parameters:
    url_metadata_key (str) – The name of the field in the document metadata containing the URL of the document.
- Return type:
    LinkExtractor[Document]
- extract_many(
-     inputs: Iterable[InputT],
- )
Add edges from each input to the corresponding documents.
- Parameters:
inputs (Iterable[InputT]) – The input content to extract edges from.
- Returns:
Iterable over the set of links extracted from the input.
- Return type:
Iterable[Set[Link]]