BeautifulSoupTransformer

class langchain_community.document_transformers.beautiful_soup_transformer.BeautifulSoupTransformer

Transform HTML content by extracting specific tags and removing unwanted ones.

Example

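from langchain_community.document_transformers import BeautifulSoupTransformer
from langchain_core.documents import Document

# Any sequence of Documents whose page_content is HTML will do.
docs = [Document(page_content="<p>Hello, world!</p>")]

bs4_transformer = BeautifulSoupTransformer()
docs_transformed = bs4_transformer.transform_documents(docs)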

Initialize the transformer.

This checks if the BeautifulSoup4 package is installed. If not, it raises an ImportError.

Methods

__init__()

Initialize the transformer.

atransform_documents(documents, **kwargs)

Asynchronously transform a list of documents.

extract_tags(html_content, tags, *[, ...])

Extract specific tags from the given HTML content.

remove_unnecessary_lines(content)

Clean up the content by removing unnecessary lines.

remove_unwanted_classnames(html_content, ...)

Remove unwanted class names from the given HTML content.

remove_unwanted_tags(html_content, unwanted_tags)

Remove unwanted tags from the given HTML content.

transform_documents(documents[, ...])

Transform a list of Document objects by cleaning their HTML content.

__init__() → None

Initialize the transformer.

This checks if the BeautifulSoup4 package is installed. If not, it raises an ImportError.

Return type:

None
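
Because construction raises an ImportError when BeautifulSoup4 is missing, callers may want to guard it. The following is a minimal sketch, assuming the relevant pip packages are `langchain-community` and `beautifulsoup4`; the `transformer = None` fallback is an illustrative choice, not something the library prescribes.

```python
# Guard the import and construction so a missing dependency
# produces a readable message instead of an unhandled traceback.
try:
    from langchain_community.document_transformers import BeautifulSoupTransformer

    transformer = BeautifulSoupTransformer()
except ImportError as exc:
    # Illustrative fallback: downstream code can check for None.
    transformer = None
    print(f"BeautifulSoupTransformer is unavailable: {exc}")
```

Code that runs later can test `transformer is None` and skip the HTML clean-up step, or re-raise if the transformer is required.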

async atransform_documents(documents: Sequence[Document], **kwargs: Any) → Sequence[Document]

Asynchronously transform a list of documents.

Parameters:
  • documents (Sequence[Document]) – A sequence of Documents to be transformed.

  • kwargs (Any)

Returns:

A sequence of transformed Documents.

Return type:

Sequence[Document]

static extract_tags(html_content: str, tags: List[str] | Tuple[str, ...], *, remove_comments: bool = False) → str

Extract specific tags from the given HTML content.

Parameters:
  • html_content (str) – The original HTML content string.

  • tags (List[str] | Tuple[str, ...]) – A list of tags to be extracted from the HTML.

  • remove_comments (bool) – If set to True, HTML comments will be removed.

Returns:

A string combining the content of the extracted tags.

Return type:

str

static remove_unnecessary_lines(content: str) → str

Clean up the content by removing unnecessary lines.

Parameters:

content (str) – A string, which may contain unnecessary lines or spaces.

Returns:

A cleaned string with unnecessary lines removed.

Return type:

str

static remove_unwanted_classnames(html_content: str, unwanted_classnames: List[str] | Tuple[str, ...]) → str

Remove unwanted class names from the given HTML content.

Parameters:
  • html_content (str) – The original HTML content string.

  • unwanted_classnames (List[str] | Tuple[str, ...]) – A list of classnames to be removed from the HTML.

Returns:

A cleaned HTML string with unwanted classnames removed.

Return type:

str

static remove_unwanted_tags(html_content: str, unwanted_tags: List[str] | Tuple[str, ...]) → str

Remove unwanted tags from the given HTML content.

Parameters:
  • html_content (str) – The original HTML content string.

  • unwanted_tags (List[str] | Tuple[str, ...]) – A list of tags to be removed from the HTML.

Returns:

A cleaned HTML string with unwanted tags removed.

Return type:

str

transform_documents(documents: Sequence[Document], unwanted_tags: List[str] | Tuple[str, ...] = ('script', 'style'), tags_to_extract: List[str] | Tuple[str, ...] = ('p', 'li', 'div', 'a'), remove_lines: bool = True, *, unwanted_classnames: Tuple[str, ...] | List[str] = (), remove_comments: bool = False, **kwargs: Any) → Sequence[Document]

Transform a list of Document objects by cleaning their HTML content.

Parameters:
  • documents (Sequence[Document]) – A sequence of Document objects containing HTML content.

  • unwanted_tags (List[str] | Tuple[str, ...]) – A list of tags to be removed from the HTML.

  • tags_to_extract (List[str] | Tuple[str, ...]) – A list of tags whose content will be extracted.

  • remove_lines (bool) – If set to True, unnecessary lines will be removed.

  • unwanted_classnames (Tuple[str, ...] | List[str]) – A list of class names to be removed from the HTML.

  • remove_comments (bool) – If set to True, comments will be removed.

  • kwargs (Any)

Returns:

A sequence of Document objects with transformed content.

Return type:

Sequence[Document]
