BeautifulSoupTransformer
- class langchain_community.document_transformers.beautiful_soup_transformer.BeautifulSoupTransformer
Transform HTML content by extracting specific tags and removing unwanted ones.
Example
from langchain_community.document_transformers import BeautifulSoupTransformer
bs4_transformer = BeautifulSoupTransformer()
docs_transformed = bs4_transformer.transform_documents(docs)
Initialize the transformer.
This checks if the BeautifulSoup4 package is installed. If not, it raises an ImportError.
Methods
- __init__(): Initialize the transformer.
- atransform_documents(documents, **kwargs): Asynchronously transform a list of documents.
- extract_tags(html_content, tags, *[, ...]): Extract specific tags from the given HTML content.
- remove_unnecessary_lines(content): Clean up the content by removing unnecessary lines.
- remove_unwanted_classnames(html_content, ...): Remove unwanted class names from the given HTML content.
- remove_unwanted_tags(html_content, unwanted_tags): Remove unwanted tags from the given HTML content.
- transform_documents(documents[, ...]): Transform a list of Document objects by cleaning their HTML content.
- __init__() -> None
Initialize the transformer.
This checks if the BeautifulSoup4 package is installed. If not, it raises an ImportError.
- Return type:
None
- async atransform_documents(documents: Sequence[Document], **kwargs: Any) -> Sequence[Document]
Asynchronously transform a list of documents.
- static extract_tags(html_content: str, tags: List[str] | Tuple[str, ...], *, remove_comments: bool = False) -> str
Extract specific tags from the given HTML content.
- Parameters:
html_content (str) – The original HTML content string.
tags (List[str] | Tuple[str, ...]) – A list or tuple of tags to be extracted from the HTML.
remove_comments (bool) – If set to True, comments will be removed.
- Returns:
A string combining the content of the extracted tags.
- Return type:
str
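As a rough illustration of the call described above, the sketch below runs extract_tags on a small, made-up HTML string. The sample markup and variable names are assumptions for the example. The import is guarded because the method needs both langchain-community and beautifulsoup4 to be installed.

```python
# Hypothetical sample input; only <p> and <li> content is requested.
html = "<h1>Title</h1><p>Body text</p><ul><li>Point</li></ul>"

try:
    from langchain_community.document_transformers import BeautifulSoupTransformer

    # Static method, so no transformer instance is needed.
    extracted = BeautifulSoupTransformer.extract_tags(html, ["p", "li"])
except ImportError:
    extracted = None  # langchain-community or beautifulsoup4 is missing

print(extracted)
```

Because the heading tag is not in the requested list, its text should not appear in the returned string.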
- static remove_unnecessary_lines(content: str) -> str
Clean up the content by removing unnecessary lines.
- Parameters:
content (str) – A string, which may contain unnecessary lines or spaces.
- Returns:
A cleaned string with unnecessary lines removed.
- Return type:
str
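A minimal sketch of calling this helper on its own, assuming a made-up string with blank lines and stray indentation. The input text and variable names are illustrative only, and the import is guarded in case langchain-community is not installed.

```python
# Hypothetical input with blank lines and padding around the text.
raw_text = "  first line  \n\n\n   second line\n"

try:
    from langchain_community.document_transformers import BeautifulSoupTransformer

    # Static method, so it can be called without creating a transformer.
    cleaned_text = BeautifulSoupTransformer.remove_unnecessary_lines(raw_text)
except ImportError:
    cleaned_text = None  # langchain-community is missing

print(repr(cleaned_text))
```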
- static remove_unwanted_classnames(html_content: str, unwanted_classnames: List[str] | Tuple[str, ...]) -> str
Remove unwanted class names from the given HTML content.
- Parameters:
html_content (str) – The original HTML content string.
unwanted_classnames (List[str] | Tuple[str, ...]) – A list or tuple of class names to be removed from the HTML.
- Returns:
A cleaned HTML string with unwanted class names removed.
- Return type:
str
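The following sketch shows one way this helper might be used. It assumes the method drops whole elements that carry a listed class, which the name and description suggest but this page does not spell out. The sample markup and the class name "ad" are made up for the example.

```python
# Hypothetical markup: one element marked with an unwanted class, one to keep.
html = '<div class="ad">Buy now</div><p class="body">Article text</p>'

try:
    from langchain_community.document_transformers import BeautifulSoupTransformer

    # Assumption: elements whose class matches "ad" are removed entirely.
    without_ads = BeautifulSoupTransformer.remove_unwanted_classnames(html, ["ad"])
except ImportError:
    without_ads = None  # langchain-community or beautifulsoup4 is missing

print(without_ads)
```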
- static remove_unwanted_tags(html_content: str, unwanted_tags: List[str] | Tuple[str, ...]) -> str
Remove unwanted tags from the given HTML content.
- Parameters:
html_content (str) – The original HTML content string.
unwanted_tags (List[str] | Tuple[str, ...]) – A list or tuple of tags to be removed from the HTML.
- Returns:
A cleaned HTML string with unwanted tags removed.
- Return type:
str
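A short sketch of stripping script and style elements before any further processing. The HTML snippet and variable names are assumptions for illustration, and the guarded import covers the case where the required packages are not installed.

```python
# Hypothetical markup containing tags that are usually noise for text extraction.
html = "<div><script>alert(1)</script><style>p {color: red}</style><p>Keep me</p></div>"

try:
    from langchain_community.document_transformers import BeautifulSoupTransformer

    # Static method: returns the HTML as a string with the listed tags removed.
    stripped_html = BeautifulSoupTransformer.remove_unwanted_tags(html, ["script", "style"])
except ImportError:
    stripped_html = None  # langchain-community or beautifulsoup4 is missing

print(stripped_html)
```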
- transform_documents(documents: Sequence[Document], unwanted_tags: List[str] | Tuple[str, ...] = ('script', 'style'), tags_to_extract: List[str] | Tuple[str, ...] = ('p', 'li', 'div', 'a'), remove_lines: bool = True, *, unwanted_classnames: Tuple[str, ...] | List[str] = (), remove_comments: bool = False, **kwargs: Any) -> Sequence[Document]
Transform a list of Document objects by cleaning their HTML content.
- Parameters:
documents (Sequence[Document]) – A sequence of Document objects containing HTML content.
unwanted_tags (List[str] | Tuple[str, ...]) – A list or tuple of tags to be removed from the HTML.
tags_to_extract (List[str] | Tuple[str, ...]) – A list or tuple of tags whose content will be extracted.
remove_lines (bool) – If set to True, unnecessary lines will be removed.
unwanted_classnames (Tuple[str, ...] | List[str]) – A list or tuple of class names to be removed from the HTML.
remove_comments (bool) – If set to True, comments will be removed.
kwargs (Any)
- Returns:
A sequence of Document objects with transformed content.
- Return type:
Sequence[Document]
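To show how the parameters above fit together, here is a minimal end-to-end sketch. The helper clean_page and the sample HTML are hypothetical, introduced only for this example. It assumes Document can be imported from langchain_core.documents, and it returns None when the required packages are missing, since the constructor raises ImportError without BeautifulSoup4.

```python
from typing import Optional


def clean_page(html: str) -> Optional[str]:
    """Hypothetical helper: clean one HTML page, or return None if dependencies are missing."""
    try:
        from langchain_core.documents import Document
        from langchain_community.document_transformers import BeautifulSoupTransformer

        transformer = BeautifulSoupTransformer()  # raises ImportError without beautifulsoup4
    except ImportError:
        return None

    docs = [Document(page_content=html)]
    transformed = transformer.transform_documents(
        docs,
        unwanted_tags=["script", "style"],  # removed before extraction
        tags_to_extract=["p", "li"],        # only these tags contribute text
        remove_lines=True,                  # tidy up leftover blank lines
    )
    return transformed[0].page_content


# Made-up sample page for the demonstration.
SAMPLE_HTML = (
    "<html><body><script>tracking()</script>"
    "<p>First paragraph.</p><ul><li>Item one</li></ul></body></html>"
)

result = clean_page(SAMPLE_HTML)
print(result)
```

Narrowing tags_to_extract from the default ('p', 'li', 'div', 'a') is a design choice for this example. Whether the default or a narrower list works better depends on the pages being processed.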