extract_sub_links
- langchain_core.utils.html.extract_sub_links(
- raw_html: str,
- url: str,
- *,
- base_url: str | None = None,
- pattern: str | Pattern | None = None,
- prevent_outside: bool = True,
- exclude_prefixes: Sequence[str] = (),
- continue_on_failure: bool = False,
- ) → list[str]
Extract all links from a raw HTML string and convert them into absolute URLs.
- Parameters:
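raw_html (str) – The original HTML.
url (str) – The URL of the page the HTML came from; relative links are resolved against it.
base_url (str | None) – The base URL that links are checked against to decide whether they are outside links.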
pattern (str | Pattern | None) – Regex to use for extracting links from raw HTML.
prevent_outside (bool) – If True, ignore external links which are not children of the base URL.
exclude_prefixes (Sequence[str]) – Exclude any URLs that start with one of these prefixes.
continue_on_failure (bool) – If True, continue if parsing a specific link raises an exception. Otherwise, raise the exception.
- Returns:
The extracted sub links, as absolute URLs.
- Return type:
list[str]
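The filtering described by the parameters above can be illustrated with a small standard-library sketch. This is not the library's implementation: `sketch_sub_links`, `HREF_RE`, and the `example.com` URLs are hypothetical names made up for the example, the `pattern` and `continue_on_failure` options are left out, and the fallback from `base_url` to `url` is an assumption.

```python
import re
from urllib.parse import urljoin

# Hypothetical helper for illustration only; not part of langchain_core.
HREF_RE = re.compile(r'href=["\'](.*?)["\']', re.IGNORECASE)


def sketch_sub_links(raw_html, url, *, base_url=None,
                     prevent_outside=True, exclude_prefixes=()):
    """Approximate the documented behaviour using only the standard library."""
    # Assumption: when no base URL is given, compare against the page URL.
    base = base_url if base_url is not None else url

    found = set()
    for href in HREF_RE.findall(raw_html):
        # urljoin resolves relative links against the page URL and
        # leaves already-absolute links unchanged.
        found.add(urljoin(url, href))

    results = []
    for link in sorted(found):  # sorted only to make the output deterministic
        if any(link.startswith(prefix) for prefix in exclude_prefixes):
            continue  # dropped by exclude_prefixes
        if prevent_outside and not link.startswith(base):
            continue  # not a child of the base URL
        results.append(link)
    return results


html = '''
<a href="/docs/intro">Intro</a>
<a href="guide.html">Guide</a>
<a href="https://example.com/docs/api/">API</a>
<a href="https://other.example.org/page">Elsewhere</a>
<a href="/blog/post">Blog</a>
'''

links = sketch_sub_links(
    html,
    "https://example.com/docs/index.html",
    base_url="https://example.com/docs/",
    exclude_prefixes=("https://example.com/docs/api",),
)
print(links)
# ['https://example.com/docs/guide.html', 'https://example.com/docs/intro']
```

With `langchain_core` installed, the corresponding call would pass the same arguments to `extract_sub_links(raw_html, url, base_url=..., exclude_prefixes=...)`. The order of the returned list may differ from this sketch, so it is safer not to rely on it.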