Skip to main content

scrapfly

ScrapFly

ScrapFly is a web scraping API with headless browser capabilities, proxies, and anti-bot bypass. It allows for extracting web page data into accessible LLM markdown or text.

Installation

Install ScrapFly Python SDK and he required Langchain packages using pip:

pip install scrapfly-sdk langchain langchain-community

Usage

from langchain_community.document_loaders import ScrapflyLoader

scrapfly_loader = ScrapflyLoader(
["https://web-scraping.dev/products"],
api_key="Your ScrapFly API key", # Get your API key from https://www.scrapfly.io/
ignore_scrape_failures=True, # Ignore unprocessable web pages and log their exceptions
)

# Load documents from URLs as markdown
documents = scrapfly_loader.load()
print(documents)
API Reference:ScrapflyLoader

The ScrapflyLoader also allows passigng ScrapeConfig object for customizing the scrape request. See the documentation for the full feature details and their API params: https://scrapfly.io/docs/scrape-api/getting-started

from langchain_community.document_loaders import ScrapflyLoader

scrapfly_scrape_config = {
"asp": True, # Bypass scraping blocking and antibot solutions, like Cloudflare
"render_js": True, # Enable JavaScript rendering with a cloud headless browser
"proxy_pool": "public_residential_pool", # Select a proxy pool (datacenter or residnetial)
"country": "us", # Select a proxy location
"auto_scroll": True, # Auto scroll the page
"js": "", # Execute custom JavaScript code by the headless browser
}

scrapfly_loader = ScrapflyLoader(
["https://web-scraping.dev/products"],
api_key="Your ScrapFly API key", # Get your API key from https://www.scrapfly.io/
ignore_scrape_failures=True, # Ignore unprocessable web pages and log their exceptions
scrape_config=scrapfly_scrape_config, # Pass the scrape_config object
scrape_format="markdown", # The scrape result format, either `markdown`(default) or `text`
)

# Load documents from URLs as markdown
documents = scrapfly_loader.load()
print(documents)
API Reference:ScrapflyLoader

Was this page helpful?


You can also leave detailed feedback on GitHub.