Unlike traditional web scraping tools, Diffbot doesn’t require any rules to read the content on a page. It starts with computer vision, which classifies a page into one of 20 possible types. Content is then interpreted by a machine learning model trained to identify the key attributes on a page based on its type. The result is a website transformed into clean structured data (like JSON or CSV), ready for your application.
This page covers how to extract HTML documents from a list of URLs with the Diffbot Extract API and turn them into a document format that can be used downstream.
urls = [
The Diffbot Extract API requires an API token. Once you have one, you can extract the data.
Read the instructions on how to get a Diffbot API token.
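Since the loader cannot work without a token, it can help to fail early with a clear message when the token is missing. The following is a minimal sketch, assuming the token is stored in the DIFFBOT_API_TOKEN environment variable; the helper name get_diffbot_token is hypothetical and not part of Diffbot or LangChain.

```python
import os


def get_diffbot_token(env_var: str = "DIFFBOT_API_TOKEN") -> str:
    """Return the token stored in `env_var`, or raise if it is missing or blank.

    Hypothetical helper for illustration; not part of any library.
    """
    # Treat an unset variable and a whitespace-only value the same way.
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your Diffbot API token."
        )
    return token
```

The returned value could then be passed as the api_token argument in place of a direct os.environ.get call, which would otherwise silently pass None when the variable is unset.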
import os
from langchain.document_loaders import DiffbotLoader
# Read the token from the environment rather than hard-coding it.
loader = DiffbotLoader(urls=urls, api_token=os.environ.get("DIFFBOT_API_TOKEN"))
With the .load() method, you can see the loaded documents.
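To get a quick overview of what was loaded, one option is to print a one-line summary per document. This is a sketch under stated assumptions: it relies only on the page_content and metadata attributes that LangChain documents carry, it assumes the loader records the originating URL under a "source" metadata key (falling back to a placeholder otherwise), and summarize_documents is a hypothetical helper. The sample below uses stand-in objects so that it runs without an API token.

```python
from types import SimpleNamespace
from typing import Any, Iterable, List


def summarize_documents(docs: Iterable[Any], preview_chars: int = 80) -> List[str]:
    """Build one summary line per document: source, text length, short preview.

    Hypothetical helper for illustration; not part of LangChain.
    """
    lines = []
    for doc in docs:
        # Assumption: the URL is stored under "source"; use a placeholder if not.
        source = doc.metadata.get("source", "<unknown>")
        # Collapse runs of whitespace so the preview fits on one line.
        text = " ".join(doc.page_content.split())
        lines.append(f"{source} ({len(text)} chars): {text[:preview_chars]}")
    return lines


# Stand-in objects with the same two attributes as a loaded document.
sample_docs = [
    SimpleNamespace(
        page_content="Hello   world\nfrom Diffbot",
        metadata={"source": "https://example.com/a"},
    )
]
for line in summarize_documents(sample_docs):
    print(line)
```

With a real loader, the same helper could be called as summarize_documents(loader.load()).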