Skip to main content
Open on GitHub

Grobid

GROBID is a machine learning library for extracting, parsing, and re-structuring raw documents.

It is designed and expected to be used to parse academic papers, where it works particularly well.

Note: if the articles supplied to Grobid are large documents (e.g. dissertations) exceeding a certain number of elements, they might not be processed.

This page covers how to use the Grobid to parse articles for LangChain.

Installation

The grobid installation is described in details in https://grobid.readthedocs.io/en/latest/Install-Grobid/. However, it is probably easier and less troublesome to run grobid through a docker container, as documented here.

Use Grobid with LangChain

Once grobid is installed and up and running (you can check by accessing it http://localhost:8070), you're ready to go.

You can now use the GrobidParser to produce documents

from langchain_community.document_loaders.parsers import GrobidParser
from langchain_community.document_loaders.generic import GenericLoader

#Produce chunks from article paragraphs
loader = GenericLoader.from_filesystem(
"/Users/31treehaus/Desktop/Papers/",
glob="*",
suffixes=[".pdf"],
parser= GrobidParser(segment_sentences=False)
)
docs = loader.load()

#Produce chunks from article sentences
loader = GenericLoader.from_filesystem(
"/Users/31treehaus/Desktop/Papers/",
glob="*",
suffixes=[".pdf"],
parser= GrobidParser(segment_sentences=True)
)
docs = loader.load()
API Reference:GrobidParser | GenericLoader

Chunk metadata will include Bounding Boxes. Although these are a bit funky to parse, they are explained in https://grobid.readthedocs.io/en/latest/Coordinates-in-PDF/


Was this page helpful?