LanguageParser
- class langchain_community.document_loaders.parsers.language.language_parser.LanguageParser(language: Literal['cpp', 'go', 'java', 'kotlin', 'js', 'ts', 'php', 'proto', 'python', 'rst', 'ruby', 'rust', 'scala', 'markdown', 'latex', 'html', 'sol', 'csharp', 'cobol', 'c', 'lua', 'perl', 'elixir', 'sql'] | None = None, parser_threshold: int = 0)
Parse using the respective programming language syntax.
Each top-level function and class in the code is loaded into its own document. An additional document holds the remaining top-level code, with the already segmented functions and classes excluded.
This approach can potentially improve the accuracy of QA models over source code.
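The segmentation idea described above can be sketched with Python's standard `ast` module. This is a minimal illustration for Python source only, not LangChain's actual implementation; the helper name `split_top_level` is hypothetical, and the way the remainder is built here (simply dropping the segmented lines) is an assumption.

```python
import ast


def split_top_level(source: str) -> tuple[list[str], str]:
    """Return (segments, remainder) for a piece of Python source.

    segments: the text of each top-level function or class.
    remainder: all other top-level lines, with the segmented code removed.
    """
    tree = ast.parse(source)
    lines = source.splitlines()
    segments = []
    taken = set()  # 0-based indices of lines that belong to a segment

    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # Include decorator lines, which come before the def/class line.
            first = min([node.lineno] + [d.lineno for d in node.decorator_list])
            start, end = first - 1, node.end_lineno
            segments.append("\n".join(lines[start:end]))
            taken.update(range(start, end))

    remainder = "\n".join(
        line for i, line in enumerate(lines) if i not in taken
    )
    return segments, remainder


example = '''import os

def greet(name):
    return "hi " + name

class Greeter:
    pass

print(greet("x"))
'''

segments, remainder = split_top_level(example)
```

Here `segments` holds one entry for `greet` and one for `Greeter`, while `remainder` keeps the import and the final `print` call, which mirrors the "one document per function or class, plus one for the rest" layout.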
The supported languages for code parsing are:
C: “c” (*)
C++: “cpp” (*)
C#: “csharp” (*)
COBOL: “cobol”
Elixir: “elixir”
Go: “go” (*)
Java: “java” (*)
JavaScript: “js” (requires package esprima)
Kotlin: “kotlin” (*)
Lua: “lua” (*)
Perl: “perl” (*)
Python: “python”
Ruby: “ruby” (*)
Rust: “rust” (*)
Scala: “scala” (*)
SQL: “sql” (*)
TypeScript: “ts” (*)
Items marked with (*) require the packages tree_sitter and tree_sitter_languages. It is straightforward to add support for additional languages using tree_sitter, although this currently requires modifying LangChain.
The language used for parsing can be configured, along with the minimum number of lines required to activate the splitting based on syntax.
If a language is not explicitly specified, LanguageParser will infer one from filename extensions, if present.
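Extension-based inference of this kind could look roughly like the following sketch. The mapping and the function name `infer_language` are hypothetical and cover only a few languages; the table LangChain uses internally may differ.

```python
from pathlib import Path
from typing import Optional

# Illustrative subset only (an assumption, not LangChain's internal table).
EXTENSION_TO_LANGUAGE = {
    ".py": "python",
    ".js": "js",
    ".ts": "ts",
    ".go": "go",
    ".rs": "rust",
}


def infer_language(source: Optional[str]) -> Optional[str]:
    """Return a language identifier for a file path, or None if unknown."""
    if not source:
        return None
    # Compare case-insensitively so "main.RS" and "main.rs" behave the same.
    return EXTENSION_TO_LANGUAGE.get(Path(source).suffix.lower())
```

A path without a recognized extension yields `None`, in which case a caller would have no language to parse with and would need to fall back to treating the file as plain text.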
Examples
.. code-block:: python

    from langchain_community.document_loaders.generic import GenericLoader
    from langchain_community.document_loaders.parsers import LanguageParser

    loader = GenericLoader.from_filesystem(
        "./code",
        glob="**/*",
        suffixes=[".py", ".js"],
        parser=LanguageParser()
    )
    docs = loader.load()

Example instantiation that selects the language manually:

.. code-block:: python

    loader = GenericLoader.from_filesystem(
        "./code",
        glob="**/*",
        suffixes=[".py"],
        parser=LanguageParser(language="python")
    )

Example instantiation that sets the line-count threshold:

.. code-block:: python

    loader = GenericLoader.from_filesystem(
        "./code",
        glob="**/*",
        suffixes=[".py"],
        parser=LanguageParser(parser_threshold=200)
    )
Language parser that splits code using the respective language syntax.
- Parameters:
language (Optional[Language]) – If None (default), the parser tries to infer the language from the source's filename extension.
parser_threshold (int) – Minimum lines needed to activate parsing (0 by default).
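The effect of `parser_threshold` can be pictured as a simple line-count gate. The sketch below is an assumption about the behavior: the function name `should_split` is hypothetical, and this page does not state whether the comparison is strict or inclusive.

```python
def should_split(code: str, parser_threshold: int = 0) -> bool:
    """Decide whether code is long enough to be split by syntax.

    Assumption: splitting happens only when the code has more lines
    than the threshold; shorter inputs are kept as a single document.
    """
    return len(code.splitlines()) > parser_threshold
```

With the default of 0, any non-empty file qualifies for splitting; with `parser_threshold=200`, a short script would be kept whole.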
Methods
__init__([language, parser_threshold]) – Language parser that splits code using the respective language syntax.
lazy_parse(blob) – Lazy parsing interface.
parse(blob) – Eagerly parse the blob into a document or documents.
- __init__(language: Literal['cpp', 'go', 'java', 'kotlin', 'js', 'ts', 'php', 'proto', 'python', 'rst', 'ruby', 'rust', 'scala', 'markdown', 'latex', 'html', 'sol', 'csharp', 'cobol', 'c', 'lua', 'perl', 'elixir', 'sql'] | None = None, parser_threshold: int = 0)
Language parser that splits code using the respective language syntax.
- Parameters:
language (Literal['cpp', 'go', 'java', 'kotlin', 'js', 'ts', 'php', 'proto', 'python', 'rst', 'ruby', 'rust', 'scala', 'markdown', 'latex', 'html', 'sol', 'csharp', 'cobol', 'c', 'lua', 'perl', 'elixir', 'sql'] | None) – If None (default), the parser tries to infer the language from the source's filename extension.
parser_threshold (int) – Minimum lines needed to activate parsing (0 by default).
- lazy_parse(blob: Blob) → Iterator[Document]
Lazy parsing interface.
Subclasses are required to implement this method.
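The relationship between a lazy and an eager parsing method can be sketched with a generator. The classes below (`FakeBlob`, `FakeDocument`, `ParagraphParser`) are hypothetical stand-ins written for illustration, not the LangChain `Blob`, `Document`, or parser classes.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class FakeBlob:
    """Hypothetical stand-in for a blob: just raw text."""
    data: str


@dataclass
class FakeDocument:
    """Hypothetical stand-in for a document."""
    page_content: str


class ParagraphParser:
    """Toy parser that yields one document per blank-line-separated chunk."""

    def lazy_parse(self, blob: FakeBlob) -> Iterator[FakeDocument]:
        # Documents are produced one at a time, only as the caller iterates.
        for chunk in blob.data.split("\n\n"):
            if chunk.strip():
                yield FakeDocument(page_content=chunk)

    def parse(self, blob: FakeBlob) -> List[FakeDocument]:
        # The eager variant simply materializes the lazy iterator.
        return list(self.lazy_parse(blob))
```

Iterating over `lazy_parse` lets a caller stop early or stream results, whereas `parse` returns the full list at once.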
Examples using LanguageParser