LanguageParser

class langchain_community.document_loaders.parsers.language.language_parser.LanguageParser(language: Literal['cpp', 'go', 'java', 'kotlin', 'js', 'ts', 'php', 'proto', 'python', 'rst', 'ruby', 'rust', 'scala', 'markdown', 'latex', 'html', 'sol', 'csharp', 'cobol', 'c', 'lua', 'perl', 'elixir', 'sql'] | None = None, parser_threshold: int = 0)

Parse using the respective programming language syntax.

Each top-level function and class in the code is loaded into a separate document. One additional document holds the remaining top-level code, with the already segmented functions and classes excluded.

This approach can improve the accuracy of QA models over source code.
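The segmentation described above can be illustrated for Python with the standard-library `ast` module. This is only a sketch of the idea, not LangChain's implementation: the helper name `split_top_level` and the sample source are hypothetical, and the way LangChain renders the remaining code may differ.

```python
import ast


def split_top_level(source: str) -> tuple[list[str], str]:
    """Return (function/class chunks, remaining top-level code)."""
    lines = source.splitlines()
    chunks: list[str] = []
    taken: set[int] = set()  # 0-based indices of lines moved into a chunk

    # ast.parse(...).body contains only top-level statements, so nested
    # functions and methods stay inside their enclosing chunk.
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # Start at the first decorator, if any, so it travels with its target.
            first = min([node.lineno] + [d.lineno for d in node.decorator_list])
            start, end = first - 1, node.end_lineno
            chunks.append("\n".join(lines[start:end]))
            taken.update(range(start, end))

    remainder = "\n".join(
        line for index, line in enumerate(lines) if index not in taken
    )
    return chunks, remainder


# Hypothetical sample input, used only to exercise the sketch.
SAMPLE = '''import os

def greet(name):
    return "hi " + name

class Greeter:
    def run(self):
        return greet("x")

print(greet(os.getcwd()))
'''

chunks, remainder = split_top_level(SAMPLE)
for chunk in chunks:
    print("--- chunk ---")
    print(chunk)
print("--- remainder ---")
print(remainder)
```

Here the function and the class each become one chunk, while the import and the final `print` call stay in the remainder.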

The supported languages for code parsing are:

  • C: “c” (*)

  • C++: “cpp” (*)

  • C#: “csharp” (*)

  • COBOL: “cobol”

  • Elixir: “elixir”

  • Go: “go” (*)

  • Java: “java” (*)

  • JavaScript: “js” (requires package esprima)

  • Kotlin: “kotlin” (*)

  • Lua: “lua” (*)

  • Perl: “perl” (*)

  • Python: “python”

  • Ruby: “ruby” (*)

  • Rust: “rust” (*)

  • Scala: “scala” (*)

  • SQL: “sql” (*)

  • TypeScript: “ts” (*)

Items marked with (*) require the packages tree_sitter and tree_sitter_languages. It is straightforward to add support for additional languages using tree_sitter, although this currently requires modifying LangChain.

The language used for parsing can be configured, along with the minimum number of lines required to activate the splitting based on syntax.

If a language is not explicitly specified, LanguageParser will infer one from filename extensions, if present.
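As a rough illustration of extension-based inference, the sketch below maps a file path to one of the language identifiers listed above. The mapping shown is an assumed, partial table and `infer_language` is a hypothetical helper; the actual lookup inside LangChain may cover more extensions or behave differently.

```python
import os
from typing import Optional

# Assumed, illustrative subset; not the table LangChain ships with.
EXTENSION_TO_LANGUAGE = {
    "py": "python",
    "js": "js",
    "ts": "ts",
    "go": "go",
    "rs": "rust",
}


def infer_language(source: Optional[str], explicit: Optional[str] = None) -> Optional[str]:
    """Prefer an explicitly chosen language, else look at the file extension."""
    if explicit is not None:
        return explicit
    if not source:
        return None
    # splitext returns ("path/name", ".ext"); the extension is "" when absent.
    extension = os.path.splitext(source)[1].lstrip(".").lower()
    return EXTENSION_TO_LANGUAGE.get(extension)


print(infer_language("src/app.py"))
print(infer_language("src/app.py", explicit="js"))
print(infer_language("README"))
```

An explicit language always wins in this sketch, which mirrors the documented behavior that inference only applies when no language is specified.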

Examples

    from langchain_community.document_loaders.generic import GenericLoader
    from langchain_community.document_loaders.parsers import LanguageParser

    loader = GenericLoader.from_filesystem(
        "./code",
        glob="**/*",
        suffixes=[".py", ".js"],
        parser=LanguageParser()
    )
    docs = loader.load()

Example instantiation that selects the language manually:

    loader = GenericLoader.from_filesystem(
        "./code",
        glob="**/*",
        suffixes=[".py"],
        parser=LanguageParser(language="python")
    )

Example instantiation that sets the line-count threshold:

    loader = GenericLoader.from_filesystem(
        "./code",
        glob="**/*",
        suffixes=[".py"],
        parser=LanguageParser(parser_threshold=200)
    )
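The effect of a line threshold can be sketched as a simple gate on the file's line count. The helper `should_split` is hypothetical, and the boundary case (a file whose length exactly equals the threshold is left unsplit) is an assumption of this sketch rather than something stated above.

```python
def should_split(code: str, parser_threshold: int = 0) -> bool:
    """Return True when the file is long enough for syntax-based splitting."""
    # Assumption: the file must have MORE lines than the threshold.
    return len(code.splitlines()) > parser_threshold


short_file = "a = 1\nb = 2\n"
long_file = "\n".join(["x = 1"] * 201)

print(should_split(short_file, parser_threshold=200))
print(should_split(long_file, parser_threshold=200))
```

With `parser_threshold=200`, the two-line file is kept whole in this sketch, while the 201-line file is eligible for splitting.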

Language parser that splits code using the respective language syntax.

Parameters:
  • language (Optional[Language]) – If None (default), the parser tries to infer the language from the source's filename extension.

  • parser_threshold (int) – Minimum number of lines a file needs before syntax-based splitting is applied (0 by default).

Methods

__init__([language, parser_threshold])

Language parser that splits code using the respective language syntax.

lazy_parse(blob)

Lazy parsing interface.

parse(blob)

Eagerly parse the blob into a document or documents.

__init__(language: Literal['cpp', 'go', 'java', 'kotlin', 'js', 'ts', 'php', 'proto', 'python', 'rst', 'ruby', 'rust', 'scala', 'markdown', 'latex', 'html', 'sol', 'csharp', 'cobol', 'c', 'lua', 'perl', 'elixir', 'sql'] | None = None, parser_threshold: int = 0)

Language parser that splits code using the respective language syntax.

Parameters:
  • language (Literal['cpp', 'go', 'java', 'kotlin', 'js', 'ts', 'php', 'proto', 'python', 'rst', 'ruby', 'rust', 'scala', 'markdown', 'latex', 'html', 'sol', 'csharp', 'cobol', 'c', 'lua', 'perl', 'elixir', 'sql'] | None) – If None (default), the parser tries to infer the language from the source's filename extension.

  • parser_threshold (int) – Minimum number of lines a file needs before syntax-based splitting is applied (0 by default).

lazy_parse(blob: Blob) → Iterator[Document]

Lazy parsing interface.

Subclasses are required to implement this method.

Parameters:

blob (Blob) – Blob instance

Returns:

Generator of documents

Return type:

Iterator[Document]

parse(blob: Blob) → list[Document]

Eagerly parse the blob into a document or documents.

This is a convenience method for interactive development environments.

Production applications should favor the lazy_parse method instead.

Subclasses should generally not override this parse method.

Parameters:

blob (Blob) – Blob instance

Returns:

List of documents

Return type:

list[Document]
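The relationship between the lazy and eager interfaces can be sketched with a small stand-in class. `SketchParser` is hypothetical and works on plain strings rather than Blob and Document objects; it only illustrates why a generator-based `lazy_parse` suits production use while `parse` is convenient interactively.

```python
from collections.abc import Iterator


class SketchParser:
    """Hypothetical stand-in for a blob parser; not a LangChain class."""

    def lazy_parse(self, text: str) -> Iterator[str]:
        # Yield one chunk at a time so callers can stop early or stream
        # results without holding every chunk in memory.
        for paragraph in text.split("\n\n"):
            if paragraph.strip():
                yield paragraph

    def parse(self, text: str) -> list[str]:
        # The eager variant simply drains the lazy generator into a list.
        return list(self.lazy_parse(text))


parser = SketchParser()
text = "first\n\nsecond\n\nthird"

stream = parser.lazy_parse(text)
print(next(stream))        # only the first chunk has been produced so far
print(parser.parse(text))  # all chunks, materialized at once
```

Because the eager method is defined in terms of the lazy one, a subclass in this sketch only needs to implement `lazy_parse`.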
