class LanguageParser(BaseBlobParser):
    """Parse using the respective programming language syntax.

    Each top-level function and class in the code is loaded into separate documents.
    Furthermore, an extra document is generated, containing the remaining top-level code
    that excludes the already segmented functions and classes.

    This approach can potentially improve the accuracy of QA models over source code.

    The supported languages for code parsing are:

    - C: "c" (*)
    - C++: "cpp" (*)
    - C#: "csharp" (*)
    - COBOL: "cobol"
    - Elixir: "elixir"
    - Go: "go" (*)
    - Java: "java" (*)
    - JavaScript: "js" (requires package `esprima`)
    - Kotlin: "kotlin" (*)
    - Lua: "lua" (*)
    - Perl: "perl" (*)
    - Python: "python"
    - Ruby: "ruby" (*)
    - Rust: "rust" (*)
    - Scala: "scala" (*)
    - SQL: "sql" (*)
    - TypeScript: "ts" (*)

    Items marked with (*) require the packages `tree_sitter` and
    `tree_sitter_languages`. It is straightforward to add support for additional
    languages using `tree_sitter`, although this currently requires modifying LangChain.

    The language used for parsing can be configured, along with the minimum number of
    lines required to activate the splitting based on syntax.

    If a language is not explicitly specified, `LanguageParser` will infer one from
    filename extensions, if present.

    Examples:

        .. code-block:: python

            from langchain_community.document_loaders.generic import GenericLoader
            from langchain_community.document_loaders.parsers import LanguageParser

            loader = GenericLoader.from_filesystem(
                "./code",
                glob="**/*",
                suffixes=[".py", ".js"],
                parser=LanguageParser()
            )
            docs = loader.load()

        Example instantiations to manually select the language:

        .. code-block:: python

            loader = GenericLoader.from_filesystem(
                "./code",
                glob="**/*",
                suffixes=[".py"],
                parser=LanguageParser(language="python")
            )

        Example instantiations to set number of lines threshold:

        .. code-block:: python

            loader = GenericLoader.from_filesystem(
                "./code",
                glob="**/*",
                suffixes=[".py"],
                parser=LanguageParser(parser_threshold=200)
            )
    """

    def __init__(self, language: Optional[Language] = None, parser_threshold: int = 0):
        """
        Language parser that splits code using the respective language syntax.

        Args:
            language: If None (default), it will try to infer language from source.
            parser_threshold: Minimum lines needed to activate parsing (0 by default).
        """
        # Reject an explicitly requested language that has no registered segmenter.
        if language and language not in LANGUAGE_SEGMENTERS:
            raise Exception(f"No parser available for {language}")
        self.language = language
        self.parser_threshold = parser_threshold