SparkSQL#
- class langchain_community.utilities.spark_sql.SparkSQL(spark_session: SparkSession | None = None, catalog: str | None = None, schema: str | None = None, ignore_tables: List[str] | None = None, include_tables: List[str] | None = None, sample_rows_in_table_info: int = 3)[source]#
SparkSQL is a utility class for interacting with Spark SQL.
Initialize a SparkSQL object.
- Parameters:
spark_session (Optional[SparkSession]) – A SparkSession object. If not provided, one will be created.
catalog (Optional[str]) – The catalog to use. If not provided, the default catalog will be used.
schema (Optional[str]) – The schema to use. If not provided, the default schema will be used.
ignore_tables (Optional[List[str]]) – A list of tables to ignore. If not provided, all tables will be used.
include_tables (Optional[List[str]]) – A list of tables to include. If not provided, all tables will be used.
sample_rows_in_table_info (int) – The number of rows to include in the table info. Defaults to 3.
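To illustrate how `include_tables` and `ignore_tables` interact, here is a hypothetical sketch of the filtering semantics the parameter descriptions imply (an explicit include list restricts the visible tables; otherwise everything not ignored is usable). This is illustrative only, not the class's actual implementation:

```python
from typing import Iterable, List, Optional

def usable_table_names(
    all_tables: Iterable[str],
    include_tables: Optional[List[str]] = None,
    ignore_tables: Optional[List[str]] = None,
) -> List[str]:
    """Hypothetical sketch: an include list wins when given;
    otherwise every table not in the ignore list is usable."""
    if include_tables:
        return [t for t in all_tables if t in include_tables]
    if ignore_tables:
        return [t for t in all_tables if t not in ignore_tables]
    return list(all_tables)
```

For example, with tables `["users", "orders", "logs"]`, passing `ignore_tables=["logs"]` leaves `["users", "orders"]`, while `include_tables=["users"]` leaves only `["users"]`.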
Methods

__init__([spark_session, catalog, schema, ...])
Initialize a SparkSQL object.

from_uri(database_uri[, engine_args])
Create a remote Spark Session via Spark connect.

get_table_info([table_names])
Get information about specified tables.

get_table_info_no_throw([table_names])
Get information about specified tables.

get_usable_table_names()
Get names of tables available.

run(command[, fetch])
Execute a SQL command and return a string representing the results.

run_no_throw(command[, fetch])
Execute a SQL command and return a string representing the results.
- __init__(spark_session: SparkSession | None = None, catalog: str | None = None, schema: str | None = None, ignore_tables: List[str] | None = None, include_tables: List[str] | None = None, sample_rows_in_table_info: int = 3)[source]#
Initialize a SparkSQL object.
- Parameters:
spark_session (Optional[SparkSession]) – A SparkSession object. If not provided, one will be created.
catalog (Optional[str]) – The catalog to use. If not provided, the default catalog will be used.
schema (Optional[str]) – The schema to use. If not provided, the default schema will be used.
ignore_tables (Optional[List[str]]) – A list of tables to ignore. If not provided, all tables will be used.
include_tables (Optional[List[str]]) – A list of tables to include. If not provided, all tables will be used.
sample_rows_in_table_info (int) – The number of rows to include in the table info. Defaults to 3.
- classmethod from_uri(database_uri: str, engine_args: dict | None = None, **kwargs: Any) SparkSQL [source]#
Create a remote Spark Session via Spark connect. For example: SparkSQL.from_uri("sc://localhost:15002")
- Parameters:
database_uri (str)
engine_args (dict | None)
kwargs (Any)
- Return type:
SparkSQL
- get_table_info(table_names: List[str] | None = None) str [source]#
Get information about specified tables.
- Parameters:
table_names (List[str] | None)
- Return type:
str
- get_table_info_no_throw(table_names: List[str] | None = None) str [source]#
Get information about specified tables.
Follows best practices as specified in: Rajkumar et al, 2022 (https://arxiv.org/abs/2204.00498)
If sample_rows_in_table_info is set, the specified number of sample rows will be appended to each table description. This can improve performance as demonstrated in the paper.
- Parameters:
table_names (List[str] | None)
- Return type:
str
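The "schema plus sample rows" format described above can be sketched without a Spark cluster. The following illustration uses the stdlib sqlite3 module in place of Spark SQL; the function name and output layout are assumptions for demonstration, not this class's actual code:

```python
import sqlite3

def table_info_with_samples(conn: sqlite3.Connection, table: str, n_rows: int = 3) -> str:
    """Illustrative sketch (sqlite3 stands in for Spark SQL): return the
    table's DDL followed by up to n_rows sample rows in a comment block."""
    # Recover the CREATE TABLE statement for the table.
    ddl = conn.execute(
        "SELECT sql FROM sqlite_master WHERE type='table' AND name=?", (table,)
    ).fetchone()[0]
    # Pull a small sample of rows to append to the description.
    cur = conn.execute(f"SELECT * FROM {table} LIMIT {n_rows}")
    header = "\t".join(col[0] for col in cur.description)
    rows = "\n".join("\t".join(str(v) for v in row) for row in cur.fetchall())
    return f"{ddl}\n\n/*\n{n_rows} rows from {table} table:\n{header}\n{rows}\n*/"
```

Feeding this kind of schema-with-samples string to a language model is the practice the cited paper evaluates.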
- get_usable_table_names() Iterable[str] [source]#
Get names of tables available.
- Return type:
Iterable[str]
- run(command: str, fetch: str = 'all') str [source]#
Execute a SQL command and return a string representing the results.
- Parameters:
command (str)
fetch (str)
- Return type:
str
- run_no_throw(command: str, fetch: str = 'all') str [source]#
Execute a SQL command and return a string representing the results.
If the statement returns rows, a string of the results is returned. If the statement returns no rows, an empty string is returned.
If the statement throws an error, the error message is returned.
- Parameters:
command (str)
fetch (str)
- Return type:
str
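The no-throw contract described above (results as a string, empty string for no rows, error message instead of an exception) can be sketched as a wrapper pattern. This illustration runs against stdlib sqlite3 rather than Spark, and the helper is a hypothetical stand-in, not this class's implementation:

```python
import sqlite3

def run_no_throw(conn: sqlite3.Connection, command: str, fetch: str = "all") -> str:
    """Sketch of the no-throw contract: return rows as a string,
    '' when there are none, and the error message on failure."""
    try:
        cur = conn.execute(command)
        rows = [cur.fetchone()] if fetch == "one" else cur.fetchall()
        rows = [r for r in rows if r is not None]  # drop the None from an empty fetchone
        return str(rows) if rows else ""
    except sqlite3.Error as e:
        # Swallow the exception and hand the message back as data.
        return f"Error: {e}"
```

Returning the error message as a string is what makes this variant suitable for agent loops: the model sees the failure and can revise its SQL instead of the chain aborting.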
Examples using SparkSQL