SparkSQL

class langchain_community.utilities.spark_sql.SparkSQL(spark_session: SparkSession | None = None, catalog: str | None = None, schema: str | None = None, ignore_tables: List[str] | None = None, include_tables: List[str] | None = None, sample_rows_in_table_info: int = 3)

SparkSQL is a utility class for interacting with Spark SQL.

Initialize a SparkSQL object.

Parameters:

spark_session (Optional[SparkSession]) – A SparkSession object. If not provided, one will be created.
catalog (Optional[str]) – The catalog to use. If not provided, the default catalog will be used.
schema (Optional[str]) – The schema to use. If not provided, the default schema will be used.
ignore_tables (Optional[List[str]]) – A list of tables to ignore. If not provided, no tables are ignored.
include_tables (Optional[List[str]]) – A list of tables to include. If not provided, all tables are included.
sample_rows_in_table_info (int) – The number of rows to include in the table info. Defaults to 3.
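
As a minimal sketch of the constructor parameters above, the following wraps a local SparkSession in a SparkSQL object. The helper name make_spark_sql, the local[1] master, and the app name are assumptions made for illustration; they are not part of the library.

```python
from typing import List, Optional


def make_spark_sql(include_tables: Optional[List[str]] = None, sample_rows: int = 3):
    """Wrap a local SparkSession in a SparkSQL utility (hypothetical helper)."""
    # Imported lazily so this module still loads where pyspark is not installed.
    from pyspark.sql import SparkSession
    from langchain_community.utilities.spark_sql import SparkSQL

    # A single-threaded local session; the master and app name are arbitrary choices.
    spark = SparkSession.builder.master("local[1]").appName("sparksql-demo").getOrCreate()
    return SparkSQL(
        spark_session=spark,
        include_tables=include_tables,  # None applies no include filter
        sample_rows_in_table_info=sample_rows,
    )
```

Calling make_spark_sql() with no arguments would expose every table in the default catalog and schema; passing include_tables narrows that to the named tables.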

Methods

`__init__`([spark_session, catalog, schema, ...])	Initialize a SparkSQL object.
`from_uri`(database_uri[, engine_args])	Create a remote Spark session via Spark Connect.
`get_table_info`([table_names])
`get_table_info_no_throw`([table_names])	Get information about specified tables.
`get_usable_table_names`()	Get names of tables available.
`run`(command[, fetch])
`run_no_throw`(command[, fetch])	Execute a SQL command and return a string representing the results.

classmethod from_uri(database_uri: str, engine_args: dict | None = None, **kwargs: Any) → SparkSQL

Create a remote Spark session via Spark Connect. For example: SparkSQL.from_uri("sc://localhost:15002")

Parameters:

database_uri (str) – The Spark Connect URI of the remote server, for example sc://localhost:15002.
engine_args (dict | None) –
kwargs (Any) –

Return type:

SparkSQL
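
The from_uri call can be sketched as follows, assuming a Spark Connect server is reachable at the given host and port. The helpers spark_connect_uri and connect are hypothetical names introduced here; only SparkSQL.from_uri and the sc://host:port URI form come from the documentation above.

```python
def spark_connect_uri(host: str = "localhost", port: int = 15002) -> str:
    """Build a Spark Connect URI of the form sc://host:port."""
    return f"sc://{host}:{port}"


def connect(host: str = "localhost", port: int = 15002):
    """Open a SparkSQL utility over Spark Connect (hypothetical helper)."""
    # Imported lazily so this module still loads without langchain_community.
    from langchain_community.utilities.spark_sql import SparkSQL

    # Requires a running Spark Connect server at the target address.
    return SparkSQL.from_uri(spark_connect_uri(host, port))
```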

get_table_info(table_names: List[str] | None = None) → str

Parameters:

table_names (List[str] | None) –

Return type:

str

get_table_info_no_throw(table_names: List[str] | None = None) → str

Get information about specified tables.

Follows best practices as specified in Rajkumar et al., 2022 (https://arxiv.org/abs/2204.00498).

If sample_rows_in_table_info is nonzero, that number of sample rows is appended to each table description. As the paper demonstrates, this can improve performance.

Parameters:

table_names (List[str] | None) –

Return type:

str

get_usable_table_names() → Iterable[str]

Get names of tables available.

Return type:

Iterable[str]
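
To illustrate how ignore_tables and include_tables could narrow the set of usable tables, here is a pure-Python sketch. It is not the library's implementation: the function name resolve_usable_tables, the precedence of include_tables over ignore_tables, and the sorted output are all assumptions.

```python
from typing import Iterable, List, Optional


def resolve_usable_tables(
    all_tables: Iterable[str],
    include_tables: Optional[Iterable[str]] = None,
    ignore_tables: Optional[Iterable[str]] = None,
) -> List[str]:
    """Return the table names left after applying the include/ignore filters."""
    if include_tables:
        # Assumption: an include list takes precedence and acts as an allow-list.
        return sorted(set(include_tables) & set(all_tables))
    # Otherwise drop the ignored tables; with no ignore list, keep everything.
    return sorted(set(all_tables) - set(ignore_tables or ()))
```

With the hypothetical tables orders, users, and tmp, passing ignore_tables=["tmp"] leaves orders and users, while passing include_tables=["users"] leaves only users.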

run(command: str, fetch: str = 'all') → str

Parameters:

command (str) –
fetch (str) –

Return type:

str

run_no_throw(command: str, fetch: str = 'all') → str

Execute a SQL command and return a string representing the results.

If the statement returns rows, a string of the results is returned. If the statement returns no rows, an empty string is returned.

If the statement throws an error, the error message is returned.

Parameters:

command (str) –
fetch (str) –

Return type:

str
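
The relationship between run and run_no_throw described above, where an error message is returned instead of raised, can be sketched as a generic wrapper. This is an illustration of the pattern only: the decorator name no_throw and the "Error: " prefix are assumptions, not the library's documented behavior.

```python
from typing import Callable


def no_throw(fn: Callable[..., str]) -> Callable[..., str]:
    """Wrap fn so that any exception is returned as a string instead of raised."""

    def wrapper(*args, **kwargs) -> str:
        try:
            return fn(*args, **kwargs)
        except Exception as exc:
            # Assumed message format; the caller receives text, never an exception.
            return f"Error: {exc}"

    return wrapper
```

Returning the message as a string suits callers such as LLM agents, which can read the error text and retry with a corrected query rather than crash.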

Examples using SparkSQL

Spark SQL Toolkit