SparkSQL#

class langchain_community.utilities.spark_sql.SparkSQL(spark_session: SparkSession | None = None, catalog: str | None = None, schema: str | None = None, ignore_tables: List[str] | None = None, include_tables: List[str] | None = None, sample_rows_in_table_info: int = 3)[source]#

SparkSQL is a utility class for interacting with Spark SQL.

Initialize a SparkSQL object.

Parameters:
  • spark_session (Optional[SparkSession]) – A SparkSession object. If not provided, one will be created.

  • catalog (Optional[str]) – The catalog to use. If not provided, the default catalog will be used.

  • schema (Optional[str]) – The schema to use. If not provided, the default schema will be used.

  • ignore_tables (Optional[List[str]]) – A list of tables to ignore. If not provided, all tables will be used.

  • include_tables (Optional[List[str]]) – A list of tables to include. If not provided, all tables will be used.

  • sample_rows_in_table_info (int) – The number of rows to include in the table info. Defaults to 3.
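A minimal construction sketch (assumes pyspark and langchain_community are installed; the `employees` table and its schema are hypothetical, and the live calls are guarded so the sketch degrades gracefully without a Spark installation):

```python
# Hypothetical DDL for illustration only.
table_ddl = "CREATE TABLE IF NOT EXISTS employees (id INT, name STRING) USING parquet"

try:
    from pyspark.sql import SparkSession
    from langchain_community.utilities.spark_sql import SparkSQL

    # Build a local session explicitly; if spark_session were omitted,
    # SparkSQL would create one itself.
    spark = SparkSession.builder.master("local[1]").getOrCreate()
    spark.sql(table_ddl)

    db = SparkSQL(spark_session=spark, sample_rows_in_table_info=2)
    print(sorted(db.get_usable_table_names()))
except Exception as exc:  # pyspark missing, or no local Spark available
    print(f"Skipping live example: {exc}")
```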

Methods

__init__([spark_session, catalog, schema, ...])

Initialize a SparkSQL object.

from_uri(database_uri[, engine_args])

Create a remote Spark session via Spark Connect.

get_table_info([table_names])

get_table_info_no_throw([table_names])

Get information about specified tables.

get_usable_table_names()

Get the names of the available tables.

run(command[, fetch])

run_no_throw(command[, fetch])

Execute a SQL command and return a string representing the results.

__init__(spark_session: SparkSession | None = None, catalog: str | None = None, schema: str | None = None, ignore_tables: List[str] | None = None, include_tables: List[str] | None = None, sample_rows_in_table_info: int = 3)[source]#

Initialize a SparkSQL object.

Parameters:
  • spark_session (Optional[SparkSession]) – A SparkSession object. If not provided, one will be created.

  • catalog (Optional[str]) – The catalog to use. If not provided, the default catalog will be used.

  • schema (Optional[str]) – The schema to use. If not provided, the default schema will be used.

  • ignore_tables (Optional[List[str]]) – A list of tables to ignore. If not provided, all tables will be used.

  • include_tables (Optional[List[str]]) – A list of tables to include. If not provided, all tables will be used.

  • sample_rows_in_table_info (int) – The number of rows to include in the table info. Defaults to 3.

classmethod from_uri(database_uri: str, engine_args: dict | None = None, **kwargs: Any) → SparkSQL[source]#

Create a remote Spark session via Spark Connect. For example: SparkSQL.from_uri("sc://localhost:15002")

Parameters:
  • database_uri (str)

  • engine_args (dict | None)

  • kwargs (Any)

Return type:

SparkSQL
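A hedged sketch of connecting via Spark Connect (assumes a Spark Connect server is listening on localhost:15002, the URI from the docstring above; the call is guarded since no server may be running):

```python
# The "sc://" scheme is the Spark Connect URI format used in the docstring.
connect_uri = "sc://localhost:15002"

try:
    from langchain_community.utilities.spark_sql import SparkSQL

    # from_uri builds a remote session instead of a local one.
    db = SparkSQL.from_uri(connect_uri, schema="default")
    print(db.get_usable_table_names())
except Exception as exc:  # no server listening, or pyspark[connect] missing
    print(f"Skipping live example: {exc}")
```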

get_table_info(table_names: List[str] | None = None) → str[source]#

Parameters:

table_names (List[str] | None)

Return type:

str

get_table_info_no_throw(table_names: List[str] | None = None) → str[source]#

Get information about specified tables.

Follows best practices as specified in: Rajkumar et al., 2022 (https://arxiv.org/abs/2204.00498)

If sample_rows_in_table_info is set, the specified number of sample rows will be appended to each table description. This can improve performance, as demonstrated in the paper.

Parameters:

table_names (List[str] | None)

Return type:

str
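A sketch of probing table info (table names here are hypothetical; the _no_throw variant returns the error message as a string rather than raising, so a missing table is safe to include):

```python
# Hypothetical table names: one that might exist, one that certainly does not.
tables = ["employees", "no_such_table"]

try:
    from pyspark.sql import SparkSession
    from langchain_community.utilities.spark_sql import SparkSQL

    spark = SparkSession.builder.master("local[1]").getOrCreate()
    # sample_rows_in_table_info appends that many sample rows to each
    # table description, per the Rajkumar et al. findings cited above.
    db = SparkSQL(spark_session=spark, sample_rows_in_table_info=3)
    print(db.get_table_info_no_throw(table_names=tables))
except Exception as exc:  # pyspark missing, or no local Spark available
    print(f"Skipping live example: {exc}")
```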

get_usable_table_names() → Iterable[str][source]#

Get the names of the available tables.

Return type:

Iterable[str]

run(command: str, fetch: str = 'all') → str[source]#

Parameters:
  • command (str)

  • fetch (str)

Return type:

str

run_no_throw(command: str, fetch: str = 'all') → str[source]#

Execute a SQL command and return a string representing the results.

If the statement returns rows, a string of the results is returned. If the statement returns no rows, an empty string is returned.

If the statement throws an error, the error message is returned.

Parameters:
  • command (str)

  • fetch (str)

Return type:

str
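A sketch contrasting the two execution methods (queries are hypothetical; run() raises on SQL errors, while run_no_throw() returns the error text as a string, which suits agent loops that feed errors back to a model):

```python
# Hypothetical queries for illustration.
good_query = "SELECT 1 AS answer"
bad_query = "SELECT * FROM table_that_does_not_exist"

try:
    from pyspark.sql import SparkSession
    from langchain_community.utilities.spark_sql import SparkSQL

    spark = SparkSession.builder.master("local[1]").getOrCreate()
    db = SparkSQL(spark_session=spark)

    print(db.run(good_query, fetch="all"))  # stringified result rows
    print(db.run_no_throw(bad_query))       # error message, no exception
except Exception as exc:  # pyspark missing, or no local Spark available
    print(f"Skipping live example: {exc}")
```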
