SparkSQL

class langchain_community.utilities.spark_sql.SparkSQL(spark_session: SparkSession | None = None, catalog: str | None = None, schema: str | None = None, ignore_tables: List[str] | None = None, include_tables: List[str] | None = None, sample_rows_in_table_info: int = 3)

SparkSQL is a utility class for interacting with Spark SQL.

Initialize a SparkSQL object.

Parameters:
  • spark_session (Optional[SparkSession]) – A SparkSession object. If not provided, one will be created.

  • catalog (Optional[str]) – The catalog to use. If not provided, the default catalog will be used.

  • schema (Optional[str]) – The schema to use. If not provided, the default schema will be used.

  • ignore_tables (Optional[List[str]]) – A list of tables to ignore. If not provided, all tables will be used.

  • include_tables (Optional[List[str]]) – A list of tables to include. If not provided, all tables will be used.

  • sample_rows_in_table_info (int) – The number of rows to include in the table info. Defaults to 3.

Methods

__init__([spark_session, catalog, schema, ...])

Initialize a SparkSQL object.

from_uri(database_uri[, engine_args])

Create a remote Spark session via Spark Connect.

get_table_info([table_names])

get_table_info_no_throw([table_names])

Get information about specified tables.

get_usable_table_names()

Get names of tables available.

run(command[, fetch])

run_no_throw(command[, fetch])

Execute a SQL command and return a string representing the results.

__init__(spark_session: SparkSession | None = None, catalog: str | None = None, schema: str | None = None, ignore_tables: List[str] | None = None, include_tables: List[str] | None = None, sample_rows_in_table_info: int = 3)

Initialize a SparkSQL object.

Parameters:
  • spark_session (Optional[SparkSession]) – A SparkSession object. If not provided, one will be created.

  • catalog (Optional[str]) – The catalog to use. If not provided, the default catalog will be used.

  • schema (Optional[str]) – The schema to use. If not provided, the default schema will be used.

  • ignore_tables (Optional[List[str]]) – A list of tables to ignore. If not provided, all tables will be used.

  • include_tables (Optional[List[str]]) – A list of tables to include. If not provided, all tables will be used.

  • sample_rows_in_table_info (int) – The number of rows to include in the table info. Defaults to 3.
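
As a minimal sketch of how these parameters fit together, the helper below builds a SparkSQL wrapper over a local session. The function name make_spark_sql and the local[1] master are illustrative assumptions, not part of the library. The imports are deferred so the snippet can be loaded on a machine where pyspark is not installed.

```python
from typing import List, Optional


def make_spark_sql(include_tables: Optional[List[str]] = None, sample_rows: int = 3):
    """Build a SparkSQL wrapper over a local session (hypothetical helper)."""
    # Deferred imports: both packages must be installed before this is called.
    from langchain_community.utilities.spark_sql import SparkSQL
    from pyspark.sql import SparkSession

    # Assumption: a single-threaded local master is enough for experimentation.
    spark = SparkSession.builder.master("local[1]").getOrCreate()
    return SparkSQL(
        spark_session=spark,
        include_tables=include_tables,
        sample_rows_in_table_info=sample_rows,
    )
```

Calling make_spark_sql(include_tables=["trips"]) would restrict the wrapper to a table named trips, assuming such a table exists in the current schema (the table name is hypothetical).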

classmethod from_uri(database_uri: str, engine_args: dict | None = None, **kwargs: Any) → SparkSQL

Create a remote Spark session via Spark Connect. For example: SparkSQL.from_uri("sc://localhost:15002")

Parameters:
  • database_uri (str) –

  • engine_args (dict | None) –

  • kwargs (Any) –

Return type:

SparkSQL
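
The following sketch wraps the from_uri call shown above in a small helper that lists the tables visible through a Spark Connect endpoint. The helper name remote_table_names is hypothetical, and the default URI is simply the one from the example above. A reachable Spark Connect server is assumed at call time.

```python
from typing import List


def remote_table_names(uri: str = "sc://localhost:15002") -> List[str]:
    """List usable tables on a Spark Connect endpoint (hypothetical helper)."""
    # Deferred import: needs langchain_community and a pyspark installation
    # with Spark Connect support.
    from langchain_community.utilities.spark_sql import SparkSQL

    db = SparkSQL.from_uri(uri)
    # get_usable_table_names() returns an iterable, so materialize it as a list.
    return list(db.get_usable_table_names())
```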

get_table_info(table_names: List[str] | None = None) → str
Parameters:

table_names (List[str] | None) –

Return type:

str

get_table_info_no_throw(table_names: List[str] | None = None) → str

Get information about specified tables.

Follows best practices as specified in: Rajkumar et al, 2022 (https://arxiv.org/abs/2204.00498)

If sample_rows_in_table_info is nonzero, that many sample rows are appended to each table description. According to the paper, including sample rows can improve the quality of the generated SQL.

Parameters:

table_names (List[str] | None) –

Return type:

str
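
To illustrate what a table description with appended sample rows might look like, here is a minimal, library-independent sketch. The function format_table_info and the exact layout (a CREATE TABLE statement followed by a commented block of tab-separated rows) are assumptions made for illustration; the string SparkSQL actually produces may differ.

```python
from typing import List, Sequence


def format_table_info(
    create_stmt: str,
    table: str,
    columns: Sequence[str],
    sample_rows: List[Sequence[str]],
) -> str:
    """Append sample rows to a table's CREATE statement (hypothetical layout)."""
    if not sample_rows:
        # No sample rows requested: the schema alone is the description.
        return create_stmt
    header = "\t".join(columns)
    body = "\n".join("\t".join(row) for row in sample_rows)
    return (
        f"{create_stmt}\n\n/*\n{len(sample_rows)} rows from {table} table:\n"
        f"{header}\n{body}\n*/"
    )


info = format_table_info(
    "CREATE TABLE t (id INT, name STRING)",
    "t",
    ["id", "name"],
    [["1", "a"], ["2", "b"]],
)
print(info)
```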

get_usable_table_names() → Iterable[str]

Get names of tables available.

Return type:

Iterable[str]
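
One plausible way the include_tables and ignore_tables settings could determine the usable table names is sketched below. This is an assumption-laden illustration, not the library's code: the function usable_table_names is hypothetical, and both the precedence of the include list and the sorted output are assumptions.

```python
from typing import Iterable, List, Optional


def usable_table_names(
    all_tables: Iterable[str],
    include_tables: Optional[List[str]] = None,
    ignore_tables: Optional[List[str]] = None,
) -> List[str]:
    """Filter table names by an include list or an ignore list (hypothetical)."""
    include = set(include_tables or [])
    ignore = set(ignore_tables or [])
    if include:
        # Assumption: an explicit include list takes precedence.
        return sorted(include)
    # Otherwise drop the ignored tables; with neither list, all tables remain.
    return sorted(set(all_tables) - ignore)
```

With neither list set, every table is returned, which matches the "all tables will be used" default described for both parameters.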

run(command: str, fetch: str = 'all') → str
Parameters:
  • command (str) –

  • fetch (str) –

Return type:

str

run_no_throw(command: str, fetch: str = 'all') → str

Execute a SQL command and return a string representing the results.

If the statement returns rows, a string of the results is returned. If the statement returns no rows, an empty string is returned.

If the statement throws an error, the error message is returned.

Parameters:
  • command (str) –

  • fetch (str) –

Return type:

str
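
The no-throw behavior described above can be sketched as a thin wrapper that turns exceptions into returned text. The names below (run_no_throw as a free function, fake_run as a stand-in query runner) and the "Error: " prefix are assumptions for illustration only; no Spark session is involved.

```python
from typing import Callable


def run_no_throw(run: Callable[[str], str], command: str) -> str:
    """Call run(command); on failure return the error message as a string."""
    try:
        return run(command)
    except Exception as exc:
        # Assumption: errors are reported as text with an "Error: " prefix.
        return f"Error: {exc}"


def fake_run(command: str) -> str:
    """Stand-in for a real query runner (hypothetical)."""
    if not command.strip().upper().startswith("SELECT"):
        raise ValueError("only SELECT is supported in this stub")
    return "[('1',)]"


ok = run_no_throw(fake_run, "SELECT 1")
failed = run_no_throw(fake_run, "DROP TABLE t")
```

Returning the message instead of raising lets a caller, such as an agent loop, read the error and retry with a corrected query.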
