pyspark.sql.SparkSession

class pyspark.sql.SparkSession(sparkContext, jsparkSession=None, options={})
The entry point to programming Spark with the Dataset and DataFrame API.

A SparkSession can be used to create a DataFrame, register a DataFrame as a table, execute SQL over tables, cache tables, and read parquet files. To create a SparkSession, use the following builder pattern.

Changed in version 3.4.0: Supports Spark Connect.
builder
    Creates a Builder for constructing a SparkSession.

    Changed in version 3.4.0: Supports Spark Connect.
Examples
Create a Spark session.
>>> spark = (
...     SparkSession.builder
...     .master("local")
...     .appName("Word Count")
...     .config("spark.some.config.option", "some-value")
...     .getOrCreate()
... )
Create a Spark session with Spark Connect.
>>> spark = (
...     SparkSession.builder
...     .remote("sc://localhost")
...     .appName("Word Count")
...     .config("spark.some.config.option", "some-value")
...     .getOrCreate()
... )
Methods

active()
    Returns the active or default SparkSession for the current thread, returned by the builder.

addArtifact(*path[, pyfile, archive, file])
    Add artifact(s) to the client session.

addArtifacts(*path[, pyfile, archive, file])
    Add artifact(s) to the client session.

addTag(tag)
    Add a tag to be assigned to all the operations started by this thread in this session.

clearProgressHandlers()
    Clear all registered progress handlers.

clearTags()
    Clear the current thread's operation tags.

copyFromLocalToFs(local_path, dest_path)
    Copy a file from the local file system to the cloud storage file system.

createDataFrame(data[, schema, ...])
    Creates a DataFrame from an RDD, a list, a pandas.DataFrame, a numpy.ndarray, or a pyarrow.Table.

getActiveSession()
    Returns the active SparkSession for the current thread, returned by the builder.

getTags()
    Get the tags that are currently set to be assigned to all the operations started by this thread.

interruptAll()
    Interrupt all operations of this session currently running on the connected server.

interruptOperation(op_id)
    Interrupt an operation of this session with the given operation ID.

interruptTag(tag)
    Interrupt all operations of this session with the given operation tag.

newSession()
    Returns a new SparkSession with separate SQLConf, registered temporary views, and UDFs, but a shared SparkContext and table cache.

range(start[, end, step, numPartitions])
    Create a DataFrame with a single pyspark.sql.types.LongType column named id, containing elements in a range from start to end (exclusive) with step value step.

registerProgressHandler(handler)
    Register a progress handler to be called when a progress update is received from the server.

removeProgressHandler(handler)
    Remove a progress handler that was previously registered.

removeTag(tag)
    Remove a tag previously added to be assigned to all the operations started by this thread in this session.

sql(sqlQuery[, args])
    Returns a DataFrame representing the result of the given query.

stop()
    Stop the underlying SparkContext.

table(tableName)
    Returns the specified table as a DataFrame.

Attributes
builder
    Creates a Builder for constructing a SparkSession.

catalog
    Interface through which the user may create, drop, alter or query underlying databases, tables, functions, etc.

client
    Gives access to the Spark Connect client.

conf
    Runtime configuration interface for Spark.

dataSource
    Returns a DataSourceRegistration for data source registration.

profile
    Returns a Profile for performance/memory profiling.

read
    Returns a DataFrameReader that can be used to read data in as a DataFrame.

readStream
    Returns a DataStreamReader that can be used to read data streams as a streaming DataFrame.

sparkContext
    Returns the underlying SparkContext.

streams
    Returns a StreamingQueryManager that allows managing all the StreamingQuery instances active on this context.

udf
    Returns a UDFRegistration for UDF registration.

udtf
    Returns a UDTFRegistration for UDTF registration.

version
    The version of Spark on which this application is running.