SparkSession(sparkContext[, jsparkSession])
SparkSession
The entry point to programming Spark with the Dataset and DataFrame API.
Catalog(sparkSession)
Catalog
User-facing catalog API, accessible through SparkSession.catalog.
DataFrame(jdf, sql_ctx)
DataFrame
A distributed collection of data grouped into named columns.
Column(jc)
Column
A column in a DataFrame.
Row
A row in a DataFrame.
GroupedData(jgd, df)
GroupedData
A set of methods for aggregations on a DataFrame, created by DataFrame.groupBy().
PandasCogroupedOps(gd1, gd2)
PandasCogroupedOps
A logical grouping of two GroupedData, created by GroupedData.cogroup().
DataFrameNaFunctions(df)
DataFrameNaFunctions
Functionality for working with missing data in DataFrame.
DataFrameStatFunctions(df)
DataFrameStatFunctions
Functionality for statistic functions with DataFrame.
Window
Utility functions for defining window in DataFrames.
The entry point to programming Spark with the Dataset and DataFrame API. To create a Spark session, use the SparkSession.builder attribute. See also SparkSession.
SparkSession.builder
SparkSession.builder.appName(name)
SparkSession.builder.appName
Sets a name for the application, which will be shown in the Spark web UI.
SparkSession.builder.config([key, value, conf])
SparkSession.builder.config
Sets a config option.
SparkSession.builder.enableHiveSupport()
SparkSession.builder.enableHiveSupport
Enables Hive support, including connectivity to a persistent Hive metastore, support for Hive SerDes, and Hive user-defined functions.
SparkSession.builder.getOrCreate()
SparkSession.builder.getOrCreate
Gets an existing SparkSession or, if there is no existing one, creates a new one based on the options set in this builder.
SparkSession.builder.master(master)
SparkSession.builder.master
Sets the Spark master URL to connect to, such as “local” to run locally, “local[4]” to run locally with 4 cores, or “spark://master:7077” to run on a Spark standalone cluster.
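As an informal sketch of how these builder calls chain together (the master URL, application name, and config key below are only illustrative):
>>> from pyspark.sql import SparkSession
>>> spark = (
...     SparkSession.builder
...     .master("local[4]")
...     .appName("builder-example")
...     .config("spark.sql.shuffle.partitions", "8")
...     .getOrCreate()
... )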
SparkSession.catalog
Interface through which the user may create, drop, alter or query underlying databases, tables, functions, etc.
SparkSession.conf
Runtime configuration interface for Spark.
SparkSession.createDataFrame(data[, schema, …])
SparkSession.createDataFrame
Creates a DataFrame from an RDD, a list or a pandas.DataFrame.
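For example, a minimal sketch (assuming an active SparkSession named spark):
>>> df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
>>> df.show()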
SparkSession.getActiveSession()
SparkSession.getActiveSession
Returns the active SparkSession for the current thread, returned by the builder
SparkSession.newSession()
SparkSession.newSession
Returns a new SparkSession as a new session, with separate SQLConf, registered temporary views and UDFs, but a shared SparkContext and table cache.
SparkSession.range(start[, end, step, …])
SparkSession.range
Create a DataFrame with single pyspark.sql.types.LongType column named id, containing elements in a range from start to end (exclusive) with step value step.
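For instance, assuming an active SparkSession named spark:
>>> spark.range(1, 7, 2).collect()
[Row(id=1), Row(id=3), Row(id=5)]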
SparkSession.read
Returns a DataFrameReader that can be used to read data in as a DataFrame.
SparkSession.readStream
Returns a DataStreamReader that can be used to read data streams as a streaming DataFrame.
SparkSession.sparkContext
Returns the underlying SparkContext.
SparkSession.sql(sqlQuery)
SparkSession.sql
Returns a DataFrame representing the result of the given query.
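A small illustrative query (assuming an active SparkSession named spark):
>>> spark.sql("SELECT 1 AS id, 'a' AS value").show()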
SparkSession.stop()
SparkSession.stop
Stop the underlying SparkContext.
SparkSession.streams
Returns a StreamingQueryManager that allows managing all the StreamingQuery instances active on this context.
SparkSession.table(tableName)
SparkSession.table
Returns the specified table as a DataFrame.
SparkSession.udf
Returns a UDFRegistration for UDF registration.
SparkSession.version
The version of Spark on which this application is running.
RuntimeConfig(jconf)
RuntimeConfig
User-facing configuration API, accessible through SparkSession.conf.
DataFrameReader.csv(path[, schema, sep, …])
DataFrameReader.csv
Loads a CSV file and returns the result as a DataFrame.
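A hedged sketch; the path and the header/inferSchema options shown are only examples:
>>> df = spark.read.csv("/tmp/people.csv", header=True, inferSchema=True)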
DataFrameReader.format(source)
DataFrameReader.format
Specifies the input data source format.
DataFrameReader.jdbc(url, table[, column, …])
DataFrameReader.jdbc
Construct a DataFrame representing the database table named table accessible via JDBC URL url and connection properties.
DataFrameReader.json(path[, schema, …])
DataFrameReader.json
Loads JSON files and returns the results as a DataFrame.
DataFrameReader.load([path, format, schema])
DataFrameReader.load
Loads data from a data source and returns it as a DataFrame.
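For illustration, combining format(), option(), schema() and load() (the path, option, and DDL schema string are hypothetical):
>>> df = (
...     spark.read.format("json")
...     .option("multiLine", "true")
...     .schema("id INT, name STRING")
...     .load("/tmp/events.json")
... )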
DataFrameReader.option(key, value)
DataFrameReader.option
Adds an input option for the underlying data source.
DataFrameReader.options(**options)
DataFrameReader.options
Adds input options for the underlying data source.
DataFrameReader.orc(path[, mergeSchema, …])
DataFrameReader.orc
Loads ORC files, returning the result as a DataFrame.
DataFrameReader.parquet(*paths, **options)
DataFrameReader.parquet
Loads Parquet files, returning the result as a DataFrame.
DataFrameReader.schema(schema)
DataFrameReader.schema
Specifies the input schema.
DataFrameReader.table(tableName)
DataFrameReader.table
Returns the specified table as a DataFrame.
DataFrameWriter.bucketBy(numBuckets, col, *cols)
DataFrameWriter.bucketBy
Buckets the output by the given columns.
DataFrameWriter.csv(path[, mode, …])
DataFrameWriter.csv
Saves the content of the DataFrame in CSV format at the specified path.
DataFrameWriter.format(source)
DataFrameWriter.format
Specifies the underlying output data source.
DataFrameWriter.insertInto(tableName[, …])
DataFrameWriter.insertInto
Inserts the content of the DataFrame to the specified table.
DataFrameWriter.jdbc(url, table[, mode, …])
DataFrameWriter.jdbc
Saves the content of the DataFrame to an external database table via JDBC.
DataFrameWriter.json(path[, mode, …])
DataFrameWriter.json
Saves the content of the DataFrame in JSON format (JSON Lines text format or newline-delimited JSON) at the specified path.
DataFrameWriter.mode(saveMode)
DataFrameWriter.mode
Specifies the behavior when data or table already exists.
DataFrameWriter.option(key, value)
DataFrameWriter.option
Adds an output option for the underlying data source.
DataFrameWriter.options(**options)
DataFrameWriter.options
Adds output options for the underlying data source.
DataFrameWriter.orc(path[, mode, …])
DataFrameWriter.orc
Saves the content of the DataFrame in ORC format at the specified path.
DataFrameWriter.parquet(path[, mode, …])
DataFrameWriter.parquet
Saves the content of the DataFrame in Parquet format at the specified path.
DataFrameWriter.partitionBy(*cols)
DataFrameWriter.partitionBy
Partitions the output by the given columns on the file system.
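A typical write chain, sketched with an existing DataFrame df and an illustrative partition column and output path:
>>> (
...     df.write.mode("overwrite")
...     .partitionBy("year")
...     .parquet("/tmp/output")
... )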
DataFrameWriter.save([path, format, mode, …])
DataFrameWriter.save
Saves the contents of the DataFrame to a data source.
DataFrameWriter.saveAsTable(name[, format, …])
DataFrameWriter.saveAsTable
Saves the content of the DataFrame as the specified table.
DataFrameWriter.sortBy(col, *cols)
DataFrameWriter.sortBy
Sorts the output in each bucket by the given columns on the file system.
DataFrameWriter.text(path[, compression, …])
DataFrameWriter.text
Saves the content of the DataFrame in a text file at the specified path.
DataFrame.agg(*exprs)
DataFrame.agg
Aggregate on the entire DataFrame without groups (shorthand for df.groupBy().agg()).
DataFrame.alias(alias)
DataFrame.alias
Returns a new DataFrame with an alias set.
DataFrame.approxQuantile(col, probabilities, …)
DataFrame.approxQuantile
Calculates the approximate quantiles of numerical columns of a DataFrame.
DataFrame.cache()
DataFrame.cache
Persists the DataFrame with the default storage level (MEMORY_AND_DISK).
DataFrame.checkpoint([eager])
DataFrame.checkpoint
Returns a checkpointed version of this DataFrame.
DataFrame.coalesce(numPartitions)
DataFrame.coalesce
Returns a new DataFrame that has exactly numPartitions partitions.
DataFrame.colRegex(colName)
DataFrame.colRegex
Selects column based on the column name specified as a regex and returns it as Column.
DataFrame.collect()
DataFrame.collect
Returns all the records as a list of Row.
DataFrame.columns
Returns all column names as a list.
DataFrame.corr(col1, col2[, method])
DataFrame.corr
Calculates the correlation of two columns of a DataFrame as a double value.
DataFrame.count()
DataFrame.count
Returns the number of rows in this DataFrame.
DataFrame.cov(col1, col2)
DataFrame.cov
Calculate the sample covariance for the given columns, specified by their names, as a double value.
DataFrame.createGlobalTempView(name)
DataFrame.createGlobalTempView
Creates a global temporary view with this DataFrame.
DataFrame.createOrReplaceGlobalTempView(name)
DataFrame.createOrReplaceGlobalTempView
Creates or replaces a global temporary view using the given name.
DataFrame.createOrReplaceTempView(name)
DataFrame.createOrReplaceTempView
Creates or replaces a local temporary view with this DataFrame.
DataFrame.createTempView(name)
DataFrame.createTempView
Creates a local temporary view with this DataFrame.
DataFrame.crossJoin(other)
DataFrame.crossJoin
Returns the cartesian product with another DataFrame.
DataFrame.crosstab(col1, col2)
DataFrame.crosstab
Computes a pair-wise frequency table of the given columns.
DataFrame.cube(*cols)
DataFrame.cube
Create a multi-dimensional cube for the current DataFrame using the specified columns, so we can run aggregations on them.
DataFrame.describe(*cols)
DataFrame.describe
Computes basic statistics for numeric and string columns.
DataFrame.distinct()
DataFrame.distinct
Returns a new DataFrame containing the distinct rows in this DataFrame.
DataFrame.drop(*cols)
DataFrame.drop
Returns a new DataFrame that drops the specified column.
DataFrame.dropDuplicates([subset])
DataFrame.dropDuplicates
Return a new DataFrame with duplicate rows removed, optionally only considering certain columns.
DataFrame.drop_duplicates([subset])
DataFrame.drop_duplicates
drop_duplicates() is an alias for dropDuplicates().
DataFrame.dropna([how, thresh, subset])
DataFrame.dropna
Returns a new DataFrame omitting rows with null values.
DataFrame.dtypes
Returns all column names and their data types as a list.
DataFrame.exceptAll(other)
DataFrame.exceptAll
Return a new DataFrame containing rows in this DataFrame but not in another DataFrame while preserving duplicates.
DataFrame.explain([extended, mode])
DataFrame.explain
Prints the (logical and physical) plans to the console for debugging purposes.
DataFrame.fillna(value[, subset])
DataFrame.fillna
Replace null values, alias for na.fill().
DataFrame.filter(condition)
DataFrame.filter
Filters rows using the given condition.
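For example (assuming an active SparkSession named spark; the column names are illustrative):
>>> from pyspark.sql import functions as F
>>> df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
>>> df.filter(F.col("id") > 1).select("id", "value").show()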
DataFrame.first()
DataFrame.first
Returns the first row as a Row.
DataFrame.foreach(f)
DataFrame.foreach
Applies the f function to all Row of this DataFrame.
DataFrame.foreachPartition(f)
DataFrame.foreachPartition
Applies the f function to each partition of this DataFrame.
DataFrame.freqItems(cols[, support])
DataFrame.freqItems
Finding frequent items for columns, possibly with false positives.
DataFrame.groupBy(*cols)
DataFrame.groupBy
Groups the DataFrame using the specified columns, so we can run aggregation on them.
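A short sketch of grouping and aggregating (the dept and salary columns are made up for illustration):
>>> from pyspark.sql import functions as F
>>> df = spark.createDataFrame([("a", 1), ("a", 3), ("b", 2)], ["dept", "salary"])
>>> df.groupBy("dept").agg(F.avg("salary").alias("avg_salary")).show()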
DataFrame.head([n])
DataFrame.head
Returns the first n rows.
DataFrame.hint(name, *parameters)
DataFrame.hint
Specifies some hint on the current DataFrame.
DataFrame.inputFiles()
DataFrame.inputFiles
Returns a best-effort snapshot of the files that compose this DataFrame.
DataFrame.intersect(other)
DataFrame.intersect
Return a new DataFrame containing rows only in both this DataFrame and another DataFrame.
DataFrame.intersectAll(other)
DataFrame.intersectAll
Return a new DataFrame containing rows in both this DataFrame and another DataFrame while preserving duplicates.
DataFrame.isLocal()
DataFrame.isLocal
Returns True if the collect() and take() methods can be run locally (without any Spark executors).
DataFrame.isStreaming
Returns True if this DataFrame contains one or more sources that continuously return data as it arrives.
DataFrame.join(other[, on, how])
DataFrame.join
Joins with another DataFrame, using the given join expression.
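For instance, a left join on a shared id column (both DataFrames below are illustrative):
>>> people = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])
>>> ages = spark.createDataFrame([(1, 30)], ["id", "age"])
>>> people.join(ages, on="id", how="left").show()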
DataFrame.limit(num)
DataFrame.limit
Limits the result count to the number specified.
DataFrame.localCheckpoint([eager])
DataFrame.localCheckpoint
Returns a locally checkpointed version of this DataFrame.
DataFrame.mapInPandas(func, schema)
DataFrame.mapInPandas
Maps an iterator of batches in the current DataFrame using a Python native function that takes and outputs a pandas DataFrame, and returns the result as a DataFrame.
DataFrame.na
Returns a DataFrameNaFunctions for handling missing values.
DataFrame.orderBy(*cols, **kwargs)
DataFrame.orderBy
Returns a new DataFrame sorted by the specified column(s).
DataFrame.persist([storageLevel])
DataFrame.persist
Sets the storage level to persist the contents of the DataFrame across operations after the first time it is computed.
DataFrame.printSchema()
DataFrame.printSchema
Prints out the schema in the tree format.
DataFrame.randomSplit(weights[, seed])
DataFrame.randomSplit
Randomly splits this DataFrame with the provided weights.
DataFrame.rdd
Returns the content as a pyspark.RDD of Row.
DataFrame.registerTempTable(name)
DataFrame.registerTempTable
Registers this DataFrame as a temporary table using the given name.
DataFrame.repartition(numPartitions, *cols)
DataFrame.repartition
Returns a new DataFrame partitioned by the given partitioning expressions.
DataFrame.repartitionByRange(numPartitions, …)
DataFrame.repartitionByRange
Returns a new DataFrame range-partitioned by the given partitioning expressions.
DataFrame.replace(to_replace[, value, subset])
DataFrame.replace
Returns a new DataFrame replacing a value with another value.
DataFrame.rollup(*cols)
DataFrame.rollup
Create a multi-dimensional rollup for the current DataFrame using the specified columns, so we can run aggregation on them.
DataFrame.sameSemantics(other)
DataFrame.sameSemantics
Returns True when the logical query plans inside both DataFrames are equal and therefore return same results.
DataFrame.sample([withReplacement, …])
DataFrame.sample
Returns a sampled subset of this DataFrame.
DataFrame.sampleBy(col, fractions[, seed])
DataFrame.sampleBy
Returns a stratified sample without replacement based on the fraction given on each stratum.
DataFrame.schema
Returns the schema of this DataFrame as a pyspark.sql.types.StructType.
DataFrame.select(*cols)
DataFrame.select
Projects a set of expressions and returns a new DataFrame.
DataFrame.selectExpr(*expr)
DataFrame.selectExpr
Projects a set of SQL expressions and returns a new DataFrame.
DataFrame.semanticHash()
DataFrame.semanticHash
Returns a hash code of the logical query plan against this DataFrame.
DataFrame.show([n, truncate, vertical])
DataFrame.show
Prints the first n rows to the console.
DataFrame.sort(*cols, **kwargs)
DataFrame.sort
Returns a new DataFrame sorted by the specified column(s).
DataFrame.sortWithinPartitions(*cols, **kwargs)
DataFrame.sortWithinPartitions
Returns a new DataFrame with each partition sorted by the specified column(s).
DataFrame.stat
Returns a DataFrameStatFunctions for statistic functions.
DataFrame.storageLevel
Get the DataFrame’s current storage level.
DataFrame.subtract(other)
DataFrame.subtract
Return a new DataFrame containing rows in this DataFrame but not in another DataFrame.
DataFrame.summary(*statistics)
DataFrame.summary
Computes specified statistics for numeric and string columns.
DataFrame.tail(num)
DataFrame.tail
Returns the last num rows as a list of Row.
DataFrame.take(num)
DataFrame.take
Returns the first num rows as a list of Row.
DataFrame.toDF(*cols)
DataFrame.toDF
Returns a new DataFrame with the specified column names.
DataFrame.toJSON([use_unicode])
DataFrame.toJSON
Converts a DataFrame into an RDD of strings.
DataFrame.toLocalIterator([prefetchPartitions])
DataFrame.toLocalIterator
Returns an iterator that contains all of the rows in this DataFrame.
DataFrame.toPandas()
DataFrame.toPandas
Returns the contents of this DataFrame as a pandas.DataFrame.
DataFrame.transform(func)
DataFrame.transform
Returns a new DataFrame produced by applying the given function to this DataFrame; concise syntax for chaining custom transformations.
DataFrame.union(other)
DataFrame.union
Return a new DataFrame containing union of rows in this and another DataFrame.
DataFrame.unionAll(other)
DataFrame.unionAll
Return a new DataFrame containing union of rows in this and another DataFrame (an alias of union).
DataFrame.unionByName(other[, …])
DataFrame.unionByName
Returns a new DataFrame containing union of rows in this and another DataFrame.
DataFrame.unpersist([blocking])
DataFrame.unpersist
Marks the DataFrame as non-persistent, and removes all blocks for it from memory and disk.
DataFrame.where(condition)
DataFrame.where
where() is an alias for filter().
DataFrame.withColumn(colName, col)
DataFrame.withColumn
Returns a new DataFrame by adding a column or replacing the existing column that has the same name.
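A minimal sketch (assuming an active SparkSession named spark):
>>> from pyspark.sql import functions as F
>>> df = spark.createDataFrame([(1,), (2,)], ["id"])
>>> df.withColumn("id_plus_one", F.col("id") + 1).show()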
DataFrame.withColumnRenamed(existing, new)
DataFrame.withColumnRenamed
Returns a new DataFrame by renaming an existing column.
DataFrame.withWatermark(eventTime, …)
DataFrame.withWatermark
Defines an event time watermark for this DataFrame.
DataFrame.write
Interface for saving the content of the non-streaming DataFrame out into external storage.
DataFrame.writeStream
Interface for saving the content of the streaming DataFrame out into external storage.
DataFrame.writeTo(table)
DataFrame.writeTo
Create a write configuration builder for v2 sources.
DataFrame.to_pandas_on_spark([index_col])
DataFrame.to_pandas_on_spark
Converts the existing DataFrame into a pandas-on-Spark DataFrame.
DataFrameNaFunctions.drop([how, thresh, subset])
DataFrameNaFunctions.drop
DataFrameNaFunctions.fill(value[, subset])
DataFrameNaFunctions.fill
DataFrameNaFunctions.replace(to_replace[, …])
DataFrameNaFunctions.replace
DataFrameStatFunctions.approxQuantile(col, …)
DataFrameStatFunctions.approxQuantile
DataFrameStatFunctions.corr(col1, col2[, method])
DataFrameStatFunctions.corr
DataFrameStatFunctions.cov(col1, col2)
DataFrameStatFunctions.cov
DataFrameStatFunctions.crosstab(col1, col2)
DataFrameStatFunctions.crosstab
DataFrameStatFunctions.freqItems(cols[, support])
DataFrameStatFunctions.freqItems
DataFrameStatFunctions.sampleBy(col, fractions)
DataFrameStatFunctions.sampleBy
Column.alias(*alias, **kwargs)
Column.alias
Returns this column aliased with a new name or names (in the case of expressions that return more than one column, such as explode).
Column.asc()
Column.asc
Returns a sort expression based on ascending order of the column.
Column.asc_nulls_first()
Column.asc_nulls_first
Returns a sort expression based on ascending order of the column, and null values appear before non-null values.
Column.asc_nulls_last()
Column.asc_nulls_last
Returns a sort expression based on ascending order of the column, and null values appear after non-null values.
Column.astype(dataType)
Column.astype
astype() is an alias for cast().
Column.between(lowerBound, upperBound)
Column.between
True if the current column is between the lower bound and upper bound, inclusive.
Column.bitwiseAND(other)
Column.bitwiseAND
Compute bitwise AND of this expression with another expression.
Column.bitwiseOR(other)
Column.bitwiseOR
Compute bitwise OR of this expression with another expression.
Column.bitwiseXOR(other)
Column.bitwiseXOR
Compute bitwise XOR of this expression with another expression.
Column.cast(dataType)
Column.cast
Casts the column into type dataType.
Column.contains(other)
Column.contains
Contains the other element.
Column.desc()
Column.desc
Returns a sort expression based on the descending order of the column.
Column.desc_nulls_first()
Column.desc_nulls_first
Returns a sort expression based on the descending order of the column, and null values appear before non-null values.
Column.desc_nulls_last()
Column.desc_nulls_last
Returns a sort expression based on the descending order of the column, and null values appear after non-null values.
Column.dropFields(*fieldNames)
Column.dropFields
An expression that drops fields in StructType by name.
Column.endswith(other)
Column.endswith
String ends with.
Column.eqNullSafe(other)
Column.eqNullSafe
Equality test that is safe for null values.
Column.getField(name)
Column.getField
An expression that gets a field by name in a StructType.
Column.getItem(key)
Column.getItem
An expression that gets an item at position ordinal out of a list, or gets an item by key out of a dict.
Column.isNotNull()
Column.isNotNull
True if the current expression is NOT null.
Column.isNull()
Column.isNull
True if the current expression is null.
Column.isin(*cols)
Column.isin
A boolean expression that is evaluated to true if the value of this expression is contained by the evaluated values of the arguments.
Column.like(other)
Column.like
SQL like expression.
Column.name(*alias, **kwargs)
Column.name
name() is an alias for alias().
Column.otherwise(value)
Column.otherwise
Evaluates a list of conditions and returns one of multiple possible result expressions.
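For example, pairing when() with otherwise() (the threshold and labels are illustrative):
>>> from pyspark.sql import functions as F
>>> df = spark.range(3)
>>> df.select(F.when(F.col("id") > 1, "big").otherwise("small").alias("size")).show()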
Column.over(window)
Column.over
Define a windowing column.
Column.rlike(other)
Column.rlike
SQL RLIKE expression (LIKE with Regex).
Column.startswith(other)
Column.startswith
String starts with.
Column.substr(startPos, length)
Column.substr
Return a Column which is a substring of the column.
Column.when(condition, value)
Column.when
Evaluates a list of conditions and returns one of multiple possible result expressions.
Column.withField(fieldName, col)
Column.withField
An expression that adds/replaces a field in StructType by name.
ArrayType(elementType[, containsNull])
ArrayType
Array data type.
BinaryType
Binary (byte array) data type.
BooleanType
Boolean data type.
ByteType
Byte data type, i.e. a signed integer in a single byte.
DataType
Base class for data types.
DateType
Date (datetime.date) data type.
DecimalType([precision, scale])
DecimalType
Decimal (decimal.Decimal) data type.
DoubleType
Double data type, representing double precision floats.
FloatType
Float data type, representing single precision floats.
IntegerType
Int data type, i.e. a signed 32-bit integer.
LongType
Long data type, i.e. a signed 64-bit integer.
MapType(keyType, valueType[, valueContainsNull])
MapType
Map data type.
NullType
Null type.
ShortType
Short data type, i.e. a signed 16-bit integer.
StringType
String data type.
StructField(name, dataType[, nullable, metadata])
StructField
A field in StructType.
StructType([fields])
Struct type, consisting of a list of StructField.
TimestampType
Timestamp (datetime.datetime) data type.
Row.asDict([recursive])
Row.asDict
Returns the Row as a dict.
abs(col)
abs
Computes the absolute value.
acos(col)
acos
New in version 1.4.0.
acosh(col)
acosh
Computes inverse hyperbolic cosine of the input column.
add_months(start, months)
add_months
Returns the date that is months months after start
aggregate(col, initialValue, merge[, finish])
aggregate
Applies a binary operator to an initial state and all elements in the array, and reduces this to a single state.
approxCountDistinct(col[, rsd])
approxCountDistinct
Deprecated since version 2.1.0.
approx_count_distinct(col[, rsd])
approx_count_distinct
Aggregate function: returns a new Column for approximate distinct count of column col.
array(*cols)
array
Creates a new array column.
array_contains(col, value)
array_contains
Collection function: returns null if the array is null, true if the array contains the given value, and false otherwise.
array_distinct(col)
array_distinct
Collection function: removes duplicate values from the array.
array_except(col1, col2)
array_except
Collection function: returns an array of the elements in col1 but not in col2, without duplicates.
array_intersect(col1, col2)
array_intersect
Collection function: returns an array of the elements in the intersection of col1 and col2, without duplicates.
array_join(col, delimiter[, null_replacement])
array_join
Concatenates the elements of column using the delimiter.
array_max(col)
array_max
Collection function: returns the maximum value of the array.
array_min(col)
array_min
Collection function: returns the minimum value of the array.
array_position(col, value)
array_position
Collection function: Locates the position of the first occurrence of the given value in the given array.
array_remove(col, element)
array_remove
Collection function: Removes all elements equal to element from the given array.
array_repeat(col, count)
array_repeat
Collection function: creates an array containing a column repeated count times.
array_sort(col)
array_sort
Collection function: sorts the input array in ascending order.
array_union(col1, col2)
array_union
Collection function: returns an array of the elements in the union of col1 and col2, without duplicates.
arrays_overlap(a1, a2)
arrays_overlap
Collection function: returns true if the arrays contain any common non-null element; if not, returns null if both the arrays are non-empty and any of them contains a null element; returns false otherwise.
arrays_zip(*cols)
arrays_zip
Collection function: Returns a merged array of structs in which the N-th struct contains all N-th values of input arrays.
asc(col)
asc
Returns a sort expression based on the ascending order of the given column name.
asc_nulls_first(col)
asc_nulls_first
Returns a sort expression based on the ascending order of the given column name, and null values appear before non-null values.
asc_nulls_last(col)
asc_nulls_last
Returns a sort expression based on the ascending order of the given column name, and null values appear after non-null values.
ascii(col)
ascii
Computes the numeric value of the first character of the string column.
asin(col)
asin
New in version 1.3.0.
asinh(col)
asinh
Computes inverse hyperbolic sine of the input column.
assert_true(col[, errMsg])
assert_true
Returns null if the input column is true; throws an exception with the provided error message otherwise.
atan(col)
atan
atanh(col)
atanh
Computes inverse hyperbolic tangent of the input column.
atan2(col1, col2)
atan2
avg(col)
avg
Aggregate function: returns the average of the values in a group.
base64(col)
base64
Computes the BASE64 encoding of a binary column and returns it as a string column.
bin(col)
bin
Returns the string representation of the binary value of the given column.
bitwise_not(col)
bitwise_not
Computes bitwise not.
bitwiseNOT(col)
bitwiseNOT
Computes bitwise not (an alias of bitwise_not).
broadcast(df)
broadcast
Marks a DataFrame as small enough for use in broadcast joins.
bround(col[, scale])
bround
Round the given value to scale decimal places using HALF_EVEN rounding mode if scale >= 0 or at integral part when scale < 0.
bucket(numBuckets, col)
bucket
Partition transform function: A transform for any type that partitions by a hash of the input column.
cbrt(col)
cbrt
Computes the cube-root of the given value.
ceil(col)
ceil
Computes the ceiling of the given value.
coalesce(*cols)
coalesce
Returns the first column that is not null.
col(col)
col
Returns a Column based on the given column name.
collect_list(col)
collect_list
Aggregate function: returns a list of objects with duplicates.
collect_set(col)
collect_set
Aggregate function: returns a set of objects with duplicate elements eliminated.
column(col)
column
Returns a Column based on the given column name (an alias of col).
concat(*cols)
concat
Concatenates multiple input columns together into a single column.
concat_ws(sep, *cols)
concat_ws
Concatenates multiple input string columns together into a single string column, using the given separator.
conv(col, fromBase, toBase)
conv
Convert a number in a string column from one base to another.
corr(col1, col2)
corr
Returns a new Column for the Pearson Correlation Coefficient for col1 and col2.
cos(col)
cos
cosh(col)
cosh
count(col)
count
Aggregate function: returns the number of items in a group.
count_distinct(col, *cols)
count_distinct
Returns a new Column for distinct count of col or cols.
countDistinct(col, *cols)
countDistinct
Returns a new Column for distinct count of col or cols (an alias of count_distinct).
covar_pop(col1, col2)
covar_pop
Returns a new Column for the population covariance of col1 and col2.
covar_samp(col1, col2)
covar_samp
Returns a new Column for the sample covariance of col1 and col2.
crc32(col)
crc32
Calculates the cyclic redundancy check value (CRC32) of a binary column and returns the value as a bigint.
create_map(*cols)
create_map
Creates a new map column.
cume_dist()
cume_dist
Window function: returns the cumulative distribution of values within a window partition, i.e. the fraction of rows that are below the current row.
current_date()
current_date
Returns the current date at the start of query evaluation as a DateType column.
current_timestamp()
current_timestamp
Returns the current timestamp at the start of query evaluation as a TimestampType column.
date_add(start, days)
date_add
Returns the date that is days days after start
date_format(date, format)
date_format
Converts a date/timestamp/string to a value of string in the format specified by the date format given by the second argument.
date_sub(start, days)
date_sub
Returns the date that is days days before start
date_trunc(format, timestamp)
date_trunc
Returns timestamp truncated to the unit specified by the format.
datediff(end, start)
datediff
Returns the number of days from start to end.
dayofmonth(col)
dayofmonth
Extract the day of the month of a given date as integer.
dayofweek(col)
dayofweek
Extract the day of the week of a given date as integer.
dayofyear(col)
dayofyear
Extract the day of the year of a given date as integer.
days(col)
days
Partition transform function: A transform for timestamps and dates to partition data into days.
decode(col, charset)
decode
Computes the first argument into a string from a binary using the provided character set (one of ‘US-ASCII’, ‘ISO-8859-1’, ‘UTF-8’, ‘UTF-16BE’, ‘UTF-16LE’, ‘UTF-16’).
degrees(col)
degrees
Converts an angle measured in radians to an approximately equivalent angle measured in degrees.
dense_rank()
dense_rank
Window function: returns the rank of rows within a window partition, without any gaps.
desc(col)
desc
Returns a sort expression based on the descending order of the given column name.
desc_nulls_first(col)
desc_nulls_first
Returns a sort expression based on the descending order of the given column name, and null values appear before non-null values.
desc_nulls_last(col)
desc_nulls_last
Returns a sort expression based on the descending order of the given column name, and null values appear after non-null values.
element_at(col, extraction)
element_at
Collection function: Returns element of array at given index in extraction if col is array.
encode(col, charset)
encode
Computes the first argument into a binary from a string using the provided character set (one of ‘US-ASCII’, ‘ISO-8859-1’, ‘UTF-8’, ‘UTF-16BE’, ‘UTF-16LE’, ‘UTF-16’).
exists(col, f)
exists
Returns whether a predicate holds for one or more elements in the array.
exp(col)
exp
Computes the exponential of the given value.
explode(col)
explode
Returns a new row for each element in the given array or map.
explode_outer(col)
explode_outer
Returns a new row for each element in the given array or map; unlike explode, a row with null is produced when the array or map is null or empty.
expm1(col)
expm1
Computes the exponential of the given value minus one.
expr(str)
expr
Parses the expression string into the column that it represents
factorial(col)
factorial
Computes the factorial of the given value.
filter(col, f)
filter
Returns an array of elements for which a predicate holds in a given array.
first(col[, ignorenulls])
first
Aggregate function: returns the first value in a group.
flatten(col)
flatten
Collection function: creates a single array from an array of arrays.
floor(col)
floor
Computes the floor of the given value.
forall(col, f)
forall
Returns whether a predicate holds for every element in the array.
format_number(col, d)
format_number
Formats the number X to a format like ‘#,###,###.##’, rounded to d decimal places with HALF_EVEN round mode, and returns the result as a string.
format_string(format, *cols)
format_string
Formats the arguments in printf-style and returns the result as a string column.
from_csv(col, schema[, options])
from_csv
Parses a column containing a CSV string to a row with the specified schema.
from_json(col, schema[, options])
from_json
Parses a column containing a JSON string into a MapType with StringType as keys type, StructType or ArrayType with the specified schema.
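A brief sketch using a DDL-formatted schema string (the payload column and schema are made up):
>>> from pyspark.sql import functions as F
>>> df = spark.createDataFrame([('{"a": 1, "b": "x"}',)], ["payload"])
>>> df.select(F.from_json("payload", "a INT, b STRING").alias("data")).show()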
from_unixtime(timestamp[, format])
from_unixtime
Converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the given format.
from_utc_timestamp(timestamp, tz)
from_utc_timestamp
This is a common function for databases supporting TIMESTAMP WITHOUT TIMEZONE.
get_json_object(col, path)
get_json_object
Extracts json object from a json string based on json path specified, and returns json string of the extracted json object.
greatest(*cols)
greatest
Returns the greatest value of the list of column names, skipping null values.
grouping(col)
grouping
Aggregate function: indicates whether a specified column in a GROUP BY list is aggregated or not, returns 1 for aggregated or 0 for not aggregated in the result set.
grouping_id(*cols)
grouping_id
Aggregate function: returns the level of grouping, equal to (grouping(c1) << (n-1)) + (grouping(c2) << (n-2)) + … + grouping(cn).
hash(*cols)
hash
Calculates the hash code of given columns, and returns the result as an int column.
hex(col)
hex
Computes hex value of the given column, which could be pyspark.sql.types.StringType, pyspark.sql.types.BinaryType, pyspark.sql.types.IntegerType or pyspark.sql.types.LongType.
hour(col)
hour
Extract the hours of a given date as integer.
hours(col)
hours
Partition transform function: A transform for timestamps to partition data into hours.
hypot(col1, col2)
hypot
Computes sqrt(a^2 + b^2) without intermediate overflow or underflow.
initcap(col)
initcap
Translate the first letter of each word to upper case in the sentence.
input_file_name()
input_file_name
Creates a string column for the file name of the current Spark task.
instr(str, substr)
instr
Locate the position of the first occurrence of substr column in the given string.
isnan(col)
isnan
An expression that returns true iff the column is NaN.
isnull(col)
isnull
An expression that returns true iff the column is null.
json_tuple(col, *fields)
json_tuple
Creates a new row for a json column according to the given field names.
kurtosis(col)
kurtosis
Aggregate function: returns the kurtosis of the values in a group.
lag(col[, offset, default])
lag
Window function: returns the value that is offset rows before the current row, and default if there is less than offset rows before the current row.
last(col[, ignorenulls])
last
Aggregate function: returns the last value in a group.
last_day(date)
last_day
Returns the last day of the month which the given date belongs to.
lead(col[, offset, default])
lead
Window function: returns the value that is offset rows after the current row, and default if there is less than offset rows after the current row.
least(*cols)
least
Returns the least value of the list of column names, skipping null values.
length(col)
length
Computes the character length of string data or number of bytes of binary data.
levenshtein(left, right)
levenshtein
Computes the Levenshtein distance of the two given strings.
lit(col)
lit
Creates a Column of literal value.
locate(substr, str[, pos])
locate
Locate the position of the first occurrence of substr in a string column, after position pos.
log(arg1[, arg2])
log
Returns the first argument-based logarithm of the second argument.
log10(col)
log10
Computes the logarithm of the given value in Base 10.
log1p(col)
log1p
Computes the natural logarithm of the given value plus one.
log2(col)
log2
Returns the base-2 logarithm of the argument.
lower(col)
lower
Converts a string expression to lower case.
lpad(col, len, pad)
lpad
Left-pad the string column to width len with pad.
ltrim(col)
ltrim
Trim the spaces from left end for the specified string value.
map_concat(*cols)
map_concat
Returns the union of all the given maps.
map_entries(col)
map_entries
Collection function: Returns an unordered array of all entries in the given map.
map_filter(col, f)
map_filter
Returns a map whose key-value pairs satisfy a predicate.
map_from_arrays(col1, col2)
map_from_arrays
Creates a new map from two arrays.
map_from_entries(col)
map_from_entries
Collection function: Returns a map created from the given array of entries.
map_keys(col)
map_keys
Collection function: Returns an unordered array containing the keys of the map.
map_values(col)
map_values
Collection function: Returns an unordered array containing the values of the map.
map_zip_with(col1, col2, f)
map_zip_with
Merge two given maps, key-wise into a single map using a function.
max(col)
max
Aggregate function: returns the maximum value of the expression in a group.
md5(col)
md5
Calculates the MD5 digest and returns the value as a 32 character hex string.
mean(col)
mean
Aggregate function: returns the average of the values in a group (an alias of avg).
min(col)
min
Aggregate function: returns the minimum value of the expression in a group.
minute(col)
minute
Extract the minutes of a given date as integer.
monotonically_increasing_id()
monotonically_increasing_id
A column that generates monotonically increasing 64-bit integers.
month(col)
month
Extract the month of a given date as integer.
months(col)
months
Partition transform function: A transform for timestamps and dates to partition data into months.
months_between(date1, date2[, roundOff])
months_between
Returns number of months between dates date1 and date2.
nanvl(col1, col2)
nanvl
Returns col1 if it is not NaN, or col2 if col1 is NaN.
next_day(date, dayOfWeek)
next_day
Returns the first date which is later than the value of the date column.
nth_value(col, offset[, ignoreNulls])
nth_value
Window function: returns the value that is the offsetth row of the window frame (counting from 1), and null if the size of window frame is less than offset rows.
ntile(n)
ntile
Window function: returns the ntile group id (from 1 to n inclusive) in an ordered window partition.
overlay(src, replace, pos[, len])
overlay
Overlay the specified portion of src with replace, starting from byte position pos of src and proceeding for len bytes.
pandas_udf([f, returnType, functionType])
pandas_udf
Creates a pandas user defined function (a.k.a. vectorized user defined function).
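A minimal Spark 3-style sketch using type hints (the function and column below are illustrative):
>>> import pandas as pd
>>> from pyspark.sql.functions import pandas_udf
>>> @pandas_udf("double")
... def times_two(s: pd.Series) -> pd.Series:
...     return s * 2.0
>>> spark.range(3).select(times_two("id").alias("doubled")).show()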
percent_rank()
percent_rank
Window function: returns the relative rank (i.e. percentile) of rows within a window partition.
percentile_approx(col, percentage[, accuracy])
percentile_approx
Returns the approximate percentile of the numeric column col, which is the smallest value in the ordered col values (sorted from least to greatest) such that no more than percentage of col values are less than or equal to that value.
posexplode(col)
posexplode
Returns a new row for each element with position in the given array or map.
posexplode_outer(col)
posexplode_outer
Returns a new row for each element with position in the given array or map; unlike posexplode, a row with nulls is produced when the array or map is null or empty.
pow(col1, col2)
pow
Returns the value of the first argument raised to the power of the second argument.
product(col)
product
Aggregate function: returns the product of the values in a group.
quarter(col)
quarter
Extract the quarter of a given date as integer.
radians(col)
radians
Converts an angle measured in degrees to an approximately equivalent angle measured in radians.
raise_error(errMsg)
raise_error
Throws an exception with the provided error message.
rand([seed])
rand
Generates a random column with independent and identically distributed (i.i.d.) samples uniformly distributed in [0.0, 1.0).
randn([seed])
randn
Generates a column with independent and identically distributed (i.i.d.) samples from the standard normal distribution.
rank()
rank
Window function: returns the rank of rows within a window partition.
regexp_extract(str, pattern, idx)
regexp_extract
Extract a specific group matched by a Java regex, from the specified string column.
regexp_replace(str, pattern, replacement)
regexp_replace
Replace all substrings of the specified string value that match regexp with rep.
repeat(col, n)
repeat
Repeats a string column n times, and returns it as a new string column.
reverse(col)
reverse
Collection function: returns a reversed string or an array with reverse order of elements.
rint(col)
rint
Returns the double value that is closest in value to the argument and is equal to a mathematical integer.
round(col[, scale])
round
Round the given value to scale decimal places using HALF_UP rounding mode if scale >= 0 or at integral part when scale < 0.
row_number()
row_number
Window function: returns a sequential number starting at 1 within a window partition.
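As a sketch of the usual window-function pattern (the dept and salary columns are illustrative):
>>> from pyspark.sql import Window, functions as F
>>> df = spark.createDataFrame([("a", 10), ("a", 20), ("b", 5)], ["dept", "salary"])
>>> w = Window.partitionBy("dept").orderBy(F.col("salary").desc())
>>> df.withColumn("rn", F.row_number().over(w)).show()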
rpad(col, len, pad)
rpad
Right-pad the string column to width len with pad.
rtrim(col)
rtrim
Trim the spaces from right end for the specified string value.
schema_of_csv(csv[, options])
schema_of_csv
Parses a CSV string and infers its schema in DDL format.
schema_of_json(json[, options])
schema_of_json
Parses a JSON string and infers its schema in DDL format.
second(col)
second
Extract the seconds of a given date as integer.
sentences(string[, language, country])
sentences
Splits a string into arrays of sentences, where each sentence is an array of words.
sequence(start, stop[, step])
sequence
Generate a sequence of integers from start to stop, incrementing by step.
session_window(timeColumn, gapDuration)
session_window
Generates session window given a timestamp specifying column.
sha1(col)
sha1
Returns the hex string result of SHA-1.
sha2(col, numBits)
sha2
Returns the hex string result of SHA-2 family of hash functions (SHA-224, SHA-256, SHA-384, and SHA-512).
shiftleft(col, numBits)
shiftleft
Shift the given value numBits left.
shiftright(col, numBits)
shiftright
(Signed) shift the given value numBits right.
shiftrightunsigned(col, numBits)
shiftrightunsigned
Unsigned shift the given value numBits right.
shuffle(col)
shuffle
Collection function: Generates a random permutation of the given array.
signum(col)
signum
Computes the signum of the given value.
sin(col)
sin
sinh(col)
sinh
size(col)
size
Collection function: returns the length of the array or map stored in the column.
skewness(col)
skewness
Aggregate function: returns the skewness of the values in a group.
slice(x, start, length)
slice
Collection function: returns an array containing all the elements in x from index start (array indices start at 1, or from the end if start is negative) with the specified length.
sort_array(col[, asc])
sort_array
Collection function: sorts the input array in ascending or descending order according to the natural ordering of the array elements.
soundex(col)
soundex
Returns the SoundEx encoding for a string
spark_partition_id()
spark_partition_id
A column for partition ID.
split(str, pattern[, limit])
split
Splits str around matches of the given pattern.
sqrt(col)
sqrt
Computes the square root of the specified float value.
stddev(col)
stddev
Aggregate function: alias for stddev_samp.
stddev_pop(col)
stddev_pop
Aggregate function: returns population standard deviation of the expression in a group.
stddev_samp(col)
stddev_samp
Aggregate function: returns the unbiased sample standard deviation of the expression in a group.
struct(*cols)
struct
Creates a new struct column.
substring(str, pos, len)
substring
Substring starts at pos and is of length len when str is String type or returns the slice of byte array that starts at pos in byte and is of length len when str is Binary type.
substring_index(str, delim, count)
substring_index
Returns the substring from string str before count occurrences of the delimiter delim.
sum(col)
sum
Aggregate function: returns the sum of all values in the expression.
sum_distinct(col)
sum_distinct
Aggregate function: returns the sum of distinct values in the expression.
sumDistinct(col)
sumDistinct
Aggregate function: returns the sum of distinct values in the expression (an alias of sum_distinct).
tan(col)
tan
tanh(col)
tanh
timestamp_seconds(col)
timestamp_seconds
New in version 3.1.0.
toDegrees(col)
toDegrees
toRadians(col)
toRadians
to_csv(col[, options])
to_csv
Converts a column containing a StructType into a CSV string.
to_date(col[, format])
to_date
Converts a Column into pyspark.sql.types.DateType using the optionally specified format.
to_json(col[, options])
to_json
Converts a column containing a StructType, ArrayType or a MapType into a JSON string.
to_timestamp(col[, format])
to_timestamp
Converts a Column into pyspark.sql.types.TimestampType using the optionally specified format.
to_utc_timestamp(timestamp, tz)
to_utc_timestamp
transform(col, f)
transform
Returns an array of elements after applying a transformation to each element in the input array.
transform_keys(col, f)
transform_keys
Applies a function to every key-value pair in a map and returns a map with the results of those applications as the new keys for the pairs.
transform_values(col, f)
transform_values
Applies a function to every key-value pair in a map and returns a map with the results of those applications as the new values for the pairs.
translate(srcCol, matching, replace)
translate
Translates any character in srcCol that occurs in matching into the corresponding character in replace.
trim(col)
trim
Trim the spaces from both ends for the specified string column.
trunc(date, format)
trunc
Returns date truncated to the unit specified by the format.
udf([f, returnType])
udf
Creates a user defined function (UDF).
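For example, a hedged sketch of creating and applying a Python UDF:
>>> from pyspark.sql import functions as F
>>> from pyspark.sql.types import IntegerType
>>> plus_one = F.udf(lambda x: x + 1, IntegerType())
>>> spark.range(3).select(plus_one("id").alias("id_plus_one")).show()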
unbase64(col)
unbase64
Decodes a BASE64 encoded string column and returns it as a binary column.
unhex(col)
unhex
Inverse of hex.
unix_timestamp([timestamp, format])
unix_timestamp
Converts a time string with the given pattern (‘yyyy-MM-dd HH:mm:ss’ by default) to a Unix timestamp in seconds, using the default timezone and the default locale; returns null on failure.
upper(col)
upper
Converts a string expression to upper case.
var_pop(col)
var_pop
Aggregate function: returns the population variance of the values in a group.
var_samp(col)
var_samp
Aggregate function: returns the unbiased sample variance of the values in a group.
variance(col)
variance
Aggregate function: alias for var_samp
weekofyear(col)
weekofyear
Extract the week number of a given date as integer.
when(condition, value)
when
window(timeColumn, windowDuration[, …])
window
Bucketize rows into one or more time windows given a timestamp specifying column.
xxhash64(*cols)
xxhash64
Calculates the hash code of given columns using the 64-bit variant of the xxHash algorithm, and returns the result as a long column.
year(col)
year
Extract the year of a given date as integer.
years(col)
years
Partition transform function: A transform for timestamps and dates to partition data into years.
zip_with(left, right, f)
zip_with
Merge two given arrays, element-wise, into a single array using a function.
from_avro(data, jsonFormatSchema[, options])
from_avro
Converts a binary column of Avro format into its corresponding catalyst value.
to_avro(data[, jsonFormatSchema])
to_avro
Converts a column into binary of avro format.
Window.currentRow
Window.orderBy(*cols)
Window.orderBy
Creates a WindowSpec with the ordering defined.
Window.partitionBy(*cols)
Window.partitionBy
Creates a WindowSpec with the partitioning defined.
Window.rangeBetween(start, end)
Window.rangeBetween
Creates a WindowSpec with the frame boundaries defined, from start (inclusive) to end (inclusive).
Window.rowsBetween(start, end)
Window.rowsBetween
Creates a WindowSpec with the frame boundaries defined, from start (inclusive) to end (inclusive).
Window.unboundedFollowing
Window.unboundedPreceding
WindowSpec.orderBy(*cols)
WindowSpec.orderBy
Defines the ordering columns in a WindowSpec.
WindowSpec.partitionBy(*cols)
WindowSpec.partitionBy
Defines the partitioning columns in a WindowSpec.
WindowSpec.rangeBetween(start, end)
WindowSpec.rangeBetween
Defines the frame boundaries, from start (inclusive) to end (inclusive).
WindowSpec.rowsBetween(start, end)
WindowSpec.rowsBetween
Defines the frame boundaries, from start (inclusive) to end (inclusive).
GroupedData.agg(*exprs)
GroupedData.agg
Compute aggregates and returns the result as a DataFrame.
GroupedData.apply(udf)
GroupedData.apply
It is an alias of pyspark.sql.GroupedData.applyInPandas(); however, it takes a pyspark.sql.functions.pandas_udf() whereas pyspark.sql.GroupedData.applyInPandas() takes a Python native function.
GroupedData.applyInPandas(func, schema)
GroupedData.applyInPandas
Maps each group of the current DataFrame using a pandas udf and returns the result as a DataFrame.
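A small sketch, following the usual pattern (the id and v columns are illustrative):
>>> df = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0)], ["id", "v"])
>>> def subtract_mean(pdf):
...     return pdf.assign(v=pdf.v - pdf.v.mean())
...
>>> df.groupBy("id").applyInPandas(subtract_mean, schema="id long, v double").show()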
GroupedData.avg(*cols)
GroupedData.avg
Computes average values for each numeric column for each group.
GroupedData.cogroup(other)
GroupedData.cogroup
Cogroups this group with another group so that we can run cogrouped operations.
GroupedData.count()
GroupedData.count
Counts the number of records for each group.
GroupedData.max(*cols)
GroupedData.max
Computes the max value for each numeric column for each group.
GroupedData.mean(*cols)
GroupedData.mean
Computes average values for each numeric column for each group (an alias of avg).
GroupedData.min(*cols)
GroupedData.min
Computes the min value for each numeric column for each group.
GroupedData.pivot(pivot_col[, values])
GroupedData.pivot
Pivots a column of the current DataFrame and performs the specified aggregation.
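For instance, pivoting quarters into columns (the year, quarter, and sales columns are illustrative):
>>> df = spark.createDataFrame(
...     [(2023, "Q1", 100), (2023, "Q2", 150), (2024, "Q1", 120)],
...     ["year", "quarter", "sales"])
>>> df.groupBy("year").pivot("quarter").sum("sales").show()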
GroupedData.sum(*cols)
GroupedData.sum
Computes the sum for each numeric column for each group.
PandasCogroupedOps.applyInPandas(func, schema)
PandasCogroupedOps.applyInPandas
Applies a function to each cogroup using pandas and returns the result as a DataFrame.
Catalog.cacheTable(tableName)
Catalog.cacheTable
Caches the specified table in-memory.
Catalog.clearCache()
Catalog.clearCache
Removes all cached tables from the in-memory cache.
Catalog.createExternalTable(tableName[, …])
Catalog.createExternalTable
Creates a table based on the dataset in a data source.
Catalog.createTable(tableName[, path, …])
Catalog.createTable
Creates a table based on the dataset in a data source.
Catalog.currentDatabase()
Catalog.currentDatabase
Returns the current default database in this session.
Catalog.dropGlobalTempView(viewName)
Catalog.dropGlobalTempView
Drops the global temporary view with the given view name in the catalog.
Catalog.dropTempView(viewName)
Catalog.dropTempView
Drops the local temporary view with the given view name in the catalog.
Catalog.isCached(tableName)
Catalog.isCached
Returns true if the table is currently cached in-memory.
Catalog.listColumns(tableName[, dbName])
Catalog.listColumns
Returns a list of columns for the given table/view in the specified database.
Catalog.listDatabases()
Catalog.listDatabases
Returns a list of databases available across all sessions.
Catalog.listFunctions([dbName])
Catalog.listFunctions
Returns a list of functions registered in the specified database.
Catalog.listTables([dbName])
Catalog.listTables
Returns a list of tables/views in the specified database.
Catalog.recoverPartitions(tableName)
Catalog.recoverPartitions
Recovers all the partitions of the given table and updates the catalog.
Catalog.refreshByPath(path)
Catalog.refreshByPath
Invalidates and refreshes all the cached data (and the associated metadata) for any DataFrame that contains the given data source path.
Catalog.refreshTable(tableName)
Catalog.refreshTable
Invalidates and refreshes all the cached data and metadata of the given table.
Catalog.registerFunction(name, f[, returnType])
Catalog.registerFunction
An alias for spark.udf.register().
Catalog.setCurrentDatabase(dbName)
Catalog.setCurrentDatabase
Sets the current default database in this session.
Catalog.uncacheTable(tableName)
Catalog.uncacheTable
Removes the specified table from the in-memory cache.