Buckets the output by the given columns. If specified, the output is laid out on the file system similar to Hive's bucketing scheme.
This is applicable for all file-based data sources (e.g. Parquet, JSON) starting with Spark 2.1.0.
2.0
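A minimal sketch of bucketed output, assuming a local SparkSession, a temporary warehouse directory, and hypothetical sample data and table name (none of these come from the original text). It writes through saveAsTable because, in the Spark 2.x releases this sketch was written against, path-based save() rejects bucketed output.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Assumption: a throwaway local session with its own warehouse directory.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("bucketBy-sketch")
  .config("spark.sql.warehouse.dir", Files.createTempDirectory("warehouse").toString)
  .getOrCreate()
import spark.implicits._

// Hypothetical sample data, used only for illustration.
val events = Seq((1, "click"), (2, "view"), (3, "click"), (4, "view"))
  .toDF("user_id", "action")

events.write
  .mode("overwrite")
  .bucketBy(4, "user_id") // hash rows into 4 buckets by user_id
  .sortBy("user_id")      // sort rows inside each bucket
  .saveAsTable("events_bucketed")

val rowCount = spark.table("events_bucketed").count()
```

The bucket count of 4 is arbitrary here; in practice it is usually chosen from the data volume and the join or aggregation keys.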
Saves the content of the DataFrame in CSV format at the specified path. This is equivalent to:
format("csv").save(path)
You can set the following CSV-specific option(s) for writing CSV files:
sep (default ,): sets the single character used as a separator for each field and value.
quote (default "): sets the single character used for escaping quoted values where the separator can be part of the value.
escape (default \): sets the single character used for escaping quotes inside an already quoted value.
escapeQuotes (default true): a flag indicating whether values containing quotes should always be enclosed in quotes. The default is to escape all values containing a quote character.
quoteAll (default false): a flag indicating whether all values should always be enclosed in quotes. The default is to only escape values containing a quote character.
header (default false): writes the names of columns as the first line.
nullValue (default empty string): sets the string representation of a null value.
compression (default null): compression codec to use when saving to file. This can be one of the known case-insensitive shortened names (none, bzip2, gzip, lz4, snappy and deflate).
dateFormat (default yyyy-MM-dd): sets the string that indicates a date format. Custom date formats follow the formats at java.text.SimpleDateFormat. This applies to date type.
timestampFormat (default yyyy-MM-dd'T'HH:mm:ss.SSSZZ): sets the string that indicates a timestamp format. Custom date formats follow the formats at java.text.SimpleDateFormat. This applies to timestamp type.
2.0.0
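The options above can be combined on a single writer. The following is a sketch only, assuming a local SparkSession, a temporary output directory, and hypothetical sample data; it writes a header row, uses ";" as the separator, and writes nulls as NA.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("csv-sketch").getOrCreate()
import spark.implicits._

// Assumption: a fresh temporary location; the "people" subdirectory does not exist yet.
val out = Files.createTempDirectory("csv-sketch").resolve("people").toString

// Hypothetical sample data; the second row has a null age.
val people = Seq(("Ann", Some(31)), ("Bob", None)).toDF("name", "age")

people.coalesce(1).write // one output file keeps the example easy to inspect
  .option("header", "true")
  .option("sep", ";")
  .option("nullValue", "NA")
  .csv(out)

// Read the files back with matching header and separator options.
val readBack = spark.read
  .option("header", "true")
  .option("sep", ";")
  .csv(out)
```

Reading with the same header and sep settings is what lets the column names survive the round trip; without a schema, the reader returns every column as a string.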
Specifies the underlying output data source. Built-in options include "parquet", "json", etc.
1.4.0
Inserts the content of the DataFrame to the specified table. It requires that the schema of the DataFrame is the same as the schema of the table.
1.4.0
Unlike saveAsTable, insertInto ignores the column names and just uses position-based resolution. For example:
scala> Seq((1, 2)).toDF("i", "j").write.mode("overwrite").saveAsTable("t1")
scala> Seq((3, 4)).toDF("j", "i").write.insertInto("t1")
scala> Seq((5, 6)).toDF("a", "b").write.insertInto("t1")
scala> sql("select * from t1").show
+---+---+
|  i|  j|
+---+---+
|  5|  6|
|  3|  4|
|  1|  2|
+---+---+
Because it inserts data to an existing table, format or options will be ignored.
Saves the content of the DataFrame to an external database table via JDBC. If the table already exists in the external database, the behavior of this function depends on the save mode, specified by the mode function (the default is to throw an exception).
Don't create too many partitions in parallel on a large cluster; otherwise Spark might crash your external database systems.
You can set the following JDBC-specific option(s) for writing via JDBC:
truncate (default false): use TRUNCATE TABLE instead of DROP TABLE.
In case of failures, users should turn off the truncate option to use DROP TABLE again. Also, due to the different behavior of TRUNCATE TABLE among DBMSs, it's not always safe to use this. MySQLDialect, DB2Dialect, MsSqlServerDialect, DerbyDialect, and OracleDialect support this, while PostgresDialect and the default JdbcDialect don't. For an unknown or unsupported JdbcDialect, the user option truncate is ignored.
url: JDBC database url of the form jdbc:subprotocol:subname
table: Name of the table in the external database.
connectionProperties: JDBC database connection arguments, a list of arbitrary string tag/value pairs. Normally at least a "user" and "password" property should be included. "batchsize" can be used to control the number of rows per insert. "isolationLevel" can be one of "NONE", "READ_COMMITTED", "READ_UNCOMMITTED", "REPEATABLE_READ", or "SERIALIZABLE", corresponding to standard transaction isolation levels defined by JDBC's Connection object, with default of "READ_UNCOMMITTED".
1.4.0
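A hedged sketch of how the connection arguments and the truncate option might be wired together. The helper names jdbcProperties and overwriteViaJdbc are hypothetical, the batch size and isolation level are arbitrary illustrative values, and the URL, table, and credentials are supplied by the caller; running the write itself requires a reachable database and its JDBC driver on the classpath.

```scala
import java.util.Properties
import org.apache.spark.sql.{DataFrame, SaveMode}

// Hypothetical helper: builds the connection arguments described above.
def jdbcProperties(user: String, password: String): Properties = {
  val props = new Properties()
  props.setProperty("user", user)
  props.setProperty("password", password)
  props.setProperty("batchsize", "1000")                // rows per insert (illustrative value)
  props.setProperty("isolationLevel", "READ_COMMITTED") // one of the levels listed above
  props
}

// Hypothetical helper: overwrites an existing table, asking Spark to use
// TRUNCATE TABLE instead of DROP TABLE where the dialect supports it.
def overwriteViaJdbc(df: DataFrame, url: String, table: String, props: Properties): Unit =
  df.write
    .mode(SaveMode.Overwrite)
    .option("truncate", "true")
    .jdbc(url, table, props)
```

Keeping truncate tied to Overwrite mode reflects the text above: the option only matters when an existing table would otherwise be dropped and recreated.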
Saves the content of the DataFrame in JSON format (JSON Lines text format or newline-delimited JSON) at the specified path. This is equivalent to:
format("json").save(path)
You can set the following JSON-specific option(s) for writing JSON files:
compression (default null): compression codec to use when saving to file. This can be one of the known case-insensitive shortened names (none, bzip2, gzip, lz4, snappy and deflate).
dateFormat (default yyyy-MM-dd): sets the string that indicates a date format. Custom date formats follow the formats at java.text.SimpleDateFormat. This applies to date type.
timestampFormat (default yyyy-MM-dd'T'HH:mm:ss.SSSZZ): sets the string that indicates a timestamp format. Custom date formats follow the formats at java.text.SimpleDateFormat. This applies to timestamp type.
1.4.0
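As an illustration of the compression option, the sketch below writes gzip-compressed JSON Lines and reads them back. The local SparkSession, temporary directory, and sample rows are assumptions made for the example.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("json-sketch").getOrCreate()
import spark.implicits._

// Assumption: a fresh temporary location for the output directory.
val out = Files.createTempDirectory("json-sketch").resolve("rows").toString

// Hypothetical sample data.
val rows = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "label")

// Each row becomes one JSON object per line; part files are gzip-compressed.
rows.write.option("compression", "gzip").json(out)

// The reader decompresses the part files transparently based on their extension.
val readBack = spark.read.json(out)
```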
Specifies the behavior when data or table already exists. Options include:
overwrite: overwrite the existing data.
append: append the data.
ignore: ignore the operation (i.e. no-op).
error: default option, throw an exception at runtime.
1.4.0
Specifies the behavior when data or table already exists. Options include:
SaveMode.Overwrite: overwrite the existing data.
SaveMode.Append: append the data.
SaveMode.Ignore: ignore the operation (i.e. no-op).
SaveMode.ErrorIfExists: default option, throw an exception at runtime.
1.4.0
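The four save modes can be observed by writing the same small DataFrame to one path several times. This is a sketch under stated assumptions: a local SparkSession, a temporary path, Parquet as an arbitrary file format, and hypothetical sample data.

```scala
import java.nio.file.Files
import scala.util.Try
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().master("local[*]").appName("mode-sketch").getOrCreate()
import spark.implicits._

// Assumption: the "numbers" subdirectory does not exist before the first write.
val out = Files.createTempDirectory("mode-sketch").resolve("numbers").toString
val df = Seq(1, 2, 3).toDF("n")

df.write.mode(SaveMode.ErrorIfExists).parquet(out) // succeeds: nothing at the path yet
df.write.mode(SaveMode.Append).parquet(out)        // adds three more rows
df.write.mode(SaveMode.Ignore).parquet(out)        // no-op: data already exists
val afterIgnore = spark.read.parquet(out).count()

df.write.mode("overwrite").parquet(out)            // string form of SaveMode.Overwrite
val afterOverwrite = spark.read.parquet(out).count()

// With the default mode, writing to an existing path fails at runtime.
val defaultModeFailed = Try(df.write.mode("error").parquet(out)).isFailure
```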
Adds an output option for the underlying data source.
2.0.0
Adds an output option for the underlying data source.
2.0.0
Adds an output option for the underlying data source.
2.0.0
Adds an output option for the underlying data source.
1.4.0
Adds output options for the underlying data source.
1.4.0
(Scala-specific) Adds output options for the underlying data source.
1.4.0
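A short sketch of how format, option, options, and save might be chained, assuming a local SparkSession, a temporary output path, and hypothetical sample data. It uses the Boolean overload of option and the Scala Map overload of options; the CSV option names are the ones listed for CSV output.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("options-sketch").getOrCreate()
import spark.implicits._

// Assumption: a fresh temporary location for the output directory.
val out = Files.createTempDirectory("options-sketch").resolve("labels").toString

// Hypothetical sample data.
val df = Seq((1, "a"), (2, "b")).toDF("id", "label")

df.coalesce(1).write
  .format("csv")              // choose the data source by name
  .option("header", true)     // Boolean overload of option
  .options(Map("sep" -> "|")) // Scala-specific Map overload of options
  .mode("overwrite")
  .save(out)

// Read the raw lines back to inspect what was written.
val writtenLines = spark.read.textFile(out).collect().toSeq
```

Options are passed to the data source as strings, so option("header", true) and option("header", "true") are interchangeable here.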
Saves the content of the DataFrame in ORC format at the specified path. This is equivalent to:
format("orc").save(path)
You can set the following ORC-specific option(s) for writing ORC files:
compression (default snappy): compression codec to use when saving to file. This can be one of the known case-insensitive shortened names (none, snappy, zlib, and lzo). This will override orc.compress.
1.5.0
Currently, this method can only be used after enabling Hive support.
Saves the content of the DataFrame in Parquet format at the specified path. This is equivalent to:
format("parquet").save(path)
You can set the following Parquet-specific option(s) for writing Parquet files:
compression (default is the value specified in spark.sql.parquet.compression.codec): compression codec to use when saving to file. This can be one of the known case-insensitive shortened names (none, snappy, gzip, and lzo). This will override spark.sql.parquet.compression.codec.
1.4.0
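A minimal sketch of a Parquet round trip with a per-write compression codec, assuming a local SparkSession, a temporary path, and hypothetical sample data. Unlike the text-based formats, Parquet stores the column types, which the example checks after reading back.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("parquet-sketch").getOrCreate()
import spark.implicits._

// Assumption: a fresh temporary location for the output directory.
val out = Files.createTempDirectory("parquet-sketch").resolve("items").toString

// Hypothetical sample data.
val items = Seq((1, "a", 0.5), (2, "b", 1.5)).toDF("id", "label", "score")

// The option overrides spark.sql.parquet.compression.codec for this write only.
items.write.option("compression", "gzip").parquet(out)

val readBack = spark.read.parquet(out)

// Compare names and types only; nullability may differ after the round trip.
val writtenTypes = items.schema.map(f => (f.name, f.dataType))
val readTypes = readBack.schema.map(f => (f.name, f.dataType))
```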
Partitions the output by the given columns on the file system. If specified, the output is laid out on the file system similar to Hive's partitioning scheme. As an example, when we partition a dataset by year and then month, the directory layout would look like:
year=2016/month=01/
year=2016/month=02/
Partitioning is one of the most widely used techniques to optimize physical data layout. It provides a coarse-grained index for skipping unnecessary data reads when queries have predicates on the partitioned columns. In order for partitioning to work well, the number of distinct values in each column should typically be less than tens of thousands.
This is applicable for all file-based data sources (e.g. Parquet, JSON) starting with Spark 2.1.0.
1.4.0
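The partitioned layout can be inspected directly on the local file system. The following sketch assumes a local SparkSession, a temporary output path, Parquet as an arbitrary format, and hypothetical sales data partitioned by year and then month.

```scala
import java.io.File
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("partitionBy-sketch").getOrCreate()
import spark.implicits._

// Assumption: a fresh temporary location for the output directory.
val out = Files.createTempDirectory("partitionBy-sketch").resolve("sales").toString

// Hypothetical sample data.
val sales = Seq((2016, 1, 10.0), (2016, 2, 12.5), (2017, 1, 9.0))
  .toDF("year", "month", "amount")

// One directory level per partition column, in the order given.
sales.write.partitionBy("year", "month").parquet(out)

// Top-level directories are named column=value.
val yearDirs = new File(out).listFiles().filter(_.isDirectory).map(_.getName).sorted.toSeq
val monthDirs2016 = new File(out, "year=2016").listFiles()
  .filter(_.isDirectory).map(_.getName).sorted.toSeq

// A predicate on a partition column only needs the matching directories.
val rows2016 = spark.read.parquet(out).where($"year" === 2016).count()
```

Because month is an integer column here, its directories are named month=1 and month=2 rather than zero-padded; a string column holding "01" would produce month=01.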
Saves the content of the DataFrame as the specified table.
1.4.0
Saves the content of the DataFrame at the specified path.
1.4.0
Saves the content of the DataFrame as the specified table.
If the table already exists, the behavior of this function depends on the save mode, specified by the mode function (the default is to throw an exception). When mode is Overwrite, the schema of the DataFrame does not need to be the same as that of the existing table.
When mode is Append, if there is an existing table, we will use the format and options of the existing table. The column order in the schema of the DataFrame doesn't need to be the same as that of the existing table. Unlike insertInto, saveAsTable will use the column names to find the correct column positions. For example:
scala> Seq((1, 2)).toDF("i", "j").write.mode("overwrite").saveAsTable("t1")
scala> Seq((3, 4)).toDF("j", "i").write.mode("append").saveAsTable("t1")
scala> sql("select * from t1").show
+---+---+
|  i|  j|
+---+---+
|  1|  2|
|  4|  3|
+---+---+
When the DataFrame is created from a non-partitioned HadoopFsRelation with a single input path, and the data source provider can be mapped to an existing Hive built-in SerDe (i.e. ORC and Parquet), the table is persisted in a Hive-compatible format, which means other systems like Hive will be able to read this table. Otherwise, the table is persisted in a Spark SQL specific format.
1.4.0
Sorts the output in each bucket by the given columns.
This is applicable for all file-based data sources (e.g. Parquet, JSON) starting with Spark 2.1.0.
2.0
Saves the content of the DataFrame in a text file at the specified path. The DataFrame must have only one column that is of string type. Each row becomes a new line in the output file. For example:
// Scala:
df.write.text("/path/to/output")
// Java:
df.write().text("/path/to/output")
You can set the following option(s) for writing text files:
compression (default null): compression codec to use when saving to file. This can be one of the known case-insensitive shortened names (none, bzip2, gzip, lz4, snappy and deflate).
1.6.0
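Since text output accepts only a single string column, a DataFrame with several columns has to be collapsed first. The sketch below shows one possible way to do that with concat_ws; the local SparkSession, temporary path, sample data, and the tab separator are assumptions made for the example.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.concat_ws

val spark = SparkSession.builder().master("local[*]").appName("text-sketch").getOrCreate()
import spark.implicits._

// Assumption: a fresh temporary location for the output directory.
val out = Files.createTempDirectory("text-sketch").resolve("lines").toString

// Hypothetical sample data with two columns.
val df = Seq((1, "a"), (2, "b")).toDF("id", "label")

// Collapse both columns into one tab-separated string column.
val asText = df.select(concat_ws("\t", $"id".cast("string"), $"label").as("value"))

asText.coalesce(1).write.text(out)

// Each row became one line in the output.
val linesBack = spark.read.textFile(out).collect().sorted.toSeq
```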
Interface used to write a Dataset to external storage systems (e.g. file systems, key-value stores, etc). Use Dataset.write to access this.
1.4.0