[SPARK-8462] [DOCS] Documentation fixes for Spark SQL
This fixes various minor documentation issues on the Spark SQL page
Author: Lars Francke <lars.francke@gmail.com>
Closes #6890 from lfrancke/SPARK-8462 and squashes the following commits:
dd7e302 [Lars Francke] Merge branch 'master' into SPARK-8462
34eff2c [Lars Francke] Minor documentation fixes
(cherry picked from commit 4ce3bab89f)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
parent 152f4465d3
commit bd9bbd6119
@@ -819,8 +819,8 @@ saveDF(select(df, "name", "age"), "namesAndAges.parquet")
 You can also manually specify the data source that will be used along with any extra options
 that you would like to pass to the data source. Data sources are specified by their fully qualified
-name (i.e., `org.apache.spark.sql.parquet`), but for built-in sources you can also use the shorted
-name (`json`, `parquet`, `jdbc`). DataFrames of any type can be converted into other types
+name (i.e., `org.apache.spark.sql.parquet`), but for built-in sources you can also use their short
+names (`json`, `parquet`, `jdbc`). DataFrames of any type can be converted into other types
 using this syntax.
 
 <div class="codetabs">
@@ -828,7 +828,7 @@ using this syntax.
 
 {% highlight scala %}
 val df = sqlContext.read.format("json").load("examples/src/main/resources/people.json")
-df.select("name", "age").write.format("json").save("namesAndAges.parquet")
+df.select("name", "age").write.format("json").save("namesAndAges.json")
 {% endhighlight %}
 
 </div>
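The snippet in this hunk only specifies the source format; for the "extra options" the prose above mentions, here is a minimal sketch of how they are passed (illustrative only, not part of this commit; the JDBC URL and table name are placeholders):

{% highlight scala %}
// Sketch: extra options are supplied with option()/options() before load().
// The connection URL and table name below are placeholders, not real endpoints.
val jdbcDF = sqlContext.read
  .format("jdbc")
  .option("url", "jdbc:postgresql:dbserver")
  .option("dbtable", "schema.tablename")
  .load()
{% endhighlight %}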
@@ -975,7 +975,7 @@ schemaPeople.write().parquet("people.parquet");
 // The result of loading a parquet file is also a DataFrame.
 DataFrame parquetFile = sqlContext.read().parquet("people.parquet");
 
-//Parquet files can also be registered as tables and then used in SQL statements.
+// Parquet files can also be registered as tables and then used in SQL statements.
 parquetFile.registerTempTable("parquetFile");
 DataFrame teenagers = sqlContext.sql("SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19");
 List<String> teenagerNames = teenagers.javaRDD().map(new Function<Row, String>() {
@@ -1059,7 +1059,7 @@ SELECT * FROM parquetTable
 Table partitioning is a common optimization approach used in systems like Hive. In a partitioned
 table, data are usually stored in different directories, with partitioning column values encoded in
 the path of each partition directory. The Parquet data source is now able to discover and infer
-partitioning information automatically. For exmaple, we can store all our previously used
+partitioning information automatically. For example, we can store all our previously used
 population data into a partitioned table using the following directory structure, with two extra
 columns, `gender` and `country` as partitioning columns:
 
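The directory structure the paragraph refers to lies outside this hunk; as a rough sketch of what partition discovery expects (illustrative only, `people` is a hypothetical DataFrame and the paths are placeholders):

{% highlight scala %}
// Writing into directories named column=value produces a layout such as
//   path/to/table/gender=male/country=US/...
//   path/to/table/gender=female/country=CN/...
people.write.parquet("path/to/table/gender=male/country=US")
people.write.parquet("path/to/table/gender=female/country=CN")

// Reading the root directory lets the Parquet source discover the partitions,
// so `gender` and `country` appear as columns of the resulting DataFrame.
val partitioned = sqlContext.read.parquet("path/to/table")
{% endhighlight %}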
@@ -1121,12 +1121,12 @@ source is now able to automatically detect this case and merge schemas of all th
 import sqlContext.implicits._
 
 // Create a simple DataFrame, stored into a partition directory
-val df1 = sparkContext.makeRDD(1 to 5).map(i => (i, i * 2)).toDF("single", "double")
+val df1 = sc.makeRDD(1 to 5).map(i => (i, i * 2)).toDF("single", "double")
 df1.write.parquet("data/test_table/key=1")
 
 // Create another DataFrame in a new partition directory,
 // adding a new column and dropping an existing column
-val df2 = sparkContext.makeRDD(6 to 10).map(i => (i, i * 3)).toDF("single", "triple")
+val df2 = sc.makeRDD(6 to 10).map(i => (i, i * 3)).toDF("single", "triple")
 df2.write.parquet("data/test_table/key=2")
 
 // Read the partitioned table
@@ -1134,7 +1134,7 @@ val df3 = sqlContext.read.parquet("data/test_table")
 df3.printSchema()
 
 // The final schema consists of all 3 columns in the Parquet files together
-// with the partiioning column appeared in the partition directory paths.
+// with the partitioning column appeared in the partition directory paths.
 // root
 // |-- single: int (nullable = true)
 // |-- double: int (nullable = true)
@@ -1165,7 +1165,7 @@ df3 = sqlContext.load("data/test_table", "parquet")
 df3.printSchema()
 
 # The final schema consists of all 3 columns in the Parquet files together
-# with the partiioning column appeared in the partition directory paths.
+# with the partitioning column appeared in the partition directory paths.
 # root
 # |-- single: int (nullable = true)
 # |-- double: int (nullable = true)
@@ -1192,7 +1192,7 @@ df3 <- loadDF(sqlContext, "data/test_table", "parquet")
 printSchema(df3)
 
 # The final schema consists of all 3 columns in the Parquet files together
-# with the partiioning column appeared in the partition directory paths.
+# with the partitioning column appeared in the partition directory paths.
 # root
 # |-- single: int (nullable = true)
 # |-- double: int (nullable = true)
@@ -1249,7 +1249,7 @@ Configuration of Parquet can be done using the `setConf` method on `SQLContext`
   <td>false</td>
   <td>
     Turn on Parquet filter pushdown optimization. This feature is turned off by default because of a known
-    bug in Paruet 1.6.0rc3 (<a href="https://issues.apache.org/jira/browse/PARQUET-136">PARQUET-136</a>).
+    bug in Parquet 1.6.0rc3 (<a href="https://issues.apache.org/jira/browse/PARQUET-136">PARQUET-136</a>).
     However, if your table doesn't contain any nullable string or binary columns, it's still safe to turn
     this feature on.
   </td>
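As the surrounding section notes, these properties are set through the `setConf` method on `SQLContext` (or with `SET key=value` in SQL). A minimal sketch, assuming the row edited above is the `spark.sql.parquet.filterPushdown` property:

{% highlight scala %}
// Sketch only: opt in to Parquet filter pushdown despite the default of false.
// Per the description above, this is safe when the table has no nullable
// string or binary columns (see PARQUET-136).
sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")
{% endhighlight %}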
@@ -1398,7 +1398,7 @@ sqlContext <- sparkRSQL.init(sc)
 # The path can be either a single text file or a directory storing text files.
 path <- "examples/src/main/resources/people.json"
 # Create a DataFrame from the file(s) pointed to by path
-people <- jsonFile(sqlContex,t path)
+people <- jsonFile(sqlContext, path)
 
 # The inferred schema can be visualized using the printSchema() method.
 printSchema(people)
@@ -1470,7 +1470,7 @@ sqlContext.sql("FROM src SELECT key, value").collect().foreach(println)
 
 When working with Hive one must construct a `HiveContext`, which inherits from `SQLContext`, and
 adds support for finding tables in the MetaStore and writing queries using HiveQL. In addition to
-the `sql` method a `HiveContext` also provides an `hql` methods, which allows queries to be
+the `sql` method a `HiveContext` also provides an `hql` method, which allows queries to be
 expressed in HiveQL.
 
 {% highlight java %}
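For comparison with the Java block this hunk touches, a minimal Scala sketch of the construction described above (assumes a Hive-enabled Spark build and reuses the `src` table example from the surrounding section):

{% highlight scala %}
import org.apache.spark.sql.hive.HiveContext

// sc is an existing SparkContext; HiveContext adds MetaStore lookup and HiveQL support.
val hiveContext = new HiveContext(sc)

hiveContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
hiveContext.sql("FROM src SELECT key, value").collect().foreach(println)
{% endhighlight %}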
@@ -2766,7 +2766,7 @@ from pyspark.sql.types import *
 </tr>
 <tr>
   <td> <b>MapType</b> </td>
-  <td> enviroment </td>
+  <td> environment </td>
   <td>
   list(type="map", keyType=<i>keyType</i>, valueType=<i>valueType</i>, valueContainsNull=[<i>valueContainsNull</i>])<br />
   <b>Note:</b> The default value of <i>valueContainsNull</i> is <i>True</i>.
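For readers cross-referencing the Scala API, a small sketch of the same default called out in this row, namely that `valueContainsNull` is true when omitted (illustrative only):

{% highlight scala %}
import org.apache.spark.sql.types.{IntegerType, MapType, StringType}

// Omitting the third argument leaves valueContainsNull at its default of true.
val withDefault = MapType(StringType, IntegerType)
val explicit    = MapType(StringType, IntegerType, valueContainsNull = false)
{% endhighlight %}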