181 commits

Author SHA1 Message Date
gatorsmile c0eda7e87f [SPARK-19396][DOC] JDBC Options are Case In-sensitive
### What changes were proposed in this pull request?
JDBC options are case-insensitive now that PR https://github.com/apache/spark/pull/15884 has been merged into Spark 2.1.
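
To illustrate what case-insensitive option handling means, here is a minimal Python sketch. The `CaseInsensitiveOptions` class is hypothetical and is not Spark's implementation; it only models looking up an option regardless of how its key is capitalized.

```python
class CaseInsensitiveOptions:
    """Hypothetical model of an options map whose keys ignore case."""

    def __init__(self, options):
        # Normalize every key to lower case once, at construction time.
        self._options = {key.lower(): value for key, value in options.items()}

    def get(self, key, default=None):
        # Normalize the lookup key the same way before reading.
        return self._options.get(key.lower(), default)


opts = CaseInsensitiveOptions({"dbTable": "result", "FetchSize": "100"})
print(opts.get("dbtable"))    # same entry as "dbTable"
print(opts.get("fetchsize"))  # same entry as "FetchSize"
```

Normalizing keys on both write and read keeps the lookup a plain dictionary access.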

### How was this patch tested?
N/A

Author: gatorsmile <gatorsmile@gmail.com>

Closes #16734 from gatorsmile/fixDocCaseInsensitive.
2017-01-30 14:05:53 -08:00
aokolnychyi 3fdce81434 [SPARK-16046][DOCS] Aggregations in the Spark SQL programming guide
## What changes were proposed in this pull request?

- A separate subsection for Aggregations under “Getting Started” in the Spark SQL programming guide. It mentions which aggregate functions are predefined and how users can create their own.
- Examples of using the `UserDefinedAggregateFunction` abstract class for untyped aggregations in Java and Scala.
- Examples of using the `Aggregator` abstract class for type-safe aggregations in Java and Scala.
- Python is not covered.
- The PR might not resolve the ticket since I do not know what exactly was planned by the author.

In total, there are four new standalone examples that can be executed via `spark-submit` or `run-example`. The updated Spark SQL programming guide references these examples and does not contain hard-coded snippets.

## How was this patch tested?

The patch was tested locally by building the docs. The examples were run as well.

![image](https://cloud.githubusercontent.com/assets/6235869/21292915/04d9d084-c515-11e6-811a-999d598dffba.png)

Author: aokolnychyi <okolnychyyanton@gmail.com>

Closes #16329 from aokolnychyi/SPARK-16046.
2017-01-24 22:13:17 -08:00
Dongjoon Hyun 923e594844 [SPARK-18941][SQL][DOC] Add a new behavior document on CREATE/DROP TABLE with LOCATION
## What changes were proposed in this pull request?

This PR adds a new behavior change description on `CREATE TABLE ... LOCATION` at `sql-programming-guide.md` clearly under `Upgrading From Spark SQL 1.6 to 2.0`. This change is introduced at Apache Spark 2.0.0 as [SPARK-15276](https://issues.apache.org/jira/browse/SPARK-15276).

## How was this patch tested?

```
SKIP_API=1 jekyll build
```

**Newly Added Description**
<img width="913" alt="new" src="https://cloud.githubusercontent.com/assets/9700541/21743606/7efe2b12-d4ba-11e6-8a0d-551222718ea2.png">

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #16400 from dongjoon-hyun/SPARK-18941.
2017-01-07 18:55:01 -08:00
Wenchen Fan cca945b6aa [SPARK-18885][SQL] unify CREATE TABLE syntax for data source and hive serde tables
## What changes were proposed in this pull request?

Today we have different syntaxes for creating data source tables and Hive serde tables. We should unify them to avoid confusing users and as a step toward making Hive a data source.

Please read https://issues.apache.org/jira/secure/attachment/12843835/CREATE-TABLE.pdf for details.

TODO(for follow-up PRs):
1. TBLPROPERTIES is not added to the new syntax, we should decide if we wanna add it later.
2. `SHOW CREATE TABLE` should be updated to use the new syntax.
3. we should decide if we wanna change the behavior of `SET LOCATION`.

## How was this patch tested?

new tests

Author: Wenchen Fan <wenchen@databricks.com>

Closes #16296 from cloud-fan/create-table.
2017-01-05 17:40:27 -08:00
Cheng Lian 871f6114ac [SPARK-19016][SQL][DOC] Document scalable partition handling
## What changes were proposed in this pull request?

This PR documents the scalable partition handling feature in the body of the programming guide.

Before this PR, we only mentioned it in the migration guide. It was not clear that, since 2.1, external data source tables require an extra `MSCK REPAIR TABLE` command to have per-partition information persisted.

## How was this patch tested?

N/A.

Author: Cheng Lian <lian@databricks.com>

Closes #16424 from liancheng/scalable-partition-handling-doc.
2016-12-30 14:46:30 -08:00
c-sahuja 01c7c6b884 Update Spark documentation to provide information on how to create External Table
## What changes were proposed in this pull request?
Although `saveAsTable` currently provides no API for saving a DataFrame as an external table, the same result can be achieved by setting an option on `DataFrameWriter` whose key is the string "path" and whose value is the string location of the external table. The option must be set before `saveAsTable` is called.

## How was this patch tested?
Documentation was reviewed for formatting and content after the push was performed on the branch.
![updated documentation](https://cloud.githubusercontent.com/assets/15376052/20953147/4cfcf308-bc57-11e6-807c-e21fb774a760.PNG)

Author: c-sahuja <sahuja@cloudera.com>

Closes #16185 from c-sahuja/createExternalTable.
2016-12-06 19:03:23 -08:00
Dongjoon Hyun 410b789866 [MINOR][DOC] Use SparkR TRUE value and add default values for StructField in SQL Guide.
## What changes were proposed in this pull request?

In `SQL Programming Guide`, this PR uses `TRUE` instead of `True` in SparkR and adds default values of `nullable` for `StructField` in Scala/Python/R (i.e., "Note: The default value of nullable is true."). In Java API, `nullable` is not optional.

**BEFORE**
* SPARK 2.1.0 RC1
http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc1-docs/sql-programming-guide.html#data-types

**AFTER**

* R
<img width="916" alt="screen shot 2016-12-04 at 11 58 19 pm" src="https://cloud.githubusercontent.com/assets/9700541/20877443/abba19a6-ba7d-11e6-8984-afbe00333fb0.png">

* Scala
<img width="914" alt="screen shot 2016-12-04 at 11 57 37 pm" src="https://cloud.githubusercontent.com/assets/9700541/20877433/99ce734a-ba7d-11e6-8bb5-e8619041b09b.png">

* Python
<img width="914" alt="screen shot 2016-12-04 at 11 58 04 pm" src="https://cloud.githubusercontent.com/assets/9700541/20877440/a5c89338-ba7d-11e6-8f92-6c0ae9388d7e.png">

## How was this patch tested?

Manual.

```
cd docs
SKIP_API=1 jekyll build
open _site/index.html
```

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #16141 from dongjoon-hyun/SPARK-SQL-GUIDE.
2016-12-05 10:36:13 -08:00
Eric Liang 489845f3a0 [SPARK-18145] Update documentation for hive partition management in 2.1
## What changes were proposed in this pull request?

This documents the partition handling changes for Spark 2.1 and how to migrate existing tables.

## How was this patch tested?

Built docs locally.

rxin

Author: Eric Liang <ekl@databricks.com>

Closes #16074 from ericl/spark-18145.
2016-11-29 20:06:39 -08:00
Weiqing Yang f4a98e421e [WIP][SQL][DOC] Fix incorrect code tag
## What changes were proposed in this pull request?
This PR fixes an incorrect `code` tag in `sql-programming-guide.md`.

## How was this patch tested?
Manually.

Author: Weiqing Yang <yangweiqing001@gmail.com>

Closes #15941 from weiqingy/fixtag.
2016-11-26 15:41:37 +00:00
Dongjoon Hyun fb07bbe575 [SPARK-18413][SQL][FOLLOW-UP] Use numPartitions instead of maxConnections
## What changes were proposed in this pull request?

This is a follow-up PR of #15868 that merges the `maxConnections` option into the `numPartitions` option.

## How was this patch tested?

Pass the existing tests.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #15966 from dongjoon-hyun/SPARK-18413-2.
2016-11-25 10:35:07 -08:00
Dongjoon Hyun 07beb5d21c [SPARK-18413][SQL] Add maxConnections JDBCOption
## What changes were proposed in this pull request?

This PR adds a new JDBCOption, `maxConnections`, which sets the maximum number of simultaneous JDBC connections allowed. The option applies only to writes, which are coalesced if needed. It defaults to the number of partitions of the RDD. Previously, SQL users could not control this, while Scala/Java/Python users could use the `coalesce` (or `repartition`) API.
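
The rule described above can be sketched in plain Python. The function name `effective_write_partitions` is hypothetical and this is not Spark's code; it only models "coalesce if needed, otherwise keep the RDD's partition count". Rejecting non-positive values is an assumption of the sketch.

```python
def effective_write_partitions(rdd_partitions, max_connections=None):
    """Model how many partitions (and thus JDBC connections) a write would use."""
    if max_connections is None:
        # Option not set: default to the number of partitions of the RDD.
        return rdd_partitions
    if max_connections < 1:
        # Assumption of this sketch: non-positive limits are rejected.
        raise ValueError("maxConnections must be a positive integer")
    # Coalesce only when the RDD has more partitions than allowed connections.
    return min(rdd_partitions, max_connections)


# With three partitions, limits of 1 to 4 connections give:
print([effective_write_partitions(3, m) for m in (1, 2, 3, 4)])  # [1, 2, 3, 3]
```

A limit larger than the partition count has no effect, because coalescing never increases the number of partitions.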

**Reported Scenario**

For the following case, the number of connections becomes 200 and the database cannot handle all of them.

```sql
CREATE OR REPLACE TEMPORARY VIEW resultview
USING org.apache.spark.sql.jdbc
OPTIONS (
  url "jdbc:oracle:thin:10.129.10.111:1521:BKDB",
  dbtable "result",
  user "HIVE",
  password "HIVE"
);
-- set spark.sql.shuffle.partitions=200
INSERT OVERWRITE TABLE resultview SELECT g, count(1) AS COUNT FROM tnet.DT_LIVE_INFO GROUP BY g
```

## How was this patch tested?

Manual. Do the following and check the Spark UI.

**Step 1 (MySQL)**
```
CREATE TABLE t1 (a INT);
CREATE TABLE data (a INT);
INSERT INTO data VALUES (1);
INSERT INTO data VALUES (2);
INSERT INTO data VALUES (3);
```

**Step 2 (Spark)**
```scala
SPARK_HOME=$PWD bin/spark-shell --driver-memory 4G --driver-class-path mysql-connector-java-5.1.40-bin.jar
scala> sql("SET spark.sql.shuffle.partitions=3")
scala> sql("CREATE OR REPLACE TEMPORARY VIEW data USING org.apache.spark.sql.jdbc OPTIONS (url 'jdbc:mysql://localhost:3306/t', dbtable 'data', user 'root', password '')")
scala> sql("CREATE OR REPLACE TEMPORARY VIEW t1 USING org.apache.spark.sql.jdbc OPTIONS (url 'jdbc:mysql://localhost:3306/t', dbtable 't1', user 'root', password '', maxConnections '1')")
scala> sql("INSERT OVERWRITE TABLE t1 SELECT a FROM data GROUP BY a")
scala> sql("CREATE OR REPLACE TEMPORARY VIEW t1 USING org.apache.spark.sql.jdbc OPTIONS (url 'jdbc:mysql://localhost:3306/t', dbtable 't1', user 'root', password '', maxConnections '2')")
scala> sql("INSERT OVERWRITE TABLE t1 SELECT a FROM data GROUP BY a")
scala> sql("CREATE OR REPLACE TEMPORARY VIEW t1 USING org.apache.spark.sql.jdbc OPTIONS (url 'jdbc:mysql://localhost:3306/t', dbtable 't1', user 'root', password '', maxConnections '3')")
scala> sql("INSERT OVERWRITE TABLE t1 SELECT a FROM data GROUP BY a")
scala> sql("CREATE OR REPLACE TEMPORARY VIEW t1 USING org.apache.spark.sql.jdbc OPTIONS (url 'jdbc:mysql://localhost:3306/t', dbtable 't1', user 'root', password '', maxConnections '4')")
scala> sql("INSERT OVERWRITE TABLE t1 SELECT a FROM data GROUP BY a")
```

![maxconnections](https://cloud.githubusercontent.com/assets/9700541/20287987/ed8409c2-aa84-11e6-8aab-ae28e63fe54d.png)

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #15868 from dongjoon-hyun/SPARK-18413.
2016-11-21 13:57:36 +00:00
Weiqing Yang 241e04bc03 [MINOR][DOC] Fix typos in the 'configuration', 'monitoring' and 'sql-programming-guide' documentation
## What changes were proposed in this pull request?

Fix typos in the 'configuration', 'monitoring' and 'sql-programming-guide' documentation.

## How was this patch tested?
Manually.

Author: Weiqing Yang <yangweiqing001@gmail.com>

Closes #15886 from weiqingy/fixTypo.
2016-11-16 10:34:56 +00:00
Felix Cheung 44c8bfda79 [SQL][DOC] updating doc for JSON source to link to jsonlines.org
## What changes were proposed in this pull request?

API and programming guide doc changes for Scala, Python and R.

## How was this patch tested?

manual test

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #15629 from felixcheung/jsondoc.
2016-10-26 23:06:11 -07:00
Sean Owen 4ecbe1b92f [SPARK-17810][SQL] Default spark.sql.warehouse.dir is relative to local FS but can resolve as HDFS path
## What changes were proposed in this pull request?

Always resolve spark.sql.warehouse.dir as a local path, relative to the working directory rather than the home directory.

## How was this patch tested?

Existing tests.

Author: Sean Owen <sowen@cloudera.com>

Closes #15382 from srowen/SPARK-17810.
2016-10-24 10:44:45 +01:00
Tommy YU f39852e598 [SPARK-18001][DOCUMENT] fix broke link to SparkDataFrame
## What changes were proposed in this pull request?

In http://spark.apache.org/docs/latest/sql-programming-guide.html, Section "Untyped Dataset Operations (aka DataFrame Operations)"

The link to the R DataFrame doesn't work and returns:
The requested URL /docs/latest/api/R/DataFrame.html was not found on this server.

The correct link is SparkDataFrame.html for Spark 2.0.

## How was this patch tested?

Manually checked.

Author: Tommy YU <tummyyu@163.com>

Closes #15543 from Wenpei/spark-18001.
2016-10-18 21:15:32 -07:00
Weiqing Yang 20dd11096c [MINOR][DOC] Add more built-in sources in sql-programming-guide.md
## What changes were proposed in this pull request?
Add more built-in sources in sql-programming-guide.md.

## How was this patch tested?
Manually.

Author: Weiqing Yang <yangweiqing001@gmail.com>

Closes #15522 from weiqingy/dsDoc.
2016-10-18 13:38:14 -07:00
Dhruve Ashar a0ebcb3a30 [DOC] Fix typo in sql hive doc
Change is too trivial to file a JIRA.

Author: Dhruve Ashar <dhruveashar@gmail.com>

Closes #15485 from dhruve/master.
2016-10-14 17:45:27 +01:00
hyukjinkwon 0c0ad436ad [SPARK-17719][SPARK-17776][SQL] Unify and tie up options in a single place in JDBC datasource package
## What changes were proposed in this pull request?

This PR proposes to fix arbitrary usages among `Map[String, String]`, `Properties` and `JDBCOptions` instances for options in `execution/jdbc` package and make the connection properties exclude Spark-only options.

This PR includes some changes as below:

- Unify `Map[String, String]`, `Properties` and `JDBCOptions` in `execution/jdbc` package to `JDBCOptions`.

- Move `batchsize`, `fetchsize`, `driver` and `isolationlevel` options into `JDBCOptions` instance.

- Document `batchSize` and `isolationlevel`, marking which options are read-only and which are write-only. This also includes minor typo fixes and more detailed explanations for some statements, such as url.

- Throw exceptions fast by checking arguments first rather than in execution time (e.g. for `fetchsize`).

- Exclude Spark-only options in connection properties.

## How was this patch tested?

Existing tests should cover this.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #15292 from HyukjinKwon/SPARK-17719.
2016-10-10 22:22:41 -07:00
Wenchen Fan 23ddff4b2b [SPARK-17338][SQL] add global temp view
## What changes were proposed in this pull request?

A global temporary view is a cross-session temporary view, which means it is shared among all sessions. Its lifetime is the lifetime of the Spark application, i.e. it is automatically dropped when the application terminates. It is tied to a system-preserved database `global_temp` (configurable via SparkConf), and we must use the qualified name to refer to a global temp view, e.g. `SELECT * FROM global_temp.view1`.
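
The scoping rules above can be modelled with a small Python sketch. The `Application` and `Session` classes and their method names are hypothetical stand-ins, not Spark APIs; the sketch only shows that global temp views are shared across sessions and resolved through the `global_temp` qualifier, while ordinary temp views stay session-local.

```python
GLOBAL_TEMP_DB = "global_temp"  # system-preserved database name from the description


class Application:
    """Hypothetical stand-in for one Spark application; owns the shared views."""

    def __init__(self):
        self.global_views = {}

    def new_session(self):
        return Session(self)


class Session:
    """Hypothetical stand-in for one session with its own local temp views."""

    def __init__(self, app):
        self._app = app
        self._local_views = {}

    def create_temp_view(self, name, definition):
        self._local_views[name] = definition

    def create_global_temp_view(self, name, definition):
        self._app.global_views[name] = definition

    def lookup(self, name):
        # A global temp view is reachable only through its qualified name.
        db, _, view = name.rpartition(".")
        if db == GLOBAL_TEMP_DB:
            return self._app.global_views.get(view)
        if db:
            return None  # other databases are outside this sketch
        return self._local_views.get(name)


app = Application()
s1, s2 = app.new_session(), app.new_session()
s1.create_global_temp_view("view1", "SELECT 1")
s1.create_temp_view("local1", "SELECT 2")
print(s2.lookup("global_temp.view1"))  # SELECT 1
print(s2.lookup("view1"))              # None
print(s2.lookup("local1"))             # None
```

Dropping the `Application` object drops every global view with it, which mirrors the lifetime described above.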

changes for `SessionCatalog`:

1. add a new field `gloabalTempViews: GlobalTempViewManager`, to access the shared global temp views, and the global temp db name.
2. `createDatabase` will fail if users wanna create `global_temp`, which is system preserved.
3. `setCurrentDatabase` will fail if users wanna set `global_temp`, which is system preserved.
4. add `createGlobalTempView`, which is used in `CreateViewCommand` to create global temp views.
5. add `dropGlobalTempView`, which is used in `CatalogImpl` to drop global temp view.
6. add `alterTempViewDefinition`, which is used in `AlterViewAsCommand` to update the view definition for local/global temp views.
7. `renameTable`/`dropTable`/`isTemporaryTable`/`lookupRelation`/`getTempViewOrPermanentTableMetadata`/`refreshTable` will handle global temp views.

changes for SQL commands:

1. `CreateViewCommand`/`AlterViewAsCommand` is updated to support global temp views
2. `ShowTablesCommand` outputs a new column `database`, which is used to distinguish global and local temp views.
3. other commands can also handle global temp views if they call `SessionCatalog` APIs which accepts global temp views, e.g. `DropTableCommand`, `AlterTableRenameCommand`, `ShowColumnsCommand`, etc.

changes for other public API

1. add a new method `dropGlobalTempView` in `Catalog`
2. `Catalog.findTable` can find global temp view
3. add a new method `createGlobalTempView` in `Dataset`

## How was this patch tested?

new tests in `SQLViewSuite`

Author: Wenchen Fan <wenchen@databricks.com>

Closes #14897 from cloud-fan/global-temp-view.
2016-10-10 15:48:57 +08:00
Justin Pihony 50b89d05b7 [SPARK-14525][SQL] Make DataFrameWrite.save work for jdbc
## What changes were proposed in this pull request?

This change modifies the implementation of DataFrameWriter.save such that it works with jdbc, and the call to jdbc merely delegates to save.

## How was this patch tested?

This was tested via unit tests in the JDBCWriteSuite, of which I added one new test to cover this scenario.

## Additional details

rxin This seems to have been most recently touched by you and was also commented on in the JIRA.

This contribution is my original work and I license the work to the project under the project's open source license.

Author: Justin Pihony <justin.pihony@gmail.com>
Author: Justin Pihony <justin.pihony@typesafe.com>

Closes #12601 from JustinPihony/jdbc_reconciliation.
2016-09-26 09:54:22 +01:00
Daniel Darabos 69cb049697 Correct fetchsize property name in docs
## What changes were proposed in this pull request?

Replace `fetchSize` with `fetchsize` in the docs.

## How was this patch tested?

I manually tested `fetchSize` and `fetchsize`. The latter has an effect. See also [`JdbcUtils.scala#L38`](https://github.com/apache/spark/blob/v2.0.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L38) for the definition of the property.

Author: Daniel Darabos <darabos.daniel@gmail.com>

Closes #14975 from darabos/patch-3.
2016-09-17 12:28:42 +01:00
GraceH 4b6c2cbcb1 [SPARK-16968] Document additional options in jdbc Writer
## What changes were proposed in this pull request?

This documents the JDBC writer options added in a previous PR.

## How was this patch tested?

Unit tests were added in the previous PR.

Author: GraceH <jhuang1@paypal.com>

Closes #14683 from GraceH/jdbc_options.
2016-08-22 09:03:46 +01:00
keliang 1275f64696 [SPARK-16870][DOCS] Summary:add "spark.sql.broadcastTimeout" into docs/sql-programming-gu…
## What changes were proposed in this pull request?
The default value for spark.sql.broadcastTimeout is 300s, but this property does not appear in any of the Spark docs. This PR adds "spark.sql.broadcastTimeout" to docs/sql-programming-guide.md to help people fix this timeout error when it happens.

## How was this patch tested?

Not needed.

…ide.md

JIRA_ID:SPARK-16870
Description: The default value for spark.sql.broadcastTimeout is 300s, but this property does not appear in any of the Spark docs. This PR adds "spark.sql.broadcastTimeout" to docs/sql-programming-guide.md to help people fix this timeout error when it happens.
Test:done

Author: keliang <keliang@cmss.chinamobile.com>

Closes #14477 from biglobster/keliang.
2016-08-07 09:28:32 +01:00
Cheng Lian 10e1c0e638 [SPARK-16734][EXAMPLES][SQL] Revise examples of all language bindings
## What changes were proposed in this pull request?

This PR makes various minor updates to examples of all language bindings to make sure they are consistent with each other. Some typos and missing parts (JDBC example in Scala/Java/Python) are also fixed.

## How was this patch tested?

Manually tested.

Author: Cheng Lian <lian@databricks.com>

Closes #14368 from liancheng/revise-examples.
2016-08-02 15:02:40 +08:00
Takeshi YAMAMURO cda4603de3 [SQL][DOC] Fix a default name for parquet compression
## What changes were proposed in this pull request?
This PR fixes an incorrect description of the default Parquet compression.

Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>

Closes #14351 from maropu/FixParquetDoc.
2016-07-25 15:08:58 -07:00
Cheng Lian 53b2456d1d [SPARK-16380][EXAMPLES] Update SQL examples and programming guide for Python language binding
This PR is based on PR #14098 authored by wangmiao1981.

## What changes were proposed in this pull request?

This PR replaces the original Python Spark SQL example file with the following three files:

- `sql/basic.py`

  Demonstrates basic Spark SQL features.

- `sql/datasource.py`

  Demonstrates various Spark SQL data sources.

- `sql/hive.py`

  Demonstrates Spark SQL Hive interaction.

This PR also removes hard-coded Python example snippets in the SQL programming guide by extracting snippets from the above files using the `include_example` Liquid template tag.

## How was this patch tested?

Manually tested.

Author: wm624@hotmail.com <wm624@hotmail.com>
Author: Cheng Lian <lian@databricks.com>

Closes #14317 from liancheng/py-examples-update.
2016-07-23 11:41:24 -07:00
WeichenXu 9674af6f6f [SPARK-16568][SQL][DOCUMENTATION] update sql programming guide refreshTable API in python code
## What changes were proposed in this pull request?

update `refreshTable` API in python code of the sql-programming-guide.

This API is added in SPARK-15820

## How was this patch tested?

N/A

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #14220 from WeichenXu123/update_sql_doc_catalog.
2016-07-19 18:48:41 -07:00
Cheng Lian 1426a08052 [SPARK-16303][DOCS][EXAMPLES] Minor Scala/Java example update
## What changes were proposed in this pull request?

This PR moves the last remaining hard-coded Scala example snippet from the SQL programming guide into `SparkSqlExample.scala`. It also renames all Scala/Java example files so that all "Sql" in the file names are updated to "SQL".

## How was this patch tested?

Manually verified the generated HTML page.

Author: Cheng Lian <lian@databricks.com>

Closes #14245 from liancheng/minor-scala-example-update.
2016-07-18 23:07:59 -07:00
Shivaram Venkataraman 01c4c1fa53 [SPARK-16553][DOCS] Fix SQL example file name in docs
## What changes were proposed in this pull request?

Fixes a typo in the sql programming guide

## How was this patch tested?

Building docs locally

Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

Closes #14208 from shivaram/spark-sql-doc-fix.
2016-07-14 14:19:30 -07:00
aokolnychyi 772c213ec7 [SPARK-16303][DOCS][EXAMPLES] Updated SQL programming guide and examples
- Hard-coded Spark SQL sample snippets were moved into source files under examples sub-project.
- Removed the inconsistency between Scala and Java Spark SQL examples
- Scala and Java Spark SQL examples were updated

The work is still in progress. All involved examples were tested manually. An additional round of testing will be done after the code review.

![image](https://cloud.githubusercontent.com/assets/6235869/16710314/51851606-462a-11e6-9fbe-0818daef65e4.png)

Author: aokolnychyi <okolnychyyanton@gmail.com>

Closes #14119 from aokolnychyi/spark_16303.
2016-07-13 16:12:11 +08:00
Lianhui Wang 5ad68ba5ce [SPARK-15752][SQL] Optimize metadata only query that has an aggregate whose children are deterministic project or filter operators.
## What changes were proposed in this pull request?
When a query only uses metadata (for example, a partition key), it can return results based on that metadata without scanning files. Hive did this in HIVE-1003.
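
The idea can be sketched in Python under one assumption: partitions are laid out as Hive-style `key=value` directory names. The function `distinct_partition_values` is hypothetical and is not Spark's optimizer rule; it only shows that the distinct values of a partition key can be read from the paths without opening any data file.

```python
def distinct_partition_values(partition_paths, key):
    """Collect the distinct values of one partition key from directory names only."""
    values = set()
    for path in partition_paths:
        for segment in path.split("/"):
            # Each segment is expected to look like "key=value".
            name, sep, value = segment.partition("=")
            if sep and name == key:
                values.add(value)
    return sorted(values)


paths = ["year=2015/month=12", "year=2016/month=01", "year=2016/month=07"]
print(distinct_partition_values(paths, "year"))  # ['2015', '2016']
```

The cost depends on the number of partitions rather than on the size of the data they contain.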

## How was this patch tested?
add unit tests

Author: Lianhui Wang <lianhuiwang09@gmail.com>
Author: Wenchen Fan <wenchen@databricks.com>
Author: Lianhui Wang <lianhuiwang@users.noreply.github.com>

Closes #13494 from lianhuiwang/metadata-only.
2016-07-12 18:52:15 +02:00
Xin Ren 9cb1eb7af7 [SPARK-16381][SQL][SPARKR] Update SQL examples and programming guide for R language binding
https://issues.apache.org/jira/browse/SPARK-16381

## What changes were proposed in this pull request?

Update SQL examples and programming guide for R language binding.

Here I follow the example at https://github.com/apache/spark/compare/master...liancheng:example-snippet-extraction and create a separate R file to store all the example code.

## How was this patch tested?

Manual test on my local machine.
Screenshot as below:

![screen shot 2016-07-06 at 4 52 25 pm](https://cloud.githubusercontent.com/assets/3925641/16638180/13925a58-439a-11e6-8d57-8451a63dcae9.png)

Author: Xin Ren <iamshrek@126.com>

Closes #14082 from keypointt/SPARK-16381.
2016-07-11 20:05:28 +08:00
Cheng Lian bde1d6a615 [SPARK-16294][SQL] Labelling support for the include_example Jekyll plugin
## What changes were proposed in this pull request?

This PR adds labelling support for the `include_example` Jekyll plugin, so that we may split a single source file into multiple line blocks with different labels, and include them in multiple code snippets in the generated HTML page.
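
As a rough illustration of label-based extraction, the Python sketch below pulls the lines for one label out of a source file. It is not the Jekyll plugin's code, and the `$example on:<label>$` / `$example off:<label>$` marker syntax is assumed here for illustration; `extract_labelled_block` is a hypothetical helper.

```python
def extract_labelled_block(source, label):
    """Return the lines between the on/off markers for one label."""
    start = f"$example on:{label}$"   # assumed marker syntax
    stop = f"$example off:{label}$"   # assumed marker syntax
    selected, inside = [], False
    for line in source.splitlines():
        if start in line:
            inside = True
        elif stop in line:
            inside = False
        elif inside and "$example " not in line:
            # Keep content lines; skip markers that belong to other labels.
            selected.append(line)
    return selected


source = """\
// $example on:create_df$
val df = spark.read.json("people.json")
// $example off:create_df$
df.printSchema()
// $example on:show$
df.show()
// $example off:show$
"""
print(extract_labelled_block(source, "show"))  # ['df.show()']
```

Because each label is matched together with its closing `$`, a label that is a prefix of another label does not match by accident.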

## How was this patch tested?

Manually tested.

<img width="923" alt="screenshot at jun 29 19-53-21" src="https://cloud.githubusercontent.com/assets/230655/16451099/66a76db2-3e33-11e6-84fb-63104c2f0688.png">

Author: Cheng Lian <lian@databricks.com>

Closes #13972 from liancheng/include-example-with-labels.
2016-06-29 22:50:53 -07:00
Yin Huai dd6b7dbe70 [SPARK-15863][SQL][DOC][FOLLOW-UP] Update SQL programming guide.
## What changes were proposed in this pull request?
This PR makes several updates to SQL programming guide.

Author: Yin Huai <yhuai@databricks.com>

Closes #13938 from yhuai/doc.
2016-06-27 22:44:08 -07:00
Felix Cheung 79aa1d82ca [SQL][DOC] SQL programming guide add deprecated methods in 2.0.0
## What changes were proposed in this pull request?

Doc changes

## How was this patch tested?

manual

liancheng

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #13827 from felixcheung/sqldocdeprecate.
2016-06-22 10:37:13 +08:00
Takeshi YAMAMURO 41e0ffb19f [SPARK-15894][SQL][DOC] Update docs for controlling #partitions
## What changes were proposed in this pull request?
Update docs for two parameters `spark.sql.files.maxPartitionBytes` and `spark.sql.files.openCostInBytes` in Other Configuration Options.

## How was this patch tested?
N/A

Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>

Closes #13797 from maropu/SPARK-15894-2.
2016-06-21 14:27:16 +08:00
Felix Cheung 58f6e27dd7 [SPARK-15863][SQL][DOC][SPARKR] sql programming guide updates to include sparkSession in R
## What changes were proposed in this pull request?

Update doc as per discussion in PR #13592

## How was this patch tested?

manual

shivaram liancheng

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #13799 from felixcheung/rsqlprogrammingguide.
2016-06-21 13:56:37 +08:00
Cheng Lian 6df8e38860 [SPARK-15863][SQL][DOC] Initial SQL programming guide update for Spark 2.0
## What changes were proposed in this pull request?

Initial SQL programming guide update for Spark 2.0. Contents like 1.6 to 2.0 migration guide are still incomplete.

We may also want to add more examples for Scala/Java Dataset typed transformations.

## How was this patch tested?

N/A

Author: Cheng Lian <lian@databricks.com>

Closes #13592 from liancheng/sql-programming-guide-2.0.
2016-06-20 14:50:28 -07:00
Mortada Mehyar 675a73715d [DOCUMENTATION] fixed groupby aggregation example for pyspark
## What changes were proposed in this pull request?

fixing documentation for the groupby/agg example in python

## How was this patch tested?

The existing example in the documentation does not contain valid syntax (missing parenthesis) and does not use `Column` in the expression for `agg()`.

after the fix here's how I tested it:

```
In [1]: from pyspark.sql import Row

In [2]: import pyspark.sql.functions as func

In [3]: %cpaste
Pasting code; enter '--' alone on the line to stop or use Ctrl-D.
:records = [{'age': 19, 'department': 1, 'expense': 100},
: {'age': 20, 'department': 1, 'expense': 200},
: {'age': 21, 'department': 2, 'expense': 300},
: {'age': 22, 'department': 2, 'expense': 300},
: {'age': 23, 'department': 3, 'expense': 300}]
:--

In [4]: df = sqlContext.createDataFrame([Row(**d) for d in records])

In [5]: df.groupBy("department").agg(df["department"], func.max("age"), func.sum("expense")).show()

+----------+----------+--------+------------+
|department|department|max(age)|sum(expense)|
+----------+----------+--------+------------+
|         1|         1|      20|         300|
|         2|         2|      22|         600|
|         3|         3|      23|         300|
+----------+----------+--------+------------+
```

Author: Mortada Mehyar <mortada.mehyar@gmail.com>

Closes #13587 from mortada/groupby_agg_doc_fix.
2016-06-10 00:23:34 -07:00
gatorsmile 6cb8f836da [SPARK-15396][SQL][DOC] It can't connect hive metastore database
#### What changes were proposed in this pull request?
The `hive.metastore.warehouse.dir` property in hive-site.xml is deprecated since Spark 2.0.0. Users might not be able to connect to the existing metastore if they do not use the new conf parameter `spark.sql.warehouse.dir`.

This PR is to update the document and example for explaining the latest changes in the configuration of default location of database.

Below is the screenshot of the latest generated docs:

<img width="681" alt="screenshot 2016-05-20 08 38 10" src="https://cloud.githubusercontent.com/assets/11567269/15433296/a05c4ace-1e66-11e6-8d2b-73682b32e9c2.png">

<img width="789" alt="screenshot 2016-05-20 08 53 26" src="https://cloud.githubusercontent.com/assets/11567269/15433734/645dc42e-1e68-11e6-9476-effc9f8721bb.png">

<img width="789" alt="screenshot 2016-05-20 08 53 37" src="https://cloud.githubusercontent.com/assets/11567269/15433738/68569f92-1e68-11e6-83d3-ef5bb221a8d8.png">

No change is made to the R example.

<img width="860" alt="screenshot 2016-05-20 08 54 38" src="https://cloud.githubusercontent.com/assets/11567269/15433779/965b8312-1e68-11e6-8bc4-53c88ceacde2.png">

#### How was this patch tested?
N/A

Author: gatorsmile <gatorsmile@gmail.com>

Closes #13225 from gatorsmile/document.
2016-05-21 23:12:27 -07:00
Sean Zhong 25b315e6ca [SPARK-15171][SQL] Remove the references to deprecated method dataset.registerTempTable
## What changes were proposed in this pull request?

Update the unit test code, examples, and documents to remove calls to deprecated method `dataset.registerTempTable`.

## How was this patch tested?

This PR only changes the unit test code, examples, and comments. It should be safe.
This is a follow up of PR https://github.com/apache/spark/pull/12945 which was merged.

Author: Sean Zhong <seanzhong@databricks.com>

Closes #13098 from clockfly/spark-15171-remove-deprecation.
2016-05-18 09:01:59 +08:00
Sun Rui 4ae9fe091c [SPARK-12919][SPARKR] Implement dapply() on DataFrame in SparkR.
## What changes were proposed in this pull request?

dapply() applies an R function on each partition of a DataFrame and returns a new DataFrame.

The function signature is:

	dapply(df, function(localDF) {}, schema = NULL)

R function input: local data.frame from the partition on local node
R function output: local data.frame

Schema specifies the Row format of the resulting DataFrame. It must match the R function's output.
If schema is not specified, each partition of the result DataFrame will be serialized in R into a single byte array. Such resulting DataFrame can be processed by successive calls to dapply().

## How was this patch tested?
SparkR unit tests.

Author: Sun Rui <rui.sun@intel.com>
Author: Sun Rui <sunrui2016@gmail.com>

Closes #12493 from sun-rui/SPARK-12919.
2016-04-29 16:41:07 -07:00
Dongjoon Hyun 6ab4d9e0c7 [SPARK-14883][DOCS] Fix wrong R examples and make them up-to-date
## What changes were proposed in this pull request?

This issue aims to fix some errors in R examples and make them up-to-date in docs and example modules.

- Remove the wrong usage of `map`. We need to use `lapply` in `sparkR` if needed. However, `lapply` is private so far. The corrected example will be added later.
- Fix the wrong example in Section `Generic Load/Save Functions` of `docs/sql-programming-guide.md` for consistency
- Fix datatypes in `sparkr.md`.
- Update a data result in `sparkr.md`.
- Replace deprecated functions to remove warnings: jsonFile -> read.json, parquetFile -> read.parquet
- Use up-to-date R-like functions: loadDF -> read.df, saveDF -> write.df, saveAsParquetFile -> write.parquet
- Replace `SparkR DataFrame` with `SparkDataFrame` in `dataframe.R` and `data-manipulation.R`.
- Other minor syntax fixes and a typo.

## How was this patch tested?

Manual.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12649 from dongjoon-hyun/SPARK-14883.
2016-04-24 22:10:27 -07:00
Mark Grover ff9ae61a3b [SPARK-14601][DOC] Minor doc/usage changes related to removal of Spark assembly
## What changes were proposed in this pull request?

Removing references to assembly jar in documentation.
Adding an additional (previously undocumented) usage of spark-submit to run examples.

## How was this patch tested?

Ran spark-submit usage to ensure formatting was fine. Ran examples using SparkSubmit.

Author: Mark Grover <mark@apache.org>

Closes #12365 from markgrover/spark-14601.
2016-04-14 18:51:43 -07:00
Dongjoon Hyun 1a0cca1fc8 [MINOR][DOCS] Fix wrong data types in JSON Datasets example.
## What changes were proposed in this pull request?

This PR corrects the `age` data type from `integer` to `long` in `SQL Programming Guide: JSON Datasets`.

## How was this patch tested?

Manual.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12290 from dongjoon-hyun/minor_fix_type_in_json_example.
2016-04-11 09:03:11 +01:00
Reynold Xin 9ca0760d67 [SPARK-10063][SQL] Remove DirectParquetOutputCommitter
## What changes were proposed in this pull request?
This patch removes DirectParquetOutputCommitter. This was initially created by Databricks as a faster way to write Parquet data to S3. However, given how the underlying S3 Hadoop implementation works, this committer only works when there are no failures. If there are multiple attempts of the same task (e.g. speculation or task failures or node failures), the output data can be corrupted. I don't think this performance optimization outweighs the correctness issue.

## How was this patch tested?
Removed the related tests also.

Author: Reynold Xin <rxin@databricks.com>

Closes #12229 from rxin/SPARK-10063.
2016-04-07 00:51:45 -07:00
Marcelo Vanzin 24d7d2e453 [SPARK-13579][BUILD] Stop building the main Spark assembly.
This change modifies the "assembly/" module to just copy needed
dependencies to its build directory, and modifies the packaging
script to pick those up (and remove duplicate jars packaged in the
examples module).

I also made some minor adjustments to dependencies to remove some
test jars from the final packaging, and remove jars that conflict with each
other when packaged separately (e.g. servlet api).

Also note that this change restores guava in applications' classpaths, even
though it's still shaded inside Spark. This is now needed for the Hadoop
libraries that are packaged with Spark, which now are not processed by
the shade plugin.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #11796 from vanzin/SPARK-13579.
2016-04-04 16:52:22 -07:00
Daoyuan Wang d1c193a2f1 [SPARK-12855][MINOR][SQL][DOC][TEST] remove spark.sql.dialect from doc and test
## What changes were proposed in this pull request?

Since the developer API of the pluggable parser was removed in #10801, the docs should be updated accordingly.

## How was this patch tested?

This patch will not affect the real code path.

Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #11758 from adrian-wang/spark12855.
2016-03-16 22:52:10 -07:00
Dongjoon Hyun 4ce2d24e2a [SPARK-13942][CORE][DOCS] Remove Shark-related docs for 2.x
## What changes were proposed in this pull request?

`Shark` was merged into `Spark SQL` in [July 2014](https://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html). The following seem to be the only remaining legacy sections. For Spark 2.x, we should clean up those docs.

**Migration Guide**
```
- ## Migration Guide for Shark Users
- ...
- ### Scheduling
- ...
- ### Reducer number
- ...
- ### Caching
```

## How was this patch tested?

Pass the Jenkins test.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #11770 from dongjoon-hyun/SPARK-13942.
2016-03-16 15:50:24 -07:00
Dongjoon Hyun c3689bc24e [SPARK-13702][CORE][SQL][MLLIB] Use diamond operator for generic instance creation in Java code.
## What changes were proposed in this pull request?

To make `docs/examples` (and other related code) simpler, more readable, and more user-friendly, this PR replaces existing code like the following with the `diamond` operator.

```
-    final ArrayList<Product2<Object, Object>> dataToWrite =
-      new ArrayList<Product2<Object, Object>>();
+    final ArrayList<Product2<Object, Object>> dataToWrite = new ArrayList<>();
```

Java 7 and higher support the **diamond** operator, which replaces the type arguments required to invoke the constructor of a generic class with an empty set of type parameters (<>). Currently, Spark's Java code uses it inconsistently.
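
As a standalone illustration of the point above (not code from this PR), the sketch below builds the same map twice, once with explicit type arguments and once with the diamond operator. The class name `DiamondExample` and both method names are hypothetical and chosen only for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical example class; not part of the Spark code base.
public class DiamondExample {

    // Pre-Java 7 style: type arguments are repeated on every constructor call.
    static Map<String, List<Integer>> explicit() {
        Map<String, List<Integer>> byKey = new HashMap<String, List<Integer>>();
        List<Integer> values = new ArrayList<Integer>();
        values.add(1);
        byKey.put("a", values);
        return byKey;
    }

    // Java 7+ style: the compiler infers the type arguments from the
    // declared type on the left-hand side of each assignment.
    static Map<String, List<Integer>> diamond() {
        Map<String, List<Integer>> byKey = new HashMap<>();
        List<Integer> values = new ArrayList<>();
        values.add(1);
        byKey.put("a", values);
        return byKey;
    }

    public static void main(String[] args) {
        // Both forms produce equal collections; only the source text differs.
        System.out.println(explicit().equals(diamond())); // prints true
    }
}
```

The diamond is assigned to a typed local variable in each case rather than passed directly as a method argument, because inference in argument position is more limited on Java 7 than on Java 8 and later.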

## How was this patch tested?

Manual.
Pass the existing tests.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #11541 from dongjoon-hyun/SPARK-13702.
2016-03-09 10:31:26 +00:00