Commit graph

16451 commits

Author SHA1 Message Date
Wenchen Fan 8640cdb836 [SPARK-15441][SQL] support null object in Dataset outer-join
## What changes were proposed in this pull request?

Currently we can't encode top level null object into internal row, as Spark SQL doesn't allow row to be null, only its columns can be null.

This is not a problem before, as we assume the input object is never null. However, for outer join, we do need the semantics of null object.

This PR fixes this problem by making both join sides produce a single column, i.e. nest the logical plan output(by `CreateStruct`), so that we have an extra level to represent top level null obejct.

## How was this patch tested?

new test in `DatasetSuite`

Author: Wenchen Fan <wenchen@databricks.com>

Closes #13425 from cloud-fan/outer-join2.
2016-06-01 16:16:54 -07:00
Cheng Lian 7bb64aae27 [SPARK-15269][SQL] Removes unexpected empty table directories created while creating external Spark SQL data sourcet tables.
This PR is an alternative to #13120 authored by xwu0226.

## What changes were proposed in this pull request?

When creating an external Spark SQL data source table and persisting its metadata to Hive metastore, we don't use the standard Hive `Table.dataLocation` field because Hive only allows directory paths as data locations while Spark SQL also allows file paths. However, if we don't set `Table.dataLocation`, Hive always creates an unexpected empty table directory under database location, but doesn't remove it while dropping the table (because the table is external).

This PR works around this issue by explicitly setting `Table.dataLocation` and then manullay removing the created directory after creating the external table.

Please refer to [this JIRA comment][1] for more details about why we chose this approach as a workaround.

[1]: https://issues.apache.org/jira/browse/SPARK-15269?focusedCommentId=15297408&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15297408

## How was this patch tested?

1. A new test case is added in `HiveQuerySuite` for this case
2. Updated `ShowCreateTableSuite` to use the same table name in all test cases. (This is how I hit this issue at the first place.)

Author: Cheng Lian <lian@databricks.com>

Closes #13270 from liancheng/spark-15269-unpleasant-fix.
2016-06-01 16:02:27 -07:00
Andrew Or 9e2643b21d [SPARK-15596][SPARK-15635][SQL] ALTER TABLE RENAME fixes
## What changes were proposed in this pull request?

**SPARK-15596**: Even after we renamed a cached table, the plan would remain in the cache with the old table name. If I created a new table using the old name then the old table would return incorrect data. Note that this applies only to Hive tables.

**SPARK-15635**: Renaming a datasource table would render the table not query-able. This is because we store the location of the table in a "path" property, which was not updated to reflect Hive's change in table location following a rename.

## How was this patch tested?

DDLSuite

Author: Andrew Or <andrew@databricks.com>

Closes #13416 from andrewor14/rename-table.
2016-06-01 14:26:24 -07:00
Thomas Graves 5b08ee6396 [SPARK-15671] performance regression CoalesceRDD.pickBin with large #…
I was running a 15TB join job with 202000 partitions. It looks like the changes I made to CoalesceRDD in pickBin() are really slow with that large of partitions. The array filter with that many elements just takes to long.
It took about an hour for it to pickBins for all the partitions.
original change:
83ee92f603

Just reverting the pickBin code back to get currpreflocs fixes the issue

After reverting the pickBin code the coalesce takes about 10 seconds so for now it makes sense to revert those changes and we can look at further optimizations later.

Tested this via RDDSuite unit test and manually testing the very large job.

Author: Thomas Graves <tgraves@prevailsail.corp.gq1.yahoo.com>

Closes #13443 from tgravescs/SPARK-15671.
2016-06-01 13:21:40 -07:00
WeichenXu 2402b91461 [SPARK-15702][DOCUMENTATION] Update document programming-guide accumulator section
## What changes were proposed in this pull request?

Update document programming-guide accumulator section (scala language)
java and python version, because the API haven't done, so I do not modify them.

## How was this patch tested?

N/A

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #13441 from WeichenXu123/update_doc_accumulatorV2_clean.
2016-06-01 12:57:02 -07:00
Yanbo Liang 07a98ca4ce [SPARK-15587][ML] ML 2.0 QA: Scala APIs audit for ml.feature
## What changes were proposed in this pull request?
ML 2.0 QA: Scala APIs audit for ml.feature. Mainly include:
* Remove seed for ```QuantileDiscretizer```, since we use ```approxQuantile``` to produce bins and ```seed``` is useless.
* Scala API docs update.
* Sync Scala and Python API docs for these changes.

## How was this patch tested?
Exist tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #13410 from yanboliang/spark-15587.
2016-06-01 10:49:51 -07:00
Reynold Xin a71d1364ae [SPARK-15686][SQL] Move user-facing streaming classes into sql.streaming
## What changes were proposed in this pull request?
This patch moves all user-facing structured streaming classes into sql.streaming. As part of this, I also added some since version annotation to methods and classes that don't have them.

## How was this patch tested?
Updated tests to reflect the moves.

Author: Reynold Xin <rxin@databricks.com>

Closes #13429 from rxin/SPARK-15686.
2016-06-01 10:14:40 -07:00
Sean Zhong d5012c2740 [SPARK-15495][SQL] Improve the explain output for Aggregation operator
## What changes were proposed in this pull request?

This PR improves the explain output of Aggregator operator.

SQL:

```
Seq((1,2,3)).toDF("a", "b", "c").createTempView("df1")
spark.sql("cache table df1")
spark.sql("select count(a), count(c), b from df1 group by b").explain()
```

**Before change:**

```
*TungstenAggregate(key=[b#8], functions=[count(1),count(1)], output=[count(a)#79L,count(c)#80L,b#8])
+- Exchange hashpartitioning(b#8, 200), None
   +- *TungstenAggregate(key=[b#8], functions=[partial_count(1),partial_count(1)], output=[b#8,count#98L,count#99L])
      +- InMemoryTableScan [b#8], InMemoryRelation [a#7,b#8,c#9], true, 10000, StorageLevel(disk=true, memory=true, offheap=false, deserialized=true, replication=1), LocalTableScan [a#7,b#8,c#9], [[1,2,3]], Some(df1)
``````

**After change:**

```
*Aggregate(key=[b#8], functions=[count(1),count(1)], output=[count(a)#79L,count(c)#80L,b#8])
+- Exchange hashpartitioning(b#8, 200), None
   +- *Aggregate(key=[b#8], functions=[partial_count(1),partial_count(1)], output=[b#8,count#98L,count#99L])
      +- InMemoryTableScan [b#8], InMemoryRelation [a#7,b#8,c#9], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas), LocalTableScan [a#7,b#8,c#9], [[1,2,3]], Some(df1)
```

## How was this patch tested?

Manual test and existing UT.

Author: Sean Zhong <seanzhong@databricks.com>

Closes #13363 from clockfly/verbose3.
2016-06-01 09:58:01 -07:00
Cheng Lian 1f43562daf [SPARK-14343][SQL] Proper column pruning for text data source
## What changes were proposed in this pull request?

Text data source ignores requested schema, and may give wrong result when the only data column is not requested. This may happen when only partitioning column(s) are requested for a partitioned text table.

## How was this patch tested?

New test case added in `TextSuite`.

Author: Cheng Lian <lian@databricks.com>

Closes #13431 from liancheng/spark-14343-partitioned-text-table.
2016-06-01 07:30:55 -07:00
Lianhui Wang 6563d72b16 [SPARK-15664][MLLIB] Replace FileSystem.get(conf) with path.getFileSystem(conf) when removing CheckpointFile in MLlib
## What changes were proposed in this pull request?
if sparkContext.set CheckpointDir to another Dir that is not default FileSystem, it will throw exception when removing CheckpointFile in MLlib.
So we should always get the FileSystem from Path to avoid wrong FS problem.
## How was this patch tested?
N/A

Author: Lianhui Wang <lianhuiwang09@gmail.com>

Closes #13408 from lianhuiwang/SPARK-15664.
2016-06-01 08:30:38 -05:00
jerryshao e4ce1bc4f3 [SPARK-15659][SQL] Ensure FileSystem is gotten from path
## What changes were proposed in this pull request?

Currently `spark.sql.warehouse.dir` is pointed to local dir by default, which will throw exception when HADOOP_CONF_DIR is configured and default FS is hdfs.

```
java.lang.IllegalArgumentException: Wrong FS: file:/Users/sshao/projects/apache-spark/spark-warehouse, expected: hdfs://localhost:8020
```

So we should always get the `FileSystem` from `Path` to avoid wrong FS problem.

## How was this patch tested?

Local test.

Author: jerryshao <sshao@hortonworks.com>

Closes #13405 from jerryshao/SPARK-15659.
2016-06-01 08:28:19 -05:00
Andrew Or 1dd9256441 [HOTFIX] DDLSuite was broken by 93e9714 2016-05-31 20:06:08 -07:00
Tejas Patil ac38bdc756 [SPARK-15601][CORE] CircularBuffer's toString() to print only the contents written if buffer isn't full
## What changes were proposed in this pull request?

1. The class allocated 4x space than needed as it was using `Int` to store the `Byte` values

2. If CircularBuffer isn't full, currently toString() will print some garbage chars along with the content written as is tries to print the entire array allocated for the buffer. The fix is to keep track of buffer getting full and don't print the tail of the buffer if it isn't full (suggestion by sameeragarwal over https://github.com/apache/spark/pull/12194#discussion_r64495331)

3. Simplified `toString()`

## How was this patch tested?

Added new test case

Author: Tejas Patil <tejasp@fb.com>

Closes #13351 from tejasapatil/circular_buffer.
2016-05-31 19:52:22 -05:00
xin Wu 04f925ede8 [SPARK-15236][SQL][SPARK SHELL] Add spark-defaults property to switch to use InMemoryCatalog
## What changes were proposed in this pull request?
This PR change REPL/Main to check this property `spark.sql.catalogImplementation` to decide if `enableHiveSupport `should be called.

If `spark.sql.catalogImplementation` is set to `hive`, and hive classes are built, Spark will use Hive support.
Other wise, Spark will create a SparkSession with in-memory catalog support.

## How was this patch tested?
Run the REPL component test.

Author: xin Wu <xinwu@us.ibm.com>
Author: Xin Wu <xinwu@us.ibm.com>

Closes #13088 from xwu0226/SPARK-15236.
2016-05-31 17:42:47 -07:00
Dongjoon Hyun 85d6b0db9f [SPARK-15618][SQL][MLLIB] Use SparkSession.builder.sparkContext if applicable.
## What changes were proposed in this pull request?

This PR changes function `SparkSession.builder.sparkContext(..)` from **private[sql]** into **private[spark]**, and uses it if applicable like the followings.
```
- val spark = SparkSession.builder().config(sc.getConf).getOrCreate()
+ val spark = SparkSession.builder().sparkContext(sc).getOrCreate()
```

## How was this patch tested?

Pass the existing Jenkins tests.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13365 from dongjoon-hyun/SPARK-15618.
2016-05-31 17:40:44 -07:00
Eric Liang 93e97147eb [MINOR] Slightly better error message when attempting to query hive tables w/in-mem catalog
andrewor14

Author: Eric Liang <ekl@databricks.com>

Closes #13427 from ericl/better-error-msg.
2016-05-31 17:39:03 -07:00
Dongjoon Hyun 196a0d8273 [MINOR][SQL][DOCS] Fix docs of Dataset.scala and SQLImplicits.scala.
This PR fixes a sample code, a description, and indentations in docs.

Manual.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13420 from dongjoon-hyun/minor_fix_dataset_doc.
2016-05-31 17:37:33 -07:00
WeichenXu dad5a68818 [SPARK-15670][JAVA API][SPARK CORE] label_accumulator_deprecate_in_java_spark_context
## What changes were proposed in this pull request?

Add deprecate annotation for acumulator V1 interface in JavaSparkContext class

## How was this patch tested?

N/A

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #13412 from WeichenXu123/label_accumulator_deprecate_in_java_spark_context.
2016-05-31 17:34:34 -07:00
Sean Zhong 06514d689c [SPARK-12988][SQL] Can't drop top level columns that contain dots
## What changes were proposed in this pull request?

Fixes "Can't drop top level columns that contain dots".

This work is based on dilipbiswal's https://github.com/apache/spark/pull/10943.
This PR fixes problems like:

```
scala> Seq((1, 2)).toDF("a.b", "a.c").drop("a.b")
org.apache.spark.sql.AnalysisException: cannot resolve '`a.c`' given input columns: [a.b, a.c];
```

`drop(columnName)` can only be used to drop top level column, so, we should parse the column name literally WITHOUT interpreting dot "."

We should also NOT interpret back tick "`", otherwise it is hard to understand what

```
​```aaa```bbb``
```

actually means.

## How was this patch tested?

Unit tests.

Author: Sean Zhong <seanzhong@databricks.com>

Closes #13306 from clockfly/fix_drop_column.
2016-05-31 17:34:10 -07:00
Jacek Laskowski 0f24713468 [CORE][DOC][MINOR] typos + links
## What changes were proposed in this pull request?

A very tiny change to javadoc (which I don't mind if gets merged with a bigger change). I've just found it annoying and couldn't resist proposing a pull request. Sorry srowen and rxin.

## How was this patch tested?

Manual build

Author: Jacek Laskowski <jacek@japila.pl>

Closes #13383 from jaceklaskowski/memory-consumer.
2016-05-31 17:32:37 -07:00
Josh Rosen 8ca01a6feb [SPARK-15680][SQL] Disable comments in generated code in order to avoid perf. issues
## What changes were proposed in this pull request?

In benchmarks involving tables with very wide and complex schemas (thousands of columns, deep nesting), I noticed that significant amounts of time (order of tens of seconds per task) were being spent generating comments during the code generation phase.

The root cause of the performance problem stems from the fact that calling toString() on a complex expression can involve thousands of string concatenations, resulting in huge amounts (tens of gigabytes) of character array allocation and copying.

In the long term, we can avoid this problem by passing StringBuilders down the tree and using them to accumulate output. As a short-term workaround, this patch guards comment generation behind a flag and disables comments by default (for wide tables / complex queries, these comments were being truncated prior to display and thus were not very useful).

## How was this patch tested?

This was tested manually by running a Spark SQL query over an empty table with a very wide schema obtained from a real workload. Disabling comments brought the per-task time down from about 16 seconds to 600 milliseconds.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #13421 from JoshRosen/disable-line-comments-in-codegen.
2016-05-31 17:30:03 -07:00
Reynold Xin 223f1d58c4 [SPARK-15662][SQL] Add since annotation for classes in sql.catalog
## What changes were proposed in this pull request?
This patch does a few things:

1. Adds since version annotation to methods and classes in sql.catalog.
2. Fixed a typo in FilterFunction and a whitespace issue in spark/api/java/function/package.scala
3. Added "database" field to Function class.

## How was this patch tested?
Updated unit test case for "database" field in Function class.

Author: Reynold Xin <rxin@databricks.com>

Closes #13406 from rxin/SPARK-15662.
2016-05-31 17:29:10 -07:00
Jacek Laskowski 6954704299 [CORE][MINOR][DOC] Removing incorrect scaladoc
## What changes were proposed in this pull request?

I don't think the method will ever throw an exception so removing a false comment. Sorry srowen and rxin again -- I simply couldn't resist.

I wholeheartedly support merging the change with a bigger one (and trashing this PR).

## How was this patch tested?

Manual build

Author: Jacek Laskowski <jacek@japila.pl>

Closes #13384 from jaceklaskowski/blockinfomanager.
2016-05-31 19:21:25 -05:00
Marcelo Vanzin 57adb77e6b [SPARK-15451][BUILD] Use jdk7's rt.jar when available.
This helps with preventing jdk8-specific calls being checked in,
because PR builders are running the compiler with the wrong settings.

If the JAVA_7_HOME env variable is set, assume it points at
a jdk7 and use its rt.jar when invoking javac. For zinc, just run
it with jdk7, and disable it when building jdk8-specific code.

A big note for sbt usage: adding the bootstrap options forces sbt
to fork the compiler, and that disables incremental compilation.
That means that it's really not convenient to use for normal
development, but should be ok for automated builds.

Tested with JAVA_HOME=jdk8 and JAVA_7_HOME=jdk7:
- mvn + zinc
- mvn sans zinc
- sbt

Verified that in all cases, jdk8-specific library calls fail to
compile.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #13272 from vanzin/SPARK-15451.
2016-05-31 16:54:34 -07:00
Tathagata Das 90b11439b3 [SPARK-15517][SQL][STREAMING] Add support for complete output mode in Structure Streaming
## What changes were proposed in this pull request?
Currently structured streaming only supports append output mode.  This PR adds the following.

- Added support for Complete output mode in the internal state store, analyzer and planner.
- Added public API in Scala and Python for users to specify output mode
- Added checks for unsupported combinations of output mode and DF operations
  - Plans with no aggregation should support only Append mode
  - Plans with aggregation should support only Update and Complete modes
  - Default output mode is Append mode (**Question: should we change this to automatically set to Complete mode when there is aggregation?**)
- Added support for Complete output mode in Memory Sink. So Memory Sink internally supports append and complete, update. But from public API only Complete and Append output modes are supported.

## How was this patch tested?
Unit tests in various test suites
- StreamingAggregationSuite: tests for complete mode
- MemorySinkSuite: tests for checking behavior in Append and Complete modes.
- UnsupportedOperationSuite: tests for checking unsupported combinations of DF ops and output modes
- DataFrameReaderWriterSuite: tests for checking that output mode cannot be called on static DFs
- Python doc test and existing unit tests modified to call write.outputMode.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13286 from tdas/complete-mode.
2016-05-31 15:57:01 -07:00
Dilip Biswal dfe2cbeb43 [SPARK-15557] [SQL] cast the string into DoubleType when it's used together with decimal
In this case, the result type of the expression becomes DECIMAL(38, 36) as we promote the individual string literals to DECIMAL(38, 18) when we handle string promotions for `BinaryArthmaticExpression`.

I think we need to cast the string literals to Double type instead. I looked at the history and found that  this was changed to use decimal instead of double to avoid potential loss of precision when we cast decimal to double.

To double check i ran the query against hive, mysql. This query returns non NULL result for both the databases and both promote the expression to use double.
Here is the output.

- Hive
```SQL
hive> create table l2 as select (cast(99 as decimal(19,6)) + '2') from l1;
OK
hive> describe l2;
OK
_c0                 	double
```
- MySQL
```SQL
mysql> create table foo2 as select (cast(99 as decimal(19,6)) + '2') from test;
Query OK, 1 row affected (0.01 sec)
Records: 1  Duplicates: 0  Warnings: 0

mysql> describe foo2;
+-----------------------------------+--------+------+-----+---------+-------+
| Field                             | Type   | Null | Key | Default | Extra |
+-----------------------------------+--------+------+-----+---------+-------+
| (cast(99 as decimal(19,6)) + '2') | double | NO   |     | 0       |       |
+-----------------------------------+--------+------+-----+---------+-------+
```

## How was this patch tested?
Added a new test in SQLQuerySuite

Author: Dilip Biswal <dbiswal@us.ibm.com>

Closes #13368 from dilipbiswal/spark-15557.
2016-05-31 15:49:45 -07:00
Davies Liu 2df6ca848e [SPARK-15327] [SQL] fix split expression in whole stage codegen
## What changes were proposed in this pull request?

Right now, we will split the code for expressions into multiple functions when it exceed 64k, which requires that the the expressions are using Row object, but this is not true for whole-state codegen, it will fail to compile after splitted.

This PR will not split the code in whole-stage codegen.

## How was this patch tested?

Added regression tests.

Author: Davies Liu <davies@databricks.com>

Closes #13235 from davies/fix_nested_codegen.
2016-05-31 15:36:02 -07:00
Yanbo Liang 594484cd83 [MINOR][DOC][ML] ml.clustering scala & python api doc sync
## What changes were proposed in this pull request?
Since we done Scala API audit for ml.clustering at #13148, we should also fix and update the corresponding Python API docs to keep them in sync.

## How was this patch tested?
Docs change, no tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #13291 from yanboliang/spark-15361-followup.
2016-05-31 14:56:43 -07:00
Shixiong Zhu 9a74de18a1 Revert "[SPARK-11753][SQL][TEST-HADOOP2.2] Make allowNonNumericNumbers option work
## What changes were proposed in this pull request?

This reverts commit c24b6b679c. Sent a PR to run Jenkins tests due to the revert conflicts of `dev/deps/spark-deps-hadoop*`.

## How was this patch tested?

Jenkins unit tests, integration tests, manual tests)

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #13417 from zsxwing/revert-SPARK-11753.
2016-05-31 14:50:07 -07:00
Yin Huai c6de5832bf [SPARK-15622][SQL] Wrap the parent classloader of Janino's classloader in the ParentClassLoader.
## What changes were proposed in this pull request?
At https://github.com/aunkrig/janino/blob/janino_2.7.8/janino/src/org/codehaus/janino/ClassLoaderIClassLoader.java#L80-L85, Janino's classloader throws the exception when its parent throws a ClassNotFoundException with a cause set. However, it does not throw the exception when there is no cause set. Seems we need to use a special ClassLoader to wrap the actual parent classloader set to Janino handle this behavior.

## How was this patch tested?
I have reverted the workaround made by https://issues.apache.org/jira/browse/SPARK-11636 ( https://github.com/apache/spark/compare/master...yhuai:SPARK-15622?expand=1#diff-bb538fda94224dd0af01d0fd7e1b4ea0R81) and `test-only *ReplSuite -- -z "SPARK-2576 importing implicits"` still passes the test (without the change in `CodeGenerator`, this test does not pass with the change in `ExecutorClassLoader `).

Author: Yin Huai <yhuai@databricks.com>

Closes #13366 from yhuai/SPARK-15622.
2016-05-31 12:30:34 -07:00
Wenchen Fan 2bfed1a0c5 [SPARK-15658][SQL] UDT serializer should declare its data type as udt instead of udt.sqlType
## What changes were proposed in this pull request?

When we build serializer for UDT object, we should declare its data type as udt instead of udt.sqlType, or if we deserialize it again, we lose the information that it's a udt object and throw analysis exception.

## How was this patch tested?

new test in `UserDefiendTypeSuite`

Author: Wenchen Fan <wenchen@databricks.com>

Closes #13402 from cloud-fan/udt.
2016-05-31 11:00:38 -07:00
gatorsmile d67c82e4b6 [SPARK-15647][SQL] Fix Boundary Cases in OptimizeCodegen Rule
#### What changes were proposed in this pull request?

The following condition in the Optimizer rule `OptimizeCodegen` is not right.
```Scala
branches.size < conf.maxCaseBranchesForCodegen
```

- The number of branches in case when clause should be `branches.size + elseBranch.size`.
- `maxCaseBranchesForCodegen` is the maximum boundary for enabling codegen. Thus, we should use `<=` instead of `<`.

This PR is to fix this boundary case and also add missing test cases for verifying the conf `MAX_CASES_BRANCHES`.

#### How was this patch tested?
Added test cases in `SQLConfSuite`

Author: gatorsmile <gatorsmile@gmail.com>

Closes #13392 from gatorsmile/maxCaseWhen.
2016-05-31 10:08:00 -07:00
Lianhui Wang 2bfc4f1521 [SPARK-15649][SQL] Avoid to serialize MetastoreRelation in HiveTableScanExec
## What changes were proposed in this pull request?
in HiveTableScanExec, schema is lazy and is related with relation.attributeMap. So it needs to serialize MetastoreRelation when serializing task binary bytes.It can avoid to serialize MetastoreRelation.

## How was this patch tested?

Author: Lianhui Wang <lianhuiwang09@gmail.com>

Closes #13397 from lianhuiwang/avoid-serialize.
2016-05-31 09:21:51 -07:00
Takeshi YAMAMURO 95db8a44f3 [SPARK-15528][SQL] Fix race condition in NumberConverter
## What changes were proposed in this pull request?
A local variable in NumberConverter is wrongly shared between threads.
This pr fixes the race condition.

## How was this patch tested?
Manually checked.

Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>

Closes #13391 from maropu/SPARK-15528.
2016-05-31 07:25:16 -05:00
catapan 6878f3e2ea [SPARK-15641] HistoryServer to not show invalid date for incomplete application
## What changes were proposed in this pull request?
For incomplete applications in HistoryServer, the complete column will show "-" instead of incorrect date.

## How was this patch tested?
manually tested.

Author: catapan <cedarpan86@gmail.com>
Author: Ziying Pan <cedarpan@Ziyings-MacBook.local>

Closes #13396 from catapan/SPARK-15641_fix_completed_column.
2016-05-31 06:55:07 -05:00
Reynold Xin 675921040e [SPARK-15638][SQL] Audit Dataset, SparkSession, and SQLContext
## What changes were proposed in this pull request?
This patch contains a list of changes as a result of my auditing Dataset, SparkSession, and SQLContext. The patch audits the categorization of experimental APIs, function groups, and deprecations. For the detailed list of changes, please see the diff.

## How was this patch tested?
N/A

Author: Reynold Xin <rxin@databricks.com>

Closes #13370 from rxin/SPARK-15638.
2016-05-30 22:47:58 -07:00
Devaraj K 5b21139dbf [SPARK-10530][CORE] Kill other task attempts when one taskattempt belonging the same task is succeeded in speculation
## What changes were proposed in this pull request?

With this patch, TaskSetManager kills other running attempts when any one of the attempt succeeds for the same task. Also killed tasks will not be considered as failed tasks and they get listed separately in the UI and also shows the task state as KILLED instead of FAILED.

## How was this patch tested?

core\src\test\scala\org\apache\spark\ui\jobs\JobProgressListenerSuite.scala
core\src\test\scala\org\apache\spark\util\JsonProtocolSuite.scala

I have verified this patch manually by enabling spark.speculation as true, when any attempt gets succeeded then other running attempts are getting killed for the same task and other pending tasks are getting assigned in those. And also when any attempt gets killed then they are considered as KILLED tasks and not considered as FAILED tasks. Please find the attached screen shots for the reference.

![stage-tasks-table](https://cloud.githubusercontent.com/assets/3174804/14075132/394c6a12-f4f4-11e5-8638-20ff7b8cc9bc.png)
![stages-table](https://cloud.githubusercontent.com/assets/3174804/14075134/3b60f412-f4f4-11e5-9ea6-dd0dcc86eb03.png)

Ref : https://github.com/apache/spark/pull/11916

Author: Devaraj K <devaraj@apache.org>

Closes #11996 from devaraj-kavali/SPARK-10530.
2016-05-30 14:29:27 -07:00
Matthew Wise 2d34183b27 [DOCS] fix example code issues in documentation
## What changes were proposed in this pull request?

Fixed broken java code examples in streaming documentation

Attn: tdas

Author: Matthew Wise <matthew.rs.wise@gmail.com>

Closes #13388 from mawise/fix_docs_java_streaming_example.
2016-05-30 09:12:02 -05:00
Xin Ren 5728aa558e [SPARK-15645][STREAMING] Fix some typos of Streaming module
## What changes were proposed in this pull request?

No code change, just some typo fixing.

## How was this patch tested?

Manually run project build with testing, and build is successful.

Author: Xin Ren <iamshrek@126.com>

Closes #13385 from keypointt/codeWalkThroughStreaming.
2016-05-30 08:40:03 -05:00
Cheng Lian 1360a6d636 [SPARK-15112][SQL] Disables EmbedSerializerInFilter for plan fragments that change schema
## What changes were proposed in this pull request?

`EmbedSerializerInFilter` implicitly assumes that the plan fragment being optimized doesn't change plan schema, which is reasonable because `Dataset.filter` should never change the schema.

However, due to another issue involving `DeserializeToObject` and `SerializeFromObject`, typed filter *does* change plan schema (see [SPARK-15632][1]). This breaks `EmbedSerializerInFilter` and causes corrupted data.

This PR disables `EmbedSerializerInFilter` when there's a schema change to avoid data corruption. The schema change issue should be addressed in follow-up PRs.

## How was this patch tested?

New test case added in `DatasetSuite`.

[1]: https://issues.apache.org/jira/browse/SPARK-15632

Author: Cheng Lian <lian@databricks.com>

Closes #13362 from liancheng/spark-15112-corrupted-filter.
2016-05-29 23:19:12 -07:00
Sean Owen ce1572d16f [MINOR] Resolve a number of miscellaneous build warnings
## What changes were proposed in this pull request?

This change resolves a number of build warnings that have accumulated, before 2.x. It does not address a large number of deprecation warnings, especially related to the Accumulator API. That will happen separately.

## How was this patch tested?

Jenkins

Author: Sean Owen <sowen@cloudera.com>

Closes #13377 from srowen/BuildWarnings.
2016-05-29 16:48:14 -05:00
Reynold Xin 472f16181d [SPARK-15636][SQL] Make aggregate expressions more concise in explain
## What changes were proposed in this pull request?
This patch reduces the verbosity of aggregate expressions in explain (but does not actually remove any information). As an example, for the following command:
```
spark.range(10).selectExpr("sum(id) + 1", "count(distinct id)").explain(true)
```

Output before this patch:
```
== Physical Plan ==
*TungstenAggregate(key=[], functions=[(sum(id#0L),mode=Final,isDistinct=false),(count(id#0L),mode=Final,isDistinct=true)], output=[(sum(id) + 1)#3L,count(DISTINCT id)#16L])
+- Exchange SinglePartition, None
   +- *TungstenAggregate(key=[], functions=[(sum(id#0L),mode=PartialMerge,isDistinct=false),(count(id#0L),mode=Partial,isDistinct=true)], output=[sum#18L,count#21L])
      +- *TungstenAggregate(key=[id#0L], functions=[(sum(id#0L),mode=PartialMerge,isDistinct=false)], output=[id#0L,sum#18L])
         +- Exchange hashpartitioning(id#0L, 5), None
            +- *TungstenAggregate(key=[id#0L], functions=[(sum(id#0L),mode=Partial,isDistinct=false)], output=[id#0L,sum#18L])
               +- *Range (0, 10, splits=2)
```

Output after this patch:
```
== Physical Plan ==
*TungstenAggregate(key=[], functions=[sum(id#0L),count(distinct id#0L)], output=[(sum(id) + 1)#3L,count(DISTINCT id)#16L])
+- Exchange SinglePartition, None
   +- *TungstenAggregate(key=[], functions=[merge_sum(id#0L),partial_count(distinct id#0L)], output=[sum#18L,count#21L])
      +- *TungstenAggregate(key=[id#0L], functions=[merge_sum(id#0L)], output=[id#0L,sum#18L])
         +- Exchange hashpartitioning(id#0L, 5), None
            +- *TungstenAggregate(key=[id#0L], functions=[partial_sum(id#0L)], output=[id#0L,sum#18L])
               +- *Range (0, 10, splits=2)
```

Note the change from `(sum(id#0L),mode=PartialMerge,isDistinct=false)` to `merge_sum(id#0L)`.

In general aggregate explain is still very verbose, but further work will be done as follow-up pull requests.

## How was this patch tested?
Tested manually.

Author: Reynold Xin <rxin@databricks.com>

Closes #13367 from rxin/SPARK-15636.
2016-05-28 14:14:36 -07:00
felixcheung 74c1b79f3f [SPARK-15637][SPARKR] fix R tests on R 3.2.2
## What changes were proposed in this pull request?

Change version check in R tests

## How was this patch tested?

R tests
shivaram

Author: felixcheung <felixcheung_m@hotmail.com>

Closes #13369 from felixcheung/rversioncheck.
2016-05-28 10:32:40 -07:00
Yadong Qi b4c32c4952 [SPARK-15549][SQL] Disable bucketing when the output doesn't contain all bucketing columns
## What changes were proposed in this pull request?
I create a bucketed table bucketed_table with bucket column i,
```scala
case class Data(i: Int, j: Int, k: Int)
sc.makeRDD(Array((1, 2, 3))).map(x => Data(x._1, x._2, x._3)).toDF.write.bucketBy(2, "i").saveAsTable("bucketed_table")
```

and I run the following SQLs:
```sql
SELECT j FROM bucketed_table;
Error in query: bucket column i not found in existing columns (j);

SELECT j, MAX(k) FROM bucketed_table GROUP BY j;
Error in query: bucket column i not found in existing columns (j, k);
```

I think we should add a check that, we only enable bucketing when it satisfies all conditions below:
1. the conf is enabled
2. the relation is bucketed
3. the output contains all bucketing columns

## How was this patch tested?
Updated test cases to reflect the changes.

Author: Yadong Qi <qiyadong2010@gmail.com>

Closes #13321 from watermen/SPARK-15549.
2016-05-28 10:19:29 -07:00
Liang-Chi Hsieh f1b220eeee [SPARK-15553][SQL] Dataset.createTempView should use CreateViewCommand
## What changes were proposed in this pull request?

Let `Dataset.createTempView` and `Dataset.createOrReplaceTempView` use `CreateViewCommand`, rather than calling `SparkSession.createTempView`. Besides, this patch also removes `SparkSession.createTempView`.

## How was this patch tested?
Existing tests.

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>

Closes #13327 from viirya/dataset-createtempview.
2016-05-27 21:24:08 -07:00
Reynold Xin 73178c7556 [SPARK-15633][MINOR] Make package name for Java tests consistent
## What changes were proposed in this pull request?
This is a simple patch that makes package names for Java 8 test suites consistent. I moved everything to test.org.apache.spark to we can test package private APIs properly. Also added "java8" as the package name so we can easily run all the tests related to Java 8.

## How was this patch tested?
This is a test only change.

Author: Reynold Xin <rxin@databricks.com>

Closes #13364 from rxin/SPARK-15633.
2016-05-27 21:20:02 -07:00
Zheng RuiFeng 9893dc9757 [SPARK-15610][ML] update error message for k in pca
## What changes were proposed in this pull request?
Fix the wrong bound of `k` in `PCA`
`require(k <= sources.first().size, ...`  ->  `require(k < sources.first().size`

BTW, remove unused import in `ml.ElementwiseProduct`

## How was this patch tested?

manual tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #13356 from zhengruifeng/fix_pca.
2016-05-27 21:57:41 -05:00
dding3 88c9c467a3 [SPARK-15562][ML] Delete temp directory after program exit in DataFrameExample
## What changes were proposed in this pull request?
Temp directory used to save records is not deleted after program exit in DataFrameExample. Although it called deleteOnExit, it doesn't work as the directory is not empty. Similar things happend in ContextCleanerSuite. Update the code to make sure temp directory is deleted after program exit.

## How was this patch tested?

unit tests and local build.

Author: dding3 <ding.ding@intel.com>

Closes #13328 from dding3/master.
2016-05-27 21:01:50 -05:00
wm624@hotmail.com 5d4dafe8fd [SPARK-15449][MLLIB][EXAMPLE] Wrong Data Format - Documentation Issue
## What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)
In the MLLib naivebayes example, scala and python example doesn't use libsvm data, but Java does.

I make changes in scala and python example to use the libsvm data as the same as Java example.

## How was this patch tested?

Manual tests

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #13301 from wangmiao1981/example.
2016-05-27 20:59:24 -05:00
Andrew Or 4a2fb8b87c [SPARK-15594][SQL] ALTER TABLE SERDEPROPERTIES does not respect partition spec
## What changes were proposed in this pull request?

These commands ignore the partition spec and change the storage properties of the table itself:
```
ALTER TABLE table_name PARTITION (a=1, b=2) SET SERDE 'my_serde'
ALTER TABLE table_name PARTITION (a=1, b=2) SET SERDEPROPERTIES ('key1'='val1')
```
Now they change the storage properties of the specified partition.

## How was this patch tested?

DDLSuite

Author: Andrew Or <andrew@databricks.com>

Closes #13343 from andrewor14/alter-table-serdeproperties.
2016-05-27 17:27:24 -07:00