Commit graph

11413 commits

Author SHA1 Message Date
Andrew Or 2248ad8b70 [SPARK-8498] [SQL] Add regression test for SPARK-8470
**Summary of the problem in SPARK-8470.** When using `HiveContext` to create a DataFrame from a user-defined case class, Spark throws `scala.reflect.internal.MissingRequirementError` when it tries to infer the schema using reflection. This is caused by `HiveContext` silently overwriting the context class loader that contains the user classes.
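A minimal sketch of the failure mode (class name illustrative; `sc` and `hiveContext` as provided in spark-shell):

```scala
// A user-defined case class, loaded by the original context class loader.
case class MyRecord(id: Int, name: String)

val rdd = sc.parallelize(Seq(MyRecord(1, "a")))
// Schema inference reflects on MyRecord through the thread's context class
// loader; if HiveContext has silently replaced that loader, the lookup fails
// with scala.reflect.internal.MissingRequirementError.
val df = hiveContext.createDataFrame(rdd)
```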

**What this issue is about.** This issue adds regression tests for SPARK-8470, which is already fixed in #6891. We closed SPARK-8470 as a duplicate because it is a different manifestation of the same problem as SPARK-8368. Because the reproduction is complex, we pre-package a special test jar and include it in the Spark project itself.

I tested this with and without the fix in #6891 and verified that it passes only if the fix is present.

Author: Andrew Or <andrew@databricks.com>

Closes #6909 from andrewor14/SPARK-8498 and squashes the following commits:

5e9d688 [Andrew Or] Add regression test for SPARK-8470

(cherry picked from commit 093c34838d)
Signed-off-by: Yin Huai <yhuai@databricks.com>
2015-06-19 17:34:36 -07:00
Yin Huai 2510365faa [HOT-FIX] Fix compilation (caused by 0131142d98)
Author: Yin Huai <yhuai@databricks.com>

Closes #6913 from yhuai/branch-1.4-hotfix and squashes the following commits:

7f91fa0 [Yin Huai] [HOT-FIX] Fix compilation (caused by 0131142d98).
2015-06-19 17:29:51 -07:00
Nathan Howell 0131142d98 [SPARK-8093] [SQL] Remove empty structs inferred from JSON documents
Author: Nathan Howell <nhowell@godaddy.com>

Closes #6799 from NathanHowell/spark-8093 and squashes the following commits:

76ac3e8 [Nathan Howell] [SPARK-8093] [SQL] Remove empty structs inferred from JSON documents

(cherry picked from commit 9814b971f0)
Signed-off-by: Yin Huai <yhuai@databricks.com>

Conflicts:
	sql/core/src/test/scala/org/apache/spark/sql/json/TestJsonData.scala
2015-06-19 16:23:11 -07:00
Hossein 1a6b510784 [SPARK-8452] [SPARKR] expose jobGroup API in SparkR
This pull request adds the following methods to SparkR:

```R
setJobGroup()
cancelJobGroup()
clearJobGroup()
```
For each method, the Spark context is passed as the first argument. There does not seem to be a good way to test these in R.
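These wrappers call the corresponding methods on the JVM `SparkContext`. A usage sketch of that underlying Scala API (group ID and job are illustrative; `sc` as in spark-shell):

```scala
sc.setJobGroup("etl", "nightly ETL", interruptOnCancel = true)
try {
  sc.parallelize(1 to 100).count()  // this job runs under the "etl" group
} finally {
  sc.clearJobGroup()
}
// From another thread, every job in the group can be cancelled at once:
// sc.cancelJobGroup("etl")
```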

cc shivaram and davies

Author: Hossein <hossein@databricks.com>

Closes #6889 from falaki/SPARK-8452 and squashes the following commits:

9ce9f1e [Hossein] Added basic tests to verify methods can be called and won't throw errors
c706af9 [Hossein] Added examples
a2c19af [Hossein] taking spark context as first argument
343ca77 [Hossein] Added setJobGroup, cancelJobGroup and clearJobGroup to SparkR

(cherry picked from commit 1fa29c2df2)
Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
2015-06-19 15:52:27 -07:00
Yin Huai 9ac8393663 [SPARK-8368] [SPARK-8058] [SQL] HiveContext may override the context class loader of the current thread (branch 1.4)
This is for 1.4 branch (based on https://github.com/apache/spark/pull/6891).

Author: Yin Huai <yhuai@databricks.com>

Closes #6895 from yhuai/SPARK-8368-1.4 and squashes the following commits:

adbbbc9 [Yin Huai] Minor update.
3cca0e9 [Yin Huai] Correctly set the class loader in the conf of the state in client wrapper.
b1e14a9 [Yin Huai] Failed tests.
2015-06-19 11:15:28 -07:00
Tathagata Das 4b2c793a27 [SPARK-7180] [SPARK-8090] [SPARK-8091] Fix a number of SerializationDebugger bugs and limitations
This PR solves three SerializationDebugger issues.
* SPARK-7180 - SerializationDebugger fails with ArrayOutOfBoundsException
* SPARK-8090 - SerializationDebugger does not handle classes with writeReplace correctly
* SPARK-8091 - SerializationDebugger does not handle classes with writeObject method

The solutions for each are explained as follows
* SPARK-7180 - The wrong slot desc was used for getting the value of the fields in the object being tested.
* SPARK-8090 - Test the type of the replaced object.
* SPARK-8091 - Use a dummy ObjectOutputStream to collect all the objects written by the writeObject() method, and then test those objects as usual (sketched below).
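A minimal, self-contained sketch of that dummy-stream idea (class name and exact mechanism assumed; the real implementation may differ):

```scala
import java.io.{ObjectOutputStream, OutputStream}
import scala.collection.mutable.ArrayBuffer

// An ObjectOutputStream that records every object passing through
// serialization while discarding the actual bytes.
class CollectingObjectOutputStream extends ObjectOutputStream(new OutputStream {
  override def write(b: Int): Unit = ()  // drop the bytes
}) {
  val collected = new ArrayBuffer[Any]
  enableReplaceObject(true)
  override protected def replaceObject(obj: AnyRef): AnyRef = {
    collected += obj  // remember the object so it can be tested as usual
    obj               // pass it through unchanged
  }
}
```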

I also added more tests to the test suite to increase code coverage. For example, I added tests for cases where there are no serializability issues.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #6625 from tdas/SPARK-7180 and squashes the following commits:

c7cb046 [Tathagata Das] Addressed comments on docs
ae212c8 [Tathagata Das] Improved docs
304c97b [Tathagata Das] Fixed build error
26b5179 [Tathagata Das] more tests.....92% line coverage
7e2fdcf [Tathagata Das] Added more tests
d1967fb [Tathagata Das] Added comments.
da75d34 [Tathagata Das] Removed unnecessary lines.
50a608d [Tathagata Das] Fixed bugs and added support for writeObject
2015-06-19 11:06:32 -07:00
Sean Owen 3415fb978b [SPARK-5836] [DOCS] [STREAMING] Clarify what may cause long-running Spark apps to preserve shuffle files
Clarify what may cause long-running Spark apps to preserve shuffle files

Author: Sean Owen <sowen@cloudera.com>

Closes #6901 from srowen/SPARK-5836 and squashes the following commits:

a9faef0 [Sean Owen] Clarify what may cause long-running Spark apps to preserve shuffle files

(cherry picked from commit 4be53d0395)
Signed-off-by: Andrew Or <andrew@databricks.com>
2015-06-19 11:03:12 -07:00
Andrew Or aedd893b42 [SPARK-8451] [SPARK-7287] SparkSubmitSuite should check exit code
This patch also re-enables the tests. Now that we have access to the log4j logs, it should be easier to debug the flakiness.

yhuai brkyvz

Author: Andrew Or <andrew@databricks.com>

Closes #6886 from andrewor14/spark-submit-suite-fix and squashes the following commits:

3f99ff1 [Andrew Or] Move destroy to finally block
9a62188 [Andrew Or] Re-enable ignored tests
2382672 [Andrew Or] Check for exit code

(cherry picked from commit 68a2dca292)
Signed-off-by: Andrew Or <andrew@databricks.com>
2015-06-19 10:56:36 -07:00
Lianhui Wang 6f2e411084 [SPARK-8430] ExternalShuffleBlockResolver of shuffle service should support UnsafeShuffleManager
andrewor14 can you take a look? Thanks

Author: Lianhui Wang <lianhuiwang09@gmail.com>

Closes #6873 from lianhuiwang/SPARK-8430 and squashes the following commits:

51c47ca [Lianhui Wang] update andrewor's comments
2b27b19 [Lianhui Wang] support UnsafeShuffleManager

(cherry picked from commit 9baf093014)
Signed-off-by: Andrew Or <andrew@databricks.com>
2015-06-19 10:47:15 -07:00
Xiangrui Meng 1f2dafb77f [SPARK-8151] [MLLIB] pipeline components should correctly implement copy
Otherwise, extra params get ignored in `PipelineModel.transform`. jkbradley
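A sketch of the resulting pattern for a custom component (the class is hypothetical; `defaultCopy` is the `Params` helper this patch settles on):

```scala
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.StructType

// A do-nothing Transformer whose copy() carries extra params forward correctly.
class IdentityTransformer(override val uid: String) extends Transformer {
  override def transform(dataset: DataFrame): DataFrame = dataset
  override def transformSchema(schema: StructType): StructType = schema
  override def copy(extra: ParamMap): IdentityTransformer = defaultCopy(extra)
}
```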

Author: Xiangrui Meng <meng@databricks.com>

Closes #6622 from mengxr/SPARK-8087 and squashes the following commits:

0e4c8c4 [Xiangrui Meng] fix merge issues
26fc1f0 [Xiangrui Meng] address comments
e607a04 [Xiangrui Meng] merge master
b85b57e [Xiangrui Meng] fix examples/compile
d6f7891 [Xiangrui Meng] rename defaultCopyWithParams to defaultCopy
84ec278 [Xiangrui Meng] remove setter checks due to generics
2cf2ed0 [Xiangrui Meng] snapshot
291814f [Xiangrui Meng] OneVsRest.copy
1dfe3bd [Xiangrui Meng] PipelineModel.copy should copy stages

(cherry picked from commit 43c7ec6384)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
2015-06-19 10:05:07 -07:00
Kevin Conor 164b9d32e7 [SPARK-8339] [PYSPARK] integer division for python 3
`itertools.islice` requires an integer for the stop argument. Switching to integer division here prevents a `ValueError` when `vs` is evaluated above.

davies

This is my original work, and I license it to the project.

Author: Kevin Conor <kevin@discoverybayconsulting.com>

Closes #6794 from kconor/kconor-patch-1 and squashes the following commits:

da5e700 [Kevin Conor] Integer division for batch size

(cherry picked from commit fdf63f1249)
Signed-off-by: Davies Liu <davies@databricks.com>
2015-06-19 00:12:43 -07:00
Cheng Lian f48f3a2e2f [SPARK-8458] [SQL] Don't strip scheme part of output path when writing ORC files
`Path.toUri.getPath` strips the scheme part of the output path (from `file:///foo` to `/foo`), which causes the ORC data source to write only to the file system configured in the Hadoop configuration. We should use `Path.toString` instead.
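A quick illustration of the difference (a sketch using Hadoop's `Path` directly):

```scala
import org.apache.hadoop.fs.Path

val out = new Path("file:///foo")
out.toUri.getPath  // "/foo"        -- scheme dropped, later resolved against fs.defaultFS
out.toString       // "file:///foo" -- scheme preserved
```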

Author: Cheng Lian <lian@databricks.com>

Closes #6892 from liancheng/spark-8458 and squashes the following commits:

87f8199 [Cheng Lian] Don't strip scheme of output path when writing ORC files

(cherry picked from commit a71cbbdea5)
Signed-off-by: Cheng Lian <lian@databricks.com>
2015-06-18 22:02:13 -07:00
Dibyendu Bhattacharya b55e4b9a52 [SPARK-8080] [STREAMING] Receiver.store with Iterator does not give correct count at Spark UI
tdas zsxwing this is the new PR for SPARK-8080

I have merged https://github.com/apache/spark/pull/6659

Also note that for MEMORY_ONLY settings, when a block cannot be safely unrolled into memory because there is not enough space, BlockManager won't try to put the block, and ReceivedBlockHandler will throw a SparkException because it cannot find the block ID in PutResult. Thus the number of records in a block won't be counted if the block fails to unroll in memory, which is fine.

For MEMORY_AND_DISK settings, if BlockManager cannot unroll the block into memory, the block will still be serialized to disk, and the same holds for the WAL-based store. So in those cases (storage level = memory + disk) the number of records will be counted even though the block could not be unrolled into memory.

Thus I added isFullyConsumed to the CountingIterator but have not used it, since it can never happen that a block is not fully consumed while ReceivedBlockHandler still gets the block ID.

I have also added a few test cases to cover those block-unrolling scenarios.

Author: Dibyendu Bhattacharya <dibyendu.bhattacharya1@pearson.com>
Author: U-PEROOT\UBHATD1 <UBHATD1@PIN-L-PI046.PEROOT.com>

Closes #6707 from dibbhatt/master and squashes the following commits:

f6cb6b5 [Dibyendu Bhattacharya] [SPARK-8080][STREAMING] Receiver.store with Iterator does not give correct count at Spark UI
f37cfd8 [Dibyendu Bhattacharya] [SPARK-8080][STREAMING] Receiver.store with Iterator does not give correct count at Spark UI
5a8344a [Dibyendu Bhattacharya] [SPARK-8080][STREAMING] Receiver.store with Iterator does not give correct count at Spark UI Count ByteBufferBlock as 1 count
fceac72 [Dibyendu Bhattacharya] [SPARK-8080][STREAMING] Receiver.store with Iterator does not give correct count at Spark UI
0153e7e [Dibyendu Bhattacharya] [SPARK-8080][STREAMING] Receiver.store with Iterator does not give correct count at Spark UI Fixed comments given by @zsxwing
4c5931d [Dibyendu Bhattacharya] [SPARK-8080][STREAMING] Receiver.store with Iterator does not give correct count at Spark UI
01e6dc8 [U-PEROOT\UBHATD1] A
2015-06-18 20:04:29 -07:00
Lars Francke bd9bbd6119 [SPARK-8462] [DOCS] Documentation fixes for Spark SQL
This fixes various minor documentation issues on the Spark SQL page

Author: Lars Francke <lars.francke@gmail.com>

Closes #6890 from lfrancke/SPARK-8462 and squashes the following commits:

dd7e302 [Lars Francke] Merge branch 'master' into SPARK-8462
34eff2c [Lars Francke] Minor documentation fixes

(cherry picked from commit 4ce3bab89f)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
2015-06-18 19:40:55 -07:00
Josh Rosen 152f4465d3 [SPARK-8446] [SQL] Add helper functions for testing SparkPlan physical operators
This patch introduces `SparkPlanTest`, a base class for unit tests of SparkPlan physical operators.  This is analogous to Spark SQL's existing `QueryTest`, which does something similar for end-to-end tests with actual queries.

These helper methods provide nicer error output when tests fail and help developers to avoid writing lots of boilerplate in order to execute manually constructed physical plans.
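A hypothetical usage sketch (method names and signatures follow the description above, not necessarily the exact API; assumes the test SQLContext implicits are in scope for `toDF`):

```scala
import org.apache.spark.sql.execution.{Project, SparkPlan, SparkPlanTest}

class MyOperatorSuite extends SparkPlanTest {
  test("projection keeps all input rows") {
    val input = Seq((1, "a"), (2, "b")).toDF("key", "value")
    checkAnswer(
      input,                                              // input DataFrame
      (child: SparkPlan) => Project(child.output, child), // plan under test
      input.collect().toSeq)                              // expected rows
  }
}
```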

Author: Josh Rosen <joshrosen@databricks.com>
Author: Josh Rosen <rosenville@gmail.com>
Author: Michael Armbrust <michael@databricks.com>

Closes #6885 from JoshRosen/spark-plan-test and squashes the following commits:

f8ce275 [Josh Rosen] Fix some IntelliJ inspections and delete some dead code
84214be [Josh Rosen] Add an extra column which isn't part of the sort
ae1896b [Josh Rosen] Provide implicits automatically
a80f9b0 [Josh Rosen] Merge pull request #4 from marmbrus/pr/6885
d9ab1e4 [Michael Armbrust] Add simple resolver
c60a44d [Josh Rosen] Manually bind references
996332a [Josh Rosen] Add types so that tests compile
a46144a [Josh Rosen] WIP

(cherry picked from commit 207a98ca59)
Signed-off-by: Michael Armbrust <michael@databricks.com>
2015-06-18 16:45:27 -07:00
zsxwing 9f293a9eb6 [SPARK-8376] [DOCS] Add common lang3 to the Spark Flume Sink doc
Commons Lang 3 has been added as one of the dependencies of Spark Flume Sink since #5703. This PR updates the doc for it.

Author: zsxwing <zsxwing@gmail.com>

Closes #6829 from zsxwing/flume-sink-dep and squashes the following commits:

f8617f0 [zsxwing] Add common lang3 to the Spark Flume Sink doc

(cherry picked from commit 24e53793b4)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
2015-06-18 16:00:42 -07:00
Josh Rosen c1da5cf029 [SPARK-8353] [DOCS] Show anchor links when hovering over documentation headers
This patch uses [AnchorJS](https://bryanbraun.github.io/anchorjs/) to show deep anchor links when hovering over headers in the Spark documentation. For example:

![image](https://cloud.githubusercontent.com/assets/50748/8240800/1502f85c-15ba-11e5-819a-97b231370a39.png)

This makes it easier for users to link to specific sections of the documentation.

I also removed some dead Javascript which isn't used in our current docs (it was introduced for the old AMPCamp training, but isn't used anymore).

Author: Josh Rosen <joshrosen@databricks.com>

Closes #6808 from JoshRosen/SPARK-8353 and squashes the following commits:

e59d8a7 [Josh Rosen] Suppress underline on hover
f518b6a [Josh Rosen] Turn on for all headers, since we use H1s in a bunch of places
a9fec01 [Josh Rosen] Add anchor links when hovering over headers; remove some dead JS code

(cherry picked from commit 44c931f006)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
2015-06-18 15:10:33 -07:00
Davies Liu ca23c3b014 [SPARK-8202] [PYSPARK] fix infinite loop during external sort in PySpark
The batch size during external sort will grow up to a maximum of 10000, then shrink down to zero, causing an infinite loop. Given the assumption that the items usually have similar sizes, we don't need to adjust the batch size after the first spill.

cc JoshRosen rxin angelini

Author: Davies Liu <davies@databricks.com>

Closes #6714 from davies/batch_size and squashes the following commits:

b170dfb [Davies Liu] update test
b9be832 [Davies Liu] Merge branch 'batch_size' of github.com:davies/spark into batch_size
6ade745 [Davies Liu] update test
5c21777 [Davies Liu] Update shuffle.py
e746aec [Davies Liu] fix batch size during sort
2015-06-18 13:49:32 -07:00
Burak Yavuz 9dabc12936 [SPARK-8095] Resolve dependencies of --packages in local ivy cache
Dependencies of artifacts in the local ivy cache were not being resolved properly; they were not being picked up. Now they should be.

cc andrewor14

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #6788 from brkyvz/local-ivy-fix and squashes the following commits:

2875bf4 [Burak Yavuz] fix temp dir bug
48cc648 [Burak Yavuz] improve deletion
a69e3e6 [Burak Yavuz] delete cache before test as well
0037197 [Burak Yavuz] fix merge conflicts
f60772c [Burak Yavuz] use different folder for m2 cache during testing
b6ef038 [Burak Yavuz] [SPARK-8095] Resolve dependencies of Spark Packages in local ivy cache

Conflicts:
	core/src/test/scala/org/apache/spark/deploy/SparkSubmitUtilsSuite.scala
2015-06-17 22:36:28 -07:00
xutingjun 67ad12d793 [SPARK-8392] RDDOperationGraph: getting cached nodes is slow
```scala
def getAllNodes: Seq[RDDOperationNode] = {
  _childNodes ++ _childClusters.flatMap(_.childNodes)
}
```

When `_childClusters` has many nodes, the process hangs. I think we can improve the efficiency here.

Author: xutingjun <xutingjun@huawei.com>

Closes #6839 from XuTingjun/DAGImprove and squashes the following commits:

53b03ea [xutingjun] change code to more concise and easier to read
f98728b [xutingjun] fix words: node -> nodes
f87c663 [xutingjun] put the filter inside
81f9fd2 [xutingjun] put the filter inside

(cherry picked from commit e2cdb0568b)
Signed-off-by: Andrew Or <andrew@databricks.com>
2015-06-17 22:31:39 -07:00
Yin Huai 73cf5def06 [SPARK-8306] [SQL] AddJar command needs to set the new class loader to the HiveConf inside executionHive.state.
https://issues.apache.org/jira/browse/SPARK-8306

I will try to add a test later.

marmbrus aarondav
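A sketch of the intent (paths and names illustrative; in the patch itself the loader is applied to the HiveConf held by executionHive.state):

```scala
import java.io.File
import java.net.URLClassLoader
import org.apache.hadoop.hive.conf.HiveConf

// After ADD JAR, both the thread and Hive's conf must see the new loader.
val jarUrl = new File("/path/to/udf.jar").toURI.toURL  // illustrative path
val loader = new URLClassLoader(Array(jarUrl), Thread.currentThread.getContextClassLoader)
Thread.currentThread.setContextClassLoader(loader)

val hiveConf = new HiveConf()  // stands in for executionHive.state's conf
hiveConf.setClassLoader(loader)
```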

Author: Yin Huai <yhuai@databricks.com>

Closes #6758 from yhuai/SPARK-8306 and squashes the following commits:

1292346 [Yin Huai] [SPARK-8306] AddJar command needs to set the new class loader to the HiveConf inside executionHive.state.

(cherry picked from commit 302556ff99)
Signed-off-by: Michael Armbrust <michael@databricks.com>

Conflicts:
	sql/hive/src/main/scala/org/apache/spark/sql/hive/client/ClientWrapper.scala
2015-06-17 15:14:42 -07:00
zsxwing 5aedfa2ceb [SPARK-8404] [STREAMING] [TESTS] Use thread-safe collections to make the tests more reliable
KafkaStreamSuite, DirectKafkaStreamSuite, JavaKafkaStreamSuite and JavaDirectKafkaStreamSuite use non-thread-safe collections to collect data in one thread and check it in another thread, which may cause the tests to fail.

This PR changes them to thread-safe collections.

Note: I cannot reproduce the test failures in my environment. But at least, this PR should make the tests more reliable.
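A sketch of the kind of change involved (the exact collection chosen is assumed):

```scala
import java.util.concurrent.ConcurrentLinkedQueue

// Before: a plain mutable.ArrayBuffer shared across threads without locking.
// After: a thread-safe queue that both threads may touch concurrently.
val result = new ConcurrentLinkedQueue[String]()
result.add("event")               // collecting thread
assert(result.contains("event"))  // asserting thread
```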

Author: zsxwing <zsxwing@gmail.com>

Closes #6852 from zsxwing/fix-KafkaStreamSuite and squashes the following commits:

d464211 [zsxwing] Use thread-safe collections to make the tests more reliable

(cherry picked from commit a06d9c8e76)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
2015-06-17 15:00:17 -07:00
zsxwing 5e7973df0e [SPARK-8373] [PYSPARK] Add emptyRDD to pyspark and fix the issue when calling sum on an empty RDD
This PR fixes the sum issue and also adds `emptyRDD` so that it's easy to create a test case.

Author: zsxwing <zsxwing@gmail.com>

Closes #6826 from zsxwing/python-emptyRDD and squashes the following commits:

b36993f [zsxwing] Update the return type to JavaRDD[T]
71df047 [zsxwing] Add emptyRDD to pyspark and fix the issue when calling sum on an empty RDD

(cherry picked from commit 0fc4b96f3e)
Signed-off-by: Andrew Or <andrew@databricks.com>
2015-06-17 13:59:47 -07:00
Carson Wang f0513733d4 [SPARK-8372] History server shows incorrect information for application not started
The history server may show an incorrect app ID for an incomplete application, such as `<App ID>.inprogress`. This app info never disappears, even after the app is completed.
![incorrectappinfo](https://cloud.githubusercontent.com/assets/9278199/8156147/2a10fdbe-137d-11e5-9620-c5b61d93e3c1.png)

The cause of the issue is that a log path name is used as the app ID when the app ID cannot be obtained during replay.

Author: Carson Wang <carson.wang@intel.com>

Closes #6827 from carsonwang/SPARK-8372 and squashes the following commits:

cdbb089 [Carson Wang] Fix code style
3e46b35 [Carson Wang] Update code style
90f5dde [Carson Wang] Add a unit test
d8c9cd0 [Carson Wang] Replaying events only return information when app is started

(cherry picked from commit 2837e06709)
Signed-off-by: Andrew Or <andrew@databricks.com>
2015-06-17 13:42:46 -07:00
Mingfei d75c53d88d [SPARK-8161] Set externalBlockStoreInitialized to be true, after ExternalBlockStore is initialized
externalBlockStoreInitialized is never set to true, which means blocks stored in ExternalBlockStore cannot be removed.

Author: Mingfei <mingfei.shi@intel.com>

Closes #6702 from shimingfei/SetTrue and squashes the following commits:

add61d8 [Mingfei] Set externalBlockStoreInitialized to be true, after ExternalBlockStore is initialized

(cherry picked from commit 7ad8c5d869)
Signed-off-by: Andrew Or <andrew@databricks.com>
2015-06-17 13:40:16 -07:00
Punya Biswal a7f6979d0f [SPARK-7515] [DOC] Update documentation for PySpark on YARN with cluster mode
Now that PySpark on YARN with cluster mode is supported, let's update the doc.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #6040 from sarutak/update-doc-for-pyspark-on-yarn and squashes the following commits:

ad9f88c [Kousuke Saruta] Brushed up sentences
469fd2e [Kousuke Saruta] Merge branch 'master' of https://github.com/apache/spark into update-doc-for-pyspark-on-yarn
fcfdb92 [Kousuke Saruta] Updated doc for PySpark on YARN with cluster mode

Author: Punya Biswal <pbiswal@palantir.com>
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #6842 from punya/feature/SPARK-7515 and squashes the following commits:

0b83648 [Punya Biswal] Merge remote-tracking branch 'origin/branch-1.4' into feature/SPARK-7515
de025cd [Kousuke Saruta] [SPARK-7515] [DOC] Update documentation for PySpark on YARN with cluster mode
2015-06-17 13:37:20 -07:00
Sean Owen 320c4420b9 [SPARK-8395] [DOCS] start-slave.sh docs incorrect
start-slave.sh no longer takes a worker # param in 1.4+

Author: Sean Owen <sowen@cloudera.com>

Closes #6855 from srowen/SPARK-8395 and squashes the following commits:

300278e [Sean Owen] start-slave.sh no longer takes a worker # param in 1.4+

(cherry picked from commit f005be0273)
Signed-off-by: Andrew Or <andrew@databricks.com>
2015-06-17 13:31:17 -07:00
Vyacheslav Baranov a5f602efcf [SPARK-8309] [CORE] Support for more than 12M items in OpenHashMap
The problem occurs because the position mask `0xEFFFFFF` is incorrect: its 25th bit is zero, so when the capacity grows beyond 2^24, `OpenHashMap` calculates an incorrect index of the value in the `_values` array.

I've also added a size check in `rehash()`, so that it fails instead of reporting invalid item indices.
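A quick demonstration of the dropped bit (self-contained sketch):

```scala
val badMask  = 0xEFFFFFF  // bit 24 (the 25th bit) is zero
val goodMask = 0xFFFFFFF  // all of the low 28 bits set
val pos = 1 << 24         // 16777216, reached once capacity exceeds 2^24

println(pos & badMask)    // 0        -- the position is silently mangled
println(pos & goodMask)   // 16777216 -- preserved
```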

Author: Vyacheslav Baranov <slavik.baranov@gmail.com>

Closes #6763 from SlavikBaranov/SPARK-8309 and squashes the following commits:

8557445 [Vyacheslav Baranov] Resolved review comments
4d5b954 [Vyacheslav Baranov] Resolved review comments
eaf1e68 [Vyacheslav Baranov] Fixed failing test
f9284fd [Vyacheslav Baranov] Resolved review comments
3920656 [Vyacheslav Baranov] SPARK-8309: Support for more than 12M items in OpenHashMap

(cherry picked from commit c13da20a55)
Signed-off-by: Sean Owen <sowen@cloudera.com>
2015-06-17 09:42:41 +01:00
Punya Biswal 877deb0468 Fix break introduced by backport
rxin this is the fix you requested for the break introduced by backporting #6793

Author: Punya Biswal <pbiswal@palantir.com>

Closes #6850 from punya/feature/fix-backport-break and squashes the following commits:

fdc3693 [Punya Biswal] Fix break introduced by backport
2015-06-16 22:31:49 -07:00
Yanbo Liang 15d973f2d9 [SPARK-7916] [MLLIB] MLlib Python doc parity check for classification and regression
Check and update the MLlib Python classification and regression docs to be as complete as the Scala docs.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #6460 from yanboliang/spark-7916 and squashes the following commits:

f8deda4 [Yanbo Liang] trigger jenkins
6dc4d99 [Yanbo Liang] address comments
ce2a43e [Yanbo Liang] truncate too long line and remove extra sparse
3eaf6ad [Yanbo Liang] MLlib Python doc parity check for classification and regression

(cherry picked from commit ca998757e8)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
2015-06-16 14:30:42 -07:00
Marcelo Vanzin b9e5d3cadd [SPARK-8126] [BUILD] Make sure temp dir exists when running tests.
If you ran "clean" at the top-level sbt project, the temp dir would
go away, so running "test" without restarting sbt would fail. This
fixes that by making sure the temp dir exists before running tests.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #6805 from vanzin/SPARK-8126-fix and squashes the following commits:

12d7768 [Marcelo Vanzin] [SPARK-8126] [build] Make sure temp dir exists when running tests.

(cherry picked from commit cebf241184)
Signed-off-by: Sean Owen <sowen@cloudera.com>
2015-06-16 21:10:25 +01:00
Radek Ostrowski 4da0686508 [SQL] [DOC] improved a comment
[SQL][DOC] I found it a bit confusing when I came across it for the first time in the docs

Author: Radek Ostrowski <dest.hawaii@gmail.com>
Author: radek <radek@radeks-MacBook-Pro-2.local>

Closes #6332 from radek1st/master and squashes the following commits:

dae3347 [Radek Ostrowski] fixed typo
c76bb3a [radek] improved a comment

(cherry picked from commit 4bd10fd509)
Signed-off-by: Sean Owen <sowen@cloudera.com>
2015-06-16 21:04:45 +01:00
Moussa Taifi 1378bdc4a9 [SPARK-DOCS] [SPARK-SQL] Update sql-programming-guide.md
Typo in thriftserver section

Author: Moussa Taifi <moutai10@gmail.com>

Closes #6847 from moutai/patch-1 and squashes the following commits:

1bd29df [Moussa Taifi] Update sql-programming-guide.md

(cherry picked from commit dc455b8833)
Signed-off-by: Sean Owen <sowen@cloudera.com>
2015-06-16 21:02:14 +01:00
huangzhaowei f287f7ea14 [SPARK-8367] [STREAMING] Add a limit for 'spark.streaming.blockInterval` since a data loss bug.
The bug was reported in the JIRA issue [SPARK-8367](https://issues.apache.org/jira/browse/SPARK-8367).
The resolution is to limit the configuration `spark.streaming.blockInterval` to a positive number.
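A sketch of the guard (the config key is from this patch; the surrounding code is assumed):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
// Reject a non-positive block interval up front instead of losing data later.
val blockIntervalMs = conf.getTimeAsMs("spark.streaming.blockInterval", "200ms")
require(blockIntervalMs > 0, "'spark.streaming.blockInterval' should be a positive value")
```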

Author: huangzhaowei <carlmartinmax@gmail.com>
Author: huangzhaowei <SaintBacchus@users.noreply.github.com>

Closes #6818 from SaintBacchus/SPARK-8367 and squashes the following commits:

c9d1927 [huangzhaowei] Update BlockGenerator.scala
bd3f71a [huangzhaowei] Use requre instead of if
3d17796 [huangzhaowei] [SPARK_8367][Streaming]Add a limit for 'spark.streaming.blockInterval' since a data loss bug.

(cherry picked from commit ccf010f27b)
Signed-off-by: Sean Owen <sowen@cloudera.com>
2015-06-16 08:16:18 +02:00
tedyu fff8d7ee6c SPARK-8336 Fix NullPointerException with functions.rand()
This PR fixes the problem reported by Justin Yip in the thread 'NullPointerException with functions.rand()'

Tested using spark-shell and verified that the following works:
sqlContext.createDataFrame(Seq((1,2), (3, 100))).withColumn("index", rand(30)).show()

Author: tedyu <yuzhihong@gmail.com>

Closes #6793 from tedyu/master and squashes the following commits:

62fd97b [tedyu] Create RandomSuite
750f92c [tedyu] Add test for Rand() with seed
a1d66c5 [tedyu] Fix NullPointerException with functions.rand()

(cherry picked from commit 1a62d61696)
Signed-off-by: Reynold Xin <rxin@databricks.com>
2015-06-15 17:00:43 -07:00
Peter Hoffmann 0ffbf08519 fix read/write mixup
Author: Peter Hoffmann <ph@peter-hoffmann.com>

Closes #6815 from hoffmann/patch-1 and squashes the following commits:

2abb6da [Peter Hoffmann] fix read/write mixup

(cherry picked from commit f3f2a4397d)
Signed-off-by: Reynold Xin <rxin@databricks.com>
2015-06-14 11:41:50 -07:00
Michael Armbrust 2805d145e3 [SPARK-8358] [SQL] Wait for child resolution when resolving generators
Author: Michael Armbrust <michael@databricks.com>

Closes #6811 from marmbrus/aliasExplodeStar and squashes the following commits:

fbd2065 [Michael Armbrust] more style
806a373 [Michael Armbrust] fix style
7cbb530 [Michael Armbrust] [SPARK-8358][SQL] Wait for child resolution when resolving generatorsa

(cherry picked from commit 9073a426e4)
Signed-off-by: Michael Armbrust <michael@databricks.com>
2015-06-14 11:21:55 -07:00
Josh Rosen 4634be5a7d [SPARK-8354] [SQL] Fix off-by-factor-of-8 error when allocating scratch space in UnsafeFixedWidthAggregationMap
UnsafeFixedWidthAggregationMap contains an off-by-factor-of-8 error when allocating row conversion scratch space: we take a size requirement, measured in bytes, then allocate a long array of that size.  This means that we end up allocating 8x too much conversion space.

This patch fixes this by allocating a `byte[]` array instead.  This doesn't impose any new limitations on the maximum sizes of UnsafeRows, since UnsafeRowConverter already used integers when calculating the size requirements for rows.
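The arithmetic of the fix in miniature (a sketch; variable names assumed):

```scala
val sizeRequirementInBytes = 1024
// Before: a long[] of that length occupies 8 * 1024 = 8192 bytes -- 8x too much.
val before = new Array[Long](sizeRequirementInBytes)
// After: a byte[] of that length occupies exactly the requested 1024 bytes.
val after = new Array[Byte](sizeRequirementInBytes)
```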

Author: Josh Rosen <joshrosen@databricks.com>

Closes #6809 from JoshRosen/sql-bytes-vs-words-fix and squashes the following commits:

6520339 [Josh Rosen] Updates to reflect fact that UnsafeRow max size is constrained by max byte[] size

(cherry picked from commit ea7fd2ff64)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
2015-06-14 09:41:01 -07:00
Mike Dusenberry 187a3d5385 [Spark-8343] [Streaming] [Docs] Improve Spark Streaming Guides.
This improves the Spark Streaming Guides by fixing broken links, rewording confusing sections, fixing typos, adding missing words, etc.

Author: Mike Dusenberry <dusenberrymw@gmail.com>

Closes #6801 from dusenberrymw/SPARK-8343_Improve_Spark_Streaming_Guides_MERGED and squashes the following commits:

6688090 [Mike Dusenberry] Improvements to the Spark Streaming Custom Receiver Guide, including slight rewording of confusing sections, and fixing typos & missing words.
436fbd8 [Mike Dusenberry] Bunch of improvements to the Spark Streaming Guide, including fixing broken links, slight rewording of confusing sections, fixing typos & missing words, etc.

(cherry picked from commit 35d1267cf8)
Signed-off-by: Reynold Xin <rxin@databricks.com>
2015-06-13 21:22:53 -07:00
Michael Armbrust 1ca431e83f [SPARK-8329][SQL] Allow _ in DataSource options
Author: Michael Armbrust <michael@databricks.com>

Closes #6786 from marmbrus/optionsParser and squashes the following commits:

e7d18ef [Michael Armbrust] add dots
99a3452 [Michael Armbrust] [SPARK-8329][SQL] Allow _ in DataSource options

(cherry picked from commit 4aed66f299)
Signed-off-by: Reynold Xin <rxin@databricks.com>
2015-06-12 23:11:25 -07:00
Tathagata Das 7c11ccf391 [SPARK-7284] [STREAMING] Updated streaming documentation
- Kinesis API updated
- Kafka version updated, and Python API for Direct Kafka added
- Added SQLContext.getOrCreate()
- Added information on how to get partitionId in foreachRDD

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #6781 from tdas/SPARK-7284 and squashes the following commits:

aac7be0 [Tathagata Das] Added information on how to get partition id
a66ec22 [Tathagata Das] Complete the line incomplete line,
a92ca39 [Tathagata Das] Updated streaming documentation

(cherry picked from commit e9471d3414)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
2015-06-12 15:23:09 -07:00
Andrew Or 7608373419 [SPARK-8330] DAG visualization: trim whitespace from input
Safeguard against DOM rewriting.

Author: Andrew Or <andrew@databricks.com>

Closes #6787 from andrewor14/dag-viz-trim and squashes the following commits:

0fb4afe [Andrew Or] Trim input metadata from DOM

(cherry picked from commit 8860405151)
Signed-off-by: Andrew Or <andrew@databricks.com>
2015-06-12 11:15:09 -07:00
Mark Smith 141eab71ee [SPARK-8322] [EC2] Added spark 1.4.0 into the VALID_SPARK_VERSIONS and…
… SPARK_TACHYON_MAP

Author: Mark Smith <mark.smith@bronto.com>

Closes #6777 from markmsmith/branch-1.4 and squashes the following commits:

a218cfa [Mark Smith] [SPARK-8322][EC2] Fixed tachyon mapp entry to point to 0.6.4
90d1655 [Mark Smith] [SPARK-8322][EC2] Added spark 1.4.0 into the VALID_SPARK_VERSIONS and SPARK_TACHYON_MAP
2015-06-12 10:28:30 -07:00
Marcelo Vanzin 8b25f62bf1 [SPARK-6511] [docs] Fix example command in hadoop-provided docs.
Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #6766 from vanzin/SPARK-6511 and squashes the following commits:

49f0f67 [Marcelo Vanzin] [SPARK-6511] [docs] Fix example command in hadoop-provided docs.

(cherry picked from commit 9cbdf31ec1)
Signed-off-by: Reynold Xin <rxin@databricks.com>
2015-06-11 15:29:09 -07:00
Shivaram Venkataraman 3a62569afb [SPARK-8310] [EC2] Update spark-ec2 branch to 1.4
cc pwendell  -- We should probably update our release guidelines to change this when we cut a release branch ?

Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

Closes #6765 from shivaram/SPARK-8310-14 and squashes the following commits:

066e44e [Shivaram Venkataraman] Update spark-ec2 branch to 1.4
2015-06-11 13:22:08 -07:00
Adam Roberts b313920abd [SPARK-8289] Specify stack size for consistency with Java tests - resolves test failures
This change is a simple one: it specifies a stack size of 4096k instead of the vendor default for Java tests (the defaults vary between Java vendors). This remedies test failures observed in JavaALSSuite with IBM and Oracle Java, owing to their lower default stack sizes compared to OpenJDK. 4096k is a suitable default with which the tests pass on each Java vendor tested. The alternative is to reduce the number of iterations in the test (no failures were observed with 5 iterations instead of 15).

-Xss works with Oracle's HotSpot VM, IBM's J9 VM and OpenJDK (IcedTea).

I have ensured this does not have any negative implications for other tests.

Author: Adam Roberts <aroberts@uk.ibm.com>
Author: a-roberts <aroberts@uk.ibm.com>

Closes #6727 from a-roberts/IncJavaStackSize and squashes the following commits:

ab40aea [Adam Roberts] Specify stack size for SBT builds
5032d8d [a-roberts] Update pom.xml

(cherry picked from commit 6b68366df3)
Signed-off-by: Sean Owen <sowen@cloudera.com>
2015-06-11 08:41:00 +01:00
navis.ryu 5c05b5c0d2 [SPARK-8285] [SQL] CombineSum should be calculated as unlimited decimal first
    case cs @ CombineSum(expr) =>
      val calcType = expr.dataType
        expr.dataType match {
          case DecimalType.Fixed(_, _) =>
            DecimalType.Unlimited
          case _ =>
            expr.dataType
        }

Because of the stray line break, the match is a separate, discarded expression, so calcType is always expr.dataType. All credit belongs to IntelliJ.

Author: navis.ryu <navis@apache.org>

Closes #6736 from navis/SPARK-8285 and squashes the following commits:

20382c1 [navis.ryu] [SPARK-8285] [SQL] CombineSum should be calculated as unlimited decimal first

(cherry picked from commit 6a47114bc2)
Signed-off-by: Reynold Xin <rxin@databricks.com>
2015-06-10 18:19:24 -07:00
Paavo 59fc3f1972 [SPARK-8200] [MLLIB] Check for empty RDDs in StreamingLinearAlgorithm
Test cases for both StreamingLinearRegression and StreamingLogisticRegression, and code fix.

Edit:
This contribution is my original work and I license the work to the project under the project's open source license.

Author: Paavo <pparkkin@gmail.com>

Closes #6713 from pparkkin/streamingmodel-empty-rdd and squashes the following commits:

ff5cd78 [Paavo] Update strings to use interpolation.
db234cf [Paavo] Use !rdd.isEmpty.
54ad89e [Paavo] Test case for empty stream.
393e36f [Paavo] Ignore empty RDDs.
0bfc365 [Paavo] Test case for empty stream.

(cherry picked from commit b928f54384)
Signed-off-by: Sean Owen <sowen@cloudera.com>
2015-06-10 23:26:54 +01:00
WangTaoTheTonic 2846a357f3 [SPARK-8273] Driver hangs up when yarn shutdown in client mode
In client mode, if YARN is shut down while a Spark application is running, the application hangs after several retries (default: 30) because the exception thrown by YarnClientImpl cannot be caught at the upper level. We should exit in that case so that the user is aware of it.

The exception we want to catch is [here](https://github.com/apache/hadoop/blob/branch-2.7.0/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/retry/RetryInvocationHandler.java#L122), and I tried to fix it referring to [MR](https://github.com/apache/hadoop/blob/branch-2.7.0/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/main/java/org/apache/hadoop/mapred/ClientServiceDelegate.java#L320).

Author: WangTaoTheTonic <wangtao111@huawei.com>

Closes #6717 from WangTaoTheTonic/SPARK-8273 and squashes the following commits:

28752d6 [WangTaoTheTonic] catch the throwed exception
2015-06-10 13:36:16 -07:00
Adam Roberts 568d1d51d6 [SPARK-7756] CORE RDDOperationScope fix for IBM Java
IBM Java has an extra method when we call getStackTrace(): "getStackTraceImpl", a native method. This causes two tests within "DStreamScopeSuite" to fail when running with IBM Java: instead of "map" or "filter" being the method names found, "getStackTrace" is returned. This commit addresses the issue by using dropWhile. Given that our current method is withScope, we look for the next method that isn't ours; we don't care about methods that come before us in the stack trace (e.g. getStackTrace), regardless of how many levels deep that might go.

IBM:
java.lang.Thread.getStackTraceImpl(Native Method)
java.lang.Thread.getStackTrace(Thread.java:1117)
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:104)

Oracle:
PRINTING STACKTRACE!!!
java.lang.Thread.getStackTrace(Thread.java:1552)
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:106)
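A minimal sketch of the dropWhile approach (variable names assumed):

```scala
// Find the first method on the stack that isn't part of our own scoping code,
// however many vendor-specific frames (e.g. getStackTraceImpl) sit above it.
val ourMethodName = "withScope"
val callerName = Thread.currentThread.getStackTrace
  .dropWhile(_.getMethodName != ourMethodName)  // frames above ours, vendor-dependent
  .dropWhile(_.getMethodName == ourMethodName)  // our own frames
  .headOption
  .map(_.getMethodName)
```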

I've tested this with Oracle and IBM Java, no side effects for other tests introduced.

Author: Adam Roberts <aroberts@uk.ibm.com>
Author: a-roberts <aroberts@uk.ibm.com>

Closes #6740 from a-roberts/RDDScopeStackCrawlFix and squashes the following commits:

13ce390 [Adam Roberts] Ensure consistency with String equality checking
a4fc0e0 [a-roberts] Update RDDOperationScope.scala

(cherry picked from commit 19e30b48f3)
Signed-off-by: Andrew Or <andrew@databricks.com>
2015-06-10 13:21:59 -07:00