Commit graph

11123 commits

Author SHA1 Message Date
Andrew Or 8f20824268 [SPARK-7864] [UI] Do not kill innocent stages from visualization
**Reproduction.** Run a long-running job, go to the job page, expand the DAG visualization, and click into a stage. Your stage is now killed. Why? This is because the visualization code just reaches into the stage table and grabs the first link it finds. In our case, this first link happens to be the kill link instead of the one to the stage page.

**Fix.** Use proper CSS selectors to avoid ambiguity.

This is an alternative to #6407. Thanks carsonwang for catching this.

Author: Andrew Or <andrew@databricks.com>

Closes #6419 from andrewor14/fix-ui-viz-kill and squashes the following commits:

25203bd [Andrew Or] Do not kill innocent stages
2015-05-26 16:31:34 -07:00
Xiangrui Meng 836a75898f [SPARK-7748] [MLLIB] Graduate spark.ml from alpha
With descent coverage of feature transformers, algorithms, and model tuning support, it is time to graduate `spark.ml` from alpha. This PR changes all `AlphaComponent` annotations to either `DeveloperApi` or `Experimental`, depending on whether we expect a class/method to be used by end users (who use the pipeline API to assemble/tune their ML pipelines but not to create new pipeline components.) `UnaryTransformer` becomes a `DeveloperApi` in this PR.

jkbradley harsha2010

Author: Xiangrui Meng <meng@databricks.com>

Closes #6417 from mengxr/SPARK-7748 and squashes the following commits:

effbccd [Xiangrui Meng] organize imports
c15028e [Xiangrui Meng] added missing docs
1b2e5f8 [Xiangrui Meng] update package doc
73ca791 [Xiangrui Meng] alpha -> ex/dev for the rest
93819db [Xiangrui Meng] alpha -> ex/dev in ml.param
55ca073 [Xiangrui Meng] alpha -> ex/dev in ml.feature
83572f1 [Xiangrui Meng] add Experimental and DeveloperApi tags (wip)
2015-05-26 15:51:31 -07:00
zsxwing 9f742241cb [SPARK-6602] [CORE] Remove some places in core that calling SparkEnv.actorSystem
Author: zsxwing <zsxwing@gmail.com>

Closes #6333 from zsxwing/remove-actor-system-usage and squashes the following commits:

f125aa6 [zsxwing] Fix YarnAllocatorSuite
ceadcf6 [zsxwing] Change the "port" parameter type of "AkkaUtils.address" to "int"; update ApplicationMaster and YarnAllocator to get the driverUrl from RpcEnv
3239380 [zsxwing] Remove some places in core that calling SparkEnv.actorSystem
2015-05-26 15:28:49 -07:00
Shivaram Venkataraman 2e9a5f229e [SPARK-3674] YARN support in Spark EC2
This corresponds to https://github.com/mesos/spark-ec2/pull/116 in the spark-ec2 repo. The only changes required on the spark_ec2.py script is to open the RM port.

cc andrewor14

Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

Closes #6376 from shivaram/spark-ec2-yarn and squashes the following commits:

961504a [Shivaram Venkataraman] Merge branch 'master' of https://github.com/apache/spark into spark-ec2-yarn
152c94c [Shivaram Venkataraman] Open 8088 for YARN in EC2
2015-05-26 15:01:27 -07:00
MechCoder 61664732b2 [SPARK-7844] [MLLIB] Fix broken tests in KernelDensity
The densities in KernelDensity are scaled down by
(number of parallel processes X number of points). It should be just no.of samples. This results in broken tests in KernelDensitySuite which haven't been tested properly.

Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #6383 from MechCoder/spark-7844 and squashes the following commits:

ab81302 [MechCoder] Math->math
9b8ed50 [MechCoder] Make one pass to update count
a92fe50 [MechCoder] [SPARK-7844] Fix broken tests in KernelDensity
2015-05-26 13:21:00 -07:00
Patrick Wendell b7d8085942 Revert "[SPARK-7042] [BUILD] use the standard akka artifacts with hadoop-2.x"
This reverts commit 43aa819c04.
2015-05-26 10:05:13 -07:00
Zhang, Liye 63099122de [SPARK-7854] [TEST] refine Kryo test suite
this modification is according to JoshRosen 's comments, for details, please refer to [#5934](https://github.com/apache/spark/pull/5934/files#r30949751).

Author: Zhang, Liye <liye.zhang@intel.com>

Closes #6395 from liyezhang556520/kryoTest and squashes the following commits:

da214c8 [Zhang, Liye] refine Kryo test suite accroding to Josh's comments
2015-05-26 17:08:16 +01:00
Mike Dusenberry e5a63a0e39 [DOCS] [MLLIB] Fixing misformatted links in v1.4 MLlib Naive Bayes documentation by removing space and newline characters.
A couple of links in the MLlib Naive Bayes documentation for v1.4 were broken due to the addition of either space or newline characters between the link title and link URL in the markdown doc.  (Interestingly enough, they are rendered correctly in the GitHub viewer, but not when compiled to HTML by Jekyll.)

Author: Mike Dusenberry <dusenberrymw@gmail.com>

Closes #6412 from dusenberrymw/Fix_Broken_Links_In_MLlib_Naive_Bayes_Docs and squashes the following commits:

91a4028 [Mike Dusenberry] Fixing misformatted links by removing space and newline characters.
2015-05-26 17:05:58 +01:00
meawoppl 8dbe777703 [SPARK-7806][EC2] Fixes that allow the spark_ec2.py tool to run with Python3
I have used this script to launch, destroy, start, and stop clusters successfully.

Author: meawoppl <meawoppl@gmail.com>

Closes #6336 from meawoppl/py3ec2spark and squashes the following commits:

2e87046 [meawoppl] Py3 compat fixes.
2015-05-26 09:02:25 -07:00
linweizhong 8948ad3fb5 [SPARK-7339] [PYSPARK] PySpark shuffle spill memory sometimes are not correct
In PySpark we get memory used before and after spill, then use the difference of these two value as memorySpilled, but if the before value is small than after value, then we will get a negative value, but this scenario 0 value may be more reasonable.

Below is the result in HistoryServer we have tested:
Index	ID	Attempt	Status	Locality Level	Executor ID / Host	Launch Time	Duration	GC Time	Input Size / Records	Write Time	Shuffle Write Size / Records	Shuffle Spill (Memory)	Shuffle Spill (Disk)	Errors
0	0	0	SUCCESS	NODE_LOCAL	3 / vm119	2015/05/04 17:31:06	21 s	0.1 s	128.1 MB (hadoop) / 3237	70 ms	10.1 MB / 2529	0.0 B	5.7 MB
2	2	0	SUCCESS	NODE_LOCAL	1 / vm118	2015/05/04 17:31:06	22 s	89 ms	128.1 MB (hadoop) / 3205	0.1 s	10.1 MB / 2529	-1048576.0 B	5.9 MB
1	1	0	SUCCESS	NODE_LOCAL	2 / vm117	2015/05/04 17:31:06	22 s	0.1 s	128.1 MB (hadoop) / 3271	68 ms	10.1 MB / 2529	-1048576.0 B	5.6 MB
4	4	0	SUCCESS	NODE_LOCAL	2 / vm117	2015/05/04 17:31:06	22 s	0.1 s	128.1 MB (hadoop) / 3192	51 ms	10.1 MB / 2529	-1048576.0 B	5.9 MB
3	3	0	SUCCESS	NODE_LOCAL	3 / vm119	2015/05/04 17:31:06	22 s	0.1 s	128.1 MB (hadoop) / 3262	51 ms	10.1 MB / 2529	1024.0 KB	5.8 MB
5	5	0	SUCCESS	NODE_LOCAL	1 / vm118	2015/05/04 17:31:06	22 s	89 ms	128.1 MB (hadoop) / 3256	93 ms	10.1 MB / 2529	-1048576.0 B	5.7 MB

/cc davies

Author: linweizhong <linweizhong@huawei.com>

Closes #5887 from Sephiroth-Lin/spark-7339 and squashes the following commits:

9186c81 [linweizhong] Use max function to get a nonnegative value
d41672b [linweizhong] Update MemoryBytesSpilled when memorySpilled > 0
2015-05-26 08:35:39 -07:00
scwf bf49c22130 [CORE] [TEST] Fix SimpleDateParamTest
```
sbt.ForkMain$ForkError: 1424424077190 was not equal to 1424474477190
	at org.scalatest.MatchersHelper$.newTestFailedException(MatchersHelper.scala:160)
	at org.scalatest.Matchers$ShouldMethodHelper$.shouldMatcher(Matchers.scala:6231)
	at org.scalatest.Matchers$AnyShouldWrapper.should(Matchers.scala:6265)
	at org.apache.spark.status.api.v1.SimpleDateParamTest$$anonfun$1.apply$mcV$sp(SimpleDateParamTest.scala:25)
	at org.apache.spark.status.api.v1.SimpleDateParamTest$$anonfun$1.apply(SimpleDateParamTest.scala:23)
	at org.apache.spark.status.api.v1.SimpleDateParamTest$$anonfun$1.apply(SimpleDateParamTest.scala:23)
	at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
	at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
	at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
	at org.scalatest.Transformer.apply(Transformer.scala:22)
	at org.scalatest.Transformer.apply(Transformer.scala:20)
	at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
	at org.scalatest.Suite$class.withFixture(Suite.scala:
```

Set timezone to fix SimpleDateParamTest

Author: scwf <wangfei1@huawei.com>
Author: Fei Wang <wangfei1@huawei.com>

Closes #6377 from scwf/fix-SimpleDateParamTest and squashes the following commits:

b8df1e5 [Fei Wang] Update SimpleDateParamSuite.scala
8bb74f0 [scwf] fix SimpleDateParamSuite
2015-05-26 08:42:52 -05:00
Konstantin Shaposhnikov 43aa819c04 [SPARK-7042] [BUILD] use the standard akka artifacts with hadoop-2.x
Both akka 2.3.x and hadoop-2.x use protobuf 2.5 so only hadoop-1 build needs
custom 2.3.4-spark akka version that shades protobuf-2.5

This partially fixes SPARK-7042 (for hadoop-2.x builds)

Author: Konstantin Shaposhnikov <Konstantin.Shaposhnikov@sc.com>

Closes #6341 from kostya-sh/SPARK-7042 and squashes the following commits:

7eb8c60 [Konstantin Shaposhnikov] [SPARK-7042][BUILD] use the standard akka artifacts with hadoop-2.x
2015-05-26 07:49:32 +01:00
Reynold Xin c9adcad81a [SQL][minor] Removed unused Catalyst logical plan DSL.
The Catalyst DSL is no longer used as a public facing API. This pull request removes the UDF and writeToFile feature from it since they are not used in unit tests.

Author: Reynold Xin <rxin@databricks.com>

Closes #6350 from rxin/unused-logical-dsl and squashes the following commits:

90b3de6 [Reynold Xin] [SQL][minor] Removed unused Catalyst logical plan DSL.
2015-05-25 23:09:22 -07:00
Yin Huai f38e619c41 [SPARK-7832] [Build] Always run SQL tests in master build.
https://issues.apache.org/jira/browse/SPARK-7832

Author: Yin Huai <yhuai@databricks.com>

Closes #6385 from yhuai/runSQLTests and squashes the following commits:

3d399bc [Yin Huai] Always run SQL tests in master build.
2015-05-25 18:23:58 -07:00
Calvin Jia ce0051d6f7 [SPARK-6391][DOCS] Document Tachyon compatibility.
Adds a section in the RDD persistence section of the programming-guide docs detailing Spark-Tachyon version compatibility as discussed in [[SPARK-6391]](https://issues.apache.org/jira/browse/SPARK-6391).

Author: Calvin Jia <jia.calvin@gmail.com>

Closes #6382 from calvinjia/spark-6391 and squashes the following commits:

113e863 [Calvin Jia] Move compatibility info to the offheap storage level section.
7942dc5 [Calvin Jia] Add a section in the programming-guide docs for Tachyon compatibility.
2015-05-25 16:50:43 -07:00
Cheng Lian 8af1bf10b7 [SPARK-7842] [SQL] Makes task committing/aborting in InsertIntoHadoopFsRelation more robust
When committing/aborting a write task issued in `InsertIntoHadoopFsRelation`, if an exception is thrown from `OutputWriter.close()`, the committing/aborting process will be interrupted, and leaves messy stuff behind (e.g., the `_temporary` directory created by `FileOutputCommitter`).

This PR makes these two process more robust by catching potential exceptions and falling back to normal task committment/abort.

Author: Cheng Lian <lian@databricks.com>

Closes #6378 from liancheng/spark-7838 and squashes the following commits:

f18253a [Cheng Lian] Makes task committing/aborting in InsertIntoHadoopFsRelation more robust
2015-05-26 00:28:47 +08:00
Cheng Lian bfeedc69a2 [SPARK-7684] [SQL] Invoking HiveContext.newTemporaryConfiguration() shouldn't create new metastore directory
The "Database does not exist" error reported in SPARK-7684 was caused by `HiveContext.newTemporaryConfiguration()`, which always creates a new temporary metastore directory and returns a metastore configuration pointing that directory. This makes `TestHive.reset()` always replaces old temporary metastore with an empty new one.

Author: Cheng Lian <lian@databricks.com>

Closes #6359 from liancheng/spark-7684 and squashes the following commits:

95d2eb8 [Cheng Lian] Addresses @marmbrust's comment
042769d [Cheng Lian] Don't create new temp directory in HiveContext.newTemporaryConfiguration()
2015-05-26 00:16:06 +08:00
tedyu fd31fd4976 Add test which shows Kryo buffer size configured in mb is properly supported
This PR adds test which shows that Kryo buffer size configured in mb is supported properly

Author: tedyu <yuzhihong@gmail.com>

Closes #6390 from tedyu/master and squashes the following commits:

c51ea64 [tedyu] Fix KryoSerializer creation
f12ee04 [tedyu] Correct conf variable name in test
642de51 [tedyu] Drop change in KryoSerializer so that the new test runs
d2fdbc4 [tedyu] Give bufferSizeKb initial value
9a17277 [tedyu] Rewrite bufferSize checking
4739998 [tedyu] Rewrite bufferSize checking
830d0d0 [tedyu] Kryo buffer size configured in mb should be properly supported
2015-05-25 08:20:31 +01:00
tedyu 23bea97d92 Close HBaseAdmin at the end of HBaseTest
Author: tedyu <yuzhihong@gmail.com>

Closes #6381 from ted-yu/master and squashes the following commits:

e2f0ea1 [tedyu] Close HBaseAdmin at the end of HBaseTest
2015-05-25 08:19:42 +01:00
Judy Nash 4f4ba8fda8 [SPARK-7811] Fix typo on slf4j configuration on metrics.properties.tem…
Fix minor typo on metrics.properties.template where slf4j is incorrectly spelled as sl4j.

Author: Judy Nash <judynash@microsoft.com>

Closes #6362 from judynash/master and squashes the following commits:

c644875 [Judy Nash] SPARK-7811: Fix typo on slf4j configuration on metrics.properties.template
2015-05-24 21:48:27 +01:00
Ram Sriharsha 65c696ecc0 [SPARK-7833] [ML] Add python wrapper for RegressionEvaluator
Author: Ram Sriharsha <rsriharsha@hw11853.local>

Closes #6365 from harsha2010/SPARK-7833 and squashes the following commits:

923f288 [Ram Sriharsha] cleanup
7623b7d [Ram Sriharsha] python style fix
9743f83 [Ram Sriharsha] [SPARK-7833][ml] Add python wrapper for RegressionEvaluator
2015-05-24 10:36:02 -07:00
Yin Huai ed21476bc0 [SPARK-7805] [SQL] Move SQLTestUtils.scala and ParquetTest.scala to src/test
https://issues.apache.org/jira/browse/SPARK-7805

Because `sql/hive`'s tests depend on the test jar of `sql/core`, we do not need to store `SQLTestUtils` and `ParquetTest` in `src/main`. We should only add stuff that will be needed by `sql/console` or Python tests (for Python, we need it in `src/main`, right? davies).

Author: Yin Huai <yhuai@databricks.com>

Closes #6334 from yhuai/SPARK-7805 and squashes the following commits:

af6d0c9 [Yin Huai] mima
b86746a [Yin Huai] Move SQLTestUtils.scala and ParquetTest.scala to src/test.
2015-05-24 09:51:37 -07:00
Yin Huai bfbc0df729 [SPARK-7845] [BUILD] Bump "Hadoop 1" tests to version 1.2.1
https://issues.apache.org/jira/browse/SPARK-7845

Author: Yin Huai <yhuai@databricks.com>

Closes #6384 from yhuai/hadoop1Test and squashes the following commits:

82fcea8 [Yin Huai] Use hadoop 1.2.1 (a stable version) for hadoop 1 test.
2015-05-24 09:49:57 -07:00
Patrick Wendell 3c1a2d049c [SPARK-7287] [HOTFIX] Disable o.a.s.deploy.SparkSubmitSuite --packages 2015-05-23 19:44:03 -07:00
Shivaram Venkataraman b231baa248 [HOTFIX] Copy SparkR lib if it exists in make-distribution
This is to fix an issue reported in #6373 where the `cp` would fail if `-Psparkr` was not used in the build

cc dragos pwendell

Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

Closes #6379 from shivaram/make-distribution-hotfix and squashes the following commits:

08eb7e4 [Shivaram Venkataraman] Copy SparkR lib if it exists in make-distribution
2015-05-23 12:28:16 -07:00
Yin Huai 2b7e63585d [SPARK-7654] [SQL] Move insertInto into reader/writer interface.
This one continues the work of https://github.com/apache/spark/pull/6216.

Author: Yin Huai <yhuai@databricks.com>
Author: Reynold Xin <rxin@databricks.com>

Closes #6366 from yhuai/insert and squashes the following commits:

3d717fb [Yin Huai] Use insertInto to handle the casue when table exists and Append is used for saveAsTable.
56d2540 [Yin Huai] Add PreWriteCheck to HiveContext's analyzer.
c636e35 [Yin Huai] Remove unnecessary empty lines.
cf83837 [Yin Huai] Move insertInto to write. Also, remove the partition columns from InsertIntoHadoopFsRelation.
0841a54 [Reynold Xin] Removed experimental tag for deprecated methods.
33ed8ef [Reynold Xin] [SPARK-7654][SQL] Move insertInto into reader/writer interface.
2015-05-23 09:48:20 -07:00
Davies Liu a4df0f2d84 Fix install jira-python
jira-pytyhon package should be installed by

  sudo pip install jira

cc pwendell

Author: Davies Liu <davies@databricks.com>

Closes #6367 from davies/fix_jira_python2 and squashes the following commits:

fbb3c8e [Davies Liu] Fix install jira-python
2015-05-23 09:14:07 -07:00
Davies Liu be47af1bdb [SPARK-7840] add insertInto() to Writer
Add tests later.

Author: Davies Liu <davies@databricks.com>

Closes #6375 from davies/insertInto and squashes the following commits:

826423e [Davies Liu] add insertInto() to Writer
2015-05-23 09:07:14 -07:00
Davies Liu efe3bfdf49 [SPARK-7322, SPARK-7836, SPARK-7822][SQL] DataFrame window function related updates
1. ntile should take an integer as parameter.
2. Added Python API (based on #6364)
3. Update documentation of various DataFrame Python functions.

Author: Davies Liu <davies@databricks.com>
Author: Reynold Xin <rxin@databricks.com>

Closes #6374 from rxin/window-final and squashes the following commits:

69004c7 [Reynold Xin] Style fix.
288cea9 [Reynold Xin] Update documentaiton.
7cb8985 [Reynold Xin] Merge pull request #6364 from davies/window
66092b4 [Davies Liu] update docs
ed73cb4 [Reynold Xin] [SPARK-7322][SQL] Improve DataFrame window function documentation.
ef55132 [Davies Liu] Merge branch 'master' of github.com:apache/spark into window4
8936ade [Davies Liu] fix maxint in python 3
2649358 [Davies Liu] update docs
778e2c0 [Davies Liu] SPARK-7836 and SPARK-7822: Python API of window functions
2015-05-23 08:30:05 -07:00
zsxwing ad0badba14 [SPARK-7777][Streaming] Handle the case when there is no block in a batch
In the old implementation, if a batch has no block, `areWALRecordHandlesPresent` will be `true` and it will return `WriteAheadLogBackedBlockRDD`.

This PR handles this case by returning `WriteAheadLogBackedBlockRDD` or `BlockRDD` according to the configuration.

Author: zsxwing <zsxwing@gmail.com>

Closes #6372 from zsxwing/SPARK-7777 and squashes the following commits:

788f895 [zsxwing] Handle the case when there is no block in a batch
2015-05-23 02:11:17 -07:00
Shivaram Venkataraman a40bca0111 [SPARK-6811] Copy SparkR lib in make-distribution.sh
This change also remove native libraries from SparkR to make sure our distribution works across platforms

Tested by building on Mac, running on Amazon Linux (CentOS), Windows VM and vice-versa (built on Linux run on Mac)

I will also test this with YARN soon and update this PR.

Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

Closes #6373 from shivaram/sparkr-binary and squashes the following commits:

ae41b5c [Shivaram Venkataraman] Remove native libraries from SparkR Also include the built SparkR package in make-distribution.sh
2015-05-23 00:04:01 -07:00
Davies Liu 7af3818c6b [SPARK-6806] [SPARKR] [DOCS] Fill in SparkR examples in programming guide
sqlCtx -> sqlContext

You can check the docs by:

```
$ cd docs
$ SKIP_SCALADOC=1 jekyll serve
```
cc shivaram

Author: Davies Liu <davies@databricks.com>

Closes #5442 from davies/r_docs and squashes the following commits:

7a12ec6 [Davies Liu] remove rdd in R docs
8496b26 [Davies Liu] remove the docs related to RDD
e23b9d6 [Davies Liu] delete R docs for RDD API
222e4ff [Davies Liu] Merge branch 'master' into r_docs
89684ce [Davies Liu] Merge branch 'r_docs' of github.com:davies/spark into r_docs
f0a10e1 [Davies Liu] address comments from @shivaram
f61de71 [Davies Liu] Update pairRDD.R
3ef7cf3 [Davies Liu] use + instead of function(a,b) a+b
2f10a77 [Davies Liu] address comments from @cafreeman
9c2a062 [Davies Liu] mention R api together with Python API
23f751a [Davies Liu] Fill in SparkR examples in programming guide
2015-05-23 00:01:40 -07:00
GenTang 4583cf4be1 [SPARK-5090] [EXAMPLES] The improvement of python converter for hbase
Hi,

Following the discussion in http://apache-spark-developers-list.1001551.n3.nabble.com/python-converter-in-HBaseConverter-scala-spark-examples-td10001.html. I made some modification in three files in package examples:
1. HBaseConverters.scala: the new converter will converts all the records in an hbase results into a single string
2. hbase_input.py: as the value string may contain several records, we can use ast package to convert the string into dict
3. HBaseTest.scala: as the package examples use hbase 0.98.7 the original constructor HTableDescriptor is deprecated. The updation to new constructor is made

Author: GenTang <gen.tang86@gmail.com>

Closes #3920 from GenTang/master and squashes the following commits:

d2153df [GenTang] import JSONObject precisely
4802481 [GenTang] dump the result into a singl String
62df7f0 [GenTang] remove the comment
21de653 [GenTang] return the string in json format
15b1fe3 [GenTang] the modification of comments
5cbbcfc [GenTang] the improvement of pythonconverter
ceb31c5 [GenTang] the modification for adapting updation of hbase
3253b61 [GenTang] the modification accompanying the improvement of pythonconverter
2015-05-22 23:37:03 -07:00
Hari Shreedharan 368b8c2b5e [HOTFIX] Add tests for SparkListenerApplicationStart with Driver Logs.
#6166 added the driver logs to `SparkListenerApplicationStart`. This  adds tests in `JsonProtocolSuite` to ensure we don't regress.

Author: Hari Shreedharan <hshreedharan@apache.org>

Closes #6368 from harishreedharan/jsonprotocol-test and squashes the following commits:

dc9eafc [Hari Shreedharan] [HOTFIX] Add tests for SparkListenerApplicationStart with Driver Logs.
2015-05-22 23:07:56 -07:00
Tathagata Das baa89838cc [SPARK-7838] [STREAMING] Set scope for kinesis stream
Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #6369 from tdas/SPARK-7838 and squashes the following commits:

87d1c7f [Tathagata Das] Addressed comment
37775d8 [Tathagata Das] set scope for kinesis stream
2015-05-22 23:05:54 -07:00
Shivaram Venkataraman 017b3404a5 [MINOR] Add SparkR to create-release script
Enables the SparkR profiles for all the binary builds we create

cc pwendell

Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

Closes #6371 from shivaram/sparkr-create-release and squashes the following commits:

ca5a0b2 [Shivaram Venkataraman] Add -Psparkr to create-release.sh
2015-05-22 22:33:49 -07:00
Akshat Aranya a16357413d [SPARK-7795] [CORE] Speed up task scheduling in standalone mode by reusing serializer
My experiments with scheduling very short tasks in standalone cluster mode indicated that a significant amount of time was being spent in scheduling the tasks (>500ms for 256 tasks).  I found that most of the time was being spent in creating a new instance of serializer for each task.  Changing this to just one serializer brought down the scheduling time to 8ms.

Author: Akshat Aranya <aaranya@quantcast.com>

Closes #6323 from coolfrood/master and squashes the following commits:

12d8c9e [Akshat Aranya] Reduce visibility of serializer
bd4a5dd [Akshat Aranya] Style fix
0b8ca93 [Akshat Aranya] Incorporate review comments
fe530cd [Akshat Aranya] Speed up task scheduling in standalone mode by reusing serializer instead of creating a new one for each task.
2015-05-22 22:03:31 -07:00
Mike Dusenberry 63a5ce75ea [SPARK-7830] [DOCS] [MLLIB] Adding logistic regression to the list of Multiclass Classification Supported Methods documentation
Added logistic regression to the list of Multiclass Classification Supported Methods in the MLlib Classification and Regression documentation, as it was missing.

Author: Mike Dusenberry <dusenberrymw@gmail.com>

Closes #6357 from dusenberrymw/Add_LR_To_List_Of_Multiclass_Classification_Methods and squashes the following commits:

7918650 [Mike Dusenberry] Updating broken link due to the "Binary Classification" section on the Linear Methods page being renamed to "Classification".
3005dc2 [Mike Dusenberry] Adding logistic regression to the list of Multiclass Classification Supported Methods in the MLlib Classification and Regression documentation, as it was missing.
2015-05-22 18:03:12 -07:00
Burak Yavuz 8014e1f6bb [SPARK-7224] [SPARK-7306] mock repository generator for --packages tests without nio.Path
The previous PR for SPARK-7224 (#5790) broke JDK 6, because it used java.nio.Path, which was in jdk 7, and not in 6. This PR uses Guava's `Files` to handle directory creation, and etc...

The description from the previous PR:
> This patch contains an `IvyTestUtils` file, which dynamically generates jars and pom files to test the `--packages` feature without having to rely on the internet, and Maven Central.

cc pwendell

I also rand the flaky test about 20 times locally, it didn't fail a single time, but I think it may fail like once every 100 builds? I still haven't figured the cause yet, but the test before it, `--jars` was also failing after we turned off the `--packages` test in `SparkSubmitSuite`. It may be related to the launch of SparkSubmit.

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #5892 from brkyvz/maven-utils and squashes the following commits:

e9b1903 [Burak Yavuz] fix merge conflict
68214e0 [Burak Yavuz] remove ignore for test(neglect spark dependencies)
e632381 [Burak Yavuz] fix ignore
9ef1408 [Burak Yavuz] re-enable --packages test
22eea62 [Burak Yavuz] Merge branch 'master' of github.com:apache/spark into maven-utils
05cd0de [Burak Yavuz] added mock repository generator
2015-05-22 17:48:09 -07:00
Tathagata Das 1c388a9985 [SPARK-7788] Made KinesisReceiver.onStart() non-blocking
KinesisReceiver calls worker.run() which is a blocking call (while loop) as per source code of kinesis-client library - https://github.com/awslabs/amazon-kinesis-client/blob/v1.2.1/src/main/java/com/amazonaws/services/kinesis/clientlibrary/lib/worker/Worker.java.
This results in infinite loop while calling sparkStreamingContext.stop(stopSparkContext = false, stopGracefully = true) perhaps because ReceiverTracker is never able to register the receiver (it's receiverInfo field is a empty map) causing it to be stuck in infinite loop while waiting for running flag to be set to false.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #6348 from tdas/SPARK-7788 and squashes the following commits:

2584683 [Tathagata Das] Added receiver id in thread name
6cf1cd4 [Tathagata Das] Made KinesisReceiver.onStart non-blocking
2015-05-22 17:39:01 -07:00
Andrew Or 3d8760d76e [SPARK-7771] [SPARK-7779] Dynamic allocation: lower default timeouts further
The default add time of 5s is still too slow for small jobs. Also, the current default remove time of 10 minutes seem rather high. This patch lowers both and rephrases a few log messages.

Author: Andrew Or <andrew@databricks.com>

Closes #6301 from andrewor14/da-minor and squashes the following commits:

6d614a6 [Andrew Or] Lower log level
2811492 [Andrew Or] Log information when requests are canceled
5fcd3eb [Andrew Or] Fix tests
3320710 [Andrew Or] Lower timeouts + rephrase a few log messages
2015-05-22 17:37:38 -07:00
Michael Armbrust 3c1305107a [SPARK-7834] [SQL] Better window error messages
Author: Michael Armbrust <michael@databricks.com>

Closes #6363 from marmbrus/windowErrors and squashes the following commits:

516b02d [Michael Armbrust] [SPARK-7834] [SQL] Better window error messages
2015-05-22 17:23:12 -07:00
Imran Rashid 821254fb94 [SPARK-7760] add /json back into master & worker pages; add test
Author: Imran Rashid <irashid@cloudera.com>

Closes #6284 from squito/SPARK-7760 and squashes the following commits:

5e02d8a [Imran Rashid] style; increase timeout
9987399 [Imran Rashid] comment
8c7ed63 [Imran Rashid] add /json back into master & worker pages; add test
2015-05-22 16:05:07 -07:00
Liang-Chi Hsieh 126d7235de [SPARK-7270] [SQL] Consider dynamic partition when inserting into hive table
JIRA: https://issues.apache.org/jira/browse/SPARK-7270

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #5864 from viirya/dyn_partition_insert and squashes the following commits:

b5627df [Liang-Chi Hsieh] For comments.
3b21e4b [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into dyn_partition_insert
8a4352d [Liang-Chi Hsieh] Consider dynamic partition when inserting into hive table.
2015-05-22 15:39:58 -07:00
Santiago M. Mola e4aef91fe7 [SPARK-7724] [SQL] Support Intersect/Except in Catalyst DSL.
Author: Santiago M. Mola <santi@mola.io>

Closes #6327 from smola/feature/catalyst-dsl-set-ops and squashes the following commits:

11db778 [Santiago M. Mola] [SPARK-7724] [SQL] Support Intersect/Except in Catalyst DSL.
2015-05-22 15:10:27 -07:00
WangTaoTheTonic 31d5d463e7 [SPARK-7758] [SQL] Override more configs to avoid failure when connect to a postgre sql
https://issues.apache.org/jira/browse/SPARK-7758

When initializing `executionHive`, we only masks
`javax.jdo.option.ConnectionURL` to override metastore location.  However,
other properties that relates to the actual Hive metastore data source are not
masked.  For example, when using Spark SQL with a PostgreSQL backed Hive
metastore, `executionHive` actually tries to use settings read from
`hive-site.xml`, which talks about PostgreSQL, to connect to the temporary
Derby metastore, thus causes error.

To fix this, we need to mask all metastore data source properties.
Specifically, according to the code of [Hive `ObjectStore.getDataSourceProps()`
method] [1], all properties whose name mentions "jdo" and "datanucleus" must be
included.

[1]: https://github.com/apache/hive/blob/release-0.13.1/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java#L288

Have tested using postgre sql as metastore, it worked fine.

Author: WangTaoTheTonic <wangtao111@huawei.com>

Closes #6314 from WangTaoTheTonic/SPARK-7758 and squashes the following commits:

ca7ae7c [WangTaoTheTonic] add comments
86caf2c [WangTaoTheTonic] delete unused import
e4f0feb [WangTaoTheTonic] block more data source related property
92a81fa [WangTaoTheTonic] fix style check
e3e683d [WangTaoTheTonic] override more configs to avoid failuer connecting to postgre sql
2015-05-22 14:43:16 -07:00
Josh Rosen eac00691da [SPARK-7766] KryoSerializerInstance reuse is unsafe when auto-reset is disabled
SPARK-3386 / #5606 modified the shuffle write path to re-use serializer instances across multiple calls to DiskBlockObjectWriter. It turns out that this introduced a very rare bug when using `KryoSerializer`: if auto-reset is disabled and reference-tracking is enabled, then we'll end up re-using the same serializer instance to write multiple output streams without calling `reset()` between write calls, which can lead to cases where objects in one file may contain references to objects that are in previous files, causing errors during deserialization.

This patch fixes this bug by calling `reset()` at the start of `serialize()` and `serializeStream()`. I also added a regression test which demonstrates that this problem only occurs when auto-reset is disabled and reference-tracking is enabled.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #6293 from JoshRosen/kryo-instance-reuse-bug and squashes the following commits:

e19726d [Josh Rosen] Add fix for SPARK-7766.
71845e3 [Josh Rosen] Add failing regression test to trigger Kryo re-use bug
2015-05-22 13:28:14 -07:00
Ram Sriharsha 509d55ab41 [SPARK-7574] [ML] [DOC] User guide for OneVsRest
Including Iris Dataset (after shuffling and relabeling 3 -> 0 to confirm to 0 -> numClasses-1 labeling). Could not find an existing dataset in data/mllib for multiclass classification.

Author: Ram Sriharsha <rsriharsha@hw11853.local>

Closes #6296 from harsha2010/SPARK-7574 and squashes the following commits:

645427c [Ram Sriharsha] cleanup
46c41b1 [Ram Sriharsha] cleanup
2f76295 [Ram Sriharsha] Code Review Fixes
ebdf103 [Ram Sriharsha] Java Example
c026613 [Ram Sriharsha] Code Review fixes
4b7d1a6 [Ram Sriharsha] minor cleanup
13bed9c [Ram Sriharsha] add wikipedia link
bb9dbfa [Ram Sriharsha] Clean up naming
6f90db1 [Ram Sriharsha] [SPARK-7574][ml][doc] User guide for OneVsRest
2015-05-22 13:18:08 -07:00
Patrick Wendell c63036cd47 Revert "[BUILD] Always run SQL tests in master build."
This reverts commit 147b6be3b6.
2015-05-22 10:04:45 -07:00
Ram Sriharsha f490b3b4c7 [SPARK-7404] [ML] Add RegressionEvaluator to spark.ml
Author: Ram Sriharsha <rsriharsha@hw11853.local>

Closes #6344 from harsha2010/SPARK-7404 and squashes the following commits:

16b9d77 [Ram Sriharsha] consistent naming
7f100b6 [Ram Sriharsha] cleanup
c46044d [Ram Sriharsha] Merge with Master + Code Review Fixes
188fa0a [Ram Sriharsha] Merge branch 'master' into SPARK-7404
f5b6a4c [Ram Sriharsha] cleanup doc
97beca5 [Ram Sriharsha] update test to use R packages
32dd310 [Ram Sriharsha] fix indentation
f93b812 [Ram Sriharsha] fix test
1b6ebb3 [Ram Sriharsha] [SPARK-7404][ml] Add RegressionEvaluator to spark.ml
2015-05-22 09:59:44 -07:00