Commit graph

11370 commits

Author SHA1 Message Date
Marcelo Vanzin 8b25f62bf1 [SPARK-6511] [docs] Fix example command in hadoop-provided docs.
Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #6766 from vanzin/SPARK-6511 and squashes the following commits:

49f0f67 [Marcelo Vanzin] [SPARK-6511] [docs] Fix example command in hadoop-provided docs.

(cherry picked from commit 9cbdf31ec1)
Signed-off-by: Reynold Xin <rxin@databricks.com>
2015-06-11 15:29:09 -07:00
Shivaram Venkataraman 3a62569afb [SPARK-8310] [EC2] Update spark-ec2 branch to 1.4
cc pwendell -- We should probably update our release guidelines to change this when we cut a release branch?

Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

Closes #6765 from shivaram/SPARK-8310-14 and squashes the following commits:

066e44e [Shivaram Venkataraman] Update spark-ec2 branch to 1.4
2015-06-11 13:22:08 -07:00
Adam Roberts b313920abd [SPARK-8289] Specify stack size for consistency with Java tests - resolves test failures
This is a simple change: it specifies a stack size of 4096k for Java tests instead of the vendor default (the defaults vary between Java vendors). This remedies test failures observed in JavaALSSuite with IBM and Oracle Java, whose default stack sizes are lower than OpenJDK's. 4096k is a suitable default with which the tests pass on every Java vendor tested. The alternative is to reduce the number of iterations in the test (no failures were observed with 5 iterations instead of 15).

-Xss works with Oracle's HotSpot VM, IBM's J9 VM and OpenJDK (IcedTea).

I have ensured this does not have any negative implications for other tests.
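
For the SBT side, the change presumably amounts to a forked-test JVM option along these lines (a sketch, not the exact diff):

    // build.sbt: pin the thread stack size for forked test JVMs so
    // behaviour does not depend on the Java vendor's default.
    javaOptions in Test += "-Xss4096k"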

Author: Adam Roberts <aroberts@uk.ibm.com>
Author: a-roberts <aroberts@uk.ibm.com>

Closes #6727 from a-roberts/IncJavaStackSize and squashes the following commits:

ab40aea [Adam Roberts] Specify stack size for SBT builds
5032d8d [a-roberts] Update pom.xml

(cherry picked from commit 6b68366df3)
Signed-off-by: Sean Owen <sowen@cloudera.com>
2015-06-11 08:41:00 +01:00
navis.ryu 5c05b5c0d2 [SPARK-8285] [SQL] CombineSum should be calculated as unlimited decimal first
    case cs @ CombineSum(expr) =>
      val calcType = expr.dataType
        expr.dataType match {
          case DecimalType.Fixed(_, _) =>
            DecimalType.Unlimited
          case _ =>
            expr.dataType
        }
As written, calcType is always expr.dataType; the result of the match expression is never assigned. All credit belongs to IntelliJ.
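
The intended shape of the fix (a sketch; the match result must actually feed calcType):

    case cs @ CombineSum(expr) =>
      // Decimal sums are first computed as unlimited-precision decimals.
      val calcType = expr.dataType match {
        case DecimalType.Fixed(_, _) => DecimalType.Unlimited
        case _ => expr.dataType
      }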

Author: navis.ryu <navis@apache.org>

Closes #6736 from navis/SPARK-8285 and squashes the following commits:

20382c1 [navis.ryu] [SPARK-8285] [SQL] CombineSum should be calculated as unlimited decimal first

(cherry picked from commit 6a47114bc2)
Signed-off-by: Reynold Xin <rxin@databricks.com>
2015-06-10 18:19:24 -07:00
Paavo 59fc3f1972 [SPARK-8200] [MLLIB] Check for empty RDDs in StreamingLinearAlgorithm
Test cases for both StreamingLinearRegression and StreamingLogisticRegression, and code fix.

Edit:
This contribution is my original work and I license the work to the project under the project's open source license.
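
The guard itself is presumably along these lines (a sketch; names assumed, not taken from the patch):

    // Skip the model update for an empty micro-batch instead of failing on it.
    data.foreachRDD { (rdd, time) =>
      if (!rdd.isEmpty()) {
        model = Some(algorithm.run(rdd, model.get.weights))
      }
    }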

Author: Paavo <pparkkin@gmail.com>

Closes #6713 from pparkkin/streamingmodel-empty-rdd and squashes the following commits:

ff5cd78 [Paavo] Update strings to use interpolation.
db234cf [Paavo] Use !rdd.isEmpty.
54ad89e [Paavo] Test case for empty stream.
393e36f [Paavo] Ignore empty RDDs.
0bfc365 [Paavo] Test case for empty stream.

(cherry picked from commit b928f54384)
Signed-off-by: Sean Owen <sowen@cloudera.com>
2015-06-10 23:26:54 +01:00
WangTaoTheTonic 2846a357f3 [SPARK-8273] Driver hangs up when yarn shutdown in client mode
In client mode, if YARN is shut down while a Spark application is running, the application hangs after several retries (default: 30) because the exception thrown by YarnClientImpl is not caught at the upper level. We should exit in that case, so the user is made aware of the failure.

The exception we want to catch is [here](https://github.com/apache/hadoop/blob/branch-2.7.0/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/retry/RetryInvocationHandler.java#L122), and the fix follows the approach taken in [MR](https://github.com/apache/hadoop/blob/branch-2.7.0/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/main/java/org/apache/hadoop/mapred/ClientServiceDelegate.java#L320).
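
A rough sketch of the idea (exception type and handling are assumptions):

    // Stop the monitoring loop and exit once the RM connection retries
    // are exhausted, instead of hanging indefinitely.
    try {
      client.monitorApplication(appId)
    } catch {
      case e: java.net.ConnectException =>
        logError("YARN appears to be down; exiting so the failure is visible.", e)
        System.exit(1)
    }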

Author: WangTaoTheTonic <wangtao111@huawei.com>

Closes #6717 from WangTaoTheTonic/SPARK-8273 and squashes the following commits:

28752d6 [WangTaoTheTonic] catch the throwed exception
2015-06-10 13:36:16 -07:00
Adam Roberts 568d1d51d6 [SPARK-7756] CORE RDDOperationScope fix for IBM Java
IBM Java has an extra method when we do getStackTrace(): "getStackTraceImpl", a native method. This causes two tests within "DStreamScopeSuite" to fail when running with IBM Java, because instead of "map" or "filter" being the method names found, "getStackTrace" is returned. This commit addresses the issue by using dropWhile: given that our current method is withScope, we look for the next method that isn't ours, and ignore whatever comes before us in the stack trace, e.g. getStackTrace, regardless of how many levels deep that goes.

IBM:
java.lang.Thread.getStackTraceImpl(Native Method)
java.lang.Thread.getStackTrace(Thread.java:1117)
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:104)

Oracle:
PRINTING STACKTRACE!!!
java.lang.Thread.getStackTrace(Thread.java:1552)
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:106)

I've tested this with Oracle and IBM Java; no side effects were introduced for other tests.
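
The dropWhile idea, sketched (exact predicate details assumed): walk down to our own withScope frames, then take the first frame past them, so anything above them (getStackTrace, getStackTraceImpl, ...) is skipped on any JVM.

    val ourMethodName = "withScope"
    val callerMethodName = Thread.currentThread.getStackTrace
      .dropWhile(_.getMethodName != ourMethodName) // skip frames above ours
      .find(_.getMethodName != ourMethodName)      // first frame past ours
      .map(_.getMethodName)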

Author: Adam Roberts <aroberts@uk.ibm.com>
Author: a-roberts <aroberts@uk.ibm.com>

Closes #6740 from a-roberts/RDDScopeStackCrawlFix and squashes the following commits:

13ce390 [Adam Roberts] Ensure consistency with String equality checking
a4fc0e0 [a-roberts] Update RDDOperationScope.scala

(cherry picked from commit 19e30b48f3)
Signed-off-by: Andrew Or <andrew@databricks.com>
2015-06-10 13:21:59 -07:00
Hossein 28e8a6ea65 [SPARK-8282] [SPARKR] Make number of threads used in RBackend configurable
Read number of threads for RBackend from configuration.

[SPARK-8282] #comment Linking with JIRA
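
Presumably this boils down to something like the following (the config key and default are assumptions):

    // Number of RBackend handler threads, now read from SparkConf.
    val numRBackendThreads = conf.getInt("spark.r.numRBackendThreads", 2)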

Author: Hossein <hossein@databricks.com>

Closes #6730 from falaki/SPARK-8282 and squashes the following commits:

33b3d98 [Hossein] Documented new config parameter
70f2a9c [Hossein] Fixing import
ec44225 [Hossein] Read number of threads for RBackend from configuration

(cherry picked from commit 30ebf1a233)
Signed-off-by: Andrew Or <andrew@databricks.com>
2015-06-10 13:19:53 -07:00
Cheng Lian 7b88e6a1e3 [SQL] [MINOR] Fixes a minor Java example error in SQL programming guide
Author: Cheng Lian <lian@databricks.com>

Closes #6749 from liancheng/java-sample-fix and squashes the following commits:

5b44585 [Cheng Lian] Fixes a minor Java example error in SQL programming guide

(cherry picked from commit 8f7308f9c4)
Signed-off-by: Reynold Xin <rxin@databricks.com>
2015-06-10 11:48:19 -07:00
Patrick Wendell a0a7f2f921 [SPARK-6511] [DOCUMENTATION] Explain how to use Hadoop provided builds
This provides preliminary documentation pointing out how to use the
Hadoop-free builds. I am hoping that over time this list can grow to
include most of the popular Hadoop distributions.

Getting more people using these builds will help us reduce, in the
long term, the number of binaries we build.

Author: Patrick Wendell <patrick@databricks.com>

Closes #6729 from pwendell/hadoop-provided and squashes the following commits:

1113b76 [Patrick Wendell] [SPARK-6511] [Documentation] Explain how to use Hadoop provided builds

(cherry picked from commit 6e4fb0c9e8)
Signed-off-by: Patrick Wendell <patrick@databricks.com>
2015-06-09 16:14:30 -07:00
Andrew Or 1175cfe088 [MINOR] [UI] DAG visualization: trim whitespace from input
Just as a safeguard against DOM rewriting.

Author: Andrew Or <andrew@databricks.com>

Closes #6732 from andrewor14/dag-viz-trim and squashes the following commits:

7e9bacb [Andrew Or] [MINOR] [UI] DAG visualization: trim whitespace from input

(cherry picked from commit 0d5892dc72)
Signed-off-by: Andrew Or <andrew@databricks.com>
2015-06-09 15:44:09 -07:00
FavioVazquez a7b7a194ac [SPARK-8274] [DOCUMENTATION-MLLIB] Fix wrong URLs in MLlib Frequent Pattern Mining Documentation
There is a mistake in the URLs of the Scala section of FP-Growth in the MLlib Frequent Pattern Mining documentation. The URL points to https://spark.apache.org/docs/latest/api/java/org/apache/spark/mllib/fpm/FPGrowth.html, which is the Java API; the link should point to the Scala API https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.fpm.FPGrowth

There's another mistake in the FP-GrowthModel entry in the same section: the link points, again, to the Java API https://spark.apache.org/docs/latest/api/java/org/apache/spark/mllib/fpm/FPGrowthModel.html when it should point to the Scala API https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.fpm.FPGrowthModel

Author: FavioVazquez <favio.vazquezp@gmail.com>

Closes #6722 from FavioVazquez/fix-wrog-urls-mllib-fpgrowth and squashes the following commits:

e1ca54d [FavioVazquez] - Fixed wrong URLs in MLlib Frequent Pattern Mining, FP-Growth Scala section
ad882a3 [FavioVazquez] Merge remote-tracking branch 'upstream/master'
f27a20b [FavioVazquez] Merge remote-tracking branch 'upstream/master'
9af7074 [FavioVazquez] Merge remote-tracking branch 'upstream/master'
edab1ef [FavioVazquez] Merge remote-tracking branch 'upstream/master'
b2e2f8c [FavioVazquez] Merge remote-tracking branch 'upstream/master'

(cherry picked from commit 490d5a72ec)
Signed-off-by: Sean Owen <sowen@cloudera.com>
2015-06-09 15:02:56 +01:00
hqzizania 0a9383decb [SPARK-6820] [SPARKR] Convert NAs to null type in SparkR DataFrames
Author: hqzizania <qian.huang@intel.com>

Closes #6190 from hqzizania/R and squashes the following commits:

1641f9e [hqzizania] fixes and add test units
bb3411a [hqzizania] Convert NAs to null type in SparkR DataFrames

(cherry picked from commit a5c52c1a34)
Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
2015-06-08 21:40:27 -07:00
Andrew Or e9a8372361 [SPARK-8162] [HOTFIX] Fix NPE in spark-shell
This was caused by this commit: f271347

This patch does not attempt to fix the root cause of why the `VisibleForTesting` annotation causes a NPE in the shell. We should find a way to fix that separately.

Author: Andrew Or <andrew@databricks.com>

Closes #6711 from andrewor14/fix-spark-shell and squashes the following commits:

bf62ecc [Andrew Or] Prevent NPE in spark-shell
2015-06-08 18:10:40 -07:00
Marcelo Vanzin 99c2a57348 [SPARK-8126] [BUILD] Use custom temp directory during build.
Even with all the efforts to clean up the temp directories created by
unit tests, Spark leaves a lot of garbage in /tmp after a test run.
This change overrides java.io.tmpdir to place those files under the
build directory instead.

After an sbt full unit test run, I was left with > 400 MB of temp
files. Since they're now under the build dir, it's much easier to
clean them up.

Also make a slight change to a unit test to make it not pollute the
source directory with test data.
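
In sbt terms, the override presumably looks something like this (a sketch; the exact build change may differ):

    // Point java.io.tmpdir for forked test JVMs at a directory under the
    // build output, so a test run's leftovers go away with a clean.
    javaOptions in Test +=
      s"-Djava.io.tmpdir=${(baseDirectory.value / "target" / "tmp").getAbsolutePath}"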

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #6674 from vanzin/SPARK-8126 and squashes the following commits:

0f8ad41 [Marcelo Vanzin] Make sure tmp dir exists when tests run.
643e916 [Marcelo Vanzin] [MINOR] [BUILD] Use custom temp directory during build.
2015-06-08 22:17:44 +01:00
Cheng Lian 69197c3e38 [SPARK-8121] [SQL] Fixes InsertIntoHadoopFsRelation job initialization for Hadoop 1.x (branch 1.4 backport based on https://github.com/apache/spark/pull/6669) 2015-06-08 11:36:42 -07:00
linweizhong a3afc2cbab [SPARK-7705] [YARN] Cleanup of .sparkStaging directory fails if application is killed
As tested, if we cancel or kill the app, the final status may be undefined, killed, or succeeded, so the staging directory should be cleaned up when the ApplicationMaster exits with any final application status.
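
A sketch of the idea (method names assumed): run the cleanup in the ApplicationMaster's shutdown path unconditionally rather than only on success.

    // Clean up .sparkStaging whatever the final status turned out to be
    // (SUCCEEDED, KILLED, FAILED or undefined).
    try {
      unregister(finalStatus, finalMsg)
    } finally {
      cleanupStagingDir()
    }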

Author: linweizhong <linweizhong@huawei.com>

Closes #6409 from Sephiroth-Lin/SPARK-7705 and squashes the following commits:

3a5a0a5 [linweizhong] Update
83dc274 [linweizhong] Update
923d44d [linweizhong] Update
0dd7c2d [linweizhong] Update
b76a102 [linweizhong] Update code style
7846b69 [linweizhong] Update
bd6cf0d [linweizhong] Refactor
aed9f18 [linweizhong] Clean up stagingDir when launch app on yarn
95595c3 [linweizhong] Cleanup of .sparkStaging directory when AppMaster exit at any final application status

(cherry picked from commit eacd4a929b)
Signed-off-by: Sean Owen <sowen@cloudera.com>
2015-06-08 09:34:42 +01:00
Daoyuan Wang 58bfdd6212 [SPARK-4761] [DOC] [SQL] kryo default setting in SQL Thrift server
this is a follow up of #3621

/cc liancheng pwendell

Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #6639 from adrian-wang/kryodoc and squashes the following commits:

3c4b1cf [Daoyuan Wang] [DOC] kryo default setting in SQL Thrift server

(cherry picked from commit 10fc2f6f51)
Signed-off-by: Reynold Xin <rxin@databricks.com>
2015-06-08 01:07:56 -07:00
Reynold Xin b9c046f6d7 [SPARK-8004][SQL] Quote identifier in JDBC data source.
This is a follow-up patch to #6577 to replace columnEnclosing with quoteIdentifier.

I also did some minor cleanup to the JdbcDialect file.
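
The renamed hook presumably has this shape (a sketch based on the PR description):

    // JdbcDialect: wrap a column name in the dialect's quote characters
    // so reserved words survive as identifiers.
    def quoteIdentifier(colName: String): String = s""""$colName""""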

Author: Reynold Xin <rxin@databricks.com>

Closes #6689 from rxin/jdbc-quote and squashes the following commits:

bad365f [Reynold Xin] Fixed test compilation...
e39e14e [Reynold Xin] Fixed compilation.
db9a8e0 [Reynold Xin] [SPARK-8004][SQL] Quote identifier in JDBC data source.

(cherry picked from commit d6d601a07b)
Signed-off-by: Reynold Xin <rxin@databricks.com>
2015-06-07 10:52:18 -07:00
Reynold Xin ff26767c03 [SPARK-8146] DataFrame Python API: Alias replace in df.na
Author: Reynold Xin <rxin@databricks.com>

Closes #6688 from rxin/df-alias-replace and squashes the following commits:

774c19c [Reynold Xin] [SPARK-8146] DataFrame Python API: Alias replace in DataFrameNaFunctions.

(cherry picked from commit 0ac47083f7)
Signed-off-by: Reynold Xin <rxin@databricks.com>
2015-06-07 01:21:08 -07:00
Liang-Chi Hsieh b4d54417e5 [SPARK-8141] [SQL] Precompute datatypes for partition columns and reuse it
JIRA: https://issues.apache.org/jira/browse/SPARK-8141

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #6687 from viirya/reuse_partition_column_types and squashes the following commits:

dab0688 [Liang-Chi Hsieh] Reuse partitionColumnTypes.

(cherry picked from commit 26d07f1ece)
Signed-off-by: Cheng Lian <lian@databricks.com>
2015-06-07 15:35:43 +08:00
979969786 9d1f4d66bd [SPARK-8145] [WEBUI] Trigger a double click on the span to show full job description.
When using Spark SQL, the Jobs tab and Stages tab display only part of the SQL statement. This change displays the full SQL when the description span is double-clicked.

before:
![before](https://cloud.githubusercontent.com/assets/5399861/8022257/9f8e0a22-0cf8-11e5-98c8-da4d7a615e7e.png)

after double click on the description span:
![after](https://cloud.githubusercontent.com/assets/5399861/8022261/dac08d4a-0cf8-11e5-8fe7-74c96c6ce933.png)

Author: 979969786 <q79969786@gmail.com>

Closes #6646 from 979969786/master and squashes the following commits:

b5ba20e [979969786] Trigger a double click on the span to show full job description.

(cherry picked from commit 081db9479a)
Signed-off-by: Andrew Or <andrew@databricks.com>
2015-06-06 23:15:35 -07:00
Liang-Chi Hsieh b6fdc6cf11 [SPARK-8004][SQL] Enclose column names by JDBC Dialect
JIRA: https://issues.apache.org/jira/browse/SPARK-8004

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #6577 from viirya/enclose_jdbc_columns and squashes the following commits:

614606a [Liang-Chi Hsieh] For comment.
bc50182 [Liang-Chi Hsieh] Enclose column names by JDBC Dialect.

(cherry picked from commit 901a552c5e)
Signed-off-by: Reynold Xin <rxin@databricks.com>
2015-06-06 23:00:18 -07:00
Hari Shreedharan 6faaf15ba3 [SPARK-7955] [CORE] Ensure executors with cached RDD blocks are not removed if dynamic allocation is enabled.

This is a work in progress. This patch ensures that an executor that has cached RDD blocks is not removed,
but makes no attempt to find another executor to remove instead. This is meant to get some feedback on the current
approach; if it makes sense, I will then look at choosing another executor to remove. No testing has been done yet either.
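
A rough sketch of the selection change (names assumed; the real patch tracks cached blocks via the listener):

    // When dynamic allocation picks idle executors to remove, leave out
    // any executor that still holds cached RDD blocks.
    val candidates = idleExecutors.filterNot(execId => hasCachedBlocks(execId))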

Author: Hari Shreedharan <hshreedharan@apache.org>

Closes #6508 from harishreedharan/dymanic-caching and squashes the following commits:

dddf1eb [Hari Shreedharan] Minor configuration description update.
10130e2 [Hari Shreedharan] Fix compile issue.
5417b53 [Hari Shreedharan] Add documentation for new config. Remove block from cachedBlocks when it is dropped.
875916a [Hari Shreedharan] Make some code more readable.
39940ca [Hari Shreedharan] Handle the case where the executor has not yet registered.
90ad711 [Hari Shreedharan] Remove unused imports and unused methods.
063985c [Hari Shreedharan] Send correct message instead of recursively calling same method.
ec2fd7e [Hari Shreedharan] Add file missed in last commit
5d10fad [Hari Shreedharan] Update cached blocks status using local info, rather than doing an RPC.
193af4c [Hari Shreedharan] WIP. Use local state rather than via RPC.
ae932ff [Hari Shreedharan] Fix config param name.
272969d [Hari Shreedharan] Fix seconds to millis bug.
5a1993f [Hari Shreedharan] Add timeout for cache executors. Ignore broadcast blocks while checking if there are cached blocks.
57fefc2 [Hari Shreedharan] [SPARK-7955][Core] Ensure executors with cached RDD blocks are not removed if dynamic allocation is enabled.

(cherry picked from commit 3285a51121)
Signed-off-by: Andrew Or <andrew@databricks.com>
2015-06-06 21:13:33 -07:00
Cheng Lian d8a53fb806 [SPARK-8079] [SQL] Makes InsertIntoHadoopFsRelation job/task abortion more robust
As described in SPARK-8079, when writing a DataFrame to a `HadoopFsRelation`, if `HadoopFsRelation.prepareForWriteJob` throws an exception, an unexpected NPE is thrown during job abortion. (This issue doesn't cause much damage, since the job is failing anyway.)

This PR makes the job/task abortion logic in `InsertIntoHadoopFsRelation` more robust to avoid such confusing exceptions.
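
A plausible shape for the hardening (a sketch, not the actual diff): abortion may run before the committer was ever initialized, and cleanup itself must never throw.

    def abortJob(): Unit = try {
      if (jobCommitter != null) {
        jobCommitter.abortJob(jobContext, JobStatus.State.FAILED)
      }
    } catch {
      case cause: Throwable => logError("Exception while aborting job.", cause)
    }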

Author: Cheng Lian <lian@databricks.com>

Closes #6612 from liancheng/spark-8079 and squashes the following commits:

87cd81e [Cheng Lian] Addresses @rxin's comment
1864c75 [Cheng Lian] Addresses review comments
9e6dbb3 [Cheng Lian] Makes InsertIntoHadoopFsRelation job/task abortion more robust

(cherry picked from commit 16fc49617e)
Signed-off-by: Cheng Lian <lian@databricks.com>
2015-06-06 17:23:46 +08:00
amey 84523fc387 [SPARK-7991] [PySpark] Adding support for passing lists to describe.
This is a minor change.

Author: amey <amey@skytree.net>

Closes #6655 from ameyc/JIRA-7991/support-passing-list-to-describe and squashes the following commits:

e8a1dff [amey] Adding support for passing lists to describe.

(cherry picked from commit 356a4a9b93)
Signed-off-by: Reynold Xin <rxin@databricks.com>
2015-06-05 13:49:55 -07:00
Luca Martinetti 94f65bccee [SPARK-7747] [SQL] [DOCS] spark.sql.planner.externalSort
Add documentation for spark.sql.planner.externalSort
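
For reference, the setting being documented is toggled like any other SQL conf (usage sketch):

    // Spill sorts to disk instead of keeping them purely in memory.
    sqlContext.setConf("spark.sql.planner.externalSort", "true")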

Author: Luca Martinetti <luca@luca.io>

Closes #6272 from lucamartinetti/docs-externalsort and squashes the following commits:

985661b [Luca Martinetti] [SPARK-7747] [SQL] [DOCS] Add documentation for spark.sql.planner.externalSort

(cherry picked from commit 4060526cd3)
Signed-off-by: Yin Huai <yhuai@databricks.com>
2015-06-05 13:41:52 -07:00
zsxwing 200c980a13 [SPARK-8112] [STREAMING] Fix the negative event count issue
Author: zsxwing <zsxwing@gmail.com>

Closes #6659 from zsxwing/SPARK-8112 and squashes the following commits:

a5d7da6 [zsxwing] Address comments
d255b6e [zsxwing] Fix the negative event count issue

(cherry picked from commit 4f16d3fe2e)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
2015-06-05 12:46:15 -07:00
Andrew Or 429c658519 Revert "[MINOR] [BUILD] Use custom temp directory during build."
This reverts commit 9b3e4c1871.
2015-06-05 10:54:06 -07:00
Shivaram Venkataraman 3e3151e755 [SPARK-8085] [SPARKR] Support user-specified schema in read.df
cc davies sun-rui

Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

Closes #6620 from shivaram/sparkr-read-schema and squashes the following commits:

16a6726 [Shivaram Venkataraman] Fix loadDF to pass schema Also add a unit test
a229877 [Shivaram Venkataraman] Use wrapper function to DataFrameReader
ee70ba8 [Shivaram Venkataraman] Support user-specified schema in read.df

(cherry picked from commit 12f5eaeee1)
Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
2015-06-05 10:19:15 -07:00
Akhil Das 0ef2e9d351 [STREAMING] Update streaming-kafka-integration.md
Fixed the broken links (Examples) in the documentation.

Author: Akhil Das <akhld@darktech.ca>

Closes #6666 from akhld/patch-2 and squashes the following commits:

2228b83 [Akhil Das] Update streaming-kafka-integration.md

(cherry picked from commit 019dc9f558)
Signed-off-by: Sean Owen <sowen@cloudera.com>
2015-06-05 14:24:06 +02:00
Marcelo Vanzin 9b3e4c1871 [MINOR] [BUILD] Use custom temp directory during build.
Even with all the efforts to clean up the temp directories created by
unit tests, Spark leaves a lot of garbage in /tmp after a test run.
This change overrides java.io.tmpdir to place those files under the
build directory instead.

After an sbt full unit test run, I was left with > 400 MB of temp
files. Since they're now under the build dir, it's much easier to
clean them up.

Also make a slight change to a unit test to make it not pollute the
source directory with test data.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #6653 from vanzin/unit-test-tmp and squashes the following commits:

31e2dd5 [Marcelo Vanzin] Fix tests that depend on each other.
aa92944 [Marcelo Vanzin] [minor] [build] Use custom temp directory during build.

(cherry picked from commit b16b5434ff)
Signed-off-by: Sean Owen <sowen@cloudera.com>
2015-06-05 14:12:05 +02:00
Sean Owen 90cf686386 [MINOR] remove unused interpolation var in log message
Completely trivial but I noticed this wrinkle in a log message today; `$sender` doesn't refer to anything and isn't interpolated here.

Author: Sean Owen <sowen@cloudera.com>

Closes #6650 from srowen/Interpolation and squashes the following commits:

518687a [Sean Owen] Actually interpolate log string
7edb866 [Sean Owen] Trivial: remove unused interpolation var in log message

(cherry picked from commit 3a5c4da473)
Signed-off-by: Reynold Xin <rxin@databricks.com>
2015-06-05 00:32:52 -07:00
Ted Blackman f02af7c8f7 [SPARK-8116][PYSPARK] Allow sc.range() to take a single argument.
Author: Ted Blackman <ted.blackman@gmail.com>

Closes #6656 from belisarius222/branch-1.4 and squashes the following commits:

747cbc2 [Ted Blackman] [SPARK-8116][PYSPARK] Allow sc.range() to take a single argument.
2015-06-04 22:21:11 -07:00
Carson Wang 3ba6fc515d [SPARK-8098] [WEBUI] Show correct length of bytes on log page
The log page should show only the desired number of bytes. Currently it shows all bytes from the startIndex to the end of the file, and the "Next" button on the page is always disabled.
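
The fix presumably reduces to clamping the read window (variable names assumed):

    // Read exactly the requested window of the log file, not through EOF.
    val endIndex = math.min(startIndex + byteLength, totalLength)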

Author: Carson Wang <carson.wang@intel.com>

Closes #6640 from carsonwang/logpage and squashes the following commits:

58cb3fd [Carson Wang] Show correct length of bytes on log page

(cherry picked from commit 63bc0c4430)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
2015-06-04 16:25:05 -07:00
Shivaram Venkataraman 0b71b851de [SPARK-8027] [SPARKR] Move man pages creation to install-dev.sh
This also helps us get rid of the sparkr-docs Maven profile, as docs are now built just by using -Psparkr when the roxygen2 package is available

Related to discussion in #6567

cc pwendell srowen -- Let me know if this looks better

Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

Closes #6593 from shivaram/sparkr-pom-cleanup and squashes the following commits:

b282241 [Shivaram Venkataraman] Remove sparkr-docs from release script as well
8f100a5 [Shivaram Venkataraman] Move man pages creation to install-dev.sh This also helps us get rid of the sparkr-docs maven profile as docs are now built by just using -Psparkr when the roxygen2 package is available

(cherry picked from commit 3dc005282a)
Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
2015-06-04 12:52:45 -07:00
Mike Dusenberry 81ff7a9012 [SPARK-7969] [SQL] Added a DataFrame.drop function that accepts a Column reference.
Added a `DataFrame.drop` function that accepts a `Column` reference rather than a `String`, and added associated unit tests. Basically, it iterates through the `DataFrame`'s columns to find one with an expression equivalent to that of the `Column` argument supplied to the function.
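
A usage sketch with hypothetical DataFrames `left` and `right` (the motivating case from the description: same-named columns after a join):

    val joined = left.join(right, left("id") === right("id"))
    // Dropping by name would be ambiguous here; dropping by Column is not.
    val result = joined.drop(right("id"))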

Author: Mike Dusenberry <dusenberrymw@gmail.com>

Closes #6585 from dusenberrymw/SPARK-7969_Drop_method_on_Dataframes_should_handle_Column and squashes the following commits:

514727a [Mike Dusenberry] Updating the @since tag of the drop(Column) function doc to reflect version 1.4.1 instead of 1.4.0.
2f1bb4e [Mike Dusenberry] Adding an additional assert statement to the 'drop column after join' unit test in order to make sure the correct column was indeed left over.
6bf7c0e [Mike Dusenberry] Minor code formatting change.
e583888 [Mike Dusenberry] Adding more Python doctests for the df.drop with column reference function to test joined datasets that have columns with the same name.
5f74401 [Mike Dusenberry] Updating DataFrame.drop with column reference function to use logicalPlan.output to prevent ambiguities resulting from columns with the same name. Also added associated unit tests for joined datasets with duplicate column names.
4b8bbe8 [Mike Dusenberry] Adding Python support for Dataframe.drop with a Column reference.
986129c [Mike Dusenberry] Added a DataFrame.drop function that accepts a Column reference rather than a String, and added associated unit tests.  Basically iterates through the DataFrame to find a column with an expression that is equivalent to one supplied to the function.

(cherry picked from commit df7da07a86)
Signed-off-by: Reynold Xin <rxin@databricks.com>
2015-06-04 11:30:25 -07:00
Daniel Darabos daf9451a4d Fix maxTaskFailures comment
If maxTaskFailures is 1, the task set is aborted after 1 task failure. Other documentation and the code support this reading; I think it's just this comment that was off. It's an easy mistake to make -- can you please double-check that I'm correct? Thanks!

Author: Daniel Darabos <darabos.daniel@gmail.com>

Closes #6621 from darabos/patch-2 and squashes the following commits:

dfebdec [Daniel Darabos] Fix comment.

(cherry picked from commit 10ba188087)
Signed-off-by: Sean Owen <sowen@cloudera.com>
2015-06-04 13:48:56 +02:00
Andrew Or 84da653192 [BUILD] Fix Maven build for Kinesis
A necessary dependency that is transitively referenced is not
provided, causing compilation failures in builds that provide
the kinesis-asl profile.
2015-06-03 20:47:53 -07:00
Andrew Or bfe74b34a6 [SPARK-7558] Demarcate tests in unit-tests.log (1.4)
This includes the following commits:

original: 9eb222c
hotfix1: 8c99793
hotfix2: a4f2412
scalastyle check: 609c492

---
Original patch #6441
Branch-1.3 patch #6602

Author: Andrew Or <andrew@databricks.com>

Closes #6598 from andrewor14/demarcate-tests-1.4 and squashes the following commits:

4c3c566 [Andrew Or] Merge branch 'branch-1.4' of github.com:apache/spark into demarcate-tests-1.4
e217b78 [Andrew Or] [SPARK-7558] Guard against direct uses of FunSuite / FunSuiteLike
46d4361 [Andrew Or] Various whitespace changes (minor)
3d9bf04 [Andrew Or] Make all test suites extend SparkFunSuite instead of FunSuite
eaa520e [Andrew Or] Fix tests?
b4d93de [Andrew Or] Fix tests
634a777 [Andrew Or] Fix log message
a932e8d [Andrew Or] Fix manual things that cannot be covered through automation
8bc355d [Andrew Or] Add core tests as dependencies in all modules
75d361f [Andrew Or] Introduce base abstract class for all test suites
2015-06-03 20:46:44 -07:00
Andrew Or 584a2ba21c [BUILD] Use right branch when checking against Hive (1.4)
For branch-1.4.

This is identical to #6629 and is not strictly necessary. I'm opening this as a PR since it changes Jenkins test behavior and I want to test it out here.

Author: Andrew Or <andrew@databricks.com>

Closes #6630 from andrewor14/build-check-hive-1.4 and squashes the following commits:

186ec65 [Andrew Or] [BUILD] Use right branch when checking against Hive
2015-06-03 18:09:14 -07:00
Andrew Or 96f71b105a [BUILD] Increase Jenkins test timeout
Currently the Hive tests alone take 40m. The right thing to do is
to reduce the test time. However, that is a bigger project, and
we currently have PRs blocked on tests timing out.
2015-06-03 17:45:07 -07:00
Shivaram Venkataraman c2c129073f [SPARK-8084] [SPARKR] Make SparkR scripts fail on error
cc shaneknapp pwendell JoshRosen

Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

Closes #6623 from shivaram/SPARK-8084 and squashes the following commits:

0ec5b26 [Shivaram Venkataraman] Make SparkR scripts fail on error

(cherry picked from commit 0576c3c4ff)
Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
2015-06-03 17:02:29 -07:00
Ryan Williams 16748694b8 [SPARK-8088] don't attempt to lower number of executors by 0
Author: Ryan Williams <ryan.blake.williams@gmail.com>

Closes #6624 from ryan-williams/execs and squashes the following commits:

b6f71d4 [Ryan Williams] don't attempt to lower number of executors by 0

(cherry picked from commit 51898b5158)
Signed-off-by: Andrew Or <andrew@databricks.com>
2015-06-03 16:54:52 -07:00
Andrew Or 0bc9a3ec42 [HOTFIX] [TYPO] Fix typo in #6546 2015-06-03 16:04:35 -07:00
Andrew Or d0be9508f5 [HOTFIX] Unbreak build from backporting #6546
This is caused by 7e46ea0228.
2015-06-03 15:25:35 -07:00
Xiangrui Meng b2a22a651f [SPARK-8051] [MLLIB] make StringIndexerModel silent if input column does not exist
This is just a workaround for a bigger problem. Some pipeline stages may not be effective during prediction, and they should not complain about missing required columns, e.g. `StringIndexerModel`. jkbradley
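
The workaround, sketched (assumed shape, inside `StringIndexerModel.transform`):

    // Pass the data through untouched when the input column is absent,
    // e.g. the label column at prediction time.
    if (!dataset.columns.contains(inputColName)) {
      dataset
    } else {
      transformWithIndex(dataset) // hypothetical name for the normal path
    }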

Author: Xiangrui Meng <meng@databricks.com>

Closes #6595 from mengxr/SPARK-8051 and squashes the following commits:

b6a36b9 [Xiangrui Meng] add doc
f143fd4 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-8051
8ee7c7e [Xiangrui Meng] use SparkFunSuite
e112394 [Xiangrui Meng] make StringIndexerModel silent if input column does not exist

(cherry picked from commit 26c9d7a0f9)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
2015-06-03 15:16:36 -07:00
Shivaram Venkataraman ca21fff7da [SPARK-3674] [EC2] Clear SPARK_WORKER_INSTANCES when using YARN
cc andrewor14

Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

Closes #6424 from shivaram/spark-worker-instances-yarn-ec2 and squashes the following commits:

db244ae [Shivaram Venkataraman] Make Python Lint happy
0593d1b [Shivaram Venkataraman] Clear SPARK_WORKER_INSTANCES when using YARN

(cherry picked from commit d3e026f879)
Signed-off-by: Andrew Or <andrew@databricks.com>
2015-06-03 15:14:44 -07:00
zsxwing 7e46ea0228 [SPARK-7989] [CORE] [TESTS] Fix flaky tests in ExternalShuffleServiceSuite and SparkListenerWithClusterSuite
The flaky tests in ExternalShuffleServiceSuite and SparkListenerWithClusterSuite will fail if there are not enough executors up before running the jobs.

This PR adds `JobProgressListener.waitUntilExecutorsUp`. The tests for the cluster mode can use it to wait until the expected executors are up.
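
A test would presumably use it roughly like this (the exact signature is an assumption):

    // Block until the cluster actually has the executors the test expects.
    sc.jobProgressListener.waitUntilExecutorsUp(numExecutors = 2, timeout = 10000)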

Author: zsxwing <zsxwing@gmail.com>

Closes #6546 from zsxwing/SPARK-7989 and squashes the following commits:

5560e09 [zsxwing] Fix a typo
3b69840 [zsxwing] Fix flaky tests in ExternalShuffleServiceSuite and SparkListenerWithClusterSuite

(cherry picked from commit f27134782e)
Signed-off-by: Andrew Or <andrew@databricks.com>

Conflicts:
	core/src/test/scala/org/apache/spark/broadcast/BroadcastSuite.scala
	core/src/test/scala/org/apache/spark/scheduler/SparkListenerWithClusterSuite.scala
2015-06-03 15:05:49 -07:00
zsxwing 306837e4e3 [SPARK-8001] [CORE] Make AsynchronousListenerBus.waitUntilEmpty throw TimeoutException if timeout
Some places forget to call `assert` to check the return value of `AsynchronousListenerBus.waitUntilEmpty`. Instead of adding `assert` in these places, I think it's better to make `AsynchronousListenerBus.waitUntilEmpty` throw `TimeoutException`.
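
A sketch of the changed contract (internals assumed; `TimeoutException` is `java.util.concurrent`'s): fail loudly on timeout instead of returning a Boolean that callers may forget to assert on.

    @throws(classOf[TimeoutException])
    def waitUntilEmpty(timeoutMillis: Long): Unit = {
      val deadline = System.currentTimeMillis + timeoutMillis
      while (!queueIsEmpty) { // queueIsEmpty: hypothetical internal state check
        if (System.currentTimeMillis > deadline) {
          throw new TimeoutException(
            s"The event queue is not empty after $timeoutMillis milliseconds")
        }
        Thread.sleep(10)
      }
    }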

Author: zsxwing <zsxwing@gmail.com>

Closes #6550 from zsxwing/SPARK-8001 and squashes the following commits:

607674a [zsxwing] Make AsynchronousListenerBus.waitUntilEmpty throw TimeoutException if timeout

(cherry picked from commit 1d8669f15c)
Signed-off-by: Andrew Or <andrew@databricks.com>
2015-06-03 15:03:15 -07:00