Commit graph

7992 commits

Author SHA1 Message Date
Holden Karau ba78383bac SPARK-3318: Documentation update in addFile on how to use SparkFiles.get
Rather than specifying the path to SparkFiles we need to use the filename.

Author: Holden Karau <holden@pigscanfly.ca>

Closes #2210 from holdenk/SPARK-3318-documentation-for-addfiles-should-say-to-use-file-not-path and squashes the following commits:

a25d27a [Holden Karau] Update the JavaSparkContext addFile method to be clear about using fileName with SparkFiles as well
0ebcb05 [Holden Karau] Documentation update in addFile on how to use SparkFiles.get to specify filename rather than path
2014-08-30 16:58:17 -07:00
Marcelo Vanzin b6cf134817 [SPARK-2889] Create Hadoop config objects consistently.
Different places in the code were instantiating Configuration / YarnConfiguration objects in different ways. This could lead to confusion for people who actually expected "spark.hadoop.*" options to end up in the configs used by Spark code, since that would only happen for the SparkContext's config.

This change modifies most places to use SparkHadoopUtil to initialize configs, and make that method do the translation that previously was only done inside SparkContext.

The places that were not changed fall in one of the following categories:
- Test code where this doesn't really matter
- Places deep in the code where plumbing SparkConf would be too difficult for very little gain
- Default values for arguments - since the caller can provide their own config in that case

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #1843 from vanzin/SPARK-2889 and squashes the following commits:

52daf35 [Marcelo Vanzin] Merge branch 'master' into SPARK-2889
f179013 [Marcelo Vanzin] Merge branch 'master' into SPARK-2889
51e71cf [Marcelo Vanzin] Add test to ensure that overriding Yarn configs works.
53f9506 [Marcelo Vanzin] Add DeveloperApi annotation.
3d345cb [Marcelo Vanzin] Restore old method for backwards compat.
fc45067 [Marcelo Vanzin] Merge branch 'master' into SPARK-2889
0ac3fdf [Marcelo Vanzin] Merge branch 'master' into SPARK-2889
3f26760 [Marcelo Vanzin] Compilation fix.
f16cadd [Marcelo Vanzin] Initialize config in SparkHadoopUtil.
b8ab173 [Marcelo Vanzin] Update Utils API to take a Configuration argument.
1e7003f [Marcelo Vanzin] Replace explicit Configuration instantiation with SparkHadoopUtil.
2014-08-30 14:48:07 -07:00
Reynold Xin d90434c035 Manually close old pull requests
Closes #1824
2014-08-29 23:21:38 -07:00
Raymond Liu acea92806c [SPARK-2288] Hide ShuffleBlockManager behind ShuffleManager
By Hiding the shuffleblockmanager behind Shufflemanager, we decouple the shuffle data's block mapping management work from Diskblockmananger. This give a more clear interface and more easy for other shuffle manager to implement their own block management logic. the jira ticket have more details.

Author: Raymond Liu <raymond.liu@intel.com>

Closes #1241 from colorant/shuffle and squashes the following commits:

0e01ae3 [Raymond Liu] Move ShuffleBlockmanager behind shuffleManager
2014-08-29 23:05:18 -07:00
Kousuke Saruta 7e662af332 [SPARK-3305] Remove unused import from UI classes.
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #2200 from sarutak/SPARK-3305 and squashes the following commits:

3cbd6ee [Kousuke Saruta] Removed unused import from classes related to UI
2014-08-29 22:52:32 -07:00
Patrick Wendell a004a8d879 BUILD: Adding back CDH4 as per user requests 2014-08-29 22:24:35 -07:00
Cheng Lian 32b18dd52c [SPARK-3320][SQL] Made batched in-memory column buffer building work for SchemaRDDs with empty partitions
Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #2213 from liancheng/spark-3320 and squashes the following commits:

45a0139 [Cheng Lian] Fixed typo in InMemoryColumnarQuerySuite
f67067d [Cheng Lian] Fixed SPARK-3320
2014-08-29 18:16:47 -07:00
wangfei 13901764f4 [SPARK-3296][mllib] spark-example should be run-example in head notation of DenseKMeans and SparseNaiveBayes
`./bin/spark-example`  should be `./bin/run-example` in DenseKMeans and SparseNaiveBayes

Author: wangfei <wangfei_hello@126.com>

Closes #2193 from scwf/run-example and squashes the following commits:

207eb3a [wangfei] spark-example should be run-example
27a8999 [wangfei] ./bin/spark-example should be ./bin/run-example
2014-08-29 17:37:15 -07:00
Zdenek Farana 98ddbe6cdb [SPARK-3173][SQL] Timestamp support in the parser
If you have a table with TIMESTAMP column, that column can't be used in WHERE clause properly - it is not evaluated properly. [More](https://issues.apache.org/jira/browse/SPARK-3173)

Motivation: http://www.aproint.com/aggregation-with-spark-sql/

- [x] modify SqlParser so it supports casting to TIMESTAMP (workaround for item 2)
- [x] the string literal should be converted into Timestamp if the column is Timestamp.

Author: Zdenek Farana <zdenek.farana@gmail.com>
Author: Zdenek Farana <zdenek.farana@aproint.com>

Closes #2084 from byF/SPARK-3173 and squashes the following commits:

442b59d [Zdenek Farana] Fixed test merge conflict
2dbf4f6 [Zdenek Farana] Merge remote-tracking branch 'origin/SPARK-3173' into SPARK-3173
65b6215 [Zdenek Farana] Fixed timezone sensitivity in the test
47b27b4 [Zdenek Farana] Now works in the case of "StringLiteral=TimestampColumn"
96a661b [Zdenek Farana] Code style change
491dfcf [Zdenek Farana] Added test cases for SPARK-3173
4446b1e [Zdenek Farana] A string literal is casted into Timestamp when the column is Timestamp.
59af397 [Zdenek Farana] Added a new TIMESTAMP keyword; CAST to TIMESTAMP now can be used in SQL expression.
2014-08-29 15:39:15 -07:00
qiping.lqp 634d04b87c [SPARK-3291][SQL]TestcaseName in createQueryTest should not contain ":"
":" is not allowed to appear in a file name of Windows system. If file name contains ":", this file can't be checked out in a Windows system and developers using Windows must be careful to not commit the deletion of such files, Which is very inconvenient.

Author: qiping.lqp <qiping.lqp@alibaba-inc.com>

Closes #2191 from chouqin/querytest and squashes the following commits:

0e943a1 [qiping.lqp] rename golden file
60a863f [qiping.lqp] TestcaseName in createQueryTest should not contain ":"
2014-08-29 15:37:43 -07:00
Cheng Lian d94a44d7ca [SPARK-3269][SQL] Decreases initial buffer size for row set to prevent OOM
When a large batch size is specified, `SparkSQLOperationManager` OOMs even if the whole result set is much smaller than the batch size.

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #2171 from liancheng/jdbc-fetch-size and squashes the following commits:

5e1623b [Cheng Lian] Decreases initial buffer size for row set to prevent OOM
2014-08-29 15:36:04 -07:00
Cheng Lian b1eccfc88a [SQL] Turns on in-memory columnar compression in HiveCompatibilitySuite
`HiveCompatibilitySuite` already turns on in-memory columnar caching, it would be good to also enable compression to improve test coverage.

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #2190 from liancheng/compression-on and squashes the following commits:

88b536c [Cheng Lian] Code cleanup, narrowed field visibility
d13efd2 [Cheng Lian] Turns on in-memory columnar compression in HiveCompatibilitySuite
2014-08-29 15:34:59 -07:00
Cheng Hao dc4d577c65 [SPARK-3198] [SQL] Remove the TreeNode.id
Thus id property of the TreeNode API does save time in a faster way to compare 2 TreeNodes, it is kind of performance bottleneck during the expression object creation in a multi-threading env (because of the memory barrier).
Fortunately, the tree node comparison only happen once in master, so even we remove it, the entire performance will not be affected.

Author: Cheng Hao <hao.cheng@intel.com>

Closes #2155 from chenghao-intel/treenode and squashes the following commits:

7cf2cd2 [Cheng Hao] Remove the implicit keyword for TreeNodeRef and some other small issues
5873415 [Cheng Hao] Remove the TreeNode.id
2014-08-29 15:32:26 -07:00
Cheng Lian 287c0ac772 [SPARK-3234][Build] Fixed environment variables that rely on deprecated command line options in make-distribution.sh
Please refer to [SPARK-3234](https://issues.apache.org/jira/browse/SPARK-3234) for details.

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #2208 from liancheng/spark-3234 and squashes the following commits:

fb26de8 [Cheng Lian] Fixed SPARK-3234
2014-08-29 15:29:43 -07:00
William Benton 2f1519defa SPARK-2813: [SQL] Implement SQRT() directly in Spark SQL
This PR adds a native implementation for SQL SQRT() and thus avoids delegating this function to Hive.

Author: William Benton <willb@redhat.com>

Closes #1750 from willb/spark-2813 and squashes the following commits:

22c8a79 [William Benton] Fixed missed newline from rebase
d673861 [William Benton] Added string coercions for SQRT and associated test case
e125df4 [William Benton] Added ExpressionEvaluationSuite test cases for SQRT
7b84bcd [William Benton] SQL SQRT now properly returns NULL for NULL inputs
8256971 [William Benton] added SQRT test to SqlQuerySuite
504d2e5 [William Benton] Added native SQRT implementation
2014-08-29 15:26:59 -07:00
Nicholas Chammas 53aa8316e8 [Docs] SQL doc formatting and typo fixes
As [reported on the dev list](http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-1-0-RC2-tp8107p8131.html):
* Code fencing with triple-backticks doesn’t seem to work like it does on GitHub. Newlines are lost. Instead, use 4-space indent to format small code blocks.
* Nested bullets need 2 leading spaces, not 1.
* Spellcheck!

Author: Nicholas Chammas <nicholas.chammas@gmail.com>
Author: nchammas <nicholas.chammas@gmail.com>

Closes #2201 from nchammas/sql-doc-fixes and squashes the following commits:

873f889 [Nicholas Chammas] [Docs] fix skip-api flag
5195e0c [Nicholas Chammas] [Docs] SQL doc formatting and typo fixes
3b26c8d [nchammas] [Spark QA] Link to console output on test time out
2014-08-29 15:23:32 -07:00
Davies Liu e248328b39 [SPARK-3307] [PySpark] Fix doc string of SparkContext.broadcast()
remove invalid docs

Author: Davies Liu <davies.liu@gmail.com>

Closes #2202 from davies/keep and squashes the following commits:

aa3b44f [Davies Liu] remove invalid docs
2014-08-29 11:47:49 -07:00
Kousuke Saruta 27df6ce6a2 [SPARK-3279] Remove useless field variable in ApplicationMaster
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #2177 from sarutak/SPARK-3279 and squashes the following commits:

2955edc [Kousuke Saruta] Removed useless field variable from ApplicationMaster
2014-08-28 22:52:53 -07:00
Reynold Xin 665e71d14d [SPARK-1912] Lazily initialize buffers for local shuffle blocks.
This is a simplified fix for SPARK-1912.

Author: Reynold Xin <rxin@apache.org>

Closes #2179 from rxin/SPARK-1912 and squashes the following commits:

b2f0e9e [Reynold Xin] Fix unit tests.
a8eddfe [Reynold Xin] [SPARK-1912] Lazily initialize buffers for local shuffle blocks.
2014-08-28 19:00:40 -07:00
nchammas 3c517a812e [Spark QA] Link to console output on test time out
When tests time out we should link to the Jenkins console output for easy review. We already do this for when tests start or complete normally.

Here's [a recent example](https://github.com/apache/spark/pull/2109#issuecomment-53374032) of where this would be helpful.

Author: nchammas <nicholas.chammas@gmail.com>

Closes #2140 from nchammas/patch-1 and squashes the following commits:

3b26c8d [nchammas] [Spark QA] Link to console output on test time out
2014-08-28 18:08:28 -07:00
Andrew Or a46b8f2d71 [SPARK-3277] Fix external spilling with LZ4 assertion error
**Summary of the changes**

The bulk of this PR is comprised of tests and documentation; the actual fix is really just adding 1 line of code (see `BlockObjectWriter.scala`). We currently do not run the `External*` test suites with different compression codecs, and this would have caught the bug reported in [SPARK-3277](https://issues.apache.org/jira/browse/SPARK-3277). This PR extends the existing code to test spilling using all compression codecs known to Spark, including `LZ4`.

**The bug itself**

In `DiskBlockObjectWriter`, we only report the shuffle bytes written before we close the streams. With `LZ4`, all the bytes written reported by our metrics were 0 because `flush()` was not taking effect for some reason. In general, compression codecs may write additional bytes to the file after we call `close()`, and so we must also capture those bytes in our shuffle write metrics.

Thanks mridulm and pwendell for help with debugging.

Author: Andrew Or <andrewor14@gmail.com>
Author: Patrick Wendell <pwendell@gmail.com>

Closes #2187 from andrewor14/fix-lz4-spilling and squashes the following commits:

1b54bdc [Andrew Or] Speed up tests by not compressing everything
1c4624e [Andrew Or] Merge branch 'master' of github.com:apache/spark into fix-lz4-spilling
6b2e7d1 [Andrew Or] Fix compilation error
92e251b [Patrick Wendell] Better documentation for BlockObjectWriter.
a1ad536 [Andrew Or] Fix tests
089593f [Andrew Or] Actually fix SPARK-3277 (tests still fail)
4bbcf68 [Andrew Or] Update tests to actually test all compression codecs
b264a84 [Andrew Or] ExternalAppendOnlyMapSuite code style fixes (minor)
1bfa743 [Andrew Or] Add more information to assert for better debugging
2014-08-28 17:05:21 -07:00
Sandy Ryza 92af2314f2 SPARK-3082. yarn.Client.logClusterResourceDetails throws NPE if requeste...
...d queue doesn't exist

Author: Sandy Ryza <sandy@cloudera.com>

Closes #1984 from sryza/sandy-spark-3082 and squashes the following commits:

fe08c37 [Sandy Ryza] Remove log message entirely
85253ad [Sandy Ryza] SPARK-3082. yarn.Client.logClusterResourceDetails throws NPE if requested queue doesn't exist
2014-08-28 16:18:50 -07:00
Ankur Dave 96df929069 [SPARK-3190] Avoid overflow in VertexRDD.count()
VertexRDDs with more than 4 billion elements are counted incorrectly due to integer overflow when summing partition sizes. This PR fixes the issue by converting partition sizes to Longs before summing them.

The following code previously returned -10000000. After applying this PR, it returns the correct answer of 5000000000 (5 billion).

```scala
val pairs = sc.parallelize(0L until 500L).map(_ * 10000000)
  .flatMap(start => start until (start + 10000000)).map(x => (x, x))
VertexRDD(pairs).count()
```

Author: Ankur Dave <ankurdave@gmail.com>

Closes #2106 from ankurdave/SPARK-3190 and squashes the following commits:

641f468 [Ankur Dave] Avoid overflow in VertexRDD.count()
2014-08-28 15:17:01 -07:00
Yadong Qi 39012452da [SPARK-3285] [examples] Using values.sum is easier to understand than using values.foldLeft(0)(_ + _)
def sum[B >: A](implicit num: Numeric[B]): B = foldLeft(num.zero)(num.plus)
Using values.sum is easier to understand than using values.foldLeft(0)(_ + _), so we'd better use values.sum instead of values.foldLeft(0)(_ + _)

Author: Yadong Qi <qiyadong2010@gmail.com>

Closes #2182 from watermen/bug-fix3 and squashes the following commits:

17be9fb [Yadong Qi] Update CheckpointSuite.scala
714bda5 [Yadong Qi] Update BasicOperationsSuite.scala
57e704c [Yadong Qi] Update StatefulNetworkWordCount.scala
2014-08-28 14:08:48 -07:00
Reynold Xin be53c54b5c [SPARK-3281] Remove Netty specific code in BlockManager / shuffle
Netty functionality will be added back in subsequent PRs by using the BlockTransferService interface.

Author: Reynold Xin <rxin@apache.org>

Closes #2181 from rxin/SPARK-3281 and squashes the following commits:

5494b0e [Reynold Xin] Fix extra port.
ff6d1e1 [Reynold Xin] [SPARK-3281] Remove Netty specific code in BlockManager.
2014-08-28 14:08:07 -07:00
Andrew Or 41dc5987d9 [SPARK-3264] Allow users to set executor Spark home in Mesos
The executors and the driver may not share the same Spark home. There is currently one way to set the executor side Spark home in Mesos, through setting `spark.home`. However, this is neither documented nor intuitive. This PR adds a more specific config `spark.mesos.executor.home` and exposes this to the user.

liancheng tnachen

Author: Andrew Or <andrewor14@gmail.com>

Closes #2166 from andrewor14/mesos-spark-home and squashes the following commits:

b87965e [Andrew Or] Merge branch 'master' of github.com:apache/spark into mesos-spark-home
f6abb2e [Andrew Or] Document spark.mesos.executor.home
ca7846d [Andrew Or] Add more specific configuration for executor Spark home in Mesos
2014-08-28 11:05:44 -07:00
Cheng Lian 6d392b36ee [SPARK-2608][Core] Fixed command line option passing issue over Mesos via SPARK_EXECUTOR_OPTS
This is another try after #2145 to fix [SPARK-2608](https://issues.apache.org/jira/browse/SPARK-2608).

The basic idea is to pass `extraJavaOpts` and `extraLibraryPath` together via environment variable `SPARK_EXECUTOR_OPTS`. This variable is recognized by `spark-class` and not used anywhere else. In this way, we still launch Mesos executors with `spark-class`/`spark-executor`, but avoids the executor side Spark home issue.

Quoted string with spaces is not allowed in either `extraJavaOpts` or `extraLibraryPath` when using Spark over Mesos. The reason is that Mesos passes the whole command line as a single string argument to `sh -c` to start the executor, and this makes shell string escaping non-trivial to handle. This should be fixed in a later release.

Classes in package `org.apache.spark.deploy` shouldn't be used as they assume Spark is deployed in standalone mode, and give wrong executor side Spark home directory. Please refer to comments in #2145 for more details.

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #2161 from liancheng/mesos-fix-with-env-var and squashes the following commits:

ba59190 [Cheng Lian] Added fine grained Mesos executor support
1174076 [Cheng Lian] Draft fix for CoarseMesosSchedulerBackend

(cherry picked from commit 935bffe3bf)
Signed-off-by: Reynold Xin <rxin@apache.org>
2014-08-28 11:03:48 -07:00
Tatiana Borisova 70d814665b [SPARK-3150] Fix NullPointerException in in Spark recovery: Add initializing default values in DriverInfo.init()
The issue happens when Spark is run standalone on a cluster.
When master and driver fall simultaneously on one node in a cluster, master tries to recover its state and restart spark driver.
While restarting driver, it falls with NPE exception (stacktrace is below).
After falling, it restarts and tries to recover its state and restart Spark driver again. It happens over and over in an infinite cycle.
Namely, Spark tries to read DriverInfo state from zookeeper, but after reading it happens to be null in DriverInfo.worker.

https://issues.apache.org/jira/browse/SPARK-3150

Author: Tatiana Borisova <tanyatik@yandex.ru>

Closes #2062 from tanyatik/spark-3150 and squashes the following commits:

9936043 [Tatiana Borisova] Add initializing default values in DriverInfo.init()
2014-08-28 10:36:36 -07:00
Michael Armbrust 76e3ba4264 [SPARK-3230][SQL] Fix udfs that return structs
We need to convert the case classes into Rows.

Author: Michael Armbrust <michael@databricks.com>

Closes #2133 from marmbrus/structUdfs and squashes the following commits:

189722f [Michael Armbrust] Merge remote-tracking branch 'origin/master' into structUdfs
8e29b1c [Michael Armbrust] Use existing function
d8d0b76 [Michael Armbrust] Fix udfs that return structs
2014-08-28 00:15:23 -07:00
Cheng Lian 68f75dcdfe [SQL] Fixed 2 comment typos in SQLConf
Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #2172 from liancheng/sqlconf-typo and squashes the following commits:

115cc71 [Cheng Lian] Fixed 2 comment typos in SQLConf
2014-08-28 00:08:09 -07:00
Michael Armbrust 024178c574 [HOTFIX][SQL] Remove cleaning of UDFs
It is not safe to run the closure cleaner on slaves.  #2153 introduced this which broke all UDF execution on slaves.  Will re-add cleaning of UDF closures in a follow-up PR.

Author: Michael Armbrust <michael@databricks.com>

Closes #2174 from marmbrus/fixUdfs and squashes the following commits:

55406de [Michael Armbrust] [HOTFIX] Remove cleaning of UDFs
2014-08-27 23:05:34 -07:00
Andrew Or dafe343499 [HOTFIX] Wait for EOF only for the PySpark shell
In `SparkSubmitDriverBootstrapper`, we wait for the parent process to send us an `EOF` before finishing the application. This is applicable for the PySpark shell because we terminate the application the same way. However if we run a python application, for instance, the JVM actually never exits unless it receives a manual EOF from the user. This is causing a few tests to timeout.

We only need to do this for the PySpark shell because Spark submit runs as a python subprocess only in this case. Thus, the normal Spark shell doesn't need to go through this case even though it is also a REPL.

Thanks davies for reporting this.

Author: Andrew Or <andrewor14@gmail.com>

Closes #2170 from andrewor14/bootstrap-hotfix and squashes the following commits:

42963f5 [Andrew Or] Do not wait for EOF unless this is the pyspark shell
2014-08-27 23:03:46 -07:00
Rob O'Dwyer f38fab97c7 SPARK-3265 Allow using custom ipython executable with pyspark
Although you can make pyspark use ipython with `IPYTHON=1`, and also change the python executable with `PYSPARK_PYTHON=...`, you can't use both at the same time because it hardcodes the default ipython script.

This makes it use the `PYSPARK_PYTHON` variable if present and fall back to default python, similarly to how the default python executable is handled.

So you can use a custom ipython like so:
`PYSPARK_PYTHON=./anaconda/bin/ipython IPYTHON_OPTS="notebook" pyspark`

Author: Rob O'Dwyer <odwyerrob@gmail.com>

Closes #2167 from robbles/patch-1 and squashes the following commits:

d98e8a9 [Rob O'Dwyer] Allow using custom ipython executable with pyspark
2014-08-27 19:47:33 -07:00
scwf b86277c132 [SPARK-3271] delete unused methods in Utils
delete no used method in Utils

Author: scwf <wangfei1@huawei.com>

Closes #2160 from scwf/delete-no-use-method and squashes the following commits:

d8f6b0d [scwf] delete no use method in Utils
2014-08-27 19:44:30 -07:00
Matthew Farrellee 64d8ecbbe9 Add line continuation for script to work w/ py2.7.5
Error was -

$ SPARK_HOME=$PWD/dist ./dev/create-release/generate-changelist.py
  File "./dev/create-release/generate-changelist.py", line 128
    if day < SPARK_REPO_CHANGE_DATE1 or
                                      ^
SyntaxError: invalid syntax

Author: Matthew Farrellee <matt@redhat.com>

Closes #2139 from mattf/master-fix-generate-changelist.py-0 and squashes the following commits:

6b3a900 [Matthew Farrellee] Add line continuation for script to work w/ py2.7.5
2014-08-27 15:50:30 -07:00
Patrick Wendell 8712653f11 HOTFIX: Don't build with YARN support for Mapr3 2014-08-27 15:41:09 -07:00
Michael Armbrust 7d2a7a91f2 [SPARK-3235][SQL] Ensure in-memory tables don't always broadcast.
Author: Michael Armbrust <michael@databricks.com>

Closes #2147 from marmbrus/inMemDefaultSize and squashes the following commits:

5390360 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into inMemDefaultSize
14204d3 [Michael Armbrust] Set the context before creating SparkLogicalPlans.
8da4414 [Michael Armbrust] Make sure we throw errors when leaf nodes fail to provide statistcs
18ce029 [Michael Armbrust] Ensure in-memory tables don't always broadcast.
2014-08-27 15:14:08 -07:00
luogankun 65253502b9 [SPARK-3065][SQL] Add locale setting to fix results do not match for udf_unix_timestamp format "yyyy MMM dd h:mm:ss a" run with not "America/Los_Angeles" TimeZone in HiveCompatibilitySuite
When run the udf_unix_timestamp of org.apache.spark.sql.hive.execution.HiveCompatibilitySuite testcase
with not "America/Los_Angeles" TimeZone throws error. [https://issues.apache.org/jira/browse/SPARK-3065]
add locale setting on beforeAll and afterAll method to fix the bug of HiveCompatibilitySuite testcase

Author: luogankun <luogankun@gmail.com>

Closes #1968 from luogankun/SPARK-3065 and squashes the following commits:

c167832 [luogankun] [SPARK-3065][SQL] Add Locale setting to HiveCompatibilitySuite
0a25e3a [luogankun] [SPARK-3065][SQL] Add Locale setting to HiveCompatibilitySuite
2014-08-27 15:08:22 -07:00
Aaron Davidson cc275f4b79 [SQL] [SPARK-3236] Reading Parquet tables from Metastore mangles location
Currently we do `relation.hiveQlTable.getDataLocation.getPath`, which returns the path-part of the URI (e.g., "s3n://my-bucket/my-path" => "/my-path"). We should do `relation.hiveQlTable.getDataLocation.toString` instead, as a URI's toString returns a faithful representation of the full URI, which can later be passed into a Hadoop Path.

Author: Aaron Davidson <aaron@databricks.com>

Closes #2150 from aarondav/parquet-location and squashes the following commits:

459f72c [Aaron Davidson] [SQL] [SPARK-3236] Reading Parquet tables from Metastore mangles location
2014-08-27 15:05:47 -07:00
viirya 28d41d6279 [SPARK-3252][SQL] Add missing condition for test
According to the text message, both relations should be tested. So add the missing condition.

Author: viirya <viirya@gmail.com>

Closes #2159 from viirya/fix_test and squashes the following commits:

b1c0f52 [viirya] add missing condition.
2014-08-27 14:55:05 -07:00
Andrew Or 63a053ab14 [SPARK-3243] Don't use stale spark-driver.* system properties
If we set both `spark.driver.extraClassPath` and `--driver-class-path`, then the latter correctly overrides the former. However, the value of the system property `spark.driver.extraClassPath` still uses the former, which is actually not added to the class path. This may cause some confusion...

Of course, this also affects other options (i.e. java options, library path, memory...).

Author: Andrew Or <andrewor14@gmail.com>

Closes #2154 from andrewor14/driver-submit-configs-fix and squashes the following commits:

17ec6fc [Andrew Or] Fix tests
0140836 [Andrew Or] Don't forget spark.driver.memory
e39d20f [Andrew Or] Also set spark.driver.extra* configs in client mode
2014-08-27 14:46:56 -07:00
Vida Ha 7faf755ae4 Spark-3213 Fixes issue with spark-ec2 not detecting slaves created with "Launch More like this"
... copy the spark_cluster_tag from a spot instance requests over to the instances.

Author: Vida Ha <vida@databricks.com>

Closes #2163 from vidaha/vida/spark-3213 and squashes the following commits:

5070a70 [Vida Ha] Spark-3214 Fix issue with spark-ec2 not detecting slaves created with 'Launch More Like This' and using Spot Requests
2014-08-27 14:26:06 -07:00
Davies Liu 4fa2fda88f [SPARK-2871] [PySpark] add RDD.lookup(key)
RDD.lookup(key)

        Return the list of values in the RDD for key `key`. This operation
        is done efficiently if the RDD has a known partitioner by only
        searching the partition that the key maps to.

        >>> l = range(1000)
        >>> rdd = sc.parallelize(zip(l, l), 10)
        >>> rdd.lookup(42)  # slow
        [42]
        >>> sorted = rdd.sortByKey()
        >>> sorted.lookup(42)  # fast
        [42]

It also clean up the code in RDD.py, and fix several bugs (related to preservesPartitioning).

Author: Davies Liu <davies.liu@gmail.com>

Closes #2093 from davies/lookup and squashes the following commits:

1789cd4 [Davies Liu] `f` in foreach could be generator or not.
2871b80 [Davies Liu] Merge branch 'master' into lookup
c6390ea [Davies Liu] address all comments
0f1bce8 [Davies Liu] add test case for lookup()
be0e8ba [Davies Liu] fix preservesPartitioning
eb1305d [Davies Liu] add RDD.lookup(key)
2014-08-27 13:18:33 -07:00
chutium 48f42781de [SPARK-3138][SQL] sqlContext.parquetFile should be able to take a single file as parameter
```if (!fs.getFileStatus(path).isDir) throw Exception``` make no sense after this commit #1370

be careful if someone is working on SPARK-2551, make sure the new change passes test case ```test("Read a parquet file instead of a directory")```

Author: chutium <teng.qiu@gmail.com>

Closes #2044 from chutium/parquet-singlefile and squashes the following commits:

4ae477f [chutium] [SPARK-3138][SQL] sqlContext.parquetFile should be able to take a single file as parameter
2014-08-27 13:13:04 -07:00
Chip Senkbeil 191d7cf2a6 [SPARK-3256] Added support for :cp <jar> that was broken in Scala 2.10.x for REPL
As seen with [SI-6502](https://issues.scala-lang.org/browse/SI-6502) of Scala, the _:cp_ command was broken in Scala 2.10.x. As the Spark shell is a friendly wrapper on top of the Scala REPL, it is also affected by this problem.

My solution was to alter the internal classpath and invalidate any new entries. I also had to add the ability to add new entries to the parent classloader of the interpreter (SparkIMain's global).

The advantage of this versus wiping the interpreter and replaying all of the commands is that you don't have to worry about rerunning heavy Spark-related commands (going to the cluster) or potentially reloading data that might have changed. Instead, you get to work from where you left off.

Until this is fixed upstream for 2.10.x, I had to use reflection to alter the internal compiler classpath.

The solution now looks like this:
![screen shot 2014-08-13 at 3 46 02 pm](https://cloud.githubusercontent.com/assets/2481802/3912625/f02b1440-232c-11e4-9bf6-bafb3e352d14.png)

Author: Chip Senkbeil <rcsenkbe@us.ibm.com>

Closes #1929 from rcsenkbeil/FixReplClasspathSupport and squashes the following commits:

f420cbf [Chip Senkbeil] Added SparkContext.addJar calls to support executing code on remote clusters
a826795 [Chip Senkbeil] Updated AddUrlsToClasspath to use 'new Run' suggestion over hackish compiler error
2ff1d86 [Chip Senkbeil] Added compilation failure on symbols hack to get Scala classes to load correctly
a220639 [Chip Senkbeil] Added support for :cp <jar> that was broken in Scala 2.10.x for REPL
2014-08-27 13:01:11 -07:00
Cheng Hao 4238c17dc9 [SPARK-3197] [SQL] Reduce the Expression tree object creations for aggregation function (min/max)
Aggregation function min/max in catalyst will create expression tree for each single row, however, the expression tree creation is quite expensive in a multithreading env currently. Hence we got a very bad performance for the min/max.
Here is the benchmark that I've done in my local.

Master | Previous Result (ms) | Current Result (ms)
------------ | ------------- | -------------
local | 3645 | 3416
local[6] | 3602 | 1002

The Benchmark source code.
```
case class Record(key: Int, value: Int)

object TestHive2 extends HiveContext(new SparkContext("local[6]", "TestSQLContext", new SparkConf()))

object DataPrepare extends App {
  import TestHive2._

  val rdd = sparkContext.parallelize((1 to 10000000).map(i => Record(i % 3000, i)), 12)

  runSqlHive("SHOW TABLES")
  runSqlHive("DROP TABLE if exists a")
  runSqlHive("DROP TABLE if exists result")
  rdd.registerAsTable("records")

  runSqlHive("""CREATE TABLE a (key INT, value INT)
                 | ROW FORMAT SERDE
                 | 'org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe'
                 | STORED AS RCFILE
               """.stripMargin)
  runSqlHive("""CREATE TABLE result (key INT, value INT)
                 | ROW FORMAT SERDE
                 | 'org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe'
                 | STORED AS RCFILE
               """.stripMargin)

  hql(s"""from records
             | insert into table a
             | select key, value
           """.stripMargin)
}

object PerformanceTest extends App {
  import TestHive2._

  hql("SHOW TABLES")
  hql("set spark.sql.shuffle.partitions=12")

  val cmd = "select min(value), max(value) from a group by key"

  val results = ("Result1", benchmark(cmd)) ::
                ("Result2", benchmark(cmd)) ::
                ("Result3", benchmark(cmd)) :: Nil
  results.foreach { case (prompt, result) => {
      println(s"$prompt: took ${result._1} ms (${result._2} records)")
    }
  }

  def benchmark(cmd: String) = {
    val begin = System.currentTimeMillis()
    val count = hql(cmd).count
    val end = System.currentTimeMillis()
    ((end - begin), count)
  }
}
```

Author: Cheng Hao <hao.cheng@intel.com>

Closes #2113 from chenghao-intel/aggregation_expression_optimization and squashes the following commits:

db40395 [Cheng Hao] remove the transient and add val for the expression property
d56167d [Cheng Hao] Reduce the Expressions creation
2014-08-27 12:50:47 -07:00
u0jing 3b5eb7083d [SPARK-3118][SQL]add "SHOW TBLPROPERTIES tblname;" and "SHOW COLUMNS (FROM|IN) table_name [(FROM|IN) db_name]" support
JIRA issue: [SPARK-3118] https://issues.apache.org/jira/browse/SPARK-3118

eg:
> SHOW TBLPROPERTIES test;
SHOW TBLPROPERTIES test;
numPartitions	0
numFiles	1
transient_lastDdlTime	1407923642
numRows	0
totalSize	82
rawDataSize	0

eg:
> SHOW COLUMNS  in test;
SHOW COLUMNS  in test;
OK
Time taken: 0.304 seconds
id
stid
bo

Author: u0jing <u9jing@gmail.com>

Closes #2034 from u0jing/spark-3118 and squashes the following commits:

b231d87 [u0jing] add golden answer files
35f4885 [u0jing] add 'show columns' and 'show tblproperties' support
2014-08-27 12:47:14 -07:00
Allan Douglas R. de Oliveira 5ac4093c9f SPARK-3259 - User data should be given to the master
Author: Allan Douglas R. de Oliveira <allan@chaordicsystems.com>

Closes #2162 from douglaz/user_data_master and squashes the following commits:

10d15f6 [Allan Douglas R. de Oliveira] Give user data also to the master
2014-08-27 12:43:22 -07:00
uncleGen d8298c46b7 [SPARK-3170][CORE][BUG]:RDD info loss in "StorageTab" and "ExecutorTab"
compeleted stage only need to remove its own partitions that are no longer cached. However, "StorageTab" may lost some rdds which are cached actually. Not only in "StorageTab", "ExectutorTab" may also lose some rdd info which have been overwritten by last rdd in a same task.
1. "StorageTab": when multiple stages run simultaneously, completed stage will remove rdd info which belong to other stages that are still running.
2. "ExectutorTab": taskcontext may lose some "updatedBlocks" info of  rdds  in a dependency chain. Like the following example:
         val r1 = sc.paralize(..).cache()
         val r2 = r1.map(...).cache()
         val n = r2.count()

When count the r2, r1 and r2 will be cached finally. So in CacheManager.getOrCompute, the taskcontext should contain "updatedBlocks" of r1 and r2. Currently, the "updatedBlocks" only contain the info of r2.

Author: uncleGen <hustyugm@gmail.com>

Closes #2131 from uncleGen/master_ui_fix and squashes the following commits:

a6a8a0b [uncleGen] fix some coding style
3a1bc15 [uncleGen] fix some error in unit test
56ea488 [uncleGen] there's some line too long
c82ba82 [uncleGen] Bug Fix: RDD info loss in "StorageTab" and "ExecutorTab"
2014-08-27 10:33:01 -07:00
Marcelo Vanzin b92d823ad1 [SPARK-2933] [yarn] Refactor and cleanup Yarn AM code.
This change modifies the Yarn module so that all the logic related
to running the ApplicationMaster is localized. Instead of, previously,
4 different classes with mostly identical code, now we have:

- A single, shared ApplicationMaster class, which can operate both in
  client and cluster mode, and substitutes the old ApplicationMaster
  (for cluster mode) and ExecutorLauncher (for client mode).

The benefit here is that all different execution modes for all supported
yarn versions use the same shared code for monitoring executor allocation,
setting up configuration, and monitoring the process's lifecycle.

- A new YarnRMClient interface, which defines basic RM functionality needed
  by the ApplicationMaster. This interface has concrete implementations for
  each supported Yarn version.

- A new YarnAllocator interface, which just abstracts the existing interface
  of the YarnAllocationHandler class. This is to avoid having to touch the
  allocator code too much in this change, although it might benefit from a
  similar effort in the future.

The end result is much easier to understand code, with much less duplication,
making it much easier to fix bugs, add features, and test everything knowing
that all supported versions will behave the same.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #2020 from vanzin/SPARK-2933 and squashes the following commits:

3bbf3e7 [Marcelo Vanzin] Merge branch 'master' into SPARK-2933
ff389ed [Marcelo Vanzin] Do not interrupt reporter thread from within itself.
3a8ed37 [Marcelo Vanzin] Remote stale comment.
0f5142c [Marcelo Vanzin] Review feedback.
41f8c8a [Marcelo Vanzin] Fix app status reporting.
c0794be [Marcelo Vanzin] Correctly clean up staging directory.
92770cc [Marcelo Vanzin] Merge branch 'master' into SPARK-2933
ecaf332 [Marcelo Vanzin] Small fix to shutdown code.
f02d3f8 [Marcelo Vanzin] Merge branch 'master' into SPARK-2933
f581122 [Marcelo Vanzin] Review feedback.
557fdeb [Marcelo Vanzin] Cleanup a couple more constants.
be6068d [Marcelo Vanzin] Restore shutdown hook to clean up staging dir.
5150993 [Marcelo Vanzin] Some more cleanup.
b6289ab [Marcelo Vanzin] Move cluster/client code to separate methods.
ecb23cd [Marcelo Vanzin] More trivial cleanup.
34f1e63 [Marcelo Vanzin] Fix some questionable error handling.
5657c7d [Marcelo Vanzin] Finish app if SparkContext initialization times out.
0e4be3d [Marcelo Vanzin] Keep "ExecutorLauncher" as the main class for client-mode AM.
91beabb [Marcelo Vanzin] Fix UI filter registration.
8c72239 [Marcelo Vanzin] Trivial cleanups.
99a52d5 [Marcelo Vanzin] Changes to the yarn-alpha project to use common AM code.
848ca6d [Marcelo Vanzin] [SPARK-2933] [yarn] Refactor and cleanup Yarn AM code.
2014-08-27 11:02:04 -05:00