11767 commits

Author SHA1 Message Date
zhichao.li 1a7a7d7d57 [SPARK-8213][SQL]Add function factorial
Author: zhichao.li <zhichao.li@intel.com>

Closes #6822 from zhichao-li/factorial and squashes the following commits:

26edf4f [zhichao.li] add factorial
2015-07-02 20:37:31 -07:00
Bryan Cutler aa7bbc1438 [SPARK-6980] [CORE] Akka timeout exceptions indicate which conf controls them (RPC Layer)
Latest changes after refactoring to the RPC layer.  I rebased against trunk to make sure to get any recent changes since it had been a while.  I wasn't crazy about the name `ConfigureTimeout` and `RpcTimeout` seemed to fit better, but I'm open to suggestions!

I ran most of the tests and they pass, but others would get stuck with "WARN TaskSchedulerImpl: Initial job has not accepted any resources".  I think it's just my machine, so I thought I would push what I have anyway.

Still left to do:
* I only added a couple unit tests so far, there are probably some more cases to test
* Make sure all uses require a `RpcTimeout`
* Right now, both the `ask` and `Await.result` use the same timeout, should we differentiate between these in the TimeoutException message?
* I wrapped `Await.result` in `RpcTimeout`, should we also wrap `Await.ready`?
* Proper scoping of classes and methods

hardmettle, feel free to help out with any of these!
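
The idea of a timeout that reports which configuration controls it can be sketched as follows. This is a minimal Python illustration, not the Scala code from the PR: the class names, the message wording, and the property key passed in by the caller are assumptions made for the example.

```python
import concurrent.futures


class RpcTimeoutError(Exception):
    """Timeout error whose message names the controlling configuration key."""


class RpcTimeout:
    """Pairs a timeout duration with the configuration key it came from."""

    def __init__(self, duration_s, timeout_prop):
        self.duration_s = duration_s      # how long to wait, in seconds
        self.timeout_prop = timeout_prop  # hypothetical config key name

    def await_result(self, future):
        """Wait for the future; on timeout, re-raise with the key attached."""
        try:
            return future.result(timeout=self.duration_s)
        except concurrent.futures.TimeoutError as err:
            raise RpcTimeoutError(
                "Future timed out after %s s. This timeout is controlled by %s"
                % (self.duration_s, self.timeout_prop)
            ) from err
```

Wrapping the wait in one place means every caller gets the same hint about which setting to raise, instead of a bare timeout.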

Author: Bryan Cutler <bjcutler@us.ibm.com>
Author: Harsh Gupta <harsh@Harshs-MacBook-Pro.local>
Author: BryanCutler <cutlerb@gmail.com>

Closes #6205 from BryanCutler/configTimeout-6980 and squashes the following commits:

46c8d48 [Bryan Cutler] [SPARK-6980] Changed RpcEnvSuite test to never reply instead of just sleeping, to avoid possible sync issues
06afa53 [Bryan Cutler] [SPARK-6980] RpcTimeout class extends Serializable, was causing error in MasterSuite
7bb70f1 [Bryan Cutler] Merge branch 'master' into configTimeout-6980
dbd5f73 [Bryan Cutler] [SPARK-6980] Changed RpcUtils askRpcTimeout and lookupRpcTimeout scope to private[spark] and improved deprecation warning msg
4e89c75 [Bryan Cutler] [SPARK-6980] Missed one usage of deprecated RpcUtils.askTimeout in YarnSchedulerBackend although it is not being used, and fixed SparkConfSuite UT to not use deprecated RpcUtils functions
6a1c50d [Bryan Cutler] [SPARK-6980] Minor cleanup of test case
7f4d78e [Bryan Cutler] [SPARK-6980] Fixed scala style checks
287059a [Bryan Cutler] [SPARK-6980] Removed extra import in AkkaRpcEnvSuite
3d8b1ff [Bryan Cutler] [SPARK-6980] Cleaned up imports in AkkaRpcEnvSuite
3a168c7 [Bryan Cutler] [SPARK-6980] Rewrote Akka RpcTimeout UTs in RpcEnvSuite
7636189 [Bryan Cutler] [SPARK-6980] Fixed call to askWithReply in DAGScheduler to use RpcTimeout - this was being compiled by auto-tupling and changing the message type of BlockManagerHeartbeat
be11c4e [Bryan Cutler] Merge branch 'master' into configTimeout-6980
039afed [Bryan Cutler] [SPARK-6980] Corrected import organization
218aa50 [Bryan Cutler] [SPARK-6980] Corrected issues from feedback
fadaf6f [Bryan Cutler] [SPARK-6980] Put back in deprecated RpcUtils askTimeout and lookupTimout to fix MiMa errors
fa6ed82 [Bryan Cutler] [SPARK-6980] Had to increase timeout on positive test case because a processor slowdown could trigger an Future TimeoutException
b05d449 [Bryan Cutler] [SPARK-6980] Changed constructor to use val duration instead of getter function, changed name of string property from conf to timeoutProp for consistency
c6cfd33 [Bryan Cutler] [SPARK-6980] Changed UT ask message timeout to explicitly intercept a SparkException
1394de6 [Bryan Cutler] [SPARK-6980] Moved MessagePrefix to createRpcTimeoutException directly
1517721 [Bryan Cutler] [SPARK-6980] RpcTimeout object scope should be private[spark]
2206b4d [Bryan Cutler] [SPARK-6980] Added unit test for ask then immediat awaitReply
1b9beab [Bryan Cutler] [SPARK-6980] Cleaned up import ordering
08f5afc [Bryan Cutler] [SPARK-6980] Added UT for constructing RpcTimeout with default value
d3754d1 [Bryan Cutler] [SPARK-6980] Added akkaConf to prevent dead letter logging
995d196 [Bryan Cutler] [SPARK-6980] Cleaned up import ordering, comments, spacing from PR feedback
7774d56 [Bryan Cutler] [SPARK-6980] Cleaned up UT imports
4351c48 [Bryan Cutler] [SPARK-6980] Added UT for addMessageIfTimeout, cleaned up UTs
1607a5f [Bryan Cutler] [SPARK-6980] Changed addMessageIfTimeout to PartialFunction, cleanup from PR comments
2f94095 [Bryan Cutler] [SPARK-6980] Added addMessageIfTimeout for when a Future is completed with TimeoutException
235919b [Bryan Cutler] [SPARK-6980] Resolved conflicts after master merge
c07d05c [Bryan Cutler] Merge branch 'master' into configTimeout-6980-tmp
b7fb99f [BryanCutler] Merge pull request #2 from hardmettle/configTimeoutUpdates_6980
4be3a8d [Harsh Gupta] Modifying loop condition to find property match
0ee5642 [Harsh Gupta] Changing the loop condition to halt at the first match in the property list for RpcEnv exception catch
f74064d [Harsh Gupta] Retrieving properties from property list using iterator and while loop instead of chained functions
a294569 [Bryan Cutler] [SPARK-6980] Added creation of RpcTimeout with Seq of property keys
23d2f26 [Bryan Cutler] [SPARK-6980] Fixed await result not being handled by RpcTimeout
49f9f04 [Bryan Cutler] [SPARK-6980] Minor cleanup and scala style fix
5b59a44 [Bryan Cutler] [SPARK-6980] Added some RpcTimeout unit tests
78a2c0a [Bryan Cutler] [SPARK-6980] Using RpcTimeout.awaitResult for future in AppClient now
97523e0 [Bryan Cutler] [SPARK-6980] Akka ask timeout description refactored to RPC layer
2015-07-02 21:38:21 -05:00
Josh Rosen d9838196ff [SPARK-8782] [SQL] Fix code generation for ORDER BY NULL
This fixes code generation for queries containing `ORDER BY NULL`.  Previously, the generated code would fail to compile.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #7179 from JoshRosen/generate-order-fixes and squashes the following commits:

6ef49a6 [Josh Rosen] Fix ORDER BY NULL
0036696 [Josh Rosen] Add regression test for SPARK-8782 (ORDER BY NULL)
2015-07-02 18:07:09 -07:00
Reynold Xin e589e71a29 Revert "[SPARK-8784] [SQL] Add Python API for hex and unhex"
This reverts commit fc7aebd94a.
2015-07-02 16:25:10 -07:00
Yu ISHIKAWA 488bad319a [SPARK-7104] [MLLIB] Support model save/load in Python's Word2Vec
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #6821 from yu-iskw/SPARK-7104 and squashes the following commits:

975136b [Yu ISHIKAWA] Organize import
0ef58b6 [Yu ISHIKAWA] Use rmtree, instead of removedirs
cb21653 [Yu ISHIKAWA] Add an explicit type for `Word2VecModelWrapper.save`
1d468ef [Yu ISHIKAWA] [SPARK-7104][MLlib] Support model save/load in Python's Word2Vec
2015-07-02 15:55:16 -07:00
Davies Liu fc7aebd94a [SPARK-8784] [SQL] Add Python API for hex and unhex
Also improve the performance of hex/unhex

Author: Davies Liu <davies@databricks.com>

Closes #7181 from davies/hex and squashes the following commits:

f032fbb [Davies Liu] Merge branch 'hex' of github.com:davies/spark into hex
49e325f [Davies Liu] Merge branch 'master' of github.com:apache/spark into hex
b31fc9a [Davies Liu] Update math.scala
25156b7 [Davies Liu] address comments and fix test
c3af78c [Davies Liu] address commments
1a24082 [Davies Liu] Add Python API for hex and unhex
2015-07-02 15:43:02 -07:00
lewuathe 7d9cc9673e [SPARK-3382] [MLLIB] GradientDescent convergence tolerance
GradientDescent can receive a convergence tolerance value. The default value is 0.0.
When the loss value becomes less than the tolerance set by the user, iteration is terminated.
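
A convergence tolerance of this kind can be sketched in plain Python. The squashed commit subjects below mention comparing the difference of the solution vector in a relative way, so this sketch assumes that criterion; the function name, the signature, and the exact comparison are illustrative and not taken from MLlib.

```python
import math


def gradient_descent(grad, x0, step, num_iterations, convergence_tol=0.0):
    """Run gradient descent; stop early once the solution stops moving.

    Returns the final point and the number of iterations actually run.
    """
    x = list(x0)
    for i in range(1, num_iterations + 1):
        g = grad(x)
        new_x = [xi - step * gi for xi, gi in zip(x, g)]
        # Assumed criterion: size of the update relative to the new solution.
        diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(new_x, x)))
        scale = max(math.sqrt(sum(a * a for a in new_x)), 1.0)
        x = new_x
        if diff < convergence_tol * scale:
            return x, i
    return x, num_iterations


# Minimise f(x) = x^2, whose gradient is 2x; each step halves x.
_, iterations = gradient_descent(lambda x: [2 * x[0]], [1.0], 0.25, 100, 0.01)
print(iterations)  # 7
```

With the default tolerance of 0.0 the strict comparison never succeeds, so all `num_iterations` steps run, which matches the behaviour before the parameter existed.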

Author: lewuathe <lewuathe@me.com>

Closes #3636 from Lewuathe/gd-convergence-tolerance and squashes the following commits:

0b8a9a8 [lewuathe] Update doc
ce91b15 [lewuathe] Merge branch 'master' into gd-convergence-tolerance
4f22c2b [lewuathe] Modify based on SPARK-1503
5e47b82 [lewuathe] Merge branch 'master' into gd-convergence-tolerance
abadb7e [lewuathe] Fix LassoSuite
8fadebd [lewuathe] Fix failed unit tests
ee5de46 [lewuathe] Merge branch 'master' into gd-convergence-tolerance
8313ba2 [lewuathe] Fix styles
0ead94c [lewuathe] Merge branch 'master' into gd-convergence-tolerance
a94cfd5 [lewuathe] Modify some styles
3aef0a2 [lewuathe] Modify converged logic to do relative comparison
f7b19d5 [lewuathe] [SPARK-3382] Clarify comparison logic
e6c9cd2 [lewuathe] [SPARK-3382] Compare with the diff of solution vector
4b125d2 [lewuathe] [SPARK3382] Fix scala style
e7c10dd [lewuathe] [SPARK-3382] format improvements
f867eea [lewuathe] [SPARK-3382] Modify warning message statements
b9d5e61 [lewuathe] [SPARK-3382] should compare diff inside loss history and convergence tolerance
5433f71 [lewuathe] [SPARK-3382] GradientDescent convergence tolerance
2015-07-02 15:00:13 -07:00
Reynold Xin 52508beb65 [SPARK-8772][SQL] Implement implicit type cast for expressions that define input types.
Author: Reynold Xin <rxin@databricks.com>

Closes #7175 from rxin/implicitCast and squashes the following commits:

88080a2 [Reynold Xin] Clearer definition of implicit type cast.
f0ff97f [Reynold Xin] Added missing file.
c65e532 [Reynold Xin] [SPARK-8772][SQL] Implement implicit type cast for expressions that defines input types.
2015-07-02 14:16:14 -07:00
Andrew Or cd20355078 [SPARK-7835] Refactor HeartbeatReceiverSuite for coverage + cleanup
The existing test suite has a lot of duplicate code and doesn't even cover the most fundamental feature of the HeartbeatReceiver, which is expiring hosts that have not responded in a while.

This introduces manual clocks in `HeartbeatReceiver` and makes it respond to heartbeats only for registered executors. A few internal messages are moved to `receiveAndReply` to increase determinism of the tests so we don't have to rely on flaky constructs like `eventually`.
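
Why a manual clock makes such tests deterministic can be shown with a small Python sketch. `ManualClock` and `HeartbeatTracker` here are hypothetical stand-ins written for this example, not the Spark classes; the sketch only assumes the behaviour described above (heartbeats honoured for registered executors only, hosts expired after a timeout).

```python
class ManualClock:
    """A clock that only moves when the test tells it to."""

    def __init__(self, now_ms=0):
        self._now_ms = now_ms

    def now(self):
        return self._now_ms

    def advance(self, ms):
        self._now_ms += ms


class HeartbeatTracker:
    """Tracks the last heartbeat per registered executor."""

    def __init__(self, clock, timeout_ms):
        self.clock = clock
        self.timeout_ms = timeout_ms
        self.last_seen = {}

    def register(self, executor_id):
        self.last_seen[executor_id] = self.clock.now()

    def heartbeat(self, executor_id):
        # Unregistered executors are ignored rather than silently added.
        if executor_id not in self.last_seen:
            return False
        self.last_seen[executor_id] = self.clock.now()
        return True

    def expire_dead_hosts(self):
        now = self.clock.now()
        dead = sorted(e for e, t in self.last_seen.items()
                      if now - t > self.timeout_ms)
        for executor_id in dead:
            del self.last_seen[executor_id]
        return dead


clock = ManualClock()
tracker = HeartbeatTracker(clock, timeout_ms=100)
tracker.register("a")
tracker.register("b")
clock.advance(60)
tracker.heartbeat("a")              # "a" stays alive
clock.advance(60)
print(tracker.expire_dead_hosts())  # ['b']
```

Because time advances only through `advance`, the expiry check needs no sleeping or retry loops.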

Author: Andrew Or <andrew@databricks.com>

Closes #7173 from andrewor14/heartbeat-receiver-tests and squashes the following commits:

4a903d6 [Andrew Or] Increase HeartReceiverSuite coverage and clean up
2015-07-02 13:59:56 -07:00
Deron Eriksson fcbcba66c9 [SPARK-1564] [DOCS] Added Javascript to Javadocs to create badges for tags like :: Experimental ::
Modified copy_api_dirs.rb and created api-javadocs.js and api-javadocs.css files in order to add badges to javadoc files for :: Experimental ::, :: DeveloperApi ::, and :: AlphaComponent :: tags

Author: Deron Eriksson <deron@us.ibm.com>

Closes #7169 from deroneriksson/SPARK-1564_JavaDocs_badges and squashes the following commits:

a8353db [Deron Eriksson] added license headers to api-docs.css and api-javadocs.css
07feb07 [Deron Eriksson] added linebreaks to make jquery more readable when adding html badge tags
65b4930 [Deron Eriksson] Modified copy_api_dirs.rb and created api-javadocs.js and api-javadocs.css files in order to add badges to javadoc files for :: Experimental ::, :: DeveloperApi ::, and :: AlphaComponent :: tags
2015-07-02 13:55:53 -07:00
Andrew Or 82cf3315e6 [SPARK-8781] Fix variables in published pom.xml are not resolved
The issue is summarized in the JIRA and is caused by this commit: 984ad60147.

This patch reverts that commit and fixes the maven build in a different way. We limit the dependencies of `KinesisReceiverSuite` to avoid having to deal with the complexities in how maven deals with transitive test dependencies.

Author: Andrew Or <andrew@databricks.com>

Closes #7193 from andrewor14/fix-kinesis-pom and squashes the following commits:

ca3d5d4 [Andrew Or] Limit kinesis test dependencies
f24e09c [Andrew Or] Revert "[BUILD] Fix Maven build for Kinesis"
2015-07-02 13:49:45 -07:00
MechCoder 34d448dbe1 [SPARK-8479] [MLLIB] Add numNonzeros and numActives to linalg.Matrices
Matrices allow zeros to be stored explicitly in their values. It is sometimes handy to have a method for checking whether the number of nonzeros is the same as the number of active values.
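
The distinction between the two counts can be illustrated with a minimal Python sketch of a compressed-sparse-column matrix. The class and method names are hypothetical and only mirror the names in the commit title.

```python
class SparseMatrixCSC:
    """A tiny compressed-sparse-column matrix holding explicit entries."""

    def __init__(self, num_rows, num_cols, col_ptrs, row_indices, values):
        self.num_rows = num_rows
        self.num_cols = num_cols
        self.col_ptrs = col_ptrs
        self.row_indices = row_indices
        self.values = values

    def num_actives(self):
        # Every explicitly stored entry counts, including stored zeros.
        return len(self.values)

    def num_nonzeros(self):
        # Only stored entries whose value is not zero.
        return sum(1 for v in self.values if v != 0)


# A 3x2 matrix with three stored entries, one of which is an explicit zero.
m = SparseMatrixCSC(3, 2, [0, 2, 3], [0, 2, 1], [1.0, 0.0, 3.0])
print(m.num_actives(), m.num_nonzeros())  # 3 2
```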

Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #6904 from MechCoder/nnz_matrix and squashes the following commits:

252c6b7 [MechCoder] Add to MiMa excludes
e2390f5 [MechCoder] Use count instead of foreach
2f62b2f [MechCoder] Add to MiMa excludes
d6e96ef [MechCoder] [SPARK-8479] Add numNonzeros and numActives to linalg.Matrices
2015-07-02 11:28:14 -07:00
Andrew Or 2e2f32603c [SPARK-8581] [SPARK-8584] Simplify checkpointing code + better error message
This patch rewrites the old checkpointing code in a way that is easier to understand. It also adds a guard against an invalid specification of checkpoint directory to provide a clearer error message. Most of the changes here are relatively minor.

Author: Andrew Or <andrew@databricks.com>

Closes #6968 from andrewor14/checkpoint-cleanup and squashes the following commits:

4ef8263 [Andrew Or] Use global synchronized instead
6f6fd84 [Andrew Or] Merge branch 'master' of github.com:apache/spark into checkpoint-cleanup
b1437ad [Andrew Or] Warn instead of throw
5484293 [Andrew Or] Merge branch 'master' of github.com:apache/spark into checkpoint-cleanup
7fb4af5 [Andrew Or] Guard against bad settings of checkpoint directory
691da98 [Andrew Or] Simplify checkpoint code / code style / comments
2015-07-02 10:57:02 -07:00
Liang-Chi Hsieh 0e553a3e93 [SPARK-8708] [MLLIB] Partition ALS ratings based on both users and products
JIRA: https://issues.apache.org/jira/browse/SPARK-8708

Previously, the partitions of ratings were based only on the given products. So if the `usersProducts` given for prediction contains only a few products or even one product, the generated ratings will be pushed into a few partitions or a single partition and can't use high parallelism.

The following code is the example reported in the JIRA. Because it asks for the predictions for users on product 2, there is only one partition in the result.

    >>> r1 = (1, 1, 1.0)
    >>> r2 = (1, 2, 2.0)
    >>> r3 = (2, 1, 2.0)
    >>> r4 = (2, 2, 2.0)
    >>> r5 = (3, 1, 1.0)
    >>> ratings = sc.parallelize([r1, r2, r3, r4, r5], 5)
    >>> users = ratings.map(itemgetter(0)).distinct()
    >>> model = ALS.trainImplicit(ratings, 1, seed=10)
    >>> predictions_for_2 = model.predictAll(users.map(lambda u: (u, 2)))
    >>> predictions_for_2.glom().map(len).collect()
    [0, 0, 3, 0, 0]

This PR uses user and product instead of only product to partition the ratings.
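
The skew can be reproduced without Spark. The sketch below is only an illustration of hash partitioning by key: the `31 * user + product` hash is an assumption chosen for the example and is not the partitioner or join strategy used in the PR.

```python
from collections import Counter


def partition_sizes(keys, num_partitions, hash_of):
    """Count how many keys land in each partition under a given hash."""
    counts = Counter(hash_of(k) % num_partitions for k in keys)
    return [counts.get(i, 0) for i in range(num_partitions)]


# Predictions for users 1, 2 and 3 on product 2, as in the JIRA example.
user_products = [(1, 2), (2, 2), (3, 2)]

# Keyed by product only: every pair hashes to the same partition.
print(partition_sizes(user_products, 5, lambda up: up[1]))
# [0, 0, 3, 0, 0]

# Keyed by both user and product: the pairs spread out.
print(partition_sizes(user_products, 5, lambda up: 31 * up[0] + up[1]))
# [1, 0, 0, 1, 1]
```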

Author: Liang-Chi Hsieh <viirya@gmail.com>
Author: Liang-Chi Hsieh <viirya@appier.com>

Closes #7121 from viirya/mfm_fix_partition and squashes the following commits:

779946d [Liang-Chi Hsieh] Calculate approximate numbers of users and products in one pass.
4336dc2 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into mfm_fix_partition
83e56c1 [Liang-Chi Hsieh] Instead of additional join, use the numbers of users and products to decide how to perform join.
b534dc8 [Liang-Chi Hsieh] Paritition ratings based on both users and products.
2015-07-02 10:18:23 -07:00
Yijie Shen 52302a8039 [SPARK-8407] [SQL] complex type constructors: struct and named_struct
This is a follow up of [SPARK-8283](https://issues.apache.org/jira/browse/SPARK-8283) ([PR-6828](https://github.com/apache/spark/pull/6828)), to support both `struct` and `named_struct` in Spark SQL.

After [#6725](https://github.com/apache/spark/pull/6828), the semantics of [`CreateStruct`](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypes.scala#L56) have changed a little: it is no longer limited to columns of `NamedExpressions`, and it names non-NamedExpression fields following the Hive convention: col1, col2, ...

This PR would both loosen [`struct`](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L723) to take children of `Expression` type and add `named_struct` support.

Author: Yijie Shen <henry.yijieshen@gmail.com>

Closes #6874 from yijieshen/SPARK-8283 and squashes the following commits:

4cd3375ac [Yijie Shen] change struct documentation
d599d0b [Yijie Shen] rebase code
9a7039e [Yijie Shen] fix reviews and regenerate golden answers
b487354 [Yijie Shen] replace assert using checkAnswer
f07e114 [Yijie Shen] tiny fix
9613be9 [Yijie Shen] review fix
7fef712 [Yijie Shen] Fix checkInputTypes' implementation using foldable and nullable
60812a7 [Yijie Shen] Fix type check
828d694 [Yijie Shen] remove unnecessary resolved assertion inside dataType method
fd3cd8e [Yijie Shen] remove type check from eval
7a71255 [Yijie Shen] tiny fix
ccbbd86 [Yijie Shen] Fix reviews
47da332 [Yijie Shen] remove nameStruct API from DataFrame
917e680 [Yijie Shen] Fix reviews
4bd75ad [Yijie Shen] loosen struct method in functions.scala to take Expression children
0acb7be [Yijie Shen] Add CreateNamedStruct in both DataFrame function API and FunctionRegistery
2015-07-02 10:12:25 -07:00
Wenchen Fan afa021e03f [SPARK-8747] [SQL] fix EqualNullSafe for binary type
also improve tests for binary comparison.

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #7143 from cloud-fan/binary and squashes the following commits:

28a5b76 [Wenchen Fan] improve test
04ef4b0 [Wenchen Fan] fix equalNullSafe
2015-07-02 10:06:38 -07:00
Tarek Auel 5b3338130d [SPARK-8223] [SPARK-8224] [SQL] shift left and shift right
Jira:
https://issues.apache.org/jira/browse/SPARK-8223
https://issues.apache.org/jira/browse/SPARK-8224

~~I am aware of #7174 and will update this pr, if it's merged.~~ Done
I don't know if #7034 can simplify this, but we can take a look at it if it gets merged.

rxin In the Jira ticket the function has no second argument. I added a `numBits` argument that allows specifying the number of bits. I guess this improves the usability. I wanted to add `shiftleft(value)` as well, but the `selectExpr` dataframe tests crash if I have both. In order to do this, I added the following to functions.scala: `def shiftRight(e: Column): Column = ShiftRight(e.expr, lit(1).expr)`, but as I mentioned, this doesn't pass tests like `df.selectExpr("shiftRight(a)", ...` (not enough arguments exception).

If we need the bitwise shift in order to be Hive compatible, I suggest adding `shiftLeft` and something like `shiftLeftX`.

Author: Tarek Auel <tarek.auel@googlemail.com>

Closes #7178 from tarekauel/8223 and squashes the following commits:

8023bb5 [Tarek Auel] [SPARK-8223][SPARK-8224] fixed test
f3f64e6 [Tarek Auel] [SPARK-8223][SPARK-8224] Integer -> Int
f628706 [Tarek Auel] [SPARK-8223][SPARK-8224] removed toString; updated function description
3b56f2a [Tarek Auel] Merge remote-tracking branch 'origin/master' into 8223
5189690 [Tarek Auel] [SPARK-8223][SPARK-8224] minor fix and style fix
9434a28 [Tarek Auel] Merge remote-tracking branch 'origin/master' into 8223
44ee324 [Tarek Auel] [SPARK-8223][SPARK-8224] docu fix
ac7fe9d [Tarek Auel] [SPARK-8223][SPARK-8224] right and left bit shift
2015-07-02 10:02:19 -07:00
Yanbo Liang 0a468a46bf [SPARK-8758] [MLLIB] Add Python user guide for PowerIterationClustering
Add Python user guide for PowerIterationClustering

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #7155 from yanboliang/spark-8758 and squashes the following commits:

18d803b [Yanbo Liang] address comments
dd29577 [Yanbo Liang] Add Python user guide for PowerIterationClustering
2015-07-02 09:59:54 -07:00
Alok Singh 99c40cd0d8 [SPARK-8647] [MLLIB] Potential issue with constant hashCode
I added the following code:
  // see [SPARK-8647], this achieves the needed constant hash code without constant no.
  override def hashCode(): Int = this.getClass.getName.hashCode()

This yields the constant hash code requested in the JIRA.

Author: Alok  Singh <singhal@Aloks-MacBook-Pro.local>

Closes #7146 from aloknsingh/aloknsingh_SPARK-8647 and squashes the following commits:

e58bccf [Alok  Singh] [SPARK-8647][MLlib] to avoid the class derivation issues, change the constant hashCode to override def hashCode(): Int = classOf[MatrixUDT].getName.hashCode()
43cdb89 [Alok  Singh] [SPARK-8647][MLlib] Potential issue with constant hashCode
2015-07-02 09:58:57 -07:00
Wisely Chen 246265f2bb [SPARK-8690] [SQL] Add a setting to disable SparkSQL parquet schema merge by using datasource API
The detailed problem description is in https://issues.apache.org/jira/browse/SPARK-8690

Generally speaking, I add a config spark.sql.parquet.mergeSchema to achieve the same effect as sqlContext.load("parquet" , Map( "path" -> "..." , "mergeSchema" -> "false" ))

It becomes a simple flag without any side effects.

Author: Wisely Chen <wiselychen@appier.com>

Closes #7070 from thegiive/SPARK8690 and squashes the following commits:

c6f3e86 [Wisely Chen] Refactor some code style and merge the test case to ParquetSchemaMergeConfigSuite
94c9307 [Wisely Chen] Remove some style problem
db8ef1b [Wisely Chen] Change config to SQLConf and add test case
b6806fb [Wisely Chen] remove text
c0edb8c [Wisely Chen] [SPARK-8690] add a config spark.sql.parquet.mergeSchema to disable datasource API schema merge feature.
2015-07-02 09:58:12 -07:00
Christian Kadner 1bbdf9ead9 [SPARK-8746] [SQL] update download link for Hive 0.13.1
updated the [Hive 0.13.1](https://archive.apache.org/dist/hive/hive-0.13.1) download link in `sql/README.md`

Author: Christian Kadner <ckadner@us.ibm.com>

Closes #7144 from ckadner/SPARK-8746 and squashes the following commits:

65d80f7 [Christian Kadner] [SPARK-8746][SQL] update download link for Hive 0.13.1
2015-07-02 13:45:19 +01:00
Vinod K C c572e25617 [SPARK-8787] [SQL] Changed parameter order of @deprecated in package object sql
The parameter order of the deprecated annotation in package object sql is wrong:
deprecated("1.3.0", "use DataFrame")

This has to be changed to deprecated("use DataFrame", "1.3.0")

Author: Vinod K C <vinod.kc@huawei.com>

Closes #7183 from vinodkc/fix_deprecated_param_order and squashes the following commits:

1cbdbe8 [Vinod K C] Modified the message
700911c [Vinod K C] Changed order of parameters
2015-07-02 13:42:48 +01:00
Kousuke Saruta 41588365ad [DOCS] Fix minor wrong lambda expression example.
It's a really minor issue, but there is an example with wrong lambda-expression usage in `SQLContext.scala`, as follows.

```
sqlContext.udf().register("myUDF",
       (Integer arg1, String arg2) -> arg2 + arg1),  <- We have an extra `)` here.
       DataTypes.StringType);
```

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #7187 from sarutak/fix-minor-wrong-lambda-expression and squashes the following commits:

a13196d [Kousuke Saruta] Fixed minor wrong lambda expression example.
2015-07-02 21:16:35 +09:00
huangzhaowei 1b0c8e6104 [SPARK-8687] [YARN] Fix bug: Executor can't fetch the new set configuration in yarn-client
Spark initializes the properties in `CoarseGrainedSchedulerBackend.start`:
```scala
    // TODO (prashant) send conf instead of properties
    driverEndpoint = rpcEnv.setupEndpoint(
      CoarseGrainedSchedulerBackend.ENDPOINT_NAME, new DriverEndpoint(rpcEnv, properties))
```
The YARN logic then sets some configuration, but does not update it in these `properties`.
So the `Executor` won't get those properties.

[Jira](https://issues.apache.org/jira/browse/SPARK-8687)

Author: huangzhaowei <carlmartinmax@gmail.com>

Closes #7066 from SaintBacchus/SPARK-8687 and squashes the following commits:

1de4f48 [huangzhaowei] Ensure all necessary properties have already been set before startup ExecutorLaucher
2015-07-01 23:14:13 -07:00
Ilya Ganelin 3697232b7d [SPARK-3071] Increase default driver memory
I've updated default values in comments, documentation, and in the command line builder to be 1g based on comments in the JIRA. I've also updated most usages to point at a single variable defined in the Utils.scala and JavaUtils.java files. This wasn't possible in all cases (R, shell scripts etc.) but usage in most code is now pointing at the same place.

Please let me know if I've missed anything.

Will the spark-shell use the value within the command line builder during instantiation?

Author: Ilya Ganelin <ilya.ganelin@capitalone.com>

Closes #7132 from ilganeli/SPARK-3071 and squashes the following commits:

4074164 [Ilya Ganelin] String fix
271610b [Ilya Ganelin] Merge branch 'SPARK-3071' of github.com:ilganeli/spark into SPARK-3071
273b6e9 [Ilya Ganelin] Test fix
fd67721 [Ilya Ganelin] Update JavaUtils.java
26cc177 [Ilya Ganelin] test fix
e5db35d [Ilya Ganelin] Fixed test failure
39732a1 [Ilya Ganelin] merge fix
a6f7deb [Ilya Ganelin] Created default value for DRIVER MEM in Utils that's now used in almost all locations instead of setting manually in each
09ad698 [Ilya Ganelin] Update SubmitRestProtocolSuite.scala
19b6f25 [Ilya Ganelin] Missed one doc update
2698a3d [Ilya Ganelin] Updated default value for driver memory
2015-07-01 23:11:02 -07:00
Josh Rosen 377ff4c9e8 [SPARK-8740] [PROJECT INFRA] Support GitHub OAuth tokens in dev/merge_spark_pr.py
This commit allows `dev/merge_spark_pr.py` to use personal GitHub OAuth tokens in order to make authenticated requests. This is necessary to work around per-IP rate limiting issues.

To use a token, just set the `GITHUB_OAUTH_KEY` environment variable.  You can create a personal token at https://github.com/settings/tokens; we only require `public_repo` scope.

If the script fails due to a rate-limit issue, it now logs a useful message directing the user to the OAuth token instructions.
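
Attaching such a token to a request can be sketched as below. This is a minimal Python 3 illustration, not the code from `dev/merge_spark_pr.py`; the helper name is hypothetical, and only the `GITHUB_OAUTH_KEY` variable name comes from the description above.

```python
import os
import urllib.request


def build_github_request(url):
    """Build a request, adding a token header when GITHUB_OAUTH_KEY is set."""
    request = urllib.request.Request(url)
    token = os.environ.get("GITHUB_OAUTH_KEY")
    if token:
        # GitHub accepts personal tokens in the Authorization header.
        request.add_header("Authorization", "token %s" % token)
    return request
```

When the variable is unset the request is sent unauthenticated, so the script keeps working for users who have not created a token.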

Author: Josh Rosen <joshrosen@databricks.com>

Closes #7136 from JoshRosen/pr-merge-script-oauth-authentication and squashes the following commits:

4d011bd [Josh Rosen] Fix error message
23d92ff [Josh Rosen] Support GitHub OAuth tokens in dev/merge_spark_pr.py
2015-07-01 23:06:52 -07:00
Holden Karau 15d41cc501 [SPARK-8769] [TRIVIAL] [DOCS] toLocalIterator should mention it results in many jobs
Author: Holden Karau <holden@pigscanfly.ca>

Closes #7171 from holdenk/SPARK-8769-toLocalIterator-documentation-improvement and squashes the following commits:

97ddd99 [Holden Karau] Add note
2015-07-01 23:05:45 -07:00
Holden Karau d14338eafc [SPARK-8771] [TRIVIAL] Add a version to the deprecated annotation for the actorSystem
Author: Holden Karau <holden@pigscanfly.ca>

Closes #7172 from holdenk/SPARK-8771-actor-system-deprecation-tag-uses-deprecated-deprecation-tag and squashes the following commits:

7f1455b [Holden Karau] Add .0s to the versions for the derpecated anotations in SparkEnv.scala
ca13c9d [Holden Karau] Add a version to the deprecated annotation for the actorSystem in SparkEnv
2015-07-01 23:04:05 -07:00
huangzhaowei 646366b5d2 [SPARK-8688] [YARN] Bug fix: disable the cache fs to gain the HDFS connection.
If `fs.hdfs.impl.disable.cache` is `false` (the default), `FileSystem` will use the cached `DFSClient`, which uses the old token.
[AMDelegationTokenRenewer](https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/AMDelegationTokenRenewer.scala#L196)
```scala
    val credentials = UserGroupInformation.getCurrentUser.getCredentials
    credentials.writeTokenStorageFile(tempTokenPath, discachedConfiguration)
```
Although `credentials` has the new token, the cached client with the old token is still used.
So it's better to set `fs.hdfs.impl.disable.cache` to `true` to avoid token expiry.

[Jira](https://issues.apache.org/jira/browse/SPARK-8688)

Author: huangzhaowei <carlmartinmax@gmail.com>

Closes #7069 from SaintBacchus/SPARK-8688 and squashes the following commits:

f94cd0b [huangzhaowei] modify function parameter
8fb9eb9 [huangzhaowei] explicit  the comment
0cd55c9 [huangzhaowei] Rename function name to be an accurate one
cf776a1 [huangzhaowei] [SPARK-8688][YARN]Bug fix: disable the cache fs to gain the HDFS connection.
2015-07-01 23:01:44 -07:00
Devaraj K 792fcd802c [SPARK-8754] [YARN] YarnClientSchedulerBackend doesn't stop gracefully in failure conditions
In YarnClientSchedulerBackend.stop(), added a check for monitorThread.

Author: Devaraj K <devaraj@apache.org>

Closes #7153 from devaraj-kavali/master and squashes the following commits:

66be9ad [Devaraj K] https://issues.apache.org/jira/browse/SPARK-8754 YarnClientSchedulerBackend doesn't stop gracefully in failure conditions
2015-07-01 22:59:04 -07:00
zhichao.li b285ac5ba8 [SPARK-8227] [SQL] Add function unhex
cc chenghao-intel  adrian-wang

Author: zhichao.li <zhichao.li@intel.com>

Closes #7113 from zhichao-li/unhex and squashes the following commits:

379356e [zhichao.li] remove exception checking
a4ae6dc [zhichao.li] add udf_unhex to whitelist
fe5c14a [zhichao.li] add todigit
607d7a3 [zhichao.li] use checkInputTypes
bffd37f [zhichao.li] change to use Hex in apache common package
cde73f5 [zhichao.li] update to use AutoCastInputTypes
11945c7 [zhichao.li] style
c852d46 [zhichao.li] Add function unhex
2015-07-01 22:19:51 -07:00
Rosstin 4e4f74b5e1 [SPARK-8660] [MLLIB] removed > symbols from comments in LogisticRegressionSuite.scala for ease of copypaste
'>' symbols removed from comments in LogisticRegressionSuite.scala, for ease of copypaste

also single-lined the multiline commands (is this desirable, or does it violate style?)

Author: Rosstin <asterazul@gmail.com>

Closes #7167 from Rosstin/SPARK-8660-2 and squashes the following commits:

f4b9bc8 [Rosstin] SPARK-8660 restored character limit on multiline comments in LogisticRegressionSuite.scala
fe6b112 [Rosstin] SPARK-8660 > symbols removed from LogisticRegressionSuite.scala for easy of copypaste
39ddd50 [Rosstin] Merge branch 'master' of github.com:apache/spark into SPARK-8661
5a05dee [Rosstin] SPARK-8661 for LinearRegressionSuite.scala, changed javadoc-style comments to regular multiline comments to make it easier to copy-paste the R code.
bb9a4b1 [Rosstin] Merge branch 'master' of github.com:apache/spark into SPARK-8660
242aedd [Rosstin] SPARK-8660, changed comment style from JavaDoc style to normal multiline comment in order to make copypaste into R easier, in file classification/LogisticRegressionSuite.scala
2cd2985 [Rosstin] Merge branch 'master' of github.com:apache/spark into SPARK-8639
21ac1e5 [Rosstin] Merge branch 'master' of github.com:apache/spark into SPARK-8639
6c18058 [Rosstin] fixed minor typos in docs/README.md and docs/api.md
2015-07-01 21:42:06 -07:00
Reynold Xin 9fd13d5613 [SPARK-8770][SQL] Create BinaryOperator abstract class.
Our current BinaryExpression abstract class is not for generic binary expressions, i.e. it requires left/right children to have the same type. However, due to its name, contributors build new binary expressions that don't have that assumption (e.g. Sha) and still extend BinaryExpression.

This patch creates a new BinaryOperator abstract class, and updates the analyzer to only apply the type casting rule there. This patch also adds the notion of "prettyName" to expressions, which defines the user-facing name for the expression.

Author: Reynold Xin <rxin@databricks.com>

Closes #7174 from rxin/binary-opterator and squashes the following commits:

f31900d [Reynold Xin] [SPARK-8770][SQL] Create BinaryOperator abstract class.
fceb216 [Reynold Xin] Merge branch 'master' of github.com:apache/spark into binary-opterator
d8518cf [Reynold Xin] Updated Python tests.
2015-07-01 21:14:13 -07:00
Reynold Xin 3a342dedc0 Revert "[SPARK-8770][SQL] Create BinaryOperator abstract class."
This reverts commit 2727789998.
2015-07-01 16:59:39 -07:00
Reynold Xin 2727789998 [SPARK-8770][SQL] Create BinaryOperator abstract class.
Our current BinaryExpression abstract class is not for generic binary expressions, i.e. it requires left/right children to have the same type. However, due to its name, contributors build new binary expressions that don't have that assumption (e.g. Sha) and still extend BinaryExpression.

This patch creates a new BinaryOperator abstract class, and updates the analyzer to only apply the type casting rule there. This patch also adds the notion of "prettyName" to expressions, which defines the user-facing name for the expression.

Author: Reynold Xin <rxin@databricks.com>

Closes #7170 from rxin/binaryoperator and squashes the following commits:

51264a5 [Reynold Xin] [SPARK-8770][SQL] Create BinaryOperator abstract class.
2015-07-01 16:56:48 -07:00
Davies Liu f958f27e20 [SPARK-8766] support non-ascii character in column names
Use UTF-8 to encode column names in Python 2; otherwise encoding may fail with the default encoding ('ascii').

This PR also fixes a bug that occurs when a Java exception has no error message.
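
A minimal sketch of the encoding issue, assuming a hypothetical helper named `encode_column_name` (not the actual PySpark change). It shows that a non-ASCII name cannot be encoded as ASCII but round-trips through UTF-8:

```python
import sys


def encode_column_name(name):
    """Hypothetical helper: make a column name safe to pass on in Python 2.

    On Python 2, unicode names are encoded as UTF-8 bytes; on Python 3,
    str is already unicode, so the name is returned unchanged.
    """
    if sys.version_info[0] < 3 and isinstance(name, unicode):  # noqa: F821
        return name.encode("utf-8")
    return name


name = u"\u5217\u540d"  # a non-ASCII column name

# Encoding with 'ascii' fails for such a name.
try:
    name.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Encoding with UTF-8 succeeds and can be decoded back losslessly.
utf8_bytes = name.encode("utf-8")
```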

Author: Davies Liu <davies@databricks.com>

Closes #7165 from davies/non_ascii and squashes the following commits:

02cb61a [Davies Liu] fix tests
3b09d31 [Davies Liu] add encoding in header
867754a [Davies Liu] support non-ascii character in column names
2015-07-01 16:43:18 -07:00
Marcelo Vanzin 1ce6428907 [SPARK-3444] [CORE] Restore INFO level after log4j test.
Otherwise other tests don't log anything useful...

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #7140 from vanzin/SPARK-3444 and squashes the following commits:

de14836 [Marcelo Vanzin] Better fix.
6cff13a [Marcelo Vanzin] [SPARK-3444] [core] Restore INFO level after log4j test.
2015-07-01 20:40:47 +01:00
Davies Liu 3083e17645 [QUICKFIX] [SQL] fix copy of generated row
copy() of generated Row doesn't check nullability of columns

Author: Davies Liu <davies@databricks.com>

Closes #7163 from davies/fix_copy and squashes the following commits:

661a206 [Davies Liu] fix copy of generated row
2015-07-01 12:39:57 -07:00
jerryshao 9f7db3486f [SPARK-7820] [BUILD] Fix Java8-tests suite compile and test error under sbt
Author: jerryshao <saisai.shao@intel.com>

Closes #7120 from jerryshao/SPARK-7820 and squashes the following commits:

6902439 [jerryshao] fix Java8-tests suite compile error under sbt
2015-07-01 12:33:24 -07:00
zsxwing 75b9fe4c5f [SPARK-8378] [STREAMING] Add the Python API for Flume
Author: zsxwing <zsxwing@gmail.com>

Closes #6830 from zsxwing/flume-python and squashes the following commits:

78dfdac [zsxwing] Fix the compile error in the test code
f1bf3c0 [zsxwing] Address TD's comments
0449723 [zsxwing] Add sbt goal streaming-flume-assembly/assembly
e93736b [zsxwing] Fix the test case for determine_modules_to_test
9d5821e [zsxwing] Fix pyspark_core dependencies
f9ee681 [zsxwing] Merge branch 'master' into flume-python
7a55837 [zsxwing] Add streaming_flume_assembly to run-tests.py
b96b0de [zsxwing] Merge branch 'master' into flume-python
ce85e83 [zsxwing] Fix incompatible issues for Python 3
01cbb3d [zsxwing] Add import sys
152364c [zsxwing] Fix the issue that StringIO doesn't work in Python 3
14ba0ff [zsxwing] Add flume-assembly for sbt building
b8d5551 [zsxwing] Merge branch 'master' into flume-python
4762c34 [zsxwing] Fix the doc
0336579 [zsxwing] Refactor Flume unit tests and also add tests for Python API
9f33873 [zsxwing] Add the Python API for Flume
2015-07-01 11:59:24 -07:00
Joseph K. Bradley b8faa32875 [SPARK-8765] [MLLIB] [PYTHON] removed flaky python PIC test
See failure: [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36133/console]

CC yanboliang  mengxr

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #7164 from jkbradley/pic-python-test and squashes the following commits:

156d55b [Joseph K. Bradley] removed flaky python PIC test
2015-07-01 11:57:52 -07:00
Yuhao Yang 2012913355 [SPARK-8308] [MLLIB] add missing save load for python example
jira: https://issues.apache.org/jira/browse/SPARK-8308

1. Add some missing save/load calls to the Python examples: LogisticRegression, LinearRegression and NaiveBayes.
2. Tune down iterations for MatrixFactorization, since the current number triggers a StackOverflow with the default Java configuration (>1M).

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #6760 from hhbyyh/docUpdate and squashes the following commits:

9bd3383 [Yuhao Yang] update scala example
8a44692 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into docUpdate
077cbb8 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into docUpdate
3e948dc [Yuhao Yang] add missing save load for python example
2015-07-01 11:17:56 -07:00
lewuathe 184de91d15 [SPARK-6263] [MLLIB] Python MLlib API missing items: Utils
Implement missing API in pyspark.

MLUtils
* appendBias
* loadVectors

`kFold` is also missing; however, I am not sure `ClassTag` can be passed or restored through Python.
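
For illustration, the idea behind `appendBias` (returning a new vector with a 1.0 bias term appended) can be sketched on plain Python lists. The `append_bias` function below is a hypothetical stand-in and does not use the pyspark vector types:

```python
def append_bias(values):
    """Return a new list with a bias term of 1.0 appended.

    The input sequence is copied, not modified in place.
    """
    return list(values) + [1.0]


features = [2.0, 3.0]
with_bias = append_bias(features)
```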

Author: lewuathe <lewuathe@me.com>

Closes #5707 from Lewuathe/SPARK-6263 and squashes the following commits:

16863ea [lewuathe] Merge master
3fc27e7 [lewuathe] Merge branch 'master' into SPARK-6263
6084e9c [lewuathe] Resolv conflict
d2aa2a0 [lewuathe] Resolv conflict
9c329d8 [lewuathe] Fix efficiency
3a12a2d [lewuathe] Merge branch 'master' into SPARK-6263
1d4714b [lewuathe] Fix style
b29e2bc [lewuathe] Remove scipy dependencies
e32eb40 [lewuathe] Merge branch 'master' into SPARK-6263
25d3c9d [lewuathe] Remove unnecessary imports
7ec04db [lewuathe] Resolv conflict
1502d13 [lewuathe] Resolv conflict
d6bd416 [lewuathe] Check existence of scipy.sparse
5d555b1 [lewuathe] Construct scipy.sparse matrix
c345a44 [lewuathe] Merge branch 'master' into SPARK-6263
b8b5ef7 [lewuathe] Fix unnecessary sort method
d254be7 [lewuathe] Merge branch 'master' into SPARK-6263
62a9c7e [lewuathe] Fix appendBias return type
454c73d [lewuathe] Merge branch 'master' into SPARK-6263
a353354 [lewuathe] Remove unnecessary appendBias implementation
44295c2 [lewuathe] Merge branch 'master' into SPARK-6263
64f72ad [lewuathe] Merge branch 'master' into SPARK-6263
c728046 [lewuathe] Fix style
2980569 [lewuathe] [SPARK-6263] Python MLlib API missing items: Utils
2015-07-01 11:14:07 -07:00
Wenchen Fan 31b4a3d7f2 [SPARK-8621] [SQL] support empty string as column name
Improve the empty check in `parseAttributeName` so that an empty string is allowed as a column name.
Close https://github.com/apache/spark/pull/7117

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #7149 from cloud-fan/8621 and squashes the following commits:

efa9e3e [Wenchen Fan] support empty string
2015-07-01 10:31:35 -07:00
Reynold Xin 4137f769b8 [SPARK-8752][SQL] Add ExpectsInputTypes trait for defining expected input types.
This patch doesn't actually introduce any code that uses the new ExpectsInputTypes. It just adds the trait so others can use it. Also renamed the old expectsInputTypes function to just inputTypes.

We should add implicit type casting also in the future.

Author: Reynold Xin <rxin@databricks.com>

Closes #7151 from rxin/expects-input-types and squashes the following commits:

16cf07b [Reynold Xin] [SPARK-8752][SQL] Add ExpectsInputTypes trait for defining expected input types.
2015-07-01 10:30:54 -07:00
Sun Rui 69c5dee2f0 [SPARK-7714] [SPARKR] SparkR tests should use more specific expectations than expect_true
1. Update the pattern 'expect_true(a == b)' to 'expect_equal(a, b)'.
2. Update the pattern 'expect_true(inherits(a, b))' to 'expect_is(a, b)'.
3. Update the pattern 'expect_true(identical(a, b))' to 'expect_identical(a, b)'.

Author: Sun Rui <rui.sun@intel.com>

Closes #7152 from sun-rui/SPARK-7714 and squashes the following commits:

8ad2440 [Sun Rui] Fix test case errors.
8fe9f0c [Sun Rui] Update the pattern 'expect_true(identical(a, b))' to 'expect_identical(a, b)'.
f1b8005 [Sun Rui] Update the pattern 'expect_true(inherits(a, b))' to 'expect_is(a, b)'.
f631e94 [Sun Rui] Update the pattern 'expect_true(a == b)' to 'expect_equal(a, b)'.
2015-07-01 09:50:12 -07:00
cocoatomo fdcad6ef48 [SPARK-8763] [PYSPARK] executing run-tests.py with Python 2.6 fails with absence of subprocess.check_output function
Running run-tests.py with Python 2.6 causes the following error:

```
Running PySpark tests. Output is in python//Users/tomohiko/.jenkins/jobs/pyspark_test/workspace/python/unit-tests.log
Will test against the following Python executables: ['python2.6', 'python3.4', 'pypy']
Will test the following Python modules: ['pyspark-core', 'pyspark-ml', 'pyspark-mllib', 'pyspark-sql', 'pyspark-streaming']
Traceback (most recent call last):
  File "./python/run-tests.py", line 196, in <module>
    main()
  File "./python/run-tests.py", line 159, in main
    python_implementation = subprocess.check_output(
AttributeError: 'module' object has no attribute 'check_output'
...
```

This error is caused by the use of the subprocess.check_output function, which has existed only since Python 2.7.
(ref. https://docs.python.org/2.7/library/subprocess.html#subprocess.check_output)
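
One way to work around the missing function is a small fallback built on `subprocess.Popen`. The sketch below is an assumption about how such a backport could look, modelled on the documented behaviour of `check_output`; the name `check_output_compat` is hypothetical and this is not necessarily the code in the patch.

```python
import subprocess
import sys


def check_output_compat(*popenargs, **kwargs):
    """Run a command and return its stdout, like subprocess.check_output."""
    if "stdout" in kwargs:
        raise ValueError("stdout argument not allowed, it will be overridden.")
    process = subprocess.Popen(stdout=subprocess.PIPE, *popenargs, **kwargs)
    output, _ = process.communicate()
    retcode = process.poll()
    if retcode:
        # Report the failing command the same way check_output does.
        cmd = kwargs.get("args")
        if cmd is None:
            cmd = popenargs[0]
        raise subprocess.CalledProcessError(retcode, cmd)
    return output


# Prefer the standard library version when it exists (Python 2.7+).
check_output = getattr(subprocess, "check_output", check_output_compat)
```

Callers can then use `check_output(...)` unconditionally, for example `check_output([sys.executable, "-c", "print('ok')"])`.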

Author: cocoatomo <cocoatomo77@gmail.com>

Closes #7161 from cocoatomo/issues/8763-test-fails-py26 and squashes the following commits:

cf4f901 [cocoatomo] [SPARK-8763] backport process.check_output function from Python 2.7
2015-07-01 09:37:09 -07:00
Reynold Xin 97652416e2 [SPARK-8750][SQL] Remove the closure in functions.callUdf.
Author: Reynold Xin <rxin@databricks.com>

Closes #7148 from rxin/calludf-closure and squashes the following commits:

00df372 [Reynold Xin] Fixed index out of bound exception.
4beba76 [Reynold Xin] [SPARK-8750][SQL] Remove the closure in functions.callUdf.
2015-07-01 01:08:20 -07:00
Wenchen Fan 0eee061589 [SQL] [MINOR] remove internalRowRDD in DataFrame
Developers are already familiar with `queryExecution.toRDD` as the internal row RDD, so we should not add a new concept.

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #7116 from cloud-fan/internal-rdd and squashes the following commits:

24756ca [Wenchen Fan] remove internalRowRDD
2015-07-01 01:02:33 -07:00
Reynold Xin fc3a6fe67f [SPARK-8749][SQL] Remove HiveTypeCoercion trait.
Moved all the rules into the companion object.

Author: Reynold Xin <rxin@databricks.com>

Closes #7147 from rxin/SPARK-8749 and squashes the following commits:

c1c6dc0 [Reynold Xin] [SPARK-8749][SQL] Remove HiveTypeCoercion trait.
2015-07-01 00:08:16 -07:00