Commit graph

11792 commits

Author SHA1 Message Date
Josh Rosen 377ff4c9e8 [SPARK-8740] [PROJECT INFRA] Support GitHub OAuth tokens in dev/merge_spark_pr.py
This commit allows `dev/merge_spark_pr.py` to use personal GitHub OAuth tokens in order to make authenticated requests. This is necessary to work around per-IP rate limiting issues.

To use a token, just set the `GITHUB_OAUTH_KEY` environment variable.  You can create a personal token at https://github.com/settings/tokens; we only require `public_repo` scope.

If the script fails due to a rate-limit issue, it now logs a useful message directing the user to the OAuth token instructions.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #7136 from JoshRosen/pr-merge-script-oauth-authentication and squashes the following commits:

4d011bd [Josh Rosen] Fix error message
23d92ff [Josh Rosen] Support GitHub OAuth tokens in dev/merge_spark_pr.py
2015-07-01 23:06:52 -07:00
Holden Karau 15d41cc501 [SPARK-8769] [TRIVIAL] [DOCS] toLocalIterator should mention it results in many jobs
Author: Holden Karau <holden@pigscanfly.ca>

Closes #7171 from holdenk/SPARK-8769-toLocalIterator-documentation-improvement and squashes the following commits:

97ddd99 [Holden Karau] Add note
2015-07-01 23:05:45 -07:00
Holden Karau d14338eafc [SPARK-8771] [TRIVIAL] Add a version to the deprecated annotation for the actorSystem
Author: Holden Karau <holden@pigscanfly.ca>

Closes #7172 from holdenk/SPARK-8771-actor-system-deprecation-tag-uses-deprecated-deprecation-tag and squashes the following commits:

7f1455b [Holden Karau] Add .0s to the versions for the derpecated anotations in SparkEnv.scala
ca13c9d [Holden Karau] Add a version to the deprecated annotation for the actorSystem in SparkEnv
2015-07-01 23:04:05 -07:00
huangzhaowei 646366b5d2 [SPARK-8688] [YARN] Bug fix: disable the cache fs to gain the HDFS connection.
If `fs.hdfs.impl.disable.cache` was `false`(default), `FileSystem` will use the cached `DFSClient` which use old token.
[AMDelegationTokenRenewer](https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/AMDelegationTokenRenewer.scala#L196)
```scala
    val credentials = UserGroupInformation.getCurrentUser.getCredentials
    credentials.writeTokenStorageFile(tempTokenPath, discachedConfiguration)
```
Although the `credentials` had the new Token, but it still use the cached client and old token.
So It's better to set the `fs.hdfs.impl.disable.cache`  as `true` to avoid token expired.

[Jira](https://issues.apache.org/jira/browse/SPARK-8688)

Author: huangzhaowei <carlmartinmax@gmail.com>

Closes #7069 from SaintBacchus/SPARK-8688 and squashes the following commits:

f94cd0b [huangzhaowei] modify function parameter
8fb9eb9 [huangzhaowei] explicit  the comment
0cd55c9 [huangzhaowei] Rename function name to be an accurate one
cf776a1 [huangzhaowei] [SPARK-8688][YARN]Bug fix: disable the cache fs to gain the HDFS connection.
2015-07-01 23:01:44 -07:00
Devaraj K 792fcd802c [SPARK-8754] [YARN] YarnClientSchedulerBackend doesn't stop gracefully in failure conditions
In YarnClientSchedulerBackend.stop(), added a check for monitorThread.

Author: Devaraj K <devaraj@apache.org>

Closes #7153 from devaraj-kavali/master and squashes the following commits:

66be9ad [Devaraj K] https://issues.apache.org/jira/browse/SPARK-8754 YarnClientSchedulerBackend doesn't stop gracefully in failure conditions
2015-07-01 22:59:04 -07:00
zhichao.li b285ac5ba8 [SPARK-8227] [SQL] Add function unhex
cc chenghao-intel  adrian-wang

Author: zhichao.li <zhichao.li@intel.com>

Closes #7113 from zhichao-li/unhex and squashes the following commits:

379356e [zhichao.li] remove exception checking
a4ae6dc [zhichao.li] add udf_unhex to whitelist
fe5c14a [zhichao.li] add todigit
607d7a3 [zhichao.li] use checkInputTypes
bffd37f [zhichao.li] change to use Hex in apache common package
cde73f5 [zhichao.li] update to use AutoCastInputTypes
11945c7 [zhichao.li] style
c852d46 [zhichao.li] Add function unhex
2015-07-01 22:19:51 -07:00
Rosstin 4e4f74b5e1 [SPARK-8660] [MLLIB] removed > symbols from comments in LogisticRegressionSuite.scala for ease of copypaste
'>' symbols removed from comments in LogisticRegressionSuite.scala, for ease of copypaste

also single-lined the multiline commands (is this desirable, or does it violate style?)

Author: Rosstin <asterazul@gmail.com>

Closes #7167 from Rosstin/SPARK-8660-2 and squashes the following commits:

f4b9bc8 [Rosstin] SPARK-8660 restored character limit on multiline comments in LogisticRegressionSuite.scala
fe6b112 [Rosstin] SPARK-8660 > symbols removed from LogisticRegressionSuite.scala for easy of copypaste
39ddd50 [Rosstin] Merge branch 'master' of github.com:apache/spark into SPARK-8661
5a05dee [Rosstin] SPARK-8661 for LinearRegressionSuite.scala, changed javadoc-style comments to regular multiline comments to make it easier to copy-paste the R code.
bb9a4b1 [Rosstin] Merge branch 'master' of github.com:apache/spark into SPARK-8660
242aedd [Rosstin] SPARK-8660, changed comment style from JavaDoc style to normal multiline comment in order to make copypaste into R easier, in file classification/LogisticRegressionSuite.scala
2cd2985 [Rosstin] Merge branch 'master' of github.com:apache/spark into SPARK-8639
21ac1e5 [Rosstin] Merge branch 'master' of github.com:apache/spark into SPARK-8639
6c18058 [Rosstin] fixed minor typos in docs/README.md and docs/api.md
2015-07-01 21:42:06 -07:00
Reynold Xin 9fd13d5613 [SPARK-8770][SQL] Create BinaryOperator abstract class.
Our current BinaryExpression abstract class is not for generic binary expressions, i.e. it requires left/right children to have the same type. However, due to its name, contributors build new binary expressions that don't have that assumption (e.g. Sha) and still extend BinaryExpression.

This patch creates a new BinaryOperator abstract class, and update the analyzer o only apply type casting rule there. This patch also adds the notion of "prettyName" to expressions, which defines the user-facing name for the expression.

Author: Reynold Xin <rxin@databricks.com>

Closes #7174 from rxin/binary-opterator and squashes the following commits:

f31900d [Reynold Xin] [SPARK-8770][SQL] Create BinaryOperator abstract class.
fceb216 [Reynold Xin] Merge branch 'master' of github.com:apache/spark into binary-opterator
d8518cf [Reynold Xin] Updated Python tests.
2015-07-01 21:14:13 -07:00
Reynold Xin 3a342dedc0 Revert "[SPARK-8770][SQL] Create BinaryOperator abstract class."
This reverts commit 2727789998.
2015-07-01 16:59:39 -07:00
Reynold Xin 2727789998 [SPARK-8770][SQL] Create BinaryOperator abstract class.
Our current BinaryExpression abstract class is not for generic binary expressions, i.e. it requires left/right children to have the same type. However, due to its name, contributors build new binary expressions that don't have that assumption (e.g. Sha) and still extend BinaryExpression.

This patch creates a new BinaryOperator abstract class, and update the analyzer o only apply type casting rule there. This patch also adds the notion of "prettyName" to expressions, which defines the user-facing name for the expression.

Author: Reynold Xin <rxin@databricks.com>

Closes #7170 from rxin/binaryoperator and squashes the following commits:

51264a5 [Reynold Xin] [SPARK-8770][SQL] Create BinaryOperator abstract class.
2015-07-01 16:56:48 -07:00
Davies Liu f958f27e20 [SPARK-8766] support non-ascii character in column names
Use UTF-8 to encode the name of column in Python 2, or it may failed to encode with default encoding ('ascii').

This PR also fix a bug when there is Java exception without error message.

Author: Davies Liu <davies@databricks.com>

Closes #7165 from davies/non_ascii and squashes the following commits:

02cb61a [Davies Liu] fix tests
3b09d31 [Davies Liu] add encoding in header
867754a [Davies Liu] support non-ascii character in column names
2015-07-01 16:43:18 -07:00
Marcelo Vanzin 1ce6428907 [SPARK-3444] [CORE] Restore INFO level after log4j test.
Otherwise other tests don't log anything useful...

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #7140 from vanzin/SPARK-3444 and squashes the following commits:

de14836 [Marcelo Vanzin] Better fix.
6cff13a [Marcelo Vanzin] [SPARK-3444] [core] Restore INFO level after log4j test.
2015-07-01 20:40:47 +01:00
Davies Liu 3083e17645 [QUICKFIX] [SQL] fix copy of generated row
copy() of generated Row doesn't check nullability of columns

Author: Davies Liu <davies@databricks.com>

Closes #7163 from davies/fix_copy and squashes the following commits:

661a206 [Davies Liu] fix copy of generated row
2015-07-01 12:39:57 -07:00
jerryshao 9f7db3486f [SPARK-7820] [BUILD] Fix Java8-tests suite compile and test error under sbt
Author: jerryshao <saisai.shao@intel.com>

Closes #7120 from jerryshao/SPARK-7820 and squashes the following commits:

6902439 [jerryshao] fix Java8-tests suite compile error under sbt
2015-07-01 12:33:24 -07:00
zsxwing 75b9fe4c5f [SPARK-8378] [STREAMING] Add the Python API for Flume
Author: zsxwing <zsxwing@gmail.com>

Closes #6830 from zsxwing/flume-python and squashes the following commits:

78dfdac [zsxwing] Fix the compile error in the test code
f1bf3c0 [zsxwing] Address TD's comments
0449723 [zsxwing] Add sbt goal streaming-flume-assembly/assembly
e93736b [zsxwing] Fix the test case for determine_modules_to_test
9d5821e [zsxwing] Fix pyspark_core dependencies
f9ee681 [zsxwing] Merge branch 'master' into flume-python
7a55837 [zsxwing] Add streaming_flume_assembly to run-tests.py
b96b0de [zsxwing] Merge branch 'master' into flume-python
ce85e83 [zsxwing] Fix incompatible issues for Python 3
01cbb3d [zsxwing] Add import sys
152364c [zsxwing] Fix the issue that StringIO doesn't work in Python 3
14ba0ff [zsxwing] Add flume-assembly for sbt building
b8d5551 [zsxwing] Merge branch 'master' into flume-python
4762c34 [zsxwing] Fix the doc
0336579 [zsxwing] Refactor Flume unit tests and also add tests for Python API
9f33873 [zsxwing] Add the Python API for Flume
2015-07-01 11:59:24 -07:00
Joseph K. Bradley b8faa32875 [SPARK-8765] [MLLIB] [PYTHON] removed flaky python PIC test
See failure: [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36133/console]

CC yanboliang  mengxr

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #7164 from jkbradley/pic-python-test and squashes the following commits:

156d55b [Joseph K. Bradley] removed flaky python PIC test
2015-07-01 11:57:52 -07:00
Yuhao Yang 2012913355 [SPARK-8308] [MLLIB] add missing save load for python example
jira: https://issues.apache.org/jira/browse/SPARK-8308

1. add some missing save/load in python examples. , LogisticRegression, LinearRegression and NaiveBayes
2. tune down iterations for MatrixFactorization, since current number will trigger StackOverflow for default java configuration (>1M)

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #6760 from hhbyyh/docUpdate and squashes the following commits:

9bd3383 [Yuhao Yang] update scala example
8a44692 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into docUpdate
077cbb8 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into docUpdate
3e948dc [Yuhao Yang] add missing save load for python example
2015-07-01 11:17:56 -07:00
lewuathe 184de91d15 [SPARK-6263] [MLLIB] Python MLlib API missing items: Utils
Implement missing API in pyspark.

MLUtils
* appendBias
* loadVectors

`kFold` is also missing however I am not sure `ClassTag` can be passed or restored through python.

Author: lewuathe <lewuathe@me.com>

Closes #5707 from Lewuathe/SPARK-6263 and squashes the following commits:

16863ea [lewuathe] Merge master
3fc27e7 [lewuathe] Merge branch 'master' into SPARK-6263
6084e9c [lewuathe] Resolv conflict
d2aa2a0 [lewuathe] Resolv conflict
9c329d8 [lewuathe] Fix efficiency
3a12a2d [lewuathe] Merge branch 'master' into SPARK-6263
1d4714b [lewuathe] Fix style
b29e2bc [lewuathe] Remove scipy dependencies
e32eb40 [lewuathe] Merge branch 'master' into SPARK-6263
25d3c9d [lewuathe] Remove unnecessary imports
7ec04db [lewuathe] Resolv conflict
1502d13 [lewuathe] Resolv conflict
d6bd416 [lewuathe] Check existence of scipy.sparse
5d555b1 [lewuathe] Construct scipy.sparse matrix
c345a44 [lewuathe] Merge branch 'master' into SPARK-6263
b8b5ef7 [lewuathe] Fix unnecessary sort method
d254be7 [lewuathe] Merge branch 'master' into SPARK-6263
62a9c7e [lewuathe] Fix appendBias return type
454c73d [lewuathe] Merge branch 'master' into SPARK-6263
a353354 [lewuathe] Remove unnecessary appendBias implementation
44295c2 [lewuathe] Merge branch 'master' into SPARK-6263
64f72ad [lewuathe] Merge branch 'master' into SPARK-6263
c728046 [lewuathe] Fix style
2980569 [lewuathe] [SPARK-6263] Python MLlib API missing items: Utils
2015-07-01 11:14:07 -07:00
Wenchen Fan 31b4a3d7f2 [SPARK-8621] [SQL] support empty string as column name
improve the empty check in `parseAttributeName` so that we can allow empty string as column name.
Close https://github.com/apache/spark/pull/7117

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #7149 from cloud-fan/8621 and squashes the following commits:

efa9e3e [Wenchen Fan] support empty string
2015-07-01 10:31:35 -07:00
Reynold Xin 4137f769b8 [SPARK-8752][SQL] Add ExpectsInputTypes trait for defining expected input types.
This patch doesn't actually introduce any code that uses the new ExpectsInputTypes. It just adds the trait so others can use it. Also renamed the old expectsInputTypes function to just inputTypes.

We should add implicit type casting also in the future.

Author: Reynold Xin <rxin@databricks.com>

Closes #7151 from rxin/expects-input-types and squashes the following commits:

16cf07b [Reynold Xin] [SPARK-8752][SQL] Add ExpectsInputTypes trait for defining expected input types.
2015-07-01 10:30:54 -07:00
Sun Rui 69c5dee2f0 [SPARK-7714] [SPARKR] SparkR tests should use more specific expectations than expect_true
1. Update the pattern 'expect_true(a == b)' to 'expect_equal(a, b)'.
2. Update the pattern 'expect_true(inherits(a, b))' to 'expect_is(a, b)'.
3. Update the pattern 'expect_true(identical(a, b))' to 'expect_identical(a, b)'.

Author: Sun Rui <rui.sun@intel.com>

Closes #7152 from sun-rui/SPARK-7714 and squashes the following commits:

8ad2440 [Sun Rui] Fix test case errors.
8fe9f0c [Sun Rui] Update the pattern 'expect_true(identical(a, b))' to 'expect_identical(a, b)'.
f1b8005 [Sun Rui] Update the pattern 'expect_true(inherits(a, b))' to 'expect_is(a, b)'.
f631e94 [Sun Rui] Update the pattern 'expect_true(a == b)' to 'expect_equal(a, b)'.
2015-07-01 09:50:12 -07:00
cocoatomo fdcad6ef48 [SPARK-8763] [PYSPARK] executing run-tests.py with Python 2.6 fails with absence of subprocess.check_output function
Running run-tests.py with Python 2.6 cause following error:

```
Running PySpark tests. Output is in python//Users/tomohiko/.jenkins/jobs/pyspark_test/workspace/python/unit-tests.log
Will test against the following Python executables: ['python2.6', 'python3.4', 'pypy']
Will test the following Python modules: ['pyspark-core', 'pyspark-ml', 'pyspark-mllib', 'pyspark-sql', 'pyspark-streaming']
Traceback (most recent call last):
  File "./python/run-tests.py", line 196, in <module>
    main()
  File "./python/run-tests.py", line 159, in main
    python_implementation = subprocess.check_output(
AttributeError: 'module' object has no attribute 'check_output'
...
```

The cause of this error is using subprocess.check_output function, which exists since Python 2.7.
(ref. https://docs.python.org/2.7/library/subprocess.html#subprocess.check_output)

Author: cocoatomo <cocoatomo77@gmail.com>

Closes #7161 from cocoatomo/issues/8763-test-fails-py26 and squashes the following commits:

cf4f901 [cocoatomo] [SPARK-8763] backport process.check_output function from Python 2.7
2015-07-01 09:37:09 -07:00
Reynold Xin 97652416e2 [SPARK-8750][SQL] Remove the closure in functions.callUdf.
Author: Reynold Xin <rxin@databricks.com>

Closes #7148 from rxin/calludf-closure and squashes the following commits:

00df372 [Reynold Xin] Fixed index out of bound exception.
4beba76 [Reynold Xin] [SPARK-8750][SQL] Remove the closure in functions.callUdf.
2015-07-01 01:08:20 -07:00
Wenchen Fan 0eee061589 [SQL] [MINOR] remove internalRowRDD in DataFrame
Developers have already familiar with `queryExecution.toRDD` as internal row RDD, and we should not add new concept.

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #7116 from cloud-fan/internal-rdd and squashes the following commits:

24756ca [Wenchen Fan] remove internalRowRDD
2015-07-01 01:02:33 -07:00
Reynold Xin fc3a6fe67f [SPARK-8749][SQL] Remove HiveTypeCoercion trait.
Moved all the rules into the companion object.

Author: Reynold Xin <rxin@databricks.com>

Closes #7147 from rxin/SPARK-8749 and squashes the following commits:

c1c6dc0 [Reynold Xin] [SPARK-8749][SQL] Remove HiveTypeCoercion trait.
2015-07-01 00:08:16 -07:00
Reynold Xin 365c14055e [SPARK-8748][SQL] Move castability test out from Cast case class into Cast object.
This patch moved resolve function in Cast case class into the companion object, and renamed it canCast. We can then use this in the analyzer without a Cast expr.

Author: Reynold Xin <rxin@databricks.com>

Closes #7145 from rxin/cast and squashes the following commits:

cd086a9 [Reynold Xin] Whitespace changes.
4d2d989 [Reynold Xin] [SPARK-8748][SQL] Move castability test out from Cast case class into Cast object.
2015-06-30 23:04:54 -07:00
zsxwing 64c14618d3 [SPARK-6602][Core]Remove unnecessary synchronized
A follow-up pr to address https://github.com/apache/spark/pull/5392#discussion_r33627528

Author: zsxwing <zsxwing@gmail.com>

Closes #7141 from zsxwing/pr5392-follow-up and squashes the following commits:

fcf7b50 [zsxwing] Remove unnecessary synchronized
2015-06-30 21:57:07 -07:00
x1- b6e76edf30 [SPARK-8535] [PYSPARK] PySpark : Can't create DataFrame from Pandas dataframe with no explicit column name
Because implicit name of `pandas.columns` are Int, but `StructField` json expect `String`.
So I think `pandas.columns` are should be convert to `String`.

### issue

* [SPARK-8535 PySpark : Can't create DataFrame from Pandas dataframe with no explicit column name](https://issues.apache.org/jira/browse/SPARK-8535)

Author: x1- <viva008@gmail.com>

Closes #7124 from x1-/SPARK-8535 and squashes the following commits:

d68fd38 [x1-] modify unit-test using pandas.
ea1897d [x1-] For implicit name of pandas.columns are Int, so should be convert to String.
2015-06-30 20:35:46 -07:00
Feynman Liang f457569886 [SPARK-8471] [ML] Rename DiscreteCosineTransformer to DCT
Rename DiscreteCosineTransformer and related classes to DCT.

Author: Feynman Liang <fliang@databricks.com>

Closes #7138 from feynmanliang/dct-features and squashes the following commits:

e547b3e [Feynman Liang] Fix renaming bug
9d5c9e4 [Feynman Liang] Lowercase JavaDCTSuite variable
f9a8958 [Feynman Liang] Remove old files
f8fe794 [Feynman Liang] Merge branch 'master' into dct-features
894d0b2 [Feynman Liang] Rename DiscreteCosineTransformer to DCT
433dbc7 [Feynman Liang] Test refactoring
91e9636 [Feynman Liang] Style guide and test helper refactor
b5ac19c [Feynman Liang] Use Vector types, add Java test
530983a [Feynman Liang] Tests for other numeric datatypes
195d7aa [Feynman Liang] Implement support for arbitrary numeric types
95d4939 [Feynman Liang] Working DCT for 1D Doubles
2015-06-30 20:19:43 -07:00
zsxwing 3bee0f1466 [SPARK-6602][Core] Update Master, Worker, Client, AppClient and related classes to use RpcEndpoint
This PR updates the rest Actors in core to RpcEndpoint.

Because there is no `ActorSelection` in RpcEnv, I changes the logic of `registerWithMaster` in Worker and AppClient to avoid blocking the message loop. These changes need to be reviewed carefully.

Author: zsxwing <zsxwing@gmail.com>

Closes #5392 from zsxwing/rpc-rewrite-part3 and squashes the following commits:

2de7bed [zsxwing] Merge branch 'master' into rpc-rewrite-part3
f12d943 [zsxwing] Address comments
9137b82 [zsxwing] Fix the code style
e734c71 [zsxwing] Merge branch 'master' into rpc-rewrite-part3
2d24fb5 [zsxwing] Fix the code style
5a82374 [zsxwing] Merge branch 'master' into rpc-rewrite-part3
fa47110 [zsxwing] Merge branch 'master' into rpc-rewrite-part3
72304f0 [zsxwing] Update the error strategy for AkkaRpcEnv
e56cb16 [zsxwing] Always send failure back to the sender
a7b86e6 [zsxwing] Use JFuture for java.util.concurrent.Future
aa34b9b [zsxwing] Fix the code style
bd541e7 [zsxwing] Merge branch 'master' into rpc-rewrite-part3
25a84d8 [zsxwing] Use ThreadUtils
060ff31 [zsxwing] Merge branch 'master' into rpc-rewrite-part3
dbfc916 [zsxwing] Improve the docs and comments
837927e [zsxwing] Merge branch 'master' into rpc-rewrite-part3
5c27f97 [zsxwing] Merge branch 'master' into rpc-rewrite-part3
fadbb9e [zsxwing] Fix the code style
6637e3c [zsxwing] Merge remote-tracking branch 'origin/master' into rpc-rewrite-part3
7fdee0e [zsxwing] Fix the return type to ExecutorService and ScheduledExecutorService
e8ad0a5 [zsxwing] Fix the code style
6b2a104 [zsxwing] Log error and use SparkExitCode.UNCAUGHT_EXCEPTION exit code
fbf3194 [zsxwing] Add Utils.newDaemonSingleThreadExecutor and newDaemonSingleThreadScheduledExecutor
b776817 [zsxwing] Update Master, Worker, Client, AppClient and related classes to use RpcEndpoint
2015-06-30 17:39:55 -07:00
Tarek Auel ccdb05222a [SPARK-8727] [SQL] Missing python api; md5, log2
Jira: https://issues.apache.org/jira/browse/SPARK-8727

Author: Tarek Auel <tarek.auel@gmail.com>
Author: Tarek Auel <tarek.auel@googlemail.com>

Closes #7114 from tarekauel/missing-python and squashes the following commits:

ef4c61b [Tarek Auel] [SPARK-8727] revert dataframe change
4029d4d [Tarek Auel] removed dataframe pi and e unit test
66f0d2b [Tarek Auel] removed pi and e from python api and dataframe api; added _to_java_column(col) for strlen
4d07318 [Tarek Auel] fixed python unit test
45f2bee [Tarek Auel] fixed result of pi and e
c39f47b [Tarek Auel] add python api
bd50a3a [Tarek Auel] add missing python functions
2015-06-30 17:00:51 -07:00
Reynold Xin 8133125ca0 [SPARK-8741] [SQL] Remove e and pi from DataFrame functions.
Author: Reynold Xin <rxin@databricks.com>

Closes #7137 from rxin/SPARK-8741 and squashes the following commits:

32c7e75 [Reynold Xin] [SPARK-8741][SQL] Remove e and pi from DataFrame functions.
2015-06-30 16:54:51 -07:00
sethah 8d23587f1d [SPARK-7739] [MLLIB] Improve ChiSqSelector example code in user guide
Author: sethah <seth.hendrickson16@gmail.com>

Closes #7029 from sethah/working_on_SPARK-7739 and squashes the following commits:

ef96916 [sethah] Fixing some style issues
efea1f8 [sethah] adding clarification to ChiSqSelector example
2015-06-30 16:28:25 -07:00
Davies Liu 58ee2a2e47 [SPARK-8738] [SQL] [PYSPARK] capture SQL AnalysisException in Python API
Capture the AnalysisException in SQL, hide the long java stack trace, only show the error message.

cc rxin

Author: Davies Liu <davies@databricks.com>

Closes #7135 from davies/ananylis and squashes the following commits:

dad7ae7 [Davies Liu] add comment
ec0c0e8 [Davies Liu] Update utils.py
cdd7edd [Davies Liu] add doc
7b044c2 [Davies Liu] fix python 3
f84d3bd [Davies Liu] capture SQL AnalysisException in Python API
2015-06-30 16:17:46 -07:00
Kousuke Saruta d2495f7cc7 [SPARK-8739] [WEB UI] [WINDOWS] A illegal character \r can be contained in StagePage.
This issue was reported by saurfang. Thanks!

There is a following code in StagePage.scala.

```
                   |width="$serializationTimeProportion%"></rect>
                 |<rect class="getting-result-time-proportion"
                   |x="$gettingResultTimeProportionPos%" y="0px" height="26px"
                   |width="$gettingResultTimeProportion%"></rect></svg>',
               |'start': new Date($launchTime),
               |'end': new Date($finishTime)
             |}
           |""".stripMargin.replaceAll("\n", " ")
```

The last `replaceAll("\n", "")` doesn't work when we checkout and build source code on Windows and deploy on Linux.
It's because when we checkout the source code on Windows, new-line-code is replaced with `"\r\n"` and `replaceAll("\n", "")` replaces only `"\n"`.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #7133 from sarutak/SPARK-8739 and squashes the following commits:

17fb044 [Kousuke Saruta] Fixed a new-line-code issue
2015-06-30 14:09:29 -07:00
lee19 e72526227f [SPARK-8563] [MLLIB] Fixed a bug so that IndexedRowMatrix.computeSVD().U.numCols = k
I'm sorry that I made https://github.com/apache/spark/pull/6949 closed by mistake.
I pushed codes again.

And, I added a test code.

>
There is a bug that `U.numCols() = self.nCols` in `IndexedRowMatrix.computeSVD()`
It should have been `U.numCols() = k = svd.U.numCols()`

>
```
self = U * sigma * V.transpose
(m x n) = (m x n) * (k x k) * (k x n) //ASIS
-->
(m x n) = (m x k) * (k x k) * (k x n) //TOBE
```

Author: lee19 <lee19@live.co.kr>

Closes #6953 from lee19/MLlibBugfix and squashes the following commits:

c1812a0 [lee19] [SPARK-8563] [MLlib] Used nRows instead of numRows() to reduce a burden.
4b9803b [lee19] [SPARK-8563] [MLlib] Fixed a build error.
c2ccd89 [lee19] Added a unit test that validates matrix sizes of svd for [SPARK-8563][MLlib]
8373424 [lee19] [SPARK-8563][MLlib] Fixed a bug so that IndexedRowMatrix.computeSVD().U.numCols = k
2015-06-30 14:08:00 -07:00
zsxwing 8c898964f0 [SPARK-8705] [WEBUI] Don't display rects when totalExecutionTime is 0
Because `System.currentTimeMillis()` is not accurate for tasks that only need several milliseconds, sometimes `totalExecutionTime` in `makeTimeline` will be 0. If `totalExecutionTime` is 0, there will the following error in the console.

![screen shot 2015-06-29 at 7 08 55 pm](https://cloud.githubusercontent.com/assets/1000778/8406776/5cd38e04-1e92-11e5-89f2-0c5134fe4b6b.png)

This PR fixes it by using an empty svg tag when `totalExecutionTime` is 0. This is a screenshot for a task that its totalExecutionTime is 0 after fixing it.

![screen shot 2015-06-30 at 12 26 52 am](https://cloud.githubusercontent.com/assets/1000778/8412896/7b33b4be-1ebf-11e5-9100-d6d656af3747.png)

Author: zsxwing <zsxwing@gmail.com>

Closes #7088 from zsxwing/SPARK-8705 and squashes the following commits:

9ee4ef5 [zsxwing] Address comments
ef2ecfa [zsxwing] Don't display rects when totalExecutionTime is 0
2015-06-30 14:06:50 -07:00
Joseph K. Bradley 3ba23ffd37 [SPARK-8736] [ML] GBTRegressor should not threshold prediction
Changed GBTRegressor so it does NOT threshold the prediction.  Added test which fails with bug but works after fix.

CC: feynmanliang  mengxr

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #7134 from jkbradley/gbrt-fix and squashes the following commits:

613b90e [Joseph K. Bradley] Changed GBTRegressor so it does NOT threshold the prediction
2015-06-30 14:02:50 -07:00
Marcelo Vanzin 4bb8375fc2 [SPARK-8372] Do not show applications that haven't recorded their app ID yet.
Showing these applications may lead to weird behavior in the History Server. For old logs, if
the app ID is recorded later, you may end up with a duplicate entry. For new logs, the app might
be listed with a ".inprogress" suffix.

So ignore those, but still allow old applications that don't record app IDs at all (1.0 and 1.1) to be shown.

Author: Marcelo Vanzin <vanzin@cloudera.com>
Author: Carson Wang <carson.wang@intel.com>

Closes #7097 from vanzin/SPARK-8372 and squashes the following commits:

a24eab2 [Marcelo Vanzin] Feedback.
112ae8f [Marcelo Vanzin] Merge branch 'master' into SPARK-8372
7b91b74 [Marcelo Vanzin] Handle logs generated by 1.0 and 1.1.
1eca3fe [Carson Wang] [SPARK-8372] History server shows incorrect information for application not started
2015-06-30 14:01:52 -07:00
Joshi 7dda0844e1 [SPARK-2645] [CORE] Allow SparkEnv.stop() to be called multiple times without side effects.
Fix for SparkContext stop behavior - Allow sc.stop() to be called multiple times without side effects.

Author: Joshi <rekhajoshm@gmail.com>
Author: Rekha Joshi <rekhajoshm@gmail.com>

Closes #6973 from rekhajoshm/SPARK-2645 and squashes the following commits:

277043e [Joshi] Fix for SparkContext stop behavior
446b0a4 [Joshi] Fix for SparkContext stop behavior
2ce5760 [Joshi] Fix for SparkContext stop behavior
c97839a [Joshi] Fix for SparkContext stop behavior
1aff39c [Joshi] Fix for SparkContext stop behavior
12f66b5 [Joshi] Fix for SparkContext stop behavior
72bb484 [Joshi] Fix for SparkContext stop behavior
a5a7d7f [Joshi] Fix for SparkContext stop behavior
9193a0c [Joshi] Fix for SparkContext stop behavior
58dba70 [Joshi] SPARK-2645: Fix for SparkContext stop behavior
380c5b0 [Joshi] SPARK-2645: Fix for SparkContext stop behavior
b566b66 [Joshi] SPARK-2645: Fix for SparkContext stop behavior
0be142d [Rekha Joshi] Merge pull request #3 from apache/master
106fd8e [Rekha Joshi] Merge pull request #2 from apache/master
e3677c9 [Rekha Joshi] Merge pull request #1 from apache/master
2015-06-30 14:00:35 -07:00
xutingjun 79f0b371a3 [SPARK-8560] [UI] The Executors page will have negative if having resubmitted tasks
when the ```taskEnd.reason``` is ```Resubmitted```, it shouldn't  do statistics. Because this tasks has a ```SUCCESS``` taskEnd before.

Author: xutingjun <xutingjun@huawei.com>

Closes #6950 from XuTingjun/pageError and squashes the following commits:

af35dc3 [xutingjun] When taskEnd is Resubmitted, don't do statistics
2015-06-30 13:57:27 -07:00
Yuhao Yang 61d7b533dd [SPARK-7514] [MLLIB] Add MinMaxScaler to feature transformation
jira: https://issues.apache.org/jira/browse/SPARK-7514
Add a popular scaling method to feature component, which is commonly known as min-max normalization or Rescaling.

Core function is,
Normalized(x) = (x - min) / (max - min) * scale + newBase

where `newBase` and `scale` are parameters (type Double) of the `VectorTransformer`. `newBase` is the new minimum number for the features, and `scale` controls the ranges after transformation. This is a little complicated than the basic MinMax normalization, yet it provides flexibility so that users can control the range more specifically. like [0.1, 0.9] in some NN application.

For case that `max == min`, 0.5 is used as the raw value. (0.5 * scale + newBase)
I'll add UT once the design got settled ( and this is not considered as too naive)

reference:
 http://en.wikipedia.org/wiki/Feature_scaling
http://stn.spotfire.com/spotfire_client_help/index.htm#norm/norm_scale_between_0_and_1.htm

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #6039 from hhbyyh/minMaxNorm and squashes the following commits:

f942e9f [Yuhao Yang] add todo for metadata
8b37bbc [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into minMaxNorm
4894dbc [Yuhao Yang] add copy
fa2989f [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into minMaxNorm
29db415 [Yuhao Yang] add clue and minor adjustment
5b8f7cc [Yuhao Yang] style fix
9b133d0 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into minMaxNorm
22f20f2 [Yuhao Yang] style change and bug fix
747c9bb [Yuhao Yang] add ut and remove mllib version
a5ba0aa [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into minMaxNorm
585cc07 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into minMaxNorm
1c6dcb1 [Yuhao Yang] minor change
0f1bc80 [Yuhao Yang] add MinMaxScaler to ml
8e7436e [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into minMaxNorm
3663165 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into minMaxNorm
1247c27 [Yuhao Yang] some comments improvement
d285a19 [Yuhao Yang] initial checkin for minMaxNorm
2015-06-30 12:44:43 -07:00
Feynman Liang 74cc16dbc3 [SPARK-8471] [ML] Discrete Cosine Transform Feature Transformer
Implementation and tests for Discrete Cosine Transformer.

Author: Feynman Liang <fliang@databricks.com>

Closes #6894 from feynmanliang/dct-features and squashes the following commits:

433dbc7 [Feynman Liang] Test refactoring
91e9636 [Feynman Liang] Style guide and test helper refactor
b5ac19c [Feynman Liang] Use Vector types, add Java test
530983a [Feynman Liang] Tests for other numeric datatypes
195d7aa [Feynman Liang] Implement support for arbitrary numeric types
95d4939 [Feynman Liang] Working DCT for 1D Doubles
2015-06-30 12:31:33 -07:00
Vinod K C b8e5bb6fc1 [SPARK-8628] [SQL] Race condition in AbstractSparkSQLParser.parse
Made lexical iniatialization as lazy val

Author: Vinod K C <vinod.kc@huawei.com>

Closes #7015 from vinodkc/handle_lexical_initialize_schronization and squashes the following commits:

b6d1c74 [Vinod K C] Avoided repeated lexical  initialization
5863cf7 [Vinod K C] Removed space
e27c66c [Vinod K C] Avoid reinitialization of lexical in parse method
ef4f60f [Vinod K C] Reverted import order
e9fc49a [Vinod K C] handle  synchronization in SqlLexical.initialize
2015-06-30 12:24:47 -07:00
Yanbo Liang c1befd780c [SPARK-8664] [ML] Add PCA transformer
Add PCA transformer for ML pipeline

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #7065 from yanboliang/spark-8664 and squashes the following commits:

4afae45 [Yanbo Liang] address comments
e9effd7 [Yanbo Liang] Add PCA transformer
2015-06-30 12:23:48 -07:00
Christian Kadner 1e1f339976 [SPARK-6785] [SQL] fix DateTimeUtils for dates before 1970
Hi Michael,
this Pull-Request is a follow-up to [PR-6242](https://github.com/apache/spark/pull/6242). I removed the two obsolete test cases from the HiveQuerySuite and deleted the corresponding golden answer files.
Thanks for your review!

Author: Christian Kadner <ckadner@us.ibm.com>

Closes #6983 from ckadner/SPARK-6785 and squashes the following commits:

ab1e79b [Christian Kadner] Merge remote-tracking branch 'origin/SPARK-6785' into SPARK-6785
1fed877 [Christian Kadner] [SPARK-6785][SQL] failed Scala style test, remove spaces on empty line DateTimeUtils.scala:61
9d8021d [Christian Kadner] [SPARK-6785][SQL] merge recent changes in DateTimeUtils & MiscFunctionsSuite
b97c3fb [Christian Kadner] [SPARK-6785][SQL] move test case for DateTimeUtils to DateTimeUtilsSuite
a451184 [Christian Kadner] [SPARK-6785][SQL] fix DateTimeUtils.fromJavaDate(java.util.Date) for Dates before 1970
2015-06-30 12:22:34 -07:00
huangzhaowei d16a944375 [SPARK-8619] [STREAMING] Don't recover keytab and principal configuration within Streaming checkpoint
[Client.scala](https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L786) will change these configurations, so this would cause the problem that the Streaming recover logic can't find the local keytab file(since configuration was changed)
```scala
      sparkConf.set("spark.yarn.keytab", keytabFileName)
      sparkConf.set("spark.yarn.principal", args.principal)
```

Problem described at [Jira](https://issues.apache.org/jira/browse/SPARK-8619)

Author: huangzhaowei <carlmartinmax@gmail.com>

Closes #7008 from SaintBacchus/SPARK-8619 and squashes the following commits:

d50dbdf [huangzhaowei] Delect one blank space
9b8e92c [huangzhaowei] Fix code style and add a short comment.
0d8f800 [huangzhaowei] Don't recover keytab and principal configuration within Streaming checkpoint.
2015-06-30 11:46:22 -07:00
zsxwing 57264400ac [SPARK-8630] [STREAMING] Prevent from checkpointing QueueInputDStream
This PR throws an exception in `QueueInputDStream.writeObject` so that it can fail the application when calling `StreamingContext.start` rather than failing it during recovering QueueInputDStream.

Author: zsxwing <zsxwing@gmail.com>

Closes #7016 from zsxwing/queueStream-checkpoint and squashes the following commits:

89a3d73 [zsxwing] Fix JavaAPISuite.testQueueStream
cc40fd7 [zsxwing] Prevent from checkpointing QueueInputDStream
2015-06-30 11:14:38 -07:00
nishkamravi2 ca7e460f7d [SPARK-7988] [STREAMING] Round-robin scheduling of receivers by default
Minimal PR for round-robin scheduling of receivers. Dense scheduling can be enabled by setting preferredLocation, so a new config parameter isn't really needed. Tested this on a cluster of 6 nodes and noticed 20-25% gain in throughput compared to random scheduling.

tdas pwendell

Author: nishkamravi2 <nishkamravi@gmail.com>
Author: Nishkam Ravi <nravi@cloudera.com>

Closes #6607 from nishkamravi2/master_nravi and squashes the following commits:

1918819 [Nishkam Ravi] Update ReceiverTrackerSuite.scala
f747739 [Nishkam Ravi] Update ReceiverTrackerSuite.scala
6127e58 [Nishkam Ravi] Update ReceiverTracker and ReceiverTrackerSuite
9f1abc2 [nishkamravi2] Update ReceiverTrackerSuite.scala
ae29152 [Nishkam Ravi] Update test suite with TD's suggestions
48a4a97 [nishkamravi2] Update ReceiverTracker.scala
bc23907 [nishkamravi2] Update ReceiverTracker.scala
68e8540 [nishkamravi2] Update SchedulerSuite.scala
4604f28 [nishkamravi2] Update SchedulerSuite.scala
179b90f [nishkamravi2] Update ReceiverTracker.scala
242e677 [nishkamravi2] Update SchedulerSuite.scala
7f3e028 [Nishkam Ravi] Update ReceiverTracker.scala, add unit test cases in SchedulerSuite
f8a3e05 [nishkamravi2] Update ReceiverTracker.scala
4cf97b6 [nishkamravi2] Update ReceiverTracker.scala
16e84ec [Nishkam Ravi] Update ReceiverTracker.scala
45e3a99 [Nishkam Ravi] Merge branch 'master_nravi' of https://github.com/nishkamravi2/spark into master_nravi
02dbdb8 [Nishkam Ravi] Update ReceiverTracker.scala
07b9dfa [nishkamravi2] Update ReceiverTracker.scala
6caeefe [nishkamravi2] Update ReceiverTracker.scala
7888257 [nishkamravi2] Update ReceiverTracker.scala
6e3515c [Nishkam Ravi] Minor changes
975b8d8 [Nishkam Ravi] Merge branch 'master_nravi' of https://github.com/nishkamravi2/spark into master_nravi
3cac21b [Nishkam Ravi] Generalize the scheduling algorithm
b05ee2f [nishkamravi2] Update ReceiverTracker.scala
bb5e09b [Nishkam Ravi] Add a new var in receiver to store location information for round-robin scheduling
41705de [nishkamravi2] Update ReceiverTracker.scala
fff1b2e [Nishkam Ravi] Round-robin scheduling of streaming receivers
2015-06-30 11:12:15 -07:00
Tijo Thomas 9213f73a8e [SPARK-8615] [DOCUMENTATION] Fixed Sample deprecated code
Modified the deprecated jdbc api in the documentation.

Author: Tijo Thomas <tijoparacka@gmail.com>

Closes #7039 from tijoparacka/JIRA_8615 and squashes the following commits:

6e73b8a [Tijo Thomas] Reverted new lines
4042fcf [Tijo Thomas] updated to sql documentation
a27949c [Tijo Thomas] Fixed Sample deprecated code
2015-06-30 10:50:45 -07:00