Commit graph

8987 commits

Author SHA1 Message Date
alexdebrie 794f3aec24 [SPARK-4745] Fix get_existing_cluster() function with multiple security groups
The current get_existing_cluster() function would only find an instance belonging to a cluster if the instance's security groups == cluster_name + "-master" (or "-slaves"). This fix allows for multiple security groups by checking whether the cluster_name + "-master" security group is in the list of groups for a particular instance.
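
A sketch of the changed check (the real code lives in ec2/spark_ec2.py and is Python; names here are illustrative) — the logic moves from equality to membership:

```
// Illustrative sketch of the fix: match on membership, not equality.
def isClusterMaster(instanceGroups: Seq[String], clusterName: String): Boolean =
  instanceGroups.contains(clusterName + "-master")
  // was effectively: instanceGroups == Seq(clusterName + "-master")
```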

Author: alexdebrie <alexdebrie1@gmail.com>

Closes #3596 from alexdebrie/master and squashes the following commits:

9d51232 [alexdebrie] Fix get_existing_cluster() function with multiple security groups
2014-12-04 14:14:39 -08:00
Patrick Wendell 8dae26f838 [HOTFIX] Fixing two issues with the release script.
1. The version replacement was still producing some false changes.
2. Uploads now go to the staging repo specifically.

Author: Patrick Wendell <pwendell@gmail.com>

Closes #3608 from pwendell/release-script and squashes the following commits:

3c63294 [Patrick Wendell] Fixing two issues with the release script:
2014-12-04 12:11:41 -08:00
WangTaoTheTonic 8106b1e36b [SPARK-4253] Ignore spark.driver.host in yarn-cluster and standalone-cluster modes
In yarn-cluster and standalone-cluster modes, we don't know where the driver will run until it is launched. If the `spark.driver.host` property is set on the submitting machine and propagated to the driver through SparkConf, this leads to errors when the driver launches.

This patch fixes this issue by dropping the `spark.driver.host` property in SparkSubmit when running in a cluster deploy mode.
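
A minimal sketch of the idea (hypothetical helper, not SparkSubmit's actual code):

```
// Drop spark.driver.host before launching the driver in cluster deploy modes,
// since the submitting machine's value would be wrong on the driver's host.
def prepareDriverConf(conf: Map[String, String], deployMode: String): Map[String, String] =
  if (deployMode == "cluster") conf - "spark.driver.host" else conf
```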

Author: WangTaoTheTonic <barneystinson@aliyun.com>
Author: WangTao <barneystinson@aliyun.com>

Closes #3112 from WangTaoTheTonic/SPARK4253 and squashes the following commits:

ed1a25c [WangTaoTheTonic] revert unrelated formatting issue
02c4e49 [WangTao] add comment
32a3f3f [WangTaoTheTonic] ignore it in SparkSubmit instead of SparkContext
667cf24 [WangTaoTheTonic] document fix
ff8d5f7 [WangTaoTheTonic] also ignore it in standalone cluster mode
2286e6b [WangTao] ignore spark.driver.host in yarn-cluster mode
2014-12-04 11:53:23 -08:00
Cheng Lian 28c7acacef [SPARK-4683][SQL] Add a beeline.cmd to run on Windows
Tested locally with a Win7 VM. Connected to a Spark SQL Thrift server instance running on Mac OS X with the following command line:

```
bin\beeline.cmd -u jdbc:hive2://10.0.2.2:10000 -n lian
```

Author: Cheng Lian <lian@databricks.com>

Closes #3599 from liancheng/beeline.cmd and squashes the following commits:

79092e7 [Cheng Lian] Windows script for BeeLine
2014-12-04 10:21:03 -08:00
Xiangrui Meng 7e758d7092 [FIX][DOC] Fix broken links in ml-guide.md
and some minor changes in ScalaDoc.

Author: Xiangrui Meng <meng@databricks.com>

Closes #3601 from mengxr/SPARK-4575-fix and squashes the following commits:

c559768 [Xiangrui Meng] minor code update
ce94da8 [Xiangrui Meng] Java Bean -> JavaBean
0b5c182 [Xiangrui Meng] fix links in ml-guide
2014-12-04 20:16:35 +08:00
Joseph K. Bradley 469a6e5f3b [SPARK-4575] [mllib] [docs] spark.ml pipelines doc + bug fixes
Documentation:
* Added ml-guide.md, linked from mllib-guide.md
* Updated mllib-guide.md with small section pointing to ml-guide.md

Examples:
* CrossValidatorExample
* SimpleParamsExample
* (I copied these + the SimpleTextClassificationPipeline example into the ml-guide.md)

Bug fixes:
* PipelineModel: did not use ParamMaps correctly
* UnaryTransformer: issues with TypeTag serialization (Thanks to mengxr for that fix!)

CC: mengxr shivaram etrain. Documentation for Pipelines: I know the docs are not complete, but the goal is to have enough to let interested people get started using spark.ml, and to add more docs once the package is more established/complete.

Author: Joseph K. Bradley <joseph@databricks.com>
Author: jkbradley <joseph.kurata.bradley@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes #3588 from jkbradley/ml-package-docs and squashes the following commits:

d393b5c [Joseph K. Bradley] fixed bug in Pipeline (typo from last commit).  updated examples for CV and Params for spark.ml
c38469c [Joseph K. Bradley] Updated ml-guide with CV examples
99f88c2 [Joseph K. Bradley] Fixed bug in PipelineModel.transform* with usage of params.  Updated CrossValidatorExample to use more training examples so it is less likely to get a 0-size fold.
ea34dc6 [jkbradley] Merge pull request #4 from mengxr/ml-package-docs
3b83ec0 [Xiangrui Meng] replace TypeTag with explicit datatype
41ad9b1 [Joseph K. Bradley] Added examples for spark.ml: SimpleParamsExample + Java version, CrossValidatorExample + Java version.  CrossValidatorExample not working yet.  Added programming guide for spark.ml, but need to add CrossValidatorExample to it once CrossValidatorExample works.
2014-12-04 17:00:06 +08:00
Joseph K. Bradley 529439bd50 [docs] Fix outdated comment in tuning guide
When you use the SPARK_JAVA_OPTS env variable, Spark complains:

```
SPARK_JAVA_OPTS was detected (set to ' -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps ').
This is deprecated in Spark 1.0+.

Please instead use:
 - ./spark-submit with conf/spark-defaults.conf to set defaults for an application
 - ./spark-submit with --driver-java-options to set -X options for a driver
 - spark.executor.extraJavaOptions to set -X options for executors
 - SPARK_DAEMON_JAVA_OPTS to set java options for standalone daemons (master or worker)
```

This updates the docs to redirect the user to the relevant part of the configuration docs.

CC: mengxr  but please CC someone else as needed

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #3592 from jkbradley/tuning-doc and squashes the following commits:

0760ce1 [Joseph K. Bradley] fixed outdated comment in tuning guide
2014-12-04 00:59:32 -08:00
Aaron Davidson c6c7165e7e [SQL] Minor: Avoid calling Seq#size in a loop
Just found this instance while doing some jstack-based profiling of a Spark SQL job. It is very unlikely that this is causing much of a perf issue anywhere, but it is unnecessarily suboptimal.
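
A generic sketch of the pattern (illustrative, not the actual Spark SQL code): hoist the size computation out of the loop, since Seq#size can be O(n) for linear sequences.

```
def sumAll(xs: IndexedSeq[Int]): Int = {
  val n = xs.size      // computed once, outside the loop
  var sum = 0
  var i = 0
  while (i < n) {      // instead of re-evaluating xs.size every iteration
    sum += xs(i)
    i += 1
  }
  sum
}
```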

Author: Aaron Davidson <aaron@databricks.com>

Closes #3593 from aarondav/seq-opt and squashes the following commits:

962cdfc [Aaron Davidson] [SQL] Minor: Avoid calling Seq#size in a loop
2014-12-04 00:58:42 -08:00
lewuathe 20bfea4ab7 [SPARK-4685] Include all spark.ml and spark.mllib packages in JavaDoc's MLlib group
This is #3554 from Lewuathe except that I put both `spark.ml` and `spark.mllib` in the group `MLlib`.

Closes #3554

jkbradley

Author: lewuathe <lewuathe@me.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes #3598 from mengxr/Lewuathe-modify-javadoc-setting and squashes the following commits:

184609a [Xiangrui Meng] merge spark.ml and spark.mllib into the same group in javadoc
f7535e6 [lewuathe] [SPARK-4685] Update JavaDoc settings to include spark.ml and all spark.mllib subpackages in the right sections
2014-12-04 16:51:41 +08:00
Reynold Xin c3ad486036 [SPARK-4719][API] Consolidate various narrow dep RDD classes with MapPartitionsRDD
MappedRDD, MappedValuesRDD, FlatMappedValuesRDD, FilteredRDD, GlommedRDD, FlatMappedRDD are not necessary. They can be implemented trivially using MapPartitionsRDD.
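
A standalone sketch of the consolidation (simplified; an Iterator stands in for a partition): each of those specialized RDDs is just a particular partition-level function applied through one primitive.

```
// One primitive: apply a function to a whole partition (here, an Iterator).
def mapPartitions[T, U](part: Iterator[T])(f: Iterator[T] => Iterator[U]): Iterator[U] = f(part)

// The specialized operations fall out as one-liners on top of it.
def mapOp[T, U](part: Iterator[T])(g: T => U): Iterator[U]       = mapPartitions(part)(_.map(g))
def filterOp[T](part: Iterator[T])(p: T => Boolean): Iterator[T] = mapPartitions(part)(_.filter(p))
def flatMapOp[T, U](part: Iterator[T])(g: T => TraversableOnce[U]): Iterator[U] =
  mapPartitions(part)(_.flatMap(g))
def glomOp[T](part: Iterator[T]): Iterator[List[T]] = mapPartitions(part)(it => Iterator(it.toList))
```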

Author: Reynold Xin <rxin@databricks.com>

Closes #3578 from rxin/SPARK-4719 and squashes the following commits:

eed9853 [Reynold Xin] Preserve partitioning for filter.
eb1a89b [Reynold Xin] [SPARK-4719][API] Consolidate various narrow dep RDD classes with MapPartitionsRDD.
2014-12-04 00:45:57 -08:00
Jacky Li ed88db4cb2 [SQL] remove unnecessary import
Author: Jacky Li <jacky.likun@huawei.com>

Closes #3585 from jackylk/remove and squashes the following commits:

045423d [Jacky Li] remove unnecessary import
2014-12-04 00:43:55 -08:00
Patrick Wendell 3cdae038f1 MAINTENANCE: Automated closing of pull requests.
This commit exists to close the following pull requests on Github:

Closes #1875 (close requested by 'marmbrus')
Closes #3566 (close requested by 'andrewor14')
Closes #3487 (close requested by 'pwendell')
2014-12-03 22:15:46 -08:00
Andrew Or a4dfb4efef [Release] Correctly translate contributors name in release notes
This commit involves three main changes:

(1) It separates the translation of contributor names from the
generation of the contributors list. This is largely motivated
by the Github API limit; even if we exceed this limit, we should
at least be able to proceed manually as before. This is why the
translation logic is abstracted into its own script
translate-contributors.py.

(2) When we look for candidate replacements for invalid author
names, we should look for the assignees of the associated JIRAs
too. As a result, the intermediate file must keep track of these.

(3) This provides an interactive mode with which the user can
sit at the terminal and manually pick the candidate replacement
that he/she thinks makes the most sense. As before, there is a
non-interactive mode that picks the first candidate that the
script considers "valid."

TODO: We should have a known_contributors file that stores
known mappings so we don't have to go through all of this
translation every time. This is also valuable because some
contributors simply cannot be automatically translated.
2014-12-03 19:10:07 -08:00
Joseph K. Bradley 657a88835d [SPARK-4580] [SPARK-4610] [mllib] [docs] Documentation for tree ensembles + DecisionTree API fix
Major changes:
* Added programming guide sections for tree ensembles
* Added examples for tree ensembles
* Updated DecisionTree programming guide with more info on parameters
* **API change**: Standardized the tree parameter for the number of classes (for classification)

Minor changes:
* Updated decision tree documentation
* Updated existing tree and tree ensemble examples
 * Use train/test split, and compute test error instead of training error.
 * Fixed decision_tree_runner.py to actually use the number of classes it computes from data. (small bug fix)

Note: I know this is a lot of lines, but most is covered by:
* Programming guide sections for gradient boosting and random forests.  (The changes are probably best viewed by generating the docs locally.)
* New examples (which were copied from the programming guide)
* The "numClasses" renaming

I have run all examples and relevant unit tests.
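
For reference, a sketch of the renamed parameter in use, based on the 1.2-era MLlib API (data path and parameter values illustrative; assumes a SparkContext named sc):

```
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.util.MLUtils

val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
val Array(training, test) = data.randomSplit(Array(0.7, 0.3))
val model = DecisionTree.trainClassifier(
  training,
  numClasses = 2,                        // previously numClassesForClassification
  categoricalFeaturesInfo = Map[Int, Int](),
  impurity = "gini",
  maxDepth = 5,
  maxBins = 32)
```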

CC: mengxr manishamde codedeft

Author: Joseph K. Bradley <joseph@databricks.com>
Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>

Closes #3461 from jkbradley/ensemble-docs and squashes the following commits:

70a75f3 [Joseph K. Bradley] updated forest vs boosting comparison
d1de753 [Joseph K. Bradley] Added note about toString and toDebugString for DecisionTree to migration guide
8e87f8f [Joseph K. Bradley] Combined GBT and RandomForest guides into one ensembles guide
6fab846 [Joseph K. Bradley] small fixes based on review
b9f8576 [Joseph K. Bradley] updated decision tree doc
375204c [Joseph K. Bradley] fixed python style
2b60b6e [Joseph K. Bradley] merged Java RandomForest examples into 1 file.  added header.  Fixed small bug in same example in the programming guide.
706d332 [Joseph K. Bradley] updated python DT runner to print full model if it is small
c76c823 [Joseph K. Bradley] added migration guide for mllib
abe5ed7 [Joseph K. Bradley] added examples for random forest in Java and Python to examples folder
07fc11d [Joseph K. Bradley] Renamed numClassesForClassification to numClasses everywhere in trees and ensembles. This is a breaking API change, but it was necessary to correct an API inconsistency in Spark 1.1 (where Python DecisionTree used numClasses but Scala used numClassesForClassification).
cdfdfbc [Joseph K. Bradley] added examples for GBT
6372a2b [Joseph K. Bradley] updated decision tree examples to use random split.  tested all of them.
ad3e695 [Joseph K. Bradley] added gbt and random forest to programming guide.  still need to update their examples
2014-12-04 09:57:50 +08:00
Joseph K. Bradley 27ab0b8a03 [SPARK-4711] [mllib] [docs] Programming guide advice on choosing optimizer
I have heard requests for the docs to include advice about choosing an optimization method. The programming guide could include a brief statement about this (so the user does not have to read the whole optimization section).

CC: mengxr

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #3569 from jkbradley/lr-doc and squashes the following commits:

654aeb5 [Joseph K. Bradley] updated section header for mllib-optimization
5035ad0 [Joseph K. Bradley] updated based on review
94f6dec [Joseph K. Bradley] Updated linear methods and optimization docs with quick advice on choosing an optimization method
2014-12-04 08:58:03 +08:00
Reynold Xin 1826372d0a [SPARK-4085] Propagate FetchFailedException when Spark fails to read local shuffle file.
cc aarondav kayousterhout pwendell

This should go into 1.2?

Author: Reynold Xin <rxin@databricks.com>

Closes #3579 from rxin/SPARK-4085 and squashes the following commits:

255b4fd [Reynold Xin] Updated test.
f9814d9 [Reynold Xin] Code review feedback.
2afaf35 [Reynold Xin] [SPARK-4085] Propagate FetchFailedException when Spark fails to read local shuffle file.
2014-12-03 16:28:24 -08:00
Mark Hamstra 96b27855c5 [SPARK-4498][core] Don't transition ExecutorInfo to RUNNING until Driver adds Executor
The ExecutorInfo only reaches the RUNNING state if the Driver is alive to send the ExecutorStateChanged message to the master. Otherwise, appInfo.resetRetryCount() is never called, and failing Executors will eventually exceed ApplicationState.MAX_NUM_RETRY, resulting in the application being removed from the master's accounting.

JoshRosen

Author: Mark Hamstra <markhamstra@gmail.com>

Closes #3550 from markhamstra/SPARK-4498 and squashes the following commits:

8f543b1 [Mark Hamstra] Don't transition ExecutorInfo to RUNNING until Executor is added by Driver
2014-12-03 15:08:01 -08:00
Michael Armbrust 513ef82e85 [SPARK-4552][SQL] Avoid exception when reading empty parquet data through Hive
This is a very small fix that catches one specific exception and returns an empty table.  #3441 will address this in a more principled way.

Author: Michael Armbrust <michael@databricks.com>

Closes #3586 from marmbrus/fixEmptyParquet and squashes the following commits:

2781d9f [Michael Armbrust] Handle empty lists for newParquet
04dd376 [Michael Armbrust] Avoid exception when reading empty parquet data through Hive
2014-12-03 14:13:35 -08:00
Andrew Or 90ec643e9a [HOT FIX] [YARN] Check whether /lib exists before listing its files
This is caused by a975dc3279

Author: Andrew Or <andrew@databricks.com>

Closes #3589 from andrewor14/yarn-hot-fix and squashes the following commits:

a4fad5f [Andrew Or] Check whether lib directory exists before listing its files
2014-12-03 13:56:23 -08:00
Masayoshi TSUZUKI 692f49378f [SPARK-4642] Add description about spark.yarn.queue to running-on-YARN document.
Added descriptions about these parameters.
- spark.yarn.queue

Modified the description of the default value of this parameter.
- spark.yarn.submit.file.replication

Author: Masayoshi TSUZUKI <tsudukim@oss.nttdata.co.jp>

Closes #3500 from tsudukim/feature/SPARK-4642 and squashes the following commits:

ce99655 [Masayoshi TSUZUKI] better grammatically.
21cf624 [Masayoshi TSUZUKI] Removed intentionally undocumented properties.
88cac9b [Masayoshi TSUZUKI] [SPARK-4642] Documents about running-on-YARN needs update
2014-12-03 13:16:24 -08:00
zsxwing edd3cd477c [SPARK-4715][Core] Make sure tryToAcquire won't return a negative value
ShuffleMemoryManager.tryToAcquire may return a negative value. The unit test demonstrates this bug. It will output `0 did not equal -200 granted is negative`.
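
A sketch of the invariant the fix enforces (illustrative, not the actual ShuffleMemoryManager code):

```
// Clamp the grant so a caller can never be handed a negative amount of memory.
def tryToAcquire(requested: Long, available: Long): Long =
  math.max(0L, math.min(requested, available))
```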

Author: zsxwing <zsxwing@gmail.com>

Closes #3575 from zsxwing/SPARK-4715 and squashes the following commits:

a193ae6 [zsxwing] Make sure tryToAcquire won't return a negative value
2014-12-03 12:19:40 -08:00
Masayoshi TSUZUKI 96786e3ee5 [SPARK-4701] Typo in sbt/sbt
Fixed a typo.

Author: Masayoshi TSUZUKI <tsudukim@oss.nttdata.co.jp>

Closes #3560 from tsudukim/feature/SPARK-4701 and squashes the following commits:

ed2a3f1 [Masayoshi TSUZUKI] Another whitespace position error.
1af3a35 [Masayoshi TSUZUKI] [SPARK-4701] Typo in sbt/sbt
2014-12-03 12:08:00 -08:00
Jim Lim a975dc3279 SPARK-2624 add datanucleus jars to the container in yarn-cluster
If `spark-submit` finds the datanucleus jars, it adds them to the driver's classpath, but does not add them to the container.

This patch modifies the yarn deployment class to copy all `datanucleus-*` jars found in `[spark-home]/libs` to the container.

Author: Jim Lim <jim@quixey.com>

Closes #3238 from jimjh/SPARK-2624 and squashes the following commits:

3633071 [Jim Lim] SPARK-2624 update documentation and comments
fe95125 [Jim Lim] SPARK-2624 keep java imports together
6c31fe0 [Jim Lim] SPARK-2624 update documentation
6690fbf [Jim Lim] SPARK-2624 add tests
d28d8e9 [Jim Lim] SPARK-2624 add spark.yarn.datanucleus.dir option
84e6cba [Jim Lim] SPARK-2624 add datanucleus jars to the container in yarn-cluster
2014-12-03 11:16:29 -08:00
DB Tsai d00542987e [SPARK-4717][MLlib] Optimize BLAS library to avoid de-reference multiple times in loop
Keep a local reference to the `values` and `indices` arrays of the `Vector` object so the JVM can locate the values with a single operation. See `SPARK-4581`
for a similar optimization and the bytecode analysis.
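
A sketch of the pattern on a simplified dot product (illustrative types, not the actual BLAS code):

```
class DenseVec(val values: Array[Double])

def dot(x: DenseVec, y: DenseVec): Double = {
  val xv = x.values  // one field dereference each, held in locals,
  val yv = y.values  // instead of x.values(i)/y.values(i) on every iteration
  var sum = 0.0
  var i = 0
  while (i < xv.length) {
    sum += xv(i) * yv(i)
    i += 1
  }
  sum
}
```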

Author: DB Tsai <dbtsai@alpinenow.com>

Closes #3577 from dbtsai/blasopt and squashes the following commits:

62d38c4 [DB Tsai] formatting
0316cef [DB Tsai] first commit
2014-12-03 22:31:39 +08:00
DB Tsai 7fc49ed911 [SPARK-4708][MLLib] Make k-means run two/three times faster with dense/sparse samples
Note that the usage of `breezeSquaredDistance` in
`org.apache.spark.mllib.util.MLUtils.fastSquaredDistance`
is in the critical path, and `breezeSquaredDistance` is slow.
We should replace it with our own implementation.

Here is the benchmark against mnist8m dataset.

Before
DenseVector: 70.04secs
SparseVector: 59.05secs

With this PR
DenseVector: 30.58secs
SparseVector: 21.14secs
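
The standard identity behind such a replacement (a sketch; the real MLUtils.fastSquaredDistance also guards numerical precision): ||a − b||² = ||a||² + ||b||² − 2·(a·b), so with per-point norms precomputed, each distance costs only a dot product.

```
def fastSquaredDistance(aNormSq: Double, bNormSq: Double, aDotB: Double): Double =
  aNormSq + bNormSq - 2.0 * aDotB
```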

Author: DB Tsai <dbtsai@alpinenow.com>

Closes #3565 from dbtsai/kmean and squashes the following commits:

08bc068 [DB Tsai] restyle
de24662 [DB Tsai] address feedback
b185a77 [DB Tsai] cleanup
4554ddd [DB Tsai] first commit
2014-12-03 19:01:56 +08:00
Joseph K. Bradley 4ac2151154 [SPARK-4710] [mllib] Eliminate MLlib compilation warnings
Renamed StreamingKMeans to StreamingKMeansExample to avoid warning about name conflict with StreamingKMeans class.

Added import to DecisionTreeRunner to eliminate warning.

CC: mengxr

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #3568 from jkbradley/ml-compilation-warnings and squashes the following commits:

64d6bc4 [Joseph K. Bradley] Updated DecisionTreeRunner.scala and StreamingKMeans.scala to eliminate compilation warnings, including renaming StreamingKMeans to StreamingKMeansExample.
2014-12-03 18:50:03 +08:00
zsxwing 8af551f71d [SPARK-4397][Core] Change the 'since' value of '@deprecated' to '1.3.0'
As #3262 wasn't merged to branch 1.2, the `since` value of `deprecated` should be '1.3.0'.

Author: zsxwing <zsxwing@gmail.com>

Closes #3573 from zsxwing/SPARK-4397-version and squashes the following commits:

1daa03c [zsxwing] Change the 'since' value to '1.3.0'
2014-12-03 02:05:17 -08:00
JerryLead 77be8b986f [SPARK-4672][Core]Checkpoint() should clear f to shorten the serialization chain
The related JIRA is https://issues.apache.org/jira/browse/SPARK-4672

The f closure of `PartitionsRDD(ZippedPartitionsRDD2)` contains a `$outer` that references EdgeRDD/VertexRDD, which causes the task's serialization chain to become very long in iterative GraphX applications. As a result, a StackOverflow error will occur. If we set "f = null" in `clearDependencies()`, checkpoint() can cut off the long serialization chain. More details and explanation can be found in the JIRA.
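
A sketch of the fix as described (hypothetical class name, not the actual Spark source):

```
// Null out the closure once dependencies are cleared at checkpoint time, so the
// $outer chain it captures is no longer part of the serialized task.
class ZippedPartitionsSketch[A, B, V](var f: (Iterator[A], Iterator[B]) => Iterator[V]) {
  def clearDependencies(): Unit = {
    f = null
  }
}
```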

Author: JerryLead <JerryLead@163.com>
Author: Lijie Xu <csxulijie@gmail.com>

Closes #3545 from JerryLead/my_core and squashes the following commits:

f7faea5 [JerryLead] checkpoint() should clear the f to avoid StackOverflow error
c0169da [JerryLead] Merge branch 'master' of https://github.com/apache/spark
52799e3 [Lijie Xu] Merge pull request #1 from apache/master
2014-12-02 23:53:29 -08:00
JerryLead 17c162f668 [SPARK-4672][GraphX]Non-transient PartitionsRDDs will lead to StackOverflow error
The related JIRA is https://issues.apache.org/jira/browse/SPARK-4672

In a nutshell, if `val partitionsRDD` in EdgeRDDImpl and VertexRDDImpl is non-transient, the serialization chain can become very long in iterative algorithms and finally lead to a StackOverflow error. More details and explanation can be found in the JIRA.
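
A sketch of the change (hypothetical wrapper): the @transient annotation keeps the wrapped RDD out of the closure serialization graph.

```
import org.apache.spark.rdd.RDD

// The field is reconstructible state, not data the task closure needs to carry.
class VertexRDDSketch[VD](@transient val partitionsRDD: RDD[(Long, VD)]) extends Serializable
```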

Author: JerryLead <JerryLead@163.com>
Author: Lijie Xu <csxulijie@gmail.com>

Closes #3544 from JerryLead/my_graphX and squashes the following commits:

628f33c [JerryLead] set PartitionsRDD to be transient in EdgeRDDImpl and VertexRDDImpl
c0169da [JerryLead] Merge branch 'master' of https://github.com/apache/spark
52799e3 [Lijie Xu] Merge pull request #1 from apache/master
2014-12-02 17:14:11 -08:00
JerryLead fc0a1475ef [SPARK-4672][GraphX]Perform checkpoint() on PartitionsRDD to shorten the lineage
The related JIRA is https://issues.apache.org/jira/browse/SPARK-4672

Iterative GraphX applications always have long lineage, while checkpoint() on EdgeRDD and VertexRDD themselves cannot shorten the lineage. In contrast, if we perform checkpoint() on their PartitionsRDD, the long lineage can be cut off. Moreover, existing operations such as cache() in this code are performed on the PartitionsRDD, so checkpoint() should work the same way. More details and explanation can be found in the JIRA.
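
A sketch of the delegation (hypothetical wrapper), mirroring how cache() already works:

```
import org.apache.spark.rdd.RDD

class GraphRDDSketch[T](@transient val partitionsRDD: RDD[T]) {
  def cache(): this.type = { partitionsRDD.cache(); this }
  def checkpoint(): Unit = partitionsRDD.checkpoint()  // cut lineage at the wrapped RDD
}
```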

Author: JerryLead <JerryLead@163.com>
Author: Lijie Xu <csxulijie@gmail.com>

Closes #3549 from JerryLead/my_graphX_checkpoint and squashes the following commits:

d1aa8d8 [JerryLead] Perform checkpoint() on PartitionsRDD not VertexRDD and EdgeRDD themselves
ff08ed4 [JerryLead] Merge branch 'master' of https://github.com/apache/spark
c0169da [JerryLead] Merge branch 'master' of https://github.com/apache/spark
52799e3 [Lijie Xu] Merge pull request #1 from apache/master
2014-12-02 17:08:02 -08:00
Andrew Or 5da21f07d8 [Release] Translate unknown author names automatically 2014-12-02 16:36:12 -08:00
Reynold Xin 2d4f6e70f7 Minor nit style cleanup in GraphX. 2014-12-02 14:41:05 -08:00
wangfei 3ae0cda83c [SPARK-4695][SQL] Get result using executeCollect
Use ```executeCollect``` to collect the result, because executeCollect is a custom implementation of collect in Spark SQL that performs better than RDD's collect.

Author: wangfei <wangfei1@huawei.com>

Closes #3547 from scwf/executeCollect and squashes the following commits:

a5ab68e [wangfei] Revert "adding debug info"
a60d680 [wangfei] fix test failure
0db7ce8 [wangfei] adding debug info
184c594 [wangfei] using executeCollect instead collect
2014-12-02 14:30:44 -08:00
Daoyuan Wang 1f5ddf17e8 [SPARK-4670] [SQL] wrong symbol for bitwise not
We should use `~` instead of `-` for bitwise NOT.
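
A quick illustration of the difference (two's complement: ~x == -x - 1):

```
val x = 5
assert(-x == -5)  // arithmetic negation
assert(~x == -6)  // bitwise NOT flips every bit
```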

Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #3528 from adrian-wang/symbol and squashes the following commits:

affd4ad [Daoyuan Wang] fix code gen test case
56efb79 [Daoyuan Wang] ensure bitwise NOT over byte and short persist data type
f55fbae [Daoyuan Wang] wrong symbol for bitwise not
2014-12-02 14:25:12 -08:00
Daoyuan Wang f6df609dcc [SPARK-4593][SQL] Return null when denominator is 0
```
SELECT max(1/0) FROM src
```
would return a very large number, which is obviously not right.
For hive-0.12, Hive returns `Infinity` for 1/0, while for hive-0.13.1 it is `NULL`.
I think it is better to keep our behavior consistent with the newer Hive version.
This PR ensures that when the divisor is 0, the result of the expression is NULL, same as hive-0.13.1.
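
A sketch of the evaluation rule the PR adopts (illustrative, not the actual Catalyst code):

```
// A zero divisor yields null rather than Infinity or an exception.
def divide(dividend: Double, divisor: Double): Any =
  if (divisor == 0.0) null else dividend / divisor
```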

Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #3443 from adrian-wang/div and squashes the following commits:

2e98677 [Daoyuan Wang] fix code gen for divide 0
85c28ba [Daoyuan Wang] temp
36236a5 [Daoyuan Wang] add test cases
6f5716f [Daoyuan Wang] fix comments
cee92bd [Daoyuan Wang] avoid evaluation 2 times
22ecd9a [Daoyuan Wang] fix style
cf28c58 [Daoyuan Wang] divide fix
2dfe50f [Daoyuan Wang] return null when divider is 0 of Double type
2014-12-02 14:21:47 -08:00
YanTangZhai 1066427600 [SPARK-4676][SQL] JavaSchemaRDD.schema may throw NullType MatchError if sql has null
```
val jsc = new org.apache.spark.api.java.JavaSparkContext(sc)
val jhc = new org.apache.spark.sql.hive.api.java.JavaHiveContext(jsc)
val nrdd = jhc.hql("select null from spark_test.for_test")
println(nrdd.schema)
```
Then the error is thrown as follows:
```
scala.MatchError: NullType (of class org.apache.spark.sql.catalyst.types.NullType$)
at org.apache.spark.sql.types.util.DataTypeConversions$.asJavaDataType(DataTypeConversions.scala:43)
```

Author: YanTangZhai <hakeemzhai@tencent.com>
Author: yantangzhai <tyz0303@163.com>
Author: Michael Armbrust <michael@databricks.com>

Closes #3538 from YanTangZhai/MatchNullType and squashes the following commits:

e052dff [yantangzhai] [SPARK-4676] [SQL] JavaSchemaRDD.schema may throw NullType MatchError if sql has null
4b4bb34 [yantangzhai] [SPARK-4676] [SQL] JavaSchemaRDD.schema may throw NullType MatchError if sql has null
896c7b7 [yantangzhai] fix NullType MatchError in JavaSchemaRDD when sql has null
6e643f8 [YanTangZhai] Merge pull request #11 from apache/master
e249846 [YanTangZhai] Merge pull request #10 from apache/master
d26d982 [YanTangZhai] Merge pull request #9 from apache/master
76d4027 [YanTangZhai] Merge pull request #8 from apache/master
03b62b0 [YanTangZhai] Merge pull request #7 from apache/master
8a00106 [YanTangZhai] Merge pull request #6 from apache/master
cbcba66 [YanTangZhai] Merge pull request #3 from apache/master
cdef539 [YanTangZhai] Merge pull request #1 from apache/master
2014-12-02 14:15:12 -08:00
baishuo 69b6fed206 [SPARK-4663][SQL] Add finally to avoid resource leak
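A sketch of the pattern the fix applies (illustrative resource, not the actual code):

```
import java.io.FileWriter

def writeAll(path: String, lines: Seq[String]): Unit = {
  val writer = new FileWriter(path)
  try {
    lines.foreach(line => writer.write(line + "\n"))
  } finally {
    writer.close()  // released on both success and failure
  }
}
```
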
Author: baishuo <vc_java@hotmail.com>

Closes #3526 from baishuo/master-trycatch and squashes the following commits:

d446e14 [baishuo] correct the code style
b36bf96 [baishuo] correct the code style
ae0e447 [baishuo] add finally to avoid resource leak
2014-12-02 12:12:03 -08:00
Kousuke Saruta e75e04f980 [SPARK-4536][SQL] Add sqrt and abs to Spark SQL DSL
Spark SQL has embedded sqrt and abs, but the DSL doesn't support those functions.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #3401 from sarutak/dsl-missing-operator and squashes the following commits:

07700cf [Kousuke Saruta] Modified Literal(null, NullType) to Literal(null) in DslQuerySuite
8f366f8 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into dsl-missing-operator
1b88e2e [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into dsl-missing-operator
0396f89 [Kousuke Saruta] Added sqrt and abs to Spark SQL DSL
2014-12-02 12:07:52 -08:00
Reynold Xin b1f8fe316a Indent license header properly for interfaces.scala.
A very small nit update.

Author: Reynold Xin <rxin@databricks.com>

Closes #3552 from rxin/license-header and squashes the following commits:

df8d1a4 [Reynold Xin] Indent license header properly for interfaces.scala.
2014-12-02 11:59:15 -08:00
Kay Ousterhout d9a148ba6a [SPARK-4686] Link to allowed master URLs is broken
The link points to the old scala programming guide; it should point to the submitting applications page.

This should be backported to 1.1.2 (it's been broken since 1.0).

Author: Kay Ousterhout <kayousterhout@gmail.com>

Closes #3542 from kayousterhout/SPARK-4686 and squashes the following commits:

a8fc43b [Kay Ousterhout] [SPARK-4686] Link to allowed master URLs is broken
2014-12-02 09:06:02 -08:00
zsxwing 6dfe38a03a [SPARK-4397][Core] Cleanup 'import SparkContext._' in core
This PR cleans up `import SparkContext._` in core for SPARK-4397(#3262) to prove it really works well.

Author: zsxwing <zsxwing@gmail.com>

Closes #3530 from zsxwing/SPARK-4397-cleanup and squashes the following commits:

04e2273 [zsxwing] Cleanup 'import SparkContext._' in core
2014-12-02 00:18:41 -08:00
DB Tsai 64f3175bf9 [SPARK-4611][MLlib] Implement the efficient vector norm
The vector norm in breeze is implemented via `activeIterator`, which is known to be very slow.
In this PR, an efficient vector norm is implemented, and with this API, `Normalizer` and
`k-means` see big performance improvements.
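
A sketch of the kind of specialized loop that replaces activeIterator (illustrative, for the L2 norm):

```
def normL2(values: Array[Double]): Double = {
  var sum = 0.0
  var i = 0
  while (i < values.length) {
    sum += values(i) * values(i)
    i += 1
  }
  math.sqrt(sum)
}
```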

Here is the benchmark against mnist8m dataset.

a) `Normalizer`
Before
DenseVector: 68.25secs
SparseVector: 17.01secs

With this PR
DenseVector: 12.71secs
SparseVector: 2.73secs

b) `k-means`
Before
DenseVector: 83.46secs
SparseVector: 61.60secs

With this PR
DenseVector: 70.04secs
SparseVector: 59.05secs

Author: DB Tsai <dbtsai@alpinenow.com>

Closes #3462 from dbtsai/norm and squashes the following commits:

63c7165 [DB Tsai] typo
0c3637f [DB Tsai] add import org.apache.spark.SparkContext._ back
6fa616c [DB Tsai] address feedback
9b7cb56 [DB Tsai] move norm to static method
0b632e6 [DB Tsai] kmeans
dbed124 [DB Tsai] style
c1a877c [DB Tsai] first commit
2014-12-02 11:40:43 +08:00
Patrick Wendell b0a46d8995 MAINTENANCE: Automated closing of pull requests.
This commit exists to close the following pull requests on Github:

Closes #1612 (close requested by 'marmbrus')
Closes #2723 (close requested by 'marmbrus')
Closes #1737 (close requested by 'marmbrus')
Closes #2252 (close requested by 'marmbrus')
Closes #2029 (close requested by 'marmbrus')
Closes #2386 (close requested by 'marmbrus')
Closes #2997 (close requested by 'marmbrus')
2014-12-01 17:27:14 -08:00
zsxwing d3e02dddf0 [SPARK-4268][SQL] Use #::: to get benefit from Stream in SqlLexical.allCaseVersions
In addition, `s.isEmpty` is used to eliminate the string comparison.
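
A sketch of the function's shape (`#:::` concatenates Streams lazily, so case versions are produced on demand rather than all at once):

```
def allCaseVersions(s: String, prefix: String = ""): Stream[String] =
  if (s.isEmpty) {
    Stream(prefix)
  } else {
    allCaseVersions(s.tail, prefix + s.head.toLower) #:::
      allCaseVersions(s.tail, prefix + s.head.toUpper)
  }

// allCaseVersions("as").toList == List("as", "aS", "As", "AS")
```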

Author: zsxwing <zsxwing@gmail.com>

Closes #3132 from zsxwing/SPARK-4268 and squashes the following commits:

358e235 [zsxwing] Improvement of allCaseVersions
2014-12-01 16:39:54 -08:00
Daoyuan Wang 4df60a8cbc [SPARK-4529] [SQL] support view with column alias
Support view definitions like

```
CREATE VIEW view3(valoo)
TBLPROPERTIES ("fear" = "factor")
AS SELECT upper(value) FROM src WHERE key=86;
```

(valoo is the alias of upper(value)). This is a missing part of SPARK-4239, toward full view support.

Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #3396 from adrian-wang/viewcolumn and squashes the following commits:

4d001d0 [Daoyuan Wang] support view with column alias
2014-12-01 16:08:51 -08:00
Daoyuan Wang 5edbcbfb61 [SQL][DOC] Date type in SQL programming guide
Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #3535 from adrian-wang/datedoc and squashes the following commits:

18ff1ed [Daoyuan Wang] [DOC] Date type
2014-12-01 14:04:07 -08:00
wangfei 7b79957879 [SQL] Minor fix for doc and comment
Author: wangfei <wangfei1@huawei.com>

Closes #3533 from scwf/sql-doc1 and squashes the following commits:

962910b [wangfei] doc and comment fix
2014-12-01 14:02:02 -08:00
ravipesala bc353819cc [SPARK-4658][SQL] Code documentation issue in DDL of datasource API
Author: ravipesala <ravindra.pesala@huawei.com>

Closes #3516 from ravipesala/ddl_doc and squashes the following commits:

d101fdf [ravipesala] Style issues fixed
d2238cd [ravipesala] Corrected documentation
2014-12-01 13:31:27 -08:00
ravipesala 6a9ff19dc0 [SPARK-4650][SQL] Support multiple columns in countDistinct function like count(distinct c1,c2..) in Spark SQL
Support multiple columns in the countDistinct function, e.g. count(distinct c1, c2), in Spark SQL.
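
The query shape this enables (sketch; table and column names hypothetical, assumes a SQLContext named sqlContext):

```
val results = sqlContext.sql("SELECT COUNT(DISTINCT c1, c2) FROM logs")
```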

Author: ravipesala <ravindra.pesala@huawei.com>
Author: Michael Armbrust <michael@databricks.com>

Closes #3511 from ravipesala/countdistinct and squashes the following commits:

cc4dbb1 [ravipesala] style
070e12a [ravipesala] Supporting multi column support in count(distinct c1,c2..) in Spark SQL
2014-12-01 13:28:04 -08:00
Liang-Chi Hsieh b57365a1ec [SPARK-4358][SQL] Let BigDecimal do checking type compatibility
Remove hardcoded max and min values for types. Let BigDecimal do the type-compatibility checking.
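
A sketch of the idea (illustrative helper): let BigDecimal compare against a type's bounds instead of hardcoding magic constants.

```
def fitsInInt(v: BigDecimal): Boolean =
  v.isWhole && v >= BigDecimal(Int.MinValue) && v <= BigDecimal(Int.MaxValue)
```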

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #3208 from viirya/more_numericLit and squashes the following commits:

e9834b4 [Liang-Chi Hsieh] Remove byte and short types for number literal.
1bd1825 [Liang-Chi Hsieh] Fix Indentation and make the modification clearer.
cf1a997 [Liang-Chi Hsieh] Modified for comment to add a rule of analysis that adds a cast.
91fe489 [Liang-Chi Hsieh] add Byte and Short.
1bdc69d [Liang-Chi Hsieh] Let BigDecimal do checking type compatibility.
2014-12-01 13:17:56 -08:00