Commit graph

11030 commits

Author SHA1 Message Date
zsxwing 10007fbe0b [SPARK-7589] [STREAMING] [WEBUI] Make "Input Rate" in the Streaming page consistent with other pages
This PR makes "Input Rate" in the Streaming page consistent with Job and Stage pages.

![screen shot 2015-05-12 at 5 03 35 pm](https://cloud.githubusercontent.com/assets/1000778/7601444/f943f8ac-f8ca-11e4-8280-a715d814f434.png)
![screen shot 2015-05-12 at 5 07 25 pm](https://cloud.githubusercontent.com/assets/1000778/7601445/f9571c0c-f8ca-11e4-9b12-9317cb55c002.png)

Author: zsxwing <zsxwing@gmail.com>

Closes #6102 from zsxwing/SPARK-7589 and squashes the following commits:

2745225 [zsxwing] Make "Input Rate" in the Streaming page consistent with other pages

(cherry picked from commit bec938f777)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
2015-05-13 10:01:38 -07:00
Cheng Hao 42cf4a2a5e [SPARK-6734] [SQL] Add UDTF.close support in Generate
Some third-party UDTF extensions generate additional rows in the "GenericUDTF.close()" method, which is supported / documented by Hive.
https://cwiki.apache.org/confluence/display/Hive/DeveloperGuide+UDTF
However, Spark SQL ignores the "GenericUDTF.close()" method, which causes bugs when porting jobs from Hive to Spark SQL.
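
A minimal, hypothetical Scala sketch against the Hive UDTF API (class and column names are illustrative) of the contract in question: rows forwarded from `close()` must reach the output.

```scala
import java.util.Arrays
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF
import org.apache.hadoop.hive.serde2.objectinspector.{ObjectInspector, ObjectInspectorFactory, StructObjectInspector}
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory

class CountRowsUDTF extends GenericUDTF {
  private var count = 0L

  override def initialize(argOIs: Array[ObjectInspector]): StructObjectInspector =
    ObjectInspectorFactory.getStandardStructObjectInspector(
      Arrays.asList("n"),
      Arrays.asList[ObjectInspector](PrimitiveObjectInspectorFactory.javaLongObjectInspector))

  override def process(args: Array[AnyRef]): Unit = { count += 1 }

  // Emits a single summary row at the end; before this fix, Spark SQL dropped it.
  override def close(): Unit =
    forward(Array[AnyRef](java.lang.Long.valueOf(count)))
}
```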

Author: Cheng Hao <hao.cheng@intel.com>

Closes #5383 from chenghao-intel/udtf_close and squashes the following commits:

98b4e4b [Cheng Hao] Support UDTF.close

(cherry picked from commit 0da254fb29)
Signed-off-by: Cheng Lian <lian@databricks.com>
2015-05-14 00:35:22 +08:00
Cheng Lian d78f0e1b48 [MINOR] [SQL] Removes debugging println
Author: Cheng Lian <lian@databricks.com>

Closes #6123 from liancheng/remove-println and squashes the following commits:

03356b6 [Cheng Lian] Removes debugging println

(cherry picked from commit aa6ba3f216)
Signed-off-by: Cheng Lian <lian@databricks.com>
2015-05-13 23:40:30 +08:00
Yin Huai 9ca28d9de6 [SQL] In InsertIntoFSBasedRelation.insert, log cause before abort job/task.
We need to add a log entry before calling `abortTask`/`abortJob`. Otherwise, an exception from `abortTask`/`abortJob` will shadow the real cause.
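
A hedged, self-contained sketch of the pattern (helper names are illustrative, not Spark's actual code): log the real failure before invoking the abort path, so an exception thrown there cannot shadow it.

```scala
def runTask(body: => Unit)(abortTask: () => Unit): Unit =
  try body catch {
    case cause: Throwable =>
      // Log the real cause first; even if abortTask() throws, `cause` is not lost.
      System.err.println(s"Aborting task. Cause: $cause")
      abortTask()
      throw cause
  }
```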

cc liancheng

Author: Yin Huai <yhuai@databricks.com>

Closes #6105 from yhuai/logCause and squashes the following commits:

8dfe0d8 [Yin Huai] Log cause.

(cherry picked from commit b061bd517a)
Signed-off-by: Cheng Lian <lian@databricks.com>
2015-05-13 23:36:36 +08:00
Cheng Lian cb1fe81339 [SPARK-7599] [SQL] Don't restrict customized output committers to be subclasses of FileOutputCommitter
Author: Cheng Lian <lian@databricks.com>

Closes #6118 from liancheng/spark-7599 and squashes the following commits:

31e1bd6 [Cheng Lian] Don't restrict customized output committers to be subclasses of FileOutputCommitter

(cherry picked from commit 10c546e9d4)
Signed-off-by: Yin Huai <yhuai@databricks.com>
2015-05-13 07:36:04 -07:00
Masayoshi TSUZUKI bfdecace5d [SPARK-6568] spark-shell.cmd --jars option does not accept the jar that has space in its path
Escape spaces in the arguments.
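
For reference, the kind of invocation this fixes (the path is hypothetical):

```
spark-shell.cmd --jars "C:\libs with spaces\mylib.jar"
```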

Author: Masayoshi TSUZUKI <tsudukim@oss.nttdata.co.jp>
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #5447 from tsudukim/feature/SPARK-6568-2 and squashes the following commits:

3f9a188 [Masayoshi TSUZUKI] modified some errors.
ed46047 [Masayoshi TSUZUKI] avoid scalastyle errors.
1784239 [Masayoshi TSUZUKI] removed Utils.formatPath.
e03f289 [Masayoshi TSUZUKI] removed testWindows from Utils.resolveURI and Utils.resolveURIs. replaced SystemUtils.IS_OS_WINDOWS to Utils.isWindows. removed Utils.formatPath from PythonRunner.scala.
84c33d0 [Masayoshi TSUZUKI] - use resolveURI in nonLocalPaths - run tests for Windows path only on Windows
016128d [Masayoshi TSUZUKI] fixed to use File.toURI()
2c62e3b [Masayoshi TSUZUKI] Merge pull request #1 from sarutak/SPARK-6568-2
7019a8a [Masayoshi TSUZUKI] Merge branch 'master' of https://github.com/apache/spark into feature/SPARK-6568-2
45946ee [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-6568-2
10f1c73 [Kousuke Saruta] Added a comment
93c3c40 [Kousuke Saruta] Merge branch 'classpath-handling-fix' of github.com:sarutak/spark into SPARK-6568-2
649da82 [Kousuke Saruta] Fix classpath handling
c7ba6a7 [Masayoshi TSUZUKI] [SPARK-6568] spark-shell.cmd --jars option does not accept the jar that has space in its path

(cherry picked from commit 50c7270801)
Signed-off-by: Sean Owen <sowen@cloudera.com>
2015-05-13 09:43:49 +01:00
linweizhong 7bd5274270 [SPARK-7526] [SPARKR] Specify ip of RBackend, MonitorServer and RRDD Socket server
These R processes are only used to communicate with the JVM process locally, so binding to localhost is more reasonable than binding to a wildcard IP.
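
A hedged illustration of the idea in plain java.net (not the actual SparkR code):

```scala
import java.net.{InetAddress, InetSocketAddress, ServerSocket}

val server = new ServerSocket()
// new ServerSocket(port) would listen on the wildcard address (0.0.0.0);
// binding explicitly to localhost keeps the R <-> JVM bridge local-only.
server.bind(new InetSocketAddress(InetAddress.getByName("localhost"), 0))
```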

Author: linweizhong <linweizhong@huawei.com>

Closes #6053 from Sephiroth-Lin/spark-7526 and squashes the following commits:

5303af7 [linweizhong] bind to localhost rather than wildcard ip

(cherry picked from commit 98195c3031)
Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
2015-05-12 23:55:54 -07:00
Sun Rui b18f1c61ae [SPARK-7482] [SPARKR] Rename some DataFrame API methods in SparkR to match their counterparts in Scala.
Author: Sun Rui <rui.sun@intel.com>

Closes #6007 from sun-rui/SPARK-7482 and squashes the following commits:

5c5cf5e [Sun Rui] Implement alias loadDF() as a new function.
3a30c10 [Sun Rui] Rename load()/save() to read.df()/write.df(). Also add loadDF()/saveDF() as aliases.
9f569d6 [Sun Rui] [SPARK-7482][SparkR] Rename some DataFrame API methods in SparkR to match their counterparts in Scala.

(cherry picked from commit df9b94a57c)
Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
2015-05-12 23:52:42 -07:00
Santiago M. Mola 6ff3379a19 [SPARK-7566][SQL] Add type to HiveContext.analyzer
This makes HiveContext.analyzer overrideable.

Author: Santiago M. Mola <santi@mola.io>

Closes #6086 from smola/patch-3 and squashes the following commits:

8ece136 [Santiago M. Mola] [SPARK-7566][SQL] Add type to HiveContext.analyzer

(cherry picked from commit 208b902257)
Signed-off-by: Reynold Xin <rxin@databricks.com>
2015-05-12 23:44:43 -07:00
Reynold Xin 219a9043ef [SPARK-7321][SQL] Add Column expression for conditional statements (when/otherwise)
This builds on https://github.com/apache/spark/pull/5932 and should close https://github.com/apache/spark/pull/5932 as well.

As an example:
```python
df.select(when(df['age'] == 2, 3).otherwise(4).alias("age")).collect()
```
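
A hedged Scala equivalent, assuming the Scala Column API mirrors the Python call (`df` is a DataFrame with an "age" column):

```scala
import org.apache.spark.sql.functions.when

df.select(when(df("age") === 2, 3).otherwise(4).alias("age")).collect()
```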

Author: Reynold Xin <rxin@databricks.com>
Author: kaka1992 <kaka_1992@163.com>

Closes #6072 from rxin/when-expr and squashes the following commits:

8f49201 [Reynold Xin] Throw exception if otherwise is applied twice.
0455eda [Reynold Xin] Reset run-tests.
bfb9d9f [Reynold Xin] Updated documentation and test cases.
762f6a5 [Reynold Xin] Merge pull request #5932 from kaka1992/IFCASE
95724c6 [kaka1992] Update
8218d0a [kaka1992] Update
801009e [kaka1992] Update
76d6346 [kaka1992] [SPARK-7321][SQL] Add Column expression for conditional statements (if, case)

(cherry picked from commit 97dee313f2)
Signed-off-by: Reynold Xin <rxin@databricks.com>
2015-05-12 21:44:16 -07:00
Reynold Xin bdd5db9f16 [SPARK-7588] Document all SQL/DataFrame public methods with @since tag
This pull request adds a @since tag to all public methods/classes in SQL/DataFrame to indicate the version in which each was first added.

Author: Reynold Xin <rxin@databricks.com>

Closes #6101 from rxin/tbc and squashes the following commits:

ed55e11 [Reynold Xin] Add since version to all DataFrame methods.

(cherry picked from commit 8fd55358b7)
Signed-off-by: Reynold Xin <rxin@databricks.com>
2015-05-12 18:37:10 -07:00
zsxwing 2cc3301812 [HOTFIX] Use the old Job API to support old Hadoop versions
#5526 uses `Job.getInstance`, which does not exist in old Hadoop versions. Just use `new Job` instead.
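
For reference, a hedged sketch of the swap (the Hadoop imports are real; the snippet is illustrative):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.Job

// Job.getInstance(conf) only exists on newer Hadoop releases; the constructor,
// though deprecated in Hadoop 2.x, works on both old and new versions.
val job = new Job(new Configuration())
```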

cc liancheng

Author: zsxwing <zsxwing@gmail.com>

Closes #6095 from zsxwing/hotfix and squashes the following commits:

b0c2049 [zsxwing] Use the old Job API to support old Hadoop versions

(cherry picked from commit 247b70349c)
Signed-off-by: Cheng Lian <lian@databricks.com>
2015-05-13 08:33:43 +08:00
Xiangrui Meng 08ec1af547 [SPARK-7572] [MLLIB] do not import Param/Params under pyspark.ml
Remove `Param` and `Params` from `pyspark.ml` and add a section in the doc. brkyvz

Author: Xiangrui Meng <meng@databricks.com>

Closes #6094 from mengxr/SPARK-7572 and squashes the following commits:

022abd6 [Xiangrui Meng] do not import Param/Params under spark.ml

(cherry picked from commit 77f64c736d)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
2015-05-12 17:15:47 -07:00
Tathagata Das bb81b15005 [SPARK-7554] [STREAMING] Throw exception when an active/stopped StreamingContext is used to create DStreams and output operations
Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #6099 from tdas/SPARK-7554 and squashes the following commits:

2cd4158 [Tathagata Das] Throw exceptions on attempts to add stuff to active and stopped contexts.

(cherry picked from commit 23f7d66d51)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
2015-05-12 17:07:46 -07:00
Xiangrui Meng 6c292a213b [SPARK-7528] [MLLIB] make RankingMetrics Java-friendly
`RankingMetrics` contains a ClassTag, which is hard to create in Java. This PR adds a factory method `of` for Java users. coderxiang

Author: Xiangrui Meng <meng@databricks.com>

Closes #6098 from mengxr/SPARK-7528 and squashes the following commits:

e5d57ae [Xiangrui Meng] make RankingMetrics Java-friendly

(cherry picked from commit 2713bc65af)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
2015-05-12 16:54:04 -07:00
Tathagata Das 91fbd93f26 [SPARK-7553] [STREAMING] Added methods to maintain a singleton StreamingContext
In a REPL/notebook environment, it's very easy to lose a reference to a StreamingContext by overwriting the variable name. So if you happen to execute the following commands
```
val ssc = new StreamingContext(...) // cmd 1
ssc.start() // cmd 2
...
val ssc = new StreamingContext(...) // accidentally run cmd 1 again
```
The value of ssc will be overwritten. Now you can neither start the new context (as only one context can be started), nor stop the previous context (as the reference is lost).
Hence it's best to maintain a singleton reference to the active context, so that we never lose the reference to it.
Since this problem mainly occurs in REPL environments, it's best to add this as Experimental support in the Scala API only, so that it can be used in Scala REPLs and notebooks.
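
A hedged usage sketch of the resulting helpers (see the getActive/getActiveOrCreate commits below; the exact signature is assumed):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

def createContext(): StreamingContext =
  new StreamingContext(new SparkConf().setAppName("demo").setMaster("local[2]"), Seconds(1))

// Returns the currently active context if there is one, otherwise creates it,
// so accidentally re-running the creation cell cannot orphan a started context.
val ssc = StreamingContext.getActiveOrCreate(createContext _)
```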

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #6070 from tdas/SPARK-7553 and squashes the following commits:

731c9a1 [Tathagata Das] Fixed style
a797171 [Tathagata Das] Added more unit tests
19fc70b [Tathagata Das] Added :: Experimental :: in docs
64706c9 [Tathagata Das] Fixed test
634db5d [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into SPARK-7553
3884a25 [Tathagata Das] Fixing test bug
d37a846 [Tathagata Das] Added getActive and getActiveOrCreate

(cherry picked from commit 00e7b09a0b)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
2015-05-12 16:44:27 -07:00
Joseph K. Bradley 612247ff05 [SPARK-7573] [ML] OneVsRest cleanups
Minor cleanups discussed with [~mengxr]:
* move OneVsRest from reduction to classification sub-package
* make model constructor private

Some doc cleanups too

CC: harsha2010  Could you please verify this looks OK?  Thanks!

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #6097 from jkbradley/onevsrest-cleanup and squashes the following commits:

4ecd48d [Joseph K. Bradley] org imports
430b065 [Joseph K. Bradley] moved OneVsRest from reduction subpackage to classification.  small java doc style fixes
9f8b9b9 [Joseph K. Bradley] Small cleanups to OneVsRest.  Made model constructor private to ml package.

(cherry picked from commit 96c4846db8)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
2015-05-12 16:42:36 -07:00
Joseph K. Bradley d080df10bc [SPARK-7557] [ML] [DOC] User guide for spark.ml HashingTF, Tokenizer
Added feature transformer subsection to spark.ml guide, with HashingTF and Tokenizer.  Added JavaHashingTFSuite to test Java examples in new guide.

I've run the Scala and Python examples in the Spark/PySpark shells. I ran the Java examples via the test suite (with small modifications for printing).

CC: mengxr

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #6093 from jkbradley/hashingtf-guide and squashes the following commits:

d5d213f [Joseph K. Bradley] small fix
dd6e91a [Joseph K. Bradley] fixes from code review of user guide
33c3ff9 [Joseph K. Bradley] small fix
bc6058c [Joseph K. Bradley] fix link
361a174 [Joseph K. Bradley] Added subsection for feature transformers to spark.ml guide, with HashingTF and Tokenizer.  Added JavaHashingTFSuite to test Java examples in new guide

(cherry picked from commit f0c1bc3472)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
2015-05-12 16:40:03 -07:00
Yuhao Yang fe34a5915c [SPARK-7496] [MLLIB] Update Programming guide with Online LDA
jira: https://issues.apache.org/jira/browse/SPARK-7496

Update the LDA subsection of the clustering section of the MLlib programming guide to include online LDA.

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #6046 from hhbyyh/ldaDocument and squashes the following commits:

4b6fbfa [Yuhao Yang] add online paper and some comparison
fd4c983 [Yuhao Yang] update lda document for optimizers

(cherry picked from commit 1d703660d4)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
2015-05-12 15:12:37 -07:00
zsxwing 221375ee1f [SPARK-7406] [STREAMING] [WEBUI] Add tooltips for "Scheduling Delay", "Processing Time" and "Total Delay"
Screenshots:
![screen shot 2015-05-06 at 2 29 03 pm](https://cloud.githubusercontent.com/assets/1000778/7504129/9c57f710-f3fc-11e4-9c6e-1b79c17c546d.png)

![screen shot 2015-05-06 at 2 24 35 pm](https://cloud.githubusercontent.com/assets/1000778/7504140/b63bb216-f3fc-11e4-83a5-6dfc6481d192.png)

tdas as we discussed offline

Author: zsxwing <zsxwing@gmail.com>

Closes #5952 from zsxwing/SPARK-7406 and squashes the following commits:

2b004ea [zsxwing] Merge branch 'master' into SPARK-7406
e9eb506 [zsxwing] Update tooltip contents
2215b2a [zsxwing] Add tooltips for "Scheduling Delay", "Processing Time" and "Total Delay"

(cherry picked from commit 1422e79e51)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
2015-05-12 14:41:30 -07:00
Xiangrui Meng 2555517c97 [SPARK-7571] [MLLIB] rename Math to math
`scala.Math` has been deprecated since Scala 2.8. This PR only touches `Math` usages in MLlib. dbtsai
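
For reference (a Scala standard library fact):

```scala
val a = Math.sqrt(2.0) // resolves to the deprecated scala.Math object on Scala 2.10
val b = math.sqrt(2.0) // scala.math, the supported package object
```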

Author: Xiangrui Meng <meng@databricks.com>

Closes #6092 from mengxr/SPARK-7571 and squashes the following commits:

fe8f8d3 [Xiangrui Meng] Math -> math

(cherry picked from commit a4874b0d18)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
2015-05-12 14:39:11 -07:00
Venkata Ramana Gollamudi 32819fcb7c [SPARK-7484][SQL]Support jdbc connection properties
A few JDBC drivers, such as SybaseIQ, support passing the username and password only through connection properties, so the same needs to be supported for
SQLContext.jdbc, dataframe.createJDBCTable, and dataframe.insertIntoJDBC.
Added as default arguments or overloaded functions to preserve backward compatibility.
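
A hedged sketch of the intended call shape (assumed from this description, not verified against the final API; `sqlContext` is an in-scope SQLContext):

```scala
val props = new java.util.Properties()
props.setProperty("user", "scott")      // illustrative credentials
props.setProperty("password", "tiger")
val df = sqlContext.jdbc("jdbc:sybase:Tds:host:5000/db", "employees", props)
```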

Author: Venkata Ramana Gollamudi <ramana.gollamudi@huawei.com>

Closes #6009 from gvramana/add_jdbc_conn_properties and squashes the following commits:

396a0d0 [Venkata Ramana Gollamudi] fixed comments
d66dd8c [Venkata Ramana Gollamudi] fixed comments
1b8cd8c [Venkata Ramana Gollamudi] Support jdbc connection properties

(cherry picked from commit 455551d1c6)
Signed-off-by: Reynold Xin <rxin@databricks.com>
2015-05-12 14:38:53 -07:00
Xiangrui Meng 98ccd934f3 [SPARK-7559] [MLLIB] Bucketizer should include the right most boundary in the last bucket.
We give special treatment to +inf in `Bucketizer`. This could be simplified by always including the largest split value in the last bucket. E.g., (x1, x2, x3) defines buckets [x1, x2) and [x2, x3]. This shouldn't affect user code much, and there are applications that need to include the right-most value. For example, we can bucketize ratings from 0 to 10 into bad, neutral, and good with splits 0, 4, 6, 10. It would read oddly if users had to put 0, 4, 6, 10.1 (or 11).

This also updates the implementation to use `Arrays.binarySearch`, and `withClue` in tests.
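
A hedged sketch of the boundary rule (1.4-era spark.ml API; setter names assumed):

```scala
import org.apache.spark.ml.feature.Bucketizer

val bucketizer = new Bucketizer()
  .setSplits(Array(0.0, 4.0, 6.0, 10.0)) // buckets: [0, 4), [4, 6), [6, 10]
  .setInputCol("rating")
  .setOutputCol("ratingBucket")
// A rating of exactly 10.0 now lands in the last bucket instead of failing.
```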

yinxusen jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #6075 from mengxr/SPARK-7559 and squashes the following commits:

e28f910 [Xiangrui Meng] update bucketizer impl

(cherry picked from commit 23b9863e2a)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
2015-05-12 14:24:33 -07:00
Michael Armbrust c68485e7a7 [SPARK-7569][SQL] Better error for invalid binary expressions
`scala> Seq((1,1)).toDF("a", "b").select(lit(1) + new java.sql.Date(1)) `

Before:

```
org.apache.spark.sql.AnalysisException: invalid expression (1 + 0) between Literal 1, IntegerType and Literal 0, DateType;
```

After:
```
org.apache.spark.sql.AnalysisException: invalid expression (1 + 0) between int and date;
```

Author: Michael Armbrust <michael@databricks.com>

Closes #6089 from marmbrus/betterBinaryError and squashes the following commits:

23b68ad [Michael Armbrust] [SPARK-7569][SQL] Better error for invalid binary expressions

(cherry picked from commit 2a41c0d71a)
Signed-off-by: Reynold Xin <rxin@databricks.com>
2015-05-12 13:37:01 -07:00
Ram Sriharsha fd16709f04 [SPARK-7015] [MLLIB] [WIP] Multiclass to Binary Reduction: One Against All
Initial cut of one-against-all. The test code is scaffolding, not fully implemented.
This WIP is to gather early feedback.

Author: Ram Sriharsha <rsriharsha@hw11853.local>

Closes #5830 from harsha2010/reduction and squashes the following commits:

5f4b495 [Ram Sriharsha] Fix Test
386e98b [Ram Sriharsha] Style fix
49b4a17 [Ram Sriharsha] Simplify the test
02279cc [Ram Sriharsha] Output Label Metadata in Prediction Col
bc78032 [Ram Sriharsha] Code Review Updates
8ce4845 [Ram Sriharsha] Merge with Master
2a807be [Ram Sriharsha] Merge branch 'master' into reduction
e21bfcc [Ram Sriharsha] Style Fix
5614f23 [Ram Sriharsha] Style Fix
c75583a [Ram Sriharsha] Cleanup
7a5f136 [Ram Sriharsha] Fix TODOs
804826b [Ram Sriharsha] Merge with Master
1448a5f [Ram Sriharsha] Style Fix
6e47807 [Ram Sriharsha] Style Fix
d63e46b [Ram Sriharsha] Incorporate Code Review Feedback
ced68b5 [Ram Sriharsha] Refactor OneVsAll to implement Predictor
78fa82a [Ram Sriharsha] extra line
0dfa1fb [Ram Sriharsha] Fix inexhaustive match cases that may arise from UnresolvedAttribute
a59a4f4 [Ram Sriharsha] @Experimental
4167234 [Ram Sriharsha] Merge branch 'master' into reduction
868a4fd [Ram Sriharsha] @Experimental
041d905 [Ram Sriharsha] Code Review Fixes
df188d8 [Ram Sriharsha] Style fix
612ec48 [Ram Sriharsha] Style Fix
6ef43d3 [Ram Sriharsha] Prefer Unresolved Attribute to Option: Java APIs are cleaner
6bf6bff [Ram Sriharsha] Update OneHotEncoder to new API
e29cb89 [Ram Sriharsha] Merge branch 'master' into reduction
1c7fa44 [Ram Sriharsha] Fix Tests
ca83672 [Ram Sriharsha] Incorporate Code Review Feedback + Rename to OneVsRestClassifier
221beeed [Ram Sriharsha] Upgrade to use Copy method for cloning Base Classifiers
26f1ddb [Ram Sriharsha] Merge with SPARK-5956 API changes
9738744 [Ram Sriharsha] Merge branch 'master' into reduction
1a3e375 [Ram Sriharsha] More efficient Implementation: Use withColumn to generate label column dynamically
32e0189 [Ram Sriharsha] Restrict reduction to Margin Based Classifiers
ff272da [Ram Sriharsha] Style fix
28771f5 [Ram Sriharsha] Add Tests for Multiclass to Binary Reduction
b60f874 [Ram Sriharsha] Fix Style issues in Test
3191cdf [Ram Sriharsha] Remove this test, accidental commit
23f056c [Ram Sriharsha] Fix Headers for test
1b5e929 [Ram Sriharsha] Fix Style issues and add Header
8752863 [Ram Sriharsha] [SPARK-7015][MLLib][WIP] Multiclass to Binary Reduction: One Against All

(cherry picked from commit 595a67589a)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
2015-05-12 13:35:33 -07:00
Tim Ellison eadda926cd [SPARK-2018] [CORE] Upgrade LZF library to fix endian serialization problem

Pick up a newer version of the dependency with the fix for SPARK-2018. The update involved patching the ning/compress LZF library to handle big-endian systems correctly.

Credit goes to gireeshpunathil for diagnosing the problem, and cowtowncoder for fixing it.

Spark tests run clean for me.

Author: Tim Ellison <t.p.ellison@gmail.com>

Closes #6077 from tellison/UpgradeLZF and squashes the following commits:

ad8d4ef [Tim Ellison] [SPARK-2018] [CORE] Upgrade LZF library to fix endian serialization problem

(cherry picked from commit 5438f49ccf)
Signed-off-by: Sean Owen <sowen@cloudera.com>
2015-05-12 20:48:48 +01:00
Burak Yavuz 432694c18e [SPARK-7487] [ML] Feature Parity in PySpark for ml.regression
Added LinearRegression Python API

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #6016 from brkyvz/ml-reg and squashes the following commits:

11c9ef9 [Burak Yavuz] address comments
1027a40 [Burak Yavuz] fix typo
4c699ad [Burak Yavuz] added tree regressor api
8afead2 [Burak Yavuz] made mixin for DT
fa51c74 [Burak Yavuz] save additions
0640d48 [Burak Yavuz] added ml.regression
82aac48 [Burak Yavuz] added linear regression

(cherry picked from commit 8e935b0a21)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
2015-05-12 12:17:14 -07:00
Andrew Or ce6c40066f [HOT FIX #6076] DAG visualization: curve the edges 2015-05-12 12:07:36 -07:00
Wenchen Fan 8be43f897f [SPARK-7276] [DATAFRAME] speed up DataFrame.select by collapsing Project
Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #5831 from cloud-fan/7276 and squashes the following commits:

ee4a1e1 [Wenchen Fan] fix rebase mistake
a3b565d [Wenchen Fan] refactor
99deb5d [Wenchen Fan] add test
f1f67ad [Wenchen Fan] fix 7276

(cherry picked from commit 4e290522c2)
Signed-off-by: Michael Armbrust <michael@databricks.com>
2015-05-12 11:58:50 -07:00
Andrew Or a23610458c [SPARK-7500] DAG visualization: move cluster labeling to dagre-d3
This fixes the label-bleeding issue described in the JIRA and pictured in the screenshots below. I also took the opportunity to move some code to where it belongs. In particular:

(1) Drawing cluster labels is now implemented in my branch of dagre-d3 instead of in Spark
(2) All graph styling is now moved from Scala to JS

Note that these changes are related because our existing mechanism of "tacking on cluster labels" afterwards isn't flexible enough for us to fix issues like this one easily. For the other half of the changes, visit http://github.com/andrewor14/dagre-d3.

-------------------

**Before.**
<img src="https://cloud.githubusercontent.com/assets/2133137/7582769/b1423440-f845-11e4-8248-b3446a01bf79.png" width="300px"/>

-------------------

**After.**
<img src="https://cloud.githubusercontent.com/assets/2133137/7582742/74891ae6-f845-11e4-96c4-41c7b8aedbdf.png" width="400px"/>

Author: Andrew Or <andrew@databricks.com>

Closes #6076 from andrewor14/dag-viz-bleed and squashes the following commits:

5858d7a [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-bleed
c686dc4 [Andrew Or] Fix tooltip placement
d908c36 [Andrew Or] Add link to dagre-d3 changes (minor)
4a4fb58 [Andrew Or] Fix bleeding + move all styling to JS

(cherry picked from commit 65697bbeaf)
Signed-off-by: Andrew Or <andrew@databricks.com>
2015-05-12 11:18:08 -07:00
Wenchen Fan ec8928604d [DataFrame][minor] support column in field accessor
Minor improvement: we can now use a `Column` as the extraction expression.
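
A hedged illustration (assuming `df` has a map column "m" and a key column "k"):

```scala
// Previously the extraction argument had to be a literal; now a Column works too.
df.select(df("m")(df("k")))
```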

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #6080 from cloud-fan/tmp and squashes the following commits:

0fdefb7 [Wenchen Fan] support column in field accessor

(cherry picked from commit bfcaf8adcd)
Signed-off-by: Reynold Xin <rxin@databricks.com>
2015-05-12 10:38:19 -07:00
Cheng Lian d2328137f7 [SPARK-3928] [SPARK-5182] [SQL] Partitioning support for the data sources API
This PR adds partitioning support for the external data sources API. It aims to simplify development of file-system-based data sources, and to provide first-class partitioning support for both the read path and the write path. Existing data sources like JSON and Parquet can be simplified with this work.

## New features provided

1. Hive compatible partition discovery

   This actually generalizes the partition discovery strategy used in the Parquet data source in Spark 1.3.0 (see the layout sketch after this list).

1. Generalized partition pruning optimization

   Now partition pruning is handled during physical planning phase.  Specific data sources don't need to worry about this harness anymore.

   (This also implies that we can remove `CatalystScan` after migrating the Parquet data source, since now we don't need to pass Catalyst expressions to data source implementations.)

1. Insertion with dynamic partitions

   When inserting data to a `FSBasedRelation`, data can be partitioned dynamically by specified partition columns.
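
For reference, Hive-compatible partition discovery recognizes directory layouts of this shape (paths illustrative); `year` and `month` become partition columns whose values are parsed from the directory names:

```
/path/to/table/year=2015/month=05/part-00000.parquet
/path/to/table/year=2015/month=06/part-00001.parquet
```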

## New structures provided

### Developer API

1. `FSBasedRelation`

   Base abstract class for file system based data sources.

1. `OutputWriter`

   Base abstract class for output row writers, responsible for writing a single row object.

1. `FSBasedRelationProvider`

   A new relation provider for `FSBasedRelation` subclasses. Note that data sources extending `FSBasedRelation` don't need to extend `RelationProvider` and `SchemaRelationProvider`.

### User API

New overloaded versions of

1. `DataFrame.save()`
1. `DataFrame.saveAsTable()`
1. `SQLContext.load()`

are provided to allow users to save/load DataFrames with user defined dynamic partition columns.

### Spark SQL query planning

1. `InsertIntoFSBasedRelation`

   Used to implement write path for `FSBasedRelation`s.

1. New rules for `FSBasedRelation` in `DataSourceStrategy`

   These are added to hook `FSBasedRelation` into physical query plan in read path, and perform partition pruning.

## TODO

- [ ] Use scratch directories when overwriting a table with data selected from itself.

      Currently, this is not supported, because the table being overwritten is always deleted before any data is written to it.

- [ ] When inserting with dynamic partition columns, use external sorter to group the data first.

      This ensures that we only need to open a single `OutputWriter` at a time.  For data sources like Parquet, `OutputWriter`s can be quite memory-consuming.  One issue is that this approach breaks the row distribution of the original DataFrame.  However, we didn't promise to preserve data distribution when writing a DataFrame.

- [x] More tests.  Specifically, test cases for

      - [x] Self-join
      - [x] Loading partitioned relations with a subset of partition columns stored in data files.
      - [x] `SQLContext.load()` with user defined dynamic partition columns.

## Parquet data source migration

Parquet data source migration is covered in PR https://github.com/liancheng/spark/pull/6, which is against this PR branch and for preview only. A formal PR needs to be made after this one is merged.

Author: Cheng Lian <lian@databricks.com>

Closes #5526 from liancheng/partitioning-support and squashes the following commits:

5351a1b [Cheng Lian] Fixes compilation error introduced while rebasing
1f9b1a5 [Cheng Lian] Tweaks data schema passed to FSBasedRelations
43ba50e [Cheng Lian] Avoids serializing generated projection code
edf49e7 [Cheng Lian] Removed commented stale code block
348a922 [Cheng Lian] Adds projection in FSBasedRelation.buildScan(requiredColumns, inputPaths)
ad4d4de [Cheng Lian] Enables HDFS style globbing
8d12e69 [Cheng Lian] Fixes compilation error
c71ac6c [Cheng Lian] Addresses comments from @marmbrus
7552168 [Cheng Lian] Fixes typo in MimaExclude.scala
0349e09 [Cheng Lian] Fixes compilation error introduced while rebasing
52b0c9b [Cheng Lian] Adjusts project/MimaExclude.scala
c466de6 [Cheng Lian] Addresses comments
bc3f9b4 [Cheng Lian] Uses projection to separate partition columns and data columns while inserting rows
795920a [Cheng Lian] Fixes compilation error after rebasing
0b8cd70 [Cheng Lian] Adds Scala/Catalyst row conversion when writing non-partitioned tables
fa543f3 [Cheng Lian] Addresses comments
5849dd0 [Cheng Lian] Fixes doc typos.  Fixes partition discovery refresh.
51be443 [Cheng Lian] Replaces FSBasedRelation.outputCommitterClass with FSBasedRelation.prepareForWrite
c4ed4fe [Cheng Lian] Bug fixes and a new test suite
a29e663 [Cheng Lian] Bug fix: should only pass actuall data files to FSBaseRelation.buildScan
5f423d3 [Cheng Lian] Bug fixes. Lets data source to customize OutputCommitter rather than OutputFormat
54c3d7b [Cheng Lian] Enforces that FileOutputFormat must be used
be0c268 [Cheng Lian] Uses TaskAttempContext rather than Configuration in OutputWriter.init
0bc6ad1 [Cheng Lian] Resorts to new Hadoop API, and now FSBasedRelation can customize output format class
f320766 [Cheng Lian] Adds prepareForWrite() hook, refactored writer containers
422ff4a [Cheng Lian] Fixes style issue
ce52353 [Cheng Lian] Adds new SQLContext.load() overload with user defined dynamic partition columns
8d2ff71 [Cheng Lian] Merges partition columns when reading partitioned relations
ca1805b [Cheng Lian] Removes duplicated partition discovery code in new Parquet
f18dec2 [Cheng Lian] More strict schema checking
b746ab5 [Cheng Lian] More tests
9b487bf [Cheng Lian] Fixes compilation errors introduced while rebasing
ea6c8dd [Cheng Lian] Removes remote debugging stuff
327bb1d [Cheng Lian] Implements partitioning support for data sources API
3c5073a [Cheng Lian] Fixes SaveModes used in test cases
fb5a607 [Cheng Lian] Fixes compilation error
9d17607 [Cheng Lian] Adds the contract that OutputWriter should have zero-arg constructor
5de194a [Cheng Lian] Forgot Apache licence header
95d0b4d [Cheng Lian] Renames PartitionedSchemaRelationProvider to FSBasedRelationProvider
770b5ba [Cheng Lian] Adds tests for FSBasedRelation
3ba9bbf [Cheng Lian] Adds DataFrame.saveAsTable() overrides which support partitioning
1b8231f [Cheng Lian] Renames FSBasedPrunedFilteredScan to FSBasedRelation
aa8ba9a [Cheng Lian] Javadoc fix
012ed2d [Cheng Lian] Adds PartitioningOptions
7dd8dd5 [Cheng Lian] Adds new interfaces and stub methods for data sources API partitioning support

(cherry picked from commit 0595b6de8f)
Signed-off-by: Cheng Lian <lian@databricks.com>
2015-05-13 01:32:55 +08:00
Wenchen Fan a9d84a9bf7 [DataFrame][minor] cleanup unapply methods in DataTypes
Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #6079 from cloud-fan/unapply and squashes the following commits:

40da442 [Wenchen Fan] one more
7d90a05 [Wenchen Fan] cleanup unapply in DataTypes

(cherry picked from commit 831504cf6b)
Signed-off-by: Reynold Xin <rxin@databricks.com>
2015-05-12 10:29:05 -07:00
Daoyuan Wang 653db0a1bd [SPARK-6876] [PySpark] [SQL] add DataFrame na.replace in pyspark
Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #6003 from adrian-wang/pynareplace and squashes the following commits:

672efba [Daoyuan Wang] remove py2.7 feature
4a148f7 [Daoyuan Wang] to_replace support dict, value support single value, and add full tests
9e232e7 [Daoyuan Wang] rename scala map
af0268a [Daoyuan Wang] remove na
63ac579 [Daoyuan Wang] add na.replace in pyspark

(cherry picked from commit d86ce84584)
Signed-off-by: Reynold Xin <rxin@databricks.com>
2015-05-12 10:23:57 -07:00
Tathagata Das 2bbb685f4c [SPARK-7532] [STREAMING] StreamingContext.start() made to logWarning and not throw exception
Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #6060 from tdas/SPARK-7532 and squashes the following commits:

6fe2e83 [Tathagata Das] Update docs
7dadfc3 [Tathagata Das] Fixed bug again
99c7678 [Tathagata Das] Added logInfo
65aec20 [Tathagata Das] Fix bug
5bf031b [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into SPARK-7532
1a9a818 [Tathagata Das] Fix scaladoc
c584313 [Tathagata Das] StreamingContext.start() made to logWarning and not throw exception

(cherry picked from commit ec6f2a9774)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
2015-05-12 08:48:34 -07:00
Andrew Or 56016326c0 [SPARK-7467] Dag visualization: treat checkpoint as an RDD operation
Such that a checkpoint RDD does not go into random scopes on the UI, e.g. `take`. We've seen this in streaming.

Author: Andrew Or <andrew@databricks.com>

Closes #6004 from andrewor14/dag-viz-checkpoint and squashes the following commits:

9217439 [Andrew Or] Fix checkpoints
4ae8806 [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-checkpoint
19bc07b [Andrew Or] Treat checkpoint as an RDD operation

(cherry picked from commit f3e8e60063)
Signed-off-by: Andrew Or <andrew@databricks.com>
2015-05-12 01:41:02 -07:00
Marcelo Vanzin afe54b76a6 [SPARK-7485] [BUILD] Remove pyspark files from assembly.
The sbt part of the build is hacky; it basically tricks sbt
into generating the zip by using a generator, but returns
an empty list for the generated files so that nothing is
actually added to the assembly.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #6022 from vanzin/SPARK-7485 and squashes the following commits:

22c1e04 [Marcelo Vanzin] Remove unneeded code.
4893622 [Marcelo Vanzin] [SPARK-7485] [build] Remove pyspark files from assembly.

(cherry picked from commit 82e890fb19)
Signed-off-by: Andrew Or <andrew@databricks.com>
2015-05-12 01:39:28 -07:00
linweizhong 4092a2e859 [MINOR] [PYSPARK] Set PYTHONPATH to python/lib/pyspark.zip rather than python/pyspark
As of PR #5580 we create pyspark.zip during the build and set PYTHONPATH to python/lib/pyspark.zip, so update this to keep things consistent.

Author: linweizhong <linweizhong@huawei.com>

Closes #6047 from Sephiroth-Lin/pyspark_pythonpath and squashes the following commits:

8cc3d96 [linweizhong] Set PYTHONPATH to python/lib/pyspark.zip rather than python/pyspark as PR#5580 we have create pyspark.zip on build

(cherry picked from commit 9847875266)
Signed-off-by: Andrew Or <andrew@databricks.com>
2015-05-12 01:36:36 -07:00
zsxwing af374ed268 [SPARK-7534] [CORE] [WEBUI] Fix the Stage table when a stage is missing
Just improved the Stage table when a stage is missing.

Before:

![screen shot 2015-05-11 at 10 11 51 am](https://cloud.githubusercontent.com/assets/1000778/7570842/2ba37380-f7c8-11e4-9b5f-cf1a6264b2a4.png)

After:

![screen shot 2015-05-11 at 10 26 09 am](https://cloud.githubusercontent.com/assets/1000778/7570848/33703152-f7c8-11e4-81a8-d53dd72d7b8d.png)

Author: zsxwing <zsxwing@gmail.com>

Closes #6061 from zsxwing/SPARK-7534 and squashes the following commits:

09fe862 [zsxwing] Leave it blank rather than '-'
6299197 [zsxwing] Fix the Stage table when a stage is missing
2015-05-12 01:35:14 -07:00
vidmantas zemleris 6523fb81ba [SPARK-6994][SQL] Update docs for fetching Row fields by name
add docs for https://issues.apache.org/jira/browse/SPARK-6994

Author: vidmantas zemleris <vidmantas@vinted.com>

Closes #6030 from vidma/docs/row-with-named-fields and squashes the following commits:

241b401 [vidmantas zemleris] [SPARK-6994][SQL] Update docs for fetching Row fields by name

(cherry picked from commit 640f63b959)
Signed-off-by: Reynold Xin <rxin@databricks.com>
2015-05-11 22:29:41 -07:00
Reynold Xin c6b8148458 [SQL] Rename Dialect -> ParserDialect.
Author: Reynold Xin <rxin@databricks.com>

Closes #6071 from rxin/parserdialect and squashes the following commits:

ca2eb31 [Reynold Xin] Rename Dialect -> ParserDialect.

(cherry picked from commit 16696759e9)
Signed-off-by: Reynold Xin <rxin@databricks.com>
2015-05-11 22:07:02 -07:00
Joshi 835a770c84 [SPARK-7435] [SPARKR] Make DataFrame.show() consistent with that of Scala and pySpark
Author: Joshi <rekhajoshm@gmail.com>
Author: Rekha Joshi <rekhajoshm@gmail.com>

Closes #5989 from rekhajoshm/fix/SPARK-7435 and squashes the following commits:

cfc9e02 [Joshi] Spark-7435[R]: updated patch for review comments
62becc1 [Joshi] SPARK-7435: Update to DataFrame
e3677c9 [Rekha Joshi] Merge pull request #1 from apache/master

(cherry picked from commit b94a93371c)
Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
2015-05-11 21:04:59 -07:00
Reynold Xin ed40ab5b27 [SPARK-7509][SQL] DataFrame.drop in Python for dropping columns.
Author: Reynold Xin <rxin@databricks.com>

Closes #6068 from rxin/drop-column and squashes the following commits:

9d7d5ec [Reynold Xin] [SPARK-7509][SQL] DataFrame.drop in Python for dropping columns.

(cherry picked from commit 028ad4bd29)
Signed-off-by: Reynold Xin <rxin@databricks.com>
2015-05-11 20:04:42 -07:00
Zhongshuai Pei c30982dd86 [SPARK-7437] [SQL] Fold "literal in (item1, item2, ..., literal, ...)" into true or false directly
SQL
```
select key from src where 3 in (4, 5);
```
Before
```
== Optimized Logical Plan ==
Project [key#12]
 Filter 3 INSET (5,4)
  MetastoreRelation default, src, None
```

After
```
== Optimized Logical Plan ==
LocalRelation [key#228], []
```

Author: Zhongshuai Pei <799203320@qq.com>
Author: DoingDone9 <799203320@qq.com>

Closes #5972 from DoingDone9/InToFalse and squashes the following commits:

4c722a2 [Zhongshuai Pei] Update predicates.scala
abe2bbb [Zhongshuai Pei] Update Optimizer.scala
fa461a5 [Zhongshuai Pei] Update Optimizer.scala
e34c28a [Zhongshuai Pei] Update predicates.scala
24739bd [Zhongshuai Pei] Update ConstantFoldingSuite.scala
f4dbf50 [Zhongshuai Pei] Update ConstantFoldingSuite.scala
35ceb7a [Zhongshuai Pei] Update Optimizer.scala
36c194e [Zhongshuai Pei] Update Optimizer.scala
2e8f6ca [Zhongshuai Pei] Update Optimizer.scala
14952e2 [Zhongshuai Pei] Merge pull request #13 from apache/master
f03fe7f [Zhongshuai Pei] Merge pull request #12 from apache/master
f12fa50 [Zhongshuai Pei] Merge pull request #10 from apache/master
f61210c [Zhongshuai Pei] Merge pull request #9 from apache/master
34b1a9a [Zhongshuai Pei] Merge pull request #8 from apache/master
802261c [DoingDone9] Merge pull request #7 from apache/master
d00303b [DoingDone9] Merge pull request #6 from apache/master
98b134f [DoingDone9] Merge pull request #5 from apache/master
161cae3 [DoingDone9] Merge pull request #4 from apache/master
c87e8b6 [DoingDone9] Merge pull request #3 from apache/master
cb1852d [DoingDone9] Merge pull request #2 from apache/master
c3f046f [DoingDone9] Merge pull request #1 from apache/master

(cherry picked from commit 4b5e1fe94c)
Signed-off-by: Michael Armbrust <michael@databricks.com>
2015-05-11 19:23:09 -07:00
Cheng Hao 1a664a0d54 [SPARK-7411] [SQL] Support SerDe for HiveQl in CTAS
This is a follow-up to #5876 and should be merged after it.

Let's wait for the unit test results from Jenkins.

Author: Cheng Hao <hao.cheng@intel.com>

Closes #5963 from chenghao-intel/useIsolatedClient and squashes the following commits:

f87ace6 [Cheng Hao] remove the TODO and add `resolved condition` for HiveTable
a8260e8 [Cheng Hao] Update code as feedback
f4e243f [Cheng Hao] remove the serde setting for SequenceFile
d166afa [Cheng Hao] style issue
d25a4aa [Cheng Hao] Add SerDe support for CTAS

(cherry picked from commit e35d878be3)
Signed-off-by: Michael Armbrust <michael@databricks.com>
2015-05-11 19:21:29 -07:00
Reynold Xin 8a9d2348e9 [SPARK-7324] [SQL] DataFrame.dropDuplicates
This should also close https://github.com/apache/spark/pull/5870

Author: Reynold Xin <rxin@databricks.com>

Closes #6066 from rxin/dropDups and squashes the following commits:

130692f [Reynold Xin] [SPARK-7324][SQL] DataFrame.dropDuplicates

(cherry picked from commit b6bf4f76c7)
Signed-off-by: Michael Armbrust <michael@databricks.com>
2015-05-11 19:15:32 -07:00
Tathagata Das c16b47f9ed [SPARK-7530] [STREAMING] Added StreamingContext.getState() to expose the current state of the context
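
A hedged usage sketch (the enum values INITIALIZED / ACTIVE / STOPPED are inferred from the squashed commits below; `ssc` is an in-scope StreamingContext):

```scala
import org.apache.spark.streaming.StreamingContextState

if (ssc.getState() == StreamingContextState.ACTIVE) {
  ssc.stop(stopSparkContext = false)
}
```
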
Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #6058 from tdas/SPARK-7530 and squashes the following commits:

80ee0e6 [Tathagata Das] STARTED --> ACTIVE
3da6547 [Tathagata Das] Added synchronized
dd88444 [Tathagata Das] Added more docs
e1a8505 [Tathagata Das] Fixed comment length
89f9980 [Tathagata Das] Change to Java enum and added Java test
7c57351 [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into SPARK-7530
dd4e702 [Tathagata Das] Addressed comments.
3d56106 [Tathagata Das] Added Mima excludes
2b86ba1 [Tathagata Das] Added scala docs.
1722433 [Tathagata Das] Fixed style
976b094 [Tathagata Das] Added license
0585130 [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into SPARK-7530
e0f0a05 [Tathagata Das] Added getState and exposed StreamingContextState

(cherry picked from commit f9c7580ada)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
2015-05-11 18:54:06 -07:00
Xusen Yin f188815989 [SPARK-5893] [ML] Add bucketizer
JIRA issue [here](https://issues.apache.org/jira/browse/SPARK-5893).

One thing to make clear: the `buckets` parameter, an array of `Double`, serves as the split points. Say,

```scala
buckets = Array(-0.5, 0.0, 0.5)
```

splits the real line into 4 ranges, (-inf, -0.5], (-0.5, 0.0], (0.0, 0.5], (0.5, +inf), which are encoded as 0, 1, 2, 3.

Author: Xusen Yin <yinxusen@gmail.com>
Author: Joseph K. Bradley <joseph@databricks.com>

Closes #5980 from yinxusen/SPARK-5893 and squashes the following commits:

dc8c843 [Xusen Yin] Merge pull request #4 from jkbradley/yinxusen-SPARK-5893
1ca973a [Joseph K. Bradley] one more bucketizer test
34f124a [Joseph K. Bradley] Removed lowerInclusive, upperInclusive params from Bucketizer, and used splits instead.
eacfcfa [Xusen Yin] change ML attribute from splits into buckets
c3cc770 [Xusen Yin] add more unit test for binary search
3a16cc2 [Xusen Yin] refine comments and names
ac77859 [Xusen Yin] fix style error
fb30d79 [Xusen Yin] fix and test binary search
2466322 [Xusen Yin] refactor Bucketizer
11fb00a [Xusen Yin] change it into an Estimator
998bc87 [Xusen Yin] check buckets
4024cf1 [Xusen Yin] add test suite
5fe190e [Xusen Yin] add bucketizer

(cherry picked from commit 35fb42a0b0)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
2015-05-11 18:41:36 -07:00
Reynold Xin e1e599d58c Updated DataFrame.saveAsTable Hive warning to include SPARK-7550 ticket.
So users that are interested in this can track it easily.

Author: Reynold Xin <rxin@databricks.com>

Closes #6067 from rxin/SPARK-7550 and squashes the following commits:

ee0e34c [Reynold Xin] Updated DataFrame.saveAsTable Hive warning to include SPARK-7550 ticket.

(cherry picked from commit 87229c95c6)
Signed-off-by: Reynold Xin <rxin@databricks.com>
2015-05-11 18:10:52 -07:00
Reynold Xin eaa6116200 [SPARK-7462][SQL] Update documentation for retaining grouping columns in DataFrames.
Author: Reynold Xin <rxin@databricks.com>

Closes #6062 from rxin/agg-retain-doc and squashes the following commits:

43e511e [Reynold Xin] [SPARK-7462][SQL] Update documentation for retaining grouping columns in DataFrames.

(cherry picked from commit 3a9b6997df)
Signed-off-by: Reynold Xin <rxin@databricks.com>
2015-05-11 18:07:19 -07:00