ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Davies Liu	5c746eedda	[SPARK-5395] [PySpark] fix python process leak while coalesce() Currently, the Python process is released into the pool only after the task has finished, which causes many processes to be forked if coalesce() is called. This PR changes it to release the process as soon as all the data has been read from it (the partition is finished), so that a process can be reused to process multiple partitions in a single task. Author: Davies Liu <davies@databricks.com> Closes #4238 from davies/py_leak and squashes the following commits: ec80a43 [Davies Liu] add @volatile 6da437a [Davies Liu] address comments 24ed322 [Davies Liu] fix python process leak while coalesce()	2015-01-29 17:28:37 -08:00
Reynold Xin	ce9c43ba8c	[SQL] DataFrame API improvements 1. Added Dsl.column in case Dsl.col is shadowed. 2. Allow using String to specify the target data type in cast. 3. Support sorting on multiple columns using column names. 4. Added Java API test file. Author: Reynold Xin <rxin@databricks.com> Closes #4280 from rxin/dsl1 and squashes the following commits: 33ecb7a [Reynold Xin] Add the Java test. d06540a [Reynold Xin] [SQL] DataFrame API improvements.	2015-01-29 17:24:00 -08:00
Patrick Wendell	d2071e8f45	Revert "[WIP] [SPARK-3996]: Shade Jetty in Spark deliverables" This reverts commit `f240fe390b`.	2015-01-29 17:14:27 -08:00
Yoshihiro Shimizu	5338772f3f	remove 'return' looks unnecessary 😀 Author: Yoshihiro Shimizu <shimizu@amoad.com> Closes #4268 from y-shimizu/remove-return and squashes the following commits: 12be0e9 [Yoshihiro Shimizu] remove 'return'	2015-01-29 16:55:00 -08:00
Patrick Wendell	f240fe390b	[WIP] [SPARK-3996]: Shade Jetty in Spark deliverables This patch piggybacks on vanzin's work to simplify the Guava shading, and adds Jetty as a shaded library in Spark. Other than adding Jetty, it consolidates the \<artifactSet\> elements into the root pom. I found it was a bit easier to follow that way, since you don't need to look into child poms to find out which specific artifact sets are included in shading. Author: Patrick Wendell <patrick@databricks.com> Closes #4252 from pwendell/jetty and squashes the following commits: 19f0710 [Patrick Wendell] More code review feedback 961452d [Patrick Wendell] Responding to feedback from Marcello 6df25ca [Patrick Wendell] [WIP] [SPARK-3996]: Shade Jetty in Spark deliverables	2015-01-29 16:31:19 -08:00
Josh Rosen	0bb15f22d1	[SPARK-5464] Fix help() for Python DataFrame instances This fixes an exception that prevented users from calling `help()` on Python DataFrame instances. Author: Josh Rosen <joshrosen@databricks.com> Closes #4278 from JoshRosen/SPARK-5464-python-dataframe-help-command and squashes the following commits: 08f95f7 [Josh Rosen] Fix exception when calling help() on Python DataFrame instances	2015-01-29 16:23:20 -08:00
Yin Huai	c00d517d66	[SPARK-4296][SQL] Trims aliases when resolving and checking aggregate expressions I believe that SPARK-4296 has been fixed by `3684fd21e1`. I am adding tests based on #3910 (changing the udf to HiveUDF instead). Author: Yin Huai <yhuai@databricks.com> Author: Cheng Lian <lian@databricks.com> Closes #4010 from yhuai/SPARK-4296-yin and squashes the following commits: 6343800 [Yin Huai] Merge remote-tracking branch 'upstream/master' into SPARK-4296-yin 6cfadd2 [Yin Huai] Actually, this issue has been fixed by `3684fd21e1`. d42b707 [Yin Huai] Update comment. 8b3a274 [Yin Huai] Since expressions in grouping expressions can have aliases, which can be used by the outer query block, revert this change. 443538d [Cheng Lian] Trims aliases when resolving and checking aggregate expressions	2015-01-29 15:49:34 -08:00
wangfei	c1b3eebf97	[SPARK-5373][SQL] Literal in agg grouping expressions leads to incorrect result `select key, count( * ) from src group by key, 1` will get the wrong answer. e.g. for this table ``` val testData2 = TestSQLContext.sparkContext.parallelize( TestData2(1, 1) :: TestData2(1, 2) :: TestData2(2, 1) :: TestData2(2, 2) :: TestData2(3, 1) :: TestData2(3, 2) :: Nil, 2).toSchemaRDD testData2.registerTempTable("testData2") ``` result of `SELECT a, count(1) FROM testData2 GROUP BY a, 1` is ``` [1,1] [2,2] [3,1] ``` Author: wangfei <wangfei1@huawei.com> Closes #4169 from scwf/agg-bug and squashes the following commits: 05751db [wangfei] fix bugs when literal in agg grouping expressioons	2015-01-29 15:47:18 -08:00
wangfei	fbaf9e0896	[SPARK-5367][SQL] Support star expression in udf Spark SQL currently does not support star expressions in UDFs; running the following SQL with spark-sql produces an error: ``` select concat(*) from src ``` Author: wangfei <wangfei1@huawei.com> Author: scwf <wangfei1@huawei.com> Closes #4163 from scwf/udf-star and squashes the following commits: 9db7b39 [wangfei] addressed comments da1da09 [scwf] minor fix f87b5f9 [scwf] added test case 587bf7e [wangfei] compile fix eb93c16 [wangfei] fix star resolve issue in udf	2015-01-29 15:44:53 -08:00
Yash Datta	de221ea032	[SPARK-4786][SQL]: Parquet filter pushdown for castable types Enable parquet filter pushdown of castable types like short, byte that can be cast to integer Author: Yash Datta <Yash.Datta@guavus.com> Closes #4156 from saucam/filter_short and squashes the following commits: a403979 [Yash Datta] SPARK-4786: Fix styling issues d029866 [Yash Datta] SPARK-4786: Add test case cb2e0d9 [Yash Datta] SPARK-4786: Parquet filter pushdown for castable types	2015-01-29 15:42:23 -08:00
Michael Davies	940f375611	[SPARK-5309][SQL] Add support for dictionaries in PrimitiveConverter for Strin... ...gs. Parquet Converters allow developers to take advantage of dictionary encoding of column data to reduce Column Binary decoding. The Spark PrimitiveConverter was not using that API, and consequently, for String columns that used dictionary compression, it repeated Binary to String conversions for the same String. In measurements this could account for over 25% of entire query time. For example, a 500M row table split across 16 blocks was aggregated and summed in a little under 30s before this change and a little under 20s after the change. Author: Michael Davies <Michael.BellDavies@gmail.com> Closes #4187 from MickDavies/SPARK-5309-2 and squashes the following commits: 327287e [Michael Davies] SPARK-5309: Add support for dictionaries in PrimitiveConverter for Strings. 33c002c [Michael Davies] SPARK-5309: Add support for dictionaries in PrimitiveConverter for Strings.	2015-01-29 15:40:59 -08:00
Liang-Chi Hsieh	bce0ba1fbd	[SPARK-5429][SQL] Use javaXML plan serialization for Hive golden answers on Hive 0.13.1 I found that running `HiveComparisonTest.createQueryTest` to generate Hive golden answer files on Hive 0.13.1 would throw KryoException. I am not sure if this can be reproduced by others. Since Hive 0.13.0, Kryo plan serialization is introduced to replace javaXML as default plan serialization format. This is a quick fix to set hive configuration to use javaXML serialization. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #4223 from viirya/fix_hivetest and squashes the following commits: 97a8760 [Liang-Chi Hsieh] Use javaXML plan serialization.	2015-01-29 15:28:22 -08:00
Reynold Xin	715632232d	[SPARK-5445][SQL] Consolidate Java and Scala DSL static methods. Turns out Scala does generate static methods for ones defined in a companion object. Finally no need to separate api.java.dsl and api.scala.dsl. Author: Reynold Xin <rxin@databricks.com> Closes #4276 from rxin/dsl and squashes the following commits: 30aa611 [Reynold Xin] Add all files. 1a9d215 [Reynold Xin] [SPARK-5445][SQL] Consolidate Java and Scala DSL static methods.	2015-01-29 15:13:09 -08:00
Marcelo Vanzin	f9e569452e	[SPARK-5466] Add explicit guava dependencies where needed. One side-effect of shading guava is that it disappears as a transitive dependency. For Hadoop 2.x, this was masked by the fact that Hadoop itself depends on guava. But certain versions of Hadoop 1.x also shade guava, leaving either no guava or some random version pulled by another dependency on the classpath. So be explicit about the dependency in modules that use guava directly, which is the right thing to do anyway. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #4272 from vanzin/SPARK-5466 and squashes the following commits: e3f30e5 [Marcelo Vanzin] Dependency for catalyst is not needed. d3b2c84 [Marcelo Vanzin] [SPARK-5466] Add explicit guava dependencies where needed.	2015-01-29 13:00:45 -08:00
Xiangrui Meng	a3dc618486	[SPARK-5477] refactor stat.py There is only a single `stat.py` file for the `mllib.stat` package. We recently added `MultivariateGaussian` under `mllib.stat.distribution` in Scala/Java. It would be nice to refactor `stat.py` and make it easy to expand. Note that `ChiSqTestResult` is moved from `mllib.stat` to `mllib.stat.test`. The latter is used in Scala/Java. It is only used in the return value of `Statistics.chiSqTest`, so this should be an okay change. davies Author: Xiangrui Meng <meng@databricks.com> Closes #4266 from mengxr/py-stat-refactor and squashes the following commits: 1a5e1db [Xiangrui Meng] refactor stat.py	2015-01-29 10:11:44 -08:00
Reynold Xin	5ad78f6205	[SQL] Various DataFrame DSL update. 1. Added foreach, foreachPartition, flatMap to DataFrame. 2. Added col() in dsl. 3. Support renaming columns in toDataFrame. 4. Support type inference on arrays (in addition to Seq). 5. Updated mllib to use the new DSL. Author: Reynold Xin <rxin@databricks.com> Closes #4260 from rxin/sql-dsl-update and squashes the following commits: 73466c1 [Reynold Xin] Fixed LogisticRegression. Also added better error message for resolve. fab3ccc [Reynold Xin] Bug fix. d31fcd2 [Reynold Xin] Style fix. 62608c4 [Reynold Xin] [SQL] Various DataFrame DSL update.	2015-01-29 00:01:10 -08:00
Burak Yavuz	a63be1a18f	[SPARK-3977] Conversion methods for BlockMatrix to other Distributed Matrices The conversion methods for `BlockMatrix`. Conversions go through `CoordinateMatrix` in order to cause a shuffle so that intermediate operations will be stored on disk and the expensive initial computation will be mitigated. Author: Burak Yavuz <brkyvz@gmail.com> Closes #4256 from brkyvz/SPARK-3977PR and squashes the following commits: 4df37fe [Burak Yavuz] moved TODO inside code block b049c07 [Burak Yavuz] addressed code review feedback v1 66cb755 [Burak Yavuz] added default toBlockMatrix conversion 851f2a2 [Burak Yavuz] added better comments and checks cdb9895 [Burak Yavuz] [SPARK-3977] Conversion methods for BlockMatrix to other Distributed Matrices	2015-01-28 23:42:07 -08:00
Reynold Xin	5b9760de8d	[SPARK-5445][SQL] Made DataFrame dsl usable in Java Also removed the literal implicit transformation since it is pretty scary for API design. Instead, created a new lit method for creating literals. This doesn't break anything from a compatibility perspective because Literal was added two days ago. Author: Reynold Xin <rxin@databricks.com> Closes #4241 from rxin/df-docupdate and squashes the following commits: c0f4810 [Reynold Xin] Fix Python merge conflict. 094c7d7 [Reynold Xin] Minor style fix. Reset Python tests. 3c89f4a [Reynold Xin] Package. dfe6962 [Reynold Xin] Updated Python aggregate. 5dd4265 [Reynold Xin] Made dsl Java callable. 14b3c27 [Reynold Xin] Fix literal expression for symbols. 68b31cb [Reynold Xin] Literal. 4cfeb78 [Reynold Xin] [SPARK-5097][SQL] Address DataFrame code review feedback.	2015-01-28 19:10:32 -08:00
Xiangrui Meng	4ee79c71af	[SPARK-5430] move treeReduce and treeAggregate from mllib to core We have seen many use cases of `treeAggregate`/`treeReduce` outside the ML domain. Maybe it is time to move them to Core. pwendell Author: Xiangrui Meng <meng@databricks.com> Closes #4228 from mengxr/SPARK-5430 and squashes the following commits: 20ad40d [Xiangrui Meng] exclude tree* from mima e89a43e [Xiangrui Meng] fix compile and update java doc 3ae1a4b [Xiangrui Meng] add treeReduce/treeAggregate to Python 6f948c5 [Xiangrui Meng] add treeReduce/treeAggregate to JavaRDDLike d600b6c [Xiangrui Meng] move treeReduce and treeAggregate to core	2015-01-28 17:26:03 -08:00
Xiangrui Meng	e80dc1c5a8	[SPARK-4586][MLLIB] Python API for ML pipeline and parameters This PR adds Python API for ML pipeline and parameters. The design doc can be found on the JIRA page. It includes transformers and an estimator to demo the simple text classification example code. TODO: - [x] handle parameters in LRModel - [x] unit tests - [x] missing some docs CC: davies jkbradley Author: Xiangrui Meng <meng@databricks.com> Author: Davies Liu <davies@databricks.com> Closes #4151 from mengxr/SPARK-4586 and squashes the following commits: 415268e [Xiangrui Meng] remove inherit_doc from __init__ edbd6fe [Xiangrui Meng] move Identifiable to ml.util 44c2405 [Xiangrui Meng] Merge pull request #2 from davies/ml dd1256b [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-4586 14ae7e2 [Davies Liu] fix docs 54ca7df [Davies Liu] fix tests 78638df [Davies Liu] Merge branch 'SPARK-4586' of github.com:mengxr/spark into ml fc59a02 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-4586 1dca16a [Davies Liu] refactor 090b3a3 [Davies Liu] Merge branch 'master' of github.com:apache/spark into ml 0882513 [Xiangrui Meng] update doc style a4f4dbf [Xiangrui Meng] add unit test for LR 7521d1c [Xiangrui Meng] add unit tests to HashingTF and Tokenizer ba0ba1e [Xiangrui Meng] add unit tests for pipeline 0586c7b [Xiangrui Meng] add more comments to the example 5153cff [Xiangrui Meng] simplify java models 036ca04 [Xiangrui Meng] gen numFeatures 46fa147 [Xiangrui Meng] update mllib/pom.xml to include python files in the assembly 1dcc17e [Xiangrui Meng] update code gen and make param appear in the doc f66ba0c [Xiangrui Meng] make params a property d5efd34 [Xiangrui Meng] update doc conf and move embedded param map to instance attribute f4d0fe6 [Xiangrui Meng] use LabeledDocument and Document in example 05e3e40 [Xiangrui Meng] update example d3e8dbe [Xiangrui Meng] more docs optimize pipeline.fit impl 56de571 [Xiangrui Meng] fix style d0c5bb8 [Xiangrui Meng] a working copy bce72f4 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-4586 17ecfb9 [Xiangrui Meng] code gen for shared params d9ea77c [Xiangrui Meng] update doc c18dca1 [Xiangrui Meng] make the example working dadd84e [Xiangrui Meng] add base classes and docs a3015cf [Xiangrui Meng] add Estimator and Transformer 46eea43 [Xiangrui Meng] a pipeline in python 33b68e0 [Xiangrui Meng] a working LR	2015-01-28 17:14:23 -08:00
Michael Nazario	e023112d33	[SPARK-5441][pyspark] Make SerDeUtil PairRDD to Python conversions more robust SerDeUtil.pairRDDToPython and SerDeUtil.pythonToPairRDD now both support empty RDDs by checking the result of take(1) instead of calling first which throws an exception. Author: Michael Nazario <mnazario@palantir.com> Closes #4236 from mnazario/feature/empty-first and squashes the following commits: a531c0c [Michael Nazario] Added regression tests for SPARK-5441 e3b2fb6 [Michael Nazario] Added acceptance of the empty case	2015-01-28 13:58:46 -08:00
Yandu Oppacher	3bead67d59	[SPARK-4387][PySpark] Refactoring python profiling code to make it extensible This PR is based on #3255, with conflicts and code style fixed. Closes #3255. Author: Yandu Oppacher <yandu.oppacher@jadedpixel.com> Author: Davies Liu <davies@databricks.com> Closes #3901 from davies/refactor-python-profile-code and squashes the following commits: b4a9306 [Davies Liu] fix tests 4b79ce8 [Davies Liu] add docstring for profiler_cls 2700e47 [Davies Liu] use BasicProfiler as default 349e341 [Davies Liu] more refactor 6a5d4df [Davies Liu] refactor and fix tests 31bf6b6 [Davies Liu] fix code style 0864b5d [Yandu Oppacher] Remove unused method 76a6c37 [Yandu Oppacher] Added a profile collector to accumulate the profilers per stage 9eefc36 [Yandu Oppacher] Fix doc 9ace076 [Yandu Oppacher] Refactor of profiler, and moved tests around 8739aff [Yandu Oppacher] Code review fixes 9bda3ec [Yandu Oppacher] Refactor profiler code	2015-01-28 13:48:06 -08:00
Ryan Williams	a731314c31	[SPARK-5417] Remove redundant executor-id set() call This happens inside SparkEnv initialization as of #4194 Author: Ryan Williams <ryan.blake.williams@gmail.com> Closes #4213 from ryan-williams/exec-id-set and squashes the following commits: b3e4f7b [Ryan Williams] Remove redundant executor-id set() call	2015-01-28 13:04:52 -08:00
Nicholas Chammas	d44ee43665	[SPARK-5434] [EC2] Preserve spaces in EC2 path Fixes [SPARK-5434](https://issues.apache.org/jira/browse/SPARK-5434). Simple demonstration of the problem and the fix: ``` $ spacey_path="/path/with some/spaces" $ dirname $spacey_path usage: dirname path $ echo $? 1 $ dirname "$spacey_path" /path/with some $ echo $? 0 ``` Author: Nicholas Chammas <nicholas.chammas@gmail.com> Closes #4224 from nchammas/patch-1 and squashes the following commits: 960711a [Nicholas Chammas] [EC2] Preserve spaces in EC2 path	2015-01-28 12:56:03 -08:00
Andrew Or	84b6ecdef6	[SPARK-5437] Fix DriverSuite and SparkSubmitSuite timeout issues In DriverSuite, we currently set a timeout of 60 seconds. If after this time the process has not terminated, we leak the process because we never destroy it. In SparkSubmitSuite, we currently do not have a timeout so the test can hang indefinitely. Author: Andrew Or <andrew@databricks.com> Closes #4230 from andrewor14/fix-driver-suite and squashes the following commits: f5c80fd [Andrew Or] Fix timeout behaviors in both suites 8092c36 [Andrew Or] Stop SparkContext after every individual test	2015-01-28 12:53:22 -08:00
lianhuiwang	81f8f34062	[SPARK-4955] With executor dynamic scaling enabled, executors should be added or killed in yarn-cluster mode. With executor dynamic scaling enabled, it should be possible to add or kill executors in yarn-cluster mode. So in yarn-cluster mode, the ApplicationMaster starts an AMActor that adds or kills executors, and the YarnSchedulerActor in YarnSchedulerBackend sends messages to the AM's AMActor. andrewor14 ChengXiangLi tdas Author: lianhuiwang <lianhuiwang09@gmail.com> Closes #3962 from lianhuiwang/SPARK-4955 and squashes the following commits: 48d9ebb [lianhuiwang] update with andrewor14's comments 12426af [lianhuiwang] refactor am's code 45da3b0 [lianhuiwang] remove unrelated code 9318fc1 [lianhuiwang] update with andrewor14's comments 08ba473 [lianhuiwang] address andrewor14's comments 265c36d [lianhuiwang] fix small change f43bda8 [lianhuiwang] fix address andrewor14's comments 7a7767a [lianhuiwang] fix address andrewor14's comments bbc4d5a [lianhuiwang] address andrewor14's comments 1b029a4 [lianhuiwang] fix bug 7d33791 [lianhuiwang] in AM create a new actorSystem 2164ea8 [lianhuiwang] fix a min bug 6dfeeec [lianhuiwang] in yarn-cluster mode,executor number can be added or killed.	2015-01-28 12:51:15 -08:00
Michael Nazario	456c11f15a	[SPARK-5440][pyspark] Add toLocalIterator to pyspark rdd Since Java and Scala both have access to iterate over partitions via the "toLocalIterator" function, python should also have that same ability. Author: Michael Nazario <mnazario@palantir.com> Closes #4237 from mnazario/feature/toLocalIterator and squashes the following commits: 1c58526 [Michael Nazario] Fix documentation off by one error 0cdc8f8 [Michael Nazario] Add toLocalIterator to PySpark	2015-01-28 12:47:12 -08:00
Sean Owen	9b18009b83	SPARK-1934 [CORE] "this" reference escape to "selectorThread" during construction in ConnectionManager This change reshuffles the order of initialization in `ConnectionManager` so that the last thing that happens is running `selectorThread`, which invokes a method that relies on object state in `ConnectionManager`. zsxwing also reported a similar problem in `BlockManager` in the JIRA, but I can't find a similar pattern there. Maybe it was subsequently fixed? Author: Sean Owen <sowen@cloudera.com> Closes #4225 from srowen/SPARK-1934 and squashes the following commits: c4dec3b [Sean Owen] Init all object state in ConnectionManager constructor before starting thread in constructor that accesses object's state	2015-01-28 12:44:35 -08:00
Kousuke Saruta	e902dc443d	[SPARK-5188][BUILD] make-distribution.sh should support curl, not only wget to get Tachyon When we use `make-distribution.sh` with the `--with-tachyon` option, Tachyon is downloaded with the `wget` command, but some systems don't have `wget` by default (Mac OS X, for example, does not). Other scripts like build/mvn and build/sbt support not only `wget` but also `curl`, so `make-distribution.sh` should support `curl` too. Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #3988 from sarutak/SPARK-5188 and squashes the following commits: 0f546e0 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-5188 010e884 [Kousuke Saruta] Merge branch 'SPARK-5188' of github.com:sarutak/spark into SPARK-5188 163687e [Kousuke Saruta] Fixed a merge conflict e24e01b [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-5188 3daf1f1 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-5188 3caa4cb [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-5188 7cc8255 [Kousuke Saruta] Fix to use \$MVN instead of mvn a3e908b [Kousuke Saruta] Fixed style 2db9fbf [Kousuke Saruta] Removed redirection from the logic which checks the existence of commands 1e4c7e0 [Kousuke Saruta] Used "command" command instead of "type" command 83b49b5 [Kousuke Saruta] Modified make-distribution.sh so that we use curl, not only wget to get tachyon	2015-01-28 12:43:22 -08:00
Sandy Ryza	406f6d3070	SPARK-5458. Refer to aggregateByKey instead of combineByKey in docs Author: Sandy Ryza <sandy@cloudera.com> Closes #4251 from sryza/sandy-spark-5458 and squashes the following commits: 460827a [Sandy Ryza] Python too d2dc160 [Sandy Ryza] SPARK-5458. Refer to aggregateByKey instead of combineByKey in docs	2015-01-28 12:41:23 -08:00
Reynold Xin	c8e934ef3c	[SPARK-5447][SQL] Replaced reference to SchemaRDD with DataFrame. and [SPARK-5448][SQL] Make CacheManager a concrete class and field in SQLContext Author: Reynold Xin <rxin@databricks.com> Closes #4242 from rxin/sqlCleanup and squashes the following commits: e351cb2 [Reynold Xin] Fixed toDataFrame. 6545c42 [Reynold Xin] More changes. 728c017 [Reynold Xin] [SPARK-5447][SQL] Replaced reference to SchemaRDD with DataFrame.	2015-01-28 12:10:01 -08:00
Winston Chen	453d7999b8	[SPARK-5361]Multiple Java RDD <-> Python RDD conversions not working correctly This is found through reading RDD from `sc.newAPIHadoopRDD` and writing it back using `rdd.saveAsNewAPIHadoopFile` in pyspark. It turns out that whenever there are multiple RDD conversions from JavaRDD to PythonRDD then back to JavaRDD, the exception below happens: ``` 15/01/16 10:28:31 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 7) java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to java.util.ArrayList at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:157) at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:153) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308) ``` The test case code below reproduces it: ``` from pyspark.rdd import RDD dl = [ (u'2', {u'director': u'David Lean'}), (u'7', {u'director': u'Andrew Dominik'}) ] dl_rdd = sc.parallelize(dl) tmp = dl_rdd._to_java_object_rdd() tmp2 = sc._jvm.SerDe.javaToPython(tmp) t = RDD(tmp2, sc) t.count() tmp = t._to_java_object_rdd() tmp2 = sc._jvm.SerDe.javaToPython(tmp) t = RDD(tmp2, sc) t.count() # it blows up here during the 2nd time of conversion ``` Author: Winston Chen <wchen@quid.com> Closes #4146 from wingchen/master and squashes the following commits: 903df7d [Winston Chen] SPARK-5361, update to toSeq based on the PR 5d90a83 [Winston Chen] SPARK-5361, make python pretty, so to pass PEP 8 checks 126be6b [Winston Chen] SPARK-5361, add in test case 4cf1187 [Winston Chen] SPARK-5361, add in test case 9f1a097 [Winston Chen] add in tuple handling while converting form python RDD back to JavaRDD	2015-01-28 11:08:44 -08:00
Kousuke Saruta	0b35fcd7f0	[SPARK-5291][CORE] Add timestamp and reason why an executor is removed to SparkListenerExecutorAdded and SparkListenerExecutorRemoved `SparkListenerExecutorAdded` and `SparkListenerExecutorRemoved` were added recently. I think it would be useful for them to carry a timestamp and the reason why an executor was removed. Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #4082 from sarutak/SPARK-5291 and squashes the following commits: a026ff2 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-5291 979dfe1 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-5291 cf9f9080 [Kousuke Saruta] Fixed test case 1f2a89b [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-5291 243f2a60 [Kousuke Saruta] Modified MesosSchedulerBackendSuite a527c35 [Kousuke Saruta] Added timestamp to SparkListenerExecutorAdded	2015-01-28 11:02:51 -08:00
Burak Yavuz	eeb53bf90e	[SPARK-3974][MLlib] Distributed Block Matrix Abstractions This pull request includes the abstractions for the distributed BlockMatrix representation. `BlockMatrix` will allow users to store very large matrices in small blocks of local matrices. Specific partitioners, such as `RowBasedPartitioner` and `ColumnBasedPartitioner`, are implemented in order to optimize addition and multiplication operations that will be added in a following PR. This work is based on the ml-matrix repo developed at the AMPLab at UC Berkeley, CA. https://github.com/amplab/ml-matrix Additional thanks to rezazadeh, shivaram, and mengxr for guidance on the design. Author: Burak Yavuz <brkyvz@gmail.com> Author: Xiangrui Meng <meng@databricks.com> Author: Burak Yavuz <brkyvz@dn51t42l.sunet> Author: Burak Yavuz <brkyvz@dn51t4rd.sunet> Author: Burak Yavuz <brkyvz@dn0a221430.sunet> Closes #3200 from brkyvz/SPARK-3974 and squashes the following commits: a8eace2 [Burak Yavuz] Merge pull request #2 from mengxr/brkyvz-SPARK-3974 feb32a7 [Xiangrui Meng] update tests e1d3ee8 [Xiangrui Meng] minor updates 24ec7b8 [Xiangrui Meng] update grid partitioner 5eecd48 [Burak Yavuz] fixed gridPartitioner and added tests 140f20e [Burak Yavuz] Merge branch 'master' of github.com:apache/spark into SPARK-3974 1694c9e [Burak Yavuz] almost finished addressing comments f9d664b [Burak Yavuz] updated API and modified partitioning scheme eebbdf7 [Burak Yavuz] preliminary changes addressing code review 1a63b20 [Burak Yavuz] [SPARK-3974] Remove setPartition method. Isn't required 1e8bb2a [Burak Yavuz] [SPARK-3974] Change return type of cache and persist 239ab4b [Burak Yavuz] [SPARK-3974] Addressed @jkbradley's comments ba414d2 [Burak Yavuz] [SPARK-3974] fixed frobenius norm ab6cde0 [Burak Yavuz] [SPARK-3974] Modifications cleaning code up, making size calculation more robust 9ae85aa [Burak Yavuz] [SPARK-3974] Made partitioner a variable inside BlockMatrix instead of a constructor variable d033861 [Burak Yavuz] [SPARK-3974] Removed SubMatrixInfo and added constructor without partitioner 49b9586 [Burak Yavuz] [SPARK-3974] Updated testing utils from master 645afbe [Burak Yavuz] [SPARK-3974] Pull latest master b05aabb [Burak Yavuz] [SPARK-3974] Updated tests to reflect changes 19c17e8 [Burak Yavuz] [SPARK-3974] Changed blockIdRow and blockIdCol 589fbb6 [Burak Yavuz] [SPARK-3974] Code review feedback addressed aa8f086 [Burak Yavuz] [SPARK-3974] Additional comments added f378e16 [Burak Yavuz] [SPARK-3974] Block Matrix Abstractions ready b693209 [Burak Yavuz] Ready for Pull request	2015-01-28 10:06:37 -08:00
Patrick Wendell	622ff09d03	MAINTENANCE: Automated closing of pull requests. This commit exists to close the following pull requests on Github: Closes #1480 (close requested by 'pwendell') Closes #4205 (close requested by 'kdatta') Closes #4114 (close requested by 'pwendell') Closes #3382 (close requested by 'mengxr') Closes #3933 (close requested by 'mengxr') Closes #3870 (close requested by 'yhuai')	2015-01-28 02:15:14 -08:00
Ryan Williams	661d3f9f3e	[SPARK-5415] bump sbt to version to 0.13.7 Author: Ryan Williams <ryan.blake.williams@gmail.com> Closes #4211 from ryan-williams/sbt0.13.7 and squashes the following commits: e28476d [Ryan Williams] bump sbt to version to 0.13.7	2015-01-28 02:13:06 -08:00
Marcelo Vanzin	37a5e272f8	[SPARK-4809] Rework Guava library shading. The current way of shading Guava is a little problematic. Code that depends on "spark-core" does not see the transitive dependency, yet classes in "spark-core" actually depend on Guava. So it's a little tricky to run unit tests that use spark-core classes, since you need a compatible version of Guava in your dependencies when running the tests. This can become a little tricky, and is kind of a bad user experience. This change modifies the way Guava is shaded so that it's applied uniformly across the Spark build. This means Guava is shaded inside spark-core itself, so that the dependency issues above are solved. Aside from that, all Spark sub-modules have their Guava references relocated, so that they refer to the relocated classes now packaged inside spark-core. Before, this was only done by the time the assembly was built, so projects that did not end up inside the assembly (such as streaming backends) could still reference the original location of Guava classes. The Guava classes are added to the "first" artifact Spark generates (network-common), so that all downstream modules have the needed classes available. Since "network-common" is a dependency of spark-core, all Spark apps should get the relocated classes automatically. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #3658 from vanzin/SPARK-4809 and squashes the following commits: 3c93e42 [Marcelo Vanzin] Shade Guava in the network-common artifact. 5d69ec9 [Marcelo Vanzin] Merge branch 'master' into SPARK-4809 b3104fc [Marcelo Vanzin] Add comment. 941848f [Marcelo Vanzin] Merge branch 'master' into SPARK-4809 f78c48a [Marcelo Vanzin] Merge branch 'master' into SPARK-4809 8053dd4 [Marcelo Vanzin] Merge branch 'master' into SPARK-4809 107d7da [Marcelo Vanzin] Add fix for SPARK-5052 (PR #3874). 40b8723 [Marcelo Vanzin] Merge branch 'master' into SPARK-4809 4a4ed42 [Marcelo Vanzin] [SPARK-4809] Rework Guava library shading.	2015-01-28 00:29:29 -08:00
Reynold Xin	d74373225e	[SPARK-5097][SQL] Test cases for DataFrame expressions. Author: Reynold Xin <rxin@databricks.com> Closes #4235 from rxin/df-tests1 and squashes the following commits: f341db6 [Reynold Xin] [SPARK-5097][SQL] Test cases for DataFrame expressions.	2015-01-27 18:10:49 -08:00
Reynold Xin	119f45d61d	[SPARK-5097][SQL] DataFrame This pull request redesigns the existing Spark SQL dsl, which already provides data frame like functionalities. TODOs: With the exception of Python support, other tasks can be done in separate, follow-up PRs. - [ ] Audit of the API - [ ] Documentation - [ ] More test cases to cover the new API - [x] Python support - [ ] Type alias SchemaRDD Author: Reynold Xin <rxin@databricks.com> Author: Davies Liu <davies@databricks.com> Closes #4173 from rxin/df1 and squashes the following commits: 0a1a73b [Reynold Xin] Merge branch 'df1' of github.com:rxin/spark into df1 23b4427 [Reynold Xin] Mima. 828f70d [Reynold Xin] Merge pull request #7 from davies/df 257b9e6 [Davies Liu] add repartition 6bf2b73 [Davies Liu] fix collect with UDT and tests e971078 [Reynold Xin] Missing quotes. b9306b4 [Reynold Xin] Remove removeColumn/updateColumn for now. a728bf2 [Reynold Xin] Example rename. e8aa3d3 [Reynold Xin] groupby -> groupBy. 9662c9e [Davies Liu] improve DataFrame Python API 4ae51ea [Davies Liu] python API for dataframe 1e5e454 [Reynold Xin] Fixed a bug with symbol conversion. 2ca74db [Reynold Xin] Couple minor fixes. ea98ea1 [Reynold Xin] Documentation & literal expressions. 2b22684 [Reynold Xin] Got rid of IntelliJ problems. 02bbfbc [Reynold Xin] Tightening imports. ffbce66 [Reynold Xin] Fixed compilation error. 59b6d8b [Reynold Xin] Style violation. b85edfb [Reynold Xin] ALS. 8c37f0a [Reynold Xin] Made MLlib and examples compile 6d53134 [Reynold Xin] Hive module. d35efd5 [Reynold Xin] Fixed compilation error. ce4a5d2 [Reynold Xin] Fixed test cases in SQL except ParquetIOSuite. 66d5ef1 [Reynold Xin] SQLContext minor patch. c9bcdc0 [Reynold Xin] Checkpoint: SQL module compiles!	2015-01-27 16:08:24 -08:00
Sandy Ryza	b1b35ca2e4	SPARK-5199. FS read metrics should support CombineFileSplits and track bytes from all FSs ...mbineFileSplits Author: Sandy Ryza <sandy@cloudera.com> Closes #4050 from sryza/sandy-spark-5199 and squashes the following commits: 864514b [Sandy Ryza] Add tests and fix bug 0d504f1 [Sandy Ryza] Prettify 915c7e6 [Sandy Ryza] Get metrics from all filesystems cdbc3e8 [Sandy Ryza] SPARK-5199. Input metrics should show up for InputFormats that return CombineFileSplits	2015-01-27 15:42:55 -08:00
Davies Liu	fdaad4eb03	[MLlib] fix python example of ALS in guide fix python example of ALS in guide, use Rating instead of np.array. Author: Davies Liu <davies@databricks.com> Closes #4226 from davies/fix_als_guide and squashes the following commits: 1433d76 [Davies Liu] fix python example of als in guide	2015-01-27 15:33:01 -08:00
Sean Owen	ff356e2a21	SPARK-5308 [BUILD] MD5 / SHA1 hash format doesn't match standard Maven output Here's one way to make the hashes match what Maven's plugins would create. It takes a little extra footwork since OS X doesn't have the same command line tools. An alternative is just to make Maven output these of course - would that be better? I ask in case there is a reason I'm missing, like, we need to hash files that Maven doesn't build. Author: Sean Owen <sowen@cloudera.com> Closes #4161 from srowen/SPARK-5308 and squashes the following commits: 70d09d0 [Sean Owen] Use $(...) syntax e25eff8 [Sean Owen] Generate MD5, SHA1 hashes in a format like Maven's plugin	2015-01-27 10:22:50 -08:00
Burak Yavuz	914267484a	[SPARK-5321] Support for transposing local matrices Support for transposing local matrices added. The `.transpose` function creates a new object re-using the backing array(s) but switches `numRows` and `numCols`. Operations check the flag `.isTransposed` to see whether the indexing in `values` should be modified. This PR will pave the way for transposing `BlockMatrix`. Author: Burak Yavuz <brkyvz@gmail.com> Closes #4109 from brkyvz/SPARK-5321 and squashes the following commits: 87ab83c [Burak Yavuz] fixed scalastyle caf4438 [Burak Yavuz] addressed code review v3 c524770 [Burak Yavuz] address code review comments 2 77481e8 [Burak Yavuz] fixed MiMa f1c1742 [Burak Yavuz] small refactoring ccccdec [Burak Yavuz] fixed failed test dd45c88 [Burak Yavuz] addressed code review a01bd5f [Burak Yavuz] [SPARK-5321] Fixed MiMa issues 2a63593 [Burak Yavuz] [SPARK-5321] fixed bug causing failed gemm test c55f29a [Burak Yavuz] [SPARK-5321] Support for transposing local matrices cleaned up c408c05 [Burak Yavuz] [SPARK-5321] Support for transposing local matrices added	2015-01-27 01:46:17 -08:00
Liang-Chi Hsieh	7b0ed79795	[SPARK-5419][Mllib] Fix the logic in Vectors.sqdist The current implementation in Vectors.sqdist is not efficient because of allocating temp arrays. There is also a bug in the code `v1.indices.length / v1.size < 0.5`. This pr fixes the bug and refactors sqdist without allocating new arrays. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #4217 from viirya/fix_sqdist and squashes the following commits: e8b0b3d [Liang-Chi Hsieh] For review comments. 314c424 [Liang-Chi Hsieh] Fix sqdist bug.	2015-01-27 01:29:14 -08:00
MechCoder	d6894b1c53	[SPARK-3726] [MLlib] Allow sampling_rate not equal to 1.0 in RandomForests I've added support for sampling_rate not equal to 1.0 . I have two major questions. 1. A Scala style test is failing, since the number of parameters now exceed 10. 2. I would like suggestions to understand how to test this. Author: MechCoder <manojkumarsivaraj334@gmail.com> Closes #4073 from MechCoder/spark-3726 and squashes the following commits: 8012fb2 [MechCoder] Add test in Strategy e0e0d9c [MechCoder] TST: Add better test d1df1b2 [MechCoder] Add test to verify subsampling behavior a7bfc70 [MechCoder] [SPARK-3726] Allow sampling_rate not equal to 1.0	2015-01-26 19:46:17 -08:00
lewuathe	f2ba5c6fc3	[SPARK-5119] java.lang.ArrayIndexOutOfBoundsException on trying to train... ... decision tree model Labels loaded from libsvm files are mapped to 0.0 if they are negative labels because they should be nonnegative value. Author: lewuathe <lewuathe@me.com> Closes #3975 from Lewuathe/map-negative-label-to-positive and squashes the following commits: 12d1d59 [lewuathe] [SPARK-5119] Fix code styles 6d9a18a [lewuathe] [SPARK-5119] Organize test codes 62a150c [lewuathe] [SPARK-5119] Modify Impurities throw exceptions with negatie labels 3336c21 [lewuathe] [SPARK-5119] java.lang.ArrayIndexOutOfBoundsException on trying to train decision tree model	2015-01-26 18:03:21 -08:00
Elmer Garduno	661e0fca5d	[SPARK-5052] Add common/base classes to fix guava methods signatures. Fixes problems with incorrect method signatures related to shaded classes. For discussion see the jira issue. Author: Elmer Garduno <elmerg@google.com> Closes #3874 from elmer-garduno/fix_guava_signatures and squashes the following commits: aa5d8e0 [Elmer Garduno] Unshade common/base[Function\|Supplier] classes to fix guava methods signatures.	2015-01-26 17:40:48 -08:00
Sean Owen	0497ea51ac	SPARK-960 [CORE] [TEST] JobCancellationSuite "two jobs sharing the same stage" is broken This reenables and fixes this test, after addressing two issues: - The Semaphore that was intended to be shared locally was being serialized and copied; it's now a static member in the companion object as in other tests - Later changes to Spark means that cancelling the first task will not cancel the shared stage and therefore the second task should succeed Author: Sean Owen <sowen@cloudera.com> Closes #4180 from srowen/SPARK-960 and squashes the following commits: 43da66f [Sean Owen] Fix 'two jobs sharing the same stage' test and reenable it: truly share a Semaphore locally as intended, and update expectation of failure in non-cancelled task	2015-01-26 14:32:27 -08:00
David Y. Ross	b38034e878	Fix command spaces issue in make-distribution.sh Storing command in variables is tricky in bash, use an array to handle all issues with spaces, quoting, etc. See: http://mywiki.wooledge.org/BashFAQ/050 Author: David Y. Ross <dyross@gmail.com> Closes #4126 from dyross/dyr-fix-make-distribution and squashes the following commits: 4ce522b [David Y. Ross] Fix command spaces issue in make-distribution.sh	2015-01-26 14:26:10 -08:00
Sean Owen	54e7b456dd	SPARK-4147 [CORE] Reduce log4j dependency Defer use of log4j class until it's known that log4j 1.2 is being used. This may avoid dealing with log4j dependencies for callers that reroute slf4j to another logging framework. The only change is to push one half of the check in the original `if` condition inside. This is a trivial change, may or may not actually solve a problem, but I think it's all that makes sense to do for SPARK-4147. Author: Sean Owen <sowen@cloudera.com> Closes #4190 from srowen/SPARK-4147 and squashes the following commits: 4e99942 [Sean Owen] Defer use of log4j class until it's known that log4j 1.2 is being used. This may avoid dealing with log4j dependencies for callers that reroute slf4j to another logging framework.	2015-01-26 14:23:42 -08:00

9451 commits