ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Wenchen Fan	1974d1d34d	[SPARK-12719][SQL] SQL generation support for Generate ## What changes were proposed in this pull request? This PR adds SQL generation support for `Generate` operator. It always converts `Generate` operator into `LATERAL VIEW` format as there are many limitations to put UDTF in project list. This PR is based on https://github.com/apache/spark/pull/11658, please see the last commit to review the real changes. Thanks dilipbiswal for his initial work! Takes over https://github.com/apache/spark/pull/11596 ## How was this patch tested? new tests in `LogicalPlanToSQLSuite` Author: Wenchen Fan <wenchen@databricks.com> Closes #11696 from cloud-fan/generate.	2016-03-17 20:25:05 +08:00
Wenchen Fan	8ef3399aff	[SPARK-13928] Move org.apache.spark.Logging into org.apache.spark.internal.Logging ## What changes were proposed in this pull request? Logging was made private in Spark 2.0. If we move it, then users would be able to create a Logging trait themselves to avoid changing their own code. ## How was this patch tested? existing tests. Author: Wenchen Fan <wenchen@databricks.com> Closes #11764 from cloud-fan/logger.	2016-03-17 19:23:38 +08:00
trueyao	ea9ca6f04c	[SPARK-13901][CORE] correct the logDebug information when jump to the next locality level JIRA Issue:https://issues.apache.org/jira/browse/SPARK-13901 In getAllowedLocalityLevel method of TaskSetManager,we get wrong logDebug information when jump to the next locality level.So we should fix it. Author: trueyao <501663994@qq.com> Closes #11719 from trueyao/logDebug-localityWait.	2016-03-17 09:45:06 +00:00
Yuhao Yang	357d82d84d	[SPARK-13629][ML] Add binary toggle Param to CountVectorizer ## What changes were proposed in this pull request? It would be handy to add a binary toggle Param to CountVectorizer, as in the scikit-learn one: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html If set, then all non-zero counts will be set to 1. ## How was this patch tested? unit tests Author: Yuhao Yang <hhbyyh@gmail.com> Closes #11536 from hhbyyh/cvToggle.	2016-03-17 11:21:11 +02:00
Zheng RuiFeng	204c9dec2c	[MINOR][DOC] Add JavaStreamingTestExample ## What changes were proposed in this pull request? Add the java example of StreamingTest ## How was this patch tested? manual tests in CLI: bin/run-example mllib.JavaStreamingTestExample dataDir 5 100 Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #11776 from zhengruifeng/streaming_je.	2016-03-17 11:09:02 +02:00
Davies Liu	30c18841e4	Revert "[SPARK-13840][SQL] Split Optimizer Rule ColumnPruning to ColumnPruning and EliminateOperator" This reverts commit `99bd2f0e94`.	2016-03-16 23:11:13 -07:00
Josh Rosen	82066a1667	[SPARK-13948] MiMa check should catch if the visibility changes to private MiMa excludes are currently generated using both the current Spark version's classes and Spark 1.2.0's classes, but this doesn't make sense: we should only be ignoring classes which were `private` in the previous Spark version, not classes which became private in the current version. This patch updates `dev/mima` to only generate excludes with respect to the previous artifacts that MiMa checks against. It also updates `MimaBuild` so that `excludeClass` only applies directly to the class being excluded and not to its companion object (since a class and its companion object can have different accessibility). Author: Josh Rosen <joshrosen@databricks.com> Closes #11774 from JoshRosen/SPARK-13948.	2016-03-16 23:02:25 -07:00
Ryan Blue	5faba9facc	[SPARK-13403][SQL] Pass hadoopConfiguration to HiveConf constructors. This commit updates the HiveContext so that sc.hadoopConfiguration is used to instantiate its internal instances of HiveConf. I tested this by overriding the S3 FileSystem implementation from spark-defaults.conf as "spark.hadoop.fs.s3.impl" (to avoid [HADOOP-12810](https://issues.apache.org/jira/browse/HADOOP-12810)). Author: Ryan Blue <blue@apache.org> Closes #11273 from rdblue/SPARK-13403-new-hive-conf-from-hadoop-conf.	2016-03-16 22:57:06 -07:00
Josh Rosen	de1a84e56e	[SPARK-13926] Automatically use Kryo serializer when shuffling RDDs with simple types Because ClassTags are available when constructing ShuffledRDD we can use them to automatically use Kryo for shuffle serialization when the RDD's types are known to be compatible with Kryo. This patch introduces `SerializerManager`, a component which picks the "best" serializer for a shuffle given the elements' ClassTags. It will automatically pick a Kryo serializer for ShuffledRDDs whose key, value, and/or combiner types are primitives, arrays of primitives, or strings. In the future we can use this class as a narrow extension point to integrate specialized serializers for other types, such as ByteBuffers. In a planned followup patch, I will extend the BlockManager APIs so that we're able to use similar automatic serializer selection when caching RDDs (this is a little trickier because the ClassTags need to be threaded through many more places). Author: Josh Rosen <joshrosen@databricks.com> Closes #11755 from JoshRosen/automatically-pick-best-serializer.	2016-03-16 22:52:55 -07:00
Daoyuan Wang	d1c193a2f1	[SPARK-12855][MINOR][SQL][DOC][TEST] remove spark.sql.dialect from doc and test ## What changes were proposed in this pull request? Since developer API of plug-able parser has been removed in #10801 , docs should be updated accordingly. ## How was this patch tested? This patch will not affect the real code path. Author: Daoyuan Wang <daoyuan.wang@intel.com> Closes #11758 from adrian-wang/spark12855.	2016-03-16 22:52:10 -07:00
Dongjoon Hyun	c890c359b1	[MINOR][SQL][BUILD] Remove duplicated lines ## What changes were proposed in this pull request? This PR removes three minor duplicated lines. First one is making the following unreachable code warning. ``` JoinSuite.scala:52: unreachable code [warn] case j: BroadcastHashJoin => j ``` The other two are just consecutive repetitions in `Seq` of MiMa filters. ## How was this patch tested? Pass the existing Jenkins test. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #11773 from dongjoon-hyun/remove_duplicated_line.	2016-03-16 22:48:58 -07:00
Jakob Odersky	7eef2463ad	[SPARK-13118][SQL] Expression encoding for optional synthetic classes ## What changes were proposed in this pull request? Fix expression generation for optional types. Standard Java reflection causes issues when dealing with synthetic Scala objects (things that do not map to Java and thus contain a dollar sign in their name). This patch introduces Scala reflection in such cases. This patch also adds a regression test for Dataset's handling of classes defined in package objects (which was the initial purpose of this PR). ## How was this patch tested? A new test in ExpressionEncoderSuite that tests optional inner classes and a regression test for Dataset's handling of package objects. Author: Jakob Odersky <jakob@odersky.com> Closes #11708 from jodersky/SPARK-13118-package-objects.	2016-03-16 21:53:16 -07:00
Davies Liu	c100d31ddc	[SPARK-13873] [SQL] Avoid copy of UnsafeRow when there is no join in whole stage codegen ## What changes were proposed in this pull request? We need to copy the UnsafeRow since a Join could produce multiple rows from single input rows. We could avoid that if there is no join (or the join will not produce multiple rows) inside WholeStageCodegen. Updated the benchmark for `collect`, we could see 20-30% speedup. ## How was this patch tested? existing unit tests. Author: Davies Liu <davies@databricks.com> Closes #11740 from davies/avoid_copy2.	2016-03-16 21:46:04 -07:00
hyukjinkwon	917f4000b4	[SPARK-13719][SQL] Parse JSON rows having an array type and a struct type in the same fieild ## What changes were proposed in this pull request? This https://github.com/apache/spark/pull/2400 added the support to parse JSON rows wrapped with an array. However, this throws an exception when the given data contains array data and struct data in the same field as below: ```json {"a": {"b": 1}} {"a": []} ``` and the schema is given as below: ```scala val schema = StructType( StructField("a", StructType( StructField("b", StringType) :: Nil )) :: Nil) ``` - Before ```scala sqlContext.read.schema(schema).json(path).show() ``` ```scala Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 in stage 0.0 failed 4 times, most recent failure: Lost task 7.3 in stage 0.0 (TID 10, 192.168.1.170): java.lang.ClassCastException: org.apache.spark.sql.types.GenericArrayData cannot be cast to org.apache.spark.sql.catalyst.InternalRow at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getStruct(rows.scala:50) at org.apache.spark.sql.catalyst.expressions.GenericMutableRow.getStruct(rows.scala:247) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown Source) ... ``` - After ```scala sqlContext.read.schema(schema).json(path).show() ``` ```bash +----+ \| a\| +----+ \| [1]\| \|null\| +----+ ``` For other data types, in this case it converts the given values are `null` but only this case emits an exception. This PR makes the support for wrapped rows applied only at the top level. ## How was this patch tested? Unit tests were used and `./dev/run_tests` for code style tests. Author: hyukjinkwon <gurwls223@gmail.com> Closes #11752 from HyukjinKwon/SPARK-3308-follow-up.	2016-03-16 18:20:30 -07:00
Andrew Or	ca9ef86c84	[SPARK-13923][SQL] Implement SessionCatalog ## What changes were proposed in this pull request? As part of the effort to merge `SQLContext` and `HiveContext`, this patch implements an internal catalog called `SessionCatalog` that handles temporary functions and tables and delegates metastore operations to `ExternalCatalog`. Currently, this is still dead code, but in the future it will be part of `SessionState` and will replace `o.a.s.sql.catalyst.analysis.Catalog`. A recent patch #11573 parses Hive commands ourselves in Spark, but still passes the entire query text to Hive. In a future patch, we will use `SessionCatalog` to implement the parsed commands. ## How was this patch tested? 800+ lines of tests in `SessionCatalogSuite`. Author: Andrew Or <andrew@databricks.com> Closes #11750 from andrewor14/temp-catalog.	2016-03-16 18:02:43 -07:00
Yuhao Yang	92b70576ea	[SPARK-13761][ML] Deprecate validateParams ## What changes were proposed in this pull request? Deprecate validateParams() method here: `035d3acdf3/mllib/src/main/scala/org/apache/spark/ml/param/params.scala (L553)` Move all functionality in overridden methods to transformSchema(). Check docs to make sure they indicate complex Param interaction checks should be done in transformSchema. ## How was this patch tested? unit tests Author: Yuhao Yang <hhbyyh@gmail.com> Closes #11620 from hhbyyh/depreValid.	2016-03-16 17:31:55 -07:00
Jakob Odersky	d4d84936fb	[SPARK-11011][SQL] Narrow type of UDT serialization ## What changes were proposed in this pull request? Narrow down the parameter type of `UserDefinedType#serialize()`. Currently, the parameter type is `Any`, however it would logically make more sense to narrow it down to the type of the actual user defined type. ## How was this patch tested? Existing tests were successfully run on local machine. Author: Jakob Odersky <jakob@odersky.com> Closes #11379 from jodersky/SPARK-11011-udt-types.	2016-03-16 16:59:36 -07:00
Sameer Agarwal	77ba3021c1	[SPARK-13869][SQL] Remove redundant conditions while combining filters ## What changes were proposed in this pull request? [I'll link it to the JIRA once ASF JIRA is back online] This PR modifies the existing `CombineFilters` rule to remove redundant conditions while combining individual filter predicates. For instance, queries of the form `table.where('a === 1 && 'b === 1).where('a === 1 && 'c === 1)` will now be optimized to ` table.where('a === 1 && 'b === 1 && 'c === 1)` (instead of ` table.where('a === 1 && 'a === 1 && 'b === 1 && 'c === 1)`) ## How was this patch tested? Unit test in `FilterPushdownSuite` Author: Sameer Agarwal <sameer@databricks.com> Closes #11670 from sameeragarwal/combine-filters.	2016-03-16 16:27:46 -07:00
Sameer Agarwal	f96997ba24	[SPARK-13871][SQL] Support for inferring filters from data constraints ## What changes were proposed in this pull request? This PR generalizes the `NullFiltering` optimizer rule in catalyst to `InferFiltersFromConstraints` that can automatically infer all relevant filters based on an operator's constraints while making sure of 2 things: (a) no redundant filters are generated, and (b) filters that do not contribute to any further optimizations are not generated. ## How was this patch tested? Extended all tests in `InferFiltersFromConstraintsSuite` (that were initially based on `NullFilteringSuite` to test filter inference in `Filter` and `Join` operators. In particular the 2 tests ( `single inner join with pre-existing filters: filter out values on either side` and `multiple inner joins: filter out values on all sides on equi-join keys` attempts to highlight/test the real potential of this rule for join optimization. Author: Sameer Agarwal <sameer@databricks.com> Closes #11665 from sameeragarwal/infer-filters.	2016-03-16 16:26:51 -07:00
Sameer Agarwal	b90c0206fa	[SPARK-13922][SQL] Filter rows with null attributes in vectorized parquet reader # What changes were proposed in this pull request? It's common for many SQL operators to not care about reading `null` values for correctness. Currently, this is achieved by performing `isNotNull` checks (for all relevant columns) on a per-row basis. Pushing these null filters in the vectorized parquet reader should bring considerable benefits (especially for cases when the underlying data doesn't contain any nulls or contains all nulls). ## How was this patch tested? Intel(R) Core(TM) i7-4960HQ CPU 2.60GHz String with Nulls Scan (0%): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------- SQL Parquet Vectorized 1229 / 1648 8.5 117.2 1.0X PR Vectorized 833 / 846 12.6 79.4 1.5X PR Vectorized (Null Filtering) 732 / 782 14.3 69.8 1.7X Intel(R) Core(TM) i7-4960HQ CPU 2.60GHz String with Nulls Scan (50%): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------- SQL Parquet Vectorized 995 / 1053 10.5 94.9 1.0X PR Vectorized 732 / 772 14.3 69.8 1.4X PR Vectorized (Null Filtering) 725 / 790 14.5 69.1 1.4X Intel(R) Core(TM) i7-4960HQ CPU 2.60GHz String with Nulls Scan (95%): Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------- SQL Parquet Vectorized 326 / 333 32.2 31.1 1.0X PR Vectorized 190 / 200 55.1 18.2 1.7X PR Vectorized (Null Filtering) 168 / 172 62.2 16.1 1.9X Author: Sameer Agarwal <sameer@databricks.com> Closes #11749 from sameeragarwal/perf-testing.	2016-03-16 16:25:40 -07:00
Dongjoon Hyun	4ce2d24e2a	[SPARK-13942][CORE][DOCS] Remove Shark-related docs for 2.x ## What changes were proposed in this pull request? `Shark` was merged into `Spark SQL` since [July 2014](https://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html). The followings seem to be the only legacy. For Spark 2.x, we had better clean up those docs. Migration Guide ``` - ## Migration Guide for Shark Users - ... - ### Scheduling - ... - ### Reducer number - ... - ### Caching ``` ## How was this patch tested? Pass the Jenkins test. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #11770 from dongjoon-hyun/SPARK-13942.	2016-03-16 15:50:24 -07:00
GayathriMurali	27e1f38851	[SPARK-13034] PySpark ml.classification support export/import ## What changes were proposed in this pull request? Add export/import for all estimators and transformers(which have Scala implementation) under pyspark/ml/classification.py. ## How was this patch tested? ./python/run-tests ./dev/lint-python Unit tests added to check persistence in Logistic Regression Author: GayathriMurali <gayathri.m.softie@gmail.com> Closes #11707 from GayathriMurali/SPARK-13034.	2016-03-16 14:21:42 -07:00
Xiangrui Meng	85c42fda99	[SPARK-13927][MLLIB] add row/column iterator to local matrices ## What changes were proposed in this pull request? Add row/column iterator to local matrices to simplify tasks like BlockMatrix => RowMatrix conversion. It handles dense and sparse matrices properly. ## How was this patch tested? Unit tests on sparse and dense matrix. cc: dbtsai Author: Xiangrui Meng <meng@databricks.com> Closes #11757 from mengxr/SPARK-13927.	2016-03-16 14:19:54 -07:00
Joseph K. Bradley	6fc2b6541f	[SPARK-11888][ML] Decision tree persistence in spark.ml ### What changes were proposed in this pull request? Made these MLReadable and MLWritable: DecisionTreeClassifier, DecisionTreeClassificationModel, DecisionTreeRegressor, DecisionTreeRegressionModel * The shared implementation is in treeModels.scala * I use case classes to create a DataFrame to save, and I use the Dataset API to parse loaded files. Other changes: * Made CategoricalSplit.numCategories public (to use in persistence) * Fixed a bug in DefaultReadWriteTest.testEstimatorAndModelReadWrite, where it did not call the checkModelData function passed as an argument. This caused an error in LDASuite, which I fixed. ### How was this patch tested? Persistence is tested via unit tests. For each algorithm, there are 2 non-trivial trees (depth 2). One is built with continuous features, and one with categorical; this ensures that both types of splits are tested. Author: Joseph K. Bradley <joseph@databricks.com> Closes #11581 from jkbradley/dt-io.	2016-03-16 14:18:35 -07:00
Yanbo Liang	3f06eb72ca	[SPARK-13613][ML] Provide ignored tests to export test dataset into CSV format ## What changes were proposed in this pull request? Provide ignored test cases to export the test dataset into CSV format in ```LinearRegressionSuite```, ```LogisticRegressionSuite```, ```AFTSurvivalRegressionSuite``` and ```GeneralizedLinearRegressionSuite```, so users can validate the training accuracy compared with R's glm, glmnet and survival package. cc mengxr ## How was this patch tested? The test suite is ignored, but I have enabled all these cases offline and it works as expected. Author: Yanbo Liang <ybliang8@gmail.com> Closes #11463 from yanboliang/spark-13613.	2016-03-16 14:14:15 -07:00
Xusen Yin	ae6c677c8a	[SPARK-13038][PYSPARK] Add load/save to pipeline ## What changes were proposed in this pull request? JIRA issue: https://issues.apache.org/jira/browse/SPARK-13038 1. Add load/save to PySpark Pipeline and PipelineModel 2. Add `_transfer_stage_to_java()` and `_transfer_stage_from_java()` for `JavaWrapper`. ## How was this patch tested? Test with doctest. Author: Xusen Yin <yinxusen@gmail.com> Closes #11683 from yinxusen/SPARK-13038-only.	2016-03-16 13:49:40 -07:00
gatorsmile	c4bd57602c	[SPARK-12721][SQL] SQL Generation for Script Transformation #### What changes were proposed in this pull request? This PR is to convert to SQL from analyzed logical plans containing operator `ScriptTransformation`. For example, below is the SQL containing `Transform` ``` SELECT TRANSFORM (a, b, c, d) USING 'cat' FROM parquet_t2 ``` Its logical plan is like ``` ScriptTransformation [a#210L,b#211L,c#212L,d#213L], cat, [key#208,value#209], HiveScriptIOSchema(List(),List(),Some(org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe),Some(org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe),List((field.delim, )),List((field.delim, )),Some(org.apache.hadoop.hive.ql.exec.TextRecordReader),Some(org.apache.hadoop.hive.ql.exec.TextRecordWriter),true) +- SubqueryAlias parquet_t2 +- Relation[a#210L,b#211L,c#212L,d#213L] ParquetRelation ``` The generated SQL will be like ``` SELECT TRANSFORM (`parquet_t2`.`a`, `parquet_t2`.`b`, `parquet_t2`.`c`, `parquet_t2`.`d`) USING 'cat' AS (`key` string, `value` string) FROM `default`.`parquet_t2` ``` #### How was this patch tested? Seven test cases are added to `LogicalPlanToSQLSuite`. Author: gatorsmile <gatorsmile@gmail.com> Author: xiaoli <lixiao1983@gmail.com> Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local> Closes #11503 from gatorsmile/transformToSQL.	2016-03-16 13:11:11 -07:00
Wenchen Fan	1d1de28a3c	[SPARK-13827][SQL] Can't add subquery to an operator with same-name outputs while generate SQL string ## What changes were proposed in this pull request? This PR tries to solve a fundamental issue in the `SQLBuilder`. When we want to turn a logical plan into SQL string and put it after FROM clause, we need to wrap it with a sub-query. However, a logical plan is allowed to have same-name outputs with different qualifiers(e.g. the `Join` operator), and this kind of plan can't be put under a subquery as we will erase and assign a new qualifier to all outputs and make it impossible to distinguish same-name outputs. To solve this problem, this PR renames all attributes with globally unique names(using exprId), so that we don't need qualifiers to resolve ambiguity anymore. For example, `SELECT x.key, MAX(y.key) OVER () FROM t x JOIN t y`, we will parse this SQL to a Window operator and a Project operator, and add a sub-query between them. The generated SQL looks like: ``` SELECT sq_1.key, sq_1.max FROM ( SELECT sq_0.key, sq_0.key, MAX(sq_0.key) OVER () AS max FROM ( SELECT x.key, y.key FROM t1 AS x JOIN t2 AS y ) AS sq_0 ) AS sq_1 ``` You can see, the `key` columns become ambiguous after `sq_0`. After this PR, it will generate something like: ``` SELECT attr_30 AS key, attr_37 AS max FROM ( SELECT attr_30, attr_37 FROM ( SELECT attr_30, attr_35, MAX(attr_35) AS attr_37 FROM ( SELECT attr_30, attr_35 FROM (SELECT key AS attr_30 FROM t1) AS sq_0 INNER JOIN (SELECT key AS attr_35 FROM t1) AS sq_1 ) AS sq_2 ) AS sq_3 ) AS sq_4 ``` The outermost SELECT is used to turn the generated named to real names back, and the innermost SELECT is used to alias real columns to our generated names. Between them, there is no name ambiguity anymore. ## How was this patch tested? existing tests and new tests in LogicalPlanToSQLSuite. Author: Wenchen Fan <wenchen@databricks.com> Closes #11658 from cloud-fan/gensql.	2016-03-16 11:57:28 -07:00
Zheng RuiFeng	91984978e7	[SPARK-13816][GRAPHX] Add parameter checks for algorithms in Graphx JIRA: https://issues.apache.org/jira/browse/SPARK-13816 ## What changes were proposed in this pull request? Add parameter checks for algorithms in Graphx: Pregel,LabelPropagation,PageRank,SVDPlusPlus ## How was this patch tested? manual tests Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #11655 from zhengruifeng/graphx_param_check.	2016-03-16 11:52:25 -07:00
Cheng Hao	d9670f8473	[SPARK-13894][SQL] SqlContext.range return type from DataFrame to DataSet ## What changes were proposed in this pull request? https://issues.apache.org/jira/browse/SPARK-13894 Change the return type of the `SQLContext.range` API from `DataFrame` to `Dataset`. ## How was this patch tested? No additional unit test required. Author: Cheng Hao <hao.cheng@intel.com> Closes #11730 from chenghao-intel/range.	2016-03-16 11:20:15 -07:00
Wenchen Fan	d9e8f26d03	[SPARK-13924][SQL] officially support multi-insert ## What changes were proposed in this pull request? There is a feature of hive SQL called multi-insert. For example: ``` FROM src INSERT OVERWRITE TABLE dest1 SELECT key + 1 INSERT OVERWRITE TABLE dest2 SELECT key WHERE key > 2 INSERT OVERWRITE TABLE dest3 SELECT col EXPLODE(arr) exp AS col ... ``` We partially support it currently, with some limitations: 1) WHERE can't reference columns produced by LATERAL VIEW. 2) It's not executed eagerly, i.e. `sql("...multi-insert clause...")` won't take place right away like other commands, e.g. CREATE TABLE. This PR removes these limitations and make us fully support multi-insert. ## How was this patch tested? new tests in `SQLQuerySuite` Author: Wenchen Fan <wenchen@databricks.com> Closes #11754 from cloud-fan/lateral-view.	2016-03-16 10:52:36 -07:00
Jeff Zhang	eacd9d8eda	[SPARK-13360][PYSPARK][YARN] PYSPARK_DRIVER_PYTHON and PYSPARK_PYTHON… … is not picked up in yarn-cluster mode Author: Jeff Zhang <zjffdu@apache.org> Closes #11238 from zjffdu/SPARK-13360.	2016-03-16 10:21:01 -07:00
Wesley Tang	5f6bdf97c5	[SPARK-13281][CORE] Switch broadcast of RDD to exception from warning ## What changes were proposed in this pull request? In SparkContext, throw Illegalargumentexception when trying to broadcast rdd directly, instead of logging the warning. ## How was this patch tested? mvn clean install Add UT in BroadcastSuite Author: Wesley Tang <tangmingjun@mininglamp.com> Closes #11735 from breakdawn/master.	2016-03-16 16:12:17 +00:00
Sean Owen	9412547e7a	[SPARK-13823][HOTFIX] Increase tryAcquire timeout and assert it succeeds to fix failure on slow machines ## What changes were proposed in this pull request? I'm seeing several PR builder builds fail after https://github.com/apache/spark/pull/11725/files. Example: https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.4/lastFailedBuild/console ``` testCommunication(org.apache.spark.launcher.LauncherServerSuite) Time elapsed: 0.023 sec <<< FAILURE! java.lang.AssertionError: expected:<app-id> but was:<null> at org.apache.spark.launcher.LauncherServerSuite.testCommunication(LauncherServerSuite.java:93) ``` However, other builds pass this same test, including the test when run locally and on the Jenkins PR builder. The failure itself concerns a change to how the test waits on a condition, and the wait can time out; therefore I think this is due to fast/slow machine differences. This is an attempt at a hot fix; it's a little hard to verify since locally and on the PR builder, it passes anyway. The change itself should be harmless anyway. Why didn't this happen before, if the new logic was supposed to be equivalent to the old? I think this is the sequence: - First attempt to acquire semaphore for 10ms actually silently times out - The changed being waited for happens just after that, a bit too late - Assertion passes since condition became true just in time - `release()` fires from the listener - Next `tryAcquire` however immediately succeeds because the first `tryAcquire` didn't acquire anything, but its subsequent condition is not yet true; this would explain why the second one always fails Versus the original using `notifyAll()`, there's a small difference: `wait()`-ing after `notifyAll()` just results in another wait; it doesn't make it return immediately. So this was a tiny latent issue that was masked by the semantics. Now the test asserts that the event actually happened (semaphore was acquired). (The timeout is still here to prevent the test from hanging forever, and to detect really slow response.) The timeout is increased to a second to allow plenty of time anyway. ## How was this patch tested? Jenkins tests Author: Sean Owen <sowen@cloudera.com> Closes #11763 from srowen/SPARK-13823.3.	2016-03-16 16:11:24 +00:00
Carson Wang	496d2a2b40	[SPARK-13889][YARN] Fix integer overflow when calculating the max number of executor failure ## What changes were proposed in this pull request? The max number of executor failure before failing the application is default to twice the maximum number of executors if dynamic allocation is enabled. The default value for "spark.dynamicAllocation.maxExecutors" is Int.MaxValue. So this causes an integer overflow and a wrong result. The calculated value of the default max number of executor failure is 3. This PR adds a check to avoid the overflow. ## How was this patch tested? It tests if the value is greater that Int.MaxValue / 2 to avoid the overflow when it multiplies 2. Author: Carson Wang <carson.wang@intel.com> Closes #11713 from carsonwang/IntOverflow.	2016-03-16 10:56:01 +00:00
Tejas Patil	1d95fb6785	[SPARK-13793][CORE] PipedRDD doesn't propagate exceptions while reading parent RDD ## What changes were proposed in this pull request? PipedRDD creates a child thread to read output of the parent stage and feed it to the pipe process. Used a variable to save the exception thrown in the child thread and then propagating the exception in the main thread if the variable was set. ## How was this patch tested? - Added a unit test - Ran all the existing tests in PipedRDDSuite and they all pass with the change - Tested the patch with a real pipe() job, bounced the executor node which ran the parent stage to simulate a fetch failure and observed that the parent stage was re-ran. Author: Tejas Patil <tejasp@fb.com> Closes #11628 from tejasapatil/pipe_rdd.	2016-03-16 09:58:53 +00:00
GayathriMurali	56d88247f1	[SPARK-13396] Stop using our internal deprecated .metrics on Exceptio… JIRA: https://issues.apache.org/jira/browse/SPARK-13396 Stop using our internal deprecated .metrics on ExceptionFailure instead use accumUpdates Author: GayathriMurali <gayathri.m.softie@gmail.com> Closes #11544 from GayathriMurali/SPARK-13396.	2016-03-16 09:39:41 +00:00
Sean Owen	3b461d9ecd	[SPARK-13823][SPARK-13397][SPARK-13395][CORE] More warnings, StandardCharset follow up ## What changes were proposed in this pull request? Follow up to https://github.com/apache/spark/pull/11657 - Also update `String.getBytes("UTF-8")` to use `StandardCharsets.UTF_8` - And fix one last new Coverity warning that turned up (use of unguarded `wait()` replaced by simpler/more robust `java.util.concurrent` classes in tests) - And while we're here cleaning up Coverity warnings, just fix about 15 more build warnings ## How was this patch tested? Jenkins tests Author: Sean Owen <sowen@cloudera.com> Closes #11725 from srowen/SPARK-13823.2.	2016-03-16 09:36:34 +00:00
Yonathan Randolph	05ab2948ab	[SPARK-13906] Ensure that there are at least 2 dispatcher threads. ## What changes were proposed in this pull request? Force at least two dispatcher-event-loop threads. Since SparkDeploySchedulerBackend (in AppClient) calls askWithRetry to CoarseGrainedScheduler in the same process, there the driver needs at least two dispatcher threads to prevent the dispatcher thread from hanging. ## How was this patch tested? Manual. Author: Yonathan Randolph <yonathangmail.com> Author: Yonathan Randolph <yonathan@liftigniter.com> Closes #11728 from yonran/SPARK-13906.	2016-03-16 09:34:04 +00:00
Dongjoon Hyun	431a3d04b4	[SPARK-12653][SQL] Re-enable test "SPARK-8489: MissingRequirementError during reflection" ## What changes were proposed in this pull request? The purpose of [SPARK-12653](https://issues.apache.org/jira/browse/SPARK-12653) is re-enabling a regression test. Historically, the target regression test is added by [SPARK-8498](`093c34838d`), but is temporarily disabled by [SPARK-12615](`8ce645d4ee`) due to binary compatibility error. The following is the current error message at the submitting spark job with the pre-built `test.jar` file in the target regression test. ``` Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.SparkContext$.$lessinit$greater$default$6()Lscala/collection/Map; ``` Simple rebuilding `test.jar` can not recover the purpose of testcase since we need to support both Scala 2.10 and 2.11 for a while. For example, we will face the following Scala 2.11 error if we use `test.jar` built by Scala 2.10. ``` Exception in thread "main" java.lang.NoSuchMethodError: scala.reflect.api.JavaUniverse.runtimeMirror(Ljava/lang/ClassLoader;)Lscala/reflect/api/JavaMirrors$JavaMirror; ``` This PR replace the existing `test.jar` with `test-2.10.jar` and `test-2.11.jar` and improve the regression test to use the suitable jar file. ## How was this patch tested? Pass the existing Jenkins test. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #11744 from dongjoon-hyun/SPARK-12653.	2016-03-16 09:05:53 +00:00
hyukjinkwon	92024797a4	[SPARK-13899][SQL] Produce InternalRow instead of external Row at CSV data source ## What changes were proposed in this pull request? https://issues.apache.org/jira/browse/SPARK-13899 This PR makes CSV data source produce `InternalRow` instead of `Row`. Basically, this resembles JSON data source. It uses the same codes for casting. ## How was this patch tested? Unit tests were used within IDE and code style was checked by `./dev/run_tests`. Author: hyukjinkwon <gurwls223@gmail.com> Closes #11717 from HyukjinKwon/SPARK-13899.	2016-03-15 23:31:46 -07:00
Dongjoon Hyun	3c578c594e	[SPARK-13920][BUILD] MIMA checks should apply to @Experimental and @DeveloperAPI APIs ## What changes were proposed in this pull request? We are able to change `Experimental` and `DeveloperAPI` API freely but also should monitor and manage those API carefully. This PR for [SPARK-13920](https://issues.apache.org/jira/browse/SPARK-13920) enables MiMa check and adds filters for them. ## How was this patch tested? Pass the Jenkins tests (including MiMa). Author: Dongjoon Hyun <dongjoon@apache.org> Closes #11751 from dongjoon-hyun/SPARK-13920.	2016-03-15 23:25:31 -07:00
Yanbo Liang	3665294d4e	[SPARK-9837][ML] R-like summary statistics for GLMs via iteratively reweighted least squares ## What changes were proposed in this pull request? Provide R-like summary statistics for GLMs via iteratively reweighted least squares. ## How was this patch tested? unit tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #11694 from yanboliang/spark-9837.	2016-03-15 22:30:07 -07:00
Davies Liu	421f6c20e8	[SPARK-13917] [SQL] generate broadcast semi join ## What changes were proposed in this pull request? This PR brings codegen support for broadcast left-semi join. ## How was this patch tested? Existing tests. Added benchmark, the result show 7X speedup. Author: Davies Liu <davies@databricks.com> Closes #11742 from davies/gen_semi.	2016-03-15 22:17:04 -07:00
Yucai Yu	52b6a899be	[MINOR][TEST][SQL] Remove wrong "expected" parameter in checkNaNWithoutCodegen ## What changes were proposed in this pull request? Remove the wrong "expected" parameter in MathFunctionsSuite.scala's checkNaNWithoutCodegen. This function is to check NaN value, so the "expected" parameter is useless. The Callers do not pass "expected" value and the similar function like checkNaNWithGeneratedProjection and checkNaNWithOptimization do not use it also. Author: Yucai Yu <yucai.yu@intel.com> Closes #11718 from yucai/unused_expected.	2016-03-15 21:44:58 -07:00
Davies Liu	bbd887f53c	[SPARK-13918][SQL] Merge SortMergeJoin and SortMergerOuterJoin ## What changes were proposed in this pull request? This PR just move some code from SortMergeOuterJoin into SortMergeJoin. This is for support codegen for outer join. ## How was this patch tested? existing tests. Author: Davies Liu <davies@databricks.com> Closes #11743 from davies/gen_smjouter.	2016-03-15 19:58:49 -07:00
Reynold Xin	643649dcbf	[SPARK-13895][SQL] DataFrameReader.text should return Dataset[String] ## What changes were proposed in this pull request? This patch changes DataFrameReader.text()'s return type from DataFrame to Dataset[String]. Closes #11731. ## How was this patch tested? Updated existing integration tests to reflect the change. Author: Reynold Xin <rxin@databricks.com> Closes #11739 from rxin/SPARK-13895.	2016-03-15 14:57:54 -07:00
Marcelo Vanzin	41eaabf593	[SPARK-13626][CORE] Revert change to SparkConf's constructor. It shouldn't be private. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #11734 from vanzin/SPARK-13626-api.	2016-03-15 14:51:25 -07:00
CodingCat	dddf2f2d87	[MINOR] a minor fix for the comments of a method in RPC Dispatcher ## What changes were proposed in this pull request? a minor fix for the comments of a method in RPC Dispatcher ## How was this patch tested? existing unit tests Author: CodingCat <zhunansjtu@gmail.com> Closes #11738 from CodingCat/minor_rpc.	2016-03-15 14:46:21 -07:00
Stavros Kontopoulos	50e3644d00	[SPARK-13896][SQL][STRING] Dataset.toJSON should return Dataset ## What changes were proposed in this pull request? Change the return type of toJson in Dataset class ## How was this patch tested? No additional unit test required. Author: Stavros Kontopoulos <stavros.kontopoulos@typesafe.com> Closes #11732 from skonto/fix_toJson.	2016-03-15 12:18:30 -07:00

... 11 12 13 14 15 ...

15781 commits