This PR relaxes the requirements of a `Sink` for structured streaming to only require idempotent appending of data. Previously, the `Sink` needed to be able to transactionally append data while recording an opaque offset indicating how far into the stream we have processed.
In order to do this, a new write-ahead-log has been added to stream execution, which records the offsets that are present in each batch. The log is created in the newly added `checkpointLocation`, which defaults to `${spark.sql.streaming.checkpointLocation}/${queryName}` but can be overridden by setting `checkpointLocation` in `DataFrameWriter`.
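For illustration, here is a minimal, hypothetical sketch of how a query's checkpoint location might be configured; `streamingDf` and the paths are assumed, and the call that actually starts the stream is omitted since that API was still evolving at the time:
```scala
// Hypothetical sketch: configuring where the streaming write-ahead log (the offsets
// recorded per batch) is kept. The global default directory comes from the SQL conf;
// a per-query location can be supplied on the writer.
sqlContext.setConf("spark.sql.streaming.checkpointLocation", "/tmp/streaming-checkpoints")

val writer = streamingDf.write
  .format("parquet")
  // per-query override of the checkpoint (write-ahead-log) location
  .option("checkpointLocation", "/tmp/my-query-checkpoint")
```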
In addition to making sinks easier to write, the addition of batchIds and a checkpoint location is done in anticipation of integration with the `StateStore` (#11645).
Author: Michael Armbrust <michael@databricks.com>
Closes#11804 from marmbrus/batchIds.
## What changes were proposed in this pull request?
Currently, **BooleanSimplification** optimization can handle the following cases.
* a && (!a || b ) ==> a && b
* a && (b || !a ) ==> a && b
However, it cannot handle the following cases, since these expressions fail the comparison between their canonicalized forms.
* a < 1 && (!(a < 1) || b) ==> (a < 1) && b
* a <= 1 && (!(a <= 1) || b) ==> (a <= 1) && b
* a > 1 && (!(a > 1) || b) ==> (a > 1) && b
* a >= 1 && (!(a >= 1) || b) ==> (a >= 1) && b
This PR implements the above cases and also the following ones (a toy sketch of the underlying idea appears after this list).
* a < 1 && ((a >= 1) || b ) ==> (a < 1) && b
* a <= 1 && ((a > 1) || b ) ==> (a <= 1) && b
* a > 1 && ((a <= 1) || b) ==> (a > 1) && b
* a >= 1 && ((a < 1) || b) ==> (a >= 1) && b
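To illustrate the idea (this is a toy, self-contained sketch, not the actual Catalyst rule): a disjunct can be dropped when it is recognized as the negation of the other conjunct, and complementary comparisons such as `a < 1` and `a >= 1` count as negations of each other.
```scala
// Toy illustration of the simplification, assuming a tiny expression AST.
sealed trait Expr
case class Attr(name: String) extends Expr
case class Lit(value: Int) extends Expr
case class LT(left: Expr, right: Expr) extends Expr   // left < right
case class GTE(left: Expr, right: Expr) extends Expr  // left >= right
case class Not(child: Expr) extends Expr
case class And(left: Expr, right: Expr) extends Expr
case class Or(left: Expr, right: Expr) extends Expr

// Two expressions negate each other if one is Not(other), or if they are
// complementary comparisons over the same operands.
def negates(x: Expr, y: Expr): Boolean = (x, y) match {
  case (Not(a), b)               => a == b
  case (a, Not(b))               => a == b
  case (LT(a1, b1), GTE(a2, b2)) => a1 == a2 && b1 == b2
  case (GTE(a1, b1), LT(a2, b2)) => a1 == a2 && b1 == b2
  case _                         => false
}

// a && (!a || b)  ==>  a && b   (and the comparison variants listed above)
def simplify(e: Expr): Expr = e match {
  case And(a, Or(na, b)) if negates(a, na) => And(a, b)
  case And(a, Or(b, na)) if negates(a, na) => And(a, b)
  case other                               => other
}

// simplify(And(LT(Attr("a"), Lit(1)), Or(GTE(Attr("a"), Lit(1)), Attr("b"))))
//   returns And(LT(Attr("a"), Lit(1)), Attr("b"))
```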
## How was this patch tested?
Pass the Jenkins tests including new test cases in BooleanSimplificationSuite.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#11851 from dongjoon-hyun/SPARK-14029.
SPARK-13774: IllegalArgumentException: Can not create a Path from an empty string for incorrect file path
**Overview:**
- If a non-existent path is given in this call
```
scala> sqlContext.read.format("csv").load("file-path-is-incorrect.csv")
```
it throws the following error:
`java.lang.IllegalArgumentException: Can not create a Path from an empty string` …..
It is thrown from the `inferSchema` call in `org.apache.spark.sql.execution.datasources.DataSource.resolveRelation`.
- The purpose of this JIRA is to throw a better error message.
- With the fix, you will now get a _Path does not exist_ error message.
```
scala> sqlContext.read.format("csv").load("file-path-is-incorrect.csv")
org.apache.spark.sql.AnalysisException: Path does not exist: file:/Users/ksunitha/trunk/spark/file-path-is-incorrect.csv;
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:215)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:204)
...
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:204)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:131)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:141)
... 49 elided
```
**Details**
_Changes include:_
- Check whether the path exists in `resolveRelation` in `DataSource`, and throw an `AnalysisException` with a message like “Path does not exist: $path”.
- The `AnalysisException` is thrown similar to the other exceptions thrown in `resolveRelation`.
- The glob path and the non-glob path are checked with minimal calls to path existence. If the `globPath` is empty, it is a nonexistent glob pattern and an error is thrown. If it is not a glob path, it is sufficient to check only whether the first element in the Seq is valid (see the sketch after this list).
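A rough sketch of what the added check might look like (simplified and hypothetical; the real change lives inside `DataSource.resolveRelation`, where `AnalysisException` is constructible, and it distinguishes glob from non-glob paths more carefully than this):
```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.AnalysisException

// Hadoop's globStatus returns null or an empty array when a glob pattern matches
// nothing, which also covers a plain path that simply does not exist.
def verifyPathsExist(paths: Seq[String], hadoopConf: Configuration): Unit = {
  paths.foreach { p =>
    val path = new Path(p)
    val fs = path.getFileSystem(hadoopConf)
    val matched = fs.globStatus(path)
    if (matched == null || matched.isEmpty) {
      throw new AnalysisException(s"Path does not exist: $path")
    }
  }
}
```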
_Test modifications:_
- Changes went in for 3 tests to account for this error checking.
- SQLQuerySuite:test("run sql directly on files") – Error message needed to be updated.
- 2 tests in MetastoreDataSourcesSuite failed because they used a dummy path; they are modified to use a temp dir so they can move past the check and continue to exercise the code path they were meant to test.
_New Tests:_
2 new tests are added to DataFrameSuite to validate that both glob and non-glob paths throw the new error message.
_Testing:_
Unit tests were run with the fix.
**Notes/Questions to reviewers:**
- There is some code duplication in DataSource.scala between the resolveRelation method and createSource with respect to getting the paths. I have not made any changes to the createSource code path. Should we make the change there as well?
- From other JIRAs, I know there is restructuring and changes going on in this area, not sure how that will affect these changes, but since this seemed like a starter issue, I looked into it. If we prefer not to add the overhead of the checks, or if there is a better place to do so, let me know.
I would appreciate your review. Thanks for your time and comments.
Author: Sunitha Kambhampati <skambha@us.ibm.com>
Closes#11775 from skambha/improve_errmsg.
## What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-13953
Currently, JSON data source creates a new field in `PERMISSIVE` mode for storing malformed string.
This field can be renamed via `spark.sql.columnNameOfCorruptRecord` option but it is a global configuration.
This PR makes that option applicable per read, so it can be specified via `option()`. This overrides `spark.sql.columnNameOfCorruptRecord` if it is set.
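A hypothetical usage sketch (the option key is assumed to match the SQL conf's meaning; paths are made up):
```scala
// The per-read corrupt-record column name takes precedence over the global
// spark.sql.columnNameOfCorruptRecord setting.
val records = sqlContext.read
  .option("columnNameOfCorruptRecord", "_malformed") // per-read override
  .json("/path/to/records.json")

// Malformed rows end up with their raw string in the renamed column.
records.filter("_malformed IS NOT NULL").show()
```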
## How was this patch tested?
Unit tests were used and `./dev/run-tests` for coding style tests.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#11881 from HyukjinKwon/SPARK-13953.
## What changes were proposed in this pull request?
This is a follow-up of PR #11348.
After PR #11348, a predicate is never pushed through a project as long as the project contains any non-deterministic fields. Thus, it's impossible that the candidate filter condition can reference any non-deterministic projected fields, and related logic can be safely cleaned up.
To be more specific, the following optimization is allowed:
```scala
// From:
df.select('a, 'b).filter('c > rand(42))
// To:
df.filter('c > rand(42)).select('a, 'b)
```
while this isn't:
```scala
// From:
df.select('a, rand('b) as 'rb, 'c).filter('c > 'rb)
// To:
df.filter('c > rand('b)).select('a, rand('b) as 'rb, 'c)
```
## How was this patch tested?
Existing test cases should do the work.
Author: Cheng Lian <lian@databricks.com>
Closes#11864 from liancheng/spark-13473-cleanup.
## What changes were proposed in this pull request?
As we have completed the `SQLBuilder`, we can safely turn on native view by default.
## How was this patch tested?
existing tests.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#11872 from cloud-fan/native-view.
## What changes were proposed in this pull request?
Replaces current docstring ("Creates a :class:`WindowSpec` with the partitioning defined.") with "Creates a :class:`WindowSpec` with the ordering defined."
## How was this patch tested?
PySpark unit tests (no regression introduced). No changes to the code.
Author: zero323 <matthew.szymkiewicz@gmail.com>
Closes#11877 from zero323/order-by-description.
This PR implements the new `buildReader` interface for the Parquet `FileFormat`. A simple implementation of `FileScanRDD` is also included.
This code should be tested by the many existing tests for parquet.
Author: Michael Armbrust <michael@databricks.com>
Author: Sameer Agarwal <sameer@databricks.com>
Author: Nong Li <nong@databricks.com>
Closes#11709 from marmbrus/parquetReader.
## What changes were proposed in this pull request?
This patch adds support for reading `DecimalTypes` with high (> 18) precision in `VectorizedColumnReader`
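As a hedged illustration (paths and names assumed), a high-precision decimal can be round-tripped through Parquet and now be read by the vectorized path instead of falling back to parquet-mr:
```scala
// Write and read back a DecimalType(25, 5) column, which exceeds 18 digits of precision.
val path = "/tmp/high-precision-decimals.parquet" // assumed scratch location
sqlContext.range(10)
  .selectExpr("cast(id as decimal(25, 5)) as d")
  .write.mode("overwrite").parquet(path)

sqlContext.read.parquet(path).agg(Map("d" -> "sum")).show()
```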
## How was this patch tested?
1. `VectorizedColumnReader` initially had a gating condition on `primitiveType.getDecimalMetadata().getPrecision() > Decimal.MAX_LONG_DIGITS()` that made us fall back on parquet-mr for handling high-precision decimals. This condition is now removed.
2. In particular, the `ParquetHadoopFsRelationSuite` (that tests for all supported hive types -- including `DecimalType(25, 5)`) fails when the gating condition is removed (https://github.com/apache/spark/pull/11808) and should now pass with this change.
Author: Sameer Agarwal <sameer@databricks.com>
Closes#11869 from sameeragarwal/bigdecimal-parquet.
This PR resolves two issues:
First, it supports expanding `*` inside aggregate functions of structs when using the DataFrame/Dataset APIs. For example,
```scala
structDf.groupBy($"a").agg(min(struct($"record.*")))
```
Second, it improves the error messages for invalid star usage when using the DataFrame/Dataset APIs. For example,
```scala
pagecounts4PartitionsDS
.map(line => (line._1, line._3))
.toDF()
.groupBy($"_1")
.agg(sum("*") as "sumOccurances")
```
Before the fix, the invalid usage will issue a confusing error message, like:
```
org.apache.spark.sql.AnalysisException: cannot resolve '_1' given input columns _1, _2;
```
After the fix, the message is like:
```
org.apache.spark.sql.AnalysisException: Invalid usage of '*' in function 'sum'
```
cc: rxin nongli cloud-fan
Author: gatorsmile <gatorsmile@gmail.com>
Closes#11208 from gatorsmile/sumDataSetResolution.
Building on the `SerializerManager` introduced in SPARK-13926 / #11755, this patch modifies Spark's BlockManager to use RDDs' ClassTags in order to select the best serializer to use when caching RDD blocks.
When storing a local block, the BlockManager `put()` methods use implicits to record ClassTags and store those tags in the blocks' BlockInfo records. When reading a local block, the stored ClassTag is used to pick the appropriate serializer. When a block is stored with replication, the class tag is written into the block transfer metadata and will also be stored in the remote BlockManager.
There are two or three places where we don't properly pass ClassTags, including TorrentBroadcast and BlockRDD. I think this happens to work because the missing ClassTag always happens to be `ClassTag.Any`, but it might be worth looking more carefully at those places to see whether we should be more explicit.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#11801 from JoshRosen/pick-best-serializer-for-caching.
## What changes were proposed in this pull request?
This patch merges DatasetHolder and DataFrameHolder. This makes more sense because DataFrame/Dataset are now one class.
In addition, fixed some minor issues with pull request #11732.
## How was this patch tested?
Updated existing unit tests that test these implicits.
Author: Reynold Xin <rxin@databricks.com>
Closes#11737 from rxin/SPARK-13898.
## What changes were proposed in this pull request?
WholeStageCodegen naturally breaks the execution into pipelines whose durations are easier to
measure. This is more granular than the task timings (a task can contain multiple
pipelines) and is integrated with the web UI.
We currently report the total time (across all tasks) and the min/max/median to get a sense of how long each pipeline is taking.
## How was this patch tested?
Manually tested looking at the web ui.
Author: Nong Li <nong@databricks.com>
Closes#11741 from nongli/spark-13916.
https://issues.apache.org/jira/browse/SPARK-13019
The example code in the user guide is embedded in the markdown and hence it is not easy to test. It would be nice to automatically test them. This JIRA is to discuss options to automate example code testing and see what we can do in Spark 1.6.
Goal is to move actual example code to spark/examples and test compilation in Jenkins builds. Then in the markdown, we can reference part of the code to show in the user guide. This requires adding a Jekyll tag that is similar to https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, e.g., called include_example.
`{% include_example scala/org/apache/spark/examples/mllib/SummaryStatisticsExample.scala %}`
Jekyll will find `examples/src/main/scala/org/apache/spark/examples/mllib/SummaryStatisticsExample.scala`, pick the code blocks marked "example", and use them to replace the `{% highlight %}` code block in the markdown.
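For illustration, a sketch of what a marked example file might look like (the exact marker syntax is assumed here; only the region between the markers is rendered into the guide):
```scala
// examples/src/main/scala/org/apache/spark/examples/mllib/SummaryStatisticsExample.scala
object SummaryStatisticsExample {
  def main(args: Array[String]): Unit = {
    // setup code (SparkContext creation, data loading) stays outside the markers
    // $example on$
    // only this marked region replaces the {% highlight %} block in the user guide
    println("summary statistics example body goes here")
    // $example off$
  }
}
```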
See more sub-tasks in parent ticket: https://issues.apache.org/jira/browse/SPARK-11337
Author: Xin Ren <iamshrek@126.com>
Closes#11108 from keypointt/SPARK-13019.
## What changes were proposed in this pull request?
There is only one exception: `PythonUDF`. However, I don't think the `PythonUDF#` prefix is useful, as we can only create Python UDFs in a Python context. This PR removes the `PythonUDF#` prefix from `PythonUDF.toString`, so that it doesn't need to override `sql`.
## How was this patch tested?
existing tests.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#11859 from cloud-fan/tmp.
## What changes were proposed in this pull request?
This PR generates code that gets a value in each column from ```ColumnVector``` instead of creating ```InternalRow``` when ```ColumnarBatch``` is accessed. This PR improves the benchmark program by up to 15%.
This PR consists of two parts:
1. Get a ```ColumnVector``` by using the ```ColumnarBatch.column()``` method
2. Get a value of each column by using ```rdd_col${COLIDX}.getInt(ROWIDX)``` instead of ```rdd_row.getInt(COLIDX)```
This is a motivating example.
````
sqlContext.conf.setConfString(SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key, "true")
sqlContext.conf.setConfString(SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key, "true")
val values = 10
withTempPath { dir =>
  withTempTable("t1", "tempTable") {
    sqlContext.range(values).registerTempTable("t1")
    sqlContext.sql("select id % 2 as p, cast(id as INT) as id from t1")
      .write.partitionBy("p").parquet(dir.getCanonicalPath)
    sqlContext.read.parquet(dir.getCanonicalPath).registerTempTable("tempTable")
    sqlContext.sql("select sum(p) from tempTable").collect
  }
}
````
The original code
````java
...
/* 072 */ while (!shouldStop() && rdd_batchIdx < numRows) {
/* 073 */ InternalRow rdd_row = rdd_batch.getRow(rdd_batchIdx++);
/* 074 */ /*** CONSUME: TungstenAggregate(key=[], functions=[(sum(cast(p#4 as bigint)),mode=Partial,isDistinct=false)], output=[sum#10L]) */
/* 075 */ /* input[0, int] */
/* 076 */ boolean rdd_isNull = rdd_row.isNullAt(0);
/* 077 */ int rdd_value = rdd_isNull ? -1 : (rdd_row.getInt(0));
...
````
The code generated by this PR
````java
/* 072 */ while (!shouldStop() && rdd_batchIdx < numRows) {
/* 073 */ org.apache.spark.sql.execution.vectorized.ColumnVector rdd_col0 = rdd_batch.column(0);
/* 074 */ /*** CONSUME: TungstenAggregate(key=[], functions=[(sum(cast(p#4 as bigint)),mode=Partial,isDistinct=false)], output=[sum#10L]) */
/* 075 */ /* input[0, int] */
/* 076 */ boolean rdd_isNull = rdd_col0.getIsNull(rdd_batchIdx);
/* 077 */ int rdd_value = rdd_isNull ? -1 : (rdd_col0.getInt(rdd_batchIdx));
...
/* 128 */ rdd_batchIdx++;
/* 129 */ }
/* 130 */ if (shouldStop()) return;
````
Performance
Without this PR
````
model name : Intel(R) Xeon(R) CPU E5-2667 v2 3.30GHz
Partitioned Table: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
-------------------------------------------------------------------------------------------
Read data column 434 / 488 36.3 27.6 1.0X
Read partition column 302 / 346 52.1 19.2 1.4X
Read both columns 588 / 643 26.8 37.4 0.7X
````
With this PR
````
model name : Intel(R) Xeon(R) CPU E5-2667 v2 3.30GHz
Partitioned Table: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
-------------------------------------------------------------------------------------------
Read data column 392 / 516 40.1 24.9 1.0X
Read partition column 256 / 318 61.4 16.3 1.5X
Read both columns 523 / 539 30.1 33.3 0.7X
````
## How was this patch tested?
Tested by existing test suites and benchmark
Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Closes#11636 from kiszk/SPARK-13805.
## What changes were proposed in this pull request?
This PR tries to acquire the memory for the hash map in shuffled hash join, failing the task if there is not enough memory (otherwise it could OOM the executor).
It also removes the unused HashedRelation.
## How was this patch tested?
Existing unit tests. Manual tests with TPCDS Q78.
Author: Davies Liu <davies@databricks.com>
Closes#11826 from davies/cleanup_hash2.
## What changes were proposed in this pull request?
This is a more aggressive version of PR #11820, which not only fixes the original problem, but also does the following updates to enforce the at-most-one-qualifier constraint:
- Renames `NamedExpression.qualifiers` to `NamedExpression.qualifier`
- Uses `Option[String]` rather than `Seq[String]` for `NamedExpression.qualifier`
Quoted PR description of #11820 here:
> Current implementations of `AttributeReference.sql` and `Alias.sql` joins all available qualifiers, which is logically wrong. But this implementation mistake doesn't cause any real SQL generation bugs though, since there is always at most one qualifier for any given `AttributeReference` or `Alias`.
## How was this patch tested?
Existing tests should be enough.
Author: Cheng Lian <lian@databricks.com>
Closes#11822 from liancheng/spark-14004-aggressive.
## What changes were proposed in this pull request?
Case classes defined in the REPL are wrapped by line classes, and we have a trick for the Scala 2.10 REPL to automatically register the wrapper classes to `OuterScope` so that we can use them when creating encoders.
However, this trick doesn't work after we upgrade to Scala 2.11, and unfortunately the tests exist only for Scala 2.10, which kept this bug hidden until now.
This PR moves the encoder tests to the Scala 2.11 `ReplSuite`, and fixes this bug with another approach (the previous trick can't be ported to the Scala 2.11 REPL): make `OuterScope` smarter so that it can detect classes defined in the REPL and load the singletons of the line wrapper classes automatically.
## How was this patch tested?
the migrated encoder tests in `ReplSuite`
Author: Wenchen Fan <wenchen@databricks.com>
Closes#11410 from cloud-fan/repl.
## What changes were proposed in this pull request?
Ad-hoc Dataset API ScalaDoc fixes
## How was this patch tested?
By building and checking ScalaDoc locally.
Author: Cheng Lian <lian@databricks.com>
Closes#11862 from liancheng/ds-doc-fixes.
## What changes were proposed in this pull request?
`SubqueryHolder` is only used when generating SQL strings in `SQLBuilder`, so it's clearer to make it an inner class of `SQLBuilder`.
## How was this patch tested?
existing tests.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#11861 from cloud-fan/gensql.
## What changes were proposed in this pull request?
Spark uses the `DeveloperApi` annotation, but sometimes it conflicts with visibility. This PR tries to fix those conflicts by removing the annotation from non-public classes. The following is an example.
**JobResult.scala**
```scala
@DeveloperApi
sealed trait JobResult
@DeveloperApi
case object JobSucceeded extends JobResult
-@DeveloperApi
private[spark] case class JobFailed(exception: Exception) extends JobResult
```
## How was this patch tested?
Pass the existing Jenkins test.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#11797 from dongjoon-hyun/SPARK-13986.
## What changes were proposed in this pull request?
When we validate an encoder, we may call `dataType` on unresolved expressions. This PR fixes the validation so that we resolve attributes first.
## How was this patch tested?
a new test in `DatasetSuite`
Author: Wenchen Fan <wenchen@databricks.com>
Closes#11816 from cloud-fan/encoder.
#### What changes were proposed in this pull request?
This PR is to support order by position in SQL, e.g.
```SQL
select c1, c2, c3 from tbl order by 1 desc, 3
```
should be equivalent to
```SQL
select c1, c2, c3 from tbl order by c1 desc, c3 asc
```
This is controlled by the config option `spark.sql.orderByOrdinal` (a usage sketch follows the list below).
- When true, the ordinal numbers are treated as the position in the select list.
- When false, the ordinal numbers in the order/sort by clause are ignored.
- Only integer literals are converted (not foldable expressions); if foldable expressions are found, they are ignored.
- This also works with select *.
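A hypothetical usage sketch of the config toggle (table and column names are made up):
```scala
// With order-by-ordinal enabled, the literals refer to select-list positions.
sqlContext.setConf("spark.sql.orderByOrdinal", "true")
sqlContext.sql("SELECT c1, c2, c3 FROM tbl ORDER BY 1 DESC, 3").show() // ORDER BY c1 DESC, c3

// With it disabled, the ordinal numbers in the ORDER BY clause are ignored.
sqlContext.setConf("spark.sql.orderByOrdinal", "false")
sqlContext.sql("SELECT c1, c2, c3 FROM tbl ORDER BY 1 DESC, 3").show()
```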
**Question**: Do we still need to sort by columns that contain zero references? In this case, it will have no impact on the sorting results. IMO, we should not allow users to do it. rxin cloud-fan marmbrus yhuai hvanhovell
-- Update: in these cases, the ordinals are ignored.
**Note**: This PR is taken from https://github.com/apache/spark/pull/10731. When merging this PR, please give the credit to zhichao-li
Also cc all the people who are involved in the previous discussion: adrian-wang chenghao-intel tejasapatil
#### How was this patch tested?
Added a few test cases for both positive and negative test cases.
Author: gatorsmile <gatorsmile@gmail.com>
Closes#11815 from gatorsmile/orderByPosition.
## What changes were proposed in this pull request?
- Removed two methods that have been deprecated since 1.4
- Fixed two other compilation warnings
## How was this patch tested?
existing test suits
Author: proflin <proflin.me@gmail.com>
Closes#11850 from lw-lin/streaming-kinesis-deprecates-warnings.
## What changes were proposed in this pull request?
This PR adds some proper periods and spaces to Spark CLI help messages and SQL/YARN conf docs for consistency.
## How was this patch tested?
Manual.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#11848 from dongjoon-hyun/add_proper_period_and_space.
## What changes were proposed in this pull request?
[Spark Coding Style Guide](https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide) has 100-character limit on lines, but it's disabled for Java since 11/09/15. This PR enables **LineLength** checkstyle again. To help that, this also introduces **RedundantImport** and **RedundantModifier**, too. The following is the diff on `checkstyle.xml`.
```xml
- <!-- TODO: 11/09/15 disabled - the lengths are currently > 100 in many places -->
- <!--
<module name="LineLength">
<property name="max" value="100"/>
<property name="ignorePattern" value="^package.*|^import.*|a href|href|http://|https://|ftp://"/>
</module>
- -->
<module name="NoLineWrap"/>
<module name="EmptyBlock">
<property name="option" value="TEXT"/>
@@ -167,5 +164,7 @@
</module>
<module name="CommentsIndentation"/>
<module name="UnusedImports"/>
+ <module name="RedundantImport"/>
+ <module name="RedundantModifier"/>
```
## How was this patch tested?
Currently, `lint-java` is disabled in Jenkins. It needs a manual test.
After the Jenkins tests pass, `dev/lint-java` should pass locally.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#11831 from dongjoon-hyun/SPARK-14011.
## What changes were proposed in this pull request?
Currently, there is no way to control the behaviour when the JSON data source fails to parse corrupt records.
This PR adds the support for parse modes just like CSV data source. There are three modes below:
- `PERMISSIVE` : When it fails to parse, this sets `null` to the field. This is the default mode.
- `DROPMALFORMED`: When it fails to parse, this drops the whole record.
- `FAILFAST`: When it fails to parse, it just throws an exception.
This PR also makes the JSON data source share the `ParseModes` with the CSV data source.
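A hypothetical usage sketch (the `mode` option key mirrors the CSV data source; paths are made up):
```scala
// PERMISSIVE (default): keep every record, setting null for fields that fail to parse.
val permissive = sqlContext.read
  .option("mode", "PERMISSIVE")
  .json("/path/to/records.json")

// DROPMALFORMED: silently drop records that fail to parse.
val dropped = sqlContext.read
  .option("mode", "DROPMALFORMED")
  .json("/path/to/records.json")

// FAILFAST: throw an exception on the first record that fails to parse.
val strict = sqlContext.read
  .option("mode", "FAILFAST")
  .json("/path/to/records.json")
```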
## How was this patch tested?
Unit tests were used and `./dev/run-tests` for code style tests.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#11756 from HyukjinKwon/SPARK-13764.
#### What changes were proposed in this pull request?
This PR is to add a new Optimizer rule for pruning Sort if its SortOrder is no-op. In the phase of **Optimizer**, if a specific `SortOrder` does not have any reference, it has no effect on the sorting results. If `Sort` is empty, remove the whole `Sort`.
For example, in the following SQL query
```SQL
SELECT * FROM t ORDER BY NULL + 5
```
Before the fix, the plan is like
```
== Analyzed Logical Plan ==
a: int, b: int
Sort [(cast(null as int) + 5) ASC], true
+- Project [a#92,b#93]
+- SubqueryAlias t
+- Project [_1#89 AS a#92,_2#90 AS b#93]
+- LocalRelation [_1#89,_2#90], [[1,2],[1,2]]
== Optimized Logical Plan ==
Sort [null ASC], true
+- LocalRelation [a#92,b#93], [[1,2],[1,2]]
== Physical Plan ==
WholeStageCodegen
: +- Sort [null ASC], true, 0
: +- INPUT
+- Exchange rangepartitioning(null ASC, 5), None
+- LocalTableScan [a#92,b#93], [[1,2],[1,2]]
```
After the fix, the plan is like
```
== Analyzed Logical Plan ==
a: int, b: int
Sort [(cast(null as int) + 5) ASC], true
+- Project [a#92,b#93]
+- SubqueryAlias t
+- Project [_1#89 AS a#92,_2#90 AS b#93]
+- LocalRelation [_1#89,_2#90], [[1,2],[1,2]]
== Optimized Logical Plan ==
LocalRelation [a#92,b#93], [[1,2],[1,2]]
== Physical Plan ==
LocalTableScan [a#92,b#93], [[1,2],[1,2]]
```
cc rxin cloud-fan marmbrus Thanks!
#### How was this patch tested?
Added a test suite for covering this rule
Author: gatorsmile <gatorsmile@gmail.com>
Closes#11840 from gatorsmile/sortElimination.
This PR changes the `findSplits` method in spark.ml to perform split calculations on the workers. This PR is meant to copy [PR-8246](https://github.com/apache/spark/pull/8246) which added the same feature for MLlib.
Author: sethah <seth.hendrickson16@gmail.com>
Closes#10231 from sethah/SPARK-12182.
## What changes were proposed in this pull request?
Increase 'connectionTimeout' to make RequestTimeoutIntegrationSuite more stable
## How was this patch tested?
Existing unit tests
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#11833 from zsxwing/SPARK-10680.
## What changes were proposed in this pull request?
Previously, Dataset.groupBy returns a GroupedData, and Dataset.groupByKey returns a GroupedDataset. The naming is very similar, and unfortunately does not convey the real differences between the two.
Assume we are grouping by some keys (K). groupByKey is a key-value style group by, in which the schema of the returned dataset is a tuple of just two fields: key and value. groupBy, on the other hand, is a relational style group by, in which the schema of the returned dataset is flattened and contains |K| + |V| fields.
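A rough sketch contrasting the two styles, assuming a `Dataset[(String, Long)]` named `ds` of word/count pairs and `sqlContext.implicits._` in scope:
```scala
// Relational style: flattened schema with the grouping column plus the aggregates.
val relational = ds.toDF("word", "count").groupBy("word").sum("count")

// Key-value style: the result stays keyed, with key and value as the two sides.
val keyValue = ds.groupByKey(_._1).count()
```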
This pull request also removes the experimental tag from RelationalGroupedDataset. It has been with DataFrame since 1.3, and we have enough confidence now to stabilize it.
## How was this patch tested?
This is a rename to improve API understandability. Should be covered by all existing tests.
Author: Reynold Xin <rxin@databricks.com>
Closes#11841 from rxin/SPARK-13897.
## What changes were proposed in this pull request?
Since `sparkR` is no longer used for submitting R scripts as of Spark 2.0, a user sees the following error message if they follow the instructions in `R/README.md`. This PR updates `R/README.md`.
```bash
$ ./bin/sparkR examples/src/main/r/dataframe.R
Running R applications through 'sparkR' is not supported as of Spark 2.0.
Use ./bin/spark-submit <R file>
```
## How was this patch tested?
Manual.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#11842 from dongjoon-hyun/update_r_readme.
## What changes were proposed in this pull request?
500L << 20 is actually pretty close to 32-bit int limit. I was trying to increase this to 500L << 23 and got negative numbers instead.
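A worked example of the overflow (the arithmetic checks out with 32-bit Int semantics):
```scala
val ok       = 500  << 20   // 524288000, still below Int.MaxValue (2147483647)
val overflow = 500  << 23   // 4194304000 wraps around to -100663296 as an Int
val fixed    = 500L << 23   // 4194304000 computed as a Long, no overflow
```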
## How was this patch tested?
I'm only modifying test code.
Author: Reynold Xin <rxin@databricks.com>
Closes#11839 from rxin/SPARK-14018.
## What changes were proposed in this pull request?
This is a minor followup on https://github.com/apache/spark/pull/11799 that extracts out the `VectorizedColumnReader` from `VectorizedParquetRecordReader` into its own file.
## How was this patch tested?
N/A (refactoring only)
Author: Sameer Agarwal <sameer@databricks.com>
Closes#11834 from sameeragarwal/rename.
## What changes were proposed in this pull request?
This PR updates Scala and Hadoop versions in the build description and commands in `Building Spark` documents.
## How was this patch tested?
N/A
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#11838 from dongjoon-hyun/fix_doc_building_spark.
## What changes were proposed in this pull request?
This is continued work from https://github.com/apache/spark/pull/11536#issuecomment-198511013,
containing some comment updates and style adjustments.
jkbradley
## How was this patch tested?
unit tests.
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes#11830 from hhbyyh/cvToggle.
## What changes were proposed in this pull request?
This PR cleans up the new parquet record reader with the following changes:
1. Removes the non-vectorized parquet reader code from `UnsafeRowParquetRecordReader`.
2. Removes the non-vectorized column reader code from `ColumnReader`.
3. Renames `UnsafeRowParquetRecordReader` to `VectorizedParquetRecordReader` and `ColumnReader` to `VectorizedColumnReader`
4. Deprecates `PARQUET_UNSAFE_ROW_RECORD_READER_ENABLED`
## How was this patch tested?
Refactoring only; Existing tests should reveal any problems.
Author: Sameer Agarwal <sameer@databricks.com>
Closes#11799 from sameeragarwal/vectorized-parquet.
## What changes were proposed in this pull request?
As part of testing the generation of SQL queries from an analyzed SQL plan, we run the generated SQL for tests in HiveComparisonTest. This PR makes the generated SQL be eagerly analyzed, so that when a generated SQL string has any analysis error, we can see the error message created by
```
case NonFatal(e) => fail(
s"""Failed to analyze the converted SQL string:
|
|# Original HiveQL query string:
|$queryString
|
|# Resolved query plan:
|${originalQuery.analyzed.treeString}
|
|# Converted SQL query string:
|$convertedSQL
""".stripMargin, e)
```
Right now, if we can parse a generated SQL string but fail to analyze it, we will see the error message generated by the following code (it only mentions that we cannot execute the original query, i.e. `queryString`).
```
case e: Throwable =>
val errorMessage =
s"""
|Failed to execute query using catalyst:
|Error: ${e.getMessage}
|${stackTraceToString(e)}
|$queryString
|$query
|== HIVE - ${hive.size} row(s) ==
|${hive.mkString("\n")}
""".stripMargin
```
## How was this patch tested?
Existing tests.
Author: Yin Huai <yhuai@databricks.com>
Closes#11825 from yhuai/SPARK-13972-follow-up.
## What changes were proposed in this pull request?
This change fixes the executor OOM which was recently introduced in PR apache/spark#11095
## How was this patch tested?
Tested by running a spark job on the cluster.
… Sorter
Author: Sital Kedia <skedia@fb.com>
Closes#11794 from sitalkedia/SPARK-13958.
## What changes were proposed in this pull request?
This regression was introduced in #9182; previously the attempt id was simply a counter, "1" or "2". With the change in #9182, it was changed to a full name such as "appattempt-xxx-00001", which affects all the parts that use this attempt id, like the event log file name and the history server app URL link. So here we change it back to the counter to keep consistent with the previous code.
This also reverts #11518, which fixed the history log URL link according to the new attempt id format; since we change back to the previous format here, that patch is no longer necessary, so it is reverted.
Also clean up "spark.yarn.app.id" and "spark.yarn.app.attemptId", since they are useless now.
## How was this patch tested?
Test it with unit test and manually test different scenario:
1. application running in yarn-client mode.
2. application running in yarn-cluster mode.
3. application running in yarn-cluster mode with multiple attempts.
Checked both the event log file name and url link.
CC vanzin tgravescs , please help to review, thanks a lot.
Author: jerryshao <sshao@hortonworks.com>
Closes#11721 from jerryshao/SPARK-13885.
## What changes were proposed in this pull request?
ShuffledHashJoin (including outer join) was removed in 1.6 in favor of SortMergeJoin, which is more robust and also fast.
ShuffledHashJoin is still useful when: 1) one table is much smaller than the other, so the cost of building a hash table on the smaller table is lower than sorting the larger table, and 2) any partition of the small table can fit in memory.
This PR brings back ShuffledHashJoin, basically reverting #9645 and fixing the conflicts. It also merges outer join and left-semi join into the same class. This PR does not implement full outer join, because it cannot be implemented efficiently (it would require building hash tables on both sides).
A simple benchmark (one table is 5x smaller than other one) show that ShuffledHashJoin could be 2X faster than SortMergeJoin.
## How was this patch tested?
Added new unit tests for ShuffledHashJoin.
Author: Davies Liu <davies@databricks.com>
Closes#11788 from davies/shuffle_join.
## What changes were proposed in this pull request?
Current implementations of `AttributeReference.sql` and `Alias.sql` joins all available qualifiers, which is logically wrong. But this implementation mistake doesn't cause any real SQL generation bugs though, since there is always at most one qualifier for any given `AttributeReference` or `Alias`.
This PR fixes this issue by only picking the first qualifier.
## How was this patch tested?
Existing tests should be enough.
Author: Cheng Lian <lian@databricks.com>
Closes#11820 from liancheng/spark-14004-single-qualifier.
## What changes were proposed in this pull request?
Now we should be able to convert all logical plans to SQL strings, if they are parsed from Hive queries. This PR changes the error handling to throw exceptions instead of just logging.
We will send new PRs for spotted bugs, and merge this one after all bugs are fixed.
## How was this patch tested?
existing tests.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#11782 from cloud-fan/test.
## What changes were proposed in this pull request?
Fix some nits discussed in https://github.com/apache/spark/pull/11776#issuecomment-198207419
- use `!rdd.isEmpty` instead of `rdd.count > 0`
- use `static` instead of `AtomicInteger`
- remove unneeded "throws Exception"
## How was this patch tested?
manual tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#11821 from zhengruifeng/je_fix.
## What changes were proposed in this pull request?
The fix is simple: use the existing `CombineUnions` rule to combine adjacent Unions before building the SQL string.
## How was this patch tested?
The re-enabled test
Author: Wenchen Fan <wenchen@databricks.com>
Closes#11818 from cloud-fan/bug-fix.
## What changes were proposed in this pull request?
When trainingSummary is None, it should throw ```RuntimeException```.
cc mengxr
## How was this patch tested?
Existing tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#11784 from yanboliang/fix-summary.
## What changes were proposed in this pull request?
This patch updates documentations for Datasets. I also updated some internal documentation for exchange/broadcast.
## How was this patch tested?
Just documentation/api stability update.
Author: Reynold Xin <rxin@databricks.com>
Closes#11814 from rxin/dataset-docs.