ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
sarutak	b9b54b1c88	[SPARK-21368][SQL] TPCDSQueryBenchmark can't refer query files. ## What changes were proposed in this pull request? TPCDSQueryBenchmark packaged into a jar doesn't work with spark-submit. It's because of the failure of reference query files in the jar file. ## How was this patch tested? Ran the benchmark. Author: sarutak <sarutak@oss.nttdata.co.jp> Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #18592 from sarutak/fix-tpcds-benchmark.	2017-09-12 10:49:46 -07:00
Zhenhua Wang	515910e9bd	[SPARK-17642][SQL] support DESC EXTENDED/FORMATTED table column commands ## What changes were proposed in this pull request? Support DESC (EXTENDED \| FORMATTED) ? TABLE COLUMN command. Support DESC EXTENDED \| FORMATTED TABLE COLUMN command to show column-level statistics. Do NOT support describe nested columns. ## How was this patch tested? Added test cases. Author: Zhenhua Wang <wzh_zju@163.com> Author: Zhenhua Wang <wangzhenhua@huawei.com> Author: wangzhenhua <wangzhenhua@huawei.com> Closes #16422 from wzhfy/descColumn.	2017-09-12 08:59:52 -07:00
Jen-Ming Chung	7d0a3ef4ce	[SPARK-21610][SQL][FOLLOWUP] Corrupt records are not handled properly when creating a dataframe from a file ## What changes were proposed in this pull request? When the `requiredSchema` only contains `_corrupt_record`, the derived `actualSchema` is empty and the `_corrupt_record` values are null for all rows. This PR detects this situation and raises an exception with a reasonable workaround message so that users know what happened and how to fix the query. ## How was this patch tested? Added unit test in `CSVSuite`. Author: Jen-Ming Chung <jenmingisme@gmail.com> Closes #19199 from jmchung/SPARK-21610-FOLLOWUP.	2017-09-12 22:47:12 +09:00
caoxuewen	dc74c0e67d	[MINOR][SQL] remove unuse import class ## What changes were proposed in this pull request? This PR removes imported classes that are unused. ## How was this patch tested? N/A Author: caoxuewen <cao.xuewen@zte.com.cn> Closes #19131 from heary-cao/unuse_import.	2017-09-11 10:09:20 +01:00
Jen-Ming Chung	6273a711b6	[SPARK-21610][SQL] Corrupt records are not handled properly when creating a dataframe from a file ## What changes were proposed in this pull request? ``` echo '{"field": 1} {"field": 2} {"field": "3"}' >/tmp/sample.json ``` ```scala import org.apache.spark.sql.types._ val schema = new StructType() .add("field", ByteType) .add("_corrupt_record", StringType) val file = "/tmp/sample.json" val dfFromFile = spark.read.schema(schema).json(file) scala> dfFromFile.show(false) +-----+---------------+ \|field\|_corrupt_record\| +-----+---------------+ \|1 \|null \| \|2 \|null \| \|null \|{"field": "3"} \| +-----+---------------+ scala> dfFromFile.filter($"_corrupt_record".isNotNull).count() res1: Long = 0 scala> dfFromFile.filter($"_corrupt_record".isNull).count() res2: Long = 3 ``` When the `requiredSchema` only contains `_corrupt_record`, the derived `actualSchema` is empty and the `_corrupt_record` values are null for all rows. This PR detects this situation and raises an exception with a reasonable workaround message so that users know what happened and how to fix the query. ## How was this patch tested? Added test case. Author: Jen-Ming Chung <jenmingisme@gmail.com> Closes #18865 from jmchung/SPARK-21610.	2017-09-10 17:26:43 -07:00
Jane Wang	f76790557b	[SPARK-4131] Support "Writing data into the filesystem from queries" ## What changes were proposed in this pull request? This PR implements the sql feature: INSERT OVERWRITE [LOCAL] DIRECTORY directory1 [ROW FORMAT row_format] [STORED AS file_format] SELECT ... FROM ... ## How was this patch tested? Added new unittests and also pulled the code to fb-spark so that we could test writing to hdfs directory. Author: Jane Wang <janewang@fb.com> Closes #18975 from janewangfb/port_local_directory.	2017-09-09 11:48:34 -07:00
Yanbo Liang	e4d8f9a36a	[MINOR][SQL] Correct DataFrame doc. ## What changes were proposed in this pull request? Correct DataFrame doc. ## How was this patch tested? Only doc change, no tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #19173 from yanboliang/df-doc.	2017-09-09 09:25:12 -07:00
Liang-Chi Hsieh	6b45d7e941	[SPARK-21954][SQL] JacksonUtils should verify MapType's value type instead of key type ## What changes were proposed in this pull request? `JacksonUtils.verifySchema` verifies if a data type can be converted to JSON. For `MapType`, it now verifies the key type. However, in `JacksonGenerator`, when converting a map to JSON, we only care about its values and create a writer for the values. The keys in a map are treated as strings by calling `toString` on the keys. Thus, we should change `JacksonUtils.verifySchema` to verify the value type of `MapType`. ## How was this patch tested? Added tests. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #19167 from viirya/test-jacksonutils.	2017-09-09 19:10:52 +09:00
Andrew Ash	8a5eb50681	[SPARK-21941] Stop storing unused attemptId in SQLTaskMetrics ## What changes were proposed in this pull request? In a driver heap dump containing 390,105 instances of SQLTaskMetrics this would have saved me approximately 3.2MB of memory. Since we're not getting any benefit from storing this unused value, let's eliminate it until a future PR makes use of it. ## How was this patch tested? Existing unit tests Author: Andrew Ash <andrew@andrewash.com> Closes #19153 from ash211/aash/trim-sql-listener.	2017-09-08 23:33:15 -07:00
Kazuaki Ishizaki	8a4f228dc0	[SPARK-21946][TEST] fix flaky test: "alter table: rename cached table" in InMemoryCatalogedDDLSuite ## What changes were proposed in this pull request? This PR fixes flaky test `InMemoryCatalogedDDLSuite "alter table: rename cached table"`. Since this test validates a distributed DataFrame, the result should be checked by using `checkAnswer`. The original version used the `df.collect().Seq` method, which does not guarantee the order of the elements in the result. ## How was this patch tested? Use existing test case Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #19159 from kiszk/SPARK-21946.	2017-09-08 09:39:20 -07:00
Dongjoon Hyun	c26976fe14	[SPARK-21939][TEST] Use TimeLimits instead of Timeouts Since ScalaTest 3.0.0, `org.scalatest.concurrent.Timeouts` is deprecated. This PR replaces the deprecated one with `org.scalatest.concurrent.TimeLimits`. ```scala -import org.scalatest.concurrent.Timeouts._ +import org.scalatest.concurrent.TimeLimits._ ``` Pass the existing test suites. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #19150 from dongjoon-hyun/SPARK-21939. Change-Id: I1a1b07f1b97e51e2263dfb34b7eaaa099b2ded5e	2017-09-08 09:31:13 +08:00
Dongjoon Hyun	eea2b877cf	[SPARK-21912][SQL] ORC/Parquet table should not create invalid column names ## What changes were proposed in this pull request? Currently, users hit job aborts while creating or altering ORC/Parquet tables with invalid column names. We should prevent this by raising an AnalysisException that advises using aliases instead, as Parquet data source tables do. BEFORE ```scala scala> sql("CREATE TABLE orc1 USING ORC AS SELECT 1 `a b`") 17/09/04 13:28:21 ERROR Utils: Aborting task java.lang.IllegalArgumentException: Error: : expected at the position 8 of 'struct<a b:int>' but ' ' is found. 17/09/04 13:28:21 ERROR FileFormatWriter: Job job_20170904132821_0001 aborted. 17/09/04 13:28:21 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1) org.apache.spark.SparkException: Task failed while writing rows. ``` AFTER ```scala scala> sql("CREATE TABLE orc1 USING ORC AS SELECT 1 `a b`") 17/09/04 13:27:40 ERROR CreateDataSourceTableAsSelectCommand: Failed to write to table orc1 org.apache.spark.sql.AnalysisException: Attribute name "a b" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.; ``` ## How was this patch tested? Pass the Jenkins with a new test case. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #19124 from dongjoon-hyun/SPARK-21912.	2017-09-06 22:20:48 -07:00
Liang-Chi Hsieh	ce7293c150	[SPARK-21835][SQL][FOLLOW-UP] RewritePredicateSubquery should not produce unresolved query plans ## What changes were proposed in this pull request? This is a follow-up of #19050 to deal with `ExistenceJoin` case. ## How was this patch tested? Added test. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #19151 from viirya/SPARK-21835-followup.	2017-09-06 22:15:25 -07:00
Jacek Laskowski	fa0092bddf	[SPARK-21901][SS] Define toString for StateOperatorProgress ## What changes were proposed in this pull request? Just `StateOperatorProgress.toString` + few formatting fixes ## How was this patch tested? Local build. Waiting for OK from Jenkins. Author: Jacek Laskowski <jacek@japila.pl> Closes #19112 from jaceklaskowski/SPARK-21901-StateOperatorProgress-toString.	2017-09-06 15:48:48 -07:00
Jose Torres	acdf45fb52	[SPARK-21765] Check that optimization doesn't affect isStreaming bit. ## What changes were proposed in this pull request? Add an assert in logical plan optimization that the isStreaming bit stays the same, and fix empty relation rules where that wasn't happening. ## How was this patch tested? new and existing unit tests Author: Jose Torres <joseph.torres@databricks.com> Author: Jose Torres <joseph-torres@databricks.com> Closes #19056 from joseph-torres/SPARK-21765-followup.	2017-09-06 11:19:46 -07:00
Liang-Chi Hsieh	f2e22aebfe	[SPARK-21835][SQL] RewritePredicateSubquery should not produce unresolved query plans ## What changes were proposed in this pull request? Correlated predicate subqueries are rewritten into `Join` by the rule `RewritePredicateSubquery` during optimization. It is possible that the two sides of the `Join` have conflicting attributes. In that case, the query plans produced by `RewritePredicateSubquery` become unresolved and break structural integrity. We should check if there are conflicting attributes in the `Join` and de-duplicate them by adding a `Project`. ## How was this patch tested? Added tests. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #19050 from viirya/SPARK-21835.	2017-09-06 07:42:19 -07:00
Xingbo Jiang	fd60d4fa6c	[SPARK-21652][SQL] Fix rule confliction between InferFiltersFromConstraints and ConstantPropagation ## What changes were proposed in this pull request? For the given example below, the predicate added by `InferFiltersFromConstraints` is folded by `ConstantPropagation` later, this leads to unconverged optimize iteration: ``` Seq((1, 1)).toDF("col1", "col2").createOrReplaceTempView("t1") Seq(1, 2).toDF("col").createOrReplaceTempView("t2") sql("SELECT * FROM t1, t2 WHERE t1.col1 = 1 AND 1 = t1.col2 AND t1.col1 = t2.col AND t1.col2 = t2.col") ``` We can fix this by adjusting the indent of the optimize rules. ## How was this patch tested? Add test case that would have failed in `SQLQuerySuite`. Author: Xingbo Jiang <xingbo.jiang@databricks.com> Closes #19099 from jiangxb1987/unconverge-optimization.	2017-09-05 13:12:39 -07:00
gatorsmile	2974406d17	[SPARK-21845][SQL][TEST-MAVEN] Make codegen fallback of expressions configurable ## What changes were proposed in this pull request? We should make codegen fallback of expressions configurable. So far, it is always on. We might hide it when our codegen have compilation bugs. Thus, we should also disable the codegen fallback when running test cases. ## How was this patch tested? Added test cases Author: gatorsmile <gatorsmile@gmail.com> Closes #19119 from gatorsmile/fallbackCodegen.	2017-09-05 09:04:03 -07:00
hyukjinkwon	02a4386aec	[SPARK-20978][SQL] Bump up Univocity version to 2.5.4 ## What changes were proposed in this pull request? There was a bug in Univocity Parser that causes the issue in SPARK-20978. This was fixed as below: ```scala val df = spark.read.schema("a string, b string, unparsed string").option("columnNameOfCorruptRecord", "unparsed").csv(Seq("a").toDS()) df.show() ``` Before ``` java.lang.NullPointerException at scala.collection.immutable.StringLike$class.stripLineEnd(StringLike.scala:89) at scala.collection.immutable.StringOps.stripLineEnd(StringOps.scala:29) at org.apache.spark.sql.execution.datasources.csv.UnivocityParser.org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$getCurrentInput(UnivocityParser.scala:56) at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$convert$1.apply(UnivocityParser.scala:207) at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$convert$1.apply(UnivocityParser.scala:207) ... ``` After ``` +---+----+--------+ \| a\| b\|unparsed\| +---+----+--------+ \| a\|null\| a\| +---+----+--------+ ``` It was fixed in 2.5.0 and 2.5.4 was released. I guess it'd be safe to upgrade this. ## How was this patch tested? Unit test added in `CSVSuite.scala`. Author: hyukjinkwon <gurwls223@gmail.com> Closes #19113 from HyukjinKwon/bump-up-univocity.	2017-09-05 23:21:43 +08:00
Dongjoon Hyun	4e7a29efdb	[SPARK-21913][SQL][TEST] `withDatabase` should drop database with CASCADE ## What changes were proposed in this pull request? Currently, `withDatabase` fails if the database is not empty. It would be better to drop the database cleanly with CASCADE. ## How was this patch tested? This is a change to a test util. Pass the existing Jenkins. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #19125 from dongjoon-hyun/SPARK-21913.	2017-09-05 00:20:16 -07:00
Sean Owen	ca59445adb	[SPARK-21418][SQL] NoSuchElementException: None.get in DataSourceScanExec with sun.io.serialization.extendedDebugInfo=true ## What changes were proposed in this pull request? If no SparkConf is available to Utils.redact, simply don't redact. ## How was this patch tested? Existing tests Author: Sean Owen <sowen@cloudera.com> Closes #19123 from srowen/SPARK-21418.	2017-09-04 23:02:59 +02:00
Liang-Chi Hsieh	9f30d92803	[SPARK-21654][SQL] Complement SQL predicates expression description ## What changes were proposed in this pull request? SQL predicates don't have complete expression description. This patch goes to complement the description by adding arguments, examples. This change also adds related test cases for the SQL predicate expressions. ## How was this patch tested? Existing tests. And added predicate test. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #18869 from viirya/SPARK-21654.	2017-09-03 21:55:18 -07:00
gatorsmile	acb7fed237	[SPARK-21891][SQL] Add TBLPROPERTIES to DDL statement: CREATE TABLE USING ## What changes were proposed in this pull request? Add `TBLPROPERTIES` to the DDL statement `CREATE TABLE USING`. After this change, the DDL becomes ``` CREATE [TEMPORARY] TABLE [IF NOT EXISTS] [db_name.]table_name USING table_provider [OPTIONS table_property_list] [PARTITIONED BY (col_name, col_name, ...)] [CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC\|DESC], ...)] INTO num_buckets BUCKETS ] [LOCATION path] [COMMENT table_comment] [TBLPROPERTIES (property_name=property_value, ...)] [[AS] select_statement]; ``` ## How was this patch tested? Add a few tests Author: gatorsmile <gatorsmile@gmail.com> Closes #19100 from gatorsmile/addTablePropsToCreateTableUsing.	2017-09-02 14:53:41 -07:00
Sean Owen	12ab7f7e89	[SPARK-14280][BUILD][WIP] Update change-version.sh and pom.xml to add Scala 2.12 profiles and enable 2.12 compilation …build; fix some things that will be warnings or errors in 2.12; restore Scala 2.12 profile infrastructure ## What changes were proposed in this pull request? This change adds back the infrastructure for a Scala 2.12 build, but does not enable it in the release or Python test scripts. In order to make that meaningful, it also resolves compile errors that the code hits in 2.12 only, in a way that still works with 2.11. It also updates dependencies to the earliest minor release of dependencies whose current version does not yet support Scala 2.12. This is in a sense covered by other JIRAs under the main umbrella, but implemented here. The versions below still work with 2.11, and are the _latest_ maintenance release in the _earliest_ viable minor release. - Scalatest 2.x -> 3.0.3 - Chill 0.8.0 -> 0.8.4 - Clapper 1.0.x -> 1.1.2 - json4s 3.2.x -> 3.4.2 - Jackson 2.6.x -> 2.7.9 (required by json4s) This change does _not_ fully enable a Scala 2.12 build: - It will also require dropping support for Kafka before 0.10. Easy enough, just didn't do it yet here - It will require recreating `SparkILoop` and `Main` for REPL 2.12, which is SPARK-14650. Possible to do here too. What it does do is make changes that resolve much of the remaining gap without affecting the current 2.11 build. ## How was this patch tested? Existing tests and build. Manually tested with `./dev/change-scala-version.sh 2.12` to verify it compiles, modulo the exceptions above. Author: Sean Owen <sowen@cloudera.com> Closes #18645 from srowen/SPARK-14280.	2017-09-01 19:21:21 +01:00
he.qiao	12f0d24225	[SPARK-21880][WEB UI] In the SQL table page, modify jobs trace information ## What changes were proposed in this pull request? As shown below, for example, When the job 5 is running, It was a mistake to think that five jobs were running, So I think it would be more appropriate to change jobs to job id. ![image](https://user-images.githubusercontent.com/21355020/29909612-4dc85064-8e59-11e7-87cd-275a869243bb.png) ## How was this patch tested? no need Author: he.qiao <he.qiao17@zte.com.cn> Closes #19093 from Geek-He/08_31_sqltable.	2017-09-01 10:47:11 -07:00
hyukjinkwon	5cd8ea99f0	[SPARK-21779][PYTHON] Simpler DataFrame.sample API in Python ## What changes were proposed in this pull request? This PR allows `DataFrame.sample(...)` to omit `withReplacement`, which defaults to `False`, consistent with the equivalent Scala / Java API. In short, the following examples are allowed: ```python >>> df = spark.range(10) >>> df.sample(0.5).count() 7 >>> df.sample(fraction=0.5).count() 3 >>> df.sample(0.5, seed=42).count() 5 >>> df.sample(fraction=0.5, seed=42).count() 5 ``` In addition, this PR also adds type checking, as shown below: ```python >>> df = spark.range(10) >>> df.sample().count() ... TypeError: withReplacement (optional), fraction (required) and seed (optional) should be a bool, float and number; however, got []. >>> df.sample(True).count() ... TypeError: withReplacement (optional), fraction (required) and seed (optional) should be a bool, float and number; however, got [<type 'bool'>]. >>> df.sample(42).count() ... TypeError: withReplacement (optional), fraction (required) and seed (optional) should be a bool, float and number; however, got [<type 'int'>]. >>> df.sample(fraction=False, seed="a").count() ... TypeError: withReplacement (optional), fraction (required) and seed (optional) should be a bool, float and number; however, got [<type 'bool'>, <type 'str'>]. >>> df.sample(seed=[1]).count() ... TypeError: withReplacement (optional), fraction (required) and seed (optional) should be a bool, float and number; however, got [<type 'list'>]. >>> df.sample(withReplacement="a", fraction=0.5, seed=1) ... TypeError: withReplacement (optional), fraction (required) and seed (optional) should be a bool, float and number; however, got [<type 'str'>, <type 'float'>, <type 'int'>]. ``` ## How was this patch tested? Manually tested, unit tests added in doc tests and manually checked the built documentation for Python. Author: hyukjinkwon <gurwls223@gmail.com> Closes #18999 from HyukjinKwon/SPARK-21779.	2017-09-01 13:01:23 +09:00
Bryan Cutler	501370d9d5	[SPARK-21583][HOTFIX] Removed intercept in test causing failures Removing a check in the ColumnarBatchSuite that depended on a Java assertion. This assertion is compiled out in the Maven builds, causing the test to fail. This part of the test is not specific to the functionality that is being tested here. Author: Bryan Cutler <cutlerb@gmail.com> Closes #19098 from BryanCutler/hotfix-ColumnarBatchSuite-assertion.	2017-08-31 11:32:10 -07:00
Jacek Laskowski	9696580c33	[SPARK-21886][SQL] Use SparkSession.internalCreateDataFrame to create… … Dataset with LogicalRDD logical operator ## What changes were proposed in this pull request? Reusing `SparkSession.internalCreateDataFrame` wherever possible (to cut dups) ## How was this patch tested? Local build and waiting for Jenkins Author: Jacek Laskowski <jacek@japila.pl> Closes #19095 from jaceklaskowski/SPARK-21886-internalCreateDataFrame.	2017-08-31 09:44:29 -07:00
gatorsmile	19b0240d42	[SPARK-21878][SQL][TEST] Create SQLMetricsTestUtils ## What changes were proposed in this pull request? Creates `SQLMetricsTestUtils` for the utility functions of both Hive-specific and the other SQLMetrics test cases. Also, move two SQLMetrics test cases from sql/hive to sql/core. ## How was this patch tested? N/A Author: gatorsmile <gatorsmile@gmail.com> Closes #19092 from gatorsmile/rewriteSQLMetrics.	2017-08-31 09:16:26 -07:00
Bryan Cutler	964b507c75	[SPARK-21583][SQL] Create a ColumnarBatch from ArrowColumnVectors ## What changes were proposed in this pull request? This PR allows the creation of a `ColumnarBatch` from `ReadOnlyColumnVectors` where previously a columnar batch could only allocate vectors internally. This is useful for using `ArrowColumnVectors` in a batch form to do row-based iteration. Also added `ArrowConverter.fromPayloadIterator` which converts `ArrowPayload` iterator to `InternalRow` iterator and uses a `ColumnarBatch` internally. ## How was this patch tested? Added a new unit test for creating a `ColumnarBatch` with `ReadOnlyColumnVectors` and a test to verify the roundtrip of rows -> ArrowPayload -> rows, using `toPayloadIterator` and `fromPayloadIterator`. Author: Bryan Cutler <cutlerb@gmail.com> Closes #18787 from BryanCutler/arrow-ColumnarBatch-support-SPARK-21583.	2017-08-31 13:08:52 +09:00
Andrew Ash	313c6ca435	[SPARK-21875][BUILD] Fix Java style bugs ## What changes were proposed in this pull request? Fix Java code style so `./dev/lint-java` succeeds ## How was this patch tested? Run `./dev/lint-java` Author: Andrew Ash <andrew@andrewash.com> Closes #19088 from ash211/spark-21875-lint-java.	2017-08-31 09:26:11 +09:00
Dongjoon Hyun	d8f4540863	[SPARK-21839][SQL] Support SQL config for ORC compression ## What changes were proposed in this pull request? This PR aims to support `spark.sql.orc.compression.codec` like Parquet's `spark.sql.parquet.compression.codec`. Users can use SQLConf to control ORC compression, too. ## How was this patch tested? Pass the Jenkins with new and updated test cases. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #19055 from dongjoon-hyun/SPARK-21839.	2017-08-31 08:16:58 +09:00
caoxuewen	235d28333c	[MINOR][SQL][TEST] Test shuffle hash join while is not expected ## What changes were proposed in this pull request? `ignore("shuffle hash join")` is intended to test _case class ShuffledHashJoinExec_. But when 'ignore' is changed to 'test', the test actually exercises _case class BroadcastHashJoinExec_, because canBroadcast is true. Output printed from _canBroadcast(plan: LogicalPlan)_ before the change: ``` canBroadcast plan.stats.sizeInBytes:6710880 canBroadcast conf.autoBroadcastJoinThreshold:10000000 ``` After the change, plan.stats.sizeInBytes is 11184808. Output printed from _canBuildLocalHashMap(plan: LogicalPlan)_ and _muchSmaller(a: LogicalPlan, b: LogicalPlan)_: ``` canBuildLocalHashMap plan.stats.sizeInBytes:11184808 canBuildLocalHashMap conf.autoBroadcastJoinThreshold:10000000 canBuildLocalHashMap conf.numShufflePartitions:2 ``` ``` muchSmaller a.stats.sizeInBytes * 3:33554424 muchSmaller b.stats.sizeInBytes:33554432 ``` ## How was this patch tested? Existing test case. Author: caoxuewen <cao.xuewen@zte.com.cn> Closes #19069 from heary-cao/shuffle_hash_join.	2017-08-30 10:10:24 -07:00
gatorsmile	32d6d9d720	Revert "[SPARK-21845][SQL] Make codegen fallback of expressions configurable" This reverts commit `3d0e174244`.	2017-08-30 09:08:40 -07:00
gatorsmile	3d0e174244	[SPARK-21845][SQL] Make codegen fallback of expressions configurable ## What changes were proposed in this pull request? We should make codegen fallback of expressions configurable. So far, it is always on. We might hide it when our codegen have compilation bugs. Thus, we should also disable the codegen fallback when running test cases. ## How was this patch tested? Added test cases Author: gatorsmile <gatorsmile@gmail.com> Closes #19062 from gatorsmile/fallbackCodegen.	2017-08-29 20:59:01 -07:00
Wenchen Fan	6327ea570b	[SPARK-21255][SQL] simplify encoder for java enum ## What changes were proposed in this pull request? This is a follow-up for https://github.com/apache/spark/pull/18488, to simplify the code. The major change is, we should map java enum to string type, instead of a struct type with a single string field. ## How was this patch tested? existing tests Author: Wenchen Fan <wenchen@databricks.com> Closes #19066 from cloud-fan/fix.	2017-08-29 09:15:59 -07:00
Wang Gengliang	8fcbda9c93	[SPARK-21848][SQL] Add trait UserDefinedExpression to identify user-defined functions ## What changes were proposed in this pull request? Add trait UserDefinedExpression to identify user-defined functions. UDF can be expensive. In optimizer we may need to avoid executing UDF multiple times. E.g. ```scala table.select(UDF as 'a).select('a, ('a + 1) as 'b) ``` If UDF is expensive in this case, optimizer should not collapse the project to ```scala table.select(UDF as 'a, (UDF+1) as 'b) ``` Currently UDF classes like PythonUDF, HiveGenericUDF are not defined in catalyst. This PR is to add a new trait to make it easier to identify user-defined functions. ## How was this patch tested? Unit test Author: Wang Gengliang <ltnwgl@gmail.com> Closes #19064 from gengliangwang/UDFType.	2017-08-29 09:08:59 -07:00
Takuya UESHIN	32fa0b8141	[SPARK-21781][SQL] Modify DataSourceScanExec to use concrete ColumnVector type. ## What changes were proposed in this pull request? As mentioned at https://github.com/apache/spark/pull/18680#issuecomment-316820409, when we have more `ColumnVector` implementations, it might (or might not) have huge performance implications because it might disable inlining, or force virtual dispatches. As for read path, one of the major paths is the one generated by `ColumnBatchScan`. Currently it refers `ColumnVector` so the penalty will be bigger as we have more classes, but we can know the concrete type from its usage, e.g. vectorized Parquet reader uses `OnHeapColumnVector`. We can use the concrete type in the generated code directly to avoid the penalty. ## How was this patch tested? Existing tests. Author: Takuya UESHIN <ueshin@databricks.com> Closes #18989 from ueshin/issues/SPARK-21781.	2017-08-29 20:16:45 +08:00
iamhumanbeing	07142cf6dc	[SPARK-21843] testNameNote should be "(minNumPostShufflePartitions: 5)" Signed-off-by: iamhumanbeing <iamhumanbeinggmail.com> ## What changes were proposed in this pull request? testNameNote = "(minNumPostShufflePartitions: 3)" is not correct. It should be "(minNumPostShufflePartitions: " + numPartitions + ")" in ExchangeCoordinatorSuite. ## How was this patch tested? Unit tests. Author: iamhumanbeing <iamhumanbeing@gmail.com> Closes #19058 from iamhumanbeing/testnote.	2017-08-27 08:23:57 +01:00
Sean Owen	1a598d717c	[SPARK-21837][SQL][TESTS] UserDefinedTypeSuite Local UDTs not actually testing what it intends ## What changes were proposed in this pull request? Adjust Local UDTs test to assert about results, and fix index of vector column. See JIRA for details. ## How was this patch tested? Existing tests. Author: Sean Owen <sowen@cloudera.com> Closes #19053 from srowen/SPARK-21837.	2017-08-25 13:29:40 -07:00
vinodkc	51620e288b	[SPARK-21756][SQL] Add JSON option to allow unquoted control characters ## What changes were proposed in this pull request? This patch adds allowUnquotedControlChars option in JSON data source to allow JSON Strings to contain unquoted control characters (ASCII characters with value less than 32, including tab and line feed characters) ## How was this patch tested? Add new test cases Author: vinodkc <vinod.kc.in@gmail.com> Closes #19008 from vinodkc/br_fix_SPARK-21756.	2017-08-25 10:18:03 -07:00
Sean Owen	de7af295c2	[MINOR][BUILD] Fix build warnings and Java lint errors ## What changes were proposed in this pull request? Fix build warnings and Java lint errors. This just helps a bit in evaluating (new) warnings in another PR I have open. ## How was this patch tested? Existing tests Author: Sean Owen <sowen@cloudera.com> Closes #19051 from srowen/JavaWarnings.	2017-08-25 16:07:13 +01:00
mike	7d16776d28	[SPARK-21255][SQL][WIP] Fixed NPE when creating encoder for enum ## What changes were proposed in this pull request? Fixed NPE when creating encoder for enum. When you try to create an encoder for Enum type (or bean with enum property) via Encoders.bean(...), it fails with NullPointerException at TypeToken:495. I did a little research and it turns out, that in JavaTypeInference following code ``` def getJavaBeanReadableProperties(beanClass: Class[_]): Array[PropertyDescriptor] = { val beanInfo = Introspector.getBeanInfo(beanClass) beanInfo.getPropertyDescriptors.filterNot(_.getName == "class") .filter(_.getReadMethod != null) } ``` filters out properties named "class", because we wouldn't want to serialize that. But enum types have another property of type Class named "declaringClass", which we are trying to inspect recursively. Eventually we try to inspect ClassLoader class, which has property "defaultAssertionStatus" with no read method, which leads to NPE at TypeToken:495. I added property name "declaringClass" to filtering to resolve this. ## How was this patch tested? Unit test in JavaDatasetSuite which creates an encoder for enum Author: mike <mike0sv@gmail.com> Author: Mikhail Sveshnikov <mike0sv@gmail.com> Closes #18488 from mike0sv/enum-support.	2017-08-25 07:22:34 +01:00
Herman van Hovell	05af2de0fd	[SPARK-21830][SQL] Bump ANTLR version and fix a few issues. ## What changes were proposed in this pull request? This PR bumps the ANTLR version to 4.7, and fixes a number of small parser related issues uncovered by the bump. The main reason for upgrading is that in some cases the current version of ANTLR (4.5) can exhibit exponential slowdowns if it needs to parse boolean predicates. For example the following query will take forever to parse: ```sql SELECT * FROM RANGE(1000) WHERE TRUE AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' ``` This is caused by a know bug in ANTLR (https://github.com/antlr/antlr4/issues/994), which was fixed in version 4.6. ## How was this patch tested? Existing tests. Author: Herman van Hovell <hvanhovell@databricks.com> Closes #19042 from hvanhovell/SPARK-21830.	2017-08-24 16:33:55 -07:00
Shixiong Zhu	d3abb36990	[SPARK-21788][SS] Handle more exceptions when stopping a streaming query ## What changes were proposed in this pull request? Add more cases we should view as a normal query stop rather than a failure. ## How was this patch tested? The new unit tests. Author: Shixiong Zhu <zsxwing@gmail.com> Closes #18997 from zsxwing/SPARK-21788.	2017-08-24 10:23:59 -07:00
Wenchen Fan	2dd37d827f	[SPARK-21826][SQL] outer broadcast hash join should not throw NPE ## What changes were proposed in this pull request? This is a bug introduced by https://github.com/apache/spark/pull/11274/files#diff-7adb688cbfa583b5711801f196a074bbL274 . Non-equal join condition should only be applied when the equal-join condition matches. ## How was this patch tested? regression test Author: Wenchen Fan <wenchen@databricks.com> Closes #19036 from cloud-fan/bug.	2017-08-24 16:44:12 +02:00
Liang-Chi Hsieh	183d4cb71f	[SPARK-21759][SQL] In.checkInputDataTypes should not wrongly report unresolved plans for IN correlated subquery ## What changes were proposed in this pull request? With the check for structural integrity proposed in SPARK-21726, it is found that the optimization rule `PullupCorrelatedPredicates` can produce unresolved plans. For a correlated IN query looks like: SELECT t1.a FROM t1 WHERE t1.a IN (SELECT t2.c FROM t2 WHERE t1.b < t2.d); The query plan might look like: Project [a#0] +- Filter a#0 IN (list#4 [b#1]) : +- Project [c#2] : +- Filter (outer(b#1) < d#3) : +- LocalRelation <empty>, [c#2, d#3] +- LocalRelation <empty>, [a#0, b#1] After `PullupCorrelatedPredicates`, it produces query plan like: 'Project [a#0] +- 'Filter a#0 IN (list#4 [(b#1 < d#3)]) : +- Project [c#2, d#3] : +- LocalRelation <empty>, [c#2, d#3] +- LocalRelation <empty>, [a#0, b#1] Because the correlated predicate involves another attribute `d#3` in subquery, it has been pulled out and added into the `Project` on the top of the subquery. When `list` in `In` contains just one `ListQuery`, `In.checkInputDataTypes` checks if the size of `value` expressions matches the output size of subquery. In the above example, there is only `value` expression and the subquery output has two attributes `c#2, d#3`, so it fails the check and `In.resolved` returns `false`. We should not let `In.checkInputDataTypes` wrongly report unresolved plans to fail the structural integrity check. ## How was this patch tested? Added test. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #18968 from viirya/SPARK-21759.	2017-08-24 21:46:58 +08:00
Takuya UESHIN	9e33954ddf	[SPARK-21745][SQL] Refactor ColumnVector hierarchy to make ColumnVector read-only and to introduce WritableColumnVector. ## What changes were proposed in this pull request? This is a refactoring of `ColumnVector` hierarchy and related classes. 1. make `ColumnVector` read-only 2. introduce `WritableColumnVector` with write interface 3. remove `ReadOnlyColumnVector` ## How was this patch tested? Existing tests. Author: Takuya UESHIN <ueshin@databricks.com> Closes #18958 from ueshin/issues/SPARK-21745.	2017-08-24 21:13:44 +08:00
lufei	846bc61cf5	[MINOR][SQL] The comment of Class ExchangeCoordinator exist a typing and context error ## What changes were proposed in this pull request? The example given in the comment of class ExchangeCoordinator has four post-shuffle partitions, but the current comment says “three”. ## How was this patch tested? Author: lufei <lu.fei80@zte.com.cn> Closes #19028 from figo77/SPARK-21816.	2017-08-24 10:07:27 +01:00
Jose Torres	3c0c2d09ca	[SPARK-21765] Set isStreaming on leaf nodes for streaming plans. ## What changes were proposed in this pull request? All streaming logical plans will now have isStreaming set. This involved adding isStreaming as a case class arg in a few cases, since a node might be logically streaming depending on where it came from. ## How was this patch tested? Existing unit tests - no functional change is intended in this PR. Author: Jose Torres <joseph-torres@databricks.com> Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #18973 from joseph-torres/SPARK-21765.	2017-08-22 19:07:43 -07:00
gatorsmile	01a8e46278	[SPARK-21769][SQL] Add a table-specific option for always respecting schemas inferred/controlled by Spark SQL ## What changes were proposed in this pull request? For Hive-serde tables, we always respect the schema stored in Hive metastore, because the schema could be altered by the other engines that share the same metastore. Thus, we always trust the metastore-controlled schema for Hive-serde tables when the schemas are different (without considering the nullability and cases). However, in some scenarios, Hive metastore also could INCORRECTLY overwrite the schemas when the serde and Hive metastore built-in serde are different. The proposed solution is to introduce a table-specific option for such scenarios. For a specific table, users can make Spark always respect Spark-inferred/controlled schema instead of trusting metastore-controlled schema. By default, we trust Hive metastore-controlled schema. ## How was this patch tested? Added a cross-version test case Author: gatorsmile <gatorsmile@gmail.com> Closes #19003 from gatorsmile/respectSparkSchema.	2017-08-22 13:12:59 -07:00
gatorsmile	43d71d9659	[SPARK-21499][SQL] Support creating persistent function for Spark UDAF(UserDefinedAggregateFunction) ## What changes were proposed in this pull request? This PR is to enable users to create persistent Scala UDAF (that extends UserDefinedAggregateFunction). ```SQL CREATE FUNCTION myDoubleAvg AS 'test.org.apache.spark.sql.MyDoubleAvg' ``` Before this PR, Spark UDAF only can be registered through the API `spark.udf.register(...)` ## How was this patch tested? Added test cases Author: gatorsmile <gatorsmile@gmail.com> Closes #18700 from gatorsmile/javaUDFinScala.	2017-08-22 13:01:35 -07:00
gatorsmile	be72b157ea	[SPARK-21803][TEST] Remove the HiveDDLCommandSuite ## What changes were proposed in this pull request? We do not have any Hive-specific parser. It does not make sense to keep a parser-specific test suite `HiveDDLCommandSuite.scala` in the Hive package. This PR is to remove it. ## How was this patch tested? N/A Author: gatorsmile <gatorsmile@gmail.com> Closes #19015 from gatorsmile/combineDDL.	2017-08-22 17:54:39 +08:00
Marcelo Vanzin	84b5b16ea6	[SPARK-21617][SQL] Store correct table metadata when altering schema in Hive metastore. For Hive tables, the current "replace the schema" code is the correct path, except that an exception in that path should result in an error, and not in retrying in a different way. For data source tables, Spark may generate a non-compatible Hive table; but for that to work with Hive 2.1, the detection of data source tables needs to be fixed in the Hive client, to also consider the raw tables used by code such as `alterTableSchema`. Tested with existing and added unit tests (plus internal tests with a 2.1 metastore). Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #18849 from vanzin/SPARK-21617.	2017-08-21 15:09:02 -07:00
Sean Owen	b3a07526fe	[SPARK-21718][SQL] Heavy log of type: "Skipping partition based on stats ..." ## What changes were proposed in this pull request? Reduce 'Skipping partitions' message to debug ## How was this patch tested? Existing tests Author: Sean Owen <sowen@cloudera.com> Closes #19010 from srowen/SPARK-21718.	2017-08-21 14:20:40 +02:00
Wenchen Fan	7880909c45	[SPARK-21743][SQL][FOLLOW-UP] top-most limit should not cause memory leak ## What changes were proposed in this pull request? This is a follow-up of https://github.com/apache/spark/pull/18955 , to fix a bug that we break whole stage codegen for `Limit`. ## How was this patch tested? existing tests. Author: Wenchen Fan <wenchen@databricks.com> Closes #18993 from cloud-fan/bug.	2017-08-18 11:19:22 -07:00
Masha Basmanova	23ea898080	[SPARK-21213][SQL] Support collecting partition-level statistics: rowCount and sizeInBytes ## What changes were proposed in this pull request? Added support for ANALYZE TABLE [db_name].tablename PARTITION (partcol1[=val1], partcol2[=val2], ...) COMPUTE STATISTICS [NOSCAN] SQL command to calculate total number of rows and size in bytes for a subset of partitions. Calculated statistics are stored in Hive Metastore as user-defined properties attached to partition objects. Property names are the same as the ones used to store table-level statistics: spark.sql.statistics.totalSize and spark.sql.statistics.numRows. When partition specification contains all partition columns with values, the command collects statistics for a single partition that matches the specification. When some partition columns are missing or listed without their values, the command collects statistics for all partitions which match a subset of partition column values specified. For example, table t has 4 partitions with the following specs: * Partition1: (ds='2008-04-08', hr=11) * Partition2: (ds='2008-04-08', hr=12) * Partition3: (ds='2008-04-09', hr=11) * Partition4: (ds='2008-04-09', hr=12) 'ANALYZE TABLE t PARTITION (ds='2008-04-09', hr=11)' command will collect statistics only for partition 3. 'ANALYZE TABLE t PARTITION (ds='2008-04-09')' command will collect statistics for partitions 3 and 4. 'ANALYZE TABLE t PARTITION (ds, hr)' command will collect statistics for all four partitions. When the optional parameter NOSCAN is specified, the command doesn't count number of rows and only gathers size in bytes. The statistics gathered by ANALYZE TABLE command can be fetched using DESC EXTENDED [db_name.]tablename PARTITION command. ## How was this patch tested? Added tests. Author: Masha Basmanova <mbasmanova@fb.com> Closes #18421 from mbasmanova/mbasmanova-analyze-partition.	2017-08-18 09:54:39 -07:00
Reynold Xin	07a2b8738e	[SPARK-21778][SQL] Simpler Dataset.sample API in Scala / Java ## What changes were proposed in this pull request? Dataset.sample requires a boolean flag withReplacement as the first argument. However, most of the time users simply want to sample some records without replacement. This ticket introduces a new sample function that simply takes in the fraction and seed. ## How was this patch tested? Tested manually. Not sure yet if we should add a test case for just this wrapper ... Author: Reynold Xin <rxin@databricks.com> Closes #18988 from rxin/SPARK-21778.	2017-08-18 23:58:20 +09:00
Jen-Ming Chung	7ab951885f	[SPARK-21677][SQL] json_tuple throws NullPointException when column is null as string type ## What changes were proposed in this pull request? ``` scala scala> Seq(("""{"Hyukjin": 224, "John": 1225}""")).toDS.selectExpr("json_tuple(value, trim(null))").show() ... java.lang.NullPointerException at ... ``` Currently the `null` field name will throw NullPointException. As a given field name null can't be matched with any field names in json, we just output null as its column value. This PR achieves it by returning a very unlikely column name `__NullFieldName` in evaluation of the field names. ## How was this patch tested? Added unit test. Author: Jen-Ming Chung <jenmingisme@gmail.com> Closes #18930 from jmchung/SPARK-21677.	2017-08-17 15:59:45 -07:00
gatorsmile	ae9e424792	[SQL][MINOR][TEST] Set spark.unsafe.exceptionOnMemoryLeak to true ## What changes were proposed in this pull request? When running IntelliJ, we are unable to capture the exception of memory leak detection. > org.apache.spark.executor.Executor: Managed memory leak detected Explicitly setting `spark.unsafe.exceptionOnMemoryLeak` in SparkConf when building the SparkSession, instead of reading it from system properties. ## How was this patch tested? N/A Author: gatorsmile <gatorsmile@gmail.com> Closes #18967 from gatorsmile/setExceptionOnMemoryLeak.	2017-08-17 13:00:37 -07:00
Wenchen Fan	a45133b826	[SPARK-21743][SQL] top-most limit should not cause memory leak ## What changes were proposed in this pull request? For top-most limit, we will use a special operator to execute it: `CollectLimitExec`. `CollectLimitExec` will retrieve `n`(which is the limit) rows from each partition of the child plan output, see https://github.com/apache/spark/blob/v2.2.0/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala#L311. It's very likely that we don't exhaust the child plan output. This is fine when whole-stage-codegen is off, as child plan will release the resource via task completion listener. However, when whole-stage codegen is on, the resource can only be released if all output is consumed. To fix this memory leak, one simple approach is, when `CollectLimitExec` retrieve `n` rows from child plan output, child plan output should only have `n` rows, then the output is exhausted and resource is released. This can be done by wrapping child plan with `LocalLimit` ## How was this patch tested? a regression test Author: Wenchen Fan <wenchen@databricks.com> Closes #18955 from cloud-fan/leak.	2017-08-16 22:37:45 -07:00
10129659	1cce1a3b63	[SPARK-21603][SQL] The wholestage codegen will be much slower then that is closed when the function is too long ## What changes were proposed in this pull request? Disable whole-stage codegen when the generated function has more lines than the maximum set by the spark.sql.codegen.MaxFunctionLength parameter, because a function that is too long will not be optimized by the JIT. In a benchmark, execution is 10x slower when the generated function is too long: ignore("max function length of wholestagecodegen") { val N = 20 << 15 val benchmark = new Benchmark("max function length of wholestagecodegen", N) def f(): Unit = sparkSession.range(N) .selectExpr( "id", "(id & 1023) as k1", "cast(id & 1023 as double) as k2", "cast(id & 1023 as int) as k3", "case when id > 100 and id <= 200 then 1 else 0 end as v1", "case when id > 200 and id <= 300 then 1 else 0 end as v2", "case when id > 300 and id <= 400 then 1 else 0 end as v3", "case when id > 400 and id <= 500 then 1 else 0 end as v4", "case when id > 500 and id <= 600 then 1 else 0 end as v5", "case when id > 600 and id <= 700 then 1 else 0 end as v6", "case when id > 700 and id <= 800 then 1 else 0 end as v7", "case when id > 800 and id <= 900 then 1 else 0 end as v8", "case when id > 900 and id <= 1000 then 1 else 0 end as v9", "case when id > 1000 and id <= 1100 then 1 else 0 end as v10", "case when id > 1100 and id <= 1200 then 1 else 0 end as v11", "case when id > 1200 and id <= 1300 then 1 else 0 end as v12", "case when id > 1300 and id <= 1400 then 1 else 0 end as v13", "case when id > 1400 and id <= 1500 then 1 else 0 end as v14", "case when id > 1500 and id <= 1600 then 1 else 0 end as v15", "case when id > 1600 and id <= 1700 then 1 else 0 end as v16", "case when id > 1700 and id <= 1800 then 1 else 0 end as v17", "case when id > 1800 and id <= 1900 then 1 else 0 end as v18") .groupBy("k1", "k2", "k3") .sum() .collect() benchmark.addCase(s"codegen = F") { iter => sparkSession.conf.set("spark.sql.codegen.wholeStage", "false") f() } benchmark.addCase(s"codegen = T") { iter => sparkSession.conf.set("spark.sql.codegen.wholeStage", "true") sparkSession.conf.set("spark.sql.codegen.MaxFunctionLength", "10000") f() } benchmark.run() /* Java HotSpot(TM) 64-Bit Server VM 1.8.0_111-b14 on Windows 7 6.1 Intel64 Family 6 Model 58 Stepping 9, GenuineIntel max function length of wholestagecodegen: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------ codegen = F 443 / 507 1.5 676.0 1.0X codegen = T 3279 / 3283 0.2 5002.6 0.1X */ } ## How was this patch tested? Ran the unit test. Author: 10129659 <chen.yanshan@zte.com.cn> Closes #18810 from eatoncys/codegen.	2017-08-16 09:12:20 -07:00
Dongjoon Hyun	8c54f1eb71	[SPARK-21422][BUILD] Depend on Apache ORC 1.4.0 ## What changes were proposed in this pull request? Like Parquet, this PR aims to depend on the latest Apache ORC 1.4 for Apache Spark 2.3. There are key benefits for Apache ORC 1.4. - Stability: Apache ORC 1.4.0 has many fixes and we can depend on ORC community more. - Maintainability: Reduce the Hive dependency and can remove old legacy code later. Later, we can get the following two key benefits by adding new ORCFileFormat in SPARK-20728 (#17980), too. - Usability: User can use ORC data sources without hive module, i.e, -Phive. - Speed: Use both Spark ColumnarBatch and ORC RowBatch together. This will be faster than the current implementation in Spark. ## How was this patch tested? Pass the jenkins. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #18640 from dongjoon-hyun/SPARK-21422.	2017-08-15 23:00:13 -07:00
Xingbo Jiang	42b9eda80e	[MINOR] Fix a typo in the method name `UserDefinedFunction.asNonNullabe` ## What changes were proposed in this pull request? The method name `asNonNullabe` should be `asNonNullable`. ## How was this patch tested? N/A Author: Xingbo Jiang <xingbo.jiang@databricks.com> Closes #18952 from jiangxb1987/typo.	2017-08-15 16:40:01 -07:00
Marcelo Vanzin	3f958a9992	[SPARK-21731][BUILD] Upgrade scalastyle to 0.9. This version fixes a few issues in the import order checker; it provides better error messages, and detects more improper ordering (thus the need to change a lot of files in this patch). The main fix is that it correctly complains about the order of packages vs. classes. As part of the above, I moved some "SparkSession" import in ML examples inside the "$example on$" blocks; that didn't seem consistent across different source files to start with, and avoids having to add more on/off blocks around specific imports. The new scalastyle also seems to have a better header detector, so a few license headers had to be updated to match the expected indentation. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #18943 from vanzin/SPARK-21731.	2017-08-15 13:59:00 -07:00
Wenchen Fan	14bdb25fd7	[SPARK-18464][SQL][FOLLOWUP] support old table which doesn't store schema in table properties ## What changes were proposed in this pull request? This is a follow-up of https://github.com/apache/spark/pull/15900 , to fix one more bug: When table schema is empty and need to be inferred at runtime, we should not resolve parent plans before the schema has been inferred, or the parent plans will be resolved against an empty schema and may get wrong result for something like `select *` The fix logic is: introduce `UnresolvedCatalogRelation` as a placeholder. Then we replace it with `LogicalRelation` or `HiveTableRelation` during analysis, so that it's guaranteed that we won't resolve parent plans until the schema has been inferred. ## How was this patch tested? regression test Author: Wenchen Fan <wenchen@databricks.com> Closes #18907 from cloud-fan/bug.	2017-08-15 09:04:56 -07:00
donnyzone	bc9902587a	[SPARK-19471][SQL] AggregationIterator does not initialize the generated result projection before using it ## What changes were proposed in this pull request? This is a follow-up PR that moves the test case in PR-18920 (https://github.com/apache/spark/pull/18920) to DataFrameAggregateSuite. ## How was this patch tested? unit test Author: donnyzone <wellfengzhu@gmail.com> Closes #18946 from DonnyZone/branch-19471-followingPR.	2017-08-15 08:51:18 -07:00
Shixiong Zhu	12411b5edf	[SPARK-21732][SQL] Lazily init hive metastore client ## What changes were proposed in this pull request? This PR changes the code to lazily init the hive metastore client so that we can create a SparkSession without talking to the hive metastore server. This is helpful when a hive metastore server is configured but is down: you can still start the Spark shell to debug. ## How was this patch tested? The new unit test. Author: Shixiong Zhu <shixiong@databricks.com> Closes #18944 from zsxwing/hive-lazy-init.	2017-08-14 23:46:52 -07:00
Shixiong Zhu	282f00b410	[SPARK-21696][SS] Fix a potential issue that may generate partial snapshot files ## What changes were proposed in this pull request? Directly writing a snapshot file may generate a partial file. This PR changes it to write to a temp file then rename to the target file. ## How was this patch tested? Jenkins. Author: Shixiong Zhu <shixiong@databricks.com> Closes #18928 from zsxwing/SPARK-21696.	2017-08-14 15:06:55 -07:00
donnyzone	fbc269252a	[SPARK-19471][SQL] AggregationIterator does not initialize the generated result projection before using it ## What changes were proposed in this pull request? Recently, we have also encountered NPE issues in our production environment, as described in: https://issues.apache.org/jira/browse/SPARK-19471 This issue can be reproduced by the following examples: `val df = spark.createDataFrame(Seq(("1", 1), ("1", 2), ("2", 3), ("2", 4))).toDF("x", "y") //HashAggregate, SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key=false df.groupBy("x").agg(rand(),sum("y")).show() //ObjectHashAggregate, SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key=false df.groupBy("x").agg(rand(),collect_list("y")).show() //SortAggregate, SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key=false &&SQLConf.USE_OBJECT_HASH_AGG.key=false df.groupBy("x").agg(rand(),collect_list("y")).show()` This PR is based on PR-16820 (https://github.com/apache/spark/pull/16820) with test cases for all aggregation paths. We want to push it forward. > When AggregationIterator generates result projection, it does not call the initialize method of the Projection class. This will cause a runtime NullPointerException when the projection involves nondeterministic expressions. ## How was this patch tested? Unit test; verified in a production environment. Author: donnyzone <wellfengzhu@gmail.com> Closes #18920 from DonnyZone/Branch-spark-19471.	2017-08-14 09:37:18 -07:00
caoxuewen	0326b69c98	[MINOR][SQL][TEST] no uncache table in joinsuite test ## What changes were proposed in this pull request? At present, the test("broadcasted hash outer join operator selection") case sets testData2 to _CACHE TABLE_ but never runs _uncache table_ on testData2, which can be confusing. In addition, in the JoinSuite test cases the cached tables are already cleared by SharedSQLContext through _spark.sharedState.cacheManager.clearCache_, so the tests do not need to uncache tables themselves. This PR fixes that. ## How was this patch tested? Existing test cases. Author: caoxuewen <cao.xuewen@zte.com.cn> Closes #18914 from heary-cao/uncache_table.	2017-08-14 09:33:22 -07:00
aokolnychyi	5596ce83c4	[MINOR][SQL] Additional test case for CheckCartesianProducts rule ## What changes were proposed in this pull request? While discovering optimization rules and their test coverage, I did not find any tests for `CheckCartesianProducts` in the Catalyst folder. So, I decided to create a new test suite. Once I finished, I found a test in `JoinSuite` for this functionality so feel free to discard this change if it does not make much sense. The proposed test suite covers a few additional use cases. Author: aokolnychyi <anton.okolnychyi@sap.com> Closes #18909 from aokolnychyi/check-cartesian-join-tests.	2017-08-13 21:33:16 -07:00
Tejas Patil	7f16c69107	[SPARK-19122][SQL] Unnecessary shuffle+sort added if join predicates ordering differ from bucketing and sorting order ## What changes were proposed in this pull request? Jira : https://issues.apache.org/jira/browse/SPARK-19122 `leftKeys` and `rightKeys` in `SortMergeJoinExec` are altered based on the ordering of join keys in the child's `outputPartitioning`. This is done every time `requiredChildDistribution` is invoked during query planning. ## How was this patch tested? - Added new test case - Existing tests Author: Tejas Patil <tejasp@fb.com> Closes #16985 from tejasapatil/SPARK-19122_join_order_shuffle.	2017-08-11 15:13:42 -07:00
Tejas Patil	94439997d5	[SPARK-21595] Separate thresholds for buffering and spilling in ExternalAppendOnlyUnsafeRowArray ## What changes were proposed in this pull request? [SPARK-21595](https://issues.apache.org/jira/browse/SPARK-21595) reported that there is excessive spilling to disk due to default spill threshold for `ExternalAppendOnlyUnsafeRowArray` being quite small for WINDOW operator. Old behaviour of WINDOW operator (pre https://github.com/apache/spark/pull/16909) would hold data in an array for first 4096 records post which it would switch to `UnsafeExternalSorter` and start spilling to disk after reaching `spark.shuffle.spill.numElementsForceSpillThreshold` (or earlier if there was paucity of memory due to excessive consumers). Currently the (switch from in-memory to `UnsafeExternalSorter`) and (`UnsafeExternalSorter` spilling to disk) for `ExternalAppendOnlyUnsafeRowArray` is controlled by a single threshold. This PR aims to separate that to have more granular control. ## How was this patch tested? Added unit tests Author: Tejas Patil <tejasp@fb.com> Closes #18843 from tejasapatil/SPARK-21595.	2017-08-11 22:01:00 +02:00
LucaCanali	0377338bf7	[SPARK-21519][SQL] Add an option to the JDBC data source to initialize the target DB environment Add an option to the JDBC data source to initialize the environment of the remote database session ## What changes were proposed in this pull request? This proposes an option to the JDBC data source, tentatively called "sessionInitStatement", to implement the session initialization functionality present, for example, in the Sqoop connector for Oracle (see https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_oraoop_oracle_session_initialization_statements). After each database session is opened to the remote DB, and before starting to read data, this option executes a custom SQL statement (or a PL/SQL block in the case of Oracle). See also https://issues.apache.org/jira/browse/SPARK-21519 ## How was this patch tested? Manually tested using Spark SQL data source and Oracle JDBC Author: LucaCanali <luca.canali@cern.ch> Closes #18724 from LucaCanali/JDBC_datasource_sessionInitStatement.	2017-08-11 12:03:37 -07:00
Adrian Ionescu	95ad960caf	[SPARK-21669] Internal API for collecting metrics/stats during FileFormatWriter jobs ## What changes were proposed in this pull request? This patch introduces an internal interface for tracking metrics and/or statistics on data on the fly, as it is being written to disk during a `FileFormatWriter` job and partially reimplements SPARK-20703 in terms of it. The interface basically consists of 3 traits: - `WriteTaskStats`: just a tag for classes that represent statistics collected during a `WriteTask` The only constraint it adds is that the class should be `Serializable`, as instances of it will be collected on the driver from all executors at the end of the `WriteJob`. - `WriteTaskStatsTracker`: a trait for classes that can actually compute statistics based on tuples that are processed by a given `WriteTask` and eventually produce a `WriteTaskStats` instance. - `WriteJobStatsTracker`: a trait for classes that act as containers of `Serializable` state that's necessary for instantiating `WriteTaskStatsTracker` on executors and finally process the resulting collection of `WriteTaskStats`, once they're gathered back on the driver. Potential future use of this interface is e.g. CBO stats maintenance during `INSERT INTO table ... ` operations. ## How was this patch tested? Existing tests for SPARK-20703 exercise the new code: `hive/SQLMetricsSuite`, `sql/JavaDataFrameReaderWriterSuite`, etc. Author: Adrian Ionescu <adrian@databricks.com> Closes #18884 from adrian-ionescu/write-stats-tracker-api.	2017-08-10 12:37:10 -07:00
bravo-zhang	84454d7d33	[SPARK-14932][SQL] Allow DataFrame.replace() to replace values with None ## What changes were proposed in this pull request? Currently `df.na.replace("", Map[String, String]("NULL" -> null))` will produce exception. This PR enables passing null/None as value in the replacement map in DataFrame.replace(). Note that the replacement map keys and values should still be the same type, while the values can have a mix of null/None and that type. This PR enables following operations for example: `df.na.replace("", Map[String, String]("NULL" -> null))`(scala) `df.na.replace("", Map[Any, Any](60 -> null, 70 -> 80))`(scala) `df.na.replace('Alice', None)`(python) `df.na.replace([10, 20])`(python, replacing with None is by default) One use case could be: I want to replace all the empty strings with null/None because they were incorrectly generated and then drop all null/None data `df.na.replace("", Map("" -> null)).na.drop()`(scala) `df.replace(u'', None).dropna()`(python) ## How was this patch tested? Scala unit test. Python doctest and unit test. Author: bravo-zhang <mzhang1230@gmail.com> Closes #18820 from bravo-zhang/spark-14932.	2017-08-09 17:42:21 -07:00
gatorsmile	2d799d0808	[SPARK-21504][SQL] Add spark version info into table metadata ## What changes were proposed in this pull request? This PR is to add the spark version info in the table metadata. When creating the table, this value is assigned. It can help users find which version of Spark was used to create the table. ## How was this patch tested? N/A Author: gatorsmile <gatorsmile@gmail.com> Closes #18709 from gatorsmile/addVersion.	2017-08-09 08:46:25 -07:00
Xingbo Jiang	031910b0ec	[SPARK-21608][SPARK-9221][SQL] Window rangeBetween() API should allow literal boundary ## What changes were proposed in this pull request? Window rangeBetween() API should allow literal boundary, that means, the window range frame can calculate frame of double/date/timestamp. Example of the use case can be: ``` SELECT val_timestamp, cate, avg(val_timestamp) OVER(PARTITION BY cate ORDER BY val_timestamp RANGE BETWEEN CURRENT ROW AND interval 23 days 4 hours FOLLOWING) FROM testData ``` This PR refactors the Window `rangeBetween` and `rowsBetween` API, while the legacy user code should still be valid. ## How was this patch tested? Add new test cases both in `DataFrameWindowFunctionsSuite` and in `window.sql`. Author: Xingbo Jiang <xingbo.jiang@databricks.com> Closes #18814 from jiangxb1987/literal-boundary.	2017-08-09 13:23:49 +08:00
Shixiong Zhu	6edfff055c	[SPARK-21596][SS] Ensure places calling HDFSMetadataLog.get check the return value ## What changes were proposed in this pull request? When I was investigating a flaky test, I realized that many places don't check the return value of `HDFSMetadataLog.get(batchId: Long): Option[T]`. When a batch is supposed to be there, the caller just ignores None rather than throwing an error. If a bug prevents a query from generating a batch metadata file, this behavior hides it, allows the query to continue running, and eventually deletes metadata logs, which makes the problem hard to debug. This PR ensures that places calling HDFSMetadataLog.get always check the return value. ## How was this patch tested? Jenkins Author: Shixiong Zhu <shixiong@databricks.com> Closes #18799 from zsxwing/SPARK-21596.	2017-08-08 20:20:26 -07:00
Sean Owen	fb54a564d7	[SPARK-20433][BUILD] Bump jackson from 2.6.5 to 2.6.7.1 ## What changes were proposed in this pull request? Taking over https://github.com/apache/spark/pull/18789 ; Closes #18789 Update Jackson to 2.6.7 uniformly, and some components to 2.6.7.1, to get some fixes and prep for Scala 2.12 ## How was this patch tested? Existing tests Author: Sean Owen <sowen@cloudera.com> Closes #18881 from srowen/SPARK-20433.	2017-08-08 18:15:29 -07:00
Liang-Chi Hsieh	ee1304199b	[SPARK-21567][SQL] Dataset should work with type alias ## What changes were proposed in this pull request? If we create a type alias for a type workable with Dataset, the type alias doesn't work with Dataset. A reproducible case looks like: object C { type TwoInt = (Int, Int) def tupleTypeAlias: TwoInt = (1, 1) } Seq(1).toDS().map(_ => ("", C.tupleTypeAlias)) It throws an exception like: type T1 is not a class scala.ScalaReflectionException: type T1 is not a class at scala.reflect.api.Symbols$SymbolApi$class.asClass(Symbols.scala:275) ... This patch accesses the dealias of type in many places in `ScalaReflection` to fix it. ## How was this patch tested? Added test case. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #18813 from viirya/SPARK-21567.	2017-08-08 16:12:41 +08:00
Marcos P. Sanchez	312bebfb6d	[SPARK-21640][FOLLOW-UP][SQL] added errorifexists on IllegalArgumentException message ## What changes were proposed in this pull request? This commit adds a new argument to the IllegalArgumentException message. This recent commit added the argument: `dcac1d57f0` ## How was this patch tested? Unit tests passed. Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Marcos P. Sanchez <mpenate@stratio.com> Closes #18862 from mpenate/feature/exception-errorifexists.	2017-08-07 22:41:57 -07:00
gatorsmile	baf5cac0f8	[SPARK-21648][SQL] Fix confusing assert failure in JDBC source when parallel fetching parameters are not properly provided. ### What changes were proposed in this pull request? ```SQL CREATE TABLE mytesttable1 USING org.apache.spark.sql.jdbc OPTIONS ( url 'jdbc:mysql://${jdbcHostname}:${jdbcPort}/${jdbcDatabase}?user=${jdbcUsername}&password=${jdbcPassword}', dbtable 'mytesttable1', paritionColumn 'state_id', lowerBound '0', upperBound '52', numPartitions '53', fetchSize '10000' ) ``` The option name `paritionColumn` above is misspelled. That means users did not provide a value for `partitionColumn`. In such a case, users hit a confusing error. ``` AssertionError: assertion failed java.lang.AssertionError: assertion failed at scala.Predef$.assert(Predef.scala:156) at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:39) at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:312) ``` ### How was this patch tested? Added a test case Author: gatorsmile <gatorsmile@gmail.com> Closes #18864 from gatorsmile/jdbcPartCol.	2017-08-07 13:04:04 -07:00
Jose Torres	cce25b360e	[SPARK-21565][SS] Propagate metadata in attribute replacement. ## What changes were proposed in this pull request? Propagate metadata in attribute replacement during streaming execution. This is necessary for EventTimeWatermarks consuming replaced attributes. ## How was this patch tested? new unit test, which was verified to fail before the fix Author: Jose Torres <joseph-torres@databricks.com> Closes #18840 from joseph-torres/SPARK-21565.	2017-08-07 12:27:16 -07:00
Mac	4f7ec3a316	[SPARK][DOCS] Added note on meaning of position to substring function ## What changes were proposed in this pull request? Enhanced some existing documentation Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Mac <maclockard@gmail.com> Closes #18710 from maclockard/maclockard-patch-1.	2017-08-07 17:16:03 +01:00
Xiao Li	bbfd6b5d24	[SPARK-21647][SQL] Fix SortMergeJoin when using CROSS ### What changes were proposed in this pull request? author: BoleynSu closes https://github.com/apache/spark/pull/18836 ```Scala val df = Seq((1, 1)).toDF("i", "j") df.createOrReplaceTempView("T") withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") { sql("select * from (select a.i from T a cross join T t where t.i = a.i) as t1 " + "cross join T t2 where t2.i = t1.i").explain(true) } ``` The above code could cause the following exception: ``` SortMergeJoinExec should not take Cross as the JoinType java.lang.IllegalArgumentException: SortMergeJoinExec should not take Cross as the JoinType at org.apache.spark.sql.execution.joins.SortMergeJoinExec.outputOrdering(SortMergeJoinExec.scala:100) ``` Our SortMergeJoinExec supports CROSS. We should not hit such an exception. This PR is to fix the issue. ### How was this patch tested? Modified the two existing test cases. Author: Xiao Li <gatorsmile@gmail.com> Author: Boleyn Su <boleyn.su@gmail.com> Closes #18863 from gatorsmile/pr-18836.	2017-08-08 00:00:01 +08:00
zhoukang	8b69b17f3f	[SPARK-21544][DEPLOY][TEST-MAVEN] Tests jar of some module should not upload twice ## What changes were proposed in this pull request? For the modules below: common/network-common streaming sql/core sql/catalyst tests.jar will install or deploy twice. Like: `[DEBUG] Installing org.apache.spark:spark-streaming_2.11/maven-metadata.xml to /home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/maven-metadata-local.xml [INFO] Installing /home/mi/Work/Spark/scala2.11/spark/streaming/target/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar to /home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/2.1.0-mdh2.1.0.1-SNAPSHOT/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar [DEBUG] Skipped re-installing /home/mi/Work/Spark/scala2.11/spark/streaming/target/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar to /home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/2.1.0-mdh2.1.0.1-SNAPSHOT/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar, seems unchanged` The reason is below: `[DEBUG] (f) artifact = org.apache.spark:spark-streaming_2.11:jar:2.1.0-mdh2.1.0.1-SNAPSHOT [DEBUG] (f) attachedArtifacts = [org.apache.spark:spark-streaming_2.11:test-jar:tests:2.1.0-mdh2.1.0.1-SNAPSHOT, org.apache.spark:spark-streaming_2.11:jar:tests:2.1.0-mdh2.1.0.1-SNAPSHOT, org.apache.spark:spark-streaming_2.11:java-source:sources:2.1.0-mdh2.1.0.1-SNAPSHOT, org.apache.spark:spark-streaming_2.11:java-source:test-sources:2.1.0-mdh2.1.0.1-SNAPSHOT, org.apache.spark:spark-streaming_2.11:javadoc:javadoc:2.1.0-mdh2.1.0.1-SNAPSHOT]` when executing 'mvn deploy' to nexus during release. It will fail since a release nexus cannot be overwritten. ## How was this patch tested? Execute 'mvn clean install -Pyarn -Phadoop-2.6 -Phadoop-provided -DskipTests' Author: zhoukang <zhoukang199191@gmail.com> Closes #18745 from caneGuy/zhoukang/fix-installtwice.	2017-08-07 12:51:39 +01:00
Takeshi Yamamuro	74b47845ea	[SPARK-20963][SQL][FOLLOW-UP] Use UnresolvedSubqueryColumnAliases for visitTableName ## What changes were proposed in this pull request? This pr (follow-up of #18772) used `UnresolvedSubqueryColumnAliases` for `visitTableName` in `AstBuilder`, which is a new unresolved `LogicalPlan` implemented in #18185. ## How was this patch tested? Existing tests Author: Takeshi Yamamuro <yamamuro@apache.org> Closes #18857 from maropu/SPARK-20963-FOLLOWUP.	2017-08-06 10:14:45 -07:00
Yuming Wang	10b3ca3e93	[SPARK-21574][SQL] Point out user to set hive config before SparkSession is initialized ## What changes were proposed in this pull request? Since Spark 2.0.0, SET hive config commands do not pass the values to HiveClient. This PR tells users who try to set hive config to do so before SparkSession is initialized. ## How was this patch tested? manual tests <img width="1637" alt="spark-set" src="https://user-images.githubusercontent.com/5399861/29001141-03f943ee-7ab3-11e7-8584-ba5a5e81f6ad.png"> Author: Yuming Wang <wgyumg@gmail.com> Closes #18769 from wangyum/SPARK-21574.	2017-08-06 10:08:44 -07:00
vinodkc	1ba967b25e	[SPARK-21588][SQL] SQLContext.getConf(key, null) should return null ## What changes were proposed in this pull request? Calling SQLContext.getConf(key, null) for a key that is not defined in the conf and has no default value throws an NPE. It happens only when the conf has a value converter. Added a null check on defaultValue inside SQLConf.getConfString to avoid calling entry.valueConverter(defaultValue). ## How was this patch tested? Added unit test Author: vinodkc <vinod.kc.in@gmail.com> Closes #18852 from vinodkc/br_Fix_SPARK-21588.	2017-08-05 23:04:39 -07:00
Takeshi Yamamuro	990efad1c6	[SPARK-20963][SQL] Support column aliases for join relations in FROM clause ## What changes were proposed in this pull request? This pr added parsing rules to support column aliases for join relations in FROM clause. This pr is a sub-task of #18079. ## How was this patch tested? Added tests in `AnalysisSuite`, `PlanParserSuite,` and `SQLQueryTestSuite`. Author: Takeshi Yamamuro <yamamuro@apache.org> Closes #18772 from maropu/SPARK-20963-2.	2017-08-05 20:35:54 -07:00
arodriguez	dcac1d57f0	[SPARK-21640] Add errorifexists as a valid string for ErrorIfExists save mode ## What changes were proposed in this pull request? This PR includes the changes to make the string "errorifexists" also valid for ErrorIfExists save mode. ## How was this patch tested? Unit tests and manual tests Author: arodriguez <arodriguez@arodriguez.stratio> Closes #18844 from ardlema/SPARK-21640.	2017-08-05 11:21:51 -07:00
hyukjinkwon	ba327ee54c	[SPARK-21485][FOLLOWUP][SQL][DOCS] Describes examples and arguments separately, and note/since in SQL built-in function documentation ## What changes were proposed in this pull request? This PR proposes to separate `extended` into `examples` and `arguments` internally so that both can be separately documented and add `since` and `note` for additional information. For `since`, it looks like users sometimes get confused by, to my knowledge, missing version information. For example, see https://www.mail-archive.com/userspark.apache.org/msg64798.html For a few good examples to check the built documentation, please see both: `from_json` - https://spark-test.github.io/sparksqldoc/#from_json `like` - https://spark-test.github.io/sparksqldoc/#like For `DESCRIBE FUNCTION`, `note` and `since` are added as below: ``` > DESCRIBE FUNCTION EXTENDED rlike; ... Extended Usage: Arguments: ... Examples: ... Note: Use LIKE to match with simple string pattern ``` ``` > DESCRIBE FUNCTION EXTENDED to_json; ... Examples: ... Since: 2.2.0 ``` For the complete documentation, see https://spark-test.github.io/sparksqldoc/ ## How was this patch tested? Manual tests and existing tests. Please see https://spark-test.github.io/sparksqldoc Jenkins tests are needed to double check Author: hyukjinkwon <gurwls223@gmail.com> Closes #18749 from HyukjinKwon/followup-sql-doc-gen.	2017-08-05 10:10:56 -07:00
liuxian	894d5a453a	[SPARK-21580][SQL] Integers in aggregation expressions are wrongly taken as group-by ordinal ## What changes were proposed in this pull request? create temporary view data as select * from values (1, 1), (1, 2), (2, 1), (2, 2), (3, 1), (3, 2) as data(a, b); `select 3, 4, sum(b) from data group by 1, 2;` `select 3 as c, 4 as d, sum(b) from data group by c, d;` When running these two cases, the following exception occurred: `Error in query: GROUP BY position 4 is not in select list (valid range is [1, 3]); line 1 pos 10` The cause of this failure: If an aggregateExpression is an integer, then after the groupExpression is replaced with this aggregateExpression, the groupExpression is still considered an ordinal. The solution: This bug is due to re-entrance of an analyzed plan. We can solve it by using `resolveOperators` in `SubstituteUnresolvedOrdinals`. ## How was this patch tested? Added unit test case Author: liuxian <liu.xian3@zte.com.cn> Closes #18779 from 10110346/groupby.	2017-08-04 22:55:06 -07:00
Shixiong Zhu	6cbd18c9d0	[SPARK-21374][CORE] Fix reading globbed paths from S3 into DF with disabled FS cache ## What changes were proposed in this pull request? This PR replaces #18623 to do some clean up. Closes #18623 ## How was this patch tested? Jenkins Author: Shixiong Zhu <shixiong@databricks.com> Author: Andrey Taptunov <taptunov@amazon.com> Closes #18848 from zsxwing/review-pr18623.	2017-08-04 22:40:04 -07:00
Reynold Xin	5ad1796b9f	[SPARK-21634][SQL] Change OneRowRelation from a case object to case class ## What changes were proposed in this pull request? OneRowRelation is the only plan that is a case object, which causes some issues with makeCopy using a 0-arg constructor. This patch changes it from a case object to a case class. This blocks SPARK-21619. ## How was this patch tested? Should be covered by existing test cases. Author: Reynold Xin <rxin@databricks.com> Closes #18839 from rxin/SPARK-21634.	2017-08-04 10:36:08 -07:00
Yuming Wang	231f67247b	[SPARK-21205][SQL] pmod(number, 0) should be null. ## What changes were proposed in this pull request? Hive `pmod(3.13, 0)`: ```:sql hive> select pmod(3.13, 0); OK NULL Time taken: 2.514 seconds, Fetched: 1 row(s) hive> ``` Spark `mod(3.13, 0)`: ```:sql spark-sql> select mod(3.13, 0); NULL spark-sql> ``` But the Spark `pmod(3.13, 0)`: ```:sql spark-sql> select pmod(3.13, 0); 17/06/25 09:35:58 ERROR SparkSQLDriver: Failed in [select pmod(3.13, 0)] java.lang.NullPointerException at org.apache.spark.sql.catalyst.expressions.Pmod.pmod(arithmetic.scala:504) at org.apache.spark.sql.catalyst.expressions.Pmod.nullSafeEval(arithmetic.scala:432) at org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:419) at org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:323) ... ``` This PR makes `pmod(number, 0)` return null. ## How was this patch tested? unit tests Author: Yuming Wang <wgyumg@gmail.com> Closes #18413 from wangyum/SPARK-21205.	2017-08-04 12:06:08 +02:00
Andrew Ray	25826c77dd	[SPARK-21330][SQL] Bad partitioning does not allow to read a JDBC table with extreme values on the partition column ## What changes were proposed in this pull request? An overflow of the difference of bounds on the partitioning column leads to no data being read. This patch checks for this overflow. ## How was this patch tested? New unit test. Author: Andrew Ray <ray.andrew@gmail.com> Closes #18800 from aray/SPARK-21330.	2017-08-04 08:58:01 +01:00
Shixiong Zhu	0d26b3aa55	[SPARK-21546][SS] dropDuplicates should ignore watermark when it's not a key ## What changes were proposed in this pull request? When the watermark is not a column of `dropDuplicates`, right now it will crash. This PR fixed this issue. ## How was this patch tested? The new unit test. Author: Shixiong Zhu <shixiong@databricks.com> Closes #18822 from zsxwing/SPARK-21546.	2017-08-02 14:02:13 -07:00

4115 commits