ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Bogdan Raducanu	cd6dff78be	[SPARK-25209][SQL] Avoid deserializer check in Dataset.apply when Dataset is actually DataFrame ## What changes were proposed in this pull request? Dataset.apply calls dataset.deserializer (to provide an early error) which ends up calling the full Analyzer on the deserializer. This can take tens of milliseconds, depending on how big the plan is. Since Dataset.apply is called for many Dataset operations such as Dataset.where it can be a significant overhead for short queries. According to a comment in the PR that introduced this check, we can at least remove this check for DataFrames: https://github.com/apache/spark/pull/20402#discussion_r164338267 ## How was this patch tested? Existing tests + manual benchmark Author: Bogdan Raducanu <bogdan@databricks.com> Closes #22201 from bogdanrdc/deserializer-fix.	2018-08-24 04:13:07 +02:00
s71955	b88ddb8a83	[SPARK-23425][SQL] Support wildcard in HDFS path for load table command ## What changes were proposed in this pull request? Problem statement load data command with hdfs file paths consists of wild card strings like * are not working eg: "load data inpath 'hdfs://hacluster/user/ext* into table t1" throws Analysis exception while executing this query ![wildcard_issue](https://user-images.githubusercontent.com/12999161/42673744-9f5c0c16-8621-11e8-8d28-cdc41bbe6efe.PNG) Analysis - Currently fs.exists() API which is used for path validation in load command API cannot resolve the path with wild card pattern, To mitigate this problem i am using globStatus() API another api which can resolve the paths with hdfs supported wildcards like ,? etc(inline with hive wildcard support). Improvement identified as part of this issue -* Currently system wont support wildcard character to be used for folder level path in a local file system. This PR has handled this scenario, the same globStatus API will unify the validation logic of local and non local file systems, this will ensure the behavior consistency between the hdfs and local file path in load command. with this improvement user will be able to use a wildcard character in folder level path of a local file system in load command inline with hive behaviour, in older versions user can use wildcards only in file path of the local file system if they use in folder path system use to give an error by mentioning that not supported. eg: load data local inpath '/localfilesystem/folder* into table t1 ## How was this patch tested? a) Manually tested by executing test-cases in HDFS yarn cluster. Reports is been attached in below section. b) Existing test-case can verify the impact and functionality for local file path scenarios c) A test-case is been added for verifying the functionality when wild card is been used in folder level path of a local file system ## Test Results Note: all ip's were updated to localhost for security reasons. HDFS path details ``` vm1:/opt/ficlient # hadoop fs -ls /user/data/sujith1 Found 2 items -rw-r--r-- 3 shahid hadoop 4802 2018-03-26 15:45 /user/data/sujith1/typeddata60.txt -rw-r--r-- 3 shahid hadoop 4883 2018-03-26 15:45 /user/data/sujith1/typeddata61.txt vm1:/opt/ficlient # hadoop fs -ls /user/data/sujith2 Found 2 items -rw-r--r-- 3 shahid hadoop 4802 2018-03-26 15:45 /user/data/sujith2/typeddata60.txt -rw-r--r-- 3 shahid hadoop 4883 2018-03-26 15:45 /user/data/sujith2/typeddata61.txt ``` positive scenario by specifying complete file path to know about record size ``` 0: jdbc:hive2://localhost:22550/default> create table wild_spark (time timestamp, name string, isright boolean, datetoday date, num binary, height double, score float, decimaler decimal(10,0), id tinyint, age int, license bigint, length smallint) row format delimited fields terminated by ','; +---------+--+ \| Result \| +---------+--+ +---------+--+ No rows selected (1.217 seconds) 0: jdbc:hive2://localhost:22550/default> load data inpath '/user/data/sujith1/typeddata60.txt' into table wild_spark; +---------+--+ \| Result \| +---------+--+ +---------+--+ No rows selected (4.236 seconds) 0: jdbc:hive2://localhost:22550/default> load data inpath '/user/data/sujith1/typeddata61.txt' into table wild_spark; +---------+--+ \| Result \| +---------+--+ +---------+--+ No rows selected (0.602 seconds) 0: jdbc:hive2://localhost:22550/default> select count() from wild_spark; +-----------+--+ \| count(1) \| +-----------+--+ \| 121 \| +-----------+--+ 1 row selected (18.529 seconds) 0: jdbc:hive2://localhost:22550/default> ``` With wild card character in file path ``` 0: jdbc:hive2://localhost:22550/default> create table spark_withWildChar (time timestamp, name string, isright boolean, datetoday date, num binary, height double, score float, decimaler decimal(10,0), id tinyint, age int, license bigint, length smallint) row format delimited fields terminated by ','; +---------+--+ \| Result \| +---------+--+ +---------+--+ No rows selected (0.409 seconds) 0: jdbc:hive2://localhost:22550/default> load data inpath '/user/data/sujith1/type' into table spark_withWildChar; +---------+--+ \| Result \| +---------+--+ +---------+--+ No rows selected (1.502 seconds) 0: jdbc:hive2://localhost:22550/default> select count() from spark_withWildChar; +-----------+--+ \| count(1) \| +-----------+--+ \| 121 \| +-----------+--+ ``` with ? wild card scenario ``` 0: jdbc:hive2://localhost:22550/default> create table spark_withWildChar_DiffChar (time timestamp, name string, isright boolean, datetoday date, num binary, height double, score float, decimaler decimal(10,0), id tinyint, age int, license bigint, length smallint) row format delimited fields terminated by ','; +---------+--+ \| Result \| +---------+--+ +---------+--+ No rows selected (0.489 seconds) 0: jdbc:hive2://localhost:22550/default> load data inpath '/user/data/sujith1/?ypeddata60.txt' into table spark_withWildChar_DiffChar; +---------+--+ \| Result \| +---------+--+ +---------+--+ No rows selected (1.152 seconds) 0: jdbc:hive2://localhost:22550/default> load data inpath '/user/data/sujith1/?ypeddata61.txt' into table spark_withWildChar_DiffChar; +---------+--+ \| Result \| +---------+--+ +---------+--+ No rows selected (0.644 seconds) 0: jdbc:hive2://localhost:22550/default> select count() from spark_withWildChar_DiffChar; +-----------+--+ \| count(1) \| +-----------+--+ \| 121 \| +-----------+--+ 1 row selected (16.078 seconds) ``` with folder level wild card scenario ``` 0: jdbc:hive2://localhost:22550/default> create table spark_withWildChar_folderlevel (time timestamp, name string, isright boolean, datetoday date, num binary, height double, score float, decimaler decimal(10,0), id tinyint, age int, license bigint, length smallint) row format delimited fields terminated by ','; +---------+--+ \| Result \| +---------+--+ +---------+--+ No rows selected (0.489 seconds) 0: jdbc:hive2://localhost:22550/default> load data inpath '/user/data/suji/' into table spark_withWildChar_folderlevel; +---------+--+ \| Result \| +---------+--+ +---------+--+ No rows selected (1.152 seconds) 0: jdbc:hive2://localhost:22550/default> select count() from spark_withWildChar_folderlevel; +-----------+--+ \| count(1) \| +-----------+--+ \| 242 \| +-----------+--+ 1 row selected (16.078 seconds) ``` Negative scenario invalid path ``` 0: jdbc:hive2://localhost:22550/default> load data inpath '/user/data/sujiinvalid/' into table spark_withWildChar_folder; Error: org.apache.spark.sql.AnalysisException: LOAD DATA input path does not exist: /user/data/sujiinvalid/; (state=,code=0) 0: jdbc:hive2://localhost:22550/default> ``` Hive Test results- file level ``` 0: jdbc:hive2://localhost:21066/> create table hive_withWildChar_files (time timestamp, name string, isright boolean, datetoday date, num binary, height double, score float, decimaler decimal(10,0), id tinyint, age int, license bigint, length smallint) stored as TEXTFILE; No rows affected (0.723 seconds) 0: jdbc:hive2://localhost:21066/> load data inpath '/user/data/sujith1/type' into table hive_withWildChar_files; INFO : Loading data to table default.hive_withwildchar_files from hdfs://hacluster/user/sujith1/type* No rows affected (0.682 seconds) 0: jdbc:hive2://localhost:21066/> select count() from hive_withWildChar_files; +------+--+ \| _c0 \| +------+--+ \| 121 \| +------+--+ 1 row selected (50.832 seconds) ``` Hive Test results- folder level ``` 0: jdbc:hive2://localhost:21066/> create table hive_withWildChar_folder (time timestamp, name string, isright boolean, datetoday date, num binary, height double, score float, decimaler decimal(10,0), id tinyint, age int, license bigint, length smallint) stored as TEXTFILE; No rows affected (0.459 seconds) 0: jdbc:hive2://localhost:21066/> load data inpath '/user/data/suji/' into table hive_withWildChar_folder; INFO : Loading data to table default.hive_withwildchar_folder from hdfs://hacluster/user/data/suji/* No rows affected (0.76 seconds) 0: jdbc:hive2://localhost:21066/> select count(*) from hive_withWildChar_folder; +------+--+ \| _c0 \| +------+--+ \| 242 \| +------+--+ 1 row selected (46.483 seconds) ``` Closes #20611 from sujith71955/master_wldcardsupport. Lead-authored-by: s71955 <sujithchacko.2010@gmail.com> Co-authored-by: sujith71955 <sujithchacko.2010@gmail.com> Signed-off-by: hyukjinkwon <gurwls223@apache.org>	2018-08-24 09:54:30 +08:00
Jose Torres	8ed0449285	[SPARK-25204][SS] Fix race in rate source test. ## What changes were proposed in this pull request? Fix a race in the rate source tests. We need a better way of testing restart behavior. ## How was this patch tested? unit test Closes #22191 from jose-torres/racetest. Authored-by: Jose Torres <torres.joseph.f+github@gmail.com> Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>	2018-08-23 12:14:27 -07:00
Takuya UESHIN	a9aacdf1c2	[SPARK-25208][SQL] Loosen Cast.forceNullable for DecimalType. ## What changes were proposed in this pull request? Casting to `DecimalType` is not always needed to force nullable. If the decimal type to cast is wider than original type, or only truncating or precision loss, the casted value won't be `null`. ## How was this patch tested? Added and modified tests. Closes #22200 from ueshin/issues/SPARK-25208/cast_nullable_decimal. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2018-08-23 22:48:26 +08:00
Rao Fu	5d572fc7c3	[SPARK-25126][SQL] Avoid creating Reader for all orc files ## What changes were proposed in this pull request? [SPARK-25126] (https://issues.apache.org/jira/browse/SPARK-25126) reports loading a large number of orc files consumes a lot of memory in both 2.0 and 2.3. The issue is caused by creating a Reader for every orc file in order to infer the schema. In OrFileOperator.ReadSchema, a Reader is created for every file although only the first valid one is used. This uses significant amount of memory when there `paths` have a lot of files. In 2.3 a different code path (OrcUtils.readSchema) is used for inferring schema for orc files. This commit changes both functions to create Reader lazily. ## How was this patch tested? Pass the Jenkins with a newly added test case by dongjoon-hyun Closes #22157 from raofu/SPARK-25126. Lead-authored-by: Rao Fu <rao@coupang.com> Co-authored-by: Dongjoon Hyun <dongjoon@apache.org> Co-authored-by: Rao Fu <raofu04@gmail.com> Signed-off-by: hyukjinkwon <gurwls223@apache.org>	2018-08-23 22:00:20 +08:00
Bruce Robbins	8cc591c91a	[SPARK-25164][SQL] Avoid rebuilding column and path list for each column in parquet reader ## What changes were proposed in this pull request? VectorizedParquetRecordReader::initializeInternal rebuilds the column list and path list once for each column. Therefore, it indirectly iterates 2\colCount\colCount times for each parquet file. This inefficiency impacts jobs that read parquet-backed tables with many columns and many files. Jobs that read tables with few columns or few files are not impacted. This PR changes initializeInternal so that it builds each list only once. I ran benchmarks on my laptop with 1 worker thread, running this query: <pre> sql("select * from parquet_backed_table where id1 = 1").collect </pre> There are roughly one matching row for every 425 rows, and the matching rows are sprinkled pretty evenly throughout the table (that is, every page for column <code>id1</code> has at least one matching row). 6000 columns, 1 million rows, 67 32M files: master \| branch \| improvement -------\|---------\|----------- 10.87 min \| 6.09 min \| 44% 6000 columns, 1 million rows, 23 98m files: master \| branch \| improvement -------\|---------\|----------- 7.39 min \| 5.80 min \| 21% 600 columns 10 million rows, 67 32M files: master \| branch \| improvement -------\|---------\|----------- 1.95 min \| 1.96 min \| -0.5% 60 columns, 100 million rows, 67 32M files: master \| branch \| improvement -------\|---------\|----------- 0.55 min \| 0.55 min \| 0% ## How was this patch tested? - sql unit tests - pyspark-sql tests Closes #22188 from bersprockets/SPARK-25164. Authored-by: Bruce Robbins <bersprockets@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2018-08-23 14:52:23 +08:00
Takeshi Yamamuro	2a0a8f753b	[SPARK-23034][SQL] Show RDD/relation names in RDD/Hive table scan nodes ## What changes were proposed in this pull request? This pr proposed to show RDD/relation names in RDD/Hive table scan nodes. This change made these names show up in the webUI and explain results. For example; ``` scala> sql("CREATE TABLE t(c1 int) USING hive") scala> sql("INSERT INTO t VALUES(1)") scala> spark.table("t").explain() == Physical Plan == Scan hive default.t [c1#8], HiveTableRelation `default`.`t`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#8] ^^^^^^^^^^^ ``` <img width="212" alt="spark-pr-hive" src="https://user-images.githubusercontent.com/692303/44501013-51264c80-a6c6-11e8-94f8-0704aee83bb6.png"> Closes #20226 ## How was this patch tested? Added tests in `DataFrameSuite`, `DatasetSuite`, and `HiveExplainSuite` Closes #22153 from maropu/pr20226. Lead-authored-by: Takeshi Yamamuro <yamamuro@apache.org> Co-authored-by: Tejas Patil <tejasp@fb.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2018-08-23 14:26:10 +08:00
Takuya UESHIN	49720906c9	[SPARK-23932][SQL][FOLLOW-UP] Fix an example of zip_with function. ## What changes were proposed in this pull request? This is a follow-up pr of #22031 which added `zip_with` function to fix an example. ## How was this patch tested? Existing tests. Closes #22194 from ueshin/issues/SPARK-23932/fix_examples. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: hyukjinkwon <gurwls223@apache.org>	2018-08-23 14:17:29 +08:00
Reynold Xin	0295ad40de	[SPARK-25127] DataSourceV2: Remove SupportsPushDownCatalystFilters ## What changes were proposed in this pull request? They depend on internal Expression APIs. Let's see how far we can get without it. ## How was this patch tested? Just some code removal. There's no existing tests as far as I can tell so it's easy to remove. Closes #22185 from rxin/SPARK-25127. Authored-by: Reynold Xin <rxin@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2018-08-23 08:10:45 +08:00
Tathagata Das	3106324986	[SPARK-25184][SS] Fixed race condition in StreamExecution that caused flaky test in FlatMapGroupsWithState ## What changes were proposed in this pull request? The race condition that caused test failure is between 2 threads. - The MicrobatchExecution thread that processes inputs to produce answers and then generates progress events. - The test thread that generates some input data, checked the answer and then verified the query generated progress event. The synchronization structure between these threads is as follows 1. MicrobatchExecution thread, in every batch, does the following in order. a. Processes batch input to generate answer. b. Signals `awaitProgressLockCondition` to wake up threads waiting for progress using `awaitOffset` c. Generates progress event 2. Test execution thread a. Calls `awaitOffset` to wait for progress, which waits on `awaitProgressLockCondition`. b. As soon as `awaitProgressLockCondition` is signaled, it would move on the in the test to check answer. c. Finally, it would verify the last generated progress event. What can happen is the following sequence of events: 2a -> 1a -> 1b -> 2b -> 2c -> 1c. In other words, the progress event may be generated after the test tries to verify it. The solution has two steps. 1. Signal the waiting thread after the progress event has been generated, that is, after `finishTrigger()`. 2. Increase the timeout of `awaitProgressLockCondition.await(100 ms)` to a large value. This latter is to ensure that test thread for keeps waiting on `awaitProgressLockCondition`until the MicroBatchExecution thread explicitly signals it. With the existing small timeout of 100ms the following sequence can occur. - MicroBatchExecution thread updates committed offsets - Test thread waiting on `awaitProgressLockCondition` accidentally times out after 100 ms, finds that the committed offsets have been updated, therefore returns from `awaitOffset` and moves on to the progress event tests. - MicroBatchExecution thread then generates progress event and signals. But the test thread has already attempted to verify the event and failed. By increasing the timeout to large (e.g., `streamingTimeoutMs = 60 seconds`, similar to `awaitInitialization`), this above type of race condition is also avoided. ## How was this patch tested? Ran locally many times. Closes #22182 from tdas/SPARK-25184. Authored-by: Tathagata Das <tathagata.das1565@gmail.com> Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>	2018-08-22 12:22:53 -07:00
cclauss	71f38ac242	[SPARK-23698][PYTHON] Resolve undefined names in Python 3 ## What changes were proposed in this pull request? Fix issues arising from the fact that builtins __file__, __long__, __raw_input()__, __unicode__, __xrange()__, etc. were all removed from Python 3. __Undefined names__ have the potential to raise [NameError](https://docs.python.org/3/library/exceptions.html#NameError) at runtime. ## How was this patch tested? * $ __python2 -m flake8 . --count --select=E9,F82 --show-source --statistics__ * $ __python3 -m flake8 . --count --select=E9,F82 --show-source --statistics__ holdenk flake8 testing of https://github.com/apache/spark on Python 3.6.3 $ __python3 -m flake8 . --count --select=E901,E999,F821,F822,F823 --show-source --statistics__ ``` ./dev/merge_spark_pr.py:98:14: F821 undefined name 'raw_input' result = raw_input("\n%s (y/n): " % prompt) ^ ./dev/merge_spark_pr.py:136:22: F821 undefined name 'raw_input' primary_author = raw_input( ^ ./dev/merge_spark_pr.py:186:16: F821 undefined name 'raw_input' pick_ref = raw_input("Enter a branch name [%s]: " % default_branch) ^ ./dev/merge_spark_pr.py:233:15: F821 undefined name 'raw_input' jira_id = raw_input("Enter a JIRA id [%s]: " % default_jira_id) ^ ./dev/merge_spark_pr.py:278:20: F821 undefined name 'raw_input' fix_versions = raw_input("Enter comma-separated fix version(s) [%s]: " % default_fix_versions) ^ ./dev/merge_spark_pr.py:317:28: F821 undefined name 'raw_input' raw_assignee = raw_input( ^ ./dev/merge_spark_pr.py:430:14: F821 undefined name 'raw_input' pr_num = raw_input("Which pull request would you like to merge? (e.g. 34): ") ^ ./dev/merge_spark_pr.py:442:18: F821 undefined name 'raw_input' result = raw_input("Would you like to use the modified title? (y/n): ") ^ ./dev/merge_spark_pr.py:493:11: F821 undefined name 'raw_input' while raw_input("\n%s (y/n): " % pick_prompt).lower() == "y": ^ ./dev/create-release/releaseutils.py:58:16: F821 undefined name 'raw_input' response = raw_input("%s [y/n]: " % msg) ^ ./dev/create-release/releaseutils.py:152:38: F821 undefined name 'unicode' author = unidecode.unidecode(unicode(author, "UTF-8")).strip() ^ ./python/setup.py:37:11: F821 undefined name '__version__' VERSION = __version__ ^ ./python/pyspark/cloudpickle.py:275:18: F821 undefined name 'buffer' dispatch[buffer] = save_buffer ^ ./python/pyspark/cloudpickle.py:807:18: F821 undefined name 'file' dispatch[file] = save_file ^ ./python/pyspark/sql/conf.py:61:61: F821 undefined name 'unicode' if not isinstance(obj, str) and not isinstance(obj, unicode): ^ ./python/pyspark/sql/streaming.py:25:21: F821 undefined name 'long' intlike = (int, long) ^ ./python/pyspark/streaming/dstream.py:405:35: F821 undefined name 'long' return self._sc._jvm.Time(long(timestamp * 1000)) ^ ./sql/hive/src/test/resources/data/scripts/dumpdata_script.py:21:10: F821 undefined name 'xrange' for i in xrange(50): ^ ./sql/hive/src/test/resources/data/scripts/dumpdata_script.py:22:14: F821 undefined name 'xrange' for j in xrange(5): ^ ./sql/hive/src/test/resources/data/scripts/dumpdata_script.py:23:18: F821 undefined name 'xrange' for k in xrange(20022): ^ 20 F821 undefined name 'raw_input' 20 ``` Closes #20838 from cclauss/fix-undefined-names. Authored-by: cclauss <cclauss@bluewin.ch> Signed-off-by: Bryan Cutler <cutlerb@gmail.com>	2018-08-22 10:06:59 -07:00
Wenchen Fan	e754887182	[SPARK-24882][SQL] improve data source v2 API ## What changes were proposed in this pull request? Improve the data source v2 API according to the [design doc](https://docs.google.com/document/d/1DDXCTCrup4bKWByTalkXWgavcPdvur8a4eEu8x1BzPM/edit?usp=sharing) summary of the changes 1. rename `ReadSupport` -> `DataSourceReader` -> `InputPartition` -> `InputPartitionReader` to `BatchReadSupportProvider` -> `BatchReadSupport` -> `InputPartition`/`PartitionReaderFactory` -> `PartitionReader`. Similar renaming also happens at streaming and write APIs. 2. create `ScanConfig` to store query specific information like operator pushdown result, streaming offsets, etc. This makes batch and streaming `ReadSupport`(previouslly named `DataSourceReader`) immutable. All other methods take `ScanConfig` as input, which implies applying operator pushdown and getting streaming offsets happen before all other things(get input partitions, report statistics, etc.). 3. separate `InputPartition` to `InputPartition` and `PartitionReaderFactory`. This is a natural separation, data splitting and reading are orthogonal and we should not mix them in one interfaces. This also makes the naming consistent between read and write API: `PartitionReaderFactory` vs `DataWriterFactory`. 4. separate the batch and streaming interfaces. Sometimes it's painful to force the streaming interface to extend batch interface, as we may need to override some batch methods to return false, or even leak the streaming concept to batch API(e.g. `DataWriterFactory#createWriter(partitionId, taskId, epochId)`) Some follow-ups we should do after this PR (tracked by https://issues.apache.org/jira/browse/SPARK-25186 ): 1. Revisit the life cycle of `ReadSupport` instances. Currently I keep it same as the previous `DataSourceReader`, i.e. the life cycle is bound to the batch/stream query. This fits streaming very well but may not be perfect for batch source. We can also consider to let `ReadSupport.newScanConfigBuilder` take `DataSourceOptions` as parameter, if we decide to change the life cycle. 2. Add `WriteConfig`. This is similar to `ScanConfig` and makes the write API more flexible. But it's only needed when we add the `replaceWhere` support, and it needs to change the streaming execution engine for this new concept, which I think is better to be done in another PR. 3. Refine the document. This PR adds/changes a lot of document and it's very likely that some people may have better ideas. 4. Figure out the life cycle of `CustomMetrics`. It looks to me that it should be bound to a `ScanConfig`, but we need to change `ProgressReporter` to get the `ScanConfig`. Better to be done in another PR. 5. Better operator pushdown API. This PR keeps the pushdown API as it was, i.e. using the `SupportsPushdownXYZ` traits. We can design a better API using build pattern, but this is a complicated design and deserves an individual JIRA ticket and design doc. 6. Improve the continuous streaming engine to only create a new `ScanConfig` when re-configuring. 7. Remove `SupportsPushdownCatalystFilter`. This is actually not a must-have for file source, we can change the hive partition pruning to use the public `Filter`. ## How was this patch tested? existing tests. Closes #22009 from cloud-fan/redesign. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com>	2018-08-22 00:10:55 -07:00
Marco Gaido	55f36641ff	[SPARK-25093][SQL] Avoid recompiling regexp for comments multiple times ## What changes were proposed in this pull request? The PR moves the compilation of the regexp for code formatting outside the method which is called for each code block when splitting expressions, in order to avoid recompiling the regexp every time. Credit should be given to Izek Greenfield. ## How was this patch tested? existing UTs Closes #22135 from mgaido91/SPARK-25093. Authored-by: Marco Gaido <marcogaido91@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2018-08-22 14:31:51 +08:00
Wenchen Fan	4a9c9d8f9a	[SPARK-25159][SQL] json schema inference should only trigger one job ## What changes were proposed in this pull request? This fixes a perf regression caused by https://github.com/apache/spark/pull/21376 . We should not use `RDD#toLocalIterator`, which triggers one Spark job per RDD partition. This is very bad for RDDs with a lot of small partitions. To fix it, this PR introduces a way to access SQLConf in the scheduler event loop thread, so that we don't need to use `RDD#toLocalIterator` anymore in `JsonInferSchema`. ## How was this patch tested? a new test Closes #22152 from cloud-fan/conf. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com>	2018-08-21 22:21:08 -07:00
Takeshi Yamamuro	07737c87d6	[SPARK-23711][SPARK-25140][SQL] Catch correct exceptions when expr codegen fails ## What changes were proposed in this pull request? This pr is to fix bugs when expr codegen fails; we need to catch `java.util.concurrent.ExecutionException` instead of `InternalCompilerException` and `CompileException` . This handling is the same with the `WholeStageCodegenExec ` one: `60af2501e1/sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala (L585)` ## How was this patch tested? Added tests in `CodeGeneratorWithInterpretedFallbackSuite` Closes #22154 from maropu/SPARK-25140. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Xiao Li <gatorsmile@gmail.com>	2018-08-21 22:17:44 -07:00
Tathagata Das	a998e9d829	[MINOR] Added import to fix compilation ## What changes were proposed in this pull request? Two back to PRs implicitly conflicted by one PR removing an existing import that the other PR needed. This did not cause explicit conflict as the import already existed, but not used. https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-compile-maven-hadoop-2.7/8226/consoleFull ``` [info] Compiling 342 Scala sources and 97 Java sources to /home/jenkins/workspace/spark-master-compile-maven-hadoop-2.7/sql/core/target/scala-2.11/classes... [warn] /home/jenkins/workspace/spark-master-compile-maven-hadoop-2.7/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala:128: value ENABLE_JOB_SUMMARY in object ParquetOutputFormat is deprecated: see corresponding Javadoc for more information. [warn] && conf.get(ParquetOutputFormat.ENABLE_JOB_SUMMARY) == null) { [warn] ^ [error] /home/jenkins/workspace/spark-master-compile-maven-hadoop-2.7/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/statefulOperators.scala:95: value asJava is not a member of scala.collection.immutable.Map[String,Long] [error] new java.util.HashMap(customMetrics.mapValues(long2Long).asJava) [error] ^ [warn] one warning found [error] one error found [error] Compile failed at Aug 21, 2018 4:04:35 PM [12.827s] ``` ## How was this patch tested? It compiles! Closes #22175 from tdas/fix-build. Authored-by: Tathagata Das <tathagata.das1565@gmail.com> Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>	2018-08-21 17:08:15 -07:00
Jungtaek Lim	42035a4fec	[SPARK-24441][SS] Expose total estimated size of states in HDFSBackedStateStoreProvider ## What changes were proposed in this pull request? This patch exposes the estimation of size of cache (loadedMaps) in HDFSBackedStateStoreProvider as a custom metric of StateStore. The rationalize of the patch is that state backed by HDFSBackedStateStoreProvider will consume more memory than the number what we can get from query status due to caching multiple versions of states. The memory footprint to be much larger than query status reports in situations where the state store is getting a lot of updates: while shallow-copying map incurs additional small memory usages due to the size of map entities and references, but row objects will still be shared across the versions. If there're lots of updates between batches, less row objects will be shared and more row objects will exist in memory consuming much memory then what we expect. While HDFSBackedStateStore refers loadedMaps in HDFSBackedStateStoreProvider directly, there would be only one `StateStoreWriter` which refers a StateStoreProvider, so the value is not exposed as well as being aggregated multiple times. Current state metrics are safe to aggregate for the same reason. ## How was this patch tested? Tested manually. Below is the snapshot of UI page which is reflected by the patch: <img width="601" alt="screen shot 2018-06-05 at 10 16 16 pm" src="https://user-images.githubusercontent.com/1317309/40978481-b46ad324-690e-11e8-9b0f-e80528612a62.png"> Please refer "estimated size of states cache in provider total" as well as "count of versions in state cache in provider". Closes #21469 from HeartSaVioR/SPARK-24441. Authored-by: Jungtaek Lim <kabhwan@gmail.com> Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>	2018-08-21 15:28:31 -07:00
Gengliang Wang	ac0174e55a	[SPARK-25129][SQL] Make the mapping of com.databricks.spark.avro to built-in module configurable ## What changes were proposed in this pull request? In https://issues.apache.org/jira/browse/SPARK-24924, the data source provider com.databricks.spark.avro is mapped to the new package org.apache.spark.sql.avro . As per the discussion in the [Jira](https://issues.apache.org/jira/browse/SPARK-24924) and PR #22119, we should make the mapping configurable. This PR also improve the error message when data source of Avro/Kafka is not found. ## How was this patch tested? Unit test Closes #22133 from gengliangwang/configurable_avro_mapping. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com>	2018-08-21 15:26:24 -07:00
Jungtaek Lim	6c5cb85856	[SPARK-24763][SS] Remove redundant key data from value in streaming aggregation ## What changes were proposed in this pull request? This patch proposes a new flag option for stateful aggregation: remove redundant key data from value. Enabling new option runs similar with current, and uses less memory for state according to key/value fields of state operator. Please refer below link to see detailed perf. test result: https://issues.apache.org/jira/browse/SPARK-24763?focusedCommentId=16536539&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16536539 Since the state between enabling the option and disabling the option is not compatible, the option is set to 'disable' by default (to ensure backward compatibility), and OffsetSeqMetadata would prevent modifying the option after executing query. ## How was this patch tested? Modify unit tests to cover both disabling option and enabling option. Also did manual tests to see whether propose patch improves state memory usage. Closes #21733 from HeartSaVioR/SPARK-24763. Authored-by: Jungtaek Lim <kabhwan@gmail.com> Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>	2018-08-21 15:22:42 -07:00
Xingbo Jiang	4fb96e5105	[SPARK-25114][CORE] Fix RecordBinaryComparator when subtraction between two words is divisible by Integer.MAX_VALUE. ## What changes were proposed in this pull request? https://github.com/apache/spark/pull/22079#discussion_r209705612 It is possible for two objects to be unequal and yet we consider them as equal with this code, if the long values are separated by Int.MaxValue. This PR fixes the issue. ## How was this patch tested? Add new test cases in `RecordBinaryComparatorSuite`. Closes #22101 from jiangxb1987/fix-rbc. Authored-by: Xingbo Jiang <xingbo.jiang@databricks.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com>	2018-08-20 23:13:31 -07:00
seancxmao	f984ec75ed	[SPARK-25132][SQL] Case-insensitive field resolution when reading from Parquet ## What changes were proposed in this pull request? Spark SQL returns NULL for a column whose Hive metastore schema and Parquet schema are in different letter cases, regardless of spark.sql.caseSensitive set to true or false. This PR aims to add case-insensitive field resolution for ParquetFileFormat. * Do case-insensitive resolution only if Spark is in case-insensitive mode. * Field resolution should fail if there is ambiguity, i.e. more than one field is matched. ## How was this patch tested? Unit tests added. Closes #22148 from seancxmao/SPARK-25132-Parquet. Authored-by: seancxmao <seancxmao@gmail.com> Signed-off-by: hyukjinkwon <gurwls223@apache.org>	2018-08-21 10:34:23 +08:00
Koert Kuipers	b461acb2d9	[SPARK-25134][SQL] Csv column pruning with checking of headers throws incorrect error ## What changes were proposed in this pull request? When column pruning is turned on the checking of headers in the csv should only be for the fields in the requiredSchema, not the dataSchema, because column pruning means only requiredSchema is read. ## How was this patch tested? Added 2 unit tests where column pruning is turned on/off and csv headers are checked againt schema Please review http://spark.apache.org/contributing.html before opening a pull request. Closes #22123 from koertkuipers/feat-csv-column-pruning-and-check-header. Authored-by: Koert Kuipers <koert@tresata.com> Signed-off-by: hyukjinkwon <gurwls223@apache.org>	2018-08-21 10:23:55 +08:00
Dongjoon Hyun	883f3aff67	[SPARK-25144][SQL][TEST] Free aggregate map when task ends ## What changes were proposed in this pull request? [SPARK-25144](https://issues.apache.org/jira/browse/SPARK-25144) reports memory leaks on Apache Spark 2.0.2 ~ 2.3.2-RC5. The bug is already fixed via #21738 as a part of SPARK-21743. This PR only adds a test case to prevent any future regression. ```scala scala> case class Foo(bar: Option[String]) scala> val ds = List(Foo(Some("bar"))).toDS scala> val result = ds.flatMap(_.bar).distinct scala> result.rdd.isEmpty 18/08/19 23:01:54 WARN Executor: Managed memory leak detected; size = 8650752 bytes, TID = 125 res0: Boolean = false ``` ## How was this patch tested? Pass the Jenkins with a new added test case. Closes #22155 from dongjoon-hyun/SPARK-25144-2. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: hyukjinkwon <gurwls223@apache.org>	2018-08-21 09:08:36 +08:00
Gengliang Wang	60af2501e1	[SPARK-25160][SQL] Avro: remove sql configuration spark.sql.avro.outputTimestampType ## What changes were proposed in this pull request? In the PR for supporting logical timestamp types https://github.com/apache/spark/pull/21935, a SQL configuration spark.sql.avro.outputTimestampType is added, so that user can specify the output timestamp precision they want. With PR https://github.com/apache/spark/pull/21847, the output file can be written with user specified types. So there is no need to have such trivial configuration. Otherwise to make it consistent we need to add configuration for all the Catalyst types that can be converted into different Avro types. This PR also add a test case for user specified output schema with different timestamp types. ## How was this patch tested? Unit test Closes #22151 from gengliangwang/removeOutputTimestampType. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: hyukjinkwon <gurwls223@apache.org>	2018-08-20 20:42:27 +08:00
Takuya UESHIN	6b8fbbfb11	[SPARK-25141][SQL][TEST] Modify tests for higher-order functions to check bind method. ## What changes were proposed in this pull request? We should also check `HigherOrderFunction.bind` method passes expected parameters. This pr modifies tests for higher-order functions to check `bind` method. ## How was this patch tested? Modified tests. Closes #22131 from ueshin/issues/SPARK-25141/bind_test. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2018-08-19 09:18:47 +09:00
Maxim Gekk	a8a1ac01c4	[SPARK-24959][SQL] Speed up count() for JSON and CSV ## What changes were proposed in this pull request? In the PR, I propose to skip invoking of the CSV/JSON parser per each line in the case if the required schema is empty. Added benchmarks for `count()` shows performance improvement up to 3.5 times. Before: ``` Count a dataset with 10 columns: Best/Avg Time(ms) Rate(M/s) Per Row(ns) -------------------------------------------------------------------------------------- JSON count() 7676 / 7715 1.3 767.6 CSV count() 3309 / 3363 3.0 330.9 ``` After: ``` Count a dataset with 10 columns: Best/Avg Time(ms) Rate(M/s) Per Row(ns) -------------------------------------------------------------------------------------- JSON count() 2104 / 2156 4.8 210.4 CSV count() 2332 / 2386 4.3 233.2 ``` ## How was this patch tested? It was tested by `CSVSuite` and `JSONSuite` as well as on added benchmarks. Author: Maxim Gekk <maxim.gekk@databricks.com> Author: Maxim Gekk <max.gekk@gmail.com> Closes #21909 from MaxGekk/empty-schema-optimization.	2018-08-18 10:34:49 -07:00
Xiangrui Meng	f454d5287f	[MINOR][DOC][SQL] use one line for annotation arg value ## What changes were proposed in this pull request? Put annotation args in one line, or API doc generation will fail. ~~~ [error] /Users/meng/src/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:1559: annotation argument needs to be a constant; found: "_FUNC_(expr) - Returns the character length of string data or number of bytes of ".+("binary data. The length of string data includes the trailing spaces. The length of binary ").+("data includes binary zeros.") [error] "binary data. The length of string data includes the trailing spaces. The length of binary " + [error] ^ [info] No documentation generated with unsuccessful compiler run [error] one error found [error] (catalyst/compile:doc) Scaladoc generation failed [error] Total time: 27 s, completed Aug 17, 2018 3:20:08 PM ~~~ ## How was this patch tested? sbt catalyst/compile:doc passed Closes #22137 from mengxr/minor-doc-fix. Authored-by: Xiangrui Meng <meng@databricks.com> Signed-off-by: hyukjinkwon <gurwls223@apache.org>	2018-08-18 17:20:34 +08:00
Takuya UESHIN	c1ffb3c10a	[SPARK-23938][SQL][FOLLOW-UP][TEST] Nullabilities of value arguments should be true. ## What changes were proposed in this pull request? This is a follow-up pr of #22017 which added `map_zip_with` function. In the test, when creating a lambda function, we use the `valueContainsNull` values for the nullabilities of the value arguments, but we should've used `true` as the same as `bind` method because the values might be `null` if the keys don't match. ## How was this patch tested? Added small tests and existing tests. Closes #22126 from ueshin/issues/SPARK-23938/fix_tests. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2018-08-17 14:13:37 +09:00
Marek Novotny	8af61fba03	[SPARK-25122][SQL] Deduplication of supports equals code ## What changes were proposed in this pull request? The method ```*supportEquals``` determining whether elements of a data type could be used as items in a hash set or as keys in a hash map is duplicated across multiple collection and higher-order functions. This PR suggests to deduplicate the method. ## How was this patch tested? Run tests in: - DataFrameFunctionsSuite - CollectionExpressionsSuite - HigherOrderExpressionsSuite Closes #22110 from mn-mikke/SPARK-25122. Authored-by: Marek Novotny <mn.mikke@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2018-08-17 11:52:16 +08:00
codeatri	f16140975d	[SPARK-23940][SQL] Add transform_values SQL function ## What changes were proposed in this pull request? This pr adds `transform_values` function which applies the function to each entry of the map and transforms the values. ```javascript > SELECT transform_values(map(array(1, 2, 3), array(1, 2, 3)), (k,v) -> v + 1); map(1->2, 2->3, 3->4) > SELECT transform_values(map(array(1, 2, 3), array(1, 2, 3)), (k,v) -> k + v); map(1->2, 2->4, 3->6) ``` ## How was this patch tested? New Tests added to `DataFrameFunctionsSuite` `HigherOrderFunctionsSuite` `SQLQueryTestSuite` Closes #22045 from codeatri/SPARK-23940. Authored-by: codeatri <nehapatil6@gmail.com> Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2018-08-17 11:50:06 +09:00
Dilip Biswal	e59dd8fa0c	[SPARK-25092][SQL][FOLLOWUP] Add RewriteCorrelatedScalarSubquery in list of nonExcludableRules ## What changes were proposed in this pull request? Add RewriteCorrelatedScalarSubquery in the list of nonExcludableRules since its used to transform correlated scalar subqueries to joins. ## How was this patch tested? Added test in OptimizerRuleExclusionSuite Author: Dilip Biswal <dbiswal@us.ibm.com> Closes #22108 from dilipbiswal/scalar_exclusion.	2018-08-16 15:55:00 -07:00
Sandeep Singh	ea63a7a168	[SPARK-23932][SQL] Higher order function zip_with ## What changes were proposed in this pull request? Merges the two given arrays, element-wise, into a single array using function. If one array is shorter, nulls are appended at the end to match the length of the longer array, before applying function: ``` SELECT zip_with(ARRAY[1, 3, 5], ARRAY['a', 'b', 'c'], (x, y) -> (y, x)); -- [ROW('a', 1), ROW('b', 3), ROW('c', 5)] SELECT zip_with(ARRAY[1, 2], ARRAY[3, 4], (x, y) -> x + y); -- [4, 6] SELECT zip_with(ARRAY['a', 'b', 'c'], ARRAY['d', 'e', 'f'], (x, y) -> concat(x, y)); -- ['ad', 'be', 'cf'] SELECT zip_with(ARRAY['a'], ARRAY['d', null, 'f'], (x, y) -> coalesce(x, y)); -- ['a', null, 'f'] ``` ## How was this patch tested? Added tests Closes #22031 from techaddict/SPARK-23932. Authored-by: Sandeep Singh <sandeep@techaddict.me> Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2018-08-16 23:02:45 +09:00
codeatri	5b4a38d826	[SPARK-23939][SQL] Add transform_keys function ## What changes were proposed in this pull request? This pr adds transform_keys function which applies the function to each entry of the map and transforms the keys. ```javascript > SELECT transform_keys(map(array(1, 2, 3), array(1, 2, 3)), (k,v) -> k + 1); map(2->1, 3->2, 4->3) > SELECT transform_keys(map(array(1, 2, 3), array(1, 2, 3)), (k,v) -> k + v); map(2->1, 4->2, 6->3) ``` ## How was this patch tested? Added tests. Closes #22013 from codeatri/SPARK-23939. Authored-by: codeatri <nehapatil6@gmail.com> Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2018-08-16 17:07:33 +09:00
Bo Meng	7822c3f8d1	[SPARK-25082][SQL] improve the javadoc for expm1() ## What changes were proposed in this pull request? Correct the javadoc for expm1() function. ## How was this patch tested? None. It is a minor issue. Closes #22115 from bomeng/25082. Authored-by: Bo Meng <bo.meng@jd.com> Signed-off-by: hyukjinkwon <gurwls223@apache.org>	2018-08-16 14:14:42 +08:00
Liang-Chi Hsieh	19c45db477	[SPARK-24505][SQL] Convert strings in codegen to blocks: Cast and BoundAttribute ## What changes were proposed in this pull request? This is split from #21520. This includes changes of `BoundAttribute` and `Cast`. This patch also adds few convenient APIs: ```scala CodeGenerator.freshVariable(name: String, dt: DataType): VariableValue CodeGenerator.freshVariable(name: String, javaClass: Class[_]): VariableValue JavaCode.javaType(javaClass: Class[_]): Inline JavaCode.javaType(dataType: DataType): Inline JavaCode.boxedType(dataType: DataType): Inline ``` ## How was this patch tested? Existing tests. Closes #21537 from viirya/SPARK-24505-1. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: hyukjinkwon <gurwls223@apache.org>	2018-08-15 14:32:51 +08:00
Bryan Cutler	ed075e1ff6	[SPARK-23874][SQL][PYTHON] Upgrade Apache Arrow to 0.10.0 ## What changes were proposed in this pull request? Upgrade Apache Arrow to 0.10.0 Version 0.10.0 has a number of bug fixes and improvements with the following pertaining directly to usage in Spark: * Allow for adding BinaryType support ARROW-2141 * Bug fix related to array serialization ARROW-1973 * Python2 str will be made into an Arrow string instead of bytes ARROW-2101 * Python bytearrays are supported in as input to pyarrow ARROW-2141 * Java has common interface for reset to cleanup complex vectors in Spark ArrowWriter ARROW-1962 * Cleanup pyarrow type equality checks ARROW-2423 * ArrowStreamWriter should not hold references to ArrowBlocks ARROW-2632, ARROW-2645 * Improved low level handling of messages for RecordBatch ARROW-2704 ## How was this patch tested? existing tests Author: Bryan Cutler <cutlerb@gmail.com> Closes #21939 from BryanCutler/arrow-upgrade-010.	2018-08-14 17:13:38 -07:00
Kris Mok	3c614d0565	[SPARK-25113][SQL] Add logging to CodeGenerator when any generated method's bytecode size goes above HugeMethodLimit ## What changes were proposed in this pull request? Add logging for all generated methods from the `CodeGenerator` whose bytecode size goes above 8000 bytes. This is to help with gathering stats on how often Spark is generating methods too big to be JIT'd. It covers all codegen scenarios, include whole-stage codegen and also individual expression codegen, e.g. unsafe projection, mutable projection, etc. ## How was this patch tested? Manually tested that logging did happen when generated method was above 8000 bytes. Also added a new unit test case to `CodeGenerationSuite` to verify that the logging did happen. Author: Kris Mok <kris.mok@databricks.com> Closes #22103 from rednaxelafx/codegen-8k-logging.	2018-08-14 16:40:00 -07:00
Alessandro Bellina	b81e3031fd	[SPARK-25043] print master and appId from spark-sql on startup ## What changes were proposed in this pull request? A small change to print the master and appId from spark-sql as with logging turned down all the way (`log4j.logger.org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver=WARN`), we may not know this information easily. This adds the following string before the `spark-sql>` prompt shows on the screen. `Spark master: yarn, Application Id: application_123456789_12345` ## How was this patch tested? I ran spark-sql locally and saw the appId displayed as expected. Please review http://spark.apache.org/contributing.html before opening a pull request. Closes #22025 from abellina/SPARK-25043_print_master_and_app_id_from_sparksql. Lead-authored-by: Alessandro Bellina <abellina@gmail.com> Co-authored-by: Alessandro Bellina <abellina@yahoo-inc.com> Signed-off-by: Thomas Graves <tgraves@apache.org>	2018-08-14 13:15:55 -05:00
Marek Novotny	42263fd0cb	[SPARK-23938][SQL] Add map_zip_with function ## What changes were proposed in this pull request? This PR adds a new SQL function called ```map_zip_with```. It merges the two given maps into a single map by applying function to the pair of values with the same key. ## How was this patch tested? Added new tests into: - DataFrameFunctionsSuite.scala - HigherOrderFunctionsSuite.scala Closes #22017 from mn-mikke/SPARK-23938. Authored-by: Marek Novotny <mn.mikke@gmail.com> Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2018-08-14 21:14:15 +09:00
Dongjoon Hyun	e2ab7deae7	[MINOR][SQL][DOC] Fix `to_json` example in function description and doc ## What changes were proposed in this pull request? This PR fixes the an example for `to_json` in doc and function description. - http://spark.apache.org/docs/2.3.0/api/sql/#to_json - `describe function extended` ## How was this patch tested? Pass the Jenkins with the updated test. Closes #22096 from dongjoon-hyun/minor_json. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: hyukjinkwon <gurwls223@apache.org>	2018-08-14 19:59:39 +08:00
Marco Gaido	c220cc42ab	[SPARK-25028][SQL] Avoid NPE when analyzing partition with NULL values ## What changes were proposed in this pull request? `ANALYZE TABLE ... PARTITION(...) COMPUTE STATISTICS` can fail with a NPE if a partition column contains a NULL value. The PR avoids the NPE, replacing the `NULL` values with the default partition placeholder. ## How was this patch tested? added UT Closes #22036 from mgaido91/SPARK-25028. Authored-by: Marco Gaido <marcogaido91@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2018-08-14 00:59:18 +08:00
Takuya UESHIN	b804ca5771	[SPARK-23908][SQL][FOLLOW-UP] Rename inputs to arguments, and add argument type check. ## What changes were proposed in this pull request? This is a follow-up pr of #21954 to address comments. - Rename ambiguous name `inputs` to `arguments`. - Add argument type check and remove hacky workaround. - Address other small comments. ## How was this patch tested? Existing tests and some additional tests. Closes #22075 from ueshin/issues/SPARK-23908/fup1. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2018-08-13 20:58:29 +08:00
Maxim Gekk	ab06c25350	[SPARK-24391][SQL] Support arrays of any types by from_json ## What changes were proposed in this pull request? The PR removes a restriction for element types of array type which exists in `from_json` for the root type. Currently, the function can handle only arrays of structs. Even array of primitive types is disallowed. The PR allows arrays of any types currently supported by JSON datasource. Here is an example of an array of a primitive type: ``` scala> import org.apache.spark.sql.functions._ scala> val df = Seq("[1, 2, 3]").toDF("a") scala> val schema = new ArrayType(IntegerType, false) scala> val arr = df.select(from_json($"a", schema)) scala> arr.printSchema root \|-- jsontostructs(a): array (nullable = true) \| \|-- element: integer (containsNull = true) ``` and result of converting of the json string to the `ArrayType`: ``` scala> arr.show +----------------+ \|jsontostructs(a)\| +----------------+ \| [1, 2, 3]\| +----------------+ ``` ## How was this patch tested? I added a few positive and negative tests: - array of primitive types - array of arrays - array of structs - array of maps Closes #21439 from MaxGekk/from_json-array. Lead-authored-by: Maxim Gekk <maxim.gekk@databricks.com> Co-authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: hyukjinkwon <gurwls223@apache.org>	2018-08-13 20:13:09 +08:00
Takuya UESHIN	b270bccfff	[SPARK-25096][SQL] Loosen nullability if the cast is force-nullable. ## What changes were proposed in this pull request? In type coercion for complex types, if the found type is force-nullable to cast, we should loosen the nullability to be able to cast. Also for map key type, we can't use the type. ## How was this patch tested? Added some test. Closes #22086 from ueshin/issues/SPARK-25096/fix_type_coercion. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: hyukjinkwon <gurwls223@apache.org>	2018-08-13 19:27:17 +08:00
Gengliang Wang	be2238fb50	[SPARK-24774][SQL] Avro: Support logical decimal type ## What changes were proposed in this pull request? Support Avro logical date type: https://avro.apache.org/docs/1.8.2/spec.html#Decimal ## How was this patch tested? Unit test Closes #22037 from gengliangwang/avro_decimal. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2018-08-13 08:29:07 +08:00
忍冬	a90b1f5d93	[MINOR][DOC] Fix Java example code in Column's comments ## What changes were proposed in this pull request? Fix scaladoc in Column ## How was this patch tested? None Closes #22069 from sadhen/fix_doc_minor. Authored-by: 忍冬 <rendong@wacai.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2018-08-12 08:26:21 -05:00
Kazuaki Ishizaki	d177234792	[SQL][TEST][MINOR] Add missing codes to ParquetCompressionCodecPrecedenceSuite ## What changes were proposed in this pull request? This PR adds codes to ``"Test `spark.sql.parquet.compression.codec` config"` in `ParquetCompressionCodecPrecedenceSuite`. ## How was this patch tested? Existing UTs Closes #22083 from kiszk/ParquetCompressionCodecPrecedenceSuite. Authored-by: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Signed-off-by: hyukjinkwon <gurwls223@apache.org>	2018-08-12 19:43:25 +08:00
Dilip Biswal	c3be2cd347	[SPARK-25092] Add RewriteExceptAll and RewriteIntersectAll in the list of nonExcludableRules ## What changes were proposed in this pull request? Add RewriteExceptAll and RewriteIntersectAll in the list of nonExcludableRules as the rewrites are essential for the functioning of EXCEPT ALL and INTERSECT ALL feature. ## How was this patch tested? Added test in OptimizerRuleExclusionSuite. Closes #22080 from dilipbiswal/exceptall_rewrite_exclusion. Authored-by: Dilip Biswal <dbiswal@us.ibm.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com>	2018-08-11 22:51:11 -07:00
Kazuhiro Sera	8ec25cd67e	Fix typos detected by github.com/client9/misspell ## What changes were proposed in this pull request? Fixing typos is sometimes very hard. It's not so easy to visually review them. Recently, I discovered a very useful tool for it, [misspell](https://github.com/client9/misspell). This pull request fixes minor typos detected by [misspell](https://github.com/client9/misspell) except for the false positives. If you would like me to work on other files as well, let me know. ## How was this patch tested? ### before ``` $ misspell . \| grep -v '.js' R/pkg/R/SQLContext.R:354:43: "definiton" is a misspelling of "definition" R/pkg/R/SQLContext.R:424:43: "definiton" is a misspelling of "definition" R/pkg/R/SQLContext.R:445:43: "definiton" is a misspelling of "definition" R/pkg/R/SQLContext.R:495:43: "definiton" is a misspelling of "definition" NOTICE-binary:454:16: "containd" is a misspelling of "contained" R/pkg/R/context.R:46:43: "definiton" is a misspelling of "definition" R/pkg/R/context.R:74:43: "definiton" is a misspelling of "definition" R/pkg/R/DataFrame.R:591:48: "persistance" is a misspelling of "persistence" R/pkg/R/streaming.R:166:44: "occured" is a misspelling of "occurred" R/pkg/inst/worker/worker.R:65:22: "ouput" is a misspelling of "output" R/pkg/tests/fulltests/test_utils.R:106:25: "environemnt" is a misspelling of "environment" common/kvstore/src/test/java/org/apache/spark/util/kvstore/InMemoryStoreSuite.java:38:39: "existant" is a misspelling of "existent" common/kvstore/src/test/java/org/apache/spark/util/kvstore/LevelDBSuite.java:83:39: "existant" is a misspelling of "existent" common/network-common/src/main/java/org/apache/spark/network/crypto/TransportCipher.java:243:46: "transfered" is a misspelling of "transferred" common/network-common/src/main/java/org/apache/spark/network/sasl/SaslEncryption.java:234:19: "transfered" is a misspelling of "transferred" common/network-common/src/main/java/org/apache/spark/network/sasl/SaslEncryption.java:238:63: "transfered" is a misspelling of "transferred" common/network-common/src/main/java/org/apache/spark/network/sasl/SaslEncryption.java:244:46: "transfered" is a misspelling of "transferred" common/network-common/src/main/java/org/apache/spark/network/sasl/SaslEncryption.java:276:39: "transfered" is a misspelling of "transferred" common/network-common/src/main/java/org/apache/spark/network/util/AbstractFileRegion.java:27:20: "transfered" is a misspelling of "transferred" common/unsafe/src/test/scala/org/apache/spark/unsafe/types/UTF8StringPropertyCheckSuite.scala:195:15: "orgin" is a misspelling of "origin" core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala:621:39: "gauranteed" is a misspelling of "guaranteed" core/src/main/scala/org/apache/spark/status/storeTypes.scala:113:29: "ect" is a misspelling of "etc" core/src/main/scala/org/apache/spark/storage/DiskStore.scala:282:18: "transfered" is a misspelling of "transferred" core/src/main/scala/org/apache/spark/util/ListenerBus.scala:64:17: "overriden" is a misspelling of "overridden" core/src/test/scala/org/apache/spark/ShuffleSuite.scala:211:7: "substracted" is a misspelling of "subtracted" core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala:1922:49: "agriculteur" is a misspelling of "agriculture" core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala:2468:84: "truely" is a misspelling of "truly" core/src/test/scala/org/apache/spark/storage/FlatmapIteratorSuite.scala:25:18: "persistance" is a misspelling of "persistence" core/src/test/scala/org/apache/spark/storage/FlatmapIteratorSuite.scala:26:69: "persistance" is a misspelling of "persistence" data/streaming/AFINN-111.txt:1219:0: "humerous" is a misspelling of "humorous" dev/run-pip-tests:55:28: "enviroments" is a misspelling of "environments" dev/run-pip-tests:91:37: "virutal" is a misspelling of "virtual" dev/merge_spark_pr.py:377:72: "accross" is a misspelling of "across" dev/merge_spark_pr.py:378:66: "accross" is a misspelling of "across" dev/run-pip-tests:126:25: "enviroments" is a misspelling of "environments" docs/configuration.md:1830:82: "overriden" is a misspelling of "overridden" docs/structured-streaming-programming-guide.md:525:45: "processs" is a misspelling of "processes" docs/structured-streaming-programming-guide.md:1165:61: "BETWEN" is a misspelling of "BETWEEN" docs/sql-programming-guide.md:1891:810: "behaivor" is a misspelling of "behavior" examples/src/main/python/sql/arrow.py:98:8: "substract" is a misspelling of "subtract" examples/src/main/python/sql/arrow.py:103:27: "substract" is a misspelling of "subtract" licenses/LICENSE-heapq.txt:5:63: "Stichting" is a misspelling of "Stitching" licenses/LICENSE-heapq.txt:6:2: "Mathematisch" is a misspelling of "Mathematics" licenses/LICENSE-heapq.txt:262:29: "Stichting" is a misspelling of "Stitching" licenses/LICENSE-heapq.txt:262:39: "Mathematisch" is a misspelling of "Mathematics" licenses/LICENSE-heapq.txt:269:49: "Stichting" is a misspelling of "Stitching" licenses/LICENSE-heapq.txt:269:59: "Mathematisch" is a misspelling of "Mathematics" licenses/LICENSE-heapq.txt:274:2: "STICHTING" is a misspelling of "STITCHING" licenses/LICENSE-heapq.txt:274:12: "MATHEMATISCH" is a misspelling of "MATHEMATICS" licenses/LICENSE-heapq.txt:276:29: "STICHTING" is a misspelling of "STITCHING" licenses/LICENSE-heapq.txt:276:39: "MATHEMATISCH" is a misspelling of "MATHEMATICS" licenses-binary/LICENSE-heapq.txt:5:63: "Stichting" is a misspelling of "Stitching" licenses-binary/LICENSE-heapq.txt:6:2: "Mathematisch" is a misspelling of "Mathematics" licenses-binary/LICENSE-heapq.txt:262:29: "Stichting" is a misspelling of "Stitching" licenses-binary/LICENSE-heapq.txt:262:39: "Mathematisch" is a misspelling of "Mathematics" licenses-binary/LICENSE-heapq.txt:269:49: "Stichting" is a misspelling of "Stitching" licenses-binary/LICENSE-heapq.txt:269:59: "Mathematisch" is a misspelling of "Mathematics" licenses-binary/LICENSE-heapq.txt:274:2: "STICHTING" is a misspelling of "STITCHING" licenses-binary/LICENSE-heapq.txt:274:12: "MATHEMATISCH" is a misspelling of "MATHEMATICS" licenses-binary/LICENSE-heapq.txt:276:29: "STICHTING" is a misspelling of "STITCHING" licenses-binary/LICENSE-heapq.txt:276:39: "MATHEMATISCH" is a misspelling of "MATHEMATICS" mllib/src/main/resources/org/apache/spark/ml/feature/stopwords/hungarian.txt:170:0: "teh" is a misspelling of "the" mllib/src/main/resources/org/apache/spark/ml/feature/stopwords/portuguese.txt:53:0: "eles" is a misspelling of "eels" mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala:99:20: "Euclidian" is a misspelling of "Euclidean" mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala:539:11: "Euclidian" is a misspelling of "Euclidean" mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala:77:36: "Teh" is a misspelling of "The" mllib/src/main/scala/org/apache/spark/mllib/clustering/StreamingKMeans.scala:230:24: "inital" is a misspelling of "initial" mllib/src/main/scala/org/apache/spark/mllib/stat/MultivariateOnlineSummarizer.scala:276:9: "Euclidian" is a misspelling of "Euclidean" mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala:237:26: "descripiton" is a misspelling of "descriptions" python/pyspark/find_spark_home.py:30:13: "enviroment" is a misspelling of "environment" python/pyspark/context.py:937:12: "supress" is a misspelling of "suppress" python/pyspark/context.py:938:12: "supress" is a misspelling of "suppress" python/pyspark/context.py:939:12: "supress" is a misspelling of "suppress" python/pyspark/context.py:940:12: "supress" is a misspelling of "suppress" python/pyspark/heapq3.py:6:63: "Stichting" is a misspelling of "Stitching" python/pyspark/heapq3.py:7:2: "Mathematisch" is a misspelling of "Mathematics" python/pyspark/heapq3.py:263:29: "Stichting" is a misspelling of "Stitching" python/pyspark/heapq3.py:263:39: "Mathematisch" is a misspelling of "Mathematics" python/pyspark/heapq3.py:270:49: "Stichting" is a misspelling of "Stitching" python/pyspark/heapq3.py:270:59: "Mathematisch" is a misspelling of "Mathematics" python/pyspark/heapq3.py:275:2: "STICHTING" is a misspelling of "STITCHING" python/pyspark/heapq3.py:275:12: "MATHEMATISCH" is a misspelling of "MATHEMATICS" python/pyspark/heapq3.py:277:29: "STICHTING" is a misspelling of "STITCHING" python/pyspark/heapq3.py:277:39: "MATHEMATISCH" is a misspelling of "MATHEMATICS" python/pyspark/heapq3.py:713:8: "probabilty" is a misspelling of "probability" python/pyspark/ml/clustering.py:1038:8: "Currenlty" is a misspelling of "Currently" python/pyspark/ml/stat.py:339:23: "Euclidian" is a misspelling of "Euclidean" python/pyspark/ml/regression.py:1378:20: "paramter" is a misspelling of "parameter" python/pyspark/mllib/stat/_statistics.py:262:8: "probabilty" is a misspelling of "probability" python/pyspark/rdd.py:1363:32: "paramter" is a misspelling of "parameter" python/pyspark/streaming/tests.py:825:42: "retuns" is a misspelling of "returns" python/pyspark/sql/tests.py:768:29: "initalization" is a misspelling of "initialization" python/pyspark/sql/tests.py:3616:31: "initalize" is a misspelling of "initialize" resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerBackendUtil.scala:120:39: "arbitary" is a misspelling of "arbitrary" resource-managers/mesos/src/test/scala/org/apache/spark/deploy/mesos/MesosClusterDispatcherArgumentsSuite.scala:26:45: "sucessfully" is a misspelling of "successfully" resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala:358:27: "constaints" is a misspelling of "constraints" resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/YarnClusterSuite.scala:111:24: "senstive" is a misspelling of "sensitive" sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala:1063:5: "overwirte" is a misspelling of "overwrite" sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala:1348:17: "compatability" is a misspelling of "compatibility" sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala:77:36: "paramter" is a misspelling of "parameter" sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:1374:22: "precendence" is a misspelling of "precedence" sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisSuite.scala:238:27: "unnecassary" is a misspelling of "unnecessary" sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ConditionalExpressionSuite.scala:212:17: "whn" is a misspelling of "when" sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinHelper.scala:147:60: "timestmap" is a misspelling of "timestamp" sql/core/src/test/scala/org/apache/spark/sql/TPCDSQuerySuite.scala:150:45: "precentage" is a misspelling of "percentage" sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchemaSuite.scala:135:29: "infered" is a misspelling of "inferred" sql/hive/src/test/resources/golden/udf_instr-1-2e76f819563dbaba4beb51e3a130b922:1:52: "occurance" is a misspelling of "occurrence" sql/hive/src/test/resources/golden/udf_instr-2-32da357fc754badd6e3898dcc8989182:1:52: "occurance" is a misspelling of "occurrence" sql/hive/src/test/resources/golden/udf_locate-1-6e41693c9c6dceea4d7fab4c02884e4e:1:63: "occurance" is a misspelling of "occurrence" sql/hive/src/test/resources/golden/udf_locate-2-d9b5934457931447874d6bb7c13de478:1:63: "occurance" is a misspelling of "occurrence" sql/hive/src/test/resources/golden/udf_translate-2-f7aa38a33ca0df73b7a1e6b6da4b7fe8:9:79: "occurence" is a misspelling of "occurrence" sql/hive/src/test/resources/golden/udf_translate-2-f7aa38a33ca0df73b7a1e6b6da4b7fe8:13:110: "occurence" is a misspelling of "occurrence" sql/hive/src/test/resources/ql/src/test/queries/clientpositive/annotate_stats_join.q:46:105: "distint" is a misspelling of "distinct" sql/hive/src/test/resources/ql/src/test/queries/clientpositive/auto_sortmerge_join_11.q:29:3: "Currenly" is a misspelling of "Currently" sql/hive/src/test/resources/ql/src/test/queries/clientpositive/avro_partitioned.q:72:15: "existant" is a misspelling of "existent" sql/hive/src/test/resources/ql/src/test/queries/clientpositive/decimal_udf.q:25:3: "substraction" is a misspelling of "subtraction" sql/hive/src/test/resources/ql/src/test/queries/clientpositive/groupby2_map_multi_distinct.q:16:51: "funtion" is a misspelling of "function" sql/hive/src/test/resources/ql/src/test/queries/clientpositive/groupby_sort_8.q:15:30: "issueing" is a misspelling of "issuing" sql/hive/src/test/scala/org/apache/spark/sql/sources/HadoopFsRelationTest.scala:669:52: "wiht" is a misspelling of "with" sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/session/HiveSessionImpl.java:474:9: "Refering" is a misspelling of "Referring" ``` ### after ``` $ misspell . \| grep -v '.js' common/network-common/src/main/java/org/apache/spark/network/util/AbstractFileRegion.java:27:20: "transfered" is a misspelling of "transferred" core/src/main/scala/org/apache/spark/status/storeTypes.scala:113:29: "ect" is a misspelling of "etc" core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala:1922:49: "agriculteur" is a misspelling of "agriculture" data/streaming/AFINN-111.txt:1219:0: "humerous" is a misspelling of "humorous" licenses/LICENSE-heapq.txt:5:63: "Stichting" is a misspelling of "Stitching" licenses/LICENSE-heapq.txt:6:2: "Mathematisch" is a misspelling of "Mathematics" licenses/LICENSE-heapq.txt:262:29: "Stichting" is a misspelling of "Stitching" licenses/LICENSE-heapq.txt:262:39: "Mathematisch" is a misspelling of "Mathematics" licenses/LICENSE-heapq.txt:269:49: "Stichting" is a misspelling of "Stitching" licenses/LICENSE-heapq.txt:269:59: "Mathematisch" is a misspelling of "Mathematics" licenses/LICENSE-heapq.txt:274:2: "STICHTING" is a misspelling of "STITCHING" licenses/LICENSE-heapq.txt:274:12: "MATHEMATISCH" is a misspelling of "MATHEMATICS" licenses/LICENSE-heapq.txt:276:29: "STICHTING" is a misspelling of "STITCHING" licenses/LICENSE-heapq.txt:276:39: "MATHEMATISCH" is a misspelling of "MATHEMATICS" licenses-binary/LICENSE-heapq.txt:5:63: "Stichting" is a misspelling of "Stitching" licenses-binary/LICENSE-heapq.txt:6:2: "Mathematisch" is a misspelling of "Mathematics" licenses-binary/LICENSE-heapq.txt:262:29: "Stichting" is a misspelling of "Stitching" licenses-binary/LICENSE-heapq.txt:262:39: "Mathematisch" is a misspelling of "Mathematics" licenses-binary/LICENSE-heapq.txt:269:49: "Stichting" is a misspelling of "Stitching" licenses-binary/LICENSE-heapq.txt:269:59: "Mathematisch" is a misspelling of "Mathematics" licenses-binary/LICENSE-heapq.txt:274:2: "STICHTING" is a misspelling of "STITCHING" licenses-binary/LICENSE-heapq.txt:274:12: "MATHEMATISCH" is a misspelling of "MATHEMATICS" licenses-binary/LICENSE-heapq.txt:276:29: "STICHTING" is a misspelling of "STITCHING" licenses-binary/LICENSE-heapq.txt:276:39: "MATHEMATISCH" is a misspelling of "MATHEMATICS" mllib/src/main/resources/org/apache/spark/ml/feature/stopwords/hungarian.txt:170:0: "teh" is a misspelling of "the" mllib/src/main/resources/org/apache/spark/ml/feature/stopwords/portuguese.txt:53:0: "eles" is a misspelling of "eels" mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala:99:20: "Euclidian" is a misspelling of "Euclidean" mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala:539:11: "Euclidian" is a misspelling of "Euclidean" mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala:77:36: "Teh" is a misspelling of "The" mllib/src/main/scala/org/apache/spark/mllib/stat/MultivariateOnlineSummarizer.scala:276:9: "Euclidian" is a misspelling of "Euclidean" python/pyspark/heapq3.py:6:63: "Stichting" is a misspelling of "Stitching" python/pyspark/heapq3.py:7:2: "Mathematisch" is a misspelling of "Mathematics" python/pyspark/heapq3.py:263:29: "Stichting" is a misspelling of "Stitching" python/pyspark/heapq3.py:263:39: "Mathematisch" is a misspelling of "Mathematics" python/pyspark/heapq3.py:270:49: "Stichting" is a misspelling of "Stitching" python/pyspark/heapq3.py:270:59: "Mathematisch" is a misspelling of "Mathematics" python/pyspark/heapq3.py:275:2: "STICHTING" is a misspelling of "STITCHING" python/pyspark/heapq3.py:275:12: "MATHEMATISCH" is a misspelling of "MATHEMATICS" python/pyspark/heapq3.py:277:29: "STICHTING" is a misspelling of "STITCHING" python/pyspark/heapq3.py:277:39: "MATHEMATISCH" is a misspelling of "MATHEMATICS" python/pyspark/ml/stat.py:339:23: "Euclidian" is a misspelling of "Euclidean" ``` Closes #22070 from seratch/fix-typo. Authored-by: Kazuhiro Sera <seratch@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2018-08-11 21:23:36 -05:00
yucai	41a7de6002	[SPARK-25084][SQL] "distribute by" on multiple columns (wrap in brackets) may lead to codegen issue ## What changes were proposed in this pull request? "distribute by" on multiple columns (wrap in brackets) may lead to codegen issue. Simple way to reproduce: ```scala val df = spark.range(1000) val columns = (0 until 400).map{ i => s"id as id$i" } val distributeExprs = (0 until 100).map(c => s"id$c").mkString(",") df.selectExpr(columns : _).createTempView("test") spark.sql(s"select from test distribute by ($distributeExprs)").count() ``` ## How was this patch tested? Add UT. Closes #22066 from yucai/SPARK-25084. Authored-by: yucai <yyu1@ebay.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2018-08-11 21:38:31 +08:00

1 2 3 4 5 ...

6941 commits