## What changes were proposed in this pull request?
Add RewriteCorrelatedScalarSubquery in the list of nonExcludableRules since its used to transform correlated scalar subqueries to joins.
## How was this patch tested?
Added test in OptimizerRuleExclusionSuite
Author: Dilip Biswal <dbiswal@us.ibm.com>
Closes#22108 from dilipbiswal/scalar_exclusion.
## What changes were proposed in this pull request?
Merges the two given arrays, element-wise, into a single array using function. If one array is shorter, nulls are appended at the end to match the length of the longer array, before applying function:
```
SELECT zip_with(ARRAY[1, 3, 5], ARRAY['a', 'b', 'c'], (x, y) -> (y, x)); -- [ROW('a', 1), ROW('b', 3), ROW('c', 5)]
SELECT zip_with(ARRAY[1, 2], ARRAY[3, 4], (x, y) -> x + y); -- [4, 6]
SELECT zip_with(ARRAY['a', 'b', 'c'], ARRAY['d', 'e', 'f'], (x, y) -> concat(x, y)); -- ['ad', 'be', 'cf']
SELECT zip_with(ARRAY['a'], ARRAY['d', null, 'f'], (x, y) -> coalesce(x, y)); -- ['a', null, 'f']
```
## How was this patch tested?
Added tests
Closes#22031 from techaddict/SPARK-23932.
Authored-by: Sandeep Singh <sandeep@techaddict.me>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
## What changes were proposed in this pull request?
This pr adds transform_keys function which applies the function to each entry of the map and transforms the keys.
```javascript
> SELECT transform_keys(map(array(1, 2, 3), array(1, 2, 3)), (k,v) -> k + 1);
map(2->1, 3->2, 4->3)
> SELECT transform_keys(map(array(1, 2, 3), array(1, 2, 3)), (k,v) -> k + v);
map(2->1, 4->2, 6->3)
```
## How was this patch tested?
Added tests.
Closes#22013 from codeatri/SPARK-23939.
Authored-by: codeatri <nehapatil6@gmail.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
## What changes were proposed in this pull request?
Correct the javadoc for expm1() function.
## How was this patch tested?
None. It is a minor issue.
Closes#22115 from bomeng/25082.
Authored-by: Bo Meng <bo.meng@jd.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
This is split from #21520. This includes changes of `BoundAttribute` and `Cast`.
This patch also adds few convenient APIs:
```scala
CodeGenerator.freshVariable(name: String, dt: DataType): VariableValue
CodeGenerator.freshVariable(name: String, javaClass: Class[_]): VariableValue
JavaCode.javaType(javaClass: Class[_]): Inline
JavaCode.javaType(dataType: DataType): Inline
JavaCode.boxedType(dataType: DataType): Inline
```
## How was this patch tested?
Existing tests.
Closes#21537 from viirya/SPARK-24505-1.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
Upgrade Apache Arrow to 0.10.0
Version 0.10.0 has a number of bug fixes and improvements with the following pertaining directly to usage in Spark:
* Allow for adding BinaryType support ARROW-2141
* Bug fix related to array serialization ARROW-1973
* Python2 str will be made into an Arrow string instead of bytes ARROW-2101
* Python bytearrays are supported in as input to pyarrow ARROW-2141
* Java has common interface for reset to cleanup complex vectors in Spark ArrowWriter ARROW-1962
* Cleanup pyarrow type equality checks ARROW-2423
* ArrowStreamWriter should not hold references to ArrowBlocks ARROW-2632, ARROW-2645
* Improved low level handling of messages for RecordBatch ARROW-2704
## How was this patch tested?
existing tests
Author: Bryan Cutler <cutlerb@gmail.com>
Closes#21939 from BryanCutler/arrow-upgrade-010.
## What changes were proposed in this pull request?
Add logging for all generated methods from the `CodeGenerator` whose bytecode size goes above 8000 bytes.
This is to help with gathering stats on how often Spark is generating methods too big to be JIT'd. It covers all codegen scenarios, include whole-stage codegen and also individual expression codegen, e.g. unsafe projection, mutable projection, etc.
## How was this patch tested?
Manually tested that logging did happen when generated method was above 8000 bytes.
Also added a new unit test case to `CodeGenerationSuite` to verify that the logging did happen.
Author: Kris Mok <kris.mok@databricks.com>
Closes#22103 from rednaxelafx/codegen-8k-logging.
## What changes were proposed in this pull request?
A small change to print the master and appId from spark-sql as with logging turned down all the way (`log4j.logger.org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver=WARN`), we may not know this information easily. This adds the following string before the `spark-sql>` prompt shows on the screen.
`Spark master: yarn, Application Id: application_123456789_12345`
## How was this patch tested?
I ran spark-sql locally and saw the appId displayed as expected.
Please review http://spark.apache.org/contributing.html before opening a pull request.
Closes#22025 from abellina/SPARK-25043_print_master_and_app_id_from_sparksql.
Lead-authored-by: Alessandro Bellina <abellina@gmail.com>
Co-authored-by: Alessandro Bellina <abellina@yahoo-inc.com>
Signed-off-by: Thomas Graves <tgraves@apache.org>
## What changes were proposed in this pull request?
This PR adds a new SQL function called ```map_zip_with```. It merges the two given maps into a single map by applying function to the pair of values with the same key.
## How was this patch tested?
Added new tests into:
- DataFrameFunctionsSuite.scala
- HigherOrderFunctionsSuite.scala
Closes#22017 from mn-mikke/SPARK-23938.
Authored-by: Marek Novotny <mn.mikke@gmail.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
## What changes were proposed in this pull request?
This PR fixes the an example for `to_json` in doc and function description.
- http://spark.apache.org/docs/2.3.0/api/sql/#to_json
- `describe function extended`
## How was this patch tested?
Pass the Jenkins with the updated test.
Closes#22096 from dongjoon-hyun/minor_json.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
`ANALYZE TABLE ... PARTITION(...) COMPUTE STATISTICS` can fail with a NPE if a partition column contains a NULL value.
The PR avoids the NPE, replacing the `NULL` values with the default partition placeholder.
## How was this patch tested?
added UT
Closes#22036 from mgaido91/SPARK-25028.
Authored-by: Marco Gaido <marcogaido91@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
This is a follow-up pr of #21954 to address comments.
- Rename ambiguous name `inputs` to `arguments`.
- Add argument type check and remove hacky workaround.
- Address other small comments.
## How was this patch tested?
Existing tests and some additional tests.
Closes#22075 from ueshin/issues/SPARK-23908/fup1.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
The PR removes a restriction for element types of array type which exists in `from_json` for the root type. Currently, the function can handle only arrays of structs. Even array of primitive types is disallowed. The PR allows arrays of any types currently supported by JSON datasource. Here is an example of an array of a primitive type:
```
scala> import org.apache.spark.sql.functions._
scala> val df = Seq("[1, 2, 3]").toDF("a")
scala> val schema = new ArrayType(IntegerType, false)
scala> val arr = df.select(from_json($"a", schema))
scala> arr.printSchema
root
|-- jsontostructs(a): array (nullable = true)
| |-- element: integer (containsNull = true)
```
and result of converting of the json string to the `ArrayType`:
```
scala> arr.show
+----------------+
|jsontostructs(a)|
+----------------+
| [1, 2, 3]|
+----------------+
```
## How was this patch tested?
I added a few positive and negative tests:
- array of primitive types
- array of arrays
- array of structs
- array of maps
Closes#21439 from MaxGekk/from_json-array.
Lead-authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
In type coercion for complex types, if the found type is force-nullable to cast, we should loosen the nullability to be able to cast. Also for map key type, we can't use the type.
## How was this patch tested?
Added some test.
Closes#22086 from ueshin/issues/SPARK-25096/fix_type_coercion.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
Support Avro logical date type:
https://avro.apache.org/docs/1.8.2/spec.html#Decimal
## How was this patch tested?
Unit test
Closes#22037 from gengliangwang/avro_decimal.
Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
Fix scaladoc in Column
## How was this patch tested?
None
Closes#22069 from sadhen/fix_doc_minor.
Authored-by: 忍冬 <rendong@wacai.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
## What changes were proposed in this pull request?
This PR adds codes to ``"Test `spark.sql.parquet.compression.codec` config"` in `ParquetCompressionCodecPrecedenceSuite`.
## How was this patch tested?
Existing UTs
Closes#22083 from kiszk/ParquetCompressionCodecPrecedenceSuite.
Authored-by: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
Add RewriteExceptAll and RewriteIntersectAll in the list of nonExcludableRules as the rewrites are essential for the functioning of EXCEPT ALL and INTERSECT ALL feature.
## How was this patch tested?
Added test in OptimizerRuleExclusionSuite.
Closes#22080 from dilipbiswal/exceptall_rewrite_exclusion.
Authored-by: Dilip Biswal <dbiswal@us.ibm.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
## What changes were proposed in this pull request?
Fixing typos is sometimes very hard. It's not so easy to visually review them. Recently, I discovered a very useful tool for it, [misspell](https://github.com/client9/misspell).
This pull request fixes minor typos detected by [misspell](https://github.com/client9/misspell) except for the false positives. If you would like me to work on other files as well, let me know.
## How was this patch tested?
### before
```
$ misspell . | grep -v '.js'
R/pkg/R/SQLContext.R:354:43: "definiton" is a misspelling of "definition"
R/pkg/R/SQLContext.R:424:43: "definiton" is a misspelling of "definition"
R/pkg/R/SQLContext.R:445:43: "definiton" is a misspelling of "definition"
R/pkg/R/SQLContext.R:495:43: "definiton" is a misspelling of "definition"
NOTICE-binary:454:16: "containd" is a misspelling of "contained"
R/pkg/R/context.R:46:43: "definiton" is a misspelling of "definition"
R/pkg/R/context.R:74:43: "definiton" is a misspelling of "definition"
R/pkg/R/DataFrame.R:591:48: "persistance" is a misspelling of "persistence"
R/pkg/R/streaming.R:166:44: "occured" is a misspelling of "occurred"
R/pkg/inst/worker/worker.R:65:22: "ouput" is a misspelling of "output"
R/pkg/tests/fulltests/test_utils.R:106:25: "environemnt" is a misspelling of "environment"
common/kvstore/src/test/java/org/apache/spark/util/kvstore/InMemoryStoreSuite.java:38:39: "existant" is a misspelling of "existent"
common/kvstore/src/test/java/org/apache/spark/util/kvstore/LevelDBSuite.java:83:39: "existant" is a misspelling of "existent"
common/network-common/src/main/java/org/apache/spark/network/crypto/TransportCipher.java:243:46: "transfered" is a misspelling of "transferred"
common/network-common/src/main/java/org/apache/spark/network/sasl/SaslEncryption.java:234:19: "transfered" is a misspelling of "transferred"
common/network-common/src/main/java/org/apache/spark/network/sasl/SaslEncryption.java:238:63: "transfered" is a misspelling of "transferred"
common/network-common/src/main/java/org/apache/spark/network/sasl/SaslEncryption.java:244:46: "transfered" is a misspelling of "transferred"
common/network-common/src/main/java/org/apache/spark/network/sasl/SaslEncryption.java:276:39: "transfered" is a misspelling of "transferred"
common/network-common/src/main/java/org/apache/spark/network/util/AbstractFileRegion.java:27:20: "transfered" is a misspelling of "transferred"
common/unsafe/src/test/scala/org/apache/spark/unsafe/types/UTF8StringPropertyCheckSuite.scala:195:15: "orgin" is a misspelling of "origin"
core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala:621:39: "gauranteed" is a misspelling of "guaranteed"
core/src/main/scala/org/apache/spark/status/storeTypes.scala:113:29: "ect" is a misspelling of "etc"
core/src/main/scala/org/apache/spark/storage/DiskStore.scala:282:18: "transfered" is a misspelling of "transferred"
core/src/main/scala/org/apache/spark/util/ListenerBus.scala:64:17: "overriden" is a misspelling of "overridden"
core/src/test/scala/org/apache/spark/ShuffleSuite.scala:211:7: "substracted" is a misspelling of "subtracted"
core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala:1922:49: "agriculteur" is a misspelling of "agriculture"
core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala:2468:84: "truely" is a misspelling of "truly"
core/src/test/scala/org/apache/spark/storage/FlatmapIteratorSuite.scala:25:18: "persistance" is a misspelling of "persistence"
core/src/test/scala/org/apache/spark/storage/FlatmapIteratorSuite.scala:26:69: "persistance" is a misspelling of "persistence"
data/streaming/AFINN-111.txt:1219:0: "humerous" is a misspelling of "humorous"
dev/run-pip-tests:55:28: "enviroments" is a misspelling of "environments"
dev/run-pip-tests:91:37: "virutal" is a misspelling of "virtual"
dev/merge_spark_pr.py:377:72: "accross" is a misspelling of "across"
dev/merge_spark_pr.py:378:66: "accross" is a misspelling of "across"
dev/run-pip-tests:126:25: "enviroments" is a misspelling of "environments"
docs/configuration.md:1830:82: "overriden" is a misspelling of "overridden"
docs/structured-streaming-programming-guide.md:525:45: "processs" is a misspelling of "processes"
docs/structured-streaming-programming-guide.md:1165:61: "BETWEN" is a misspelling of "BETWEEN"
docs/sql-programming-guide.md:1891:810: "behaivor" is a misspelling of "behavior"
examples/src/main/python/sql/arrow.py:98:8: "substract" is a misspelling of "subtract"
examples/src/main/python/sql/arrow.py:103:27: "substract" is a misspelling of "subtract"
licenses/LICENSE-heapq.txt:5:63: "Stichting" is a misspelling of "Stitching"
licenses/LICENSE-heapq.txt:6:2: "Mathematisch" is a misspelling of "Mathematics"
licenses/LICENSE-heapq.txt:262:29: "Stichting" is a misspelling of "Stitching"
licenses/LICENSE-heapq.txt:262:39: "Mathematisch" is a misspelling of "Mathematics"
licenses/LICENSE-heapq.txt:269:49: "Stichting" is a misspelling of "Stitching"
licenses/LICENSE-heapq.txt:269:59: "Mathematisch" is a misspelling of "Mathematics"
licenses/LICENSE-heapq.txt:274:2: "STICHTING" is a misspelling of "STITCHING"
licenses/LICENSE-heapq.txt:274:12: "MATHEMATISCH" is a misspelling of "MATHEMATICS"
licenses/LICENSE-heapq.txt:276:29: "STICHTING" is a misspelling of "STITCHING"
licenses/LICENSE-heapq.txt:276:39: "MATHEMATISCH" is a misspelling of "MATHEMATICS"
licenses-binary/LICENSE-heapq.txt:5:63: "Stichting" is a misspelling of "Stitching"
licenses-binary/LICENSE-heapq.txt:6:2: "Mathematisch" is a misspelling of "Mathematics"
licenses-binary/LICENSE-heapq.txt:262:29: "Stichting" is a misspelling of "Stitching"
licenses-binary/LICENSE-heapq.txt:262:39: "Mathematisch" is a misspelling of "Mathematics"
licenses-binary/LICENSE-heapq.txt:269:49: "Stichting" is a misspelling of "Stitching"
licenses-binary/LICENSE-heapq.txt:269:59: "Mathematisch" is a misspelling of "Mathematics"
licenses-binary/LICENSE-heapq.txt:274:2: "STICHTING" is a misspelling of "STITCHING"
licenses-binary/LICENSE-heapq.txt:274:12: "MATHEMATISCH" is a misspelling of "MATHEMATICS"
licenses-binary/LICENSE-heapq.txt:276:29: "STICHTING" is a misspelling of "STITCHING"
licenses-binary/LICENSE-heapq.txt:276:39: "MATHEMATISCH" is a misspelling of "MATHEMATICS"
mllib/src/main/resources/org/apache/spark/ml/feature/stopwords/hungarian.txt:170:0: "teh" is a misspelling of "the"
mllib/src/main/resources/org/apache/spark/ml/feature/stopwords/portuguese.txt:53:0: "eles" is a misspelling of "eels"
mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala:99:20: "Euclidian" is a misspelling of "Euclidean"
mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala:539:11: "Euclidian" is a misspelling of "Euclidean"
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala:77:36: "Teh" is a misspelling of "The"
mllib/src/main/scala/org/apache/spark/mllib/clustering/StreamingKMeans.scala:230:24: "inital" is a misspelling of "initial"
mllib/src/main/scala/org/apache/spark/mllib/stat/MultivariateOnlineSummarizer.scala:276:9: "Euclidian" is a misspelling of "Euclidean"
mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala:237:26: "descripiton" is a misspelling of "descriptions"
python/pyspark/find_spark_home.py:30:13: "enviroment" is a misspelling of "environment"
python/pyspark/context.py:937:12: "supress" is a misspelling of "suppress"
python/pyspark/context.py:938:12: "supress" is a misspelling of "suppress"
python/pyspark/context.py:939:12: "supress" is a misspelling of "suppress"
python/pyspark/context.py:940:12: "supress" is a misspelling of "suppress"
python/pyspark/heapq3.py:6:63: "Stichting" is a misspelling of "Stitching"
python/pyspark/heapq3.py:7:2: "Mathematisch" is a misspelling of "Mathematics"
python/pyspark/heapq3.py:263:29: "Stichting" is a misspelling of "Stitching"
python/pyspark/heapq3.py:263:39: "Mathematisch" is a misspelling of "Mathematics"
python/pyspark/heapq3.py:270:49: "Stichting" is a misspelling of "Stitching"
python/pyspark/heapq3.py:270:59: "Mathematisch" is a misspelling of "Mathematics"
python/pyspark/heapq3.py:275:2: "STICHTING" is a misspelling of "STITCHING"
python/pyspark/heapq3.py:275:12: "MATHEMATISCH" is a misspelling of "MATHEMATICS"
python/pyspark/heapq3.py:277:29: "STICHTING" is a misspelling of "STITCHING"
python/pyspark/heapq3.py:277:39: "MATHEMATISCH" is a misspelling of "MATHEMATICS"
python/pyspark/heapq3.py:713:8: "probabilty" is a misspelling of "probability"
python/pyspark/ml/clustering.py:1038:8: "Currenlty" is a misspelling of "Currently"
python/pyspark/ml/stat.py:339:23: "Euclidian" is a misspelling of "Euclidean"
python/pyspark/ml/regression.py:1378:20: "paramter" is a misspelling of "parameter"
python/pyspark/mllib/stat/_statistics.py:262:8: "probabilty" is a misspelling of "probability"
python/pyspark/rdd.py:1363:32: "paramter" is a misspelling of "parameter"
python/pyspark/streaming/tests.py:825:42: "retuns" is a misspelling of "returns"
python/pyspark/sql/tests.py:768:29: "initalization" is a misspelling of "initialization"
python/pyspark/sql/tests.py:3616:31: "initalize" is a misspelling of "initialize"
resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerBackendUtil.scala:120:39: "arbitary" is a misspelling of "arbitrary"
resource-managers/mesos/src/test/scala/org/apache/spark/deploy/mesos/MesosClusterDispatcherArgumentsSuite.scala:26:45: "sucessfully" is a misspelling of "successfully"
resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala:358:27: "constaints" is a misspelling of "constraints"
resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/YarnClusterSuite.scala:111:24: "senstive" is a misspelling of "sensitive"
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala:1063:5: "overwirte" is a misspelling of "overwrite"
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala:1348:17: "compatability" is a misspelling of "compatibility"
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala:77:36: "paramter" is a misspelling of "parameter"
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:1374:22: "precendence" is a misspelling of "precedence"
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisSuite.scala:238:27: "unnecassary" is a misspelling of "unnecessary"
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ConditionalExpressionSuite.scala:212:17: "whn" is a misspelling of "when"
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinHelper.scala:147:60: "timestmap" is a misspelling of "timestamp"
sql/core/src/test/scala/org/apache/spark/sql/TPCDSQuerySuite.scala:150:45: "precentage" is a misspelling of "percentage"
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchemaSuite.scala:135:29: "infered" is a misspelling of "inferred"
sql/hive/src/test/resources/golden/udf_instr-1-2e76f819563dbaba4beb51e3a130b922:1:52: "occurance" is a misspelling of "occurrence"
sql/hive/src/test/resources/golden/udf_instr-2-32da357fc754badd6e3898dcc8989182:1:52: "occurance" is a misspelling of "occurrence"
sql/hive/src/test/resources/golden/udf_locate-1-6e41693c9c6dceea4d7fab4c02884e4e:1:63: "occurance" is a misspelling of "occurrence"
sql/hive/src/test/resources/golden/udf_locate-2-d9b5934457931447874d6bb7c13de478:1:63: "occurance" is a misspelling of "occurrence"
sql/hive/src/test/resources/golden/udf_translate-2-f7aa38a33ca0df73b7a1e6b6da4b7fe8:9:79: "occurence" is a misspelling of "occurrence"
sql/hive/src/test/resources/golden/udf_translate-2-f7aa38a33ca0df73b7a1e6b6da4b7fe8:13:110: "occurence" is a misspelling of "occurrence"
sql/hive/src/test/resources/ql/src/test/queries/clientpositive/annotate_stats_join.q:46:105: "distint" is a misspelling of "distinct"
sql/hive/src/test/resources/ql/src/test/queries/clientpositive/auto_sortmerge_join_11.q:29:3: "Currenly" is a misspelling of "Currently"
sql/hive/src/test/resources/ql/src/test/queries/clientpositive/avro_partitioned.q:72:15: "existant" is a misspelling of "existent"
sql/hive/src/test/resources/ql/src/test/queries/clientpositive/decimal_udf.q:25:3: "substraction" is a misspelling of "subtraction"
sql/hive/src/test/resources/ql/src/test/queries/clientpositive/groupby2_map_multi_distinct.q:16:51: "funtion" is a misspelling of "function"
sql/hive/src/test/resources/ql/src/test/queries/clientpositive/groupby_sort_8.q:15:30: "issueing" is a misspelling of "issuing"
sql/hive/src/test/scala/org/apache/spark/sql/sources/HadoopFsRelationTest.scala:669:52: "wiht" is a misspelling of "with"
sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/session/HiveSessionImpl.java:474:9: "Refering" is a misspelling of "Referring"
```
### after
```
$ misspell . | grep -v '.js'
common/network-common/src/main/java/org/apache/spark/network/util/AbstractFileRegion.java:27:20: "transfered" is a misspelling of "transferred"
core/src/main/scala/org/apache/spark/status/storeTypes.scala:113:29: "ect" is a misspelling of "etc"
core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala:1922:49: "agriculteur" is a misspelling of "agriculture"
data/streaming/AFINN-111.txt:1219:0: "humerous" is a misspelling of "humorous"
licenses/LICENSE-heapq.txt:5:63: "Stichting" is a misspelling of "Stitching"
licenses/LICENSE-heapq.txt:6:2: "Mathematisch" is a misspelling of "Mathematics"
licenses/LICENSE-heapq.txt:262:29: "Stichting" is a misspelling of "Stitching"
licenses/LICENSE-heapq.txt:262:39: "Mathematisch" is a misspelling of "Mathematics"
licenses/LICENSE-heapq.txt:269:49: "Stichting" is a misspelling of "Stitching"
licenses/LICENSE-heapq.txt:269:59: "Mathematisch" is a misspelling of "Mathematics"
licenses/LICENSE-heapq.txt:274:2: "STICHTING" is a misspelling of "STITCHING"
licenses/LICENSE-heapq.txt:274:12: "MATHEMATISCH" is a misspelling of "MATHEMATICS"
licenses/LICENSE-heapq.txt:276:29: "STICHTING" is a misspelling of "STITCHING"
licenses/LICENSE-heapq.txt:276:39: "MATHEMATISCH" is a misspelling of "MATHEMATICS"
licenses-binary/LICENSE-heapq.txt:5:63: "Stichting" is a misspelling of "Stitching"
licenses-binary/LICENSE-heapq.txt:6:2: "Mathematisch" is a misspelling of "Mathematics"
licenses-binary/LICENSE-heapq.txt:262:29: "Stichting" is a misspelling of "Stitching"
licenses-binary/LICENSE-heapq.txt:262:39: "Mathematisch" is a misspelling of "Mathematics"
licenses-binary/LICENSE-heapq.txt:269:49: "Stichting" is a misspelling of "Stitching"
licenses-binary/LICENSE-heapq.txt:269:59: "Mathematisch" is a misspelling of "Mathematics"
licenses-binary/LICENSE-heapq.txt:274:2: "STICHTING" is a misspelling of "STITCHING"
licenses-binary/LICENSE-heapq.txt:274:12: "MATHEMATISCH" is a misspelling of "MATHEMATICS"
licenses-binary/LICENSE-heapq.txt:276:29: "STICHTING" is a misspelling of "STITCHING"
licenses-binary/LICENSE-heapq.txt:276:39: "MATHEMATISCH" is a misspelling of "MATHEMATICS"
mllib/src/main/resources/org/apache/spark/ml/feature/stopwords/hungarian.txt:170:0: "teh" is a misspelling of "the"
mllib/src/main/resources/org/apache/spark/ml/feature/stopwords/portuguese.txt:53:0: "eles" is a misspelling of "eels"
mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala:99:20: "Euclidian" is a misspelling of "Euclidean"
mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala:539:11: "Euclidian" is a misspelling of "Euclidean"
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala:77:36: "Teh" is a misspelling of "The"
mllib/src/main/scala/org/apache/spark/mllib/stat/MultivariateOnlineSummarizer.scala:276:9: "Euclidian" is a misspelling of "Euclidean"
python/pyspark/heapq3.py:6:63: "Stichting" is a misspelling of "Stitching"
python/pyspark/heapq3.py:7:2: "Mathematisch" is a misspelling of "Mathematics"
python/pyspark/heapq3.py:263:29: "Stichting" is a misspelling of "Stitching"
python/pyspark/heapq3.py:263:39: "Mathematisch" is a misspelling of "Mathematics"
python/pyspark/heapq3.py:270:49: "Stichting" is a misspelling of "Stitching"
python/pyspark/heapq3.py:270:59: "Mathematisch" is a misspelling of "Mathematics"
python/pyspark/heapq3.py:275:2: "STICHTING" is a misspelling of "STITCHING"
python/pyspark/heapq3.py:275:12: "MATHEMATISCH" is a misspelling of "MATHEMATICS"
python/pyspark/heapq3.py:277:29: "STICHTING" is a misspelling of "STITCHING"
python/pyspark/heapq3.py:277:39: "MATHEMATISCH" is a misspelling of "MATHEMATICS"
python/pyspark/ml/stat.py:339:23: "Euclidian" is a misspelling of "Euclidean"
```
Closes#22070 from seratch/fix-typo.
Authored-by: Kazuhiro Sera <seratch@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
## What changes were proposed in this pull request?
"distribute by" on multiple columns (wrap in brackets) may lead to codegen issue.
Simple way to reproduce:
```scala
val df = spark.range(1000)
val columns = (0 until 400).map{ i => s"id as id$i" }
val distributeExprs = (0 until 100).map(c => s"id$c").mkString(",")
df.selectExpr(columns : _*).createTempView("test")
spark.sql(s"select * from test distribute by ($distributeExprs)").count()
```
## How was this patch tested?
Add UT.
Closes#22066 from yucai/SPARK-25084.
Authored-by: yucai <yyu1@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
Parquet file provides six codecs: "snappy", "gzip", "lzo", "lz4", "brotli", "zstd".
This pr add missing compression codec :"lz4", "brotli", "zstd" .
## How was this patch tested?
N/A
Closes#22068 from 10110346/nosupportlz4.
Authored-by: liuxian <liu.xian3@zte.com.cn>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
A logical `Limit` is performed physically by two operations `LocalLimit` and `GlobalLimit`.
Most of time, we gather all data into a single partition in order to run `GlobalLimit`. If we use a very big limit number, shuffling data causes performance issue also reduces parallelism.
We can avoid shuffling into single partition if we don't care data ordering. This patch implements this idea by doing a map stage during global limit. It collects the info of row numbers at each partition. For each partition, we locally retrieves limited data without any shuffling to finish this global limit.
For example, we have three partitions with rows (100, 100, 50) respectively. In global limit of 100 rows, we may take (34, 33, 33) rows for each partition locally. After global limit we still have three partitions.
If the data partition has certain ordering, we can't distribute required rows evenly to each partitions because it could change data ordering. But we still can avoid shuffling.
## How was this patch tested?
Jenkins tests.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#16677 from viirya/improve-global-limit-parallelism.
## What changes were proposed in this pull request?
Support for text socket stream in spark structured streaming "continuous" mode. This is roughly based on the idea of ContinuousMemoryStream where the executor queries the data from driver over an RPC endpoint.
This makes it possible to create Structured streaming continuous pipeline to ingest data via "nc" and run examples.
## How was this patch tested?
Unit test and ran spark examples in structured streaming continuous mode.
Please review http://spark.apache.org/contributing.html before opening a pull request.
Closes#21199 from arunmahadevan/SPARK-24127.
Authored-by: Arun Mahadevan <arunm@apache.org>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
This PR simplified code generation for `ArrayDistinct`. #21966 enabled code generation only if the type can be specialized by the hash set. This PR follows this strategy.
Optimization of null handling will be implemented in #21912.
## How was this patch tested?
Existing UTs
Closes#22044 from kiszk/SPARK-23912-follow.
Authored-by: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
## What changes were proposed in this pull request?
This is a follow-up to #21305 that adds a test suite for AppendData analysis.
This also fixes the following problems uncovered by these tests:
* Incorrect order of data types passed to `canWrite` is fixed
* The field check calls `canWrite` first to ensure all errors are found
* `AppendData#resolved` must check resolution of the query's attributes
* Column names are quoted to show empty names
## How was this patch tested?
This PR adds a test suite for AppendData analysis.
Closes#22043 from rdblue/SPARK-24251-add-append-data-analysis-tests.
Authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
This pr adds `exists` function which tests whether a predicate holds for one or more elements in the array.
```sql
> SELECT exists(array(1, 2, 3), x -> x % 2 == 0);
true
```
## How was this patch tested?
Added tests.
Closes#22052 from ueshin/issues/SPARK-25068/exists.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
## What changes were proposed in this pull request?
When a `SparkSession` is stopped, `SQLConf.get` should use the fallback conf to avoid weird issues like
```
sbt.ForkMain$ForkError: java.lang.IllegalStateException: LiveListenerBus is stopped.
at org.apache.spark.scheduler.LiveListenerBus.addToQueue(LiveListenerBus.scala:97)
at org.apache.spark.scheduler.LiveListenerBus.addToStatusQueue(LiveListenerBus.scala:80)
at org.apache.spark.sql.internal.SharedState.<init>(SharedState.scala:93)
at org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(SparkSession.scala:120)
at org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(SparkSession.scala:120)
at scala.Option.getOrElse(Option.scala:121)
...
```
## How was this patch tested?
a new test suite
Closes#22056 from cloud-fan/session.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
## What changes were proposed in this pull request?
Just delete the unused variable `inputFields` in WindowExec, avoid making others confused while reading the code.
## How was this patch tested?
Existing UT.
Closes#22057 from xuanyuanking/SPARK-25077.
Authored-by: liyuanjian <liyuanjian@baidu.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
## What changes were proposed in this pull request?
Currently, Analyze table calculates table size sequentially for each partition. We can parallelize size calculations over partitions.
Results : Tested on a table with 100 partitions and data stored in S3.
With changes :
- 10.429s
- 10.557s
- 10.439s
- 9.893s
Without changes :
- 110.034s
- 99.510s
- 100.743s
- 99.106s
## How was this patch tested?
Simple unit test.
Closes#21608 from Achuth17/improveAnalyze.
Lead-authored-by: Achuth17 <Achuth.narayan@gmail.com>
Co-authored-by: arajagopal17 <arajagopal@qubole.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
## What changes were proposed in this pull request?
Correct the class name typo checked in through SPARK-24891
## How was this patch tested?
Passed all existing tests.
Closes#22049 from maryannxue/known-not-null.
Authored-by: maryannxue <maryannxue@apache.org>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
## What changes were proposed in this pull request?
This PR addresses two issues in `BufferHolderSparkSubmitSuite`.
1. While `BufferHolderSparkSubmitSuite` tried to allocate a large object several times, it actually allocated an object once and reused the object.
2. `BufferHolderSparkSubmitSuite` may fail due to timeout
To assign a small object before allocating a large object each time solved issue 1 by avoiding reuse.
To increasing heap size from 4g to 7g solved issue 2. It can also avoid OOM after fixing issue 1.
## How was this patch tested?
Updated existing `BufferHolderSparkSubmitSuite`
Closes#20636 from kiszk/SPARK-23415.
Authored-by: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
This PR fixes typo regarding `auxiliary verb + verb[s]`. This is a follow-on of #21956.
## How was this patch tested?
N/A
Closes#22040 from kiszk/spellcheck1.
Authored-by: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
We should use `Block.isEmpty/nonEmpty` instead of comparing with empty string to check whether the code is empty or not.
```
[error] [warn] /.../sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala:278: org.apache.spark.sql.catalyst.expressions.codegen.Block and String are unrelated: they will most likely always compare unequal
[error] [warn] if (ev.code != "" && required.contains(attributes(i))) {
[error] [warn]
[error] [warn] /.../sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastHashJoinExec.scala:323: org.apache.spark.sql.catalyst.expressions.codegen.Block and String are unrelated: they will most likely never compare equal
[error] [warn] | ${buildVars.filter(_.code == "").map(v => s"${v.isNull} = true;").mkString("\n")}
[error] [warn]
```
## How was this patch tested?
Existing tests.
Closes#22041 from ueshin/issues/SPARK-25058/fix_comparison.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
## What changes were proposed in this pull request?
We have completed a significant subset of the object related Expressions to provide an interpreted fallback. This PR is going to modify the tests to also test the interpreted code paths.
One concern right now is that by testing the interpreted code paths too, we will double current test time or more. Otherwise, we can only choose to test the interpreted code paths for just few test suites such as encoder related.
## How was this patch tested?
Existing tests.
Closes#21535 from viirya/SPARK-23596.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
This is a follow-up pr of #22014.
We still have some more compilation errors in scala-2.12 with sbt:
```
[error] [warn] /.../sql/core/src/main/scala/org/apache/spark/sql/DataFrameNaFunctions.scala:493: match may not be exhaustive.
[error] It would fail on the following input: (_, _)
[error] [warn] val typeMatches = (targetType, f.dataType) match {
[error] [warn]
[error] [warn] /.../sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala:393: match may not be exhaustive.
[error] It would fail on the following input: (_, _)
[error] [warn] prevBatchOff.get.toStreamProgress(sources).foreach {
[error] [warn]
[error] [warn] /.../sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala:173: match may not be exhaustive.
[error] It would fail on the following input: AggregateExpression(_, _, false, _)
[error] [warn] val rewrittenDistinctFunctions = functionsWithDistinct.map {
[error] [warn]
[error] [warn] /.../sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/SymmetricHashJoinStateManager.scala:271: match may not be exhaustive.
[error] It would fail on the following input: (_, _)
[error] [warn] keyWithIndexToValueMetrics.customMetrics.map {
[error] [warn]
[error] [warn] /.../sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala:959: match may not be exhaustive.
[error] It would fail on the following input: CatalogTableType(_)
[error] [warn] val tableTypeString = metadata.tableType match {
[error] [warn]
[error] [warn] /.../sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala:923: match may not be exhaustive.
[error] It would fail on the following input: CatalogTableType(_)
[error] [warn] hiveTable.setTableType(table.tableType match {
[error] [warn]
```
## How was this patch tested?
Manually build with Scala-2.12.
Closes#22039 from ueshin/issues/SPARK-25036/fix_match.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
## What changes were proposed in this pull request?
This pr is a follow-up pr of #21982 and fixes the examples.
## How was this patch tested?
Existing tests.
Closes#22035 from ueshin/issues/SPARK-23911/fup1.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
## What changes were proposed in this pull request?
The PR remove the following compilation error using scala-2.12 with sbt by adding a default case to `match`.
```
/home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/ValueInterval.scala:63: match may not be exhaustive.
[error] It would fail on the following inputs: (NumericValueInterval(_, _), _), (_, NumericValueInterval(_, _)), (_, _)
[error] [warn] def isIntersected(r1: ValueInterval, r2: ValueInterval): Boolean = (r1, r2) match {
[error] [warn]
[error] [warn] /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/ValueInterval.scala:79: match may not be exhaustive.
[error] It would fail on the following inputs: (NumericValueInterval(_, _), _), (_, NumericValueInterval(_, _)), (_, _)
[error] [warn] (r1, r2) match {
[error] [warn]
[error] [warn] /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproxCountDistinctForIntervals.scala:67: match may not be exhaustive.
[error] It would fail on the following inputs: (ArrayType(_, _), _), (_, ArrayData()), (_, _)
[error] [warn] (endpointsExpression.dataType, endpointsExpression.eval()) match {
[error] [warn]
[error] [warn] /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala:470: match may not be exhaustive.
[error] It would fail on the following inputs: NewFunctionSpec(_, None, Some(_)), NewFunctionSpec(_, Some(_), None)
[error] [warn] newFunction match {
[error] [warn]
[error] [warn] [error] [warn] /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala:709: match may not be exhaustive.
[error] It would fail on the following input: Schema((x: org.apache.spark.sql.types.DataType forSome x not in org.apache.spark.sql.types.StructType), _)
[error] [warn] def attributesFor[T: TypeTag]: Seq[Attribute] = schemaFor[T] match {
[error] [warn]
```
## How was this patch tested?
Existing UTs with Scala-2.11.
Manually build with Scala-2.12
Closes#22014 from kiszk/SPARK-25036b.
Authored-by: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
This PR fixes an exception during the compilation of generated code of `mapEntry`. This error occurs since the current code uses `key` type to store a `value` when `key` and `value` types are primitive type.
```
val mid0 = Literal.create(Map(1 -> 1.1, 2 -> 2.2), MapType(IntegerType, DoubleType))
checkEvaluation(MapEntries(mid0), Seq(r(1, 1.1), r(2, 2.2)))
```
```
[info] Code generation of map_entries(keys: [1,2], values: [1.1,2.2]) failed:
[info] java.util.concurrent.ExecutionException: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 80, Column 20: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 80, Column 20: No applicable constructor/method found for actual parameters "int, double"; candidates are: "public void org.apache.spark.sql.catalyst.expressions.UnsafeRow.setInt(int, int)", "public void org.apache.spark.sql.catalyst.InternalRow.setInt(int, int)"
[info] java.util.concurrent.ExecutionException: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 80, Column 20: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 80, Column 20: No applicable constructor/method found for actual parameters "int, double"; candidates are: "public void org.apache.spark.sql.catalyst.expressions.UnsafeRow.setInt(int, int)", "public void org.apache.spark.sql.catalyst.InternalRow.setInt(int, int)"
[info] at com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:306)
[info] at com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:293)
[info] at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
[info] at com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:135)
[info] at com.google.common.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2410)
[info] at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2380)
[info] at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
[info] at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2257)
[info] at com.google.common.cache.LocalCache.get(LocalCache.java:4000)
[info] at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4004)
[info] at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
[info] at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1290)
...
```
## How was this patch tested?
Added a new test to `CollectionExpressionsSuite`
Closes#22033 from kiszk/SPARK-23935-followup.
Authored-by: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
## What changes were proposed in this pull request?
This is a follow-up pr of #21980.
`Shuffle` can also be `ExpressionWithRandomSeed` to produce different values for each execution in streaming query.
## How was this patch tested?
Added a test.
Closes#22027 from ueshin/issues/SPARK-25010/random_seed.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
This adds a new logical plan, AppendData, that was proposed in SPARK-23521: Standardize SQL logical plans.
* DataFrameWriter uses the new AppendData plan for DataSourceV2 appends
* AppendData is resolved if its output columns match the incoming data frame
* A new analyzer rule, ResolveOutputColumns, validates data before it is appended. This rule will add safe casts, rename columns, and checks nullability
## How was this patch tested?
Existing tests for v2 appends. Will add AppendData tests to validate logical plan analysis.
Closes#21305 from rdblue/SPARK-24251-add-append-data.
Lead-authored-by: Ryan Blue <blue@apache.org>
Co-authored-by: Ryan Blue <rdblue@users.noreply.github.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
Alter View can excute sql like "ALTER VIEW ... AS INSERT INTO" . We should throw ParseException(s"Operation not allowed: $message", ctx) as Create View does.
```
override def visitCreateView(ctx: CreateViewContext): LogicalPlan = withOrigin(ctx) {
if (ctx.identifierList != null) {
operationNotAllowed("CREATE VIEW ... PARTITIONED ON", ctx)
} else {
// CREATE VIEW ... AS INSERT INTO is not allowed.
ctx.query.queryNoWith match {
case s: SingleInsertQueryContext if s.insertInto != null =>
operationNotAllowed("CREATE VIEW ... AS INSERT INTO", ctx)
case _: MultiInsertQueryContext =>
operationNotAllowed("CREATE VIEW ... AS FROM ... [INSERT INTO ...]+", ctx)
case _ => // OK
}
```
```
override def visitAlterViewQuery(ctx: AlterViewQueryContext): LogicalPlan = withOrigin(ctx) {
// ALTER VIEW ... AS INSERT INTO is not allowed.
ctx.query.queryNoWith match {
case s: SingleInsertQueryContext if s.insertInto != null =>
operationNotAllowed("ALTER VIEW ... AS INSERT INTO", ctx)
case _: MultiInsertQueryContext =>
operationNotAllowed("ALTER VIEW ... AS FROM ... [INSERT INTO ...]+", ctx)
case _ => // OK
}
AlterViewAsCommand(
name = visitTableIdentifier(ctx.tableIdentifier),
originalText = source(ctx.query),
query = plan(ctx.query))
}
```
## How was this patch tested?
UT has been added in SparkSqlParserSuite
Closes#22028 from sddyljsx/SPARK-25046.
Lead-authored-by: Neal Song <neal_song@126.com>
Co-authored-by: neal <neal_song@126.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
## What changes were proposed in this pull request?
The PR fix the bug in `buildFormattedString` function in `MapType`, which makes the printed schema misleading.
## How was this patch tested?
Added UT
Closes#22006 from invkrh/fix-map-schema-print.
Authored-by: invkrh <invkrh@gmail.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
## What changes were proposed in this pull request?
The PR adds the high order function `map_filter`, which filters the entries of a map and returns a new map which contains only the entries which satisfied the filter function.
## How was this patch tested?
added UTs
Closes#21986 from mgaido91/SPARK-23937.
Authored-by: Marco Gaido <marcogaido91@gmail.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
## What changes were proposed in this pull request?
This is a followup of https://github.com/apache/spark/pull/21822
Similar to `TreeNode`, `AnalysisHelper` should also provide 3 versions of transformations: `resolveOperatorsUp`, `resolveOperatorsDown` and `resolveOperators`.
This PR adds the missing `resolveOperatorsUp`, and also fixes some code style which is missed in #21822
## How was this patch tested?
existing tests.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#21932 from cloud-fan/follow.
## What changes were proposed in this pull request?
The tests for shuffle can be flaky (eg. https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94355/testReport/). This happens because we have not set the seed for `Random`.
## How was this patch tested?
running 10000 times the UT (validated that with a different seed eg. 12345 the test fails).
Closes#22023 from mgaido91/SPARK-23928_followup.
Authored-by: Marco Gaido <marcogaido91@gmail.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
The design details is attached to the JIRA issue [here](https://drive.google.com/file/d/1zKm3aNZ3DpsqIuoMvRsf0kkDkXsAasxH/view)
High level overview of the changes are:
- Enhance the qualifier to be more than one string
- Add support to store the qualifier. Enhance the lookupRelation to keep the qualifier appropriately.
- Enhance the table matching column resolution algorithm to account for qualifier being more than a string.
- Enhance the table matching algorithm in UnresolvedStar.expand
- Ensure that we continue to support select t1.i1 from db1.t1
## How was this patch tested?
- New tests are added.
- Several test scenarios were added in a separate [test pr 17067](https://github.com/apache/spark/pull/17067). The tests that were not supported earlier are marked with TODO markers and those are now supported with the code changes here.
- Existing unit tests ( hive, catalyst and sql) were run successfully.
Closes#17185 from skambha/colResolution.
Authored-by: Sunitha Kambhampati <skambha@us.ibm.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
In the PR, I propose to replace Scala parallel collections by new methods `parmap()`. The methods use futures to transform a sequential collection by applying a lambda function to each element in parallel. The result of `parmap` is another regular (sequential) collection.
The proposed `parmap` method aims to solve the problem of impossibility to interrupt parallel Scala collection. This possibility is needed for reliable task preemption.
## How was this patch tested?
A test was added to `ThreadUtilsSuite`
Closes#21913 from MaxGekk/par-map.
Authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
Using struct types in subqueries with the `IN` clause can generate invalid plans in `RewritePredicateSubquery`. Indeed, we are not handling clearly the cases when the outer value is a struct or the output of the inner subquery is a struct.
The PR aims to make Spark's behavior the same as the one of the other RDBMS - namely Oracle and Postgres behavior were checked. So we consider valid only queries having the same number of fields in the outer value and in the subquery. This means that:
- `(a, b) IN (select c, d from ...)` is a valid query;
- `(a, b) IN (select (c, d) from ...)` throws an AnalysisException, as in the subquery we have only one field of type struct while in the outer value we have 2 fields;
- `a IN (select (c, d) from ...)` - where `a` is a struct - is a valid query.
## How was this patch tested?
Added UT
Closes#21403 from mgaido91/SPARK-24313.
Authored-by: Marco Gaido <marcogaido91@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
Like Uuid in SPARK-24896, Rand and Randn expressions now produce the same results for each execution in streaming query. It doesn't make too much sense for streaming queries. We should make them produce different results as Uuid.
In this change, similar to Uuid, we assign new random seeds to Rand/Randn when returning optimized plan from `IncrementalExecution`.
Note: Different to Uuid, Rand/Randn can be created with initial seed. Because we replace this initial seed at `IncrementalExecution`, it doesn't use the initial seed anymore. For now it seems to me not a big issue for streaming query. But need to confirm with others. cc zsxwing cloud-fan
## How was this patch tested?
Added test.
Closes#21980 from viirya/SPARK-25010.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>