Commit graph

24512 commits

Author SHA1 Message Date
Stavros Kontopoulos 5e74570c8f [SPARK-23153][K8S] Support client dependencies with a Hadoop Compatible File System
## What changes were proposed in this pull request?
- Solves the current issue with `--packages` in cluster mode (there is no ticket for it). Also note some past [issues](https://issues.apache.org/jira/browse/SPARK-22657) when Hadoop libs are used on the spark-submit side.
- Supports `spark.jars`, `spark.files`, and the app jar.

It works as follows:
Spark submit uploads the deps to the HCFS, and the driver then serves the deps via the Spark file server.
No HCFS URIs are propagated.

The related design document is [here](https://docs.google.com/document/d/1peg_qVhLaAl4weo5C51jQicPwLclApBsdR1To2fgc48/edit). The next option to add is the RSS, but it has to be improved given the past discussion about it (Spark 2.3).
## How was this patch tested?

- Run integration test suite.
- Run an example using S3:

```
 ./bin/spark-submit \
...
 --packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.6 \
 --deploy-mode cluster \
 --name spark-pi \
 --class org.apache.spark.examples.SparkPi \
 --conf spark.executor.memory=1G \
 --conf spark.kubernetes.namespace=spark \
 --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-sa \
 --conf spark.driver.memory=1G \
 --conf spark.executor.instances=2 \
 --conf spark.sql.streaming.metricsEnabled=true \
 --conf "spark.driver.extraJavaOptions=-Divy.cache.dir=/tmp -Divy.home=/tmp" \
 --conf spark.kubernetes.container.image.pullPolicy=Always \
 --conf spark.kubernetes.container.image=skonto/spark:k8s-3.0.0 \
 --conf spark.kubernetes.file.upload.path=s3a://fdp-stavros-test \
 --conf spark.hadoop.fs.s3a.access.key=... \
 --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
 --conf spark.hadoop.fs.s3a.fast.upload=true \
 --conf spark.kubernetes.executor.deleteOnTermination=false \
 --conf spark.hadoop.fs.s3a.secret.key=... \
 --conf spark.files=client:///...resolv.conf \
file:///my.jar
```
Added integration tests based on [Ceph nano](https://github.com/ceph/cn), which looks very [active](http://www.sebastien-han.fr/blog/2019/02/24/Ceph-nano-is-getting-better-and-better/).
Unfortunately, MinIO needs Hadoop >= 2.8.

Closes #23546 from skonto/support-client-deps.

Authored-by: Stavros Kontopoulos <stavros.kontopoulos@lightbend.com>
Signed-off-by: Erik Erlandson <eerlands@redhat.com>
2019-05-22 16:15:42 -07:00
Sean Owen 6c5827c723 [SPARK-27794][R][DOCS] Use https URL for CRAN repo
## What changes were proposed in this pull request?

Use https URL for CRAN repo (and for a Scala download in a Dockerfile)

## How was this patch tested?

Existing tests.

Closes #24664 from srowen/SPARK-27794.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-05-22 14:28:21 -07:00
Yuming Wang 76988dd4a2 [SPARK-27737][FOLLOW-UP][SQL][test-hadoop3.2] Update Hive test jars from 2.3.4 to 2.3.5
## What changes were proposed in this pull request?

This PR updates `hive-contrib-2.3.4.jar` to `hive-contrib-2.3.5.jar` and `hive-hcatalog-core-2.3.4.jar` to `hive-hcatalog-core-2.3.5.jar`.

## How was this patch tested?

Existing test

Closes #24673 from wangyum/SPARK-27737-hive.jar.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-05-22 08:29:06 -07:00
Dongjoon Hyun a24cdc00bf [SPARK-27800][SQL][HOTFIX][FOLLOWUP] Fix wrong answer on BitwiseXor test cases
This PR is a follow-up of https://github.com/apache/spark/pull/24669 to fix the wrong answers used in test cases.

Closes #24674 from dongjoon-hyun/SPARK-27800.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-05-22 03:11:29 -07:00
Liu Xiao bf617996aa [SPARK-27800][SQL][DOC] Fix wrong answer of example for BitwiseXor
## What changes were proposed in this pull request?

Fix example for bitwise xor function. 3 ^ 5 should be 6 rather than 2.
- See https://spark.apache.org/docs/latest/api/sql/index.html#_14
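
For reference, the corrected value is easy to sanity-check in spark-shell (3 = 0b011, 5 = 0b101, so 3 ^ 5 = 0b110 = 6):

```scala
// quick check of the corrected example value
spark.sql("SELECT 3 ^ 5").collect()
// returns Array([6])
```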

## How was this patch tested?

manual tests

Closes #24669 from alex-lx/master.

Authored-by: Liu Xiao <hhdxlx@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-05-21 21:52:19 -07:00
David Vogelbacher 034cb139a1 [SPARK-27778][PYTHON] Fix toPandas conversion of empty DataFrame with Arrow enabled
## What changes were proposed in this pull request?
https://github.com/apache/spark/pull/22275 introduced a performance improvement where we send partitions out of order to python and then, as a last step, send the partition order as well.
However, if there are no partitions, we will never send the partition order and we will get an "EofError" on the Python side.
This PR fixes this by also sending the partition order if there are no partitions present.

## How was this patch tested?
New unit test added.

Closes #24650 from dvogelbacher/dv/fixNoPartitionArrowConversion.

Authored-by: David Vogelbacher <dvogelbacher@palantir.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-05-22 13:21:26 +09:00
Wenchen Fan 03c9e8adee [SPARK-24586][SQL] Upcast should not allow casting from string to other types
## What changes were proposed in this pull request?

When converting a Dataset to another Dataset, Spark will upcast the fields in the original Dataset to the types of the corresponding fields in the target Dataset.

However, the current upcast behavior is a little weird: we don't allow upcasting from string to numeric types, but we do allow non-numeric target types like boolean, date, etc.

As a result, `Seq("str").toDS.as[Int]` fails, but `Seq("str").toDS.as[Boolean]` works and throws an NPE during execution.

The motivation of the upcast is to prevent things like runtime NPEs, so it's more reasonable to make the upcast stricter.

This PR does 2 things:
1. rename `Cast.canSafeCast` to `Cast.canUpcast`, and support complex types
2. remove `Cast.mayTruncate` and replace it with `!Cast.canUpcast`

Note that, the up cast change also affects persistent view resolution. But since we don't support changing column types of an existing table, there is no behavior change here.
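
A minimal spark-shell sketch of the behavior change described above:

```scala
// behavior described in this PR (spark-shell, with implicits in scope)
Seq("str").toDS.as[Int]      // already failed at analysis before this change
Seq("str").toDS.as[Boolean]  // previously passed analysis and threw an NPE at runtime;
                             // with the stricter upcast it now fails at analysis as well
```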

## How was this patch tested?

new tests

Closes #21586 from cloud-fan/cast.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-05-22 11:35:51 +08:00
Gengliang Wang c3c443ca8c [SPARK-27698][SQL] Add new method convertibleFilters for getting pushed down filters in Parquet file reader
## What changes were proposed in this pull request?

To return accurate pushed filters in the Parquet file scan (https://github.com/apache/spark/pull/24327#pullrequestreview-234775673), we can process the original data source filters in the following way:
1. For "And" operators, split the conjunctive predicates and try converting each of them. After that:
   1.1 if partial predicate pushdown is allowed, return the convertible results;
   1.2 otherwise, return the whole predicate if it is convertible, or an empty result if it is not.

2. For "Or" operators, if both children can be pushed down, the predicate is partially or totally convertible; otherwise, return an empty result.

3. For other operators, they cannot be partially pushed down:
   3.1 if the entire predicate is convertible, return it as-is;
   3.2 otherwise, return an empty result.
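
The rules above can be sketched roughly as follows; this is a simplified illustration over `org.apache.spark.sql.sources.Filter`, not the actual `ParquetFilters` code, and `convertLeaf` is a hypothetical stand-in for the per-filter conversion:

```scala
import org.apache.spark.sql.sources.{And, Filter, Or}

// Simplified sketch of the conversion rules above.
def convertibleFilters(
    pred: Filter,
    canPartialPushDown: Boolean,
    convertLeaf: Filter => Option[Filter]): Option[Filter] = pred match {
  case And(left, right) =>
    val lhs = convertibleFilters(left, canPartialPushDown, convertLeaf)
    val rhs = convertibleFilters(right, canPartialPushDown, convertLeaf)
    (lhs, rhs) match {
      case (Some(l), Some(r)) => Some(And(l, r))       // 1.2: the whole predicate converts
      case _ if canPartialPushDown => lhs.orElse(rhs)  // 1.1: keep the convertible side
      case _ => None
    }
  case Or(left, right) =>
    // 2: both children must be convertible, otherwise drop the whole Or
    for {
      l <- convertibleFilters(left, canPartialPushDown, convertLeaf)
      r <- convertibleFilters(right, canPartialPushDown, convertLeaf)
    } yield Or(l, r)
  case leaf =>
    convertLeaf(leaf)                                  // 3: leaves are all-or-nothing
}
```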

This PR also contains code refactoring. Currently `ParquetFilters.createFilter` accepts the parameter `schema: MessageType` and creates a field mapping for every input filter. We can make it a class member and avoid creating the `nameToParquetField` mapping for every input filter.

## How was this patch tested?

Unit test

Closes #24597 from gengliangwang/refactorParquetFilters.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-05-22 11:27:25 +08:00
wenxuanguan e7443d6412 [SPARK-27774][CORE][MLLIB] Avoid hardcoded configs
## What changes were proposed in this pull request?

Avoid hardcoded config keys in `SparkConf` and `SparkSubmit` and in tests.

## How was this patch tested?

N/A

Closes #24631 from wenxuanguan/minor-fix.

Authored-by: wenxuanguan <choose_home@126.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-05-22 10:45:11 +09:00
Yuming Wang 6cd1efd0ae [SPARK-27737][SQL] Upgrade to Hive 2.3.5 for Hive Metastore Client and Hadoop-3.2 profile
## What changes were proposed in this pull request?

This PR aims to upgrade to Hive 2.3.5 for Hive Metastore Client and Hadoop-3.2 profile.

Release Notes - Hive - Version 2.3.5

- [[HIVE-21536](https://issues.apache.org/jira/browse/HIVE-21536)] - Backport HIVE-17764 to branch-2.3
- [[HIVE-21585](https://issues.apache.org/jira/browse/HIVE-21585)] - Upgrade branch-2.3 to ORC 1.3.4
- [[HIVE-21639](https://issues.apache.org/jira/browse/HIVE-21639)] - Spark test failed since HIVE-10632
- [[HIVE-21680](https://issues.apache.org/jira/browse/HIVE-21680)] - Backport HIVE-17644 to branch-2 and branch-2.3

https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12345394&styleName=Text&projectId=12310843

## How was this patch tested?

This PR is tested in two ways.
- Pass the Jenkins with the default configuration for `Hive Metastore Client` testing.
- Pass the Jenkins with `test-hadoop3.2` configuration for `Hadoop 3.2` testing.

Closes #24620 from wangyum/SPARK-27737.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-05-22 10:24:17 +09:00
williamwong 8442d94fb1 [SPARK-27248][SQL] refreshTable should recreate cache with same cache name and storage level
If we refresh a cached table, the table cache will first be uncached and then recached (lazily). Currently, this logic is embedded in the CatalogImpl.refreshTable method.
The current implementation does not preserve the cache name and storage level. As a result, the cache name and storage level could change after a REFRESH. IMHO, that is not what a user would expect.
I would like to fix this behavior by first saving the cache name and storage level, and then reusing them when recaching the table.
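
For illustration, a minimal sketch of the scenario (table name and storage level are just examples):

```scala
import org.apache.spark.storage.StorageLevel

spark.range(10).createOrReplaceTempView("t")
spark.catalog.cacheTable("t", StorageLevel.DISK_ONLY)

// Before this change, a refresh recached "t" with the default cache name and
// storage level; with the fix, DISK_ONLY and the original cache name are preserved.
spark.catalog.refreshTable("t")
```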

Two unit tests are added to make sure cache name is unchanged upon table refresh. Before applying this patch, the test created for qualified case would fail.

Closes #24221 from William1104/feature/SPARK-27248.

Lead-authored-by: williamwong <william1104@gmail.com>
Co-authored-by: William Wong <william1104@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-05-21 11:37:16 -07:00
Liang-Chi Hsieh c033a3e1e6 [SPARK-27439][SQL] Explainging Dataset should show correct resolved plans
## What changes were proposed in this pull request?

Because a temporary view is resolved during analysis when we create a dataset, the content of the view is determined when the dataset is created, not when it is evaluated. Now the explain result of a dataset may not be consistent with its collected result, because we use the pre-analyzed logical plan of the dataset in the explain command, and the explain command analyzes the logical plan passed in. So if a view is changed after the dataset was created, the plans shown by the explain command aren't the same as the plan of the dataset.

```scala
scala> spark.range(10).createOrReplaceTempView("test")
scala> spark.range(5).createOrReplaceTempView("test2")
scala> spark.sql("select * from test").createOrReplaceTempView("tmp001")
scala> val df = spark.sql("select * from tmp001")
scala> spark.sql("select * from test2").createOrReplaceTempView("tmp001")
scala> df.show
+---+
| id|
+---+
|  0|
|  1|
|  2|
|  3|
|  4|
|  5|
|  6|
|  7|
|  8|
|  9|
+---+
scala> df.explain(true)
```

Before:
```scala
== Parsed Logical Plan ==
'Project [*]
+- 'UnresolvedRelation `tmp001`

== Analyzed Logical Plan ==
id: bigint
Project [id#2L]
+- SubqueryAlias `tmp001`
   +- Project [id#2L]
      +- SubqueryAlias `test2`
         +- Range (0, 5, step=1, splits=Some(12))

== Optimized Logical Plan ==
Range (0, 5, step=1, splits=Some(12))

== Physical Plan ==
*(1) Range (0, 5, step=1, splits=12)
```

After:
```scala
== Parsed Logical Plan ==
'Project [*]
+- 'UnresolvedRelation `tmp001`

== Analyzed Logical Plan ==
id: bigint
Project [id#0L]
+- SubqueryAlias `tmp001`
   +- Project [id#0L]
      +- SubqueryAlias `test`
         +- Range (0, 10, step=1, splits=Some(12))

== Optimized Logical Plan ==
Range (0, 10, step=1, splits=Some(12))

== Physical Plan ==
*(1) Range (0, 10, step=1, splits=12)
```

The previous PR for this issue has a regression when explaining an explain statement, like `sql("explain select 1").explain(true)`. This new fix follows hvanhovell's advice at https://github.com/apache/spark/pull/24464#issuecomment-494165538.

Explain an explain:
```scala
scala> sql("explain select 1").explain(true)
== Parsed Logical Plan ==
ExplainCommand 'Project [unresolvedalias(1, None)], false, false, false

== Analyzed Logical Plan ==
plan: string
ExplainCommand 'Project [unresolvedalias(1, None)], false, false, false

== Optimized Logical Plan ==
ExplainCommand 'Project [unresolvedalias(1, None)], false, false, false

== Physical Plan ==
Execute ExplainCommand
   +- ExplainCommand 'Project [unresolvedalias(1, None)], false, false, false
```

Btw, I found there is a regression after applying hvanhovell's advice:

```scala
spark.readStream
      .format("org.apache.spark.sql.streaming.test")
      .load()
      .explain(true)
```

```scala
== Parsed Logical Plan ==
StreamingRelation DataSource(org.apache.spark.sql.test.TestSparkSession3e8c7175,org.apache.spark.sql.streaming.test,List(),None,List(),None,Map(),None
), dummySource, [a#559]

== Analyzed Logical Plan ==
a: int
StreamingRelation DataSource(org.apache.spark.sql.test.TestSparkSession3e8c7175,org.apache.spark.sql.streaming.test,List(),None,List(),None,Map(),Non$
), dummySource, [a#559]

== Optimized Logical Plan ==
org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
dummySource
== Physical Plan ==
org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
dummySource
```

So I made a change to fix that too.

## How was this patch tested?

Added tests and tested manually.

Closes #24654 from viirya/SPARK-27439-3.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
2019-05-21 11:27:05 -07:00
Sean Owen eed6de1a65 [MINOR][DOCS] Tighten up some key links to the project and download pages to use HTTPS
## What changes were proposed in this pull request?

Tighten up some key links to the project and download pages to use HTTPS

## How was this patch tested?

N/A

Closes #24665 from srowen/HTTPSURLs.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-05-21 10:56:42 -07:00
Sean Owen 4d64ed8114 [SPARK-27796][MESOS] Remove obsolete spark-mesos Dockerfile example
## What changes were proposed in this pull request?

Remove obsolete spark-mesos Dockerfile example. This isn't tested and apparently hasn't been updated in 4 years.

## How was this patch tested?

N/A

Closes #24667 from srowen/SPARK-27796.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-05-21 10:53:55 -07:00
DB Tsai 808d9d05fc [SPARK-27762][SQL] Support user provided avro schema for writing fields with different ordering
## What changes were proposed in this pull request?

The Spark Avro reader supports reading Avro files with a user-provided schema whose fields are in a different order. However, the Avro writer doesn't support this. This PR adds the feature to the Spark Avro writer.
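
A hedged sketch of the use case, assuming the spark-avro module is on the classpath and using its `avroSchema` option (spark-shell, with implicits in scope):

```scala
// user-provided schema listing the fields in a different order than the DataFrame
val avroSchema =
  """{"type": "record", "name": "rec", "fields": [
    |  {"name": "b", "type": "string"},
    |  {"name": "a", "type": "long"}
    |]}""".stripMargin

// DataFrame columns are (a, b); the provided write schema orders them as (b, a)
Seq((1L, "x"), (2L, "y")).toDF("a", "b")
  .write
  .format("avro")
  .option("avroSchema", avroSchema)
  .save("/tmp/avro_reordered")
```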

## How was this patch tested?

New test is added.

Closes #24635 from dbtsai/avroFix.

Lead-authored-by: DB Tsai <d_tsai@apple.com>
Co-authored-by: Brian Lindblom <blindblom@apple.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
2019-05-21 17:34:19 +00:00
David Navas 9e73be38a5 [SPARK-27726][CORE] Fix performance of ElementTrackingStore deletes when using InMemoryStore under high loads
The details of the PR are explored in-depth in the sub-tasks of the umbrella jira SPARK-27726.
Briefly:
  1. Stop issuing asynchronous requests to cleanup elements in the tracking store when a request is already pending
  2. Fix a couple of thread-safety issues (mutable state and mis-ordered updates)
  3. Move Summary deletion outside of Stage deletion loop like Tasks already are
  4. Reimplement multi-delete in a removeAllKeys call which allows InMemoryStore to implement it in a performant manner.
  5. Some generic typing and exception handling cleanup

We see about five orders of magnitude improvement in the deletion code, which for us is the difference between a server that needs restarting daily, and one that is stable over weeks.

Unit tests for the fire-once asynchronous code and the removeAll calls in both LevelDB and InMemoryStore are supplied.  It was noted that the testing code for the LevelDB and InMemoryStore is highly repetitive, and should probably be merged, but we did not attempt that in this PR.

A version of this code was run in our production 2.3.3 and we were able to sustain higher throughput without going into GC overload (which was happening on a daily basis some weeks ago).

A version of this code was also put under a purpose-built Performance Suite of tests to verify performance under both types of Store implementations for both before and after code streams and for both total and partial delete cases (this code is not included in this PR).

Closes #24616 from davidnavas/PentaBugFix.

Authored-by: David Navas <davidn@clearstorydata.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-05-21 10:22:21 -07:00
Wenchen Fan 1e0facb60d [SQL][DOC][MINOR] update documents for Table and WriteBuilder
## What changes were proposed in this pull request?

Update the docs to reflect the changes made by https://github.com/apache/spark/pull/24129

## How was this patch tested?

N/A

Closes #24658 from cloud-fan/comment.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-05-21 09:29:06 -07:00
HyukjinKwon 20fb01bbea [MINOR][PYTHON] Remove explain(True) in test_udf.py
## What changes were proposed in this pull request?

Not a big deal but it bugged me. This PR removes printing out plans in PySpark UDF tests.

Before:

```
Running tests...
----------------------------------------------------------------------
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
== Parsed Logical Plan ==
GlobalLimit 1
+- LocalLimit 1
   +- Project [id#668L, <lambda>(id#668L) AS copy#673]
      +- Sort [id#668L ASC NULLS FIRST], true
         +- Range (0, 10, step=1, splits=Some(4))

== Analyzed Logical Plan ==
id: bigint, copy: int
GlobalLimit 1
+- LocalLimit 1
   +- Project [id#668L, <lambda>(id#668L) AS copy#673]
      +- Sort [id#668L ASC NULLS FIRST], true
         +- Range (0, 10, step=1, splits=Some(4))

== Optimized Logical Plan ==
GlobalLimit 1
+- LocalLimit 1
   +- Project [id#668L, pythonUDF0#676 AS copy#673]
      +- BatchEvalPython [<lambda>(id#668L)], [id#668L, pythonUDF0#676]
         +- Range (0, 10, step=1, splits=Some(4))

== Physical Plan ==
CollectLimit 1
+- *(2) Project [id#668L, pythonUDF0#676 AS copy#673]
   +- BatchEvalPython [<lambda>(id#668L)], [id#668L, pythonUDF0#676]
      +- *(1) Range (0, 10, step=1, splits=4)

...........................................
----------------------------------------------------------------------
Ran 43 tests in 19.777s
```

After:

```
Running tests...
----------------------------------------------------------------------
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
...........................................
----------------------------------------------------------------------
Ran 43 tests in 25.201s
```

## How was this patch tested?

N/A

Closes #24661 from HyukjinKwon/remove-explain-in-test.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-05-21 23:39:31 +09:00
Josh Rosen 604aa1b045 [SPARK-27786][SQL] Fix Sha1, Md5, and Base64 codegen when commons-codec is shaded
## What changes were proposed in this pull request?

When running a custom build of Spark which shades `commons-codec`, the `Sha1` expression generates code which fails to compile:

```
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 47, Column 93: A method named "sha1Hex" is not declared in any enclosing class nor any supertype, nor through a static import
```

This is caused by an interaction between Spark's code generator and the shading: the current codegen template includes the string `org.apache.commons.codec.digest.DigestUtils.sha1Hex` as part of a larger string literal, preventing JarJarLinks from being able to replace the class name with the shaded class's name. As a result, the generated code still references the original unshaded class name, triggering an error in case the original unshaded dependency isn't on the classpath.

This problem impacts the `Sha1`, `Md5`, and `Base64` expressions.

To fix this problem and allow for proper shading, this PR updates the codegen templates to replace the hardcoded class names with `${classOf[<name>].getName}` calls.
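
A simplified sketch of the template change (illustrative, not the exact Spark source; the generated variable name is hypothetical):

```scala
// `input` stands for the generated variable holding the input bytes
val input = "value1"

// Before: the class name is embedded in the string literal, so shading cannot rewrite it.
val before = s"org.apache.commons.codec.digest.DigestUtils.sha1Hex($input)"

// After: the (possibly shaded/relocated) class name is resolved when Spark is compiled.
val digestUtils = classOf[org.apache.commons.codec.digest.DigestUtils].getName
val after = s"$digestUtils.sha1Hex($input)"
```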

## How was this patch tested?

Existing tests.

To ensure that I found all occurrences of this problem, I used IntelliJ's "Find in Path" to search for lines matching the regex `^(?!import|package).*(org|com|net|io)\.(?!apache\.spark)` and then filtered matches to inspect only non-test "Usage in string constants" cases. This isn't _perfect_ but I think it'll catch most cases.

Closes #24655 from JoshRosen/fix-shaded-apache-commons.

Authored-by: Josh Rosen <rosenville@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-05-21 21:18:34 +08:00
Prashant Sharma 5f4b50513c [MINOR][DOCS] Fix Spark hive example.
## What changes were proposed in this pull request?

The documentation has an error: https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html#hive-tables.

The example:
```scala
scala> val dataDir = "/tmp/parquet_data"
dataDir: String = /tmp/parquet_data

scala> spark.range(10).write.parquet(dataDir)

scala> sql(s"CREATE EXTERNAL TABLE hive_ints(key int) STORED AS PARQUET LOCATION '$dataDir'")
res6: org.apache.spark.sql.DataFrame = []

scala> sql("SELECT * FROM hive_ints").show()

+----+
| key|
+----+
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
+----+
```

`range` emits a column named `id`, not `key`, which is why the query above returns nulls.
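
One way to make the example line up, sketched below (the actual doc fix may differ), is to write a column literally named `key`:

```scala
// spark-shell sketch: write a `key` column so it matches the Hive DDL
val dataDir = "/tmp/parquet_data"
spark.range(10).select($"id".cast("int").as("key")).write.parquet(dataDir)
sql(s"CREATE EXTERNAL TABLE hive_ints(key int) STORED AS PARQUET LOCATION '$dataDir'")
sql("SELECT * FROM hive_ints").show()   // now shows the values instead of nulls
```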

Closes #24657 from ScrapCodes/fix_hive_example.

Lead-authored-by: Prashant Sharma <prashant@apache.org>
Co-authored-by: Prashant Sharma <prashsh1@in.ibm.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-05-21 18:23:38 +09:00
hustfeiwang d90c460c48 [SPARK-27637][SHUFFLE] For nettyBlockTransferService, if IOException occurred while fetching data, check whether relative executor is alive before retry
## What changes were proposed in this pull request?

There are several kinds of shuffle clients: blockTransferService and externalShuffleClient.

For the externalShuffleClient, there is a corresponding external shuffle service, which serves the shuffle block data regardless of the state of the executors.

The blockTransferService is used to fetch broadcast blocks, and to fetch shuffle data when the external shuffle service is not enabled.

When fetching data with the blockTransferService, the shuffle client connects to the corresponding executor's BlockManager, so if that executor is dead, the fetch can never succeed.

When spark.shuffle.service.enabled is true and spark.dynamicAllocation.enabled is true, an executor will be removed after it has been idle for more than idleTimeout.

If a blockTransferService successfully creates a connection to an executor, but that executor is removed just as the broadcast block fetch begins, it will retry (see RetryingBlockFetcher), which is ineffective.

If spark.shuffle.io.retryWait and spark.shuffle.io.maxRetries are large, such as 30s and 10 retries, this can waste 5 minutes.

In this PR, we check whether the corresponding executor is alive before retrying.

## How was this patch tested?

Unit test.

Closes #24533 from turboFei/SPARK-27637.

Authored-by: hustfeiwang <wangfei3@corp.netease.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-05-21 13:45:42 +08:00
Dongjoon Hyun 039db879f4 Revert "[SPARK-27439][SQL] Explainging Dataset should show correct resolved plans"
This reverts commit 4b725e50a7.
2019-05-20 15:07:00 -07:00
Wenchen Fan 0e6601acdf [SPARK-27747][SQL] add a logical plan link in the physical plan
## What changes were proposed in this pull request?

It's pretty useful if we can convert a physical plan back to a logical plan, e.g., in https://github.com/apache/spark/pull/24389

This PR introduces a new feature to `TreeNode`, which allows `TreeNode` to carry some extra information via a mutable map, and keep the information when it's copied.

The planner leverages this feature to put the logical plan into the physical plan.
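
A hedged sketch of the mechanism (illustrative names; the actual `TreeNode` API may differ):

```scala
import scala.collection.mutable

// nodes carry a small mutable map of typed tags that survives copies
case class NodeTag[T](name: String)

class TaggedNode {
  val tags: mutable.Map[NodeTag[_], Any] = mutable.Map.empty
  def setTagValue[T](tag: NodeTag[T], value: T): Unit = tags(tag) = value
  def getTagValue[T](tag: NodeTag[T]): Option[T] = tags.get(tag).map(_.asInstanceOf[T])
  def copyTagsFrom(other: TaggedNode): Unit = tags ++= other.tags
}

// the planner can then stash the originating logical plan on a physical node, e.g.:
// physicalNode.setTagValue(NodeTag[LogicalPlan]("logical_plan"), logicalPlan)
```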

## How was this patch tested?

a test suite that runs all TPCDS queries and checks that some common physical plans contain the corresponding logical plans.

Closes #24626 from cloud-fan/link.

Lead-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Peng Bo <bo.peng1019@gmail.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
2019-05-20 13:42:25 -07:00
Yuming Wang 5dda1fe296 [SPARK-27699][FOLLOW-UP][SQL][test-hadoop3.2][test-maven] Fix hadoop-3.2 test error
## What changes were proposed in this pull request?

This PR fixes the `hadoop-3.2` test error:
```
- SPARK-27699 Converting disjunctions into ORC SearchArguments *** FAILED ***
  Expected "...SS_THAN_EQUALS a 10)[
  leaf-1 = (LESS_THAN a 1)
  ]expr = (or (not leaf...", but got "...SS_THAN_EQUALS a 10)[, leaf-1 = (LESS_THAN a 1), ]expr = (or (not leaf..." (HiveOrcFilterSuite.scala:445)
```
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/105514/consoleFull

## How was this patch tested?

N/A

Closes #24639 from wangyum/SPARK-27699.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-05-20 13:04:05 -07:00
Sean Owen db24b04cad [MINOR][EXAMPLES] Don't use internal Spark logging in user examples
## What changes were proposed in this pull request?

Don't use internal Spark logging in user examples, because users shouldn't / can't use it directly anyway. These examples already use println in some cases. Note that the usage in StreamingExamples is on purpose.

## How was this patch tested?

N/A

Closes #24649 from srowen/ExampleLog.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-05-20 08:43:03 -07:00
HyukjinKwon b7bf4fd123 [SPARK-27402][INFRA][FOLLOW-UP] Exclude 'hive-thriftserver' in modules to test for hadoop3.2 for now
## What changes were proposed in this pull request?

This PR excludes 'hive-thriftserver' from the modules to test for hadoop3.2 for now as well.

## How was this patch tested?

Manually tested via `run-tests.py`

Closes #24644 from HyukjinKwon/SPARK-27402.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-05-20 07:53:19 -07:00
Yuming Wang 974b879220 [SPARK-27694][SQL] Support auto-updating table statistics for data source CTAS command
## What changes were proposed in this pull request?

This PR adds support for collecting statistics for CTAS (creating a data source table using the result of a query).

## How was this patch tested?

unit tests and manual tests:
```sql
bin/spark-sql --conf spark.sql.statistics.size.autoUpdate.enabled=true -S

spark-sql> CREATE TABLE spark_27694 USING parquet AS SELECT 'a', 'b';
spark-sql> DESC FORMATTED spark_27694;
a	string	NULL
b	string	NULL

# Detailed Table Information
Database	default
Table	spark_27694
Owner	root
Created Time	Mon May 13 19:45:33 GMT-07:00 2019
Last Access	Wed Dec 31 17:00:00 GMT-07:00 1969
Created By	Spark 3.0.0-SNAPSHOT
Type	MANAGED
Provider	parquet
Statistics	561 bytes
Location	file:/user/hive/warehouse/spark_27694
Serde Library	org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat	org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat	org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
```

Closes #24596 from wangyum/SPARK-27694.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-05-19 22:29:40 -07:00
Ryan Blue bc46feaced [SPARK-27693][SQL] Add default catalog property
Add a SQL config property for the default v2 catalog.

Existing tests for regressions.

Closes #24594 from rdblue/SPARK-27693-add-default-catalog-config.

Authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-05-19 21:30:52 -07:00
Yuming Wang 93b5a2b686 [SPARK-27610][FOLLOW-UP][YARN] Remove duplicate declaration of plugin maven-antrun-plugin
## What changes were proposed in this pull request?

This PR removes a duplicate declaration of the plugin `org.apache.maven.plugins:maven-antrun-plugin`:
```
[WARNING] Some problems were encountered while building the effective model for org.apache.spark:spark-network-yarn_2.12:jar:3.0.0-SNAPSHOT
[WARNING] 'build.plugins.plugin.(groupId:artifactId)' must be unique but found duplicate declaration of plugin org.apache.maven.plugins:maven-antrun-plugin  line 177, column 15
[WARNING]
[WARNING] It is highly recommended to fix these problems because they threaten the stability of your build.
[WARNING]
[WARNING] For this reason, future Maven versions might no longer support building such malformed projects.
[WARNING]
```
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/105523/consoleFull

## How was this patch tested?

Existing test

Closes #24641 from wangyum/SPARK-27610.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-05-19 20:59:35 -07:00
HyukjinKwon 2431ab0999 [SPARK-27771][SQL] Add SQL description for grouping functions (cube, rollup, grouping and grouping_id)
## What changes were proposed in this pull request?

Both look added as of 2.0 (see SPARK-12541 and SPARK-12706). I referred to existing docs and examples in other API docs.
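
The added descriptions can be inspected like this (a minimal example):

```scala
// show the new description and examples for one of the grouping functions
spark.sql("DESCRIBE FUNCTION EXTENDED grouping_id").show(truncate = false)
```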

## How was this patch tested?

Manually built the documentation, ran the examples, and ran `DESCRIBE FUNCTION EXTENDED`.

Closes #24642 from HyukjinKwon/SPARK-27771.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-05-19 19:26:20 -07:00
Arun Mahadevan 1a8c09334d [SPARK-27754][K8S] Introduce additional config (spark.kubernetes.driver.request.cores) for driver request cores for spark on k8s
## What changes were proposed in this pull request?

Spark on K8s supports a config for specifying the executor cpu requests
(spark.kubernetes.executor.request.cores), but a similar config is missing
for the driver. Instead, the integer `spark.driver.cores` value is currently used.

Although a pod spec can have `cpu` for fine-grained control, like the following, this PR proposes an additional configuration, `spark.kubernetes.driver.request.cores`, for the driver request cores.
```
resources:
  requests:
    memory: "64Mi"
    cpu: "250m"
```
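
A sketch of setting the proposed config (assuming its value follows the same conventions as `spark.kubernetes.executor.request.cores`, e.g. `0.5` or `500m`):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.kubernetes.driver.request.cores", "500m") // fine-grained CPU request for the driver pod
  .set("spark.driver.cores", "1")                       // previously the only (integer-valued) knob
```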

## How was this patch tested?

Unit tests

Closes #24630 from arunmahadevan/SPARK-27754.

Authored-by: Arun Mahadevan <arunm@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-05-18 21:28:46 -07:00
liuxian 9bca99b29b [SPARK-27552][SQL] The configuration hive.exec.stagingdir is invalid on Windows OS
## What changes were proposed in this pull request?
If we set `hive.exec.stagingdir=.test-staging\tmp`,
the staging directory is still `.hive-staging` on Windows OS.

Reasons for failure:
Test code:
```
 val path = new Path("C:\\test\\hivetable")
  println("path.toString: " + path.toString)
  println("path.toUri.getPath: " + path.toUri.getPath)
```

Output:
```
path.toString: C:/test/hivetable
path.toUri.getPath: /C:/test/hivetable
```
We can see that `path.toUri.getPath` has one more separator than `path.toString`, and the separator is '/', not '\'.
So `stagingPathName.stripPrefix(inputPathName).stripPrefix(File.separator).startsWith(".")` will return false.
## How was this patch tested?
1. Existing tests
2. Manual testing on Windows OS

Closes #24446 from 10110346/stagingdir.

Authored-by: liuxian <liu.xian3@zte.com.cn>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-05-17 14:00:17 -05:00
Dongjoon Hyun 141a3bfc8d [SPARK-27755][BUILD] Update zstd-jni to 1.4.0-1
## What changes were proposed in this pull request?

This PR aims to update `zstd-jni` library to `1.4.0-1` which improves the `level 1 compression speed` performance by 6% in most scenarios. The following is the full release note.
- https://github.com/facebook/zstd/releases/tag/v1.4.0

## How was this patch tested?

Pass the Jenkins.

Closes #24632 from dongjoon-hyun/SPARK-27755.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-05-17 08:34:45 -07:00
Gengliang Wang e39e97b73a [SPARK-27699][SQL] Partially push down disjunctive predicated in Parquet/ORC
## What changes were proposed in this pull request?

Currently, in `ParquetFilters` and `OrcFilters`, if the child predicate of `Or` operator can't be entirely pushed down, the predicates will be thrown away.
In fact, the conjunctive predicates under `Or` operators can be partially pushed down.
For example, say `a` and `b` are convertible, while `c` can't be pushed down. Then the predicate
`a or (b and c)`
can be converted to
`(a or b) and (a or c)`,
and we can still push down `(a or b)`.
A disjunctive predicate cannot be pushed down only when one of its children is not even partially convertible.
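
The transformation can be illustrated with the public filter classes (a simplified sketch, not the actual `ParquetFilters`/`OrcFilters` code):

```scala
import org.apache.spark.sql.sources._

val a = GreaterThan("x", 1)
val b = LessThan("y", 10)
val c = StringContains("z", "foo")        // suppose this one cannot be pushed down

val original    = Or(a, And(b, c))        // a or (b and c)
val distributed = And(Or(a, b), Or(a, c)) // (a or b) and (a or c): Or(a, b) is still pushable
```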

This PR also improves the filter pushdown logic in `DataSourceV2Strategy`. With partial filter pushdown in the `Or` operator, the result of `pushedFilters()` might not exist in the mapping `translatedFilterToExpr`. To fix it, this PR changes the mapping `translatedFilterToExpr` to map leaf filter expressions to `sources.Filter`, and later on rebuilds the whole expression with the mapping.
## How was this patch tested?

Unit test

Closes #24598 from gengliangwang/pushdownDisjunctivePredicates.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-05-17 19:25:24 +08:00
Kazuaki Ishizaki 9e0d8c6ce2 [SPARK-27752][CORE] Upgrade lz4-java from 1.5.1 to 1.6.0
## What changes were proposed in this pull request?

This PR upgrades lz4-java from 1.5.1 to 1.6.0. Lz4-java is available at https://github.com/lz4/lz4-java.

Changes from 1.5.1:
- Upgraded LZ4 to 1.9.1. Updated the JNI bindings, except for the one for Linux/i386. Decompression speed is improved on amd64.
- Deprecated use of LZ4FastDecompressor of a native instance because the corresponding C API function is deprecated. See the release note of LZ4 1.9.0 for details. Updated javadoc accordingly.
- Changed the module name from org.lz4.lz4-java to org.lz4.java to avoid using - in the module name. (severn-everett, Oliver Eikemeier, Rei Odaira)
- Enabled build with Java 11. Note that the distribution is still built with Java 7. (Rei Odaira)

## How was this patch tested?

Existing tests.

Closes #24629 from kiszk/SPARK-27752.

Authored-by: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-05-16 20:45:13 -07:00
Wenchen Fan fc5bd6da77 [SPARK-27576][SQL] table capability to skip the output column resolution
## What changes were proposed in this pull request?

Currently we have an analyzer rule, which resolves the output columns of data source v2 writing plans, to make sure the schema of the input query is compatible with the table.

However, not all data sources need this check. For example, the `NoopDataSource` doesn't care about the schema of input query at all.

This PR introduces a new table capability: ACCEPT_ANY_SCHEMA. If a table reports this capability, we skip resolving output columns for it during write.
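
A hedged sketch of a table opting in (class and package names follow the DSv2 API of that era and may differ):

```scala
import java.util
import org.apache.spark.sql.sources.v2.{Table, TableCapability}
import org.apache.spark.sql.types.StructType

class AnySchemaSinkTable extends Table {
  override def name(): String = "any_schema_sink"
  override def schema(): StructType = new StructType()   // the schema is irrelevant for this sink
  override def capabilities(): util.Set[TableCapability] =
    util.EnumSet.of(TableCapability.BATCH_WRITE, TableCapability.ACCEPT_ANY_SCHEMA)
}
```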

Note that, we already skip resolving output columns for `NoopDataSource` because it implements `SupportsSaveMode`. However, `SupportsSaveMode` is a hack and will be removed soon.

## How was this patch tested?

new test cases

Closes #24469 from cloud-fan/schema-check.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-05-16 16:24:53 -07:00
Shixiong Zhu 6a317c8f01 [SPARK-27735][SS] Parsing interval string should be case-insensitive in SS
## What changes were proposed in this pull request?

Some APIs in Structured Streaming require the user to specify an interval. Right now these APIs don't accept upper-case strings.

This PR adds a new method `fromCaseInsensitiveString` to `CalendarInterval` to support parsing upper-case strings, and fixes all APIs that need to parse an interval string.
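
For example, one of the affected interval-taking APIs (a spark-shell sketch):

```scala
import org.apache.spark.sql.streaming.Trigger

val df = spark.readStream.format("rate").load()
val query = df.writeStream
  .format("console")
  .trigger(Trigger.ProcessingTime("10 SECONDS"))  // an upper-case interval string now parses
  .start()
```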

## How was this patch tested?

The new unit test.

Closes #24619 from zsxwing/SPARK-27735.

Authored-by: Shixiong Zhu <zsxwing@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-05-16 13:58:27 -07:00
shivusondur c6a45e6f67 [SPARK-27722][SQL] removed the unsed "UnsafeKeyValueSorter" file.
## What changes were proposed in this pull request?

removed the unused "UnsafeKeyValueSorter.java" file

## How was this patch tested?

Ran Compilation and UT locally.

Closes #24622 from shivusondur/jira27722.

Authored-by: shivusondur <shivusondur@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-05-16 18:22:06 +08:00
Wenchen Fan 3e30a98810 [SPARK-27674][SQL] the hint should not be dropped after cache lookup
## What changes were proposed in this pull request?

This is a follow-up of https://github.com/apache/spark/pull/20365.

#20365 fixed this problem when the hint node is a root node. This PR fixes the problem for all cases.

## How was this patch tested?

a new test

Closes #24580 from cloud-fan/bug.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
2019-05-15 15:47:52 -07:00
Yuming Wang 02c33694c8 [SPARK-27354][SQL] Move incompatible code from the hive-thriftserver module to sql/hive-thriftserver/v1.2.1
## What changes were proposed in this pull request?

When we upgraded the built-in Hive to 2.3.4, the current `hive-thriftserver` module became incompatible, due to Hive changes such as:
1. [HIVE-12442](https://issues.apache.org/jira/browse/HIVE-12442) HiveServer2: Refactor/repackage HiveServer2's Thrift code so that it can be used in the tasks
2. [HIVE-12237](https://issues.apache.org/jira/browse/HIVE-12237) Use slf4j as logging facade
3. [HIVE-13169](https://issues.apache.org/jira/browse/HIVE-13169) HiveServer2: Support delegation token based connection when using http transport

So this PR moves the incompatible code to `sql/hive-thriftserver/v1.2.1` and copies it to `sql/hive-thriftserver/v2.3.4` for the next code review.

## How was this patch tested?

manual tests:
```
diff -urNa sql/hive-thriftserver/v1.2.1 sql/hive-thriftserver/v2.3.4
```

Closes #24282 from wangyum/SPARK-27354.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
2019-05-15 14:52:08 -07:00
Xingbo Jiang 0bba5cf568 [SPARK-20774][SPARK-27036][SQL] Cancel the running broadcast execution on BroadcastTimeout
## What changes were proposed in this pull request?

In the existing code, a broadcast execution timeout for the Future only causes a query failure, but the job running with the broadcast and the computation in the Future are not canceled. This wastes resources and slows down the other jobs. This PR tries to cancel both the running job and the running hashed relation construction thread.

## How was this patch tested?

Add new test suite `BroadcastExchangeExec`

Closes #24595 from jiangxb1987/SPARK-20774.

Authored-by: Xingbo Jiang <xingbo.jiang@databricks.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
2019-05-15 14:47:15 -07:00
Gabor Somogyi efa303581a [SPARK-27687][SS] Rename Kafka consumer cache capacity conf and document caching
## What changes were proposed in this pull request?

Kafka-related Spark parameters have to start with `spark.kafka.` and not with `spark.sql.`. Because of this, I've renamed `spark.sql.kafkaConsumerCache.capacity`.

Since Kafka consumer caching is not documented, I've added documentation for it as well.

## How was this patch tested?

Existing + added unit test.

```
cd docs
SKIP_API=1 jekyll build
```
and manual webpage check.

Closes #24590 from gaborgsomogyi/SPARK-27687.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-05-15 10:42:09 -07:00
Marcelo Vanzin d14e2d7874 [SPARK-27678][UI] Allow user impersonation in the UI.
This feature allows proxy servers to identify the actual request user
using a request parameter, and performs access control checks against
that user instead of the authenticated user. Impersonation is only
allowed if the authenticated user is configured as an admin.

The request parameter used ("doAs") matches the one currently used by
Knox, but it should be easy to change / customize if different proxy
servers use a different way of identifying the original user.

Tested with updated unit tests and also with a live server behind Knox.

Closes #24582 from vanzin/SPARK-27678.

Authored-by: Marcelo Vanzin <vanzin@cloudera.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-05-15 09:58:12 -07:00
Sean Owen bfb3ffe9b3 [SPARK-27682][CORE][GRAPHX][MLLIB] Replace use of collections and methods that will be removed in Scala 2.13 with work-alikes
## What changes were proposed in this pull request?

This replaces use of collection classes like `MutableList` and `ArrayStack` with workalikes that are available in 2.12, as they will be removed in 2.13. It also removes use of `.to[Collection]` as its use was superfluous anyway. Removing `collection.breakOut` will have to wait until 2.13.
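
For instance, one possible 2.12-compatible workalike for the removed classes (the PR may pick different replacements):

```scala
import scala.collection.mutable.ListBuffer

// instead of scala.collection.mutable.MutableList (removed in 2.13)
val buf = ListBuffer.empty[Int]
buf += 1
buf += 2

// instead of scala.collection.mutable.ArrayStack (removed in 2.13): a plain List used as a stack
var stack = List.empty[Int]
stack = 1 :: stack       // push
val top = stack.head     // peek
stack = stack.tail       // pop
```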

## How was this patch tested?

Existing tests

Closes #24586 from srowen/SPARK-27682.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-05-15 09:29:12 -05:00
xy_xin fd9acf23b0 [SPARK-27713][SQL] Move org.apache.spark.sql.execution.* in catalyst to core
## What changes were proposed in this pull request?

`RecordBinaryComparator`, `UnsafeExternalRowSorter` and `UnsafeKeyValueSorter` currently live in catalyst, but they should be moved to core, as they're used only in physical plans.

## How was this patch tested?

Existing tests.

Closes #24607 from xianyinxin/SPARK-27713.

Authored-by: xy_xin <xianyin.xxy@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-05-15 15:24:21 +08:00
gengjiaan 7dd2dd5dc5 [MINOR][SS] Remove duplicate 'add' in comment of StructuredSessionization.
## What changes were proposed in this pull request?

The `StructuredSessionization` comment contains a duplicate 'add'; I think it should be changed.

## How was this patch tested?

Existing UTs.

Closes #24589 from beliefer/remove-duplicate-add-in-comment.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-05-15 16:01:43 +09:00
Ryan Blue 2da5b21834 [SPARK-24923][SQL] Implement v2 CreateTableAsSelect
## What changes were proposed in this pull request?

This adds a v2 implementation for CTAS queries

* Update the SQL parser to parse CREATE queries using multi-part identifiers
* Update `CheckAnalysis` to validate partitioning references with the CTAS query schema
* Add `CreateTableAsSelect` v2 logical plan and `CreateTableAsSelectExec` v2 physical plan
* Update create conversion from `CreateTableAsSelectStatement` to support the new v2 logical plan
* Update `DataSourceV2Strategy` to convert v2 CTAS logical plan to the new physical plan
* Add `findNestedField` to `StructType` to support reference validation
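
For example, a CTAS against a v2 catalog with a multi-part identifier looks like this (a sketch; `testcat` is a hypothetical catalog registered via `spark.sql.catalog.testcat`):

```scala
spark.sql("""
  CREATE TABLE testcat.ns.people
  USING parquet
  AS SELECT 'alice' AS name, 30 AS age
""")
```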

## How was this patch tested?

We have been running these changes in production for several months. Also:

* Add a test suite `CreateTablePartitioningValidationSuite` for new analysis checks
* Add a test suite for v2 SQL, `DataSourceV2SQLSuite`
* Update catalyst `DDLParserSuite` to use multi-part identifiers (`Seq[String]`)
* Add test cases to `PlanResolutionSuite` for v2 CTAS: known catalog and v2 source implementation

Closes #24570 from rdblue/SPARK-24923-add-v2-ctas.

Authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-05-15 11:24:03 +08:00
Yuming Wang fee695d0cf [SPARK-27690][SQL] Remove materialized views first in HiveClientImpl.reset
## What changes were proposed in this pull request?

We should remove materialized views first, otherwise (note that Hive 3.1 can reproduce this issue):
```scala
Cause: org.apache.derby.shared.common.error.DerbySQLIntegrityConstraintViolationException: DELETE on table 'TBLS' caused a violation of foreign key constraint 'MV_TABLES_USED_FK2' for key (4).  The statement has been rolled back.
at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown Source)
at org.apache.derby.impl.jdbc.TransactionResourceImpl.wrapInSQLException(Unknown Source)
at org.apache.derby.impl.jdbc.TransactionResourceImpl.handleException(Unknown Source)
at org.apache.derby.impl.jdbc.EmbedConnection.handleException(Unknown Source)
at org.apache.derby.impl.jdbc.ConnectionChild.handleException(Unknown Source)
at org.apache.derby.impl.jdbc.EmbedStatement.executeStatement(Unknown Source)
at org.apache.derby.impl.jdbc.EmbedPreparedStatement.executeBatchElement(Unknown Source)
at org.apache.derby.impl.jdbc.EmbedStatement.executeLargeBatch(Unknown Source)
```

## How was this patch tested?

Existing test

Closes #24592 from wangyum/SPARK-27690.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-05-14 09:05:22 -07:00
Sean Owen a10608cb82 [SPARK-27680][CORE][SQL][GRAPHX] Remove usage of Traversable
## What changes were proposed in this pull request?

This removes usage of `Traversable`, which is removed in Scala 2.13. This is mostly an internal change, except for the change in the `SparkConf.setAll` method. See additional comments below.

## How was this patch tested?

Existing tests.

Closes #24584 from srowen/SPARK-27680.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-05-14 09:14:56 -05:00
pgandhi 695dbe27ce [SPARK-25719][UI] : Search functionality in datatables in stages page should search over formatted data rather than the raw data
The pull request to add datatables to the stage page, SPARK-21809, got merged. The search functionality in those datatables, while a great improvement for searching through a large number of tasks, performs the search over the raw data rather than the formatted data displayed in the tables. It would be great if the search could happen on the formatted data as well.

## What changes were proposed in this pull request?

Added code to enable searching over the displayed data in tables, e.g. searching for "165.7 MiB" or "0.3 ms" will now return results. Also, search was previously missing for two columns in the task table, "Shuffle Read Bytes" and "Shuffle Remote Reads", which I have added here.

## How was this patch tested?

Manual Tests

Closes #24419 from pgandhi999/SPARK-25719.

Authored-by: pgandhi <pgandhi@verizonmedia.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-05-14 09:05:13 -05:00