## What changes were proposed in this pull request?
We have been having a potential problem with `Union` when the children have the same expression ids in their outputs, which happens in a self-union.
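For illustration, a minimal self-union that reproduces the situation (a sketch assuming a `SparkSession` named `spark`; not code from this PR):
```scala
// Both children of the resulting Union node are the same plan, so their
// outputs carry identical expression ids.
val df = spark.range(3)
val unioned = df.union(df)
unioned.explain(true) // the two Union children report the same attribute ids
```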
## How was this patch tested?
Modified some tests to adjust plan changes.
Closes#24236 from ueshin/issues/SPARK-27314/dedup_union.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
This fixes the `analyzer should replace current_date with literals` test in `ComputeCurrentTimeSuite` by making the calculation of `min` and `max` days independent of the time zone.
## How was this patch tested?
by `ComputeCurrentTimeSuite`.
Closes#24240 from MaxGekk/current-date-followup.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
When `logConf` is set to true, config values whose keys contain passwords were printed in cleartext in the driver log. This change uses the existing `redact` method in `Utils` to redact all passwords, based on the redaction pattern in `SparkConf`, before printing the conf to the driver log, thus ensuring that sensitive information such as passwords is not printed in cleartext.
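A rough sketch of the approach (`Utils.redact` is Spark-internal and the helper below is hypothetical):
```scala
import org.apache.spark.SparkConf
import org.apache.spark.util.Utils

// Redact values whose keys match the redaction pattern before building the
// string that gets logged when spark.logConf is true.
def redactedDebugString(conf: SparkConf): String =
  Utils.redact(conf, conf.getAll.toSeq)
    .sorted
    .map { case (k, v) => s"$k=$v" }
    .mkString("\n")
```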
## How was this patch tested?
This patch was tested through `SparkConfSuite` & then entire unit test through sbt
Closes#24196 from ninadingole/SPARK-27244.
Authored-by: Ninad Ingole <robert.wallis@example.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
This makes the `CurrentDate` expression and the `current_date` function independent of time zone settings. The new result is the number of days since the epoch in the `UTC` time zone. Previously, Spark shifted the current date (in the `UTC` time zone) according to the session time zone, which violates the definition of `DateType`: the number of days since the epoch (an absolute point in time, midnight of Jan 1 1970 in UTC).
This change makes `CurrentDate` consistent with `CurrentTimestamp`, which is also independent of the time zone.
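A minimal sketch of the new semantics, using the plain Java time API rather than the PR's code:
```scala
import java.time.{Instant, ZoneOffset}

// The current date as the number of days since the epoch, derived from the
// current instant in UTC rather than from the session time zone.
val daysSinceEpoch: Long = Instant.now().atOffset(ZoneOffset.UTC).toLocalDate.toEpochDay
```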
## How was this patch tested?
The changes were tested by existing test suites like `DateExpressionsSuite`.
Closes#24185 from MaxGekk/current-date.
Lead-authored-by: Maxim Gekk <max.gekk@gmail.com>
Co-authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
move
```scala
org.apache.spark.sql.execution.streaming.BaseStreamingSource
org.apache.spark.sql.execution.streaming.BaseStreamingSink
```
to the java directory.
## How was this patch tested?
Existing UT.
Closes#24222 from ConeyLiu/move-scala-to-java.
Authored-by: Xianyang Liu <xianyang.liu@intel.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
When using fair scheduler mode for the Thrift server, we may get unpredictable results.
```scala
val pool = sessionToActivePool.get(parentSession.getSessionHandle)
if (pool != null) {
  sqlContext.sparkContext.setLocalProperty(SparkContext.SPARK_SCHEDULER_POOL, pool)
}
```
The cause is that we execute queries for the Thrift server on a thread pool, and when we call `setLocalProperty` we may get unpredictable behavior:
```scala
/**
 * Set a local property that affects jobs submitted from this thread, such as the Spark fair
 * scheduler pool. User-defined properties may also be set here. These properties are propagated
 * through to worker tasks and can be accessed there via
 * [[org.apache.spark.TaskContext#getLocalProperty]].
 *
 * These properties are inherited by child threads spawned from this thread. This
 * may have unexpected consequences when working with thread pools. The standard java
 * implementation of thread pools have worker threads spawn other worker threads.
 * As a result, local properties may propagate unpredictably.
 */
def setLocalProperty(key: String, value: String) {
  if (value == null) {
    localProperties.get.remove(key)
  } else {
    localProperties.get.setProperty(key, value)
  }
}
```
I posted an example at https://jira.apache.org/jira/browse/SPARK-26914 .
## How was this patch tested?
UT
Closes#23826 from caneGuy/zhoukang/fix-scheduler-error.
Authored-by: zhoukang <zhoukang199191@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
I was playing with the scheduler and found this weird thing: in `TaskSchedulerImpl` we import `scala.collection.Set` without any reason. This is bad in practice, as it silently changes the actual class when we simply write `Set`, which by default should point to the immutable set.
This change only affects one method: `getExecutorsAliveOnHost`. I checked all the call sites and none of them need a general `Set` type.
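A minimal illustration of the pitfall (not code from the PR):
```scala
import scala.collection.Set // `Set` now means the general trait, not immutable.Set

// Compiles, because the general Set trait admits mutable implementations.
val s: Set[Int] = scala.collection.mutable.HashSet(1, 2, 3)
// With the default binding, scala.collection.immutable.Set, this would not compile.
```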
## How was this patch tested?
N/A
Closes#24231 from cloud-fan/minor.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
- Adds persistent volume integration tests
- Adds a custom tag to the test to exclude it if it is run against a cloud backend.
- Assumes the default fs type for the host; AFAIK that is ext4.
## How was this patch tested?
Manually run the tests against minikube as usual:
```
[INFO] --- scalatest-maven-plugin:1.0:test (integration-test) spark-kubernetes-integration-tests_2.12 ---
Discovery starting.
Discovery completed in 192 milliseconds.
Run starting. Expected test count is: 16
KubernetesSuite:
- Run SparkPi with no resources
- Run SparkPi with a very long application name.
- Use SparkLauncher.NO_RESOURCE
- Run SparkPi with a master URL without a scheme.
- Run SparkPi with an argument.
- Run SparkPi with custom labels, annotations, and environment variables.
- Run extraJVMOptions check on driver
- Run SparkRemoteFileTest using a remote data file
- Run SparkPi with env and mount secrets.
- Run PySpark on simple pi.py example
- Run PySpark with Python2 to test a pyfiles example
- Run PySpark with Python3 to test a pyfiles example
- Run PySpark with memory customization
- Run in client mode.
- Start pod creation from template
- Test PVs with local storage
```
Closes#23514 from skonto/pvctests.
Authored-by: Stavros Kontopoulos <stavros.kontopoulos@lightbend.com>
Signed-off-by: shane knapp <incomplete@gmail.com>
## What changes were proposed in this pull request?
In https://github.com/apache/spark/pull/23130, all empty files are excluded from target file splits in `FileSourceScanExec`.
In File source V2, we should keep the same behavior.
This PR filters out empty files when listing files in `PartitioningAwareFileIndex`, so that the upper levels don't need to handle them.
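A hedged sketch of the listing-time filter (the helper name is hypothetical; the actual change lives inside `PartitioningAwareFileIndex`):
```scala
import org.apache.hadoop.fs.FileStatus

// Drop zero-length files during listing so they never reach split planning.
def nonEmptyDataFiles(files: Seq[FileStatus]): Seq[FileStatus] =
  files.filter(f => f.isFile && f.getLen > 0)
```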
## How was this patch tested?
Unit test
Closes#24227 from gengliangwang/ignoreEmptyFile.
Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
For now, `ReuseSubquery` in Spark compares two subqueries at the `SubqueryExec` level, which renders the `ReuseSubquery` rule ineffective. This pull request fixes that, and adds a configuration key exclusively for subquery reuse.
## How was this patch tested?
add a unit test.
Closes#24214 from adrian-wang/reuse.
Authored-by: Daoyuan Wang <me@daoyuan.wang>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
## What changes were proposed in this pull request?
To make the blocking behaviour consistent, this PR makes catalog table/view `uncacheQuery` non-blocking by default. Once this PR is merged, all such behaviours in Spark are non-blocking by default.
## How was this patch tested?
Pass Jenkins.
Closes#24212 from maropu/SPARK-26771-FOLLOWUP.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
## What changes were proposed in this pull request?
In the original PR, #24158, pruning nested fields in complex map keys was not supported, because some methods in schema pruning didn't support it at that moment. This is a followup to add it.
## How was this patch tested?
Added tests.
Closes#24220 from viirya/SPARK-26847-followup.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
## What changes were proposed in this pull request?
Subquery reuse and exchange reuse are not the same feature: if we want to reuse exchanges but not subqueries, that cannot be done with only one configuration.
This PR adds a new configuration `spark.sql.subquery.reuse` to control subqueryReuse.
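Hypothetical usage of the new key (assuming a `SparkSession` named `spark`):
```scala
// Disable subquery reuse without touching exchange reuse.
spark.conf.set("spark.sql.subquery.reuse", "false")
```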
## How was this patch tested?
N/A
Closes#23998 from 10110346/SUBQUERY_REUSE.
Authored-by: liuxian <liu.xian3@zte.com.cn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
In data source V2, the method `PartitionReader.next()` has side effects: when the method is called, the current reader proceeds to the next record.
This might throw a RuntimeException/IOException, and the File source V2 framework should handle these exceptions.
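A hedged sketch of such handling (the helper name and policy flag are hypothetical):
```scala
import java.io.IOException

// Wrap the side-effecting next() so a corrupted file can be skipped rather
// than failing the whole task when the ignore-corrupt-files policy is on.
def safeNext(next: () => Boolean, ignoreCorruptFiles: Boolean): Boolean =
  try {
    next()
  } catch {
    case _: IOException | _: RuntimeException if ignoreCorruptFiles =>
      false // treat the rest of the corrupted file as exhausted
  }
```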
## How was this patch tested?
Unit test.
Closes#24225 from gengliangwang/corruptFile.
Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
To make https://github.com/apache/spark/pull/23788 easy to review, this PR moves `OrcColumnVector.java`, `OrcShimUtils.scala`, `OrcFilters.scala` and `OrcFilterSuite.scala` to `sql/core/v1.2.1` and copies them to `sql/core/v2.3.4`.
## How was this patch tested?
manual tests
```shell
diff -urNa sql/core/v1.2.1 sql/core/v2.3.4
```
Closes#24119 from wangyum/SPARK-27182.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
## What changes were proposed in this pull request?
Raise the serialized task size threshold at which a warning is generated from 100 KiB to 1000 KiB.
As several people have noted, the original change for this JIRA highlighted that this threshold is low. Test output regularly shows:
```
- sorting on StringType with nullable=false, sortOrder=List('a DESC NULLS LAST)
22:47:53.320 WARN org.apache.spark.scheduler.TaskSetManager: Stage 80 contains a task of very large size (755 KiB). The maximum recommended task size is 100 KiB.
22:47:53.348 WARN org.apache.spark.scheduler.TaskSetManager: Stage 81 contains a task of very large size (755 KiB). The maximum recommended task size is 100 KiB.
22:47:53.417 WARN org.apache.spark.scheduler.TaskSetManager: Stage 83 contains a task of very large size (755 KiB). The maximum recommended task size is 100 KiB.
22:47:53.444 WARN org.apache.spark.scheduler.TaskSetManager: Stage 84 contains a task of very large size (755 KiB). The maximum recommended task size is 100 KiB.
...
- SPARK-20688: correctly check analysis for scalar sub-queries
22:49:10.314 WARN org.apache.spark.scheduler.DAGScheduler: Broadcasting large task binary with size 150.8 KiB
- SPARK-21835: Join in correlated subquery should be duplicateResolved: case 1
22:49:10.595 WARN org.apache.spark.scheduler.DAGScheduler: Broadcasting large task binary with size 150.7 KiB
22:49:10.744 WARN org.apache.spark.scheduler.DAGScheduler: Broadcasting large task binary with size 150.7 KiB
22:49:10.894 WARN org.apache.spark.scheduler.DAGScheduler: Broadcasting large task binary with size 150.7 KiB
- SPARK-21835: Join in correlated subquery should be duplicateResolved: case 2
- SPARK-21835: Join in correlated subquery should be duplicateResolved: case 3
- SPARK-23316: AnalysisException after max iteration reached for IN query
22:49:11.559 WARN org.apache.spark.scheduler.DAGScheduler: Broadcasting large task binary with size 154.2 KiB
```
It seems that a larger threshold of about 1 MB is more suitable.
## How was this patch tested?
Existing tests.
Closes#24226 from srowen/SPARK-26660.2.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
## What changes were proposed in this pull request?
SPARK-26982 allows users to describe the output of a query. However, it did not support CTEs, due to a limitation of the grammar, which had a single rule to parse both selects and inserts. After SPARK-27209, which splits select and insert parsing into two different rules, we can now easily support describing the output of CTEs.
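A hypothetical example of what becomes describable (assuming a `SparkSession` named `spark`):
```scala
// Describe the output of a CTE query; the parser previously rejected this.
spark.sql("DESCRIBE QUERY WITH t AS (SELECT 1 AS col) SELECT * FROM t").show()
```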
## How was this patch tested?
Existing tests were modified.
Closes#24224 from dilipbiswal/describe_support_cte.
Authored-by: Dilip Biswal <dbiswal@us.ibm.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
Currently, File source v2 allows each data source to specify the supported data types by implementing the method `supportsDataType` in `FileScan` and `FileWriteBuilder`.
However, in the read path, the validation checks all the data types in `readSchema`, which might contain partition columns. This is actually a regression. E.g. the Text data source only supports the String data type, while partition columns can still contain the Integer type, since partition columns are processed by Spark.
This PR is to:
1. Refactor schema validation and check the data schema only (a sketch follows this list).
2. Filter the partition columns out of the data schema if a user-specified schema is provided.
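A hedged sketch of point 1 (names assumed; not the PR's exact code):
```scala
import org.apache.spark.sql.types.StructType

// Validate only the data schema: the read schema minus partition columns,
// since partition columns are processed by Spark itself.
def dataSchema(readSchema: StructType, partitionSchema: StructType): StructType =
  StructType(readSchema.filterNot(f => partitionSchema.fieldNames.contains(f.name)))
```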
## How was this patch tested?
Unit test
Closes#24203 from gengliangwang/schemaValidation.
Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
Right now there are several issues in `EncryptedMessage.transferTo`:
- When the underlying buffer has more than `1024 * 32` bytes (this should be rare, but it can happen in error messages sent over the wire), it may send only a partial message, as `EncryptedMessage.count` becomes less than `transferred`. This will cause the client to hang forever (or time out) as it waits to receive the expected number of bytes, or cause weird errors (such as corruption or a silent correctness issue) if the channel is reused by other messages.
- When the underlying buffer is full, it still tries to write out bytes in a busy loop.
This PR fixes the issues in `EncryptedMessage.transferTo` and also makes it follow the contract of `FileRegion`:
- `count` should be a fixed value which is just the length of the whole message.
- It should be non-blocking. When the underlying socket is not ready to write, it should give up and give control back.
- `transferTo` should return the length of written bytes.
## How was this patch tested?
The new added tests.
Closes#24211 from zsxwing/fix-enc.
Authored-by: Shixiong Zhu <zsxwing@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
In the PR, I propose to use the SQL config `spark.sql.session.timeZone` when formatting `TIMESTAMP` literals, and to make formatting of `DATE` literals independent of the time zone. The changes make parsing and formatting of `TIMESTAMP`/`DATE` literals consistent, and independent of the default time zone of the current JVM.
Also, this PR ports `TIMESTAMP`/`DATE` literal formatting to the Proleptic Gregorian calendar by using `TimestampFormatter`/`DateFormatter`.
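A hypothetical demonstration (assuming a `SparkSession` named `spark`):
```scala
// The session time zone, not the JVM default, now drives how the TIMESTAMP
// literal is parsed and formatted.
spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
spark.sql("SELECT TIMESTAMP '2019-03-27 10:00:00'").show(truncate = false)
```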
## How was this patch tested?
Added new tests to `LiteralExpressionSuite`
Closes#24181 from MaxGekk/timezone-aware-literals.
Authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
Followup to PR https://github.com/apache/spark/pull/17085
This PR adds the weight column to the PySpark side, which was already added to the Scala API.
The PR also undoes a name change on the Scala side corresponding to a change in another similar PR, as noted here:
https://github.com/apache/spark/pull/17084#discussion_r259648639
## How was this patch tested?
This patch adds python tests for the changes to the pyspark API.
Closes#24197 from imatiach-msft/ilmat/regressor-eval-python.
Authored-by: Ilya Matiach <ilmat@microsoft.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
I happened to meet this case a few times before:
```
Enter comma-separated fix version(s) [3.0.0]: 3.0,0
Restoring head pointer to master
git checkout master
Already on 'master'
git branch
Traceback (most recent call last):
File "./dev/merge_spark_pr_jira.py", line 537, in <module>
main()
File "./dev/merge_spark_pr_jira.py", line 523, in main
resolve_jira_issues(title, merged_refs, jira_comment)
File "./dev/merge_spark_pr_jira.py", line 359, in resolve_jira_issues
resolve_jira_issue(merge_branches, comment, jira_id)
File "./dev/merge_spark_pr_jira.py", line 302, in resolve_jira_issue
jira_fix_versions = map(lambda v: get_version_json(v), fix_versions)
File "./dev/merge_spark_pr_jira.py", line 302, in <lambda>
jira_fix_versions = map(lambda v: get_version_json(v), fix_versions)
File "./dev/merge_spark_pr_jira.py", line 300, in get_version_json
return filter(lambda v: v.name == version_str, versions)[0].raw
IndexError: list index out of range
```
I typed the fix version wrongly (there's a comma in `3.0,0`) and it ended the loop in the merge script. Not a big deal, but it has bugged me a few times. I finally hit it again today and decided to fix it.
This PR proposes to recover from wrongly set fix versions.
## How was this patch tested?
I manually copied and pasted the specific codes and tested separately in both Python 2 and Python 3.
**Positive cases:**
```
Enter comma-separated fix version(s) [3.0.0]: # blank test (to use default)
['3.0.0']
```
```
Enter comma-separated fix version(s) [3.0.0,2.4.2]: # multiple default versions
['3.0.0', '2.4.2']
```
```
Enter comma-separated fix version(s) [3.0.0]: 2.4.1 # valid version
['2.4.1']
```
```
Enter comma-separated fix version(s) [3.0.0]: 3.0.0,2.4.2 # multiple valid versions
['3.0.0', '2.4.2']
```
**Keyboard interrupt (Ctrl + C):**
```
Enter comma-separated fix version(s) [3.0.0]: ^CTraceback (most recent call last): # keyboard interrupt
File "test_merge_script.py", line 45, in <module>
test()
File "test_merge_script.py", line 26, in test
fix_versions = input("Enter comma-separated fix version(s) [%s]: " % default_fix_versions)
KeyboardInterrupt
```
**Wrongly typed versions (recovered):**
```
Enter comma-separated fix version(s) [3.0.0]: 3.1
Specified version(s) [3.1] not found in the available versions, try again (or leave blank and fix manually).
Enter comma-separated fix version(s) [3.0.0]: 123
Specified version(s) [123] not found in the available versions, try again (or leave blank and fix manually).
Enter comma-separated fix version(s) [3.0.0]: 3.0,0
Specified version(s) [3.0, 0] not found in the available versions, try again (or leave blank and fix manually).
Enter comma-separated fix version(s) [3.0.0]: damn
Specified version(s) [damn] not found in the available versions, try again (or leave blank and fix manually).
Enter comma-separated fix version(s) [3.0.0]: 3.0.0,2.5.2 # one invalid version among multiple versions
Specified version(s) [3.0.0, 2.5.2] not found in the available versions, try again (or leave blank and fix manually).
```
**Arbitrary exceptions in fix version parsing (recovered):**
```
Enter comma-separated fix version(s) [3.0.0]:
Traceback (most recent call last):
File "tmp.py", line 11, in <module>
raise Exception("arbitrary exception")
Exception: arbitrary exception
Error setting fix version(s), try again (or leave blank and fix manually)
Enter comma-separated fix version(s) [3.0.0]:
Traceback (most recent call last):
File "tmp.py", line 10, in <module>
raise Exception("arbitrary exception")
Exception: arbitrary exception
Error setting fix version(s), try again (or leave blank and fix manually)
Enter comma-separated fix version(s) [3.0.0]:
```
Closes#24213 from HyukjinKwon/merge_script_fix_version.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
This is a follow-up of #23169.
We should've used string interpolation to show the config key in the warning message.
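The bug class in miniature (the config key is hypothetical):
```scala
val configKey = "spark.sql.some.conf"
// Without the `s` prefix, "$configKey" would be printed literally.
println(s"The config $configKey is deprecated.")
```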
## How was this patch tested?
Existing tests.
Closes#24217 from ueshin/issues/SPARK-26103/s.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
This PR is a follow-up of #23393.
The HTML in the doc is broken, so this fixes the broken `code` tag.
## How was this patch tested?
Existing tests.
Closes#24216 from ueshin/issues/SPARK-26288/fix_doc.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
Currently, in the grammar file, the rule `query` is responsible for parsing both select and insert statements. As a result, we need more semantic checks in the code to guard against invalid insert constructs in a query; a couple of examples are in the `visitCreateView` and `visitAlterView` functions. Another issue is that we don't catch invalid insert constructs in all the places until `checkAnalysis` (and the errors we raise can be confusing as well). Here are a couple of examples:
```SQL
select * from (insert into bar values (2));
```
```
Error in query: unresolved operator 'Project [*];
'Project [*]
+- SubqueryAlias `__auto_generated_subquery_name`
+- InsertIntoHiveTable `default`.`bar`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, false, false, [c1]
+- Project [cast(col1#18 as int) AS c1#20]
+- LocalRelation [col1#18]
```
```SQL
select * from foo where c1 in (insert into bar values (2))
```
```
Error in query: cannot resolve '(default.foo.`c1` IN (listquery()))' due to data type mismatch:
The number of columns in the left hand side of an IN subquery does not match the
number of columns in the output of subquery.
#columns in left hand side: 1.
#columns in right hand side: 0.
Left side columns:
[default.foo.`c1`].
Right side columns:
[].;;
'Project [*]
+- 'Filter c1#6 IN (list#5 [])
: +- InsertIntoHiveTable `default`.`bar`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, false, false, [c1]
: +- Project [cast(col1#7 as int) AS c1#9]
: +- LocalRelation [col1#7]
+- SubqueryAlias `default`.`foo`
+- HiveTableRelation `default`.`foo`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#6]
```
For both the cases above, we should reject the syntax at parser level.
In this PR, we create two top-level parser rules to parse `SELECT` and `INSERT` respectively.
I will create a small PR to allow CTEs in DESCRIBE QUERY after this PR is in.
## How was this patch tested?
Added tests to `PlanParserSuite` and removed the semantic check tests from `SparkSqlParserSuite`.
Closes#24150 from dilipbiswal/split-query-insert.
Authored-by: Dilip Biswal <dbiswal@us.ibm.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
This PR proposes to add an assert to `ScalarSubquery`'s `dataType`, because there's a possibility that `dataType` is called before the analysis exception is thrown.
This was found while working on [SPARK-27088](https://issues.apache.org/jira/browse/SPARK-27088). That change calls `treeString` for logging purposes, and the specific test "scalar subquery with no column" under `AnalysisErrorSuite` was failing with:
```
Caused by: sbt.ForkMain$ForkError: java.util.NoSuchElementException: next on empty iterator
...
at scala.collection.mutable.ArrayOps$ofRef.head(ArrayOps.scala:198)
at org.apache.spark.sql.catalyst.expressions.ScalarSubquery.dataType(subquery.scala:251)
at org.apache.spark.sql.catalyst.expressions.Alias.dataType(namedExpressions.scala:163)
...
at org.apache.spark.sql.catalyst.trees.TreeNode.simpleString(TreeNode.scala:465)
...
at org.apache.spark.sql.catalyst.rules.RuleExecutor$PlanChangeLogger.logRule(RuleExecutor.scala:176)
at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:116)
...
```
The reason is that `treeString`, called for logging, happened to call `dataType` on `ScalarSubquery`, but one test has a plan with no columns, so it threw `NoSuchElementException` before the analysis check.
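A hedged sketch of the added guard (the helper name is hypothetical; the assertion message matches the test output below):
```scala
import org.apache.spark.sql.types.{DataType, StructType}

// Fail with a clear assertion instead of a NoSuchElementException when the
// subquery plan has no output columns.
def scalarSubqueryDataType(planSchema: StructType): DataType = {
  assert(planSchema.fields.nonEmpty, "Scala subquery should have only one column")
  planSchema.fields.head.dataType
}
```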
## How was this patch tested?
Manually tested.
```scala
ScalarSubquery(LocalRelation()).treeString
```
```
An exception or error caused a run to abort: assertion failed: Scala subquery should have only one column
java.lang.AssertionError: assertion failed: Scala subquery should have only one column
at scala.Predef$.assert(Predef.scala:223)
at org.apache.spark.sql.catalyst.expressions.ScalarSubquery.dataType(subquery.scala:252)
at org.apache.spark.sql.catalyst.analysis.AnalysisErrorSuite.<init>(AnalysisErrorSuite.scala:116)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at java.lang.Class.newInstance(Class.java:442)
at org.scalatest.tools.Runner$.genSuiteConfig(Runner.scala:1428)
at org.scalatest.tools.Runner$.$anonfun$doRunRunRunDaDoRunRun$8(Runner.scala:1236)
at scala.collection.immutable.List.map(List.scala:286)
at org.scalatest.tools.Runner$.doRunRunRunDaDoRunRun(Runner.scala:1235)
```
Closes#24182 from sandeep-katta/subqueryissue.
Authored-by: sandeep-katta <sandeep.katta2007@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
As per https://docs.oracle.com/javase/8/docs/api/java/lang/ClassLoader.html:
> Class loaders that support concurrent loading of classes are known as parallel capable class loaders and are required to register themselves at their class initialization time by invoking the ClassLoader.registerAsParallelCapable method. Note that the ClassLoader class is registered as parallel capable by default. However, its subclasses still need to register themselves if they are parallel capable.

I.e., we can have finer-grained class-loading locks by registering our classloaders as parallel capable. (Refer to the deadlock due to the macro lock: https://issues.apache.org/jira/browse/SPARK-26961.)
All the classloaders we have are wrappers of `URLClassLoader`, which is itself parallel capable, but this registration cannot be done from Scala code due to static registration; see https://github.com/scala/bug/issues/11429
## How was this patch tested?
All existing UTs must pass.
Closes#24126 from ajithme/driverlock.
Authored-by: Ajith <ajith2489@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
In SPARK-26837, we prune nested fields from object serializers if they are unnecessary in the query execution; it leaves support for MapType as a TODO item. This PR adds support for map types.
## How was this patch tested?
Added tests.
Closes#24158 from viirya/SPARK-26847.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
We need to add `map_keys` and `map_values` into `ProjectionOverSchema` to support those methods in nested schema pruning. This also adds end-to-end tests to `SchemaPruningSuite`.
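A hypothetical query shape this enables pruning for (the column names are invented):
```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, explode, map_values}

// Only `name` inside the map's value struct is read, so the other fields of
// the value struct can be pruned from the scan.
def namesOnly(df: DataFrame): DataFrame =
  df.select(explode(map_values(col("m"))).as("v")).select("v.name")
```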
## How was this patch tested?
Added tests.
Closes#24202 from viirya/SPARK-27268.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
Currently, if we want to configure `spark.sql.files.maxPartitionBytes` to 256 megabytes, we must set `spark.sql.files.maxPartitionBytes=268435456`, which is very unfriendly to users.
And if we set it like this: `spark.sql.files.maxPartitionBytes=256M`, we encounter this exception:
```
Exception in thread "main" java.lang.IllegalArgumentException:
spark.sql.files.maxPartitionBytes should be long, but was 256M
at org.apache.spark.internal.config.ConfigHelpers$.toNumber(ConfigBuilder.scala)
```
This PR uses `bytesConf` to replace `longConf` or `intConf` wherever the configuration sets a number of bytes (a sketch follows the list below).
Configuration change list:
`spark.files.maxPartitionBytes`
`spark.files.openCostInBytes`
`spark.shuffle.sort.initialBufferSize`
`spark.shuffle.spill.initialMemoryThreshold`
`spark.sql.autoBroadcastJoinThreshold`
`spark.sql.files.maxPartitionBytes`
`spark.sql.files.openCostInBytes`
`spark.sql.defaultSizeInBytes`
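A hedged sketch of the declaration change (`ConfigBuilder` is Spark-internal; the default shown is illustrative):
```scala
import org.apache.spark.internal.config.ConfigBuilder
import org.apache.spark.network.util.ByteUnit

// With bytesConf, values such as "256m" are parsed via byte-unit suffixes
// instead of failing the long/int parse.
val MAX_PARTITION_BYTES = ConfigBuilder("spark.sql.files.maxPartitionBytes")
  .bytesConf(ByteUnit.BYTE)
  .createWithDefault(128 * 1024 * 1024)
```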
## How was this patch tested?
1. Existing unit tests
2. Manual testing
Closes#24187 from 10110346/bytesConf.
Authored-by: liuxian <liu.xian3@zte.com.cn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
This applies some minor updates/cleanups following up on SPARK-26928, notably renaming `JVMCPU.scala` to `JVMCPUSource.scala`.
## How was this patch tested?
Manually tested
Closes#24201 from LucaCanali/fixupSPARK-26928.
Authored-by: Luca Canali <luca.canali@cern.ch>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
Adding missing spaces after commas.
Closes#24205 from attilapiros/minor-doc-changes.
Authored-by: “attilapiros” <piros.attila.zsolt@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Update Oracle docker image name.
## How was this patch tested?
./build/mvn test -Pdocker-integration-tests -pl :spark-docker-integration-tests_2.12
Closes#24086 from lipzhu/SPARK-27155.
Authored-by: Zhu, Lipeng <lipzhu@ebay.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Now that we support returning a pandas DataFrame for struct type in a Scalar Pandas UDF, if we chain another Pandas UDF after the Scalar Pandas UDF returning a pandas DataFrame, the argument of the chained UDF will be a pandas DataFrame; but currently we don't support a pandas DataFrame as an argument of a Scalar Pandas UDF. That means there is an inconsistency between the chained UDF and a single UDF.
We should support taking a pandas DataFrame for a struct type argument in Scalar Pandas UDFs, to be consistent.
Currently pyarrow >= 0.11 is supported.
## How was this patch tested?
Modified and added some tests.
Closes#24177 from ueshin/issues/SPARK-27240/structtype_argument.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Bryan Cutler <cutlerb@gmail.com>
## What changes were proposed in this pull request?
Remove Scala 2.11 support in build files and docs, and in various parts of code that accommodated 2.11. See some targeted comments below.
## How was this patch tested?
Existing tests.
Closes#23098 from srowen/SPARK-26132.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
This is a follow-up of #24047 and fixes wrong tests in `StatisticsCollectionSuite`.
## How was this patch tested?
Pass Jenkins.
Closes#24198 from maropu/SPARK-25196-FOLLOWUP-2.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
## What changes were proposed in this pull request?
Passing multiple configurations while submitting a Spark application is not clearly documented and no examples are given; it would be better to document this, since the Spark documentation lacks clarity here.
While browsing, I could see a few questions raised by users; a reference is provided below:
https://community.hortonworks.com/questions/105022/spark-submit-multiple-configurations.html
As part of the fix, I documented the above scenario with an example.
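A hypothetical sketch of the documented scenario using `SparkLauncher` (paths and class names invented); each configuration is passed as its own setting, just like repeated `--conf` flags on `spark-submit`:
```scala
import org.apache.spark.launcher.SparkLauncher

val launcher = new SparkLauncher()
  .setAppResource("/path/to/app.jar") // hypothetical application jar
  .setMainClass("com.example.Main")   // hypothetical main class
  .setConf("spark.eventLog.enabled", "false")
  .setConf("spark.executor.extraJavaOptions", "-XX:+PrintGCDetails")
```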
## How was this patch tested?
Manual inspection of the updated document.
Closes#24191 from sujith71955/master_conf.
Authored-by: s71955 <sujithchacko.2010@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
When a timeout happens we don't know the state of the remote end, so there is no point in doing anything else, since it will most probably fail anyway.
The change also demotes the log message printed when falling back to
SASL, since a warning is too noisy for when the fallback is really
needed (e.g. old shuffle service, or shuffle service with new auth
disabled).
Closes#24160 from vanzin/SPARK-27219.
Authored-by: Marcelo Vanzin <vanzin@cloudera.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
This PR aims to update the Kafka dependency to 2.2.0 to bring the following improvements and bug fixes:
- https://issues.apache.org/jira/projects/KAFKA/versions/12344063
Due to [KAFKA-4453](https://issues.apache.org/jira/browse/KAFKA-4453), the data plane API and the controller plane API are separated, and Apache Spark needs the following changes:
```scala
- servers.head.apis.metadataCache
+ servers.head.dataPlaneRequestProcessor.metadataCache
```
## How was this patch tested?
Pass the Jenkins with the existing tests.
Closes#24190 from dongjoon-hyun/SPARK-27260.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
This fixes a typo in the SQL config value: DATETIME_JAVA8API_**EANBLED** -> DATETIME_JAVA8API_**ENABLED**.
## How was this patch tested?
This was tested by `RowEncoderSuite` and `LiteralExpressionSuite`.
Closes#24194 from MaxGekk/date-localdate-followup.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
A Hive UDAF knows the aggregation mode when creating the aggregation buffer, so that it can create different buffers for different inputs: the original data or an aggregation buffer. Please see an example in the [sketches library](7f9e76e9e0/src/main/java/com/yahoo/sketches/hive/cpc/DataToSketchUDAF.java (L107)).
However, the Hive UDAF adapter in Spark always creates the buffer with PARTIAL1 mode, which can only deal with one input: the original data. This PR fixes it.
All credits go to pgandhi999, who investigated the problem, studied the Hive UDAF behaviors, and wrote the tests.
Closes https://github.com/apache/spark/pull/23778
## How was this patch tested?
a new test
Closes#24144 from cloud-fan/hive.
Lead-authored-by: pgandhi <pgandhi@verizonmedia.com>
Co-authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
## What changes were proposed in this pull request?
Fix Scala 2.11 maven build issue after merging SPARK-26946.
## How was this patch tested?
Maven Scala 2.11 and 2.12 builds with `-Phadoop-provided -Phadoop-2.7 -Pyarn -Phive -Phive-thriftserver`.
Closes#24184 from jzhuge/SPARK-26946-1.
Authored-by: John Zhuge <jzhuge@apache.org>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
`SelectedField` doesn't support `map_keys` and `map_values` for now. When a map key or value is a complex struct, we should be able to prune unnecessary fields from keys/values. This proposes adding `map_keys` and `map_values` support to `SelectedField`.
## How was this patch tested?
Added tests.
Closes#24179 from viirya/SPARK-27241.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
This is a follow-up of #24047; to follow the `CacheManager.cachedData` lock semantics, this PR wraps the `statsOfPlanToCache` update with `synchronized`.
## How was this patch tested?
Pass Jenkins
Closes#24178 from maropu/SPARK-24047-FOLLOWUP.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
Migrate CSV to File Data Source V2.
## How was this patch tested?
Unit test
Closes#24005 from gengliangwang/CSVDataSourceV2.
Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
The `hashSeed` method allocates 64 bytes instead of 8. The other bytes are always zeros (thanks to the default behavior of `ByteBuffer`) and can be excluded from the hash calculation because they don't differentiate inputs.
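A hedged sketch close to the shape of the fix (details assumed):
```scala
import java.nio.ByteBuffer
import scala.util.hashing.MurmurHash3

// Allocate exactly the 8 bytes a Long occupies, so no constant zero padding
// is fed into the hash; previously this buffer was ByteBuffer.allocate(64).
def hashSeed(seed: Long): Long = {
  val bytes = ByteBuffer.allocate(8).putLong(seed).array()
  val lowBits = MurmurHash3.bytesHash(bytes)
  val highBits = MurmurHash3.bytesHash(bytes, lowBits)
  (highBits.toLong << 32) | (lowBits.toLong & 0xFFFFFFFFL)
}
```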
## How was this patch tested?
By running the existing tests in `XORShiftRandomSuite`.
Closes#20793 from MaxGekk/hash-buff-size.
Lead-authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>