Commit graph

24115 commits

Author SHA1 Message Date
Sean Owen 23bde44797 [SPARK-27358][UI] Update jquery to 1.12.x to pick up security fixes
## What changes were proposed in this pull request?

Update jquery -> 1.12.4, datatables -> 1.10.18, mustache -> 2.3.12.
Add missing mustache license

## How was this patch tested?

I manually tested the UI locally with the javascript console open and didn't observe any problems or JS errors. The only 'risky' change seems to be mustache, but on reading its release notes, I don't think the changes from 0.8.1 to 2.x would affect Spark's simple usage.

Closes #24288 from srowen/SPARK-27358.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-04-05 12:54:01 -05:00
Jungtaek Lim (HeartSaVioR) a840b99daf [MINOR][DOC] Fix html tag broken in configuration.md
## What changes were proposed in this pull request?

This patch fixes wrong HTML tag in configuration.md which breaks the table tag.

This is originally reported in dev mailing list: https://lists.apache.org/thread.html/744bdc83b3935776c8d91bf48fdf80d9a3fed3858391e60e343206f9%3Cdev.spark.apache.org%3E

## How was this patch tested?

This change is a one-liner and pretty obvious, so I guess we may be able to skip testing.

Closes #24304 from HeartSaVioR/MINOR-configuration-doc-html-tag-error.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-04-05 08:41:19 -07:00
gatorsmile 5678e687c6 [SPARK-27393][SQL] Show ReusedSubquery in the plan when the subquery is reused
## What changes were proposed in this pull request?
With this change, we can easily identify the plan difference when subquery is reused.
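For reference, a query of roughly this shape (a hypothetical reconstruction from the `avg(key)` aggregates shown in the plans below, not taken from the PR) produces such plans:

```scala
// Hypothetical reproduction: the same scalar subquery appears twice, so the planner
// can reuse the first subquery's result for the second occurrence.
val df = spark.sql(
  """SELECT (SELECT avg(key) FROM testData) + (SELECT avg(key) FROM testData)
    |FROM testData
    |LIMIT 1""".stripMargin)
df.explain()  // with reuse enabled, the physical plan contains ReusedSubquery nodes
```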

When the reuse is enabled, the plan looks like
```
== Physical Plan ==
CollectLimit 1
+- *(1) Project [(Subquery subquery240 + ReusedSubquery Subquery subquery240) AS (scalarsubquery() + scalarsubquery())#253]
   :  :- Subquery subquery240
   :  :  +- *(2) HashAggregate(keys=[], functions=[avg(cast(key#13 as bigint))], output=[avg(key)#250])
   :  :     +- Exchange SinglePartition
   :  :        +- *(1) HashAggregate(keys=[], functions=[partial_avg(cast(key#13 as bigint))], output=[sum#256, count#257L])
   :  :           +- *(1) SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData, true])).key AS key#13]
   :  :              +- Scan[obj#12]
   :  +- ReusedSubquery Subquery subquery240
   +- *(1) SerializeFromObject
      +- Scan[obj#12]
```

When the reuse is disabled, the plan looks like
```
== Physical Plan ==
CollectLimit 1
+- *(1) Project [(Subquery subquery286 + Subquery subquery287) AS (scalarsubquery() + scalarsubquery())#299]
   :  :- Subquery subquery286
   :  :  +- *(2) HashAggregate(keys=[], functions=[avg(cast(key#13 as bigint))], output=[avg(key)#296])
   :  :     +- Exchange SinglePartition
   :  :        +- *(1) HashAggregate(keys=[], functions=[partial_avg(cast(key#13 as bigint))], output=[sum#302, count#303L])
   :  :           +- *(1) SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData, true])).key AS key#13]
   :  :              +- Scan[obj#12]
   :  +- Subquery subquery287
   :     +- *(2) HashAggregate(keys=[], functions=[avg(cast(key#13 as bigint))], output=[avg(key)#298])
   :        +- Exchange SinglePartition
   :           +- *(1) HashAggregate(keys=[], functions=[partial_avg(cast(key#13 as bigint))], output=[sum#306, count#307L])
   :              +- *(1) SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData, true])).key AS key#13]
   :                 +- Scan[obj#12]
   +- *(1) SerializeFromObject
      +- Scan[obj#12]
```

## How was this patch tested?
Modified the existing test.

Closes #24258 from gatorsmile/followupSPARK-27279.

Authored-by: gatorsmile <gatorsmile@gmail.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
2019-04-05 08:31:41 -07:00
Gengliang Wang 568db94e0c [SPARK-27356][SQL] File source V2: Fix the case that data columns overlap with partition schema
## What changes were proposed in this pull request?

In the current file source V2 framework, the schema of `FileScan` is not returned correctly if there are overlapping columns between `dataSchema` and `partitionSchema`. The actual schema should be
`dataSchema - overlapSchema + partitionSchema`, which might have a different column order from the pushed-down `requiredSchema` in `SupportsPushDownRequiredColumns.pruneColumns`.

For example, if the data schema is `[a: String, b: String, c: String]` and the partition schema is `[b: Int, d: Int]`, the result schema is `[a: String, b: Int, c: String, d: Int]` in the current `FileTable` and `HadoopFsRelation`, while the actual scan schema is `[a: String, c: String, b: Int, d: Int]` in `FileScan`.

To fix the corner case, this PR proposes that the output schema of `FileTable` should be `dataSchema - overlapSchema + partitionSchema`, so that the column order is consistent with `FileScan`.
Putting all the partition columns at the end of the table schema is more reasonable.
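A minimal sketch of that schema construction, using the example column names above (illustrative only, not the actual `FileTable` code):

```scala
import org.apache.spark.sql.types._

// Build the table schema as dataSchema - overlapSchema + partitionSchema,
// so partition columns always come last, matching FileScan's column order.
val dataSchema = StructType(Seq(
  StructField("a", StringType), StructField("b", StringType), StructField("c", StringType)))
val partitionSchema = StructType(Seq(
  StructField("b", IntegerType), StructField("d", IntegerType)))

val partitionNames = partitionSchema.fieldNames.toSet
val tableSchema = StructType(
  dataSchema.filterNot(f => partitionNames.contains(f.name)) ++ partitionSchema)
// tableSchema: [a: String, c: String, b: Int, d: Int]
```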

## How was this patch tested?

Unit test.

Closes #24284 from gengliangwang/FixReadSchema.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-04-05 13:34:46 +08:00
Aayushmaan Jain 04e53d2e3c [SPARK-27342][SQL] Optimize Limit 0 queries
## What changes were proposed in this pull request?
With this change, unnecessary file scans are avoided in case of Limit 0 queries.

I added a case (rule) to `PropagateEmptyRelation` to replace `GlobalLimit 0` and `LocalLimit 0` nodes with an empty `LocalRelation`. This prunes the subtree under the Limit 0 node and further allows other rules of `PropagateEmptyRelation` to optimize the Logical Plan - while remaining semantically consistent with the Limit 0 query.

For instance:
**Query:**
`SELECT * FROM table1 INNER JOIN (SELECT * FROM table2 LIMIT 0) AS table2 ON table1.id = table2.id`

**Optimized Plan without fix:**
```
Join Inner, (id#79 = id#87)
:- Filter isnotnull(id#79)
:  +- Relation[id#79,num1#80] parquet
+- Filter isnotnull(id#87)
   +- GlobalLimit 0
      +- LocalLimit 0
         +- Relation[id#87,num2#88] parquet
```

**Optimized Plan with fix:**
`LocalRelation <empty>, [id#75, num1#76, id#77, num2#78]`
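A minimal sketch of such a rule case (simplified names; an assumption rather than the PR's exact code):

```scala
import org.apache.spark.sql.catalyst.expressions.IntegerLiteral
import org.apache.spark.sql.catalyst.plans.logical._
import org.apache.spark.sql.catalyst.rules.Rule

// Limit 0 can never produce rows, so the whole subtree collapses to an empty relation,
// which PropagateEmptyRelation's other rules can then propagate further up the plan.
object OptimizeLimitZero extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
    case GlobalLimit(IntegerLiteral(0), child) => LocalRelation(child.output, data = Nil)
    case LocalLimit(IntegerLiteral(0), child)  => LocalRelation(child.output, data = Nil)
  }
}
```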

## How was this patch tested?
Added unit tests to verify Limit 0 optimization for:
- Simple query containing Limit 0
- Inner Join, Left Outer Join, Right Outer Join, Full Outer Join queries containing Limit 0 as one of their children
- Nested Inner Joins between 3 tables with one of them having a Limit 0 clause.
- Intersect query wherein one of the subqueries was a Limit 0 query.

Closes #24271 from aayushmaanjain/optimize-limit0.

Authored-by: Aayushmaan Jain <aayushmaan.jain42@gmail.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
2019-04-04 21:19:40 -07:00
Ruben Fiszel 0e44a51f2e [SPARK-24345][SQL] Improve ParseError stop location when offending symbol is a token
In the case where the offending symbol is a CommonToken, this PR increases the accuracy of the start and stop origin by leveraging the start and stop index information from CommonToken.
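A hypothetical sketch of the idea (names and structure are illustrative, not the PR's code):

```scala
import org.antlr.v4.runtime.CommonToken

// When the offending symbol is a CommonToken, its character start/stop indexes give a
// precise span for the error; otherwise fall back to the reported position in the line.
def offendingSpan(offendingSymbol: Any, charPositionInLine: Int): (Int, Int) =
  offendingSymbol match {
    case token: CommonToken => (token.getStartIndex, token.getStopIndex)
    case _                  => (charPositionInLine, charPositionInLine)
  }
```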

Closes #21334 from rubenfiszel/patch-1.

Lead-authored-by: Ruben Fiszel <rubenfiszel@gmail.com>
Co-authored-by: rubenfiszel <rfiszel@palantir.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-04-04 18:20:34 -05:00
Dongjoon Hyun 938d954375 [SPARK-27382][SQL][TEST] Update Spark 2.4.x testing in HiveExternalCatalogVersionsSuite
## What changes were proposed in this pull request?

Since Apache Spark 2.4.1 vote passed and is distributed into mirrors, we need to test 2.4.1. This should land on both `master` and `branch-2.4`.

## How was this patch tested?

Pass the Jenkins.

Closes #24292 from dongjoon-hyun/SPARK-27382.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-04-04 13:49:56 -07:00
Wenchen Fan f7bd1ab586 [SPARK-26811][SQL][FOLLOWUP] some more document fixes
## What changes were proposed in this pull request?

While working on https://github.com/apache/spark/pull/24129, I realized that I missed some document fixes in https://github.com/apache/spark/pull/24285. This PR covers all of them.

## How was this patch tested?

N/A

Author: Wenchen Fan <wenchen@databricks.com>

Closes #24295 from cloud-fan/doc.
2019-04-05 01:07:08 +08:00
Yuming Wang 1d95dea307 [SPARK-27349][SQL] Dealing with TimeVars removed in Hive 2.x
## What changes were proposed in this pull request?
`hive.stats.jdbc.timeout` and `hive.stats.retries.wait` were removed by [HIVE-12164](https://issues.apache.org/jira/browse/HIVE-12164).
This PR deals with that change.

## How was this patch tested?

unit tests

Closes #24277 from wangyum/SPARK-27349.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
2019-04-03 22:52:37 -07:00
Wenchen Fan b56e433b54 [SPARK-27338][CORE][FOLLOWUP] remove trailing space
## What changes were proposed in this pull request?

https://github.com/apache/spark/pull/24265 breaks the lint check because it has a trailing space (not sure why it passed Jenkins). This PR fixes it.

## How was this patch tested?

N/A

Closes #24289 from cloud-fan/fix.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-04-04 11:43:20 +08:00
Wenchen Fan 5c50f68253 [SPARK-26811][SQL][FOLLOWUP] fix some documentation
## What changes were proposed in this pull request?

It's a followup of https://github.com/apache/spark/pull/24012, to fix two pieces of documentation:
1. `SupportsRead` and `SupportsWrite` are not internal anymore. They are public interfaces now.
2. `Scan` should link to `BATCH_READ` instead of hardcoding it.

## How was this patch tested?
N/A

Closes #24285 from cloud-fan/doc.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-04-04 10:31:27 +08:00
Venkata krishnan Sowrirajan 6c4552c650 [SPARK-27338][CORE] Fix deadlock in UnsafeExternalSorter.SpillableIterator when locking both UnsafeExternalSorter.SpillableIterator and TaskMemoryManager
## What changes were proposed in this pull request?

`UnsafeExternalSorter.SpillableIterator#loadNext()` takes a lock on the `UnsafeExternalSorter` and calls `freePage` once the `lastPage` is consumed, which needs to take a lock on `TaskMemoryManager`. At the same time, another MemoryConsumer using `UnsafeExternalSorter` as part of sorting can try to `allocatePage`, which needs to get a lock on `TaskMemoryManager`; this can cause a spill, which in turn requires the lock on `UnsafeExternalSorter` again, causing a deadlock. This is a classic deadlock situation, similar to SPARK-26265.

To fix this, we can move the `freePage` call in `loadNext` outside of the `synchronized` block, similar to the fix in SPARK-26265.
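A simplified sketch of that fix pattern (an illustration under assumed names, not Spark's actual Java code):

```scala
// Defer freeing the consumed page until after the sorter lock is released, so the
// TaskMemoryManager lock is never requested while holding the sorter lock.
class SpillableIteratorSketch(freePage: Long => Unit) {
  private var lastPage: Long = 42L  // stand-in for the consumed memory page

  def loadNext(): Unit = {
    var pageToFree: Long = -1L
    this.synchronized {
      // advance the iterator; remember the consumed page instead of freeing it here
      pageToFree = lastPage
      lastPage = -1L
    }
    // freeing here takes the TaskMemoryManager lock without holding the sorter lock,
    // breaking the circular wait between the two locks
    if (pageToFree != -1L) freePage(pageToFree)
  }
}
```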

## How was this patch tested?

Manual tests were done, and I will also try to add a test.

Closes #24265 from venkata91/deadlock-sorter.

Authored-by: Venkata krishnan Sowrirajan <vsowrirajan@qubole.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-04-04 09:58:05 +08:00
LantaoJin 69dd44af19 [SPARK-27216][CORE] Upgrade RoaringBitmap to 0.7.45 to fix Kryo unsafe ser/dser issue
## What changes were proposed in this pull request?

HighlyCompressedMapStatus uses RoaringBitmap to record the empty blocks. But RoaringBitmap couldn't be serialized/deserialized with the unsafe KryoSerializer.

It's a bug in RoaringBitmap 0.5.11 and is fixed in the latest version.

This is an update of #24157
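A hypothetical reproduction of the issue (assumed from the description, not the PR's actual UT):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoSerializer
import org.roaringbitmap.RoaringBitmap

// Round-trip a RoaringBitmap through Kryo with the unsafe serializer enabled.
val conf = new SparkConf().set("spark.kryo.unsafe", "true")
val ser = new KryoSerializer(conf).newInstance()

val bitmap = new RoaringBitmap()
bitmap.add(1); bitmap.add(3); bitmap.add(5)

val roundTripped = ser.deserialize[RoaringBitmap](ser.serialize(bitmap))
assert(roundTripped == bitmap)  // broken with RoaringBitmap 0.5.11, fixed after the upgrade
```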

## How was this patch tested?

Add a UT

Closes #24264 from LantaoJin/SPARK-27216.

Lead-authored-by: LantaoJin <jinlantao@gmail.com>
Co-authored-by: Lantao Jin <jinlantao@gmail.com>
Signed-off-by: Imran Rashid <irashid@cloudera.com>
2019-04-03 20:09:50 -05:00
Dongjoon Hyun b51763612a Revert "[SPARK-27278][SQL] Optimize GetMapValue when the map is a foldable and the key is not"
This reverts commit 5888b15d9c.
2019-04-03 09:41:13 -07:00
Wenchen Fan ffb362a705 [SPARK-19712][SQL][FOLLOW-UP] reduce code duplication
## What changes were proposed in this pull request?

abstract some common code into a method.

## How was this patch tested?

existing tests

Closes #24281 from cloud-fan/minor.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-04-04 00:37:57 +08:00
Liang-Chi Hsieh d04a7371da [MINOR][DOC][SQL] Remove out-of-date doc about ORC in DataFrameReader and Writer
## What changes were proposed in this pull request?

`orc` is now available even when Hive support isn't enabled. This is a minor doc change to reflect it.

## How was this patch tested?

Doc only change.

Closes #24280 from viirya/fix-orc-doc.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-04-03 09:11:09 -07:00
Maxim Gekk 1bc672366d [SPARK-27344][SQL][TEST] Support the LocalDate and Instant classes in Java Bean encoders
## What changes were proposed in this pull request?

- Added new test for Java Bean encoder of the classes: `java.time.LocalDate` and `java.time.Instant`.
- Updated comment for `Encoders.bean`
- New Row getters: `getLocalDate` and `getInstant`
- Extended `inferDataType` to infer types for `java.time.LocalDate` -> `DateType` and `java.time.Instant` -> `TimestampType`.

## How was this patch tested?

By `JavaBeanDeserializationSuite`

Closes #24273 from MaxGekk/bean-instant-localdate.

Lead-authored-by: Maxim Gekk <max.gekk@gmail.com>
Co-authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-04-03 17:45:59 +08:00
Dilip Biswal 3286bff942 [SPARK-27255][SQL] Report error when illegal expressions are hosted by a plan operator.
## What changes were proposed in this pull request?
In the PR, we raise an AnalysisError when we detect the presence of aggregate expressions in a WHERE clause. Here is the problem description from the JIRA.

Aggregate functions should not be allowed in a WHERE clause, but Spark SQL throws an exception when generating code. It is supposed to throw an exception during parsing or analysis.

Here is an example:
```
val df = spark.sql("select * from t where sum(ta) > 0")
df.explain(true)
df.show()
```
Resulting exception:
```
Exception in thread "main" java.lang.UnsupportedOperationException: Cannot generate code for expression: sum(cast(input[0, int, false] as bigint))
	at org.apache.spark.sql.catalyst.expressions.Unevaluable.doGenCode(Expression.scala:291)
	at org.apache.spark.sql.catalyst.expressions.Unevaluable.doGenCode$(Expression.scala:290)
	at org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression.doGenCode(interfaces.scala:87)
	at org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:138)
	at scala.Option.getOrElse(Option.scala:138)
```
Checked the behaviour of other databases, and all of them return an error:
**Postgres**
```
select * from foo where max(c1) > 0;
Error
ERROR: aggregate functions are not allowed in WHERE Position: 25
```
**DB2**
```
db2 => select * from foo where max(c1) > 0;
SQL0120N  Invalid use of an aggregate function or OLAP function.
```
**Oracle**
```
select * from foo where max(c1) > 0;
ORA-00934: group function is not allowed here
```
**MySql**
```
select * from foo where max(c1) > 0;
Invalid use of group function
```

**Update**
This PR has been enhanced to report an error when expressions such as Aggregate, Window, and Generate are hosted by operators where they are invalid.
## How was this patch tested?
Added tests in AnalysisErrorSuite and group-by.sql

Closes #24209 from dilipbiswal/SPARK-27255.

Authored-by: Dilip Biswal <dbiswal@us.ibm.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-04-03 13:05:06 +08:00
Maxim Gekk 1d20d13149 [SPARK-25496][SQL] Deprecate from_utc_timestamp and to_utc_timestamp
## What changes were proposed in this pull request?

In the PR, I propose to deprecate the `from_utc_timestamp()` and `to_utc_timestamp`, and disable them by default. The functions can be enabled back via the SQL config `spark.sql.legacy.utcTimestampFunc.enabled`. By default, any calls of the functions throw an analysis exception.

One of the reasons for deprecation is that the functions violate the semantics of `TimestampType`, which is the number of microseconds since the epoch in the UTC time zone. Shifting microseconds since the epoch by a time zone offset doesn't make sense, because the result no longer represents microseconds since the epoch in the UTC time zone and so cannot be considered a `TimestampType`.
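Assumed usage based on the description (a sketch, not taken from the PR's tests): by default the functions now raise an analysis exception, and the legacy config restores the old behaviour.

```scala
// Re-enable the deprecated functions via the legacy SQL config.
spark.conf.set("spark.sql.legacy.utcTimestampFunc.enabled", "true")
spark.sql("SELECT from_utc_timestamp(timestamp '2019-04-02 10:00:00', 'PST')").show()
```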

## How was this patch tested?

The changes were tested by `DateExpressionsSuite` and `DateFunctionsSuite`.

Closes #24195 from MaxGekk/conv-utc-timestamp-deprecate.

Lead-authored-by: Maxim Gekk <max.gekk@gmail.com>
Co-authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Co-authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-04-03 10:55:56 +08:00
Dilip Biswal b8b5acdd41 [SPARK-19712][SQL][FOLLOW-UP] Don't do partial pushdown when pushing down LeftAnti joins below Aggregate or Window operators.
## What changes were proposed in this pull request?
After [23750](https://github.com/apache/spark/pull/23750), we may push down left-anti joins below aggregate and window operators with a partial join condition. This is not correct and was pointed out by hvanhovell and cloud-fan [here](https://github.com/apache/spark/pull/23750#discussion_r270017097). This PR addresses their comments.
## How was this patch tested?
Added two new tests to verify the behaviour.

Closes #24253 from dilipbiswal/SPARK-19712-followup.

Authored-by: Dilip Biswal <dbiswal@us.ibm.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-04-03 09:56:27 +08:00
Gabor Somogyi 3628242bd0 [MINOR][DSTREAMS] Add DStreamCheckpointData.cleanup warning if delete returns false
## What changes were proposed in this pull request?

While I was reviewing #24235 I found a minor possible addition: `FileSystem.delete` returns a boolean which is not yet checked. In this PR I've added a warning message when it returns false. I've marked this as MINOR because no control-flow change is introduced.

## How was this patch tested?

Existing unit tests.

Closes #24263 from gaborgsomogyi/SPARK-27301-minor.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-04-02 18:34:40 -05:00
Hyukjin Kwon d7dd59a6b4 [SPARK-26224][SQL][PYTHON][R][FOLLOW-UP] Add notes about many projects in withColumn at SparkR and PySpark as well
## What changes were proposed in this pull request?

This is a followup of https://github.com/apache/spark/pull/23285. This PR adds the notes into PySpark and SparkR documentation as well.

While I am here, I revised the doc a bit to make it sound a bit more neutral

## How was this patch tested?

Manually built the doc and verified.

Closes #24272 from HyukjinKwon/SPARK-26224.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2019-04-03 08:30:24 +09:00
Hyukjin Kwon 949d712839 [SPARK-27346][SQL] Loosen the newline assert condition on 'examples' field in ExpressionInfo
## What changes were proposed in this pull request?

I haven't tested by myself on Windows and I am not 100% sure if this is going to cause an actual problem.

However, this one line:

827383a97c/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/ExpressionInfo.java (L82)

made me investigate a lot today.

Given my speculation, if Spark is built on Linux and executed on Windows, it looks possible for multiline strings, like,

5264164a67/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/mathExpressions.scala (L146-L150)

to throw an exception because the newline in the binary is `\n` but `System.lineSeparator` returns `\r\n`.
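A hypothetical illustration of the mismatch (an assumption, not the actual assert in `ExpressionInfo`):

```scala
// A multiline example string compiled on Linux embeds "\n", while Windows'
// System.lineSeparator() is "\r\n", so a strict check on the platform separator fails there.
val examples = "\n    Examples:\n      > SELECT acos(1);\n       0.0\n  "
val usesPlatformSeparator = examples.contains(System.lineSeparator())
// true on Linux/macOS, false on Windows when the jar was built on Linux
```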

I think this has not been noticed yet because this particular code is not released yet (see SPARK-26426).

Looks just better to loosen the condition and forget about this stuff.

This should be backported into branch-2.4 as well.

## How was this patch tested?

N/A

Closes #24274 from HyukjinKwon/SPARK-27346.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2019-04-03 08:27:41 +09:00
Yuming Wang 13c5c1fb4b [SPARK-27180][BUILD][YARN] Fix testing issues with yarn module in Hadoop-3
## What changes were proposed in this pull request?

Fix testing issues with `yarn` module in Hadoop-3:

1. Upgrade jersey-1 to `1.19` to fix ```Cause: java.lang.NoClassDefFoundError: com/sun/jersey/spi/container/servlet/ServletContainer```.
2. Copy `ServerSocketUtil` from hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/net/ServerSocketUtil.java to fix ```java.lang.NoClassDefFoundError: org/apache/hadoop/net/ServerSocketUtil```.
3. Adapt `SessionHandler` from jetty-9.3.25.v20180904/jetty-server/src/main/java/org/eclipse/jetty/server/session/SessionHandler.java to fix ```java.lang.NoSuchMethodError: org.eclipse.jetty.server.session.SessionHandler.getSessionManager()Lorg/eclipse/jetty/server/SessionManager```.

## How was this patch tested?

manual tests:
```shell
build/sbt yarn/test -Pyarn
build/sbt yarn/test -Phadoop-3.2 -Pyarn

build/mvn -Dtest=none -DwildcardSuites=org.apache.spark.deploy.yarn.YarnClusterSuite -pl resource-managers/yarn test -Pyarn
build/mvn -Dtest=none -DwildcardSuites=org.apache.spark.deploy.yarn.YarnClusterSuite -pl resource-managers/yarn test -Pyarn -Phadoop-3.2
```

Closes #24115 from wangyum/hadoop3-yarn.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-04-02 15:38:26 -05:00
Gabor Somogyi 57aff93886 [SPARK-26998][CORE] Remove SSL configuration from executors
## What changes were proposed in this pull request?

Different SSL passwords show up as command-line arguments on the executor side in standalone mode:
* keyStorePassword
* keyPassword
* trustStorePassword

In this PR I've removed SSL configurations from executors.

## How was this patch tested?

Existing + additional unit tests.
Additionally tested with standalone mode and checked the command line arguments:
```
[gaborsomogyi:~/spark] SPARK-26998(+4/-0,3)+ ± jps
94803 CoarseGrainedExecutorBackend
94818 Jps
90149 RemoteMavenServer
91925 Nailgun
94793 SparkSubmit
94680 Worker
94556 Master
398
[gaborsomogyi:~/spark] SPARK-26998(+4/-1,3)+ ± ps -ef | egrep "94556|94680|94793|94803"
  502 94556     1   0  2:02PM ttys007    0:07.39 /Library/Java/JavaVirtualMachines/jdk1.8.0_152.jdk/Contents/Home/bin/java -cp /Users/gaborsomogyi/spark/conf/:/Users/gaborsomogyi/spark/assembly/target/scala-2.12/jars/* -Xmx1g org.apache.spark.deploy.master.Master --host gsomogyi-MBP.local --port 7077 --webui-port 8080 --properties-file conf/spark-defaults.conf
  502 94680     1   0  2:02PM ttys007    0:07.27 /Library/Java/JavaVirtualMachines/jdk1.8.0_152.jdk/Contents/Home/bin/java -cp /Users/gaborsomogyi/spark/conf/:/Users/gaborsomogyi/spark/assembly/target/scala-2.12/jars/* -Xmx1g org.apache.spark.deploy.worker.Worker --webui-port 8081 --properties-file conf/spark-defaults.conf spark://gsomogyi-MBP.local:7077
  502 94793 94782   0  2:02PM ttys007    0:35.52 /Library/Java/JavaVirtualMachines/jdk1.8.0_152.jdk/Contents/Home/bin/java -cp /Users/gaborsomogyi/spark/conf/:/Users/gaborsomogyi/spark/assembly/target/scala-2.12/jars/* -Dscala.usejavacp=true -Xmx1g org.apache.spark.deploy.SparkSubmit --master spark://gsomogyi-MBP.local:7077 --class org.apache.spark.repl.Main --name Spark shell spark-shell
  502 94803 94680   0  2:03PM ttys007    0:05.20 /Library/Java/JavaVirtualMachines/jdk1.8.0_152.jdk/Contents/Home/bin/java -cp /Users/gaborsomogyi/spark/conf/:/Users/gaborsomogyi/spark/assembly/target/scala-2.12/jars/* -Xmx1024M -Dspark.ssl.ui.port=0 -Dspark.driver.port=60902 org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler172.30.65.186:60902 --executor-id 0 --hostname 172.30.65.186 --cores 8 --app-id app-20190326140311-0000 --worker-url spark://Worker172.30.65.186:60899
  502 94910 57352   0  2:05PM ttys008    0:00.00 egrep 94556|94680|94793|94803
```

Closes #24170 from gaborgsomogyi/SPARK-26998.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-04-02 09:18:43 -07:00
Sean Owen d4420b455a [SPARK-27323][CORE][SQL][STREAMING] Use Single-Abstract-Method support in Scala 2.12 to simplify code
## What changes were proposed in this pull request?

Use Single Abstract Method syntax where possible (and minor related cleanup). Comments below. No logic should change here.
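A generic illustration of the Scala 2.12 Single Abstract Method conversion this relies on (an example, not a line from the PR's diff):

```scala
// Before: an anonymous class implementing the SAM interface Runnable.
val before = new Thread(new Runnable {
  override def run(): Unit = println("doing work")
})
// After: in Scala 2.12 a lambda converts directly to the SAM type Runnable.
val after = new Thread(() => println("doing work"))
```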

## How was this patch tested?

Existing tests.

Closes #24241 from srowen/SPARK-27323.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-04-02 07:37:05 -07:00
Dongjoon Hyun d575a453db Revert "[SPARK-25496][SQL] Deprecate from_utc_timestamp and to_utc_timestamp"
This reverts commit c5e83ab92c.
2019-04-02 01:05:54 -07:00
Dongjoon Hyun a0d807d5ab [SPARK-26856][PYSPARK][FOLLOWUP] Fix UT failure due to wrong patterns for Kinesis assembly
## What changes were proposed in this pull request?

After [SPARK-26856](https://github.com/apache/spark/pull/23797), the `Kinesis` Python UT fails with a `Found multiple JARs` exception due to a wrong pattern.

- https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/104171/console
```
Exception: Found multiple JARs:
.../spark-streaming-kinesis-asl-assembly-3.0.0-SNAPSHOT.jar,
.../spark-streaming-kinesis-asl-assembly_2.12-3.0.0-SNAPSHOT.jar;
please remove all but one
```

It's because the pattern was changed in a wrong way.

**Original**
```python
kinesis_asl_assembly_dir, "target/scala-*/%s-*.jar" % name_prefix))
kinesis_asl_assembly_dir, "target/%s_*.jar" % name_prefix))
```
**After SPARK-26856**
```python
project_full_path, "target/scala-*/%s*.jar" % jar_name_prefix))
project_full_path, "target/%s*.jar" % jar_name_prefix))
```

The actual kinesis assembly jar files look like the following.

**SBT Build**
```
-rw-r--r--  1 dongjoon  staff  87459461 Apr  1 19:01 spark-streaming-kinesis-asl-assembly-3.0.0-SNAPSHOT.jar
-rw-r--r--  1 dongjoon  staff       309 Apr  1 18:58 spark-streaming-kinesis-asl-assembly_2.12-3.0.0-SNAPSHOT-tests.jar
-rw-r--r--  1 dongjoon  staff       309 Apr  1 18:58 spark-streaming-kinesis-asl-assembly_2.12-3.0.0-SNAPSHOT.jar
```

**MAVEN Build**
```
-rw-r--r--   1 dongjoon  staff   8.6K Apr  1 18:55 spark-streaming-kinesis-asl-assembly_2.12-3.0.0-SNAPSHOT-sources.jar
-rw-r--r--   1 dongjoon  staff   8.6K Apr  1 18:55 spark-streaming-kinesis-asl-assembly_2.12-3.0.0-SNAPSHOT-test-sources.jar
-rw-r--r--   1 dongjoon  staff   8.7K Apr  1 18:55 spark-streaming-kinesis-asl-assembly_2.12-3.0.0-SNAPSHOT-tests.jar
-rw-r--r--   1 dongjoon  staff    21M Apr  1 18:55 spark-streaming-kinesis-asl-assembly_2.12-3.0.0-SNAPSHOT.jar
```

In addition, after SPARK-26856, the utility function `search_jar` is shared to find `avro` jar files, which are identical for both `sbt` and `mvn`. To sum up, the current jar pattern parameter cannot handle both `kinesis` and `avro` jars. This PR splits the single pattern into two patterns.

## How was this patch tested?

Manual. Please note that this will remove only `Found multiple JARs` exception. Kinesis tests need more configurations to run locally.
```
$ build/sbt -Pkinesis-asl test:package streaming-kinesis-asl-assembly/assembly
$ export ENABLE_KINESIS_TESTS=1
$ python/run-tests.py --python-executables python2.7 --module pyspark-streaming
```

Closes #24268 from dongjoon-hyun/SPARK-26856.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2019-04-02 14:52:56 +09:00
Marco Gaido 0b150f833c [SPARK-26224][SQL] Advice the user when creating many project on subsequent calls to withColumn
## What changes were proposed in this pull request?

We have seen many cases where users make several subsequent calls to `withColumn` on a Dataset. This leads to the generation of many `Project` nodes on top of the plan, causing serious problems which can also lead to `StackOverflowException`s.

The PR improves the doc of `withColumn` in order to advise the user to avoid this pattern and do something different, i.e. a single select with all the columns they need.
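An illustrative example of that advice, assuming a DataFrame `df` with a numeric column `a` (hypothetical, not from the PR):

```scala
import org.apache.spark.sql.functions.col

// Each withColumn call stacks another Project node on top of the plan...
val slow = df
  .withColumn("a2", col("a") * 2)
  .withColumn("a3", col("a") * 3)
  .withColumn("a4", col("a") * 4)

// ...whereas a single select adds all the derived columns in one Project.
val fast = df.select(
  col("*"),
  (col("a") * 2).as("a2"),
  (col("a") * 3).as("a3"),
  (col("a") * 4).as("a4"))
```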

## How was this patch tested?

NA

Closes #23285 from mgaido91/SPARK-26224.

Authored-by: Marco Gaido <marcogaido91@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2019-04-02 14:12:47 +09:00
Maxim Gekk c5e83ab92c [SPARK-25496][SQL] Deprecate from_utc_timestamp and to_utc_timestamp
## What changes were proposed in this pull request?

In the PR, I propose to deprecate the `from_utc_timestamp()` and `to_utc_timestamp`, and disable them by default. The functions can be enabled back via the SQL config `spark.sql.legacy.utcTimestampFunc.enabled`. By default, any calls of the functions throw an analysis exception.

One of the reasons for deprecation is that the functions violate the semantics of `TimestampType`, which is the number of microseconds since the epoch in the UTC time zone. Shifting microseconds since the epoch by a time zone offset doesn't make sense, because the result no longer represents microseconds since the epoch in the UTC time zone and so cannot be considered a `TimestampType`.

## How was this patch tested?

The changes were tested by `DateExpressionsSuite` and `DateFunctionsSuite`.

Closes #24195 from MaxGekk/conv-utc-timestamp-deprecate.

Lead-authored-by: Maxim Gekk <max.gekk@gmail.com>
Co-authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Co-authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-04-02 10:20:06 +08:00
Liang-Chi Hsieh eaf008ad0e [SPARK-27329][SQL] Pruning nested field in map of map key and value from object serializers
## What changes were proposed in this pull request?

If an object serializer has a map whose key or value is itself a map, pruning nested fields should still work.

Previously, the object serializer pruner didn't recursively prune nested fields if they were deeply located in a map key or value. This patch proposes to address that by slightly refactoring the pruning logic.

## How was this patch tested?

Added tests.

Closes #24260 from viirya/SPARK-27329.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-04-01 13:53:55 -07:00
Giovanni Lanzani 92530c7db1 [SPARK-9792] Make DenseMatrix equality semantical
Before, you could have this code

```
A = SparseMatrix(2, 2, [0, 2, 3], [0], [2])
B = DenseMatrix(2, 2, [2, 0, 0, 0])

B == A  # False
A == B  # True
```

The second would be `True` as `SparseMatrix` already checks for semantic
equality. This commit changes `DenseMatrix` so that equality is
semantic as well.

## What changes were proposed in this pull request?

Better semantic equality for DenseMatrix

## How was this patch tested?

Unit tests were added, plus manual testing. Note that the code falls back to the old behavior when `other` is not a SparseMatrix.

Closes #17968 from gglanzani/SPARK-9792.

Authored-by: Giovanni Lanzani <giovanni@lanzani.nl>
Signed-off-by: Holden Karau <holden@pigscanfly.ca>
2019-04-01 09:30:33 -07:00
Marco Gaido 5888b15d9c [SPARK-27278][SQL] Optimize GetMapValue when the map is a foldable and the key is not
## What changes were proposed in this pull request?

When `GetMapValue` contains a foldable map and a non-foldable key, `SimplifyExtractValueOps` fails to optimize it by transforming it into CASE WHEN statements.
The PR adds a case covering this situation too.
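An assumed illustration of the case this covers (not from the PR's tests): a literal, foldable map indexed by a non-foldable key can now be rewritten into CASE WHEN branches rather than building the map per row.

```scala
// The optimizer can turn map(0, 'a', 1, 'b')[key] into
// CASE WHEN key = 0 THEN 'a' WHEN key = 1 THEN 'b' END.
val df = spark.range(3).selectExpr("map(0, 'a', 1, 'b')[cast(id AS INT)] AS v")
df.explain(true)  // the optimized logical plan should show the CASE WHEN rewrite
```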

## How was this patch tested?

added UT

Closes #24223 from mgaido91/SPARK-27278.

Authored-by: Marco Gaido <marcogaido91@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-04-01 09:09:06 -07:00
Maxim Gekk d332958109 [SPARK-27325][SQL] Add implicit encoders for LocalDate and Instant
## What changes were proposed in this pull request?

Added implicit encoders for the `java.time.LocalDate` and `java.time.Instant` classes. This allows creation of datasets from instances of the types.
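Assumed usage based on the description (a sketch, not the PR's test code): with the new implicit encoders in scope, Datasets can be built directly from `java.time` values.

```scala
import java.time.{Instant, LocalDate}
import spark.implicits._

val dates = Seq(LocalDate.of(2019, 4, 1), LocalDate.of(2019, 4, 2)).toDS()
val instants = Seq(Instant.parse("2019-04-01T10:15:30.00Z")).toDS()
```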

## How was this patch tested?

Added new tests to `JavaDatasetSuite` and `DatasetSuite`.

Closes #24249 from MaxGekk/instant-localdate-encoders.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-04-01 23:02:48 +08:00
Marco Gaido 8012f55a9b [SPARK-26812][SQL] Report correct nullability for complex datatypes in Union
## What changes were proposed in this pull request?

When there is a `Union`, the reported output datatypes are the ones of the first plan, and the nullability is updated according to all the plans. For complex types, though, the nullability of their elements is not updated using the types from the other plans. This means that the nullability of the inner elements is the one of the first plan. If this is not compatible with that of the other plans, errors can happen (as reported in the JIRA).

The PR proposes to update the nullability of the inner elements of complex datatypes according to the most permissive value across all the plans.
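An illustrative sketch of the idea (not the PR's code): for complex types, the union's output must take the most permissive element nullability of both children.

```scala
import org.apache.spark.sql.types._

val left  = ArrayType(IntegerType, containsNull = false)
val right = ArrayType(IntegerType, containsNull = true)
// The merged element type must allow nulls, since the right-hand child may produce them.
val merged = ArrayType(IntegerType, containsNull = left.containsNull || right.containsNull)
```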

## How was this patch tested?

added UT

Closes #23726 from mgaido91/SPARK-26812.

Authored-by: Marco Gaido <marcogaido91@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-04-01 22:22:10 +08:00
Yuming Wang f799e34962 [MINOR][BUILD] Upgrade apache-rat to 0.13
## What changes were proposed in this pull request?

This PR upgrades `apache-rat` to 0.13. Issues fixed by 0.13:
https://issues.apache.org/jira/issues/?jql=project%20%3D%20RAT%20AND%20fixVersion%20%3D%200.13

## How was this patch tested?

manual tests

Closes #24262 from wangyum/apache-rat.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2019-04-01 16:44:42 +09:00
“attilapiros” 9eb896cc3b [SPARK-27333][TEST] Update thread audit whitelist to skip broadcast-exchange-.*, process reaper and StatisticsDataReferenceCleaner threads
## What changes were proposed in this pull request?

Update thread audit whitelist to skip threads of the global broadcast exchange thread pool, process reaper and Hadoop FS statistics data reference cleaner thread.

## How was this patch tested?

Via existing UT using broadcast exchange via `sbt` i.e:

```
> project sql
> testOnly *.SessionStateSuite -- -z "fork new sessions and run query on inherited table"
```

Before (long line manually wrapped to save horizontal scrolling for reviewers):

```
===== POSSIBLE THREAD LEAK IN SUITE o.a.s.sql.SessionStateSuite,
thread names: broadcast-exchange-6, broadcast-exchange-0,
broadcast-exchange-2, broadcast-exchange-5, broadcast-exchange-7,
broadcast-exchange-4, broadcast-exchange-1, process reaper, broadcast-exchange-3,
 org.apache.hadoop.fs.FileSystem$Statistics$StatisticsDataReferenceCleaner =====
```

After this change no possible thread leak detected.

Closes #24244 from attilapiros/thread-audit-minor.

Authored-by: “attilapiros” <piros.attila.zsolt@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-03-31 17:33:31 -07:00
chakravarthiT fc9aad0957 [SPARK-27253][SQL] Prioritizes parent session's SQLConf over SparkConf when cloning a session
## What changes were proposed in this pull request?

A cloned session should prioritize `SQLConf` from its parent over `SparkConf`. Currently, when cloning a session, the child session has the configuration set in `SparkConf` even when the same properties are set in its parent's `SQLConf`.

Currently, when a Spark session is cloned, `mergeSparkConf` in `BaseSessionStateBuilder`'s `conf` overwrites  `SQLConf` values as set in `SparkConf`.

This PR proposes to call `mergeSparkConf` only when the parent session is empty.

See the code links below.

1. Parent's `sessionState`

c26379b446/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala (L268)

c26379b446/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala (L157-L161)

5dab5f651f/sql/core/src/main/scala/org/apache/spark/sql/internal/BaseSessionStateBuilder.scala (L88-L90)

2. Child `sessionState`

c26379b446/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala (L269)

c26379b446/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala (L155)

c26379b446/sql/core/src/main/scala/org/apache/spark/sql/internal/SessionState.scala (L102)

c26379b446/sql/core/src/main/scala/org/apache/spark/sql/internal/SessionState.scala (L74)

5dab5f651f/sql/core/src/main/scala/org/apache/spark/sql/internal/BaseSessionStateBuilder.scala (L305)

5dab5f651f/sql/core/src/main/scala/org/apache/spark/sql/internal/BaseSessionStateBuilder.scala (L283)

5dab5f651f/sql/core/src/main/scala/org/apache/spark/sql/internal/BaseSessionStateBuilder.scala (L292)

5dab5f651f/sql/core/src/main/scala/org/apache/spark/sql/internal/BaseSessionStateBuilder.scala (L88-L90)

## How was this patch tested?
Added UT and with existing Unit Tests.

Closes #24189 from chakravarthiT/CloneDiscardsConf.

Authored-by: chakravarthiT <tcchakra@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2019-04-01 09:33:18 +09:00
Takeshi Yamamuro 885aab40a2 [SPARK-27266][SQL] Support ANALYZE TABLE to collect tables stats for cached catalog views
## What changes were proposed in this pull request?
The current master doesn't support ANALYZE TABLE to collect table stats for catalog views even if they are cached, as follows:

```scala
scala> sql(s"CREATE VIEW v AS SELECT 1 c")
scala> sql(s"CACHE LAZY TABLE v")
scala> sql(s"ANALYZE TABLE v COMPUTE STATISTICS")
org.apache.spark.sql.AnalysisException: ANALYZE TABLE is not supported on views.;
...
```

Since SPARK-25196 added support for an ANALYZE command to collect column statistics for cached catalog views, we could support table stats, too.

## How was this patch tested?
Added tests in `StatisticsCollectionSuite` and `InMemoryColumnarQuerySuite`.

Closes #24200 from maropu/SPARK-27266.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-03-31 17:24:21 -07:00
Maxim Gekk 6115a5e1a0 [SPARK-27327][SQL] New JSON benchmarks: functions, Dataset[String]
## What changes were proposed in this pull request?

Added new benchmarks for:
1. JSON functions: `from_json`, `json_tuple` and `get_json_object`
2. Parsing `Dataset[String]` with JSON records
3. Comparing just splitting input text by lines with schema inference and per-line parsing, when the encoding is set and when it is not.

Also, existing benchmarks were refactored to use the `NoOp` datasource to eliminate the overhead of triggers like `.filter((_: Row) => true).count()`.

## How was this patch tested?

By running `JSONBenchmark` locally.

Closes #24252 from MaxGekk/json-benchmark-func.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2019-04-01 08:33:16 +09:00
gatorsmile 92b6f86f6d [SPARK-27244][CORE][TEST][FOLLOWUP] toDebugString redacts sensitive information
## What changes were proposed in this pull request?
This PR is a FollowUp of https://github.com/apache/spark/pull/24196. It improves the test case by using the parameters that are being used in the actual scenarios.

## How was this patch tested?
N/A

Closes #24257 from gatorsmile/followupSPARK-27244.

Authored-by: gatorsmile <gatorsmile@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-03-30 22:58:28 -07:00
Yuming Wang b670f39fc6 [SPARK-24793][FOLLOW-UP][K8S] Remove duplicate declaration of mockito-core
## What changes were proposed in this pull request?

```
[WARNING] Some problems were encountered while building the effective model for org.apache.spark:spark-kubernetes_2.12:jar:3.0.0-SNAPSHOT
[WARNING] 'dependencies.dependency.(groupId:artifactId:type:classifier)' must be unique: org.mockito:mockito-core:jar -> duplicate declaration of version (?)  org.apache.spark:spark-kubernetes_2.12:[unknown-version], /Users/yumwang/spark/resource-managers/kubernetes/core/pom.xml, line 98, column 17
```
This PR removes the duplicate declaration of `mockito-core`.

## How was this patch tested?

N/A

Closes #24256 from wangyum/SPARK-24793-FOLLOW-UP.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-03-30 21:29:32 -07:00
Felix Cheung fa0f791d4d [MINOR][R] fix R project description
## What changes were proposed in this pull request?

update as per this NOTE when running CRAN check

```
The Title field should be in title case, current version then in title case:
‘R Front end for 'Apache Spark'’
‘R Front End for 'Apache Spark'’

```

Closes #24255 from felixcheung/rdesc.

Authored-by: Felix Cheung <felixcheung_m@hotmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-03-30 21:00:46 -07:00
Sean Owen 754f820035 [SPARK-26918][DOCS] All .md should have ASF license header
## What changes were proposed in this pull request?

Add AL2 license to metadata of all .md files.
This seemed to be the tidiest way as it will get ignored by .md renderers and other tools. Attempts to write them as markdown comments revealed that there is no such standard thing.

## How was this patch tested?

Doc build

Closes #24243 from srowen/SPARK-26918.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-03-30 19:49:45 -05:00
Dongjoon Hyun 88ea319871 Revert "[SPARK-27192][CORE] spark.task.cpus should be less or equal than spark.executor.cores"
This reverts commit f8fa564dec.
2019-03-30 16:35:34 -07:00
Gengliang Wang 5dab5f651f [SPARK-27326][SQL] Fall back all v2 file sources in InsertIntoTable to V1 FileFormat
## What changes were proposed in this pull request?

In the first PR for file source V2, there was a rule for falling back Orc V2 table to OrcFileFormat: https://github.com/apache/spark/pull/23383/files#diff-57e8244b6964e4f84345357a188421d5R34

As we are migrating more file sources to data source V2, we should make the rule more generic. This PR proposes to:
1. Rename the rule `FallbackOrcDataSourceV2` to `FallBackFileSourceV2`. The name is more generic, and we use "fall back" as a verb, while "fallback" is a noun.
2. Rename the method `fallBackFileFormat` in `FileDataSourceV2` to `fallbackFileFormat`. Here we should use "fallback" as a noun.
3. Add a new method `fallbackFileFormat` in `FileTable`. This is for falling back to V1 in the rule `FallBackFileSourceV2`.

## How was this patch tested?

Existing Unit tests.

Closes #24251 from gengliangwang/fallbackV1Rule.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-03-30 14:38:26 -07:00
Yuming Wang 0cbef34ede [MINOR][BUILD] Add ASF license header to plugins.sbt
## What changes were proposed in this pull request?

This PR adds an ASF license header to plugins.sbt; otherwise:
![image](https://user-images.githubusercontent.com/5399861/55273959-670b8800-530d-11e9-9b6f-214a3cde802e.png)

## How was this patch tested?
Warning disappears after adding ASF license header:
![image](https://user-images.githubusercontent.com/5399861/55273961-6c68d280-530d-11e9-9d15-5fb73a1b991e.png)

Closes #24248 from wangyum/plugins.sbt.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-03-30 12:47:02 -05:00
Yuming Wang 44b0d328e5 [MINOR] Update the scala version of LICENSE-binary to 2.12
## What changes were proposed in this pull request?

Update the scala version of `LICENSE-binary` to 2.12.

## How was this patch tested?

N/A

Closes #24250 from wangyum/LICENSE-binary.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-03-30 12:46:08 -05:00
liulijia f8fa564dec [SPARK-27192][CORE] spark.task.cpus should be less or equal than spark.executor.cores
## What changes were proposed in this pull request?
spark.task.cpus should be less than or equal to spark.executor.cores when using static executor allocation.
## How was this patch tested?
manual

Closes #24131 from liutang123/SPARK-27192.

Authored-by: liulijia <liutang123@yeah.net>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-03-30 12:38:05 -05:00
Sean Owen 2ec650d843 [SPARK-27267][CORE] Update snappy to avoid error when decompressing empty serialized data
## What changes were proposed in this pull request?

(See JIRA for problem statement)

Update snappy 1.1.7.1 -> 1.1.7.3 to pick up an empty-stream and Java 9 fix.

There appear to be no other changes of consequence:
https://github.com/xerial/snappy-java/blob/master/Milestone.md

## How was this patch tested?

Existing tests

Closes #24242 from srowen/SPARK-27267.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-03-30 02:41:24 -05:00