ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Marco Gaido	ec873a4fd2	[SPARK-14516][FOLLOWUP] Adding ClusteringEvaluator to examples ## What changes were proposed in this pull request? In SPARK-14516 we have introduced ClusteringEvaluator, but we didn't put any reference in the documentation and the examples were still relying on the sum of squared errors to show a way to evaluate the clustering model. The PR adds the ClusteringEvaluator in the examples. ## How was this patch tested? Manual runs of the examples. Author: Marco Gaido <mgaido@hortonworks.com> Closes #19676 from mgaido91/SPARK-14516_examples.	2017-12-11 06:35:31 -06:00
zouchenjun	4289ac9d8d	[SPARK-22496][SQL] thrift server adds operation logs ## What changes were proposed in this pull request? Since Hive 2.0+ upgraded log4j to log4j2, a lot of [changes](https://issues.apache.org/jira/browse/HIVE-11304) were made around it. Spark is not ready to update its built-in Hive version (1.2.1), so I made a few small changes: the function registerCurrentOperationLog is moved from SQLOperation to its parent class ExecuteStatementOperation so Spark can use it. ## How was this patch tested? Manual test. Author: zouchenjun <zouchenjun@youzan.com> Closes #19721 from ChenjunZou/operation-log.	2017-12-10 20:36:14 -08:00
Felix Cheung	ab1b6ee731	[BUILD] update release scripts ## What changes were proposed in this pull request? Change to dist.apache.org instead of home directory sha512 should have .sha512 extension. From ASF release signing doc: "The checksum SHOULD be generated using SHA-512. A .sha file SHOULD contain a SHA-1 checksum, for historical reasons." NOTE: I think should require some changes to work with Jenkins' release build ## How was this patch tested? manually Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #19754 from felixcheung/releasescript.	2017-12-09 09:28:46 -06:00
Dongjoon Hyun	251b2c03b4	[SPARK-22672][SQL][TEST][FOLLOWUP] Fix to use `spark.conf` ## What changes were proposed in this pull request? During https://github.com/apache/spark/pull/19882, `conf` is mistakenly used to switch ORC implementation between `native` and `hive`. To affect `OrcTest` correctly, `spark.conf` should be used. ## How was this patch tested? Pass the tests. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #19931 from dongjoon-hyun/SPARK-22672-2.	2017-12-09 20:20:28 +09:00
Imran Rashid	acf7ef3154	[SPARK-12297][SQL] Adjust timezone for int96 data from impala ## What changes were proposed in this pull request? Int96 data written by impala vs data written by hive & spark is stored slightly differently -- they use a different offset for the timezone. This adds an option "spark.sql.parquet.int96TimestampConversion" (false by default) to adjust timestamps if and only if the writer is impala (or more precisely, if the parquet file's "createdBy" metadata does not start with "parquet-mr"). This matches the existing behavior in hive from HIVE-9482. ## How was this patch tested? Unit test added, existing tests run via jenkins. Author: Imran Rashid <irashid@cloudera.com> Author: Henry Robinson <henry@apache.org> Closes #19769 from squito/SPARK-12297_skip_conversion.	2017-12-09 11:53:15 +09:00
Sandor Murakozi	e4639fa68f	[SPARK-21672][CORE] Remove SHS-specific application / attempt data structures ## What changes were proposed in this pull request? In general, the SHS pages now use the public API types to represent applications. Some internal code paths still used their own view of what applications and attempts look like (`ApplicationHistoryInfo` and `ApplicationAttemptInfo`), declared in ApplicationHistoryProvider.scala. This pull request removes these classes and updates the rest of the code to use `status.api.v1.ApplicationInfo` and `status.api.v1.ApplicationAttemptInfo` instead. Furthermore, `status.api.v1.ApplicationInfo` and `status.api.v1.ApplicationAttemptInfo` were changed to case classes to - facilitate copying instances - simplify equality checking in test code - provide a nicer toString() To simplify the code a bit, `v1.` prefixes were also removed from occurrences of v1.ApplicationInfo and v1.ApplicationAttemptInfo, as there is no more ambiguity between classes in history and status.api.v1. ## How was this patch tested? By running existing automated tests. Author: Sandor Murakozi <smurakozi@gmail.com> Closes #19920 from smurakozi/SPARK-21672.	2017-12-08 14:17:50 -08:00
Li Jin	26e66453de	[SPARK-22655][PYSPARK] Throw exception rather than exit silently in PythonRunner when Spark session is stopped ## What changes were proposed in this pull request? During Spark shutdown, if there are some active tasks, sometimes they will complete with incorrect results. The issue is in PythonRunner where it is returning partial result instead of throwing exception during Spark shutdown. This patch makes it so that these tasks fail instead of complete with partial results. ## How was this patch tested? Existing tests. Author: Li Jin <ice.xelloss@gmail.com> Closes #19852 from icexelloss/python-runner-shutdown.	2017-12-08 20:44:21 +09:00
Juliusz Sompolski	f28b1a4c41	[SPARK-22721] BytesToBytesMap peak memory not updated. ## What changes were proposed in this pull request? Follow-up to earlier commit. The peak memory of BytesToBytesMap is not updated in more places - spill() and destructiveIterator(). ## How was this patch tested? Manually. Author: Juliusz Sompolski <julek@databricks.com> Closes #19923 from juliuszsompolski/SPARK-22721cd.	2017-12-08 12:19:45 +01:00
Sunitha Kambhampati	f88a67bf08	[SPARK-22452][SQL] Add getDouble to DataSourceV2Options - Implemented getDouble method in DataSourceV2Options - Add unit test Author: Sunitha Kambhampati <skambha@us.ibm.com> Closes #19921 from skambha/ds2.	2017-12-08 14:48:19 +08:00
Tathagata Das	b11869bc3b	[SPARK-22187][SS][REVERT] Revert change in state row format for mapGroupsWithState ## What changes were proposed in this pull request? #19416 changed the format in which rows were encoded in the state store. However, this can break existing streaming queries with the old format in unpredictable ways (potentially crashing the JVM). Hence I am reverting this for now. This will be re-applied in the future after we start saving more metadata in checkpoints to signify which version of state row format the existing streaming query is running. Then we can decode old and new formats accordingly. ## How was this patch tested? Existing tests. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #19924 from tdas/SPARK-22187-1.	2017-12-07 22:02:51 -08:00
Dongjoon Hyun	0ba8f4b211	[SPARK-21787][SQL] Support for pushing down filters for DateType in native OrcFileFormat ## What changes were proposed in this pull request? This PR adds support for pushing down filters for DateType in ORC. ## How was this patch tested? Pass the Jenkins with newly added and updated test cases. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #18995 from dongjoon-hyun/SPARK-21787.	2017-12-08 09:52:16 +08:00
Dongjoon Hyun	aa1764ba1a	[SPARK-22279][SQL] Turn on spark.sql.hive.convertMetastoreOrc by default ## What changes were proposed in this pull request? Like Parquet, this PR aims to turn on `spark.sql.hive.convertMetastoreOrc` by default. ## How was this patch tested? Pass all the existing test cases. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #19499 from dongjoon-hyun/SPARK-22279.	2017-12-07 15:45:23 -08:00
Wang Gengliang	18b75d465b	[SPARK-22719][SQL] Refactor ConstantPropagation ## What changes were proposed in this pull request? The current time complexity of ConstantPropagation is O(n^2), which can be slow when the query is complex. Refactor the implementation with O( n ) time complexity, and some pruning to avoid traversing the whole `Condition` ## How was this patch tested? Unit test. Also simple benchmark test in ConstantPropagationSuite ``` val condition = (1 to 500).map{_ => Rand(0) === Rand(0)}.reduce(And) val query = testRelation .select(columnA) .where(condition) val start = System.currentTimeMillis() (1 to 40).foreach { _ => Optimize.execute(query.analyze) } val end = System.currentTimeMillis() println(end - start) ``` Run time before changes: 18989ms (474ms per loop) Run time after changes: 1275 ms (32ms per loop) Author: Wang Gengliang <ltnwgl@gmail.com> Closes #19912 from gengliangwang/ConstantPropagation.	2017-12-07 10:24:49 -08:00
kellyzly	f41c0a93fd	[SPARK-22660][BUILD] Use position() and limit() to fix ambiguity issue in scala-2.12 and JDK9 ## What changes were proposed in this pull request? Some compile errors occur after upgrading to scala-2.12: ``` spark_source/core/src/main/scala/org/apache/spark/executor/Executor.scala:455: ambiguous reference to overloaded definition, method limit in class ByteBuffer of type (x$1: Int)java.nio.ByteBuffer method limit in class Buffer of type ()Int match expected type ? val resultSize = serializedDirectResult.limit error ``` The limit method was moved from ByteBuffer to the superclass Buffer and it can no longer be called without (). The same applies to the position method. ``` /home/zly/prj/oss/jdk9_HOS_SOURCE/spark_source/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/ScriptTransformationExec.scala:427: ambiguous reference to overloaded definition, [error] both method putAll in class Properties of type (x$1: java.util.Map[_, _])Unit [error] and method putAll in class Hashtable of type (x$1: java.util.Map[_ <: Object, _ <: Object])Unit [error] match argument types (java.util.Map[String,String]) [error] props.putAll(outputSerdeProps.toMap.asJava) [error] ^ ``` This is because the key type is Object instead of String, which is unsafe. ## How was this patch tested? Running tests. Author: kellyzly <kellyzly@126.com> Closes #19854 from kellyzly/SPARK-22660.	2017-12-07 10:04:04 -06:00
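The two ambiguities described in SPARK-22660 can be avoided outside Spark as well. The following is a minimal sketch that assumes only the JDK and the Scala standard library; `BufferCalls`, `resultSize`, `remainingBytes` and `copyInto` are hypothetical names chosen for illustration and are not taken from the patch.

```scala
import java.nio.ByteBuffer
import java.util.Properties

object BufferCalls {
  // Explicit parentheses select the zero-argument getter, so the call stays
  // unambiguous on JDKs where ByteBuffer also declares limit(int)/position(int).
  def resultSize(buf: ByteBuffer): Int = buf.limit()

  def remainingBytes(buf: ByteBuffer): Int = buf.limit() - buf.position()

  // Copying entries one by one sidesteps the overloaded putAll entirely
  // (an illustrative workaround, not necessarily what the patch does).
  def copyInto(props: Properties, entries: Map[String, String]): Properties = {
    entries.foreach { case (k, v) => props.setProperty(k, v) }
    props
  }
}
```

Writing `buf.limit()` instead of `buf.limit` costs nothing on older toolchains, so the same source should compile both before and after the upgrade.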
Marco Gaido	b79071910e	[SPARK-22696][SQL] objects functions should not use unneeded global variables ## What changes were proposed in this pull request? Some objects functions are using global variables which are not needed. This can generate some unneeded entries in the constant pool. The PR replaces the unneeded global variables with local variables. ## How was this patch tested? added UTs Author: Marco Gaido <mgaido@hortonworks.com> Author: Marco Gaido <marcogaido91@gmail.com> Closes #19908 from mgaido91/SPARK-22696.	2017-12-07 21:24:36 +08:00
Marco Gaido	fc29446300	[SPARK-22699][SQL] GenerateSafeProjection should not use global variables for struct ## What changes were proposed in this pull request? GenerateSafeProjection defines a mutable state for each struct, which is not needed. This aggravates the well-known issues related to constant pool limits. The PR replaces the global variable with a local one. ## How was this patch tested? added UT Author: Marco Gaido <marcogaido91@gmail.com> Closes #19914 from mgaido91/SPARK-22699.	2017-12-07 21:18:27 +08:00
Dongjoon Hyun	dd59a4be36	[SPARK-22712][SQL] Use `buildReaderWithPartitionValues` in native OrcFileFormat ## What changes were proposed in this pull request? To support vectorization in native OrcFileFormat later, we need to use `buildReaderWithPartitionValues` instead of `buildReader` like ParquetFileFormat. This PR replaces `buildReader` with `buildReaderWithPartitionValues`. ## How was this patch tested? Pass the Jenkins with the existing test cases. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #19907 from dongjoon-hyun/SPARK-ORC-BUILD-READER.	2017-12-07 21:08:15 +08:00
Brad Kaiser	beb717f648	[SPARK-22618][CORE] Catch exception in removeRDD to stop jobs from dying ## What changes were proposed in this pull request? I propose that BlockManagerMasterEndpoint.removeRdd() should catch and log any IOExceptions it receives. As it is now, the exception can bubble up to the main thread and kill user applications when called from RDD.unpersist(). I think this change is a better experience for the end user. I chose to catch the exception in BlockManagerMasterEndpoint.removeRdd() instead of RDD.unpersist() because this way the RDD.unpersist() blocking option will still work correctly. Otherwise, blocking will get short circuited by the first error. ## How was this patch tested? This patch was tested with a job that shows the job killing behavior mentioned above. rxin, it looks like you originally wrote this method, I would appreciate it if you took a look. Thanks. This contribution is my original work and is licensed under the project's open source license. Author: Brad Kaiser <kaiserb@us.ibm.com> Closes #19836 from brad-kaiser/catch-unpersist-exception.	2017-12-07 21:04:09 +08:00
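The idea behind SPARK-22618, recovering from a failure per executor so that one error neither kills the caller nor short-circuits the remaining removals, can be sketched with plain Scala futures. Everything here is an assumption for illustration: `RemoveRddSketch`, `askToRemove` and the simulated failure on executor 2 are hypothetical and do not come from `BlockManagerMasterEndpoint`.

```scala
import java.io.IOException
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object RemoveRddSketch {
  implicit val ec: ExecutionContext = ExecutionContext.global

  // Hypothetical stand-in for asking one executor to drop an RDD's blocks.
  // It yields the number of removed blocks, or fails with an IOException.
  def askToRemove(executorId: Int): Future[Int] = Future {
    if (executorId == 2) throw new IOException(s"executor $executorId is unreachable")
    3
  }

  // Each per-executor future is recovered on its own, so a failure is logged
  // and counted as zero blocks instead of failing the combined future.
  def removeRdd(executorIds: Seq[Int]): Future[Seq[Int]] =
    Future.sequence(executorIds.map { id =>
      askToRemove(id).recover {
        case e: IOException =>
          Console.err.println(s"Error trying to remove RDD blocks: ${e.getMessage}")
          0
      }
    })

  // Blocking variant: still waits for every executor, including failed ones.
  def removeRddBlocking(executorIds: Seq[Int]): Seq[Int] =
    Await.result(removeRdd(executorIds), 10.seconds)
}
```

Because recovery happens before `Future.sequence`, a blocking caller still waits for every executor rather than returning on the first error.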
Sunitha Kambhampati	2be448260d	[SPARK-22452][SQL] Add getInt, getLong, getBoolean to DataSourceV2Options - Implemented methods getInt, getLong, getBoolean for DataSourceV2Options - Added new unit tests to exercise these methods Author: Sunitha Kambhampati <skambha@us.ibm.com> Closes #19902 from skambha/spark22452.	2017-12-07 20:59:47 +08:00
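Typed getters such as those added in SPARK-22452 usually amount to a lookup plus a parse with a fallback. The sketch below is a hypothetical Scala model, not the actual `DataSourceV2Options` class: the `Options` name, the case-insensitive key handling and the default-value parameters are all assumptions made for illustration.

```scala
import java.util.Locale

// Hypothetical model of an options map with typed getters.
final class Options(raw: Map[String, String]) {
  // Assumption: keys are compared case-insensitively.
  private val normalized: Map[String, String] =
    raw.map { case (k, v) => k.toLowerCase(Locale.ROOT) -> v }

  def get(key: String): Option[String] =
    normalized.get(key.toLowerCase(Locale.ROOT))

  // Each getter parses the stored string, or returns the default if the key is absent.
  def getInt(key: String, default: Int): Int =
    get(key).map(_.toInt).getOrElse(default)

  def getLong(key: String, default: Long): Long =
    get(key).map(_.toLong).getOrElse(default)

  def getBoolean(key: String, default: Boolean): Boolean =
    get(key).map(_.toBoolean).getOrElse(default)

  def getDouble(key: String, default: Double): Double =
    get(key).map(_.toDouble).getOrElse(default)
}
```

In this sketch a malformed value (for example `"abc"` passed to `getInt`) throws rather than falling back to the default, which keeps configuration mistakes visible.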
Kazuaki Ishizaki	ea2fbf4197	[SPARK-22705][SQL] Case, Coalesce, and In use less global variables ## What changes were proposed in this pull request? This PR accomplishes the following two items. 1. Reduce # of global variables from two to one for generated code of `Case` and `Coalesce` and remove global variables for generated code of `In`. 2. Make lifetime of global variable local within an operation Item 1. reduces # of constant pool entries in a Java class. Item 2. ensures that a variable is not passed to arguments in a method split by `CodegenContext.splitExpressions()`, which is addressed by #19865. ## How was this patch tested? Added new tests into `PredicateSuite`, `NullExpressionsSuite`, and `ConditionalExpressionSuite`. Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #19901 from kiszk/SPARK-22705.	2017-12-07 20:55:35 +08:00
Wenchen Fan	e103adf45a	[SPARK-22703][SQL] make ColumnarRow an immutable view ## What changes were proposed in this pull request? Similar to https://github.com/apache/spark/pull/19842 , we should also make `ColumnarRow` an immutable view, and move forward to make `ColumnVector` public. ## How was this patch tested? Existing tests. The performance concern should be same as https://github.com/apache/spark/pull/19842 . Author: Wenchen Fan <wenchen@databricks.com> Closes #19898 from cloud-fan/row-id.	2017-12-07 20:45:11 +08:00
Dongjoon Hyun	c1e5688d1a	[SPARK-22672][SQL][TEST] Refactor ORC Tests ## What changes were proposed in this pull request? Since SPARK-20682, we have two `OrcFileFormat`s. This PR refactors ORC tests with three principles (with a few exceptions) 1. Move test suite into `sql/core`. 2. Create `HiveXXX` test suite in `sql/hive` by reusing `sql/core` test suite. 3. `OrcTest` will provide common helper functions and `val orcImp: String`. Test Suites Native OrcFileFormat - org.apache.spark.sql.hive.orc - OrcFilterSuite - OrcPartitionDiscoverySuite - OrcQuerySuite - OrcSourceSuite - o.a.s.sql.hive.orc - OrcHadoopFsRelationSuite Hive built-in OrcFileFormat - o.a.s.sql.hive.orc - HiveOrcFilterSuite - HiveOrcPartitionDiscoverySuite - HiveOrcQuerySuite - HiveOrcSourceSuite - HiveOrcHadoopFsRelationSuite Hierarchy ``` OrcTest -> OrcSuite -> OrcSourceSuite -> OrcQueryTest -> OrcQuerySuite -> OrcPartitionDiscoveryTest -> OrcPartitionDiscoverySuite -> OrcFilterSuite HadoopFsRelationTest -> OrcHadoopFsRelationSuite -> HiveOrcHadoopFsRelationSuite ``` Please note the followings. - Unlike the other test suites, `OrcHadoopFsRelationSuite` doesn't inherit `OrcTest`. It is inside `sql/hive` like `ParquetHadoopFsRelationSuite` due to the dependencies and follows the existing convention to use `val dataSourceName: String` - `OrcFilterSuite`s cannot reuse test cases due to the different function signatures using Hive 1.2.1 ORC classes and Apache ORC 1.4.1 classes. ## How was this patch tested? Pass the Jenkins tests with reorganized test suites. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #19882 from dongjoon-hyun/SPARK-22672.	2017-12-07 20:42:46 +08:00
Juliusz Sompolski	d32337b1ef	[SPARK-22721] BytesToBytesMap peak memory usage not accurate after reset() ## What changes were proposed in this pull request? BytesToBytesMap doesn't update peak memory usage before shrinking back to initial capacity in reset(), so after a disk spill one never knows what the size of the hash table was before spilling. ## How was this patch tested? Checked manually. Author: Juliusz Sompolski <julek@databricks.com> Closes #19915 from juliuszsompolski/SPARK-22721.	2017-12-07 13:05:59 +01:00
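The bug in SPARK-22721 comes down to ordering: the peak has to be recorded before the structure shrinks. A minimal sketch of that ordering follows, using a hypothetical `PeakTrackingMap` class with made-up method names; it models only the bookkeeping, not the real `BytesToBytesMap`.

```scala
// Hypothetical, simplified model of a map that tracks its peak memory use.
final class PeakTrackingMap(initialBytes: Long) {
  private var currentBytes: Long = initialBytes
  private var peakBytes: Long = initialBytes

  // Simulates the map acquiring more memory as it fills up.
  def grow(extraBytes: Long): Unit = currentBytes += extraBytes

  private def updatePeak(): Unit =
    if (currentBytes > peakBytes) peakBytes = currentBytes

  // Record the peak before shrinking; otherwise the pre-spill size is lost.
  def reset(): Unit = {
    updatePeak()
    currentBytes = initialBytes
  }

  def peakMemoryUsedBytes: Long = {
    updatePeak()
    peakBytes
  }
}
```

If `reset()` skipped the `updatePeak()` call, a map that grew and was then reset would report only its initial size as the peak.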
Kazuaki Ishizaki	8ae004b460	[SPARK-22688][SQL] Upgrade Janino version to 3.0.8 ## What changes were proposed in this pull request? This PR upgrades the Janino version to 3.0.8. [Janino 3.0.8](https://janino-compiler.github.io/janino/changelog.html) includes an important fix to reduce the number of constant pool entries by using 'sipush' java bytecode. * SIPUSH bytecode is not used for short integer constant [#33](https://github.com/janino-compiler/janino/issues/33). Please see details in [this discussion thread](https://github.com/apache/spark/pull/19518#issuecomment-346674976). ## How was this patch tested? Existing tests Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #19890 from kiszk/SPARK-22688.	2017-12-06 16:15:25 -08:00
Marco Gaido	f110a7f884	[SPARK-22693][SQL] CreateNamedStruct and InSet should not use global variables ## What changes were proposed in this pull request? CreateNamedStruct and InSet are using a global variable which is not needed. This can generate some unneeded entries in the constant pool. The PR removes the unnecessary mutable states and makes them local variables. ## How was this patch tested? added UT Author: Marco Gaido <marcogaido91@gmail.com> Author: Marco Gaido <mgaido@hortonworks.com> Closes #19896 from mgaido91/SPARK-22693.	2017-12-06 14:12:16 -08:00
smurakozi	9948b860ac	[SPARK-22516][SQL] Bump up Univocity version to 2.5.9 ## What changes were proposed in this pull request? There was a bug in Univocity Parser that causes the issue in SPARK-22516. This was fixed by upgrading from 2.5.4 to 2.5.9 version of the library : Executing ``` spark.read.option("header","true").option("inferSchema", "true").option("multiLine", "true").option("comment", "g").csv("test_file_without_eof_char.csv").show() ``` Before ``` ERROR Executor: Exception in task 0.0 in stage 6.0 (TID 6) com.univocity.parsers.common.TextParsingException: java.lang.IllegalArgumentException - Unable to skip 1 lines from line 2. End of input reached ... Internal state when error was thrown: line=3, column=0, record=2, charIndex=31 at com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:339) at com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:475) at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anon$1.next(UnivocityParser.scala:281) at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) ``` After ``` +-------+-------+ \|column1\|column2\| +-------+-------+ \| abc\| def\| +-------+-------+ ``` ## How was this patch tested? The already existing `CSVSuite.commented lines in CSV data` test was extended to parse the file also in multiline mode. The test input file was modified to also include a comment in the last line. Author: smurakozi <smurakozi@gmail.com> Closes #19906 from smurakozi/SPARK-22516.	2017-12-06 13:22:08 -08:00
gatorsmile	effca9868e	[SPARK-22720][SS] Make EventTimeWatermark Extend UnaryNode ## What changes were proposed in this pull request? Our Analyzer and Optimizer have multiple rules for `UnaryNode`. After making `EventTimeWatermark` extend `UnaryNode`, we do not need a special handling for `EventTimeWatermark`. ## How was this patch tested? The existing tests Author: gatorsmile <gatorsmile@gmail.com> Closes #19913 from gatorsmile/eventtimewatermark.	2017-12-06 13:11:38 -08:00
Devaraj K	51066b437b	[SPARK-14228][CORE][YARN] Lost executor of RPC disassociated, and occurs exception: Could not find CoarseGrainedScheduler or it has been stopped ## What changes were proposed in this pull request? I see the two instances where the exception is occurring. Instance 1: ``` 17/11/10 15:49:32 ERROR util.Utils: Uncaught exception in thread driver-revive-thread org.apache.spark.SparkException: Could not find CoarseGrainedScheduler. at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:160) at org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:140) at org.apache.spark.rpc.netty.NettyRpcEnv.send(NettyRpcEnv.scala:187) at org.apache.spark.rpc.netty.NettyRpcEndpointRef.send(NettyRpcEnv.scala:521) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(CoarseGrainedSchedulerBackend.scala:125) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(CoarseGrainedSchedulerBackend.scala:125) at scala.Option.foreach(Option.scala:257) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anon$1$$anonfun$run$1.apply$mcV$sp(CoarseGrainedSchedulerBackend.scala:125) at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1344) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anon$1.run(CoarseGrainedSchedulerBackend.scala:124) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) ``` In CoarseGrainedSchedulerBackend.scala, driver-revive-thread starts with DriverEndpoint.onStart() and keeps sending the ReviveOffers messages periodically till it gets shutdown as part DriverEndpoint.onStop(). There is no proper coordination between the driver-revive-thread(shutdown) and the RpcEndpoint unregister, RpcEndpoint unregister happens first and then driver-revive-thread shuts down as part of DriverEndpoint.onStop(), In-between driver-revive-thread may try to send the ReviveOffers message which is leading to the above exception. To fix this issue, this PR moves the shutting down of driver-revive-thread to CoarseGrainedSchedulerBackend.stop() which executes before the DriverEndpoint unregister. Instance 2: ``` 17/11/10 16:31:38 ERROR cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Error requesting driver to remove executor 1 for reason Executor for container container_1508535467865_0226_01_000002 exited because of a YARN event (e.g., pre-emption) and not because of an error in the running job. org.apache.spark.SparkException: Could not find CoarseGrainedScheduler. 
at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:160) at org.apache.spark.rpc.netty.Dispatcher.postLocalMessage(Dispatcher.scala:135) at org.apache.spark.rpc.netty.NettyRpcEnv.ask(NettyRpcEnv.scala:229) at org.apache.spark.rpc.netty.NettyRpcEndpointRef.ask(NettyRpcEnv.scala:516) at org.apache.spark.rpc.RpcEndpointRef.ask(RpcEndpointRef.scala:63) at org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnSchedulerEndpoint$$anonfun$receive$1.applyOrElse(YarnSchedulerBackend.scala:269) at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117) at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205) at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101) at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:221) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) ``` Here YarnDriverEndpoint tries to send remove executor messages after the Yarn scheduler backend service stop, which is leading to the above exception. To avoid the above exception, 1) We may add a condition(which checks whether service has stopped or not) before sending executor remove message 2) Add a warn log message in onFailure case when the service is already stopped In this PR, chosen the 2) option which adds a log message in the case of onFailure without the exception stack trace since the option 1) would need to to go through for every remove executor message. ## How was this patch tested? I verified it manually, I don't see these exceptions with the PR changes. Author: Devaraj K <devaraj@apache.org> Closes #19741 from devaraj-kavali/SPARK-14228.	2017-12-06 10:39:15 -08:00
Reynold Xin	4286cba7da	[SPARK-22710] ConfigBuilder.fallbackConf should trigger onCreate function ## What changes were proposed in this pull request? I was looking at the config code today and found that configs defined using ConfigBuilder.fallbackConf didn't trigger onCreate function. This patch fixes it. This doesn't require backporting since we currently have no configs that use it. ## How was this patch tested? Added a test case for all the config final creator functions in ConfigEntrySuite. Author: Reynold Xin <rxin@databricks.com> Closes #19905 from rxin/SPARK-22710.	2017-12-06 10:11:25 -08:00
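One way to make sure every final creator function fires the creation hook, as SPARK-22710 requires of `fallbackConf`, is to route all of them through a single helper. The sketch below is a much-reduced hypothetical model: `Entry`, `Builder`, `finish` and the `spark.a` / `spark.b` keys in the usage note are invented for illustration and are not Spark's `ConfigBuilder`.

```scala
// Hypothetical, much-reduced model of a config builder with an onCreate hook.
final case class Entry(key: String, default: Option[String], fallback: Option[Entry])

final class Builder(key: String) {
  private var callback: Option[Entry => Unit] = None

  def onCreate(f: Entry => Unit): Builder = {
    callback = Some(f)
    this
  }

  // Every final creator goes through this helper, so the hook always fires.
  private def finish(entry: Entry): Entry = {
    callback.foreach(f => f(entry))
    entry
  }

  def createWithDefault(value: String): Entry =
    finish(Entry(key, Some(value), None))

  def fallbackConf(other: Entry): Entry =
    finish(Entry(key, None, Some(other)))
}
```

For example, `new Builder("spark.b").onCreate(register).fallbackConf(base)` invokes `register` exactly once with the new entry, just as `createWithDefault` does.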
Marco Gaido	e98f9647f4	[SPARK-22695][SQL] ScalaUDF should not use global variables ## What changes were proposed in this pull request? ScalaUDF is using global variables which are not needed. This can generate some unneeded entries in the constant pool. The PR replaces the unneeded global variables with local variables. ## How was this patch tested? added UT Author: Marco Gaido <mgaido@hortonworks.com> Author: Marco Gaido <marcogaido91@gmail.com> Closes #19900 from mgaido91/SPARK-22695.	2017-12-07 00:50:49 +08:00
Kazuaki Ishizaki	813c0f945d	[SPARK-22704][SQL] Least and Greatest use less global variables ## What changes were proposed in this pull request? This PR accomplishes the following two items. 1. Reduce # of global variables from two to one 2. Make lifetime of global variable local within an operation Item 1. reduces # of constant pool entries in a Java class. Item 2. ensures that a variable is not passed to arguments in a method split by `CodegenContext.splitExpressions()`, which is addressed by #19865. ## How was this patch tested? Added new test into `ArithmeticExpressionSuite` Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #19899 from kiszk/SPARK-22704.	2017-12-07 00:45:51 +08:00
Zheng RuiFeng	6f41c593bb	[SPARK-22690][ML] Imputer inherit HasOutputCols ## What changes were proposed in this pull request? make `Imputer` inherit `HasOutputCols` ## How was this patch tested? existing tests Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #19889 from zhengruifeng/using_HasOutputCols.	2017-12-06 08:27:17 -08:00
Dongjoon Hyun	fb6a922751	[SPARK-20728][SQL][FOLLOWUP] Use an actionable exception message ## What changes were proposed in this pull request? This is a follow-up of https://github.com/apache/spark/pull/19871 to improve an exception message. ## How was this patch tested? Pass the Jenkins. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #19903 from dongjoon-hyun/orc_exception.	2017-12-06 20:20:20 +09:00
Liang-Chi Hsieh	00d176d2fe	[SPARK-20392][SQL] Set barrier to prevent re-entering a tree ## What changes were proposed in this pull request? The SQL `Analyzer` goes through a whole query plan even when most of it is already analyzed. This increases the time spent on query analysis, especially for long pipelines in ML. This patch adds a logical node called `AnalysisBarrier` that wraps an analyzed logical plan to prevent it from being analyzed again. The barrier is applied to the analyzed logical plan in `Dataset`. It won't change the output of the wrapped logical plan and just acts as a wrapper to hide it from the analyzer. New operations on the dataset will be put on the barrier, so only the newly created nodes will be analyzed. This analysis barrier will be removed at the end of the analysis stage. ## How was this patch tested? Added tests. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #19873 from viirya/SPARK-20392-reopen.	2017-12-05 21:43:41 -08:00
Dongjoon Hyun	82183f7b57	[SPARK-22686][SQL] DROP TABLE IF EXISTS should not show AnalysisException ## What changes were proposed in this pull request? The fix for a view resolution issue in [SPARK-22488](https://github.com/apache/spark/pull/19713) introduced a regression on the `2.2.1` and `master` branches, as shown below. This PR fixes that. ```scala scala> spark.version res2: String = 2.2.1 scala> sql("DROP TABLE IF EXISTS t").show 17/12/04 21:01:06 WARN DropTableCommand: org.apache.spark.sql.AnalysisException: Table or view not found: t; org.apache.spark.sql.AnalysisException: Table or view not found: t; ``` ## How was this patch tested? Manual. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #19888 from dongjoon-hyun/SPARK-22686.	2017-12-06 10:52:29 +08:00
Mark Petruska	59aa3d56af	[SPARK-20706][SPARK-SHELL] Spark-shell not overriding method/variable definition ## What changes were proposed in this pull request? [SPARK-20706](https://issues.apache.org/jira/browse/SPARK-20706): Spark-shell not overriding method/variable definition This is a Scala repl bug ( [SI-9740](https://github.com/scala/bug/issues/9740) ), was fixed in version 2.11.9 ( [see the original PR](https://github.com/scala/scala/pull/5090) ) ## How was this patch tested? Added a new test case in `ReplSuite`. Author: Mark Petruska <petruska.mark@gmail.com> Closes #19879 from mpetruska/SPARK-20706.	2017-12-05 18:08:36 -06:00
Zhenhua Wang	1e17ab83de	[SPARK-22662][SQL] Failed to prune columns after rewriting predicate subquery ## What changes were proposed in this pull request? As a simple example: ``` spark-sql> create table base (a int, b int) using parquet; Time taken: 0.066 seconds spark-sql> create table relInSubq ( x int, y int, z int) using parquet; Time taken: 0.042 seconds spark-sql> explain select a from base where a in (select x from relInSubq); == Physical Plan == Project [a#83] +- BroadcastHashJoin [a#83], [x#85], LeftSemi, BuildRight :- FileScan parquet default.base[a#83,b#84] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://100.0.0.4:9000/wzh/base], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<a:int,b:int> +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint))) +- Project [x#85] +- *FileScan parquet default.relinsubq[x#85] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://100.0.0.4:9000/wzh/relinsubq], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<x:int> ``` We only need column `a` in table `base`, but all columns (`a`, `b`) are fetched. The reason is that, in "Operator Optimizations" batch, `ColumnPruning` first produces a `Project` on table `base`, but then it's removed by `removeProjectBeforeFilter`. Because at that time, the predicate subquery is in filter form. Then, in "Rewrite Subquery" batch, `RewritePredicateSubquery` converts the subquery into a LeftSemi join, but this batch doesn't have the `ColumnPruning` rule. This results in reading all columns for the `base` table. ## How was this patch tested? Added a new test case. Author: Zhenhua Wang <wangzhenhua@huawei.com> Closes #19855 from wzhfy/column_pruning_subquery.	2017-12-05 15:15:32 -08:00
Wenchen Fan	132a3f4708	[SPARK-22500][SQL][FOLLOWUP] cast for struct can split code even with whole stage codegen ## What changes were proposed in this pull request? A followup of https://github.com/apache/spark/pull/19730, we can split the code for casting struct even with whole stage codegen. This PR also has some renaming to make the code easier to read. ## How was this patch tested? existing test Author: Wenchen Fan <wenchen@databricks.com> Closes #19891 from cloud-fan/cast.	2017-12-05 11:40:13 -08:00
Wenchen Fan	ced6ccf0d6	[SPARK-22701][SQL] add ctx.splitExpressionsWithCurrentInputs ## What changes were proposed in this pull request? This pattern appears many times in the codebase: ``` if (ctx.INPUT_ROW == null || ctx.currentVars != null) { exprs.mkString("\n") } else { ctx.splitExpressions(...) } ``` This PR adds a `ctx.splitExpressionsWithCurrentInputs` for this pattern ## How was this patch tested? existing tests Author: Wenchen Fan <wenchen@databricks.com> Closes #19895 from cloud-fan/splitExpression.	2017-12-05 10:15:15 -08:00
Carson Wang	03fdc92e42	[SPARK-22681] Accumulator should only be updated once for each task in result stage ## What changes were proposed in this pull request? As the doc says "For accumulator updates performed inside actions only, Spark guarantees that each task’s update to the accumulator will only be applied once, i.e. restarted tasks will not update the value." But currently the code doesn't guarantee this. ## How was this patch tested? Newly added tests. Author: Carson Wang <carson.wang@intel.com> Closes #19877 from carsonwang/fixAccum.	2017-12-05 09:15:22 -08:00
Dongjoon Hyun	326f1d6728	[SPARK-20728][SQL] Make OrcFileFormat configurable between sql/hive and sql/core ## What changes were proposed in this pull request? This PR aims to provide a configuration to choose the default `OrcFileFormat` from the legacy `sql/hive` module or the new `sql/core` module. For example, this configuration affects the following operations. ```scala spark.read.orc(...) ``` ```sql CREATE TABLE t USING ORC ... ``` ## How was this patch tested? Pass the Jenkins with new test suites. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #19871 from dongjoon-hyun/spark-sql-orc-enabled.	2017-12-05 20:46:35 +08:00
gatorsmile	53e5251bb3	[SPARK-22675][SQL] Refactoring PropagateTypes in TypeCoercion ## What changes were proposed in this pull request? PropagateTypes are called twice in TypeCoercion. We do not need to call it twice. Instead, we should call it after each change on the types. ## How was this patch tested? The existing tests Author: gatorsmile <gatorsmile@gmail.com> Closes #19874 from gatorsmile/deduplicatePropagateTypes.	2017-12-05 20:43:02 +08:00
Wenchen Fan	a8af4da12c	[SPARK-22682][SQL] HashExpression does not need to create global variables ## What changes were proposed in this pull request? It turns out that `HashExpression` can pass some values around via parameters when splitting code into methods, to save some global variable slots. This also prevents a weird case where a global variable appears in a parameter list, which was discovered by https://github.com/apache/spark/pull/19865 ## How was this patch tested? existing tests Author: Wenchen Fan <wenchen@databricks.com> Closes #19878 from cloud-fan/minor.	2017-12-05 12:43:05 +08:00
Wenchen Fan	295df746ec	[SPARK-22677][SQL] cleanup whole stage codegen for hash aggregate ## What changes were proposed in this pull request? The `HashAggregateExec` whole stage codegen path is a little messy and hard to understand. This change cleans it up a little bit, especially for the fast hash map part. ## How was this patch tested? existing tests Author: Wenchen Fan <wenchen@databricks.com> Closes #19869 from cloud-fan/hash-agg.	2017-12-05 12:38:26 +08:00
Marco Gaido	3887b7eef7	[SPARK-22665][SQL] Avoid repartitioning with empty list of expressions ## What changes were proposed in this pull request? Repartitioning by an empty set of expressions is currently possible, even though this case is not handled properly. Indeed, `HashExpression` has a check to avoid running on an empty set, but this check is not performed while repartitioning. Thus, the PR adds a check to prevent this invalid situation. ## How was this patch tested? added UT Author: Marco Gaido <marcogaido91@gmail.com> Closes #19870 from mgaido91/SPARK-22665.	2017-12-04 17:08:56 -08:00
Zhenhua Wang	1d5597b408	[SPARK-22626][SQL][FOLLOWUP] improve documentation and simplify test case ## What changes were proposed in this pull request? This PR improves documentation for not using zero `numRows` statistics and simplifies the test case. The reason why some Hive tables have zero `numRows` is that, in Hive, when stats gathering is disabled, `numRows` is always zero after INSERT command: ``` hive> create table src (key int, value string) stored as orc; hive> desc formatted src; Table Parameters: COLUMN_STATS_ACCURATE {\"BASIC_STATS\":\"true\"} numFiles 0 numRows 0 rawDataSize 0 totalSize 0 transient_lastDdlTime 1512399590 hive> set hive.stats.autogather=false; hive> insert into src select 1, 'a'; hive> desc formatted src; Table Parameters: numFiles 1 numRows 0 rawDataSize 0 totalSize 275 transient_lastDdlTime 1512399647 hive> insert into src select 1, 'b'; hive> desc formatted src; Table Parameters: numFiles 2 numRows 0 rawDataSize 0 totalSize 550 transient_lastDdlTime 1512399687 ``` ## How was this patch tested? Modified existing test. Author: Zhenhua Wang <wzh_zju@163.com> Closes #19880 from wzhfy/doc_zero_rowCount.	2017-12-04 15:08:07 -08:00
Marcelo Vanzin	e1dd03e42c	[SPARK-22372][CORE, YARN] Make cluster submission use SparkApplication. The main goal of this change is to allow multiple cluster-mode submissions from the same JVM, without having them end up with mixed configuration. That is done by extending the SparkApplication trait, and doing so was reasonably trivial for standalone and mesos modes. For YARN mode, there was a complication. YARN used a "SPARK_YARN_MODE" system property to control behavior indirectly in a whole bunch of places, mainly in the SparkHadoopUtil / YarnSparkHadoopUtil classes. Most of the changes here are removing that. Since we removed support for Hadoop 1.x, some methods that lived in YarnSparkHadoopUtil can now live in SparkHadoopUtil. The remaining methods don't need to be part of the class, and can be called directly from the YarnSparkHadoopUtil object, so now there's a single implementation of SparkHadoopUtil. There were two places in the code that relied on SPARK_YARN_MODE to make decisions about YARN-specific functionality, and now explicitly check the master from the configuration for that instead: * fetching the external shuffle service port, which can come from the YARN configuration. * propagation of the authentication secret using Hadoop credentials. This also was cleaned up a little to not need so many methods in `SparkHadoopUtil`. With those out of the way, actually changing the YARN client to extend SparkApplication was easy. Tested with existing unit tests, and also by running YARN apps with auth and kerberos both on and off in a real cluster. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #19631 from vanzin/SPARK-22372.	2017-12-04 11:05:03 -08:00
Reza Safi	f81401e1cb	[SPARK-22162] Executors and the driver should use consistent JobIDs in the RDD commit protocol I have modified SparkHadoopWriter so that executors and the driver always use consistent JobIds during the Hadoop commit. Before SPARK-18191, Spark always used the rddId; it just incorrectly named the variable stageId. After SPARK-18191, it used the rddId as the jobId on the driver's side, and the stageId as the jobId on the executors' side. With this change, executors and the driver consistently use rddId as the jobId. Also with this change, during the Hadoop commit protocol Spark uses the actual stageId to check whether a stage can be committed, whereas before it used the executors' jobId for this check. In addition to the existing unit tests, a test has been added to check whether executors and the driver are using the same JobId. The test failed before this change and passed after applying this fix. Author: Reza Safi <rezasafi@cloudera.com> Closes #19848 from rezasafi/stagerddsimple.	2017-12-04 09:23:48 -08:00
Marco Gaido	3927bb9b46	[SPARK-22473][FOLLOWUP][TEST] Remove deprecated Date functions ## What changes were proposed in this pull request? #19696 replaced the deprecated usages for `Date` and `Waiter`, but a few methods were missed. The PR fixes the forgotten deprecated usages. ## How was this patch tested? existing UTs Author: Marco Gaido <mgaido@hortonworks.com> Closes #19875 from mgaido91/SPARK-22473_FOLLOWUP.	2017-12-04 11:07:27 -06:00
Yuming Wang	4131ad03f4	[SPARK-22489][DOC][FOLLOWUP] Update broadcast behavior changes in migration section ## What changes were proposed in this pull request? Update broadcast behavior changes in migration section. ## How was this patch tested? N/A Author: Yuming Wang <wgyumg@gmail.com> Closes #19858 from wangyum/SPARK-22489-migration.	2017-12-03 23:52:37 -08:00

21547 commits