ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
maryannxue	e8167768cf	[SPARK-25044][FOLLOW-UP] Change ScalaUDF constructor signature ## What changes were proposed in this pull request? This is a follow-up PR for #22259. The extra field added in `ScalaUDF` with the original PR was declared optional but should in fact be required; otherwise callers of `ScalaUDF`'s constructor could ignore this new field and cause the result to be incorrect. This PR makes the new field required and changes its name to `handleNullForInputs`. #22259 breaks the previous behavior for null-handling of primitive-type input parameters. For example, for `val f = udf({(x: Int, y: Any) => x})`, `f(null, "str")` should return `null` but would return `0` after #22259. In this PR, all UDF methods except `def udf(f: AnyRef, dataType: DataType): UserDefinedFunction` have been restored to the original behavior. The only exception is documented in the Spark SQL migration guide. In addition, now that we have this extra field indicating if a null-test should be applied on the corresponding input value, we can also make use of this flag to avoid the rule `HandleNullInputsForUDF` being applied infinitely. ## How was this patch tested? Added UT in UDFSuite Passed affected existing UTs: AnalysisSuite UDFSuite Closes #22732 from maryannxue/spark-25044-followup. Lead-authored-by: maryannxue <maryannxue@apache.org> Co-authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2018-10-19 21:03:59 +08:00
Russell Spitzer	6e0fc8b0fc	[SPARK-25560][SQL] Allow FunctionInjection in SparkExtensions This allows an implementer of Spark Session Extensions to utilize a method "injectFunction" which will add a new function to the default Spark Session Catalogue. ## What changes were proposed in this pull request? Adds a new function to SparkSessionExtensions def injectFunction(functionDescription: FunctionDescription) Where function description is a new type type FunctionDescription = (FunctionIdentifier, FunctionBuilder) The functions are loaded in BaseSessionBuilder when the function registry does not have a parent function registry to get loaded from. ## How was this patch tested? New unit tests are added for the extension in SparkSessionExtensionSuite Closes #22576 from RussellSpitzer/SPARK-25560. Authored-by: Russell Spitzer <Russell.Spitzer@gmail.com> Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>	2018-10-19 10:40:56 +02:00
hyukjinkwon	c8f7691c64	[MINOR][DOC] Spacing items in migration guide for readability and consistency ## What changes were proposed in this pull request? Currently, the migration guide has no space between items, which looks too compact and is hard to read. Some items already had spaces between them in the migration guide. This PR proposes formatting them consistently for readability. Before: ![screen shot 2018-10-18 at 10 00 04 am](https://user-images.githubusercontent.com/6477701/47126768-9e84fb80-d2bc-11e8-9211-84703486c553.png) After: ![screen shot 2018-10-18 at 9 53 55 am](https://user-images.githubusercontent.com/6477701/47126708-4fd76180-d2bc-11e8-9aa5-546f0622ca20.png) ## How was this patch tested? Manually tested: Closes #22761 from HyukjinKwon/minor-migration-doc. Authored-by: hyukjinkwon <gurwls223@apache.org> Signed-off-by: hyukjinkwon <gurwls223@apache.org>	2018-10-19 13:55:27 +08:00
Justin Uang	1e6c1d8bfb	[SPARK-25493][SQL] Use auto-detection for CRLF in CSV datasource multiline mode ## What changes were proposed in this pull request? CSVs with windows style crlf ('\r\n') don't work in multiline mode. They work fine in single line mode because the line separation is done by Hadoop, which can handle all the different types of line separators. This PR fixes it by enabling Univocity's line separator detection in multiline mode, which will detect '\r\n', '\r', or '\n' automatically as it is done by hadoop in single line mode. ## How was this patch tested? Unit test with a file with crlf line endings. Closes #22503 from justinuang/fix-clrf-multiline. Authored-by: Justin Uang <juang@palantir.com> Signed-off-by: hyukjinkwon <gurwls223@apache.org>	2018-10-19 11:13:02 +08:00
Marco Gaido	d0ecff2854	[SPARK-25764][ML][EXAMPLES] Update BisectingKMeans example to use ClusteringEvaluator ## What changes were proposed in this pull request? The PR updates the examples for `BisectingKMeans` so that they don't use the deprecated method `computeCost` (see SPARK-25758). ## How was this patch tested? running examples Closes #22763 from mgaido91/SPARK-25764. Authored-by: Marco Gaido <marcogaido91@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2018-10-19 09:33:46 +08:00
shivusondur	f704ebe902	[SPARK-25683][CORE] Updated the log for the firstTime event Drop occurs ## What changes were proposed in this pull request? When the first dropEvent occurs, lastReportTimestamp was printed in the log as Wed Dec 31 16:00:00 PST 1969 (Dropped 1 events from eventLog since Wed Dec 31 16:00:00 PST 1969.) The reason is that lastReportTimestamp is initialized with 0. Now the log is updated to print "... since the application starts" if 'lastReportTimestamp' == 0, which happens when the first dropEvent occurs. ## How was this patch tested? Manually verified. Closes #22677 from shivusondur/AsyncEvent1. Authored-by: shivusondur <shivusondur@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2018-10-18 15:05:56 -07:00
Yuanjian Li	987f386588	[SPARK-24499][SQL][DOC] Split the page of sql-programming-guide.html to multiple separate pages ## What changes were proposed in this pull request? 1. Split the main page of sql-programming-guide into 7 parts: - Getting Started - Data Sources - Performance Tuning - Distributed SQL Engine - PySpark Usage Guide for Pandas with Apache Arrow - Migration Guide - Reference 2. Add a left menu for sql-programming-guide, keeping the first-level index for each part in the menu. ![image](https://user-images.githubusercontent.com/4833765/47016859-6332e180-d183-11e8-92e8-ce62518a83c4.png) ## How was this patch tested? Local test with jekyll build/serve. Closes #22746 from xuanyuanking/SPARK-24499. Authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2018-10-18 11:59:06 -07:00
Marco Gaido	c2962546d9	[SPARK-25758][ML] Deprecate computeCost on BisectingKMeans ## What changes were proposed in this pull request? The PR proposes to deprecate the `computeCost` method on `BisectingKMeans` in favor of the adoption of `ClusteringEvaluator` in order to evaluate the clustering. ## How was this patch tested? NA Closes #22756 from mgaido91/SPARK-25758. Authored-by: Marco Gaido <marcogaido91@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2018-10-18 10:32:25 -07:00
Marcelo Vanzin	15524c41b2	[SPARK-25682][K8S] Package example jars in same target for dev and distro images. This way the image generated from both environments has the same layout, with just a difference in contents that should not affect functionality. Also added some minor error checking to the image script. Closes #22681 from vanzin/SPARK-25682. Authored-by: Marcelo Vanzin <vanzin@cloudera.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2018-10-18 10:21:37 -07:00
Takuya UESHIN	e80f18dbd8	[SPARK-25763][SQL][PYSPARK][TEST] Use more `@contextmanager` to ensure clean-up each test. ## What changes were proposed in this pull request? Currently each test in `SQLTest` in PySpark is not cleaned up properly. We should introduce and use more `contextmanager`s to make it convenient to clean up the context properly. ## How was this patch tested? Modified tests. Closes #22762 from ueshin/issues/SPARK-25763/cleanup_sqltests. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: hyukjinkwon <gurwls223@apache.org>	2018-10-19 00:31:01 +08:00
Yuming Wang	1117fc35ff	[SPARK-25760][SQL] Set AddJarCommand return empty ## What changes were proposed in this pull request? Only `AddJarCommand` returns `0`, and the user will be confused about what it means. This PR sets it to empty. ```sql spark-sql> add jar /Users/yumwang/spark/sql/hive/src/test/resources/TestUDTF.jar; ADD JAR /Users/yumwang/spark/sql/hive/src/test/resources/TestUDTF.jar 0 spark-sql> ``` ## How was this patch tested? manual tests ```sql spark-sql> add jar /Users/yumwang/spark/sql/hive/src/test/resources/TestUDTF.jar; ADD JAR /Users/yumwang/spark/sql/hive/src/test/resources/TestUDTF.jar spark-sql> ``` Closes #22747 from wangyum/AddJarCommand. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2018-10-18 09:19:42 -07:00
Sean Owen	734c6af0dd	[SPARK-24601][FOLLOWUP] Update Jackson to 2.9.6 in Kinesis ## What changes were proposed in this pull request? Also update Kinesis SDK's Jackson to match Spark's ## How was this patch tested? Existing tests, including Kinesis ones, which ought to be hereby triggered. This was uncovered, I believe, in https://github.com/apache/spark/pull/22729#issuecomment-430666080 Closes #22757 from srowen/SPARK-24601.2. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2018-10-18 07:00:00 -05:00
Russell Spitzer	c3eaee7765	[SPARK-25003][PYSPARK] Use SessionExtensions in Pyspark Master ## What changes were proposed in this pull request? Previously Pyspark used the private constructor for SparkSession when building that object. This resulted in a SparkSession without checking the sql.extensions parameter for additional session extensions. To fix this we instead use the Session.builder() path as SparkR does; this loads the extensions and allows their use in PySpark. ## How was this patch tested? An integration test was added which mimics the Scala test for the same feature. Closes #21990 from RussellSpitzer/SPARK-25003-master. Authored-by: Russell Spitzer <Russell.Spitzer@gmail.com> Signed-off-by: hyukjinkwon <gurwls223@apache.org>	2018-10-18 12:29:09 +08:00
Marcelo Vanzin	7d425b190a	[SPARK-20327][YARN] Follow up: fix resource request tests on Hadoop 3. The test fix is to allocate a `Resource` object only after the resource types have been initialized. Otherwise the YARN classes get in a weird state and throw a different exception than expected, because the resource has a different view of the registered resources. I also removed a test for a null resource since that seems unnecessary and made the fix more complicated. All the other changes are just cleanup; basically simplify the tests by defining what is being tested and deriving the resource type registration and the SparkConf from that data, instead of having redundant definitions in the tests. Ran tests with Hadoop 3 (and also without it). Closes #22751 from vanzin/SPARK-20327.fix. Authored-by: Marcelo Vanzin <vanzin@cloudera.com> Signed-off-by: Imran Rashid <irashid@cloudera.com>	2018-10-17 10:40:47 -05:00
Gengliang Wang	24f5bbd770	[SPARK-25735][CORE][MINOR] Improve start-thriftserver.sh: print clean usage and exit with code 1 ## What changes were proposed in this pull request? Currently if we run ``` sh start-thriftserver.sh -h ``` we get ``` ... Thrift server options: 2018-10-15 21:45:39 INFO HiveThriftServer2:54 - Starting SparkContext 2018-10-15 21:45:40 INFO SparkContext:54 - Running Spark version 2.3.2 2018-10-15 21:45:40 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 2018-10-15 21:45:40 ERROR SparkContext:91 - Error initializing SparkContext. org.apache.spark.SparkException: A master URL must be set in your configuration at org.apache.spark.SparkContext.<init>(SparkContext.scala:367) at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2493) at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:934) at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:925) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:925) at org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:48) at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2$.main(HiveThriftServer2.scala:79) at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.main(HiveThriftServer2.scala) 2018-10-15 21:45:40 ERROR Utils:91 - Uncaught exception in thread main ``` After fix, the usage output is clean: ``` ... Thrift server options: --hiveconf <property=value> Use value for given property ``` Also exit with code 1, to follow other scripts(this is the behavior of parsing option `-h` for other linux commands as well). ## How was this patch tested? Manual test. Closes #22727 from gengliangwang/stsUsage. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2018-10-17 09:56:17 -05:00
Gengliang Wang	1901f06211	[SPARK-25741][WEBUI] Long URLs are not rendered properly in web UI ## What changes were proposed in this pull request? When the URL for description column in the table of job/stage page is long, WebUI doesn't render it properly. ![beforefix](https://user-images.githubusercontent.com/1097932/47009242-9323ba00-d16e-11e8-8262-0848d814442a.jpeg) Both job and stage page are using the class `name-link` for the description URL, so change the style of `a.name-link` to fix it. ## How was this patch tested? Manual test on my local: ![afterfix](https://user-images.githubusercontent.com/1097932/47009269-a46cc680-d16e-11e8-9ff5-0318a20db634.jpeg) Closes #22744 from gengliangwang/fixUILink. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2018-10-17 09:51:41 -05:00
Vladimir Kuriatkov	e5b8136f47	[SPARK-21402][SQL] Fix java array of structs deserialization When deserializing values of ArrayType with struct elements in java beans, fields of structs get mixed up. I suggest using struct data types retrieved from resolved input data instead of inferring them from java beans. ## What changes were proposed in this pull request? MapObjects expression is used to map array elements to java beans. The struct type of the elements is inferred from the java bean structure and ends up with a mixed-up field order. I used UnresolvedMapObjects instead of MapObjects, which allows providing the element type for MapObjects during analysis based on the resolved input data, not on the java bean. ## How was this patch tested? Added a test case. Built complete project on travis. michalsenkyr cloud-fan marmbrus liancheng Closes #22708 from vofque/SPARK-21402. Lead-authored-by: Vladimir Kuriatkov <vofque@gmail.com> Co-authored-by: Vladimir Kuriatkov <Vladimir_Kuriatkov@epam.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2018-10-17 22:13:05 +08:00
Gengliang Wang	2ab4473bed	[SPARK-25754][DOC] Change CDN for MathJax ## What changes were proposed in this pull request? Currently when we open our doc site: https://spark.apache.org/docs/latest/index.html , there is one warning ![image](https://user-images.githubusercontent.com/1097932/47065926-2b757980-d217-11e8-868f-02ce73f513ae.png) This PR is to change the CDN as per the migration tips: https://www.mathjax.org/cdn-shutting-down/ This is very very trivial. But it would be good to follow the suggestion from MathJax team and remove the warning, in case one day the original CDN is no longer available. ## How was this patch tested? Manual check. Closes #22753 from gengliangwang/migrateMathJax. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2018-10-17 06:52:00 -05:00
Wenchen Fan	9690eba16e	[SPARK-25680][SQL] SQL execution listener shouldn't happen on execution thread ## What changes were proposed in this pull request? The SQL execution listener framework was created from scratch(see https://github.com/apache/spark/pull/9078). It didn't leverage what we already have in the spark listener framework, and one major problem is, the listener runs on the spark execution thread, which means a bad listener can block spark's query processing. This PR re-implements the SQL execution listener framework. Now `ExecutionListenerManager` is just a normal spark listener, which watches the `SparkListenerSQLExecutionEnd` events and post events to the user-provided SQL execution listeners. ## How was this patch tested? existing tests. Closes #22674 from cloud-fan/listener. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2018-10-17 16:06:07 +08:00
彭灿00244106	e9332f600e	[SQL][CATALYST][MINOR] update some error comments ## What changes were proposed in this pull request? This PR corrects some comment errors: 1. change "as low a possible" to "as low as possible" in RewriteDistinctAggregates.scala 2. delete the redundant word “with” in HiveTableScanExec’s doExecute() method ## How was this patch tested? Existing unit tests. Closes #22694 from CarolinePeng/update_comment. Authored-by: 彭灿00244106 <00244106@zte.intra> Signed-off-by: hyukjinkwon <gurwls223@apache.org>	2018-10-17 12:45:13 +08:00
Takeshi Yamamuro	a9f685bb70	[SPARK-25734][SQL] Literal should have a value corresponding to dataType ## What changes were proposed in this pull request? `Literal.value` should have a value corresponding to `dataType`. This PR added code to verify it and fixed the existing tests to do so. ## How was this patch tested? Modified the existing tests. Closes #22724 from maropu/SPARK-25734. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2018-10-17 11:02:39 +08:00
Maxim Gekk	e9af9460bc	[SPARK-25393][SQL] Adding new function from_csv() ## What changes were proposed in this pull request? The PR adds new function `from_csv()` similar to `from_json()` to parse columns with CSV strings. I added the following methods: ```Scala def from_csv(e: Column, schema: StructType, options: Map[String, String]): Column ``` and this signature to call it from Python, R and Java: ```Scala def from_csv(e: Column, schema: String, options: java.util.Map[String, String]): Column ``` ## How was this patch tested? Added new test suites `CsvExpressionsSuite`, `CsvFunctionsSuite` and sql tests. Closes #22379 from MaxGekk/from_csv. Lead-authored-by: Maxim Gekk <maxim.gekk@databricks.com> Co-authored-by: Maxim Gekk <max.gekk@gmail.com> Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com> Co-authored-by: hyukjinkwon <gurwls223@apache.org> Signed-off-by: hyukjinkwon <gurwls223@apache.org>	2018-10-17 09:32:05 +08:00
Dilip Biswal	9d4dd7992b	[SPARK-25631][SPARK-25632][SQL][TEST] Improve the test runtime of KafkaRDDSuite ## What changes were proposed in this pull request? Set a reasonable poll timeout that's used while consuming topics/partitions from Kafka. In its absence, a default of 2 minutes is used as the timeout value, and all the negative tests take a minimum of 2 minutes to execute. After this change, we save about 4 minutes in this suite. ## How was this patch tested? Test fix. Closes #22670 from dilipbiswal/SPARK-25631. Authored-by: Dilip Biswal <dbiswal@us.ibm.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2018-10-16 17:49:40 -05:00
Stavros Kontopoulos	bd2c447131	[SPARK-25394][CORE] Add an application status metrics source - Exposes several metrics regarding application status as a source, useful to scrape them via jmx instead of mining the metrics rest api. Example use case: prometheus + jmx exporter. - Metrics are gathered when a job ends at the AppStatusListener side, could be more fine-grained but most metrics like tasks completed are also counted by executors. More metrics could be exposed in the future to avoid scraping executors in some scenarios. - a config option `spark.app.status.metrics.enabled` is added to disable/enable these metrics, by default they are disabled. This was manually tested with jmx source enabled and prometheus server on k8s: ![metrics](https://user-images.githubusercontent.com/7945591/45300945-63064d00-b518-11e8-812a-d9b4155ba0c0.png) In the next pic the job delay is shown for repeated pi calculation (Spark action). ![pi](https://user-images.githubusercontent.com/7945591/45329927-89a1a380-b56b-11e8-9cc1-5e76cb83969f.png) Closes #22381 from skonto/add_app_status_metrics. Authored-by: Stavros Kontopoulos <stavros.kontopoulos@lightbend.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2018-10-16 14:58:26 -07:00
Sean Owen	703e6da1ec	[SPARK-25705][BUILD][STREAMING][TEST-MAVEN] Remove Kafka 0.8 integration ## What changes were proposed in this pull request? Remove Kafka 0.8 integration ## How was this patch tested? Existing tests, build scripts Closes #22703 from srowen/SPARK-25705. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2018-10-16 09:10:24 -05:00
Dongjoon Hyun	2c664edc06	[SPARK-25579][SQL] Use quoted attribute names if needed in pushed ORC predicates ## What changes were proposed in this pull request? This PR aims to fix an ORC performance regression at Spark 2.4.0 RCs from Spark 2.3.2. Currently, for column names with `.`, the pushed predicates are ignored. Test Data ```scala scala> val df = spark.range(Int.MaxValue).sample(0.2).toDF("col.with.dot") scala> df.write.mode("overwrite").orc("/tmp/orc") ``` Spark 2.3.2 ```scala scala> spark.sql("set spark.sql.orc.impl=native") scala> spark.sql("set spark.sql.orc.filterPushdown=true") scala> spark.time(spark.read.orc("/tmp/orc").where("`col.with.dot` < 10").show) +------------+ \|col.with.dot\| +------------+ \| 5\| \| 7\| \| 8\| +------------+ Time taken: 1542 ms scala> spark.time(spark.read.orc("/tmp/orc").where("`col.with.dot` < 10").show) +------------+ \|col.with.dot\| +------------+ \| 5\| \| 7\| \| 8\| +------------+ Time taken: 152 ms ``` Spark 2.4.0 RC3 ```scala scala> spark.time(spark.read.orc("/tmp/orc").where("`col.with.dot` < 10").show) +------------+ \|col.with.dot\| +------------+ \| 5\| \| 7\| \| 8\| +------------+ Time taken: 4074 ms scala> spark.time(spark.read.orc("/tmp/orc").where("`col.with.dot` < 10").show) +------------+ \|col.with.dot\| +------------+ \| 5\| \| 7\| \| 8\| +------------+ Time taken: 1771 ms ``` ## How was this patch tested? Pass the Jenkins with a newly added test case. Closes #22597 from dongjoon-hyun/SPARK-25579. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: hyukjinkwon <gurwls223@apache.org>	2018-10-16 20:30:23 +08:00
Wenchen Fan	e028fd3aed	[SPARK-25736][SQL][TEST] add tests to verify the behavior of multi-column count ## What changes were proposed in this pull request? AFAIK multi-column count is not widely supported by the mainstream databases(postgres doesn't support), and the SQL standard doesn't define it clearly, as near as I can tell. Since Spark supports it, we should clearly document the current behavior and add tests to verify it. ## How was this patch tested? N/A Closes #22728 from cloud-fan/doc. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: hyukjinkwon <gurwls223@apache.org>	2018-10-16 15:13:01 +08:00
Yuming Wang	5c7f6b6636	[SPARK-25629][TEST] Reduce ParquetFilterSuite: filter pushdown test time costs in Jenkins ## What changes were proposed in this pull request? Testing only these 4 cases is enough: `be2238fb50/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetWriteSupport.scala (L269-L279)` ## How was this patch tested? Manual tests on my local machine. before: ``` - filter pushdown - decimal (13 seconds, 683 milliseconds) ``` after: ``` - filter pushdown - decimal (9 seconds, 713 milliseconds) ``` Closes #22636 from wangyum/SPARK-25629. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: hyukjinkwon <gurwls223@apache.org>	2018-10-16 12:30:02 +08:00
Imran Rashid	fdaa99897a	[SPARK-25738][SQL] Fix LOAD DATA INPATH for hdfs port ## What changes were proposed in this pull request? LOAD DATA INPATH didn't work if the defaultFS included a port for hdfs. Handling this just requires a small change to use the correct URI constructor. ## How was this patch tested? Added a unit test, ran all tests via jenkins Closes #22733 from squito/SPARK-25738. Authored-by: Imran Rashid <irashid@cloudera.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2018-10-15 18:34:30 -07:00
gatorsmile	4cee191c04	[SPARK-25674][FOLLOW-UP] Update the stats for each ColumnarBatch ## What changes were proposed in this pull request? This PR is a follow-up of https://github.com/apache/spark/pull/22594 . This alternative can avoid the unneeded computation in the hot code path. - For row-based scan, we keep the original way. - For the columnar scan, we just need to update the stats after each batch. ## How was this patch tested? N/A Closes #22731 from gatorsmile/udpateStatsFileScanRDD. Authored-by: gatorsmile <gatorsmile@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2018-10-16 08:58:29 +08:00
Ilan Filonenko	6c9c84ffb9	[SPARK-23257][K8S] Kerberos Support for Spark on K8S ## What changes were proposed in this pull request? This is the work on setting up Secure HDFS interaction with Spark-on-K8S. The architecture is discussed in this community-wide google [doc](https://docs.google.com/document/d/1RBnXD9jMDjGonOdKJ2bA1lN4AAV_1RwpU_ewFuCNWKg) This initiative can be broken down into 4 Stages STAGE 1 - [x] Detecting the `HADOOP_CONF_DIR` environment variable and using Config Maps to store all Hadoop config files locally, while also setting `HADOOP_CONF_DIR` locally in the driver / executors STAGE 2 - [x] Grabbing `TGT` from `LTC` or using keytabs+principal and creating a `DT` that will be mounted as a secret or using a pre-populated secret STAGE 3 - [x] Driver STAGE 4 - [x] Executor ## How was this patch tested? Locally tested on a single-node, pseudo-distributed Kerberized Hadoop Cluster - [x] E2E Integration tests https://github.com/apache/spark/pull/22608 - [ ] Unit tests ## Docs and Error Handling? - [x] Docs - [x] Error Handling ## Contribution Credit kimoonkim skonto Closes #21669 from ifilonenko/secure-hdfs. Lead-authored-by: Ilan Filonenko <if56@cornell.edu> Co-authored-by: Ilan Filonenko <ifilondz@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2018-10-15 15:48:51 -07:00
SongYadong	0820484ba1	[SPARK-25716][SQL][MINOR] remove unnecessary collection operation in valid constraints generation ## What changes were proposed in this pull request? The Project logical operator generates valid constraints using two opposite operations. It subtracts child constraints from all constraints, then unions the child constraints back in. I think this may not be necessary. The Aggregate operator has the same problem as Project. This PR tries to remove these two opposite collection operations. ## How was this patch tested? Related unit tests: ProjectEstimationSuite CollapseProjectSuite PushProjectThroughUnionSuite UnsafeProjectionBenchmark GeneratedProjectionSuite CodeGeneratorWithInterpretedFallbackSuite TakeOrderedAndProjectSuite GenerateUnsafeProjectionSuite BucketedRandomProjectionLSHSuite RemoveRedundantAliasAndProjectSuite AggregateBenchmark AggregateOptimizeSuite AggregateEstimationSuite DecimalAggregatesSuite DateFrameAggregateSuite ObjectHashAggregateSuite TwoLevelAggregateHashMapSuite ObjectHashAggregateExecBenchmark SingleLevelAggregateHaspMapSuite TypedImperativeAggregateSuite RewriteDistinctAggregatesSuite HashAggregationQuerySuite HashAggregationQueryWithControlledFallbackSuite TypedImperativeAggregateSuite TwoLevelAggregateHashMapWithVectorizedMapSuite Closes #22706 from SongYadong/generate_constraints. Authored-by: SongYadong <song.yadong1@zte.com.cn> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2018-10-15 15:45:40 -07:00
Marco Gaido	56247c1d17	[SPARK-25727][FOLLOWUP] Move outputOrdering to case class field for InMemoryRelation ## What changes were proposed in this pull request? The PR addresses [the comment](https://github.com/apache/spark/pull/22715#discussion_r225024084) in the previous one. `outputOrdering` becomes a field of `InMemoryRelation`. ## How was this patch tested? existing UTs Closes #22726 from mgaido91/SPARK-25727_followup. Authored-by: Marco Gaido <marcogaido91@gmail.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2018-10-15 10:12:45 -07:00
gatorsmile	9426fd0c24	[SPARK-25372][YARN][K8S][FOLLOW-UP] Deprecate and generalize keytab / principal config ## What changes were proposed in this pull request? Update the next version of Spark from 2.5 to 3.0 ## How was this patch tested? N/A Closes #22717 from gatorsmile/followupSPARK-25372. Authored-by: gatorsmile <gatorsmile@gmail.com> Signed-off-by: hyukjinkwon <gurwls223@apache.org>	2018-10-14 15:20:01 +08:00
gatorsmile	6c3f2c6a6a	[SPARK-25727][SQL] Add outputOrdering to otherCopyArgs in InMemoryRelation ## What changes were proposed in this pull request? Add `outputOrdering` to `otherCopyArgs` in InMemoryRelation so that this field will be copied when we do the tree transformation. ``` val data = Seq(100).toDF("count").cache() data.queryExecution.optimizedPlan.toJSON ``` The above code can generate the following error: ``` assertion failed: InMemoryRelation fields: output, cacheBuilder, statsOfPlanToCache, outputOrdering, values: List(count#178), CachedRDDBuilder(true,10000,StorageLevel(disk, memory, deserialized, 1 replicas),(1) Project [value#176 AS count#178] +- LocalTableScan [value#176] ,None), Statistics(sizeInBytes=12.0 B, hints=none) java.lang.AssertionError: assertion failed: InMemoryRelation fields: output, cacheBuilder, statsOfPlanToCache, outputOrdering, values: List(count#178), CachedRDDBuilder(true,10000,StorageLevel(disk, memory, deserialized, 1 replicas),(1) Project [value#176 AS count#178] +- LocalTableScan [value#176] ,None), Statistics(sizeInBytes=12.0 B, hints=none) at scala.Predef$.assert(Predef.scala:170) at org.apache.spark.sql.catalyst.trees.TreeNode.jsonFields(TreeNode.scala:611) at org.apache.spark.sql.catalyst.trees.TreeNode.org$apache$spark$sql$catalyst$trees$TreeNode$$collectJsonValue$1(TreeNode.scala:599) at org.apache.spark.sql.catalyst.trees.TreeNode.jsonValue(TreeNode.scala:604) at org.apache.spark.sql.catalyst.trees.TreeNode.toJSON(TreeNode.scala:590) ``` ## How was this patch tested? Added a test Closes #22715 from gatorsmile/copyArgs1. Authored-by: gatorsmile <gatorsmile@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2018-10-13 22:10:17 -07:00
Dongjoon Hyun	6bbceb9fef	[SPARK-25726][SQL][TEST] Fix flaky test in SaveIntoDataSourceCommandSuite ## What changes were proposed in this pull request? [SPARK-22479](https://github.com/apache/spark/pull/19708/files#diff-5c22ac5160d3c9d81225c5dd86265d27R31) adds a test case which sometimes fails because the used password string `123` matches `41230802`. This PR aims to fix the flakiness. - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97343/consoleFull ```scala SaveIntoDataSourceCommandSuite: - simpleString is redacted * FAILED * "SaveIntoDataSourceCommand .org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider41230802, Map(password -> *******(redacted), url -> *******(redacted), driver -> mydriver), ErrorIfExists +- Range (0, 1, step=1, splits=Some(2)) " contained "123" (SaveIntoDataSourceCommandSuite.scala:42) ``` ## How was this patch tested? Pass the Jenkins with the updated test case Closes #22716 from dongjoon-hyun/SPARK-25726. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2018-10-13 18:01:28 -07:00
Wenchen Fan	b73f76beb3	[SPARK-25714][SQL][FOLLOWUP] improve the comment inside BooleanSimplification rule ## What changes were proposed in this pull request? improve the code comment added in https://github.com/apache/spark/pull/22702/files ## How was this patch tested? N/A Closes #22711 from cloud-fan/minor. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2018-10-13 16:43:10 -07:00
Gengliang Wang	26c1b959cf	[SPARK-25711][CORE] Improve start-history-server.sh: show usage User-Friendly and remove deprecated options ## What changes were proposed in this pull request? Currently, if we try run ``` ./start-history-server.sh -h ``` We will get such error ``` java.io.FileNotFoundException: File -h does not exist ``` 1. This is not User-Friendly. For option `-h` or `--help`, it should be parsed correctly and show the usage of the class/script. 2. We can remove deprecated options for setting event log directory through command line options. After fix, we can get following output: ``` Usage: ./sbin/start-history-server.sh [options] Options: --properties-file FILE Path to a custom Spark properties file. Default is conf/spark-defaults.conf. Configuration options can be set by setting the corresponding JVM system property. History Server options are always available; additional options depend on the provider. History Server options: spark.history.ui.port Port where server will listen for connections (default 18080) spark.history.acls.enable Whether to enable view acls for all applications (default false) spark.history.provider Name of history provider class (defaults to file system-based provider) spark.history.retainedApplications Max number of application UIs to keep loaded in memory (default 50) FsHistoryProvider options: spark.history.fs.logDirectory Directory where app logs are stored (default: file:/tmp/spark-events) spark.history.fs.updateInterval How often to reload log data from storage (in seconds, default: 10) ``` ## How was this patch tested? Manual test Closes #22699 from gengliangwang/refactorSHSUsage. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2018-10-13 13:34:31 -07:00
Gengliang Wang	2eaf058788	[SPARK-25718][SQL] Detect recursive reference in Avro schema and throw exception ## What changes were proposed in this pull request? Avro schema allows recursive reference, e.g. the schema for linked-list in https://avro.apache.org/docs/1.8.2/spec.html#schema_record ``` { "type": "record", "name": "LongList", "aliases": ["LinkedLongs"], // old name for this "fields" : [ {"name": "value", "type": "long"}, // each element has a long {"name": "next", "type": ["null", "LongList"]} // optional next element ] } ``` In current Spark SQL, it is impossible to convert the schema as `StructType` . Run `SchemaConverters.toSqlType(avroSchema)` and we will get stack overflow exception. We should detect the recursive reference and throw exception for it. ## How was this patch tested? New unit test case. Closes #22709 from gengliangwang/avroRecursiveRef. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2018-10-13 14:49:38 +08:00
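The recursive-reference check described in SPARK-25718 above can be sketched without the Avro library. This is only an illustration under stated assumptions: `Schema`, `Record`, `NamedRef`, `Union`, and `hasRecursiveReference` are hypothetical names, not the Avro API and not the code in this patch. The idea is to carry the set of record names on the current path and report a cycle when a name is seen again.

```scala
object RecursiveSchemaCheck {
  // Hypothetical, simplified stand-in for an Avro schema (not the Avro API).
  sealed trait Schema
  case object LongType extends Schema
  case object NullType extends Schema
  final case class Union(branches: List[Schema]) extends Schema
  // A reference to a record by name, e.g. "LongList" inside its own "next" field.
  final case class NamedRef(name: String) extends Schema
  final case class Record(name: String, fields: List[(String, Schema)]) extends Schema

  // `seen` holds the names of the records enclosing the current node.
  // Meeting one of those names again means the schema refers to itself.
  def hasRecursiveReference(schema: Schema, seen: Set[String] = Set.empty): Boolean =
    schema match {
      case Record(name, fields) =>
        seen.contains(name) ||
          fields.exists { case (_, fieldSchema) => hasRecursiveReference(fieldSchema, seen + name) }
      case NamedRef(name) => seen.contains(name)
      case Union(branches) => branches.exists(hasRecursiveReference(_, seen))
      case LongType | NullType => false
    }

  // The linked-list schema from the commit message, expressed in the toy model.
  val longList: Schema = Record("LongList", List(
    "value" -> LongType,
    "next" -> Union(List(NullType, NamedRef("LongList")))
  ))

  // A record with no self-reference, for comparison.
  val flat: Schema = Record("Point", List("x" -> LongType, "y" -> LongType))
}
```

A converter built this way could raise an exception where the sketch returns `true`, instead of recursing until the stack overflows.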
gatorsmile	8812746d4f	[MINOR] Fix code comment in BooleanSimplification.	2018-10-12 23:01:06 -07:00
Wenchen Fan	34f229bc21	[SPARK-25710][SQL] range should report metrics correctly ## What changes were proposed in this pull request? Currently `Range` reports metrics in batch granularity. This is acceptable, but it's better if we can make it row granularity without performance penalty. Before this PR, the metrics are updated when preparing the batch, which is before we actually consume data. In this PR, the metrics are updated after the data are consumed. There are 2 different cases: 1. The data processing loop has a stop check. The metrics are updated when we need to stop. 2. no stop check. The metrics are updated after the loop. ## How was this patch tested? existing tests and a new benchmark Closes #22698 from cloud-fan/range. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2018-10-13 13:55:28 +08:00
gatorsmile	c9ba59d38e	[SPARK-25714] Fix Null Handling in the Optimizer rule BooleanSimplification ## What changes were proposed in this pull request? ```Scala val df1 = Seq(("abc", 1), (null, 3)).toDF("col1", "col2") df1.write.mode(SaveMode.Overwrite).parquet("/tmp/test1") val df2 = spark.read.parquet("/tmp/test1") df2.filter("col1 = 'abc' OR (col1 != 'abc' AND col2 == 3)").show() ``` Before the PR, it returns both rows. After the fix, it returns only `Row("abc", 1)`. This fixes a bug in NULL handling in BooleanSimplification. The bug was introduced in the Spark 1.6 release. ## How was this patch tested? Added test cases Closes #22702 from gatorsmile/fixBooleanSimplify2. Authored-by: gatorsmile <gatorsmile@gmail.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2018-10-12 21:02:38 -07:00
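To see why the filter in SPARK-25714 above is sensitive to NULL, SQL's three-valued logic can be modelled with `Option[Boolean]`, where `None` stands for NULL. The sketch below is not Spark code and all names in it are hypothetical. It assumes the problematic simplification has the shape `a OR (NOT a AND b)` to `a OR b`, which matches the predicate in the commit message but is not confirmed by it.

```scala
object ThreeValuedLogic {
  // None models SQL NULL (unknown).
  type Tri = Option[Boolean]

  def and(a: Tri, b: Tri): Tri = (a, b) match {
    case (Some(false), _) | (_, Some(false)) => Some(false) // false dominates AND
    case (Some(true), Some(true))            => Some(true)
    case _                                   => None
  }

  def or(a: Tri, b: Tri): Tri = (a, b) match {
    case (Some(true), _) | (_, Some(true)) => Some(true) // true dominates OR
    case (Some(false), Some(false))        => Some(false)
    case _                                 => None
  }

  def not(a: Tri): Tri = a.map(!_)

  // a OR (NOT a AND b): the shape of the filter in the commit message.
  def original(a: Tri, b: Tri): Tri = or(a, and(not(a), b))

  // a OR b: equivalent to `original` only when `a` cannot be NULL.
  def simplified(a: Tri, b: Tri): Tri = or(a, b)
}
```

For the row `(null, 3)`, `col1 = 'abc'` evaluates to NULL and `col2 == 3` to true. `original(None, Some(true))` is `None`, so a filter drops the row, while `simplified(None, Some(true))` is `Some(true)`, so the row is kept. That difference corresponds to the extra row returned before the fix.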
Szilard Nemeth	3946de7734	[SPARK-20327][CORE][YARN] Add CLI support for YARN custom resources, like GPUs ## What changes were proposed in this pull request? This PR adds CLI support for YARN custom resources, e.g. GPUs and any other resources YARN defines. The custom resources are defined with Spark properties, no additional CLI arguments were introduced. The properties can be defined in the following form: AM resources, client mode: Format: `spark.yarn.am.resource.<resource-name>` The property name follows the naming convention of YARN AM cores / memory properties: `spark.yarn.am.memory and spark.yarn.am.cores ` Driver resources, cluster mode: Format: `spark.yarn.driver.resource.<resource-name>` The property name follows the naming convention of driver cores / memory properties: `spark.driver.memory and spark.driver.cores.` Executor resources: Format: `spark.yarn.executor.resource.<resource-name>` The property name follows the naming convention of executor cores / memory properties: `spark.executor.memory / spark.executor.cores`. For the driver resources (cluster mode) and executor resources properties, we use the `yarn` prefix here as custom resource types are specific to YARN, currently. Validation: Please note that a validation logic is added to avoid having requested resources defined in 2 ways, for example defining the following configs: ``` "--conf", "spark.driver.memory=2G", "--conf", "spark.yarn.driver.resource.memory=1G" ``` will not start execution and will print an error message. ## How was this patch tested? Unit tests + manual execution with Hadoop2 and Hadoop 3 builds. Testing have been performed on a real cluster with Spark and YARN configured: Cluster and client mode Request Resource Types with lowercase and uppercase units Start Spark job with only requesting standard resources (mem / cpu) Error handling cases: - Request unknown resource type - Request Resource type (either memory / cpu) with duplicate configs at the same time (e.g. 
with this config: ``` --conf spark.yarn.am.resource.memory=1G \ --conf spark.yarn.driver.resource.memory=2G \ --conf spark.yarn.executor.resource.memory=3G \ ``` ), ResourceTypeValidator handles these cases well, so it is not permitted - Request standard resource (memory / cpu) with the new style configs, e.g. --conf spark.yarn.am.resource.memory=1G, this is not permitted and handled well. An example about how I ran the testcases: ``` cd ~;export HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop/; ./spark-2.4.0-SNAPSHOT-bin-custom-spark/bin/spark-submit \ --class org.apache.spark.examples.SparkPi \ --master yarn \ --deploy-mode cluster \ --driver-memory 1G \ --driver-cores 1 \ --executor-memory 1G \ --executor-cores 1 \ --conf spark.logConf=true \ --conf spark.yarn.executor.resource.gpu=3G \ --verbose \ ./spark-2.4.0-SNAPSHOT-bin-custom-spark/examples/jars/spark-examples_2.11-2.4.0-SNAPSHOT.jar \ 10; ``` Closes #20761 from szyszy/SPARK-20327. Authored-by: Szilard Nemeth <snemeth@cloudera.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2018-10-12 18:14:13 -07:00
Sean Owen	1ddfab8c4f	[SPARK-19287][CORE][STREAMING] JavaPairRDD flatMapValues requires function returning Iterable, not Iterator ## What changes were proposed in this pull request? Fix old oversight in API: Java `flatMapValues` needs a `FlatMapFunction` ## How was this patch tested? Existing tests. Closes #22690 from srowen/SPARK-19287. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2018-10-12 18:10:59 -05:00
Yuming Wang	e965fb55ac	[SPARK-25664][SQL][TEST] Refactor JoinBenchmark to use main method ## What changes were proposed in this pull request? Refactor `JoinBenchmark` to use main method. 1. use `spark-submit`: ```console bin/spark-submit --class org.apache.spark.sql.execution.benchmark.JoinBenchmark --jars ./core/target/spark-core_2.11-3.0.0-SNAPSHOT-tests.jar ./sql/catalyst/target/spark-sql_2.11-3.0.0-SNAPSHOT-tests.jar ``` 2. Generate benchmark result: ```console SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.JoinBenchmark" ``` ## How was this patch tested? manual tests Closes #22661 from wangyum/SPARK-25664. Lead-authored-by: Yuming Wang <yumwang@ebay.com> Co-authored-by: Yuming Wang <wgyumg@gmail.com> Co-authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2018-10-12 16:08:12 -07:00
Mathieu St-Louis	4e141a4160	[STREAMING][DOC] Fix typo & formatting for JavaDoc ## What changes were proposed in this pull request? - Fixed typo for function outputMode - OutputMode.Complete(), changed `these is some updates` to `there are some updates` - Replaced hyphenized list by HTML unordered list tags in comments to fix the Javadoc documentation. Current render from most recent [Spark API Docs](https://spark.apache.org/docs/2.3.1/api/java/org/apache/spark/sql/streaming/DataStreamWriter.html): #### outputMode(OutputMode) - List formatted as a prose. ![image](https://user-images.githubusercontent.com/2295469/46250648-11086700-c3f4-11e8-8a5a-d88b079c165d.png) #### outputMode(String) - List formatted as a prose. ![image](https://user-images.githubusercontent.com/2295469/46250651-24b3cd80-c3f4-11e8-9dac-ae37599afbce.png) #### partitionBy(String*) - List formatted as a prose. ![image](https://user-images.githubusercontent.com/2295469/46250655-36957080-c3f4-11e8-990b-47bd612d3c51.png) ## How was this patch tested? This PR contains a document patch ergo no functional testing is required. Closes #22593 from niofire/fix-typo-datastreamwriter. Authored-by: Mathieu St-Louis <mastloui@microsoft.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2018-10-12 14:09:10 -05:00
Maxim Gekk	c7eadb5e66	[SPARK-25660][SQL] Fix for the backward slash as CSV fields delimiter ## What changes were proposed in this pull request? The PR addresses the exception raised when accessing characters beyond the end of the delimiter string. In particular, using the backward slash `\` as the CSV field delimiter causes the following exception on reading `abc\1`: ```Scala String index out of range: 1 java.lang.StringIndexOutOfBoundsException: String index out of range: 1 at java.lang.String.charAt(String.java:658) ``` because `str.charAt(1)` tries to access a char beyond the end of `str` in `CSVUtils.toChar` ## How was this patch tested? Added tests for the empty string and for a string containing the backward slash to `CSVUtilsSuite`. Besides that, I added an end-to-end test to check how the backward slash is handled when reading a CSV string that contains it. Closes #22654 from MaxGekk/csv-slash-delim. Authored-by: Maxim Gekk <maxim.gekk@databricks.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2018-10-12 12:04:00 -07:00
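The out-of-range access described in SPARK-25660 above comes from reading a second character of a one-character string. One way to avoid that class of bug is to pattern match on the whole character sequence, so the length is checked before any character is read. The sketch below is not Spark's `CSVUtils.toChar`: `delimiterToChar` is a hypothetical helper, and rejecting the empty string and a lone backslash by returning `None` is an assumption made for illustration.

```scala
object DelimiterParsing {
  // Hypothetical helper (not Spark's CSVUtils.toChar).
  // Each case fixes the length of the input, so no index is read out of range.
  def delimiterToChar(str: String): Option[Char] = str.toList match {
    case c :: Nil if c != '\\' => Some(c)    // ordinary single character
    case '\\' :: 't' :: Nil    => Some('\t') // escaped tab
    case '\\' :: '\\' :: Nil   => Some('\\') // escaped backslash
    case _                     => None       // empty, lone backslash, or unsupported (assumption)
  }
}
```

A caller can then decide how to report a `None`, for example by raising an error that names the offending delimiter.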
Shahid	8e039a7554	[SPARK-25697][CORE] When zstd compression enabled, InProgress application is throwing Error in the history webui ## What changes were proposed in this pull request? When we enable event log compression with 'zstd' as the compression codec, we are unable to open the web UI of a running application from the history server page. The reason is that the replay listener cannot read the zstd-compressed event log because the zstd frame is not finished yet. This causes a truncated error while reading the event log. So, when we try to open the web UI from the history server page, it throws "truncated error ", and we are never able to open a running application in the web UI while zstd compression is enabled. In this PR, when the IO exception happens and the application is still running, we log the error "Failed to read Spark event log: evetLogDirAppName.inprogress" instead of throwing an exception. ## How was this patch tested? Test steps: 1) spark.eventLog.compress = true 2) spark.io.compression.codec = zstd 3) restart history server 4) launch bin/spark-shell 5) run some queries 6) open history server page 7) click on the application Before fix: ![screenshot from 2018-10-10 23-52-12](https://user-images.githubusercontent.com/23054875/46757387-9b4fa580-cce7-11e8-96ad-8938400483ed.png) ![screenshot from 2018-10-10 23-52-28](https://user-images.githubusercontent.com/23054875/46757393-a0145980-cce7-11e8-8cb0-44b583dde648.png) After fix: ![screenshot from 2018-10-10 23-43-49](https://user-images.githubusercontent.com/23054875/46756971-6858e200-cce6-11e8-946c-0bffebb2cfba.png) ![screenshot from 2018-10-10 23-44-05](https://user-images.githubusercontent.com/23054875/46756981-6d1d9600-cce6-11e8-95ea-ff8339a2fdfd.png) Closes #22689 from shahidki31/SPARK-25697. Authored-by: Shahid <shahidki31@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2018-10-12 12:57:09 -05:00
Gengliang Wang	52f9f66d54	[SPARK-25712][CORE][MINOR] Improve usage message of start-master.sh and start-slave.sh ## What changes were proposed in this pull request? Currently if we run ``` ./sbin/start-master.sh -h ``` We get ``` Usage: ./sbin/start-master.sh [options] 18/10/11 23:38:30 INFO Master: Started daemon with process name: 33907C02TL2JZGTF1 18/10/11 23:38:30 INFO SignalUtils: Registered signal handler for TERM 18/10/11 23:38:30 INFO SignalUtils: Registered signal handler for HUP 18/10/11 23:38:30 INFO SignalUtils: Registered signal handler for INT Options: -i HOST, --ip HOST Hostname to listen on (deprecated, please use --host or -h) -h HOST, --host HOST Hostname to listen on -p PORT, --port PORT Port to listen on (default: 7077) --webui-port PORT Port for web UI (default: 8080) --properties-file FILE Path to a custom Spark properties file. Default is conf/spark-defaults.conf. ``` We can filter out some useless output. ## How was this patch tested? Manual test Closes #22700 from gengliangwang/improveStartScript. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2018-10-12 12:42:34 -05:00
lajin	541d7e1e4b	[SPARK-25685][BUILD] Allow running tests in Jenkins in enterprise Git repository ## What changes were proposed in this pull request? Many companies have their own enterprise GitHub to manage Spark code. Building and testing in those repositories with Jenkins requires modifying this script. So I suggest adding some environment variables to allow regression testing in an enterprise Jenkins instead of the default Spark repository on GitHub. ## How was this patch tested? Manual test. Closes #22678 from LantaoJin/SPARK-25685. Lead-authored-by: lajin <lajin@ebay.com> Co-authored-by: LantaoJin <jinlantao@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2018-10-12 12:41:33 -05:00

23538 commits