ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
xueyu	f71e8da5ef	[SPARK-24566][CORE] Fix spark.storage.blockManagerSlaveTimeoutMs default config This PR use spark.network.timeout in place of spark.storage.blockManagerSlaveTimeoutMs when it is not configured, as configuration doc said manual test Author: xueyu <278006819@qq.com> Closes #21575 from xueyumusic/slaveTimeOutConfig.	2018-06-29 10:44:49 -07:00
Jose Torres	f6e6899a8b	[SPARK-24386][SS] coalesce(1) aggregates in continuous processing ## What changes were proposed in this pull request? Provide a continuous processing implementation of coalesce(1), as well as allowing aggregates on top of it. The changes in ContinuousQueuedDataReader and such are to use split.index (the ID of the partition within the RDD currently being compute()d) rather than context.partitionId() (the partition ID of the scheduled task within the Spark job - that is, the post coalesce writer). In the absence of a narrow dependency, these values were previously always the same, so there was no need to distinguish. ## How was this patch tested? new unit test Author: Jose Torres <torres.joseph.f+github@gmail.com> Closes #21560 from jose-torres/coalesce.	2018-06-28 16:25:40 -07:00
Huaxin Gao	2224861f2f	[SPARK-24439][ML][PYTHON] Add distanceMeasure to BisectingKMeans in PySpark ## What changes were proposed in this pull request? add distanceMeasure to BisectingKMeans in Python. ## How was this patch tested? added doctest and also manually tested it. Author: Huaxin Gao <huaxing@us.ibm.com> Closes #21557 from huaxingao/spark-24439.	2018-06-28 14:07:28 -07:00
Jacek Laskowski	e1d3f80103	[SPARK-24408][SQL][DOC] Move abs function to math_funcs group ## What changes were proposed in this pull request? A few math functions (`abs` , `bitwiseNOT`, `isnan`, `nanvl`) are not in math_funcs group. They should really be. ## How was this patch tested? Awaiting Jenkins Author: Jacek Laskowski <jacek@japila.pl> Closes #21448 from jaceklaskowski/SPARK-24408-math-funcs-doc.	2018-06-28 13:22:52 -07:00
Holden Karau	a95a4af764	[SPARK-23120][PYSPARK][ML] Add basic PMML export support to PySpark ## What changes were proposed in this pull request? Adds basic PMML export support for Spark ML stages to PySpark as was previously done in Scala. Includes LinearRegressionModel as the first stage to implement. ## How was this patch tested? Doctest, the main testing work for this is on the Scala side. (TODO holden add the unittest once I finish locally). Author: Holden Karau <holden@pigscanfly.ca> Closes #21172 from holdenk/SPARK-23120-add-pmml-export-support-to-pyspark.	2018-06-28 13:20:08 -07:00
bravo-zhang	524827f062	[SPARK-14712][ML] LogisticRegressionModel.toString should summarize model ## What changes were proposed in this pull request? [SPARK-14712](https://issues.apache.org/jira/browse/SPARK-14712) spark.mllib LogisticRegressionModel overrides toString to print a little model info. We should do the same in spark.ml and override repr in pyspark. ## How was this patch tested? LogisticRegressionSuite.scala Python doctest in pyspark.ml.classification.py Author: bravo-zhang <mzhang1230@gmail.com> Closes #18826 from bravo-zhang/spark-14712.	2018-06-28 12:40:39 -07:00
Xingbo Jiang	5b05966488	[SPARK-24564][TEST] Add test suite for RecordBinaryComparator ## What changes were proposed in this pull request? Add a new test suite to test RecordBinaryComparator. ## How was this patch tested? New test suite. Author: Xingbo Jiang <xingbo.jiang@databricks.com> Closes #21570 from jiangxb1987/rbc-test.	2018-06-28 14:19:50 +08:00
Fokko Driesprong	6a97e8eb31	[SPARK-24603][SQL] Fix findTightestCommonType reference in comments findTightestCommonTypeOfTwo has been renamed to findTightestCommonType ## What changes were proposed in this pull request? (Please fill in changes proposed in this fix) ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Fokko Driesprong <fokkodriesprong@godatadriven.com> Closes #21597 from Fokko/fd-typo.	2018-06-28 09:59:00 +08:00
Takeshi Yamamuro	1c9acc2438	[SPARK-24206][SQL][FOLLOW-UP] Update DataSourceReadBenchmark benchmark results ## What changes were proposed in this pull request? This pr corrected the default configuration (`spark.master=local[1]`) for benchmarks. Also, this updated performance results on the AWS `r3.xlarge`. ## How was this patch tested? N/A Author: Takeshi Yamamuro <yamamuro@apache.org> Closes #21625 from maropu/FixDataSourceReadBenchmark.	2018-06-28 09:21:10 +08:00
Takeshi Yamamuro	bd32b509a1	[SPARK-24645][SQL] Skip parsing when csvColumnPruning enabled and partitions scanned only ## What changes were proposed in this pull request? In the master, when `csvColumnPruning`(implemented in [this commit](`64fad0b519 (diff-d19881aceddcaa5c60620fdcda99b4c4)`)) enabled and partitions scanned only, it throws an exception below; ``` scala> val dir = "/tmp/spark-csv/csv" scala> spark.range(10).selectExpr("id % 2 AS p", "id").write.mode("overwrite").partitionBy("p").csv(dir) scala> spark.read.csv(dir).selectExpr("sum(p)").collect() 18/06/25 13:12:51 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 5) java.lang.NullPointerException at org.apache.spark.sql.execution.datasources.csv.UnivocityParser.org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$convert(UnivocityParser.scala:197) at org.apache.spark.sql.execution.datasources.csv.UnivocityParser.parse(UnivocityParser.scala:190) at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$5.apply(UnivocityParser.scala:309) at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$5.apply(UnivocityParser.scala:309) at org.apache.spark.sql.execution.datasources.FailureSafeParser.parse(FailureSafeParser.scala:61) ... ``` This pr modified code to skip CSV parsing in the case. ## How was this patch tested? Added tests in `CSVSuite`. Author: Takeshi Yamamuro <yamamuro@apache.org> Closes #21631 from maropu/SPARK-24645.	2018-06-28 09:19:25 +08:00
Kallman, Steven	c5aa54d54b	[SPARK-24553][WEB-UI] http 302 fixes for href redirect ## What changes were proposed in this pull request? Updated URL/href links to include a '/' before '?id' to make links consistent and avoid http 302 redirect errors within UI port 4040 tabs. ## How was this patch tested? Built a runnable distribution and executed jobs. Validated that http 302 redirects are no longer encountered when clicking on links within UI port 4040 tabs. Author: Steven Kallman <SJKallmangmail.com> Author: Kallman, Steven <Steven.Kallman@CapitalOne.com> Closes #21600 from SJKallman/{Spark-24553}{WEB-UI}-redirect-href-fixes.	2018-06-27 15:36:59 -07:00
Takeshi Yamamuro	893ea224cc	[SPARK-24204][SQL] Verify a schema in Json/Orc/ParquetFileFormat ## What changes were proposed in this pull request? This pr added code to verify a schema in Json/Orc/ParquetFileFormat along with CSVFileFormat. ## How was this patch tested? Added verification tests in `FileBasedDataSourceSuite` and `HiveOrcSourceSuite`. Author: Takeshi Yamamuro <yamamuro@apache.org> Closes #21389 from maropu/SPARK-24204.	2018-06-27 15:25:51 -07:00
Sanket Chintapalli	221d03acca	[SPARK-24533] Typesafe rebranded to lightbend. Changing the build downloads path Typesafe has rebranded to lightbend. Just changing the downloads path to avoid redirection Tested by running build/mvn -DskipTests package Author: Sanket Chintapalli <schintap@yahoo-inc.com> Closes #21636 from redsanket/SPARK-24533.	2018-06-27 14:37:24 -07:00
Marco Gaido	776befbfd5	[SPARK-24660][SHS] Show correct error pages when downloading logs ## What changes were proposed in this pull request? SHS is showing bad errors when trying to download logs is not successful. This may happen because the requested application doesn't exist or the user doesn't have permissions for it, for instance. The PR fixes the response when errors occur, so that they are displayed properly. ## How was this patch tested? manual tests Before the patch: 1. Unauthorized user ![screen shot 2018-06-26 at 3 53 33 pm](https://user-images.githubusercontent.com/8821783/41918118-f8b37e70-795b-11e8-91e8-d0250239f09d.png) 2. Non-existing application ![screen shot 2018-06-26 at 3 25 19 pm](https://user-images.githubusercontent.com/8821783/41918082-e3034c72-795b-11e8-970e-cee4a1eae77f.png) After the patch 1. Unauthorized user ![screen shot 2018-06-26 at 3 41 29 pm](https://user-images.githubusercontent.com/8821783/41918155-0d950476-795c-11e8-8d26-7b7ce73e6fe1.png) 2. Non-existing application ![screen shot 2018-06-26 at 3 40 37 pm](https://user-images.githubusercontent.com/8821783/41918175-1a14bb88-795c-11e8-91ab-eadf29190a02.png) Author: Marco Gaido <marcogaido91@gmail.com> Closes #21644 from mgaido91/SPARK-24660.	2018-06-27 14:26:08 -07:00
debugger87	c04cb2d1b7	[SPARK-21687][SQL] Spark SQL should set createTime for Hive partition ## What changes were proposed in this pull request? Set createTime for every hive partition created in Spark SQL, which could be used to manage data lifecycle in Hive warehouse. We found that almost every partition modified by spark sql has not been set createTime. ``` mysql> select * from partitions where create_time=0 limit 1\G; ************************* 1. row ************************* PART_ID: 1028584 CREATE_TIME: 0 LAST_ACCESS_TIME: 1502203611 PART_NAME: date=20170130 SD_ID: 1543605 TBL_ID: 211605 LINK_TARGET_ID: NULL 1 row in set (0.27 sec) ``` ## How was this patch tested? N/A Author: debugger87 <yangchaozhong.2009@gmail.com> Author: Chaozhong Yang <yangchaozhong.2009@gmail.com> Closes #18900 from debugger87/fix/set-create-time-for-hive-partition.	2018-06-27 11:34:28 -07:00
Marcelo Vanzin	78ecb6d457	[SPARK-24446][YARN] Properly quote library path for YARN. Because the way YARN executes commands via bash -c, everything needs to be quoted so that the whole command is fully contained inside a bash string and is interpreted correctly when the string is read by bash. This is a bit different than the quoting done when executing things as if typing in a bash shell. Tweaked unit tests to exercise the bad behavior, which would cause existing tests to time out without the fix. Also tested on a real cluster, verifying the shell script created by YARN to run the container. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #21476 from vanzin/SPARK-24446.	2018-06-27 10:57:29 -07:00
Yuanjian Li	6a0b77a55d	[SPARK-24215][PYSPARK][FOLLOW UP] Implement eager evaluation for DataFrame APIs in PySpark ## What changes were proposed in this pull request? Address comments in #21370 and add more test. ## How was this patch tested? Enhance test in pyspark/sql/test.py and DataFrameSuite Author: Yuanjian Li <xyliyuanjian@gmail.com> Closes #21553 from xuanyuanking/SPARK-24215-follow.	2018-06-27 10:43:06 -07:00
Yuexin Zhang	a1a64e3583	[SPARK-21335][DOC] doc changes for disallowed un-aliased subquery use case ## What changes were proposed in this pull request? Document a change for un-aliased subquery use case, to address the last question in PR #18559: https://github.com/apache/spark/pull/18559#issuecomment-316884858 (Please fill in changes proposed in this fix) ## How was this patch tested? it does not affect tests. Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Yuexin Zhang <zach.yx.zhang@gmail.com> Closes #21647 from cnZach/doc_change_for_SPARK-20690_SPARK-21335.	2018-06-27 16:05:36 +08:00
Takuya UESHIN	9a76f23c6a	[SPARK-23927][SQL][FOLLOW-UP] Fix a build failure. ## What changes were proposed in this pull request? This pr is a follow-up pr of #21155. The #21155 removed unnecessary import at that time, but the import became necessary in another pr. ## How was this patch tested? Existing tests. Author: Takuya UESHIN <ueshin@databricks.com> Closes #21646 from ueshin/issues/SPARK-23927/fup1.	2018-06-27 11:52:48 +08:00
Vayda, Oleksandr: IT (PRG)	2669b4de3b	[SPARK-23927][SQL] Add "sequence" expression ## What changes were proposed in this pull request? The PR adds the SQL function ```sequence```. https://issues.apache.org/jira/browse/SPARK-23927 The behavior of the function is based on Presto's one. Ref: https://prestodb.io/docs/current/functions/array.html - ```sequence(start, stop) → array<bigint>``` Generate a sequence of integers from ```start``` to ```stop```, incrementing by ```1``` if ```start``` is less than or equal to ```stop```, otherwise ```-1```. - ```sequence(start, stop, step) → array<bigint>``` Generate a sequence of integers from ```start``` to ```stop```, incrementing by ```step```. - ```sequence(start_date, stop_date) → array<date>``` Generate a sequence of dates from ```start_date``` to ```stop_date```, incrementing by ```interval 1 day``` if ```start_date``` is less than or equal to ```stop_date```, otherwise ```- interval 1 day```. - ```sequence(start_date, stop_date, step_interval) → array<date>``` Generate a sequence of dates from ```start_date``` to ```stop_date```, incrementing by ```step_interval```. The type of ```step_interval``` is ```CalendarInterval```. - ```sequence(start_timestemp, stop_timestemp) → array<timestamp>``` Generate a sequence of timestamps from ```start_timestamps``` to ```stop_timestamps```, incrementing by ```interval 1 day``` if ```start_date``` is less than or equal to ```stop_date```, otherwise ```- interval 1 day```. - ```sequence(start_timestamp, stop_timestamp, step_interval) → array<timestamp>``` Generate a sequence of timestamps from ```start_timestamps``` to ```stop_timestamps```, incrementing by ```step_interval```. The type of ```step_interval``` is ```CalendarInterval```. ## How was this patch tested? Added unit tests. Author: Vayda, Oleksandr: IT (PRG) <Oleksandr.Vayda@barclayscapital.com> Closes #21155 from wajda/feature/array-api-sequence.	2018-06-27 11:52:31 +09:00
Maxim Gekk	d08f53dc61	[SPARK-24605][SQL] size(null) returns null instead of -1 ## What changes were proposed in this pull request? In PR, I propose new behavior of `size(null)` under the config flag `spark.sql.legacy.sizeOfNull`. If the former one is disabled, the `size()` function returns `null` for `null` input. By default the `spark.sql.legacy.sizeOfNull` is enabled to keep backward compatibility with previous versions. In that case, `size(null)` returns `-1`. ## How was this patch tested? Modified existing tests for the `size()` function to check new behavior (`null`) and old one (`-1`). Author: Maxim Gekk <maxim.gekk@databricks.com> Closes #21598 from MaxGekk/legacy-size-of-null.	2018-06-27 10:36:51 +08:00
Kris Mok	1b9368f7d4	[SPARK-24659][SQL] GenericArrayData.equals should respect element type differences ## What changes were proposed in this pull request? Fix `GenericArrayData.equals`, so that it respects the actual types of the elements. e.g. an instance that represents an `array<int>` and another instance that represents an `array<long>` should be considered incompatible, and thus should return false for `equals`. `GenericArrayData` doesn't keep any schema information by itself, and rather relies on the Java objects referenced by its `array` field's elements to keep track of their own object types. So, the most straightforward way to respect their types is to call `equals` on the elements, instead of using Scala's `==` operator, which can have semantics that are not always desirable: ``` new java.lang.Integer(123) == new java.lang.Long(123L) // true in Scala new java.lang.Integer(123).equals(new java.lang.Long(123L)) // false in Scala ``` ## How was this patch tested? Added unit test in `ComplexDataSuite` Author: Kris Mok <kris.mok@databricks.com> Closes #21643 from rednaxelafx/fix-genericarraydata-equals.	2018-06-27 10:27:40 +08:00
Imran Rashid	16f2c3ea46	[SPARK-6237][NETWORK] Network-layer changes to allow stream upload. These changes allow an RPCHandler to receive an upload as a stream of data, without having to buffer the entire message in the FrameDecoder. The primary use case is for replicating large blocks. By itself, this change is adding dead-code that is not being used -- it is a step towards SPARK-24296. Added unit tests for handling streaming data, including successfully sending data, and failures in reading the stream with concurrent requests. Summary of changes: * Introduce a new UploadStream RPC which is sent to push a large payload as a stream (in contrast, the pre-existing StreamRequest and StreamResponse RPCs are used for pull-based streaming). * Generalize RpcHandler.receive() to support requests which contain streams. * Generalize StreamInterceptor to handle both request and response messages (previously it only handled responses). * Introduce StdChannelListener to abstract away common logging logic in ChannelFuture listeners. Author: Imran Rashid <irashid@cloudera.com> Closes #21346 from squito/upload_stream.	2018-06-26 15:56:58 -07:00
Dilip Biswal	02f8781fa2	[SPARK-24423][SQL] Add a new option for JDBC sources ## What changes were proposed in this pull request? Here is the description in the JIRA - Currently, our JDBC connector provides the option `dbtable` for users to specify the to-be-loaded JDBC source table. ```SQL val jdbcDf = spark.read .format("jdbc") .option("dbtable", "dbName.tableName") .options(jdbcCredentials: Map) .load() ``` Normally, users do not fetch the whole JDBC table due to the poor performance/throughput of JDBC. Thus, they normally just fetch a small set of tables. For advanced users, they can pass a subquery as the option. ```SQL val query = """ (select * from tableName limit 10) as tmp """ val jdbcDf = spark.read .format("jdbc") .option("dbtable", query) .options(jdbcCredentials: Map) .load() ``` However, this is straightforward to end users. We should simply allow users to specify the query by a new option `query`. We will handle the complexity for them. ```SQL val query = """select * from tableName limit 10""" val jdbcDf = spark.read .format("jdbc") .option("query", query) .options(jdbcCredentials: Map) .load() ``` ## How was this patch tested? Added tests in JDBCSuite and JDBCWriterSuite. Also tested against MySQL, Postgress, Oracle, DB2 (using docker infrastructure) to make sure there are no syntax issues. Author: Dilip Biswal <dbiswal@us.ibm.com> Closes #21590 from dilipbiswal/SPARK-24423.	2018-06-26 15:17:00 -07:00
Yuming Wang	dcaa49ff1e	[SPARK-24658][SQL] Remove workaround for ANTLR bug ## What changes were proposed in this pull request? Issue antlr/antlr4#781 has already been fixed, so the workaround of extracting the pattern into a separate rule is no longer needed. The presto already removed it: https://github.com/prestodb/presto/pull/10744. ## How was this patch tested? Existing tests Author: Yuming Wang <yumwang@ebay.com> Closes #21641 from wangyum/ANTLR-780.	2018-06-26 14:33:04 -07:00
Marek Novotny	e07aee2165	[SPARK-24636][SQL] Type coercion of arrays for array_join function ## What changes were proposed in this pull request? Presto's implementation accepts arbitrary arrays of primitive types as an input: ``` presto> SELECT array_join(ARRAY [1, 2, 3], ', '); _col0 --------- 1, 2, 3 (1 row) ``` This PR proposes to implement a type coercion rule for ```array_join``` function that converts arrays of primitive as well as non-primitive types to arrays of string. ## How was this patch tested? New test cases add into: - sql-tests/inputs/typeCoercion/native/arrayJoin.sql - DataFrameFunctionsSuite.scala Author: Marek Novotny <mn.mikke@gmail.com> Closes #21620 from mn-mikke/SPARK-24636.	2018-06-26 09:51:55 +08:00
DB Tsai	c7967c6049	[SPARK-24418][BUILD] Upgrade Scala to 2.11.12 and 2.12.6 ## What changes were proposed in this pull request? Scala is upgraded to `2.11.12` and `2.12.6`. We used `loadFIles()` in `ILoop` as a hook to initialize the Spark before REPL sees any files in Scala `2.11.8`. However, it was a hack, and it was not intended to be a public API, so it was removed in Scala `2.11.12`. From the discussion in Scala community, https://github.com/scala/bug/issues/10913 , we can use `initializeSynchronous` to initialize Spark instead. This PR implements the Spark initialization there. However, in Scala `2.11.12`'s `ILoop.scala`, in function `def startup()`, the first thing it calls is `printWelcome()`. As a result, Scala will call `printWelcome()` and `splash` before calling `initializeSynchronous`. Thus, the Spark shell will allow users to type commends first, and then show the Spark UI URL. It's working, but it will change the Spark Shell interface as the following. ```scala ➜ apache-spark git:(scala-2.11.12) ✗ ./bin/spark-shell Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). Welcome to ____ __ / __/__ ___ _____/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 2.4.0-SNAPSHOT /_/ Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_161) Type in expressions to have them evaluated. Type :help for more information. scala> Spark context Web UI available at http://192.168.1.169:4040 Spark context available as 'sc' (master = local[*], app id = local-1528180279528). Spark session available as 'spark'. scala> ``` It seems there is no easy way to inject the Spark initialization code in the proper place as Scala doesn't provide a hook. Maybe som-snytt can comment on this. The following command is used to update the dep files. ```scala ./dev/test-dependencies.sh --replace-manifest ``` ## How was this patch tested? Existing tests Author: DB Tsai <d_tsai@apple.com> Closes #21495 from dbtsai/scala-2.11.12.	2018-06-26 09:48:52 +08:00
Bruce Robbins	4c059ebc60	[SPARK-23776][DOC] Update instructions for running PySpark after building with SBT ## What changes were proposed in this pull request? This update tells the reader how to build Spark with SBT such that pyspark-sql tests will succeed. If you follow the current instructions for building Spark with SBT, pyspark/sql/udf.py fails with: <pre> AnalysisException: u'Can not load class test.org.apache.spark.sql.JavaStringLength, please make sure it is on the classpath;' </pre> ## How was this patch tested? I ran the doc build command (SKIP_API=1 jekyll build) and eyeballed the result. Author: Bruce Robbins <bersprockets@gmail.com> Closes #21628 from bersprockets/SPARK-23776_doc.	2018-06-26 09:48:15 +08:00
Bryan Cutler	d48803bf64	[SPARK-24324][PYTHON][FOLLOWUP] Grouped Map positional conf should have deprecation note ## What changes were proposed in this pull request? Followup to the discussion of the added conf in SPARK-24324 which allows assignment by column position only. This conf is to preserve old behavior and will be removed in future releases, so it should have a note to indicate that. ## How was this patch tested? NA Author: Bryan Cutler <cutlerb@gmail.com> Closes #21637 from BryanCutler/arrow-groupedMap-conf-deprecate-followup-SPARK-24324.	2018-06-25 17:08:23 -07:00
Marcelo Vanzin	6d16b9885d	[SPARK-24552][CORE][SQL] Use task ID instead of attempt number for writes. This passes the unique task attempt id instead of attempt number to v2 data sources because attempt number is reused when stages are retried. When attempt numbers are reused, sources that track data by partition id and attempt number may incorrectly clean up data because the same attempt number can be both committed and aborted. For v1 / Hadoop writes, generate a unique ID based on available attempt numbers to avoid a similar problem. Closes #21558 Author: Marcelo Vanzin <vanzin@cloudera.com> Author: Ryan Blue <blue@apache.org> Closes #21606 from vanzin/SPARK-24552.2.	2018-06-25 16:54:57 -07:00
Marcelo Vanzin	baa01c8ca9	[INFRA] Close stale PR. Closes #21614	2018-06-25 15:12:33 -07:00
Stacy Kerkela	5264164a67	[SPARK-24648][SQL] SqlMetrics should be threadsafe Use LongAdder to make SQLMetrics thread safe. ## What changes were proposed in this pull request? Replace += with LongAdder.add() for concurrent counting ## How was this patch tested? Unit tests with local threads Author: Stacy Kerkela <stacy.kerkela@databricks.com> Closes #21634 from dbkerkela/sqlmetrics-concurrency-stacy.	2018-06-25 23:41:39 +02:00
Marco Gaido	594ac4f7b8	[SPARK-24633][SQL] Fix codegen when split is required for arrays_zip ## What changes were proposed in this pull request? In function array_zip, when split is required by the high number of arguments, a codegen error can happen. The PR fixes codegen for cases when splitting the code is required. ## How was this patch tested? added UT Author: Marco Gaido <marcogaido91@gmail.com> Closes #21621 from mgaido91/SPARK-24633.	2018-06-25 23:44:20 +08:00
Maryann Xue	bac50aa371	[SPARK-24596][SQL] Non-cascading Cache Invalidation ## What changes were proposed in this pull request? 1. Add parameter 'cascade' in CacheManager.uncacheQuery(). Under 'cascade=false' mode, only invalidate the current cache, and for other dependent caches, rebuild execution plan and reuse cached buffer. 2. Pass true/false from callers in different uncache scenarios: - Drop tables and regular (persistent) views: regular mode - Drop temporary views: non-cascading mode - Modify table contents (INSERT/UPDATE/MERGE/DELETE): regular mode - Call `DataSet.unpersist()`: non-cascading mode - Call `Catalog.uncacheTable()`: follow the same convention as drop tables/view, which is, use non-cascading mode for temporary views and regular mode for the rest Note that a regular (persistent) view is a database object just like a table, so after dropping a regular view (whether cached or not cached), any query referring to that view should no long be valid. Hence if a cached persistent view is dropped, we need to invalidate the all dependent caches so that exceptions will be thrown for any later reference. On the other hand, a temporary view is in fact equivalent to an unnamed DataSet, and dropping a temporary view should have no impact on queries referencing that view. Thus we should do non-cascading uncaching for temporary views, which also guarantees a consistent uncaching behavior between temporary views and unnamed DataSets. ## How was this patch tested? New tests in CachedTableSuite and DatasetCacheSuite. Author: Maryann Xue <maryannxue@apache.org> Closes #21594 from maryannxue/noncascading-cache.	2018-06-25 07:17:30 -07:00
Jim Kleckner	8ab8ef7733	Fix minor typo in docs/cloud-integration.md ## What changes were proposed in this pull request? Minor typo in docs/cloud-integration.md ## How was this patch tested? This is trivial enough that it should not affect tests. Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Jim Kleckner <jim@cloudphysics.com> Closes #21629 from jkleckner/fix-doc-typo.	2018-06-25 16:23:23 +08:00
Takuya UESHIN	6e0596e263	[SPARK-23931][SQL][FOLLOW-UP] Make `arrays_zip` in function.scala `@scala.annotation.varargs`. ## What changes were proposed in this pull request? This is a follow-up pr of #21045 which added `arrays_zip`. The `arrays_zip` in functions.scala should've been `scala.annotation.varargs`. This pr makes it `scala.annotation.varargs`. ## How was this patch tested? Existing tests. Author: Takuya UESHIN <ueshin@databricks.com> Closes #21630 from ueshin/issues/SPARK-23931/fup1.	2018-06-24 23:56:47 -07:00
Takeshi Yamamuro	f596ebe4d3	[SPARK-24327][SQL] Verify and normalize a partition column name based on the JDBC resolved schema ## What changes were proposed in this pull request? This pr modified JDBC datasource code to verify and normalize a partition column based on the JDBC resolved schema before building `JDBCRelation`. Closes #20370 ## How was this patch tested? Added tests in `JDBCSuite`. Author: Takeshi Yamamuro <yamamuro@apache.org> Closes #21379 from maropu/SPARK-24327.	2018-06-24 23:14:42 -07:00
Bryan Cutler	a5849ad9a3	[SPARK-24324][PYTHON] Pandas Grouped Map UDF should assign result columns by name ## What changes were proposed in this pull request? Currently, a `pandas_udf` of type `PandasUDFType.GROUPED_MAP` will assign the resulting columns based on index of the return pandas.DataFrame. If a new DataFrame is returned and constructed using a dict, then the order of the columns could be arbitrary and be different than the defined schema for the UDF. If the schema types still match, then no error will be raised and the user will see column names and column data mixed up. This change will first try to assign columns using the return type field names. If a KeyError occurs, then the column index is checked if it is string based. If so, then the error is raised as it is most likely a naming mistake, else it will fallback to assign columns by position and raise a TypeError if the field types do not match. ## How was this patch tested? Added a test that returns a new DataFrame with column order different than the schema. Author: Bryan Cutler <cutlerb@gmail.com> Closes #21427 from BryanCutler/arrow-grouped-map-mixesup-cols-SPARK-24324.	2018-06-24 09:28:46 +08:00
Takeshi Yamamuro	98f363b774	[SPARK-24206][SQL] Improve FilterPushdownBenchmark benchmark code ## What changes were proposed in this pull request? This pr added benchmark code `FilterPushdownBenchmark` for string pushdown and updated performance results on the AWS `r3.xlarge`. ## How was this patch tested? N/A Author: Takeshi Yamamuro <yamamuro@apache.org> Closes #21288 from maropu/UpdateParquetBenchmark.	2018-06-23 17:51:18 -07:00
Maxim Gekk	c7e2742f9b	[SPARK-24190][SQL] Allow saving of JSON files in UTF-16 and UTF-32 ## What changes were proposed in this pull request? Currently, restrictions in JSONOptions for `encoding` and `lineSep` are the same for read and for write. For example, a requirement for `lineSep` in the code: ``` df.write.option("encoding", "UTF-32BE").json(file) ``` doesn't allow to skip `lineSep` and use its default value `\n` because it throws the exception: ``` equirement failed: The lineSep option must be specified for the UTF-32BE encoding java.lang.IllegalArgumentException: requirement failed: The lineSep option must be specified for the UTF-32BE encoding ``` In the PR, I propose to separate JSONOptions in read and write, and make JSONOptions in write less restrictive. ## How was this patch tested? Added new test for blacklisted encodings in read. And the `lineSep` option was removed in write for some tests. Author: Maxim Gekk <maxim.gekk@databricks.com> Author: Maxim Gekk <max.gekk@gmail.com> Closes #21247 from MaxGekk/json-options-in-write.	2018-06-23 17:40:20 -07:00
Marcelo Vanzin	4e7d8678a3	[SPARK-24372][BUILD] Add scripts to help with preparing releases. The "do-release.sh" script asks questions about the RC being prepared, trying to find out as much as possible automatically, and then executes the existing scripts with proper arguments to prepare the release. This script was used to prepare the 2.3.1 release candidates, so was tested in that context. The docker version runs that same script inside a docker image especially crafted for building Spark releases. That image is based on the work by Felix C. linked in the bug. At this point is has been only midly tested. I also added a template for the vote e-mail, with placeholders for things that need to be replaced, although there is no automation around that for the moment. It shouldn't be hard to hook up certain things like version and tags to this, or to figure out certain things like the repo URL from the output of the release scripts. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #21515 from vanzin/SPARK-24372.	2018-06-22 12:38:34 -05:00
jerryshao	33e77fa89b	[SPARK-24518][CORE] Using Hadoop credential provider API to store password ## What changes were proposed in this pull request? In our distribution, because we don't do such fine-grained access control of config file, also configuration file is world readable shared between different components, so password may leak to different users. Hadoop credential provider API support storing password in a secure way, in which Spark could read it in a secure way, so here propose to add support of using credential provider API to get password. ## How was this patch tested? Adding tests and verified locally. Author: jerryshao <sshao@hortonworks.com> Closes #21548 from jerryshao/SPARK-24518.	2018-06-22 10:14:12 -07:00
Hieu Huynh	39dfaf2fd1	[SPARK-24519] Make the threshold for highly compressed map status configurable Problem MapStatus uses hardcoded value of 2000 partitions to determine if it should use highly compressed map status. We should make it configurable to allow users to more easily tune their jobs with respect to this without having for them to modify their code to change the number of partitions. Note we can leave this as an internal/undocumented config for now until we have more advise for the users on how to set this config. Some of my reasoning: The config gives you a way to easily change something without the user having to change code, redeploy jar, and then run again. You can simply change the config and rerun. It also allows for easier experimentation. Changing the # of partitions has other side affects, whether good or bad is situation dependent. It can be worse are you could be increasing # of output files when you don't want to be, affects the # of tasks needs and thus executors to run in parallel, etc. There have been various talks about this number at spark summits where people have told customers to increase it to be 2001 partitions. Note if you just do a search for spark 2000 partitions you will fine various things all talking about this number. This shows that people are modifying their code to take this into account so it seems to me having this configurable would be better. Once we have more advice for users we could expose this and document information on it. What changes were proposed in this pull request? I make the hardcoded value mentioned above to be configurable under the name _SHUFFLE_MIN_NUM_PARTS_TO_HIGHLY_COMPRESS_, which has default value to be 2000. Users can set it to the value they want by setting the property name _spark.shuffle.minNumPartitionsToHighlyCompress_ How was this patch tested? I wrote a unit test to make sure that the default value is 2000, and _IllegalArgumentException_ will be thrown if user set it to a non-positive value. The unit test also checks that highly compressed map status is correctly used when the number of partition is greater than _SHUFFLE_MIN_NUM_PARTS_TO_HIGHLY_COMPRESS_. Author: Hieu Huynh <“Hieu.huynh@oath.com”> Closes #21527 from hthuynh2/spark_branch_1.	2018-06-22 09:16:14 -05:00
Marek Novotny	92c2f00bd2	[SPARK-23934][SQL] Adding map_from_entries function ## What changes were proposed in this pull request? The PR adds the `map_from_entries` function that returns a map created from the given array of entries. ## How was this patch tested? New tests added into: - `CollectionExpressionSuite` - `DataFrameFunctionSuite` ## CodeGen Examples ### Primitive-type Keys and Values ``` val idf = Seq( Seq((1, 10), (2, 20), (3, 10)), Seq((1, 10), null, (2, 20)) ).toDF("a") idf.filter('a.isNotNull).select(map_from_entries('a)).debugCodegen ``` Result: ``` /* 042 / boolean project_isNull_0 = false; / 043 / MapData project_value_0 = null; / 044 / / 045 / for (int project_idx_2 = 0; !project_isNull_0 && project_idx_2 < inputadapter_value_0.numElements(); project_idx_2++) { / 046 / project_isNull_0 \|= inputadapter_value_0.isNullAt(project_idx_2); / 047 / } / 048 / if (!project_isNull_0) { / 049 / final int project_numEntries_0 = inputadapter_value_0.numElements(); / 050 / / 051 / final long project_keySectionSize_0 = UnsafeArrayData.calculateSizeOfUnderlyingByteArray(project_numEntries_0, 4); / 052 / final long project_valueSectionSize_0 = UnsafeArrayData.calculateSizeOfUnderlyingByteArray(project_numEntries_0, 4); / 053 / final long project_byteArraySize_0 = 8 + project_keySectionSize_0 + project_valueSectionSize_0; / 054 / if (project_byteArraySize_0 > 2147483632) { / 055 / final Object[] project_keys_0 = new Object[project_numEntries_0]; / 056 / final Object[] project_values_0 = new Object[project_numEntries_0]; / 057 / / 058 / for (int project_idx_1 = 0; project_idx_1 < project_numEntries_0; project_idx_1++) { / 059 / InternalRow project_entry_1 = inputadapter_value_0.getStruct(project_idx_1, 2); / 060 / / 061 / project_keys_0[project_idx_1] = project_entry_1.getInt(0); / 062 / project_values_0[project_idx_1] = project_entry_1.getInt(1); / 063 / } / 064 / / 065 / project_value_0 = org.apache.spark.sql.catalyst.util.ArrayBasedMapData.apply(project_keys_0, project_values_0); / 066 / / 067 / } else { / 068 / final byte[] project_byteArray_0 = new byte[(int)project_byteArraySize_0]; / 069 / UnsafeMapData project_unsafeMapData_0 = new UnsafeMapData(); / 070 / Platform.putLong(project_byteArray_0, 16, project_keySectionSize_0); / 071 / Platform.putLong(project_byteArray_0, 24, project_numEntries_0); / 072 / Platform.putLong(project_byteArray_0, 24 + project_keySectionSize_0, project_numEntries_0); / 073 / project_unsafeMapData_0.pointTo(project_byteArray_0, 16, (int)project_byteArraySize_0); / 074 / ArrayData project_keyArrayData_0 = project_unsafeMapData_0.keyArray(); / 075 / ArrayData project_valueArrayData_0 = project_unsafeMapData_0.valueArray(); / 076 / / 077 / for (int project_idx_0 = 0; project_idx_0 < project_numEntries_0; project_idx_0++) { / 078 / InternalRow project_entry_0 = inputadapter_value_0.getStruct(project_idx_0, 2); / 079 / / 080 / project_keyArrayData_0.setInt(project_idx_0, project_entry_0.getInt(0)); / 081 / project_valueArrayData_0.setInt(project_idx_0, project_entry_0.getInt(1)); / 082 / } / 083 / / 084 / project_value_0 = project_unsafeMapData_0; / 085 / } / 086 / / 087 / } ``` ### Non-primitive-type Keys and Values ``` val sdf = Seq( Seq(("a", null), ("b", "bb"), ("c", "aa")), Seq(("a", "aa"), null, (null, "bb")) ).toDF("a") sdf.filter('a.isNotNull).select(map_from_entries('a)).debugCodegen ``` Result: ``` / 042 / boolean project_isNull_0 = false; / 043 / MapData project_value_0 = null; / 044 / / 045 / for (int project_idx_1 = 0; !project_isNull_0 && project_idx_1 < inputadapter_value_0.numElements(); project_idx_1++) { / 046 / project_isNull_0 \|= inputadapter_value_0.isNullAt(project_idx_1); / 047 / } / 048 / if (!project_isNull_0) { / 049 / final int project_numEntries_0 = inputadapter_value_0.numElements(); / 050 / / 051 / final Object[] project_keys_0 = new Object[project_numEntries_0]; / 052 / final Object[] project_values_0 = new Object[project_numEntries_0]; / 053 / / 054 / for (int project_idx_0 = 0; project_idx_0 < project_numEntries_0; project_idx_0++) { / 055 / InternalRow project_entry_0 = inputadapter_value_0.getStruct(project_idx_0, 2); / 056 / / 057 / if (project_entry_0.isNullAt(0)) { / 058 / throw new RuntimeException("The first field from a struct (key) can't be null."); / 059 / } / 060 / / 061 / project_keys_0[project_idx_0] = project_entry_0.getUTF8String(0); / 062 / project_values_0[project_idx_0] = project_entry_0.getUTF8String(1); / 063 / } / 064 / / 065 / project_value_0 = org.apache.spark.sql.catalyst.util.ArrayBasedMapData.apply(project_keys_0, project_values_0); / 066 / / 067 */ } ``` Author: Marek Novotny <mn.mikke@gmail.com> Closes #21282 from mn-mikke/feature/array-api-map_from_entries-to-master.	2018-06-22 16:18:22 +09:00
Wenchen Fan	dc8a6befa5	[SPARK-24588][SS] streaming join should require HashClusteredPartitioning from children ## What changes were proposed in this pull request? In https://github.com/apache/spark/pull/19080 we simplified the distribution/partitioning framework, and make all the join-like operators require `HashClusteredDistribution` from children. Unfortunately streaming join operator was missed. This can cause wrong result. Think about ``` val input1 = MemoryStream[Int] val input2 = MemoryStream[Int] val df1 = input1.toDF.select('value as 'a, 'value * 2 as 'b) val df2 = input2.toDF.select('value as 'a, 'value * 2 as 'b).repartition('b) val joined = df1.join(df2, Seq("a", "b")).select('a) ``` The physical plan is ``` (3) Project [a#5] +- StreamingSymmetricHashJoin [a#5, b#6], [a#10, b#11], Inner, condition = [ leftOnly = null, rightOnly = null, both = null, full = null ], state info [ checkpoint = <unknown>, runId = 54e31fce-f055-4686-b75d-fcd2b076f8d8, opId = 0, ver = 0, numPartitions = 5], 0, state cleanup [ left = null, right = null ] :- Exchange hashpartitioning(a#5, b#6, 5) : +- (1) Project [value#1 AS a#5, (value#1 * 2) AS b#6] : +- StreamingRelation MemoryStream[value#1], [value#1] +- Exchange hashpartitioning(b#11, 5) +- (2) Project [value#3 AS a#10, (value#3 2) AS b#11] +- StreamingRelation MemoryStream[value#3], [value#3] ``` The left table is hash partitioned by `a, b`, while the right table is hash partitioned by `b`. This means, we may have a matching record that is in different partitions, which should be in the output but not. ## How was this patch tested? N/A Author: Wenchen Fan <wenchen@databricks.com> Closes #21587 from cloud-fan/join.	2018-06-21 15:38:46 -07:00
Maryann Xue	b9a6f7499a	[SPARK-24613][SQL] Cache with UDF could not be matched with subsequent dependent caches ## What changes were proposed in this pull request? Wrap the logical plan with a `AnalysisBarrier` for execution plan compilation in CacheManager, in order to avoid the plan being analyzed again. ## How was this patch tested? Add one test in `DatasetCacheSuite` Author: Maryann Xue <maryannxue@apache.org> Closes #21602 from maryannxue/cache-mismatch.	2018-06-21 11:45:30 -07:00
Marcelo Vanzin	c8e909cd49	[SPARK-24589][CORE] Correctly identify tasks in output commit coordinator. When an output stage is retried, it's possible that tasks from the previous attempt are still running. In that case, there would be a new task for the same partition in the new attempt, and the coordinator would allow both tasks to commit their output since it did not keep track of stage attempts. The change adds more information to the stage state tracked by the coordinator, so that only one task is allowed to commit the output in the above case. The stage state in the coordinator is also maintained across stage retries, so that a stray speculative task from a previous stage attempt is not allowed to commit. This also removes some code added in SPARK-18113 that allowed for duplicate commit requests; with the RPC code used in Spark 2, that situation cannot happen, so there is no need to handle it. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #21577 from vanzin/SPARK-24552.	2018-06-21 13:25:15 -05:00
“attilapiros”	b56e9c613f	[SPARK-16630][YARN] Blacklist a node if executors won't launch on it ## What changes were proposed in this pull request? This change extends YARN resource allocation handling with blacklisting functionality. This handles cases when node is messed up or misconfigured such that a container won't launch on it. Before this change backlisting only focused on task execution but this change introduces YarnAllocatorBlacklistTracker which tracks allocation failures per host (when enabled via "spark.yarn.blacklist.executor.launch.blacklisting.enabled"). ## How was this patch tested? ### With unit tests Including a new suite: YarnAllocatorBlacklistTrackerSuite. #### Manually It was tested on a cluster by deleting the Spark jars on one of the node. #### Behaviour before these changes Starting Spark as: ``` spark2-shell --master yarn --deploy-mode client --num-executors 4 --conf spark.executor.memory=4g --conf "spark.yarn.max.executor.failures=6" ``` Log is: ``` 18/04/12 06:49:36 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 11, (reason: Max number of executor failures (6) reached) 18/04/12 06:49:39 INFO yarn.ApplicationMaster: Unregistering ApplicationMaster with FAILED (diag message: Max number of executor failures (6) reached) 18/04/12 06:49:39 INFO impl.AMRMClientImpl: Waiting for application to be successfully unregistered. 18/04/12 06:49:39 INFO yarn.ApplicationMaster: Deleting staging directory hdfs://apiros-1.gce.test.com:8020/user/systest/.sparkStaging/application_1523459048274_0016 18/04/12 06:49:39 INFO util.ShutdownHookManager: Shutdown hook called ``` #### Behaviour after these changes Starting Spark as: ``` spark2-shell --master yarn --deploy-mode client --num-executors 4 --conf spark.executor.memory=4g --conf "spark.yarn.max.executor.failures=6" --conf "spark.yarn.blacklist.executor.launch.blacklisting.enabled=true" ``` And the log is: ``` 18/04/13 05:37:43 INFO yarn.YarnAllocator: Will request 1 executor container(s), each with 1 core(s) and 4505 MB memory (including 409 MB of overhead) 18/04/13 05:37:43 INFO yarn.YarnAllocator: Submitted 1 unlocalized container requests. 18/04/13 05:37:43 INFO yarn.YarnAllocator: Launching container container_1523459048274_0025_01_000008 on host apiros-4.gce.test.com for executor with ID 6 18/04/13 05:37:43 INFO yarn.YarnAllocator: Received 1 containers from YARN, launching executors on 1 of them. 18/04/13 05:37:43 INFO yarn.YarnAllocator: Completed container container_1523459048274_0025_01_000007 on host: apiros-4.gce.test.com (state: COMPLETE, exit status: 1) 18/04/13 05:37:43 INFO yarn.YarnAllocatorBlacklistTracker: blacklisting host as YARN allocation failed: apiros-4.gce.test.com 18/04/13 05:37:43 INFO yarn.YarnAllocatorBlacklistTracker: adding nodes to YARN application master's blacklist: List(apiros-4.gce.test.com) 18/04/13 05:37:43 WARN yarn.YarnAllocator: Container marked as failed: container_1523459048274_0025_01_000007 on host: apiros-4.gce.test.com. Exit status: 1. Diagnostics: Exception from container-launch. Container id: container_1523459048274_0025_01_000007 Exit code: 1 Stack trace: ExitCodeException exitCode=1: at org.apache.hadoop.util.Shell.runCommand(Shell.java:604) at org.apache.hadoop.util.Shell.run(Shell.java:507) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:789) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:213) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) ``` Where the most important part is: ``` 18/04/13 05:37:43 INFO yarn.YarnAllocatorBlacklistTracker: blacklisting host as YARN allocation failed: apiros-4.gce.test.com 18/04/13 05:37:43 INFO yarn.YarnAllocatorBlacklistTracker: adding nodes to YARN application master's blacklist: List(apiros-4.gce.test.com) ``` And execution was continued (no shutdown called). ### Testing the backlisting of the whole cluster Starting Spark with YARN blacklisting enabled then removing a the Spark core jar one by one from all the cluster nodes. Then executing a simple spark job which fails checking the yarn log the expected exit status is contained: ``` 18/06/15 01:07:10 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 11, (reason: Due to executor failures all available nodes are blacklisted) 18/06/15 01:07:13 INFO util.ShutdownHookManager: Shutdown hook called ``` Author: “attilapiros” <piros.attila.zsolt@gmail.com> Closes #21068 from attilapiros/SPARK-16630.	2018-06-21 09:17:18 -05:00
Rekha Joshi	c0cad596b8	[SPARK-24614][PYSPARK] Fix for SyntaxWarning on tests.py ## What changes were proposed in this pull request? Fix for SyntaxWarning on tests.py ## How was this patch tested? ./dev/run-tests Author: Rekha Joshi <rekhajoshm@gmail.com> Closes #21604 from rekhajoshm/SPARK-24614.	2018-06-21 16:41:43 +08:00
Chongguang LIU	7236e759c9	[SPARK-24574][SQL] array_contains, array_position, array_remove and element_at functions deal with Column type ## What changes were proposed in this pull request? For the function ```def array_contains(column: Column, value: Any): Column ``` , if we pass the `value` parameter as a Column type, it will yield a runtime exception. This PR proposes a pattern matching to detect if `value` is of type Column. If yes, it will use the .expr of the column, otherwise it will work as it used to. Same thing for ```array_position, array_remove and element_at``` functions ## How was this patch tested? Unit test modified to cover this code change. Ping ueshin Author: Chongguang LIU <chong@Chongguangs-MacBook-Pro.local> Closes #21581 from chongguang/SPARK-24574.	2018-06-21 14:58:57 +08:00

1 2 3 4 5 ...

22155 commits