ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
pj.fanning	c0e333dbed	[SPARK-21709][BUILD] sbt 0.13.16 and some plugin updates ## What changes were proposed in this pull request? Update sbt version to 0.13.16. I think this is a useful stepping stone to getting to sbt 1.0.0. ## How was this patch tested? Existing Build. Author: pj.fanning <pj.fanning@workday.com> Closes #18921 from pjfanning/SPARK-21709.	2017-08-12 20:01:20 +01:00
Ajay Saini	35db3b9fe3	[SPARK-17025][ML][PYTHON] Persistence for Pipelines with Python-only Stages ## What changes were proposed in this pull request? Implemented a Python-only persistence framework for pipelines containing stages that cannot be saved using Java. ## How was this patch tested? Created a custom Python-only UnaryTransformer, included it in a Pipeline, and saved/loaded the pipeline. The loaded pipeline was compared against the original using _compare_pipelines() in tests.py. Author: Ajay Saini <ajays725@gmail.com> Closes #18888 from ajaysaini725/PythonPipelines.	2017-08-11 23:57:08 -07:00
Sean Owen	b0bdfce9ca	[MINOR][BUILD] Download RAT and R version info over HTTPS; use RAT 0.12 ## What changes were proposed in this pull request? This is trivial, but bugged me. We should download software over HTTPS. And we can use RAT 0.12 while at it to pick up bug fixes. ## How was this patch tested? N/A Author: Sean Owen <sowen@cloudera.com> Closes #18927 from srowen/Rat012.	2017-08-12 14:31:05 +09:00
Stavros Kontopoulos	da8c59bdea	[SPARK-12559][SPARK SUBMIT] fix --packages for stand-alone cluster mode Fixes --packages flag for the stand-alone case in cluster mode. Adds to the driver classpath the jars that are resolved via ivy along with any other jars passed to `spark.jars`. Jars not resolved by ivy are downloaded explicitly to a tmp folder on the driver node. Similar code is available in SparkSubmit so we refactored part of it to use it at the DriverWrapper class which is responsible for launching driver in standalone cluster mode. Note: In stand-alone mode `spark.jars` contains the user jar so it can be fetched later on at the executor side. Manually by submitting a driver in cluster mode within a standalone cluster and checking if dependencies were resolved at the driver side. Author: Stavros Kontopoulos <st.kontopoulos@gmail.com> Closes #18630 from skonto/fix_packages_stand_alone_cluster.	2017-08-11 15:52:32 -07:00
Tejas Patil	7f16c69107	[SPARK-19122][SQL] Unnecessary shuffle+sort added if join predicates ordering differ from bucketing and sorting order ## What changes were proposed in this pull request? Jira : https://issues.apache.org/jira/browse/SPARK-19122 `leftKeys` and `rightKeys` in `SortMergeJoinExec` are altered based on the ordering of join keys in the child's `outputPartitioning`. This is done everytime `requiredChildDistribution` is invoked during query planning. ## How was this patch tested? - Added new test case - Existing tests Author: Tejas Patil <tejasp@fb.com> Closes #16985 from tejasapatil/SPARK-19122_join_order_shuffle.	2017-08-11 15:13:42 -07:00
Tejas Patil	94439997d5	[SPARK-21595] Separate thresholds for buffering and spilling in ExternalAppendOnlyUnsafeRowArray ## What changes were proposed in this pull request? [SPARK-21595](https://issues.apache.org/jira/browse/SPARK-21595) reported that there is excessive spilling to disk due to default spill threshold for `ExternalAppendOnlyUnsafeRowArray` being quite small for WINDOW operator. Old behaviour of WINDOW operator (pre https://github.com/apache/spark/pull/16909) would hold data in an array for first 4096 records post which it would switch to `UnsafeExternalSorter` and start spilling to disk after reaching `spark.shuffle.spill.numElementsForceSpillThreshold` (or earlier if there was paucity of memory due to excessive consumers). Currently the (switch from in-memory to `UnsafeExternalSorter`) and (`UnsafeExternalSorter` spilling to disk) for `ExternalAppendOnlyUnsafeRowArray` is controlled by a single threshold. This PR aims to separate that to have more granular control. ## How was this patch tested? Added unit tests Author: Tejas Patil <tejasp@fb.com> Closes #18843 from tejasapatil/SPARK-21595.	2017-08-11 22:01:00 +02:00
LucaCanali	0377338bf7	[SPARK-21519][SQL] Add an option to the JDBC data source to initialize the target DB environment Add an option to the JDBC data source to initialize the environment of the remote database session ## What changes were proposed in this pull request? This proposes an option to the JDBC datasource, tentatively called " sessionInitStatement" to implement the functionality of session initialization present for example in the Sqoop connector for Oracle (see https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_oraoop_oracle_session_initialization_statements ) . After each database session is opened to the remote DB, and before starting to read data, this option executes a custom SQL statement (or a PL/SQL block in the case of Oracle). See also https://issues.apache.org/jira/browse/SPARK-21519 ## How was this patch tested? Manually tested using Spark SQL data source and Oracle JDBC Author: LucaCanali <luca.canali@cern.ch> Closes #18724 from LucaCanali/JDBC_datasource_sessionInitStatement.	2017-08-11 12:03:37 -07:00
Kent Yao	2387f1e316	[SPARK-21675][WEBUI] Add a navigation bar at the bottom of the Details for Stage Page ## What changes were proposed in this pull request? 1. In Spark Web UI, the Details for Stage Page don't have a navigation bar at the bottom. When we drop down to the bottom, it is better for us to see a navi bar right there to go wherever we what. 2. Executor ID is not equivalent to Host, it may be better to separate them, and then we can group the tasks by Hosts . ## How was this patch tested? manually test ![wx20170809-165606](https://user-images.githubusercontent.com/8326978/29114161-f82b4920-7d25-11e7-8d0c-0c036b008a78.png) Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Kent Yao <yaooqinn@hotmail.com> Closes #18893 from yaooqinn/SPARK-21675.	2017-08-11 14:57:06 +01:00
Reynold Xin	584c7f1437	[SPARK-21699][SQL] Remove unused getTableOption in ExternalCatalog ## What changes were proposed in this pull request? This patch removes the unused SessionCatalog.getTableMetadataOption and ExternalCatalog. getTableOption. ## How was this patch tested? Removed the test case. Author: Reynold Xin <rxin@databricks.com> Closes #18912 from rxin/remove-getTableOption.	2017-08-10 18:56:25 -07:00
Peng Meng	ca6955858c	[SPARK-21638][ML] Fix RF/GBT Warning message error ## What changes were proposed in this pull request? When train RF model, there are many warning messages like this: > WARN RandomForest: Tree learning is using approximately 268492800 bytes per iteration, which exceeds requested limit maxMemoryUsage=268435456. This allows splitting 2622 nodes in this iteration. This warning message is unnecessary and the data is not accurate. Actually, if all the nodes cannot split in one iteration, it will show this warning. For most of the case, all the nodes cannot split just in one iteration, so for most of the case, it will show this warning for each iteration. ## How was this patch tested? The existing UT Author: Peng Meng <peng.meng@intel.com> Closes #18868 from mpjlu/fixRFwarning.	2017-08-10 21:38:03 +01:00
Adrian Ionescu	95ad960caf	[SPARK-21669] Internal API for collecting metrics/stats during FileFormatWriter jobs ## What changes were proposed in this pull request? This patch introduces an internal interface for tracking metrics and/or statistics on data on the fly, as it is being written to disk during a `FileFormatWriter` job and partially reimplements SPARK-20703 in terms of it. The interface basically consists of 3 traits: - `WriteTaskStats`: just a tag for classes that represent statistics collected during a `WriteTask` The only constraint it adds is that the class should be `Serializable`, as instances of it will be collected on the driver from all executors at the end of the `WriteJob`. - `WriteTaskStatsTracker`: a trait for classes that can actually compute statistics based on tuples that are processed by a given `WriteTask` and eventually produce a `WriteTaskStats` instance. - `WriteJobStatsTracker`: a trait for classes that act as containers of `Serializable` state that's necessary for instantiating `WriteTaskStatsTracker` on executors and finally process the resulting collection of `WriteTaskStats`, once they're gathered back on the driver. Potential future use of this interface is e.g. CBO stats maintenance during `INSERT INTO table ... ` operations. ## How was this patch tested? Existing tests for SPARK-20703 exercise the new code: `hive/SQLMetricsSuite`, `sql/JavaDataFrameReaderWriterSuite`, etc. Author: Adrian Ionescu <adrian@databricks.com> Closes #18884 from adrian-ionescu/write-stats-tracker-api.	2017-08-10 12:37:10 -07:00
bravo-zhang	84454d7d33	[SPARK-14932][SQL] Allow DataFrame.replace() to replace values with None ## What changes were proposed in this pull request? Currently `df.na.replace("", Map[String, String]("NULL" -> null))` will produce exception. This PR enables passing null/None as value in the replacement map in DataFrame.replace(). Note that the replacement map keys and values should still be the same type, while the values can have a mix of null/None and that type. This PR enables following operations for example: `df.na.replace("", Map[String, String]("NULL" -> null))`(scala) `df.na.replace("", Map[Any, Any](60 -> null, 70 -> 80))`(scala) `df.na.replace('Alice', None)`(python) `df.na.replace([10, 20])`(python, replacing with None is by default) One use case could be: I want to replace all the empty strings with null/None because they were incorrectly generated and then drop all null/None data `df.na.replace("", Map("" -> null)).na.drop()`(scala) `df.replace(u'', None).dropna()`(python) ## How was this patch tested? Scala unit test. Python doctest and unit test. Author: bravo-zhang <mzhang1230@gmail.com> Closes #18820 from bravo-zhang/spark-14932.	2017-08-09 17:42:21 -07:00
peay	c06f3f5ac5	[SPARK-21551][PYTHON] Increase timeout for PythonRDD.serveIterator ## What changes were proposed in this pull request? This modification increases the timeout for `serveIterator` (which is not dynamically configurable). This fixes timeout issues in pyspark when using `collect` and similar functions, in cases where Python may take more than a couple seconds to connect. See https://issues.apache.org/jira/browse/SPARK-21551 ## How was this patch tested? Ran the tests. cc rxin Author: peay <peay@protonmail.com> Closes #18752 from peay/spark-21551.	2017-08-09 14:03:18 -07:00
Jose Torres	0fb73253fc	[SPARK-21587][SS] Added filter pushdown through watermarks. ## What changes were proposed in this pull request? Push filter predicates through EventTimeWatermark if they're deterministic and do not reference the watermarked attribute. (This is similar but not identical to the logic for pushing through UnaryNode.) ## How was this patch tested? unit tests Author: Jose Torres <joseph-torres@databricks.com> Closes #18790 from joseph-torres/SPARK-21587.	2017-08-09 12:50:04 -07:00
gatorsmile	2d799d0808	[SPARK-21504][SQL] Add spark version info into table metadata ## What changes were proposed in this pull request? This PR is to add the spark version info in the table metadata. When creating the table, this value is assigned. It can help users find which version of Spark was used to create the table. ## How was this patch tested? N/A Author: gatorsmile <gatorsmile@gmail.com> Closes #18709 from gatorsmile/addVersion.	2017-08-09 08:46:25 -07:00
Takeshi Yamamuro	b78cf13bf0	[SPARK-21276][CORE] Update lz4-java to the latest (v1.4.0) ## What changes were proposed in this pull request? This pr updated `lz4-java` to the latest (v1.4.0) and removed custom `LZ4BlockInputStream`. We currently use custom `LZ4BlockInputStream` to read concatenated byte stream in shuffle. But, this functionality has been implemented in the latest lz4-java (https://github.com/lz4/lz4-java/pull/105). So, we might update the latest to remove the custom `LZ4BlockInputStream`. Major diffs between the latest release and v1.3.0 in the master are as follows (`62f7547abb...6d4693f562`); - fixed NPE in XXHashFactory similarly - Don't place resources in default package to support shading - Fixes ByteBuffer methods failing to apply arrayOffset() for array-backed - Try to load lz4-java from java.library.path, then fallback to bundled - Add ppc64le binary - Add s390x JNI binding - Add basic LZ4 Frame v1.5.0 support - enable aarch64 support for lz4-java - Allow unsafeInstance() for ppc64le archiecture - Add unsafeInstance support for AArch64 - Support 64-bit JNI build on Solaris - Avoid over-allocating a buffer - Allow EndMark to be incompressible for LZ4FrameInputStream. - Concat byte stream ## How was this patch tested? Existing tests. Author: Takeshi Yamamuro <yamamuro@apache.org> Closes #18883 from maropu/SPARK-21276.	2017-08-09 17:31:52 +02:00
vinodkc	83fe3b5e10	[SPARK-21665][CORE] Need to close resources after use ## What changes were proposed in this pull request? Resources in Core - SparkSubmitArguments.scala, Spark-launcher - AbstractCommandBuilder.java, resource-managers- YARN - Client.scala are released ## How was this patch tested? No new test cases added, Unit test have been passed Author: vinodkc <vinod.kc.in@gmail.com> Closes #18880 from vinodkc/br_fixresouceleak.	2017-08-09 15:18:38 +02:00
10087686	6426adffaf	[SPARK-21663][TESTS] test("remote fetch below max RPC message size") should call masterTracker.stop() in MapOutputTrackerSuite Signed-off-by: 10087686 <wang.jiaochunzte.com.cn> ## What changes were proposed in this pull request? After Unit tests end，there should be call masterTracker.stop() to free resource; (Please fill in changes proposed in this fix) ## How was this patch tested? Run Unit tests; (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Please review http://spark.apache.org/contributing.html before opening a pull request. Author: 10087686 <wang.jiaochun@zte.com.cn> Closes #18867 from wangjiaochun/mapout.	2017-08-09 18:45:38 +08:00
WeichenXu	b35660dd0e	[SPARK-21523][ML] update breeze to 0.13.2 for an emergency bugfix in strong wolfe line search ## What changes were proposed in this pull request? Update breeze to 0.13.1 for an emergency bugfix in strong wolfe line search https://github.com/scalanlp/breeze/pull/651 ## How was this patch tested? N/A Author: WeichenXu <WeichenXu123@outlook.com> Closes #18797 from WeichenXu123/update-breeze.	2017-08-09 14:44:10 +08:00
Anderson Osagie	ae8a2b1496	[SPARK-21176][WEB UI] Use a single ProxyServlet to proxy all workers and applications ## What changes were proposed in this pull request? Currently, each application and each worker creates their own proxy servlet. Each proxy servlet is backed by its own HTTP client and a relatively large number of selector threads. This is excessive but was fixed (to an extent) by https://github.com/apache/spark/pull/18437. However, a single HTTP client (backed by a single selector thread) should be enough to handle all proxy requests. This PR creates a single proxy servlet no matter how many applications and workers there are. ## How was this patch tested? . The unit tests for rewriting proxied locations and headers were updated. I then spun up a 100 node cluster to ensure that proxy'ing worked correctly jiangxb1987 Please let me know if there's anything else I can do to help push this thru. Thanks! Author: Anderson Osagie <osagie@gmail.com> Closes #18499 from aosagie/fix/minimize-proxy-threads.	2017-08-09 14:35:27 +08:00
pgandhi	f016f5c8f6	[SPARK-21503][UI] Spark UI shows incorrect task status for a killed Executor Process The executor tab on Spark UI page shows task as completed when an executor process that is running that task is killed using the kill command. Added the case ExecutorLostFailure which was previously not there, thus, the default case would be executed in which case, task would be marked as completed. This case will consider all those cases where executor connection to Spark Driver was lost due to killing the executor process, network connection etc. ## How was this patch tested? Manually Tested the fix by observing the UI change before and after. Before: <img width="1398" alt="screen shot-before" src="https://user-images.githubusercontent.com/22228190/28482929-571c9cea-6e30-11e7-93dd-728de5cdea95.png"> After: <img width="1385" alt="screen shot-after" src="https://user-images.githubusercontent.com/22228190/28482964-8649f5ee-6e30-11e7-91bd-2eb2089c61cc.png"> Please review http://spark.apache.org/contributing.html before opening a pull request. Author: pgandhi <pgandhi@yahoo-inc.com> Author: pgandhi999 <parthkgandhi9@gmail.com> Closes #18707 from pgandhi999/master.	2017-08-09 13:46:06 +08:00
Xingbo Jiang	031910b0ec	[SPARK-21608][SPARK-9221][SQL] Window rangeBetween() API should allow literal boundary ## What changes were proposed in this pull request? Window rangeBetween() API should allow literal boundary, that means, the window range frame can calculate frame of double/date/timestamp. Example of the use case can be: ``` SELECT val_timestamp, cate, avg(val_timestamp) OVER(PARTITION BY cate ORDER BY val_timestamp RANGE BETWEEN CURRENT ROW AND interval 23 days 4 hours FOLLOWING) FROM testData ``` This PR refactors the Window `rangeBetween` and `rowsBetween` API, while the legacy user code should still be valid. ## How was this patch tested? Add new test cases both in `DataFrameWindowFunctionsSuite` and in `window.sql`. Author: Xingbo Jiang <xingbo.jiang@databricks.com> Closes #18814 from jiangxb1987/literal-boundary.	2017-08-09 13:23:49 +08:00
Shixiong Zhu	6edfff055c	[SPARK-21596][SS] Ensure places calling HDFSMetadataLog.get check the return value ## What changes were proposed in this pull request? When I was investigating a flaky test, I realized that many places don't check the return value of `HDFSMetadataLog.get(batchId: Long): Option[T]`. When a batch is supposed to be there, the caller just ignores None rather than throwing an error. If some bug causes a query doesn't generate a batch metadata file, this behavior will hide it and allow the query continuing to run and finally delete metadata logs and make it hard to debug. This PR ensures that places calling HDFSMetadataLog.get always check the return value. ## How was this patch tested? Jenkins Author: Shixiong Zhu <shixiong@databricks.com> Closes #18799 from zsxwing/SPARK-21596.	2017-08-08 20:20:26 -07:00
Sean Owen	fb54a564d7	[SPARK-20433][BUILD] Bump jackson from 2.6.5 to 2.6.7.1 ## What changes were proposed in this pull request? Taking over https://github.com/apache/spark/pull/18789 ; Closes #18789 Update Jackson to 2.6.7 uniformly, and some components to 2.6.7.1, to get some fixes and prep for Scala 2.12 ## How was this patch tested? Existing tests Author: Sean Owen <sowen@cloudera.com> Closes #18881 from srowen/SPARK-20433.	2017-08-08 18:15:29 -07:00
Marcelo Vanzin	2c1bfb497f	[SPARK-21671][CORE] Move kvstore to "util" sub-package, add private annotation. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #18886 from vanzin/SPARK-21671.	2017-08-08 14:33:27 -07:00
Marcelo Vanzin	979bf946d5	[SPARK-20655][CORE] In-memory KVStore implementation. This change adds an in-memory implementation of KVStore that can be used by the live UI. The implementation is not fully optimized, neither for speed nor space, but should be fast enough for using in the listener bus. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #18395 from vanzin/SPARK-20655.	2017-08-08 11:02:54 -07:00
hyukjinkwon	08ef7d7187	[MINOR][R][BUILD] More reliable detection of R version for Windows in AppVeyor ## What changes were proposed in this pull request? This PR proposes to use https://rversions.r-pkg.org/r-release-win instead of https://rversions.r-pkg.org/r-release to check R's version for Windows correctly. We met a syncing problem with Windows release (see #15709) before. To cut this short, it was ... - 3.3.2 release was released but not for Windows for few hours. - `https://rversions.r-pkg.org/r-release` returns the latest as 3.3.2 and the download link for 3.3.1 becomes `windows/base/old` by our script - 3.3.2 release for WIndows yet - 3.3.1 is still not in `windows/base/old` but `windows/base` as the latest - Failed to download with `windows/base/old` link and builds were broken I believe this problem is not only what we met. Please see `01ce943929` and also this `r-release-win` API came out between 3.3.1 and 3.3.2 (assuming to deal with this issue), please see `https://github.com/metacran/rversions.app/issues/2`. Using this API will prevent the problem although it looks quite rare assuming from the commit logs in https://github.com/metacran/rversions.app/commits/master. After 3.3.2, both `r-release-win` and `r-release` are being updated together. ## How was this patch tested? AppVeyor tests. Author: hyukjinkwon <gurwls223@gmail.com> Closes #18859 from HyukjinKwon/use-reliable-link.	2017-08-08 23:18:59 +09:00
Liang-Chi Hsieh	ee1304199b	[SPARK-21567][SQL] Dataset should work with type alias ## What changes were proposed in this pull request? If we create a type alias for a type workable with Dataset, the type alias doesn't work with Dataset. A reproducible case looks like: object C { type TwoInt = (Int, Int) def tupleTypeAlias: TwoInt = (1, 1) } Seq(1).toDS().map(_ => ("", C.tupleTypeAlias)) It throws an exception like: type T1 is not a class scala.ScalaReflectionException: type T1 is not a class at scala.reflect.api.Symbols$SymbolApi$class.asClass(Symbols.scala:275) ... This patch accesses the dealias of type in many places in `ScalaReflection` to fix it. ## How was this patch tested? Added test case. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #18813 from viirya/SPARK-21567.	2017-08-08 16:12:41 +08:00
Marcos P. Sanchez	312bebfb6d	[SPARK-21640][FOLLOW-UP][SQL] added errorifexists on IllegalArgumentException message ## What changes were proposed in this pull request? This commit adds a new argument for IllegalArgumentException message. This recent commit added the argument: [`dcac1d57f0`) ## How was this patch tested? Unit test have been passed Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Marcos P. Sanchez <mpenate@stratio.com> Closes #18862 from mpenate/feature/exception-errorifexists.	2017-08-07 22:41:57 -07:00
Yanbo Liang	f763d8464b	[SPARK-19270][FOLLOW-UP][ML] PySpark GLR model.summary should return a printable representation. ## What changes were proposed in this pull request? PySpark GLR ```model.summary``` should return a printable representation by calling Scala ```toString```. ## How was this patch tested? ``` from pyspark.ml.regression import GeneralizedLinearRegression dataset = spark.read.format("libsvm").load("data/mllib/sample_linear_regression_data.txt") glr = GeneralizedLinearRegression(family="gaussian", link="identity", maxIter=10, regParam=0.3) model = glr.fit(dataset) model.summary ``` Before this PR: ![image](https://user-images.githubusercontent.com/1962026/29021059-e221633e-7b96-11e7-8d77-5d53f89c81a9.png) After this PR: ![image](https://user-images.githubusercontent.com/1962026/29021097-fce80fa6-7b96-11e7-8ab4-7e113d447d5d.png) Author: Yanbo Liang <ybliang8@gmail.com> Closes #18870 from yanboliang/spark-19270.	2017-08-08 08:43:58 +08:00
Ajay Saini	fdcee028af	[SPARK-21542][ML][PYTHON] Python persistence helper functions ## What changes were proposed in this pull request? Added DefaultParamsWriteable, DefaultParamsReadable, DefaultParamsWriter, and DefaultParamsReader to Python to support Python-only persistence of Json-serializable parameters. ## How was this patch tested? Instantiated an estimator with Json-serializable parameters (ex. LogisticRegression), saved it using the added helper functions, and loaded it back, and compared it to the original instance to make sure it is the same. This test was both done in the Python REPL and implemented in the unit tests. Note to reviewers: there are a few excess comments that I left in the code for clarity but will remove before the code is merged to master. Author: Ajay Saini <ajays725@gmail.com> Closes #18742 from ajaysaini725/PythonPersistenceHelperFunctions.	2017-08-07 17:03:20 -07:00
gatorsmile	baf5cac0f8	[SPARK-21648][SQL] Fix confusing assert failure in JDBC source when parallel fetching parameters are not properly provided. ### What changes were proposed in this pull request? ```SQL CREATE TABLE mytesttable1 USING org.apache.spark.sql.jdbc OPTIONS ( url 'jdbc:mysql://${jdbcHostname}:${jdbcPort}/${jdbcDatabase}?user=${jdbcUsername}&password=${jdbcPassword}', dbtable 'mytesttable1', paritionColumn 'state_id', lowerBound '0', upperBound '52', numPartitions '53', fetchSize '10000' ) ``` The above option name `paritionColumn` is wrong. That mean, users did not provide the value for `partitionColumn`. In such case, users hit a confusing error. ``` AssertionError: assertion failed java.lang.AssertionError: assertion failed at scala.Predef$.assert(Predef.scala:156) at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:39) at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:312) ``` ### How was this patch tested? Added a test case Author: gatorsmile <gatorsmile@gmail.com> Closes #18864 from gatorsmile/jdbcPartCol.	2017-08-07 13:04:04 -07:00
Jose Torres	cce25b360e	[SPARK-21565][SS] Propagate metadata in attribute replacement. ## What changes were proposed in this pull request? Propagate metadata in attribute replacement during streaming execution. This is necessary for EventTimeWatermarks consuming replaced attributes. ## How was this patch tested? new unit test, which was verified to fail before the fix Author: Jose Torres <joseph-torres@databricks.com> Closes #18840 from joseph-torres/SPARK-21565.	2017-08-07 12:27:16 -07:00
Mac	4f7ec3a316	[SPARK][DOCS] Added note on meaning of position to substring function ## What changes were proposed in this pull request? Enhanced some existing documentation Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Mac <maclockard@gmail.com> Closes #18710 from maclockard/maclockard-patch-1.	2017-08-07 17:16:03 +01:00
Xiao Li	bbfd6b5d24	[SPARK-21647][SQL] Fix SortMergeJoin when using CROSS ### What changes were proposed in this pull request? author: BoleynSu closes https://github.com/apache/spark/pull/18836 ```Scala val df = Seq((1, 1)).toDF("i", "j") df.createOrReplaceTempView("T") withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") { sql("select * from (select a.i from T a cross join T t where t.i = a.i) as t1 " + "cross join T t2 where t2.i = t1.i").explain(true) } ``` The above code could cause the following exception: ``` SortMergeJoinExec should not take Cross as the JoinType java.lang.IllegalArgumentException: SortMergeJoinExec should not take Cross as the JoinType at org.apache.spark.sql.execution.joins.SortMergeJoinExec.outputOrdering(SortMergeJoinExec.scala:100) ``` Our SortMergeJoinExec supports CROSS. We should not hit such an exception. This PR is to fix the issue. ### How was this patch tested? Modified the two existing test cases. Author: Xiao Li <gatorsmile@gmail.com> Author: Boleyn Su <boleyn.su@gmail.com> Closes #18863 from gatorsmile/pr-18836.	2017-08-08 00:00:01 +08:00
zhoukang	8b69b17f3f	[SPARK-21544][DEPLOY][TEST-MAVEN] Tests jar of some module should not upload twice ## What changes were proposed in this pull request? For moudle below: common/network-common streaming sql/core sql/catalyst tests.jar will install or deploy twice.Like: `[DEBUG] Installing org.apache.spark:spark-streaming_2.11/maven-metadata.xml to /home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/maven-metadata-local.xml [INFO] Installing /home/mi/Work/Spark/scala2.11/spark/streaming/target/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar to /home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/2.1.0-mdh2.1.0.1-SNAPSHOT/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar [DEBUG] Skipped re-installing /home/mi/Work/Spark/scala2.11/spark/streaming/target/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar to /home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/2.1.0-mdh2.1.0.1-SNAPSHOT/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar, seems unchanged` The reason is below: `[DEBUG] (f) artifact = org.apache.spark:spark-streaming_2.11🫙2.1.0-mdh2.1.0.1-SNAPSHOT [DEBUG] (f) attachedArtifacts = [org.apache.spark:spark-streaming_2.11:test-jar:tests:2.1.0-mdh2.1.0.1-SNAPSHOT, org.apache.spark:spark-streaming_2.11🫙tests:2.1.0-mdh2.1.0.1-SNAPSHOT, org.apache.spark:spark -streaming_2.11:java-source:sources:2.1.0-mdh2.1.0.1-SNAPSHOT, org.apache.spark:spark-streaming_2.11:java-source:test-sources:2.1.0-mdh2.1.0.1-SNAPSHOT, org.apache.spark:spark-streaming_2.11:javadoc:javadoc:2.1.0 -mdh2.1.0.1-SNAPSHOT]` when executing 'mvn deploy' to nexus during release.I will fail since release nexus can not be overrided. ## How was this patch tested? Execute 'mvn clean install -Pyarn -Phadoop-2.6 -Phadoop-provided -DskipTests' Author: zhoukang <zhoukang199191@gmail.com> Closes #18745 from caneGuy/zhoukang/fix-installtwice.	2017-08-07 12:51:39 +01:00
Peng Meng	1426eea84c	[SPARK-21623][ML] fix RF doc ## What changes were proposed in this pull request? comments of parentStats in RF are wrong. parentStats is not only used for the first iteration, it is used with all the iteration for unordered features. ## How was this patch tested? Author: Peng Meng <peng.meng@intel.com> Closes #18832 from mpjlu/fixRFDoc.	2017-08-07 11:03:07 +01:00
Stavros Kontopoulos	663f30d14a	[SPARK-13041][MESOS] Adds sandbox uri to spark dispatcher ui ## What changes were proposed in this pull request? Adds a sandbox link per driver in the dispatcher ui with minimal changes after a bug was fixed here: https://issues.apache.org/jira/browse/MESOS-4992 The sandbox uri has the following format: http://<proxy_uri>/#/slaves/\<agent-id\>/ frameworks/ \<scheduler-id\>/executors/\<driver-id\>/browse For dc/os the proxy uri is <dc/os uri>/mesos. For the dc/os deployment scenario and to make things easier I introduced a new config property named `spark.mesos.proxy.baseURL` which should be passed to the dispatcher when launched using --conf. If no such configuration is detected then no sandbox uri is depicted, and there is an empty column with a header (this can be changed so nothing is shown). Within dc/os the base url must be a property for the dispatcher that we should add in the future here: `9e7c909c3b/repo/packages/S/spark/26/config.json` It is not easy to detect in different environments what is that uri so user should pass it. ## How was this patch tested? Tested with the mesos test suite here: https://github.com/typesafehub/mesos-spark-integration-tests. Attached image shows the ui modification where the sandbox header is added. ![image](https://user-images.githubusercontent.com/7945591/27831630-2a3b447e-60d4-11e7-87bb-d057efd4efa7.png) Tested the uri redirection the way it was suggested here: https://issues.apache.org/jira/browse/MESOS-4992 Built mesos 1.4 from the master branch and started the mesos dispatcher with the command: `./sbin/start-mesos-dispatcher.sh --conf spark.mesos.proxy.baseURL=http://localhost:5050 -m mesos://127.0.0.1:5050` Run a spark example: `./bin/spark-submit --class org.apache.spark.examples.SparkPi --master mesos://10.10.1.79:7078 --deploy-mode cluster --executor-memory 2G --total-executor-cores 2 http://<path>/spark-examples_2.11-2.1.1.jar 10` Sandbox uri is shown at the bottom of the page: ![image](https://user-images.githubusercontent.com/7945591/28599237-89d0a8c8-71b1-11e7-8f94-41ad117ceead.png) Redirection works as expected: ![image](https://user-images.githubusercontent.com/7945591/28599247-a5d65248-71b1-11e7-8b5e-a0ac2a79fa23.png) Author: Stavros Kontopoulos <st.kontopoulos@gmail.com> Closes #18528 from skonto/adds_the_sandbox_uri.	2017-08-07 10:32:19 +01:00
Xianyang Liu	534a063f7c	[SPARK-21621][CORE] Reset numRecordsWritten after DiskBlockObjectWriter.commitAndGet called ## What changes were proposed in this pull request? We should reset numRecordsWritten to zero after DiskBlockObjectWriter.commitAndGet called. Because when `revertPartialWritesAndClose` be called, we decrease the written records in `ShuffleWriteMetrics` . However, we decreased the written records to zero, this should be wrong, we should only decreased the number reords after the last `commitAndGet` called. ## How was this patch tested? Modified existing test. Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Xianyang Liu <xianyang.liu@intel.com> Closes #18830 from ConeyLiu/DiskBlockObjectWriter.	2017-08-07 17:04:53 +08:00
Sean Owen	39e044e3d8	[MINOR][BUILD] Remove duplicate test-jar:test spark-sql dependency from Hive module ## What changes were proposed in this pull request? Remove duplicate test-jar:test spark-sql dependency from Hive module; move test-jar dependencies together logically. This generates a big warning at the start of the Maven build otherwise. ## How was this patch tested? Existing build. No functional changes here. Author: Sean Owen <sowen@cloudera.com> Closes #18858 from srowen/DupeSqlTestDep.	2017-08-06 16:48:49 -07:00
BartekH	438c381584	Add "full_outer" name to join types I have discovered that "full_outer" name option is working in Spark 2.0, but it is not printed in exception. Please verify. ## What changes were proposed in this pull request? (Please fill in changes proposed in this fix) ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Please review http://spark.apache.org/contributing.html before opening a pull request. Author: BartekH <bartekhamielec@gmail.com> Closes #17985 from BartekH/patch-1.	2017-08-06 16:40:59 -07:00
actuaryzhang	55aa4da285	[SPARK-21622][ML][SPARKR] Support offset in SparkR GLM ## What changes were proposed in this pull request? Support offset in SparkR GLM #16699 Author: actuaryzhang <actuaryzhang10@gmail.com> Closes #18831 from actuaryzhang/sparkROffset.	2017-08-06 15:14:12 -07:00
Takeshi Yamamuro	74b47845ea	[SPARK-20963][SQL][FOLLOW-UP] Use UnresolvedSubqueryColumnAliases for visitTableName ## What changes were proposed in this pull request? This pr (follow-up of #18772) used `UnresolvedSubqueryColumnAliases` for `visitTableName` in `AstBuilder`, which is a new unresolved `LogicalPlan` implemented in #18185. ## How was this patch tested? Existing tests Author: Takeshi Yamamuro <yamamuro@apache.org> Closes #18857 from maropu/SPARK-20963-FOLLOWUP.	2017-08-06 10:14:45 -07:00
Yuming Wang	10b3ca3e93	[SPARK-21574][SQL] Point out user to set hive config before SparkSession is initialized ## What changes were proposed in this pull request? Since Spark 2.0.0, SET hive config commands do not pass the values to HiveClient, this PR point out user to set hive config before SparkSession is initialized when they try to set hive config. ## How was this patch tested? manual tests <img width="1637" alt="spark-set" src="https://user-images.githubusercontent.com/5399861/29001141-03f943ee-7ab3-11e7-8584-ba5a5e81f6ad.png"> Author: Yuming Wang <wgyumg@gmail.com> Closes #18769 from wangyum/SPARK-21574.	2017-08-06 10:08:44 -07:00
Felix Cheung	d4e7f20f54	[SPARKR][BUILD] AppVeyor change to latest R version ## What changes were proposed in this pull request? R version update ## How was this patch tested? AppVeyor Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #18856 from felixcheung/rappveyorver.	2017-08-06 19:51:35 +09:00
vinodkc	1ba967b25e	[SPARK-21588][SQL] SQLContext.getConf(key, null) should return null ## What changes were proposed in this pull request? In SQLContext.get(key,null) for a key that is not defined in the conf, and doesn't have a default value defined, throws a NPE. Int happens only when conf has a value converter Added null check on defaultValue inside SQLConf.getConfString to avoid calling entry.valueConverter(defaultValue) ## How was this patch tested? Added unit test Author: vinodkc <vinod.kc.in@gmail.com> Closes #18852 from vinodkc/br_Fix_SPARK-21588.	2017-08-05 23:04:39 -07:00
Takeshi Yamamuro	990efad1c6	[SPARK-20963][SQL] Support column aliases for join relations in FROM clause ## What changes were proposed in this pull request? This pr added parsing rules to support column aliases for join relations in FROM clause. This pr is a sub-task of #18079. ## How was this patch tested? Added tests in `AnalysisSuite`, `PlanParserSuite,` and `SQLQueryTestSuite`. Author: Takeshi Yamamuro <yamamuro@apache.org> Closes #18772 from maropu/SPARK-20963-2.	2017-08-05 20:35:54 -07:00
hzyaoqin	41568e9a0f	[SPARK-21637][SPARK-21451][SQL] get `spark.hadoop.` properties from sysProps to hiveconf ## What changes were proposed in this pull request? When we use `bin/spark-sql` command configuring `--conf spark.hadoop.foo=bar`, the `SparkSQLCliDriver` initializes an instance of hiveconf, it does not add `foo->bar` to it. this pr gets `spark.hadoop.` properties from sysProps to this hiveconf ## How was this patch tested? UT Author: hzyaoqin <hzyaoqin@corp.netease.com> Author: Kent Yao <yaooqinn@hotmail.com> Closes #18668 from yaooqinn/SPARK-21451.	2017-08-05 17:30:47 -07:00
arodriguez	dcac1d57f0	[SPARK-21640] Add errorifexists as a valid string for ErrorIfExists save mode ## What changes were proposed in this pull request? This PR includes the changes to make the string "errorifexists" also valid for ErrorIfExists save mode. ## How was this patch tested? Unit tests and manual tests Author: arodriguez <arodriguez@arodriguez.stratio> Closes #18844 from ardlema/SPARK-21640.	2017-08-05 11:21:51 -07:00
hyukjinkwon	ba327ee54c	[SPARK-21485][FOLLOWUP][SQL][DOCS] Describes examples and arguments separately, and note/since in SQL built-in function documentation ## What changes were proposed in this pull request? This PR proposes to separate `extended` into `examples` and `arguments` internally so that both can be separately documented and add `since` and `note` for additional information. For `since`, it looks users sometimes get confused by, up to my knowledge, missing version information. For example, see https://www.mail-archive.com/userspark.apache.org/msg64798.html For few good examples to check the built documentation, please see both: `from_json` - https://spark-test.github.io/sparksqldoc/#from_json `like` - https://spark-test.github.io/sparksqldoc/#like For `DESCRIBE FUNCTION`, `note` and `since` are added as below: ``` > DESCRIBE FUNCTION EXTENDED rlike; ... Extended Usage: Arguments: ... Examples: ... Note: Use LIKE to match with simple string pattern ``` ``` > DESCRIBE FUNCTION EXTENDED to_json; ... Examples: ... Since: 2.2.0 ``` For the complete documentation, see https://spark-test.github.io/sparksqldoc/ ## How was this patch tested? Manual tests and existing tests. Please see https://spark-test.github.io/sparksqldoc Jenkins tests are needed to double check Author: hyukjinkwon <gurwls223@gmail.com> Closes #18749 from HyukjinKwon/followup-sql-doc-gen.	2017-08-05 10:10:56 -07:00

1 2 3 4 5 ...

20259 commits