ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
WeichenXu	fb0a8a8dd7	[SPARK-17961][SPARKR][SQL] Add storageLevel to DataFrame for SparkR ## What changes were proposed in this pull request? Add storageLevel to DataFrame for SparkR. This is similar to this RP: https://github.com/apache/spark/pull/13780 but in R I do not make a class for `StorageLevel` but add a method `storageToString` ## How was this patch tested? test added. Author: WeichenXu <WeichenXu123@outlook.com> Closes #15516 from WeichenXu123/storageLevel_df_r.	2016-10-26 13:26:43 -07:00
Yanbo Liang	ea3605e825	[MINOR][ML] Refactor clustering summary. ## What changes were proposed in this pull request? Abstract ```ClusteringSummary``` from ```KMeansSummary```, ```GaussianMixtureSummary``` and ```BisectingSummary```, and eliminate duplicated pieces of code. ## How was this patch tested? Existing tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #15555 from yanboliang/clustering-summary.	2016-10-26 11:48:54 -07:00
Shixiong Zhu	7d10631c16	[SPARK-18104][DOC] Don't build KafkaSource doc ## What changes were proposed in this pull request? Don't need to build doc for KafkaSource because the user should use the data source APIs to use KafkaSource. All KafkaSource APIs are internal. ## How was this patch tested? Verified manually. Author: Shixiong Zhu <shixiong@databricks.com> Closes #15630 from zsxwing/kafka-unidoc.	2016-10-26 11:16:20 -07:00
jiangxingbo	fa7d9d7082	[SPARK-18063][SQL] Failed to infer constraints over multiple aliases ## What changes were proposed in this pull request? The `UnaryNode.getAliasedConstraints` function fails to replace all expressions by their alias where constraints contains more than one expression to be replaced. For example: ``` val tr = LocalRelation('a.int, 'b.string, 'c.int) val multiAlias = tr.where('a === 'c + 10).select('a.as('x), 'c.as('y)) multiAlias.analyze.constraints ``` currently outputs: ``` ExpressionSet(Seq( IsNotNull(resolveColumn(multiAlias.analyze, "x")), IsNotNull(resolveColumn(multiAlias.analyze, "y")) ) ``` The constraint `resolveColumn(multiAlias.analyze, "x") === resolveColumn(multiAlias.analyze, "y") + 10)` is missing. ## How was this patch tested? Add new test cases in `ConstraintPropagationSuite`. Author: jiangxingbo <jiangxb1987@gmail.com> Closes #15597 from jiangxb1987/alias-constraints.	2016-10-26 20:12:20 +02:00
Shixiong Zhu	7ac70e7ba8	[SPARK-13747][SQL] Fix concurrent executions in ForkJoinPool for SQL ## What changes were proposed in this pull request? Calling `Await.result` will allow other tasks to be run on the same thread when using ForkJoinPool. However, SQL uses a `ThreadLocal` execution id to trace Spark jobs launched by a query, which doesn't work perfectly in ForkJoinPool. This PR just uses `Awaitable.result` instead to prevent ForkJoinPool from running other tasks in the current waiting thread. ## How was this patch tested? Jenkins Author: Shixiong Zhu <shixiong@databricks.com> Closes #15520 from zsxwing/SPARK-13747.	2016-10-26 10:36:36 -07:00
Yanbo Liang	312ea3f7f6	[SPARK-17748][FOLLOW-UP][ML] Reorg variables of WeightedLeastSquares. ## What changes were proposed in this pull request? This is follow-up work of #15394. Reorg some variables of ```WeightedLeastSquares``` and fix one minor issue of ```WeightedLeastSquaresSuite```. ## How was this patch tested? Existing tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #15621 from yanboliang/spark-17748.	2016-10-26 09:28:28 -07:00
Mark Grover	4bee954079	[SPARK-18093][SQL] Fix default value test in SQLConfSuite to work rega… …rdless of warehouse dir's existence ## What changes were proposed in this pull request? Appending a trailing slash, if there already isn't one for the sake comparison of the two paths. It doesn't take away from the essence of the check, but removes any potential mismatch due to lack of trailing slash. ## How was this patch tested? Ran unit tests and they passed. Author: Mark Grover <mark@apache.org> Closes #15623 from markgrover/spark-18093.	2016-10-26 09:07:30 -07:00
jiangxingbo	3c023570b2	[SPARK-17733][SQL] InferFiltersFromConstraints rule never terminates for query ## What changes were proposed in this pull request? The function `QueryPlan.inferAdditionalConstraints` and `UnaryNode.getAliasedConstraints` can produce a non-converging set of constraints for recursive functions. For instance, if we have two constraints of the form(where a is an alias): `a = b, a = f(b, c)` Applying both these rules in the next iteration would infer: `f(b, c) = f(f(b, c), c)` This process repeated, the iteration won't converge and the set of constraints will grow larger and larger until OOM. ~~To fix this problem, we collect alias from expressions and skip infer constraints if we are to transform an `Expression` to another which contains it.~~ To fix this problem, we apply additional check in `inferAdditionalConstraints`, when it's possible to generate recursive constraints, we skip generate that. ## How was this patch tested? Add new testcase in `SQLQuerySuite`/`InferFiltersFromConstraintsSuite`. Author: jiangxingbo <jiangxb1987@gmail.com> Closes #15319 from jiangxb1987/constraints.	2016-10-26 17:09:48 +02:00
Shuai Lin	402205ddf7	[SPARK-17802] Improved caller context logging. ## What changes were proposed in this pull request? [SPARK-16757](https://issues.apache.org/jira/browse/SPARK-16757) sets the hadoop `CallerContext` when calling hadoop/hdfs apis to make spark applications more diagnosable in hadoop/hdfs logs. However, the `org.apache.hadoop.ipc.CallerContext` class is only added since [hadoop 2.8](https://issues.apache.org/jira/browse/HDFS-9184), which is not officially releaed yet. So each time `utils.CallerContext.setCurrentContext()` is called (e.g [when a task is created](https://github.com/apache/spark/blob/b678e46/core/src/main/scala/org/apache/spark/scheduler/Task.scala#L95-L96)), a "java.lang.ClassNotFoundException: org.apache.hadoop.ipc.CallerContext" error is logged, which pollutes the spark logs when there are lots of tasks. This patch improves this behaviour by only logging the `ClassNotFoundException` once. ## How was this patch tested? Existing tests. Author: Shuai Lin <linshuai2012@gmail.com> Closes #15377 from lins05/spark-17802-improve-callercontext-logging.	2016-10-26 14:31:47 +02:00
Alex Bozarth	5d0f81da49	[SPARK-4411][WEB UI] Add "kill" link for jobs in the UI ## What changes were proposed in this pull request? Currently users can kill stages via the web ui but not jobs directly (jobs are killed if one of their stages is). I've added the ability to kill jobs via the web ui. This code change is based on #4823 by lianhuiwang and updated to work with the latest code matching how stages are currently killed. In general I've copied the kill stage code warning and note comments and all. I also updated applicable tests and documentation. ## How was this patch tested? Manually tested and dev/run-tests ![screen shot 2016-10-11 at 4 49 43 pm](https://cloud.githubusercontent.com/assets/13952758/19292857/12f1b7c0-8fd4-11e6-8982-210249f7b697.png) Author: Alex Bozarth <ajbozart@us.ibm.com> Author: Lianhui Wang <lianhuiwang09@gmail.com> Closes #15441 from ajbozarth/spark4411.	2016-10-26 14:26:54 +02:00
Sean Owen	2978136475	[SPARK-18027][YARN] .sparkStaging not clean on RM ApplicationNotFoundException ## What changes were proposed in this pull request? Cleanup YARN staging dir on all `KILLED`/`FAILED` paths in `monitorApplication` ## How was this patch tested? Existing tests Author: Sean Owen <sowen@cloudera.com> Closes #15598 from srowen/SPARK-18027.	2016-10-26 14:23:11 +02:00
Sean Owen	6c7d094ec4	[SPARK-18022][SQL] java.lang.NullPointerException instead of real exception when saving DF to MySQL ## What changes were proposed in this pull request? On null next exception in JDBC, don't init it as cause or suppressed ## How was this patch tested? Existing tests Author: Sean Owen <sowen@cloudera.com> Closes #15599 from srowen/SPARK-18022.	2016-10-26 14:19:40 +02:00
gatorsmile	93b8ad184a	[SPARK-17693][SQL] Fixed Insert Failure To Data Source Tables when the Schema has the Comment Field ### What changes were proposed in this pull request? ```SQL CREATE TABLE tab1(col1 int COMMENT 'a', col2 int) USING parquet INSERT INTO TABLE tab1 SELECT 1, 2 ``` The insert attempt will fail if the target table has a column with comments. The error is strange to the external users: ``` assertion failed: No plan for InsertIntoTable Relation[col1#15,col2#16] parquet, false, false +- Project [1 AS col1#19, 2 AS col2#20] +- OneRowRelation$ ``` This PR is to fix the above bug by checking the metadata when comparing the schema between the table and the query. If not matched, we also copy the metadata. This is an alternative to https://github.com/apache/spark/pull/15266 ### How was this patch tested? Added a test case Author: gatorsmile <gatorsmile@gmail.com> Closes #15615 from gatorsmile/insertDataSourceTableWithCommentSolution2.	2016-10-26 00:38:34 -07:00
WeichenXu	12b3e8d2e0	[SPARK-18007][SPARKR][ML] update SparkR MLP - add initalWeights parameter ## What changes were proposed in this pull request? update SparkR MLP, add initalWeights parameter. ## How was this patch tested? test added. Author: WeichenXu <WeichenXu123@outlook.com> Closes #15552 from WeichenXu123/mlp_r_add_initialWeight_param.	2016-10-25 21:42:59 -07:00
hayashidac	c329a568b5	[SPARK-16988][SPARK SHELL] spark history server log needs to be fixed to show https url when ssl is enabled spark history server log needs to be fixed to show https url when ssl is enabled Author: chie8842 <chie@chie-no-Mac-mini.local> Closes #15611 from hayashidac/SPARK-16988.	2016-10-26 07:13:48 +09:00
sethah	2c7394ad09	[SPARK-18019][ML] Add instrumentation to GBTs ## What changes were proposed in this pull request? Add instrumentation for logging in ML GBT, part of umbrella ticket [SPARK-14567](https://issues.apache.org/jira/browse/SPARK-14567) ## How was this patch tested? Tested locally: ```` 16/10/20 10:24:51 INFO Instrumentation: GBTRegressor-gbtr_2b460d3e2e93-1207021668-45: training: numPartitions=1 storageLevel=StorageLevel(1 replicas) 16/10/20 10:24:51 INFO Instrumentation: GBTRegressor-gbtr_2b460d3e2e93-1207021668-45: {"maxIter":1} 16/10/20 10:24:51 INFO Instrumentation: GBTRegressor-gbtr_2b460d3e2e93-1207021668-45: {"numFeatures":2} 16/10/20 10:24:51 INFO Instrumentation: GBTRegressor-gbtr_2b460d3e2e93-1207021668-45: {"numClasses":0} ... 16/10/20 15:54:21 INFO Instrumentation: GBTRegressor-gbtr_065fad465377-1922077832-22: training finished ```` Author: sethah <seth.hendrickson16@gmail.com> Closes #15574 from sethah/gbt_instr.	2016-10-25 13:11:21 -07:00
Wenchen Fan	a21791e316	[SPARK-18070][SQL] binary operator should not consider nullability when comparing input types ## What changes were proposed in this pull request? Binary operator requires its inputs to be of same type, but it should not consider nullability, e.g. `EqualTo` should be able to compare an element-nullable array and an element-non-nullable array. ## How was this patch tested? a regression test in `DataFrameSuite` Author: Wenchen Fan <wenchen@databricks.com> Closes #15606 from cloud-fan/type-bug.	2016-10-25 12:08:17 -07:00
Vinayak	c5fe3dd4f5	[SPARK-18010][CORE] Reduce work performed for building up the application list for the History Server app list UI page ## What changes were proposed in this pull request? allow ReplayListenerBus to skip deserialising and replaying certain events using an inexpensive check of the event log entry. Use this to ensure that when event log replay is triggered for building the application list, we get the ReplayListenerBus to skip over all but the few events needed for our immediate purpose. Refer [SPARK-18010] for the motivation behind this change. ## How was this patch tested? Tested with existing HistoryServer and ReplayListener unit test suites. All tests pass. Please review https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark before opening a pull request. Author: Vinayak <vijoshi5@in.ibm.com> Closes #15556 from vijoshi/SAAS-467_master.	2016-10-25 10:36:03 -07:00
Yanbo Liang	ac8ff920fa	[SPARK-17748][FOLLOW-UP][ML] Fix build error for Scala 2.10. ## What changes were proposed in this pull request? #15394 introduced build error for Scala 2.10, this PR fix it. ## How was this patch tested? Existing test. Author: Yanbo Liang <ybliang8@gmail.com> Closes #15625 from yanboliang/spark-17748-scala.	2016-10-25 10:22:02 -07:00
Zheng RuiFeng	38cdd6ccda	[SPARK-14634][ML][FOLLOWUP] Delete superfluous line in BisectingKMeans ## What changes were proposed in this pull request? As commented by jkbradley in https://github.com/apache/spark/pull/12394, `model.setSummary(summary)` is superfluous ## How was this patch tested? existing tests Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #15619 from zhengruifeng/del_superfluous.	2016-10-25 03:19:50 -07:00
Wenchen Fan	6f31833dbe	[SPARK-18026][SQL] should not always lowercase partition columns of partition spec in parser ## What changes were proposed in this pull request? Currently we always lowercase the partition columns of partition spec in parser, with the assumption that table partition columns are always lowercased. However, this is not true for data source tables, which are case preserving. It's safe for now because data source tables don't store partition spec in metastore and don't support `ADD PARTITION`, `DROP PARTITION`, `RENAME PARTITION`, but we should make our code future-proof. This PR makes partition spec case preserving at parser, and improve the `PreprocessTableInsertion` analyzer rule to normalize the partition columns in partition spec, w.r.t. the table partition columns. ## How was this patch tested? existing tests. Author: Wenchen Fan <wenchen@databricks.com> Closes #15566 from cloud-fan/partition-spec.	2016-10-25 15:00:33 +08:00
sethah	78d740a08a	[SPARK-17748][ML] One pass solver for Weighted Least Squares with ElasticNet ## What changes were proposed in this pull request? 1. Make a pluggable solver interface for `WeightedLeastSquares` 2. Add a `QuasiNewton` solver to handle elastic net regularization for `WeightedLeastSquares` 3. Add method `BLAS.dspmv` used by QN solver 4. Add mechanism for WLS to handle singular covariance matrices by falling back to QN solver when Cholesky fails. ## How was this patch tested? Unit tests - see below. ## Design choices Pluggable Normal Solver Before, the `WeightedLeastSquares` package always used the Cholesky decomposition solver to compute the solution to the normal equations. Now, we specify the solver as a constructor argument to the `WeightedLeastSquares`. We introduce a new trait: ````scala private[ml] sealed trait NormalEquationSolver { def solve( bBar: Double, bbBar: Double, abBar: DenseVector, aaBar: DenseVector, aBar: DenseVector): NormalEquationSolution } ```` We extend this trait for different variants of normal equation solvers. In the future, we can easily add others (like QR) using this interface. Always train in the standardized space The normal solver did not previously standardize the data, but this patch introduces a change such that we always solve the normal equations in the standardized space. We convert back to the original space in the same way that is done for distributed L-BFGS/OWL-QN. We add test cases for zero variance features/labels. Use L-BFGS locally to solve normal equations for singular matrix When linear regression with the normal solver is called for a singular matrix, we initially try to solve with Cholesky. We use the output of `lapack.dppsv` to determine if the matrix is singular. If it is, we fall back to using L-BFGS locally to solve the normal equations. We add test cases for this as well. ## Test cases I found it helpful to enumerate some of the test cases and hopefully it makes review easier. WeightedLeastSquares 1. Constant columns - Cholesky solver fails with no regularization, Auto solver falls back to QN, and QN trains successfully. 2. Collinear features - Cholesky solver fails with no regularization, Auto solver falls back to QN, and QN trains successfully. 3. Label is constant zero - no training is performed regardless of intercept. Coefficients are zero and intercept is zero. 4. Label is constant - if fitIntercept, then no training is performed and intercept equals label mean. If not fitIntercept, then we train and return an answer that matches R's lm package. 5. Test with L1 - go through various combinations of L1/L2, standardization, fitIntercept and verify that output matches glmnet. 6. Initial intercept - verify that setting the initial intercept to label mean is correct by training model with strong L1 regularization so that all coefficients are zero and intercept converges to label mean. 7. Test diagInvAtWA - since we are standardizing features now during training, we should test that the inverse is computed to match R. LinearRegression 1. For all existing L1 test cases, test the "normal" solver too. 2. Check that using the normal solver now handles singular matrices. 3. Check that using the normal solver with L1 produces an objective history in the model summary, but does not produce the inverse of AtA. BLAS 1. Test new method `dspmv`. ## Performance Testing This patch will speed up linear regression with L1/elasticnet penalties when the feature size is < 4096. I have not conducted performance tests at scale, only observed by testing locally that there is a speed improvement. We should decide if this PR needs to be blocked before performance testing is conducted. Author: sethah <seth.hendrickson16@gmail.com> Closes #15394 from sethah/SPARK-17748.	2016-10-24 23:47:59 -07:00
Kay Ousterhout	483c37c581	[SPARK-17894][HOTFIX] Fix broken build from The named parameter in an overridden class isn't supported in Scala 2.10 so was breaking the build. cc zsxwing Author: Kay Ousterhout <kayousterhout@gmail.com> Closes #15617 from kayousterhout/hotfix.	2016-10-24 20:16:00 -07:00
gatorsmile	d479c52622	[SPARK-17409][SQL][FOLLOW-UP] Do Not Optimize Query in CTAS More Than Once ### What changes were proposed in this pull request? This follow-up PR is for addressing the [comment](https://github.com/apache/spark/pull/15048). We added two test cases based on the suggestion from yhuai . One is a new test case using the `saveAsTable` API to create a data source table. Another is for CTAS on Hive serde table. Note: No need to backport this PR to 2.0. Will submit a new PR to backport the whole fix with new test cases to Spark 2.0 ### How was this patch tested? N/A Author: gatorsmile <gatorsmile@gmail.com> Closes #15459 from gatorsmile/ctasOptimizedTestCases.	2016-10-25 10:47:11 +08:00
Wenchen Fan	84a3399908	[SPARK-18028][SQL] simplify TableFileCatalog ## What changes were proposed in this pull request? Simplify/cleanup TableFileCatalog: 1. pass a `CatalogTable` instead of `databaseName` and `tableName` into `TableFileCatalog`, so that we don't need to fetch table metadata from metastore again 2. In `TableFileCatalog.filterPartitions0`, DO NOT set `PartitioningAwareFileCatalog.BASE_PATH_PARAM`. According to the [classdoc](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileCatalog.scala#L189-L209), the default value of `basePath` already satisfies our need. What's more, if we set this parameter, we may break the case 2 which is metioned in the classdoc. 3. add `equals` and `hashCode` to `TableFileCatalog` 4. add `SessionCatalog.listPartitionsByFilter` which handles case sensitivity. ## How was this patch tested? existing tests. Author: Wenchen Fan <wenchen@databricks.com> Closes #15568 from cloud-fan/table-file-catalog.	2016-10-25 08:42:21 +08:00
Tathagata Das	407c3cedf2	[SPARK-17624][SQL][STREAMING][TEST] Fixed flaky StateStoreSuite.maintenance ## What changes were proposed in this pull request? The reason for the flakiness was follows. The test starts the maintenance background thread, and then writes 20 versions of the state store. The maintenance thread is expected to create snapshots in the middle, and clean up old files that are not needed any more. The earliest delta file (1.delta) is expected to be deleted as snapshots will ensure that the earliest delta would not be needed. However, the default configuration for the maintenance thread is to retain files such that last 2 versions can be recovered, and delete the rest. Now while generating the versions, the maintenance thread can kick in and create snapshots anywhere between version 10 and 20 (at least 10 deltas needed for snapshot). Then later it will choose to retain only version 20 and 19 (last 2). There are two cases. - Common case: One of the version between 10 and 19 gets snapshotted. Then recovering versions 19 and 20 just needs 19.snapshot and 20.delta, so 1.delta gets deleted. - Uncommon case (reason for flakiness): Only version 20 gets snapshotted. Then recovering versoin 20 requires 20.snapshot, and recovering version 19 all the previous 19...1.delta. So 1.delta does not get deleted. This PR rearranges the checks such that it create 20 versions, and then waits that there is at least one snapshot, then creates another 20. This will ensure that the latest 2 versions cannot require anything older than the first snapshot generated, and therefore will 1.delta will be deleted. In addition, I have added more logs, and comments that I felt would help future debugging and understanding what is going on. ## How was this patch tested? Ran the StateStoreSuite > 6K times in a heavily loaded machine (10 instances of tests running in parallel). No failures. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #15592 from tdas/SPARK-17624.	2016-10-24 17:21:16 -07:00
Eren Avsarogullari	81d6933e75	[SPARK-17894][CORE] Ensure uniqueness of TaskSetManager name. `TaskSetManager` should have unique name to avoid adding duplicate ones to parent `Pool` via `SchedulableBuilder`. This problem has been surfaced with following discussion: [[PR: Avoid adding duplicate schedulables]](https://github.com/apache/spark/pull/15326) Proposal : There is 1x1 relationship between `stageAttemptId` and `TaskSetManager` so `taskSet.Id` covering both `stageId` and `stageAttemptId` looks to be used for uniqueness of `TaskSetManager` name instead of just `stageId`. Current TaskSetManager Name : `var name = "TaskSet_" + taskSet.stageId.toString` Sample: TaskSet_0 Proposed TaskSetManager Name : `val name = "TaskSet_" + taskSet.Id ` `// taskSet.Id = (stageId + "." + stageAttemptId)` Sample : TaskSet_0.0 Added new Unit Test. Author: erenavsarogullari <erenavsarogullari@gmail.com> Closes #15463 from erenavsarogullari/SPARK-17894.	2016-10-24 15:33:54 -07:00
Sean Owen	4ecbe1b92f	[SPARK-17810][SQL] Default spark.sql.warehouse.dir is relative to local FS but can resolve as HDFS path ## What changes were proposed in this pull request? Always resolve spark.sql.warehouse.dir as a local path, and as relative to working dir not home dir ## How was this patch tested? Existing tests. Author: Sean Owen <sowen@cloudera.com> Closes #15382 from srowen/SPARK-17810.	2016-10-24 10:44:45 +01:00
Zheng RuiFeng	c64a8ff397	[SPARK-18049][MLLIB][TEST] Add missing tests for truePositiveRate and weightedTruePositiveRate ## What changes were proposed in this pull request? Add missing tests for `truePositiveRate` and `weightedTruePositiveRate` in `MulticlassMetricsSuite` ## How was this patch tested? added testing Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #15585 from zhengruifeng/mc_missing_test.	2016-10-24 10:25:24 +01:00
Felix Cheung	3a423f5a03	[SPARKR][BRANCH-2.0] R merge API doc and example fix ## What changes were proposed in this pull request? Fixes for R doc ## How was this patch tested? N/A Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #15589 from felixcheung/rdocmergefix. (cherry picked from commit `0e0d83a597`) Signed-off-by: Felix Cheung <felixcheung@apache.org>	2016-10-23 10:53:43 -07:00
CodingCat	a81fba048f	[SPARK-18058][SQL] Comparing column types ignoring Nullability in Union and SetOperation ## What changes were proposed in this pull request? The PR tries to fix [SPARK-18058](https://issues.apache.org/jira/browse/SPARK-18058) which refers to a bug that the column types are compared with the extra care about Nullability in Union and SetOperation. This PR converts the columns types by setting all fields as nullable before comparison ## How was this patch tested? regular unit test cases Author: CodingCat <zhunansjtu@gmail.com> Closes #15595 from CodingCat/SPARK-18058.	2016-10-23 19:42:11 +02:00
jiangxingbo	b158256c2e	[SPARK-18045][SQL][TESTS] Move `HiveDataFrameAnalyticsSuite` to package `sql` ## What changes were proposed in this pull request? The testsuite `HiveDataFrameAnalyticsSuite` has nothing to do with HIVE, we should move it to package `sql`. The original test cases in that suite are splited into two existing testsuites: `DataFrameAggregateSuite` tests for the functions and ~~`SQLQuerySuite`~~`SQLQueryTestSuite` tests for the SQL statements. ## How was this patch tested? ~~Modified `SQLQuerySuite` in package `sql`.~~ Add query file for `SQLQueryTestSuite`. Author: jiangxingbo <jiangxb1987@gmail.com> Closes #15582 from jiangxb1987/group-analytics-test.	2016-10-23 13:28:35 +02:00
Tejas Patil	21c7539a52	[SPARK-18038][SQL] Move output partitioning definition from UnaryNodeExec to its children ## What changes were proposed in this pull request? Jira : https://issues.apache.org/jira/browse/SPARK-18038 This was a suggestion by rxin over one of the dev list discussion : http://apache-spark-developers-list.1001551.n3.nabble.com/Project-not-preserving-child-partitioning-td19417.html His words: >> It would be better (safer) to move the output partitioning definition into each of the operator and remove it from UnaryExecNode. With this PR, following is the output partitioning and ordering for all the impls of `UnaryExecNode`. UnaryExecNode's impl \| outputPartitioning \| outputOrdering \| comment ------------ \| ------------- \| ------------ \| ------------ AppendColumnsExec \| child's \| Nil \| child's ordering can be used AppendColumnsWithObjectExec \| child's \| Nil \| child's ordering can be used BroadcastExchangeExec \| BroadcastPartitioning \| Nil \| - CoalesceExec \| UnknownPartitioning \| Nil \| - CollectLimitExec \| SinglePartition \| Nil \| - DebugExec \| child's \| Nil \| child's ordering can be used DeserializeToObjectExec \| child's \| Nil \| child's ordering can be used ExpandExec \| UnknownPartitioning \| Nil \| - FilterExec \| child's \| child's \| - FlatMapGroupsInRExec \| child's \| Nil \| child's ordering can be used GenerateExec \| child's \| Nil \| need to dig more GlobalLimitExec \| child's \| child's \| - HashAggregateExec \| child's \| Nil \| - InputAdapter \| child's \| child's \| - InsertIntoHiveTable \| child's \| Nil \| terminal node, doesn't need partitioning LocalLimitExec \| child's \| child's \| - MapElementsExec \| child's \| child's \| - MapGroupsExec \| child's \| Nil \| child's ordering can be used MapPartitionsExec \| child's \| Nil \| child's ordering can be used ProjectExec \| child's \| child's \| - SampleExec \| child's \| Nil \| child's ordering can be used ScriptTransformation \| child's \| Nil \| child's ordering can be used SerializeFromObjectExec \| child's \| Nil \| child's ordering can be used ShuffleExchange \| custom \| Nil \| - SortAggregateExec \| child's \| sort over grouped exprs \| - SortExec \| child's \| custom \| - StateStoreRestoreExec \| child's \| Nil \| child's ordering can be used StateStoreSaveExec \| child's \| Nil \| child's ordering can be used SubqueryExec \| child's \| child's \| - TakeOrderedAndProjectExec \| SinglePartition \| custom \| - WholeStageCodegenExec \| child's \| child's \| - WindowExec \| child's \| child's \| - ## How was this patch tested? This does NOT change any existing functionality so relying on existing tests Author: Tejas Patil <tejasp@fb.com> Closes #15575 from tejasapatil/SPARK-18038_UnaryNodeExec_output_partitioning.	2016-10-23 13:25:47 +02:00
Tejas Patil	eff4aed1ac	[SPARK-18035][SQL] Introduce performant and memory efficient APIs to create ArrayBasedMapData ## What changes were proposed in this pull request? Jira: https://issues.apache.org/jira/browse/SPARK-18035 In HiveInspectors, I saw that converting Java map to Spark's `ArrayBasedMapData` spent quite sometime in buffer copying : https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveInspectors.scala#L658 The reason being `map.toSeq` allocates a new buffer and copies the map entries to it: https://github.com/scala/scala/blob/2.11.x/src/library/scala/collection/MapLike.scala#L323 This copy is not needed as we get rid of it once we extract the key and value arrays. Here is the call trace: ``` org.apache.spark.sql.hive.HiveInspectors$$anonfun$unwrapperFor$41.apply(HiveInspectors.scala:664) scala.collection.AbstractMap.toSeq(Map.scala:59) scala.collection.MapLike$class.toSeq(MapLike.scala:323) scala.collection.AbstractMap.toBuffer(Map.scala:59) scala.collection.MapLike$class.toBuffer(MapLike.scala:326) scala.collection.AbstractTraversable.copyToBuffer(Traversable.scala:104) scala.collection.TraversableOnce$class.copyToBuffer(TraversableOnce.scala:275) scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) scala.collection.AbstractIterable.foreach(Iterable.scala:54) scala.collection.IterableLike$class.foreach(IterableLike.scala:72) scala.collection.AbstractIterator.foreach(Iterator.scala:1336) scala.collection.Iterator$class.foreach(Iterator.scala:893) scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:59) scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:59) ``` Also, earlier code was populating keys and values arrays separately by iterating twice. The PR avoids double iteration of the map and does it in one iteration. EDIT: During code review, there were several more places in the code which were found to do similar thing. The PR dedupes those instances and introduces convenient APIs which are performant and memory efficient ## Performance gains The number is subjective and depends on how many map columns are accessed in the query and average entries per map. For one the queries that I tried out, I saw 3% CPU savings (end-to-end) for the query. ## How was this patch tested? This does not change the end result produced so relying on existing tests. Author: Tejas Patil <tejasp@fb.com> Closes #15573 from tejasapatil/SPARK-18035_avoid_toSeq.	2016-10-22 20:43:43 -07:00
Sandeep Singh	bc167a2a53	[SPARK-928][CORE] Add support for Unsafe-based serializer in Kryo ## What changes were proposed in this pull request? Now since we have migrated to Kryo-3.0.0 in https://issues.apache.org/jira/browse/SPARK-11416, we can gives users option to use unsafe SerDer. It can turned by setting `spark.kryo.useUnsafe` to `true` ## How was this patch tested? Ran existing tests ``` Benchmark Kryo Unsafe vs safe Serialization: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------ basicTypes: Int unsafe:true 160 / 178 98.5 10.1 1.0X basicTypes: Long unsafe:true 210 / 218 74.9 13.4 0.8X basicTypes: Float unsafe:true 203 / 213 77.5 12.9 0.8X basicTypes: Double unsafe:true 226 / 235 69.5 14.4 0.7X Array: Int unsafe:true 1087 / 1101 14.5 69.1 0.1X Array: Long unsafe:true 2758 / 2844 5.7 175.4 0.1X Array: Float unsafe:true 1511 / 1552 10.4 96.1 0.1X Array: Double unsafe:true 2942 / 2972 5.3 187.0 0.1X Map of string->Double unsafe:true 2645 / 2739 5.9 168.2 0.1X basicTypes: Int unsafe:false 211 / 218 74.7 13.4 0.8X basicTypes: Long unsafe:false 247 / 253 63.6 15.7 0.6X basicTypes: Float unsafe:false 211 / 216 74.5 13.4 0.8X basicTypes: Double unsafe:false 227 / 233 69.2 14.4 0.7X Array: Int unsafe:false 3012 / 3032 5.2 191.5 0.1X Array: Long unsafe:false 4463 / 4515 3.5 283.8 0.0X Array: Float unsafe:false 2788 / 2868 5.6 177.2 0.1X Array: Double unsafe:false 3558 / 3752 4.4 226.2 0.0X Map of string->Double unsafe:false 2806 / 2933 5.6 178.4 0.1X ``` Author: Sandeep Singh <sandeep@techaddict.me> Author: Sandeep Singh <sandeep@origamilogic.com> Closes #12913 from techaddict/SPARK-928.	2016-10-22 12:03:37 -07:00
WeichenXu	4f1dcd3dce	[SPARK-18051][SPARK CORE] fix bug of custom PartitionCoalescer causing serialization exception ## What changes were proposed in this pull request? add a require check in `CoalescedRDD` to make sure the passed in `partitionCoalescer` to be `serializable`. and update the document for api `RDD.coalesce` ## How was this patch tested? Manual.(test code in jira [SPARK-18051]) Author: WeichenXu <WeichenXu123@outlook.com> Closes #15587 from WeichenXu123/fix_coalescer_bug.	2016-10-22 11:59:28 -07:00
hyukjinkwon	5fa9f8795a	[SPARK-17123][SQL] Use type-widened encoder for DataFrame rather than existing encoder to allow type-widening from set operations # What changes were proposed in this pull request? This PR fixes set operations in `DataFrame` to be performed fine without exceptions when the types are non-scala native types. (e.g, `TimestampType`, `DateType` and `DecimalType`). The problem is, it seems set operations such as `union`, `intersect` and `except` uses the encoder belonging to the `Dataset` in caller. So, `Dataset` of the caller holds `ExpressionEncoder[Row]` as it is when the set operations are performed. However, the return types can be actually widen. So, we should use `ExpressionEncoder[Row]` constructed from executed plan rather than using existing one. Otherwise, this will generate some codes wrongly via `StaticInvoke`. Running the codes below: ```scala val dates = Seq( (new Date(0), BigDecimal.valueOf(1), new Timestamp(2)), (new Date(3), BigDecimal.valueOf(4), new Timestamp(5)) ).toDF("date", "timestamp", "decimal") val widenTypedRows = Seq( (new Timestamp(2), 10.5D, "string") ).toDF("date", "timestamp", "decimal") val results = dates.union(widenTypedRows).collect() results.foreach(println) ``` prints below: Before ```java 23:08:54.490 ERROR org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 28, Column 107: No applicable constructor/method found for actual parameters "long"; candidates are: "public static java.sql.Date org.apache.spark.sql.catalyst.util.DateTimeUtils.toJavaDate(int)" /* 001 / public java.lang.Object generate(Object[] references) { / 002 / return new SpecificSafeProjection(references); / 003 / } / 004 / / 005 / class SpecificSafeProjection extends org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection { / 006 / / 007 / private Object[] references; / 008 / private MutableRow mutableRow; / 009 / private Object[] values; / 010 / private org.apache.spark.sql.types.StructType schema; / 011 / / 012 / / 013 / public SpecificSafeProjection(Object[] references) { / 014 / this.references = references; / 015 / mutableRow = (MutableRow) references[references.length - 1]; / 016 / / 017 / this.schema = (org.apache.spark.sql.types.StructType) references[0]; / 018 / } / 019 / / 020 / public java.lang.Object apply(java.lang.Object _i) { / 021 / InternalRow i = (InternalRow) _i; / 022 / / 023 / values = new Object[3]; / 024 / / 025 / boolean isNull2 = i.isNullAt(0); / 026 / long value2 = isNull2 ? -1L : (i.getLong(0)); / 027 / boolean isNull1 = isNull2; / 028 / final java.sql.Date value1 = isNull1 ? null : org.apache.spark.sql.catalyst.util.DateTimeUtils.toJavaDate(value2); / 029 / isNull1 = value1 == null; / 030 / if (isNull1) { / 031 / values[0] = null; / 032 / } else { / 033 / values[0] = value1; / 034 / } / 035 / / 036 / boolean isNull4 = i.isNullAt(1); / 037 / double value4 = isNull4 ? -1.0 : (i.getDouble(1)); / 038 / / 039 / boolean isNull3 = isNull4; / 040 / java.math.BigDecimal value3 = null; / 041 / if (!isNull3) { / 042 / / 043 / Object funcResult = null; / 044 / funcResult = value4.toJavaBigDecimal(); / 045 / if (funcResult == null) { / 046 / isNull3 = true; / 047 / } else { / 048 / value3 = (java.math.BigDecimal) funcResult; / 049 / } / 050 / / 051 / } / 052 / isNull3 = value3 == null; / 053 / if (isNull3) { / 054 / values[1] = null; / 055 / } else { / 056 / values[1] = value3; / 057 / } / 058 / / 059 / boolean isNull6 = i.isNullAt(2); / 060 / UTF8String value6 = isNull6 ? null : (i.getUTF8String(2)); / 061 / boolean isNull5 = isNull6; / 062 / final java.sql.Timestamp value5 = isNull5 ? null : org.apache.spark.sql.catalyst.util.DateTimeUtils.toJavaTimestamp(value6); / 063 / isNull5 = value5 == null; / 064 / if (isNull5) { / 065 / values[2] = null; / 066 / } else { / 067 / values[2] = value5; / 068 / } / 069 / / 070 / final org.apache.spark.sql.Row value = new org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema(values, schema); / 071 / if (false) { / 072 / mutableRow.setNullAt(0); / 073 / } else { / 074 / / 075 / mutableRow.update(0, value); / 076 / } / 077 / / 078 / return mutableRow; / 079 / } / 080 / } ``` After* ```bash [1969-12-31 00:00:00.0,1.0,1969-12-31 16:00:00.002] [1969-12-31 00:00:00.0,4.0,1969-12-31 16:00:00.005] [1969-12-31 16:00:00.002,10.5,string] ``` ## How was this patch tested? Unit tests in `DataFrameSuite` Author: hyukjinkwon <gurwls223@gmail.com> Closes #15072 from HyukjinKwon/SPARK-17123.	2016-10-22 20:09:04 +02:00
Eric Liang	3eca283aca	[SPARK-17994][SQL] Add back a file status cache for catalog tables ## What changes were proposed in this pull request? In SPARK-16980, we removed the full in-memory cache of table partitions in favor of loading only needed partitions from the metastore. This greatly improves the initial latency of queries that only read a small fraction of table partitions. However, since the metastore does not store file statistics, we need to discover those from remote storage. With the loss of the in-memory file status cache this has to happen on each query, increasing the latency of repeated queries over the same partitions. The proposal is to add back a per-table cache of partition contents, i.e. Map[Path, Array[FileStatus]]. This cache would be retained per-table, and can be invalidated through refreshTable() and refreshByPath(). Unlike the prior cache, it can be incrementally updated as new partitions are read. ## How was this patch tested? Existing tests and new tests in `HiveTablePerfStatsSuite`. cc mallman Author: Eric Liang <ekl@databricks.com> Author: Michael Allman <michael@videoamp.com> Author: Eric Liang <ekhliang@gmail.com> Closes #15539 from ericl/meta-cache.	2016-10-22 22:08:28 +08:00
Drew Robb	ab3363e9f6	[SPARK-17986][ML] SQLTransformer should remove temporary tables ## What changes were proposed in this pull request? A call to the method `SQLTransformer.transform` previously would create a temporary table and never delete it. This change adds a call to `dropTempView()` that deletes this temporary table before returning the result so that the table will not remain in spark's table catalog. Because `tableName` is randomized and not exposed, there should be no expected use of this table outside of the `transform` method. ## How was this patch tested? A single new assertion was added to the existing test of the `SQLTransformer.transform` method that all temporary tables are removed. Without the corresponding code change, this new assertion fails. I am not aware of any circumstances in which removing this temporary view would be bad for performance or correctness in other ways, but some expertise here would be helpful. Author: Drew Robb <drewrobb@gmail.com> Closes #15526 from drewrobb/SPARK-17986.	2016-10-22 01:59:36 -07:00
Sean Owen	01b26a0643	[SPARK-17898][DOCS] repositories needs username and password ## What changes were proposed in this pull request? Document `user:password` syntax as possible means of specifying credentials for password-protected `--repositories` ## How was this patch tested? Doc build Author: Sean Owen <sowen@cloudera.com> Closes #15584 from srowen/SPARK-17898.	2016-10-22 09:39:07 +01:00
Erik O'Shaughnessy	625fdddacd	[SPARK-17944][DEPLOY] sbin/start-* scripts use of `hostname -f` fail with Solaris ## What changes were proposed in this pull request? Modify sbin/start-master.sh, sbin/start-mesos-dispatcher.sh and sbin/start-slaves.sh to use the output of 'uname' to select which OS-specific command-line is used to determine the host's fully qualified host name. ## How was this patch tested? Tested by hand; starting on Solaris, Linux and macOS. Author: Erik O'Shaughnessy <erik.oshaughnessy@gmail.com> Closes #15557 from JnyJny/SPARK-17944.	2016-10-22 09:37:53 +01:00
Sean Owen	7178c56433	[SPARK-16606][MINOR] Tiny follow-up to , to correct more instances of the same log message typo ## What changes were proposed in this pull request? Tiny follow-up to SPARK-16606 / https://github.com/apache/spark/pull/14533 , to correct more instances of the same log message typo ## How was this patch tested? Existing tests (no functional change anyway) Author: Sean Owen <sowen@cloudera.com> Closes #15586 from srowen/SPARK-16606.2.	2016-10-21 22:20:52 -07:00
Reynold Xin	3fbf5a58c2	[SPARK-18042][SQL] OutputWriter should expose file path written ## What changes were proposed in this pull request? This patch adds a new "path" method on OutputWriter that returns the path of the file written by the OutputWriter. This is part of the necessary work to consolidate structured streaming and batch write paths. The batch write path has a nice feature that each data source can define the extension of the files, and allow Spark to specify the staging directory and the prefix for the files. However, in the streaming path we need to collect the list of files written, and there is no interface right now to do that. ## How was this patch tested? N/A - there is no behavior change and this should be covered by existing tests. Author: Reynold Xin <rxin@databricks.com> Closes #15580 from rxin/SPARK-18042.	2016-10-21 17:27:18 -07:00
cody koeninger	c9720b2195	[STREAMING][KAFKA][DOC] clarify kafka settings needed for larger batches ## What changes were proposed in this pull request? Minor doc change to mention kafka configuration for larger spark batches. ## How was this patch tested? Doc change only, confirmed via jekyll. The configuration issue was discussed / confirmed with users on the mailing list. Author: cody koeninger <cody@koeninger.org> Closes #15570 from koeninger/kafka-doc-heartbeat.	2016-10-21 16:27:19 -07:00
cody koeninger	268ccb9a48	[SPARK-17812][SQL][KAFKA] Assign and specific startingOffsets for structured stream ## What changes were proposed in this pull request? startingOffsets takes specific per-topicpartition offsets as a json argument, usable with any consumer strategy assign with specific topicpartitions as a consumer strategy ## How was this patch tested? Unit tests Author: cody koeninger <cody@koeninger.org> Closes #15504 from koeninger/SPARK-17812.	2016-10-21 15:55:04 -07:00
Wenchen Fan	140570252f	[SPARK-18044][STREAMING] FileStreamSource should not infer partitions in every batch ## What changes were proposed in this pull request? In `FileStreamSource.getBatch`, we will create a `DataSource` with specified schema, to avoid inferring the schema again and again. However, we don't pass the partition columns, and will infer the partition again and again. This PR fixes it by keeping the partition columns in `FileStreamSource`, like schema. ## How was this patch tested? N/A Author: Wenchen Fan <wenchen@databricks.com> Closes #15581 from cloud-fan/stream.	2016-10-21 15:28:16 -07:00
w00228970	c1f344f1a0	[SPARK-17929][CORE] Fix deadlock when CoarseGrainedSchedulerBackend reset ## What changes were proposed in this pull request? https://issues.apache.org/jira/browse/SPARK-17929 Now `CoarseGrainedSchedulerBackend` reset will get the lock, ``` protected def reset(): Unit = synchronized { numPendingExecutors = 0 executorsPendingToRemove.clear() // Remove all the lingering executors that should be removed but not yet. The reason might be // because (1) disconnected event is not yet received; (2) executors die silently. executorDataMap.toMap.foreach { case (eid, _) => driverEndpoint.askWithRetry[Boolean]( RemoveExecutor(eid, SlaveLost("Stale executor after cluster manager re-registered."))) } } ``` but on removeExecutor also need the lock "CoarseGrainedSchedulerBackend.this.synchronized", this will cause deadlock. ``` private def removeExecutor(executorId: String, reason: ExecutorLossReason): Unit = { logDebug(s"Asked to remove executor $executorId with reason $reason") executorDataMap.get(executorId) match { case Some(executorInfo) => // This must be synchronized because variables mutated // in this block are read when requesting executors val killed = CoarseGrainedSchedulerBackend.this.synchronized { addressToExecutorId -= executorInfo.executorAddress executorDataMap -= executorId executorsPendingLossReason -= executorId executorsPendingToRemove.remove(executorId).getOrElse(false) } ... ## How was this patch tested? manual test. Author: w00228970 <wangfei1@huawei.com> Closes #15481 from scwf/spark-17929.	2016-10-21 14:43:55 -07:00
Tathagata Das	7a531e3054	[SPARK-17926][SQL][STREAMING] Added json for statuses ## What changes were proposed in this pull request? StreamingQueryStatus exposed through StreamingQueryListener often needs to be recorded (similar to SparkListener events). This PR adds `.json` and `.prettyJson` to `StreamingQueryStatus`, `SourceStatus` and `SinkStatus`. ## How was this patch tested? New unit tests Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #15476 from tdas/SPARK-17926.	2016-10-21 13:07:29 -07:00
Hossein	e371040a01	[SPARK-17811] SparkR cannot parallelize data.frame with NA or NULL in Date columns ## What changes were proposed in this pull request? NA date values are serialized as "NA" and NA time values are serialized as NaN from R. In the backend we did not have proper logic to deal with them. As a result we got an IllegalArgumentException for Date and wrong value for time. This PR adds support for deserializing NA as Date and Time. ## How was this patch tested? * [x] TODO Author: Hossein <hossein@databricks.com> Closes #15421 from falaki/SPARK-17811.	2016-10-21 12:38:52 -07:00
Felix Cheung	e21e1c946c	[SPARK-18013][SPARKR] add crossJoin API ## What changes were proposed in this pull request? Add crossJoin and do not default to cross join if joinExpr is left out ## How was this patch tested? unit test Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #15559 from felixcheung/rcrossjoin.	2016-10-21 12:35:37 -07:00

... 4 5 6 7 8 ...

18230 commits