ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Takeshi YAMAMURO	1b39fafa75	[SPARK-13361][SQL] Add benchmark codes for Encoder#compress() in CompressionSchemeBenchmark This pr added benchmark codes for Encoder#compress(). Also, it replaced the benchmark results with new ones because the output format of `Benchmark` changed. Author: Takeshi YAMAMURO <linguin.m.s@gmail.com> Closes #11236 from maropu/CompressionSpike.	2016-02-25 20:17:48 -08:00
Josh Rosen	633d63a48a	[SPARK-12757] Add block-level read/write locks to BlockManager ## Motivation As a pre-requisite to off-heap caching of blocks, we need a mechanism to prevent pages / blocks from being evicted while they are being read. With on-heap objects, evicting a block while it is being read merely leads to memory-accounting problems (because we assume that an evicted block is a candidate for garbage-collection, which will not be true during a read), but with off-heap memory this will lead to either data corruption or segmentation faults. ## Changes ### BlockInfoManager and reader/writer locks This patch adds block-level read/write locks to the BlockManager. It introduces a new `BlockInfoManager` component, which is contained within the `BlockManager`, holds the `BlockInfo` objects that the `BlockManager` uses for tracking block metadata, and exposes APIs for locking blocks in either shared read or exclusive write modes. `BlockManager`'s `get()` and `put()` methods now implicitly acquire the necessary locks. After a `get()` call successfully retrieves a block, that block is locked in a shared read mode. A `put()` call will block until it acquires an exclusive write lock. If the write succeeds, the write lock will be downgraded to a shared read lock before returning to the caller. This `put()` locking behavior allows us store a block and then immediately turn around and read it without having to worry about it having been evicted between the write and the read, which will allow us to significantly simplify `CacheManager` in the future (see #10748). See `BlockInfoManagerSuite`'s test cases for a more detailed specification of the locking semantics. ### Auto-release of locks at the end of tasks Our locking APIs support explicit release of locks (by calling `unlock()`), but it's not always possible to guarantee that locks will be released prior to the end of the task. One reason for this is our iterator interface: since our iterators don't support an explicit `close()` operator to signal that no more records will be consumed, operations like `take()` or `limit()` don't have a good means to release locks on their input iterators' blocks. Another example is broadcast variables, whose block locks can only be released at the end of the task. To address this, `BlockInfoManager` uses a pair of maps to track the set of locks acquired by each task. Lock acquisitions automatically record the current task attempt id by obtaining it from `TaskContext`. When a task finishes, code in `Executor` calls `BlockInfoManager.unlockAllLocksForTask(taskAttemptId)` to free locks. ### Locking and the MemoryStore In order to prevent in-memory blocks from being evicted while they are being read, the `MemoryStore`'s `evictBlocksToFreeSpace()` method acquires write locks on blocks which it is considering as candidates for eviction. These lock acquisitions are non-blocking, so a block which is being read will not be evicted. By holding write locks until the eviction is performed or skipped (in case evicting the blocks would not free enough memory), we avoid a race where a new reader starts to read a block after the block has been marked as an eviction candidate but before it has been removed. ### Locking and remote block transfer This patch makes small changes to to block transfer and network layer code so that locks acquired by the BlockTransferService are released as soon as block transfer messages are consumed and released by Netty. This builds on top of #11193, a bug fix related to freeing of network layer ManagedBuffers. ## FAQ - Why not use Java's built-in [`ReadWriteLock`](https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/locks/ReadWriteLock.html)? Our locks operate on a per-task rather than per-thread level. Under certain circumstances a task may consist of multiple threads, so using `ReadWriteLock` would mean that we might call `unlock()` from a thread which didn't hold the lock in question, an operation which has undefined semantics. If we could rely on Java 8 classes, we might be able to use [`StampedLock`](https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/locks/StampedLock.html) to work around this issue. - Why not detect "leaked" locks in tests?: See above notes about `take()` and `limit`. Author: Josh Rosen <joshrosen@databricks.com> Closes #10705 from JoshRosen/pin-pages.	2016-02-25 17:17:56 -08:00
Timothy Chen	712995757c	[SPARK-13387][MESOS] Add support for SPARK_DAEMON_JAVA_OPTS with MesosClusterDispatcher. ## What changes were proposed in this pull request? Add support for SPARK_DAEMON_JAVA_OPTS with MesosClusterDispatcher. ## How was the this patch tested? Manual testing by launching dispatcher with SPARK_DAEMON_JAVA_OPTS Author: Timothy Chen <tnachen@gmail.com> Closes #11277 from tnachen/cluster_dispatcher_opts.	2016-02-25 17:07:58 -08:00
Josh Rosen	f2cfafdfe0	[SPARK-13501] Remove use of Guava Stopwatch Our nightly doc snapshot builds are failing due to some issue involving the Guava Stopwatch constructor: ``` [error] /home/jenkins/workspace/spark-master-docs/spark/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/CoarseMesosSchedulerBackend.scala:496: constructor Stopwatch in class Stopwatch cannot be accessed in class CoarseMesosSchedulerBackend [error] val stopwatch = new Stopwatch() [error] ^ ``` This Stopwatch constructor was deprecated in newer versions of Guava (`fd0cbc2c5c`) and it's possible that some classpath issues affecting Unidoc could be causing this to trigger compilation failures. In order to work around this issue, this patch removes this use of Stopwatch since we don't use it anywhere else in the Spark codebase. Author: Josh Rosen <joshrosen@databricks.com> Closes #11376 from JoshRosen/remove-stopwatch.	2016-02-25 17:04:43 -08:00
hushan	7a6ee8a8fe	[SPARK-12009][YARN] Avoid to re-allocating yarn container while driver want to stop all Executors Author: hushan <hushan@xiaomi.com> Closes #9992 from suyanNone/tricky.	2016-02-25 16:57:41 -08:00
Liwei Lin	dc6c5ea4c9	[SPARK-13468][WEB UI] Fix a corner case where the Stage UI page should show DAG but it doesn't show When uses clicks more than one time on any stage in the DAG graph on the Job web UI page, many new Stage web UI pages are opened, but only half of their DAG graphs are expanded. After this PR's fix, every newly opened Stage page's DAG graph is expanded. Before: ![](https://cloud.githubusercontent.com/assets/15843379/13279144/74808e86-db10-11e5-8514-cecf31af8908.png) After: ![](https://cloud.githubusercontent.com/assets/15843379/13279145/77ca5dec-db10-11e5-9457-8e1985461328.png) ## What changes were proposed in this pull request? - Removed the `expandDagViz` parameter for _Stage_ page and related codes - Added a `onclick` function setting `expandDagVizArrowKey(false)` as `true` ## How was this patch tested? Manual tests (with this fix) to verified this fix work: - clicked many times on _Job_ Page's DAG Graph → each newly opened Stage page's DAG graph is expanded Manual tests (with this fix) to verified this fix do not break features we already had: - refreshed many times for a same _Stage_ page (whose DAG already expanded) → DAG remained expanded upon every refresh - refreshed many times for a same _Stage_ page (whose DAG unexpanded) → DAG remained unexpanded upon every refresh - refreshed many times for a same _Job_ page (whose DAG already expanded) → DAG remained expanded upon every refresh - refreshed many times for a same _Job_ page (whose DAG unexpanded) → DAG remained unexpanded upon every refresh Author: Liwei Lin <proflin.me@gmail.com> Closes #11368 from proflin/SPARK-13468.	2016-02-25 15:36:25 -08:00
Yu ISHIKAWA	35316cb0b7	[SPARK-13292] [ML] [PYTHON] QuantileDiscretizer should take random seed in PySpark ## What changes were proposed in this pull request? QuantileDiscretizer in Python should also specify a random seed. ## How was this patch tested? unit tests Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #11362 from yu-iskw/SPARK-13292 and squashes the following commits: 02ffa76 [Yu ISHIKAWA] [SPARK-13292][ML][PYTHON] QuantileDiscretizer should take random seed in PySpark	2016-02-25 13:29:10 -08:00
Yu ISHIKAWA	14e2700de2	[SPARK-12874][ML] ML StringIndexer does not protect itself from column name duplication ## What changes were proposed in this pull request? ML StringIndexer does not protect itself from column name duplication. We should still improve a way to validate a schema of `StringIndexer` and `StringIndexerModel`. However, it would be great to fix at another issue. ## How was this patch tested? unit test Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #11370 from yu-iskw/SPARK-12874.	2016-02-25 13:21:33 -08:00
Lin Zhao	fb8bb04766	[SPARK-13069][STREAMING] Add "ask" style store() to ActorReciever Introduces a "ask" style ```store``` in ```ActorReceiver``` as a way to allow actor receiver blocked by back pressure or maxRate. Author: Lin Zhao <lin@exabeam.com> Closes #11176 from lin-zhao/SPARK-13069.	2016-02-25 12:32:24 -08:00
Davies Liu	751724b132	Revert "[SPARK-13457][SQL] Removes DataFrame RDD operations" This reverts commit `157fe64f3e`.	2016-02-25 11:53:48 -08:00
Shixiong Zhu	46f6e79316	Revert "[SPARK-13117][WEB UI] WebUI should use the local ip not 0.0.0.0" This reverts commit `2e44031faf`.	2016-02-25 11:39:26 -08:00
huangzhaowei	5fcf4c2bfc	[SPARK-12316] Wait a minutes to avoid cycle calling. When application end, AM will clean the staging dir. But if the driver trigger to update the delegation token, it will can't find the right token file and then it will endless cycle call the method 'updateCredentialsIfRequired'. Then it lead driver StackOverflowError. https://issues.apache.org/jira/browse/SPARK-12316 Author: huangzhaowei <carlmartinmax@gmail.com> Closes #10475 from SaintBacchus/SPARK-12316.	2016-02-25 09:14:19 -06:00
Cheng Lian	157fe64f3e	[SPARK-13457][SQL] Removes DataFrame RDD operations ## What changes were proposed in this pull request? This PR removes DataFrame RDD operations. Original calls are now replaced by calls to methods of `DataFrame.rdd`. ## How was the this patch tested? No extra tests are added. Existing tests should do the work. Author: Cheng Lian <lian@databricks.com> Closes #11323 from liancheng/remove-df-rdd-ops.	2016-02-25 23:07:59 +08:00
Yanbo Liang	4460113d41	[SPARK-13490][ML] ML LinearRegression should cache standardization param value ## What changes were proposed in this pull request? Like #11027 for ```LogisticRegression```, ```LinearRegression``` with L1 regularization should also cache the value of the ```standardization``` rather than re-fetching it from the ```ParamMap``` for every OWLQN iteration. cc srowen ## How was this patch tested? No extra tests are added. It should pass all existing tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #11367 from yanboliang/spark-13490.	2016-02-25 13:34:29 +00:00
Michael Gummelt	c98a93ded3	[SPARK-13439][MESOS] Document that spark.mesos.uris is comma-separated Author: Michael Gummelt <mgummelt@mesosphere.io> Closes #11311 from mgummelt/document_csv.	2016-02-25 13:32:09 +00:00
Terence Yim	fae88af184	[SPARK-13441][YARN] Fix NPE in yarn Client.createConfArchive method ## What changes were proposed in this pull request? Instead of using result of File.listFiles() directly, which may throw NPE, check for null first. If it is null, log a warning instead ## How was the this patch tested? Ran the ./dev/run-tests locally Tested manually on a cluster Author: Terence Yim <terence@cask.co> Closes #11337 from chtyim/fixes/SPARK-13441-null-check.	2016-02-25 13:29:30 +00:00
Oliver Pierson	6f8e835c68	[SPARK-13444][MLLIB] QuantileDiscretizer chooses bad splits on large DataFrames ## What changes were proposed in this pull request? Change line 113 of QuantileDiscretizer.scala to `val requiredSamples = math.max(numBins * numBins, 10000.0)` so that `requiredSamples` is a `Double`. This will fix the division in line 114 which currently results in zero if `requiredSamples < dataset.count` ## How was the this patch tested? Manual tests. I was having a problems using QuantileDiscretizer with my a dataset and after making this change QuantileDiscretizer behaves as expected. Author: Oliver Pierson <ocp@gatech.edu> Author: Oliver Pierson <opierson@umd.edu> Closes #11319 from oliverpierson/SPARK-13444.	2016-02-25 13:24:46 +00:00
Cheng Lian	3fa6491be6	[SPARK-13473][SQL] Don't push predicate through project with nondeterministic field(s) ## What changes were proposed in this pull request? Predicates shouldn't be pushed through project with nondeterministic field(s). See https://github.com/graphframes/graphframes/pull/23 and SPARK-13473 for more details. This PR targets master, branch-1.6, and branch-1.5. ## How was this patch tested? A test case is added in `FilterPushdownSuite`. It constructs a query plan where a filter is over a project with a nondeterministic field. Optimized query plan shouldn't change in this case. Author: Cheng Lian <lian@databricks.com> Closes #11348 from liancheng/spark-13473-no-ppd-through-nondeterministic-project-field.	2016-02-25 20:43:03 +08:00
Devaraj K	2e44031faf	[SPARK-13117][WEB UI] WebUI should use the local ip not 0.0.0.0 Fixed the HTTP Server Host Name/IP issue i.e. HTTP Server to take the configured host name/IP and not '0.0.0.0' always. Author: Devaraj K <devaraj@apache.org> Closes #11133 from devaraj-kavali/SPARK-13117.	2016-02-25 12:18:43 +00:00
Reynold Xin	2b2c8c3323	[SPARK-13486][SQL] Move SQLConf into an internal package ## What changes were proposed in this pull request? This patch moves SQLConf into org.apache.spark.sql.internal package to make it very explicit that it is internal. Soon I will also submit more API work that creates implementations of interfaces in this internal package. ## How was this patch tested? If it compiles, then the refactoring should work. Author: Reynold Xin <rxin@databricks.com> Closes #11363 from rxin/SPARK-13486.	2016-02-25 17:49:50 +08:00
Davies Liu	07f92ef1fa	[SPARK-13376] [SPARK-13476] [SQL] improve column pruning ## What changes were proposed in this pull request? This PR mostly rewrite the ColumnPruning rule to support most of the SQL logical plans (except those for Dataset). This PR also fix a bug in Generate, it should always output UnsafeRow, added an regression test for that. ## How was this patch tested? This is test by unit tests, also manually test with TPCDS Q78, which could prune all unused columns successfully, improved the performance by 78% (from 22s to 12s). Author: Davies Liu <davies@databricks.com> Closes #11354 from davies/fix_column_pruning.	2016-02-25 00:13:07 -08:00
huangzhaowei	264533b553	[SPARK-13482][MINOR][CONFIGURATION] Make consistency of the configuraiton named in TransportConf. `spark.storage.memoryMapThreshold` has two kind of the value, one is 210241024 as integer and the other one is '2m' as string. "2m" is recommanded in document but it will go wrong if the code goes into `TransportConf#memoryMapBytes`. [Jira](https://issues.apache.org/jira/browse/SPARK-13482) Author: huangzhaowei <carlmartinmax@gmail.com> Closes #11360 from SaintBacchus/SPARK-13482.	2016-02-24 23:52:17 -08:00
Kai Jiang	4d2864b2d7	[SPARK-7106][MLLIB][PYSPARK] Support model save/load in Python's FPGrowth ## What changes were proposed in this pull request? Python API supports mode save/load in FPGrowth JIRA: [https://issues.apache.org/jira/browse/SPARK-7106](https://issues.apache.org/jira/browse/SPARK-7106) ## How was the this patch tested? The patch is tested with Python doctest. Author: Kai Jiang <jiangkai@gmail.com> Closes #11321 from vectorijk/spark-7106.	2016-02-24 23:22:14 -08:00
Joseph K. Bradley	13ce10e954	[SPARK-13479][SQL][PYTHON] Added Python API for approxQuantile ## What changes were proposed in this pull request? * Scala DataFrameStatFunctions: Added version of approxQuantile taking a List instead of an Array, for Python compatbility * Python DataFrame and DataFrameStatFunctions: Added approxQuantile ## How was this patch tested? * unit test in sql/tests.py Documentation was copied from the existing approxQuantile exactly. Author: Joseph K. Bradley <joseph@databricks.com> Closes #11356 from jkbradley/approx-quantile-python.	2016-02-24 23:15:36 -08:00
Michael Armbrust	2b042577fb	[SPARK-13092][SQL] Add ExpressionSet for constraint tracking This PR adds a new abstraction called an `ExpressionSet` which attempts to canonicalize expressions to remove cosmetic differences. Deterministic expressions that are in the set after canonicalization will always return the same answer given the same input (i.e. false positives should not be possible). However, it is possible that two canonical expressions that are not equal will in fact return the same answer given any input (i.e. false negatives are possible). ```scala val set = AttributeSet('a + 1 :: 1 + 'a :: Nil) set.iterator => Iterator('a + 1) set.contains('a + 1) => true set.contains(1 + 'a) => true set.contains('a + 2) => false ``` Other relevant changes include: - Since this concept overlaps with the existing `semanticEquals` and `semanticHash`, those functions are also ported to this new infrastructure. - A memoized `canonicalized` version of the expression is added as a `lazy val` to `Expression` and is used by both `semanticEquals` and `ExpressionSet`. - A set of unit tests for `ExpressionSet` are added - Tests which expect `semanticEquals` to be less intelligent than it now is are updated. As a followup, we should consider auditing the places where we do `O(n)` `semanticEquals` operations and replace them with `ExpressionSet`. We should also consider consolidating `AttributeSet` as a specialized factory for an `ExpressionSet.` Author: Michael Armbrust <michael@databricks.com> Closes #11338 from marmbrus/expressionSet.	2016-02-24 19:43:00 -08:00
Nong Li	5a7af9e7ac	[SPARK-13250] [SQL] Update PhysicallRDD to convert to UnsafeRow if using the vectorized scanner. Some parts of the engine rely on UnsafeRow which the vectorized parquet scanner does not want to produce. This add a conversion in Physical RDD. In the case where codegen is used (and the scan is the start of the pipeline), there is no requirement to use UnsafeRow. This patch adds update PhysicallRDD to support codegen, which eliminates the need for the UnsafeRow conversion in all cases. The result of these changes for TPCDS-Q19 at the 10gb sf reduces the query time from 9.5 seconds to 6.5 seconds. Author: Nong Li <nong@databricks.com> Closes #11141 from nongli/spark-13250.	2016-02-24 17:16:45 -08:00
Yin Huai	cbb0b65ad5	[SPARK-13383][SQL] Fix test ## What changes were proposed in this pull request? Reverting SPARK-13376 (`d563c8fa01`) affects the test added by SPARK-13383. So, I am fixing the test. Author: Yin Huai <yhuai@databricks.com> Closes #11355 from yhuai/SPARK-13383-fix-test.	2016-02-24 16:13:55 -08:00
Yin Huai	bc353805bd	[SPARK-13475][TESTS][SQL] HiveCompatibilitySuite should still run in PR builder even if a PR only changes sql/core ## What changes were proposed in this pull request? `HiveCompatibilitySuite` should still run in PR build even if a PR only changes sql/core. So, I am going to remove `ExtendedHiveTest` annotation from `HiveCompatibilitySuite`. https://issues.apache.org/jira/browse/SPARK-13475 Author: Yin Huai <yhuai@databricks.com> Closes #11351 from yhuai/SPARK-13475.	2016-02-24 13:34:53 -08:00
gatorsmile	5289837a72	[HOT][TEST] Disable a Test that Requires Nested Union Support. ## What changes were proposed in this pull request? Since "[SPARK-13321][SQL] Support nested UNION in parser" is reverted, we need to disable the test case that requires this PR. Thanks! rxin yhuai marmbrus ## How was this patch tested? N/A Author: gatorsmile <gatorsmile@gmail.com> Closes #11352 from gatorsmile/disableTestCase.	2016-02-24 13:30:23 -08:00
Wenchen Fan	a60f91284c	[SPARK-13467] [PYSPARK] abstract python function to simplify pyspark code ## What changes were proposed in this pull request? When we pass a Python function to JVM side, we also need to send its context, e.g. `envVars`, `pythonIncludes`, `pythonExec`, etc. However, it's annoying to pass around so many parameters at many places. This PR abstract python function along with its context, to simplify some pyspark code and make the logic more clear. ## How was the this patch tested? by existing unit tests. Author: Wenchen Fan <wenchen@databricks.com> Closes #11342 from cloud-fan/python-clean.	2016-02-24 12:44:54 -08:00
Reynold Xin	f92f53faee	Revert "[SPARK-13321][SQL] Support nested UNION in parser" This reverts commit `55d6fdf22d`.	2016-02-24 12:25:02 -08:00
Reynold Xin	65805ab6ea	Revert "Revert "[SPARK-13383][SQL] Keep broadcast hint after column pruning"" This reverts commit `382b27babf`.	2016-02-24 12:03:45 -08:00
Reynold Xin	d563c8fa01	Revert "[SPARK-13376] [SQL] improve column pruning" This reverts commit `e9533b419e`.	2016-02-24 11:58:32 -08:00
Reynold Xin	382b27babf	Revert "[SPARK-13383][SQL] Keep broadcast hint after column pruning" This reverts commit `f373986997`.	2016-02-24 11:58:12 -08:00
Liang-Chi Hsieh	f373986997	[SPARK-13383][SQL] Keep broadcast hint after column pruning JIRA: https://issues.apache.org/jira/browse/SPARK-13383 ## What changes were proposed in this pull request? When we do column pruning in Optimizer, we put additional Project on top of a logical plan. However, when we already wrap a BroadcastHint on a logical plan, the added Project will hide BroadcastHint after later execution. We should take care of BroadcastHint when we do column pruning. ## How was the this patch tested? Unit test is added. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #11260 from viirya/keep-broadcasthint.	2016-02-24 10:22:40 -08:00
Liang-Chi Hsieh	8930181833	[SPARK-13472] [SPARKR] Fix unstable Kmeans test in R JIRA: https://issues.apache.org/jira/browse/SPARK-13472 ## What changes were proposed in this pull request? One Kmeans test in R is unstable and sometimes fails. We should fix it. ## How was this patch tested? Unit test is modified in this PR. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #11345 from viirya/fix-kmeans-r-test and squashes the following commits: f959f61 [Liang-Chi Hsieh] Sort resulted clusters.	2016-02-24 07:05:20 -08:00
Daniel Jalova	bcfd55fa98	[SPARK-12759][Core][Spark should fail fast if --executor-memory is too small for spark to start] Added an exception to be thrown in UnifiedMemoryManager.scala if the configuration given for executor memory is too low. Also modified the exception message thrown when driver memory is too low. This patch was tested manually by passing in config options to Spark shell. I also added a test in UnifiedMemoryManagerSuite.scala Author: Daniel Jalova <djalova@us.ibm.com> Closes #11255 from djalova/SPARK-12759.	2016-02-24 12:15:11 +00:00
Rahul Tanwani	831429170f	[MINOR][MAINTENANCE] Fix typo for the pull request template. ## What changes were proposed in this pull request? (Please fill in changes proposed in this fix) ## How was the this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Author: Rahul Tanwani <tanwanirahul@gmail.com> Closes #11343 from tanwanirahul/pull_request_template.	2016-02-24 00:45:31 -08:00
Davies Liu	86c852cf2e	[SPARK-13431] [SQL] [test-maven] split keywords from ExpressionParser.g ## What changes were proposed in this pull request? This PR pull all the keywords (and some others) from ExpressionParser.g as KeywordParser.g, because ExpressionParser is too large to compile. ## How was the this patch tested? unit test, maven build Closes #11329 Author: Davies Liu <davies@databricks.com> Closes #11331 from davies/split_expr.	2016-02-23 21:22:00 -08:00
Davies Liu	e9533b419e	[SPARK-13376] [SQL] improve column pruning ## What changes were proposed in this pull request? This PR mostly rewrite the ColumnPruning rule to support most of the SQL logical plans (except those for Dataset). ## How was the this patch tested? This is test by unit tests, also manually test with TPCDS Q78, which could prune all unused columns successfully, improved the performance by 78% (from 22s to 12s). Author: Davies Liu <davies@databricks.com> Closes #11256 from davies/fix_column_pruning.	2016-02-23 18:19:22 -08:00
JeremyNixon	230bbeaa61	[SPARK-10759][ML] update cross validator with include_example This pull request uses {%include_example%} to add an example for the python cross validator to ml-guide. Author: JeremyNixon <jnixon2@gmail.com> Closes #11240 from JeremyNixon/pipeline_include_example.	2016-02-23 15:57:29 -08:00
Xusen Yin	8d29001dec	[SPARK-13011] K-means wrapper in SparkR https://issues.apache.org/jira/browse/SPARK-13011 Author: Xusen Yin <yinxusen@gmail.com> Closes #11124 from yinxusen/SPARK-13011.	2016-02-23 15:42:58 -08:00
Timothy Hunter	15e3015563	[SPARK-6761][SQL][ML] Fixes to API and documentation of approximate quantiles ## What changes were proposed in this pull request? This continues thunterdb 's work on `approxQuantile` API. It changes the signature of `approxQuantile` from `(col: String, quantile: Double, epsilon: Double): Double` to `(col: String, probabilities: Array[Double], relativeError: Double): Array[Double]` and update API doc. It also improves the error message in tests and simplifies the merge algorithm for summaries. ## How was the this patch tested? Use the same unit tests as before. Closes #11325 Author: Timothy Hunter <timhunter@databricks.com> Author: Xiangrui Meng <meng@databricks.com> Closes #11332 from mengxr/SPARK-6761.	2016-02-23 15:31:17 -08:00
Davies Liu	9cdd867da9	[SPARK-13373] [SQL] generate sort merge join ## What changes were proposed in this pull request? Generates code for SortMergeJoin. ## How was the this patch tested? Unit tests and manually tested with TPCDS Q72, which showed 70% performance improvements (from 42s to 25s), but micro benchmark only show minor improvements, it may depends the distribution of data and number of columns. Author: Davies Liu <davies@databricks.com> Closes #11248 from davies/gen_smj.	2016-02-23 15:00:10 -08:00
Davies Liu	c481bdf512	[SPARK-13329] [SQL] considering output for statistics of logical plan The current implementation of statistics of UnaryNode does not considering output (for example, Project may product much less columns than it's child), we should considering it to have a better guess. We usually only join with few columns from a parquet table, the size of projected plan could be much smaller than the original parquet files. Having a better guess of size help we choose between broadcast join or sort merge join. After this PR, I saw a few queries choose broadcast join other than sort merge join without turning spark.sql.autoBroadcastJoinThreshold for every query, ended up with about 6-8X improvements on end-to-end time. We use `defaultSize` of DataType to estimate the size of a column, currently For DecimalType/StringType/BinaryType and UDT, we are over-estimate too much (4096 Bytes), so this PR change them to some more reasonable values. Here are the new defaultSize for them: DecimalType: 8 or 16 bytes, based on the precision StringType: 20 bytes BinaryType: 100 bytes UDF: default size of SQL type These numbers are not perfect (hard to have a perfect number for them), but should be better than 4096. Author: Davies Liu <davies@databricks.com> Closes #11210 from davies/statics.	2016-02-23 12:55:44 -08:00
Michael Armbrust	c5bfe5d2a2	[SPARK-13440][SQL] ObjectType should accept any ObjectType, If should not care about nullability The type checking functions of `If` and `UnwrapOption` are fixed to eliminate spurious failures. `UnwrapOption` was checking for an input of `ObjectType` but `ObjectType`'s accept function was hard coded to return `false`. `If`'s type check was returning a false negative in the case that the two options differed only by nullability. Tests added: - an end-to-end regression test is added to `DatasetSuite` for the reported failure. - all the unit tests in `ExpressionEncoderSuite` are augmented to also confirm successful analysis. These tests are actually what pointed out the additional issues with `If` resolution. Author: Michael Armbrust <michael@databricks.com> Closes #11316 from marmbrus/datasetOptions.	2016-02-23 11:20:27 -08:00
Lianhui Wang	9f4263392e	[SPARK-7729][UI] Executor which has been killed should also be displayed on Executor Tab andrewor14 squito Dead Executors should also be displayed on Executor Tab. as following: ![image](https://cloud.githubusercontent.com/assets/545478/11492707/ae55d7f6-982b-11e5-919a-b62cd84684b2.png) Author: Lianhui Wang <lianhuiwang09@gmail.com> This patch had conflicts when merged, resolved by Committer: Andrew Or <andrew@databricks.com> Closes #10058 from lianhuiwang/SPARK-7729.	2016-02-23 11:08:39 -08:00
Grzegorz Chilkiewicz	5d69eaf097	[SPARK-13338][ML] Allow setting 'degree' parameter to 1 for PolynomialExpansion Author: Grzegorz Chilkiewicz <grzegorz.chilkiewicz@codilime.com> Closes #11216 from grzegorz-chilkiewicz/master.	2016-02-23 10:30:02 -08:00
zhuol	4d1e5f92e1	[SPARK-13364] Sort appId as num rather than str in history page. ## What changes were proposed in this pull request? History page now sorts the appID as a string, which can lead to unexpected order for the case "application_11111_9" and "application_11111_20". Add a new sort type called appId-numeric can fix it. ## How was the this patch tested? This patch was manually tested with UI. See the screenshot below: ![sortappidbetter](https://cloud.githubusercontent.com/assets/11683054/13185564/7f941a16-d707-11e5-8fb7-0316368d3030.png) Author: zhuol <zhuol@yahoo-inc.com> Closes #11259 from zhuoliu/13364.	2016-02-23 11:16:42 -06:00
Liang-Chi Hsieh	87d7f8904a	[SPARK-13358] [SQL] Retrieve grep path when do benchmark JIRA: https://issues.apache.org/jira/browse/SPARK-13358 When trying to run a benchmark, I found that on my Ubuntu linux grep is not in /usr/bin/ but /bin/. So wondering if it is better to use which to retrieve grep path. cc davies Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #11231 from viirya/benchmark-grep-path.	2016-02-23 07:56:08 -08:00

... 4 5 6 7 8 ...

15132 commits