ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Eric Liang	a3356343cb	[SPARK-18185] Fix all forms of INSERT / OVERWRITE TABLE for Datasource tables ## What changes were proposed in this pull request? As of current 2.1, INSERT OVERWRITE with dynamic partitions against a Datasource table will overwrite the entire table instead of only the partitions matching the static keys, as in Hive. It also doesn't respect custom partition locations. This PR adds support for all these operations to Datasource tables managed by the Hive metastore. It is implemented as follows - During planning time, the full set of partitions affected by an INSERT or OVERWRITE command is read from the Hive metastore. - The planner identifies any partitions with custom locations and includes this in the write task metadata. - FileFormatWriter tasks refer to this custom locations map when determining where to write for dynamic partition output. - When the write job finishes, the set of written partitions is compared against the initial set of matched partitions, and the Hive metastore is updated to reflect the newly added / removed partitions. It was necessary to introduce a method for staging files with absolute output paths to `FileCommitProtocol`. These files are not handled by the Hadoop output committer but are moved to their final locations when the job commits. The overwrite behavior of legacy Datasource tables is also changed: no longer will the entire table be overwritten if a partial partition spec is present. cc cloud-fan yhuai ## How was this patch tested? Unit tests, existing tests. Author: Eric Liang <ekl@databricks.com> Author: Wenchen Fan <wenchen@databricks.com> Closes #15814 from ericl/sc-5027.	2016-11-10 17:00:43 -08:00
jiangxingbo	64fbdf1aa9	[SPARK-18191][CORE][FOLLOWUP] Call `setConf` if `OutputFormat` is `Configurable`. ## What changes were proposed in this pull request? We should call `setConf` if `OutputFormat` is `Configurable`, this should be done before we create `OutputCommitter` and `RecordWriter`. This is follow up of #15769, see discussion [here](https://github.com/apache/spark/pull/15769/files#r87064229) ## How was this patch tested? Add test of this case in `PairRDDFunctionsSuite`. Author: jiangxingbo <jiangxb1987@gmail.com> Closes #15823 from jiangxb1987/config-format.	2016-11-09 13:14:26 -08:00
Vinayak	06a13ecca7	[SPARK-16808][CORE] History Server main page does not honor APPLICATION_WEB_PROXY_BASE ## What changes were proposed in this pull request? Application links generated on the history server UI no longer (regression from 1.6) contain the configured spark.ui.proxyBase in the links. To address this, made the uiRoot available globally to all javascripts for Web UI. Updated the mustache template (historypage-template.html) to include the uiroot for rendering links to the applications. The existing test was not sufficient to verify the scenario where ajax call is used to populate the application listing template, so added a new selenium test case to cover this scenario. ## How was this patch tested? Existing tests and a new unit test. No visual changes to the UI. Author: Vinayak <vijoshi5@in.ibm.com> Closes #15742 from vijoshi/SPARK-16808_master.	2016-11-09 10:40:14 -08:00
Shixiong Zhu	b6de0c98c7	[SPARK-18280][CORE] Fix potential deadlock in `StandaloneSchedulerBackend.dead` ## What changes were proposed in this pull request? "StandaloneSchedulerBackend.dead" is called in a RPC thread, so it should not call "SparkContext.stop" in the same thread. "SparkContext.stop" will block until all RPC threads exit, if it's called inside a RPC thread, it will be dead-lock. This PR add a thread local flag inside RPC threads. `SparkContext.stop` uses it to decide if launching a new thread to stop the SparkContext. ## How was this patch tested? Jenkins Author: Shixiong Zhu <shixiong@databricks.com> Closes #15775 from zsxwing/SPARK-18280.	2016-11-08 13:14:56 -08:00
jiangxingbo	9c419698fe	[SPARK-18191][CORE] Port RDD API to use commit protocol ## What changes were proposed in this pull request? This PR port RDD API to use commit protocol, the changes made here: 1. Add new internal helper class that saves an RDD using a Hadoop OutputFormat named `SparkNewHadoopWriter`, it's similar with `SparkHadoopWriter` but uses commit protocol. This class supports the newer `mapreduce` API, instead of the old `mapred` API which is supported by `SparkHadoopWriter`; 2. Rewrite `PairRDDFunctions.saveAsNewAPIHadoopDataset` function, so it uses commit protocol now. ## How was this patch tested? Exsiting test cases. Author: jiangxingbo <jiangxb1987@gmail.com> Closes #15769 from jiangxb1987/rdd-commit.	2016-11-08 09:41:01 -08:00
fidato	6f3697136a	[SPARK-16575][CORE] partition calculation mismatch with sc.binaryFiles ## What changes were proposed in this pull request? This Pull request comprises of the critical bug SPARK-16575 changes. This change rectifies the issue with BinaryFileRDD partition calculations as upon creating an RDD with sc.binaryFiles, the resulting RDD always just consisted of two partitions only. ## How was this patch tested? The original issue ie. getNumPartitions on binary Files RDD (always having two partitions) was first replicated and then tested upon the changes. Also the unit tests have been checked and passed. This contribution is my original work and I licence the work to the project under the project's open source license srowen hvanhovell rxin vanzin skyluc kmader zsxwing datafarmer Please have a look . Author: fidato <fidato.july13@gmail.com> Closes #15327 from fidato13/SPARK-16575.	2016-11-07 18:41:17 -08:00
Josh Rosen	3a710b94b0	[SPARK-18236] Reduce duplicate objects in Spark UI and HistoryServer ## What changes were proposed in this pull request? When profiling heap dumps from the HistoryServer and live Spark web UIs, I found a large amount of memory being wasted on duplicated objects and strings. This patch's changes remove most of this duplication, resulting in over 40% memory savings for some benchmarks. - Task metrics (6441f0624dfcda9c7193a64bfb416a145b5aabdf): previously, every `TaskUIData` object would have its own instances of `InputMetricsUIData`, `OutputMetricsUIData`, `ShuffleReadMetrics`, and `ShuffleWriteMetrics`, but for many tasks these metrics are irrelevant because they're all zero. This patch changes how we construct these metrics in order to re-use a single immutable "empty" value for the cases where these metrics are empty. - TaskInfo.accumulables (ade86db901127bf13c0e0bdc3f09c933a093bb76): Previously, every `TaskInfo` object had its own empty `ListBuffer` for holding updates from named accumulators. Tasks which didn't use named accumulators still paid for the cost of allocating and storing this empty buffer. To avoid this overhead, I changed the `val` with a mutable buffer into a `var` which holds an immutable Scala list, allowing tasks which do not have named accumulator updates to share the same singleton `Nil` object. - String.intern() in JSONProtocol (7e05630e9a78c455db8c8c499f0590c864624e05): in the HistoryServer, executor hostnames and ids are deserialized from JSON, leading to massive duplication of these string objects. By calling `String.intern()` on the deserialized values we can remove all of this duplication. Since Spark now requires Java 7+ we don't have to worry about string interning exhausting the permgen (see http://java-performance.info/string-intern-in-java-6-7-8/). ## How was this patch tested? I ran ``` sc.parallelize(1 to 100000, 100000).count() ``` in `spark-shell` with event logging enabled, then loaded that event log in the HistoryServer, performed a full GC, and took a heap dump. According to YourKit, the changes in this patch reduced memory consumption by roughly 28 megabytes (or 770k Java objects): ![image](https://cloud.githubusercontent.com/assets/50748/19953276/4f3a28aa-a129-11e6-93df-d7fa91396f66.png) Here's a table illustrating the drop in objects due to deduplication (the drop is <100k for some objects because some events were dropped from the listener bus; this is a separate, existing bug that I'll address separately after CPU-profiling): ![image](https://cloud.githubusercontent.com/assets/50748/19953290/6a271290-a129-11e6-93ad-b825f1448886.png) Author: Josh Rosen <joshrosen@databricks.com> Closes #15743 from JoshRosen/spark-ui-memory-usage.	2016-11-07 16:14:19 -08:00
Hyukjin Kwon	8f0ea011a7	[SPARK-14914][CORE] Fix Resource not closed after using, mostly for unit tests ## What changes were proposed in this pull request? Close `FileStreams`, `ZipFiles` etc to release the resources after using. Not closing the resources will cause IO Exception to be raised while deleting temp files. ## How was this patch tested? Existing tests Author: U-FAREAST\tl <tl@microsoft.com> Author: hyukjinkwon <gurwls223@gmail.com> Author: Tao LI <tl@microsoft.com> Closes #15618 from HyukjinKwon/SPARK-14914-1.	2016-11-07 12:47:39 -08:00
Susan X. Huynh	9a87c31385	[SPARK-17964][SPARKR] Enable SparkR with Mesos client mode and cluster mode ## What changes were proposed in this pull request? Enabled SparkR with Mesos client mode and cluster mode. Just a few changes were required to get this working on Mesos: (1) removed the SparkR on Mesos error checks and (2) do not require "--class" to be specified for R apps. The logic to check spark.mesos.executor.home was already in there. sun-rui ## How was this patch tested? 1. SparkSubmitSuite 2. On local mesos cluster (on laptop): ran SparkR shell, spark-submit client mode, and spark-submit cluster mode, with the "examples/src/main/R/dataframe.R" example application. 3. On multi-node mesos cluster: ran SparkR shell, spark-submit client mode, and spark-submit cluster mode, with the "examples/src/main/R/dataframe.R" example application. I tested with the following --conf values set: spark.mesos.executor.docker.image and spark.mesos.executor.home This contribution is my original work and I license the work to the project under the project's open source license. Author: Susan X. Huynh <xhuynh@mesosphere.com> Closes #15700 from susanxhuynh/susan-r-branch.	2016-11-05 17:45:15 +00:00
Weiqing Yang	8a9ca19247	[SPARK-17710][FOLLOW UP] Add comments to state why 'Utils.classForName' is not used ## What changes were proposed in this pull request? Add comments. ## How was this patch tested? Build passed. Author: Weiqing Yang <yangweiqing001@gmail.com> Closes #15776 from weiqingy/SPARK-17710.	2016-11-04 23:44:46 -07:00
Josh Rosen	0e3312ee72	[SPARK-18256] Improve the performance of event log replay in HistoryServer ## What changes were proposed in this pull request? This patch significantly improves the performance of event log replay in the HistoryServer via two simple changes: - Don't use `extractOpt`: it turns out that `json4s`'s `extractOpt` method uses exceptions for control flow, causing huge performance bottlenecks due to the overhead of initializing exceptions. To avoid this overhead, we can simply use our own` Utils.jsonOption` method. This patch replaces all uses of `extractOpt` with `Utils.jsonOption` and adds a style checker rule to ban the use of the slow `extractOpt` method. - Don't call `Utils.getFormattedClassName` for every event: the old code called` Utils.getFormattedClassName` dozens of times per replayed event in order to match up class names in events with SparkListener event names. By simply storing the results of these calls in constants rather than recomputing them, we're able to eliminate a huge performance hotspot by removing thousands of expensive `Class.getSimpleName` calls. ## How was this patch tested? Tested by profiling the replay of a long event log using YourKit. For an event log containing 1000+ jobs, each of which had thousands of tasks, the changes in this patch cut the replay time in half: ![image](https://cloud.githubusercontent.com/assets/50748/19980953/31154622-a1bd-11e6-9be4-21fbb9b3f9a7.png) Prior to this patch's changes, the two slowest methods in log replay were internal exceptions thrown by `Json4S` and calls to `Class.getSimpleName()`: ![image](https://cloud.githubusercontent.com/assets/50748/19981052/87416cce-a1bd-11e6-9f25-06a7cd391822.png) After this patch, these hotspots are completely eliminated. Author: Josh Rosen <joshrosen@databricks.com> Closes #15756 from JoshRosen/speed-up-jsonprotocol.	2016-11-04 19:32:26 -07:00
Adam Roberts	a42d738c5d	[SPARK-18197][CORE] Optimise AppendOnlyMap implementation ## What changes were proposed in this pull request? This improvement works by using the fastest comparison test first and we observed a 1% throughput performance improvement on PageRank (HiBench large profile) with this change. We used tprof and before the change in AppendOnlyMap.changeValue (where the optimisation occurs) this method was being used for 8053 profiling ticks representing 0.72% of the overall application time. After this change we observed this method only occurring for 2786 ticks and for 0.25% of the overall time. ## How was this patch tested? Existing unit tests and for performance we used HiBench large, profiling with tprof and IBM Healthcenter. Author: Adam Roberts <aroberts@uk.ibm.com> Closes #15714 from a-roberts/patch-9.	2016-11-04 12:06:06 -07:00
Dongjoon Hyun	27602c3375	[SPARK-18200][GRAPHX][FOLLOW-UP] Support zero as an initial capacity in OpenHashSet ## What changes were proposed in this pull request? This is a follow-up PR of #15741 in order to keep `nextPowerOf2` consistent. Before ``` nextPowerOf2(0) => 2 nextPowerOf2(1) => 1 nextPowerOf2(2) => 2 nextPowerOf2(3) => 4 nextPowerOf2(4) => 4 nextPowerOf2(5) => 8 ``` After ``` nextPowerOf2(0) => 1 nextPowerOf2(1) => 1 nextPowerOf2(2) => 2 nextPowerOf2(3) => 4 nextPowerOf2(4) => 4 nextPowerOf2(5) => 8 ``` ## How was this patch tested? N/A Author: Dongjoon Hyun <dongjoon@apache.org> Closes #15754 from dongjoon-hyun/SPARK-18200-2.	2016-11-03 23:15:33 -07:00
Sean Owen	dc4c600986	[SPARK-18138][DOCS] Document that Java 7, Python 2.6, Scala 2.10, Hadoop < 2.6 are deprecated in Spark 2.1.0 ## What changes were proposed in this pull request? Document that Java 7, Python 2.6, Scala 2.10, Hadoop < 2.6 are deprecated in Spark 2.1.0. This does not actually implement any of the change in SPARK-18138, just peppers the documentation with notices about it. ## How was this patch tested? Doc build Author: Sean Owen <sowen@cloudera.com> Closes #15733 from srowen/SPARK-18138.	2016-11-03 17:27:23 -07:00
Reynold Xin	937af592e6	[SPARK-18219] Move commit protocol API (internal) from sql/core to core module ## What changes were proposed in this pull request? This patch moves the new commit protocol API from sql/core to core module, so we can use it in the future in the RDD API. As part of this patch, I also moved the speficiation of the random uuid for the write path out of the commit protocol, and instead pass in a job id. ## How was this patch tested? N/A Author: Reynold Xin <rxin@databricks.com> Closes #15731 from rxin/SPARK-18219.	2016-11-03 02:42:48 -07:00
Dongjoon Hyun	d24e736471	[SPARK-18200][GRAPHX] Support zero as an initial capacity in OpenHashSet ## What changes were proposed in this pull request? [SPARK-18200](https://issues.apache.org/jira/browse/SPARK-18200) reports Apache Spark 2.x raises `java.lang.IllegalArgumentException: requirement failed: Invalid initial capacity` while running `triangleCount`. The root cause is that `VertexSet`, a type alias of `OpenHashSet`, does not allow zero as a initial size. This PR loosens the restriction to allow zero. ## How was this patch tested? Pass the Jenkins test with a new test case in `OpenHashSetSuite`. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #15741 from dongjoon-hyun/SPARK-18200.	2016-11-02 23:50:50 -07:00
Jeff Zhang	3c24299b71	[SPARK-18160][CORE][YARN] spark.files & spark.jars should not be passed to driver in yarn mode ## What changes were proposed in this pull request? spark.files is still passed to driver in yarn mode, so SparkContext will still handle it which cause the error in the jira desc. ## How was this patch tested? Tested manually in a 5 node cluster. As this issue only happens in multiple node cluster, so I didn't write test for it. Author: Jeff Zhang <zjffdu@apache.org> Closes #15669 from zjffdu/SPARK-18160.	2016-11-02 11:47:45 -07:00
Xiangrui Meng	02f203107b	[SPARK-14393][SQL] values generated by non-deterministic functions shouldn't change after coalesce or union ## What changes were proposed in this pull request? When a user appended a column using a "nondeterministic" function to a DataFrame, e.g., `rand`, `randn`, and `monotonically_increasing_id`, the expected semantic is the following: - The value in each row should remain unchanged, as if we materialize the column immediately, regardless of later DataFrame operations. However, since we use `TaskContext.getPartitionId` to get the partition index from the current thread, the values from nondeterministic columns might change if we call `union` or `coalesce` after. `TaskContext.getPartitionId` returns the partition index of the current Spark task, which might not be the corresponding partition index of the DataFrame where we defined the column. See the unit tests below or JIRA for examples. This PR uses the partition index from `RDD.mapPartitionWithIndex` instead of `TaskContext` and fixes the partition initialization logic in whole-stage codegen, normal codegen, and codegen fallback. `initializeStatesForPartition(partitionIndex: Int)` was added to `Projection`, `Nondeterministic`, and `Predicate` (codegen) and initialized right after object creation in `mapPartitionWithIndex`. `newPredicate` now returns a `Predicate` instance rather than a function for proper initialization. ## How was this patch tested? Unit tests. (Actually I'm not very confident that this PR fixed all issues without introducing new ones ...) cc: rxin davies Author: Xiangrui Meng <meng@databricks.com> Closes #15567 from mengxr/SPARK-14393.	2016-11-02 11:41:49 -07:00
Sean Owen	9c8deef64e	[SPARK-18076][CORE][SQL] Fix default Locale used in DateFormat, NumberFormat to Locale.US ## What changes were proposed in this pull request? Fix `Locale.US` for all usages of `DateFormat`, `NumberFormat` ## How was this patch tested? Existing tests. Author: Sean Owen <sowen@cloudera.com> Closes #15610 from srowen/SPARK-18076.	2016-11-02 09:39:15 +00:00
Jacek Laskowski	70a5db7bbd	[SPARK-18204][WEBUI] Remove SparkUI.appUIAddress ## What changes were proposed in this pull request? Removing `appUIAddress` attribute since it is no longer in use. ## How was this patch tested? Local build Author: Jacek Laskowski <jacek@japila.pl> Closes #15603 from jaceklaskowski/sparkui-fixes.	2016-11-02 09:21:26 +00:00
Ryan Blue	2dc0480816	[SPARK-17532] Add lock debugging info to thread dumps. ## What changes were proposed in this pull request? This adds information to the web UI thread dump page about the JVM locks held by threads and the locks that threads are blocked waiting to acquire. This should help find cases where lock contention is causing Spark applications to run slowly. ## How was this patch tested? Tested by applying this patch and viewing the change in the web UI. ![thread-lock-info](https://cloud.githubusercontent.com/assets/87915/18493057/6e5da870-79c3-11e6-8c20-f54c18a37544.png) Additions: - A "Thread Locking" column with the locks held by the thread or that are blocking the thread - Links from the a blocked thread to the thread holding the lock - Stack frames show where threads are inside `synchronized` blocks, "holding Monitor(...)" Author: Ryan Blue <blue@apache.org> Closes #15088 from rdblue/SPARK-17532-add-thread-lock-info.	2016-11-02 00:08:30 -07:00
Josh Rosen	b929537b6e	[SPARK-18182] Expose ReplayListenerBus.read() overload which takes string iterator The `ReplayListenerBus.read()` method is used when implementing a custom `ApplicationHistoryProvider`. The current interface only exposes a `read()` method which takes an `InputStream` and performs stream-to-lines conversion itself, but it would also be useful to expose an overloaded method which accepts an iterator of strings, thereby enabling events to be provided from non-`InputStream` sources. Author: Josh Rosen <joshrosen@databricks.com> Closes #15698 from JoshRosen/replay-listener-bus-interface.	2016-11-01 16:49:41 -07:00
Shixiong Zhu	d2923f1732	[SPARK-18143][SQL] Ignore Structured Streaming event logs to avoid breaking history server ## What changes were proposed in this pull request? Because of the refactoring work in Structured Streaming, the event logs generated by Strucutred Streaming in Spark 2.0.0 and 2.0.1 cannot be parsed. This PR just ignores these logs in ReplayListenerBus because no places use them. ## How was this patch tested? - Generated events logs using Spark 2.0.0 and 2.0.1, and saved them as `structured-streaming-query-event-logs-2.0.0.txt` and `structured-streaming-query-event-logs-2.0.1.txt` - The new added test makes sure ReplayListenerBus will skip these bad jsons. Author: Shixiong Zhu <shixiong@databricks.com> Closes #15663 from zsxwing/fix-event-log.	2016-10-31 00:11:33 -07:00
Hossein	2881a2d1d1	[SPARK-17919] Make timeout to RBackend configurable in SparkR ## What changes were proposed in this pull request? This patch makes RBackend connection timeout configurable by user. ## How was this patch tested? N/A Author: Hossein <hossein@databricks.com> Closes #15471 from falaki/SPARK-17919.	2016-10-30 16:17:23 -07:00
Eric Liang	90d3b91f4c	[SPARK-18103][SQL] Rename FileCatalog to FileIndex ## What changes were proposed in this pull request? To reduce the number of components in SQL named Catalog, rename FileCatalog to *FileIndex. A FileIndex is responsible for returning the list of partitions / files to scan given a filtering expression. ``` TableFileCatalog => CatalogFileIndex FileCatalog => FileIndex ListingFileCatalog => InMemoryFileIndex MetadataLogFileCatalog => MetadataLogFileIndex PrunedTableFileCatalog => PrunedInMemoryFileIndex ``` cc yhuai marmbrus ## How was this patch tested? N/A Author: Eric Liang <ekl@databricks.com> Author: Eric Liang <ekhliang@gmail.com> Closes #15634 from ericl/rename-file-provider.	2016-10-30 13:14:45 -07:00
Yin Huai	d3b4831d00	[SPARK-18132] Fix checkstyle This PR fixes checkstyle. Author: Yin Huai <yhuai@databricks.com> Closes #15656 from yhuai/fix-format.	2016-10-26 22:22:23 -07:00
Miao Wang	a76846cfb1	[SPARK-18126][SPARK-CORE] getIteratorZipWithIndex accepts negative value as index ## What changes were proposed in this pull request? (Please fill in changes proposed in this fix) `Utils.getIteratorZipWithIndex` was added to deal with number of records > 2147483647 in one partition. method `getIteratorZipWithIndex` accepts `startIndex` < 0, which leads to negative index. This PR just adds a defensive check on `startIndex` to make sure it is >= 0. ## How was this patch tested? Add a new unit test. Author: Miao Wang <miaowang@Miaos-MacBook-Pro.local> Closes #15639 from wangmiao1981/zip.	2016-10-27 01:17:32 +02:00
Shixiong Zhu	7ac70e7ba8	[SPARK-13747][SQL] Fix concurrent executions in ForkJoinPool for SQL ## What changes were proposed in this pull request? Calling `Await.result` will allow other tasks to be run on the same thread when using ForkJoinPool. However, SQL uses a `ThreadLocal` execution id to trace Spark jobs launched by a query, which doesn't work perfectly in ForkJoinPool. This PR just uses `Awaitable.result` instead to prevent ForkJoinPool from running other tasks in the current waiting thread. ## How was this patch tested? Jenkins Author: Shixiong Zhu <shixiong@databricks.com> Closes #15520 from zsxwing/SPARK-13747.	2016-10-26 10:36:36 -07:00
Shuai Lin	402205ddf7	[SPARK-17802] Improved caller context logging. ## What changes were proposed in this pull request? [SPARK-16757](https://issues.apache.org/jira/browse/SPARK-16757) sets the hadoop `CallerContext` when calling hadoop/hdfs apis to make spark applications more diagnosable in hadoop/hdfs logs. However, the `org.apache.hadoop.ipc.CallerContext` class is only added since [hadoop 2.8](https://issues.apache.org/jira/browse/HDFS-9184), which is not officially releaed yet. So each time `utils.CallerContext.setCurrentContext()` is called (e.g [when a task is created](https://github.com/apache/spark/blob/b678e46/core/src/main/scala/org/apache/spark/scheduler/Task.scala#L95-L96)), a "java.lang.ClassNotFoundException: org.apache.hadoop.ipc.CallerContext" error is logged, which pollutes the spark logs when there are lots of tasks. This patch improves this behaviour by only logging the `ClassNotFoundException` once. ## How was this patch tested? Existing tests. Author: Shuai Lin <linshuai2012@gmail.com> Closes #15377 from lins05/spark-17802-improve-callercontext-logging.	2016-10-26 14:31:47 +02:00
Alex Bozarth	5d0f81da49	[SPARK-4411][WEB UI] Add "kill" link for jobs in the UI ## What changes were proposed in this pull request? Currently users can kill stages via the web ui but not jobs directly (jobs are killed if one of their stages is). I've added the ability to kill jobs via the web ui. This code change is based on #4823 by lianhuiwang and updated to work with the latest code matching how stages are currently killed. In general I've copied the kill stage code warning and note comments and all. I also updated applicable tests and documentation. ## How was this patch tested? Manually tested and dev/run-tests ![screen shot 2016-10-11 at 4 49 43 pm](https://cloud.githubusercontent.com/assets/13952758/19292857/12f1b7c0-8fd4-11e6-8982-210249f7b697.png) Author: Alex Bozarth <ajbozart@us.ibm.com> Author: Lianhui Wang <lianhuiwang09@gmail.com> Closes #15441 from ajbozarth/spark4411.	2016-10-26 14:26:54 +02:00
hayashidac	c329a568b5	[SPARK-16988][SPARK SHELL] spark history server log needs to be fixed to show https url when ssl is enabled spark history server log needs to be fixed to show https url when ssl is enabled Author: chie8842 <chie@chie-no-Mac-mini.local> Closes #15611 from hayashidac/SPARK-16988.	2016-10-26 07:13:48 +09:00
Vinayak	c5fe3dd4f5	[SPARK-18010][CORE] Reduce work performed for building up the application list for the History Server app list UI page ## What changes were proposed in this pull request? allow ReplayListenerBus to skip deserialising and replaying certain events using an inexpensive check of the event log entry. Use this to ensure that when event log replay is triggered for building the application list, we get the ReplayListenerBus to skip over all but the few events needed for our immediate purpose. Refer [SPARK-18010] for the motivation behind this change. ## How was this patch tested? Tested with existing HistoryServer and ReplayListener unit test suites. All tests pass. Please review https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark before opening a pull request. Author: Vinayak <vijoshi5@in.ibm.com> Closes #15556 from vijoshi/SAAS-467_master.	2016-10-25 10:36:03 -07:00
Eren Avsarogullari	81d6933e75	[SPARK-17894][CORE] Ensure uniqueness of TaskSetManager name. `TaskSetManager` should have unique name to avoid adding duplicate ones to parent `Pool` via `SchedulableBuilder`. This problem has been surfaced with following discussion: [[PR: Avoid adding duplicate schedulables]](https://github.com/apache/spark/pull/15326) Proposal : There is 1x1 relationship between `stageAttemptId` and `TaskSetManager` so `taskSet.Id` covering both `stageId` and `stageAttemptId` looks to be used for uniqueness of `TaskSetManager` name instead of just `stageId`. Current TaskSetManager Name : `var name = "TaskSet_" + taskSet.stageId.toString` Sample: TaskSet_0 Proposed TaskSetManager Name : `val name = "TaskSet_" + taskSet.Id ` `// taskSet.Id = (stageId + "." + stageAttemptId)` Sample : TaskSet_0.0 Added new Unit Test. Author: erenavsarogullari <erenavsarogullari@gmail.com> Closes #15463 from erenavsarogullari/SPARK-17894.	2016-10-24 15:33:54 -07:00
Zheng RuiFeng	c64a8ff397	[SPARK-18049][MLLIB][TEST] Add missing tests for truePositiveRate and weightedTruePositiveRate ## What changes were proposed in this pull request? Add missing tests for `truePositiveRate` and `weightedTruePositiveRate` in `MulticlassMetricsSuite` ## How was this patch tested? added testing Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #15585 from zhengruifeng/mc_missing_test.	2016-10-24 10:25:24 +01:00
Sandeep Singh	bc167a2a53	[SPARK-928][CORE] Add support for Unsafe-based serializer in Kryo ## What changes were proposed in this pull request? Now since we have migrated to Kryo-3.0.0 in https://issues.apache.org/jira/browse/SPARK-11416, we can gives users option to use unsafe SerDer. It can turned by setting `spark.kryo.useUnsafe` to `true` ## How was this patch tested? Ran existing tests ``` Benchmark Kryo Unsafe vs safe Serialization: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------ basicTypes: Int unsafe:true 160 / 178 98.5 10.1 1.0X basicTypes: Long unsafe:true 210 / 218 74.9 13.4 0.8X basicTypes: Float unsafe:true 203 / 213 77.5 12.9 0.8X basicTypes: Double unsafe:true 226 / 235 69.5 14.4 0.7X Array: Int unsafe:true 1087 / 1101 14.5 69.1 0.1X Array: Long unsafe:true 2758 / 2844 5.7 175.4 0.1X Array: Float unsafe:true 1511 / 1552 10.4 96.1 0.1X Array: Double unsafe:true 2942 / 2972 5.3 187.0 0.1X Map of string->Double unsafe:true 2645 / 2739 5.9 168.2 0.1X basicTypes: Int unsafe:false 211 / 218 74.7 13.4 0.8X basicTypes: Long unsafe:false 247 / 253 63.6 15.7 0.6X basicTypes: Float unsafe:false 211 / 216 74.5 13.4 0.8X basicTypes: Double unsafe:false 227 / 233 69.2 14.4 0.7X Array: Int unsafe:false 3012 / 3032 5.2 191.5 0.1X Array: Long unsafe:false 4463 / 4515 3.5 283.8 0.0X Array: Float unsafe:false 2788 / 2868 5.6 177.2 0.1X Array: Double unsafe:false 3558 / 3752 4.4 226.2 0.0X Map of string->Double unsafe:false 2806 / 2933 5.6 178.4 0.1X ``` Author: Sandeep Singh <sandeep@techaddict.me> Author: Sandeep Singh <sandeep@origamilogic.com> Closes #12913 from techaddict/SPARK-928.	2016-10-22 12:03:37 -07:00
WeichenXu	4f1dcd3dce	[SPARK-18051][SPARK CORE] fix bug of custom PartitionCoalescer causing serialization exception ## What changes were proposed in this pull request? add a require check in `CoalescedRDD` to make sure the passed in `partitionCoalescer` to be `serializable`. and update the document for api `RDD.coalesce` ## How was this patch tested? Manual.(test code in jira [SPARK-18051]) Author: WeichenXu <WeichenXu123@outlook.com> Closes #15587 from WeichenXu123/fix_coalescer_bug.	2016-10-22 11:59:28 -07:00
Eric Liang	3eca283aca	[SPARK-17994][SQL] Add back a file status cache for catalog tables ## What changes were proposed in this pull request? In SPARK-16980, we removed the full in-memory cache of table partitions in favor of loading only needed partitions from the metastore. This greatly improves the initial latency of queries that only read a small fraction of table partitions. However, since the metastore does not store file statistics, we need to discover those from remote storage. With the loss of the in-memory file status cache this has to happen on each query, increasing the latency of repeated queries over the same partitions. The proposal is to add back a per-table cache of partition contents, i.e. Map[Path, Array[FileStatus]]. This cache would be retained per-table, and can be invalidated through refreshTable() and refreshByPath(). Unlike the prior cache, it can be incrementally updated as new partitions are read. ## How was this patch tested? Existing tests and new tests in `HiveTablePerfStatsSuite`. cc mallman Author: Eric Liang <ekl@databricks.com> Author: Michael Allman <michael@videoamp.com> Author: Eric Liang <ekhliang@gmail.com> Closes #15539 from ericl/meta-cache.	2016-10-22 22:08:28 +08:00
w00228970	c1f344f1a0	[SPARK-17929][CORE] Fix deadlock when CoarseGrainedSchedulerBackend reset ## What changes were proposed in this pull request? https://issues.apache.org/jira/browse/SPARK-17929 Now `CoarseGrainedSchedulerBackend` reset will get the lock, ``` protected def reset(): Unit = synchronized { numPendingExecutors = 0 executorsPendingToRemove.clear() // Remove all the lingering executors that should be removed but not yet. The reason might be // because (1) disconnected event is not yet received; (2) executors die silently. executorDataMap.toMap.foreach { case (eid, _) => driverEndpoint.askWithRetry[Boolean]( RemoveExecutor(eid, SlaveLost("Stale executor after cluster manager re-registered."))) } } ``` but on removeExecutor also need the lock "CoarseGrainedSchedulerBackend.this.synchronized", this will cause deadlock. ``` private def removeExecutor(executorId: String, reason: ExecutorLossReason): Unit = { logDebug(s"Asked to remove executor $executorId with reason $reason") executorDataMap.get(executorId) match { case Some(executorInfo) => // This must be synchronized because variables mutated // in this block are read when requesting executors val killed = CoarseGrainedSchedulerBackend.this.synchronized { addressToExecutorId -= executorInfo.executorAddress executorDataMap -= executorId executorsPendingLossReason -= executorId executorsPendingToRemove.remove(executorId).getOrElse(false) } ... ## How was this patch tested? manual test. Author: w00228970 <wangfei1@huawei.com> Closes #15481 from scwf/spark-17929.	2016-10-21 14:43:55 -07:00
Hossein	e371040a01	[SPARK-17811] SparkR cannot parallelize data.frame with NA or NULL in Date columns ## What changes were proposed in this pull request? NA date values are serialized as "NA" and NA time values are serialized as NaN from R. In the backend we did not have proper logic to deal with them. As a result we got an IllegalArgumentException for Date and wrong value for time. This PR adds support for deserializing NA as Date and Time. ## How was this patch tested? * [x] TODO Author: Hossein <hossein@databricks.com> Closes #15421 from falaki/SPARK-17811.	2016-10-21 12:38:52 -07:00
Alex Bozarth	3a237512b1	[SPARK-13275][WEB UI] Visually clarified executors start time in timeline ## What changes were proposed in this pull request? Updated the Executors added/removed bubble in the time line so it's clearer where it starts. Now the bubble is left justified on the start time (still also denoted by the line) rather than center justified. ## How was this patch tested? Manually tested UI <img width="596" alt="screen shot 2016-10-17 at 6 04 36 pm" src="https://cloud.githubusercontent.com/assets/13952758/19496563/e6c9186e-953c-11e6-85e4-63309a553f65.png"> <img width="492" alt="screen shot 2016-10-17 at 5 54 09 pm" src="https://cloud.githubusercontent.com/assets/13952758/19496568/e9f06132-953c-11e6-8901-54405ebc7f5b.png"> Author: Alex Bozarth <ajbozart@us.ibm.com> Closes #15536 from ajbozarth/spark13275.	2016-10-21 11:39:32 +01:00
Jagadeesan	595893d33a	[SPARK-17960][PYSPARK][UPGRADE TO PY4J 0.10.4] ## What changes were proposed in this pull request? 1) Upgrade the Py4J version on the Java side 2) Update the py4j src zip file we bundle with Spark ## How was this patch tested? Existing doctests & unit tests pass Author: Jagadeesan <as2@us.ibm.com> Closes #15514 from jagadeesanas2/SPARK-17960.	2016-10-21 09:48:24 +01:00
WeichenXu	39755169fb	[SPARK-18003][SPARK CORE] Fix bug of RDD zipWithIndex & zipWithUniqueId index value overflowing ## What changes were proposed in this pull request? - Fix bug of RDD `zipWithIndex` generating wrong result when one partition contains more than 2147483647 records. - Fix bug of RDD `zipWithUniqueId` generating wrong result when one partition contains more than 2147483647 records. ## How was this patch tested? test added. Author: WeichenXu <WeichenXu123@outlook.com> Closes #15550 from WeichenXu123/fix_rdd_zipWithIndex_overflow.	2016-10-19 23:41:38 -07:00
Alex Bozarth	444c2d22e3	[SPARK-10541][WEB UI] Allow ApplicationHistoryProviders to provide their own text when there aren't any complete apps ## What changes were proposed in this pull request? I've added a method to `ApplicationHistoryProvider` that returns the html paragraph to display when there are no applications. This allows providers other than `FsHistoryProvider` to determine what is printed. The current hard coded text is now moved into `FsHistoryProvider` since it assumed that's what was being used before. I chose to make the function return html rather than text because the current text block had inline html in it and it allows a new implementation of `ApplicationHistoryProvider` more versatility. I did not see any security issues with this since injecting html here requires implementing `ApplicationHistoryProvider` and can't be done outside of code. ## How was this patch tested? Manual testing and dev/run-tests No visible changes to the UI Author: Alex Bozarth <ajbozart@us.ibm.com> Closes #15490 from ajbozarth/spark10541.	2016-10-19 13:01:33 -07:00
Yu Peng	2629cd7460	[SPARK-17711][TEST-HADOOP2.2] Fix hadoop2.2 compilation error ## What changes were proposed in this pull request? Fix hadoop2.2 compilation error. ## How was this patch tested? Existing tests. cc tdas zsxwing Author: Yu Peng <loneknightpy@gmail.com> Closes #15537 from loneknightpy/fix-17711.	2016-10-18 19:43:08 -07:00
Guoqiang Li	4518642abd	[SPARK-17930][CORE] The SerializerInstance instance used when deserializing a TaskResult is not reused ## What changes were proposed in this pull request? The following code is called when the DirectTaskResult instance is deserialized ```scala def value(): T = { if (valueObjectDeserialized) { valueObject } else { // Each deserialization creates a new instance of SerializerInstance, which is very time-consuming val resultSer = SparkEnv.get.serializer.newInstance() valueObject = resultSer.deserialize(valueBytes) valueObjectDeserialized = true valueObject } } ``` In the case of stage has a lot of tasks, reuse SerializerInstance instance can improve the scheduling performance of three times The test data is TPC-DS 2T (Parquet) and SQL statement as follows (query 2): ```sql select i_item_id, avg(ss_quantity) agg1, avg(ss_list_price) agg2, avg(ss_coupon_amt) agg3, avg(ss_sales_price) agg4 from store_sales, customer_demographics, date_dim, item, promotion where ss_sold_date_sk = d_date_sk and ss_item_sk = i_item_sk and ss_cdemo_sk = cd_demo_sk and ss_promo_sk = p_promo_sk and cd_gender = 'M' and cd_marital_status = 'M' and cd_education_status = '4 yr Degree' and (p_channel_email = 'N' or p_channel_event = 'N') and d_year = 2001 group by i_item_id order by i_item_id limit 100; ``` `spark-defaults.conf` file: ``` spark.master yarn-client spark.executor.instances 20 spark.driver.memory 16g spark.executor.memory 30g spark.executor.cores 5 spark.default.parallelism 100 spark.sql.shuffle.partitions 100000 spark.serializer org.apache.spark.serializer.KryoSerializer spark.driver.maxResultSize 0 spark.rpc.netty.dispatcher.numThreads 8 spark.executor.extraJavaOptions -XX:+UseG1GC -XX:+UseStringDeduplication -XX:G1HeapRegionSize=16M -XX:MetaspaceSize=256M spark.cleaner.referenceTracking.blocking true spark.cleaner.referenceTracking.blocking.shuffle true ``` Performance test results are as follows [SPARK-17930](https://github.com/witgo/spark/tree/SPARK-17930)\| [`ed14633`](`ed14633414`]) ------------ \| ------------- 54.5 s\|231.7 s ## How was this patch tested? Existing tests. Author: Guoqiang Li <witgo@qq.com> Closes #15512 from witgo/SPARK-17930.	2016-10-18 13:46:57 -07:00
Yu Peng	231f39e3f6	[SPARK-17711] Compress rolled executor log ## What changes were proposed in this pull request? This PR adds support for executor log compression. ## How was this patch tested? Unit tests cc: yhuai tdas mengxr Author: Yu Peng <loneknightpy@gmail.com> Closes #15285 from loneknightpy/compress-executor-log.	2016-10-18 13:23:31 -07:00
Liwei Lin	7d878cf2da	[SQL][STREAMING][TEST] Fix flaky tests in StreamingQueryListenerSuite This work has largely been done by lw-lin in his PR #15497. This is a slight refactoring of it. ## What changes were proposed in this pull request? There were two sources of flakiness in StreamingQueryListener test. - When testing with manual clock, consecutive attempts to advance the clock can occur without the stream execution thread being unblocked and doing some work between the two attempts. Hence the following can happen with the current ManualClock. ``` +-----------------------------------+--------------------------------+ \| StreamExecution thread \| testing thread \| +-----------------------------------+--------------------------------+ \| ManualClock.waitTillTime(100) { \| \| \| _isWaiting = true \| \| \| wait(10) \| \| \| still in wait(10) \| if (_isWaiting) advance(100) \| \| still in wait(10) \| if (_isWaiting) advance(200) \| <- this should be disallowed ! \| still in wait(10) \| if (_isWaiting) advance(300) \| <- this should be disallowed ! \| wake up from wait(10) \| \| \| current time is 600 \| \| \| _isWaiting = false \| \| \| } \| \| +-----------------------------------+--------------------------------+ ``` - Second source of flakiness is that the adding data to memory stream may get processing in any trigger, not just the first trigger. My fix is to make the manual clock wait for the other stream execution thread to start waiting for the clock at the right wait start time. That is, `advance(200)` (see above) will wait for stream execution thread to complete the wait that started at time 0, and start a new wait at time 200 (i.e. time stamp after the previous `advance(100)`). In addition, since this is a feature that is solely used by StreamExecution, I removed all the non-generic code from ManualClock and put them in StreamManualClock inside StreamTest. ## How was this patch tested? Ran existing unit test MANY TIME in Jenkins Author: Tathagata Das <tathagata.das1565@gmail.com> Author: Liwei Lin <lwlin7@gmail.com> Closes #15519 from tdas/metrics-flaky-test-fix.	2016-10-18 00:49:57 -07:00
Sital Kedia	c7ac027d5f	[SPARK-17839][CORE] Use Nio's directbuffer instead of BufferedInputStream in order to avoid additional copy from os buffer cache to user buffer ## What changes were proposed in this pull request? Currently we use BufferedInputStream to read the shuffle file which copies the file content from os buffer cache to the user buffer. This adds additional latency in reading the spill files. We made a change to use java nio's direct buffer to read the spill files and for certain pipelines spilling significant amount of data, we see up to 7% speedup for the entire pipeline. ## How was this patch tested? Tested by running the job in the cluster and observed up to 7% speedup. Author: Sital Kedia <skedia@fb.com> Closes #15408 from sitalkedia/skedia/nio_spill_read.	2016-10-17 11:03:04 -07:00
Reynold Xin	72a6e7a57a	Revert "[SPARK-17637][SCHEDULER] Packed scheduling for Spark tasks across executors" This reverts commit `ed14633414`. The patch merged had obvious quality and documentation issue. The idea is useful, and we should work towards improving its quality and merging it in again.	2016-10-15 22:31:37 -07:00
Zhan Zhang	ed14633414	[SPARK-17637][SCHEDULER] Packed scheduling for Spark tasks across executors ## What changes were proposed in this pull request? Restructure the code and implement two new task assigner. PackedAssigner: try to allocate tasks to the executors with least available cores, so that spark can release reserved executors when dynamic allocation is enabled. BalancedAssigner: try to allocate tasks to the executors with more available cores in order to balance the workload across all executors. By default, the original round robin assigner is used. We test a pipeline, and new PackedAssigner save around 45% regarding the reserved cpu and memory with dynamic allocation enabled. ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) Both unit test in TaskSchedulerImplSuite and manual tests in production pipeline. Author: Zhan Zhang <zhanzhang@fb.com> Closes #15218 from zhzhan/packed-scheduler.	2016-10-15 18:45:04 -07:00

1 2 3 4 5 ...

5382 commits