ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Johan Grande	0ef9fe64e2	Typo in comment - Author: Johan Grande <nahoj@crans.org> Closes #18738 from nahoj/patch-1.	2017-07-28 16:51:18 +01:00
davidxdh	784680903c	[SPARK-21553][SPARK SHELL] Add the description of the default value of master parameter in the spark-shell When I type spark-shell --help, I find that the default value description for the master parameter is missing. The user does not know what the default value is when the master parameter is not included, so we need to add the master parameter default description to the help information. [https://issues.apache.org/jira/browse/SPARK-21553](https://issues.apache.org/jira/browse/SPARK-21553) Author: davidxdh <xu.donghui@zte.com.cn> Author: Donghui Xu <xu.donghui@zte.com.cn> Closes #18755 from davidxdh/dev_0728.	2017-07-28 15:21:45 +01:00
Sean Owen	63d168cbb8	[MINOR][BUILD] Fix current lint-java failures ## What changes were proposed in this pull request? Fixes current failures in dev/lint-java ## How was this patch tested? Existing linter, tests. Author: Sean Owen <sowen@cloudera.com> Closes #18757 from srowen/LintJava.	2017-07-28 11:31:40 +01:00
Wenchen Fan	9f5647d62e	[SPARK-21319][SQL] Fix memory leak in sorter ## What changes were proposed in this pull request? `UnsafeExternalSorter.recordComparator` can be either `KVComparator` or `RowComparator`, and both of them will keep the reference to the input rows they compared last time. After sorting, we return the sorted iterator to upstream operators. However, the upstream operators may take a while to consume up the sorted iterator, and `UnsafeExternalSorter` is registered to `TaskContext` at [here](https://github.com/apache/spark/blob/v2.2.0/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java#L159-L161), which means we will keep the `UnsafeExternalSorter` instance and keep the last compared input rows in memory until the sorted iterator is consumed up. Things get worse if we sort within partitions of a dataset and coalesce all partitions into one, as we will keep a lot of input rows in memory and the time to consume up all the sorted iterators is long. This PR takes over https://github.com/apache/spark/pull/18543 , the idea is that, we do not keep the record comparator instance in `UnsafeExternalSorter`, but a generator of record comparator. close #18543 ## How was this patch tested? N/A Author: Wenchen Fan <wenchen@databricks.com> Closes #18679 from cloud-fan/memory-leak.	2017-07-27 22:56:26 +08:00
zhoukang	16612638f0	[SPARK-21517][CORE] Avoid copying memory when transfer chunks remotely ## What changes were proposed in this pull request? In our production cluster,oom happens when NettyBlockRpcServer receive OpenBlocks message.The reason we observed is below: When BlockManagerManagedBuffer call ChunkedByteBuffer#toNetty, it will use Unpooled.wrappedBuffer(ByteBuffer... buffers) which use default maxNumComponents=16 in low-level CompositeByteBuf.When our component's number is bigger than 16, it will execute consolidateIfNeeded int numComponents = this.components.size(); if(numComponents > this.maxNumComponents) { int capacity = ((CompositeByteBuf.Component)this.components.get(numComponents - 1)).endOffset; ByteBuf consolidated = this.allocBuffer(capacity); for(int c = 0; c < numComponents; ++c) { CompositeByteBuf.Component c1 = (CompositeByteBuf.Component)this.components.get(c); ByteBuf b = c1.buf; consolidated.writeBytes(b); c1.freeIfNecessary(); } CompositeByteBuf.Component var7 = new CompositeByteBuf.Component(consolidated); var7.endOffset = var7.length; this.components.clear(); this.components.add(var7); } in CompositeByteBuf which will consume some memory during buffer copy. We can use another api Unpooled. wrappedBuffer(int maxNumComponents, ByteBuffer... buffers) to avoid this comsuming. ## How was this patch tested? Test in production cluster. Author: zhoukang <zhoukang@xiaomi.com> Closes #18723 from caneGuy/zhoukang/fix-chunkbuffer.	2017-07-25 17:59:21 -07:00
Eric Vandenberg	06a9793793	[SPARK-21447][WEB UI] Spark history server fails to render compressed inprogress history file in some cases. Add failure handling for EOFException that can be thrown during decompression of an inprogress spark history file, treat same as case where can't parse the last line. ## What changes were proposed in this pull request? Failure handling for case of EOFException thrown within the ReplayListenerBus.replay method to handle the case analogous to json parse fail case. This path can arise in compressed inprogress history files since an incomplete compression block could be read (not flushed by writer on a block boundary). See the stack trace of this occurrence in the jira ticket (https://issues.apache.org/jira/browse/SPARK-21447) ## How was this patch tested? Added a unit test that specifically targets validating the failure handling path appropriately when maybeTruncated is true and false. Author: Eric Vandenberg <ericvandenberg@fb.com> Closes #18673 from ericvandenbergfb/fix_inprogress_compr_history_file.	2017-07-25 11:45:35 -07:00
Marcelo Vanzin	cecd285a2a	[SPARK-20904][CORE] Don't report task failures to driver during shutdown. Executors run a thread pool with daemon threads to run tasks. This means that those threads remain active when the JVM is shutting down, meaning those tasks are affected by code that runs in shutdown hooks. So if a shutdown hook messes with something that the task is using (e.g. an HDFS connection), the task will fail and will report that failure to the driver. That will make the driver mark the task as failed regardless of what caused the executor to shut down. So, for example, if YARN pre-empted that executor, the driver would consider that task failed when it should instead ignore the failure. This change avoids reporting failures to the driver when shutdown hooks are executing; this fixes the YARN preemption accounting, and doesn't really change things much for other scenarios, other than reporting a more generic error ("Executor lost") when the executor shuts down unexpectedly - which is arguably more correct. Tested with a hacky app running on spark-shell that tried to cause failures only when shutdown hooks were running, verified that preemption didn't cause the app to fail because of task failures exceeding the threshold. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #18594 from vanzin/SPARK-20904.	2017-07-23 23:23:13 +08:00
Wenchen Fan	3ac6093086	[SPARK-10063] Follow-up: remove dead code related to an old output committer ## What changes were proposed in this pull request? DirectParquetOutputCommitter was removed from Spark as it was deemed unsafe to use. We however still have some code to generate warning. This patch removes those code as well. This is kind of a follow-up of https://github.com/apache/spark/pull/16796 ## How was this patch tested? existing tests Author: Wenchen Fan <wenchen@databricks.com> Closes #18689 from cloud-fan/minor.	2017-07-20 12:08:20 -07:00
Oleg Danilov	da9f067a1e	[SPARK-19531] Send UPDATE_LENGTH for Spark History service ## What changes were proposed in this pull request? During writing to the .inprogress file (stored on the HDFS) Hadoop doesn't update file length until close and therefor Spark's history server can't detect any changes. We have to send UPDATE_LENGTH manually. Author: Oleg Danilov <oleg.danilov@wandisco.com> Closes #16924 from dosoft/SPARK-19531.	2017-07-20 09:38:49 -07:00
Xiang Gao	b7a40f64e6	[SPARK-16542][SQL][PYSPARK] Fix bugs about types that result an array of null when creating DataFrame using python ## What changes were proposed in this pull request? This is the reopen of https://github.com/apache/spark/pull/14198, with merge conflicts resolved. ueshin Could you please take a look at my code? Fix bugs about types that result an array of null when creating DataFrame using python. Python's array.array have richer type than python itself, e.g. we can have `array('f',[1,2,3])` and `array('d',[1,2,3])`. Codes in spark-sql and pyspark didn't take this into consideration which might cause a problem that you get an array of null values when you have `array('f')` in your rows. A simple code to reproduce this bug is: ``` from pyspark import SparkContext from pyspark.sql import SQLContext,Row,DataFrame from array import array sc = SparkContext() sqlContext = SQLContext(sc) row1 = Row(floatarray=array('f',[1,2,3]), doublearray=array('d',[1,2,3])) rows = sc.parallelize([ row1 ]) df = sqlContext.createDataFrame(rows) df.show() ``` which have output ``` +---------------+------------------+ \| doublearray\| floatarray\| +---------------+------------------+ \|[1.0, 2.0, 3.0]\|[null, null, null]\| +---------------+------------------+ ``` ## How was this patch tested? New test case added Author: Xiang Gao <qasdfgtyuiop@gmail.com> Author: Gao, Xiang <qasdfgtyuiop@gmail.com> Author: Takuya UESHIN <ueshin@databricks.com> Closes #18444 from zasdfgbnm/fix_array_infer.	2017-07-20 12:46:06 +09:00
Dhruve Ashar	ef61775586	[SPARK-21243][Core] Limit no. of map outputs in a shuffle fetch ## What changes were proposed in this pull request? For configurations with external shuffle enabled, we have observed that if a very large no. of blocks are being fetched from a remote host, it puts the NM under extra pressure and can crash it. This change introduces a configuration `spark.reducer.maxBlocksInFlightPerAddress` , to limit the no. of map outputs being fetched from a given remote address. The changes applied here are applicable for both the scenarios - when external shuffle is enabled as well as disabled. ## How was this patch tested? Ran the job with the default configuration which does not change the existing behavior and ran it with few configurations of lower values -10,20,50,100. The job ran fine and there is no change in the output. (I will update the metrics related to NM in some time.) Author: Dhruve Ashar <dhruveashar@gmail.com> Closes #18487 from dhruve/impr/SPARK-21243.	2017-07-19 15:53:28 -05:00
Peng	46307b2cd3	[SPARK-21401][ML][MLLIB] add poll function for BoundedPriorityQueue ## What changes were proposed in this pull request? The most of BoundedPriorityQueue usages in ML/MLLIB are: Get the value of BoundedPriorityQueue, then sort it. For example, in Word2Vec: pq.toSeq.sortBy(-_._2) in ALS, pq.toArray.sorted() The test results show using pq.poll is much faster than sort the value. It is good to add the poll function for BoundedPriorityQueue. ## How was this patch tested? The existing UT Author: Peng <peng.meng@intel.com> Author: Peng Meng <peng.meng@intel.com> Closes #18620 from mpjlu/add-poll.	2017-07-19 09:56:48 +01:00
Marcelo Vanzin	264b0f36ce	[SPARK-21408][CORE] Better default number of RPC dispatch threads. Instead of using the host's cpu count, use the number of cores allocated for the Spark process when sizing the RPC dispatch thread pool. This avoids creating large thread pools on large machines when the number of allocated cores is small. Tested by verifying number of threads with spark.executor.cores set to 1 and 4; same thing for YARN AM. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #18639 from vanzin/SPARK-21408.	2017-07-18 13:36:10 -07:00
jerryshao	cde64add18	[SPARK-21411][YARN] Lazily create FS within kerberized UGI to avoid token acquiring failure ## What changes were proposed in this pull request? In the current `YARNHadoopDelegationTokenManager`, `FileSystem` to which to get tokens are created out of KDC logged UGI, using these `FileSystem` to get new tokens will lead to exception. The main thing is that Spark code trying to get new tokens from the FS created with token auth-ed UGI, but Hadoop can only grant new tokens in kerberized UGI. To fix this issue, we should lazily create these FileSystem within KDC logged UGI. ## How was this patch tested? Manual verification in secure cluster. CC vanzin mgummelt please help to review, thanks! Author: jerryshao <sshao@hortonworks.com> Closes #18633 from jerryshao/SPARK-21411.	2017-07-18 11:44:01 -07:00
Sean Owen	e26dac5feb	[SPARK-21415] Triage scapegoat warnings, part 1 ## What changes were proposed in this pull request? Address scapegoat warnings for: - BigDecimal double constructor - Catching NPE - Finalizer without super - List.size is O(n) - Prefer Seq.empty - Prefer Set.empty - reverse.map instead of reverseMap - Type shadowing - Unnecessary if condition. - Use .log1p - Var could be val In some instances like Seq.empty, I avoided making the change even where valid in test code to keep the scope of the change smaller. Those issues are concerned with performance and it won't matter for tests. ## How was this patch tested? Existing tests Author: Sean Owen <sowen@cloudera.com> Closes #18635 from srowen/Scapegoat1.	2017-07-18 08:47:17 +01:00
Josh Rosen	5952ad2b40	[SPARK-21444] Be more defensive when removing broadcasts in MapOutputTracker ## What changes were proposed in this pull request? In SPARK-21444, sitalkedia reported an issue where the `Broadcast.destroy()` call in `MapOutputTracker`'s `ShuffleStatus.invalidateSerializedMapOutputStatusCache()` was failing with an `IOException`, causing the DAGScheduler to crash and bring down the entire driver. This is a bug introduced by #17955. In the old code, we removed a broadcast variable by calling `BroadcastManager.unbroadcast` with `blocking=false`, but the new code simply calls `Broadcast.destroy()` which is capable of failing with an IOException in case certain blocking RPCs time out. The fix implemented here is to replace this with a call to `destroy(blocking = false)` and to wrap the entire operation in `Utils.tryLogNonFatalError`. ## How was this patch tested? I haven't written regression tests for this because it's really hard to inject mocks to simulate RPC failures here. Instead, this class of issue is probably best uncovered with more generalized error injection / network unreliability / fuzz testing tools. Author: Josh Rosen <joshrosen@databricks.com> Closes #18662 from JoshRosen/SPARK-21444.	2017-07-17 20:40:32 -07:00
Zhang A Peng	7aac755ba0	[SPARK-21410][CORE] Create less partitions for RangePartitioner if RDD.count() is less than `partitions` ## What changes were proposed in this pull request? Fix a bug in RangePartitioner: In RangePartitioner(partitions: Int, rdd: RDD[]), RangePartitioner.numPartitions is wrong if the number of elements in RDD (rdd.count()) is less than number of partitions (partitions in constructor). ## How was this patch tested? test as described in [SPARK-SPARK-21410](https://issues.apache.org/jira/browse/SPARK-21410 ) Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Zhang A Peng <zhangap@cn.ibm.com> Closes #18631 from apapi/fixRangePartitioner.numPartitions.	2017-07-18 09:57:53 +08:00
John Lee	0e07a29cf4	[SPARK-21321][SPARK CORE] Spark very verbose on shutdown ## What changes were proposed in this pull request? The current code is very verbose on shutdown. The changes I propose is to change the log level when the driver is shutting down and the RPC connections are closed (RpcEnvStoppedException). ## How was this patch tested? Tested with word count(deploy-mode = cluster, master = yarn, num-executors = 4) with 300GB of data. Author: John Lee <jlee2@yahoo-inc.com> Closes #18547 from yoonlee95/SPARK-21321.	2017-07-17 13:13:35 -05:00
Kazuaki Ishizaki	ac5d5d7959	[SPARK-21344][SQL] BinaryType comparison does signed byte array comparison ## What changes were proposed in this pull request? This PR fixes a wrong comparison for `BinaryType`. This PR enables unsigned comparison and unsigned prefix generation for an array for `BinaryType`. Previous implementations uses signed operations. ## How was this patch tested? Added a test suite in `OrderingSuite`. Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #18571 from kiszk/SPARK-21344.	2017-07-14 20:16:04 -07:00
Stavros Kontopoulos	d8257b99dd	[SPARK-21403][MESOS] fix --packages for mesos ## What changes were proposed in this pull request? Fixes --packages flag for mesos in cluster mode. Probably I will handle standalone and Yarn in another commit, I need to investigate those cases as they are different. ## How was this patch tested? Tested with a community 1.9 dc/os cluster. packages were successfully resolved in cluster mode within a container. andrewor14 susanxhuynh ArtRand srowen pls review. Author: Stavros Kontopoulos <st.kontopoulos@gmail.com> Closes #18587 from skonto/fix_packages_mesos_cluster.	2017-07-13 10:37:15 -07:00
Sean Owen	425c4ada4c	[SPARK-19810][BUILD][CORE] Remove support for Scala 2.10 ## What changes were proposed in this pull request? - Remove Scala 2.10 build profiles and support - Replace some 2.10 support in scripts with commented placeholders for 2.12 later - Remove deprecated API calls from 2.10 support - Remove usages of deprecated context bounds where possible - Remove Scala 2.10 workarounds like ScalaReflectionLock - Other minor Scala warning fixes ## How was this patch tested? Existing tests Author: Sean Owen <sowen@cloudera.com> Closes #17150 from srowen/SPARK-19810.	2017-07-13 17:06:24 +08:00
Devaraj K	e16e8c7ad3	[SPARK-21146][CORE] Master/Worker should handle and shutdown when any thread gets UncaughtException ## What changes were proposed in this pull request? Adding the default UncaughtExceptionHandler to the Worker. ## How was this patch tested? I verified it manually, when any of the worker thread gets uncaught exceptions then the default UncaughtExceptionHandler will handle those exceptions. Author: Devaraj K <devaraj@apache.org> Closes #18357 from devaraj-kavali/SPARK-21146.	2017-07-12 00:14:58 -07:00
jinxing	97a1aa2c70	[SPARK-21315][SQL] Skip some spill files when generateIterator(startIndex) in ExternalAppendOnlyUnsafeRowArray. ## What changes were proposed in this pull request? In current code, it is expensive to use `UnboundedFollowingWindowFunctionFrame`, because it is iterating from the start to lower bound every time calling `write` method. When traverse the iterator, it's possible to skip some spilled files thus to save some time. ## How was this patch tested? Added unit test Did a small test for benchmark: Put 2000200 rows into `UnsafeExternalSorter`-- 2 spill files(each contains 1000000 rows) and inMemSorter contains 200 rows. Move the iterator forward to index=2000001. With this change: `getIterator(2000001)`, it will cost almost 0ms~1ms; Without this change: `for(int i=0; i<2000001; i++)geIterator().loadNext()`, it will cost 300ms. Author: jinxing <jinxing6042@126.com> Closes #18541 from jinxing64/SPARK-21315.	2017-07-11 11:47:47 +08:00
jinxing	6a06c4b03c	[SPARK-21342] Fix DownloadCallback to work well with RetryingBlockFetcher. ## What changes were proposed in this pull request? When `RetryingBlockFetcher` retries fetching blocks. There could be two `DownloadCallback`s download the same content to the same target file. It could cause `ShuffleBlockFetcherIterator` reading a partial result. This pr proposes to create and delete the tmp files in `OneForOneBlockFetcher` Author: jinxing <jinxing6042@126.com> Author: Shixiong Zhu <zsxwing@gmail.com> Closes #18565 from jinxing64/SPARK-21342.	2017-07-10 21:06:58 +08:00
Eric Vandenberg	96d58f285b	[SPARK-21219][CORE] Task retry occurs on same executor due to race condition with blacklisting ## What changes were proposed in this pull request? There's a race condition in the current TaskSetManager where a failed task is added for retry (addPendingTask), and can asynchronously be assigned to an executor prior to the blacklist state (updateBlacklistForFailedTask), the result is the task might re-execute on the same executor. This is particularly problematic if the executor is shutting down since the retry task immediately becomes a lost task (ExecutorLostFailure). Another side effect is that the actual failure reason gets obscured by the retry task which never actually executed. There are sample logs showing the issue in the https://issues.apache.org/jira/browse/SPARK-21219 The fix is to change the ordering of the addPendingTask and updatingBlackListForFailedTask calls in TaskSetManager.handleFailedTask ## How was this patch tested? Implemented a unit test that verifies the task is black listed before it is added to the pending task. Ran the unit test without the fix and it fails. Ran the unit test with the fix and it passes. Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Eric Vandenberg <ericvandenberg@fb.com> Closes #18427 from ericvandenbergfb/blacklistFix.	2017-07-10 14:40:20 +08:00
jinxing	062c336d06	[SPARK-21343] Refine the document for spark.reducer.maxReqSizeShuffleToMem. ## What changes were proposed in this pull request? In current code, reducer can break the old shuffle service when `spark.reducer.maxReqSizeShuffleToMem` is enabled. Let's refine document. Author: jinxing <jinxing6042@126.com> Closes #18566 from jinxing64/SPARK-21343.	2017-07-09 00:27:58 +08:00
Marcelo Vanzin	9131bdb7e1	[SPARK-20342][CORE] Update task accumulators before sending task end event. This makes sures that listeners get updated task information; otherwise it's possible to write incomplete task information into event logs, for example, making the information in a replayed UI inconsistent with the original application. Added a new unit test to try to detect the problem, but it's not guaranteed to fail since it's a race; but it fails pretty reliably for me without the scheduler changes. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #18393 from vanzin/SPARK-20342.try2.	2017-07-09 00:24:54 +08:00
Marcelo Vanzin	9760c15acb	[SPARK-20379][CORE] Allow SSL config to reference env variables. This change exposes the internal code path in SparkConf that allows configs to be read with variable substitution applied, and uses that new method in SSLOptions so that SSL configs can reference other variables, and more importantly, environment variables, providing a secure way to provide passwords to Spark when using SSL. The approach is a little bit hacky, but is the smallest change possible. Otherwise, the concept of "namespaced configs" would have to be added to the config system, which would create a lot of noise for not much gain at this point. Tested with added unit tests, and on a real cluster with SSL enabled. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #18394 from vanzin/SPARK-20379.try2.	2017-07-08 14:20:09 +08:00
CodingCat	fbbe37ed41	[SPARK-19358][CORE] LiveListenerBus shall log the event name when dropping them due to a fully filled queue ## What changes were proposed in this pull request? Some dropped event will make the whole application behaves unexpectedly, e.g. some UI problem...we shall log the dropped event name to facilitate the debugging ## How was this patch tested? Existing tests Author: CodingCat <zhunansjtu@gmail.com> Closes #16697 from CodingCat/SPARK-19358.	2017-07-07 20:10:24 +08:00
Takuya UESHIN	53c2eb59b2	[SPARK-21327][SQL][PYSPARK] ArrayConstructor should handle an array of typecode 'l' as long rather than int in Python 2. ## What changes were proposed in this pull request? Currently `ArrayConstructor` handles an array of typecode `'l'` as `int` when converting Python object in Python 2 into Java object, so if the value is larger than `Integer.MAX_VALUE` or smaller than `Integer.MIN_VALUE` then the overflow occurs. ```python import array data = [Row(longarray=array.array('l', [-9223372036854775808, 0, 9223372036854775807]))] df = spark.createDataFrame(data) df.show(truncate=False) ``` ``` +----------+ \|longarray \| +----------+ \|[0, 0, -1]\| +----------+ ``` This should be: ``` +----------------------------------------------+ \|longarray \| +----------------------------------------------+ \|[-9223372036854775808, 0, 9223372036854775807]\| +----------------------------------------------+ ``` ## How was this patch tested? Added a test and existing tests. Author: Takuya UESHIN <ueshin@databricks.com> Closes #18553 from ueshin/issues/SPARK-21327.	2017-07-07 14:05:22 +09:00
caoxuewen	565e7a8d4a	[SPARK-20950][CORE] add a new config to diskWriteBufferSize which is hard coded before ## What changes were proposed in this pull request? This PR Improvement in two: 1.With spark.shuffle.spill.diskWriteBufferSize configure diskWriteBufferSize of ShuffleExternalSorter. when change the size of the diskWriteBufferSize to test `forceSorterToSpill` The average performance of running 10 times is as follows:(their unit is MS). ``` diskWriteBufferSize: 1M 512K 256K 128K 64K 32K 16K 8K 4K --------------------------------------------------------------------------------------- RecordSize = 2.5M 742 722 694 686 667 668 671 669 683 RecordSize = 1M 294 293 292 287 283 285 281 279 285 ``` 2.Remove outputBufferSizeInBytes and inputBufferSizeInBytes to initialize in mergeSpillsWithFileStream function. ## How was this patch tested? The unit test. Author: caoxuewen <cao.xuewen@zte.com.cn> Closes #18174 from heary-cao/buffersize.	2017-07-06 19:49:34 +08:00
Liang-Chi Hsieh	6ff05a66fe	[SPARK-20703][SQL] Associate metrics with data writes onto DataFrameWriter operations ## What changes were proposed in this pull request? Right now in the UI, after SPARK-20213, we can show the operations to write data out. However, there is no way to associate metrics with data writes. We should show relative metrics on the operations. #### Supported commands This change supports updating metrics for file-based data writing operations, including `InsertIntoHadoopFsRelationCommand`, `InsertIntoHiveTable`. Supported metrics: * number of written files * number of dynamic partitions * total bytes of written data * total number of output rows * average writing data out time (ms) * (TODO) min/med/max number of output rows per file/partition * (TODO) min/med/max bytes of written data per file/partition #### Commands not supported `InsertIntoDataSourceCommand`, `SaveIntoDataSourceCommand`: The two commands uses DataSource APIs to write data out, i.e., the logic of writing data out is delegated to the DataSource implementations, such as `InsertableRelation.insert` and `CreatableRelationProvider.createRelation`. So we can't obtain metrics from delegated methods for now. `CreateHiveTableAsSelectCommand`, `CreateDataSourceTableAsSelectCommand` : The two commands invokes other commands to write data out. The invoked commands can even write to non file-based data source. We leave them as future TODO. #### How to update metrics of writing files out A `RunnableCommand` which wants to update metrics, needs to override its `metrics` and provide the metrics data structure to `ExecutedCommandExec`. The metrics are prepared during the execution of `FileFormatWriter`. The callback function passed to `FileFormatWriter` will accept the metrics and update accordingly. There is a metrics updating function in `RunnableCommand`. In runtime, the function will be bound to the spark context and `metrics` of `ExecutedCommandExec` and pass to `FileFormatWriter`. ## How was this patch tested? Updated unit tests. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #18159 from viirya/SPARK-20703-2.	2017-07-06 15:47:09 +08:00
jerryshao	5800144a54	[SPARK-21012][SUBMIT] Add glob support for resources adding to Spark Current "--jars (spark.jars)", "--files (spark.files)", "--py-files (spark.submit.pyFiles)" and "--archives (spark.yarn.dist.archives)" only support non-glob path. This is OK for most of the cases, but when user requires to add more jars, files into Spark, it is too verbose to list one by one. So here propose to add glob path support for resources. Also improving the code of downloading resources. ## How was this patch tested? UT added, also verified manually in local cluster. Author: jerryshao <sshao@hortonworks.com> Closes #18235 from jerryshao/SPARK-21012.	2017-07-06 15:32:49 +08:00
Shixiong Zhu	ab866f1173	[SPARK-21248][SS] The clean up codes in StreamExecution should not be interrupted ## What changes were proposed in this pull request? This PR uses `runUninterruptibly` to avoid that the clean up codes in StreamExecution is interrupted. It also removes an optimization in `runUninterruptibly` to make sure this method never throw `InterruptedException`. ## How was this patch tested? Jenkins Author: Shixiong Zhu <shixiong@databricks.com> Closes #18461 from zsxwing/SPARK-21248.	2017-07-05 18:26:28 -07:00
Dongjoon Hyun	c8d0aba198	[SPARK-21278][PYSPARK] Upgrade to Py4J 0.10.6 ## What changes were proposed in this pull request? This PR aims to bump Py4J in order to fix the following float/double bug. Py4J 0.10.5 fixes this (https://github.com/bartdag/py4j/issues/272) and the latest Py4J is 0.10.6. BEFORE ``` >>> df = spark.range(1) >>> df.select(df['id'] + 17.133574204226083).show() +--------------------+ \|(id + 17.1335742042)\| +--------------------+ \| 17.1335742042\| +--------------------+ ``` AFTER ``` >>> df = spark.range(1) >>> df.select(df['id'] + 17.133574204226083).show() +-------------------------+ \|(id + 17.133574204226083)\| +-------------------------+ \| 17.133574204226083\| +-------------------------+ ``` ## How was this patch tested? Manual. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #18546 from dongjoon-hyun/SPARK-21278.	2017-07-05 16:33:23 -07:00
he.qiao	e3e2b5da36	[SPARK-21286][TEST] Modified StorageTabSuite unit test ## What changes were proposed in this pull request? The old unit test not effect ## How was this patch tested? unit test Author: he.qiao <he.qiao17@zte.com.cn> Closes #18511 from Geek-He/dev_0703.	2017-07-05 21:13:25 +08:00
hyukjinkwon	2b1e94b9ad	[MINOR][SPARK SUBMIT] Print out R file usage in spark-submit ## What changes were proposed in this pull request? Currently, running the shell below: ```bash $ ./bin/spark-submit tmp.R a b c ``` with R file, `tmp.R` as below: ```r #!/usr/bin/env Rscript library(SparkR) sparkRSQL.init(sparkR.init(master = "local")) collect(createDataFrame(list(list(1)))) print(commandArgs(trailingOnly = TRUE)) ``` working fine as below: ```bash _1 1 1 [1] "a" "b" "c" ``` However, it looks not printed in usage documentation as below: ```bash $ ./bin/spark-submit ``` ``` Usage: spark-submit [options] <app jar \| python file> [app arguments] ... ``` For `./bin/sparkR`, it looks fine as below: ```bash $ ./bin/sparkR tmp.R ``` ``` Running R applications through 'sparkR' is not supported as of Spark 2.0. Use ./bin/spark-submit <R file> ``` Running the script below: ```bash $ ./bin/spark-submit ``` Before ``` Usage: spark-submit [options] <app jar \| python file> [app arguments] ... ``` After ``` Usage: spark-submit [options] <app jar \| python file \| R file> [app arguments] ... ``` ## How was this patch tested? Manually tested. Author: hyukjinkwon <gurwls223@gmail.com> Closes #18505 from HyukjinKwon/minor-doc-summit.	2017-07-04 12:18:42 +01:00
liuxian	6657e00de3	[SPARK-21283][CORE] FileOutputStream should be created as append mode ## What changes were proposed in this pull request? `FileAppender` is used to write `stderr` and `stdout` files in `ExecutorRunner`, But before writing `ErrorStream` into the the `stderr` file, the header information has been written into ,if FileOutputStream is not created as append mode, the header information will be lost ## How was this patch tested? unit test case Author: liuxian <liu.xian3@zte.com.cn> Closes #18507 from 10110346/wip-lx-0703.	2017-07-04 09:16:40 +08:00
Sean Owen	a9339db99f	[SPARK-21137][CORE] Spark reads many small files slowly ## What changes were proposed in this pull request? Parallelize FileInputFormat.listStatus in Hadoop API via LIST_STATUS_NUM_THREADS to speed up examination of file sizes for wholeTextFiles et al ## How was this patch tested? Existing tests, which will exercise the key path here: using a local file system. Author: Sean Owen <sowen@cloudera.com> Closes #18441 from srowen/SPARK-21137.	2017-07-03 19:52:39 +08:00
guoxiaolong	d913db16a0	[SPARK-21250][WEB-UI] Add a url in the table of 'Running Executors' in worker page to visit job page. ## What changes were proposed in this pull request? Add a url in the table of 'Running Executors' in worker page to visit job page. When I click URL of 'Name', the current page jumps to the job page. Of course this is only in the table of 'Running Executors'. This URL of 'Name' is in the table of 'Finished Executors' does not exist, the click will not jump to any page. fix before: ![1](https://user-images.githubusercontent.com/26266482/27679397-30ddc262-5ceb-11e7-839b-0889d1f42480.png) fix after: ![2](https://user-images.githubusercontent.com/26266482/27679405-3588ef12-5ceb-11e7-9756-0a93815cd698.png) ## How was this patch tested? manual tests Please review http://spark.apache.org/contributing.html before opening a pull request. Author: guoxiaolong <guo.xiaolong1@zte.com.cn> Closes #18464 from guoxiaolongzte/SPARK-21250.	2017-07-03 13:31:01 +08:00
Devaraj K	6beca9ce94	[SPARK-21170][CORE] Utils.tryWithSafeFinallyAndFailureCallbacks throws IllegalArgumentException: Self-suppression not permitted ## What changes were proposed in this pull request? Not adding the exception to the suppressed if it is the same instance as originalThrowable. ## How was this patch tested? Added new tests to verify this, these tests fail without source code changes and passes with the change. Author: Devaraj K <devaraj@apache.org> Closes #18384 from devaraj-kavali/SPARK-21170.	2017-07-01 15:53:49 +01:00
Liang-Chi Hsieh	fd13255225	[SPARK-21052][SQL][FOLLOW-UP] Add hash map metrics to join ## What changes were proposed in this pull request? Remove `numHashCollisions` in `BytesToBytesMap`. And change `getAverageProbesPerLookup()` to `getAverageProbesPerLookup` as suggested. ## How was this patch tested? Existing tests. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #18480 from viirya/SPARK-21052-followup.	2017-06-30 15:11:27 -07:00
曾林西	1fe08d62f0	[SPARK-21223] Change fileToAppInfo in FsHistoryProvider to fix concurrent issue. # What issue does this PR address ? Jira:https://issues.apache.org/jira/browse/SPARK-21223 fix the Thread-safety issue in FsHistoryProvider Currently, Spark HistoryServer use a HashMap named fileToAppInfo in class FsHistoryProvider to store the map of eventlog path and attemptInfo. When use ThreadPool to Replay the log files in the list and merge the list of old applications with new ones, multi thread may update fileToAppInfo at the same time, which may cause Thread-safety issues, such as falling into an infinite loop because of calling resize func of the hashtable. Author: 曾林西 <zenglinxi@meituan.com> Closes #18430 from zenglinxi0615/master.	2017-06-30 19:28:43 +01:00
Xingbo Jiang	3c2fc19d47	[SPARK-18294][CORE] Implement commit protocol to support `mapred` package's committer ## What changes were proposed in this pull request? This PR makes the following changes: - Implement a new commit protocol `HadoopMapRedCommitProtocol` which support the old `mapred` package's committer; - Refactor SparkHadoopWriter and SparkHadoopMapReduceWriter, now they are combined together, thus we can support write through both mapred and mapreduce API by the new SparkHadoopWriter, a lot of duplicated codes are removed. After this change, it should be pretty easy for us to support the committer from both the new and the old hadoop API at high level. ## How was this patch tested? No major behavior change, passed the existing test cases. Author: Xingbo Jiang <xingbo.jiang@databricks.com> Closes #18438 from jiangxb1987/SparkHadoopWriter.	2017-06-30 20:30:26 +08:00
IngoSchuster	88a536babf	[SPARK-21176][WEB UI] Limit number of selector threads for admin ui proxy servlets to 8 ## What changes were proposed in this pull request? Please see also https://issues.apache.org/jira/browse/SPARK-21176 This change limits the number of selector threads that jetty creates to maximum 8 per proxy servlet (Jetty default is number of processors / 2). The newHttpClient for Jettys ProxyServlet class is overwritten to avoid the Jetty defaults (which are designed for high-performance http servers). Once https://github.com/eclipse/jetty.project/issues/1643 is available, the code could be cleaned up to avoid the method override. I really need this on v2.1.1 - what is the best way for a backport automatic merge works fine)? Shall I create another PR? ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) The patch was tested manually on a Spark cluster with a head node that has 88 processors using JMX to verify that the number of selector threads is now limited to 8 per proxy. gurvindersingh zsxwing can you please review the change? Author: IngoSchuster <ingo.schuster@de.ibm.com> Author: Ingo Schuster <ingo.schuster@de.ibm.com> Closes #18437 from IngoSchuster/master.	2017-06-30 11:16:09 +08:00
Shixiong Zhu	80f7ac3a60	[SPARK-21253][CORE] Disable spark.reducer.maxReqSizeShuffleToMem ## What changes were proposed in this pull request? Disable spark.reducer.maxReqSizeShuffleToMem because it breaks the old shuffle service. Credits to wangyum Closes #18466 ## How was this patch tested? Jenkins Author: Shixiong Zhu <shixiong@databricks.com> Author: Yuming Wang <wgyumg@gmail.com> Closes #18467 from zsxwing/SPARK-21253.	2017-06-30 11:02:22 +08:00
Feng Liu	f9151bebca	[SPARK-21188][CORE] releaseAllLocksForTask should synchronize the whole method ## What changes were proposed in this pull request? Since the objects `readLocksByTask`, `writeLocksByTask` and `info`s are coupled and supposed to be modified by other threads concurrently, all the read and writes of them in the method `releaseAllLocksForTask` should be protected by a single synchronized block like other similar methods. ## How was this patch tested? existing tests Author: Feng Liu <fengliu@databricks.com> Closes #18400 from liufengdb/synchronize.	2017-06-29 16:03:15 -07:00
杨治国10192065	29bd251dd5	[SPARK-21225][CORE] Considering CPUS_PER_TASK when allocating task slots for each WorkerOffer JIRA Issue:https://issues.apache.org/jira/browse/SPARK-21225 In the function "resourceOffers", It declare a variable "tasks" for storage the tasks which have allocated a executor. It declared like this: `val tasks = shuffledOffers.map(o => new ArrayBuffer[TaskDescription](o.cores))` But, I think this code only conside a situation for that one task per core. If the user set "spark.task.cpus" as 2 or 3, It really don't need so much Mem. I think It can motify as follow: val tasks = shuffledOffers.map(o => new ArrayBuffer[TaskDescription](o.cores / CPUS_PER_TASK)) to instead. Motify like this the other earning is that it's more easy to understand the way how the tasks allocate offers. Author: 杨治国10192065 <yang.zhiguo@zte.com.cn> Closes #18435 from JackYangzg/motifyTaskCoreDisp.	2017-06-29 20:53:48 +08:00
fjh100456	d7da2b94d6	[SPARK-21135][WEB UI] On history server page，duration of incompleted applications should be hidden instead of showing up as 0 ## What changes were proposed in this pull request? Hide duration of incompleted applications. ## How was this patch tested? manual tests Author: fjh100456 <fu.jinhua6@zte.com.cn> Closes #18351 from fjh100456/master.	2017-06-29 10:01:12 +01:00
jinxing	d106a74c53	[SPARK-21240] Fix code style for constructing and stopping a SparkContext in UT. ## What changes were proposed in this pull request? Same with SPARK-20985. Fix code style for constructing and stopping a `SparkContext`. Assure the context is stopped to avoid other tests complain that there's only one `SparkContext` can exist. Author: jinxing <jinxing6042@126.com> Closes #18454 from jinxing64/SPARK-21240.	2017-06-29 09:59:36 +01:00

1 2 3 4 5 ...

6147 commits