ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Thomas Graves	5b08ee6396	[SPARK-15671] performance regression CoalesceRDD.pickBin with large #… I was running a 15TB join job with 202000 partitions. It looks like the changes I made to CoalesceRDD in pickBin() are really slow with that many partitions. The array filter with that many elements just takes too long. It took about an hour to pick bins for all the partitions. Original change: `83ee92f603`. Just reverting the pickBin code back to get currpreflocs fixes the issue. After reverting the pickBin code, the coalesce takes about 10 seconds, so for now it makes sense to revert those changes and look at further optimizations later. Tested this via the RDDSuite unit test and by manually running the very large job. Author: Thomas Graves <tgraves@prevailsail.corp.gq1.yahoo.com> Closes #13443 from tgravescs/SPARK-15671.	2016-06-01 13:21:40 -07:00
Sean Zhong	d5012c2740	[SPARK-15495][SQL] Improve the explain output for Aggregation operator ## What changes were proposed in this pull request? This PR improves the explain output of Aggregator operator. SQL: ``` Seq((1,2,3)).toDF("a", "b", "c").createTempView("df1") spark.sql("cache table df1") spark.sql("select count(a), count(c), b from df1 group by b").explain() ``` Before change: ``` TungstenAggregate(key=[b#8], functions=[count(1),count(1)], output=[count(a)#79L,count(c)#80L,b#8]) +- Exchange hashpartitioning(b#8, 200), None +- TungstenAggregate(key=[b#8], functions=[partial_count(1),partial_count(1)], output=[b#8,count#98L,count#99L]) +- InMemoryTableScan [b#8], InMemoryRelation [a#7,b#8,c#9], true, 10000, StorageLevel(disk=true, memory=true, offheap=false, deserialized=true, replication=1), LocalTableScan [a#7,b#8,c#9], [[1,2,3]], Some(df1) ``` After change: ``` Aggregate(key=[b#8], functions=[count(1),count(1)], output=[count(a)#79L,count(c)#80L,b#8]) +- Exchange hashpartitioning(b#8, 200), None +- Aggregate(key=[b#8], functions=[partial_count(1),partial_count(1)], output=[b#8,count#98L,count#99L]) +- InMemoryTableScan [b#8], InMemoryRelation [a#7,b#8,c#9], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas), LocalTableScan [a#7,b#8,c#9], [[1,2,3]], Some(df1) ``` ## How was this patch tested? Manual test and existing UT. Author: Sean Zhong <seanzhong@databricks.com> Closes #13363 from clockfly/verbose3.	2016-06-01 09:58:01 -07:00
Tejas Patil	ac38bdc756	[SPARK-15601][CORE] CircularBuffer's toString() to print only the contents written if buffer isn't full ## What changes were proposed in this pull request? 1. The class allocated 4x more space than needed as it was using `Int` to store the `Byte` values 2. If CircularBuffer isn't full, toString() currently prints some garbage chars along with the content written, as it tries to print the entire array allocated for the buffer. The fix is to keep track of whether the buffer is full and not print the tail of the buffer if it isn't (suggestion by sameeragarwal over https://github.com/apache/spark/pull/12194#discussion_r64495331) 3. Simplified `toString()` ## How was this patch tested? Added new test case Author: Tejas Patil <tejasp@fb.com> Closes #13351 from tejasapatil/circular_buffer.	2016-05-31 19:52:22 -05:00
WeichenXu	dad5a68818	[SPARK-15670][JAVA API][SPARK CORE] label_accumulator_deprecate_in_java_spark_context ## What changes were proposed in this pull request? Add a deprecation annotation for the accumulator V1 interface in the JavaSparkContext class ## How was this patch tested? N/A Author: WeichenXu <WeichenXu123@outlook.com> Closes #13412 from WeichenXu123/label_accumulator_deprecate_in_java_spark_context.	2016-05-31 17:34:34 -07:00
Jacek Laskowski	0f24713468	[CORE][DOC][MINOR] typos + links ## What changes were proposed in this pull request? A very tiny change to javadoc (which I don't mind if gets merged with a bigger change). I've just found it annoying and couldn't resist proposing a pull request. Sorry srowen and rxin. ## How was this patch tested? Manual build Author: Jacek Laskowski <jacek@japila.pl> Closes #13383 from jaceklaskowski/memory-consumer.	2016-05-31 17:32:37 -07:00
Reynold Xin	223f1d58c4	[SPARK-15662][SQL] Add since annotation for classes in sql.catalog ## What changes were proposed in this pull request? This patch does a few things: 1. Adds since version annotation to methods and classes in sql.catalog. 2. Fixed a typo in FilterFunction and a whitespace issue in spark/api/java/function/package.scala 3. Added "database" field to Function class. ## How was this patch tested? Updated unit test case for "database" field in Function class. Author: Reynold Xin <rxin@databricks.com> Closes #13406 from rxin/SPARK-15662.	2016-05-31 17:29:10 -07:00
Jacek Laskowski	6954704299	[CORE][MINOR][DOC] Removing incorrect scaladoc ## What changes were proposed in this pull request? I don't think the method will ever throw an exception so removing a false comment. Sorry srowen and rxin again -- I simply couldn't resist. I wholeheartedly support merging the change with a bigger one (and trashing this PR). ## How was this patch tested? Manual build Author: Jacek Laskowski <jacek@japila.pl> Closes #13384 from jaceklaskowski/blockinfomanager.	2016-05-31 19:21:25 -05:00
catapan	6878f3e2ea	[SPARK-15641] HistoryServer to not show invalid date for incomplete application ## What changes were proposed in this pull request? For incomplete applications in HistoryServer, the complete column will show "-" instead of incorrect date. ## How was this patch tested? manually tested. Author: catapan <cedarpan86@gmail.com> Author: Ziying Pan <cedarpan@Ziyings-MacBook.local> Closes #13396 from catapan/SPARK-15641_fix_completed_column.	2016-05-31 06:55:07 -05:00
Reynold Xin	675921040e	[SPARK-15638][SQL] Audit Dataset, SparkSession, and SQLContext ## What changes were proposed in this pull request? This patch contains a list of changes as a result of my auditing Dataset, SparkSession, and SQLContext. The patch audits the categorization of experimental APIs, function groups, and deprecations. For the detailed list of changes, please see the diff. ## How was this patch tested? N/A Author: Reynold Xin <rxin@databricks.com> Closes #13370 from rxin/SPARK-15638.	2016-05-30 22:47:58 -07:00
Devaraj K	5b21139dbf	[SPARK-10530][CORE] Kill other task attempts when one task attempt belonging to the same task succeeds in speculation ## What changes were proposed in this pull request? With this patch, TaskSetManager kills other running attempts when any one of the attempts succeeds for the same task. Also, killed tasks are not considered failed tasks; they are listed separately in the UI, which shows the task state as KILLED instead of FAILED. ## How was this patch tested? core\src\test\scala\org\apache\spark\ui\jobs\JobProgressListenerSuite.scala core\src\test\scala\org\apache\spark\util\JsonProtocolSuite.scala I have verified this patch manually by setting spark.speculation to true: when any attempt succeeds, the other running attempts for the same task are killed and other pending tasks are assigned in their place. Also, when an attempt is killed, it is counted as a KILLED task rather than a FAILED task. Please see the attached screenshots for reference. ![stage-tasks-table](https://cloud.githubusercontent.com/assets/3174804/14075132/394c6a12-f4f4-11e5-8638-20ff7b8cc9bc.png) ![stages-table](https://cloud.githubusercontent.com/assets/3174804/14075134/3b60f412-f4f4-11e5-9ea6-dd0dcc86eb03.png) Ref : https://github.com/apache/spark/pull/11916 Author: Devaraj K <devaraj@apache.org> Closes #11996 from devaraj-kavali/SPARK-10530.	2016-05-30 14:29:27 -07:00
Xin Ren	5728aa558e	[SPARK-15645][STREAMING] Fix some typos of Streaming module ## What changes were proposed in this pull request? No code change, just some typo fixing. ## How was this patch tested? Manually run project build with testing, and build is successful. Author: Xin Ren <iamshrek@126.com> Closes #13385 from keypointt/codeWalkThroughStreaming.	2016-05-30 08:40:03 -05:00
Reynold Xin	73178c7556	[SPARK-15633][MINOR] Make package name for Java tests consistent ## What changes were proposed in this pull request? This is a simple patch that makes package names for Java 8 test suites consistent. I moved everything to test.org.apache.spark so we can test package private APIs properly. Also added "java8" as the package name so we can easily run all the tests related to Java 8. ## How was this patch tested? This is a test only change. Author: Reynold Xin <rxin@databricks.com> Closes #13364 from rxin/SPARK-15633.	2016-05-27 21:20:02 -07:00
dding3	88c9c467a3	[SPARK-15562][ML] Delete temp directory after program exit in DataFrameExample ## What changes were proposed in this pull request? The temp directory used to save records is not deleted after program exit in DataFrameExample. Although it calls deleteOnExit, that doesn't work because the directory is not empty. A similar thing happened in ContextCleanerSuite. Update the code to make sure the temp directory is deleted after program exit. ## How was this patch tested? unit tests and local build. Author: dding3 <ding.ding@intel.com> Closes #13328 from dding3/master.	2016-05-27 21:01:50 -05:00
Sital Kedia	ce756daa4f	[SPARK-15569] Reduce frequency of updateBytesWritten function in Disk… ## What changes were proposed in this pull request? Profiling a Spark job that spills a large amount of intermediate data, we found that a significant portion of time is spent in the DiskObjectWriter.updateBytesWritten function. Looking at the code, we see that the function is called too frequently to update the number of bytes written to disk. We should reduce the frequency to avoid this. ## How was this patch tested? Tested by running the job on a cluster and saw a 20% CPU gain from this change. Author: Sital Kedia <skedia@fb.com> Closes #13332 from sitalkedia/DiskObjectWriter.	2016-05-27 11:22:39 -07:00
Zheng RuiFeng	6b1a6180e7	[MINOR] Fix Typos 'a -> an' ## What changes were proposed in this pull request? `a` -> `an` I use regex to generate potential error lines: `grep -in ' a [aeiou]' mllib/src/main/scala/org/apache/spark/ml//scala` and review them line by line. ## How was this patch tested? local build `lint-java` checking Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #13317 from zhengruifeng/a_an.	2016-05-26 22:39:14 -07:00
Joseph K. Bradley	ee3609a2ef	[MINOR][CORE] Fixed doc for Accumulator2.add ## What changes were proposed in this pull request? Scala doc used outdated ```+=```. Replaced with ```add```. ## How was this patch tested? N/A Author: Joseph K. Bradley <joseph@databricks.com> Closes #13346 from jkbradley/accum-doc.	2016-05-26 22:36:43 -07:00
Sameer Agarwal	fe6de16f78	[SPARK-8428][SPARK-13850] Fix integer overflows in TimSort ## What changes were proposed in this pull request? This patch fixes a few integer overflows in `UnsafeSortDataFormat.copyRange()` and `ShuffleSortDataFormat.copyRange()` that seem to be the most likely cause behind a number of `TimSort` contract violation errors seen in Spark 2.0 and Spark 1.6 while sorting large datasets. ## How was this patch tested? Added a test in `ExternalSorterSuite` that instantiates a large array of the form [150000000, 150000001, 150000002, ...., 300000000, 0, 1, 2, ..., 149999999] that triggers a `copyRange` in `TimSort.mergeLo` or `TimSort.mergeHi`. Note that the input dataset should contain at least 268.43 million rows with a certain data distribution for an overflow to occur. Author: Sameer Agarwal <sameer@databricks.com> Closes #13336 from sameeragarwal/timsort-bug.	2016-05-26 15:49:16 -07:00
Steve Loughran	01b350a4f7	[SPARK-13148][YARN] document zero-keytab Oozie application launch; add diagnostics This patch provides detail on what to do for keytabless Oozie launches of spark apps, and adds some debug-level diagnostics of what credentials have been submitted Author: Steve Loughran <stevel@hortonworks.com> Author: Steve Loughran <stevel@apache.org> Closes #11033 from steveloughran/stevel/feature/SPARK-13148-oozie.	2016-05-26 13:55:22 -05:00
Imran Rashid	dfc9fc02cc	[SPARK-10372] [CORE] basic test framework for entire spark scheduler This is a basic framework for testing the entire scheduler. The tests this adds aren't very interesting -- the point of this PR is just to set up the framework, to keep the initial change small, but it can be built upon to test more features (e.g., speculation, killing tasks, blacklisting, etc.). Author: Imran Rashid <irashid@cloudera.com> Closes #8559 from squito/SPARK-10372-scheduler-integs.	2016-05-26 00:29:09 -05:00
Takuya UESHIN	698ef762f8	[SPARK-14269][SCHEDULER] Eliminate unnecessary submitStage() call. ## What changes were proposed in this pull request? Currently, `submitStage()` is called for waiting stages on every iteration of the event loop in `DAGScheduler` to submit all waiting stages, but most of these calls are unnecessary because they are not related to a change in stage status. We should try to submit waiting stages only when their parent stages have completed successfully. Eliminating the extra calls can improve `DAGScheduler` performance. ## How was this patch tested? Tested with some added checks, other existing tests, and our own projects. We have a project bottlenecked by `DAGScheduler`, with about 2000 stages. Before this patch, almost all of the execution time in the `Driver` process was spent processing `submitStage()` on the `dag-scheduler-event-loop` thread, but after this patch the performance improved as follows: | | total execution time | `dag-scheduler-event-loop` thread time | `submitStage()` | |--------|---------------------:|---------------------------------------:|----------------:| | Before | 760 sec | 710 sec | 667 sec | | After | 440 sec | 14 sec | 10 sec | Author: Takuya UESHIN <ueshin@happy-camper.st> Closes #12060 from ueshin/issues/SPARK-14269.	2016-05-25 13:57:25 -07:00
Dongjoon Hyun	d6d3e50719	[MINOR][CORE] Fix a HadoopRDD log message and remove unused imports in rdd files. ## What changes were proposed in this pull request? This PR fixes the following typos in log message and comments of `HadoopRDD.scala`. Also, this removes unused imports. ```scala - logWarning("Caching NewHadoopRDDs as deserialized objects usually leads to undesired" + + logWarning("Caching HadoopRDDs as deserialized objects usually leads to undesired" + ... - // since its not removed yet + // since it's not removed yet ``` ## How was this patch tested? Manual. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #13294 from dongjoon-hyun/minor_rdd_fix_log_message.	2016-05-25 10:51:33 -07:00
Jeff Zhang	01e7b9c85b	[SPARK-15345][SQL][PYSPARK] SparkSession's conf doesn't take effect when there is already an existing SparkContext ## What changes were proposed in this pull request? Override the existing SparkContext if the provided SparkConf is different. The PySpark part hasn't been fixed yet; I will do that after the first round of review to ensure this is the correct approach. ## How was this patch tested? Manually verified in spark-shell. rxin Please help review it, I think this is a very critical issue for spark 2.0 Author: Jeff Zhang <zjffdu@apache.org> Closes #13160 from zjffdu/SPARK-15345.	2016-05-25 10:46:51 -07:00
Lukasz	b120fba6ae	[SPARK-9044] Fix "Storage" tab in UI so that it reflects RDD name change. ## What changes were proposed in this pull request? 1. Making 'name' field of RDDInfo mutable. 2. In StorageListener: catching the fact that RDD's name was changed and updating it in RDDInfo. ## How was this patch tested? 1. Manual verification - the 'Storage' tab now behaves as expected. 2. The commit also contains a new unit test which verifies this. Author: Lukasz <lgieron@gmail.com> Closes #13264 from lgieron/SPARK-9044.	2016-05-25 10:24:21 -07:00
Reynold Xin	14494da87b	[SPARK-15518] Rename various scheduler backend for consistency ## What changes were proposed in this pull request? This patch renames various scheduler backends to make them consistent: - LocalScheduler -> LocalSchedulerBackend - AppClient -> StandaloneAppClient - AppClientListener -> StandaloneAppClientListener - SparkDeploySchedulerBackend -> StandaloneSchedulerBackend - CoarseMesosSchedulerBackend -> MesosCoarseGrainedSchedulerBackend - MesosSchedulerBackend -> MesosFineGrainedSchedulerBackend ## How was this patch tested? Updated test cases to reflect the name change. Author: Reynold Xin <rxin@databricks.com> Closes #13288 from rxin/SPARK-15518.	2016-05-24 20:55:47 -07:00
Dongjoon Hyun	f08bf587b1	[SPARK-15512][CORE] repartition(0) should raise IllegalArgumentException ## What changes were proposed in this pull request? Previously, SPARK-8893 added the constraints on positive number of partitions for repartition/coalesce operations in general. This PR adds one missing part for that and adds explicit two testcases. Before ```scala scala> sc.parallelize(1 to 5).coalesce(0) java.lang.IllegalArgumentException: requirement failed: Number of partitions (0) must be positive. ... scala> sc.parallelize(1 to 5).repartition(0).collect() res1: Array[Int] = Array() // empty scala> spark.sql("select 1").coalesce(0) res2: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [1: int] scala> spark.sql("select 1").coalesce(0).collect() java.lang.IllegalArgumentException: requirement failed: Number of partitions (0) must be positive. scala> spark.sql("select 1").repartition(0) res3: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [1: int] scala> spark.sql("select 1").repartition(0).collect() res4: Array[org.apache.spark.sql.Row] = Array() // empty ``` After ```scala scala> sc.parallelize(1 to 5).coalesce(0) java.lang.IllegalArgumentException: requirement failed: Number of partitions (0) must be positive. ... scala> sc.parallelize(1 to 5).repartition(0) java.lang.IllegalArgumentException: requirement failed: Number of partitions (0) must be positive. ... scala> spark.sql("select 1").coalesce(0) java.lang.IllegalArgumentException: requirement failed: Number of partitions (0) must be positive. ... scala> spark.sql("select 1").repartition(0) java.lang.IllegalArgumentException: requirement failed: Number of partitions (0) must be positive. ... ``` ## How was this patch tested? Pass the Jenkins tests with new testcases. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #13282 from dongjoon-hyun/SPARK-15512.	2016-05-24 18:55:23 -07:00
Dongjoon Hyun	be99a99fe7	[MINOR][CORE][TEST] Update obsolete `takeSample` test case. ## What changes were proposed in this pull request? This PR fixes some obsolete comments and assertion in `takeSample` testcase of `RDDSuite.scala`. ## How was this patch tested? This fixes the testcase only. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #13260 from dongjoon-hyun/SPARK-15481.	2016-05-24 11:09:54 -07:00
Liang-Chi Hsieh	695d9a0fd4	[SPARK-15433] [PYSPARK] PySpark core test should not use SerDe from PythonMLLibAPI ## What changes were proposed in this pull request? Currently PySpark core test uses the `SerDe` from `PythonMLLibAPI` which includes many MLlib things. It should use `SerDeUtil` instead. ## How was this patch tested? Existing tests. Author: Liang-Chi Hsieh <simonh@tw.ibm.com> Closes #13214 from viirya/pycore-use-serdeutil.	2016-05-24 10:10:41 -07:00
Xin Wu	01659bc50c	[SPARK-15431][SQL] Support LIST FILE(s)|JAR(s) command natively ## What changes were proposed in this pull request? Currently, the command `ADD FILE|JAR <filepath | jarpath>` is supported natively in SparkSQL. However, when this command is run, the file/jar is added to resources that cannot be looked up by the `LIST FILE(s)|JAR(s)` command, because the `LIST` command is passed to the Hive command processor in Spark-SQL or is simply not supported in Spark-shell. There is no way for users to find out what files/jars have been added to the Spark context. Refer to [Hive commands](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Cli) This PR supports the following commands: `LIST (FILE[s] [filepath ...] | JAR[s] [jarfile ...])` ### For example: ##### LIST FILE(s) ``` scala> spark.sql("add file hdfs://bdavm009.svl.ibm.com:8020/tmp/test.txt") res1: org.apache.spark.sql.DataFrame = [] scala> spark.sql("add file hdfs://bdavm009.svl.ibm.com:8020/tmp/test1.txt") res2: org.apache.spark.sql.DataFrame = [] scala> spark.sql("list file hdfs://bdavm009.svl.ibm.com:8020/tmp/test1.txt").show(false) +----------------------------------------------+ |result | +----------------------------------------------+ |hdfs://bdavm009.svl.ibm.com:8020/tmp/test1.txt| +----------------------------------------------+ scala> spark.sql("list files").show(false) +----------------------------------------------+ |result | +----------------------------------------------+ |hdfs://bdavm009.svl.ibm.com:8020/tmp/test1.txt| |hdfs://bdavm009.svl.ibm.com:8020/tmp/test.txt | +----------------------------------------------+ ``` ##### LIST JAR(s) ``` scala> spark.sql("add jar /Users/xinwu/spark/core/src/test/resources/TestUDTF.jar") res9: org.apache.spark.sql.DataFrame = [result: int] scala> spark.sql("list jar TestUDTF.jar").show(false) +---------------------------------------------+ |result | +---------------------------------------------+
|spark://192.168.1.234:50131/jars/TestUDTF.jar| +---------------------------------------------+ scala> spark.sql("list jars").show(false) +---------------------------------------------+ |result | +---------------------------------------------+ |spark://192.168.1.234:50131/jars/TestUDTF.jar| +---------------------------------------------+ ``` ## How was this patch tested? New test cases are added for Spark-SQL, Spark-Shell and SparkContext API code path. Author: Xin Wu <xinwu@us.ibm.com> Author: xin Wu <xinwu@us.ibm.com> Closes #13212 from xwu0226/list_command.	2016-05-23 17:32:01 -07:00
Bo Meng	72288fd67e	[SPARK-15468][SQL] fix some typos ## What changes were proposed in this pull request? Fix some typos found while browsing the code. ## How was this patch tested? None and obvious. Author: Bo Meng <mengbo@hotmail.com> Author: bomeng <bmeng@us.ibm.com> Closes #13246 from bomeng/typo.	2016-05-22 08:10:54 -05:00
Liang-Chi Hsieh	7920296bf8	[SPARK-15430][SQL] Fix potential ConcurrentModificationException for ListAccumulator ## What changes were proposed in this pull request? In `ListAccumulator` we create an unmodifiable view of the underlying list. However, this doesn't prevent the underlying list from being modified further. So while we access the unmodifiable view, the underlying list can be modified at the same time, which can cause `java.util.ConcurrentModificationException`. We have observed this exception in recent tests. To fix it, we can copy the underlying list and then create the unmodifiable view of the copy instead. ## How was this patch tested? The exception might be difficult to test. Existing tests should pass. Author: Liang-Chi Hsieh <simonh@tw.ibm.com> Closes #13211 from viirya/fix-concurrentmodify.	2016-05-22 08:08:46 -05:00
Shixiong Zhu	305263954a	Fix the compiler error introduced by #13153 for Scala 2.10	2016-05-19 12:36:44 -07:00
Shixiong Zhu	4e3cb7a5d9	[SPARK-15317][CORE] Don't store accumulators for every task in listeners ## What changes were proposed in this pull request? In general, the Web UI doesn't need to store the Accumulator/AccumulableInfo for every task. It only needs the Accumulator values. This PR creates new UIData classes to store the necessary fields and makes `JobProgressListener` store only these new classes, so that `JobProgressListener` won't store Accumulator/AccumulableInfo and the size of `JobProgressListener` becomes much smaller. It also eliminates `AccumulableInfo` from `SQLListener` so that we don't keep any references to those unused `AccumulableInfo`s. ## How was this patch tested? I ran two tests reported in JIRA locally: The first one is: ``` val data = spark.range(0, 10000, 1, 10000) data.cache().count() ``` The retained size of JobProgressListener decreases from 60.7M to 6.9M. The second one is: ``` import org.apache.spark.ml.CC import org.apache.spark.sql.SQLContext val sqlContext = SQLContext.getOrCreate(sc) CC.runTest(sqlContext) ``` This test won't cause OOM after applying this patch. Author: Shixiong Zhu <shixiong@databricks.com> Closes #13153 from zsxwing/memory.	2016-05-19 12:05:17 -07:00
Davies Liu	ad182086cc	[SPARK-15300] Fix writer lock conflict when removing a block ## What changes were proposed in this pull request? A writer lock could be acquired when 1) creating a new block 2) removing a block 3) evicting a block to disk. 1) and 3) could happen at the same time within the same task, and all of them could happen at the same time outside a task. It's OK when someone tries to grab the write lock for a block that is already held by another writer with the same task attempt id. This PR removes the check. ## How was this patch tested? Updated existing tests. Author: Davies Liu <davies@databricks.com> Closes #13082 from davies/write_lock_conflict.	2016-05-19 11:47:17 -07:00
Sandeep Singh	3facca5152	[CORE][MINOR] Remove redundant set master in OutputCommitCoordinatorIntegrationSuite ## What changes were proposed in this pull request? Remove redundant set master in OutputCommitCoordinatorIntegrationSuite, as we are already setting it in SparkContext below on line 43. ## How was this patch tested? existing tests Author: Sandeep Singh <sandeep@techaddict.me> Closes #13168 from techaddict/minor-1.	2016-05-19 10:44:26 +01:00
Shixiong Zhu	5c9117a3ed	[SPARK-15395][CORE] Use getHostString to create RpcAddress ## What changes were proposed in this pull request? Right now the netty RPC uses `InetSocketAddress.getHostName` to create `RpcAddress` for network events. If we use an IP address to connect, then the RpcAddress's host will be a host name (if the reverse lookup succeeds) instead of the IP address. However, some places need to compare the original IP address and the RpcAddress in `onDisconnect` (e.g., CoarseGrainedExecutorBackend), and this behavior will make the check incorrect. This PR uses `getHostString` to resolve the issue. ## How was this patch tested? Jenkins unit tests. Author: Shixiong Zhu <shixiong@databricks.com> Closes #13185 from zsxwing/host-string.	2016-05-18 20:15:00 -07:00
Dongjoon Hyun	cc6a47dd81	[SPARK-15373][WEB UI] Spark UI should show consistent timezones. ## What changes were proposed in this pull request? Currently, SparkUI shows two timezones in a single page when the timezone of the browser is different from the server JVM timezone. The following is an example on Databricks CE, which uses the 'Etc/UTC' timezone. - The time of `submitted` column of list and pop-up description shows `2016/05/18 00:03:07` - The time of `timeline chart` shows `2016/05/17 17:03:07`. ![Different Timezone](https://issues.apache.org/jira/secure/attachment/12804553/12804553_timezone.png) This PR fixes the timeline chart to use the same timezone as follows: - Upgrade `vis` from 3.9.0(2015-01-16) to 4.16.1(2016-04-18) - Override `moment` of `vis` to get `offset` - Update `AllJobsPage`, `JobPage`, and `StagePage`. ## How was this patch tested? Manual. Run the following command and see the Spark UI's event timelines. ``` $ SPARK_SUBMIT_OPTS="-Dscala.usejavacp=true -Duser.timezone=Etc/UTC" bin/spark-submit --class org.apache.spark.repl.Main ... scala> sql("select 1").head ``` Author: Dongjoon Hyun <dongjoon@apache.org> Closes #13158 from dongjoon-hyun/SPARK-15373.	2016-05-18 23:19:55 +01:00
Davies Liu	8fb1d1c7f3	[SPARK-15357] Cooperative spilling should check consumer memory mode ## What changes were proposed in this pull request? Since we support forced spilling for Spillable, which only works in OnHeap mode, unlike other SQL operators (which could be OnHeap or OffHeap), we should consider the mode of the consumer before triggering forced spilling. ## How was this patch tested? Added a new test. Author: Davies Liu <davies@databricks.com> Closes #13151 from davies/fix_mode.	2016-05-18 09:44:21 -07:00
WeichenXu	2f9047b5eb	[SPARK-15322][MLLIB][CORE][SQL] update deprecated accumulator usage to accumulatorV2 in spark project ## What changes were proposed in this pull request? I used IntelliJ IDEA to search for usages of the deprecated SparkContext.accumulator across the whole Spark project and updated the code (except the test code for the accumulator method itself). ## How was this patch tested? Existing unit tests Author: WeichenXu <WeichenXu123@outlook.com> Closes #13112 from WeichenXu123/update_accuV2_in_mllib.	2016-05-18 11:48:46 +01:00
Shixiong Zhu	8e8bc9f957	[SPARK-11735][CORE][SQL] Add a check in the constructor of SQLContext/SparkSession to make sure its SparkContext is not stopped ## What changes were proposed in this pull request? Add a check in the constructor of SQLContext/SparkSession to make sure its SparkContext is not stopped. ## How was this patch tested? Jenkins unit tests. Author: Shixiong Zhu <shixiong@databricks.com> Closes #13154 from zsxwing/check-spark-context-stop.	2016-05-17 14:57:21 -07:00
Sean Owen	122302cbf5	[SPARK-15290][BUILD] Move annotations, like @Since / @DeveloperApi, into spark-tags ## What changes were proposed in this pull request? (See https://github.com/apache/spark/pull/12416 where most of this was already reviewed and committed; this is just the module structure and move part. This change does not move the annotations into test scope, which was apparently the problem last time.) Rename `spark-test-tags` -> `spark-tags`; move common annotations like `Since` to `spark-tags` ## How was this patch tested? Jenkins tests. Author: Sean Owen <sowen@cloudera.com> Closes #13074 from srowen/SPARK-15290.	2016-05-17 09:55:53 +01:00
Nicholas Tietz	0f1f31d3a6	[SPARK-15197][DOCS] Added Scaladoc for countApprox and countByValueApprox parameters This pull request simply adds Scaladoc documentation of the parameters for countApprox and countByValueApprox. This is an important documentation change, as it clarifies what should be passed in for the timeout. Without units, this was previously unclear. I did not open a JIRA ticket per my understanding of the project contribution guidelines; as they state, the description in the ticket would be essentially just what is in the PR. If I should open one, let me know and I will do so. Author: Nicholas Tietz <nicholas.tietz@crosschx.com> Closes #12955 from ntietz/rdd-countapprox-docs.	2016-05-14 09:44:20 +01:00
Holden Karau	382dbc12bb	[SPARK-15061][PYSPARK] Upgrade to Py4J 0.10.1 ## What changes were proposed in this pull request? This upgrades to Py4J 0.10.1, which reduces syscall overhead in the Java gateway ( see https://github.com/bartdag/py4j/issues/201 ). Related https://issues.apache.org/jira/browse/SPARK-6728 . ## How was this patch tested? Existing doctests & unit tests pass Author: Holden Karau <holden@us.ibm.com> Closes #13064 from holdenk/SPARK-15061-upgrade-to-py4j-0.10.1.	2016-05-13 08:59:18 +01:00
Takuya UESHIN	a57aadae84	[SPARK-13902][SCHEDULER] Make DAGScheduler not to create duplicate stage. ## What changes were proposed in this pull request? `DAGScheduler`sometimes generate incorrect stage graph. Suppose you have the following DAG: ``` [A] <--(s_A)-- [B] <--(s_B)-- [C] <--(s_C)-- [D] \ / <------------- ``` Note: [] means an RDD, () means a shuffle dependency. Here, RDD `B` has a shuffle dependency on RDD `A`, and RDD `C` has shuffle dependency on both `B` and `A`. The shuffle dependency IDs are numbers in the `DAGScheduler`, but to make the example easier to understand, let's call the shuffled data from `A` shuffle dependency ID `s_A` and the shuffled data from `B` shuffle dependency ID `s_B`. The `getAncestorShuffleDependencies` method in `DAGScheduler` (incorrectly) does not check for duplicates when it's adding ShuffleDependencies to the parents data structure, so for this DAG, when `getAncestorShuffleDependencies` gets called on `C` (previous of the final RDD), `getAncestorShuffleDependencies` will return `s_A`, `s_B`, `s_A` (`s_A` gets added twice: once when the method "visit"s RDD `C`, and once when the method "visit"s RDD `B`). This is problematic because this line of code: https://github.com/apache/spark/blob/8ef3399/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L289 then generates a new shuffle stage for each dependency returned by `getAncestorShuffleDependencies`, resulting in duplicate map stages that compute the map output from RDD `A`. 
As a result, `DAGScheduler` generates the following stages and their parents for each shuffle:

| | stage | parents |
|----|----|----|
| s_A | ShuffleMapStage 2 | List() |
| s_B | ShuffleMapStage 1 | List(ShuffleMapStage 0) |
| s_C | ShuffleMapStage 3 | List(ShuffleMapStage 1, ShuffleMapStage 2) |
| - | ResultStage 4 | List(ShuffleMapStage 3) |

The stage for `s_A` should be `ShuffleMapStage 0`, but the stage for `s_A` is generated twice: `ShuffleMapStage 0` is overwritten by `ShuffleMapStage 2`, while `ShuffleMapStage 1` keeps referring to the old `ShuffleMapStage 0`. This patch fixes that. ## How was this patch tested? I added the sample RDD graph that produces the illegal stage graph to `DAGSchedulerSuite`. Author: Takuya UESHIN <ueshin@happy-camper.st> Closes #12655 from ueshin/issues/SPARK-13902.	2016-05-12 12:36:18 -07:00
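The duplicate described in SPARK-13902 can be reproduced in a few lines. The following is only a minimal Python sketch, not Spark's Scala code: `Rdd`, `ancestor_shuffle_deps`, and the `dedupe` flag are hypothetical names introduced for illustration, and shuffle dependencies are reduced to string IDs. It walks the ancestors of `C` and collects shuffle dependencies, with and without a duplicate check.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(eq=False)
class Rdd:
    """Hypothetical stand-in for an RDD: a name plus its dependencies."""
    name: str
    # Each dependency is (shuffle_id, parent); shuffle_id is None for a narrow dependency.
    deps: List[Tuple[Optional[str], "Rdd"]] = field(default_factory=list)


def ancestor_shuffle_deps(rdd: Rdd, dedupe: bool = True) -> List[str]:
    """Collect shuffle dependency IDs reachable from `rdd`.

    With dedupe=False, the same shuffle ID is recorded once per RDD that
    depends on it, which is the behaviour the commit message describes.
    """
    found: List[str] = []
    seen_shuffles = set()
    visited = set()
    waiting = [rdd]
    while waiting:
        current = waiting.pop()
        if id(current) in visited:
            continue
        visited.add(id(current))
        for shuffle_id, parent in current.deps:
            if shuffle_id is not None:
                if not dedupe or shuffle_id not in seen_shuffles:
                    seen_shuffles.add(shuffle_id)
                    found.append(shuffle_id)
            waiting.append(parent)
    return found


# The DAG from the commit message: B depends on A via s_A; C depends on B via s_B and on A via s_A.
a = Rdd("A")
b = Rdd("B", [("s_A", a)])
c = Rdd("C", [("s_B", b), ("s_A", a)])

without_check = ancestor_shuffle_deps(c, dedupe=False)
with_check = ancestor_shuffle_deps(c, dedupe=True)
```

Without the check, `s_A` appears once for `C` and once for `B`, so a scheduler that creates one stage per returned dependency would create two stages for the same shuffle.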
bomeng	81bf870848	[SPARK-14897][SQL] upgrade to jetty 9.2.16 ## What changes were proposed in this pull request? Since Jetty 8 is EOL (end of life) and has a critical security issue [http://www.securityweek.com/critical-vulnerability-found-jetty-web-server], I think upgrading to 9 is necessary. I am using the latest 9.2 since 9.3 requires Java 8+. `javax.servlet` and `derby` were also upgraded since Jetty 9.2 needs the corresponding versions. ## How was this patch tested? Manual test and current test cases should cover it. Author: bomeng <bmeng@us.ibm.com> Closes #12916 from bomeng/SPARK-14897.	2016-05-12 20:07:44 +01:00
Sandeep Singh	ff92eb2e80	[SPARK-15080][CORE] Break copyAndReset into copy and reset ## What changes were proposed in this pull request? Break copyAndReset into two methods copy and reset instead of just one. ## How was this patch tested? Existing Tests Author: Sandeep Singh <sandeep@techaddict.me> Closes #12936 from techaddict/SPARK-15080.	2016-05-12 11:12:09 +08:00
Andrew Or	40a949aae9	[SPARK-15262] Synchronize block manager / scheduler executor state ## What changes were proposed in this pull request? If an executor is still alive even after the scheduler has removed its metadata, we may receive a heartbeat from that executor and tell its block manager to reregister itself. If that happens, the block manager master will know about the executor, but the scheduler will not. That is a dangerous situation, because when the executor does get disconnected later, the scheduler will not ask the block manager to also remove metadata for that executor. Later, when we try to clean up an RDD or a broadcast variable, we may try to send a message to that executor, triggering an exception. ## How was this patch tested? Jenkins. Author: Andrew Or <andrew@databricks.com> Closes #13055 from andrewor14/block-manager-remove.	2016-05-11 13:36:58 -07:00
Andrew Or	bb88ad4e0e	[SPARK-15260] Atomically resize memory pools ## What changes were proposed in this pull request? When we acquire execution memory, we do a lot of things between shrinking the storage memory pool and enlarging the execution memory pool. In particular, we call `memoryStore.evictBlocksToFreeSpace`, which may do a lot of I/O and can throw exceptions. If an exception is thrown, the pool sizes on that executor will be in a bad state. This patch minimizes the things we do between the two calls to make the resizing more atomic. ## How was this patch tested? Jenkins. Author: Andrew Or <andrew@databricks.com> Closes #13039 from andrewor14/safer-pool.	2016-05-11 12:58:57 -07:00
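The idea behind SPARK-15260 can be illustrated with a small Python sketch. This is not Spark's memory manager: `Pool`, `evict_blocks`, `resize_unsafely`, and `resize_atomically` are hypothetical names, and the eviction step is a stub that can be told to fail. The sketch assumes the fallible step can run before either pool size changes, so that the two size updates sit next to each other.

```python
class Pool:
    """Hypothetical memory pool that only tracks its size in bytes."""

    def __init__(self, size: int) -> None:
        self.size = size


def evict_blocks(bytes_needed: int, fail: bool = False) -> int:
    """Stand-in for the I/O-heavy eviction step, which may raise."""
    if fail:
        raise OSError("eviction failed")
    return bytes_needed


def resize_unsafely(storage: Pool, execution: Pool, n: int, fail: bool = False) -> None:
    # Fallible work happens between the two size updates.
    storage.size -= n
    evict_blocks(n, fail)
    execution.size += n


def resize_atomically(storage: Pool, execution: Pool, n: int, fail: bool = False) -> None:
    # Fallible work happens first; the two updates are then back to back.
    freed = evict_blocks(n, fail)
    storage.size -= freed
    execution.size += freed


def total_after_failed_resize(resize) -> int:
    """Run `resize` with a failing eviction and return the combined pool size."""
    storage, execution = Pool(100), Pool(100)
    try:
        resize(storage, execution, 10, fail=True)
    except OSError:
        pass
    return storage.size + execution.size


unsafe_total = total_after_failed_resize(resize_unsafely)
atomic_total = total_after_failed_resize(resize_atomically)
```

In the unsafe ordering a failed eviction leaves 10 bytes unaccounted for; in the second ordering the combined size stays at 200.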
cody koeninger	89e67d6667	[SPARK-15085][STREAMING][KAFKA] Rename streaming-kafka artifact ## What changes were proposed in this pull request? Renaming the streaming-kafka artifact to include kafka version, in anticipation of needing a different artifact for later kafka versions ## How was this patch tested? Unit tests Author: cody koeninger <cody@koeninger.org> Closes #12946 from koeninger/SPARK-15085.	2016-05-11 12:15:41 -07:00
Eric Liang	6d0368ab8d	[SPARK-15259] Sort time metric should not include spill and record insertion time ## What changes were proposed in this pull request? After SPARK-14669 it seems the sort time metric includes both spill and record insertion time. This makes it not very useful since the metric becomes close to the total execution time of the node. We should track just the time spent for in-memory sort, as before. ## How was this patch tested? Verified metric in the UI, also unit test on UnsafeExternalRowSorter. cc davies Author: Eric Liang <ekl@databricks.com> Author: Eric Liang <ekhliang@gmail.com> Closes #13035 from ericl/fix-metrics.	2016-05-11 11:25:46 -07:00
Kousuke Saruta	ba181c0c7a	[SPARK-15235][WEBUI] Corresponding row cannot be highlighted even though cursor is on the job on Web UI's timeline ## What changes were proposed in this pull request? To extract job descriptions and stage names, timeline-view.js uses the following regular expressions: ``` var jobIdText = $($(baseElem).find(".application-timeline-content")[0]).text(); var jobId = jobIdText.match("\\(Job (\\d+)\\)")[1]; ... var stageIdText = $($(baseElem).find(".job-timeline-content")[0]).text(); var stageIdAndAttempt = stageIdText.match("\\(Stage (\\d+\\.\\d+)\\)")[1].split("."); ``` But if job descriptions include patterns like "(Job x)" or stage names include patterns like "(Stage x.y)", the regular expressions do not match as expected, so the corresponding row is not highlighted even when the cursor is on the job in the Web UI's timeline. ## How was this patch tested? Manually tested with spark-shell and Web UI. Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #13016 from sarutak/SPARK-15235.	2016-05-10 22:32:38 -07:00
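The mismatch in SPARK-15235 is easy to reproduce outside the browser. The sketch below uses Python's `re` module rather than JavaScript, and the sample text is made up. It assumes that the timeline text ends with the real "(Job N)" marker, and it shows end-anchoring as one possible remedy; whether the actual patch does this is not stated in the commit message.

```python
import re

# Same shape as the pattern quoted in the commit message.
UNANCHORED = re.compile(r"\(Job (\d+)\)")
# Assumed remedy: only accept the marker at the very end of the text.
ANCHORED = re.compile(r"\(Job (\d+)\)$")


def job_id(text, pattern):
    """Return the captured job ID as a string, or None if nothing matches."""
    match = pattern.search(text)
    return match.group(1) if match else None


# Hypothetical timeline text: a user description that itself contains
# "(Job 7)", followed by the marker for the real job, 42.
timeline_text = "nightly report (Job 7) rerun (Job 42)"

first_hit = job_id(timeline_text, UNANCHORED)
last_hit = job_id(timeline_text, ANCHORED)
```

The unanchored search returns the first occurrence, which belongs to the description, so the lookup targets the wrong row; the anchored pattern picks up the trailing marker instead.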

5490 commits