Commit graph

4516 commits

Andrew Or 2d4a961f82 [HOT FIX #6125] Do not wait for all stages to start rendering
zsxwing

Author: Andrew Or <andrew@databricks.com>

Closes #6138 from andrewor14/dag-viz-clean-properly and squashes the following commits:

19d4e98 [Andrew Or] Add synchronize
02542d6 [Andrew Or] Rename overloaded variable
d11bee1 [Andrew Or] Don't wait until all stages have started before rendering
2015-05-13 21:04:50 -07:00
Josh Rosen c53ebea9db [SPARK-7081] Faster sort-based shuffle path using binary processing cache-aware sort
This patch introduces a new shuffle manager that enhances the existing sort-based shuffle with a new cache-friendly sort algorithm that operates directly on binary data. The goals of this patch are to lower memory usage and Java object overheads during shuffle and to speed up sorting. It also lays groundwork for follow-up patches that will enable end-to-end processing of serialized records.

The new shuffle manager, `UnsafeShuffleManager`, can be enabled by setting `spark.shuffle.manager=tungsten-sort` in SparkConf.
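
For reference, a minimal sketch of enabling it programmatically (the app name and master here are illustrative, not from the patch):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Select the new shuffle manager; shuffles that don't meet its requirements are
// delegated to the regular sort-based path automatically.
val conf = new SparkConf()
  .setAppName("tungsten-sort-demo")   // illustrative
  .setMaster("local[2]")              // illustrative
  .set("spark.shuffle.manager", "tungsten-sort")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") // supports relocation

val sc = new SparkContext(conf)
```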

The new shuffle manager uses directly-managed memory to implement several performance optimizations for certain types of shuffles. In cases where the new performance optimizations cannot be applied, the new shuffle manager delegates to SortShuffleManager to handle those shuffles.

UnsafeShuffleManager's optimizations will apply when _all_ of the following conditions hold (a short eligibility example follows this list):

 - The shuffle dependency specifies no aggregation or output ordering.
 - The shuffle serializer supports relocation of serialized values (this is currently supported
   by KryoSerializer and Spark SQL's custom serializers).
 - The shuffle produces fewer than 16777216 output partitions.
 - No individual record is larger than 128 MB when serialized.
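
For instance, under the configuration sketched above (Kryo serializer, tungsten-sort manager), these hypothetical operations would and would not satisfy the conditions; the data and partition counts are illustrative:

```scala
import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(1 to 1000).map(i => (i % 100, i.toString))

// Likely eligible: a plain hash repartition specifies no aggregation and no
// output ordering, uses far fewer than 2^24 partitions, and has small records.
val repartitioned = pairs.partitionBy(new HashPartitioner(8))

// Not eligible: reduceByKey specifies map-side aggregation, and sortByKey
// specifies an output ordering, so both fall back to the sort-based path.
val aggregated = pairs.reduceByKey(_ + _)
val sorted = pairs.sortByKey()
```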

In addition, extra spill-merging optimizations are automatically applied when the shuffle compression codec supports concatenation of serialized streams. This is currently the case for Spark's LZF compression codec.

At a high level, UnsafeShuffleManager's design is similar to Spark's existing SortShuffleManager. In sort-based shuffle, incoming records are sorted according to their target partition ids, then written to a single map output file. Reducers fetch contiguous regions of this file in order to read their portion of the map output. In cases where the map output data is too large to fit in memory, sorted subsets of the output can be spilled to disk, and those on-disk files are merged to produce the final output file.

UnsafeShuffleManager optimizes this process in several ways:

 - Its sort operates on serialized binary data rather than Java objects, which reduces memory consumption and GC overheads. This optimization requires the record serializer to have certain properties to allow serialized records to be re-ordered without requiring deserialization.  See SPARK-4550, where this optimization was first proposed and implemented, for more details.

 - It uses a specialized cache-efficient sorter (UnsafeShuffleExternalSorter) that sorts arrays of compressed record pointers and partition ids. By using only 8 bytes of space per record in the sorting array, this fits more of the array into cache.

 - The spill merging procedure operates on blocks of serialized records that belong to the same partition and does not need to deserialize records during the merge.

 - When the spill compression codec supports concatenation of compressed data, the spill merge simply concatenates the serialized and compressed spill partitions to produce the final output partition.  This allows efficient data copying methods, like NIO's `transferTo`, to be used and avoids the need to allocate decompression or copying buffers during the merge.
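
As a rough, self-contained sketch of that kind of concatenation (an illustrative helper, not the merge code added by this patch):

```scala
import java.io.{File, FileInputStream, FileOutputStream}

// Append each spill file's bytes to the output via NIO's transferTo, so no
// decompression or intermediate copy buffers are allocated in user space.
def concatenateSpills(spills: Seq[File], output: File): Unit = {
  val outChannel = new FileOutputStream(output, true).getChannel
  try {
    for (spill <- spills) {
      val inChannel = new FileInputStream(spill).getChannel
      try {
        var position = 0L
        val size = inChannel.size()
        while (position < size) {
          position += inChannel.transferTo(position, size - position, outChannel)
        }
      } finally {
        inChannel.close()
      }
    }
  } finally {
    outChannel.close()
  }
}
```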

The shuffle read path is unchanged.

This patch is similar to [SPARK-4550](http://issues.apache.org/jira/browse/SPARK-4550) / #4450 but uses a slightly different implementation. The `unsafe`-based implementation featured in this patch lays the groundwork for followup patches that will enable sorting to operate on serialized data pages that will be prepared by Spark SQL's new `unsafe` operators (such as the new aggregation operator introduced in #5725).

### Future work

There are several tasks that build upon this patch, which will be left to future work:

- [SPARK-7271](https://issues.apache.org/jira/browse/SPARK-7271) Redesign / extend the shuffle interfaces to accept binary data as input. The goal here is to let us bypass serialization steps in cases where the sort input is produced by an operator that operates directly on binary data.
- Extension / redesign of the `Serializer` API. We can add new methods which allow serializers to determine the size requirements for serializing objects and for serializing objects directly to a specified memory address (similar to how `UnsafeRowConverter` works in Spark SQL).

Author: Josh Rosen <joshrosen@databricks.com>

Closes #5868 from JoshRosen/unsafe-sort and squashes the following commits:

ef0a86e [Josh Rosen] Fix scalastyle errors
7610f2f [Josh Rosen] Add tests for proper cleanup of shuffle data.
d494ffe [Josh Rosen] Fix deserialization of JavaSerializer instances.
52a9981 [Josh Rosen] Fix some bugs in the address packing code.
51812a7 [Josh Rosen] Change shuffle manager sort name to tungsten-sort
4023fa4 [Josh Rosen] Add @Private annotation to some Java classes.
de40b9d [Josh Rosen] More comments to try to explain metrics code
df07699 [Josh Rosen] Attempt to clarify confusing metrics update code
5e189c6 [Josh Rosen] Track time spent closing / flushing files; split TimeTrackingOutputStream into separate file.
d5779c6 [Josh Rosen] Merge remote-tracking branch 'origin/master' into unsafe-sort
c2ce78e [Josh Rosen] Fix a missed usage of MAX_PARTITION_ID
e3b8855 [Josh Rosen] Cleanup in UnsafeShuffleWriter
4a2c785 [Josh Rosen] rename 'sort buffer' to 'pointer array'
6276168 [Josh Rosen] Remove ability to disable spilling in UnsafeShuffleExternalSorter.
57312c9 [Josh Rosen] Clarify fileBufferSize units
2d4e4f4 [Josh Rosen] Address some minor comments in UnsafeShuffleExternalSorter.
fdcac08 [Josh Rosen] Guard against overflow when expanding sort buffer.
85da63f [Josh Rosen] Cleanup in UnsafeShuffleSorterIterator.
0ad34da [Josh Rosen] Fix off-by-one in nextInt() call
56781a1 [Josh Rosen] Rename UnsafeShuffleSorter to UnsafeShuffleInMemorySorter
e995d1a [Josh Rosen] Introduce MAX_SHUFFLE_OUTPUT_PARTITIONS.
e58a6b4 [Josh Rosen] Add more tests for PackedRecordPointer encoding.
4f0b770 [Josh Rosen] Attempt to implement proper shuffle write metrics.
d4e6d89 [Josh Rosen] Update to bit shifting constants
69d5899 [Josh Rosen] Remove some unnecessary override vals
8531286 [Josh Rosen] Add tests that automatically trigger spills.
7c953f9 [Josh Rosen] Add test that covers UnsafeShuffleSortDataFormat.swap().
e1855e5 [Josh Rosen] Fix a handful of misc. IntelliJ inspections
39434f9 [Josh Rosen] Avoid integer multiplication overflow in getMemoryUsage (thanks FindBugs!)
1e3ad52 [Josh Rosen] Delete unused ByteBufferOutputStream class.
ea4f85f [Josh Rosen] Roll back an unnecessary change in Spillable.
ae538dc [Josh Rosen] Document UnsafeShuffleManager.
ec6d626 [Josh Rosen] Add notes on maximum # of supported shuffle partitions.
0d4d199 [Josh Rosen] Bump up shuffle.memoryFraction to make tests pass.
b3b1924 [Josh Rosen] Properly implement close() and flush() in DummySerializerInstance.
1ef56c7 [Josh Rosen] Revise compression codec support in merger; test cross product of configurations.
b57c17f [Josh Rosen] Disable some overly-verbose logs that rendered DEBUG useless.
f780fb1 [Josh Rosen] Add test demonstrating which compression codecs support concatenation.
4a01c45 [Josh Rosen] Remove unnecessary log message
27b18b0 [Josh Rosen] Test for inserting records AT the max record size.
fcd9a3c [Josh Rosen] Add notes + tests for maximum record / page sizes.
9d1ee7c [Josh Rosen] Fix MiMa excludes for ShuffleWriter change
fd4bb9e [Josh Rosen] Use own ByteBufferOutputStream rather than Kryo's
67d25ba [Josh Rosen] Update Exchange operator's copying logic to account for new shuffle manager
8f5061a [Josh Rosen] Strengthen assertion to check partitioning
01afc74 [Josh Rosen] Actually read data in UnsafeShuffleWriterSuite
1929a74 [Josh Rosen] Update to reflect upstream ShuffleBlockManager -> ShuffleBlockResolver rename.
e8718dd [Josh Rosen] Merge remote-tracking branch 'origin/master' into unsafe-sort
9b7ebed [Josh Rosen] More defensive programming RE: cleaning up spill files and memory after errors
7cd013b [Josh Rosen] Begin refactoring to enable proper tests for spilling.
722849b [Josh Rosen] Add workaround for transferTo() bug in merging code; refactor tests.
9883e30 [Josh Rosen] Merge remote-tracking branch 'origin/master' into unsafe-sort
b95e642 [Josh Rosen] Refactor and document logic that decides when to spill.
1ce1300 [Josh Rosen] More minor cleanup
5e8cf75 [Josh Rosen] More minor cleanup
e67f1ea [Josh Rosen] Remove upper type bound in ShuffleWriter interface.
cfe0ec4 [Josh Rosen] Address a number of minor review comments:
8a6fe52 [Josh Rosen] Rename UnsafeShuffleSpillWriter to UnsafeShuffleExternalSorter
11feeb6 [Josh Rosen] Update TODOs related to shuffle write metrics.
b674412 [Josh Rosen] Merge remote-tracking branch 'origin/master' into unsafe-sort
aaea17b [Josh Rosen] Add comments to UnsafeShuffleSpillWriter.
4f70141 [Josh Rosen] Fix merging; now passes UnsafeShuffleSuite tests.
133c8c9 [Josh Rosen] WIP towards testing UnsafeShuffleWriter.
f480fb2 [Josh Rosen] WIP in mega-refactoring towards shuffle-specific sort.
57f1ec0 [Josh Rosen] WIP towards packed record pointers for use in optimized shuffle sort.
69232fd [Josh Rosen] Enable compressible address encoding for off-heap mode.
7ee918e [Josh Rosen] Re-order imports in tests
3aeaff7 [Josh Rosen] More refactoring and cleanup; begin cleaning iterator interfaces
3490512 [Josh Rosen] Misc. cleanup
f156a8f [Josh Rosen] Hacky metrics integration; refactor some interfaces.
2776aca [Josh Rosen] First passing test for ExternalSorter.
5e100b2 [Josh Rosen] Super-messy WIP on external sort
595923a [Josh Rosen] Remove some unused variables.
8958584 [Josh Rosen] Fix bug in calculating free space in current page.
f17fa8f [Josh Rosen] Add missing newline
c2fca17 [Josh Rosen] Small refactoring of SerializerPropertiesSuite to enable test re-use:
b8a09fe [Josh Rosen] Back out accidental log4j.properties change
bfc12d3 [Josh Rosen] Add tests for serializer relocation property.
240864c [Josh Rosen] Remove PrefixComputer and require prefix to be specified as part of insert()
1433b42 [Josh Rosen] Store record length as int instead of long.
026b497 [Josh Rosen] Re-use a buffer in UnsafeShuffleWriter
0748458 [Josh Rosen] Port UnsafeShuffleWriter to Java.
87e721b [Josh Rosen] Renaming and comments
d3cc310 [Josh Rosen] Flag that SparkSqlSerializer2 supports relocation
e2d96ca [Josh Rosen] Expand serializer API and use new function to help control when new UnsafeShuffle path is used.
e267cee [Josh Rosen] Fix compilation of UnsafeSorterSuite
9c6cf58 [Josh Rosen] Refactor to use DiskBlockObjectWriter.
253f13e [Josh Rosen] More cleanup
8e3ec20 [Josh Rosen] Begin code cleanup.
4d2f5e1 [Josh Rosen] WIP
3db12de [Josh Rosen] Minor simplification and sanity checks in UnsafeSorter
767d3ca [Josh Rosen] Fix invalid range in UnsafeSorter.
e900152 [Josh Rosen] Add test for empty iterator in UnsafeSorter
57a4ea0 [Josh Rosen] Make initialSize configurable in UnsafeSorter
abf7bfe [Josh Rosen] Add basic test case.
81d52c5 [Josh Rosen] WIP on UnsafeSorter

(cherry picked from commit 73bed408fb)
Signed-off-by: Reynold Xin <rxin@databricks.com>
2015-05-13 17:07:39 -07:00
Andrew Or 895d46a24a [SPARK-7502] DAG visualization: gracefully handle removed stages
Old stages are removed without much feedback to the user. This happens very often in streaming. See screenshots below for more detail. zsxwing

**Before**

<img src="https://cloud.githubusercontent.com/assets/2133137/7621031/643cc1e0-f978-11e4-8f42-09decaac44a7.png" width="500px"/>

-------------------------
**After**
<img src="https://cloud.githubusercontent.com/assets/2133137/7621037/6e37348c-f978-11e4-84a5-e44e154f9b13.png" width="400px"/>

Author: Andrew Or <andrew@databricks.com>

Closes #6132 from andrewor14/dag-viz-remove-gracefully and squashes the following commits:

43175cd [Andrew Or] Handle removed jobs and stages gracefully

(cherry picked from commit aa1837875a)
Signed-off-by: Andrew Or <andrew@databricks.com>
2015-05-13 16:29:59 -07:00
Andrew Or 4b4f10bc90 [SPARK-7464] DAG visualization: highlight the same RDDs on hover
This is pretty useful for MLlib.

<img src="https://cloud.githubusercontent.com/assets/2133137/7599650/c7d03dd8-f8b8-11e4-8c0a-0a89e786c90f.png" width="400px"/>

Author: Andrew Or <andrew@databricks.com>

Closes #6100 from andrewor14/dag-viz-hover and squashes the following commits:

fefe2af [Andrew Or] Link tooltips for nodes that belong to the same RDD
90c6a7e [Andrew Or] Assign classes to clusters and nodes, not IDs

(cherry picked from commit 44403414d3)
Signed-off-by: Andrew Or <andrew@databricks.com>
2015-05-13 16:29:18 -07:00
Andrew Or e6b8cef514 [SPARK-7399] Spark compilation error for scala 2.11
Subsequent fix following #5966. I tried this out locally.

Author: Andrew Or <andrew@databricks.com>

Closes #6129 from andrewor14/211-compilation and squashes the following commits:

713868f [Andrew Or] Fix compilation issue for scala 2.11

(cherry picked from commit f88ac70155)
Signed-off-by: Andrew Or <andrew@databricks.com>
2015-05-13 16:28:45 -07:00
Andrew Or ec342308a8 [SPARK-7608] Clean up old state in RDDOperationGraphListener
This is necessary for streaming and long-running Spark applications. zsxwing tdas

Author: Andrew Or <andrew@databricks.com>

Closes #6125 from andrewor14/viz-listener-leak and squashes the following commits:

8660949 [Andrew Or] Fix thing + add tests
33c0843 [Andrew Or] Clean up old job state

(cherry picked from commit f6e18388d9)
Signed-off-by: Andrew Or <andrew@databricks.com>
2015-05-13 16:28:02 -07:00
zsxwing 10007fbe0b [SPARK-7589] [STREAMING] [WEBUI] Make "Input Rate" in the Streaming page consistent with other pages
This PR makes "Input Rate" in the Streaming page consistent with Job and Stage pages.

![screen shot 2015-05-12 at 5 03 35 pm](https://cloud.githubusercontent.com/assets/1000778/7601444/f943f8ac-f8ca-11e4-8280-a715d814f434.png)
![screen shot 2015-05-12 at 5 07 25 pm](https://cloud.githubusercontent.com/assets/1000778/7601445/f9571c0c-f8ca-11e4-9b12-9317cb55c002.png)

Author: zsxwing <zsxwing@gmail.com>

Closes #6102 from zsxwing/SPARK-7589 and squashes the following commits:

2745225 [zsxwing] Make "Input Rate" in the Streaming page consistent with other pages

(cherry picked from commit bec938f777)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
2015-05-13 10:01:38 -07:00
Masayoshi TSUZUKI bfdecace5d [SPARK-6568] spark-shell.cmd --jars option does not accept the jar that has space in its path
Escape spaces in the arguments.

Author: Masayoshi TSUZUKI <tsudukim@oss.nttdata.co.jp>
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #5447 from tsudukim/feature/SPARK-6568-2 and squashes the following commits:

3f9a188 [Masayoshi TSUZUKI] modified some errors.
ed46047 [Masayoshi TSUZUKI] avoid scalastyle errors.
1784239 [Masayoshi TSUZUKI] removed Utils.formatPath.
e03f289 [Masayoshi TSUZUKI] removed testWindows from Utils.resolveURI and Utils.resolveURIs. replaced SystemUtils.IS_OS_WINDOWS to Utils.isWindows. removed Utils.formatPath from PythonRunner.scala.
84c33d0 [Masayoshi TSUZUKI] - use resolveURI in nonLocalPaths - run tests for Windows path only on Windows
016128d [Masayoshi TSUZUKI] fixed to use File.toURI()
2c62e3b [Masayoshi TSUZUKI] Merge pull request #1 from sarutak/SPARK-6568-2
7019a8a [Masayoshi TSUZUKI] Merge branch 'master' of https://github.com/apache/spark into feature/SPARK-6568-2
45946ee [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-6568-2
10f1c73 [Kousuke Saruta] Added a comment
93c3c40 [Kousuke Saruta] Merge branch 'classpath-handling-fix' of github.com:sarutak/spark into SPARK-6568-2
649da82 [Kousuke Saruta] Fix classpath handling
c7ba6a7 [Masayoshi TSUZUKI] [SPARK-6568] spark-shell.cmd --jars option does not accept the jar that has space in its path

(cherry picked from commit 50c7270801)
Signed-off-by: Sean Owen <sowen@cloudera.com>
2015-05-13 09:43:49 +01:00
linweizhong 7bd5274270 [SPARK-7526] [SPARKR] Specify ip of RBackend, MonitorServer and RRDD Socket server
These R processes are only used to communicate with the local JVM process, so binding to localhost is more reasonable than a wildcard IP.

Author: linweizhong <linweizhong@huawei.com>

Closes #6053 from Sephiroth-Lin/spark-7526 and squashes the following commits:

5303af7 [linweizhong] bind to localhost rather than wildcard ip

(cherry picked from commit 98195c3031)
Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
2015-05-12 23:55:54 -07:00
zsxwing 221375ee1f [SPARK-7406] [STREAMING] [WEBUI] Add tooltips for "Scheduling Delay", "Processing Time" and "Total Delay"
Screenshots:
![screen shot 2015-05-06 at 2 29 03 pm](https://cloud.githubusercontent.com/assets/1000778/7504129/9c57f710-f3fc-11e4-9c6e-1b79c17c546d.png)

![screen shot 2015-05-06 at 2 24 35 pm](https://cloud.githubusercontent.com/assets/1000778/7504140/b63bb216-f3fc-11e4-83a5-6dfc6481d192.png)

tdas as we discussed offline

Author: zsxwing <zsxwing@gmail.com>

Closes #5952 from zsxwing/SPARK-7406 and squashes the following commits:

2b004ea [zsxwing] Merge branch 'master' into SPARK-7406
e9eb506 [zsxwing] Update tooltip contents
2215b2a [zsxwing] Add tooltips for "Scheduling Delay", "Processing Time" and "Total Delay"

(cherry picked from commit 1422e79e51)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
2015-05-12 14:41:30 -07:00
Andrew Or ce6c40066f [HOT FIX #6076] DAG visualization: curve the edges 2015-05-12 12:07:36 -07:00
Andrew Or a23610458c [SPARK-7500] DAG visualization: move cluster labeling to dagre-d3
This fixes the label bleeding issue described in the JIRA and pictured in the screenshots below. I also took the opportunity to move some code to where it belongs. In particular:

(1) Drawing cluster labels is now implemented in my branch of dagre-d3 instead of in Spark
(2) All graph styling is now moved from Scala to JS

Note that these changes are related because our existing mechanism of "tacking on cluster labels" afterwards isn't flexible enough for us to fix issues like this one easily. For the other half of the changes, visit http://github.com/andrewor14/dagre-d3.

-------------------

**Before.**
<img src="https://cloud.githubusercontent.com/assets/2133137/7582769/b1423440-f845-11e4-8248-b3446a01bf79.png" width="300px"/>

-------------------

**After.**
<img src="https://cloud.githubusercontent.com/assets/2133137/7582742/74891ae6-f845-11e4-96c4-41c7b8aedbdf.png" width="400px"/>

Author: Andrew Or <andrew@databricks.com>

Closes #6076 from andrewor14/dag-viz-bleed and squashes the following commits:

5858d7a [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-bleed
c686dc4 [Andrew Or] Fix tooltip placement
d908c36 [Andrew Or] Add link to dagre-d3 changes (minor)
4a4fb58 [Andrew Or] Fix bleeding + move all styling to JS

(cherry picked from commit 65697bbeaf)
Signed-off-by: Andrew Or <andrew@databricks.com>
2015-05-12 11:18:08 -07:00
Cheng Lian d2328137f7 [SPARK-3928] [SPARK-5182] [SQL] Partitioning support for the data sources API
This PR adds partitioning support for the external data sources API. It aims to simplify development of file system based data sources, and provide first class partitioning support for both read path and write path.  Existing data sources like JSON and Parquet can be simplified with this work.

## New features provided

1. Hive compatible partition discovery

   This generalizes the partition discovery strategy used in the Parquet data source in Spark 1.3.0.

1. Generalized partition pruning optimization

   Partition pruning is now handled during the physical planning phase.  Specific data sources don't need to worry about this harness anymore.

   (This also implies that we can remove `CatalystScan` after migrating the Parquet data source, since now we don't need to pass Catalyst expressions to data source implementations.)

1. Insertion with dynamic partitions

   When inserting data to a `FSBasedRelation`, data can be partitioned dynamically by specified partition columns.

## New structures provided

### Developer API

1. `FSBasedRelation`

   Base abstract class for file system based data sources.

1. `OutputWriter`

   Base abstract class for output row writers, responsible for writing a single row object.

1. `FSBasedRelationProvider`

   A new relation provider for `FSBasedRelation` subclasses. Note that data sources extending `FSBasedRelation` don't need to extend `RelationProvider` and `SchemaRelationProvider`.

### User API

New overloaded versions of

1. `DataFrame.save()`
1. `DataFrame.saveAsTable()`
1. `SQLContext.load()`

are provided to allow users to save/load DataFrames with user defined dynamic partition columns.

### Spark SQL query planning

1. `InsertIntoFSBasedRelation`

   Used to implement write path for `FSBasedRelation`s.

1. New rules for `FSBasedRelation` in `DataSourceStrategy`

   These are added to hook `FSBasedRelation` into physical query plan in read path, and perform partition pruning.

## TODO

- [ ] Use scratch directories when overwriting a table with data selected from itself.

      Currently, this is not supported, because the table being overwritten is always deleted before any data is written to it.

- [ ] When inserting with dynamic partition columns, use external sorter to group the data first.

      This ensures that we only need to open a single `OutputWriter` at a time.  For data sources like Parquet, `OutputWriter`s can be quite memory-consuming.  One issue is that this approach breaks the row distribution in the original DataFrame.  However, we didn't promise to preserve data distribution when writing a DataFrame.

- [x] More tests.  Specifically, test cases for

      - [x] Self-join
      - [x] Loading partitioned relations with a subset of partition columns stored in data files.
      - [x] `SQLContext.load()` with user defined dynamic partition columns.

## Parquet data source migration

Parquet data source migration is covered in PR https://github.com/liancheng/spark/pull/6, which is against this PR branch and for preview only. A formal PR needs to be made after this one is merged.

Author: Cheng Lian <lian@databricks.com>

Closes #5526 from liancheng/partitioning-support and squashes the following commits:

5351a1b [Cheng Lian] Fixes compilation error introduced while rebasing
1f9b1a5 [Cheng Lian] Tweaks data schema passed to FSBasedRelations
43ba50e [Cheng Lian] Avoids serializing generated projection code
edf49e7 [Cheng Lian] Removed commented stale code block
348a922 [Cheng Lian] Adds projection in FSBasedRelation.buildScan(requiredColumns, inputPaths)
ad4d4de [Cheng Lian] Enables HDFS style globbing
8d12e69 [Cheng Lian] Fixes compilation error
c71ac6c [Cheng Lian] Addresses comments from @marmbrus
7552168 [Cheng Lian] Fixes typo in MimaExclude.scala
0349e09 [Cheng Lian] Fixes compilation error introduced while rebasing
52b0c9b [Cheng Lian] Adjusts project/MimaExclude.scala
c466de6 [Cheng Lian] Addresses comments
bc3f9b4 [Cheng Lian] Uses projection to separate partition columns and data columns while inserting rows
795920a [Cheng Lian] Fixes compilation error after rebasing
0b8cd70 [Cheng Lian] Adds Scala/Catalyst row conversion when writing non-partitioned tables
fa543f3 [Cheng Lian] Addresses comments
5849dd0 [Cheng Lian] Fixes doc typos.  Fixes partition discovery refresh.
51be443 [Cheng Lian] Replaces FSBasedRelation.outputCommitterClass with FSBasedRelation.prepareForWrite
c4ed4fe [Cheng Lian] Bug fixes and a new test suite
a29e663 [Cheng Lian] Bug fix: should only pass actual data files to FSBaseRelation.buildScan
5f423d3 [Cheng Lian] Bug fixes. Lets data source to customize OutputCommitter rather than OutputFormat
54c3d7b [Cheng Lian] Enforces that FileOutputFormat must be used
be0c268 [Cheng Lian] Uses TaskAttempContext rather than Configuration in OutputWriter.init
0bc6ad1 [Cheng Lian] Resorts to new Hadoop API, and now FSBasedRelation can customize output format class
f320766 [Cheng Lian] Adds prepareForWrite() hook, refactored writer containers
422ff4a [Cheng Lian] Fixes style issue
ce52353 [Cheng Lian] Adds new SQLContext.load() overload with user defined dynamic partition columns
8d2ff71 [Cheng Lian] Merges partition columns when reading partitioned relations
ca1805b [Cheng Lian] Removes duplicated partition discovery code in new Parquet
f18dec2 [Cheng Lian] More strict schema checking
b746ab5 [Cheng Lian] More tests
9b487bf [Cheng Lian] Fixes compilation errors introduced while rebasing
ea6c8dd [Cheng Lian] Removes remote debugging stuff
327bb1d [Cheng Lian] Implements partitioning support for data sources API
3c5073a [Cheng Lian] Fixes SaveModes used in test cases
fb5a607 [Cheng Lian] Fixes compilation error
9d17607 [Cheng Lian] Adds the contract that OutputWriter should have zero-arg constructor
5de194a [Cheng Lian] Forgot Apache licence header
95d0b4d [Cheng Lian] Renames PartitionedSchemaRelationProvider to FSBasedRelationProvider
770b5ba [Cheng Lian] Adds tests for FSBasedRelation
3ba9bbf [Cheng Lian] Adds DataFrame.saveAsTable() overrides which support partitioning
1b8231f [Cheng Lian] Renames FSBasedPrunedFilteredScan to FSBasedRelation
aa8ba9a [Cheng Lian] Javadoc fix
012ed2d [Cheng Lian] Adds PartitioningOptions
7dd8dd5 [Cheng Lian] Adds new interfaces and stub methods for data sources API partitioning support

(cherry picked from commit 0595b6de8f)
Signed-off-by: Cheng Lian <lian@databricks.com>
2015-05-13 01:32:55 +08:00
Daoyuan Wang 653db0a1bd [SPARK-6876] [PySpark] [SQL] add DataFrame na.replace in pyspark
Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #6003 from adrian-wang/pynareplace and squashes the following commits:

672efba [Daoyuan Wang] remove py2.7 feature
4a148f7 [Daoyuan Wang] to_replace support dict, value support single value, and add full tests
9e232e7 [Daoyuan Wang] rename scala map
af0268a [Daoyuan Wang] remove na
63ac579 [Daoyuan Wang] add na.replace in pyspark

(cherry picked from commit d86ce84584)
Signed-off-by: Reynold Xin <rxin@databricks.com>
2015-05-12 10:23:57 -07:00
Andrew Or 56016326c0 [SPARK-7467] Dag visualization: treat checkpoint as an RDD operation
Such that a checkpoint RDD does not go into random scopes on the UI, e.g. `take`. We've seen this in streaming.

Author: Andrew Or <andrew@databricks.com>

Closes #6004 from andrewor14/dag-viz-checkpoint and squashes the following commits:

9217439 [Andrew Or] Fix checkpoints
4ae8806 [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-checkpoint
19bc07b [Andrew Or] Treat checkpoint as an RDD operation

(cherry picked from commit f3e8e60063)
Signed-off-by: Andrew Or <andrew@databricks.com>
2015-05-12 01:41:02 -07:00
Marcelo Vanzin afe54b76a6 [SPARK-7485] [BUILD] Remove pyspark files from assembly.
The sbt part of the build is hacky; it basically tricks sbt
into generating the zip by using a generator, but returns
an empty list for the generated files so that nothing is
actually added to the assembly.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #6022 from vanzin/SPARK-7485 and squashes the following commits:

22c1e04 [Marcelo Vanzin] Remove unneeded code.
4893622 [Marcelo Vanzin] [SPARK-7485] [build] Remove pyspark files from assembly.

(cherry picked from commit 82e890fb19)
Signed-off-by: Andrew Or <andrew@databricks.com>
2015-05-12 01:39:28 -07:00
linweizhong 4092a2e859 [MINOR] [PYSPARK] Set PYTHONPATH to python/lib/pyspark.zip rather than python/pyspark
As of PR #5580 we create pyspark.zip during the build and set PYTHONPATH to python/lib/pyspark.zip, so update this to keep things consistent.

Author: linweizhong <linweizhong@huawei.com>

Closes #6047 from Sephiroth-Lin/pyspark_pythonpath and squashes the following commits:

8cc3d96 [linweizhong] Set PYTHONPATH to python/lib/pyspark.zip rather than python/pyspark as PR#5580 we have create pyspark.zip on build

(cherry picked from commit 9847875266)
Signed-off-by: Andrew Or <andrew@databricks.com>
2015-05-12 01:36:36 -07:00
zsxwing af374ed268 [SPARK-7534] [CORE] [WEBUI] Fix the Stage table when a stage is missing
Just improved the Stage table when a stage is missing.

Before:

![screen shot 2015-05-11 at 10 11 51 am](https://cloud.githubusercontent.com/assets/1000778/7570842/2ba37380-f7c8-11e4-9b5f-cf1a6264b2a4.png)

After:

![screen shot 2015-05-11 at 10 26 09 am](https://cloud.githubusercontent.com/assets/1000778/7570848/33703152-f7c8-11e4-81a8-d53dd72d7b8d.png)

Author: zsxwing <zsxwing@gmail.com>

Closes #6061 from zsxwing/SPARK-7534 and squashes the following commits:

09fe862 [zsxwing] Leave it blank rather than '-'
6299197 [zsxwing] Fix the Stage table when a stage is missing
2015-05-12 01:35:14 -07:00
Steve Loughran 779174a5f4 [SPARK-7508] JettyUtils-generated servlets to log & report all errors
Patch for SPARK-7508

This logs a warning and then generates a response which includes the message body and stack trace as text/plain, no-cache. The status code is 500.

In practice (in some tests in SPARK-1537, to be precise), Jetty gets in between this servlet and the web response the user sees: the body of the response is lost for any error response (500, and even 404 and bad request). The standard Jetty handlers must be getting in the way.

This patch doesn't address that; it ensures that
1. if the Jetty handlers were put to one side, users would see the errors, and
2. at the very least, the exceptions appear in the server-side logs.

This is better than users saying "I saw a 500 error" while you have nothing in the logs to show what went wrong.
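
As a rough illustration of that behavior (a hypothetical servlet shape, not the actual JettyUtils code):

```scala
import java.io.{PrintWriter, StringWriter}
import javax.servlet.http.{HttpServlet, HttpServletRequest, HttpServletResponse}

// Render the page normally; on any exception, log it and reply with a 500 whose
// plain-text, uncached body carries the message and stack trace.
class ErrorReportingServlet(render: HttpServletRequest => String) extends HttpServlet {
  override def doGet(req: HttpServletRequest, res: HttpServletResponse): Unit = {
    try {
      res.setContentType("text/html;charset=utf-8")
      res.getWriter.print(render(req))
    } catch {
      case e: Exception =>
        println(s"GET ${req.getRequestURI} failed: ${e.getMessage}") // a real logger in practice
        val trace = new StringWriter()
        e.printStackTrace(new PrintWriter(trace))
        res.setStatus(HttpServletResponse.SC_INTERNAL_SERVER_ERROR)
        res.setHeader("Cache-Control", "no-cache, no-store")
        res.setContentType("text/plain;charset=utf-8")
        res.getWriter.print(s"${e.getMessage}\n$trace")
    }
  }
}
```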

Author: Steve Loughran <stevel@hortonworks.com>

Closes #6033 from steveloughran/stevel/feature/SPARK-7508-JettyUtils and squashes the following commits:

584836f [Steve Loughran] SPARK-7508 drop trailing semicolon
ad6f185 [Steve Loughran] SPARK-7508: jetty handles exception reporting itself; spark just sets this up and logs exceptions before being relayed
258d9f9 [Steve Loughran] SPARK-7508 fix typo manually-edited before patch pushed
69c8263 [Steve Loughran] SPARK-7508 JettyUtils-generated servlets to log & report all errors
2015-05-11 13:37:54 -07:00
Kousuke Saruta 869a52d9c5 [SPARK-7403] [WEBUI] Link URL in objects on Timeline View is wrong in case of running on YARN
When we use Spark on YARN and access AllJobPage via the ResourceManager's proxy, the link URL in the objects which represent each job on the timeline view is wrong.

In timeline-view.js, the link is generated as follows.
```
window.location.href = "job/?id=" + getJobId(this);
```

This assumes the URL displayed in the web browser ends with "jobs/", but when we access AllJobPage via the proxy, the displayed URL does not end with "jobs/".

The proxy doesn't return status code 301 or 302, so the displayed URL still indicates the base URL, not "/jobs", even though AllJobPage is being displayed.

![2015-05-07 3 34 37](https://cloud.githubusercontent.com/assets/4736016/7501079/a8507ad6-f46c-11e4-9bed-62abea170f4c.png)

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #5947 from sarutak/fix-link-in-timeline and squashes the following commits:

aaf40e1 [Kousuke Saruta] Added Copyright for vis.js
01bee7b [Kousuke Saruta] Fixed timeline-view.js in order to get correct href

(cherry picked from commit 12b95abc70)
Signed-off-by: Sean Owen <sowen@cloudera.com>
2015-05-09 10:10:37 +01:00
Vinod K C b0460f4149 [SPARK-7438] [SPARK CORE] Fixed validation of relativeSD in countApproxDistinct
Author: Vinod K C <vinod.kc@huawei.com>

Closes #5974 from vinodkc/fix_countApproxDistinct_Validation and squashes the following commits:

3a3d59c [Vinod K C] Reverted removal of validation relativeSD<0.000017
799976e [Vinod K C] Removed testcase to assert IAE when relativeSD>3.7
8ddbfae [Vinod K C] Remove blank line
b1b00a3 [Vinod K C] Removed relativeSD validation from python API,RDD.scala will do validation
122d378 [Vinod K C] Fixed validation of relativeSD in  countApproxDistinct

(cherry picked from commit dda6d9f404)
Signed-off-by: Sean Owen <sowen@cloudera.com>
2015-05-09 10:03:37 +01:00
tedyu 45b62151da [SPARK-7237] Clean function in several RDD methods
Author: tedyu <yuzhihong@gmail.com>

Closes #5959 from ted-yu/master and squashes the following commits:

f83d445 [tedyu] Move cleaning outside of mapPartitionsWithIndex
56d7c92 [tedyu] Consolidate import of Random
f6014c0 [tedyu] Remove cleaning in RDD#filterWith
36feb6c [tedyu] Try to get correct syntax
55d01eb [tedyu] Try to get correct syntax
c2786df [tedyu] Correct syntax
d92bfcf [tedyu] Correct syntax in test
164d3e4 [tedyu] Correct variable name
8b50d93 [tedyu] Address Andrew's review comments
0c8d47e [tedyu] Add test for mapWith()
6846e40 [tedyu] Add test for flatMapWith()
6c124a9 [tedyu] Clean function in several RDD methods

(cherry picked from commit 54e6fa0563)
Signed-off-by: Andrew Or <andrew@databricks.com>
2015-05-08 17:16:45 -07:00
Andrew Or cafffd0c29 [SPARK-7469] [SQL] DAG visualization: show SQL query operators
The DAG visualization currently displays only low-level Spark primitives (e.g. `map`, `reduceByKey`, `filter` etc.). For SQL, these aren't particularly useful. Instead, we should display higher level physical operators (e.g. `Filter`, `Exchange`, `ShuffleHashJoin`). cc marmbrus

-----------------
**Before**
<img src="https://issues.apache.org/jira/secure/attachment/12731586/before.png" width="600px"/>
-----------------
**After** (Pay attention to the words)
<img src="https://issues.apache.org/jira/secure/attachment/12731587/after.png" width="600px"/>
-----------------

Author: Andrew Or <andrew@databricks.com>

Closes #5999 from andrewor14/dag-viz-sql and squashes the following commits:

0db23a4 [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-sql
1e211db [Andrew Or] Update comment
0d49fd6 [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-sql
ffd237a [Andrew Or] Fix style
202dac1 [Andrew Or] Make ignoreParent false by default
e61b1ab [Andrew Or] Visualize SQL operators, not low-level Spark primitives
569034a [Andrew Or] Add a flag to ignore parent settings and scopes

(cherry picked from commit bd61f07039)
Signed-off-by: Andrew Or <andrew@databricks.com>
2015-05-08 17:15:17 -07:00
Aaron Davidson 1eae47620d [SPARK-6955] Perform port retries at NettyBlockTransferService level
Currently we're doing port retries at the TransportServer level, but this is not specified by the TransportContext API, and it has further-reaching impacts like causing undesirable behavior for the YARN and Standalone shuffle services.

Author: Aaron Davidson <aaron@databricks.com>

Closes #5575 from aarondav/port-bind and squashes the following commits:

3c2d6ed [Aaron Davidson] Oops, never do it.
a5d9432 [Aaron Davidson] Remove shouldHostShuffleServiceIfEnabled
e901eb2 [Aaron Davidson] fix local-cluster mode for ExternalShuffleServiceSuite
59e5e38 [Aaron Davidson] [SPARK-6955] Perform port retries at NettyBlockTransferService level

(cherry picked from commit ffdc40ce7a)
Signed-off-by: Andrew Or <andrew@databricks.com>
2015-05-08 17:14:02 -07:00
Marcelo Vanzin 3024f6b01d [SPARK-7378] [CORE] Handle deep links to unloaded apps.
The code was treating deep links as if they were attempt IDs, so
for example if you tried to load "/history/app1/jobs" directly,
that would fail because the code would treat "jobs" as an attempt id.

This change modifies the code to try both cases - first without an
attempt id, then with it, so that deep links are handled correctly.
This assumes that the links in the Spark UI do not clash with the
attempt id namespace, though, which is the case for YARN at least,
which is the only backend that currently publishes attempt IDs.
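
A tiny, hypothetical model of that lookup order (the names and the Map stand-in are illustrative, not the HistoryServer API):

```scala
// Keyed by (appId, optional attemptId); the value stands in for a loaded UI.
def resolveUi(loaded: Map[(String, Option[String]), String],
              appId: String,
              firstSegment: Option[String]): Option[String] = {
  // Try without an attempt ID first (so "jobs" in /history/app1/jobs stays a deep link),
  // then fall back to treating the first path segment as an attempt ID.
  loaded.get((appId, None))
    .orElse(firstSegment.flatMap(attempt => loaded.get((appId, Some(attempt)))))
}
```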

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #5922 from vanzin/SPARK-7378 and squashes the following commits:

96f648b [Marcelo Vanzin] Fix comparison.
ed3bcd4 [Marcelo Vanzin] Merge branch 'master' into SPARK-7378
23483e4 [Marcelo Vanzin] Fat fingers.
b728f08 [Marcelo Vanzin] [SPARK-7378] [core] Handle deep links to unloaded apps.

(cherry picked from commit 5467c34c3d)
Signed-off-by: Andrew Or <andrew@databricks.com>
2015-05-08 14:13:05 -07:00
Marcelo Vanzin 3da5f8b71a [MINOR] [CORE] Allow History Server to read kerberos opts from config file.
Order of initialization code was wrong.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #5998 from vanzin/hs-conf-fix and squashes the following commits:

00b6b6b [Marcelo Vanzin] [minor] [core] Allow History Server to read kerberos opts from config file.

(cherry picked from commit 9042f8f378)
Signed-off-by: Andrew Or <andrew@databricks.com>
2015-05-08 14:10:34 -07:00
Andrew Or ca2f1c56c6 [SPARK-7466] DAG visualization: fix orphan nodes
Simple fix. We were comparing an option with `null`.

Before:
<img src="https://issues.apache.org/jira/secure/attachment/12731383/before.png" width="250px"/>
After:
<img src="https://issues.apache.org/jira/secure/attachment/12731384/after.png" width="250px"/>

Author: Andrew Or <andrew@databricks.com>

Closes #6002 from andrewor14/dag-viz-orphan-nodes and squashes the following commits:

a1468dc [Andrew Or] Fix null check

(cherry picked from commit 3b0c5e71f1)
Signed-off-by: Andrew Or <andrew@databricks.com>
2015-05-08 14:09:47 -07:00
Tim Ellison f734c5895c [MINOR] Defeat early garbage collection of test suite variable
The JVM is free to collect references to variables that no longer participate in a computation.  This simple patch adds an operation to the variable 'rdd' to ensure it is not collected early in the test suite's explicit calls to GC.

ref: http://bugs.java.com/view_bug.do?bug_id=6721588

Author: Tim Ellison <t.p.ellison@gmail.com>

Closes #6010 from tellison/master and squashes the following commits:

77d1c8f [Tim Ellison] Defeat early garbage collection of test suite variable by aggressive JVMs

(cherry picked from commit 31da40dfee)
Signed-off-by: Andrew Or <andrew@databricks.com>
2015-05-08 14:09:09 -07:00
Kousuke Saruta 1dde3b36bb [WEBUI] Remove debug feature for vis.js
`vis.min.js` refers to `vis.map`, which in turn refers to `vis.js`; these are only needed for debugging `vis.js`, and that debug feature is not needed for Spark itself.

This issue is really minor so I don't file this in JIRA.

/CC andrewor14

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #5994 from sarutak/remove-debug-feature-for-vis and squashes the following commits:

8be038f [Kousuke Saruta] Remove vis.map entry from .rat-exclude
7404945 [Kousuke Saruta] Removed debug feature for vis.js

(cherry picked from commit c45c09b015)
Signed-off-by: Andrew Or <andrew@databricks.com>
2015-05-08 14:06:44 -07:00
Evan Jones 62308097b2 [SPARK-7490] [CORE] [Minor] MapOutputTracker.deserializeMapStatuses: close input streams
GZIPInputStream allocates native memory that is not freed until close() or
when the finalizer runs. It is best to close() these streams explicitly.

stephenh made the same change for serializeMapStatuses in commit b0d884f0. This is the same change for deserialize.
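
A hedged sketch of the close-explicitly pattern (a generic helper, not the actual MapOutputTracker code):

```scala
import java.io.{ByteArrayInputStream, ObjectInputStream}
import java.util.zip.GZIPInputStream

// Ensure the stream is closed even if deserialization throws, so the native
// zlib memory held by GZIPInputStream is released without waiting for finalizers.
def deserializeCompressed[T](bytes: Array[Byte]): T = {
  val in = new ObjectInputStream(new GZIPInputStream(new ByteArrayInputStream(bytes)))
  try {
    in.readObject().asInstanceOf[T]
  } finally {
    in.close() // also closes the wrapped GZIPInputStream
  }
}
```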

(I ran the unit test suite! it seems to have passed. I did not make a JIRA since this seems "trivial", and the guidelines suggest it is not required for trivial changes)

Author: Evan Jones <ejones@twitter.com>

Closes #5982 from evanj/master and squashes the following commits:

0d76e85 [Evan Jones] [CORE] MapOutputTracker.deserializeMapStatuses: close input streams

(cherry picked from commit 25889d8d97)
Signed-off-by: Sean Owen <sowen@cloudera.com>
2015-05-08 22:01:01 +01:00
Kay Ousterhout 82be68f105 [SPARK-6627] Finished rename to ShuffleBlockResolver
The previous cleanup-commit for SPARK-6627 renamed ShuffleBlockManager
to ShuffleBlockResolver, but didn't rename the associated subclasses and
variables; this commit does that.

I'm unsure whether it's ok to rename ExternalShuffleBlockManager, since that's technically a public class?

cc pwendell

Author: Kay Ousterhout <kayousterhout@gmail.com>

Closes #5764 from kayousterhout/SPARK-6627 and squashes the following commits:

43add1e [Kay Ousterhout] Spacing fix
96080bf [Kay Ousterhout] Test fixes
d8a5d36 [Kay Ousterhout] [SPARK-6627] Finished rename to ShuffleBlockResolver

(cherry picked from commit 4b3bb0e43c)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
2015-05-08 12:30:49 -07:00
Matei Zaharia 0b2c252d08 [SPARK-7298] Harmonize style of new visualizations
- Colors on the timeline now match the rest of the UI
- The expandable buttons to show timeline view, DAG, etc are now more visible
- Timeline text is smaller
- DAG visualization text and colors are more consistent throughout
- Fix some JavaScript style issues
- Various small fixes throughout (e.g. inconsistent capitalization, some confusing names, HTML escaping, etc)

Author: Matei Zaharia <matei@databricks.com>

Closes #5942 from mateiz/ui and squashes the following commits:

def38d0 [Matei Zaharia] Add some tooltips
4c5a364 [Matei Zaharia] Reduce stage and rank separation slightly
43dcbe3 [Matei Zaharia] Some updates to DAG
fac734a [Matei Zaharia] tweaks
6a6705d [Matei Zaharia] More fixes
67629f5 [Matei Zaharia] Various small tweaks

(cherry picked from commit a1ec08f7ed)
Signed-off-by: Matei Zaharia <matei@databricks.com>
2015-05-08 14:42:30 -04:00
Jacek Lewandowski 89d94878fd [SPARK-7436] Fixed instantiation of custom recovery mode factory and added tests
Author: Jacek Lewandowski <lewandowski.jacek@gmail.com>

Closes #5976 from jacek-lewandowski/SPARK-7436-1.4 and squashes the following commits:

6298313 [Jacek Lewandowski] SPARK-7436: Fixed instantiation of custom recovery mode factory and added tests
2015-05-08 11:38:09 -07:00
Imran Rashid 532bfdad4a [SPARK-3454] separate json endpoints for data in the UI
Exposes data available in the UI as json over http.  Key points:

* new endpoints, handled independently of existing XyzPage classes.  Root entrypoint is `JsonRootResource`
* Uses jersey + jackson for routing & converting POJOs into json
* tests against known results in `HistoryServerSuite`
* also fixes some minor issues w/ the UI -- synchronizing on access to `StorageListener` & `StorageStatusListener`, and fixing some inconsistencies w/ the way we handle retained jobs & stages.

Author: Imran Rashid <irashid@cloudera.com>

Closes #5940 from squito/SPARK-3454_better_test_files and squashes the following commits:

1a72ed6 [Imran Rashid] rats
85fdb3e [Imran Rashid] Merge branch 'no_php' into SPARK-3454
1fc65b0 [Imran Rashid] Revert "Revert "[SPARK-3454] separate json endpoints for data in the UI""
1276900 [Imran Rashid] get rid of giant event file, replace w/ smaller one; check both shuffle read & shuffle write
4e12013 [Imran Rashid] just use test case name for expectation file name
863ef64 [Imran Rashid] rename json files to avoid strange file names and not look like php

(cherry picked from commit c796be70f3)
Signed-off-by: Patrick Wendell <patrick@databricks.com>
2015-05-08 16:54:46 +01:00
Lianhui Wang acf4bc1caa [SPARK-6869] [PYSPARK] Add pyspark archives path to PYTHONPATH
Based on https://github.com/apache/spark/pull/5478, which provides a PYSPARK_ARCHIVES_PATH env variable. With this PR, we just need to export PYSPARK_ARCHIVES_PATH=/user/spark/pyspark.zip,/user/spark/python/lib/py4j-0.8.2.1-src.zip in conf/spark-env.sh when we don't install PySpark on each node of YARN. I ran a Python application successfully on yarn-client and yarn-cluster with this PR.
andrewor14 sryza Sephiroth-Lin Can you take a look at this? Thanks.

Author: Lianhui Wang <lianhuiwang09@gmail.com>

Closes #5580 from lianhuiwang/SPARK-6869 and squashes the following commits:

66ffa43 [Lianhui Wang] Update Client.scala
c2ad0f9 [Lianhui Wang] Update Client.scala
1c8f664 [Lianhui Wang] Merge remote-tracking branch 'remotes/apache/master' into SPARK-6869
008850a [Lianhui Wang] Merge remote-tracking branch 'remotes/apache/master' into SPARK-6869
f0b4ed8 [Lianhui Wang] Merge remote-tracking branch 'remotes/apache/master' into SPARK-6869
150907b [Lianhui Wang] Merge remote-tracking branch 'remotes/apache/master' into SPARK-6869
20402cd [Lianhui Wang] use ZipEntry
9d87c3f [Lianhui Wang] update scala style
e7bd971 [Lianhui Wang] address vanzin's comments
4b8a3ed [Lianhui Wang] use pyArchivesEnvOpt
e6b573b [Lianhui Wang] address vanzin's comments
f11f84a [Lianhui Wang] zip pyspark archives
5192cca [Lianhui Wang] update import path
3b1e4c8 [Lianhui Wang] address tgravescs's comments
9396346 [Lianhui Wang] put zip to make-distribution.sh
0d2baf7 [Lianhui Wang] update import paths
e0179be [Lianhui Wang] add zip pyspark archives in build or sparksubmit
31e8e06 [Lianhui Wang] update code style
9f31dac [Lianhui Wang] update code and add comments
f72987c [Lianhui Wang] add archives path to PYTHONPATH

(cherry picked from commit ebff7327af)
Signed-off-by: Thomas Graves <tgraves@apache.org>
2015-05-08 08:45:13 -05:00
Zhang, Liye f5e9678e39 [SPARK-7392] [CORE] bugfix: Kryo buffer size cannot be larger than 2M
Author: Zhang, Liye <liye.zhang@intel.com>

Closes #5934 from liyezhang556520/kryoBufSize and squashes the following commits:

5707e04 [Zhang, Liye] fix import order
8693288 [Zhang, Liye] replace multiplier with ByteUnit methods
9bf93e9 [Zhang, Liye] add tests
d91e5ed [Zhang, Liye] change kb to mb

(cherry picked from commit c2f0821aad)
Signed-off-by: Sean Owen <sowen@cloudera.com>
2015-05-08 09:11:25 +01:00
Andrew Or 1b742a414e [SPARK-7347] DAG visualization: add tooltips to RDDs
This is an addition to #5729.

Here's an example with ALS.
<img src="https://issues.apache.org/jira/secure/attachment/12731039/tooltip.png" width="400px"></img>

Author: Andrew Or <andrew@databricks.com>

Closes #5957 from andrewor14/viz-hover2 and squashes the following commits:

60e3758 [Andrew Or] Add tooltips for RDDs on job page

(cherry picked from commit 88717ee4e7)
Signed-off-by: Andrew Or <andrew@databricks.com>
2015-05-07 12:30:03 -07:00
Andrew Or 800c0fc8d5 [SPARK-7391] DAG visualization: auto expand if linked from another viz
This is an addition to #5729.

If you click into a stage from the DAG viz on the job page, you might expect the DAG for that stage to be expanded. However, once you get to the stage page, you actually have to expand the DAG viz there yourself.

This patch makes this happen automatically. It's a small UX improvement.

Author: Andrew Or <andrew@databricks.com>

Closes #5958 from andrewor14/viz-auto-expand and squashes the following commits:

03cd157 [Andrew Or] Automatically expand DAG viz if from job page

(cherry picked from commit f1216514b8)
Signed-off-by: Andrew Or <andrew@databricks.com>
2015-05-07 12:29:25 -07:00
Timothy Chen 226033cfff [SPARK-7373] [MESOS] Add docker support for launching drivers in mesos cluster mode.
Building on the existing Docker support for Mesos, this also enables the Mesos cluster mode scheduler to launch Spark drivers in Docker images.

This also allows the executors launched by the drivers to use the same Docker image, by passing the Docker settings along.

Author: Timothy Chen <tnachen@gmail.com>

Closes #5917 from tnachen/spark_cluster_docker and squashes the following commits:

1e842f5 [Timothy Chen] Add docker support for launching drivers in mesos cluster mode.

(cherry picked from commit 4eecf550aa)
Signed-off-by: Andrew Or <andrew@databricks.com>
2015-05-07 12:23:22 -07:00
Tijo Thomas d4e31bfcdb [SPARK-7399] [SPARK CORE] Fixed compilation error in scala 2.11
Scala has a deterministic naming scheme for the generated methods which return default arguments. Here, one of the default arguments of the overloaded method has to be removed.
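
A hedged illustration of the kind of clash default arguments can create between overloads (a made-up class, not the Spark code in question):

```scala
class Example {
  // Compiles to a synthetic submit$default$2 method holding the default value.
  def submit(tasks: Seq[String], retries: Int = 3): Int = tasks.size + retries

  // An overload that also declared a default for `retries` would need the same
  // synthetic name, so the compiler rejects the pair; dropping the default from
  // one overload resolves it.
  // def submit(tasks: Array[String], retries: Int = 3): Int = tasks.length + retries
}
```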

Author: Tijo Thomas <tijoparacka@gmail.com>

Closes #5966 from tijoparacka/fix_compilation_error_in_scala2.11 and squashes the following commits:

c90bba8 [Tijo Thomas] Fixed compilation error in scala 2.11

(cherry picked from commit 0c33bf817c)
Signed-off-by: Andrew Or <andrew@databricks.com>
2015-05-07 12:21:40 -07:00
Andrew Or 85a644b7bc [HOT FIX] For DAG visualization #5954 2015-05-06 18:03:21 -07:00
Andrew Or 76e8344f20 [SPARK-7371] [SPARK-7377] [SPARK-7408] DAG visualization addendum (#5729)
This is a follow-up patch for #5729.

**[SPARK-7408]** Move as much style code from JS to CSS as possible
**[SPARK-7377]** Fix JS error if a job / stage contains only one RDD
**[SPARK-7371]** Decrease emphasis on RDD on stage page as requested by mateiz pwendell

This patch also includes general code clean up.

<img src="https://issues.apache.org/jira/secure/attachment/12730992/before-after.png" width="500px"></img>

Author: Andrew Or <andrew@databricks.com>

Closes #5954 from andrewor14/viz-emphasize-rdd and squashes the following commits:

3c0d4f0 [Andrew Or] Guard against JS error by rendering arrows only if needed
f23e15b [Andrew Or] Merge branch 'master' of github.com:apache/spark into viz-emphasize-rdd
565801f [Andrew Or] Clean up code
9dab5f0 [Andrew Or] Move styling from JS to CSS + clean up code
107c0b6 [Andrew Or] Tweak background color, stroke width, font size etc.
1610c62 [Andrew Or] Implement cluster padding for stage page

(cherry picked from commit 8fa6829f5e)
Signed-off-by: Andrew Or <andrew@databricks.com>
2015-05-06 17:52:41 -07:00
Andrew Or c0ec20a510 [HOT FIX] [SPARK-7418] Ignore flaky SparkSubmitUtilsSuite test 2015-05-06 17:10:06 -07:00
Josh Rosen 2163367ea9 Add Private annotation.
This was originally added as part of #4435, which was reverted.
2015-05-06 11:07:21 -07:00
Josh Rosen d651e28383 [SPARK-7311] Introduce internal Serializer API for determining if serializers support object relocation
This patch extends the `Serializer` interface with a new `Private` API which allows serializers to indicate whether they support relocation of serialized objects in serializer stream output.

This relocatability property is described in more detail in `Serializer.scala`, but in a nutshell a serializer supports relocation if reordering the bytes of serialized objects in serialization stream output is equivalent to having re-ordered those elements prior to serializing them.  The optimized shuffle paths introduced in #4450 and #5868 both rely on serializers having this property; this patch just centralizes the logic for determining whether a serializer has it.  I also added tests and comments clarifying when this works for KryoSerializer.

This change allows the optimizations in #4450 to be applied for shuffles that use `SqlSerializer2`.
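
One hedged way to picture the property (an illustrative check built on the public stream API; it is a proxy for the definition above, not the test added by this patch):

```scala
import java.io.ByteArrayOutputStream
import org.apache.spark.serializer.Serializer

// Serialize a sequence of records through one stream and return the raw bytes.
def streamBytes(ser: Serializer, records: Seq[String]): Array[Byte] = {
  val bos = new ByteArrayOutputStream()
  val stream = ser.newInstance().serializeStream(bos)
  records.foreach(r => stream.writeObject(r))
  stream.close()
  bos.toByteArray
}

// If concatenating independently serialized records equals serializing them in one
// stream, then byte blocks can be swapped as if the records had been reordered.
def looksRelocatable(ser: Serializer, a: String, b: String): Boolean =
  java.util.Arrays.equals(streamBytes(ser, Seq(a)) ++ streamBytes(ser, Seq(b)),
                          streamBytes(ser, Seq(a, b)))
```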

Author: Josh Rosen <joshrosen@databricks.com>

Closes #5924 from JoshRosen/SPARK-7311 and squashes the following commits:

50a68ca [Josh Rosen] Address minor nits
0a7ebd7 [Josh Rosen] Clarify reason why SqlSerializer2 supports this serializer
123b992 [Josh Rosen] Cleanup for submitting as standalone patch.
4aa61b2 [Josh Rosen] Add missing newline
2c1233a [Josh Rosen] Small refactoring of SerializerPropertiesSuite to enable test re-use:
0ba75e6 [Josh Rosen] Add tests for serializer relocation property.
450fa21 [Josh Rosen] Back out accidental log4j.properties change
86d4dcd [Josh Rosen] Flag that SparkSqlSerializer2 supports relocation
b9624ee [Josh Rosen] Expand serializer API and use new function to help control when new UnsafeShuffle path is used.

(cherry picked from commit 002c12384d)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
2015-05-06 10:53:19 -07:00
zsxwing 20f9237712 [SPARK-7384][Core][Tests] Fix flaky tests for distributed mode in BroadcastSuite
Fixed the following failure: https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.3-Maven-pre-YARN/hadoop.version=1.0.4,label=centos/452/testReport/junit/org.apache.spark.broadcast/BroadcastSuite/Unpersisting_HttpBroadcast_on_executors_and_driver_in_distributed_mode/

The tests should wait until all slaves are up. Otherwise, only some of the `BlockManager`s may be registered, which fails the tests.

Author: zsxwing <zsxwing@gmail.com>

Closes #5925 from zsxwing/SPARK-7384 and squashes the following commits:

783cb7b [zsxwing] Add comments for _jobProgressListener and remove postfixOps
1009ef1 [zsxwing] [SPARK-7384][Core][Tests] Fix flaky tests for distributed mode in BroadcastSuite

(cherry picked from commit 9f019c7223)
Signed-off-by: Reynold Xin <rxin@databricks.com>
2015-05-05 23:25:36 -07:00
Reynold Xin 765f6e115e Revert "[SPARK-3454] separate json endpoints for data in the UI"
This reverts commit ff8b449958.

This commit broke Spark on Windows.
2015-05-05 19:28:35 -07:00
Sandy Ryza 762ff2e113 Some minor cleanup after SPARK-4550.
JoshRosen this PR addresses the comments you left on #4450 after it got merged.

Author: Sandy Ryza <sandy@cloudera.com>

Closes #5916 from sryza/sandy-spark-4550-cleanup and squashes the following commits:

dee3d85 [Sandy Ryza] Some minor cleanup after SPARK-4550.

(cherry picked from commit 0092abb47a)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
2015-05-05 18:33:04 -07:00
zsxwing 8109c9e105 [SPARK-6939] [STREAMING] [WEBUI] Add timeline and histogram graphs for streaming statistics
This is the initial work of SPARK-6939. Not yet ready for code review. Here are the screenshots:

![graph1](https://cloud.githubusercontent.com/assets/1000778/7165766/465942e0-e3dc-11e4-9b05-c184b09d75dc.png)

![graph2](https://cloud.githubusercontent.com/assets/1000778/7165779/53f13f34-e3dc-11e4-8714-a4a75b7e09ff.png)

TODOs:
- [x] Display more information on mouse hover
- [x] Align the timeline and distribution graphs
- [x] Clean up the codes

Author: zsxwing <zsxwing@gmail.com>

Closes #5533 from zsxwing/SPARK-6939 and squashes the following commits:

9f7cd19 [zsxwing] Merge branch 'master' into SPARK-6939
deacc3f [zsxwing] Remove unused import
cd03424 [zsxwing] Fix .rat-excludes
70cc87d [zsxwing] Streaming Scheduling Delay => Scheduling Delay
d457277 [zsxwing] Fix UIUtils in BatchPage
b3f303e [zsxwing] Add comments for unclear classes and methods
ff0bff8 [zsxwing] Make InputDStream.name private[streaming]
cc392c5 [zsxwing] Merge branch 'master' into SPARK-6939
e275e23 [zsxwing] Move time related methods to Streaming's UIUtils
d5d86f6 [zsxwing] Fix incorrect lastErrorTime
3be4b7a [zsxwing] Use InputInfo
b50fa32 [zsxwing] Jump to the batch page when clicking a point in the timeline graphs
203605d [zsxwing] Merge branch 'master' into SPARK-6939
74307cf [zsxwing] Reuse the data for histogram graphs to reduce the page size
2586916 [zsxwing] Merge branch 'master' into SPARK-6939
70d8533 [zsxwing] Remove BatchInfo.numRecords and a few renames
7bbdc0a [zsxwing] Hide the receiver sub table if no receiver
a2972e9 [zsxwing] Add some ui tests for StreamingPage
fd03ad0 [zsxwing] Add a test to verify no memory leak
4a8f886 [zsxwing] Merge branch 'master' into SPARK-6939
18607a1 [zsxwing] Merge branch 'master' into SPARK-6939
d0b0aec [zsxwing] Clean up the codes
a459f49 [zsxwing] Add a dash line to processing time graphs
8e4363c [zsxwing] Prepare for the demo
c81a1ee [zsxwing] Change time unit in the graphs automatically
4c0b43f [zsxwing] Update Streaming UI
04c7500 [zsxwing] Make the server and client use the same timezone
fed8219 [zsxwing] Move the x axis at the top and show a better tooltip
c23ce10 [zsxwing] Make two graphs close
d78672a [zsxwing] Make the X axis use the same range
881c907 [zsxwing] Use histogram for distribution
5688702 [zsxwing] Fix the unit test
ddf741a [zsxwing] Fix the unit test
ad93295 [zsxwing] Remove unnecessary codes
a0458f9 [zsxwing] Clean the codes
b82ed1e [zsxwing] Update the graphs as per comments
dd653a1 [zsxwing] Add timeline and histogram graphs for streaming statistics

(cherry picked from commit 489700c809)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
2015-05-05 12:52:29 -07:00
jerryshao 29350eef30 [SPARK-7007] [CORE] Add a metric source for ExecutorAllocationManager
Add a metric source to expose the internal status of ExecutorAllocationManager, to better monitor the resource usage of executors when dynamic allocation is enabled. Please help to review, thanks a lot.
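
A hedged sketch of what such a source can look like (the gauge names and constructor arguments are illustrative, not necessarily those added by this patch):

```scala
import com.codahale.metrics.{Gauge, MetricRegistry}

// Exposes a couple of dynamic-allocation values as Codahale gauges so they can be
// reported through Spark's metrics system (console, JMX, Ganglia, ...).
class ExecutorAllocationManagerSource(targetExecutors: () => Int,
                                      pendingExecutors: () => Int) {
  val sourceName = "ExecutorAllocationManager"
  val metricRegistry = new MetricRegistry()

  private def gauge(name: String, value: () => Int): Unit =
    metricRegistry.register(MetricRegistry.name("executors", name),
      new Gauge[Int] { override def getValue: Int = value() })

  gauge("numberTargetExecutors", targetExecutors)
  gauge("numberPendingExecutors", pendingExecutors)
}
```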

Author: jerryshao <saisai.shao@intel.com>

Closes #5589 from jerryshao/dynamic-allocation-source and squashes the following commits:

104d155 [jerryshao] rebase and address the comments
c501a2c [jerryshao] Address the comments
d237ba5 [jerryshao] Address the comments
2c3540f [jerryshao] Add a metric source for ExecutorAllocationManager

(cherry picked from commit 9f1f9b1037)
Signed-off-by: Andrew Or <andrew@databricks.com>
2015-05-05 09:43:55 -07:00