During aggregation in a Python worker, if memory usage rises above spark.executor.memory, the worker switches to disk-spilling aggregation.
It splits the aggregation into multiple stages; in each stage, it partitions the aggregated data by hash and dumps the partitions to disk. After all the data has been aggregated, it merges the stages together, partition by partition.
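The spill-and-merge flow described above can be sketched roughly as follows. This is a minimal illustration, not the ExternalMerger code from this patch: `spill_aggregate`, `max_keys`, and `num_partitions` are hypothetical names, and a key-count threshold stands in for a real memory limit.

```python
import os
import pickle
import tempfile


def spill_aggregate(pairs, combine, max_keys=1000, num_partitions=4):
    """Aggregate (key, value) pairs, spilling to disk when too many keys are held.

    max_keys is an assumption of this sketch; it stands in for a memory limit.
    """
    spill_dir = tempfile.mkdtemp(prefix="spill-")
    data = {}
    stages = 0

    def spill(stage):
        # Partition the in-memory map by key hash and dump each partition to a file.
        buckets = [{} for _ in range(num_partitions)]
        for k, v in data.items():
            buckets[hash(k) % num_partitions][k] = v
        for p, bucket in enumerate(buckets):
            with open(os.path.join(spill_dir, "%d-%d" % (stage, p)), "wb") as f:
                pickle.dump(bucket, f)
        data.clear()

    for k, v in pairs:
        data[k] = combine(data[k], v) if k in data else v
        if len(data) > max_keys:
            spill(stages)
            stages += 1

    if stages == 0:
        # Nothing was spilled, so the in-memory map is the final result.
        os.rmdir(spill_dir)
        yield from data.items()
        return

    # Flush the remainder so that every key lives on disk, then merge the
    # stages one partition at a time; only one partition is in memory at once.
    spill(stages)
    stages += 1
    for p in range(num_partitions):
        merged = {}
        for s in range(stages):
            path = os.path.join(spill_dir, "%d-%d" % (s, p))
            with open(path, "rb") as f:
                for k, v in pickle.load(f).items():
                    merged[k] = combine(merged[k], v) if k in merged else v
            os.remove(path)
        yield from merged.items()
    os.rmdir(spill_dir)
```

Because a key always hashes to the same partition within one process, each key is emitted exactly once, from the partition that owns it.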
Author: Davies Liu <davies.liu@gmail.com>
Closes #1460 from davies/spill and squashes the following commits:
cad91bf [Davies Liu] call gc.collect() after data.clear() to release memory as much as possible.
37d71f7 [Davies Liu] balance the partitions
902f036 [Davies Liu] add shuffle.py into run-tests
dcf03a9 [Davies Liu] fix memory_info() of psutil
67e6eba [Davies Liu] comment for MAX_TOTAL_PARTITIONS
f6bd5d6 [Davies Liu] rollback next_limit() again, the performance difference is huge:
e74b785 [Davies Liu] fix code style and change next_limit to memory_limit
400be01 [Davies Liu] address all the comments
6178844 [Davies Liu] refactor and improve docs
fdd0a49 [Davies Liu] add long doc string for ExternalMerger
1a97ce4 [Davies Liu] limit used memory and size of objects in partitionBy()
e6cc7f9 [Davies Liu] Merge branch 'master' into spill
3652583 [Davies Liu] address comments
e78a0a0 [Davies Liu] fix style
24cec6a [Davies Liu] get local directory by SPARK_LOCAL_DIR
57ee7ef [Davies Liu] update docs
286aaff [Davies Liu] let spilled aggregation in Python configurable
e9a40f6 [Davies Liu] recursive merger
6edbd1f [Davies Liu] Hash based disk spilling aggregation
Author: Prashant Sharma <prashant.s@imaginea.com>
Closes #1051 from ScrapCodes/SPARK-2014/pyspark-cache and squashes the following commits:
f192df7 [Prashant Sharma] Code Review
2a2f43f [Prashant Sharma] [SPARK-2014] Make PySpark store RDDs in MEMORY_ONLY_SER with compression by default
Stopping the Twitter Receiver would call twitter4j's TwitterStream.shutdown, which in turn causes an Exception to be thrown to the listener. This exception caused the Receiver to be restarted. This patch checks whether the receiver was stopped, and restarts on exception only if it was not.
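The guard can be sketched as below. This is a language-neutral illustration in Python, not the Scala receiver code; the class and method names are hypothetical, and a counter stands in for the real restart logic.

```python
class RestartGuardedReceiver:
    """Hypothetical receiver that restarts on errors unless it was stopped on purpose."""

    def __init__(self):
        self.stopped = False
        self.restart_count = 0

    def on_exception(self, exc):
        # Callback invoked when the underlying stream reports an error.
        if self.stopped:
            return  # exception caused by our own shutdown: do not restart
        self.restart_count += 1  # stand-in for the real restart logic

    def stop(self):
        # Set the flag before shutting down, so the callback can see it.
        self.stopped = True
        # Simulate the stream raising an exception as a side effect of shutdown.
        self.on_exception(RuntimeError("stream shut down"))
```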
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes #1577 from tdas/twitter-stop and squashes the following commits:
011b525 [Tathagata Das] Fixed Twitter stream stopping bug.
Author: Neville Li <neville@spotify.com>
Closes #1188 from nevillelyh/neville/ui and squashes the following commits:
d3ac425 [Neville Li] SPARK-2250: show persisted RDD in stage UI
f075db9 [Neville Li] SPARK-2035: show call stack even when description is available
Allow small errors in comparison.
@dbtsai , this unit test blocks https://github.com/apache/spark/pull/1562 . I may need to merge this one first. We can change it to use the tools in https://github.com/apache/spark/pull/1425 after that PR gets merged.
Author: Xiangrui Meng <meng@databricks.com>
Closes #1576 from mengxr/fix-binary-metrics-unit-tests and squashes the following commits:
5076a7f [Xiangrui Meng] fix binary metrics unit tests
In JsonRDD.scalafy, we are using toMap/toList to convert a Java Map/List to a Scala one. These two operations are pretty expensive because they read elements from a Java Map/List and then load them into a Scala Map/List. We can use Scala wrappers to wrap those Java collections instead of using toMap/toList.
I did a quick test to see the performance. I had a 2.9GB cached RDD[String] storing one JSON object per record (twitter dataset). My simple test program is attached below.
```scala
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._
val jsonData = sc.textFile("...")
jsonData.cache.count
val jsonSchemaRDD = sqlContext.jsonRDD(jsonData)
jsonSchemaRDD.registerAsTable("jt")
sqlContext.sql("select count(*) from jt").collect
```
Stages for the schema inference and the table scan both had 48 tasks. These tasks were executed sequentially. For the current implementation, scanning the JSON dataset will materialize values of all fields of a record. The inferred schema of the dataset can be accessed at https://gist.github.com/yhuai/05fe8a57c638c6666f8d.
From the result, there was no significant difference in running `jsonRDD`. For the simple aggregation query, results are attached below.
```
Original:
Run 1: 26.1s
Run 2: 27.03s
Run 3: 27.035s
With this change:
Run 1: 21.086s
Run 2: 21.035s
Run 3: 21.029s
```
JIRA: https://issues.apache.org/jira/browse/SPARK-2603
Author: Yin Huai <huai@cse.ohio-state.edu>
Closes #1504 from yhuai/removeToMapToList and squashes the following commits:
6831b77 [Yin Huai] Fix failed tests.
09b9bca [Yin Huai] Merge remote-tracking branch 'upstream/master' into removeToMapToList
d1abdb8 [Yin Huai] Remove unnecessary toMap and toList.
Add a `<deb.bin.filemode>744</deb.bin.filemode>` property to the `assembly/pom.xml` that defaults to `744`.
Use this property for ../bin folder <filemode>.
This patch doesn't change the current default modes but allows one to override the modes at build time:
`-Ddeb.bin.filemode=<new mode>`
Author: tzolov <christian.tzolov@gmail.com>
Closes #1531 from tzolov/SPARK-2619 and squashes the following commits:
6d95343 [tzolov] [Build] SPARK-2619: Configurable filemode for the spark/bin folder in the .deb package
...rce manager UI
Use the event logger directory to provide a direct link to finished
application UI in yarn resourcemanager UI.
Author: Rahul Singhal <rahul.singhal@guavus.com>
Closes #1094 from rahulsinghaliitd/SPARK-2150 and squashes the following commits:
95f230c [Rahul Singhal] SPARK-2150: Provide direct link to finished application UI in yarn resource manager UI
Unpersist RDDs that are no longer needed during Bagel iterations to make full use of memory.
Author: Daoyuan <daoyuan.wang@intel.com>
Closes #1519 from adrian-wang/bagelunpersist and squashes the following commits:
182c9dd [Daoyuan] rename var nextUseless to lastRDD
87fd3a4 [Daoyuan] bagel unpersist old processed rdd
...spark-submit
The PR allows invocations like
spark-submit --class org.MyClass --spark.shuffle.spill false myjar.jar
Author: Sandy Ryza <sandy@cloudera.com>
Closes #1253 from sryza/sandy-spark-2310 and squashes the following commits:
1dc9855 [Sandy Ryza] More doc and cleanup
00edfb9 [Sandy Ryza] Review comments
91b244a [Sandy Ryza] Change format to --conf PROP=VALUE
8fabe77 [Sandy Ryza] SPARK-2310. Support arbitrary Spark properties on the command line with spark-submit
Author: Michael Armbrust <michael@databricks.com>
Closes #1556 from marmbrus/fixBooleanEqualsOne and squashes the following commits:
ad8edd4 [Michael Armbrust] Add rule for true = 1 and false = 0.
Author: GuoQiang Li <witgo@qq.com>
Closes #1511 from witgo/JsonProtocol and squashes the following commits:
2b6227f [GuoQiang Li] Fix NPE for JsonProtocol
RoutingTableMessage was used to construct routing tables to enable
joining VertexRDDs with partitioned edges. It stored three elements: the
destination vertex ID, the source edge partition, and a byte specifying
the position in which the edge partition referenced the vertex to enable
join elimination.
However, this was incompatible with sort-based shuffle (SPARK-2045). It
was also slightly wasteful, because partition IDs are usually much
smaller than 2^32, though this was mitigated by a custom serializer that
used variable-length encoding.
This commit replaces RoutingTableMessage with a pair of (VertexId, Int)
where the Int encodes both the source partition ID (in the lower 30
bits) and the position (in the top 2 bits).
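The packing scheme can be sketched as follows. This is an illustration in Python rather than the Scala code in this commit; `encode`, `decode`, and the constant names are hypothetical. Python integers are unbounded, so the packed value shown here is the unsigned 32-bit pattern, whereas a Scala Int would hold the same bits as a signed value.

```python
PID_BITS = 30                      # lower 30 bits hold the source partition ID
POSITION_BITS = 2                  # top 2 bits hold the position
PID_MASK = (1 << PID_BITS) - 1     # 0x3FFFFFFF
POSITION_MASK = (1 << POSITION_BITS) - 1


def encode(pid, position):
    """Pack a partition ID and a position into one 32-bit pattern."""
    if not 0 <= pid <= PID_MASK:
        raise ValueError("partition ID must fit in 30 bits")
    if not 0 <= position <= POSITION_MASK:
        raise ValueError("position must fit in 2 bits")
    return (position << PID_BITS) | pid


def decode(packed):
    """Return (pid, position) from a value produced by encode()."""
    return packed & PID_MASK, (packed >> PID_BITS) & POSITION_MASK
```

For example, `encode(5, 3)` yields `0xC0000005`, and `decode` recovers `(5, 3)` from it.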
Author: Ankur Dave <ankurdave@gmail.com>
Closes #1553 from ankurdave/remove-RoutingTableMessage and squashes the following commits:
697e17b [Ankur Dave] Replace RoutingTableMessage with pair
Author: witgo <witgo@qq.com>
Closes #1403 from witgo/hive_compatibility and squashes the following commits:
4e5ecdb [witgo] The default does not run hive compatibility tests
Author: Prashant Sharma <prashant.s@imaginea.com>
Closes #1510 from ScrapCodes/SPARK-2549/fun-in-fun and squashes the following commits:
9458bc5 [Prashant Sharma] Tested by removing an inner function from excludes.
bc03b1c [Prashant Sharma] SPARK-2549 Functions defined inside of other functions trigger failures
Author: Ian O Connell <ioconnell@twitter.com>
Closes #1377 from ianoc/feature/SPARK-2102 and squashes the following commits:
5498566 [Ian O Connell] Docs update suggested by Patrick
20e8555 [Ian O Connell] Slight style change
f92c294 [Ian O Connell] Add docs for new KryoSerializer option
f3735c8 [Ian O Connell] Add using a kryo resource pool for the SqlSerializer
4e5c342 [Ian O Connell] Register the SparkConf for kryo, it gets swept into serialization
665805a [Ian O Connell] Add a spark.kryo.registrationRequired option for configuring the Kryo Serializer
Instead of shipping just the name and then looking up the info on the workers, we now ship the whole classname. Also, because the file was getting pretty large, I refactored it to move the type conversion code into its own file.
Author: Michael Armbrust <michael@databricks.com>
Closes #1552 from marmbrus/fixTempUdfs and squashes the following commits:
b695904 [Michael Armbrust] Make add jar execute with Hive. Ship the whole function class name since sometimes we cannot lookup temporary functions on the workers.
This change adds an analyzer rule to
1. find expressions in `HAVING` clause filters that depend on unresolved attributes,
2. push these expressions down to the underlying aggregates, and then
3. project them away above the filter.
It also enables the `HAVING` queries in the Hive compatibility suite.
Author: William Benton <willb@redhat.com>
Closes #1497 from willb/spark-2226 and squashes the following commits:
92c9a93 [William Benton] Removed unnecessary import
f1d4f34 [William Benton] Cleanups missed in prior commit
0e1624f [William Benton] Incorporated suggestions from @marmbrus; thanks!
541d4ee [William Benton] Cleanups from review
5a12647 [William Benton] Explanatory comments and stylistic cleanups.
c7f2b2c [William Benton] Whitelist HAVING queries.
29a26e3 [William Benton] Added rule to handle unresolved attributes in HAVING clauses (SPARK-2226)
Hi mridulm, I just thought of this issue with [#1212](https://github.com/apache/spark/pull/1212): I added FakeRackUtil to hold the host -> rack mapping. It should be cleaned up after use so that it won't interfere with test cases others may add later.
Really sorry about this.
Author: Rui Li <rui.li@intel.com>
Closes #1454 from lirui-intel/SPARK-2277-fix-UT and squashes the following commits:
f8ea25c [Rui Li] SPARK-2277: clear host->rack info properly
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes #1491 from ueshin/issues/SPARK-2588 and squashes the following commits:
43d0a46 [Takuya UESHIN] Merge branch 'master' into issues/SPARK-2588
1023ea0 [Takuya UESHIN] Modify tests to use DSLs.
2310bf1 [Takuya UESHIN] Add some more DSLs.
Make spark's "local[N]" better.
In our company, we use "local[N]" in production. It works excellently. It's our best choice.
Author: woshilaiceshide <woshilaiceshide@qq.com>
Closes #1544 from woshilaiceshide/localX and squashes the following commits:
6c85154 [woshilaiceshide] [CORE] SPARK-2640: In "local[N]", free cores of the only executor should be touched by "spark.task.cpus" for every finish/start-up of tasks.
It's useful to know whether one thread is constantly spilling or multiple threads are spilling relatively infrequently. Right now everything looks a little jumbled and we can't tell which lines belong to the same thread. For instance:
```
06:14:37 ExternalAppendOnlyMap: Spilling in-memory map of 4 MB to disk (194 times so far)
06:14:37 ExternalAppendOnlyMap: Spilling in-memory map of 4 MB to disk (198 times so far)
06:14:37 ExternalAppendOnlyMap: Spilling in-memory map of 4 MB to disk (198 times so far)
06:14:37 ExternalAppendOnlyMap: Spilling in-memory map of 10 MB to disk (197 times so far)
06:14:38 ExternalAppendOnlyMap: Spilling in-memory map of 9 MB to disk (45 times so far)
06:14:38 ExternalAppendOnlyMap: Spilling in-memory map of 23 MB to disk (198 times so far)
06:14:38 ExternalAppendOnlyMap: Spilling in-memory map of 38 MB to disk (25 times so far)
06:14:38 ExternalAppendOnlyMap: Spilling in-memory map of 161 MB to disk (25 times so far)
06:14:39 ExternalAppendOnlyMap: Spilling in-memory map of 0 MB to disk (199 times so far)
06:14:39 ExternalAppendOnlyMap: Spilling in-memory map of 4 MB to disk (166 times so far)
06:14:39 ExternalAppendOnlyMap: Spilling in-memory map of 4 MB to disk (199 times so far)
06:14:39 ExternalAppendOnlyMap: Spilling in-memory map of 4 MB to disk (200 times so far)
```
Author: Andrew Or <andrewor14@gmail.com>
Closes #1517 from andrewor14/external-log and squashes the following commits:
90e48bb [Andrew Or] Log thread ID when spilling
The name `preservesPartitioning` is ambiguous: it could mean 1) preserves the indices of partitions, or 2) preserves the partitioner. The latter is correct, and `preservesPartitioning` should really be called `preservesPartitioner` to avoid confusion. Unfortunately, this is already part of the API and we cannot change it. We should be clear in the doc and fix wrong usages.
This PR
1. adds notes in `mapPartitions*`,
2. makes `RDD.sample` preserve partitioner,
3. changes `preservesPartitioning` to false in `RDD.zip` because the keys of the first RDD are no longer the keys of the zipped RDD,
4. fixes some wrong usages in MLlib.
Author: Xiangrui Meng <meng@databricks.com>
Closes #1526 from mengxr/preserve-partitioner and squashes the following commits:
b361e65 [Xiangrui Meng] update doc based on pwendell's comments
3b1ba19 [Xiangrui Meng] update doc
357575c [Xiangrui Meng] fix unit test
20b4816 [Xiangrui Meng] Merge branch 'master' into preserve-partitioner
d1caa65 [Xiangrui Meng] add doc to explain preservesPartitioning fix wrong usage of preservesPartitioning make sample preserse partitioning
MessageToPartition was used in `Graph#partitionBy`. Unlike a Tuple2, it marked the key as transient to avoid sending it over the network. However, it was incompatible with sort-based shuffle (SPARK-2045) and represented only a minor optimization: for partitionBy, it improved performance by 6.3% (30.4 s to 28.5 s) and reduced communication by 5.6% (114.2 MB to 107.8 MB).
Author: Ankur Dave <ankurdave@gmail.com>
Closes #1537 from ankurdave/remove-MessageToPartition and squashes the following commits:
f9d0054 [Ankur Dave] Remove MessageToPartition
ab71364 [Ankur Dave] Remove unused VertexBroadcastMsg
Opt for option 2 defined in SPARK-2577, i.e., retrieve and pass the correct file system object to addResource.
Author: Gera Shegalov <gera@twitter.com>
Closes #1483 from gerashegalov/master and squashes the following commits:
90c9087 [Gera Shegalov] [YARN] SPARK-2577: File upload to viewfs is broken due to mount point resolution
The issue is caused by #1112.
Author: GuoQiang Li <witgo@qq.com>
Closes #1501 from witgo/webui_style and squashes the following commits:
4b34998 [GuoQiang Li] In some cases, pages display incorrect in WebUI
fix examples
Author: CrazyJvm <crazyjvm@gmail.com>
Closes #1523 from CrazyJvm/graphx-example and squashes the following commits:
663457a [CrazyJvm] outDegrees does not take parameters
7cfff1d [CrazyJvm] fix example for joinVertices
Currently, using "==" in a HiveQL expression causes an exception to be thrown; this patch fixes it.
Author: Cheng Hao <hao.cheng@intel.com>
Closes #1522 from chenghao-intel/equal and squashes the following commits:
f62a0ff [Cheng Hao] Add == Support for HiveQl
### Why and what?
Currently, the AppendOnlyMap performs an "in-place" sort by converting its array of [key, value, key, value] pairs into an array of [(key, value), (key, value)] pairs. However, this causes us to allocate many Tuple2 objects, which come at a nontrivial overhead.
This patch adds a Sorter API, intended for in memory sorts, which simply ports the Android Timsort implementation (available under Apache v2) and abstracts the interface in a way which introduces no more than 1 virtual function invocation of overhead at each abstraction point.
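To illustrate the idea of sorting the flat [key, value, key, value] layout directly, without building one wrapper object per pair, here is a minimal sketch. It deliberately swaps in a plain insertion sort for Timsort to stay short, so it shows only the data-movement pattern, not the algorithm or the Sorter API from this patch; `sort_kv_in_place` is a hypothetical name.

```python
def sort_kv_in_place(arr):
    """Sort a flat [k0, v0, k1, v1, ...] list by key, moving each value with its key.

    Insertion sort is used here only for brevity; it is not Timsort.
    """
    n = len(arr) // 2
    for i in range(1, n):
        k, v = arr[2 * i], arr[2 * i + 1]
        j = i - 1
        # Shift pairs with a larger key one slot (two cells) to the right.
        while j >= 0 and arr[2 * j] > k:
            arr[2 * j + 2] = arr[2 * j]
            arr[2 * j + 3] = arr[2 * j + 1]
            j -= 1
        arr[2 * j + 2] = k
        arr[2 * j + 3] = v
```

Sorting `[3, "c", 1, "a", 2, "b"]` this way leaves `[1, "a", 2, "b", 3, "c"]` in the same list.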
Please compare our port of the Android Timsort sort with the original implementation: http://www.diffchecker.com/wiwrykcl
### Memory implications
An AppendOnlyMap contains N kv pairs, which results in roughly 2N elements within its underlying array. Each of these elements is 4 bytes wide in a [compressed OOPS](https://wikis.oracle.com/display/HotSpotInternals/CompressedOops) system, which is the default.
Today's approach immediately allocates N Tuple2 objects, which take up 24N bytes in total (exposed via YourKit), and undergoes a Java sort. The Java 6 version immediately copies the entire array (4N bytes here), while the Java 7 version has a worst-case allocation of half the array (2N bytes).
This results in a worst-case sorting overhead of 24N + 2N = 26N bytes (for Java 7).
The Sorter does not require allocating any tuples, but since it uses Timsort, it may copy up to half the entire array in the worst case.
This results in a worst-case sorting overhead of 4N bytes.
Thus, we have reduced the worst-case overhead of the sort by roughly 22 bytes times the number of elements.
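The arithmetic above can be checked with a few lines of Python. The byte counts (24 bytes per Tuple2, 4 bytes per array element, full copy on Java 6, half copy on Java 7) are taken from the text; the function names are hypothetical.

```python
def tuple_sort_overhead(n, java7=True):
    """Worst-case extra bytes when sorting N pairs via Tuple2 objects."""
    tuple_bytes = 24 * n        # N Tuple2 objects at 24 bytes each
    ref_array_bytes = 4 * n     # array of N 4-byte references to the tuples
    # Java 7 copies at most half the array; Java 6 copies all of it.
    copy_bytes = ref_array_bytes // 2 if java7 else ref_array_bytes
    return tuple_bytes + copy_bytes


def sorter_overhead(n):
    """Worst-case extra bytes when sorting the flat key-value array in place."""
    kv_array_bytes = 2 * n * 4  # 2N elements of 4 bytes each
    return kv_array_bytes // 2  # Timsort may copy up to half the array


n = 1000000
saved = tuple_sort_overhead(n) - sorter_overhead(n)  # 26N - 4N = 22N bytes
```

By the same accounting, the Java 6 path would cost 24N + 4N = 28N bytes.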
### Performance implications
As the destructiveSortedIterator is used for spilling in an ExternalAppendOnlyMap, the purpose of this patch is to provide stability by reducing memory usage rather than improve performance. However, because it implements Timsort, it also brings a substantial performance boost over our prior implementation.
Here are the results of a microbenchmark that sorted 25 million, randomly distributed (Float, Int) pairs. The Java Arrays.sort() tests were run **only on the keys**, and thus moved less data. Our current implementation is called "Tuple-sort using Arrays.sort()" while the new implementation is "KV-array using Sorter".
<table>
<tr><th>Test</th><th>First run (JDK6)</th><th>Average of 10 (JDK6)</th><th>First run (JDK7)</th><th>Average of 10 (JDK7)</th></tr>
<tr><td>primitive Arrays.sort()</td><td>3216 ms</td><td>1190 ms</td><td>2724 ms</td><td>131 ms (!!)</td></tr>
<tr><td>Arrays.sort()</td><td>18564 ms</td><td>2006 ms</td><td>13201 ms</td><td>878 ms</td></tr>
<tr><td>Tuple-sort using Arrays.sort()</td><td>31813 ms</td><td>3550 ms</td><td>20990 ms</td><td>1919 ms</td></tr>
<tr><td><b>KV-array using Sorter</b></td><td></td><td></td><td><b>15020 ms</b></td><td><b>834 ms</b></td></tr>
</table>
The results show that this Sorter performs exactly as expected (after the first run) -- it is as fast as the Java 7 Arrays.sort() (which shares the same algorithm), but is significantly faster than the Tuple-sort on Java 6 or 7.
In short, this patch should significantly improve performance for users running either Java 6 or 7.
Author: Aaron Davidson <aaron@databricks.com>
Closes #1502 from aarondav/sort and squashes the following commits:
652d936 [Aaron Davidson] Update license, move Sorter to java src
a7b5b1c [Aaron Davidson] fix licenses
5c0efaf [Aaron Davidson] Update tmpLength
ec395c8 [Aaron Davidson] Ignore benchmark (again) and fix docs
034bf10 [Aaron Davidson] Change to Apache v2 Timsort
b97296c [Aaron Davidson] Don't try to run benchmark on Jenkins + private[spark]
6307338 [Aaron Davidson] SPARK-2047: Introduce an in-mem Sorter, and use it to reduce mem usage
Fix Mima issues in #1521.
Author: Xiangrui Meng <meng@databricks.com>
Closes #1533 from mengxr/mima-als and squashes the following commits:
78386e1 [Xiangrui Meng] make Mima ignore updateFeatures (private) in ALS
Author: peng.zhang <peng.zhang@xiaomi.com>
Closes #1521 from renozhang/fix-als and squashes the following commits:
b5727a4 [peng.zhang] Remove no need argument
1a4f7a0 [peng.zhang] Fix data skew in ALS
Author: Prashant Sharma <prashant@apache.org>
Closes #1441 from ScrapCodes/SPARK-2452/multi-statement and squashes the following commits:
26c5c72 [Prashant Sharma] Added a test case.
7e8d28d [Prashant Sharma] SPARK-2452, create a new valid for each instead of using lineId, because Line ids can be same sometimes.
Changes RDD.toDebugString() to show hierarchy and shuffle transformations more clearly
New output:
```
(3) FlatMappedValuesRDD[325] at apply at Transformer.scala:22
| MappedValuesRDD[324] at apply at Transformer.scala:22
| CoGroupedRDD[323] at apply at Transformer.scala:22
+-(5) MappedRDD[320] at apply at Transformer.scala:22
| | MappedRDD[319] at apply at Transformer.scala:22
| | MappedValuesRDD[318] at apply at Transformer.scala:22
| | MapPartitionsRDD[317] at apply at Transformer.scala:22
| | ShuffledRDD[316] at apply at Transformer.scala:22
| +-(10) MappedRDD[315] at apply at Transformer.scala:22
| | ParallelCollectionRDD[314] at apply at Transformer.scala:22
+-(100) MappedRDD[322] at apply at Transformer.scala:22
| ParallelCollectionRDD[321] at apply at Transformer.scala:22
```
Author: Gregory Owen <greowen@gmail.com>
Closes #1364 from GregOwen/to-debug-string and squashes the following commits:
08f5c78 [Gregory Owen] toDebugString: prettier debug printing to show shuffles and joins more clearly
1603f7b [Gregory Owen] toDebugString: prettier debug printing to show shuffles and joins more clearly
We need to use the analyzed attributes; otherwise we end up with a tree that will never resolve.
Author: Michael Armbrust <michael@databricks.com>
Closes #1470 from marmbrus/fixApplySchema and squashes the following commits:
f968195 [Michael Armbrust] Use analyzed attributes when applying the schema.
4969015 [Michael Armbrust] Add test case.
[SPARK-2434][MLlib]: Add warning messages, in both the comments and the code, that refer users to the original MLlib implementations of some popular example machine learning algorithms. The following examples have been modified:
Scala:
* LocalALS
* LocalFileLR
* LocalKMeans
* LocalLP
* SparkALS
* SparkHdfsLR
* SparkKMeans
* SparkLR
Python:
* kmeans.py
* als.py
* logistic_regression.py
Author: Burak <brkyvz@gmail.com>
Closes #1515 from brkyvz/SPARK-2434 and squashes the following commits:
7505da9 [Burak] [SPARK-2434][MLlib]: Warning messages added, scalastyle errors fixed, and added missing punctuation
b96b522 [Burak] [SPARK-2434][MLlib]: Warning messages added and scalastyle errors fixed
4762f39 [Burak] [SPARK-2434]: Warning messages added
17d3d83 [Burak] SPARK-2434: Added warning messages to the naive implementations of the example algorithms
2cb5301 [Burak] SPARK-2434: Warning messages redirecting to original implementaions added.
Results may not be returned in the expected order, so relax that constraint.
Author: Aaron Davidson <aaron@databricks.com>
Closes #1514 from aarondav/flakey and squashes the following commits:
e5af823 [Aaron Davidson] Fix flakey HiveQuerySuite test
In CPython, the hash of None differs across machines, which causes wrong results during shuffle. This PR fixes that.
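A machine-independent hash for partitioning can be sketched as below. This is a minimal illustration of the idea, not necessarily the function added by this PR: the names and the tuple-mixing constants are assumptions, and other key types (for example strings under hash randomization) may need similar care.

```python
def portable_hash(x):
    """Hash that maps None to a fixed value, including None nested inside tuples."""
    if x is None:
        return 0
    if isinstance(x, tuple):
        # Build the tuple hash from the portable hashes of its elements,
        # so that (None, 1) also hashes identically on every machine.
        h = 0x345678
        for item in x:
            h ^= portable_hash(item)
            h *= 1000003
            h &= 0xFFFFFFFF
        h ^= len(x)
        return h
    return hash(x)


def partition_for(key, num_partitions):
    """Pick the shuffle partition for a key using the portable hash."""
    return portable_hash(key) % num_partitions
```

With this, `partition_for(None, 8)` is `0` on every worker, so records keyed by None always meet in the same partition.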
Author: Davies Liu <davies.liu@gmail.com>
Closes #1371 from davies/hash_of_none and squashes the following commits:
d01745f [Davies Liu] add comments, remove outdated unit tests
5467141 [Davies Liu] disable hijack of hash, use it only for partitionBy()
b7118aa [Davies Liu] use __builtin__ instead of __builtins__
839e417 [Davies Liu] hijack hash to make hash of None consistant cross machines
Author: Sandy Ryza <sandy@cloudera.com>
Closes #634 from sryza/sandy-spark-1707 and squashes the following commits:
2f6e358 [Sandy Ryza] Default min registered executors ratio to .8 for YARN
354c630 [Sandy Ryza] Remove outdated comments
c744ef3 [Sandy Ryza] Take out waitForInitialAllocations
2a4329b [Sandy Ryza] SPARK-1707. Remove unnecessary 3 second sleep in YarnClusterScheduler
Standalone application examples written in Java are added to the 'mllib-linear-methods.md' file.
This commit is related to the issue [Add full Java Examples in MLlib docs](https://issues.apache.org/jira/browse/SPARK-1945).
Also I changed the name of the sigmoid function from 'logit' to 'f'. This is because the logit function
is the inverse of sigmoid.
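The relationship behind the rename can be checked numerically. The sketch below is in Python for brevity (the examples in this commit are Java), and the function names are only illustrative.

```python
import math


def sigmoid(x):
    """Logistic (sigmoid) function: maps a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def logit(p):
    """Log-odds: maps a probability in (0, 1) back to a real number."""
    return math.log(p / (1.0 - p))


# Applying logit after sigmoid recovers the input, up to rounding error.
round_trip_error = abs(logit(sigmoid(0.7)) - 0.7)
```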
Thanks,
Michael
Author: Michael Giannakopoulos <miccagiann@gmail.com>
Closes #1311 from miccagiann/master and squashes the following commits:
8ffe5ab [Michael Giannakopoulos] Update code so as to comply with code standards.
f7ad5cc [Michael Giannakopoulos] Merge remote-tracking branch 'upstream/master'
38d92c7 [Michael Giannakopoulos] Adding PCA, SVD and LBFGS examples in Java. Performing minor updates in the already committed examples so as to eradicate the call of 'productElement' function whenever is possible.
cc0a089 [Michael Giannakopoulos] Modyfied Java examples so as to comply with coding standards.
b1141b2 [Michael Giannakopoulos] Added Java examples for Clustering and Collaborative Filtering [mllib-clustering.md & mllib-collaborative-filtering.md].
837f7a8 [Michael Giannakopoulos] Merge remote-tracking branch 'upstream/master'
15f0eb4 [Michael Giannakopoulos] Java examples included in 'mllib-linear-methods.md' file.
As a result of shivaram's experience debugging long scheduler delay, I think we should improve the tooltip to point people in the right direction if scheduler delay is large.
Author: Kay Ousterhout <kayousterhout@gmail.com>
Closes #1488 from kayousterhout/better_tooltips and squashes the following commits:
22176fd [Kay Ousterhout] Improve scheduler delay tooltip.
Stabilize the logistic function to avoid overflow in `exp(x)` if `x` is large.
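One common way to keep the logistic function from overflowing is to call `exp` only on non-positive arguments. The sketch below shows that approach; it is not necessarily the exact form used in this patch, and `stable_logistic` is a hypothetical name.

```python
import math


def stable_logistic(x):
    """Compute 1 / (1 + exp(-x)) while only ever calling exp on values <= 0."""
    if x >= 0:
        # exp(-x) is at most 1 here, so it cannot overflow.
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x, rewrite as exp(x) / (1 + exp(x)); exp(x) is below 1.
    e = math.exp(x)
    return e / (1.0 + e)
```

A naive `1.0 / (1.0 + math.exp(-x))` raises OverflowError in Python for a large negative `x` such as -1000, while this version returns 0.0.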
Author: Xiangrui Meng <meng@databricks.com>
Closes #1493 from mengxr/py-logistic and squashes the following commits:
259e863 [Xiangrui Meng] stabilize logistic function in pyspark
Author: Sandy Ryza <sandy@cloudera.com>
Closes #1474 from sryza/sandy-spark-2564 and squashes the following commits:
35b8388 [Sandy Ryza] Fix compile error on upmerge
7b985fb [Sandy Ryza] Fix test compile error
43f79e6 [Sandy Ryza] SPARK-2564. ShuffleReadMetrics.totalBlocksRead is redundant
This is part of SPARK-2495 to allow users to construct linear models manually.
Author: Xiangrui Meng <meng@databricks.com>
Closes #1492 from mengxr/public-constructor and squashes the following commits:
a48b766 [Xiangrui Meng] remove private[mllib] from linear models' constructors
We should fix this in branch-1.0 as well.
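The bug class fixed here (SPARK-2598: a binary search over range bounds that ignores the supplied ordering) can be illustrated with a small sketch. It is written in Python rather than Scala, `binary_search` and the comparator are hypothetical, and it is not the RangePartitioner code itself.

```python
def binary_search(bounds, key, compare):
    """Return the index of the first bound that does not precede `key`.

    `bounds` must be sorted under `compare`, where compare(a, b) < 0 means
    that a comes before b in that ordering.
    """
    lo, hi = 0, len(bounds)
    while lo < hi:
        mid = (lo + hi) // 2
        if compare(bounds[mid], key) < 0:
            lo = mid + 1
        else:
            hi = mid
    return lo


descending = lambda a, b: b - a   # custom ordering: larger values come first
natural = lambda a, b: a - b      # default ascending ordering

bounds = [30, 20, 10]             # sorted under the descending ordering
right = binary_search(bounds, 25, descending)  # uses the given ordering
wrong = binary_search(bounds, 25, natural)     # ignores it: wrong index
```

With bounds sorted in descending order, searching with the matching comparator places 25 at index 1, while searching with the natural ordering lands on index 3.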
Author: Reynold Xin <rxin@apache.org>
Closes #1500 from rxin/rangePartitioner and squashes the following commits:
c0a94f5 [Reynold Xin] [SPARK-2598] RangePartitioner's binary search does not use the given Ordering.
...s of CoGroupedRDD and PairRDDFunctions
This also removes an unnecessary tuple creation in cogroup.
Author: Sandy Ryza <sandy@cloudera.com>
Closes #1447 from sryza/sandy-spark-2519-2 and squashes the following commits:
b6d9699 [Sandy Ryza] Remove missed Tuple2 match in CoGroupedRDD
a109828 [Sandy Ryza] Remove another pattern matching in MappedValuesRDD and revert some changes in PairRDDFunctions
be10f8a [Sandy Ryza] SPARK-2519 part 2. Remove pattern matching on Tuple2 in critical sections of CoGroupedRDD and PairRDDFunctions
make-distribution.sh gives a slightly off error message when using --with-hive.
Author: Mark Wagner <mwagner@mwagner-ld.linkedin.biz>
Closes #1489 from wagnermarkd/SPARK-2587 and squashes the following commits:
7b5d3ff [Mark Wagner] SPARK-2587: Fix error message in make-distribution.sh