* add "ML" prefix to reader/writer/readable/writable to avoid name collision with java.util.*
* define `DefaultParamsReadable/Writable` and use them to save some code
* use `super.load` instead so people can jump directly to the doc of `Readable.load`, which documents the Java compatibility issues
jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Closes#9827 from mengxr/SPARK-11839.
Before this PR there were two things that would blow up if you called `df.as[MyClass]` if `MyClass` was defined in the REPL:
- [x] Because `classForName` doesn't work on the munged names returned by `tpe.erasure.typeSymbol.asClass.fullName`
- [x] Because we don't have anything to pass into the constructor for the `$outer` pointer.
Note that this PR is just adding the infrastructure for working with inner classes in encoder and is not yet sufficient to make them work in the REPL. Currently, the implementation show in 95cec7d413 is causing a bug that breaks code gen due to some interaction between janino and the `ExecutorClassLoader`. This will be addressed in a follow-up PR.
Author: Michael Armbrust <michael@databricks.com>
Closes#9602 from marmbrus/dataset-replClasses.
stack trace of failure:
```
org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to eventually never returned normally. Attempted 62 times over 1.006322071 seconds. Last failure message:
Argument(s) are different! Wanted:
writeAheadLog.write(
java.nio.HeapByteBuffer[pos=0 lim=124 cap=124],
10
);
-> at org.apache.spark.streaming.util.BatchedWriteAheadLogSuite$$anonfun$23$$anonfun$apply$mcV$sp$15.apply(WriteAheadLogSuite.scala:518)
Actual invocation has different arguments:
writeAheadLog.write(
java.nio.HeapByteBuffer[pos=0 lim=124 cap=124],
10
);
-> at org.apache.spark.streaming.util.WriteAheadLogSuite$BlockingWriteAheadLog.write(WriteAheadLogSuite.scala:756)
```
I believe the issue was that due to a race condition, the ordering of the events could be messed up in the final ByteBuffer, therefore the comparison fails.
By adding eventually between the requests, we make sure the ordering is preserved. Note that in real life situations, the ordering across threads will not matter.
Another solution would be to implement a custom mockito matcher that sorts and then compares the results, but that kind of sounds like overkill to me. Let me know what you think tdas zsxwing
Author: Burak Yavuz <brkyvz@gmail.com>
Closes#9790 from brkyvz/fix-flaky-2.
DStream checkpoint interval is by default set at max(10 second, batch interval). That's bad for large batch intervals where the checkpoint interval = batch interval, and RDDs get checkpointed every batch.
This PR is to set the checkpoint interval of trackStateByKey to 10 * batch duration.
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#9805 from tdas/SPARK-11814.
The HP Fortify Opens Source Review team (https://www.hpfod.com/open-source-review-project) reported a handful of potential resource leaks that were discovered using their static analysis tool. We should fix the issues identified by their scan.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#9455 from JoshRosen/fix-potential-resource-leaks.
SparkListenerSuite's _"onTaskGettingResult() called when result fetched remotely"_ test was extremely slow (1 to 4 minutes to run) and recently became extremely flaky, frequently failing with OutOfMemoryError.
The root cause was the fact that this was using `System.setProperty` to set the Akka frame size, which was not actually modifying the frame size. As a result, this test would allocate much more data than necessary. The fix here is to simply use SparkConf in order to configure the frame size.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#9822 from JoshRosen/SPARK-11649.
Add read/write support to the following estimators under spark.ml:
* CountVectorizer
* IDF
* MinMaxScaler
* StandardScaler (a little awkward because we store some params in spark.mllib model)
* StringIndexer
Added some necessary method for read/write. Maybe we should add `private[ml] trait DefaultParamsReadable` and `DefaultParamsWritable` to save some boilerplate code, though we still need to override `load` for Java compatibility.
jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Closes#9798 from mengxr/SPARK-6787.
This patch refactors the existing Kryo encoder expressions and adds support for Java serialization.
Author: Reynold Xin <rxin@databricks.com>
Closes#9802 from rxin/SPARK-11810.
Apply the user supplied pathfilter while retrieving the files from fs.
Author: Dilip Biswal <dbiswal@us.ibm.com>
Closes#9652 from dilipbiswal/spark-11544.
This PR includes:
* Update SparkR:::glm, SparkR:::summary API docs.
* Update SparkR machine learning user guide and example codes to show:
* supporting feature interaction in R formula.
* summary for gaussian GLM model.
* coefficients for binomial GLM model.
mengxr
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#9727 from yanboliang/spark-11684.
jira: https://issues.apache.org/jira/browse/SPARK-11813
I found the problem during training a large corpus. Avoid serialization of vocab in Word2Vec has 2 benefits.
1. Performance improvement for less serialization.
2. Increase the capacity of Word2Vec a lot.
Currently in the fit of word2vec, the closure mainly includes serialization of Word2Vec and 2 global table.
the main part of Word2vec is the vocab of size: vocab * 40 * 2 * 4 = 320 vocab
2 global table: vocab * vectorSize * 8. If vectorSize = 20, that's 160 vocab.
Their sum cannot exceed Int.max due to the restriction of ByteArrayOutputStream. In any case, avoiding serialization of vocab helps decrease the size of the closure serialization, especially when vectorSize is small, thus to allow larger vocabulary.
Actually there's another possible fix, make local copy of fields to avoid including Word2Vec in the closure. Let me know if that's preferred.
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes#9803 from hhbyyh/w2vVocab.
Also modifies DefaultParamsWriter.saveMetadata to take optional extra metadata.
CC: mengxr yanboliang
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#9786 from jkbradley/als-io.
This replaces [https://github.com/apache/spark/pull/9656] with updates.
fayeshine should be the main author when this PR is committed.
CC: mengxr fayeshine
Author: Wenjian Huang <nextrush@163.com>
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#9814 from jkbradley/fayeshine-patch-6790.
return Double.NaN for mean/average when count == 0 for all numeric types that is converted to Double, Decimal type continue to return null.
Author: JihongMa <linlin200605@gmail.com>
Closes#9705 from JihongMA/SPARK-11720.
[SPARK-6028](https://issues.apache.org/jira/browse/SPARK-6028) uses network module to implement RPC. However, there are some configurations named with `spark.shuffle` prefix in the network module.
This PR refactors them to make sure the user can control them in shuffle and RPC separately. The user can use `spark.rpc.*` to set the configuration for netty RPC.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#9481 from zsxwing/SPARK-10745.
Based on my conversions with people, I believe the consensus is that the coarse-grained mode is more stable and easier to reason about. It is best to use that as the default rather than the more flaky fine-grained mode.
Author: Reynold Xin <rxin@databricks.com>
Closes#9795 from rxin/SPARK-11809.
Currently streaming foreachRDD Java API uses a function prototype requiring a return value of null. This PR deprecates the old method and uses VoidFunction to allow for more concise declaration. Also added VoidFunction2 to Java API in order to use in Streaming methods. Unit test is added for using foreachRDD with VoidFunction, and changes have been tested with Java 7 and Java 8 using lambdas.
Author: Bryan Cutler <bjcutler@us.ibm.com>
Closes#9488 from BryanCutler/foreachRDD-VoidFunction-SPARK-4557.
Currently, if the first SQLContext is not removed after stopping SparkContext, a SQLContext could set there forever. This patch make this more robust.
Author: Davies Liu <davies@databricks.com>
Closes#9706 from davies/clear_context.
https://issues.apache.org/jira/browse/SPARK-11792
The main changes include:
* Renaming `SizeEstimation` to `KnownSizeEstimation`. Hopefully this new name has more information.
* Making `estimatedSize` return `Long` instead of `Option[Long]`.
* In `UnsaveHashedRelation`, `estimatedSize` will delegate the work to `SizeEstimator` if we have not created a `BytesToBytesMap`.
Since we will put `UnsaveHashedRelation` to `BlockManager`, it is generally good to let it provide a more accurate size estimation. Also, if we do not put `BytesToBytesMap` directly into `BlockerManager`, I feel it is not really necessary to make `BytesToBytesMap` extends `KnownSizeEstimation`.
Author: Yin Huai <yhuai@databricks.com>
Closes#9813 from yhuai/SPARK-11792-followup.
Using ENSIME, I often have `.ensime_cache` polluting my source tree. This PR simply adds the cache directory to `.gitignore`
Author: Jakob Odersky <jodersky@gmail.com>
Closes#9708 from jodersky/master.
we use `ExpressionEncoder.tuple` to build the result encoder, which assumes the input encoder should point to a struct type field if it’s non-flat.
However, our keyEncoder always point to a flat field/fields: `groupingAttributes`, we should combine them into a single `NamedExpression`.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#9792 from cloud-fan/agg.
If user use primitive parameters in UDF, there is no way for him to do the null-check for primitive inputs, so we are assuming the primitive input is null-propagatable for this case and return null if the input is null.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#9770 from cloud-fan/udf.
When we resolve the join operator, we may change the output of right side if self-join is detected. So in `Dataset.joinWith`, we should resolve the join operator first, and then get the left output and right output from it, instead of using `left.output` and `right.output` directly.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#9806 from cloud-fan/self-join.
Make sure we are using the context classloader when deserializing failed TaskResults instead of the Spark classloader.
The issue is that `enqueueFailedTask` was using the incorrect classloader which results in `ClassNotFoundException`.
Adds a test in TaskResultGetterSuite that compiles a custom exception, throws it on the executor, and asserts that Spark handles the TaskResult deserialization instead of returning `UnknownReason`.
See #9367 for previous comments
See SPARK-11195 for a full repro
Author: Hurshal Patel <hpatel516@gmail.com>
Closes#9779 from choochootrain/spark-11195-master.
The goal of this PR is to add tests covering the issue to ensure that is was resolved by [SPARK-11086](https://issues.apache.org/jira/browse/SPARK-11086).
Author: zero323 <matthew.szymkiewicz@gmail.com>
Closes#9743 from zero323/SPARK-11281-tests.
Update to Commons Collections 3.2.2 to avoid any potential remote code execution vulnerability
Author: Sean Owen <sowen@cloudera.com>
Closes#9731 from srowen/SPARK-11652.
I also found a bug with self-joins returning incorrect results in the Dataset API. Two test cases attached and filed SPARK-11803.
Author: Reynold Xin <rxin@databricks.com>
Closes#9789 from rxin/SPARK-11802.
JIRA issue https://issues.apache.org/jira/browse/SPARK-11728.
The ml-ensembles.md file contains `OneVsRestExample`. Instead of writing new code files of two `OneVsRestExample`s, I use two existing files in the examples directory, they are `OneVsRestExample.scala` and `JavaOneVsRestExample.scala`.
Author: Xusen Yin <yinxusen@gmail.com>
Closes#9716 from yinxusen/SPARK-11728.
I have added unit test for ML's StandardScaler By comparing with R's output, please review for me.
Thx.
Author: RoyGaoVLIS <roygao@zju.edu.cn>
Closes#6665 from RoyGao/7013.
See discussion toward the tail of https://github.com/apache/spark/pull/9723
From zsxwing :
```
The user should not call stop or other long-time work in a listener since it will block the listener thread, and prevent from stopping SparkContext/StreamingContext.
I cannot see an approach since we need to stop the listener bus's thread before stopping SparkContext/StreamingContext totally.
```
Proposed solution is to prevent the call to StreamingContext#stop() in the listener bus's thread.
Author: tedyu <yuzhihong@gmail.com>
Closes#9741 from tedyu/master.
The bug described at [SPARK-11755](https://issues.apache.org/jira/browse/SPARK-11755), after exporting ```predict``` we can both get the help information from the SparkR and base R package like the following:
```Java
> help(predict)
Help on topic ‘predict’ was found in the following packages:
Package Library
SparkR /Users/yanboliang/data/trunk2/spark/R/lib
stats /Library/Frameworks/R.framework/Versions/3.2/Resources/library
Choose one
1: Make predictions from a model {SparkR}
2: Model Predictions {stats}
```
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#9732 from yanboliang/spark-11755.
The default implementation of serialization UTF8String with Kyro may be not correct (BYTE_ARRAY_OFFSET could be different across JVM)
Author: Davies Liu <davies@databricks.com>
Closes#9704 from davies/kyro_string.
This PR upgrade the version of RoaringBitmap to 0.5.10, to optimize the memory layout, will be much smaller when most of blocks are empty.
This PR is based on #9661 (fix conflicts), see all of the comments at https://github.com/apache/spark/pull/9661 .
Author: Kent Yao <yaooqinn@hotmail.com>
Author: Davies Liu <davies@databricks.com>
Author: Charles Allen <charles@allen-net.com>
Closes#9746 from davies/roaring_mapstatus.
Fix the serialization of RoaringBitmap with Kyro serializer
This PR came from https://github.com/metamx/spark/pull/1, thanks to drcrallen
Author: Davies Liu <davies@databricks.com>
Author: Charles Allen <charles@allen-net.com>
Closes#9748 from davies/SPARK-11016.
I also wrote a test case -- but unfortunately the test case is not working due to SPARK-11795.
Author: Reynold Xin <rxin@databricks.com>
Closes#9784 from rxin/SPARK-11503.
When we exceed the max memory tell users to increase both params instead of just the one.
Author: Holden Karau <holden@us.ibm.com>
Closes#9758 from holdenk/SPARK-11771-maximum-memory-in-yarn-is-controlled-by-two-params-have-both-in-error-msg.
Sometimes, EmbeddedZookeeper may need more than 6 seconds to setup up in a slow Jenkins worker. So just increase the timeout, it won't increase the test time if the test passes.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#9778 from zsxwing/SPARK-11790.