Currently, if the first SQLContext is not removed after stopping SparkContext, a SQLContext could sit there forever. This patch makes this more robust.
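A minimal sketch of the idea, assuming the active context is tracked in an atomic reference and that the SparkContext exposes some way to tell it has been stopped (names below are illustrative, not the actual patch):
```scala
import java.util.concurrent.atomic.AtomicReference
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

object SQLContextHolder {
  private val instantiated = new AtomicReference[SQLContext]()

  def getOrCreate(sc: SparkContext): SQLContext = {
    val ctx = instantiated.get()
    // Ignore a cached SQLContext whose SparkContext has already been stopped,
    // so a fresh one is created instead of lingering forever.
    if (ctx == null || ctx.sparkContext.isStopped) {
      val newCtx = new SQLContext(sc)
      instantiated.set(newCtx)
      newCtx
    } else {
      ctx
    }
  }
}
```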
Author: Davies Liu <davies@databricks.com>
Closes#9706 from davies/clear_context.
https://issues.apache.org/jira/browse/SPARK-11792
The main changes include:
* Renaming `SizeEstimation` to `KnownSizeEstimation`. Hopefully the new name is more informative.
* Making `estimatedSize` return `Long` instead of `Option[Long]`.
* In `UnsafeHashedRelation`, `estimatedSize` will delegate the work to `SizeEstimator` if we have not created a `BytesToBytesMap`.
Since we will put `UnsafeHashedRelation` into `BlockManager`, it is generally good to let it provide a more accurate size estimation. Also, if we do not put `BytesToBytesMap` directly into `BlockManager`, I feel it is not really necessary to make `BytesToBytesMap` extend `KnownSizeEstimation`.
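A rough sketch of what the renamed trait and the delegation could look like; the signatures follow the description above, while the class name and method bodies are illustrative:
```scala
import org.apache.spark.unsafe.map.BytesToBytesMap
import org.apache.spark.util.SizeEstimator

// estimatedSize now returns Long rather than Option[Long].
trait KnownSizeEstimation {
  def estimatedSize: Long
}

// Illustrative delegation only: before a BytesToBytesMap exists, fall back to the
// generic SizeEstimator; afterwards use the map's own memory accounting.
class UnsafeHashedRelationSketch extends KnownSizeEstimation {
  private var binaryMap: BytesToBytesMap = _

  override def estimatedSize: Long = {
    if (binaryMap != null) binaryMap.getTotalMemoryConsumption
    else SizeEstimator.estimate(this)
  }
}
```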
Author: Yin Huai <yhuai@databricks.com>
Closes#9813 from yhuai/SPARK-11792-followup.
Using ENSIME, I often have `.ensime_cache` polluting my source tree. This PR simply adds the cache directory to `.gitignore`
Author: Jakob Odersky <jodersky@gmail.com>
Closes#9708 from jodersky/master.
We use `ExpressionEncoder.tuple` to build the result encoder, which assumes the input encoder should point to a struct-type field if it's non-flat.
However, our key encoder always points to a flat field (or fields): `groupingAttributes`, so we should combine them into a single `NamedExpression`.
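A hedged sketch of the idea using Catalyst expressions (the exact call site is in the aggregation planning; the helper name is made up):
```scala
import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, CreateStruct, NamedExpression}

// Wrap the flat grouping attributes into a single struct-typed named expression,
// so a non-flat key encoder has a struct field to bind to.
def combineKeyColumns(groupingAttributes: Seq[Attribute]): NamedExpression = {
  if (groupingAttributes.length == 1) groupingAttributes.head
  else Alias(CreateStruct(groupingAttributes), "key")()
}
```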
Author: Wenchen Fan <wenchen@databricks.com>
Closes#9792 from cloud-fan/agg.
If a user uses primitive parameters in a UDF, there is no way to null-check the primitive inputs, so we assume the primitive input is null-propagating in this case and return null if the input is null.
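For illustration (assuming a `sqlContext` with the UDF registry in scope; the function name is made up):
```scala
// A UDF with a primitive Int parameter cannot observe null itself, so for a null
// input Spark now returns null instead of calling the function with a default value.
sqlContext.udf.register("plusOne", (x: Int) => x + 1)
sqlContext.sql("SELECT plusOne(CAST(NULL AS INT))")  // yields null, not 1
```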
Author: Wenchen Fan <wenchen@databricks.com>
Closes#9770 from cloud-fan/udf.
When we resolve the join operator, we may change the output of the right side if a self-join is detected. So in `Dataset.joinWith`, we should resolve the join operator first and then get the left and right outputs from it, instead of using `left.output` and `right.output` directly.
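A minimal usage example of the case this fixes, assuming a `sqlContext` with implicits in scope (the class and data are made up):
```scala
import sqlContext.implicits._
import org.apache.spark.sql.functions.lit

case class Value(i: Int)
val ds = Seq(Value(1), Value(2)).toDS()

// Both sides are the same Dataset, so the analyzer may rewrite the right side's
// attributes; joinWith must take left/right output from the resolved join operator.
val pairs = ds.joinWith(ds, lit(true))
```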
Author: Wenchen Fan <wenchen@databricks.com>
Closes#9806 from cloud-fan/self-join.
Make sure we are using the context classloader when deserializing failed TaskResults instead of the Spark classloader.
The issue is that `enqueueFailedTask` was using the incorrect classloader, which results in a `ClassNotFoundException`.
This adds a test in TaskResultGetterSuite that compiles a custom exception, throws it on the executor, and asserts that Spark deserializes the TaskResult correctly instead of falling back to `UnknownReason`.
See #9367 for previous comments
See SPARK-11195 for a full repro
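A rough sketch of the classloader fix, with the surrounding plumbing stripped away (the helper below is illustrative, not the actual `TaskResultGetter` code):
```scala
import java.nio.ByteBuffer
import org.apache.spark.TaskEndReason
import org.apache.spark.serializer.SerializerInstance

// Deserialize a failed task's reason with the context classloader (which can see
// user jars), rather than Spark's own classloader.
def deserializeFailureReason(ser: SerializerInstance, data: ByteBuffer): Option[TaskEndReason] = {
  val loader = Thread.currentThread.getContextClassLoader
  try {
    Some(ser.deserialize[TaskEndReason](data, loader))
  } catch {
    case _: ClassNotFoundException => None  // previously hit when the wrong loader was used
  }
}
```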
Author: Hurshal Patel <hpatel516@gmail.com>
Closes#9779 from choochootrain/spark-11195-master.
The goal of this PR is to add tests covering the issue, to ensure that it was resolved by [SPARK-11086](https://issues.apache.org/jira/browse/SPARK-11086).
Author: zero323 <matthew.szymkiewicz@gmail.com>
Closes#9743 from zero323/SPARK-11281-tests.
Update to Commons Collections 3.2.2 to avoid any potential remote code execution vulnerability
Author: Sean Owen <sowen@cloudera.com>
Closes#9731 from srowen/SPARK-11652.
I also found a bug with self-joins returning incorrect results in the Dataset API. Two test cases attached and filed SPARK-11803.
Author: Reynold Xin <rxin@databricks.com>
Closes#9789 from rxin/SPARK-11802.
JIRA issue https://issues.apache.org/jira/browse/SPARK-11728.
The ml-ensembles.md file contains `OneVsRestExample`. Instead of writing two new `OneVsRestExample` code files, I use two existing files in the examples directory: `OneVsRestExample.scala` and `JavaOneVsRestExample.scala`.
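Presumably this is done with the docs' `include_example` tag; the paths below are my assumption of where the example files live:
```
{% include_example scala/org/apache/spark/examples/ml/OneVsRestExample.scala %}
{% include_example java/org/apache/spark/examples/ml/JavaOneVsRestExample.java %}
```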
Author: Xusen Yin <yinxusen@gmail.com>
Closes#9716 from yinxusen/SPARK-11728.
I have added a unit test for ML's StandardScaler by comparing with R's output; please review it for me.
Thx.
Author: RoyGaoVLIS <roygao@zju.edu.cn>
Closes#6665 from RoyGao/7013.
See discussion toward the tail of https://github.com/apache/spark/pull/9723
From zsxwing:
```
The user should not call stop or other long-time work in a listener since it will block the listener thread, and prevent from stopping SparkContext/StreamingContext.
I cannot see an approach since we need to stop the listener bus's thread before stopping SparkContext/StreamingContext totally.
```
Proposed solution is to prevent the call to StreamingContext#stop() in the listener bus's thread.
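A minimal sketch of such a guard, assuming the listener bus thread can be recognized by its name (the real check may well be implemented differently):
```scala
import org.apache.spark.SparkException

// Refuse to run stop() from inside the listener bus thread, since stop() waits for
// that very thread to terminate and would otherwise hang.
def validateNotInListenerThread(): Unit = {
  if (Thread.currentThread.getName.contains("StreamingListenerBus")) {
    throw new SparkException(
      "Cannot stop StreamingContext from within a StreamingListener; " +
        "stop() would block waiting for the listener thread to finish.")
  }
}
```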
Author: tedyu <yuzhihong@gmail.com>
Closes#9741 from tedyu/master.
The bug is described at [SPARK-11755](https://issues.apache.org/jira/browse/SPARK-11755). After exporting ```predict```, we can get the help information from both the SparkR and base R packages, like the following:
```r
> help(predict)
Help on topic ‘predict’ was found in the following packages:
Package Library
SparkR /Users/yanboliang/data/trunk2/spark/R/lib
stats /Library/Frameworks/R.framework/Versions/3.2/Resources/library
Choose one
1: Make predictions from a model {SparkR}
2: Model Predictions {stats}
```
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#9732 from yanboliang/spark-11755.
The default implementation of serializing UTF8String with Kryo may not be correct (BYTE_ARRAY_OFFSET could be different across JVMs).
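One way to make the serialized form offset-independent is a custom Kryo serializer that writes the underlying bytes explicitly; this is a sketch of that idea, not necessarily the exact patch:
```scala
import com.esotericsoftware.kryo.{Kryo, Serializer}
import com.esotericsoftware.kryo.io.{Input, Output}
import org.apache.spark.unsafe.types.UTF8String

// Serialize a UTF8String as (length, bytes) so the result does not depend on
// JVM-specific layout details such as BYTE_ARRAY_OFFSET.
class UTF8StringSerializer extends Serializer[UTF8String] {
  override def write(kryo: Kryo, output: Output, str: UTF8String): Unit = {
    val bytes = str.getBytes
    output.writeInt(bytes.length)
    output.writeBytes(bytes)
  }

  override def read(kryo: Kryo, input: Input, clazz: Class[UTF8String]): UTF8String = {
    val length = input.readInt()
    UTF8String.fromBytes(input.readBytes(length))
  }
}
```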
Author: Davies Liu <davies@databricks.com>
Closes#9704 from davies/kyro_string.
This PR upgrades RoaringBitmap to 0.5.10 to optimize the memory layout; it will be much smaller when most blocks are empty.
This PR is based on #9661 (fix conflicts), see all of the comments at https://github.com/apache/spark/pull/9661 .
Author: Kent Yao <yaooqinn@hotmail.com>
Author: Davies Liu <davies@databricks.com>
Author: Charles Allen <charles@allen-net.com>
Closes#9746 from davies/roaring_mapstatus.
Fix the serialization of RoaringBitmap with the Kryo serializer.
This PR came from https://github.com/metamx/spark/pull/1, thanks to drcrallen
Author: Davies Liu <davies@databricks.com>
Author: Charles Allen <charles@allen-net.com>
Closes#9748 from davies/SPARK-11016.
I also wrote a test case -- but unfortunately the test case is not working due to SPARK-11795.
Author: Reynold Xin <rxin@databricks.com>
Closes#9784 from rxin/SPARK-11503.
When we exceed the max memory, tell users to increase both params instead of just the one.
Author: Holden Karau <holden@us.ibm.com>
Closes#9758 from holdenk/SPARK-11771-maximum-memory-in-yarn-is-controlled-by-two-params-have-both-in-error-msg.
Sometimes, EmbeddedZookeeper may need more than 6 seconds to set up on a slow Jenkins worker. So just increase the timeout; it won't increase the test time if the test passes.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#9778 from zsxwing/SPARK-11790.
With dynamic allocation, busy executors are sometimes killed by mistake: executors that still have task assignments are killed for having been idle long enough (say, 60 seconds). The root cause is that the task-launch listener event is asynchronous.
For example, some executors are being assigned tasks but have not yet sent out the listener notification. Meanwhile, the dynamic allocation idle timeout (e.g., 60 seconds) expires and triggers a killExecutor event at the same time.
1. The timer expires before the listener event arrives.
2. Then a task is scheduled onto that killed/killing executor, which ultimately leads to task failure.
Here is the proposal to fix it. We can add force control to killExecutor. If force is not set (i.e., false), we check whether the executor being killed is idle or busy; if it still has assignments, we do not kill it and return false to indicate the kill failed. Dynamic allocation should turn force killing off (i.e., force = false), so an attempt to kill a busy executor fails and the executor survives. Later, when the task-assignment event arrives, we can remove its idle timer accordingly. This way we avoid falsely killing busy executors under dynamic allocation.
For other usages, end users can decide for themselves whether to use force killing. If that option is turned on, killExecutor performs the action without any status checking.
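A sketch of the intended shape, with the scheduler state stubbed out (the signature and helper names are illustrative):
```scala
object KillExecutorSketch {
  def hasRunningTasks(executorId: String): Boolean = ???  // placeholder for scheduler state
  def doKillExecutor(executorId: String): Unit = ???      // placeholder for the actual kill

  // With force = false (what dynamic allocation should use), busy executors are
  // skipped and the call reports failure, so their idle timers can be corrected
  // once the task-launch event finally arrives.
  def killExecutors(executorIds: Seq[String], force: Boolean = false): Boolean = {
    val killable =
      if (force) executorIds
      else executorIds.filterNot(hasRunningTasks)
    killable.foreach(doKillExecutor)
    killable.nonEmpty && killable.size == executorIds.size
  }
}
```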
Author: Grace <jie.huang@intel.com>
Author: Andrew Or <andrew@databricks.com>
Author: Jie Huang <jie.huang@intel.com>
Closes#7888 from GraceH/forcekill.
We checkpoint both when generating a batch and when completing a batch. When the processing time of a batch is greater than the batch interval, checkpointing for completing an old batch may run after checkpointing for generating a new batch. If this happens, the checkpoint of the old batch actually has the latest information, so we want to recover from it. This PR uses the latest checkpoint time as the file name, so that we can always recover from the latest checkpoint file.
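A one-line sketch of the naming idea, assuming the usual `checkpoint-<time>` file layout:
```scala
import org.apache.hadoop.fs.Path

// Name the file after the checkpoint's own time, so the newest file on disk is
// also the one carrying the latest information when we recover.
def checkpointFile(checkpointDir: String, checkpointTime: Long): Path =
  new Path(checkpointDir, s"checkpoint-$checkpointTime")
```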
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#9707 from zsxwing/fix-checkpoint.
These events happen normally during the app's lifecycle, so printing
ERROR logs all the time is misleading and can actually affect the usability
of interactive shells.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#9772 from vanzin/SPARK-11786.
This PR makes the default read/write work with simple transformers/estimators that have params of type `Param[Vector]`. jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Closes#9776 from mengxr/SPARK-11764.
Add save/load to the LogisticRegression Estimator, and refactor tests a little to make it easier to add similar support to other Estimator/Model pairs.
Moved LogisticRegressionReader/Writer to within LogisticRegressionModel.
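A hedged usage example of the new persistence hooks (the paths are placeholders, and `training` is assumed to be an existing DataFrame):
```scala
import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}

// Save the estimator's params and load them back.
val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)
lr.write.overwrite().save("/tmp/lr-estimator")
val lr2 = LogisticRegression.load("/tmp/lr-estimator")

// Same pattern for a fitted model.
val model = lr.fit(training)
model.write.overwrite().save("/tmp/lr-model")
val model2 = LogisticRegressionModel.load("/tmp/lr-model")
```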
CC: mengxr
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#9749 from jkbradley/lr-io-2.
This adds an extra filter for private or protected classes. We only filter for package private right now.
Author: Timothy Hunter <timhunter@databricks.com>
Closes#9697 from thunterdb/spark-11732.
Currently the size of a cached batch is only controlled by `batchSize` (default value 10000), which does not work well with the size of the serialized columns (for example, complex types). The memory used to build the batch is not accounted for, so it's easy to OOM (especially after unified memory management).
This PR introduces a hard limit of 4MB for the total size of the columns (up to 50 columns of uncompressed primitive columns).
This also changes the way the buffer grows: double it each time, then trim it once finished.
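A sketch of the double-then-trim growth strategy (the buffer type and helper names are illustrative, not the columnar builder's actual code):
```scala
import java.nio.ByteBuffer

// Grow by doubling while appending...
def ensureCapacity(buf: ByteBuffer, needed: Int): ByteBuffer = {
  if (buf.remaining() >= needed) buf
  else {
    var newCapacity = buf.capacity()
    while (newCapacity - buf.position() < needed) newCapacity *= 2
    val grown = ByteBuffer.allocate(newCapacity)
    buf.flip()
    grown.put(buf)
  }
}

// ...then trim to the used size once the batch is finished, so the cached bytes
// do not keep the slack capacity around.
def trimToSize(buf: ByteBuffer): Array[Byte] = {
  val trimmed = new Array[Byte](buf.position())
  buf.flip()
  buf.get(trimmed)
  trimmed
}
```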
cc liancheng
Author: Davies Liu <davies@databricks.com>
Closes#9760 from davies/cache_limit.
This excludes Estimators and those that use Vector and other non-basic types for their Params or data. This adds:
* Bucketizer
* DCT
* HashingTF
* Interaction
* NGram
* Normalizer
* OneHotEncoder
* PolynomialExpansion
* QuantileDiscretizer
* RFormula
* SQLTransformer
* StopWordsRemover
* StringIndexer
* Tokenizer
* VectorAssembler
* VectorSlicer
CC: mengxr
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#9755 from jkbradley/transformer-io.
Based on cloud-fan's comment in https://github.com/apache/spark/pull/9216, update AttributeReference's hashCode function to include the hash codes of the other attributes: name, nullable, and qualifiers.
Here, I am not 100% sure whether we should include name in the hashCode calculation, since the original calculation does not include it.
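For illustration, a sketch of a hash that mixes in the extra fields (whether `name` belongs in it is exactly the open question above; the standalone helper is made up):
```scala
def attributeHashCode(
    name: String,
    exprId: Any,
    dataType: Any,
    nullable: Boolean,
    qualifiers: Seq[String]): Int = {
  var h = 17
  h = h * 37 + name.hashCode      // inclusion of name is the open question above
  h = h * 37 + exprId.hashCode
  h = h * 37 + dataType.hashCode
  h = h * 37 + nullable.hashCode
  h = h * 37 + qualifiers.hashCode
  h
}
```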
marmbrus cloud-fan Please review if the changes are good.
Author: gatorsmile <gatorsmile@gmail.com>
Closes#9761 from gatorsmile/hashCodeNamedExpression.
This PR adds a new option `spark.sql.hive.thriftServer.singleSession` for disabling multi-session support in the Thrift server.
Note that this option is added as a Spark configuration (retrieved from `SparkConf`) rather than a Spark SQL configuration (retrieved from `SQLConf`). This is because all SQL configurations are per-session; since multi-session support is on by default, no JDBC connection would be able to modify a global configuration like the newly added one.
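For example, since it is read from `SparkConf`, it can simply be set on the conf used to start the Thrift server (the value shown is illustrative):
```scala
import org.apache.spark.SparkConf

// Disable multi-session support in the Thrift server.
val conf = new SparkConf()
  .set("spark.sql.hive.thriftServer.singleSession", "true")
```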
Author: Cheng Lian <lian@databricks.com>
Closes#9740 from liancheng/spark-11089.single-session-option.
In the previous method, `fields.toArray` casts `java.util.List[StructField]` into `Array[Object]`, which cannot be cast to `Array[StructField]`; invoking this method therefore throws "java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to [Lorg.apache.spark.sql.types.StructField;".
In this patch, I directly convert `java.util.List[StructField]` into `Array[StructField]`.
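One way to do this conversion is the typed overload of `java.util.List#toArray`, roughly (the helper name is made up):
```scala
import java.util.{List => JList}
import org.apache.spark.sql.types.StructField

// Before: fields.toArray yields Array[AnyRef], which cannot be cast to Array[StructField].
// After: pass a typed array so the element type is preserved.
def toStructFieldArray(fields: JList[StructField]): Array[StructField] =
  fields.toArray(new Array[StructField](fields.size))
```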
Author: mayuanwen <mayuanwen@qiyi.com>
Closes#9649 from jackieMaKing/Spark-11679.
This is to support JSON serialization of Param[Vector] in the pipeline API. It could be used for other purposes too. The schema is the same as `VectorUDT`. jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Closes#9751 from mengxr/SPARK-11766.
Set s3a credentials when creating a new default hadoop configuration.
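A sketch of the idea, assuming the standard s3a configuration keys and AWS environment variables:
```scala
import org.apache.hadoop.conf.Configuration

// Propagate AWS credentials from the environment into the default Hadoop
// configuration for the s3a filesystem.
def setS3aCredentials(hadoopConf: Configuration): Unit = {
  for {
    accessKey <- sys.env.get("AWS_ACCESS_KEY_ID")
    secretKey <- sys.env.get("AWS_SECRET_ACCESS_KEY")
  } {
    hadoopConf.set("fs.s3a.access.key", accessKey)
    hadoopConf.set("fs.s3a.secret.key", secretKey)
  }
}
```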
Author: Chris Bannister <chris.bannister@swiftkey.com>
Closes#9663 from Zariel/set-s3a-creds.
MESOS_NATIVE_LIBRARY was renamed in favor of MESOS_NATIVE_JAVA_LIBRARY. This commit fixes the reference in the documentation.
Author: Philipp Hoffmann <mail@philipphoffmann.de>
Closes#9768 from philipphoffmann/patch-2.
In the **[Task Launching Overheads](http://spark.apache.org/docs/latest/streaming-programming-guide.html#task-launching-overheads)** section,
>Task Serialization: Using Kryo serialization for serializing tasks can reduce the task sizes, and therefore reduce the time taken to send them to the slaves.
As we know, **Task Serialization** is configured by the **spark.closure.serializer** parameter, but currently only the Java serializer is supported. If we set **spark.closure.serializer** to **org.apache.spark.serializer.KryoSerializer**, this will throw an exception.
Author: yangping.wu <wyphao.2007@163.com>
Closes#9734 from 397090770/397090770-patch-1.