## What changes were proposed in this pull request?
Currently, many functions do not show usage information, as in the following example.
```
scala> sql("desc function extended `sin`").collect().foreach(println)
[Function: sin]
[Class: org.apache.spark.sql.catalyst.expressions.Sin]
[Usage: To be added.]
[Extended Usage:
To be added.]
```
This PR adds descriptions for functions and adds a test case to prevent adding a function without usage.
```
scala> sql("desc function extended `sin`").collect().foreach(println);
[Function: sin]
[Class: org.apache.spark.sql.catalyst.expressions.Sin]
[Usage: sin(x) - Returns the sine of x.]
[Extended Usage:
> SELECT sin(0);
0.0]
```
The only exceptions are `cube`, `grouping`, `grouping_id`, `rollup`, `window`.
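The kind of check such a test case performs can be sketched in plain Python. This is only an illustration, not the Spark test itself: the `registry` dictionary and the `functions_missing_usage` helper are hypothetical names, and only the placeholder text and the list of exempt functions come from the description above.

```python
# Placeholder text shown for undocumented functions (from the output above).
PLACEHOLDER = "To be added."

# Functions that are allowed to lack a usage string (listed above).
EXEMPT = {"cube", "grouping", "grouping_id", "rollup", "window"}


def functions_missing_usage(registry):
    """Return the sorted names of non-exempt functions without a real usage.

    `registry` is a hypothetical mapping of function name -> usage string.
    """
    return sorted(
        name
        for name, usage in registry.items()
        if name not in EXEMPT and (not usage or usage == PLACEHOLDER)
    )


registry = {
    "sin": "sin(x) - Returns the sine of x.",
    "cube": PLACEHOLDER,   # exempt, so not reported
    "foo": PLACEHOLDER,    # hypothetical undocumented function
}
missing = functions_missing_usage(registry)
print(missing)
```

A test built on this idea would simply assert that the returned list is empty.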
## How was this patch tested?
Pass the Jenkins tests (including new testcases.)
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #12185 from dongjoon-hyun/SPARK-14415.
## What changes were proposed in this pull request?
The example does not work without the DataFrame import.
## How was this patch tested?
example doc only
Author: Örjan Lundberg <orjan.lundberg@gmail.com>
Closes #12277 from oluies/patch-1.
## What changes were proposed in this pull request?
Replace sortBy() with top() to calculate the top N frequent words as dictionary.
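The difference between the two approaches can be sketched in plain Python (this is not the Spark RDD API; `top_n_terms` and `top_n_terms_sorted` are hypothetical helper names). A heap-based top-N keeps only N candidates at a time, while the sort-based version orders the whole vocabulary first:

```python
import heapq
from collections import Counter


def top_n_terms(tokens, n):
    """Pick the n most frequent terms without sorting every term."""
    counts = Counter(tokens)
    # nlargest tracks only n candidates while scanning the counts
    return heapq.nlargest(n, counts.items(), key=lambda kv: kv[1])


def top_n_terms_sorted(tokens, n):
    """Same result, but sorts the full vocabulary before truncating."""
    counts = Counter(tokens)
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:n]


tokens = "a a a b b c".split()
print(top_n_terms(tokens, 2))
```

Both helpers return the same list here; the heap-based one avoids the full sort when the vocabulary is much larger than N.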
## How was this patch tested?
Existing unit tests. Terms with the same TF are now sorted in descending order, so a test would fail if it hardcoded such terms in the dictionary in an order like "c", "d"...
Author: fwang1 <desperado.wf@gmail.com>
Closes #12265 from lionelfeng/master.
## What changes were proposed in this pull request?
When deciding whether a CommitDeniedException caused a task to fail, consider the root cause of the Exception.
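The idea of looking at the root cause rather than the outermost exception can be sketched in Python. This is an illustration only: `CommitDenied`, `root_cause`, and `failed_due_to_commit_denied` are hypothetical names, and Python's `__cause__` chain stands in for Java's `Throwable.getCause`.

```python
class CommitDenied(Exception):
    """Hypothetical stand-in for a commit-denied error."""


def root_cause(exc):
    """Follow the __cause__ chain to its end, guarding against cycles."""
    seen = set()
    while exc.__cause__ is not None and id(exc) not in seen:
        seen.add(id(exc))
        exc = exc.__cause__
    return exc


def failed_due_to_commit_denied(exc):
    """True if the innermost cause of `exc` is a CommitDenied error."""
    return isinstance(root_cause(exc), CommitDenied)


# Wrap a CommitDenied inside a more generic failure, as a task runner might.
try:
    try:
        raise CommitDenied("attempt not authorized to commit")
    except CommitDenied as inner:
        raise RuntimeError("task failed while writing rows") from inner
except RuntimeError as outer:
    wrapped = outer

print(failed_due_to_commit_denied(wrapped))
```

Checking only `type(wrapped)` would miss the denial here, because the outermost exception is a `RuntimeError`.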
## How was this patch tested?
Added a test suite for the component that extracts the root cause of the error.
Made a distribution after cherry-picking this commit to branch-1.6 and used it to run our Spark application, which would quite often fail due to the CommitDeniedException.
Author: Jason Moore <jasonmoore2k@outlook.com>
Closes #12228 from jasonmoore2k/SPARK-14357.
## What changes were proposed in this pull request?
When calling `ReceiverTracker#allocatedExecutors` in a receiver-less scenario, an NPE is thrown, since this `ReceiverTracker` is never started and `endpoint` is not created.
This happens when using streaming dynamic allocation with direct Kafka.
## How was this patch tested?
Ran a local integration test.
Author: jerryshao <sshao@hortonworks.com>
Closes #12236 from jerryshao/SPARK-14455.
## What changes were proposed in this pull request?
For an external table's metadata (in Hive's representation), its table type needs to be EXTERNAL_TABLE. Also, there needs to be a field called EXTERNAL set in the table property with a value of TRUE (for a MANAGED_TABLE it will be FALSE) based on https://github.com/apache/hive/blob/release-1.2.1/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java#L1095-L1105. HiveClientImpl's toHiveTable fails to set this table property.
## How was this patch tested?
Added a new test.
Author: Yin Huai <yhuai@databricks.com>
Closes #12275 from yhuai/SPARK-14506.
## What changes were proposed in this pull request?
Currently, `checkstyle` is configured to check the files under `src/main/java`. However, Spark has Java files in `src/main/scala`, too. This PR fixes the following configuration in `pom.xml` and the unchecked-so-far violations on those files.
```xml
-<sourceDirectory>${basedir}/src/main/java</sourceDirectory>
+<sourceDirectories>${basedir}/src/main/java,${basedir}/src/main/scala</sourceDirectories>
```
## How was this patch tested?
Passed the Jenkins build and ran `dev/lint-java` manually. (Note that Jenkins does not run `lint-java`.)
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #12242 from dongjoon-hyun/SPARK-14465.
## What changes were proposed in this pull request?
This fix tries to remove duplicate Java code in examples/mllib and examples/ml. The following changes have been made:
```
deleted: ml/JavaCrossValidatorExample.java (duplicate of JavaModelSelectionViaCrossValidationExample.java)
deleted: ml/JavaTrainValidationSplitExample.java (duplicate of JavaModelSelectionViaTrainValidationSplitExample.java)
deleted: mllib/JavaFPGrowthExample.java (duplicate of JavaSimpleFPGrowth.java)
deleted: mllib/JavaLDAExample.java (duplicate of JavaLatentDirichletAllocationExample.java)
deleted: mllib/JavaKMeans.java (merged with JavaKMeansExample.java)
deleted: mllib/JavaLR.java (duplicate of JavaLinearRegressionWithSGDExample.java)
updated: mllib/JavaKMeansExample.java (merged with mllib/JavaKMeans.java)
```
## How was this patch tested?
Existing tests passed.
Author: Yong Tang <yong.tang.github@outlook.com>
Closes #12143 from yongtang/SPARK-14301.
## What changes were proposed in this pull request?
Eagerly clean up PySpark's temporary parallelize files rather than waiting for shutdown.
## How was this patch tested?
Unit tests
Author: Holden Karau <holden@us.ibm.com>
Closes #12233 from holdenk/SPARK-13687-cleanup-pyspark-temporary-files.
## What changes were proposed in this pull request?
This PR is based on #12017
Currently, this causes batches where some values are dictionary encoded and some
are not. The non-dictionary encoded values cause us to remove the dictionary
from the batch, which makes the first values return garbage.
This patch fixes the issue by first decoding the dictionary for the values that are
already dictionary encoded before switching. A similar thing is done for the reverse
case where the initial values are not dictionary encoded.
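The fix described above can be sketched with a toy column batch in Python. This is not the Spark vectorized reader: `ColumnBatch` and its methods are hypothetical, and the sketch assumes a single dictionary per column. The key step is that already-encoded ids are decoded into plain values before the dictionary is dropped.

```python
class ColumnBatch:
    """Toy batch that holds either dictionary ids or plain values."""

    def __init__(self):
        self.dictionary = None  # set while the batch is dictionary encoded
        self.ids = []           # dictionary ids, only valid with a dictionary
        self.values = []        # plain (decoded) values

    def _decode_existing(self):
        """Materialize stored ids as plain values, then drop the dictionary."""
        if self.dictionary is not None:
            self.values.extend(self.dictionary[i] for i in self.ids)
            self.ids = []
            self.dictionary = None

    def append_encoded(self, dictionary, ids):
        if self.values:
            # Reverse case: the batch already holds plain values,
            # so decode the incoming ids instead of storing them.
            self.values.extend(dictionary[i] for i in ids)
        else:
            self.dictionary = dictionary
            self.ids.extend(ids)

    def append_plain(self, values):
        # Decode first; dropping the dictionary without decoding would
        # leave the earlier ids pointing at nothing.
        self._decode_existing()
        self.values.extend(values)

    def read(self):
        if self.dictionary is not None:
            return [self.dictionary[i] for i in self.ids]
        return list(self.values)


batch = ColumnBatch()
dictionary = ["x", "y"]
batch.append_encoded(dictionary, [1, 0])  # encoded values first
batch.append_plain(["z"])                 # then a plain (non-encoded) run
batch.append_encoded(dictionary, [0])     # then encoded values again
print(batch.read())
```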
## How was this patch tested?
This is difficult to test but replicated on a test cluster using a large tpcds data set.
Author: Nong Li <nong@databricks.com>
Author: Davies Liu <davies@databricks.com>
Closes #12279 from davies/fix_dict.
## What changes were proposed in this pull request?
Currently, we use java HashMap for HashedRelation if the key could fit within a Long. The java HashMap and CompactBuffer are not memory efficient, and the memory used by them is also not accounted for accurately.
This PR introduces a LongToUnsafeRowMap (similar to BytesToBytesMap) for better memory efficiency and performance.
This PR reopens #12190 to fix bugs.
## How was this patch tested?
Existing tests.
Author: Davies Liu <davies@databricks.com>
Closes #12278 from davies/long_map3.
#### What changes were proposed in this pull request?
This PR provides native support for the DDL commands `DROP VIEW` and `DROP TABLE`. The PR includes native parsing and native analysis.
Based on the HIVE DDL document for [DROP_VIEW_WEB_LINK](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-DropView), `DROP VIEW` is defined as,
**Syntax:**
```SQL
DROP VIEW [IF EXISTS] [db_name.]view_name;
```
- to remove metadata for the specified view.
- illegal to use DROP TABLE on a view.
- illegal to use DROP VIEW on a table.
- this command only works in `HiveContext`. In `SQLContext`, we will get an exception.
This PR also handles `DROP TABLE`.
**Syntax:**
```SQL
DROP TABLE [IF EXISTS] table_name [PURGE];
```
- Previously, the `DROP TABLE` command could only drop Hive tables in `HiveContext`. Now, after this PR, this command can also drop temporary tables, external tables, and external data source tables in `SQLContext`.
- In `HiveContext`, we will not issue an exception if the to-be-dropped table does not exist and users did not specify `IF EXISTS`. Instead, we just log an error message. If `IF EXISTS` is specified, we will not issue any error message/exception.
- In `SQLContext`, we will issue an exception if the to-be-dropped table does not exist, unless `IF EXISTS` is specified.
- Data will not be deleted if the tables are `external`, unless table type is `managed_table`.
#### How was this patch tested?
For verifying command parsing, added test cases in `spark/sql/hive/HiveDDLCommandSuite.scala`
For verifying command analysis, added test cases in `spark/sql/hive/execution/HiveDDLSuite.scala`
Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
Closes #12146 from gatorsmile/dropView.
#### What changes were proposed in this pull request?
"Not good to slightly ignore all the un-supported options/clauses. We should either support it or throw an exception." A comment from yhuai in another PR https://github.com/apache/spark/pull/12146
- Can `Explain` be an exception? The `Formatted` clause is used in `HiveCompatibilitySuite`.
- Two unsupported clauses in `Drop Table` are handled in a separate PR: https://github.com/apache/spark/pull/12146
#### How was this patch tested?
Test cases are added to verify all the cases.
Author: gatorsmile <gatorsmile@gmail.com>
Closes #12255 from gatorsmile/warningToException.
## What changes were proposed in this pull request?
This fix tries to fix issues in the `describe function` command where some of the outputs
still show Hive's functions because some built-in functions are not in FunctionRegistry.
The following built-in functions have been added to FunctionRegistry:
```
-
!
*
/
&
%
^
+
<
<=
<=>
=
==
>
>=
|
~
and
in
like
not
or
rlike
when
```
The following listed functions are not added, but hard coded in `commands.scala` (hvanhovell):
```
!=
<>
between
case
```
Below are the existing results for the above functions that have not been added:
```
spark-sql> describe function `!=`;
Function: <>
Class: org.apache.hadoop.hive.ql.udf.generic.GenericUDFOPNotEqual
Usage: a <> b - Returns TRUE if a is not equal to b
```
```
spark-sql> describe function `<>`;
Function: <>
Class: org.apache.hadoop.hive.ql.udf.generic.GenericUDFOPNotEqual
Usage: a <> b - Returns TRUE if a is not equal to b
```
```
spark-sql> describe function `between`;
Function: between
Class: org.apache.hadoop.hive.ql.udf.generic.GenericUDFBetween
Usage: between a [NOT] BETWEEN b AND c - evaluate if a is [not] in between b and c
```
```
spark-sql> describe function `case`;
Function: case
Class: org.apache.hadoop.hive.ql.udf.generic.GenericUDFCase
Usage: CASE a WHEN b THEN c [WHEN d THEN e]* [ELSE f] END - When a = b, returns c; when a = d, return e; else return f
```
## How was this patch tested?
Existing tests passed. Additional test cases added.
Author: Yong Tang <yong.tang.github@outlook.com>
Closes #12128 from yongtang/SPARK-14335.
## What changes were proposed in this pull request?
Add three Python examples.
## How was this patch tested?
manual tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes #12063 from zhengruifeng/dct_pe.
## What changes were proposed in this pull request?
In order to separate the linear algebra and vector matrix classes into a standalone jar, we need to set up the build first. This PR creates a new jar called mllib-local with minimal dependencies. The test scope will still depend on spark-core and spark-core-test in order to use the common utilities, but the runtime will avoid any platform dependency. A couple of platform-independent classes will be moved to this package to demonstrate how this works.
## How was this patch tested?
Unit tests
Author: DB Tsai <dbt@netflix.com>
Closes #12241 from dbtsai/dbtsai-mllib-local-build.
## What changes were proposed in this pull request?
Minor issues. Found 2 typos while browsing the code.
## How was this patch tested?
None.
Author: bomeng <bmeng@us.ibm.com>
Closes #12264 from bomeng/SPARK-14496.
## What changes were proposed in this pull request?
CountVectorizerModel has a binary toggle param. This PR adds the binary toggle param to the estimator CountVectorizer. As discussed in the JIRA, instead of adding a param to CountVectorizer, I moved the binary param to CountVectorizerParams. Therefore, the estimator inherits the binary param.
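What a binary toggle does to term counts can be sketched in plain Python. This is not the `CountVectorizer` API: `count_vector` is a hypothetical helper, and the sketch assumes that the toggle maps every non-zero count to 1.0.

```python
from collections import Counter


def count_vector(tokens, vocabulary, binary=False):
    """Count vocabulary terms in `tokens`; with binary=True clamp counts to 1.0."""
    counts = Counter(tokens)
    # Counter returns 0 for terms that never appear
    vector = [float(counts[term]) for term in vocabulary]
    if binary:
        vector = [1.0 if count > 0 else 0.0 for count in vector]
    return vector


tokens = "a b a c a".split()
vocabulary = ["a", "b", "d"]
print(count_vector(tokens, vocabulary))
print(count_vector(tokens, vocabulary, binary=True))
```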
## How was this patch tested?
Added a new test case, which fits the model with the binary flag set to true and then checks that all of the trained model's non-zero counts are set to 1.0.
All tests in CountVectorizerSuite.scala pass.
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes #12200 from wangmiao1981/binary_param.
## What changes were proposed in this pull request?
Currently, we use java HashMap for HashedRelation if the key could fit within a Long. The java HashMap and CompactBuffer are not memory efficient, and the memory used by them is also not accounted for accurately.
This PR introduces a LongToUnsafeRowMap (similar to BytesToBytesMap) for better memory efficiency and performance.
## How was this patch tested?
Updated existing tests.
Author: Davies Liu <davies@databricks.com>
Closes #12190 from davies/long_map2.
## What changes were proposed in this pull request?
When we first introduced Aggregators, we required the user of Aggregators to (implicitly) specify the encoders. It would actually make more sense to have the encoders be specified by the implementation of Aggregators, since each implementation should have the most state about how to encode its own data type.
Note that this simplifies the Java API because Java users no longer need to explicitly specify encoders for aggregators.
## How was this patch tested?
Updated unit tests.
Author: Reynold Xin <rxin@databricks.com>
Closes #12231 from rxin/SPARK-14451.
## What changes were proposed in this pull request?
Based on our tests, gzip decompression is very slow (< 100MB/s), making queries decompression bound. Snappy can decompress at ~ 500MB/s on a single core.
This patch changes the default compression codec for Parquet output from gzip to snappy, and also introduces a ParquetOptions class to be more consistent with other data sources (e.g. CSV, JSON).
## How was this patch tested?
Should be covered by existing unit tests.
Author: Reynold Xin <rxin@databricks.com>
Closes #12256 from rxin/SPARK-14482.
## What changes were proposed in this pull request?
Cleanups to documentation. No changes to code.
* GBT docs: Move Scala doc for private object GradientBoostedTrees to public docs for GBTClassifier,Regressor
* GLM regParam: needs doc saying it is for L2 only
* TrainValidationSplitModel: add .. versionadded:: 2.0.0
* Rename “_transformer_params_from_java” to “_transfer_params_from_java”
* LogReg Summary classes: “probability” col should not say “calibrated”
* LR summaries: coefficientStandardErrors —> document that intercept stderr comes last. Same for t,p-values
* approxCountDistinct: Document meaning of “rsd” argument.
* LDA: note which params are for online LDA only
## How was this patch tested?
Doc build
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #12266 from jkbradley/ml-doc-cleanups.
## What changes were proposed in this pull request?
This patch adds support for better handling of exceptions inside catch blocks if the code within the block throws an exception. For instance here is the code in a catch block before this change in `WriterContainer.scala`:
```scala
logError("Aborting task.", cause)
// call failure callbacks first, so we could have a chance to cleanup the writer.
TaskContext.get().asInstanceOf[TaskContextImpl].markTaskFailed(cause)
if (currentWriter != null) {
currentWriter.close()
}
abortTask()
throw new SparkException("Task failed while writing rows.", cause)
```
If `markTaskFailed` or `currentWriter.close` throws an exception, we currently lose the original cause. This PR fixes this problem by implementing a utility function `Utils.tryWithSafeCatch` that suppresses (`Throwable.addSuppressed`) any exceptions thrown within the catch block and rethrows the original exception.
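The pattern can be sketched in Python. This is an illustration of the idea, not the Scala utility: `try_with_safe_catch` is a hypothetical function, and since Python has no `addSuppressed`, the sketch records secondary failures in an ad hoc `suppressed` list on the original exception.

```python
def try_with_safe_catch(block, catch_block):
    """Run `block`; on failure run `catch_block`, but always re-raise the
    original exception even if `catch_block` itself fails."""
    try:
        return block()
    except Exception as original:
        try:
            catch_block(original)
        except Exception as secondary:
            # Keep the cleanup failure for diagnostics without letting it
            # replace the original cause ("suppressed" is an ad hoc attribute).
            if not hasattr(original, "suppressed"):
                original.suppressed = []
            original.suppressed.append(secondary)
        raise original


def write_rows():
    raise ValueError("disk full")


def cleanup(cause):
    raise OSError("close failed")


try:
    try_with_safe_catch(write_rows, cleanup)
except ValueError as error:
    caught = error

print(repr(caught), caught.suppressed)
```

Without the inner `try`, the `OSError` from `cleanup` would propagate and the caller would see the cleanup failure as the primary error.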
## How was this patch tested?
No new functionality added
Author: Sameer Agarwal <sameer@databricks.com>
Closes #12234 from sameeragarwal/fix-exception.
## What changes were proposed in this pull request?
Here is why SPARK-14437 happens:
BlockManagerId is created using NettyBlockTransferService.hostName which comes from `customHostname`. And `Executor` will set `customHostname` to the hostname which is detected by the driver. However, the driver may not be able to detect the correct address in some complicated network (Netty's Channel.remoteAddress doesn't always return a connectable address). In such case, `BlockManagerId` will be created using a wrong hostname.
To fix this issue, this PR uses `hostname` provided by `SparkEnv.create` to create `NettyBlockTransferService` and set `NettyBlockTransferService.hostname` to this one directly. A bonus of this approach is that NettyBlockTransferService won't bind to `0.0.0.0`, which is much safer.
## How was this patch tested?
Manually checked the bound address using local-cluster.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes #12240 from zsxwing/SPARK-14437.
This patch upgrades Chill to 0.8.0 and Kryo to 3.0.3. While we'll likely need to bump these dependencies again before Spark 2.0 (due to SPARK-14221 / https://github.com/twitter/chill/issues/252), I wanted to get the bulk of the Kryo 2 -> Kryo 3 migration done now in order to figure out whether there are any unexpected surprises.
Author: Josh Rosen <joshrosen@databricks.com>
Closes #12076 from JoshRosen/kryo3.
## What changes were proposed in this pull request?
In this PR, two changes are proposed for ColumnVector :
1. ColumnVector should be declared as implementing AutoCloseable - it already has close() method
2. In OnHeapColumnVector#reserveInternal(), we only need to allocate new array when existing array is null or the length of existing array is shorter than the newCapacity.
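The second change can be sketched in Python as follows. This is a toy model, not the Java class: `OnHeapIntColumn` and its `allocations` counter are hypothetical and exist only to show that a new array is allocated when the existing one is missing or too short, and reused otherwise.

```python
class OnHeapIntColumn:
    """Toy column backed by a Python list standing in for an int array."""

    def __init__(self):
        self.data = None
        self.allocations = 0  # counts allocations, for illustration only

    def reserve(self, new_capacity):
        # Allocate only if there is no array yet or it is too short.
        if self.data is None or len(self.data) < new_capacity:
            new_data = [0] * new_capacity
            if self.data is not None:
                # preserve existing contents
                new_data[: len(self.data)] = self.data
            self.data = new_data
            self.allocations += 1


column = OnHeapIntColumn()
column.reserve(4)
column.data[0] = 7
column.reserve(2)   # existing array is long enough: no new allocation
column.reserve(8)   # grows and copies the old contents
print(column.allocations, len(column.data), column.data[0])
```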
## How was this patch tested?
Existing unit tests.
Author: tedyu <yuzhihong@gmail.com>
Closes #12225 from tedyu/master.
## What changes were proposed in this pull request?
In the doc of [```checkpointInterval```](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/param/shared/sharedParams.scala#L241), we told users that they can disable checkpointing by setting ```checkpointInterval = -1```. However, LDA did not actually handle this setting; this PR fixes that bug.
## How was this patch tested?
Existing tests.
cc jkbradley
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #12089 from yanboliang/spark-14298.
[archive.apache.org](https://archive.apache.org/) is undergoing maintenance, breaking our `build/mvn` script:
> We are in the process of relocating this service. To save on the immense bandwidth that this service outputs, we have put it in maintenance mode, disabling all downloads for the next few days. We expect the maintenance to be complete no later than the morning of Monday the 11th of April, 2016.
This patch fixes this issue by updating the script to use the regular mirror network to download Maven.
Author: Josh Rosen <joshrosen@databricks.com>
Closes #12262 from JoshRosen/fix-mvn-download.
## What changes were proposed in this pull request?
A new column VarianceCol has been added to DecisionTreeRegressor in ML scala code.
This patch adds the corresponding Python API, HasVarianceCol, to class DecisionTreeRegressor.
## How was this patch tested?
./dev/lint-python
PEP8 checks passed.
rm -rf _build/*
pydoc checks passed.
./python/run-tests --python-executables=python2.7 --modules=pyspark-ml
Running PySpark tests. Output is in /Users/mwang/spark_ws_0904/python/unit-tests.log
Will test against the following Python executables: ['python2.7']
Will test the following Python modules: ['pyspark-ml']
Finished test(python2.7): pyspark.ml.evaluation (12s)
Finished test(python2.7): pyspark.ml.clustering (18s)
Finished test(python2.7): pyspark.ml.classification (30s)
Finished test(python2.7): pyspark.ml.recommendation (28s)
Finished test(python2.7): pyspark.ml.feature (43s)
Finished test(python2.7): pyspark.ml.regression (31s)
Finished test(python2.7): pyspark.ml.tuning (19s)
Finished test(python2.7): pyspark.ml.tests (34s)
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes #12116 from wangmiao1981/fix_api.
## What changes were proposed in this pull request?
Support `RandomForest{Classifier, Regressor}` save/load in the Python API.
[JIRA](https://issues.apache.org/jira/browse/SPARK-14373)
## How was this patch tested?
doctest
Author: Kai Jiang <jiangkai@gmail.com>
Closes #12238 from vectorijk/spark-14373.
## What changes were proposed in this pull request?
Allows overriding the locations for downloading Apache and Typesafe artifacts in the build/mvn script.
## How was this patch tested?
By running a script like:
````
# Remove all previously downloaded artifacts
rm -rf build/apache-maven*
rm -rf build/zinc-*
rm -rf build/scala-*
# Make sure path is clean and doesn't contain mvn, for example.
...
# Run a command without setting anything and make sure it succeeds
build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 package
# Run a command setting the default location as mirror and make sure it succeeds
APACHE_MIRROR=http://mirror.infra.cloudera.com/apache/ build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 package
# Do the same without the trailing slash this time and make sure it succeeds
APACHE_MIRROR=http://mirror.infra.cloudera.com/apache build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 package
# Do it with a bad URL and make sure it fails
APACHE_MIRROR=xyz build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 package
````
Author: Mark Grover <mark@apache.org>
Closes #12250 from markgrover/spark-14477.
## What changes were proposed in this pull request?
This splits commons.httpclient.version from commons.httpcore.version, since these two versions do not necessarily have to be the same. This change may follow up with an up-to-date version of the httpclient/httpcore libraries.
The latest 4.3.x httpclient version as of writing is 4.3.6 and the latest 4.3.x httpcore version as of writing is 4.3.3. This change would be a prerequisite for potentially moving to this new bugfix version.
## How was this patch tested?
no version change was made for httpclient/httpcore versions
mvn package
Author: Aaron Tokhy <tokaaron@amazon.com>
Closes #12245 from atokhy/pull-request.
## What changes were proposed in this pull request?
Fix for the error introduced in c59abad052:
```
/Users/jacek/dev/oss/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:626: error: annotation argument needs to be a constant; found: "_FUNC_(str) - ".+("Returns str, with the first letter of each word in uppercase, all other letters in ").+("lowercase. Words are delimited by white space.")
"Returns str, with the first letter of each word in uppercase, all other letters in " +
^
```
## How was this patch tested?
Local build
Author: Jacek Laskowski <jacek@japila.pl>
Closes #12192 from jaceklaskowski/SPARK-14402-HOTFIX.
## What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-14189
When the types inferred for the same field are `IntegralType` and `DecimalType`, but the `DecimalType` cannot hold the given `IntegralType`, the JSON data source fails to find a compatible type while resolving the compatible `DataType` and falls back to `StringType`.
This can be observed when `prefersDecimal` is enabled.
```scala
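import org.apache.spark.rdd.RDD

// Assumes a `sqlContext` is in scope, as in spark-shell.
def mixedIntegerAndDoubleRecords: RDD[String] =
  sqlContext.sparkContext.parallelize(
    """{"a": 3, "b": 1.1}""" ::
    """{"a": 3.1, "b": 1}""" :: Nil)

// Keep the DataFrame in `jsonDF`; printSchema() returns Unit.
val jsonDF = sqlContext.read
  .option("prefersDecimal", "true")
  .json(mixedIntegerAndDoubleRecords)
jsonDF.printSchema()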
```
- **Before**
```
root
|-- a: string (nullable = true)
|-- b: string (nullable = true)
```
- **After**
```
root
|-- a: decimal(21, 1) (nullable = true)
|-- b: decimal(21, 1) (nullable = true)
```
(Note that integer is inferred as `LongType` which becomes `DecimalType(20, 0)`)
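The `decimal(21, 1)` result can be reproduced with a small arithmetic sketch in Python. It assumes that the widened type keeps the larger number of integer digits and the larger scale of the two inputs, and that a literal such as `1.1` is inferred as precision 2, scale 1; both are assumptions for illustration, and any upper limit on precision is not modeled. `widen_decimals` is a hypothetical helper, not the Spark implementation.

```python
def widen_decimals(left, right):
    """Smallest (precision, scale) able to hold both inputs without loss."""
    (p1, s1), (p2, s2) = left, right
    scale = max(s1, s2)
    integer_digits = max(p1 - s1, p2 - s2)
    return (integer_digits + scale, scale)


LONG_AS_DECIMAL = (20, 0)  # LongType viewed as DecimalType(20, 0), per the note
ONE_POINT_ONE = (2, 1)     # assumed type for the literal 1.1

print(widen_decimals(LONG_AS_DECIMAL, ONE_POINT_ONE))  # (21, 1)
```

Twenty integer digits from the long plus one fractional digit from the decimal gives precision 21 with scale 1, which matches the schema shown in the **After** output.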
## How was this patch tested?
Unit tests, plus style checks via `dev/run_tests`.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes #11993 from HyukjinKwon/SPARK-14189.
## What changes were proposed in this pull request?
This PR resolves a problem with parsing unescaped quotes in input data. For example, currently the data below:
```
"a"b,ccc,ddd
e,f,g
```
is parsed as follows:
- **Before**
```bash
["a"b,ccc,ddd[\n]e,f,g] <- as a value.
```
- **After**
```bash
["a"b], [ccc], [ddd]
[e], [f], [g]
```
This PR bumps up the Univocity parser's version. This was fixed in `2.0.2`, https://github.com/uniVocity/univocity-parsers/issues/60.
## How was this patch tested?
Unit tests in `CSVSuite` and `sbt/sbt scalastyle`.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes #12226 from HyukjinKwon/SPARK-14103-quote.
## What changes were proposed in this pull request?
The EMLDAOptimizer should generally not delete its last checkpoint since that can cause failures when DistributedLDAModel methods are called (if any partitions need to be recovered from the checkpoint).
This PR adds a "deleteLastCheckpoint" option which defaults to false. This is a change in behavior from Spark 1.6, in that the last checkpoint will not be removed by default.
This involves adding the deleteLastCheckpoint option to both spark.ml and spark.mllib, and modifying PeriodicCheckpointer to support the option.
This also:
* Makes MLlibTestSparkContext extend TempDirectory and set the checkpointDir to tempDir
* Updates LibSVMRelationSuite because of a name conflict with "tempDir" (and fixes a bug where it failed to delete a temp directory)
* Adds a MIMA exclude for DistributedLDAModel constructor, which is already ```private[clustering]```
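The effect of the option can be sketched with a toy checkpointer in Python. This is not `PeriodicCheckpointer`: the `ToyCheckpointer` class, its fields, and the `ckpt-N` names are hypothetical, and the sketch assumes that older checkpoints are dropped whenever a newer one is written.

```python
class ToyCheckpointer:
    """Toy periodic checkpointer that tracks which checkpoints still exist."""

    def __init__(self, interval, delete_last_checkpoint=False):
        self.interval = interval
        self.delete_last_checkpoint = delete_last_checkpoint
        self.live = []  # names of checkpoints that have not been removed

    def update(self, iteration):
        if iteration % self.interval != 0:
            return
        self.live.append("ckpt-%d" % iteration)
        # Older checkpoints are no longer needed once a newer one exists.
        del self.live[:-1]

    def finish(self):
        # With the default (False), the last checkpoint survives so that
        # lost partitions can still be recovered from it later.
        if self.delete_last_checkpoint:
            self.live.clear()
        return list(self.live)


def run(iterations, **kwargs):
    checkpointer = ToyCheckpointer(**kwargs)
    for iteration in range(1, iterations + 1):
        checkpointer.update(iteration)
    return checkpointer.finish()


print(run(10, interval=3))
print(run(10, interval=3, delete_last_checkpoint=True))
```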
## How was this patch tested?
Added 2 new unit tests to spark.ml LDASuite, which calls into spark.mllib.
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #12166 from jkbradley/emlda-save-checkpoint.
Currently all `SparkFirehoseListener` implementations are broken since we expect listeners to extend `SparkListener`, while the fire hose only extends `SparkListenerInterface`. This changes the addListener function and the config based injection to use the interface instead.
The existing tests in SparkListenerSuite are improved such that they would have caught this.
Follow-up to #12142
Author: Michael Armbrust <michael@databricks.com>
Closes #12227 from marmbrus/fixListener.
## What changes were proposed in this pull request?
`OutputCommitCoordinator` was introduced to deal with concurrent task attempts racing to write output, leading to data loss or corruption. For more detail, read the [JIRA description](https://issues.apache.org/jira/browse/SPARK-14468).
Before: `OutputCommitCoordinator` is enabled only if speculation is enabled.
After: `OutputCommitCoordinator` is always enabled.
Users may still disable this through `spark.hadoop.outputCommitCoordination.enabled`, but they really shouldn't...
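How a coordinator can arbitrate between racing attempts can be sketched in Python. This is a simplified model, not the Spark class: `CommitCoordinator` is hypothetical, and the sketch assumes a first-asker-wins policy in which only one attempt per partition is ever authorized to commit.

```python
import threading


class CommitCoordinator:
    """Toy coordinator: the first attempt to ask for a partition wins."""

    def __init__(self):
        self._lock = threading.Lock()
        self._authorized = {}  # partition -> attempt allowed to commit

    def can_commit(self, partition, attempt):
        with self._lock:
            # setdefault records this attempt only if no winner exists yet
            winner = self._authorized.setdefault(partition, attempt)
            return winner == attempt


coordinator = CommitCoordinator()
first = coordinator.can_commit(partition=0, attempt=1)
second = coordinator.can_commit(partition=0, attempt=2)
print(first, second)
```

Because the decision is made in one place under a lock, two attempts for the same partition can never both be told to commit, whether or not speculation is enabled.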
## How was this patch tested?
`OutputCommitCoordinator*Suite`
Author: Andrew Or <andrew@databricks.com>
Closes #12244 from andrewor14/always-occ.
Docs change to remove the sentence about Mesos not supporting cluster mode.
It was not.
Author: Michael Gummelt <mgummelt@mesosphere.io>
Closes #12249 from mgummelt/fix-mesos-cluster-docs.
## What changes were proposed in this pull request?
We implement typed filter with `MapPartitions`, which doesn't work well with whole stage codegen. This PR uses `Filter` to implement typed filter, so we get whole stage codegen support for free.
This PR also introduces `DeserializeToObject` and `SerializeFromObject` to separate serialization logic from object operators, so that it's easier to write optimization rules for adjacent object operators.
## How was this patch tested?
existing tests.
Author: Wenchen Fan <wenchen@databricks.com>
Closes #12061 from cloud-fan/whole-stage-codegen.
## What changes were proposed in this pull request?
This is a followup to #12117 and addresses some of the TODOs introduced there. In particular, the resolution of database is now pushed into session catalog, which knows about the current database. Further, the logic for checking whether a function exists is pushed into the external catalog.
No change in functionality is expected.
## How was this patch tested?
`SessionCatalogSuite`, `DDLSuite`
Author: Andrew Or <andrew@databricks.com>
Closes #12198 from andrewor14/function-exists.
## What changes were proposed in this pull request?
This PR brings the support of using grouping()/grouping_id() in HAVING/ORDER BY clause.
The resolved grouping()/grouping_id() will be replaced by the unresolved "spark_grouping_id" virtual attribute, then resolved by ResolveMissingAttribute.
This PR also fixes a HAVING clause that accesses a grouping column that is not present in the SELECT clause, for example:
```sql
select count(1) from (select 1 as a) t group by a having a > 0
```
## How was this patch tested?
Add new tests.
Author: Davies Liu <davies@databricks.com>
Closes #12235 from davies/grouping_having.
## What changes were proposed in this pull request?
In the DataSource#write method, the variables `dataSchema` and `equality`, and the related logic, are no longer used. Let's remove them.
## How was this patch tested?
Existing tests.
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes #12237 from sarutak/SPARK-14456.
## What changes were proposed in this pull request?
The timeouts were lower than the other timeouts in the test. Other tests were stable over the last month.
## How was this patch tested?
Jenkins tests.
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes #12219 from tdas/flaky-test-fix.
## What changes were proposed in this pull request?
Currently Spark clients are started with the same memory setting for Xms and Xmx, leading to reserving unnecessarily high amounts of memory.
This behavior is changed, and clients can now specify an initial heap size using the extraJavaOptions in the config for the driver, executor, and AM individually.
Note that only -Xms can be provided through this config option; if the client wants to set the max size (-Xmx), this has to be done via the *.memory configuration knobs, which are currently supported.
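The rule can be sketched in Python. This is only an illustration of the described behavior: `jvm_args`, the argument order, and the error message are hypothetical, not what Spark's launcher actually produces.

```python
def jvm_args(memory, extra_java_options=""):
    """Build JVM heap flags: -Xmx comes from the memory setting, while
    -Xms (and other flags) may be passed via the extra options string."""
    extra = extra_java_options.split()
    if any(option.startswith("-Xmx") for option in extra):
        # The maximum heap must be set through the memory setting instead.
        raise ValueError("set the maximum heap via the memory setting, not -Xmx")
    return ["-Xmx" + memory] + extra


print(jvm_args("4g", "-Xms512m -verbose:gc"))
```

With no `-Xms` in the extra options, the sketch emits only `-Xmx`, so the initial heap is left to the JVM default rather than being pinned to the maximum.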
## How was this patch tested?
Monitored executor and yarn logs in debug mode to verify the commands through which they are being launched in client and cluster mode. The driver memory was verified locally using jps -v. Setting the -Xmx parameter in extraJavaOptions raises an exception with the info provided.
Author: Dhruve Ashar <dhruveashar@gmail.com>
Closes #12115 from dhruve/impr/SPARK-12384.