## What changes were proposed in this pull request?
Aggregation Without Window/GroupBy expressions will fail in `checkAnalysis`, the error message is a bit misleading, we should generate a more specific error message for this case.
For example,
```
spark.read.load("/some-data")
.withColumn("date_dt", to_date($"date"))
.withColumn("year", year($"date_dt"))
.withColumn("week", weekofyear($"date_dt"))
.withColumn("user_count", count($"userId"))
.withColumn("daily_max_in_week", max($"user_count").over(weeklyWindow))
)
```
creates the following output:
```
org.apache.spark.sql.AnalysisException: expression '`randomColumn`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;
```
In the error message above, `randomColumn` doesn't appear in the query(acturally it's added by function `withColumn`), so the message is not enough for the user to address the problem.
## How was this patch tested?
Manually test
Before:
```
scala> spark.sql("select col, count(col) from tbl")
org.apache.spark.sql.AnalysisException: expression 'tbl.`col`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;;
```
After:
```
scala> spark.sql("select col, count(col) from tbl")
org.apache.spark.sql.AnalysisException: grouping expressions sequence is empty, and 'tbl.`col`' is not an aggregate function. Wrap '(count(col#231L) AS count(col)#239L)' in windowing function(s) or wrap 'tbl.`col`' in first() (or first_value) if you don't care which value you get.;;
```
Also add new test sqls in `group-by.sql`.
Author: jiangxingbo <jiangxb1987@gmail.com>
Closes#15672 from jiangxb1987/groupBy-empty.
## What changes were proposed in this pull request?
Likewise [DataSet.scala](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L156) KeyValueGroupedDataset should mark the queryExecution as transient.
As mentioned in the Jira ticket, without transient we saw serialization issues like
```
Caused by: java.io.NotSerializableException: org.apache.spark.sql.execution.QueryExecution
Serialization stack:
- object not serializable (class: org.apache.spark.sql.execution.QueryExecution, value: ==
```
## How was this patch tested?
Run the query which is specified in the Jira ticket before and after:
```
val a = spark.createDataFrame(sc.parallelize(Seq((1,2),(3,4)))).as[(Int,Int)]
val grouped = a.groupByKey(
{x:(Int,Int)=>x._1}
)
val mappedGroups = grouped.mapGroups((k,x)=>
{(k,1)}
)
val yyy = sc.broadcast(1)
val last = mappedGroups.rdd.map(xx=>
{ val simpley = yyy.value 1 }
)
```
Author: Ergin Seyfe <eseyfe@fb.com>
Closes#15706 from seyfe/keyvaluegrouped_serialization.
## What changes were proposed in this pull request?
This is a follow-up to https://github.com/apache/spark/pull/15634.
## How was this patch tested?
N/A
Author: Liwei Lin <lwlin7@gmail.com>
Closes#15712 from lw-lin/18103.
## What changes were proposed in this pull request?
1, move cast to `Predictor`
2, and then, remove unnecessary cast
## How was this patch tested?
existing tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#15414 from zhengruifeng/move_cast.
## What changes were proposed in this pull request?
Simplify struct creation, especially the aspect of `CleanupAliases` which missed some aliases when handling trees created by `CreateStruct`.
This PR includes:
1. A failing test (create struct with nested aliases, some of the aliases survive `CleanupAliases`).
2. A fix that transforms `CreateStruct` into a `CreateNamedStruct` constructor, effectively eliminating `CreateStruct` from all expression trees.
3. A `NamePlaceHolder` used by `CreateStruct` when column names cannot be extracted from unresolved `NamedExpression`.
4. A new Analyzer rule that resolves `NamePlaceHolder` into a string literal once the `NamedExpression` is resolved.
5. `CleanupAliases` code was simplified as it no longer has to deal with `CreateStruct`'s top level columns.
## How was this patch tested?
running all tests-suits in package org.apache.spark.sql, especially including the analysis suite, making sure added test initially fails, after applying suggested fix rerun the entire analysis package successfully.
modified few tests that expected `CreateStruct` which is now transformed into `CreateNamedStruct`.
Credit goes to hvanhovell for assisting with this PR.
Author: eyal farago <eyal farago>
Author: eyal farago <eyal.farago@gmail.com>
Author: Herman van Hovell <hvanhovell@databricks.com>
Author: Eyal Farago <eyal.farago@actimize.com>
Author: Hyukjin Kwon <gurwls223@gmail.com>
Author: eyalfa <eyal.farago@gmail.com>
Closes#14444 from eyalfa/SPARK-16839_redundant_aliases_after_cleanupAliases.
## What changes were proposed in this pull request?
Currently an unqualified `getFunction(..)`call returns a wrong result; the returned function is shown as temporary function without a database. For example:
```
scala> sql("create function fn1 as 'org.apache.hadoop.hive.ql.udf.generic.GenericUDFAbs'")
res0: org.apache.spark.sql.DataFrame = []
scala> spark.catalog.getFunction("fn1")
res1: org.apache.spark.sql.catalog.Function = Function[name='fn1', className='org.apache.hadoop.hive.ql.udf.generic.GenericUDFAbs', isTemporary='true']
```
This PR fixes this by adding database information to ExpressionInfo (which is used to store the function information).
## How was this patch tested?
Added more thorough tests to `CatalogSuite`.
Author: Herman van Hovell <hvanhovell@databricks.com>
Closes#15542 from hvanhovell/SPARK-17996.
## What changes were proposed in this pull request?
Enclose --conf option value with "" to support multi value configs like spark.driver.extraJavaOptions, without "", driver will fail to start.
## How was this patch tested?
Jenkins Tests.
Test in our production environment, also unit tests, It is a very small change.
Author: Wang Lei <lei.wang@kongming-inc.com>
Closes#15643 from LeightonWong/messos-cluster.
## What changes were proposed in this pull request?
Migrate Mesos configs to use ConfigEntry
## How was this patch tested?
Jenkins Tests
Author: Sandeep Singh <sandeep@techaddict.me>
Closes#15654 from techaddict/SPARK-16881.
Mesos 0.23.0 introduces a Fetch Cache feature http://mesos.apache.org/documentation/latest/fetcher/ which allows caching of resources specified in command URIs.
This patch:
- Updates the Mesos shaded protobuf dependency to 0.23.0
- Allows setting `spark.mesos.fetcherCache.enable` to enable the fetch cache for all specified URIs. (URIs must be specified for the setting to have any affect)
- Updates documentation for Mesos configuration with the new setting.
This patch does NOT:
- Allow for per-URI caching configuration. The cache setting is global to ALL URIs for the command.
Author: Charles Allen <charles@allen-net.com>
Closes#13713 from drcrallen/SPARK15994.
## What changes were proposed in this pull request?
When multiple records have the minimum value, the answer of ApproximatePercentile is wrong.
## How was this patch tested?
add a test case
Author: wangzhenhua <wangzhenhua@huawei.com>
Closes#15641 from wzhfy/percentile.
## What changes were proposed in this pull request?
This PR merges multiple lines enumerating items in order to remove the redundant spaces following slashes in [Structured Streaming Programming Guide in 2.0.2-rc1](http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc1-docs/structured-streaming-programming-guide.html).
- Before: `Scala/ Java/ Python`
- After: `Scala/Java/Python`
## How was this patch tested?
Manual by the followings because this is documentation update.
```
cd docs
SKIP_API=1 jekyll build
```
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#15686 from dongjoon-hyun/minor_doc_space.
## What changes were proposed in this pull request?
As reported on the jira, insert overwrite statement runs much slower in Spark, compared with hive-client.
It seems there is a patch [HIVE-11940](ba21806b77) which largely improves insert overwrite performance on Hive. HIVE-11940 is patched after Hive 2.0.0.
Because Spark SQL uses older Hive library, we can not benefit from such improvement.
The reporter verified that there is also a big performance gap between Hive 1.2.1 (520.037 secs) and Hive 2.0.1 (35.975 secs) on insert overwrite execution.
Instead of upgrading to Hive 2.0 in Spark SQL, which might not be a trivial task, this patch provides an approach to delete the partition before asking Hive to load data files into the partition.
Note: The case reported on the jira is insert overwrite to partition. Since `Hive.loadTable` also uses the function to replace files, insert overwrite to table should has the same issue. We can take the same approach to delete the table first. I will upgrade this to include this.
## How was this patch tested?
Jenkins tests.
There are existing tests using insert overwrite statement. Those tests should be passed. I added a new test to specially test insert overwrite into partition.
For performance issue, as I don't have Hive 2.0 environment, this needs the reporter to verify it. Please refer to the jira.
Please review https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark before opening a pull request.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#15667 from viirya/improve-hive-insertoverwrite.
## What changes were proposed in this pull request?
This patch introduces an internal commit protocol API that is used by the batch data source to do write commits. It currently has only one implementation that uses Hadoop MapReduce's OutputCommitter API. In the future, this commit API can be used to unify streaming and batch commits.
## How was this patch tested?
Should be covered by existing write tests.
Author: Reynold Xin <rxin@databricks.com>
Author: Eric Liang <ekl@databricks.com>
Closes#15707 from rxin/SPARK-18024-2.
## What changes were proposed in this pull request?
This will re-run the flaky test a few times after it fails. This will help determine if it's due to nondeterministic test setup, or because of some environment issue (e.g. leaked config from another test).
cc yhuai
Author: Eric Liang <ekl@databricks.com>
Closes#15708 from ericl/spark-18167-3.
## What changes were proposed in this pull request?
When inserting into datasource tables with partitions managed by the hive metastore, we need to notify the metastore of newly added partitions. Previously this was implemented via `msck repair table`, but this is more expensive than needed.
This optimizes the insertion path to add only the updated partitions.
## How was this patch tested?
Existing tests (I verified manually that tests fail if the repair operation is omitted).
Author: Eric Liang <ekl@databricks.com>
Closes#15633 from ericl/spark-18087.
## What changes were proposed in this pull request?
One possibility for this test flaking is that we have corrupted the partition schema somehow in the tests, which causes the cast to decimal to fail in the call. This should at least show us the actual partition values.
## How was this patch tested?
Run it locally, it prints out something like `ArrayBuffer(test(partcol=0), test(partcol=1), test(partcol=2), test(partcol=3), test(partcol=4))`.
Author: Eric Liang <ekl@databricks.com>
Closes#15701 from ericl/print-more-info.
## What changes were proposed in this pull request?
The test `when schema inference is turned on, should read partition data` should not delete files because the source maybe is listing files. This PR just removes the delete actions since they are not necessary.
## How was this patch tested?
Jenkins
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#15699 from zsxwing/SPARK-18030.
## What changes were proposed in this pull request?
### Problem
Iterative ML code may easily create query plans that grow exponentially. We found that query planning time also increases exponentially even when all the sub-plan trees are cached.
The following snippet illustrates the problem:
``` scala
(0 until 6).foldLeft(Seq(1, 2, 3).toDS) { (plan, iteration) =>
println(s"== Iteration $iteration ==")
val time0 = System.currentTimeMillis()
val joined = plan.join(plan, "value").join(plan, "value").join(plan, "value").join(plan, "value")
joined.cache()
println(s"Query planning takes ${System.currentTimeMillis() - time0} ms")
joined.as[Int]
}
// == Iteration 0 ==
// Query planning takes 9 ms
// == Iteration 1 ==
// Query planning takes 26 ms
// == Iteration 2 ==
// Query planning takes 53 ms
// == Iteration 3 ==
// Query planning takes 163 ms
// == Iteration 4 ==
// Query planning takes 700 ms
// == Iteration 5 ==
// Query planning takes 3418 ms
```
This is because when building a new Dataset, the new plan is always built upon `QueryExecution.analyzed`, which doesn't leverage existing cached plans.
On the other hand, usually, doing caching every a few iterations may not be the right direction for this problem since caching is too memory consuming (imaging computing connected components over a graph with 50 billion nodes). What we really need here is to truncate both the query plan (to minimize query planning time) and the lineage of the underlying RDD (to avoid stack overflow).
### Changes introduced in this PR
This PR tries to fix this issue by introducing a `checkpoint()` method into `Dataset[T]`, which does exactly the things described above. The following snippet, which is essentially the same as the one above but invokes `checkpoint()` instead of `cache()`, shows the micro benchmark result of this PR:
One key point is that the checkpointed Dataset should preserve the origianl partitioning and ordering information of the original Dataset, so that we can avoid unnecessary shuffling (similar to reading from a pre-bucketed table). This is done by adding `outputPartitioning` and `outputOrdering` to `LogicalRDD` and `RDDScanExec`.
### Micro benchmark
``` scala
spark.sparkContext.setCheckpointDir("/tmp/cp")
(0 until 100).foldLeft(Seq(1, 2, 3).toDS) { (plan, iteration) =>
println(s"== Iteration $iteration ==")
val time0 = System.currentTimeMillis()
val cp = plan.checkpoint()
cp.count()
System.out.println(s"Checkpointing takes ${System.currentTimeMillis() - time0} ms")
val time1 = System.currentTimeMillis()
val joined = cp.join(cp, "value").join(cp, "value").join(cp, "value").join(cp, "value")
val result = joined.as[Int]
println(s"Query planning takes ${System.currentTimeMillis() - time1} ms")
result
}
// == Iteration 0 ==
// Checkpointing takes 591 ms
// Query planning takes 13 ms
// == Iteration 1 ==
// Checkpointing takes 1605 ms
// Query planning takes 16 ms
// == Iteration 2 ==
// Checkpointing takes 782 ms
// Query planning takes 8 ms
// == Iteration 3 ==
// Checkpointing takes 729 ms
// Query planning takes 10 ms
// == Iteration 4 ==
// Checkpointing takes 734 ms
// Query planning takes 9 ms
// == Iteration 5 ==
// ...
// == Iteration 50 ==
// Checkpointing takes 571 ms
// Query planning takes 7 ms
// == Iteration 51 ==
// Checkpointing takes 548 ms
// Query planning takes 7 ms
// == Iteration 52 ==
// Checkpointing takes 596 ms
// Query planning takes 8 ms
// == Iteration 53 ==
// Checkpointing takes 568 ms
// Query planning takes 7 ms
// ...
```
You may see that although checkpointing is more heavy weight an operation, it always takes roughly the same amount of time to perform both checkpointing and query planning.
### Open question
mengxr mentioned that it would be more convenient if we can make `Dataset.checkpoint()` eager, i.e., always performs a `RDD.count()` after calling `RDD.checkpoint()`. Not quite sure whether this is a universal requirement. Maybe we can add a `eager: Boolean` argument for `Dataset.checkpoint()` to support that.
## How was this patch tested?
Unit test added in `DatasetSuite`.
Author: Cheng Lian <lian@databricks.com>
Author: Yin Huai <yhuai@databricks.com>
Closes#15651 from liancheng/ds-checkpoint.
## What changes were proposed in this pull request?
Because of the refactoring work in Structured Streaming, the event logs generated by Strucutred Streaming in Spark 2.0.0 and 2.0.1 cannot be parsed.
This PR just ignores these logs in ReplayListenerBus because no places use them.
## How was this patch tested?
- Generated events logs using Spark 2.0.0 and 2.0.1, and saved them as `structured-streaming-query-event-logs-2.0.0.txt` and `structured-streaming-query-event-logs-2.0.1.txt`
- The new added test makes sure ReplayListenerBus will skip these bad jsons.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#15663 from zsxwing/fix-event-log.
## What changes were proposed in this pull request?
Add subsmaplingRate to randomForestClassifier
Add varianceCol to randomForestRegressor
In Python
## How was this patch tested?
manual tests
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#15638 from felixcheung/pyrandomforest.
## What changes were proposed in this pull request?
Random Forest Regression and Classification for R
Clean-up/reordering generics.R
## How was this patch tested?
manual tests, unit tests
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#15607 from felixcheung/rrandomforest.
## What changes were proposed in this pull request?
This patch makes RBackend connection timeout configurable by user.
## How was this patch tested?
N/A
Author: Hossein <hossein@databricks.com>
Closes#15471 from falaki/SPARK-17919.
## What changes were proposed in this pull request?
Currently, `ANALYZE TABLE` command accepts `identifier` for option `NOSCAN`. This PR raises a ParseException for unknown option.
**Before**
```scala
scala> sql("create table test(a int)")
res0: org.apache.spark.sql.DataFrame = []
scala> sql("analyze table test compute statistics blah")
res1: org.apache.spark.sql.DataFrame = []
```
**After**
```scala
scala> sql("create table test(a int)")
res0: org.apache.spark.sql.DataFrame = []
scala> sql("analyze table test compute statistics blah")
org.apache.spark.sql.catalyst.parser.ParseException:
Expected `NOSCAN` instead of `blah`(line 1, pos 0)
```
## How was this patch tested?
Pass the Jenkins test with a new test case.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#15640 from dongjoon-hyun/SPARK-18106.
## What changes were proposed in this pull request?
To reduce the number of components in SQL named *Catalog, rename *FileCatalog to *FileIndex. A FileIndex is responsible for returning the list of partitions / files to scan given a filtering expression.
```
TableFileCatalog => CatalogFileIndex
FileCatalog => FileIndex
ListingFileCatalog => InMemoryFileIndex
MetadataLogFileCatalog => MetadataLogFileIndex
PrunedTableFileCatalog => PrunedInMemoryFileIndex
```
cc yhuai marmbrus
## How was this patch tested?
N/A
Author: Eric Liang <ekl@databricks.com>
Author: Eric Liang <ekhliang@gmail.com>
Closes#15634 from ericl/rename-file-provider.
## What changes were proposed in this pull request?
The behavior of union is not well defined here. It is safer to explicitly execute these commands in order. The other use of `Union` in this way will be removed by https://github.com/apache/spark/pull/15633
## How was this patch tested?
Existing tests.
cc yhuai cloud-fan
Author: Eric Liang <ekhliang@gmail.com>
Author: Eric Liang <ekl@databricks.com>
Closes#15665 from ericl/spark-18146.
## What changes were proposed in this pull request?
Return potentially fewer than k cluster centers in cases where k distinct centroids aren't available or aren't selected.
## How was this patch tested?
Existing tests
Author: Sean Owen <sowen@cloudera.com>
Closes#15450 from srowen/SPARK-3261.
## What changes were proposed in this pull request?
org.apache.spark.sql.hive.execution.SQLQuerySuite is flaking when hive partition pruning is enabled.
Based on the stack traces, it seems to be an old issue where Hive fails to cast a numeric partition column ("Invalid character string format for type DECIMAL"). There are two possibilities here: either we are somehow corrupting the partition table to have non-decimal values in that column, or there is a transient issue with Derby.
This PR logs the result of the retry when this exception is encountered, so we can confirm what is going on.
## How was this patch tested?
n/a
cc yhuai
Author: Eric Liang <ekl@databricks.com>
Closes#15676 from ericl/spark-18167.
## What changes were proposed in this pull request?
Fixed the issue that ForeachSink didn't rethrow the exception.
## How was this patch tested?
The fixed unit test.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#15674 from zsxwing/foreach-sink-error.
## What changes were proposed in this pull request?
Implement Locality Sensitive Hashing along with approximate nearest neighbors and approximate similarity join based on the [design doc](https://docs.google.com/document/d/1D15DTDMF_UWTTyWqXfG7y76iZalky4QmifUYQ6lH5GM/edit).
Detailed changes are as follows:
(1) Implement abstract LSH, LSHModel classes as Estimator-Model
(2) Implement approxNearestNeighbors and approxSimilarityJoin in the abstract LSHModel
(3) Implement Random Projection as LSH subclass for Euclidean distance, Min Hash for Jaccard Distance
(4) Implement unit test utility methods including checkLshProperty, checkNearestNeighbor and checkSimilarityJoin
Things that will be implemented in a follow-up PR:
- Bit Sampling for Hamming Distance, SignRandomProjection for Cosine Distance
- PySpark Integration for the scala classes and methods.
## How was this patch tested?
Unit test is implemented for all the implemented classes and algorithms. A scalability test on Uber's dataset was performed internally.
Tested the methods on [WEX dataset](https://aws.amazon.com/items/2345) from AWS, with the steps and results [here](https://docs.google.com/document/d/19BXg-67U83NVB3M0I84HVBVg3baAVaESD_mrg_-vLro/edit).
## References
Gionis, Aristides, Piotr Indyk, and Rajeev Motwani. "Similarity search in high dimensions via hashing." VLDB 7 Sep. 1999: 518-529.
Wang, Jingdong et al. "Hashing for similarity search: A survey." arXiv preprint arXiv:1408.2927 (2014).
Author: Yunni <Euler57721@gmail.com>
Author: Yun Ni <yunn@uber.com>
Closes#15148 from Yunni/SPARK-5992-yunn-lsh.
## What changes were proposed in this pull request?
In Python 3, there is only one integer type (i.e., int), which mostly behaves like the long type in Python 2. Since Python 3 won't accept "L", so removed "L" in all examples.
## How was this patch tested?
Unit tests.
…rrors]
Author: Jagadeesan <as2@us.ibm.com>
Closes#15660 from jagadeesanas2/SPARK-18133.
## What changes were proposed in this pull request?
Add instrumentation to GMM
## How was this patch tested?
Test in spark-shell
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#15636 from zhengruifeng/gmm_instr.
## What changes were proposed in this pull request?
Issue:
Querying on a global temp view throws Table or view not found exception.
Fix:
Update the lookupRelation in HiveSessionCatalog to check for global temp views similar to the SessionCatalog.lookupRelation.
Before fix:
Querying on a global temp view ( for. e.g.: select * from global_temp.v1) throws Table or view not found exception
After fix:
Query succeeds and returns the right result.
## How was this patch tested?
- Two unit tests are added to check for global temp view for the code path when hive support is enabled.
- Regression unit tests were run successfully. ( build/sbt -Phive hive/test, build/sbt sql/test, build/sbt catalyst/test)
Author: Sunitha Kambhampati <skambha@us.ibm.com>
Closes#15649 from skambha/lookuprelationChanges.
## What changes were proposed in this pull request?
We should follow hive table and also store partition spec in metastore for data source table.
This brings 2 benefits:
1. It's more flexible to manage the table data files, as users can use `ADD PARTITION`, `DROP PARTITION` and `RENAME PARTITION`
2. We don't need to cache all file status for data source table anymore.
## How was this patch tested?
existing tests.
Author: Eric Liang <ekl@databricks.com>
Author: Michael Allman <michael@videoamp.com>
Author: Eric Liang <ekhliang@gmail.com>
Author: Wenchen Fan <wenchen@databricks.com>
Closes#15515 from cloud-fan/partition.
## What changes were proposed in this pull request?
A follow up PR for #14553 to fix the flaky test. It's flaky because the file list API doesn't guarantee any order of the return list.
## How was this patch tested?
Jenkins
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#15661 from zsxwing/fix-StreamingQuerySuite.
## What changes were proposed in this pull request?
This PR is an enhancement of PR with commit ID:57dc326bd00cf0a49da971e9c573c48ae28acaa2.
NaN is a special type of value which is commonly seen as invalid. But We find that there are certain cases where NaN are also valuable, thus need special handling. We provided user when dealing NaN values with 3 options, to either reserve an extra bucket for NaN values, or remove the NaN values, or report an error, by setting handleNaN "keep", "skip", or "error"(default) respectively.
'''Before:
val bucketizer: Bucketizer = new Bucketizer()
.setInputCol("feature")
.setOutputCol("result")
.setSplits(splits)
'''After:
val bucketizer: Bucketizer = new Bucketizer()
.setInputCol("feature")
.setOutputCol("result")
.setSplits(splits)
.setHandleNaN("keep")
## How was this patch tested?
Tests added in QuantileDiscretizerSuite, BucketizerSuite and DataFrameStatSuite
Signed-off-by: VinceShieh <vincent.xieintel.com>
Author: VinceShieh <vincent.xie@intel.com>
Author: Vincent Xie <vincent.xie@intel.com>
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#15428 from VinceShieh/spark-17219_followup.
## What changes were proposed in this pull request?
maxOffsetsPerTrigger option for rate limiting, proportionally based on volume of different topicpartitions.
## How was this patch tested?
Added unit test
Author: cody koeninger <cody@koeninger.org>
Closes#15527 from koeninger/SPARK-17813.
## What changes were proposed in this pull request?
While learning core scheduler code, I found two lines of wrong comments. This PR simply corrects the comments.
## How was this patch tested?
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes#15631 from wangmiao1981/Rbug.
## What changes were proposed in this pull request?
API and programming guide doc changes for Scala, Python and R.
## How was this patch tested?
manual test
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#15629 from felixcheung/jsondoc.
## What changes were proposed in this pull request?
a couple of small late finding fixes for doc
## How was this patch tested?
manually
wangmiao1981
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#15650 from felixcheung/logitfix.
## What changes were proposed in this pull request?
A short code snippet that uses toLocalIterator() on a dataframe produced by a RunnableCommand
reproduces the problem. toLocalIterator() is called by thriftserver when
`spark.sql.thriftServer.incrementalCollect`is set to handle queries producing large result
set.
**Before**
```SQL
scala> spark.sql("show databases")
res0: org.apache.spark.sql.DataFrame = [databaseName: string]
scala> res0.toLocalIterator()
16/10/26 03:00:24 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast to org.apache.spark.sql.catalyst.expressions.UnsafeRow
```
**After**
```SQL
scala> spark.sql("drop database databases")
res30: org.apache.spark.sql.DataFrame = []
scala> spark.sql("show databases")
res31: org.apache.spark.sql.DataFrame = [databaseName: string]
scala> res31.toLocalIterator().asScala foreach println
[default]
[parquet]
```
## How was this patch tested?
Added a test in DDLSuite
Author: Dilip Biswal <dbiswal@us.ibm.com>
Closes#15642 from dilipbiswal/SPARK-18009.
## What changes were proposed in this pull request?
In order to facilitate the writing of additional Encoders, I proposed opening up the ObjectType SQL DataType. This DataType is used extensively in the JavaBean Encoder, but would also be useful in writing other custom encoders.
As mentioned by marmbrus, it is understood that the Expressions API is subject to potential change.
## How was this patch tested?
The change only affects the visibility of the ObjectType class, and the existing SQL test suite still runs without error.
Author: ALeksander Eskilson <alek.eskilson@cerner.com>
Closes#15453 from bdrillard/master.
## What changes were proposed in this pull request?
This PR contains changes to the Source trait such that the scheduler can notify data sources when it is safe to discard buffered data. Summary of changes:
* Added a method `commit(end: Offset)` that tells the Source that is OK to discard all offsets up `end`, inclusive.
* Changed the semantics of a `None` value for the `getBatch` method to mean "from the very beginning of the stream"; as opposed to "all data present in the Source's buffer".
* Added notes that the upper layers of the system will never call `getBatch` with a start value less than the last value passed to `commit`.
* Added a `lastCommittedOffset` method to allow the scheduler to query the status of each Source on restart. This addition is not strictly necessary, but it seemed like a good idea -- Sources will be maintaining their own persistent state, and there may be bugs in the checkpointing code.
* The scheduler in `StreamExecution.scala` now calls `commit` on its stream sources after marking each batch as complete in its checkpoint.
* `MemoryStream` now cleans committed batches out of its internal buffer.
* `TextSocketSource` now cleans committed batches from its internal buffer.
## How was this patch tested?
Existing regression tests already exercise the new code.
Author: frreiss <frreiss@us.ibm.com>
Closes#14553 from frreiss/fred-16963.
## What changes were proposed in this pull request?
(Please fill in changes proposed in this fix)
`Utils.getIteratorZipWithIndex` was added to deal with number of records > 2147483647 in one partition.
method `getIteratorZipWithIndex` accepts `startIndex` < 0, which leads to negative index.
This PR just adds a defensive check on `startIndex` to make sure it is >= 0.
## How was this patch tested?
Add a new unit test.
Author: Miao Wang <miaowang@Miaos-MacBook-Pro.local>
Closes#15639 from wangmiao1981/zip.
## What changes were proposed in this pull request?
As we discussed in #14818, I added a separate R wrapper spark.logit for logistic regression.
This single interface supports both binary and multinomial logistic regression. It also has "predict" and "summary" for binary logistic regression.
## How was this patch tested?
New unit tests are added.
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes#15365 from wangmiao1981/glm.
## What changes were proposed in this pull request?
Currently we have several test cases for group analytics(ROLLUP/CUBE/GROUPING SETS) in `SQLQuerySuite`, should better move them into a query file test.
The following test cases are moved to `group-analytics.sql`:
```
test("rollup")
test("grouping sets when aggregate functions containing groupBy columns")
test("cube")
test("grouping sets")
test("grouping and grouping_id")
test("grouping and grouping_id in having")
test("grouping and grouping_id in sort")
```
This is followup work of #15582
## How was this patch tested?
Modified query file `group-analytics.sql`, which will be tested by `SQLQueryTestSuite`.
Author: jiangxingbo <jiangxb1987@gmail.com>
Closes#15624 from jiangxb1987/group-analytics-test.
## What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-14300
Duplicated code found in scala/examples/mllib, below all deleted in this PR:
- DenseGaussianMixture.scala
- StreamingLinearRegression.scala
## delete reasons:
#### delete: mllib/DenseGaussianMixture.scala
- duplicate of mllib/GaussianMixtureExample
#### delete: mllib/StreamingLinearRegression.scala
- duplicate of mllib/StreamingLinearRegressionExample
When merging and cleaning those code, be sure not disturb the previous example on and off blocks.
## How was this patch tested?
Test with `SKIP_API=1 jekyll` manually to make sure that works well.
Author: Xin Ren <iamshrek@126.com>
Closes#12195 from keypointt/SPARK-14300.