## What changes were proposed in this pull request?
Currently, for the key that can not fit within a long, we build a hash map for UnsafeHashedRelation, it's converted to BytesToBytesMap after serialization and deserialization. We should build a BytesToBytesMap directly to have better memory efficiency.
In order to do that, BytesToBytesMap should support multiple (K,V) pair with the same K, Location.putNewKey() is renamed to Location.append(), which could append multiple values for the same key (same Location). `Location.newValue()` is added to find the next value for the same key.
## How was this patch tested?
Existing tests. Added benchmark for broadcast hash join with duplicated keys.
Author: Davies Liu <davies@databricks.com>
Closes#11870 from davies/map2.
### What changes were proposed in this pull request?
The current ANTLR3 parser is quite complex to maintain and suffers from code blow-ups. This PR introduces a new parser that is based on ANTLR4.
This parser is based on the [Presto's SQL parser](https://github.com/facebook/presto/blob/master/presto-parser/src/main/antlr4/com/facebook/presto/sql/parser/SqlBase.g4). The current implementation can parse and create Catalyst and SQL plans. Large parts of the HiveQl DDL and some of the DML functionality is currently missing, the plan is to add this in follow-up PRs.
This PR is a work in progress, and work needs to be done in the following area's:
- [x] Error handling should be improved.
- [x] Documentation should be improved.
- [x] Multi-Insert needs to be tested.
- [ ] Naming and package locations.
### How was this patch tested?
Catalyst and SQL unit tests.
Author: Herman van Hovell <hvanhovell@questtec.nl>
Closes#11557 from hvanhovell/ngParser.
## What changes were proposed in this pull request?
JIRA: https://issues.apache.org/jira/browse/SPARK-14156
In HiveComparisonTest, when catalyst results are different to hive results, we will collect the messages for computed tables during the test. During creating the message, we use sparkPlan. But we actually run the query with executedPlan. So the error message is sometimes confusing.
For example, as wholestage codegen is enabled by default now. The shown spark plan for computed tables is the plan before wholestage codegen.
A concrete is the following error message shown before this patch. It is the error shown when running `HiveCompatibilityTest` `auto_join26`.
auto_join26 has one SQL to create table:
INSERT OVERWRITE TABLE dest_j1
SELECT x.key, count(1) FROM src1 x JOIN src y ON (x.key = y.key) group by x.key; (1)
Then a SQL to retrieve the result:
select * from dest_j1 x order by x.key; (2)
When the above SQL (2) to retrieve the result fails, In `HiveComparisonTest` we will try to collect and show the generated data from table `dest_j1` using the SQL (1)'s spark plan. The you will see this error:
TungstenAggregate(key=[key#8804], functions=[(count(1),mode=Partial,isDistinct=false)], output=[key#8804,count#8834L])
+- Project [key#8804]
+- BroadcastHashJoin [key#8804], [key#8806], Inner, BuildRight, None
:- Filter isnotnull(key#8804)
: +- InMemoryColumnarTableScan [key#8804], [isnotnull(key#8804)], InMemoryRelation [key#8804,value#8805], true, 5, StorageLevel(true, true, false, true, 1), HiveTableScan [key#8717,value#8718], MetastoreRelation default, src1, None, Some(src1)
+- Filter isnotnull(key#8806)
+- InMemoryColumnarTableScan [key#8806], [isnotnull(key#8806)], InMemoryRelation [key#8806,value#8807], true, 5, StorageLevel(true, true, false, true, 1), HiveTableScan [key#8760,value#8761], MetastoreRelation default, src, None, Some(src)
at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47)
at org.apache.spark.sql.execution.aggregate.TungstenAggregate.doExecute(TungstenAggregate.scala:82)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:121)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:121)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:140)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:137)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:120)
at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.apply(TungstenAggregate.scala:87)
at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.apply(TungstenAggregate.scala:82)
at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:46)
... 70 more
Caused by: java.lang.UnsupportedOperationException: Filter does not implement doExecuteBroadcast
at org.apache.spark.sql.execution.SparkPlan.doExecuteBroadcast(SparkPlan.scala:221)
The message is confusing because it is not the plan actually run by SparkSQL engine to create the generated table. The plan actually run is no problem. But as before this patch, we run `e.sparkPlan.collect` to retrieve and show the generated data, spark plan is not the plan we can run. So the above error will be shown.
After this patch, we won't see the error because the executed plan is no problem and works.
## How was this patch tested?
Existing tests.
Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Closes#11957 from viirya/use-executedplan.
Refactor RRDD by separating the common logic interacting with the R worker to a new class RRunner, which can be used to evaluate R UDFs.
Now RRDD relies on RRuner for RDD computation and RRDD could be reomved if we want to remove RDD API in SparkR later.
Author: Sun Rui <rui.sun@intel.com>
Closes#10947 from sun-rui/SPARK-12792.
JIRA: https://issues.apache.org/jira/browse/SPARK-13742
## What changes were proposed in this pull request?
`RandomSampler.sample` currently accepts iterator as input and output another iterator. This makes it inappropriate to use in wholestage codegen of `Sampler` operator #11517. This change is to add non-iterator interface to `RandomSampler`.
This change adds a new method `def sample(): Int` to the trait `RandomSampler`. As we don't need to know the actual values of the sampling items, so this new method takes no arguments.
This method will decide whether to sample the next item or not. It returns how many times the next item will be sampled.
For `BernoulliSampler` and `BernoulliCellSampler`, the returned sampling times can only be 0 or 1. It simply means whether to sample the next item or not.
For `PoissonSampler`, the returned value can be more than 1, meaning the next item will be sampled multiple times.
## How was this patch tested?
Tests are added into `RandomSamplerSuite`.
Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Author: Liang-Chi Hsieh <viirya@appier.com>
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#11578 from viirya/random-sampler-no-iterator.
## What changes were proposed in this pull request?
Fix incorrect use of binarySearch in SparseMatrix
## How was this patch tested?
Unit test added.
Author: Chenliang Xu <chexu@groupon.com>
Closes#11992 from luckyrandom/SPARK-14187.
## What changes were proposed in this pull request?
Spark Shell provides an easy way to use Spark in Scala environment. This PR adds `reset` command to a blocked list, also cleaned up according to the Scala coding style.
```scala
scala> sc
res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext718fad24
scala> :reset
scala> sc
<console>:11: error: not found: value sc
sc
^
```
If we blocks `reset`, Spark Shell works like the followings.
```scala
scala> :reset
reset: no such command. Type :help for help.
scala> :re
re is ambiguous: did you mean :replay or :require?
```
## How was this patch tested?
Manual. Run `bin/spark-shell` and type `:reset`.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#11920 from dongjoon-hyun/SPARK-14102.
## What changes were proposed in this pull request?
Better error message with k-means init can't be enough samples from input (because it is perhaps empty)
## How was this patch tested?
Jenkins tests.
Author: Sean Owen <sowen@cloudera.com>
Closes#11979 from srowen/SPARK-12494.
## What changes were proposed in this pull request?
The indentation of debug log output by `CodeGenerator` is weird.
The first line of the generated code should be put on the next line of the first line of the log message.
```
16/03/28 11:10:24 DEBUG CodeGenerator: /* 001 */
/* 002 */ public java.lang.Object generate(Object[] references) {
/* 003 */ return new SpecificSafeProjection(references);
...
```
After this patch is applied, we get debug log like as follows.
```
16/03/28 10:45:50 DEBUG CodeGenerator:
/* 001 */
/* 002 */ public java.lang.Object generate(Object[] references) {
/* 003 */ return new SpecificSafeProjection(references);
...
```
## How was this patch tested?
Ran some jobs and checked debug logs.
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes#11990 from sarutak/fix-debuglog-indentation.
## What changes were proposed in this pull request?
Made evaluate method public. Fixed LogisticRegressionModel evaluate to handle case when probabilityCol is not specified.
## How was this patch tested?
There were already unit tests for these methods.
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#11928 from jkbradley/public-evaluate.
## What changes were proposed in this pull request?
This PR fixes the following line and the related code. Historically, this code was added in [SPARK-5597](https://issues.apache.org/jira/browse/SPARK-5597). After [SPARK-5597](https://issues.apache.org/jira/browse/SPARK-5597) was committed, [SPARK-3365](https://issues.apache.org/jira/browse/SPARK-3365) is fixed now. Now, we had better remove the comment without changing persistent code.
```scala
- categories: Seq[Double]) { // TODO: Change to List once SPARK-3365 is fixed
+ categories: Seq[Double]) {
```
## How was this patch tested?
Pass the Jenkins tests.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#11966 from dongjoon-hyun/change_categories_type.
## What changes were proposed in this pull request?
This PR fixes the following two testcases in order to test the correct usages.
```
checkSqlGeneration("SELECT substr('This is a test', 'is')")
checkSqlGeneration("SELECT substring('This is a test', 'is')")
```
Actually, the testcases works but tests on exceptional cases.
```
scala> sql("SELECT substr('This is a test', 'is')")
res0: org.apache.spark.sql.DataFrame = [substring(This is a test, CAST(is AS INT), 2147483647): string]
scala> sql("SELECT substr('This is a test', 'is')").collect()
res1: Array[org.apache.spark.sql.Row] = Array([null])
```
## How was this patch tested?
Pass the modified unit tests.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#11963 from dongjoon-hyun/fix_substr_testcase.
#### What changes were proposed in this pull request?
This PR is to provide native parsing support for two DDL commands: ```Describe Database``` and ```Alter Database Set Properties```
Based on the Hive DDL document:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
##### 1. ALTER DATABASE
**Syntax:**
```SQL
ALTER (DATABASE|SCHEMA) database_name SET DBPROPERTIES (property_name=property_value, ...)
```
- `ALTER DATABASE` is to add new (key, value) pairs into `DBPROPERTIES`
##### 2. DESCRIBE DATABASE
**Syntax:**
```SQL
DESCRIBE DATABASE [EXTENDED] db_name
```
- `DESCRIBE DATABASE` shows the name of the database, its comment (if one has been set), and its root location on the filesystem. When `extended` is true, it also shows the database's properties
#### How was this patch tested?
Added the related test cases to `DDLCommandSuite`
Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
This patch had conflicts when merged, resolved by
Committer: Yin Huai <yhuai@databricks.com>
Closes#11977 from gatorsmile/parseAlterDatabase.
## What changes were proposed in this pull request?
This PR implements `FileFormat.buildReader()` for our ORC data source. It also fixed several minor styling issues related to `HadoopFsRelation` planning code path.
Note that `OrcNewInputFormat` doesn't rely on `OrcNewSplit` for creating `OrcRecordReader`s, plain `FileSplit` is just fine. That's why we can simply create the record reader with the help of `OrcNewInputFormat` and `FileSplit`.
## How was this patch tested?
Existing test cases should do the work
Author: Cheng Lian <lian@databricks.com>
Closes#11936 from liancheng/spark-14116-build-reader-for-orc.
### What changes were proposed in this pull request?
Based on the Hive DDL document https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
The syntax of DDL command for Drop Database is
```SQL
DROP (DATABASE|SCHEMA) [IF EXISTS] database_name [RESTRICT|CASCADE];
```
- If `IF EXISTS` is not specified, the default behavior is to issue a warning message if `database_name` does't exist
- `RESTRICT` is the default behavior.
This PR is to provide a native parsing support for `DROP DATABASE`.
#### How was this patch tested?
Added a test case `DDLCommandSuite`
Author: gatorsmile <gatorsmile@gmail.com>
Closes#11962 from gatorsmile/parseDropDatabase.
This patch extends Spark's `UnifiedMemoryManager` to add bookkeeping support for off-heap storage memory, an requirement for enabling off-heap caching (which will be done by #11805). The `MemoryManager`'s `storageMemoryPool` has been split into separate on- and off-heap pools and the storage and unroll memory allocation methods have been updated to accept a `memoryMode` parameter to specify whether allocations should be performed on- or off-heap.
In order to reduce the testing surface, the `StaticMemoryManager` does not support off-heap caching (we plan to eventually remove the `StaticMemoryManager`, so this isn't a significant limitation).
Author: Josh Rosen <joshrosen@databricks.com>
Closes#11942 from JoshRosen/off-heap-storage-memory-bookkeeping.
## What changes were proposed in this pull request?
1. merge consumeChild into consume()
2. always generate code for input variables and UnsafeRow, a plan can use eight of them.
## How was this patch tested?
Existing tests.
Author: Davies Liu <davies@databricks.com>
Closes#11975 from davies/gen_refactor.
## What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-13973
## How was this patch tested?
Pyspark
Author: Rekha Joshi <rekhajoshm@gmail.com>
Author: Joshi <rekhajoshm@gmail.com>
Closes#11829 from rekhajoshm/SPARK-13973.
## What changes were proposed in this pull request?
Removed methods that has been deprecated since 1.1, 1.2, 1.3, 1.4, and 1.5.
## How was this patch tested?
- manully checked that no codes in Spark call these methods any more
- existing test suits
Author: Liwei Lin <lwlin7@gmail.com>
Author: proflin <proflin.me@gmail.com>
Closes#11910 from lw-lin/remove-deprecates.
## What changes were proposed in this pull request?
This PR fixes some newly added java-lint errors(unused-imports, line-lengsth).
## How was this patch tested?
Pass the Jenkins tests.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#11968 from dongjoon-hyun/SPARK-14167.
## What changes were proposed in this pull request?
This PR removes all docs about the old streaming-akka, streaming-zeromq, streaming-mqtt and streaming-twitter projects since I have already copied them to https://github.com/spark-packages
Also remove mqtt_wordcount.py that I forgot to remove previously.
## How was this patch tested?
Jenkins PR Build.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#11824 from zsxwing/remove-doc.
## What changes were proposed in this pull request?
HDFSMetadataLog uses newer FileContext API to achieve atomic renaming. However, FileContext implementations may not exist for many scheme for which there may be FileSystem implementations. In those cases, rather than failing completely, we should fallback to the FileSystem based implementation, and log warning that there may be file consistency issues in case the log directory is concurrently modified.
In addition I have also added more tests to increase the code coverage.
## How was this patch tested?
Unit test.
Tested on cluster with custom file system.
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#11925 from tdas/SPARK-14109.
## What changes were proposed in this pull request?
This PR moves flume back to Spark as per the discussion in the dev mail-list.
## How was this patch tested?
Existing Jenkins tests.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#11895 from zsxwing/move-flume-back.
## What changes were proposed in this pull request?
StringIndexerModel.transform sets the output column metadata to use name inputCol. It should not. Fixing this causes a problem with the metadata produced by RFormula.
Fix in RFormula: I added the StringIndexer columns to prefixesToRewrite, and I modified VectorAttributeRewriter to find and replace all "prefixes" since attributes collect multiple prefixes from StringIndexer + Interaction.
Note that "prefixes" is no longer accurate since internal strings may be replaced.
## How was this patch tested?
Unit test which failed before this fix.
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#11965 from jkbradley/StringIndexer-fix.
Currently SparkContext.getCallSite() makes a call to Utils.getCallSite().
```
private[spark] def getCallSite(): CallSite = {
val callSite = Utils.getCallSite()
CallSite(
Option(getLocalProperty(CallSite.SHORT_FORM)).getOrElse(callSite.shortForm),
Option(getLocalProperty(CallSite.LONG_FORM)).getOrElse(callSite.longForm)
)
}
```
However, in some places utils.withDummyCallSite(sc) is invoked to avoid expensive threaddumps within getCallSite(). But Utils.getCallSite() is evaluated earlier causing threaddumps to be computed.
This can have severe impact on smaller queries (that finish in 10-20 seconds) having large number of RDDs.
Creating this patch for lazy evaluation of getCallSite.
No new test cases are added. Following standalone test was tried out manually. Also, built entire spark binary and tried with few SQL queries in TPC-DS and TPC-H in multi node cluster
```
def run(): Unit = {
val conf = new SparkConf()
val sc = new SparkContext("local[1]", "test-context", conf)
val start: Long = System.currentTimeMillis();
val confBroadcast = sc.broadcast(new SerializableConfiguration(new Configuration()))
Utils.withDummyCallSite(sc) {
//Large tables end up creating 5500 RDDs
for(i <- 1 to 5000) {
//ignore nulls in RDD as its mainly for testing callSite
val testRDD = new HadoopRDD(sc, confBroadcast, None, null,
classOf[NullWritable], classOf[Writable], 10)
}
}
val end: Long = System.currentTimeMillis();
println("Time taken : " + (end - start))
}
def main(args: Array[String]): Unit = {
run
}
```
Author: Rajesh Balamohan <rbalamohan@apache.org>
Closes#11911 from rajeshbalamohan/SPARK-14091.
## What changes were proposed in this pull request?
There is a potential dead-lock in Hadoop Shell.runCommand before 2.5.0 ([HADOOP-10622](https://issues.apache.org/jira/browse/HADOOP-10622)). If we interrupt some thread running Shell.runCommand, we may hit this issue.
This PR adds some protecion to prevent from interrupting the microBatchThread when we may run into Shell.runCommand. There are two places will call Shell.runCommand now:
- offsetLog.add
- FileStreamSource.getOffset
They will create a file using HDFS API and call Shell.runCommand to set the file permission.
## How was this patch tested?
Existing unit tests.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#11940 from zsxwing/workaround-for-HADOOP-10622.
## What changes were proposed in this pull request?
This PR adds support for automatically inferring `IsNotNull` constraints from any non-nullable attributes that are part of an operator's output. This also fixes the issue that causes the optimizer to hit the maximum number of iterations for certain queries in https://github.com/apache/spark/pull/11828.
## How was this patch tested?
Unit test in `ConstraintPropagationSuite`
Author: Sameer Agarwal <sameer@databricks.com>
Closes#11953 from sameeragarwal/infer-isnotnull.
## What changes were proposed in this pull request?
JIRA: https://issues.apache.org/jira/browse/SPARK-12443
`constructorFor` will call `dataTypeFor` to determine if a type is `ObjectType` or not. If there is not case for `Decimal`, it will be recognized as `ObjectType` and causes the bug.
## How was this patch tested?
Test is added into `ExpressionEncoderSuite`.
Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#10399 from viirya/fix-encoder-decimal.
## What changes were proposed in this pull request?
StateStoreCoordinator.reportActiveInstance is async, so subsequence state checks must be in eventually.
## How was this patch tested?
Jenkins tests
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#11924 from tdas/state-store-flaky-fix.
## What changes were proposed in this pull request?
This PR is a minor cleanup task as part of https://issues.apache.org/jira/browse/SPARK-14008 to explicitly identify/catch the `UnsupportedOperationException` while initializing the vectorized parquet reader. Other exceptions will simply be thrown back to `SqlNewHadoopPartition`.
## How was this patch tested?
N/A (cleanup only; no new functionality added)
Author: Sameer Agarwal <sameer@databricks.com>
Closes#11950 from sameeragarwal/parquet-cleanup.
## What changes were proposed in this pull request?
As we have `CreateArray` and `CreateStruct`, we should also have `CreateMap`. This PR adds the `CreateMap` expression, and the DataFrame API, and python API.
## How was this patch tested?
various new tests.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#11879 from cloud-fan/create_map.
## What changes were proposed in this pull request?
This PR fix the conflict between ColumnPruning and PushPredicatesThroughProject, because ColumnPruning will try to insert a Project before Filter, but PushPredicatesThroughProject will move the Filter before Project.This is fixed by remove the Project before Filter, if the Project only do column pruning.
The RuleExecutor will fail the test if reached max iterations.
Closes#11745
## How was this patch tested?
Existing tests.
This is a test case still failing, disabled for now, will be fixed by https://issues.apache.org/jira/browse/SPARK-14137
Author: Davies Liu <davies@databricks.com>
Closes#11828 from davies/fail_rule.
## What changes were proposed in this pull request?
Change lint python script to stop on first error rather than building them up so its clearer why we failed (requested by rxin). Also while in the file, remove the commented out code.
## How was this patch tested?
Manually ran lint-python script with & without pep8 errors locally and verified expected results.
Author: Holden Karau <holden@us.ibm.com>
Closes#11898 from holdenk/SPARK-13887-pylint-fast-fail.
## What changes were proposed in this pull request?
In https://github.com/apache/spark/pull/11410, we missed a corner case: define the inner class and use it in `Dataset` at the same time by using paste mode. For this case, the inner class and the `Dataset` are inside same line object, when we build the `Dataset`, we try to get outer pointer from line object, and it will fail because the line object is not initialized yet.
https://issues.apache.org/jira/browse/SPARK-13456?focusedCommentId=15209174&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15209174 is an example for this corner case.
This PR make the process of getting outer pointer from line object lazy, so that we can successfully build the `Dataset` and finish initializing the line object.
## How was this patch tested?
new test in repl suite.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#11931 from cloud-fan/repl.
## What changes were proposed in this pull request?
We ran into a problem today debugging some class loading problem during deserialization, and JVM was masking the underlying exception which made it very difficult to debug. We can however log the exceptions using try/catch ourselves in serialization/deserialization. The good thing is that all these methods are already using Utils.tryOrIOException, so we can just put the try catch and logging in a single place.
## How was this patch tested?
A logging change with a manual test.
Author: Reynold Xin <rxin@databricks.com>
Closes#11951 from rxin/SPARK-14149.
## What changes were proposed in this pull request?
This reopens#11836, which was merged but promptly reverted because it introduced flaky Hive tests.
## How was this patch tested?
See `CatalogTestCases`, `SessionCatalogSuite` and `HiveContextSuite`.
Author: Andrew Or <andrew@databricks.com>
Closes#11938 from andrewor14/session-catalog-again.
## What changes were proposed in this pull request?
Dataset has two variants of groupByKey, one for untyped and the other for typed. It actually doesn't make as much sense to have an untyped API here, since apps that want to use untyped APIs should just use the groupBy "DataFrame" API.
## How was this patch tested?
This patch removes a method, and removes the associated tests.
Author: Reynold Xin <rxin@databricks.com>
Closes#11949 from rxin/SPARK-14145.
## What changes were proposed in this pull request?
unionAll has been deprecated in SPARK-14088.
## How was this patch tested?
Should be covered by all existing tests.
Author: Reynold Xin <rxin@databricks.com>
Closes#11946 from rxin/SPARK-14142.
## What changes were proposed in this pull request?
This PR continues the work in #11447, we implemented the wrapper of ```AFTSurvivalRegression``` named ```survreg``` in SparkR.
## How was this patch tested?
Test against output from R package survival's survreg.
cc mengxr felixcheung
Close#11447
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#11932 from yanboliang/spark-13010-new.
#### What changes were proposed in this pull request?
This PR is to support group by position in SQL. For example, when users input the following query
```SQL
select c1 as a, c2, c3, sum(*) from tbl group by 1, 3, c4
```
The ordinals are recognized as the positions in the select list. Thus, `Analyzer` converts it to
```SQL
select c1, c2, c3, sum(*) from tbl group by c1, c3, c4
```
This is controlled by the config option `spark.sql.groupByOrdinal`.
- When true, the ordinal numbers in group by clauses are treated as the position in the select list.
- When false, the ordinal numbers are ignored.
- Only convert integer literals (not foldable expressions). If found foldable expressions, ignore them.
- When the positions specified in the group by clauses correspond to the aggregate functions in select list, output an exception message.
- star is not allowed to use in the select list when users specify ordinals in group by
Note: This PR is taken from https://github.com/apache/spark/pull/10731. When merging this PR, please give the credit to zhichao-li
Also cc all the people who are involved in the previous discussion: rxin cloud-fan marmbrus yhuai hvanhovell adrian-wang chenghao-intel tejasapatil
#### How was this patch tested?
Added a few test cases for both positive and negative test cases.
Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
Closes#11846 from gatorsmile/groupByOrdinal.
## What changes were proposed in this pull request?
Added MLReadable and MLWritable to Decision Tree Classifier and Regressor. Added doctests.
## How was this patch tested?
Python Unit tests. Tests added to check persistence in DecisionTreeClassifier and DecisionTreeRegressor.
Author: GayathriMurali <gayathri.m.softie@gmail.com>
Closes#11892 from GayathriMurali/SPARK-13949.
## What changes were proposed in this pull request?
GBTs in pyspark previously had seed parameters, but they could not be passed as keyword arguments through the class constructor. This patch adds seed as a keyword argument and also sets default value.
## How was this patch tested?
Doc tests were updated to pass a random seed through the GBTClassifier and GBTRegressor constructors.
Author: sethah <seth.hendrickson16@gmail.com>
Closes#11944 from sethah/SPARK-14107.
When a block is persisted in the MemoryStore at a serialized storage level, the current MemoryStore.putIterator() code will unroll the entire iterator as Java objects in memory, then will turn around and serialize an iterator obtained from the unrolled array. This is inefficient and doubles our peak memory requirements.
Instead, I think that we should incrementally serialize blocks while unrolling them.
A downside to incremental serialization is the fact that we will need to deserialize the partially-unrolled data in case there is not enough space to unroll the block and the block cannot be dropped to disk. However, I'm hoping that the memory efficiency improvements will outweigh any performance losses as a result of extra serialization in that hopefully-rare case.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#11791 from JoshRosen/serialize-incrementally.
## What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-11871
Add save/load for MLPC
## How was this patch tested?
Test with Scala unit test
Author: Xusen Yin <yinxusen@gmail.com>
Closes#9854 from yinxusen/SPARK-11871.
Replace example code in mllib-feature-extraction.md using include_example
https://issues.apache.org/jira/browse/SPARK-13017
The example code in the user guide is embedded in the markdown and hence it is not easy to test. It would be nice to automatically test them. This JIRA is to discuss options to automate example code testing and see what we can do in Spark 1.6.
Goal is to move actual example code to spark/examples and test compilation in Jenkins builds. Then in the markdown, we can reference part of the code to show in the user guide. This requires adding a Jekyll tag that is similar to https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, e.g., called include_example.
`{% include_example scala/org/apache/spark/examples/mllib/TFIDFExample.scala %}`
Jekyll will find `examples/src/main/scala/org/apache/spark/examples/mllib/TFIDFExample.scala` and pick code blocks marked "example" and replace code block in
`{% highlight %}`
in the markdown.
See more sub-tasks in parent ticket: https://issues.apache.org/jira/browse/SPARK-11337
Author: Xin Ren <iamshrek@126.com>
Closes#11142 from keypointt/SPARK-13017.
## What changes were proposed in this pull request?
A fix for local metrics tests that can fail on fast machines.
This is probably what is suggested here #3380 by aarondav?
## How was this patch tested?
CI Tests
Cheers
Author: Joan <joan@goyeau.com>
Closes#11747 from joan38/SPARK-2208-Local-metrics-tests.