## What changes were proposed in this pull request?
Fix a bug in the Spark daemon shell script:
the daemon process starts successfully, but the script prints a failure message.
## How was this patch tested?
existing test.
Author: WeichenXu <WeichenXu123@outlook.com>
Closes #13172 from WeichenXu123/fix-spark-15203.
## What changes were proposed in this pull request?
The default value of the param `linkPredictionCol` for GeneralizedLinearRegression differs between PySpark and Scala, because the default values introduced by #13106 and #13129 conflict. This causes ml.tests to fail.
## How was this patch tested?
Existing tests.
Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Closes #13220 from viirya/hotfix-regresstion.
## What changes were proposed in this pull request?
There is no way to use the Hive catalog in `pyspark-shell`. This is because we used to create a `SparkContext` before calling `SparkSession.enableHiveSupport().getOrCreate()`, which just gets the existing `SparkContext` instead of creating a new one. As a result, `spark.sql.catalogImplementation` was never propagated.
## How was this patch tested?
Manual.
Author: Andrew Or <andrew@databricks.com>
Closes #13203 from andrewor14/fix-pyspark-shell.
## What changes were proposed in this pull request?
When we parse DDLs involving table or database properties, we need to validate the values.
E.g. if we alter a database's property without providing a value:
```
ALTER DATABASE my_db SET DBPROPERTIES('some_key')
```
Then we'll ignore it with Hive, but override the property with the in-memory catalog. Inconsistencies like these arise because we don't validate the property values.
In such cases, we should throw exceptions instead.
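The validation idea can be sketched in plain Python. This is only an illustration, not the parser code touched by this PR; `validate_properties` and the error text are hypothetical names chosen for the example.

```python
def validate_properties(props):
    """Reject property keys that were given without a value.

    `props` maps each key to its value, or to None when the DDL statement
    listed the key alone, as in DBPROPERTIES('some_key').
    """
    missing = sorted(key for key, value in props.items() if value is None)
    if missing:
        raise ValueError("Values must be specified for key(s): " + ", ".join(missing))
    return props


# A key without a value is rejected instead of being ignored or overwritten.
try:
    validate_properties({"some_key": None})
except ValueError as err:
    print(err)
```

Failing at parse time means every catalog implementation sees the same, already validated, properties.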
## How was this patch tested?
`DDLCommandSuite`
Author: Andrew Or <andrew@databricks.com>
Closes #13205 from andrewor14/ddl-prop-values.
#### What changes were proposed in this pull request?
`refreshTable` was a method in `HiveContext`. It was deleted accidentally while we were migrating the APIs. This PR is to add it back to `HiveContext`.
In addition, in `SparkSession`, we put it under the catalog namespace (`SparkSession.catalog.refreshTable`).
#### How was this patch tested?
Changed the existing test cases to use the function `refreshTable`. Also added a test case for `refreshTable` in `hivecontext-compatibility`.
Author: gatorsmile <gatorsmile@gmail.com>
Closes #13156 from gatorsmile/refreshTable.
## What changes were proposed in this pull request?
* ```GeneralizedLinearRegression``` API docs enhancement.
* The default value of ```GeneralizedLinearRegression``` ```linkPredictionCol``` is now unset rather than empty. This is consistent with other similar params such as ```weightCol```.
* Make some methods more private.
* Fix a minor bug of LinearRegression.
* Fix some other issues.
## How was this patch tested?
Existing tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #13129 from yanboliang/spark-15339.
## What changes were proposed in this pull request?
Correct some typos and incorrectly worded sentences.
## How was this patch tested?
Doc changes only.
Note that many of these changes were identified by whomfire01
Author: sethah <seth.hendrickson16@gmail.com>
Closes #13180 from sethah/ml_guide_audit.
## What changes were proposed in this pull request?
The MLlib APIs are no longer recommended, and some of their methods are even deprecated.
Update the warning message to recommend ML usage.
```
def showWarning() {
  System.err.println(
    """WARN: This is a naive implementation of Logistic Regression and is given as an example!
      |Please use either org.apache.spark.mllib.classification.LogisticRegressionWithSGD or
      |org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
      |for more conventional use.
    """.stripMargin)
}
```
To
```
def showWarning() {
  System.err.println(
    """WARN: This is a naive implementation of Logistic Regression and is given as an example!
      |Please use org.apache.spark.ml.classification.LogisticRegression
      |for more conventional use.
    """.stripMargin)
}
```
## How was this patch tested?
local build
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes #13190 from zhengruifeng/update_recd.
## What changes were proposed in this pull request?
This DataFrame example uses `VectorImplicits._`, which is a private API.
Since the `Vectors` object has a public API, we use `Vectors.fromML` instead of the implicits.
## How was this patch tested?
Manually ran the example.
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes #13213 from wangmiao1981/ml.
## What changes were proposed in this pull request?
The TRUNCATE TABLE command is supported by Hive. See the link: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
Below is the related Hive JIRA: https://issues.apache.org/jira/browse/HIVE-446
This PR implements the TRUNCATE TABLE command, excluding column truncation (HIVE-4005).
## How was this patch tested?
Added a test case.
Author: Lianhui Wang <lianhuiwang09@gmail.com>
Closes #13170 from lianhuiwang/truncate.
## What changes were proposed in this pull request?
The following code:
```
val ds = Seq(("a", 1), ("b", 2), ("c", 3)).toDS()
ds.filter(_._1 == "b").select(expr("_1").as[String]).foreach(println(_))
```
throws an Exception:
```
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: _1#420
at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:50)
at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88)
at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87)
...
Cause: java.lang.RuntimeException: Couldn't find _1#420 in [_1#416,_2#417]
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1$$anonfun$applyOrElse$1.apply(BoundAttribute.scala:94)
at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1$$anonfun$applyOrElse$1.apply(BoundAttribute.scala:88)
at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:49)
at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88)
at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87)
...
```
This is because the `EmbedSerializerInFilter` rule drops the `exprId`s of the output of the surrounding `SerializeFromObject`.
The analyzed and optimized plans of the above example are as follows:
```
== Analyzed Logical Plan ==
_1: string
Project [_1#420]
+- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, scala.Tuple2]._1, true) AS _1#420,input[0, scala.Tuple2]._2 AS _2#421]
+- Filter <function1>.apply
+- DeserializeToObject newInstance(class scala.Tuple2), obj#419: scala.Tuple2
+- LocalRelation [_1#416,_2#417], [[0,1800000001,1,61],[0,1800000001,2,62],[0,1800000001,3,63]]
== Optimized Logical Plan ==
!Project [_1#420]
+- Filter <function1>.apply
+- LocalRelation [_1#416,_2#417], [[0,1800000001,1,61],[0,1800000001,2,62],[0,1800000001,3,63]]
```
This PR fixes the `EmbedSerializerInFilter` rule to keep the `exprId`s of the output of the surrounding `SerializeFromObject`.
The plans after this patch are as follows:
```
== Analyzed Logical Plan ==
_1: string
Project [_1#420]
+- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, scala.Tuple2]._1, true) AS _1#420,input[0, scala.Tuple2]._2 AS _2#421]
+- Filter <function1>.apply
+- DeserializeToObject newInstance(class scala.Tuple2), obj#419: scala.Tuple2
+- LocalRelation [_1#416,_2#417], [[0,1800000001,1,61],[0,1800000001,2,62],[0,1800000001,3,63]]
== Optimized Logical Plan ==
Project [_1#416]
+- Filter <function1>.apply
+- LocalRelation [_1#416,_2#417], [[0,1800000001,1,61],[0,1800000001,2,62],[0,1800000001,3,63]]
```
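The binding failure can be pictured with a toy model in plain Python. This is not Catalyst code: attributes are modeled as `(name, expr_id)` tuples, and `bind` and `remap` are hypothetical helpers. The positional mapping in `remap` is an assumption made for the illustration.

```python
def bind(references, child_output):
    """Resolve each (name, expr_id) reference to its ordinal in child_output."""
    ordinals = []
    for ref in references:
        if ref not in child_output:
            raise LookupError(f"Couldn't find {ref[0]}#{ref[1]} in {child_output}")
        ordinals.append(child_output.index(ref))
    return ordinals


def remap(references, serializer_output, child_output):
    """Rewrite references to the serializer's output as the child's attributes.

    Assumes serializer_output[i] corresponds positionally to child_output[i].
    """
    mapping = dict(zip(serializer_output, child_output))
    return [mapping.get(ref, ref) for ref in references]


relation = [("_1", 416), ("_2", 417)]    # output of LocalRelation
serialized = [("_1", 420), ("_2", 421)]  # output of SerializeFromObject
project = [("_1", 420)]                  # what Project refers to

# Once the serializer is removed, _1#420 no longer exists below the Project,
# so binding only works after the reference is rewritten to _1#416.
print(bind(remap(project, serialized, relation), relation))
```

Calling `bind(project, relation)` directly raises a `LookupError`, which mirrors the "Couldn't find _1#420" error above.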
## How was this patch tested?
Existing tests and I added a test to check if `filter and then select` works.
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes #13096 from ueshin/issues/SPARK-15313.
## What changes were proposed in this pull request?
This patch fixes a bug in `TypeUtils.checkForSameTypeInputExpr`. Previously the code tested for strict equality, which does not take nullability into account.
This is based on https://github.com/apache/spark/pull/12768. This patch fixed a bug there (with empty expression) and added a test case.
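As a rough illustration of comparing types while ignoring nullability, here is a plain-Python sketch. It is not the Scala code in `TypeUtils`; `ArrayType`, `same_type` and `check_same_type_inputs` are hypothetical stand-ins, and accepting an empty input list is an assumption about the intended behavior.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArrayType:
    element_type: object
    contains_null: bool = True


def same_type(left, right):
    """Compare two types structurally, ignoring nullability flags."""
    if isinstance(left, ArrayType) and isinstance(right, ArrayType):
        return same_type(left.element_type, right.element_type)
    return left == right


def check_same_type_inputs(types):
    """Return True when all input types match; an empty input is accepted."""
    if not types:
        return True
    return all(same_type(types[0], t) for t in types[1:])


nullable = ArrayType("int", contains_null=True)
non_nullable = ArrayType("int", contains_null=False)
print(nullable == non_nullable)           # strict equality sees two different types
print(same_type(nullable, non_nullable))  # the relaxed comparison treats them as equal
```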
## How was this patch tested?
Added a new test suite and test case.
Closes #12768.
Author: Reynold Xin <rxin@databricks.com>
Author: Oleg Danilov <oleg.danilov@wandisco.com>
Closes #13208 from rxin/SPARK-14990.
## What changes were proposed in this pull request?
Currently `SparkSession.Builder` uses `SQLContext.getOrCreate`. It should probably be the other way around, i.e. all the core logic goes in `SparkSession`, and `SQLContext` just calls that. This patch does that.
This patch also makes sure config options specified in the builder are propagated to the existing (and of course the new) SparkSession.
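The get-or-create behavior with option propagation can be sketched in plain Python. This is a simplified model, not the Scala implementation; `Session`, `Builder` and `get_or_create` are hypothetical names.

```python
class Session:
    _active = None  # the single shared session, if one exists

    def __init__(self):
        self.conf = {}


class Builder:
    def __init__(self):
        self._options = {}

    def config(self, key, value):
        self._options[key] = value
        return self

    def get_or_create(self):
        # Reuse the existing session when there is one, otherwise create it.
        if Session._active is None:
            Session._active = Session()
        # Apply the builder's options in both cases.
        Session._active.conf.update(self._options)
        return Session._active


first = Builder().config("a", "1").get_or_create()
second = Builder().config("b", "2").get_or_create()
print(first is second, second.conf)
```

Both builders return the same session, and the second builder's option is visible on it even though the session already existed.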
## How was this patch tested?
Updated tests to reflect the change, and also introduced a new SparkSessionBuilderSuite that should cover all the branches.
Author: Reynold Xin <rxin@databricks.com>
Closes #13200 from rxin/SPARK-15075.
Hello, can you help check this PR? It adds support for `java.math.BigInteger` in the Java bean code path. Internally, Spark converts `BigInteger` to `BigDecimal` in ColumnType.scala and CatalystRowConverter.scala. This PR takes a similar approach and converts the `BigInteger` to a `BigDecimal`.
Author: Kevin Yu <qyu@us.ibm.com>
Closes #10125 from kevinyu98/working_on_spark-11827.
## What changes were proposed in this pull request?
Fix `MapObjects.itemAccessorMethod` to handle `TimestampType`. Without this fix, `Array[Timestamp]` cannot be properly encoded or decoded. To reproduce this, in `ExpressionEncoderSuite`, if you add the following test case:
`encodeDecodeTest(Array(Timestamp.valueOf("2016-01-29 10:00:00")), "array of timestamp")`
... you will see that (without this fix) it fails with the following output:
```
- encode/decode for array of timestamp: [Ljava.sql.Timestamp;fd9ebde *** FAILED ***
Exception thrown while decoding
Converted: [0,1000000010,800000001,52a7ccdc36800]
Schema: value#61615
root
-- value: array (nullable = true)
|-- element: timestamp (containsNull = true)
Encoder:
class[value[0]: array<timestamp>] (ExpressionEncoderSuite.scala:312)
```
## How was this patch tested?
Existing tests
Author: Sumedh Mungee <smungee@gmail.com>
Closes #13108 from smungee/fix-itemAccessorMethod.
## What changes were proposed in this pull request?
Refactor all Java tests that use SparkSession to extend SharedSparkSession.
## How was this patch tested?
Existing Tests
Author: Sandeep Singh <sandeep@techaddict.me>
Closes #13101 from techaddict/SPARK-15296.
## What changes were proposed in this pull request?
When a `NoClassDefFoundError` or `ClassNotFoundException` is thrown, check whether the missing class was removed in Spark 2.0. If so, the user must be using an incompatible library and we can provide a better message.
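A plain-Python sketch of that lookup follows. It is only an illustration: `REMOVED_CLASSES` and `explain_missing_class` are hypothetical, and the set contains just the one class name mentioned in this description.

```python
# Hypothetical list; the real set of removed classes lives in Spark itself.
REMOVED_CLASSES = {"org.apache.spark.sql.sources.HadoopFsRelationProvider"}


def explain_missing_class(class_name):
    """Return a friendlier message when the missing class is known to be removed."""
    if class_name in REMOVED_CLASSES:
        return (f"{class_name} is removed in Spark 2.0. "
                "Please check if your library is compatible with Spark 2.0")
    return f"{class_name} could not be found"


print(explain_missing_class("org.apache.spark.sql.sources.HadoopFsRelationProvider"))
```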
## How was this patch tested?
1. Run `bin/pyspark --packages com.databricks:spark-avro_2.10:2.0.1`
2. type `sqlContext.read.format("com.databricks.spark.avro").load("src/test/resources/episodes.avro")`.
It will show `java.lang.ClassNotFoundException: org.apache.spark.sql.sources.HadoopFsRelationProvider is removed in Spark 2.0. Please check if your library is compatible with Spark 2.0`
Author: Shixiong Zhu <shixiong@databricks.com>
Closes #13201 from zsxwing/better-message.
## What changes were proposed in this pull request?
```ml.evaluation``` Scala and Python API sync.
## How was this patch tested?
Only API docs change, no new tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #13195 from yanboliang/evaluation-doc.
## What changes were proposed in this pull request?
Currently, ```model.write``` does not save ```summary``` (if applicable). We should add documentation to clarify this.
We also fixed the incorrect link ```[[MLWriter]]``` to ```[[org.apache.spark.ml.util.MLWriter]]```.
## How was this patch tested?
Documentation update, no unit test.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #13131 from yanboliang/spark-15341.
## What changes were proposed in this pull request?
Add ConsoleSink to Structured Streaming. Users can use it to display DataFrames on the console (useful for debugging and demonstrating), similar to the functionality of `DStream#print`. To use it:
```
val query = result.write
  .format("console")
  .trigger(ProcessingTime("2 seconds"))
  .startStream()
```
## How was this patch tested?
Verified locally.
I am not sure whether this is suitable to add to Structured Streaming; please review and comment. Thanks a lot.
Author: jerryshao <sshao@hortonworks.com>
Closes #13162 from jerryshao/SPARK-15375.
## What changes were proposed in this pull request?
Open up APIs for converting between new, old linear algebra types (in spark.mllib.linalg):
`Sparse`/`Dense` X `Vector`/`Matrices` `.asML` and `.fromML`
## How was this patch tested?
Existing Tests
Author: Sandeep Singh <sandeep@techaddict.me>
Closes #13202 from techaddict/SPARK-15414.
## What changes were proposed in this pull request?
Audit Scala API for ml.clustering.
Fix some incorrect API documentation and update outdated docs.
## How was this patch tested?
Existing unit tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #13148 from yanboliang/spark-15361.
## What changes were proposed in this pull request?
Add `Since` annotations to ml.stat.MultivariateOnlineSummarizer.scala.
## How was this patch tested?
unit tests
Author: DB Tsai <dbt@netflix.com>
Closes #13197 from dbtsai/cleanup.
## What changes were proposed in this pull request?
We use `autoBroadcastJoinThreshold + 1L` as the default value of the size estimation. That is not good in 2.0, because we now calculate the size based on the size of the schema, so the estimation could be less than `autoBroadcastJoinThreshold` if you have a SELECT on top of a DataFrame created from an RDD.
This PR changes the default value to `Long.MaxValue`.
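The effect can be seen with a small arithmetic sketch in plain Python. The scaling rule in `projected_size`, the 10 MB threshold and the row widths are assumptions made for illustration, not Spark's actual statistics code.

```python
LONG_MAX = 2**63 - 1  # value of Long.MaxValue on the JVM


def projected_size(child_size, child_row_width, output_row_width):
    """Scale a child's size estimate by the ratio of output to input row width."""
    return child_size * output_row_width // child_row_width


threshold = 10 * 1024 * 1024  # hypothetical autoBroadcastJoinThreshold (10 MB)

old_default = threshold + 1
new_default = LONG_MAX

# Selecting one 8-byte column out of a 64-byte row shrinks the estimate 8x.
print(projected_size(old_default, 64, 8) < threshold)  # True: wrongly looks broadcastable
print(projected_size(new_default, 64, 8) < threshold)  # False
```

With the old default, any projection that narrows the rows pushes the estimate under the threshold; starting from `Long.MaxValue` leaves enough headroom that it stays above it.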
## How was this patch tested?
Added regression tests.
Author: Davies Liu <davies@databricks.com>
Closes #13183 from davies/fix_default_size.
## What changes were proposed in this pull request?
In general, the Web UI doesn't need to store the Accumulator/AccumulableInfo for every task. It only needs the Accumulator values.
This PR creates new UIData classes to store the necessary fields and makes `JobProgressListener` store only these new classes, so that `JobProgressListener` no longer stores Accumulator/AccumulableInfo and its size becomes much smaller. It also eliminates `AccumulableInfo` from `SQLListener` so that we don't keep any references to those unused `AccumulableInfo`s.
## How was this patch tested?
I ran two tests reported in JIRA locally:
The first one is:
```
val data = spark.range(0, 10000, 1, 10000)
data.cache().count()
```
The retained size of JobProgressListener decreases from 60.7M to 6.9M.
The second one is:
```
import org.apache.spark.ml.CC
import org.apache.spark.sql.SQLContext
val sqlContext = SQLContext.getOrCreate(sc)
CC.runTest(sqlContext)
```
This test won't cause OOM after applying this patch.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes #13153 from zsxwing/memory.
## What changes were proposed in this pull request?
This PR is a follow-up of #13079. It replaces `hasUnsupportedFeatures: Boolean` in `CatalogTable` with `unsupportedFeatures: Seq[String]`, which contains unsupported Hive features of the underlying Hive table. In this way, we can accurately report all unsupported Hive features in the exception message.
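The difference between a boolean flag and a list of feature names can be sketched in plain Python. `check_supported`, the message wording and the feature names are hypothetical and only illustrate the idea.

```python
def check_supported(table_name, unsupported_features):
    """Raise an error that names every unsupported feature of the table."""
    if unsupported_features:
        details = ", ".join(unsupported_features)
        raise NotImplementedError(
            f"Table {table_name} uses unsupported Hive features: {details}")


# With a list instead of a flag, the message can say exactly what is unsupported.
try:
    check_supported("t1", ["bucketing", "skewed columns"])
except NotImplementedError as err:
    print(err)
```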
## How was this patch tested?
Updated existing test case to check exception message.
Author: Cheng Lian <lian@databricks.com>
Closes #13173 from liancheng/spark-14346-follow-up.
## What changes were proposed in this pull request?
Add linkPredictionCol to GeneralizedLinearRegression and fix the PyDoc to generate the bullet list
## How was this patch tested?
doctests & built docs locally
Author: Holden Karau <holden@us.ibm.com>
Closes #13106 from holdenk/SPARK-15316-add-linkPredictionCol-toGeneralizedLinearRegression.
## What changes were proposed in this pull request?
This PR corrects another case that uses the deprecated `accumulableCollection`, replacing it with `listAccumulator`. The previous PR appears to have missed it.
Since `ArrayBuffer[InternalRow].asJava` is `java.util.List[InternalRow]`, it seems ok to replace the usage.
## How was this patch tested?
Related existing tests `InMemoryColumnarQuerySuite` and `CachedTableSuite`.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes #13187 from HyukjinKwon/SPARK-15322.
## What changes were proposed in this pull request?
After #12871, we are forced to create `/user/hive/warehouse` when SimpleAnalyzer is used, even though SimpleAnalyzer may not need the directory.
## How was this patch tested?
Manual test.
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes #13175 from sarutak/SPARK-15387.
## What changes were proposed in this pull request?
A write lock can be acquired when 1) creating a new block, 2) removing a block, or 3) evicting a block to disk. 1) and 3) can happen at the same time within the same task, and all of them can happen at the same time outside a task. It is therefore OK for someone to try to grab the write lock for a block that is already held by another holder with the same task attempt id.
This PR removes the check.
## How was this patch tested?
Updated existing tests.
Author: Davies Liu <davies@databricks.com>
Closes #13082 from davies/write_lock_conflict.
#### What changes were proposed in this pull request?
This follow-up PR is to address the remaining comments in https://github.com/apache/spark/pull/12385
The major change in this PR is to issue better error messages in PySpark by using the mechanism that was proposed by davies in https://github.com/apache/spark/pull/7135
For example, in PySpark, if we input the following statement:
```python
>>> l = [('Alice', 1)]
>>> df = sqlContext.createDataFrame(l)
>>> df.createTempView("people")
>>> df.createTempView("people")
```
Before this PR, the exception looks like this:
```
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/xiaoli/IdeaProjects/sparkDelivery/python/pyspark/sql/dataframe.py", line 152, in createTempView
self._jdf.createTempView(name)
File "/Users/xiaoli/IdeaProjects/sparkDelivery/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 933, in __call__
File "/Users/xiaoli/IdeaProjects/sparkDelivery/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/Users/xiaoli/IdeaProjects/sparkDelivery/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 312, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o35.createTempView.
: org.apache.spark.sql.catalyst.analysis.TempTableAlreadyExistsException: Temporary table 'people' already exists;
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTempView(SessionCatalog.scala:324)
at org.apache.spark.sql.SparkSession.createTempView(SparkSession.scala:523)
at org.apache.spark.sql.Dataset.createTempView(Dataset.scala:2328)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:211)
at java.lang.Thread.run(Thread.java:745)
```
After this PR, the exception becomes cleaner:
```
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/xiaoli/IdeaProjects/sparkDelivery/python/pyspark/sql/dataframe.py", line 152, in createTempView
self._jdf.createTempView(name)
File "/Users/xiaoli/IdeaProjects/sparkDelivery/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 933, in __call__
File "/Users/xiaoli/IdeaProjects/sparkDelivery/python/pyspark/sql/utils.py", line 75, in deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u"Temporary table 'people' already exists;"
```
#### How was this patch tested?
Fixed an existing PySpark test case
Author: gatorsmile <gatorsmile@gmail.com>
Closes #13126 from gatorsmile/followup-14684.
## What changes were proposed in this pull request?
When broadcasting a table with more than 100 million rows (which ideally should not happen), the computed size of the needed memory overflows.
This PR fixes the overflow by converting the value to Long when calculating the size of memory.
It also adds more checks in broadcast to show reasonable messages.
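The overflow can be reproduced with a short arithmetic sketch in plain Python, which emulates 32-bit signed multiplication. The row count and the per-row cost are made-up numbers for illustration; they are not taken from Spark's broadcast code.

```python
def to_int32(value):
    """Wrap an integer the way 32-bit signed (JVM Int) arithmetic does."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value >= 0x80000000 else value


rows = 150_000_000
bytes_per_row = 16  # hypothetical per-row cost

as_int = to_int32(rows * bytes_per_row)  # product carried out in 32 bits
as_long = rows * bytes_per_row           # same product without wrapping

print(as_int)   # -1894967296
print(as_long)  # 2400000000
```

Widening the operands to a 64-bit type before multiplying is what keeps the result positive.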
## How was this patch tested?
Add test.
Author: Davies Liu <davies@databricks.com>
Closes #13182 from davies/fix_broadcast.
## What changes were proposed in this pull request?
This PR adds `Since` annotations in `Vectors.scala` and `Matrices.scala` of spark-mllib-local.
## How was this patch tested?
Scala Style Checks.
Author: Pravin Gadakh <prgadakh@in.ibm.com>
Closes #13191 from pravingadakh/SPARK-14613.
## What changes were proposed in this pull request?
Audit the Scala API for classification. Almost all issues in this section were related to ```MultilayerPerceptronClassifier```.
* Fix one wrong param getter function: ```getOptimizer``` -> ```getSolver```
* Add missing setter function for ```solver``` and ```stepSize```.
* Make ```GD``` solver take effect.
* Update docs, annotations and fix other minor issues.
## How was this patch tested?
Existing unit tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #13076 from yanboliang/spark-15292.
## What changes were proposed in this pull request?
[SPARK-14646](https://issues.apache.org/jira/browse/SPARK-14646) makes ```KMeansModel``` store the cluster centers one per row. ```KMeansModel.load()``` method needs to be updated in order to load models saved with Spark 1.6.
## How was this patch tested?
Since ```save/load``` is ```Experimental``` for 1.6, I think offline test for backwards compatibility is enough.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #13149 from yanboliang/spark-15362.
## What changes were proposed in this pull request?
Remove redundant set master in OutputCommitCoordinatorIntegrationSuite, as we are already setting it in SparkContext below on line 43.
## How was this patch tested?
existing tests
Author: Sandeep Singh <sandeep@techaddict.me>
Closes #13168 from techaddict/minor-1.
## What changes were proposed in this pull request?
This PR aims to add a new **FoldablePropagation** optimizer that propagates foldable expressions by replacing all attributes with the aliases of the original foldable expression. Other optimizations will take advantage of the propagated foldable expressions: e.g. the `EliminateSorts` optimizer can now handle Cases 2 and 3 below. (Case 1 is the previous implementation.)
1. Literals and foldable expression, e.g. "ORDER BY 1.0, 'abc', Now()"
2. Foldable ordinals, e.g. "SELECT 1.0, 'abc', Now() ORDER BY 1, 2, 3"
3. Foldable aliases, e.g. "SELECT 1.0 x, 'abc' y, Now() z ORDER BY x, y, z"
This PR has been generalized based on cloud-fan's key ideas many times; he should be credited for the work he did.
**Before**
```
scala> sql("SELECT 1.0, Now() x ORDER BY 1, x").explain
== Physical Plan ==
WholeStageCodegen
: +- Sort [1.0#5 ASC,x#0 ASC], true, 0
: +- INPUT
+- Exchange rangepartitioning(1.0#5 ASC, x#0 ASC, 200), None
+- WholeStageCodegen
: +- Project [1.0 AS 1.0#5,1461873043577000 AS x#0]
: +- INPUT
+- Scan OneRowRelation[]
```
**After**
```
scala> sql("SELECT 1.0, Now() x ORDER BY 1, x").explain
== Physical Plan ==
WholeStageCodegen
: +- Project [1.0 AS 1.0#5,1461873079484000 AS x#0]
: +- INPUT
+- Scan OneRowRelation[]
```
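Why the `Sort` disappears can be sketched in plain Python: once an ORDER BY item is known to refer to a constant select expression, sorting by it cannot change the row order. The function below is a toy model with hypothetical names, not the Catalyst rule itself.

```python
def drop_foldable_sort_keys(select_list, order_by):
    """Drop ORDER BY items that refer to constant (foldable) select expressions.

    select_list: list of (alias, is_foldable) pairs.
    order_by: list of aliases or 1-based ordinals.
    """
    foldable_aliases = {alias for alias, is_foldable in select_list if is_foldable}
    remaining = []
    for item in order_by:
        # Resolve an ordinal such as 1 to the alias at that select position.
        alias = select_list[item - 1][0] if isinstance(item, int) else item
        if alias not in foldable_aliases:
            remaining.append(item)
    return remaining


# SELECT 1.0, Now() x ORDER BY 1, x: both sort keys are constants.
print(drop_foldable_sort_keys([("1.0", True), ("x", True)], [1, "x"]))
```

An empty result means there is nothing left to sort by, which corresponds to the plan without the `Sort` and `Exchange` nodes.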
## How was this patch tested?
Pass the Jenkins tests including a new test case.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #12719 from dongjoon-hyun/SPARK-14939.
## What changes were proposed in this pull request?
It seems most of the Python examples were changed to use SparkSession by https://github.com/apache/spark/pull/12809. That PR said that both examples below:
- `simple_params_example.py`
- `aft_survival_regression.py`
were not changed because they do not work. It seems `aft_survival_regression.py` was changed by https://github.com/apache/spark/pull/13050, but `simple_params_example.py` has not been changed yet.
This PR corrects the example and makes it use SparkSession.
In more detail, it seems `threshold` was replaced with `thresholds` here and there by 5a23213c14. However, calling `lr.fit(training, paramMap)` overwrites the values. So, `threshold` was 5 and `thresholds` becomes 5.5 (by `1 / (1 + thresholds(0) / thresholds(1))`).
According to the comment below, this is not allowed: 354f8f11bd/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala (L58-L61).
So, this PR sets the equivalent value so that this does not throw an exception.
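The conversion formula quoted above can be checked with a few lines of plain Python. `binary_threshold` is a hypothetical helper, and the input values are illustrative rather than the ones used in the example script.

```python
def binary_threshold(thresholds):
    """Binary threshold equivalent to a two-class `thresholds` array."""
    t0, t1 = thresholds
    return 1 / (1 + t0 / t1)


# Equal per-class thresholds correspond to a binary threshold of one half.
print(binary_threshold([0.5, 0.5]))
```

Setting `threshold` and `thresholds` together is only consistent when the two values agree under this formula.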
## How was this patch tested?
Manually (`mvn package -DskipTests && spark-submit simple_params_example.py`)
Author: hyukjinkwon <gurwls223@gmail.com>
Closes #13135 from HyukjinKwon/SPARK-15031.
## What changes were proposed in this pull request?
Whole Stage Codegen depends on `SparkPlan.reference` to do some optimization. For physical object operators, they should be consistent with their logical version and set the `reference` correctly.
## How was this patch tested?
new test in DatasetSuite
Author: Wenchen Fan <wenchen@databricks.com>
Closes #13167 from cloud-fan/bug.
## What changes were proposed in this pull request?
Right now the netty RPC uses `InetSocketAddress.getHostName` to create `RpcAddress` for network events. If we use an IP address to connect, then the RpcAddress's host will be a host name (if the reverse lookup succeeds) instead of the IP address. However, some places need to compare the original IP address and the RpcAddress in `onDisconnect` (e.g., CoarseGrainedExecutorBackend), and this behavior makes the check incorrect.
This PR uses `getHostString` to resolve the issue.
## How was this patch tested?
Jenkins unit tests.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes #13185 from zsxwing/host-string.
## What changes were proposed in this pull request?
I reviewed Scala and Python APIs for ml.feature and corrected discrepancies.
## How was this patch tested?
Built docs locally, ran style checks
Author: Bryan Cutler <cutlerb@gmail.com>
Closes #13159 from BryanCutler/ml.feature-api-sync.
## What changes were proposed in this pull request?
This patch is a follow-up to https://github.com/apache/spark/pull/13104 and adds documentation to clarify the semantics of read.text with respect to partitioning.
## How was this patch tested?
N/A
Author: Reynold Xin <rxin@databricks.com>
Closes #13184 from rxin/SPARK-14463.
#### What changes were proposed in this pull request?
The command `SET -v` always outputs the default values even if we set the parameter. This behavior is incorrect. Instead, if users override it, we should output the user-specified value.
In addition, the output schema of `SET -v` is wrong. We should use the column `value` instead of `default` for the parameter value.
This PR is to fix the above two issues.
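The intended lookup order can be sketched in plain Python. `DEFAULTS` and `set_v` are hypothetical; the sketch only shows that a user-specified value should win over the default when listing parameters.

```python
# One real configuration key with its documented default, used as sample data.
DEFAULTS = {"spark.sql.shuffle.partitions": "200"}


def set_v(overrides):
    """List (key, value) rows, preferring the user-specified value over the default."""
    return [(key, overrides.get(key, default))
            for key, default in sorted(DEFAULTS.items())]


print(set_v({}))
print(set_v({"spark.sql.shuffle.partitions": "10"}))
```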
#### How was this patch tested?
Added a test case.
Author: gatorsmile <gatorsmile@gmail.com>
Closes #13081 from gatorsmile/setVcommand.
## What changes were proposed in this pull request?
This PR adds a null check in `SparkSession.createDataFrame`, so that we can make sure the passed-in rows match the given schema.
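A plain-Python sketch of such a check is shown below. It is not the encoder code changed by this PR; `check_rows` and the schema representation are hypothetical.

```python
def check_rows(rows, schema):
    """Ensure no row holds None in a column the schema declares non-nullable.

    schema: list of (column_name, nullable) pairs, in column order.
    """
    for row in rows:
        for value, (name, nullable) in zip(row, schema):
            if value is None and not nullable:
                raise ValueError(f"Column {name} is not nullable but got None")
    return rows


schema = [("id", False), ("name", True)]
print(check_rows([(1, "a"), (2, None)], schema))  # nullable column may hold None
```

Passing a row such as `(None, "b")` raises a `ValueError`, because `id` is declared non-nullable.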
## How was this patch tested?
new tests in `DatasetSuite`
Author: Wenchen Fan <wenchen@databricks.com>
Closes #13008 from cloud-fan/row-encoder.
https://issues.apache.org/jira/browse/SPARK-15323
I was using partitioned text datasets in Spark 1.6.1, but this broke in Spark 2.0.0.
It would be logical to also support writing them, but I am not entirely sure how to solve this with the new Dataset implementation.
Also, it doesn't work using `sqlContext.read.text`, since that method returns a `Dataset[String]`.
See https://issues.apache.org/jira/browse/SPARK-14463 for that issue.
Author: Jurriaan Pruis <email@jurriaanpruis.nl>
Closes #13104 from jurriaan/fix-partitioned-text-reads.
## What changes were proposed in this pull request?
We use `autoBroadcastJoinThreshold + 1L` as the default value of the size estimation. That is not good in 2.0, because we now calculate the size based on the size of the schema, so the estimation could be less than `autoBroadcastJoinThreshold` if you have a SELECT on top of a DataFrame created from an RDD.
This PR changes the default value to `Long.MaxValue`.
## How was this patch tested?
Added regression tests.
Author: Davies Liu <davies@databricks.com>
Closes #13179 from davies/fix_default_size.