## What changes were proposed in this pull request?
`a` -> `an`
I use regex to generate potential error lines:
`grep -in ' a [aeiou]' mllib/src/main/scala/org/apache/spark/ml/*/*scala`
and review them line by line.
## How was this patch tested?
local build
`lint-java` checking
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#13317 from zhengruifeng/a_an.
## What changes were proposed in this pull request?
Scala doc used outdated ```+=```. Replaced with ```add```.
## How was this patch tested?
N/A
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#13346 from jkbradley/accum-doc.
## What changes were proposed in this pull request?
Follow up on the earlier PR - in here we are fixing up roxygen2 doc examples.
Also add to the programming guide migration section.
## How was this patch tested?
SparkR tests
Author: felixcheung <felixcheung_m@hotmail.com>
Closes#13340 from felixcheung/sqlcontextdoc.
## What changes were proposed in this pull request?
This PR corrects SparkR to use `shell()` instead of `system2()` on Windows.
Using `system2(...)` on Windows does not process windows file separator `\`. `shell(tralsate = TRUE, ...)` can treat this problem. So, this was changed to be chosen according to OS.
Existing tests were failed on Windows due to this problem. For example, those were failed.
```
8. Failure: sparkJars tag in SparkContext (test_includeJAR.R#34)
9. Failure: sparkJars tag in SparkContext (test_includeJAR.R#36)
```
The cases above were due to using of `system2`.
In addition, this PR also fixes some tests failed on Windows.
```
5. Failure: sparkJars sparkPackages as comma-separated strings (test_context.R#128)
6. Failure: sparkJars sparkPackages as comma-separated strings (test_context.R#131)
7. Failure: sparkJars sparkPackages as comma-separated strings (test_context.R#134)
```
The cases above were due to a weird behaviour of `normalizePath()`. On Linux, if the path does not exist, it just prints out the input but it prints out including the current path on Windows.
```r
# On Linus
path <- normalizePath("aa")
print(path)
[1] "aa"
# On Windows
path <- normalizePath("aa")
print(path)
[1] "C:\\Users\\aa"
```
## How was this patch tested?
Jenkins tests and manually tested in a Window machine as below:
Here is the [stdout](https://gist.github.com/HyukjinKwon/4bf35184f3a30f3bce987a58ec2bbbab) of testing.
Closes#7025
Author: hyukjinkwon <gurwls223@gmail.com>
Author: Hyukjin Kwon <gurwls223@gmail.com>
Author: Prakash PC <prakash.chinnu@gmail.com>
Closes#13165 from HyukjinKwon/pr/7025.
## What changes were proposed in this pull request?
Certain table properties (and SerDe properties) are in the protected namespace `spark.sql.sources.`, which we use internally for datasource tables. The user should not be allowed to
(1) Create a Hive table setting these properties
(2) Alter these properties in an existing table
Previously, we threw an exception if the user tried to alter the properties of an existing datasource table. However, this is overly restrictive for datasource tables and does not do anything for Hive tables.
## How was this patch tested?
DDLSuite
Author: Andrew Or <andrew@databricks.com>
Closes#13341 from andrewor14/alter-table-props.
https://issues.apache.org/jira/browse/SPARK-15542
## What changes were proposed in this pull request?
When running`./R/install-dev.sh` in **Mac OS EI Captain** environment, I got
```
mbp185-xr:spark xin$ ./R/install-dev.sh
usage: dirname path
```
This message is very confusing to me, and then I found R is not properly configured on my Mac when this script is using `$(which R)` to get R home.
I tried similar situation on CentOS with R missing, and it's giving me very clear error message while MacOS is not.
on CentOS:
```
[rootip-xxx-31-9-xx spark]# which R
/usr/bin/which: no R in (/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/usr/lib/jvm/java-1.7.0-openjdk.x86_64/bin:/root/bin)
```
but on Mac, if not found then nothing returned and this is causing the confusing message for R build failure and running R/install-dev.sh:
```
mbp185-xr:spark xin$ which R
mbp185-xr:spark xin$
```
Here I just added a clear message for this miss configuration for R when running `R/install-dev.sh`.
```
mbp185-xr:spark xin$ ./R/install-dev.sh
Cannot find R home by running 'which R', please make sure R is properly installed.
```
## How was this patch tested?
Manually tested on local machine.
Author: Xin Ren <iamshrek@126.com>
Closes#13308 from keypointt/SPARK-15542.
## What changes were proposed in this pull request?
Two more changes:
(1) Fix truncate table for data source tables (only for cases without `PARTITION`)
(2) Disallow truncating external tables or views
## How was this patch tested?
`DDLSuite`
Author: Andrew Or <andrew@databricks.com>
Closes#13315 from andrewor14/truncate-table.
## What changes were proposed in this pull request?
This PR changes SQLContext/HiveContext's public constructor to use SparkSession.build.getOrCreate and removes isRootContext from SQLContext.
## How was this patch tested?
Existing tests.
Author: Yin Huai <yhuai@databricks.com>
Closes#13310 from yhuai/SPARK-15532.
## What changes were proposed in this pull request?
This PR addresses two related issues:
1. `Dataset.showString()` should show case classes/Java beans at all levels as rows, while master code only handles top level ones.
2. `Dataset.showString()` should show full contents produced the underlying query plan
Dataset is only a view of the underlying query plan. Columns not referred by the encoder are still reachable using methods like `Dataset.col`. So it probably makes more sense to show full contents of the query plan.
## How was this patch tested?
Two new test cases are added in `DatasetSuite` to check `.showString()` output.
Author: Cheng Lian <lian@databricks.com>
Closes#13331 from liancheng/spark-15550-ds-show.
## What changes were proposed in this pull request?
This patch fixes a few integer overflows in `UnsafeSortDataFormat.copyRange()` and `ShuffleSortDataFormat copyRange()` that seems to be the most likely cause behind a number of `TimSort` contract violation errors seen in Spark 2.0 and Spark 1.6 while sorting large datasets.
## How was this patch tested?
Added a test in `ExternalSorterSuite` that instantiates a large array of the form of [150000000, 150000001, 150000002, ...., 300000000, 0, 1, 2, ..., 149999999] that triggers a `copyRange` in `TimSort.mergeLo` or `TimSort.mergeHi`. Note that the input dataset should contain at least 268.43 million rows with a certain data distribution for an overflow to occur.
Author: Sameer Agarwal <sameer@databricks.com>
Closes#13336 from sameeragarwal/timsort-bug.
## What changes were proposed in this pull request?
Add more verbose error message when order by clause is missed when using Window function.
## How was this patch tested?
Unit test.
Author: Sean Zhong <seanzhong@databricks.com>
Closes#13333 from clockfly/spark-13445.
## What changes were proposed in this pull request?
Several classes and methods have been deprecated and are creating lots of build warnings in branch-2.0. This issue is to identify and fix those items:
* WithSGD classes: Change to make class not deprecated, object deprecated, and public class constructor deprecated. Any public use will require a deprecated API. We need to keep a non-deprecated private API since we cannot eliminate certain uses: Python API, streaming algs, and examples.
* Use in PythonMLlibAPI: Change to using private constructors
* Streaming algs: No warnings after we un-deprecate the classes
* Examples: Deprecate or change ones which use deprecated APIs
* MulticlassMetrics fields (precision, etc.)
* LinearRegressionSummary.model field
## How was this patch tested?
Existing tests. Checked for warnings manually.
Author: Sean Owen <sowen@cloudera.com>
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#13314 from jkbradley/warning-cleanups.
## What changes were proposed in this pull request?
SparkSession has a list of unnecessary private[sql] methods. These methods cause some trouble because private[sql] doesn't apply in Java. In the cases that they are easy to remove, we can simply remove them. This patch does that.
As part of this pull request, I also replaced a bunch of protected[sql] with private[sql], to tighten up visibility.
## How was this patch tested?
Updated test cases to reflect the changes.
Author: Reynold Xin <rxin@databricks.com>
Closes#13319 from rxin/SPARK-15552.
## What changes were proposed in this pull request?
Also sets confs in the underlying sc when using SparkSession.builder.getOrCreate(). This is a bug-fix from a post-merge comment in https://github.com/apache/spark/pull/13289
## How was this patch tested?
Python doc-tests.
Author: Eric Liang <ekl@databricks.com>
Closes#13309 from ericl/spark-15520-1.
## What changes were proposed in this pull request?
Same as #13302, but for DROP TABLE.
## How was this patch tested?
`DDLSuite`
Author: Andrew Or <andrew@databricks.com>
Closes#13307 from andrewor14/drop-table.
This patch provides detail on what to do for keytabless Oozie launches of spark apps, and adds some debug-level diagnostics of what credentials have been submitted
Author: Steve Loughran <stevel@hortonworks.com>
Author: Steve Loughran <stevel@apache.org>
Closes#11033 from steveloughran/stevel/feature/SPARK-13148-oozie.
Eliminate the need to pass sqlContext to method since it is a singleton - and we don't want to support multiple contexts in a R session.
Changes are done in a back compat way with deprecation warning added. Method signature for S3 methods are added in a concise, clean approach such that in the next release the deprecated signature can be taken out easily/cleanly (just delete a few lines per method).
Custom method dispatch is implemented to allow for multiple JVM reference types that are all 'jobj' in R and to avoid having to add 30 new exports.
Author: felixcheung <felixcheung_m@hotmail.com>
Closes#9192 from felixcheung/rsqlcontext.
## What changes were proposed in this pull request?
See https://issues.apache.org/jira/browse/SPARK-15523
This PR replaces PR #13293. It's isolated to a new branch, and contains some more squashed changes.
## How was this patch tested?
1. Executed `mvn clean package` in `mllib` directory
2. Executed `dev/test-dependencies.sh --replace-manifest` in the root directory.
Author: Villu Ruusmann <villu.ruusmann@gmail.com>
Closes#13297 from vruusmann/update-jpmml.
## What changes were proposed in this pull request?
(Please fill in changes proposed in this fix)
The Binarization scala example val dataFrame : Dataframe = spark.createDataFrame(data).toDF("label", "feature"), which can't be pasted in the spark-shell as Dataframe is not imported. Compared with other examples, this explicit type is not required.
So I removed Dataframe in the code.
## How was this patch tested?
(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
Manually tested
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes#13266 from wangmiao1981/unit.
## What changes were proposed in this pull request?
For some of the test cases, e.g. `OrcSourceSuite`, it will create temp folders and temp files inside them. But after tests finish, the folders are not removed. This will cause lots of temp files created and space occupied, if we keep running the test cases.
The reason is dir.delete() won't work if dir is not empty. We need to recursively delete the content before deleting the folder.
## How was this patch tested?
Manually checked the temp folder to make sure the temp files were deleted.
Author: Bo Meng <mengbo@hotmail.com>
Closes#13304 from bomeng/SPARK-15537.
## What changes were proposed in this pull request?
This patch renames various DefaultSources to make their names more self-describing. The choice of "DefaultSource" was from the days when we did not have a good way to specify short names.
They are now named:
- LibSVMFileFormat
- CSVFileFormat
- JdbcRelationProvider
- JsonFileFormat
- ParquetFileFormat
- TextFileFormat
Backward compatibility is maintained through aliasing.
## How was this patch tested?
Updated relevant test cases too.
Author: Reynold Xin <rxin@databricks.com>
Closes#13311 from rxin/SPARK-15543.
This is a basic framework for testing the entire scheduler. The tests this adds aren't very interesting -- the point of this PR is just to setup the framework, to keep the initial change small, but it can be built upon to test more features (eg., speculation, killing tasks, blacklisting, etc.).
Author: Imran Rashid <irashid@cloudera.com>
Closes#8559 from squito/SPARK-10372-scheduler-integs.
## What changes were proposed in this pull request?
(Please fill in changes proposed in this fix)
There are some failures when running SparkR unit tests.
In this PR, I fixed two of these failures in test_context.R and test_sparkSQL.R
The first one is due to different masked name. I added missed names in the expected arrays.
The second one is because one PR removed the logic of a previous fix of missing subset method.
The file privilege issue is still there. I am debugging it. SparkR shell can run the test case successfully.
test_that("pipeRDD() on RDDs", {
actual <- collect(pipeRDD(rdd, "more"))
When using run-test script, it complains no such directories as below:
cannot open file '/tmp/Rtmp4FQbah/filee2273f9d47f7': No such file or directory
## How was this patch tested?
(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
Manually test it
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes#13284 from wangmiao1981/R.
## What changes were proposed in this pull request?
This patch deprecates `Dataset.explode` and documents appropriate workarounds to use `flatMap()` or `functions.explode()` instead.
## How was this patch tested?
N/A
Author: Sameer Agarwal <sameer@databricks.com>
Closes#13312 from sameeragarwal/deprecate.
## What changes were proposed in this pull request?
The ANTLR4 SBT plugin has been moved from its own repo to one on bintray. The version was also changed from `0.7.10` to `0.7.11`. The latter actually broke our build (ihji has fixed this by also adding `0.7.10` and others to the bin-tray repo).
This PR upgrades the SBT-ANTLR4 plugin and ANTLR4 to their most recent versions (`0.7.11`/`4.5.3`). I have also removed a few obsolete build configurations.
## How was this patch tested?
Manually running SBT/Maven builds.
Author: Herman van Hovell <hvanhovell@questtec.nl>
Closes#13299 from hvanhovell/SPARK-15525.
## What changes were proposed in this pull request?
Two changes:
- When things fail, `TRUNCATE TABLE` just returns nothing. Instead, we should throw exceptions.
- Remove `TRUNCATE TABLE ... COLUMN`, which was never supported by either Spark or Hive.
## How was this patch tested?
Jenkins.
Author: Andrew Or <andrew@databricks.com>
Closes#13302 from andrewor14/truncate-table.
## What changes were proposed in this pull request?
Add a warning log for the case that `numIterations * miniBatchFraction <1.0` during gradient descent. If the product of those two numbers is less than `1.0`, then not all training examples will be used during optimization. To put this concretely, suppose that `numExamples = 100`, `miniBatchFraction = 0.2` and `numIterations = 3`. Then, 3 iterations will occur each sampling approximately 6 examples each. In the best case, each of the 6 examples are unique; hence 18/100 examples are used.
This may be counter-intuitive to most users and led to the issue during the development of another Spark ML model: https://github.com/zhengruifeng/spark-libFM/issues/11. If a user actually does not require the training data set, it would be easier and more intuitive to use `RDD.sample`.
## How was this patch tested?
`build/mvn -DskipTests clean package` build succeeds
Author: Gio Borje <gborje@linkedin.com>
Closes#13265 from Hydrotoast/master.
## What changes were proposed in this pull request?
Some PySpark examples need a SparkContext and get it by accessing _sc directly from the session. These examples should use the provided property `sparkContext` in `SparkSession` instead.
## How was this patch tested?
Ran modified examples
Author: Bryan Cutler <cutlerb@gmail.com>
Closes#13303 from BryanCutler/pyspark-session-sparkContext-MINOR.
## What changes were proposed in this pull request?
Currently a method `submitStage()` for waiting stages is called on every iteration of the event loop in `DAGScheduler` to submit all waiting stages, but most of them are not necessary because they are not related to Stage status.
The case we should try to submit waiting stages is only when their parent stages are successfully completed.
This elimination can improve `DAGScheduler` performance.
## How was this patch tested?
Added some checks and other existing tests, and our projects.
We have a project bottle-necked by `DAGScheduler`, having about 2000 stages.
Before this patch the almost all execution time in `Driver` process was spent to process `submitStage()` of `dag-scheduler-event-loop` thread but after this patch the performance was improved as follows:
| | total execution time | `dag-scheduler-event-loop` thread time | `submitStage()` |
|--------|---------------------:|---------------------------------------:|----------------:|
| Before | 760 sec | 710 sec | 667 sec |
| After | 440 sec | 14 sec | 10 sec |
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes#12060 from ueshin/issues/SPARK-14269.
## What changes were proposed in this pull request?
Extra strategies does not work for streams because `IncrementalExecution` uses modified planner with stateful operations but it does not include extra strategies.
This pr fixes `IncrementalExecution` to include extra strategies to use them.
## How was this patch tested?
I added a test to check if extra strategies work for streams.
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes#13261 from ueshin/issues/SPARK-15483.
Remove "Default: MEMORY_AND_DISK" from `Param` doc field in ALS storage level params. This fixes up the output of `explainParam(s)` so that default values are not displayed twice.
We can revisit in the case that [SPARK-15130](https://issues.apache.org/jira/browse/SPARK-15130) moves ahead with adding defaults in some way to PySpark param doc fields.
Tests N/A.
Author: Nick Pentreath <nickp@za.ibm.com>
Closes#13277 from MLnick/SPARK-15500-als-remove-default-storage-param.
fixed typos for source code for components [mllib] [streaming] and [SQL]
None and obvious.
Author: lfzCarlosC <lfz.carlos@gmail.com>
Closes#13298 from lfzCarlosC/master.
## What changes were proposed in this pull request?
This PR fixes the following typos in log message and comments of `HadoopRDD.scala`. Also, this removes unused imports.
```scala
- logWarning("Caching NewHadoopRDDs as deserialized objects usually leads to undesired" +
+ logWarning("Caching HadoopRDDs as deserialized objects usually leads to undesired" +
...
- // since its not removed yet
+ // since it's not removed yet
```
## How was this patch tested?
Manual.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#13294 from dongjoon-hyun/minor_rdd_fix_log_message.
## What changes were proposed in this pull request?
This fixes the python SparkSession builder to allow setting confs correctly. This was a leftover TODO from https://github.com/apache/spark/pull/13200.
## How was this patch tested?
Python doc tests.
cc andrewor14
Author: Eric Liang <ekl@databricks.com>
Closes#13289 from ericl/spark-15520.
## What changes were proposed in this pull request?
Override the existing SparkContext is the provided SparkConf is different. PySpark part hasn't been fixed yet, will do that after the first round of review to ensure this is the correct approach.
## How was this patch tested?
Manually verify it in spark-shell.
rxin Please help review it, I think this is a very critical issue for spark 2.0
Author: Jeff Zhang <zjffdu@apache.org>
Closes#13160 from zjffdu/SPARK-15345.
## What changes were proposed in this pull request?
1. Making 'name' field of RDDInfo mutable.
2. In StorageListener: catching the fact that RDD's name was changed and updating it in RDDInfo.
## How was this patch tested?
1. Manual verification - the 'Storage' tab now behaves as expected.
2. The commit also contains a new unit test which verifies this.
Author: Lukasz <lgieron@gmail.com>
Closes#13264 from lgieron/SPARK-9044.
## What changes were proposed in this pull request?
This patch removes the last two commands defined in the catalyst module: DescribeFunction and ShowFunctions. They were unnecessary since the parser could just generate DescribeFunctionCommand and ShowFunctionsCommand directly.
## How was this patch tested?
Created a new SparkSqlParserSuite.
Author: Reynold Xin <rxin@databricks.com>
Closes#13292 from rxin/SPARK-15436.
## What changes were proposed in this pull request?
Under Upgrading From SparkR 1.5.x to 1.6.x section added the information, SparkSQL converts `NA` in R to `null`.
## How was this patch tested?
Document update, no tests.
Author: Krishna Kalyan <krishnakalyan3@gmail.com>
Closes#13268 from krishnakalyan3/spark-12071-1.
## What changes were proposed in this pull request?
PySpark: Add links to the predictors from the models in regression.py, improve linear and isotonic pydoc in minor ways.
User guide / R: Switch the installed package list to be enough to build the R docs on a "fresh" install on ubuntu and add sudo to match the rest of the commands.
User Guide: Add a note about using gem2.0 for systems with both 1.9 and 2.0 (e.g. some ubuntu but maybe more).
## How was this patch tested?
built pydocs locally, tested new user build instructions
Author: Holden Karau <holden@us.ibm.com>
Closes#13199 from holdenk/SPARK-15412-improve-linear-isotonic-regression-pydoc.
## What changes were proposed in this pull request?
`JavaKafkaStreamSuite.testKafkaStream` assumes when `sent.size == result.size`, the contents of `sent` and `result` should be same. However, that's not true. The content of `result` may not be the final content.
This PR modified the test to always retry the assertions even if the contents of `sent` and `result` are not same.
Here is the failure in Jenkins: http://spark-tests.appspot.com/tests/org.apache.spark.streaming.kafka.JavaKafkaStreamSuite/testKafkaStream
## How was this patch tested?
Jenkins unit tests.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#13281 from zsxwing/flaky-kafka-test.
## What changes were proposed in this pull request?
This PR fixes 3 slow tests:
1. `ParquetQuerySuite.read/write wide table`: This is not a good unit test as it runs more than 5 minutes. This PR removes it and add a new regression test in `CodeGenerationSuite`, which is more "unit".
2. `ParquetQuerySuite.returning batch for wide table`: reduce the threshold and use smaller data size.
3. `DatasetSuite.SPARK-14554: Dataset.map may generate wrong java code for wide table`: Improve `CodeFormatter.format`(introduced at https://github.com/apache/spark/pull/12979) can dramatically speed this it up.
## How was this patch tested?
N/A
Author: Wenchen Fan <wenchen@databricks.com>
Closes#13273 from cloud-fan/test.
## What changes were proposed in this pull request?
Currently if a table is used in join operation we rely on Metastore returned size to calculate if we can convert the operation to Broadcast join. This optimization only kicks in for table's that have the statistics available in metastore. Hive generally rolls over to HDFS if the statistics are not available directly from metastore and this seems like a reasonable choice to adopt given the optimization benefit of using broadcast joins.
## How was this patch tested?
I have executed queries locally to test.
Author: Parth Brahmbhatt <pbrahmbhatt@netflix.com>
Closes#13150 from Parth-Brahmbhatt/SPARK-15365.
## What changes were proposed in this pull request?
This patch renames various scheduler backends to make them consistent:
- LocalScheduler -> LocalSchedulerBackend
- AppClient -> StandaloneAppClient
- AppClientListener -> StandaloneAppClientListener
- SparkDeploySchedulerBackend -> StandaloneSchedulerBackend
- CoarseMesosSchedulerBackend -> MesosCoarseGrainedSchedulerBackend
- MesosSchedulerBackend -> MesosFineGrainedSchedulerBackend
## How was this patch tested?
Updated test cases to reflect the name change.
Author: Reynold Xin <rxin@databricks.com>
Closes#13288 from rxin/SPARK-15518.
## What changes were proposed in this pull request?
Previously, SPARK-8893 added the constraints on positive number of partitions for repartition/coalesce operations in general. This PR adds one missing part for that and adds explicit two testcases.
**Before**
```scala
scala> sc.parallelize(1 to 5).coalesce(0)
java.lang.IllegalArgumentException: requirement failed: Number of partitions (0) must be positive.
...
scala> sc.parallelize(1 to 5).repartition(0).collect()
res1: Array[Int] = Array() // empty
scala> spark.sql("select 1").coalesce(0)
res2: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [1: int]
scala> spark.sql("select 1").coalesce(0).collect()
java.lang.IllegalArgumentException: requirement failed: Number of partitions (0) must be positive.
scala> spark.sql("select 1").repartition(0)
res3: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [1: int]
scala> spark.sql("select 1").repartition(0).collect()
res4: Array[org.apache.spark.sql.Row] = Array() // empty
```
**After**
```scala
scala> sc.parallelize(1 to 5).coalesce(0)
java.lang.IllegalArgumentException: requirement failed: Number of partitions (0) must be positive.
...
scala> sc.parallelize(1 to 5).repartition(0)
java.lang.IllegalArgumentException: requirement failed: Number of partitions (0) must be positive.
...
scala> spark.sql("select 1").coalesce(0)
java.lang.IllegalArgumentException: requirement failed: Number of partitions (0) must be positive.
...
scala> spark.sql("select 1").repartition(0)
java.lang.IllegalArgumentException: requirement failed: Number of partitions (0) must be positive.
...
```
## How was this patch tested?
Pass the Jenkins tests with new testcases.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#13282 from dongjoon-hyun/SPARK-15512.
## What changes were proposed in this pull request?
If the user relies on the schema to be inferred in file streams can break easily for multiple reasons
- accidentally running on a directory which has no data
- schema changing underneath
- on restart, the query will infer schema again, and may unexpectedly infer incorrect schema, as the file in the directory may be different at the time of the restart.
To avoid these complicated scenarios, for Spark 2.0, we are going to disable schema inferencing by default with a config, so that user is forced to consider explicitly what is the schema it wants, rather than the system trying to infer it and run into weird corner cases.
In this PR, I introduce a SQLConf that determines whether schema inference for file streams is allowed or not. It is disabled by default.
## How was this patch tested?
Updated unit tests that test error behavior with and without schema inference enabled.
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#13238 from tdas/SPARK-15458.
This PR adds a note to clarify that the ML API for ALS only supports integers for user/item ids, and that other types for these columns can be used but the ids must fall within integer range.
(Refer [SPARK-14891](https://issues.apache.org/jira/browse/SPARK-14891)).
Also cleaned up a reference to `mllib` in the ML doc.
## How was this patch tested?
Built and viewed User Guide doc locally.
Author: Nick Pentreath <nickp@za.ibm.com>
Closes#13278 from MLnick/SPARK-15502-als-int-id-doc-note.
## What changes were proposed in this pull request?
This PR fixes some obsolete comments and assertion in `takeSample` testcase of `RDDSuite.scala`.
## How was this patch tested?
This fixes the testcase only.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#13260 from dongjoon-hyun/SPARK-15481.
## What changes were proposed in this pull request?
spark.sql("CREATE FUNCTION myfunc AS 'com.haizhi.bdp.udf.UDFGetGeoCode'") throws "org.apache.hadoop.hive.ql.metadata.HiveException:MetaException(message:NoSuchObjectException(message:Function default.myfunc does not exist))" with hive 1.2.1.
I think it is introduced by pr #12853. Fixing it by catching Exception (not NoSuchObjectException) and string matching.
## How was this patch tested?
added a unit test and also tested it manually
Author: wangyang <wangyang@haizhi.com>
Closes#13177 from wangyang1992/fixCreateFunc2.
We only need one copy of it. The client code that was uploading the
second copy just needs to be modified to update the metadata in the
cache, so that the AM knows where to find the configuration.
Tested by running app on YARN and verifying in the logs only one archive
is uploaded.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#13232 from vanzin/SPARK-15405.