## What changes were proposed in this pull request?
The `checkpointInterval` is a non-expert param. This PR moves its setter to non-expert group.
Author: Xiangrui Meng <meng@databricks.com>
Closes#13813 from mengxr/checkpoint-non-expert.
## What changes were proposed in this pull request?
Add a configuration to allow people to set a minimum polling delay when no new data arrives (default is 10ms). This PR also cleans up some INFO logs.
## How was this patch tested?
Existing unit tests.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#13718 from zsxwing/SPARK-16002.
## What changes were proposed in this pull request?
This PR migrates some test cases introduced in #12313 as a follow-up of #13754 and #13766. These test cases cover `DataFrameWriter.insertInto()`, while the former two only cover SQL `INSERT` statements.
Note that the `testPartitionedTable` utility method tests both Hive SerDe tables and data source tables.
## How was this patch tested?
N/A
Author: Cheng Lian <lian@databricks.com>
Closes#13810 from liancheng/spark-16037-follow-up-tests.
## What changes were proposed in this pull request?
Several places set the seed Param default value to None which will translate to a zero value on the Scala side. This is unnecessary because a default fixed value already exists and if a test depends on a zero valued seed, then it should explicitly set it to zero instead of relying on this translation. These cases can be safely removed except for the ALS doc test, which has been changed to set the seed value to zero.
## How was this patch tested?
Ran PySpark tests locally
Author: Bryan Cutler <cutlerb@gmail.com>
Closes#13672 from BryanCutler/pyspark-cleanup-setDefault-seed-SPARK-15741.
## What changes were proposed in this pull request?
Found these issues while reviewing for SPARK-16090
## How was this patch tested?
roxygen2 doc gen, checked output html
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#13803 from felixcheung/rdocrd.
## What changes were proposed in this pull request?
This PR allows us to create a Row without any fields.
## How was this patch tested?
Added a test for empty row and udf without arguments.
Author: Davies Liu <davies@databricks.com>
Closes#13812 from davies/no_argus.
This makes sure the files are in the executor's classpath as they're
expected to be. Also update the unit test to make sure the files are
there as expected.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#13792 from vanzin/SPARK-16080.
## What changes were proposed in this pull request?
This is a follow-up to https://github.com/apache/spark/pull/13795 to properly set CSV options in Python API. As part of this, I also make the Python option setting for both CSV and JSON more robust against positional errors.
## How was this patch tested?
N/A
Author: Reynold Xin <rxin@databricks.com>
Closes#13800 from rxin/SPARK-13792-2.
## What changes were proposed in this pull request?
This PR is a subset of #13023 by yanboliang to make SparkR model param names and default values consistent with MLlib. I tried to avoid other changes from #13023 to keep this PR minimal. I will send a follow-up PR to improve the documentation.
Main changes:
* `spark.glm`: epsilon -> tol, maxit -> maxIter
* `spark.kmeans`: default k -> 2, default maxIter -> 20, default initMode -> "k-means||"
* `spark.naiveBayes`: laplace -> smoothing, default 1.0
## How was this patch tested?
Existing unit tests.
Author: Xiangrui Meng <meng@databricks.com>
Closes#13801 from mengxr/SPARK-15177.1.
## What changes were proposed in this pull request?
1. FORMATTED is actually supported, but partition is not supported;
2. Remove parenthesis as it is not necessary just like anywhere else.
## How was this patch tested?
Minor issue. I do not think it needs a test case!
Author: bomeng <bmeng@us.ibm.com>
Closes#13791 from bomeng/SPARK-16084.
## What changes were proposed in this pull request?
jira: https://issues.apache.org/jira/browse/SPARK-16045
2.0 Audit: Update document for StopWordsRemover and Binarizer.
## How was this patch tested?
manual review for doc
Author: Yuhao Yang <hhbyyh@gmail.com>
Author: Yuhao Yang <yuhao.yang@intel.com>
Closes#13375 from hhbyyh/stopdoc.
This PR adds missing `Since` annotations to `ml.feature` package.
Closes#8505.
## How was this patch tested?
Existing tests.
Author: Nick Pentreath <nickp@za.ibm.com>
Closes#13641 from MLnick/add-since-annotations.
## What changes were proposed in this pull request?
I ran a full pass from A to Z and fixed the obvious duplications, improper grouping etc.
There are still more doc issues to be cleaned up.
## How was this patch tested?
manual tests
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#13798 from felixcheung/rdocseealso.
## What changes were proposed in this pull request?
Update docs for two parameters `spark.sql.files.maxPartitionBytes` and `spark.sql.files.openCostInBytes ` in Other Configuration Options.
## How was this patch tested?
N/A
Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>
Closes#13797 from maropu/SPARK-15894-2.
## What changes were proposed in this pull request?
Update doc as per discussion in PR #13592
## How was this patch tested?
manual
shivaram liancheng
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#13799 from felixcheung/rsqlprogrammingguide.
This has changed from 1.6, and now stores memory off-heap using spark's off-heap support instead of in tachyon.
Author: Eric Liang <ekl@databricks.com>
Closes#13744 from ericl/spark-16025.
## What changes were proposed in this pull request?
This PR makes `input_file_name()` function return the file paths not empty strings for external data sources based on `NewHadoopRDD`, such as [spark-redshift](cba5eee1ab/src/main/scala/com/databricks/spark/redshift/RedshiftRelation.scala (L149)) and [spark-xml](https://github.com/databricks/spark-xml/blob/master/src/main/scala/com/databricks/spark/xml/util/XmlFile.scala#L39-L47).
The codes with the external data sources below:
```scala
df.select(input_file_name).show()
```
will produce
- **Before**
```
+-----------------+
|input_file_name()|
+-----------------+
| |
+-----------------+
```
- **After**
```
+--------------------+
| input_file_name()|
+--------------------+
|file:/private/var...|
+--------------------+
```
## How was this patch tested?
Unit tests in `ColumnExpressionSuite`.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#13759 from HyukjinKwon/SPARK-16044.
## What changes were proposed in this pull request?
Both VectorUDT and MatrixUDT are private APIs, because UserDefinedType itself is private in Spark. However, in order to let developers implement their own transformers and estimators, we should expose both types in a public API to simply the implementation of transformSchema, transform, etc. Otherwise, they need to get the data types using reflection.
## How was this patch tested?
Unit tests in Scala and Java.
Author: Xiangrui Meng <meng@databricks.com>
Closes#13789 from mengxr/SPARK-16074.
#### What changes were proposed in this pull request?
This PR is to fix the following bugs:
**Issue 1: Wrong Results when lowerBound is larger than upperBound in Column Partitioning**
```scala
spark.read.jdbc(
url = urlWithUserAndPass,
table = "TEST.seq",
columnName = "id",
lowerBound = 4,
upperBound = 0,
numPartitions = 3,
connectionProperties = new Properties)
```
**Before code changes:**
The returned results are wrong and the generated partitions are wrong:
```
Part 0 id < 3 or id is null
Part 1 id >= 3 AND id < 2
Part 2 id >= 2
```
**After code changes:**
Issue an `IllegalArgumentException` exception:
```
Operation not allowed: the lower bound of partitioning column is larger than the upper bound. lowerBound: 5; higherBound: 1
```
**Issue 2: numPartitions is more than the number of key values between upper and lower bounds**
```scala
spark.read.jdbc(
url = urlWithUserAndPass,
table = "TEST.seq",
columnName = "id",
lowerBound = 1,
upperBound = 5,
numPartitions = 10,
connectionProperties = new Properties)
```
**Before code changes:**
Returned correct results but the generated partitions are very inefficient, like:
```
Partition 0: id < 1 or id is null
Partition 1: id >= 1 AND id < 1
Partition 2: id >= 1 AND id < 1
Partition 3: id >= 1 AND id < 1
Partition 4: id >= 1 AND id < 1
Partition 5: id >= 1 AND id < 1
Partition 6: id >= 1 AND id < 1
Partition 7: id >= 1 AND id < 1
Partition 8: id >= 1 AND id < 1
Partition 9: id >= 1
```
**After code changes:**
Adjust `numPartitions` and can return the correct answers:
```
Partition 0: id < 2 or id is null
Partition 1: id >= 2 AND id < 3
Partition 2: id >= 3 AND id < 4
Partition 3: id >= 4
```
**Issue 3: java.lang.ArithmeticException when numPartitions is zero**
```Scala
spark.read.jdbc(
url = urlWithUserAndPass,
table = "TEST.seq",
columnName = "id",
lowerBound = 0,
upperBound = 4,
numPartitions = 0,
connectionProperties = new Properties)
```
**Before code changes:**
Got the following exception:
```
java.lang.ArithmeticException: / by zero
```
**After code changes:**
Able to return a correct answer by disabling column partitioning when numPartitions is equal to or less than zero
#### How was this patch tested?
Added test cases to verify the results
Author: gatorsmile <gatorsmile@gmail.com>
Closes#13773 from gatorsmile/jdbcPartitioning.
## What changes were proposed in this pull request?
This pull request adds a new option (maxMalformedLogPerPartition) in CSV reader to limit the maximum of logging message Spark generates per partition for malformed records.
The error log looks something like
```
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: More than 10 malformed records have been found on this partition. Malformed records from now on will not be logged.
```
Closes#12173
## How was this patch tested?
Manually tested.
Author: Reynold Xin <rxin@databricks.com>
Closes#13795 from rxin/SPARK-13792.
## What changes were proposed in this pull request?
This PR adds `pivot` function to SparkR for API parity. Since this PR is based on https://github.com/apache/spark/pull/13295 , mhnatiuk should be credited for the work he did.
## How was this patch tested?
Pass the Jenkins tests (including new testcase.)
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#13786 from dongjoon-hyun/SPARK-15294.
Fix the bug for Python UDF that does not have any arguments.
Added regression tests.
Author: Davies Liu <davies.liu@gmail.com>
Closes#13793 from davies/fix_no_arguments.
(cherry picked from commit abe36c53d1)
Signed-off-by: Davies Liu <davies.liu@gmail.com>
## What changes were proposed in this pull request?
Removed unnecessary duplicated documentation in dapply and dapplyCollect.
In this pull request I created separate R docs for dapply and dapplyCollect - kept dapply's documentation separate from dapplyCollect's and referred from one to another via a link.
## How was this patch tested?
Existing test cases.
Author: Narine Kokhlikyan <narine@slice.com>
Closes#13790 from NarineK/dapply-docs-fix.
## What changes were proposed in this pull request?
Fixed missing import for DecisionTreeRegressionModel used in GBTClassificationModel trees method.
## How was this patch tested?
Local tests
Author: Bryan Cutler <cutlerb@gmail.com>
Closes#13787 from BryanCutler/pyspark-GBTClassificationModel-import-SPARK-16079.
## What changes were proposed in this pull request?
The property spark.streaming.stateStore.maintenanceInterval should be renamed and harmonized with other properties related to Structured Streaming like spark.sql.streaming.stateStore.minDeltasForSnapshot.
## How was this patch tested?
Existing unit tests.
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes#13777 from sarutak/SPARK-16061.
## What changes were proposed in this pull request?
Issues with current reader behavior.
- `text()` without args returns an empty DF with no columns -> inconsistent, its expected that text will always return a DF with `value` string field,
- `textFile()` without args fails with exception because of the above reason, it expected the DF returned by `text()` to have a `value` field.
- `orc()` does not have var args, inconsistent with others
- `json(single-arg)` was removed, but that caused source compatibility issues - [SPARK-16009](https://issues.apache.org/jira/browse/SPARK-16009)
- user specified schema was not respected when `text/csv/...` were used with no args - [SPARK-16007](https://issues.apache.org/jira/browse/SPARK-16007)
The solution I am implementing is to do the following.
- For each format, there will be a single argument method, and a vararg method. For json, parquet, csv, text, this means adding json(string), etc.. For orc, this means adding orc(varargs).
- Remove the special handling of text(), csv(), etc. that returns empty dataframe with no fields. Rather pass on the empty sequence of paths to the datasource, and let each datasource handle it right. For e.g, text data source, should return empty DF with schema (value: string)
- Deduped docs and fixed their formatting.
## How was this patch tested?
Added new unit tests for Scala and Java tests
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#13727 from tdas/SPARK-15982.
## What changes were proposed in this pull request?
Initial SQL programming guide update for Spark 2.0. Contents like 1.6 to 2.0 migration guide are still incomplete.
We may also want to add more examples for Scala/Java Dataset typed transformations.
## How was this patch tested?
N/A
Author: Cheng Lian <lian@databricks.com>
Closes#13592 from liancheng/sql-programming-guide-2.0.
## What changes were proposed in this pull request?
This PR adds `since` tags to Roxygen documentation according to the previous documentation archive.
https://home.apache.org/~dongjoon/spark-2.0.0-docs/api/R/
## How was this patch tested?
Manual.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#13734 from dongjoon-hyun/SPARK-14995.
## What changes were proposed in this pull request?
roxygen2 doc, programming guide, example updates
## How was this patch tested?
manual checks
shivaram
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#13751 from felixcheung/rsparksessiondoc.
## What changes were proposed in this pull request?
This PR adds `spark_partition_id` virtual column function in SparkR for API parity.
The following is just an example to illustrate a SparkR usage on a partitioned parquet table created by `spark.range(10).write.mode("overwrite").parquet("/tmp/t1")`.
```r
> collect(select(read.parquet('/tmp/t1'), c('id', spark_partition_id())))
id SPARK_PARTITION_ID()
1 3 0
2 4 0
3 8 1
4 9 1
5 0 2
6 1 3
7 2 4
8 5 5
9 6 6
10 7 7
```
## How was this patch tested?
Pass the Jenkins tests (including new testcase).
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#13768 from dongjoon-hyun/SPARK-16053.
## What changes were proposed in this pull request?
fix code doc
## How was this patch tested?
manual
shivaram
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#13782 from felixcheung/rcountdoc.
## What changes were proposed in this pull request?
spark.lapply and setLogLevel
## How was this patch tested?
unit test
shivaram thunterdb
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#13752 from felixcheung/rlapply.
## What changes were proposed in this pull request?
This issue adds `read.orc/write.orc` to SparkR for API parity.
## How was this patch tested?
Pass the Jenkins tests (with new testcases).
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#13763 from dongjoon-hyun/SPARK-16051.
## What changes were proposed in this pull request?
Add dropTempView and deprecate dropTempTable
## How was this patch tested?
unit tests
shivaram liancheng
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#13753 from felixcheung/rdroptempview.
## What changes were proposed in this pull request?
This PR adds `monotonically_increasing_id` column function in SparkR for API parity.
After this PR, SparkR supports the followings.
```r
> df <- read.json("examples/src/main/resources/people.json")
> collect(select(df, monotonically_increasing_id(), df$name, df$age))
monotonically_increasing_id() name age
1 0 Michael NA
2 1 Andy 30
3 2 Justin 19
```
## How was this patch tested?
Pass the Jenkins tests (with added testcase).
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#13774 from dongjoon-hyun/SPARK-16059.
## What changes were proposed in this pull request?
ConsoleSinkSuite just collects content from stdout and compare them with the expected string. However, because Spark may not stop some background threads at once, there is a race condition that other threads are outputting logs to **stdout** while ConsoleSinkSuite is running. Then it will make ConsoleSinkSuite fail.
Therefore, I just deleted `ConsoleSinkSuite`. If we want to test ConsoleSinkSuite in future, we should refactoring ConsoleSink to make it testable instead of depending on stdout. Therefore, this test is useless and I just delete it.
## How was this patch tested?
Just removed a flaky test.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#13776 from zsxwing/SPARK-16050.
## What changes were proposed in this pull request?
This PR adds the static partition support to INSERT statement when the target table is a data source table.
## How was this patch tested?
New tests in InsertIntoHiveTableSuite and DataSourceAnalysisSuite.
**Note: This PR is based on https://github.com/apache/spark/pull/13766. The last commit is the actual change.**
Author: Yin Huai <yhuai@databricks.com>
Closes#13769 from yhuai/SPARK-16030-1.
## What changes were proposed in this pull request?
This patch adds a text-based socket source similar to the one in Spark Streaming for debugging and tutorials. The source is clearly marked as debug-only so that users don't try to run it in production applications, because this type of source cannot provide HA without storing a lot of state in Spark.
## How was this patch tested?
Unit tests and manual tests in spark-shell.
Author: Matei Zaharia <matei@databricks.com>
Closes#13748 from mateiz/socket-source.
## What changes were proposed in this pull request?
In the 2.0 document, Line "A full example that produces the experiment described in the PIC paper can be found under examples/." is redundant.
There is already "Find full example code at "examples/src/main/scala/org/apache/spark/examples/mllib/PowerIterationClusteringExample.scala" in the Spark repo.".
We should remove the first line, which is consistent with other documents.
## How was this patch tested?
(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
Manual test
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes#13755 from wangmiao1981/doc.
## What changes were proposed in this pull
(Paste from JIRA issue.)
As a follow up for SPARK-15697, I have following semantics for `:reset` command.
On `:reset` we forget all that user has done but not the initialization of spark. To avoid confusion or make it more clear, we show the message `spark` and `sc` are not erased, infact they are in same state as they were left by previous operations done by the user.
While doing above, somewhere I felt that this is not usually what reset means. But an accidental shutdown of a cluster can be very costly, so may be in that sense this is less surprising and still useful.
## How was this patch tested?
Manually, by calling `:reset` command, by both altering the state of SparkContext and creating some local variables.
Author: Prashant Sharma <prashant@apache.org>
Author: Prashant Sharma <prashsh1@in.ibm.com>
Closes#13661 from ScrapCodes/repl-reset-command.
## What changes were proposed in this pull request?
Internally, we use Int to represent a date (the days since 1970-01-01), when we convert that into unix timestamp (milli-seconds since epoch in UTC), we get the offset of a timezone using local millis (the milli-seconds since 1970-01-01 in a timezone), but TimeZone.getOffset() expect unix timestamp, the result could be off by one hour (in Daylight Saving Time (DST) or not).
This PR change to use best effort approximate of posix timestamp to lookup the offset. In the event of changing of DST, Some time is not defined (for example, 2016-03-13 02:00:00 PST), or could lead to multiple valid result in UTC (for example, 2016-11-06 01:00:00), this best effort approximate should be enough in practice.
## How was this patch tested?
Added regression tests.
Author: Davies Liu <davies@databricks.com>
Closes#13652 from davies/fix_timezone.
## What changes were proposed in this pull request?
`DataFrameWriter` can be used to append data to existing data source tables. It becomes tricky when partition columns used in `DataFrameWriter.partitionBy(columns)` don't match the actual partition columns of the underlying table. This pull request enforces the check so that the partition columns of these two always match.
## How was this patch tested?
Unit test.
Author: Sean Zhong <seanzhong@databricks.com>
Closes#13749 from clockfly/SPARK-16034.
## What changes were proposed in this pull request?
The current table insertion has some weird behaviours:
1. inserting into a partitioned table with mismatch columns has confusing error message for hive table, and wrong result for datasource table
2. inserting into a partitioned table without partition list has wrong result for hive table.
This PR fixes these 2 problems.
## How was this patch tested?
new test in hive `SQLQuerySuite`
Author: Wenchen Fan <wenchen@databricks.com>
Closes#13754 from cloud-fan/insert2.
*This contribution is my original work and that I license the work to the project under the project's open source license.*
## What changes were proposed in this pull request?
Documentation updates to PySpark's GroupedData
## How was this patch tested?
Manual Tests
Author: Josh Howes <josh.howes@gmail.com>
Author: Josh Howes <josh.howes@maxpoint.com>
Closes#13724 from josh-howes/bugfix/SPARK-15973.
## What changes were proposed in this pull request?
Improve readability of `InMemoryTableScanExec.scala`, which has too much stuff in it.
## How was this patch tested?
Jenkins
Author: Andrew Or <andrew@databricks.com>
Closes#13742 from andrewor14/move-inmemory-relation.
## What changes were proposed in this pull request?
Support with statement syntax for SparkSession in pyspark
## How was this patch tested?
Manually verify it. Although I can add unit test for it, it would affect other unit test because the SparkContext is stopped after the with statement.
Author: Jeff Zhang <zjffdu@apache.org>
Closes#13541 from zjffdu/SPARK-15803.