## What changes were proposed in this pull request?
Python API for VectorSizeHint Transformer.
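A minimal usage sketch of the new transformer; the SparkSession `spark`, the DataFrame, and the column names below are illustrative, not taken from this patch:
```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, VectorSizeHint
from pyspark.ml.linalg import Vectors

# Illustrative data: a vector column plus a scalar column.
df = spark.createDataFrame([(Vectors.dense([1.0, 2.0, 3.0]), 4.0)], ["vec", "x"])

# VectorSizeHint declares the size of "vec" so downstream stages such as
# VectorAssembler can work with it without inspecting the data, e.g. on streams.
size_hint = VectorSizeHint(inputCol="vec", size=3, handleInvalid="error")
assembler = VectorAssembler(inputCols=["vec", "x"], outputCol="features")
Pipeline(stages=[size_hint, assembler]).fit(df).transform(df).show()
```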
## How was this patch tested?
doc-tests.
Author: Bago Amirbekian <bago@databricks.com>
Closes#20112 from MrBago/vectorSizeHint-PythonAPI.
## What changes were proposed in this pull request?
Adding a `fitMultiple` API to `Estimator` with a default implementation, and updating the ml.tuning meta-estimators to use this API.
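A hedged sketch of calling the new API directly; the SparkSession `spark`, the training data, and the param values are illustrative:
```python
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

train = spark.createDataFrame(
    [(1.0, Vectors.dense([0.0, 1.1])), (0.0, Vectors.dense([2.0, 1.0]))],
    ["label", "features"])

lr = LogisticRegression(maxIter=5)
param_maps = [{lr.regParam: 0.1}, {lr.regParam: 0.01}]

# fitMultiple returns a thread-safe iterator of (index, model) pairs, not
# necessarily in input order; this is what the ml.tuning meta-estimators consume.
for index, model in lr.fitMultiple(train, param_maps):
    print(index, model.coefficients)
```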
## How was this patch tested?
Unit tests.
Author: Bago Amirbekian <bago@databricks.com>
Closes#20058 from MrBago/python-fitMultiple.
## What changes were proposed in this pull request?
Add SQL functions.
## How was this patch tested?
manual, unit tests
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes#20105 from felixcheung/rsqlfuncs.
## What changes were proposed in this pull request?
Make sure model data is stored in order. cc WeichenXu123
## How was this patch tested?
existing tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#20113 from zhengruifeng/gmm_save.
The scheduled task was racing with the test code and could influence
the values returned to the test, triggering assertion failures. The change adds
a new config that is only used during testing, and overrides it
on the affected test suite.
The issue in the bug can be reliably reproduced by reducing the interval
in the test (e.g. to 10ms).
While there, fixed an exception that shows up in the logs while these
tests run, and simplified some code (which was also causing misleading
log messages in the log output of the test).
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#20050 from vanzin/SPARK-22864.
This change adds a new configuration option and support code that limits
how much disk space the SHS will use. The default value is pretty generous
so that applications will, hopefully, only rarely need to be replayed
because of their disk stores being evicted.
This works by keeping track of how much data each application is using.
Also, because it's not possible to know, before replaying, how much space
will be needed, it's possible that usage will exceed the configured limit
temporarily. The code uses the concept of a "lease" to try to limit how
much the SHS will exceed the limit in those cases.
Active UIs are also tracked, so they're never deleted. This works in
tandem with the existing option of how many active UIs are loaded; because
unused UIs will be unloaded, their disk stores will also become candidates
for deletion. If the data is not deleted, though, re-loading the UI is
pretty quick.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#20011 from vanzin/SPARK-20654.
## What changes were proposed in this pull request?
This is a follow-up PR of #19587.
If `xmlrunner` is installed, `VectorizedUDFTests.test_vectorized_udf_check_config` fails with the following error, because `self`, which is a subclass of `unittest.TestCase`, is referenced in the UDF `check_records_per_batch` and can't be pickled anymore.
```
PicklingError: Cannot pickle files that are not opened for reading: w
```
This change makes the UDF not refer to `self`.
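A hedged sketch of the pattern (the names below are illustrative, not the actual test code): bind plain values to locals so the UDF closure no longer captures the unpicklable `TestCase` instance.
```python
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

def build_check_udf(test_case):
    # Copy the plain value out of `self` / the TestCase; the lambda then only
    # closes over `expected`, which pickles fine. `records_per_batch` is a
    # hypothetical attribute used for illustration.
    expected = test_case.records_per_batch
    return udf(lambda n: n <= expected, BooleanType())
```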
## How was this patch tested?
Tested locally.
Author: Takuya UESHIN <ueshin@databricks.com>
Closes#20115 from ueshin/issues/SPARK-22370_fup1.
In general, JIRAs are assigned to the original reporter or one of
the commenters. This updates the merge script to give you a simple
choice to do that, so you don't have to do it manually.
Author: Imran Rashid <irashid@cloudera.com>
Closes#20107 from squito/SPARK-22921.
## What changes were proposed in this pull request?
The issue has been raised in two JIRA tickets: [SPARK-21657](https://issues.apache.org/jira/browse/SPARK-21657) and [SPARK-16998](https://issues.apache.org/jira/browse/SPARK-16998). Basically, what happens is that in collection generators like explode/inline we create many rows from each row, and currently each exploded row also contains the column on which it was created. As a result, a 10k-element array in one row gets copied 10k times, once into each generated row, which leads to quadratic memory consumption. However, it is a common case that the original column gets projected out after the explode, so we can avoid duplicating it.
In this solution we propose to identify this situation in the optimizer and turn on a flag for omitting the original column in the generation process.
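A hedged illustration of the query shape this optimization targets (not the optimizer change itself; assumes a SparkSession `spark`):
```python
from pyspark.sql.functions import explode

# One row carrying a large array; explode produces one output row per element.
df = spark.createDataFrame([(1, list(range(10000)))], ["id", "arr"])

# The source column "arr" is projected out after the explode, so it no longer
# needs to be copied into each of the 10k generated rows.
exploded = df.select("id", explode("arr").alias("value"))
exploded.count()
```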
## How was this patch tested?
1. We added a benchmark test to MiscBenchmark that shows a 16x improvement in runtime.
2. We ran some of the other tests in MiscBenchmark and they show 15% improvements.
3. We ran this code on a specific case from our production data with rows containing arrays of size ~200k and it reduced the runtime from 6 hours to 3 mins.
Author: oraviv <oraviv@paypal.com>
Author: uzadude <ohad.raviv@gmail.com>
Author: uzadude <15645757+uzadude@users.noreply.github.com>
Closes#19683 from uzadude/optimize_explode.
## What changes were proposed in this pull request?
When there are no broadcast hints, the current Spark strategies prefer building the right side, without considering the sizes of the two tables. This patch adds logic to take both tables' sizes into account when choosing the build side. To make the logic clear, the build side is determined as follows (see the sketch after this list):
1. If there are broadcast hints, the build side is determined by `broadcastSideByHints`;
2. If there are no broadcast hints, the build side is determined by `broadcastSideBySizes`;
3. If broadcasting is disabled by the config, it falls back to the next cases.
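A hedged sketch of the two cases; the DataFrames `large_df` and `small_df` and the join column "key" are illustrative:
```python
from pyspark.sql.functions import broadcast

# 1. With an explicit hint, the hinted side is chosen as the build side.
hinted = large_df.join(broadcast(small_df), "key")

# 2. Without a hint, the side with the smaller size estimate is preferred,
#    subject to spark.sql.autoBroadcastJoinThreshold.
unhinted = large_df.join(small_df, "key")
```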
## How was this patch tested?
Author: Feng Liu <fengliu@databricks.com>
Closes#20099 from liufengdb/fix-spark-strategies.
## What changes were proposed in this pull request?
Simplify some estimation logic by using double instead of decimal.
## How was this patch tested?
Existing tests.
Author: Zhenhua Wang <wangzhenhua@huawei.com>
Closes#20062 from wzhfy/simplify_by_double.
## What changes were proposed in this pull request?
After #19474, the children of insertion commands are missing in the UI.
To fix it:
1. Create a new physical plan `DataWritingCommandExec` to execute `DataWritingCommand` with children, so that other commands won't be affected.
2. On creation of a `DataWritingCommand`, a new field `allColumns` must be specified, which is the output of the analyzed plan.
3. In `FileFormatWriter`, the output schema will use `allColumns` instead of the output of the optimized plan.
Before code changes:
![2017-12-19 10 27 10](https://user-images.githubusercontent.com/1097932/34161850-d2fd0acc-e50c-11e7-898a-177154fe7d8e.png)
After code changes:
![2017-12-19 10 27 04](https://user-images.githubusercontent.com/1097932/34161865-de23de26-e50c-11e7-9131-0c32f7b7b749.png)
## How was this patch tested?
Unit test
Author: Wang Gengliang <ltnwgl@gmail.com>
Closes#20020 from gengliangwang/insert.
## What changes were proposed in this pull request?
This is to work around the Hive issue: https://issues.apache.org/jira/browse/HIVE-11935
## How was this patch tested?
Author: Feng Liu <fengliu@databricks.com>
Closes#20109 from liufengdb/synchronized.
## What changes were proposed in this pull request?
This PR explicitly imports the missing `warnings` in `flume.py`.
## How was this patch tested?
Manually tested.
```python
>>> import warnings
>>> warnings.simplefilter('always', DeprecationWarning)
>>> from pyspark.streaming import flume
>>> flume.FlumeUtils.createStream(None, None, None)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/.../spark/python/pyspark/streaming/flume.py", line 60, in createStream
warnings.warn(
NameError: global name 'warnings' is not defined
```
```python
>>> import warnings
>>> warnings.simplefilter('always', DeprecationWarning)
>>> from pyspark.streaming import flume
>>> flume.FlumeUtils.createStream(None, None, None)
/.../spark/python/pyspark/streaming/flume.py:65: DeprecationWarning: Deprecated in 2.3.0. Flume support is deprecated as of Spark 2.3.0. See SPARK-22142.
DeprecationWarning)
...
```
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#20110 from HyukjinKwon/SPARK-22313-followup.
## What changes were proposed in this pull request?
Currently, in `ChiSqSelectorModel`, save:
```
spark.createDataFrame(dataArray).repartition(1).write...
```
The default partition number used by `createDataFrame` is "defaultParallelism", and the current `RoundRobinPartitioning` does not guarantee that the `repartition` produces the same ordering as the local array. We need to fix it.
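A hedged PySpark illustration of the ordering concern (not the actual fix, which is on the Scala side; assumes a SparkSession `spark`):
```python
data = [(i, float(i) * 10) for i in range(5)]

# repartition(1) goes through a round-robin shuffle, so the row order of the
# local array is not guaranteed in the single output partition.
shuffled = spark.createDataFrame(data, ["idx", "value"]).repartition(1)

# Parallelizing into a single partition up front keeps the original order.
ordered = spark.createDataFrame(
    spark.sparkContext.parallelize(data, numSlices=1), ["idx", "value"])
```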
## How was this patch tested?
N/A
Author: WeichenXu <weichen.xu@databricks.com>
Closes#20088 from WeichenXu123/fix_chisq_model_save.
## What changes were proposed in this pull request?
Escaping of the escape character should be supported when using the UniVocity CSV encoding/decoding library.
Ref: https://github.com/uniVocity/univocity-parsers#escaping-quote-escape-characters
One option is added for reading and writing CSV: `escapeQuoteEscaping`
## How was this patch tested?
Unit test added.
Author: soonmok-kwon <soonmok.kwon@navercorp.com>
Closes#20004 from ep1804/SPARK-22818.
This change adds a new launcher that allows applications to be run
in a separate thread in the same process as the calling code. To
achieve that, some code from the child process implementation was
moved to abstract classes that implement the common functionality,
and the new launcher inherits from those.
The new launcher was added as a new class, instead of being implemented
as a new option on the existing SparkLauncher, to avoid ambiguous
APIs. For example, SparkLauncher has ways to set the child app's
environment, modify SPARK_HOME, or control the logging of the
child process, none of which apply to in-process apps.
The in-process launcher has limitations: it needs Spark in the
context class loader of the calling thread, and it's bound by
Spark's current limitation of a single client-mode application
per JVM. It also relies on the recently added SparkApplication
trait to make sure different apps don't mess up each other's
configuration, so config isolation is currently limited to cluster mode.
I also chose to keep the same socket-based communication for in-process
apps, even though it might be possible to avoid it for in-process
mode. That helps both implementations share more code.
Tested with new and existing unit tests, and with a simple app that
uses the launcher; also made sure the app ran fine with older launcher
jar to check binary compatibility.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#19591 from vanzin/SPARK-11035.
## What changes were proposed in this pull request?
Test Coverage for `DateTimeOperations`, this is a Sub-tasks for [SPARK-22722](https://issues.apache.org/jira/browse/SPARK-22722).
## How was this patch tested?
N/A
Author: Yuming Wang <wgyumg@gmail.com>
Closes#20061 from wangyum/SPARK-22890.
Port code from the old executors listener to the new one, so that
the driver logs present in the application start event are kept.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#20038 from vanzin/SPARK-22836.
## What changes were proposed in this pull request?
Add tarLongFileMode=posix configuration for the assembly plugin
## How was this patch tested?
Reran build successfully
```
./build/mvn package -Pbigtop-dist -DskipTests -rf :spark-assembly_2.11
[INFO] Spark Project Assembly ............................. SUCCESS [ 23.082 s]
```
Author: Gera Shegalov <gshegalov@salesforce.com>
Closes#20055 from gerashegalov/gera/tarLongFileMode.
## What changes were proposed in this pull request?
This PR cleans up a few Java linter errors for Apache Spark 2.3 release.
## How was this patch tested?
```bash
$ dev/lint-java
Using `mvn` from path: /usr/local/bin/mvn
Checkstyle checks passed.
```
We can see the result from [Travis CI](https://travis-ci.org/dongjoon-hyun/spark/builds/322470787), too.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#20101 from dongjoon-hyun/fix-java-lint.
## What changes were proposed in this pull request?
For an empty/null column, the result of `ApproximatePercentile` is null. Then in `ApproxCountDistinctForIntervals`, a `MatchError` (for `endpoints`) will be thrown if we try to generate a histogram for that column. Besides, there is no need to generate a histogram for such a column. In this patch, we exclude such columns when generating histograms.
## How was this patch tested?
Enhanced test cases for empty/null columns.
Author: Zhenhua Wang <wangzhenhua@huawei.com>
Closes#20102 from wzhfy/no_record_hgm_bug.
## What changes were proposed in this pull request?
I found this problem while auditing the analyzer code. It's dangerous to introduce extra `AnalysisBarrier`s during analysis, as the plans inside them will bypass all analysis afterwards, which may not be expected. We should only preserve existing `AnalysisBarrier`s and not introduce new ones.
## How was this patch tested?
existing tests
Author: Wenchen Fan <wenchen@databricks.com>
Closes#20094 from cloud-fan/barrier.
## What changes were proposed in this pull request?
This PR proposes to add `localCheckpoint(..)` in R API.
```r
df <- localCheckpoint(createDataFrame(iris))
```
## How was this patch tested?
Unit tests added in `R/pkg/tests/fulltests/test_sparkSQL.R`
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#20073 from HyukjinKwon/SPARK-22843.
## What changes were proposed in this pull request?
This PR updates the Kubernetes documentation corresponding to the following features/changes in #19954.
* Ability to use remote dependencies through the init-container.
* Ability to mount user-specified secrets into the driver and executor pods.
vanzin jiangxb1987 foxish
Author: Yinan Li <liyinan926@gmail.com>
Closes#20059 from liyinan926/doc-update.
## What changes were proposed in this pull request?
This PR expands the Kubernetes mode to be able to use remote dependencies on http/https endpoints, GCS, S3, etc. It adds steps for configuring and appending the Kubernetes init-container into the driver and executor pods for downloading remote dependencies.
[Init-containers](https://kubernetes.io/docs/concepts/workloads/pods/init-containers/), as the name suggests, are containers that are run to completion before the main containers start, and are often used to perform initialization tasks prior to starting the main containers. We use init-containers to localize remote application dependencies before the driver/executors start running. The code that the init-container runs is also included. This PR also adds a step to the driver and executors for mounting user-specified secrets that may store credentials for accessing data storage, e.g., S3 and Google Cloud Storage (GCS), into the driver and executors.
## How was this patch tested?
* The patch contains unit tests which are passing.
* Manual testing: `./build/mvn -Pkubernetes clean package` succeeded.
* Manual testing of the following cases:
* [x] Running SparkPi using container-local spark-example jar.
* [x] Running SparkPi using container-local spark-example jar with user-specific secret mounted.
* [x] Running SparkPi using spark-example jar hosted remotely on an https endpoint.
cc rxin felixcheung mateiz (shepherd)
k8s-big-data SIG members & contributors: mccheah foxish ash211 ssuchter varunkatta kimoonkim erikerlandson tnachen ifilonenko liyinan926
reviewers: vanzin felixcheung jiangxb1987 mridulm
Author: Yinan Li <liyinan926@gmail.com>
Closes#19954 from liyinan926/init-container.
## What changes were proposed in this pull request?
This PR addresses additional review comments in #19811
## How was this patch tested?
Existing test suites
Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Closes#20036 from kiszk/SPARK-18066-followup.
## What changes were proposed in this pull request?
Fix OneVsRestModel transform failing on streaming data.
## How was this patch tested?
A UT will be added soon, once #19979 is merged. (A helper test method is needed there.)
Author: WeichenXu <weichen.xu@databricks.com>
Closes#20077 from WeichenXu123/fix_ovs_model_transform.
## What changes were proposed in this pull request?
Test coverage for arithmetic operations leading to:
1. Precision loss
2. Overflow
Moreover, it adds tests for casting bad strings to other input types and for using bad strings as operands of some functions.
## How was this patch tested?
added tests
Author: Marco Gaido <marcogaido91@gmail.com>
Closes#20084 from mgaido91/SPARK-22904.
## What changes were proposed in this pull request?
This is a follow-up PR of #19884, updating the setup.py file to add the pyarrow dependency.
## How was this patch tested?
Existing tests.
Author: Takuya UESHIN <ueshin@databricks.com>
Closes#20089 from ueshin/issues/SPARK-22324/fup1.
## What changes were proposed in this pull request?
Fix the table owner being null when creating a new table through Spark SQL.
## How was this patch tested?
manual test.
1. First create a table.
2. Then select the table properties from the MySQL database connected to the Hive metastore.
Author: xu.wenchun <xu.wenchun@immomo.com>
Closes#20034 from BruceXu1991/SPARK-22846.
## What changes were proposed in this pull request?
`DateTimeOperations` accepts [`StringType`](ae998ec2b5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala (L669)), but:
```
spark-sql> SELECT '2017-12-24' + interval 2 months 2 seconds;
Error in query: cannot resolve '(CAST('2017-12-24' AS DOUBLE) + interval 2 months 2 seconds)' due to data type mismatch: differing types in '(CAST('2017-12-24' AS DOUBLE) + interval 2 months 2 seconds)' (double and calendarinterval).; line 1 pos 7;
'Project [unresolvedalias((cast(2017-12-24 as double) + interval 2 months 2 seconds), None)]
+- OneRowRelation
spark-sql>
```
After this PR:
```
spark-sql> SELECT '2017-12-24' + interval 2 months 2 seconds;
2018-02-24 00:00:02
Time taken: 0.2 seconds, Fetched 1 row(s)
```
## How was this patch tested?
unit tests
Author: Yuming Wang <wgyumg@gmail.com>
Closes#20067 from wangyum/SPARK-22894.
## What changes were proposed in this pull request?
Some improvements:
1. Point out that we are using both Spark SQL native syntax and HQL syntax in the example.
2. Avoid using the same table name as the temp view, to not confuse users.
3. Create the external hive table with a directory that already has data, which is a more common use case.
4. Remove the usage of `spark.sql.parquet.writeLegacyFormat`. This config was introduced by https://github.com/apache/spark/pull/8566 and has nothing to do with Hive.
5. Remove the `repartition` and `coalesce` example. These two are not Hive-specific; we should put them in a different example file. Besides, they can't accurately control the number of output files, since `spark.sql.files.maxRecordsPerFile` also controls it.
## How was this patch tested?
N/A
Author: Wenchen Fan <wenchen@databricks.com>
Closes#20081 from cloud-fan/minor.
## What changes were proposed in this pull request?
In SPARK-20586 the flag `deterministic` was added to Scala UDFs, but it is not available for Python UDFs. This flag is useful for cases where the UDF's code can return different results for the same input. Due to optimization, duplicate invocations may be eliminated or the function may even be invoked more times than it is present in the query. This can lead to unexpected behavior.
This PR adds the deterministic flag, via the `asNondeterministic` method, to let the user mark the function as non-deterministic and therefore avoid the optimizations which might lead to strange behaviors.
## How was this patch tested?
Manual tests:
```
>>> from pyspark.sql.functions import *
>>> from pyspark.sql.types import *
>>> df_br = spark.createDataFrame([{'name': 'hello'}])
>>> import random
>>> udf_random_col = udf(lambda: int(100*random.random()), IntegerType()).asNondeterministic()
>>> df_br = df_br.withColumn('RAND', udf_random_col())
>>> random.seed(1234)
>>> udf_add_ten = udf(lambda rand: rand + 10, IntegerType())
>>> df_br.withColumn('RAND_PLUS_TEN', udf_add_ten('RAND')).show()
+-----+----+-------------+
| name|RAND|RAND_PLUS_TEN|
+-----+----+-------------+
|hello| 3| 13|
+-----+----+-------------+
```
Author: Marco Gaido <marcogaido91@gmail.com>
Author: Marco Gaido <mgaido@hortonworks.com>
Closes#19929 from mgaido91/SPARK-22629.
## What changes were proposed in this pull request?
The decimal type is not yet supported in `ArrowWriter`.
This change adds decimal type support.
## How was this patch tested?
Added a test to `ArrowConvertersSuite`.
Author: Takuya UESHIN <ueshin@databricks.com>
Closes#18754 from ueshin/issues/SPARK-21552.
## What changes were proposed in this pull request?
The Kinesis client can resume from a specified timestamp while creating a stream. We should have an option to pass a timestamp in the config to allow Kinesis to resume from the given timestamp.
The patch introduces a new `KinesisInitialPositionInStream` that takes the `InitialPositionInStream` with the `timestamp` information that can be used to resume kinesis fetches from the provided timestamp.
## How was this patch tested?
Unit Tests
cc : budde brkyvz
Author: Yash Sharma <ysharma@atlassian.com>
Closes#18029 from yssharma/ysharma/kcl_resume.
## What changes were proposed in this pull request?
This is a follow-up PR of #20054, modifying the error messages for both pandas and pyarrow to show the actual versions.
## How was this patch tested?
Existing tests.
Author: Takuya UESHIN <ueshin@databricks.com>
Closes#20074 from ueshin/issues/SPARK-22874_fup1.
## What changes were proposed in this pull request?
We should use `dataType.simpleString` to unify the data type mismatch messages:
Before:
```
spark-sql> select cast(1 as binary);
Error in query: cannot resolve 'CAST(1 AS BINARY)' due to data type mismatch: cannot cast IntegerType to BinaryType; line 1 pos 7;
```
After:
```
spark-sql> select cast(1 as binary);
Error in query: cannot resolve 'CAST(1 AS BINARY)' due to data type mismatch: cannot cast int to binary; line 1 pos 7;
```
## How was this patch tested?
Existing tests.
Author: Yuming Wang <wgyumg@gmail.com>
Closes#20064 from wangyum/SPARK-22893.
## What changes were proposed in this pull request?
Via some testing I found that CrossValidator still has a memory issue: it will occupy `O(n * sizeof(model))` memory for holding models while fitting, whereas, if well optimized, it should be `O(parallelism * sizeof(model))`.
This is because `modelFutures` holds a reference to each model object after its future completes (it can be fetched via `future.value.get.get`), and both `Future.sequence` and the `modelFutures` array hold references to each model future. So all model objects stay referenced, and the memory usage remains `O(n * sizeof(model))`.
I fix this by merging `modelFuture` and `foldMetricFuture` together, and using an `AtomicInteger` to count completed fitting tasks; when all are done, `trainingDataset.unpersist` is triggered.
I previously commented on this issue on the old PR [SPARK-19357]
https://github.com/apache/spark/pull/16774#pullrequestreview-53674264
Unfortunately, at that time I did not realize that the issue still existed, but now I have confirmed it and created this PR to fix it.
## Discussion
I give three approaches which we can compare; after discussion I realized none of them is ideal, so we have to make a trade-off.
**After discussion with jkbradley, approach 3 was chosen.**
### Approach 1
~~The approach proposed by MrBago at~~ https://github.com/apache/spark/pull/19904#discussion_r156751569
~~This approach resolves the model object reference issue and allows the model objects to be GCed in time. **BUT, in some cases, it still does not resolve the O(N) model memory occupation issue**. Let me use an extreme case to describe it:~~
~~Suppose we set `parallelism = 1` and there are 100 paramMaps, so we have 100 fitting & evaluation tasks. In this approach, because of `parallelism = 1`, the code has to wait for all 100 fitting tasks to complete **(at which point the memory occupied by models already reaches 100 * sizeof(model))**, and only then will it unpersist the training dataset and run the 100 evaluation tasks.~~
### Approach 2
~~This approach is the old version of the code in my PR~~ 2cc7c28f38
~~This approach can make sure that, in every case, the peak memory occupied by models is `O(numParallelism * sizeof(model))`, but it has an issue: in some extreme cases, the "unpersist training dataset" will be delayed until most of the evaluation tasks complete. Suppose the case
`parallelism = 1`, with 100 fitting & evaluation tasks: each fitting & evaluation task has to be executed one by one, so only after the first 99 fitting & evaluation tasks and the 100th fitting task complete will the "unpersist training dataset" be triggered.~~
### Approach 3
After I compared approach 1 and approach 2, I realized that, in the case where parallelism is low but there are many fitting & evaluation tasks, we cannot achieve both of the following goals:
- Make the peak memory occupied by models (driver-side) O(parallelism * sizeof(model))
- Unpersist the training dataset before most of the evaluation tasks have started.
So I vote for a simpler approach: move the unpersisting of the training dataset to the end. (Does this really matter?)
Because goal 1 is more important, we must make sure the peak memory occupied by models (driver-side) is O(parallelism * sizeof(model)); otherwise there is a high risk of OOM.
Like the following code:
```
val foldMetricFutures = epm.zipWithIndex.map { case (paramMap, paramIndex) =>
Future[Double] {
val model = est.fit(trainingDataset, paramMap).asInstanceOf[Model[_]]
//...other minor codes
val metric = eval.evaluate(model.transform(validationDataset, paramMap))
logDebug(s"Got metric $metric for model trained with $paramMap.")
metric
} (executionContext)
}
val foldMetrics = foldMetricFutures.map(ThreadUtils.awaitResult(_, Duration.Inf))
trainingDataset.unpersist() // <------- unpersist at the end
validationDataset.unpersist()
```
## How was this patch tested?
N/A
Author: WeichenXu <weichen.xu@databricks.com>
Closes#19904 from WeichenXu123/fix_cross_validator_memory_issue.
## What changes were proposed in this pull request?
In choosing a Partitioner to use for a cogroup-like operation between a number of RDDs, the default behaviour was: if some of the RDDs already have a partitioner, choose the one among them with the maximum number of partitions.
This behaviour, in some cases, could hit the 2G limit (SPARK-6235). To illustrate one such scenario, consider two RDDs:
RDD 1: with smaller data and a smaller number of partitions, along with a Partitioner.
RDD 2: with much larger data and a larger number of partitions, without a Partitioner.
The cogroup of these two RDDs could hit the 2G limit, as a large amount of data is shuffled into a small number of partitions.
This PR introduces a safety check wherein the Partitioner is chosen only if either of the following conditions is met (see the sketch after this list):
1. the number of partitions of the RDD associated with the Partitioner is greater than or equal to the max number of upstream partitions; or
2. the number of partitions of the RDD associated with the Partitioner is less than, but within a single order of magnitude of, the max number of upstream partitions.
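A hedged PySpark illustration of the scenario (assumes a SparkContext `sc`; the sizes are illustrative):
```python
# A small RDD that already has a 4-partition hash partitioner.
small_rdd = sc.parallelize([(i, i) for i in range(100)]).partitionBy(4)

# A much larger RDD with many more partitions and no partitioner.
large_rdd = sc.parallelize([(i, str(i)) for i in range(1000000)], 2000)

# Before the safety check, the cogroup would reuse small_rdd's 4-partition
# partitioner, shuffling all of large_rdd into just 4 partitions; with the
# check, 4 is not within an order of magnitude of 2000, so it is not reused.
grouped = small_rdd.cogroup(large_rdd)
print(grouped.getNumPartitions())
```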
## How was this patch tested?
Unit tests in PartitioningSuite and PairRDDFunctionsSuite
Author: sujithjay <sujith@logistimo.com>
Closes#20002 from sujithjay/SPARK-22465.
## What changes were proposed in this pull request?
Since all CRAN checks go through the same machine, if there is an older partial download or partial install of Spark left behind, the tests fail. This PR overwrites the install files when running tests. This shouldn't affect Jenkins as `SPARK_HOME` is set when running Jenkins tests.
## How was this patch tested?
Test manually by running `R CMD check --as-cran`
Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Closes#20060 from shivaram/sparkr-overwrite-cran.
## What changes were proposed in this pull request?
This PR adds `date_trunc` in R API as below:
```r
> df <- createDataFrame(list(list(a = as.POSIXlt("2012-12-13 12:34:00"))))
> head(select(df, date_trunc("hour", df$a)))
date_trunc(hour, a)
1 2012-12-13 12:00:00
```
## How was this patch tested?
Unit tests added in `R/pkg/tests/fulltests/test_sparkSQL.R`.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#20031 from HyukjinKwon/r-datetrunc.
## What changes were proposed in this pull request?
This PR fixes a style that broke the build.
## How was this patch tested?
Manually tested.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#20065 from HyukjinKwon/minor-style.
## What changes were proposed in this pull request?
Create the table using the right DataFrame: peopleDF -> usersDF.
peopleDF:
+----+-------+
| age| name|
+----+-------+
usersDF:
+------+--------------+----------------+
| name|favorite_color|favorite_numbers|
+------+--------------+----------------+
## How was this patch tested?
Manually tested.
Author: CNRui <13266776177@163.com>
Closes#20052 from CNRui/patch-2.
## What changes were proposed in this pull request?
Spark Hive Scala examples improvements made:
* Writing a DataFrame / Dataset to a Hive managed or Hive external table using different storage formats.
* Implementation of partition, repartition, and coalesce with appropriate examples.
## How was this patch tested?
* Patch has been tested manually and by running ./dev/run-tests.
Author: chetkhatri <ckhatrimanjal@gmail.com>
Closes#20018 from chetkhatri/scala-sparkhive-examples.
## What changes were proposed in this pull request?
Basic continuous execution, supporting map/flatMap/filter, with commits and advancement through RPC.
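A hedged PySpark sketch of starting a query with the continuous trigger, assuming an active SparkSession `spark` and that the rate source and console sink support continuous mode:
```python
query = (spark.readStream
         .format("rate")
         .load()
         .selectExpr("value % 10 AS key")
         .writeStream
         .format("console")
         .trigger(continuous="1 second")  # continuous trigger with a 1s checkpoint interval
         .start())
```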
## How was this patch tested?
new unit-ish tests (exercising execution end to end)
Author: Jose Torres <jose@databricks.com>
Closes#19984 from jose-torres/continuous-impl.