Currently, the checkpoints to DynamoDB occur only when new data comes in, as we update the clock for the checkpointState. This PR makes the checkpoint a scheduled execution based on the `checkpointInterval`.
Author: Burak Yavuz <brkyvz@gmail.com>
Closes#9421 from brkyvz/kinesis-checkpoint.
For now they are thin wrappers around the corresponding Hive UDAFs.
One limitation with these in Hive 0.13.0 is they only support aggregating primitive types.
I chose snake_case here instead of camelCase because it seems to be used in the majority of the multi-word fns.
Do we also want to add these to `functions.py`?
This approach was recommended here: https://github.com/apache/spark/pull/8592#issuecomment-154247089
marmbrus rxin
Author: Nick Buroojy <nick.buroojy@civitaslearning.com>
Closes#9526 from nburoojy/nick/udaf-alias.
(cherry picked from commit a6ee4f989d)
Signed-off-by: Michael Armbrust <michael@databricks.com>
I have tested it on my local, it is working fine, please review
Author: sachin aggarwal <different.sachin@gmail.com>
Closes#9539 from agsachin/SPARK-11552-real.
this change rejects offers for slaves with unmet constraints for 120s to mitigate offer starvation.
this prevents mesos to send us these offers again and again.
in return, we get more offers for slaves which might meet our constraints.
and it enables mesos to send the rejected offers to other frameworks.
Author: Felix Bechstein <felix.bechstein@otto.de>
Closes#8639 from felixb/decline_offers_constraint_mismatch.
1) kafkaStreams is a list. The list should be unpacked when passing it into the streaming context union method, which accepts a variable number of streams.
2) print() should be pprint() for pyspark.
This contribution is my original work, and I license the work to the project under the project's open source license.
Author: chriskang90 <jckang@uchicago.edu>
Closes#9545 from c-kang/streaming_python_typo.
Make sample test less flaky by setting the seed
Tested with
```
repeat { if (count(sample(df, FALSE, 0.1)) == 3) { break } }
```
Author: felixcheung <felixcheung_m@hotmail.com>
Closes#9549 from felixcheung/rsample.
As shown in https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Compile/job/Spark-Master-Scala211-Compile/1946/console , compilation fails with:
```
[error] /home/jenkins/workspace/Spark-Master-Scala211-Compile/core/src/main/scala/org/apache/spark/storage/RDDInfo.scala:25: in class RDDInfo, multiple overloaded alternatives of constructor RDDInfo define default arguments.
[error] class RDDInfo(
[error]
```
This PR tries to fix the compilation error
Author: tedyu <yuzhihong@gmail.com>
Closes#9538 from tedyu/master.
The current pmml models generated do not specify the pmml version in its root node. This is a problem when using this pmml model in other tools because they expect the version attribute to be set explicitly. This fix adds the pmml version attribute to the generated pmml models and specifies its value as 4.2.
Author: fazlan-nazeem <fazlann@wso2.com>
Closes#9558 from fazlan-nazeem/master.
Expose R-like summary statistics in SparkR::glm for linear regression, the output of ```summary``` like
```Java
$DevianceResiduals
Min Max
-0.9509607 0.7291832
$Coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.6765 0.2353597 7.123139 4.456124e-11
Sepal_Length 0.3498801 0.04630128 7.556598 4.187317e-12
Species_versicolor -0.9833885 0.07207471 -13.64402 0
Species_virginica -1.00751 0.09330565 -10.79796 0
```
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#9561 from yanboliang/spark-11494.
It doesn't show up as a hyperlink currently. It will show up as a hyperlink after this change.
Author: Rohit Agarwal <mindprince@gmail.com>
Closes#9544 from mindprince/patch-2.
Addressing https://issues.apache.org/jira/browse/SPARK-11218, mostly copied start-thriftserver.sh.
```
charlesyeh-mbp:spark charlesyeh$ ./sbin/start-master.sh --help
Usage: Master [options]
Options:
-i HOST, --ip HOST Hostname to listen on (deprecated, please use --host or -h)
-h HOST, --host HOST Hostname to listen on
-p PORT, --port PORT Port to listen on (default: 7077)
--webui-port PORT Port for web UI (default: 8080)
--properties-file FILE Path to a custom Spark properties file.
Default is conf/spark-defaults.conf.
```
```
charlesyeh-mbp:spark charlesyeh$ ./sbin/start-slave.sh
Usage: Worker [options] <master>
Master must be a URL of the form spark://hostname:port
Options:
-c CORES, --cores CORES Number of cores to use
-m MEM, --memory MEM Amount of memory to use (e.g. 1000M, 2G)
-d DIR, --work-dir DIR Directory to run apps in (default: SPARK_HOME/work)
-i HOST, --ip IP Hostname to listen on (deprecated, please use --host or -h)
-h HOST, --host HOST Hostname to listen on
-p PORT, --port PORT Port to listen on (default: random)
--webui-port PORT Port for web UI (default: 8081)
--properties-file FILE Path to a custom Spark properties file.
Default is conf/spark-defaults.conf.
```
Author: Charles Yeh <charlesyeh@dropbox.com>
Closes#9432 from CharlesYeh/helpmsg.
The reason is that:
1. For partitioned hive table, we will move the partitioned columns after data columns. (e.g. `<a: Int, b: Int>` partition by `a` will become `<b: Int, a: Int>`)
2. When append data to table, we use position to figure out how to match input columns to table's columns.
So when we append data to partitioned table, we will match wrong columns between input and table. A solution is reordering the input columns before match by position, like what we did for [`InsertIntoHadoopFsRelation`](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelation.scala#L101-L105)
Author: Wenchen Fan <wenchen@databricks.com>
Closes#9408 from cloud-fan/append.
A few changes:
1. Removed fold, since it can be confusing for distributed collections.
2. Created specific interfaces for each Dataset function (e.g. MapFunction, ReduceFunction, MapPartitionsFunction)
3. Added more documentation and test cases.
The other thing I'm considering doing is to have a "collector" interface for FlatMapFunction and MapPartitionsFunction, similar to MapReduce's map function.
Author: Reynold Xin <rxin@databricks.com>
Closes#9531 from rxin/SPARK-11564.
Doc change to align with HiveConf default in terms of where to create `warehouse` directory.
Author: xin Wu <xinwu@us.ibm.com>
Closes#9365 from xwu0226/spark-10046-commit.
This PR adds support for multiple column in a single count distinct aggregate to the new aggregation path.
cc yhuai
Author: Herman van Hovell <hvanhovell@questtec.nl>
Closes#9409 from hvanhovell/SPARK-11451.
This snippet seems to be mistakenly introduced at two places in #5348.
Author: Rohit Agarwal <mindprince@gmail.com>
Closes#9540 from mindprince/patch-1.
JIRA: https://issues.apache.org/jira/browse/SPARK-11362
We use scala.collection.mutable.BitSet in BroadcastNestedLoopJoin now. We should use Spark's BitSet.
Author: Liang-Chi Hsieh <viirya@appier.com>
Closes#9316 from viirya/use-spark-bitset.
This PR is a follow up for PR https://github.com/apache/spark/pull/9406. It adds more documentation to the rewriting rule, removes a redundant if expression in the non-distinct aggregation path and adds a multiple distinct test to the AggregationQuerySuite.
cc yhuai marmbrus
Author: Herman van Hovell <hvanhovell@questtec.nl>
Closes#9541 from hvanhovell/SPARK-9241-followup.
Could jkbradley and davies review it?
- Create a wrapper class: `LDAModelWrapper` for `LDAModel`. Because we can't deal with the return value of`describeTopics` in Scala from pyspark directly. `Array[(Array[Int], Array[Double])]` is too complicated to convert it.
- Add `loadLDAModel` in `PythonMLlibAPI`. Since `LDAModel` in Scala is an abstract class and we need to call `load` of `DistributedLDAModel`.
[[SPARK-8467] Add LDAModel.describeTopics() in Python - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-8467)
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
Closes#8643 from yu-iskw/SPARK-8467-2.
In order to lay the groundwork for proper off-heap memory support in SQL / Tungsten, we need to extend our MemoryManager to perform bookkeeping for off-heap memory.
## User-facing changes
This PR introduces a new configuration, `spark.memory.offHeapSize` (name subject to change), which specifies the absolute amount of off-heap memory that Spark and Spark SQL can use. If Tungsten is configured to use off-heap execution memory for allocating data pages, then all data page allocations must fit within this size limit.
## Internals changes
This PR contains a lot of internal refactoring of the MemoryManager. The key change at the heart of this patch is the introduction of a `MemoryPool` class (name subject to change) to manage the bookkeeping for a particular category of memory (storage, on-heap execution, and off-heap execution). These MemoryPools are not fixed-size; they can be dynamically grown and shrunk according to the MemoryManager's policies. In StaticMemoryManager, these pools have fixed sizes, proportional to the legacy `[storage|shuffle].memoryFraction`. In the new UnifiedMemoryManager, the sizes of these pools are dynamically adjusted according to its policies.
There are two subclasses of `MemoryPool`: `StorageMemoryPool` manages storage memory and `ExecutionMemoryPool` manages execution memory. The MemoryManager creates two execution pools, one for on-heap memory and one for off-heap. Instances of `ExecutionMemoryPool` manage the logic for fair sharing of their pooled memory across running tasks (in other words, the ShuffleMemoryManager-like logic has been moved out of MemoryManager and pushed into these ExecutionMemoryPool instances).
I think that this design is substantially easier to understand and reason about than the previous design, where most of these responsibilities were handled by MemoryManager and its subclasses. To see this, take at look at how simple the logic in `UnifiedMemoryManager` has become: it's now very easy to see when memory is dynamically shifted between storage and execution.
## TODOs
- [x] Fix handful of test failures in the MemoryManagerSuites.
- [x] Fix remaining TODO comments in code.
- [ ] Document new configuration.
- [x] Fix commented-out tests / asserts:
- [x] UnifiedMemoryManagerSuite.
- [x] Write tests that exercise the new off-heap memory management policies.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#9344 from JoshRosen/offheap-memory-accounting.
SparkExecuteStatementOperation logs result schema for each getNextRowSet() calls which is by default every 1000 rows, overwhelming whole log file.
Author: navis.ryu <navis@apache.org>
Closes#9514 from navis/SPARK-11546.
The second PR for SPARK-9241, this adds support for multiple distinct columns to the new aggregation code path.
This PR solves the multiple DISTINCT column problem by rewriting these Aggregates into an Expand-Aggregate-Aggregate combination. See the [JIRA ticket](https://issues.apache.org/jira/browse/SPARK-9241) for some information on this. The advantages over the - competing - [first PR](https://github.com/apache/spark/pull/9280) are:
- This can use the faster TungstenAggregate code path.
- It is impossible to OOM due to an ```OpenHashSet``` allocating to much memory. However, this will multiply the number of input rows by the number of distinct clauses (plus one), and puts a lot more memory pressure on the aggregation code path itself.
The location of this Rule is a bit funny, and should probably change when the old aggregation path is changed.
cc yhuai - Could you also tell me where to add tests for this?
Author: Herman van Hovell <hvanhovell@questtec.nl>
Closes#9406 from hvanhovell/SPARK-9241-rewriter.
I tested the various options with both spark-submit and spark-class of specifying number of executors in both client and cluster mode where it applied.
--num-workers, --num-executors, spark.executor.instances, SPARK_EXECUTOR_INSTANCES, default nothing supplied
Author: Thomas Graves <tgraves@staydecay.corp.gq1.yahoo.com>
Closes#9523 from tgravescs/SPARK-11555.
This PR implements the default save/load for non-meta estimators and transformers using the JSON serialization of param values. The saved metadata includes:
* class name
* uid
* timestamp
* paramMap
The save/load interface is similar to DataFrames. We use the current active context by default, which should be sufficient for most use cases.
~~~scala
instance.save("path")
instance.write.context(sqlContext).overwrite().save("path")
Instance.load("path")
~~~
The param handling is different from the design doc. We didn't save default and user-set params separately, and when we load it back, all parameters are user-set. This does cause issues. But it also cause other issues if we modify the default params.
TODOs:
* [x] Java test
* [ ] a follow-up PR to implement default save/load for all non-meta estimators and transformers
cc jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Closes#9454 from mengxr/SPARK-11217.
This PR enables the Expand operator to process and produce Unsafe Rows.
Author: Herman van Hovell <hvanhovell@questtec.nl>
Closes#9414 from hvanhovell/SPARK-11450.
https://issues.apache.org/jira/browse/SPARK-10116
This is really trivial, just happened to notice it -- if `XORShiftRandom.hashSeed` is really supposed to have random bits throughout (as the comment implies), it needs to do something for the conversion to `long`.
mengxr mkolod
Author: Imran Rashid <irashid@cloudera.com>
Closes#8314 from squito/SPARK-10116.
This PR adds test cases that test various column pruning and filter push-down cases.
Author: Cheng Lian <lian@databricks.com>
Closes#9468 from liancheng/spark-10978.follow-up.
JIRA: https://issues.apache.org/jira/browse/SPARK-9162
Currently ScalaUDF extends CodegenFallback and doesn't provide code generation implementation. This path implements code generation for ScalaUDF.
Author: Liang-Chi Hsieh <viirya@appier.com>
Closes#9270 from viirya/scalaudf-codegen.
Just ignored `InputDStream`s that have null `rememberDuration` in `DStreamGraph.getMaxInputStreamRememberDuration`.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#9476 from zsxwing/SPARK-11511.
A cleanup for https://github.com/apache/spark/pull/9085.
The `DecimalLit` is very similar to `FloatLit`, we can just keep one of them.
Also added low level unit test at `SqlParserSuite`
Author: Wenchen Fan <wenchen@databricks.com>
Closes#9482 from cloud-fan/parser.
This PR adds the ability to do typed SQL aggregations. We will likely also want to provide an interface to allow users to do aggregations on objects, but this is deferred to another PR.
```scala
val ds = Seq(("a", 10), ("a", 20), ("b", 1), ("b", 2), ("c", 1)).toDS()
ds.groupBy(_._1).agg(sum("_2").as[Int]).collect()
res0: Array(("a", 30), ("b", 3), ("c", 1))
```
Author: Michael Armbrust <michael@databricks.com>
Closes#9499 from marmbrus/dataset-agg.
This brings the support of off-heap memory for array inside BytesToBytesMap and InMemorySorter, then we could allocate all the memory from off-heap for execution.
Closes#8068
Author: Davies Liu <davies@databricks.com>
Closes#9477 from davies/unsafe_timsort.
sbt's version resolution code always picks the most recent version, and we
don't want that for guava.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#9508 from vanzin/SPARK-11538.