ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Burak Yavuz	1431319e5b	Add mockito as an explicit test dependency to spark-streaming While sbt successfully compiles as it properly pulls the mockito dependency, maven builds have broken. We need this in ASAP. tdas Author: Burak Yavuz <brkyvz@gmail.com> Closes #9584 from brkyvz/fix-master.	2015-11-09 18:53:57 -08:00
Shixiong Zhu	6502944f39	[SPARK-11333][STREAMING] Add executorId to ReceiverInfo and display it in UI Expose executorId to `ReceiverInfo` and UI since it's helpful when there are multiple executors running in the same host. Screenshot: <img width="1058" alt="screen shot 2015-11-02 at 10 52 19 am" src="https://cloud.githubusercontent.com/assets/1000778/10890968/2e2f5512-8150-11e5-8d9d-746e826b69e8.png"> Author: Shixiong Zhu <shixiong@databricks.com> Author: zsxwing <zsxwing@gmail.com> Closes #9418 from zsxwing/SPARK-11333.	2015-11-09 18:13:37 -08:00
zsxwing	1f0f14efe3	[SPARK-11462][STREAMING] Add JavaStreamingListener Currently, StreamingListener is not Java friendly because it exposes some Scala collections to Java users directly, such as Option, Map. This PR added a Java version of StreamingListener and a bunch of Java friendly classes for Java users. Author: zsxwing <zsxwing@gmail.com> Author: Shixiong Zhu <shixiong@databricks.com> Closes #9420 from zsxwing/java-streaming-listener.	2015-11-09 17:38:19 -08:00
Burak Yavuz	0ce6f9b2d2	[SPARK-11141][STREAMING] Batch ReceivedBlockTrackerLogEvents for WAL writes When using S3 as a directory for WALs, the writes take too long. The driver gets very easily bottlenecked when multiple receivers send AddBlock events to the ReceiverTracker. This PR adds batching of events in the ReceivedBlockTracker so that receivers don't get blocked by the driver for too long. cc zsxwing tdas Author: Burak Yavuz <brkyvz@gmail.com> Closes #9143 from brkyvz/batch-wal-writes.	2015-11-09 17:35:12 -08:00
Burak Yavuz	26062d2260	[SPARK-11198][STREAMING][KINESIS] Support de-aggregation of records during recovery While the KCL handles de-aggregation during the regular operation, during recovery we use the lower level api, and therefore need to de-aggregate the records. tdas Testing is an issue, we need protobuf magic to do the aggregated records. Maybe we could depend on KPL for tests? Author: Burak Yavuz <brkyvz@gmail.com> Closes #9403 from brkyvz/kinesis-deaggregation.	2015-11-09 17:18:49 -08:00
Yuhao Yang	61f9c8711c	[SPARK-11069][ML] Add RegexTokenizer option to convert to lowercase jira: https://issues.apache.org/jira/browse/SPARK-11069 quotes from jira: Tokenizer converts strings to lowercase automatically, but RegexTokenizer does not. It would be nice to add an option to RegexTokenizer to convert to lowercase. Proposal: call the Boolean Param "toLowercase" set default to false (so behavior does not change) Actually sklearn converts to lowercase before tokenizing too Author: Yuhao Yang <hhbyyh@gmail.com> Closes #9092 from hhbyyh/tokenLower.	2015-11-09 16:55:23 -08:00
Yu ISHIKAWA	7dc9d8dba6	[SPARK-11610][MLLIB][PYTHON][DOCS] Make the docs of LDAModel.describeTopics in Python more specific cc jkbradley Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #9577 from yu-iskw/SPARK-11610.	2015-11-09 16:25:29 -08:00
Reynold Xin	675c7e723c	[SPARK-11564][SQL] Fix documentation for DataFrame.take/collect Author: Reynold Xin <rxin@databricks.com> Closes #9557 from rxin/SPARK-11564-1.	2015-11-09 16:22:15 -08:00
Michael Armbrust	9c740a9ddf	[SPARK-11578][SQL] User API for Typed Aggregation This PR adds a new interface for user-defined aggregations, that can be used in `DataFrame` and `Dataset` operations to take all of the elements of a group and reduce them to a single value. For example, the following aggregator extracts an `int` from a specific class and adds them up: ```scala case class Data(i: Int) val customSummer = new Aggregator[Data, Int, Int] { def prepare(d: Data) = d.i def reduce(l: Int, r: Int) = l + r def present(r: Int) = r }.toColumn() val ds: Dataset[Data] = ... val aggregated = ds.select(customSummer) ``` By using helper functions, users can make a generic `Aggregator` that works on any input type: ```scala /** An `Aggregator` that adds up any numeric type returned by the given function. / class SumOf[I, N : Numeric](f: I => N) extends Aggregator[I, N, N] with Serializable { val numeric = implicitly[Numeric[N]] override def zero: N = numeric.zero override def reduce(b: N, a: I): N = numeric.plus(b, f(a)) override def present(reduction: N): N = reduction } def sum[I, N : Numeric : Encoder](f: I => N): TypedColumn[I, N] = new SumOf(f).toColumn ``` These aggregators can then be used alongside other built-in SQL aggregations. ```scala val ds = Seq(("a", 10), ("a", 20), ("b", 1), ("b", 2), ("c", 1)).toDS() ds .groupBy(_._1) .agg( sum(_._2), // The aggregator defined above. expr("sum(_2)").as[Int], // A built-in dynatically typed aggregation. count("")) // A built-in statically typed aggregation. .collect() res0: ("a", 30, 30, 2L), ("b", 3, 3, 2L), ("c", 1, 1, 1L) ``` The current implementation focuses on integrating this into the typed API, but currently only supports running aggregations that return a single long value as explained in `TypedAggregateExpression`. This will be improved in a followup PR. Author: Michael Armbrust <michael@databricks.com> Closes #9555 from marmbrus/dataset-useragg.	2015-11-09 16:11:00 -08:00
gatorsmile	2f38378856	[SPARK-11360][DOC] Loss of nullability when writing parquet files This fix is to add one line to explain the current behavior of Spark SQL when writing Parquet files. All columns are forced to be nullable for compatibility reasons. Author: gatorsmile <gatorsmile@gmail.com> Closes #9314 from gatorsmile/lossNull.	2015-11-09 16:06:48 -08:00
hyukjinkwon	9565c246ea	[SPARK-9557][SQL] Refactor ParquetFilterSuite and remove old ParquetFilters code Actually this was resolved by https://github.com/apache/spark/pull/8275. But I found the JIRA issue for this is not marked as resolved since the PR above was made for another issue but the PR above resolved both. I commented that this is resolved by the PR above; however, I opened this PR as I would like to just add a little bit of corrections. In the previous PR, I refactored the test by not reducing just collecting filters; however, this would not test properly `And` filter (which is not given to the tests). I unintentionally changed this from the original way (before being refactored). In this PR, I just followed the original way to collect filters by reducing. I would like to close this if this PR is inappropriate and somebody would like this deal with it in the separate PR related with this. Author: hyukjinkwon <gurwls223@gmail.com> Closes #9554 from HyukjinKwon/SPARK-9557.	2015-11-09 15:20:50 -08:00
Wenchen Fan	fcb57e9c73	[SPARK-11564][SQL][FOLLOW-UP] improve java api for GroupedDataset created `MapGroupFunction`, `FlatMapGroupFunction`, `CoGroupFunction` Author: Wenchen Fan <wenchen@databricks.com> Closes #9564 from cloud-fan/map.	2015-11-09 15:16:47 -08:00
Yu ISHIKAWA	8a2336893a	[SPARK-6517][MLLIB] Implement the Algorithm of Hierarchical Clustering I implemented a hierarchical clustering algorithm again. This PR doesn't include examples, documentation and spark.ml APIs. I am going to send another PRs later. https://issues.apache.org/jira/browse/SPARK-6517 - This implementation based on a bi-sectiong K-means clustering. - It derives from the freeman-lab 's implementation - The basic idea is not changed from the previous version. (#2906) - However, It is 1000x faster than the previous version through parallel processing. Thank you for your great cooperation, RJ Nowling(rnowling), Jeremy Freeman(freeman-lab), Xiangrui Meng(mengxr) and Sean Owen(srowen). Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Author: Xiangrui Meng <meng@databricks.com> Author: Yu ISHIKAWA <yu-iskw@users.noreply.github.com> Closes #5267 from yu-iskw/new-hierarchical-clustering.	2015-11-09 14:56:36 -08:00
Burak Yavuz	a3a7c9103e	[SPARK-11359][STREAMING][KINESIS] Checkpoint to DynamoDB even when new data doesn't come in Currently, the checkpoints to DynamoDB occur only when new data comes in, as we update the clock for the checkpointState. This PR makes the checkpoint a scheduled execution based on the `checkpointInterval`. Author: Burak Yavuz <brkyvz@gmail.com> Closes #9421 from brkyvz/kinesis-checkpoint.	2015-11-09 14:39:18 -08:00
Cheng Lian	150f6a89b7	[SPARK-11595] [SQL] Fixes ADD JAR when the input path contains URL scheme Author: Cheng Lian <lian@databricks.com> Closes #9569 from liancheng/spark-11595.fix-add-jar.	2015-11-09 14:32:52 -08:00
Nick Buroojy	f138cb8733	[SPARK-9301][SQL] Add collect_set and collect_list aggregate functions For now they are thin wrappers around the corresponding Hive UDAFs. One limitation with these in Hive 0.13.0 is they only support aggregating primitive types. I chose snake_case here instead of camelCase because it seems to be used in the majority of the multi-word fns. Do we also want to add these to `functions.py`? This approach was recommended here: https://github.com/apache/spark/pull/8592#issuecomment-154247089 marmbrus rxin Author: Nick Buroojy <nick.buroojy@civitaslearning.com> Closes #9526 from nburoojy/nick/udaf-alias. (cherry picked from commit `a6ee4f989d`) Signed-off-by: Michael Armbrust <michael@databricks.com>	2015-11-09 14:30:52 -08:00
Rishabh Bhardwaj	b7720fa455	[SPARK-11548][DOCS] Replaced example code in mllib-collaborative-filtering.md using include_example Kindly review the changes. Author: Rishabh Bhardwaj <rbnext29@gmail.com> Closes #9519 from rishabhbhardwaj/SPARK-11337.	2015-11-09 14:27:36 -08:00
sachin aggarwal	51d41e4b1a	[SPARK-11552][DOCS][Replaced example code in ml-decision-tree.md using include_example] I have tested it on my local, it is working fine, please review Author: sachin aggarwal <different.sachin@gmail.com> Closes #9539 from agsachin/SPARK-11552-real.	2015-11-09 14:25:42 -08:00
Felix Bechstein	5039a49b63	[SPARK-10471][CORE][MESOS] prevent getting offers for unmet constraints this change rejects offers for slaves with unmet constraints for 120s to mitigate offer starvation. this prevents mesos to send us these offers again and again. in return, we get more offers for slaves which might meet our constraints. and it enables mesos to send the rejected offers to other frameworks. Author: Felix Bechstein <felix.bechstein@otto.de> Closes #8639 from felixb/decline_offers_constraint_mismatch.	2015-11-09 13:36:14 -08:00
Yu ISHIKAWA	88a3fdcc78	[SPARK-10280][MLLIB][PYSPARK][DOCS] Add @since annotation to pyspark.ml.classification Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #8690 from yu-iskw/SPARK-10280.	2015-11-09 13:16:04 -08:00
Bharat Lal	860ea0d386	[SPARK-11581][DOCS] Example mllib code in documentation incorrectly computes MSE Author: Bharat Lal <bharat.iisc@gmail.com> Closes #9560 from bharatl/SPARK-11581.	2015-11-09 11:33:01 -08:00
chriskang90	874cd66d4b	[DOCS] Fix typo for Python section on unifying Kafka streams 1) kafkaStreams is a list. The list should be unpacked when passing it into the streaming context union method, which accepts a variable number of streams. 2) print() should be pprint() for pyspark. This contribution is my original work, and I license the work to the project under the project's open source license. Author: chriskang90 <jckang@uchicago.edu> Closes #9545 from c-kang/streaming_python_typo.	2015-11-09 19:39:22 +01:00
felixcheung	cd174882a5	[SPARK-9865][SPARKR] Flaky SparkR test: test_sparkSQL.R: sample on a DataFrame Make sample test less flaky by setting the seed Tested with ``` repeat { if (count(sample(df, FALSE, 0.1)) == 3) { break } } ``` Author: felixcheung <felixcheung_m@hotmail.com> Closes #9549 from felixcheung/rsample.	2015-11-09 10:26:09 -08:00
tedyu	404a28f4ed	[SPARK-11112] Fix Scala 2.11 compilation error in RDDInfo.scala As shown in https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Compile/job/Spark-Master-Scala211-Compile/1946/console , compilation fails with: ``` [error] /home/jenkins/workspace/Spark-Master-Scala211-Compile/core/src/main/scala/org/apache/spark/storage/RDDInfo.scala:25: in class RDDInfo, multiple overloaded alternatives of constructor RDDInfo define default arguments. [error] class RDDInfo( [error] ``` This PR tries to fix the compilation error Author: tedyu <yuzhihong@gmail.com> Closes #9538 from tedyu/master.	2015-11-09 10:07:58 -08:00
Charles Yeh	08a7a836c3	[SPARK-10565][CORE] add missing web UI stats to /api/v1/applications JSON I looked at the other endpoints, and they don't seem to be missing any fields. Added fields: ![image](https://cloud.githubusercontent.com/assets/613879/10948801/58159982-82e4-11e5-86dc-62da201af910.png) Author: Charles Yeh <charlesyeh@dropbox.com> Closes #9472 from CharlesYeh/api_vars.	2015-11-09 11:59:32 -06:00
fazlan-nazeem	9b88e1dcad	[SPARK-11582][MLLIB] specifying pmml version attribute =4.2 in the root node of pmml model The current pmml models generated do not specify the pmml version in its root node. This is a problem when using this pmml model in other tools because they expect the version attribute to be set explicitly. This fix adds the pmml version attribute to the generated pmml models and specifies its value as 4.2. Author: fazlan-nazeem <fazlann@wso2.com> Closes #9558 from fazlan-nazeem/master.	2015-11-09 08:58:55 -08:00
Yanbo Liang	d50a66cc04	[SPARK-10689][ML][DOC] User guide and example code for AFTSurvivalRegression Add user guide and example code for ```AFTSurvivalRegression```. Author: Yanbo Liang <ybliang8@gmail.com> Closes #9491 from yanboliang/spark-10689.	2015-11-09 08:57:29 -08:00
Yanbo Liang	8c0e1b50e9	[SPARK-11494][ML][R] Expose R-like summary statistics in SparkR::glm for linear regression Expose R-like summary statistics in SparkR::glm for linear regression, the output of ```summary``` like ```Java $DevianceResiduals Min Max -0.9509607 0.7291832 $Coefficients Estimate Std. Error t value Pr(>\|t\|) (Intercept) 1.6765 0.2353597 7.123139 4.456124e-11 Sepal_Length 0.3498801 0.04630128 7.556598 4.187317e-12 Species_versicolor -0.9833885 0.07207471 -13.64402 0 Species_virginica -1.00751 0.09330565 -10.79796 0 ``` Author: Yanbo Liang <ybliang8@gmail.com> Closes #9561 from yanboliang/spark-11494.	2015-11-09 08:56:22 -08:00
Rohit Agarwal	b541b31630	[DOC][MINOR][SQL] Fix internal link It doesn't show up as a hyperlink currently. It will show up as a hyperlink after this change. Author: Rohit Agarwal <mindprince@gmail.com> Closes #9544 from mindprince/patch-2.	2015-11-09 13:28:00 +01:00
Charles Yeh	9e48cdfbde	[SPARK-11218][CORE] show help messages for start-slave and start-master Addressing https://issues.apache.org/jira/browse/SPARK-11218, mostly copied start-thriftserver.sh. ``` charlesyeh-mbp:spark charlesyeh$ ./sbin/start-master.sh --help Usage: Master [options] Options: -i HOST, --ip HOST Hostname to listen on (deprecated, please use --host or -h) -h HOST, --host HOST Hostname to listen on -p PORT, --port PORT Port to listen on (default: 7077) --webui-port PORT Port for web UI (default: 8080) --properties-file FILE Path to a custom Spark properties file. Default is conf/spark-defaults.conf. ``` ``` charlesyeh-mbp:spark charlesyeh$ ./sbin/start-slave.sh Usage: Worker [options] <master> Master must be a URL of the form spark://hostname:port Options: -c CORES, --cores CORES Number of cores to use -m MEM, --memory MEM Amount of memory to use (e.g. 1000M, 2G) -d DIR, --work-dir DIR Directory to run apps in (default: SPARK_HOME/work) -i HOST, --ip IP Hostname to listen on (deprecated, please use --host or -h) -h HOST, --host HOST Hostname to listen on -p PORT, --port PORT Port to listen on (default: random) --webui-port PORT Port for web UI (default: 8081) --properties-file FILE Path to a custom Spark properties file. Default is conf/spark-defaults.conf. ``` Author: Charles Yeh <charlesyeh@dropbox.com> Closes #9432 from CharlesYeh/helpmsg.	2015-11-09 13:22:05 +01:00
Wenchen Fan	d8b50f7029	[SPARK-11453][SQL] append data to partitioned table will messes up the result The reason is that: 1. For partitioned hive table, we will move the partitioned columns after data columns. (e.g. `<a: Int, b: Int>` partition by `a` will become `<b: Int, a: Int>`) 2. When append data to table, we use position to figure out how to match input columns to table's columns. So when we append data to partitioned table, we will match wrong columns between input and table. A solution is reordering the input columns before match by position, like what we did for [`InsertIntoHadoopFsRelation`](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelation.scala#L101-L105) Author: Wenchen Fan <wenchen@databricks.com> Closes #9408 from cloud-fan/append.	2015-11-08 21:01:53 -08:00
Reynold Xin	97b7080cf2	[SPARK-11564][SQL] Dataset Java API audit A few changes: 1. Removed fold, since it can be confusing for distributed collections. 2. Created specific interfaces for each Dataset function (e.g. MapFunction, ReduceFunction, MapPartitionsFunction) 3. Added more documentation and test cases. The other thing I'm considering doing is to have a "collector" interface for FlatMapFunction and MapPartitionsFunction, similar to MapReduce's map function. Author: Reynold Xin <rxin@databricks.com> Closes #9531 from rxin/SPARK-11564.	2015-11-08 20:57:09 -08:00
Wenchen Fan	b2d195e137	[SPARK-11554][SQL] add map/flatMap to GroupedDataset Author: Wenchen Fan <wenchen@databricks.com> Closes #9521 from cloud-fan/map.	2015-11-08 12:59:35 -08:00
xin Wu	26739059bc	[SPARK-10046][SQL] Hive warehouse dir not set in current directory when not … Doc change to align with HiveConf default in terms of where to create `warehouse` directory. Author: xin Wu <xinwu@us.ibm.com> Closes #9365 from xwu0226/spark-10046-commit.	2015-11-08 12:28:19 -08:00
Herman van Hovell	30c8ba71a7	[SPARK-11451][SQL] Support single distinct count on multiple columns. This PR adds support for multiple column in a single count distinct aggregate to the new aggregation path. cc yhuai Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #9409 from hvanhovell/SPARK-11451.	2015-11-08 11:06:10 -08:00
Rohit Agarwal	5c4e6d7ec9	[DOC][SQL] Remove redundant out-of-place python snippet This snippet seems to be mistakenly introduced at two places in #5348. Author: Rohit Agarwal <mindprince@gmail.com> Closes #9540 from mindprince/patch-1.	2015-11-08 14:24:26 +00:00
Sean Owen	d981902101	[SPARK-11476][DOCS] Incorrect function referred to in MLib Random data generation documentation Fix Python example to use normalRDD as advertised Author: Sean Owen <sowen@cloudera.com> Closes #9529 from srowen/SPARK-11476.	2015-11-08 11:15:58 +00:00
Liang-Chi Hsieh	4b69a42eda	[SPARK-11362] [SQL] Use Spark BitSet in BroadcastNestedLoopJoin JIRA: https://issues.apache.org/jira/browse/SPARK-11362 We use scala.collection.mutable.BitSet in BroadcastNestedLoopJoin now. We should use Spark's BitSet. Author: Liang-Chi Hsieh <viirya@appier.com> Closes #9316 from viirya/use-spark-bitset.	2015-11-07 19:44:45 -08:00
Herman van Hovell	ef362846eb	[SPARK-9241][SQL] Supporting multiple DISTINCT columns - follow-up This PR is a follow up for PR https://github.com/apache/spark/pull/9406. It adds more documentation to the rewriting rule, removes a redundant if expression in the non-distinct aggregation path and adds a multiple distinct test to the AggregationQuerySuite. cc yhuai marmbrus Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #9541 from hvanhovell/SPARK-9241-followup.	2015-11-07 13:37:37 -08:00
Yu ISHIKAWA	2ff0e79a86	[SPARK-8467] [MLLIB] [PYSPARK] Add LDAModel.describeTopics() in Python Could jkbradley and davies review it? - Create a wrapper class: `LDAModelWrapper` for `LDAModel`. Because we can't deal with the return value of`describeTopics` in Scala from pyspark directly. `Array[(Array[Int], Array[Double])]` is too complicated to convert it. - Add `loadLDAModel` in `PythonMLlibAPI`. Since `LDAModel` in Scala is an abstract class and we need to call `load` of `DistributedLDAModel`. [[SPARK-8467] Add LDAModel.describeTopics() in Python - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-8467) Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #8643 from yu-iskw/SPARK-8467-2.	2015-11-06 22:56:29 -08:00
Andrew Or	7f741905b0	[SPARK-11112] DAG visualization: display RDD callsite <img width="548" alt="screen shot 2015-11-01 at 9 42 33 am" src="https://cloud.githubusercontent.com/assets/2133137/10870343/2a8cd070-807d-11e5-857a-4ebcace77b5b.png"> mateiz sarutak Author: Andrew Or <andrew@databricks.com> Closes #9398 from andrewor14/rdd-callsite.	2015-11-07 05:35:53 +01:00
Josh Rosen	30b706b7b3	[SPARK-11389][CORE] Add support for off-heap memory to MemoryManager In order to lay the groundwork for proper off-heap memory support in SQL / Tungsten, we need to extend our MemoryManager to perform bookkeeping for off-heap memory. ## User-facing changes This PR introduces a new configuration, `spark.memory.offHeapSize` (name subject to change), which specifies the absolute amount of off-heap memory that Spark and Spark SQL can use. If Tungsten is configured to use off-heap execution memory for allocating data pages, then all data page allocations must fit within this size limit. ## Internals changes This PR contains a lot of internal refactoring of the MemoryManager. The key change at the heart of this patch is the introduction of a `MemoryPool` class (name subject to change) to manage the bookkeeping for a particular category of memory (storage, on-heap execution, and off-heap execution). These MemoryPools are not fixed-size; they can be dynamically grown and shrunk according to the MemoryManager's policies. In StaticMemoryManager, these pools have fixed sizes, proportional to the legacy `[storage\|shuffle].memoryFraction`. In the new UnifiedMemoryManager, the sizes of these pools are dynamically adjusted according to its policies. There are two subclasses of `MemoryPool`: `StorageMemoryPool` manages storage memory and `ExecutionMemoryPool` manages execution memory. The MemoryManager creates two execution pools, one for on-heap memory and one for off-heap. Instances of `ExecutionMemoryPool` manage the logic for fair sharing of their pooled memory across running tasks (in other words, the ShuffleMemoryManager-like logic has been moved out of MemoryManager and pushed into these ExecutionMemoryPool instances). I think that this design is substantially easier to understand and reason about than the previous design, where most of these responsibilities were handled by MemoryManager and its subclasses. To see this, take at look at how simple the logic in `UnifiedMemoryManager` has become: it's now very easy to see when memory is dynamically shifted between storage and execution. ## TODOs - [x] Fix handful of test failures in the MemoryManagerSuites. - [x] Fix remaining TODO comments in code. - [ ] Document new configuration. - [x] Fix commented-out tests / asserts: - [x] UnifiedMemoryManagerSuite. - [x] Write tests that exercise the new off-heap memory management policies. Author: Josh Rosen <joshrosen@databricks.com> Closes #9344 from JoshRosen/offheap-memory-accounting.	2015-11-06 18:17:34 -08:00
Michael Armbrust	105732dcc6	[HOTFIX] Fix python tests after #9527 #9527 missed updating the python tests. Author: Michael Armbrust <michael@databricks.com> Closes #9533 from marmbrus/hotfixTextValue.	2015-11-06 17:22:30 -08:00
navis.ryu	1c80d66e52	[SPARK-11546] Thrift server makes too many logs about result schema SparkExecuteStatementOperation logs result schema for each getNextRowSet() calls which is by default every 1000 rows, overwhelming whole log file. Author: navis.ryu <navis@apache.org> Closes #9514 from navis/SPARK-11546.	2015-11-06 17:13:46 -08:00
Herman van Hovell	6d0ead322e	[SPARK-9241][SQL] Supporting multiple DISTINCT columns (2) - Rewriting Rule The second PR for SPARK-9241, this adds support for multiple distinct columns to the new aggregation code path. This PR solves the multiple DISTINCT column problem by rewriting these Aggregates into an Expand-Aggregate-Aggregate combination. See the [JIRA ticket](https://issues.apache.org/jira/browse/SPARK-9241) for some information on this. The advantages over the - competing - [first PR](https://github.com/apache/spark/pull/9280) are: - This can use the faster TungstenAggregate code path. - It is impossible to OOM due to an ```OpenHashSet``` allocating to much memory. However, this will multiply the number of input rows by the number of distinct clauses (plus one), and puts a lot more memory pressure on the aggregation code path itself. The location of this Rule is a bit funny, and should probably change when the old aggregation path is changed. cc yhuai - Could you also tell me where to add tests for this? Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #9406 from hvanhovell/SPARK-9241-rewriter.	2015-11-06 16:04:20 -08:00
Nong Li	1ab72b0860	[SPARK-11410] [PYSPARK] Add python bindings for repartition and sortW… …ithinPartitions. Author: Nong Li <nong@databricks.com> Closes #9504 from nongli/spark-11410.	2015-11-06 15:48:20 -08:00
Wenchen Fan	7e9a9e603a	[SPARK-11269][SQL] Java API support & test cases for Dataset This simply brings https://github.com/apache/spark/pull/9358 up-to-date. Author: Wenchen Fan <wenchen@databricks.com> Author: Reynold Xin <rxin@databricks.com> Closes #9528 from rxin/dataset-java.	2015-11-06 15:37:07 -08:00
Thomas Graves	f6680cdc5d	[SPARK-11555] spark on yarn spark-class --num-workers doesn't work I tested the various options with both spark-submit and spark-class of specifying number of executors in both client and cluster mode where it applied. --num-workers, --num-executors, spark.executor.instances, SPARK_EXECUTOR_INSTANCES, default nothing supplied Author: Thomas Graves <tgraves@staydecay.corp.gq1.yahoo.com> Closes #9523 from tgravescs/SPARK-11555.	2015-11-06 15:24:33 -08:00
Xiangrui Meng	c447c9d546	[SPARK-11217][ML] save/load for non-meta estimators and transformers This PR implements the default save/load for non-meta estimators and transformers using the JSON serialization of param values. The saved metadata includes: * class name * uid * timestamp * paramMap The save/load interface is similar to DataFrames. We use the current active context by default, which should be sufficient for most use cases. ~~~scala instance.save("path") instance.write.context(sqlContext).overwrite().save("path") Instance.load("path") ~~~ The param handling is different from the design doc. We didn't save default and user-set params separately, and when we load it back, all parameters are user-set. This does cause issues. But it also cause other issues if we modify the default params. TODOs: * [x] Java test * [ ] a follow-up PR to implement default save/load for all non-meta estimators and transformers cc jkbradley Author: Xiangrui Meng <meng@databricks.com> Closes #9454 from mengxr/SPARK-11217.	2015-11-06 14:51:03 -08:00
Reynold Xin	3a652f691b	[SPARK-11561][SQL] Rename text data source's column name to value. Author: Reynold Xin <rxin@databricks.com> Closes #9527 from rxin/SPARK-11561.	2015-11-06 14:47:41 -08:00

1 2 3 4 5 ...

13648 commits