Commit graph

10006 commits

Author SHA1 Message Date
Xiangrui Meng e43139f403 [SPARK-5976][MLLIB] Add partitioner to factors returned by ALS
The model trained by ALS requires partitioning information to do quick lookups of a user/item factor when making recommendations for individual requests. In the new implementation, we didn't set partitioners in the factors returned by ALS, which would cause a performance regression.
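
A minimal sketch of why a known partitioner speeds up single-key lookups. This is plain Python, not Spark or ALS code; `NUM_PARTITIONS`, `partition_of`, and the lookup helpers are hypothetical names used only for illustration.

```python
from typing import Dict, List, Optional, Tuple

NUM_PARTITIONS = 4  # hypothetical partition count for the illustration

Factor = List[float]


def partition_of(key: int, num_partitions: int = NUM_PARTITIONS) -> int:
    """Hash partitioner: maps a key to exactly one partition index."""
    return hash(key) % num_partitions


def build_partitions(factors: Dict[int, Factor]) -> List[Dict[int, Factor]]:
    """Place each (id, factor) pair in the partition chosen by the partitioner."""
    partitions: List[Dict[int, Factor]] = [{} for _ in range(NUM_PARTITIONS)]
    for key, vec in factors.items():
        partitions[partition_of(key)][key] = vec
    return partitions


def lookup_with_partitioner(
    partitions: List[Dict[int, Factor]], key: int
) -> Tuple[Optional[Factor], int]:
    """Partitioner known: inspect only the one partition that can hold the key."""
    return partitions[partition_of(key)].get(key), 1


def lookup_without_partitioner(
    partitions: List[Dict[int, Factor]], key: int
) -> Tuple[Optional[Factor], int]:
    """Partitioner unknown: every partition has to be scanned."""
    found: Optional[Factor] = None
    scanned = 0
    for part in partitions:
        scanned += 1
        if key in part:
            found = part[key]
    return found, scanned
```

Each helper returns the factor together with the number of partitions it had to inspect, so the two lookups can be compared directly.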

srowen coderxiang

Author: Xiangrui Meng <meng@databricks.com>

Closes #4748 from mengxr/SPARK-5976 and squashes the following commits:

9373a09 [Xiangrui Meng] add partitioner to factors returned by ALS
260f183 [Xiangrui Meng] add a test for partitioner
2015-02-25 23:43:29 -08:00
Joseph K. Bradley d20559b157 [SPARK-5974] [SPARK-5980] [mllib] [python] [docs] Update ML guide with save/load, Python GBT
* Add GradientBoostedTrees Python examples to ML guide
  * I ran these in the pyspark shell, and they worked.
* Add save/load to examples in ML guide
* Added note to python docs about predict,transform not working within RDD actions,transformations in some cases (See SPARK-5981)

CC: mengxr

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #4750 from jkbradley/SPARK-5974 and squashes the following commits:

c410e38 [Joseph K. Bradley] Added note to LabeledPoint about attributes
bcae18b [Joseph K. Bradley] Added import of models for save/load examples in ml guide.  Fixed line length for tree.py, feature.py (but not other ML Pyspark files yet).
6d81c3e [Joseph K. Bradley] completed python GBT examples
9903309 [Joseph K. Bradley] Added note to python docs about predict,transform not working within RDD actions,transformations in some cases
c7dfad8 [Joseph K. Bradley] Added model save/load to ML guide.  Added GBT examples to ML guide
2015-02-25 16:13:17 -08:00
Brennon York 46a044a36a [SPARK-1182][Docs] Sort the configuration parameters in configuration.md
Sorts all configuration options present on the `configuration.md` page to ease readability.

Author: Brennon York <brennon.york@capitalone.com>

Closes #3863 from brennonyork/SPARK-1182 and squashes the following commits:

5696f21 [Brennon York] fixed merge conflict with port comments
81a7b10 [Brennon York] capitalized A in Allocation
e240486 [Brennon York] moved all spark.mesos properties into the running-on-mesos doc
7de5f75 [Brennon York] moved serialization from application to compression and serialization section
a16fec0 [Brennon York] moved shuffle settings from network to shuffle
f8fa286 [Brennon York] sorted encryption category
1023f15 [Brennon York] moved initialExecutors
e9d62aa [Brennon York] fixed akka.heartbeat.interval
25e6f6f [Brennon York] moved spark.executer.user*
4625ade [Brennon York] added spark.executor.extra* items
4ee5648 [Brennon York] fixed merge conflicts
1b49234 [Brennon York] sorting mishap
2b5758b [Brennon York] sorting mishap
6fbdf42 [Brennon York] sorting mishap
55dc6f8 [Brennon York] sorted security
ec34294 [Brennon York] sorted dynamic allocation
2a7c4a3 [Brennon York] sorted scheduling
aa9acdc [Brennon York] sorted networking
a4380b8 [Brennon York] sorted execution behavior
27f3919 [Brennon York] sorted compression and serialization
80a5bbb [Brennon York] sorted spark ui
3f32e5b [Brennon York] sorted shuffle behavior
6c51b38 [Brennon York] sorted runtime environment
efe9d6f [Brennon York] sorted application properties
2015-02-25 16:12:56 -08:00
Yanbo Liang 41e2e5acb7 [SPARK-5926] [SQL] make DataFrame.explain leverage queryExecution.logical
DataFrame.explain returns the wrong result when the query is a DDL command.

For example, the following two queries should print the same execution plan, but they do not:
sql("create table tb as select * from src where key > 490").explain(true)
sql("explain extended create table tb as select * from src where key > 490")

This is because DataFrame.explain uses logicalPlan, which has already been forced to execute; we should use the unexecuted plan, queryExecution.logical, instead.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #4707 from yanboliang/spark-5926 and squashes the following commits:

fa6db63 [Yanbo Liang] logicalPlan is not lazy
0e40a1b [Yanbo Liang] make DataFrame.explain leverage queryExecution.logical
2015-02-25 15:37:13 -08:00
Liang-Chi Hsieh 12dbf98c5d [SPARK-5999][SQL] Remove duplicate Literal matching block
Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #4760 from viirya/dup_literal and squashes the following commits:

06e7516 [Liang-Chi Hsieh] Remove duplicate Literal matching block.
2015-02-25 15:22:33 -08:00
Cheng Lian e0fdd467e2 [SPARK-6010] [SQL] Merging compatible Parquet schemas before computing splits
`ReadContext.init` calls `InitContext.getMergedKeyValueMetadata`, which doesn't know how to merge conflicting user-defined key-value metadata and throws an exception. In our case, when dealing with different but compatible schemas, different Parquet part-files carry different Spark SQL schema JSON strings, which causes this problem. Reading similar Parquet files generated by Hive doesn't suffer from this issue.

In this PR, we manually merge the schemas before passing them to `ReadContext` to avoid the exception.

Author: Cheng Lian <lian@databricks.com>

Closes #4768 from liancheng/spark-6010 and squashes the following commits:

9002f0a [Cheng Lian] Fixes SPARK-6010
2015-02-25 15:15:22 -08:00
Davies Liu f3f4c87b3d [SPARK-5944] [PySpark] fix version in Python API docs
use RELEASE_VERSION when building the Python API docs

Author: Davies Liu <davies@databricks.com>

Closes #4731 from davies/api_version and squashes the following commits:

c9744c9 [Davies Liu] Update create-release.sh
08cbc3f [Davies Liu] fix python docs
2015-02-25 15:13:34 -08:00
Kay Ousterhout 838a48036c [SPARK-5982] Remove incorrect Local Read Time Metric
This metric is incomplete, because the files are memory mapped, so much of the read from disk occurs later as tasks actually read the file's data.

This should be merged into 1.3, so that we never expose this incorrect metric to users.

CC pwendell ksakellis sryza

Author: Kay Ousterhout <kayousterhout@gmail.com>

Closes #4749 from kayousterhout/SPARK-5982 and squashes the following commits:

9737b5e [Kay Ousterhout] More fixes
a1eb300 [Kay Ousterhout] Removed one more use of local read time
cf13497 [Kay Ousterhout] [SPARK-5982] Remove incorrect Local Read Time Metric
2015-02-25 14:55:24 -08:00
Brennon York 9f603fce78 [SPARK-1955][GraphX]: VertexRDD can incorrectly assume index sharing
Fixes the issue where VertexRDDs that are `diff`ed, `innerJoin`ed, or `leftJoin`ed and have different partition sizes fail under the `zipPartitions` method. This fix tests whether the partitions are equal and, if not, repartitions the other VertexRDD to match the partition size of the calling VertexRDD.

Author: Brennon York <brennon.york@capitalone.com>

Closes #4705 from brennonyork/SPARK-1955 and squashes the following commits:

0882590 [Brennon York] updated to properly handle differently-partitioned vertexRDDs
2015-02-25 14:11:12 -08:00
Milan Straka a777c65da9 [SPARK-5970][core] Register directory created in getOrCreateLocalRootDirs for automatic deletion.
As documented in createDirectory, the result of createDirectory is not registered for automatic removal. Currently there are 4 directories left in `/tmp` after just running `pyspark`.

Author: Milan Straka <fox@ucw.cz>

Closes #4759 from foxik/remove-tmp-dirs and squashes the following commits:

280450d [Milan Straka] Use createTempDir in getOrCreateLocalRootDirs...
2015-02-25 21:33:34 +00:00
Sean Owen 7d8e6a2e44 SPARK-5930 [DOCS] Documented default of spark.shuffle.io.retryWait is confusing
Clarify default max wait in spark.shuffle.io.retryWait docs

CC andrewor14

Author: Sean Owen <sowen@cloudera.com>

Closes #4769 from srowen/SPARK-5930 and squashes the following commits:

ae2792b [Sean Owen] Clarify default max wait in spark.shuffle.io.retryWait docs
2015-02-25 12:20:44 -08:00
Michael Armbrust f84c799ea0 [SPARK-5996][SQL] Fix specialized outbound conversions
Author: Michael Armbrust <michael@databricks.com>

Closes #4757 from marmbrus/udtConversions and squashes the following commits:

3714aad [Michael Armbrust] [SPARK-5996][SQL] Fix specialized outbound conversions
2015-02-25 10:13:40 -08:00
guliangliang dd077abf2e [SPARK-5771] Number of Cores in Completed Applications of Standalone Master Web Page always be 0 if sc.stop() is called
In Standalone mode, the number of cores shown under Completed Applications on the Master Web Page is always zero if sc.stop() is called.
The number is correct if sc.stop() is not called.
The likely reason:
after sc.stop() is called, the function removeExecutor of class ApplicationInfo is called, which reduces the variable coresGranted to zero. The variable coresGranted is used to display the number of cores on the Web Page.

Author: guliangliang <guliangliang@qiyi.com>

Closes #4567 from marsishandsome/Spark5771 and squashes the following commits:

694796e [guliangliang] remove duplicate code
a20e390 [guliangliang] change to Cores Using & Requested
0c19c95 [guliangliang] change Cores to Cores (max)
cfbd97d [guliangliang] [SPARK-5771] Number of Cores in Completed Applications of Standalone Master Web Page always be 0 if sc.stop() is called
2015-02-25 14:48:02 +00:00
Benedikt Linse 5b8480e035 [GraphX] fixing 3 typos in the graphx programming guide
Corrected 3 Typos in the GraphX programming guide. I hope this is the correct way to contribute.

Author: Benedikt Linse <benedikt.linse@gmail.com>

Closes #4766 from 1123/master and squashes the following commits:

8a63812 [Benedikt Linse] fixing 3 typos in the graphx programming guide
2015-02-25 14:46:17 +00:00
prabs d51ed263ee [SPARK-5666][streaming][MQTT streaming] some trivial fixes
Modified to adhere to accepted coding standards, as pointed out by tdas in PR #3844

Author: prabs <prabsmails@gmail.com>
Author: Prabeesh K <prabsmails@gmail.com>

Closes #4178 from prabeesh/master and squashes the following commits:

bd2cb49 [Prabeesh K] address the comment
ccc0765 [prabs] address the comment
46f9619 [prabs] address the comment
c035bdc [prabs] address the comment
22dd7f7 [prabs] address the comments
0cc67bd [prabs] address the comment
838c38e [prabs] address the comment
cd57029 [prabs] address the comments
66919a3 [Prabeesh K] changed MqttDefaultFilePersistence to MemoryPersistence
5857989 [prabs] modified to adhere to accepted coding standards
2015-02-25 14:37:35 +00:00
Davies Liu d641fbb39c [SPARK-5994] [SQL] Python DataFrame documentation fixes
* select empty should NOT be the same as select. Make sure selectExpr behaves the same.
* join param documentation
* link to source doesn't work in the Jekyll-generated file
* cross reference of columns (i.e. enabling linking)
* show(): move df example before df.show()
* move tests in SQLContext out of docstring, otherwise the doc is too long
* Column.desc and .asc don't have any documentation
* in documentation, sort functions.*

Author: Davies Liu <davies@databricks.com>

Closes #4756 from davies/df_docs and squashes the following commits:

f30502c [Davies Liu] fix doc
32f0d46 [Davies Liu] fix DataFrame docs
2015-02-24 20:51:55 -08:00
Yin Huai 769e092bdc [SPARK-5286][SQL] SPARK-5286 followup
https://issues.apache.org/jira/browse/SPARK-5286

Author: Yin Huai <yhuai@databricks.com>

Closes #4755 from yhuai/SPARK-5286-throwable and squashes the following commits:

4c0c450 [Yin Huai] Catch Throwable instead of Exception.
2015-02-24 19:51:36 -08:00
Tathagata Das 922b43b3cc [SPARK-5993][Streaming][Build] Fix assembly jar location of kafka-assembly
The published kafka-assembly JAR was empty in 1.3.0-RC1.
This is because the Maven build generated two JARs:
1. an empty JAR file (since kafka-assembly has no code of its own)
2. an assembly JAR file containing everything, in a different location from 1
The Maven publishing plugin uploaded 1 and not 2.
If 2 is instead not configured to be generated in a different location, there is only one JAR containing everything, and that is the one that gets published.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #4753 from tdas/SPARK-5993 and squashes the following commits:

c390db8 [Tathagata Das] Fix assembly jar location of kafka-assembly
2015-02-24 19:10:37 -08:00
Reynold Xin fba11c2f55 [SPARK-5985][SQL] DataFrame sortBy -> orderBy in Python.
Also added desc/asc function for constructing sorting expressions more conveniently. And added a small fix to lift alias out of cast expression.

Author: Reynold Xin <rxin@databricks.com>

Closes #4752 from rxin/SPARK-5985 and squashes the following commits:

aeda5ae [Reynold Xin] Added Experimental flag to ColumnName.
047ad03 [Reynold Xin] Lift alias out of cast.
c9cf17c [Reynold Xin] [SPARK-5985][SQL] DataFrame sortBy -> orderBy in Python.
2015-02-24 18:59:23 -08:00
Reynold Xin 53a1ebf33b [SPARK-5904][SQL] DataFrame Java API test suites.
Added a new test suite to make sure Java DF programs can use varargs properly.
Also moved all suites into test.org.apache.spark package to make sure the suites also test for method visibility.

Author: Reynold Xin <rxin@databricks.com>

Closes #4751 from rxin/df-tests and squashes the following commits:

1e8b8e4 [Reynold Xin] Fixed imports and renamed JavaAPISuite.
a6ca53b [Reynold Xin] [SPARK-5904][SQL] DataFrame Java API test suites.
2015-02-24 18:51:41 -08:00
Cheng Lian f816e73902 [SPARK-5751] [SQL] [WIP] Revamped HiveThriftServer2Suite for robustness
**NOTICE** Do NOT merge this, as we're waiting for #3881 to be merged.

`HiveThriftServer2Suite` has been notorious for its flakiness for a while. This was mostly due to spawning and communicating with external server processes. This PR revamps this test suite for better robustness:

1. Fixes a race condition that occurred while using `tail -f` to check the log file

   It's possible that the line we are looking for has already been printed into the log file before we start the `tail -f` process. This PR uses `tail -n +0 -f` to ensure all lines are checked.

2. Retries up to 3 times if the server fails to start

   In most cases, the server fails to start because of a port conflict. This PR no longer asks the system to choose an available TCP port, but uses a random port first, and retries up to 3 times if the server fails to start.

3. A server instance is reused among all test cases within a single suite

   The original `HiveThriftServer2Suite` is split into two test suites, `HiveThriftBinaryServerSuite` and `HiveThriftHttpServerSuite`. Each suite starts a `HiveThriftServer2` instance and reuses it for all of its test cases.
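
The "random port, retry up to 3 times" idea from item 2 can be sketched roughly as follows. This is an illustrative Python stand-in, not the test suite's Scala code; `start_with_retries`, `bind_listener`, and the port range are assumptions made for the example.

```python
import random
import socket
from typing import Callable, Optional, Tuple, TypeVar

T = TypeVar("T")


def start_with_retries(
    start_server: Callable[[int], T],
    max_attempts: int = 3,
    port_range: Tuple[int, int] = (20000, 60000),  # assumed range, not from the PR
    rng: Optional[random.Random] = None,
) -> Tuple[T, int]:
    """Pick a random port and try to start; on failure, retry with a new port."""
    rng = rng or random.Random()
    last_error: Optional[OSError] = None
    for _ in range(max_attempts):
        port = rng.randint(*port_range)
        try:
            return start_server(port), port
        except OSError as err:  # e.g. the port is already in use
            last_error = err
    raise RuntimeError(
        f"server failed to start after {max_attempts} attempts"
    ) from last_error


def bind_listener(port: int) -> socket.socket:
    """Example start function: bind a local TCP listener on the given port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind(("127.0.0.1", port))
        sock.listen(1)
    except OSError:
        sock.close()
        raise
    return sock
```

Calling `start_with_retries(bind_listener)` returns the listening socket and the port that was finally used; the caller is responsible for closing the socket.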

**TODO**

- [ ] Starts the Thrift server in foreground once #3881 is merged (adding `--foreground` flag to `spark-daemon.sh`)

Author: Cheng Lian <lian@databricks.com>

Closes #4720 from liancheng/revamp-thrift-server-tests and squashes the following commits:

d6c80eb [Cheng Lian] Relaxes server startup timeout
6f14eb1 [Cheng Lian] Revamped HiveThriftServer2Suite for robustness
2015-02-25 08:34:55 +08:00
MechCoder 2a0fe34891 [SPARK-5436] [MLlib] Validate GradientBoostedTrees using runWithValidation
Training can stop early if the decrease in error rate is less than a certain tolerance (tol), or if the error increases because the training data is overfit.

This introduces a new method, runWithValidation, which takes a pair of RDDs: one for the training data and the other for validation.
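
A minimal sketch of the early-stopping rule described above, in plain Python rather than MLlib code. The function name `early_stopping_rounds` and the idea of passing in a precomputed list of per-iteration validation errors are assumptions for illustration; the real method trains trees and evaluates them as it goes.

```python
from typing import Sequence, Tuple


def early_stopping_rounds(
    validation_errors: Sequence[float], tol: float = 1e-3
) -> Tuple[int, float]:
    """Return (boosting rounds to keep, best validation error).

    validation_errors[m] is the validation error after m + 1 rounds.
    Training stops as soon as the improvement over the best error so far
    is smaller than tol, which also covers the case where the error rises.
    """
    if not validation_errors:
        raise ValueError("need at least one validation error")
    best_error = validation_errors[0]
    best_rounds = 1
    for m in range(1, len(validation_errors)):
        current = validation_errors[m]
        if best_error - current < tol:
            break  # improvement too small, or the error increased
        if current < best_error:
            # With a negative tol a slightly worse round does not stop
            # training, but it must not replace the best model either.
            best_error = current
            best_rounds = m + 1
    return best_rounds, best_error
```

The returned round count corresponds to the model with the best validation error seen before stopping, not necessarily the last round that was trained.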

Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #4677 from MechCoder/spark-5436 and squashes the following commits:

1bb21d4 [MechCoder] Combine regression and classification tests into a single one
e4d799b [MechCoder] Addresses indentation and doc comments
b48a70f [MechCoder] COSMIT
b928a19 [MechCoder] Move validation while training section under usage tips
fad9b6e [MechCoder] Made the following changes 1. Add section to documentation 2. Return corresponding to bestValidationError 3. Allow negative tolerance.
55e5c3b [MechCoder] One liner for prevValidateError
3e74372 [MechCoder] TST: Add test for classification
77549a9 [MechCoder] [SPARK-5436] Validate GradientBoostedTrees using runWithValidation
2015-02-24 15:13:22 -08:00
Davies Liu da505e5927 [SPARK-5973] [PySpark] fix zip with two RDDs with AutoBatchedSerializer
Author: Davies Liu <davies@databricks.com>

Closes #4745 from davies/fix_zip and squashes the following commits:

2124b2c [Davies Liu] Update tests.py
b5c828f [Davies Liu] increase the number of records
c1e40fd [Davies Liu] fix zip with two RDDs with AutoBatchedSerializer
2015-02-24 14:50:00 -08:00
Michael Armbrust a2b9137923 [SPARK-5952][SQL] Lock when using hive metastore client
Author: Michael Armbrust <michael@databricks.com>

Closes #4746 from marmbrus/hiveLock and squashes the following commits:

8b871cf [Michael Armbrust] [SPARK-5952][SQL] Lock when using hive metastore client
2015-02-24 13:39:29 -08:00
Judy c5ba975ee8 [Spark-5708] Add Slf4jSink to Spark Metrics
Add Slf4jSink to Spark Metrics using Coda Hale's Slf4jReporter.
This sends metrics to log4j, allowing Spark users to reuse their log4j pipeline for metrics collection.

I reviewed the existing unit tests and didn't see any sink-related tests. Please advise on whether tests should be added.

Author: Judy <judynash@microsoft.com>
Author: judynash <judynash@microsoft.com>

Closes #4644 from judynash/master and squashes the following commits:

57ef214 [judynash] doc clarification and indent fixes
a751a66 [Judy] Spark-5708: Add Slf4jSink to Spark Metrics
2015-02-24 20:50:16 +00:00
Xiangrui Meng 105791e35c [MLLIB] Change x_i to y_i in Variance's user guide
Variance is calculated on labels/responses.

Author: Xiangrui Meng <meng@databricks.com>

Closes #4740 from mengxr/patch-1 and squashes the following commits:

673317b [Xiangrui Meng] [MLLIB] Change x_i to y_i in Variance's user guide
2015-02-24 11:38:59 -08:00
Andrew Or 6d2caa576f [SPARK-5965] Standalone Worker UI displays {{USER_JAR}}
For screenshot see: https://issues.apache.org/jira/browse/SPARK-5965
This was caused by 20a6013106.

Author: Andrew Or <andrew@databricks.com>

Closes #4739 from andrewor14/user-jar-blocker and squashes the following commits:

23c4a9e [Andrew Or] Use right argument
2015-02-24 11:08:07 -08:00
Tathagata Das 64d2c01ff1 [Spark-5967] [UI] Correctly clean JobProgressListener.stageIdToActiveJobIds
Patch should be self-explanatory
pwendell JoshRosen

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #4741 from tdas/SPARK-5967 and squashes the following commits:

653b5bb [Tathagata Das] Fixed the fix and added test
e2de972 [Tathagata Das] Clear stages which have no corresponding active jobs.
2015-02-24 11:02:47 -08:00
Michael Armbrust 201236628a [SPARK-5532][SQL] Repartition should not use external rdd representation
Author: Michael Armbrust <michael@databricks.com>

Closes #4738 from marmbrus/udtRepart and squashes the following commits:

c06d7b5 [Michael Armbrust] fix compilation
91c8829 [Michael Armbrust] [SQL][SPARK-5532] Repartition should not use external rdd representation
2015-02-24 10:52:18 -08:00
Michael Armbrust 0a59e45e2f [SPARK-5910][SQL] Support for as in selectExpr
Author: Michael Armbrust <michael@databricks.com>

Closes #4736 from marmbrus/asExprs and squashes the following commits:

5ba97e4 [Michael Armbrust] [SPARK-5910][SQL] Support for as in selectExpr
2015-02-24 10:49:51 -08:00
Cheng Lian 8403331333 [SPARK-5968] [SQL] Suppresses ParquetOutputCommitter WARN logs
Please refer to the [JIRA ticket] [1] for the motivation.

[1]: https://issues.apache.org/jira/browse/SPARK-5968

Author: Cheng Lian <lian@databricks.com>

Closes #4744 from liancheng/spark-5968 and squashes the following commits:

caac6a8 [Cheng Lian] Suppresses ParquetOutputCommitter WARN logs
2015-02-24 10:45:38 -08:00
Xiangrui Meng cf2e41653d [SPARK-5958][MLLIB][DOC] update block matrix user guide
* Removed SVD code from examples.
* Corrected Java API doc link.
* Updated variable names: `AtransposeA` -> `ata`.
* Minor changes.

brkyvz

Author: Xiangrui Meng <meng@databricks.com>

Closes #4737 from mengxr/update-block-matrix-user-guide and squashes the following commits:

70f53ac [Xiangrui Meng] update block matrix user guide
2015-02-23 22:08:44 -08:00
Michael Armbrust 1ed57086d4 [SPARK-5873][SQL] Allow viewing of partially analyzed plans in queryExecution
Author: Michael Armbrust <michael@databricks.com>

Closes #4684 from marmbrus/explainAnalysis and squashes the following commits:

afbaa19 [Michael Armbrust] fix python
d93278c [Michael Armbrust] fix hive
e5fa0a4 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into explainAnalysis
52119f2 [Michael Armbrust] more tests
82a5431 [Michael Armbrust] fix tests
25753d2 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into explainAnalysis
aee1e6a [Michael Armbrust] fix hive
b23a844 [Michael Armbrust] newline
de8dc51 [Michael Armbrust] more comments
acf620a [Michael Armbrust] [SPARK-5873][SQL] Show partially analyzed plans in query execution
2015-02-23 17:34:54 -08:00
Yin Huai 48376bfe9c [SPARK-5935][SQL] Accept MapType in the schema provided to a JSON dataset.
JIRA: https://issues.apache.org/jira/browse/SPARK-5935

Author: Yin Huai <yhuai@databricks.com>
Author: Yin Huai <huai@cse.ohio-state.edu>

Closes #4710 from yhuai/jsonMapType and squashes the following commits:

3e40390 [Yin Huai] Remove unnecessary changes.
f8e6267 [Yin Huai] Fix test.
baa36e3 [Yin Huai] Accept MapType in the schema provided to jsonFile/jsonRDD.
2015-02-23 17:16:34 -08:00
Joseph K. Bradley 59536cc87e [SPARK-5912] [docs] [mllib] Small fixes to ChiSqSelector docs
Fixes:
* typo in Scala example
* Removed comment "usually applied on sparse data" since that is debatable
* small edits to text for clarity

CC: avulanov  I noticed a typo post-hoc and ended up making a few small edits.  Do the changes look OK?

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #4732 from jkbradley/chisqselector-docs and squashes the following commits:

9656a3b [Joseph K. Bradley] added Java example for ChiSqSelector to guide
3f3f9f4 [Joseph K. Bradley] small fixes to ChiSqSelector docs
2015-02-23 16:15:57 -08:00
Alexander Ulanov 28ccf5ee76 [MLLIB] SPARK-5912 Programming guide for feature selection
Added a description of ChiSqSelector and a few words about feature selection in general. I could add a code example; however, it would not look reasonable in the absence of a feature discretizer or a dataset in the `data` folder that has redundant features.

Author: Alexander Ulanov <nashb@yandex.ru>

Closes #4709 from avulanov/SPARK-5912 and squashes the following commits:

19a8a4e [Alexander Ulanov] Addressing reviewers comments @jkbradley
58d9e4d [Alexander Ulanov] Addressing reviewers comments @jkbradley
eb6b9fe [Alexander Ulanov] Typo
2921a1d [Alexander Ulanov] ChiSqSelector example of use
c845350 [Alexander Ulanov] ChiSqSelector docs
2015-02-23 12:09:40 -08:00
Jacky Li 651a1c019e [SPARK-5939][MLLib] make FPGrowth example app take parameters
Add parameter parsing to the FPGrowth example app in Scala and Java.
Also add a sample data file to the data/mllib folder.

Author: Jacky Li <jacky.likun@huawei.com>

Closes #4714 from jackylk/parameter and squashes the following commits:

8c478b3 [Jacky Li] fix according to comments
3bb74f6 [Jacky Li] make FPGrowth example app take parameters
f0e4d10 [Jacky Li] make FPGrowth example app take parameters
2015-02-23 08:47:28 -08:00
CodingCat 242d49584c [SPARK-5724] fix the misconfiguration in AkkaUtils
https://issues.apache.org/jira/browse/SPARK-5724

In AkkaUtils, we set several failure-detector-related parameters as follows:

```
val akkaConf = ConfigFactory.parseMap(conf.getAkkaConf.toMap[String, String])
      .withFallback(akkaSslConfig).withFallback(ConfigFactory.parseString(
      s"""
      |akka.daemonic = on
      |akka.loggers = [""akka.event.slf4j.Slf4jLogger""]
      |akka.stdout-loglevel = "ERROR"
      |akka.jvm-exit-on-fatal-error = off
      |akka.remote.require-cookie = "$requireCookie"
      |akka.remote.secure-cookie = "$secureCookie"
      |akka.remote.transport-failure-detector.heartbeat-interval = $akkaHeartBeatInterval s
      |akka.remote.transport-failure-detector.acceptable-heartbeat-pause = $akkaHeartBeatPauses s
      |akka.remote.transport-failure-detector.threshold = $akkaFailureDetector
      |akka.actor.provider = "akka.remote.RemoteActorRefProvider"
      |akka.remote.netty.tcp.transport-class = "akka.remote.transport.netty.NettyTransport"
      |akka.remote.netty.tcp.hostname = "$host"
      |akka.remote.netty.tcp.port = $port
      |akka.remote.netty.tcp.tcp-nodelay = on
      |akka.remote.netty.tcp.connection-timeout = $akkaTimeout s
      |akka.remote.netty.tcp.maximum-frame-size = ${akkaFrameSize}B
      |akka.remote.netty.tcp.execution-pool-size = $akkaThreads
      |akka.actor.default-dispatcher.throughput = $akkaBatchSize
      |akka.log-config-on-start = $logAkkaConfig
      |akka.remote.log-remote-lifecycle-events = $lifecycleEvents
      |akka.log-dead-letters = $lifecycleEvents
      |akka.log-dead-letters-during-shutdown = $lifecycleEvents
      """.stripMargin))

```

Actually, there is no parameter named "akka.remote.transport-failure-detector.threshold"
see: http://doc.akka.io/docs/akka/2.3.4/general/configuration.html
What exists is "akka.remote.watch-failure-detector.threshold"

Author: CodingCat <zhunansjtu@gmail.com>

Closes #4512 from CodingCat/SPARK-5724 and squashes the following commits:

bafe56e [CodingCat] fix the grammar in configuration doc
338296e [CodingCat] remove failure-detector related info
8bfcfd4 [CodingCat] fix the misconfiguration in AkkaUtils
2015-02-23 11:29:25 +00:00
Saisai Shao 757b14b862 [SPARK-5943][Streaming] Update the test to use new API to reduce the warning
Author: Saisai Shao <saisai.shao@intel.com>

Closes #4722 from jerryshao/SPARK-5943 and squashes the following commits:

1b01233 [Saisai Shao] Update the test to use new API to reduce the warning
2015-02-23 11:27:27 +00:00
Makoto Fukuhara 9348767416 [EXAMPLES] fix typo.
Author: Makoto Fukuhara <fukuo33@gmail.com>

Closes #4724 from fukuo33/fix-typo and squashes the following commits:

8c806b9 [Makoto Fukuhara] fix typo.
2015-02-23 09:24:33 +00:00
Ilya Ganelin 95cd643aa9 [SPARK-3885] Provide mechanism to remove accumulators once they are no longer used
Instead of storing strong references to accumulators, I now store weak references, and I have updated any code that uses these accumulators to check whether the reference resolves before using the accumulator. A weak reference is cleared once no other copy of the variable exists, whereas with a soft reference the accumulators would only be cleared when the GC is about to run out of memory.
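
The weak-reference pattern can be sketched as below. This uses Python's `weakref` module as a stand-in for JVM weak references; `Accumulator`, `AccumulatorRegistry`, and their methods are hypothetical names for illustration and do not mirror Spark's classes.

```python
import gc
import weakref
from typing import Dict


class Accumulator:
    """Toy accumulator with an id and a running value."""

    def __init__(self, acc_id: int, value: int = 0) -> None:
        self.id = acc_id
        self.value = value

    def add(self, amount: int) -> None:
        self.value += amount


class AccumulatorRegistry:
    """Tracks accumulators by id without keeping them alive."""

    def __init__(self) -> None:
        self._originals: Dict[int, "weakref.ref[Accumulator]"] = {}

    def register(self, acc: Accumulator) -> None:
        # Only a weak reference is stored, so the registry alone
        # never prevents the accumulator from being collected.
        self._originals[acc.id] = weakref.ref(acc)

    def get(self, acc_id: int) -> Accumulator:
        ref = self._originals.get(acc_id)
        acc = ref() if ref is not None else None  # None once collected
        if acc is None:
            raise LookupError(f"accumulator {acc_id} is unknown or was collected")
        return acc

    def cleanup(self) -> int:
        """Drop entries whose accumulator is gone; return how many were removed."""
        stale = [k for k, ref in self._originals.items() if ref() is None]
        for k in stale:
            del self._originals[k]
        return len(stale)
```

Callers always go through `get`, which checks that the reference still resolves before handing the accumulator back; `cleanup` removes the stale entries left behind by collected accumulators.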

Author: Ilya Ganelin <ilya.ganelin@capitalone.com>

Closes #4021 from ilganeli/SPARK-3885 and squashes the following commits:

4ba9575 [Ilya Ganelin]  Fixed error in test suite
8510943 [Ilya Ganelin] Extra code
bb76ef0 [Ilya Ganelin] File deleted somehow
283a333 [Ilya Ganelin] Added cleanup method for accumulators to remove stale references within Accumulators.original to accumulators that are now out of scope
345fd4f [Ilya Ganelin] Merge remote-tracking branch 'upstream/master' into SPARK-3885
7485a82 [Ilya Ganelin] Fixed build error
c8e0f2b [Ilya Ganelin] Added working test for accumulator garbage collection
94ce754 [Ilya Ganelin] Still not being properly garbage collected
8722b63 [Ilya Ganelin] Fixing gc test
7414a9c [Ilya Ganelin] Added test for accumulator garbage collection
18d62ec [Ilya Ganelin] Updated to throw Exception when accessing a GCd accumulator
9a81928 [Ilya Ganelin] Reverting permissions changes
28f705c [Ilya Ganelin] Merge remote-tracking branch 'upstream/master' into SPARK-3885
b820ab4b [Ilya Ganelin] reset
d78f4bf [Ilya Ganelin] Removed obsolete comment
0746e61 [Ilya Ganelin] Updated DAGSchedulerSUite to fix bug
3350852 [Ilya Ganelin] Updated DAGScheduler and Suite to correctly use new implementation of WeakRef Accumulator storage
c49066a [Ilya Ganelin] Merge remote-tracking branch 'upstream/master' into SPARK-3885
cbb9023 [Ilya Ganelin] Merge remote-tracking branch 'upstream/master' into SPARK-3885
a77d11b [Ilya Ganelin] Updated Accumulators class to store weak references instead of strong references to allow garbage collection of old accumulators
2015-02-22 22:57:26 -08:00
Aaron Josephs e4f9d03d72 [SPARK-911] allow efficient queries for a range if RDD is partitioned wi...
...th RangePartitioner
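
As a rough illustration of why range partitioning makes range queries cheap: when the partition boundaries are known, only the partitions that overlap the queried range need to be scanned. The sketch below is plain Python, not Spark code; the `bounds` layout (sorted upper bounds, with one extra final partition for larger keys) and the function names are assumptions for the example.

```python
from bisect import bisect_left
from typing import List, Sequence


def partition_for_key(bounds: Sequence[int], key: int) -> int:
    """Partition i holds keys <= bounds[i]; the last partition holds the rest."""
    return bisect_left(bounds, key)


def partitions_for_range(bounds: Sequence[int], lower: int, upper: int) -> List[int]:
    """Indices of the only partitions that can contain keys in [lower, upper]."""
    if lower > upper:
        return []
    first = partition_for_key(bounds, lower)
    last = partition_for_key(bounds, upper)
    return list(range(first, last + 1))
```

With three bounds there are four partitions, and a query that falls between two adjacent bounds touches a single partition instead of all four.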

Author: Aaron Josephs <ajoseph4@binghamton.edu>

Closes #1381 from aaronjosephs/PLAT-911 and squashes the following commits:

e30ade5 [Aaron Josephs] [SPARK-911] allow efficient queries for a range if RDD is partitioned with RangePartitioner
2015-02-22 22:09:06 -08:00
Cheng Hao 275b1bef89 [DataFrame] [Typo] Fix the typo
Author: Cheng Hao <hao.cheng@intel.com>

Closes #4717 from chenghao-intel/typo1 and squashes the following commits:

858d7b0 [Cheng Hao] update the typo
2015-02-22 08:56:30 +00:00
Alexander a7f9039025 [DOCS] Fix typo in API for custom InputFormats based on the “new” MapReduce API
This looks like a simple typo: `SparkContext.newHadoopRDD` instead of `SparkContext.newAPIHadoopRDD`, as in the actual API docs at http://spark.apache.org/docs/1.2.1/api/scala/index.html#org.apache.spark.SparkContext

Author: Alexander <abezzubov@nflabs.com>

Closes #4718 from bzz/hadoop-InputFormats-doc-fix and squashes the following commits:

680a4c4 [Alexander] Fix typo in docs on custom Hadoop InputFormats
2015-02-22 08:53:05 +00:00
Patrick Wendell 46462ff255 MAINTENANCE: Automated closing of pull requests.
This commit exists to close the following pull requests on GitHub:

Closes #3490 (close requested by 'andrewor14')
Closes #4646 (close requested by 'srowen')
Closes #3591 (close requested by 'andrewor14')
Closes #3656 (close requested by 'andrewor14')
Closes #4553 (close requested by 'JoshRosen')
Closes #4202 (close requested by 'srowen')
Closes #4497 (close requested by 'marmbrus')
Closes #4150 (close requested by 'andrewor14')
Closes #2409 (close requested by 'andrewor14')
Closes #4221 (close requested by 'srowen')
2015-02-21 23:07:30 -08:00
Evan Yu 7683982faf [SPARK-5860][CORE] JdbcRDD: overflow on large range with high number of partitions
Fix an overflow bug in JdbcRDD when calculating partitions for large BIGINT ids
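
A sketch of how this kind of overflow can arise. The partition-start formula `lower + i * length / numPartitions` is assumed here for illustration, and 64-bit wraparound is emulated in Python (whose integers do not overflow) with the hypothetical helper `wrap_int64`. The exact variant plays the role of arbitrary-precision arithmetic such as BigInt.

```python
INT64_MIN = -(2 ** 63)
UINT64_SPAN = 2 ** 64


def wrap_int64(x: int) -> int:
    """Emulate signed 64-bit two's-complement wraparound."""
    return (x - INT64_MIN) % UINT64_SPAN + INT64_MIN


def jvm_div(a: int, b: int) -> int:
    """Integer division that truncates toward zero, as on the JVM."""
    q = abs(a) // abs(b)
    return q if (a >= 0) == (b > 0) else -q


def partition_start_64bit(lower: int, upper: int, num_partitions: int, i: int) -> int:
    """Start of partition i when every intermediate result is a 64-bit long."""
    length = wrap_int64(1 + upper - lower)
    product = wrap_int64(i * length)  # this product is where the wrap happens
    return wrap_int64(lower + jvm_div(product, num_partitions))


def partition_start_exact(lower: int, upper: int, num_partitions: int, i: int) -> int:
    """Same formula with arbitrary-precision integers, so nothing wraps."""
    length = 1 + upper - lower
    return lower + (i * length) // num_partitions
```

For small ranges the two functions agree. For a range as wide as 1 to 2**62 split into 20 partitions, the 64-bit variant yields a start below the lower bound for partition 10, while the exact variant stays inside the range.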

Author: Evan Yu <ehotou@gmail.com>

Closes #4701 from hotou/SPARK-5860 and squashes the following commits:

9e038d1 [Evan Yu] [SPARK-5860][CORE] Prevent overflowing at the length level
7883ad9 [Evan Yu] [SPARK-5860][CORE] Prevent overflowing at the length level
c88755a [Evan Yu] [SPARK-5860][CORE] switch to BigInt instead of BigDecimal
4e9ff4f [Evan Yu] [SPARK-5860][CORE] JdbcRDD overflow on large range with high number of partitions
2015-02-21 20:40:21 +00:00
Hari Shreedharan 7138816abe [SPARK-5937][YARN] Fix ClientSuite to set YARN mode, so that the correct class is used in t...
...ests.

Without this, SparkHadoopUtil is used by the Client instead of YarnSparkHadoopUtil.

Author: Hari Shreedharan <hshreedharan@apache.org>

Closes #4711 from harishreedharan/SPARK-5937 and squashes the following commits:

d154de6 [Hari Shreedharan] Use System.clearProperty() instead of setting the value of SPARK_YARN_MODE to empty string.
f729f70 [Hari Shreedharan] Fix ClientSuite to set YARN mode, so that the correct class is used in tests.
2015-02-21 10:01:01 -08:00
Nishkam Ravi d3cbd38c33 SPARK-5841 [CORE] [HOTFIX 2] Memory leak in DiskBlockManager
Continue to see IllegalStateException in YARN cluster mode. Adding a simple workaround for now.

Author: Nishkam Ravi <nravi@cloudera.com>
Author: nishkamravi2 <nishkamravi@gmail.com>
Author: nravi <nravi@c1704.halxg.cloudera.com>

Closes #4690 from nishkamravi2/master_nravi and squashes the following commits:

d453197 [nishkamravi2] Update NewHadoopRDD.scala
6f41a1d [nishkamravi2] Update NewHadoopRDD.scala
0ce2c32 [nishkamravi2] Update HadoopRDD.scala
f7e33c2 [Nishkam Ravi] Merge branch 'master_nravi' of https://github.com/nishkamravi2/spark into master_nravi
ba1eb8b [Nishkam Ravi] Try-catch block around the two occurrences of removeShutDownHook. Deletion of semi-redundant occurrences of expensive operation inShutDown.
71d0e17 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
494d8c0 [nishkamravi2] Update DiskBlockManager.scala
3c5ddba [nishkamravi2] Update DiskBlockManager.scala
f0d12de [Nishkam Ravi] Workaround for IllegalStateException caused by recent changes to BlockManager.stop
79ea8b4 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
b446edc [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
5c9a4cb [nishkamravi2] Update TaskSetManagerSuite.scala
535295a [nishkamravi2] Update TaskSetManager.scala
3e1b616 [Nishkam Ravi] Modify test for maxResultSize
9f6583e [Nishkam Ravi] Changes to maxResultSize code (improve error message and add condition to check if maxResultSize > 0)
5f8f9ed [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
636a9ff [nishkamravi2] Update YarnAllocator.scala
8f76c8b [Nishkam Ravi] Doc change for yarn memory overhead
35daa64 [Nishkam Ravi] Slight change in the doc for yarn memory overhead
5ac2ec1 [Nishkam Ravi] Remove out
dac1047 [Nishkam Ravi] Additional documentation for yarn memory overhead issue
42c2c3d [Nishkam Ravi] Additional changes for yarn memory overhead issue
362da5e [Nishkam Ravi] Additional changes for yarn memory overhead
c726bd9 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
f00fa31 [Nishkam Ravi] Improving logging for AM memoryOverhead
1cf2d1e [nishkamravi2] Update YarnAllocator.scala
ebcde10 [Nishkam Ravi] Modify default YARN memory_overhead-- from an additive constant to a multiplier (redone to resolve merge conflicts)
2e69f11 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
efd688a [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark
2b630f9 [nravi] Accept memory input as "30g", "512M" instead of an int value, to be consistent with rest of Spark
3bf8fad [nravi] Merge branch 'master' of https://github.com/apache/spark
5423a03 [nravi] Merge branch 'master' of https://github.com/apache/spark
eb663ca [nravi] Merge branch 'master' of https://github.com/apache/spark
df2aeb1 [nravi] Improved fix for ConcurrentModificationIssue (Spark-1097, Hadoop-10456)
6b840f0 [nravi] Undo the fix for SPARK-1758 (the problem is fixed)
5108700 [nravi] Fix in Spark for the Concurrent thread modification issue (SPARK-1097, HADOOP-10456)
681b36f [nravi] Fix for SPARK-1758: failing test org.apache.spark.JavaAPISuite.wholeTextFiles
2015-02-21 09:59:28 -08:00
Jacky Li e155324711 [MLlib] fix typo
fix typo: it should be "default:" instead of "default;"

Author: Jacky Li <jackylk@users.noreply.github.com>

Closes #4713 from jackylk/patch-10 and squashes the following commits:

15daf2e [Jacky Li] [MLlib] fix typo
2015-02-21 13:00:16 +00:00
Davies Liu 5b0a42cb17 [SPARK-5898] [SPARK-5896] [SQL] [PySpark] create DataFrame from pandas and tuple/list
Fix createDataFrame() from pandas DataFrame (not tested by jenkins, depends on SPARK-5693).

It also supports creating a DataFrame from a plain tuple/list without column names; `_1`, `_2` will be used as column names.

Author: Davies Liu <davies@databricks.com>

Closes #4679 from davies/pandas and squashes the following commits:

c0cbe0b [Davies Liu] fix tests
8466d1d [Davies Liu] fix create DataFrame from pandas
2015-02-20 15:35:05 -08:00