Commit graph

8204 commits

Author SHA1 Message Date
RJ Nowling fec921552f [MLLib] Fix example code variable name misspelling in MLLib Feature Extraction guide
Author: RJ Nowling <rnowling@gmail.com>

Closes #2459 from rnowling/tfidf-fix and squashes the following commits:

b370a91 [RJ Nowling] Fix variable name misspelling in MLLib Feature Extraction guide
2014-09-22 09:10:41 -07:00
wangfei fd0b32c520 [Minor]ignore .idea_modules
ignore .idea_modules ,  ```sbt/sbt gen-idea``` generate this dir.

Author: wangfei <wangfei1@huawei.com>

Closes #2476 from scwf/patch-4 and squashes the following commits:

e6ab88a [wangfei] ignore .idea_modules
2014-09-21 13:09:36 -07:00
Ian Hummel a0454efe21 [SPARK-3595] Respect configured OutputCommitters when calling saveAsHadoopFile
Addresses the issue in https://issues.apache.org/jira/browse/SPARK-3595, namely saveAsHadoopFile hardcoding the OutputCommitter.  This is not ideal when running Spark jobs that write to S3, especially when running them from an EMR cluster where the default OutputCommitter is a DirectOutputCommitter.

Author: Ian Hummel <ian@themodernlife.net>

Closes #2450 from themodernlife/spark-3595 and squashes the following commits:

f37a0e5 [Ian Hummel] Update based on comments from pwendell
a11d9f3 [Ian Hummel] Fix formatting
4359664 [Ian Hummel] Add an example showing usage
8b6be94 [Ian Hummel] Add ability to specify OutputCommitter, espcially useful when writing to an S3 bucket from an EMR cluster
2014-09-21 13:04:36 -07:00
Patrick Wendell d112a6c79d MAINTENANCE: Automated closing of pull requests.
This commit exists to close the following pull requests on Github:

Closes #1328 (close requested by 'pwendell')
Closes #2314 (close requested by 'pwendell')
Closes #997 (close requested by 'pwendell')
Closes #550 (close requested by 'pwendell')
Closes #1506 (close requested by 'pwendell')
Closes #2423 (close requested by 'mengxr')
Closes #554 (close requested by 'joshrosen')
2014-09-20 23:11:05 -07:00
WangTao 8e875d2aff [SPARK-3599]Avoid loading properties file frequently
https://issues.apache.org/jira/browse/SPARK-3599

Author: WangTao <barneystinson@aliyun.com>
Author: WangTaoTheTonic <barneystinson@aliyun.com>

Closes #2454 from WangTaoTheTonic/avoidLoadingFrequently and squashes the following commits:

3681182 [WangTao] do not use clone
7dca036 [WangTao] use lazy val instead
2a79f26 [WangTaoTheTonic] Avoid loaing properties file frequently
2014-09-20 19:07:23 -07:00
Michael Armbrust 293ce85145 [SPARK-3414][SQL] Replace LowerCaseSchema with Resolver
**This PR introduces a subtle change in semantics for HiveContext when using the results in Python or Scala.  Specifically, while resolution remains case insensitive, it is now case preserving.**

_This PR is a follow up to #2293 (and to a lesser extent #2262 #2334)._

In #2293 the catalog was changed to store analyzed logical plans instead of unresolved ones.  While this change fixed the reported bug (which was caused by yet another instance of us forgetting to put in a `LowerCaseSchema` operator) it had the consequence of breaking assumptions made by `MultiInstanceRelation`.  Specifically, we can't replace swap out leaf operators in a tree without rewriting changed expression ids (which happens when you self join the same RDD that has been registered as a temp table).

In this PR, I instead remove the need to insert `LowerCaseSchema` operators at all, by moving the concern of matching up identifiers completely into analysis.  Doing so allows the test cases from both #2293 and #2262 to pass at the same time (and likely fixes a slew of other "unknown unknown" bugs).

While it is rolled back in this PR, storing the analyzed plan might actually be a good idea.  For instance, it is kind of confusing if you register a temporary table, change the case sensitivity of resolution and now you can't query that table anymore.  This can be addressed in a follow up PR.

Follow-ups:
 - Configurable case sensitivity
 - Consider storing analyzed plans for temp tables

Author: Michael Armbrust <michael@databricks.com>

Closes #2382 from marmbrus/lowercase and squashes the following commits:

c21171e [Michael Armbrust] Ensure the resolver is used for field lookups and ensure that case insensitive resolution is still case preserving.
d4320f1 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into lowercase
2de881e [Michael Armbrust] Address comments.
219805a [Michael Armbrust] style
5b93711 [Michael Armbrust] Replace LowerCaseSchema with Resolver.
2014-09-20 16:41:14 -07:00
Cheng Lian 7f54580c45 [SPARK-3609][SQL] Adds sizeInBytes statistics for Limit operator when all output attributes are of native data types
This helps to replace shuffled hash joins with broadcast hash joins in some cases.

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #2468 from liancheng/more-stats and squashes the following commits:

32687dc [Cheng Lian] Moved the test case to PlannerSuite
5595a91 [Cheng Lian] Removes debugging code
73faf69 [Cheng Lian] Test case for auto choosing broadcast hash join
f30fe1d [Cheng Lian] Adds sizeInBytes estimation for Limit when all output types are native types
2014-09-20 16:30:49 -07:00
Sandy Ryza 7c8ad1c083 SPARK-3574. Shuffle finish time always reported as -1
The included test waits 100 ms after job completion for task completion events to come in so it can verify they have reasonable finish times.  Does anyone know a better way to wait on listener events that are expected to come in?

Author: Sandy Ryza <sandy@cloudera.com>

Closes #2440 from sryza/sandy-spark-3574 and squashes the following commits:

c81439b [Sandy Ryza] Fix test failure
b340956 [Sandy Ryza] SPARK-3574. Remove shuffleFinishTime metric
2014-09-20 16:03:17 -07:00
Matthew Farrellee 5f8833c672 [PySpark] remove unnecessary use of numSlices from pyspark tests
Author: Matthew Farrellee <matt@redhat.com>

Closes #2467 from mattf/master-pyspark-remove-numslices-from-tests and squashes the following commits:

c49a87b [Matthew Farrellee] [PySpark] remove unnecessary use of numSlices from pyspark tests
2014-09-20 15:09:35 -07:00
Santiago M. Mola c32c8538ef Fix Java example in Streaming Programming Guide
"val conf" was used instead of "SparkConf conf" in Java snippet.

Author: Santiago M. Mola <santi@mola.io>

Closes #2472 from smola/patch-1 and squashes the following commits:

5bfeb9b [Santiago M. Mola] Fix Java example in Streaming Programming Guide
2014-09-20 15:05:03 -07:00
Vida Ha 78d4220fa0 SPARK-3608 Break if the instance tag naming succeeds
Author: Vida Ha <vida@databricks.com>

Closes #2466 from vidaha/vida/spark-3608 and squashes the following commits:

9509776 [Vida Ha] Break if the instance tag naming succeeds
2014-09-20 01:24:49 -07:00
andrewor14 8af2370619 [Docs] Fix outdated docs for standalone cluster
This is now supported!

Author: andrewor14 <andrewor14@gmail.com>
Author: Andrew Or <andrewor14@gmail.com>

Closes #2461 from andrewor14/document-standalone-cluster and squashes the following commits:

85c8b9e [andrewor14] Wording change per Patrick
35e30ee [Andrew Or] Fix outdated docs for standalone cluster
2014-09-19 16:02:38 -07:00
Nicholas Chammas 99b06b6fd2 [Build] Fix passing of args to sbt
Simple mistake, simple fix:
```shell
args="arg1 arg2 arg3"

sbt $args    # sbt sees 3 arguments
sbt "$args"  # sbt sees 1 argument
```

Should fix the problems we are seeing [here](https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/694/AMPLAB_JENKINS_BUILD_PROFILE=hadoop1.0,label=centos/console), for example.

Author: Nicholas Chammas <nicholas.chammas@gmail.com>

Closes #2462 from nchammas/fix-sbt-master-build and squashes the following commits:

4500c86 [Nicholas Chammas] warn about quoting
10018a6 [Nicholas Chammas] Revert "test hadoop1 build"
7d5356c [Nicholas Chammas] Revert "re-add bad quoting for testing"
061600c [Nicholas Chammas] re-add bad quoting for testing
b2de56c [Nicholas Chammas] test hadoop1 build
43fb854 [Nicholas Chammas] unquote profile args
2014-09-19 15:44:47 -07:00
Daoyuan Wang ba68a51c40 [SPARK-3485][SQL] Use GenericUDFUtils.ConversionHelper for Simple UDF type conversions
This is just another solution to SPARK-3485, in addition to PR #2355
In this patch, we will use ConventionHelper and FunctionRegistry to invoke a simple udf evaluation, which rely more on hive, but much cleaner and safer.
We can discuss which one is better.

Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #2407 from adrian-wang/simpleudf and squashes the following commits:

15762d2 [Daoyuan Wang] add posmod test which would fail the test but now ok
0d69eb4 [Daoyuan Wang] another way to pass to hive simple udf
2014-09-19 15:39:31 -07:00
Sandy Ryza 3b9cd13ebc SPARK-3605. Fix typo in SchemaRDD.
Author: Sandy Ryza <sandy@cloudera.com>

Closes #2460 from sryza/sandy-spark-3605 and squashes the following commits:

09d940b [Sandy Ryza] SPARK-3605. Fix typo in SchemaRDD.
2014-09-19 15:34:48 -07:00
Davies Liu a95ad99e31 [SPARK-3592] [SQL] [PySpark] support applySchema to RDD of Row
Fix the issue when applySchema() to an RDD of Row.

Also add type mapping for BinaryType.

Author: Davies Liu <davies.liu@gmail.com>

Closes #2448 from davies/row and squashes the following commits:

dd220cf [Davies Liu] fix test
3f3f188 [Davies Liu] add more test
f559746 [Davies Liu] add tests, fix serialization
9688fd2 [Davies Liu] support applySchema to RDD of Row
2014-09-19 15:33:42 -07:00
ravipesala 5522151eb1 [SPARK-2594][SQL] Support CACHE TABLE <name> AS SELECT ...
This feature allows user to add cache table from the select query.
Example : ```CACHE TABLE testCacheTable AS SELECT * FROM TEST_TABLE```
Spark takes this type of SQL as command and it does lazy caching just like ```SQLContext.cacheTable```, ```CACHE TABLE <name>``` does.
It can be executed from both SQLContext and HiveContext.

Recreated the pull request after rebasing with master.And fixed all the comments raised in previous pull requests.
https://github.com/apache/spark/pull/2381
https://github.com/apache/spark/pull/2390

Author : ravipesala ravindra.pesalahuawei.com

Author: ravipesala <ravindra.pesala@huawei.com>

Closes #2397 from ravipesala/SPARK-2594 and squashes the following commits:

a5f0beb [ravipesala] Simplified the code as per Admin comment.
8059cd2 [ravipesala] Changed the behaviour from eager caching to lazy caching.
d6e469d [ravipesala] Code review comments by Admin are handled.
c18aa38 [ravipesala] Merge remote-tracking branch 'remotes/ravipesala/Add-Cache-table-as' into SPARK-2594
394d5ca [ravipesala] Changed style
fb1759b [ravipesala] Updated as per Admin comments
8c9993c [ravipesala] Changed the style
d8b37b2 [ravipesala] Updated as per the comments by Admin
bc0bffc [ravipesala] Merge remote-tracking branch 'ravipesala/Add-Cache-table-as' into Add-Cache-table-as
e3265d0 [ravipesala] Updated the code as per the comments by Admin in pull request.
724b9db [ravipesala] Changed style
aaf5b59 [ravipesala] Added comment
dc33895 [ravipesala] Updated parser to support add cache table command
b5276b2 [ravipesala] Updated parser to support add cache table command
eebc0c1 [ravipesala] Add CACHE TABLE <name> AS SELECT ...
6758f80 [ravipesala] Changed style
7459ce3 [ravipesala] Added comment
13c8e27 [ravipesala] Updated parser to support add cache table command
4e858d8 [ravipesala] Updated parser to support add cache table command
b803fc8 [ravipesala] Add CACHE TABLE <name> AS SELECT ...
2014-09-19 15:31:57 -07:00
Cheng Hao 2c3cc7641d [SPARK-3501] [SQL] Fix the bug of Hive SimpleUDF creates unnecessary type cast
When do the query like:
```
select datediff(cast(value as timestamp), cast('2002-03-21 00:00:00' as timestamp)) from src;
```
SparkSQL will raise exception:
```
[info] scala.MatchError: TimestampType (of class org.apache.spark.sql.catalyst.types.TimestampType$)
[info] at org.apache.spark.sql.catalyst.expressions.Cast.castToTimestamp(Cast.scala:77)
[info] at org.apache.spark.sql.catalyst.expressions.Cast.cast$lzycompute(Cast.scala:251)
[info] at org.apache.spark.sql.catalyst.expressions.Cast.cast(Cast.scala:247)
[info] at org.apache.spark.sql.catalyst.expressions.Cast.eval(Cast.scala:263)
[info] at org.apache.spark.sql.catalyst.optimizer.ConstantFolding$$anonfun$apply$5$$anonfun$applyOrElse$2.applyOrElse(Optimizer.scala:217)
[info] at org.apache.spark.sql.catalyst.optimizer.ConstantFolding$$anonfun$apply$5$$anonfun$applyOrElse$2.applyOrElse(Optimizer.scala:210)
[info] at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144)
[info] at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4$$anonfun$apply$2.apply(TreeNode.scala:180)
[info] at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
[info] at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
```

Author: Cheng Hao <hao.cheng@intel.com>

Closes #2368 from chenghao-intel/cast_exception and squashes the following commits:

5c9c3a5 [Cheng Hao] make more clear code
49dfc50 [Cheng Hao] Add no-op for Cast and revert the position of SimplifyCasts
b804abd [Cheng Hao] Add unit test to show the failure in identical data type casting
330a5c8 [Cheng Hao] Update Code based on comments
b834ed4 [Cheng Hao] Fix bug of HiveSimpleUDF with unnecessary type cast which cause exception in constant folding
2014-09-19 15:29:22 -07:00
Davies Liu fce5e251d6 [SPARK-3491] [MLlib] [PySpark] use pickle to serialize data in MLlib
Currently, we serialize the data between JVM and Python case by case manually, this cannot scale to support so many APIs in MLlib.

This patch will try to address this problem by serialize the data using pickle protocol, using Pyrolite library to serialize/deserialize in JVM. Pickle protocol can be easily extended to support customized class.

All the modules are refactored to use this protocol.

Known issues: There will be some performance regression (both CPU and memory, the serialized data increased)

Author: Davies Liu <davies.liu@gmail.com>

Closes #2378 from davies/pickle_mllib and squashes the following commits:

dffbba2 [Davies Liu] Merge branch 'master' of github.com:apache/spark into pickle_mllib
810f97f [Davies Liu] fix equal of matrix
032cd62 [Davies Liu] add more type check and conversion for user_product
bd738ab [Davies Liu] address comments
e431377 [Davies Liu] fix cache of rdd, refactor
19d0967 [Davies Liu] refactor Picklers
2511e76 [Davies Liu] cleanup
1fccf1a [Davies Liu] address comments
a2cc855 [Davies Liu] fix tests
9ceff73 [Davies Liu] test size of serialized Rating
44e0551 [Davies Liu] fix cache
a379a81 [Davies Liu] fix pickle array in python2.7
df625c7 [Davies Liu] Merge commit '154d141' into pickle_mllib
154d141 [Davies Liu] fix autobatchedpickler
44736d7 [Davies Liu] speed up pickling array in Python 2.7
e1d1bfc [Davies Liu] refactor
708dc02 [Davies Liu] fix tests
9dcfb63 [Davies Liu] fix style
88034f0 [Davies Liu] rafactor, address comments
46a501e [Davies Liu] choose batch size automatically
df19464 [Davies Liu] memorize the module and class name during pickleing
f3506c5 [Davies Liu] Merge branch 'master' into pickle_mllib
722dd96 [Davies Liu] cleanup _common.py
0ee1525 [Davies Liu] remove outdated tests
b02e34f [Davies Liu] remove _common.py
84c721d [Davies Liu] Merge branch 'master' into pickle_mllib
4d7963e [Davies Liu] remove muanlly serialization
6d26b03 [Davies Liu] fix tests
c383544 [Davies Liu] classification
f2a0856 [Davies Liu] mllib/regression
d9f691f [Davies Liu] mllib/util
cccb8b1 [Davies Liu] mllib/tree
8fe166a [Davies Liu] Merge branch 'pickle' into pickle_mllib
aa2287e [Davies Liu] random
f1544c4 [Davies Liu] refactor clustering
52d1350 [Davies Liu] use new protocol in mllib/stat
b30ef35 [Davies Liu] use pickle to serialize data for mllib/recommendation
f44f771 [Davies Liu] enable tests about array
3908f5c [Davies Liu] Merge branch 'master' into pickle
c77c87b [Davies Liu] cleanup debugging code
60e4e2f [Davies Liu] support unpickle array.array for Python 2.6
2014-09-19 15:01:11 -07:00
Matthew Farrellee a03e5b81e9 [SPARK-1701] [PySpark] remove slice terminology from python examples
Author: Matthew Farrellee <matt@redhat.com>

Closes #2304 from mattf/SPARK-1701-partition-over-slice-for-python-examples and squashes the following commits:

928a581 [Matthew Farrellee] [SPARK-1701] [PySpark] remove slice terminology from python examples
2014-09-19 14:35:22 -07:00
Matthew Farrellee be0c7563ea [SPARK-1701] Clarify slice vs partition in the programming guide
This is a partial solution to SPARK-1701, only addressing the
documentation confusion.

Additional work can be to actually change the numSlices parameter name
across languages, with care required for scala & python to maintain
backward compatibility for named parameters.

Author: Matthew Farrellee <matt@redhat.com>

Closes #2305 from mattf/SPARK-1701 and squashes the following commits:

c0af05d [Matthew Farrellee] Further tweak
06f80fc [Matthew Farrellee] Wording tweak from Josh Rosen's review
7b045e0 [Matthew Farrellee] [SPARK-1701] Clarify slice vs partition in the programming guide
2014-09-19 14:31:50 -07:00
Patrick Wendell a48956f582 MAINTENANCE: Automated closing of pull requests.
This commit exists to close the following pull requests on Github:

Closes #726 (close requested by 'pwendell')
Closes #151 (close requested by 'pwendell')
2014-09-19 10:49:42 -07:00
Larry Xiao 3bbbdd8180 [SPARK-2062][GraphX] VertexRDD.apply does not use the mergeFunc
VertexRDD.apply had a bug where it ignored the merge function for
duplicate vertices and instead used whichever vertex attribute occurred
first. This commit fixes the bug by passing the merge function through
to ShippableVertexPartition.apply, which merges any duplicates using the
merge function and then fills in missing vertices using the specified
default vertex attribute. This commit also adds a unit test for
VertexRDD.apply.

Author: Larry Xiao <xiaodi@sjtu.edu.cn>
Author: Blie Arkansol <xiaodi@sjtu.edu.cn>
Author: Ankur Dave <ankurdave@gmail.com>

Closes #1903 from larryxiao/2062 and squashes the following commits:

625aa9d [Blie Arkansol] Merge pull request #1 from ankurdave/SPARK-2062
476770b [Ankur Dave] ShippableVertexPartition.initFrom: Don't run mergeFunc on default values
614059f [Larry Xiao] doc update: note about the default null value vertices construction
dfdb3c9 [Larry Xiao] minor fix
1c70366 [Larry Xiao] scalastyle check: wrap line, parameter list indent 4 spaces
e4ca697 [Larry Xiao] [TEST] VertexRDD.apply mergeFunc
6a35ea8 [Larry Xiao] [TEST] VertexRDD.apply mergeFunc
4fbc29c [Blie Arkansol] undo unnecessary change
efae765 [Larry Xiao] fix mistakes: should be able to call with or without mergeFunc
b2422f9 [Larry Xiao] Merge branch '2062' of github.com:larryxiao/spark into 2062
52dc7f7 [Larry Xiao] pass mergeFunc to VertexPartitionBase, where merge is handled
581e9ee [Larry Xiao] TODO: VertexRDDSuite
20d80a3 [Larry Xiao] [SPARK-2062][GraphX] VertexRDD.apply does not use the mergeFunc
2014-09-18 23:33:18 -07:00
Burak e76ef5cb8e [SPARK-3418] Sparse Matrix support (CCS) and additional native BLAS operations added
Local `SparseMatrix` support added in Compressed Column Storage (CCS) format in addition to Level-2 and Level-3 BLAS operations such as dgemv and dgemm respectively.

BLAS doesn't support  sparse matrix operations, therefore support for `SparseMatrix`-`DenseMatrix` multiplication and `SparseMatrix`-`DenseVector` implementations have been added. I will post performance comparisons in the comments momentarily.

Author: Burak <brkyvz@gmail.com>

Closes #2294 from brkyvz/SPARK-3418 and squashes the following commits:

88814ed [Burak] Hopefully fixed MiMa this time
47e49d5 [Burak] really fixed MiMa issue
f0bae57 [Burak] [SPARK-3418] Fixed MiMa compatibility issues (excluded from check)
4b7dbec [Burak] 9/17 comments addressed
7af2f83 [Burak] sealed traits Vector and Matrix
d3a8a16 [Burak] [SPARK-3418] Squashed missing alpha bug.
421045f [Burak] [SPARK-3418] New code review comments addressed
f35a161 [Burak] [SPARK-3418] Code review comments addressed and multiplication further optimized
2508577 [Burak] [SPARK-3418] Fixed one more style issue
d16e8a0 [Burak] [SPARK-3418] Fixed style issues and added documentation for methods
204a3f7 [Burak] [SPARK-3418] Fixed failing Matrix unit test
6025297 [Burak] [SPARK-3418] Fixed Scala-style errors
dc7be71 [Burak] [SPARK-3418][MLlib] Matrix unit tests expanded with indexing and updating
d2d5851 [Burak] [SPARK-3418][MLlib] Sparse Matrix support and additional native BLAS operations added
2014-09-18 22:18:51 -07:00
Davies Liu e77fa81a61 [SPARK-3554] [PySpark] use broadcast automatically for large closure
Py4j can not handle large string efficiently, so we should use broadcast for large closure automatically. (Broadcast use local filesystem to pass through data).

Author: Davies Liu <davies.liu@gmail.com>

Closes #2417 from davies/command and squashes the following commits:

fbf4e97 [Davies Liu] bugfix
aefd508 [Davies Liu] use broadcast automatically for large closure
2014-09-18 18:11:48 -07:00
Andrew Or 9306297d1d [Minor Hot Fix] Move a line in SparkSubmit to the right place
This was introduced in #2449

Author: Andrew Or <andrewor14@gmail.com>

Closes #2452 from andrewor14/standalone-hot-fix and squashes the following commits:

d5190ca [Andrew Or] Put that line in the right place
2014-09-18 17:49:28 -07:00
Victsm b3ed37e5ba [SPARK-3560] Fixed setting spark.jars system property in yarn-cluster mode
Author: Victsm <victor.nju@gmail.com>
Author: Min Shen <mshen@linkedin.com>

Closes #2449 from Victsm/SPARK-3560 and squashes the following commits:

918405a [Victsm] Removed the additional space
4502a2a [Min Shen] [SPARK-3560] Fixed setting spark.jars system property in yarn-cluster mode.

(cherry picked from commit 832dff64dd)
Signed-off-by: Andrew Or <andrewor14@gmail.com>
2014-09-18 15:58:29 -07:00
WangTaoTheTonic 471e6a3a47 [SPARK-3589][Minor]remove redundant code
https://issues.apache.org/jira/browse/SPARK-3589

"export CLASSPATH" in spark-class is redundant since same variable is exported before.
We could reuse defined value "isYarnCluster" in SparkSubmit.scala.

Author: WangTaoTheTonic <barneystinson@aliyun.com>

Closes #2445 from WangTaoTheTonic/removeRedundant and squashes the following commits:

6fb6872 [WangTaoTheTonic] remove redundant code
2014-09-18 12:07:53 -07:00
Kousuke Saruta 6cab838b98 [SPARK-3566] [BUILD] .gitignore and .rat-excludes should consider Windows cmd file and Emacs' backup files
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #2426 from sarutak/emacs-metafiles-ignore and squashes the following commits:

a306020 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into emacs-metafiles-ignore
6a0a5eb [Kousuke Saruta] Added cmd file entry to .rat-excludes and .gitignore
897da63 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into emacs-metafiles-ignore
8cade06 [Kousuke Saruta] Modified .gitignore to ignore emacs lock file and backup file
2014-09-18 12:04:32 -07:00
Patrick Wendell 3ad4176cf9 SPARK-3579 Jekyll doc generation is different across environments.
This patch makes some small changes to fix this problem:
1. We document specific versions of Jekyll/Kramdown to use that match
   those used when building the upstream docs.
2. We add a configuration for a property that for some reason varies across
   packages of Jekyll/Kramdown even with the same version.

Author: Patrick Wendell <pwendell@gmail.com>

Closes #2443 from pwendell/jekyll and squashes the following commits:

54ee2ab [Patrick Wendell] SPARK-3579 Jekyll doc generation is different across environments.
2014-09-18 10:30:17 -07:00
WangTaoTheTonic 3447d10090 [SPARK-3547]Using a special exit code instead of 1 to represent ClassNotFoundExcepti...
...on

As improvement of https://github.com/apache/spark/pull/1944, we should use more special exit code to represent ClassNotFoundException.

Author: WangTaoTheTonic <barneystinson@aliyun.com>

Closes #2421 from WangTaoTheTonic/classnotfoundExitCode and squashes the following commits:

645a22a [WangTaoTheTonic] Serveral typos to trigger Jenkins
d6ae559 [WangTaoTheTonic] use 101 instead
a2d6465 [WangTaoTheTonic] use 127 instead
fbb232f [WangTaoTheTonic] Using a special exit code instead of 1 to represent ClassNotFoundException
2014-09-18 10:17:18 -07:00
GuoQiang Li 6772afec2f [Minor] rat exclude dependency-reduced-pom.xml
Author: GuoQiang Li <witgo@qq.com>

Closes #2326 from witgo/rat-excludes and squashes the following commits:

860904e [GuoQiang Li] rat exclude dependency-reduced-pom.xml
2014-09-17 22:54:34 -07:00
Nicholas Chammas 5547fa1ee9 [SPARK-3534] Add hive-thriftserver to SQL tests
Addresses the problem pointed out in [this comment](https://github.com/apache/spark/pull/2441#issuecomment-55990116).

Author: Nicholas Chammas <nicholas.chammas@gmail.com>

Closes #2442 from nchammas/patch-1 and squashes the following commits:

7e68b60 [Nicholas Chammas] [SPARK-3534] Add hive-thriftserver to SQL tests
2014-09-17 22:37:11 -07:00
WangTaoTheTonic 3f169bfe3c [SPARK-3565]Fix configuration item not consistent with document
https://issues.apache.org/jira/browse/SPARK-3565

"spark.ports.maxRetries" should be "spark.port.maxRetries". Make the configuration keys in document and code consistent.

Author: WangTaoTheTonic <barneystinson@aliyun.com>

Closes #2427 from WangTaoTheTonic/fixPortRetries and squashes the following commits:

c178813 [WangTaoTheTonic] Use blank lines trigger Jenkins
646f3fe [WangTaoTheTonic] also in SparkBuild.scala
3700dba [WangTaoTheTonic] Fix configuration item not consistent with document
2014-09-17 21:59:23 -07:00
Kousuke Saruta 1147973f1c [SPARK-3567] appId field in SparkDeploySchedulerBackend should be volatile
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #2428 from sarutak/appid-volatile-modification and squashes the following commits:

c7d890d [Kousuke Saruta] Added volatile modifier to appId field in SparkDeploySchedulerBackend
2014-09-17 16:52:27 -07:00
Kousuke Saruta 6688a266f2 [SPARK-3564][WebUI] Display App ID on HistoryPage
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #2424 from sarutak/display-appid-on-webui and squashes the following commits:

417fe90 [Kousuke Saruta] Added "App ID column" to HistoryPage
2014-09-17 16:31:58 -07:00
Kousuke Saruta cbc065039f [SPARK-3571] Spark standalone cluster mode doesn't work.
I think, this issue is caused by #1106

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #2436 from sarutak/SPARK-3571 and squashes the following commits:

7a4deea [Kousuke Saruta] Modified Master.scala to use numWorkersVisited and numWorkersAlive instead of stopPos
4e51e35 [Kousuke Saruta] Modified Master to prevent from 0 divide
4817ecd [Kousuke Saruta] Brushed up previous change
71e84b6 [Kousuke Saruta] Modified Master to enable schedule normally
2014-09-17 16:23:50 -07:00
Nicholas Chammas 7fc3bb7c88 [SPARK-3534] Fix expansion of testing arguments to sbt
Testing arguments to `sbt` need to be passed as an array, not a single, long string.

Fixes a bug introduced in #2420.

Author: Nicholas Chammas <nicholas.chammas@gmail.com>

Closes #2437 from nchammas/selective-testing and squashes the following commits:

a9f9c1c [Nicholas Chammas] fix printing of sbt test arguments
cf57cbf [Nicholas Chammas] fix sbt test arguments
e33b978 [Nicholas Chammas] Merge pull request #2 from apache/master
0b47ca4 [Nicholas Chammas] Merge branch 'master' of github.com:nchammas/spark
8051486 [Nicholas Chammas] Merge pull request #1 from apache/master
03180a4 [Nicholas Chammas] Merge branch 'master' of github.com:nchammas/spark
d4c5f43 [Nicholas Chammas] Merge pull request #6 from apache/master
2014-09-17 15:14:04 -07:00
Andrew Ash b3830b28f8 Docs: move HA subsections to a deeper indentation level
Makes the table of contents read better

Author: Andrew Ash <andrew@andrewash.com>

Closes #2402 from ash211/docs/better-indentation and squashes the following commits:

ea0e130 [Andrew Ash] Move HA subsections to a deeper indentation level
2014-09-17 15:07:57 -07:00
Nicholas Chammas 5044e4953a [SPARK-1455] [SPARK-3534] [Build] When possible, run SQL tests only.
If the only files changed are related to SQL, then only run the SQL tests.

This patch includes some cosmetic/maintainability refactoring. I would be more than happy to undo some of these changes if they are inappropriate.

We can accept this patch mostly as-is and address the immediate need documented in [SPARK-3534](https://issues.apache.org/jira/browse/SPARK-3534), or we can keep it open until a satisfactory solution along the lines [discussed here](https://issues.apache.org/jira/browse/SPARK-1455?focusedCommentId=14136424&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14136424) is reached.

Note: I had to hack this patch up to test it locally, so what I'm submitting here and what I tested are technically different.

Author: Nicholas Chammas <nicholas.chammas@gmail.com>

Closes #2420 from nchammas/selective-testing and squashes the following commits:

db3fa2d [Nicholas Chammas] diff against master!
f9e23f6 [Nicholas Chammas] when possible, run SQL tests only
2014-09-17 12:44:44 -07:00
Michael Armbrust cbf983bb4a [SQL][DOCS] Improve table caching section
Author: Michael Armbrust <michael@databricks.com>

Closes #2434 from marmbrus/patch-1 and squashes the following commits:

67215be [Michael Armbrust] [SQL][DOCS] Improve table caching section
2014-09-17 12:41:49 -07:00
Nicholas Chammas 8fbd5f4a90 [Docs] minor grammar fix
Author: Nicholas Chammas <nicholas.chammas@gmail.com>

Closes #2430 from nchammas/patch-2 and squashes the following commits:

d476bfb [Nicholas Chammas] [Docs] minor grammar fix
2014-09-17 12:33:09 -07:00
chesterxgchen 7d1a37239c SPARK-3177 (on Master Branch)
The JIRA and PR was original created for branch-1.1, and move to master branch now.
Chester

The Issue is due to that yarn-alpha and yarn have different APIs for certain class fields. In this particular case,  the ClientBase using reflection to to address this issue, and we need to different way to test the ClientBase's method.  Original ClientBaseSuite using getFieldValue() method to do this. But it doesn't work for yarn-alpha as the API returns an array of String instead of just String (which is the case for Yarn-stable API).

 To fix the test, I add a new method

  def getFieldValue2[A: ClassTag, A1: ClassTag, B](clazz: Class[_], field: String,
                                                      defaults: => B)
                              (mapTo:  A => B)(mapTo1: A1 => B) : B =
    Try(clazz.getField(field)).map(_.get(null)).map {
      case v: A => mapTo(v)
      case v1: A1 => mapTo1(v1)
      case _ => defaults
    }.toOption.getOrElse(defaults)

to handle the cases where the field type can be either type A or A1. In this new method the type A or A1 is pattern matched and corresponding mapTo function (mapTo or mapTo1) is used.

Author: chesterxgchen <chester@alpinenow.com>

Closes #2204 from chesterxgchen/SPARK-3177-master and squashes the following commits:

e72a6ea [chesterxgchen]  The Issue is due to that yarn-alpha and yarn have different APIs for certain class fields. In this particular case,  the ClientBase using reflection to to address this issue, and we need to different way to test the ClientBase's method.  Original ClientBaseSuite using getFieldValue() method to do this. But it doesn't work for yarn-alpha as the API returns an array of String instead of just String (which is the case for Yarn-stable API).
2014-09-17 10:25:52 -05:00
viper-kun 983609a4dd [Docs] Correct spark.files.fetchTimeout default value
change the value of spark.files.fetchTimeout

Author: viper-kun <xukun.xu@huawei.com>

Closes #2406 from viper-kun/master and squashes the following commits:

ecb0d46 [viper-kun] [Docs] Correct spark.files.fetchTimeout default value
7cf4c7a [viper-kun] Update configuration.md
2014-09-17 00:09:57 -07:00
wangfei 008a5ed480 [Minor]ignore all config files in conf
Some config files in ```conf``` should ignore, such as
        conf/fairscheduler.xml
        conf/hive-log4j.properties
        conf/metrics.properties
...
So ignore all ```sh```/```properties```/```conf```/```xml``` files

Author: wangfei <wangfei1@huawei.com>

Closes #2395 from scwf/patch-2 and squashes the following commits:

3dc53f2 [wangfei] duplicate ```conf/*.conf```
3c2986f [wangfei] ignore all config files
2014-09-16 21:57:33 -07:00
Andrew Or 0a7091e689 [SPARK-3555] Fix UISuite race condition
The test "jetty selects different port under contention" is flaky.

If another process binds to 4040 before the test starts, then the first server we start there will fail, and the subsequent servers we start thereafter may successfully bind to 4040 if it was released between the servers starting. Instead, we should just let Java find a random free port for us and hold onto it for the duration of the test.

Author: Andrew Or <andrewor14@gmail.com>

Closes #2418 from andrewor14/fix-port-contention and squashes the following commits:

0cd4974 [Andrew Or] Stop them servers
a7071fe [Andrew Or] Pick random port instead of 4040
2014-09-16 16:03:20 -07:00
Evan Chan a6e1712f1e Add a Community Projects page
This adds a new page to the docs listing community projects -- those created outside of Apache Spark that are of interest to the community of Spark users.   Anybody can add to it just by submitting a PR.

There was a discussion thread about alternatives:
* Creating a Github organization for Spark projects -  we could not find any sponsors for this, and it would be difficult to organize since many folks just create repos in their company organization or personal accounts
* Apache has some place for storing community projects, but it was deemed difficult to work with, and again would be some permissions issues -- not everyone could update it.

Author: Evan Chan <velvia@gmail.com>

Closes #2219 from velvia/community-projects-page and squashes the following commits:

7316822 [Evan Chan] Point to Spark wiki: supplemental projects page
613b021 [Evan Chan] Add a few more projects
a85eaaf [Evan Chan] Add a Community Projects page
2014-09-16 13:46:06 -07:00
Dan Osipov b20171267d [SPARK-787] Add S3 configuration parameters to the EC2 deploy scripts
When deploying to AWS, there is additional configuration that is required to read S3 files. EMR creates it automatically, there is no reason that the Spark EC2 script shouldn't.

This PR requires a corresponding PR to the mesos/spark-ec2 to be merged, as it gets cloned in the process of setting up machines: https://github.com/mesos/spark-ec2/pull/58

Author: Dan Osipov <daniil.osipov@shazam.com>

Closes #1120 from danosipov/s3_credentials and squashes the following commits:

758da8b [Dan Osipov] Modify documentation to include the new parameter
71fab14 [Dan Osipov] Use a parameter --copy-aws-credentials to enable S3 credential deployment
7e0da26 [Dan Osipov] Get AWS credentials out of boto connection instance
39bdf30 [Dan Osipov] Add S3 configuration parameters to the EC2 deploy scripts
2014-09-16 13:40:16 -07:00
Davies Liu ec1adecbb7 [SPARK-3430] [PySpark] [Doc] generate PySpark API docs using Sphinx
Using Sphinx to generate API docs for PySpark.

requirement: Sphinx

```
$ cd python/docs/
$ make html
```

The generated API docs will be located at python/docs/_build/html/index.html

It can co-exists with those generated by Epydoc.

This is the first working version, after merging in, then we can continue to improve it and replace the epydoc finally.

Author: Davies Liu <davies.liu@gmail.com>

Closes #2292 from davies/sphinx and squashes the following commits:

425a3b1 [Davies Liu] cleanup
1573298 [Davies Liu] move docs to python/docs/
5fe3903 [Davies Liu] Merge branch 'master' into sphinx
9468ab0 [Davies Liu] fix makefile
b408f38 [Davies Liu] address all comments
e2ccb1b [Davies Liu] update name and version
9081ead [Davies Liu] generate PySpark API docs using Sphinx
2014-09-16 12:51:58 -07:00
Kousuke Saruta a9e910430f [SPARK-3546] InputStream of ManagedBuffer is not closed and causes running out of file descriptor
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #2408 from sarutak/resolve-resource-leak-issue and squashes the following commits:

074781d [Kousuke Saruta] Modified SuffleBlockFetcherIterator
5f63f67 [Kousuke Saruta] Move metrics increment logic and debug logging outside try block
b37231a [Kousuke Saruta] Modified FileSegmentManagedBuffer#nioByteBuffer to check null or not before invoking channel.close
bf29d4a [Kousuke Saruta] Modified FileSegment to close channel
2014-09-16 12:41:45 -07:00