Commit graph

6618 commits

Author SHA1 Message Date
Reynold Xin 83f2a2f14e [sql] Rename Expression.apply to eval for better readability.
Also used this opportunity to add a bunch of override's and made some members private.

Author: Reynold Xin <rxin@apache.org>

Closes #340 from rxin/eval and squashes the following commits:

a7c7ca7 [Reynold Xin] Fixed conflicts in merge.
9069de6 [Reynold Xin] Merge branch 'master' into eval
3ccc313 [Reynold Xin] Merge branch 'master' into eval
1a47e10 [Reynold Xin] Renamed apply to eval for generators and added a bunch of override's.
ea061de [Reynold Xin] Rename Expression.apply to eval for better readability.
2014-04-07 10:45:31 -07:00
Davis Shepherd a3c51c6ea2 SPARK-1432: Make sure that all metadata fields are properly cleaned
While working on spark-1337 with @pwendell, we noticed that not all of the metadata maps in JobProgessListener were being properly cleaned. This could lead to a (hypothetical) memory leak issue should a job run long enough. This patch aims to address the issue.

Author: Davis Shepherd <davis@conviva.com>

Closes #338 from dgshep/master and squashes the following commits:

a77b65c [Davis Shepherd] In the contex of SPARK-1337: Make sure that all metadata fields are properly cleaned
2014-04-07 10:02:00 -07:00
Michael Armbrust b5bae849db [SQL] SPARK-1427 Fix toString for SchemaRDD NativeCommands.
Author: Michael Armbrust <michael@databricks.com>

Closes #343 from marmbrus/toStringFix and squashes the following commits:

37198fe [Michael Armbrust] Fix toString for SchemaRDD NativeCommands.
2014-04-07 01:46:50 -07:00
Michael Armbrust accd0999f9 [SQL] SPARK-1371 Hash Aggregation Improvements
Given:
```scala
case class Data(a: Int, b: Int)
val rdd =
  sparkContext
    .parallelize(1 to 200)
    .flatMap(_ => (1 to 50000).map(i => Data(i % 100, i)))
rdd.registerAsTable("data")
cacheTable("data")
```
Before:
```
SELECT COUNT(*) FROM data:[10000000]
16795.567ms
SELECT a, SUM(b) FROM data GROUP BY a
7536.436ms
SELECT SUM(b) FROM data
10954.1ms
```

After:
```
SELECT COUNT(*) FROM data:[10000000]
1372.175ms
SELECT a, SUM(b) FROM data GROUP BY a
2070.446ms
SELECT SUM(b) FROM data
958.969ms
```

Author: Michael Armbrust <michael@databricks.com>

Closes #295 from marmbrus/hashAgg and squashes the following commits:

ec63575 [Michael Armbrust] Add comment.
d0495a9 [Michael Armbrust] Use scaladoc instead.
b4a6887 [Michael Armbrust] Address review comments.
a2d90ba [Michael Armbrust] Capture child output statically to avoid issues with generators and serialization.
7c13112 [Michael Armbrust] Rewrite Aggregate operator to stream input and use projections.  Remove unused local RDD functions implicits.
5096f99 [Michael Armbrust] Make HiveUDAF fields transient since object inspectors are not serializable.
6a4b671 [Michael Armbrust] Add option to avoid binding operators expressions automatically.
92cca08 [Michael Armbrust] Always include serialization debug info when running tests.
1279df2 [Michael Armbrust] Increase default number of partitions.
2014-04-07 00:14:00 -07:00
Patrick Wendell 87d0928a33 SPARK-1431: Allow merging conflicting pull requests
Sometimes if there is a small conflict it's nice to be able to just
manually fix it up rather than have another RTT with the contributor.

Author: Patrick Wendell <pwendell@gmail.com>

Closes #342 from pwendell/merge-conflicts and squashes the following commits:

cdce61a [Patrick Wendell] SPARK-1431: Allow merging conflicting pull requests
2014-04-06 21:04:45 -07:00
Evan Chan 1440154c27 SPARK-1154: Clean up app folders in worker nodes
This is a fix for [SPARK-1154](https://issues.apache.org/jira/browse/SPARK-1154).   The issue is that worker nodes fill up with a huge number of app-* folders after some time.  This change adds a periodic cleanup task which asynchronously deletes app directories older than a configurable TTL.

Two new configuration parameters have been introduced:
  spark.worker.cleanup_interval
  spark.worker.app_data_ttl

This change does not include moving the downloads of application jars to a location outside of the work directory.  We will address that if we have time, but that potentially involves caching so it will come either as part of this PR or a separate PR.

Author: Evan Chan <ev@ooyala.com>
Author: Kelvin Chu <kelvinkwchu@yahoo.com>

Closes #288 from velvia/SPARK-1154-cleanup-app-folders and squashes the following commits:

0689995 [Evan Chan] CR from @aarondav - move config, clarify for standalone mode
9f10d96 [Evan Chan] CR from @pwendell - rename configs and add cleanup.enabled
f2f6027 [Evan Chan] CR from @andrewor14
553d8c2 [Kelvin Chu] change the variable name to currentTimeMillis since it actually tracks in seconds
8dc9cb5 [Kelvin Chu] Fixed a bug in Utils.findOldFiles() after merge.
cb52f2b [Kelvin Chu] Change the name of findOldestFiles() to findOldFiles()
72f7d2d [Kelvin Chu] Fix a bug of Utils.findOldestFiles(). file.lastModified is returned in milliseconds.
ad99955 [Kelvin Chu] Add unit test for Utils.findOldestFiles()
dc1a311 [Evan Chan] Don't recompute current time with every new file
e3c408e [Evan Chan] Document the two new settings
b92752b [Evan Chan] SPARK-1154: Add a periodic task to clean up app directories
2014-04-06 19:21:40 -07:00
Aaron Davidson 4106558435 SPARK-1314: Use SPARK_HIVE to determine if we include Hive in packaging
Previously, we based our decision regarding including datanucleus jars based on the existence of a spark-hive-assembly jar, which was incidentally built whenever "sbt assembly" is run. This means that a typical and previously supported pathway would start using hive jars.

This patch has the following features/bug fixes:

- Use of SPARK_HIVE (default false) to determine if we should include Hive in the assembly jar.
- Analagous feature in Maven with -Phive (previously, there was no support for adding Hive to any of our jars produced by Maven)
- assemble-deps fixed since we no longer use a different ASSEMBLY_DIR
- avoid adding log message in compute-classpath.sh to the classpath :)

Still TODO before mergeable:
- We need to download the datanucleus jars outside of sbt. Perhaps we can have spark-class download them if SPARK_HIVE is set similar to how sbt downloads itself.
- Spark SQL documentation updates.

Author: Aaron Davidson <aaron@databricks.com>

Closes #237 from aarondav/master and squashes the following commits:

5dc4329 [Aaron Davidson] Typo fixes
dd4f298 [Aaron Davidson] Doc update
dd1a365 [Aaron Davidson] Eliminate need for SPARK_HIVE at runtime by d/ling datanucleus from Maven
a9269b5 [Aaron Davidson] [WIP] Use SPARK_HIVE to determine if we include Hive in packaging
2014-04-06 17:48:41 -07:00
Aaron Davidson 7ce52c4a7a SPARK-1349: spark-shell gets its own command history
Currently, spark-shell shares its command history with scala repl.

This fix is simply a modification of the default FileBackedHistory file setting:
https://github.com/scala/scala/blob/master/src/repl/scala/tools/nsc/interpreter/session/FileBackedHistory.scala#L77

Author: Aaron Davidson <aaron@databricks.com>

Closes #267 from aarondav/repl and squashes the following commits:

f9c62d2 [Aaron Davidson] SPARK-1349: spark-shell gets its own command history separate from scala repl
2014-04-06 17:43:44 -07:00
Sean Owen 856c50f59b SPARK-1387. Update build plugins, avoid plugin version warning, centralize versions
Another handful of small build changes to organize and standardize a bit, and avoid warnings:

- Update Maven plugin versions for good measure
- Since plugins need maven 3.0.4 already, require it explicitly (<3.0.4 had some bugs anyway)
- Use variables to define versions across dependencies where they should move in lock step
- ... and make this consistent between Maven/SBT

OK, I also updated the JIRA URL while I was at it here.

Author: Sean Owen <sowen@cloudera.com>

Closes #291 from srowen/SPARK-1387 and squashes the following commits:

461eca1 [Sean Owen] Couldn't resist also updating JIRA location to new one
c2d5cc5 [Sean Owen] Update plugins and Maven version; use variables consistently across Maven/SBT to define dependency versions that should stay in step.
2014-04-06 17:41:01 -07:00
Egor Pakhomov e258e5040f [SPARK-1259] Make RDD locally iterable
Author: Egor Pakhomov <pahomov.egor@gmail.com>

Closes #156 from epahomov/SPARK-1259 and squashes the following commits:

8ec8f24 [Egor Pakhomov] Make to local iterator shorter
34aa300 [Egor Pakhomov] Fix toLocalIterator docs
08363ef [Egor Pakhomov] SPARK-1259 from toLocallyIterable to toLocalIterator
6a994eb [Egor Pakhomov] SPARK-1259 Make RDD locally iterable
8be3dcf [Egor Pakhomov] SPARK-1259 Make RDD locally iterable
33ecb17 [Egor Pakhomov] SPARK-1259 Make RDD locally iterable
2014-04-06 16:43:01 -07:00
witgo 7012ffafad Fix SPARK-1420 The maven build error for Spark Catalyst
Author: witgo <witgo@qq.com>

Closes #333 from witgo/SPARK-1420 and squashes the following commits:

902519e [witgo] add dependency scala-reflect to catalyst
2014-04-06 16:03:06 -07:00
Matei Zaharia 0b85516781 SPARK-1421. Make MLlib work on Python 2.6
The reason it wasn't working was passing a bytearray to stream.write(), which is not supported in Python 2.6 but is in 2.7. (This array came from NumPy when we converted data to send it over to Java). Now we just convert those bytearrays to strings of bytes, which preserves nonprintable characters as well.

Author: Matei Zaharia <matei@databricks.com>

Closes #335 from mateiz/mllib-python-2.6 and squashes the following commits:

f26c59f [Matei Zaharia] Update docs to no longer say we need Python 2.7
a84d6af [Matei Zaharia] SPARK-1421. Make MLlib work on Python 2.6
2014-04-05 20:52:05 -07:00
Sean Owen 890d63bd4e Fix for PR #195 for Java 6
Use Java 6's recommended equivalent of Java 7's Logger.getGlobal() to retain Java 6 compatibility. See PR #195

Author: Sean Owen <sowen@cloudera.com>

Closes #334 from srowen/FixPR195ForJava6 and squashes the following commits:

f92fbd3 [Sean Owen] Use Java 6's recommended equivalent of Java 7's Logger.getGlobal() to retain Java 6 compatibility
2014-04-05 19:08:24 -07:00
Mridul Muralidharan 6e88583aef [SPARK-1371] fix computePreferredLocations signature to not depend on underlying implementation
Change to Map and Set - not mutable HashMap and HashSet

Author: Mridul Muralidharan <mridulm80@apache.org>

Closes #302 from mridulm/master and squashes the following commits:

df747af [Mridul Muralidharan] Address review comments
17e2907 [Mridul Muralidharan] fix computePreferredLocations signature to not depend on underlying implementation
2014-04-05 15:23:37 -07:00
Kay Ousterhout 2d0150c1a2 Remove the getStageInfo() method from SparkContext.
This method exposes the Stage objects, which are
private to Spark and should not be exposed to the
user.

This method was added in 01d77f329f; ccing @squito here in case there's a good reason to keep this!

Author: Kay Ousterhout <kayousterhout@gmail.com>

Closes #308 from kayousterhout/remove_public_method and squashes the following commits:

2e2f009 [Kay Ousterhout] Remove the getStageInfo() method from SparkContext.
2014-04-05 15:17:50 -07:00
Prashant Sharma 7c18428fac HOTFIX for broken CI, by SPARK-1336
Learnt about `set -o pipefail` is very useful.

Author: Prashant Sharma <prashant.s@imaginea.com>
Author: Prashant Sharma <scrapcodes@gmail.com>

Closes #321 from ScrapCodes/hf-SPARK-1336 and squashes the following commits:

9d22bc2 [Prashant Sharma] added comment why echo -e q exists.
f865951 [Prashant Sharma] made error to match with word boundry so errors does not match. This is there to make sure build fails if provided SparkBuild has compile errors.
7fffdf2 [Prashant Sharma] Removed a stray line.
97379d8 [Prashant Sharma] HOTFIX for broken CI, by SPARK-1336
2014-04-04 22:49:19 -07:00
Prabeesh K 0acc7a02b4 small fix ( proogram -> program )
Author: Prabeesh K <prabsmails@gmail.com>

Closes #331 from prabeesh/patch-3 and squashes the following commits:

9399eb5 [Prabeesh K] small fix(proogram -> program)
2014-04-04 21:32:00 -07:00
Michael Armbrust 8de038eb36 [SQL] SPARK-1366 Consistent sql function across different types of SQLContexts
Now users who want to use HiveQL should explicitly say `hiveql` or `hql`.

Author: Michael Armbrust <michael@databricks.com>

Closes #319 from marmbrus/standardizeSqlHql and squashes the following commits:

de68d0e [Michael Armbrust] Fix sampling test.
fbe4a54 [Michael Armbrust] Make `sql` always use spark sql parser, users of hive context can now use hql or hiveql to run queries using HiveQL instead.
2014-04-04 21:15:33 -07:00
Haoyuan Li b50ddfde03 SPARK-1305: Support persisting RDD's directly to Tachyon
Move the PR#468 of apache-incubator-spark to the apache-spark
"Adding an option to persist Spark RDD blocks into Tachyon."

Author: Haoyuan Li <haoyuan@cs.berkeley.edu>
Author: RongGu <gurongwalker@gmail.com>

Closes #158 from RongGu/master and squashes the following commits:

72b7768 [Haoyuan Li] merge master
9f7fa1b [Haoyuan Li] fix code style
ae7834b [Haoyuan Li] minor cleanup
a8b3ec6 [Haoyuan Li] merge master branch
e0f4891 [Haoyuan Li] better check offheap.
55b5918 [RongGu] address matei's comment on the replication of offHeap storagelevel
7cd4600 [RongGu] remove some logic code for tachyonstore's replication
51149e7 [RongGu] address aaron's comment on returning value of the remove() function in tachyonstore
8adfcfa [RongGu] address arron's comment on inTachyonSize
120e48a [RongGu] changed the root-level dir name in Tachyon
5cc041c [Haoyuan Li] address aaron's comments
9b97935 [Haoyuan Li] address aaron's comments
d9a6438 [Haoyuan Li] fix for pspark
77d2703 [Haoyuan Li] change python api.git status
3dcace4 [Haoyuan Li] address matei's comments
91fa09d [Haoyuan Li] address patrick's comments
589eafe [Haoyuan Li] use TRY_CACHE instead of MUST_CACHE
64348b2 [Haoyuan Li] update conf docs.
ed73e19 [Haoyuan Li] Merge branch 'master' of github.com:RongGu/spark-1
619a9a8 [RongGu] set number of directories in TachyonStore back to 64; added a TODO tag for duplicated code from the DiskStore
be79d77 [RongGu] find a way to clean up some unnecessay metods and classed to make the code simpler
49cc724 [Haoyuan Li] update docs with off_headp option
4572f9f [RongGu] reserving the old apply function API of StorageLevel
04301d3 [RongGu] rename StorageLevel.TACHYON to Storage.OFF_HEAP
c9aeabf [RongGu] rename the StorgeLevel.TACHYON as StorageLevel.OFF_HEAP
76805aa [RongGu] unifies the config properties name prefix; add the configs into docs/configuration.md
e700d9c [RongGu] add the SparkTachyonHdfsLR example and some comments
fd84156 [RongGu] use randomUUID to generate sparkapp directory name on tachyon;minor code style fix
939e467 [Haoyuan Li] 0.4.1-thrift from maven central
86a2eab [Haoyuan Li] tachyon 0.4.1-thrift is in the staging repo. but jenkins failed to download it. temporarily revert it back to 0.4.1
16c5798 [RongGu] make the dependency on tachyon as tachyon-0.4.1-thrift
eacb2e8 [RongGu] Merge branch 'master' of https://github.com/RongGu/spark-1
bbeb4de [RongGu] fix the JsonProtocolSuite test failure problem
6adb58f [RongGu] Merge branch 'master' of https://github.com/RongGu/spark-1
d827250 [RongGu] fix JsonProtocolSuie test failure
716e93b [Haoyuan Li] revert the version
ca14469 [Haoyuan Li] bump tachyon version to 0.4.1-thrift
2825a13 [RongGu] up-merging to the current master branch of the apache spark
6a22c1a [Haoyuan Li] fix scalastyle
8968b67 [Haoyuan Li] exclude more libraries from tachyon dependency to be the same as referencing tachyon-client.
77be7e8 [RongGu] address mateiz's comment about the temp folder name problem. The implementation followed mateiz's advice.
1dcadf9 [Haoyuan Li] typo
bf278fa [Haoyuan Li] fix python tests
e82909c [Haoyuan Li] minor cleanup
776a56c [Haoyuan Li] address patrick's and ali's comments from the previous PR
8859371 [Haoyuan Li] various minor fixes and clean up
e3ddbba [Haoyuan Li] add doc to use Tachyon cache mode.
fcaeab2 [Haoyuan Li] address Aaron's comment
e554b1e [Haoyuan Li] add python code
47304b3 [Haoyuan Li] make tachyonStore in BlockMananger lazy val; add more comments StorageLevels.
dc8ef24 [Haoyuan Li] add old storelevel constructor
e01a271 [Haoyuan Li] update tachyon 0.4.1
8011a96 [RongGu] fix a brought-in mistake in StorageLevel
70ca182 [RongGu] a bit change in comment
556978b [RongGu] fix the scalastyle errors
791189b [RongGu] "Adding an option to persist Spark RDD blocks into Tachyon." move the PR#468 of apache-incubator-spark to the apache-spark
2014-04-04 20:38:20 -07:00
Mark Hamstra 1347ebd4b5 [SPARK-1419] Bumped parent POM to apache 14
Keeping up-to-date with the parent, which includes some bugfixes.

Author: Mark Hamstra <markhamstra@gmail.com>

Closes #328 from markhamstra/Apache14 and squashes the following commits:

3f19975 [Mark Hamstra] Bumped parent POM to apache 14
2014-04-04 19:19:48 -07:00
Patrick Wendell 5f3c1bb513 Add test utility for generating Jar files with compiled classes.
This was requested by a few different people and may be generally
useful, so I'd like to contribute this and not block on a different
PR for it to get in.

Author: Patrick Wendell <pwendell@gmail.com>

Closes #326 from pwendell/class-loader-test-utils and squashes the following commits:

ff3e88e [Patrick Wendell] Add test utility for generating Jar files with compiled classes.
2014-04-04 19:15:15 -07:00
Matei Zaharia 60e18ce7dd SPARK-1414. Python API for SparkContext.wholeTextFiles
Also clarified comment on each file having to fit in memory

Author: Matei Zaharia <matei@databricks.com>

Closes #327 from mateiz/py-whole-files and squashes the following commits:

9ad64a5 [Matei Zaharia] SPARK-1414. Python API for SparkContext.wholeTextFiles
2014-04-04 17:29:29 -07:00
Michael Armbrust d956cc2516 [SQL] Minor fixes.
Author: Michael Armbrust <michael@databricks.com>

Closes #315 from marmbrus/minorFixes and squashes the following commits:

b23a15d [Michael Armbrust] fix scaladoc
11062ac [Michael Armbrust] Fix registering "SELECT *" queries as tables and caching them.  As some tests for this and self-joins.
3997dc9 [Michael Armbrust] Move Row extractor to catalyst.
208bf5e [Michael Armbrust] More idiomatic naming of DSL functions. * subquery => as * for join condition => on, i.e., `r.join(s, condition = 'a == 'b)` =>`r.join(s, on = 'a == 'b)`
87211ce [Michael Armbrust] Correctly handle self joins of in-memory cached tables.
69e195e [Michael Armbrust] Change != to !== in the DSL since != will always translate to != on Any.
01f2dd5 [Michael Armbrust] Correctly assign aliases to tables in SqlParser.
2014-04-04 17:23:17 -07:00
Thomas Graves 198892fe8d [SPARK-1198] Allow pipes tasks to run in different sub-directories
This works as is on Linux/Mac/etc but doesn't cover working on Windows.  In here I use ln -sf for symlinks. Putting this up for comments on that. Do we want to create perhaps some classes for doing shell commands - Linux vs Windows.  Is there some other way we want to do this?   I assume we are still supporting jdk1.6?

Also should I update the Java API for pipes to allow this parameter?

Author: Thomas Graves <tgraves@apache.org>

Closes #128 from tgravescs/SPARK1198 and squashes the following commits:

abc1289 [Thomas Graves] remove extra tag in pom file
ba23fc0 [Thomas Graves] Add support for symlink on windows, remove commons-io usage
da4b221 [Thomas Graves] Merge branch 'master' of https://github.com/tgravescs/spark into SPARK1198
61be271 [Thomas Graves] Fix file name filter
6b783bd [Thomas Graves] style fixes
1ab49ca [Thomas Graves] Add support for running pipe tasks is separate directories
2014-04-04 17:16:31 -07:00
Patrick Wendell a02b535d5e Don't create SparkContext in JobProgressListenerSuite.
This reduces the time of the test from 11 seconds to 20 milliseconds.

Author: Patrick Wendell <pwendell@gmail.com>

Closes #324 from pwendell/job-test and squashes the following commits:

868d9eb [Patrick Wendell] Don't create SparkContext in JobProgressListenerSuite.
2014-04-04 14:46:32 -07:00
Sandy Ryza 16b8308887 SPARK-1375. Additional spark-submit cleanup
Author: Sandy Ryza <sandy@cloudera.com>

Closes #278 from sryza/sandy-spark-1375 and squashes the following commits:

5fbf1e9 [Sandy Ryza] SPARK-1375. Additional spark-submit cleanup
2014-04-04 13:28:42 -07:00
Xusen Yin f1fa617023 [SPARK-1133] Add whole text files reader in MLlib
Here is a pointer to the former [PR164](https://github.com/apache/spark/pull/164).

I add the pull request for the JIRA issue [SPARK-1133](https://spark-project.atlassian.net/browse/SPARK-1133), which brings a new files reader API in MLlib.

Author: Xusen Yin <yinxusen@gmail.com>

Closes #252 from yinxusen/whole-files-input and squashes the following commits:

7191be6 [Xusen Yin] refine comments
0af3faf [Xusen Yin] add JavaAPI test
01745ee [Xusen Yin] fix deletion error
cc97dca [Xusen Yin] move whole text file API to Spark core
d792cee [Xusen Yin] remove the typo character "+"
6bdf2c2 [Xusen Yin] test for small local file system block size
a1f1e7e [Xusen Yin] add two extra spaces
28cb0fe [Xusen Yin] add whole text files reader
2014-04-04 11:12:47 -07:00
Aaron Davidson 01cf4c402b SPARK-1404: Always upgrade spark-env.sh vars to environment vars
This was broken when spark-env.sh was made idempotent, as the idempotence check is an environment variable, but the spark-env.sh variables may not have been.

Tested in zsh, bash, and sh.

Author: Aaron Davidson <aaron@databricks.com>

Closes #310 from aarondav/SPARK-1404 and squashes the following commits:

c3406a5 [Aaron Davidson] Add extra export in spark-shell
6a0e340 [Aaron Davidson] SPARK-1404: Always upgrade spark-env.sh vars to environment vars
2014-04-04 09:50:24 -07:00
Sandy Ryza 7f32fd42aa SPARK-1350. Always use JAVA_HOME to run executor container JVMs.
Author: Sandy Ryza <sandy@cloudera.com>

Closes #313 from sryza/sandy-spark-1350 and squashes the following commits:

bb6d187 [Sandy Ryza] SPARK-1350. Always use JAVA_HOME to run executor container JVMs.
2014-04-04 08:54:04 -05:00
Patrick Wendell ee6e9e7d86 SPARK-1337: Application web UI garbage collects newest stages
Simple fix...

Author: Patrick Wendell <pwendell@gmail.com>

Closes #320 from pwendell/stage-clean-up and squashes the following commits:

29be62e [Patrick Wendell] SPARK-1337: Application web UI garbage collects newest stages instead old ones
2014-04-03 22:13:56 -07:00
Patrick Wendell 33e63618d0 Revert "[SPARK-1398] Removed findbugs jsr305 dependency"
This reverts commit 92a86b285f.
2014-04-03 17:00:06 -07:00
Michael Armbrust 9231b011a9 Fix jenkins from giving the green light to builds that don't compile.
Adding `| grep` swallows the non-zero return code from sbt failures. See [here](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13735/consoleFull) for a Jenkins run that fails to compile, but still gets a green light.

Note the [BUILD FIX] commit isn't actually part of this PR, but github is out of date.

Author: Michael Armbrust <michael@databricks.com>

Closes #317 from marmbrus/fixJenkins and squashes the following commits:

7c77ff9 [Michael Armbrust] Remove output filter that was swallowing non-zero exit codes for test failures.
2014-04-03 16:53:35 -07:00
Michael Armbrust d94826be6d [BUILD FIX] Fix compilation of Spark SQL Java API.
The JavaAPI and the Parquet improvements PRs didn't conflict, but broke the build.

Author: Michael Armbrust <michael@databricks.com>

Closes #316 from marmbrus/hotFixJavaApi and squashes the following commits:

0b84c2d [Michael Armbrust] Fix compilation of Spark SQL Java API.
2014-04-03 16:12:08 -07:00
Diana Carroll a599e43d6e [SPARK-1134] Fix and document passing of arguments to IPython
This is based on @dianacarroll's previous pull request https://github.com/apache/spark/pull/227, and @joshrosen's comments on https://github.com/apache/spark/pull/38. Since we do want to allow passing arguments to IPython, this does the following:
* It documents that IPython can't be used with standalone jobs for now. (Later versions of IPython will deal with PYTHONSTARTUP properly and enable this, see https://github.com/ipython/ipython/pull/5226, but no released version has that fix.)
* If you run `pyspark` with `IPYTHON=1`, it passes your command-line arguments to it. This way you can do stuff like `IPYTHON=1 bin/pyspark notebook`.
* The old `IPYTHON_OPTS` remains, but I've removed it from the documentation. This is in case people read an old tutorial that uses it.

This is not a perfect solution and I'd also be okay with keeping things as they are today (ignoring `$@` for IPython and using IPYTHON_OPTS), and only doing the doc change. With this change though, when IPython fixes https://github.com/ipython/ipython/pull/5226, people will immediately be able to do `IPYTHON=1 bin/pyspark myscript.py` to run a standalone script and get all the benefits of running scripts in IPython (presumably better debugging and such). Without it, there will be no way to run scripts in IPython.

@joshrosen you should probably take the final call on this.

Author: Diana Carroll <dcarroll@cloudera.com>

Closes #294 from mateiz/spark-1134 and squashes the following commits:

747bb13 [Diana Carroll] SPARK-1134 bug with ipython prevents non-interactive use with spark; only call ipython if no command line arguments were supplied
2014-04-03 15:48:42 -07:00
Michael Armbrust b8f534196f [SQL] SPARK-1333 First draft of java API
WIP: Some work remains...
 * [x] Hive support
 * [x] Tests
 * [x] Update docs

Feedback welcome!

Author: Michael Armbrust <michael@databricks.com>

Closes #248 from marmbrus/javaSchemaRDD and squashes the following commits:

b393913 [Michael Armbrust] @srowen 's java style suggestions.
f531eb1 [Michael Armbrust] Address matei's comments.
33a1b1a [Michael Armbrust] Ignore JavaHiveSuite.
822f626 [Michael Armbrust] improve docs.
ab91750 [Michael Armbrust] Improve Java SQL API: * Change JavaRow => Row * Add support for querying RDDs of JavaBeans * Docs * Tests * Hive support
0b859c8 [Michael Armbrust] First draft of java API.
2014-04-03 15:45:34 -07:00
Prashant Sharma c1ea3afb51 Spark 1162 Implemented takeOrdered in pyspark.
Since python does not have a library for max heap and usual tricks like inverting values etc.. does not work for all cases.

We have our own implementation of max heap.

Author: Prashant Sharma <prashant.s@imaginea.com>

Closes #97 from ScrapCodes/SPARK-1162/pyspark-top-takeOrdered2 and squashes the following commits:

35f86ba [Prashant Sharma] code review
2b1124d [Prashant Sharma] fixed tests
e8a08e2 [Prashant Sharma] Code review comments.
49e6ba7 [Prashant Sharma] SPARK-1162 added takeOrdered to pyspark
2014-04-03 15:42:17 -07:00
Cheng Hao 5d1feda217 [SPARK-1360] Add Timestamp Support for SQL
This PR includes:
1) Add new data type Timestamp
2) Add more data type casting base on Hive's Rule
3) Fix bug missing data type in both parsers (HiveQl & SQLParser).

Author: Cheng Hao <hao.cheng@intel.com>

Closes #275 from chenghao-intel/timestamp and squashes the following commits:

df709e5 [Cheng Hao] Move orc_ends_with_nulls to blacklist
24b04b0 [Cheng Hao] Put 3 cases into the black lists(describe_pretty,describe_syntax,lateral_view_outer)
fc512c2 [Cheng Hao] remove the unnecessary data type equality check in data casting
d0d1919 [Cheng Hao] Add more data type for scala reflection
3259808 [Cheng Hao] Add the new Golden files
3823b97 [Cheng Hao] Update the UnitTest cases & add timestamp type for HiveQL
54a0489 [Cheng Hao] fix bug mapping to 0 (which is supposed to be null) when NumberFormatException occurs
9cb505c [Cheng Hao] Fix issues according to PR comments
e529168 [Cheng Hao] Fix bug of converting from String
6fc8100 [Cheng Hao] Update Unit Test & CodeStyle
8a1d4d6 [Cheng Hao] Add DataType for SqlParser
ce4385e [Cheng Hao] Add TimestampType Support
2014-04-03 15:33:17 -07:00
Andre Schumacher fbebaedf26 Spark parquet improvements
A few improvements to the Parquet support for SQL queries:
- Instead of files a ParquetRelation is now backed by a directory, which simplifies importing data from other
  sources
- InsertIntoParquetTable operation now supports switching between overwriting or appending (at least in
  HiveQL)
- tests now use the new API
- Parquet logging can be set to WARNING level (Default)
- Default compression for Parquet files (GZIP, as in parquet-mr)

Author: Andre Schumacher <andre.schumacher@iki.fi>

Closes #195 from AndreSchumacher/spark_parquet_improvements and squashes the following commits:

54df314 [Andre Schumacher] SPARK-1383 [SQL] Improvements to ParquetRelation
2014-04-03 15:31:47 -07:00
Mark Hamstra 92a86b285f [SPARK-1398] Removed findbugs jsr305 dependency
Should be a painless upgrade, and does offer some significant advantages should we want to leverage FindBugs more during the 1.0 lifecycle. http://findbugs.sourceforge.net/findbugs2.html

Author: Mark Hamstra <markhamstra@gmail.com>

Closes #307 from markhamstra/findbugs and squashes the following commits:

99f2d09 [Mark Hamstra] Removed unnecessary findbugs jsr305 dependency
2014-04-03 14:08:47 -07:00
Michael Armbrust 47ebea5468 [SQL] SPARK-1364 Improve datatype and test coverage for ScalaReflection schema inference.
Author: Michael Armbrust <michael@databricks.com>

Closes #293 from marmbrus/reflectTypes and squashes the following commits:

f54e8e8 [Michael Armbrust] Improve datatype and test coverage for ScalaReflection schema inference.
2014-04-02 18:14:31 -07:00
Xiangrui Meng 9c65fa76f9 [SPARK-1212, Part II] Support sparse data in MLlib
In PR https://github.com/apache/spark/pull/117, we added dense/sparse vector data model and updated KMeans to support sparse input. This PR is to replace all other `Array[Double]` usage by `Vector` in generalized linear models (GLMs) and Naive Bayes. Major changes:

1. `LabeledPoint` becomes `LabeledPoint(Double, Vector)`.
2. Methods that accept `RDD[Array[Double]]` now accept `RDD[Vector]`. We cannot support both in an elegant way because of type erasure.
3. Mark 'createModel' and 'predictPoint' protected because they are not for end users.
4. Add libSVMFile to MLContext.
5. NaiveBayes can accept arbitrary labels (introducing a breaking change to Python's `NaiveBayesModel`).
6. Gradient computation no longer creates temp vectors.
7. Column normalization and centering are removed from Lasso and Ridge because the operation will densify the data. Simple feature transformation can be done before training.

TODO:
1. ~~Use axpy when possible.~~
2. ~~Optimize Naive Bayes.~~

Author: Xiangrui Meng <meng@databricks.com>

Closes #245 from mengxr/vector and squashes the following commits:

eb6e793 [Xiangrui Meng] move libSVMFile to MLUtils and rename to loadLibSVMData
c26c4fc [Xiangrui Meng] update DecisionTree to use RDD[Vector]
11999c7 [Xiangrui Meng] Merge branch 'master' into vector
f7da54b [Xiangrui Meng] add minSplits to libSVMFile
da25e24 [Xiangrui Meng] revert the change to default addIntercept because it might change the behavior of existing code without warning
493f26f [Xiangrui Meng] Merge branch 'master' into vector
7c1bc01 [Xiangrui Meng] add a TODO to NB
b9b7ef7 [Xiangrui Meng] change default value of addIntercept to false
b01df54 [Xiangrui Meng] allow to change or clear threshold in LR and SVM
4addc50 [Xiangrui Meng] merge master
4ca5b1b [Xiangrui Meng] remove normalization from Lasso and update tests
f04fe8a [Xiangrui Meng] remove normalization from RidgeRegression and update tests
d088552 [Xiangrui Meng] use static constructor for MLContext
6f59eed [Xiangrui Meng] update libSVMFile to determine number of features automatically
3432e84 [Xiangrui Meng] update NaiveBayes to support sparse data
0f8759b [Xiangrui Meng] minor updates to NB
b11659c [Xiangrui Meng] style update
78c4671 [Xiangrui Meng] add libSVMFile to MLContext
f0fe616 [Xiangrui Meng] add a test for sparse linear regression
44733e1 [Xiangrui Meng] use in-place gradient computation
e981396 [Xiangrui Meng] use axpy in Updater
db808a1 [Xiangrui Meng] update JavaLR example
befa592 [Xiangrui Meng] passed scala/java tests
75c83a4 [Xiangrui Meng] passed test compile
1859701 [Xiangrui Meng] passed compile
834ada2 [Xiangrui Meng] optimized MLUtils.computeStats update some ml algorithms to use Vector (cont.)
135ab72 [Xiangrui Meng] merge glm
0e57aa4 [Xiangrui Meng] update Lasso and RidgeRegression to parse the weights correctly from GLM mark createModel protected mark predictPoint protected
d7f629f [Xiangrui Meng] fix a bug in GLM when intercept is not used
3f346ba [Xiangrui Meng] update some ml algorithms to use Vector
2014-04-02 14:01:12 -07:00
Reynold Xin ed730c9502 StopAfter / TopK related changes
1. Renamed StopAfter to Limit to be more consistent with naming in other relational databases.
2. Renamed TopK to TakeOrdered to be more consistent with Spark RDD API.
3. Avoid breaking lineage in Limit.
4. Added a bunch of override's to execution/basicOperators.scala.

@marmbrus @liancheng

Author: Reynold Xin <rxin@apache.org>
Author: Michael Armbrust <michael@databricks.com>

Closes #233 from rxin/limit and squashes the following commits:

13eb12a [Reynold Xin] Merge pull request #1 from marmbrus/limit
92b9727 [Michael Armbrust] More hacks to make Maps serialize with Kryo.
4fc8b4e [Reynold Xin] Merge branch 'master' of github.com:apache/spark into limit
87b7d37 [Reynold Xin] Use the proper serializer in limit.
9b79246 [Reynold Xin] Updated doc for Limit.
47d3327 [Reynold Xin] Copy tuples in Limit before shuffle.
231af3a [Reynold Xin] Limit/TakeOrdered: 1. Renamed StopAfter to Limit to be more consistent with naming in other relational databases. 2. Renamed TopK to TakeOrdered to be more consistent with Spark RDD API. 3. Avoid breaking lineage in Limit. 4. Added a bunch of override's to execution/basicOperators.scala.
2014-04-02 12:48:04 -07:00
Cheng Lian 1faa579711 [SPARK-1371][WIP] Compression support for Spark SQL in-memory columnar storage
JIRA issue: [SPARK-1373](https://issues.apache.org/jira/browse/SPARK-1373)

(Although tagged as WIP, this PR is structurally complete. The only things left unimplemented are 3 more compression algorithms: `BooleanBitSet`, `IntDelta` and `LongDelta`, which are trivial to add later in this or another separate PR.)

This PR contains compression support for Spark SQL in-memory columnar storage. Main interfaces include:

*   `CompressionScheme`

    Each `CompressionScheme` represents a concrete compression algorithm, which basically consists of an `Encoder` for compression and a `Decoder` for decompression. Algorithms implemented include:

    * `RunLengthEncoding`
    * `DictionaryEncoding`

    Algorithms to be implemented include:

    * `BooleanBitSet`
    * `IntDelta`
    * `LongDelta`

*   `CompressibleColumnBuilder`

    A stackable `ColumnBuilder` trait used to build byte buffers for compressible columns.  A best `CompressionScheme` that exhibits lowest compression ratio is chosen for each column according to statistical information gathered while elements are appended into the `ColumnBuilder`. However, if no `CompressionScheme` can achieve a compression ratio better than 80%, no compression will be done for this column to save CPU time.

    Memory layout of the final byte buffer is showed below:

    ```
     .--------------------------- Column type ID (4 bytes)
     |   .----------------------- Null count N (4 bytes)
     |   |   .------------------- Null positions (4 x N bytes, empty if null count is zero)
     |   |   |     .------------- Compression scheme ID (4 bytes)
     |   |   |     |   .--------- Compressed non-null elements
     V   V   V     V   V
    +---+---+-----+---+---------+
    |   |   | ... |   | ... ... |
    +---+---+-----+---+---------+
     \-----------/ \-----------/
        header         body
    ```

*   `CompressibleColumnAccessor`

    A stackable `ColumnAccessor` trait used to iterate (possibly) compressed data column.

*   `ColumnStats`

    Used to collect statistical information while loading data into in-memory columnar table. Optimizations like partition pruning rely on this information.

    Strictly speaking, `ColumnStats` related code is not part of the compression support. It's contained in this PR to ensure and validate the row-based API design (which is used to avoid boxing/unboxing cost whenever possible).

A major refactoring change since PR #205 is:

* Refactored all getter/setter methods for primitive types in various places into `ColumnType` classes to remove duplicated code.

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #285 from liancheng/memColumnarCompression and squashes the following commits:

ed71bbd [Cheng Lian] Addressed all PR comments by @marmbrus
d3a4fa9 [Cheng Lian] Removed Ordering[T] in ColumnStats for better performance
5034453 [Cheng Lian] Bug fix, more tests, and more refactoring
c298b76 [Cheng Lian] Test suites refactored
2780d6a [Cheng Lian] [WIP] in-memory columnar compression support
211331c [Cheng Lian] WIP: in-memory columnar compression support
85cc59b [Cheng Lian] Refactored ColumnAccessors & ColumnBuilders to remove duplicate code
2014-04-02 12:47:22 -07:00
Daniel Darabos 78236334e4 Do not re-use objects in the EdgePartition/EdgeTriplet iterators.
This avoids a silent data corruption issue (https://spark-project.atlassian.net/browse/SPARK-1188) and has no performance impact by my measurements. It also simplifies the code. As far as I can tell the object re-use was nothing but premature optimization.

I did actual benchmarks for all the included changes, and there is no performance difference. I am not sure where to put the benchmarks. Does Spark not have a benchmark suite?

This is an example benchmark I did:

test("benchmark") {
  val builder = new EdgePartitionBuilder[Int]
  for (i <- (1 to 10000000)) {
    builder.add(i.toLong, i.toLong, i)
  }
  val p = builder.toEdgePartition
  p.map(_.attr + 1).iterator.toList
}

It ran for 10 seconds both before and after this change.

Author: Daniel Darabos <darabos.daniel@gmail.com>

Closes #276 from darabos/spark-1188 and squashes the following commits:

574302b [Daniel Darabos] Restore "manual" copying in EdgePartition.map(Iterator). Add comment to discourage novices like myself from trying to simplify the code.
4117a64 [Daniel Darabos] Revert EdgePartitionSuite.
4955697 [Daniel Darabos] Create a copy of the Edge objects in EdgeRDD.compute(). This avoids exposing the object re-use, while still enables the more efficient behavior for internal code.
4ec77f8 [Daniel Darabos] Add comments about object re-use to the affected functions.
2da5e87 [Daniel Darabos] Restore object re-use in EdgePartition.
0182f2b [Daniel Darabos] Do not re-use objects in the EdgePartition/EdgeTriplet iterators. This avoids a silent data corruption issue (SPARK-1188) and has no performance impact in my measurements. It also simplifies the code.
c55f52f [Daniel Darabos] Tests that reproduce the problems from SPARK-1188.
2014-04-02 12:27:37 -07:00
Andrew Or de8eefa804 [SPARK-1385] Use existing code for JSON de/serialization of BlockId
`BlockId.scala` offers a way to reconstruct a BlockId from a string through regex matching. `util/JsonProtocol.scala` duplicates this functionality by explicitly matching on the BlockId type.
With this PR, the de/serialization of BlockIds will go through the first (older) code path.

(Most of the line changes in this PR involve changing `==` to `===` in `JsonProtocolSuite.scala`)

Author: Andrew Or <andrewor14@gmail.com>

Closes #289 from andrewor14/blockid-json and squashes the following commits:

409d226 [Andrew Or] Simplify JSON de/serialization for BlockId
2014-04-02 10:43:09 -07:00
Kay Ousterhout 11973a7bda Renamed stageIdToActiveJob to jobIdToActiveJob.
This data structure was misused and, as a result, later renamed to an incorrect name.

This data structure seems to have gotten into this tangled state as a result of @henrydavidge using the stageID instead of the job Id to index into it and later @andrewor14 renaming the data structure to reflect this misunderstanding.

This patch renames it and removes an incorrect indexing into it.  The incorrect indexing into it meant that the code added by @henrydavidge to warn when a task size is too large (added here 57579934f0) was not always executed; this commit fixes that.

Author: Kay Ousterhout <kayousterhout@gmail.com>

Closes #301 from kayousterhout/fixCancellation and squashes the following commits:

bd3d3a4 [Kay Ousterhout] Renamed stageIdToActiveJob to jobIdToActiveJob.
2014-04-02 10:35:52 -07:00
Michael Armbrust ea9de658a3 Remove * from test case golden filename.
@rxin mentioned this might cause issues on windows machines.

Author: Michael Armbrust <michael@databricks.com>

Closes #297 from marmbrus/noStars and squashes the following commits:

263122a [Michael Armbrust] Remove * from test case golden filename.
2014-04-01 23:54:38 -07:00
Manish Amde 8b3045ceab MLI-1 Decision Trees
Joint work with @hirakendu, @etrain, @atalwalkar and @harsha2010.

Key features:
+ Supports binary classification and regression
+ Supports gini, entropy and variance for information gain calculation
+ Supports both continuous and categorical features

The algorithm has gone through several development iterations over the last few months leading to a highly optimized implementation. Optimizations include:

1. Level-wise training to reduce passes over the entire dataset.
2. Bin-wise split calculation to reduce computation overhead.
3. Aggregation over partitions before combining to reduce communication overhead.

Author: Manish Amde <manish9ue@gmail.com>
Author: manishamde <manish9ue@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes #79 from manishamde/tree and squashes the following commits:

1e8c704 [Manish Amde] remove numBins field in the Strategy class
7d54b4f [manishamde] Merge pull request #4 from mengxr/dtree
f536ae9 [Xiangrui Meng] another pass on code style
e1dd86f [Manish Amde] implementing code style suggestions
62dc723 [Manish Amde] updating javadoc and converting helper methods to package private to allow unit testing
201702f [Manish Amde] making some more methods private
f963ef5 [Manish Amde] making methods private
c487e6a [manishamde] Merge pull request #1 from mengxr/dtree
24500c5 [Xiangrui Meng] minor style updates
4576b64 [Manish Amde] documentation and for to while loop conversion
ff363a7 [Manish Amde] binary search for bins and while loop for categorical feature bins
632818f [Manish Amde] removing threshold for classification predict method
2116360 [Manish Amde] removing dummy bin calculation for categorical variables
6068356 [Manish Amde] ensuring num bins is always greater than max number of categories
62c2562 [Manish Amde] fixing comment indentation
ad1fc21 [Manish Amde] incorporated mengxr's code style suggestions
d1ef4f6 [Manish Amde] more documentation
794ff4d [Manish Amde] minor improvements to docs and style
eb8fcbe [Manish Amde] minor code style updates
cd2c2b4 [Manish Amde] fixing code style based on feedback
63e786b [Manish Amde] added multiple train methods for java compatability
d3023b3 [Manish Amde] adding more docs for nested methods
84f85d6 [Manish Amde] code documentation
9372779 [Manish Amde] code style: max line lenght <= 100
dd0c0d7 [Manish Amde] minor: some docs
0dd7659 [manishamde] basic doc
5841c28 [Manish Amde] unit tests for categorical features
f067d68 [Manish Amde] minor cleanup
c0e522b [Manish Amde] updated predict and split threshold logic
b09dc98 [Manish Amde] minor refactoring
6b7de78 [Manish Amde] minor refactoring and tests
d504eb1 [Manish Amde] more tests for categorical features
dbb7ac1 [Manish Amde] categorical feature support
6df35b9 [Manish Amde] regression predict logic
53108ed [Manish Amde] fixing index for highest bin
e23c2e5 [Manish Amde] added regression support
c8f6d60 [Manish Amde] adding enum for feature type
b0e3e76 [Manish Amde] adding enum for feature type
154aa77 [Manish Amde] enums for configurations
733d6dd [Manish Amde] fixed tests
02c595c [Manish Amde] added command line parsing
98ec8d5 [Manish Amde] tree building and prediction logic
b0eb866 [Manish Amde] added logic to handle leaf nodes
80e8c66 [Manish Amde] working version of multi-level split calculation
4798aae [Manish Amde] added gain stats class
dad0afc [Manish Amde] decison stump functionality working
03f534c [Manish Amde] some more tests
0012a77 [Manish Amde] basic stump working
8bca1e2 [Manish Amde] additional code for creating intermediate RDD
92cedce [Manish Amde] basic building blocks for intermediate RDD calculation. untested.
cd53eae [Manish Amde] skeletal framework
2014-04-01 21:40:49 -07:00
Matei Zaharia 45df912736 Revert "[Spark-1134] only call ipython if no arguments are given; remove IPYTHONOPTS from call"
This reverts commit afb5ea6278.
2014-04-01 19:31:50 -07:00
Diana Carroll afb5ea6278 [Spark-1134] only call ipython if no arguments are given; remove IPYTHONOPTS from call
see comments on Pull Request https://github.com/apache/spark/pull/38
(i couldn't figure out how to modify an existing pull request, so I'm hoping I can withdraw that one and replace it with this one.)

Author: Diana Carroll <dcarroll@cloudera.com>

Closes #227 from dianacarroll/spark-1134 and squashes the following commits:

ffe47f2 [Diana Carroll] [spark-1134] remove ipythonopts from ipython command
b673bf7 [Diana Carroll] Merge branch 'master' of github.com:apache/spark
0309cf9 [Diana Carroll] SPARK-1134 bug with ipython prevents non-interactive use with spark; only call ipython if no command line arguments were supplied
2014-04-01 19:29:26 -07:00