Commit graph

7182 commits

Author SHA1 Message Date
baishuo(白硕) aa41a522d8 fix java.lang.ClassCastException
An exception is thrown when running bin/run-example org.apache.spark.examples.sql.RDDRelation.
The exception details are:
Exception in thread "main" java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.Integer
	at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:106)
	at org.apache.spark.sql.catalyst.expressions.GenericRow.getInt(Row.scala:145)
	at org.apache.spark.examples.sql.RDDRelation$.main(RDDRelation.scala:49)
	at org.apache.spark.examples.sql.RDDRelation.main(RDDRelation.scala)
Changing sql("SELECT COUNT(*) FROM records").collect().head.getInt(0) to sql("SELECT COUNT(*) FROM records").collect().head.getLong(0) makes the exception no longer occur.
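For reference, a minimal sketch of the corrected call (assuming `sql` is in scope via the example's SQLContext import):
```scala
// COUNT(*) produces a Long, so the field must be read with getLong rather than getInt,
// which is what triggered the ClassCastException above.
val count: Long = sql("SELECT COUNT(*) FROM records").collect().head.getLong(0)
```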

Author: baishuo(白硕) <vc_java@hotmail.com>

Closes #949 from baishuo/master and squashes the following commits:

f4b319f [baishuo(白硕)] fix java.lang.ClassCastException
2014-06-03 13:39:47 -07:00
Erik Selin 8edc9d0330 [SPARK-1468] Modify the partition function used by partitionBy.
Make partitionBy use a tweaked version of hash as its default partition function,
since the Python hash function does not consistently assign the same value
to None across Python processes.

Associated JIRA at https://issues.apache.org/jira/browse/SPARK-1468

Author: Erik Selin <erik.selin@jadedpixel.com>

Closes #371 from tyro89/consistent_hashing and squashes the following commits:

201c301 [Erik Selin] Make partitionBy use a tweaked version of hash as its default partition function since the python hash function does not consistently assign the same value to None across python processes.
2014-06-03 13:31:16 -07:00
tzolov b1f285359a Add support for Pivotal HD in the Maven build: SPARK-1992
Allow Spark to build against particular Pivotal HD distributions. For example, to build Spark against Pivotal HD 2.0.1, one can run:
```
mvn -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0-gphd-3.0.1.0 -DskipTests clean package
```

Author: tzolov <christian.tzolov@gmail.com>

Closes #942 from tzolov/master and squashes the following commits:

bc3e05a [tzolov] Add support for Pivotal HD in the Maven build and SBT build: [SPARK-1992]
2014-06-03 13:26:29 -07:00
Wenchen Fan(Cloud) 45e9bc85db [SPARK-1912] fix compress memory issue during reduce
When we need to read a compressed block, we first create a compression stream instance (LZF or Snappy) and use it to wrap that block.
Say a reducer task needs to read 1000 local shuffle blocks: it first prepares to read all 1000 blocks, which means creating 1000 compression stream instances to wrap them. But initializing a compression instance allocates some memory, and having many compression instances alive at the same time is a problem.
Since the reducer actually reads the shuffle blocks one by one, we can initialize the compression instances lazily.
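A rough sketch of the lazy-initialization idea, reusing the `LazyProxyIterator` name from the squashed commits below (illustrative only, not the actual Spark code):
```scala
// Wraps the construction of an iterator so the underlying (memory-hungry)
// compression stream is only created when the block is actually consumed.
class LazyProxyIterator[T](makeIterator: () => Iterator[T]) extends Iterator[T] {
  private lazy val underlying = makeIterator()  // compression stream allocated here, on demand
  override def hasNext: Boolean = underlying.hasNext
  override def next(): T = underlying.next()
}

// Usage sketch: wrapForCompression would create the LZF/Snappy stream for one block.
// val records = new LazyProxyIterator(() => deserializeStream(wrapForCompression(blockStream)))
```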

Author: Wenchen Fan(Cloud) <cloud0fan@gmail.com>

Closes #860 from cloud-fan/fix-compress and squashes the following commits:

0924a6b [Wenchen Fan(Cloud)] rename 'doWork' into 'getIterator'
07f32c2 [Wenchen Fan(Cloud)] move the LazyProxyIterator to dataDeserialize
d80c426 [Wenchen Fan(Cloud)] remove empty lines in short class
2c8adb2 [Wenchen Fan(Cloud)] add inline comment
8ebff77 [Wenchen Fan(Cloud)] fix compress memory issue during reduce
2014-06-03 13:18:20 -07:00
Henry Saputra 6c044ed100 SPARK-2001 : Remove docs/spark-debugger.md from master
Per discussion in dev list:
"
It seems that spark-debugger.md is no longer accurate (see
http://spark.apache.org/docs/latest/spark-debugger.html); Spark has evolved
since the doc was originally written, which makes it obsolete.
There is already work pending for new replay debugging (I could not
find the PR links for it).
With version control we can always reinstate the old doc if needed,
but as of today the doc no longer reflects the current state of
Spark's RDDs.
"

Author: Henry Saputra <henry.saputra@gmail.com>

Closes #953 from hsaputra/SPARK-2001-hsaputra and squashes the following commits:

dc324aa [Henry Saputra] SPARK-2001 : Remove docs/spark-debugger.md from master since it is obsolete
2014-06-03 13:03:51 -07:00
Syed Hashmi 7782a304ad [SPARK-1942] Stop clearing spark.driver.port in unit tests
Stop resetting spark.driver.port in unit tests (Scala, Java, and Python).

Author: Syed Hashmi <shashmi@cloudera.com>
Author: CodingCat <zhunansjtu@gmail.com>

Closes #943 from syedhashmi/master and squashes the following commits:

885f210 [Syed Hashmi] Removing unnecessary file (created by mergetool)
b8bd4b5 [Syed Hashmi] Merge remote-tracking branch 'upstream/master'
b895e59 [Syed Hashmi] Revert "[SPARK-1784] Add a new partitioner"
57b6587 [Syed Hashmi] Revert "[SPARK-1784] Add a balanced partitioner"
1574769 [Syed Hashmi] [SPARK-1942] Stop clearing spark.driver.port in unit tests
4354836 [Syed Hashmi] Revert "SPARK-1686: keep schedule() calling in the main thread"
fd36542 [Syed Hashmi] [SPARK-1784] Add a balanced partitioner
6668015 [CodingCat] SPARK-1686: keep schedule() calling in the main thread
4ca94cc [Syed Hashmi] [SPARK-1784] Add a new partitioner
2014-06-03 12:04:47 -07:00
Cheng Lian 862283e9cc Avoid dynamic dispatching when unwrapping Hive data.
This is a follow up of PR #758.

The `unwrapHiveData` function is now composed statically, according to the field object inspector, before actual rows are scanned, to avoid the cost of dynamic dispatching.

According to the same micro benchmark used in PR #758, this simple change brings a slight performance boost: 2.5% for the CSV table and 1% for the RCFile table.

```
Optimized version:

CSV: 6870 ms, RCFile: 5687 ms
CSV: 6832 ms, RCFile: 5800 ms
CSV: 6822 ms, RCFile: 5679 ms
CSV: 6704 ms, RCFile: 5758 ms
CSV: 6819 ms, RCFile: 5725 ms

Original version:

CSV: 7042 ms, RCFile: 5667 ms
CSV: 6883 ms, RCFile: 5703 ms
CSV: 7115 ms, RCFile: 5665 ms
CSV: 7020 ms, RCFile: 5981 ms
CSV: 6871 ms, RCFile: 5906 ms
```
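A toy sketch of the static composition described above, with hypothetical inspector types standing in for Hive's object inspectors (not the actual Catalyst code):
```scala
// Pick the unwrapping function once per field, driven by its object inspector,
// rather than dispatching on every value while scanning rows.
sealed trait FieldInspector
case object IntInspector extends FieldInspector
case object StringInspector extends FieldInspector

def unwrapperFor(inspector: FieldInspector): Any => Any = inspector match {
  case IntInspector    => (v: Any) => v.asInstanceOf[java.lang.Integer].intValue()
  case StringInspector => (v: Any) => v.toString
}

// Composed before the scan:
// val unwrappers = fieldInspectors.map(unwrapperFor)
// ...then applied to every row value with no further matching.
```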

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #935 from liancheng/staticUnwrapping and squashes the following commits:

c49c70c [Cheng Lian] Avoid dynamic dispatching when unwrapping Hive data.
2014-06-02 19:20:23 -07:00
egraldlo ec8be274a7 [SPARK-1995][SQL] system function upper and lower can be supported
I am not sure whether it is time to implement system functions for string operations in Spark SQL yet.

Author: egraldlo <egraldlo@gmail.com>

Closes #936 from egraldlo/stringoperator and squashes the following commits:

3c6c60a [egraldlo] Add UPPER, LOWER, MAX and MIN into hive parser
ea76d0a [egraldlo] modify the formatting issues
b49f25e [egraldlo] modify the formatting issues
1f0bbb5 [egraldlo] system function upper and lower supported
13d3267 [egraldlo] system function upper and lower supported
2014-06-02 18:02:57 -07:00
Cheng Lian d000ca98a8 [SPARK-1958] Calling .collect() on a SchemaRDD should call executeCollect() on the underlying query plan.
In cases like `Limit` and `TakeOrdered`, `executeCollect()` makes optimizations that `execute().collect()` will not.
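For illustration, a hedged sketch of why delegating to `executeCollect()` helps, with `Seq[String]` standing in for `RDD[Row]` and all names hypothetical:
```scala
// The base plan falls back to materializing everything via execute().
abstract class PlanSketch {
  def execute(): Seq[String]
  def executeCollect(): Array[String] = execute().toArray
}

// A Limit operator can override executeCollect() to pull only `limit` rows,
// rather than collecting the whole child output and truncating afterwards.
class LimitSketch(limit: Int, child: PlanSketch) extends PlanSketch {
  def execute(): Seq[String] = child.execute().take(limit)
  override def executeCollect(): Array[String] = child.execute().take(limit).toArray
}
```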

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #939 from liancheng/spark-1958 and squashes the following commits:

bdc4a14 [Cheng Lian] Copy rows to present immutable data to users
8250976 [Cheng Lian] Added return type explicitly for public API
192a25c [Cheng Lian] [SPARK-1958] Calling .collect() on a SchemaRDD should call executeCollect() on the underlying query plan.
2014-06-02 12:09:43 -07:00
Tor Myklebust 9a5d482e09 [SPARK-1553] Alternating nonnegative least-squares
This pull request includes a nonnegative least-squares solver (NNLS) tailored to the kinds of small-scale problems that come up when training matrix factorisation models by alternating nonnegative least-squares (ANNLS).

The method used for the NNLS subproblems is based on the classical method of projected gradients.  There is a modification where, if the set of active constraints has not changed since the last iteration, a conjugate gradient step is considered and possibly rejected in favour of the gradient; this improves convergence once the optimal face has been located.

The NNLS solver is in `org.apache.spark.mllib.optimization.NNLSbyPCG`.
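For intuition, a heavily simplified projected-gradient step for the NNLS subproblem minimize 0.5 * x'Ax - b'x subject to x >= 0 (fixed step size; names and structure are illustrative, not the MLlib solver):
```scala
// One projected-gradient step: the gradient of 0.5 * x'Ax - b'x is A x - b;
// move along the negative gradient, then project onto the nonnegative orthant.
def projectedGradientStep(ata: Array[Array[Double]], atb: Array[Double],
                          x: Array[Double], step: Double): Array[Double] = {
  val n = x.length
  val grad = Array.tabulate(n) { i =>
    var s = 0.0
    var j = 0
    while (j < n) { s += ata(i)(j) * x(j); j += 1 }
    s - atb(i)
  }
  // max(0, .) enforces the x >= 0 constraint.
  Array.tabulate(n)(i => math.max(0.0, x(i) - step * grad(i)))
}
```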

Author: Tor Myklebust <tmyklebu@gmail.com>

Closes #460 from tmyklebu/annls and squashes the following commits:

79bc4b5 [Tor Myklebust] Merge branch 'master' of https://github.com/apache/spark into annls
199b0bc [Tor Myklebust] Make the ctor private again and use the builder pattern.
7fbabf1 [Tor Myklebust] Cleanup matrix math in NNLSSuite.
65ef7f2 [Tor Myklebust] Make ALS's ctor public and remove a couple of "convenience" wrappers.
2d4f3cb [Tor Myklebust] Cleanup.
0cb4481 [Tor Myklebust] Drop the iteration limit from 40k to max(400,20n).
e2a01d1 [Tor Myklebust] Create a workspace object for NNLS to cut down on memory allocations.
b285106 [Tor Myklebust] Clean up NNLS test cases.
9c820b6 [Tor Myklebust] Tweak variable names.
8a1a436 [Tor Myklebust] Describe the problem and add a reference to Polyak's paper.
5345402 [Tor Myklebust] Style fixes that got eaten.
ac673bd [Tor Myklebust] More safeguards against numerical ridiculousness.
c288b6a [Tor Myklebust] Finish moving the NNLS solver.
9a82fa6 [Tor Myklebust] Fix scalastyle moanings.
33bf4f2 [Tor Myklebust] Fix missing space.
89ea0a8 [Tor Myklebust] Hack ALSSuite to support NNLS testing.
f5dbf4d [Tor Myklebust] Teach ALS how to use the NNLS solver.
6cb563c [Tor Myklebust] Tests for the nonnegative least squares solver.
a68ac10 [Tor Myklebust] A nonnegative least-squares solver.
2014-06-02 11:48:09 -07:00
Ankur Dave 9535f4045d Add landmark-based Shortest Path algorithm to graphx.lib
This is a modified version of apache/spark#10.

Author: Ankur Dave <ankurdave@gmail.com>
Author: Andres Perez <andres@tresata.com>

Closes #933 from ankurdave/shortestpaths and squashes the following commits:

03a103c [Ankur Dave] Style fixes
7a1ff48 [Ankur Dave] Improve ShortestPaths documentation
d75c8fc [Ankur Dave] Remove unnecessary VD type param, and pass through ED
d983fb4 [Ankur Dave] Fix style errors
60ed8e6 [Andres Perez] Add Shortest-path computations to graphx.lib with unit tests.
2014-06-02 00:00:24 -07:00
Patrick Wendell d17d221487 Better explanation for how to use MIMA excludes.
This patch does a few things:
1. We have a file MimaExcludes.scala exclusively for excludes.
2. The test runner tells users about that file if a test fails.
3. I've added back the excludes used from 0.9->1.0. We should keep
   these in the project as an official audit trail of times where
   we decided to make exceptions.

Author: Patrick Wendell <pwendell@gmail.com>

Closes #937 from pwendell/mima and squashes the following commits:

7ee0db2 [Patrick Wendell] Better explanation for how to use MIMA excludes.
2014-06-01 17:27:05 -07:00
Reynold Xin eea3aab4f2 Made spark_ec2.py PEP8 compliant.
The change set is actually pretty small -- mostly whitespace changes. Admittedly this is a scary change due to the lack of tests to cover the ec2 scripts, and also because indentation actually impacts control flow in Python ...

Look at changes without whitespace diff here: https://github.com/apache/spark/pull/891/files?w=1

Author: Reynold Xin <rxin@apache.org>

Closes #891 from rxin/spark-ec2-pep8 and squashes the following commits:

ac1bf11 [Reynold Xin] Made spark_ec2.py PEP8 compliant.
2014-06-01 15:39:04 -07:00
Yadid Ayzenberg 366c0c4c30 updated java code blocks in spark SQL guide such that ctx will refer to a JavaSparkContext and sqlCtx will refer to a JavaSQLContext

Author: Yadid Ayzenberg <yadid@media.mit.edu>

Closes #932 from yadid/master and squashes the following commits:

f92fb3a [Yadid Ayzenberg] updated java code blocks in spark SQL guide such that ctx will refer to a JavaSparkContext and sqlCtx will refer to a JavaSQLContext
2014-05-31 19:44:13 -07:00
Uri Laserson 5e98967b61 SPARK-1917: fix PySpark import of scipy.special functions
https://issues.apache.org/jira/browse/SPARK-1917

Author: Uri Laserson <laserson@cloudera.com>

Closes #866 from laserson/SPARK-1917 and squashes the following commits:

d947e8c [Uri Laserson] Added test for scipy.special importing
1798bbd [Uri Laserson] SPARK-1917: fix PySpark import of scipy.special
2014-05-31 14:59:09 -07:00
witgo d8c005d537 Improve maven plugin configuration
Author: witgo <witgo@qq.com>

Closes #786 from witgo/maven_plugin and squashes the following commits:

5de86a2 [witgo] Merge branch 'master' of https://github.com/apache/spark into maven_plugin
c35ef73 [witgo] Improve maven plugin configuration
2014-05-31 14:36:27 -07:00
Aaron Davidson 9909efc10a SPARK-1839: PySpark RDD#take() shouldn't always read from driver
This patch simply ports over the Scala implementation of RDD#take(), which reads the first partition at the driver, then decides how many more partitions it needs to read and will possibly start a real job if it's more than 1. (Note that SparkContext#runJob(allowLocal=true) only runs the job locally if there's 1 partition selected and no parent stages.)
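A rough, driver-local sketch of the strategy described above, with plain iterators standing in for partitions (illustrative only, not the actual implementation):
```scala
// Read one partition first, then scan exponentially more partitions per round
// until `num` elements have been collected, instead of collecting everything.
def takeSketch[T](partitions: Seq[Iterator[T]], num: Int): Seq[T] = {
  val buf = scala.collection.mutable.ArrayBuffer.empty[T]
  var scanned = 0
  var perRound = 1
  while (buf.size < num && scanned < partitions.size) {
    partitions.slice(scanned, scanned + perRound).foreach { it =>
      buf ++= it.take(num - buf.size)
    }
    scanned += perRound
    perRound *= 2
  }
  buf.take(num).toSeq
}
```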

Author: Aaron Davidson <aaron@databricks.com>

Closes #922 from aarondav/take and squashes the following commits:

fa06df9 [Aaron Davidson] SPARK-1839: PySpark RDD#take() shouldn't always read from driver
2014-05-31 13:04:57 -07:00
Aaron Davidson 7d52777eff Super minor: Close inputStream in SparkSubmitArguments
`Properties#load()` doesn't close the InputStream, but it'd be closed after being GC'd anyway...

Also changed file.getName to file, because getName only shows the filename. This will show the full (possibly relative) path, which is less confusing if it's not found.
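A minimal sketch of the kind of change described, using a hypothetical helper rather than the exact SparkSubmitArguments code:
```scala
import java.io.FileInputStream
import java.util.Properties

// Properties#load does not close the stream it reads from, so close it explicitly
// instead of waiting for the InputStream to be garbage-collected.
def loadProperties(path: String): Properties = {
  val props = new Properties()
  val in = new FileInputStream(path)
  try props.load(in)
  finally in.close()
  props
}
```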

Author: Aaron Davidson <aaron@databricks.com>

Closes #914 from aarondav/tiny and squashes the following commits:

db9d072 [Aaron Davidson] Super minor: Close inputStream in SparkSubmitArguments
2014-05-31 12:36:58 -07:00
Michael Armbrust 1a0da0ec57 [SQL] SPARK-1964 Add timestamp to hive metastore type parser.
Author: Michael Armbrust <michael@databricks.com>

Closes #913 from marmbrus/timestampMetastore and squashes the following commits:

8e0154f [Michael Armbrust] Add timestamp to hive metastore type parser.
2014-05-31 12:34:22 -07:00
Michael Armbrust 7463cd248f Optionally include Hive as a dependency of the REPL.
Due to the way spark-shell launches from an assembly jar, I don't think this change will affect anyone who isn't trying to launch the shell directly from sbt.  That said, it is kinda nice to be able to launch all things directly from SBT when developing.

Author: Michael Armbrust <michael@databricks.com>

Closes #801 from marmbrus/hiveRepl and squashes the following commits:

9570571 [Michael Armbrust] Optionally include Hive as a dependency of the REPL.
2014-05-31 12:24:35 -07:00
Takuya UESHIN 3ce81494c5 [SPARK-1947] [SQL] Child of SumDistinct or Average should be widened to prevent overflows the same as Sum.
Child of `SumDistinct` or `Average` should be widened to prevent overflows the same as `Sum`.

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #902 from ueshin/issues/SPARK-1947 and squashes the following commits:

99c3dcb [Takuya UESHIN] Insert Cast for SumDistinct and Average.
2014-05-31 11:30:03 -07:00
Chen Chao 9ecc40d3ae correct tiny comment error
Author: Chen Chao <crazyjvm@gmail.com>

Closes #928 from CrazyJvm/patch-8 and squashes the following commits:

144328b [Chen Chao] correct tiny comment error
2014-05-31 00:06:49 -07:00
Cheng Lian cf989601d0 [SPARK-1959] String "NULL" shouldn't be interpreted as null value
JIRA issue: [SPARK-1959](https://issues.apache.org/jira/browse/SPARK-1959)

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #909 from liancheng/spark-1959 and squashes the following commits:

306659c [Cheng Lian] [SPARK-1959] String "NULL" shouldn't be interpreted as null value
2014-05-30 22:13:11 -07:00
CodingCat 41bfdda3cc SPARK-1976: fix the misleading part in streaming docs
Spark Streaming requires at least two working threads, but the document gives an example like

import org.apache.spark.api.java.function._
import org.apache.spark.streaming._
import org.apache.spark.streaming.api._
// Create a StreamingContext with a local master
val ssc = new StreamingContext("local", "NetworkWordCount", Seconds(1))
http://spark.apache.org/docs/latest/streaming-programming-guide.html
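For reference, an example along the lines of the fix: the local master needs at least two threads so the receiver and the processing tasks can both run (sketch assumes the 1.0-era StreamingContext constructor):
```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

// "local[2]" gives the driver two worker threads: one for the socket receiver,
// one to process the received batches. A bare "local" master would starve processing.
val ssc = new StreamingContext("local[2]", "NetworkWordCount", Seconds(1))
```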

Author: CodingCat <zhunansjtu@gmail.com>

Closes #924 from CodingCat/master and squashes the following commits:

bb89f20 [CodingCat] update streaming docs
2014-05-30 22:06:08 -07:00
nchammas 23ae36630a updated link to mailing list
Author: nchammas <nicholas.chammas@gmail.com>

Closes #923 from nchammas/patch-1 and squashes the following commits:

65c4d18 [nchammas] updated link to mailing list
2014-05-30 22:04:57 -07:00
Andrew Ash 9c1f204d80 Typo: and -> an
Author: Andrew Ash <andrew@andrewash.com>

Closes #927 from ash211/patch-5 and squashes the following commits:

79b577d [Andrew Ash] Typo: and -> an
2014-05-30 22:02:04 -07:00
Zhen Peng ff562b2396 [SPARK-1901] worker should make sure executor has exited before updating executor's info
https://issues.apache.org/jira/browse/SPARK-1901

Author: Zhen Peng <zhenpeng01@baidu.com>

Closes #854 from zhpengg/bugfix-worker-kills-executor and squashes the following commits:

21d380b [Zhen Peng] add some error messages
506cea6 [Zhen Peng] add some docs for killProcess()
a0b9860 [Zhen Peng] [SPARK-1901] worker should make sure executor has exited before updating executor's info
2014-05-30 10:12:51 -07:00
Prashant Sharma 79fa8fd4b1 [SPARK-1971] Update MIMA to compare against Spark 1.0.0
Author: Prashant Sharma <prashant.s@imaginea.com>

Closes #910 from ScrapCodes/enable-mima/spark-core and squashes the following commits:

79f3687 [Prashant Sharma] updated Mima to check against version 1.0
1e8969c [Prashant Sharma] Spark core missed out on Mima settings. So in effect we never tested spark core for mima related errors.
2014-05-30 01:13:51 -07:00
Matei Zaharia c8bf4131bc [SPARK-1566] consolidate programming guide, and general doc updates
This is a fairly large PR to clean up and update the docs for 1.0. The major changes are:

* A unified programming guide for all languages replaces language-specific ones and shows language-specific info in tabs
* New programming guide sections on key-value pairs, unit testing, input formats beyond text, migrating from 0.9, and passing functions to Spark
* Spark-submit guide moved to a separate page and expanded slightly
* Various cleanups of the menu system, security docs, and others
* Updated look of title bar to differentiate the docs from previous Spark versions

You can find the updated docs at http://people.apache.org/~matei/1.0-docs/_site/ and in particular http://people.apache.org/~matei/1.0-docs/_site/programming-guide.html.

Author: Matei Zaharia <matei@databricks.com>

Closes #896 from mateiz/1.0-docs and squashes the following commits:

03e6853 [Matei Zaharia] Some tweaks to configuration and YARN docs
0779508 [Matei Zaharia] tweak
ef671d4 [Matei Zaharia] Keep frames in JavaDoc links, and other small tweaks
1bf4112 [Matei Zaharia] Review comments
4414f88 [Matei Zaharia] tweaks
d04e979 [Matei Zaharia] Fix some old links to Java guide
a34ed33 [Matei Zaharia] tweak
541bb3b [Matei Zaharia] miscellaneous changes
fcefdec [Matei Zaharia] Moved submitting apps to separate doc
61d72b4 [Matei Zaharia] stuff
181f217 [Matei Zaharia] migration guide, remove old language guides
e11a0da [Matei Zaharia] Add more API functions
6a030a9 [Matei Zaharia] tweaks
8db0ae3 [Matei Zaharia] Added key-value pairs section
318d2c9 [Matei Zaharia] tweaks
1c81477 [Matei Zaharia] New section on basics and function syntax
e38f559 [Matei Zaharia] Actually added programming guide to Git
a33d6fe [Matei Zaharia] First pass at updating programming guide to support all languages, plus other tweaks throughout
3b6a876 [Matei Zaharia] More CSS tweaks
01ec8bf [Matei Zaharia] More CSS tweaks
e6d252e [Matei Zaharia] Change color of doc title bar to differentiate from 0.9.0
2014-05-30 00:34:33 -07:00
Prashant Sharma eeee978a34 [SPARK-1820] Make GenerateMimaIgnore @DeveloperApi annotation aware.
We add all the classes annotated as `DeveloperApi` to `~/.mima-excludes`.

Author: Prashant Sharma <prashant.s@imaginea.com>
Author: nikhil7sh <nikhilsharmalnmiit@gmail.ccom>

Closes #904 from ScrapCodes/SPARK-1820/ignore-Developer-Api and squashes the following commits:

de944f9 [Prashant Sharma] Code review.
e3c5215 [Prashant Sharma] Incorporated patrick's suggestions and fixed the scalastyle build.
9983a42 [nikhil7sh] [SPARK-1820] Make GenerateMimaIgnore @DeveloperApi annotation aware
2014-05-29 23:20:20 -07:00
Ankur Dave b7e28fa451 initial version of LPA
A straightforward implementation of the LPA algorithm for detecting graph communities using the Pregel framework.  Amongst the growing literature on community detection algorithms in networks, LPA is perhaps the most elementary, and despite its flaws it remains a nice and simple approach.

Author: Ankur Dave <ankurdave@gmail.com>
Author: haroldsultan <haroldsultan@gmail.com>
Author: Harold Sultan <haroldsultan@gmail.com>

Closes #905 from haroldsultan/master and squashes the following commits:

327aee0 [haroldsultan] Merge pull request #2 from ankurdave/label-propagation
227a4d0 [Ankur Dave] Untabify
0ac574c [haroldsultan] Merge pull request #1 from ankurdave/label-propagation
0e24303 [Ankur Dave] Add LabelPropagationSuite
84aa061 [Ankur Dave] LabelPropagation: Fix compile errors and style; rename from LPA
9830342 [Harold Sultan] initial version of LPA
2014-05-29 15:39:25 -07:00
Cheng Lian 8f7141fbc0 [SPARK-1368][SQL] Optimized HiveTableScan
JIRA issue: [SPARK-1368](https://issues.apache.org/jira/browse/SPARK-1368)

This PR introduces two major updates:

- Replaced FP style code with `while` loop and reusable `GenericMutableRow` object in critical path of `HiveTableScan`.
- Using `ColumnProjectionUtils` to help optimizing RCFile and ORC column pruning.

My quick micro benchmark suggests these two optimizations made the optimized version 2x and 2.5x faster when scanning the CSV table and the RCFile table respectively:

```
Original:

[info] CSV: 27676 ms, RCFile: 26415 ms
[info] CSV: 27703 ms, RCFile: 26029 ms
[info] CSV: 27511 ms, RCFile: 25962 ms

Optimized:

[info] CSV: 13820 ms, RCFile: 10402 ms
[info] CSV: 14158 ms, RCFile: 10691 ms
[info] CSV: 13606 ms, RCFile: 10346 ms
```

The micro benchmark loads a 609MB CSV file (structurally similar to the `src` test table) into a normal Hive table with `LazySimpleSerDe` and an RCFile table, then scans these tables respectively.

Preparation code:

```scala
package org.apache.spark.examples.sql.hive

import org.apache.spark.sql.hive.LocalHiveContext
import org.apache.spark.{SparkConf, SparkContext}

object HiveTableScanPrepare extends App {
  val sparkContext = new SparkContext(
    new SparkConf()
      .setMaster("local")
      .setAppName(getClass.getSimpleName.stripSuffix("$")))

  val hiveContext = new LocalHiveContext(sparkContext)

  import hiveContext._

  hql("drop table scan_csv")
  hql("drop table scan_rcfile")

  hql("""create table scan_csv (key int, value string)
        |  row format serde 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
        |  with serdeproperties ('field.delim'=',')
      """.stripMargin)

  hql(s"""load data local inpath "${args(0)}" into table scan_csv""")

  hql("""create table scan_rcfile (key int, value string)
        |  row format serde 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe'
        |stored as
        |  inputformat 'org.apache.hadoop.hive.ql.io.RCFileInputFormat'
        |  outputformat 'org.apache.hadoop.hive.ql.io.RCFileOutputFormat'
      """.stripMargin)

  hql(
    """
      |from scan_csv
      |insert overwrite table scan_rcfile
      |select scan_csv.key, scan_csv.value
    """.stripMargin)
}
```

Benchmark code:

```scala
package org.apache.spark.examples.sql.hive

import org.apache.spark.sql.hive.LocalHiveContext
import org.apache.spark.{SparkConf, SparkContext}

object HiveTableScanBenchmark extends App {
  val sparkContext = new SparkContext(
    new SparkConf()
      .setMaster("local")
      .setAppName(getClass.getSimpleName.stripSuffix("$")))

  val hiveContext = new LocalHiveContext(sparkContext)

  import hiveContext._

  val scanCsv = hql("select key from scan_csv")
  val scanRcfile = hql("select key from scan_rcfile")

  val csvDuration = benchmark(scanCsv.count())
  val rcfileDuration = benchmark(scanRcfile.count())

  println(s"CSV: $csvDuration ms, RCFile: $rcfileDuration ms")

  def benchmark(f: => Unit) = {
    val begin = System.currentTimeMillis()
    f
    val end = System.currentTimeMillis()
    end - begin
  }
}
```

@marmbrus Please help review, thanks!

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #758 from liancheng/fastHiveTableScan and squashes the following commits:

4241a19 [Cheng Lian] Distinguishes sorted and possibly not sorted operations more accurately in HiveComparisonTest
cf640d8 [Cheng Lian] More HiveTableScan optimisations:
bf0e7dc [Cheng Lian] Added SortedOperation pattern to match *some* definitely sorted operations and avoid some sorting cost in HiveComparisonTest.
6d1c642 [Cheng Lian] Using ColumnProjectionUtils to optimise RCFile and ORC column pruning
eb62fd3 [Cheng Lian] [SPARK-1368] Optimized HiveTableScan
2014-05-29 15:24:03 -07:00
Yin Huai 60b89fe6b0 SPARK-1935: Explicitly add commons-codec 1.5 as a dependency.
Author: Yin Huai <huai@cse.ohio-state.edu>

Closes #889 from yhuai/SPARK-1935 and squashes the following commits:

7d50ef1 [Yin Huai] Explicitly add commons-codec 1.5 as a dependency.
2014-05-29 09:07:39 -07:00
Jyotiska NK 9cff1dd25a Added doctest and method description in context.py
Added doctest for method textFile and description for methods _initialize_context and _ensure_initialized in context.py

Author: Jyotiska NK <jyotiska123@gmail.com>

Closes #187 from jyotiska/pyspark_context and squashes the following commits:

356f945 [Jyotiska NK] Added doctest for textFile method in context.py
5b23686 [Jyotiska NK] Updated context.py with method descriptions
2014-05-28 23:08:39 -07:00
witgo 4dbb27b0cf [SPARK-1712]: TaskDescription instance is too big causes Spark to hang
Author: witgo <witgo@qq.com>

Closes #694 from witgo/SPARK-1712_new and squashes the following commits:

0f52483 [witgo] review commit
83ce29b [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1712_new
52e6752 [witgo] reset test SparkContext
63636b6 [witgo] review commit
44a59ee [witgo] review commit
3b6d48c [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1712_new
926bd6a [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1712_new
9a5cfad [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1712_new
03cc562 [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1712_new
b0930b0 [witgo] review commit
b1174bd [witgo] merge master
f76679b [witgo] merge master
689495d [witgo] fix scala style bug
1d35c3c [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1712_new
062c182 [witgo] fix small bug for code style
0a428cf [witgo] add unit tests
158b2dc [witgo] review commit
4afe71d [witgo] review commit
9e4ffa7 [witgo] review commit
1d35c7d [witgo] fix hang
7965580 [witgo] fix Statement order
0e29eac [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1712_new
3ea1ca1 [witgo] remove duplicate serialize
743a7ad [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1712_new
86e2048 [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1712_new
2a89adc [witgo] SPARK-1712: TaskDescription instance is too big causes Spark to hang
2014-05-28 15:57:05 -07:00
David Lemieux 4312cf0bad Spark 1916
The changes could be ported back to 0.9 as well.
Changed in.read to in.readFully so that the whole input stream is read rather than just the first 1020 bytes.
This should be OK considering that Flume caps the body size to 32K by default.
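A minimal illustration of the read vs. readFully difference, with an assumed stream and length rather than the actual SparkFlumeEvent code:
```scala
import java.io.DataInputStream

// read(buf) may return after filling only part of the buffer; readFully(buf) keeps
// reading until the whole buffer is filled (or throws EOFException), so the event
// body is never silently truncated.
def readBody(in: DataInputStream, length: Int): Array[Byte] = {
  val buf = new Array[Byte](length)
  in.readFully(buf)
  buf
}
```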

Author: David Lemieux <david.lemieux@radialpoint.com>

Closes #865 from lemieud/SPARK-1916 and squashes the following commits:

a265673 [David Lemieux] Updated SparkFlumeEvent to read the whole stream rather than the first X bytes.
(cherry picked from commit 0b769b73fb)

Signed-off-by: Patrick Wendell <pwendell@gmail.com>
2014-05-28 15:51:32 -07:00
Patrick Wendell 7801d44fd3 Organize configuration docs
This PR improves and organizes the config option page
and makes a few other changes to config docs. See a preview here:
http://people.apache.org/~pwendell/config-improvements/configuration.html

The biggest changes are:
1. The configs for the standalone master/workers were moved to the
standalone page and out of the general config doc.
2. SPARK_LOCAL_DIRS was missing from the standalone docs.
3. Expanded discussion of injecting configs with spark-submit, including an
example.
4. Config options were organized into the following categories:
- Runtime Environment
- Shuffle Behavior
- Spark UI
- Compression and Serialization
- Execution Behavior
- Networking
- Scheduling
- Security
- Spark Streaming

Author: Patrick Wendell <pwendell@gmail.com>

Closes #880 from pwendell/config-cleanup and squashes the following commits:

93f56c3 [Patrick Wendell] Feedback from Matei
6f66efc [Patrick Wendell] More feedback
16ae776 [Patrick Wendell] Adding back header section
d9c264f [Patrick Wendell] Small fix
e0c1728 [Patrick Wendell] Response to Matei's review
27d57db [Patrick Wendell] Reverting changes to index.html (covered in #896)
e230ef9 [Patrick Wendell] Merge remote-tracking branch 'apache/master' into config-cleanup
a374369 [Patrick Wendell] Line wrapping fixes
fdff7fc [Patrick Wendell] Merge remote-tracking branch 'apache/master' into config-cleanup
3289ea4 [Patrick Wendell] Pulling in changes from #856
106ee31 [Patrick Wendell] Small link fix
f7e79bc [Patrick Wendell] Re-organizing config options.
54b184d [Patrick Wendell] Adding standalone configs to the standalone page
592e94a [Patrick Wendell] Stash
29b5446 [Patrick Wendell] Better discussion of spark-submit in configuration docs
2d719ef [Patrick Wendell] Small fix
4af9e07 [Patrick Wendell] Adding SPARK_LOCAL_DIRS docs
204b248 [Patrick Wendell] Small fixes
2014-05-28 15:49:54 -07:00
jmu 82eadc3b07 Fix doc about NetworkWordCount/JavaNetworkWordCount usage of spark streaming
Usage: NetworkWordCount <master> <hostname> <port>
-->
Usage: NetworkWordCount <hostname> <port>

Usage: JavaNetworkWordCount <master> <hostname> <port>
-->
Usage: JavaNetworkWordCount <hostname> <port>

Author: jmu <jmujmu@gmail.com>

Closes #826 from jmu/master and squashes the following commits:

9fb7980 [jmu] Merge branch 'master' of https://github.com/jmu/spark
b9a6b02 [jmu] Fix doc for NetworkWordCount/JavaNetworkWordCount Usage: NetworkWordCount <master> <hostname> <port> --> Usage: NetworkWordCount <hostname> <port>
2014-05-27 22:41:47 -07:00
Takuya UESHIN 9df86835b6 [SPARK-1938] [SQL] ApproxCountDistinctMergeFunction should return Int value.
`ApproxCountDistinctMergeFunction` should return an `Int` value because the `dataType` of `ApproxCountDistinct` is `IntegerType`.

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #893 from ueshin/issues/SPARK-1938 and squashes the following commits:

3970e88 [Takuya UESHIN] Remove a superfluous line.
5ad7ec1 [Takuya UESHIN] Make dataType for each of CountDistinct, ApproxCountDistinctMerge and ApproxCountDistinct LongType.
cbe7c71 [Takuya UESHIN] Revert a change.
fc3ac0f [Takuya UESHIN] Fix evaluated value type of ApproxCountDistinctMergeFunction to Int.
2014-05-27 22:17:50 -07:00
LY Lai 0682567450 [SQL] SPARK-1922
Allow underscore in the column name of a struct field: https://issues.apache.org/jira/browse/SPARK-1922.

Author: LY Lai <ly.lai@vpon.com>

Closes #873 from lyuanlai/master and squashes the following commits:

2253263 [LY Lai] Allow underscore in struct field column name
2014-05-27 16:08:38 -07:00
Takuya UESHIN 3b0babad1f [SPARK-1915] [SQL] AverageFunction should not count if the evaluated value is null.
Average values differ depending on whether the calculation is done partially or not,
because `AverageFunction` (in the non-partial calculation) counts a row even if the evaluated value is null.

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #862 from ueshin/issues/SPARK-1915 and squashes the following commits:

b1ff3c0 [Takuya UESHIN] Modify AverageFunction not to count if the evaluated value is null.
2014-05-27 14:55:23 -07:00
Takuya UESHIN d1375a2bff [SPARK-1926] [SQL] Nullability of Max/Min/First should be true.
Nullability of `Max`/`Min`/`First` should be `true` because they return `null` if there are no rows.

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #881 from ueshin/issues/SPARK-1926 and squashes the following commits:

322610f [Takuya UESHIN] Fix nullability of Min/Max/First.
2014-05-27 14:53:57 -07:00
lianhuiwang 95e4c9c6fb bugfix worker DriverStateChanged state should match DriverState.FAILED
bugfix worker DriverStateChanged state should match DriverState.FAILED

Author: lianhuiwang <lianhuiwang09@gmail.com>

Closes #864 from lianhuiwang/master and squashes the following commits:

480ce94 [lianhuiwang] address aarondav comments
f2b5970 [lianhuiwang] bugfix worker DriverStateChanged state should match DriverState.FAILED
2014-05-27 11:53:38 -07:00
zsxwing 549830b0db SPARK-1932: Fix race conditions in onReceiveCallback and cachedPeers
`var cachedPeers: Seq[BlockManagerId] = null` is used in `def replicate(blockId: BlockId, data: ByteBuffer, level: StorageLevel)` without proper protection.

There are two places that call `replicate(blockId, bytesAfterPut, level)`:
* 17f3075bc4/core/src/main/scala/org/apache/spark/storage/BlockManager.scala (L644) runs in `connectionManager.futureExecContext`
* 17f3075bc4/core/src/main/scala/org/apache/spark/storage/BlockManager.scala (L752) `doPut` runs in `connectionManager.handleMessageExecutor`. `org.apache.spark.storage.BlockManagerWorker` calls `blockManager.putBytes` in `connectionManager.handleMessageExecutor`.

As they run in different `Executor`s, this is a race condition that may cause the memory pointed to by `cachedPeers` to be incorrect even if `cachedPeers != null`.

The race condition of `onReceiveCallback` is that it's set in `BlockManagerWorker` but read in a different thread in `ConnectionManager.handleMessageExecutor`.
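Not the actual fix, but an illustrative sketch of one way to make such a lazily cached field safe to access from different thread pools:
```scala
// Guard the lazily computed peer list with a lock so concurrent callers from
// different executors/thread pools never observe a half-published value.
class PeerCache(fetchPeers: () => Seq[String]) {
  private var cachedPeers: Seq[String] = null
  def peers: Seq[String] = synchronized {
    if (cachedPeers == null) cachedPeers = fetchPeers()
    cachedPeers
  }
}
```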

Author: zsxwing <zsxwing@gmail.com>

Closes #887 from zsxwing/SPARK-1932 and squashes the following commits:

524f69c [zsxwing] SPARK-1932: Fix race conditions in onReceiveCallback and cachedPeers
2014-05-26 23:17:39 -07:00
Reynold Xin 90e281b55a SPARK-1933: Throw a more meaningful exception when a directory is passed to addJar/addFile.
https://issues.apache.org/jira/browse/SPARK-1933

Author: Reynold Xin <rxin@apache.org>

Closes #888 from rxin/addfile and squashes the following commits:

8c402a3 [Reynold Xin] Updated comment.
ff6c162 [Reynold Xin] SPARK-1933: Throw a more meaningful exception when a directory is passed to addJar/addFile.
2014-05-26 22:05:23 -07:00
Reynold Xin 9ed37190f4 Updated dev Python scripts to make them PEP8 compliant.
Author: Reynold Xin <rxin@apache.org>

Closes #875 from rxin/pep8-dev-scripts and squashes the following commits:

04b084f [Reynold Xin] Made dev Python scripts PEP8 compliant.
2014-05-26 21:40:52 -07:00
Reynold Xin ef690e1f69 Fixed the error message for OutOfMemoryError in DAGScheduler. 2014-05-26 21:31:27 -07:00
Zhen Peng 8d271c90fa SPARK-1929 DAGScheduler suspended by local task OOM
DAGScheduler does not handle local task OOM properly, and will wait for the job result forever.

Author: Zhen Peng <zhenpeng01@baidu.com>

Closes #883 from zhpengg/bugfix-dag-scheduler-oom and squashes the following commits:

76f7eda [Zhen Peng] remove redundant memory allocations
aa63161 [Zhen Peng] SPARK-1929 DAGScheduler suspended by local task OOM
2014-05-26 21:30:25 -07:00
Ankur Dave 56c771cb2d [SPARK-1931] Reconstruct routing tables in Graph.partitionBy
905173df57 introduced a bug in partitionBy where, after repartitioning the edges, it reuses the VertexRDD without updating the routing tables to reflect the new edge layout. Subsequent accesses of the triplets contain nulls for many vertex properties.

This commit adds a test for this bug and fixes it by introducing `VertexRDD#withEdges` and calling it in `partitionBy`.

Author: Ankur Dave <ankurdave@gmail.com>

Closes #885 from ankurdave/SPARK-1931 and squashes the following commits:

3930cdd [Ankur Dave] Note how to set up VertexRDD for efficient joins
9bdbaa4 [Ankur Dave] [SPARK-1931] Reconstruct routing tables in Graph.partitionBy
2014-05-26 16:10:22 -07:00
zsxwing cb7fe50348 SPARK-1925: Replace '&' with '&&'
JIRA: https://issues.apache.org/jira/browse/SPARK-1925
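For context, a small example of why the operator choice matters in Scala (illustrative, not the patched code):
```scala
// '&&' short-circuits: the right-hand side is skipped when the left is false.
// With '&', both sides are always evaluated, so xs.nonEmpty would still run
// (and throw a NullPointerException) when xs is null.
def firstOrNone(xs: Seq[Int]): Option[Int] =
  if (xs != null && xs.nonEmpty) Some(xs.head) else None
```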

Author: zsxwing <zsxwing@gmail.com>

Closes #879 from zsxwing/SPARK-1925 and squashes the following commits:

5cf5a6d [zsxwing] SPARK-1925: Replace '&' with '&&'
2014-05-26 14:34:58 -07:00