Commit graph

8828 commits

Author SHA1 Message Date
Mingfei 165cec9c46 [SPARK-4432] Close InStream after the block is accessed
The InStream is not closed after data is read from Tachyon, which leaves blocks in Tachyon locked after they are accessed.

Author: Mingfei <mingfei.shi@intel.com>

Closes #3290 from shimingfei/lockFix and squashes the following commits:

fffe345 [Mingfei] close InStream after the block is accessed
2014-11-18 22:17:06 -08:00
Mingfei 67e9876b3e [SPARK-4441] Close Tachyon client when TachyonBlockManager is shutdown
Currently the Tachyon client is not closed when TachyonBlockManager is shut down, which causes some resources in Tachyon to go unreclaimed.

Author: Mingfei <mingfei.shi@intel.com>

Closes #3299 from shimingfei/closeClient and squashes the following commits:

0913fbd [Mingfei] close Tachyon client when TachyonBlockManager is shutdown
2014-11-18 22:16:36 -08:00
Marcelo Vanzin 397d3aae5b Bumping version to 1.3.0-SNAPSHOT.
Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #3277 from vanzin/version-1.3 and squashes the following commits:

7c3c396 [Marcelo Vanzin] Added temp repo to sbt build.
5f404ff [Marcelo Vanzin] Add another exclusion.
19457e7 [Marcelo Vanzin] Update old version to 1.2, add temporary 1.2 repo.
3c8d705 [Marcelo Vanzin] Workaround for MIMA checks.
e940810 [Marcelo Vanzin] Bumping version to 1.3.0-SNAPSHOT.
2014-11-18 21:24:18 -08:00
Cheng Lian 423baea953 [SPARK-4468][SQL] Fixes Parquet filter creation for inequality predicates with literals on the left hand side
For expressions like `10 < someVar`, we should create an `Operators.Gt` filter, but right now an `Operators.Lt` is created. This issue affects all inequality predicates with literals on the left hand side.

(This bug existed before #3317 and affects branch-1.1. #3338 was opened to backport this to branch-1.1.)

[Review on Reviewable](https://reviewable.io/reviews/apache/spark/3334)

Author: Cheng Lian <lian@databricks.com>

Closes #3334 from liancheng/fix-parquet-comp-filter and squashes the following commits:

0130897 [Cheng Lian] Fixes Parquet comparison filter generation
2014-11-18 17:41:54 -08:00
Davies Liu 7f22fa81eb [SPARK-4327] [PySpark] Python API for RDD.randomSplit()
```
pyspark.RDD.randomSplit(self, weights, seed=None)
    Randomly splits this RDD with the provided weights.

    :param weights: weights for splits, will be normalized if they don't sum to 1
    :param seed: random seed
    :return: split RDDs in a list

    >>> rdd = sc.parallelize(range(10), 1)
    >>> rdd1, rdd2, rdd3 = rdd.randomSplit([0.4, 0.6, 1.0], 11)
    >>> rdd1.collect()
    [3, 6]
    >>> rdd2.collect()
    [0, 5, 7]
    >>> rdd3.collect()
    [1, 2, 4, 8, 9]
```

Author: Davies Liu <davies@databricks.com>

Closes #3193 from davies/randomSplit and squashes the following commits:

78bf997 [Davies Liu] fix tests, do not use numpy in randomSplit, no performance gain
f5fdf63 [Davies Liu] fix bug with int in weights
4dfa2cd [Davies Liu] refactor
f866bcf [Davies Liu] remove unneeded change
c7a2007 [Davies Liu] switch to python implementation
95a48ac [Davies Liu] Merge branch 'master' of github.com:apache/spark into randomSplit
0d9b256 [Davies Liu] refactor
1715ee3 [Davies Liu] address comments
41fce54 [Davies Liu] randomSplit()
2014-11-18 16:37:35 -08:00
Xiangrui Meng bb46046154 [SPARK-4433] fix a race condition in zipWithIndex
Spark hangs with the following code:

~~~
sc.parallelize(1 to 10).zipWithIndex.repartition(10).count()
~~~

This is because ZippedWithIndexRDD triggers a job in getPartitions and it causes a deadlock in DAGScheduler.getPreferredLocs (synced). The fix is to compute `startIndices` during construction.
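
A minimal sketch of that eager computation (illustrative only, not the actual ZippedWithIndexRDD code; `computeStartIndices` is a made-up helper):

```scala
import org.apache.spark.rdd.RDD

// Illustrative helper: compute each partition's starting offset eagerly, at
// construction time, so getPartitions never launches a job from inside
// DAGScheduler.getPreferredLocs.
def computeStartIndices[T](prev: RDD[T]): Array[Long] = {
  val counts = prev
    .mapPartitions(iter => Iterator(iter.size.toLong), preservesPartitioning = true)
    .collect()
  counts.scanLeft(0L)(_ + _).dropRight(1) // cumulative offset of each partition
}
```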

This should be applied to branch-1.0, branch-1.1, and branch-1.2.

pwendell

Author: Xiangrui Meng <meng@databricks.com>

Closes #3291 from mengxr/SPARK-4433 and squashes the following commits:

c284d9f [Xiangrui Meng] fix a racing condition in zipWithIndex
2014-11-18 16:25:44 -08:00
Davies Liu 4a377aff2d [SPARK-3721] [PySpark] broadcast objects larger than 2G
This patch will bring support for broadcasting objects larger than 2G.

pickle, zlib, FrameSerializer, and Array[Byte] all cannot support objects larger than 2G, so this patch introduces LargeObjectSerializer to serialize broadcast objects. The object is serialized and compressed into small chunks, and the type of Broadcast[Array[Byte]] is changed into Broadcast[Array[Array[Byte]]].
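
A rough sketch of the chunking idea (illustrative; the `chunk` helper and the 64 MB chunk size are assumptions, not Spark's actual API), assuming the payload has already been serialized to an InputStream:

```scala
import java.io.InputStream
import java.util.Arrays
import scala.collection.mutable.ArrayBuffer

// Illustrative only: split a large serialized payload into sub-2G chunks so it
// can be shipped as Broadcast[Array[Array[Byte]]] instead of a single
// Broadcast[Array[Byte]].
def chunk(payload: InputStream, chunkSize: Int = 64 * 1024 * 1024): Array[Array[Byte]] = {
  val chunks = ArrayBuffer.empty[Array[Byte]]
  val buffer = new Array[Byte](chunkSize)
  var read = payload.read(buffer)
  while (read > 0) {
    chunks += Arrays.copyOf(buffer, read) // copy only the bytes actually read
    read = payload.read(buffer)
  }
  chunks.toArray
}
```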

Testing broadcast of objects larger than 2G is slow and memory-hungry, so this was tested manually; it could be added to SparkPerf.

Author: Davies Liu <davies@databricks.com>
Author: Davies Liu <davies.liu@gmail.com>

Closes #2659 from davies/huge and squashes the following commits:

7b57a14 [Davies Liu] add more tests for broadcast
28acff9 [Davies Liu] Merge branch 'master' of github.com:apache/spark into huge
a2f6a02 [Davies Liu] bug fix
4820613 [Davies Liu] Merge branch 'master' of github.com:apache/spark into huge
5875c73 [Davies Liu] address comments
10a349b [Davies Liu] address comments
0c33016 [Davies Liu] Merge branch 'master' of github.com:apache/spark into huge
6182c8f [Davies Liu] Merge branch 'master' into huge
d94b68f [Davies Liu] Merge branch 'master' of github.com:apache/spark into huge
2514848 [Davies Liu] address comments
fda395b [Davies Liu] Merge branch 'master' of github.com:apache/spark into huge
1c2d928 [Davies Liu] fix scala style
091b107 [Davies Liu] broadcast objects larger than 2G
2014-11-18 16:17:51 -08:00
Davies Liu d2e29516f2 [SPARK-4306] [MLlib] Python API for LogisticRegressionWithLBFGS
```
class LogisticRegressionWithLBFGS
 |  train(cls, data, iterations=100, initialWeights=None, corrections=10, tolerance=0.0001, regParam=0.01, intercept=False)
 |      Train a logistic regression model on the given data.
 |
 |      :param data:           The training data, an RDD of LabeledPoint.
 |      :param iterations:     The number of iterations (default: 100).
 |      :param initialWeights: The initial weights (default: None).
 |      :param regParam:       The regularizer parameter (default: 0.01).
 |      :param regType:        The type of regularizer used for training
 |                             our model.
 |                             :Allowed values:
 |                               - "l1" for using L1 regularization
 |                               - "l2" for using L2 regularization
 |                               - None for no regularization
 |                               (default: "l2")
 |      :param intercept:      Boolean parameter which indicates the use
 |                             or not of the augmented representation for
 |                             training data (i.e. whether bias features
 |                             are activated or not).
 |      :param corrections:    The number of corrections used in the LBFGS update (default: 10).
 |      :param tolerance:      The convergence tolerance of iterations for L-BFGS (default: 1e-4).
 |
 |      >>> data = [
 |      ...     LabeledPoint(0.0, [0.0, 1.0]),
 |      ...     LabeledPoint(1.0, [1.0, 0.0]),
 |      ... ]
 |      >>> lrm = LogisticRegressionWithLBFGS.train(sc.parallelize(data))
 |      >>> lrm.predict([1.0, 0.0])
 |      1
 |      >>> lrm.predict([0.0, 1.0])
 |      0
 |      >>> lrm.predict(sc.parallelize([[1.0, 0.0], [0.0, 1.0]])).collect()
 |      [1, 0]
```

Author: Davies Liu <davies@databricks.com>

Closes #3307 from davies/lbfgs and squashes the following commits:

34bd986 [Davies Liu] Merge branch 'master' of http://git-wip-us.apache.org/repos/asf/spark into lbfgs
5a945a6 [Davies Liu] address comments
941061b [Davies Liu] Merge branch 'master' of github.com:apache/spark into lbfgs
03e5543 [Davies Liu] add it to docs
ed2f9a8 [Davies Liu] add regType
76cd1b6 [Davies Liu] reorder arguments
4429a74 [Davies Liu] Update classification.py
9252783 [Davies Liu] python api for LogisticRegressionWithLBFGS
2014-11-18 15:57:33 -08:00
Kay Ousterhout 010bc86e40 [SPARK-4463] Add (de)select all button for add'l metrics.
This commit removes the behavior where when a user clicks
"Show additional metrics" on the stage page, all of the additional
metrics are automatically selected; now, collapsing and expanding
the additional metrics has no effect on which options are selected.
Instead, there's a "(De)select All" box at the top; checking this box
checks all additional metrics (and similarly, unchecking it unchecks
all additional metrics).

This commit is intended to be backported to 1.2, so that the additional
metrics behavior is not confusing to users.

Now when a user clicks the "Show additional metrics" menu, this is what
it looks like:
![image](https://cloud.githubusercontent.com/assets/1108612/5094347/1541ead6-6f15-11e4-8e8c-25a65ddbdfb2.png)

Author: Kay Ousterhout <kayousterhout@gmail.com>

Closes #3331 from kayousterhout/SPARK-4463 and squashes the following commits:

9e17cea [Kay Ousterhout] Added italics
b731230 [Kay Ousterhout] [SPARK-4463] Add (de)select all button for add'l metrics.
2014-11-18 15:01:06 -08:00
Davies Liu e34f38ff1a [SPARK-4017] show progress bar in console
The progress bar will look like this:

![1___spark_job__85_250_finished__4_are_running___java_](https://cloud.githubusercontent.com/assets/40902/4854813/a02f44ac-6099-11e4-9060-7c73a73151d6.png)

In the right corner, the numbers are: finished tasks, running tasks, total tasks.

After the stage has finished, it will disappear.

The progress bar is only shown if the logging level is WARN or higher (but the progress in the title is still shown); it can be turned off via spark.driver.showConsoleProgress.
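
A hedged usage sketch for turning the bar off explicitly (the property name is the one mentioned above; the app name is illustrative):

```scala
import org.apache.spark.SparkConf

// Disable the console progress bar via the property mentioned above.
val conf = new SparkConf()
  .setAppName("progress-bar-example") // illustrative app name
  .set("spark.driver.showConsoleProgress", "false")
```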

Author: Davies Liu <davies@databricks.com>

Closes #3029 from davies/progress and squashes the following commits:

95336d5 [Davies Liu] Merge branch 'master' of github.com:apache/spark into progress
fc49ac8 [Davies Liu] address commentse
2e90f75 [Davies Liu] show multiple stages in same time
0081bcc [Davies Liu] address comments
38c42f1 [Davies Liu] fix tests
ab87958 [Davies Liu] disable progress bar during tests
30ac852 [Davies Liu] re-implement progress bar
b3f34e5 [Davies Liu] Merge branch 'master' of github.com:apache/spark into progress
6fd30ff [Davies Liu] show progress bar if no task finished in 500ms
e4e7344 [Davies Liu] refactor
e1f524d [Davies Liu] revert unnecessary change
a60477c [Davies Liu] Merge branch 'master' of github.com:apache/spark into progress
5cae3f2 [Davies Liu] fix style
ea49fe0 [Davies Liu] address comments
bc53d99 [Davies Liu] refactor
e6bb189 [Davies Liu] fix logging in sparkshell
7e7d4e7 [Davies Liu] address commments
5df26bb [Davies Liu] fix style
9e42208 [Davies Liu] show progress bar in console and title
2014-11-18 13:37:21 -08:00
Davies Liu 80f3177882 [SPARK-4404] remove sys.exit() in shutdown hook
If SparkSubmit dies first, the bootstrapper will be blocked by the shutdown hook. Calling sys.exit() in a shutdown hook causes a kind of deadlock.
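
A standalone illustration of that deadlock (not Spark code): System.exit waits for shutdown hooks to complete, so a hook that itself calls sys.exit() blocks on the shutdown lock held by the exiting thread and never returns.

```scala
// Illustrative only: this program never terminates, because the hook's
// sys.exit() re-enters the JVM shutdown sequence, which is still waiting for
// this very hook to finish.
object ShutdownHookDeadlock {
  def main(args: Array[String]): Unit = {
    Runtime.getRuntime.addShutdownHook(new Thread {
      override def run(): Unit = sys.exit(1) // blocks forever
    })
    println("main finished; the JVM will now hang in the shutdown hook")
  }
}
```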

cc andrewor14

Author: Davies Liu <davies@databricks.com>

Closes #3289 from davies/fix_bootstraper and squashes the following commits:

ea5cdd1 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_bootstraper
e04b690 [Davies Liu] remove sys.exit in hook
4d11366 [Davies Liu] remove shutdown hook if subprocess die fist
2014-11-18 13:11:38 -08:00
Kousuke Saruta bfebfd8b28 [SPARK-4075][SPARK-4434] Fix the URI validation logic for Application Jar name.
This PR adds a regression test for SPARK-4434.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #3326 from sarutak/add-triple-slash-testcase and squashes the following commits:

82bc9cc [Kousuke Saruta] Fixed wrong grammar in comment
9149027 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into add-triple-slash-testcase
c1c80ca [Kousuke Saruta] Fixed style
4f30210 [Kousuke Saruta] Modified comments
9e09da2 [Kousuke Saruta] Fixed URI validation for jar file
d4b99ef [Kousuke Saruta] [SPARK-4075] [Deploy] Jar url validation is not enough for Jar file
ac79906 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into add-triple-slash-testcase
6d4f47e [Kousuke Saruta] Added a test case as a regression check for SPARK-4434
2014-11-18 12:17:33 -08:00
Michael Armbrust 90d72ec850 [SQL] Support partitioned parquet tables that have the key in both the directory and the file
Author: Michael Armbrust <michael@databricks.com>

Closes #3272 from marmbrus/keyInPartitionedTable and squashes the following commits:

447f08c [Michael Armbrust] Support partitioned parquet tables that have the key in both the directory and the file
2014-11-18 12:13:23 -08:00
Xiangrui Meng b54c6ab3c5 [SPARK-4396] allow lookup by index in Python's Rating
In PySpark, ALS can take an RDD of (user, product, rating) tuples as input. However, model.predict outputs an RDD of Rating. So on the input side, users can use r[0], r[1], r[2], while on the output side, users have to use r.user, r.product, r.rating. We should allow lookup by index in Rating by making Rating a namedtuple.

davies

[Review on Reviewable](https://reviewable.io/reviews/apache/spark/3261)

Author: Xiangrui Meng <meng@databricks.com>

Closes #3261 from mengxr/SPARK-4396 and squashes the following commits:

543aef0 [Xiangrui Meng] use named tuple to implement ALS
0b61bae [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-4396
d3bd7d4 [Xiangrui Meng] allow lookup by index in Python's Rating
2014-11-18 10:35:29 -08:00
Davies Liu 8fbf72b790 [SPARK-4435] [MLlib] [PySpark] improve classification
This PR adds setThreshold() and clearThreshold() to LogisticRegressionModel and SVMModel, and also supports RDDs of vectors in LogisticRegressionModel.predict(), SVMModel.predict() and NaiveBayes.predict().

Author: Davies Liu <davies@databricks.com>

Closes #3305 from davies/setThreshold and squashes the following commits:

d0b835f [Davies Liu] Merge branch 'master' of github.com:apache/spark into setThreshold
e4acd76 [Davies Liu] address comments
2231a5f [Davies Liu] bugfix
7bd9009 [Davies Liu] address comments
0b0a8a7 [Davies Liu] address comments
c1e5573 [Davies Liu] improve classification
2014-11-18 10:11:13 -08:00
Felix Maximilian Möller cedc3b5aa4 ALS implicit: added missing parameter alpha in doc string
Author: Felix Maximilian Möller <felixmaximilian.moeller@immobilienscout24.de>

Closes #3343 from felixmaximilian/fix-documentation and squashes the following commits:

43dcdfb [Felix Maximilian Möller] Removed the information about the switch implicitPrefs. The parameter implicitPrefs cannot be set in this context because it is inherent true when calling the trainImplicit method.
7d172ba [Felix Maximilian Möller] added missing parameter alpha in doc string.
2014-11-18 10:08:24 -08:00
Patrick Wendell c6e0c2ab1c SPARK-4466: Provide support for publishing Scala 2.11 artifacts to Maven
The maven release plug-in does not have support for publishing two separate sets of artifacts for a single release. Because of the way that Scala 2.11 support in Spark works, we have to write some customized code to do this. The good news is that the Maven release API is just a thin wrapper on doing git commits and pushing artifacts to the HTTP API of Apache's Sonatype server and this might overall make our deployment easier to understand.

This was already used for the 1.2 snapshot, so I think it is working well. One other nice thing is this could be pretty easily extended to publish nightly snapshots.

Author: Patrick Wendell <pwendell@gmail.com>

Closes #3332 from pwendell/releases and squashes the following commits:

2fedaed [Patrick Wendell] Automate the opening and closing of Sonatype repos
e2a24bb [Patrick Wendell] Fixing issue where we overrode non-spark version numbers
9df3a50 [Patrick Wendell] Adding TODO
1cc1749 [Patrick Wendell] Don't build the thriftserver for 2.11
933201a [Patrick Wendell] Make tagging of release commit eager
d0388a6 [Patrick Wendell] Support Scala 2.11 build
4f4dc62 [Patrick Wendell] Change to 2.11 should not be included when committing new patch
bf742e1 [Patrick Wendell] Minor fixes
ffa1df2 [Patrick Wendell] Adding a Scala 2.11 package to test it
9ac4381 [Patrick Wendell] Addressing TODO
b3105ff [Patrick Wendell] Removing commented out code
d906803 [Patrick Wendell] Small fix
3f4d985 [Patrick Wendell] More work
fcd54c2 [Patrick Wendell] Consolidating use of keys
df2af30 [Patrick Wendell] Changes to release stuff
2014-11-17 21:07:50 -08:00
Cheng Lian 36b0956a3e [SPARK-4453][SPARK-4213][SQL] Simplifies Parquet filter generation code
While reviewing PR #3083 and #3161, I noticed that Parquet record filter generation code can be simplified significantly according to the clue stated in [SPARK-4453](https://issues.apache.org/jira/browse/SPARK-4213). This PR addresses both SPARK-4453 and SPARK-4213 with this simplification.

While generating the `ParquetTableScan` operator, we need to remove all Catalyst predicates that have already been pushed down to Parquet. Originally, we first generate the record filter, and then call `findExpression` to traverse the generated filter to find out all pushed-down predicates [[1](64c6b9bad5/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala (L213-L228))]. In this way, we have to introduce the `CatalystFilter` class hierarchy to bind the Catalyst predicates together with their generated Parquet filters, which complicates the code base a lot.

The basic idea of this PR is that we don't need `findExpression` after filter generation, because we already know a predicate can be pushed down if we can successfully generate its corresponding Parquet filter. SPARK-4213 is fixed by returning `None` for any unsupported predicate type.
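
A toy, self-contained sketch of that idea (the types below are stand-ins, not Spark's Catalyst or Parquet classes): a predicate counts as pushed down exactly when generation returns `Some`, and unsupported predicates simply yield `None`.

```scala
// Toy predicate language standing in for Catalyst expressions.
sealed trait Pred
case class LessThan(column: String, value: Int) extends Pred
case class And(left: Pred, right: Pred) extends Pred
case class Unsupported(description: String) extends Pred

// Generation returns an Option: Some(filter) means "pushed down",
// None means "keep this predicate in the Spark-side filter".
def createFilter(p: Pred): Option[String] = p match {
  case LessThan(c, v) => Some(s"lt($c, $v)")
  case And(l, r) =>
    // conservative toy rule: push the conjunction only if both sides convert
    for (lf <- createFilter(l); rf <- createFilter(r)) yield s"and($lf, $rf)"
  case _ => None
}
```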

[Review on Reviewable](https://reviewable.io/reviews/apache/spark/3317)

Author: Cheng Lian <lian@databricks.com>

Closes #3317 from liancheng/simplify-parquet-filters and squashes the following commits:

d6a9499 [Cheng Lian] Fixes import styling issue
43760e8 [Cheng Lian] Simplifies Parquet filter generation logic
2014-11-17 16:55:12 -08:00
Cheng Hao ef7c464eff [SPARK-4448] [SQL] unwrap for the ConstantObjectInspector
Author: Cheng Hao <hao.cheng@intel.com>

Closes #3308 from chenghao-intel/unwrap_constant_oi and squashes the following commits:

156b500 [Cheng Hao] rebase the master
c5b20ab [Cheng Hao] unwrap for the ConstantObjectInspector
2014-11-17 16:35:49 -08:00
w00228970 42389b1780 [SPARK-4443][SQL] Fix statistics for external table in spark sql hive
The `totalSize` of an external table is always zero, which influences the join strategy (a broadcast join is always used for external tables).

Author: w00228970 <wangfei1@huawei.com>

Closes #3304 from scwf/statistics and squashes the following commits:

568f321 [w00228970] fix statistics for external table
2014-11-17 16:33:50 -08:00
Cheng Lian 6b7f2f753d [SPARK-4309][SPARK-4407][SQL] Date type support for Thrift server, and fixes for complex types
This PR is exactly the same as #3178 except it reverts the `FileStatus.isDir` to `FileStatus.isDirectory` change, since it doesn't compile with Hadoop 1.

[Review on Reviewable](https://reviewable.io/reviews/apache/spark/3298)

Author: Cheng Lian <lian@databricks.com>

Closes #3298 from liancheng/date-for-thriftserver and squashes the following commits:

866037e [Cheng Lian] Revers isDirectory to isDir (it breaks Hadoop 1 profile)
6f71d0b [Cheng Lian] Makes toHiveString static
26fa955 [Cheng Lian] Fixes complex type support in Hive 0.13.1 shim
a92882a [Cheng Lian] Updates HiveShim for 0.13.1
73f442b [Cheng Lian] Adds Date support for HiveThriftServer2 (Hive 0.12.0)
2014-11-17 16:31:05 -08:00
Cheng Hao 69e858cc77 [SQL] Construct the MutableRow from an Array
Author: Cheng Hao <hao.cheng@intel.com>

Closes #3217 from chenghao-intel/mutablerow and squashes the following commits:

e8a10bd [Cheng Hao] revert the change of Row object
4681aea [Cheng Hao] Add toMutableRow method in object Row
a751838 [Cheng Hao] Construct the MutableRow from an existed row
2014-11-17 16:29:52 -08:00
Takuya UESHIN 566c791931 [SPARK-4425][SQL] Handle NaN or Infinity cast to Timestamp correctly.
`Cast` from `NaN` or `Infinity` of `Double` or `Float` to `TimestampType` throws `NumberFormatException`.

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #3283 from ueshin/issues/SPARK-4425 and squashes the following commits:

14def0c [Takuya UESHIN] Fix Cast to be able to handle NaN or Infinity to TimestampType.
2014-11-17 16:28:07 -08:00
Takuya UESHIN 3a81a1c9e0 [SPARK-4420][SQL] Change nullability of Cast from DoubleType/FloatType to DecimalType.
This is follow-up of [SPARK-4390](https://issues.apache.org/jira/browse/SPARK-4390) (#3256).

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #3278 from ueshin/issues/SPARK-4420 and squashes the following commits:

7fea558 [Takuya UESHIN] Add some tests.
cb2301a [Takuya UESHIN] Fix tests.
133bad5 [Takuya UESHIN] Change nullability of Cast from DoubleType/FloatType to DecimalType.
2014-11-17 16:26:48 -08:00
Cheng Lian 5ce7dae859 [SQL] Makes conjunction pushdown more aggressive for in-memory table
This is inspired by the [Parquet record filter generation code](64c6b9bad5/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetFilters.scala (L387-L400)).

[Review on Reviewable](https://reviewable.io/reviews/apache/spark/3318)

Author: Cheng Lian <lian@databricks.com>

Closes #3318 from liancheng/aggresive-conj-pushdown and squashes the following commits:

78b69d2 [Cheng Lian] Makes conjunction pushdown more aggressive
2014-11-17 15:33:13 -08:00
Josh Rosen 0f3ceb56c7 [SPARK-4180] [Core] Prevent creation of multiple active SparkContexts
This patch adds error-detection logic to throw an exception when attempting to create multiple active SparkContexts in the same JVM, since this is currently unsupported and has been known to cause confusing behavior (see SPARK-2243 for more details).

**The solution implemented here is only a partial fix.**  A complete fix would have the following properties:

1. Only one SparkContext may ever be under construction at any given time.
2. Once a SparkContext has been successfully constructed, any subsequent construction attempts should fail until the active SparkContext is stopped.
3. If the SparkContext constructor throws an exception, then all resources created in the constructor should be cleaned up (SPARK-4194).
4. If a user attempts to create a SparkContext but the creation fails, then the user should be able to create new SparkContexts.

This PR only provides 2) and 4); we should be able to provide all of these properties, but the correct fix will involve larger changes to SparkContext's construction / initialization, so we'll target it for a different Spark release.

### The correct solution:

I think that the correct way to do this would be to move the construction of SparkContext's dependencies into a static method in the SparkContext companion object.  Specifically, we could make the default SparkContext constructor `private` and change it to accept a `SparkContextDependencies` object that contains all of SparkContext's dependencies (e.g. DAGScheduler, ContextCleaner, etc.).  Secondary constructors could call a method on the SparkContext companion object to create the `SparkContextDependencies` and pass the result to the primary SparkContext constructor.  For example:

```scala
class SparkContext private (deps: SparkContextDependencies) {
  def this(conf: SparkConf) {
    this(SparkContext.getDeps(conf))
  }
}

object SparkContext {
  private[spark] def getDeps(conf: SparkConf): SparkContextDependencies = synchronized {
    if (anotherSparkContextIsActive) { throw new Exception(...) }
    var dagScheduler: DAGScheduler = null
    try {
      dagScheduler = new DAGScheduler(...)
      [...]
    } catch {
      case e: Exception =>
        Option(dagScheduler).foreach(_.stop())
        [...]
    }
    SparkContextDependencies(dagScheduler, ....)
  }
}
```

This gives us mutual exclusion and ensures that any resources created during the failed SparkContext initialization are properly cleaned up.

This indirection is necessary to maintain binary compatibility.  In retrospect, it would have been nice if SparkContext had no private constructors and could only be created through builder / factory methods on its companion object, since this buys us lots of flexibility and makes dependency injection easier.

### Alternative solutions:

As an alternative solution, we could refactor SparkContext's primary constructor to perform all object creation in a giant `try-finally` block.  Unfortunately, this will require us to turn a bunch of `vals` into `vars` so that they can be assigned from the `try` block.  If we still want `vals`, we could wrap each `val` in its own `try` block (since the try block can return a value), but this will lead to extremely messy code and won't guard against the introduction of future code which doesn't properly handle failures.

The more complex approach outlined above gives us some nice dependency injection benefits, so I think that might be preferable to a `var`-ification.

### This PR's solution:

- At the start of the constructor, check whether some other SparkContext is active; if so, throw an exception.
- If another SparkContext might be under construction (or has thrown an exception during construction), allow the new SparkContext to begin construction but log a warning (since resources might have been leaked from a failed creation attempt).
- At the end of the SparkContext constructor, check whether some other SparkContext constructor has raced and successfully created an active context.  If so, throw an exception.

This guarantees that no two SparkContexts will ever be active and exposed to users (since we check at the very end of the constructor).  If two threads race to construct SparkContexts, then one of them will win and another will throw an exception.

This exception can be turned into a warning by setting `spark.driver.allowMultipleContexts = true`.  The exception is disabled in unit tests, since there are some suites (such as Hive) that may require more significant refactoring to clean up their SparkContexts.  I've made a few changes to other suites' test fixtures to properly clean up SparkContexts so that the unit test logs contain fewer warnings.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #3121 from JoshRosen/SPARK-4180 and squashes the following commits:

23c7123 [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-4180
d38251b [Josh Rosen] Address latest round of feedback.
c0987d3 [Josh Rosen] Accept boolean instead of SparkConf in methods.
85a424a [Josh Rosen] Incorporate more review feedback.
372d0d3 [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-4180
f5bb78c [Josh Rosen] Update mvn build, too.
d809cb4 [Josh Rosen] Improve handling of failed SparkContext creation attempts.
79a7e6f [Josh Rosen] Fix commented out test
a1cba65 [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-4180
7ba6db8 [Josh Rosen] Add utility to set system properties in tests.
4629d5c [Josh Rosen] Set spark.driver.allowMultipleContexts=true in tests.
ed17e14 [Josh Rosen] Address review feedback; expose hack workaround for existing unit tests.
1c66070 [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-4180
06c5c54 [Josh Rosen] Add / improve SparkContext cleanup in streaming BasicOperationsSuite
d0437eb [Josh Rosen] StreamingContext.stop() should stop SparkContext even if StreamingContext has not been started yet.
c4d35a2 [Josh Rosen] Log long form of creation site to aid debugging.
918e878 [Josh Rosen] Document "one SparkContext per JVM" limitation.
afaa7e3 [Josh Rosen] [SPARK-4180] Prevent creations of multiple active SparkContexts.
2014-11-17 12:48:18 -08:00
Andy Konwinski cec1116b4b [DOCS][SQL] Fix broken link to Row class scaladoc
Author: Andy Konwinski <andykonwinski@gmail.com>

Closes #3323 from andyk/patch-2 and squashes the following commits:

4699fdc [Andy Konwinski] Fix broken link to Row class scaladoc
2014-11-17 11:52:47 -08:00
Andrew Or dbb9da5c3d Revert "[SPARK-4075] [Deploy] Jar url validation is not enough for Jar file"
This reverts commit 098f83c7cc.
2014-11-17 11:24:28 -08:00
Ankur Dave 9ac2bb18ed [SPARK-4444] Drop VD type parameter from EdgeRDD
Due to vertex attribute caching, EdgeRDD previously took two type parameters: ED and VD. However, this is an implementation detail that should not be exposed in the interface, so this PR drops the VD type parameter.

This requires removing the `filter` method from the EdgeRDD interface, because it depends on vertex attribute caching.

Author: Ankur Dave <ankurdave@gmail.com>

Closes #3303 from ankurdave/edgerdd-drop-tparam and squashes the following commits:

38dca9b [Ankur Dave] Leave EdgeRDD.fromEdges public
fafeb51 [Ankur Dave] Drop VD type parameter from EdgeRDD
2014-11-17 11:06:31 -08:00
Adam Pingel e7690ed20a SPARK-2811 upgrade algebird to 0.8.1
Author: Adam Pingel <adam@axle-lang.org>

Closes #3282 from adampingel/master and squashes the following commits:

70c8d3c [Adam Pingel] relocate the algebird example back to example/src
7a9d8be [Adam Pingel] SPARK-2811 upgrade algebird to 0.8.1
2014-11-17 10:47:29 -08:00
Prashant Sharma 5c92d47ad2 SPARK-4445, Don't display storage level in toDebugString unless RDD is persisted.
Author: Prashant Sharma <prashant.s@imaginea.com>

Closes #3310 from ScrapCodes/SPARK-4445/rddDebugStringFix and squashes the following commits:

4e57c52 [Prashant Sharma] SPARK-4445, Don't display storage level in toDebugString unless RDD is persisted
2014-11-17 10:40:33 -08:00
Michael Armbrust 64c6b9bad5 [SPARK-4410][SQL] Add support for external sort
Adds a new operator that uses Spark's `ExternalSort` class.  It is off by default now, but we might consider making it the default if benchmarks show that it does not regress performance.

Author: Michael Armbrust <michael@databricks.com>

Closes #3268 from marmbrus/externalSort and squashes the following commits:

48b9726 [Michael Armbrust] comments
b98799d [Michael Armbrust] Add test
afd7562 [Michael Armbrust] Add support for external sort.
2014-11-16 21:55:57 -08:00
GuoQiang Li 5168c6ca9f [SPARK-4422][MLLIB] In some cases, Vectors.fromBreeze gets wrong results.
cc mengxr

Author: GuoQiang Li <witgo@qq.com>

Closes #3281 from witgo/SPARK-4422 and squashes the following commits:

5f1fa5e [GuoQiang Li] import order
50783bd [GuoQiang Li] review commits
7a10123 [GuoQiang Li] In some cases, Vectors.fromBreeze get wrong results.
2014-11-16 21:31:51 -08:00
Michael Armbrust 45ce3273cb Revert "[SPARK-4309][SPARK-4407][SQL] Date type support for Thrift server, and fixes for complex types"
Author: Michael Armbrust <michael@databricks.com>

Closes #3292 from marmbrus/revert4309 and squashes the following commits:

808e96e [Michael Armbrust] Revert "[SPARK-4309][SPARK-4407][SQL] Date type support for Thrift server, and fixes for complex types"
2014-11-16 15:05:08 -08:00
Cheng Lian cb6bd83a91 [SPARK-4309][SPARK-4407][SQL] Date type support for Thrift server, and fixes for complex types
SPARK-4407 was detected while working on SPARK-4309. Merged these two into a single PR since 1.2.0 RC is approaching.

[Review on Reviewable](https://reviewable.io/reviews/apache/spark/3178)

Author: Cheng Lian <lian@databricks.com>

Closes #3178 from liancheng/date-for-thriftserver and squashes the following commits:

6f71d0b [Cheng Lian] Makes toHiveString static
26fa955 [Cheng Lian] Fixes complex type support in Hive 0.13.1 shim
a92882a [Cheng Lian] Updates HiveShim for 0.13.1
73f442b [Cheng Lian] Adds Date support for HiveThriftServer2 (Hive 0.12.0)
2014-11-16 14:26:41 -08:00
Josh Rosen 7850e0c707 [SPARK-4393] Fix memory leak in ConnectionManager ACK timeout TimerTasks; use HashedWheelTimer
This patch is intended to fix a subtle memory leak in ConnectionManager's ACK timeout TimerTasks: in the old code, each TimerTask held a reference to the message being sent and a cancelled TimerTask won't necessarily be garbage-collected until it's scheduled to run, so this caused huge buildups of messages that weren't garbage collected until their timeouts expired, leading to OOMs.

This patch addresses this problem by capturing only the message ID in the TimerTask instead of the whole message, and by keeping a WeakReference to the promise in the TimerTask.  I've also modified this code to use Netty's HashedWheelTimer, whose performance characteristics should be better for this use-case.
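
A hedged sketch of that pattern, assuming Netty's `io.netty.util.HashedWheelTimer` (the `scheduleAckTimeout` helper is illustrative, not the actual ConnectionManager code):

```scala
import java.lang.ref.WeakReference
import java.util.concurrent.TimeUnit
import scala.concurrent.Promise
import io.netty.util.{HashedWheelTimer, Timeout, TimerTask}

val ackTimer = new HashedWheelTimer()

// Capture only the message id and a WeakReference to the promise, so an
// expired or cancelled timeout never keeps the whole message alive.
def scheduleAckTimeout(messageId: Int, promise: Promise[Unit], timeoutSec: Long): Timeout = {
  val promiseRef = new WeakReference(promise)
  ackTimer.newTimeout(new TimerTask {
    override def run(timeout: Timeout): Unit =
      Option(promiseRef.get).foreach { p =>
        p.tryFailure(new java.io.IOException(s"No ACK received for message $messageId"))
      }
  }, timeoutSec, TimeUnit.SECONDS)
}
```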

Thanks to cristianopris for narrowing down this issue!

Author: Josh Rosen <joshrosen@databricks.com>

Closes #3259 from JoshRosen/connection-manager-timeout-bugfix and squashes the following commits:

afcc8d6 [Josh Rosen] Address rxin's review feedback.
2a2e92d [Josh Rosen] Keep only WeakReference to promise in TimerTask;
0f0913b [Josh Rosen] Spelling fix: timout => timeout
3200c33 [Josh Rosen] Use Netty HashedWheelTimer
f847dd4 [Josh Rosen] Don't capture entire message in ACK timeout task.
2014-11-16 00:44:15 -08:00
Kousuke Saruta 84468b2e20 [SPARK-4426][SQL][Minor] The symbol of BitwiseOr is wrong, should not be '&'
The symbol of BitwiseOr is defined as '&' but I think it's wrong. It should be '|'.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #3284 from sarutak/bitwise-or-symbol-fix and squashes the following commits:

aff4be5 [Kousuke Saruta] Fixed symbol of BitwiseOr
2014-11-15 22:23:47 -08:00
Josh Rosen 7d8e152eec [SPARK-4419] Upgrade snappy-java to 1.1.1.6
This upgrades snappy-java to 1.1.1.6, which includes a patch that improves error messages when attempting to deserialize empty inputs using SnappyInputStream (see xerial/snappy-java#89).

We previously tried to upgrade to 1.1.1.5 in #2911 but reverted that patch after discovering a memory leak in snappy-java. This leak should have been fixed in 1.1.1.6, though (see xerial/snappy-java#92).

Author: Josh Rosen <joshrosen@databricks.com>

Closes #3287 from JoshRosen/SPARK-4419 and squashes the following commits:

5d6f4cc [Josh Rosen] [SPARK-4419] Upgrade snappy-java to 1.1.1.6.
2014-11-15 22:22:34 -08:00
Josh Rosen 40eb8b6ef3 [SPARK-2321] Several progress API improvements / refactorings
This PR refactors / extends the status API introduced in #2696.

- Change StatusAPI from a mixin trait to a class.  Before, the new status API methods were directly accessible through SparkContext, whereas now they're accessed through a `sc.statusAPI` field.  As long as we were going to add these methods directly to SparkContext, the mixin trait seemed like a good idea, but this might be simpler to reason about and may avoid pitfalls that I've run into while attempting to refactor other parts of SparkContext to use mixins (see #3071, for example).
- Change the name from SparkStatusAPI to SparkStatusTracker.
- Make `getJobIdsForGroup(null)` return ids for jobs that aren't associated with any job group.
- Add `getActiveStageIds()` and `getActiveJobIds()` methods that return the ids of whatever's currently active in this SparkContext.  This should simplify davies's progress bar code.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #3197 from JoshRosen/progress-api-improvements and squashes the following commits:

30b0afa [Josh Rosen] Rename SparkStatusAPI to SparkStatusTracker.
d1b08d8 [Josh Rosen] Add missing newlines
2cc7353 [Josh Rosen] Add missing file.
d5eab1f [Josh Rosen] Add getActive[Stage|Job]Ids() methods.
a227984 [Josh Rosen] getJobIdsForGroup(null) should return jobs for default group
c47e294 [Josh Rosen] Remove StatusAPI mixin trait.
2014-11-14 23:46:25 -08:00
kai cbddac2369 Added contains(key) to Metadata
Add contains(key) to org.apache.spark.sql.catalyst.util.Metadata to test for the existence of a key; otherwise, the Metadata class's get methods may throw a NoSuchElementException if the key does not exist.
Test cases are added to MetadataSuite as well.
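
A hedged usage sketch, assuming the `MetadataBuilder` API in the same package (the key and value are illustrative):

```scala
import org.apache.spark.sql.catalyst.util.{Metadata, MetadataBuilder}

val meta: Metadata = new MetadataBuilder()
  .putString("comment", "user id column") // illustrative key/value
  .build()

// With contains(key), callers can probe for a key without risking a
// NoSuchElementException from the typed getters.
val comment: Option[String] =
  if (meta.contains("comment")) Some(meta.getString("comment")) else None
```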

Author: kai <kaizeng@eecs.berkeley.edu>

Closes #3273 from kai-zeng/metadata-fix and squashes the following commits:

74b3d03 [kai] Added contains(key) to Metadata
2014-11-14 23:44:23 -08:00
Kousuke Saruta 60969b0336 [SPARK-4260] Httpbroadcast should set connection timeout.
HttpBroadcast sets a read timeout but doesn't set a connection timeout.
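
A minimal sketch of the fix on a plain `URLConnection` (the URL and the 60-second values are illustrative):

```scala
import java.net.URL

// Set a connect timeout in addition to the read timeout, so a dead server is
// detected instead of blocking indefinitely during connection setup.
val connection = new URL("http://driver-host:12345/broadcast_0").openConnection() // illustrative URL
connection.setConnectTimeout(60 * 1000)
connection.setReadTimeout(60 * 1000)
```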

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #3122 from sarutak/httpbroadcast-timeout and squashes the following commits:

c7f3a56 [Kousuke Saruta] Added Connection timeout for Http Connection to HttpBroadcast.scala
2014-11-14 22:36:56 -08:00
zsxwing 861223ee5b [SPARK-4363][Doc] Update the Broadcast example
Author: zsxwing <zsxwing@gmail.com>

Closes #3226 from zsxwing/SPARK-4363 and squashes the following commits:

8109914 [zsxwing] Update the Broadcast example
2014-11-14 22:28:48 -08:00
zsxwing dba1405823 [SPARK-4379][Core] Change Exception to SparkException in checkpoint
It's better to change to SparkException. However, it's a breaking change since it will change the exception type.

Author: zsxwing <zsxwing@gmail.com>

Closes #3241 from zsxwing/SPARK-4379 and squashes the following commits:

409f3af [zsxwing] Change Exception to SparkException in checkpoint
2014-11-14 22:25:41 -08:00
Davies Liu 7fe08b43c7 [SPARK-4415] [PySpark] JVM should exit after Python exit
When the JVM is started by a Python process, it should exit once stdin is closed.

test: add spark.driver.memory in conf/spark-defaults.conf

```
daviesdm:~/work/spark$ cat conf/spark-defaults.conf
spark.driver.memory       8g
daviesdm:~/work/spark$ bin/pyspark
>>> quit
daviesdm:~/work/spark$ jps
4931 Jps
286
daviesdm:~/work/spark$ python wc.py
943738
0.719928026199
daviesdm:~/work/spark$ jps
286
4990 Jps
```

Author: Davies Liu <davies@databricks.com>

Closes #3274 from davies/exit and squashes the following commits:

df0e524 [Davies Liu] address comments
ce8599c [Davies Liu] address comments
050651f [Davies Liu] JVM should exit after Python exit
2014-11-14 20:14:33 -08:00
WangTao 303a4e4d23 [SPARK-4404] SparkSubmitDriverBootstrapper should stop after its SparkSubmit sub-process ends

https://issues.apache.org/jira/browse/SPARK-4404

When we have spark.driver.extra* or spark.driver.memory in SPARK_SUBMIT_PROPERTIES_FILE, spark-class will use SparkSubmitDriverBootstrapper to launch the driver.
If we get the process id of SparkSubmitDriverBootstrapper and want to kill it while it is running, we expect its SparkSubmit sub-process to stop as well.
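
A rough sketch of that behaviour (illustrative, not the actual SparkSubmitDriverBootstrapper code): register a shutdown hook in the bootstrapper that destroys the child process, so killing the bootstrapper also stops SparkSubmit.

```scala
// Illustrative sketch: tie the child process's lifetime to the parent's.
val submitProcess = new ProcessBuilder("spark-submit", "--help").inheritIO().start() // illustrative command

Runtime.getRuntime.addShutdownHook(new Thread {
  override def run(): Unit = submitProcess.destroy()
})

submitProcess.waitFor()
```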

Author: WangTao <barneystinson@aliyun.com>
Author: WangTaoTheTonic <barneystinson@aliyun.com>

Closes #3266 from WangTaoTheTonic/killsubmit and squashes the following commits:

e03eba5 [WangTaoTheTonic] add comments
57b5ca1 [WangTao] SparkSubmitDriverBootstrapper should stop after its SparkSubmit sub-process ends
2014-11-14 20:11:51 -08:00
Sandy Ryza ad42b28324 SPARK-4214. With dynamic allocation, avoid outstanding requests for more executors than pending tasks need.

WIP. Still need to add and fix tests.

Author: Sandy Ryza <sandy@cloudera.com>

Closes #3204 from sryza/sandy-spark-4214 and squashes the following commits:

35cf0e0 [Sandy Ryza] Add comment
13b53df [Sandy Ryza] Review feedback
067465f [Sandy Ryza] Whitespace fix
6ae080c [Sandy Ryza] Add tests and get num pending tasks from ExecutorAllocationListener
531e2b6 [Sandy Ryza] SPARK-4214. With dynamic allocation, avoid outstanding requests for more executors than pending tasks need.
2014-11-14 15:51:40 -08:00
Jim Carroll 37482ce5a7 [SPARK-4412][SQL] Fix Spark's control of Parquet logging.
The Spark ParquetRelation.scala code assumes that the parquet.Log class has already been loaded. If ParquetRelation.enableLogForwarding executes prior to the parquet.Log class being loaded, then the code in enableLogForwarding has no effect.

ParquetRelation.scala attempts to override the parquet logger but, at least currently (and if your application simply reads a parquet file before it does anything else with Parquet), the parquet.Log class hasn't been loaded yet. Therefore the code in ParquetRelation.enableLogForwarding has no effect. If you look at the code in parquet.Log, there's a static initializer that needs to be called prior to enableLogForwarding, or whatever enableLogForwarding does gets undone by that static initializer.

The "fix" is to force the static initializer in parquet.Log to be called as part of enableLogForwarding.

Author: Jim Carroll <jim@dontcallme.com>

Closes #3271 from jimfcarroll/parquet-logging and squashes the following commits:

37bdff7 [Jim Carroll] Fix Spark's control of Parquet logging.
2014-11-14 15:33:21 -08:00
Yash Datta 63ca3af66f [SPARK-4365][SQL] Remove unnecessary filter call on records returned from parquet library
Since the parquet library has been updated, we no longer need to filter the records returned from it for null records, as the library now skips those:

from parquet-hadoop/src/main/java/parquet/hadoop/InternalParquetRecordReader.java:

```java
public boolean nextKeyValue() throws IOException, InterruptedException {
  boolean recordFound = false;
  while (!recordFound) {
    // no more records left
    if (current >= total) {
      return false;
    }
    try {
      checkRead();
      currentValue = recordReader.read();
      current++;
      if (recordReader.shouldSkipCurrentRecord()) {
        // this record is being filtered via the filter2 package
        if (DEBUG) LOG.debug("skipping record");
        continue;
      }
      if (currentValue == null) {
        // only happens with FilteredRecordReader at end of block
        current = totalCountLoadedSoFar;
        if (DEBUG) LOG.debug("filtered record reader reached end of block");
        continue;
      }

      recordFound = true;
      if (DEBUG) LOG.debug("read value: " + currentValue);
    } catch (RuntimeException e) {
      throw new ParquetDecodingException(
          format("Can not read value at %d in block %d in file %s", current, currentBlock, file), e);
    }
  }
  return true;
}
```

Author: Yash Datta <Yash.Datta@guavus.com>

Closes #3229 from saucam/remove_filter and squashes the following commits:

8909ae9 [Yash Datta] SPARK-4365: Remove unnecessary filter call on records returned from parquet library
2014-11-14 15:16:40 -08:00
Jim Carroll f76b968370 [SPARK-4386] Improve performance when writing Parquet files.
If you profile the writing of a Parquet file, the single most time-consuming call inside org.apache.spark.sql.parquet.MutableRowWriteSupport.write is actually the scala.collection.AbstractSequence.size call. This is because the size call actually ends up COUNTING the elements via scala.collection.LinearSeqOptimized.length ("optimized?").

This doesn't need to be done: "size" is currently called repeatedly wherever it is needed, rather than being called once at the top of the method and stored in a 'val'.
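
A toy sketch of the change (the method and types are illustrative, not the actual MutableRowWriteSupport code):

```scala
// Illustrative only: call .size once and keep it in a val, instead of
// re-evaluating it (an O(n) traversal for linear sequences) inside the loop.
def writeRow(values: Seq[Any], writeValue: Any => Unit): Unit = {
  val size = values.size // hoisted out of the loop
  var i = 0
  while (i < size) {
    writeValue(values(i))
    i += 1
  }
}
```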

Author: Jim Carroll <jim@dontcallme.com>

Closes #3254 from jimfcarroll/parquet-perf and squashes the following commits:

30cc0b5 [Jim Carroll] Improve performance when writing Parquet files.
2014-11-14 15:11:53 -08:00
Cheng Lian 0c7b66bd44 [SPARK-4322][SQL] Enables struct fields as sub expressions of grouping fields
While resolving struct fields, the resulted `GetField` expression is wrapped with an `Alias` to make it a named expression. Assume `a` is a struct instance with a field `b`, then `"a.b"` will be resolved as `Alias(GetField(a, "b"), "b")`. Thus, for this following SQL query:

```sql
SELECT a.b + 1 FROM t GROUP BY a.b + 1
```

the grouping expression is

```scala
Add(GetField(a, "b"), Literal(1, IntegerType))
```

while the aggregation expression is

```scala
Add(Alias(GetField(a, "b"), "b"), Literal(1, IntegerType))
```

This mismatch makes the above SQL query fail during both the analysis and execution phases. This PR fixes this issue by removing the alias when substituting aggregation expressions.
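
A hedged sketch of the fix in Catalyst terms (assuming that code base's `Alias` and `GetField` expression classes; `trimStructFieldAliases` is an illustrative name):

```scala
import org.apache.spark.sql.catalyst.expressions.{Alias, Expression, GetField}

// Illustrative: drop the alias that resolution wraps around struct-field
// accesses, so the aggregation expression matches the grouping expression.
def trimStructFieldAliases(expr: Expression): Expression = expr transform {
  case Alias(getField: GetField, _) => getField
}
```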

[Review on Reviewable](https://reviewable.io/reviews/apache/spark/3248)

Author: Cheng Lian <lian@databricks.com>

Closes #3248 from liancheng/spark-4322 and squashes the following commits:

23a46ea [Cheng Lian] Code simplification
dd20a79 [Cheng Lian] Should only trim aliases around `GetField`s
7f46532 [Cheng Lian] Enables struct fields as sub expressions of grouping fields
2014-11-14 15:09:36 -08:00