Commit graph

8689 commits

Prashant Sharma d6a3025392 [BUILD] Fixed resolver for scalastyle plugin and upgrade sbt version.
Author: Prashant Sharma <prashant.s@imaginea.com>

Closes #2877 from ScrapCodes/scalastyle-fix and squashes the following commits:

a17b9fe [Prashant Sharma] [BUILD] Fixed resolver for scalastyle plugin.
2014-10-22 22:13:11 -07:00
Prashant Sharma c5882c663e [SPARK-3812] [BUILD] Adapt maven build to publish effective pom.
I tried the maven help plugin first, but that published all projects in the top-level pom, so I was left with no choice but to roll my own trivial plugin. This patch basically installs an effective pom after `maven install` finishes.

The problem it fixes is described as follows:
If you install using maven:
`mvn install -DskipTests -Dhadoop.version=2.2.0 -Phadoop-2.2`
then without this patch the published pom(s) will list the hadoop version as 1.0.4, which can be a problem at some point.

Author: Prashant Sharma <prashant.s@imaginea.com>

Closes #2673 from ScrapCodes/build-changes-effective-pom and squashes the following commits:

aa7b91d [Prashant Sharma] used an unused dep.
0300dac [Prashant Sharma] improved comment messages..
28f891e [Prashant Sharma] Added a useless dependency, so that we can shade it. And realized fake shading works for us.
553d96b [Prashant Sharma] Shaded some unused class of an unused dep, to generate effective pom(s)
2014-10-22 19:44:00 -07:00
zsxwing 137d942353 [SPARK-3877][YARN] Throw an exception when application is not successful so that the exit code will be set to 1
When a YARN application fails (yarn-cluster mode), the exit code of spark-submit is still 0. This makes it hard to write automated scripts to run Spark jobs on YARN, because the failure cannot be detected in those scripts.

This PR adds a status check after `monitorApplication`. If the application is not successful, `run()` throws a `SparkException`, so that Client.scala exits with code 1. Therefore, people can use the exit code of `spark-submit` in automated scripts.
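
To make the mechanism concrete, here is a hedged sketch of such a status check (illustrative names only; the actual change is in YARN's Client.scala):

```
// Sketch only: propagate a failed final status as an exception so that
// spark-submit exits with a non-zero code.
import org.apache.hadoop.yarn.api.records.FinalApplicationStatus
import org.apache.spark.SparkException

def assertApplicationSucceeded(finalStatus: FinalApplicationStatus): Unit = {
  if (finalStatus != FinalApplicationStatus.SUCCEEDED) {
    throw new SparkException(s"Application finished with failed status: $finalStatus")
  }
}
```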

Author: zsxwing <zsxwing@gmail.com>

Closes #2732 from zsxwing/SPARK-3877 and squashes the following commits:

1f89fa5 [zsxwing] Fix the unit test
a0498e1 [zsxwing] Update the docs and the error message
e1cb9ef [zsxwing] Fix the hacky way of calling Client
ff16fec [zsxwing] Remove System.exit in Client.scala and add a test
6a2c103 [zsxwing] [SPARK-3877] Throw an exception when application is not successful so that the exit code wil be set to 1
2014-10-22 15:04:41 -07:00
Josh Rosen 813effc701 [SPARK-3426] Fix sort-based shuffle error when spark.shuffle.compress and spark.shuffle.spill.compress settings are different
This PR fixes SPARK-3426, an issue where sort-based shuffle crashes if the
`spark.shuffle.spill.compress` and `spark.shuffle.compress` settings have
different values.

The problem is that sort-based shuffle's read and write paths use different
settings for determining whether to apply compression.  ExternalSorter writes
runs to files using `TempBlockId` ids, which causes
`spark.shuffle.spill.compress` to be used for enabling compression, but these
spilled files end up being shuffled over the network and read as shuffle files
using `ShuffleBlockId` by BlockStoreShuffleFetcher, which causes
`spark.shuffle.compress` to be used for enabling decompression.  As a result,
this leads to errors when these settings disagree.

Based on the discussions in #2247 and #2178, it sounds like we don't want to
remove the `spark.shuffle.spill.compress` setting.  Therefore, I've tried to
come up with a fix where `spark.shuffle.spill.compress` is used to compress
data that's read and written locally and `spark.shuffle.compress` is used to
compress any data that will be fetched / read as shuffle blocks.

To do this, I split `TempBlockId` into two new id types, `TempLocalBlockId` and
`TempShuffleBlockId`, which map to `spark.shuffle.spill.compress` and
`spark.shuffle.compress`, respectively.  ExternalAppendOnlyMap also used temp
blocks for spilling data.  It looks like ExternalSorter was designed to be
a generic sorter but its configuration already happens to be tied to sort-based
shuffle, so I think it's fine if we use `spark.shuffle.compress` to compress
its spills; we can move the compression configuration to the constructor in
a later commit if we find that ExternalSorter is being used in other contexts
where we want different configuration options to control compression.  To
summarize:

**Before:**

|       | ExternalAppendOnlyMap        | ExternalSorter               |
|-------|------------------------------|------------------------------|
| Read  | spark.shuffle.spill.compress | spark.shuffle.compress       |
| Write | spark.shuffle.spill.compress | spark.shuffle.spill.compress |

**After:**

|       | ExternalAppendOnlyMap        | ExternalSorter         |
|-------|------------------------------|------------------------|
| Read  | spark.shuffle.spill.compress | spark.shuffle.compress |
| Write | spark.shuffle.spill.compress | spark.shuffle.compress |
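
As a hedged illustration of the resulting mapping (toy types and a plain config map, not the actual BlockManager code):

```
import java.util.UUID

sealed trait BlockId
case class TempLocalBlockId(id: UUID) extends BlockId     // spilled and read back locally
case class TempShuffleBlockId(id: UUID) extends BlockId   // spilled, but served as shuffle data
case class ShuffleBlockId(shuffleId: Int, mapId: Int, reduceId: Int) extends BlockId

def shouldCompress(blockId: BlockId, conf: Map[String, String]): Boolean = blockId match {
  case _: TempLocalBlockId =>
    conf.getOrElse("spark.shuffle.spill.compress", "true").toBoolean
  case _: TempShuffleBlockId | _: ShuffleBlockId =>
    conf.getOrElse("spark.shuffle.compress", "true").toBoolean
}
```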

Thanks to andrewor14 for debugging this with me!

Author: Josh Rosen <joshrosen@databricks.com>

Closes #2890 from JoshRosen/SPARK-3426 and squashes the following commits:

1921cf6 [Josh Rosen] Minor edit for clarity.
c8dd8f2 [Josh Rosen] Add comment explaining use of createTempShuffleBlock().
2c687b9 [Josh Rosen] Fix SPARK-3426.
91e7e40 [Josh Rosen] Combine tests into single test of all combinations
76ca65e [Josh Rosen] Add regression test for SPARK-3426.
2014-10-22 14:49:58 -07:00
freeman 97cf19f64e Fix for sampling error in NumPy v1.9 [SPARK-3995][PYSPARK]
Change maximum value for default seed during RDD sampling so that it is strictly less than 2 ** 32. This prevents a bug in the most recent version of NumPy, which cannot accept random seeds above this bound.

Adds an extra test that uses the default seed (instead of setting it manually, as in the docstrings).

mengxr

Author: freeman <the.freeman.lab@gmail.com>

Closes #2889 from freeman-lab/pyspark-sampling and squashes the following commits:

dc385ef [freeman] Change maximum value for default seed
2014-10-22 09:33:12 -07:00
CrazyJvm f05e09b4c9 use isRunningLocally rather than runningLocally
`runningLocally` is deprecated now.

Author: CrazyJvm <crazyjvm@gmail.com>

Closes #2879 from CrazyJvm/runningLocally and squashes the following commits:

bec0b3e [CrazyJvm] use isRunningLocally rather than runningLocally
2014-10-22 00:51:33 -07:00
Karthik bae4ca3bbf Update JavaCustomReceiver.java
Changed the usage string to correctly reflect the file name.

Author: Karthik <karthik.gomadam@gmail.com>

Closes #2699 from namelessnerd/patch-1 and squashes the following commits:

8570e33 [Karthik] Update JavaCustomReceiver.java
2014-10-22 00:08:53 -07:00
Sandy Ryza 6bb56faea8 SPARK-1813. Add a utility to SparkConf that makes using Kryo really easy
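
A hedged usage sketch of the convenience this adds, assuming the helper ended up on SparkConf as registerKryoClasses (the classes below are made-up placeholders):

```
import org.apache.spark.SparkConf

case class Point(x: Double, y: Double)   // placeholder user classes
case class Segment(a: Point, b: Point)

val conf = new SparkConf()
  .setAppName("kryo-shortcut-example")
  // One call instead of wiring up spark.serializer and a custom KryoRegistrator.
  .registerKryoClasses(Array(classOf[Point], classOf[Segment]))
```
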
Author: Sandy Ryza <sandy@cloudera.com>

Closes #789 from sryza/sandy-spark-1813 and squashes the following commits:

48b05e9 [Sandy Ryza] Simplify
b824932 [Sandy Ryza] Allow both spark.kryo.classesToRegister and spark.kryo.registrator at the same time
6a15bb7 [Sandy Ryza] Small fix
a2278c0 [Sandy Ryza] Respond to review comments
6ef592e [Sandy Ryza] SPARK-1813. Add a utility to SparkConf that makes using Kryo really easy
2014-10-21 21:53:09 -07:00
wangfei 856b081729 [SQL]redundant methods for broadcast
Remove redundant methods for broadcast in ```TableReader```.

Author: wangfei <wangfei1@huawei.com>

Closes #2862 from scwf/TableReader and squashes the following commits:

414cc24 [wangfei] unnecessary methods for broadcast
2014-10-21 16:20:58 -07:00
coderxiang 814a9cd7fa SPARK-3568 [mllib] add ranking metrics
Add common metrics for ranking algorithms (http://www-nlp.stanford.edu/IR-book/), including:
 - Mean Average Precision
 - Precision@n: top-n precision
 - Discounted cumulative gain (DCG) and NDCG

The following methods and the corresponding tests are implemented:

```
class RankingMetrics[T](predictionAndLabels: RDD[(Array[T], Array[T])]) {
  /* Returns the precision@k for each query */
  lazy val precAtK: RDD[Array[Double]]

  /**
   * @param k the position to compute the truncated precision
   * @return the average precision at the first k ranking positions
   */
  def precision(k: Int): Double

  /* Returns the average precision for each query */
  lazy val avePrec: RDD[Double]

  /* Returns the mean average precision (MAP) of all the queries */
  lazy val meanAvePrec: Double

  /* Returns the normalized discounted cumulative gain for each query */
  lazy val ndcgAtK: RDD[Array[Double]]

  /**
   * @param k the position to compute the truncated ndcg
   * @return the average ndcg at the first k ranking positions
   */
  def ndcg(k: Int): Double
}
```
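
A hedged usage sketch against the API outlined above (method names taken from this sketch, which may differ from what was finally merged; a live SparkContext `sc` is assumed):

```
val predictionAndLabels = sc.parallelize(Seq(
  (Array(1, 6, 2, 7, 8, 3, 9, 10, 4, 5), Array(1, 2, 3, 4, 5)),
  (Array(4, 1, 5, 6, 2, 7, 3, 8, 9, 10), Array(1, 2, 3))
))
val metrics = new RankingMetrics(predictionAndLabels)

println(metrics.precision(5))   // average precision at the first 5 positions
println(metrics.meanAvePrec)    // mean average precision (MAP)
println(metrics.ndcg(5))        // average NDCG at the first 5 positions
```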

Author: coderxiang <shuoxiangpub@gmail.com>

Closes #2667 from coderxiang/rankingmetrics and squashes the following commits:

d881097 [coderxiang] update doc
14d9cd9 [coderxiang] remove unexpected files
d7fb93f [coderxiang] style change and remove ignored files
f113ee1 [coderxiang] modify doc for displaying superscript and subscript
f626896 [coderxiang] improve doc and remove unnecessary computation while labSet is empty
be6645e [coderxiang] set the precision of empty labset to 0.0
d64c120 [coderxiang] add logWarning for empty ground truth set
dfae292 [coderxiang] handle empty labSet for map. add test
62047c4 [coderxiang] style change and add documentation
f66612d [coderxiang] add additional test of precisionAt
b794cb2 [coderxiang] move private members precAtK, ndcgAtK into public methods. style change
77c9e5d [coderxiang] set precAtK and ndcgAtK as private member. Improve documentation
5f87bce [coderxiang] add API to calculate precision and ndcg at each ranking position
b7851cc [coderxiang] Use generic type to represent IDs
e443fee [coderxiang] change style and use alternative builtin methods
3a5a6ff [coderxiang] add ranking metrics
2014-10-21 15:45:47 -07:00
Aaron Davidson 5fdaf52a9d [SPARK-3994] Use standard Aggregator code path for countByKey and countByValue
See [JIRA](https://issues.apache.org/jira/browse/SPARK-3994) for more information. Also adds
a note which warns against using these methods.
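
As a hedged illustration of the direction (not the actual patch), a count by key can be expressed through the ordinary pair-RDD aggregation path instead of collecting every value per key:

```
// Sketch only; assumes a live SparkContext `sc` and the Spark 1.x implicit
// pair-RDD conversions from SparkContext._.
import org.apache.spark.SparkContext._

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val counts: Map[String, Long] =
  pairs.mapValues(_ => 1L).reduceByKey(_ + _).collect().toMap
```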

Author: Aaron Davidson <aaron@databricks.com>

Closes #2839 from aarondav/countByKey and squashes the following commits:

d6fdb2a [Aaron Davidson] Respond to comments
e1f06d3 [Aaron Davidson] [SPARK-3994] Use standard Aggregator code path for countByKey and countByValue
2014-10-21 13:15:29 -07:00
Michelangelo D'Agostino 1a623b2e16 SPARK-3770: Make userFeatures accessible from python
https://issues.apache.org/jira/browse/SPARK-3770

We need access to the underlying latent user features from Python. However, the userFeatures RDD from the MatrixFactorizationModel isn't accessible from the Python bindings. I've added a method to the underlying Scala class to turn the RDD[(Int, Array[Double])] into an RDD[String]. This is then accessed from the Python recommendation.py.

Author: Michelangelo D'Agostino <mdagostino@civisanalytics.com>

Closes #2636 from mdagost/mf_user_features and squashes the following commits:

c98f9e2 [Michelangelo D'Agostino] Added unit tests for userFeatures and productFeatures and merged master.
d5eadf8 [Michelangelo D'Agostino] Merge branch 'master' into mf_user_features
2481a2a [Michelangelo D'Agostino] Merged master and resolved conflict.
a6ffb96 [Michelangelo D'Agostino] Eliminated a function from our first approach to this problem that is no longer needed now that we added the fromTuple2RDD function.
2aa1bf8 [Michelangelo D'Agostino] Implemented a function called fromTuple2RDD in PythonMLLibAPI and used it to expose the MF userFeatures and productFeatures in python.
34cb2a2 [Michelangelo D'Agostino] A couple of lint cleanups and a comment.
cdd98e3 [Michelangelo D'Agostino] It's working now.
e1fbe5e [Michelangelo D'Agostino] Added scala function to stringify userFeatures for access in python.
2014-10-21 11:49:39 -07:00
Andrew Or 61ca7742d2 [SPARK-4020] Do not rely on timeouts to remove failed block managers
If an executor fails without being scheduled to run any tasks, then `DAGScheduler` won't notify `BlockManagerMasterActor` that the associated block manager should be removed. Instead, the associated block manager will be expired only after a few rounds of heartbeat timeouts. In terms of removal treatment, there should really be no distinction between executors that have been scheduled tasks and those that have not.

The fix, then, is to add all known executors to `TaskSchedulerImpl`'s `activeExecutorIds` whether or not they have been scheduled tasks. In fact, the existing comment above `activeExecutorIds` is
```
// Which executor IDs we have executors on
val activeExecutorIds = new HashSet[String]
```
not  "Which executors have been scheduled tasks thus far."

Author: Andrew Or <andrewor14@gmail.com>

Closes #2865 from andrewor14/active-executors and squashes the following commits:

ff3172b [Andrew Or] Add all known executors to `activeExecutorIds`
2014-10-21 11:22:25 -07:00
zsxwing c262cd5ddc [SPARK-4035] Fix a wrong format specifier
Just found a typo. Should not use "%f" for Long.
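
For illustration (not the patched line itself), formatting a Long with "%f" fails at runtime:

```
val bytes: Long = 1048576L
// "%f" expects a floating-point argument, so the following would throw
// java.util.IllegalFormatConversionException:
// "Memory used: %f".format(bytes)
val ok = "Memory used: %d".format(bytes)   // "%d" is the correct specifier for a Long
```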

Author: zsxwing <zsxwing@gmail.com>

Closes #2875 from zsxwing/SPARK-4035 and squashes the following commits:

ce347e2 [zsxwing] Fix a wrong format specifier
2014-10-21 10:37:13 -07:00
Holden Karau 2aeb84bc79 replace awaitTransformation with awaitTermination in scaladoc/javadoc
Author: Holden Karau <holden@pigscanfly.ca>

Closes #2861 from holdenk/SPARK-4015-Documentation-in-the-streaming-context-references-non-existent-function and squashes the following commits:

081db8a [Holden Karau] fix pyspark streaming doc too
0e03863 [Holden Karau] replace awaitTransformation with awaitTermination
2014-10-21 09:37:17 -07:00
Davies Liu 8570816834 [SPARK-4023] [MLlib] [PySpark] convert rdd into RDD of Vector
Convert the input rdd to RDD of Vector.

cc mengxr

Author: Davies Liu <davies@databricks.com>

Closes #2870 from davies/fix4023 and squashes the following commits:

1eac767 [Davies Liu] address comments
0871576 [Davies Liu] convert rdd into RDD of Vector
2014-10-21 09:29:45 -07:00
Josh Rosen 5a8f64f336 [SPARK-3958] TorrentBroadcast cleanup / debugging improvements.
This PR makes several changes to TorrentBroadcast in order to make
it easier to reason about, which should help when debugging SPARK-3958.
The key changes:

- Remove all state from the global TorrentBroadcast object.  This state
  consisted mainly of configuration options, like the block size and
  compression codec, and was read by the blockify / unblockify methods.
  Unfortunately, the use of `lazy val` for `BLOCK_SIZE` meant that the block
  size was always determined by the first SparkConf that TorrentBroadcast was
  initialized with; as a result, unit tests could not properly test
  TorrentBroadcast with different block sizes.

  Instead, blockifyObject and unBlockifyObject now accept compression codecs
  and blockSizes as arguments.  These arguments are supplied at the call sites
  inside of TorrentBroadcast instances.  Each TorrentBroadcast instance
  determines these values from SparkEnv's SparkConf.  I was careful to ensure
  that we do not accidentally serialize CompressionCodec or SparkConf objects
  as part of the TorrentBroadcast object.

- Remove special-case handling of local-mode in TorrentBroadcast.  I don't
  think that broadcast implementations should know about whether we're running
  in local mode.  If we want to optimize the performance of broadcast in local
  mode, then we should detect this at a higher level and use a dummy
  LocalBroadcastFactory implementation instead.

  Removing this code fixes a subtle error condition: in the old local mode
  code, a failure to find the broadcast in the local BlockManager would lead
  to an attempt to deblockify zero blocks, which could lead to confusing
  deserialization or decompression errors when we attempted to decompress
  an empty byte array.  This should never have happened, though: a failure to
  find the block in local mode is evidence of some other error.  The changes
  here will make it easier to debug those errors if they ever happen.

- Add a check that throws an exception when attempting to deblockify an
  empty array (see the sketch after this list).

- Use ScalaCheck to add a test to check that TorrentBroadcast's
  blockifyObject and unBlockifyObject methods are inverses.

- Misc. cleanup and logging improvements.
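
A minimal sketch of the shape of the blockify change and the empty-array check, simplified to raw byte arrays (the real methods also deal with serializers and compression codecs):

```
// Configuration such as the block size is passed in as an argument instead of
// being read from a global lazy val, and deblockifying an empty array fails fast.
def blockify(bytes: Array[Byte], blockSize: Int): Array[Array[Byte]] =
  bytes.grouped(blockSize).toArray

def unBlockify(blocks: Array[Array[Byte]]): Array[Byte] = {
  require(blocks.nonEmpty, "cannot deblockify an empty array of blocks")
  blocks.flatten
}
```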

Author: Josh Rosen <joshrosen@databricks.com>

Closes #2844 from JoshRosen/torrentbroadcast-bugfix and squashes the following commits:

1e8268d [Josh Rosen] Address Reynold's review comments
2a9fdfd [Josh Rosen] Address Reynold's review comments.
c3b08f9 [Josh Rosen] Update TorrentBroadcast tests to reflect removal of special local-mode optimizations.
5c22782 [Josh Rosen] Store broadcast variable's value in the driver.
33fc754 [Josh Rosen] Change blockify/unblockifyObject to accept serializer as argument.
618a872 [Josh Rosen] [SPARK-3958] TorrentBroadcast cleanup / debugging improvements.
2014-10-21 00:49:11 -07:00
Reynold Xin 342b57db66 Update Building Spark link. 2014-10-20 19:16:35 -07:00
wangxiaojing 0fe1c09369 [SPARK-3940][SQL] Avoid console printing error messages three times
When a SQL statement is wrong, the console currently prints the error message three times instead of once. For example:
<pre>
spark-sql> show tabless;
show tabless;
14/10/13 21:03:48 INFO ParseDriver: Parsing command: show tabless
............
	at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:274)
	at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:413)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:209)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
Caused by: org.apache.hadoop.hive.ql.parse.ParseException: line 1:5 cannot recognize input near 'show' 'tabless' '<EOF>' in ddl statement

	at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:193)
	at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:161)
	at org.apache.spark.sql.hive.HiveQl$.getAst(HiveQl.scala:218)
	at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:226)
	... 47 more
Time taken: 4.35 seconds
14/10/13 21:03:51 INFO CliDriver: Time taken: 4.35 seconds
</pre>

Author: wangxiaojing <u9jing@gmail.com>

Closes #2790 from wangxiaojing/spark-3940 and squashes the following commits:

e2e5c14 [wangxiaojing] sql Print the error code three times
2014-10-20 17:15:28 -07:00
Takuya UESHIN 7586e2e67a [SPARK-3969][SQL] Optimizer should have a super class as an interface.
Some developers want to replace `Optimizer` to fit their projects but can't do so because currently `Optimizer` is an `object`.
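
The change boils down to the following pattern, shown here as an illustrative sketch rather than the actual Catalyst code:

```
// Before: object Optimizer { ... }  -- impossible to subclass or swap out.
// After (sketch): an extensible base class plus a default singleton.
abstract class Optimizer {
  def batches: Seq[String]   // stand-in for the real rule batches
}

object DefaultOptimizer extends Optimizer {
  val batches = Seq("ConstantFolding", "CombineFilters")
}
```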

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #2825 from ueshin/issues/SPARK-3969 and squashes the following commits:

abbc53c [Takuya UESHIN] Re-rename Optimizer object.
4d2e1bc [Takuya UESHIN] Rename Optimizer object.
9547a23 [Takuya UESHIN] Extract abstract class from Optimizer for developers to be able to replace Optimizer.
2014-10-20 17:09:12 -07:00
luogankun fce1d41611 [SPARK-3945]Properties of hive-site.xml is invalid in running the Thrift JDBC server
Write the properties of hive-site.xml to HiveContext when initializing the session state in SparkSQLEnv.scala.

The SparkSQLEnv.init() method in HiveThriftServer2.scala cannot write the properties of hive-site.xml to HiveContext; for example, adding the configuration property spark.sql.shuffle.partitions to hive-site.xml has no effect.

Author: luogankun <luogankun@gmail.com>

Closes #2800 from luogankun/SPARK-3945 and squashes the following commits:

3679efc [luogankun] [SPARK-3945]Write properties of hive-site.xml to HiveContext when initilize session state In SparkSQLEnv.scala
2014-10-20 16:50:51 -07:00
Takuya UESHIN 364d52b707 [SPARK-3966][SQL] Fix nullabilities of Cast related to DateType.
Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #2820 from ueshin/issues/SPARK-3966 and squashes the following commits:

ca4a745 [Takuya UESHIN] Fix nullabilities of Cast related to DateType.
2014-10-20 15:51:05 -07:00
Michael Armbrust e9c1afa87b [SPARK-3800][SQL] Clean aliases from grouping expressions
Author: Michael Armbrust <michael@databricks.com>

Closes #2658 from marmbrus/nestedAggs and squashes the following commits:

862b763 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into nestedAggs
3234521 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into nestedAggs
8b06fdc [Michael Armbrust] possible fix for grouping on nested fields
2014-10-20 15:32:17 -07:00
Cheng Lian 1b3ce61ce9 [SPARK-3906][SQL] Adds multiple join support for SQLContext
Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #2767 from liancheng/multi-join and squashes the following commits:

9dc0d18 [Cheng Lian] Adds multiple join support for SQLContext
2014-10-20 15:29:54 -07:00
Qiping Li eadc4c590e [SPARK-3207][MLLIB]Choose splits for continuous features in DecisionTree more adaptively
DecisionTree splits on continuous features by choosing an array of values from a subsample of the data.
Currently, it does not check for identical values in the subsample, so it could end up having multiple copies of the same split. In this PR, we choose splits for a continuous feature in 3 steps:

1. Sort sample values for this feature
2. Get number of occurrence of each distinct value
3. Iterate the value count array computed in step 2 to choose splits.

After the splits are found, `numSplits` and `numBins` in the metadata are updated.
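
A simplified sketch of the 3-step procedure for a single sampled feature, using equal-frequency boundaries over distinct values (the real implementation differs in details such as weighting and bin bookkeeping):

```
def chooseSplits(sampledValues: Array[Double], maxSplits: Int): Array[Double] = {
  // 1. Sort the sampled values for this feature.
  val sorted = sampledValues.sorted
  // 2. Count the occurrences of each distinct value, preserving sorted order.
  val distinctCounts: Array[(Double, Int)] =
    sorted.groupBy(identity).toArray.sortBy(_._1).map { case (v, xs) => (v, xs.length) }
  // 3. Walk the (value, count) array and place a split each time the cumulative
  //    count passes the next boundary; identical values can never yield duplicate
  //    splits because each distinct value is visited exactly once.
  val stride = sorted.length.toDouble / (maxSplits + 1)
  val splits = scala.collection.mutable.ArrayBuffer.empty[Double]
  var cumulative = 0
  var nextTarget = stride
  for ((value, count) <- distinctCounts.dropRight(1)) {   // no split after the last value
    cumulative += count
    if (cumulative >= nextTarget) {
      splits += value
      nextTarget += stride
    }
  }
  splits.toArray
}
```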

CC: mengxr manishamde jkbradley, please help me review this, thanks.

Author: Qiping Li <liqiping1991@gmail.com>
Author: chouqin <liqiping1991@gmail.com>
Author: liqi <liqiping1991@gmail.com>
Author: qiping.lqp <qiping.lqp@alibaba-inc.com>

Closes #2780 from chouqin/dt-findsplits and squashes the following commits:

18d0301 [Qiping Li] check explicitly findsplits return distinct splits
8dc28ab [chouqin] remove blank lines
ffc920f [chouqin] adjust code based on comments and add more test cases
9857039 [chouqin] Merge branch 'master' of https://github.com/apache/spark into dt-findsplits
d353596 [qiping.lqp] fix pyspark doc test
9e64699 [Qiping Li] fix random forest unit test
3c72913 [Qiping Li] fix random forest unit test
092efcb [Qiping Li] fix bug
f69f47f [Qiping Li] fix bug
ab303a4 [Qiping Li] fix bug
af6dc97 [Qiping Li] fix bug
2a8267a [Qiping Li] fix bug
c339a61 [Qiping Li] fix bug
369f812 [Qiping Li] fix style
8f46af6 [Qiping Li] add comments and unit test
9e7138e [Qiping Li] Merge branch 'dt-findsplits' of https://github.com/chouqin/spark into dt-findsplits
1b25a35 [Qiping Li] Merge branch 'master' of https://github.com/apache/spark into dt-findsplits
0cd744a [liqi] fix bug
3652823 [Qiping Li] fix bug
af7cb79 [Qiping Li] Choose splits for continuous features in DecisionTree more adaptively
2014-10-20 13:12:26 -07:00
mcheah 4afe9a4852 [SPARK-3736] Workers reconnect when disassociated from the master.
Before, if the master node is killed and restarted, the worker nodes
would not attempt to reconnect to the Master. Therefore, when the Master
node was restarted, the worker nodes needed to be restarted as well.

Now, when the Master node is disconnected, the worker nodes will
continuously ping the master node in attempts to reconnect to it. Once
the master node restarts, it will detect one of the registration
requests from its former workers. The result is that the cluster
re-enters a healthy state.

In addition, when the master does not receive a heartbeat from the
worker, the worker was removed; however, when the worker sent a
heartbeat to the master, the master used to ignore the heartbeat. Now,
a master that receives a heartbeat from a worker that had been
disconnected will request the worker to re-attempt the registration
process, at which point the worker will send a RegisterWorker request
and be re-connected accordingly.

Re-connection attempts per worker are submitted every N seconds, where N
is configured by the property spark.worker.reconnect.interval - this has
a default of 60 seconds right now.
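
A hedged sketch of that retry loop using plain JDK scheduling (the actual logic lives in the Worker actor's registerWithMaster() path):

```
import java.util.concurrent.{Executors, TimeUnit}

// Keep re-sending registration requests every `intervalSeconds` until the
// master acknowledges us, then cancel further attempts.
def scheduleReconnect(intervalSeconds: Long)(isRegistered: () => Boolean)(register: () => Unit): Unit = {
  val scheduler = Executors.newSingleThreadScheduledExecutor()
  scheduler.scheduleAtFixedRate(new Runnable {
    override def run(): Unit = {
      if (isRegistered()) scheduler.shutdown()   // connected again: stop retrying
      else register()                            // e.g. send another RegisterWorker message
    }
  }, 0L, intervalSeconds, TimeUnit.SECONDS)
}
```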

Author: mcheah <mcheah@palantir.com>

Closes #2828 from mccheah/reconnect-dead-workers and squashes the following commits:

83f8bc9 [mcheah] [SPARK-3736] More informative log message, and fixing some indentation.
fe0e02f [mcheah] [SPARK-3736] Moving reconnection logic to registerWithMaster().
94ddeca [mcheah] [SPARK-3736] Changing a log warning to a log info.
a698e35 [mcheah] [SPARK-3736] Addressing PR comment to make some defs private.
b9a3077 [mcheah] [SPARK-3736] Addressing PR comments related to reconnection.
2ad5ed5 [mcheah] [SPARK-3736] Cancel attempts to reconnect if the master changes.
b5b34af [mcheah] [SPARK-3736] Workers reconnect when disassociated from the master.
2014-10-20 11:35:18 -07:00
Takuya UESHIN ea054e1fc7 [SPARK-3986][SQL] Fix package names to fit their directory names.
Package names of 2 test suites are different from their directory names.
- `GeneratedEvaluationSuite`
- `GeneratedMutableEvaluationSuite`

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #2835 from ueshin/issues/SPARK-3986 and squashes the following commits:

fa2cc05 [Takuya UESHIN] Fix package names to fit their directory names.
2014-10-20 11:31:51 -07:00
GuoQiang Li 51afde9d8b [SPARK-4010][Web UI]Spark UI returns 500 in yarn-client mode
This problem was caused by #1966.
CC YanTangZhai andrewor14

Author: GuoQiang Li <witgo@qq.com>

Closes #2858 from witgo/SPARK-4010 and squashes the following commits:

9866fbf [GuoQiang Li] Spark UI returns 500 in yarn-client mode
2014-10-20 11:03:53 -07:00
jerryshao c7aeecd08f [SPARK-3948][Shuffle]Fix stream corruption bug in sort-based shuffle
A kernel 2.6.32 bug leads to unexpected behavior of transferTo in copyStream, which corrupts the shuffle output file in sort-based shuffle and in turn introduces PARSING_ERROR(2), deserialization errors, or offset-out-of-range errors. This is fixed here by adding an append flag and some position-checking code. Details can be seen in [SPARK-3948](https://issues.apache.org/jira/browse/SPARK-3948).
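
A hedged sketch of the two safeguards described here, append mode plus a position check (the real change is in Spark's copyStream utility and is additionally guarded by a configuration flag):

```
import java.io.{File, FileInputStream, FileOutputStream, IOException}

def appendFileTo(src: File, dest: File): Unit = {
  val in = new FileInputStream(src).getChannel
  val out = new FileOutputStream(dest, true).getChannel   // append = true
  try {
    val initialPos = out.position()
    val copied = in.transferTo(0, in.size(), out)
    // Guard against the kernel 2.6.32 transferTo misbehaviour by checking that
    // the destination really advanced by the number of bytes reported.
    if (out.position() != initialPos + copied) {
      throw new IOException(s"Unexpected position after transferTo: ${out.position()}")
    }
  } finally {
    in.close()
    out.close()
  }
}
```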

Author: jerryshao <saisai.shao@intel.com>

Closes #2824 from jerryshao/SPARK-3948 and squashes the following commits:

be0533a [jerryshao] Address the comments
a82b184 [jerryshao] add configuration to control the NIO way of copying stream
e17ada2 [jerryshao] Fix kernel 2.6.32 bug led unexpected behavior of transferTo
2014-10-20 10:20:21 -07:00
Josh Rosen d1966f3a8b [SPARK-3902] [SPARK-3590] Stabilize AsyncRDDActions and add Java API
This PR adds a Java API for AsyncRDDActions and promotes the API from `Experimental` to stable.

Author: Josh Rosen <joshrosen@apache.org>
Author: Josh Rosen <joshrosen@databricks.com>

Closes #2760 from JoshRosen/async-rdd-actions-in-java and squashes the following commits:

0d45fbc [Josh Rosen] Whitespace fix.
ad3ae53 [Josh Rosen] Merge remote-tracking branch 'origin/master' into async-rdd-actions-in-java
c0153a5 [Josh Rosen] Remove unused variable.
e8e2867 [Josh Rosen] Updates based on Marcelo's review feedback
7a1417f [Josh Rosen] Removed unnecessary java.util import.
6f8f6ac [Josh Rosen] Fix import ordering.
ff28e49 [Josh Rosen] Add MiMa excludes and fix a scalastyle error.
346e46e [Josh Rosen] [SPARK-3902] Stabilize AsyncRDDActions; add Java API.
2014-10-19 20:02:31 -07:00
Josh Rosen 7e63bb49c5 [SPARK-2546] Clone JobConf for each task (branch-1.0 / 1.1 backport)
This patch attempts to fix SPARK-2546 in `branch-1.0` and `branch-1.1`.  The underlying problem is that thread-safety issues in Hadoop Configuration objects may cause Spark tasks to get stuck in infinite loops.  The approach taken here is to clone a new copy of the JobConf for each task rather than sharing a single copy between tasks.  Note that there are still Configuration thread-safety issues that may affect the driver, but these seem much less likely to occur in practice and will be more complex to fix (see discussion on the SPARK-2546 ticket).

This cloning is guarded by a new configuration option (`spark.hadoop.cloneConf`) and is disabled by default in order to avoid unexpected performance regressions for workloads that are unaffected by the Configuration thread-safety issues.
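
The shape of the guarded cloning, as a hedged sketch with simplified names (the real change is inside HadoopRDD):

```
import org.apache.hadoop.mapred.JobConf

// With spark.hadoop.cloneConf enabled, every task works on its own copy of the
// JobConf; with the default (disabled), the single shared JobConf is reused.
def jobConfForTask(shared: JobConf, cloneConf: Boolean): JobConf =
  if (cloneConf) new JobConf(shared) else shared
```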

Author: Josh Rosen <joshrosen@apache.org>

Closes #2684 from JoshRosen/jobconf-fix-backport and squashes the following commits:

f14f259 [Josh Rosen] Add configuration option to control cloning of Hadoop JobConf.
b562451 [Josh Rosen] Remove unused jobConfCacheKey field.
dd25697 [Josh Rosen] [SPARK-2546] [1.0 / 1.1 backport] Clone JobConf for each task.

(cherry picked from commit 2cd40db2b3)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>

Conflicts:
	core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala
2014-10-19 00:35:05 -07:00
Davies Liu 05db2da7dc [SPARK-3952] [Streaming] [PySpark] add Python examples in Streaming Programming Guide
This adds Python examples to the Streaming Programming Guide.

Also add RecoverableNetworkWordCount example.

Author: Davies Liu <davies.liu@gmail.com>
Author: Davies Liu <davies@databricks.com>

Closes #2808 from davies/pyguide and squashes the following commits:

8d4bec4 [Davies Liu] update readme
26a7e37 [Davies Liu] fix format
3821c4d [Davies Liu] address comments, add missing file
7e4bb8a [Davies Liu] add Python examples in Streaming Programming Guide
2014-10-18 19:14:48 -07:00
Sean Owen f406a83918 SPARK-3926 [CORE] Result of JavaRDD.collectAsMap() is not Serializable
Make JavaPairRDD.collectAsMap result Serializable since Java Maps generally are

Author: Sean Owen <sowen@cloudera.com>

Closes #2805 from srowen/SPARK-3926 and squashes the following commits:

ecb78ee [Sean Owen] Fix conflict between java.io.Serializable and use of Scala's Serializable
f4717f9 [Sean Owen] Oops, fix compile problem
ae1b36f [Sean Owen] Expand to cover Maps returned from other Java API methods as well
51c26c2 [Sean Owen] Make JavaPairRDD.collectAsMap result Serializable since Java Maps generally are
2014-10-18 12:38:18 -07:00
Joseph K. Bradley 477c6481cc [SPARK-3934] [SPARK-3918] [mllib] Bug fixes for RandomForest, DecisionTree
SPARK-3934: When run with a mix of unordered categorical and continuous features, on multiclass classification, RandomForest fails. The bug is in the sanity checks in getFeatureOffset and getLeftRightFeatureOffsets, which use the wrong indices for checking whether features are unordered.
Fix: Remove the sanity checks since they are not really needed, and since they would require DTStatsAggregator to keep track of an extra set of indices (for the feature subset).

Added test to RandomForestSuite which failed with old version but now works.

SPARK-3918: Added baggedInput.unpersist at end of training.

Also:
* I removed DTStatsAggregator.isUnordered since it is no longer used.
* DecisionTreeMetadata: Added logWarning when maxBins is automatically reduced.
* Updated DecisionTreeRunner to explicitly fix the test data to have the same number of features as the training data.  This is a temporary fix which should eventually be replaced by pre-indexing both datasets.
* RandomForestModel: Updated toString to print total number of nodes in forest.
* Changed Predict class to be public DeveloperApi.  This was necessary to allow users to create their own trees by hand (for testing).

CC: mengxr  manishamde chouqin codedeft  Just notifying you of these small bug fixes.

Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>

Closes #2785 from jkbradley/dtrunner-update and squashes the following commits:

9132321 [Joseph K. Bradley] merged with master, fixed imports
9dbd000 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dtrunner-update
e116473 [Joseph K. Bradley] Changed Predict class to be public DeveloperApi.
f502e65 [Joseph K. Bradley] bug fix for SPARK-3934
7f3d60f [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dtrunner-update
ba567ab [Joseph K. Bradley] Changed DTRunner to load test data using same number of features as in training data.
4e88c1f [Joseph K. Bradley] changed RF toString to print total number of nodes
2014-10-17 15:02:57 -07:00
Daoyuan Wang 23f6171d63 [SPARK-3985] [Examples] fix file path using os.path.join
Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #2834 from adrian-wang/sqlpypath and squashes the following commits:

da7aa95 [Daoyuan Wang] fix file path using path.join
2014-10-17 14:49:44 -07:00
Michael Armbrust adcb7d3350 [SPARK-3855][SQL] Preserve the result attribute of python UDFs though transformations
In the current implementation it was possible for the reference to change after analysis.

Author: Michael Armbrust <michael@databricks.com>

Closes #2717 from marmbrus/pythonUdfResults and squashes the following commits:

da14879 [Michael Armbrust] Fix test
6343bcb [Michael Armbrust] add test
9533286 [Michael Armbrust] Correctly preserve the result attribute of python UDFs though transformations
2014-10-17 14:12:07 -07:00
Marcelo Vanzin 803e7f0877 [SPARK-3979] [yarn] Use fs's default replication.
This avoids issues when HDFS is configured in a way that would not
allow the hardcoded default replication of "3".

Note: getDefaultReplication(Path) was added in 0.23.3, and the oldest
one available on Maven Central is 0.23.7, so I chose to not add code
to access that method via reflection.
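
A hedged one-liner of the idea (as noted above, getDefaultReplication(Path) needs Hadoop 0.23.3 or later):

```
import org.apache.hadoop.fs.{FileSystem, Path}

// Ask the destination file system instead of hardcoding a replication of 3.
def replicationFor(fs: FileSystem, dest: Path): Short = fs.getDefaultReplication(dest)
```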

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #2831 from vanzin/SPARK-3979 and squashes the following commits:

b0e3a97 [Marcelo Vanzin] [SPARK-3979] [yarn] Use fs's default replication.
2014-10-17 13:45:10 -05:00
likun c351862064 [SPARK-3935][Core] log the number of records that has been written
There is an unused variable (count) in saveAsHadoopDataset in PairRDDFunctions.scala. The initial idea of this variable seems to be to count the number of records, so I am adding a log statement to log the number of records that have been written to the writer.

Author: likun <jacky.likun@huawei.com>
Author: jackylk <jacky.likun@huawei.com>

Closes #2791 from jackylk/SPARK-3935 and squashes the following commits:

a874047 [jackylk] removing the unused variable in PairRddFunctions.scala
3bf43c7 [likun] log the number of records has been written
2014-10-17 10:33:45 -07:00
Shivaram Venkataraman e678b9f02a [SPARK-3973] Print call site information for broadcasts
It's hard to debug which broadcast variables refer to what in a big codebase. Printing call site information helps in debugging.

Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

Closes #2829 from shivaram/spark-broadcast-print and squashes the following commits:

cd6dbdf [Shivaram Venkataraman] Print call site information for broadcasts
2014-10-16 19:43:33 -07:00
yantangzhai dedace83f3 [SPARK-3067] JobProgressPage could not show Fair Scheduler Pools section sometimes
The JobProgressPage sometimes fails to show the Fair Scheduler Pools section.
SparkContext starts the web UI and then calls postEnvironmentUpdate. If the JobProgressPage is accessed between the web UI starting and postEnvironmentUpdate, the lazy val isFairScheduler will be false, and the Fair Scheduler Pools section will never be displayed after that.
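
A minimal self-contained illustration of the race (not Spark code): a lazy val caches whatever it sees on first access, so reading it before the environment update pins the wrong answer.

```
var schedulingMode: Option[String] = None                      // set by a later update
lazy val isFairScheduler = schedulingMode.exists(_ == "FAIR")  // evaluated once, then cached

println(isFairScheduler)        // read too early => false, and the result is memoized
schedulingMode = Some("FAIR")   // the "postEnvironmentUpdate" arrives afterwards
println(isFairScheduler)        // still false: the lazy val never re-evaluates
```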

Author: yantangzhai <tyz0303@163.com>
Author: YanTangZhai <hakeemzhai@tencent.com>

Closes #1966 from YanTangZhai/SPARK-3067 and squashes the following commits:

d4323f8 [yantangzhai] update [SPARK-3067] JobProgressPage could not show Fair Scheduler Pools section sometimes
8a00106 [YanTangZhai] Merge pull request #6 from apache/master
b6391cc [yantangzhai] revert [SPARK-3067] JobProgressPage could not show Fair Scheduler Pools section sometimes
d2226cd [yantangzhai] [SPARK-3067] JobProgressPage could not show Fair Scheduler Pools section sometimes
cbcba66 [YanTangZhai] Merge pull request #3 from apache/master
aac7f7b [yantangzhai] [SPARK-3067] JobProgressPage could not show Fair Scheduler Pools section sometimes
cdef539 [YanTangZhai] Merge pull request #1 from apache/master
2014-10-16 19:27:19 -07:00
zsxwing 56fd34af52 [SPARK-3741] Add afterExecute for handleConnectExecutor
Sorry. I found that I forgot to add `afterExecute` for `handleConnectExecutor` in #2593.

Author: zsxwing <zsxwing@gmail.com>

Closes #2794 from zsxwing/SPARK-3741 and squashes the following commits:

a0bc4dd [zsxwing] Add afterExecute for handleConnectExecutor
2014-10-16 19:22:02 -07:00
WangTaoTheTonic e7f4ea8a52 [SPARK-3890][Docs]remove redundant spark.executor.memory in doc
This was introduced in f7e79bc42c; I'm not sure why we need spark.executor.memory twice here.

Author: WangTaoTheTonic <barneystinson@aliyun.com>
Author: WangTao <barneystinson@aliyun.com>

Closes #2745 from WangTaoTheTonic/redundantconfig and squashes the following commits:

e7564dc [WangTao] too long line
fdbdb1f [WangTaoTheTonic] trivial workaround
d06b6e5 [WangTaoTheTonic] remove redundant spark.executor.memory in doc
2014-10-16 19:12:57 -07:00
Zhang, Liye 642b246beb [SPARK-3941][CORE] _remainingmem should not increase twice when updateBlockInfo
In BlockManagerMasterActor, _remainingMem would be increased by memSize twice in updateBlockInfo when the new storage level is invalid and the old storage level uses memory. Also, _remainingMem should be increased by the original memory size instead of the new memSize.

Author: Zhang, Liye <liye.zhang@intel.com>

Closes #2792 from liyezhang556520/spark-3941-remainMem and squashes the following commits:

3d487cc [Zhang, Liye] make the code concise
0380a32 [Zhang, Liye] [SPARK-3941][CORE] _remainingmem should not increase twice when updateBlockInfo
2014-10-16 19:07:37 -07:00
Kun Li be2ec4a91d [SQL]typo in HiveFromSpark
Author: Kun Li <jacky.likun@gmail.com>

Closes #2809 from jackylk/patch-1 and squashes the following commits:

46c926b [Kun Li] typo in HiveFromSpark
2014-10-16 19:00:10 -07:00
Aaron Davidson 7f7b50ed9d [SPARK-3923] Increase Akka heartbeat pause above heartbeat interval
Something about the 2.3.4 upgrade seems to have made the issue manifest where all the services disconnect from each other after exactly 1000 seconds (which is the heartbeat interval). [This post](https://groups.google.com/forum/#!topic/akka-user/X3xzpTCbEFs) suggests that heartbeat pause should be greater than heartbeat interval, and increasing the pause from 600s to 6000s seems to have rectified the issue. My current cluster has now exceeded 1400s of uptime without failure!

I do not know why this fixed it, because the threshold we have set for the failure detector is the exponent of a timeout, and 300 is extremely large. Perhaps the default failure detector changed in 2.3.4 and now ignores threshold.
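
For reference, a hedged sketch of the corresponding settings (the property names follow the spark.akka.* convention of that era and are an assumption here; the values come from the description above):

```
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.akka.heartbeat.interval", "1000")   // seconds; unchanged
  .set("spark.akka.heartbeat.pauses", "6000")     // raised so the pause exceeds the interval
```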

Author: Aaron Davidson <aaron@databricks.com>

Closes #2784 from aarondav/fix-timeout and squashes the following commits:

bd1151a [Aaron Davidson] Increase pause, don't decrease interval
9cb0372 [Aaron Davidson] [SPARK-3923] Decrease Akka heartbeat interval below heartbeat pause
2014-10-16 18:58:18 -07:00
Prashant Sharma 2fe0ba9561 SPARK-3874: Provide stable TaskContext API
This is a small number of clean-up changes on top of #2782. Closes #2782.

Author: Prashant Sharma <prashant.s@imaginea.com>
Author: Patrick Wendell <pwendell@gmail.com>

Closes #2803 from pwendell/pr-2782 and squashes the following commits:

56d5b7a [Patrick Wendell] Minor clean-up
44089ec [Patrick Wendell] Clean-up the TaskContext API.
ed551ce [Prashant Sharma] Fixed a typo
df261d0 [Prashant Sharma] Josh's suggestion
facf3b1 [Prashant Sharma] Fixed the mima issue.
7ecc2fe [Prashant Sharma] CR, Moved implementations to TaskContextImpl
bbd9e05 [Prashant Sharma] adding missed out files to git.
ef633f5 [Prashant Sharma] SPARK-3874, Provide stable TaskContext API
2014-10-16 21:38:45 -04:00
Cheng Lian 99e416b6d6 [SQL] Fixes the race condition that may cause test failure
The removed `Future` was used to end the test case as soon as the Spark SQL CLI process exits. When the process exits prematurely, this mechanism prevents the test case to wait until timeout. But it also creates a race condition: when `foundAllExpectedAnswers.tryFailure` is called, there are chances that the last expected output line of the CLI process hasn't been caught by the main logics of the test code, thus fails the test case.

Removing this `Future` doesn't affect correctness.

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #2823 from liancheng/clean-clisuite and squashes the following commits:

489a97c [Cheng Lian] Fixes the race condition that may cause test failure
2014-10-16 16:15:55 -07:00
Davies Liu 091d32c52e [SPARK-3971] [MLLib] [PySpark] hotfix: Customized pickler should work in cluster mode
A customized pickler should be registered before unpickling, but in the executor there is no way to register the picklers before running the tasks.

So, we need to register the picklers in the tasks themselves: duplicate javaToPython() and pythonToJava() in MLlib and call SerDe.initialize() before pickling or unpickling.

Author: Davies Liu <davies.liu@gmail.com>

Closes #2830 from davies/fix_pickle and squashes the following commits:

0c85fb9 [Davies Liu] revert the privacy change
6b94e15 [Davies Liu] use JavaConverters instead of JavaConversions
0f02050 [Davies Liu] hotfix: Customized pickler does not work in cluster
2014-10-16 14:56:50 -07:00
Shiti 4c589cac44 [SPARK-3944][Core] Code re-factored as suggested
Author: Shiti <ssaxena.ece@gmail.com>

Closes #2810 from Shiti/master and squashes the following commits:

051d82f [Shiti] setting the default value of uri scheme to "file"  where matching "file" or None yields the same result
2014-10-16 10:52:06 -07:00
prudhvi 044583a241 [Core] Upgrading ScalaStyle version to 0.5 and removing SparkSpaceAfterCommentStartChecker.
Author: prudhvi <prudhvi953@gmail.com>

Closes #2799 from prudhvije/ScalaStyle/space-after-comment-start and squashes the following commits:

fc263a1 [prudhvi] [Core] Using scalastyle to check the space after comment start
2014-10-16 02:05:44 -04:00