Commit graph

8753 commits

Author SHA1 Message Date
Cheng Hao ca83f1e2c4 [SPARK-2917] [SQL] Avoid table creation in logical plan analyzing for CTAS
Author: Cheng Hao <hao.cheng@intel.com>

Closes #1846 from chenghao-intel/ctas and squashes the following commits:

56a0578 [Cheng Hao] remove the unused imports
9a57abc [Cheng Hao] Avoid table creation in logical plan analyzing
2014-09-11 11:57:01 -07:00
Davies Liu 1ef656ea85 [SPARK-3047] [PySpark] add an option to use str in textFileRDD
str is much efficient than unicode (both CPU and memory), it'e better to use str in textFileRDD. In order to keep compatibility, use unicode by default. (Maybe change it in the future).

use_unicode=True:

daviesliudm:~/work/spark$ time python wc.py
(u'./universe/spark/sql/core/target/java/org/apache/spark/sql/execution/ExplainCommand$.java', 7776)

real	2m8.298s
user	0m0.185s
sys	0m0.064s

use_unicode=False

daviesliudm:~/work/spark$ time python wc.py
('./universe/spark/sql/core/target/java/org/apache/spark/sql/execution/ExplainCommand$.java', 7776)

real	1m26.402s
user	0m0.182s
sys	0m0.062s

We can see that it got 32% improvement!

Author: Davies Liu <davies.liu@gmail.com>

Closes #1951 from davies/unicode and squashes the following commits:

8352d57 [Davies Liu] update version number
a286f2f [Davies Liu] rollback loads()
85246e5 [Davies Liu] add docs for use_unicode
a0295e1 [Davies Liu] add an option to use str in textFile()
2014-09-11 11:50:36 -07:00
Chris Cope ed1980ffa9 [SPARK-2140] Updating heap memory calculation for YARN stable and alpha.
Updated pull request, reflecting YARN stable and alpha states. I am getting intermittent test failures on my own test infrastructure. Is that tracked anywhere yet?

Author: Chris Cope <ccope@resilientscience.com>

Closes #2253 from copester/master and squashes the following commits:

5ad89da [Chris Cope] [SPARK-2140] Removing calculateAMMemory functions since they are no longer needed.
52b4e45 [Chris Cope] [SPARK-2140] Updating heap memory calculation for YARN stable and alpha.
2014-09-11 08:13:07 -05:00
Aaron Staple c27718f376 [SPARK-2781][SQL] Check resolution of LogicalPlans in Analyzer.
LogicalPlan contains a ‘resolved’ attribute indicating that all of its execution requirements have been resolved. This attribute is not checked before query execution. The analyzer contains a step to check that all Expressions are resolved, but this is not equivalent to checking all LogicalPlans. In particular, the Union plan’s implementation of ‘resolved’ verifies that the types of its children’s columns are compatible. Because the analyzer does not check that a Union plan is resolved, it is possible to execute a Union plan that outputs different types in the same column.  See SPARK-2781 for an example.

This patch adds two checks to the analyzer’s CheckResolution rule. First, each logical plan is checked to see if it is not resolved despite its children being resolved. This allows the ‘problem’ unresolved plan to be included in the TreeNodeException for reporting. Then as a backstop the root plan is checked to see if it is resolved, which recursively checks that the entire plan tree is resolved. Note that the resolved attribute is implemented recursively, and this patch also explicitly checks the resolved attribute on each logical plan in the tree. I assume the query plan trees will not be large enough for this redundant checking to meaningfully impact performance.

Because this patch starts validating that LogicalPlans are resolved before execution, I had to fix some cases where unresolved plans were passing through the analyzer as part of the implementation of the hive query system. In particular, HiveContext applies the CreateTables and PreInsertionCasts, and ExtractPythonUdfs rules manually after the analyzer runs. I moved these rules to the analyzer stage (for hive queries only), in the process completing a code TODO indicating the rules should be moved to the analyzer.

It’s worth noting that moving the CreateTables rule means introducing an analyzer rule with a significant side effect - in this case the side effect is creating a hive table. The rule will only attempt to create a table once even if its batch is executed multiple times, because it converts the InsertIntoCreatedTable plan it matches against into an InsertIntoTable. Additionally, these hive rules must be added to the Resolution batch rather than as a separate batch because hive rules rules may be needed to resolve non-root nodes, leaving the root to be resolved on a subsequent batch iteration. For example, the hive compatibility test auto_smb_mapjoin_14, and others, make use of a query plan where the root is a Union and its children are each a hive InsertIntoTable.

Mixing the custom hive rules with standard analyzer rules initially resulted in an additional failure because of policy differences between spark sql and hive when casting a boolean to a string. Hive casts booleans to strings as “true” / “false” while spark sql casts booleans to strings as “1” / “0” (causing the cast1.q test to fail). This behavior is a result of the BooleanCasts rule in HiveTypeCoercion.scala, and from looking at the implementation of BooleanCasts I think converting to to “1”/“0” is potentially a programming mistake. (If the BooleanCasts rule is disabled, casting produces “true”/“false” instead.) I believe “true” / “false” should be the behavior for spark sql - I changed the behavior so bools are converted to “true”/“false” to be consistent with hive, and none of the existing spark tests failed.

Finally, in some initial testing with hive it appears that an implicit type coercion of boolean to string results in a lowercase string, e.g. CONCAT( TRUE, “” ) -> “true” while an explicit cast produces an all caps string, e.g. CAST( TRUE AS STRING ) -> “TRUE”.  The change I’ve made just converts to lowercase strings in all cases.  I believe it is at least more correct than the existing spark sql implementation where all Cast expressions become “1” / “0”.

Author: Aaron Staple <aaron.staple@gmail.com>

Closes #1706 from staple/SPARK-2781 and squashes the following commits:

32683c4 [Aaron Staple] Fix compilation failure due to merge.
7c77fda [Aaron Staple] Move ExtractPythonUdfs to Analyzer's extendedRules in HiveContext.
d49bfb3 [Aaron Staple] Address review comments.
915b690 [Aaron Staple] Fix merge issue causing compilation failure.
701dcd2 [Aaron Staple] [SPARK-2781][SQL] Check resolution of LogicalPlans in Analyzer.
2014-09-10 21:01:53 -07:00
Michael Armbrust f92cde24e8 [SPARK-3447][SQL] Remove explicit conversion with JListWrapper to avoid NPE
Author: Michael Armbrust <michael@databricks.com>

Closes #2323 from marmbrus/kryoJListNPE and squashes the following commits:

9634f11 [Michael Armbrust] Rollback JSON RDD changes
4d4d93c [Michael Armbrust] Merge remote-tracking branch 'origin/master' into kryoJListNPE
646976b [Michael Armbrust] Fix JSON RDD Conversion too
59065bc [Michael Armbrust] Remove explicit conversion to avoid NPE
2014-09-10 20:59:40 -07:00
Michael Armbrust 84e2c8bfe4 [SQL] Add test case with workaround for reading partitioned Avro files
In order to read from partitioned Avro files we need to also set the `SERDEPROPERTIES` since `TBLPROPERTIES` are not passed to the initialization.  This PR simply adds a test to make sure we don't break this workaround.

Author: Michael Armbrust <michael@databricks.com>

Closes #2340 from marmbrus/avroPartitioned and squashes the following commits:

6b969d6 [Michael Armbrust] fix style
fea2124 [Michael Armbrust] Add test case with workaround for reading partitioned avro files.
2014-09-10 20:57:38 -07:00
qiping.lqp 79cdb9b64a [SPARK-2207][SPARK-3272][MLLib]Add minimum information gain and minimum instances per node as training parameters for decision tree.
These two parameters can act as early stop rules to do pre-pruning. When a split cause cause left or right child to have less than `minInstancesPerNode` or has less information gain than `minInfoGain`, current node will not be split by this split.

When there is no possible splits that satisfy requirements, there is no useful information gain stats, but we still need to calculate the predict value for current node. So I separated calculation of predict from calculation of information gain, which can also save computation when the number of possible splits is large. Please see [SPARK-3272](https://issues.apache.org/jira/browse/SPARK-3272) for more details.

CC: mengxr manishamde jkbradley, please help me review this, thanks.

Author: qiping.lqp <qiping.lqp@alibaba-inc.com>
Author: chouqin <liqiping1991@gmail.com>

Closes #2332 from chouqin/dt-preprune and squashes the following commits:

f1d11d1 [chouqin] fix typo
c7ebaf1 [chouqin] fix typo
39f9b60 [chouqin] change edge `minInstancesPerNode` to 2 and add one more test
0278a11 [chouqin] remove `noSplit` and set `Predict` private to tree
d593ec7 [chouqin] fix docs and change minInstancesPerNode to 1
efcc736 [qiping.lqp] fix bug
10b8012 [qiping.lqp] fix style
6728fad [qiping.lqp] minor fix: remove empty lines
bb465ca [qiping.lqp] Merge branch 'master' of https://github.com/apache/spark into dt-preprune
cadd569 [qiping.lqp] add api docs
46b891f [qiping.lqp] fix bug
e72c7e4 [qiping.lqp] add comments
845c6fa [qiping.lqp] fix style
f195e83 [qiping.lqp] fix style
987cbf4 [qiping.lqp] fix bug
ff34845 [qiping.lqp] separate calculation of predict of node from calculation of info gain
ac42378 [qiping.lqp] add min info gain and min instances per node parameters in decision tree
2014-09-10 15:37:10 -07:00
WangTaoTheTonic 558962a83f [SPARK-3411] Improve load-balancing of concurrently-submitted drivers across workers
If the waiting driver array is too big, the drivers in it will be dispatched to the first worker we get(if it has enough resources), with or without the Randomization.

We should do randomization every time we dispatch a driver, in order to better balance drivers.

Author: WangTaoTheTonic <barneystinson@aliyun.com>
Author: WangTao <barneystinson@aliyun.com>

Closes #1106 from WangTaoTheTonic/fixBalanceDrivers and squashes the following commits:

d1a928b [WangTaoTheTonic] Minor adjustment
b6560cf [WangTaoTheTonic] solve the shuffle problem for HashSet
f674e59 [WangTaoTheTonic] add comment and minor fix
2835929 [WangTao] solve the failed test and avoid filtering
2ca3091 [WangTao] fix checkstyle
bc91bb1 [WangTao] Avoid shuffle every time we schedule the driver using round robin
bbc7087 [WangTaoTheTonic] Optimize the schedule in Master
2014-09-10 13:06:47 -07:00
Wenchen Fan e4f4886d71 [SPARK-2096][SQL] Correctly parse dot notations
First let me write down the current `projections` grammar of spark sql:

    expression                : orExpression
    orExpression              : andExpression {"or" andExpression}
    andExpression             : comparisonExpression {"and" comparisonExpression}
    comparisonExpression      : termExpression | termExpression "=" termExpression | termExpression ">" termExpression | ...
    termExpression            : productExpression {"+"|"-" productExpression}
    productExpression         : baseExpression {"*"|"/"|"%" baseExpression}
    baseExpression            : expression "[" expression "]" | ... | ident | ...
    ident                     : identChar {identChar | digit} | delimiters | ...
    identChar                 : letter | "_" | "."
    delimiters                : "," | ";" | "(" | ")" | "[" | "]" | ...
    projection                : expression [["AS"] ident]
    projections               : projection { "," projection}

For something like `a.b.c[1]`, it will be parsed as:
<img src="http://img51.imgspice.com/i/03008/4iltjsnqgmtt_t.jpg" border=0>
But for something like `a[1].b`, the current grammar can't parse it correctly.
A simple solution is written in `ParquetQuerySuite#NestedSqlParser`, changed grammars are:

    delimiters                : "." | "," | ";" | "(" | ")" | "[" | "]" | ...
    identChar                 : letter | "_"
    baseExpression            : expression "[" expression "]" | expression "." ident | ... | ident | ...
This works well, but can't cover some corner case like `select t.a.b from table as t`:
<img src="http://img51.imgspice.com/i/03008/v2iau3hoxoxg_t.jpg" border=0>
`t.a.b` parsed as `GetField(GetField(UnResolved("t"), "a"), "b")` instead of `GetField(UnResolved("t.a"), "b")` using this new grammar.
However, we can't resolve `t` as it's not a filed, but the whole table.(if we could do this, then `select t from table as t` is legal, which is unexpected)
My solution is:

    dotExpressionHeader       : ident "." ident
    baseExpression            : expression "[" expression "]" | expression "." ident | ... | dotExpressionHeader  | ident | ...
I passed all test cases under sql locally and add a more complex case.
"arrayOfStruct.field1 to access all values of field1" is not supported yet. Since this PR has changed a lot of code, I will open another PR for it.
I'm not familiar with the latter optimize phase, please correct me if I missed something.

Author: Wenchen Fan <cloud0fan@163.com>
Author: Michael Armbrust <michael@databricks.com>

Closes #2230 from cloud-fan/dot and squashes the following commits:

e1a8898 [Wenchen Fan] remove support for arbitrary nested arrays
ee8a724 [Wenchen Fan] rollback LogicalPlan, support dot operation on nested array type
a58df40 [Michael Armbrust] add regression test for doubly nested data
16bc4c6 [Wenchen Fan] some enhance
95d733f [Wenchen Fan] split long line
dc31698 [Wenchen Fan] SPARK-2096 Correctly parse dot notations
2014-09-10 12:56:59 -07:00
Sandy Ryza 1f4a648d4e SPARK-1713. Use a thread pool for launching executors.
This patch copies the approach used in the MapReduce application master for launching containers.

Author: Sandy Ryza <sandy@cloudera.com>

Closes #663 from sryza/sandy-spark-1713 and squashes the following commits:

036550d [Sandy Ryza] SPARK-1713. [YARN] Use a threadpool for launching executor containers
2014-09-10 14:34:24 -05:00
Josh Rosen 26503fdf20 [HOTFIX] Fix scala style issue introduced by #2276. 2014-09-10 12:02:23 -07:00
Daoyuan Wang f0c87dc86a [SPARK-3363][SQL] Type Coercion should promote null to all other types.
Type Coercion should support every type to have null value

Author: Daoyuan Wang <daoyuan.wang@intel.com>
Author: Michael Armbrust <michael@databricks.com>

Closes #2246 from adrian-wang/spark3363-0 and squashes the following commits:

c6241de [Daoyuan Wang] minor code clean
595b417 [Daoyuan Wang] Merge pull request #2 from marmbrus/pr/2246
832e640 [Michael Armbrust] reduce code duplication
ef6f986 [Daoyuan Wang] make double boolean miss in jsonRDD compatibleType
c619f0a [Daoyuan Wang] Type Coercion should support every type to have null value
2014-09-10 10:48:36 -07:00
Daoyuan Wang a0283300c4 [SPARK-3362][SQL] Fix resolution for casewhen with nulls.
Current implementation will ignore else val type.

Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #2245 from adrian-wang/casewhenbug and squashes the following commits:

3332f6e [Daoyuan Wang] remove wrong comment
83b536c [Daoyuan Wang] a comment to trigger retest
d7315b3 [Daoyuan Wang] code improve
eed35fc [Daoyuan Wang] bug in casewhen resolve
2014-09-10 10:45:24 -07:00
Benoy Antony 6f7a76838f [SPARK-3286] - Cannot view ApplicationMaster UI when Yarn’s url scheme i...
...s https

Author: Benoy Antony <benoy@apache.org>

Closes #2276 from benoyantony/SPARK-3286 and squashes the following commits:

c3d51ee [Benoy Antony] Use address with scheme, but Allpha version removes the scheme
e82f94e [Benoy Antony] Use address with scheme, but Allpha version removes the scheme
92127c9 [Benoy Antony] rebasing from master
450c536 [Benoy Antony] [SPARK-3286] - Cannot view ApplicationMaster UI when Yarn’s url scheme is https
f060c02 [Benoy Antony] [SPARK-3286] - Cannot view ApplicationMaster UI when Yarn’s url scheme is https
2014-09-10 11:59:39 -05:00
Eric Liang b734ed0c22 [SPARK-3395] [SQL] DSL sometimes incorrectly reuses attribute ids, breaking queries
This resolves https://issues.apache.org/jira/browse/SPARK-3395

Author: Eric Liang <ekl@google.com>

Closes #2266 from ericl/spark-3395 and squashes the following commits:

7f2b6f0 [Eric Liang] add regression test
05bd1e4 [Eric Liang] in the dsl, create a new schema instance in each applySchema
2014-09-09 23:47:12 -07:00
Matthew Farrellee 25b5b867d5 [SPARK-3458] enable python "with" statements for SparkContext
allow for best practice code,

```
try:
  sc = SparkContext()
  app(sc)
finally:
  sc.stop()
```

to be written using a "with" statement,

```
with SparkContext() as sc:
  app(sc)
```

Author: Matthew Farrellee <matt@redhat.com>

Closes #2335 from mattf/SPARK-3458 and squashes the following commits:

5b4e37c [Matthew Farrellee] [SPARK-3458] enable python "with" statements for SparkContext
2014-09-09 18:54:54 -07:00
Cheng Lian c110614b33 [SPARK-3448][SQL] Check for null in SpecificMutableRow.update
`SpecificMutableRow.update` doesn't check for null, and breaks existing `MutableRow` contract.

The tricky part here is that for performance considerations, the `update` method of all subclasses of `MutableValue` doesn't check for null and sets the null bit to false.

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #2325 from liancheng/check-for-null and squashes the following commits:

9366c44 [Cheng Lian] Check for null in SpecificMutableRow.update
2014-09-09 18:39:33 -07:00
xinyunh 07ee4a28c3 [SPARK-3176] Implement 'ABS and 'LAST' for sql
Add support for the mathematical function"ABS" and the analytic function "last" to return a subset of the rows satisfying a query within spark sql. Test-cases included.

Author: xinyunh <xinyun.huang@huawei.com>
Author: bomeng <golf8lover>

Closes #2099 from xinyunh/sqlTest and squashes the following commits:

71d15e7 [xinyunh] remove POWER part
8843643 [xinyunh] fix the code style issue
39f0309 [bomeng] Modify the code of POWER and ABS. Move them to the file arithmetic
ff8e51e [bomeng] add abs() function support
7f6980a [xinyunh] fix the bug in 'Last' component
b3df91b [xinyunh] add 'Last' component
2014-09-09 16:55:39 -07:00
Prashant Sharma 02b5ac7191 Minor - Fix trivial compilation warnings.
Author: Prashant Sharma <prashant.s@imaginea.com>

Closes #2331 from ScrapCodes/compilation-warn and squashes the following commits:

44c1e76 [Prashant Sharma] Minor - Fix trivial compilation warnings.
2014-09-09 14:42:28 -07:00
scwf 26862337c9 [SPARK-3193]output errer info when Process exit code is not zero in test suite
https://issues.apache.org/jira/browse/SPARK-3193
I noticed that sometimes pr tests failed due to the Process exitcode != 0,refer to
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18688/consoleFull
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19118/consoleFull

[info] SparkSubmitSuite:
[info] - prints usage on empty input
[info] - prints usage with only --help
[info] - prints error with unrecognized options
[info] - handle binary specified but not class
[info] - handles arguments with --key=val
[info] - handles arguments to user program
[info] - handles arguments to user program with name collision
[info] - handles YARN cluster mode
[info] - handles YARN client mode
[info] - handles standalone cluster mode
[info] - handles standalone client mode
[info] - handles mesos client mode
[info] - handles confs with flag equivalents
[info] - launch simple application with spark-submit *** FAILED ***
[info]   org.apache.spark.SparkException: Process List(./bin/spark-submit, --class, org.apache.spark.deploy.SimpleApplicationTest, --name, testApp, --master, local, file:/tmp/1408854098404-0/testJar-1408854098404.jar) exited with code 1
[info]   at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:872)
[info]   at org.apache.spark.deploy.SparkSubmitSuite.runSparkSubmit(SparkSubmitSuite.scala:311)
[info]   at org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply$mcV$sp(SparkSubmitSuite.scala:291)
[info]   at org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply(SparkSubmitSuite.scala:284)
[info]   at org.apacSpark assembly has been built with Hive, including Datanucleus jars on classpath

this PR output the process error info when failed, it can be helpful for diagnosis.

Author: scwf <wangfei1@huawei.com>

Closes #2108 from scwf/output-test-error-info and squashes the following commits:

0c48082 [scwf] minor fix according to comments
563fde1 [scwf] output errer info when Process exitcode not zero
2014-09-09 11:57:01 -07:00
Sean Owen f0f1ba09b1 SPARK-3404 [BUILD] SparkSubmitSuite fails with "spark-submit exits with code 1"
This fixes the `SparkSubmitSuite` failure by setting `<spark.ui.port>0</spark.ui.port>` in the Maven build, to match the SBT build. This avoids a port conflict which causes failures.

(This also updates the `scalatest` plugin off of a release candidate, to the identical final release.)

Author: Sean Owen <sowen@cloudera.com>

Closes #2328 from srowen/SPARK-3404 and squashes the following commits:

512d782 [Sean Owen] Set spark.ui.port=0 in Maven scalatest config to match SBT build and avoid SparkSubmitSuite failure due to port conflict
2014-09-09 10:24:00 -07:00
Sandy Ryza 88547a09fc SPARK-3422. JavaAPISuite.getHadoopInputSplits isn't used anywhere.
Author: Sandy Ryza <sandy@cloudera.com>

Closes #2324 from sryza/sandy-spark-3422 and squashes the following commits:

6446175 [Sandy Ryza] SPARK-3422. JavaAPISuite.getHadoopInputSplits isn't used anywhere.
2014-09-09 10:23:28 -07:00
Cheng Hao 1e03cf79f8 [SPARK-3455] [SQL] **HOT FIX** Fix the unit test failure
Unit test failed due to can not resolve the attribute references. Temporally disable this test case for a quick fixing, otherwise it will block the others.

Author: Cheng Hao <hao.cheng@intel.com>

Closes #2334 from chenghao-intel/unit_test_failure and squashes the following commits:

661f784 [Cheng Hao] temporally disable the failed test case
2014-09-09 10:18:25 -07:00
Mario Pastorelli c419e4f1bd [Docs] actorStream storageLevel default is MEMORY_AND_DISK_SER_2
Comment of the storageLevel param of actorStream says that it defaults to memory-only while the default is MEMORY_AND_DISK_SER_2.

Author: Mario Pastorelli <pastorelli.mario@gmail.com>

Closes #2319 from melrief/master and squashes the following commits:

7b6ce68 [Mario Pastorelli] [Docs] actorStream storageLevel default is MEMORY_AND_DISK_SER_2
2014-09-09 00:51:28 -07:00
Cheng Lian ce5cb32587 [Build] Removed -Phive-thriftserver since this profile has been removed
Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #2269 from liancheng/clean-run-tests-profile and squashes the following commits:

08617bd [Cheng Lian] Removed -Phive-thriftserver since this profile has been removed
2014-09-09 00:50:59 -07:00
Mark Hamstra 092e2f152f SPARK-2425 Don't kill a still-running Application because of some misbehaving Executors
Introduces a LOADING -> RUNNING ApplicationState transition and prevents Master from removing an Application with RUNNING Executors.

Two basic changes: 1) Instead of allowing MAX_NUM_RETRY abnormal Executor exits over the entire lifetime of the Application, allow that many since any Executor successfully began running the Application; 2) Don't remove the Application while Master still thinks that there are RUNNING Executors.

This should be fine as long as the ApplicationInfo doesn't believe any Executors are forever RUNNING when they are not.  I think that any non-RUNNING Executors will eventually no longer be RUNNING in Master's accounting, but another set of eyes should confirm that.  This PR also doesn't try to detect which nodes have gone rogue or to kill off bad Workers, so repeatedly failing Executors will continue to fail and fill up log files with failure reports as long as the Application keeps running.

Author: Mark Hamstra <markhamstra@gmail.com>

Closes #1360 from markhamstra/SPARK-2425 and squashes the following commits:

f099c0b [Mark Hamstra] Reuse appInfo
b2b7b25 [Mark Hamstra] Moved 'Application failed' logging
bdd0928 [Mark Hamstra] switched to string interpolation
1dd591b [Mark Hamstra] SPARK-2425 introduce LOADING -> RUNNING ApplicationState transition and prevent Master from removing Application with RUNNING Executors
2014-09-08 20:51:56 -07:00
William Benton 2b7ab814f9 [SPARK-3329][SQL] Don't depend on Hive SET pair ordering in tests.
This fixes some possible spurious test failures in `HiveQuerySuite` by comparing sets of key-value pairs as sets, rather than as lists.

Author: William Benton <willb@redhat.com>
Author: Aaron Davidson <aaron@databricks.com>

Closes #2220 from willb/spark-3329 and squashes the following commits:

3b3e205 [William Benton] Collapse collectResults case match in HiveQuerySuite
6525d8e [William Benton] Handle cases where SET returns Rows of (single) strings
cf11b0e [Aaron Davidson] Fix flakey HiveQuerySuite test
2014-09-08 19:29:23 -07:00
Cheng Lian dc1dbf206e [SPARK-3414][SQL] Stores analyzed logical plan when registering a temp table
Case insensitivity breaks when unresolved relation contains attributes with uppercase letters in their names, because we store unanalyzed logical plan when registering temp tables while the `CaseInsensitivityAttributeReferences` batch runs before the `Resolution` batch. To fix this issue, we need to store analyzed logical plan.

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #2293 from liancheng/spark-3414 and squashes the following commits:

d9fa1d6 [Cheng Lian] Stores analyzed logical plan when registering a temp table
2014-09-08 19:08:05 -07:00
William Benton ca0348e682 SPARK-3423: [SQL] Implement BETWEEN for SQLParser
This patch improves the SQLParser by adding support for BETWEEN conditions

Author: William Benton <willb@redhat.com>

Closes #2295 from willb/sql-between and squashes the following commits:

0016d30 [William Benton] Implement BETWEEN for SQLParser
2014-09-08 19:05:02 -07:00
Xiangrui Meng 50a4fa774a [SPARK-3443][MLLIB] update default values of tree:
Adjust the default values of decision tree, based on the memory requirement discussed in https://github.com/apache/spark/pull/2125 :

1. maxMemoryInMB: 128 -> 256
2. maxBins: 100 -> 32
3. maxDepth: 4 -> 5 (in some example code)

jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #2322 from mengxr/tree-defaults and squashes the following commits:

cda453a [Xiangrui Meng] fix tests
5900445 [Xiangrui Meng] update comments
8c81831 [Xiangrui Meng] update default values of tree:
2014-09-08 18:59:57 -07:00
Eric Liang 7db53391f1 [SPARK-3349][SQL] Output partitioning of limit should not be inherited from child
This resolves https://issues.apache.org/jira/browse/SPARK-3349

Author: Eric Liang <ekl@google.com>

Closes #2262 from ericl/spark-3349 and squashes the following commits:

3e1b05c [Eric Liang] add regression test
ac32723 [Eric Liang] make limit/takeOrdered output SinglePartition
2014-09-08 16:14:36 -07:00
Reynold Xin 08ce18881e [SPARK-3019] Pluggable block transfer interface (BlockTransferService)
This pull request creates a new BlockTransferService interface for block fetch/upload and refactors the existing ConnectionManager to implement BlockTransferService (NioBlockTransferService).

Most of the changes are simply moving code around. The main class to inspect is ShuffleBlockFetcherIterator.

Review guide:
- Most of the ConnectionManager code is now in network.cm package
- ManagedBuffer is a new buffer abstraction backed by several different implementations (file segment, nio ByteBuffer, Netty ByteBuf)
- BlockTransferService is the main internal interface introduced in this PR
- NioBlockTransferService implements BlockTransferService and replaces the old BlockManagerWorker
- ShuffleBlockFetcherIterator replaces the told BlockFetcherIterator to use the new interface

TODOs that should be separate PRs:
- Implement NettyBlockTransferService
- Finalize the API/semantics for ManagedBuffer.release()

Author: Reynold Xin <rxin@apache.org>

Closes #2240 from rxin/blockTransferService and squashes the following commits:

64cd9d7 [Reynold Xin] Merge branch 'master' into blockTransferService
1dfd3d7 [Reynold Xin] Limit the length of the FileInputStream.
1332156 [Reynold Xin] Fixed style violation from refactoring.
2960c93 [Reynold Xin] Added ShuffleBlockFetcherIteratorSuite.
e29c721 [Reynold Xin] Updated comment for ShuffleBlockFetcherIterator.
8a1046e [Reynold Xin] Code review feedback:
2c6b1e1 [Reynold Xin] Removed println in test cases.
2a907e4 [Reynold Xin] Merge branch 'master' into blockTransferService-merge
07ccf0d [Reynold Xin] Added init check to CMBlockTransferService.
98c668a [Reynold Xin] Added failure handling and fixed unit tests.
ae05fcd [Reynold Xin] Updated tests, although DistributedSuite is hanging.
d8d595c [Reynold Xin] Merge branch 'master' of github.com:apache/spark into blockTransferService
9ef279c [Reynold Xin] Initial refactoring to move ConnectionManager to use the BlockTransferService.
2014-09-08 15:59:20 -07:00
Matthew Rocklin 939a322c85 [SPARK-3417] Use new-style classes in PySpark
Tiny PR making SQLContext a new-style class.  This allows various type logic to work more effectively

```Python
In [1]: import pyspark

In [2]: pyspark.sql.SQLContext.mro()
Out[2]: [pyspark.sql.SQLContext, object]
```

Author: Matthew Rocklin <mrocklin@gmail.com>

Closes #2288 from mrocklin/sqlcontext-new-style-class and squashes the following commits:

4aadab6 [Matthew Rocklin] update other old-style classes
a2dc02f [Matthew Rocklin] pyspark.sql.SQLContext is new-style class
2014-09-08 15:45:36 -07:00
Henry Cook 26bc7655de [SQL] Minor edits to sql programming guide.
Author: Henry Cook <hcook@eecs.berkeley.edu>

Closes #2316 from hcook/sql-docs and squashes the following commits:

373f94b [Henry Cook] Minor edits to sql programming guide.
2014-09-08 14:56:37 -07:00
Matthew Farrellee 386bc24ebe Provide a default PYSPARK_PYTHON for python/run_tests
Without this the version of python used in the test is not
recorded. The error is,

   Testing with Python version:
   ./run-tests: line 57: --version: command not found

Author: Matthew Farrellee <matt@redhat.com>

Closes #2300 from mattf/master-fix-python-run-tests and squashes the following commits:

65a09f5 [Matthew Farrellee] Provide a default PYSPARK_PYTHON for python/run_tests
2014-09-08 12:37:52 -07:00
Sandy Ryza 16a73c2473 SPARK-2978. Transformation with MR shuffle semantics
I didn't add this to the transformations list in the docs because it's kind of obscure, but would be happy to do so if others think it would be helpful.

Author: Sandy Ryza <sandy@cloudera.com>

Closes #2274 from sryza/sandy-spark-2978 and squashes the following commits:

4a5332a [Sandy Ryza] Fix Java test
c04b447 [Sandy Ryza] Fix Python doc and add back deleted code
433ad5b [Sandy Ryza] Add Java test
4c25a54 [Sandy Ryza] Add s at the end and a couple other fixes
9b0ba99 [Sandy Ryza] Fix compilation
36e0571 [Sandy Ryza] Fix import ordering
48c12c2 [Sandy Ryza] Add Java version and additional doc
e5381cd [Sandy Ryza] Fix python style warnings
f147634 [Sandy Ryza] SPARK-2978. Transformation with MR shuffle semantics
2014-09-08 11:20:00 -07:00
Prashant Sharma e16a8e7db5 SPARK-3337 Paranoid quoting in shell to allow install dirs with spaces within.
...

Tested ! TBH, it isn't a great idea to have directory with spaces within. Because emacs doesn't like it then hadoop doesn't like it. and so on...

Author: Prashant Sharma <prashant.s@imaginea.com>

Closes #2229 from ScrapCodes/SPARK-3337/quoting-shell-scripts and squashes the following commits:

d4ad660 [Prashant Sharma] SPARK-3337 Paranoid quoting in shell to allow install dirs with spaces within.
2014-09-08 10:24:15 -07:00
Joseph K. Bradley 711356b422 [SPARK-3086] [SPARK-3043] [SPARK-3156] [mllib] DecisionTree aggregation improvements
Summary:
1. Variable numBins for each feature [SPARK-3043]
2. Reduced data reshaping in aggregation [SPARK-3043]
3. Choose ordering for ordered categorical features adaptively [SPARK-3156]
4. Changed nodes to use 1-indexing [SPARK-3086]
5. Small clean-ups

Note: This PR looks bigger than it is since I moved several functions from inside findBestSplitsPerGroup to outside of it (to make it clear what was being serialized in the aggregation).

Speedups: This update helps most when many features use few bins but a few features use many bins.  Some example results on speedups with 2M examples, 3.5K features (15-worker EC2 cluster):
* Example where old code was reasonably efficient (1/2 continuous, 1/4 binary, 1/4 20-category): 164.813 --> 116.491 sec
* Example where old code wasted many bins (1/10 continuous, 81/100 binary, 9/100 20-category): 128.701 --> 39.334 sec

Details:

(1) Variable numBins for each feature [SPARK-3043]

DecisionTreeMetadata now computes a variable numBins for each feature.  It also tracks numSplits.

(2) Reduced data reshaping in aggregation [SPARK-3043]

Added DTStatsAggregator, a wrapper around the aggregate statistics array for easy but efficient indexing.
* Added ImpurityAggregator and ImpurityCalculator classes, to make DecisionTree code more oblivious to the type of impurity.
* Design note: I originally tried creating Impurity classes which stored data and storing the aggregates in an Array[Array[Array[Impurity]]].  However, this led to significant slowdowns, perhaps because of overhead in creating so many objects.

The aggregate statistics are never reshaped, and cumulative sums are computed in-place.

Updated the layout of aggregation functions.  The update simplifies things by (1) dividing features into ordered/unordered (instead of ordered/unordered/continuous) and (2) making use of the DTStatsAggregator for indexing.
For this update, the following functions were refactored:
* updateBinForOrderedFeature
* updateBinForUnorderedFeature
* binaryOrNotCategoricalBinSeqOp
* multiclassWithCategoricalBinSeqOp
* regressionBinSeqOp
The above 5 functions were replaced with:
* orderedBinSeqOp
* someUnorderedBinSeqOp

Other changes:
* calculateGainForSplit now treats all feature types the same way.
* Eliminated extractLeftRightNodeAggregates.

(3) Choose ordering for ordered categorical features adaptively [SPARK-3156]

Updated binsToBestSplit():
* This now computes cumulative sums of stats for ordered features.
* For ordered categorical features, it chooses an ordering for categories. (This uses to be done by findSplitsBins.)
* Uses iterators to shorten code and avoid building an Array[Array[InformationGainStats]].

Side effects:
* In findSplitsBins: A sample of the data is only taken for data with continuous features.  It is not needed for data with only categorical features.
* In findSplitsBins: splits and bins are no longer pre-computed for ordered categorical features since they are not needed.
* TreePoint binning is simpler for categorical features.

(4) Changed nodes to use 1-indexing [SPARK-3086]

Nodes used to be indexed from 0.  Now they are indexed from 1.
Node indexing functions are now collected in object Node (Node.scala).

(5) Small clean-ups

Eliminated functions extractNodeInfo() and extractInfoForLowerLevels() to reduce duplicate code.
Eliminated InvalidBinIndex since it is no longer used.

CC: mengxr  manishamde  Please let me know if you have thoughts on this—thanks!

Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>

Closes #2125 from jkbradley/dt-opt3alt and squashes the following commits:

42c192a [Joseph K. Bradley] Merge branch 'rfs' into dt-opt3alt
d3cc46b [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt3alt
00e4404 [Joseph K. Bradley] optimization for TreePoint construction (pre-computing featureArity and isUnordered as arrays)
425716c [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into rfs
a2acea5 [Joseph K. Bradley] Small optimizations based on profiling
aa4e4df [Joseph K. Bradley] Updated DTStatsAggregator with bug fix (nodeString should not be multiplied by statsSize)
4651154 [Joseph K. Bradley] Changed numBins semantics for unordered features. * Before: numBins = numSplits = (1 << k - 1) - 1 * Now: numBins = 2 * numSplits = 2 * [(1 << k - 1) - 1] * This also involved changing the semantics of: ** DecisionTreeMetadata.numUnorderedBins()
1e3b1c7 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt3alt
1485fcc [Joseph K. Bradley] Made some DecisionTree methods private.
92f934f [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt3alt
e676da1 [Joseph K. Bradley] Updated documentation for DecisionTree
37ca845 [Joseph K. Bradley] Fixed problem with how DecisionTree handles ordered categorical	features.
105f8ab [Joseph K. Bradley] Removed commented-out getEmptyBinAggregates from DecisionTree
062c31d [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt3alt
6d32ccd [Joseph K. Bradley] In DecisionTree.binsToBestSplit, changed loops to iterators to shorten code.
807cd00 [Joseph K. Bradley] Finished DTStatsAggregator, a wrapper around the aggregate statistics for easy but hopefully efficient indexing.  Modified old ImpurityAggregator classes and renamed them ImpurityCalculator; added ImpurityAggregator classes which work with DTStatsAggregator but do not store data.  Unit tests all succeed.
f2166fd [Joseph K. Bradley] still working on DTStatsAggregator
92f7118 [Joseph K. Bradley] Added partly written DTStatsAggregator
fd8df30 [Joseph K. Bradley] Moved some aggregation helpers outside of findBestSplitsPerGroup
d7c53ee [Joseph K. Bradley] Added more doc for ImpurityAggregator
a40f8f1 [Joseph K. Bradley] Changed nodes to be indexed from 1.  Tests work.
95cad7c [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt3
5f94342 [Joseph K. Bradley] Added treeAggregate since not yet merged from master.  Moved node indexing functions to Node.
61c4509 [Joseph K. Bradley] Fixed bugs from merge: missing DT timer call, and numBins setting.  Cleaned up DT Suite some.
3ba7166 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt3
b314659 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt3
9c83363 [Joseph K. Bradley] partial merge but not done yet
45f7ea7 [Joseph K. Bradley] partial merge, not yet done
5fce635 [Joseph K. Bradley] Merge branch 'dt-opt2' into dt-opt3
26d10dd [Joseph K. Bradley] Removed tree/model/Filter.scala since no longer used.  Removed debugging println calls in DecisionTree.scala.
356daba [Joseph K. Bradley] Merge branch 'dt-opt1' into dt-opt2
430d782 [Joseph K. Bradley] Added more debug info on binning error.  Added some docs.
d036089 [Joseph K. Bradley] Print timing info to logDebug.
e66f1b1 [Joseph K. Bradley] TreePoint * Updated doc * Made some methods private
8464a6e [Joseph K. Bradley] Moved TimeTracker to tree/impl/ in its own file, and cleaned it up.  Removed debugging println calls from DecisionTree.  Made TreePoint extend Serialiable
a87e08f [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt1
dd4d3aa [Joseph K. Bradley] Mid-process in bug fix: bug for binary classification with categorical features * Bug: Categorical features were all treated as ordered for binary classification.  This is possible but would require the bin ordering to be determined on-the-fly after the aggregation.  Currently, the ordering is determined a priori and fixed for all splits. * (Temp) Fix: Treat low-arity categorical features as unordered for binary classification. * Related change: I removed most tests for isMulticlass in the code.  I instead test metadata for whether there are unordered features. * Status: The bug may be fixed, but more testing needs to be done.
438a660 [Joseph K. Bradley] removed subsampling for mnist8m from DT
86e217f [Joseph K. Bradley] added cache to DT input
e3c84cc [Joseph K. Bradley] Added stuff fro mnist8m to D T Runner
51ef781 [Joseph K. Bradley] Fixed bug introduced by last commit: Variance impurity calculation was incorrect since counts were swapped accidentally
fd65372 [Joseph K. Bradley] Major changes: * Created ImpurityAggregator classes, rather than old aggregates. * Feature split/bin semantics are based on ordered vs. unordered ** E.g.: numSplits = numBins for all unordered features, and numSplits = numBins - 1 for all ordered features. * numBins can differ for each feature
c1565a5 [Joseph K. Bradley] Small DecisionTree updates: * Simplification: Updated calculateGainForSplit to take aggregates for a single (feature, split) pair. * Internal doc: findAggForOrderedFeatureClassification
b914f3b [Joseph K. Bradley] DecisionTree optimization: eliminated filters + small changes
b2ed1f3 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt
0f676e2 [Joseph K. Bradley] Optimizations + Bug fix for DecisionTree
3211f02 [Joseph K. Bradley] Optimizing DecisionTree * Added TreePoint representation to avoid calling findBin multiple times. * (not working yet, but debugging)
f61e9d2 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-timing
bcf874a [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-timing
511ec85 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-timing
a95bc22 [Joseph K. Bradley] timing for DecisionTree internals
2014-09-08 09:47:13 -07:00
Prashant Sharma 0d1cc4ae42 [HOTFIX] A left over version change. It should make mima happy.
Author: Prashant Sharma <prashant.s@imaginea.com>

Closes #2317 from ScrapCodes/hotfix and squashes the following commits:

b6472d4 [Prashant Sharma] [HOTFIX] for hotfixes, a left over version change.
2014-09-08 14:32:53 +05:30
Reynold Xin eddfeddac1 [SPARK-938][doc] Add OpenStack Swift support
See compiled doc at
http://people.apache.org/~rxin/tmp/openstack-swift/_site/storage-openstack-swift.html

This is based on #1010. Closes #1010.

Author: Reynold Xin <rxin@apache.org>
Author: Gil Vernik <gilv@il.ibm.com>

Closes #2298 from rxin/openstack-swift and squashes the following commits:

ff4e394 [Reynold Xin] Two minor comments from Patrick.
279f6de [Reynold Xin] core-sites -> core-site
dfb8fea [Reynold Xin] Updated based on Gil's suggestion.
846f5cb [Reynold Xin] Added a link from overview page.
0447c9f [Reynold Xin] Removed sample code.
e9c3761 [Reynold Xin] Merge pull request #1010 from gilv/master
9233fef [Gil Vernik] Fixed typos
6994827 [Gil Vernik] Merge pull request #1 from rxin/openstack
ac0679e [Reynold Xin] Fixed an unclosed tr.
47ce99d [Reynold Xin] Merge branch 'master' into openstack
cca7192 [Gil Vernik] Removed white spases from pom.xml
99f095d [Reynold Xin] Pending openstack changes.
eb22295 [Reynold Xin] Merge pull request #1010 from gilv/master
39a9737 [Gil Vernik] Spark integration with Openstack Swift
c977658 [Gil Vernik] Merge branch 'master' of https://github.com/gilv/spark
2aba763 [Gil Vernik] Fix to docs/openstack-integration.md
9b625b5 [Gil Vernik] Merge branch 'master' of https://github.com/gilv/spark
eff538d [Gil Vernik] SPARK-938 - Openstack Swift object storage support
ce483d7 [Gil Vernik] SPARK-938 - Openstack Swift object storage support
b6c37ef [Gil Vernik] Openstack Swift support
2014-09-07 20:56:04 -07:00
Reynold Xin f25bbbdb3a [SPARK-3280] Made sort-based shuffle the default implementation
Sort-based shuffle has lower memory usage and seems to outperform hash-based in almost all of our testing.

Author: Reynold Xin <rxin@apache.org>

Closes #2178 from rxin/sort-shuffle and squashes the following commits:

713d341 [Reynold Xin] Fixed test failures by setting spark.shuffle.compress to the same value as spark.shuffle.spill.compress.
85165e6 [Reynold Xin] Fixed a comment typo.
aa0d372 [Reynold Xin] [SPARK-3280] Made sort-based shuffle the default implementation
2014-09-07 20:42:07 -07:00
Josh Rosen 4ba2673569 [HOTFIX] Fix broken Mima tests on the master branch
By merging #2268, which bumped the Spark version to 1.2.0-SNAPSHOT, I inadvertently broke the Mima binary compatibility tests.  The issue is that we were comparing 1.2.0-SNAPSHOT against Spark 1.0.0 without using any Mima excludes.  The right long-term fix for this is probably to publish nightly snapshots on Maven central and change the master branch to test binary compatibility against the current release candidate branch's snapshots until that release is finalized.

As a short-term fix until 1.1.0 is published on Maven central, I've configured the build to test the master branch for binary compatibility against the 1.1.0-RC4 jars.  I'll loop back and remove the Apache staging repo as soon as 1.1.0 final is available.

Author: Josh Rosen <joshrosen@apache.org>

Closes #2315 from JoshRosen/mima-fix and squashes the following commits:

776bc2c [Josh Rosen] Add two excludes to workaround Mima annotation issues.
ec90e21 [Josh Rosen] Add deploy and graphx to 1.2 MiMa excludes.
57569be [Josh Rosen] Fix MiMa tests in master branch; test against 1.1.0 RC.
2014-09-07 20:39:53 -07:00
Cheng Lian 9d69a782bd Fixed typos in make-distribution.sh
`hadoop.version` and `yarn.version` are properties rather then profiles, should use `-D` instead of `-P`.

/cc pwendell

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #2121 from liancheng/fix-make-dist and squashes the following commits:

4c49158 [Cheng Lian] Also mentions Hadoop version related Maven profiles
ed5b42a [Cheng Lian] Fixed typos in make-distribution.sh
2014-09-07 20:38:32 -07:00
Ward Viaene ecfa76cdfe [SPARK-3415] [PySpark] removes SerializingAdapter code
This code removes the SerializingAdapter code that was copied from PiCloud

Author: Ward Viaene <ward.viaene@bigdatapartnership.com>

Closes #2287 from wardviaene/feature/pythonsys and squashes the following commits:

5f0d426 [Ward Viaene] SPARK-3415: modified test class to do dump and load
5f5d559 [Ward Viaene] SPARK-3415: modified test class name and call cloudpickle.dumps instead using StringIO
afc4a9a [Ward Viaene] SPARK-3415: added newlines to pass lint
aaf10b7 [Ward Viaene] SPARK-3415: removed references to SerializingAdapter and rewrote test
65ffeff [Ward Viaene] removed duplicate test
a958866 [Ward Viaene] SPARK-3415: test script
e263bf5 [Ward Viaene] SPARK-3415: removes legacy SerializingAdapter code
2014-09-07 18:54:36 -07:00
Reynold Xin e2614038e7 [SPARK-3408] Fixed Limit operator so it works with sort-based shuffle.
Author: Reynold Xin <rxin@apache.org>

Closes #2281 from rxin/sql-limit-sort and squashes the following commits:

1ef7780 [Reynold Xin] [SPARK-3408] Fixed Limit operator so it works with sort-based shuffle.
2014-09-07 18:42:24 -07:00
Michael Armbrust 39db1bfdab [SQL] Update SQL Programming Guide
Author: Michael Armbrust <michael@databricks.com>
Author: Yin Huai <huai@cse.ohio-state.edu>

Closes #2258 from marmbrus/sqlDocUpdate and squashes the following commits:

f3d450b [Michael Armbrust] fix brackets
bea3bfa [Michael Armbrust] Davies suggestions
3a29fe2 [Michael Armbrust] tighten visibility
a71aa36 [Michael Armbrust] Draft of doc updates
52932c0 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into sqlDocUpdate
1e8c849 [Yin Huai] Update the example used for applySchema.
9457c39 [Yin Huai] Update doc.
31ba240 [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeDoc
29bc668 [Yin Huai] Draft doc for data type and schema APIs.
2014-09-07 21:34:46 -04:00
Eric Liang 6754570d83 [SPARK-3394] [SQL] Fix crash in TakeOrdered when limit is 0
This resolves https://issues.apache.org/jira/browse/SPARK-3394

Author: Eric Liang <ekl@google.com>

Closes #2264 from ericl/spark-3394 and squashes the following commits:

c87355b [Eric Liang] refactor
bfb6140 [Eric Liang] change RDD takeOrdered instead
7a51528 [Eric Liang] fix takeordered when limit = 0
2014-09-07 17:57:59 -07:00
Reynold Xin 3fb57a0ab3 [SPARK-3353] parent stage should have lower stage id.
Previously parent stages had higher stage id, but parent stages are executed first. This pull request changes the behavior so parent stages would have lower stage id.

For example, command:
```scala
sc.parallelize(1 to 10).map(x=>(x,x)).reduceByKey(_+_).count
```
breaks down into 2 stages.

The old web UI:
![screen shot 2014-09-04 at 12 42 44 am](https://cloud.githubusercontent.com/assets/323388/4146177/60fb4f42-3407-11e4-819f-853eb0e22b25.png)

Web UI with this patch:
![screen shot 2014-09-04 at 12 44 55 am](https://cloud.githubusercontent.com/assets/323388/4146178/62e08e62-3407-11e4-867b-a36b10534464.png)

Author: Reynold Xin <rxin@apache.org>

Closes #2273 from rxin/lower-stage-id and squashes the following commits:

abbb4c6 [Reynold Xin] Fixed SparkListenerSuite.
0e02379 [Reynold Xin] Updated DAGSchedulerSuite.
54ccea3 [Reynold Xin] [SPARK-3353] parent stage should have lower stage id.
2014-09-06 19:06:30 -07:00
Davies Liu 110fb8b24d [SPARK-2334] fix AttributeError when call PipelineRDD.id()
The underline JavaRDD for PipelineRDD is created lazily, it's delayed until call _jrdd.

The id of JavaRDD is cached as `_id`, it saves a RPC call in py4j for later calls.

closes #1276

Author: Davies Liu <davies.liu@gmail.com>

Closes #2296 from davies/id and squashes the following commits:

e197958 [Davies Liu] fix style
9721716 [Davies Liu] fix id of PipelineRDD
2014-09-06 16:12:29 -07:00
GuoQiang Li 21a1e1bb89 [SPARK-3273][SPARK-3301]We should read the version information from the same place
Author: GuoQiang Li <witgo@qq.com>

Closes #2175 from witgo/SPARK-3273 and squashes the following commits:

cf9c65a [GuoQiang Li] We should read the version information from the same place
2a44e2f [GuoQiang Li] The spark version in the welcome message of pyspark is not correct
2014-09-06 15:08:43 -07:00