Commit graph

7313 commits

Author SHA1 Message Date
Doris Xin 45a95f82ca Remove unicode operator from RDD.scala
Some IDEs don’t support unicode characters in source code. Check if this breaks binary compatibility.

Author: Doris Xin <doris.s.xin@gmail.com>

Closes #1119 from dorx/unicode and squashes the following commits:

05618c3 [Doris Xin] Remove unicode operator from RDD.scala
2014-06-18 15:01:29 -07:00
Mark Hamstra 4cbeea83e0 SPARK-2158 Clean up core/stdout file from FileAppenderSuite
@tdas

Author: Mark Hamstra <markhamstra@gmail.com>

Closes #1100 from markhamstra/SPARK-2158 and squashes the following commits:

ae8e069 [Mark Hamstra] Response to TD's review
2f1e201 [Mark Hamstra] Cleanup 'stdout' file within FileAppenderSuite
2014-06-18 14:56:41 -07:00
Kay Ousterhout 3870248740 [SPARK-1466] Raise exception if pyspark Gateway process doesn't start.
If the gateway process fails to start correctly (e.g., because JAVA_HOME isn't set correctly, there's no Spark jar, etc.), right now pyspark fails because of a very difficult-to-understand error, where we try to parse stdout to get the port where Spark started and there's nothing there. This commit properly catches the error and throws an exception that includes the stderr output for much easier debugging.

Thanks to @shivaram and @stogers for helping to fix this issue!

Author: Kay Ousterhout <kayousterhout@gmail.com>

Closes #383 from kayousterhout/pyspark and squashes the following commits:

36dd54b [Kay Ousterhout] [SPARK-1466] Raise exception if Gateway process doesn't start.
2014-06-18 13:16:26 -07:00
Reynold Xin dd96fcda01 Updated the comment for SPARK-2162.
A follow up on #1103

@andrewor14

Author: Reynold Xin <rxin@apache.org>

Closes #1117 from rxin/SPARK-2162 and squashes the following commits:

a4231de [Reynold Xin] Updated the comment for SPARK-2162.
2014-06-18 12:48:58 -07:00
Raymond Liu 5ad5e3486a [SPARK-2162] Double check in doGetLocal to avoid read on removed block.
other wise, it will either read in vain in memory level case, or throw exception in disk level case when it believe the block is there while actually it had been removed.

Author: Raymond Liu <raymond.liu@intel.com>

Closes #1103 from colorant/bm and squashes the following commits:

daac114 [Raymond Liu] Address comments
d1ea287 [Raymond Liu] Double check in doGetLocal to avoid read on removed block.
2014-06-18 10:57:45 -07:00
Yin Huai 587d32012c [SPARK-2176][SQL] Extra unnecessary exchange operator in the result of an explain command
```
hql("explain select * from src group by key").collect().foreach(println)

[ExplainCommand [plan#27:0]]
[ Aggregate false, [key#25], [key#25,value#26]]
[  Exchange (HashPartitioning [key#25:0], 200)]
[   Exchange (HashPartitioning [key#25:0], 200)]
[    Aggregate true, [key#25], [key#25]]
[     HiveTableScan [key#25,value#26], (MetastoreRelation default, src, None), None]
```

There are two exchange operators.

However, if we do not use explain...
```
hql("select * from src group by key")

res4: org.apache.spark.sql.SchemaRDD =
SchemaRDD[8] at RDD at SchemaRDD.scala:100
== Query Plan ==
Aggregate false, [key#8], [key#8,value#9]
 Exchange (HashPartitioning [key#8:0], 200)
  Aggregate true, [key#8], [key#8]
   HiveTableScan [key#8,value#9], (MetastoreRelation default, src, None), None
```
The plan is fine.

The cause of this bug is explained below.

When we create an `execution.ExplainCommand`, we use the `executedPlan` as the child of this `ExplainCommand`. But, this `executedPlan` is prepared for execution again when we generate the `executedPlan` for the `ExplainCommand`. Basically, `prepareForExecution` is called twice on a physical plan. Because after `prepareForExecution` we have already bounded those references (in `BoundReference`s), `AddExchange` cannot figure out we are using the same partitioning (we use `AttributeReference`s to create an `ExchangeOperator` and then those references will be changed to `BoundReference`s after `prepareForExecution` is called). So, an extra `ExchangeOperator` is inserted.

I think in `CommandStrategy`, we should just use the `sparkPlan` (`sparkPlan` is the input of `prepareForExecution`) to initialize the `ExplainCommand` instead of using `executedPlan`.

The link to JIRA: https://issues.apache.org/jira/browse/SPARK-2176

Author: Yin Huai <huai@cse.ohio-state.edu>

Closes #1116 from yhuai/SPARK-2176 and squashes the following commits:

197c19c [Yin Huai] Use sparkPlan to initialize a Physical Explain Command instead of using executedPlan.
2014-06-18 10:51:32 -07:00
Vadim Chekan 889f7b7624 [STREAMING] SPARK-2009 Key not found exception when slow receiver starts
I got "java.util.NoSuchElementException: key not found: 1401756085000 ms" exception when using kafka stream and 1 sec batchPeriod.

Investigation showed that the reason is that ReceiverLauncher.startReceivers is asynchronous (started in a thread).
https://github.com/vchekan/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/scheduler/ReceiverTracker.scala#L206

In case of slow starting receiver, such as Kafka, it easily takes more than 2sec to start. In result, no single "compute" will be called on ReceiverInputDStream before first batch job is executed and receivedBlockInfo remains empty (obviously). Batch job will cause ReceiverInputDStream.getReceivedBlockInfo call and "key not found" exception.

The patch makes getReceivedBlockInfo more robust by tolerating missing values.

Author: Vadim Chekan <kot.begemot@gmail.com>

Closes #961 from vchekan/branch-1.0 and squashes the following commits:

e86f82b [Vadim Chekan] Fixed indentation
4609563 [Vadim Chekan] Key not found exception: if receiver is slow to start, it is possible that getReceivedBlockInfo will be called before compute has been called
(cherry picked from commit 26f6b98931)

Signed-off-by: Patrick Wendell <pwendell@gmail.com>
2014-06-17 22:04:06 -07:00
Patrick Wendell 9e4b4bd083 Revert "SPARK-2038: rename "conf" parameters in the saveAsHadoop functions"
This reverts commit 443f5e1bbc.

This commit unfortunately would break source compatibility if users have named
the hadoopConf parameter.
2014-06-17 19:34:17 -07:00
Yin Huai d2f4f30b12 [SPARK-2060][SQL] Querying JSON Datasets with SQL and DSL in Spark SQL
JIRA: https://issues.apache.org/jira/browse/SPARK-2060

Programming guide: http://yhuai.github.io/site/sql-programming-guide.html

Scala doc of SQLContext: http://yhuai.github.io/site/api/scala/index.html#org.apache.spark.sql.SQLContext

Author: Yin Huai <huai@cse.ohio-state.edu>

Closes #999 from yhuai/newJson and squashes the following commits:

227e89e [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
ce8eedd [Yin Huai] rxin's comments.
bc9ac51 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
94ffdaa [Yin Huai] Remove "get" from method names.
ce31c81 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
e2773a6 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
79ea9ba [Yin Huai] Fix typos.
5428451 [Yin Huai] Newline
1f908ce [Yin Huai] Remove extra line.
d7a005c [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
7ea750e [Yin Huai] marmbrus's comments.
6a5f5ef [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
83013fb [Yin Huai] Update Java Example.
e7a6c19 [Yin Huai] SchemaRDD.javaToPython should convert a field with the StructType to a Map.
6d20b85 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
4fbddf0 [Yin Huai] Programming guide.
9df8c5a [Yin Huai] Python API.
7027634 [Yin Huai] Java API.
cff84cc [Yin Huai] Use a SchemaRDD for a JSON dataset.
d0bd412 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
ab810b0 [Yin Huai] Make JsonRDD private.
6df0891 [Yin Huai] Apache header.
8347f2e [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
66f9e76 [Yin Huai] Update docs and use the entire dataset to infer the schema.
8ffed79 [Yin Huai] Update the example.
a5a4b52 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
4325475 [Yin Huai] If a sampled dataset is used for schema inferring, update the schema of the JsonTable after first execution.
65b87f0 [Yin Huai] Fix sampling...
8846af5 [Yin Huai] API doc.
52a2275 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
0387523 [Yin Huai] Address PR comments.
666b957 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
a2313a6 [Yin Huai] Address PR comments.
f3ce176 [Yin Huai] After type conflict resolution, if a NullType is found, StringType is used.
0576406 [Yin Huai] Add Apache license header.
af91b23 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
f45583b [Yin Huai] Infer the schema of a JSON dataset (a text file with one JSON object per line or a RDD[String] with one JSON object per string) and returns a SchemaRDD.
f31065f [Yin Huai] A query plan or a SchemaRDD can print out its schema.
2014-06-17 19:14:59 -07:00
Patrick Wendell b2ebf429e2 HOTFIX: bug caused by #941
This patch should have qualified the use of PIPE. This needs to be back ported into 0.9 and 1.0.

Author: Patrick Wendell <pwendell@gmail.com>

Closes #1108 from pwendell/hotfix and squashes the following commits:

711c58d [Patrick Wendell] HOTFIX: bug caused by #941
2014-06-17 15:09:24 -07:00
Andrew Or a14807e84c [SPARK-2147 / 2161] Show removed executors on the UI
This PR includes two changes
- **[SPARK-2147]** When an application finishes cleanly (i.e. `sc.stop()` is called), all of its executors used to disappear from the Master UI. This no longer happens.
- **[SPARK-2161]** This adds a "Removed Executors" table to Master UI, so the user can find out why their executors died from the logs, for instance. The equivalent table already existed in the Worker UI, but was hidden because of a bug (the comment `//scalastyle:off` disconnected the `Seq[Node]` that represents the HTML for table).

This should go into 1.0.1 if possible.

Author: Andrew Or <andrewor14@gmail.com>

Closes #1102 from andrewor14/remember-removed-executors and squashes the following commits:

2e2298f [Andrew Or] Add hash code method to ExecutorInfo (minor)
abd72e0 [Andrew Or] Merge branch 'master' of github.com:apache/spark into remember-removed-executors
792f992 [Andrew Or] Add missing equals method in ExecutorInfo
3390b49 [Andrew Or] Add executor state column to WorkerPage
161f8a2 [Andrew Or] Display finished executors table (fix bug)
fbb65b8 [Andrew Or] Removed unused method
c89bb6e [Andrew Or] Add table for removed executors in MasterWebUI
fe47402 [Andrew Or] Show exited executors on the Master UI
2014-06-17 12:25:55 -07:00
CodingCat 443f5e1bbc SPARK-2038: rename "conf" parameters in the saveAsHadoop functions
to distinguish with SparkConf object

https://issues.apache.org/jira/browse/SPARK-2038

Author: CodingCat <zhunansjtu@gmail.com>

Closes #1087 from CodingCat/SPARK-2038 and squashes the following commits:

763975f [CodingCat] style fix
d91288d [CodingCat] rename "conf" parameters in the saveAsHadoop functions
2014-06-17 12:17:48 -07:00
Sandy Ryza 2794990e9e SPARK-2146. Fix takeOrdered doc
Removes Python syntax in Scaladoc, corrects result in Scaladoc, and removes irrelevant cache() call in Python doc.

Author: Sandy Ryza <sandy@cloudera.com>

Closes #1086 from sryza/sandy-spark-2146 and squashes the following commits:

185ff18 [Sandy Ryza] Use Seq instead of Array
c996120 [Sandy Ryza] SPARK-2146.  Fix takeOrdered doc
2014-06-17 12:03:22 -07:00
Andrew Ash b92d16b114 SPARK-1063 Add .sortBy(f) method on RDD
This never got merged from the apache/incubator-spark repo (which is now deleted) but there had been several rounds of code review on this PR there.

I think this is ready for merging.

Author: Andrew Ash <andrew@andrewash.com>

This patch had conflicts when merged, resolved by
Committer: Reynold Xin <rxin@apache.org>

Closes #369 from ash211/sortby and squashes the following commits:

d09147a [Andrew Ash] Fix Ordering import
43d0a53 [Andrew Ash] Fix missing .collect()
29a54ed [Andrew Ash] Re-enable test by converting to a closure
5a95348 [Andrew Ash] Add license for RDDSuiteUtils
64ed6e3 [Andrew Ash] Remove leaked diff
d4de69a [Andrew Ash] Remove scar tissue
63638b5 [Andrew Ash] Add Python version of .sortBy()
45e0fde [Andrew Ash] Add Java version of .sortBy()
adf84c5 [Andrew Ash] Re-indent to keep line lengths under 100 chars
9d9b9d8 [Andrew Ash] Use parentheses on .collect() calls
0457b69 [Andrew Ash] Ignore failing test
99f0baf [Andrew Ash] Merge branch 'master' into sortby
222ae97 [Andrew Ash] Try moving Ordering objects out to a different class
3fd0dd3 [Andrew Ash] Add (failing) test for sortByKey with explicit Ordering
b8b5bbc [Andrew Ash] Align remove extra spaces that were used to align ='s in test code
8c53298 [Andrew Ash] Actually use ascending and numPartitions parameters
381eef2 [Andrew Ash] Correct silly typo
7db3e84 [Andrew Ash] Support ascending and numPartitions params in sortBy()
0f685fd [Andrew Ash] Merge remote-tracking branch 'origin/master' into sortby
ca4490d [Andrew Ash] Add .sortBy(f) method on RDD
2014-06-17 11:47:48 -07:00
Zongheng Yang e243c5ffac [SPARK-2053][SQL] Add Catalyst expressions for CASE WHEN.
JIRA ticket: https://issues.apache.org/jira/browse/SPARK-2053

This PR adds support for two types of CASE statements present in Hive. The first type is of the form `CASE WHEN a THEN b [WHEN c THEN d]* [ELSE e] END`, with the semantics like a chain of if statements. The second type is of the form `CASE a WHEN b THEN c [WHEN d THEN e]* [ELSE f] END`, with the semantics like a switch statement on key `a`. Both forms are implemented in `CaseWhen`.

[This link](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-ConditionalFunctions) contains more detailed descriptions on their semantics.

Notes / Open issues:

* Please check if any implicit contracts / invariants are broken in the implementations (especially for the operators). I am not very familiar with them and I currently find them tricky to spot.
* We should decide whether or not a non-boolean condition is allowed in a branch of `CaseWhen`. Hive throws a `SemanticException` for this situation and I think it'd be good to mimic it -- the question is where in the whole Spark SQL pipeline should we signal an exception for such a query.

Author: Zongheng Yang <zongheng.y@gmail.com>

Closes #1055 from concretevitamin/caseWhen and squashes the following commits:

4226eb9 [Zongheng Yang] Comment.
79d26fc [Zongheng Yang] Merge branch 'master' into caseWhen
caf9383 [Zongheng Yang] Update a FIXME.
9d26ab8 [Zongheng Yang] Add @transient marker.
788a0d9 [Zongheng Yang] Implement CastNulls, which fixes udf_case and udf_when.
7ef284f [Zongheng Yang] Refactors: remove redundant passes, improve toString, mark transient.
f47ae7b [Zongheng Yang] Modify queries in tests to have shorter golden files.
1c1fbfc [Zongheng Yang] Cleanups per review comments.
7d2b7e2 [Zongheng Yang] Translate CaseKeyWhen to CaseWhen at parsing time.
47d406a [Zongheng Yang] Do toArray once and lazily outside of eval().
bb3d109 [Zongheng Yang] Update scaladoc of a method.
aea3195 [Zongheng Yang] Fix bug that branchesArr is not used; remove unused import.
96870a8 [Zongheng Yang] Turn off scalastyle for some comments.
7392f3a [Zongheng Yang] Minor cleanup.
2cf08bb [Zongheng Yang] Merge branch 'master' into caseWhen
9f84b40 [Zongheng Yang] Add golden outputs from Hive.
db51a85 [Zongheng Yang] Add allCondBooleans check; uncomment tests.
3f9ef0a [Zongheng Yang] Cleanups and bug fixes (mainly in eval() and resolved).
be54bc8 [Zongheng Yang] Rewrite eval() to a low-level implementation. Separate two CASE stmts.
f2bcb9d [Zongheng Yang] WIP
5906f75 [Zongheng Yang] WIP
efd019b [Zongheng Yang] eval() and toString() bug fixes.
7d81e95 [Zongheng Yang] Clean up resolved.
a31d782 [Zongheng Yang] Finish up Case.
2014-06-17 13:30:17 +02:00
Xi Liu f5a4049e53 [SPARK-2164][SQL] Allow Hive UDF on columns of type struct
Author: Xi Liu <xil@conviva.com>

Closes #796 from xiliu82/sqlbug and squashes the following commits:

328dfc4 [Xi Liu] [Spark SQL] remove a temporary function after test
354386a [Xi Liu] [Spark SQL] add test suite for UDF on struct
8fc6f51 [Xi Liu] [SparkSQL] allow UDF on struct
2014-06-17 13:16:02 +02:00
Andrew Or 09deb3eee0 [SPARK-2144] ExecutorsPage reports incorrect # of RDD blocks
This is reproducible whenever we drop a block because of memory pressure.

This is because StorageStatusListener actually never removes anything from the block maps of its StorageStatuses. Instead, when a block is dropped, it sets the block's storage level to `StorageLevel.NONE`, when it should just remove it from the map.

This PR includes this simple fix.

Author: Andrew Or <andrewor14@gmail.com>

Closes #1080 from andrewor14/ui-blocks and squashes the following commits:

fcf9f1a [Andrew Or] Remove BlockStatus if it is no longer cached
2014-06-17 01:28:22 -07:00
Daniel Darabos 23a12ce20c SPARK-2035: Store call stack for stages, display it on the UI.
I'm not sure about the test -- I get a lot of unrelated failures for some reason. I'll try to sort it out. But hopefully the automation will test this for me if I send a pull request :).

I'll attach a demo HTML in [Jira](https://issues.apache.org/jira/browse/SPARK-2035).

Author: Daniel Darabos <darabos.daniel@gmail.com>
Author: Patrick Wendell <pwendell@gmail.com>

Closes #981 from darabos/darabos-call-stack and squashes the following commits:

f7c6bfa [Daniel Darabos] Fix bad merge. I undid 83c226d454 by Doris.
3d0a48d [Daniel Darabos] Merge remote-tracking branch 'upstream/master' into darabos-call-stack
b857849 [Daniel Darabos] Style: Break long line.
ecb5690 [Daniel Darabos] Include the last Spark method in the full stack trace. Otherwise it is not visible if the stage name is overridden.
d00a85b [Patrick Wendell] Make call sites for stages non-optional and well defined
b9eba24 [Daniel Darabos] Make StageInfo.details non-optional. Add JSON serialization code for the new field. Verify JSON backward compatibility.
4312828 [Daniel Darabos] Remove Mima excludes for CallSite. They should be unnecessary now, with SPARK-2070 fixed.
0920750 [Daniel Darabos] Merge remote-tracking branch 'upstream/master' into darabos-call-stack
a4b1faf [Daniel Darabos] Add Mima exclusions for the CallSite changes it has picked up. They are private methods/classes, so we ought to be safe.
932f810 [Daniel Darabos] Use empty CallSite instead of null in DAGSchedulerSuite. Outside of testing, this parameter always originates in SparkContext.scala, and will never be null.
ccd89d1 [Daniel Darabos] Fix long lines.
ac173e4 [Daniel Darabos] Hide "show details" if there are no details to show.
6182da6 [Daniel Darabos] Set a configurable limit on maximum call stack depth. It can be useful in memory-constrained situations with large numbers of stages.
8fe2e34 [Daniel Darabos] Store call stack for stages, display it on the UI.
2014-06-17 00:08:05 -07:00
Anant 8cd04c3eec SPARK-1990: added compatibility for python 2.6 for ssh_read command
https://issues.apache.org/jira/browse/SPARK-1990

There were some posts on the lists that spark-ec2 does not work with Python 2.6. In addition, we should check the Python version at the top of the script and exit if it's too old

Author: Anant <anant.asty@gmail.com>

Closes #941 from anantasty/SPARK-1990 and squashes the following commits:

4ca441d [Anant] Implmented check_optput withinthe module to work with python 2.6
c6ed85c [Anant] added compatibility for python 2.6 for ssh_read command
2014-06-16 23:43:01 -07:00
Kan Zhang d81c08bac9 [SPARK-2130] End-user friendly String repr for StorageLevel in Python
JIRA issue https://issues.apache.org/jira/browse/SPARK-2130

This PR adds an end-user friendly String representation for StorageLevel
in Python, similar to ```StorageLevel.description``` in Scala.
```
>>> rdd = sc.parallelize([1,2])
>>> storage_level = rdd.getStorageLevel()
>>> storage_level
StorageLevel(False, False, False, False, 1)
>>> print(storage_level)
Serialized 1x Replicated
```

Author: Kan Zhang <kzhang@apache.org>

Closes #1096 from kanzhang/SPARK-2130 and squashes the following commits:

7c8b98b [Kan Zhang] [SPARK-2130] Prettier epydoc output
cc5bf45 [Kan Zhang] [SPARK-2130] End-user friendly String representation for StorageLevel in Python
2014-06-16 23:31:31 -07:00
Anatoli Fomenko 7afa912e74 MLlib documentation fix
Synchronized mllib-optimization.md with Spark Scaladoc: removed reference to GradientDescent.runMiniBatchSGD method

This is a temporary fix to remove  a link from http://spark.apache.org/docs/latest/mllib-optimization.html to GradientDescent.runMiniBatchSGD which is not in the current online GradientDescent Scaladoc.
FIXME: revert this commit after GradientDescent Scaladoc is updated.
See images for details.

![mllib-docs-fix-1](https://cloud.githubusercontent.com/assets/1375501/3294410/ccf19bb8-f5a8-11e3-93f1-f593016209eb.png)
![mllib-docs-fix-2](https://cloud.githubusercontent.com/assets/1375501/3294411/d0b59a7e-f5a8-11e3-8fc8-329c177ef8c8.png)

Author: Anatoli Fomenko <fa@apache.org>

Closes #1098 from afomenko/master and squashes the following commits:

5cb0758 [Anatoli Fomenko] MLlib documentation fix
2014-06-16 23:10:36 -07:00
Cheng Lian 237b96bc59 Minor fix: made "EXPLAIN" output to play well with JDBC output format
Fixed the broken JDBC output. Test from Shark `beeline`:

```
beeline> !connect jdbc:hive2://localhost:10000/
scan complete in 2ms
Connecting to jdbc:hive2://localhost:10000/
Enter username for jdbc:hive2://localhost:10000/: lian
Enter password for jdbc:hive2://localhost:10000/:
Connected to: Hive (version 0.12.0)
Driver: Hive (version 0.12.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://localhost:10000/>
0: jdbc:hive2://localhost:10000/> explain select * from src;
+-------------------------------------------------------------------------------+
|                                     plan                                      |
+-------------------------------------------------------------------------------+
| ExplainCommand [plan#2:0]                                                     |
|  HiveTableScan [key#0,value#1], (MetastoreRelation default, src, None), None  |
+-------------------------------------------------------------------------------+
2 rows selected (1.386 seconds)
```

Before this change, the output looked something like this:

```
+-------------------------------------------------------------------------------+
|                                     plan                                      |
+-------------------------------------------------------------------------------+
| ExplainCommand [plan#2:0]
 HiveTableScan [key#0,value#1], (MetastoreRelation default, src, None), None  |
+-------------------------------------------------------------------------------+
```

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #1097 from liancheng/multiLineExplain and squashes the following commits:

eb37967 [Cheng Lian] Made output of "EXPLAIN" play well with JDBC output format
2014-06-16 16:42:17 -07:00
Cheng Lian 273afcb254 [SQL][SPARK-2094] Follow up of PR #1071 for Java API
Updated `JavaSQLContext` and `JavaHiveContext` similar to what we've done to `SQLContext` and `HiveContext` in PR #1071. Added corresponding test case for Spark SQL Java API.

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #1085 from liancheng/spark-2094-java and squashes the following commits:

29b8a51 [Cheng Lian] Avoided instantiating JavaSparkContext & JavaHiveContext to workaround test failure
92bb4fb [Cheng Lian] Marked test cases in JavaHiveQLSuite with "ignore"
22aec97 [Cheng Lian] Follow up of PR #1071 for Java API
2014-06-16 21:32:51 +02:00
witgo cdf2b04570 [SPARK-1930] The Container is running beyond physical memory limits, so as to be killed
Author: witgo <witgo@qq.com>

Closes #894 from witgo/SPARK-1930 and squashes the following commits:

564307e [witgo] Update the running-on-yarn.md
3747515 [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1930
172647b [witgo] add memoryOverhead docs
a0ff545 [witgo] leaving only two configs
a17bda2 [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1930
478ca15 [witgo] Merge branch 'master' into SPARK-1930
d1244a1 [witgo] Merge branch 'master' into SPARK-1930
8b967ae [witgo] Merge branch 'master' into SPARK-1930
655a820 [witgo] review commit
71859a7 [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1930
e3c531d [witgo] review commit
e16f190 [witgo] different memoryOverhead
ffa7569 [witgo] review commit
5c9581f [witgo] Merge branch 'master' into SPARK-1930
9a6bcf2 [witgo] review commit
8fae45a [witgo] fix NullPointerException
e0dcc16 [witgo] Adding  configuration items
b6a989c [witgo] Fix container memory beyond limit, were killed
2014-06-16 14:27:31 -05:00
Kan Zhang 4fdb491775 [SPARK-2010] Support for nested data in PySpark SQL
JIRA issue https://issues.apache.org/jira/browse/SPARK-2010

This PR adds support for nested collection types in PySpark SQL, including
array, dict, list, set, and tuple. Example,

```
>>> from array import array
>>> from pyspark.sql import SQLContext
>>> sqlCtx = SQLContext(sc)
>>> rdd = sc.parallelize([
...         {"f1" : array('i', [1, 2]), "f2" : {"row1" : 1.0}},
...         {"f1" : array('i', [2, 3]), "f2" : {"row2" : 2.0}}])
>>> srdd = sqlCtx.inferSchema(rdd)
>>> srdd.collect() == [{"f1" : array('i', [1, 2]), "f2" : {"row1" : 1.0}},
...                    {"f1" : array('i', [2, 3]), "f2" : {"row2" : 2.0}}]
True
>>> rdd = sc.parallelize([
...         {"f1" : [[1, 2], [2, 3]], "f2" : set([1, 2]), "f3" : (1, 2)},
...         {"f1" : [[2, 3], [3, 4]], "f2" : set([2, 3]), "f3" : (2, 3)}])
>>> srdd = sqlCtx.inferSchema(rdd)
>>> srdd.collect() == \
... [{"f1" : [[1, 2], [2, 3]], "f2" : set([1, 2]), "f3" : (1, 2)},
...  {"f1" : [[2, 3], [3, 4]], "f2" : set([2, 3]), "f3" : (2, 3)}]
True
```

Author: Kan Zhang <kzhang@apache.org>

Closes #1041 from kanzhang/SPARK-2010 and squashes the following commits:

1b2891d [Kan Zhang] [SPARK-2010] minor doc change and adding a TODO
504f27e [Kan Zhang] [SPARK-2010] Support for nested data in PySpark SQL
2014-06-16 11:11:29 -07:00
CodingCat 716c88aa14 SPARK-2039: apply output dir existence checking for all output formats
https://issues.apache.org/jira/browse/SPARK-2039

apply output dir existence checking for all output formats

Author: CodingCat <zhunansjtu@gmail.com>

Closes #1088 from CodingCat/SPARK-2039 and squashes the following commits:

c52747a [CodingCat] apply output dir existence checking for all output formats
2014-06-15 23:47:58 -07:00
Ali Ghodsi 119b06a04f Updating docs to include missing information about reducers and clarify ...
...how the OFFHEAP storage level works (there has been confusion around this).

Author: Ali Ghodsi <alig@cs.berkeley.edu>

Closes #1089 from alig/master and squashes the following commits:

ca8114d [Ali Ghodsi] Updating docs to include missing information about reducers and clarify how the OFFHEAP storage level works (there has been confusion around this).
2014-06-15 23:44:30 -07:00
Andrew Ash 9672ee07fb SPARK-2148 Add link to requirements for custom equals() and hashcode() methods
https://issues.apache.org/jira/browse/SPARK-2148

Author: Andrew Ash <andrew@andrewash.com>

Closes #1092 from ash211/SPARK-2148 and squashes the following commits:

93513df [Andrew Ash] SPARK-2148 Add link to requirements for custom equals() and hashcode() methods
2014-06-15 23:33:22 -07:00
CrazyJvm a63aa1adb2 SPARK-1999: StorageLevel in storage tab and RDD Storage Info never changes
StorageLevel in 'storage tab' and 'RDD Storage Info' never changes even if you call rdd.unpersist() and then you give the rdd another different storage level.

Author: CrazyJvm <crazyjvm@gmail.com>

Closes #968 from CrazyJvm/ui-storagelevel and squashes the following commits:

62555fa [CrazyJvm] change RDDInfo constructor param 'storageLevel' to var, so there's need to add another variable _storageLevel。
9f1571e [CrazyJvm] JIRA https://issues.apache.org/jira/browse/SPARK-1999 UI : StorageLevel in storage tab and RDD Storage Info never changes
2014-06-15 23:23:26 -07:00
Kan Zhang ca5d9d43b9 [SPARK-937] adding EXITED executor state and not relaunching cleanly exited executors
There seems to be 2 issues.

1. When job is done, driver asks executor to shutdown. However, this clean exit was assigned FAILED executor state by Worker. I introduced EXITED executor state for executors who voluntarily exit (both normal and abnormal exit depending on the exit code).

2. When Master gets notified an executor has exited, it launches another one to replace it, regardless of reason why the executor had exited. When the reason was job has finished, the unnecessary replacement got subsequently killed when App disassociates. This launching and killing of unnecessary executors shows up in the log and is confusing to users. I added check for executor exit status and avoid launching (and subsequent killing) of unnecessary replacements when executors exit cleanly.

One could ask the scheduler to tell Master job is done so that Master wouldn't launch the replacement executor. However, there is a race condition between App telling Master job is done and Worker telling Master an executor had exited. There is no guarantee the former will happen before the later. Instead, I chose to check the exit code when executor exits. If the exit code is 0, I assume executor has been asked to shutdown by driver and Master will not launch replacements.

Due to race condition, it could also happen that (although didn't happen on my local cluster), Master detects App disassociation event before the executor exits by itself. In such cases, the executor will be rightfully killed and labeled as KILLED, while the App state will show FINISHED.

Author: Kan Zhang <kzhang@apache.org>

Closes #306 from kanzhang/SPARK-1118 and squashes the following commits:

cb0cc86 [Kan Zhang] [SPARK-937] adding EXITED executor state and not relaunching cleanly exited executors
2014-06-15 14:55:34 -07:00
Michael Armbrust 269fc62b20 [SQL] Support transforming TreeNodes with Option children.
Thanks goes to @marmbrus for his implementation.

Author: Michael Armbrust <michael@databricks.com>
Author: Zongheng Yang <zongheng.y@gmail.com>

Closes #1074 from concretevitamin/option-treenode and squashes the following commits:

ef27b85 [Zongheng Yang] Merge pull request #1 from marmbrus/pr/1074
73133c2 [Michael Armbrust] TreeNodes can't be inner classes.
ab78420 [Zongheng Yang] Add a test.
2ccb721 [Michael Armbrust] Add support for transformation of optional children.
2014-06-15 11:28:34 +02:00
Kan Zhang 7dd9fc67a6 [SPARK-1837] NumericRange should be partitioned in the same way as other...
... sequences

Author: Kan Zhang <kzhang@apache.org>

Closes #776 from kanzhang/SPARK-1837 and squashes the following commits:

e48f018 [Kan Zhang] [SPARK-1837] code refactoring
67c33b5 [Kan Zhang] minor change
403f9b1 [Kan Zhang] [SPARK-1837] NumericRange should be partitioned in the same way as other sequences
2014-06-14 14:31:28 -07:00
Kan Zhang b52603b039 [SPARK-2013] Documentation for saveAsPickleFile and pickleFile in Python
Author: Kan Zhang <kzhang@apache.org>

Closes #983 from kanzhang/SPARK-2013 and squashes the following commits:

0e128bb [Kan Zhang] [SPARK-2013] minor update
e728516 [Kan Zhang] [SPARK-2013] Documentation for saveAsPickleFile and pickleFile in Python
2014-06-14 13:22:30 -07:00
Kan Zhang 2550533a28 [SPARK-2079] Support batching when serializing SchemaRDD to Python
Added batching with default batch size 10 in SchemaRDD.javaToPython

Author: Kan Zhang <kzhang@apache.org>

Closes #1023 from kanzhang/SPARK-2079 and squashes the following commits:

2d1915e [Kan Zhang] [SPARK-2079] Add batching in SchemaRDD.javaToPython
19b0c09 [Kan Zhang] [SPARK-2079] Removing unnecessary wrapping in SchemaRDD.javaToPython
2014-06-14 13:17:22 -07:00
Yin Huai 8919685091 [Spark-2137][SQL] Timestamp UDFs broken
https://issues.apache.org/jira/browse/SPARK-2137

Author: Yin Huai <huai@cse.ohio-state.edu>

Closes #1081 from yhuai/SPARK-2137 and squashes the following commits:

c04f910 [Yin Huai] Merge remote-tracking branch 'upstream/master' into SPARK-2137
205f17b [Yin Huai] Make Hive UDF wrapper support Timestamp.
2014-06-13 23:28:57 -07:00
akkomar edb1f0e316 Small correction in Streaming Programming Guide doc
Corrected description of `repartition` function under 'Level of Parallelism in Data Receiving'.

Author: akkomar <ak.komar@gmail.com>

Closes #1079 from akkomar/streaming-guide-doc and squashes the following commits:

32dfc62 [akkomar] Corrected description of `repartition` function under 'Level of Parallelism in Data Receiving'.
2014-06-13 15:37:26 -07:00
Cheng Lian ac96d9657c [SPARK-2094][SQL] "Exactly once" semantics for DDL and command statements
## Related JIRA issues

- Main issue:

  - [SPARK-2094](https://issues.apache.org/jira/browse/SPARK-2094): Ensure exactly once semantics for DDL/Commands

- Issues resolved as dependencies:

  - [SPARK-2081](https://issues.apache.org/jira/browse/SPARK-2081): Undefine output() from the abstract class Command and implement it in concrete subclasses
  - [SPARK-2128](https://issues.apache.org/jira/browse/SPARK-2128): No plan for DESCRIBE
  - [SPARK-1852](https://issues.apache.org/jira/browse/SPARK-1852): SparkSQL Queries with Sorts run before the user asks them to

- Other related issue:

  - [SPARK-2129](https://issues.apache.org/jira/browse/SPARK-2129): NPE thrown while lookup a view

    Two test cases, `join_view` and `mergejoin_mixed`, within the `HiveCompatibilitySuite` are removed from the whitelist to workaround this issue.

## PR Overview

This PR defines physical plans for DDL statements and commands and wraps their side effects in a lazy field `PhysicalCommand.sideEffectResult`, so that they are executed eagerly and exactly once.  Also, as a positive side effect, now DDL statements and commands can be turned into proper `SchemaRDD`s and let user query the execution results.

This PR defines schemas for the following DDL/commands:

- EXPLAIN command

  - `plan`: String, the plan explanation

- SET command

  - `key`: String, the key(s) of the propert(y/ies) being set or queried
  - `value`: String, the value(s) of the propert(y/ies) being queried

- Other Hive native command

  - `result`: String, execution result returned by Hive

  **NOTE**: We should refine schemas for different native commands by defining physical plans for them in the future.

## Examples

### EXPLAIN command

Take the "EXPLAIN" command as an example, we first execute the command and obtain a `SchemaRDD` at the same time, then query the `plan` field with the schema DSL:

```
scala> loadTestTable("src")
...

scala> val q0 = hql("EXPLAIN SELECT key, COUNT(*) FROM src GROUP BY key")
...
q0: org.apache.spark.sql.SchemaRDD =
SchemaRDD[0] at RDD at SchemaRDD.scala:98
== Query Plan ==
ExplainCommandPhysical [plan#11:0]
 Aggregate false, [key#4], [key#4,SUM(PartialCount#6L) AS c_1#2L]
  Exchange (HashPartitioning [key#4:0], 200)
   Exchange (HashPartitioning [key#4:0], 200)
    Aggregate true, [key#4], [key#4,COUNT(1) AS PartialCount#6L]
     HiveTableScan [key#4], (MetastoreRelation default, src, None), None

scala> q0.select('plan).collect()
...
[ExplainCommandPhysical [plan#24:0]
 Aggregate false, [key#17], [key#17,SUM(PartialCount#19L) AS c_1#2L]
  Exchange (HashPartitioning [key#17:0], 200)
   Exchange (HashPartitioning [key#17:0], 200)
    Aggregate true, [key#17], [key#17,COUNT(1) AS PartialCount#19L]
     HiveTableScan [key#17], (MetastoreRelation default, src, None), None]

scala>
```

### SET command

In this example we query all the properties set in `SQLConf`, register the result as a table, and then query the table with HiveQL:

```
scala> val q1 = hql("SET")
...
q1: org.apache.spark.sql.SchemaRDD =
SchemaRDD[7] at RDD at SchemaRDD.scala:98
== Query Plan ==
<SET command: executed by Hive, and noted by SQLContext>

scala> q1.registerAsTable("properties")

scala> hql("SELECT key, value FROM properties ORDER BY key LIMIT 10").foreach(println)
...
== Query Plan ==
TakeOrdered 10, [key#51:0 ASC]
 Project [key#51:0,value#52:1]
  SetCommandPhysical None, None, [key#55:0,value#56:1]), which has no missing parents
14/06/12 12:19:27 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from Stage 5 (SchemaRDD[21] at RDD at SchemaRDD.scala:98
== Query Plan ==
TakeOrdered 10, [key#51:0 ASC]
 Project [key#51:0,value#52:1]
  SetCommandPhysical None, None, [key#55:0,value#56:1])
...
[datanucleus.autoCreateSchema,true]
[datanucleus.autoStartMechanismMode,checked]
[datanucleus.cache.level2,false]
[datanucleus.cache.level2.type,none]
[datanucleus.connectionPoolingType,BONECP]
[datanucleus.fixedDatastore,false]
[datanucleus.identifierFactory,datanucleus1]
[datanucleus.plugin.pluginRegistryBundleCheck,LOG]
[datanucleus.rdbms.useLegacyNativeValueStrategy,true]
[datanucleus.storeManagerType,rdbms]

scala>
```

### "Exactly once" semantics

At last, an example of the "exactly once" semantics:

```
scala> val q2 = hql("CREATE TABLE t1(key INT, value STRING)")
...
q2: org.apache.spark.sql.SchemaRDD =
SchemaRDD[28] at RDD at SchemaRDD.scala:98
== Query Plan ==
<Native command: executed by Hive>

scala> table("t1")
...
res9: org.apache.spark.sql.SchemaRDD =
SchemaRDD[32] at RDD at SchemaRDD.scala:98
== Query Plan ==
HiveTableScan [key#58,value#59], (MetastoreRelation default, t1, None), None

scala> q2.collect()
...
res10: Array[org.apache.spark.sql.Row] = Array([])

scala>
```

As we can see, the "CREATE TABLE" command is executed eagerly right after the `SchemaRDD` is created, and referencing the `SchemaRDD` again won't trigger a duplicated execution.

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #1071 from liancheng/exactlyOnceCommand and squashes the following commits:

d005b03 [Cheng Lian] Made "SET key=value" returns the newly set key value pair
f6c7715 [Cheng Lian] Added test cases for DDL/command statement RDDs
1d00937 [Cheng Lian] Makes SchemaRDD DSLs work for DDL/command statement RDDs
5c7e680 [Cheng Lian] Bug fix: wrong type used in pattern matching
48aa2e5 [Cheng Lian] Refined SQLContext.emptyResult as an empty RDD[Row]
cc64f32 [Cheng Lian] Renamed physical plan classes for DDL/commands
74789c1 [Cheng Lian] Fixed failing test cases
0ad343a [Cheng Lian] Added physical plan for DDL and commands to ensure the "exactly once" semantics
2014-06-13 12:59:48 -07:00
Michael Armbrust 1c2fd015b0 [SPARK-1964][SQL] Add timestamp to HiveMetastoreTypes.toMetastoreType
Author: Michael Armbrust <michael@databricks.com>

Closes #1061 from marmbrus/timestamp and squashes the following commits:

79c3903 [Michael Armbrust] Add timestamp to HiveMetastoreTypes.toMetastoreType()
2014-06-13 12:55:15 -07:00
nravi 70c8116c0a Workaround in Spark for ConcurrentModification issue (JIRA Hadoop-10456, Spark-1097)
This fix has gone into Hadoop 2.4.1. For developers using <  2.4.1, it would be good to have a workaround in Spark as well.

Fix has been tested for performance as well, no regressions found.

Author: nravi <nravi@c1704.halxg.cloudera.com>

Closes #1000 from nishkamravi2/master and squashes the following commits:

eb663ca [nravi] Merge branch 'master' of https://github.com/apache/spark
df2aeb1 [nravi] Improved fix for ConcurrentModificationIssue (Spark-1097, Hadoop-10456)
6b840f0 [nravi] Undo the fix for SPARK-1758 (the problem is fixed)
5108700 [nravi] Fix in Spark for the Concurrent thread modification issue (SPARK-1097, HADOOP-10456)
681b36f [nravi] Fix for SPARK-1758: failing test org.apache.spark.JavaAPISuite.wholeTextFiles
2014-06-13 10:52:21 -07:00
Xiangrui Meng b3736e3d2f [HOTFIX] add math3 version to pom
Passed `mvn package`.

Author: Xiangrui Meng <meng@databricks.com>

Closes #1075 from mengxr/takeSample-fix and squashes the following commits:

45b4590 [Xiangrui Meng] add math3 version to pom
2014-06-13 02:59:38 -07:00
Michael Armbrust 13f8cfdc04 [SPARK-2135][SQL] Use planner for in-memory scans
Author: Michael Armbrust <michael@databricks.com>

Closes #1072 from marmbrus/cachedStars and squashes the following commits:

8757c8e [Michael Armbrust] Use planner for in-memory scans.
2014-06-12 23:09:41 -07:00
John Zhao f95ac686bc [SPARK-1516]Throw exception in yarn client instead of run system.exit directly.
All the changes is in  the package of "org.apache.spark.deploy.yarn":
    1) Throw exception in ClinetArguments and ClientBase instead of exit directly.
    2) in Client's main method, if exception is caught, it will exit with code 1, otherwise exit with code 0.

After the fix, if user integrate the spark yarn client into their applications, when the argument is wrong or the running is finished, the application won't be terminated.

Author: John Zhao <jzhao@alpinenow.com>

Closes #490 from codeboyyong/jira_1516_systemexit_inyarnclient and squashes the following commits:

138cb48 [John Zhao] [SPARK-1516]Throw exception in yarn clinet instead of run system.exit directly. All the changes is in  the package of "org.apache.spark.deploy.yarn": 1) Add a ClientException with an exitCode 2) Throws exception in ClinetArguments and ClientBase instead of exit directly 3) in Client's main method, catch exception and exit with the exitCode.
2014-06-12 21:39:00 -07:00
Andrew Or 44daec5abd [Minor] Fix style, formatting and naming in BlockManager etc.
This is a precursor to a bigger change. I wanted to separate out the relatively insignificant changes so the ultimate PR is not inflated.

(Warning: this PR is full of unimportant nitpicks)

Author: Andrew Or <andrewor14@gmail.com>

Closes #1058 from andrewor14/bm-minor and squashes the following commits:

8e12eaf [Andrew Or] SparkException -> BlockException
c36fd53 [Andrew Or] Make parts of BlockManager more readable
0a5f378 [Andrew Or] Entry -> MemoryEntry
e9762a5 [Andrew Or] Tone down string interpolation (minor reverts)
c4de9ac [Andrew Or] Merge branch 'master' of github.com:apache/spark into bm-minor
b3470f1 [Andrew Or] More string interpolation (minor)
7f9dcab [Andrew Or] Use string interpolation (minor)
94a425b [Andrew Or] Refactor against duplicate code + minor changes
8a6a7dc [Andrew Or] Exception -> SparkException
97c410f [Andrew Or] Deal with MIMA excludes
2480f1d [Andrew Or] Fixes in StorgeLevel.scala
abb0163 [Andrew Or] Style, formatting and naming fixes
2014-06-12 20:40:58 -07:00
Doris Xin 1de1d703bf SPARK-1939 Refactor takeSample method in RDD to use ScaSRS
Modified the takeSample method in RDD to use the ScaSRS sampling technique to improve performance. Added a private method that computes sampling rate > sample_size/total to ensure sufficient sample size with success rate >= 0.9999. Added a unit test for the private method to validate choice of sampling rate.

Author: Doris Xin <doris.s.xin@gmail.com>
Author: dorx <doris.s.xin@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes #916 from dorx/takeSample and squashes the following commits:

5b061ae [Doris Xin] merge master
444e750 [Doris Xin] edge cases
3de882b [dorx] Merge pull request #2 from mengxr/SPARK-1939
82dde31 [Xiangrui Meng] update pyspark's takeSample
48d954d [Doris Xin] remove unused imports from RDDSuite
fb1452f [Doris Xin] allowing num to be greater than count in all cases
1481b01 [Doris Xin] washing test tubes and making coffee
dc699f3 [Doris Xin] give back imports removed by accident in rdd.py
64e445b [Doris Xin] logwarnning as soon as it enters the while loop
55518ed [Doris Xin] added TODO for logging in rdd.py
eff89e2 [Doris Xin] addressed reviewer comments.
ecab508 [Doris Xin] "fixed checkstyle violation
0a9b3e3 [Doris Xin] "reviewer comment addressed"
f80f270 [Doris Xin] Merge branch 'master' into takeSample
ae3ad04 [Doris Xin] fixed edge cases to prevent overflow
065ebcd [Doris Xin] Merge branch 'master' into takeSample
9bdd36e [Doris Xin] Check sample size and move computeFraction
e3fd6a6 [Doris Xin] Merge branch 'master' into takeSample
7cab53a [Doris Xin] fixed import bug in rdd.py
ffea61a [Doris Xin] SPARK-1939: Refactor takeSample method in RDD
1441977 [Doris Xin] SPARK-1939 Refactor takeSample method in RDD to use ScaSRS
2014-06-12 19:44:27 -07:00
Ariel Rabkin 0154587ab7 document laziness of parallelize
Took me several hours to figure out this behavior. It would be good to highlight it in the documentation.

Author: Ariel Rabkin <asrabkin@cs.princeton.edu>

Closes #1070 from asrabkin/master and squashes the following commits:

29a076e [Ariel Rabkin] doc fix
2014-06-12 17:51:33 -07:00
Shuo Xiang a6e0afdcf0 SPARK-2085: [MLlib] Apply user-specific regularization instead of uniform regularization in ALS
The current implementation of ALS takes a single regularization parameter and apply it on both of the user factors and the product factors. This kind of regularization can be less effective while user number is significantly larger than the number of products (and vice versa). For example, if we have 10M users and 1K product, regularization on user factors will dominate. Following the discussion in [this thread](http://apache-spark-user-list.1001560.n3.nabble.com/possible-bug-in-Spark-s-ALS-implementation-tt2567.html#a2704), the implementation in this PR will regularize each factor vector by #ratings * lambda.

Author: Shuo Xiang <sxiang@twitter.com>

Closes #1026 from coderxiang/als-reg and squashes the following commits:

93dfdb4 [Shuo Xiang] Merge remote-tracking branch 'upstream/master' into als-reg
b98f19c [Shuo Xiang] merge latest master
52c7b58 [Shuo Xiang] Apply user-specific regularization instead of uniform regularization in Alternating Least Squares (ALS)
2014-06-12 17:37:06 -07:00
Patrick Wendell 1c04652c8f SPARK-1843: Replace assemble-deps with env variable.
(This change is actually small, I moved some logic into
compute-classpath that was previously in spark-class).

Assemble deps has existed for a while to allow developers to
run local code with new changes quickly. When I'm developing I
typically use a simpler approach which just prepends the Spark
classes to the classpath before the assembly jar. This is well
defined in the JVM and the Spark classes take precedence over those
in the assembly.

This approach is portable across both builds which is the main reason I'd
like to switch to it. It's also a bit easier to toggle on and off quickly.

The way you use this is the following:
```
$ ./bin/spark-shell # Use spark with the normal assembly
$ export SPARK_PREPEND_CLASSES=true
$ ./bin/spark-shell # Now it's using compiled classes
$ unset SPARK_PREPEND_CLASSES
$ ./bin/spark-shell # Back to normal
```

Author: Patrick Wendell <pwendell@gmail.com>

Closes #877 from pwendell/assemble-deps and squashes the following commits:

8a11345 [Patrick Wendell] Merge remote-tracking branch 'apache/master' into assemble-deps
faa3168 [Patrick Wendell] Adding a warning for compatibility
3f151a7 [Patrick Wendell] Small fix
bbfb73c [Patrick Wendell] Review feedback
328e9f8 [Patrick Wendell] SPARK-1843: Replace assemble-deps with env variable.
2014-06-12 15:43:32 -07:00
Marcelo Vanzin ecde5b8375 [SPARK-2080] Yarn: report HS URL in client mode, correct user in cluster mode.
Yarn client mode was not setting the app's tracking URL to the
History Server's URL when configured by the user. Now client mode
behaves the same as cluster mode.

In SparkContext.scala, the "user.name" system property had precedence
over the SPARK_USER environment variable. This means that SPARK_USER
was never used, since "user.name" is always set by the JVM. In Yarn
cluster mode, this means the application always reported itself as
being run by user "yarn" (or whatever user was running the Yarn NM).
One could argue that the correct fix would be to use UGI.getCurrentUser()
here, but at least for Yarn that will match what SPARK_USER is set
to.

Author: Marcelo Vanzin <vanzin@cloudera.com>

This patch had conflicts when merged, resolved by
Committer: Thomas Graves <tgraves@apache.org>

Closes #1002 from vanzin/yarn-client-url and squashes the following commits:

4046e04 [Marcelo Vanzin] Set HS link in yarn-alpha also.
4c692d9 [Marcelo Vanzin] Yarn: report HS URL in client mode, correct user in cluster mode.
2014-06-12 16:19:36 -05:00
Doris Xin 83c226d454 [SPARK-2088] fix NPE in toString
After deserialization, the transient field creationSiteInfo does not get backfilled with the default value, but the toString method, which is invoked by the serializer, expects the field to always be non-null. An NPE is thrown when toString is called by the serializer when creationSiteInfo is null.

Author: Doris Xin <doris.s.xin@gmail.com>

Closes #1028 from dorx/toStringNPE and squashes the following commits:

f20021e [Doris Xin] unit test for toString after desrialization
6f0a586 [Doris Xin] Merge branch 'master' into toStringNPE
f47fecf [Doris Xin] Merge branch 'master' into toStringNPE
76199c6 [Doris Xin] [SPARK-2088] fix NPE in toString
2014-06-12 12:53:07 -07:00
Sandy Ryza ce92a9c18f SPARK-554. Add aggregateByKey.
Author: Sandy Ryza <sandy@cloudera.com>

Closes #705 from sryza/sandy-spark-554 and squashes the following commits:

2302b8f [Sandy Ryza] Add MIMA exclude
f52e0ad [Sandy Ryza] Fix Python tests for real
2f3afa3 [Sandy Ryza] Fix Python test
0b735e9 [Sandy Ryza] Fix line lengths
ae56746 [Sandy Ryza] Fix doc (replace T with V)
c2be415 [Sandy Ryza] Java and Python aggregateByKey
23bf400 [Sandy Ryza] SPARK-554.  Add aggregateByKey.
2014-06-12 08:14:25 -07:00