Commit graph

10517 commits

Author SHA1 Message Date
Cheng Lian 1a49496b4a [SPARK-6082] [SQL] Provides better error message for malformed rows when caching tables
Constructs like Hive `TRANSFORM` may generate malformed rows (via badly authored external scripts for example). I'm a bit hesitant to have this feature, since it introduces per-tuple cost when caching tables. However, considering caching tables is usually a one-time cost, this is probably worth having.

Author: Cheng Lian <lian@databricks.com>

Closes #4842 from liancheng/spark-6082 and squashes the following commits:

b05dbff [Cheng Lian] Provides better error message for malformed rows when caching tables
2015-03-02 16:18:00 -08:00
Michael Armbrust 8223ce6a81 [SPARK-6114][SQL] Avoid metastore conversions before plan is resolved
Author: Michael Armbrust <michael@databricks.com>

Closes #4855 from marmbrus/explodeBug and squashes the following commits:

a712249 [Michael Armbrust] [SPARK-6114][SQL] Avoid metastore conversions before plan is resolved
2015-03-02 16:10:54 -08:00
guliangliang 26c1c56dea [SPARK-5522] Accelerate the History Server start
When the history server starts, all log files are fetched and parsed to obtain each application's metadata (e.g. App Name, Start Time, Duration). In our production cluster there are 2600 log files (160G) in HDFS, and restarting the history server takes 3 hours, which is too long for us.

It would be better if the history server could show logs with missing information during start-up and fill in the missing information after fetching and parsing each log file.

Author: guliangliang <guliangliang@qiyi.com>

Closes #4525 from marsishandsome/Spark5522 and squashes the following commits:

a865c11 [guliangliang] fix bug2
4340c2b [guliangliang] fix bug
af92a5a [guliangliang] [SPARK-5522] Accelerate the History Server start
2015-03-02 15:33:23 -08:00
Marcelo Vanzin 6b348d90f4 [SPARK-6050] [yarn] Relax matching of vcore count in received containers.
Some YARN configurations return a vcore count for allocated
containers that does not match the requested resource. That means
Spark would always ignore those containers. So relax the matching
of the vcore count to allow the Spark jobs to run.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #4818 from vanzin/SPARK-6050 and squashes the following commits:

991c803 [Marcelo Vanzin] Remove config option, standardize on legacy behavior (no vcore matching).
8c9c346 [Marcelo Vanzin] Restrict lax matching to vcores only.
3359692 [Marcelo Vanzin] [SPARK-6050] [yarn] Add config option to do lax resource matching.
2015-03-02 16:41:43 -06:00
q00251598 582e5a24c5 [SPARK-6040][SQL] Fix the percent bug in tablesample
A HiveQL expression like `select count(1) from src tablesample(1 percent);` should sample 1% of the rows, but the current version of Spark treats it as 100%.
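
The intended semantics can be sketched as follows. This is a minimal Python illustration, not Spark's parser code: `percent_to_fraction` is a hypothetical helper name, and the range check is only an assumption about what "check and adjust the fraction" might involve.

```python
def percent_to_fraction(percent: float) -> float:
    """Convert a TABLESAMPLE percentage (0-100) into a sampling fraction (0.0-1.0)."""
    # Assumption for this sketch: values outside [0, 100] are rejected.
    if not 0.0 <= percent <= 100.0:
        raise ValueError(f"Sampling percentage must be in [0, 100], got {percent}")
    # The bug described above amounts to using the percentage as if it were
    # already a fraction, so "1 percent" behaved like a fraction of 1.0.
    return percent / 100.0
```

With this conversion, `tablesample(1 percent)` maps to a fraction of 0.01 rather than 1.0.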

Author: q00251598 <qiyadong@huawei.com>

Closes #4789 from watermen/SPARK-6040 and squashes the following commits:

2453ebe [q00251598] check and adjust the fraction.
2015-03-02 13:16:29 -08:00
Liang-Chi Hsieh 3f9def8117 [Minor] Fix doc typo for describing primitiveTerm effectiveness condition
It should be `true` instead of `false`?

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #4762 from viirya/doc_fix and squashes the following commits:

2e37482 [Liang-Chi Hsieh] Fix doc.
2015-03-02 13:11:17 -08:00
Sean Owen 0b472f60cd SPARK-5390 [DOCS] Encourage users to post on Stack Overflow in Community Docs
Point "Community" to main Spark Community page; mention SO tag apache-spark.

Separately, the Apache site can be updated to mention, under Mailing Lists:
"StackOverflow also has an apache-spark tag for Spark Q&A." or similar.

Author: Sean Owen <sowen@cloudera.com>

Closes #4843 from srowen/SPARK-5390 and squashes the following commits:

3508ac6 [Sean Owen] Point "Community" to main Spark Community page; mention SO tag apache-spark
2015-03-02 21:10:08 +00:00
Paul Power d9a8bae778 [DOCS] Refactored Dataframe join comment to use correct parameter ordering
The API signature for join requires the JoinType to be the third parameter. The code examples provided for join show JoinType being passed as the second parameter, resulting in errors (i.e. `df1.join(df2, "outer", $"df1Key" === $"df2Key")`). The correct sample code is `df1.join(df2, $"df1Key" === $"df2Key", "outer")`.

Author: Paul Power <paul.power@peerside.com>

Closes #4847 from peerside/master and squashes the following commits:

ebc1efa [Paul Power] Merge pull request #1 from peerside/peerside-patch-1
e353340 [Paul Power] Updated comments use correct sample code for Dataframe joins
2015-03-02 13:09:35 -08:00
Yanbo Liang af2effdd7b [SPARK-6080] [PySpark] correct LogisticRegressionWithLBFGS regType parameter for pyspark
Currently LogisticRegressionWithLBFGS in python/pyspark/mllib/classification.py invokes callMLlibFunc with a wrong "regType" parameter.
It was assigned "str(regType)", which translates None (Python) to "None" (Java/Scala). The right way is to translate None (Python) to null (Java/Scala), just as we do in LogisticRegressionWithSGD.
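
The pitfall is easy to reproduce in plain Python. The sketch below does not call PySpark; `reg_type_buggy` and `reg_type_fixed` are hypothetical names used only to contrast the two ways of preparing the argument.

```python
from typing import Optional


def reg_type_buggy(reg_type: Optional[str]) -> str:
    # str(None) yields the four-character string "None", not a null value.
    return str(reg_type)


def reg_type_fixed(reg_type: Optional[str]) -> Optional[str]:
    # Pass None through unchanged so that a bridge layer can map it to null.
    return reg_type
```

The buggy variant turns "no regularization" into a string that the receiving side would have to treat as a regularizer name.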

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #4831 from yanboliang/pyspark_classification and squashes the following commits:

12db65a [Yanbo Liang] correct LogisticRegressionWithLBFGS regType parameter for pyspark
2015-03-02 10:17:24 -08:00
DEBORAH SIEGEL e7d8ae444f aggregateMessages example in graphX doc
The examples illustrating the difference between legacy mapReduceTriplets usage and aggregateMessages usage have type issues in the reduce step for both operators.

Since this is just an example, changed it to reduce the message String by concatenation, although this is non-optimal for performance.

Author: DEBORAH SIEGEL <deborahsiegel@DEBORAHs-MacBook-Pro.local>

Closes #4853 from d3borah/master and squashes the following commits:

db54173 [DEBORAH SIEGEL] fixed aggregateMessages example in graphX doc
2015-03-02 10:15:32 -08:00
q00251598 9ce12aaf28 [SPARK-5741][SQL] Support the path contains comma in HiveContext
When running ```select * from nzhang_part where hr = 'file,';```, it throws the exception ```java.lang.IllegalArgumentException: Can not create a Path from an empty string```, because the HDFS path contains a comma and FileInputFormat.setInputPaths splits paths on commas.
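
A simplified model of the failure, written in Python rather than against the Hadoop API: `split_paths_on_commas` is a hypothetical stand-in that treats every comma as a separator, and the partition directory below is an invented example path.

```python
from typing import List


def split_paths_on_commas(comma_separated: str) -> List[str]:
    # Simplified model: every comma is treated as a path separator.
    return comma_separated.split(",")


# Hypothetical partition directory whose last component ends with a comma.
partition_dir = "/warehouse/nzhang_part/ds=2010-08-15/hr=file,"
parts = split_paths_on_commas(partition_dir)
```

The trailing comma leaves an empty string as the last element of `parts`, which is the kind of value that cannot be turned into a path.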

### SQL
```
set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=true;
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

create table nzhang_part like srcpart;

insert overwrite table nzhang_part partition (ds='2010-08-15', hr) select key, value, hr from srcpart where ds='2008-04-08';

insert overwrite table nzhang_part partition (ds='2010-08-15', hr=11) select key, value from srcpart where ds='2008-04-08';

insert overwrite table nzhang_part partition (ds='2010-08-15', hr)
select * from (
select key, value, hr from srcpart where ds='2008-04-08'
union all
select '1' as key, '1' as value, 'file,' as hr from src limit 1) s;

select * from nzhang_part where hr = 'file,';
```

### Error Log
```
15/02/10 14:33:16 ERROR SparkSQLDriver: Failed in [select * from nzhang_part where hr = 'file,']
java.lang.IllegalArgumentException: Can not create a Path from an empty string
at org.apache.hadoop.fs.Path.checkPathArg(Path.java:127)
at org.apache.hadoop.fs.Path.<init>(Path.java:135)
at org.apache.hadoop.util.StringUtils.stringToPath(StringUtils.java:241)
at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:400)
at org.apache.spark.sql.hive.HadoopTableReader$.initializeLocalJobConfFunc(TableReader.scala:251)
at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$11.apply(TableReader.scala:229)
at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$11.apply(TableReader.scala:229)
at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:172)
at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:172)
at scala.Option.map(Option.scala:145)
at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:172)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:196)
```

Author: q00251598 <qiyadong@huawei.com>

Closes #4532 from watermen/SPARK-5741 and squashes the following commits:

9758ab1 [q00251598] fix bug
1db1a1c [q00251598] use setInputPaths(Job job, Path... inputPaths)
b788a72 [q00251598] change FileInputFormat.setInputPaths to jobConf.set and add test suite
2015-03-02 10:13:11 -08:00
Kenneth Myers 95ac68bf12 [SPARK-6111] Fixed usage string in documentation.
Usage info in documentation does not match actual usage info.

Doc string usage says ```Usage: network_wordcount.py <zk> <topic>``` whereas the actual usage is ```Usage: kafka_wordcount.py <zk> <topic>```

Author: Kenneth Myers <myerske@us.ibm.com>

Closes #4852 from kennethmyers/kafka_wordcount_documentation_fix and squashes the following commits:

3855325 [Kenneth Myers] Fixed usage string in documentation.
2015-03-02 17:25:24 +00:00
Yin Huai 3efd8bb6cf [SPARK-6052][SQL]In JSON schema inference, we should always set containsNull of an ArrayType to true
Always set `containsNull = true` when inferring the schema of JSON datasets. If we set `containsNull` based on the records we scanned, we may miss arrays with null values when sampling. Also, because future data can have arrays with null values, always setting `containsNull = true` is more robust if we convert JSON data to Parquet.
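
The sampling problem can be illustrated with a small Python sketch. Both function names are hypothetical and the records are made-up data; this is not Spark's inference code.

```python
from typing import Any, List, Optional


def contains_null_from_sample(sampled_arrays: List[List[Optional[Any]]]) -> bool:
    # Data-dependent inference: only reports nulls the sample happened to include.
    return any(item is None for arr in sampled_arrays for item in arr)


def contains_null_conservative(sampled_arrays: List[List[Optional[Any]]]) -> bool:
    # Conservative choice: always assume arrays may hold nulls.
    return True


all_records = [[1, 2], [3, None], [4, 5]]  # made-up array values
sample = all_records[:1]                   # a sample that misses the null
```

Inference from `sample` concludes there are no nulls even though `all_records` contains one, while the conservative variant is unaffected by what the sample contains.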

JIRA: https://issues.apache.org/jira/browse/SPARK-6052

Author: Yin Huai <yhuai@databricks.com>

Closes #4806 from yhuai/jsonArrayContainsNull and squashes the following commits:

05eab9d [Yin Huai] Change containsNull to true.
2015-03-02 23:18:07 +08:00
Yin Huai 39a54b40af [SPARK-6073][SQL] Need to refresh metastore cache after append data in CreateMetastoreDataSourceAsSelect
JIRA: https://issues.apache.org/jira/browse/SPARK-6073

liancheng

Author: Yin Huai <yhuai@databricks.com>

Closes #4824 from yhuai/refreshCache and squashes the following commits:

b9542ef [Yin Huai] Refresh metadata cache in the Catalog in CreateMetastoreDataSourceAsSelect.
2015-03-02 22:42:18 +08:00
Lianhui Wang 49c7a8f6f3 [SPARK-6103][Graphx]remove unused class to import in EdgeRDDImpl
Class TaskContext is unused in EdgeRDDImpl, so we need to remove it from the import list.

Author: Lianhui Wang <lianhuiwang09@gmail.com>

Closes #4846 from lianhuiwang/SPARK-6103 and squashes the following commits:

31aed64 [Lianhui Wang] remove unused class to import in EdgeRDDImpl
2015-03-02 09:06:56 +00:00
Sean Owen 948c2390ab SPARK-3357 [CORE] Internal log messages should be set at DEBUG level instead of INFO
Demote some 'noisy' log messages to debug level. I added a few more, to include everything that gets logged in stanzas like this:

```
15/03/01 00:03:54 INFO BlockManager: Removing broadcast 0
15/03/01 00:03:54 INFO BlockManager: Removing block broadcast_0_piece0
15/03/01 00:03:54 INFO MemoryStore: Block broadcast_0_piece0 of size 839 dropped from memory (free 277976091)
15/03/01 00:03:54 INFO BlockManagerInfo: Removed broadcast_0_piece0 on localhost:49524 in memory (size: 839.0 B, free: 265.1 MB)
15/03/01 00:03:54 INFO BlockManagerMaster: Updated info of block broadcast_0_piece0
15/03/01 00:03:54 INFO BlockManager: Removing block broadcast_0
15/03/01 00:03:54 INFO MemoryStore: Block broadcast_0 of size 1088 dropped from memory (free 277977179)
15/03/01 00:03:54 INFO ContextCleaner: Cleaned broadcast 0
```

as well as regular messages like

```
15/03/01 00:02:33 INFO MemoryStore: ensureFreeSpace(2640) called with curMem=47322, maxMem=278019440
```

WDYT? good or should some be left alone?

CC mengxr who suggested some of this.

Author: Sean Owen <sowen@cloudera.com>

Closes #4838 from srowen/SPARK-3357 and squashes the following commits:

dce75c1 [Sean Owen] Back out some debug level changes
d9b784d [Sean Owen] Demote some 'noisy' log messages to debug level
2015-03-02 08:51:03 +00:00
Saisai Shao d8fb40edea [Streaming][Minor]Fix some error docs in streaming examples
Small changes, please help to review, thanks a lot.

Author: Saisai Shao <saisai.shao@intel.com>

Closes #4837 from jerryshao/doc-fix and squashes the following commits:

545291a [Saisai Shao] Fix some error docs in streaming examples
2015-03-02 08:49:19 +00:00
MechCoder 3f00bb3ef1 [SPARK-6083] [MLLib] [DOC] Make Python API example consistent in NaiveBayes
Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #4834 from MechCoder/spark-6083 and squashes the following commits:

1cdd7b5 [MechCoder] Add parse function
65bbbe9 [MechCoder] [SPARK-6083] Make Python API example consistent in NaiveBayes
2015-03-01 16:28:15 -08:00
Xiangrui Meng aedbbaa3dd [SPARK-6053][MLLIB] support save/load in PySpark's ALS
A simple wrapper to save/load `MatrixFactorizationModel` in Python. jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #4811 from mengxr/SPARK-5991 and squashes the following commits:

f135dac [Xiangrui Meng] update save doc
57e5200 [Xiangrui Meng] address comments
06140a4 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-5991
282ec8d [Xiangrui Meng] support save/load in PySpark's ALS
2015-03-01 16:26:57 -08:00
Marcelo Vanzin fd8d283eeb [SPARK-6074] [sql] Package pyspark sql bindings.
This is needed for the SQL bindings to work on Yarn.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #4822 from vanzin/SPARK-6074 and squashes the following commits:

fb52001 [Marcelo Vanzin] [SPARK-6074] [sql] Package pyspark sql bindings.
2015-03-01 11:05:10 +00:00
Josh Rosen 2df5f1f006 [SPARK-6075] Fix bug in that caused lost accumulator updates: do not store WeakReferences in localAccums map
This fixes a non-deterministic bug introduced in #4021 that could cause tasks' accumulator updates to be lost.  The problem is that `localAccums` should not hold weak references: after the task finishes running there won't be any strong references to these local accumulators, so they can get garbage-collected before the executor reads the `localAccums` map.  We don't need weak references here anyways, since this map is cleared at the end of each task.
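
The same hazard exists in Python and can be sketched with the standard `weakref` module. This is an analogy, not the Scala code: `Accumulator`, `weak_accums`, `strong_accums`, and `run_task` are hypothetical names.

```python
import gc
import weakref


class Accumulator:
    def __init__(self, value: int) -> None:
        self.value = value


weak_accums = weakref.WeakValueDictionary()  # values are held weakly
strong_accums = {}                           # values are held strongly


def run_task() -> None:
    # The task's local variables are the only other strong references,
    # and they disappear when the function returns.
    a, b = Accumulator(1), Accumulator(2)
    weak_accums[1] = a
    strong_accums[2] = b


run_task()
gc.collect()  # make collection explicit for runtimes without reference counting
```

After the task finishes, the weakly held entry is gone before anyone reads the map, while the strongly held entry survives until the map is cleared explicitly.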

Author: Josh Rosen <joshrosen@databricks.com>

Closes #4835 from JoshRosen/SPARK-6075 and squashes the following commits:

4f4b5b2 [Josh Rosen] Remove defensive assertions that caused test failures in code unrelated to this change
120c7b0 [Josh Rosen] [SPARK-6075] Do not store WeakReferences in localAccums map
2015-02-28 22:51:01 -08:00
Evan Yu 643300a6e2 SPARK-5984: Fix TimSort bug causes ArrayOutOfBoundsException
Fix TimSort bug which causes an ArrayOutOfBoundsException.

Using the proposed fix here
http://envisage-project.eu/proving-android-java-and-python-sorting-algorithm-is-broken-and-how-to-fix-it/

Author: Evan Yu <ehotou@gmail.com>

Closes #4804 from hotou/SPARK-5984 and squashes the following commits:

3421b6c [Evan Yu] SPARK-5984: Add info to LICENSE
e61c6b8 [Evan Yu] SPARK-5984: Fix license and document
6ccc280 [Evan Yu] SPARK-5984: Add License header to file
e06c0d2 [Evan Yu] SPARK-5984: Add License header to file
4d95f75 [Evan Yu] SPARK-5984: Fix TimSort bug causes ArrayOutOfBoundsException
479a106 [Evan Yu] SPARK-5984: Fix TimSort bug causes ArrayOutOfBoundsException
2015-02-28 18:55:34 -08:00
Sean Owen 86fcdaef62 SPARK-1965 [WEBUI] Spark UI throws NPE on trying to load the app page for non-existent app
Don't throw NPE if appId is unknown. kayousterhout is this a decent enough band-aid for avoiding a full-blown NPE? it should just render empty content instead

Author: Sean Owen <sowen@cloudera.com>

Closes #4777 from srowen/SPARK-1965 and squashes the following commits:

7e16590 [Sean Owen] Update app not found message
cb878d6 [Sean Owen] Return basic "not found" page for unknown appId
d8270da [Sean Owen] Don't throw NPE if appId is unknown
2015-02-28 15:34:08 +00:00
Sean Owen f91298e2c5 SPARK-5983 [WEBUI] Don't respond to HTTP TRACE in HTTP-based UIs
Disallow TRACE HTTP method in servlets

Author: Sean Owen <sowen@cloudera.com>

Closes #4765 from srowen/SPARK-5983 and squashes the following commits:

421b25b [Sean Owen] Disallow TRACE HTTP method in servlets
2015-02-28 15:23:59 +00:00
Michael Griffiths b36b1bc22e SPARK-6063 MLlib doesn't pass mvn scalastyle check due to UTF chars in LDAModel.scala
Remove unicode characters from MLlib file.

Author: Michael Griffiths <msjgriffiths@gmail.com>
Author: Griffiths, Michael (NYC-RPM) <michael.griffiths@reprisemedia.com>

Closes #4815 from msjgriffiths/SPARK-6063 and squashes the following commits:

bcd7de1 [Griffiths, Michael (NYC-RPM)] Change \u201D quote marks around 'theta' to standard single apostrophe (\x27)
38eb535 [Michael Griffiths] Merge pull request #2 from apache/master
b08e865 [Michael Griffiths] Merge pull request #1 from apache/master
2015-02-28 14:48:03 +00:00
Cheng Lian e6003f0a57 [SPARK-5775] [SQL] BugFix: GenericRow cannot be cast to SpecificMutableRow when nested data and partitioned table
This PR adapts anselmevignon's #4697 to master and branch-1.3. Please refer to PR description of #4697 for details.

Author: Cheng Lian <lian@databricks.com>
Author: Cheng Lian <liancheng@users.noreply.github.com>
Author: Yin Huai <yhuai@databricks.com>

Closes #4792 from liancheng/spark-5775 and squashes the following commits:

538f506 [Cheng Lian] Addresses comments
cee55cf [Cheng Lian] Merge pull request #4 from yhuai/spark-5775-yin
b0b74fb [Yin Huai] Remove runtime pattern matching.
ca6e038 [Cheng Lian] Fixes SPARK-5775
2015-02-28 21:15:43 +08:00
Patrick Wendell 9168259813 MAINTENANCE: Automated closing of pull requests.
This commit exists to close the following pull requests on Github:

Closes #1128 (close requested by 'srowen')
Closes #3425 (close requested by 'srowen')
Closes #4770 (close requested by 'srowen')
Closes #2813 (close requested by 'srowen')
2015-02-27 23:10:09 -08:00
Burak Yavuz 6d8e5fbc0d [SPARK-5979][SPARK-6032] Smaller safer --packages fix
pwendell tdas
These are the safer parts of PR #4754:
 - SPARK-5979: All dependencies with the groupId `org.apache.spark` passed through `--packages`, were being excluded from the dependency tree on the assumption that they would be in the assembly jar. This is not the case, therefore the exclusion rules had to be defined more explicitly.
 - SPARK-6032: Ivy prints a whole lot of logs while retrieving dependencies. These were printed to `System.out`. Moved the logging to `System.err`.

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #4802 from brkyvz/simple-streaming-fix and squashes the following commits:

e0f38cb [Burak Yavuz] Merge branch 'master' of github.com:apache/spark into simple-streaming-fix
bad921c [Burak Yavuz] [SPARK-5979][SPARK-6032] Smaller safer fix
2015-02-27 22:59:35 -08:00
Marcelo Vanzin dba08d1fc3 [SPARK-6070] [yarn] Remove unneeded classes from shuffle service jar.
These may conflict with the classes already in the NM. We shouldn't
be repackaging them.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #4820 from vanzin/SPARK-6070 and squashes the following commits:

871b566 [Marcelo Vanzin] The "d'oh how didn't I think of it before" solution.
3cba946 [Marcelo Vanzin] Use profile instead, so that dependencies don't need to be explicitly listed.
7a18a1b [Marcelo Vanzin] [SPARK-6070] [yarn] Remove unneeded classes from shuffle service jar.
2015-02-27 22:44:11 -08:00
Davies Liu e0e64ba4b1 [SPARK-6055] [PySpark] fix incorrect __eq__ of DataType
The `__eq__` of DataType is not correct, so the class cache is not used correctly (a created class cannot be found by its dataType). As a result, lots of classes are created (saved in _cached_cls) and never released.

Also, all instances of the same DataType have the same hash code, so a dict can hold many objects with the same hash code. This ends up like a hash attack, and accessing the dict becomes very slow (depending on the implementation of CPython).

This PR also improves the performance of inferSchema (avoiding unnecessary conversion of objects).

cc pwendell  JoshRosen

Author: Davies Liu <davies@databricks.com>

Closes #4808 from davies/leak and squashes the following commits:

6a322a4 [Davies Liu] tests refactor
3da44fc [Davies Liu] fix __eq__ of Singleton
534ac90 [Davies Liu] add more checks
46999dc [Davies Liu] fix tests
d9ae973 [Davies Liu] fix memory leak in sql
2015-02-27 20:07:17 -08:00
Cheng Lian 8c468a6600 [SPARK-5751] [SQL] Sets SPARK_HOME as SPARK_PID_DIR when running Thrift server test suites
This is a follow-up of #4720. By default, `spark-daemon.sh` writes PID files under `/tmp`, which makes it impossible to start multiple server instances simultaneously. This PR sets `SPARK_PID_DIR` to the Spark home directory to work around this problem.

Many thanks to chenghao-intel for pointing out this issue!

Author: Cheng Lian <lian@databricks.com>

Closes #4758 from liancheng/thriftserver-pid-dir and squashes the following commits:

252fa0f [Cheng Lian] Uses temporary directory as Thrift server PID directory
1b3d1e3 [Cheng Lian] Sets SPARK_HOME as SPARK_PID_DIR when running Thrift server test suites
2015-02-28 08:41:49 +08:00
Saisai Shao 5f7f3b938e [Streaming][Minor] Remove useless type signature of Java Kafka direct stream API
cc tdas .

Author: Saisai Shao <saisai.shao@intel.com>

Closes #4817 from jerryshao/signature-minor-fix and squashes the following commits:

eebfaac [Saisai Shao] Remove useless type parameter
2015-02-27 13:01:42 -08:00
Joseph K. Bradley d17cb2ba33 [SPARK-4587] [mllib] [docs] Fixed save,load calls in ML guide examples
Should pass spark context to save/load

CC: mengxr

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #4816 from jkbradley/ml-io-doc-fix and squashes the following commits:

83d369d [Joseph K. Bradley] added comment to save,load parts of ML guide examples
2841170 [Joseph K. Bradley] Fixed save,load calls in ML guide examples
2015-02-27 13:00:36 -08:00
zsxwing 57566d0af3 [SPARK-6059][Yarn] Add volatile to ApplicationMaster's reporterThread and allocator
`ApplicationMaster.reporterThread` and `ApplicationMaster.allocator` are accessed in multiple threads, so they should be marked as `volatile`.

Author: zsxwing <zsxwing@gmail.com>

Closes #4814 from zsxwing/SPARK-6059 and squashes the following commits:

17d9386 [zsxwing] Add volatile to ApplicationMaster's reporterThread and allocator
2015-02-27 13:33:39 +00:00
zsxwing e747e98490 [SPARK-6058][Yarn] Log the user class exception in ApplicationMaster
Because ApplicationMaster doesn't set SparkUncaughtExceptionHandler, the exception in the user class won't be logged. This PR added a `logError` for it.

Author: zsxwing <zsxwing@gmail.com>

Closes #4813 from zsxwing/SPARK-6058 and squashes the following commits:

806c932 [zsxwing] Log the user class exception
2015-02-27 13:31:46 +00:00
Zhang, Liye 8cd1692c90 [SPARK-6036][CORE] avoid race condition between eventlogListener and akka actor system
For a detailed description, please refer to [SPARK-6036](https://issues.apache.org/jira/browse/SPARK-6036).

Author: Zhang, Liye <liye.zhang@intel.com>

Closes #4785 from liyezhang556520/EventLogInProcess and squashes the following commits:

8b0b0a6 [Zhang, Liye] stop listener after DAGScheduler
79b15b3 [Zhang, Liye] SPARK-6036 avoid race condition between eventlogListener and akka actor system
2015-02-26 23:11:43 -08:00
许鹏 0375a413b8 fix spark-6033, clarify the spark.worker.cleanup behavior in standalone mode
jira case spark-6033 https://issues.apache.org/jira/browse/SPARK-6033

In standalone deploy mode, the cleanup will only remove the stopped application's directories.

The original description about the cleanup behavior is incorrect.

Author: 许鹏 <peng.xu@fraudmetrix.cn>

Closes #4803 from hseagle/spark-6033 and squashes the following commits:

927a6a0 [许鹏] fix the incorrect description about the spark.worker.cleanup in standalone mode
2015-02-26 23:06:34 -08:00
Andrew Or 7c99a014fb [SPARK-6046] Privatize SparkConf.translateConfKey
The warning about deprecated configs is actually issued when the configs are set, not when they are read. As a result we don't need to explicitly call `translateConfKey` outside of `SparkConf` just to print the warning again in vain.

Author: Andrew Or <andrew@databricks.com>

Closes #4797 from andrewor14/warn-deprecated-config and squashes the following commits:

8fb43e6 [Andrew Or] Privatize SparkConf.translateConfKey
2015-02-26 22:39:46 -08:00
Lukasz Jastrzebski 4a8a0a8ecd SPARK-2168 [Spark core] Use relative URIs for the app links in the History Server.
As agreed in PR #1160, this adds a test to verify that the history server generates relative links to applications.

Author: Lukasz Jastrzebski <lukasz.jastrzebski@gmail.com>

Closes #4778 from elyast/master and squashes the following commits:

0c07fab [Lukasz Jastrzebski] Incorporating comments for SPARK-2168
6d7866d [Lukasz Jastrzebski] Adjusting test for  SPARK-2168 for master branch
d6f4fbe [Lukasz Jastrzebski] Added test for  SPARK-2168
2015-02-26 22:38:06 -08:00
jerryshao 67595eb8fb [SPARK-5495][UI] Add app and driver kill function in master web UI
Add application kill function in master web UI for standalone mode. Details can be seen in [SPARK-5495](https://issues.apache.org/jira/browse/SPARK-5495).

The snapshot of UI shows as below:
![snapshot](https://dl.dropboxusercontent.com/u/19230832/master_ui.png)

Please help to review, thanks a lot.

Author: jerryshao <saisai.shao@intel.com>

Closes #4288 from jerryshao/SPARK-5495 and squashes the following commits:

fa3e486 [jerryshao] Add some conditions
9a7be93 [jerryshao] Add kill Driver function
a239776 [jerryshao] Change the code format
ff5195d [jerryshao] Add app kill function in master web UI
2015-02-26 22:36:48 -08:00
jerryshao 12135e9054 [SPARK-5771][UI][hotfix] Change Requested Cores into * if default cores is not set
cc andrewor14, srowen.

Author: jerryshao <saisai.shao@intel.com>

Closes #4800 from jerryshao/SPARK-5771 and squashes the following commits:

a2483c2 [jerryshao] Change the UI of Requested Cores into * if default cores is not set
2015-02-26 22:35:43 -08:00
Yin Huai 5e5ad6558d [SPARK-6024][SQL] When a data source table has too many columns, its schema cannot be stored in metastore.
JIRA: https://issues.apache.org/jira/browse/SPARK-6024
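
One of the squashed commits describes the approach as splitting the schema's JSON string when it is too large. A minimal Python sketch of that idea follows; the property keys `schema.numParts` and `schema.part.N`, the function names, and the threshold are assumptions made for illustration, not the names used by the patch.

```python
from typing import Dict


def split_schema(schema_json: str, threshold: int) -> Dict[str, str]:
    """Split a long schema string into numbered parts no longer than `threshold`."""
    parts = [schema_json[i:i + threshold] for i in range(0, len(schema_json), threshold)]
    props = {"schema.numParts": str(len(parts))}  # hypothetical key
    for i, part in enumerate(parts):
        props[f"schema.part.{i}"] = part          # hypothetical key
    return props


def join_schema(props: Dict[str, str]) -> str:
    """Reassemble the schema string from its numbered parts."""
    num_parts = int(props["schema.numParts"])
    return "".join(props[f"schema.part.{i}"] for i in range(num_parts))
```

Storing the part count alongside the parts lets the reader reassemble them in order without scanning for keys.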

Author: Yin Huai <yhuai@databricks.com>

Closes #4795 from yhuai/wideSchema and squashes the following commits:

4882e6f [Yin Huai] Address comments.
73e71b4 [Yin Huai] Address comments.
143927a [Yin Huai] Simplify code.
cc1d472 [Yin Huai] Make the schema wider.
12bacae [Yin Huai] If the JSON string of a schema is too large, split it before storing it in metastore.
e9b4f70 [Yin Huai] Failed test.
2015-02-26 20:46:05 -08:00
Liang-Chi Hsieh 4ad5153f54 [SPARK-6037][SQL] Avoiding duplicate Parquet schema merging
`FilteringParquetRowInputFormat` manually merges Parquet schemas before computing splits. However, this is redundant because the schemas are already merged in `ParquetRelation2`. We don't need to re-merge them at `InputFormat`.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #4786 from viirya/dup_parquet_schemas_merge and squashes the following commits:

ef78a5a [Liang-Chi Hsieh] Avoiding duplicate Parquet schema merging.
2015-02-27 11:06:47 +08:00
Hong Shen 18f2098433 [SPARK-5529][CORE]Add expireDeadHosts in HeartbeatReceiver
If a blockManager has not sent a heartbeat for more than 120s, BlockManagerMasterActor will remove it. But coarseGrainedSchedulerBackend can only remove an executor after a DisassociatedEvent. We should expireDeadHosts at HeartbeatReceiver.
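
The expiry logic can be sketched in a few lines of Python. This is a generic timeout tracker, not the HeartbeatReceiver code: `HeartbeatTracker` and its method names are hypothetical, and the 120-second default is taken from the description above.

```python
from typing import Dict, List


class HeartbeatTracker:
    def __init__(self, timeout_s: float = 120.0) -> None:
        self.timeout_s = timeout_s
        self.last_seen: Dict[str, float] = {}  # executor id -> last heartbeat time

    def heartbeat(self, executor_id: str, now: float) -> None:
        self.last_seen[executor_id] = now

    def expire_dead_hosts(self, now: float) -> List[str]:
        # Collect executors whose last heartbeat is older than the timeout,
        # then forget them so they are reported only once.
        dead = [e for e, t in self.last_seen.items() if now - t > self.timeout_s]
        for executor_id in dead:
            del self.last_seen[executor_id]
        return dead
```

Calling `expire_dead_hosts` periodically gives the caller a list of executors to remove, independent of any disconnect event.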

Author: Hong Shen <hongshen@tencent.com>

Closes #4363 from shenh062326/my_change3 and squashes the following commits:

2c9a46a [Hong Shen] Change some code style.
1a042ff [Hong Shen] Change some code style.
2dc456e [Hong Shen] Change some code style.
d221493 [Hong Shen] Fix test failed
7448ac6 [Hong Shen] A minor change in sparkContext and heartbeatReceiver
b904aed [Hong Shen] Fix failed test
52725af [Hong Shen] Remove assert in SparkContext.killExecutors
5bedcb8 [Hong Shen] Remove assert in SparkContext.killExecutors
a858fb5 [Hong Shen] A minor change in HeartbeatReceiver
3e221d9 [Hong Shen] A minor change in HeartbeatReceiver
6bab7aa [Hong Shen] Change a code style.
07952f3 [Hong Shen] Change configs name and code style.
ce9257e [Hong Shen] Fix test failed
bccd515 [Hong Shen] Fix test failed
8e77408 [Hong Shen] Fix test failed
c1dfda1 [Hong Shen] Fix test failed
e197e20 [Hong Shen] Fix test failed
fb5df97 [Hong Shen] Remove ExpireDeadHosts in BlockManagerMessages
b5c0441 [Hong Shen] Remove expireDeadHosts in BlockManagerMasterActor
c922cb0 [Hong Shen] Add expireDeadHosts in HeartbeatReceiver
2015-02-26 18:43:23 -08:00
Sean Owen fbc469473d SPARK-4579 [WEBUI] Scheduling Delay appears negative
Ensure scheduler delay handles unfinished task case, and ensure delay is never negative even due to rounding
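
A minimal Python sketch of the two guards described here. The function and parameter names are hypothetical, and folding all non-scheduler time into a single `overhead_ms` argument is a simplification of whatever the UI actually subtracts.

```python
from typing import Optional


def scheduler_delay_ms(
    launch_time_ms: int,
    finish_time_ms: Optional[int],
    executor_run_time_ms: int,
    overhead_ms: int,
) -> int:
    # An unfinished task has no finish time yet, so report no delay for it.
    if finish_time_ms is None:
        return 0
    total_ms = finish_time_ms - launch_time_ms
    # Clamp at zero so rounding in the individual measurements cannot
    # produce a negative delay.
    return max(0, total_ms - executor_run_time_ms - overhead_ms)
```

The clamp matters when the measured components sum to slightly more than the wall-clock total.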

Author: Sean Owen <sowen@cloudera.com>

Closes #4796 from srowen/SPARK-4579 and squashes the following commits:

ad6713c [Sean Owen] Ensure scheduler delay handles unfinished task case, and ensure delay is never negative even due to rounding
2015-02-26 17:35:09 -08:00
tedyu e60ad2f4c4 SPARK-6045 RecordWriter should be checked against null in PairRDDFunctio...
...ns#saveAsNewAPIHadoopDataset

Author: tedyu <yuzhihong@gmail.com>

Closes #4794 from tedyu/master and squashes the following commits:

2632a57 [tedyu] SPARK-6045 RecordWriter should be checked against null in PairRDDFunctions#saveAsNewAPIHadoopDataset
2d8d4b1 [tedyu] SPARK-6045 RecordWriter should be checked against null in PairRDDFunctions#saveAsNewAPIHadoopDataset
2015-02-26 23:27:09 +00:00
mohit.goyal b38dec2ffd [SPARK-5951][YARN] Remove unreachable driver memory properties in yarn client mode
Remove unreachable driver memory properties in yarn client mode

Author: mohit.goyal <mohit.goyal@guavus.com>

Closes #4730 from zuxqoj/master and squashes the following commits:

977dc96 [mohit.goyal] remove unreachable deprecated variables in yarn client mode
2015-02-26 14:27:47 -08:00
moussa taifi c871e2dae0 Add a note for context termination for History server on Yarn
The history server on Yarn only shows completed jobs. This adds a note concerning the needed explicit context termination at the end of a spark job which is a best practice anyway.
Related to SPARK-2972 and SPARK-3458

Author: moussa taifi <moutai10@gmail.com>

Closes #4721 from moutai/add-history-server-note-for-closing-the-spark-context and squashes the following commits:

9f5b6c3 [moussa taifi] Fix upper case typo for YARN
3ad3db4 [moussa taifi] Add context termination for History server on Yarn
2015-02-26 14:20:30 -08:00
Sean Owen 3fb53c0298 SPARK-4300 [CORE] Race condition during SparkWorker shutdown
Close appender saving stdout/stderr before destroying process to avoid exception on reading closed input stream.
(This also removes a redundant `waitFor()` although it was harmless)

CC tdas since I think you wrote this method.

Author: Sean Owen <sowen@cloudera.com>

Closes #4787 from srowen/SPARK-4300 and squashes the following commits:

e0cdabf [Sean Owen] Close appender saving stdout/stderr before destroying process to avoid exception on reading closed input stream
2015-02-26 14:08:56 -08:00
Cheolsoo Park 5f3238b3b0 [SPARK-6018] [YARN] NoSuchMethodError in Spark app is swallowed by YARN AM
Author: Cheolsoo Park <cheolsoop@netflix.com>

Closes #4773 from piaozhexiu/SPARK-6018 and squashes the following commits:

2a919d5 [Cheolsoo Park] Rename e with cause to avoid duplicate names
1e71d2d [Cheolsoo Park] Replace placeholder with throwable
eb5750d [Cheolsoo Park] NoSuchMethodError in Spark app is swallowed by YARN AM
2015-02-26 13:53:49 -08:00