Author: Ryan Williams <ryan.blake.williams@gmail.com>
Closes #4573 from ryan-williams/history and squashes the following commits:
a8647ec [Ryan Williams] fix test calls to .replay()
98aa3fe [Ryan Williams] include filename in history-parsing error message
8deecf0 [Ryan Williams] add line number to history-parsing error message
b668b52 [Ryan Williams] add log info line to history-eventlog parsing
Author: sboeschhuawei <stephen.boesch@huawei.com>
Closes #4495 from javadba/picexamples and squashes the following commits:
3c84b14 [sboeschhuawei] PIC Examples updates from Xiangrui's comments round 5
2878675 [sboeschhuawei] Fourth round with xiangrui on PICExample
d7ac350 [sboeschhuawei] Updates to PICExample from Xiangrui's comments round 3
d7f0cba [sboeschhuawei] Updates to PICExample from Xiangrui's comments round 3
cef28f4 [sboeschhuawei] Further updates to PICExample from Xiangrui's comments
f7ff43d [sboeschhuawei] Update to PICExample from Xiangrui's comments
efeec45 [sboeschhuawei] Update to PICExample from Xiangrui's comments
03e8de4 [sboeschhuawei] Added PICExample
c509130 [sboeschhuawei] placeholder for pic examples
5864d4a [sboeschhuawei] placeholder for pic examples
Naturally, we may need to add an option to print the Spark version in the Spark scripts. This is pretty common in script tools.
![9](https://cloud.githubusercontent.com/assets/7402327/6183331/cab1b74e-b38e-11e4-9daa-e26e6015cff3.JPG)
Author: uncleGen <hustyugm@gmail.com>
Author: genmao.ygm <genmao.ygm@alibaba-inc.com>
Closes #4522 from uncleGen/master-clean-150211 and squashes the following commits:
9f2127c [genmao.ygm] revert the behavior of "-v"
015ddee [uncleGen] minor changes
463f02c [uncleGen] minor changes
Some other process might be using the pid saved in the pid file. In that case we should ignore it and launch the daemons.
JIRA is down for maintenance. I will file an issue once it returns.
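The pid check described above could look roughly like the following Python sketch. It is not the script's actual logic: the helper names `command_of` and `should_launch`, the `/proc` lookup (Linux only), and the assumption that the daemon's command name contains `java` are all illustrative.

```python
from typing import Callable, Optional


def command_of(pid: int) -> Optional[str]:
    """Return the command name of a live process, or None if there is none.

    Reads /proc/<pid>/comm, so this helper only works on Linux.
    """
    try:
        with open(f"/proc/{pid}/comm") as f:
            return f.read().strip()
    except OSError:
        return None


def should_launch(pid_file: str,
                  lookup: Callable[[int], Optional[str]] = command_of,
                  expected: str = "java") -> bool:
    """Decide whether a new daemon may be started.

    Launch unless the pid file points at a live process whose command
    looks like our daemon (assumed here to contain `expected`).
    """
    try:
        with open(pid_file) as f:
            pid = int(f.read().strip())
    except (OSError, ValueError):
        return True  # no pid file, or unreadable contents: nothing is running
    command = lookup(pid)
    if command is None:
        return True  # stale pid file: the process is gone
    # The pid is alive; only refuse when it really is our daemon.
    return expected not in command
```

Passing the lookup as a parameter keeps the decision testable without spawning real processes.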
Author: WangTaoTheTonic <barneystinson@aliyun.com>
Author: WangTaoTheTonic <wangtao111@huawei.com>
Closes #3683 from WangTaoTheTonic/otherproc and squashes the following commits:
daa86a1 [WangTaoTheTonic] some bash style fix
8befee7 [WangTaoTheTonic] handle the mistake scenario
cf4ecc6 [WangTaoTheTonic] remove redundant condition
f36cfb4 [WangTaoTheTonic] some other processes might take the pid
This PR fixes SPARK-3365.
The cause is that Spark generated the wrong schema for the type `List` in `ScalaReflection.scala`.
For example, the generated schema for type `Seq[String]` is:
```
{"name":"x","type":{"type":"array","elementType":"string","containsNull":true},"nullable":true,"metadata":{}}
```
whereas the generated schema for type `List[String]` is:
```
{"name":"x","type":{"type":"struct","fields":[]},"nullable":true,"metadata":{}}
```
Author: tianyi <tianyi.asiainfo@gmail.com>
Closes #4581 from tianyi/SPARK-3365 and squashes the following commits:
a097e86 [tianyi] change the order of resolution in ScalaReflection.scala
Author: Yin Huai <yhuai@databricks.com>
Closes #4542 from yhuai/moveSaveMode and squashes the following commits:
65a4425 [Yin Huai] Move SaveMode to sql package.
Please see https://issues.apache.org/jira/browse/SPARK-5335.
The fix itself is in commit e58a8b01a8bedcbfbbc6d04b1c1489255865cf87. The two earlier commits fix another VPC-related bug that is waiting to be merged. I should have put that earlier fix in its own branch so that this PR would not include it. :(
This code is released under the project's license.
Author: Vladimir Grigor <vladimir@kiosked.com>
Author: Vladimir Grigor <vladimir@voukka.com>
Closes #4122 from voukka/SPARK-5335_delete_sg_vpc and squashes the following commits:
090dca9 [Vladimir Grigor] fixes as per review: removed printing of group_id and added comment
730ec05 [Vladimir Grigor] fix for SPARK-5335: Destroying cluster in VPC with "--delete-groups" fails to remove security groups
Author: Yin Huai <yhuai@databricks.com>
Closes #4544 from yhuai/jsonUseLongTypeByDefault and squashes the following commits:
6e2ffc2 [Yin Huai] Use LongType as the default type for integers in JSON schema inference.
There is a lot of logging coming from the driver and workers. It is noisy and scary, with lots of exceptions in it, so people get confused about whether the tests are failing or not.
This PR mutes the logging during tests and only shows it if a test fails.
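The idea of hiding output unless a test fails can be sketched in a few lines of Python. This is only an illustration of the technique, not the change made to the test runner; the function name `run_quietly` is hypothetical.

```python
import subprocess
import sys
from typing import List


def run_quietly(cmd: List[str]) -> bool:
    """Run one test command, hiding its output unless it fails."""
    result = subprocess.run(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, text=True)
    if result.returncode != 0:
        # Only a failing run gets its captured log replayed.
        sys.stdout.write(result.stdout)
        return False
    return True
```

Merging stderr into stdout keeps the replayed log in the order the test produced it.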
Author: Davies Liu <davies@databricks.com>
Closes #4572 from davies/mute and squashes the following commits:
1e9069c [Davies Liu] mute the logging during python tests
The `$MVN` command variable may contain spaces, so every reference to it must be wrapped in quotes.
Author: David Y. Ross <dyross@gmail.com>
Closes #4540 from dyross/dyr-fix-make-distribution2 and squashes the following commits:
5a41596 [David Y. Ross] SPARK-5747: Fix wordsplitting bugs in make-distribution.sh
Sometimes, for various reasons, an exception is thrown while NMClient starts a container. For example, if spark_shuffle is not configured on some machines, it throws:
java.lang.Error: org.apache.hadoop.yarn.exceptions.InvalidAuxServiceException: The auxService:spark_shuffle does not exist.
Because YarnAllocator uses a ThreadPoolExecutor to start containers, we cannot tell which container or hostname threw the exception. I think we should catch YarnException in ExecutorRunnable when starting a container, so that if an exception occurs we know the container id or hostname of the failed container.
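As an illustration of the approach, here is a small Python sketch of wrapping a failure raised on a thread pool so that the identifying details travel with it. Everything in it is hypothetical: `ContainerStartError`, `start_container`, and `start_all` are stand-ins, not the YARN or Spark APIs.

```python
from concurrent.futures import ThreadPoolExecutor


class ContainerStartError(Exception):
    """Hypothetical wrapper that records which container failed."""


def start_container(container_id: str, host: str, has_shuffle_service: bool) -> str:
    # Stand-in for the real NodeManager call; fails like the report above.
    if not has_shuffle_service:
        raise RuntimeError("The auxService:spark_shuffle does not exist")
    return container_id


def start_with_context(container_id: str, host: str, has_shuffle_service: bool) -> str:
    try:
        return start_container(container_id, host, has_shuffle_service)
    except Exception as e:
        # Re-raise with the identifying details attached, keeping the cause.
        raise ContainerStartError(
            f"Exception while starting container {container_id} on host {host}"
        ) from e


def start_all(containers):
    """Start containers on a pool; return (started ids, error messages)."""
    started, errors = [], []
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(start_with_context, *c) for c in containers]
        for future in futures:
            try:
                started.append(future.result())
            except ContainerStartError as e:
                errors.append(f"{e}: {e.__cause__}")
    return started, errors
```

Without the wrapper, the pool would surface only the bare error text, with no hint of which submission produced it.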
Author: lianhuiwang <lianhuiwang09@gmail.com>
Closes #4554 from lianhuiwang/SPARK-5759 and squashes the following commits:
caf5a99 [lianhuiwang] use SparkException to warp exception
c02140f [lianhuiwang] ExecutorRunnable should catch YarnException while NMClient start container
The changes are summarized in the commit message. Test or test-related code accounts for 90% of the lines changed.
Author: Andrew Or <andrew@databricks.com>
Closes #4557 from andrewor14/rest-tests and squashes the following commits:
b4dc980 [Andrew Or] Merge branch 'master' of github.com:apache/spark into rest-tests
b55e40f [Andrew Or] Add test for unknown fields
cc96993 [Andrew Or] private[spark] -> private[rest]
578cf45 [Andrew Or] Clean up test code a little
d82d971 [Andrew Or] v1 -> serverVersion
ea48f65 [Andrew Or] Merge branch 'master' of github.com:apache/spark into rest-tests
00999a8 [Andrew Or] Revamp tests + fix a few corner cases
mateiz, was excluding the time to write this final file from the shuffle write time intentional?
Author: Kay Ousterhout <kayousterhout@gmail.com>
Closes #4559 from kayousterhout/SPARK-5762 and squashes the following commits:
5c6f3d9 [Kay Ousterhout] Use foreach
94e4237 [Kay Ousterhout] Removed open time metrics added inadvertently
ace156c [Kay Ousterhout] Moved metrics to finally block
d773276 [Kay Ousterhout] Use nano time
5a59906 [Kay Ousterhout] [SPARK-5762] Fix shuffle write time for sort-based shuffle
Author: Venkata Ramana G <ramana.gollamudihuawei.com>
Author: Venkata Ramana Gollamudi <ramana.gollamudi@huawei.com>
Closes #4561 from gvramana/word_split and squashes the following commits:
285c8d4 [Venkata Ramana Gollamudi] Fixed word split problem in run-example and compute-classpath
Author: Katsunori Kanda <potix2@gmail.com>
Closes #4566 from potix2/ec2-update-version-1-2-1 and squashes the following commits:
77e7840 [Katsunori Kanda] [EC2] Update default Spark version to 1.2.1
ksakellis I stumbled on your JIRA for this yesterday; I know it's assigned to you but I'd already done this for my own uses a while ago so thought I could help save you the work of doing it! Hopefully this doesn't duplicate any work you've already done.
Here's a screenshot of what the UI looks like:
![image](https://cloud.githubusercontent.com/assets/1108612/6135352/c03e7276-b11c-11e4-8f11-c6aefe1f35b9.png)
Based on a discussion with pwendell, I put the data read remotely in as an additional metric rather than showing it in brackets as you'd suggested, Kostas. The assumption here is that the average user doesn't care about the differentiation between local / remote data, so it's better not to pollute the UI.
I also added data about the local read time, which I've found very helpful for debugging, but I didn't put it in the UI because I think it's probably something not a ton of people will need to use.
With this change, the total read time and total write time shown in the UI will be equal, fixing a long-term source of user confusion:
![image](https://cloud.githubusercontent.com/assets/1108612/6135399/25f14490-b11d-11e4-8086-20be5f4002e6.png)
Author: Kay Ousterhout <kayousterhout@gmail.com>
Closes #4510 from kayousterhout/SPARK-5645 and squashes the following commits:
4a0182c [Kay Ousterhout] oops
5f5da1b [Kay Ousterhout] Small style fix
5da04cf [Kay Ousterhout] Addressed more comments from Kostas
ba05149 [Kay Ousterhout] Remove parens
a9dc685 [Kay Ousterhout] Kostas comment, test fix
33d2e2d [Kay Ousterhout] Merge remote-tracking branch 'upstream/master' into SPARK-5645
347e2cd [Kay Ousterhout] [SPARK-5645] Added local read bytes/time to task metrics
Author: Michael Armbrust <michael@databricks.com>
Author: wangfei <wangfei1@huawei.com>
Closes #4558 from marmbrus/errorMessages and squashes the following commits:
5e5ab50 [Michael Armbrust] Merge pull request #15 from scwf/errorMessages
fa38881 [wangfei] fix for grouping__id
f279a71 [wangfei] make right references for ScriptTransformation
d29fbde [Michael Armbrust] extra case
1a797b4 [Michael Armbrust] comments
d4e9015 [Michael Armbrust] add comment
af9e668 [Michael Armbrust] no braces
34eb3a4 [Michael Armbrust] more work
6197cd5 [Michael Armbrust] [SQL] Better error messages for analysis failures
Updated the examples to use the new API and added the DataFrame concept.
Author: Antonio Navarro Perez <ajnavarro@users.noreply.github.com>
Closes #4560 from ajnavarro/ajnavarro-doc-sql-update and squashes the following commits:
82ebcf3 [Antonio Navarro Perez] Changed a missing JavaSQLContext to SQLContext.
8d5376a [Antonio Navarro Perez] fixed typo
8196b6b [Antonio Navarro Perez] [SQL][DOCS] Update sql documentation
Consider only x.y.z versions from JIRA. CC JoshRosen who will probably know this script well.
Alternative is to call the version "2.0.0" after all in JIRA.
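A filter of this kind can be sketched with a regular expression. The snippet below is a minimal illustration and not the script's own code; the function name `release_versions` and the placeholder name `next-major` in the usage note are hypothetical.

```python
import re

# Matches exactly three dot-separated numeric components, e.g. "1.3.0".
RELEASE_VERSION = re.compile(r"\d+\.\d+\.\d+")


def release_versions(names):
    """Keep only names of the form x.y.z, preserving their order."""
    return [n for n in names if RELEASE_VERSION.fullmatch(n)]
```

For instance, `release_versions(["1.3.0", "next-major", "1.2.1"])` keeps the two numeric names and drops the placeholder.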
Author: Sean Owen <sowen@cloudera.com>
Closes #4570 from srowen/SPARK-5776 and squashes the following commits:
fffafde [Sean Owen] Consider only x.y.z verisons from JIRA
This PR detaches MLlib model import/export code from SQL's JSON support, and hence unblocks #4544. yhuai
Author: Xiangrui Meng <meng@databricks.com>
Closes #4555 from mengxr/SPARK-5757 and squashes the following commits:
b0415e8 [Xiangrui Meng] replace SQL JSON usage by json4s
[Was previously PR4507]
As per SPARK-5655, recently committed code chmod 700s all application files created on the local fs by a spark executor. This is both unnecessary and broken on YARN, where files created in the nodemanager's working directory are already owned by the user running the job and the 'yarn' group. Group read permission is also needed for the auxiliary shuffle service to be able to read the files, as this is running as the 'yarn' user.
Author: Andrew Rowson <github@growse.com>
Closes #4509 from growse/master and squashes the following commits:
7ca993c [Andrew Rowson] Moved chmod700 functionality into Utils.getOrCreateLocalRootDirs
f57ce6b [Andrew Rowson] [SPARK-5655] Don't chmod700 application files if running in a YARN container
RAT fails on cache paths. Add them to .rat-excludes.
Author: Oren Mazor <oren.mazor@gmail.com>
Closes #4569 from orenmazor/apache_master and squashes the following commits:
d0c9e7e [Oren Mazor] ignore cache paths for RAT tests
(for master / 1.4 only)
Author: Sean Owen <sowen@cloudera.com>
Closes #4526 from srowen/SPARK-5727.2 and squashes the following commits:
83ba49c [Sean Owen] Remove Debian packaging
Eases use in the spark-shell.
Author: Michael Armbrust <michael@databricks.com>
Closes #4545 from marmbrus/serialization and squashes the following commits:
04748e6 [Michael Armbrust] @scala.annotation.varargs
b36e219 [Michael Armbrust] moreFixes
- Removed DataFrame.apply for projection & filtering since they are extremely confusing.
- Added implicits for RDD[Int], RDD[Long], and RDD[String]
Author: Reynold Xin <rxin@databricks.com>
Closes #4543 from rxin/df-cleanup and squashes the following commits:
81ec915 [Reynold Xin] [SQL] More DataFrame fixes.
As a follow-up to https://github.com/apache/spark/pull/4524
Author: Reynold Xin <rxin@databricks.com>
Closes #4539 from rxin/SPARK-3688 and squashes the following commits:
5ac56c7 [Reynold Xin] exists
da8eea4 [Reynold Xin] [SPARK-3688][SQL] More inline comments for LogicalPlan.
This PR fixes the name resolution problem described in https://issues.apache.org/jira/browse/SPARK-3688
```
CREATE TABLE t1(x INT);
CREATE TABLE t2(a STRUCT<x: INT>, k INT);
SELECT a.x FROM t1 a JOIN t2 b ON a.x = b.k;
```
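In the query above, `a.x` can be read either as column `x` of the table aliased `a`, or as field `x` of the struct column `a` in `t2`. The following Python sketch shows a "table.column first" resolution order. It is a simplified illustration, not the analyzer's code: `Attribute` and `resolve` are hypothetical names, and ambiguity errors and case-insensitive matching are left out.

```python
from typing import List, NamedTuple, Optional, Tuple


class Attribute(NamedTuple):
    qualifier: str  # table name or alias
    name: str       # column name


def resolve(name: str,
            attrs: List[Attribute]) -> Optional[Tuple[Attribute, List[str]]]:
    """Resolve a dotted name to (attribute, remaining nested field path)."""
    parts = name.split(".")
    # 1. Prefer the table.column reading.
    if len(parts) >= 2:
        for attr in attrs:
            if (attr.qualifier, attr.name) == (parts[0], parts[1]):
                return attr, parts[2:]
    # 2. Fall back to column[.field...] with no qualifier.
    for attr in attrs:
        if attr.name == parts[0]:
            return attr, parts[1:]
    return None
```

With the attributes of the join (`a.x`, `b.a`, `b.k`), `resolve("a.x", ...)` picks column `x` of table `a`, while `resolve("b.a.x", ...)` picks the struct column `a` of `b` with `x` left as a nested field.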
Author: tianyi <tianyi.asiainfo@gmail.com>
Closes #4524 from tianyi/SPARK-3688 and squashes the following commits:
237a256 [tianyi] resolve a name with table.column pattern first.
I also fixed a bunch of bad output in test cases.
Author: Michael Armbrust <michael@databricks.com>
Closes #4520 from marmbrus/selfJoin and squashes the following commits:
4f4a85c [Michael Armbrust] comments
49c8e26 [Michael Armbrust] fix tests
6fc38de [Michael Armbrust] fix style
55d64b3 [Michael Armbrust] fix dataframe selfjoins
Looking at the code, I believe this remark about `take(n)` computing partitions on the driver is no longer correct. Apologies if I'm wrong.
This came up in http://stackoverflow.com/q/28436559/3318517.
Author: Daniel Darabos <darabos.daniel@gmail.com>
Closes #4533 from darabos/patch-2 and squashes the following commits:
cc80f3a [Daniel Darabos] Remove outdated remark about take(n).
1. DataFrame.renameColumn
2. DataFrame.show() and _repr_
3. Use simpleString() rather than jsonValue in DataFrame.dtypes
4. createDataFrame from local Python data, including pandas.DataFrame
Author: Davies Liu <davies@databricks.com>
Closes #4528 from davies/df3 and squashes the following commits:
014acea [Davies Liu] fix typo
6ba526e [Davies Liu] fix tests
46f5f95 [Davies Liu] address comments
6cbc154 [Davies Liu] dataframe.show() and improve dtypes
6f94f25 [Davies Liu] create DataFrame from local Python data
This just adds a deprecation message. It's intended for backporting to branch 1.3 but can go in master too, to be followed by another PR that removes it for 1.4.
Author: Sean Owen <sowen@cloudera.com>
Closes #4516 from srowen/SPARK-5727.1 and squashes the following commits:
d48989f [Sean Owen] Refer to Spark 1.4
6c1c8b3 [Sean Owen] Deprecate Debian packaging
Use temp dir for ActiveMQ database
Author: Sean Owen <sowen@cloudera.com>
Closes #4517 from srowen/SPARK-5728 and squashes the following commits:
1d3aeb8 [Sean Owen] Use temp dir for ActiveMQ database
Author: cody koeninger <cody@koeninger.org>
Closes #4511 from koeninger/kafkaRdd-leader-to-broker and squashes the following commits:
f7151d4 [cody koeninger] [SPARK-4964] test refactoring
6f8680b [cody koeninger] [SPARK-4964] add test of the scala api for KafkaUtils.createRDD
f81e016 [cody koeninger] [SPARK-4964] leave KafkaStreamSuite host and port as private
5173f3f [cody koeninger] [SPARK-4964] test the Java variations of createRDD
e9cece4 [cody koeninger] [SPARK-4964] pass leaders as a map to ensure 1 leader per TopicPartition
The `initialState` of LDA performs several RDD operations that look redundant. This PR tries to simplify these operations.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes #4501 from viirya/sim_lda and squashes the following commits:
4870fe4 [Liang-Chi Hsieh] For comments.
9af1487 [Liang-Chi Hsieh] Refactor initial step of LDA to remove redundant operations.
Also took the chance to fix up some style ...
Author: Reynold Xin <rxin@databricks.com>
Closes #4489 from rxin/SPARK-5702 and squashes the following commits:
74f42e3 [Reynold Xin] [SPARK-5702][SQL] Allow short names for built-in data sources.
If the user specifies a bad REST URL, the server will throw an NPE instead of propagating the error back. This is because the default `ErrorServlet` has the wrong prefix. This is a one-line fix. I will add more comprehensive tests in a separate patch.
Author: Andrew Or <andrew@databricks.com>
Closes #4518 from andrewor14/rest-npe and squashes the following commits:
16b15bc [Andrew Or] Correct ErrorServlet context prefix
Previously, SparkHadoopWriter always committed its tasks without question. The problem is that when speculation is enabled sometimes this can result in multiple tasks committing their output to the same file. Even though an HDFS-writing task may be re-launched due to speculation, the original task is not killed and may eventually commit as well.
This can cause strange race conditions where multiple tasks that commit interfere with each other, with the result being that some partition files are actually lost entirely. For more context on these kinds of scenarios, see SPARK-4879.
In Hadoop MapReduce jobs, the application master is a central coordinator that authorizes whether or not any given task can commit. Before a task commits its output, it queries the application master as to whether or not such a commit is safe, and the application master does bookkeeping as tasks are requesting commits. Duplicate tasks that would write to files that were already written to from other tasks are prohibited from committing.
This patch emulates that functionality - the crucial missing component was a central arbitrator, which is now a module called the OutputCommitCoordinator. The coordinator lives on the driver and the executors can obtain a reference to this actor and request its permission to commit. As tasks commit and are reported as completed successfully or unsuccessfully by the DAGScheduler, the commit coordinator is informed of the task completion events as well to update its internal state.
Future work includes more rigorous unit testing and extra optimizations should this patch cause a performance regression. It is unclear what the overall cost of communicating back to the driver on every hadoop-committing task will be. It's also important for those hitting this issue to backport this onto previous versions of Spark because the bug has serious consequences: data is lost.
Currently, the OutputCommitCoordinator is only used when `spark.speculation` is true. It can be disabled by setting `spark.hadoop.outputCommitCoordination.enabled=false` in SparkConf.
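The arbitration described above can be sketched as a small "first requester wins" registry. This Python sketch only illustrates the idea and is not the OutputCommitCoordinator implementation: the class and method names are hypothetical, there is no RPC layer, and letting the same attempt ask twice is a simplifying assumption.

```python
import threading
from typing import Dict, Tuple


class CommitCoordinator:
    """First attempt to ask for a (stage, partition) wins; others are denied."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        # (stage, partition) -> attempt id that is authorized to commit
        self._authorized: Dict[Tuple[int, int], int] = {}

    def can_commit(self, stage: int, partition: int, attempt: int) -> bool:
        with self._lock:
            holder = self._authorized.setdefault((stage, partition), attempt)
            return holder == attempt

    def attempt_failed(self, stage: int, partition: int, attempt: int) -> None:
        """Release the authorization if the failed attempt was holding it."""
        with self._lock:
            if self._authorized.get((stage, partition)) == attempt:
                del self._authorized[(stage, partition)]
```

A speculative duplicate that asks after the original attempt is denied, and if the authorized attempt later fails, the slot is freed so that another attempt can commit.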
This patch is an updated version of #4155 (by mccheah), which in turn was an updated version of this PR.
Closes #4155.
Author: mcheah <mcheah@palantir.com>
Author: Josh Rosen <joshrosen@databricks.com>
Closes #4066 from JoshRosen/SPARK-4879-sparkhadoopwriter-fix and squashes the following commits:
658116b [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-4879-sparkhadoopwriter-fix
ed783b2 [Josh Rosen] Address Andrew’s feedback.
e7be65a [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-4879-sparkhadoopwriter-fix
14861ea [Josh Rosen] splitID -> partitionID in a few places
ed8b554 [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-4879-sparkhadoopwriter-fix
48d5c1c [Josh Rosen] Roll back copiesRunning change in TaskSetManager
3969f5f [Josh Rosen] Re-enable guarding of commit coordination with spark.speculation setting.
ede7590 [Josh Rosen] Add test to ensure that a job that denies all commits cannot complete successfully.
97da5fe [Josh Rosen] Use actor only for RPC; call methods directly in DAGScheduler.
f582574 [Josh Rosen] Some cleanup in OutputCommitCoordinatorSuite
a7c0e29 [Josh Rosen] Create fake TaskInfo using dummy fields instead of Mockito.
997b41b [Josh Rosen] Roll back unnecessary DAGSchedulerSingleThreadedProcessLoop refactoring:
459310a [Josh Rosen] Roll back TaskSetManager changes that broke other tests.
dd00b7c [Josh Rosen] Move CommitDeniedException to executors package; remove `@DeveloperAPI` annotation.
c79df98 [Josh Rosen] Some misc. code style + doc changes:
f7d69c5 [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-4879-sparkhadoopwriter-fix
92e6dc9 [Josh Rosen] Bug fix: use task ID instead of StageID to index into authorizedCommitters.
b344bad [Josh Rosen] (Temporarily) re-enable “always coordinate” for testing purposes.
0aec91e [Josh Rosen] Only coordinate when speculation is enabled; add configuration option to bypass new coordination.
594e41a [mcheah] Fixing a scalastyle error
60a47f4 [mcheah] Writing proper unit test for OutputCommitCoordinator and fixing bugs.
d63f63f [mcheah] Fixing compiler error
9fe6495 [mcheah] Fixing scalastyle
1df2a91 [mcheah] Throwing exception if SparkHadoopWriter commit denied
d431144 [mcheah] Using more concurrency to process OutputCommitCoordinator requests.
c334255 [mcheah] Properly handling messages that could be sent after actor shutdown.
8d5a091 [mcheah] Was mistakenly serializing the accumulator in test suite.
9c6a4fa [mcheah] More OutputCommitCoordinator cleanup on stop()
78eb1b5 [mcheah] Better OutputCommitCoordinatorActor stopping; simpler canCommit
83de900 [mcheah] Making the OutputCommitCoordinatorMessage serializable
abc7db4 [mcheah] TaskInfo can't be null in DAGSchedulerSuite
f135a8e [mcheah] Moving the output commit coordinator from class into method.
1c2b219 [mcheah] Renaming oudated names for test function classes
66a71cd [mcheah] Removing whitespace modifications
6b543ba [mcheah] Removing redundant accumulator in unit test
c9decc6 [mcheah] Scalastyle fixes
bc80770 [mcheah] Unit tests for OutputCommitCoordinator
6e6f748 [mcheah] [SPARK-4879] Use the Spark driver to authorize Hadoop commits.
Do not recursively strip out projects. Only strip the first level project.
```scala
df("colA") + df("colB").as("colC")
```
Previously, the above would construct an invalid plan.
Author: Reynold Xin <rxin@databricks.com>
Closes #4519 from rxin/computability and squashes the following commits:
87ff763 [Reynold Xin] Code review feedback.
015c4fc [Reynold Xin] [SQL][DataFrame] Fix column computability.
Author: Cheng Hao <hao.cheng@intel.com>
Closes #4496 from chenghao-intel/df_explain and squashes the following commits:
552aa58 [Cheng Hao] Add explain support for DF
Deprecate inferSchema() and applySchema() in favor of createDataFrame(), which can take an optional `schema` to create a DataFrame from an RDD. The `schema` can be a StructType or a list of column names.
Author: Davies Liu <davies@databricks.com>
Closes #4498 from davies/create and squashes the following commits:
08469c1 [Davies Liu] remove Scala/Java API for now
c80a7a9 [Davies Liu] fix hive test
d1bd8f2 [Davies Liu] cleanup applySchema
9526e97 [Davies Liu] createDataFrame from RDD with columns
Author: Cheng Hao <hao.cheng@intel.com>
Closes #4468 from chenghao-intel/json and squashes the following commits:
aeb7801 [Cheng Hao] avoid multiple json generator created
Also start from the bottom so we show the first error instead of the top error.
Author: Michael Armbrust <michael@databricks.com>
Closes #4439 from marmbrus/analysisException and squashes the following commits:
45862a0 [Michael Armbrust] fix hive test
a773bba [Michael Armbrust] Merge remote-tracking branch 'origin/master' into analysisException
f88079f [Michael Armbrust] update more cases
fede90a [Michael Armbrust] newline
fbf4bc3 [Michael Armbrust] move to sql
6235db4 [Michael Armbrust] [SQL] Add an exception for analysis errors.