JIRA: https://issues.apache.org/jira/browse/SPARK-10437
If an expression in `SortOrder` is a resolved one, such as `count(1)`, the corresponding rule in `Analyzer` to make it work in order by will not be applied.
Author: Liang-Chi Hsieh <viirya@appier.com>
Closes#8599 from viirya/orderby-agg.
* a follow-up to 16b6d18613 as `--num-executors` flag is not suppported.
* links + formatting
Author: Jacek Laskowski <jacek.laskowski@deepsense.io>
Closes#8762 from jaceklaskowski/docs-spark-on-yarn.
Missed this when reviewing `pyspark.mllib.random` for SPARK-10275.
Author: noelsmith <mail@noelsmith.com>
Closes#8773 from noel-smith/mllib-random-versionadded-fix.
This change does two things:
- tag a few tests and adds the mechanism in the build to be able to disable those tags,
both in maven and sbt, for both junit and scalatest suites.
- add some logic to run-tests.py to disable some tags depending on what files have
changed; that's used to disable expensive tests when a module hasn't explicitly
been changed, to speed up testing for changes that don't directly affect those
modules.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#8437 from vanzin/test-tags.
jira: https://issues.apache.org/jira/browse/SPARK-10491
We implemented dspr with sparse vector support in `RowMatrix`. This method is also used in WeightedLeastSquares and other places. It would be useful to move it to `linalg.BLAS`.
Let me know if new UT needed.
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes#8663 from hhbyyh/movedspr.
Comments preceding toMessage method state: "The edge partition is encoded in the lower
* 30 bytes of the Int, and the position is encoded in the upper 2 bytes of the Int.". References to bytes should be changed to bits.
This contribution is my original work and I license the work to the Spark project under it's open source license.
Author: Robin East <robin.east@xense.co.uk>
Closes#8756 from insidedctm/master.
Links work now properly + consistent use of *Spark standalone cluster* (Spark uppercase + lowercase the rest -- seems agreed in the other places in the docs).
Author: Jacek Laskowski <jacek.laskowski@deepsense.io>
Closes#8759 from jaceklaskowski/docs-submitting-apps.
Duplicated the since decorator from pyspark.sql into pyspark (also tweaked to handle functions without docstrings).
Added since to methods + "versionadded::" to classes (derived from the git file history in pyspark).
Author: noelsmith <mail@noelsmith.com>
Closes#8633 from noel-smith/SPARK-10273-since-mllib-feature.
PySpark DenseVector, SparseVector ```__eq__``` method should use semantics equality, and DenseVector can compared with SparseVector.
Implement PySpark DenseVector, SparseVector ```__hash__``` method based on the first 16 entries. That will make PySpark Vector objects can be used in collections.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#8166 from yanboliang/spark-9793.
This patch adds support for submitting map stages in a DAG individually so that we can make downstream decisions after seeing statistics about their output, as part of SPARK-9850. I also added more comments to many of the key classes in DAGScheduler. By itself, the patch is not super useful except maybe to switch between a shuffle and broadcast join, but with the other subtasks of SPARK-9850 we'll be able to do more interesting decisions.
The main entry point is SparkContext.submitMapStage, which lets you run a map stage and see stats about the map output sizes. Other stats could also be collected through accumulators. See AdaptiveSchedulingSuite for a short example.
Author: Matei Zaharia <matei@databricks.com>
Closes#8180 from mateiz/spark-9851.
This is a follow-up patch to #8723. I missed one case there.
Author: Andrew Or <andrew@databricks.com>
Closes#8727 from andrewor14/fix-threading-suite.
Make this lazy so that it can set the yarn mode before creating the securityManager.
Author: Tom Graves <tgraves@yahoo-inc.com>
Author: Thomas Graves <tgraves@staydecay.corp.gq1.yahoo.com>
Closes#8719 from tgravescs/SPARK-10549.
Move .java files in `src/main/scala` to `src/main/java` root, except for `package-info.java` (to stay next to package.scala)
Author: Sean Owen <sowen@cloudera.com>
Closes#8736 from srowen/SPARK-10576.
`ApplicationMaster` no longer has the `--num-executors` flag, and had an undocumented `--properties-file` configuration option.
cc srowen
Author: Erick Tryzelaar <erick.tryzelaar@gmail.com>
Closes#8754 from erickt/master.
This PR is in conflict with #8535 and #8573. Will update this one when they are merged.
Author: zsxwing <zsxwing@gmail.com>
Closes#8642 from zsxwing/expand-nest-join.
Alternative to PR #6122; in this case the refactored out classes are replaced by inner classes with the same name for backwards binary compatibility
* process in a lighter-weight, backwards-compatible way
Author: Edoardo Vacchi <uncommonnonsense@gmail.com>
Closes#6356 from evacchi/sqlctx-refactoring-lite.
Fixes bug where IndexToString output schema was DoubleType. Correct me if I'm wrong, but it doesn't seem like the output needs to have any "ML Attribute" metadata.
Author: Nick Pritchard <nicholas.pritchard@falkonry.com>
Closes#8751 from pnpritchard/SPARK-10573.
[SPARK-3382](https://issues.apache.org/jira/browse/SPARK-3382) added a ```convergenceTol``` parameter for GradientDescent-based methods in Scala. We need that parameter in Python; otherwise, Python users will not be able to adjust that behavior (or even reproduce behavior from previous releases since the default changed).
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#8457 from yanboliang/spark-10194.
The default value of hive metastore version is 1.2.1 but the documentation says the value of `spark.sql.hive.metastore.version` is 0.13.1.
Also, we cannot get the default value by `sqlContext.getConf("spark.sql.hive.metastore.version")`.
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes#8739 from sarutak/SPARK-10584.
This is a follow-up of https://github.com/apache/spark/pull/8317.
When speculation is enabled, there may be multiply tasks writing to the same path. Generally it's OK as we will write to a temporary directory first and only one task can commit the temporary directory to target path.
However, when we use direct output committer, tasks will write data to target path directly without temporary directory. This causes problems like corrupted data. Please see [PR comment](https://github.com/apache/spark/pull/8191#issuecomment-131598385) for more details.
Unfortunately, we don't have a simple flag to tell if a output committer will write to temporary directory or not, so for safety, we have to disable any customized output committer when `speculation` is true.
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#8687 from cloud-fan/direct-committer.
A few Identifiable types did override their toString method but without using the parent implementation. As a consequence, the uid was not present anymore in the toString result. It is the default behaviour.
This patch is a quick fix. The question of enforcement is still up.
No tests have been written to verify the toString method behaviour. That would be long to do because all types should be tested and not only those which have a regression now.
It is possible to enforce the condition using the compiler by making the toString method final but that would introduce unwanted potential API breaking changes (see jira).
Author: Bertrand Dechoux <BertrandDechoux@users.noreply.github.com>
Closes#8062 from BertrandDechoux/SPARK-9720.
This is a followup to #8499 which adds a Scalastyle rule to mandate the use of SparkHadoopUtil's JobContext accessor methods and fixes the existing violations.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#8521 from JoshRosen/SPARK-10330-part2.
Adding STDDEV support for DataFrame using 1-pass online /parallel algorithm to compute variance. Please review the code change.
Author: JihongMa <linlin200605@gmail.com>
Author: Jihong MA <linlin200605@gmail.com>
Author: Jihong MA <jihongma@jihongs-mbp.usca.ibm.com>
Author: Jihong MA <jihongma@Jihongs-MacBook-Pro.local>
Closes#6297 from JihongMA/SPARK-SQL.
Fix a few Java API test style issues: unused generic types, exceptions, wrong assert argument order
Author: Sean Owen <sowen@cloudera.com>
Closes#8706 from srowen/SPARK-10547.
When throwing an IllegalArgumentException in SnappyCompressionCodec.init, chain the existing exception. This allows potentially important debugging info to be passed to the user.
Manual testing shows the exception chained properly, and the test suite still looks fine as well.
This contribution is my original work and I license the work to the project under the project's open source license.
Author: Daniel Imfeld <daniel@danielimfeld.com>
Closes#8725 from dimfeld/dimfeld-patch-1.
This commit ensures if an assertion fails within a thread, it will ultimately fail the test. Otherwise we end up potentially masking real bugs by not propagating assertion failures properly.
Author: Andrew Or <andrew@databricks.com>
Closes#8723 from andrewor14/fix-threading-suite.
1. Hide `LocalNodeIterator` behind the `LocalNode#asIterator` method
2. Add tests for this
Author: Andrew Or <andrew@databricks.com>
Closes#8708 from andrewor14/local-hash-join-follow-up.
This PR is in conflict with #8535. I will update this one when #8535 gets merged.
Author: zsxwing <zsxwing@gmail.com>
Closes#8573 from zsxwing/more-local-operators.
Just fixing a typo in exception message, raised when attempting to pickle SparkContext.
Author: Icaro Medeiros <icaro.medeiros@gmail.com>
Closes#8724 from icaromedeiros/master.
See this thread for background:
http://search-hadoop.com/m/q3RTt0rWvIkHAE81
We should check the range of partition Id and provide meaningful message through exception.
Alternatively, we can use abs() and modulo to force the partition Id into legitimate range. However, expectation is that user should correct the logic error in his / her code.
Author: tedyu <yuzhihong@gmail.com>
Closes#8703 from tedyu/master.
If hadoopFsRelationSuites's "test all data types" is too flaky we can disable it for now.
https://issues.apache.org/jira/browse/SPARK-10540
Author: Yin Huai <yhuai@databricks.com>
Closes#8705 from yhuai/SPARK-10540-ignore.
Changes:
* Make Scala doc for StringIndexerInverse clearer. Also remove Scala doc from transformSchema, so that the doc is inherited.
* MetadataUtils.scala: “ Helper utilities for tree-based algorithms” —> not just trees anymore
CC: holdenk mengxr
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#8679 from jkbradley/doc-fixes-1.5.
We should document options in public API doc. Otherwise, it is hard to find out the options without looking at the code. I tried to make `DefaultSource` private and put the documentation to package doc. However, since then there exists no public class under `source.libsvm`, the Java package doc doesn't show up in the generated html file (http://bugs.java.com/bugdatabase/view_bug.do?bug_id=4492654). So I put the doc to `DefaultSource` instead. There are several minor updates in this PR:
1. Do `vectorType == "sparse"` only once.
2. Update `hashCode` and `equals`.
3. Remove inherited doc.
4. Delete temp dir in `afterAll`.
Lewuathe
Author: Xiangrui Meng <meng@databricks.com>
Closes#8699 from mengxr/SPARK-10537.
LinearRegression and LogisticRegression lack of some Params for Python, and some Params are not shared classes which lead we need to write them for each class. These kinds of Params are list here:
```scala
HasElasticNetParam
HasFitIntercept
HasStandardization
HasThresholds
```
Here we implement them in shared params at Python side and make LinearRegression/LogisticRegression parameters peer with Scala one.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#8508 from yanboliang/spark-10026.
I fixed to use LIBSVM data source in the example code in spark.ml instead of MLUtils
Author: y-shimizu <y.shimizu0429@gmail.com>
Closes#8697 from y-shimizu/SPARK-10518.
Previously, project/plugins.sbt explicitly set scalaVersion to 2.10.4. This can cause issues when using a version of sbt that is compiled against a different version of Scala (for example sbt 0.13.9 uses 2.10.5). Removing this explicit setting will cause build files to be compiled and run against the same version of Scala that sbt is compiled against.
Note that this only applies to the project build files (items in project/), it is distinct from the version of Scala we target for the actual spark compilation.
Author: Ahir Reddy <ahirreddy@gmail.com>
Closes#8709 from ahirreddy/sbt-scala-version-fix.