This pull request rewrites CaseWhen expression to break the single, monolithic "branches" field into a sequence of tuples (Seq[(condition, value)]) and an explicit optional elseValue field.
Prior to this pull request, each even position in "branches" represents the condition for each branch, and each odd position represents the value for each branch. The use of them have been pretty confusing with a lot sliding windows or grouped(2) calls.
Author: Reynold Xin <rxin@databricks.com>
Closes#10734 from rxin/simplify-case.
This replaces the `execfile` used for running custom python shell scripts
with explicit open, compile and exec (as recommended by 2to3). The reason
for this change is to make the pythonstartup option compatible with python3.
Author: Erik Selin <erik.selin@gmail.com>
Closes#10255 from tyro89/pythonstartup-python3.
This patch modifies our PR merge script to reset back to a named branch when restoring the original checkout upon exit. When the committer is originally checked out to a detached head, then they will be restored back to that same ref (the same as today's behavior).
This is a slightly updated version of #7569, with an extra fix to handle the detached head corner-case.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#10709 from JoshRosen/SPARK-9383.
Removes some duplicated code that was reintroduced during a merge.
Author: Jakob Odersky <jodersky@gmail.com>
Closes#10711 from jodersky/repl-2.11-duplicate.
The default run has changed, but the documentation didn't fully reflect the change.
Author: Luc Bourlier <luc.bourlier@typesafe.com>
Closes#10740 from skyluc/issue/mesos-modes-doc.
https://github.com/apache/spark/pull/10736 was merged yesterday and caused the master start to fail because of the style issue.
Author: Yin Huai <yhuai@databricks.com>
Closes#10742 from yhuai/fixStyle.
This is the final PR about SPARK-12692.
We have removed all of white spaces before comma from code so let's enforce style checking.
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes#10736 from sarutak/SPARK-12692-followup-enforce-checking.
Fix the style violation (space before , and :).
This PR is a followup for #10643 and rework of #10685 .
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes#10732 from sarutak/SPARK-12692-followup-sql.
cloud-fan Can you please take a look ?
In this case, we are failing during check analysis while validating the aggregation expression. I have added a semanticEquals for HiveGenericUDF to fix this. Please let me know if this is the right way to address this issue.
Author: Dilip Biswal <dbiswal@us.ibm.com>
Closes#10520 from dilipbiswal/spark-12558.
Fix the style violation (space before , and :).
This PR is a followup for #10643
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes#10719 from sarutak/SPARK-12692-followup-core.
There are many potential benefits of having an efficient in memory columnar format as an alternate
to UnsafeRow. This patch introduces ColumnarBatch/ColumnarVector which starts this effort. The
remaining implementation can be done as follow up patches.
As stated in the in the JIRA, there are useful external components that operate on memory in a
simple columnar format. ColumnarBatch would serve that purpose and could server as a
zero-serialization/zero-copy exchange for this use case.
This patch supports running the underlying data either on heap or off heap. On heap runs a bit
faster but we would need offheap for zero-copy exchanges. Currently, this mode is hidden behind one
interface (ColumnVector).
This differs from Parquet or the existing columnar cache because this is *not* intended to be used
as a storage format. The focus is entirely on CPU efficiency as we expect to only have 1 of these
batches in memory per task. The layout of the values is just dense arrays of the value type.
Author: Nong Li <nong@databricks.com>
Author: Nong <nongli@gmail.com>
Closes#10628 from nongli/spark-12635.
- [x] Upgrade Py4J to 0.9.1
- [x] SPARK-12657: Revert SPARK-12617
- [x] SPARK-12658: Revert SPARK-12511
- Still keep the change that only reading checkpoint once. This is a manual change and worth to take a look carefully. bfd4b5c040
- [x] Verify no leak any more after reverting our workarounds
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#10692 from zsxwing/py4j-0.9.1.
This PR implements SQL generation support for persisted data source tables. A new field `metastoreTableIdentifier: Option[TableIdentifier]` is added to `LogicalRelation`. When a `LogicalRelation` representing a persisted data source relation is created, this field holds the database name and table name of the relation.
Author: Cheng Lian <lian@databricks.com>
Closes#10712 from liancheng/spark-12724-datasources-sql-gen.
This patch removes CaseKeyWhen expression and replaces it with a factory method that generates the equivalent CaseWhen. This reduces the amount of code we'd need to maintain in the future for both code generation and optimizer.
Note that we introduced CaseKeyWhen to avoid duplicate evaluations of the key. This is no longer a problem because we now have common subexpression elimination.
Author: Reynold Xin <rxin@databricks.com>
Closes#10722 from rxin/SPARK-12768.
Let me know whether you'd like to see it in other place
Author: Robert Kruszewski <robertk@palantir.com>
Closes#10210 from robert3005/feature/pluggable-optimizer.
This pull request does a few small things:
1. Separated if simplification from BooleanSimplification and created a new rule SimplifyConditionals. In the future we can also simplify other conditional expressions here.
2. Added unit test for SimplifyConditionals.
3. Renamed SimplifyCaseConversionExpressionsSuite to SimplifyStringCaseConversionSuite
Author: Reynold Xin <rxin@databricks.com>
Closes#10716 from rxin/SPARK-12762.
[SPARK-12582][Test] IndexShuffleBlockResolverSuite fails in windows
* IndexShuffleBlockResolverSuite fails in windows due to file is not closed.
* mv IndexShuffleBlockResolverSuite.scala from "test/java" to "test/scala".
https://issues.apache.org/jira/browse/SPARK-12582
Author: Yucai Yu <yucai.yu@intel.com>
Closes#10526 from yucai/master.
Currently, RDD function aggregate's parameter doesn't explain well, especially parameter "zeroValue".
It's helpful to let junior scala user know that "zeroValue" attend both "seqOp" and "combOp" phase.
Author: Tommy YU <tummyyu@163.com>
Closes#10587 from Wenpei/rdd_aggregate_doc.
Use a much smaller step size in LinearRegressionWithSGD MLlib examples to achieve a reasonable RMSE.
Our training folks hit this exact same issue when concocting an example and had the same solution.
Author: Sean Owen <sowen@cloudera.com>
Closes#10675 from srowen/SPARK-5273.
Fix the style violation (space before , and :).
This PR is a followup for #10643.
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes#10718 from sarutak/SPARK-12692-followup-sql.
Fix the style violation (space before , and :).
This PR is a followup for #10643.
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes#10686 from sarutak/SPARK-12692-followup-yarn.
Fix the style violation (space before , and :).
This PR is a followup for #10643.
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes#10685 from sarutak/SPARK-12692-followup-streaming.
https://issues.apache.org/jira/browse/SPARK-11823
This test often hangs and times out, leaving hanging processes. Let's ignore it for now and improve the test.
Author: Yin Huai <yhuai@databricks.com>
Closes#10715 from yhuai/SPARK-11823-ignore.
Scala syntax allows binary case classes to be used as infix operator in pattern matching. This PR makes use of this syntax sugar to make `BooleanSimplification` more readable.
Author: Cheng Lian <lian@databricks.com>
Closes#10445 from liancheng/boolean-simplification-simplification.
```
[info] Exception encountered when attempting to run a suite with class name:
org.apache.spark.sql.hive.LogicalPlanToSQLSuite *** ABORTED *** (325 milliseconds)
[info] org.apache.spark.sql.AnalysisException: Table `t1` already exists.;
[info] at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:296)
[info] at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:285)
[info] at org.apache.spark.sql.hive.LogicalPlanToSQLSuite.beforeAll(LogicalPlanToSQLSuite.scala:33)
[info] at org.scalatest.BeforeAndAfterAll$class.beforeAll(BeforeAndAfterAll.scala:187)
[info] at org.apache.spark.sql.hive.LogicalPlanToSQLSuite.beforeAll(LogicalPlanToSQLSuite.scala:23)
[info] at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:253)
[info] at org.apache.spark.sql.hive.LogicalPlanToSQLSuite.run(LogicalPlanToSQLSuite.scala:23)
[info] at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:462)
[info] at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:671)
[info] at sbt.ForkMain$Run$2.call(ForkMain.java:296)
[info] at sbt.ForkMain$Run$2.call(ForkMain.java:286)
[info] at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[info] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
[info] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
[info] at java.lang.Thread.run(Thread.java:745)
```
/cc liancheng
Author: wangfei <wangfei_hello@126.com>
Closes#10682 from scwf/fix-test.
The PR allows us to use the new SQL parser to parse SQL expressions such as: ```1 + sin(x*x)```
We enable this functionality in this PR, but we will not start using this actively yet. This will be done as soon as we have reached grammar parity with the existing parser stack.
cc rxin
Author: Herman van Hovell <hvanhovell@questtec.nl>
Closes#10649 from hvanhovell/SPARK-12576.
jira: https://issues.apache.org/jira/browse/SPARK-10809
We could provide a single-document topicDistributions method for LocalLDAModel to allow for quick queries which avoid RDD operations. Currently, the user must use an RDD of documents.
add some missing assert too.
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes#9484 from hhbyyh/ldaTopicPre.
jira: https://issues.apache.org/jira/browse/SPARK-12685
the log of `word2vec` reports
trainWordsCount = -785727483
during computation over a large dataset.
Update the priority as it will affect the computation process.
`alpha = learningRate * (1 - numPartitions * wordCount.toDouble / (trainWordsCount + 1))`
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes#10627 from hhbyyh/w2voverflow.
PySpark MLlib ```GaussianMixtureModel``` should support single instance ```predict/predictSoft``` just like Scala do.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#10552 from yanboliang/spark-12603.
This patch fixes a build/test issue caused by the combination of #10672 and a latent issue in the original `dev/test-dependencies` script.
First, changes which _only_ touched build files were not triggering full Jenkins runs, making it possible for a build change to be merged even though it could cause failures in other tests. The `root` build module now depends on `build`, so all tests will now be run whenever a build-related file is changed.
I also added a `clean` step to the Maven install step in `dev/test-dependencies` in order to address an issue where the dummy JARs stuck around and caused "multiple assembly JARs found" errors in tests.
/cc zsxwing
Author: Josh Rosen <joshrosen@databricks.com>
Closes#10704 from JoshRosen/fix-build-test-problems.
JIRA: https://issues.apache.org/jira/browse/SPARK-12744
This PR makes parsing JSON integers to timestamps consistent with casting behavior.
Author: Anatoliy Plastinin <anatoliy.plastinin@gmail.com>
Closes#10687 from antlypls/fix-json-timestamp-parsing.
The current Spark Streaming kinesis connector references a quite old version 1.9.40 of the AWS Java SDK (1.10.40 is current). Numerous AWS features including Kinesis Firehose are unavailable in 1.9. Those two versions of the AWS SDK in turn require conflicting versions of Jackson (2.4.4 and 2.5.3 respectively) such that one cannot include the current AWS SDK in a project that also uses the Spark Streaming Kinesis ASL.
Author: BrianLondon <brian@seatgeek.com>
Closes#10256 from BrianLondon/master.
According to the documentation the sortByKey method does not take a lambda as an argument, thus the example is flawed. Removed the argument completely as this will default to ascending sort.
Author: Udo Klein <git@blinkenlight.net>
Closes#10640 from udoklein/patch-1.
This is a hotfix for a build bug introduced by the Netty exclusion changes in #10672. We can't exclude `io.netty:netty` because Akka depends on it. There's not a direct conflict between `io.netty:netty` and `io.netty:netty-all`, because the former puts classes in the `org.jboss.netty` namespace while the latter uses the `io.netty` namespace. However, there still is a conflict between `org.jboss.netty:netty` and `io.netty:netty`, so we need to continue to exclude the JBoss version of that artifact.
While the diff here looks somewhat large, note that this is only a revert of a some of the changes from #10672. You can see the net changes in pom.xml at 3119206b71...5211ab8 (diff-600376dffeb79835ede4a0b285078036)
Author: Josh Rosen <joshrosen@databricks.com>
Closes#10693 from JoshRosen/netty-hotfix.
#10659 removed the repository `https://repo.eclipse.org/content/repositories/paho-releases` but it's needed by MiMa because `spark-streaming-mqtt(1.6.0)` depends on `mqttv3(1.0.1)` and it is provided by the removed repository and maven-central provide only `mqttv3(1.0.2)` for now.
Otherwise, if `mqttv3(1.0.1)` is absent from the local repository, dev/mima should fail.
JoshRosen Do you have any other better idea?
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes#10688 from sarutak/SPARK-4628-followup.
Turn import ordering violations into build errors, plus a few adjustments
to account for how the checker behaves. I'm a little on the fence about
whether the existing code is right, but it's easier to appease the checker
than to discuss what's the more correct order here.
Plus a few fixes to imports that cropped in since my recent cleanups.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#10612 from vanzin/SPARK-3873-enable.
Netty classes are published under multiple artifacts with different names, so our build needs to exclude the `io.netty:netty` and `org.jboss.netty:netty` versions of the Netty artifact. However, our existing exclusions were incomplete, leading to situations where duplicate Netty classes would wind up on the classpath and cause compile errors (or worse).
This patch fixes the exclusion issue by adding more exclusions and uses Maven Enforcer's [banned dependencies](https://maven.apache.org/enforcer/enforcer-rules/bannedDependencies.html) rule to prevent these classes from accidentally being reintroduced. I also updated `dev/test-dependencies.sh` to run `mvn validate` so that the enforcer rules can run as part of pull request builds.
/cc rxin srowen pwendell. I'd like to backport at least the exclusion portion of this fix to `branch-1.5` in order to fix the documentation publishing job, which fails nondeterministically due to incompatible versions of Netty classes taking precedence on the compile-time classpath.
Author: Josh Rosen <rosenville@gmail.com>
Author: Josh Rosen <joshrosen@databricks.com>
Closes#10672 from JoshRosen/enforce-netty-exclusions.
Fix the style violation (space before `,` and `:`).
This PR is a followup for #10643.
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes#10683 from sarutak/SPARK-12692-followup-graphx.
Fix the style violation (space before , and :).
This PR is a followup for #10643.
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes#10684 from sarutak/SPARK-12692-followup-mllib.