The goal of this PR is to eliminate unnecessary translations when there are back-to-back `MapPartitions` operations. In order to achieve this I also made the following simplifications:
- Operators no longer have hold encoders, instead they have only the expressions that they need. The benefits here are twofold: the expressions are visible to transformations so go through the normal resolution/binding process. now that they are visible we can change them on a case by case basis.
- Operators no longer have type parameters. Since the engine is responsible for its own type checking, having the types visible to the complier was an unnecessary complication. We still leverage the scala compiler in the companion factory when constructing a new operator, but after this the types are discarded.
Deferred to a follow up PR:
- Remove as much of the resolution/binding from Dataset/GroupedDataset as possible. We should still eagerly check resolution and throw an error though in the case of mismatches for an `as` operation.
- Eliminate serializations in more cases by adding more cases to `EliminateSerialization`
Author: Michael Armbrust <michael@databricks.com>
Closes#10747 from marmbrus/encoderExpressions.
This patch significantly speeds up the BlockManagerSuite's "SPARK-9591: getRemoteBytes from another location when Exception throw" test, reducing the test time from 45s to ~250ms. The key change was to set `spark.shuffle.io.maxRetries` to 0 (the code previously set `spark.network.timeout` to `2s`, but this didn't make a difference because the slowdown was not due to this timeout).
Along the way, I also cleaned up the way that we handle SparkConf in BlockManagerSuite: previously, each test would mutate a shared SparkConf instance, while now each test gets a fresh SparkConf.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#10759 from JoshRosen/SPARK-12174.
When running the `run-tests` script, style checkers run only when any source files are modified but they should run when configuration files related to style are modified.
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes#10754 from sarutak/SPARK-12821.
The generated code for CaseWhen uses a control variable "got" to make sure we do not evaluate more branches once a branch is true. Changing that to generate just simple "if / else" would be slightly more efficient.
This closes#10737.
Author: Reynold Xin <rxin@databricks.com>
Closes#10755 from rxin/SPARK-12771.
Add `listener.synchronized` to get `storageStatusList` and `execInfo` atomically.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#10728 from zsxwing/SPARK-12784.
When an Executor process is destroyed, the FileAppender that is asynchronously reading the stderr stream of the process can throw an IOException during read because the stream is closed. Before the ExecutorRunner destroys the process, the FileAppender thread is flagged to stop. This PR wraps the inputStream.read call of the FileAppender in a try/catch block so that if an IOException is thrown and the thread has been flagged to stop, it will safely ignore the exception. Additionally, the FileAppender thread was changed to use Utils.tryWithSafeFinally to better log any exception that do occur. Added unit tests to verify a IOException is thrown and logged if FileAppender is not flagged to stop, and that no IOException when the flag is set.
Author: Bryan Cutler <cutlerb@gmail.com>
Closes#10714 from BryanCutler/file-appender-read-ioexception-SPARK-9844.
This PR makes bucketing and exchange share one common hash algorithm, so that we can guarantee the data distribution is same between shuffle and bucketed data source, which enables us to only shuffle one side when join a bucketed table and a normal one.
This PR also fixes the tests that are broken by the new hash behaviour in shuffle.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#10703 from cloud-fan/use-hash-expr-in-shuffle.
We've already removed local execution but didn't deprecate `TaskContext.isRunningLocally()`; we should deprecate it for 2.0.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#10751 from JoshRosen/remove-local-exec-from-taskcontext.
Fixed WSSSE computeCost in Python mllib KMeans user guide example by using new computeCost method API in Python.
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#10707 from jkbradley/kmeans-doc-fix.
jira: https://issues.apache.org/jira/browse/SPARK-12026
The issue is valid as features.toArray.view.zipWithIndex.slice(startCol, endCol) becomes slower as startCol gets larger.
I tested on local and the change can improve the performance and the running time was stable.
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes#10146 from hhbyyh/chiSq.
This problem lies in `BypassMergeSortShuffleWriter`, empty partition will also generate a temp shuffle file with several bytes. So here change to only create file when partition is not empty.
This problem only lies in here, no such issue in `HashShuffleWriter`.
Please help to review, thanks a lot.
Author: jerryshao <sshao@hortonworks.com>
Closes#10376 from jerryshao/SPARK-12400.
I hit the exception below. The `UnsafeKVExternalSorter` does pass `null` as the consumer when creating an `UnsafeInMemorySorter`. Normally the NPE doesn't occur because the `inMemSorter` is set to null later and the `free()` method is not called. It happens when there is another exception like OOM thrown before setting `inMemSorter` to null. Anyway, we can add the null check to avoid it.
```
ERROR spark.TaskContextImpl: Error in TaskCompletionListener
java.lang.NullPointerException
at org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.free(UnsafeInMemorySorter.java:110)
at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.cleanupResources(UnsafeExternalSorter.java:288)
at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter$1.onTaskCompletion(UnsafeExternalSorter.java:141)
at org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:79)
at org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:77)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:77)
at org.apache.spark.scheduler.Task.run(Task.scala:91)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)
```
Author: Carson Wang <carson.wang@intel.com>
Closes#10637 from carsonwang/FixNPE.
This pull request rewrites CaseWhen expression to break the single, monolithic "branches" field into a sequence of tuples (Seq[(condition, value)]) and an explicit optional elseValue field.
Prior to this pull request, each even position in "branches" represents the condition for each branch, and each odd position represents the value for each branch. The use of them have been pretty confusing with a lot sliding windows or grouped(2) calls.
Author: Reynold Xin <rxin@databricks.com>
Closes#10734 from rxin/simplify-case.
This replaces the `execfile` used for running custom python shell scripts
with explicit open, compile and exec (as recommended by 2to3). The reason
for this change is to make the pythonstartup option compatible with python3.
Author: Erik Selin <erik.selin@gmail.com>
Closes#10255 from tyro89/pythonstartup-python3.
This patch modifies our PR merge script to reset back to a named branch when restoring the original checkout upon exit. When the committer is originally checked out to a detached head, then they will be restored back to that same ref (the same as today's behavior).
This is a slightly updated version of #7569, with an extra fix to handle the detached head corner-case.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#10709 from JoshRosen/SPARK-9383.
Removes some duplicated code that was reintroduced during a merge.
Author: Jakob Odersky <jodersky@gmail.com>
Closes#10711 from jodersky/repl-2.11-duplicate.
The default run has changed, but the documentation didn't fully reflect the change.
Author: Luc Bourlier <luc.bourlier@typesafe.com>
Closes#10740 from skyluc/issue/mesos-modes-doc.
https://github.com/apache/spark/pull/10736 was merged yesterday and caused the master start to fail because of the style issue.
Author: Yin Huai <yhuai@databricks.com>
Closes#10742 from yhuai/fixStyle.
This is the final PR about SPARK-12692.
We have removed all of white spaces before comma from code so let's enforce style checking.
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes#10736 from sarutak/SPARK-12692-followup-enforce-checking.
Fix the style violation (space before , and :).
This PR is a followup for #10643 and rework of #10685 .
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes#10732 from sarutak/SPARK-12692-followup-sql.
cloud-fan Can you please take a look ?
In this case, we are failing during check analysis while validating the aggregation expression. I have added a semanticEquals for HiveGenericUDF to fix this. Please let me know if this is the right way to address this issue.
Author: Dilip Biswal <dbiswal@us.ibm.com>
Closes#10520 from dilipbiswal/spark-12558.
Fix the style violation (space before , and :).
This PR is a followup for #10643
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes#10719 from sarutak/SPARK-12692-followup-core.
There are many potential benefits of having an efficient in memory columnar format as an alternate
to UnsafeRow. This patch introduces ColumnarBatch/ColumnarVector which starts this effort. The
remaining implementation can be done as follow up patches.
As stated in the in the JIRA, there are useful external components that operate on memory in a
simple columnar format. ColumnarBatch would serve that purpose and could server as a
zero-serialization/zero-copy exchange for this use case.
This patch supports running the underlying data either on heap or off heap. On heap runs a bit
faster but we would need offheap for zero-copy exchanges. Currently, this mode is hidden behind one
interface (ColumnVector).
This differs from Parquet or the existing columnar cache because this is *not* intended to be used
as a storage format. The focus is entirely on CPU efficiency as we expect to only have 1 of these
batches in memory per task. The layout of the values is just dense arrays of the value type.
Author: Nong Li <nong@databricks.com>
Author: Nong <nongli@gmail.com>
Closes#10628 from nongli/spark-12635.
- [x] Upgrade Py4J to 0.9.1
- [x] SPARK-12657: Revert SPARK-12617
- [x] SPARK-12658: Revert SPARK-12511
- Still keep the change that only reading checkpoint once. This is a manual change and worth to take a look carefully. bfd4b5c040
- [x] Verify no leak any more after reverting our workarounds
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#10692 from zsxwing/py4j-0.9.1.
This PR implements SQL generation support for persisted data source tables. A new field `metastoreTableIdentifier: Option[TableIdentifier]` is added to `LogicalRelation`. When a `LogicalRelation` representing a persisted data source relation is created, this field holds the database name and table name of the relation.
Author: Cheng Lian <lian@databricks.com>
Closes#10712 from liancheng/spark-12724-datasources-sql-gen.
This patch removes CaseKeyWhen expression and replaces it with a factory method that generates the equivalent CaseWhen. This reduces the amount of code we'd need to maintain in the future for both code generation and optimizer.
Note that we introduced CaseKeyWhen to avoid duplicate evaluations of the key. This is no longer a problem because we now have common subexpression elimination.
Author: Reynold Xin <rxin@databricks.com>
Closes#10722 from rxin/SPARK-12768.
Let me know whether you'd like to see it in other place
Author: Robert Kruszewski <robertk@palantir.com>
Closes#10210 from robert3005/feature/pluggable-optimizer.
This pull request does a few small things:
1. Separated if simplification from BooleanSimplification and created a new rule SimplifyConditionals. In the future we can also simplify other conditional expressions here.
2. Added unit test for SimplifyConditionals.
3. Renamed SimplifyCaseConversionExpressionsSuite to SimplifyStringCaseConversionSuite
Author: Reynold Xin <rxin@databricks.com>
Closes#10716 from rxin/SPARK-12762.
[SPARK-12582][Test] IndexShuffleBlockResolverSuite fails in windows
* IndexShuffleBlockResolverSuite fails in windows due to file is not closed.
* mv IndexShuffleBlockResolverSuite.scala from "test/java" to "test/scala".
https://issues.apache.org/jira/browse/SPARK-12582
Author: Yucai Yu <yucai.yu@intel.com>
Closes#10526 from yucai/master.
Currently, RDD function aggregate's parameter doesn't explain well, especially parameter "zeroValue".
It's helpful to let junior scala user know that "zeroValue" attend both "seqOp" and "combOp" phase.
Author: Tommy YU <tummyyu@163.com>
Closes#10587 from Wenpei/rdd_aggregate_doc.
Use a much smaller step size in LinearRegressionWithSGD MLlib examples to achieve a reasonable RMSE.
Our training folks hit this exact same issue when concocting an example and had the same solution.
Author: Sean Owen <sowen@cloudera.com>
Closes#10675 from srowen/SPARK-5273.
Fix the style violation (space before , and :).
This PR is a followup for #10643.
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes#10718 from sarutak/SPARK-12692-followup-sql.
Fix the style violation (space before , and :).
This PR is a followup for #10643.
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes#10686 from sarutak/SPARK-12692-followup-yarn.
Fix the style violation (space before , and :).
This PR is a followup for #10643.
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes#10685 from sarutak/SPARK-12692-followup-streaming.
https://issues.apache.org/jira/browse/SPARK-11823
This test often hangs and times out, leaving hanging processes. Let's ignore it for now and improve the test.
Author: Yin Huai <yhuai@databricks.com>
Closes#10715 from yhuai/SPARK-11823-ignore.
Scala syntax allows binary case classes to be used as infix operator in pattern matching. This PR makes use of this syntax sugar to make `BooleanSimplification` more readable.
Author: Cheng Lian <lian@databricks.com>
Closes#10445 from liancheng/boolean-simplification-simplification.
```
[info] Exception encountered when attempting to run a suite with class name:
org.apache.spark.sql.hive.LogicalPlanToSQLSuite *** ABORTED *** (325 milliseconds)
[info] org.apache.spark.sql.AnalysisException: Table `t1` already exists.;
[info] at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:296)
[info] at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:285)
[info] at org.apache.spark.sql.hive.LogicalPlanToSQLSuite.beforeAll(LogicalPlanToSQLSuite.scala:33)
[info] at org.scalatest.BeforeAndAfterAll$class.beforeAll(BeforeAndAfterAll.scala:187)
[info] at org.apache.spark.sql.hive.LogicalPlanToSQLSuite.beforeAll(LogicalPlanToSQLSuite.scala:23)
[info] at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:253)
[info] at org.apache.spark.sql.hive.LogicalPlanToSQLSuite.run(LogicalPlanToSQLSuite.scala:23)
[info] at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:462)
[info] at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:671)
[info] at sbt.ForkMain$Run$2.call(ForkMain.java:296)
[info] at sbt.ForkMain$Run$2.call(ForkMain.java:286)
[info] at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[info] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
[info] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
[info] at java.lang.Thread.run(Thread.java:745)
```
/cc liancheng
Author: wangfei <wangfei_hello@126.com>
Closes#10682 from scwf/fix-test.
The PR allows us to use the new SQL parser to parse SQL expressions such as: ```1 + sin(x*x)```
We enable this functionality in this PR, but we will not start using this actively yet. This will be done as soon as we have reached grammar parity with the existing parser stack.
cc rxin
Author: Herman van Hovell <hvanhovell@questtec.nl>
Closes#10649 from hvanhovell/SPARK-12576.
jira: https://issues.apache.org/jira/browse/SPARK-10809
We could provide a single-document topicDistributions method for LocalLDAModel to allow for quick queries which avoid RDD operations. Currently, the user must use an RDD of documents.
add some missing assert too.
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes#9484 from hhbyyh/ldaTopicPre.
jira: https://issues.apache.org/jira/browse/SPARK-12685
the log of `word2vec` reports
trainWordsCount = -785727483
during computation over a large dataset.
Update the priority as it will affect the computation process.
`alpha = learningRate * (1 - numPartitions * wordCount.toDouble / (trainWordsCount + 1))`
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes#10627 from hhbyyh/w2voverflow.
PySpark MLlib ```GaussianMixtureModel``` should support single instance ```predict/predictSoft``` just like Scala do.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#10552 from yanboliang/spark-12603.
This patch fixes a build/test issue caused by the combination of #10672 and a latent issue in the original `dev/test-dependencies` script.
First, changes which _only_ touched build files were not triggering full Jenkins runs, making it possible for a build change to be merged even though it could cause failures in other tests. The `root` build module now depends on `build`, so all tests will now be run whenever a build-related file is changed.
I also added a `clean` step to the Maven install step in `dev/test-dependencies` in order to address an issue where the dummy JARs stuck around and caused "multiple assembly JARs found" errors in tests.
/cc zsxwing
Author: Josh Rosen <joshrosen@databricks.com>
Closes#10704 from JoshRosen/fix-build-test-problems.