In the comment (Line 1083), it says: "Otherwise, interpolate the number of partitions we need to try, but overestimate it by 50%."
`(1.5 * num * partsScanned / buf.size).toInt` is the estimate of the total number of partitions needed. In every iteration, we should only scan the additional partitions given by the increment `(1.5 * num * partsScanned / buf.size).toInt - partsScanned`.
The existing implementation grows `partsScanned` exponentially (roughly `x_{n+1} >= (1.5 + 1) * x_n`).
This could be a performance problem (unless it is the intended behavior).
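A minimal sketch of the interpolation step, assuming the variable names from the description; the clamping policy and example numbers are illustrative, not the exact RDD.take code:

```scala
// Assumes bufSize > 0 (the empty-buffer case is handled separately in the real code)
def partitionsToTryNext(num: Int, partsScanned: Int, bufSize: Int, totalParts: Int): Int = {
  // Estimate of the total number of partitions needed, overestimated by 50%
  val estimatedTotal = (1.5 * num * partsScanned / bufSize).toInt
  // Scan only the increment beyond what has already been scanned,
  // instead of adding estimatedTotal on top of partsScanned every iteration
  math.min(math.max(estimatedTotal - partsScanned, 1), totalParts - partsScanned)
}

// 1000 rows wanted, 4 partitions scanned so far yielded 100 rows:
partitionsToTryNext(num = 1000, partsScanned = 4, bufSize = 100, totalParts = 200)  // 56
```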
Author: yingjieMiao <yingjie@42go.com>
Closes#2648 from yingjieMiao/rdd_take and squashes the following commits:
d758218 [yingjieMiao] scala style fix
a8e74bb [yingjieMiao] python style fix
4b6e777 [yingjieMiao] infix operator style fix
4391d3b [yingjieMiao] typo fix.
692f4e6 [yingjieMiao] cap numPartsToTry
c4483dc [yingjieMiao] style fix
1d2c410 [yingjieMiao] also change in rdd.py and AsyncRDD
d31ff7e [yingjieMiao] handle the edge case after 1 iteration
a2aa36b [yingjieMiao] RDD take method: overestimate too much
Author: Reynold Xin <rxin@apache.org>
Closes#2727 from rxin/SPARK-3861-broadcast-hash-2 and squashes the following commits:
9c7b1a2 [Reynold Xin] Revert "Reuse CompactBuffer in UniqueKeyHashedRelation."
97626a1 [Reynold Xin] Reuse CompactBuffer in UniqueKeyHashedRelation.
7fcffb5 [Reynold Xin] Make UniqueKeyHashedRelation private[joins].
18eb214 [Reynold Xin] Merge branch 'SPARK-3861-broadcast-hash' into SPARK-3861-broadcast-hash-1
4b9d0c9 [Reynold Xin] UniqueKeyHashedRelation.get should return null if the value is null.
e0ebdd1 [Reynold Xin] Added a test case.
90b58c0 [Reynold Xin] [SPARK-3861] Avoid rebuilding hash tables on each partition
0c0082b [Reynold Xin] Fix line length.
cbc664c [Reynold Xin] Rename join -> joins package.
a070d44 [Reynold Xin] Fix line length in HashJoin
a39be8c [Reynold Xin] [SPARK-3857] Create a join package for various join operators.
While training Gradient Boosted Decision Trees on large-scale sparse data, Spark spilled a large amount of data onto disk, which exposed the bug below:
In version 1.1.0, the train method in DecisionTree.scala persists treeInput in memory but never unpersists it. This caused heavy disk usage.
In the GitHub version (1.2.0, probably), the train method in RandomForest.scala persists baggedInput without unpersisting it either.
After adding the unpersist call, it works correctly.
https://issues.apache.org/jira/browse/SPARK-3918
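A minimal sketch of the fix pattern, assuming a local SparkContext; the RDD and the per-iteration work are stand-ins, not the actual DecisionTree/RandomForest code:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object UnpersistExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("unpersist-sketch").setMaster("local[2]"))
    val input = sc.parallelize(1 to 1000000)

    // Cache the derived RDD that is reused across training iterations ...
    val baggedInput = input.map(x => (x % 10, x)).persist(StorageLevel.MEMORY_AND_DISK)
    try {
      for (_ <- 1 to 5) baggedInput.count()   // stand-in for the per-iteration work
    } finally {
      baggedInput.unpersist()                 // ... and release its blocks once training is done
    }
    sc.stop()
  }
}
```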
Author: omgteam <Kimlong.Liu@gmail.com>
Closes#2775 from omgteam/master and squashes the following commits:
815d543 [omgteam] adjust tab to spaces
1a36f83 [omgteam] Bug: fix without unpersist baggedInput in RandomForest.scala
There are three [Custom Receiver Guide] links in the streaming doc; the first one is wrong.
Author: w00228970 <wangfei1@huawei.com>
Author: wangfei <wangfei1@huawei.com>
Closes#2749 from scwf/streaming-doc and squashes the following commits:
0cd76b7 [wangfei] update link tojump to the Akka-specific section
45b0646 [w00228970] wrong link in streaming doc
Author: Ken Takagiwa <ugw.gi.world@gmail.com>
Closes#2778 from giwa/patch-2 and squashes the following commits:
a59f9a1 [Ken Takagiwa] Add echo "Run streaming tests ..."
Author: GuoQiang Li <witgo@qq.com>
Closes#2763 from witgo/SPARK-3905 and squashes the following commits:
17d7990 [GuoQiang Li] The keys for sorting the columns of Executor page ,Stage page Storage page are incorrect
val path = ... //path to seq file with BytesWritable as type of both key and value
val file = sc.sequenceFile[Array[Byte],Array[Byte]](path)
file.take(1)(0)._1
This prints incorrect content of the byte array: the actual content starts out correct, but "random" bytes and zeros are appended. BytesWritable has two methods:
getBytes() - returns the whole internal array, which is often longer than the actual stored value; it usually contains leftovers from previous, longer values
copyBytes() - returns just the beginning of the internal array, as determined by the internal length property
It looks like the implicit conversion between BytesWritable and Array[Byte] uses getBytes instead of the correct copyBytes.
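A hedged sketch of a corrected conversion; the helper name is hypothetical, and the real fix may use copyBytes() directly where available:

```scala
import org.apache.hadoop.io.BytesWritable

def bytesWritableToByteArray(bw: BytesWritable): Array[Byte] = {
  // getBytes() exposes the whole backing array, which is often longer than the value;
  // copy only the first getLength() bytes (equivalent to copyBytes())
  val result = new Array[Byte](bw.getLength)
  System.arraycopy(bw.getBytes, 0, result, 0, bw.getLength)
  result
}
```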
dbtsai
Author: Jakub Dubovský <james64@inMail.sk>
Author: Dubovsky Jakub <dubovsky@avast.com>
Closes#2712 from james64/3121-bugfix and squashes the following commits:
f85d24c [Jakub Dubovský] Test name changed, comments added
1b20d51 [Jakub Dubovský] Import placed correctly
406e26c [Jakub Dubovský] Scala style fixed
f92ffa6 [Dubovsky Jakub] performance tuning
480f9cd [Dubovsky Jakub] Bug 3121 fixed
This was reported in https://issues.apache.org/jira/browse/SPARK-3445. There are API differences between the 0.23.* and the 2.0.*-alpha branches that are not accounted for when this code was introduced.
Author: Andrew Or <andrewor14@gmail.com>
Closes#2776 from andrewor14/fix-yarn-alpha and squashes the following commits:
ec94752 [Andrew Or] Fix compilation error for 2.0.*-alpha
Previously, when the val partitionStrategy was created it called a function in the Analytics object which was a copy of the PartitionStrategy.fromString() method. This function has been removed, and the assignment of partitionStrategy now uses the PartitionStrategy.fromString method instead. In this way, it better matches the declarations of edge/vertex StorageLevel variables.
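A hedged sketch of the resulting call; the option-map handling is illustrative, the point being that the library's PartitionStrategy.fromString is used instead of a local copy:

```scala
import org.apache.spark.graphx.PartitionStrategy

val options = Map("partStrategy" -> "EdgePartition2D")
val partitionStrategy: Option[PartitionStrategy] =
  options.get("partStrategy").map(PartitionStrategy.fromString)
```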
Author: NamelessAnalyst <NamelessAnalyst@users.noreply.github.com>
Closes#2569 from NamelessAnalyst/branch-1.1 and squashes the following commits:
c24ff51 [NamelessAnalyst] Update Analytics.scala
(cherry picked from commit 5a21e3e7e9)
Signed-off-by: Ankur Dave <ankurdave@gmail.com>
When reporting that a remote error occurred, the ConnectionManager should also log the stacktrace of the remote exception. This PR accomplishes this by sending the remote exception's stacktrace as the payload in the "negative ACK / error message."
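A hedged sketch of turning a remote exception into an error payload; the helper name is hypothetical and this is not the actual ConnectionManager code:

```scala
import java.io.{PrintWriter, StringWriter}
import java.nio.charset.StandardCharsets

def errorMessagePayload(e: Throwable): Array[Byte] = {
  val sw = new StringWriter()
  e.printStackTrace(new PrintWriter(sw, true))
  sw.toString.getBytes(StandardCharsets.UTF_8)  // explicitly UTF-8, as in the follow-up commit
}
```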
Author: Josh Rosen <joshrosen@apache.org>
Closes#2741 from JoshRosen/propagate-cm-exceptions-to-sender and squashes the following commits:
b5366cc [Josh Rosen] Explicitly encode error messages using UTF-8.
cef18b3 [Josh Rosen] [SPARK-3887] Send stracktrace in ConnectionManager error messages.
The Sphinx documents contain corrupted reST formatting and produce some warnings.
The purpose of this issue is same as https://issues.apache.org/jira/browse/SPARK-3773.
commit: 0e8203f4fb
output
```
$ cd ./python/docs
$ make clean html
rm -rf _build/*
sphinx-build -b html -d _build/doctrees . _build/html
Making output directory...
Running Sphinx v1.2.3
loading pickled environment... not yet created
building [html]: targets for 4 source files that are out of date
updating environment: 4 added, 0 changed, 0 removed
reading sources... [100%] pyspark.sql
/Users/<user>/MyRepos/Scala/spark/python/pyspark/mllib/feature.py:docstring of pyspark.mllib.feature.Word2VecModel.findSynonyms:4: WARNING: Field list ends without a blank line; unexpected unindent.
/Users/<user>/MyRepos/Scala/spark/python/pyspark/mllib/feature.py:docstring of pyspark.mllib.feature.Word2VecModel.transform:3: WARNING: Field list ends without a blank line; unexpected unindent.
/Users/<user>/MyRepos/Scala/spark/python/pyspark/sql.py:docstring of pyspark.sql:4: WARNING: Bullet list ends without a blank line; unexpected unindent.
looking for now-outdated files... none found
pickling environment... done
checking consistency... done
preparing documents... done
writing output... [100%] pyspark.sql
writing additional files... (12 module code pages) _modules/index search
copying static files... WARNING: html_static_path entry u'/Users/<user>/MyRepos/Scala/spark/python/docs/_static' does not exist
done
copying extra files... done
dumping search index... done
dumping object inventory... done
build succeeded, 4 warnings.
Build finished. The HTML pages are in _build/html.
```
Author: cocoatomo <cocoatomo77@gmail.com>
Closes#2766 from cocoatomo/issues/3909-sphinx-build-warnings and squashes the following commits:
2c7faa8 [cocoatomo] [SPARK-3909][PySpark][Doc] A corrupted format in Sphinx documents and building warnings
./python/run-tests searches for a Python 2.6 executable on PATH and uses it if available.
When using Python 2.6, it tries to import the unittest2 module, which is not part of the Python 2.6 standard library, so it fails with an ImportError.
Author: cocoatomo <cocoatomo77@gmail.com>
Closes#2759 from cocoatomo/issues/3867-unittest2-import-error and squashes the following commits:
f068eb5 [cocoatomo] [SPARK-3867] ./python/run-tests failed when it run with Python 2.6 and unittest2 is not installed
Required by Scala 2.11: only one function/constructor amongst overridden alternatives can have a default argument.
Author: Prashant Sharma <prashant.s@imaginea.com>
Closes#2750 from ScrapCodes/SPARK-2924/default-args-removed and squashes the following commits:
d9785c3 [Prashant Sharma] [SPARK-2924] Required by scala 2.11, only one function/ctor amongst overriden alternatives, can have default argument.
We had to upgrade our Hive 0.12 version as well to deal with a protobuf
conflict (both hive and akka have been using a shaded protobuf version).
This is testing a correctly patched version of Hive 0.12.
Author: Patrick Wendell <pwendell@gmail.com>
Closes#2756 from pwendell/hotfix and squashes the following commits:
cc979d0 [Patrick Wendell] HOTFIX: Fix build issue with Akka 2.3.4 upgrade.
Use AutoBatchedSerializer by default, which chooses the proper batch size based on the size of serialized objects, letting the size of a serialized batch fall into the range [64k, 640k].
In the JVM, the serializer also tracks the objects in a batch to detect duplicated objects, so a larger batch may cause OOM in the JVM.
Author: Davies Liu <davies.liu@gmail.com>
Closes#2740 from davies/batchsize and squashes the following commits:
52cdb88 [Davies Liu] update docs
185f2b9 [Davies Liu] use AutoBatchedSerializer by default
In general, individual shuffle blocks are frequently small, so mmapping them often creates a lot of waste. It may not be bad to mmap the larger ones, but it is pretty inconvenient to get configuration into ManagedBuffer, and besides it is unlikely to help all that much.
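A hedged sketch of the idea, with an illustrative threshold and method name (not the actual FileSegmentManagedBuffer code): small segments are read into a regular buffer, large ones are memory-mapped.

```scala
import java.io.RandomAccessFile
import java.nio.ByteBuffer
import java.nio.channels.FileChannel.MapMode

def readSegment(path: String, offset: Long, length: Int,
                mmapThreshold: Int = 2 * 1024 * 1024): ByteBuffer = {
  val file = new RandomAccessFile(path, "r")
  try {
    if (length < mmapThreshold) {
      // Small segment: copy into a normal buffer to avoid mmap waste and SIGBUS risk
      val buf = ByteBuffer.allocate(length)
      var pos = offset
      while (buf.hasRemaining) {
        val read = file.getChannel.read(buf, pos)
        if (read == -1) throw new java.io.IOException(s"Unexpected EOF in $path")
        pos += read
      }
      buf.flip()
      buf
    } else {
      // Large segment: memory-map it
      file.getChannel.map(MapMode.READ_ONLY, offset, length)
    }
  } finally {
    file.close()
  }
}
```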
Author: Aaron Davidson <aaron@databricks.com>
Closes#2742 from aarondav/mmap and squashes the following commits:
a152065 [Aaron Davidson] Add other pathway back
52b6cd2 [Aaron Davidson] [SPARK-3889] Attempt to avoid SIGBUS by not mmapping files in ConnectionManager
This is a second rev of the Akka upgrade (earlier merged, but reverted). I made a slight modification which is that I also upgrade Hive to deal with a compatibility issue related to the protocol buffers library.
Author: Anand Avati <avati@redhat.com>
Author: Patrick Wendell <pwendell@gmail.com>
Closes#2752 from pwendell/akka-upgrade and squashes the following commits:
4c7ca3f [Patrick Wendell] Upgrading to new hive->protobuf version
57a2315 [Anand Avati] SPARK-1812: streaming - remove tests which depend on akka.actor.IO
2a551d3 [Anand Avati] SPARK-1812: core - upgrade to akka 2.3.4
Queries like SELECT a.key FROM (SELECT key FROM src) \`a\` do not work, because backticks in subquery aliases are not handled properly. This PR fixes that.
Author: ravipesala <ravindra.pesala@huawei.com>
Closes#2737 from ravipesala/SPARK-3834 and squashes the following commits:
0e0ab98 [ravipesala] Fixing issue in backtick handling for subquery aliases
Uses `MEMORY_AND_DISK` as the default storage level for in-memory table caching. Due to the in-memory columnar representation, recomputing the partitions of an in-memory cached table can be very expensive.
Author: Cheng Lian <lian.cs.zju@gmail.com>
Closes#2686 from liancheng/spark-3824 and squashes the following commits:
35d2ed0 [Cheng Lian] Removes extra space
1ab7967 [Cheng Lian] Reduces test data size to fit DiskStore.getBytes()
ba565f0 [Cheng Lian] Maks CachedBatch serializable
07f0204 [Cheng Lian] Sets in-memory table default storage level to MEMORY_AND_DISK
This PR is a follow up of #2590, and tries to introduce a top level SQL parser entry point for all SQL dialects supported by Spark SQL.
A top level parser `SparkSQLParser` is introduced to handle the syntaxes that all SQL dialects should recognize (e.g. `CACHE TABLE`, `UNCACHE TABLE` and `SET`, etc.). For all the syntaxes this parser doesn't recognize directly, it falls back to a specified function that tries to parse arbitrary input to a `LogicalPlan`. This function is typically another parser combinator like `SqlParser`. DDL syntaxes introduced in #2475 can be moved here.
The `ExtendedHiveQlParser` now only handles Hive-specific extensions.
Also took the chance to refactor/reformat `SqlParser` for better readability.
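A hedged sketch of the "shared syntax plus fallback" structure using Scala parser combinators; the class and rule names are illustrative, not the actual SparkSQLParser:

```scala
import scala.util.parsing.combinator.RegexParsers

class TopLevelParser(fallback: String => String) extends RegexParsers {
  // Syntax every SQL dialect should recognize
  private def cacheTable: Parser[String] =
    regex("(?i)CACHE\\s+TABLE".r) ~> regex("\\w+".r) ^^ { t => s"CacheCommand($t)" }
  private def uncacheTable: Parser[String] =
    regex("(?i)UNCACHE\\s+TABLE".r) ~> regex("\\w+".r) ^^ { t => s"UncacheCommand($t)" }
  // Anything else is delegated to the dialect-specific parser
  private def others: Parser[String] = regex("(?s).+".r) ^^ fallback

  private def statement: Parser[String] = cacheTable | uncacheTable | others

  def apply(input: String): String = parseAll(statement, input) match {
    case Success(plan, _) => plan
    case failure          => sys.error(failure.toString)
  }
}

// The fallback would typically be another parser (e.g. SqlParser or the Hive QL parser)
val parser = new TopLevelParser(sql => s"DelegatedPlan($sql)")
parser("CACHE TABLE logs")    // CacheCommand(logs)
parser("SELECT * FROM logs")  // DelegatedPlan(SELECT * FROM logs)
```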
Author: Cheng Lian <lian.cs.zju@gmail.com>
Closes#2698 from liancheng/gen-sql-parser and squashes the following commits:
ceada76 [Cheng Lian] Minor styling fixes
9738934 [Cheng Lian] Minor refactoring, removes optional trailing ";" in the parser
bb2ab12 [Cheng Lian] SET property value can be empty string
ce8860b [Cheng Lian] Passes test suites
e86968e [Cheng Lian] Removes debugging code
8bcace5 [Cheng Lian] Replaces digit.+ to rep1(digit) (Scala style checking doesn't like it)
d15d54f [Cheng Lian] Unifies SQL and HiveQL parsers
I noticed a few issues with how temp directories are created and deleted:
*Minor*
* Guava's `Files.createTempDir()` plus `File.deleteOnExit()` is used in many tests to make a temp dir, but `Utils.createTempDir()` seems to be the standard Spark mechanism
* Call to `File.deleteOnExit()` could be pushed into `Utils.createTempDir()` as well, along with this replacement
* _I messed up the message in an exception in `Utils` in SPARK-3794; fixed here_
*Bit Less Minor*
* `Utils.deleteRecursively()` fails immediately if any `IOException` occurs, instead of trying to delete any remaining files and subdirectories. I've observed this leave temp dirs around. I suggest changing it to continue in the face of an exception and throw one of the possibly several exceptions that occur at the end.
* `Utils.createTempDir()` adds a JVM shutdown hook every time the method is called, even when the new dir lies under a dir already registered for deletion, since that check only happens inside the hook. However, `Utils` already manages a set of all dirs to delete on shutdown, called `shutdownDeletePaths`. A single hook can be registered to delete all of these on exit; this is how Tachyon temp paths are cleaned up in `TachyonBlockManager` (see the sketch below).
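A hedged sketch combining both points: a single shutdown hook over a shared set of paths, and a recursive delete that keeps going after a failure. Names and details are illustrative, not the actual Utils code.

```scala
import java.io.{File, IOException}
import scala.collection.mutable

object TempDirCleanup {
  private val shutdownDeletePaths = mutable.HashSet[String]()

  // A single hook registered once, instead of one hook per createTempDir() call
  Runtime.getRuntime.addShutdownHook(new Thread("delete temp dirs") {
    override def run(): Unit = shutdownDeletePaths.synchronized {
      shutdownDeletePaths.foreach(p => deleteRecursively(new File(p)))
    }
  })

  def registerForDeletion(dir: File): Unit = shutdownDeletePaths.synchronized {
    shutdownDeletePaths += dir.getAbsolutePath
  }

  // Try to delete everything even if some deletions fail; surface one failure at the end
  def deleteRecursively(file: File): Unit = {
    var lastException: IOException = null
    if (file.isDirectory) {
      Option(file.listFiles()).getOrElse(Array.empty[File]).foreach { child =>
        try deleteRecursively(child)
        catch { case e: IOException => lastException = e }
      }
    }
    if (!file.delete() && file.exists()) {
      lastException = new IOException(s"Failed to delete: ${file.getAbsolutePath}")
    }
    if (lastException != null) throw lastException
  }
}
```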
I noticed a few other things that might be changed but wanted to ask first:
* Shouldn't the set of dirs to delete be `File`, not just `String` paths?
* `Utils` manages the set of `TachyonFile` that have been registered for deletion, but the shutdown hook is managed in `TachyonBlockManager`. Should this logic not live together, and not in `Utils`? it's more specific to Tachyon, and looks a slight bit odd to import in such a generic place.
Author: Sean Owen <sowen@cloudera.com>
Closes#2670 from srowen/SPARK-3811 and squashes the following commits:
071ae60 [Sean Owen] Update per @vanzin's review
da0146d [Sean Owen] Make Utils.deleteRecursively try to delete all paths even when an exception occurs; use one shutdown hook instead of one per method call to delete temp dirs
3a0faa4 [Sean Owen] Standardize on Utils.createTempDir instead of Files.createTempDir
This prevents it from changing during serialization, leading to corrupted results.
Author: Michael Armbrust <michael@databricks.com>
Closes#2656 from marmbrus/generateBug and squashes the following commits:
efa32eb [Michael Armbrust] Store the output of a generator in a val. This prevents it from changing during serialization.
This pull request addresses a few issues related to PySpark's IPython support:
- Fix the remaining uses of the '-u' flag, which IPython doesn't support (see SPARK-3772).
- Change PYSPARK_PYTHON_OPTS to PYSPARK_DRIVER_PYTHON_OPTS, so that the old name is reserved in case we ever want to allow the worker Python options to be customized (this variable was introduced in #2554 and hasn't landed in a release yet, so this doesn't break any compatibility).
- Introduce a PYSPARK_DRIVER_PYTHON option that allows the driver to use `ipython` while the workers use a different Python version.
- Attempt to use Python 2.7 by default if PYSPARK_PYTHON is not specified.
- Retain the old semantics for IPYTHON=1 and IPYTHON_OPTS (to avoid breaking existing example programs).
There are more details in a block comment in `bin/pyspark`.
Author: Josh Rosen <joshrosen@apache.org>
Closes#2651 from JoshRosen/SPARK-3772 and squashes the following commits:
7b8eb86 [Josh Rosen] More changes to PySpark python executable configuration:
c4f5778 [Josh Rosen] [SPARK-3772] Allow ipython to be used by Pyspark workers; IPython fixes:
"case when" conditional function is already supported in Spark SQL but there is no support in SqlParser. So added parser support to it.
Author : ravipesala ravindra.pesalahuawei.com
Author: ravipesala <ravindra.pesala@huawei.com>
Closes#2678 from ravipesala/SPARK-3813 and squashes the following commits:
70c75a7 [ravipesala] Fixed styles
713ea84 [ravipesala] Updated as per admin comments
709684f [ravipesala] Changed parser to support case when function.
The alias parameter is being ignored, which makes it more difficult to specify a qualifier for Generator expressions.
Author: Nathan Howell <nhowell@godaddy.com>
Closes#2721 from NathanHowell/SPARK-3858 and squashes the following commits:
8aa0f43 [Nathan Howell] [SPARK-3858][SQL] Pass the generator alias into logical plan node
chenghao-intel assigned this to me, check PR #2284 for previous discussion
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Closes#2529 from adrian-wang/rowapi and squashes the following commits:
c6594b2 [Daoyuan Wang] using boxed
7b7e6e3 [Daoyuan Wang] update pattern match
7a39456 [Daoyuan Wang] rename file and refresh getAs[T]
4c18c29 [Daoyuan Wang] remove setAs[T] and null judge
1614493 [Daoyuan Wang] add missing row api
This PR aims to provide a way to skip/query corrupt JSON records. To do so, we introduce an internal column to hold corrupt records (the default name is `_corrupt_record`. This name can be changed by setting the value of `spark.sql.columnNameOfCorruptRecord`). When there is a parsing error, we will put the corrupt record in its unparsed format to the internal column. Users can skip/query this column through SQL.
* To query those corrupt records
```
-- For Hive parser
SELECT `_corrupt_record`
FROM jsonTable
WHERE `_corrupt_record` IS NOT NULL
-- For our SQL parser
SELECT _corrupt_record
FROM jsonTable
WHERE _corrupt_record IS NOT NULL
```
* To skip corrupt records and query regular records
```
-- For Hive parser
SELECT field1, field2
FROM jsonTable
WHERE `_corrupt_record` IS NULL
-- For our SQL parser
SELECT field1, field2
FROM jsonTable
WHERE _corrupt_record IS NULL
```
Generally, it is not recommended to change the name of the internal column. If the name has to be changed to avoid possible name conflicts, you can use `sqlContext.setConf(SQLConf.COLUMN_NAME_OF_CORRUPT_RECORD, <new column name>)` or `sqlContext.sql("SET spark.sql.columnNameOfCorruptRecord=<new column name>")`.
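A hedged usage sketch; the `_malformed` name, the file path, and the surrounding setup are illustrative:

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)   // assumes an existing SparkContext `sc`
sqlContext.setConf("spark.sql.columnNameOfCorruptRecord", "_malformed")
sqlContext.jsonFile("/path/to/records.json").registerTempTable("jsonTable")
val corrupt = sqlContext.sql("SELECT _malformed FROM jsonTable WHERE _malformed IS NOT NULL")
```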
Author: Yin Huai <huai@cse.ohio-state.edu>
Closes#2680 from yhuai/corruptJsonRecord and squashes the following commits:
4c9828e [Yin Huai] Merge remote-tracking branch 'upstream/master' into corruptJsonRecord
309616a [Yin Huai] Change the default name of corrupt record to "_corrupt_record".
b4a3632 [Yin Huai] Merge remote-tracking branch 'upstream/master' into corruptJsonRecord
9375ae9 [Yin Huai] Set the column name of corrupt json record back to the default one after the unit test.
ee584c0 [Yin Huai] Provide a way to query corrupt json records as unparsed strings.
In JSONRDD.scala, add 'case TimestampType' in the enforceCorrectType function and a toTimestamp function.
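A hedged sketch of the added branch; the input types handled here are illustrative, and the real JsonRDD code may accept more formats:

```scala
import java.sql.Timestamp

def toTimestamp(value: Any): Timestamp = value match {
  case s: String => Timestamp.valueOf(s)   // e.g. "2014-10-01 12:34:56"
  case l: Long   => new Timestamp(l)       // epoch milliseconds
}

// inside enforceCorrectType:
//   case TimestampType => toTimestamp(value)
```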
Author: Mike Timper <mike@aurorafeint.com>
Closes#2720 from mtimper/master and squashes the following commits:
9386ab8 [Mike Timper] Fix and tests for SPARK-3853
The ./python/run-tests script displays messages on stdout about which test it is currently running, but does not write them to unit-tests.log.
This makes it harder to recognize which test programs were executed and which test failed.
Author: cocoatomo <cocoatomo77@gmail.com>
Closes#2724 from cocoatomo/issues/3868-display-testing-module-name and squashes the following commits:
c63d9fa [cocoatomo] [SPARK-3868][PySpark] Hard to recognize which module is tested from unit-tests.log
This fixes two issues in CliSuite.
1. CliSuite throws an IndexOutOfBoundsException:
Exception in thread "Thread-6" java.lang.IndexOutOfBoundsException: 6
at scala.collection.mutable.ResizableArray$class.apply(ResizableArray.scala:43)
at scala.collection.mutable.ArrayBuffer.apply(ArrayBuffer.scala:47)
at org.apache.spark.sql.hive.thriftserver.CliSuite.org$apache$spark$sql$hive$thriftserver$CliSuite$$captureOutput$1(CliSuite.scala:67)
at org.apache.spark.sql.hive.thriftserver.CliSuite$$anonfun$4.apply(CliSuite.scala:78)
at org.apache.spark.sql.hive.thriftserver.CliSuite$$anonfun$4.apply(CliSuite.scala:78)
at scala.sys.process.ProcessLogger$$anon$1.out(ProcessLogger.scala:96)
at scala.sys.process.BasicIO$$anonfun$processOutFully$1.apply(BasicIO.scala:135)
at scala.sys.process.BasicIO$$anonfun$processOutFully$1.apply(BasicIO.scala:135)
at scala.sys.process.BasicIO$.readFully$1(BasicIO.scala:175)
at scala.sys.process.BasicIO$.processLinesFully(BasicIO.scala:179)
at scala.sys.process.BasicIO$$anonfun$processFully$1.apply(BasicIO.scala:164)
at scala.sys.process.BasicIO$$anonfun$processFully$1.apply(BasicIO.scala:162)
at scala.sys.process.ProcessBuilderImpl$Simple$$anonfun$3.apply$mcV$sp(ProcessBuilderImpl.scala:73)
at scala.sys.process.ProcessImpl$Spawn$$anon$1.run(ProcessImpl.scala:22)
Actually, it is multi-threading that leads to this problem.
2. Use `line.startsWith` instead of `line.contains` to assert the expected answer. This fixes a tiny bug in CliSuite: for the test case "Simple commands", one of the expected answers is "5". If we use `contains`, output lines like "14/10/06 11:54:36 INFO CliDriver: Time taken: 1.078 seconds" or "14/10/06 11:54:36 INFO StatsReportListener: 0% 5% 10% 25% 50% 75% 90% 95% 100%" (both of which contain a "5") would also make the assertion pass.
Author: scwf <wangfei1@huawei.com>
Closes#2666 from scwf/clisuite and squashes the following commits:
11430db [scwf] fix-clisuite
The In case class is replaced by an InSet class when all the filter values are literals. InSet uses a HashSet instead of a Sequence, giving a significant performance improvement (previously the sequence used a worst-case linear match via the exists method, since the filter list was assumed to contain arbitrary expressions). The maximum improvement should be visible when only a small percentage of a large dataset matches the filter list.
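A hedged sketch of the per-row evaluation difference (not the actual Catalyst expression classes):

```scala
val filterValues: Seq[Any] = Seq(1, 5, 9, 42)   // all literals in the IN list
val hset: Set[Any] = filterValues.toSet         // built once by the optimizer rule

def evalIn(rowValue: Any): Boolean    = filterValues.exists(_ == rowValue)  // O(n) per row
def evalInSet(rowValue: Any): Boolean = hset.contains(rowValue)             // O(1) expected per row
```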
Author: Yash Datta <Yash.Datta@guavus.com>
Closes#2561 from saucam/branch-1.1 and squashes the following commits:
4bf2d19 [Yash Datta] SPARK-3711: 1. Fix code style and import order 2. Fix optimization condition 3. Add tests for null in filter list 4. Add test case that optimization is not triggered in case of attributes in filter list
afedbcd [Yash Datta] SPARK-3711: 1. Add test cases for InSet class in ExpressionEvaluationSuite 2. Add class OptimizedInSuite on the lines of ConstantFoldingSuite, for the optimized In clause
0fc902f [Yash Datta] SPARK-3711: UnaryMinus will be handled by constantFolding
bd84c67 [Yash Datta] SPARK-3711: Incorporate review comments. Move optimization of In clause to Optimizer.scala by adding a rule. Add appropriate comments
430f5d1 [Yash Datta] SPARK-3711: Optimize the filter list in case of negative values as well
bee98aa [Yash Datta] SPARK-3711: Optimize where in clause filter queries
Author: Vida Ha <vida@databricks.com>
Closes#2621 from vidaha/vida/SPARK-3752 and squashes the following commits:
d7fdbbc [Vida Ha] Add tests for different UDF's
Make ConnectionManager propagate errors properly and add more logs to avoid Executors swallowing errors.
This PR made the following changes:
* Register a callback to `Connection` so that the error will be propagated properly.
* Add more logs so that the errors won't be swallowed by Executors.
* Use trySuccess/tryFailure because `Promise` doesn't allow calling success/failure more than once.
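A hedged illustration of the trySuccess/tryFailure point (plain Promise usage, not the ConnectionManager code):

```scala
import scala.concurrent.Promise

val p = Promise[Int]()
p.success(1)
// p.success(2)                                   // would throw IllegalStateException
p.trySuccess(2)                                   // returns false instead of throwing
p.tryFailure(new RuntimeException("late error"))  // also returns false; safe to call again
```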
Author: zsxwing <zsxwing@gmail.com>
Closes#2593 from zsxwing/SPARK-3741 and squashes the following commits:
1d5aed5 [zsxwing] Fix naming
0b8a61c [zsxwing] Merge branch 'master' into SPARK-3741
764aec5 [zsxwing] [SPARK-3741] Make ConnectionManager propagate errors properly and add more logs to avoid Executors swallowing errors
cc mengxr
Author: GuoQiang Li <witgo@qq.com>
Closes#2730 from witgo/SPARK-3856 and squashes the following commits:
2cffce1 [GuoQiang Li] use norm operator after breeze 0.10 upgrade
Currently, the implementation does one unnecessary aggregation step. The aggregation step for level L (to choose splits) gives enough information to set the predictions of any leaf nodes at level L+1. We can use that info and skip the aggregation step for the last level of the tree (which only has leaf nodes).
### Implementation Details
Each node now has an `impurity` field, and `predict` is changed from type `Double` to type `Predict` (this can be used to compute prediction probabilities in the future). When computing the best splits for each node, we also compute impurity and predict for the child nodes, which are used to construct the newly allocated child nodes. So at level L, we have already set impurity and predict for the nodes at level L+1.
If level L+1 is the last level, then we can avoid the aggregation step entirely. What's more, the calculation of parent impurity for the top nodes of each tree needs to be handled differently, because we have to compute impurity and predict for them first. In `binsToBestSplit`, if the current node is a top node (level == 0), we calculate impurity and predict first;
after finding the best split, the top node's predict and impurity are set to the calculated values. Non-top nodes' impurity and predict are already calculated and don't need to be recalculated. I considered adding an initialization step to set the top nodes' impurity and predict so that all nodes could be treated the same way, but that would require a lot of code duplication (all the code for the seq operation (BinSeqOp) would need to be duplicated), so I chose the current approach.
CC mengxr manishamde jkbradley, please help me review this, thanks.
Author: Qiping Li <liqiping1991@gmail.com>
Closes#2708 from chouqin/avoid-agg and squashes the following commits:
8e269ea [Qiping Li] adjust code and comments
eefeef1 [Qiping Li] adjust comments and check child nodes' impurity
c41b1b6 [Qiping Li] fix pyspark unit test
7ad7a71 [Qiping Li] fix unit test
822c912 [Qiping Li] add comments and unit test
e41d715 [Qiping Li] fix bug in test suite
6cc0333 [Qiping Li] SPARK-3158: Avoid 1 extra aggregation for DecisionTree training
It took me a minute to track this down, so I thought it could be useful to have it in the docs.
I'm unsure if 512mb is the default for spark.driver.memory? Also - there could be a better value for the 'description' to differentiate it from spark.executor.memory.
Author: nartz <nartzpod@gmail.com>
Author: Nathan Artz <nathanartz@Nathans-MacBook-Pro.local>
Closes#2410 from nartz/docs/add-spark-driver-memory-to-config-docs and squashes the following commits:
a2f6c62 [nartz] Update configuration.md
74521b8 [Nathan Artz] add spark.driver.memory to config docs
Truncate appName in WebUI if it is too long.
Author: Xiangrui Meng <meng@databricks.com>
Closes#2707 from mengxr/truncate-app-name and squashes the following commits:
87834ce [Xiangrui Meng] move scala import below java
c7111dc [Xiangrui Meng] truncate appName in WebUI if it is too long
Upgrade to akka 2.3.4
Author: Anand Avati <avati@redhat.com>
Closes#1685 from avati/SPARK-1812-akka-2.3 and squashes the following commits:
57a2315 [Anand Avati] SPARK-1812: streaming - remove tests which depend on akka.actor.IO
2a551d3 [Anand Avati] SPARK-1812: core - upgrade to akka 2.3.4
Got warning msg:
~~~
[warn] /Users/meng/src/spark/mllib/src/main/scala/org/apache/spark/mllib/feature/Normalizer.scala:50: method norm in trait NumericOps is deprecated: Use norm(XXX) instead of XXX.norm
[warn] var norm = vector.toBreeze.norm(p)
~~~
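A hedged sketch of the non-deprecated form that breeze 0.10 expects:

```scala
import breeze.linalg.{norm, DenseVector}

val v = DenseVector(3.0, 4.0)
val p = 2.0
// Before (deprecated): v.norm(p)
val n = norm(v, p)   // 5.0
```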
dbtsai
Author: Xiangrui Meng <meng@databricks.com>
Closes#2718 from mengxr/SPARK-3856 and squashes the following commits:
4f38169 [Xiangrui Meng] use norm operator
Author: Reynold Xin <rxin@apache.org>
Closes#2719 from rxin/sql-join-break and squashes the following commits:
0c0082b [Reynold Xin] Fix line length.
cbc664c [Reynold Xin] Rename join -> joins package.
a070d44 [Reynold Xin] Fix line length in HashJoin
a39be8c [Reynold Xin] [SPARK-3857] Create a join package for various join operators.
Builds all wrappers up front according to the object inspector types, to avoid per-row costs.
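A hedged sketch of the "build wrappers once" idea using stand-in types; the real code dispatches on Hive ObjectInspector instances rather than this toy ADT:

```scala
sealed trait FieldOI
case object IntOI extends FieldOI
case object StringOI extends FieldOI

def wrapperFor(oi: FieldOI): Any => Any = oi match {
  case IntOI    => (v: Any) => Int.box(v.asInstanceOf[Int])
  case StringOI => (v: Any) => v.toString
}

val fieldOIs = Seq(IntOI, StringOI)
val wrappers = fieldOIs.map(wrapperFor)   // pattern matching happens once, up front
val row      = Seq(42, "spark")
val wrapped  = row.zip(wrappers).map { case (v, w) => w(v) }  // per row: plain function calls
```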
Author: Cheng Lian <lian.cs.zju@gmail.com>
Closes#2592 from liancheng/hive-value-wrapper and squashes the following commits:
9696559 [Cheng Lian] Passes all tests
4998666 [Cheng Lian] Prevents per row dynamic dispatching and pattern matching when inserting Hive values
Takes partition keys into account when applying the `PreInsertionCasts` rule.
Author: Cheng Lian <lian.cs.zju@gmail.com>
Closes#2672 from liancheng/fix-pre-insert-casts and squashes the following commits:
def1a1a [Cheng Lian] Makes PreInsertionCasts handle partitions properly
Calling `BinaryArithmetic.dataType` throws an exception until the expression is resolved, but the type coercion rule `Division` doesn't seem to follow this.
Author: Cheng Hao <hao.cheng@intel.com>
Closes#2559 from chenghao-intel/type_coercion and squashes the following commits:
199a85d [Cheng Hao] Simplify the divide rule
dc55218 [Cheng Hao] fix bug of type coercion in div
marmbrus
Update README.md to be consistent with Spark 1.1
Author: Liquan Pei <liquanpei@gmail.com>
Closes#2706 from Ishiihara/SparkSQL-readme and squashes the following commits:
33b9d4b [Liquan Pei] keep README.md up to date
This PR uses JSON instead of `toString` to serialize `DataType`s. The latter is not only hard to parse but also flaky in many cases.
Since we already write schema information to Parquet metadata in the old style, we have to retain the old `DataType` parser and ensure backward compatibility. The old parser is now renamed to `CaseClassStringParser` and moved into `object DataType`.
JoshRosen davies Please help review PySpark related changes, thanks!
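A hedged sketch of the JSON round trip; in recent Spark versions the entry points are `DataType.json` and `DataType.fromJson` under `org.apache.spark.sql.types`, while at the time of this PR the classes lived under `org.apache.spark.sql.catalyst.types`, so treat the import as illustrative:

```scala
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("id", LongType, nullable = false),
  StructField("name", StringType, nullable = true)))

val asJson   = schema.json                 // stable, parseable representation
val restored = DataType.fromJson(asJson)   // instead of parsing schema.toString
assert(restored == schema)
```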
Author: Cheng Lian <lian.cs.zju@gmail.com>
Closes#2563 from liancheng/datatype-to-json and squashes the following commits:
fc92eb3 [Cheng Lian] Reverts debugging code, simplifies primitive type JSON representation
438c75f [Cheng Lian] Refactors PySpark DataType JSON SerDe per comments
6b6387b [Cheng Lian] Removes debugging code
6a3ee3a [Cheng Lian] Addresses per review comments
dc158b5 [Cheng Lian] Addresses PEP8 issues
99ab4ee [Cheng Lian] Adds compatibility est case for Parquet type conversion
a983a6c [Cheng Lian] Adds PySpark support
f608c6e [Cheng Lian] De/serializes DataType objects from/to JSON
If we write a filter that is always FALSE, like
SELECT * FROM person WHERE FALSE;
200 tasks will run. I think 1 task is enough.
Also, the current optimizer cannot optimize the case where NOT is duplicated, like
SELECT * FROM person WHERE NOT (NOT (age > 30));
Filter expressions like the ones above should be simplified.
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes#2692 from sarutak/SPARK-3831 and squashes the following commits:
25f3e20 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-3831
23c750c [Kousuke Saruta] Improved unsupported predicate test case
a11b9f3 [Kousuke Saruta] Modified NOT predicate test case in PartitionBatchPruningSuite
8ea872b [Kousuke Saruta] Fixed the number of tasks when the data of LocalRelation is empty.
dev/scalastyle creates a log file, 'scalastyle.txt'. It is overwritten on each run but never deleted, even though dev/mima and dev/lint-python delete their log files.
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes#2702 from sarutak/scalastyle-txt-cleanup and squashes the following commits:
d6e238e [Kousuke Saruta] Fixed dev/scalastyle to cleanup scalastyle.txt