This avoids issues when HDFS is configured in a way that would not
allow the hardcoded default replication of "3".
Note: getDefaultReplication(Path) was added in 0.23.3, and the oldest
one available on Maven Central is 0.23.7, so I chose to not add code
to access that method via reflection.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#2831 from vanzin/SPARK-3979 and squashes the following commits:
b0e3a97 [Marcelo Vanzin] [SPARK-3979] [yarn] Use fs's default replication.
There is a unused variable(count) in saveAsHadoopDataset in PairRDDFunctions.scala. The initial idea of this variable seems to count the number of records, so I am adding a log statement to log the number of records that has been written to the writer.
Author: likun <jacky.likun@huawei.com>
Author: jackylk <jacky.likun@huawei.com>
Closes#2791 from jackylk/SPARK-3935 and squashes the following commits:
a874047 [jackylk] removing the unused variable in PairRddFunctions.scala
3bf43c7 [likun] log the number of records has been written
Its hard to debug which broadcast variables refer to what in a big codebase. Printing call site information helps in debugging.
Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Closes#2829 from shivaram/spark-broadcast-print and squashes the following commits:
cd6dbdf [Shivaram Venkataraman] Print call site information for broadcasts
JobProgressPage could not show Fair Scheduler Pools section sometimes.
SparkContext starts webui and then postEnvironmentUpdate. Sometimes JobProgressPage is accessed between webui starting and postEnvironmentUpdate, then the lazy val isFairScheduler will be false. The Fair Scheduler Pools section will not display any more.
Author: yantangzhai <tyz0303@163.com>
Author: YanTangZhai <hakeemzhai@tencent.com>
Closes#1966 from YanTangZhai/SPARK-3067 and squashes the following commits:
d4323f8 [yantangzhai] update [SPARK-3067] JobProgressPage could not show Fair Scheduler Pools section sometimes
8a00106 [YanTangZhai] Merge pull request #6 from apache/master
b6391cc [yantangzhai] revert [SPARK-3067] JobProgressPage could not show Fair Scheduler Pools section sometimes
d2226cd [yantangzhai] [SPARK-3067] JobProgressPage could not show Fair Scheduler Pools section sometimes
cbcba66 [YanTangZhai] Merge pull request #3 from apache/master
aac7f7b [yantangzhai] [SPARK-3067] JobProgressPage could not show Fair Scheduler Pools section sometimes
cdef539 [YanTangZhai] Merge pull request #1 from apache/master
Sorry. I found that I forgot to add `afterExecute` for `handleConnectExecutor` in #2593.
Author: zsxwing <zsxwing@gmail.com>
Closes#2794 from zsxwing/SPARK-3741 and squashes the following commits:
a0bc4dd [zsxwing] Add afterExecute for handleConnectExecutor
Introduced in f7e79bc42c, I'm not sure why we need two spark.executor.memory here.
Author: WangTaoTheTonic <barneystinson@aliyun.com>
Author: WangTao <barneystinson@aliyun.com>
Closes#2745 from WangTaoTheTonic/redundantconfig and squashes the following commits:
e7564dc [WangTao] too long line
fdbdb1f [WangTaoTheTonic] trivial workaround
d06b6e5 [WangTaoTheTonic] remove redundant spark.executor.memory in doc
In BlockManagermasterActor, _remainingMem would increase memSize for twice when updateBlockInfo if new storageLevel is invalid and old storageLevel is "useMemory". Also, _remainingMem should increase with original memory size instead of new memSize.
Author: Zhang, Liye <liye.zhang@intel.com>
Closes#2792 from liyezhang556520/spark-3941-remainMem and squashes the following commits:
3d487cc [Zhang, Liye] make the code concise
0380a32 [Zhang, Liye] [SPARK-3941][CORE] _remainingmem should not increase twice when updateBlockInfo
Something about the 2.3.4 upgrade seems to have made the issue manifest where all the services disconnect from each other after exactly 1000 seconds (which is the heartbeat interval). [This post](https://groups.google.com/forum/#!topic/akka-user/X3xzpTCbEFs) suggests that heartbeat pause should be greater than heartbeat interval, and increasing the pause from 600s to 6000s seems to have rectified the issue. My current cluster has now exceeded 1400s of uptime without failure!
I do not know why this fixed it, because the threshold we have set for the failure detector is the exponent of a timeout, and 300 is extremely large. Perhaps the default failure detector changed in 2.3.4 and now ignores threshold.
Author: Aaron Davidson <aaron@databricks.com>
Closes#2784 from aarondav/fix-timeout and squashes the following commits:
bd1151a [Aaron Davidson] Increase pause, don't decrease interval
9cb0372 [Aaron Davidson] [SPARK-3923] Decrease Akka heartbeat interval below heartbeat pause
This is a small number of clean-up changes on top of #2782. Closes#2782.
Author: Prashant Sharma <prashant.s@imaginea.com>
Author: Patrick Wendell <pwendell@gmail.com>
Closes#2803 from pwendell/pr-2782 and squashes the following commits:
56d5b7a [Patrick Wendell] Minor clean-up
44089ec [Patrick Wendell] Clean-up the TaskContext API.
ed551ce [Prashant Sharma] Fixed a typo
df261d0 [Prashant Sharma] Josh's suggestion
facf3b1 [Prashant Sharma] Fixed the mima issue.
7ecc2fe [Prashant Sharma] CR, Moved implementations to TaskContextImpl
bbd9e05 [Prashant Sharma] adding missed out files to git.
ef633f5 [Prashant Sharma] SPARK-3874, Provide stable TaskContext API
The removed `Future` was used to end the test case as soon as the Spark SQL CLI process exits. When the process exits prematurely, this mechanism prevents the test case to wait until timeout. But it also creates a race condition: when `foundAllExpectedAnswers.tryFailure` is called, there are chances that the last expected output line of the CLI process hasn't been caught by the main logics of the test code, thus fails the test case.
Removing this `Future` doesn't affect correctness.
Author: Cheng Lian <lian.cs.zju@gmail.com>
Closes#2823 from liancheng/clean-clisuite and squashes the following commits:
489a97c [Cheng Lian] Fixes the race condition that may cause test failure
Customized pickler should be registered before unpickling, but in executor, there is no way to register the picklers before run the tasks.
So, we need to register the picklers in the tasks itself, duplicate the javaToPython() and pythonToJava() in MLlib, call SerDe.initialize() before pickling or unpickling.
Author: Davies Liu <davies.liu@gmail.com>
Closes#2830 from davies/fix_pickle and squashes the following commits:
0c85fb9 [Davies Liu] revert the privacy change
6b94e15 [Davies Liu] use JavaConverters instead of JavaConversions
0f02050 [Davies Liu] hotfix: Customized pickler does not work in cluster
Author: Shiti <ssaxena.ece@gmail.com>
Closes#2810 from Shiti/master and squashes the following commits:
051d82f [Shiti] setting the default value of uri scheme to "file" where matching "file" or None yields the same result
Author: prudhvi <prudhvi953@gmail.com>
Closes#2799 from prudhvije/ScalaStyle/space-after-comment-start and squashes the following commits:
fc263a1 [prudhvi] [Core] Using scalastyle to check the space after comment start
This is another implementation about #1256
cc andrewor14 vanzin
Author: GuoQiang Li <witgo@qq.com>
Closes#2379 from witgo/SPARK-2098-new and squashes the following commits:
4ef1cbd [GuoQiang Li] review commit
49ef70e [GuoQiang Li] Refactor getDefaultPropertiesFile
c45d20c [GuoQiang Li] All Spark processes should support spark-defaults.conf, config file
HT to Diana, just proposing an implementation of her suggestion, which I rather agreed with. Is there a second/third for the motion?
Refer to "self-contained" rather than "standalone" apps to avoid confusion with standalone deployment mode. And fix placement of reference to this in MLlib docs.
Author: Sean Owen <sowen@cloudera.com>
Closes#2787 from srowen/SPARK-1307 and squashes the following commits:
b5b82e2 [Sean Owen] Refer to "self-contained" rather than "standalone" apps to avoid confusion with standalone deployment mode. And fix placement of reference to this in MLlib docs.
Modified not to pollute environment variables.
Just moved the main logic into `XXX2.cmd` from `XXX.cmd`, and call `XXX2.cmd` with cmd command in `XXX.cmd`.
`pyspark.cmd` and `spark-class.cmd` are already using the same way, but `spark-shell.cmd`, `spark-submit.cmd` and `/python/docs/make.bat` are not.
Author: Masayoshi TSUZUKI <tsudukim@oss.nttdata.co.jp>
Closes#2797 from tsudukim/feature/SPARK-3943 and squashes the following commits:
b397a7d [Masayoshi TSUZUKI] [SPARK-3943] Some scripts bin\*.cmd pollutes environment variables in Windows
When _JAVA_OPTIONS environment variable is set, a command "java -version" outputs a message like "Picked up _JAVA_OPTIONS: -Dfile.encoding=UTF-8".
./bin/spark-class knows java version from the first line of "java -version" output, so it mistakes java version with _JAVA_OPTIONS set.
Author: cocoatomo <cocoatomo77@gmail.com>
Closes#2725 from cocoatomo/issues/3869-mistake-java-version and squashes the following commits:
f894ebd [cocoatomo] [SPARK-3869] ./bin/spark-class miss Java version with _JAVA_OPTIONS set
Avoid overflow in computing n*(n+1)/2 as much as possible; throw explicit error when Gramian computation will fail due to negative array size; warn about large result when computing Gramian too
Author: Sean Owen <sowen@cloudera.com>
Closes#2801 from srowen/SPARK-3803 and squashes the following commits:
b4e6d92 [Sean Owen] Avoid overflow in computing n*(n+1)/2 as much as possible; throw explicit error when Gramian computation will fail due to negative array size; warn about large result when computing Gramian too
Author: shitis <ssaxena.ece@gmail.com>
Closes#2795 from Shiti/master and squashes the following commits:
46897d7 [shitis] Using Option Wrapper to convert String with value null to None
Modified to ignore not the docs/ directory, but only the docs/_build/ which is the output directory of sphinx build.
Author: Masayoshi TSUZUKI <tsudukim@oss.nttdata.co.jp>
Closes#2796 from tsudukim/feature/SPARK-3946 and squashes the following commits:
2bea6a9 [Masayoshi TSUZUKI] [SPARK-3946] gitignore in /python includes wrong directory
Validate the memory is greater than zero when set from the SPARK_WORKER_MEMORY environment variable or command line without a g or m label. Added unit tests. If memory is 0 an IllegalStateException is thrown. Updated unit tests to mock environment variables by subclassing SparkConf (tip provided by Josh Rosen). Updated WorkerArguments to use SparkConf.getenv instead of System.getenv for reading the SPARK_WORKER_MEMORY environment variable.
Author: Bill Bejeck <bbejeck@gmail.com>
Closes#2309 from bbejeck/spark-memory-worker and squashes the following commits:
51cf915 [Bill Bejeck] SPARK-3178 - Validate the memory is greater than zero when set from the SPARK_WORKER_MEMORY environment variable or command line without a g or m label. Added unit tests. If memory is 0 an IllegalStateException is thrown. Updated unit tests to mock environment variables by subclassing SparkConf (tip provided by Josh Rosen). Updated WorkerArguments to use SparkConf.getenv instead of System.getenv for reading the SPARK_WORKER_MEMORY environment variable.
The goal of this patch is to fix the swapped arguments in standalone mode, which was caused by 79e45c9323 (diff-79391110e9f26657e415aa169a004998R153).
More details can be found in the JIRA: [SPARK-3921](https://issues.apache.org/jira/browse/SPARK-3921)
Tested in Standalone mode, but not in Mesos.
Author: Aaron Davidson <aaron@databricks.com>
Closes#2779 from aarondav/fix-standalone and squashes the following commits:
725227a [Aaron Davidson] Fix ExecutorRunnerTest
9d703fe [Aaron Davidson] [SPARK-3921] Fix CoarseGrainedExecutorBackend's arguments for Standalone mode
@harishreedharan @pwendell
See JIRA for diagnosis of the problem
https://issues.apache.org/jira/browse/SPARK-3912
The solution was to reimplement it.
1. Find a free port (by binding and releasing a server-scoket), and then use that port
2. Remove thread.sleep()s, instead repeatedly try to create a sender and send data and check whether data was sent. Use eventually() to minimize waiting time.
3. Check whether all the data was received, without caring about batches.
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#2773 from tdas/flume-test-fix and squashes the following commits:
93cd7f6 [Tathagata Das] Reimplimented FlumeStreamSuite to be more robust.
As scwf pointed out, `HiveThriftServer2Suite` isn't effective anymore after the Thrift server was made a daemon. On the other hand, these test suites were known flaky, PR #2214 tried to fix them but failed because of unknown Jenkins build error. This PR fixes both sets of issues.
In this PR, instead of watching `start-thriftserver.sh` output, the test code start a `tail` process to watch the log file. A `Thread.sleep` has to be introduced because the `kill` command used in `stop-thriftserver.sh` is not synchronous.
As for the root cause of the mysterious Jenkins build failure. Please refer to [this comment](https://github.com/apache/spark/pull/2675#issuecomment-58464189) below for details.
----
(Copied from PR description of #2214)
This PR fixes two issues of `HiveThriftServer2Suite` and brings 1 enhancement:
1. Although metastore, warehouse directories and listening port are randomly chosen, all test cases share the same configuration. Due to parallel test execution, one of the two test case is doomed to fail
2. We caught any exceptions thrown from a test case and print diagnosis information, but forgot to re-throw the exception...
3. When the forked server process ends prematurely (e.g., fails to start), the `serverRunning` promise is completed with a failure, preventing the test code to keep waiting until timeout.
So, embarrassingly, this test suite was failing continuously for several days but no one had ever noticed it... Fortunately no bugs in the production code were covered under the hood.
Author: Cheng Lian <lian.cs.zju@gmail.com>
Author: wangfei <wangfei1@huawei.com>
Closes#2675 from liancheng/fix-thriftserver-tests and squashes the following commits:
1c384b7 [Cheng Lian] Minor code cleanup, restore the logging level hack in TestHive.scala
7805c33 [wangfei] reset SPARK_TESTING to avoid loading Log4J configurations in testing class paths
af2b5a9 [Cheng Lian] Removes log level hacks from TestHiveContext
d116405 [wangfei] make sure that log4j level is INFO
ee92a82 [Cheng Lian] Relaxes timeout
7fd6757 [Cheng Lian] Fixes test suites in hive-thriftserver
name should throw exception with name instead of exprId.
Author: Liquan Pei <liquanpei@gmail.com>
Closes#2758 from Ishiihara/SparkSQL-bug and squashes the following commits:
aa36a3b [Liquan Pei] small bug
SparkSql crashes on selecting tables using custom serde.
Example:
----------------
CREATE EXTERNAL TABLE table_name PARTITIONED BY ( a int) ROW FORMAT 'SERDE "org.apache.hadoop.hive.serde2.thrift.ThriftDeserializer" with serdeproperties("serialization.format"="org.apache.thrift.protocol.TBinaryProtocol","serialization.class"="ser_class") STORED AS SEQUENCEFILE;
The following exception is seen on running a query like 'select * from table_name limit 1':
ERROR CliDriver: org.apache.hadoop.hive.serde2.SerDeException: java.lang.NullPointerException
at org.apache.hadoop.hive.serde2.thrift.ThriftDeserializer.initialize(ThriftDeserializer.java:68)
at org.apache.hadoop.hive.ql.plan.TableDesc.getDeserializer(TableDesc.java:80)
at org.apache.spark.sql.hive.execution.HiveTableScan.addColumnMetadataToConf(HiveTableScan.scala:86)
at org.apache.spark.sql.hive.execution.HiveTableScan.<init>(HiveTableScan.scala:100)
at org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:188)
at org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:188)
at org.apache.spark.sql.SQLContext$SparkPlanner.pruneFilterProject(SQLContext.scala:364)
at org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$.apply(HiveStrategies.scala:184)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
at org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
at org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:280)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:402)
at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:400)
at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:406)
at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:406)
at org.apache.spark.sql.hive.HiveContext$QueryExecution.stringResult(HiveContext.scala:406)
at org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:59)
at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:291)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:413)
at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:226)
at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.NullPointerException
Author: chirag <chirag.aggarwal@guavus.com>
Closes#2674 from chiragaggarwal/branch-1.1 and squashes the following commits:
370c31b [chirag] SPARK-3807: Add a test case to validate the fix.
1f26805 [chirag] SPARK-3807: SparkSql does not work for tables created using custom serde (Incorporated Review Comments)
ba4bc0c [chirag] SPARK-3807: SparkSql does not work for tables created using custom serde
5c73b72 [chirag] SPARK-3807: SparkSql does not work for tables created using custom serde
(cherry picked from commit 925e22d313)
Signed-off-by: Michael Armbrust <michael@databricks.com>
Adds some functions that were very useful when trying to track down the bug from #2656. This change also changes the tree output for query plans to include the `'` prefix to unresolved nodes and `!` prefix to nodes that refer to non-existent attributes.
Author: Michael Armbrust <michael@databricks.com>
Closes#2657 from marmbrus/debugging and squashes the following commits:
654b926 [Michael Armbrust] Clean-up, add tests
763af15 [Michael Armbrust] Add typeChecking debugging functions
8c69303 [Michael Armbrust] Add inputSet, references to QueryPlan. Improve tree string with a prefix to denote invalid or unresolved nodes.
fbeab54 [Michael Armbrust] Better toString, factories for AttributeSet.
Author: Venkata Ramana G <ramana.gollamudihuawei.com>
Author: Venkata Ramana Gollamudi <ramana.gollamudi@huawei.com>
Closes#2713 from gvramana/remove_unnecessary_columns and squashes the following commits:
b7ba768 [Venkata Ramana Gollamudi] Added comment and checkstyle fix
6a93459 [Venkata Ramana Gollamudi] cloned hiveconf for each TableScanOperators so that only required columns are added
Original problem is [SPARK-3764](https://issues.apache.org/jira/browse/SPARK-3764).
`AppendingParquetOutputFormat` uses a binary-incompatible method `context.getTaskAttemptID`.
This causes binary-incompatible of Spark itself, i.e. if Spark itself is built against hadoop-1, the artifact is for only hadoop-1, and vice versa.
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes#2638 from ueshin/issues/SPARK-3771 and squashes the following commits:
efd3784 [Takuya UESHIN] Add a comment to explain the reason to use reflection.
ec213c1 [Takuya UESHIN] Use reflection to prevent breaking binary-compatibility.
There are lots of temporal files created by TestHive under the /tmp by default, which may cause potential performance issue for testing. This PR will automatically delete them after test exit.
Author: Cheng Hao <hao.cheng@intel.com>
Closes#2393 from chenghao-intel/delete_temp_on_exit and squashes the following commits:
3a6511f [Cheng Hao] Remove the temp dir after text exit
This PR adds a new rule `CheckAggregation` to the analyzer to provide better error message for non-aggregate attributes with aggregation.
Author: Cheng Lian <lian.cs.zju@gmail.com>
Closes#2774 from liancheng/non-aggregate-attr and squashes the following commits:
5246004 [Cheng Lian] Passes test suites
bf1878d [Cheng Lian] Adds checks for non-aggregate attributes with aggregation
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Closes#2344 from adrian-wang/date and squashes the following commits:
f15074a [Daoyuan Wang] remove outdated lines
2038085 [Daoyuan Wang] update return type
00fe81f [Daoyuan Wang] address lian cheng's comments
0df6ea1 [Daoyuan Wang] rebase and remove simple string
bb1b1ef [Daoyuan Wang] remove failing test
aa96735 [Daoyuan Wang] not cast for same type compare
30bf48b [Daoyuan Wang] resolve rebase conflict
617d1a8 [Daoyuan Wang] add date_udf case to white list
c37e848 [Daoyuan Wang] comment update
5429212 [Daoyuan Wang] change to long
f8f219f [Daoyuan Wang] revise according to Cheng Hao
0e0a4f5 [Daoyuan Wang] minor format
4ddcb92 [Daoyuan Wang] add java api for date
0e3110e [Daoyuan Wang] try to fix timezone issue
17fda35 [Daoyuan Wang] set test list
2dfbb5b [Daoyuan Wang] support date type
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Closes#2747 from adrian-wang/typename and squashes the following commits:
2824216 [Daoyuan Wang] remove redundant typeName
fbaf340 [Daoyuan Wang] typename
In the comment (Line 1083), it says: "Otherwise, interpolate the number of partitions we need to try, but overestimate it by 50%."
`(1.5 * num * partsScanned / buf.size).toInt` is the guess of "num of total partitions needed". In every iteration, we should consider the increment `(1.5 * num * partsScanned / buf.size).toInt - partsScanned`
Existing implementation 'exponentially' grows `partsScanned ` ( roughly: `x_{n+1} >= (1.5 + 1) x_n`)
This could be a performance problem. (unless this is the intended behavior)
Author: yingjieMiao <yingjie@42go.com>
Closes#2648 from yingjieMiao/rdd_take and squashes the following commits:
d758218 [yingjieMiao] scala style fix
a8e74bb [yingjieMiao] python style fix
4b6e777 [yingjieMiao] infix operator style fix
4391d3b [yingjieMiao] typo fix.
692f4e6 [yingjieMiao] cap numPartsToTry
c4483dc [yingjieMiao] style fix
1d2c410 [yingjieMiao] also change in rdd.py and AsyncRDD
d31ff7e [yingjieMiao] handle the edge case after 1 iteration
a2aa36b [yingjieMiao] RDD take method: overestimate too much
Author: Reynold Xin <rxin@apache.org>
Closes#2727 from rxin/SPARK-3861-broadcast-hash-2 and squashes the following commits:
9c7b1a2 [Reynold Xin] Revert "Reuse CompactBuffer in UniqueKeyHashedRelation."
97626a1 [Reynold Xin] Reuse CompactBuffer in UniqueKeyHashedRelation.
7fcffb5 [Reynold Xin] Make UniqueKeyHashedRelation private[joins].
18eb214 [Reynold Xin] Merge branch 'SPARK-3861-broadcast-hash' into SPARK-3861-broadcast-hash-1
4b9d0c9 [Reynold Xin] UniqueKeyHashedRelation.get should return null if the value is null.
e0ebdd1 [Reynold Xin] Added a test case.
90b58c0 [Reynold Xin] [SPARK-3861] Avoid rebuilding hash tables on each partition
0c0082b [Reynold Xin] Fix line length.
cbc664c [Reynold Xin] Rename join -> joins package.
a070d44 [Reynold Xin] Fix line length in HashJoin
a39be8c [Reynold Xin] [SPARK-3857] Create a join package for various join operators.
During trainning Gradient Boosting Decision Tree on large-scale sparse data, spark spill hundreds of data onto disk. And find the bug below:
In version 1.1.0 DecisionTree.scala, train Method, treeInput has been persisted in Memory, but without unpersist. It caused heavy DISK usage.
In github version(1.2.0 maybe), RandomForest.scala, train Method, baggedInput has been persisted but without unpersisted too.
After added unpersist, it works right.
https://issues.apache.org/jira/browse/SPARK-3918
Author: omgteam <Kimlong.Liu@gmail.com>
Closes#2775 from omgteam/master and squashes the following commits:
815d543 [omgteam] adjust tab to spaces
1a36f83 [omgteam] Bug: fix without unpersist baggedInput in RandomForest.scala
There are three [Custom Receiver Guide] links in streaming doc, the first is wrong.
Author: w00228970 <wangfei1@huawei.com>
Author: wangfei <wangfei1@huawei.com>
Closes#2749 from scwf/streaming-doc and squashes the following commits:
0cd76b7 [wangfei] update link tojump to the Akka-specific section
45b0646 [w00228970] wrong link in streaming doc
Author: Ken Takagiwa <ugw.gi.world@gmail.com>
Closes#2778 from giwa/patch-2 and squashes the following commits:
a59f9a1 [Ken Takagiwa] Add echo "Run streaming tests ..."
Author: GuoQiang Li <witgo@qq.com>
Closes#2763 from witgo/SPARK-3905 and squashes the following commits:
17d7990 [GuoQiang Li] The keys for sorting the columns of Executor page ,Stage page Storage page are incorrect
val path = ... //path to seq file with BytesWritable as type of both key and value
val file = sc.sequenceFile[Array[Byte],Array[Byte]](path)
file.take(1)(0)._1
This prints incorrect content of byte array. Actual content starts with correct one and some "random" bytes and zeros are appended. BytesWritable has two methods:
getBytes() - return content of all internal array which is often longer then actual value stored. It usually contains the rest of previous longer values
copyBytes() - return just begining of internal array determined by internal length property
It looks like in implicit conversion between BytesWritable and Array[byte] getBytes is used instead of correct copyBytes.
dbtsai
Author: Jakub Dubovský <james64@inMail.sk>
Author: Dubovsky Jakub <dubovsky@avast.com>
Closes#2712 from james64/3121-bugfix and squashes the following commits:
f85d24c [Jakub Dubovský] Test name changed, comments added
1b20d51 [Jakub Dubovský] Import placed correctly
406e26c [Jakub Dubovský] Scala style fixed
f92ffa6 [Dubovsky Jakub] performance tuning
480f9cd [Dubovsky Jakub] Bug 3121 fixed
This was reported in https://issues.apache.org/jira/browse/SPARK-3445. There are API differences between the 0.23.* and the 2.0.*-alpha branches that are not accounted for when this code was introduced.
Author: Andrew Or <andrewor14@gmail.com>
Closes#2776 from andrewor14/fix-yarn-alpha and squashes the following commits:
ec94752 [Andrew Or] Fix compilation error for 2.0.*-alpha
Previously, when the val partitionStrategy was created it called a function in the Analytics object which was a copy of the PartitionStrategy.fromString() method. This function has been removed, and the assignment of partitionStrategy now uses the PartitionStrategy.fromString method instead. In this way, it better matches the declarations of edge/vertex StorageLevel variables.
Author: NamelessAnalyst <NamelessAnalyst@users.noreply.github.com>
Closes#2569 from NamelessAnalyst/branch-1.1 and squashes the following commits:
c24ff51 [NamelessAnalyst] Update Analytics.scala
(cherry picked from commit 5a21e3e7e9)
Signed-off-by: Ankur Dave <ankurdave@gmail.com>
When reporting that a remote error occurred, the ConnectionManager should also log the stacktrace of the remote exception. This PR accomplishes this by sending the remote exception's stacktrace as the payload in the "negative ACK / error message."
Author: Josh Rosen <joshrosen@apache.org>
Closes#2741 from JoshRosen/propagate-cm-exceptions-to-sender and squashes the following commits:
b5366cc [Josh Rosen] Explicitly encode error messages using UTF-8.
cef18b3 [Josh Rosen] [SPARK-3887] Send stracktrace in ConnectionManager error messages.
Sphinx documents contains a corrupted ReST format and have some warnings.
The purpose of this issue is same as https://issues.apache.org/jira/browse/SPARK-3773.
commit: 0e8203f4fb
output
```
$ cd ./python/docs
$ make clean html
rm -rf _build/*
sphinx-build -b html -d _build/doctrees . _build/html
Making output directory...
Running Sphinx v1.2.3
loading pickled environment... not yet created
building [html]: targets for 4 source files that are out of date
updating environment: 4 added, 0 changed, 0 removed
reading sources... [100%] pyspark.sql
/Users/<user>/MyRepos/Scala/spark/python/pyspark/mllib/feature.py:docstring of pyspark.mllib.feature.Word2VecModel.findSynonyms:4: WARNING: Field list ends without a blank line; unexpected unindent.
/Users/<user>/MyRepos/Scala/spark/python/pyspark/mllib/feature.py:docstring of pyspark.mllib.feature.Word2VecModel.transform:3: WARNING: Field list ends without a blank line; unexpected unindent.
/Users/<user>/MyRepos/Scala/spark/python/pyspark/sql.py:docstring of pyspark.sql:4: WARNING: Bullet list ends without a blank line; unexpected unindent.
looking for now-outdated files... none found
pickling environment... done
checking consistency... done
preparing documents... done
writing output... [100%] pyspark.sql
writing additional files... (12 module code pages) _modules/index search
copying static files... WARNING: html_static_path entry u'/Users/<user>/MyRepos/Scala/spark/python/docs/_static' does not exist
done
copying extra files... done
dumping search index... done
dumping object inventory... done
build succeeded, 4 warnings.
Build finished. The HTML pages are in _build/html.
```
Author: cocoatomo <cocoatomo77@gmail.com>
Closes#2766 from cocoatomo/issues/3909-sphinx-build-warnings and squashes the following commits:
2c7faa8 [cocoatomo] [SPARK-3909][PySpark][Doc] A corrupted format in Sphinx documents and building warnings
./python/run-tests search a Python 2.6 executable on PATH and use it if available.
When using Python 2.6, it is going to import unittest2 module which is not a standard library in Python 2.6, so it fails with ImportError.
Author: cocoatomo <cocoatomo77@gmail.com>
Closes#2759 from cocoatomo/issues/3867-unittest2-import-error and squashes the following commits:
f068eb5 [cocoatomo] [SPARK-3867] ./python/run-tests failed when it run with Python 2.6 and unittest2 is not installed
...riden alternatives, can have default argument.
Author: Prashant Sharma <prashant.s@imaginea.com>
Closes#2750 from ScrapCodes/SPARK-2924/default-args-removed and squashes the following commits:
d9785c3 [Prashant Sharma] [SPARK-2924] Required by scala 2.11, only one function/ctor amongst overriden alternatives, can have default argument.
We had to upgrade our Hive 0.12 version as well to deal with a protobuf
conflict (both hive and akka have been using a shaded protobuf version).
This is testing a correctly patched version of Hive 0.12.
Author: Patrick Wendell <pwendell@gmail.com>
Closes#2756 from pwendell/hotfix and squashes the following commits:
cc979d0 [Patrick Wendell] HOTFIX: Fix build issue with Akka 2.3.4 upgrade.
Use AutoBatchedSerializer by default, which will choose the proper batch size based on size of serialized objects, let the size of serialized batch fall in into [64k - 640k].
In JVM, the serializer will also track the objects in batch to figure out duplicated objects, larger batch may cause OOM in JVM.
Author: Davies Liu <davies.liu@gmail.com>
Closes#2740 from davies/batchsize and squashes the following commits:
52cdb88 [Davies Liu] update docs
185f2b9 [Davies Liu] use AutoBatchedSerializer by default