Once the external shuffle service is also documented, the dynamic allocation section will link to it. Let me know if dynamic allocation should be moved to its own page; I personally think the organization might be cleaner that way.
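For illustration, here is a minimal sketch of turning the feature on (the app name is arbitrary; the two config keys are the standard ones this documentation covers):
```
import org.apache.spark.{SparkConf, SparkContext}

// Dynamic allocation requires the external shuffle service, so that
// executors can be removed without losing their shuffle files.
val conf = new SparkConf()
  .setAppName("dynamic-allocation-demo")
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")
val sc = new SparkContext(conf)
```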
This patch builds on top of oza's work in #3689.
aarondav pwendell
Author: Andrew Or <andrew@databricks.com>
Author: Tsuyoshi Ozawa <ozawa.tsuyoshi@gmail.com>
Closes #3731 from andrewor14/document-dynamic-allocation and squashes the following commits:
1281447 [Andrew Or] Address a few comments
b9843f2 [Andrew Or] Document the configs as well
246fb44 [Andrew Or] Merge branch 'SPARK-4839' of github.com:oza/spark into document-dynamic-allocation
8c64004 [Andrew Or] Add documentation for dynamic allocation (without configs)
6827b56 [Tsuyoshi Ozawa] Fixing a documentation of spark.dynamicAllocation.enabled.
53cff58 [Tsuyoshi Ozawa] Adding a documentation about dynamic resource allocation.
This is my guess at a fix for https://issues.apache.org/jira/browse/SPARK-4831.
Author: Daniel Darabos <darabos.daniel@gmail.com>
Closes #3678 from darabos/patch-1 and squashes the following commits:
36e1243 [Daniel Darabos] Do not include SPARK_CLASSPATH if empty.
Since we can set the executor memory and executor cores from the properties file, we should also be able to set the number of executor instances.
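A minimal sketch of the intent (these keys would normally live in conf/spark-defaults.conf; they are shown via SparkConf here only for brevity):
```
import org.apache.spark.SparkConf

// With this fix, spark.executor.instances from the properties file is
// honored just like the memory and cores settings.
val conf = new SparkConf()
  .set("spark.executor.memory", "4g")
  .set("spark.executor.cores", "2")
  .set("spark.executor.instances", "8")
```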
Author: Kanwaljit Singh <kanwaljit.singh@guavus.com>
Closes #1657 from kjsingh/branch-1.0 and squashes the following commits:
d8a5a12 [Kanwaljit Singh] SPARK-2641: Fixing how spark arguments are loaded from properties file for num executors
Conflicts:
core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala
Added a module equivalent to utils.sh, and modified spark-shell2.cmd to use it for parsing options.
Application options can now be used, e.g.:
`bin\spark-shell.cmd --master spark://master:7077 -i path\to\script.txt`
Author: Masayoshi TSUZUKI <tsudukim@oss.nttdata.co.jp>
Closes #3350 from tsudukim/feature/SPARK-3060 and squashes the following commits:
4551e56 [Masayoshi TSUZUKI] Modified too long line which defines the submission options to pass findstr command.
3a11361 [Masayoshi TSUZUKI] [SPARK-3060] spark-shell.cmd doesn't accept application options in Windows OS
The signature of `registerKryoClasses` is actually `Array[Class[_]]`, not `Seq`.
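A corrected usage sketch (the two registered classes are hypothetical placeholders):
```
import org.apache.spark.SparkConf

class MyClass1
class MyClass2

// registerKryoClasses takes an Array[Class[_]], not a Seq.
val conf = new SparkConf()
  .setAppName("kryo-demo")
  .registerKryoClasses(Array(classOf[MyClass1], classOf[MyClass2]))
```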
Author: Eran Medan <ehrann.mehdan@gmail.com>
Closes #3747 from eranation/patch-1 and squashes the following commits:
ee9885d [Eran Medan] change signature of example to match released code
Currently the event logger uses a directory and several files to
describe an app's event log, all but one of which are empty. This
is not very HDFS-friendly, since creating lots of nodes in HDFS
(especially when they don't contain any data) is frowned upon due
to the node metadata being kept in the NameNode's memory.
Instead, add a header section to the event log file that contains metadata
needed to read the events. This metadata includes things like the Spark
version (for future code that may need it for backwards compatibility) and
the compression codec used for the event data.
With the new approach, aside from reducing the load on the NameNode, far fewer
remote calls are needed when reading the log directory.
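A rough sketch of the single-file layout described above (the header key names and file name are illustrative, not the exact ones the event logger writes):
```
import java.io.PrintWriter

val log = new PrintWriter("app-1418840932872.eventlog")
log.println("SPARK_VERSION=1.3.0")
log.println("COMPRESSION_CODEC=org.apache.spark.io.SnappyCompressionCodec")
log.println("")  // a newline separates the plain-text header from the events
log.println("""{"Event":"SparkListenerApplicationStart","App Name":"demo"}""")
log.close()
```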
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes #1222 from vanzin/hist-server-single-log and squashes the following commits:
cc8f5de [Marcelo Vanzin] Store header in plain text.
c7e6123 [Marcelo Vanzin] Update comment.
59c561c [Marcelo Vanzin] Review feedback.
216c5a3 [Marcelo Vanzin] Review comments.
dce28e9 [Marcelo Vanzin] Fix log overwrite test.
f91c13e [Marcelo Vanzin] Handle "spark.eventLog.overwrite", and add unit test.
346f0b4 [Marcelo Vanzin] Review feedback.
ed0023e [Marcelo Vanzin] Merge branch 'master' into hist-server-single-log
3f4500f [Marcelo Vanzin] Unit test for SPARK-3697.
45c7a1f [Marcelo Vanzin] Version of SPARK-3697 for this branch.
b3ee30b [Marcelo Vanzin] Merge branch 'master' into hist-server-single-log
a6d5c50 [Marcelo Vanzin] Merge branch 'master' into hist-server-single-log
16fd491 [Marcelo Vanzin] Use unique log directory for each codec.
0ef3f70 [Marcelo Vanzin] Merge branch 'master' into hist-server-single-log
d93c44a [Marcelo Vanzin] Add a newline to make the header more readable.
9e928ba [Marcelo Vanzin] Add types.
bd6ba8c [Marcelo Vanzin] Review feedback.
a624a89 [Marcelo Vanzin] Merge branch 'master' into hist-server-single-log
04364dc [Marcelo Vanzin] Merge branch 'master' into hist-server-single-log
bb7c2d3 [Marcelo Vanzin] Fix scalastyle warning.
16661a3 [Marcelo Vanzin] Simplify some internal code.
cc6bce4 [Marcelo Vanzin] Some review feedback.
a722184 [Marcelo Vanzin] Do not encode metadata in log file name.
3700586 [Marcelo Vanzin] Restore log flushing.
f677930 [Marcelo Vanzin] Fix botched rebase.
ae571fa [Marcelo Vanzin] Fix end-to-end event logger test.
9db0efd [Marcelo Vanzin] Show prettier name in UI.
8f42274 [Marcelo Vanzin] Make history server parse old-style log directories.
6251dd7 [Marcelo Vanzin] Make event logger use a single file.
This patch upgrades `spark-ec2`'s Boto version to 2.34.0, since this is blocking several features. Newer versions of Boto don't work properly when they're loaded from a zipfile since they try to read a JSON file from a path relative to the Boto library sources.
Therefore, this patch also changes spark-ec2 to automatically download Boto from PyPI if it's not present in `SPARK_EC2_DIR/lib`, similar to what we do in the `sbt/sbt` script. This shouldn't be an issue for users since they already need an internet connection to launch an EC2 cluster. By performing the download in spark_ec2.py instead of the Bash script, this should also work for Windows users.
I've tested this with Python 2.6, too.
Author: Josh Rosen <joshrosen@databricks.com>
Closes #3737 from JoshRosen/update-boto and squashes the following commits:
0aa43cc [Josh Rosen] Remove unused setup_standalone_cluster() method.
f02935d [Josh Rosen] Enable Python deprecation warnings and fix one Boto warning:
587ae89 [Josh Rosen] [SPARK-4890] Upgrade Boto to 2.34.0; automatically download Boto from PyPi instead of packaging it
Author: Ryan Williams <ryan.blake.williams@gmail.com>
Closes #2848 from ryan-williams/fetch-file and squashes the following commits:
c14daff [Ryan Williams] Fix copy that was changed to a move inadvertently
8e39c16 [Ryan Williams] code review feedback
788ed41 [Ryan Williams] don’t redundantly overwrite executor JAR deps
Author: Ryan Williams <ryan.blake.williams@gmail.com>
Closes #3736 from ryan-williams/hist and squashes the following commits:
421d8ff [Ryan Williams] add another random typo fix
76d6a4c [Ryan Williams] remove hdfs example
a2d0f82 [Ryan Williams] code review feedback
9ca7629 [Ryan Williams] [SPARK-4889] update history server example cmds
This consolidates some code paths and simplifies the constructor arguments of a few classes.
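A simplified before/after of the constructor change (stand-in types, not the real signatures):
```
class SparkEnvStub

// Before: Executor constructed its own environment internally.
class ExecutorBefore(executorId: String, hostname: String) {
  private val env = new SparkEnvStub
}

// After: the caller creates the SparkEnv and passes it in.
class ExecutorAfter(executorId: String, hostname: String, env: SparkEnvStub)
```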
Author: Reynold Xin <rxin@databricks.com>
Closes #3738 from rxin/sparkEnvDepRefactor and squashes the following commits:
82e02cc [Reynold Xin] Fixed couple bugs.
217062a [Reynold Xin] Code review feedback.
bd00af7 [Reynold Xin] Small refactoring to pass SparkEnv into Executor rather than creating SparkEnv in Executor.
HiveInspectors.scala failed to compile with Hadoop 1, as BytesWritable.copyBytes is not available in Hadoop 1.
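A Hadoop-1-compatible replacement, as a sketch (not necessarily the exact hotfix code):
```
import org.apache.hadoop.io.BytesWritable

// Copy exactly getLength bytes out of the (possibly over-allocated)
// backing array, which is what copyBytes does in Hadoop 2.
def copyBytes(bw: BytesWritable): Array[Byte] = {
  val copy = new Array[Byte](bw.getLength)
  System.arraycopy(bw.getBytes, 0, copy, 0, bw.getLength)
  copy
}
```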
Author: Cheng Hao <hao.cheng@intel.com>
Closes #3742 from chenghao-intel/settable_oi_hotfix and squashes the following commits:
bb04d1f [Cheng Hao] hot fix for ByteWritables.copyBytes
Author: Sandy Ryza <sandy@cloudera.com>
Closes #3684 from sryza/sandy-spark-3428 and squashes the following commits:
cb827fe [Sandy Ryza] SPARK-3428. TaskMetrics for running tasks is missing GC time metrics
The current version of `getCallSite` traverses the collection of `StackTraceElement`s twice. This is unnecessary, since we can perform the work in a single pass; we also do not need to keep the filtered `StackTraceElement`s.
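A schematic contrast of the two approaches (illustrative only, not the actual Utils.getCallSite code):
```
def isSparkFrame(el: StackTraceElement): Boolean =
  el.getClassName.startsWith("org.apache.spark.")

// Before: filter builds an intermediate collection, which is then scanned again.
def firstUserFrameOld(trace: Array[StackTraceElement]): Option[StackTraceElement] =
  trace.filter(el => !isSparkFrame(el)).headOption

// After: a single traversal, no intermediate collection kept around.
def firstUserFrameNew(trace: Array[StackTraceElement]): Option[StackTraceElement] =
  trace.find(el => !isSparkFrame(el))
```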
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes #3532 from viirya/refactor_getCallSite and squashes the following commits:
62aa124 [Liang-Chi Hsieh] Fix style.
e741017 [Liang-Chi Hsieh] Refactor getCallSite.
This patch adds the following to MLlib's data generators:
* Exponential, gamma, and log normal generators that wrap Apache Commons Math3, added to the private API
* Functions for generating exponential, gamma, and log normal RDDs and vector RDDs (see the usage sketch below)
* Tests for the above
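A usage sketch of the new functions (assuming the `sc` from spark-shell; the signatures mirror the existing normalRDD/poissonRDD helpers):
```
import org.apache.spark.mllib.random.RandomRDDs._

val expData   = exponentialRDD(sc, 2.0, 1000000L)    // mean = 2.0
val gammaData = gammaRDD(sc, 9.0, 0.5, 1000000L)     // shape = 9.0, scale = 0.5
val logNData  = logNormalRDD(sc, 0.0, 1.0, 1000000L) // mean = 0.0, std = 1.0
```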
Author: RJ Nowling <rnowling@gmail.com>
Closes #3680 from rnowling/spark4728 and squashes the following commits:
455f50a [RJ Nowling] Add tests for exponential, gamma, and log normal samplers to JavaRandomRDDsSuite
3e1134a [RJ Nowling] Fix val/var, unncessary creation of Distribution objects when setting seeds, and import line longer than line wrap limits
58f5b97 [RJ Nowling] Fix bounds in tests so they scale with variance, not stdev
84fd98d [RJ Nowling] Add more values for testing distributions.
9f96232 [RJ Nowling] [SPARK-4728] Add exponential, gamma, and log normal sampling to MLlib data generators
A Hive UDAF may create a customized object constructed by a SettableStructObjectInspector; this is critical when integrating Hive UDAFs with the refactored UDAF interface.
There is a performance issue in `wrap/unwrap` since more match cases were added; that will be handled in another PR.
Author: Cheng Hao <hao.cheng@intel.com>
Closes #3429 from chenghao-intel/settable_oi and squashes the following commits:
9f0aff3 [Cheng Hao] update code style issues as feedbacks
2b0561d [Cheng Hao] Add more scala doc
f5a40e8 [Cheng Hao] add scala doc
2977e9b [Cheng Hao] remove the timezone setting for test suite
3ed284c [Cheng Hao] fix the date type comparison
f1b6749 [Cheng Hao] Update the comment
932940d [Cheng Hao] Add more unit test
72e4332 [Cheng Hao] Add settable StructObjectInspector support
Adding support for partial aggregation of SumDistinct.
Author: ravipesala <ravindra.pesala@huawei.com>
Closes #3348 from ravipesala/SPARK-2554 and squashes the following commits:
fd28e4d [ravipesala] Fixed review comments
e60e67f [ravipesala] Fixed test cases and made it as nullable
32fe234 [ravipesala] Supporting SumDistinct partial aggregation
Conflicts:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregates.scala
The sql "select * from spark_test::for_test where abs(20141202) is not null" has predicates=List(IS NOT NULL HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFAbs(20141202)) and
partitionKeyIds=AttributeSet(). PruningPredicates is List(IS NOT NULL HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFAbs(20141202)). Then the exception "java.lang.IllegalArgumentException: requirement failed: Partition pruning predicates only supported for partitioned tables." is thrown.
The sql "select * from spark_test::for_test_partitioned_table where abs(20141202) is not null and type_id=11 and platform = 3" with partitioned key insert_date has predicates=List(IS NOT NULL HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFAbs(20141202), (type_id#12 = 11), (platform#8 = 3)) and partitionKeyIds=AttributeSet(insert_date#24). PruningPredicates is List(IS NOT NULL HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFAbs(20141202)).
Author: YanTangZhai <hakeemzhai@tencent.com>
Author: yantangzhai <tyz0303@163.com>
Closes #3556 from YanTangZhai/SPARK-4693 and squashes the following commits:
620ebe3 [yantangzhai] [SPARK-4693] [SQL] PruningPredicates may be wrong if predicates contains an empty AttributeSet() references
37cfdf5 [yantangzhai] [SPARK-4693] [SQL] PruningPredicates may be wrong if predicates contains an empty AttributeSet() references
70a3544 [yantangzhai] [SPARK-4693] [SQL] PruningPredicates may be wrong if predicates contains an empty AttributeSet() references
efa9b03 [YanTangZhai] Update HiveQuerySuite.scala
72accf1 [YanTangZhai] Update HiveQuerySuite.scala
e572b9a [YanTangZhai] Update HiveStrategies.scala
6e643f8 [YanTangZhai] Merge pull request #11 from apache/master
e249846 [YanTangZhai] Merge pull request #10 from apache/master
d26d982 [YanTangZhai] Merge pull request #9 from apache/master
76d4027 [YanTangZhai] Merge pull request #8 from apache/master
03b62b0 [YanTangZhai] Merge pull request #7 from apache/master
8a00106 [YanTangZhai] Merge pull request #6 from apache/master
cbcba66 [YanTangZhai] Merge pull request #3 from apache/master
cdef539 [YanTangZhai] Merge pull request #1 from apache/master
**sessionToActivePool** in **SparkSQLOperationManager** grows without bound, even as sessions expire.
We should remove the pool value when a session is closed, even though not every session has an entry in **sessionToActivePool**.
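The shape of the fix, as a simplified sketch (the real map lives in SparkSQLOperationManager and is keyed by the session handle):
```
import scala.collection.mutable

val sessionToActivePool = mutable.Map[String, String]()

def onSessionClosed(sessionHandle: String): Unit = {
  // Evict the entry so the map cannot grow without bound; remove is a
  // no-op for sessions that never set an active pool.
  sessionToActivePool.remove(sessionHandle)
}
```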
Author: guowei2 <guowei2@asiainfo.com>
Closes #3617 from guowei2/SPARK-4756 and squashes the following commits:
e9b97b8 [guowei2] fix compile bug with Shim12
cf0f521 [guowei2] Merge remote-tracking branch 'apache/master' into SPARK-4756
e070998 [guowei2] fix: remove active pool of the session when it expired
parquetFile now accepts Hadoop glob patterns in the path.
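A usage sketch (the paths are hypothetical; assumes the `sqlContext` from spark-shell):
```
// Glob patterns in the path are now expanded instead of being treated
// as a literal file name.
val events = sqlContext.parquetFile("hdfs:///logs/2014/12/*/events-*.parquet")
```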
Author: Thu Kyaw <trk007@gmail.com>
Closes #3407 from tkyaw/master and squashes the following commits:
19115ad [Thu Kyaw] Merge https://github.com/apache/spark
ceded32 [Thu Kyaw] [SPARK-3928][SQL] Support wildcard matches on Parquet files.
d322c28 [Thu Kyaw] [SPARK-3928][SQL] Support wildcard matches on Parquet files.
ce677c6 [Thu Kyaw] [SPARK-3928][SQL] Support wildcard matches on Parquet files.
Add support for `GROUPING SETS`, `ROLLUP`, `CUBE` and the virtual column `GROUPING__ID`.
More details on how to use `GROUPING SETS` can be found at https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation,+Cube,+Grouping+and+Rollup and https://issues.apache.org/jira/secure/attachment/12676811/grouping_set.pdf.
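A usage sketch of the supported syntax (the table and columns are hypothetical; assumes a HiveContext):
```
hiveContext.sql("""
  SELECT dept, role, GROUPING__ID, count(*)
  FROM employees
  GROUP BY dept, role
  GROUPING SETS ((dept, role), (dept), ())
""")
```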
The general idea of the implementation:
1 Replace `ROLLUP` and `CUBE` with `GROUPING SETS`
2 Explode each input row, and then feed the rows to `Aggregate`
* Each grouping set is represented as a bit mask over the `GroupBy Expression List`: for each bit, `1` means the expression is selected, otherwise `0` (the left is the lower bit, the right the higher bit, in the `GroupBy Expression List`)
* Several projections are constructed according to the grouping sets; within each projection (a `Seq[Expression]`), we replace an expression with `Literal(null)` if it's not selected in the grouping set (based on the bit mask)
* The output schema of `Explode` is `child.output :+ grouping__id`
* The GroupBy expressions of `Aggregate` are `GroupBy Expression List :+ grouping__id`
* The aggregation expressions are kept the same for the `Aggregate`
The expression substitutions happen during logical plan analysis, so we benefit from logical plan optimizations (e.g. expression constant folding, map-side aggregation, etc.). Only an `Explosive` operator is added to the physical plan, which explodes the rows according to the pre-set projections.
A known issue, to be addressed in a follow-up PR:
* The `ColumnPruning` optimization is not yet supported for the `Explosive` node.
Author: Cheng Hao <hao.cheng@intel.com>
Closes #1567 from chenghao-intel/grouping_sets and squashes the following commits:
fe65fcc [Cheng Hao] Remove the extra space
3547056 [Cheng Hao] Add more doc and Simplify the Expand
a7c869d [Cheng Hao] update code as feedbacks
d23c672 [Cheng Hao] Add GroupingExpression to replace the Seq[Expression]
414b165 [Cheng Hao] revert the unnecessary changes
ec276c6 [Cheng Hao] Support Rollup/Cube/GroupingSets
This is so that the `ExecutorAllocationManager` does not take in the `SparkContext`, with all of its dependencies, as an argument. It prevents future developers from tying this class down further to the `SparkContext`, which has really become quite a monstrous object.
cc'ing pwendell who originally suggested this, and JoshRosen who may have thoughts about the trait mix-in style of `SparkContext`.
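The shape of the refactoring, abridged (the method names follow the new trait, but this is a sketch rather than the full interface):
```
trait ExecutorAllocationClient {
  def requestExecutors(numAdditionalExecutors: Int): Boolean
  def killExecutors(executorIds: Seq[String]): Boolean
}

// The manager now depends only on this narrow interface, which
// SparkContext implements, instead of on SparkContext itself.
class ExecutorAllocationManager(client: ExecutorAllocationClient)
```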
Author: Andrew Or <andrew@databricks.com>
Closes #3614 from andrewor14/dynamic-allocation-sc and squashes the following commits:
187070d [Andrew Or] Merge branch 'master' of github.com:apache/spark into dynamic-allocation-sc
59baf6c [Andrew Or] Merge branch 'master' of github.com:apache/spark into dynamic-allocation-sc
347a348 [Andrew Or] Refactor SparkContext into ExecutorAllocationClient
Author: Ivan Vergiliev <ivan@leanplum.com>
Closes #3605 from IvanVergiliev/change-serializer and squashes the following commits:
a49b7cf [Ivan Vergiliev] Use serializer instead of closureSerializer in aggregate/foldByKey.
spark.locality.wait was set to 100000 in examples/graphx/Analytics.scala.
This setting should be left to the user.
Author: Ernest <earneyzxl@gmail.com>
Closes #3730 from Earne/SPARK-4880 and squashes the following commits:
d79ed04 [Ernest] remove spark.locality.wait in Analytics
The original test doesn't make sense: if you step in, the lossSum is already NaN
and the coefficients are diverging. That's because the step size is too large for SGD,
so it doesn't work.
The correct behavior is that you should get smaller coefficients than the ones
without regularization. Comparing the values using a relative error of 20000.0 doesn't
make sense either.
Author: DB Tsai <dbtsai@alpinenow.com>
Closes #3735 from dbtsai/mlortestfix and squashes the following commits:
b1a3c42 [DB Tsai] first commit
Hi all - cleaned up the code to get rid of the unused parameter and added some discussion of the ThreadPoolExecutor parameters to explain why we can use a single threadCount instead of providing a min/max.
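A sketch of the rationale (the values are arbitrary): with an unbounded work queue, a ThreadPoolExecutor never grows beyond corePoolSize, so passing the same value for core and max loses nothing.
```
import java.util.concurrent.{LinkedBlockingQueue, ThreadPoolExecutor, TimeUnit}

val threadCount = 8
val pool = new ThreadPoolExecutor(
  threadCount,             // corePoolSize
  threadCount,             // maximumPoolSize; never exceeded with an unbounded queue
  60L, TimeUnit.SECONDS,   // keep-alive for idle threads
  new LinkedBlockingQueue[Runnable]())
```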
Author: Ilya Ganelin <ilya.ganelin@capitalone.com>
Closes #3664 from ilganeli/SPARK-3607C and squashes the following commits:
3c05690 [Ilya Ganelin] Updated documentation and refactored code to extract shared variables
Author: Timothy Chen <tnachen@gmail.com>
Closes #3349 from tnachen/mesos_doc and squashes the following commits:
737ef49 [Timothy Chen] Add TOC
5ca546a [Timothy Chen] Update description around cores requested.
26283a5 [Timothy Chen] Add mesos specific configurations into doc
The spark.yarn.applicationMaster.waitTries config should be changed to a time period.
Author: Sandy Ryza <sandy@cloudera.com>
Closes #3471 from sryza/sandy-spark-3779 and squashes the following commits:
20b9887 [Sandy Ryza] Deprecate old property
42b5df7 [Sandy Ryza] Review feedback
9a959a1 [Sandy Ryza] SPARK-3779. yarn spark.yarn.applicationMaster.waitTries config should be changed to a time period
Currently, there is no way to pass YARN AM specific Java options. This causes potential issues when reading the classpath from a Hadoop configuration file: Hadoop replaces variables in its property values with the system properties passed in via Java options, and how the value should be specified depends on the Hadoop distribution.
The new options are SPARK_YARN_JAVA_OPTS or spark.yarn.extraJavaOptions. I made this Spark global level, because typically we don't want users to specify it on the command line each time they submit a Spark job once it is set up in spark-defaults.conf.
In addition, enabling these extra options to be passed to the AM provides more flexibility.
For example, in the following valid mapred-site.xml file, the classpath specifies values using a system property. Hadoop can handle it correctly because the Java options are passed in; currently Spark will break because hadoop.version is not passed in.
```
<property>
  <name>mapreduce.application.classpath</name>
  <value>/etc/hadoop/${hadoop.version}/mapreduce/*</value>
</property>
```
In the meantime, we cannot rely on mapreduce.admin.map.child.java.opts in mapred-site.xml, because it specifies its own extra Java options, which do not apply to Spark.
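A sketch of the new option in use (the key name is the client-mode-specific one settled on during review, per the commit log below; the value is an example only):
```
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.yarn.am.extraJavaOptions", "-Dhadoop.version=2.4.0")
```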
Author: Zhan Zhang <zhazhan@gmail.com>
Closes #3409 from zhzhan/Spark-4461 and squashes the following commits:
daec3d0 [Zhan Zhang] solve review comments
08f44a7 [Zhan Zhang] add warning in driver mode if spark.yarn.am.extraJavaOptions is configured
5a505d3 [Zhan Zhang] solve review comments
4ed43ad [Zhan Zhang] solve review comments
ad777ed [Zhan Zhang] Merge branch 'master' into Spark-4461
3e9e574 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
e3f9abe [Zhan Zhang] solve review comments
8963552 [Zhan Zhang] rebase
f8f6700 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
dea1692 [Zhan Zhang] change the option key name to client mode specific
90d5dff [Zhan Zhang] rebase
8ac9254 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
092a25f [Zhan Zhang] solve review comments
bc5a9ae [Zhan Zhang] solve review comments
782b014 [Zhan Zhang] add new configuration to docs/running-on-yarn.md and remove it from spark-defaults.conf.template
6faaa97 [Zhan Zhang] solve review comments
369863f [Zhan Zhang] clean up unnecessary var
733de9c [Zhan Zhang] Merge branch 'master' into Spark-4461
a68e7f0 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
864505a [Zhan Zhang] Add extra java options to be passed to Yarn application master
15830fc [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
685d911 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
03ebad3 [Zhan Zhang] Merge branch 'master' of https://github.com/zhzhan/spark
46d9e3d [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
ebb213a [Zhan Zhang] revert
b983ef3 [Zhan Zhang] test
c4efb9b [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
779d67b [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
4daae6d [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
12e1be5 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
ce0ca7b [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
93f3081 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
3764505 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
a9d372b [Zhan Zhang] Merge branch 'master' of https://github.com/zhzhan/spark
a00f60f [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
497b0f4 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
4a2e36d [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
a72c0d4 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
301eb4a [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
cedcc6f [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
adf4924 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
d10bf00 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
7e0cc36 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
68deb11 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
3ee3b2b [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
2b0d513 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
1ccd7cc [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
af9feb9 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
e4c1982 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
921e914 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
789ea21 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
cb53a2c [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
f6a8a40 [Zhan Zhang] revert
ba14f28 [Zhan Zhang] test
Modify the Python annotations for Sphinx. There is no change to the build process described in
https://github.com/apache/spark/blob/master/docs/README.md
Author: lewuathe <lewuathe@me.com>
Closes #3685 from Lewuathe/sphinx-tag-for-pydoc and squashes the following commits:
88a0fd9 [lewuathe] [SPARK-4822] Fix DevelopApi and WARN tags
3d7a398 [lewuathe] [SPARK-4822] Use sphinx tags for Python doc annotations
This commit exists to close the following pull requests on Github:
Closes #3137 (close requested by 'marmbrus')
Closes #3362 (close requested by 'marmbrus')
Closes #2979 (close requested by 'JoshRosen')
Closes #2223 (close requested by 'JoshRosen')
Closes #2998 (close requested by 'marmbrus')
Closes #3202 (close requested by 'marmbrus')
Closes #3079 (close requested by 'marmbrus')
Closes #3210 (close requested by 'marmbrus')
Closes #2764 (close requested by 'marmbrus')
Closes #3618 (close requested by 'marmbrus')
Closes #3501 (close requested by 'marmbrus')
Closes #2768 (close requested by 'marmbrus')
Closes #3381 (close requested by 'marmbrus')
Closes #3510 (close requested by 'marmbrus')
Closes #3703 (close requested by 'marmbrus')
Closes #2543 (close requested by 'marmbrus')
Closes #2876 (close requested by 'marmbrus')
Closes #1281 (close requested by 'JoshRosen')
```
TestSQLContext.sparkContext.parallelize(
"""{"ip":"27.31.100.29","headers":{"Host":"1.abc.com","Charset":"UTF-8"}}""" ::
"""{"ip":"27.31.100.29","headers":{}}""" ::
"""{"ip":"27.31.100.29","headers":""}""" :: Nil)
```
Since the empty string value of "headers" is initially considered a String (in lines 2 and 3), the real nested data type (the struct-typed "headers" in line 1) is ignored, and the "headers" of line 1 is also taken as String type, which is not what we expect.
Author: Cheng Hao <hao.cheng@intel.com>
Closes #3708 from chenghao-intel/json and squashes the following commits:
e7a72e9 [Cheng Hao] add more concise unit test
853de51 [Cheng Hao] NullType instead of StringType when sampling against empty string or null value
Author: Michael Armbrust <michael@databricks.com>
Closes #3727 from marmbrus/parquetNotEq and squashes the following commits:
2157bfc [Michael Armbrust] Fix parquet filter suite
+ small doc edit
+ include edit to make IntelliJ happy
CC: davies mengxr
Note to davies -- this does not fix the "WARNING: Literal block expected; none found." warnings since that seems to involve spacing which IntelliJ does not like. (Those warnings occur when generating the Python docs.)
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #3669 from jkbradley/python-warnings and squashes the following commits:
4587868 [Joseph K. Bradley] fixed warning
8cb073c [Joseph K. Bradley] Updated based on davies recommendation
c51eca4 [Joseph K. Bradley] Updated rst file for pyspark.mllib.rand doc. Small doc edit. Small include edit to make IntelliJ happy.
In local mode, Hadoop/Hive will ignore "mapred.map.tasks", hence a small table file always yields a single input split. However, Spark SQL doesn't honor that in table scanning, so we get different results in the Hive compatibility tests. This PR fixes that.
Author: Cheng Hao <hao.cheng@intel.com>
Closes #2589 from chenghao-intel/source_split and squashes the following commits:
dff38e7 [Cheng Hao] Remove the extra blank line
160a2b6 [Cheng Hao] fix the compiling bug
04d67f7 [Cheng Hao] Keep 1 split for small file in table scanning
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Closes #3616 from adrian-wang/sqrt and squashes the following commits:
d877439 [Daoyuan Wang] fix NULLTYPE
3effa2c [Daoyuan Wang] sqrt(negative value) should return null
Predicates like `a = NULL` and `a < NULL` can't be pushed down since Parquet `Lt`, `LtEq`, `Gt`, `GtEq` doesn't accept null value. Note that `Eq` and `NotEq` can only be used with `null` to represent predicates like `a IS NULL` and `a IS NOT NULL`.
However, normally this issue doesn't cause an NPE, because any value compared to `NULL` results in `NULL`, and Spark SQL automatically optimizes out `NULL` predicates in the `SimplifyFilters` rule. Only testing code that intentionally disables the optimizer may trigger this issue. (That's why this issue is not marked as a blocker, and I do **NOT** think we need to backport this to branch-1.1.)
This PR restricts `Lt`, `LtEq`, `Gt` and `GtEq` to non-null values only, and only uses `Eq` with a null value to push down `IsNull` and `IsNotNull`. Also added support for the Parquet `NotEq` filter, for completeness and a (tiny) performance gain; it's also used to push down `IsNotNull`.
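A self-contained sketch of the stricter rule (simplified stand-ins, not Spark SQL's actual expression or filter classes):
```
sealed trait Pred
case class IsNull(col: String) extends Pred
case class IsNotNull(col: String) extends Pred
case class Lt(col: String, value: Any) extends Pred

def convertible(p: Pred): Boolean = p match {
  case IsNull(_) | IsNotNull(_) => true       // mapped to Parquet Eq/NotEq with null
  case Lt(_, v)                 => v != null  // Lt/LtEq/Gt/GtEq reject null values
}
```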
Author: Cheng Lian <lian@databricks.com>
Closes #3367 from liancheng/filters-with-null and squashes the following commits:
cc41281 [Cheng Lian] Fixes several styling issues
de7de28 [Cheng Lian] Adds stricter rules for Parquet filters with null
Based on #2543.
Author: Michael Armbrust <michael@databricks.com>
Closes #3724 from marmbrus/resolveGetField and squashes the following commits:
0a47aae [Michael Armbrust] Fix case insensitive resolution of GetField.
HiveThriftServer2 cannot exit automatically when the standby resource manager takes over in YARN HA mode.
The scheduler backend was aware that the AM had exited, so it called sc.stop to exit the driver process, but there was a user thread (HiveThriftServer2) which was still alive, causing this problem.
To fix it, add a daemon thread that detects whether the SparkContext is null; if the sc is stopped, call ThriftServer.stop to stop the user thread.
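A sketch of the listener-based variant from the final commit (the class name and callback wiring are illustrative):
```
import org.apache.spark.scheduler.{SparkListener, SparkListenerApplicationEnd}

// Stop the thrift server once the application ends, so the surviving
// user thread no longer keeps the process alive.
class ThriftServerShutdownListener(stopServer: () => Unit) extends SparkListener {
  override def onApplicationEnd(end: SparkListenerApplicationEnd): Unit = {
    stopServer()
  }
}
```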
Author: carlmartin <carlmartinmax@gmail.com>
Closes #3576 from SaintBacchus/ThriftServer2ExitBug and squashes the following commits:
2890b4a [carlmartin] Use SparkListener instead of the demo thread to stop the hive server.
c15da0e [carlmartin] HiveThriftServer2 can not exit automactic when changing the standy resource manager in Yarn HA mode
Add `sort by` support for both the DSL & SqlParser.
This PR is related to #3386; whichever one is merged first will require the other to be rebased.
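A usage sketch of the parser side (the table name is hypothetical; `SORT BY` orders rows within each partition rather than globally):
```
sqlContext.sql("SELECT key, value FROM records SORT BY key")
```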
Author: Cheng Hao <hao.cheng@intel.com>
Closes #3481 from chenghao-intel/sortby and squashes the following commits:
041004f [Cheng Hao] Add sort by for DSL & SimpleSqlParser
The `MetricsServlet` handler should be added to the web UI after being initialized by `MetricsSystem`; otherwise the servlet handler cannot be attached.
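A sketch of the required ordering (assuming handles to the MetricsSystem and the web UI; shown schematically, not as the exact internal call sites):
```
metricsSystem.start()  // initializes MetricsServlet
metricsSystem.getServletHandlers.foreach(webUi.attachHandler)  // only then attach
```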
Author: Saisai Shao <saisai.shao@intel.com>
Author: Josh Rosen <joshrosen@databricks.com>
Author: jerryshao <saisai.shao@intel.com>
Closes #3444 from jerryshao/SPARK-4595 and squashes the following commits:
434d17e [Saisai Shao] Merge pull request #10 from JoshRosen/metrics-system-cleanup
87a2292 [Josh Rosen] Guard against misuse of MetricsSystem methods.
f779fe0 [jerryshao] Fix MetricsServlet not work issue
Author: Josh Rosen <joshrosen@databricks.com>
Closes #3719 from JoshRosen/rat-fix and squashes the following commits:
1542886 [Josh Rosen] [HOTFIX] Fix RAT exclusion for known_translations file
Use lowercase for the `options` keys to make them case-insensitive, and then use lower case to get values from the parameters.
With this, the following command works:
```
create temporary table normal_parquet
USING org.apache.spark.sql.parquet
OPTIONS (
PATH '/xxx/data'
)
```
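A sketch of the `CaseInsensitiveMap` idea from commit 4f86401 below (simplified; the real class would also need to implement the Map interface):
```
class CaseInsensitiveMap(base: Map[String, String]) {
  private val lowered = base.map { case (k, v) => (k.toLowerCase, v) }
  def get(key: String): Option[String] = lowered.get(key.toLowerCase)
}

// new CaseInsensitiveMap(Map("PATH" -> "/xxx/data")).get("path")
//   => Some("/xxx/data")
```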
Author: scwf <wangfei1@huawei.com>
Author: wangfei <wangfei1@huawei.com>
Closes #3470 from scwf/ddl-ulcase and squashes the following commits:
ae78509 [scwf] address comments
8f4f585 [wangfei] address comments
3c132ef [scwf] minor fix
a0fc20b [scwf] Merge branch 'master' of https://github.com/apache/spark into ddl-ulcase
4f86401 [scwf] adding CaseInsensitiveMap
e244e8d [wangfei] using lower case in json
e0cb017 [wangfei] make options in-casesensitive
This PR brings support for using StructType (and other hashable types) as a key in MapType.
Author: Davies Liu <davies@databricks.com>
Closes #3714 from davies/fix_struct_in_map and squashes the following commits:
68585d7 [Davies Liu] fix primitive types in MapType
9601534 [Davies Liu] support StructType as key in MapType
Author: Cheng Hao <hao.cheng@intel.com>
Closes #3595 from chenghao-intel/udf0 and squashes the following commits:
a858973 [Cheng Hao] Add 0 arguments support for udf