This is a follow-up of PR #6493, which has been reverted in branch-1.4 because it uses Java 7 specific APIs and breaks Java 6 build. This PR replaces those APIs with equivalent Guava ones to ensure Java 6 friendliness.
cc andrewor14 pwendell, this should also be back ported to branch-1.4.
Author: Cheng Lian <lian@databricks.com>
Closes#6547 from liancheng/override-log4j and squashes the following commits:
c900cfd [Cheng Lian] Addresses Shixiong's comment
72da795 [Cheng Lian] Uses Guava API to ensure Java 6 friendliness
(cherry picked from commit 5cd6a63d96)
Signed-off-by: Andrew Or <andrew@databricks.com>
The `HiveThriftServer2Test` relies on proper logging behavior to assert whether the Thrift server daemon process is started successfully. However, some other jar files listed in the classpath may potentially contain an unexpected Log4J configuration file which overrides the logging behavior.
This PR writes a temporary `log4j.properties` and prepend it to driver classpath before starting the testing Thrift server process to ensure proper logging behavior.
cc andrewor14 yhuai
Author: Cheng Lian <lian@databricks.com>
Closes#6493 from liancheng/override-log4j and squashes the following commits:
c489e0e [Cheng Lian] Fixes minor Scala styling issue
b46ef0d [Cheng Lian] Uses a temporary log4j.properties in HiveThriftServer2Test to ensure expected logging behavior
The temporary column should be dropped after we get the prediction column. harsha2010
Author: Xiangrui Meng <meng@databricks.com>
Closes#6592 from mengxr/SPARK-8049 and squashes the following commits:
1d89107 [Xiangrui Meng] use SparkFunSuite
6ee70de [Xiangrui Meng] drop tmp col from OneVsRest output
(cherry picked from commit 89f21f66b5)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
Thanks ogirardot, closes#6580
cc rxin JoshRosen
Author: Davies Liu <davies@databricks.com>
Closes#6590 from davies/when and squashes the following commits:
c0f2069 [Davies Liu] fix Column.when() and otherwise()
(cherry picked from commit 605ddbb27c)
Signed-off-by: Reynold Xin <rxin@databricks.com>
The current code references the schema of the DataFrame to be written before checking save mode. This triggers expensive metadata discovery prematurely. For save mode other than `Append`, this metadata discovery is useless since we either ignore the result (for `Ignore` and `ErrorIfExists`) or delete existing files (for `Overwrite`) later.
This PR fixes this issue by deferring metadata discovery after save mode checking.
Author: Cheng Lian <lian@databricks.com>
Closes#6583 from liancheng/spark-8014 and squashes the following commits:
1aafabd [Cheng Lian] Updates comments
088abaa [Cheng Lian] Avoids schema merging and partition discovery when data schema and partition schema are defined
8fbd93f [Cheng Lian] Fixes SPARK-8014
(cherry picked from commit 686a45f0b9)
Signed-off-by: Yin Huai <yhuai@databricks.com>
Updating ML Doc's *"Estimator, Transformer, and Param"* example to use `model.extractParamMap` instead of `model.fittingParamMap`, which no longer exists.
mengxr, I believe this addresses (part of) the *update documentation* TODO list item from [PR 5820](https://github.com/apache/spark/pull/5820).
Author: Mike Dusenberry <dusenberrymw@gmail.com>
Closes#6514 from dusenberrymw/Fix_ML_Doc_Estimator_Transformer_Param_Example and squashes the following commits:
6366e1f [Mike Dusenberry] Updating instances of model.extractParamMap to model.parent.extractParamMap, since the Params of the parent Estimator could possibly differ from thos of the Model.
d850e0e [Mike Dusenberry] Removing all references to "fittingParamMap" throughout Spark, since it has been removed.
0480304 [Mike Dusenberry] Updating the ML Doc "Estimator, Transformer, and Param" Java example to use model.extractParamMap() instead of model.fittingParamMap(), which no longer exists.
7d34939 [Mike Dusenberry] Updating ML Doc "Estimator, Transformer, and Param" example to use model.extractParamMap instead of model.fittingParamMap, which no longer exists.
(cherry picked from commit ad06727fe9)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
PySpark SQL's `readerwriter` and `window` doctests weren't being run by our test runner script; this patch re-enables them.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#6542 from JoshRosen/enable-more-pyspark-sql-tests and squashes the following commits:
9f46ce4 [Josh Rosen] Enable PySpark SQL readerwriter and window tests.
The minimal change would be to disable shading of Guava in the module,
and rely on the transitive dependency from other libraries instead. But
since Guava's use is so localized, I think it's better to just not use
it instead, so I replaced that code and removed all traces of Guava from
the module's build.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#6555 from vanzin/SPARK-8015 and squashes the following commits:
c0ceea8 [Marcelo Vanzin] Add comments about dependency management.
c38228d [Marcelo Vanzin] Add guava dep in test scope.
b7a0349 [Marcelo Vanzin] Add libthrift exclusion.
6e0942d [Marcelo Vanzin] Add comment in pom.
2d79260 [Marcelo Vanzin] [SPARK-8015] [flume] Remove Guava dependency from flume-sink.
(cherry picked from commit 0071bd8d31)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
Author: Cheng Lian <lian@databricks.com>
Closes#6581 from liancheng/spark-8037 and squashes the following commits:
d08e97b [Cheng Lian] Ignores files whose name starts with dot in HadoopFsRelation
(cherry picked from commit 1bb5d716c0)
Signed-off-by: Cheng Lian <lian@databricks.com>
87941ff8c4 accidentally removed the EvaluatedType.
Author: Yin Huai <yhuai@databricks.com>
Closes#6589 from yhuai/getBackEvaluatedType and squashes the following commits:
618c2eb [Yin Huai] Add EvaluatedType back.
The new test uses CV to compare `maxIter=0` and `maxIter=1`, and validate on the evaluation result. jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Closes#6572 from mengxr/SPARK-7432 and squashes the following commits:
c236bb8 [Xiangrui Meng] fix flacky cv doctest
(cherry picked from commit bd97840d5c)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
add schema()/format()/options() for reader, add mode()/format()/options()/partitionBy() for writer
cc rxin yhuai pwendell
Author: Davies Liu <davies@databricks.com>
Closes#6578 from davies/readwrite and squashes the following commits:
720d293 [Davies Liu] address comments
b65dfa2 [Davies Liu] Update readwriter.py
1299ab6 [Davies Liu] make Python API consistent with Scala
(cherry picked from commit 445647a1a3)
Signed-off-by: Patrick Wendell <patrick@databricks.com>
This closes#6570.
Author: Yin Huai <yhuai@databricks.com>
Author: Reynold Xin <rxin@databricks.com>
Closes#6573 from rxin/deterministic and squashes the following commits:
356cd22 [Reynold Xin] Added unit test for the optimizer.
da3fde1 [Reynold Xin] Merge pull request #6570 from yhuai/SPARK-8023
da56200 [Yin Huai] Comments.
e38f264 [Yin Huai] Comment.
f9d6a73 [Yin Huai] Add a deterministic method to Expression.
(cherry picked from commit 0f80990bfa)
Signed-off-by: Reynold Xin <rxin@databricks.com>
Conflicts:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/random.scala
https://issues.apache.org/jira/browse/SPARK-8020
Author: Yin Huai <yhuai@databricks.com>
Closes#6571 from yhuai/SPARK-8020-1 and squashes the following commits:
0398f5b [Yin Huai] First populate the SQLConf and then construct executionHive and metadataHive.
(cherry picked from commit 7b7f7b6c6f)
Signed-off-by: Yin Huai <yhuai@databricks.com>
cc yhuai
Author: Davies Liu <davies@databricks.com>
Closes#6558 from davies/decimalType and squashes the following commits:
c877ca8 [Davies Liu] Update ParquetConverter.scala
48cc57c [Davies Liu] Update ParquetConverter.scala
b43845c [Davies Liu] add test
3b4a94f [Davies Liu] DecimalType is not read back when non-native type exists
(cherry picked from commit bcb47ad771)
Signed-off-by: Reynold Xin <rxin@databricks.com>
This PR adds a Java unit test and user guide for `StringIndexer`. I put it before `OneHotEncoder` because they are closely related. jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Closes#6561 from mengxr/SPARK-7582 and squashes the following commits:
4bba4f1 [Xiangrui Meng] fix example
ba1cd1b [Xiangrui Meng] fix style
7fa18d1 [Xiangrui Meng] add user guide for StringIndexer
136cb93 [Xiangrui Meng] add a Java unit test for StringIndexer
(cherry picked from commit 0221c7f0ef)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
Scala `deprecated` annotation actually doesn't show up in JavaDoc.
Author: zsxwing <zsxwing@gmail.com>
Closes#6564 from zsxwing/SPARK-8025 and squashes the following commits:
2faa2bb [zsxwing] Add JavaDoc style deprecation for deprecated Streaming methods
(cherry picked from commit 7f74bb3bc6)
Signed-off-by: Reynold Xin <rxin@databricks.com>
Author: Reynold Xin <rxin@databricks.com>
Closes#6569 from rxin/freqItemsWarning and squashes the following commits:
7eec145 [Reynold Xin] [minor doc] Add exploratory data analysis warning for DataFrame.stat.freqItem API.
(cherry picked from commit 4c868b9943)
Signed-off-by: Reynold Xin <rxin@databricks.com>
Also use that profile in create-release.sh
cc pwendell -- Note that this means that we need `knitr` and `roxygen` installed on the machines used for building the release. Let me know if you need help with that.
Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Closes#6567 from shivaram/SPARK-8027 and squashes the following commits:
8dc8ecf [Shivaram Venkataraman] Add maven profile to build R package docs Also use that profile in create-release.sh
(cherry picked from commit cae9306c4f)
Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Author: Reynold Xin <rxin@databricks.com>
Closes#6565 from rxin/alias and squashes the following commits:
286d880 [Reynold Xin] [SPARK-8026][SQL] Add Column.alias to Scala/Java DataFrame API
(cherry picked from commit 89f642a0e8)
Signed-off-by: Reynold Xin <rxin@databricks.com>
Author: Reynold Xin <rxin@databricks.com>
Closes#6566 from rxin/crosstab and squashes the following commits:
e0ace1c [Reynold Xin] [SPARK-7982][SQL] DataFrame.stat.crosstab should use 0 instead of null for pairs that don't appear
(cherry picked from commit 6396cc0303)
Signed-off-by: Reynold Xin <rxin@databricks.com>
This prevents the spark.jars from being cleared while using `--packages` or `--jars`
cc pwendell davies brkyvz
Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Closes#6568 from shivaram/SPARK-8028 and squashes the following commits:
3a9cf1f [Shivaram Venkataraman] Use addJar instead of setJars in SparkR This prevents the spark.jars from being cleared
(cherry picked from commit 6b44278ef7)
Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
StreamingContext.start() can throw exception because DStream.validateAtStart() fails (say, checkpoint directory not set for StateDStream). But by then JobScheduler, JobGenerator, and ReceiverTracker has already started, along with their actors. But those cannot be shutdown because the only way to do that is call StreamingContext.stop() which cannot be called as the context has not been marked as ACTIVE.
The solution in this PR is to stop the internal scheduler if start throw exception, and mark the context as STOPPED.
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#6559 from tdas/SPARK-7958 and squashes the following commits:
20b2ec1 [Tathagata Das] Added synchronized
790b617 [Tathagata Das] Handled exception in StreamingContext.start()
(cherry picked from commit 2f9c7519d6)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
This PR makes the types module in `pyspark/sql/types` work with pylint static analysis by removing the dynamic naming of the `pyspark/sql/_types` module to `pyspark/sql/types`.
Tests are now loaded using `$PYSPARK_DRIVER_PYTHON -m module` rather than `$PYSPARK_DRIVER_PYTHON module.py`. The old method adds the location of `module.py` to `sys.path`, so this change prevents accidental use of relative paths in Python.
Author: Michael Nazario <mnazario@palantir.com>
Closes#6439 from mnazario/feature/SPARK-7899 and squashes the following commits:
366ef30 [Michael Nazario] Remove hack on random.py
bb8b04d [Michael Nazario] Make doctests consistent with other tests
6ee4f75 [Michael Nazario] Change test scripts to use "-m"
673528f [Michael Nazario] Move _types back to types
This PR adds a section in the user guide for `VectorAssembler` with code examples in Python/Java/Scala. It also adds a unit test in Java.
jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Closes#6556 from mengxr/SPARK-7584 and squashes the following commits:
11313f6 [Xiangrui Meng] simplify Java example
0cd47f3 [Xiangrui Meng] update user guide
fd36292 [Xiangrui Meng] update Java unit test
ce61ca0 [Xiangrui Meng] add Java unit test for VectorAssembler
e399942 [Xiangrui Meng] scala/python example code
(cherry picked from commit 90c606925e)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
Increase the duration and timeout in streaming python tests.
Author: Davies Liu <davies@databricks.com>
Closes#6239 from davies/flaky_tests and squashes the following commits:
d6aee8f [Davies Liu] fix window tests
26317f7 [Davies Liu] Merge branch 'master' of github.com:apache/spark into flaky_tests
7947db6 [Davies Liu] fix streaming flaky tests
(cherry picked from commit b7ab0299b0)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
pwendell tdas
Author: Nishkam Ravi <nravi@cloudera.com>
Author: nishkamravi2 <nishkamravi@gmail.com>
Author: nravi <nravi@c1704.halxg.cloudera.com>
Closes#6544 from nishkamravi2/master_nravi and squashes the following commits:
46e8c03 [Nishkam Ravi] Slight modification to streaming docs
(cherry picked from commit e7c7e51f2e)
Signed-off-by: Sean Owen <sowen@cloudera.com>
Author: Davies Liu <davies@databricks.com>
Closes#6532 from davies/decimal and squashes the following commits:
c7fcbce [Davies Liu] Update tests.py
1425359 [Davies Liu] DecimalType should not be singleton
(cherry picked from commit 91777a1c3a)
Signed-off-by: Reynold Xin <rxin@databricks.com>
Author: Sun Rui <rui.sun@intel.com>
Closes#6183 from sun-rui/SPARK-7227 and squashes the following commits:
dd6f5b3 [Sun Rui] Rename readEnv() back to readMap(). Add alias na.omit() for dropna().
41cf725 [Sun Rui] [SPARK-7227][SPARKR] Support fillna / dropna in R DataFrame.
(cherry picked from commit 46576ab303)
Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Author: Reynold Xin <rxin@databricks.com>
Closes#6541 from rxin/trailing-whitespace-on and squashes the following commits:
f72ebe4 [Reynold Xin] [SPARK-3850] Turn style checker on for trailing whitespaces.
(cherry picked from commit 866652c903)
Signed-off-by: Reynold Xin <rxin@databricks.com>
add save load for examples:
KMeansModel
PowerIterationClusteringModel
Word2VecModel
IsotonicRegressionModel
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes#6498 from hhbyyh/docSaveLoad and squashes the following commits:
7f9f06d [Yuhao Yang] add missing imports
c604cad [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into docSaveLoad
1dd77cc [Yuhao Yang] update document with some missing save/load
(cherry picked from commit 0674700303)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
Author: Reynold Xin <rxin@databricks.com>
Closes#6534 from rxin/whitespace-mllib and squashes the following commits:
38926e3 [Reynold Xin] [SPARK-3850] Trim trailing spaces for MLlib.
(cherry picked from commit e1067d0ad1)
Signed-off-by: Reynold Xin <rxin@databricks.com>
Add license for dagre-d3 and graphlib-dot
Author: zsxwing <zsxwing@gmail.com>
Closes#6539 from zsxwing/LICENSE and squashes the following commits:
82b0475 [zsxwing] Add license for dagre-d3 and graphlib-dot
(cherry picked from commit d1d2def2f5)
Signed-off-by: Andrew Or <andrew@databricks.com>
Author: Reynold Xin <rxin@databricks.com>
Closes#6535 from rxin/whitespace-sql and squashes the following commits:
de50316 [Reynold Xin] [SPARK-3850] Trim trailing spaces for SQL.
(cherry picked from commit 63a50be13d)
Signed-off-by: Reynold Xin <rxin@databricks.com>
Conflicts:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/HiveTypeCoercion.scala
sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala
sql/catalyst/src/test/scala/org/apache/spark/sql/types/DataTypeSuite.scala
sql/core/src/test/scala/org/apache/spark/sql/DataFrameStatSuite.scala
Author: Reynold Xin <rxin@databricks.com>
Closes#6533 from rxin/whitespace-2 and squashes the following commits:
038314c [Reynold Xin] [SPARK-3850] Trim trailing spaces for core.
(cherry picked from commit 74fdc97c72)
Signed-off-by: Reynold Xin <rxin@databricks.com>
Conflicts:
core/src/main/scala/org/apache/spark/storage/TachyonBlockManager.scala
core/src/test/scala/org/apache/spark/serializer/KryoSerializerSuite.scala
Author: Reynold Xin <rxin@databricks.com>
This patch had conflicts when merged, resolved by
Committer: Reynold Xin <rxin@databricks.com>
Closes#6527 from rxin/covariant-equals and squashes the following commits:
e7d7784 [Reynold Xin] [SPARK-7975] Enforce CovariantEqualsChecker
(cherry picked from commit 7896e99b2a)
Signed-off-by: Reynold Xin <rxin@databricks.com>
Author: Cheng Lian <lian@databricks.com>
Closes#6529 from liancheng/schemardd-deprecation-fix and squashes the following commits:
49765c2 [Cheng Lian] Adds @deprecated Scaladoc entry for SchemaRDD
(cherry picked from commit 8764dccebd)
Signed-off-by: Reynold Xin <rxin@databricks.com>