Author: Cheng Lian <lian@databricks.com>
Closes #6749 from liancheng/java-sample-fix and squashes the following commits:
5b44585 [Cheng Lian] Fixes a minor Java example error in SQL programming guide
(cherry picked from commit 8f7308f9c4)
Signed-off-by: Reynold Xin <rxin@databricks.com>
This provides preliminary documentation pointing out how to use the
Hadoop-free builds. I am hoping that over time this list can grow to
include most of the popular Hadoop distributions.
Getting more people using these builds will help us reduce the number
of binaries we build in the long term.
Author: Patrick Wendell <patrick@databricks.com>
Closes #6729 from pwendell/hadoop-provided and squashes the following commits:
1113b76 [Patrick Wendell] [SPARK-6511] [Documentation] Explain how to use Hadoop provided builds
(cherry picked from commit 6e4fb0c9e8)
Signed-off-by: Patrick Wendell <patrick@databricks.com>
Just as a safeguard against DOM rewriting.
Author: Andrew Or <andrew@databricks.com>
Closes #6732 from andrewor14/dag-viz-trim and squashes the following commits:
7e9bacb [Andrew Or] [MINOR] [UI] DAG visualization: trim whitespace from input
(cherry picked from commit 0d5892dc72)
Signed-off-by: Andrew Or <andrew@databricks.com>
Author: hqzizania <qian.huang@intel.com>
Closes #6190 from hqzizania/R and squashes the following commits:
1641f9e [hqzizania] fixes and add test units
bb3411a [hqzizania] Convert NAs to null type in SparkR DataFrames
(cherry picked from commit a5c52c1a34)
Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
This was caused by this commit: f271347
This patch does not attempt to fix the root cause of why the `VisibleForTesting` annotation causes an NPE in the shell. We should find a way to fix that separately.
Author: Andrew Or <andrew@databricks.com>
Closes #6711 from andrewor14/fix-spark-shell and squashes the following commits:
bf62ecc [Andrew Or] Prevent NPE in spark-shell
Even with all the efforts to cleanup the temp directories created by
unit tests, Spark leaves a lot of garbage in /tmp after a test run.
This change overrides java.io.tmpdir to place those files under the
build directory instead.
After an sbt full unit test run, I was left with > 400 MB of temp
files. Since they're now under the build dir, it's much easier to
clean them up.
Also make a slight change to a unit test to make it not pollute the
source directory with test data.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes #6674 from vanzin/SPARK-8126 and squashes the following commits:
0f8ad41 [Marcelo Vanzin] Make sure tmp dir exists when tests run.
643e916 [Marcelo Vanzin] [MINOR] [BUILD] Use custom temp directory during build.
As I have tested, if we cancel or kill the app, the final status may be undefined, killed, or succeeded, so clean up the staging directory when the ApplicationMaster exits with any final application status.
Author: linweizhong <linweizhong@huawei.com>
Closes #6409 from Sephiroth-Lin/SPARK-7705 and squashes the following commits:
3a5a0a5 [linweizhong] Update
83dc274 [linweizhong] Update
923d44d [linweizhong] Update
0dd7c2d [linweizhong] Update
b76a102 [linweizhong] Update code style
7846b69 [linweizhong] Update
bd6cf0d [linweizhong] Refactor
aed9f18 [linweizhong] Clean up stagingDir when launch app on yarn
95595c3 [linweizhong] Cleanup of .sparkStaging directory when AppMaster exit at any final application status
(cherry picked from commit eacd4a929b)
Signed-off-by: Sean Owen <sowen@cloudera.com>
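The cleanup described above can be sketched as follows (an illustrative stand-in, not Spark's YARN code): the removal no longer depends on the final application status, so undefined, killed, and succeeded runs are all cleaned up the same way.

```python
import os
import shutil
import tempfile

def cleanup_staging_dir(staging_dir, final_status):
    # Remove the staging dir regardless of final_status
    # (UNDEFINED, KILLED, and SUCCEEDED are all handled).
    try:
        shutil.rmtree(staging_dir)
    except OSError as e:
        print(f"Failed to clean up staging dir {staging_dir}: {e}")

staging = tempfile.mkdtemp(prefix=".sparkStaging-")
cleanup_staging_dir(staging, "KILLED")
print(os.path.exists(staging))  # False
```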
this is a follow up of #3621
/cc liancheng pwendell
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Closes #6639 from adrian-wang/kryodoc and squashes the following commits:
3c4b1cf [Daoyuan Wang] [DOC] kryo default setting in SQL Thrift server
(cherry picked from commit 10fc2f6f51)
Signed-off-by: Reynold Xin <rxin@databricks.com>
This is a follow-up patch to #6577 to replace columnEnclosing with quoteIdentifier.
I also did some minor cleanup to the JdbcDialect file.
Author: Reynold Xin <rxin@databricks.com>
Closes #6689 from rxin/jdbc-quote and squashes the following commits:
bad365f [Reynold Xin] Fixed test compilation...
e39e14e [Reynold Xin] Fixed compilation.
db9a8e0 [Reynold Xin] [SPARK-8004][SQL] Quote identifier in JDBC data source.
(cherry picked from commit d6d601a07b)
Signed-off-by: Reynold Xin <rxin@databricks.com>
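A hedged sketch of what identifier quoting for a JDBC dialect typically looks like (the real `JdbcDialect.quoteIdentifier` may differ in details): wrap the column name in double quotes and double any embedded quote characters so reserved words and odd names survive the round trip to SQL.

```python
def quote_identifier(col_name: str) -> str:
    # Wrap in double quotes; escape embedded quotes by doubling them.
    return '"' + col_name.replace('"', '""') + '"'

print(quote_identifier("order"))   # "order"
print(quote_identifier('we"ird'))  # "we""ird"
```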
Author: Reynold Xin <rxin@databricks.com>
Closes #6688 from rxin/df-alias-replace and squashes the following commits:
774c19c [Reynold Xin] [SPARK-8146] DataFrame Python API: Alias replace in DataFrameNaFunctions.
(cherry picked from commit 0ac47083f7)
Signed-off-by: Reynold Xin <rxin@databricks.com>
JIRA: https://issues.apache.org/jira/browse/SPARK-8141
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes #6687 from viirya/reuse_partition_column_types and squashes the following commits:
dab0688 [Liang-Chi Hsieh] Reuse partitionColumnTypes.
(cherry picked from commit 26d07f1ece)
Signed-off-by: Cheng Lian <lian@databricks.com>
…moved if dynamic allocation is enabled.
This is a work in progress. This patch ensures that an executor that has cached RDD blocks is not removed,
but makes no attempt to find another executor to remove instead. This is meant to get some feedback on the current
approach; if it makes sense, I will then look at choosing another executor to remove. No testing has been done yet either.
Author: Hari Shreedharan <hshreedharan@apache.org>
Closes #6508 from harishreedharan/dymanic-caching and squashes the following commits:
dddf1eb [Hari Shreedharan] Minor configuration description update.
10130e2 [Hari Shreedharan] Fix compile issue.
5417b53 [Hari Shreedharan] Add documentation for new config. Remove block from cachedBlocks when it is dropped.
875916a [Hari Shreedharan] Make some code more readable.
39940ca [Hari Shreedharan] Handle the case where the executor has not yet registered.
90ad711 [Hari Shreedharan] Remove unused imports and unused methods.
063985c [Hari Shreedharan] Send correct message instead of recursively calling same method.
ec2fd7e [Hari Shreedharan] Add file missed in last commit
5d10fad [Hari Shreedharan] Update cached blocks status using local info, rather than doing an RPC.
193af4c [Hari Shreedharan] WIP. Use local state rather than via RPC.
ae932ff [Hari Shreedharan] Fix config param name.
272969d [Hari Shreedharan] Fix seconds to millis bug.
5a1993f [Hari Shreedharan] Add timeout for cache executors. Ignore broadcast blocks while checking if there are cached blocks.
57fefc2 [Hari Shreedharan] [SPARK-7955][Core] Ensure executors with cached RDD blocks are not removed if dynamic allocation is enabled.
(cherry picked from commit 3285a51121)
Signed-off-by: Andrew Or <andrew@databricks.com>
As described in SPARK-8079, when writing a DataFrame to a `HadoopFsRelation`, if `HadoopFsRelation.prepareForWriteJob` throws an exception, an unexpected NPE will be thrown during job abortion. (This issue doesn't cause much damage, since the job is failing anyway.)
This PR makes the job/task abortion logic in `InsertIntoHadoopFsRelation` more robust to avoid such confusing exceptions.
Author: Cheng Lian <lian@databricks.com>
Closes #6612 from liancheng/spark-8079 and squashes the following commits:
87cd81e [Cheng Lian] Addresses @rxin's comment
1864c75 [Cheng Lian] Addresses review comments
9e6dbb3 [Cheng Lian] Makes InsertIntoHadoopFsRelation job/task abortion more robust
(cherry picked from commit 16fc49617e)
Signed-off-by: Cheng Lian <lian@databricks.com>
This is a minor change.
Author: amey <amey@skytree.net>
Closes #6655 from ameyc/JIRA-7991/support-passing-list-to-describe and squashes the following commits:
e8a1dff [amey] Adding support for passing lists to describe.
(cherry picked from commit 356a4a9b93)
Signed-off-by: Reynold Xin <rxin@databricks.com>
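The argument handling this change adds can be sketched like so (illustrative only, not PySpark's actual implementation): accept either a varargs call or a single list, and normalize both to the same column list.

```python
def describe(*cols):
    # Accept either describe("age", "name") or describe(["age", "name"]).
    if len(cols) == 1 and isinstance(cols[0], list):
        cols = cols[0]
    return list(cols)

print(describe(["age", "name"]) == describe("age", "name"))  # True
```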
Add documentation for spark.sql.planner.externalSort
Author: Luca Martinetti <luca@luca.io>
Closes #6272 from lucamartinetti/docs-externalsort and squashes the following commits:
985661b [Luca Martinetti] [SPARK-7747] [SQL] [DOCS] Add documentation for spark.sql.planner.externalSort
(cherry picked from commit 4060526cd3)
Signed-off-by: Yin Huai <yhuai@databricks.com>
Author: zsxwing <zsxwing@gmail.com>
Closes #6659 from zsxwing/SPARK-8112 and squashes the following commits:
a5d7da6 [zsxwing] Address comments
d255b6e [zsxwing] Fix the negative event count issue
(cherry picked from commit 4f16d3fe2e)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
cc davies sun-rui
Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Closes #6620 from shivaram/sparkr-read-schema and squashes the following commits:
16a6726 [Shivaram Venkataraman] Fix loadDF to pass schema Also add a unit test
a229877 [Shivaram Venkataraman] Use wrapper function to DataFrameReader
ee70ba8 [Shivaram Venkataraman] Support user-specified schema in read.df
(cherry picked from commit 12f5eaeee1)
Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Fixed the broken links (Examples) in the documentation.
Author: Akhil Das <akhld@darktech.ca>
Closes #6666 from akhld/patch-2 and squashes the following commits:
2228b83 [Akhil Das] Update streaming-kafka-integration.md
(cherry picked from commit 019dc9f558)
Signed-off-by: Sean Owen <sowen@cloudera.com>
Even with all the efforts to cleanup the temp directories created by
unit tests, Spark leaves a lot of garbage in /tmp after a test run.
This change overrides java.io.tmpdir to place those files under the
build directory instead.
After an sbt full unit test run, I was left with > 400 MB of temp
files. Since they're now under the build dir, it's much easier to
clean them up.
Also make a slight change to a unit test to make it not pollute the
source directory with test data.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes #6653 from vanzin/unit-test-tmp and squashes the following commits:
31e2dd5 [Marcelo Vanzin] Fix tests that depend on each other.
aa92944 [Marcelo Vanzin] [minor] [build] Use custom temp directory during build.
(cherry picked from commit b16b5434ff)
Signed-off-by: Sean Owen <sowen@cloudera.com>
Completely trivial but I noticed this wrinkle in a log message today; `$sender` doesn't refer to anything and isn't interpolated here.
Author: Sean Owen <sowen@cloudera.com>
Closes #6650 from srowen/Interpolation and squashes the following commits:
518687a [Sean Owen] Actually interpolate log string
7edb866 [Sean Owen] Trivial: remove unused interpolation var in log message
(cherry picked from commit 3a5c4da473)
Signed-off-by: Reynold Xin <rxin@databricks.com>
Author: Ted Blackman <ted.blackman@gmail.com>
Closes #6656 from belisarius222/branch-1.4 and squashes the following commits:
747cbc2 [Ted Blackman] [SPARK-8116][PYSPARK] Allow sc.range() to take a single argument.
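The single-argument behavior mirrors Python's built-in `range()`, and the normalization can be sketched as follows (names are illustrative, not PySpark internals): one argument is treated as the end, with start defaulting to 0 and step to 1.

```python
def normalize_range_args(start, end=None, step=1):
    # sc.range(5) behaves like sc.range(0, 5, 1), as with built-in range().
    if end is None:
        start, end = 0, start
    return start, end, step

print(normalize_range_args(5))        # (0, 5, 1)
print(normalize_range_args(2, 8, 2))  # (2, 8, 2)
```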
The log page should only show the desired number of bytes. Currently it shows bytes from the startIndex to the end of the file, and the "Next" button on the page is always disabled.
Author: Carson Wang <carson.wang@intel.com>
Closes #6640 from carsonwang/logpage and squashes the following commits:
58cb3fd [Carson Wang] Show correct length of bytes on log page
(cherry picked from commit 63bc0c4430)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
This also helps us get rid of the sparkr-docs maven profile, as docs are now built just by using -Psparkr when the roxygen2 package is available
Related to discussion in #6567
cc pwendell srowen -- Let me know if this looks better
Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Closes #6593 from shivaram/sparkr-pom-cleanup and squashes the following commits:
b282241 [Shivaram Venkataraman] Remove sparkr-docs from release script as well
8f100a5 [Shivaram Venkataraman] Move man pages creation to install-dev.sh This also helps us get rid of the sparkr-docs maven profile as docs are now built by just using -Psparkr when the roxygen2 package is available
(cherry picked from commit 3dc005282a)
Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Added a `DataFrame.drop` function that accepts a `Column` reference rather than a `String`, and added associated unit tests. Basically, it iterates through the `DataFrame`'s columns to find one whose expression is equivalent to that of the `Column` argument supplied to the function.
Author: Mike Dusenberry <dusenberrymw@gmail.com>
Closes #6585 from dusenberrymw/SPARK-7969_Drop_method_on_Dataframes_should_handle_Column and squashes the following commits:
514727a [Mike Dusenberry] Updating the @since tag of the drop(Column) function doc to reflect version 1.4.1 instead of 1.4.0.
2f1bb4e [Mike Dusenberry] Adding an additional assert statement to the 'drop column after join' unit test in order to make sure the correct column was indeed left over.
6bf7c0e [Mike Dusenberry] Minor code formatting change.
e583888 [Mike Dusenberry] Adding more Python doctests for the df.drop with column reference function to test joined datasets that have columns with the same name.
5f74401 [Mike Dusenberry] Updating DataFrame.drop with column reference function to use logicalPlan.output to prevent ambiguities resulting from columns with the same name. Also added associated unit tests for joined datasets with duplicate column names.
4b8bbe8 [Mike Dusenberry] Adding Python support for Dataframe.drop with a Column reference.
986129c [Mike Dusenberry] Added a DataFrame.drop function that accepts a Column reference rather than a String, and added associated unit tests. Basically iterates through the DataFrame to find a column with an expression that is equivalent to one supplied to the function.
(cherry picked from commit df7da07a86)
Signed-off-by: Reynold Xin <rxin@databricks.com>
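The matching logic described above can be sketched in plain Python (dicts stand in for Catalyst attributes; this is not Spark's actual code): walk the plan's output and keep every attribute that is not the one being dropped, comparing by identity (expression id) rather than by name, so duplicate column names from a join stay unambiguous.

```python
def drop_column(output_attrs, to_drop):
    # Keep every attribute except the exact one being dropped;
    # identity comparison avoids ambiguity between same-named columns.
    return [a for a in output_attrs if a is not to_drop]

a, b1, b2 = {"name": "a"}, {"name": "b"}, {"name": "b"}  # duplicate names
print([c["name"] for c in drop_column([a, b1, b2], b1)])  # ['a', 'b']
```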
If maxTaskFailures is 1, the task set is aborted after 1 task failure. Other documentation and the code support this reading; I think it's just this comment that was off. It's an easy mistake to make, so please double-check that I'm correct. Thanks!
Author: Daniel Darabos <darabos.daniel@gmail.com>
Closes #6621 from darabos/patch-2 and squashes the following commits:
dfebdec [Daniel Darabos] Fix comment.
(cherry picked from commit 10ba188087)
Signed-off-by: Sean Owen <sowen@cloudera.com>
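The semantics the corrected comment documents reduce to a simple threshold check (an illustrative sketch, not the scheduler's code): with maxTaskFailures = 1, the very first failure already aborts the task set.

```python
def should_abort(num_failures, max_task_failures):
    # Abort once the failure count reaches the configured maximum.
    return num_failures >= max_task_failures

print(should_abort(1, 1))  # True
print(should_abort(1, 4))  # False
```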
This includes the following commits:
original: 9eb222c
hotfix1: 8c99793
hotfix2: a4f2412
scalastyle check: 609c492
---
Original patch #6441
Branch-1.3 patch #6602
Author: Andrew Or <andrew@databricks.com>
Closes #6598 from andrewor14/demarcate-tests-1.4 and squashes the following commits:
4c3c566 [Andrew Or] Merge branch 'branch-1.4' of github.com:apache/spark into demarcate-tests-1.4
e217b78 [Andrew Or] [SPARK-7558] Guard against direct uses of FunSuite / FunSuiteLike
46d4361 [Andrew Or] Various whitespace changes (minor)
3d9bf04 [Andrew Or] Make all test suites extend SparkFunSuite instead of FunSuite
eaa520e [Andrew Or] Fix tests?
b4d93de [Andrew Or] Fix tests
634a777 [Andrew Or] Fix log message
a932e8d [Andrew Or] Fix manual things that cannot be covered through automation
8bc355d [Andrew Or] Add core tests as dependencies in all modules
75d361f [Andrew Or] Introduce base abstract class for all test suites
For branch-1.4.
This is identical to #6629 and is not strictly necessary. I'm opening this as a PR since it changes Jenkins test behavior and I want to test it out here.
Author: Andrew Or <andrew@databricks.com>
Closes #6630 from andrewor14/build-check-hive-1.4 and squashes the following commits:
186ec65 [Andrew Or] [BUILD] Use right branch when checking against Hive
Currently hive tests alone take 40m. The right thing to do is
to reduce the test time. However, that is a bigger project, and
we currently have PRs blocked waiting on tests not to time out.
cc shaneknapp pwendell JoshRosen
Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Closes #6623 from shivaram/SPARK-8084 and squashes the following commits:
0ec5b26 [Shivaram Venkataraman] Make SparkR scripts fail on error
(cherry picked from commit 0576c3c4ff)
Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Author: Ryan Williams <ryan.blake.williams@gmail.com>
Closes #6624 from ryan-williams/execs and squashes the following commits:
b6f71d4 [Ryan Williams] don't attempt to lower number of executors by 0
(cherry picked from commit 51898b5158)
Signed-off-by: Andrew Or <andrew@databricks.com>
This is just a workaround for a bigger problem. Some pipeline stages, e.g. `StringIndexerModel`, may not be effective during prediction, and they should not complain about missing required columns. jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Closes #6595 from mengxr/SPARK-8051 and squashes the following commits:
b6a36b9 [Xiangrui Meng] add doc
f143fd4 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-8051
8ee7c7e [Xiangrui Meng] use SparkFunSuite
e112394 [Xiangrui Meng] make StringIndexerModel silent if input column does not exist
(cherry picked from commit 26c9d7a0f9)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
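The workaround can be sketched like so (a dict stands in for a DataFrame; this is not the MLlib implementation): if the input column is absent, e.g. no label column at prediction time, return the data unchanged instead of raising on a missing required column.

```python
def transform(df, input_col):
    # Silently pass the data through when the input column is missing.
    if input_col not in df:
        return df
    out = dict(df)
    # Toy "indexing": map each value to its rank among distinct values.
    out[input_col + "_idx"] = [sorted(set(df[input_col])).index(v)
                               for v in df[input_col]]
    return out

df = {"text": ["a", "b"]}
print(transform(df, "label") is df)  # True: missing column is skipped silently
```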
cc andrewor14
Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Closes #6424 from shivaram/spark-worker-instances-yarn-ec2 and squashes the following commits:
db244ae [Shivaram Venkataraman] Make Python Lint happy
0593d1b [Shivaram Venkataraman] Clear SPARK_WORKER_INSTANCES when using YARN
(cherry picked from commit d3e026f879)
Signed-off-by: Andrew Or <andrew@databricks.com>
The flaky tests in ExternalShuffleServiceSuite and SparkListenerWithClusterSuite will fail if there are not enough executors up before running the jobs.
This PR adds `JobProgressListener.waitUntilExecutorsUp`. The tests for the cluster mode can use it to wait until the expected executors are up.
Author: zsxwing <zsxwing@gmail.com>
Closes #6546 from zsxwing/SPARK-7989 and squashes the following commits:
5560e09 [zsxwing] Fix a typo
3b69840 [zsxwing] Fix flaky tests in ExternalShuffleServiceSuite and SparkListenerWithClusterSuite
(cherry picked from commit f27134782e)
Signed-off-by: Andrew Or <andrew@databricks.com>
Conflicts:
core/src/test/scala/org/apache/spark/broadcast/BroadcastSuite.scala
core/src/test/scala/org/apache/spark/scheduler/SparkListenerWithClusterSuite.scala
Some places forget to call `assert` to check the return value of `AsynchronousListenerBus.waitUntilEmpty`. Instead of adding `assert` in these places, I think it's better to make `AsynchronousListenerBus.waitUntilEmpty` throw `TimeoutException`.
Author: zsxwing <zsxwing@gmail.com>
Closes #6550 from zsxwing/SPARK-8001 and squashes the following commits:
607674a [zsxwing] Make AsynchronousListenerBus.waitUntilEmpty throw TimeoutException if timeout
(cherry picked from commit 1d8669f15c)
Signed-off-by: Andrew Or <andrew@databricks.com>
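The new contract can be sketched as follows (an illustrative stand-in, not Spark's `AsynchronousListenerBus`): instead of returning a boolean that call sites may forget to wrap in `assert`, raise a `TimeoutException` when the bus has not drained in time.

```python
import time

class TimeoutException(Exception):
    pass

def wait_until_empty(pending, timeout_secs):
    # Poll until the queue drains; raise rather than return False on timeout.
    deadline = time.monotonic() + timeout_secs
    while pending:
        if time.monotonic() >= deadline:
            raise TimeoutException(f"listener bus not drained in {timeout_secs}s")
        time.sleep(0.01)

wait_until_empty([], 0.1)  # empty queue: returns quietly
```

Because forgetting to check an exception is impossible, every caller now fails loudly on timeout instead of silently ignoring a `False` return.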
Author: Timothy Chen <tnachen@gmail.com>
Closes #6615 from tnachen/mesos_driver_path and squashes the following commits:
4f47b7c [Timothy Chen] Use the correct base path in mesos driver page.
(cherry picked from commit bfbf12b349)
Signed-off-by: Andrew Or <andrew@databricks.com>
Java-friendly APIs added:
* GaussianMixture.run()
* GaussianMixtureModel.predict()
* DistributedLDAModel.javaTopicDistributions()
* StreamingKMeans: trainOn, predictOn, predictOnValues
* Statistics.corr
* params
* added doc to w() since Java docs do not inherit doc
* removed non-Java-friendly w() from StringArrayParam and DoubleArrayParam
* made DoubleArrayParam Java-friendly w() actually Java-friendly
I generated the doc and verified all changes.
CC: mengxr
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #6562 from jkbradley/java-api-1.4 and squashes the following commits:
c16821b [Joseph K. Bradley] Small fixes based on code review.
d955581 [Joseph K. Bradley] unit test fixes
29b6b0d [Joseph K. Bradley] small fixes
fe6dcfe [Joseph K. Bradley] Added several Java-friendly APIs + unit tests: NaiveBayes, GaussianMixture, LDA, StreamingKMeans, Statistics.corr, params
(cherry picked from commit 20a26b595c)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
Author: Reynold Xin <rxin@databricks.com>
Closes #6608 from rxin/parquet-analysis and squashes the following commits:
b5dc8e2 [Reynold Xin] Code review feedback.
5617cf6 [Reynold Xin] [SPARK-8074] Parquet should throw AnalysisException during setup for data type/name related failures.
(cherry picked from commit 939e4f3d8d)
Signed-off-by: Reynold Xin <rxin@databricks.com>
Author: Sun Rui <rui.sun@intel.com>
Closes #6605 from sun-rui/SPARK-8063 and squashes the following commits:
51ca48b [Sun Rui] [SPARK-8063][SPARKR] Spark master URL conflict between MASTER env variable and --master command line option.
(cherry picked from commit 708c63bbbe)
Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
https://issues.apache.org/jira/browse/SPARK-7973
Author: Yin Huai <yhuai@databricks.com>
Closes #6525 from yhuai/SPARK-7973 and squashes the following commits:
763b821 [Yin Huai] Also change the timeout of "Single command with -e" to 2 minutes.
e598a08 [Yin Huai] Increase the timeout to 3 minutes.
(cherry picked from commit f1646e1023)
Signed-off-by: Yin Huai <yhuai@databricks.com>