`spark-daemon.sh` confirms the process id by fuzzy matching the class name while stopping the service; however, this fails if the Java process's argument list is very long (greater than 4096 characters).
This PR loosens the check for the service process.
Author: Cheng Hao <hao.cheng@intel.com>
Closes#4611 from chenghao-intel/stopping_service and squashes the following commits:
a0051f6 [Cheng Hao] loosen the process checking while stopping a service
This PR adds a `finalize` method to DiskMapIterator to clean up resources even if an exception occurs while processing data.
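As a rough sketch of the pattern (with illustrative names, not Spark's actual internals), a disk-backed iterator can delete its temp file in `finalize()` as a last resort when the consumer abandons it after an exception:

```scala
import java.io.{File, FileInputStream}

// Hypothetical iterator that streams bytes from a temporary spill file.
class DiskBackedIterator(tempFile: File) extends Iterator[Int] {
  private val in = new FileInputStream(tempFile)
  private var nextByte = in.read()

  override def hasNext: Boolean = nextByte != -1
  override def next(): Int = { val b = nextByte; nextByte = in.read(); b }

  private def cleanup(): Unit = {
    in.close()
    tempFile.delete() // remove the temp file even on an abnormal exit
  }

  // Invoked by the GC if the iterator is dropped mid-stream.
  override def finalize(): Unit = {
    try cleanup() finally super.finalize()
  }
}
```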
Author: zsxwing <zsxwing@gmail.com>
Closes#4219 from zsxwing/SPARK-5423 and squashes the following commits:
d4b2ca6 [zsxwing] Cleanup resources in DiskMapIterator.finalize to ensure deleting the temp file
The stability of the new submission gateway assumes that the arguments in `DriverWrapper` are consistent across multiple Spark versions. However, this is not at all clear from the code itself. In fact, this was broken in 20a6013106, which is fortunately OK because both that commit and the original commit that added this gateway are part of the same release.
To prevent this from happening again, we should at the very least add a huge warning where appropriate.
Author: Andrew Or <andrew@databricks.com>
Closes#4687 from andrewor14/driver-wrapper-warning and squashes the following commits:
7989b56 [Andrew Or] Add huge compatibility warning
Author: Jacek Lewandowski <lewandowski.jacek@gmail.com>
Closes#4653 from jacek-lewandowski/SPARK-5548-2-master and squashes the following commits:
0e199b6 [Jacek Lewandowski] SPARK-5548: applied reviewer's comments
843eafb [Jacek Lewandowski] SPARK-5548: Fix for AkkaUtilsSuite failure - attempt 2
marmbrus am I missing something obvious here? I verified that this fixes the problem for me (on 1.2.1) on EC2, but I'm confused about how others haven't noticed this.
Author: Kay Ousterhout <kayousterhout@gmail.com>
Closes#4630 from kayousterhout/SPARK-5846_1.3 and squashes the following commits:
2022ad4 [Kay Ousterhout] [SPARK-5846] Correctly set job description and pool for SQL jobs
Updated PIC user guide to reflect API changes and added a simple Java example. The API is still not very Java-friendly. I created SPARK-5990 for this issue.
Author: Xiangrui Meng <meng@databricks.com>
Closes#4680 from mengxr/SPARK-5897 and squashes the following commits:
847d216 [Xiangrui Meng] apache header
87719a2 [Xiangrui Meng] remove PIC image
2dd921f [Xiangrui Meng] update PIC user guide and add a Java example
Python's `int` is 64-bit on 64-bit machines (very common now), so we should infer it as LongType in Spark SQL.
Also, LongType in SQL will come back as `int`.
Author: Davies Liu <davies@databricks.com>
Closes#4666 from davies/long and squashes the following commits:
6bc6cc4 [Davies Liu] infer int as LongType
Also added test cases for checking the serializability of HiveContext and SQLContext.
Author: Reynold Xin <rxin@databricks.com>
Closes#4628 from rxin/SPARK-5840 and squashes the following commits:
ecb3bcd [Reynold Xin] test cases and reviews.
55eb822 [Reynold Xin] [SPARK-5840][SQL] HiveContext cannot be serialized due to tuple extraction.
Docs for BlockMatrix. mengxr
Author: Burak Yavuz <brkyvz@gmail.com>
Closes#4664 from brkyvz/SPARK-5507PR and squashes the following commits:
4db30b0 [Burak Yavuz] [SPARK-5507] Added documentation for BlockMatrix
The API is still not very Java-friendly because `Array[Item]` in `freqItemsets` is recognized as `Object` in Java. We might want to define a case class to wrap the return pair to make it Java friendly.
Author: Xiangrui Meng <meng@databricks.com>
Closes#4661 from mengxr/SPARK-5519 and squashes the following commits:
58ccc25 [Xiangrui Meng] add user guide with example code for fp-growth
Correct exclusion path for JBLAS native libs.
(More explanation coming soon on the mailing list re: 1.3.0 RC1)
Author: Sean Owen <sowen@cloudera.com>
Closes#4673 from srowen/SPARK-5669.2 and squashes the following commits:
e29693c [Sean Owen] Correct exclusion path for JBLAS native libs
A variable `shutdownCallback` in SparkDeploySchedulerBackend can be accessed from multiple threads, so it should be enclosed in a synchronized block.
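A minimal sketch of the pattern, with illustrative names rather than Spark's actual fields; every read and write of the shared callback goes through the same lock:

```scala
// Stand-in for SparkDeploySchedulerBackend's shared state.
class SchedulerBackendLike {
  private var shutdownCallback: Option[Int => Unit] = None

  def onShutdown(cb: Int => Unit): Unit = synchronized {
    shutdownCallback = Some(cb)
  }

  def stop(exitCode: Int): Unit = synchronized {
    shutdownCallback.foreach(cb => cb(exitCode))
  }
}
```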
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes#3781 from sarutak/SPARK-4949 and squashes the following commits:
c146c93 [Kousuke Saruta] Removed "setShutdownCallback" method
c7265dc [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-4949
42ca528 [Kousuke Saruta] Changed the declaration of the variable "shutdownCallback" as a volatile reference instead of AtomicReference
552df7c [Kousuke Saruta] Changed the declaration of the variable "shutdownCallback" as a volatile reference instead of AtomicReference
f556819 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-4949
1b60fd1 [Kousuke Saruta] Improved the locking logics
5942765 [Kousuke Saruta] Enclosed shutdownCallback in SparkDeploySchedulerBackend by synchronized block
numClassesForClassification has been renamed to numClasses.
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes#4672 from MechCoder/minor-doc and squashes the following commits:
d2ddb7f [MechCoder] Minor doc fix in GBT classification example
Also add tests for distinct()
Author: Davies Liu <davies@databricks.com>
Closes#4667 from davies/repartition and squashes the following commits:
79059fd [Davies Liu] add test
cb4915e [Davies Liu] fix repartition
This pull request replaces calls to deprecated methods from `java.util.Date` with near-equivalents in `java.util.Calendar`.
Author: Tor Myklebust <tmyklebu@gmail.com>
Closes#4668 from tmyklebu/master and squashes the following commits:
66215b1 [Tor Myklebust] Use GregorianCalendar instead of Timestamp get methods.
Although we've migrated to the DataFrame API, lots of code still uses `rdd` or `srdd` as local variable names. This PR tries to address these naming inconsistencies and some other minor DataFrame related style issues.
Author: Cheng Lian <lian@databricks.com>
Closes#4670 from liancheng/df-cleanup and squashes the following commits:
3e14448 [Cheng Lian] Cleans up DataFrame variable names and toDF() calls
The test was incorrect: instead of counting the number of records, it counted the number of partitions of the RDD generated by the DStream, which was not its intention. I will be running this patch multiple times to understand its flakiness.
PS: This was caused by my refactoring in https://github.com/apache/spark/pull/4384/
koeninger check it out.
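A sketch of the corrected assertion, assuming a `DStream[_]` named `stream` under test (names are illustrative, not the actual test code):

```scala
import java.util.concurrent.atomic.AtomicLong
import org.apache.spark.streaming.dstream.DStream

def countRecords(stream: DStream[_]): AtomicLong = {
  val total = new AtomicLong(0)
  // Count the records in each batch; rdd.partitions.length would count partitions.
  stream.foreachRDD { rdd => total.addAndGet(rdd.count()) }
  total
}
```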
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#4597 from tdas/kafka-flaky-test and squashes the following commits:
d236235 [Tathagata Das] Unignored last test.
e9a1820 [Tathagata Das] fix test
JIRA: https://issues.apache.org/jira/browse/SPARK-5723
Author: Yin Huai <yhuai@databricks.com>
This patch had conflicts when merged, resolved by
Committer: Michael Armbrust <michael@databricks.com>
Closes#4639 from yhuai/defaultCTASFileFormat and squashes the following commits:
a568137 [Yin Huai] Merge remote-tracking branch 'upstream/master' into defaultCTASFileFormat
ad2b07d [Yin Huai] Update tests and error messages.
8af5b2a [Yin Huai] Update conf key and unit test.
5a67903 [Yin Huai] Use data source write path for Hive's CTAS statements when no storage format/handler is specified.
https://issues.apache.org/jira/browse/SPARK-5875 has a case to reproduce the bug and explain the root cause.
Author: Yin Huai <yhuai@databricks.com>
Closes#4663 from yhuai/projectResolved and squashes the following commits:
472f7b6 [Yin Huai] If a logical.Project has any AggregateExpression or Generator, it's resolved field should be false.
This patch addresses a race condition in DAGScheduler by properly synchronizing accesses to its `cacheLocs` map.
This map is accessed by the `getCacheLocs` and `clearCacheLocs()` methods, which can be called by separate threads, since DAGScheduler's `getPreferredLocs()` method is called by SparkContext and indirectly calls `getCacheLocs()`. If this map is cleared by the DAGScheduler event processing thread while a user thread is submitting a job and computing preferred locations, then this can cause the user thread to throw "NoSuchElementException: key not found" errors.
Most accesses to DAGScheduler's internal state do not need synchronization because that state is only accessed from the event processing loop's thread. An alternative approach to fixing this bug would be to refactor this code so that SparkContext sends the DAGScheduler a message in order to get the list of preferred locations. However, this would involve more extensive changes to this code and would be significantly harder to backport to maintenance branches since some of the related code has undergone significant refactoring (e.g. the introduction of EventLoop). Since `cacheLocs` is the only state that's accessed in this way, adding simple synchronization seems like a better short-term fix.
See #3345 for additional context.
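A sketch of the shape of the fix, with an illustrative stand-in for the scheduler state: the reader (user thread) and the clearer (event loop) take the same lock, so a clear can no longer race a lookup:

```scala
import scala.collection.mutable

class CacheLocsHolder {
  private val cacheLocs = new mutable.HashMap[Int, Seq[String]]

  def getCacheLocs(rddId: Int): Seq[String] = cacheLocs.synchronized {
    cacheLocs.getOrElseUpdate(rddId, Seq.empty)
  }

  def clearCacheLocs(): Unit = cacheLocs.synchronized {
    cacheLocs.clear()
  }
}
```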
Author: Josh Rosen <joshrosen@databricks.com>
Closes#4660 from JoshRosen/SPARK-4454 and squashes the following commits:
12d64ba [Josh Rosen] Properly synchronize accesses to DAGScheduler cacheLocs map.
Currently, PySpark does not support narrow dependencies during cogroup/join when the two RDDs have the same partitioner, so an unnecessary shuffle stage comes in.
The Python implementation of cogroup/join is different from the Scala one: it depends on union() and partitionBy(). This patch tries to use PartitionerAwareUnionRDD() in union() when all the RDDs have the same partitioner. It also fixes `preservesPartitioning` in all the map() and mapPartitions() calls, so partitionBy() can skip the unnecessary shuffle stage.
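The Scala-side mechanism this leans on, in sketch form (assuming an active SparkContext `sc`): a map over a partitioned RDD that keeps its partitioner lets a later join avoid a shuffle:

```scala
import org.apache.spark.HashPartitioner

val left = sc.parallelize(0 until 100).map(i => (i, i)).partitionBy(new HashPartitioner(4))
val mapped = left.mapPartitions(
  iter => iter.map { case (k, v) => (k, v * 2) },
  preservesPartitioning = true) // keep the partitioner...
val joined = mapped.join(left)  // ...so both sides agree and no shuffle is needed
```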
Author: Davies Liu <davies@databricks.com>
Closes#4629 from davies/narrow and squashes the following commits:
dffe34e [Davies Liu] improve test, check number of stages for join/cogroup
1ed3ba2 [Davies Liu] Merge branch 'master' of github.com:apache/spark into narrow
4d29932 [Davies Liu] address comment
cc28d97 [Davies Liu] add unit tests
940245e [Davies Liu] address comments
ff5a0a6 [Davies Liu] skip the partitionBy() on Python side
eb26c62 [Davies Liu] narrow dependency in PySpark
The problem is that after we create an empty Hive metastore Parquet table (e.g. `CREATE TABLE test (a int) STORED AS PARQUET`), Hive creates an empty dir for us, which causes our data source `ParquetRelation2` to fail to get the schema of the table. See the JIRA for the case to reproduce the bug and the exception.
This PR is based on #4562 from chenghao-intel.
JIRA: https://issues.apache.org/jira/browse/SPARK-5852
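A sketch of the reproduction from Scala, assuming a HiveContext named `hiveContext`:

```scala
hiveContext.sql("CREATE TABLE test (a INT) STORED AS PARQUET") // Hive creates an empty directory
hiveContext.sql("SELECT * FROM test").collect() // previously failed: no Parquet footer to derive a schema from
```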
Author: Yin Huai <yhuai@databricks.com>
Author: Cheng Hao <hao.cheng@intel.com>
Closes#4655 from yhuai/CTASParquet and squashes the following commits:
b8b3450 [Yin Huai] Update tests.
2ac94f7 [Yin Huai] Update tests.
3db3d20 [Yin Huai] Minor update.
d7e2308 [Yin Huai] Revert changes in HiveMetastoreCatalog.scala.
36978d1 [Cheng Hao] Update the code as feedback
a04930b [Cheng Hao] fix bug of scan an empty parquet based table
442ffe0 [Cheng Hao] passdown the schema for Parquet File in HiveContext
The sqlCtx will be a HiveContext if Hive is built into the assembly jar, or a SQLContext if not.
It also skips the Hive tests in pyspark.sql.tests if Hive is not available.
Author: Davies Liu <davies@databricks.com>
Closes#4659 from davies/sqlctx and squashes the following commits:
0e6629a [Davies Liu] sqlCtx in pyspark
Author: Davies Liu <davies@databricks.com>
Closes#4658 from davies/explain and squashes the following commits:
db87ea2 [Davies Liu] output explain in Python
This patch brings the pull-based progress API into Python, along with a Python example.
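For reference, the Scala side of the same pull-based API looks roughly like this (assuming an active SparkContext `sc`); the patch exposes the equivalent calls to Python:

```scala
val tracker = sc.statusTracker
for (stageId <- tracker.getActiveStageIds) {
  tracker.getStageInfo(stageId).foreach { info =>
    println(s"stage $stageId: ${info.numCompletedTasks}/${info.numTasks} tasks complete")
  }
}
```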
Author: Davies Liu <davies@databricks.com>
Closes#3027 from davies/progress_api and squashes the following commits:
b1ba984 [Davies Liu] fix style
d3b9253 [Davies Liu] add tests, mute the exception after stop
4297327 [Davies Liu] Merge branch 'master' of github.com:apache/spark into progress_api
969fa9d [Davies Liu] Merge branch 'master' of github.com:apache/spark into progress_api
25590c9 [Davies Liu] update with Java API
360de2d [Davies Liu] Merge branch 'master' of github.com:apache/spark into progress_api
c0f1021 [Davies Liu] Merge branch 'master' of github.com:apache/spark into progress_api
023afb3 [Davies Liu] add Python API and example for progress API
Author: Michael Armbrust <michael@databricks.com>
Closes#4657 from marmbrus/pythonUdfs and squashes the following commits:
a7823a8 [Michael Armbrust] [SPARK-5868][SQL] Fix python UDFs in HiveContext and checks in SQLContext
In the unit tests, the table src (key INT, value STRING) is not the same as Hive's src (key STRING, value STRING):
https://github.com/apache/hive/blob/branch-0.13/data/scripts/q_test_init.sql
In reflect.q, the test failed for the expression `reflect("java.lang.Integer", "valueOf", key, 16)`, which expects the argument `key` to be a STRING, not an INT.
This PR doesn't aim to change the `src` schema; we can do that after 1.3 is released, though we would probably need to re-generate all the golden files.
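For illustration, assuming a HiveContext named `hiveContext` and Hive's canonical src table, an explicit cast makes the expression match the STRING expectation:

```scala
hiveContext.sql(
  "SELECT reflect('java.lang.Integer', 'valueOf', CAST(key AS STRING), 16) FROM src")
```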
Author: Cheng Hao <hao.cheng@intel.com>
Closes#4584 from chenghao-intel/reflect and squashes the following commits:
e5bdc3a [Cheng Hao] Move the test case reflect into blacklist
184abfd [Cheng Hao] revert the change to table src1
d9bcf92 [Cheng Hao] Update the HiveContext Unittest
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#4649 from viirya/use_checkpath and squashes the following commits:
0f9a1a1 [Liang-Chi Hsieh] Use same function to check path parameter.
Currently, `ParquetConversions` in `HiveMetastoreCatalog` will transformUp the given plan multiple times if there are many Metastore Parquet tables. Since the transformUp operation is recursive, it is better to perform it only once.
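A self-contained toy illustration of the idea (not Catalyst's actual types): one bottom-up pass that converts every matching table beats one pass per table:

```scala
sealed trait Node { def children: Seq[Node] }
case class Table(name: String, children: Seq[Node] = Nil) extends Node
case class Op(children: Seq[Node]) extends Node

def transformUp(n: Node)(f: PartialFunction[Node, Node]): Node = {
  val withNewChildren: Node = n match {
    case Table(name, cs) => Table(name, cs.map(transformUp(_)(f)))
    case Op(cs)          => Op(cs.map(transformUp(_)(f)))
  }
  f.applyOrElse(withNewChildren, identity[Node])
}

val metastoreTables = Set("a", "b")
val plan = Op(Seq(Table("a"), Op(Seq(Table("b")))))
// One pass converts every Metastore Parquet table at once.
val converted = transformUp(plan) {
  case Table(name, cs) if metastoreTables(name) => Table(name + "_parquet", cs)
}
```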
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#4651 from viirya/parquet_atonce and squashes the following commits:
c1ed29d [Liang-Chi Hsieh] Fix bug.
e0f919b [Liang-Chi Hsieh] Only transformUp the given plan once.
Author: CodingCat <zhunansjtu@gmail.com>
Closes#4656 from CodingCat/fix_typo and squashes the following commits:
b41d15c [CodingCat] recover
689fe46 [CodingCat] fix typo
A jar file containing Python sources can be used as a Python package, just like a zip file.
spark-submit already puts the jar file on the PYTHONPATH; this patch also puts it on sys.path, so it can be used in the Python worker.
Author: Davies Liu <davies@databricks.com>
Closes#4652 from davies/jar and squashes the following commits:
17d3f76 [Davies Liu] support .jar as python package
Avoid the call to remove a shutdown hook being made from within a shutdown hook
CC pwendell JoshRosen MattWhelan
Author: Sean Owen <sowen@cloudera.com>
Closes#4648 from srowen/SPARK-5841.2 and squashes the following commits:
51548db [Sean Owen] Avoid call to remove shutdown hook being called from shutdown hook
This commit exists to close the following pull requests on Github:
Closes#3297 (close requested by 'andrewor14')
Closes#3345 (close requested by 'pwendell')
Closes#2729 (close requested by 'srowen')
Closes#2320 (close requested by 'pwendell')
Closes#4529 (close requested by 'andrewor14')
Closes#2098 (close requested by 'srowen')
Closes#4120 (close requested by 'andrewor14')
For unordered features, it is sufficient to use splits, since the threshold of the split corresponds to the threshold of the HighSplit of the bin and the LowSplit goes unused.
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes#4231 from MechCoder/spark-3381 and squashes the following commits:
58c19a5 [MechCoder] COSMIT
c274b74 [MechCoder] Remove unordered feature calculation in labeledPointToTreePoint
b2b9b89 [MechCoder] COSMIT
d3ee042 [MechCoder] [SPARK-3381] [MLlib] Eliminate bins for unordered features
`hasShutdownDeleteTachyonDir(file: TachyonFile)` should use `shutdownDeleteTachyonPaths` (not `shutdownDeletePaths`) to determine whether it contains the file. To solve this, delete the two unused functions.
Author: xukun 00228947 <xukun.xu@huawei.com>
Author: viper-kun <xukun.xu@huawei.com>
Closes#4418 from viper-kun/deleteunusedfun and squashes the following commits:
87340eb [viper-kun] fix style
3d6c69e [xukun 00228947] fix bug
2bc397e [xukun 00228947] deleteunusedfun
The previous behavior was to log an error. This is fine in the general case, where no `spark.metrics.conf` parameter was specified: a default `metrics.properties` is looked for, and the exception is logged and suppressed if it doesn't exist.
If the user has purposefully specified a metrics config file, however, it makes more sense to show them an error when that file doesn't exist.
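A minimal sketch of the intended behavior, with illustrative names rather than MetricsConfig's actual API: only swallow the missing-file case when the user did not explicitly configure a path:

```scala
import java.io.{FileInputStream, FileNotFoundException}
import java.util.Properties

def loadMetricsConfig(userPath: Option[String]): Properties = {
  val props = new Properties()
  val path = userPath.getOrElse("metrics.properties")
  try {
    val in = new FileInputStream(path)
    try props.load(in) finally in.close()
  } catch {
    case _: FileNotFoundException if userPath.isEmpty =>
      // The default file is optional; continue with empty properties.
    case e: FileNotFoundException =>
      throw e // The user asked for this file, so surface the error.
  }
  props
}
```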
Author: Ryan Williams <ryan.blake.williams@gmail.com>
Closes#4571 from ryan-williams/metrics and squashes the following commits:
5bccb14 [Ryan Williams] private-ize some MetricsConfig members
08ff998 [Ryan Williams] rename METRICS_CONF: DEFAULT_METRICS_CONF_FILENAME
f4d7fab [Ryan Williams] fix tests
ad24b0e [Ryan Williams] add "metrics.properties" to .rat-excludes
94e810b [Ryan Williams] throw if nonexistent Sink class is specified
31d2c30 [Ryan Williams] metrics code review feedback
56287db [Ryan Williams] throw if nonexistent metrics config file provided
1. added explain()
2. added isLocal()
3. do not call show() in __repr__
4. added foreach() and foreachPartition()
5. added distinct()
6. fixed functions.col()/column()/lit()
7. fixed unit tests in sql/functions.py
8. fixed unicode in showString()
Author: Davies Liu <davies@databricks.com>
Closes#4645 from davies/df6 and squashes the following commits:
6b46a2c [Davies Liu] fix DataFrame Python API
`numFeatures` is only used by multinomial logistic regression. Calling `.first()` for every GLM causes a performance regression, especially in Python.
Author: Xiangrui Meng <meng@databricks.com>
Closes#4647 from mengxr/SPARK-5858 and squashes the following commits:
036dc7f [Xiangrui Meng] remove unnecessary first() call
12c5548 [Xiangrui Meng] check numFeatures only once
I've seen out of memory exceptions when trying
to run many parallel builds against the same Zinc
server during packaging. We should use the same
increased memory settings we use for Maven itself.
I tested this and confirmed that the Nailgun JVM
launched with higher memory.
Author: Patrick Wendell <patrick@databricks.com>
Closes#4643 from pwendell/zinc-memory and squashes the following commits:
717cfb0 [Patrick Wendell] SPARK-5856: Launch Zinc with larger memory options.
Author: jerryshao <saisai.shao@intel.com>
Closes#4612 from jerryshao/SPARK-5826 and squashes the following commits:
7ec71db [jerryshao] Remove transient for conf statement
88d84e6 [jerryshao] Fix Configuration not serializable problem
If we need to transform the input data, we should cache the output to avoid re-computing feature vectors every iteration. dbtsai
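A minimal sketch of the pattern, assuming an `RDD[LabeledPoint]` named `input` and a hypothetical feature transformation `scale: Vector => Vector`:

```scala
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.storage.StorageLevel

val transformed = input.map(p => p.copy(features = scale(p.features)))
transformed.persist(StorageLevel.MEMORY_AND_DISK) // cache once, so each
// optimizer iteration reuses the vectors instead of recomputing them
// ... run iterative training over `transformed`, then:
transformed.unpersist()
```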
Author: Xiangrui Meng <meng@databricks.com>
Closes#4593 from mengxr/SPARK-5802 and squashes the following commits:
ae3be84 [Xiangrui Meng] cache transformed data in glm
Author: Reynold Xin <rxin@databricks.com>
Closes#4640 from rxin/SPARK-5853 and squashes the following commits:
9c6f569 [Reynold Xin] [SPARK-5853][SQL] Schema support in Row.
Author: Patrick Wendell <patrick@databricks.com>
Closes#4638 from pwendell/SPARK-5850 and squashes the following commits:
386126f [Patrick Wendell] SPARK-5850: Remove experimental label for Scala 2.11 and FlumePollingStream.
There is a chance of deadlock in which the Python process is waiting for the ending mark from the JVM, but the mark has been eaten by a corrupted stream.
This PR checks the ending mark from Python in a non-blocking way, so the JVM will not be blocked by the Python process.
There is a small chance that the ending mark was sent by the Python process but is not yet available; in that case, the Python process will not be reused.
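A sketch of the non-blocking check, assuming `in: java.io.InputStream` is the stream from the Python worker:

```scala
val endingMark: Option[Int] =
  if (in.available() > 0) Some(in.read()) // the mark already arrived; consume it
  else None // not there yet: don't block the JVM on a possibly-hung worker
```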
cc JoshRosen pwendell
Author: Davies Liu <davies@databricks.com>
Closes#4601 from davies/freeze and squashes the following commits:
e15a8c3 [Davies Liu] update logging
890329c [Davies Liu] Merge branch 'freeze' of github.com:davies/spark into freeze
2bd2228 [Davies Liu] add more logging
656d544 [Davies Liu] Update PythonRDD.scala
05e1085 [Davies Liu] check ending mark in non-block way
Added a bunch of tags.
Also changed parquetFile to take varargs rather than a string followed by varargs.
Author: Reynold Xin <rxin@databricks.com>
Closes#4636 from rxin/df-doc and squashes the following commits:
651f80c [Reynold Xin] Fixed parquetFile in PySpark.
8dc3024 [Reynold Xin] [SQL] Various DataFrame doc changes.