This brings the master branch in sync with branch-1.1
Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Closes #2510 from shivaram/spark-ec2-version and squashes the following commits:
bb0dd16 [Shivaram Venkataraman] Set EC2 version to 1.1.0 and update version map
Sometimes Jenkins posts [spurious reports of new classes being added](https://github.com/apache/spark/pull/2339#issuecomment-56570170). I believe this stems from diffing the patch against `master`, as opposed to against `master...`, which starts from the commit the PR was branched from.
This patch fixes that behavior.
Author: Nicholas Chammas <nicholas.chammas@gmail.com>
Closes #2512 from nchammas/diff-only-commits-ahead and squashes the following commits:
c065599 [Nicholas Chammas] comment typo fix
a453c67 [Nicholas Chammas] diff from branch point
This is a refactored version of the original PR https://github.com/apache/spark/pull/1723 by mubarak.
Please take a look, andrewor14 and mubarak.
Author: Mubarak Seyed <mubarak.seyed@gmail.com>
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes #2464 from tdas/streaming-callsite and squashes the following commits:
dc54c71 [Tathagata Das] Made changes based on PR comments.
390b45d [Tathagata Das] Fixed minor bugs.
904cd92 [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into streaming-callsite
7baa427 [Tathagata Das] Refactored getCallSite and setCallSite to make it simpler. Also added unit test for DStream creation site.
b9ed945 [Mubarak Seyed] Adding streaming utils
c461cf4 [Mubarak Seyed] Merge remote-tracking branch 'upstream/master'
ceb43da [Mubarak Seyed] Changing default regex function name
8c5d443 [Mubarak Seyed] Merge remote-tracking branch 'upstream/master'
196121b [Mubarak Seyed] Merge remote-tracking branch 'upstream/master'
491a1eb [Mubarak Seyed] Removing streaming visibility from getRDDCreationCallSite in DStream
33a7295 [Mubarak Seyed] Fixing review comments: Merging both setCallSite methods
c26d933 [Mubarak Seyed] Merge remote-tracking branch 'upstream/master'
f51fd9f [Mubarak Seyed] Fixing scalastyle, Regex for Utils.getCallSite, and changing method names in DStream
5051c58 [Mubarak Seyed] Getting return value of compute() into variable and call setCallSite(prevCallSite) only once. Adding return for other code paths (for None)
a207eb7 [Mubarak Seyed] Fixing code review comments
ccde038 [Mubarak Seyed] Removing Utils import from MappedDStream
2a09ad6 [Mubarak Seyed] Changes in Utils.scala for SPARK-1853
1d90cc3 [Mubarak Seyed] Changes for SPARK-1853
5f3105a [Mubarak Seyed] Merge remote-tracking branch 'upstream/master'
70f494f [Mubarak Seyed] Changes for SPARK-1853
1500deb [Mubarak Seyed] Changes in Spark Streaming UI
9d38d3c [Mubarak Seyed] [SPARK-1853] Show Streaming application code context (file, line number) in Spark Stages UI
d466d75 [Mubarak Seyed] Changes for spark streaming UI
`SPARK_DRIVER_MEMORY` was only used to start the `SparkSubmit` JVM, which becomes the driver only in client mode but not cluster mode. In cluster mode, this property is simply not propagated to the worker nodes.
`SPARK_EXECUTOR_MEMORY` is picked up from `SparkContext`, but in cluster mode the driver runs on one of the worker machines, where this environment variable may not be set.
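For illustration, a minimal sketch of the intended precedence (the property names are the standard Spark conf keys; the defaults here are only placeholders):
```scala
// Hedged sketch: prefer the conf properties, fall back to the env vars,
// so cluster mode sees the same memory settings as client mode.
def resolveMemory(prop: String, envVar: String, default: String): String =
  sys.props.get(prop)
    .orElse(sys.env.get(envVar))
    .getOrElse(default)

val driverMemory   = resolveMemory("spark.driver.memory", "SPARK_DRIVER_MEMORY", "512m")
val executorMemory = resolveMemory("spark.executor.memory", "SPARK_EXECUTOR_MEMORY", "1g")
```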
Author: Andrew Or <andrewor14@gmail.com>
Closes #2500 from andrewor14/memory-env-vars and squashes the following commits:
6217b38 [Andrew Or] Respect SPARK_*_MEMORY for cluster mode
...the driver
Author: Sandy Ryza <sandy@cloudera.com>
Closes #2487 from sryza/sandy-spark-3612 and squashes the following commits:
2b7353d [Sandy Ryza] SPARK-3612. Executor shouldn't quit if heartbeat message fails to reach the driver
Guava's Optional refers to some package-private classes / methods, and
when those are relocated the code stops working, throwing exceptions.
So add the affected classes to the exception list too, and add a unit
test.
(Note that this unit test only really makes sense in maven, since we
don't relocate in the sbt build. Also, JavaAPISuite doesn't seem to
be run by "mvn test" - I had to manually add command line options to
enable it.)
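For context, a small usage sketch of why the extra exclusions matter (Absent and Present are the package-private classes backing Optional in Guava):
```scala
import com.google.common.base.Optional

// Optional.of returns a Present and Optional.absent returns the Absent
// singleton; if those classes were relocated while Optional itself was
// not, these calls would fail at runtime.
val some: Optional[String] = Optional.of("value")
val none: Optional[String] = Optional.absent()
none.or("fallback")  // "fallback"
```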
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes #2496 from vanzin/SPARK-3647 and squashes the following commits:
84f58d7 [Marcelo Vanzin] [SPARK-3647] Add more exceptions to Guava relocation.
This will allow us to take advantage of things like the spark.defaults file.
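A minimal sketch of the behavior this enables (assuming the 1.1-era `SQLContext.getConf` accessor):
```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// spark.sql.* entries set on SparkConf (e.g. via spark defaults) are
// copied into the SQLContext's configuration when it is created.
val conf = new SparkConf()
  .setMaster("local").setAppName("conf-copy")
  .set("spark.sql.shuffle.partitions", "10")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
sqlContext.getConf("spark.sql.shuffle.partitions")  // expected: "10"
```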
Author: Michael Armbrust <michael@databricks.com>
Closes #2493 from marmbrus/copySparkConf and squashes the following commits:
0bd1377 [Michael Armbrust] Copy SQL configuration from SparkConf when a SQLContext is created.
tdas, is this what you had in mind for this JIRA? I saw this one and thought it would be easy to take care of, and helpful as I use streaming from Java.
I could do the same for `Time`? Happy to do so.
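Per the commit log, the Java-friendly factories ended up on a new `Durations` object; a usage sketch:
```scala
import org.apache.spark.streaming.Durations

// Equivalent to the Scala-only Seconds(1) / Minutes(2) helpers,
// but callable without case-class sugar from Java.
val batchInterval = Durations.seconds(1)
val window = Durations.minutes(2)
```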
Author: Sean Owen <sowen@cloudera.com>
Closes #2403 from srowen/SPARK-2745 and squashes the following commits:
5a9e706 [Sean Owen] Change "Duration" to "Durations" to avoid changing Duration case class API
bda301c [Sean Owen] Just delegate to Scala binary operator syntax to avoid scalastyle warning
7dde949 [Sean Owen] Disable scalastyle for false positives. Add Java static factory methods seconds(), minutes() to Duration. Add Java-friendly methods to Time too, and unit tests. Remove unnecessary math.floor from Time.floor()
4dee32e [Sean Owen] Add named methods to Duration in parallel to symbolic methods for Java-friendliness. Also add unit tests for Duration, in Scala and Java.
Parquet returns null metadata when querying an empty Parquet file while calculating splits, so a null check was added that returns empty splits instead.
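A minimal sketch of the guard (the names here are hypothetical, not the actual Parquet API):
```scala
// If the metadata comes back null (empty Parquet file), return empty
// splits instead of dereferencing it and crashing.
def safeSplits[Split](metadata: AnyRef)(splitsFrom: AnyRef => Seq[Split]): Seq[Split] =
  if (metadata == null) Seq.empty else splitsFrom(metadata)
```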
Author : ravipesala ravindra.pesala@huawei.com
Author: ravipesala <ravindra.pesala@huawei.com>
Closes #2456 from ravipesala/SPARK-3536 and squashes the following commits:
1e81a50 [ravipesala] Fixed the issue when querying on empty parquet file.
Since we have moved to `ConventionHelper`, it is quite easy to avoid calling `javaClassToDataType` in Hive simple UDFs. This will solve SPARK-3582.
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Closes #2506 from adrian-wang/spark3582 and squashes the following commits:
450c28e [Daoyuan Wang] not limit argument type for hive simple udf
This patch fixes timestamps smaller than 0 and casting int as timestamp.
select cast(1000 as timestamp) from src limit 1;
should return 1970-01-01 00:00:01, but we currently take the value as 1000 seconds.
Also, the current implementation has a bug when the time is before 1970-01-01 00:00:00.
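A sketch of the intended semantics (per the example above, the integer is taken as milliseconds since the epoch, and pre-1970 values must work too):
```scala
import java.sql.Timestamp

// 1000 means 1000 milliseconds, i.e. 1970-01-01 00:00:01.0 UTC,
// not 1000 seconds; negative values land before the epoch.
def intToTimestamp(value: Int): Timestamp = new Timestamp(value.toLong)

intToTimestamp(1000)   // 1970-01-01 00:00:01.0
intToTimestamp(-1000)  // 1969-12-31 23:59:59.0
```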
rxin marmbrus chenghao-intel
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Closes #2458 from adrian-wang/timestamp and squashes the following commits:
4274b1d [Daoyuan Wang] set test not related to timezone
1234f66 [Daoyuan Wang] fix timestamp smaller than 0 and cast int as timestamp
This is part of a broader effort to clean up the Yarn integration code after #2020.
The high-level changes in this PR include:
- Removing duplicate code, especially across the alpha and stable APIs
- Simplifying unnecessarily complex method signatures and hierarchies
- Renaming unclear variable and method names
- Organizing logging output produced when the user runs Spark on Yarn
- Adding extensive documentation
- Privatizing classes where possible
I have tested the stable API on a Hadoop 2.4 cluster. I tested submitting a jar that references classes in other jars in both client and cluster mode. I also made changes in the alpha API, though I do not have access to an alpha cluster. I have verified that it compiles, but it would be ideal if others can help test it.
For those interested in some examples in detail, please read on.
--------------------------------------------------------------------------------------------------------
***Appendix***
- The loop to `getApplicationReport` from the RM is duplicated in 4 places: in the stable `Client`, alpha `Client`, and twice in `YarnClientSchedulerBackend`. We should not have different loops for client and cluster deploy modes.
- There are many fragmented small helper methods that are only used once and should just be inlined. For instance, `ClientBase#getLocalPath` returns `null` on certain conditions, and its only caller `ClientBase#addFileToClasspath` checks whether the value returned is `null`. We could just have the caller check on that same condition to avoid passing `null`s around.
- In `YarnSparkHadoopUtil#addToEnvironment`, we take in an argument `classpathSeparator` that always has the same value upstream (i.e. `File.pathSeparator`). This argument is now removed from the signature and from all callers of this method upstream (see the sketch after this list).
- `ClientBase#copyRemoteFile` is now renamed to `copyFileToRemote`. It was unclear whether we are copying a remote file to our local file system, or copying a locally visible file to a remote file system. Also, even the content of the method has inaccurately named variables. We use `val remoteFs` to signify the file system of the locally visible file and `val fs` to signify the remote, destination file system. These are now renamed `srcFs` and `destFs` respectively.
- We currently log the AM container's environment and resource mappings directly as Scala collections. This is incredibly hard to read and probably too verbose for the average Spark user. In other modes (e.g. standalone), we also don't log the launch commands by default, so the logging level of this information is now set to `DEBUG`.
- None of these classes (`Client`, `ClientBase`, `YarnSparkHadoopUtil` etc.) is intended to be used by a Spark application (the user should go through Spark submit instead). At the very least they should be `private[spark]`.
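A simplified before/after sketch of the `addToEnvironment` change described above:
```scala
import java.io.File
import scala.collection.mutable

// Before, callers had to pass a classpathSeparator that was always
// File.pathSeparator; after, the separator is fixed inside the method.
def addToEnvironment(env: mutable.HashMap[String, String], key: String, value: String): Unit = {
  env(key) = env.get(key) match {
    case Some(existing) => existing + File.pathSeparator + value
    case None => value
  }
}
```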
Author: Andrew Or <andrewor14@gmail.com>
Closes #2350 from andrewor14/yarn-cleanup and squashes the following commits:
39e8c7b [Andrew Or] Address review comments
6619f9b [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-cleanup
2ca6d64 [Andrew Or] Improve logging in application monitor
a3b9693 [Andrew Or] Minor changes
7dd6298 [Andrew Or] Simplify ClientBase#monitorApplication
547487c [Andrew Or] Provide default values for null application report entries
a0ad1e9 [Andrew Or] Fix class not found error
1590141 [Andrew Or] Address review comments
45ccdea [Andrew Or] Remove usages of getAMMemory
d8e33b6 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-cleanup
ed0b42d [Andrew Or] Fix alpha compilation error
c0587b4 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-cleanup
6d74888 [Andrew Or] Minor comment changes
6573c1d [Andrew Or] Clean up, simplify and document code for setting classpaths
e4779b6 [Andrew Or] Clean up log messages + variable naming in ClientBase
8766d37 [Andrew Or] Heavily add documentation to Client* classes + various clean-ups
6c94d79 [Andrew Or] Various cleanups in ClientBase and ClientArguments
ef7069a [Andrew Or] Clean up YarnClientSchedulerBackend more
6de9072 [Andrew Or] Guard against potential NPE in debug logging mode
fabe4c4 [Andrew Or] Reuse more code in YarnClientSchedulerBackend
3f941dc [Andrew Or] First cut at simplifying the Client (stable and alpha)
Assign the value of the YARN container log directory to the Java opt `spark.yarn.app.container.log.dir`, so a user-defined log4j.properties can reference this value and write logs to the YARN container's log directory.
Otherwise, a user-defined file appender will only write to the container's CWD; log files in the CWD will not be displayed on the YARN UI, and they cannot be aggregated to the HDFS log directory after the job finishes.
User defined log4j.properties reference example:
```
log4j.appender.rolling_file.File = ${spark.yarn.app.container.log.dir}/spark.log
```
Author: peng.zhang <peng.zhang@xiaomi.com>
Closes #1573 from renozhang/yarn-log-dir and squashes the following commits:
16c5cb8 [peng.zhang] Update doc
f2b5e2a [peng.zhang] Change variable's name, and update running-on-yarn.md
503ea2d [peng.zhang] Support log4j log to yarn container dir
GraphGenerators.sampleLogNormal is supposed to return an integer strictly less than maxVal. However, it violates this guarantee. It generates its return value as follows:
```scala
var X: Double = maxVal
while (X >= maxVal) {
  val Z = rand.nextGaussian()
  X = math.exp(mu + sigma*Z)
}
math.round(X.toFloat)
```
When X is sampled to be close to (but less than) maxVal, then it will pass the while loop condition, but the rounded result will be equal to maxVal, which will violate the guarantee. For example, if maxVal is 5 and X is 4.9, then X < maxVal, but `math.round(X.toFloat)` is 5.
This PR instead rounds X before checking the loop condition, guaranteeing that the condition will hold for the return value.
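Per the newest commit in the list ("Go back to math.floor"), the final version rounds down; a minimal sketch of why flooring restores the strict bound:
```scala
import scala.util.Random

// Hedged sketch: because X < maxVal at loop exit and maxVal is an Int,
// math.floor(X) <= maxVal - 1, so the strict upper bound now holds.
def sampleLogNormal(mu: Double, sigma: Double, maxVal: Int, rand: Random): Int = {
  var x: Double = maxVal
  while (x >= maxVal) {
    val z = rand.nextGaussian()
    x = math.exp(mu + sigma * z)
  }
  math.floor(x).toInt
}
```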
Author: Ankur Dave <ankurdave@gmail.com>
Closes #2439 from ankurdave/SPARK-3578 and squashes the following commits:
f6655e5 [Ankur Dave] Go back to math.floor
5900c22 [Ankur Dave] Round X in loop condition
6fd5fb1 [Ankur Dave] Run sampleLogNormal bounds check 1000 times
1638598 [Ankur Dave] Round down in sampleLogNormal to guarantee upper bound
Author: Grega Kespret <grega.kespret@gmail.com>
Closes #2479 from gregakespret/patch-1 and squashes the following commits:
dd6b90a [Grega Kespret] Update docs to use jsonRDD instead of wrong jsonRdd.
Author: RJ Nowling <rnowling@gmail.com>
Closes #2459 from rnowling/tfidf-fix and squashes the following commits:
b370a91 [RJ Nowling] Fix variable name misspelling in MLLib Feature Extraction guide
Ignore .idea_modules; ```sbt/sbt gen-idea``` generates this directory.
Author: wangfei <wangfei1@huawei.com>
Closes #2476 from scwf/patch-4 and squashes the following commits:
e6ab88a [wangfei] ignore .idea_modules
Addresses the issue in https://issues.apache.org/jira/browse/SPARK-3595, namely saveAsHadoopFile hardcoding the OutputCommitter. This is not ideal when running Spark jobs that write to S3, especially when running them from an EMR cluster where the default OutputCommitter is a DirectOutputCommitter.
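A usage sketch of the new hook; `FileOutputCommitter` stands in for whatever committer you would plug in (e.g. an S3-friendly DirectOutputCommitter):
```scala
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapred.{FileOutputCommitter, JobConf, TextOutputFormat}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("committer"))
val rdd = sc.parallelize(Seq(("k", "v"))).map { case (k, v) => (new Text(k), new Text(v)) }

val jobConf = new JobConf(sc.hadoopConfiguration)
// Swap in a custom OutputCommitter here instead of relying on the hardcoded one.
jobConf.setOutputCommitter(classOf[FileOutputCommitter])
rdd.saveAsHadoopFile("/tmp/committer-out", classOf[Text], classOf[Text],
  classOf[TextOutputFormat[Text, Text]], jobConf)
```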
Author: Ian Hummel <ian@themodernlife.net>
Closes #2450 from themodernlife/spark-3595 and squashes the following commits:
f37a0e5 [Ian Hummel] Update based on comments from pwendell
a11d9f3 [Ian Hummel] Fix formatting
4359664 [Ian Hummel] Add an example showing usage
8b6be94 [Ian Hummel] Add ability to specify OutputCommitter, especially useful when writing to an S3 bucket from an EMR cluster
This commit exists to close the following pull requests on Github:
Closes #1328 (close requested by 'pwendell')
Closes #2314 (close requested by 'pwendell')
Closes #997 (close requested by 'pwendell')
Closes #550 (close requested by 'pwendell')
Closes #1506 (close requested by 'pwendell')
Closes #2423 (close requested by 'mengxr')
Closes #554 (close requested by 'joshrosen')
https://issues.apache.org/jira/browse/SPARK-3599
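Per the commit log ("use lazy val instead"), a minimal sketch of the caching idea (the file path is illustrative):
```scala
import java.io.FileInputStream
import java.util.Properties
import scala.collection.JavaConverters._

// Parse the defaults file once on first access and reuse the result,
// instead of re-reading it from disk every time.
lazy val defaultSparkProperties: Map[String, String] = {
  val props = new Properties()
  val in = new FileInputStream("conf/spark-defaults.conf")
  try props.load(in) finally in.close()
  props.asScala.toMap
}
```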
Author: WangTao <barneystinson@aliyun.com>
Author: WangTaoTheTonic <barneystinson@aliyun.com>
Closes #2454 from WangTaoTheTonic/avoidLoadingFrequently and squashes the following commits:
3681182 [WangTao] do not use clone
7dca036 [WangTao] use lazy val instead
2a79f26 [WangTaoTheTonic] Avoid loading properties file frequently
**This PR introduces a subtle change in semantics for HiveContext when using the results in Python or Scala. Specifically, while resolution remains case insensitive, it is now case preserving.**
_This PR is a follow up to #2293 (and to a lesser extent #2262 and #2334)._
In #2293 the catalog was changed to store analyzed logical plans instead of unresolved ones. While this change fixed the reported bug (which was caused by yet another instance of us forgetting to put in a `LowerCaseSchema` operator), it had the consequence of breaking assumptions made by `MultiInstanceRelation`. Specifically, we can't swap out leaf operators in a tree without rewriting changed expression ids (which happens when you self join the same RDD that has been registered as a temp table).
In this PR, I instead remove the need to insert `LowerCaseSchema` operators at all, by moving the concern of matching up identifiers completely into analysis. Doing so allows the test cases from both #2293 and #2262 to pass at the same time (and likely fixes a slew of other "unknown unknown" bugs).
While it is rolled back in this PR, storing the analyzed plan might actually be a good idea. For instance, it is kind of confusing if you register a temporary table, change the case sensitivity of resolution and now you can't query that table anymore. This can be addressed in a follow up PR.
Follow-ups:
- Configurable case sensitivity
- Consider storing analyzed plans for temp tables
Author: Michael Armbrust <michael@databricks.com>
Closes #2382 from marmbrus/lowercase and squashes the following commits:
c21171e [Michael Armbrust] Ensure the resolver is used for field lookups and ensure that case insensitive resolution is still case preserving.
d4320f1 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into lowercase
2de881e [Michael Armbrust] Address comments.
219805a [Michael Armbrust] style
5b93711 [Michael Armbrust] Replace LowerCaseSchema with Resolver.
This helps to replace shuffled hash joins with broadcast hash joins in some cases.
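A sketch of the estimation idea (the per-type widths mirror fixed-size "native" types; the numbers are illustrative):
```scala
// A LIMIT over columns whose types have fixed widths has a computable
// upper bound on output size, so the planner can decide whether the
// result fits under the auto-broadcast threshold.
def limitSizeInBytes(limit: Int, columnWidths: Seq[Int]): Long =
  limit.toLong * columnWidths.sum

limitSizeInBytes(100, Seq(4, 8))  // 100 rows of (Int, Long): 1200 bytes
```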
Author: Cheng Lian <lian.cs.zju@gmail.com>
Closes #2468 from liancheng/more-stats and squashes the following commits:
32687dc [Cheng Lian] Moved the test case to PlannerSuite
5595a91 [Cheng Lian] Removes debugging code
73faf69 [Cheng Lian] Test case for auto choosing broadcast hash join
f30fe1d [Cheng Lian] Adds sizeInBytes estimation for Limit when all output types are native types
The included test waits 100 ms after job completion for task completion events to come in so it can verify they have reasonable finish times. Does anyone know a better way to wait on listener events that are expected to come in?
Author: Sandy Ryza <sandy@cloudera.com>
Closes #2440 from sryza/sandy-spark-3574 and squashes the following commits:
c81439b [Sandy Ryza] Fix test failure
b340956 [Sandy Ryza] SPARK-3574. Remove shuffleFinishTime metric
Author: Matthew Farrellee <matt@redhat.com>
Closes #2467 from mattf/master-pyspark-remove-numslices-from-tests and squashes the following commits:
c49a87b [Matthew Farrellee] [PySpark] remove unnecessary use of numSlices from pyspark tests
"val conf" was used instead of "SparkConf conf" in Java snippet.
Author: Santiago M. Mola <santi@mola.io>
Closes #2472 from smola/patch-1 and squashes the following commits:
5bfeb9b [Santiago M. Mola] Fix Java example in Streaming Programming Guide
Author: Vida Ha <vida@databricks.com>
Closes #2466 from vidaha/vida/spark-3608 and squashes the following commits:
9509776 [Vida Ha] Break if the instance tag naming succeeds
This is now supported!
Author: andrewor14 <andrewor14@gmail.com>
Author: Andrew Or <andrewor14@gmail.com>
Closes #2461 from andrewor14/document-standalone-cluster and squashes the following commits:
85c8b9e [andrewor14] Wording change per Patrick
35e30ee [Andrew Or] Fix outdated docs for standalone cluster
Simple mistake, simple fix:
```shell
args="arg1 arg2 arg3"
sbt $args # sbt sees 3 arguments
sbt "$args" # sbt sees 1 argument
```
Should fix the problems we are seeing [here](https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/694/AMPLAB_JENKINS_BUILD_PROFILE=hadoop1.0,label=centos/console), for example.
Author: Nicholas Chammas <nicholas.chammas@gmail.com>
Closes #2462 from nchammas/fix-sbt-master-build and squashes the following commits:
4500c86 [Nicholas Chammas] warn about quoting
10018a6 [Nicholas Chammas] Revert "test hadoop1 build"
7d5356c [Nicholas Chammas] Revert "re-add bad quoting for testing"
061600c [Nicholas Chammas] re-add bad quoting for testing
b2de56c [Nicholas Chammas] test hadoop1 build
43fb854 [Nicholas Chammas] unquote profile args
This is just another solution to SPARK-3485, in addition to PR #2355.
In this patch, we use ConventionHelper and FunctionRegistry to invoke simple UDF evaluation, which relies more on Hive but is much cleaner and safer.
We can discuss which one is better.
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Closes #2407 from adrian-wang/simpleudf and squashes the following commits:
15762d2 [Daoyuan Wang] add posmod test which would fail the test but now ok
0d69eb4 [Daoyuan Wang] another way to pass to hive simple udf
Author: Sandy Ryza <sandy@cloudera.com>
Closes #2460 from sryza/sandy-spark-3605 and squashes the following commits:
09d940b [Sandy Ryza] SPARK-3605. Fix typo in SchemaRDD.
Fix the issue when calling applySchema() on an RDD of Row.
Also add type mapping for BinaryType.
Author: Davies Liu <davies.liu@gmail.com>
Closes #2448 from davies/row and squashes the following commits:
dd220cf [Davies Liu] fix test
3f3f188 [Davies Liu] add more test
f559746 [Davies Liu] add tests, fix serialization
9688fd2 [Davies Liu] support applySchema to RDD of Row
This feature allows the user to cache a table from a select query.
Example : ```CACHE TABLE testCacheTable AS SELECT * FROM TEST_TABLE```
Spark treats this type of SQL as a command and does lazy caching, just like ```SQLContext.cacheTable``` and ```CACHE TABLE <name>``` do.
It can be executed from both SQLContext and HiveContext.
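A usage sketch (assumes an existing SparkContext `sc` and a registered table `TEST_TABLE`):
```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
// Lazily caches the query result, like SQLContext.cacheTable would.
sqlContext.sql("CACHE TABLE testCacheTable AS SELECT * FROM TEST_TABLE")
// The cache is materialized on first access.
sqlContext.sql("SELECT * FROM testCacheTable").collect()
```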
Recreated the pull request after rebasing with master, and fixed all the comments raised in the previous pull requests.
https://github.com/apache/spark/pull/2381
https://github.com/apache/spark/pull/2390
Author : ravipesala ravindra.pesala@huawei.com
Author: ravipesala <ravindra.pesala@huawei.com>
Closes #2397 from ravipesala/SPARK-2594 and squashes the following commits:
a5f0beb [ravipesala] Simplified the code as per Admin comment.
8059cd2 [ravipesala] Changed the behaviour from eager caching to lazy caching.
d6e469d [ravipesala] Code review comments by Admin are handled.
c18aa38 [ravipesala] Merge remote-tracking branch 'remotes/ravipesala/Add-Cache-table-as' into SPARK-2594
394d5ca [ravipesala] Changed style
fb1759b [ravipesala] Updated as per Admin comments
8c9993c [ravipesala] Changed the style
d8b37b2 [ravipesala] Updated as per the comments by Admin
bc0bffc [ravipesala] Merge remote-tracking branch 'ravipesala/Add-Cache-table-as' into Add-Cache-table-as
e3265d0 [ravipesala] Updated the code as per the comments by Admin in pull request.
724b9db [ravipesala] Changed style
aaf5b59 [ravipesala] Added comment
dc33895 [ravipesala] Updated parser to support add cache table command
b5276b2 [ravipesala] Updated parser to support add cache table command
eebc0c1 [ravipesala] Add CACHE TABLE <name> AS SELECT ...
6758f80 [ravipesala] Changed style
7459ce3 [ravipesala] Added comment
13c8e27 [ravipesala] Updated parser to support add cache table command
4e858d8 [ravipesala] Updated parser to support add cache table command
b803fc8 [ravipesala] Add CACHE TABLE <name> AS SELECT ...
When executing a query like:
```
select datediff(cast(value as timestamp), cast('2002-03-21 00:00:00' as timestamp)) from src;
```
SparkSQL will raise an exception:
```
[info] scala.MatchError: TimestampType (of class org.apache.spark.sql.catalyst.types.TimestampType$)
[info] at org.apache.spark.sql.catalyst.expressions.Cast.castToTimestamp(Cast.scala:77)
[info] at org.apache.spark.sql.catalyst.expressions.Cast.cast$lzycompute(Cast.scala:251)
[info] at org.apache.spark.sql.catalyst.expressions.Cast.cast(Cast.scala:247)
[info] at org.apache.spark.sql.catalyst.expressions.Cast.eval(Cast.scala:263)
[info] at org.apache.spark.sql.catalyst.optimizer.ConstantFolding$$anonfun$apply$5$$anonfun$applyOrElse$2.applyOrElse(Optimizer.scala:217)
[info] at org.apache.spark.sql.catalyst.optimizer.ConstantFolding$$anonfun$apply$5$$anonfun$applyOrElse$2.applyOrElse(Optimizer.scala:210)
[info] at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144)
[info] at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4$$anonfun$apply$2.apply(TreeNode.scala:180)
[info] at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
[info] at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
```
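A sketch over a toy AST of the "no-op for Cast" idea from the commit log: casting an expression to the type it already has should be the identity rather than falling through to a MatchError:
```scala
sealed trait Expr { def dataType: String }
case class Literal(value: Any, dataType: String) extends Expr
case class Cast(child: Expr, dataType: String) extends Expr

// Identical-type casts are eliminated before they can hit the
// unhandled branch in the cast evaluator.
def simplifyCast(e: Expr): Expr = e match {
  case Cast(child, t) if child.dataType == t => child
  case other => other
}

simplifyCast(Cast(Literal("2002-03-21 00:00:00", "timestamp"), "timestamp"))
// => Literal("2002-03-21 00:00:00", "timestamp")
```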
Author: Cheng Hao <hao.cheng@intel.com>
Closes #2368 from chenghao-intel/cast_exception and squashes the following commits:
5c9c3a5 [Cheng Hao] make more clear code
49dfc50 [Cheng Hao] Add no-op for Cast and revert the position of SimplifyCasts
b804abd [Cheng Hao] Add unit test to show the failure in identical data type casting
330a5c8 [Cheng Hao] Update Code based on comments
b834ed4 [Cheng Hao] Fix bug of HiveSimpleUDF with unnecessary type cast which cause exception in constant folding
Currently, we serialize the data between the JVM and Python manually, case by case; this cannot scale to support so many APIs in MLlib.
This patch tries to address the problem by serializing the data using the pickle protocol, with the Pyrolite library handling serialization/deserialization in the JVM. The pickle protocol can be easily extended to support customized classes.
All the modules are refactored to use this protocol.
Known issues: there will be some performance regression (both CPU and memory; the serialized data is larger).
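For context, a minimal sketch of the round-trip on the JVM side using Pyrolite's Pickler/Unpickler:
```scala
import net.razorvine.pickle.{Pickler, Unpickler}

// Pyrolite writes the same byte format Python's pickle module reads,
// so one protocol can carry data for every MLlib API.
val bytes: Array[Byte] = new Pickler().dumps(Array(1.0, 2.0, 3.0))
val roundTripped = new Unpickler().loads(bytes)
```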
Author: Davies Liu <davies.liu@gmail.com>
Closes #2378 from davies/pickle_mllib and squashes the following commits:
dffbba2 [Davies Liu] Merge branch 'master' of github.com:apache/spark into pickle_mllib
810f97f [Davies Liu] fix equal of matrix
032cd62 [Davies Liu] add more type check and conversion for user_product
bd738ab [Davies Liu] address comments
e431377 [Davies Liu] fix cache of rdd, refactor
19d0967 [Davies Liu] refactor Picklers
2511e76 [Davies Liu] cleanup
1fccf1a [Davies Liu] address comments
a2cc855 [Davies Liu] fix tests
9ceff73 [Davies Liu] test size of serialized Rating
44e0551 [Davies Liu] fix cache
a379a81 [Davies Liu] fix pickle array in python2.7
df625c7 [Davies Liu] Merge commit '154d141' into pickle_mllib
154d141 [Davies Liu] fix autobatchedpickler
44736d7 [Davies Liu] speed up pickling array in Python 2.7
e1d1bfc [Davies Liu] refactor
708dc02 [Davies Liu] fix tests
9dcfb63 [Davies Liu] fix style
88034f0 [Davies Liu] refactor, address comments
46a501e [Davies Liu] choose batch size automatically
df19464 [Davies Liu] memorize the module and class name during pickling
f3506c5 [Davies Liu] Merge branch 'master' into pickle_mllib
722dd96 [Davies Liu] cleanup _common.py
0ee1525 [Davies Liu] remove outdated tests
b02e34f [Davies Liu] remove _common.py
84c721d [Davies Liu] Merge branch 'master' into pickle_mllib
4d7963e [Davies Liu] remove manual serialization
6d26b03 [Davies Liu] fix tests
c383544 [Davies Liu] classification
f2a0856 [Davies Liu] mllib/regression
d9f691f [Davies Liu] mllib/util
cccb8b1 [Davies Liu] mllib/tree
8fe166a [Davies Liu] Merge branch 'pickle' into pickle_mllib
aa2287e [Davies Liu] random
f1544c4 [Davies Liu] refactor clustering
52d1350 [Davies Liu] use new protocol in mllib/stat
b30ef35 [Davies Liu] use pickle to serialize data for mllib/recommendation
f44f771 [Davies Liu] enable tests about array
3908f5c [Davies Liu] Merge branch 'master' into pickle
c77c87b [Davies Liu] cleanup debugging code
60e4e2f [Davies Liu] support unpickle array.array for Python 2.6
Author: Matthew Farrellee <matt@redhat.com>
Closes #2304 from mattf/SPARK-1701-partition-over-slice-for-python-examples and squashes the following commits:
928a581 [Matthew Farrellee] [SPARK-1701] [PySpark] remove slice terminology from python examples
This is a partial solution to SPARK-1701, addressing only the documentation confusion.
Additional work would be to actually change the numSlices parameter name across languages, with care required for Scala and Python to maintain backward compatibility for named parameters.
Author: Matthew Farrellee <matt@redhat.com>
Closes #2305 from mattf/SPARK-1701 and squashes the following commits:
c0af05d [Matthew Farrellee] Further tweak
06f80fc [Matthew Farrellee] Wording tweak from Josh Rosen's review
7b045e0 [Matthew Farrellee] [SPARK-1701] Clarify slice vs partition in the programming guide
This commit exists to close the following pull requests on Github:
Closes #726 (close requested by 'pwendell')
Closes #151 (close requested by 'pwendell')
VertexRDD.apply had a bug where it ignored the merge function for
duplicate vertices and instead used whichever vertex attribute occurred
first. This commit fixes the bug by passing the merge function through
to ShippableVertexPartition.apply, which merges any duplicates using the
merge function and then fills in missing vertices using the specified
default vertex attribute. This commit also adds a unit test for
VertexRDD.apply.
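A plain-Scala sketch of the intended semantics for duplicate vertex ids:
```scala
val input = Seq((1L, "a"), (1L, "b"), (2L, "c"))
val mergeFunc = (a: String, b: String) => a + b

// With the fix, duplicates are combined with mergeFunc; previously
// VertexRDD.apply silently kept whichever attribute came first.
val merged = input.groupBy(_._1).mapValues(_.map(_._2).reduce(mergeFunc))
// merged: Map(1L -> "ab", 2L -> "c")
```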
Author: Larry Xiao <xiaodi@sjtu.edu.cn>
Author: Blie Arkansol <xiaodi@sjtu.edu.cn>
Author: Ankur Dave <ankurdave@gmail.com>
Closes #1903 from larryxiao/2062 and squashes the following commits:
625aa9d [Blie Arkansol] Merge pull request #1 from ankurdave/SPARK-2062
476770b [Ankur Dave] ShippableVertexPartition.initFrom: Don't run mergeFunc on default values
614059f [Larry Xiao] doc update: note about the default null value vertices construction
dfdb3c9 [Larry Xiao] minor fix
1c70366 [Larry Xiao] scalastyle check: wrap line, parameter list indent 4 spaces
e4ca697 [Larry Xiao] [TEST] VertexRDD.apply mergeFunc
6a35ea8 [Larry Xiao] [TEST] VertexRDD.apply mergeFunc
4fbc29c [Blie Arkansol] undo unnecessary change
efae765 [Larry Xiao] fix mistakes: should be able to call with or without mergeFunc
b2422f9 [Larry Xiao] Merge branch '2062' of github.com:larryxiao/spark into 2062
52dc7f7 [Larry Xiao] pass mergeFunc to VertexPartitionBase, where merge is handled
581e9ee [Larry Xiao] TODO: VertexRDDSuite
20d80a3 [Larry Xiao] [SPARK-2062][GraphX] VertexRDD.apply does not use the mergeFunc
Local `SparseMatrix` support added in Compressed Column Storage (CCS) format, in addition to Level-2 and Level-3 BLAS operations such as dgemv and dgemm respectively.
BLAS doesn't support sparse matrix operations, so `SparseMatrix`-`DenseMatrix` and `SparseMatrix`-`DenseVector` multiplication implementations have been added. I will post performance comparisons in the comments momentarily.
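A sketch of the CCS layout using the new class (constructor arguments per this PR; treat the multiply call as illustrative):
```scala
import org.apache.spark.mllib.linalg.{DenseVector, SparseMatrix}

// The 2x2 matrix [[1.0, 0.0], [0.0, 2.0]] in Compressed Column Storage:
// values are stored column by column; column j occupies
// values(colPtrs(j) until colPtrs(j + 1)).
val sm = new SparseMatrix(2, 2, Array(0, 1, 2), Array(0, 1), Array(1.0, 2.0))
sm.multiply(new DenseVector(Array(1.0, 1.0)))  // DenseVector(1.0, 2.0)
```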
Author: Burak <brkyvz@gmail.com>
Closes #2294 from brkyvz/SPARK-3418 and squashes the following commits:
88814ed [Burak] Hopefully fixed MiMa this time
47e49d5 [Burak] really fixed MiMa issue
f0bae57 [Burak] [SPARK-3418] Fixed MiMa compatibility issues (excluded from check)
4b7dbec [Burak] 9/17 comments addressed
7af2f83 [Burak] sealed traits Vector and Matrix
d3a8a16 [Burak] [SPARK-3418] Squashed missing alpha bug.
421045f [Burak] [SPARK-3418] New code review comments addressed
f35a161 [Burak] [SPARK-3418] Code review comments addressed and multiplication further optimized
2508577 [Burak] [SPARK-3418] Fixed one more style issue
d16e8a0 [Burak] [SPARK-3418] Fixed style issues and added documentation for methods
204a3f7 [Burak] [SPARK-3418] Fixed failing Matrix unit test
6025297 [Burak] [SPARK-3418] Fixed Scala-style errors
dc7be71 [Burak] [SPARK-3418][MLlib] Matrix unit tests expanded with indexing and updating
d2d5851 [Burak] [SPARK-3418][MLlib] Sparse Matrix support and additional native BLAS operations added
Py4J cannot handle large strings efficiently, so we should use broadcast for large closures automatically. (Broadcast uses the local filesystem to pass the data through.)
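A hedged sketch of the auto-broadcast idea (the threshold and names are illustrative, not the actual cutoff):
```scala
import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast

val threshold = 1 << 20  // 1 MB; placeholder, not the actual cutoff

// Small pickled commands go straight over Py4J; large ones are shipped
// through a broadcast, which moves data via the local filesystem.
def wrapCommand(sc: SparkContext, pickled: Array[Byte]): Either[Array[Byte], Broadcast[Array[Byte]]] =
  if (pickled.length <= threshold) Left(pickled)
  else Right(sc.broadcast(pickled))
```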
Author: Davies Liu <davies.liu@gmail.com>
Closes #2417 from davies/command and squashes the following commits:
fbf4e97 [Davies Liu] bugfix
aefd508 [Davies Liu] use broadcast automatically for large closure
This was introduced in #2449
Author: Andrew Or <andrewor14@gmail.com>
Closes #2452 from andrewor14/standalone-hot-fix and squashes the following commits:
d5190ca [Andrew Or] Put that line in the right place
Author: Victsm <victor.nju@gmail.com>
Author: Min Shen <mshen@linkedin.com>
Closes #2449 from Victsm/SPARK-3560 and squashes the following commits:
918405a [Victsm] Removed the additional space
4502a2a [Min Shen] [SPARK-3560] Fixed setting spark.jars system property in yarn-cluster mode.
(cherry picked from commit 832dff64dd)
Signed-off-by: Andrew Or <andrewor14@gmail.com>
https://issues.apache.org/jira/browse/SPARK-3589
"export CLASSPATH" in spark-class is redundant since same variable is exported before.
We could reuse defined value "isYarnCluster" in SparkSubmit.scala.
Author: WangTaoTheTonic <barneystinson@aliyun.com>
Closes #2445 from WangTaoTheTonic/removeRedundant and squashes the following commits:
6fb6872 [WangTaoTheTonic] remove redundant code
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes #2426 from sarutak/emacs-metafiles-ignore and squashes the following commits:
a306020 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into emacs-metafiles-ignore
6a0a5eb [Kousuke Saruta] Added cmd file entry to .rat-excludes and .gitignore
897da63 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into emacs-metafiles-ignore
8cade06 [Kousuke Saruta] Modified .gitignore to ignore emacs lock file and backup file
This patch makes some small changes to fix this problem:
1. We document specific versions of Jekyll/Kramdown to use that match
those used when building the upstream docs.
2. We add a configuration for a property that for some reason varies across
packages of Jekyll/Kramdown even with the same version.
Author: Patrick Wendell <pwendell@gmail.com>
Closes #2443 from pwendell/jekyll and squashes the following commits:
54ee2ab [Patrick Wendell] SPARK-3579 Jekyll doc generation is different across environments.
...on
As an improvement on https://github.com/apache/spark/pull/1944, we should use a more specific exit code to represent ClassNotFoundException.
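Per the commit log ("use 101 instead"), a sketch of the dedicated exit code:
```scala
val CLASS_NOT_FOUND_EXIT_STATUS = 101  // instead of a generic 1

def loadMainClass(name: String): Class[_] =
  try Class.forName(name)
  catch {
    case e: ClassNotFoundException =>
      System.err.println(s"Class not found: $name")
      sys.exit(CLASS_NOT_FOUND_EXIT_STATUS)
  }
```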
Author: WangTaoTheTonic <barneystinson@aliyun.com>
Closes #2421 from WangTaoTheTonic/classnotfoundExitCode and squashes the following commits:
645a22a [WangTaoTheTonic] Several typos to trigger Jenkins
d6ae559 [WangTaoTheTonic] use 101 instead
a2d6465 [WangTaoTheTonic] use 127 instead
fbb232f [WangTaoTheTonic] Using a special exit code instead of 1 to represent ClassNotFoundException
Author: GuoQiang Li <witgo@qq.com>
Closes #2326 from witgo/rat-excludes and squashes the following commits:
860904e [GuoQiang Li] rat exclude dependency-reduced-pom.xml