Implemented almost all math functions found in scala.math (max, min and abs were already present).
cc mengxr marmbrus
Author: Burak Yavuz <brkyvz@gmail.com>
Closes#5616 from brkyvz/math-udfs and squashes the following commits:
fb27153 [Burak Yavuz] reverted exception message
836a098 [Burak Yavuz] fixed test and addressed small comment
e5f0d13 [Burak Yavuz] addressed code review v2.2
b26c5fb [Burak Yavuz] addressed review v2.1
2761f08 [Burak Yavuz] addressed review v2
6588a5b [Burak Yavuz] fixed merge conflicts
b084e10 [Burak Yavuz] Addressed code review
029e739 [Burak Yavuz] fixed atan2 test
534cc11 [Burak Yavuz] added more tests, addressed comments
fa68dbe [Burak Yavuz] added double specific test data
937d5a5 [Burak Yavuz] use doubles instead of ints
8e28fff [Burak Yavuz] Added apache header
7ec8f7f [Burak Yavuz] Added math functions for DataFrames
`HeartbeatReceiver` will call `TaskScheduler.executorHeartbeatReceived`, which is a blocking operation because `TaskScheduler.executorHeartbeatReceived` will call
```Scala
blockManagerMaster.driverEndpoint.askWithReply[Boolean](
BlockManagerHeartbeat(blockManagerId), 600 seconds)
```
finally. Even if it asks from a local Actor, it may block the current Akka thread. E.g., the reply may be dispatched to the same thread of the ask operation. So the reply cannot be processed. An extreme case is setting the thread number of Akka dispatch thread pool to 1.
jstack log:
```
"sparkDriver-akka.actor.default-dispatcher-14" daemon prio=10 tid=0x00007f2a8c02d000 nid=0x725 waiting on condition [0x00007f2b1d6d0000]
java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00000006197a0868> (a scala.concurrent.impl.Promise$CompletionLatch)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1033)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1326)
at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208)
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at akka.dispatch.MonitorableThreadFactory$AkkaForkJoinWorkerThread$$anon$3.block(ThreadPoolBuilder.scala:169)
at scala.concurrent.forkjoin.ForkJoinPool.managedBlock(ForkJoinPool.java:3640)
at akka.dispatch.MonitorableThreadFactory$AkkaForkJoinWorkerThread.blockOn(ThreadPoolBuilder.scala:167)
at scala.concurrent.Await$.result(package.scala:107)
at org.apache.spark.rpc.RpcEndpointRef.askWithReply(RpcEnv.scala:355)
at org.apache.spark.scheduler.DAGScheduler.executorHeartbeatReceived(DAGScheduler.scala:169)
at org.apache.spark.scheduler.TaskSchedulerImpl.executorHeartbeatReceived(TaskSchedulerImpl.scala:367)
at org.apache.spark.HeartbeatReceiver$$anonfun$receiveAndReply$1.applyOrElse(HeartbeatReceiver.scala:103)
at org.apache.spark.rpc.akka.AkkaRpcEnv.org$apache$spark$rpc$akka$AkkaRpcEnv$$processMessage(AkkaRpcEnv.scala:182)
at org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1$$anonfun$receiveWithLogging$1$$anonfun$applyOrElse$4.apply$mcV$sp(AkkaRpcEnv.scala:128)
at org.apache.spark.rpc.akka.AkkaRpcEnv.org$apache$spark$rpc$akka$AkkaRpcEnv$$safelyCall(AkkaRpcEnv.scala:203)
at org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1$$anonfun$receiveWithLogging$1.applyOrElse(AkkaRpcEnv.scala:127)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:59)
at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:42)
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
at org.apache.spark.util.ActorLogReceive$$anon$1.applyOrElse(ActorLogReceive.scala:42)
at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
at org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1.aroundReceive(AkkaRpcEnv.scala:94)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
at akka.dispatch.Mailbox.run(Mailbox.scala:220)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
```
This PR moved this blocking operation to a separated thread.
Author: zsxwing <zsxwing@gmail.com>
Closes#5723 from zsxwing/SPARK-7174 and squashes the following commits:
98bfe48 [zsxwing] Use a single thread for checking timeout and reporting executorHeartbeatReceived
5b3b545 [zsxwing] Move calling `TaskScheduler.executorHeartbeatReceived` to another thread to avoid blocking the Akka thread pool
jira: https://issues.apache.org/jira/browse/SPARK-7090
LDA was implemented with extensibility in mind. And with the development of OnlineLDA and Gibbs Sampling, we are collecting more detailed requirements from different algorithms.
As Joseph Bradley jkbradley proposed in https://github.com/apache/spark/pull/4807 and with some further discussion, we'd like to adjust the code structure a little to present the common interface and extension point clearly.
Basically class LDA would be a common entrance for LDA computing. And each LDA object will refer to a LDAOptimizer for the concrete algorithm implementation. Users can customize LDAOptimizer with specific parameters and assign it to LDA.
Concrete changes:
1. Add a trait `LDAOptimizer`, which defines the common iterface for concrete implementations. Each subClass is a wrapper for a specific LDA algorithm.
2. Move EMOptimizer to file LDAOptimizer and inherits from LDAOptimizer, rename to EMLDAOptimizer. (in case a more generic EMOptimizer comes in the future)
-adjust the constructor of EMOptimizer, since all the parameters should be passed in through initialState method. This can avoid unwanted confusion or overwrite.
-move the code from LDA.initalState to initalState of EMLDAOptimizer
3. Add property ldaOptimizer to LDA and its getter/setter, and EMLDAOptimizer is the default Optimizer.
4. Change the return type of LDA.run from DistributedLDAModel to LDAModel.
Further work:
add OnlineLDAOptimizer and other possible Optimizers once ready.
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes#5661 from hhbyyh/ldaRefactor and squashes the following commits:
0e2e006 [Yuhao Yang] respond to review comments
08a45da [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaRefactor
e756ce4 [Yuhao Yang] solve mima exception
d74fd8f [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaRefactor
0bb8400 [Yuhao Yang] refactor LDA with Optimizer
ec2f857 [Yuhao Yang] protoptype for discussion
Remove use of commons-lang in favor of commons-lang3 classes; remove commons-io use in favor of Guava
Author: Sean Owen <sowen@cloudera.com>
Closes#5703 from srowen/SPARK-7145 and squashes the following commits:
21fbe03 [Sean Owen] Remove use of commons-lang in favor of commons-lang3 classes; remove commons-io use in favor of Guava
Set up a shutdown hook to try to stop the Spark context in
case the user forgets to do it. The main effect is that any
open logs files are flushed and closed, which is particularly
interesting for event logs.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#5696 from vanzin/SPARK-3090 and squashes the following commits:
3b554b5 [Marcelo Vanzin] [SPARK-3090] [core] Stop SparkContext if user forgets to.
Added a check to the SparkContext.union method to check that a partitioner is defined on all RDDs when instantiating a PartitionerAwareUnionRDD.
Author: Steven She <steven@canopylabs.com>
Closes#5679 from stevencanopy/SPARK-7103 and squashes the following commits:
5a3d846 [Steven She] SPARK-7103: Fix crash with SparkContext.union when at least one RDD has no partitioner
Author: hlin09 <hlin09pu@gmail.com>
Closes#5568 from hlin09/zipPartitions and squashes the following commits:
12c08a5 [hlin09] Fix comments
d2d32db [hlin09] Merge branch 'master' into zipPartitions
ec56d2f [hlin09] Fix test.
27655d3 [hlin09] Adds support for zipPartitions.
....py
Author: tedyu <yuzhihong@gmail.com>
Closes#5673 from tedyu/master and squashes the following commits:
ab7c72b [tedyu] SPARK-7107 Adjust indentation to pass Python style tests
6e25939 [tedyu] Adjust line length to be shorter than 100 characters
18d172a [tedyu] SPARK-7107 Add parameter for zookeeper.znode.parent to hbase_inputformat.py
Author: Jeff Harrison <jeffrharrison@gmail.com>
Closes#5667 from His-name-is-Joof/joofspark and squashes the following commits:
f8814a6 [Jeff Harrison] newline added after RDD show() output
4d9d972 [Jeff Harrison] Merge branch 'master' into joofspark
9d2295e [Jeff Harrison] parallelize with 1:10
878b830 [Jeff Harrison] Merge branch 'master' into joofspark
c8c0b80 [Jeff Harrison] add test for RDD function show()
123be65 [Jeff Harrison] SPARK-6856
turned on hive-thriftserver profile in release script
Author: Misha Chernetsov <chernetsov@gmail.com>
Closes#5429 from chernetsov/master and squashes the following commits:
9cc36af [Misha Chernetsov] [SPARK-4925] Publish Spark SQL hive-thriftserver maven artifact turned on hive-thriftserver profile in release script for scala 2.10
according liancheng‘s comment in https://issues.apache.org/jira/browse/SPARK-6505, this patch remove the reflection call in HiveFunctionWrapper, and implement the functions named "deserializeObjectByKryo" and "serializeObjectByKryo" according the functions with the save name in
org.apache.hadoop.hive.ql.exec.Utilities.java
Author: baishuo <vc_java@hotmail.com>
Closes#5660 from baishuo/SPARK-6505-20150423 and squashes the following commits:
ae61ec4 [baishuo] modify code style
78d9fa3 [baishuo] modify code style
0b522a7 [baishuo] modify code style
a5ff9c7 [baishuo] Remove the reflection call in HiveFunctionWrapper
rename DataTypeParser.apply to DataTypeParser.parse to make it more clear and readable.
/cc rxin
Author: wangfei <wangfei1@huawei.com>
Closes#5710 from scwf/apply and squashes the following commits:
c319977 [wangfei] rename apply to parse
Author: Reynold Xin <rxin@databricks.com>
Closes#5705 from rxin/df-pid and squashes the following commits:
401018f [Reynold Xin] [SPARK-7152][SQL] Add a Column expression for partition ID.
1. predict(predict.toString) has already output prefix “predict” thus it’s duplicated to print ", predict = " again
2. there are some extra spaces
Author: Alain <aihe@usc.edu>
Closes#5687 from AiHe/tree-node-issue-2 and squashes the following commits:
9862b9a [Alain] Pass scala coding style checking
44ba947 [Alain] Minor][MLLIB] Format toString method in MLLIB
bdc402f [Alain] [Minor][MLLIB] Fix a formatting bug in toString method in Node
426eee7 [Alain] [Minor][MLLIB] Fix a formatting bug in toString method in Node.scala
Add a try/catch block around removeShutDownHook else IllegalStateException thrown in YARN cluster mode (see https://github.com/apache/spark/pull/4690)
cc andrewor14, srowen
Author: Nishkam Ravi <nravi@cloudera.com>
Author: nishkamravi2 <nishkamravi@gmail.com>
Author: nravi <nravi@c1704.halxg.cloudera.com>
Closes#5672 from nishkamravi2/master_nravi and squashes the following commits:
0f1abd0 [nishkamravi2] Update Utils.scala
474e3bf [nishkamravi2] Update DiskBlockManager.scala
97c383e [nishkamravi2] Update Utils.scala
8691e0c [Nishkam Ravi] Add a try/catch block around Utils.removeShutdownHook
2be1e76 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
1c13b79 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
bad4349 [nishkamravi2] Update Main.java
36a6f87 [Nishkam Ravi] Minor changes and bug fixes
b7f4ae7 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
4a45d6a [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
458af39 [Nishkam Ravi] Locate the jar using getLocation, obviates the need to pass assembly path as an argument
d9658d6 [Nishkam Ravi] Changes for SPARK-6406
ccdc334 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
3faa7a4 [Nishkam Ravi] Launcher library changes (SPARK-6406)
345206a [Nishkam Ravi] spark-class merge Merge branch 'master_nravi' of https://github.com/nishkamravi2/spark into master_nravi
ac58975 [Nishkam Ravi] spark-class changes
06bfeb0 [nishkamravi2] Update spark-class
35af990 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
32c3ab3 [nishkamravi2] Update AbstractCommandBuilder.java
4bd4489 [nishkamravi2] Update AbstractCommandBuilder.java
746f35b [Nishkam Ravi] "hadoop" string in the assembly name should not be mandatory (everywhere else in spark we mandate spark-assembly*hadoop*.jar)
bfe96e0 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
ee902fa [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
d453197 [nishkamravi2] Update NewHadoopRDD.scala
6f41a1d [nishkamravi2] Update NewHadoopRDD.scala
0ce2c32 [nishkamravi2] Update HadoopRDD.scala
f7e33c2 [Nishkam Ravi] Merge branch 'master_nravi' of https://github.com/nishkamravi2/spark into master_nravi
ba1eb8b [Nishkam Ravi] Try-catch block around the two occurrences of removeShutDownHook. Deletion of semi-redundant occurrences of expensive operation inShutDown.
71d0e17 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
494d8c0 [nishkamravi2] Update DiskBlockManager.scala
3c5ddba [nishkamravi2] Update DiskBlockManager.scala
f0d12de [Nishkam Ravi] Workaround for IllegalStateException caused by recent changes to BlockManager.stop
79ea8b4 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
b446edc [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
5c9a4cb [nishkamravi2] Update TaskSetManagerSuite.scala
535295a [nishkamravi2] Update TaskSetManager.scala
3e1b616 [Nishkam Ravi] Modify test for maxResultSize
9f6583e [Nishkam Ravi] Changes to maxResultSize code (improve error message and add condition to check if maxResultSize > 0)
5f8f9ed [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
636a9ff [nishkamravi2] Update YarnAllocator.scala
8f76c8b [Nishkam Ravi] Doc change for yarn memory overhead
35daa64 [Nishkam Ravi] Slight change in the doc for yarn memory overhead
5ac2ec1 [Nishkam Ravi] Remove out
dac1047 [Nishkam Ravi] Additional documentation for yarn memory overhead issue
42c2c3d [Nishkam Ravi] Additional changes for yarn memory overhead issue
362da5e [Nishkam Ravi] Additional changes for yarn memory overhead
c726bd9 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
f00fa31 [Nishkam Ravi] Improving logging for AM memoryOverhead
1cf2d1e [nishkamravi2] Update YarnAllocator.scala
ebcde10 [Nishkam Ravi] Modify default YARN memory_overhead-- from an additive constant to a multiplier (redone to resolve merge conflicts)
2e69f11 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
efd688a [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark
2b630f9 [nravi] Accept memory input as "30g", "512M" instead of an int value, to be consistent with rest of Spark
3bf8fad [nravi] Merge branch 'master' of https://github.com/apache/spark
5423a03 [nravi] Merge branch 'master' of https://github.com/apache/spark
eb663ca [nravi] Merge branch 'master' of https://github.com/apache/spark
df2aeb1 [nravi] Improved fix for ConcurrentModificationIssue (Spark-1097, Hadoop-10456)
6b840f0 [nravi] Undo the fix for SPARK-1758 (the problem is fixed)
5108700 [nravi] Fix in Spark for the Concurrent thread modification issue (SPARK-1097, HADOOP-10456)
681b36f [nravi] Fix for SPARK-1758: failing test org.apache.spark.JavaAPISuite.wholeTextFiles
Author: Prashant Sharma <prashant.s@imaginea.com>
Closes#5662 from ScrapCodes/SPARK-7092/scala-update-2.11.6 and squashes the following commits:
58cf4f9 [Prashant Sharma] [SPARK-7092] Update spark scala version to 2.11.6
Author: Yin Huai <yhuai@databricks.com>
Closes#5702 from yhuai/howToGenerateGoldenFiles and squashes the following commits:
9c4a7f8 [Yin Huai] Update readme to include instructions on generating golden answer files based on Hive 0.13.1.
This is a continuation of [https://github.com/apache/spark/pull/5530] (which was for Decision Trees), but for ensembles: Random Forests and Gradient-Boosted Trees. Please refer to the JIRA [https://issues.apache.org/jira/browse/SPARK-6113], the design doc linked from the JIRA, and the previous PR linked above for design discussions.
This PR follows the example set by the previous PR for Decision Trees. It includes a few cleanups to Decision Trees.
Note: There is one issue which will be addressed in a separate PR: Ensembles' component Models have no parent or fittingParamMap. I plan to submit a separate PR which makes those values in Model be Options. It does not matter much which PR gets merged first.
CC: mengxr manishamde codedeft chouqin
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#5626 from jkbradley/dt-api-ensembles and squashes the following commits:
729167a [Joseph K. Bradley] small cleanups based on code review
bbae2a2 [Joseph K. Bradley] Updated per all comments in code review
855aa9a [Joseph K. Bradley] scala style fix
ea3d901 [Joseph K. Bradley] Added GBT to spark.ml, with tests and examples
c0f30c1 [Joseph K. Bradley] Added random forests and test suites to spark.ml. Not tested yet. Need to add example as well
d045ebd [Joseph K. Bradley] some more updates, but far from done
ee1a10b [Joseph K. Bradley] Added files from old PR and did some initial updates.
To run Spark, default will read file from HDFS if we don't set the schema.
Author: linweizhong <linweizhong@huawei.com>
Closes#5684 from Sephiroth-Lin/pyspark_example_minor and squashes the following commits:
19fe145 [linweizhong] Update example sql.py, so that can read file correctly
This is a reopening of #4867.
A short summary of the issues resolved from the previous PR:
1. HTTPClient version mismatch: Selenium (used for UI tests) requires version 4.3.x, and Tachyon included 4.2.5 through a transitive dependency of its shaded thrift jar. To address this, Tachyon 0.6.3 will promote the transitive dependencies of the shaded jar so they can be excluded in spark.
2. Jackson-Mapper-ASL version mismatch: In lower versions of hadoop-client (ie. 1.0.4), version 1.0.1 is included. The parquet library used in spark sql requires version 1.8+. Its unclear to me why upgrading tachyon-client would cause this dependency to break. The solution was to exclude jackson-mapper-asl from hadoop-client.
It seems that the dependency management in spark-parent will not work on transitive dependencies, one way to make sure jackson-mapper-asl is included with the correct version is to add it as a top level dependency. The best solution would be to exclude the dependency in the modules which require a higher version, but that did not fix the unit tests. Any suggestions on the best way to solve this would be appreciated!
Author: Calvin Jia <jia.calvin@gmail.com>
Closes#5354 from calvinjia/upgrade_tachyon_0.6.3 and squashes the following commits:
0eefe4d [Calvin Jia] Handle httpclient version in maven dependency management. Remove httpclient version setting from profiles.
7c00dfa [Calvin Jia] Set httpclient version to 4.3.2 for selenium. Specify version of httpclient for sql/hive (previously 4.2.5 transitive dependency of libthrift).
9263097 [Calvin Jia] Merge master to test latest changes
dbfc1bd [Calvin Jia] Use Tachyon 0.6.4 for cleaner dependencies.
e2ff80a [Calvin Jia] Exclude the jetty and curator promoted dependencies from tachyon-client.
a3a29da [Calvin Jia] Update tachyon-client exclusions.
0ae6c97 [Calvin Jia] Change tachyon version to 0.6.3
a204df9 [Calvin Jia] Update make distribution tachyon version.
a93c94f [Calvin Jia] Exclude jackson-mapper-asl from hadoop client since it has a lower version than spark's expected version.
a8a923c [Calvin Jia] Exclude httpcomponents from Tachyon
910fabd [Calvin Jia] Update to master
eed9230 [Calvin Jia] Update tachyon version to 0.6.1.
11907b3 [Calvin Jia] Use TachyonURI for tachyon paths instead of strings.
71bf441 [Calvin Jia] Upgrade Tachyon client version to 0.6.0.
Author: Sun Rui <rui.sun@intel.com>
Closes#5613 from sun-rui/SPARK-6852 and squashes the following commits:
abaf02e [Sun Rui] Change the type of default numPartitions from integer to numeric in generics.R.
29d67c1 [Sun Rui] [SPARK-6852][SPARKR] Accept numeric as numPartitions in SparkR.
Author: Sun Rui <rui.sun@intel.com>
Closes#5628 from sun-rui/SPARK-7033 and squashes the following commits:
046bc9e [Sun Rui] Clean split usage in tests.
d531c86 [Sun Rui] [SPARK-7033][SPARKR] Clean usage of split. Use partition instead where applicable.
See [SPARK-6528](https://issues.apache.org/jira/browse/SPARK-6528). Add IDF transformer in ML package.
Author: Xusen Yin <yinxusen@gmail.com>
Closes#5266 from yinxusen/SPARK-6528 and squashes the following commits:
741db31 [Xusen Yin] get param from new paramMap
d169967 [Xusen Yin] add final to param and IDF class
c9c3759 [Xusen Yin] simplify test suite
5867c09 [Xusen Yin] refine IDF transformer with new interfaces
7727cae [Xusen Yin] Merge branch 'master' into SPARK-6528
4338a37 [Xusen Yin] Merge branch 'master' into SPARK-6528
aef2cdf [Xusen Yin] add doc and group for param
5760b49 [Xusen Yin] fix code style
2add691 [Xusen Yin] fix code style and test
03fbecb [Xusen Yin] remove duplicated code
2aa4be0 [Xusen Yin] clean test suite
4802c67 [Xusen Yin] add IDF transformer and test suite
yinxusen
Author: Xiangrui Meng <meng@databricks.com>
Closes#5681 from mengxr/SPARK-7115 and squashes the following commits:
9ac27cd [Xiangrui Meng] skip the very first 1 in poly expansion
Also took the chance to improve documentation for various types.
Author: Reynold Xin <rxin@databricks.com>
Closes#5675 from rxin/data-type-matching-expr and squashes the following commits:
0f31856 [Reynold Xin] One more function documentation.
27c1973 [Reynold Xin] Added more documentation.
336a36d [Reynold Xin] [SQL] Fixed expression data type matching.
fix typo
Author: Ken Geis <geis.ken@gmail.com>
Closes#5674 from kgeis/patch-1 and squashes the following commits:
5ae67de [Ken Geis] Update sql-programming-guide.md
This pr tries to provide a way to let python users workaround https://issues.apache.org/jira/browse/SPARK-6231.
Author: Yin Huai <yhuai@databricks.com>
Closes#5634 from yhuai/pythonDFAlias and squashes the following commits:
8465acd [Yin Huai] Add an alias to a Python DF.
When specifying non-spark properties (i.e. names don't start with spark.) in the command line and config file, spark-submit and spark-shell behave differently, causing confusion to users.
Here is the summary-
* spark-submit
* --conf k=v => silently ignored
* spark-defaults.conf => applied
* spark-shell
* --conf k=v => show a warning message and ignored
* spark-defaults.conf => show a warning message and ignored
I assume that ignoring non-spark properties is intentional. If so, it should always be ignored with a warning message in all cases.
Author: Cheolsoo Park <cheolsoop@netflix.com>
Closes#5617 from piaozhexiu/SPARK-7037 and squashes the following commits:
8957950 [Cheolsoo Park] Add IgnoreNonSparkProperties method
fedd01c [Cheolsoo Park] Ignore non-spark properties with a warning message in all cases
Author: Sun Rui <rui.sun@intel.com>
Closes#5655 from sun-rui/SPARK-6818 and squashes the following commits:
7c66570 [Sun Rui] [SPARK-6818][SPARKR] Support column deletion in SparkR DataFrame API.
It was over 1000 lines of code, making it harder to find all the types. Only moved code around, and didn't change any.
Author: Reynold Xin <rxin@databricks.com>
Closes#5670 from rxin/break-types and squashes the following commits:
8c59023 [Reynold Xin] Check in missing files.
dcd5193 [Reynold Xin] [SQL] Break dataTypes.scala into multiple files.
jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Closes#5649 from mengxr/SPARK-7070 and squashes the following commits:
c66023c [Xiangrui Meng] setBeta should call setTopicConcentration
Author: Tijo Thomas <tijoparacka@gmail.com>
Closes#5656 from tijoparacka/FIX_PATHISSUE_CHANGE_VERSION_SCRIPT and squashes the following commits:
ab4f4b1 [Tijo Thomas] removed whitespace
24478c9 [Tijo Thomas] modified to provide the spark base dir while searching for pom and also while changing the vesrion no
7b8e10b [Tijo Thomas] Modified for providing the base directories while finding the list of pom files and also while changing the version no
https://issues.apache.org/jira/browse/SPARK-6879
Use `applications` to replace `FileStatus`, and check if the app is completed before clean it up.
If an exception was throwed, add it to `applications` to wait for the next loop.
Author: WangTaoTheTonic <wangtao111@huawei.com>
Closes#5491 from WangTaoTheTonic/SPARK-6879 and squashes the following commits:
4a533eb [WangTaoTheTonic] treat ACE specially
cb45105 [WangTaoTheTonic] rebase
d4d5251 [WangTaoTheTonic] per Marcelo's comments
d7455d8 [WangTaoTheTonic] slightly change when delete file
b0abca5 [WangTaoTheTonic] use global var to store apps to clean
94adfe1 [WangTaoTheTonic] leave expired apps alone to be deleted
9872a9d [WangTaoTheTonic] use the right path
fdef4d6 [WangTaoTheTonic] check if app is completed before clean it up
Author: wizz <wizz@wizz-dev01.kawasaki.flab.fujitsu.com>
Closes#5658 from kuromatsu-nobuyuki/SPARK-7085 and squashes the following commits:
6ec2d21 [wizz] Fix miniBatchFraction parameter in train method called with 4 arguments
The web UI's "task deserialization time" metric is slightly misleading because it does not capture the time taken to deserialize the broadcasted RDD.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#5635 from JoshRosen/SPARK-7058 and squashes the following commits:
ed90f75 [Josh Rosen] Update UI tooltip
a3743b4 [Josh Rosen] Update comments.
4f52910 [Josh Rosen] Roll back whitespace change
e9cf9f4 [Josh Rosen] Remove unused variable
9f32e55 [Josh Rosen] Expose executorDeserializeTime on Task instead of pushing runtime calculation into Task.
21f5b47 [Josh Rosen] Don't double-count the broadcast deserialization time in task runtime
1752f0e [Josh Rosen] [SPARK-7058] Incorporate RDD deserialization time in task deserialization time metric
Author: Vinod K C <vinod.kc@huawei.com>
Closes#5633 from vinodkc/use_correct_classloader_driverload and squashes the following commits:
73c5380 [Vinod K C] Use correct ClassLoader for JDBC Driver
Currently if you want to create a StreamingContext from checkpoint information, the system will create a new SparkContext. This prevent StreamingContext to be recreated from checkpoints in managed environments where SparkContext is precreated.
The solution in this PR: Introduce the following methods on StreamingContext
1. `new StreamingContext(checkpointDirectory, sparkContext)`
Recreate StreamingContext from checkpoint using the provided SparkContext
2. `StreamingContext.getOrCreate(checkpointDirectory, sparkContext, createFunction: SparkContext => StreamingContext)`
If checkpoint file exists, then recreate StreamingContext using the provided SparkContext (that is, call 1.), else create StreamingContext using the provided createFunction
TODO: the corresponding Java and Python API has to be added as well.
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#5428 from tdas/SPARK-6752 and squashes the following commits:
94db63c [Tathagata Das] Fix long line.
524f519 [Tathagata Das] Many changes based on PR comments.
eabd092 [Tathagata Das] Added Function0, Java API and unit tests for StreamingContext.getOrCreate
36a7823 [Tathagata Das] Minor changes.
204814e [Tathagata Das] Added StreamingContext.getOrCreate with existing SparkContext
Author: Cheng Hao <hao.cheng@intel.com>
Closes#5625 from chenghao-intel/transform and squashes the following commits:
5ec1dd2 [Cheng Hao] fix the deadlock issue in ScriptTransform
Author: Prabeesh K <prabeesh.k@namshi.com>
Closes#5653 from prabeesh/fix and squashes the following commits:
9d7a9f5 [Prabeesh K] fixed scala string interpolation error
Author: Prashant Sharma <prashant.s@imaginea.com>
Closes#5652 from ScrapCodes/hf/compilation-fix-scala-2.11 and squashes the following commits:
819ff06 [Prashant Sharma] [HOTFIX] Fix compilation for scala 2.11.
Also renamed JvmType to InternalType.
Author: Reynold Xin <rxin@databricks.com>
Closes#5651 from rxin/native-to-atomic-type and squashes the following commits:
cbd4028 [Reynold Xin] [SPARK-7069][SQL] Rename NativeType -> AtomicType.
Author: Reynold Xin <rxin@databricks.com>
Closes#5646 from rxin/remove-primitive-type and squashes the following commits:
01b673d [Reynold Xin] [SPARK-7068][SQL] Remove PrimitiveType
Author: Reynold Xin <rxin@databricks.com>
Closes#5648 from rxin/vectorAssembler-boolean and squashes the following commits:
1bf3d40 [Reynold Xin] [MLlib] Add support for BooleanType to VectorAssembler.