This bug only happens on Python 3 and Windows.
I tested this manually with Python 3 and the Python daemon disabled; there is no unit test yet.
Author: Davies Liu <davies@databricks.com>
Closes #8181 from davies/open_mode.
What `StringIndexerInverse` does is not strictly associated with `StringIndexer`, and the name does not clearly describe the transformation. Renaming it to `IndexToString` might be better.
~~I also changed `invert` to `inverse` without arguments. `inputCol` and `outputCol` could be set after.~~
I also removed `invert`.
jkbradley holdenk
Author: Xiangrui Meng <meng@databricks.com>
Closes #8152 from mengxr/SPARK-9922.
If pandas is broken (it can't be imported and raises an exception other than ImportError), pyspark can't be imported either. We should ignore all such exceptions.
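A minimal sketch of the guarded import described above. The helper name `try_import` and its `importer` parameter are hypothetical and are not PySpark code; the point is only that catching `Exception` rather than `ImportError` also covers a package that fails in some other way at import time.

```python
import importlib


def try_import(name, importer=importlib.import_module):
    """Return the imported module, or None if importing fails for any reason.

    `try_import` and `importer` are hypothetical names used for illustration.
    """
    try:
        return importer(name)
    except Exception:
        # A broken install may raise something other than ImportError,
        # so catch broadly and treat the package as unavailable.
        return None


def broken_importer(name):
    # Stand-in for a broken package that raises a non-ImportError on import.
    raise RuntimeError("simulated failure while importing " + name)


has_json = try_import("json") is not None
has_broken = try_import("pandas", importer=broken_importer) is not None
```

Here `has_json` is true and `has_broken` is false, so the caller can keep going without the optional dependency.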
Author: Davies Liu <davies@databricks.com>
Closes #8173 from davies/fix_pandas.
I skimmed through the docs for various instances of Object and replaced them with Java-compatible versions of the same.
1. Some methods in LDAModel.
2. runMiniBatchSGD
3. kolmogorovSmirnovTest
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes #8126 from MechCoder/java_incop.
To follow the naming rule of ML, change `MultilayerPerceptronClassifierModel` to `MultilayerPerceptronClassificationModel` like `DecisionTreeClassificationModel`, `GBTClassificationModel` and so on.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #8164 from yanboliang/mlp-name.
Copied ML models must have the same parent as the original ones.
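As a rough illustration of the invariant, the toy class below carries the parent reference across a copy. `Model`, `weights`, and the `estimator` stand-in are hypothetical names for this sketch, not the Spark ML classes.

```python
from copy import deepcopy


class Model:
    """Toy model that remembers the estimator (parent) that produced it."""

    def __init__(self, weights, parent=None):
        self.weights = weights
        self.parent = parent

    def copy(self):
        # Copy the model state, then carry the parent reference over so the
        # copy still points at the same estimator object as the original.
        copied = Model(deepcopy(self.weights))
        copied.parent = self.parent
        return copied


estimator = object()  # stand-in for the estimator that fit the model
original = Model([0.1, 0.2], parent=estimator)
duplicate = original.copy()
```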
Author: lewuathe <lewuathe@me.com>
Author: Lewuathe <lewuathe@me.com>
Closes #7447 from Lewuathe/SPARK-9073.
PR #7967 enables us to save data source relations to metastore in Hive compatible format when possible. But it fails to persist Parquet relations with decimal column(s) to Hive metastore of versions lower than 1.2.0. This is because `ParquetHiveSerDe` in Hive versions prior to 1.2.0 doesn't support decimal. This PR checks for this case and falls back to Spark SQL specific metastore table format.
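The fallback decision can be sketched roughly as follows. The function names, the string type names, and the returned labels are all assumptions made for illustration; the sketch covers only the decimal check and ignores the other conditions that decide whether a Hive compatible format is possible.

```python
def parse_version(version):
    """Turn a dotted version string such as "0.13.1" into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def choose_metastore_format(hive_version, column_types):
    """Pick a table format label for a Parquet relation (hypothetical helper)."""
    has_decimal = any(t.lower().startswith("decimal") for t in column_types)
    if has_decimal and parse_version(hive_version) < (1, 2, 0):
        # Older ParquetHiveSerDe cannot handle decimal columns, so fall back.
        return "spark-sql-specific"
    return "hive-compatible"


old_with_decimal = choose_metastore_format("0.13.1", ["int", "decimal(10,2)"])
new_with_decimal = choose_metastore_format("1.2.0", ["int", "decimal(10,2)"])
old_without_decimal = choose_metastore_format("0.13.1", ["int", "string"])
```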
Author: Yin Huai <yhuai@databricks.com>
Author: Cheng Lian <lian@databricks.com>
Closes #8130 from liancheng/spark-9757/old-hive-parquet-decimal.
This requires some discussion. I'm not sure whether `runs` is a useful parameter. It certainly complicates the implementation. We might want to optimize the k-means implementation with block matrix operations. In this case, having `runs` may not be worth the trade-off. Also it increases the communication cost in a single job, which might cause other issues.
This PR also renames `epsilon` to `tol` to have consistent naming among algorithms. The Python constructor is updated to include all parameters.
jkbradley yu-iskw
Author: Xiangrui Meng <meng@databricks.com>
Closes #8148 from mengxr/SPARK-9918 and squashes the following commits:
149b9e5 [Xiangrui Meng] fix constructor in Python and rename epsilon to tol
3cc15b3 [Xiangrui Meng] fix test and change initStep to initSteps in python
a0a0274 [Xiangrui Meng] remove runs from k-means in the pipeline API
I made a mistake in #8049 by casting the literal value to the attribute's data type, which can simply truncate the literal value and push a wrong filter down.
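A small worked example of the truncation problem, using plain Python lists in place of a data source. The column values and the literal are made up for illustration.

```python
rows = [0, 1, 2]  # values of an integer column
literal = 1.5

# Correct predicate: compare the integer column against the literal as-is.
expected = [v for v in rows if v < literal]

# Faulty push-down: cast the literal to the column's type first.
truncated = int(literal)  # 1.5 becomes 1
pushed_down = [v for v in rows if v < truncated]
```

`expected` keeps the rows 0 and 1, while `pushed_down` keeps only 0, so the pushed-down filter silently drops a matching row.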
JIRA: https://issues.apache.org/jira/browse/SPARK-9927
Author: Yijie Shen <henry.yijieshen@gmail.com>
Closes #8157 from yjshen/rever8049.
The problem with defining setters in the base class is that it doesn't return the correct type in Java.
ericl
Author: Xiangrui Meng <meng@databricks.com>
Closes #8143 from mengxr/SPARK-9914 and squashes the following commits:
d36c887 [Xiangrui Meng] remove setters from model
a49021b [Xiangrui Meng] define setters explicitly for Java and use setParam group
This patch adds a thread-safe lookup for BytesToBytesMap and uses it in the broadcasted HashedRelation.
Author: Davies Liu <davies@databricks.com>
Closes #8151 from davies/safeLookup.
https://issues.apache.org/jira/browse/SPARK-9920
Taking `sqlContext.sql("select i, sum(j1) as sum from testAgg group by i").explain()` as an example, the output of our current master is
```
== Physical Plan ==
TungstenAggregate(key=[i#0], value=[(sum(cast(j1#1 as bigint)),mode=Final,isDistinct=false)]
TungstenExchange hashpartitioning(i#0)
TungstenAggregate(key=[i#0], value=[(sum(cast(j1#1 as bigint)),mode=Partial,isDistinct=false)]
Scan ParquetRelation[file:/user/hive/warehouse/testagg][i#0,j1#1]
```
With this PR, the output will be
```
== Physical Plan ==
TungstenAggregate(key=[i#0], functions=[(sum(cast(j1#1 as bigint)),mode=Final,isDistinct=false)], output=[i#0,sum#18L])
TungstenExchange hashpartitioning(i#0)
TungstenAggregate(key=[i#0], functions=[(sum(cast(j1#1 as bigint)),mode=Partial,isDistinct=false)], output=[i#0,currentSum#22L])
Scan ParquetRelation[file:/user/hive/warehouse/testagg][i#0,j1#1]
```
Author: Yin Huai <yhuai@databricks.com>
Closes #8150 from yhuai/SPARK-9920.
sparkr.zip is now built by SparkSubmit on a need-to-build basis.
cc shivaram
Author: Burak Yavuz <brkyvz@gmail.com>
Closes #8147 from brkyvz/make-dist-fix.
There exists a chance that the prefixes keep growing to the maximum pattern length. Then the final local processing step becomes unnecessary. feynmanliang
Author: Xiangrui Meng <meng@databricks.com>
Closes #8136 from mengxr/SPARK-9903.
Made ProbabilisticClassifier, Identifiable, VectorUDT public. All are annotated as DeveloperApi.
CC: mengxr EronWright
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #8004 from jkbradley/ml-api-public-items and squashes the following commits:
7ebefda [Joseph K. Bradley] update per code review
7ff0768 [Joseph K. Bradley] attepting to add mima fix
756d84c [Joseph K. Bradley] VectorUDT annotated as AlphaComponent
ae7767d [Joseph K. Bradley] added another warning
94fd553 [Joseph K. Bradley] Made ProbabilisticClassifier, Identifiable, VectorUDT public APIs
Currently, UnsafeRowSerializer does not close the InputStream, which will cause an fd leak if the InputStream holds an open fd.
TODO: the fd could still be leaked if any items in the stream are not consumed. Currently it relies on GC to close the fd in this case.
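The behavior, including the case the TODO leaves open, can be sketched with a Python generator. `read_records` and the fixed record size are assumptions for this illustration and do not mirror the serializer's actual record format.

```python
import io


def read_records(stream, record_size):
    """Yield fixed-size records and close the stream at end of input."""
    while True:
        chunk = stream.read(record_size)
        if not chunk:
            # End of stream reached: release the underlying resource now.
            stream.close()
            return
        yield chunk


# Fully consumed: the stream is closed once the last record has been read.
consumed = io.BytesIO(b"aabbcc")
records = list(read_records(consumed, 2))

# Partially consumed: the end-of-stream branch never runs, so this stream
# stays open, which is the remaining leak described in the TODO.
abandoned = io.BytesIO(b"aabbcc")
first = next(read_records(abandoned, 2))
```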
cc JoshRosen
Author: Davies Liu <davies@databricks.com>
Closes #8116 from davies/fd_leak.
I think that we should pass additional configuration flags to disable the driver UI and Master REST server in SparkSubmitSuite and HiveSparkSubmitSuite. This might cut down on port-contention-related flakiness in Jenkins.
Author: Josh Rosen <joshrosen@databricks.com>
Closes #8124 from JoshRosen/disable-ui-in-sparksubmitsuite.
I added lots of expression functions for SparkR. This PR includes only functions whose params are only `(Column)` or `(Column, Column)`. I think we need to improve how we test those functions, but that would be better handled in a separate issue.
## Diff Summary
- Add lots of functions in `functions.R` and their generics in `generic.R`
- Add aliases for `ceiling` and `sign`
- Move expression functions from `column.R` to `functions.R`
- Modify `rdname` from `column` to `functions`
I haven't added support for the `not` function, because its name collides with the `testthat` package. I couldn't think of a way to define it.
## New Supported Functions
```
approxCountDistinct
ascii
base64
bin
bitwiseNOT
ceil (alias: ceiling)
crc32
dayofmonth
dayofyear
explode
factorial
hex
hour
initcap
isNaN
last_day
length
log2
ltrim
md5
minute
month
negate
quarter
reverse
round
rtrim
second
sha1
signum (alias: sign)
size
soundex
to_date
trim
unbase64
unhex
weekofyear
year
datediff
levenshtein
months_between
nanvl
pmod
```
## JIRA
[[SPARK-9855] Add expression functions into SparkR whose params are simple - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-9855)
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
Closes #8123 from yu-iskw/SPARK-9855.
…fails
Author: cody koeninger <cody@koeninger.org>
Closes #8133 from koeninger/SPARK-9780 and squashes the following commits:
406259d [cody koeninger] [SPARK-9780][Streaming][Kafka] prevent NPE if KafkaRDD instantiation fails
As per the TODO, move weightCol to Shared Params.
Author: Holden Karau <holden@pigscanfly.ca>
Closes #8144 from holdenk/SPARK-9909-move-weightCol-toSharedParams.
Refactor Utils class and create ShutdownHookManager.
NOTE: I wasn't able to run /dev/run-tests on a Windows machine.
Manual tests were conducted locally using a custom log4j.properties file with a Redis appender and a logstash formatter (bundled in the fat jar submitted to Spark).
ex:
log4j.rootCategory=WARN,console,redis
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
log4j.logger.org.eclipse.jetty=WARN
log4j.logger.org.eclipse.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
log4j.logger.org.apache.spark.graphx.Pregel=INFO
log4j.appender.redis=com.ryantenney.log4j.FailoverRedisAppender
log4j.appender.redis.endpoints=hostname:port
log4j.appender.redis.key=mykey
log4j.appender.redis.alwaysBatch=false
log4j.appender.redis.layout=net.logstash.log4j.JSONEventLayoutV1
Author: michellemay <mlemay@gmail.com>
Closes #8109 from michellemay/SPARK-9826.
… allocation are set. Now, dynamic allocation is set to false when num-executors is explicitly specified as an argument. Consequently, executorAllocationManager is not initialized in the SparkContext.
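A rough sketch of the resulting rule over a plain dictionary of settings. The helper name is hypothetical, and treating the presence of `spark.executor.instances` as "num-executors was specified" is an assumption of this sketch rather than the exact check in the patch.

```python
def dynamic_allocation_enabled(conf):
    """Return True only if dynamic allocation is requested and no explicit
    executor count was given (hypothetical helper over a plain dict)."""
    requested = conf.get("spark.dynamicAllocation.enabled", "false") == "true"
    explicit_executors = "spark.executor.instances" in conf
    return requested and not explicit_executors


only_dynamic = dynamic_allocation_enabled(
    {"spark.dynamicAllocation.enabled": "true"}
)
both_set = dynamic_allocation_enabled(
    {"spark.dynamicAllocation.enabled": "true", "spark.executor.instances": "4"}
)
```

With both settings present the explicit executor count wins, so `both_set` is false and an allocation manager would not be created.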
Author: Niranjan Padmanabhan <niranjan.padmanabhan@cloudera.com>
Closes #7657 from neurons/SPARK-9092.
Reinstated LogisticRegression.threshold Param for binary compatibility. Param thresholds overrides threshold, if set.
CC: mengxr dbtsai feynmanliang
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #8079 from jkbradley/logreg-reinstate-threshold.
Check for and add missing docs for PySpark ML (this issue only checks for missing docs in o.a.s.ml, not o.a.s.mllib).
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #8059 from yanboliang/SPARK-9766.
rxin
First pull request for Spark so let me know if I am missing anything
The contribution is my original work and I license the work to the project under the project's open source license.
Author: Brennan Ashton <bashton@brennanashton.com>
Closes #8016 from btashton/patch-1.
From JIRA: Currently, Params.copyValues copies default parameter values to the paramMap of the target instance, rather than the defaultParamMap. It should copy to the defaultParamMap because explicitly setting a parameter can change the semantics.
This issue arose in SPARK-9789, where 2 params "threshold" and "thresholds" for LogisticRegression can have mutually exclusive values. If thresholds is set, then fit() will copy the default value of threshold as well, easily resulting in inconsistent settings for the 2 params.
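The intended behavior can be sketched with a toy `Params` class that keeps explicit values and defaults in separate maps. The class and its snake_case method names are hypothetical and only mimic the idea; they are not the Spark ML API.

```python
class Params:
    """Toy parameter holder with separate explicit and default maps."""

    def __init__(self):
        self.param_map = {}          # values set explicitly by the user
        self.default_param_map = {}  # default values

    def is_set(self, name):
        return name in self.param_map

    def get(self, name):
        if name in self.param_map:
            return self.param_map[name]
        return self.default_param_map[name]

    def copy_values(self, target):
        # Defaults go to the target's default map and explicit values to its
        # explicit map, so a copied default never looks explicitly set.
        target.default_param_map.update(self.default_param_map)
        target.param_map.update(self.param_map)
        return target


source = Params()
source.default_param_map["threshold"] = 0.5
source.param_map["thresholds"] = [0.3, 0.7]
target = source.copy_values(Params())
```

After the copy, `threshold` is still readable through its default but is not reported as set, so it cannot conflict with the explicitly set `thresholds`.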
CC: mengxr
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #8115 from jkbradley/copyvalues-fix.
If the correct parameter is not provided, Hive will run into an error
because it calls methods that are specific to the local filesystem to
copy the data.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes #8086 from vanzin/SPARK-9804.
This is the sister patch to #8011, but for aggregation.
In a nutshell: create the `TungstenAggregationIterator` before computing the parent partition. Internally this creates a `BytesToBytesMap` which acquires a page in the constructor as of this patch. This ensures that the aggregation operator is not starved since we reserve at least 1 page in advance.
rxin yhuai
Author: Andrew Or <andrew@databricks.com>
Closes #8038 from andrewor14/unsafe-starve-memory-agg.
This is based on KaiXinXiaoLei's changes in #7716.
The issue is that when someone calls `sc.killExecutor("1")` on the same executor twice quickly, then the executor target will be adjusted downwards by 2 instead of 1 even though we're only actually killing one executor. In certain cases where we don't adjust the target back upwards quickly, we'll end up with jobs hanging.
This is a common danger because there are many places where this is called:
- `HeartbeatReceiver` kills an executor that has not been sending heartbeats
- `ExecutorAllocationManager` kills an executor that has been idle
- The user code might call this, which may interfere with the previous callers
While it's not clear whether this fixes SPARK-9745, fixing this potential race condition seems like a strict improvement. I've added a regression test to illustrate the issue.
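One way to picture the race and a guard against it is the toy tracker below, which lowers the target only the first time a given executor is killed. `ExecutorTracker` and its fields are hypothetical; this is a sketch of the idea under those assumptions, not the change made in the scheduler backend.

```python
class ExecutorTracker:
    """Toy bookkeeping for an executor target (hypothetical names)."""

    def __init__(self, executor_ids):
        self.alive = set(executor_ids)
        self.pending_removal = set()
        self.target = len(self.alive)

    def kill_executor(self, executor_id):
        # Ignore unknown executors and executors already being removed, so a
        # repeated kill request cannot lower the target a second time.
        if executor_id not in self.alive or executor_id in self.pending_removal:
            return False
        self.pending_removal.add(executor_id)
        self.target -= 1
        return True


tracker = ExecutorTracker(["1", "2", "3"])
first_kill = tracker.kill_executor("1")
second_kill = tracker.kill_executor("1")
```

The second call is a no-op, so the target ends at 2 rather than 1.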
Author: Andrew Or <andrew@databricks.com>
Closes #8078 from andrewor14/da-double-kill.
This allows clients to retrieve the original exception from the
cause field of the SparkException that is thrown by the driver.
If the original exception is not in fact Serializable then it will
not be returned, but the message and stacktrace will be. (All Java
Throwables implement the Serializable interface, but this is no
guarantee that a particular implementation can actually be
serialized.)
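A loose Python analogue of this fallback, using `pickle` in place of Java serialization. `package_failure`, the report layout, and `LockHoldingError` are all hypothetical names for this sketch.

```python
import pickle
import threading
import traceback


def package_failure(exc):
    """Build a report that always has text details and, when possible, the cause."""
    report = {
        "message": str(exc),
        "stacktrace": "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        ),
        "cause": None,
    }
    try:
        # Serializing can fail even for a type that is nominally serializable.
        report["cause"] = pickle.dumps(exc)
    except Exception:
        # Not serializable in practice: keep only message and stack trace.
        pass
    return report


class LockHoldingError(Exception):
    """Hypothetical error whose state (a lock) cannot be pickled."""

    def __init__(self, message):
        super().__init__(message)
        self.lock = threading.Lock()


plain = package_failure(ValueError("bad input"))
opaque = package_failure(LockHoldingError("holds a lock"))
```

`plain` carries the serialized exception, while `opaque` carries only the message and stack trace.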
Author: Tom White <tom@cloudera.com>
Closes #7014 from tomwhite/propagate-user-exceptions.
This PR adds a hacky workaround for PARQUET-201, and should be removed once we upgrade to parquet-mr 1.8.1 or higher versions.
In Parquet, not all types of columns can be used for filter push-down optimization. The set of valid column types is controlled by `ValidTypeMap`. Unfortunately, in parquet-mr 1.7.0 and prior versions, this limitation is too strict, and doesn't allow `BINARY (ENUM)` columns to be pushed down. On the other hand, `BINARY (ENUM)` is commonly seen in Parquet files written by libraries like `parquet-avro`.
This restriction is problematic for Spark SQL, because Spark SQL doesn't have a type that maps to Parquet `BINARY (ENUM)` directly, and always converts `BINARY (ENUM)` to Catalyst `StringType`. Thus, a predicate involving a `BINARY (ENUM)` is recognized as one involving a string field instead and can be pushed down by the query optimizer. Such predicates are actually perfectly legal, except that they fail the `ValidTypeMap` check.
The workaround added here relaxes `ValidTypeMap` to include `BINARY (ENUM)`. I also took the chance to simplify `ParquetCompatibilityTest` a little while adding the regression test.
Author: Cheng Lian <lian@databricks.com>
Closes #8107 from liancheng/spark-9407/parquet-enum-filter-push-down.
This PR fixes the failure to push filters down to the JDBC source caused by `Cast` during pattern matching.
When we compare columns of different types, there is a good chance that a cast is needed on the column. The expression then no longer matches a pattern written directly against Attribute, so the filter is not pushed down.
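The mismatch can be reproduced with a toy expression tree. The dataclasses and `to_source_filter` below are simplified, hypothetical stand-ins for the Catalyst expressions and the filter translation; the sketch shows only why the match fails, not how the PR resolves it.

```python
from dataclasses import dataclass


@dataclass
class Attribute:
    name: str


@dataclass
class Literal:
    value: object


@dataclass
class Cast:
    child: object
    to_type: str


@dataclass
class EqualTo:
    left: object
    right: object


def to_source_filter(expr):
    """Translate only the direct `attribute = literal` shape."""
    if (
        isinstance(expr, EqualTo)
        and isinstance(expr.left, Attribute)
        and isinstance(expr.right, Literal)
    ):
        return (expr.left.name, "=", expr.right.value)
    # Anything else, including a Cast around the attribute, is not translated.
    return None


direct = to_source_filter(EqualTo(Attribute("a"), Literal(1)))
casted = to_source_filter(EqualTo(Cast(Attribute("a"), "double"), Literal(1.0)))
```

`direct` becomes a source filter, while `casted` is `None` because the left side is a `Cast` rather than an `Attribute`.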
Author: Yijie Shen <henry.yijieshen@gmail.com>
Closes #8049 from yjshen/jdbc_pushdown.