Detailed exception log can be seen in [SPARK-9877](https://issues.apache.org/jira/browse/SPARK-9877), the problem is when creating `StandaloneRestServer`, `self` (`masterEndpoint`) is null. So this fix is creating `StandaloneRestServer` when `self` is available.
Author: jerryshao <sshao@hortonworks.com>
Closes#8127 from jerryshao/SPARK-9877.
In these tests, we use a custom listener and we assert on fields in the stage / task completion events. However, these events are posted in a separate thread so they're not guaranteed to be posted in time. This commit fixes this flakiness through a job end registration callback.
Author: Andrew Or <andrew@databricks.com>
Closes#8176 from andrewor14/fix-accumulator-suite.
When a stage failed and another stage was resubmitted with only part of partitions to compute, all the tasks failed with error message: java.util.NoSuchElementException: key not found: peakExecutionMemory.
This is because the internal accumulators are not properly initialized for this stage while other codes assume the internal accumulators always exist.
Author: Carson Wang <carson.wang@intel.com>
Closes#8090 from carsonwang/SPARK-9809.
Currently, we access the `page.pageNumer` after it's freed, that could be modified by other thread, cause NPE.
The same TaskMemoryManager could be used by multiple threads (for example, Python UDF and TransportScript), so it should be thread safe to allocate/free memory/page. The underlying Bitset and HashSet are not thread safe, we should put them inside a synchronized block.
cc JoshRosen
Author: Davies Liu <davies@databricks.com>
Closes#8177 from davies/memory_manager.
Modified type of ShuffleMapStage.numAvailableOutputs from Long to Int
Author: Neelesh Srinivas Salian <nsalian@cloudera.com>
Closes#8183 from nssalian/SPARK-9923.
in MLlib sometimes we need to set metadata for the new column, thus we will alias the new column with metadata before call `withColumn` and in `withColumn` we alias this clolumn again. Here I overloaded `withColumn` to allow user set metadata, just like what we did for `Column.as`.
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#8159 from cloud-fan/withColumn.
It would be helpful to allow users to pass a pre-computed index to create an indexer, rather than always going through StringIndexer to create the model.
Author: Holden Karau <holden@pigscanfly.ca>
Closes#7267 from holdenk/SPARK-8744-StringIndexerModel-should-have-public-constructor.
This modifies DecisionTreeMetadata construction to treat 1-category features as continuous, so that trees do not fail with such features. It is important for the pipelines API, where VectorIndexer can automatically categorize certain features as categorical.
As stated in the JIRA, this is a temp fix which we can improve upon later by automatically filtering out those features. That will take longer, though, since it will require careful indexing.
Targeted for 1.5 and master
CC: manishamde mengxr yanboliang
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#8187 from jkbradley/tree-1cat.
Some minor clean-ups after SPARK-9661. See my inline comments. MechCoder jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Closes#8190 from mengxr/SPARK-9661-fix.
As `InternalRow` does not extend `Row` now, I think we can remove it.
Author: Liang-Chi Hsieh <viirya@appier.com>
Closes#8170 from viirya/remove_canequal.
Currently, pageSize of TungstenSort is calculated from driver.memory, it should use executor.memory instead.
Also, in the worst case, the safeFactor could be 4 (because of rounding), increase it to 16.
cc rxin
Author: Davies Liu <davies@databricks.com>
Closes#8175 from davies/page_size.
A fundamental limitation of the existing SQL tests is that *there is simply no way to create your own `SparkContext`*. This is a serious limitation because the user may wish to use a different master or config. As a case in point, `BroadcastJoinSuite` is entirely commented out because there is no way to make it pass with the existing infrastructure.
This patch removes the singletons `TestSQLContext` and `TestData`, and instead introduces a `SharedSQLContext` that starts a context per suite. Unfortunately the singletons were so ingrained in the SQL tests that this patch necessarily needed to touch *all* the SQL test files.
<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/8111)
<!-- Reviewable:end -->
Author: Andrew Or <andrew@databricks.com>
Closes#8111 from andrewor14/sql-tests-refactor.
When the free memory in executor goes low, the cached broadcast objects need to serialized into disk, but currently the deserialized UnsafeHashedRelation can't be serialized , fail with NPE. This PR fixes that.
cc rxin
Author: Davies Liu <davies@databricks.com>
Closes#8174 from davies/serialize_hashed.
This bug only happen on Python 3 and Windows.
I tested this manually with python 3 and disable python daemon, no unit test yet.
Author: Davies Liu <davies@databricks.com>
Closes#8181 from davies/open_mode.
What `StringIndexerInverse` does is not strictly associated with `StringIndexer`, and the name is not clearly describing the transformation. Renaming to `IndexToString` might be better.
~~I also changed `invert` to `inverse` without arguments. `inputCol` and `outputCol` could be set after.~~
I also removed `invert`.
jkbradley holdenk
Author: Xiangrui Meng <meng@databricks.com>
Closes#8152 from mengxr/SPARK-9922.
If pandas is broken (can't be imported, raise other exceptions other than ImportError), pyspark can't be imported, we should ignore all the exceptions.
Author: Davies Liu <davies@databricks.com>
Closes#8173 from davies/fix_pandas.
I skimmed through the docs for various instance of Object and replaced them with Java compaible versions of the same.
1. Some methods in LDAModel.
2. runMiniBatchSGD
3. kolmogorovSmirnovTest
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes#8126 from MechCoder/java_incop.
To follow the naming rule of ML, change `MultilayerPerceptronClassifierModel` to `MultilayerPerceptronClassificationModel` like `DecisionTreeClassificationModel`, `GBTClassificationModel` and so on.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#8164 from yanboliang/mlp-name.
Copied ML models must have the same parent of original ones
Author: lewuathe <lewuathe@me.com>
Author: Lewuathe <lewuathe@me.com>
Closes#7447 from Lewuathe/SPARK-9073.
PR #7967 enables us to save data source relations to metastore in Hive compatible format when possible. But it fails to persist Parquet relations with decimal column(s) to Hive metastore of versions lower than 1.2.0. This is because `ParquetHiveSerDe` in Hive versions prior to 1.2.0 doesn't support decimal. This PR checks for this case and falls back to Spark SQL specific metastore table format.
Author: Yin Huai <yhuai@databricks.com>
Author: Cheng Lian <lian@databricks.com>
Closes#8130 from liancheng/spark-9757/old-hive-parquet-decimal.
This requires some discussion. I'm not sure whether `runs` is a useful parameter. It certainly complicates the implementation. We might want to optimize the k-means implementation with block matrix operations. In this case, having `runs` may not be worth the trade-off. Also it increases the communication cost in a single job, which might cause other issues.
This PR also renames `epsilon` to `tol` to have consistent naming among algorithms. The Python constructor is updated to include all parameters.
jkbradley yu-iskw
Author: Xiangrui Meng <meng@databricks.com>
Closes#8148 from mengxr/SPARK-9918 and squashes the following commits:
149b9e5 [Xiangrui Meng] fix constructor in Python and rename epsilon to tol
3cc15b3 [Xiangrui Meng] fix test and change initStep to initSteps in python
a0a0274 [Xiangrui Meng] remove runs from k-means in the pipeline API
I made a mistake in #8049 by casting literal value to attribute's data type, which would cause simply truncate the literal value and push a wrong filter down.
JIRA: https://issues.apache.org/jira/browse/SPARK-9927
Author: Yijie Shen <henry.yijieshen@gmail.com>
Closes#8157 from yjshen/rever8049.
The problem with defining setters in the base class is that it doesn't return the correct type in Java.
ericl
Author: Xiangrui Meng <meng@databricks.com>
Closes#8143 from mengxr/SPARK-9914 and squashes the following commits:
d36c887 [Xiangrui Meng] remove setters from model
a49021b [Xiangrui Meng] define setters explicitly for Java and use setParam group
This patch add a thread-safe lookup for BytesToBytseMap, and use that in broadcasted HashedRelation.
Author: Davies Liu <davies@databricks.com>
Closes#8151 from davies/safeLookup.
https://issues.apache.org/jira/browse/SPARK-9920
Taking `sqlContext.sql("select i, sum(j1) as sum from testAgg group by i").explain()` as an example, the output of our current master is
```
== Physical Plan ==
TungstenAggregate(key=[i#0], value=[(sum(cast(j1#1 as bigint)),mode=Final,isDistinct=false)]
TungstenExchange hashpartitioning(i#0)
TungstenAggregate(key=[i#0], value=[(sum(cast(j1#1 as bigint)),mode=Partial,isDistinct=false)]
Scan ParquetRelation[file:/user/hive/warehouse/testagg][i#0,j1#1]
```
With this PR, the output will be
```
== Physical Plan ==
TungstenAggregate(key=[i#0], functions=[(sum(cast(j1#1 as bigint)),mode=Final,isDistinct=false)], output=[i#0,sum#18L])
TungstenExchange hashpartitioning(i#0)
TungstenAggregate(key=[i#0], functions=[(sum(cast(j1#1 as bigint)),mode=Partial,isDistinct=false)], output=[i#0,currentSum#22L])
Scan ParquetRelation[file:/user/hive/warehouse/testagg][i#0,j1#1]
```
Author: Yin Huai <yhuai@databricks.com>
Closes#8150 from yhuai/SPARK-9920.
sparkr.zip is now built by SparkSubmit on a need-to-build basis.
cc shivaram
Author: Burak Yavuz <brkyvz@gmail.com>
Closes#8147 from brkyvz/make-dist-fix.
There exists a chance that the prefixes keep growing to the maximum pattern length. Then the final local processing step becomes unnecessary. feynmanliang
Author: Xiangrui Meng <meng@databricks.com>
Closes#8136 from mengxr/SPARK-9903.
Made ProbabilisticClassifier, Identifiable, VectorUDT public. All are annotated as DeveloperApi.
CC: mengxr EronWright
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#8004 from jkbradley/ml-api-public-items and squashes the following commits:
7ebefda [Joseph K. Bradley] update per code review
7ff0768 [Joseph K. Bradley] attepting to add mima fix
756d84c [Joseph K. Bradley] VectorUDT annotated as AlphaComponent
ae7767d [Joseph K. Bradley] added another warning
94fd553 [Joseph K. Bradley] Made ProbabilisticClassifier, Identifiable, VectorUDT public APIs
Currently, UnsafeRowSerializer does not close the InputStream, will cause fd leak if the InputStream has an open fd in it.
TODO: the fd could still be leaked, if any items in the stream is not consumed. Currently it replies on GC to close the fd in this case.
cc JoshRosen
Author: Davies Liu <davies@databricks.com>
Closes#8116 from davies/fd_leak.
I think that we should pass additional configuration flags to disable the driver UI and Master REST server in SparkSubmitSuite and HiveSparkSubmitSuite. This might cut down on port-contention-related flakiness in Jenkins.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#8124 from JoshRosen/disable-ui-in-sparksubmitsuite.
I added lots of expression functions for SparkR. This PR includes only functions whose params are only `(Column)` or `(Column, Column)`. And I think we need to improve how to test those functions. However, it would be better to work on another issue.
## Diff Summary
- Add lots of functions in `functions.R` and their generic in `generic.R`
- Add aliases for `ceiling` and `sign`
- Move expression functions from `column.R` to `functions.R`
- Modify `rdname` from `column` to `functions`
I haven't supported `not` function, because the name has a collesion with `testthat` package. I didn't think of the way to define it.
## New Supported Functions
```
approxCountDistinct
ascii
base64
bin
bitwiseNOT
ceil (alias: ceiling)
crc32
dayofmonth
dayofyear
explode
factorial
hex
hour
initcap
isNaN
last_day
length
log2
ltrim
md5
minute
month
negate
quarter
reverse
round
rtrim
second
sha1
signum (alias: sign)
size
soundex
to_date
trim
unbase64
unhex
weekofyear
year
datediff
levenshtein
months_between
nanvl
pmod
```
## JIRA
[[SPARK-9855] Add expression functions into SparkR whose params are simple - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-9855)
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
Closes#8123 from yu-iskw/SPARK-9855.
…fails
Author: cody koeninger <cody@koeninger.org>
Closes#8133 from koeninger/SPARK-9780 and squashes the following commits:
406259d [cody koeninger] [SPARK-9780][Streaming][Kafka] prevent NPE if KafkaRDD instantiation fails
As per the TODO move weightCol to Shared Params.
Author: Holden Karau <holden@pigscanfly.ca>
Closes#8144 from holdenk/SPARK-9909-move-weightCol-toSharedParams.