Author: Ted Blackman <ted.blackman@gmail.com>
Closes #6656 from belisarius222/branch-1.4 and squashes the following commits:
747cbc2 [Ted Blackman] [SPARK-8116][PYSPARK] Allow sc.range() to take a single argument.
(cherry picked from commit f02af7c8f7)
Signed-off-by: Reynold Xin <rxin@databricks.com>
I kept some of the sql import there to avoid changing too many lines.
Author: Reynold Xin <rxin@databricks.com>
Closes #6661 from rxin/remove-wildcard-import-sqlcontext and squashes the following commits:
c265347 [Reynold Xin] Fixed ListTablesSuite failure.
de9d491 [Reynold Xin] Fixed tests.
73b5365 [Reynold Xin] Mima.
8f6b642 [Reynold Xin] Fixed style violation.
443f6e8 [Reynold Xin] [SPARK-8113][SQL] Remove some wildcard import on TestSQLContext._
Derby has a `derby.system.durability` configuration property that can be used to disable I/O synchronization calls for writes. This sacrifices durability but can result in large performance gains, which is appropriate for tests.
We should enable this in our test system properties in order to speed up the Hive compatibility tests. I saw 2-3x speedups locally with this change.
See https://db.apache.org/derby/docs/10.8/ref/rrefproperdurability.html for more documentation of this property.
Author: Josh Rosen <joshrosen@databricks.com>
Closes #6651 from JoshRosen/hive-compat-suite-speedup and squashes the following commits:
b7a08a2 [Josh Rosen] Set derby.system.durability=test in our unit tests.
The log page should only show the desired length of bytes. Currently it shows bytes from the startIndex to the end of the file. As a result, the "Next" button on the page is always disabled.
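A minimal sketch of the intended windowing, written in Python rather than taken from the patch. The names `log_window`, `log_length`, `byte_length`, and `offset` are hypothetical, and the behavior when no offset is given (show the tail of the file) is an assumption of this sketch.

```python
def log_window(log_length, byte_length, offset=None):
    """Return (start, end) byte positions of the slice to display."""
    if offset is None:
        # Assumption: with no offset requested, show the last `byte_length` bytes.
        start = log_length - byte_length
    else:
        start = offset
    # Keep the start position inside the file.
    start = max(0, min(start, log_length))
    # Stop after `byte_length` bytes rather than at the end of the file.
    end = min(start + byte_length, log_length)
    return start, end
```

With a window computed this way, a "Next" control could be enabled whenever `end < log_length`.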
Author: Carson Wang <carson.wang@intel.com>
Closes #6640 from carsonwang/logpage and squashes the following commits:
58cb3fd [Carson Wang] Show correct length of bytes on log page
This patch replaces Distinct with Aggregate in the optimizer, so Distinct will become
more efficient over time as we optimize Aggregate (via Tungsten).
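As an illustration of why the replacement is valid, here is a small Python sketch (not Spark code) that expresses distinct as an aggregate grouping by every column with no aggregate functions. The helper names `aggregate` and `distinct` are hypothetical.

```python
def aggregate(rows, group_by):
    """Group rows by the `group_by` column positions, emitting each group key once."""
    seen = {}
    for row in rows:
        key = tuple(row[i] for i in group_by)
        seen.setdefault(key, None)  # dicts preserve first-seen order
    return list(seen)


def distinct(rows):
    """Distinct expressed as an aggregate that groups by every column."""
    if not rows:
        return []
    return aggregate(rows, group_by=range(len(rows[0])))
```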
Author: Reynold Xin <rxin@databricks.com>
Closes #6637 from rxin/replace-distinct and squashes the following commits:
b3cc50e [Reynold Xin] Mima excludes.
93d6117 [Reynold Xin] Code review feedback.
87e4741 [Reynold Xin] [SPARK-7440][SQL] Remove physical Distinct operator in favor of Aggregate.
This is a follow-up on #6393. I am removing the following files in this PR.
```
./sql/hive/v0.13.1/src/main/scala/org/apache/spark/sql/hive/Shim13.scala
./sql/hive-thriftserver/v0.13.1/src/main/scala/org/apache/spark/sql/hive/thriftserver/Shim13.scala
```
Basically, I refactored the shim code as follows:
* Rewrote code directly with Hive 0.13 methods, or
* Converted code into private methods, or
* Extracted code into separate classes
For leftover code that didn't fit any of these cases, I created a HiveShim object. For example, helper functions that wrap Hive 0.13 methods to work around Hive bugs are placed there.
Author: Cheolsoo Park <cheolsoop@netflix.com>
Closes #6604 from piaozhexiu/SPARK-6909 and squashes the following commits:
5dccc20 [Cheolsoo Park] Remove hive shim code
This also helps us get rid of the sparkr-docs maven profile as docs are now built by just using -Psparkr when the roxygen2 package is available
Related to discussion in #6567
cc pwendell srowen -- Let me know if this looks better
Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Closes #6593 from shivaram/sparkr-pom-cleanup and squashes the following commits:
b282241 [Shivaram Venkataraman] Remove sparkr-docs from release script as well
8f100a5 [Shivaram Venkataraman] Move man pages creation to install-dev.sh This also helps us get rid of the sparkr-docs maven profile as docs are now built by just using -Psparkr when the roxygen2 package is available
Resolves [SPARK-7743](https://issues.apache.org/jira/browse/SPARK-7743).
Trivial changes of versions, package names, as well as a small issue in `ParquetTableOperations.scala`
```diff
- val readContext = getReadSupport(configuration).init(
+ val readContext = ParquetInputFormat.getReadSupportInstance(configuration).init(
```
This change is needed because `ParquetInputFormat.getReadSupport` was made package-private in the latest release.
Thanks
-- Thomas Omans
Author: Thomas Omans <tomans@cj.com>
Closes #6597 from eggsby/SPARK-7743 and squashes the following commits:
2df0d1b [Thomas Omans] [SPARK-7743] [SQL] Upgrading parquet version to 1.7.0
Added a `DataFrame.drop` function that accepts a `Column` reference rather than a `String`, and added associated unit tests. Basically iterates through the `DataFrame` to find a column with an expression that is equivalent to that of the `Column` argument supplied to the function.
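A rough Python sketch of the matching idea, not the implementation in this PR. `Attribute`, `expr_id`, and `drop_column` are hypothetical stand-ins: each column carries a name plus a unique expression id, so two columns with the same name (for example after a join) can still be told apart.

```python
from collections import namedtuple

# Hypothetical stand-in for a resolved column: a name plus a unique expression id.
Attribute = namedtuple("Attribute", ["name", "expr_id"])


def drop_column(output, target):
    """Return the columns of `output` except those equal to `target`."""
    return [attr for attr in output if attr != target]


left_key = Attribute("key", 1)
right_key = Attribute("key", 2)  # same name, different origin after a join
joined = [left_key, Attribute("value", 3), right_key]

# Dropping by reference removes only the right-hand "key" column.
remaining = drop_column(joined, right_key)
```

Dropping by name alone could not distinguish the two `key` columns here, which is the ambiguity the column-reference variant avoids.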
Author: Mike Dusenberry <dusenberrymw@gmail.com>
Closes #6585 from dusenberrymw/SPARK-7969_Drop_method_on_Dataframes_should_handle_Column and squashes the following commits:
514727a [Mike Dusenberry] Updating the @since tag of the drop(Column) function doc to reflect version 1.4.1 instead of 1.4.0.
2f1bb4e [Mike Dusenberry] Adding an additional assert statement to the 'drop column after join' unit test in order to make sure the correct column was indeed left over.
6bf7c0e [Mike Dusenberry] Minor code formatting change.
e583888 [Mike Dusenberry] Adding more Python doctests for the df.drop with column reference function to test joined datasets that have columns with the same name.
5f74401 [Mike Dusenberry] Updating DataFrame.drop with column reference function to use logicalPlan.output to prevent ambiguities resulting from columns with the same name. Also added associated unit tests for joined datasets with duplicate column names.
4b8bbe8 [Mike Dusenberry] Adding Python support for Dataframe.drop with a Column reference.
986129c [Mike Dusenberry] Added a DataFrame.drop function that accepts a Column reference rather than a String, and added associated unit tests. Basically iterates through the DataFrame to find a column with an expression that is equivalent to one supplied to the function.
To reduce the overhead of codegen, this PR switches to Janino for compiling SQL expressions into bytecode.
After this change, the time to compile a SQL expression drops from 100ms to 5ms, which is necessary before codegen can be turned on for general workloads and for tests.
cc rxin
Author: Davies Liu <davies@databricks.com>
Closes #6479 from davies/janino and squashes the following commits:
cc689f5 [Davies Liu] remove globalLock
262d848 [Davies Liu] Merge branch 'master' of github.com:apache/spark into janino
eec3a33 [Davies Liu] address comments from Josh
f37c8c3 [Davies Liu] fix DecimalType and cast to String
202298b [Davies Liu] Merge branch 'master' of github.com:apache/spark into janino
a21e968 [Davies Liu] fix style
0ed3dc6 [Davies Liu] Merge branch 'master' of github.com:apache/spark into janino
551a851 [Davies Liu] fix tests
c3bdffa [Davies Liu] remove print
6089ce5 [Davies Liu] change logging level
7e46ac3 [Davies Liu] fix style
d8f0f6c [Davies Liu] Merge branch 'master' of github.com:apache/spark into janino
da4926a [Davies Liu] fix tests
03660f3 [Davies Liu] WIP: use Janino to compile Java source
f2629cd [Davies Liu] Merge branch 'master' of github.com:apache/spark into janino
f7d66cf [Davies Liu] use template based string for codegen
If maxTaskFailures is 1, the task set is aborted after 1 task failure. Other documentation and the code support this reading; I think it's just this comment that was off. It's easy to make this mistake, so can you please double-check that I'm correct? Thanks!
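The reading described above can be sketched as follows. This is an illustrative Python function with hypothetical names, not the scheduler code.

```python
def should_abort(failure_count, max_task_failures):
    """Abort the task set once a task has failed `max_task_failures` times."""
    return failure_count >= max_task_failures
```

Under this reading, `max_task_failures = 1` allows no retries: the first failure aborts the task set.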
Author: Daniel Darabos <darabos.daniel@gmail.com>
Closes #6621 from darabos/patch-2 and squashes the following commits:
dfebdec [Daniel Darabos] Fix comment.
This commit exists to close the following pull requests on Github:
Closes #5976 (close requested by 'JoshRosen')
Closes #4576 (close requested by 'pwendell')
Closes #3430 (close requested by 'pwendell')
Closes #2495 (close requested by 'pwendell')
Right now we always run hive tests in branch-1.4 PRs because we compare whether the diff against master involves hive changes. Really we should be comparing against the target branch itself.
Author: Andrew Or <andrew@databricks.com>
Closes #6629 from andrewor14/build-check-hive and squashes the following commits:
450fbbd [Andrew Or] [BUILD] Use right branch when checking against Hive
Currently hive tests alone take 40m. The right thing to do is
to reduce the test time. However, that is a bigger project and
we currently have PRs blocking on tests not timing out.
cc shaneknapp pwendell JoshRosen
Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Closes #6623 from shivaram/SPARK-8084 and squashes the following commits:
0ec5b26 [Shivaram Venkataraman] Make SparkR scripts fail on error
Author: Ryan Williams <ryan.blake.williams@gmail.com>
Closes #6624 from ryan-williams/execs and squashes the following commits:
b6f71d4 [Ryan Williams] don't attempt to lower number of executors by 0
Minor error in the monitoring docs. Also made indentation changes in `ApiRootResource`
Author: Hari Shreedharan <hshreedharan@apache.org>
Closes #6628 from harishreedharan/eventlog-formatting and squashes the following commits:
a12553d [Hari Shreedharan] Javadoc updates.
ca399b6 [Hari Shreedharan] [HOTFIX] History Server API docs error fix.
Added stats from cross validation as a val in the cross validation model to save them for user access.
Author: leahmcguire <lmcguire@salesforce.com>
Closes #5915 from leahmcguire/saveCVmetrics and squashes the following commits:
49b507b [leahmcguire] fixed tyle error
67537b1 [leahmcguire] rebased
85907f0 [leahmcguire] fixed name
59987cc [leahmcguire] changed param name and test according to comments
36e71e3 [leahmcguire] rebasing
4b8223e [leahmcguire] fixed name
4ddffc6 [leahmcguire] changed param name and test according to comments
3a995da [leahmcguire] Added stats from cross validation as a val in the cross validation model to save them for user access
This is just a workaround to a bigger problem. Some pipeline stages may not be effective during prediction, and they should not complain about missing required columns, e.g. `StringIndexerModel`. jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Closes #6595 from mengxr/SPARK-8051 and squashes the following commits:
b6a36b9 [Xiangrui Meng] add doc
f143fd4 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-8051
8ee7c7e [Xiangrui Meng] use SparkFunSuite
e112394 [Xiangrui Meng] make StringIndexerModel silent if input column does not exist
cc andrewor14
Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Closes #6424 from shivaram/spark-worker-instances-yarn-ec2 and squashes the following commits:
db244ae [Shivaram Venkataraman] Make Python Lint happy
0593d1b [Shivaram Venkataraman] Clear SPARK_WORKER_INSTANCES when using YARN
Replaced `fs.listFiles` with Hadoop-1 friendly `fs.listStatus` method.
Author: Hari Shreedharan <hshreedharan@apache.org>
Closes #6619 from harishreedharan/evetlog-hadoop-1-fix and squashes the following commits:
6192078 [Hari Shreedharan] [HOTFIX] Fix Hadoop-1 build caused by #5972.
The flaky tests in ExternalShuffleServiceSuite and SparkListenerWithClusterSuite will fail if there are not enough executors up before running the jobs.
This PR adds `JobProgressListener.waitUntilExecutorsUp`. The tests for the cluster mode can use it to wait until the expected executors are up.
Author: zsxwing <zsxwing@gmail.com>
Closes #6546 from zsxwing/SPARK-7989 and squashes the following commits:
5560e09 [zsxwing] Fix a typo
3b69840 [zsxwing] Fix flaky tests in ExternalShuffleServiceSuite and SparkListenerWithClusterSuite
Some places forget to call `assert` to check the return value of `AsynchronousListenerBus.waitUntilEmpty`. Instead of adding `assert` in these places, I think it's better to make `AsynchronousListenerBus.waitUntilEmpty` throw `TimeoutException`.
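A minimal Python sketch of the design choice, with hypothetical names. Instead of returning a boolean that callers must remember to assert on, the wait raises on timeout, so a forgotten check cannot silently pass.

```python
import time


def wait_until_empty(queue, timeout_seconds, poll_interval=0.01):
    """Block until `queue` is empty; raise TimeoutError instead of returning False."""
    deadline = time.monotonic() + timeout_seconds
    while len(queue) > 0:
        if time.monotonic() > deadline:
            raise TimeoutError(f"queue not empty after {timeout_seconds} seconds")
        time.sleep(poll_interval)
```

A test that calls `wait_until_empty(events, 10)` and then inspects results needs no extra assertion: a timeout fails the test through the exception.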
Author: zsxwing <zsxwing@gmail.com>
Closes #6550 from zsxwing/SPARK-8001 and squashes the following commits:
607674a [zsxwing] Make AsynchronousListenerBus.waitUntilEmpty throw TimeoutException if timeout
This should help reduce latency for new executor allocations.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes #6600 from vanzin/SPARK-8059 and squashes the following commits:
8387a3a [Marcelo Vanzin] [SPARK-8059] [yarn] Wake up allocation thread when new requests arrive.
Author: Timothy Chen <tnachen@gmail.com>
Closes #6615 from tnachen/mesos_driver_path and squashes the following commits:
4f47b7c [Timothy Chen] Use the correct base path in mesos driver page.
Java-friendly APIs added:
* GaussianMixture.run()
* GaussianMixtureModel.predict()
* DistributedLDAModel.javaTopicDistributions()
* StreamingKMeans: trainOn, predictOn, predictOnValues
* Statistics.corr
* params
* added doc to w() since Java docs do not inherit doc
* removed non-Java-friendly w() from StringArrayParam and DoubleArrayParam
* made DoubleArrayParam Java-friendly w() actually Java-friendly
I generated the doc and verified all changes.
CC: mengxr
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #6562 from jkbradley/java-api-1.4 and squashes the following commits:
c16821b [Joseph K. Bradley] Small fixes based on code review.
d955581 [Joseph K. Bradley] unit test fixes
29b6b0d [Joseph K. Bradley] small fixes
fe6dcfe [Joseph K. Bradley] Added several Java-friendly APIs + unit tests: NaiveBayes, GaussianMixture, LDA, StreamingKMeans, Statistics.corr, params
Author: Reynold Xin <rxin@databricks.com>
Closes #6608 from rxin/parquet-analysis and squashes the following commits:
b5dc8e2 [Reynold Xin] Code review feedback.
5617cf6 [Reynold Xin] [SPARK-8074] Parquet should throw AnalysisException during setup for data type/name related failures.
Author: Sun Rui <rui.sun@intel.com>
Closes #6605 from sun-rui/SPARK-8063 and squashes the following commits:
51ca48b [Sun Rui] [SPARK-8063][SPARKR] Spark master URL conflict between MASTER env variable and --master command line option.
...m History Server
This PR adds a new API that allows the user to download event logs for an application as a zip file. APIs have been added to download all logs for a given application or just for a specific attempt.
This also adds a method to the ApplicationHistoryProvider to get the raw files, zipped.
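To illustrate the zipping step only, here is a small Python sketch that packs named log contents into an in-memory zip archive. The function name `zip_event_logs` and the example file names are hypothetical; the actual API and provider method are not shown here.

```python
import io
import zipfile


def zip_event_logs(logs):
    """Pack a mapping of file name -> bytes into an in-memory zip archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, data in logs.items():
            archive.writestr(name, data)
    return buffer.getvalue()


# Example: one archive holding the logs of two attempts of the same application.
archive_bytes = zip_event_logs({
    "app-1/attempt-1.log": b"event a",
    "app-1/attempt-2.log": b"event b",
})
```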
Author: Hari Shreedharan <hshreedharan@apache.org>
Closes #5792 from harishreedharan/eventlog-download and squashes the following commits:
221cc26 [Hari Shreedharan] Update docs with new API information.
a131be6 [Hari Shreedharan] Fix style issues.
5528bd8 [Hari Shreedharan] Merge branch 'master' into eventlog-download
6e8156e [Hari Shreedharan] Simplify tests, use Guava stream copy methods.
d8ddede [Hari Shreedharan] Remove unnecessary case in EventLogDownloadResource.
ffffb53 [Hari Shreedharan] Changed interface to use zip stream. Added more tests.
1100b40 [Hari Shreedharan] Ensure that `Path` does not appear in interfaces, by rafactoring interfaces.
5a5f3e2 [Hari Shreedharan] Fix test ordering issue.
0b66948 [Hari Shreedharan] Minor formatting/import fixes.
4fc518c [Hari Shreedharan] Fix rat failures.
a48b91f [Hari Shreedharan] Refactor to make attemptId optional in the API. Also added tests.
0fc1424 [Hari Shreedharan] File download now works for individual attempts and the entire application.
350d7e8 [Hari Shreedharan] Merge remote-tracking branch 'asf/master' into eventlog-download
fd6ab00 [Hari Shreedharan] Fix style issues
32b7662 [Hari Shreedharan] Use UIRoot directly in ApiRootResource. Also, use `Response` class to set headers.
7b362b2 [Hari Shreedharan] Almost working.
3d18ebc [Hari Shreedharan] [WIP] Try getting the event log download to work.
1. range() overloaded in SQLContext.scala
2. range() modified in python sql context.py
3. Tests added accordingly in DataFrameSuite.scala and python sql tests.py
Author: animesh <animesh@apache.spark>
Closes #6609 from animeshbaranawal/SPARK-7980 and squashes the following commits:
935899c [animesh] SPARK-7980:python+scala changes
Author: Patrick Wendell <patrick@databricks.com>
Closes #6328 from pwendell/spark-1.5-update and squashes the following commits:
2f42d02 [Patrick Wendell] A few more excludes
4bebcf0 [Patrick Wendell] Update to RC4
61aaf46 [Patrick Wendell] Using new release candidate
55f1610 [Patrick Wendell] Another exclude
04b4f04 [Patrick Wendell] More issues with transient 1.4 changes
36f549b [Patrick Wendell] [SPARK-7801] [BUILD] Updating versions to SPARK 1.5.0
https://issues.apache.org/jira/browse/SPARK-7973
Author: Yin Huai <yhuai@databricks.com>
Closes #6525 from yhuai/SPARK-7973 and squashes the following commits:
763b821 [Yin Huai] Also change the timeout of "Single command with -e" to 2 minutes.
e598a08 [Yin Huai] Increase the timeout to 3 minutes.
jira: https://issues.apache.org/jira/browse/SPARK-7983
Customers frequently use zero-based indices in their LIBSVM files. Spark reports no warnings or errors during the subsequent computation, which usually leads to weird results for many algorithms (like GBDT).
This PR adds a quick check.
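A sketch of what such a check could look like, in Python rather than the Scala of this PR. The function name `parse_libsvm_line` is hypothetical, and converting to zero-based positions after validation is an assumption of this sketch.

```python
def parse_libsvm_line(line):
    """Parse 'label index:value ...' and validate the indices."""
    tokens = line.split()
    label = float(tokens[0])
    indices, values = [], []
    previous = 0  # one-based indices must all be greater than 0
    for token in tokens[1:]:
        index_text, value_text = token.split(":")
        index = int(index_text)
        # Reject zero-based or out-of-order indices up front.
        if index <= previous:
            raise ValueError(
                "indices should be one-based and in ascending order; "
                f"found {index} after {previous}"
            )
        indices.append(index - 1)  # assumption: store zero-based positions
        values.append(float(value_text))
        previous = index
    return label, indices, values
```

Failing at load time turns a silent modelling error into an immediate, explainable exception.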
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes #6538 from hhbyyh/loadSVM and squashes the following commits:
79d9c11 [Yuhao Yang] optimization as respond to comments
4310710 [Yuhao Yang] merge conflict
96460f1 [Yuhao Yang] merge conflict
20a2811 [Yuhao Yang] use require
6e4f8ca [Yuhao Yang] add check for ascending order
9956365 [Yuhao Yang] add ut for 0-based loadlibsvm exception
5bd1f9a [Yuhao Yang] add require for one-based in loadLIBSVM
It seems hard to find a common pattern of checking types in `Expression`. Sometimes we know what input types we need (like `And`, where we know we need two booleans); sometimes we just have some rules (like `Add`, where we need two numeric types that are equal). So I defined a general interface `checkInputDataTypes` in `Expression` which returns a `TypeCheckResult`. `TypeCheckResult` can tell whether this expression passes the type checking or what the type mismatch is.
This PR mainly applies input type checking to arithmetic and predicate expressions.
TODO: apply type checking interface to more expressions.
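The shape of the interface can be sketched in Python as below. This is not the Scala code from the PR: the string type names, `check_and`, `check_add`, and `NUMERIC_TYPES` are hypothetical, and only the idea of returning a result object that is either a success or a mismatch message comes from the description above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TypeCheckResult:
    """Outcome of a type check: success, or a message describing the mismatch."""
    error: Optional[str] = None

    @property
    def is_success(self):
        return self.error is None


def check_and(left_type, right_type):
    # `And` knows its exact input types: two booleans.
    if (left_type, right_type) != ("boolean", "boolean"):
        return TypeCheckResult(
            f"And requires boolean inputs, got {left_type} and {right_type}")
    return TypeCheckResult()


NUMERIC_TYPES = {"int", "long", "float", "double"}


def check_add(left_type, right_type):
    # `Add` follows a rule instead: both inputs numeric and of the same type.
    if left_type != right_type:
        return TypeCheckResult(
            f"Add requires equal types, got {left_type} and {right_type}")
    if left_type not in NUMERIC_TYPES:
        return TypeCheckResult(f"Add requires a numeric type, got {left_type}")
    return TypeCheckResult()
```

Returning a result object rather than a bare boolean lets the caller surface the mismatch message in the error it reports.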
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes #6405 from cloud-fan/6444 and squashes the following commits:
b5ff31b [Wenchen Fan] address comments
b917275 [Wenchen Fan] rebase
39929d9 [Wenchen Fan] add todo
0808fd2 [Wenchen Fan] make constrcutor of TypeCheckResult private
3bee157 [Wenchen Fan] and decimal type coercion rule for binary comparison
8883025 [Wenchen Fan] apply type check interface to CaseWhen
cffb67c [Wenchen Fan] to have resolved call the data type check function
6eaadff [Wenchen Fan] add equal type constraint to EqualTo
3affbd8 [Wenchen Fan] more fixes
654d46a [Wenchen Fan] improve tests
e0a3628 [Wenchen Fan] improve error message
1524ff6 [Wenchen Fan] fix style
69ca3fe [Wenchen Fan] add error message and tests
c71d02c [Wenchen Fan] fix hive tests
6491721 [Wenchen Fan] use value class TypeCheckResult
7ae76b9 [Wenchen Fan] address comments
cb77e4f [Wenchen Fan] Improve error reporting for expression data type mismatch
The current check tests whether version `1.x` is less than `1.4`. This fails if x has more than one digit: x > 4, yet `1.x` still compares as less than `1.4`.
It fails in my system since I have version `1.10` :P
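A minimal sketch of the failure mode, not the code from this patch: comparing version strings lexicographically orders `1.10` before `1.4`, while comparing parsed integer components does not. The helper name `version_tuple` is hypothetical, and it assumes the major and minor components are purely numeric.

```python
def version_tuple(version):
    """Parse the leading 'major.minor' of a version string into integers."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


installed = "1.10"  # example value; in practice this might come from numpy.__version__

# String comparison is lexicographic, so "1.10" sorts before "1.4".
naive_too_old = installed < "1.4"

# Comparing integer tuples orders the versions numerically.
robust_too_old = version_tuple(installed) < (1, 4)
```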
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes #6579 from MechCoder/np_ver and squashes the following commits:
15430f8 [MechCoder] fix syntax error
893fb7e [MechCoder] remove equal to
e35f0d4 [MechCoder] minor
e89376c [MechCoder] Better checking
22703dd [MechCoder] [SPARK-8032] Make version checking for NumPy in MLlib more robust
jira: https://issues.apache.org/jira/browse/SPARK-8043
I found some issues while testing the save/load examples in the Markdown documents, as part of the 1.4 QA plan.
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes #6584 from hhbyyh/naiveDocExample and squashes the following commits:
a01a206 [Yuhao Yang] fix for Gaussian mixture
2fb8b96 [Yuhao Yang] update NaiveBayes and SVM examples in doc
I found this by chance while building spark and think it is better to keep its name consistent with other sub-projects (Spark Project *).
I am not gonna file JIRA as it is a pretty small issue.
Author: WangTaoTheTonic <wangtao111@huawei.com>
Closes #6603 from WangTaoTheTonic/projName and squashes the following commits:
994b3ba [WangTaoTheTonic] make the project name consistent
I searched the Spark codebase for all occurrences of "scalingVector"
CC: mengxr
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #6596 from jkbradley/scalingVec-rename and squashes the following commits:
d3812f8 [Joseph K. Bradley] renamed scalingVector to scalingVec
This patch significantly refactors CatalystTypeConverters to both clean up the code and enable these conversions to work with future Project Tungsten features.
At a high level, I've reorganized the code so that all functions dealing with the same type are grouped together into type-specific subclasses of `CatalystTypeConverter`. In addition, I've added new methods that allow the Catalyst Row -> Scala Row conversions to access the Catalyst row's fields through type-specific `getTYPE()` methods rather than the generic `get()` / `Row.apply` methods. This refactoring is a blocker to being able to unit test new operators that I'm developing as part of Project Tungsten, since those operators may output `UnsafeRow` instances which don't support the generic `get()`.
The stricter usage of types here has uncovered some bugs in other parts of Spark SQL:
- #6217: DescribeCommand is assigned wrong output attributes in SparkStrategies
- #6218: DataFrame.describe() should cast all aggregates to String
- #6400: Use output schema, not relation schema, for data source input conversion
Spark SQL currently has undefined behavior for what happens when you try to create a DataFrame from user-specified rows whose values don't match the declared schema. According to the `createDataFrame()` Scaladoc:
> It is important to make sure that the structure of every [[Row]] of the provided RDD matches the provided schema. Otherwise, there will be runtime exception.
Given this, it sounds like it's technically not a break of our API contract to fail fast when the data types don't match. However, there appear to be many cases where we don't fail even though the types don't match. For example, `JavaHashingTFSuite.hashingTF` passes a column of integer values for a "label" column which is supposed to contain floats. This column isn't actually read or modified as part of query processing, so its actual concrete type doesn't seem to matter. In other cases, there could be situations where we have generic numeric aggregates that tolerate being called with different numeric types than the schema specified, but this can be okay due to numeric conversions.
In the long run, we will probably want to come up with precise semantics for implicit type conversions / widening when converting Java / Scala rows to Catalyst rows. Until then, though, I think that failing fast with a ClassCastException is a reasonable behavior; this is the approach taken in this patch. Note that certain optimizations in the inbound conversion functions for primitive types mean that we'll probably preserve the old undefined behavior in a majority of cases.
Author: Josh Rosen <joshrosen@databricks.com>
Closes #6222 from JoshRosen/catalyst-converters-refactoring and squashes the following commits:
740341b [Josh Rosen] Optimize method dispatch for primitive type conversions
befc613 [Josh Rosen] Add tests to document Option-handling behavior.
5989593 [Josh Rosen] Use new SparkFunSuite base in CatalystTypeConvertersSuite
6edf7f8 [Josh Rosen] Re-add convertToScala(), since a Hive test still needs it
3f7b2d8 [Josh Rosen] Initialize converters lazily so that the attributes are resolved first
6ad0ebb [Josh Rosen] Fix JavaHashingTFSuite ClassCastException
677ff27 [Josh Rosen] Fix null handling bug; add tests.
8033d4c [Josh Rosen] Fix serialization error in UserDefinedGenerator.
85bba9d [Josh Rosen] Fix wrong input data in InMemoryColumnarQuerySuite
9c0e4e1 [Josh Rosen] Remove last use of convertToScala().
ae3278d [Josh Rosen] Throw ClassCastException errors during inbound conversions.
7ca7fcb [Josh Rosen] Comments and cleanup
1e87a45 [Josh Rosen] WIP refactoring of CatalystTypeConverters
This is Scala example code for both linear and logistic regression. Python and Java versions are to be added.
Author: DB Tsai <dbt@netflix.com>
Closes #6576 from dbtsai/elasticNetExample and squashes the following commits:
e7ca406 [DB Tsai] fix test
6bb6d77 [DB Tsai] fix suite and remove duplicated setMaxIter
136e0dd [DB Tsai] address feedback
1ec29d4 [DB Tsai] fix style
9462f5f [DB Tsai] add example
Author: Ram Sriharsha <rsriharsha@hw11853.local>
Closes #6358 from harsha2010/SPARK-7387 and squashes the following commits:
63efda2 [Ram Sriharsha] more examples for classifier to distinguish mapreduce from spark properly
aeb6bb6 [Ram Sriharsha] Python Style Fix
54a500c [Ram Sriharsha] Merge branch 'master' into SPARK-7387
615e91c [Ram Sriharsha] cleanup
204c4e3 [Ram Sriharsha] Merge branch 'master' into SPARK-7387
7246d35 [Ram Sriharsha] [SPARK-7387][ml][doc] CrossValidator example code in Python
This is a follow-up of PR #6493, which has been reverted in branch-1.4 because it uses Java 7 specific APIs and breaks the Java 6 build. This PR replaces those APIs with equivalent Guava ones to ensure Java 6 friendliness.
cc andrewor14 pwendell: this should also be backported to branch-1.4.
Author: Cheng Lian <lian@databricks.com>
Closes #6547 from liancheng/override-log4j and squashes the following commits:
c900cfd [Cheng Lian] Addresses Shixiong's comment
72da795 [Cheng Lian] Uses Guava API to ensure Java 6 friendliness
The temporary column should be dropped after we get the prediction column. harsha2010
Author: Xiangrui Meng <meng@databricks.com>
Closes #6592 from mengxr/SPARK-8049 and squashes the following commits:
1d89107 [Xiangrui Meng] use SparkFunSuite
6ee70de [Xiangrui Meng] drop tmp col from OneVsRest output
Thanks ogirardot, closes #6580
cc rxin JoshRosen
Author: Davies Liu <davies@databricks.com>
Closes #6590 from davies/when and squashes the following commits:
c0f2069 [Davies Liu] fix Column.when() and otherwise()