Commit graph

11283 commits

Author SHA1 Message Date
Xiangrui Meng 26c9d7a0f9 [SPARK-8051] [MLLIB] make StringIndexerModel silent if input column does not exist
This is just a workaround for a bigger problem. Some pipeline stages may have no effect during prediction, and they should not complain about missing required columns, e.g. `StringIndexerModel`. jkbradley
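A minimal sketch of the behavior described here (illustrative names, not the actual MLlib code): check for the input column up front and pass the data through untouched when it is absent.

```scala
import org.apache.spark.sql.DataFrame

// Illustrative sketch: transform only when the input column exists;
// otherwise return the dataset unchanged instead of raising an error.
def transformIfPresent(dataset: DataFrame, inputCol: String)
                      (doTransform: DataFrame => DataFrame): DataFrame = {
  if (dataset.schema.fieldNames.contains(inputCol)) doTransform(dataset)
  else dataset
}
```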

Author: Xiangrui Meng <meng@databricks.com>

Closes #6595 from mengxr/SPARK-8051 and squashes the following commits:

b6a36b9 [Xiangrui Meng] add doc
f143fd4 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-8051
8ee7c7e [Xiangrui Meng] use SparkFunSuite
e112394 [Xiangrui Meng] make StringIndexerModel silent if input column does not exist
2015-06-03 15:16:24 -07:00
Shivaram Venkataraman d3e026f879 [SPARK-3674] [EC2] Clear SPARK_WORKER_INSTANCES when using YARN
cc andrewor14

Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

Closes #6424 from shivaram/spark-worker-instances-yarn-ec2 and squashes the following commits:

db244ae [Shivaram Venkataraman] Make Python Lint happy
0593d1b [Shivaram Venkataraman] Clear SPARK_WORKER_INSTANCES when using YARN
2015-06-03 15:14:38 -07:00
Hari Shreedharan a8f1f1543e [HOTFIX] Fix Hadoop-1 build caused by #5792.
Replaced the `fs.listFiles` call with the Hadoop-1-friendly `fs.listStatus` method.

Author: Hari Shreedharan <hshreedharan@apache.org>

Closes #6619 from harishreedharan/evetlog-hadoop-1-fix and squashes the following commits:

6192078 [Hari Shreedharan] [HOTFIX] Fix Hadoop-1 build caused by #5972.
2015-06-03 15:11:02 -07:00
zsxwing f27134782e [SPARK-7989] [CORE] [TESTS] Fix flaky tests in ExternalShuffleServiceSuite and SparkListenerWithClusterSuite
The flaky tests in ExternalShuffleServiceSuite and SparkListenerWithClusterSuite will fail if there are not enough executors up before running the jobs.

This PR adds `JobProgressListener.waitUntilExecutorsUp`. The tests for the cluster mode can use it to wait until the expected executors are up.
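A hedged sketch of how a cluster-mode test might use the helper; the signature (executor count plus a timeout in milliseconds) and the way the listener is reached are assumptions, not taken from the patch.

```scala
import org.apache.spark.SparkContext

// Hypothetical usage sketch (the listener is package-private in Spark, so real
// tests of this kind live inside org.apache.spark).
val sc = new SparkContext("local-cluster[2,1,512]", "flaky-test-fix")
try {
  // Block until both executors have registered before submitting jobs.
  sc.jobProgressListener.waitUntilExecutorsUp(2, 10000)
  assert(sc.parallelize(1 to 100, 4).sum() == 5050)
} finally {
  sc.stop()
}
```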

Author: zsxwing <zsxwing@gmail.com>

Closes #6546 from zsxwing/SPARK-7989 and squashes the following commits:

5560e09 [zsxwing] Fix a typo
3b69840 [zsxwing] Fix flaky tests in ExternalShuffleServiceSuite and SparkListenerWithClusterSuite
2015-06-03 15:04:20 -07:00
zsxwing 1d8669f15c [SPARK-8001] [CORE] Make AsynchronousListenerBus.waitUntilEmpty throw TimeoutException if timeout
Some places forget to call `assert` to check the return value of `AsynchronousListenerBus.waitUntilEmpty`. Instead of adding `assert` in these places, I think it's better to make `AsynchronousListenerBus.waitUntilEmpty` throw `TimeoutException`.
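A sketch of the change in behavior (signatures are assumptions): instead of returning a Boolean that callers might forget to assert on, the wait throws on timeout, so a hung listener bus can no longer pass silently.

```scala
import java.util.concurrent.TimeoutException

// Generic sketch of the throwing wait, assuming a queueIsEmpty predicate.
def waitUntilEmpty(timeoutMillis: Long)(queueIsEmpty: () => Boolean): Unit = {
  val deadline = System.currentTimeMillis() + timeoutMillis
  while (!queueIsEmpty()) {
    if (System.currentTimeMillis() > deadline) {
      throw new TimeoutException(s"The event queue is not empty after $timeoutMillis ms")
    }
    Thread.sleep(10)
  }
}

// Callers no longer need assert(bus.waitUntilEmpty(...)); a timeout fails the test on its own.
```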

Author: zsxwing <zsxwing@gmail.com>

Closes #6550 from zsxwing/SPARK-8001 and squashes the following commits:

607674a [zsxwing] Make AsynchronousListenerBus.waitUntilEmpty throw TimeoutException if timeout
2015-06-03 15:03:07 -07:00
Marcelo Vanzin aa40c44207 [SPARK-8059] [YARN] Wake up allocation thread when new requests arrive.
This should help reduce latency for new executor allocations.
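A generic wait/notify sketch of the technique (illustrative, not the actual YarnAllocator code): the allocation loop sleeps on a lock between heartbeats, and an incoming request notifies the lock so allocation starts immediately instead of waiting out the interval.

```scala
// Illustrative wake-up pattern; names and structure are not from the patch.
object AllocationLoop {
  private val lock = new Object
  private var pendingRequests = 0

  // Called when new executor requests arrive: wake the sleeping allocation thread.
  def requestExecutors(n: Int): Unit = lock.synchronized {
    pendingRequests += n
    lock.notifyAll()
  }

  // Allocation thread: wait up to the heartbeat interval, but wake early on new requests.
  def run(heartbeatIntervalMs: Long): Unit = {
    while (true) {
      lock.synchronized {
        if (pendingRequests == 0) lock.wait(heartbeatIntervalMs)
        pendingRequests = 0
      }
      // ... ask YARN for the requested containers here ...
    }
  }
}
```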

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #6600 from vanzin/SPARK-8059 and squashes the following commits:

8387a3a [Marcelo Vanzin] [SPARK-8059] [yarn] Wake up allocation thread when new requests arrive.
2015-06-03 14:59:30 -07:00
Timothy Chen bfbf12b349 [SPARK-8083] [MESOS] Use the correct base path in mesos driver page.
Author: Timothy Chen <tnachen@gmail.com>

Closes #6615 from tnachen/mesos_driver_path and squashes the following commits:

4f47b7c [Timothy Chen] Use the correct base path in mesos driver page.
2015-06-03 14:57:23 -07:00
Andrew Or c6a6dd0d07 [MINOR] [UI] Improve confusing message on log page
It's good practice to check whether the input path is inside the directory
we expect, in order to avoid potentially confusing error messages.
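A hedged sketch of the containment check being suggested (not the actual worker code): canonicalize both paths and verify that the requested file lives under the expected log directory.

```scala
import java.io.File

// Illustrative check: reject paths that escape the expected log directory.
def isUnderLogDir(requestedPath: String, logDir: String): Boolean = {
  val base = new File(logDir).getCanonicalPath + File.separator
  new File(requestedPath).getCanonicalPath.startsWith(base)
}

// isUnderLogDir("/work/logs/app-1/stdout", "/work/logs")  // true
// isUnderLogDir("/work/logs/../secrets",   "/work/logs")  // false
```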
2015-06-03 14:47:09 -07:00
Joseph K. Bradley 20a26b595c [SPARK-8054] [MLLIB] Added several Java-friendly APIs + unit tests
Java-friendly APIs added:
* GaussianMixture.run()
* GaussianMixtureModel.predict()
* DistributedLDAModel.javaTopicDistributions()
* StreamingKMeans: trainOn, predictOn, predictOnValues
* Statistics.corr
* params
  * added doc to w() since Java docs do not inherit doc
  * removed non-Java-friendly w() from StringArrayParam and DoubleArrayParam
  * made DoubleArrayParam Java-friendly w() actually Java-friendly

I generated the doc and verified all changes.

CC: mengxr

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #6562 from jkbradley/java-api-1.4 and squashes the following commits:

c16821b [Joseph K. Bradley] Small fixes based on code review.
d955581 [Joseph K. Bradley] unit test fixes
29b6b0d [Joseph K. Bradley] small fixes
fe6dcfe [Joseph K. Bradley] Added several Java-friendly APIs + unit tests: NaiveBayes, GaussianMixture, LDA, StreamingKMeans, Statistics.corr, params
2015-06-03 14:34:20 -07:00
Reynold Xin 2c5a06cafd Update documentation for [SPARK-7980] [SQL] Support SQLContext.range(end) 2015-06-03 14:20:27 -07:00
Reynold Xin 939e4f3d8d [SPARK-8074] Parquet should throw AnalysisException during setup for data type/name related failures.
Author: Reynold Xin <rxin@databricks.com>

Closes #6608 from rxin/parquet-analysis and squashes the following commits:

b5dc8e2 [Reynold Xin] Code review feedback.
5617cf6 [Reynold Xin] [SPARK-8074] Parquet should throw AnalysisException during setup for data type/name related failures.
2015-06-03 13:57:57 -07:00
Sun Rui 708c63bbbe [SPARK-8063] [SPARKR] Spark master URL conflict between MASTER env variable and --master command line option.
Author: Sun Rui <rui.sun@intel.com>

Closes #6605 from sun-rui/SPARK-8063 and squashes the following commits:

51ca48b [Sun Rui] [SPARK-8063][SPARKR] Spark master URL conflict between MASTER env variable and --master command line option.
2015-06-03 11:56:35 -07:00
Hari Shreedharan d2a86eb8f0 [SPARK-7161] [HISTORY SERVER] Provide REST api to download event logs from History Server

This PR adds a new API that allows the user to download event logs for an application as a zip file. APIs have been added to download all logs for a given application or just for a specific attempt.

This also adds a method to the ApplicationHistoryProvider to get the raw files, zipped.

Author: Hari Shreedharan <hshreedharan@apache.org>

Closes #5792 from harishreedharan/eventlog-download and squashes the following commits:

221cc26 [Hari Shreedharan] Update docs with new API information.
a131be6 [Hari Shreedharan] Fix style issues.
5528bd8 [Hari Shreedharan] Merge branch 'master' into eventlog-download
6e8156e [Hari Shreedharan] Simplify tests, use Guava stream copy methods.
d8ddede [Hari Shreedharan] Remove unnecessary case in EventLogDownloadResource.
ffffb53 [Hari Shreedharan] Changed interface to use zip stream. Added more tests.
1100b40 [Hari Shreedharan] Ensure that `Path` does not appear in interfaces, by rafactoring interfaces.
5a5f3e2 [Hari Shreedharan] Fix test ordering issue.
0b66948 [Hari Shreedharan] Minor formatting/import fixes.
4fc518c [Hari Shreedharan] Fix rat failures.
a48b91f [Hari Shreedharan] Refactor to make attemptId optional in the API. Also added tests.
0fc1424 [Hari Shreedharan] File download now works for individual attempts and the entire application.
350d7e8 [Hari Shreedharan] Merge remote-tracking branch 'asf/master' into eventlog-download
fd6ab00 [Hari Shreedharan] Fix style issues
32b7662 [Hari Shreedharan] Use UIRoot directly in ApiRootResource. Also, use `Response` class to set headers.
7b362b2 [Hari Shreedharan] Almost working.
3d18ebc [Hari Shreedharan] [WIP] Try getting the event log download to work.
2015-06-03 13:43:13 -05:00
animesh d053a31be9 [SPARK-7980] [SQL] Support SQLContext.range(end)
1. range() overloaded in SQLContext.scala
2. range() modified in python sql context.py
3. Tests added accordingly in DataFrameSuite.scala and python sql tests.py
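For reference, a small sketch of the Scala side of the overload (the single-argument form starts the range at 0):

```scala
import org.apache.spark.sql.SQLContext

val sqlContext: SQLContext = ???   // an existing SQLContext

val df1 = sqlContext.range(5)      // new overload: rows with id = 0, 1, 2, 3, 4
val df2 = sqlContext.range(0, 5)   // equivalent existing two-argument form
```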

Author: animesh <animesh@apache.spark>

Closes #6609 from animeshbaranawal/SPARK-7980 and squashes the following commits:

935899c [animesh] SPARK-7980:python+scala changes
2015-06-03 11:28:18 -07:00
Patrick Wendell 2c4d550eda [SPARK-7801] [BUILD] Updating versions to SPARK 1.5.0
Author: Patrick Wendell <patrick@databricks.com>

Closes #6328 from pwendell/spark-1.5-update and squashes the following commits:

2f42d02 [Patrick Wendell] A few more excludes
4bebcf0 [Patrick Wendell] Update to RC4
61aaf46 [Patrick Wendell] Using new release candidate
55f1610 [Patrick Wendell] Another exclude
04b4f04 [Patrick Wendell] More issues with transient 1.4 changes
36f549b [Patrick Wendell] [SPARK-7801] [BUILD] Updating versions to SPARK 1.5.0
2015-06-03 10:11:27 -07:00
Yin Huai f1646e1023 [SPARK-7973] [SQL] Increase the timeout of two CliSuite tests.
https://issues.apache.org/jira/browse/SPARK-7973

Author: Yin Huai <yhuai@databricks.com>

Closes #6525 from yhuai/SPARK-7973 and squashes the following commits:

763b821 [Yin Huai] Also change the timeout of "Single command with -e" to 2 minutes.
e598a08 [Yin Huai] Increase the timeout to 3 minutes.
2015-06-03 09:26:21 -07:00
Yuhao Yang 28dbde3874 [SPARK-7983] [MLLIB] Add require for one-based indices in loadLibSVMFile
jira: https://issues.apache.org/jira/browse/SPARK-7983

Customers frequently use zero-based indices in their LIBSVM files. Spark reports no warnings or errors during the subsequent computation, and this usually leads to weird results for many algorithms (like GBDT).

This PR adds a quick check.
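For context, LIBSVM feature indices are one-based; a hedged sketch of the kind of checks being added (one-based indices and ascending order), not the literal patch:

```scala
// A LIBSVM line looks like: "<label> <index1>:<value1> <index2>:<value2> ..."
//   valid (one-based):    1.0 1:0.5 3:1.2
//   invalid (zero-based): 1.0 0:0.5 2:1.2   <- should now fail fast

// Illustrative checks on the parsed indices of one line:
def checkIndices(indices: Array[Int]): Unit = {
  require(indices.forall(_ > 0),
    "Feature indices in LIBSVM files must be one-based (> 0).")
  require(indices.sliding(2).forall(p => p.length < 2 || p(0) < p(1)),
    "Feature indices must be in strictly ascending order.")
}
```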

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #6538 from hhbyyh/loadSVM and squashes the following commits:

79d9c11 [Yuhao Yang] optimization as respond to comments
4310710 [Yuhao Yang] merge conflict
96460f1 [Yuhao Yang] merge conflict
20a2811 [Yuhao Yang] use require
6e4f8ca [Yuhao Yang] add check for ascending order
9956365 [Yuhao Yang] add ut for 0-based loadlibsvm exception
5bd1f9a [Yuhao Yang] add require for one-based in loadLIBSVM
2015-06-03 13:15:57 +02:00
Wenchen Fan d38cf217e0 [SPARK-7562][SPARK-6444][SQL] Improve error reporting for expression data type mismatch
It seems hard to find a common pattern for checking types in `Expression`. Sometimes we know exactly which input types we need (e.g. `And` needs two booleans); sometimes we only have rules (e.g. `Add` needs two numeric types that are equal). So I defined a general interface `checkInputDataTypes` in `Expression` which returns a `TypeCheckResult`. `TypeCheckResult` tells whether the expression passes the type check and, if not, what the type mismatch is.

This PR mainly applies input type checking to arithmetic and predicate expressions.

TODO: apply type checking interface to more expressions.
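A hedged sketch of the shape of that interface; the actual Catalyst definitions may differ in detail:

```scala
// Illustrative sketch of the type-checking contract described above.
sealed trait TypeCheckResult { def isSuccess: Boolean }
case object TypeCheckSuccess extends TypeCheckResult { val isSuccess = true }
case class TypeCheckFailure(message: String) extends TypeCheckResult { val isSuccess = false }

trait ExpressionLike {
  // Each expression reports whether its children's data types are acceptable,
  // e.g. an And-like expression needs two booleans, an Add-like expression needs
  // two numeric inputs of the same type.
  def checkInputDataTypes(): TypeCheckResult
}
```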

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #6405 from cloud-fan/6444 and squashes the following commits:

b5ff31b [Wenchen Fan] address comments
b917275 [Wenchen Fan] rebase
39929d9 [Wenchen Fan] add todo
0808fd2 [Wenchen Fan] make constrcutor of TypeCheckResult private
3bee157 [Wenchen Fan] and decimal type coercion rule for binary comparison
8883025 [Wenchen Fan] apply type check interface to CaseWhen
cffb67c [Wenchen Fan] to have resolved call the data type check function
6eaadff [Wenchen Fan] add equal type constraint to EqualTo
3affbd8 [Wenchen Fan] more fixes
654d46a [Wenchen Fan] improve tests
e0a3628 [Wenchen Fan] improve error message
1524ff6 [Wenchen Fan] fix style
69ca3fe [Wenchen Fan] add error message and tests
c71d02c [Wenchen Fan] fix hive tests
6491721 [Wenchen Fan] use value class TypeCheckResult
7ae76b9 [Wenchen Fan] address comments
cb77e4f [Wenchen Fan] Improve error reporting for expression data type mismatch
2015-06-03 00:47:52 -07:00
Reynold Xin ce320cb2db [SPARK-8060] Improve DataFrame Python test coverage and documentation.
Author: Reynold Xin <rxin@databricks.com>

Closes #6601 from rxin/python-read-write-test-and-doc and squashes the following commits:

baa8ad5 [Reynold Xin] Code review feedback.
f081d47 [Reynold Xin] More documentation updates.
c9902fa [Reynold Xin] [SPARK-8060] Improve DataFrame Python reader/writer interface doc and testing.
2015-06-03 00:23:34 -07:00
MechCoder 452eb82dd7 [SPARK-8032] [PYSPARK] Make version checking for NumPy in MLlib more robust
The current check compares version strings, so `1.x` is treated as less than `1.4` lexicographically. This fails whenever x has more than one digit: `1.10` sorts before `1.4` as a string even though 10 > 4.

It fails in my system since I have version `1.10` :P
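The root cause is lexicographic versus numeric comparison; a small illustration (in Scala, just to show the ordering problem, not the PySpark fix):

```scala
// As strings, "1.10" sorts before "1.4", which is numerically wrong.
val lexicographic = "1.10" < "1.4"                 // true

// Comparing the components as integers gives the intended ordering.
def components(v: String): (Int, Int) = {
  val Array(major, minor) = v.split("\\.").take(2)
  (major.toInt, minor.toInt)
}
val numeric = Ordering[(Int, Int)].lt(components("1.10"), components("1.4"))  // false
```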

Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #6579 from MechCoder/np_ver and squashes the following commits:

15430f8 [MechCoder] fix syntax error
893fb7e [MechCoder] remove equal to
e35f0d4 [MechCoder] minor
e89376c [MechCoder] Better checking
22703dd [MechCoder] [SPARK-8032] Make version checking for NumPy in MLlib more robust
2015-06-02 23:24:47 -07:00
Yuhao Yang 43adbd5611 [SPARK-8043] [MLLIB] [DOC] update NaiveBayes and SVM examples in doc
jira: https://issues.apache.org/jira/browse/SPARK-8043

I found some issues while testing the save/load examples in the markdown documentation, as part of the 1.4 QA plan.

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #6584 from hhbyyh/naiveDocExample and squashes the following commits:

a01a206 [Yuhao Yang] fix for Gaussian mixture
2fb8b96 [Yuhao Yang] update NaiveBayes and SVM examples in doc
2015-06-02 23:15:38 -07:00
WangTaoTheTonic ccaa823290 [MINOR] make the launcher project name consistent with others
I found this by chance while building Spark and think it is better to keep its name consistent with the other sub-projects (Spark Project *).

I am not going to file a JIRA as it is a pretty small issue.

Author: WangTaoTheTonic <wangtao111@huawei.com>

Closes #6603 from WangTaoTheTonic/projName and squashes the following commits:

994b3ba [WangTaoTheTonic] make the project name consistent
2015-06-02 22:59:48 -07:00
Joseph K. Bradley 07c16cb5ba [SPARK-8053] [MLLIB] renamed scalingVector to scalingVec
I searched the Spark codebase for all occurrences of "scalingVector"

CC: mengxr

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #6596 from jkbradley/scalingVec-rename and squashes the following commits:

d3812f8 [Joseph K. Bradley] renamed scalingVector to scalingVec
2015-06-02 22:56:56 -07:00
Josh Rosen cafd5056e1 [SPARK-7691] [SQL] Refactor CatalystTypeConverter to use type-specific row accessors
This patch significantly refactors CatalystTypeConverters to both clean up the code and enable these conversions to work with future Project Tungsten features.

At a high level, I've reorganized the code so that all functions dealing with the same type are grouped together into type-specific subclasses of `CatalystTypeConverter`.  In addition, I've added new methods that allow the Catalyst Row -> Scala Row conversions to access the Catalyst row's fields through type-specific `getTYPE()` methods rather than the generic `get()` / `Row.apply` methods.  This refactoring is a blocker to being able to unit test new operators that I'm developing as part of Project Tungsten, since those operators may output `UnsafeRow` instances which don't support the generic `get()`.

The stricter usage of types here has uncovered some bugs in other parts of Spark SQL:

- #6217: DescribeCommand is assigned wrong output attributes in SparkStrategies
- #6218: DataFrame.describe() should cast all aggregates to String
- #6400: Use output schema, not relation schema, for data source input conversion

Spark SQL currently has undefined behavior for what happens when you try to create a DataFrame from user-specified rows whose values don't match the declared schema.  According to the `createDataFrame()` Scaladoc:

>  It is important to make sure that the structure of every [[Row]] of the provided RDD matches the provided schema. Otherwise, there will be runtime exception.

Given this, it sounds like it's technically not a breach of our API contract to fail fast when the data types don't match. However, there appear to be many cases where we don't fail even though the types don't match. For example, `JavaHashingTFSuite.hasingTF` passes a column of integer values for a "label" column which is supposed to contain floats.  This column isn't actually read or modified as part of query processing, so its actual concrete type doesn't seem to matter. In other cases, generic numeric aggregates may tolerate being called with numeric types other than what the schema specified, which can be okay due to numeric conversions.

In the long run, we will probably want to come up with precise semantics for implicit type conversions / widening when converting Java / Scala rows to Catalyst rows.  Until then, though, I think that failing fast with a ClassCastException is a reasonable behavior; this is the approach taken in this patch.  Note that certain optimizations in the inbound conversion functions for primitive types mean that we'll probably preserve the old undefined behavior in a majority of cases.
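A small illustration of the fail-fast behavior described above (hedged; where exactly the exception surfaces depends on the converter involved):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

// Declared schema says "label" is a Double, but the row carries an Int.
val schema = StructType(Seq(StructField("label", DoubleType, nullable = false)))
val rows = Seq(Row(42))

// With the stricter converters, materializing such a DataFrame is expected to
// fail fast with a ClassCastException during inbound conversion rather than
// exhibit undefined behavior:
//   val df = sqlContext.createDataFrame(sc.parallelize(rows), schema)
//   df.collect()
```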

Author: Josh Rosen <joshrosen@databricks.com>

Closes #6222 from JoshRosen/catalyst-converters-refactoring and squashes the following commits:

740341b [Josh Rosen] Optimize method dispatch for primitive type conversions
befc613 [Josh Rosen] Add tests to document Option-handling behavior.
5989593 [Josh Rosen] Use new SparkFunSuite base in CatalystTypeConvertersSuite
6edf7f8 [Josh Rosen] Re-add convertToScala(), since a Hive test still needs it
3f7b2d8 [Josh Rosen] Initialize converters lazily so that the attributes are resolved first
6ad0ebb [Josh Rosen] Fix JavaHashingTFSuite ClassCastException
677ff27 [Josh Rosen] Fix null handling bug; add tests.
8033d4c [Josh Rosen] Fix serialization error in UserDefinedGenerator.
85bba9d [Josh Rosen] Fix wrong input data in InMemoryColumnarQuerySuite
9c0e4e1 [Josh Rosen] Remove last use of convertToScala().
ae3278d [Josh Rosen] Throw ClassCastException errors during inbound conversions.
7ca7fcb [Josh Rosen] Comments and cleanup
1e87a45 [Josh Rosen] WIP refactoring of CatalystTypeConverters
2015-06-02 22:11:03 -07:00
DB Tsai a86b3e9b9b [SPARK-7547] [ML] Scala Example code for ElasticNet
This is Scala example code for both linear and logistic regression. Python and Java versions are to be added.
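As a pointer, a minimal Scala sketch of the API the example exercises (assuming spark.ml's LinearRegression; the example code in the PR is more complete):

```scala
import org.apache.spark.ml.regression.LinearRegression

// Elastic net mixes L1 and L2 regularization:
//   elasticNetParam = 0.0 -> pure L2 (ridge), 1.0 -> pure L1 (lasso).
val lr = new LinearRegression()
  .setMaxIter(100)
  .setRegParam(0.3)
  .setElasticNetParam(0.8)

// val model = lr.fit(trainingDF)   // trainingDF has "label" and "features" columns
```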

Author: DB Tsai <dbt@netflix.com>

Closes #6576 from dbtsai/elasticNetExample and squashes the following commits:

e7ca406 [DB Tsai] fix test
6bb6d77 [DB Tsai] fix suite and remove duplicated setMaxIter
136e0dd [DB Tsai] address feedback
1ec29d4 [DB Tsai] fix style
9462f5f [DB Tsai] add example
2015-06-02 19:12:08 -07:00
Ram Sriharsha c3f4c32571 [SPARK-7387] [ML] [DOC] CrossValidator example code in Python
Author: Ram Sriharsha <rsriharsha@hw11853.local>

Closes #6358 from harsha2010/SPARK-7387 and squashes the following commits:

63efda2 [Ram Sriharsha] more examples for classifier to distinguish mapreduce from spark properly
aeb6bb6 [Ram Sriharsha] Python Style Fix
54a500c [Ram Sriharsha] Merge branch 'master' into SPARK-7387
615e91c [Ram Sriharsha] cleanup
204c4e3 [Ram Sriharsha] Merge branch 'master' into SPARK-7387
7246d35 [Ram Sriharsha] [SPARK-7387][ml][doc] CrossValidator example code in Python
2015-06-02 18:53:04 -07:00
Cheng Lian 5cd6a63d96 [SQL] [TEST] [MINOR] Follow-up of PR #6493, use Guava API to ensure Java 6 friendliness
This is a follow-up of PR #6493, which has been reverted in branch-1.4 because it uses Java 7-specific APIs and breaks the Java 6 build. This PR replaces those APIs with equivalent Guava ones to ensure Java 6 friendliness.

cc andrewor14 pwendell, this should also be back ported to branch-1.4.

Author: Cheng Lian <lian@databricks.com>

Closes #6547 from liancheng/override-log4j and squashes the following commits:

c900cfd [Cheng Lian] Addresses Shixiong's comment
72da795 [Cheng Lian] Uses Guava API to ensure Java 6 friendliness
2015-06-02 17:07:13 -07:00
Xiangrui Meng 89f21f66b5 [SPARK-8049] [MLLIB] drop tmp col from OneVsRest output
The temporary column should be dropped after we get the prediction column. harsha2010

Author: Xiangrui Meng <meng@databricks.com>

Closes #6592 from mengxr/SPARK-8049 and squashes the following commits:

1d89107 [Xiangrui Meng] use SparkFunSuite
6ee70de [Xiangrui Meng] drop tmp col from OneVsRest output
2015-06-02 16:51:17 -07:00
Davies Liu 605ddbb27c [SPARK-8038] [SQL] [PYSPARK] fix Column.when() and otherwise()
Thanks ogirardot, closes #6580

cc rxin JoshRosen

Author: Davies Liu <davies@databricks.com>

Closes #6590 from davies/when and squashes the following commits:

c0f2069 [Davies Liu] fix Column.when() and otherwise()
2015-06-02 13:38:06 -07:00
Cheng Lian 686a45f0b9 [SPARK-8014] [SQL] Avoid premature metadata discovery when writing a HadoopFsRelation with a save mode other than Append
The current code references the schema of the DataFrame to be written before checking the save mode. This triggers expensive metadata discovery prematurely. For save modes other than `Append`, this metadata discovery is useless, since we either ignore the result (for `Ignore` and `ErrorIfExists`) or delete existing files (for `Overwrite`) later.

This PR fixes the issue by deferring metadata discovery until after the save mode check.
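A hedged sketch of the reordering (illustrative control flow, not the actual data source resolution code): decide on the save mode first, and only run the expensive discovery on the path that needs it.

```scala
import org.apache.spark.sql.SaveMode

// Illustrative control flow only.
def write(mode: SaveMode, pathExists: Boolean,
          discoverExistingMetadata: () => Unit,   // expensive: schema merging, partition discovery
          deleteExisting: () => Unit,
          doWrite: () => Unit): Unit = {
  if (pathExists && mode == SaveMode.Ignore) {
    ()                                            // nothing to write, nothing discovered
  } else if (pathExists && mode == SaveMode.ErrorIfExists) {
    sys.error("path already exists")              // fail before any discovery
  } else {
    if (pathExists && mode == SaveMode.Overwrite) deleteExisting()
    if (mode == SaveMode.Append) discoverExistingMetadata()   // only Append needs it
    doWrite()
  }
}
```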

Author: Cheng Lian <lian@databricks.com>

Closes #6583 from liancheng/spark-8014 and squashes the following commits:

1aafabd [Cheng Lian] Updates comments
088abaa [Cheng Lian] Avoids schema merging and partition discovery when data schema and partition schema are defined
8fbd93f [Cheng Lian] Fixes SPARK-8014
2015-06-02 13:32:13 -07:00
Mike Dusenberry ad06727fe9 [SPARK-7985] [ML] [MLlib] [Docs] Remove "fittingParamMap" references. Updating ML Doc "Estimator, Transformer, and Param" examples.
Updating ML Doc's *"Estimator, Transformer, and Param"* example to use `model.extractParamMap` instead of `model.fittingParamMap`, which no longer exists.

mengxr, I believe this addresses (part of) the *update documentation* TODO list item from [PR 5820](https://github.com/apache/spark/pull/5820).

Author: Mike Dusenberry <dusenberrymw@gmail.com>

Closes #6514 from dusenberrymw/Fix_ML_Doc_Estimator_Transformer_Param_Example and squashes the following commits:

6366e1f [Mike Dusenberry] Updating instances of model.extractParamMap to model.parent.extractParamMap, since the Params of the parent Estimator could possibly differ from thos of the Model.
d850e0e [Mike Dusenberry] Removing all references to "fittingParamMap" throughout Spark, since it has been removed.
0480304 [Mike Dusenberry] Updating the ML Doc "Estimator, Transformer, and Param" Java example to use model.extractParamMap() instead of model.fittingParamMap(), which no longer exists.
7d34939 [Mike Dusenberry] Updating ML Doc "Estimator, Transformer, and Param" example to use model.extractParamMap instead of model.fittingParamMap, which no longer exists.
2015-06-02 12:38:14 -07:00
Marcelo Vanzin 0071bd8d31 [SPARK-8015] [FLUME] Remove Guava dependency from flume-sink.
The minimal change would be to disable shading of Guava in the module and
rely on the transitive dependency from other libraries instead. But since
Guava's use is so localized, I think it's better to just not use it at all,
so I replaced that code and removed all traces of Guava from the module's
build.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #6555 from vanzin/SPARK-8015 and squashes the following commits:

c0ceea8 [Marcelo Vanzin] Add comments about dependency management.
c38228d [Marcelo Vanzin] Add guava dep in test scope.
b7a0349 [Marcelo Vanzin] Add libthrift exclusion.
6e0942d [Marcelo Vanzin] Add comment in pom.
2d79260 [Marcelo Vanzin] [SPARK-8015] [flume] Remove Guava dependency from flume-sink.
2015-06-02 11:20:33 -07:00
Cheng Lian 1bb5d716c0 [SPARK-8037] [SQL] Ignores files whose name starts with dot in HadoopFsRelation
Author: Cheng Lian <lian@databricks.com>

Closes #6581 from liancheng/spark-8037 and squashes the following commits:

d08e97b [Cheng Lian] Ignores files whose name starts with dot in HadoopFsRelation
2015-06-03 00:59:50 +08:00
Xiangrui Meng bd97840d5c [SPARK-7432] [MLLIB] fix flaky CrossValidator doctest
The new test uses CV to compare `maxIter=0` and `maxIter=1`, and validates the evaluation result. jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #6572 from mengxr/SPARK-7432 and squashes the following commits:

c236bb8 [Xiangrui Meng] fix flacky cv doctest
2015-06-02 08:51:00 -07:00
Davies Liu 445647a1a3 [SPARK-8021] [SQL] [PYSPARK] make Python read/write API consistent with Scala
Add schema()/format()/options() for the reader; add mode()/format()/options()/partitionBy() for the writer.

cc rxin yhuai  pwendell
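For reference, a sketch of the Scala reader/writer chain the Python API is being aligned with (paths and options are placeholders):

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val sqlContext: SQLContext = ???   // an existing SQLContext

// Reader: schema() / format() / options()
val schema = StructType(Seq(StructField("name", StringType)))
val df = sqlContext.read
  .format("json")
  .schema(schema)
  .options(Map("samplingRatio" -> "0.1"))
  .load("/tmp/people.json")

// Writer: mode() / format() / options() / partitionBy()
df.write
  .format("parquet")
  .mode("overwrite")
  .partitionBy("name")
  .save("/tmp/people")
```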

Author: Davies Liu <davies@databricks.com>

Closes #6578 from davies/readwrite and squashes the following commits:

720d293 [Davies Liu] address comments
b65dfa2 [Davies Liu] Update readwriter.py
1299ab6 [Davies Liu] make Python API consistent with Scala
2015-06-02 08:37:18 -07:00
Yin Huai 0f80990bfa [SPARK-8023][SQL] Add "deterministic" attribute to Expression to avoid collapsing nondeterministic projects.
This closes #6570.

Author: Yin Huai <yhuai@databricks.com>
Author: Reynold Xin <rxin@databricks.com>

Closes #6573 from rxin/deterministic and squashes the following commits:

356cd22 [Reynold Xin] Added unit test for the optimizer.
da3fde1 [Reynold Xin] Merge pull request #6570 from yhuai/SPARK-8023
da56200 [Yin Huai] Comments.
e38f264 [Yin Huai] Comment.
f9d6a73 [Yin Huai] Add a deterministic method to Expression.
2015-06-02 00:20:52 -07:00
Yin Huai 7b7f7b6c6f [SPARK-8020] [SQL] Spark SQL conf in spark-defaults.conf make metadataHive get constructed too early
https://issues.apache.org/jira/browse/SPARK-8020

Author: Yin Huai <yhuai@databricks.com>

Closes #6571 from yhuai/SPARK-8020-1 and squashes the following commits:

0398f5b [Yin Huai] First populate the SQLConf and then construct executionHive and metadataHive.
2015-06-02 00:16:56 -07:00
Davies Liu bcb47ad771 [SPARK-6917] [SQL] DecimalType is not read back when non-native type exists
cc yhuai

Author: Davies Liu <davies@databricks.com>

Closes #6558 from davies/decimalType and squashes the following commits:

c877ca8 [Davies Liu] Update ParquetConverter.scala
48cc57c [Davies Liu] Update ParquetConverter.scala
b43845c [Davies Liu] add test
3b4a94f [Davies Liu] DecimalType is not read back when non-native type exists
2015-06-01 23:12:29 -07:00
Xiangrui Meng 0221c7f0ef [SPARK-7582] [MLLIB] user guide for StringIndexer
This PR adds a Java unit test and user guide for `StringIndexer`. I put it before `OneHotEncoder` because they are closely related. jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #6561 from mengxr/SPARK-7582 and squashes the following commits:

4bba4f1 [Xiangrui Meng] fix example
ba1cd1b [Xiangrui Meng] fix style
7fa18d1 [Xiangrui Meng] add user guide for StringIndexer
136cb93 [Xiangrui Meng] add a Java unit test for StringIndexer
2015-06-01 22:03:29 -07:00
Reynold Xin b53a011647 Fixed typo in the previous commit. 2015-06-01 21:41:53 -07:00
Yin Huai e797dba58e [SPARK-7965] [SPARK-7972] [SQL] Handle expressions containing multiple window expressions and make parser match window frames in case insensitive way
JIRAs:
https://issues.apache.org/jira/browse/SPARK-7965
https://issues.apache.org/jira/browse/SPARK-7972

Author: Yin Huai <yhuai@databricks.com>

Closes #6524 from yhuai/7965-7972 and squashes the following commits:

c12c79c [Yin Huai] Add doc for returned value.
de64328 [Yin Huai] Address rxin's comments.
fc9b1ad [Yin Huai] wip
2996da4 [Yin Huai] scala style
20b65b7 [Yin Huai] Handle expressions containing multiple window expressions.
9568b21 [Yin Huai] case insensitive matches
41f633d [Yin Huai] Failed test case.
2015-06-01 21:40:17 -07:00
zsxwing 7f74bb3bc6 [SPARK-8025][Streaming]Add JavaDoc style deprecation for deprecated Streaming methods
The Scala `deprecated` annotation doesn't actually show up in the generated JavaDoc.
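A sketch of the pattern this suggests (an assumption about the exact fix): keep the Scala annotation for compiler warnings and also put a `@deprecated` tag in the doc comment so the notice survives into the generated JavaDoc.

```scala
object StreamingExampleApi {
  /**
   * Counts words in each batch.
   *
   * @deprecated As of 1.4.0, replaced by the newer API. (This tag shows up in JavaDoc.)
   */
  @deprecated("Use the newer API instead", "1.4.0")   // produces Scala compiler warnings only
  def countWords(): Unit = { /* ... */ }
}
```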

Author: zsxwing <zsxwing@gmail.com>

Closes #6564 from zsxwing/SPARK-8025 and squashes the following commits:

2faa2bb [zsxwing] Add JavaDoc style deprecation for deprecated Streaming methods
2015-06-01 21:36:49 -07:00
Reynold Xin 75dda33f3e Revert "[SPARK-8020] Spark SQL in spark-defaults.conf make metadataHive get constructed too early"
This reverts commit 91f6be87bc.
2015-06-01 21:35:55 -07:00
Yin Huai 91f6be87bc [SPARK-8020] Spark SQL in spark-defaults.conf make metadataHive get constructed too early
https://issues.apache.org/jira/browse/SPARK-8020

Author: Yin Huai <yhuai@databricks.com>

Closes #6563 from yhuai/SPARK-8020 and squashes the following commits:

4e5addc [Yin Huai] style
bf766c6 [Yin Huai] Failed test.
0398f5b [Yin Huai] First populate the SQLConf and then construct executionHive and metadataHive.
2015-06-01 21:33:57 -07:00
Reynold Xin 4c868b9943 [minor doc] Add exploratory data analysis warning for DataFrame.stat.freqItem API
Author: Reynold Xin <rxin@databricks.com>

Closes #6569 from rxin/freqItemsWarning and squashes the following commits:

7eec145 [Reynold Xin] [minor doc] Add exploratory data analysis warning for DataFrame.stat.freqItem API.
2015-06-01 21:29:39 -07:00
Shivaram Venkataraman cae9306c4f [SPARK-8027] [SPARKR] Add maven profile to build R package docs
Also use that profile in create-release.sh

cc pwendell -- Note that this means that we need `knitr` and `roxygen` installed on the machines used for building the release. Let me know if you need help with that.

Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

Closes #6567 from shivaram/SPARK-8027 and squashes the following commits:

8dc8ecf [Shivaram Venkataraman] Add maven profile to build R package docs Also use that profile in create-release.sh
2015-06-01 21:21:45 -07:00
Reynold Xin 89f642a0e8 [SPARK-8026][SQL] Add Column.alias to Scala/Java DataFrame API
Author: Reynold Xin <rxin@databricks.com>

Closes #6565 from rxin/alias and squashes the following commits:

286d880 [Reynold Xin] [SPARK-8026][SQL] Add Column.alias to Scala/Java DataFrame API
2015-06-01 21:13:15 -07:00
Reynold Xin 6396cc0303 [SPARK-7982][SQL] DataFrame.stat.crosstab should use 0 instead of null for pairs that don't appear
Author: Reynold Xin <rxin@databricks.com>

Closes #6566 from rxin/crosstab and squashes the following commits:

e0ace1c [Reynold Xin] [SPARK-7982][SQL] DataFrame.stat.crosstab should use 0 instead of null for pairs that don't appear
2015-06-01 21:11:19 -07:00
Shivaram Venkataraman 6b44278ef7 [SPARK-8028] [SPARKR] Use addJar instead of setJars in SparkR
This prevents `spark.jars` from being cleared when using `--packages` or `--jars`.

cc pwendell davies brkyvz

Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

Closes #6568 from shivaram/SPARK-8028 and squashes the following commits:

3a9cf1f [Shivaram Venkataraman] Use addJar instead of setJars in SparkR This prevents the spark.jars from being cleared
2015-06-01 21:01:14 -07:00
Andrew Or 15d7c90aeb [MINOR] [UI] Improve error message on log page
Currently, if a bad log type is specified, we get a blank page.
We should provide a more informative error message.
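A hedged sketch of the kind of validation implied here (the supported log types are an assumption):

```scala
// Illustrative validation; treating "stdout"/"stderr" as the supported types is an assumption.
val supportedLogTypes = Set("stdout", "stderr")

def renderLog(logType: String, readLog: String => String): String = {
  if (!supportedLogTypes.contains(logType)) {
    s"Invalid log type: $logType"    // informative message instead of a blank page
  } else {
    readLog(logType)
  }
}
```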
2015-06-01 20:09:45 -07:00