Commit graph

10722 commits

Author SHA1 Message Date
Ilya Ganelin 4bdfb7bab3 [SPARK-5750][SPARK-3441][SPARK-5836][CORE] Added documentation explaining shuffle
I've updated the Spark Programming Guide to add a section on the shuffle operation providing some background on what it does. I've also addressed some of its performance impacts.

I've included documentation to address the following issues:
https://issues.apache.org/jira/browse/SPARK-5836
https://issues.apache.org/jira/browse/SPARK-3441
https://issues.apache.org/jira/browse/SPARK-5750

https://issues.apache.org/jira/browse/SPARK-4227 is related but can be addressed in a separate PR since it involves updates to the Spark Configuration Guide.

Author: Ilya Ganelin <ilya.ganelin@capitalone.com>
Author: Ilya Ganelin <ilganeli@gmail.com>

Closes #5074 from ilganeli/SPARK-5750 and squashes the following commits:

6178e24 [Ilya Ganelin] Update programming-guide.md
7a0b96f [Ilya Ganelin] Update programming-guide.md
2c5df08 [Ilya Ganelin] Merge branch 'SPARK-5750' of github.com:ilganeli/spark into SPARK-5750
dffbd2d [Ilya Ganelin] [SPARK-5750] Slight wording update
1ff4eb4 [Ilya Ganelin] Merge remote-tracking branch 'upstream/master' into SPARK-5750
85f9c6e [Ilya Ganelin] Update programming-guide.md
349d1fa [Ilya Ganelin] Added cross linkf or configuration page
eeb5a7a [Ilya Ganelin] [SPARK-5750] Added some minor fixes
dd5cc9d [Ilya Ganelin] [SPARK-5750] Fixed some factual inaccuracies with regards to shuffle internals.
a8adb57 [Ilya Ganelin] [SPARK-5750] Incoporated feedback from Sean Owen
9954bbe [Ilya Ganelin] Merge remote-tracking branch 'upstream/master' into SPARK-5750
159dd1c [Ilya Ganelin] [SPARK-5750] Style fixes from rxin.
75ef67b [Ilya Ganelin] [SPARK-5750][SPARK-3441][SPARK-5836] Added documentation explaining the shuffle operation and included errata from a number of other JIRAs
2015-03-30 11:54:01 +01:00
CodingCat de6733036e [SPARK-6596] fix the instruction on building scaladoc
In README.md under docs/ directory, it says that

> You can build just the Spark scaladoc by running build/sbt doc from the SPARK_PROJECT_ROOT directory.

I guess the right approach is build/sbt unidoc

Author: CodingCat <zhunansjtu@gmail.com>

Closes #5253 from CodingCat/SPARK-6596 and squashes the following commits:

af379ed [CodingCat] fix the instruction on building scaladoc
2015-03-30 11:41:43 +01:00
Eran Medan 17b13c53ec [spark-sql] a better exception message than "scala.MatchError" for unsupported types in Schema creation
Currently if trying to register an RDD (or DataFrame in 1.3) as a table that has types that have no supported Schema representation (e.g. type "Any") - it would throw a match error. e.g. scala.MatchError: Any (of class scala.reflect.internal.Types$ClassNoArgsTypeRef)

This fix is just to have a nicer error message than a MatchError

Author: Eran Medan <ehrann.mehdan@gmail.com>

Closes #5235 from eranation/patch-2 and squashes the following commits:

af4b1a2 [Eran Medan] Line should be under 100 chars
0c69e9d [Eran Medan] Change from sys.error UnsupportedOperationException
524be86 [Eran Medan] better exception than scala.MatchError: Any
2015-03-30 00:02:52 -07:00
Li Zhihui 01dc9f50d1 Fix string interpolator error in HeartbeatReceiver
Error log before fixed
<code>15/03/29 10:07:25 ERROR YarnScheduler: Lost an executor 24 (already removed): Executor heartbeat timed out after ${now - lastSeenMs} ms</code>

Author: Li Zhihui <zhihui.li@intel.com>

Closes #5255 from li-zhihui/fixstringinterpolator and squashes the following commits:

c93f2b7 [Li Zhihui] Fix string interpolator error in HeartbeatReceiver
2015-03-29 21:30:37 -07:00
zsxwing a8d53afb4e [SPARK-5124][Core] A standard RPC interface and an Akka implementation
This PR added a standard internal RPC interface for Spark and an Akka implementation. See [the design document](https://issues.apache.org/jira/secure/attachment/12698710/Pluggable%20RPC%20-%20draft%202.pdf) for more details.

I will split the whole work into multiple PRs to make it easier for code review. This is the first PR and avoid to touch too many files.

Author: zsxwing <zsxwing@gmail.com>

Closes #4588 from zsxwing/rpc-part1 and squashes the following commits:

fe3df4c [zsxwing] Move registerEndpoint and use actorSystem.dispatcher in asyncSetupEndpointRefByURI
f6f3287 [zsxwing] Remove RpcEndpointRef.toURI
8bd1097 [zsxwing] Fix docs and the code style
f459380 [zsxwing] Add RpcAddress.fromURI and rename urls to uris
b221398 [zsxwing] Move send methods above ask methods
15cfd7b [zsxwing] Merge branch 'master' into rpc-part1
9ffa997 [zsxwing] Fix MiMa tests
78a1733 [zsxwing] Merge remote-tracking branch 'origin/master' into rpc-part1
385b9c3 [zsxwing] Fix the code style and add docs
2cc3f78 [zsxwing] Add an asynchronous version of setupEndpointRefByUrl
e8dfec3 [zsxwing] Remove 'sendWithReply(message: Any, sender: RpcEndpointRef): Unit'
08564ae [zsxwing] Add RpcEnvFactory to create RpcEnv
e5df4ca [zsxwing] Handle AkkaFailure(e) in Actor
ec7c5b0 [zsxwing] Fix docs
7fc95e1 [zsxwing] Implement askWithReply in RpcEndpointRef
9288406 [zsxwing] Document thread-safety for setupThreadSafeEndpoint
3007c09 [zsxwing] Move setupDriverEndpointRef to RpcUtils and rename to makeDriverRef
c425022 [zsxwing] Fix the code style
5f87700 [zsxwing] Move the logical of processing message to a private function
3e56123 [zsxwing] Use lazy to eliminate CountDownLatch
07f128f [zsxwing] Remove ActionScheduler.scala
4d34191 [zsxwing] Remove scheduler from RpcEnv
7cdd95e [zsxwing] Add docs for RpcEnv
51e6667 [zsxwing] Add 'sender' to RpcCallContext and rename the parameter of receiveAndReply to 'context'
ffc1280 [zsxwing] Rename 'fail' to 'sendFailure' and other minor code style changes
28e6d0f [zsxwing] Add onXXX for network events and remove the companion objects of network events
3751c97 [zsxwing] Rename RpcResponse to RpcCallContext
fe7d1ff [zsxwing] Add explicit reply in rpc
7b9e0c9 [zsxwing] Fix the indentation
04a106e [zsxwing] Remove NopCancellable and add a const NOP in object SettableCancellable
2a579f4 [zsxwing] Remove RpcEnv.systemName
155b987 [zsxwing] Change newURI to uriOf and add some comments
45b2317 [zsxwing] A standard RPC interface and An Akka implementation
2015-03-29 21:25:09 -07:00
June.He 0e2753ff14 [SPARK-6585][Tests]Fix FileServerSuite testcase in some Env.
Change FileServerSuite.test("HttpFileServer should not work with SSL when the server is untrusted") catch SSLException

Author: June.He <jun.hejun@huawei.com>

Closes #5239 from sisihj/SPARK-6585 and squashes the following commits:

cb19ae3 [June.He] Change FileServerSuite.test("HttpFileServer should not work with SSL when the server is untrusted") catch SSLException
2015-03-29 12:47:22 +01:00
Thomas Graves 52ece26b8f [SPARK-6558] Utils.getCurrentUserName returns the full principal name instead of login name
Utils.getCurrentUserName returns UserGroupInformation.getCurrentUser().getUserName() when SPARK_USER isn't set. It should return UserGroupInformation.getCurrentUser().getShortUserName()
getUserName() returns the users full principal name (ie user1CORP.COM). getShortUserName() returns just the users login name (user1).

This just happens to work on YARN because the Client code sets:
env("SPARK_USER") = UserGroupInformation.getCurrentUser().getShortUserName()

Author: Thomas Graves <tgraves@apache.org>

Closes #5229 from tgravescs/SPARK-6558 and squashes the following commits:

24830bf [Thomas Graves] Utils.getCurrentUserName returns the full principal name instead of login name
2015-03-29 12:43:30 +01:00
Nishkam Ravi e3eb393961 [SPARK-6406] Launch Spark using assembly jar instead of a separate launcher jar
Author: Nishkam Ravi <nravi@cloudera.com>
Author: nishkamravi2 <nishkamravi@gmail.com>
Author: nravi <nravi@c1704.halxg.cloudera.com>

Closes #5085 from nishkamravi2/master_nravi and squashes the following commits:

bad4349 [nishkamravi2] Update Main.java
36a6f87 [Nishkam Ravi] Minor changes and bug fixes
b7f4ae7 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
4a45d6a [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
458af39 [Nishkam Ravi] Locate the jar using getLocation, obviates the need to pass assembly path as an argument
d9658d6 [Nishkam Ravi] Changes for SPARK-6406
ccdc334 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
3faa7a4 [Nishkam Ravi] Launcher library changes (SPARK-6406)
345206a [Nishkam Ravi] spark-class merge Merge branch 'master_nravi' of https://github.com/nishkamravi2/spark into master_nravi
ac58975 [Nishkam Ravi] spark-class changes
06bfeb0 [nishkamravi2] Update spark-class
35af990 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
32c3ab3 [nishkamravi2] Update AbstractCommandBuilder.java
4bd4489 [nishkamravi2] Update AbstractCommandBuilder.java
746f35b [Nishkam Ravi] "hadoop" string in the assembly name should not be mandatory (everywhere else in spark we mandate spark-assembly*hadoop*.jar)
bfe96e0 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
ee902fa [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
d453197 [nishkamravi2] Update NewHadoopRDD.scala
6f41a1d [nishkamravi2] Update NewHadoopRDD.scala
0ce2c32 [nishkamravi2] Update HadoopRDD.scala
f7e33c2 [Nishkam Ravi] Merge branch 'master_nravi' of https://github.com/nishkamravi2/spark into master_nravi
ba1eb8b [Nishkam Ravi] Try-catch block around the two occurrences of removeShutDownHook. Deletion of semi-redundant occurrences of expensive operation inShutDown.
71d0e17 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
494d8c0 [nishkamravi2] Update DiskBlockManager.scala
3c5ddba [nishkamravi2] Update DiskBlockManager.scala
f0d12de [Nishkam Ravi] Workaround for IllegalStateException caused by recent changes to BlockManager.stop
79ea8b4 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
b446edc [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
5c9a4cb [nishkamravi2] Update TaskSetManagerSuite.scala
535295a [nishkamravi2] Update TaskSetManager.scala
3e1b616 [Nishkam Ravi] Modify test for maxResultSize
9f6583e [Nishkam Ravi] Changes to maxResultSize code (improve error message and add condition to check if maxResultSize > 0)
5f8f9ed [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
636a9ff [nishkamravi2] Update YarnAllocator.scala
8f76c8b [Nishkam Ravi] Doc change for yarn memory overhead
35daa64 [Nishkam Ravi] Slight change in the doc for yarn memory overhead
5ac2ec1 [Nishkam Ravi] Remove out
dac1047 [Nishkam Ravi] Additional documentation for yarn memory overhead issue
42c2c3d [Nishkam Ravi] Additional changes for yarn memory overhead issue
362da5e [Nishkam Ravi] Additional changes for yarn memory overhead
c726bd9 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
f00fa31 [Nishkam Ravi] Improving logging for AM memoryOverhead
1cf2d1e [nishkamravi2] Update YarnAllocator.scala
ebcde10 [Nishkam Ravi] Modify default YARN memory_overhead-- from an additive constant to a multiplier (redone to resolve merge conflicts)
2e69f11 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
efd688a [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark
2b630f9 [nravi] Accept memory input as "30g", "512M" instead of an int value, to be consistent with rest of Spark
3bf8fad [nravi] Merge branch 'master' of https://github.com/apache/spark
5423a03 [nravi] Merge branch 'master' of https://github.com/apache/spark
eb663ca [nravi] Merge branch 'master' of https://github.com/apache/spark
df2aeb1 [nravi] Improved fix for ConcurrentModificationIssue (Spark-1097, Hadoop-10456)
6b840f0 [nravi] Undo the fix for SPARK-1758 (the problem is fixed)
5108700 [nravi] Fix in Spark for the Concurrent thread modification issue (SPARK-1097, HADOOP-10456)
681b36f [nravi] Fix for SPARK-1758: failing test org.apache.spark.JavaAPISuite.wholeTextFiles
2015-03-29 12:40:37 +01:00
Brennon York 55153f5c14 [SPARK-4123][Project Infra]: Show new dependencies added in pull requests
Starting work on this, but need to find a way to ensure that, after doing a checkout from `apache/master`, we can successfully return to the current checkout. I believe that `git rev-parse HEAD` will get me what I want, but pushing this PR up to test what the Jenkins boxes are seeing.

Author: Brennon York <brennon.york@capitalone.com>

Closes #5093 from brennonyork/SPARK-4123 and squashes the following commits:

42e243e [Brennon York] moved starting test output to before pr tests, fixed indentation, changed mvn call to build/mvn
dadd941 [Brennon York] reverted assembly pom, put the regular test suite back in play
7aa1dee [Brennon York] set new dendencies into a <code> block, removed the bash debugging flag
0074566 [Brennon York] fixed minor echo issue with quotes
e229802 [Brennon York] updated to print the new dependency found
27bb9b5 [Brennon York] changed the assembly pom to test whether the pr test will pick up new deps
5375ad8 [Brennon York] git output to dev null
9bce980 [Brennon York] ensure both gate files exist
8f3c4b4 [Brennon York] updated to reflect the correct pushed in HEAD variable
2bc7b27 [Brennon York] added a pom gate check
a18db71 [Brennon York] full test of new deps script
ea170de [Brennon York] dont let mvn execute tests
f70d8cd [Brennon York] testing mvn with package
62ffd65 [Brennon York] updated dependency output message and changed compile to package given the jenkins failure output
04747e4 [Brennon York] adding simple mvn statement to see if command executes and prints compile output
87f9bea [Brennon York] added -x flag with bash to get insight into what is executing and what isnt
9e87208 [Brennon York] added set blocks to catch any non-zero exit codes and updated output
6b3042b [Brennon York] removed excess git checkout print statements
4077d46 [Brennon York] Merge remote-tracking branch 'upstream/master' into SPARK-4123
2bb5527 [Brennon York] added echo statement so jenkins logs which pr tests are running
d027f8f [Brennon York] proper piping of unnecessary stderr and stdout
6e2890d [Brennon York] updated test output newlines
d9f6f7f [Brennon York] removed echo
bad9a3a [Brennon York] added back the new deps test
e9e3ad1 [Brennon York] removed escapes for quotes
97e5cfb [Brennon York] commenting out new deps script
17379a5 [Brennon York] Merge remote-tracking branch 'upstream/master' into SPARK-4123
56f74a8 [Brennon York] updated the unop for ensuring a test is available
f2abc8c [Brennon York] removed the git checkout
6912584 [Brennon York] added this_mssg echo output
c610d42 [Brennon York] removed the error to dev/null
b98f78c [Brennon York] added the removed deps and echo output for jenkins testing
291a8fe [Brennon York] updated location of maven binary
126ce61 [Brennon York] removing new deps test to isolate why jenkins isn't posting messages
f8011d8 [Brennon York] minor updates and style changes
63a35c9 [Brennon York] updated new dependencies test
dae7ba8 [Brennon York] Capturing output directly from dependency builds
94d3547 [Brennon York] adding the new dependencies script into the test mix
2bca3c3 [Brennon York] added a git checkout 'git rev-parse HEAD' to the end of each pr test
ae83b90 [Brennon York] removed jenkins tests to grab some values from the jenkins box
4110993 [Brennon York] beginning work on pr test to add new dependencies
2015-03-29 12:37:53 +01:00
Reynold Xin 5eef00d0c6 [DOC] Improvements to Python docs.
Author: Reynold Xin <rxin@databricks.com>

Closes #5238 from rxin/pyspark-docs and squashes the following commits:

c285951 [Reynold Xin] Reset deprecation warning.
8c1031e [Reynold Xin] inferSchema
dd91b1a [Reynold Xin] [DOC] Improvements to Python docs.
2015-03-28 23:59:27 -07:00
Xiangrui Meng f75f633b21 [SPARK-6571][MLLIB] use wrapper in MatrixFactorizationModel.load
This fixes `predictAll` after load. jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #5243 from mengxr/SPARK-6571 and squashes the following commits:

82dcaa7 [Xiangrui Meng] use wrapper in MatrixFactorizationModel.load
2015-03-28 15:08:05 -07:00
WangTaoTheTonic 99631438c0 [SPARK-6552][Deploy][Doc]expose start-slave.sh to user and update outdated doc
https://issues.apache.org/jira/browse/SPARK-6552

/cc srowen

Author: WangTaoTheTonic <wangtao111@huawei.com>

Closes #5205 from WangTaoTheTonic/SPARK-6552 and squashes the following commits:

b02263c [WangTaoTheTonic] use less than rather than less equal
f0fa408 [WangTaoTheTonic] expose start-slave.sh
2015-03-28 12:32:35 +00:00
Adam Budde 5909f0973d [SPARK-6538][SQL] Add missing nullable Metastore fields when merging a Parquet schema
Opening to replace #5188.

When Spark SQL infers a schema for a DataFrame, it will take the union of all field types present in the structured source data (e.g. an RDD of JSON data). When the source data for a row doesn't define a particular field on the DataFrame's schema, a null value will simply be assumed for this field. This workflow makes it very easy to construct tables and query over a set of structured data with a nonuniform schema. However, this behavior is not consistent in some cases when dealing with Parquet files and an external table managed by an external Hive metastore.

In our particular usecase, we use Spark Streaming to parse and transform our input data and then apply a window function to save an arbitrary-sized batch of data as a Parquet file, which itself will be added as a partition to an external Hive table via an *"ALTER TABLE... ADD PARTITION..."* statement. Since our input data is nonuniform, it is expected that not every partition batch will contain every field present in the table's schema obtained from the Hive metastore. As such, we expect that the schema of some of our Parquet files may not contain the same set fields present in the full metastore schema.

In such cases, it seems natural that Spark SQL would simply assume null values for any missing fields in the partition's Parquet file, assuming these fields are specified as nullable by the metastore schema. This is not the case in the current implementation of ParquetRelation2. The **mergeMetastoreParquetSchema()** method used to reconcile differences between a Parquet file's schema and a schema retrieved from the Hive metastore will raise an exception if the Parquet file doesn't match the same set of fields specified by the metastore.

This pull requests alters the behavior of **mergeMetastoreParquetSchema()** by having it first add any nullable fields from the metastore schema to the Parquet file schema if they aren't already present there.

Author: Adam Budde <budde@amazon.com>

Closes #5214 from budde/nullable-fields and squashes the following commits:

a52d378 [Adam Budde] Refactor ParquetSchemaSuite.scala for cases now permitted by SPARK-6471 and SPARK-6538
9041bfa [Adam Budde] Add missing nullable Metastore fields when merging a Parquet schema
2015-03-28 09:14:09 +08:00
Reynold Xin 3af7334304 [SPARK-6564][SQL] SQLContext.emptyDataFrame should contain 0 row, not 1 row
Author: Reynold Xin <rxin@databricks.com>

Closes #5226 from rxin/empty-df and squashes the following commits:

1306d88 [Reynold Xin] Proper fix.
e135bb9 [Reynold Xin] [SPARK-6564][SQL] SQLContext.emptyDataFrame should contain 0 rows, not 1 row.
2015-03-27 14:56:57 -07:00
Xusen Yin d5497ab134 [SPARK-6526][ML] Add Normalizer transformer in ML package
See [SPARK-6526](https://issues.apache.org/jira/browse/SPARK-6526).

mengxr Should we add test suite for this transformer? There is no test suite for all feature transformers in ML package now.

Author: Xusen Yin <yinxusen@gmail.com>

Closes #5181 from yinxusen/SPARK-6526 and squashes the following commits:

6faa7bf [Xusen Yin] fix style
8a462da [Xusen Yin] remove duplications
ab35ab0 [Xusen Yin] add test suite
bc8cd0f [Xusen Yin] fix comment
79774c9 [Xusen Yin] add Normalizer transformer in ML package
2015-03-27 13:29:10 -07:00
Davies Liu 887e1b72df [SPARK-6574] [PySpark] fix sql example
Fix the import in sql example.

Author: Davies Liu <davies@databricks.com>

Closes #5230 from davies/fix_sql_example and squashes the following commits:

7ecc5f4 [Davies Liu] fix sql example
2015-03-27 11:42:26 -07:00
Michael Armbrust 5d9c37c23d [SPARK-6550][SQL] Use analyzed plan in DataFrame
This is based on bug and test case proposed by viirya.  See #5203 for a excellent description of the problem.

TLDR; The problem occurs because the function `groupBy(String)` calls `resolve`, which returns an `AttributeReference`.  However, this `AttributeReference` is based on an analyzed plan which is thrown away.  At execution time, we once again analyze the plan.  However, in the case of self-joins, each call to analyze will produce a new tree for the left side of the join, rendering the previously returned `AttributeReference` invalid.

As a fix, I propose we keep the analyzed plan instead of the unresolved plan inside of a `DataFrame`.

Author: Michael Armbrust <michael@databricks.com>

Closes #5217 from marmbrus/preanalyzer and squashes the following commits:

1f98e2d [Michael Armbrust] revert change
dd4dec1 [Michael Armbrust] Use the analyzed plan in DataFrame
089c52e [Michael Armbrust] WIP
2015-03-27 11:40:00 -07:00
Dean Chen aa2b991748 [SPARK-6544][build] Increment Avro version from 1.7.6 to 1.7.7
Fixes bug causing Kryo serialization to fail with Avro files in between stages.

https://issues.apache.org/jira/browse/AVRO-1476?focusedCommentId=13999249&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13999249

Author: Dean Chen <deanchen5@gmail.com>

Closes #5193 from deanchen/SPARK-6544 and squashes the following commits:

813d4c5 [Dean Chen] [SPARK-6544][build] Increment Avro version from 1.7.6 to 1.7.7
2015-03-27 14:32:51 +00:00
zsxwing da546b7ba0 [SPARK-6556][Core] Fix wrong parsing logic of executorTimeoutMs and checkTimeoutIntervalMs in HeartbeatReceiver
The current reading logic of `executorTimeoutMs` is:
```Scala
private val executorTimeoutMs = sc.conf.getLong("spark.network.timeout",
    sc.conf.getLong("spark.storage.blockManagerSlaveTimeoutMs", 120)) * 1000
```
So if `spark.storage.blockManagerSlaveTimeoutMs` is 10000 and `spark.network.timeout` is not set, executorTimeoutMs will be 10000 * 1000. But the correct value should have been 10000.

`checkTimeoutIntervalMs` has the same issue.

This PR fixes them.

Author: zsxwing <zsxwing@gmail.com>

Closes #5209 from zsxwing/SPARK-6556 and squashes the following commits:

6a0a411 [zsxwing] Fix docs
c7d5422 [zsxwing] Add comments for executorTimeoutMs and checkTimeoutIntervalMs
ccd5147 [zsxwing] Fix wrong parsing logic of executorTimeoutMs and checkTimeoutIntervalMs in HeartbeatReceiver
2015-03-27 12:31:06 +00:00
Yu ISHIKAWA f43a61031f [SPARK-6341][mllib] Upgrade breeze from 0.11.1 to 0.11.2
There are any bugs of breeze's SparseVector at 0.11.1. You know, Spark 1.3 depends on breeze 0.11.1. So I think we should upgrade it to 0.11.2.
https://issues.apache.org/jira/browse/SPARK-6341

And thanks you for your great cooperation, David Hall(dlwh)

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #5222 from yu-iskw/upgrade-breeze and squashes the following commits:

ad8a688 [Yu ISHIKAWA] Upgrade breeze from 0.11.1 to 0.11.2 because of a bug of SparseVector. Thanks you for your great cooperation, David Hall(@dlwh)
2015-03-27 00:15:02 -07:00
mcheah 49d2ec63ec [SPARK-6405] Limiting the maximum Kryo buffer size to be 2GB.
Kryo buffers are backed by byte arrays, but primitive arrays can only be
up to 2GB in size. It is misleading to allow users to set buffers past
this size.

Author: mcheah <mcheah@palantir.com>

Closes #5218 from mccheah/feature/limit-kryo-buffer and squashes the following commits:

1d6d1be [mcheah] Fixing numeric typo
e2e30ce [mcheah] Removing explicit int and double type to match style
09fd80b [mcheah] Should be >= not >. Slightly more consistent error message.
60634f9 [mcheah] [SPARK-6405] Limiting the maximum Kryo buffer size to be 2GB.
2015-03-26 22:48:42 -07:00
Brennon York 39fb579683 [SPARK-6510][GraphX]: Add Graph#minus method to act as Set#difference
Adds a `Graph#minus` method which will return only unique `VertexId`'s from the calling `VertexRDD`.

To demonstrate a basic example with pseudocode:

```
Set((0L,0),(1L,1)).minus(Set((1L,1),(2L,2)))
> Set((0L,0))
```

Author: Brennon York <brennon.york@capitalone.com>

Closes #5175 from brennonyork/SPARK-6510 and squashes the following commits:

248d5c8 [Brennon York] added minus(VertexRDD[VD]) method to avoid createUsingIndex and updated the mask operations to simplify with andNot call
3fb7cce [Brennon York] updated graphx doc to reflect the addition of minus method
6575d92 [Brennon York] updated mima exclude
aaa030b [Brennon York] completed graph#minus functionality
7227c0f [Brennon York] beginning work on minus functionality
2015-03-26 19:08:09 -07:00
Michael Armbrust aad0032276 [DOCS][SQL] Fix JDBC example
Author: Michael Armbrust <michael@databricks.com>

Closes #5192 from marmbrus/fixJDBCDocs and squashes the following commits:

b48a33d [Michael Armbrust] [DOCS][SQL] Fix JDBC example
2015-03-26 14:51:46 -07:00
Cheng Lian 71a0d40ebd [SPARK-6554] [SQL] Don't push down predicates which reference partition column(s)
There are two cases for the new Parquet data source:

1. Partition columns exist in the Parquet data files

   We don't need to push-down these predicates since partition pruning already handles them.

1. Partition columns don't exist in the Parquet data files

   We can't push-down these predicates since they are considered as invalid columns by Parquet.

<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/5210)
<!-- Reviewable:end -->

Author: Cheng Lian <lian@databricks.com>

Closes #5210 from liancheng/spark-6554 and squashes the following commits:

4f7ec03 [Cheng Lian] Adds comments
e134ced [Cheng Lian] Don't push down predicates which reference partition column(s)
2015-03-26 13:11:37 -07:00
Reynold Xin 784fcd5327 [SPARK-6117] [SQL] Improvements to DataFrame.describe()
1. Slightly modifications to the code to make it more readable.
2. Added Python implementation.
3. Updated the documentation to state that we don't guarantee the output schema for this function and it should only be used for exploratory data analysis.

Author: Reynold Xin <rxin@databricks.com>

Closes #5201 from rxin/df-describe and squashes the following commits:

25a7834 [Reynold Xin] Reset run-tests.
6abdfee [Reynold Xin] [SPARK-6117] [SQL] Improvements to DataFrame.describe()
2015-03-26 12:26:13 -07:00
Sean Owen c3a52a0824 SPARK-6532 [BUILD] LDAModel.scala fails scalastyle on Windows
Use standard UTF-8 source / report encoding for scalastyle

Author: Sean Owen <sowen@cloudera.com>

Closes #5211 from srowen/SPARK-6532 and squashes the following commits:

16a33e5 [Sean Owen] Use standard UTF-8 source / report encoding for scalastyle
2015-03-26 10:52:31 -07:00
Sean Owen fe15ea9760 SPARK-6480 [CORE] histogram() bucket function is wrong in some simple edge cases
Fix fastBucketFunction for histogram() to handle edge conditions more correctly. Add a test, and fix existing one accordingly

Author: Sean Owen <sowen@cloudera.com>

Closes #5148 from srowen/SPARK-6480 and squashes the following commits:

974a0a0 [Sean Owen] Additional test of huge ranges, and a few more comments (and comment fixes)
23ec01e [Sean Owen] Fix fastBucketFunction for histogram() to handle edge conditions more correctly. Add a test, and fix existing one accordingly
2015-03-26 15:00:23 +00:00
Yuhao Yang 3ddb975fae [MLlib]remove unused import
minor thing. Let me know if jira is required.

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #5207 from hhbyyh/adjustImport and squashes the following commits:

2240121 [Yuhao Yang] remove unused import
2015-03-26 13:27:05 +00:00
Yash Datta 1c05027a14 [SQL][SPARK-6471]: Metastore schema should only be a subset of parquet schema to support dropping of columns using replace columns
Currently in the parquet relation 2 implementation, error is thrown in case merged schema is not exactly the same as metastore schema.
But to support cases like deletion of column using replace column command, we can relax the restriction so that even if metastore schema is a subset of merged parquet schema, the query will work.

Author: Yash Datta <Yash.Datta@guavus.com>

Closes #5141 from saucam/replace_col and squashes the following commits:

e858d5b [Yash Datta] SPARK-6471: Fix test cases, add a new test case for metastore schema to be subset of parquet schema
5f2f467 [Yash Datta] SPARK-6471: Metastore schema should only be a subset of parquet schema to support dropping of columns using replace columns
2015-03-26 21:13:38 +08:00
zsxwing 0c88ce5416 [SPARK-6468][Block Manager] Fix the race condition of subDirs in DiskBlockManager
There are two race conditions of `subDirs` in `DiskBlockManager`:

1. `getAllFiles` does not use correct locks to read the contents in `subDirs`. Although it's designed for testing, it's still worth to add correct locks to eliminate the race condition.
2. The double-check has a race condition in `getFile(filename: String)`. If a thread finds `subDirs(dirId)(subDirId)` is not null out of the `synchronized` block, it may not be able to see the correct content of the File instance pointed by `subDirs(dirId)(subDirId)` according to the Java memory model (there is no volatile variable here).

This PR fixed the above race conditions.

Author: zsxwing <zsxwing@gmail.com>

Closes #5136 from zsxwing/SPARK-6468 and squashes the following commits:

cbb872b [zsxwing] Fix the race condition of subDirs in DiskBlockManager
2015-03-26 12:54:48 +00:00
Michael Armbrust f88f51bbd4 [SPARK-6465][SQL] Fix serialization of GenericRowWithSchema using kryo
Author: Michael Armbrust <michael@databricks.com>

Closes #5191 from marmbrus/kryoRowsWithSchema and squashes the following commits:

bb83522 [Michael Armbrust] Fix serialization of GenericRowWithSchema using kryo
f914f16 [Michael Armbrust] Add no arg constructor to GenericRowWithSchema
2015-03-26 18:46:57 +08:00
DoingDone9 855cba8fe5 [SPARK-6546][Build] Using the wrong code that will make spark compile failed!!
wrong code : val tmpDir = Files.createTempDir()
not Files should Utils

Author: DoingDone9 <799203320@qq.com>

Closes #5198 from DoingDone9/FilesBug and squashes the following commits:

6e0140d [DoingDone9] Update InsertIntoHiveTableSuite.scala
e57d23f [DoingDone9] Update InsertIntoHiveTableSuite.scala
802261c [DoingDone9] Merge pull request #7 from apache/master
d00303b [DoingDone9] Merge pull request #6 from apache/master
98b134f [DoingDone9] Merge pull request #5 from apache/master
161cae3 [DoingDone9] Merge pull request #4 from apache/master
c87e8b6 [DoingDone9] Merge pull request #3 from apache/master
cb1852d [DoingDone9] Merge pull request #2 from apache/master
c3f046f [DoingDone9] Merge pull request #1 from apache/master
2015-03-26 17:04:19 +08:00
azagrebin 5bbcd1304c [SPARK-6117] [SQL] add describe function to DataFrame for summary statis...
Please review my solution for SPARK-6117

Author: azagrebin <azagrebin@gmail.com>

Closes #5073 from azagrebin/SPARK-6117 and squashes the following commits:

f9056ac [azagrebin] [SPARK-6117] [SQL] create one aggregation and split it locally into resulting DF, colocate test data with test case
ddb3950 [azagrebin] [SPARK-6117] [SQL] simplify implementation, add test for DF without numeric columns
9daf31e [azagrebin] [SPARK-6117] [SQL] add describe function to DataFrame for summary statistics
2015-03-26 00:25:04 -07:00
Davies Liu f535802977 [SPARK-6536] [PySpark] Column.inSet() in Python
```
>>> df[df.name.inSet("Bob", "Mike")].collect()
[Row(age=5, name=u'Bob')]
>>> df[df.age.inSet([1, 2, 3])].collect()
[Row(age=2, name=u'Alice')]
```

Author: Davies Liu <davies@databricks.com>

Closes #5190 from davies/in and squashes the following commits:

6b73a47 [Davies Liu] Column.inSet() in Python
2015-03-26 00:01:24 -07:00
Michael Armbrust 276ef1c3cf [SPARK-6463][SQL] AttributeSet.equal should compare size
Previously this could result in sets compare equals when in fact the right was a subset of the left.

Based on #5133 by sisihj

Author: sisihj <jun.hejun@huawei.com>
Author: Michael Armbrust <michael@databricks.com>

Closes #5194 from marmbrus/pr/5133 and squashes the following commits:

5ed4615 [Michael Armbrust] fix imports
d4cbbc0 [Michael Armbrust] Add test cases
0a0834f [sisihj]  AttributeSet.equal should compare size
2015-03-25 19:22:05 -07:00
KaiXinXiaoLei e87bf3713e The UT test of spark is failed. Because there is a test in SQLQuerySuite about creating table “test”
If the tests in "sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala" are  running before CachedTableSuite.scala, the test("Drop cached table") will failed. Because the table test is created in SQLQuerySuite.scala  ,and this table not droped. So when running "drop cached table", table test already exists.

There is error info:
01:18:35.738 ERROR hive.ql.exec.DDLTask: org.apache.hadoop.hive.ql.metadata.HiveException: AlreadyExistsException(message:Table test already exists)
at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:616)
at org.apache.hadoop.hive.ql.exec.DDLTask.createTable(DDLTask.java:4189)
at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:281)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:153)
at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:85)
at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1503)
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1270)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1088)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:911)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:901)test”

And the test about "create table test" in "sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala,is:

  test("SPARK-4825 save join to table") {
    val testData = sparkContext.parallelize(1 to 10).map(i => TestData(i, i.toString)).toDF()
    sql("CREATE TABLE test1 (key INT, value STRING)")
    testData.insertInto("test1")
    sql("CREATE TABLE test2 (key INT, value STRING)")
    testData.insertInto("test2")
    testData.insertInto("test2")
    sql("CREATE TABLE test AS SELECT COUNT(a.value) FROM test1 a JOIN test2 b ON a.key =   b.key")
    checkAnswer(
      table("test"),
      sql("SELECT COUNT(a.value) FROM test1 a JOIN test2 b ON a.key = b.key").collect().toSeq)
  }

Author: KaiXinXiaoLei <huleilei1@huawei.com>

Closes #5150 from KaiXinXiaoLei/testFailed and squashes the following commits:

7534b02 [KaiXinXiaoLei] The UT test of spark is failed.
2015-03-25 19:15:30 -07:00
Daoyuan Wang 5ab6e9f0c0 [SPARK-6202] [SQL] enable variable substitution on test framework
Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #4930 from adrian-wang/testvs and squashes the following commits:

2ce590f [Daoyuan Wang] add explicit function types
b1d68bf [Daoyuan Wang] only substitute for parseSql
9c4a950 [Daoyuan Wang] add a comment explaining
18fb481 [Daoyuan Wang] enable variable substitute on test framework
2015-03-25 18:43:26 -07:00
DoingDone9 328daf65f8 [SPARK-6271][SQL] Sort these tokens in alphabetic order to avoid further duplicate in HiveQl
Author: DoingDone9 <799203320@qq.com>

Closes #4973 from DoingDone9/sort_token and squashes the following commits:

855fa10 [DoingDone9] Update HiveQl.scala
c7080b3 [DoingDone9] Sort these tokens in alphabetic order to avoid further duplicate in HiveQl
c87e8b6 [DoingDone9] Merge pull request #3 from apache/master
cb1852d [DoingDone9] Merge pull request #2 from apache/master
c3f046f [DoingDone9] Merge pull request #1 from apache/master
2015-03-25 18:41:59 -07:00
Liang-Chi Hsieh 73d57754dd [SPARK-6326][SQL] Improve castStruct to be faster
Current `castStruct` should be very slow. This pr slightly improves it.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #5017 from viirya/faster_caststruct and squashes the following commits:

385d5b0 [Liang-Chi Hsieh] Further improved.
746fcfb [Liang-Chi Hsieh] Make castStruct faster.
2015-03-25 17:52:23 -07:00
jeanlyn e6d1406abd [SPARK-5498][SQL]fix query exception when partition schema does not match table schema
In hive,the schema of partition may be difference from  the table schema.When we use spark-sql to query the data of partition which schema is difference from the table schema,we will get the exceptions as the description of the [jira](https://issues.apache.org/jira/browse/SPARK-5498) .For example:
* We take a look of the schema for the partition and the table

```sql
DESCRIBE partition_test PARTITION (dt='1');
id                  	int              	None
name                	string              	None
dt                  	string              	None

# Partition Information
# col_name            	data_type           	comment

dt                  	string              	None
```
```
DESCRIBE partition_test;
OK
id                  	bigint              	None
name                	string              	None
dt                  	string              	None

# Partition Information
# col_name            	data_type           	comment

dt                  	string              	None
```
*  run the sql
```sql
SELECT * FROM partition_test where dt='1';
```
we will get the cast exception `java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.MutableLong cannot be cast to org.apache.spark.sql.catalyst.expressions.MutableInt`

Author: jeanlyn <jeanlyn92@gmail.com>

Closes #4289 from jeanlyn/schema and squashes the following commits:

9c8da74 [jeanlyn] fix style
b41d6b9 [jeanlyn] fix compile errors
07d84b6 [jeanlyn] Merge branch 'master' into schema
535b0b6 [jeanlyn] reduce conflicts
d6c93c5 [jeanlyn] fix bug
1e8b30c [jeanlyn] fix code style
0549759 [jeanlyn] fix code style
c879aa1 [jeanlyn] clean the code
2a91a87 [jeanlyn] add more test case and clean the code
12d800d [jeanlyn] fix code style
63d170a [jeanlyn] fix compile problem
7470901 [jeanlyn] reduce conflicts
afc7da5 [jeanlyn] make getConvertedOI compatible between 0.12.0 and 0.13.1
b1527d5 [jeanlyn] fix type mismatch
10744ca [jeanlyn] Insert a space after the start of the comment
3b27af3 [jeanlyn] SPARK-5498:fix bug when query the data when partition schema does not match table schema
2015-03-25 17:47:45 -07:00
Cheng Lian 8c3b0052f4 [SPARK-6450] [SQL] Fixes metastore Parquet table conversion
The `ParquetConversions` analysis rule generates a hash map, which maps from the original `MetastoreRelation` instances to the newly created `ParquetRelation2` instances. However, `MetastoreRelation.equals` doesn't compare output attributes. Thus, if a single metastore Parquet table appears multiple times in a query, only a single entry ends up in the hash map, and the conversion is not correctly performed.

Proper fix for this issue should be overriding `equals` and `hashCode` for MetastoreRelation. Unfortunately, this breaks more tests than expected. It's possible that these tests are ill-formed from the very beginning. As 1.3.1 release is approaching, we'd like to make the change more surgical to avoid potential regressions. The proposed fix here is to make both the metastore relations and their output attributes as keys in the hash map used in ParquetConversions.

<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/5183)
<!-- Reviewable:end -->

Author: Cheng Lian <lian@databricks.com>

Closes #5183 from liancheng/spark-6450 and squashes the following commits:

3536780 [Cheng Lian] Fixes metastore Parquet table conversion
2015-03-25 17:40:19 -07:00
Josh Rosen d44a3362ed [SPARK-6079] Use index to speed up StatusTracker.getJobIdsForGroup()
`StatusTracker.getJobIdsForGroup()` is implemented via a linear scan over a HashMap rather than using an index, which might be an expensive operation if there are many (e.g. thousands) of retained jobs.

This patch adds a new map to `JobProgressListener` in order to speed up these lookups.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #4830 from JoshRosen/statustracker-job-group-indexing and squashes the following commits:

e39c5c7 [Josh Rosen] Address review feedback
6709fb2 [Josh Rosen] Merge remote-tracking branch 'origin/master' into statustracker-job-group-indexing
2c49614 [Josh Rosen] getOrElse
97275a7 [Josh Rosen] Add jobGroup to jobId index to JobProgressListener
2015-03-25 17:40:00 -07:00
MechCoder 4fc4d0369e [SPARK-5987] [MLlib] Save/load for GaussianMixtureModels
Should be self explanatory.

Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #4986 from MechCoder/spark-5987 and squashes the following commits:

7d2cd56 [MechCoder] Iterate over dataframe in a better way
e7a14cb [MechCoder] Minor
33c84f9 [MechCoder] Store as Array[Data] instead of Data[Array]
505bd57 [MechCoder] Rebased over master and used MatrixUDT
7422bb4 [MechCoder] Store sigmas as Array[Double] instead of Array[Array[Double]]
b9794e4 [MechCoder] Minor
cb77095 [MechCoder] [SPARK-5987] Save/load for GaussianMixtureModels
2015-03-25 14:45:23 -07:00
Yanbo Liang 435337381f [SPARK-6256] [MLlib] MLlib Python API parity check for regression
MLlib Python API parity check for Regression, major disparities need to be added for Python list following:
```scala
LinearRegressionWithSGD
    setValidateData
LassoWithSGD
    setIntercept
    setValidateData
RidgeRegressionWithSGD
    setIntercept
    setValidateData
```
setFeatureScaling is mllib private function which is not needed to expose in pyspark.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #4997 from yanboliang/spark-6256 and squashes the following commits:

102f498 [Yanbo Liang] fix intercept issue & add doc test
1fb7b4f [Yanbo Liang] change 'intercept' to 'addIntercept'
de5ecbc [Yanbo Liang] MLlib Python API parity check for regression
2015-03-25 13:38:33 -07:00
Andrew Or c1b74df604 [SPARK-5771] Master UI inconsistently displays application cores
If the user calls `sc.stop()`, then the number of cores under "Completed Applications" will be 0. If the user does not call `sc.stop()`, then the number of cores will be however many cores were being used before the application exited. This PR makes both cases have the behavior of the latter.

Note that there have been a series of PR that attempted to fix this. For the full discussion, please refer to #4841. The unregister event is necessary because of a subtle race condition explained in that PR.

Tested this locally with and without calling `sc.stop()`.

Author: Andrew Or <andrew@databricks.com>

Closes #5177 from andrewor14/master-ui-cores and squashes the following commits:

62449d1 [Andrew Or] Freeze application state before finishing it
2015-03-25 13:28:32 -07:00
Kousuke Saruta acef51defb [SPARK-6537] UIWorkloadGenerator: The main thread should not stop SparkContext until all jobs finish
The main thread of UIWorkloadGenerator spawn sub threads to launch jobs but the main thread stop SparkContext without waiting for finishing those threads.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #5187 from sarutak/SPARK-6537 and squashes the following commits:

4e9307a [Kousuke Saruta] Fixed UIWorkloadGenerator so that the main thread stop SparkContext after all jobs finish
2015-03-25 13:27:15 -07:00
zsxwing 883b7e9030 [SPARK-6076][Block Manager] Fix a potential OOM issue when StorageLevel is MEMORY_AND_DISK_SER
In dcd1e42d6b/core/src/main/scala/org/apache/spark/storage/BlockManager.scala (L538) , when StorageLevel is `MEMORY_AND_DISK_SER`, it will copy the content from file into memory, then put it into MemoryStore.
```scala
              val copyForMemory = ByteBuffer.allocate(bytes.limit)
              copyForMemory.put(bytes)
              memoryStore.putBytes(blockId, copyForMemory, level)
              bytes.rewind()
```
However, if the file is bigger than the free memory, OOM will happen. A better approach is testing if there is enough memory. If not, copyForMemory should not be created, since this is an optional operation.

Author: zsxwing <zsxwing@gmail.com>

Closes #4827 from zsxwing/SPARK-6076 and squashes the following commits:

7d25545 [zsxwing] Add alias for tryToPut and dropFromMemory
1100a54 [zsxwing] Replace call-by-name with () => T
0cc0257 [zsxwing] Fix a potential OOM issue when StorageLevel is MEMORY_AND_DISK_SER
2015-03-25 12:17:18 -07:00
DoingDone9 968408b345 [SPARK-6409][SQL] It is not necessary that avoid old inteface of hive, because this will make some UDAF can not work.
spark avoid old inteface of hive, then some udaf can not work like "org.apache.hadoop.hive.ql.udf.generic.GenericUDAFAverage"

Author: DoingDone9 <799203320@qq.com>

Closes #5131 from DoingDone9/udaf and squashes the following commits:

9de08d0 [DoingDone9] Update HiveUdfSuite.scala
49c62dc [DoingDone9] Update hiveUdfs.scala
98b134f [DoingDone9] Merge pull request #5 from apache/master
161cae3 [DoingDone9] Merge pull request #4 from apache/master
c87e8b6 [DoingDone9] Merge pull request #3 from apache/master
cb1852d [DoingDone9] Merge pull request #2 from apache/master
c3f046f [DoingDone9] Merge pull request #1 from apache/master
2015-03-25 11:11:52 -07:00
Augustin Borsu 982952f4ae [ML][FEATURE] SPARK-5566: RegEx Tokenizer
Added a Regex based tokenizer for ml.
Currently the regex is fixed but if I could add a regex type paramater to the paramMap,
changing the tokenizer regex could be a parameter used in the crossValidation.
Also I wonder what would be the best way to add a stop word list.

Author: Augustin Borsu <augustin@sagacify.com>
Author: Augustin Borsu <a.borsu@gmail.com>
Author: Augustin Borsu <aborsu@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes #4504 from aborsu985/master and squashes the following commits:

716d257 [Augustin Borsu] Merge branch 'mengxr-SPARK-5566'
cb07021 [Augustin Borsu] Merge branch 'SPARK-5566' of git://github.com/mengxr/spark into mengxr-SPARK-5566
5f09434 [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
a164800 [Xiangrui Meng] remove tabs
556aa27 [Xiangrui Meng] Merge branch 'aborsu985-master' into SPARK-5566
9651aec [Xiangrui Meng] update test
f96526d [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-5566
2338da5 [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
e88d7b8 [Xiangrui Meng] change pattern to a StringParameter; update tests
148126f [Augustin Borsu] Added return type to public functions
12dddb4 [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
daf685e [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
6a85982 [Augustin Borsu] Style corrections
38b95a1 [Augustin Borsu] Added Java unit test for RegexTokenizer
b66313f [Augustin Borsu] Modified the pattern Param so it is compiled when given to the Tokenizer
e262bac [Augustin Borsu] Added unit tests in scala
cd6642e [Augustin Borsu] Changed regex to pattern
132b00b [Augustin Borsu] Changed matching to gaps and removed case folding
201a107 [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
cb9c9a7 [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
d3ef6d3 [Augustin Borsu] Added doc to RegexTokenizer
9082fc3 [Augustin Borsu] Removed stopwords parameters and updated doc
19f9e53 [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
f6a5002 [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
7f930bb [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
77ff9ca [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
2e89719 [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
196cd7a [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
11ca50f [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
9f8685a [Augustin Borsu] RegexTokenizer
9e07a78 [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
9547e9d [Augustin Borsu] RegEx Tokenizer
01cd26f [Augustin Borsu] RegExTokenizer
2015-03-25 10:16:39 -07:00
Yanbo Liang 10c78607b2 [SPARK-6496] [MLLIB] GeneralizedLinearAlgorithm.run(input, initialWeights) should initialize numFeatures
In GeneralizedLinearAlgorithm ```numFeatures``` is default to -1, we need to update it to correct value when we call run() to train a model.
```LogisticRegressionWithLBFGS.run(input)``` works well, but when we call ```LogisticRegressionWithLBFGS.run(input, initialWeights)``` to train multiclass classification model, it will throw exception due to the numFeatures is not updated.
In this PR, we just update numFeatures at the beginning of GeneralizedLinearAlgorithm.run(input, initialWeights) and add test case.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #5167 from yanboliang/spark-6496 and squashes the following commits:

8131c48 [Yanbo Liang] LogisticRegressionWithLBFGS.run(input, initialWeights) should initialize numFeatures
2015-03-25 17:05:56 +00:00