Commit graph

9844 commits

Author SHA1 Message Date
Andrew Or 38e624a732 [SPARK-5816] Add huge compatibility warning in DriverWrapper
The stability of the new submission gateway assumes that the arguments in `DriverWrapper` are consistent across multiple Spark versions. However, this is not at all clear from the code itself. In fact, this was broken in 20a6013106, which is fortunately OK because both that commit and the original commit that added this gateway are part of the same release.

To prevent this from happening again we should at the very least add a huge warning where appropriate.

Author: Andrew Or <andrew@databricks.com>

Closes #4687 from andrewor14/driver-wrapper-warning and squashes the following commits:

7989b56 [Andrew Or] Add huge compatibility warning
2015-02-19 09:56:25 -08:00
Jacek Lewandowski fb87f44921 SPARK-5548: Fix for AkkaUtilsSuite failure - attempt 2
Author: Jacek Lewandowski <lewandowski.jacek@gmail.com>

Closes #4653 from jacek-lewandowski/SPARK-5548-2-master and squashes the following commits:

0e199b6 [Jacek Lewandowski] SPARK-5548: applied reviewer's comments
843eafb [Jacek Lewandowski] SPARK-5548: Fix for AkkaUtilsSuite failure - attempt 2
2015-02-19 09:53:36 -08:00
Kay Ousterhout e945aa6139 [SPARK-5846] Correctly set job description and pool for SQL jobs
marmbrus am I missing something obvious here? I verified that this fixes the problem for me (on 1.2.1) on EC2, but I'm confused about how others wouldn't have noticed this?

Author: Kay Ousterhout <kayousterhout@gmail.com>

Closes #4630 from kayousterhout/SPARK-5846_1.3 and squashes the following commits:

2022ad4 [Kay Ousterhout] [SPARK-5846] Correctly set job description and pool for SQL jobs
2015-02-19 09:49:34 +08:00
Xiangrui Meng d12d2ad76e [SPARK-5879][MLLIB] update PIC user guide and add a Java example
Updated PIC user guide to reflect API changes and added a simple Java example. The API is still not very Java-friendly. I created SPARK-5990 for this issue.

Author: Xiangrui Meng <meng@databricks.com>

Closes #4680 from mengxr/SPARK-5897 and squashes the following commits:

847d216 [Xiangrui Meng] apache header
87719a2 [Xiangrui Meng] remove PIC image
2dd921f [Xiangrui Meng] update PIC user guide and add a Java example
2015-02-18 16:29:32 -08:00
Davies Liu aa8f10e82a [SPARK-5722] [SQL] [PySpark] infer int as LongType
The `int` is 64-bit on 64-bit machine (very common now), we should infer it as LongType for it in Spark SQL.

Also, LongType in SQL will come back as `int`.

Author: Davies Liu <davies@databricks.com>

Closes #4666 from davies/long and squashes the following commits:

6bc6cc4 [Davies Liu] infer int as LongType
2015-02-18 14:17:04 -08:00
Reynold Xin f0e3b71077 [SPARK-5840][SQL] HiveContext cannot be serialized due to tuple extraction
Also added test cases for checking the serializability of HiveContext and SQLContext.

Author: Reynold Xin <rxin@databricks.com>

Closes #4628 from rxin/SPARK-5840 and squashes the following commits:

ecb3bcd [Reynold Xin] test cases and reviews.
55eb822 [Reynold Xin] [SPARK-5840][SQL] HiveContext cannot be serialized due to tuple extraction.
2015-02-18 14:02:32 -08:00
Burak Yavuz a8eb92dcb9 [SPARK-5507] Added documentation for BlockMatrix
Docs for BlockMatrix. mengxr

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #4664 from brkyvz/SPARK-5507PR and squashes the following commits:

4db30b0 [Burak Yavuz] [SPARK-5507] Added documentation for BlockMatrix
2015-02-18 10:11:08 -08:00
Xiangrui Meng 85e9d091d5 [SPARK-5519][MLLIB] add user guide with example code for fp-growth
The API is still not very Java-friendly because `Array[Item]` in `freqItemsets` is recognized as `Object` in Java. We might want to define a case class to wrap the return pair to make it Java friendly.

Author: Xiangrui Meng <meng@databricks.com>

Closes #4661 from mengxr/SPARK-5519 and squashes the following commits:

58ccc25 [Xiangrui Meng] add user guide with example code for fp-growth
2015-02-18 10:09:56 -08:00
Sean Owen 5aecdcf1f2 SPARK-5669 [BUILD] [HOTFIX] Spark assembly includes incompatibly licensed libgfortran, libgcc code via JBLAS
Correct exclusion path for JBLAS native libs.
(More explanation coming soon on the mailing list re: 1.3.0 RC1)

Author: Sean Owen <sowen@cloudera.com>

Closes #4673 from srowen/SPARK-5669.2 and squashes the following commits:

e29693c [Sean Owen] Correct exclusion path for JBLAS native libs
2015-02-18 14:41:44 +00:00
Kousuke Saruta 82197ed3bd [SPARK-4949]shutdownCallback in SparkDeploySchedulerBackend should be enclosed by synchronized block.
A variable `shutdownCallback` in SparkDeploySchedulerBackend can be accessed from multiple threads so it should be enclosed by synchronized block.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #3781 from sarutak/SPARK-4949 and squashes the following commits:

c146c93 [Kousuke Saruta] Removed "setShutdownCallback" method
c7265dc [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-4949
42ca528 [Kousuke Saruta] Changed the declaration of the variable "shutdownCallback" as a volatile reference instead of AtomicReference
552df7c [Kousuke Saruta] Changed the declaration of the variable "shutdownCallback" as a volatile reference instead of AtomicReference
f556819 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-4949
1b60fd1 [Kousuke Saruta] Improved the locking logics
5942765 [Kousuke Saruta] Enclosed shutdownCallback in SparkDeploySchedulerBackend by synchronized block
2015-02-18 12:20:11 +00:00
MechCoder e79a7a626d SPARK-4610 addendum: [Minor] [MLlib] Minor doc fix in GBT classification example
numClassesForClassification has been renamed to numClasses.

Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #4672 from MechCoder/minor-doc and squashes the following commits:

d2ddb7f [MechCoder] Minor doc fix in GBT classification example
2015-02-18 10:13:40 +00:00
Davies Liu c1b6fa9838 [SPARK-5878] fix DataFrame.repartition() in Python
Also add tests for distinct()

Author: Davies Liu <davies@databricks.com>

Closes #4667 from davies/repartition and squashes the following commits:

79059fd [Davies Liu] add test
cb4915e [Davies Liu] fix repartition
2015-02-18 01:00:54 -08:00
Tor Myklebust de0dd6de24 Avoid deprecation warnings in JDBCSuite.
This pull request replaces calls to deprecated methods from `java.util.Date` with near-equivalents in `java.util.Calendar`.

Author: Tor Myklebust <tmyklebu@gmail.com>

Closes #4668 from tmyklebu/master and squashes the following commits:

66215b1 [Tor Myklebust] Use GregorianCalendar instead of Timestamp get methods.
2015-02-18 01:00:13 -08:00
Cheng Lian 61ab08549c [Minor] [SQL] Cleans up DataFrame variable names and toDF() calls
Although we've migrated to the DataFrame API, lots of code still uses `rdd` or `srdd` as local variable names. This PR tries to address these naming inconsistencies and some other minor DataFrame related style issues.

<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/4670)
<!-- Reviewable:end -->

Author: Cheng Lian <lian@databricks.com>

Closes #4670 from liancheng/df-cleanup and squashes the following commits:

3e14448 [Cheng Lian] Cleans up DataFrame variable names and toDF() calls
2015-02-17 23:36:20 -08:00
Tathagata Das 3912d33246 [SPARK-5731][Streaming][Test] Fix incorrect test in DirectKafkaStreamSuite
The test was incorrect. Instead of counting the number of records, it counted the number of partitions of RDD generated by DStream. Which is not its intention. I will be testing this patch multiple times to understand its flakiness.

PS: This was caused by my refactoring in https://github.com/apache/spark/pull/4384/

koeninger check it out.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #4597 from tdas/kafka-flaky-test and squashes the following commits:

d236235 [Tathagata Das] Unignored last test.
e9a1820 [Tathagata Das] fix test
2015-02-17 22:44:16 -08:00
Yin Huai e50934f11e [SPARK-5723][SQL]Change the default file format to Parquet for CTAS statements.
JIRA: https://issues.apache.org/jira/browse/SPARK-5723

Author: Yin Huai <yhuai@databricks.com>

This patch had conflicts when merged, resolved by
Committer: Michael Armbrust <michael@databricks.com>

Closes #4639 from yhuai/defaultCTASFileFormat and squashes the following commits:

a568137 [Yin Huai] Merge remote-tracking branch 'upstream/master' into defaultCTASFileFormat
ad2b07d [Yin Huai] Update tests and error messages.
8af5b2a [Yin Huai] Update conf key and unit test.
5a67903 [Yin Huai] Use data source write path for Hive's CTAS statements when no storage format/handler is specified.
2015-02-17 18:14:33 -08:00
Yin Huai d5f12bfe8f [SPARK-5875][SQL]logical.Project should not be resolved if it contains aggregates or generators
https://issues.apache.org/jira/browse/SPARK-5875 has a case to reproduce the bug and explain the root cause.

Author: Yin Huai <yhuai@databricks.com>

Closes #4663 from yhuai/projectResolved and squashes the following commits:

472f7b6 [Yin Huai] If a logical.Project has any AggregateExpression or Generator, it's resolved field should be false.
2015-02-17 17:50:39 -08:00
Josh Rosen a51fc7ef9a [SPARK-4454] Revert getOrElse() cleanup in DAGScheduler.getCacheLocs()
This method is performance-sensitive and this change wasn't necessary.
2015-02-17 17:45:16 -08:00
Josh Rosen d46d6246d2 [SPARK-4454] Properly synchronize accesses to DAGScheduler cacheLocs map
This patch addresses a race condition in DAGScheduler by properly synchronizing accesses to its `cacheLocs` map.

This map is accessed by the `getCacheLocs` and `clearCacheLocs()` methods, which can be called by separate threads, since DAGScheduler's `getPreferredLocs()` method is called by SparkContext and indirectly calls `getCacheLocs()`.  If this map is cleared by the DAGScheduler event processing thread while a user thread is submitting a job and computing preferred locations, then this can cause the user thread to throw "NoSuchElementException: key not found" errors.

Most accesses to DAGScheduler's internal state do not need synchronization because that state is only accessed from the event processing loop's thread.  An alternative approach to fixing this bug would be to refactor this code so that SparkContext sends the DAGScheduler a message in order to get the list of preferred locations.  However, this would involve more extensive changes to this code and would be significantly harder to backport to maintenance branches since some of the related code has undergone significant refactoring (e.g. the introduction of EventLoop).  Since `cacheLocs` is the only state that's accessed in this way, adding simple synchronization seems like a better short-term fix.

See #3345 for additional context.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #4660 from JoshRosen/SPARK-4454 and squashes the following commits:

12d64ba [Josh Rosen] Properly synchronize accesses to DAGScheduler cacheLocs map.
2015-02-17 17:39:58 -08:00
Burak Yavuz ae6cfb3acd [SPARK-5811] Added documentation for maven coordinates and added Spark Packages support
Documentation for maven coordinates + Spark Package support. Added pyspark tests for `--packages`

Author: Burak Yavuz <brkyvz@gmail.com>
Author: Davies Liu <davies@databricks.com>

Closes #4662 from brkyvz/SPARK-5811 and squashes the following commits:

56ccccd [Burak Yavuz] fixed broken test
64cb8ee [Burak Yavuz] passed pep8 on local
c07b81e [Burak Yavuz] fixed pep8
a8bd6b7 [Burak Yavuz] submit PR
4ef4046 [Burak Yavuz] ready for PR
8fb02e5 [Burak Yavuz] merged master
25c9b9f [Burak Yavuz] Merge branch 'master' of github.com:apache/spark into python-jar
560d13b [Burak Yavuz] before PR
17d3f76 [Davies Liu] support .jar as python package
a3eb717 [Burak Yavuz] Merge branch 'master' of github.com:apache/spark into SPARK-5811
c60156d [Burak Yavuz] [SPARK-5811] Added documentation for maven coordinates
2015-02-17 17:23:22 -08:00
Davies Liu c3d2b90bde [SPARK-5785] [PySpark] narrow dependency for cogroup/join in PySpark
Currently, PySpark does not support narrow dependency during cogroup/join when the two RDDs have the partitioner, another unnecessary shuffle stage will come in.

The Python implementation of cogroup/join is different than Scala one, it depends on union() and partitionBy(). This patch will try to use PartitionerAwareUnionRDD() in union(), when all the RDDs have the same partitioner. It also fix `reservePartitioner` in all the map() or mapPartitions(), then partitionBy() can skip the unnecessary shuffle stage.

Author: Davies Liu <davies@databricks.com>

Closes #4629 from davies/narrow and squashes the following commits:

dffe34e [Davies Liu] improve test, check number of stages for join/cogroup
1ed3ba2 [Davies Liu] Merge branch 'master' of github.com:apache/spark into narrow
4d29932 [Davies Liu] address comment
cc28d97 [Davies Liu] add unit tests
940245e [Davies Liu] address comments
ff5a0a6 [Davies Liu] skip the partitionBy() on Python side
eb26c62 [Davies Liu] narrow dependency in PySpark
2015-02-17 16:54:57 -08:00
Yin Huai 117121a4ec [SPARK-5852][SQL]Fail to convert a newly created empty metastore parquet table to a data source parquet table.
The problem is that after we create an empty hive metastore parquet table (e.g. `CREATE TABLE test (a int) STORED AS PARQUET`), Hive will create an empty dir for us, which cause our data source `ParquetRelation2` fail to get the schema of the table. See JIRA for the case to reproduce the bug and the exception.

This PR is based on #4562 from chenghao-intel.

JIRA: https://issues.apache.org/jira/browse/SPARK-5852

Author: Yin Huai <yhuai@databricks.com>
Author: Cheng Hao <hao.cheng@intel.com>

Closes #4655 from yhuai/CTASParquet and squashes the following commits:

b8b3450 [Yin Huai] Update tests.
2ac94f7 [Yin Huai] Update tests.
3db3d20 [Yin Huai] Minor update.
d7e2308 [Yin Huai] Revert changes in HiveMetastoreCatalog.scala.
36978d1 [Cheng Hao] Update the code as feedback
a04930b [Cheng Hao] fix bug of scan an empty parquet based table
442ffe0 [Cheng Hao] passdown the schema for Parquet File in HiveContext
2015-02-17 15:47:59 -08:00
Davies Liu 4d4cc760fa [SPARK-5872] [SQL] create a sqlCtx in pyspark shell
The sqlCtx will be HiveContext if hive is built in assembly jar, or SQLContext if not.

It also skip the Hive tests in pyspark.sql.tests if no hive is available.

Author: Davies Liu <davies@databricks.com>

Closes #4659 from davies/sqlctx and squashes the following commits:

0e6629a [Davies Liu] sqlCtx in pyspark
2015-02-17 15:44:37 -08:00
Davies Liu 3df85dccbc [SPARK-5871] output explain in Python
Author: Davies Liu <davies@databricks.com>

Closes #4658 from davies/explain and squashes the following commits:

db87ea2 [Davies Liu] output explain in Python
2015-02-17 13:48:38 -08:00
Davies Liu 445a755b88 [SPARK-4172] [PySpark] Progress API in Python
This patch bring the pull based progress API into Python, also a example in Python.

Author: Davies Liu <davies@databricks.com>

Closes #3027 from davies/progress_api and squashes the following commits:

b1ba984 [Davies Liu] fix style
d3b9253 [Davies Liu] add tests, mute the exception after stop
4297327 [Davies Liu] Merge branch 'master' of github.com:apache/spark into progress_api
969fa9d [Davies Liu] Merge branch 'master' of github.com:apache/spark into progress_api
25590c9 [Davies Liu] update with Java API
360de2d [Davies Liu] Merge branch 'master' of github.com:apache/spark into progress_api
c0f1021 [Davies Liu] Merge branch 'master' of github.com:apache/spark into progress_api
023afb3 [Davies Liu] add Python API and example for progress API
2015-02-17 13:36:43 -08:00
Michael Armbrust de4836f8f1 [SPARK-5868][SQL] Fix python UDFs in HiveContext and checks in SQLContext
Author: Michael Armbrust <michael@databricks.com>

Closes #4657 from marmbrus/pythonUdfs and squashes the following commits:

a7823a8 [Michael Armbrust] [SPARK-5868][SQL] Fix python UDFs in HiveContext and checks in SQLContext
2015-02-17 13:23:45 -08:00
Cheng Hao 9d281fa560 [SQL] [Minor] Update the HiveContext Unittest
In unit test, the table src(key INT, value STRING) is not the same as HIVE src(key STRING, value STRING)
https://github.com/apache/hive/blob/branch-0.13/data/scripts/q_test_init.sql

And in the reflect.q, test failed for expression `reflect("java.lang.Integer", "valueOf", key, 16)`, which expect the argument `key` as STRING not INT.

This PR doesn't aim to change the `src` schema, we can do that after 1.3 released, however, we probably need to re-generate all the golden files.

Author: Cheng Hao <hao.cheng@intel.com>

Closes #4584 from chenghao-intel/reflect and squashes the following commits:

e5bdc3a [Cheng Hao] Move the test case reflect into blacklist
184abfd [Cheng Hao] revert the change to table src1
d9bcf92 [Cheng Hao] Update the HiveContext Unittest
2015-02-17 12:25:35 -08:00
Liang-Chi Hsieh ac506b7c28 [Minor][SQL] Use same function to check path parameter in JSONRelation
Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #4649 from viirya/use_checkpath and squashes the following commits:

0f9a1a1 [Liang-Chi Hsieh] Use same function to check path parameter.
2015-02-17 12:24:13 -08:00
Liang-Chi Hsieh 4611de1cef [SPARK-5862][SQL] Only transformUp the given plan once in HiveMetastoreCatalog
Current `ParquetConversions` in `HiveMetastoreCatalog` will transformUp the given plan multiple times if there are many Metastore Parquet tables. Since the transformUp operation is recursive, it should be better to only perform it once.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #4651 from viirya/parquet_atonce and squashes the following commits:

c1ed29d [Liang-Chi Hsieh] Fix bug.
e0f919b [Liang-Chi Hsieh] Only transformUp the given plan once.
2015-02-17 12:23:18 -08:00
CodingCat 31efb39c1d [Minor] fix typo in SQL document
Author: CodingCat <zhunansjtu@gmail.com>

Closes #4656 from CodingCat/fix_typo and squashes the following commits:

b41d15c [CodingCat] recover
689fe46 [CodingCat] fix typo
2015-02-17 12:16:52 -08:00
Davies Liu fc4eb9505a [SPARK-5864] [PySpark] support .jar as python package
A jar file containing Python sources in it could be used as a Python package, just like zip file.

spark-submit already put the jar file into PYTHONPATH, this patch also put it in the sys.path, then it could be used in Python worker.

Author: Davies Liu <davies@databricks.com>

Closes #4652 from davies/jar and squashes the following commits:

17d3f76 [Davies Liu] support .jar as python package
2015-02-17 12:05:06 -08:00
Sean Owen 49c19fdbad SPARK-5841 [CORE] [HOTFIX] Memory leak in DiskBlockManager
Avoid call to remove shutdown hook being called from shutdown hook

CC pwendell JoshRosen MattWhelan

Author: Sean Owen <sowen@cloudera.com>

Closes #4648 from srowen/SPARK-5841.2 and squashes the following commits:

51548db [Sean Owen] Avoid call to remove shutdown hook being called from shutdown hook
2015-02-17 19:40:06 +00:00
Patrick Wendell 24f358b9d6 MAINTENANCE: Automated closing of pull requests.
This commit exists to close the following pull requests on Github:

Closes #3297 (close requested by 'andrewor14')
Closes #3345 (close requested by 'pwendell')
Closes #2729 (close requested by 'srowen')
Closes #2320 (close requested by 'pwendell')
Closes #4529 (close requested by 'andrewor14')
Closes #2098 (close requested by 'srowen')
Closes #4120 (close requested by 'andrewor14')
2015-02-17 11:35:26 -08:00
MechCoder 9b746f3808 [SPARK-3381] [MLlib] Eliminate bins for unordered features in DecisionTrees
For unordered features, it is sufficient to use splits since the threshold of the split corresponds the threshold of the HighSplit of the bin and there is no use of the LowSplit.

Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #4231 from MechCoder/spark-3381 and squashes the following commits:

58c19a5 [MechCoder] COSMIT
c274b74 [MechCoder] Remove unordered feature calculation in labeledPointToTreePoint
b2b9b89 [MechCoder] COSMIT
d3ee042 [MechCoder] [SPARK-3381] [MLlib] Eliminate bins for unordered features
2015-02-17 11:19:23 -08:00
xukun 00228947 b271c265b7 [SPARK-5661]function hasShutdownDeleteTachyonDir should use shutdownDeleteTachyonPaths to determine whether contains file
hasShutdownDeleteTachyonDir(file: TachyonFile) should use shutdownDeleteTachyonPaths(not shutdownDeletePaths) to determine Whether contain file. To solve it ,delete two unused function.

Author: xukun 00228947 <xukun.xu@huawei.com>
Author: viper-kun <xukun.xu@huawei.com>

Closes #4418 from viper-kun/deleteunusedfun and squashes the following commits:

87340eb [viper-kun] fix style
3d6c69e [xukun 00228947] fix bug
2bc397e [xukun 00228947] deleteunusedfun
2015-02-17 18:59:41 +00:00
Ryan Williams d8f69cf788 [SPARK-5778] throw if nonexistent metrics config file provided
previous behavior was to log an error; this is fine in the general
case where no `spark.metrics.conf` parameter was specified, in which
case a default `metrics.properties` is looked for, and the execption
logged and suppressed if it doesn't exist.

if the user has purposefully specified a metrics.conf file, however,
it makes more sense to show them an error when said file doesn't
exist.

Author: Ryan Williams <ryan.blake.williams@gmail.com>

Closes #4571 from ryan-williams/metrics and squashes the following commits:

5bccb14 [Ryan Williams] private-ize some MetricsConfig members
08ff998 [Ryan Williams] rename METRICS_CONF: DEFAULT_METRICS_CONF_FILENAME
f4d7fab [Ryan Williams] fix tests
ad24b0e [Ryan Williams] add "metrics.properties" to .rat-excludes
94e810b [Ryan Williams] throw if nonexistent Sink class is specified
31d2c30 [Ryan Williams] metrics code review feedback
56287db [Ryan Williams] throw if nonexistent metrics config file provided
2015-02-17 10:57:16 -08:00
Davies Liu d8adefefcc [SPARK-5859] [PySpark] [SQL] fix DataFrame Python API
1. added explain()
2. add isLocal()
3. do not call show() in __repl__
4. add foreach() and foreachPartition()
5. add distinct()
6. fix functions.col()/column()/lit()
7. fix unit tests in sql/functions.py
8. fix unicode in showString()

Author: Davies Liu <davies@databricks.com>

Closes #4645 from davies/df6 and squashes the following commits:

6b46a2c [Davies Liu] fix DataFrame Python API
2015-02-17 10:22:48 -08:00
Michael Armbrust c74b07fa94 [SPARK-5166][SPARK-5247][SPARK-5258][SQL] API Cleanup / Documentation
Author: Michael Armbrust <michael@databricks.com>

Closes #4642 from marmbrus/docs and squashes the following commits:

d291c34 [Michael Armbrust] python tests
9be66e3 [Michael Armbrust] comments
d56afc2 [Michael Armbrust] fix style
f004747 [Michael Armbrust] fix build
c4a907b [Michael Armbrust] fix tests
42e2b73 [Michael Armbrust] [SQL] Documentation / API Clean-up.
2015-02-17 10:21:17 -08:00
Xiangrui Meng c76da36c21 [SPARK-5858][MLLIB] Remove unnecessary first() call in GLM
`numFeatures` is only used by multinomial logistic regression. Calling `.first()` for every GLM causes performance regression, especially in Python.

Author: Xiangrui Meng <meng@databricks.com>

Closes #4647 from mengxr/SPARK-5858 and squashes the following commits:

036dc7f [Xiangrui Meng] remove unnecessary first() call
12c5548 [Xiangrui Meng] check numFeatures only once
2015-02-17 10:17:45 -08:00
Patrick Wendell 3ce46e94fe SPARK-5856: In Maven build script, launch Zinc with more memory
I've seen out of memory exceptions when trying
to run many parallel builds against the same Zinc
server during packaging. We should use the same
increased memory settings we use for Maven itself.

I tested this and confirmed that the Nailgun JVM
launched with higher memory.

Author: Patrick Wendell <patrick@databricks.com>

Closes #4643 from pwendell/zinc-memory and squashes the following commits:

717cfb0 [Patrick Wendell] SPARK-5856: Launch Zinc with larger memory options.
2015-02-17 10:10:01 -08:00
Josh Rosen ee6e3eff02 Revert "[SPARK-5363] [PySpark] check ending mark in non-block way"
This reverts commits ac6fe67e1d and c06e42f2c1.
2015-02-17 07:49:02 -08:00
jerryshao a65766bf02 [SPARK-5826][Streaming] Fix Configuration not serializable problem
Author: jerryshao <saisai.shao@intel.com>

Closes #4612 from jerryshao/SPARK-5826 and squashes the following commits:

7ec71db [jerryshao] Remove transient for conf statement
88d84e6 [jerryshao] Fix Configuration not serializable problem
2015-02-17 10:45:18 +00:00
Patrick Wendell c06e42f2c1 HOTFIX: Style issue causing build break
Caused by #4601
2015-02-16 22:11:28 -08:00
Xiangrui Meng fd84229e2a [SPARK-5802][MLLIB] cache transformed data in glm
If we need to transform the input data, we should cache the output to avoid re-computing feature vectors every iteration. dbtsai

Author: Xiangrui Meng <meng@databricks.com>

Closes #4593 from mengxr/SPARK-5802 and squashes the following commits:

ae3be84 [Xiangrui Meng] cache transformed data in glm
2015-02-16 22:09:04 -08:00
Reynold Xin d380f324c6 [SPARK-5853][SQL] Schema support in Row.
Author: Reynold Xin <rxin@databricks.com>

Closes #4640 from rxin/SPARK-5853 and squashes the following commits:

9c6f569 [Reynold Xin] [SPARK-5853][SQL] Schema support in Row.
2015-02-16 20:42:57 -08:00
Patrick Wendell a51d51ffac SPARK-5850: Remove experimental label for Scala 2.11 and FlumePollingStream
Author: Patrick Wendell <patrick@databricks.com>

Closes #4638 from pwendell/SPARK-5850 and squashes the following commits:

386126f [Patrick Wendell] SPARK-5850: Remove experimental label for Scala 2.11 and FlumePollingStream.
2015-02-16 20:33:33 -08:00
Davies Liu ac6fe67e1d [SPARK-5363] [PySpark] check ending mark in non-block way
There is chance of dead lock that the Python process is waiting for ending mark from JVM, but which is eaten by corrupted stream.

This PR checks the ending mark from Python in non-block way, so it will not blocked by Python process.

There is a small chance that the ending mark is sent by Python process but not available right now, then Python process will not be used.

cc JoshRosen pwendell

Author: Davies Liu <davies@databricks.com>

Closes #4601 from davies/freeze and squashes the following commits:

e15a8c3 [Davies Liu] update logging
890329c [Davies Liu] Merge branch 'freeze' of github.com:davies/spark into freeze
2bd2228 [Davies Liu] add more logging
656d544 [Davies Liu] Update PythonRDD.scala
05e1085 [Davies Liu] check ending mark in non-block way
2015-02-16 20:32:03 -08:00
Reynold Xin 0e180bfc3c [SQL] Various DataFrame doc changes.
Added a bunch of tags.

Also changed parquetFile to take varargs rather than a string followed by varargs.

Author: Reynold Xin <rxin@databricks.com>

Closes #4636 from rxin/df-doc and squashes the following commits:

651f80c [Reynold Xin] Fixed parquetFile in PySpark.
8dc3024 [Reynold Xin] [SQL] Various DataFrame doc changes.
2015-02-16 19:00:30 -08:00
Josh Rosen 58a82a7882 [SPARK-5849] Handle more types of invalid JSON requests in SubmitRestProtocolMessage.parseAction
This patch improves SubmitRestProtocol's handling of invalid JSON requests in cases where those requests were parsable as JSON but not as JSON objects (e.g. they could be parsed as ararys or strings).  I replaced an unchecked cast with pattern-matching and added a new test case.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #4637 from JoshRosen/rest-protocol-cast and squashes the following commits:

b3f282b [Josh Rosen] [SPARK-5849] Handle more types of invalid JSON in SubmitRestProtocolMessage.parseAction
2015-02-16 18:08:02 -08:00
azagrebin 16687651f0 [SPARK-3340] Deprecate ADD_JARS and ADD_FILES
I created a patch that disables the environment variables.
Thereby scala or python shell log a warning message to notify user about the deprecation
with the following message:
scala: "ADD_JARS environment variable is deprecated, use --jar spark submit argument instead"
python: "Warning: ADD_FILES environment variable is deprecated, use --py-files argument instead"

Is it what is expected or the code associated with the variables should be just completely removed?
Should it be somewhere documented?

Author: azagrebin <azagrebin@gmail.com>

Closes #4616 from azagrebin/master and squashes the following commits:

bab1aa9 [azagrebin] [SPARK-3340] Deprecate ADD_JARS and ADD_FILES: minor readability issue
0643895 [azagrebin] [SPARK-3340] Deprecate ADD_JARS and ADD_FILES: add warning messages
42f0107 [azagrebin] [SPARK-3340] Deprecate ADD_JARS and ADD_FILES
2015-02-16 18:06:58 -08:00