8271 commits

Author SHA1 Message Date
Ken Takagiwa 8a74e4b2a8 [DOCS] Fixed wrong links
Author: Ken Takagiwa <ugw.gi.world@gmail.com>

Closes #2042 from giwa/patch-1 and squashes the following commits:

216fe0e [Ken Takagiwa] Fixed wrong links
2014-08-19 22:43:22 -07:00
Josh Rosen ebcb94f701 [SPARK-2974] [SPARK-2975] Fix two bugs related to spark.local.dirs
This PR fixes two bugs related to `spark.local.dirs` and `SPARK_LOCAL_DIRS`, one where `Utils.getLocalDir()` might return an invalid directory (SPARK-2974) and another where the `SPARK_LOCAL_DIRS` override didn't affect the driver, which could cause problems when running tasks in local mode (SPARK-2975).

This patch fixes both issues: the new `Utils.getOrCreateLocalRootDirs(conf: SparkConf)` utility method manages the creation of local directories and handles the precedence among the different configuration options, so we should see the same behavior whether we're running in local mode or on a worker.

It's kind of a pain to mock out environment variables in tests (no easy way to mock System.getenv), so I added a `private[spark]` method to SparkConf for accessing environment variables (by default, it just delegates to System.getenv).  By subclassing SparkConf and overriding this method, we can mock out SPARK_LOCAL_DIRS in tests.

I also fixed a typo in PySpark where we used `SPARK_LOCAL_DIR` instead of `SPARK_LOCAL_DIRS` (I think this was technically innocuous, but it seemed worth fixing).
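
The environment-mocking approach described above can be sketched in Python. This is only an illustration, not the Scala code in the patch: the class names `Conf` and `FakeEnvConf`, the function `get_local_root_dirs`, the `spark.local.dir` settings key, and the precedence order (environment variable, then the setting, then the system temp directory) are all assumptions made for the example.

```python
import os
import tempfile


class Conf:
    """Minimal stand-in for a configuration object (hypothetical, not SparkConf)."""

    def __init__(self, settings=None):
        self.settings = dict(settings or {})

    def getenv(self, name):
        # By default, delegate to the real process environment.
        return os.environ.get(name)


class FakeEnvConf(Conf):
    """Test double: overrides getenv so tests never touch os.environ."""

    def __init__(self, settings=None, env=None):
        super().__init__(settings)
        self.env = dict(env or {})

    def getenv(self, name):
        return self.env.get(name)


def get_local_root_dirs(conf):
    # Assumed precedence for this sketch: environment override first,
    # then the config setting, then the system temporary directory.
    raw = (conf.getenv("SPARK_LOCAL_DIRS")
           or conf.settings.get("spark.local.dir")
           or tempfile.gettempdir())
    return [d.strip() for d in raw.split(",") if d.strip()]
```

Because every lookup goes through `conf.getenv`, a test can pass a `FakeEnvConf` with any environment it likes, and the same function yields the same answer in local mode and on a worker.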

Author: Josh Rosen <joshrosen@apache.org>

Closes #2002 from JoshRosen/local-dirs and squashes the following commits:

efad8c6 [Josh Rosen] Address review comments:
1dec709 [Josh Rosen] Minor updates to Javadocs.
7f36999 [Josh Rosen] Use env vars to detect if running in YARN container.
399ac25 [Josh Rosen] Update getLocalDir() documentation.
bb3ad89 [Josh Rosen] Remove duplicated YARN getLocalDirs() code.
3e92d44 [Josh Rosen] Move local dirs override logic into Utils; fix bugs:
b2c4736 [Josh Rosen] Add failing tests for SPARK-2974 and SPARK-2975.
007298b [Josh Rosen] Allow environment variables to be mocked in tests.
6d9259b [Josh Rosen] Fix typo in PySpark: SPARK_LOCAL_DIR should be SPARK_LOCAL_DIRS
2014-08-19 22:42:50 -07:00
Xiangrui Meng 0a984aa155 [SPARK-3142][MLLIB] output shuffle data directly in Word2Vec
Sorry I didn't realize this in #2043. Ishiihara

Author: Xiangrui Meng <meng@databricks.com>

Closes #2049 from mengxr/more-w2v and squashes the following commits:

050b1c5 [Xiangrui Meng] output shuffle data directly
2014-08-19 22:16:22 -07:00
Reynold Xin 8adfbc2b6b [SPARK-3119] Re-implementation of TorrentBroadcast.
This is a re-implementation of TorrentBroadcast, with the following changes:

1. Removes most of the mutable, transient state from TorrentBroadcast (e.g. totalBytes, num of blocks fetched).
2. Removes TorrentInfo and TorrentBlock
3. Replaces the BlockManager.getSingle call in readObject with a getLocal, resulting in one less RPC call to the BlockManagerMasterActor to find the location of the block.
4. Removes the metadata block, resulting in one less block to fetch.
5. Removes an extra memory copy for deserialization (by using Java's SequenceInputStream).

In short, for a regular broadcast object with only one block, the number of RPC calls drops from 5+1 to 2+1.

Old TorrentBroadcast for object of a single block:
1 RPC to ask for location of the broadcast variable
1 RPC to ask for location of the metadata block
1 RPC to fetch the metadata block
1 RPC to ask for location of the first data block
1 RPC to fetch the first data block
1 RPC to tell the driver we put the first data block in
i.e. 5 + 1

New TorrentBroadcast for object of a single block:
1 RPC to ask for location of the first data block
1 RPC to get the first data block
1 RPC to tell the driver we put the first data block in
i.e. 2 + 1
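
The tally above can be checked mechanically. The sketch below simply copies the two RPC lists into Python and counts them; the names `OLD_RPCS`, `NEW_RPCS`, and `split_count` are made up for this illustration and do not appear in Spark.

```python
# Each entry is one RPC, copied from the two lists above.
OLD_RPCS = [
    "ask for location of the broadcast variable",
    "ask for location of the metadata block",
    "fetch the metadata block",
    "ask for location of the first data block",
    "fetch the first data block",
    "tell the driver we put the first data block in",
]
NEW_RPCS = [
    "ask for location of the first data block",
    "get the first data block",
    "tell the driver we put the first data block in",
]


def split_count(rpcs):
    # The final "tell the driver" call is reported separately as the "+ 1".
    return len(rpcs) - 1, 1


old_reads, old_report = split_count(OLD_RPCS)
new_reads, new_report = split_count(NEW_RPCS)
print(f"old: {old_reads} + {old_report}, new: {new_reads} + {new_report}")
print("RPCs saved per single-block broadcast:", len(OLD_RPCS) - len(NEW_RPCS))
```

This prints `old: 5 + 1, new: 2 + 1` and a saving of 3 RPCs per single-block broadcast.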

Author: Reynold Xin <rxin@apache.org>

Closes #2030 from rxin/torrentBroadcast and squashes the following commits:

5bacb9d [Reynold Xin] Always add the object to driver's block manager.
0d8ed5b [Reynold Xin] Added getBytes to BlockManager and uses that in TorrentBroadcast.
2d6a5fb [Reynold Xin] Use putBytes/getRemoteBytes throughout.
3670f00 [Reynold Xin] Code review feedback.
c1185cd [Reynold Xin] [SPARK-3119] Re-implementation of TorrentBroadcast.
2014-08-19 22:11:13 -07:00
Xiangrui Meng fce5c0fb63 [HOTFIX][Streaming][MLlib] use temp folder for checkpoint
or Jenkins will complain about no Apache header in checkpoint files. tdas rxin

Author: Xiangrui Meng <meng@databricks.com>

Closes #2046 from mengxr/tmp-checkpoint and squashes the following commits:

0d3ec73 [Xiangrui Meng] remove ssc.stop
9797843 [Xiangrui Meng] change checkpointDir to lazy val
89964ab [Xiangrui Meng] use temp folder for checkpoint
2014-08-19 22:05:29 -07:00
Xiangrui Meng 068b6fe6a1 [SPARK-3130][MLLIB] detect negative values in naive Bayes
because NB treats feature values as term frequencies. jkbradley
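
A check of this kind might look like the following Python sketch. It is not the MLlib implementation; the function name `check_nonnegative` and the error message are assumptions for the example.

```python
def check_nonnegative(features):
    """Raise ValueError if any feature value is negative.

    Term frequencies are counts, so a negative value cannot be a valid input.
    """
    for index, value in enumerate(features):
        if value < 0.0:
            raise ValueError(
                f"Naive Bayes requires nonnegative feature values, "
                f"but found {value} at index {index}."
            )
    return features
```

Failing fast at training time gives a clear error instead of a silently meaningless model.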

Author: Xiangrui Meng <meng@databricks.com>

Closes #2038 from mengxr/nb-neg and squashes the following commits:

52c37c3 [Xiangrui Meng] address comments
65f892d [Xiangrui Meng] detect negative values in nb
2014-08-19 21:01:23 -07:00
wangfei 0e3ab94d41 [SQL] add note of use synchronizedMap in SQLConf
Refer to:
http://stackoverflow.com/questions/510632/whats-the-difference-between-concurrenthashmap-and-collections-synchronizedmap
Collections.synchronizedMap(map) creates a blocking Map, which ensures consistency but degrades performance. Use ConcurrentHashMap (a more efficient thread-safe hash map) instead.

Also updates HiveQuerySuite to fix a test error that appeared after the change to ConcurrentHashMap.

Author: wangfei <wangfei_hello@126.com>
Author: scwf <wangfei1@huawei.com>

Closes #1996 from scwf/sqlconf and squashes the following commits:

93bc0c5 [wangfei] revert change of HiveQuerySuite
0cc05dd [wangfei] add note for use synchronizedMap
3c224d31 [scwf] fix format
a7bcb98 [scwf] use ConcurrentHashMap in sql conf, instead of synchronizedMap
2014-08-19 19:37:02 -07:00
freeman c7252b0097 [SPARK-3112][MLLIB] Add documentation and example for StreamingLR
Added a documentation section on StreamingLR to the ``MLlib - Linear Methods``, including a worked example.

mengxr tdas

Author: freeman <the.freeman.lab@gmail.com>

Closes #2047 from freeman-lab/streaming-lr-docs and squashes the following commits:

568d250 [freeman] Tweaks to wording / formatting
05a1139 [freeman] Added documentation and example for StreamingLR
2014-08-19 18:07:42 -07:00
Xiangrui Meng 1870dbaa55 [MLLIB] minor update to word2vec
very minor update Ishiihara

Author: Xiangrui Meng <meng@databricks.com>

Closes #2043 from mengxr/minor-w2v and squashes the following commits:

be649fd [Xiangrui Meng] remove map because we only need append
eccefcc [Xiangrui Meng] minor updates to word2vec
2014-08-19 17:41:37 -07:00
Reynold Xin 8b9dc99101 [SPARK-2468] Netty based block server / client module
Previous pull request (#1907) was reverted. This brings it back. Still looking into the hang.

Author: Reynold Xin <rxin@apache.org>

Closes #1971 from rxin/netty1 and squashes the following commits:

b0be96f [Reynold Xin] Added test to make sure outstandingRequests are cleaned after firing the events.
4c6d0ee [Reynold Xin] Pass callbacks cleanly.
603dce7 [Reynold Xin] Upgrade Netty to 4.0.23 to fix the DefaultFileRegion bug.
88be1d4 [Reynold Xin] Downgrade to 4.0.21 to work around a bug in writing DefaultFileRegion.
002626a [Reynold Xin] Remove netty-test-file.txt.
db6e6e0 [Reynold Xin] Revert "Revert "[SPARK-2468] Netty based block server / client module""
2014-08-19 17:40:35 -07:00
Xiangrui Meng 825d4fe47b [SPARK-3136][MLLIB] Create Java-friendly methods in RandomRDDs
Though we don't use default argument for methods in RandomRDDs, it is still not easy for Java users to use because the output type is either `RDD[Double]` or `RDD[Vector]`. Java users should expect `JavaDoubleRDD` and `JavaRDD[Vector]`, respectively. We should create dedicated methods for Java users, and allow default arguments in Scala methods in RandomRDDs, to make life easier for both Java and Scala users. This PR also contains documentation for random data generation. brkyvz

Author: Xiangrui Meng <meng@databricks.com>

Closes #2041 from mengxr/stat-doc and squashes the following commits:

fc5eedf [Xiangrui Meng] add missing comma
ffde810 [Xiangrui Meng] address comments
aef6d07 [Xiangrui Meng] add doc for random data generation
b99d94b [Xiangrui Meng] add java-friendly methods to RandomRDDs
2014-08-19 16:06:48 -07:00
Davies Liu d7e80c2597 [SPARK-2790] [PySpark] fix zip with serializers which have different batch sizes.
If two RDDs use serializers with different batch sizes, zip() re-serializes the one with the smaller batch size and then calls RDD.zip() in Spark.
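
The idea can be illustrated on plain Python lists, with each inner list standing in for one serialized batch. This is a simplified sketch, not the PySpark code: `rebatch` and `zip_batched` are hypothetical helpers, and real serializers work on byte streams rather than lists.

```python
from itertools import chain


def rebatch(batches, batch_size):
    """Flatten a sequence of batches and regroup it into batches of batch_size."""
    items = list(chain.from_iterable(batches))
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def zip_batched(left, left_size, right, right_size):
    # Re-batch the side with the smaller batch size so both sides match.
    target = max(left_size, right_size)
    if left_size < target:
        left = rebatch(left, target)
    if right_size < target:
        right = rebatch(right, target)
    # With equal batch sizes, batches pair up one-to-one.
    return [list(zip(a, b)) for a, b in zip(left, right)]


left = [[1, 2], [3, 4], [5, 6]]              # batch size 2
right = [["a", "b", "c"], ["d", "e", "f"]]   # batch size 3
print(zip_batched(left, 2, right, 3))
```

Without the re-batching step, zipping batch against batch would pair `[1, 2]` with `["a", "b", "c"]` and misalign every element that follows.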

Author: Davies Liu <davies.liu@gmail.com>

Closes #1894 from davies/zip and squashes the following commits:

c4652ea [Davies Liu] add more test cases
6d05fc8 [Davies Liu] Merge branch 'master' into zip
813b1e4 [Davies Liu] add more tests for failed cases
a4aafda [Davies Liu] fix zip with serializers which have different batch sizes.
2014-08-19 14:46:32 -07:00
hzw19900416 76eaeb4523 Move a bracket in validateSettings of SparkConf

Author: hzw19900416 <carlmartinmax@gmail.com>

Closes #2012 from hzw19900416/codereading and squashes the following commits:

e717fb6 [hzw19900416] Move a bracket in validateSettings of SparkConf
2014-08-19 14:04:49 -07:00
Vida Ha 94053a7b76 SPARK-2333 - spark_ec2 script should allow option for existing security group
- Uses the name tag to identify machines in a cluster.
    - Allows overriding the security group name so it doesn't need to coincide with the cluster name.
    - Outputs the request id's of up to 10 pending spot instance requests.

Author: Vida Ha <vida@databricks.com>

Closes #1899 from vidaha/vida/ec2-reuse-security-group and squashes the following commits:

c80d5c3 [Vida Ha] wrap retries in a try catch block
b2989d5 [Vida Ha] SPARK-2333: spark_ec2 script should allow option for existing security group
2014-08-19 13:35:05 -07:00
freeman 31f0b071ef [SPARK-3128][MLLIB] Use streaming test suite for StreamingLR
Refactored tests for streaming linear regression to use existing streaming test utilities. Summary of changes:
- Made ``mllib`` depend on tests from ``streaming``
- Rewrote accuracy and convergence tests to use ``setupStreams`` and ``runStreams``
- Added new test for the accuracy of predictions generated by ``predictOnValue``

These tests should run faster, be easier to extend/maintain, and provide a reference for new tests.

mengxr tdas

Author: freeman <the.freeman.lab@gmail.com>

Closes #2037 from freeman-lab/streamingLR-predict-tests and squashes the following commits:

e851ca7 [freeman] Fixed long lines
50eb0bf [freeman] Refactored tests to use streaming test tools
32c43c2 [freeman] Added test for prediction
2014-08-19 13:28:57 -07:00
Kousuke Saruta cbfc26ba45 [SPARK-3089] Fix meaningless error message in ConnectionManager
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #2000 from sarutak/SPARK-3089 and squashes the following commits:

02dfdea [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-3089
e759ce7 [Kousuke Saruta] Improved error message when closing SendingConnection
2014-08-19 10:15:11 -07:00
Thomas Graves 7eb9cbc273 [SPARK-3072] YARN - Exit when reach max number failed executors
In some cases on Hadoop 2.x the Spark application master doesn't properly exit and hangs around for 10 minutes after it's really done. We should make sure it exits properly and stops the driver.

Author: Thomas Graves <tgraves@apache.org>

Closes #2022 from tgravescs/SPARK-3072 and squashes the following commits:

665701d [Thomas Graves] Exit when reach max number failed executors
2014-08-19 09:40:31 -05:00
Matt Forbes cd0720ca77 Fix typo in decision tree docs
Candidate splits were inconsistent with the example.

Author: Matt Forbes <matt@tellapart.com>

Closes #1837 from emef/tree-doc and squashes the following commits:

3be14a1 [Matt Forbes] Fix typo in decision tree docs
2014-08-18 21:43:32 -07:00
Reynold Xin 82577339dd [SPARK-3116] Remove the excessive lockings in TorrentBroadcast
Author: Reynold Xin <rxin@apache.org>

Closes #2028 from rxin/torrentBroadcast and squashes the following commits:

92c62a5 [Reynold Xin] Revert the MEMORY_AND_DISK_SER changes.
03a5221 [Reynold Xin] [SPARK-3116] Remove the excessive lockings in TorrentBroadcast
2014-08-18 20:51:41 -07:00
Josh Rosen 1f1819b20f [SPARK-3114] [PySpark] Fix Python UDFs in Spark SQL.
This fixes SPARK-3114, an issue where we inadvertently broke Python UDFs in Spark SQL.

This PR modifies the test runner script to always run the PySpark SQL tests, irrespective of whether Spark SQL itself has been modified. It also includes Davies' fix for the bug.

Closes #2026.

Author: Josh Rosen <joshrosen@apache.org>
Author: Davies Liu <davies.liu@gmail.com>

Closes #2027 from JoshRosen/pyspark-sql-fix and squashes the following commits:

9af2708 [Davies Liu] bugfix: disable compression of command
0d8d3a4 [Josh Rosen] Always run Python Spark SQL tests.
2014-08-18 20:42:19 -07:00
Xiangrui Meng 217b5e915e [SPARK-3108][MLLIB] add predictOnValues to StreamingLR and fix predictOn
It is useful in streaming to allow users to carry extra data with the prediction, for monitoring the prediction error for example. freeman-lab
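
As a rough illustration of why carrying a key alongside the prediction helps, here is a plain-Python sketch on one batch of data. It does not use the streaming API; `predict`, `predict_on_values`, and the sample weights are made up for the example.

```python
def predict(weights, features):
    """Linear prediction: dot product of weights and features."""
    return sum(w * x for w, x in zip(weights, features))


def predict_on_values(weights, keyed_features):
    # Keep each key (here, the true label) next to its prediction,
    # in the spirit of a mapValues over (key, features) pairs.
    return [(key, predict(weights, features)) for key, features in keyed_features]


weights = [0.5, 2.0]
batch = [(3.0, [2.0, 1.0]), (1.0, [0.0, 0.5])]  # (true label, features)
for label, prediction in predict_on_values(weights, batch):
    print(f"label={label} prediction={prediction} error={abs(label - prediction)}")
```

With the label kept as the key, the prediction error can be monitored without joining the output back against the input.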

Author: Xiangrui Meng <meng@databricks.com>

Closes #2023 from mengxr/predict-on-values and squashes the following commits:

cac47b8 [Xiangrui Meng] add classtag
2821b3b [Xiangrui Meng] use mapValues
0925efa [Xiangrui Meng] add predictOnValues to StreamingLR and fix predictOn
2014-08-18 18:20:54 -07:00
Joseph K. Bradley c8b16ca0d8 [SPARK-2850] [SPARK-2626] [mllib] MLlib stats examples + small fixes
Added examples for statistical summarization:
* Scala: StatisticalSummary.scala
** Tests: correlation, MultivariateOnlineSummarizer
* python: statistical_summary.py
** Tests: correlation (since MultivariateOnlineSummarizer has no Python API)

Added examples for random and sampled RDDs:
* Scala: RandomAndSampledRDDs.scala
* python: random_and_sampled_rdds.py
* Both test:
** RandomRDDGenerators.normalRDD, normalVectorRDD
** RDD.sample, takeSample, sampleByKey

Added sc.stop() to all examples.

CorrelationSuite.scala
* Added 1 test for RDDs with only 1 value

RowMatrix.scala
* numCols(): Added check for numRows = 0, with error message.
* computeCovariance(): Added check for numRows <= 1, with error message.
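
The reason for the `numRows <= 1` check is easiest to see in a small example. The sketch below assumes the usual sample covariance with an `n - 1` denominator, which is undefined for a single row; it is a plain-Python illustration, not the RowMatrix code, and the error message is invented.

```python
def compute_covariance(rows):
    """Sample covariance matrix of a list of equal-length rows."""
    n = len(rows)
    if n <= 1:
        # With n - 1 in the denominator, zero or one row cannot work.
        raise ValueError(
            f"Cannot compute the covariance of a matrix with only {n} row(s)."
        )
    d = len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    return [
        [
            sum((row[i] - means[i]) * (row[j] - means[j]) for row in rows) / (n - 1)
            for j in range(d)
        ]
        for i in range(d)
    ]


print(compute_covariance([[1.0, 2.0], [3.0, 6.0]]))
```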

Python SparseVector (pyspark/mllib/linalg.py)
* Added toDense() function

python/run-tests script
* Added stat.py (doc test)

CC: mengxr dorx  Main changes were examples to show usage across APIs.

Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>

Closes #1878 from jkbradley/mllib-stats-api-check and squashes the following commits:

ea5c047 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
dafebe2 [Joseph K. Bradley] Bug fixes for examples SampledRDDs.scala and sampled_rdds.py: Check for division by 0 and for missing key in maps.
8d1e555 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
60c72d9 [Joseph K. Bradley] Fixed stat.py doc test to work for Python versions printing nan or NaN.
b20d90a [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
4e5d15e [Joseph K. Bradley] Changed pyspark/mllib/stat.py doc tests to use NaN instead of nan.
32173b7 [Joseph K. Bradley] Stats examples update.
c8c20dc [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
cf70b07 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
0b7cec3 [Joseph K. Bradley] Small updates based on code review.  Renamed statistical_summary.py to correlations.py
ab48f6e [Joseph K. Bradley] RowMatrix.scala * numCols(): Added check for numRows = 0, with error message. * computeCovariance(): Added check for numRows <= 1, with error message.
65e4ebc [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
8195c78 [Joseph K. Bradley] Added examples for random and sampled RDDs: * Scala: RandomAndSampledRDDs.scala * python: random_and_sampled_rdds.py * Both test: ** RandomRDDGenerators.normalRDD, normalVectorRDD ** RDD.sample, takeSample, sampleByKey
064985b [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
ee918e9 [Joseph K. Bradley] Added examples for statistical summarization: * Scala: StatisticalSummary.scala ** Tests: correlation, MultivariateOnlineSummarizer * python: statistical_summary.py ** Tests: correlation (since MultivariateOnlineSummarizer has no Python API)
2014-08-18 18:01:39 -07:00
Joseph K. Bradley 115eeb30dd [mllib] DecisionTree: treeAggregate + Python example bug fix
Small DecisionTree updates:
* Changed main DecisionTree aggregate to treeAggregate.
* Fixed bug in python example decision_tree_runner.py with missing argument (since categoricalFeaturesInfo is no longer an optional argument for trainClassifier).
* Fixed same bug in python doc tests, and added tree.py to doc tests.
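
For readers unfamiliar with the first item: a tree aggregate merges per-partition results in several levels rather than combining them all in one place. The following single-process Python sketch shows the shape of the computation only; the function `tree_aggregate` and its `fan_in` parameter are assumptions for the example and do not mirror Spark's signature.

```python
from functools import reduce


def tree_aggregate(partitions, zero, seq_op, comb_op, fan_in=2):
    """Aggregate partitions, merging partial results level by level.

    fan_in must be at least 2 so that every level shrinks the list.
    """
    # Step 1: fold each partition locally, as a plain aggregate would.
    partials = [reduce(seq_op, part, zero) for part in partitions]
    if not partials:
        return zero
    # Step 2: merge partial results in groups instead of all at once.
    while len(partials) > 1:
        partials = [
            reduce(comb_op, partials[i:i + fan_in])
            for i in range(0, len(partials), fan_in)
        ]
    return partials[0]


total = tree_aggregate([[1, 2], [3], [4, 5, 6], [7]], 0,
                       lambda acc, x: acc + x, lambda a, b: a + b)
print(total)
```

In a distributed setting the intermediate merges run on executors, so the driver receives a handful of results instead of one per partition.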

CC: mengxr

Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>

Closes #2015 from jkbradley/dt-opt2 and squashes the following commits:

b5114fa [Joseph K. Bradley] Fixed python tree.py doc test (extra newline)
8e4665d [Joseph K. Bradley] Added tree.py to python doc tests.  Fixed bug from missing categoricalFeaturesInfo argument.
b7b2922 [Joseph K. Bradley] Fixed bug in python example decision_tree_runner.py with missing argument.  Changed main DecisionTree aggregate to treeAggregate.
85bbc1f [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt2
66d076f [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt2
a0ed0da [Joseph K. Bradley] Renamed DTMetadata to DecisionTreeMetadata.  Small doc updates.
3726d20 [Joseph K. Bradley] Small code improvements based on code review.
ac0b9f8 [Joseph K. Bradley] Small updates based on code review. Main change: Now using << instead of math.pow.
db0d773 [Joseph K. Bradley] scala style fix
6a38f48 [Joseph K. Bradley] Added DTMetadata class for cleaner code
931a3a7 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt2
797f68a [Joseph K. Bradley] Fixed DecisionTreeSuite bug for training second level.  Needed to update treePointToNodeIndex with groupShift.
f40381c [Joseph K. Bradley] Merge branch 'dt-opt1' into dt-opt2
5f2dec2 [Joseph K. Bradley] Fixed scalastyle issue in TreePoint
6b5651e [Joseph K. Bradley] Updates based on code review.  1 major change: persisting to memory + disk, not just memory.
2d2aaaf [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt1
26d10dd [Joseph K. Bradley] Removed tree/model/Filter.scala since no longer used.  Removed debugging println calls in DecisionTree.scala.
356daba [Joseph K. Bradley] Merge branch 'dt-opt1' into dt-opt2
430d782 [Joseph K. Bradley] Added more debug info on binning error.  Added some docs.
d036089 [Joseph K. Bradley] Print timing info to logDebug.
e66f1b1 [Joseph K. Bradley] TreePoint * Updated doc * Made some methods private
8464a6e [Joseph K. Bradley] Moved TimeTracker to tree/impl/ in its own file, and cleaned it up.  Removed debugging println calls from DecisionTree.  Made TreePoint extend Serializable
a87e08f [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt1
c1565a5 [Joseph K. Bradley] Small DecisionTree updates: * Simplification: Updated calculateGainForSplit to take aggregates for a single (feature, split) pair. * Internal doc: findAggForOrderedFeatureClassification
b914f3b [Joseph K. Bradley] DecisionTree optimization: eliminated filters + small changes
b2ed1f3 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt
0f676e2 [Joseph K. Bradley] Optimizations + Bug fix for DecisionTree
3211f02 [Joseph K. Bradley] Optimizing DecisionTree * Added TreePoint representation to avoid calling findBin multiple times. * (not working yet, but debugging)
f61e9d2 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-timing
bcf874a [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-timing
511ec85 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-timing
a95bc22 [Joseph K. Bradley] timing for DecisionTree internals
2014-08-18 14:40:05 -07:00
Marcelo Vanzin 6201b27643 [SPARK-2718] [yarn] Handle quotes and other characters in user args.
Due to the way Yarn runs things through bash, normal quoting doesn't
work as expected. This change applies the necessary voodoo to the user
args to avoid issues with bash and special characters.
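
The general problem can be shown with Python's standard `shlex` module. This is a sketch of shell quoting in general, not the escaping logic the patch adds to the YARN code; `quote_args` and the sample arguments are made up for the example.

```python
import shlex


def quote_args(args):
    """Join arguments into one command-line string that is safe for a POSIX shell."""
    return " ".join(shlex.quote(arg) for arg in args)


user_args = ['say "hi"', "a b", "$HOME", "back\\slash", "it's"]
command_line = quote_args(user_args)
print(command_line)

# shlex.split applies POSIX word-splitting rules, so a faithful round trip
# suggests a shell would hand the program the same arguments.
print(shlex.split(command_line) == user_args)
```

Joining the raw arguments with spaces instead would split `a b` into two words and let the shell expand `$HOME`.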

The change also uncovered an issue with the event logger app name
sanitizing code; it wasn't cleaning up all "bad" characters, so
sometimes it would fail to create the log dirs. I just added some
more bad character replacements.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #1724 from vanzin/SPARK-2718 and squashes the following commits:

cc84b89 [Marcelo Vanzin] Review feedback.
c1a257a [Marcelo Vanzin] Add test for backslashes.
55571d4 [Marcelo Vanzin] Unbreak yarn-client.
515613d [Marcelo Vanzin] [SPARK-2718] [yarn] Handle quotes and other characters in user args.
2014-08-18 14:10:10 -07:00
Davies Liu d1d0ee41c2 [SPARK-3103] [PySpark] fix saveAsTextFile() with utf-8
Bugfix: it raised an exception when it tried to encode non-ASCII strings into unicode. It should only encode unicode as "utf-8".
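
The bug belongs to Python 2, where `str` holds bytes and `unicode` holds text, and calling `.encode("utf-8")` on a byte string containing non-ASCII bytes first triggers an implicit ASCII decode that fails. The Python 3 sketch below shows the same rule with `bytes` and `str`; the helper name `to_output_bytes` is an assumption and this is not the PySpark code.

```python
def to_output_bytes(record):
    """Return the bytes to write for one record."""
    if isinstance(record, bytes):
        # Already encoded: pass through untouched instead of re-encoding.
        return record
    if isinstance(record, str):
        # Only text needs encoding.
        return record.encode("utf-8")
    # Anything else is converted to text first.
    return str(record).encode("utf-8")


print(to_output_bytes("é"), to_output_bytes(b"\xc3\xa9"), to_output_bytes(42))
```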

Author: Davies Liu <davies.liu@gmail.com>

Closes #2018 from davies/fix_utf8 and squashes the following commits:

4db7967 [Davies Liu] fix saveAsTextFile() with utf-8
2014-08-18 13:58:35 -07:00
Reynold Xin 3a5962f0f5 Removed .travis.yml file since we are not using Travis. 2014-08-18 13:38:56 -07:00
Marcelo Vanzin 66ade00f91 [SPARK-2169] Don't copy appName / basePath everywhere.
Instead of keeping copies in all pages, just reference the values
kept in the base SparkUI instance (by making them available via
getters).

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #1252 from vanzin/SPARK-2169 and squashes the following commits:

4412fc6 [Marcelo Vanzin] Simplify UIUtils.headerSparkPage signature.
4e5d35a [Marcelo Vanzin] [SPARK-2169] Don't copy appName / basePath everywhere.
2014-08-18 13:25:30 -07:00
Michael Armbrust 3abd0c1cda [SPARK-2406][SQL] Initial support for using ParquetTableScan to read HiveMetaStore tables.
This PR adds an experimental flag `spark.sql.hive.convertMetastoreParquet` that, when true, causes the planner to detect tables that use Hive's Parquet SerDe and plan them using Spark SQL's native `ParquetTableScan` instead.

Author: Michael Armbrust <michael@databricks.com>
Author: Yin Huai <huai@cse.ohio-state.edu>

Closes #1819 from marmbrus/parquetMetastore and squashes the following commits:

1620079 [Michael Armbrust] Revert "remove hive parquet bundle"
cc30430 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into parquetMetastore
4f3d54f [Michael Armbrust] fix style
41ebc5f [Michael Armbrust] remove hive parquet bundle
a43e0da [Michael Armbrust] Merge remote-tracking branch 'origin/master' into parquetMetastore
4c4dc19 [Michael Armbrust] Fix bug with tree splicing.
ebb267e [Michael Armbrust] include parquet hive to tests pass (Remove this later).
c0d9b72 [Michael Armbrust] Avoid creating a HadoopRDD per partition.  Add dirty hacks to retrieve partition values from the InputSplit.
8cdc93c [Michael Armbrust] Merge pull request #8 from yhuai/parquetMetastore
a0baec7 [Yin Huai] Partitioning columns can be resolved.
1161338 [Michael Armbrust] Add a test to make sure conversion is actually happening
212d5cd [Michael Armbrust] Initial support for using ParquetTableScan to read HiveMetaStore tables.
2014-08-18 13:17:10 -07:00
Matei Zaharia 9eb74c7d2c [SPARK-3091] [SQL] Add support for caching metadata on Parquet files
For larger Parquet files, reading the file footers (which is done in parallel on up to 5 threads) and HDFS block locations (which is serial) can take multiple seconds. We can add an option to cache this data within FilteringParquetInputFormat. Unfortunately ParquetInputFormat only caches footers within each instance of ParquetInputFormat, not across them.

Note: this PR leaves this turned off by default for 1.1, but I believe it's safe to turn it on after. The keys in the hash maps are FileStatus objects that include a modification time, so this will work fine if files are modified. The location cache could become invalid if files have moved within HDFS, but that's rare so I just made it invalidate entries every 15 minutes.
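
The two invalidation ideas in the note (a key that includes the modification time, plus time-based expiry) can be sketched in a few lines of Python. This is an illustration only: the patch itself uses Guava caches, and `ExpiringCache`, `file_key`, and the injectable `clock` are names invented for the example.

```python
import time


class ExpiringCache:
    """Tiny cache whose entries expire a fixed time after being written."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self.entries = {}  # key -> (expiry time, value)

    def get_or_load(self, key, loader):
        now = self.clock()
        hit = self.entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]
        # Missing or expired: reload and remember the new expiry time.
        value = loader(key)
        self.entries[key] = (now + self.ttl_seconds, value)
        return value


def file_key(path, modification_time):
    # Including the modification time means a rewritten file gets a new key,
    # so a stale entry is never returned for it.
    return (path, modification_time)


# 15 minutes, matching the expiry interval mentioned above.
footer_cache = ExpiringCache(ttl_seconds=15 * 60)
```

A modified file misses the cache immediately because its key changes, while a file that merely moved is caught by the expiry at the latest.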

Author: Matei Zaharia <matei@databricks.com>

Closes #2005 from mateiz/parquet-cache and squashes the following commits:

dae8efe [Matei Zaharia] Bug fix
c71e9ed [Matei Zaharia] Handle empty statuses directly
22072b0 [Matei Zaharia] Use Guava caches and add a config option for caching metadata
8fb56ce [Matei Zaharia] Cache file block locations too
453bd21 [Matei Zaharia] Bug fix
4094df6 [Matei Zaharia] First attempt at caching Parquet footers
2014-08-18 11:00:10 -07:00
Patrick Wendell 6bca8898a1 SPARK-3025 [SQL]: Allow JDBC clients to set a fair scheduler pool
This definitely needs review as I am not familiar with this part of Spark.
I tested this locally and it did seem to work.

Author: Patrick Wendell <pwendell@gmail.com>

Closes #1937 from pwendell/scheduler and squashes the following commits:

b858e33 [Patrick Wendell] SPARK-3025: Allow JDBC clients to set a fair scheduler pool
2014-08-18 10:52:20 -07:00
Matei Zaharia 4bf3de7107 [SPARK-3085] [SQL] Use compact data structures in SQL joins
This reuses the CompactBuffer from Spark Core to save memory and pointer
dereferences. I also tried AppendOnlyMap instead of java.util.HashMap
but unfortunately that slows things down because it seems to do more
equals() calls and the equals on GenericRow, and especially JoinedRow,
is pretty expensive.

Author: Matei Zaharia <matei@databricks.com>

Closes #1993 from mateiz/spark-3085 and squashes the following commits:

188221e [Matei Zaharia] Remove unneeded import
5f903ee [Matei Zaharia] [SPARK-3085] [SQL] Use compact data structures in SQL joins
2014-08-18 10:45:24 -07:00
Matei Zaharia 6a13dca12f [SPARK-3084] [SQL] Collect broadcasted tables in parallel in joins
BroadcastHashJoin has a broadcastFuture variable that tries to collect
the broadcasted table in a separate thread, but this doesn't help
because it's a lazy val that only gets initialized when you attempt to
build the RDD. Thus queries that broadcast multiple tables would collect
and broadcast them sequentially. I changed this to a val to let it start
collecting right when the operator is created.
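
The difference between starting the work lazily and starting it at construction time can be reproduced with Python futures. This sketch only mimics the behaviour described above: `LazyJoin`, `EagerJoin`, `collect_table`, and the sleep that stands in for collecting a table are all assumptions for the example.

```python
from concurrent.futures import ThreadPoolExecutor
import time

pool = ThreadPoolExecutor(max_workers=4)


def collect_table(name):
    time.sleep(0.1)  # stands in for collecting and broadcasting a table
    return f"{name}-broadcast"


class LazyJoin:
    def __init__(self, table):
        self.table = table
        self._future = None

    def result(self):
        # Work starts only on first use, like a lazy val.
        if self._future is None:
            self._future = pool.submit(collect_table, self.table)
        return self._future.result()


class EagerJoin:
    def __init__(self, table):
        # Work starts as soon as the operator is constructed, like a val.
        self._future = pool.submit(collect_table, table)

    def result(self):
        return self._future.result()


def timed(join_class):
    start = time.perf_counter()
    joins = [join_class(name) for name in ("a", "b", "c")]
    results = [join.result() for join in joins]
    return results, time.perf_counter() - start


lazy_results, lazy_elapsed = timed(LazyJoin)
eager_results, eager_elapsed = timed(EagerJoin)
print(f"lazy: {lazy_elapsed:.2f}s, eager: {eager_elapsed:.2f}s")
pool.shutdown()
```

Both variants return the same results, but the lazy one waits for each table in turn while the eager one overlaps all three collections.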

Author: Matei Zaharia <matei@databricks.com>

Closes #1990 from mateiz/spark-3084 and squashes the following commits:

f468766 [Matei Zaharia] [SPARK-3084] Collect broadcasted tables in parallel in joins
2014-08-18 10:05:52 -07:00
Patrick Wendell 7ae28d1247 SPARK-3096: Include parquet hive serde by default in build
A small change: we should just add this dependency. It doesn't have any recursive deps and it's needed for reading Hive Parquet tables.

Author: Patrick Wendell <pwendell@gmail.com>

Closes #2009 from pwendell/parquet and squashes the following commits:

e411f9f [Patrick Wendell] SPARk-309: Include parquet hive serde by default in build
2014-08-18 10:00:46 -07:00
Chandan Kumar f45efbb8aa [SPARK-2862] histogram method fails on some choices of bucketCount
Author: Chandan Kumar <chandan.kumar@imaginea.com>

Closes #1787 from nrchandan/spark-2862 and squashes the following commits:

a76bbf6 [Chandan Kumar] [SPARK-2862] Fix for a broken test case and add new test cases
4211eea [Chandan Kumar] [SPARK-2862] Add Scala bug id
13854f1 [Chandan Kumar] [SPARK-2862] Use shorthand range notation to avoid Scala bug
2014-08-18 09:52:25 -07:00
CrazyJvm c0cbbdeaf4 SPARK-3093 : masterLock in Worker is no longer needed
There's no need to use masterLock in Worker now, since all communication happens within the Akka actor.

Author: CrazyJvm <crazyjvm@gmail.com>

Closes #2008 from CrazyJvm/no-need-master-lock and squashes the following commits:

dd39e20 [CrazyJvm] fix format
58e7fa5 [CrazyJvm] there's no need to use masterLock now since all communications are within Akka actor
2014-08-18 09:34:36 -07:00
Liquan Pei 9306b8c6c8 [MLlib] Remove transform(dataset: RDD[String]) from Word2Vec public API
mengxr
Remove  transform(dataset: RDD[String]) from public API.

Author: Liquan Pei <liquanpei@gmail.com>

Closes #2010 from Ishiihara/Word2Vec-api and squashes the following commits:

17b1031 [Liquan Pei] remove transform(dataset: RDD[String]) from public API
2014-08-18 01:15:45 -07:00
Liquan Pei eef779b8d6 [SPARK-2842][MLlib]Word2Vec documentation
mengxr
Documentation for Word2Vec

Author: Liquan Pei <liquanpei@gmail.com>

Closes #2003 from Ishiihara/Word2Vec-doc and squashes the following commits:

4ff11d4 [Liquan Pei] minor fix
8d7458f [Liquan Pei] code reformat
6df0dcb [Liquan Pei] add Word2Vec documentation
2014-08-17 23:30:47 -07:00
Liquan Pei 3c8fa50590 [SPARK-3097][MLlib] Word2Vec performance improvement
mengxr Please review the code. Adding weights in reduceByKey soon.

Only output model entries for words that appeared in the partition before merging, and use reduceByKey to combine the models. In general, this implementation is about 30s faster than the implementation that uses a big array.
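
The shape of this optimization can be sketched on plain Python data. It is a toy illustration, not the Word2Vec code: `partition_updates`, `reduce_by_key`, and the placeholder "gradient" (a simple count in the first vector slot) are assumptions for the example.

```python
def partition_updates(partition_words, dim):
    """Emit one (word, vector) pair only for words seen in this partition."""
    updates = {}
    for word in partition_words:
        vec = updates.setdefault(word, [0.0] * dim)
        vec[0] += 1.0  # placeholder for a real gradient update
    return list(updates.items())


def reduce_by_key(pairs):
    """Combine vectors that share a key by element-wise addition."""
    merged = {}
    for word, vec in pairs:
        if word in merged:
            merged[word] = [a + b for a, b in zip(merged[word], vec)]
        else:
            merged[word] = list(vec)
    return merged


partitions = [["a", "b", "a"], ["b", "c"]]
pairs = [pair for part in partitions for pair in partition_updates(part, dim=2)]
print(reduce_by_key(pairs))
```

Each partition ships only the rows it touched, so the data exchanged grows with the number of distinct words per partition rather than with the full vocabulary.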

Author: Liquan Pei <liquanpei@gmail.com>

Closes #1932 from Ishiihara/Word2Vec-improve2 and squashes the following commits:

d5377a9 [Liquan Pei] use syn0Global and syn1Global to represent model
cad2011 [Liquan Pei] bug fix for synModify array out of bound
083aa66 [Liquan Pei] update synGlobal in place and reduce synOut size
9075e1c [Liquan Pei] combine syn0Global and syn1Global to synGlobal
aa2ab36 [Liquan Pei] use reduceByKey to combine models
2014-08-17 23:29:44 -07:00
Sandy Ryza df652ea02a SPARK-2900. aggregate inputBytes per stage
Author: Sandy Ryza <sandy@cloudera.com>

Closes #1826 from sryza/sandy-spark-2900 and squashes the following commits:

43f9091 [Sandy Ryza] SPARK-2900
2014-08-17 22:39:06 -07:00
Patrick Wendell 5173f3c40f SPARK-2884: Create binary builds in parallel with release script. 2014-08-17 22:31:04 -07:00
Xiangrui Meng c77f40668f [SPARK-3087][MLLIB] fix col indexing bug in chi-square and add a check for number of distinct values
There is a bug determining the column index. dorx

Author: Xiangrui Meng <meng@databricks.com>

Closes #1997 from mengxr/chisq-index and squashes the following commits:

8fc2ab2 [Xiangrui Meng] fix col indexing bug and add a check for number of distinct values
2014-08-17 20:53:18 -07:00
Hari Shreedharan 95470a03ae [HOTFIX][STREAMING] Allow the JVM/Netty to decide which port to bind to in Flume Polling Tests.
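
The technique named in the title, letting the runtime choose the port and then asking which one it picked, looks like this with Python's standard `socket` module. It is a generic sketch rather than the Flume or Netty test code, and `start_server_on_free_port` is a hypothetical helper.

```python
import socket


def start_server_on_free_port(host="127.0.0.1"):
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind((host, 0))  # port 0: the OS picks an unused port
    server.listen(1)
    port = server.getsockname()[1]  # ask the socket which port it got
    return server, port


server, port = start_server_on_free_port()
print("listening on port", port)
server.close()
```

Because the socket is already bound when the port number is read, there is no window in which another process can take the port, unlike probing for a free port first and binding later.
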
Author: Hari Shreedharan <harishreedharan@gmail.com>

Closes #1820 from harishreedharan/use-free-ports and squashes the following commits:

b939067 [Hari Shreedharan] Remove unused import.
67856a8 [Hari Shreedharan] Remove findFreePort.
0ea51d1 [Hari Shreedharan] Make some changes to getPort to use map on the serverOpt.
1fb0283 [Hari Shreedharan] Merge branch 'master' of https://github.com/apache/spark into use-free-ports
b351651 [Hari Shreedharan] Allow Netty to choose port, and query it to decide the port to bind to. Leaving findFreePort as is, if other tests want to use it at some point.
e6c9620 [Hari Shreedharan] Making sure the second sink uses the correct port.
11c340d [Hari Shreedharan] Add info about race condition to scaladoc.
e89d135 [Hari Shreedharan] Adding Scaladoc.
6013bb0 [Hari Shreedharan] [STREAMING] Find free ports to use before attempting to create Flume Sink in Flume Polling Suite
2014-08-17 19:50:31 -07:00
Chris Fregly 99243288b0 [SPARK-1981] updated streaming-kinesis.md
fixed markup, separated out sections more-clearly, more thorough explanations

Author: Chris Fregly <chris@fregly.com>

Closes #1757 from cfregly/master and squashes the following commits:

9b1c71a [Chris Fregly] better explained why spark checkpoints are disabled in the example (due to no stateful operations being used)
0f37061 [Chris Fregly] SPARK-1981:  (Kinesis streaming support) updated streaming-kinesis.md
862df67 [Chris Fregly] Merge remote-tracking branch 'upstream/master'
8e1ae2e [Chris Fregly] Merge remote-tracking branch 'upstream/master'
4774581 [Chris Fregly] updated docs, renamed retry to retryRandom to be more clear, removed retries around store() method
0393795 [Chris Fregly] moved Kinesis examples out of examples/ and back into extras/kinesis-asl
691a6be [Chris Fregly] fixed tests and formatting, fixed a bug with JavaKinesisWordCount during union of streams
0e1c67b [Chris Fregly] Merge remote-tracking branch 'upstream/master'
74e5c7c [Chris Fregly] updated per TD's feedback.  simplified examples, updated docs
e33cbeb [Chris Fregly] Merge remote-tracking branch 'upstream/master'
bf614e9 [Chris Fregly] per matei's feedback:  moved the kinesis examples into the examples/ dir
d17ca6d [Chris Fregly] per TD's feedback:  updated docs, simplified the KinesisUtils api
912640c [Chris Fregly] changed the foundKinesis class to be a publicly available class
db3eefd [Chris Fregly] Merge remote-tracking branch 'upstream/master'
21de67f [Chris Fregly] Merge remote-tracking branch 'upstream/master'
6c39561 [Chris Fregly] parameterized the versions of the aws java sdk and kinesis client
338997e [Chris Fregly] improve build docs for kinesis
828f8ae [Chris Fregly] more cleanup
e7c8978 [Chris Fregly] Merge remote-tracking branch 'upstream/master'
cd68c0d [Chris Fregly] fixed typos and backward compatibility
d18e680 [Chris Fregly] Merge remote-tracking branch 'upstream/master'
b3b0ff1 [Chris Fregly] [SPARK-1981] Add AWS Kinesis streaming support
2014-08-17 19:33:15 -07:00
Michael Armbrust bfa09b01d7 [SQL] Improve debug logging and toStrings.
Author: Michael Armbrust <michael@databricks.com>

Closes #2004 from marmbrus/codgenDebugging and squashes the following commits:

b7a7e41 [Michael Armbrust] Improve debug logging and toStrings.
2014-08-17 19:00:38 -07:00
Michael Armbrust 5ecb08ea06 Revert "[SPARK-2970] [SQL] spark-sql script ends with IOException when EventLogging is enabled"
Revert #1891 due to issues with hadoop 1 compatibility.

Author: Michael Armbrust <michael@databricks.com>

Closes #2007 from marmbrus/revert1891 and squashes the following commits:

68706c0 [Michael Armbrust] Revert "[SPARK-2970] [SQL] spark-sql script ends with IOException when EventLogging is enabled"
2014-08-17 18:10:45 -07:00
Patrick Wendell 318e28b503 SPARK-2881. Upgrade snappy-java to 1.1.1.3.
This upgrades snappy-java which fixes the issue reported in SPARK-2881.
This is the master branch equivalent to #1994 which provides a different
work-around for the 1.1 branch.

Author: Patrick Wendell <pwendell@gmail.com>

Closes #1995 from pwendell/snappy-1.1 and squashes the following commits:

0c7c4c2 [Patrick Wendell] SPARK-2881. Upgrade snappy-java to 1.1.1.3.
2014-08-17 17:06:55 -07:00
Joseph K. Bradley 73ab7f141c [SPARK-3042] [mllib] DecisionTree Filter top-down instead of bottom-up
DecisionTree needs to match each example to a node at each iteration.  It currently does this very inefficiently with a set of filters: for each example, it examines each node at the current level and traces up to the root to see whether that example should be handled by that node.

Fix: Filter top-down using the partly built tree itself.
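
As a rough illustration of top-down filtering, the sketch below walks one example from the root of a partly built tree down to the node that should handle it. It is written in Python rather than Scala, and the `Node` class, its fields, and the `<=` split rule are assumptions made for the example; only the name `predict_node_index` echoes the `predictNodeIndex()` method mentioned in this commit message, and the real implementation may differ.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Node:
    """Hypothetical tree node: continuous split on one feature."""
    index: int
    feature: Optional[int] = None   # feature used by the split, if any
    threshold: float = 0.0
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    @property
    def is_leaf(self) -> bool:
        # A node without both children has not been split (yet).
        return self.left is None or self.right is None


def predict_node_index(root: Node, features: Sequence[float]) -> int:
    """Follow the splits from the root; cost is the depth of the tree."""
    node = root
    while not node.is_leaf:
        go_left = features[node.feature] <= node.threshold
        node = node.left if go_left else node.right
    return node.index


# Partly built tree: node 1 splits on feature 0, node 2 on feature 1.
tree = Node(1, feature=0, threshold=0.5,
            left=Node(2, feature=1, threshold=2.0,
                      left=Node(4), right=Node(5)),
            right=Node(3))

node_a = predict_node_index(tree, [0.2, 3.0])
node_b = predict_node_index(tree, [0.9, 0.0])
```

Each example costs one walk of at most tree-depth steps, instead of checking every node at the current level and tracing each one back to the root.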

Major changes:
* Eliminated Filter class, findBinsForLevel() method.
* Set up node parent links in main loop over levels in train().
* Added predictNodeIndex() for filtering top-down.
* Added DTMetadata class

Other changes:
* Pre-compute set of unorderedFeatures.

Notes for the expected follow-up PR based on [https://issues.apache.org/jira/browse/SPARK-3043]:
* The unorderedFeatures set will next be stored in a metadata structure to simplify function calls (to store other items such as the data in strategy).

I've done initial tests indicating that this speeds things up, but am only now running large-scale ones.

CC: mengxr manishamde chouqin  Any comments are welcome---thanks!

Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>

Closes #1975 from jkbradley/dt-opt2 and squashes the following commits:

a0ed0da [Joseph K. Bradley] Renamed DTMetadata to DecisionTreeMetadata.  Small doc updates.
3726d20 [Joseph K. Bradley] Small code improvements based on code review.
ac0b9f8 [Joseph K. Bradley] Small updates based on code review. Main change: Now using << instead of math.pow.
db0d773 [Joseph K. Bradley] scala style fix
6a38f48 [Joseph K. Bradley] Added DTMetadata class for cleaner code
931a3a7 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt2
797f68a [Joseph K. Bradley] Fixed DecisionTreeSuite bug for training second level.  Needed to update treePointToNodeIndex with groupShift.
f40381c [Joseph K. Bradley] Merge branch 'dt-opt1' into dt-opt2
5f2dec2 [Joseph K. Bradley] Fixed scalastyle issue in TreePoint
6b5651e [Joseph K. Bradley] Updates based on code review.  1 major change: persisting to memory + disk, not just memory.
2d2aaaf [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt1
26d10dd [Joseph K. Bradley] Removed tree/model/Filter.scala since no longer used.  Removed debugging println calls in DecisionTree.scala.
356daba [Joseph K. Bradley] Merge branch 'dt-opt1' into dt-opt2
430d782 [Joseph K. Bradley] Added more debug info on binning error.  Added some docs.
d036089 [Joseph K. Bradley] Print timing info to logDebug.
e66f1b1 [Joseph K. Bradley] TreePoint * Updated doc * Made some methods private
8464a6e [Joseph K. Bradley] Moved TimeTracker to tree/impl/ in its own file, and cleaned it up.  Removed debugging println calls from DecisionTree.  Made TreePoint extend Serializable
a87e08f [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt1
c1565a5 [Joseph K. Bradley] Small DecisionTree updates: * Simplification: Updated calculateGainForSplit to take aggregates for a single (feature, split) pair. * Internal doc: findAggForOrderedFeatureClassification
b914f3b [Joseph K. Bradley] DecisionTree optimization: eliminated filters + small changes
b2ed1f3 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt
0f676e2 [Joseph K. Bradley] Optimizations + Bug fix for DecisionTree
3211f02 [Joseph K. Bradley] Optimizing DecisionTree * Added TreePoint representation to avoid calling findBin multiple times. * (not working yet, but debugging)
f61e9d2 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-timing
bcf874a [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-timing
511ec85 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-timing
a95bc22 [Joseph K. Bradley] timing for DecisionTree internals
2014-08-16 23:53:14 -07:00
Xiangrui Meng fbad72288d [SPARK-3077][MLLIB] fix some chisq-test
- promote nullHypothesis field in ChiSqTestResult to TestResult. Every test should have a null hypothesis
- correct null hypothesis statement for independence test
- p-value: 0.01 -> 0.1

Author: Xiangrui Meng <meng@databricks.com>

Closes #1982 from mengxr/fix-chisq and squashes the following commits:

5f0de02 [Xiangrui Meng] make ChiSqTestResult constructor package private
bc74ea1 [Xiangrui Meng] update chisq-test
2014-08-16 21:16:27 -07:00
GuoQiang Li bc95fe08df Cancel the ackTimeoutMonitor in the stop method of ConnectionManager
cc JoshRosen sarutak

Author: GuoQiang Li <witgo@qq.com>

Closes #1989 from witgo/cancel_ackTimeoutMonitor and squashes the following commits:

4a700fa [GuoQiang Li] In the stop method of ConnectionManager to cancel the ackTimeoutMonitor
2014-08-16 20:05:55 -07:00
Davies Liu 2fc8aca086 [SPARK-1065] [PySpark] improve support for large broadcasts
Passing large objects through py4j is very slow (and costs a lot of memory), so pass broadcast objects via files (similar to parallelize()).

Add an option to keep the object in the driver; it is False by default, to save memory in the driver.
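
The file-based hand-off described above can be sketched in a few lines of Python. This is not PySpark's actual code: `FileBackedValue`, its `keep` flag, and the use of `pickle` with a temporary file are assumptions chosen to illustrate the idea of writing a large object to disk once, passing only the path, and reloading lazily when the value is needed again.

```python
import os
import pickle
import tempfile


class FileBackedValue:
    """Hypothetical wrapper: store a value on disk, reload it on demand."""

    def __init__(self, value, keep=False):
        fd, self.path = tempfile.mkstemp(suffix=".broadcast")
        with os.fdopen(fd, "wb") as f:
            pickle.dump(value, f, protocol=pickle.HIGHEST_PROTOCOL)
        # With keep=False the in-memory copy is dropped to save memory.
        self._cached = keep
        self._value = value if keep else None

    @property
    def value(self):
        if not self._cached:
            with open(self.path, "rb") as f:
                self._value = pickle.load(f)
            self._cached = True
        return self._value


data = list(range(1000))
bcast = FileBackedValue(data)        # only bcast.path needs to cross processes
held_in_memory = bcast._cached       # False: nothing is kept by default
restored = bcast.value               # loaded from disk on first access
os.remove(bcast.path)
```

Only the short file path has to cross the process boundary, which is the point of avoiding a large transfer through a bridge such as py4j.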

Author: Davies Liu <davies.liu@gmail.com>

Closes #1912 from davies/broadcast and squashes the following commits:

e06df4a [Davies Liu] load broadcast from disk in driver automatically
db3f232 [Davies Liu] fix serialization of accumulator
631a827 [Davies Liu] Merge branch 'master' into broadcast
c7baa8c [Davies Liu] compress serialized broadcast and command
9a7161f [Davies Liu] fix doc tests
e93cf4b [Davies Liu] address comments: add test
6226189 [Davies Liu] improve large broadcast
2014-08-16 16:59:34 -07:00