I find it useful to see where in an RDD's DAG data is cached, so I figured others might too.
I've added both the caching level, and the actual memory state of the RDD.
Some of this is redundant with the web UI (notably the actual memory state), but (a) that is temporary, and (b) putting it in the DAG tree shows some context that can help a lot.
For example:
```
(4) ShuffledRDD[3] at reduceByKey at <console>:14
+-(4) MappedRDD[2] at map at <console>:14
| MapPartitionsRDD[1] at mapPartitions at <console>:12
| ParallelCollectionRDD[0] at parallelize at <console>:12
```
should change to
```
(4) ShuffledRDD[3] at reduceByKey at <console>:14 [Memory Deserialized 1x Replicated]
| CachedPartitions: 4; MemorySize: 50.8 MB; TachyonSize: 0.0 B; DiskSize: 0.0 B
+-(4) MappedRDD[2] at map at <console>:14 [Memory Deserialized 1x Replicated]
| MapPartitionsRDD[1] at mapPartitions at <console>:12 [Memory Deserialized 1x Replicated]
| CachedPartitions: 4; MemorySize: 109.1 MB; TachyonSize: 0.0 B; DiskSize: 0.0 B
| ParallelCollectionRDD[0] at parallelize at <console>:12 [Memory Deserialized 1x Replicated]
```
Author: Nathan Kronenfeld <nkronenfeld@oculusinfo.com>
Closes#1535 from nkronenfeld/feature/debug-caching2 and squashes the following commits:
40490bc [Nathan Kronenfeld] Back out DeveloperAPI and arguments to RDD.toDebugString, reinstate memory output
794e6a3 [Nathan Kronenfeld] Attempt to merge mima changes from master
6fe9e80 [Nathan Kronenfeld] Add exclusions to allow for signature change in toDebugString (will back out if necessary)
31d6769 [Nathan Kronenfeld] Attempt to get rid of style errors. Add comments for the new memory usage parameter.
a0f6f76 [Nathan Kronenfeld] Add parameter to RDD.toDebugString to allow detailed memory info to be shown or not. Default is for it not to be shown.
f8f565a [Nathan Kronenfeld] Fix code style error
8f54287 [Nathan Kronenfeld] Changed string addition to string interpolation as per PR comments
2a0cd4d [Nathan Kronenfeld] Fixed a small formatting issue I forgot to copy over from the old branch
8fbecb6 [Nathan Kronenfeld] Add caching information to rdd.toDebugString
(This is the corrected follow-up to https://issues.apache.org/jira/browse/SPARK-2903)
Right now, `mvn compile test-compile` fails to compile Spark. (Don't worry; `mvn package` works, so this is not major.) The issue stems from test code in some modules depending on test code in other modules. That is perfectly fine and supported by Maven.
It takes extra work to get this to work with scalatest, and this has been attempted: https://github.com/apache/spark/blob/master/sql/catalyst/pom.xml#L86
This formulation is not quite enough, since the SQL Core module's tests fail to compile for lack of finding test classes in SQL Catalyst, and likewise for most Streaming integration modules depending on core Streaming test code. Example:
```
[error] /Users/srowen/Documents/spark/sql/core/src/test/scala/org/apache/spark/sql/QueryTest.scala:23: not found: type PlanTest
[error] class QueryTest extends PlanTest {
[error] ^
[error] /Users/srowen/Documents/spark/sql/core/src/test/scala/org/apache/spark/sql/CachedTableSuite.scala:28: package org.apache.spark.sql.test is not a value
[error] test("SPARK-1669: cacheTable should be idempotent") {
[error] ^
...
```
The issue I believe is that generation of a `test-jar` is bound here to the `compile` phase, but the test classes are not being compiled in this phase. It should bind to the `test-compile` phase.
It works when executing `mvn package` or `mvn install` since test-jar artifacts are actually generated available through normal Maven mechanisms as each module is built. They are then found normally, regardless of scalatest configuration.
It would be nice for a simple `mvn compile test-compile` to work since the test code is perfectly compilable given the Maven declarations.
On the plus side, this change is low-risk as it only affects tests.
yhuai made the original scalatest change and has glanced at this and thinks it makes sense.
Author: Sean Owen <srowen@gmail.com>
Closes#1879 from srowen/SPARK-2955 and squashes the following commits:
ad8242f [Sean Owen] Generate test-jar on test-compile for modules whose tests are needed by others' tests
JIRA: https://issues.apache.org/jira/browse/SPARK-2736
This patch includes:
1. An Avro converter that converts Avro data types to Python. It handles all 3 Avro data mappings (Generic, Specific and Reflect).
2. An example Python script for reading Avro files using AvroKeyInputFormat and the converter.
3. Fixing a classloading issue.
cc @MLnick @JoshRosen @mateiz
Author: Kan Zhang <kzhang@apache.org>
Closes#1916 from kanzhang/SPARK-2736 and squashes the following commits:
02443f8 [Kan Zhang] [SPARK-2736] Adding .avsc files to .rat-excludes
f74e9a9 [Kan Zhang] [SPARK-2736] nit: clazz -> className
82cc505 [Kan Zhang] [SPARK-2736] Update data sample
0be7761 [Kan Zhang] [SPARK-2736] Example pyspark script and data files
c8e5881 [Kan Zhang] [SPARK-2736] Trying to work with all 3 Avro data models
2271a5b [Kan Zhang] [SPARK-2736] Using the right class loader to find Avro classes
536876b [Kan Zhang] [SPARK-2736] Adding Avro to Java converter
This is a rewrite of the original Netty module that was added about 1.5 years ago. The old code was turned off by default and didn't really work because it lacked a frame decoder (only worked with very very small blocks).
For this pull request, I tried to make the changes non-instrusive to the rest of Spark. I only added an init and shutdown to BlockManager/DiskBlockManager, and a bunch of comments to help me understand the existing code base.
Compared with the old Netty module, this one features:
- It appears to work :)
- SPARK-2941: option to specicy nio vs oio vs epoll for channel/transport. By default nio is used. (Not using Epoll yet because I have found some bugs with its implementation)
- SPARK-2943: options to specify send buf and receive buf for users who want to do hyper tuning
- SPARK-2942: io errors are reported from server to client (the protocol uses negative length to indicate error)
- SPARK-2940: fetching multiple blocks in a single request to reduce syscalls
- SPARK-2959: clients share a single thread pool
- SPARK-2990: use PooledByteBufAllocator to reduce GC (basically a Netty managed pool of buffers with jmalloc)
- SPARK-2625: added fetchWaitTime metric and fixed thread-safety issue in metrics update.
- SPARK-2367: bump Netty version to 4.0.21.Final to address an Epoll bug (https://groups.google.com/forum/#!topic/netty/O7m-HxCJpCA)
Compared with the existing communication manager, this one features:
- IMO it is substantially easier to understand
- zero-copy send for the server for on-disk blocks
- one-copy receive (due to a frame decoder)
- don't quote me on this, but I think a lot less sys calls
- SPARK-2990: use PooledByteBufAllocator to reduce GC (basically a Netty managed pool of buffers with jmalloc)
- SPARK-2941: option to specicy nio vs oio vs epoll for channel/transport. By default nio is used. (Not using Epoll yet because I have found some bugs with its implementation)
- SPARK-2943: options to specify send buf and receive buf for users who want to do hyper tuning
TODOs before it can fully replace the existing ConnectionManager, if that ever happens (most of them should probably be done in separate PRs since this needs to be turned on explicitly)
- [x] Basic test cases
- [ ] More unit/integration tests for failures
- [ ] Performance analysis
- [ ] Support client connection reuse so we don't need to keep opening new connections (not sure how useful this would be)
- [ ] Support putting blocks in addition to fetching blocks (i.e. two way transfer)
- [x] Support serving non-disk blocks
- [ ] Support SASL authentication
For a more comprehensive list, see https://issues.apache.org/jira/browse/SPARK-2468
Thanks to @coderplay for peer coding with me on a Sunday.
Author: Reynold Xin <rxin@apache.org>
Closes#1907 from rxin/netty and squashes the following commits:
f921421 [Reynold Xin] Upgrade Netty to 4.0.22.Final to fix another Epoll bug.
4b174ca [Reynold Xin] Shivaram's code review comment.
4a3dfe7 [Reynold Xin] Switched to nio for default (instead of epoll on Linux).
56bfb9d [Reynold Xin] Bump Netty version to 4.0.21.Final for some bug fixes.
b443a4b [Reynold Xin] Added debug message to help debug Jenkins failures.
57fc4d7 [Reynold Xin] Added test cases for BlockHeaderEncoder and BlockFetchingClientHandlerSuite.
22623e9 [Reynold Xin] Added exception handling and test case for BlockServerHandler and BlockFetchingClientHandler.
6550dd7 [Reynold Xin] Fixed block mgr init bug.
60c2edf [Reynold Xin] Beefed up server/client integration tests.
38d88d5 [Reynold Xin] Added missing test files.
6ce3f3c [Reynold Xin] Added some basic test cases.
47f7ce0 [Reynold Xin] Created server and client packages and moved files there.
b16f412 [Reynold Xin] Added commit count.
f13022d [Reynold Xin] Remove unused clone() in BlockFetcherIterator.
c57d68c [Reynold Xin] Added back missing files.
842dfa7 [Reynold Xin] Made everything work with proper reference counting.
3fae001 [Reynold Xin] Connected the new netty network module with rest of Spark.
1a8f6d4 [Reynold Xin] Completed protocol documentation.
2951478 [Reynold Xin] New Netty implementation.
cc7843d [Reynold Xin] Basic skeleton.
Note this also passes the TaskContext itself to the TaskCompletionListener. In the future we can mark TaskContext with the exception object if exception occurs during task execution.
Author: Reynold Xin <rxin@apache.org>
Closes#1938 from rxin/TaskContext and squashes the following commits:
145de43 [Reynold Xin] Added JavaTaskCompletionListenerImpl for Java API friendly guarantee.
f435ea5 [Reynold Xin] Added license header for TaskCompletionListener.
dc4ed27 [Reynold Xin] [SPARK-3027] TaskContext: tighten the visibility and provide Java friendly callback API
Mac OS X's find is from the BSD variant that doesn't have -printf option.
Author: Reynold Xin <rxin@apache.org>
Closes#1953 from rxin/mima and squashes the following commits:
e284afe [Reynold Xin] Make dev/mima runnable on Mac OS X.
...ationInfo is initialized properly after deserialization
Author: Jacek Lewandowski <lewandowski.jacek@gmail.com>
Closes#1947 from jacek-lewandowski/master and squashes the following commits:
713b2f1 [Jacek Lewandowski] SPARK-3009: Reverted readObject method in ApplicationInfo so that ApplicationInfo is initialized properly after deserialization
Reverts #1924 due to build failures with hadoop 0.23.
Author: Michael Armbrust <michael@databricks.com>
Closes#1949 from marmbrus/revert1924 and squashes the following commits:
6bff940 [Michael Armbrust] Revert "[SPARK-3011][SQL] _temporary directory should be filtered out by sqlContext.parquetFile"
In theory, the scale of your inputs are irrelevant to logistic regression.
You can "theoretically" multiply X1 by 1E6 and the estimate for β1 will
adjust accordingly. It will be 1E-6 times smaller than the original β1, due
to the invariance property of MLEs.
However, during the optimization process, the convergence (rate)
depends on the condition number of the training dataset. Scaling
the variables often reduces this condition number, thus improving
the convergence rate.
Without reducing the condition number, some training datasets
mixing the columns with different scales may not be able to converge.
GLMNET and LIBSVM packages perform the scaling to reduce
the condition number, and return the weights in the original scale.
See page 9 in http://cran.r-project.org/web/packages/glmnet/glmnet.pdf
Here, if useFeatureScaling is enabled, we will standardize the training
features by dividing the variance of each column (without subtracting
the mean to densify the sparse vector), and train the model in the
scaled space. Then we transform the coefficients from the scaled space
to the original scale as GLMNET and LIBSVM do.
Currently, it's only enabled in LogisticRegressionWithLBFGS.
Author: DB Tsai <dbtsai@alpinenow.com>
Closes#1897 from dbtsai/dbtsai-feature-scaling and squashes the following commits:
f19fc02 [DB Tsai] Added more comments
1d85289 [DB Tsai] Improve the convergence rate by minimize the condition number in LOR with LBFGS
- Added override.
- Marked some variables as private.
Author: Reynold Xin <rxin@apache.org>
Closes#1943 from rxin/metricsSource and squashes the following commits:
fbfa943 [Reynold Xin] Minor cleanup of metrics.Source. - Added override. - Marked some variables as private.
https://issues.apache.org/jira/browse/SPARK-2925
Run cmd like this will get the error
bin/spark-sql --driver-java-options '-Xdebug -Xnoagent -Xrunjdwp:transport=dt_socket,address=8788,server=y,suspend=y'
Error: Unrecognized option '-Xnoagent'.
Run with --help for usage help or --verbose for debug output
Author: wangfei <wangfei_hello@126.com>
Author: wangfei <wangfei1@huawei.com>
Closes#1851 from scwf/patch-2 and squashes the following commits:
516554d [wangfei] quote variables to fix this issue
8bd40f2 [wangfei] quote variables to fix this problem
e6d79e3 [wangfei] fix start-thriftserver bug when set driver-java-options
948395d [wangfei] fix spark-sql error when set --driver-java-options
Only encode unicode objects to UTF-8, and not strings
Author: Ahir Reddy <ahirreddy@gmail.com>
Closes#1914 from ahirreddy/json-rdd-unicode-fix1 and squashes the following commits:
ca4e9ba [Ahir Reddy] Encoding Fix
This PR adds a new conf flag `spark.sql.parquet.binaryAsString`. When it is `true`, if there is no parquet metadata file available to provide the schema of the data, we will always treat binary fields stored in parquet as string fields. This conf is used to provide a way to read string fields generated without UTF8 decoration.
JIRA: https://issues.apache.org/jira/browse/SPARK-2927
Author: Yin Huai <huai@cse.ohio-state.edu>
Closes#1855 from yhuai/parquetBinaryAsString and squashes the following commits:
689ffa9 [Yin Huai] Add missing "=".
80827de [Yin Huai] Unit test.
1765ca4 [Yin Huai] Use .toBoolean.
9d3f199 [Yin Huai] Merge remote-tracking branch 'upstream/master' into parquetBinaryAsString
5d436a1 [Yin Huai] The initial support of adding a conf to treat binary columns stored in Parquet as string columns.
Author: Chia-Yung Su <chiayung@appier.com>
Closes#1924 from joesu/bugfix-spark3011 and squashes the following commits:
c7e44f2 [Chia-Yung Su] match syntax
f8fc32a [Chia-Yung Su] filter out tmp dir
The previous behaviour of swallowing ClassNotFound exceptions when running a custom Kryo registrator could lead to difficult to debug problems later on at serialisation / deserialisation time, see SPARK-2878. Instead it is better to fail fast.
Added test case.
Author: Graham Dennis <graham.dennis@gmail.com>
Closes#1827 from GrahamDennis/feature/spark-2893 and squashes the following commits:
fbe4cb6 [Graham Dennis] [SPARK-2878]: Update the test case to match the updated exception message
65e53c5 [Graham Dennis] [SPARK-2893]: Improve message when a spark.kryo.registrator fails.
f480d85 [Graham Dennis] [SPARK-2893] Fix typo.
b59d2c2 [Graham Dennis] SPARK-2893: Do not swallow Exceptions when running a custom spark.kryo.registrator
Currently, local execution of Spark jobs is only used by take(), and it can be problematic as it can load a significant amount of data onto the driver. The worst case scenarios occur if the RDD is cached (guaranteed to load whole partition), has very large elements, or the partition is just large and we apply a filter with high selectivity or computational overhead.
Additionally, jobs that run locally in this manner do not show up in the web UI, and are thus harder to track or understand what is occurring.
This PR adds a flag to disable local execution, which is turned OFF by default, with the intention of perhaps eventually removing this functionality altogether. Removing it now is a tougher proposition since it is part of the public runJob API. An alternative solution would be to limit the flag to take()/first() to avoid impacting any external users of this API, but such usage (or, at least, reliance upon the feature) is hopefully minimal.
Author: Aaron Davidson <aaron@databricks.com>
Closes#1321 from aarondav/allowlocal and squashes the following commits:
136b253 [Aaron Davidson] Fix DAGSchedulerSuite
5599d55 [Aaron Davidson] [RFC] Disable local execution of Spark jobs by default
As mentioned in SPARK-2465, using `MEMORY_AND_DISK_SER` for user/product in/out links together with `spark.rdd.compress=true` can help reduce the space requirement by a lot, at the cost of speed. It might be useful to add this option so people can run ALS on much bigger datasets.
Another option for the method name is `setIntermediateRDDStorageLevel`.
Author: Xiangrui Meng <meng@databricks.com>
Closes#1913 from mengxr/als-storagelevel and squashes the following commits:
d942017 [Xiangrui Meng] rename to setIntermediateRDDStorageLevel
7550029 [Xiangrui Meng] add ALS.setIntermediateDataStorageLevel
These configs looked inconsistent from the rest.
Author: Andrew Or <andrewor14@gmail.com>
Closes#1936 from andrewor14/docs-code and squashes the following commits:
15f578a [Andrew Or] Add <code> tag
Modified the order of the options and arguments in spark-shell.cmd
Author: Masayoshi TSUZUKI <tsudukim@oss.nttdata.co.jp>
Closes#1918 from tsudukim/feature/SPARK-3006 and squashes the following commits:
8bba494 [Masayoshi TSUZUKI] [SPARK-3006] Failed to execute spark-shell in Windows OS
1a32410 [Masayoshi TSUZUKI] [SPARK-3006] Failed to execute spark-shell in Windows OS
Author: Patrick Wendell <pwendell@gmail.com>
Closes#1933 from pwendell/speculation and squashes the following commits:
33a3473 [Patrick Wendell] Use OpenHashSet
8ce2ff0 [Patrick Wendell] SPARK-3020: Print completed indices rather than tasks in web UI
it seems that set command does not run by SparkSQLDriver. it runs on hive api.
user can not change reduce number by setting spark.sql.shuffle.partitions
but i think setting hive properties seems just a role to spark sql.
Author: guowei <guowei@upyoo.com>
Closes#1904 from guowei2/temp-branch and squashes the following commits:
7d47dde [guowei] fixed: setting properties like spark.sql.shuffle.partitions does not effective
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes#1891 from sarutak/SPARK-2970 and squashes the following commits:
4a2d2fe [Kousuke Saruta] Modified comment style
8bd833c [Kousuke Saruta] Modified style
6c0997c [Kousuke Saruta] Modified the timing of shutdown hook execution. It should be executed before shutdown hook of o.a.h.f.FileSystem
Author: Michael Armbrust <michael@databricks.com>
Closes#1863 from marmbrus/parquetPredicates and squashes the following commits:
10ad202 [Michael Armbrust] left <=> right
f249158 [Michael Armbrust] quiet parquet tests.
802da5b [Michael Armbrust] Add test case.
eab2eda [Michael Armbrust] Fix parquet predicate push down bug
This is a follow up of #1880.
Since the row number within a single batch is known, we can estimate a much more precise initial buffer size when building an in-memory column buffer.
Author: Cheng Lian <lian.cs.zju@gmail.com>
Closes#1901 from liancheng/precise-init-buffer-size and squashes the following commits:
d5501fa [Cheng Lian] More precise initial buffer size estimation for in-memory column buffer
Author: Michael Armbrust <michael@databricks.com>
Closes#1915 from marmbrus/arrayUDF and squashes the following commits:
a1c503d [Michael Armbrust] Support for udfs that take complex types
In spark sql component, the "show create table" syntax had been disabled.
We thought it is a useful funciton to describe a hive table.
Author: tianyi <tianyi@asiainfo-linkage.com>
Author: tianyi <tianyi@asiainfo.com>
Author: tianyi <tianyi.asiainfo@gmail.com>
Closes#1760 from tianyi/spark-2817 and squashes the following commits:
7d28b15 [tianyi] [SPARK-2817] fix too short prefix problem
cbffe8b [tianyi] [SPARK-2817] fix the case problem
565ec14 [tianyi] [SPARK-2817] fix the case problem
60d48a9 [tianyi] [SPARK-2817] use system temporary folder instead of temporary files in the source tree, and also clean some empty line
dbe1031 [tianyi] [SPARK-2817] move some code out of function rewritePaths, as it may be called multiple times
9b2ba11 [tianyi] [SPARK-2817] fix the line length problem
9f97586 [tianyi] [SPARK-2817] remove test.tmp.dir from pom.xml
bfc2999 [tianyi] [SPARK-2817] add "File.separator" support, create a "testTmpDir" outside the rewritePaths
bde800a [tianyi] [SPARK-2817] add "${system:test.tmp.dir}" support add "last_modified_by" to nonDeterministicLineIndicators in HiveComparisonTest
bb82726 [tianyi] [SPARK-2817] remove test which requires a system from the whitelist.
bbf6b42 [tianyi] [SPARK-2817] add a systemProperties named "test.tmp.dir" to pass the test which contains "${system:test.tmp.dir}"
a337bd6 [tianyi] [SPARK-2817] add "show create table" support
a03db77 [tianyi] [SPARK-2817] add "show create table" support
JIRA issue: [SPARK-3004](https://issues.apache.org/jira/browse/SPARK-3004)
HiveThriftServer2 throws exception when the result set contains `NULL`. Should check `isNullAt` in `SparkSQLOperationManager.getNextRowSet`.
Note that simply using `row.addColumnValue(null)` doesn't work, since Hive set the column type of a null `ColumnValue` to String by default.
Author: Cheng Lian <lian.cs.zju@gmail.com>
Closes#1920 from liancheng/spark-3004 and squashes the following commits:
1b1db1c [Cheng Lian] Adding NULL column values in the Hive way
2217722 [Cheng Lian] Fixed SPARK-3004: added null checking when retrieving row set
Iterator.fill uses less memory
Author: Xiangrui Meng <meng@databricks.com>
Closes#1930 from mengxr/rand-gen-iter and squashes the following commits:
24178ca [Xiangrui Meng] use Iterator.fill instead of Array.fill
1. skip partitionBy() when numOfPartition is 1
2. use bisect_left (O(lg(N))) instread of loop (O(N)) in
rangePartitioner
Author: Davies Liu <davies.liu@gmail.com>
Closes#1898 from davies/sort and squashes the following commits:
0a9608b [Davies Liu] Merge branch 'master' into sort
1cf9565 [Davies Liu] improve performance of sortByKey()
because Pyrolite does not support array from Python 2.6
Author: Davies Liu <davies.liu@gmail.com>
Closes#1928 from davies/fix_array and squashes the following commits:
858e6c5 [Davies Liu] convert array into list
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes#1885 from sarutak/SPARK-2963 and squashes the following commits:
ed53329 [Kousuke Saruta] Modified description and notaton of proper noun
07c59fc [Kousuke Saruta] Added a description about how to build to use HiveServer and CLI for SparkSQL to building-with-maven.md
6e6645a [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-2963
c88fa93 [Kousuke Saruta] Added a description about building to use HiveServer and CLI for SparkSQL
For both Scala and Python.
The ser/de util functions were moved out of `PythonMLLibAPI` and into their own object to avoid creating the `PythonMLLibAPI` object inside of `MultivariateStatisticalSummarySerialized`, which is then referenced inside of a method in `PythonMLLibAPI`.
`MultivariateStatisticalSummarySerialized` was created to serialize the `Vector` fields in `MultivariateStatisticalSummary`.
Author: Doris Xin <doris.s.xin@gmail.com>
Closes#1911 from dorx/colStats and squashes the following commits:
77b9924 [Doris Xin] developerAPI tag
de9cbbe [Doris Xin] reviewer comments and moved more ser/de
459faba [Doris Xin] colStats in Statistics for both Scala and Python
Author: Zhang, Liye <liye.zhang@intel.com>
Closes#1892 from liyezhang556520/lazy_memory_request and squashes the following commits:
335ab61 [Zhang, Liye] [SPARK-1777 (partial)] bugfix: make size of requested memory correctly
Since this is a file to file copy, using transferTo should be faster.
Author: Raymond Liu <raymond.liu@intel.com>
Closes#1884 from colorant/externalSorter and squashes the following commits:
6e42f3c [Raymond Liu] More code into copyStream
bfb496b [Raymond Liu] Use transferTo when copy merge files in ExternalSorter
Instead of requiring "org.apache.spark.io.LZ4CompressionCodec", it is easier for users if Spark just accepts "lz4", "lzf", "snappy".
Author: Reynold Xin <rxin@apache.org>
Closes#1873 from rxin/compressionCodecShortForm and squashes the following commits:
9f50962 [Reynold Xin] Specify short-form compression codec names first.
63f78ee [Reynold Xin] Updated configuration documentation.
47b3848 [Reynold Xin] [SPARK-2953] Allow using short names for io compression codecs
As per discussions with Xiangrui, I've reorganized and edited the mllib documentation.
Author: Ameet Talwalkar <atalwalkar@gmail.com>
Closes#1908 from atalwalkar/master and squashes the following commits:
fe6938a [Ameet Talwalkar] made xiangruis suggested changes
840028b [Ameet Talwalkar] made xiangruis suggested changes
7ec366a [Ameet Talwalkar] reorganize and edit mllib documentation
Python 2.6 does not handle float error well as 2.7+
Author: Davies Liu <davies.liu@gmail.com>
Closes#1910 from davies/fix_test and squashes the following commits:
7e51200 [Davies Liu] fix flaky tests
mengxr
Correctly set vectorSize and alpha in Word2Vec training.
Author: Liquan Pei <liquanpei@gmail.com>
Closes#1900 from Ishiihara/Word2Vec-bugfix and squashes the following commits:
85f64f2 [Liquan Pei] correctly set vectorSize and alpha
This is a follow up for #1147 , this PR will improve the performance about 10% - 15% in my local tests.
```
Before:
LeftOuterJoin: took 16750 ms ([3000000] records)
LeftOuterJoin: took 15179 ms ([3000000] records)
RightOuterJoin: took 15515 ms ([3000000] records)
RightOuterJoin: took 15276 ms ([3000000] records)
FullOuterJoin: took 19150 ms ([6000000] records)
FullOuterJoin: took 18935 ms ([6000000] records)
After:
LeftOuterJoin: took 15218 ms ([3000000] records)
LeftOuterJoin: took 13503 ms ([3000000] records)
RightOuterJoin: took 13663 ms ([3000000] records)
RightOuterJoin: took 14025 ms ([3000000] records)
FullOuterJoin: took 16624 ms ([6000000] records)
FullOuterJoin: took 16578 ms ([6000000] records)
```
Besides the performance improvement, I also do some clean up as suggested in #1147
Author: Cheng Hao <hao.cheng@intel.com>
Closes#1765 from chenghao-intel/hash_outer_join_fixing and squashes the following commits:
ab1f9e0 [Cheng Hao] Reduce the memory copy while building the hashmap
Author: Michael Armbrust <michael@databricks.com>
Closes#1880 from marmbrus/columnBatches and squashes the following commits:
0649987 [Michael Armbrust] add test
4756fad [Michael Armbrust] fix compilation
2314532 [Michael Armbrust] Build column buffers in smaller batches
Output nullabilities of `Explode` could be detemined by `ArrayType.containsNull` or `MapType.valueContainsNull`.
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes#1888 from ueshin/issues/SPARK-2968 and squashes the following commits:
d128c95 [Takuya UESHIN] Fix nullability of Explode.
Output attributes of opposite side of `OuterJoin` should be nullable.
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes#1887 from ueshin/issues/SPARK-2965 and squashes the following commits:
bcb2d37 [Takuya UESHIN] Fix HashOuterJoin output nullabilities.
I should use `EliminateAnalysisOperators` in `analyze` instead of manually pattern matching.
Author: Yin Huai <huaiyin.thu@gmail.com>
Closes#1881 from yhuai/useEliminateAnalysisOperators and squashes the following commits:
f3e1e7f [Yin Huai] Use EliminateAnalysisOperators.
Author: wangfei <wangfei1@huawei.com>
Closes#1852 from scwf/patch-3 and squashes the following commits:
ae28c29 [wangfei] use SparkSQLEnv.stop() in ShutdownHook
JIRA issue: [SPARK-2590](https://issues.apache.org/jira/browse/SPARK-2590)
Author: Cheng Lian <lian.cs.zju@gmail.com>
Closes#1853 from liancheng/inc-collect-option and squashes the following commits:
cb3ea45 [Cheng Lian] Moved incremental collection option to Thrift server
43ce3aa [Cheng Lian] Changed incremental collect option name
623abde [Cheng Lian] Added option to handle incremental collection, disabled by default
https://issues.apache.org/jira/browse/SPARK-2844
Author: Ahir Reddy <ahirreddy@gmail.com>
Closes#1768 from ahirreddy/python-hive-context-fix and squashes the following commits:
7972d3b [Ahir Reddy] Correctly set JVM HiveContext if it is passed into Python HiveContext constructor
for training with LBFGS Optimizer which will converge faster than SGD.
Author: DB Tsai <dbtsai@alpinenow.com>
Closes#1862 from dbtsai/dbtsai-lbfgs-lor and squashes the following commits:
aa84b81 [DB Tsai] small change
f852bcd [DB Tsai] Remove duplicate method
f119fdc [DB Tsai] Formatting
97776aa [DB Tsai] address more feedback
85b4a91 [DB Tsai] address feedback
3cf50c2 [DB Tsai] LogisticRegressionWithLBFGS interface