Commit graph

8764 commits

Author SHA1 Message Date
Xiangrui Meng 84e5da87e3 [SPARK-4084] Reuse sort key in Sorter
Sorter uses generic-typed key for sorting. When data is large, it creates lots of key objects, which is not efficient. We should reuse the key in Sorter for memory efficiency. This change is part of the petabyte sort implementation from rxin .

The `Sorter` class was written in Java and marked package private. So it is only available to `org.apache.spark.util.collection`. I renamed it to `TimSort` and add a simple wrapper of it, still called `Sorter`, in Scala, which is `private[spark]`.

The benchmark code is updated, which now resets the array before each run. Here is the result on sorting primitive Int arrays of size 25 million using Sorter:

~~~
[info] - Sorter benchmark for key-value pairs !!! IGNORED !!!
Java Arrays.sort() on non-primitive int array: Took 13237 ms
Java Arrays.sort() on non-primitive int array: Took 13320 ms
Java Arrays.sort() on non-primitive int array: Took 15718 ms
Java Arrays.sort() on non-primitive int array: Took 13283 ms
Java Arrays.sort() on non-primitive int array: Took 13267 ms
Java Arrays.sort() on non-primitive int array: Took 15122 ms
Java Arrays.sort() on non-primitive int array: Took 15495 ms
Java Arrays.sort() on non-primitive int array: Took 14877 ms
Java Arrays.sort() on non-primitive int array: Took 16429 ms
Java Arrays.sort() on non-primitive int array: Took 14250 ms
Java Arrays.sort() on non-primitive int array: (13878 ms first try, 14499 ms average)
Java Arrays.sort() on primitive int array: Took 2683 ms
Java Arrays.sort() on primitive int array: Took 2683 ms
Java Arrays.sort() on primitive int array: Took 2701 ms
Java Arrays.sort() on primitive int array: Took 2746 ms
Java Arrays.sort() on primitive int array: Took 2685 ms
Java Arrays.sort() on primitive int array: Took 2735 ms
Java Arrays.sort() on primitive int array: Took 2669 ms
Java Arrays.sort() on primitive int array: Took 2693 ms
Java Arrays.sort() on primitive int array: Took 2680 ms
Java Arrays.sort() on primitive int array: Took 2642 ms
Java Arrays.sort() on primitive int array: (2948 ms first try, 2691 ms average)
Sorter without key reuse on primitive int array: Took 10732 ms
Sorter without key reuse on primitive int array: Took 12482 ms
Sorter without key reuse on primitive int array: Took 10718 ms
Sorter without key reuse on primitive int array: Took 12650 ms
Sorter without key reuse on primitive int array: Took 10747 ms
Sorter without key reuse on primitive int array: Took 10783 ms
Sorter without key reuse on primitive int array: Took 12721 ms
Sorter without key reuse on primitive int array: Took 10604 ms
Sorter without key reuse on primitive int array: Took 10622 ms
Sorter without key reuse on primitive int array: Took 11843 ms
Sorter without key reuse on primitive int array: (11089 ms first try, 11390 ms average)
Sorter with key reuse on primitive int array: Took 5141 ms
Sorter with key reuse on primitive int array: Took 5298 ms
Sorter with key reuse on primitive int array: Took 5066 ms
Sorter with key reuse on primitive int array: Took 5164 ms
Sorter with key reuse on primitive int array: Took 5203 ms
Sorter with key reuse on primitive int array: Took 5274 ms
Sorter with key reuse on primitive int array: Took 5186 ms
Sorter with key reuse on primitive int array: Took 5159 ms
Sorter with key reuse on primitive int array: Took 5164 ms
Sorter with key reuse on primitive int array: Took 5078 ms
Sorter with key reuse on primitive int array: (5311 ms first try, 5173 ms average)
~~~

So with key reuse, it is faster and less likely to trigger GC.

Author: Xiangrui Meng <meng@databricks.com>
Author: Reynold Xin <rxin@apache.org>

Closes #2937 from mengxr/SPARK-4084 and squashes the following commits:

d73c3d0 [Xiangrui Meng] address comments
0b7b682 [Xiangrui Meng] fix mima
a72f53c [Xiangrui Meng] update timeIt
38ba50c [Xiangrui Meng] update timeIt
720f731 [Xiangrui Meng] add doc about JIT specialization
78f2879 [Xiangrui Meng] update tests
7de2efd [Xiangrui Meng] update the Sorter benchmark code to be correct
8626356 [Xiangrui Meng] add prepare to timeIt and update testsin SorterSuite
5f0d530 [Xiangrui Meng] update method modifiers of SortDataFormat
6ffbe66 [Xiangrui Meng] rename Sorter to TimSort and add a Scala wrapper that is private[spark]
b00db4d [Xiangrui Meng] doc and tests
cf94e8a [Xiangrui Meng] renaming
464ddce [Reynold Xin] cherry-pick rxin's commit
2014-10-28 15:14:41 -07:00
Cheng Hao 4b55482abf [SPARK-3343] [SQL] Add serde support for CTAS
Currently, `CTAS` (Create Table As Select) doesn't support specifying the `SerDe` in HQL. This PR will pass down the `ASTNode` into the physical operator `execution.CreateTableAsSelect`, which will extract the `CreateTableDesc` object via Hive `SemanticAnalyzer`. In the meantime, I also update the `HiveMetastoreCatalog.createTable` to optionally support the `CreateTableDesc` for table creation.

Author: Cheng Hao <hao.cheng@intel.com>

Closes #2570 from chenghao-intel/ctas_serde and squashes the following commits:

e011ef5 [Cheng Hao] shim for both 0.12 & 0.13.1
cfb3662 [Cheng Hao] revert to hive 0.12
c8a547d [Cheng Hao] Support SerDe properties within CTAS
2014-10-28 14:36:06 -07:00
zsxwing abcafcfba3 [Spark 3922] Refactor spark-core to use Utils.UTF_8
A global UTF8 constant is very helpful to handle encoding problems when converting between String and bytes. There are several solutions here:

1. Add `val UTF_8 = Charset.forName("UTF-8")` to Utils.scala
2. java.nio.charset.StandardCharsets.UTF_8 (require JDK7)
3. io.netty.util.CharsetUtil.UTF_8
4. com.google.common.base.Charsets.UTF_8
5. org.apache.commons.lang.CharEncoding.UTF_8
6. org.apache.commons.lang3.CharEncoding.UTF_8

IMO, I prefer option 1) because people can find it easily.

This is a PR for option 1) and only fixes Spark Core.

Author: zsxwing <zsxwing@gmail.com>

Closes #2781 from zsxwing/SPARK-3922 and squashes the following commits:

f974edd [zsxwing] Merge branch 'master' into SPARK-3922
2d27423 [zsxwing] Refactor spark-core to use Refactor spark-core to use Utils.UTF_8
2014-10-28 14:26:57 -07:00
Daoyuan Wang 47a40f60d6 [SPARK-3988][SQL] add public API for date type
Add json and python api for date type.
By using Pickle, `java.sql.Date` was serialized as calendar, and recognized in python as `datetime.datetime`.

Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #2901 from adrian-wang/spark3988 and squashes the following commits:

c51a24d [Daoyuan Wang] convert datetime to date
5670626 [Daoyuan Wang] minor line combine
f760d8e [Daoyuan Wang] fix indent
444f100 [Daoyuan Wang] fix a typo
1d74448 [Daoyuan Wang] fix scala style
8d7dd22 [Daoyuan Wang] add json and python api for date type
2014-10-28 13:43:25 -07:00
ravipesala 5807cb40ae [SPARK-3814][SQL] Support for Bitwise AND(&), OR(|) ,XOR(^), NOT(~) in Spark HQL and SQL
Currently there is no support of Bitwise & , | in Spark HiveQl and Spark SQL as well. So this PR support the same.
I am closing https://github.com/apache/spark/pull/2926 as it has conflicts to merge. And also added support for Bitwise AND(&), OR(|) ,XOR(^), NOT(~) And I handled all review comments in that PR

Author: ravipesala <ravindra.pesala@huawei.com>

Closes #2961 from ravipesala/SPARK-3814-NEW4 and squashes the following commits:

a391c7a [ravipesala] Rebase with master
2014-10-28 13:36:06 -07:00
Kousuke Saruta 6c1b981c3f [SPARK-4058] [PySpark] Log file name is hard coded even though there is a variable '$LOG_FILE '
In a script 'python/run-tests', log file name is represented by a variable 'LOG_FILE' and it is used in run-tests. But, there are some hard-coded log file name in the script.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #2905 from sarutak/SPARK-4058 and squashes the following commits:

7710490 [Kousuke Saruta] Fixed python/run-tests not to use hard-coded log file name
2014-10-28 12:58:25 -07:00
Michael Griffiths 2f254dacf4 [SPARK-4065] Add check for IPython on Windows
This issue employs logic similar to the bash launcher (pyspark) to check
if IPTYHON=1, and if so launch ipython with options in IPYTHON_OPTS.
This fix assumes that ipython is available in the system Path, and can
be invoked with a plain "ipython" command.

Author: Michael Griffiths <msjgriffiths@gmail.com>

Closes #2910 from msjgriffiths/pyspark-windows and squashes the following commits:

ef34678 [Michael Griffiths] Change build message to comply with [SPARK-3775]
361e3d8 [Michael Griffiths] [SPARK-4065] Add check for IPython on Windows
9ce72d1 [Michael Griffiths] [SPARK-4065] Add check for IPython on Windows
2014-10-28 12:47:21 -07:00
Kousuke Saruta 4d52cec21d [SPARK-4089][Doc][Minor] The version number of Spark in _config.yaml is wrong.
The version number of Spark in docs/_config.yaml for master branch should be 1.2.0 for now.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #2943 from sarutak/SPARK-4089 and squashes the following commits:

aba7fb4 [Kousuke Saruta] Fixed the version number of Spark in _config.yaml
2014-10-28 12:44:12 -07:00
Kousuke Saruta 247c529b35 [SPARK-3657] yarn alpha YarnRMClientImpl throws NPE appMasterRequest.setTrackingUrl starting spark-shell
tgravescs reported this issue.

Following is quoted from tgravescs' report.

YarnRMClientImpl.registerApplicationMaster can throw null pointer exception when setting the trackingurl if its empty:

    appMasterRequest.setTrackingUrl(new URI(uiAddress).getAuthority())

I hit this just start spark-shell without the tracking url set.

14/09/23 16:18:34 INFO yarn.YarnRMClientImpl: Connecting to ResourceManager at kryptonitered-jt1.red.ygrid.yahoo.com/98.139.154.99:8030
Exception in thread "main" java.lang.NullPointerException
        at org.apache.hadoop.yarn.proto.YarnServiceProtos$RegisterApplicationMasterRequestProto$Builder.setTrackingUrl(YarnServiceProtos.java:710)
        at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterRequestPBImpl.setTrackingUrl(RegisterApplicationMasterRequestPBImpl.java:132)
        at org.apache.spark.deploy.yarn.YarnRMClientImpl.registerApplicationMaster(YarnRMClientImpl.scala:102)
        at org.apache.spark.deploy.yarn.YarnRMClientImpl.register(YarnRMClientImpl.scala:55)
        at org.apache.spark.deploy.yarn.YarnRMClientImpl.register(YarnRMClientImpl.scala:38)
        at org.apache.spark.deploy.yarn.ApplicationMaster.registerAM(ApplicationMaster.scala:168)
        at org.apache.spark.deploy.yarn.ApplicationMaster.runExecutorLauncher(ApplicationMaster.scala:206)
        at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:120)

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #2981 from sarutak/SPARK-3657-2 and squashes the following commits:

e2fd6bc [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-3657
70b8882 [Kousuke Saruta] Fixed NPE thrown
2014-10-28 12:37:09 -07:00
WangTaoTheTonic 1ea3e3dc9d [SPARK-4096][YARN]let ApplicationMaster accept executor memory argument in same format as JVM memory strings
Here `ApplicationMaster` accept executor memory argument only in number format, we should let it accept JVM style memory strings as well.

Author: WangTaoTheTonic <barneystinson@aliyun.com>

Closes #2955 from WangTaoTheTonic/modifyDesc and squashes the following commits:

ab98c70 [WangTaoTheTonic] append parameter passed in
3779767 [WangTaoTheTonic] Update executor memory description in the help message
2014-10-28 12:31:42 -07:00
Kousuke Saruta 44d8b45a38 [SPARK-4110] Wrong comments about default settings in spark-daemon.sh
In spark-daemon.sh, thare are following comments.

    #   SPARK_CONF_DIR  Alternate conf dir. Default is ${SPARK_PREFIX}/conf.
    #   SPARK_LOG_DIR   Where log files are stored.  PWD by default.

But, I think the default value for SPARK_CONF_DIR is `${SPARK_HOME}/conf` and for SPARK_LOG_DIR is `${SPARK_HOME}/logs`.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #2972 from sarutak/SPARK-4110 and squashes the following commits:

5a171a2 [Kousuke Saruta] Fixed wrong comments
2014-10-28 12:29:01 -07:00
Shivaram Venkataraman 7768a800d4 [SPARK-4031] Make torrent broadcast read blocks on use.
This avoids reading torrent broadcast variables when they are referenced in the closure but not used in the closure. This is done by using a `lazy val` to read broadcast blocks

cc rxin JoshRosen for review

Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

Closes #2871 from shivaram/broadcast-read-value and squashes the following commits:

1456d65 [Shivaram Venkataraman] Use getUsedTimeMs and remove readObject
d6c5ee9 [Shivaram Venkataraman] Use laxy val to implement readBroadcastBlock
0b34df7 [Shivaram Venkataraman] Merge branch 'master' of https://github.com/apache/spark into broadcast-read-value
9cec507 [Shivaram Venkataraman] Test if broadcast variables are read lazily
768b40b [Shivaram Venkataraman] Merge branch 'master' of https://github.com/apache/spark into broadcast-read-value
8792ed8 [Shivaram Venkataraman] Make torrent broadcast read blocks on use. This avoids reading broadcast variables when they are referenced in the closure but not used by the code.
2014-10-28 10:14:16 -07:00
WangTaoTheTonic 0ac52e3055 [SPARK-4098][YARN]use appUIAddress instead of appUIHostPort in yarn-client mode
https://issues.apache.org/jira/browse/SPARK-4098

Author: WangTaoTheTonic <barneystinson@aliyun.com>

Closes #2958 from WangTaoTheTonic/useAddress and squashes the following commits:

29236e6 [WangTaoTheTonic] use appUIAddress instead of appUIHostPort in yarn-cluster mode
2014-10-28 09:51:44 -05:00
WangTaoTheTonic e8813be653 [SPARK-4095][YARN][Minor]extract val isLaunchingDriver in ClientBase
Instead of checking if `args.userClass` is null repeatedly, we extract it to an global val as in `ApplicationMaster`.

Author: WangTaoTheTonic <barneystinson@aliyun.com>

Closes #2954 from WangTaoTheTonic/MemUnit and squashes the following commits:

13bda20 [WangTaoTheTonic] extract val isLaunchingDriver in ClientBase
2014-10-28 08:53:10 -05:00
WangTaoTheTonic 47346cd029 [SPARK-4116][YARN]Delete the abandoned log4j-spark-container.properties
Since its name reduced at https://github.com/apache/spark/pull/560, the log4j-spark-container.properties was never used again.
And I have searched its name globally in code and found no cite.

Author: WangTaoTheTonic <barneystinson@aliyun.com>

Closes #2977 from WangTaoTheTonic/delLog4j and squashes the following commits:

fb2729f [WangTaoTheTonic] delete the log4j file obsoleted
2014-10-28 08:46:31 -05:00
Davies Liu fae095bc7c [SPARK-3961] [MLlib] [PySpark] Python API for mllib.feature
Added completed Python API for MLlib.feature

Normalizer
StandardScalerModel
StandardScaler
HashTF
IDFModel
IDF

cc mengxr

Author: Davies Liu <davies@databricks.com>
Author: Davies Liu <davies.liu@gmail.com>

Closes #2819 from davies/feature and squashes the following commits:

4f48f48 [Davies Liu] add a note for HashingTF
67f6d21 [Davies Liu] address comments
b628693 [Davies Liu] rollback changes in Word2Vec
efb4f4f [Davies Liu] Merge branch 'master' into feature
806c7c2 [Davies Liu] address comments
3abb8c2 [Davies Liu] address comments
59781b9 [Davies Liu] Merge branch 'master' of github.com:apache/spark into feature
a405ae7 [Davies Liu] fix tests
7a1891a [Davies Liu] fix tests
486795f [Davies Liu] update programming guide, HashTF -> HashingTF
8a50584 [Davies Liu] Python API for mllib.feature
2014-10-28 03:50:22 -07:00
Josh Rosen 46c63417c1 [SPARK-4107] Fix incorrect handling of read() and skip() return values
`read()` may return fewer bytes than requested; when this occurred, the old code would silently return less data than requested, which might cause stream corruption errors.  `skip()` faces similar issues, too.

This patch fixes several cases where we mis-handle these methods' return values.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #2969 from JoshRosen/file-channel-read-fix and squashes the following commits:

e724a9f [Josh Rosen] Fix similar issue of not checking skip() return value.
cbc03ce [Josh Rosen] Update the other log message, too.
01e6015 [Josh Rosen] file.getName -> file.getAbsolutePath
d961d95 [Josh Rosen] Fix another issue in FileServerSuite.
b9265d2 [Josh Rosen] Fix a similar (minor) issue in TestUtils.
cd9d76f [Josh Rosen] Fix a similar error in Tachyon:
3db0008 [Josh Rosen] Fix a similar read() error in Utils.offsetBytes().
db985ed [Josh Rosen] Fix unsafe usage of FileChannel.read():
2014-10-28 00:04:16 -07:00
Ryan Williams 4ceb048b38 fix broken links in README.md
seems like `building-spark.html` was renamed to `building-with-maven.html`?

Is Maven the blessed build tool these days, or SBT? I couldn't find a building-with-sbt page so I went with the Maven one here.

Author: Ryan Williams <ryan.blake.williams@gmail.com>

Closes #2859 from ryan-williams/broken-links-readme and squashes the following commits:

7692253 [Ryan Williams] fix broken links in README.md
2014-10-27 23:55:13 -07:00
GuoQiang Li 7c0c26cd12 [SPARK-4064]NioBlockTransferService.fetchBlocks may cause spark to hang.
cc @rxin

Author: GuoQiang Li <witgo@qq.com>

Closes #2929 from witgo/SPARK-4064 and squashes the following commits:

20110f2 [GuoQiang Li] Modify the exception msg
3425225 [GuoQiang Li] review commits
2b07e49 [GuoQiang Li] If we create a lot of big broadcast variables, Spark may hang
2014-10-27 23:31:46 -07:00
wangxiaojing 0c34fa5b4b [SPARK-3907][SQL] Add truncate table support
JIRA issue: [SPARK-3907]https://issues.apache.org/jira/browse/SPARK-3907

Add turncate table support
TRUNCATE TABLE table_name [PARTITION partition_spec];
partition_spec:
  : (partition_col = partition_col_value, partition_col = partiton_col_value, ...)
Removes all rows from a table or partition(s). Currently target table should be native/managed table or exception will be thrown. User can specify partial partition_spec for truncating multiple partitions at once and omitting partition_spec will truncate all partitions in the table.

Author: wangxiaojing <u9jing@gmail.com>

Closes #2770 from wangxiaojing/spark-3907 and squashes the following commits:

63dbd81 [wangxiaojing] change hive scalastyle
7a03707 [wangxiaojing] add comment
f6e710e [wangxiaojing] change truncate table
a1f692c [wangxiaojing] Correct spelling mistakes
3b20007 [wangxiaojing] add truncate can not support column err message
e483547 [wangxiaojing] add golden file
77b1f20 [wangxiaojing]  add truncate table support
2014-10-27 22:02:52 -07:00
Yin Huai 27470d3406 [SQL] Correct a variable name in JavaApplySchemaSuite.applySchemaToJSON
`schemaRDD2` is not tested because `schemaRDD1` is registered again.

Author: Yin Huai <huai@cse.ohio-state.edu>

Closes #2869 from yhuai/JavaApplySchemaSuite and squashes the following commits:

95fe894 [Yin Huai] Correct variable name.
2014-10-27 20:50:09 -07:00
wangfei 89af6dfc3a [SPARK-4041][SQL] Attributes names in table scan should converted to lowercase when compare with relation attributes
In ```MetastoreRelation``` the attributes name is lowercase because of hive using lowercase for fields name, so we should convert attributes name in table scan lowercase in ```indexWhere(_.name == a.name)```.
```neededColumnIDs``` may be not correct if not convert to lowercase.

Author: wangfei <wangfei1@huawei.com>
Author: scwf <wangfei1@huawei.com>

Closes #2884 from scwf/fixColumnIds and squashes the following commits:

6174046 [scwf] use AttributeMap for this issue
dc74a24 [wangfei] use lowerName and add a test case for this issue
3ff3a80 [wangfei] more safer change
294fcb7 [scwf] attributes names in table scan should convert lowercase in neededColumnsIDs
2014-10-27 20:46:26 -07:00
Alex Liu 698a7eab77 [SPARK-3816][SQL] Add table properties from storage handler to output jobConf
...ob conf in SparkHadoopWriter class

Author: Alex Liu <alex_liu68@yahoo.com>

Closes #2677 from alexliu68/SPARK-SQL-3816 and squashes the following commits:

79c269b [Alex Liu] [SPARK-3816][SQL] Add table properties from storage handler to job conf
2014-10-27 20:43:29 -07:00
Cheng Hao 418ad83fe1 [SPARK-3911] [SQL] HiveSimpleUdf can not be optimized in constant folding
```
explain extended select cos(null) from src limit 1;
```
outputs:
```
 Project [HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFCos(null) AS c_0#5]
  MetastoreRelation default, src, None

== Optimized Logical Plan ==
Limit 1
 Project [HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFCos(null) AS c_0#5]
  MetastoreRelation default, src, None

== Physical Plan ==
Limit 1
 Project [HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFCos(null) AS c_0#5]
  HiveTableScan [], (MetastoreRelation default, src, None), None
```
After patching this PR it outputs
```
== Parsed Logical Plan ==
Limit 1
 Project ['cos(null) AS c_0#0]
  UnresolvedRelation None, src, None

== Analyzed Logical Plan ==
Limit 1
 Project [HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFCos(null) AS c_0#0]
  MetastoreRelation default, src, None

== Optimized Logical Plan ==
Limit 1
 Project [null AS c_0#0]
  MetastoreRelation default, src, None

== Physical Plan ==
Limit 1
 Project [null AS c_0#0]
  HiveTableScan [], (MetastoreRelation default, src, None), None
```

Author: Cheng Hao <hao.cheng@intel.com>

Closes #2771 from chenghao-intel/hive_udf_constant_folding and squashes the following commits:

1379c73 [Cheng Hao] duplicate the PlanTest with catalyst/plans/PlanTest
1e52dda [Cheng Hao] add unit test for hive simple udf constant folding
01609ff [Cheng Hao] support constant folding for HiveSimpleUdf
2014-10-27 20:42:05 -07:00
coderxiang 7e3a1ada86 [MLlib] SPARK-3987: add test case on objective value for NNLS
Also update step parameter to pass the proposed test

Author: coderxiang <shuoxiangpub@gmail.com>

Closes #2965 from coderxiang/nnls-test and squashes the following commits:

24b06f9 [coderxiang] add test case on objective value for NNLS; update step parameter to pass the test
2014-10-27 19:43:39 -07:00
Sean Owen bfa614b127 SPARK-4022 [CORE] [MLLIB] Replace colt dependency (LGPL) with commons-math
This change replaces usages of colt with commons-math3 equivalents, and makes some minor necessary adjustments to related code and tests to match.

Author: Sean Owen <sowen@cloudera.com>

Closes #2928 from srowen/SPARK-4022 and squashes the following commits:

61a232f [Sean Owen] Fix failure due to different sampling in JavaAPISuite.sample()
16d66b8 [Sean Owen] Simplify seeding with call to reseedRandomGenerator
a1a78e0 [Sean Owen] Use Well19937c
31c7641 [Sean Owen] Fix Python Poisson test by choosing a different seed; about 88% of seeds should work but 1 didn't, it seems
5c9c67f [Sean Owen] Additional test fixes from review
d8f88e0 [Sean Owen] Replace colt with commons-math3. Some tests do not pass yet.
2014-10-27 10:53:15 -07:00
Cheng Lian 1d7bcc8840 [SQL] Fixes caching related JoinSuite failure
PR #2860 refines in-memory table statistics and enables broader broadcasted hash join optimization for in-memory tables. This makes `JoinSuite` fail when some test suite caches test table `testData` and gets executed before `JoinSuite`. Because expected `ShuffledHashJoin`s are optimized to `BroadcastedHashJoin` according to collected in-memory table statistics.

This PR fixes this issue by clearing the cache before testing join operator selection. A separate test case is also added to test broadcasted hash join operator selection.

Author: Cheng Lian <lian@databricks.com>

Closes #2960 from liancheng/fix-join-suite and squashes the following commits:

715b2de [Cheng Lian] Fixes caching related JoinSuite failure
2014-10-27 10:06:09 -07:00
Sandy Ryza dea302ddbd SPARK-2621. Update task InputMetrics incrementally
The patch takes advantage an API provided in Hadoop 2.5 that allows getting accurate data on Hadoop FileSystem bytes read.  It eliminates the old method, which naively accepts the split size as the input bytes.  An impact of this change will be that input metrics go away when using against Hadoop versions earlier thatn 2.5.  I can add this back in, but my opinion is that no metrics are better than inaccurate metrics.

This is difficult to write a test for because we don't usually build against a version of Hadoop that contains the function we need.  I've tested it manually on a pseudo-distributed cluster.

Author: Sandy Ryza <sandy@cloudera.com>

Closes #2087 from sryza/sandy-spark-2621 and squashes the following commits:

23010b8 [Sandy Ryza] Missing style fixes
74fc9bb [Sandy Ryza] Make getFSBytesReadOnThreadCallback private
1ab662d [Sandy Ryza] Clear things up a bit
984631f [Sandy Ryza] Switch from pull to push model and add test
7ef7b22 [Sandy Ryza] Add missing curly braces
219abc9 [Sandy Ryza] Fall back to split size
90dbc14 [Sandy Ryza] SPARK-2621. Update task InputMetrics incrementally
2014-10-27 10:04:24 -07:00
Prashant Sharma c9e05ca27c [SPARK-4032] Deprecate YARN alpha support in Spark 1.2
Author: Prashant Sharma <prashant.s@imaginea.com>

Closes #2878 from ScrapCodes/SPARK-4032/deprecate-yarn-alpha and squashes the following commits:

17e9857 [Prashant Sharma] added deperecated comment to Client and ExecutorRunnable.
3a34b1e [Prashant Sharma] Updated docs...
4608dea [Prashant Sharma] [SPARK-4032] Deprecate YARN alpha support in Spark 1.2
2014-10-27 10:02:48 -07:00
Shivaram Venkataraman 9aa340a23f [SPARK-4030] Make destroy public for broadcast variables
This change makes the destroy function public for broadcast variables. Motivation for the change is described in https://issues.apache.org/jira/browse/SPARK-4030.
This patch also logs where destroy was called from if a broadcast variable is used after destruction.

Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

Closes #2922 from shivaram/broadcast-destroy and squashes the following commits:

a11abab [Shivaram Venkataraman] Fix scala style in Utils.scala
bed9c9d [Shivaram Venkataraman] Make destroy blocking by default
e80c1ab [Shivaram Venkataraman] Make destroy public for broadcast variables Also log where destroy was called from if a broadcast variable is used after destruction.
2014-10-27 08:45:36 -07:00
Liang-Chi Hsieh 6377adaf32 [SPARK-3970] Remove duplicate removal of local dirs
The shutdown hook of `DiskBlockManager` would remove localDirs. So do not need to register them with `Utils.registerShutdownDeleteDir`. It causes duplicate removal of these local dirs and corresponding exceptions.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #2826 from viirya/fix_duplicate_localdir_remove and squashes the following commits:

051d4b5 [Liang-Chi Hsieh] check dir existing and return empty List as default.
2b91a9c [Liang-Chi Hsieh] remove duplicate removal of local dirs.
2014-10-26 18:02:06 -07:00
scwf f4e8c289d8 [SPARK-4042][SQL] Append columns ids and names before broadcast
Append columns ids and names before broadcast ```hiveExtraConf```  in ```HadoopTableReader```.

Author: scwf <wangfei1@huawei.com>

Closes #2885 from scwf/HadoopTableReader and squashes the following commits:

a8c498c [scwf] append columns ids and names before broadcast
2014-10-26 16:56:11 -07:00
Kousuke Saruta 3a9d66cf59 [SPARK-4061][SQL] We cannot use EOL character in the operand of LIKE predicate.
We cannot use EOL character like \n or \r in the operand of LIKE predicate.
So following condition is never true.

    -- someStr is 'hoge\nfuga'
    where someStr LIKE 'hoge_fuga'

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #2908 from sarutak/spark-sql-like-match-modification and squashes the following commits:

d15798b [Kousuke Saruta] Remove test setting for thriftserver
f99a2f4 [Kousuke Saruta] Fixed LIKE predicate so that we can use EOL character as in a operand
2014-10-26 16:54:07 -07:00
Kousuke Saruta ace41e8bf2 [SPARK-3959][SPARK-3960][SQL] SqlParser fails to parse literal -9223372036854775808 (Long.MinValue). / We can apply unary minus only to literal.
SqlParser fails to parse -9223372036854775808 (Long.MinValue) so we cannot write queries such like as follows.

    SELECT value FROM someTable WHERE value > -9223372036854775808

Additionally, because of the wrong syntax definition, we cannot apply unary minus only to literal. So, we cannot write such expressions.

    -(value1 + value2) // Parenthesized expressions
    -column // Columns
    -MAX(column) // Functions

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #2816 from sarutak/spark-sql-dsl-improvement2 and squashes the following commits:

32a5005 [Kousuke Saruta] Remove test setting for thriftserver
c2bab5e [Kousuke Saruta] Fixed SPARK-3959 and SPARK-3960
2014-10-26 16:40:29 -07:00
ravipesala 974d7b238b [SPARK-3483][SQL] Special chars in column names
Supporting special chars in column names by using back ticks. Closed https://github.com/apache/spark/pull/2804 and created this PR as it has merge conflicts

Author: ravipesala <ravindra.pesala@huawei.com>

Closes #2927 from ravipesala/SPARK-3483-NEW and squashes the following commits:

f6329f3 [ravipesala] Rebased with master
2014-10-26 16:36:11 -07:00
Yin Huai 0481aaa8d7 [SPARK-4068][SQL] NPE in jsonRDD schema inference
Please refer to added tests for cases that can trigger the bug.

JIRA: https://issues.apache.org/jira/browse/SPARK-4068

Author: Yin Huai <huai@cse.ohio-state.edu>

Closes #2918 from yhuai/SPARK-4068 and squashes the following commits:

d360eae [Yin Huai] Handle nulls when building key paths from elements of an array.
2014-10-26 16:32:02 -07:00
Yin Huai 05308426f0 [SPARK-4052][SQL] Use scala.collection.Map for pattern matching instead of using Predef.Map (it is scala.collection.immutable.Map)
Please check https://issues.apache.org/jira/browse/SPARK-4052 for cases triggering this bug.

Author: Yin Huai <huai@cse.ohio-state.edu>

Closes #2899 from yhuai/SPARK-4052 and squashes the following commits:

1188f70 [Yin Huai] Address liancheng's comments.
b6712be [Yin Huai] Use scala.collection.Map instead of Predef.Map (scala.collection.immutable.Map).
2014-10-26 16:30:15 -07:00
Kousuke Saruta d518bc24af [SPARK-3953][SQL][Minor] Confusable variable name.
In SqlParser.scala, there is following code.

    case d ~ p ~ r ~ f ~ g ~ h ~ o ~ l  =>
      val base = r.getOrElse(NoRelation)
      val withFilter = f.map(f => Filter(f, base)).getOrElse(base)

In the code above, there are 2 variables which have same name "f" in near place.
One is receiver "f" and other is bound variable "f".

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #2807 from sarutak/SPARK-3953 and squashes the following commits:

4957c32 [Kousuke Saruta] Improved variable name in SqlParser.scala
2014-10-26 16:28:33 -07:00
Kousuke Saruta dc51f4d6d8 [SQL][DOC] Wrong package name "scala.math.sql" in sql-programming-guide.md
In sql-programming-guide.md, there is a wrong package name "scala.math.sql".

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #2873 from sarutak/wrong-packagename-fix and squashes the following commits:

4d5ecf4 [Kousuke Saruta] Fixed wrong package name in sql-programming-guide.md
2014-10-26 16:27:29 -07:00
GuoQiang Li 89e8a5d8ba [SPARK-3997][Build]scalastyle should output the error location
Author: GuoQiang Li <witgo@qq.com>

Closes #2846 from witgo/SPARK-3997 and squashes the following commits:

d6a57f8 [GuoQiang Li] scalastyle should output the error location
2014-10-26 16:24:50 -07:00
Cheng Lian 2838bf8aad [SPARK-3537][SPARK-3914][SQL] Refines in-memory columnar table statistics
This PR refines in-memory columnar table statistics:

1. adds 2 more statistics for in-memory table columns: `count` and `sizeInBytes`
1. adds filter pushdown support for `IS NULL` and `IS NOT NULL`.
1. caches and propagates statistics in `InMemoryRelation` once the underlying cached RDD is materialized.

   Statistics are collected to driver side with an accumulator.

This PR also fixes SPARK-3914 by properly propagating in-memory statistics.

Author: Cheng Lian <lian@databricks.com>

Closes #2860 from liancheng/propagates-in-mem-stats and squashes the following commits:

0cc5271 [Cheng Lian] Restricts visibility of o.a.s.s.c.p.l.Statistics
c5ff904 [Cheng Lian] Fixes test table name conflict
a8c818d [Cheng Lian] Refines tests
1d01074 [Cheng Lian] Bug fix: shouldn't call STRING.actualSize on null string value
7dc6a34 [Cheng Lian] Adds more in-memory table statistics and propagates them properly
2014-10-26 16:10:09 -07:00
Michael Armbrust 879a165858 [HOTFIX][SQL] Temporarily turn off hive-server tests.
The thirift server is not available in the default (hive13) profile yet which is breaking all SQL only PRs.  This turns off these test until #2685 is merged.

Author: Michael Armbrust <michael@databricks.com>

Closes #2950 from marmbrus/fixTests and squashes the following commits:

1a6dfee [Michael Armbrust] [HOTFIX][SQL] Temporarily turn of hive-server tests.
2014-10-26 15:24:41 -07:00
Liang-Chi Hsieh 0af7e514c6 [SPARK-3925][SQL] Do not consider the ordering of qualifiers during comparison
The orderings should not be considered during the comparison between old qualifiers and new qualifiers.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #2783 from viirya/full_qualifier_comp and squashes the following commits:

89f652c [Liang-Chi Hsieh] modification for comment.
abb5762 [Liang-Chi Hsieh] More comprehensive comparison of qualifiers.
2014-10-26 14:29:13 -07:00
anant asthana 677852c3fa Just fixing comment that shows usage
Author: anant asthana <anant.asty@gmail.com>

Closes #2948 from anantasty/patch-1 and squashes the following commits:

d8fea0b [anant asthana] Just fixing comment that shows usage
2014-10-26 14:14:12 -07:00
Josh Rosen bf589fc717 [SPARK-3616] Add basic Selenium tests to WebUISuite
This patch adds Selenium tests for Spark's web UI.  To avoid adding extra
dependencies to the test environment, the tests use Selenium's HtmlUnitDriver,
which is pure-Java, instead of, say, ChromeDriver.

I added new tests to try to reproduce a few UI bugs reported on JIRA, namely
SPARK-3021, SPARK-2105, and SPARK-2527.  I wasn't able to reproduce these bugs;
I suspect that the older ones might have been fixed by other patches.

In order to use HtmlUnitDriver, I added an explicit dependency on the
org.apache.httpcomponents version of httpclient in order to prevent jets3t's
older version from taking precedence on the classpath.

I also upgraded ScalaTest to 2.2.1.

Author: Josh Rosen <joshrosen@apache.org>
Author: Josh Rosen <joshrosen@databricks.com>

Closes #2474 from JoshRosen/webui-selenium-tests and squashes the following commits:

fcc9e83 [Josh Rosen] scalautils -> scalactic package rename
510e54a [Josh Rosen] [SPARK-3616] Add basic Selenium tests to WebUISuite.
2014-10-26 11:29:27 -07:00
Daniel Lemire b75954015f Update RoaringBitmap to 0.4.3
Roaring has been updated to version 0.4.3. We fixed a rarely occurring bug with serialization. No API or format changes were made.

Author: Daniel Lemire <lemire@gmail.com>

Closes #2938 from lemire/master and squashes the following commits:

431f3a0 [Daniel Lemire] Recommended bug fix release
2014-10-26 10:03:20 -07:00
Sean Owen df7974b8e5 SPARK-3359 [DOCS] sbt/sbt unidoc doesn't work with Java 8
This follows https://github.com/apache/spark/pull/2893 , but does not completely fix SPARK-3359 either. This fixes minor scaladoc/javadoc issues that Javadoc 8 will treat as errors.

Author: Sean Owen <sowen@cloudera.com>

Closes #2909 from srowen/SPARK-3359 and squashes the following commits:

f62c347 [Sean Owen] Fix some javadoc issues that javadoc 8 considers errors. This is not all of the errors turned up when javadoc 8 runs on output of genjavadoc.
2014-10-25 23:18:02 -07:00
Andrew Or c683444008 [SPARK-4071] Unroll fails silently if BlockManager is small
In tests, we may want to have BlockManagers of size < 1MB (spark.storage.unrollMemoryThreshold). However, these BlockManagers are useless because we can't unroll anything in them ever. At the very least we need to log a warning.

tdas

Author: Andrew Or <andrew@databricks.com>

Closes #2917 from andrewor14/unroll-safely-logging and squashes the following commits:

38947e3 [Andrew Or] Warn against starting a block manager that's too small
fd621b4 [Andrew Or] Warn against failure to reserve initial memory threshold
2014-10-25 20:07:44 -07:00
Josh Rosen 2e52e4f815 Revert "[SPARK-4056] Upgrade snappy-java to 1.1.1.5"
This reverts commit 898b22ab1f.

Reverting because this may be causing OOMs.
2014-10-25 17:07:44 -07:00
Davies Liu e41786c774 [SPARK-4088] [PySpark] Python worker should exit after socket is closed by JVM
In case of take() or exception in Python, python worker may exit before JVM read() all the response, then the write thread may raise "Connection reset" exception.

Python should always wait JVM to close the socket first.

cc JoshRosen This is a warm fix, or the tests will be flaky, sorry for that.

Author: Davies Liu <davies@databricks.com>

Closes #2941 from davies/fix_exit and squashes the following commits:

9d4d21e [Davies Liu] fix race
2014-10-25 01:20:39 -07:00