Commit graph

9439 commits

Author SHA1 Message Date
Venkata Ramana Gollamudi f33d550464 [SPARK-3891][SQL] Add array support to percentile, percentile_approx and constant inspectors support
Supported passing array to percentile and percentile_approx UDAFs
To support percentile_approx,  constant inspectors are supported for GenericUDAF
Constant folding support added to CreateArray expression
Avoided constant udf expression re-evaluation

Author: Venkata Ramana G <ramana.gollamudihuawei.com>

Author: Venkata Ramana Gollamudi <ramana.gollamudi@huawei.com>

Closes #2802 from gvramana/percentile_array_support and squashes the following commits:

a0182e5 [Venkata Ramana Gollamudi] fixed review comment
a18f917 [Venkata Ramana Gollamudi] avoid constant udf expression re-evaluation - fixes failure due to return iterator and value type mismatch
c46db0f [Venkata Ramana Gollamudi] Removed TestHive reset
4d39105 [Venkata Ramana Gollamudi] Unified inspector creation, style check fixes
f37fd69 [Venkata Ramana Gollamudi] Fixed review comments
47f6365 [Venkata Ramana Gollamudi] fixed test
cb7c61e [Venkata Ramana Gollamudi] Supported ConstantInspector for UDAF Fixed HiveUdaf wrap object issue.
7f94aff [Venkata Ramana Gollamudi] Added foldable support to CreateArray
2014-12-17 15:41:35 -08:00
Cheng Hao 8d0d2a65eb [SPARK-4856] [SQL] NullType instead of StringType when sampling against empty string or nul...
```
TestSQLContext.sparkContext.parallelize(
  """{"ip":"27.31.100.29","headers":{"Host":"1.abc.com","Charset":"UTF-8"}}""" ::
  """{"ip":"27.31.100.29","headers":{}}""" ::
  """{"ip":"27.31.100.29","headers":""}""" :: Nil)
```
As empty string (the "headers") will be considered as String in the beginning (in line 2 and 3), it ignores the real nested data type (struct type "headers" in line 1), and also take the line 1 (the "headers") as String Type, which is not our expected.

Author: Cheng Hao <hao.cheng@intel.com>

Closes #3708 from chenghao-intel/json and squashes the following commits:

e7a72e9 [Cheng Hao] add more concise unit test
853de51 [Cheng Hao] NullType instead of StringType when sampling against empty string or null value
2014-12-17 15:01:59 -08:00
Michael Armbrust 19c0faad6d [HOTFIX][SQL] Fix parquet filter suite
Author: Michael Armbrust <michael@databricks.com>

Closes #3727 from marmbrus/parquetNotEq and squashes the following commits:

2157bfc [Michael Armbrust] Fix parquet filter suite
2014-12-17 14:27:02 -08:00
Joseph K. Bradley affc3f460f [SPARK-4821] [mllib] [python] [docs] Fix for pyspark.mllib.rand doc
+ small doc edit
+ include edit to make IntelliJ happy

CC: davies  mengxr

Note to davies  -- this does not fix the "WARNING: Literal block expected; none found." warnings since that seems to involve spacing which IntelliJ does not like.  (Those warnings occur when generating the Python docs.)

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #3669 from jkbradley/python-warnings and squashes the following commits:

4587868 [Joseph K. Bradley] fixed warning
8cb073c [Joseph K. Bradley] Updated based on davies recommendation
c51eca4 [Joseph K. Bradley] Updated rst file for pyspark.mllib.rand doc.  Small doc edit.  Small include edit to make IntelliJ happy.
2014-12-17 14:12:46 -08:00
Cheng Hao 636d9fc450 [SPARK-3739] [SQL] Update the split num base on block size for table scanning
In local mode, Hadoop/Hive will ignore the "mapred.map.tasks", hence for small table file, it's always a single input split, however, SparkSQL doesn't honor that in table scanning, and we will get different result when do the Hive Compatibility test. This PR will fix that.

Author: Cheng Hao <hao.cheng@intel.com>

Closes #2589 from chenghao-intel/source_split and squashes the following commits:

dff38e7 [Cheng Hao] Remove the extra blank line
160a2b6 [Cheng Hao] fix the compiling bug
04d67f7 [Cheng Hao] Keep 1 split for small file in table scanning
2014-12-17 13:39:36 -08:00
Daoyuan Wang 902e4d54ac [SPARK-4755] [SQL] sqrt(negative value) should return null
Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #3616 from adrian-wang/sqrt and squashes the following commits:

d877439 [Daoyuan Wang] fix NULLTYPE
3effa2c [Daoyuan Wang] sqrt(negative value) should return null
2014-12-17 12:51:27 -08:00
Cheng Lian 6277135376 [SPARK-4493][SQL] Don't pushdown Eq, NotEq, Lt, LtEq, Gt and GtEq predicates with nulls for Parquet
Predicates like `a = NULL` and `a < NULL` can't be pushed down since Parquet `Lt`, `LtEq`, `Gt`, `GtEq` doesn't accept null value. Note that `Eq` and `NotEq` can only be used with `null` to represent predicates like `a IS NULL` and `a IS NOT NULL`.

However, normally this issue doesn't cause NPE because any value compared to `NULL` results `NULL`, and Spark SQL automatically optimizes out `NULL` predicate in the `SimplifyFilters` rule. Only testing code that intentionally disables the optimizer may trigger this issue. (That's why this issue is not marked as blocker and I do **NOT** think we need to backport this to branch-1.1

This PR restricts `Lt`, `LtEq`, `Gt` and `GtEq` to non-null values only, and only uses `Eq` with null value to pushdown `IsNull` and `IsNotNull`. Also, added support for Parquet `NotEq` filter for completeness and (tiny) performance gain, it's also used to pushdown `IsNotNull`.

<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/3367)
<!-- Reviewable:end -->

Author: Cheng Lian <lian@databricks.com>

Closes #3367 from liancheng/filters-with-null and squashes the following commits:

cc41281 [Cheng Lian] Fixes several styling issues
de7de28 [Cheng Lian] Adds stricter rules for Parquet filters with null
2014-12-17 12:48:04 -08:00
Michael Armbrust 7ad579ee97 [SPARK-3698][SQL] Fix case insensitive resolution of GetField.
Based on #2543.

Author: Michael Armbrust <michael@databricks.com>

Closes #3724 from marmbrus/resolveGetField and squashes the following commits:

0a47aae [Michael Armbrust] Fix case insensitive resolution of GetField.
2014-12-17 12:43:51 -08:00
carlmartin 4782def094 [SPARK-4694]Fix HiveThriftServer2 cann't stop In Yarn HA mode.
HiveThriftServer2 can not exit automactic when changing the standy resource manager in Yarn HA mode.
The scheduler backend was aware of the AM had been exited so it call sc.stop to exit the driver process but there was a user thread(HiveThriftServer2 ) which was still alive and cause this problem.
To fix it, make a demo thread to detect the sparkContext is null or not.If the sc is stopped, call the ThriftServer.stop to stop the user thread.

Author: carlmartin <carlmartinmax@gmail.com>

Closes #3576 from SaintBacchus/ThriftServer2ExitBug and squashes the following commits:

2890b4a [carlmartin] Use SparkListener instead of the demo thread to stop the hive server.
c15da0e [carlmartin] HiveThriftServer2 can not exit automactic when changing the standy resource manager in Yarn HA mode
2014-12-17 12:24:03 -08:00
Cheng Hao 5fdcbdc0c9 [SPARK-4625] [SQL] Add sort by for DSL & SimpleSqlParser
Add `sort by` support for both DSL & SqlParser.

This PR is relevant with #3386, either one merged, will cause the other rebased.

Author: Cheng Hao <hao.cheng@intel.com>

Closes #3481 from chenghao-intel/sortby and squashes the following commits:

041004f [Cheng Hao] Add sort by for DSL & SimpleSqlParser
2014-12-17 12:01:57 -08:00
Saisai Shao cf50631a66 [SPARK-4595][Core] Fix MetricsServlet not work issue
`MetricsServlet` handler should be added to the web UI after initialized by `MetricsSystem`, otherwise servlet handler cannot be attached.

Author: Saisai Shao <saisai.shao@intel.com>
Author: Josh Rosen <joshrosen@databricks.com>
Author: jerryshao <saisai.shao@intel.com>

Closes #3444 from jerryshao/SPARK-4595 and squashes the following commits:

434d17e [Saisai Shao] Merge pull request #10 from JoshRosen/metrics-system-cleanup
87a2292 [Josh Rosen] Guard against misuse of MetricsSystem methods.
f779fe0 [jerryshao] Fix MetricsServlet not work issue
2014-12-17 11:47:44 -08:00
Josh Rosen 3d0c37b811 [HOTFIX] Fix RAT exclusion for known_translations file
Author: Josh Rosen <joshrosen@databricks.com>

Closes #3719 from JoshRosen/rat-fix and squashes the following commits:

1542886 [Josh Rosen] [HOTFIX] Fix RAT exclusion for known_translations file
2014-12-16 23:00:25 -08:00
Andrew Or 4e1112e7b0 [Release] Update contributors list format and sort it
Additionally, we now warn the user when a duplicate author name
arises, in which case he/she needs to resolve it manually.
2014-12-16 22:14:18 -08:00
scwf 60698801eb [SPARK-4618][SQL] Make foreign DDL commands options case-insensitive
Using lowercase for ```options``` key to make it case-insensitive, then we should use lower case to get value from parameters.
So flowing cmd work
```
      create temporary table normal_parquet
      USING org.apache.spark.sql.parquet
      OPTIONS (
        PATH '/xxx/data'
      )
```

Author: scwf <wangfei1@huawei.com>
Author: wangfei <wangfei1@huawei.com>

Closes #3470 from scwf/ddl-ulcase and squashes the following commits:

ae78509 [scwf] address comments
8f4f585 [wangfei] address comments
3c132ef [scwf] minor fix
a0fc20b [scwf] Merge branch 'master' of https://github.com/apache/spark into ddl-ulcase
4f86401 [scwf] adding CaseInsensitiveMap
e244e8d [wangfei] using lower case in json
e0cb017 [wangfei] make options in-casesensitive
2014-12-16 21:26:36 -08:00
Davies Liu ec5c4279ed [SPARK-4866] support StructType as key in MapType
This PR brings support of using StructType(and other hashable types) as key in MapType.

Author: Davies Liu <davies@databricks.com>

Closes #3714 from davies/fix_struct_in_map and squashes the following commits:

68585d7 [Davies Liu] fix primitive types in MapType
9601534 [Davies Liu] support StructType as key in MapType
2014-12-16 21:23:28 -08:00
Cheng Hao 770d8153a5 [SPARK-4375] [SQL] Add 0 argument support for udf
Author: Cheng Hao <hao.cheng@intel.com>

Closes #3595 from chenghao-intel/udf0 and squashes the following commits:

a858973 [Cheng Hao] Add 0 arguments support for udf
2014-12-16 21:21:11 -08:00
Takuya UESHIN ddc7ba31cb [SPARK-4720][SQL] Remainder should also return null if the divider is 0.
This is a follow-up of SPARK-4593 (#3443).

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #3581 from ueshin/issues/SPARK-4720 and squashes the following commits:

c3959d4 [Takuya UESHIN] Make Remainder return null if the divider is 0.
2014-12-16 21:19:57 -08:00
Cheng Hao 0aa834adea [SPARK-4744] [SQL] Short circuit evaluation for AND & OR in CodeGen
Author: Cheng Hao <hao.cheng@intel.com>

Closes #3606 from chenghao-intel/codegen_short_circuit and squashes the following commits:

f466303 [Cheng Hao] short circuit for AND & OR
2014-12-16 21:18:39 -08:00
Cheng Lian 3b395e1051 [SPARK-4798][SQL] A new set of Parquet testing API and test suites
This PR provides a set Parquet testing API (see trait `ParquetTest`) that enables developers to write more concise test cases. A new set of Parquet test suites built upon this API  are added and aim to replace the old `ParquetQuerySuite`. To avoid potential merge conflicts, old testing code are not removed yet. The following classes can be safely removed after most Parquet related PRs are handled:

- `ParquetQuerySuite`
- `ParquetTestData`

<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/3644)
<!-- Reviewable:end -->

Author: Cheng Lian <lian@databricks.com>

Closes #3644 from liancheng/parquet-tests and squashes the following commits:

800e745 [Cheng Lian] Enforces ordering of test output
3bb8731 [Cheng Lian] Refactors HiveParquetSuite
aa2cb2e [Cheng Lian] Decouples ParquetTest and TestSQLContext
7b43a68 [Cheng Lian] Updates ParquetTest Scaladoc
7f07af0 [Cheng Lian] Adds a new set of Parquet test suites
2014-12-16 21:16:03 -08:00
Andrew Or b85044ecfa [Release] Cache known author translations locally
This bypasses unnecessary calls to the Github and JIRA API.
Additionally, having a local cache allows us to remember names
that we had to manually discover ourselves.
2014-12-16 19:28:43 -08:00
Andrew Or 6f80b749e0 [Release] Major improvements to generate contributors script
This commit introduces several major improvements to the script
that generates the contributors list for release notes, notably:

(1) Use release tags instead of a range of commits. Across branches,
commits are not actually strictly two-dimensional, and so it is not
sufficient to specify a start hash and an end hash. Otherwise, we
end up counting commits that were already merged in an older branch.

(2) Match PR numbers in addition to commit hashes. This is related
to the first point in that if a PR is already merged in an older
minor release tag, it should be filtered out here. This requires us
to do some intelligent regex parsing on the commit description in
addition to just relying on the GitHub API.

(3) Relax author validity check. The old code fails on a name that
has many middle names, for instance. The test was just too strict.

(4) Use GitHub authentication. This allows us to make far more
requests through the GitHub API than before (5000 as opposed to 60
per hour).

(5) Translate from Github username, not commit author name. This is
important because the commit author name is not always configured
correctly by the user. For instance, the username "falaki" used to
resolve to just "Hossein", which was treated as a github username
and translated to something else that is completely arbitrary.

(6) Add an option to use the untranslated name. If there is not
a satisfactory candidate to replace the untranslated name with,
at least allow the user to not translate it.
2014-12-16 17:55:27 -08:00
Jacky Li fa66ef6c97 [SPARK-4269][SQL] make wait time configurable in BroadcastHashJoin
In BroadcastHashJoin, currently it is using a hard coded value (5 minutes) to wait for the execution and broadcast of the small table.
In my opinion, it should be a configurable value since broadcast may exceed 5 minutes in some case, like in a busy/congested network environment.

Author: Jacky Li <jacky.likun@huawei.com>

Closes #3133 from jackylk/timeout-config and squashes the following commits:

733ac08 [Jacky Li] add spark.sql.broadcastTimeout in SQLConf.scala
557acd4 [Jacky Li] switch to sqlContext.getConf
81a5e20 [Jacky Li] make wait time configurable in BroadcastHashJoin
2014-12-16 15:34:59 -08:00
Michael Armbrust a66c23e134 [SPARK-4827][SQL] Fix resolution of deeply nested Project(attr, Project(Star,...)).
Since `AttributeReference` resolution and `*` expansion are currently in separate rules, each pair requires a full iteration instead of being able to resolve in a single pass.  Since its pretty easy to construct queries that have many of these in a row, I combine them into a single rule in this PR.

Author: Michael Armbrust <michael@databricks.com>

Closes #3674 from marmbrus/projectStars and squashes the following commits:

d83d6a1 [Michael Armbrust] Fix resolution of deeply nested Project(attr, Project(Star,...)).
2014-12-16 15:31:19 -08:00
tianyi 30f6b85c81 [SPARK-4483][SQL]Optimization about reduce memory costs during the HashOuterJoin
In `HashOuterJoin.scala`, spark read data from both side of join operation before zip them together. It is a waste for memory. We are trying to read data from only one side, put them into a hashmap, and then generate the `JoinedRow` with data from other side one by one.
Currently, we could only do this optimization for `left outer join` and `right outer join`. For `full outer join`, we will do something in another issue.

for
table test_csv contains 1 million records
table dim_csv contains 10 thousand records

SQL:
`select * from test_csv a left outer join dim_csv b on a.key = b.key`

the result is:
master:
```
CSV: 12671 ms
CSV: 9021 ms
CSV: 9200 ms
Current Mem Usage:787788984
```
after patch:
```
CSV: 10382 ms
CSV: 7543 ms
CSV: 7469 ms
Current Mem Usage:208145728
```

Author: tianyi <tianyi@asiainfo-linkage.com>
Author: tianyi <tianyi.asiainfo@gmail.com>

Closes #3375 from tianyi/SPARK-4483 and squashes the following commits:

72a8aec [tianyi] avoid having mutable state stored inside of the task
99c5c97 [tianyi] performance optimization
d2f94d7 [tianyi] fix bug: missing output when the join-key is null.
2be45d1 [tianyi] fix spell bug
1f2c6f1 [tianyi] remove commented codes
a676de6 [tianyi] optimize some codes
9e7d5b5 [tianyi] remove commented old codes
838707d [tianyi] Optimization about reduce memory costs during the HashOuterJoin
2014-12-16 15:22:29 -08:00
wangxiaojing ea1315e3e2 [SPARK-4527][SQl]Add BroadcastNestedLoopJoin operator selection testsuite
In `JoinSuite` add BroadcastNestedLoopJoin operator selection testsuite

Author: wangxiaojing <u9jing@gmail.com>

Closes #3395 from wangxiaojing/SPARK-4527 and squashes the following commits:

ea0e495 [wangxiaojing] change style
53c3952 [wangxiaojing] Add BroadcastNestedLoopJoin operator selection testsuite
2014-12-16 14:45:56 -08:00
Holden Karau b0dfdbdd18 SPARK-4767: Add support for launching in a specified placement group to spark_ec2
Placement groups are cool and all the cool kids are using them. Lets add support for them to spark_ec2.py because I'm lazy

Author: Holden Karau <holden@pigscanfly.ca>

Closes #3623 from holdenk/SPARK-4767-add-support-for-launching-in-a-specified-placement-group-to-spark-ec2-scripts and squashes the following commits:

111a5fd [Holden Karau] merge in master
70ace25 [Holden Karau] Placement groups are cool and all the cool kids are using them. Lets add support for them to spark_ec2.py because I'm lazy
2014-12-16 14:37:04 -08:00
zsxwing 6530243a52 [SPARK-4812][SQL] Fix the initialization issue of 'codegenEnabled'
The problem is `codegenEnabled` is `val`, but it uses a `val` `sqlContext`, which can be override by subclasses. Here is a simple example to show this issue.

```Scala
scala> :paste
// Entering paste mode (ctrl-D to finish)

abstract class Foo {

  protected val sqlContext = "Foo"

  val codegenEnabled: Boolean = {
    println(sqlContext) // it will call subclass's `sqlContext` which has not yet been initialized.
    if (sqlContext != null) {
      true
    } else {
      false
    }
  }
}

class Bar extends Foo {
  override val sqlContext = "Bar"
}

println(new Bar().codegenEnabled)

// Exiting paste mode, now interpreting.

null
false
defined class Foo
defined class Bar
```

We should make `sqlContext` `final` to prevent subclasses from overriding it incorrectly.

Author: zsxwing <zsxwing@gmail.com>

Closes #3660 from zsxwing/SPARK-4812 and squashes the following commits:

1cbb623 [zsxwing] Make `sqlContext` final to prevent subclasses from overriding it incorrectly
2014-12-16 14:13:40 -08:00
jerryshao dc8280dcca [SPARK-4847][SQL]Fix "extraStrategies cannot take effect in SQLContext" issue
Author: jerryshao <saisai.shao@intel.com>

Closes #3698 from jerryshao/SPARK-4847 and squashes the following commits:

4741130 [jerryshao] Make later added extraStrategies effect when calling strategies
2014-12-16 14:08:28 -08:00
Peter Vandenabeele 1a9e35e57a [DOCS][SQL] Add a Note on jsonFile having separate JSON objects per line
* This commit hopes to avoid the confusion I faced when trying
  to submit a regular, valid multi-line JSON file, also see

  http://apache-spark-user-list.1001560.n3.nabble.com/Loading-JSON-Dataset-fails-with-com-fasterxml-jackson-databind-JsonMappingException-td20041.html

Author: Peter Vandenabeele <peter@vandenabeele.com>

Closes #3517 from petervandenabeele/pv-docs-note-on-jsonFile-format/01 and squashes the following commits:

1f98e52 [Peter Vandenabeele] Revert to people.json and simple Note text
6b6e062 [Peter Vandenabeele] Change the "JSON" connotation to "txt"
fca7dfb [Peter Vandenabeele] Add a Note on jsonFile having separate JSON objects per line
2014-12-16 13:58:01 -08:00
Judy Nash 17688d1429 [SQL] SPARK-4700: Add HTTP protocol spark thrift server
Add HTTP protocol support and test cases to spark thrift server, so users can deploy thrift server in both TCP and http mode.

Author: Judy Nash <judynash@microsoft.com>
Author: judynash <judynash@microsoft.com>

Closes #3672 from judynash/master and squashes the following commits:

526315d [Judy Nash] correct spacing on startThriftServer method
31a6520 [Judy Nash] fix code style issues and update sql programming guide format issue
47bf87e [Judy Nash] modify withJdbcStatement method definition to meet less than 100 line length
2e9c11c [Judy Nash] add thrift server in http mode documentation on sql programming guide
1cbd305 [Judy Nash] Merge remote-tracking branch 'upstream/master'
2b1d312 [Judy Nash] updated http thrift server support based on feedback
377532c [judynash] add HTTP protocol spark thrift server
2014-12-16 12:37:26 -08:00
Mike Jennings d12c0711fa [SPARK-3405] add subnet-id and vpc-id options to spark_ec2.py
Based on this gist:
https://gist.github.com/amar-analytx/0b62543621e1f246c0a2

We use security group ids instead of security group to get around this issue:
https://github.com/boto/boto/issues/350

Author: Mike Jennings <mvj101@gmail.com>
Author: Mike Jennings <mvj@google.com>

Closes #2872 from mvj101/SPARK-3405 and squashes the following commits:

be9cb43 [Mike Jennings] `pep8 spark_ec2.py` runs cleanly.
4dc6756 [Mike Jennings] Remove duplicate comment
731d94c [Mike Jennings] Update for code review.
ad90a36 [Mike Jennings] Merge branch 'master' of https://github.com/apache/spark into SPARK-3405
1ebffa1 [Mike Jennings] Merge branch 'master' into SPARK-3405
52aaeec [Mike Jennings] [SPARK-3405] add subnet-id and vpc-id options to spark_ec2.py
2014-12-16 12:13:21 -08:00
jbencook cb48447493 [SPARK-4855][mllib] testing the Chi-squared hypothesis test
This PR tests the pyspark Chi-squared hypothesis test from this commit: c8abddc516 and moves some of the error messaging in to python.

It is a port of the Scala tests here: [HypothesisTestSuite.scala](https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/mllib/stat/HypothesisTestSuite.scala)

Hopefully, SPARK-2980 can be closed.

Author: jbencook <jbenjamincook@gmail.com>

Closes #3679 from jbencook/master and squashes the following commits:

44078e0 [jbencook] checking that bad input throws the correct exceptions
f12ee10 [jbencook] removing checks for ValueError since input tests are on the Scala side
7536cf1 [jbencook] removing python checks for invalid input
a17ee84 [jbencook] [SPARK-2980][mllib] adding unit tests for the pyspark chi-squared test
3aeb0d9 [jbencook] [SPARK-2980][mllib] bringing Chi-squared error messages to the python side
2014-12-16 11:37:23 -08:00
Davies Liu ed362008f0 [SPARK-4437] update doc for WholeCombineFileRecordReader
update doc for WholeCombineFileRecordReader

Author: Davies Liu <davies@databricks.com>
Author: Josh Rosen <joshrosen@databricks.com>

Closes #3301 from davies/fix_doc and squashes the following commits:

1d7422f [Davies Liu] Merge pull request #2 from JoshRosen/whole-text-file-cleanup
dc3d21a [Josh Rosen] More genericization in ConfigurableCombineFileRecordReader.
95d13eb [Davies Liu] address comment
bf800b9 [Davies Liu] update doc for WholeCombineFileRecordReader
2014-12-16 11:19:36 -08:00
Davies Liu c246b95dd2 [SPARK-4841] fix zip with textFile()
UTF8Deserializer can not be used in BatchedSerializer, so always use PickleSerializer() when change batchSize in zip().

Also, if two RDD have the same batch size already, they did not need re-serialize any more.

Author: Davies Liu <davies@databricks.com>

Closes #3706 from davies/fix_4841 and squashes the following commits:

20ce3a3 [Davies Liu] fix bug in _reserialize()
e3ebf7c [Davies Liu] add comment
379d2c8 [Davies Liu] fix zip with textFile()
2014-12-15 22:58:26 -08:00
meiyoula c7628771da [SPARK-4792] Add error message when making local dir unsuccessfully
Author: meiyoula <1039320815@qq.com>

Closes #3635 from XuTingjun/master and squashes the following commits:

dd1c66d [meiyoula] when old is deleted, it will throw an exception where call it
2a55bc2 [meiyoula] Update DiskBlockManager.scala
1483a4a [meiyoula] Delete multiple retries to make dir
67f7902 [meiyoula] Try some times to make dir maybe more reasonable
1c51a0c [meiyoula] Update DiskBlockManager.scala
2014-12-15 22:30:18 -08:00
Sean Owen 81112e4b57 SPARK-4814 [CORE] Enable assertions in SBT, Maven tests / AssertionError from Hive's LazyBinaryInteger
This enables assertions for the Maven and SBT build, but overrides the Hive module to not enable assertions.

Author: Sean Owen <sowen@cloudera.com>

Closes #3692 from srowen/SPARK-4814 and squashes the following commits:

caca704 [Sean Owen] Disable assertions just for Hive
f71e783 [Sean Owen] Enable assertions for SBT and Maven build
2014-12-15 17:12:05 -08:00
wangfei 5c24759ddc [Minor][Core] fix comments in MapOutputTracker
Using driver and executor in the comments of ```MapOutputTracker``` is more clear.

Author: wangfei <wangfei1@huawei.com>

Closes #3700 from scwf/commentFix and squashes the following commits:

aa68524 [wangfei] master and worker should be driver and executor
2014-12-15 16:46:43 -08:00
Sean Owen 2a28bc6100 SPARK-785 [CORE] ClosureCleaner not invoked on most PairRDDFunctions
This looked like perhaps a simple and important one. `combineByKey` looks like it should clean its arguments' closures, and that in turn covers apparently all remaining functions in `PairRDDFunctions` which delegate to it.

Author: Sean Owen <sowen@cloudera.com>

Closes #3690 from srowen/SPARK-785 and squashes the following commits:

8df68fe [Sean Owen] Clean context of most remaining functions in PairRDDFunctions, which ultimately call combineByKey
2014-12-15 16:06:15 -08:00
Ryan Williams 8176b7a02e [SPARK-4668] Fix some documentation typos.
Author: Ryan Williams <ryan.blake.williams@gmail.com>

Closes #3523 from ryan-williams/tweaks and squashes the following commits:

d2eddaa [Ryan Williams] code review feedback
ce27fc1 [Ryan Williams] CoGroupedRDD comment nit
c6cfad9 [Ryan Williams] remove unnecessary if statement
b74ea35 [Ryan Williams] comment fix
b0221f0 [Ryan Williams] fix a gendered pronoun
c71ffed [Ryan Williams] use names on a few boolean parameters
89954aa [Ryan Williams] clarify some comments in {Security,Shuffle}Manager
e465dac [Ryan Williams] Saved building-spark.md with Dillinger.io
83e8358 [Ryan Williams] fix pom.xml typo
dc4662b [Ryan Williams] typo fixes in tuning.md, configuration.md
2014-12-15 14:52:17 -08:00
Ilya Ganelin 38703bbca8 [SPARK-1037] The name of findTaskFromList & findTask in TaskSetManager.scala is confusing
Hi all - I've renamed the methods referenced in this JIRA to clarify that they modify the provided arrays (find vs. deque).

Author: Ilya Ganelin <ilya.ganelin@capitalone.com>

Closes #3665 from ilganeli/SPARK-1037B and squashes the following commits:

64c177c [Ilya Ganelin] Renamed deque to dequeue
f27d85e [Ilya Ganelin] Renamed private methods to clarify that they modify the provided parameters
683482a [Ilya Ganelin] Renamed private methods to clarify that they modify the provided parameters
2014-12-15 14:51:15 -08:00
Josh Rosen f6b8591a08 [SPARK-4826] Fix generation of temp file names in WAL tests
This PR should fix SPARK-4826, an issue where a bug in how we generate temp. file names was causing spurious test failures in the write ahead log suites.

Closes #3695.
Closes #3701.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #3704 from JoshRosen/SPARK-4826 and squashes the following commits:

f2307f5 [Josh Rosen] Use Spark Utils class for directory creation/deletion
a693ddb [Josh Rosen] remove unused Random import
b275e41 [Josh Rosen] Move creation of temp. dir to beforeEach/afterEach.
9362919 [Josh Rosen] [SPARK-4826] Fix bug in generation of temp file names. in WAL suites.
86c1944 [Josh Rosen] Revert "HOTFIX: Disabling failing block manager test"
2014-12-15 14:33:43 -08:00
Yuu ISHIKAWA 8098fab06c [SPARK-4494][mllib] IDFModel.transform() add support for single vector
I improved `IDFModel.transform` to allow using a single vector.

[[SPARK-4494] IDFModel.transform() add support for single vector - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-4494)

Author: Yuu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #3603 from yu-iskw/idf and squashes the following commits:

256ff3d [Yuu ISHIKAWA] Fix typo
a3bf566 [Yuu ISHIKAWA] - Fix typo - Optimize import order - Aggregate the assertion tests - Modify `IDFModel.transform` API for pyspark
d25e49b [Yuu ISHIKAWA] Add the implementation of `IDFModel.transform` for a term frequency vector
2014-12-15 13:44:15 -08:00
Patrick Wendell 4c0673879b HOTFIX: Disabling failing block manager test 2014-12-15 10:54:45 -08:00
Peter Klipfel 2a2983f7c5 fixed spelling errors in documentation
changed "form" to "from" in 3 documentation entries for Kafka integration

Author: Peter Klipfel <peter@klipfel.me>

Closes #3691 from peterklipfel/master and squashes the following commits:

0fe7fc5 [Peter Klipfel] fixed spelling errors in documentation
2014-12-14 00:01:16 -08:00
Patrick Wendell ef84dab8c6 MAINTENANCE: Automated closing of pull requests.
This commit exists to close the following pull requests on Github:

Closes #3488 (close requested by 'pwendell')
Closes #2939 (close requested by 'marmbrus')
Closes #3173 (close requested by 'marmbrus')
2014-12-11 23:38:40 -08:00
Daoyuan Wang 41a3f93438 [SPARK-4829] [SQL] add rule to fold count(expr) if expr is not null
Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #3676 from adrian-wang/countexpr and squashes the following commits:

dc5765b [Daoyuan Wang] add rule to fold count(expr) if expr is not null
2014-12-11 22:56:42 -08:00
Sasaki Toru 8091dd62ea [SPARK-4742][SQL] The name of Parquet File generated by AppendingParquetOutputFormat should be zero padded
When I use Parquet File as a output file using ParquetOutputFormat#getDefaultWorkFile, the file name is not zero padded while RDD#saveAsText does zero padding.

Author: Sasaki Toru <sasakitoa@nttdata.co.jp>

Closes #3602 from sasakitoa/parquet-zeroPadding and squashes the following commits:

6b0e58f [Sasaki Toru] Merge branch 'master' of git://github.com/apache/spark into parquet-zeroPadding
20dc79d [Sasaki Toru] Fixed the name of Parquet File generated by AppendingParquetOutputFormat
2014-12-11 22:54:21 -08:00
Cheng Hao 0abbff2862 [SPARK-4825] [SQL] CTAS fails to resolve when created using saveAsTable
Fix bug when query like:
```
  test("save join to table") {
    val testData = sparkContext.parallelize(1 to 10).map(i => TestData(i, i.toString))
    sql("CREATE TABLE test1 (key INT, value STRING)")
    testData.insertInto("test1")
    sql("CREATE TABLE test2 (key INT, value STRING)")
    testData.insertInto("test2")
    testData.insertInto("test2")
    sql("SELECT COUNT(a.value) FROM test1 a JOIN test2 b ON a.key = b.key").saveAsTable("test")
    checkAnswer(
      table("test"),
      sql("SELECT COUNT(a.value) FROM test1 a JOIN test2 b ON a.key = b.key").collect().toSeq)
  }
```

Author: Cheng Hao <hao.cheng@intel.com>

Closes #3673 from chenghao-intel/spark_4825 and squashes the following commits:

e8cbd56 [Cheng Hao] alternate the pattern matching order for logical plan:CTAS
e004895 [Cheng Hao] fix bug
2014-12-11 22:51:49 -08:00
Daoyuan Wang cbb634ae69 [SQL] enable empty aggr test case
This is fixed by SPARK-4318 #3184

Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #3445 from adrian-wang/emptyaggr and squashes the following commits:

982575e [Daoyuan Wang] enable empty aggr test case
2014-12-11 22:50:18 -08:00
Daoyuan Wang acb3be6bc5 [SPARK-4828] [SQL] sum and avg on empty table should always return null
So the optimizations are not valid. Also I think the optimization here is rarely encounter, so removing them will not have influence on performance.

Can we merge #3445 before I add a comparison test case from this?

Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #3675 from adrian-wang/sumempty and squashes the following commits:

42df763 [Daoyuan Wang] sum and avg on empty table should always return null
2014-12-11 22:49:27 -08:00