Commit graph

14364 commits

Author SHA1 Message Date
Udo Klein bd723bd53d removed lambda from sortByKey()
According to the documentation the sortByKey method does not take a lambda as an argument, thus the example is flawed. Removed the argument completely as this will default to ascending sort.

Author: Udo Klein <git@blinkenlight.net>

Closes #10640 from udoklein/patch-1.
2016-01-11 09:30:08 +00:00
Wenchen Fan f253feff62 [SPARK-12539][FOLLOW-UP] always sort in partitioning writer
address comments in #10498 , especially https://github.com/apache/spark/pull/10498#discussion_r49021259

Author: Wenchen Fan <wenchen@databricks.com>

This patch had conflicts when merged, resolved by
Committer: Reynold Xin <rxin@databricks.com>

Closes #10638 from cloud-fan/bucket-write.
2016-01-11 00:44:33 -08:00
Josh Rosen f13c7f8f7d [SPARK-12734][HOTFIX][TEST-MAVEN] Fix bug in Netty exclusions
This is a hotfix for a build bug introduced by the Netty exclusion changes in #10672. We can't exclude `io.netty:netty` because Akka depends on it. There's not a direct conflict between `io.netty:netty` and `io.netty:netty-all`, because the former puts classes in the `org.jboss.netty` namespace while the latter uses the `io.netty` namespace. However, there still is a conflict between `org.jboss.netty:netty` and `io.netty:netty`, so we need to continue to exclude the JBoss version of that artifact.

While the diff here looks somewhat large, note that this is only a revert of a some of the changes from #10672. You can see the net changes in pom.xml at 3119206b71...5211ab8 (diff-600376dffeb79835ede4a0b285078036)

Author: Josh Rosen <joshrosen@databricks.com>

Closes #10693 from JoshRosen/netty-hotfix.
2016-01-11 00:31:29 -08:00
Kousuke Saruta 008a558285 [SPARK-4628][BUILD] Add a resolver to MiMaBuild.scala for mqttv3(1.0.1).
#10659 removed the repository `https://repo.eclipse.org/content/repositories/paho-releases` but it's needed by MiMa because `spark-streaming-mqtt(1.6.0)` depends on `mqttv3(1.0.1)` and it is provided by the removed repository and maven-central provide only `mqttv3(1.0.2)` for now.
Otherwise, if `mqttv3(1.0.1)` is absent from the local repository, dev/mima should fail.

JoshRosen Do you have any other better idea?

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #10688 from sarutak/SPARK-4628-followup.
2016-01-10 23:33:57 -08:00
Marcelo Vanzin 6439a82503 [SPARK-3873][BUILD] Enable import ordering error checking.
Turn import ordering violations into build errors, plus a few adjustments
to account for how the checker behaves. I'm a little on the fence about
whether the existing code is right, but it's easier to appease the checker
than to discuss what's the more correct order here.

Plus a few fixes to imports that cropped in since my recent cleanups.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #10612 from vanzin/SPARK-3873-enable.
2016-01-10 20:04:50 -08:00
Josh Rosen 3ab0138b0f [SPARK-12734][BUILD] Fix Netty exclusion and use Maven Enforcer to prevent future bugs
Netty classes are published under multiple artifacts with different names, so our build needs to exclude the `io.netty:netty` and `org.jboss.netty:netty` versions of the Netty artifact. However, our existing exclusions were incomplete, leading to situations where duplicate Netty classes would wind up on the classpath and cause compile errors (or worse).

This patch fixes the exclusion issue by adding more exclusions and uses Maven Enforcer's [banned dependencies](https://maven.apache.org/enforcer/enforcer-rules/bannedDependencies.html) rule to prevent these classes from accidentally being reintroduced. I also updated `dev/test-dependencies.sh` to run `mvn validate` so that the enforcer rules can run as part of pull request builds.

/cc rxin srowen pwendell. I'd like to backport at least the exclusion portion of this fix to `branch-1.5` in order to fix the documentation publishing job, which fails nondeterministically due to incompatible versions of Netty classes taking precedence on the compile-time classpath.

Author: Josh Rosen <rosenville@gmail.com>
Author: Josh Rosen <joshrosen@databricks.com>

Closes #10672 from JoshRosen/enforce-netty-exclusions.
2016-01-10 19:59:01 -08:00
Kousuke Saruta 3119206b71 [SPARK-12692][BUILD][GRAPHX] Scala style: Fix the style violation (Space before "," or ":")
Fix the style violation (space before `,` and `:`).
This PR is a followup for #10643.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #10683 from sarutak/SPARK-12692-followup-graphx.
2016-01-10 15:41:22 -08:00
Kousuke Saruta e5904bb5e7 [SPARK-12692][BUILD][MLLIB] Scala style: Fix the style violation (Space before "," or ":")
Fix the style violation (space before , and :).
This PR is a followup for #10643.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #10684 from sarutak/SPARK-12692-followup-mllib.
2016-01-10 12:38:57 -08:00
Jacek Laskowski b78e028e37 [SPARK-12736][CORE][DEPLOY] Standalone Master cannot be started due t…
…o NoClassDefFoundError: org/spark-project/guava/collect/Maps

/cc srowen rxin

Author: Jacek Laskowski <jacek@japila.pl>

Closes #10674 from jaceklaskowski/SPARK-12736.
2016-01-10 10:36:01 +00:00
Reynold Xin 5b0d544339 [SPARK-12735] Consolidate & move spark-ec2 to AMPLab managed repository.
Author: Reynold Xin <rxin@databricks.com>

Closes #10673 from rxin/SPARK-12735.
2016-01-09 20:28:20 -08:00
Reynold Xin 3efd106e5c Close #10665 2016-01-09 20:25:28 -08:00
Reynold Xin b23c4521f5 [SPARK-12340] Fix overflow in various take functions.
This is a follow-up for the original patch #10562.

Author: Reynold Xin <rxin@databricks.com>

Closes #10670 from rxin/SPARK-12340.
2016-01-09 11:21:58 -08:00
Yanbo Liang 3d77cffec0 [SPARK-12645][SPARKR] SparkR support hash function
Add ```hash``` function for SparkR ```DataFrame```.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #10597 from yanboliang/spark-12645.
2016-01-09 12:29:51 +05:30
Liang-Chi Hsieh 95cd5d95ce [SPARK-12577] [SQL] Better support of parentheses in partition by and order by clause of window function's over clause
JIRA: https://issues.apache.org/jira/browse/SPARK-12577

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #10620 from viirya/fix-parentheses.
2016-01-08 21:48:06 -08:00
Josh Rosen 090d691323 [SPARK-4628][BUILD] Remove all non-Maven-Central repositories from build
This patch removes all non-Maven-central repositories from Spark's build, thereby avoiding any risk of future build-breaks due to us accidentally depending on an artifact which is not present in an immutable public Maven repository.

I tested this by running

```
build/mvn \
        -Phive \
        -Phive-thriftserver \
        -Pkinesis-asl \
        -Pspark-ganglia-lgpl \
        -Pyarn \
        dependency:go-offline
```

inside of a fresh Ubuntu Docker container with no Ivy or Maven caches (I did a similar test for SBT).

Author: Josh Rosen <joshrosen@databricks.com>

Closes #10659 from JoshRosen/SPARK-4628.
2016-01-08 20:58:53 -08:00
Josh Rosen 1fdf9bbd67 [SPARK-12730][TESTS] De-duplicate some test code in BlockManagerSuite
This patch deduplicates some test code in BlockManagerSuite. I'm splitting this change off from a larger PR in order to make things easier to review.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #10667 from JoshRosen/block-mgr-tests-cleanup.
2016-01-08 20:50:08 -08:00
Cheng Lian d9447cac74 [SPARK-12593][SQL] Converts resolved logical plan back to SQL
This PR tries to enable Spark SQL to convert resolved logical plans back to SQL query strings.  For now, the major use case is to canonicalize Spark SQL native view support.  The major entry point is `SQLBuilder.toSQL`, which returns an `Option[String]` if the logical plan is recognized.

The current version is still in WIP status, and is quite limited.  Known limitations include:

1.  The logical plan must be analyzed but not optimized

    The optimizer erases `Subquery` operators, which contain necessary scope information for SQL generation.  Future versions should be able to recover erased scope information by inserting subqueries when necessary.

1.  The logical plan must be created using HiveQL query string

    Query plans generated by composing arbitrary DataFrame API combinations are not supported yet.  Operators within these query plans need to be rearranged into a canonical form that is more suitable for direct SQL generation.  For example, the following query plan

    ```
    Filter (a#1 < 10)
     +- MetastoreRelation default, src, None
    ```

    need to be canonicalized into the following form before SQL generation:

    ```
    Project [a#1, b#2, c#3]
     +- Filter (a#1 < 10)
         +- MetastoreRelation default, src, None
    ```

    Otherwise, the SQL generation process will have to handle a large number of special cases.

1.  Only a fraction of expressions and basic logical plan operators are supported in this PR

    Currently, 95.7% (1720 out of 1798) query plans in `HiveCompatibilitySuite` can be successfully converted to SQL query strings.

    Known unsupported components are:

    - Expressions
      - Part of math expressions
      - Part of string expressions (buggy?)
      - Null expressions
      - Calendar interval literal
      - Part of date time expressions
      - Complex type creators
      - Special `NOT` expressions, e.g. `NOT LIKE` and `NOT IN`
    - Logical plan operators/patterns
      - Cube, rollup, and grouping set
      - Script transformation
      - Generator
      - Distinct aggregation patterns that fit `DistinctAggregationRewriter` analysis rule
      - Window functions

    Support for window functions, generators, and cubes etc. will be added in follow-up PRs.

This PR leverages `HiveCompatibilitySuite` for testing SQL generation in a "round-trip" manner:

*   For all select queries, we try to convert it back to SQL
*   If the query plan is convertible, we parse the generated SQL into a new logical plan
*   Run the new logical plan instead of the original one

If the query plan is inconvertible, the test case simply falls back to the original logic.

TODO

- [x] Fix failed test cases
- [x] Support for more basic expressions and logical plan operators (e.g. distinct aggregation etc.)
- [x] Comments and documentation

Author: Cheng Lian <lian@databricks.com>

Closes #10541 from liancheng/sql-generation.
2016-01-08 14:08:13 -08:00
Sean Owen 659fd9d04b [SPARK-4819] Remove Guava's "Optional" from public API
Replace Guava `Optional` with (an API clone of) Java 8 `java.util.Optional` (edit: and a clone of Guava `Optional`)

See also https://github.com/apache/spark/pull/10512

Author: Sean Owen <sowen@cloudera.com>

Closes #10513 from srowen/SPARK-4819.
2016-01-08 13:02:30 -08:00
Thomas Graves 553fd7b912 [SPARK-12654] sc.wholeTextFiles with spark.hadoop.cloneConf=true fail…
…s on secure Hadoop

https://issues.apache.org/jira/browse/SPARK-12654

So the bug here is that WholeTextFileRDD.getPartitions has:
val conf = getConf
in getConf if the cloneConf=true it creates a new Hadoop Configuration. Then it uses that to create a new newJobContext.
The newJobContext will copy credentials around, but credentials are only present in a JobConf not in a Hadoop Configuration. So basically when it is cloning the hadoop configuration its changing it from a JobConf to Configuration and dropping the credentials that were there. NewHadoopRDD just uses the conf passed in for the getPartitions (not getConf) which is why it works.

Author: Thomas Graves <tgraves@staydecay.corp.gq1.yahoo.com>

Closes #10651 from tgravescs/SPARK-12654.
2016-01-08 14:38:19 -06:00
Udo Klein 8c70cb4c62 fixed numVertices in transitive closure example
Author: Udo Klein <git@blinkenlight.net>

Closes #10642 from udoklein/patch-2.
2016-01-08 20:32:37 +00:00
Jeff Zhang 00d9261724 [DOCUMENTATION] doc fix of job scheduling
spark.shuffle.service.enabled is spark application related configuration, it is not necessary to set it in yarn-site.xml

Author: Jeff Zhang <zjffdu@apache.org>

Closes #10657 from zjffdu/doc-fix.
2016-01-08 11:38:46 -08:00
Bryan Cutler ea104b8f1c [SPARK-12701][CORE] FileAppender should use join to ensure writing thread completion
Changed Logging FileAppender to use join in `awaitTermination` to ensure that thread is properly finished before returning.

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #10654 from BryanCutler/fileAppender-join-thread-SPARK-12701.
2016-01-08 11:08:45 -08:00
Liang-Chi Hsieh cfe1ba56e4 [SPARK-12687] [SQL] Support from clause surrounded by ().
JIRA: https://issues.apache.org/jira/browse/SPARK-12687

Some queries such as `(select 1 as a) union (select 2 as a)` can't work. This patch fixes it.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #10660 from viirya/fix-union.
2016-01-08 09:50:41 -08:00
Sean Owen b9c8353378 [SPARK-12618][CORE][STREAMING][SQL] Clean up build warnings: 2.0.0 edition
Fix most build warnings: mostly deprecated API usages. I'll annotate some of the changes below. CC rxin who is leading the charge to remove the deprecated APIs.

Author: Sean Owen <sowen@cloudera.com>

Closes #10570 from srowen/SPARK-12618.
2016-01-08 17:47:44 +00:00
Kousuke Saruta 794ea553bd [SPARK-12692][BUILD] Scala style: check no white space before comma and colon
We should not put a white space before `,` and `:` so let's check it.
Because there are lots of style violations, first, I'd like to add a checker, enable and let the level `warning`.
Then, I'd like to fix the style step by step.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #10643 from sarutak/SPARK-12692.
2016-01-08 00:53:15 -08:00
Reynold Xin 726bd3c4ec Fix indentation for the previous patch. 2016-01-07 21:15:43 -08:00
Kevin Yu 5028a001d5 [SPARK-12317][SQL] Support units (m,k,g) in SQLConf
This PR is continue from previous closed PR 10314.

In this PR, SHUFFLE_TARGET_POSTSHUFFLE_INPUT_SIZE will be taken memory string conventions as input.

For example, the user can now specify 10g for SHUFFLE_TARGET_POSTSHUFFLE_INPUT_SIZE in SQLConf file.

marmbrus srowen : Can you help review this code changes ? Thanks.

Author: Kevin Yu <qyu@us.ibm.com>

Closes #10629 from kevinyu98/spark-12317.
2016-01-07 21:13:17 -08:00
Shixiong Zhu 28e0e500a2 [SPARK-12591][STREAMING] Register OpenHashMapBasedStateMap for Kryo
The default serializer in Kryo is FieldSerializer and it ignores transient fields and never calls `writeObject` or `readObject`. So we should register OpenHashMapBasedStateMap using `DefaultSerializer` to make it work with Kryo.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #10609 from zsxwing/SPARK-12591.
2016-01-07 17:46:24 -08:00
Shixiong Zhu c94199e977 [SPARK-12507][STREAMING][DOCUMENT] Expose closeFileAfterWrite and allowBatching configurations for Streaming
/cc tdas brkyvz

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #10453 from zsxwing/streaming-conf.
2016-01-07 17:37:46 -08:00
Sean Owen 5a4021998a [SPARK-12604][CORE] Addendum - use casting vs mapValues for countBy{Key,Value}
Per rxin, let's use the casting for countByKey and countByValue as well. Let's see if this passes.

Author: Sean Owen <sowen@cloudera.com>

Closes #10641 from srowen/SPARK-12604.2.
2016-01-07 17:21:03 -08:00
Shixiong Zhu c0c397509b [SPARK-12510][STREAMING] Refactor ActorReceiver to support Java
This PR includes the following changes:

1. Rename `ActorReceiver` to `ActorReceiverSupervisor`
2. Remove `ActorHelper`
3. Add a new `ActorReceiver` for Scala and `JavaActorReceiver` for Java
4. Add `JavaActorWordCount` example

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #10457 from zsxwing/java-actor-stream.
2016-01-07 15:26:55 -08:00
Kazuaki Ishizaki 34dbc8af21 [SPARK-12580][SQL] Remove string concatenations from usage and extended in @ExpressionDescription
Use multi-line string literals for ExpressionDescription with ``// scalastyle:off line.size.limit`` and ``// scalastyle:on line.size.limit``

The policy is here, as describe at https://github.com/apache/spark/pull/10488

Let's use multi-line string literals. If we have to have a line with more than 100 characters, let's use ``// scalastyle:off line.size.limit`` and ``// scalastyle:on line.size.limit`` to just bypass the line number requirement.

Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>

Closes #10524 from kiszk/SPARK-12580.
2016-01-07 13:56:34 -08:00
Darek Blasiak 8346518357 [SPARK-12598][CORE] bug in setMinPartitions
There is a bug in the calculation of ```maxSplitSize```.  The ```totalLen``` should be divided by ```minPartitions``` and not by ```files.size```.

Author: Darek Blasiak <darek.blasiak@640labs.com>

Closes #10546 from datafarmer/setminpartitionsbug.
2016-01-07 21:15:40 +00:00
Jacek Laskowski 1b2c2162af [STREAMING][MINOR] More contextual information in logs + minor code i…
…mprovements

Please review and merge at your convenience. Thanks!

Author: Jacek Laskowski <jacek@japila.pl>

Closes #10595 from jaceklaskowski/streaming-minor-fixes.
2016-01-07 21:12:57 +00:00
Jacek Laskowski 07b314a57a [MINOR] Fix for BUILD FAILURE for Scala 2.11
It was introduced in 917d3fc069

/cc cloud-fan rxin

Author: Jacek Laskowski <jacek@japila.pl>

Closes #10636 from jaceklaskowski/fix-for-build-failure-2.11.
2016-01-07 10:39:46 -08:00
Sameer Agarwal f194d9911a [SPARK-12662][SQL] Fix DataFrame.randomSplit to avoid creating overlapping splits
https://issues.apache.org/jira/browse/SPARK-12662

cc yhuai

Author: Sameer Agarwal <sameer@databricks.com>

Closes #10626 from sameeragarwal/randomsplit.
2016-01-07 10:37:15 -08:00
zero323 592f64985d [SPARK-12006][ML][PYTHON] Fix GMM failure if initialModel is not None
If initial model passed to GMM is not empty it causes net.razorvine.pickle.PickleException. It can be fixed by converting initialModel.weights to list.

Author: zero323 <matthew.szymkiewicz@gmail.com>

Closes #10644 from zero323/SPARK-12006.
2016-01-07 10:32:56 -08:00
Jacek Laskowski 8113dbda0b [STREAMING][DOCS][EXAMPLES] Minor fixes
Author: Jacek Laskowski <jacek@japila.pl>

Closes #10603 from jaceklaskowski/streaming-actor-custom-receiver.
2016-01-07 00:27:13 -08:00
Davies Liu fd1dcfaf26 [SPARK-12542][SQL] support except/intersect in HiveQl
Parse the SQL query with except/intersect in FROM clause for HivQL.

Author: Davies Liu <davies@databricks.com>

Closes #10622 from davies/intersect.
2016-01-06 23:46:12 -08:00
Davies Liu 6a1c864ab6 [SPARK-12295] [SQL] external spilling for window functions
This PR manage the memory used by window functions (buffered rows), also enable external spilling.

After this PR, we can run window functions on a partition with hundreds of millions of rows with only 1G.

Author: Davies Liu <davies@databricks.com>

Closes #10605 from davies/unsafe_window.
2016-01-06 23:21:52 -08:00
zzcclp 84e77a15df [DOC] fix 'spark.memory.offHeap.enabled' default value to false
modify 'spark.memory.offHeap.enabled' default value to false

Author: zzcclp <xm_zzc@sina.com>

Closes #10633 from zzcclp/fix_spark.memory.offHeap.enabled_default_value.
2016-01-06 23:06:21 -08:00
Yin Huai e5cde7ab11 Revert "[SPARK-12006][ML][PYTHON] Fix GMM failure if initialModel is not None"
This reverts commit fcd013cf70.

Author: Yin Huai <yhuai@databricks.com>

Closes #10632 from yhuai/pythonStyle.
2016-01-06 22:03:31 -08:00
Guillaume Poulin b673852037 [SPARK-12678][CORE] MapPartitionsRDD clearDependencies
MapPartitionsRDD was keeping a reference to `prev` after a call to
`clearDependencies` which could lead to memory leak.

Author: Guillaume Poulin <poulin.guillaume@gmail.com>

Closes #10623 from gpoulin/map_partition_deps.
2016-01-06 21:34:46 -08:00
jerryshao 174e72ceca [SPARK-12673][UI] Add missing uri prepending for job description
Otherwise the url will be failed to proxy to the right one if in YARN mode. Here is the screenshot:

![screen shot 2016-01-06 at 5 28 26 pm](https://cloud.githubusercontent.com/assets/850797/12139632/bbe78ecc-b49c-11e5-8932-94e8b3622a09.png)

Author: jerryshao <sshao@hortonworks.com>

Closes #10618 from jerryshao/SPARK-12673.
2016-01-06 21:28:29 -08:00
Josh Rosen 8e19c7663a [SPARK-7689] Remove TTL-based metadata cleaning in Spark 2.0
This PR removes `spark.cleaner.ttl` and the associated TTL-based metadata cleaning code.

Now that we have the `ContextCleaner` and a timer to trigger periodic GCs, I don't think that `spark.cleaner.ttl` is necessary anymore. The TTL-based cleaning isn't enabled by default, isn't included in our end-to-end tests, and has been a source of user confusion when it is misconfigured. If the TTL is set too low, data which is still being used may be evicted / deleted, leading to hard to diagnose bugs.

For all of these reasons, I think that we should remove this functionality in Spark 2.0. Additional benefits of doing this include marginally reduced memory usage, since we no longer need to store timetsamps in hashmaps, and a handful fewer threads.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #10534 from JoshRosen/remove-ttl-based-cleaning.
2016-01-06 20:50:31 -08:00
Robert Dodier 6b6d02be0d [SPARK-12663][MLLIB] More informative error message in MLUtils.loadLibSVMFile
This PR contains 1 commit which resolves [SPARK-12663](https://issues.apache.org/jira/browse/SPARK-12663).

For the record, I got a positive response from 2 people when I floated this idea on devspark.apache.org on 2015-10-23. [Link to archived discussion.](http://apache-spark-developers-list.1001551.n3.nabble.com/slightly-more-informative-error-message-in-MLUtils-loadLibSVMFile-td14764.html)

Author: Robert Dodier <robert_dodier@users.sourceforge.net>

Closes #10611 from robert-dodier/loadlibsvmfile-error-msg-branch.
2016-01-06 19:49:10 -08:00
Nong Li a74d743cc7 [SPARK-12640][SQL] Add simple benchmarking utility class and add Parquet scan benchmarks.
[SPARK-12640][SQL] Add simple benchmarking utility class and add Parquet scan benchmarks.

We've run benchmarks ad hoc to measure the scanner performance. We will continue to invest in this
and it makes sense to get these benchmarks into code. This adds a simple benchmarking utility to do
this.

Author: Nong Li <nong@databricks.com>
Author: Nong <nongli@gmail.com>

Closes #10589 from nongli/spark-12640.
2016-01-06 19:20:43 -08:00
Sean Owen ac56cf605b [SPARK-12604][CORE] Java count(AprroxDistinct)ByKey methods return Scala Long not Java
Change Java countByKey, countApproxDistinctByKey return types to use Java Long, not Scala; update similar methods for consistency on java.long.Long.valueOf with no API change

Author: Sean Owen <sowen@cloudera.com>

Closes #10554 from srowen/SPARK-12604.
2016-01-06 17:17:32 -08:00
Wenchen Fan 917d3fc069 [SPARK-12539][SQL] support writing bucketed table
This PR adds bucket write support to Spark SQL. User can specify bucketing columns, numBuckets and sorting columns with or without partition columns. For example:
```
df.write.partitionBy("year").bucketBy(8, "country").sortBy("amount").saveAsTable("sales")
```

When bucketing is used, we will calculate bucket id for each record, and group the records by bucket id. For each group, we will create a file with bucket id in its name, and write data into it. For each bucket file, if sorting columns are specified, the data will be sorted before write.

Note that there may be multiply files for one bucket, as the data is distributed.

Currently we store the bucket metadata at hive metastore in a non-hive-compatible way. We use different bucketing hash function compared to hive, so we can't be compatible anyway.

Limitations:

* Can't write bucketed data without hive metastore.
* Can't insert bucketed data into existing hive tables.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #10498 from cloud-fan/bucket-write.
2016-01-06 16:58:10 -08:00
Davies Liu 6f7ba6409a [SPARK-12681] [SQL] split IdentifiersParser.g into two files
To avoid to have a huge Java source (over 64K loc), that can't be compiled.

cc hvanhovell

Author: Davies Liu <davies@databricks.com>

Closes #10624 from davies/split_ident.
2016-01-06 15:54:00 -08:00