Commit graph

7926 commits

Author SHA1 Message Date
Davies Liu f1e71d4c3b [SPARK-3073] [PySpark] use external sort in sortBy() and sortByKey()
Use an external sort to support sorting large datasets in the reduce stage.
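
A minimal, hedged sketch of what this enables (the config knob and sizes below are assumptions for illustration, not taken from this patch): with an external sort, `sortByKey()` spills to disk in the reduce stage instead of buffering everything in worker memory.

```python
from pyspark import SparkConf, SparkContext

# Assumed knob: spark.python.worker.memory caps the Python worker's in-memory
# buffer; data beyond it is sorted in chunks on disk and merged.
conf = SparkConf().set("spark.python.worker.memory", "128m")
sc = SparkContext(appName="external-sort-sketch", conf=conf)

# A dataset comfortably larger than the worker memory cap.
pairs = sc.parallelize(range(10 ** 6)).map(lambda x: (x % 1000, x))
print(pairs.sortByKey().first())  # sorts externally when the buffer overflows
```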

Author: Davies Liu <davies.liu@gmail.com>

Closes #1978 from davies/sort and squashes the following commits:

bbcd9ba [Davies Liu] check spilled bytes in tests
b125d2f [Davies Liu] add test for external sort in rdd
eae0176 [Davies Liu] choose different disks from different processes and instances
1f075ed [Davies Liu] Merge branch 'master' into sort
eb53ca6 [Davies Liu] Merge branch 'master' into sort
644abaf [Davies Liu] add license in LICENSE
19f7873 [Davies Liu] improve tests
55602ee [Davies Liu] use external sort in sortBy() and sortByKey()
2014-08-26 16:57:40 -07:00
Michael Armbrust c4787a3690 [SPARK-3194][SQL] Add AttributeSet to fix bugs with invalid comparisons of AttributeReferences
It is common to want to describe sets of attributes that are in various parts of a query plan.  However, the semantics of putting `AttributeReference` objects into a standard Scala `Set` result in subtle bugs when references differ cosmetically.  For example, with case insensitive resolution it is possible to have two references to the same attribute whose names are not equal.

In this PR I introduce a new abstraction, an `AttributeSet`, which performs all comparisons using the globally unique `ExpressionId` instead of case class equality.  (There is already a related class, [`AttributeMap`](https://github.com/marmbrus/spark/blob/inMemStats/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/AttributeMap.scala#L32))  This new type of set is used to fix a bug in the optimizer where needed attributes were getting projected away underneath join operators.

I also took this opportunity to refactor the expression and query plan base classes.  In all but one instance the logic for computing the `references` of an `Expression` was the same.  Thus, I moved this logic into the base class.

For query plans, the semantics of the `references` method were ill-defined (is it the attributes output? those used by expression evaluation? something else?).  As a result, this method wasn't really used very much, so I removed it.
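
To make the distinction concrete, here is an illustrative sketch in Python (not the Scala implementation; the names are made up): a set keyed on a globally unique id finds attributes that a structurally-compared set misses.

```python
# Illustration only: catalyst's AttributeSet is Scala; this mirrors the idea.
class Attribute:
    def __init__(self, name, expr_id):
        self.name = name        # may differ cosmetically ("a" vs "A")
        self.expr_id = expr_id  # globally unique id, the stable identity

class AttributeSet:
    """Membership is decided by expr_id, never by name."""
    def __init__(self, attrs=()):
        self._by_id = {a.expr_id: a for a in attrs}

    def __contains__(self, attr):
        return attr.expr_id in self._by_id

lower = Attribute("a", expr_id=1)
upper = Attribute("A", expr_id=1)        # same attribute, different case
assert upper in AttributeSet([lower])    # id-based comparison: found
assert upper not in {lower}              # plain set comparison: missed
```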

TODO:
 - [x] Finish scala doc for `AttributeSet`
 - [x] Scan the code for other instances of `Set[Attribute]` and refactor them.
 - [x] Finish removing `references` from `QueryPlan`

Author: Michael Armbrust <michael@databricks.com>

Closes #2109 from marmbrus/attributeSets and squashes the following commits:

1c0dae5 [Michael Armbrust] work on serialization bug.
9ba868d [Michael Armbrust] Merge remote-tracking branch 'origin/master' into attributeSets
3ae5288 [Michael Armbrust] review comments
40ce7f6 [Michael Armbrust] style
d577cc7 [Michael Armbrust] Scaladoc
cae5d22 [Michael Armbrust] remove more references implementations
d6e16be [Michael Armbrust] Remove more instances of "def references" and normal sets of attributes.
fc26b49 [Michael Armbrust] Add AttributeSet class, remove references from Expression.
2014-08-26 16:29:14 -07:00
Burak 1208f72ac7 [SPARK-2839][MLlib] Stats Toolkit documentation updated
Documentation updated for the Statistics Toolkit of MLlib. mengxr atalwalkar

https://issues.apache.org/jira/browse/SPARK-2839

P.S. Accidentally closed #2123. New commits didn't show up after I reopened the PR. I've opened this instead and closed the old one.

Author: Burak <brkyvz@gmail.com>

Closes #2130 from brkyvz/StatsLib-Docs and squashes the following commits:

a54a855 [Burak] [SPARK-2839][MLlib] Addressed comments
bfc6896 [Burak] [SPARK-2839][MLlib] Added a more specific link to colStats() for pyspark
213fe3f [Burak] [SPARK-2839][MLlib] Modifications made according to review
fec4d9d [Burak] [SPARK-2830][MLlib] Stats Toolkit documentation updated
2014-08-26 15:18:42 -07:00
Xiangrui Meng adbd5c1636 [SPARK-3226][MLLIB] doc update for native libraries
to mention the `-Pnetlib-lgpl` option. atalwalkar

Author: Xiangrui Meng <meng@databricks.com>

Closes #2128 from mengxr/mllib-native and squashes the following commits:

4cbba57 [Xiangrui Meng] update mllib dependencies
2014-08-26 15:12:27 -07:00
Takuya UESHIN 6b5584ef1c [SPARK-3063][SQL] ExistingRdd should convert Map to catalyst Map.
Currently `ExistingRdd.convertToCatalyst` doesn't convert `Map` values.

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #1963 from ueshin/issues/SPARK-3063 and squashes the following commits:

3ba41f2 [Takuya UESHIN] Merge branch 'master' into issues/SPARK-3063
4d7bae2 [Takuya UESHIN] Merge branch 'master' into issues/SPARK-3063
9321379 [Takuya UESHIN] Merge branch 'master' into issues/SPARK-3063
d8a900a [Takuya UESHIN] Make ExistingRdd.convertToCatalyst be able to convert Map value.
2014-08-26 15:04:08 -07:00
Takuya UESHIN 98c2bb0bbd [SPARK-2969][SQL] Make ScalaReflection be able to handle ArrayType.containsNull and MapType.valueContainsNull.
Make `ScalaReflection` able to handle types like:

- `Seq[Int]` as `ArrayType(IntegerType, containsNull = false)`
- `Seq[java.lang.Integer]` as `ArrayType(IntegerType, containsNull = true)`
- `Map[Int, Long]` as `MapType(IntegerType, LongType, valueContainsNull = false)`
- `Map[Int, java.lang.Long]` as `MapType(IntegerType, LongType, valueContainsNull = true)`

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #1889 from ueshin/issues/SPARK-2969 and squashes the following commits:

24f1c5c [Takuya UESHIN] Change the default value of ArrayType.containsNull to true in Python API.
79f5b65 [Takuya UESHIN] Change the default value of ArrayType.containsNull to true in Java API.
7cd1a7a [Takuya UESHIN] Fix json test failures.
2cfb862 [Takuya UESHIN] Change the default value of ArrayType.containsNull to true.
2f38e61 [Takuya UESHIN] Revert the default value of MapTypes.valueContainsNull.
9fa02f5 [Takuya UESHIN] Fix a test failure.
1a9a96b [Takuya UESHIN] Modify ScalaReflection to handle ArrayType.containsNull and MapType.valueContainsNull.
2014-08-26 13:22:55 -07:00
Davies Liu 3cedc4f4d7 [SPARK-2871] [PySpark] add histogram() API
RDD.histogram(buckets)

        Compute a histogram using the provided buckets. The buckets
        are all open to the right except for the last, which is closed,
        e.g. [1,10,20,50] means the buckets are [1,10) [10,20) [20,50],
        i.e. 1<=x<10, 10<=x<20, 20<=x<=50. For the inputs 1 and 50 we
        would have a histogram of [1, 0, 1].

        If your histogram is evenly spaced (e.g. [0, 10, 20, 30]),
        this can be switched from an O(log n) insertion to O(1) per
        element (where n = # buckets).

        Buckets must be sorted, must not contain any duplicates, and
        must have at least two elements.

        If `buckets` is a number, it generates buckets that are evenly
        spaced between the minimum and maximum of the RDD. For example,
        if the min value is 0 and the max is 100, given 2 buckets, the
        resulting buckets will be [0,50) [50,100]. `buckets` must be at
        least 1. An exception is raised if the RDD contains infinity or
        NaN. If the elements in the RDD do not vary (max == min), a
        single bucket is always returned.

        It returns a tuple of buckets and histogram counts.

        >>> rdd = sc.parallelize(range(51))
        >>> rdd.histogram(2)
        ([0, 25, 50], [25, 26])
        >>> rdd.histogram([0, 5, 25, 50])
        ([0, 5, 25, 50], [5, 20, 26])
        >>> rdd.histogram([0, 15, 30, 45, 60], True)
        ([0, 15, 30, 45, 60], [15, 15, 15, 6])
        >>> rdd = sc.parallelize(["ab", "ac", "b", "bd", "ef"])
        >>> rdd.histogram(("a", "b", "c"))
        (('a', 'b', 'c'), [2, 2])

Closes #122, which is a duplicate of this.

Author: Davies Liu <davies.liu@gmail.com>

Closes #2091 from davies/histgram and squashes the following commits:

a322f8a [Davies Liu] fix deprecation of e.message
84e85fa [Davies Liu] remove evenBuckets, add more tests (including str)
d9a0722 [Davies Liu] address comments
0e18a2d [Davies Liu] add histgram() API
2014-08-26 13:04:30 -07:00
chutium 8856c3d860 [SPARK-3131][SQL] Allow user to set parquet compression codec for writing ParquetFile in SQLContext
There are 4 different compression codecs available for ```ParquetOutputFormat```.

In Spark SQL, it was set as a hard-coded value in ```ParquetRelation.defaultCompression```.

Original discussion:
https://github.com/apache/spark/pull/195#discussion-diff-11002083

I added a new config property in SQLConf to allow the user to change this compression codec, and I used a similar short-name syntax as described in SPARK-2953 #1873 (https://github.com/apache/spark/pull/1873/files#diff-0).

Btw, which codec should we use as the default? It was set to GZIP (https://github.com/apache/spark/pull/195/files#diff-4), but I think maybe we should change this to SNAPPY, since SNAPPY is already the default codec for shuffling in spark-core (SPARK-2469, #1415), and parquet-mr supports the Snappy codec natively (e440108de5).
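
A hedged usage sketch (PySpark; the property name is taken from this PR, and the surrounding calls reflect the SQLContext API of this era, so treat the details as assumptions):

```python
from pyspark import SparkContext
from pyspark.sql import SQLContext, Row

sc = SparkContext(appName="parquet-codec-sketch")
sqlContext = SQLContext(sc)

# Choose the codec before writing; short names per SPARK-2953:
# uncompressed, snappy, gzip, lzo.
sqlContext.sql("SET spark.sql.parquet.compression.codec=snappy")

people = sc.parallelize([Row(name="alice", age=30), Row(name="bob", age=25)])
sqlContext.inferSchema(people).saveAsParquetFile("/tmp/people.parquet")
```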

Author: chutium <teng.qiu@gmail.com>

Closes #2039 from chutium/parquet-compression and squashes the following commits:

2f44964 [chutium] [SPARK-3131][SQL] parquet compression default codec set to snappy, also in test suite
e578e21 [chutium] [SPARK-3131][SQL] compression codec config property name and default codec set to snappy
21235dc [chutium] [SPARK-3131][SQL] Allow user to set parquet compression codec for writing ParquetFile in SQLContext
2014-08-26 11:51:26 -07:00
Andrew Or b21ae5bbb9 [SPARK-2886] Use more specific actor system name than "spark"
As of #1777 we log the name of the actor system when it binds to a port. The current name "spark" is super general and does not convey any meaning. For instance, the following line is taken from my driver log after setting `spark.driver.port` to 5001.
```
14/08/13 19:33:29 INFO Remoting: Remoting started; listening on addresses:
[akka.tcp://sparkandrews-mbp:5001]
14/08/13 19:33:29 INFO Remoting: Remoting now listens on addresses:
[akka.tcp://sparkandrews-mbp:5001]
14/08/06 13:40:05 INFO Utils: Successfully started service 'spark' on port 5001.
```
This commit renames this to "sparkDriver" and "sparkExecutor". The goal of this unambitious PR is simply to make the logged information more explicit without introducing any change in functionality.

Author: Andrew Or <andrewor14@gmail.com>

Closes #1810 from andrewor14/service-name and squashes the following commits:

8c459ed [Andrew Or] Use a common variable for driver/executor actor system names
3a92843 [Andrew Or] Change actor name to sparkDriver and sparkExecutor
921363e [Andrew Or] Merge branch 'master' of github.com:apache/spark into service-name
c8c6a62 [Andrew Or] Do not include hyphens in actor name
1c1b42e [Andrew Or] Avoid spaces in akka system name
f644b55 [Andrew Or] Use more specific service name
2014-08-25 23:36:09 -07:00
Daoyuan Wang 52fbdc2ded [Spark-3222] [SQL] Cross join support in HiveQL
We can simply treat a cross join as an inner join with no join condition.
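
A hedged sketch of the new syntax (the table names are hypothetical; the entry point follows the HiveContext API of this era):

```python
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="cross-join-sketch")
hive = HiveContext(sc)

# With this change, CROSS JOIN parses in HiveQL and is planned as an
# inner join with no join condition.
rows = hive.hql("SELECT * FROM t1 CROSS JOIN t2").take(5)
```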

Author: Daoyuan Wang <daoyuan.wang@intel.com>
Author: adrian-wang <daoyuanwong@gmail.com>

Closes #2124 from adrian-wang/crossjoin and squashes the following commits:

8c9b7c5 [Daoyuan Wang] add a test
7d47bbb [adrian-wang] add cross join support for hql
2014-08-25 22:56:35 -07:00
Kousuke Saruta 62f5009f67 [SPARK-2976] Replace tabs with spaces
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #1895 from sarutak/SPARK-2976 and squashes the following commits:

1cf7e69 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-2976
d1e0666 [Kousuke Saruta] Modified styles
c5e80a4 [Kousuke Saruta] Remove tab from JavaPageRank.java and JavaKinesisWordCountASL.java
c003b36 [Kousuke Saruta] Removed tab from sorttable.js
2014-08-25 19:40:23 -07:00
witgo 9f04db17e5 SPARK-2481: The environment variable SPARK_HISTORY_OPTS is covered in spark-env.sh
Author: witgo <witgo@qq.com>
Author: GuoQiang Li <witgo@qq.com>

Closes #1341 from witgo/history_env and squashes the following commits:

b4fd9f8 [GuoQiang Li] review commit
0ebe401 [witgo] *-history-server.sh load spark-config.sh
2014-08-25 19:22:27 -07:00
Chia-Yung Su 4243bb6634 [SPARK-3011][SQL] _temporary directory should be filtered out by sqlContext.parquetFile
Fixes a compile error on Hadoop 0.23 for pull request #1924.

Author: Chia-Yung Su <chiayung@appier.com>

Closes #1959 from joesu/bugfix-spark3011 and squashes the following commits:

be30793 [Chia-Yung Su] remove .* and _* except _metadata
8fe2398 [Chia-Yung Su] add note to explain
40ea9bd [Chia-Yung Su] fix hadoop-0.23 compile error
c7e44f2 [Chia-Yung Su] match syntax
f8fc32a [Chia-Yung Su] filter out tmp dir
2014-08-25 18:20:19 -07:00
wangfei 507a1b5200 [SQL] logWarning should be logInfo in getResultSetSchema
Author: wangfei <wangfei_hello@126.com>

Closes #1939 from scwf/patch-5 and squashes the following commits:

f952d10 [wangfei] [SQL] logWarning should be logInfo in getResultSetSchema
2014-08-25 17:46:43 -07:00
Cheng Hao 156eb39661 [SPARK-3058] [SQL] Support EXTENDED for EXPLAIN
Provide `extended` keyword support for the `explain` command in SQL, e.g.:
```
explain extended select key as a1, value as a2 from src where key=1;
== Parsed Logical Plan ==
Project ['key AS a1#3,'value AS a2#4]
 Filter ('key = 1)
  UnresolvedRelation None, src, None

== Analyzed Logical Plan ==
Project [key#8 AS a1#3,value#9 AS a2#4]
 Filter (CAST(key#8, DoubleType) = CAST(1, DoubleType))
  MetastoreRelation default, src, None

== Optimized Logical Plan ==
Project [key#8 AS a1#3,value#9 AS a2#4]
 Filter (CAST(key#8, DoubleType) = 1.0)
  MetastoreRelation default, src, None

== Physical Plan ==
Project [key#8 AS a1#3,value#9 AS a2#4]
 Filter (CAST(key#8, DoubleType) = 1.0)
  HiveTableScan [key#8,value#9], (MetastoreRelation default, src, None), None

Code Generation: false
== RDD ==
(2) MappedRDD[14] at map at HiveContext.scala:350
  MapPartitionsRDD[13] at mapPartitions at basicOperators.scala:42
  MapPartitionsRDD[12] at mapPartitions at basicOperators.scala:57
  MapPartitionsRDD[11] at mapPartitions at TableReader.scala:112
  MappedRDD[10] at map at TableReader.scala:240
  HadoopRDD[9] at HadoopRDD at TableReader.scala:230
```

This is a subtask of #1847, but it can go in without any dependency.

Author: Cheng Hao <hao.cheng@intel.com>

Closes #1962 from chenghao-intel/explain_extended and squashes the following commits:

295db74 [Cheng Hao] Fix bug in printing the simple execution plan
48bc989 [Cheng Hao] Support EXTENDED for EXPLAIN
2014-08-25 17:43:56 -07:00
Cheng Lian cae9414d38 [SPARK-2929][SQL] Refactored Thrift server and CLI suites
Removed most hard-coded timeouts, timing assumptions, and all `Thread.sleep` calls. Simplified IPC and synchronization with `scala.sys.process` and future/promise so that the test suites run more robustly and faster.

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #1856 from liancheng/thriftserver-tests and squashes the following commits:

2d914ca [Cheng Lian] Minor refactoring
0e12e71 [Cheng Lian] Cleaned up test output
0ee921d [Cheng Lian] Refactored Thrift server and CLI suites
2014-08-25 16:29:59 -07:00
Takuya UESHIN d299e2bf2f [SPARK-3204][SQL] MaxOf would be foldable if both left and right are foldable.
Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #2116 from ueshin/issues/SPARK-3204 and squashes the following commits:

7d9b107 [Takuya UESHIN] Make MaxOf foldable if both left and right are foldable.
2014-08-25 16:27:00 -07:00
Cheng Lian 805fec845b Fixed a typo in docs/running-on-mesos.md
It should be `spark-env.sh` rather than `spark.env.sh`.

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #2119 from liancheng/fix-mesos-doc and squashes the following commits:

f360548 [Cheng Lian] Fixed a typo in docs/running-on-mesos.md
2014-08-25 14:56:51 -07:00
Xiangrui Meng fd8ace2d9a [FIX] fix error message in sendMessageReliably
rxin

Author: Xiangrui Meng <meng@databricks.com>

Closes #2120 from mengxr/sendMessageReliably and squashes the following commits:

b14400c [Xiangrui Meng] fix error message in sendMessageReliably
2014-08-25 14:55:20 -07:00
Allan Douglas R. de Oliveira cc40a709c0 SPARK-3180 - Better control of security groups
Adds the --authorized-address and --additional-security-group options as explained in the issue.

Author: Allan Douglas R. de Oliveira <allan@chaordicsystems.com>

Closes #2088 from douglaz/configurable_sg and squashes the following commits:

e3e48ca [Allan Douglas R. de Oliveira] Adds the option to specify the address authorized to access the SG and another option to provide an additional existing SG
2014-08-25 13:55:04 -07:00
Sean Owen cd30db566a SPARK-2798 [BUILD] Correct several small errors in Flume module pom.xml files
(EDIT) Since the scalatest issue has since been resolved, this is now about a few small problems in the Flume Sink `pom.xml`:

- `scalatest` is not declared as a test-scope dependency
- Its Avro version doesn't match the rest of the build
- Its Flume version is not synced with the other Flume module
- The other Flume module declares its dependency on Flume Sink slightly incorrectly, hard-coding the Scala 2.10 version
- It depends on Scala Lang directly, which it shouldn't

Author: Sean Owen <sowen@cloudera.com>

Closes #1726 from srowen/SPARK-2798 and squashes the following commits:

a46e2c6 [Sean Owen] scalatest to test scope, harmonize Avro and Flume versions, remove direct Scala dependency, fix '2.10' in Flume dependency
2014-08-25 13:29:07 -07:00
Xiangrui Meng 220f413686 [SPARK-2495][MLLIB] make KMeans constructor public
to reconstruct k-means models. freeman-lab

Author: Xiangrui Meng <meng@databricks.com>

Closes #2112 from mengxr/public-constructors and squashes the following commits:

18d53a9 [Xiangrui Meng] make KMeans constructor public
2014-08-25 12:30:02 -07:00
Davies Liu fb0db77242 [SPARK-2871] [PySpark] add zipWithIndex() and zipWithUniqueId()
RDD.zipWithIndex()

        Zips this RDD with its element indices.

        The ordering is first based on the partition index and then the
        ordering of items within each partition. So the first item in
        the first partition gets index 0, and the last item in the last
        partition receives the largest index.

        This method needs to trigger a Spark job when this RDD contains
        more than one partition.

        >>> sc.parallelize(range(4), 2).zipWithIndex().collect()
        [(0, 0), (1, 1), (2, 2), (3, 3)]

RDD.zipWithUniqueId()

        Zips this RDD with generated unique Long ids.

        Items in the kth partition will get ids k, n+k, 2*n+k, ..., where
        n is the number of partitions. So there may be gaps, but this
        method won't trigger a Spark job, unlike L{zipWithIndex}.

        >>> sc.parallelize(range(4), 2).zipWithUniqueId().collect()
        [(0, 0), (2, 1), (1, 2), (3, 3)]

Author: Davies Liu <davies.liu@gmail.com>

Closes #2092 from davies/zipWith and squashes the following commits:

cebe5bf [Davies Liu] improve test cases, reverse the order of index
0d2a128 [Davies Liu] add zipWithIndex() and zipWithUniqueId()
2014-08-24 21:16:05 -07:00
Reza Zadeh b1b20301b3 [MLlib][SPARK-2997] Update SVD documentation to reflect roughly square
Update the documentation to reflect the fact that we can handle roughly square matrices.

Author: Reza Zadeh <rizlar@gmail.com>

Closes #2070 from rezazadeh/svddocs and squashes the following commits:

826b8fe [Reza Zadeh] left singular vectors
3f34fc6 [Reza Zadeh] PCA is still TS
7ffa2aa [Reza Zadeh] better title
aeaf39d [Reza Zadeh] More docs
788ed13 [Reza Zadeh] add computational cost explanation
6429c59 [Reza Zadeh] Add link to rowmatrix docs
1eeab8b [Reza Zadeh] Update SVD documentation to reflect roughly square
2014-08-24 17:35:54 -07:00
DB Tsai 572952ae61 [SPARK-2841][MLlib] Documentation for feature transformations
Documentation for newly added feature transformations:
1. TF-IDF
2. StandardScaler
3. Normalizer

Author: DB Tsai <dbtsai@alpinenow.com>

Closes #2068 from dbtsai/transformer-documentation and squashes the following commits:

109f324 [DB Tsai] address feedback
2014-08-24 17:33:33 -07:00
Kousuke Saruta ded6796bf5 [SPARK-3192] Some scripts have 2 space indentation but other scripts have 4 space indentation.
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #2104 from sarutak/SPARK-3192 and squashes the following commits:

db78419 [Kousuke Saruta] Modified indentation of spark-shell
2014-08-24 09:43:44 -07:00
Raymond Liu 8861cdf112 Clean unused code in SortShuffleWriter
Just cleans up unused code that has already been moved into ExternalSorter.

Author: Raymond Liu <raymond.liu@intel.com>

Closes #1882 from colorant/sortShuffleWriter and squashes the following commits:

e6337be [Raymond Liu] Clean unused code in SortShuffleWriter
2014-08-23 19:47:14 -07:00
Davies Liu 8df4dad495 [SPARK-2871] [PySpark] add approx API for RDD
RDD.countApprox(self, timeout, confidence=0.95)

        :: Experimental ::
        Approximate version of count() that returns a potentially incomplete
        result within a timeout, even if not all tasks have finished.

        >>> rdd = sc.parallelize(range(1000), 10)
        >>> rdd.countApprox(1000, 1.0)
        1000

RDD.sumApprox(self, timeout, confidence=0.95)

        Approximate operation to return the sum within a timeout
        or meet the confidence.

        >>> rdd = sc.parallelize(range(1000), 10)
        >>> r = sum(xrange(1000))
        >>> (rdd.sumApprox(1000) - r) / r < 0.05
        True

RDD.meanApprox(self, timeout, confidence=0.95)

        :: Experimental ::
        Approximate operation to return the mean within a timeout
        or meet the confidence.

        >>> rdd = sc.parallelize(range(1000), 10)
        >>> r = sum(xrange(1000)) / 1000.0
        >>> (rdd.meanApprox(1000) - r) / r < 0.05
        True

Author: Davies Liu <davies.liu@gmail.com>

Closes #2095 from davies/approx and squashes the following commits:

e8c252b [Davies Liu] add approx API for RDD
2014-08-23 19:33:34 -07:00
Davies Liu db436e36c4 [SPARK-2871] [PySpark] add key argument for max(), min() and top(n)
RDD.max(key=None)

        param key: A function used to generate the key for comparison

        >>> rdd = sc.parallelize([1.0, 5.0, 43.0, 10.0])
        >>> rdd.max()
        43.0
        >>> rdd.max(key=str)
        5.0

RDD.min(key=None)

        Find the minimum item in this RDD.

        param key: A function used to generate the key for comparison

        >>> rdd = sc.parallelize([2.0, 5.0, 43.0, 10.0])
        >>> rdd.min()
        2.0
        >>> rdd.min(key=str)
        10.0

RDD.top(num, key=None)

        Get the top N elements from an RDD.

        Note: It returns the list sorted in descending order.
        >>> sc.parallelize([10, 4, 2, 12, 3]).top(1)
        [12]
        >>> sc.parallelize([2, 3, 4, 5, 6], 2).top(2)
        [6, 5]
        >>> sc.parallelize([10, 4, 2, 12, 3]).top(3, key=str)
        [4, 3, 2]

Author: Davies Liu <davies.liu@gmail.com>

Closes #2094 from davies/cmp and squashes the following commits:

ccbaf25 [Davies Liu] add `key` to top()
ad7e374 [Davies Liu] fix tests
2f63512 [Davies Liu] change `comp` to `key` in min/max
dd91e08 [Davies Liu] add `comp` argument for RDD.max() and RDD.min()
2014-08-23 18:55:13 -07:00
Michael Armbrust 3519b5e8e5 [SPARK-2967][SQL] Follow-up: Also copy hash expressions in sort based shuffle fix.
Follow-up to #2066

Author: Michael Armbrust <michael@databricks.com>

Closes #2072 from marmbrus/sortShuffle and squashes the following commits:

2ff8114 [Michael Armbrust] Fix bug
2014-08-23 16:21:08 -07:00
Michael Armbrust 7e191fe29b [SPARK-2554][SQL] CountDistinct partial aggregation and object allocation improvements
Author: Michael Armbrust <michael@databricks.com>
Author: Gregory Owen <greowen@gmail.com>

Closes #1935 from marmbrus/countDistinctPartial and squashes the following commits:

5c7848d [Michael Armbrust] turn off caching in the constructor
8074a80 [Michael Armbrust] fix tests
32d216f [Michael Armbrust] reynolds comments
c122cca [Michael Armbrust] Address comments, add tests
b2e8ef3 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into countDistinctPartial
fae38f4 [Michael Armbrust] Fix style
fdca896 [Michael Armbrust] cleanup
93d0f64 [Michael Armbrust] metastore concurrency fix.
db44a30 [Michael Armbrust] JIT hax.
3868f6c [Michael Armbrust] Merge pull request #9 from GregOwen/countDistinctPartial
c9e67de [Gregory Owen] Made SpecificRow and types serializable by Kryo
2b46c4b [Michael Armbrust] Merge remote-tracking branch 'origin/master' into countDistinctPartial
8ff6402 [Michael Armbrust] Add specific row.
58d15f1 [Michael Armbrust] disable codegen logging
87d101d [Michael Armbrust] Fix isNullAt bug
abee26d [Michael Armbrust] WIP
27984d0 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into countDistinctPartial
57ae3b1 [Michael Armbrust] Fix order dependent test
b3d0f64 [Michael Armbrust] Add golden files.
c1f7114 [Michael Armbrust] Improve tests / fix serialization.
f31b8ad [Michael Armbrust] more fixes
38c7449 [Michael Armbrust] comments and style
9153652 [Michael Armbrust] better toString
d494598 [Michael Armbrust] Fix tests now that the planner is better
41fbd1d [Michael Armbrust] Never try and create an empty hash set.
050bb97 [Michael Armbrust] Skip no-arg constructors for kryo,
bd08239 [Michael Armbrust] WIP
213ada8 [Michael Armbrust] First draft of partially aggregated and code generated count distinct / max
2014-08-23 16:19:10 -07:00
Yin Huai 2fb1c72ea2 [SQL] Make functionRegistry in HiveContext transient.
Seems we missed `transient` for the `functionRegistry` in `HiveContext`.

cc: marmbrus

Author: Yin Huai <huaiyin.thu@gmail.com>

Closes #2074 from yhuai/makeFunctionRegistryTransient and squashes the following commits:

6534e7d [Yin Huai] Make functionRegistry transient.
2014-08-23 12:46:41 -07:00
Liang-Chi Hsieh 76bb044b9e [Minor] fix typo
Fix a typo in comment.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #2105 from viirya/fix_typo and squashes the following commits:

6596a80 [Liang-Chi Hsieh] fix typo.
2014-08-23 10:08:25 -07:00
Daoyuan Wang f3d65cd0bf [SPARK-3068]remove MaxPermSize option for jvm 1.8
In JVM 1.8.0, MaxPermSize is no longer supported.
In Spark's `stderr` output, there would be a line like:

    Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=128m; support was removed in 8.0
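
An illustrative sketch of the runtime check (Python; not the actual launcher-script code, and the flag value is just an example):

```python
import re
import subprocess

# `java -version` prints to stderr, so fold it into stdout.
out = subprocess.check_output(["java", "-version"],
                              stderr=subprocess.STDOUT).decode()
match = re.search(r'version "1\.(\d+)', out)

java_opts = []
if match and int(match.group(1)) < 8:
    # PermGen exists only before Java 8, so the option is safe to pass.
    java_opts.append("-XX:MaxPermSize=128m")
print(java_opts)
```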

Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #2011 from adrian-wang/maxpermsize and squashes the following commits:

ef1d660 [Daoyuan Wang] direct get java version in runtime
37db9c1 [Daoyuan Wang] code refine
3c1d554 [Daoyuan Wang] remove MaxPermSize option for jvm 1.8
2014-08-23 08:09:30 -07:00
Kousuke Saruta 323cd92b9b [SPARK-2963] REGRESSION - The description of how to build for using the CLI and Thrift JDBC server is absent from the proper document
The most important points I made in #1885 are as follows:

* People who build Spark are not always programmers.
* If a person who builds Spark is not a programmer, he/she won't read the programmer's guide before building.

So, how to build for using the CLI and JDBC server should not appear only in the programmer's guide.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #2080 from sarutak/SPARK-2963 and squashes the following commits:

ee07c76 [Kousuke Saruta] Modified regression of the description about building for using Thrift JDBC server and CLI
ed53329 [Kousuke Saruta] Modified description and notaton of proper noun
07c59fc [Kousuke Saruta] Added a description about how to build to use HiveServer and CLI for SparkSQL to building-with-maven.md
6e6645a [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-2963
c88fa93 [Kousuke Saruta] Added a description about building to use HiveServer and CLI for SparkSQL
2014-08-22 22:34:24 -07:00
Tathagata Das 3004074152 [SPARK-3169] Removed dependency on spark streaming test from spark flume sink
Due to maven bug https://jira.codehaus.org/browse/MNG-1378, maven could not resolve spark streaming classes required by the spark-streaming test-jar dependency of external/flume-sink. There is no particular reason that the external/flume-sink has to depend on Spark Streaming at all, so I am eliminating this dependency. Also I have removed the exclusions present in the Flume dependencies, as there is no reason to exclude them (they were excluded in the external/flume module to prevent dependency collisions with Spark).

Since Jenkins will test the sbt build and run the unit tests, I only tested maven compilation locally.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #2101 from tdas/spark-sink-pom-fix and squashes the following commits:

8f42621 [Tathagata Das] Added Flume sink exclusions back, and added netty to test dependencies
93b559f [Tathagata Das] Removed dependency on spark streaming test from spark flume sink
2014-08-22 21:34:48 -07:00
Reynold Xin a5219db1ec Link to Contributing to Spark wiki page on README.md. 2014-08-22 17:05:35 -07:00
XuTingjun 220c2d7699 [SPARK-2742][yarn] delete useless variables
Author: XuTingjun <1039320815@qq.com>

Closes #1614 from XuTingjun/yarn-bug and squashes the following commits:

f07096e [XuTingjun] Update ClientArguments.scala
2014-08-22 10:45:00 -05:00
Joseph K. Bradley 050f8d01e4 [SPARK-2840] [mllib] DecisionTree doc update (Java, Python examples)
Updated DecisionTree documentation, with examples for Java, Python.
Added same Java example to code as well.
CC: @mengxr  @manishamde @atalwalkar

Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>

Closes #2063 from jkbradley/dt-docs and squashes the following commits:

2dd2c19 [Joseph K. Bradley] Last updates based on github review.
9dd1b6b [Joseph K. Bradley] Updated decision tree doc.
d802369 [Joseph K. Bradley] Updates based on comments: cache data, corrected doc text.
b9bee04 [Joseph K. Bradley] Updated DT examples
57eee9f [Joseph K. Bradley] Created JavaDecisionTree example from example in docs, and corrected doc example as needed.
d939a92 [Joseph K. Bradley] Updated DecisionTree documentation.  Added Java, Python examples.
2014-08-21 00:17:29 -07:00
Xiangrui Meng e0f946265b [SPARK-2843][MLLIB] add a section about regularization parameter in ALS
atalwalkar srowen

Author: Xiangrui Meng <meng@databricks.com>

Closes #2064 from mengxr/als-doc and squashes the following commits:

b2e20ab [Xiangrui Meng] introduced -> discussed
98abdd7 [Xiangrui Meng] add reference
339bd08 [Xiangrui Meng] add a section about regularization parameter in ALS
2014-08-20 17:47:39 -07:00
Xiangrui Meng e1571874f2 [SPARK-3143][MLLIB] add tf-idf user guide
Moved TF-IDF before Word2Vec because the former is more basic. I also added a link for Word2Vec. atalwalkar

Author: Xiangrui Meng <meng@databricks.com>

Closes #2061 from mengxr/tfidf-doc and squashes the following commits:

ca04c70 [Xiangrui Meng] address comments
a5ea4b4 [Xiangrui Meng] add tf-idf user guide
2014-08-20 17:41:36 -07:00
Andrew Or ba3c730e35 [SPARK-3140] Clarify confusing PySpark exception message
We read the py4j port from the stdout of the `bin/spark-submit` subprocess. If there is interference in stdout (e.g. a random echo in `spark-submit`), we throw an exception with a warning message. We do not, however, distinguish this case from the case where no stdout is produced at all.

I wasted a non-trivial amount of time being baffled by this exception in search of places where I print random whitespace (in vain, of course). A clearer exception message that distinguishes between these cases will prevent similar headaches that I have gone through.

Author: Andrew Or <andrewor14@gmail.com>

Closes #2067 from andrewor14/python-exception and squashes the following commits:

742f823 [Andrew Or] Further clarify warning messages
e96a7a0 [Andrew Or] Distinguish between unexpected output and no output at all
2014-08-20 17:07:39 -07:00
Marcelo Vanzin c9f743957f [SPARK-2848] Shade Guava in uber-jars.
For further discussion, please check the JIRA entry.

This change moves Guava classes to a different package so that they don't conflict with the user-provided Guava (or the Hadoop-provided one). Since one class (Optional) was exposed through Spark's public API, that class was forked from Guava at the current dependency version (14.0.1) so that it can be kept going forward (until the API is cleaned).

Note this change has a few implications:
- *all* classes in the final jars will reference the relocated classes. If Hadoop classes are included (i.e. "-Phadoop-provided" is not activated), those will also reference the Guava 14 classes (instead of the Guava 11 classes from the Hadoop classpath).
- if the Guava version in Spark is ever changed, the new Guava will still reference the forked Optional class; this may or may not be a problem, but in the long term it's better to think about removing Optional from the public API.

For the end user, there are two visible implications:

- Guava is not provided as a transitive dependency anymore (since it's "provided" in Spark)
- At runtime, unless they provide their own, they'll either have no Guava or Hadoop's version of Guava (11), depending on how they set up their classpath.

Note that this patch does not change the sbt deliverables; those will still contain guava in its original package, and provide guava as a compile-time dependency. This assumes that maven is the canonical build, and sbt-built artifacts are not (officially) published.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #1813 from vanzin/SPARK-2848 and squashes the following commits:

9bdffb0 [Marcelo Vanzin] Undo sbt build changes.
819b445 [Marcelo Vanzin] Review feedback.
05e0a3d [Marcelo Vanzin] Merge branch 'master' into SPARK-2848
fef4370 [Marcelo Vanzin] Unfork Optional.java.
d3ea8e1 [Marcelo Vanzin] Exclude asm classes from final jar.
637189b [Marcelo Vanzin] Add hacky filter to prefer Spark's copy of Optional.
2fec990 [Marcelo Vanzin] Shade Guava in the sbt build.
616998e [Marcelo Vanzin] Shade Guava in the maven build, fork Guava's Optional.java.
2014-08-20 16:23:10 -07:00
Alex Liu d9e94146a6 [SPARK-2846][SQL] Add configureInputJobPropertiesForStorageHandler to initialization of job conf

Author: Alex Liu <alex_liu68@yahoo.com>

Closes #1927 from alexliu68/SPARK-SQL-2846 and squashes the following commits:

e4bdc4c [Alex Liu] SPARK-SQL-2846 add configureInputJobPropertiesForStorageHandler to initial job conf
2014-08-20 16:14:06 -07:00
wangfei a1e8b1bc97 SPARK_LOGFILE and SPARK_ROOT_LOGGER are no longer needed in spark-daemon.sh
Author: wangfei <wangfei_hello@126.com>

Closes #2057 from scwf/patch-7 and squashes the following commits:

1b7b9a5 [wangfei] SPARK_LOGFILE and SPARK_ROOT_LOGGER no longer need in spark-daemon.sh
2014-08-20 16:00:46 -07:00
Michael Armbrust a2e658dcda [SPARK-2967][SQL] Fix sort based shuffle for spark sql.
Add explicit row copies when sort-based shuffle is on.
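
A conceptual sketch of the bug class being fixed (Python, not the Scala change): iterators that reuse one mutable row must be copied before a sort-based shuffle buffers them, or every buffered entry aliases the last value.

```python
class MutableRow:
    def __init__(self):
        self.values = []

def scan(data):
    row = MutableRow()          # one row object reused for every element
    for item in data:
        row.values = [item]
        yield row

buffered = list(scan([1, 2, 3]))                   # no copy: all alias one row
copied = [r.values[:] for r in scan([1, 2, 3])]    # explicit copy per row

print([r.values for r in buffered])  # [[3], [3], [3]] -- corrupted
print(copied)                        # [[1], [2], [3]] -- correct
```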

Author: Michael Armbrust <michael@databricks.com>

Closes #2066 from marmbrus/sortShuffle and squashes the following commits:

fcd7bb2 [Michael Armbrust] Fix sort based shuffle for spark sql.
2014-08-20 15:51:14 -07:00
Reynold Xin fb60bec34e [SPARK-2298] Encode stage attempt in SparkListener & UI.
Simple way to reproduce this in the UI:

```scala
val f = new java.io.File("/tmp/test")
f.delete()
sc.parallelize(1 to 2, 2).map(x => (x,x )).repartition(3).mapPartitionsWithContext { case (context, iter) =>
  if (context.partitionId == 0) {
    val f = new java.io.File("/tmp/test")
    if (!f.exists) {
      f.mkdir()
      System.exit(0);
    }
  }
  iter
}.count()
```

Author: Reynold Xin <rxin@apache.org>

Closes #1545 from rxin/stage-attempt and squashes the following commits:

3ee1d2a [Reynold Xin] - Rename attempt to retry in UI. - Properly report stage failure in FetchFailed.
40a6bd5 [Reynold Xin] Updated test suites.
c414c36 [Reynold Xin] Fixed the hanging in JobCancellationSuite.
b3e2eed [Reynold Xin] Oops previous code didn't compile.
0f36075 [Reynold Xin] Mark unknown stage attempt with id -1 and drop that in JobProgressListener.
6c08b07 [Reynold Xin] Addressed code review feedback.
4e5faa2 [Reynold Xin] [SPARK-2298] Encode stage attempt in SparkListener & UI.
2014-08-20 15:37:27 -07:00
Andrew Or b3ec51bfd7 [SPARK-2849] Handle driver configs separately in client mode
In client deploy mode, the driver is launched from within `SparkSubmit`'s JVM. This means by the time we parse Spark configs from `spark-defaults.conf`, it is already too late to control certain properties of the driver's JVM. We currently ignore these configs in client mode altogether.
```
spark.driver.memory
spark.driver.extraJavaOptions
spark.driver.extraClassPath
spark.driver.extraLibraryPath
```
This PR handles these properties before launching the driver JVM. It achieves this by spawning a separate JVM that runs a new class called `SparkSubmitDriverBootstrapper`, which spawns `SparkSubmit` as a sub-process with the appropriate classpath, library paths, java opts and memory.

Author: Andrew Or <andrewor14@gmail.com>

Closes #1845 from andrewor14/handle-configs-bash and squashes the following commits:

bed4bdf [Andrew Or] Change a few comments / messages (minor)
24dba60 [Andrew Or] Merge branch 'master' of github.com:apache/spark into handle-configs-bash
08fd788 [Andrew Or] Warn against external usages of SparkSubmitDriverBootstrapper
ff34728 [Andrew Or] Minor comments
51aeb01 [Andrew Or] Filter out JVM memory in Scala rather than Bash (minor)
9a778f6 [Andrew Or] Fix PySpark: actually kill driver on termination
d0f20db [Andrew Or] Don't pass empty library paths, classpath, java opts etc.
a78cb26 [Andrew Or] Revert a few changes in utils.sh (minor)
9ba37e2 [Andrew Or] Don't barf when the properties file does not exist
8867a09 [Andrew Or] A few more naming things (minor)
19464ad [Andrew Or] SPARK_SUBMIT_JAVA_OPTS -> SPARK_SUBMIT_OPTS
d6488f9 [Andrew Or] Merge branch 'master' of github.com:apache/spark into handle-configs-bash
1ea6bbe [Andrew Or] SparkClassLauncher -> SparkSubmitDriverBootstrapper
a91ea19 [Andrew Or] Fix precedence of library paths, classpath, java opts and memory
158f813 [Andrew Or] Remove "client mode" boolean argument
c84f5c8 [Andrew Or] Remove debug print statement (minor)
b71f52b [Andrew Or] Revert a few more changes (minor)
7d94a8d [Andrew Or] Merge branch 'master' of github.com:apache/spark into handle-configs-bash
3a8235d [Andrew Or] Only parse the properties file if special configs exist
c37e08d [Andrew Or] Revert a few more changes
a396eda [Andrew Or] Nullify my own hard work to simplify bash
0effa1e [Andrew Or] Add code in Scala that handles special configs
c886568 [Andrew Or] Fix lines too long + a few comments / style (minor)
7a4190a [Andrew Or] Merge branch 'master' of github.com:apache/spark into handle-configs-bash
7396be2 [Andrew Or] Explicitly comment that multi-line properties are not supported
fa11ef8 [Andrew Or] Parse the properties file only if the special configs exist
371cac4 [Andrew Or] Add function prefix (minor)
be99eb3 [Andrew Or] Fix tests to not include multi-line configs
bd0d468 [Andrew Or] Simplify parsing config file by ignoring multi-line arguments
56ac247 [Andrew Or] Use eval and set to simplify splitting
8d4614c [Andrew Or] Merge branch 'master' of github.com:apache/spark into handle-configs-bash
aeb79c7 [Andrew Or] Merge branch 'master' of github.com:apache/spark into handle-configs-bash
2732ac0 [Andrew Or] Integrate BASH tests into dev/run-tests + log error properly
8d26a5c [Andrew Or] Add tests for bash/utils.sh
4ae24c3 [Andrew Or] Fix bug: escape properly in quote_java_property
b3c4cd5 [Andrew Or] Fix bug: count the number of quotes instead of detecting presence
c2273fc [Andrew Or] Fix typo (minor)
e793e5f [Andrew Or] Handle multi-line arguments
5d8f8c4 [Andrew Or] Merge branch 'master' of github.com:apache/spark into submit-driver-extra
c7b9926 [Andrew Or] Minor changes to spark-defaults.conf.template
a992ae2 [Andrew Or] Escape spark.*.extraJavaOptions correctly
aabfc7e [Andrew Or] escape -> split (minor)
45a1eb9 [Andrew Or] Fix bug: escape escaped backslashes and quotes properly...
1cdc6b1 [Andrew Or] Fix bug: escape escaped double quotes properly
c854859 [Andrew Or] Add small comment
c13a2cb [Andrew Or] Merge branch 'master' of github.com:apache/spark into submit-driver-extra
8e552b7 [Andrew Or] Include an example of spark.*.extraJavaOptions
de765c9 [Andrew Or] Print spark-class command properly
a4df3c4 [Andrew Or] Move parsing and escaping logic to utils.sh
dec2343 [Andrew Or] Only export variables if they exist
fa2136e [Andrew Or] Escape Java options + parse java properties files properly
ef12f74 [Andrew Or] Minor formatting
4ec22a1 [Andrew Or] Merge branch 'master' of github.com:apache/spark into submit-driver-extra
e5cfb46 [Andrew Or] Collapse duplicate code + fix potential whitespace issues
4edcaa8 [Andrew Or] Redirect stdout to stderr for python
130f295 [Andrew Or] Handle spark.driver.memory too
98dd8e3 [Andrew Or] Add warning if properties file does not exist
8843562 [Andrew Or] Fix compilation issues...
75ee6b4 [Andrew Or] Remove accidentally added file
63ed2e9 [Andrew Or] Merge branch 'master' of github.com:apache/spark into submit-driver-extra
0025474 [Andrew Or] Revert SparkSubmit handling of --driver-* options for only cluster mode
a2ab1b0 [Andrew Or] Parse spark.driver.extra* in bash
250cb95 [Andrew Or] Do not ignore spark.driver.extra* for client mode
2014-08-20 15:01:47 -07:00
Kousuke Saruta c1ba4cd6b4 [SPARK-3149] Connection establishment information is not enough.
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #2060 from sarutak/SPARK-3149 and squashes the following commits:

1cc89af [Kousuke Saruta] Modified log message of accepting connection
2014-08-20 14:04:39 -07:00
Kousuke Saruta 0ea46ac800 [SPARK-3062] [SPARK-2970] [SQL] spark-sql script ends with IOException when EventLogging is enabled
#1891 was to avoid an IOException when EventLogging is enabled.
The solution used ShutdownHookManager, but that class is defined only in Hadoop 2.x; Hadoop 1.x doesn't have ShutdownHookManager, so #1891 doesn't compile on Hadoop 1.x.

Now I have a compromise solution for both Hadoop 1.x and 2.x:
for FileLogger only, a unique FileSystem object is created.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #1970 from sarutak/SPARK-2970 and squashes the following commits:

240c91e [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-2970
0e7b45d [Kousuke Saruta] Revert "[SPARK-2970] [SQL] spark-sql script ends with IOException when EventLogging is enabled"
e1262ec [Kousuke Saruta] Modified Filelogger to use unique FileSystem instance
2014-08-20 13:26:11 -07:00