14878 commits

Author SHA1 Message Date
Herman van Hovell 8121a4b1cb [SPARK-13277][BUILD] Follow-up ANTLR warnings are treated as build errors
It is possible to create faulty but legal ANTLR grammars. ANTLR will produce warnings but also a valid, compilable parser. This PR makes sure we treat such warnings as build errors.

cc rxin / viirya

Author: Herman van Hovell <hvanhovell@questtec.nl>

Closes #11174 from hvanhovell/ANTLR-warnings-as-errors.
2016-02-11 18:23:44 -08:00
Davies Liu b10af5e238 [SPARK-12915][SQL] add SQL metrics of numOutputRows for whole stage codegen
This PR adds SQL metrics (numOutputRows) for generated operators (the same as for non-generated ones); the cost is about 0.2 nanoseconds per row.

<img width="806" alt="gen metrics" src="https://cloud.githubusercontent.com/assets/40902/12994694/47f5881e-d0d7-11e5-9d47-78229f559ab0.png">

Author: Davies Liu <davies@databricks.com>

Closes #11170 from davies/gen_metric.
2016-02-11 18:00:03 -08:00
Liu Xiang a5257048d7 [SPARK-12765][ML][COUNTVECTORIZER] fix CountVectorizer.transform's lost transformSchema
https://issues.apache.org/jira/browse/SPARK-12765

Author: Liu Xiang <lxmtlab@gmail.com>

Closes #10720 from sloth2012/sloth.
2016-02-11 17:28:37 -08:00
sethah b354673886 [SPARK-13047][PYSPARK][ML] Pyspark Params.hasParam should not throw an error
Pyspark Params class has a method `hasParam(paramName)` which returns `True` if the class has a parameter by that name, but throws an `AttributeError` otherwise. There is not currently a way of getting a Boolean to indicate if a class has a parameter. With Spark 2.0 we could modify the existing behavior of `hasParam` or add an additional method with this functionality.

In Python:
```python
from pyspark.ml.classification import NaiveBayes
nb = NaiveBayes()
print nb.hasParam("smoothing")
print nb.hasParam("notAParam")
```
produces:
> True
> AttributeError: 'NaiveBayes' object has no attribute 'notAParam'

However, in Scala:
```scala
import org.apache.spark.ml.classification.NaiveBayes
val nb  = new NaiveBayes()
nb.hasParam("smoothing")
nb.hasParam("notAParam")
```
produces:
> true
> false
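
The Boolean-returning behavior that Scala already has could be sketched in Python roughly as follows. `Param`, `Params` and `NaiveBayesLike` here are hypothetical stand-ins written for illustration, not the real `pyspark.ml` classes, and the actual fix may differ:

```python
class Param:
    """Hypothetical stand-in for a parameter descriptor."""

    def __init__(self, name):
        self.name = name


class Params:
    """Hypothetical stand-in for pyspark.ml.param.Params (not the real class)."""

    def hasParam(self, paramName):
        # Look the attribute up with a default, so an unknown name yields
        # False instead of raising AttributeError.
        return isinstance(getattr(self, paramName, None), Param)


class NaiveBayesLike(Params):
    """Hypothetical estimator with a single parameter."""

    def __init__(self):
        self.smoothing = Param("smoothing")


nb = NaiveBayesLike()
print(nb.hasParam("smoothing"))
print(nb.hasParam("notAParam"))
```

With this shape, callers can branch on the result without wrapping the call in `try`/`except`.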

cc holdenk

Author: sethah <seth.hendrickson16@gmail.com>

Closes #10962 from sethah/SPARK-13047.
2016-02-11 16:42:44 -08:00
Yanbo Liang 30e0095566 [SPARK-13035][ML][PYSPARK] PySpark ml.clustering support export/import
PySpark ml.clustering support export/import.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #10999 from yanboliang/spark-13035.
2016-02-11 15:55:40 -08:00
Yanbo Liang 2426eb3e16 [MINOR][ML][PYSPARK] Cleanup test cases of clustering.py
Test cases should be removed from the annotation of the ```setXXX``` functions; otherwise they become part of the [Python API docs](https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.clustering.KMeans.setInitMode).
cc mengxr jkbradley

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #10975 from yanboliang/clustering-cleanup.
2016-02-11 15:53:45 -08:00
Kai Jiang c8f667d7c1 [SPARK-13037][ML][PYSPARK] PySpark ml.recommendation support export/import
PySpark ml.recommendation support export/import.

Author: Kai Jiang <jiangkai@gmail.com>

Closes #11044 from vectorijk/spark-13037.
2016-02-11 15:50:33 -08:00
Yu ISHIKAWA 574571c870 [SPARK-11515][ML] QuantileDiscretizer should take random seed
cc jkbradley

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #9535 from yu-iskw/SPARK-11515.
2016-02-11 15:05:34 -08:00
Yu ISHIKAWA efb65e09bc [SPARK-13265][ML] Refactoring of basic ML import/export for other file system besides HDFS
jkbradley I tried to improve the function for exporting a model. When I tried to export a model to S3 under Spark 1.6, it failed, so the export should support S3 besides HDFS. Can you review it when you have time? Thanks!

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #11151 from yu-iskw/SPARK-13265.
2016-02-11 15:00:23 -08:00
Reynold Xin c86009ceb9 Revert "[SPARK-13279] Remove O(n^2) operation from scheduler."
This reverts commit 50fa6fd1b3.
2016-02-11 13:31:13 -08:00
Sital Kedia 50fa6fd1b3 [SPARK-13279] Remove O(n^2) operation from scheduler.
This commit removes an unnecessary duplicate check in addPendingTask that meant
that scheduling a task set took time proportional to (# tasks)^2.
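
Why the duplicate check is quadratic can be sketched as below. The function names are hypothetical and this is not the scheduler's actual code; it only assumes that pending tasks are kept in a list-like buffer:

```python
def add_pending_with_check(pending, index):
    # Membership test on a list is a linear scan, so adding n tasks one by
    # one costs O(n^2) comparisons in total.
    if index not in pending:
        pending.append(index)


def add_pending_without_check(pending, index):
    # Plain append is O(1), so adding n tasks costs O(n) in total.
    # Assumption: duplicates are harmless because consumers skip tasks
    # that have already been launched.
    pending.append(index)


checked, unchecked = [], []
for i in range(1000):
    add_pending_with_check(checked, i)
    add_pending_without_check(unchecked, i)
```

When every index is added only once, both variants build the same list; only the cost differs.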

Author: Sital Kedia <skedia@fb.com>

Closes #11167 from sitalkedia/fix_stuck_driver and squashes the following commits:

3fe1af8 [Sital Kedia] [SPARK-13279] Remove unnecessary duplicate check in addPendingTask function
2016-02-11 13:28:14 -08:00
jayadevanmurali 0d50a22084 [SPARK-12982][SQL] Add table name validation in temp table registration
Add table name validation at temp table creation.

Author: jayadevanmurali <jayadevan.m@tcs.com>

Closes #11051 from jayadevanmurali/branch-0.2-SPARK-12982.
2016-02-11 21:21:03 +01:00
Liang-Chi Hsieh e31c80737b [SPARK-13277][SQL] ANTLR ignores other rule using the USING keyword
JIRA: https://issues.apache.org/jira/browse/SPARK-13277

There is an ANTLR warning during compilation:

    warning(200): org/apache/spark/sql/catalyst/parser/SparkSqlParser.g:938:7:
    Decision can match input such as "KW_USING Identifier" using multiple alternatives: 2, 3

    As a result, alternative(s) 3 were disabled for that input

This patch fixes it.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #11168 from viirya/fix-parser-using.
2016-02-11 21:09:44 +01:00
Tathagata Das 219a74a7c2 [STREAMING][TEST] Fix flaky streaming.FailureSuite
In some corner cases, the test suite failed to shut down the SparkContext, causing cascading failures. This fix does two things:
- Makes sure no SparkContext is active after every test
- Makes sure the StreamingContext is always shut down (prevents leaking of StreamingContexts as well, just in case)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #11166 from tdas/fix-failuresuite.
2016-02-11 10:10:36 -08:00
Alex Bozarth 13c17cbb05 [SPARK-13124][WEB UI] Fixed CSS and JS issues caused by addition of JQuery DataTables
Made sure the old tables continue to use the old css and the new DataTables use the new css. Also fixed it so the Safari Web Inspector doesn't throw errors when on the new DataTables pages.

Author: Alex Bozarth <ajbozart@us.ibm.com>

Closes #11038 from ajbozarth/spark13124.
2016-02-11 08:50:27 -06:00
Junyang f9ae99fee1 [SPARK-13074][CORE] Add JavaSparkContext. getPersistentRDDs method
The "getPersistentRDDs()" is a useful API of SparkContext to get cached RDDs. However, the JavaSparkContext does not have this API.

Add a simple getPersistentRDDs() to get java.util.Map<Integer, JavaRDD> for Java users.

Author: Junyang <fly.shenjy@gmail.com>

Closes #10978 from flyjy/master.
2016-02-11 09:33:11 +00:00
Sasaki Toru c2f21d8898 [SPARK-13264][DOC] Removed multi-byte characters in spark-env.sh.template
spark-env.sh.template contains multi-byte characters; this PR removes them.

Author: Sasaki Toru <sasakitoa@nttdata.co.jp>

Closes #11149 from sasakitoa/remove_multibyte_in_sparkenv.
2016-02-11 09:30:36 +00:00
Nong Li 18bcbbdd84 [SPARK-13270][SQL] Remove extra new lines in whole stage codegen and include pipeline plan in comments.
Author: Nong Li <nong@databricks.com>

Closes #11155 from nongli/spark-13270.
2016-02-10 23:52:19 -08:00
gatorsmile e88bff1279 [SPARK-13235][SQL] Removed an Extra Distinct from the Plan when Using Union in SQL
Currently, the parser adds two `Distinct` operators to the plan if we use `Union` or `Union Distinct` in SQL. This PR removes the extra `Distinct` from the plan.

For example, before the fix, the following query has a plan with two `Distinct` operators:
```scala
sql("select * from t0 union select * from t0").explain(true)
```
```
== Parsed Logical Plan ==
'Project [unresolvedalias(*,None)]
+- 'Subquery u_2
   +- 'Distinct
      +- 'Project [unresolvedalias(*,None)]
         +- 'Subquery u_1
            +- 'Distinct
               +- 'Union
                  :- 'Project [unresolvedalias(*,None)]
                  :  +- 'UnresolvedRelation `t0`, None
                  +- 'Project [unresolvedalias(*,None)]
                     +- 'UnresolvedRelation `t0`, None

== Analyzed Logical Plan ==
id: bigint
Project [id#16L]
+- Subquery u_2
   +- Distinct
      +- Project [id#16L]
         +- Subquery u_1
            +- Distinct
               +- Union
                  :- Project [id#16L]
                  :  +- Subquery t0
                  :     +- Relation[id#16L] ParquetRelation
                  +- Project [id#16L]
                     +- Subquery t0
                        +- Relation[id#16L] ParquetRelation

== Optimized Logical Plan ==
Aggregate [id#16L], [id#16L]
+- Aggregate [id#16L], [id#16L]
   +- Union
      :- Project [id#16L]
      :  +- Relation[id#16L] ParquetRelation
      +- Project [id#16L]
         +- Relation[id#16L] ParquetRelation
```
After the fix, the plan no longer contains the extra `Distinct`:
```
== Parsed Logical Plan ==
'Project [unresolvedalias(*,None)]
+- 'Subquery u_1
   +- 'Distinct
      +- 'Union
         :- 'Project [unresolvedalias(*,None)]
         :  +- 'UnresolvedRelation `t0`, None
         +- 'Project [unresolvedalias(*,None)]
            +- 'UnresolvedRelation `t0`, None

== Analyzed Logical Plan ==
id: bigint
Project [id#17L]
+- Subquery u_1
   +- Distinct
      +- Union
         :- Project [id#16L]
         :  +- Subquery t0
         :     +- Relation[id#16L] ParquetRelation
         +- Project [id#16L]
            +- Subquery t0
               +- Relation[id#16L] ParquetRelation

== Optimized Logical Plan ==
Aggregate [id#17L], [id#17L]
+- Union
   :- Project [id#16L]
   :  +- Relation[id#16L] ParquetRelation
   +- Project [id#16L]
      +- Relation[id#16L] ParquetRelation
```

Author: gatorsmile <gatorsmile@gmail.com>

Closes #11120 from gatorsmile/unionDistinct.
2016-02-11 08:40:27 +01:00
Herman van Hovell 1842c55d89 [SPARK-13276] Catch bad characters at the end of a Table Identifier/Expression string
The parser currently parses the following strings without a hitch:
* Table Identifier:
  * `a.b.c` should fail, but results in the following table identifier `a.b`
  * `table!#` should fail, but results in the following table identifier `table`
* Expression
  * `1+2 r+e` should fail, but results in the following expression `1 + 2`

This PR fixes this by adding terminated rules for both expression parsing and table identifier parsing.
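
The difference between an unterminated and a terminated rule can be sketched with a regular expression. The identifier grammar below is a simplified assumption made for illustration (at most one qualifier), not the actual ANTLR grammar, and the function names are hypothetical:

```python
import re

# Hypothetical, simplified grammar: an identifier with at most one
# qualifier, e.g. `table` or `db.table`.
TABLE_IDENTIFIER = re.compile(r"[A-Za-z_]\w*(\.[A-Za-z_]\w*)?")


def parse_prefix(text):
    # Unterminated rule: accepts the longest valid prefix and silently
    # ignores whatever follows it.
    m = TABLE_IDENTIFIER.match(text)
    return m.group(0) if m else None


def parse_terminated(text):
    # Terminated rule: the whole input must be consumed, which is what an
    # explicit end-of-input token enforces in a grammar.
    m = TABLE_IDENTIFIER.fullmatch(text)
    if m is None:
        raise ValueError("unexpected trailing input in %r" % text)
    return m.group(0)
```

Here `parse_prefix("a.b.c")` quietly returns `a.b`, mirroring the bug described above, while `parse_terminated("a.b.c")` raises an error.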

cc cloud-fan (we discussed this in https://github.com/apache/spark/pull/10649) jayadevanmurali (this causes your PR https://github.com/apache/spark/pull/11051 to fail)

Author: Herman van Hovell <hvanhovell@questtec.nl>

Closes #11159 from hvanhovell/SPARK-13276.
2016-02-11 08:30:58 +01:00
Davies Liu 8f744fe3d9 [SPARK-13234] [SQL] remove duplicated SQL metrics
Many SQL operators have metrics for both input and output rows. The number of input rows of an operator should be exactly the number of output rows of its child, so metrics for output rows alone are enough.

Now that whole stage codegen has improved performance, the overhead of SQL metrics is no longer trivial, so we should avoid it where it is not necessary.

This PR removes all the SQL metrics for the number of input rows and adds an SQL metric for the number of output rows to every LeafNode. It also removes the SQL metrics from operators whose input and output have the same number of rows (for example, Projection, where we may not need them).

The new SQL UI will look like:

![metrics](https://cloud.githubusercontent.com/assets/40902/12965227/63614e5e-d009-11e5-88b3-84fea04f9c20.png)

Author: Davies Liu <davies@databricks.com>

Closes #11163 from davies/remove_metrics.
2016-02-10 23:23:01 -08:00
Davies Liu b5761d150b [SPARK-12706] [SQL] grouping() and grouping_id()
grouping() returns whether a column is aggregated or not; grouping_id() returns the aggregation levels.

grouping()/grouping_id() can be used with window functions, but they do not work in the HAVING/SORT clause yet; that will be fixed by another PR.

The GROUPING__ID/grouping_id() in Hive is wrong (according to its docs), and we implemented it wrongly as well. This PR changes that to match the behavior in most databases (and the Hive docs).
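
The relationship between the two functions can be sketched as follows, assuming the bit layout used by most databases, in which the first argument is the most significant bit. The helper names and the set-based representation of a grouping set are illustrative assumptions, not Spark's implementation:

```python
def grouping(column, grouped_by):
    # 1 if the column is aggregated away (not part of the current grouping
    # set), 0 if it is one of the grouping columns.
    return 0 if column in grouped_by else 1


def grouping_id(columns, grouped_by):
    # Concatenate the grouping() bits, treating the first column as the
    # most significant bit (assumed layout, see the note above).
    result = 0
    for column in columns:
        result = (result << 1) | grouping(column, grouped_by)
    return result
```

For columns `(a, b)`, grouping by both yields 0, by `a` only yields 1, by `b` only yields 2, and the grand total row yields 3.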

Author: Davies Liu <davies@databricks.com>

Closes #10677 from davies/grouping.
2016-02-10 20:13:38 -08:00
gatorsmile 0f09f02269 [SPARK-13205][SQL] SQL Generation Support for Self Join
This PR addresses two issues:
  - Self join does not work in SQL Generation
  - When creating new instances for `LogicalRelation`, `metastoreTableIdentifier` is lost.

liancheng Could you please review the code changes? Thank you!

Author: gatorsmile <gatorsmile@gmail.com>

Closes #11084 from gatorsmile/selfJoinInSQLGen.
2016-02-11 11:08:21 +08:00
gatorsmile 663cc400f3 [SPARK-12725][SQL] Resolving Name Conflicts in SQL Generation and Name Ambiguity Caused by Internally Generated Expressions
Some analysis rules generate aliases or auxiliary attribute references with the same name but different expression IDs. For example, `ResolveAggregateFunctions` introduces `havingCondition` and `aggOrder`, and `DistinctAggregationRewriter` introduces `gid`.

This is OK for normal query execution since these attribute references get expression IDs. However, it's troublesome when converting resolved query plans back to SQL query strings since expression IDs are erased.

Here's an example Spark 1.6.0 snippet for illustration:
```scala
sqlContext.range(10).select('id as 'a, 'id as 'b).registerTempTable("t")
sqlContext.sql("SELECT SUM(a) FROM t GROUP BY a, b ORDER BY COUNT(a), COUNT(b)").explain(true)
```
The above code produces the following resolved plan:
```
== Analyzed Logical Plan ==
_c0: bigint
Project [_c0#101L]
+- Sort [aggOrder#102L ASC,aggOrder#103L ASC], true
   +- Aggregate [a#47L,b#48L], [(sum(a#47L),mode=Complete,isDistinct=false) AS _c0#101L,(count(a#47L),mode=Complete,isDistinct=false) AS aggOrder#102L,(count(b#48L),mode=Complete,isDistinct=false) AS aggOrder#103L]
      +- Subquery t
         +- Project [id#46L AS a#47L,id#46L AS b#48L]
            +- LogicalRDD [id#46L], MapPartitionsRDD[44] at range at <console>:26
```
Here we can see that both aggregate expressions in `ORDER BY` are extracted into an `Aggregate` operator, and both of them are named `aggOrder` with different expression IDs.

The solution is to automatically add the expression IDs into the attribute name for the Alias and AttributeReferences that are generated by Analyzer in SQL Generation.
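
A minimal sketch of the idea, using a hypothetical `Attribute` tuple and a made-up `name_id` naming scheme (the exact format used by the PR is not stated here):

```python
from collections import namedtuple

# Hypothetical stand-in for an attribute reference: a name, an expression
# ID, and a flag marking names generated internally by the analyzer.
Attribute = namedtuple("Attribute", ["name", "expr_id", "generated"])


def sql_name(attr):
    # Internally generated attributes get their expression ID appended, so
    # two `aggOrder` columns stay distinct once IDs are erased from the SQL.
    # The `name_id` format is an assumption made for illustration.
    if attr.generated:
        return "%s_%d" % (attr.name, attr.expr_id)
    return attr.name


order_by = [Attribute("aggOrder", 102, True), Attribute("aggOrder", 103, True)]
names = [sql_name(a) for a in order_by]
```

User-visible columns such as `a` keep their plain names, so only analyzer-generated names are affected.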

This PR also resolves another issue: users could use the same names as the internally generated ones. The duplicate names should not cause name ambiguity. When resolving a column, Catalyst should not pick a column that is internally generated.

Could you review the solution? marmbrus liancheng

I did not set the newly added flag for all the aliases and attribute references generated by the Analyzer. Please let me know if I should do so. Thank you!

Author: gatorsmile <gatorsmile@gmail.com>

Closes #11050 from gatorsmile/namingConflicts.
2016-02-11 10:44:39 +08:00
raela 719973b05e [SPARK-13274] Fix Aggregator Links on GroupedDataset Scala API
Update Aggregator links to point to #org.apache.spark.sql.expressions.Aggregator

Author: raela <raela@databricks.com>

Closes #11158 from raelawang/master.
2016-02-10 17:00:54 -08:00
Tathagata Das 0902e20288 [SPARK-13146][SQL] Management API for continuous queries
### Management API for Continuous Queries

**API for getting status of each query**
- Whether active or not
- Unique name of each query
- Status of the sources and sinks
- Exceptions

**API for managing each query**
- Immediately stop an active query
- Waiting for a query to be terminated, correctly or with error

**API for managing multiple queries**
- Listing all active queries
- Getting an active query by name
- Waiting for any one of the active queries to be terminated

**API for listening to query life cycle events**
- ContinuousQueryListener API for query start, progress and termination events.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #11030 from tdas/streaming-df-management-api.
2016-02-10 16:45:06 -08:00
Sean Owen 29c547303f [SPARK-12414][CORE] Remove closure serializer
Remove spark.closure.serializer option and use JavaSerializer always

CC andrewor14 rxin I see there's a discussion in the JIRA but just thought I'd offer this for a look at what the change would be.

Author: Sean Owen <sowen@cloudera.com>

Closes #11150 from srowen/SPARK-12414.
2016-02-10 13:34:53 -08:00
Takeshi YAMAMURO 5947fa8fa1 [SPARK-13057][SQL] Add benchmark codes and the performance results for implemented compression schemes for InMemoryRelation
This PR adds benchmark code for in-memory cache compression to make future development and discussion smoother.

Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>

Closes #10965 from maropu/ImproveColumnarCache.
2016-02-10 13:34:02 -08:00
Josh Rosen ce3bdaeeff [HOTFIX] Fix Scala 2.10 build break in TakeOrderedAndProjectSuite.
2016-02-10 12:44:40 -08:00
zhuol 4b80026f07 [SPARK-13126] fix the right margin of history page.
The right margin of the history page is a little bit off. This is a simple fix for that issue.

Author: zhuol <zhuol@yahoo-inc.com>

Closes #11029 from zhuoliu/13126.
2016-02-10 14:23:41 -06:00
Alex Bozarth 39cc620e9c [SPARK-13163][WEB UI] Column width on new History Server DataTables not getting set correctly
The column width for the new DataTables now adjusts for the current page rather than being hard-coded for the entire table's data.

Author: Alex Bozarth <ajbozart@us.ibm.com>

Closes #11057 from ajbozarth/spark13163.
2016-02-10 14:07:50 -06:00
Josh Rosen 5cf20598ce [SPARK-13254][SQL] Fix planning of TakeOrderedAndProject operator
The patch for SPARK-8964 ("use Exchange to perform shuffle in Limit" / #7334) inadvertently broke the planning of the TakeOrderedAndProject operator: because ReturnAnswer was the new root of the query plan, the TakeOrderedAndProject rule was unable to match before BasicOperators.

This patch fixes this by moving the `TakeOrderedAndCollect` and `CollectLimit` rules into the same strategy.

In addition, I made changes to the TakeOrderedAndProject operator in order to make its `doExecute()` method lazy and added a new TakeOrderedAndProjectSuite which tests the new code path.

/cc davies and marmbrus for review.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #11145 from JoshRosen/take-ordered-and-project-fix.
2016-02-10 11:00:38 -08:00
Michael Gummelt 80cb963ad9 [SPARK-5095][MESOS] Support launching multiple mesos executors in coarse grained mesos mode.
This is the next iteration of tnachen's previous PR: https://github.com/apache/spark/pull/4027

In that PR, we resolved with andrewor14 and pwendell to implement the Mesos scheduler's support of `spark.executor.cores` to be consistent with YARN and Standalone.  This PR implements that resolution.

This PR implements two high-level features.  These two features are co-dependent, so they're both implemented here:
- Mesos support for spark.executor.cores
- Multiple executors per slave

We at Mesosphere have been working with Typesafe on a Spark/Mesos integration test suite: https://github.com/typesafehub/mesos-spark-integration-tests, which passes for this PR.

The contribution is my original work and I license the work to the project under the project's open source license.

Author: Michael Gummelt <mgummelt@mesosphere.io>

Closes #10993 from mgummelt/executor_sizing.
2016-02-10 10:53:33 -08:00
Sean Owen c0b71e0b8f [SPARK-9307][CORE][SPARK] Logging: Make it either stable or private
Make Logging private[spark]. Pretty much all there is to it.

Author: Sean Owen <sowen@cloudera.com>

Closes #11103 from srowen/SPARK-9307.
2016-02-10 11:02:00 +00:00
tedyu e834e421de [SPARK-13203] Add scalastyle rule banning use of mutable.SynchronizedBuffer
andrewor14
Please take a look

Author: tedyu <yuzhihong@gmail.com>

Closes #11134 from tedyu/master.
2016-02-10 10:58:41 +00:00
Jon Maurer 2ba9b6a2df [SPARK-11518][DEPLOY, WINDOWS] Handle spaces in Windows command scripts
Author: Jon Maurer <tritab@gmail.com>
Author: Jonathan Maurer <jmaurer@Jonathans-MacBook-Pro.local>

Closes #10789 from tritab/cmd_updates.
2016-02-10 09:54:22 +00:00
Gábor Lipták 9269036d8c [SPARK-11565] Replace deprecated DigestUtils.shaHex call
Author: Gábor Lipták <gliptak@gmail.com>

Closes #9532 from gliptak/SPARK-11565.
2016-02-10 09:52:35 +00:00
Shixiong Zhu b385ce3882 [SPARK-13149][SQL] Add FileStreamSource
`FileStreamSource` is an implementation of `org.apache.spark.sql.execution.streaming.Source`. It takes advantage of the existing `HadoopFsRelationProvider` to support various file formats. It remembers the files in each batch and stores them in metadata files so that they can be recovered on restart. The metadata files are stored in the file system. There will be a further PR to clean up the metadata files periodically.

This is based on the initial work from marmbrus.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #11034 from zsxwing/stream-df-file-source.
2016-02-09 18:50:06 -08:00
Takeshi YAMAMURO 6f710f9fd4 [SPARK-12476][SQL] Implement JdbcRelation#unhandledFilters for removing unnecessary Spark Filter
Input: SELECT * FROM jdbcTable WHERE col0 = 'xxx'

Current plan:
```
== Optimized Logical Plan ==
Project [col0#0,col1#1]
+- Filter (col0#0 = xxx)
   +- Relation[col0#0,col1#1] JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;2ac7c683,{user=maropu, password=, driver=org.postgresql.Driver})

== Physical Plan ==
+- Filter (col0#0 = xxx)
   +- Scan JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;2ac7c683,{user=maropu, password=, driver=org.postgresql.Driver})[col0#0,col1#1] PushedFilters: [EqualTo(col0,xxx)]
```

This patch enables the plan below:
```
== Optimized Logical Plan ==
Project [col0#0,col1#1]
+- Filter (col0#0 = xxx)
   +- Relation[col0#0,col1#1] JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;2ac7c683,{user=maropu, password=, driver=org.postgresql.Driver})

== Physical Plan ==
Scan JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;2ac7c683,{user=maropu, password=, driver=org.postgresql.Driver})[col0#0,col1#1] PushedFilters: [EqualTo(col0,xxx)]
```
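
The idea behind the change can be sketched as follows: the relation reports which pushed filters it cannot evaluate itself, and only those need a Spark-side `Filter` on top of the scan. The tuple representation of filters and the `supported_ops` parameter are hypothetical simplifications, not the actual data source API:

```python
def unhandled_filters(filters, supported_ops):
    # Return the filters the data source cannot evaluate itself; the engine
    # only needs to re-apply these after the scan.
    # Each filter is a hypothetical (operator, column, value) tuple.
    return [f for f in filters if f[0] not in supported_ops]


pushed = [("EqualTo", "col0", "xxx")]
remaining = unhandled_filters(pushed, {"EqualTo"})

# An empty result means no extra Filter operator is needed above the scan.
needs_spark_filter = bool(remaining)
```

If the source does not support an operator, the corresponding filter is returned and still has to be evaluated by Spark.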

Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>

Closes #10427 from maropu/RemoveFilterInJdbcScan.
2016-02-10 09:45:13 +08:00
Liang-Chi Hsieh 9267bc68fa [SPARK-10524][ML] Use the soft prediction to order categories' bins
JIRA: https://issues.apache.org/jira/browse/SPARK-10524

Currently we use the hard prediction (`ImpurityCalculator.predict`) to order categories' bins. But we should use the soft prediction.
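
Why the soft prediction gives a better ordering can be sketched for the binary case. The per-category label counts below are made up, and the helper functions are illustrative assumptions rather than the `ImpurityCalculator` API:

```python
def hard_prediction(counts):
    # counts = [n_class0, n_class1]; the hard prediction is the majority
    # label, so it can only take the values 0.0 or 1.0.
    return float(counts.index(max(counts)))


def soft_prediction(counts):
    # The soft prediction is the fraction of class-1 examples, which keeps
    # the information that the hard prediction throws away.
    total = sum(counts)
    return counts[1] / total if total else 0.0


# Made-up label counts per category value.
stats = {"a": [6, 4], "b": [9, 1], "c": [2, 8]}

hard_order = sorted(stats, key=lambda c: hard_prediction(stats[c]))
soft_order = sorted(stats, key=lambda c: soft_prediction(stats[c]))
```

Under the hard prediction, `a` and `b` tie at 0.0 and their relative order is arbitrary; the soft prediction separates them (0.4 vs 0.1).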

Author: Liang-Chi Hsieh <viirya@gmail.com>
Author: Liang-Chi Hsieh <viirya@appier.com>
Author: Joseph K. Bradley <joseph@databricks.com>

Closes #8734 from viirya/dt-soft-centroids.
2016-02-09 17:10:55 -08:00
Davies Liu 0e5ebac3c1 [SPARK-12950] [SQL] Improve lookup of BytesToBytesMap in aggregate
This PR improves the lookup of BytesToBytesMap by:

1. Generating code to calculate the hash code of the grouping keys.

2. Not using MemoryLocation; the baseObject and offset for key and value are fetched directly (removing the indirection).

Author: Davies Liu <davies@databricks.com>

Closes #11010 from davies/gen_map.
2016-02-09 16:41:21 -08:00
Shixiong Zhu fae830d158 [SPARK-13245][CORE] Call shuffleMetrics methods only in one thread for ShuffleBlockFetcherIterator
Call shuffleMetrics's incRemoteBytesRead and incRemoteBlocksFetched when polling FetchResult from `results` so as to always use shuffleMetrics in one thread.

Also fix a race condition that could cause a memory leak.
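
The single-threaded metrics pattern can be sketched as below: fetcher threads only enqueue results, and the metrics object is updated by the one thread that polls the queue. The class and function names are hypothetical, not Spark's:

```python
import queue
import threading


class Metrics:
    """Hypothetical metrics holder; deliberately not thread-safe."""

    def __init__(self):
        self.remote_bytes_read = 0
        self.remote_blocks_fetched = 0


def fetch(results, sizes):
    # Fetcher threads only enqueue results; they never touch the metrics.
    for size in sizes:
        results.put(size)


results = queue.Queue()
workers = [
    threading.Thread(target=fetch, args=(results, [10, 20, 30]))
    for _ in range(4)
]
for w in workers:
    w.start()
for w in workers:
    w.join()

# Only this (consumer) thread updates the metrics while polling results.
metrics = Metrics()
while not results.empty():
    size = results.get()
    metrics.remote_bytes_read += size
    metrics.remote_blocks_fetched += 1
```

Because all updates happen on the polling thread, the metrics object needs no locking of its own.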

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #11138 from zsxwing/SPARK-13245.
2016-02-09 16:31:00 -08:00
Wenchen Fan 7fe4fe630a [SPARK-12888] [SQL] [FOLLOW-UP] benchmark the new hash expression
Adds the benchmark results as comments.

The codegen version is slower than the interpreted version for the `simple` case for 3 reasons:

1. The codegen version uses a more complex hash algorithm than the interpreted version, i.e. `Murmur3_x86_32.hashInt` vs [simple multiplication and addition](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/rows.scala#L153).
2. The codegen version writes the hash value to a row first and then reads it out. I tried to create a `GenerateHasher` that can generate code to return the hash value directly and got about a 60% speedup for the `simple` case. Is it worth it?
3. The row in the `simple` case has only one int field, so the cost of runtime reflection may be removed by branch prediction, which makes the interpreted version faster.

The `array` case is also slow for similar reasons, e.g. the array elements are of the same type, so the interpreted version can probably get rid of runtime reflection through branch prediction.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #10917 from cloud-fan/hash-benchmark.
2016-02-09 13:06:36 -08:00
Luciano Resende 2dbb916440 [SPARK-13189] Cleanup build references to Scala 2.10
Author: Luciano Resende <lresende@apache.org>

Closes #11092 from lresende/SPARK-13189.
2016-02-09 11:56:25 -08:00
Steve Loughran 34d0b70b30 [SPARK-12807][YARN] Spark External Shuffle not working in Hadoop clusters with Jackson 2.2.3
Patch to

1. Shade jackson 2.x in spark-yarn-shuffle JAR: core, databind, annotation
2. Use maven antrun to verify the JAR has the renamed classes

The verification is Maven-based, so I don't know if the verification phase kicks in on an SBT/Jenkins build. It will on a `mvn install`.

Author: Steve Loughran <stevel@hortonworks.com>

Closes #10780 from steveloughran/stevel/patches/SPARK-12807-master-shuffle.
2016-02-09 11:01:47 -08:00
Sean Owen 68ed3632c5 [SPARK-13170][STREAMING] Investigate replacing SynchronizedQueue as it is deprecated
Replace SynchronizedQueue with synchronized access to a Queue.
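
The general pattern, a plain queue whose every access goes through an explicit lock, can be sketched like this. `LockedQueue` is a hypothetical class for illustration and says nothing about how the Scala code is actually structured:

```python
import threading
from collections import deque


class LockedQueue:
    """Plain queue whose every access is guarded by one explicit lock."""

    def __init__(self):
        self._items = deque()
        self._lock = threading.Lock()

    def enqueue(self, item):
        with self._lock:
            self._items.append(item)

    def dequeue(self):
        # Returns None when the queue is empty.
        with self._lock:
            return self._items.popleft() if self._items else None

    def size(self):
        with self._lock:
            return len(self._items)


q = LockedQueue()
producers = [
    threading.Thread(target=lambda: [q.enqueue(i) for i in range(100)])
    for _ in range(4)
]
for p in producers:
    p.start()
for p in producers:
    p.join()
```

Keeping the lock next to the data makes the synchronization visible at each call site instead of hiding it in a deprecated collection type.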

Author: Sean Owen <sowen@cloudera.com>

Closes #11111 from srowen/SPARK-13170.
2016-02-09 11:23:29 +00:00
Iulian Dragos e30121afac [SPARK-13086][SHELL] Use the Scala REPL settings, to enable things like -i file.
Now:

```
$ bin/spark-shell -i test.scala
NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes ahead of assembly.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
16/01/29 17:37:38 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/01/29 17:37:39 INFO Main: Created spark context..
Spark context available as sc (master = local[*], app id = local-1454085459000).
16/01/29 17:37:39 INFO Main: Created sql context..
SQL context available as sqlContext.
Loading test.scala...
hello

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.0.0-SNAPSHOT
      /_/

Using Scala version 2.11.7 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_45)
Type in expressions to have them evaluated.
Type :help for more information.
```

Author: Iulian Dragos <jaguarul@gmail.com>

Closes #10984 from dragos/issue/repl-eval-file.
2016-02-09 09:05:22 +00:00
sachin aggarwal d9ba4d27f4 [SPARK-13177][EXAMPLES] Update ActorWordCount example to not directly use low level linked list as it is deprecated.
Author: sachin aggarwal <different.sachin@gmail.com>

Closes #11113 from agsachin/master.
2016-02-09 08:52:58 +00:00
Sebastián Ramírez c882ec57de [SPARK-13040][DOCS] Update JDBC deprecated SPARK_CLASSPATH documentation
Update JDBC documentation based on http://stackoverflow.com/a/30947090/219530 as SPARK_CLASSPATH is deprecated.

Also, this is the approach that actually worked; it did not work with SPARK_CLASSPATH or with --jars alone.

This would solve issue: https://issues.apache.org/jira/browse/SPARK-13040

Author: Sebastián Ramírez <tiangolo@gmail.com>

Closes #10948 from tiangolo/patch-docs-jdbc.
2016-02-09 08:49:34 +00:00
Holden Karau ce83fe9756 [SPARK-13201][SPARK-13200] Deprecation warning cleanups: KMeans & MFDataGenerator
KMeans:
Make a private non-deprecated version of the setRuns API so that we can call it from the Python API without deprecation warnings in our own build. Also use it internally when being called from train. Add a logWarning for non-1 values.

MFDataGenerator:
Apparently we are calling round on an integer which now in Scala 2.11 results in a warning (it didn't make any sense before either). Figure out if this is a mistake we can just remove or if we got the types wrong somewhere.

I put these two together since they are both deprecation fixes in MLlib and pretty small, but I can split them up if we would prefer it that way.

Author: Holden Karau <holden@us.ibm.com>

Closes #11112 from holdenk/SPARK-13201-non-deprecated-setRuns-SPARK-mathround-integer.
2016-02-09 08:47:28 +00:00