Commit graph

15041 commits

Author SHA1 Message Date
Marcelo Vanzin a59a357cae [SPARK-3873][MLLIB] Import order fixes.
A slight adjustment to the checker configuration was needed; a handful of
warnings are still left, but those are caused by a bug in
the checker that I'll fix separately (before enabling errors for the
checker, of course).

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #10535 from vanzin/SPARK-3873-mllib.
2015-12-31 23:48:55 -08:00
Liang-Chi Hsieh c9dbfcc653 [SPARK-11743][SQL] Move the test for arrayOfUDT
A follow-up PR for #9712. Moves the test for arrayOfUDT.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #10538 from viirya/move-udt-test.
2015-12-31 23:48:05 -08:00
Josh Rosen 5adec63a92 [SPARK-10359][PROJECT-INFRA] Multiple fixes to dev/test-dependencies.sh script
This patch includes multiple fixes for the `dev/test-dependencies.sh` script (which was introduced in #10461):

- Use `build/mvn --force` instead of `mvn` in one additional place.
- Explicitly set a zero exit code on success.
- Set `LC_ALL=C` to make `sort` results agree across machines (see https://stackoverflow.com/questions/28881/).
- Set `should_run_build_tests=True` for `build` module (this somehow got lost).

Author: Josh Rosen <joshrosen@databricks.com>

Closes #10543 from JoshRosen/dep-script-fixes.
2015-12-31 20:23:19 -08:00
Marcelo Vanzin efb10cc9ad [SPARK-3873][STREAMING] Import order fixes for streaming.
Also included a few miscellaneous other modules that had very few violations.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #10532 from vanzin/SPARK-3873-streaming.
2015-12-31 01:34:13 -08:00
Yin Huai 5cdecb1841 [SPARK-12039][SQL] Re-enable HiveSparkSubmitSuite's SPARK-9757 Persist Parquet relation with decimal column
https://issues.apache.org/jira/browse/SPARK-12039

Since we no longer support hadoop1, we can re-enable this test in master.

Author: Yin Huai <yhuai@databricks.com>

Closes #10533 from yhuai/SPARK-12039-enable.
2015-12-31 01:33:21 -08:00
Shixiong Zhu 4f5a24d7e7 [SPARK-7995][SPARK-6280][CORE] Remove AkkaRpcEnv and remove systemName from setupEndpointRef
### Remove AkkaRpcEnv

Keep `SparkEnv.actorSystem` because Streaming still uses it. Will remove it and AkkaUtils after refactoring the Streaming actorStream API.

### Remove systemName
There are 2 places using `systemName`:
* `RpcEnvConfig.name`. Although it's used as `systemName` in `AkkaRpcEnv`, `NettyRpcEnv` uses it as the service name when outputting the log `Successfully started service *** on port ***`. Since the service name in the log is useful, I kept `RpcEnvConfig.name`.
* `def setupEndpointRef(systemName: String, address: RpcAddress, endpointName: String)`. Each `ActorSystem` has a `systemName`. Akka requires the `systemName` in its URI and will refuse a connection if it does not match. However, `NettyRpcEnv` doesn't use it, so we can remove `systemName` from `setupEndpointRef` since we are removing `AkkaRpcEnv`.

### Remove RpcEnv.uriOf

`uriOf` exists because Akka uses different URI formats depending on whether authentication is used, e.g., `akka.ssl.tcp...` and `akka.tcp://...`. But `NettyRpcEnv` uses the same format either way, so `uriOf` is no longer necessary after removing `AkkaRpcEnv`.
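
A hedged before/after sketch of a call site affected by the `setupEndpointRef` change (host, port, and endpoint name here are made up, not from the PR):

```scala
// before: the Akka-era signature needed the remote system's name
// rpcEnv.setupEndpointRef("sparkMaster", RpcAddress("host", 7077), "Master")
// after: NettyRpcEnv never used systemName, so it is dropped
// rpcEnv.setupEndpointRef(RpcAddress("host", 7077), "Master")
```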

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #10459 from zsxwing/remove-akka-rpc-env.
2015-12-31 00:15:55 -08:00
Davies Liu e6c77874b9 [SPARK-12585] [SQL] move numFields to constructor of UnsafeRow
Right now, numFields is passed in by pointTo(), and bitSetWidthInBytes is then calculated, which makes pointTo() a little heavy.

It should be part of the constructor of UnsafeRow.
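
A hedged before/after sketch of what this means at a call site (argument names are illustrative):

```scala
// before: numFields travels with every pointTo() call
// val row = new UnsafeRow()
// row.pointTo(baseObject, baseOffset, numFields, sizeInBytes)

// after: numFields is fixed once at construction time
// val row = new UnsafeRow(numFields)
// row.pointTo(baseObject, baseOffset, sizeInBytes)
```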

Author: Davies Liu <davies@databricks.com>

Closes #10528 from davies/numFields.
2015-12-30 22:16:37 -08:00
Reynold Xin 93b52abca7 House cleaning: close old pull requests.
Closes #5400
Closes #5408
Closes #5423
Closes #5668
Closes #6757
Closes #6745
Closes #6613
2015-12-30 18:54:03 -08:00
Reynold Xin c642c3a210 Closes #10386 since it was superseded by #10468. 2015-12-30 18:50:30 -08:00
Reynold Xin 7b4452ba98 House cleaning: close open pull requests created before June 1st, 2015
Closes #5358
Closes #3744
Closes #3677
Closes #3536
Closes #3249
Closes #3221
Closes #2446
Closes #3794
Closes #3815
Closes #3816
Closes #3866
Closes #4286
Closes #5184
Closes #5170
Closes #5142
Closes #5025
Closes #5005
Closes #4897
Closes #4887
Closes #4849
Closes #4632
Closes #4622
Closes #4456
Closes #4449
Closes #4417
Closes #5483
Closes #5325
Closes #6545
Closes #6449
Closes #6433
Closes #6416
Closes #6403
Closes #6386
Closes #6263
Closes #6245
Closes #6213
Closes #6155
Closes #6133
Closes #6018
Closes #5978
Closes #5869
Closes #5852
Closes #5848
Closes #5754
Closes #5598
Closes #5503
Closes #4380
2015-12-30 18:49:17 -08:00
Reynold Xin be33a0cd3d [SPARK-12561] Remove JobLogger in Spark 2.0.
It was research code and has been deprecated since 1.0.0. No one really uses it since they can just use event logging.

Author: Reynold Xin <rxin@databricks.com>

Closes #10530 from rxin/SPARK-12561.
2015-12-30 18:28:08 -08:00
Marcelo Vanzin 9140d90743 [SPARK-3873][GRAPHX] Import order fixes.
There's one warning left, caused by a bug in the checker.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #10537 from vanzin/SPARK-3873-graphx.
2015-12-30 18:26:08 -08:00
Marcelo Vanzin fd33333138 [SPARK-3873][YARN] Fix import ordering.
Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #10536 from vanzin/SPARK-3873-yarn.
2015-12-30 18:20:27 -08:00
Reynold Xin ee8f8d3184 [SPARK-12588] Remove HttpBroadcast in Spark 2.0.
We switched to TorrentBroadcast in Spark 1.1, and HttpBroadcast has been undocumented since then. It's time to remove it in Spark 2.0.

Author: Reynold Xin <rxin@databricks.com>

Closes #10531 from rxin/SPARK-12588.
2015-12-30 18:07:07 -08:00
Herman van Hovell f76ee109d8 [SPARK-8641][SPARK-12455][SQL] Native Spark Window functions - Follow-up (docs & tests)
This PR is a follow-up for PR https://github.com/apache/spark/pull/9819. It adds documentation for the window functions and a couple of NULL tests.

The documentation was largely based on the documentation in (the source of) Hive and Presto:
* https://prestodb.io/docs/current/functions/window.html
* https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics

I am not sure if we need to add the licenses of these two projects to the licenses directory. They are both under the ASL. srowen any thoughts?

cc yhuai

Author: Herman van Hovell <hvanhovell@questtec.nl>

Closes #10402 from hvanhovell/SPARK-8641-docs.
2015-12-30 16:51:07 -08:00
Carson Wang b244297966 [SPARK-12399] Display correct error message when accessing REST API with an unknown app Id
I got an exception when accessing the below REST API with an unknown application Id.
`http://<server-url>:18080/api/v1/applications/xxx/jobs`
Instead of an exception, I expect an error message "no such app: xxx", similar to the error message shown when I access `/api/v1/applications/xxx`:
```
org.spark-project.guava.util.concurrent.UncheckedExecutionException: java.util.NoSuchElementException: no app with key xxx
	at org.spark-project.guava.cache.LocalCache$Segment.get(LocalCache.java:2263)
	at org.spark-project.guava.cache.LocalCache.get(LocalCache.java:4000)
	at org.spark-project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004)
	at org.spark-project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
	at org.apache.spark.deploy.history.HistoryServer.getSparkUI(HistoryServer.scala:116)
	at org.apache.spark.status.api.v1.UIRoot$class.withSparkUI(ApiRootResource.scala:226)
	at org.apache.spark.deploy.history.HistoryServer.withSparkUI(HistoryServer.scala:46)
	at org.apache.spark.status.api.v1.ApiRootResource.getJobs(ApiRootResource.scala:66)
```
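
A hedged, standalone sketch of the pattern such a fix would follow (class and function names here are illustrative, not Spark's internals):

```scala
class NotFoundException(msg: String) extends Exception(msg)

// Translate the cache miss into a "no such app" error instead of
// letting UncheckedExecutionException escape to the caller.
def lookupOrNotFound[T](appId: String)(lookup: String => T): T =
  try lookup(appId)
  catch {
    case e: Throwable
        if e.isInstanceOf[java.util.NoSuchElementException] ||
           e.getCause.isInstanceOf[java.util.NoSuchElementException] =>
      // the Guava cache in the trace above wraps the miss, hence the cause check
      throw new NotFoundException(s"no such app: $appId")
  }
```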

Author: Carson Wang <carson.wang@intel.com>

Closes #10352 from carsonwang/unknownAppFix.
2015-12-30 13:49:10 -08:00
Takeshi YAMAMURO 5c2682b0c8 [SPARK-12409][SPARK-12387][SPARK-12391][SQL] Support AND/OR/IN/LIKE push-down filters for JDBC
This is a rework of #10386 that adds more tests and LIKE push-down support.
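
A hedged Scala sketch of what such push-down compilation can look like — a hypothetical helper over the public `org.apache.spark.sql.sources` filters, not the PR's code:

```scala
import org.apache.spark.sql.sources._

// Compile a source filter, including AND/OR/IN/LIKE-style predicates,
// into a JDBC WHERE-clause fragment; None means "cannot be pushed down".
def compileFilter(f: Filter): Option[String] = f match {
  case EqualTo(attr, value)           => Some(s"$attr = '$value'")
  case In(attr, values)               => Some(values.map(v => s"'$v'").mkString(s"$attr IN (", ", ", ")"))
  case StringStartsWith(attr, prefix) => Some(s"$attr LIKE '$prefix%'")
  case And(left, right) =>
    for (l <- compileFilter(left); r <- compileFilter(right)) yield s"($l) AND ($r)"
  case Or(left, right) =>
    for (l <- compileFilter(left); r <- compileFilter(right)) yield s"($l) OR ($r)"
  case _ => None // inconvertible filters are simply not pushed down
}
```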

Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>

Closes #10468 from maropu/SupportMorePushdownInJdbc.
2015-12-30 13:34:37 -08:00
Josh Rosen 27a42c7108 [SPARK-10359] Enumerate dependencies in a file and diff against it for new pull requests
This patch adds a new build check which enumerates Spark's resolved runtime classpath and saves it to a file, then diffs against that file to detect whether pull requests have introduced dependency changes. The aim of this check is to make it simpler to reason about whether pull requests which modify the build have introduced new dependencies or changed transitive dependencies in a way that affects the final classpath.

This supplants the checks added in SPARK-4123 / #5093, which are currently disabled due to bugs.

This patch is based on pwendell's work in #8531.

Closes #8531.

Author: Josh Rosen <joshrosen@databricks.com>
Author: Patrick Wendell <patrick@databricks.com>

Closes #10461 from JoshRosen/SPARK-10359.
2015-12-30 12:47:42 -08:00
Holden Karau d1ca634db4 [SPARK-12300] [SQL] [PYSPARK] fix schema inference on local collections
Current schema inference for local Python collections halts as soon as there are no NullTypes. This differs from specifying a sampling ratio of 1.0 on a distributed collection, and can result in incomplete schema information.

Author: Holden Karau <holden@us.ibm.com>

Closes #10275 from holdenk/SPARK-12300-fix-schmea-inferance-on-local-collections.
2015-12-30 11:14:47 -08:00
Wenchen Fan aa48164a43 [SPARK-12495][SQL] use true as default value for propagateNull in NewInstance
In most cases we should propagate null when calling `NewInstance`; so far there is only one case where we should stop null propagation: creating a product/Java bean. So I think it makes more sense to propagate null by default.

This also fixes a bug when encoding a null array/map, first discovered in https://github.com/apache/spark/pull/10401

Author: Wenchen Fan <wenchen@databricks.com>

Closes #10443 from cloud-fan/encoder.
2015-12-30 10:56:08 -08:00
Neelesh Srinivas Salian 932cf44248 [SPARK-12263][DOCS] IllegalStateException: Memory can't be 0 for SPARK_WORKER_MEMORY without unit
Updated the Worker memory IllegalStateException message to indicate that values must be at least 1MB, not merely greater than 0, to help resolve this.
Requesting review

Author: Neelesh Srinivas Salian <nsalian@cloudera.com>

Closes #10483 from nssalian/SPARK-12263.
2015-12-30 11:14:13 +00:00
Reynold Xin 27af6157f9 Revert "[SPARK-12362][SQL][WIP] Inline Hive Parser"
This reverts commit b600bccf41 due to non-deterministic build breaks.
2015-12-30 00:08:44 -08:00
gatorsmile 4f75f785df [SPARK-12564][SQL] Improve missing column AnalysisException
```
org.apache.spark.sql.AnalysisException: cannot resolve 'value' given input columns text;
```

Let's put a `:` after `columns` and put the columns in `[]` so that they match the toString of DataFrame.

Author: gatorsmile <gatorsmile@gmail.com>

Closes #10518 from gatorsmile/improveAnalysisExceptionMsg.
2015-12-29 22:28:59 -08:00
Shixiong Zhu 7ab0e2289d [SPARK-12490][CORE] Limit the css style scope to fix the Streaming UI
#10441 broke the Streaming UI because of the new CSS style.

<img width="503" alt="screen shot 2015-12-29 at 4 49 04 pm" src="https://cloud.githubusercontent.com/assets/1000778/12044763/1efce0fe-ae4c-11e5-9f8b-39df08426bf8.png">

This PR just added a class for the new style and only applied them to the paged tables.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #10517 from zsxwing/fix-streaming-ui.
2015-12-29 19:54:10 -08:00
Nong Li b600bccf41 [SPARK-12362][SQL][WIP] Inline Hive Parser
This is a WIP. The PR has been taken over from nongli (see https://github.com/apache/spark/pull/10420). I have removed some additional dead code, and fixed a few issues which were caused by the fact that the inlined Hive parser is newer than the Hive parser we currently use in Spark.

I am submitting this PR in order to get some feedback and testing done. There is quite a bit of work to do:
- [ ] Get it to pass jenkins build/test.
- [ ] Acknowledge the Hive project for using their parser.
- [ ] Refactorings between HiveQl and the java classes.
  - [ ] Create our own ASTNode and integrate the current implicit extensions.
  - [ ] Move remaining ```SemanticAnalyzer``` and ```ParseUtils``` functionality to ```HiveQl```.
- [ ] Removing Hive dependencies from the parser. This will require some edits in the grammar files.
  - [ ] Introduce our own context which needs to contain a ```TokenRewriteStream```.
  - [ ] Add ```useSQL11ReservedKeywordsForIdentifier``` and ```allowQuotedId``` to the catalyst or sql configuration.
  - [ ] Remove ```HiveConf``` from grammar files &HiveQl, and pass in our own configuration.
- [ ] Moving the parser into sql/core.

cc nongli rxin

Author: Herman van Hovell <hvanhovell@questtec.nl>
Author: Nong Li <nong@databricks.com>
Author: Nong Li <nongli@gmail.com>

Closes #10509 from hvanhovell/SPARK-12362.
2015-12-29 18:47:41 -08:00
Reynold Xin 270a659584 [SPARK-12549][SQL] Take Option[Seq[DataType]] in UDF input type specification.
In Spark we allow UDFs to declare their expected input types in order to apply type coercion. The expected input type parameter takes a Seq[DataType] and uses Nil when no type coercion is applied. It makes more sense to take Option[Seq[DataType]] instead, so we can differentiate a no-arg function from a function with no expected input types specified.
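
A hedged illustration of the distinction (values only; the surrounding UDF registration API is omitted):

```scala
import org.apache.spark.sql.types._

val noCoercion: Option[Seq[DataType]] = None              // no expected input types declared
val zeroArgUdf: Option[Seq[DataType]] = Some(Nil)         // a genuinely no-arg function
val twoArgUdf: Option[Seq[DataType]] = Some(Seq(IntegerType, StringType))
```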

Author: Reynold Xin <rxin@databricks.com>

Closes #10504 from rxin/SPARK-12549.
2015-12-29 16:58:23 -08:00
Sean Owen be86268eb5 [SPARK-12349][ML] Fix typo in Spark version regex introduced in PR 10327
Sorry jkbradley
Ref: https://github.com/apache/spark/pull/10327#discussion_r48502942

Author: Sean Owen <sowen@cloudera.com>

Closes #10508 from srowen/SPARK-12349.2.
2015-12-29 16:32:26 -08:00
Hossein f6ecf14333 [SPARK-11199][SPARKR] Improve R context management story and add getOrCreate
* Changes api.r.SQLUtils to use `SQLContext.getOrCreate` instead of creating a new context (sketched after this list).
* Adds a simple test
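
A minimal sketch of what the first change might amount to (the function name is hypothetical; `SQLContext.getOrCreate` is a real API as of Spark 1.5):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

// Reuse the active SQLContext if one exists, instead of constructing a
// fresh one on every call from the R frontend.
def getSQLContext(sc: SparkContext): SQLContext =
  SQLContext.getOrCreate(sc) // previously: new SQLContext(sc)
```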

[SPARK-11199] #comment link with JIRA

Author: Hossein <hossein@databricks.com>

Closes #9185 from falaki/SPARK-11199.
2015-12-29 11:44:20 -08:00
Kazuaki Ishizaki 8e629b10cb [SPARK-12530][BUILD] Fix build break at Spark-Master-Maven-Snapshots from #1293
The compilation error was caused by string concatenations that are not constant.
Use a raw string literal to avoid the string concatenations.

https://amplab.cs.berkeley.edu/jenkins/view/Spark-Packaging/job/Spark-Master-Maven-Snapshots/1293/
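
A hedged Scala illustration of the described fix (not the failing code itself): a multi-line string built by concatenation is replaced with one raw literal.

```scala
// built from several pieces via concatenation
val viaConcat = "line1\n" +
  "line2\n" +
  "line3"

// a single raw (triple-quoted) literal
val viaRawLiteral =
  """line1
line2
line3"""
```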

Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>

Closes #10488 from kiszk/SPARK-12530.
2015-12-29 10:35:23 -08:00
Forest Fang d80cc90b55 [SPARK-12526][SPARKR] `ifelse`, `when`, `otherwise` unable to take Column as value
`ifelse`, `when`, and `otherwise` are unable to take `Column`-typed S4 objects as values.

For example:
```r
ifelse(lit(1) == lit(1), lit(2), lit(3))
ifelse(df$mpg > 0, df$mpg, 0)
```
will both fail with
```r
attempt to replicate an object of type 'environment'
```

The PR replaces `ifelse` calls with `if ... else ...` inside the function implementations to avoid attempts to vectorize (i.e., `rep()`). It remains to be discussed whether we should instead support vectorization in these functions for consistency, because `ifelse` in base R is vectorized, but I cannot foresee any scenarios in which these functions would need to be vectorized in SparkR.

For reference, added test cases which trigger failures:
```r
. Error: when(), otherwise() and ifelse() with column on a DataFrame ----------
error in evaluating the argument 'x' in selecting a method for function 'collect':
  error in evaluating the argument 'col' in selecting a method for function 'select':
  attempt to replicate an object of type 'environment'
Calls: when -> when -> ifelse -> ifelse

1: withCallingHandlers(eval(code, new_test_environment), error = capture_calls, message = function(c) invokeRestart("muffleMessage"))
2: eval(code, new_test_environment)
3: eval(expr, envir, enclos)
4: expect_equal(collect(select(df, when(df$a > 1 & df$b > 2, lit(1))))[, 1], c(NA, 1)) at test_sparkSQL.R:1126
5: expect_that(object, equals(expected, label = expected.label, ...), info = info, label = label)
6: condition(object)
7: compare(actual, expected, ...)
8: collect(select(df, when(df$a > 1 & df$b > 2, lit(1))))
Error: Test failures
Execution halted
```

Author: Forest Fang <forest.fang@outlook.com>

Closes #10481 from saurfang/spark-12526.
2015-12-29 12:45:24 +05:30
Takeshi YAMAMURO 73862a1eb9 [SPARK-11394][SQL] Throw IllegalArgumentException for unsupported types in postgresql
If a DataFrame has BYTE types, PostgreSQL throws an exception:
org.postgresql.util.PSQLException: ERROR: type "byte" does not exist
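
A hedged sketch of the fail-fast behavior described (the mapping function and type names are illustrative, not the PR's exact code):

```scala
import org.apache.spark.sql.types._

// Reject unmappable Catalyst types up front with a clear message, instead of
// emitting "byte" and letting PostgreSQL fail server-side.
def toPostgresTypeName(dt: DataType): String = dt match {
  case IntegerType => "INTEGER"
  case StringType  => "TEXT"
  case unsupported => // e.g. ByteType
    throw new IllegalArgumentException(s"Unsupported type in postgresql: $unsupported")
}
```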

Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>

Closes #9350 from maropu/FixBugInPostgreJdbc.
2015-12-28 21:28:32 -08:00
Reynold Xin 1a91be8078 [SPARK-12547][SQL] Tighten scala style checker enforcement for UDF registration
We use scalastyle:off to turn off style checks in certain places where it is not possible to follow the style guide. This is usually ok. However, in UDF registration, we disable the checker for a large amount of code simply because some lines exceed the 100-char limit. It is better to disable just the line length check rather than everything.

In this pull request, I only disabled the line length check and fixed a problem (lack of explicit types for public methods).

Author: Reynold Xin <rxin@databricks.com>

Closes #10501 from rxin/SPARK-12547.
2015-12-28 20:43:06 -08:00
gatorsmile 043135819c [SPARK-12522][SQL][MINOR] Add the missing document strings for the SQL configuration
Fixing the missing documentation for the configuration. We can see the placeholder message "TODO" when issuing the command "SET -V".
```
spark.sql.columnNameOfCorruptRecord
spark.sql.hive.verifyPartitionPath
spark.sql.sources.parallelPartitionDiscovery.threshold
spark.sql.hive.convertMetastoreParquet.mergeSchema
spark.sql.hive.convertCTAS
spark.sql.hive.thriftServer.async
```

Author: gatorsmile <gatorsmile@gmail.com>

Closes #10471 from gatorsmile/commandDesc.
2015-12-28 17:22:18 -08:00
Josh Rosen 124a3a5e4e [SPARK-12490] Don't use Javascript for web UI's paginated table controls
The web UI's paginated table uses JavaScript to implement certain navigation controls, such as table sorting and the "go to page" form. This is unnecessary and should be simplified to use plain HTML form controls and links.

/cc zsxwing, who wrote this original code, and yhuai.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #10441 from JoshRosen/simplify-paginated-table-sorting.
2015-12-28 16:42:11 -08:00
Shixiong Zhu 710b411729 [SPARK-12489][CORE][SQL][MLLIB] Fix minor issues found by FindBugs
Include the following changes:

1. Close `java.sql.Statement` (see the sketch after this list).
2. Fix incorrect `asInstanceOf`.
3. Remove unnecessary `synchronized` and `ReentrantLock`.
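
A minimal sketch of fix (1), assuming the usual pattern of closing the `Statement` even when execution throws:

```scala
import java.sql.Connection

def runUpdate(conn: Connection, sql: String): Int = {
  val stmt = conn.createStatement()
  try stmt.executeUpdate(sql)
  finally stmt.close() // previously leaked on exception paths
}
```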

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #10440 from zsxwing/findbugs.
2015-12-28 15:01:51 -08:00
Josh Rosen fb572c6e4b [SPARK-12525] Fix fatal compiler warnings in Kinesis ASL due to @transient annotations
The Scala 2.11 SBT build currently fails for Spark 1.6.0 and master due to warnings about the `transient` annotation:

```
[error] [warn] /Users/joshrosen/Documents/spark/extras/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisBackedBlockRDD.scala:73: no valid targets for annotation on value sc - it is discarded unused. You may specify targets with meta-annotations, e.g. (transient param)
[error] [warn]     transient sc: SparkContext,
```

The fix implemented here is the same as what we did in #8433: remove the `transient` annotations when they are not necessary, and use `transient private val` in the remaining cases.
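
A hedged before/after sketch of the annotation change (class names are hypothetical):

```scala
import org.apache.spark.SparkContext

// class Before(@transient sc: SparkContext)           // 2.11 warns: annotation has no valid target
class After(@transient private val sc: SparkContext)   // a val gives the annotation a target
```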

Author: Josh Rosen <joshrosen@databricks.com>

Closes #10479 from JoshRosen/fix-sbt-2.11.
2015-12-28 14:51:22 -08:00
Daoyuan Wang a6d385322e [SPARK-12222][CORE] Deserializing RoaringBitmap with the Kryo serializer throws a Buffer underflow exception
Since we only need to implement `def skipBytes(n: Int)`,
the code in #10213 could be simplified.
davies scwf

Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #10253 from adrian-wang/kryo.
2015-12-29 07:02:30 +09:00
gatorsmile 01ba95d8bf [SPARK-12441][SQL] Fixing missingInput in Generate/MapPartitions/AppendColumns/MapGroups/CoGroup
When explaining any plan with Generate, we see an exclamation mark in the plan. Normally, when we see this mark, it means the plan has an error. This PR corrects the `missingInput` in `Generate`.

For example,
```scala
import org.apache.spark.sql.Row // needed for the pattern match below

val df = Seq((1, "a b c"), (2, "a b"), (3, "a")).toDF("number", "letters")
val df2 =
  df.explode('letters) {
    case Row(letters: String) => letters.split(" ").map(Tuple1(_)).toSeq
  }

df2.explain(true)
```
Before the fix, the plan is like
```
== Parsed Logical Plan ==
'Generate UserDefinedGenerator('letters), true, false, None
+- Project [_1#0 AS number#2,_2#1 AS letters#3]
   +- LocalRelation [_1#0,_2#1], [[1,a b c],[2,a b],[3,a]]

== Analyzed Logical Plan ==
number: int, letters: string, _1: string
Generate UserDefinedGenerator(letters#3), true, false, None, [_1#8]
+- Project [_1#0 AS number#2,_2#1 AS letters#3]
   +- LocalRelation [_1#0,_2#1], [[1,a b c],[2,a b],[3,a]]

== Optimized Logical Plan ==
Generate UserDefinedGenerator(letters#3), true, false, None, [_1#8]
+- LocalRelation [number#2,letters#3], [[1,a b c],[2,a b],[3,a]]

== Physical Plan ==
!Generate UserDefinedGenerator(letters#3), true, false, [number#2,letters#3,_1#8]
+- LocalTableScan [number#2,letters#3], [[1,a b c],[2,a b],[3,a]]
```

**Updates**: The same issue is also found in four other Dataset operators: `MapPartitions`/`AppendColumns`/`MapGroups`/`CoGroup`. Fixed all four of them.

Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>

Closes #10393 from gatorsmile/generateExplain.
2015-12-28 12:48:30 -08:00
Stephan Kessler a6a4812434 [SPARK-7727][SQL] Avoid inner classes in RuleExecutor
Moved the (case) classes Strategy, Once, FixedPoint and Batch to the companion object. This is necessary if we want to make the Optimizer easily extensible, in the following sense: usually a user wants to add additional rules and keep the ones that are already there. However, inner classes made that impossible, since the code did not compile.

This allows easy extension of existing Optimizers; see DefaultOptimizerExtendableSuite for a corresponding test case.

Author: Stephan Kessler <stephan.kessler@sap.com>

Closes #10174 from stephankessler/SPARK-7727.
2015-12-28 12:46:20 -08:00
Kousuke Saruta 07165ca06f [SPARK-12424][ML] The implementation of ParamMap#filter is wrong.
ParamMap#filter uses `mutable.Map#filterKeys`. The return type of `filterKeys` is collection.Map, not mutable.Map, but the result is cast to mutable.Map using `asInstanceOf`, so we get a `ClassCastException`.
Also, the value returned by Map#filterKeys is not Serializable. That is a Scala issue (https://issues.scala-lang.org/browse/SI-6654).
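
A hedged, standalone illustration of the underlying Scala pitfall (not the PR's code):

```scala
import scala.collection.mutable

val m = mutable.Map("a" -> 1, "b" -> 2)
val view: scala.collection.Map[String, Int] = m.filterKeys(_ == "a")
// view.asInstanceOf[mutable.Map[String, Int]]  // throws ClassCastException at runtime
val copied = mutable.Map(m.toSeq.filter(_._1 == "a"): _*) // one safe alternative
```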

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #10381 from sarutak/SPARK-12424.
2015-12-29 05:33:19 +09:00
gatorsmile e01c6c8664 [SPARK-12287][SQL] Support UnsafeRow in MapPartitions/MapGroups/CoGroup
Support Unsafe Row in MapPartitions/MapGroups/CoGroup.

Added a test case for MapPartitions. Since MapGroups and CoGroup are built on AppendColumns, all the related Dataset test cases can already verify the correctness when MapGroups and CoGroup process unsafe rows.

davies cloud-fan Not sure if my understanding is right, please correct me. Thank you!

Author: gatorsmile <gatorsmile@gmail.com>

Closes #10398 from gatorsmile/unsafeRowMapGroup.
2015-12-28 12:23:28 -08:00
Yaron Weinsberg 73b70f076d [SPARK-12517] add default RDD name for one created via sc.textFile
The feature was first added in commit 7b877b2705 but was later removed (probably by mistake) in commit fc8b58195a.
This change sets the default name of RDDs created via sc.textFile(...) to the path argument.

Here is the symptom:

* Using spark-1.5.2-bin-hadoop2.6:

scala> sc.textFile("/home/root/.bashrc").name
res5: String = null

scala> sc.binaryFiles("/home/root/.bashrc").name
res6: String = /home/root/.bashrc

* while using Spark 1.3.1:

scala> sc.textFile("/home/root/.bashrc").name
res0: String = /home/root/.bashrc

scala> sc.binaryFiles("/home/root/.bashrc").name
res1: String = /home/root/.bashrc
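
A hedged sketch of the intended behavior after this change, using the REPL's `sc`:

```scala
val rdd = sc.textFile("/home/root/.bashrc")
assert(rdd.name == "/home/root/.bashrc") // previously null; now defaults to the path
```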

Author: Yaron Weinsberg <wyaron@gmail.com>
Author: yaron <yaron@il.ibm.com>

Closes #10456 from wyaron/master.
2015-12-29 05:19:11 +09:00
Kevin Yu fd50df413f [SPARK-12231][SQL] create a combineFilters' projection when we call buildPartitionedTableScan
Hello Michael & All:

We had some issues submitting the new code in the other PR (#10299), so we closed that PR and opened this one with the fix.

The reason for the previous failure is that the projection for the scan, when there is a filter that is not pushed down (the "left-over" filter), could differ in elements or ordering from the original projection.

With the new code, the approach to solving this problem is:

Insert a new Project if the "left-over" filter is nonempty and (the original projection is not empty and the projection for the scan has more than one element, which could otherwise cause a different ordering in the projection).

We added three test cases to cover the otherwise-failing cases.

Author: Kevin Yu <qyu@us.ibm.com>

Closes #10388 from kevinyu98/spark-12231.
2015-12-28 11:58:33 -08:00
Wenchen Fan 8543997f2d [HOT-FIX] bypass hive test when parsing logical plan to JSON
https://github.com/apache/spark/pull/10311 introduced some rare, non-deterministic flakiness in Hive UDF tests; see https://github.com/apache/spark/pull/10311#issuecomment-166548851

I can't reproduce it locally and may need more time to investigate; a quick solution is to bypass the Hive tests for JSON serialization.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #10430 from cloud-fan/hot-fix.
2015-12-28 11:45:44 -08:00
Josh Rosen ab6bedd85d [SPARK-12508][PROJECT-INFRA] Fix minor bugs in dev/tests/pr_public_classes.sh script
This patch fixes a handful of minor bugs in the `dev/tests/pr_public_classes.sh` script, which is used by the `run_tests_jenkins` script to detect the addition of new public classes:

- Account for differences between BSD and GNU `sed` in order to allow the script to run on OS X.
- Diff `$ghprbActualCommit^...$ghprbActualCommit` instead of `master...$ghprbActualCommit`: since `ghprbActualCommit` is a merge commit which results from merging the PR into the target branch, this will give us the desired diff and avoid certain race conditions which could lead to false positives.
- Use `echo -e` instead of `echo` so that newline characters are handled correctly in output. This should fix a formatting glitch which caused the output to appear on a single line in the GitHub comment (see [the SC2028 page](https://github.com/koalaman/shellcheck/wiki/SC2028) on the Shellcheck wiki for more details).

Author: Josh Rosen <joshrosen@databricks.com>

Closes #10455 from JoshRosen/fix-pr-public-classes-test.
2015-12-28 10:40:03 -08:00
Cheng Lian 8e23d8db7f [SPARK-12218] Fixes ORC conjunction predicate push down
This PR is a follow-up of PR #10362.

Two major changes:

1.  The fix introduced in #10362 is OK for Parquet, but may disable ORC PPD in many cases

    PR #10362 stops converting an `AND` predicate if any branch is inconvertible. On the other hand, `OrcFilters` combines all filters into a single big conjunction first and then tries to convert it into an ORC `SearchArgument`. This means that if any filter is inconvertible, no filters can be pushed down. This PR fixes the issue by finding all the convertible filters first, before doing the actual conversion (a sketch follows below).

    The reason behind the current implementation is mostly the limitation of the ORC `SearchArgument` builder, which is documented in detail in this PR.

1.  Copied the `AND` predicate fix for ORC from #10362 to avoid a merge conflict.

Same as #10362, this PR targets master (2.0.0-SNAPSHOT), branch-1.6, and branch-1.5.
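
A hedged Scala sketch of that convertible-first strategy (`isConvertible` is a hypothetical predicate, not the PR's API):

```scala
import org.apache.spark.sql.sources.{And, Filter}

// Keep only the individually convertible predicates, then conjoin just those,
// so one inconvertible filter no longer disables push-down for the rest.
def convertibleConjunction(filters: Seq[Filter],
                           isConvertible: Filter => Boolean): Option[Filter] =
  filters.filter(isConvertible).reduceOption(And(_, _))
```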

Author: Cheng Lian <lian@databricks.com>

Closes #10377 from liancheng/spark-12218.fix-orc-conjunction-ppd.
2015-12-28 08:48:44 -08:00
jerryshao 8d49400921 [SPARK-12353][STREAMING][PYSPARK] Fix countByValue inconsistent output in Python API
The semantics of Python's countByValue differ from the Scala API; it is more like countDistinctValue. This change makes it consistent with the Scala/Java API.

Author: jerryshao <sshao@hortonworks.com>

Closes #10350 from jerryshao/SPARK-12353.
2015-12-28 10:43:23 +00:00
felixcheung 5aa2710c1e [SPARK-12515][SQL][DOC] minor doc update for read.jdbc
Author: felixcheung <felixcheung_m@hotmail.com>

Closes #10465 from felixcheung/dfreaderjdbcdoc.
2015-12-28 10:22:45 +00:00
gatorsmile 9ab296ecdc [SPARK-12520] [PYSPARK] Correct Descriptions and Add Use Cases in Equi-Join
After reading the JIRA https://issues.apache.org/jira/browse/SPARK-12520, I double-checked the code.

For example, users can do the Equi-Join like
  ```df.join(df2, 'name', 'outer').select('name', 'height').collect()```
- There is a bug in 1.5 and 1.4: the code just ignores the third parameter (join type) that users pass, so the join type actually used is `Inner`, even if the user specifies another type (e.g., `Outer`).
- After PR https://github.com/apache/spark/pull/8600, 1.6 does not have this issue, but the description had not been updated.

I plan to submit another PR to fix 1.5 and to issue an error message if users specify a non-inner join type when using an equi-join.

Author: gatorsmile <gatorsmile@gmail.com>

Closes #10477 from gatorsmile/pyOuterJoin.
2015-12-27 23:18:48 -08:00
echo2mei 1e97813951 [SPARK-12396][CORE] Modify the function scheduleAtFixedRate to schedule.
Instead of just cancelling the registrationRetryTimer to keep the driver from retrying the connection to the master, change the call to schedule.
There is no need to register with the master repeatedly.
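
A hedged before/after sketch of the call (executor and task names are hypothetical; both methods are standard `ScheduledExecutorService` API):

```scala
// before: the retry task fires repeatedly
// registrationRetryThread.scheduleAtFixedRate(retryTask, delay, period, TimeUnit.SECONDS)
// after: it fires once
// registrationRetryThread.schedule(retryTask, delay, TimeUnit.SECONDS)
```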

Author: echo2mei <534384876@qq.com>

Closes #10447 from echoTomei/master.
2015-12-25 17:42:24 -08:00