Commit graph

7575 commits

Author SHA1 Message Date
Takuya UESHIN 10b59ba230 [SPARK-2428][SQL] Add except and intersect methods to SchemaRDD.
Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #1355 from ueshin/issues/SPARK-2428 and squashes the following commits:

b6fa264 [Takuya UESHIN] Add except and intersect methods to SchemaRDD.
2014-07-10 19:27:24 -07:00
Takuya UESHIN f5abd27129 [SPARK-2415] [SQL] RowWriteSupport should handle empty ArrayType correctly.
`RowWriteSupport` doesn't write empty `ArrayType` value, so the read value becomes `null`.
It should write empty `ArrayType` value as it is.

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #1339 from ueshin/issues/SPARK-2415 and squashes the following commits:

32afc87 [Takuya UESHIN] Merge branch 'master' into issues/SPARK-2415
2f05196 [Takuya UESHIN] Fix RowWriteSupport to handle empty ArrayType correctly.
2014-07-10 19:23:44 -07:00
Takuya UESHIN f62c427289 [SPARK-2431][SQL] Refine StringComparison and related codes.
Refine `StringComparison` and related codes as follows:
- `StringComparison` could be similar to `StringRegexExpression` or `CaseConversionExpression`.
- Nullability of `StringRegexExpression` could depend on children's nullabilities.
- Add a case that the like condition includes no wildcard to `LikeSimplification`.

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #1357 from ueshin/issues/SPARK-2431 and squashes the following commits:

77766f5 [Takuya UESHIN] Add a case that the like condition includes no wildcard to LikeSimplification.
b9da9d2 [Takuya UESHIN] Fix nullability of StringRegexExpression.
680bb72 [Takuya UESHIN] Refine StringComparison.
2014-07-10 19:20:00 -07:00
Artjom-Metro ae8ca4dfba SPARK-2427: Fix Scala examples that use the wrong command line arguments index
The Scala examples HBaseTest and HdfsTest don't use the correct indexes for the command line arguments. This due to to the fix of JIRA 1565, where these examples were not correctly adapted to the new usage of the submit script.

Author: Artjom-Metro <Artjom-Metro@users.noreply.github.com>
Author: Artjom-Metro <artjom31415@googlemail.com>

Closes #1353 from Artjom-Metro/fix_examples and squashes the following commits:

6111801 [Artjom-Metro] Reduce the default number of iterations
cfaa73c [Artjom-Metro] Fix some examples that use the wrong index to access the command line arguments
2014-07-10 16:03:30 -07:00
Issac Buenrostro 2dd6724850 [SPARK-1341] [Streaming] Throttle BlockGenerator to limit rate of data consumption.
Author: Issac Buenrostro <buenrostro@ooyala.com>

Closes #945 from ibuenros/SPARK-1341-throttle and squashes the following commits:

5514916 [Issac Buenrostro] Formatting changes, added documentation for streaming throttling, stricter unit tests for throttling.
62f395f [Issac Buenrostro] Add comments and license to streaming RateLimiter.scala
7066438 [Issac Buenrostro] Moved throttle code to RateLimiter class, smoother pushing when throttling active
ccafe09 [Issac Buenrostro] Throttle BlockGenerator to limit rate of data consumption.
2014-07-10 16:01:08 -07:00
tmalaska 40a8fef4e6 [SPARK-1478].3: Upgrade FlumeInputDStream's FlumeReceiver to support FLUME-1915
This is a modified version of this PR https://github.com/apache/spark/pull/1168 done by @tmalaska
Adds MIMA binary check exclusions.

Author: tmalaska <ted.malaska@cloudera.com>
Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #1347 from tdas/FLUME-1915 and squashes the following commits:

96065df [Tathagata Das] Added Mima exclusion for FlumeReceiver.
41d5338 [tmalaska] Address line 57 that was too long
12617e5 [tmalaska] SPARK-1478: Upgrade FlumeInputDStream's Flume...
2014-07-10 13:15:02 -07:00
Nicholas Chammas 369aa84e8f name ec2 instances and security groups consistently
Security groups created by `spark-ec2` do not prepend “spark-“ to the
name.

Since naming the instances themselves is new to `spark-ec2`, it’s better
to change that pattern to match the existing naming pattern for the
security groups, rather than the other way around.

Author: Nicholas Chammas <nicholas.chammas@gmail.com>
Author: nchammas <nicholas.chammas@gmail.com>

Closes #1344 from nchammas/master and squashes the following commits:

f7e4581 [Nicholas Chammas] unrelated pep8 fix
a36eed0 [Nicholas Chammas] name ec2 instances and security groups consistently
de7292a [nchammas] Merge pull request #4 from apache/master
2e4fe00 [nchammas] Merge pull request #3 from apache/master
89fde08 [nchammas] Merge pull request #2 from apache/master
69f6e22 [Nicholas Chammas] PEP8 fixes
2627247 [Nicholas Chammas] broke up lines before they hit 100 chars
6544b7e [Nicholas Chammas] [SPARK-2065] give launched instances names
69da6cf [nchammas] Merge pull request #1 from apache/master
2014-07-10 12:56:00 -07:00
Patrick Wendell 88006a6237 HOTFIX: Minor doc update for sbt change 2014-07-10 11:11:00 -07:00
Prashant Sharma 628932b8d0 [SPARK-1776] Have Spark's SBT build read dependencies from Maven.
Patch introduces the new way of working also retaining the existing ways of doing things.

For example build instruction for yarn in maven is
`mvn -Pyarn -PHadoop2.2 clean package -DskipTests`
in sbt it can become
`MAVEN_PROFILES="yarn, hadoop-2.2" sbt/sbt clean assembly`
Also supports
`sbt/sbt -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0 clean assembly`

Author: Prashant Sharma <prashant.s@imaginea.com>
Author: Patrick Wendell <pwendell@gmail.com>

Closes #772 from ScrapCodes/sbt-maven and squashes the following commits:

a8ac951 [Prashant Sharma] Updated sbt version.
62b09bb [Prashant Sharma] Improvements.
fa6221d [Prashant Sharma] Excluding sql from mima
4b8875e [Prashant Sharma] Sbt assembly no longer builds tools by default.
72651ca [Prashant Sharma] Addresses code reivew comments.
acab73d [Prashant Sharma] Revert "Small fix to run-examples script."
ac4312c [Prashant Sharma] Revert "minor fix"
6af91ac [Prashant Sharma] Ported oldDeps back. + fixes issues with prev commit.
65cf06c [Prashant Sharma] Servelet API jars mess up with the other servlet jars on the class path.
446768e [Prashant Sharma] minor fix
89b9777 [Prashant Sharma] Merge conflicts
d0a02f2 [Prashant Sharma] Bumped up pom versions, Since the build now depends on pom it is better updated there. + general cleanups.
dccc8ac [Prashant Sharma] updated mima to check against 1.0
a49c61b [Prashant Sharma] Fix for tools jar
a2f5ae1 [Prashant Sharma] Fixes a bug in dependencies.
cf88758 [Prashant Sharma] cleanup
9439ea3 [Prashant Sharma] Small fix to run-examples script.
96cea1f [Prashant Sharma] SPARK-1776 Have Spark's SBT build read dependencies from Maven.
36efa62 [Patrick Wendell] Set project name in pom files and added eclipse/intellij plugins.
4973dbd [Patrick Wendell] Example build using pom reader.
2014-07-10 11:03:37 -07:00
Masayoshi TSUZUKI c2babc089b SPARK-2115: Stage kill link is too close to stage details link
Moved (kill) link to the right side. Add confirmation dialog when (kill) link is clicked.

Author: Masayoshi TSUZUKI <tsudukim@oss.nttdata.co.jp>

Closes #1350 from tsudukim/feature/SPARK-2115 and squashes the following commits:

e2263b0 [Masayoshi TSUZUKI] Moved (kill) link to the right side. Add confirmation dialog when (kill) link is clicked.
2014-07-10 01:18:37 -07:00
Raymond Liu 2b18ea9826 Clean up SparkKMeans example's code
remove unused code

Author: Raymond Liu <raymond.liu@intel.com>

Closes #1352 from colorant/kmeans and squashes the following commits:

ddcd1dd [Raymond Liu] Clean up SparkKMeans example's code
2014-07-09 23:39:29 -07:00
Patrick Wendell 553c578de1 HOTFIX: Remove persistently failing test in master.
Apparently this functionality is going to be removed soon anywyas.
2014-07-09 19:44:24 -07:00
Patrick Wendell dd22bc2d57 Revert "[HOTFIX] Synchronize on SQLContext.settings in tests."
This reverts commit d4c30cd991.
2014-07-09 19:36:38 -07:00
Patrick Wendell 2e0a037dff SPARK-2416: Allow richer reporting of unit test results
The built-in Jenkins integration is pretty bad. It's very confusing to users whether tests have passed or failed and we can't easily customize the message.

With some small scripting around the Github API we can do much better than this.

Author: Patrick Wendell <pwendell@gmail.com>

Closes #1340 from pwendell/better-qa-messages and squashes the following commits:

fd6077d [Patrick Wendell] Better automation for unit tests.
2014-07-09 19:26:16 -07:00
Li Pu 1f33e1f201 SPARK-1782: svd for sparse matrix using ARPACK
copy ARPACK dsaupd/dseupd code from latest breeze
change RowMatrix to use sparse SVD
change tests for sparse SVD

All tests passed. I will run it against some large matrices.

Author: Li Pu <lpu@twitter.com>
Author: Xiangrui Meng <meng@databricks.com>
Author: Li Pu <li.pu@outlook.com>

Closes #964 from vrilleup/master and squashes the following commits:

7312ec1 [Li Pu] very minor comment fix
4c618e9 [Li Pu] Merge pull request #1 from mengxr/vrilleup-master
a461082 [Xiangrui Meng] make superscript show up correctly in doc
861ec48 [Xiangrui Meng] simplify axpy
62969fa [Xiangrui Meng] use BDV directly in symmetricEigs change the computation mode to local-svd, local-eigs, and dist-eigs update tests and docs
c273771 [Li Pu] automatically determine SVD compute mode and parameters
7148426 [Li Pu] improve RowMatrix multiply
5543cce [Li Pu] improve svd api
819824b [Li Pu] add flag for dense svd or sparse svd
eb15100 [Li Pu] fix binary compatibility
4c7aec3 [Li Pu] improve comments
e7850ed [Li Pu] use aggregate and axpy
827411b [Li Pu] fix EOF new line
9c80515 [Li Pu] use non-sparse implementation when k = n
fe983b0 [Li Pu] improve scala style
96d2ecb [Li Pu] improve eigenvalue sorting
e1db950 [Li Pu] SPARK-1782: svd for sparse matrix using ARPACK
2014-07-09 12:15:08 -07:00
johnnywalleye d35e3db232 [SPARK-2417][MLlib] Fix DecisionTree tests
Fixes test failures introduced by https://github.com/apache/spark/pull/1316.

For both the regression and classification cases,
val stats is the InformationGainStats for the best tree split.
stats.predict is the predicted value for the data, before the split is made.
Since 600 of the 1,000 values generated by DecisionTreeSuite.generateCategoricalDataPoints() are 1.0 and the rest 0.0, the regression tree and classification tree both correctly predict a value of 0.6 for this data now, and the assertions have been changed to reflect that.

Author: johnnywalleye <jsondag@gmail.com>

Closes #1343 from johnnywalleye/decision-tree-tests and squashes the following commits:

ef80603 [johnnywalleye] [SPARK-2417][MLlib] Fix DecisionTree tests
2014-07-09 11:06:34 -07:00
Manuel Laflamme 0eb11527d1 [STREAMING] SPARK-2343: Fix QueueInputDStream with oneAtATime false
Fix QueueInputDStream which was not removing dequeued items when used with the oneAtATime flag disabled.

Author: Manuel Laflamme <manuel.laflamme@gmail.com>

Closes #1285 from mlaflamm/spark-2343 and squashes the following commits:

61c9e38 [Manuel Laflamme] Unit tests for queue input stream
c51d029 [Manuel Laflamme] Fix QueueInputDStream with oneAtATime false
2014-07-09 10:45:45 -07:00
Kay Ousterhout 339441f545 [SPARK-2384] Add tooltips to UI.
This patch adds tooltips to clarify some points of confusion in the UI.  When users mouse over some of the table headers (shuffle read, write, and input size) as well as over the "scheduler delay" metric shown for each stage, a black tool tip (see image below) pops up describing the metric in more detail.  After the tooltip mechanism is added by this commit, I imagine others may want to add more tooltips for other things in the UI, but I think this is a good starting point.

![tooltip](https://cloud.githubusercontent.com/assets/1108612/3491905/994e179e-059f-11e4-92f2-c6c12d248d81.jpg)

This looks scary-big but much of it is adding the bootstrap tool tip JavaScript.

Also I have no idea what to put for the license in tooltip (I left it the same -- the Twitter apache header) or for JQuery (left it as nothing) -- @mateiz what's the right thing here?

cc @pwendell @andrewor14 @rxin

Author: Kay Ousterhout <kayousterhout@gmail.com>

Closes #1314 from kayousterhout/tooltips and squashes the following commits:

19981b5 [Kay Ousterhout] Exclude non-licensed javascript files from style check
d9ab5a9 [Kay Ousterhout] Response to Andrew's review
7752449 [Kay Ousterhout] [SPARK-2384] Add tooltips to UI.
2014-07-08 22:57:21 -07:00
johnnywalleye 1114207cc8 [SPARK-2152][MLlib] fix bin offset in DecisionTree node aggregations (also resolves SPARK-2160)
Hi, this pull fixes (what I believe to be) a bug in DecisionTree.scala.

In the extractLeftRightNodeAggregates function, the first set of rightNodeAgg values for Regression are set in line 792 as follows:

rightNodeAgg(featureIndex)(2 * (numBins - 2))
  = binData(shift + (2 * numBins - 1)))

Then there is a loop that sets the rest of the values, as in line 809:

rightNodeAgg(featureIndex)(2 * (numBins - 2 - splitIndex)) =
  binData(shift + (2 *(numBins - 2 - splitIndex))) +
  rightNodeAgg(featureIndex)(2 * (numBins - 1 - splitIndex))

But since splitIndex starts at 1, this ends up skipping a set of binData values.

The changes here address this issue, for both the Regression and Classification cases.

Author: johnnywalleye <jsondag@gmail.com>

Closes #1316 from johnnywalleye/master and squashes the following commits:

73809da [johnnywalleye] fix bin offset in DecisionTree node aggregations
2014-07-08 19:17:26 -07:00
DB Tsai ac9cdc116e [SPARK-2413] Upgrade junit_xml_listener to 0.5.1
which fixes the following issues

1) fix the class name to be fully qualified classpath
2) make sure the the reporting time is in second not in miliseond, which causing JUnit HTML to report incorrect number
3) make sure the duration of the tests are accumulative.

Author: DB Tsai <dbtsai@alpinenow.com>

Closes #1333 from dbtsai/dbtsai-junit and squashes the following commits:

bbeac4b [DB Tsai] Upgrade junit_xml_listener to 0.5.1 which fixes the following issues
2014-07-08 17:50:36 -07:00
Andrew Or bf04a390e4 [SPARK-2392] Executors should not start their own HTTP servers
Executors currently start their own unused HTTP file servers. This is because we use the same SparkEnv class for both executors and drivers, and we do not distinguish this case.

In the longer term, we should separate out SparkEnv for the driver and SparkEnv for the executors.

Author: Andrew Or <andrewor14@gmail.com>

Closes #1335 from andrewor14/executor-http-server and squashes the following commits:

46ef263 [Andrew Or] Start HTTP server only on the driver
2014-07-08 17:35:31 -07:00
Gabriele Nizzoli e6f7bfcfbf [SPARK-2362] Fix for newFilesOnly logic in file DStream
The newFilesOnly logic should be inverted: the logic should be that if the flag newFilesOnly==true then only start reading files older than current time. As the code is now if newFilesOnly==true then it will start to read files that are older than 0L (that is: every file in the directory).

Author: Gabriele Nizzoli <mail@nizzoli.net>

Closes #1077 from gabrielenizzoli/master and squashes the following commits:

4f1d261 [Gabriele Nizzoli] Fix for newFilesOnly logic in file DStream
2014-07-08 14:23:38 -07:00
Reynold Xin 32516f866a [SPARK-2409] Make SQLConf thread safe.
Author: Reynold Xin <rxin@apache.org>

Closes #1334 from rxin/sqlConfThreadSafetuy and squashes the following commits:

c1e0a5a [Reynold Xin] Fixed the duplicate comment.
7614372 [Reynold Xin] [SPARK-2409] Make SQLConf thread safe.
2014-07-08 14:00:47 -07:00
CrazyJvm b520b6453e SPARK-2400 : fix spark.yarn.max.executor.failures explaination
According to
```scala
  private val maxNumExecutorFailures = sparkConf.getInt("spark.yarn.max.executor.failures",
    sparkConf.getInt("spark.yarn.max.worker.failures", math.max(args.numExecutors * 2, 3)))
```
default value should be numExecutors * 2, with minimum of 3,  and it's same to the config
`spark.yarn.max.worker.failures`

Author: CrazyJvm <crazyjvm@gmail.com>

Closes #1282 from CrazyJvm/yarn-doc and squashes the following commits:

1a5f25b [CrazyJvm] remove deprecated config
c438aec [CrazyJvm] fix style
86effa6 [CrazyJvm] change expression
211f130 [CrazyJvm] fix html tag
2900d23 [CrazyJvm] fix style
a4b2e27 [CrazyJvm] fix configuration spark.yarn.max.executor.failures
2014-07-08 13:55:42 -05:00
Daniel Darabos c8a2313cdf [SPARK-2403] Catch all errors during serialization in DAGScheduler
https://issues.apache.org/jira/browse/SPARK-2403

Spark hangs for us whenever we forget to register a class with Kryo. This should be a simple fix for that. But let me know if you have a better suggestion.

I did not write a new test for this. It would be pretty complicated and I'm not sure it's worthwhile for such a simple change. Let me know if you disagree.

Author: Daniel Darabos <darabos.daniel@gmail.com>

Closes #1329 from darabos/spark-2403 and squashes the following commits:

3aceaad [Daniel Darabos] Print full stack trace for miscellaneous exceptions during serialization.
52c22ba [Daniel Darabos] Only catch NonFatal exceptions.
361e962 [Daniel Darabos] Catch all errors during serialization in DAGScheduler.
2014-07-08 10:43:46 -07:00
Michael Armbrust cc3e0a14da [SPARK-2395][SQL] Optimize common LIKE patterns.
Author: Michael Armbrust <michael@databricks.com>

Closes #1325 from marmbrus/slowLike and squashes the following commits:

023c3eb [Michael Armbrust] add comment.
8b421c2 [Michael Armbrust] Handle the case where the final % is actually escaped.
d34d37e [Michael Armbrust] add periods.
3bbf35f [Michael Armbrust] Roll back changes to SparkBuild
53894b1 [Michael Armbrust] Fix grammar.
4094462 [Michael Armbrust] Fix grammar.
6d3d0a0 [Michael Armbrust] Optimize common LIKE patterns.
2014-07-08 10:36:18 -07:00
Andrew Or 56e009d4f0 [EC2] Add default history server port to ec2 script
Right now I have to open it manually

Author: Andrew Or <andrewor14@gmail.com>

Closes #1296 from andrewor14/hist-serv-port and squashes the following commits:

8895a1f [Andrew Or] Add default history server port to ec2 script
2014-07-08 16:49:31 +09:00
Michael Armbrust 5a4063645d [SPARK-2391][SQL] Custom take() for LIMIT queries.
Using Spark's take can result in an entire in-memory partition to be shipped in order to retrieve a single row.

Author: Michael Armbrust <michael@databricks.com>

Closes #1318 from marmbrus/takeLimit and squashes the following commits:

77289a5 [Michael Armbrust] Update scala doc
32f0674 [Michael Armbrust] Custom take implementation for LIMIT queries.
2014-07-08 00:41:46 -07:00
witgo 3cd5029be7 Resolve sbt warnings during build Ⅱ
Author: witgo <witgo@qq.com>

Closes #1153 from witgo/expectResult and squashes the following commits:

97541d8 [witgo] merge master
ead26e7 [witgo] Resolve sbt warnings during build
2014-07-08 00:31:42 -07:00
Rishi Verma 0128905eea Updated programming-guide.md
Made sure that readers know the random number generator seed argument, within the 'takeSample' method, is optional.

Author: Rishi Verma <riverma@apache.org>

Closes #1324 from riverma/patch-1 and squashes the following commits:

4699676 [Rishi Verma] Updated programming-guide.md
2014-07-08 00:29:23 -07:00
Yanjie Gao 50561f4396 [SPARK-2235][SQL]Spark SQL basicOperator add Intersect operator
Hi all,
I want to submit a basic operator Intersect
For example , in sql case
select * from table1
intersect
select * from table2
So ,i want use this operator support this function in Spark SQL
This operator will return the  the intersection of SparkPlan child table RDD .
JIRA:https://issues.apache.org/jira/browse/SPARK-2235

Author: Yanjie Gao <gaoyanjie55@163.com>
Author: YanjieGao <396154235@qq.com>

Closes #1150 from YanjieGao/patch-5 and squashes the following commits:

4629afe [YanjieGao] reformat the code
bdc2ac0 [YanjieGao] reformat the code as Michael's suggestion
3b29ad6 [YanjieGao] Merge remote branch 'upstream/master' into patch-5
1cfbfe6 [YanjieGao] refomat some files
ea78f33 [YanjieGao] resolve conflict and add annotation on basicOperator and remove HiveQl
0c7cca5 [YanjieGao] modify format problem
a802ca8 [YanjieGao] Merge remote branch 'upstream/master' into patch-5
5e374c7 [YanjieGao] resolve conflict in SparkStrategies and basicOperator
f7961f6 [Yanjie Gao] update the line less than
bdc4a05 [Yanjie Gao] Update basicOperators.scala
0b49837 [Yanjie Gao] delete the annotation
f1288b4 [Yanjie Gao] delete annotation
e2b64be [Yanjie Gao] Update basicOperators.scala
4dd453e [Yanjie Gao] Update SQLQuerySuite.scala
790765d [Yanjie Gao] Update SparkStrategies.scala
ac73e60 [Yanjie Gao] Update basicOperators.scala
d4ac5e5 [Yanjie Gao] Update HiveQl.scala
61e88e7 [Yanjie Gao] Update SqlParser.scala
469f099 [Yanjie Gao] Update basicOperators.scala
e5bff61 [Yanjie Gao] Spark SQL basicOperator add Intersect operator
2014-07-07 19:40:04 -07:00
Yin Huai 4352a2fdaa [SPARK-2376][SQL] Selecting list values inside nested JSON objects raises java.lang.IllegalArgumentException
JIRA: https://issues.apache.org/jira/browse/SPARK-2376

Author: Yin Huai <huai@cse.ohio-state.edu>

Closes #1320 from yhuai/SPARK-2376 and squashes the following commits:

0107417 [Yin Huai] Merge remote-tracking branch 'upstream/master' into SPARK-2376
480803d [Yin Huai] Correctly handling JSON arrays in PySpark.
2014-07-07 18:37:38 -07:00
Yin Huai f0496ee108 [SPARK-2375][SQL] JSON schema inference may not resolve type conflicts correctly for a field inside an array of structs
For example, for
```
{"array": [{"field":214748364700}, {"field":1}]}
```
the type of field is resolved as IntType. While, for
```
{"array": [{"field":1}, {"field":214748364700}]}
```
the type of field is resolved as LongType.

JIRA: https://issues.apache.org/jira/browse/SPARK-2375

Author: Yin Huai <huaiyin.thu@gmail.com>

Closes #1308 from yhuai/SPARK-2375 and squashes the following commits:

3e2e312 [Yin Huai] Update unit test.
1b2ff9f [Yin Huai] Merge remote-tracking branch 'upstream/master' into SPARK-2375
10794eb [Yin Huai] Correctly resolve the type of a field inside an array of structs.
2014-07-07 17:05:59 -07:00
Takuya UESHIN 4deeed17c4 [SPARK-2386] [SQL] RowWriteSupport should use the exact types to cast.
When execute `saveAsParquetFile` with non-primitive type, `RowWriteSupport` uses wrong type `Int` for `ByteType` and `ShortType`.

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #1315 from ueshin/issues/SPARK-2386 and squashes the following commits:

20d89ec [Takuya UESHIN] Use None instead of null.
bd88741 [Takuya UESHIN] Add a test.
323d1d2 [Takuya UESHIN] Modify RowWriteSupport to use the exact types to cast.
2014-07-07 17:04:02 -07:00
Yin Huai c0b4cf097d [SPARK-2339][SQL] SQL parser in sql-core is case sensitive, but a table alias is converted to lower case when we create Subquery
Reported by http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Join-throws-exception-td8599.html
After we get the table from the catalog, because the table has an alias, we will temporarily insert a Subquery. Then, we convert the table alias to lower case no matter if the parser is case sensitive or not.
To see the issue ...
```
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD

case class Person(name: String, age: Int)

val people = sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt))
people.registerAsTable("people")

sqlContext.sql("select PEOPLE.name from people PEOPLE")
```
The plan is ...
```
== Query Plan ==
Project ['PEOPLE.name]
 ExistingRdd [name#0,age#1], MapPartitionsRDD[4] at mapPartitions at basicOperators.scala:176
```
You can find that `PEOPLE.name` is not resolved.

This PR introduces three changes.
1.  If a table has an alias, the catalog will not lowercase the alias. If a lowercase alias is needed, the analyzer will do the work.
2.  A catalog has a new val caseSensitive that indicates if this catalog is case sensitive or not. For example, a SimpleCatalog is case sensitive, but
3.  Corresponding unit tests.
With this PR, case sensitivity of database names and table names is handled by the catalog. Case sensitivity of other identifiers are handled by the analyzer.

JIRA: https://issues.apache.org/jira/browse/SPARK-2339

Author: Yin Huai <huai@cse.ohio-state.edu>

Closes #1317 from yhuai/SPARK-2339 and squashes the following commits:

12d8006 [Yin Huai] Handling case sensitivity correctly. This patch introduces three changes. 1. If a table has an alias, the catalog will not lowercase the alias. If a lowercase alias is needed, the analyzer will do the work. 2. A catalog has a new val caseSensitive that indicates if this catalog is case sensitive or not. For example, a SimpleCatalog is case sensitive, but 3. Corresponding unit tests. With this patch, case sensitivity of database names and table names is handled by the catalog. Case sensitivity of other identifiers is handled by the analyzer.
2014-07-07 17:01:44 -07:00
Neville Li f7ce1b3b48 [SPARK-1977][MLLIB] register mutable BitSet in MovieLenseALS
Author: Neville Li <neville@spotify.com>

Closes #1319 from nevillelyh/gh/SPARK-1977 and squashes the following commits:

1f0a355 [Neville Li] [SPARK-1977][MLLIB] register mutable BitSet in MovieLenseALS
2014-07-07 15:06:14 -07:00
Takuya UESHIN 9d5ecf8205 [SPARK-2327] [SQL] Fix nullabilities of Join/Generate/Aggregate.
Fix nullabilities of `Join`/`Generate`/`Aggregate` because:
- Output attributes of opposite side of `OuterJoin` should be nullable.
- Output attributes of generater side of `Generate` should be nullable if `join` is `true` and `outer` is `true`.
- `AttributeReference` of `computedAggregates` of `Aggregate` should be the same as `aggregateExpression`'s.

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #1266 from ueshin/issues/SPARK-2327 and squashes the following commits:

3ace83a [Takuya UESHIN] Add withNullability to Attribute and use it to change nullabilities.
df1ae53 [Takuya UESHIN] Modify nullabilize to leave attribute if not resolved.
799ce56 [Takuya UESHIN] Add nullabilization to Generate of SparkPlan.
a0fc9bc [Takuya UESHIN] Fix scalastyle errors.
0e31e37 [Takuya UESHIN] Fix Aggregate resultAttribute nullabilities.
09532ec [Takuya UESHIN] Fix Generate output nullabilities.
f20f196 [Takuya UESHIN] Fix Join output nullabilities.
2014-07-05 11:51:48 -07:00
Takuya UESHIN 3da8df939e [SPARK-2366] [SQL] Add column pruning for the right side of LeftSemi join.
The right side of `LeftSemi` join needs columns only used in join condition.

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #1301 from ueshin/issues/SPARK-2366 and squashes the following commits:

7677a39 [Takuya UESHIN] Update comments.
786d3a0 [Takuya UESHIN] Rename method name.
e0957b1 [Takuya UESHIN] Add column pruning for the right side of LeftSemi join.
2014-07-05 11:48:08 -07:00
ankit.bhardwaj 42f3abd529 [SPARK-2306]:BoundedPriorityQueue is private and not registered with Kry...
Due to the non registration of BoundedPriorityQueue  with kryoserializer, operations which are dependend on BoundedPriorityQueue are giving exceptions.One such instance is using top along with kryo serialization.
Fixed the issue by registering BoundedPriorityQueue with kryoserializer.

Author: ankit.bhardwaj <ankit.bhardwaj@guavus.com>

Closes #1299 from AnkitBhardwaj12/BoundedPriorityQueueWithKryoIssue and squashes the following commits:

a4ae8ed [ankit.bhardwaj] [SPARK-2306]:BoundedPriorityQueue is private and not registered with Kryo
2014-07-04 22:06:10 -07:00
Michael Armbrust 9d006c9737 [SPARK-2370][SQL] Decrease metadata retrieved for partitioned hive queries.
Author: Michael Armbrust <michael@databricks.com>

Closes #1305 from marmbrus/usePrunerPartitions and squashes the following commits:

744aa20 [Michael Armbrust] Use getAllPartitionsForPruner instead of getPartitions, which avoids retrieving auth data
2014-07-04 19:15:48 -07:00
Reynold Xin 0db5d5a22e Added SignalLogger to HistoryServer.
This was omitted in #1260. @aarondav

Author: Reynold Xin <rxin@apache.org>

Closes #1300 from rxin/historyServer and squashes the following commits:

af720a3 [Reynold Xin] Added SignalLogger to HistoryServer.
2014-07-04 17:33:07 -07:00
Patrick Wendell fc71658938 HOTFIX: Clean before building docs during release.
If the docs are built after a Maven build has finished the intermediate
state somehow causes a compiler bug during sbt compilation. This just
does a clean before attempting to build the docs.
2014-07-04 10:01:19 -07:00
Yanjie Gao 5dadda8645 [SPARK-2234][SQL]Spark SQL basicOperators add Except operator
Hi all,
I want to submit a Except operator in basicOperators.scala
In SQL case.SQL support two table do except operator.
select * from table1
except
select * from table2
This operator support the substract function .Return an table with the elements from `this` that are not in `other`.This operator should limit the input SparkPlan Seq only has two member.The check will later support
JIRA:https://issues.apache.org/jira/browse/SPARK-2234

Author: Yanjie Gao <gaoyanjie55@163.com>
Author: YanjieGao <396154235@qq.com>
Author: root <root@node4.(none)>
Author: gaoyanjie <gaoyanjie55@163.com>

Closes #1151 from YanjieGao/patch-6 and squashes the following commits:

f19f899 [YanjieGao] add a new blank line in basicoperators.scala
2ff7d73 [YanjieGao] resolve the identation in SqlParser and SparkStrategies
fdb5227 [YanjieGao] Merge remote branch 'upstream/master' into patch-6
9940d19 [YanjieGao] make comment less than 100c
09c7413 [YanjieGao] pr 1151 SqlParser add cache ,basic Operator rename Except and modify comment
b4b5867 [root] Merge remote branch 'upstream/master' into patch-6
b4c3869 [Yanjie Gao] change SparkStrategies Sparkcontext to SqlContext
7e0ec29 [Yanjie Gao] delete multi test
7e7c83f [Yanjie Gao] delete conflict except
b01beb8 [YanjieGao] resolve conflict sparkstrategies and basicOperators
4dc8166 [YanjieGao] resolve conflict
fa68a98 [Yanjie Gao] Update joins.scala
8e6bb00 [Yanjie Gao] delete conflict except
dd9ba5e [Yanjie Gao] Update joins.scala
a0d4e73 [Yanjie Gao] delete skew join
60f5ddd [Yanjie Gao] update less than 100c
0e72233 [Yanjie Gao] update SQLQuerySuite on master branch
7f916b5 [Yanjie Gao] update execution/basicOperators on master branch
a28dece [Yanjie Gao] Update logical/basicOperators on master branch
a639935 [Yanjie Gao] Update SparkStrategies.scala
3bf7def [Yanjie Gao] update SqlParser on master branch
26f833f [Yanjie Gao] update SparkStrategies.scala on master branch
8dd063f [Yanjie Gao] Update logical/basicOperators on master branch
9847dcf [Yanjie Gao] update SqlParser on masterbranch
d6a4604 [Yanjie Gao] Update joins.scala
424c507 [Yanjie Gao] Update joins.scala
7680742 [Yanjie Gao] Update SqlParser.scala
a7193d8 [gaoyanjie] [SPARK-2234][SQL]Spark SQL basicOperators add Except operator #1151
5c8a224 [Yanjie Gao] update the line less than 100c
ee066b3 [Yanjie Gao] Update basicOperators.scala
32a80ab [Yanjie Gao] remove except in HiveQl
cf232eb [Yanjie Gao] update 1comment 2space3 left.out
f1ea3f3 [Yanjie Gao] remove comment
7ea9b91 [Yanjie Gao] remove annotation
7f3d613 [Yanjie Gao] update .map(_.copy())
670a1bb [Yanjie Gao] Update HiveQl.scala
3fe7746 [Yanjie Gao] Update SQLQuerySuite.scala
a36eb0a [Yanjie Gao] Update basicOperators.scala
7859e56 [Yanjie Gao] Update SparkStrategies.scala
052346d [Yanjie Gao] Subtract is conflict with Subtract(e1,e2)
aab3785 [Yanjie Gao] Update SQLQuerySuite.scala
4bf80b1 [Yanjie Gao] update subtract to except
4bdd520 [Yanjie Gao] Update SqlParser.scala
2d4bfbd [Yanjie Gao] Update SQLQuerySuite.scala
0808921 [Yanjie Gao] SQLQuerySuite
a8a1948 [Yanjie Gao] SparkStrategies
1fe96c0 [Yanjie Gao] HiveQl.scala update
3305e40 [Yanjie Gao] SqlParser
7a98c37 [Yanjie Gao] Update basicOperators.scala
cf5b9d0 [Yanjie Gao] Update basicOperators.scala
8945835 [Yanjie Gao] object SkewJoin extends Strategy
2b98962 [Yanjie Gao] Update SqlParser.scala
dd32980 [Yanjie Gao] update1
68815b2 [Yanjie Gao] Reformat the code style
4eb43ec [Yanjie Gao] Update basicOperators.scala
aa06072 [Yanjie Gao] Reformat the code sytle
2014-07-04 02:43:57 -07:00
Reynold Xin b3e768e154 [SPARK-2059][SQL] Add analysis checks
This replaces #1263 with a test case.

Author: Reynold Xin <rxin@apache.org>
Author: Michael Armbrust <michael@databricks.com>

Closes #1265 from rxin/sql-analysis-error and squashes the following commits:

a639e01 [Reynold Xin] Added a test case for unresolved attribute analysis.
7371e1b [Reynold Xin] Merge pull request #1263 from marmbrus/analysisChecks
448c088 [Michael Armbrust] Add analysis checks
2014-07-04 00:53:41 -07:00
baishuo(白硕) 0bbe61223e Update SQLConf.scala
use concurrent.ConcurrentHashMap instead of util.Collections.synchronizedMap

Author: baishuo(白硕) <vc_java@hotmail.com>

Closes #1272 from baishuo/master and squashes the following commits:

51ec55d [baishuo(白硕)] Update SQLConf.scala
63da043 [baishuo(白硕)] Update SQLConf.scala
36b6dbd [baishuo(白硕)] Update SQLConf.scala
864faa0 [baishuo(白硕)] Update SQLConf.scala
593096b [baishuo(白硕)] Update SQLConf.scala
7304d9b [baishuo(白硕)] Update SQLConf.scala
843581c [baishuo(白硕)] Update SQLConf.scala
1d3e4a2 [baishuo(白硕)] Update SQLConf.scala
0740f28 [baishuo(白硕)] Update SQLConf.scala
2014-07-04 00:25:31 -07:00
Prashant Sharma d43415075b [SPARK-1199][REPL] Remove VALId and use the original import style for defined classes.
This is an alternate solution to #1176.

Author: Prashant Sharma <prashant.s@imaginea.com>

Closes #1179 from ScrapCodes/SPARK-1199/repl-fix-second-approach and squashes the following commits:

820b34b [Prashant Sharma] Here we generate two kinds of import wrappers based on whether it is a class or not.
2014-07-04 00:05:27 -07:00
Cheng Lian 544880457d [SPARK-2059][SQL] Don't throw TreeNodeException in execution.ExplainCommand
This is a fix for the problem revealed by PR #1265.

Currently `HiveComparisonSuite` ignores output of `ExplainCommand` since Catalyst query plan is quite different from Hive query plan. But exceptions throw from `CheckResolution` still breaks test cases. This PR catches any `TreeNodeException` and reports it as part of the query explanation.

After merging this PR, PR #1265 can also be merged safely.

For a normal query:

```
scala> hql("explain select key from src").foreach(println)
...
[Physical execution plan:]
[HiveTableScan [key#9], (MetastoreRelation default, src, None), None]
```

For a wrong query with unresolved attribute(s):

```
scala> hql("explain select kay from src").foreach(println)
...
[Error occurred during query planning: ]
[Unresolved attributes: 'kay, tree:]
[Project ['kay]]
[ LowerCaseSchema ]
[  MetastoreRelation default, src, None]
```

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #1294 from liancheng/safe-explain and squashes the following commits:

4318911 [Cheng Lian] Don't throw TreeNodeException in `execution.ExplainCommand`
2014-07-03 23:41:54 -07:00
Aaron Davidson 97a0bfe1c0 SPARK-2282: Reuse PySpark Accumulator sockets to avoid crashing Spark
JIRA: https://issues.apache.org/jira/browse/SPARK-2282

This issue is caused by a buildup of sockets in the TIME_WAIT stage of TCP, which is a stage that lasts for some period of time after the communication closes.

This solution simply allows us to reuse sockets that are in TIME_WAIT, to avoid issues with the buildup of the rapid creation of these sockets.

Author: Aaron Davidson <aaron@databricks.com>

Closes #1220 from aarondav/SPARK-2282 and squashes the following commits:

2e5cab3 [Aaron Davidson] SPARK-2282: Reuse PySpark Accumulator sockets to avoid crashing Spark
2014-07-03 23:02:36 -07:00
Andrew Or 3894a49be9 [SPARK-2307][Reprise] Correctly report RDD blocks on SparkUI
**Problem.** The existing code in `ExecutorPage.scala` requires a linear scan through all the blocks to filter out the uncached ones. Every refresh could be expensive if there are many blocks and many executors.

**Solution.** The proper semantics should be the following: `StorageStatusListener` should contain only block statuses that are cached. This means as soon as a block is unpersisted by any mean, its status should be removed. This is reflected in the changes made in `StorageStatusListener.scala`.

Further, the `StorageTab` must stop relying on the `StorageStatusListener` changing a dropped block's status to `StorageLevel.NONE` (which no longer happens). This is reflected in the changes made in `StorageTab.scala` and `StorageUtils.scala`.

----------

If you have been following this chain of PRs like pwendell, you will quickly notice that this reverts the changes in #1249, which reverts the changes in #1080. In other words, we are adding back the changes from #1080, and fixing SPARK-2307 on top of those changes. Please ask questions if you are confused.

Author: Andrew Or <andrewor14@gmail.com>

Closes #1255 from andrewor14/storage-ui-fix-reprise and squashes the following commits:

45416fa [Andrew Or] Merge branch 'master' of github.com:apache/spark into storage-ui-fix-reprise
a82ea25 [Andrew Or] Add tests for StorageStatusListener
8773b01 [Andrew Or] Update comment / minor changes
3afde3f [Andrew Or] Correctly report the number of blocks on SparkUI
2014-07-03 22:48:23 -07:00
Aaron Davidson 586feb5c95 [SPARK-2350] Don't NPE while launching drivers
Prior to this change, we could throw a NPE if we launch a driver while another one is waiting, because removing from an iterator while iterating over it is not safe.

Author: Aaron Davidson <aaron@databricks.com>

Closes #1289 from aarondav/master-fail and squashes the following commits:

1cf1cf4 [Aaron Davidson] SPARK-2350: Don't NPE while launching drivers
2014-07-03 22:31:41 -07:00