Commit graph

7326 commits

Author SHA1 Message Date
CodingCat acc01ab326 SPARK-2038: rename "conf" parameters in the saveAsHadoop functions with source-compatibility
https://issues.apache.org/jira/browse/SPARK-2038

to differentiate them from the SparkConf object while keeping source-level compatibility

Author: CodingCat <zhunansjtu@gmail.com>

Closes #1137 from CodingCat/SPARK-2038 and squashes the following commits:

11abeba [CodingCat] revise the comments
7ee5712 [CodingCat] to keep the source-compatibility
763975f [CodingCat] style fix
d91288d [CodingCat] rename "conf" parameters in the saveAsHadoop functions
2014-06-25 00:23:32 -07:00
Cheng Lian 22036aeb1b [BUGFIX][SQL] Should match java.math.BigDecimal when unwrapping Hive output
The `BigDecimal` branch in `unwrap` matches `scala.math.BigDecimal` rather than `java.math.BigDecimal`.
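
A minimal sketch of the pitfall, using a hypothetical helper rather than the actual HiveInspectors code: Hive hands back `java.math.BigDecimal`, so the pattern must name the Java class explicitly, since a bare `BigDecimal` in Scala resolves to `scala.math.BigDecimal` and never matches.

```scala
// Hypothetical helper, not the real unwrap method; shown only to make the type issue concrete.
object UnwrapSketch {
  def unwrapDecimal(value: Any): Any = value match {
    case bd: java.math.BigDecimal => BigDecimal(bd) // matches what Hive actually hands back
    // case bd: BigDecimal => BigDecimal(bd)        // would mean scala.math.BigDecimal and never fire here
    case other => other
  }
}
```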

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #1199 from liancheng/javaBigDecimal and squashes the following commits:

e9bb481 [Cheng Lian] Should match java.math.BigDecimal when unwrapping Hive output
2014-06-25 00:17:28 -07:00
Cheng Lian 8fade8973e [SPARK-2263][SQL] Support inserting MAP<K, V> to Hive tables
JIRA issue: [SPARK-2263](https://issues.apache.org/jira/browse/SPARK-2263)

Map objects were not converted to Hive types before inserting into Hive tables.

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #1205 from liancheng/spark-2263 and squashes the following commits:

c7a4373 [Cheng Lian] Addressed @concretevitamin's comment
784940b [Cheng Lian] SPARK-2263: support inserting MAP<K, V> to Hive tables
2014-06-25 00:14:34 -07:00
witgo b6b44853cd SPARK-2248: spark.default.parallelism does not apply in local mode
Author: witgo <witgo@qq.com>

Closes #1194 from witgo/SPARK-2248 and squashes the following commits:

6ac950b [witgo] spark.default.parallelism does not apply in local mode
2014-06-24 19:45:03 -07:00
Michael Armbrust 2714968e1b Fix possible null pointer in accumulator toString
Author: Michael Armbrust <michael@databricks.com>

Closes #1204 from marmbrus/nullPointerToString and squashes the following commits:

35b5fce [Michael Armbrust] Fix possible null pointer in accumulator toString
2014-06-24 19:39:19 -07:00
Matthew Farrellee 54055fb2b7 Autodetect JAVA_HOME on RPM-based systems
Author: Matthew Farrellee <matt@redhat.com>

Closes #1185 from mattf/master-1 and squashes the following commits:

42150fc [Matthew Farrellee] Autodetect JAVA_HOME on RPM-based systems
2014-06-24 19:32:33 -07:00
Cheng Hao 133495d826 [SQL]Add base row updating methods for JoinedRow
This will be helpful in join operators.

Author: Cheng Hao <hao.cheng@intel.com>

Closes #1187 from chenghao-intel/joinedRow and squashes the following commits:

87c19e3 [Cheng Hao] Add base row set methods for JoinedRow
2014-06-24 19:07:02 -07:00
Xiangrui Meng 8ca41769fb [SPARK-1112, 2156] Bootstrap to fetch the driver's Spark properties.
This is an alternative solution to #1124. Before launching the executor backend, we first fetch the driver's Spark properties and use them to overwrite the executor's Spark properties. This should be better than #1124.

@pwendell Are there spark properties that might be different on the driver and on the executors?

Author: Xiangrui Meng <meng@databricks.com>

Closes #1132 from mengxr/akka-bootstrap and squashes the following commits:

77ff32d [Xiangrui Meng] organize imports
68e1dfb [Xiangrui Meng] use timeout from AkkaUtils; remove props from RegisteredExecutor
46d332d [Xiangrui Meng] fix a test
7947c18 [Xiangrui Meng] increase slack size for akka
4ab696a [Xiangrui Meng] bootstrap to retrieve driver spark conf
2014-06-24 19:06:07 -07:00
Michael Armbrust a162c9b337 [SPARK-2264][SQL] Fix failing CachedTableSuite
Author: Michael Armbrust <michael@databricks.com>

Closes #1201 from marmbrus/fixCacheTests and squashes the following commits:

9d87ed1 [Michael Armbrust] Use analyzer (which runs to fixed point) instead of manually removing analysis operators.
2014-06-24 19:04:29 -07:00
Kay Ousterhout 1978a9033e Fix broken Json tests.
The assertJsonStringEquals method was missing an "assert", so it
did not actually check that the strings were equal. This commit
adds the missing assert and fixes subsequently revealed problems
with the JsonProtocolSuite.

@andrewor14 I changed some of the test functionality to match what it
looks like you intended based on the expected strings -- let me know if
anything here looks wrong.

Author: Kay Ousterhout <kayousterhout@gmail.com>

Closes #1198 from kayousterhout/json_test_fix and squashes the following commits:

77f858f [Kay Ousterhout] Fix broken Json tests.
2014-06-24 16:54:50 -07:00
Patrick Wendell 221909e678 HOTFIX: Disabling tests per SPARK-2264 2014-06-24 15:09:38 -07:00
Rui Li 924b7082b1 SPARK-1937: fix issue with task locality
Don't check executor/host availability when creating a TaskSetManager, because the executors may not have been registered yet when the TaskSetManager is created. In that case all tasks would be considered to have no preferred locations, losing data locality in later scheduling.

Author: Rui Li <rui.li@intel.com>
Author: lirui-intel <rui.li@intel.com>

Closes #892 from lirui-intel/delaySchedule and squashes the following commits:

8444d7c [Rui Li] fix code style
fafd57f [Rui Li] keep locality constraints within the valid levels
18f9e05 [Rui Li] restrict allowed locality
5b3fb2f [Rui Li] refine UT
99f843e [Rui Li] add unit test and fix bug
fff4123 [Rui Li] fix computing valid locality levels
685ed3d [Rui Li] remove delay schedule for pendingTasksWithNoPrefs
7b0177a [Rui Li] remove redundant code
c7b93b5 [Rui Li] revise patch
3d7da02 [lirui-intel] Update TaskSchedulerImpl.scala
cab4c71 [Rui Li] revised patch
539a578 [Rui Li] fix code style
cf0d6ac [Rui Li] fix code style
3dfae86 [Rui Li] re-compute pending tasks when new host is added
a225ac2 [Rui Li] SPARK-1937: fix issue with task locality
2014-06-24 11:40:37 -07:00
Reynold Xin 420c1c3e1b [SPARK-2252] Fix MathJax for HTTPs.
Found out about this from the Hacker News link to GraphX, which was using HTTPS.

@mengxr

Author: Reynold Xin <rxin@apache.org>

Closes #1189 from rxin/mllib-doc and squashes the following commits:

5328be0 [Reynold Xin] [SPARK-2252] Fix MathJax for HTTPs.
2014-06-23 23:18:47 -07:00
jerryshao 56eb8af187 [SPARK-2124] Move aggregation into shuffle implementations
This PR is a sub-task of SPARK-2044 to move the execution of aggregation into shuffle implementations.

I leave `CoGroupedRDD` and `SubtractedRDD` unchanged because they have their own implementations of aggregation; I'm not sure whether it is suitable to change these two RDDs.

I also do not move the sort-related code of `OrderedRDDFunctions` into shuffle; this will be solved in another sub-task.

Author: jerryshao <saisai.shao@intel.com>

Closes #1064 from jerryshao/SPARK-2124 and squashes the following commits:

4a05a40 [jerryshao] Modify according to comments
1f7dcc8 [jerryshao] Style changes
50a2fd6 [jerryshao] Fix test suite issue after moving aggregator to Shuffle reader and writer
1a96190 [jerryshao] Code modification related to the ShuffledRDD
308f635 [jerryshao] initial works of move combiner to ShuffleManager's reader and writer
2014-06-23 20:25:46 -07:00
Reynold Xin 51c8168377 [SPARK-2227] Support dfs command in SQL.
Note that nothing gets printed to the console because we don't properly maintain session state right now.

I will have a followup PR that fixes it.

Author: Reynold Xin <rxin@apache.org>

Closes #1167 from rxin/commands and squashes the following commits:

56f04f8 [Reynold Xin] [SPARK-2227] Support dfs command in SQL.
2014-06-23 18:34:54 -07:00
Henry Saputra 383bf72c11 Cleanup on Connection, ConnectionManagerId, ConnectionManager classes part 2
Cleanup on the Connection, ConnectionManagerId, and ConnectionManager classes, part 2, done while working on the code there to help the IDE:
1. Remove unused imports
2. Remove parentheses from method calls that do not have side effects.
3. Add parentheses to method calls that do have side effects or are not simple getters of object properties (see the sketch after this list).
4. Change if-else check (via isInstanceOf) for Connection class type with Scala expression for consistency and cleanliness.
5. Remove semicolons.
6. Remove extra spaces.
7. Remove redundant returns for consistency.
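
A small sketch of the convention behind items 2 and 3, using an illustrative class rather than the actual Connection code: Scala style keeps empty parentheses on methods that have side effects and drops them on pure accessors.

```scala
import java.nio.channels.SocketChannel

// Illustrative class, not the real Connection: shows the parentheses convention only.
class ConnectionSketch(channel: SocketChannel) {
  def remoteAddress = channel.socket.getRemoteSocketAddress  // pure accessor: no parentheses
  def close(): Unit = channel.close()                        // side effect: parentheses kept
}
```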

Author: Henry Saputra <henry.saputra@gmail.com>

Closes #1157 from hsaputra/cleanup_connection_classes_part2 and squashes the following commits:

4be6906 [Henry Saputra] Fix Spark Scala style for line over 100 chars.
85b24f7 [Henry Saputra] Cleanup on Connection and ConnectionManager classes part 2 while I was working at the code there to help IDE: 1. Remove unused imports 2. Remove parentheses in method calls that do not have side effect. 3. Add parentheses in method calls that do have side effect. 4. Change if-else check (via isInstanceOf) for Connection class type with Scala expression for consistency and cleanliness. 5. Remove semicolon 6. Remove extra spaces.
2014-06-23 17:13:26 -07:00
Marcelo Vanzin 21ddd7d1e9 [SPARK-1768] History server enhancements.
Two improvements to the history server:

- Separate the HTTP handling from history fetching, so that it's easy to add
  new backends later (thinking about SPARK-1537 in the long run)

- Avoid loading all UIs in memory. Do lazy loading instead, keeping a few in
  memory for faster access. This allows the app limit to go away, since holding
  just the listing in memory shouldn't be too expensive unless the user has millions
  of completed apps in the history (at which point I'd expect other issues to arise
  aside from history server memory usage, such as FileSystem.listStatus()
  starting to become ridiculously expensive).

I also fixed a few minor things along the way which aren't really worth mentioning.
I also removed the app's log path from the UI since that information may not even
exist depending on which backend is used (even though there is only one now).

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #718 from vanzin/hist-server and squashes the following commits:

53620c9 [Marcelo Vanzin] Add mima exclude, fix scaladoc wording.
c21f8d8 [Marcelo Vanzin] Feedback: formatting, docs.
dd8cc4b [Marcelo Vanzin] Standardize on using spark.history.* configuration.
4da3a52 [Marcelo Vanzin] Remove UI from ApplicationHistoryInfo.
2a7f68d [Marcelo Vanzin] Address review feedback.
4e72c77 [Marcelo Vanzin] Remove comment about ordering.
249bcea [Marcelo Vanzin] Remove offset / count from provider interface.
ca5d320 [Marcelo Vanzin] Remove code that deals with unfinished apps.
6e2432f [Marcelo Vanzin] Second round of feedback.
b2c570a [Marcelo Vanzin] Make class package-private.
4406f61 [Marcelo Vanzin] Cosmetic change to listing header.
e852149 [Marcelo Vanzin] Initialize new app array to expected size.
e8026f4 [Marcelo Vanzin] Review feedback.
49d2fd3 [Marcelo Vanzin] Fix a comment.
91e96ca [Marcelo Vanzin] Fix scalastyle issues.
6fbe0d8 [Marcelo Vanzin] Better handle failures when loading app info.
eee2f5a [Marcelo Vanzin] Ensure server.stop() is called when shutting down.
bda2fa1 [Marcelo Vanzin] Rudimentary paging support for the history UI.
b284478 [Marcelo Vanzin] Separate history server from history backend.
2014-06-23 13:53:44 -07:00
Prashant Sharma 6dc6722a66 [SPARK-2118] spark class should complain if tools jar is missing.
Author: Prashant Sharma <prashant.s@imaginea.com>

Closes #1068 from ScrapCodes/SPARK-2118/tools-jar-check and squashes the following commits:

29e768b [Prashant Sharma] Code Review
5cb6f7d [Prashant Sharma] [SPARK-2118] spark class should complain if tools jar is missing.
2014-06-23 13:35:09 -07:00
Cheng Lian a4bc442ca2 [SPARK-1669][SQL] Made cacheTable idempotent
JIRA issue: [SPARK-1669](https://issues.apache.org/jira/browse/SPARK-1669)

Caching the same table multiple times should end up with only 1 in-memory columnar representation of this table.

Before:

```
scala> loadTestTable("src")
...
scala> cacheTable("src")
...
scala> cacheTable("src")
...
scala> table("src")
...
== Query Plan ==
InMemoryColumnarTableScan [key#2,value#3], (InMemoryRelation [key#2,value#3], false, (InMemoryColumnarTableScan [key#2,value#3], (InMemoryRelation [key#2,value#3], false, (HiveTableScan [key#2,value#3], (MetastoreRelation default, src, None), None))))
```

After:

```
scala> loadTestTable("src")
...
scala> cacheTable("src")
...
scala> cacheTable("src")
...
scala> table("src")
...
== Query Plan ==
InMemoryColumnarTableScan [key#2,value#3], (InMemoryRelation [key#2,value#3], false, (HiveTableScan [key#2,value#3], (MetastoreRelation default, src, None), None))
```

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #1183 from liancheng/spark-1669 and squashes the following commits:

68f8a20 [Cheng Lian] Removed an unused import
51bae90 [Cheng Lian] Made cacheTable idempotent
2014-06-23 13:24:33 -07:00
Matthew Farrellee 853a2b951d Fix mvn detection
When mvn is not detected (not in executor's path), 'set -e' causes the
detection to terminate the script before the helpful error message can
be displayed.

Author: Matthew Farrellee <matt@redhat.com>

Closes #1181 from mattf/master-0 and squashes the following commits:

506549f [Matthew Farrellee] Fix mvn detection
2014-06-23 11:24:05 -07:00
Vlad b88238faee Fixed small running on YARN docs typo
The backslash is needed for the multiline command.

Author: Vlad <frolvlad@gmail.com>

Closes #1158 from frol/patch-1 and squashes the following commits:

e258044 [Vlad] Fixed small running on YARN docs typo
2014-06-23 10:55:49 -05:00
Marcelo Vanzin e380767de3 [SPARK-1395] Fix "local:" URI support in Yarn mode (again).
Recent changes ignored the fact that paths may be defined with "local:"
URIs, which means they need to be explicitly added to the classpath
everywhere a remote process is started. This change fixes that by:

- Using the correct methods to add paths to the classpath
- Creating SparkConf settings for the Spark jar itself and for the
  user's jar
- Propagating those two settings to the remote processes where needed

This ensures that both in client and in cluster mode, the driver has
the necessary info to build the executor's classpath and have things
still work when they contain "local:" references.

The change also fixes some confusion in ClientBase about whether
to use SparkConf or system properties to propagate config options to
the driver and executors, by standardizing on using data held by
SparkConf.

On the cleanup front, I removed the hacky way that log4j configuration
was being propagated to handle the "local:" case. It's much more cleanly
(and generically) handled by using spark-submit arguments (--files to
upload a config file, or setting spark.executor.extraJavaOptions to pass
JVM arguments and use a local file).

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #560 from vanzin/yarn-local-2 and squashes the following commits:

4e7f066 [Marcelo Vanzin] Correctly propagate SPARK_JAVA_OPTS to driver/executor.
6a454ea [Marcelo Vanzin] Use constants for PWD in test.
6dd5943 [Marcelo Vanzin] Fix propagation of config options to driver / executor.
b2e377f [Marcelo Vanzin] Review feedback.
93c3f85 [Marcelo Vanzin] Fix ClassCastException in test.
e5c682d [Marcelo Vanzin] Fix cluster mode, restore SPARK_LOG4J_CONF.
1dfbb40 [Marcelo Vanzin] Add documentation for spark.yarn.jar.
bbdce05 [Marcelo Vanzin] [SPARK-1395] Fix "local:" URI support in Yarn mode (again).
2014-06-23 08:51:11 -05:00
Jean-Martin Archer 9cb64b2c54 SPARK-2166 - Listing of instances to be terminated before the prompt
Lists the EC2 instances before destroying the cluster.
This was added because it can be scary to destroy EC2
instances without knowing which ones will be impacted.

Author: Jean-Martin Archer <jeanmartin.archer@pulseenergy.com>

This patch had conflicts when merged, resolved by
Committer: Patrick Wendell <pwendell@gmail.com>

Closes #270 from j-martin/master and squashes the following commits:

826455f [Jean-Martin Archer] [SPARK-2611] Implementing recommendations
27b0a36 [Jean-Martin Archer] Listing of instances to be terminated before the prompt Will list the EC2 instances before destroying the cluster. This was added because it can be scary to destroy EC2 instances without knowing which one will be impacted.
2014-06-22 20:54:42 -07:00
Ori Kremer 9fc373e3a9 SPARK-2241: quote command line args in ec2 script
To preserve quoted command-line args (in case options have spaces in them).

Author: Ori Kremer <ori.kremer@gmail.com>

Closes #1169 from orikremer/quote_cmd_line_args and squashes the following commits:

67e2aa1 [Ori Kremer] quote command line args
2014-06-22 20:23:49 -07:00
witgo 409d24e2b2 SPARK-2229: FileAppender throws an IllegalArgumentException in JDK 6
Author: witgo <witgo@qq.com>

Closes #1174 from witgo/SPARK-2229 and squashes the following commits:

f85f321 [witgo] FileAppender throws an IllegalArgumentException in JDK 6
e1a8da8 [witgo] SizeBasedRollingPolicy throws a java.lang.IllegalArgumentException in JDK 6
2014-06-22 18:25:16 -07:00
Sean Owen 9fe28c35df SPARK-1316. Remove use of Commons IO
Commons IO is actually barely used, and is not a declared dependency. This just replaces it with equivalents from the JDK and Guava.

Author: Sean Owen <sowen@cloudera.com>

Closes #1173 from srowen/SPARK-1316 and squashes the following commits:

2eb53db [Sean Owen] Reorder Guava import
8fde404 [Sean Owen] Remove use of Commons IO, which is not actually a dependency
2014-06-22 11:47:49 -07:00
Sean Owen 476581e8c8 SPARK-2034. KafkaInputDStream doesn't close resources and may prevent JVM shutdown
Tobias noted today on the mailing list:

========

I am trying to use Spark Streaming with Kafka, which works like a
charm – except for shutdown. When I run my program with "sbt
run-main", sbt will never exit, because there are two non-daemon
threads left that don't die.
I created a minimal example at
<https://gist.github.com/tgpfeiffer/b1e765064e983449c6b6#file-kafkadoesntshutdown-scala>.
It starts a StreamingContext and does nothing more than connecting to
a Kafka server and printing what it receives. Using the `future { ... }`
construct, I shut down the StreamingContext after some seconds and
then print the difference between the threads at start time and at end
time. The output can be found at
<https://gist.github.com/tgpfeiffer/b1e765064e983449c6b6#file-output1>.
There are a number of threads remaining that will prevent sbt from
exiting.
When I replace `KafkaUtils.createStream(...)` with a call that does
exactly the same, except that it calls `consumerConnector.shutdown()`
in `KafkaReceiver.onStop()` (which it should, IMO), the output is as
shown at <https://gist.github.com/tgpfeiffer/b1e765064e983449c6b6#file-output2>.
Does anyone have any idea what is going on here and why the program
doesn't shut down properly? The behavior is the same with both kafka
0.8.0 and 0.8.1.1, by the way.

========

Something similar was noted last year:

http://mail-archives.apache.org/mod_mbox/spark-dev/201309.mbox/%3C1380220041.2428.YahooMailNeo@web160804.mail.bf1.yahoo.com%3E

KafkaInputDStream doesn't close `ConsumerConnector` in `onStop()`, and does not close the `Executor` it creates. The latter leaves non-daemon threads and can prevent the JVM from shutting down even if streaming is closed properly.
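
A minimal sketch of the shape of the fix, with stand-in types and field names (the real KafkaReceiver members differ): shut down the consumer connector and the executor in `onStop()` so no non-daemon threads outlive the receiver.

```scala
import java.util.concurrent.{ExecutorService, Executors}

// Stand-in for kafka.consumer.ConsumerConnector, only to keep the sketch self-contained.
trait ConsumerConnectorLike { def shutdown(): Unit }

class KafkaReceiverSketch(connector: ConsumerConnectorLike) {
  // Hypothetical pool running the message-handling threads.
  private val messageHandlerPool: ExecutorService = Executors.newFixedThreadPool(1)

  def onStop(): Unit = {
    connector.shutdown()          // close the Kafka consumer (and its ZooKeeper client)
    messageHandlerPool.shutdown() // stop worker threads so the JVM can exit cleanly
  }
}
```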

Author: Sean Owen <sowen@cloudera.com>

Closes #980 from srowen/SPARK-2034 and squashes the following commits:

9f31a8d [Sean Owen] Restore ClassTag to private class because MIMA flags it; is the shadowing intended?
2d579a8 [Sean Owen] Close ConsumerConnector in onStop; shutdown() the local Executor that is created so that its threads stop when done; close the Zookeeper client even on exception; fix a few typos; log exceptions that otherwise vanish
2014-06-22 01:12:15 -07:00
Patrick Wendell 58b32f3470 SPARK-2231: dev/run-tests should include YARN and use a recent Hadoop version

Author: Patrick Wendell <pwendell@gmail.com>

Closes #1175 from pwendell/test-hadoop-version and squashes the following commits:

9210ef4 [Patrick Wendell] SPARK-2231: dev/run-tests should include YARN and use a recent Hadoop version
2014-06-22 00:55:27 -07:00
Sean Owen 1db9cbc336 SPARK-1996. Remove use of special Maven repo for Akka
Just following up Matei's suggestion to remove the Akka repo references. Builds and the audit-release script appear OK.

Author: Sean Owen <sowen@cloudera.com>

Closes #1170 from srowen/SPARK-1996 and squashes the following commits:

5ca2930 [Sean Owen] Remove outdated Akka repository references
2014-06-21 23:29:57 -07:00
Patrick Wendell 3e0b078001 HOTFIX: Add excludes for new MIMA files 2014-06-21 15:20:15 -07:00
Patrick Wendell 0a432d6a05 HOTFIX: Fix missing MIMA ignore 2014-06-21 13:02:49 -07:00
Reynold Xin ec935abce1 [SQL] Break hiveOperators.scala into multiple files.
The single file was getting very long (500+ loc).

Author: Reynold Xin <rxin@apache.org>

Closes #1166 from rxin/hiveOperators and squashes the following commits:

5b43068 [Reynold Xin] [SQL] Break hiveOperators.scala into multiple files.
2014-06-21 12:04:18 -07:00
Reynold Xin ca5d8b5904 [SQL] Pass SQLContext instead of SparkContext into physical operators.
This makes it easier to use config options in operators.

Author: Reynold Xin <rxin@apache.org>

Closes #1164 from rxin/sqlcontext and squashes the following commits:

797b2fd [Reynold Xin] Pass SQLContext instead of SparkContext into physical operators.
2014-06-20 22:49:48 -07:00
Marcelo Vanzin 648553d48e Fix some tests.
- JavaAPISuite was trying to compare a bare path with a URI. Fix by
  extracting the path from the URI, since we know it should be a
  local path anyway.

- b9be1609 excluded the ASM dependency everywhere, but easymock needs
  it (because cglib needs it). So re-add the dependency, with test
  scope this time.

The second one above actually uncovered a weird situation: the maven
test target works, even though I can't find the class sbt complains
about in its classpath. sbt complains with:

  [error] Uncaught exception when running org.apache.spark.util
  .random.RandomSamplerSuite: java.lang.NoClassDefFoundError:
  org/objectweb/asm/Type

To avoid more weirdness caused by that, I explicitly added the asm
dependency to both maven and sbt (for tests only), and verified
the classes don't end up in the final assembly.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #917 from vanzin/flaky-tests and squashes the following commits:

d022320 [Marcelo Vanzin] Fix some tests.
2014-06-20 20:05:12 -07:00
Anant 010c460d62 [SPARK-2061] Made splits deprecated in JavaRDDLike
The jira for the issue can be found at: https://issues.apache.org/jira/browse/SPARK-2061
Most of Spark has moved over to consistently using `partitions` instead of `splits`. We should do likewise and add a `partitions` method to JavaRDDLike and have `splits` just call that. We should also go through all cases where other APIs (e.g. Python) call `splits` and change those to use the newer API.
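
A sketch of the deprecation pattern described, with simplified signatures and an illustrative version string (the actual JavaRDDLike signatures and version differ): keep `splits` for source compatibility but make it delegate to the new `partitions`.

```scala
import java.util.{List => JList}

// Simplified stand-in for JavaRDDLike; the partition type is narrowed to a type parameter here.
trait JavaRDDLikeSketch[P] {
  /** Preferred name going forward. */
  def partitions: JList[P]

  /** Kept only for source compatibility; the version string is illustrative. */
  @deprecated("Use partitions() instead.", "1.1.0")
  def splits: JList[P] = partitions
}
```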

Author: Anant <anant.asty@gmail.com>

Closes #1062 from anantasty/SPARK-2061 and squashes the following commits:

b83ce6b [Anant] Fixed syntax issue
21f9210 [Anant] Fixed version number in deprecation string
9315b76 [Anant] made related changes to use partitions in python api
8c62dd1 [Anant] Made splits deprecated in JavaRDDLike
2014-06-20 18:57:24 -07:00
Patrick Wendell a678642495 HOTFIX: Fixing style error introduced by 08d0ac 2014-06-20 18:44:54 -07:00
Doris Xin e99903b84a [SPARK-1970] Update unit test in XORShiftRandomSuite to use ChiSquareTest from commons-math3
Updating the chi-square unit test in XORShiftRandomSuite to use the ChiSquareTest in commons-math3 instead of hardcoding the chi-square statistic for the desired confidence interval.

Author: Doris Xin <doris.s.xin@gmail.com>

Closes #1073 from dorx/math3Unit and squashes the following commits:

da0e891 [Doris Xin] remove math3 from common pom
9954143 [Doris Xin] merge master
c19948f [Doris Xin] Merge branch 'master' into math3Unit
8f84f19 [Doris Xin] [SPARK-1970] unit test in XORShiftRandomSuite
ffea61a [Doris Xin] SPARK-1939: Refactor takeSample method in RDD
1441977 [Doris Xin] SPARK-1939 Refactor takeSample method in RDD to use ScaSRS
2014-06-20 18:42:02 -07:00
Andrew Ash 08d0aca78c SPARK-1902 Silence stacktrace from logs when doing port failover to port n+1
Before:

```
14/06/08 23:58:23 WARN AbstractLifeCycle: FAILED SelectChannelConnector@0.0.0.0:4040: java.net.BindException: Address already in use
java.net.BindException: Address already in use
	at sun.nio.ch.Net.bind0(Native Method)
	at sun.nio.ch.Net.bind(Net.java:444)
	at sun.nio.ch.Net.bind(Net.java:436)
	at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:214)
	at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
	at org.eclipse.jetty.server.nio.SelectChannelConnector.open(SelectChannelConnector.java:187)
	at org.eclipse.jetty.server.AbstractConnector.doStart(AbstractConnector.java:316)
	at org.eclipse.jetty.server.nio.SelectChannelConnector.doStart(SelectChannelConnector.java:265)
	at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
	at org.eclipse.jetty.server.Server.doStart(Server.java:293)
	at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
	at org.apache.spark.ui.JettyUtils$$anonfun$1.apply$mcV$sp(JettyUtils.scala:192)
	at org.apache.spark.ui.JettyUtils$$anonfun$1.apply(JettyUtils.scala:192)
	at org.apache.spark.ui.JettyUtils$$anonfun$1.apply(JettyUtils.scala:192)
	at scala.util.Try$.apply(Try.scala:161)
	at org.apache.spark.ui.JettyUtils$.connect$1(JettyUtils.scala:191)
	at org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:205)
	at org.apache.spark.ui.WebUI.bind(WebUI.scala:99)
	at org.apache.spark.SparkContext.<init>(SparkContext.scala:223)
	at org.apache.spark.repl.SparkILoop.createSparkContext(SparkILoop.scala:957)
	at $line3.$read$$iwC$$iwC.<init>(<console>:8)
	at $line3.$read$$iwC.<init>(<console>:14)
	at $line3.$read.<init>(<console>:16)
	at $line3.$read$.<init>(<console>:20)
	at $line3.$read$.<clinit>(<console>)
	at $line3.$eval$.<init>(<console>:7)
	at $line3.$eval$.<clinit>(<console>)
	at $line3.$eval.$print(<console>)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:788)
	at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1056)
	at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:614)
	at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:645)
	at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:609)
	at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:796)
	at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:841)
	at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:753)
	at org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:121)
	at org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:120)
	at org.apache.spark.repl.SparkIMain.beQuietDuring(SparkIMain.scala:263)
	at org.apache.spark.repl.SparkILoopInit$class.initializeSpark(SparkILoopInit.scala:120)
	at org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.scala:56)
	at org.apache.spark.repl.SparkILoop$$anonfun$process$1$$anonfun$apply$mcZ$sp$5.apply$mcV$sp(SparkILoop.scala:913)
	at org.apache.spark.repl.SparkILoopInit$class.runThunks(SparkILoopInit.scala:142)
	at org.apache.spark.repl.SparkILoop.runThunks(SparkILoop.scala:56)
	at org.apache.spark.repl.SparkILoopInit$class.postInitialization(SparkILoopInit.scala:104)
	at org.apache.spark.repl.SparkILoop.postInitialization(SparkILoop.scala:56)
	at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:930)
	at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884)
	at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884)
	at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
	at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:884)
	at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:982)
	at org.apache.spark.repl.Main$.main(Main.scala:31)
	at org.apache.spark.repl.Main.main(Main.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:292)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
14/06/08 23:58:23 WARN AbstractLifeCycle: FAILED org.eclipse.jetty.server.Server@7439e55a: java.net.BindException: Address already in use
java.net.BindException: Address already in use
	at sun.nio.ch.Net.bind0(Native Method)
	at sun.nio.ch.Net.bind(Net.java:444)
	at sun.nio.ch.Net.bind(Net.java:436)
	at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:214)
	at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
	at org.eclipse.jetty.server.nio.SelectChannelConnector.open(SelectChannelConnector.java:187)
	at org.eclipse.jetty.server.AbstractConnector.doStart(AbstractConnector.java:316)
	at org.eclipse.jetty.server.nio.SelectChannelConnector.doStart(SelectChannelConnector.java:265)
	at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
	at org.eclipse.jetty.server.Server.doStart(Server.java:293)
	at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
	at org.apache.spark.ui.JettyUtils$$anonfun$1.apply$mcV$sp(JettyUtils.scala:192)
	at org.apache.spark.ui.JettyUtils$$anonfun$1.apply(JettyUtils.scala:192)
	at org.apache.spark.ui.JettyUtils$$anonfun$1.apply(JettyUtils.scala:192)
	at scala.util.Try$.apply(Try.scala:161)
	at org.apache.spark.ui.JettyUtils$.connect$1(JettyUtils.scala:191)
	at org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:205)
	at org.apache.spark.ui.WebUI.bind(WebUI.scala:99)
	at org.apache.spark.SparkContext.<init>(SparkContext.scala:223)
	at org.apache.spark.repl.SparkILoop.createSparkContext(SparkILoop.scala:957)
	at $line3.$read$$iwC$$iwC.<init>(<console>:8)
	at $line3.$read$$iwC.<init>(<console>:14)
	at $line3.$read.<init>(<console>:16)
	at $line3.$read$.<init>(<console>:20)
	at $line3.$read$.<clinit>(<console>)
	at $line3.$eval$.<init>(<console>:7)
	at $line3.$eval$.<clinit>(<console>)
	at $line3.$eval.$print(<console>)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:788)
	at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1056)
	at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:614)
	at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:645)
	at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:609)
	at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:796)
	at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:841)
	at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:753)
	at org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:121)
	at org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:120)
	at org.apache.spark.repl.SparkIMain.beQuietDuring(SparkIMain.scala:263)
	at org.apache.spark.repl.SparkILoopInit$class.initializeSpark(SparkILoopInit.scala:120)
	at org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.scala:56)
	at org.apache.spark.repl.SparkILoop$$anonfun$process$1$$anonfun$apply$mcZ$sp$5.apply$mcV$sp(SparkILoop.scala:913)
	at org.apache.spark.repl.SparkILoopInit$class.runThunks(SparkILoopInit.scala:142)
	at org.apache.spark.repl.SparkILoop.runThunks(SparkILoop.scala:56)
	at org.apache.spark.repl.SparkILoopInit$class.postInitialization(SparkILoopInit.scala:104)
	at org.apache.spark.repl.SparkILoop.postInitialization(SparkILoop.scala:56)
	at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:930)
	at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884)
	at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884)
	at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
	at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:884)
	at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:982)
	at org.apache.spark.repl.Main$.main(Main.scala:31)
	at org.apache.spark.repl.Main.main(Main.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:292)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
14/06/08 23:58:23 INFO JettyUtils: Failed to create UI at port, 4040. Trying again.
14/06/08 23:58:23 INFO JettyUtils: Error was: Failure(java.net.BindException: Address already in use)
14/06/08 23:58:23 INFO SparkUI: Started SparkUI at http://aash-mbp.local:4041
```

After:
```
14/06/09 00:04:12 INFO JettyUtils: Failed to create UI at port, 4040. Trying again.
14/06/09 00:04:12 INFO JettyUtils: Error was: Failure(java.net.BindException: Address already in use)
14/06/09 00:04:12 INFO Server: jetty-8.y.z-SNAPSHOT
14/06/09 00:04:12 INFO AbstractConnector: Started SelectChannelConnector@0.0.0.0:4041
14/06/09 00:04:12 INFO SparkUI: Started SparkUI at http://aash-mbp.local:4041
```

Lengthy logging comes from this line of code in Jetty: http://grepcode.com/file/repo1.maven.org/maven2/org.eclipse.jetty.aggregate/jetty-all/9.1.3.v20140225/org/eclipse/jetty/util/component/AbstractLifeCycle.java#210

Author: Andrew Ash <andrew@andrewash.com>

Closes #1019 from ash211/SPARK-1902 and squashes the following commits:

0dd02f7 [Andrew Ash] Leave old org.eclipse.jetty silencing in place
1e2866b [Andrew Ash] Address CR comments
9d85eed [Andrew Ash] SPARK-1902 Silence stacktrace from logs when doing port failover to port n+1
2014-06-20 18:26:10 -07:00
Aaron Davidson 2044784915 [SQL] Use hive.SessionState, not the thread local SessionState
Note that this is simply mimicking lookupRelation(). I do not have a concrete notion of why this solution is necessarily more correct than SessionState.get, but SessionState.get is returning null, which is bad.

Author: Aaron Davidson <aaron@databricks.com>

Closes #1148 from aarondav/createtable and squashes the following commits:

37c3e7c [Aaron Davidson] [SQL] Use hive.SessionState, not the thread local SessionState
2014-06-20 17:55:54 -07:00
Reynold Xin d4c7572dba Move ScriptTransformation into the appropriate place.
Author: Reynold Xin <rxin@apache.org>

Closes #1162 from rxin/script and squashes the following commits:

2c836b9 [Reynold Xin] Move ScriptTransformation into the appropriate place.
2014-06-20 17:16:56 -07:00
Andrew Or 01125a1162 Clean up CacheManager et al.
**UPDATE**

I have removed the special handling for `StorageLevel.MEMORY_*_SER` for now, because it introduces a potential performance regression. With the latest changes, this PR should include mainly style (code readability) fixes. The only functionality change is the update in `MemoryStore#putBytes` to actually return updated blocks, though this is a minor bug fix.

Now this is mainly a precursor to another PR (once again).

---------
*Old comment*

The deserialized version of a partition may occupy much more space than the serialized version. Therefore, if a partition is to be cached with `StorageLevel.MEMORY_*_SER`, we don't need to fully unroll it into an `ArrayBuffer`, but instead we can unroll it into a potentially much smaller `ByteBuffer`. This may save us from OOMs in this case.

Author: Andrew Or <andrewor14@gmail.com>

Closes #1083 from andrewor14/unroll-them-partitions and squashes the following commits:

7048aa0 [Andrew Or] Merge branch 'master' of github.com:apache/spark into unroll-them-partitions
3d9a366 [Andrew Or] Minor change for readability
d12b95f [Andrew Or] Remove unused imports (minor)
a4c387b [Andrew Or] Merge branch 'master' of github.com:apache/spark into unroll-them-partitions
cf5f565 [Andrew Or] Remove special handling for MEM_*_SER
0091ec0 [Andrew Or] Address review feedback
44ef282 [Andrew Or] Actually return updated blocks in putBytes
2941c89 [Andrew Or] Clean up BlockStore (minor)
a8f181d [Andrew Or] Add special handling for StorageLevel.MEMORY_*_SER
2014-06-20 17:14:33 -07:00
Reynold Xin 0ac71d1284 [SPARK-2225] Turn HAVING without GROUP BY into WHERE.
@willb

Author: Reynold Xin <rxin@apache.org>

Closes #1161 from rxin/having-filter and squashes the following commits:

fa8359a [Reynold Xin] [SPARK-2225] Turn HAVING without GROUP BY into WHERE.
2014-06-20 15:38:02 -07:00
William Benton 171ebb3a82 SPARK-2180: support HAVING clauses in Hive queries
This PR extends Spark's HiveQL support to handle HAVING clauses in aggregations.  The HAVING test from the Hive compatibility suite doesn't appear to be runnable from within Spark, so I added a simple comparable test to `HiveQuerySuite`.

Author: William Benton <willb@redhat.com>

Closes #1136 from willb/SPARK-2180 and squashes the following commits:

3bbaf26 [William Benton] Added casts to HAVING expressions
83f1340 [William Benton] scalastyle fixes
18387f1 [William Benton] Add test for HAVING without GROUP BY
b880bef [William Benton] Added semantic error for HAVING without GROUP BY
942428e [William Benton] Added test coverage for SPARK-2180.
56084cc [William Benton] Add support for HAVING clauses in Hive queries.
2014-06-20 13:41:38 -07:00
Allan Douglas R. de Oliveira 6a224c31e8 SPARK-1868: Users should be allowed to cogroup at least 4 RDDs
Adds cogroup for 4 RDDs.

Author: Allan Douglas R. de Oliveira <allandouglas@gmail.com>

Closes #813 from douglaz/more_cogroups and squashes the following commits:

f8d6273 [Allan Douglas R. de Oliveira] Test python groupWith for one more case
0e9009c [Allan Douglas R. de Oliveira] Added scala tests
c3ffcdd [Allan Douglas R. de Oliveira] Added java tests
517a67f [Allan Douglas R. de Oliveira] Added tests for python groupWith
2f402d5 [Allan Douglas R. de Oliveira] Removed TODO
17474f4 [Allan Douglas R. de Oliveira] Use new cogroup function
7877a2a [Allan Douglas R. de Oliveira] Fixed code
ba02414 [Allan Douglas R. de Oliveira] Added varargs cogroup to pyspark
c4a8a51 [Allan Douglas R. de Oliveira] Added java cogroup 4
e94963c [Allan Douglas R. de Oliveira] Fixed spacing
f1ee57b [Allan Douglas R. de Oliveira] Fixed scala style issues
d7196f1 [Allan Douglas R. de Oliveira] Allow the cogroup of 4 RDDs
2014-06-20 11:03:03 -07:00
Gang Bai d484ddeff1 [SPARK-2163] class LBFGS optimize with Double tolerance instead of Int
https://issues.apache.org/jira/browse/SPARK-2163

This pull request includes the change for **[SPARK-2163]**:

* Changed the convergence tolerance parameter from type `Int` to type `Double`.
* Added types for vars in `class LBFGS`, making the style consistent with `class GradientDescent`.
* Added associated test to check that optimizing via `class LBFGS` produces the same results as via calling `runLBFGS` from `object LBFGS`.

This is a very minor change but it will solve the problem in my implementation of a regression model for count data, where I make use of LBFGS for parameter estimation.
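
A sketch of why the parameter type matters, using an illustrative class rather than the actual mllib code: convergence tolerances are small fractions, so an `Int` setter would silently truncate a value like `1e-4` to `0`.

```scala
// Illustrative setter only; the real LBFGS class has more state and setters.
class LBFGSSketch {
  private var convergenceTol: Double = 1e-4

  /** With the old Int signature, a tolerance of 1e-4 would have been stored as 0. */
  def setConvergenceTol(tolerance: Double): this.type = {
    convergenceTol = tolerance
    this
  }
}
```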

Author: Gang Bai <me@baigang.net>

Closes #1104 from BaiGang/fix_int_tol and squashes the following commits:

cecf02c [Gang Bai] Changed setConvergenceTol to specify tolerance with a parameter of type Double. For the reason and the problem caused by an Int parameter, please check https://issues.apache.org/jira/browse/SPARK-2163. Added a test in LBFGSSuite for validating that optimizing via class LBFGS produces the same results as calling runLBFGS from object LBFGS. Keep the indentations and styles correct.
2014-06-20 08:52:20 -07:00
Reynold Xin 2f6a835e1a [SPARK-2218] rename Equals to EqualTo in Spark SQL expressions.
Due to the existence of scala.Equals, it is very error-prone to name the expression Equals, especially because we use a lot of partial functions and pattern matching in the optimizer.
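
A small illustration of why the old name was error-prone, with hypothetical types rather than the Catalyst classes: `scala.Equals` is always in scope, so a type pattern written against an unimported `Equals` expression class would silently match almost every case-class value instead.

```scala
object EqualsPitfallSketch {
  trait Expression
  case class EqualTo(left: Expression, right: Expression) extends Expression

  def describe(e: Any): String = e match {
    case EqualTo(_, _)   => "SQL equality expression"
    case _: scala.Equals => "any case class or tuple (what a bare `Equals` type pattern would hit)"
    case _               => "something else"
  }
}
```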

Note that this sits on top of #1144.

Author: Reynold Xin <rxin@apache.org>

Closes #1146 from rxin/equals and squashes the following commits:

f8583fd [Reynold Xin] Merge branch 'master' of github.com:apache/spark into equals
326b388 [Reynold Xin] Merge branch 'master' of github.com:apache/spark into equals
bd19807 [Reynold Xin] Rename EqualsTo to EqualTo.
81148d1 [Reynold Xin] [SPARK-2218] rename Equals to EqualsTo in Spark SQL expressions.
c4e543d [Reynold Xin] [SPARK-2210] boolean cast on boolean value should be removed.
2014-06-20 00:34:59 -07:00
Takuya UESHIN 3249528920 [SPARK-2196] [SQL] Fix nullability of CaseWhen.
`CaseWhen` should use `branches.length` to check if `elseValue` is provided or not.
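
A sketch of the check, under the assumption (mine, not stated in the message) that `branches` holds alternating condition/value expressions with an optional trailing else value, so an odd length means an else branch is present.

```scala
object CaseWhenSketch {
  // Hypothetical layout: Seq(cond1, value1, cond2, value2, ..., elseValue?)
  // An odd length means a trailing else value exists.
  def hasElseValue(branches: Seq[AnyRef]): Boolean = branches.length % 2 == 1

  // Without an else value the expression can evaluate to null, so it must stay nullable.
  def caseWhenNullable(branches: Seq[AnyRef], someValueIsNullable: Boolean): Boolean =
    !hasElseValue(branches) || someValueIsNullable
}
```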

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #1133 from ueshin/issues/SPARK-2196 and squashes the following commits:

510f12d [Takuya UESHIN] Add some tests.
dc25e8d [Takuya UESHIN] Fix nullable of CaseWhen to be nullable if the elseValue is nullable.
4f049cc [Takuya UESHIN] Fix nullability of CaseWhen.
2014-06-20 00:12:52 -07:00
Aaron Davidson f46e02fcdb SPARK-2203: PySpark defaults to use same num reduce partitions as map side
For shuffle-based operators, such as rdd.groupBy() or rdd.sortByKey(), PySpark will always assume that the default parallelism to use for the reduce side is ctx.defaultParallelism, which is a constant typically determined by the number of cores in the cluster.

In contrast, Spark's Partitioner#defaultPartitioner will use the same number of reduce partitions as map partitions unless the defaultParallelism config is explicitly set. This tends to be a better default in order to avoid OOMs, and should also be the behavior of PySpark.
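
A rough sketch of the rule described above (not the actual `Partitioner.defaultPartitioner` code): fall back to the upstream RDD's partition count unless a default parallelism has been explicitly configured.

```scala
object DefaultPartitionerSketch {
  // Illustrative only: partition counts stand in for the RDDs themselves.
  def defaultNumPartitions(upstreamPartitions: Int, explicitDefaultParallelism: Option[Int]): Int =
    explicitDefaultParallelism.getOrElse(upstreamPartitions)
}
// e.g. defaultNumPartitions(200, None) == 200, while Some(8) would force 8 reduce partitions.
```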

JIRA: https://issues.apache.org/jira/browse/SPARK-2203

Author: Aaron Davidson <aaron@databricks.com>

Closes #1138 from aarondav/pyfix and squashes the following commits:

1bd5751 [Aaron Davidson] SPARK-2203: PySpark defaults to use same num reduce partitions as map partitions
2014-06-20 00:06:57 -07:00
Reynold Xin c55bbb49f7 [SPARK-2209][SQL] Cast shouldn't do null check twice.
Also took the chance to clean up cast a little bit. Too many arrows on each line before!

Author: Reynold Xin <rxin@apache.org>

Closes #1143 from rxin/cast and squashes the following commits:

dd006cb [Reynold Xin] Code review feedback.
c2b88ae [Reynold Xin] [SPARK-2209][SQL] Cast shouldn't do null check twice.
2014-06-20 00:01:19 -07:00
Reynold Xin 6175640973 [SPARK-2210] cast to boolean on boolean value gets turned into NOT((boolean_condition) = 0)
```
explain select cast(cast(key=0 as boolean) as boolean) aaa from src
```
should be
```
[Physical execution plan:]
[Project [(key#10:0 = 0) AS aaa#7]]
[ HiveTableScan [key#10], (MetastoreRelation default, src, None), None]
```

However, it is currently
```
[Physical execution plan:]
[Project [NOT((key#10=0) = 0) AS aaa#7]]
[ HiveTableScan [key#10], (MetastoreRelation default, src, None), None]
```

Author: Reynold Xin <rxin@apache.org>

Closes #1144 from rxin/booleancast and squashes the following commits:

c4e543d [Reynold Xin] [SPARK-2210] boolean cast on boolean value should be removed.
2014-06-19 23:58:23 -07:00