ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Vida Ha	7faf755ae4	Spark-3213 Fixes issue with spark-ec2 not detecting slaves created with "Launch More like this" ... copy the spark_cluster_tag from a spot instance requests over to the instances. Author: Vida Ha <vida@databricks.com> Closes #2163 from vidaha/vida/spark-3213 and squashes the following commits: 5070a70 [Vida Ha] Spark-3214 Fix issue with spark-ec2 not detecting slaves created with 'Launch More Like This' and using Spot Requests	2014-08-27 14:26:06 -07:00
Davies Liu	4fa2fda88f	[SPARK-2871] [PySpark] add RDD.lookup(key) RDD.lookup(key) Return the list of values in the RDD for key `key`. This operation is done efficiently if the RDD has a known partitioner by only searching the partition that the key maps to. >>> l = range(1000) >>> rdd = sc.parallelize(zip(l, l), 10) >>> rdd.lookup(42) # slow [42] >>> sorted = rdd.sortByKey() >>> sorted.lookup(42) # fast [42] It also clean up the code in RDD.py, and fix several bugs (related to preservesPartitioning). Author: Davies Liu <davies.liu@gmail.com> Closes #2093 from davies/lookup and squashes the following commits: 1789cd4 [Davies Liu] `f` in foreach could be generator or not. 2871b80 [Davies Liu] Merge branch 'master' into lookup c6390ea [Davies Liu] address all comments 0f1bce8 [Davies Liu] add test case for lookup() be0e8ba [Davies Liu] fix preservesPartitioning eb1305d [Davies Liu] add RDD.lookup(key)	2014-08-27 13:18:33 -07:00
chutium	48f42781de	[SPARK-3138][SQL] sqlContext.parquetFile should be able to take a single file as parameter ```if (!fs.getFileStatus(path).isDir) throw Exception``` make no sense after this commit #1370 be careful if someone is working on SPARK-2551, make sure the new change passes test case ```test("Read a parquet file instead of a directory")``` Author: chutium <teng.qiu@gmail.com> Closes #2044 from chutium/parquet-singlefile and squashes the following commits: 4ae477f [chutium] [SPARK-3138][SQL] sqlContext.parquetFile should be able to take a single file as parameter	2014-08-27 13:13:04 -07:00
Chip Senkbeil	191d7cf2a6	[SPARK-3256] Added support for :cp <jar> that was broken in Scala 2.10.x for REPL As seen with [SI-6502](https://issues.scala-lang.org/browse/SI-6502) of Scala, the _:cp_ command was broken in Scala 2.10.x. As the Spark shell is a friendly wrapper on top of the Scala REPL, it is also affected by this problem. My solution was to alter the internal classpath and invalidate any new entries. I also had to add the ability to add new entries to the parent classloader of the interpreter (SparkIMain's global). The advantage of this versus wiping the interpreter and replaying all of the commands is that you don't have to worry about rerunning heavy Spark-related commands (going to the cluster) or potentially reloading data that might have changed. Instead, you get to work from where you left off. Until this is fixed upstream for 2.10.x, I had to use reflection to alter the internal compiler classpath. The solution now looks like this: ![screen shot 2014-08-13 at 3 46 02 pm](https://cloud.githubusercontent.com/assets/2481802/3912625/f02b1440-232c-11e4-9bf6-bafb3e352d14.png) Author: Chip Senkbeil <rcsenkbe@us.ibm.com> Closes #1929 from rcsenkbeil/FixReplClasspathSupport and squashes the following commits: f420cbf [Chip Senkbeil] Added SparkContext.addJar calls to support executing code on remote clusters a826795 [Chip Senkbeil] Updated AddUrlsToClasspath to use 'new Run' suggestion over hackish compiler error 2ff1d86 [Chip Senkbeil] Added compilation failure on symbols hack to get Scala classes to load correctly a220639 [Chip Senkbeil] Added support for :cp <jar> that was broken in Scala 2.10.x for REPL	2014-08-27 13:01:11 -07:00
Cheng Hao	4238c17dc9	[SPARK-3197] [SQL] Reduce the Expression tree object creations for aggregation function (min/max) Aggregation function min/max in catalyst will create expression tree for each single row, however, the expression tree creation is quite expensive in a multithreading env currently. Hence we got a very bad performance for the min/max. Here is the benchmark that I've done in my local. Master \| Previous Result (ms) \| Current Result (ms) ------------ \| ------------- \| ------------- local \| 3645 \| 3416 local[6] \| 3602 \| 1002 The Benchmark source code. ``` case class Record(key: Int, value: Int) object TestHive2 extends HiveContext(new SparkContext("local[6]", "TestSQLContext", new SparkConf())) object DataPrepare extends App { import TestHive2._ val rdd = sparkContext.parallelize((1 to 10000000).map(i => Record(i % 3000, i)), 12) runSqlHive("SHOW TABLES") runSqlHive("DROP TABLE if exists a") runSqlHive("DROP TABLE if exists result") rdd.registerAsTable("records") runSqlHive("""CREATE TABLE a (key INT, value INT) \| ROW FORMAT SERDE \| 'org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe' \| STORED AS RCFILE """.stripMargin) runSqlHive("""CREATE TABLE result (key INT, value INT) \| ROW FORMAT SERDE \| 'org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe' \| STORED AS RCFILE """.stripMargin) hql(s"""from records \| insert into table a \| select key, value """.stripMargin) } object PerformanceTest extends App { import TestHive2._ hql("SHOW TABLES") hql("set spark.sql.shuffle.partitions=12") val cmd = "select min(value), max(value) from a group by key" val results = ("Result1", benchmark(cmd)) :: ("Result2", benchmark(cmd)) :: ("Result3", benchmark(cmd)) :: Nil results.foreach { case (prompt, result) => { println(s"$prompt: took ${result._1} ms (${result._2} records)") } } def benchmark(cmd: String) = { val begin = System.currentTimeMillis() val count = hql(cmd).count val end = System.currentTimeMillis() ((end - begin), count) } } ``` Author: Cheng Hao <hao.cheng@intel.com> Closes #2113 from chenghao-intel/aggregation_expression_optimization and squashes the following commits: db40395 [Cheng Hao] remove the transient and add val for the expression property d56167d [Cheng Hao] Reduce the Expressions creation	2014-08-27 12:50:47 -07:00
u0jing	3b5eb7083d	[SPARK-3118][SQL]add "SHOW TBLPROPERTIES tblname;" and "SHOW COLUMNS (FROM\|IN) table_name [(FROM\|IN) db_name]" support JIRA issue: [SPARK-3118] https://issues.apache.org/jira/browse/SPARK-3118 eg: > SHOW TBLPROPERTIES test; SHOW TBLPROPERTIES test; numPartitions 0 numFiles 1 transient_lastDdlTime 1407923642 numRows 0 totalSize 82 rawDataSize 0 eg: > SHOW COLUMNS in test; SHOW COLUMNS in test; OK Time taken: 0.304 seconds id stid bo Author: u0jing <u9jing@gmail.com> Closes #2034 from u0jing/spark-3118 and squashes the following commits: b231d87 [u0jing] add golden answer files 35f4885 [u0jing] add 'show columns' and 'show tblproperties' support	2014-08-27 12:47:14 -07:00
Allan Douglas R. de Oliveira	5ac4093c9f	SPARK-3259 - User data should be given to the master Author: Allan Douglas R. de Oliveira <allan@chaordicsystems.com> Closes #2162 from douglaz/user_data_master and squashes the following commits: 10d15f6 [Allan Douglas R. de Oliveira] Give user data also to the master	2014-08-27 12:43:22 -07:00
uncleGen	d8298c46b7	[SPARK-3170][CORE][BUG]:RDD info loss in "StorageTab" and "ExecutorTab" compeleted stage only need to remove its own partitions that are no longer cached. However, "StorageTab" may lost some rdds which are cached actually. Not only in "StorageTab", "ExectutorTab" may also lose some rdd info which have been overwritten by last rdd in a same task. 1. "StorageTab": when multiple stages run simultaneously, completed stage will remove rdd info which belong to other stages that are still running. 2. "ExectutorTab": taskcontext may lose some "updatedBlocks" info of rdds in a dependency chain. Like the following example: val r1 = sc.paralize(..).cache() val r2 = r1.map(...).cache() val n = r2.count() When count the r2, r1 and r2 will be cached finally. So in CacheManager.getOrCompute, the taskcontext should contain "updatedBlocks" of r1 and r2. Currently, the "updatedBlocks" only contain the info of r2. Author: uncleGen <hustyugm@gmail.com> Closes #2131 from uncleGen/master_ui_fix and squashes the following commits: a6a8a0b [uncleGen] fix some coding style 3a1bc15 [uncleGen] fix some error in unit test 56ea488 [uncleGen] there's some line too long c82ba82 [uncleGen] Bug Fix: RDD info loss in "StorageTab" and "ExecutorTab"	2014-08-27 10:33:01 -07:00
Marcelo Vanzin	b92d823ad1	[SPARK-2933] [yarn] Refactor and cleanup Yarn AM code. This change modifies the Yarn module so that all the logic related to running the ApplicationMaster is localized. Instead of, previously, 4 different classes with mostly identical code, now we have: - A single, shared ApplicationMaster class, which can operate both in client and cluster mode, and substitutes the old ApplicationMaster (for cluster mode) and ExecutorLauncher (for client mode). The benefit here is that all different execution modes for all supported yarn versions use the same shared code for monitoring executor allocation, setting up configuration, and monitoring the process's lifecycle. - A new YarnRMClient interface, which defines basic RM functionality needed by the ApplicationMaster. This interface has concrete implementations for each supported Yarn version. - A new YarnAllocator interface, which just abstracts the existing interface of the YarnAllocationHandler class. This is to avoid having to touch the allocator code too much in this change, although it might benefit from a similar effort in the future. The end result is much easier to understand code, with much less duplication, making it much easier to fix bugs, add features, and test everything knowing that all supported versions will behave the same. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #2020 from vanzin/SPARK-2933 and squashes the following commits: 3bbf3e7 [Marcelo Vanzin] Merge branch 'master' into SPARK-2933 ff389ed [Marcelo Vanzin] Do not interrupt reporter thread from within itself. 3a8ed37 [Marcelo Vanzin] Remote stale comment. 0f5142c [Marcelo Vanzin] Review feedback. 41f8c8a [Marcelo Vanzin] Fix app status reporting. c0794be [Marcelo Vanzin] Correctly clean up staging directory. 92770cc [Marcelo Vanzin] Merge branch 'master' into SPARK-2933 ecaf332 [Marcelo Vanzin] Small fix to shutdown code. f02d3f8 [Marcelo Vanzin] Merge branch 'master' into SPARK-2933 f581122 [Marcelo Vanzin] Review feedback. 557fdeb [Marcelo Vanzin] Cleanup a couple more constants. be6068d [Marcelo Vanzin] Restore shutdown hook to clean up staging dir. 5150993 [Marcelo Vanzin] Some more cleanup. b6289ab [Marcelo Vanzin] Move cluster/client code to separate methods. ecb23cd [Marcelo Vanzin] More trivial cleanup. 34f1e63 [Marcelo Vanzin] Fix some questionable error handling. 5657c7d [Marcelo Vanzin] Finish app if SparkContext initialization times out. 0e4be3d [Marcelo Vanzin] Keep "ExecutorLauncher" as the main class for client-mode AM. 91beabb [Marcelo Vanzin] Fix UI filter registration. 8c72239 [Marcelo Vanzin] Trivial cleanups. 99a52d5 [Marcelo Vanzin] Changes to the yarn-alpha project to use common AM code. 848ca6d [Marcelo Vanzin] [SPARK-2933] [yarn] Refactor and cleanup Yarn AM code.	2014-08-27 11:02:04 -05:00
Hari Shreedharan	6f671d04fa	[SPARK-3154][STREAMING] Make FlumePollingInputDStream shutdown cleaner. Currently lot of errors get thrown from Avro IPC layer when the dstream or sink is shutdown. This PR cleans it up. Some refactoring is done in the receiver code to put all of the RPC code into a single Try and just recover from that. The sink code has also been cleaned up. Author: Hari Shreedharan <hshreedharan@apache.org> Closes #2065 from harishreedharan/clean-flume-shutdown and squashes the following commits: f93a07c [Hari Shreedharan] Formatting fixes. d7427cc [Hari Shreedharan] More fixes! a0a8852 [Hari Shreedharan] Fix race condition, hopefully! Minor other changes. 4c9ed02 [Hari Shreedharan] Remove unneeded list in Callback handler. Other misc changes. 8fee36f [Hari Shreedharan] Scala-library is required, else maven build fails. Also catch InterruptedException in TxnProcessor. 445e700 [Hari Shreedharan] Merge remote-tracking branch 'asf/master' into clean-flume-shutdown 87232e0 [Hari Shreedharan] Refactor Flume Input Stream. Clean up code, better error handling. 9001d26 [Hari Shreedharan] Change log level to debug in TransactionProcessor#shutdown method e7b8d82 [Hari Shreedharan] Incorporate review feedback 598efa7 [Hari Shreedharan] Clean up some exception handling code e1027c6 [Hari Shreedharan] Merge remote-tracking branch 'asf/master' into clean-flume-shutdown ed608c8 [Hari Shreedharan] [SPARK-3154][STREAMING] Make FlumePollingInputDStream shutdown cleaner.	2014-08-27 02:39:02 -07:00
Joseph K. Bradley	171a41cb03	[SPARK-3227] [mllib] Added migration guide for v1.0 to v1.1 The only updates are in DecisionTree. CC: mengxr Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com> Closes #2146 from jkbradley/mllib-migration and squashes the following commits: 5a1f487 [Joseph K. Bradley] small edit to doc 411d6d9 [Joseph K. Bradley] Added migration guide for v1.0 to v1.1. The only updates are in DecisionTree.	2014-08-27 01:45:59 -07:00
Xiangrui Meng	43dfc84f88	[SPARK-2830][MLLIB] doc update for 1.1 1. renamed mllib-basics to mllib-data-types 1. renamed mllib-stats to mllib-statistics 1. moved random data generation to the bottom of mllib-stats 1. updated toc accordingly atalwalkar Author: Xiangrui Meng <meng@databricks.com> Closes #2151 from mengxr/mllib-doc-1.1 and squashes the following commits: 0bd79f3 [Xiangrui Meng] add mllib-data-types b64a5d7 [Xiangrui Meng] update the content list of basis statistics in mllib-guide f625cc2 [Xiangrui Meng] move mllib-basics to mllib-data-types 4d69250 [Xiangrui Meng] move random data generation to the bottom of statistics e64f3ce [Xiangrui Meng] move mllib-stats.md to mllib-statistics.md	2014-08-27 01:19:48 -07:00
Michael Armbrust	e1139dd60e	[SPARK-3237][SQL] Fix parquet filters with UDFs Author: Michael Armbrust <michael@databricks.com> Closes #2153 from marmbrus/parquetFilters and squashes the following commits: 712731a [Michael Armbrust] Use closure serializer for sending filters. 1e83f80 [Michael Armbrust] Clean udf functions.	2014-08-27 00:59:23 -07:00
Tathagata Das	3e2864e404	[SPARK-3139] Made ContextCleaner to not block on shuffles As a workaround for SPARK-3015, the ContextCleaner was made "blocking", that is, it cleaned items one-by-one. But shuffles can take a long time to be deleted. Given that the RC for 1.1 is imminent, this PR makes a narrow change in the context cleaner - not wait for shuffle cleanups to complete. Also it changes the error messages on failure to delete to be milder warnings, as exceptions in the delete code path for one item does not really stop the actual functioning of the system. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #2143 from tdas/cleaner-shuffle-fix and squashes the following commits: 9c84202 [Tathagata Das] Restoring default blocking behavior in ContextCleanerSuite, and added docs to identify that spark.cleaner.referenceTracking.blocking does not control shuffle. 2181329 [Tathagata Das] Mark shuffle cleanup as non-blocking. e337cc2 [Tathagata Das] Changed semantics based on PR comments. 387b578 [Tathagata Das] Made ContextCleaner to not block on shuffles	2014-08-27 00:13:38 -07:00
Patrick Wendell	9d65f2712c	HOTFIX: Minor typo in conf template	2014-08-26 23:40:50 -07:00
Andrew Or	7557c4cfef	[SPARK-3167] Handle special driver configs in Windows This is an effort to bring the Windows scripts up to speed after recent splashing changes in #1845. Author: Andrew Or <andrewor14@gmail.com> Closes #2129 from andrewor14/windows-config and squashes the following commits: 881a8f0 [Andrew Or] Add reference to Windows taskkill 92e6047 [Andrew Or] Update a few comments (minor) 22b1acd [Andrew Or] Fix style again (minor) afcffea [Andrew Or] Fix style (minor) 72004c2 [Andrew Or] Actually respect --driver-java-options 803218b [Andrew Or] Actually respect SPARK_*_CLASSPATH eeb34a0 [Andrew Or] Update outdated comment (minor) 35caecc [Andrew Or] In Windows, actually kill Java processes on exit f97daa2 [Andrew Or] Fix Windows spark shell stdin issue 83ebe60 [Andrew Or] Parse special driver configs in Windows (broken)	2014-08-26 22:52:16 -07:00
Reynold Xin	bf719056b7	[SPARK-3224] FetchFailed reduce stages should only show up once in failed stages (in UI) This is a HOTFIX for 1.1. Author: Reynold Xin <rxin@apache.org> Author: Kay Ousterhout <kayousterhout@gmail.com> Closes #2127 from rxin/SPARK-3224 and squashes the following commits: effb1ce [Reynold Xin] Move log message. 49282b3 [Reynold Xin] Kay's feedback. 3f01847 [Reynold Xin] Merge pull request #2 from kayousterhout/SPARK-3224 796d282 [Kay Ousterhout] Added unit test for SPARK-3224 3d3d356 [Reynold Xin] Remove map output loc even for repeated FetchFaileds. 1dd3eb5 [Reynold Xin] [SPARK-3224] FetchFailed reduce stages should only show up once in the failed stages UI.	2014-08-26 22:12:37 -07:00
Matei Zaharia	e70aff6c2d	Manually close old pull requests Closes #671, Closes #515	2014-08-26 19:35:19 -07:00
Matei Zaharia	ee91eb8c53	Manually close some old pull requests Closes #530, Closes #223, Closes #738, Closes #546	2014-08-26 19:29:54 -07:00
Josh Rosen	d8345471ce	Fix unclosed HTML tag in Yarn docs.	2014-08-26 18:55:00 -07:00
Martin Weindel	be043e3f20	[SPARK-3240] Adding known issue for MESOS-1688 When using Mesos with the fine-grained mode, a Spark job can run into a dead lock on low allocatable memory on Mesos slaves. As a work-around 32 MB (= Mesos MIN_MEM) are allocated for each task, to ensure Mesos making new offers after task completion. From my perspective, it would be better to fix this problem in Mesos by dropping the constraint on memory for offers, but as temporary solution this patch helps to avoid the dead lock on current Mesos versions. See [[MESOS-1688] No offers if no memory is allocatable](https://issues.apache.org/jira/browse/MESOS-1688) for details for this problem. Author: Martin Weindel <martin.weindel@gmail.com> Closes #1860 from MartinWeindel/master and squashes the following commits: 5762030 [Martin Weindel] reverting work-around a6bf837 [Martin Weindel] added known issue for issue MESOS-1688 d9d2ca6 [Martin Weindel] work around for problem with Mesos offering semantic (see [https://issues.apache.org/jira/browse/MESOS-1688])	2014-08-26 18:30:45 -07:00
Takuya UESHIN	727cb25bcc	[SPARK-3036][SPARK-3037][SQL] Add MapType/ArrayType containing null value support to Parquet. JIRA: - https://issues.apache.org/jira/browse/SPARK-3036 - https://issues.apache.org/jira/browse/SPARK-3037 Currently this uses the following Parquet schema for `MapType` when `valueContainsNull` is `true`: ``` message root { optional group a (MAP) { repeated group map (MAP_KEY_VALUE) { required int32 key; optional int32 value; } } } ``` for `ArrayType` when `containsNull` is `true`: ``` message root { optional group a (LIST) { repeated group bag { optional int32 array; } } } ``` We have to think about compatibilities with older version of Spark or Hive or others I mentioned in the JIRA issues. Notice: This PR is based on #1963 and #1889. Please check them first. /cc marmbrus, yhuai Author: Takuya UESHIN <ueshin@happy-camper.st> Closes #2032 from ueshin/issues/SPARK-3036_3037 and squashes the following commits: 4e8e9e7 [Takuya UESHIN] Add ArrayType containing null value support to Parquet. 013c2ca [Takuya UESHIN] Add MapType containing null value support to Parquet. 62989de [Takuya UESHIN] Merge branch 'issues/SPARK-2969' into issues/SPARK-3036_3037 8e38b53 [Takuya UESHIN] Merge branch 'issues/SPARK-3063' into issues/SPARK-3036_3037	2014-08-26 18:28:41 -07:00
nchammas	73b3089b8d	[Docs] Run tests like in contributing guide The Contributing to Spark guide [recommends](https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-AutomatedTesting) running tests by calling `./dev/run-tests`. The README should, too. `./sbt/sbt test` does not cover Python tests or style tests. Author: nchammas <nicholas.chammas@gmail.com> Closes #2149 from nchammas/patch-2 and squashes the following commits: 2b3b132 [nchammas] [Docs] Run tests like in contributing guide	2014-08-26 17:50:04 -07:00
Cheng Lian	faeb9c0e14	[SPARK-2964] [SQL] Remove duplicated code from spark-sql and start-thriftserver.sh Author: Cheng Lian <lian.cs.zju@gmail.com> Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #1886 from sarutak/SPARK-2964 and squashes the following commits: 8ef8751 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-2964 26e7c95 [Kousuke Saruta] Revert "Shorten timeout to more reasonable value" ffb68fa [Kousuke Saruta] Modified spark-sql and start-thriftserver.sh to use bin/utils.sh 8c6f658 [Kousuke Saruta] Merge branch 'spark-3026' of https://github.com/liancheng/spark into SPARK-2964 81b43a8 [Cheng Lian] Shorten timeout to more reasonable value a89e66d [Cheng Lian] Fixed command line options quotation in scripts 9c894d3 [Cheng Lian] Fixed bin/spark-sql -S option typo be4736b [Cheng Lian] Report better error message when running JDBC/CLI without hive-thriftserver profile enabled	2014-08-26 17:33:40 -07:00
WangTao	2ffd3290fe	[SPARK-3225]Typo in script use_conf_dir => user_conf_dir in load-spark-env.sh. Author: WangTao <barneystinson@aliyun.com> Closes #1926 from WangTaoTheTonic/TypoInScript and squashes the following commits: 0c104ad [WangTao] Typo in script	2014-08-26 17:30:59 -07:00
Davies Liu	f1e71d4c3b	[SPARK-3073] [PySpark] use external sort in sortBy() and sortByKey() Using external sort to support sort large datasets in reduce stage. Author: Davies Liu <davies.liu@gmail.com> Closes #1978 from davies/sort and squashes the following commits: bbcd9ba [Davies Liu] check spilled bytes in tests b125d2f [Davies Liu] add test for external sort in rdd eae0176 [Davies Liu] choose different disks from different processes and instances 1f075ed [Davies Liu] Merge branch 'master' into sort eb53ca6 [Davies Liu] Merge branch 'master' into sort 644abaf [Davies Liu] add license in LICENSE 19f7873 [Davies Liu] improve tests 55602ee [Davies Liu] use external sort in sortBy() and sortByKey()	2014-08-26 16:57:40 -07:00
Michael Armbrust	c4787a3690	[SPARK-3194][SQL] Add AttributeSet to fix bugs with invalid comparisons of AttributeReferences It is common to want to describe sets of attributes that are in various parts of a query plan. However, the semantics of putting `AttributeReference` objects into a standard Scala `Set` result in subtle bugs when references differ cosmetically. For example, with case insensitive resolution it is possible to have two references to the same attribute whose names are not equal. In this PR I introduce a new abstraction, an `AttributeSet`, which performs all comparisons using the globally unique `ExpressionId` instead of case class equality. (There is already a related class, [`AttributeMap`](https://github.com/marmbrus/spark/blob/inMemStats/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/AttributeMap.scala#L32)) This new type of set is used to fix a bug in the optimizer where needed attributes were getting projected away underneath join operators. I also took this opportunity to refactor the expression and query plan base classes. In all but one instance the logic for computing the `references` of an `Expression` were the same. Thus, I moved this logic into the base class. For query plans the semantics of the `references` method were ill defined (is it the references output? or is it those used by expression evaluation? or what?). As a result, this method wasn't really used very much. So, I removed it. TODO: - [x] Finish scala doc for `AttributeSet` - [x] Scan the code for other instances of `Set[Attribute]` and refactor them. - [x] Finish removing `references` from `QueryPlan` Author: Michael Armbrust <michael@databricks.com> Closes #2109 from marmbrus/attributeSets and squashes the following commits: 1c0dae5 [Michael Armbrust] work on serialization bug. 9ba868d [Michael Armbrust] Merge remote-tracking branch 'origin/master' into attributeSets 3ae5288 [Michael Armbrust] review comments 40ce7f6 [Michael Armbrust] style d577cc7 [Michael Armbrust] Scaladoc cae5d22 [Michael Armbrust] remove more references implementations d6e16be [Michael Armbrust] Remove more instances of "def references" and normal sets of attributes. fc26b49 [Michael Armbrust] Add AttributeSet class, remove references from Expression.	2014-08-26 16:29:14 -07:00
Burak	1208f72ac7	[SPARK-2839][MLlib] Stats Toolkit documentation updated Documentation updated for the Statistics Toolkit of MLlib. mengxr atalwalkar https://issues.apache.org/jira/browse/SPARK-2839 P.S. Accidentally closed #2123. New commits didn't show up after I reopened the PR. I've opened this instead and closed the old one. Author: Burak <brkyvz@gmail.com> Closes #2130 from brkyvz/StatsLib-Docs and squashes the following commits: a54a855 [Burak] [SPARK-2839][MLlib] Addressed comments bfc6896 [Burak] [SPARK-2839][MLlib] Added a more specific link to colStats() for pyspark 213fe3f [Burak] [SPARK-2839][MLlib] Modifications made according to review fec4d9d [Burak] [SPARK-2830][MLlib] Stats Toolkit documentation updated	2014-08-26 15:18:42 -07:00
Xiangrui Meng	adbd5c1636	[SPARK-3226][MLLIB] doc update for native libraries to mention `-Pnetlib-lgpl` option. atalwalkar Author: Xiangrui Meng <meng@databricks.com> Closes #2128 from mengxr/mllib-native and squashes the following commits: 4cbba57 [Xiangrui Meng] update mllib dependencies	2014-08-26 15:12:27 -07:00
Takuya UESHIN	6b5584ef1c	[SPARK-3063][SQL] ExistingRdd should convert Map to catalyst Map. Currently `ExistingRdd.convertToCatalyst` doesn't convert `Map` value. Author: Takuya UESHIN <ueshin@happy-camper.st> Closes #1963 from ueshin/issues/SPARK-3063 and squashes the following commits: 3ba41f2 [Takuya UESHIN] Merge branch 'master' into issues/SPARK-3063 4d7bae2 [Takuya UESHIN] Merge branch 'master' into issues/SPARK-3063 9321379 [Takuya UESHIN] Merge branch 'master' into issues/SPARK-3063 d8a900a [Takuya UESHIN] Make ExistingRdd.convertToCatalyst be able to convert Map value.	2014-08-26 15:04:08 -07:00
Takuya UESHIN	98c2bb0bbd	[SPARK-2969][SQL] Make ScalaReflection be able to handle ArrayType.containsNull and MapType.valueContainsNull. Make `ScalaReflection` be able to handle like: - `Seq[Int]` as `ArrayType(IntegerType, containsNull = false)` - `Seq[java.lang.Integer]` as `ArrayType(IntegerType, containsNull = true)` - `Map[Int, Long]` as `MapType(IntegerType, LongType, valueContainsNull = false)` - `Map[Int, java.lang.Long]` as `MapType(IntegerType, LongType, valueContainsNull = true)` Author: Takuya UESHIN <ueshin@happy-camper.st> Closes #1889 from ueshin/issues/SPARK-2969 and squashes the following commits: 24f1c5c [Takuya UESHIN] Change the default value of ArrayType.containsNull to true in Python API. 79f5b65 [Takuya UESHIN] Change the default value of ArrayType.containsNull to true in Java API. 7cd1a7a [Takuya UESHIN] Fix json test failures. 2cfb862 [Takuya UESHIN] Change the default value of ArrayType.containsNull to true. 2f38e61 [Takuya UESHIN] Revert the default value of MapTypes.valueContainsNull. 9fa02f5 [Takuya UESHIN] Fix a test failure. 1a9a96b [Takuya UESHIN] Modify ScalaReflection to handle ArrayType.containsNull and MapType.valueContainsNull.	2014-08-26 13:22:55 -07:00
Davies Liu	3cedc4f4d7	[SPARK-2871] [PySpark] add histgram() API RDD.histogram(buckets) Compute a histogram using the provided buckets. The buckets are all open to the right except for the last which is closed. e.g. [1,10,20,50] means the buckets are [1,10) [10,20) [20,50], which means 1<=x<10, 10<=x<20, 20<=x<=50. And on the input of 1 and 50 we would have a histogram of 1,0,1. If your histogram is evenly spaced (e.g. [0, 10, 20, 30]), this can be switched from an O(log n) inseration to O(1) per element(where n = # buckets). Buckets must be sorted and not contain any duplicates, must be at least two elements. If `buckets` is a number, it will generates buckets which is evenly spaced between the minimum and maximum of the RDD. For example, if the min value is 0 and the max is 100, given buckets as 2, the resulting buckets will be [0,50) [50,100]. buckets must be at least 1 If the RDD contains infinity, NaN throws an exception If the elements in RDD do not vary (max == min) always returns a single bucket. It will return an tuple of buckets and histogram. >>> rdd = sc.parallelize(range(51)) >>> rdd.histogram(2) ([0, 25, 50], [25, 26]) >>> rdd.histogram([0, 5, 25, 50]) ([0, 5, 25, 50], [5, 20, 26]) >>> rdd.histogram([0, 15, 30, 45, 60], True) ([0, 15, 30, 45, 60], [15, 15, 15, 6]) >>> rdd = sc.parallelize(["ab", "ac", "b", "bd", "ef"]) >>> rdd.histogram(("a", "b", "c")) (('a', 'b', 'c'), [2, 2]) closes #122, it's duplicated. Author: Davies Liu <davies.liu@gmail.com> Closes #2091 from davies/histgram and squashes the following commits: a322f8a [Davies Liu] fix deprecation of e.message 84e85fa [Davies Liu] remove evenBuckets, add more tests (including str) d9a0722 [Davies Liu] address comments 0e18a2d [Davies Liu] add histgram() API	2014-08-26 13:04:30 -07:00
chutium	8856c3d860	[SPARK-3131][SQL] Allow user to set parquet compression codec for writing ParquetFile in SQLContext There are 4 different compression codec available for ```ParquetOutputFormat``` in Spark SQL, it was set as a hard-coded value in ```ParquetRelation.defaultCompression``` original discuss: https://github.com/apache/spark/pull/195#discussion-diff-11002083 i added a new config property in SQLConf to allow user to change this compression codec, and i used similar short names syntax as described in SPARK-2953 #1873 (https://github.com/apache/spark/pull/1873/files#diff-0) btw, which codec should we use as default? it was set to GZIP (https://github.com/apache/spark/pull/195/files#diff-4), but i think maybe we should change this to SNAPPY, since SNAPPY is already the default codec for shuffling in spark-core (SPARK-2469, #1415), and parquet-mr supports Snappy codec natively (`e440108de5`). Author: chutium <teng.qiu@gmail.com> Closes #2039 from chutium/parquet-compression and squashes the following commits: 2f44964 [chutium] [SPARK-3131][SQL] parquet compression default codec set to snappy, also in test suite e578e21 [chutium] [SPARK-3131][SQL] compression codec config property name and default codec set to snappy 21235dc [chutium] [SPARK-3131][SQL] Allow user to set parquet compression codec for writing ParquetFile in SQLContext	2014-08-26 11:51:26 -07:00
Andrew Or	b21ae5bbb9	[SPARK-2886] Use more specific actor system name than "spark" As of #1777 we log the name of the actor system when it binds to a port. The current name "spark" is super general and does not convey any meaning. For instance, the following line is taken from my driver log after setting `spark.driver.port` to 5001. ``` 14/08/13 19:33:29 INFO Remoting: Remoting started; listening on addresses: [akka.tcp://sparkandrews-mbp:5001] 14/08/13 19:33:29 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkandrews-mbp:5001] 14/08/06 13:40:05 INFO Utils: Successfully started service 'spark' on port 5001. ``` This commit renames this to "sparkDriver" and "sparkExecutor". The goal of this unambitious PR is simply to make the logged information more explicit without introducing any change in functionality. Author: Andrew Or <andrewor14@gmail.com> Closes #1810 from andrewor14/service-name and squashes the following commits: 8c459ed [Andrew Or] Use a common variable for driver/executor actor system names 3a92843 [Andrew Or] Change actor name to sparkDriver and sparkExecutor 921363e [Andrew Or] Merge branch 'master' of github.com:apache/spark into service-name c8c6a62 [Andrew Or] Do not include hyphens in actor name 1c1b42e [Andrew Or] Avoid spaces in akka system name f644b55 [Andrew Or] Use more specific service name	2014-08-25 23:36:09 -07:00
Daoyuan Wang	52fbdc2ded	[Spark-3222] [SQL] Cross join support in HiveQL We can simple treat cross join as inner join without join conditions. Author: Daoyuan Wang <daoyuan.wang@intel.com> Author: adrian-wang <daoyuanwong@gmail.com> Closes #2124 from adrian-wang/crossjoin and squashes the following commits: 8c9b7c5 [Daoyuan Wang] add a test 7d47bbb [adrian-wang] add cross join support for hql	2014-08-25 22:56:35 -07:00
Kousuke Saruta	62f5009f67	[SPARK-2976] Replace tabs with spaces Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #1895 from sarutak/SPARK-2976 and squashes the following commits: 1cf7e69 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-2976 d1e0666 [Kousuke Saruta] Modified styles c5e80a4 [Kousuke Saruta] Remove tab from JavaPageRank.java and JavaKinesisWordCountASL.java c003b36 [Kousuke Saruta] Removed tab from sorttable.js	2014-08-25 19:40:23 -07:00
witgo	9f04db17e5	SPARK-2481: The environment variables SPARK_HISTORY_OPTS is covered in spark-env.sh Author: witgo <witgo@qq.com> Author: GuoQiang Li <witgo@qq.com> Closes #1341 from witgo/history_env and squashes the following commits: b4fd9f8 [GuoQiang Li] review commit 0ebe401 [witgo] *-history-server.sh load spark-config.sh	2014-08-25 19:22:27 -07:00
Chia-Yung Su	4243bb6634	[SPARK-3011][SQL] _temporary directory should be filtered out by sqlContext.parquetFile fix compile error on hadoop 0.23 for the pull request #1924. Author: Chia-Yung Su <chiayung@appier.com> Closes #1959 from joesu/bugfix-spark3011 and squashes the following commits: be30793 [Chia-Yung Su] remove .* and _* except _metadata 8fe2398 [Chia-Yung Su] add note to explain 40ea9bd [Chia-Yung Su] fix hadoop-0.23 compile error c7e44f2 [Chia-Yung Su] match syntax f8fc32a [Chia-Yung Su] filter out tmp dir	2014-08-25 18:20:19 -07:00
wangfei	507a1b5200	[SQL] logWarning should be logInfo in getResultSetSchema Author: wangfei <wangfei_hello@126.com> Closes #1939 from scwf/patch-5 and squashes the following commits: f952d10 [wangfei] [SQL] logWarning should be logInfo in getResultSetSchema	2014-08-25 17:46:43 -07:00
Cheng Hao	156eb39661	[SPARK-3058] [SQL] Support EXTENDED for EXPLAIN Provide `extended` keyword support for `explain` command in SQL. e.g. ``` explain extended select key as a1, value as a2 from src where key=1; == Parsed Logical Plan == Project ['key AS a1#3,'value AS a2#4] Filter ('key = 1) UnresolvedRelation None, src, None == Analyzed Logical Plan == Project [key#8 AS a1#3,value#9 AS a2#4] Filter (CAST(key#8, DoubleType) = CAST(1, DoubleType)) MetastoreRelation default, src, None == Optimized Logical Plan == Project [key#8 AS a1#3,value#9 AS a2#4] Filter (CAST(key#8, DoubleType) = 1.0) MetastoreRelation default, src, None == Physical Plan == Project [key#8 AS a1#3,value#9 AS a2#4] Filter (CAST(key#8, DoubleType) = 1.0) HiveTableScan [key#8,value#9], (MetastoreRelation default, src, None), None Code Generation: false == RDD == (2) MappedRDD[14] at map at HiveContext.scala:350 MapPartitionsRDD[13] at mapPartitions at basicOperators.scala:42 MapPartitionsRDD[12] at mapPartitions at basicOperators.scala:57 MapPartitionsRDD[11] at mapPartitions at TableReader.scala:112 MappedRDD[10] at map at TableReader.scala:240 HadoopRDD[9] at HadoopRDD at TableReader.scala:230 ``` It's the sub task of #1847. But can go without any dependency. Author: Cheng Hao <hao.cheng@intel.com> Closes #1962 from chenghao-intel/explain_extended and squashes the following commits: 295db74 [Cheng Hao] Fix bug in printing the simple execution plan 48bc989 [Cheng Hao] Support EXTENDED for EXPLAIN	2014-08-25 17:43:56 -07:00
Cheng Lian	cae9414d38	[SPARK-2929][SQL] Refactored Thrift server and CLI suites Removed most hard coded timeout, timing assumptions and all `Thread.sleep`. Simplified IPC and synchronization with `scala.sys.process` and future/promise so that the test suites can run more robustly and faster. Author: Cheng Lian <lian.cs.zju@gmail.com> Closes #1856 from liancheng/thriftserver-tests and squashes the following commits: 2d914ca [Cheng Lian] Minor refactoring 0e12e71 [Cheng Lian] Cleaned up test output 0ee921d [Cheng Lian] Refactored Thrift server and CLI suites	2014-08-25 16:29:59 -07:00
Takuya UESHIN	d299e2bf2f	[SPARK-3204][SQL] MaxOf would be foldable if both left and right are foldable. Author: Takuya UESHIN <ueshin@happy-camper.st> Closes #2116 from ueshin/issues/SPARK-3204 and squashes the following commits: 7d9b107 [Takuya UESHIN] Make MaxOf foldable if both left and right are foldable.	2014-08-25 16:27:00 -07:00
Cheng Lian	805fec845b	Fixed a typo in docs/running-on-mesos.md It should be `spark-env.sh` rather than `spark.env.sh`. Author: Cheng Lian <lian.cs.zju@gmail.com> Closes #2119 from liancheng/fix-mesos-doc and squashes the following commits: f360548 [Cheng Lian] Fixed a typo in docs/running-on-mesos.md	2014-08-25 14:56:51 -07:00
Xiangrui Meng	fd8ace2d9a	[FIX] fix error message in sendMessageReliably rxin Author: Xiangrui Meng <meng@databricks.com> Closes #2120 from mengxr/sendMessageReliably and squashes the following commits: b14400c [Xiangrui Meng] fix error message in sendMessageReliably	2014-08-25 14:55:20 -07:00
Allan Douglas R. de Oliveira	cc40a709c0	SPARK-3180 - Better control of security groups Adds the --authorized-address and --additional-security-group options as explained in the issue. Author: Allan Douglas R. de Oliveira <allan@chaordicsystems.com> Closes #2088 from douglaz/configurable_sg and squashes the following commits: e3e48ca [Allan Douglas R. de Oliveira] Adds the option to specify the address authorized to access the SG and another option to provide an additional existing SG	2014-08-25 13:55:04 -07:00
Sean Owen	cd30db566a	SPARK-2798 [BUILD] Correct several small errors in Flume module pom.xml files (EDIT) Since the scalatest issue was since resolved, this is now about a few small problems in the Flume Sink `pom.xml` - `scalatest` is not declared as a test-scope dependency - Its Avro version doesn't match the rest of the build - Its Flume version is not synced with the other Flume module - The other Flume module declares its dependency on Flume Sink slightly incorrectly, hard-coding the Scala 2.10 version - It depends on Scala Lang directly, which it shouldn't Author: Sean Owen <sowen@cloudera.com> Closes #1726 from srowen/SPARK-2798 and squashes the following commits: a46e2c6 [Sean Owen] scalatest to test scope, harmonize Avro and Flume versions, remove direct Scala dependency, fix '2.10' in Flume dependency	2014-08-25 13:29:07 -07:00
Xiangrui Meng	220f413686	[SPARK-2495][MLLIB] make KMeans constructor public to re-construct k-means models freeman-lab Author: Xiangrui Meng <meng@databricks.com> Closes #2112 from mengxr/public-constructors and squashes the following commits: 18d53a9 [Xiangrui Meng] make KMeans constructor public	2014-08-25 12:30:02 -07:00
Davies Liu	fb0db77242	[SPARK-2871] [PySpark] add zipWithIndex() and zipWithUniqueId() RDD.zipWithIndex() Zips this RDD with its element indices. The ordering is first based on the partition index and then the ordering of items within each partition. So the first item in the first partition gets index 0, and the last item in the last partition receives the largest index. This method needs to trigger a spark job when this RDD contains more than one partitions. >>> sc.parallelize(range(4), 2).zipWithIndex().collect() [(0, 0), (1, 1), (2, 2), (3, 3)] RDD.zipWithUniqueId() Zips this RDD with generated unique Long ids. Items in the kth partition will get ids k, n+k, 2*n+k, ..., where n is the number of partitions. So there may exist gaps, but this method won't trigger a spark job, which is different from L{zipWithIndex} >>> sc.parallelize(range(4), 2).zipWithUniqueId().collect() [(0, 0), (2, 1), (1, 2), (3, 3)] Author: Davies Liu <davies.liu@gmail.com> Closes #2092 from davies/zipWith and squashes the following commits: cebe5bf [Davies Liu] improve test cases, reverse the order of index 0d2a128 [Davies Liu] add zipWithIndex() and zipWithUniqueId()	2014-08-24 21:16:05 -07:00
Reza Zadeh	b1b20301b3	[MLlib][SPARK-2997] Update SVD documentation to reflect roughly square Update the documentation to reflect the fact we can handle roughly square matrices. Author: Reza Zadeh <rizlar@gmail.com> Closes #2070 from rezazadeh/svddocs and squashes the following commits: 826b8fe [Reza Zadeh] left singular vectors 3f34fc6 [Reza Zadeh] PCA is still TS 7ffa2aa [Reza Zadeh] better title aeaf39d [Reza Zadeh] More docs 788ed13 [Reza Zadeh] add computational cost explanation 6429c59 [Reza Zadeh] Add link to rowmatrix docs 1eeab8b [Reza Zadeh] Update SVD documentation to reflect roughly square	2014-08-24 17:35:54 -07:00
DB Tsai	572952ae61	[SPARK-2841][MLlib] Documentation for feature transformations Documentation for newly added feature transformations: 1. TF-IDF 2. StandardScaler 3. Normalizer Author: DB Tsai <dbtsai@alpinenow.com> Closes #2068 from dbtsai/transformer-documentation and squashes the following commits: 109f324 [DB Tsai] address feedback	2014-08-24 17:33:33 -07:00

... 3 4 5 6 7 ...

8151 commits