ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Aaron Davidson	2044784915	[SQL] Use hive.SessionState, not the thread local SessionState Note that this is simply mimicing lookupRelation(). I do not have a concrete notion of why this solution is necessarily right-er than SessionState.get, but SessionState.get is returning null, which is bad. Author: Aaron Davidson <aaron@databricks.com> Closes #1148 from aarondav/createtable and squashes the following commits: 37c3e7c [Aaron Davidson] [SQL] Use hive.SessionState, not the thread local SessionState	2014-06-20 17:55:54 -07:00
Reynold Xin	d4c7572dba	Move ScriptTransformation into the appropriate place. Author: Reynold Xin <rxin@apache.org> Closes #1162 from rxin/script and squashes the following commits: 2c836b9 [Reynold Xin] Move ScriptTransformation into the appropriate place.	2014-06-20 17:16:56 -07:00
Andrew Or	01125a1162	Clean up CacheManager et al. UPDATE I have removed the special handling for `StorageLevel.MEMORY__SER` for now, because it introduces a potential performance regression. With the latest changes, this PR should include mainly style (code readability) fixes. The only functionality change is the update in `MemoryStore#putBytes` to actually return updated blocks, though this is a minor bug fix. Now this is mainly a precursor to another PR (once again). --------- Old comment* The deserialized version of a partition may occupy much more space than the serialized version. Therefore, if a partition is to be cached with `StorageLevel.MEMORY__SER`, we don't need to fully unroll it into an `ArrayBuffer`, but instead we can unroll it into a potentially much smaller `ByteBuffer`. This may save us from OOMs in this case. Author: Andrew Or <andrewor14@gmail.com> Closes #1083 from andrewor14/unroll-them-partitions and squashes the following commits: 7048aa0 [Andrew Or] Merge branch 'master' of github.com:apache/spark into unroll-them-partitions 3d9a366 [Andrew Or] Minor change for readability d12b95f [Andrew Or] Remove unused imports (minor) a4c387b [Andrew Or] Merge branch 'master' of github.com:apache/spark into unroll-them-partitions cf5f565 [Andrew Or] Remove special handling for MEM__SER 0091ec0 [Andrew Or] Address review feedback 44ef282 [Andrew Or] Actually return updated blocks in putBytes 2941c89 [Andrew Or] Clean up BlockStore (minor) a8f181d [Andrew Or] Add special handling for StorageLevel.MEMORY_*_SER	2014-06-20 17:14:33 -07:00
Reynold Xin	0ac71d1284	[SPARK-2225] Turn HAVING without GROUP BY into WHERE. @willb Author: Reynold Xin <rxin@apache.org> Closes #1161 from rxin/having-filter and squashes the following commits: fa8359a [Reynold Xin] [SPARK-2225] Turn HAVING without GROUP BY into WHERE.	2014-06-20 15:38:02 -07:00
William Benton	171ebb3a82	SPARK-2180: support HAVING clauses in Hive queries This PR extends Spark's HiveQL support to handle HAVING clauses in aggregations. The HAVING test from the Hive compatibility suite doesn't appear to be runnable from within Spark, so I added a simple comparable test to `HiveQuerySuite`. Author: William Benton <willb@redhat.com> Closes #1136 from willb/SPARK-2180 and squashes the following commits: 3bbaf26 [William Benton] Added casts to HAVING expressions 83f1340 [William Benton] scalastyle fixes 18387f1 [William Benton] Add test for HAVING without GROUP BY b880bef [William Benton] Added semantic error for HAVING without GROUP BY 942428e [William Benton] Added test coverage for SPARK-2180. 56084cc [William Benton] Add support for HAVING clauses in Hive queries.	2014-06-20 13:41:38 -07:00
Allan Douglas R. de Oliveira	6a224c31e8	SPARK-1868: Users should be allowed to cogroup at least 4 RDDs Adds cogroup for 4 RDDs. Author: Allan Douglas R. de Oliveira <allandouglas@gmail.com> Closes #813 from douglaz/more_cogroups and squashes the following commits: f8d6273 [Allan Douglas R. de Oliveira] Test python groupWith for one more case 0e9009c [Allan Douglas R. de Oliveira] Added scala tests c3ffcdd [Allan Douglas R. de Oliveira] Added java tests 517a67f [Allan Douglas R. de Oliveira] Added tests for python groupWith 2f402d5 [Allan Douglas R. de Oliveira] Removed TODO 17474f4 [Allan Douglas R. de Oliveira] Use new cogroup function 7877a2a [Allan Douglas R. de Oliveira] Fixed code ba02414 [Allan Douglas R. de Oliveira] Added varargs cogroup to pyspark c4a8a51 [Allan Douglas R. de Oliveira] Added java cogroup 4 e94963c [Allan Douglas R. de Oliveira] Fixed spacing f1ee57b [Allan Douglas R. de Oliveira] Fixed scala style issues d7196f1 [Allan Douglas R. de Oliveira] Allow the cogroup of 4 RDDs	2014-06-20 11:03:03 -07:00
Gang Bai	d484ddeff1	[SPARK-2163] class LBFGS optimize with Double tolerance instead of Int https://issues.apache.org/jira/browse/SPARK-2163 This pull request includes the change for [SPARK-2163]: * Changed the convergence tolerance parameter from type `Int` to type `Double`. * Added types for vars in `class LBFGS`, making the style consistent with `class GradientDescent`. * Added associated test to check that optimizing via `class LBFGS` produces the same results as via calling `runLBFGS` from `object LBFGS`. This is a very minor change but it will solve the problem in my implementation of a regression model for count data, where I make use of LBFGS for parameter estimation. Author: Gang Bai <me@baigang.net> Closes #1104 from BaiGang/fix_int_tol and squashes the following commits: cecf02c [Gang Bai] Changed setConvergenceTol'' to specify tolerance with a parameter of type Double. For the reason and the problem caused by an Int parameter, please check https://issues.apache.org/jira/browse/SPARK-2163. Added a test in LBFGSSuite for validating that optimizing via class LBFGS produces the same results as calling runLBFGS from object LBFGS. Keep the indentations and styles correct.	2014-06-20 08:52:20 -07:00
Reynold Xin	2f6a835e1a	[SPARK-2218] rename Equals to EqualTo in Spark SQL expressions. Due to the existence of scala.Equals, it is very error prone to name the expression Equals, especially because we use a lot of partial functions and pattern matching in the optimizer. Note that this sits on top of #1144. Author: Reynold Xin <rxin@apache.org> Closes #1146 from rxin/equals and squashes the following commits: f8583fd [Reynold Xin] Merge branch 'master' of github.com:apache/spark into equals 326b388 [Reynold Xin] Merge branch 'master' of github.com:apache/spark into equals bd19807 [Reynold Xin] Rename EqualsTo to EqualTo. 81148d1 [Reynold Xin] [SPARK-2218] rename Equals to EqualsTo in Spark SQL expressions. c4e543d [Reynold Xin] [SPARK-2210] boolean cast on boolean value should be removed.	2014-06-20 00:34:59 -07:00
Takuya UESHIN	3249528920	[SPARK-2196] [SQL] Fix nullability of CaseWhen. `CaseWhen` should use `branches.length` to check if `elseValue` is provided or not. Author: Takuya UESHIN <ueshin@happy-camper.st> Closes #1133 from ueshin/issues/SPARK-2196 and squashes the following commits: 510f12d [Takuya UESHIN] Add some tests. dc25e8d [Takuya UESHIN] Fix nullable of CaseWhen to be nullable if the elseValue is nullable. 4f049cc [Takuya UESHIN] Fix nullability of CaseWhen.	2014-06-20 00:12:52 -07:00
Aaron Davidson	f46e02fcdb	SPARK-2203: PySpark defaults to use same num reduce partitions as map side For shuffle-based operators, such as rdd.groupBy() or rdd.sortByKey(), PySpark will always assume that the default parallelism to use for the reduce side is ctx.defaultParallelism, which is a constant typically determined by the number of cores in cluster. In contrast, Spark's Partitioner#defaultPartitioner will use the same number of reduce partitions as map partitions unless the defaultParallelism config is explicitly set. This tends to be a better default in order to avoid OOMs, and should also be the behavior of PySpark. JIRA: https://issues.apache.org/jira/browse/SPARK-2203 Author: Aaron Davidson <aaron@databricks.com> Closes #1138 from aarondav/pyfix and squashes the following commits: 1bd5751 [Aaron Davidson] SPARK-2203: PySpark defaults to use same num reduce partitions as map partitions	2014-06-20 00:06:57 -07:00
Reynold Xin	c55bbb49f7	[SPARK-2209][SQL] Cast shouldn't do null check twice. Also took the chance to clean up cast a little bit. Too many arrows on each line before! Author: Reynold Xin <rxin@apache.org> Closes #1143 from rxin/cast and squashes the following commits: dd006cb [Reynold Xin] Code review feedback. c2b88ae [Reynold Xin] [SPARK-2209][SQL] Cast shouldn't do null check twice.	2014-06-20 00:01:19 -07:00
Reynold Xin	6175640973	[SPARK-2210] cast to boolean on boolean value gets turned into NOT((boolean_condition) = 0) ``` explain select cast(cast(key=0 as boolean) as boolean) aaa from src ``` should be ``` [Physical execution plan:] [Project [(key#10:0 = 0) AS aaa#7]] [ HiveTableScan [key#10], (MetastoreRelation default, src, None), None] ``` However, it is currently ``` [Physical execution plan:] [Project [NOT((key#10=0) = 0) AS aaa#7]] [ HiveTableScan [key#10], (MetastoreRelation default, src, None), None] ``` Author: Reynold Xin <rxin@apache.org> Closes #1144 from rxin/booleancast and squashes the following commits: c4e543d [Reynold Xin] [SPARK-2210] boolean cast on boolean value should be removed.	2014-06-19 23:58:23 -07:00
Andre Schumacher	f479cf3743	SPARK-1293 [SQL] Parquet support for nested types It should be possible to import and export data stored in Parquet's columnar format that contains nested types. For example: ```java message AddressBook { required binary owner; optional group ownerPhoneNumbers { repeated binary array; } optional group contacts { repeated group array { required binary name; optional binary phoneNumber; } } optional group nameToApartmentNumber { repeated group map { required binary key; required int32 value; } } } ``` The example could model a type (AddressBook) that contains records made of strings (owner), lists (ownerPhoneNumbers) and a table of contacts (e.g., a list of pairs or a map that can contain null values but keys must not be null). The list of tasks are as follows: <h6>Implement support for converting nested Parquet types to Spark/Catalyst types:</h6> - [x] Structs - [x] Lists - [x] Maps (note: currently keys need to be Strings) <h6>Implement import (via ``parquetFile``) of nested Parquet types (first version in this PR)</h6> - [x] Initial version <h6>Implement export (via ``saveAsParquetFile``)</h6> - [x] Initial version <h6>Test support for AvroParquet, etc.</h6> - [x] Initial testing of import of avro-generated Parquet data (simple + nested) Example: ```scala val data = TestSQLContext .parquetFile("input.dir") .toSchemaRDD data.registerAsTable("data") sql("SELECT owner, contacts[1].name, nameToApartmentNumber['John'] FROM data").collect() ``` Author: Andre Schumacher <andre.schumacher@iki.fi> Author: Michael Armbrust <michael@databricks.com> Closes #360 from AndreSchumacher/nested_parquet and squashes the following commits: 30708c8 [Andre Schumacher] Taking out AvroParquet test for now to remove Avro dependency 95c1367 [Andre Schumacher] Changes to ParquetRelation and its metadata 7eceb67 [Andre Schumacher] Review feedback 94eea3a [Andre Schumacher] Scalastyle 403061f [Andre Schumacher] Fixing some issues with tests and schema metadata b8a8b9a [Andre Schumacher] More fixes to short and byte conversion 63d1b57 [Andre Schumacher] Cleaning up and Scalastyle 88e6bdb [Andre Schumacher] Attempting to fix loss of schema 37e0a0a [Andre Schumacher] Cleaning up 14c3fd8 [Andre Schumacher] Attempting to fix Spark-Parquet schema conversion 3e1456c [Michael Armbrust] WIP: Directly serialize catalyst attributes. f7aeba3 [Michael Armbrust] [SPARK-1982] Support for ByteType and ShortType. 3104886 [Michael Armbrust] Nested Rows should be Rows, not Seqs. 3c6b25f [Andre Schumacher] Trying to reduce no-op changes wrt master 31465d6 [Andre Schumacher] Scalastyle: fixing commented out bottom de02538 [Andre Schumacher] Cleaning up ParquetTestData 2f5a805 [Andre Schumacher] Removing stripMargin from test schemas 191bc0d [Andre Schumacher] Changing to Seq for ArrayType, refactoring SQLParser for nested field extension cbb5793 [Andre Schumacher] Code review feedback 32229c7 [Andre Schumacher] Removing Row nested values and placing by generic types 0ae9376 [Andre Schumacher] Doc strings and simplifying ParquetConverter.scala a6b4f05 [Andre Schumacher] Cleaning up ArrayConverter, moving classTag to NativeType, adding NativeRow 431f00f [Andre Schumacher] Fixing problems introduced during rebase c52ff2c [Andre Schumacher] Adding native-array converter 619c397 [Andre Schumacher] Completing Map testcase 79d81d5 [Andre Schumacher] Replacing field names for array and map in WriteSupport f466ff0 [Andre Schumacher] Added ParquetAvro tests and revised Array conversion adc1258 [Andre Schumacher] Optimizing imports e99cc51 [Andre Schumacher] Fixing nested WriteSupport and adding tests 1dc5ac9 [Andre Schumacher] First version of WriteSupport for nested types d1911dc [Andre Schumacher] Simplifying ArrayType conversion f777b4b [Andre Schumacher] Scalastyle 824500c [Andre Schumacher] Adding attribute resolution for MapType b539fde [Andre Schumacher] First commit for MapType a594aed [Andre Schumacher] Scalastyle 4e25fcb [Andre Schumacher] Adding resolution of complex ArrayTypes f8f8911 [Andre Schumacher] For primitive rows fall back to more efficient converter, code reorg 6dbc9b7 [Andre Schumacher] Fixing some problems intruduced during rebase b7fcc35 [Andre Schumacher] Documenting conversions, bugfix, wrappers of Rows ee70125 [Andre Schumacher] fixing one problem with arrayconverter 98219cf [Andre Schumacher] added struct converter 5d80461 [Andre Schumacher] fixing one problem with nested structs and breaking up files 1b1b3d6 [Andre Schumacher] Fixing one problem with nested arrays ddb40d2 [Andre Schumacher] Extending tests for nested Parquet data 745a42b [Andre Schumacher] Completing testcase for nested data (Addressbook( 6125c75 [Andre Schumacher] First working nested Parquet record input 4d4892a [Andre Schumacher] First commit nested Parquet read converters aa688fe [Andre Schumacher] Adding conversion of nested Parquet schemas	2014-06-19 23:47:45 -07:00
Yin Huai	f397e92eb2	[SPARK-2177][SQL] describe table result contains only one column ``` scala> hql("describe src").collect().foreach(println) [key string None ] [value string None ] ``` The result should contain 3 columns instead of one. This screws up JDBC or even the downstream consumer of the Scala/Java/Python APIs. I am providing a workaround. We handle a subset of describe commands in Spark SQL, which are defined by ... ``` DESCRIBE [EXTENDED] [db_name.]table_name ``` All other cases are treated as Hive native commands. Also, if we upgrade Hive to 0.13, we need to check the results of context.sessionState.isHiveServerQuery() to determine how to split the result. This method is introduced by https://issues.apache.org/jira/browse/HIVE-4545. We may want to set Hive to use JsonMetaDataFormatter for the output of a DDL statement (`set hive.ddl.output.format=json` introduced by https://issues.apache.org/jira/browse/HIVE-2822). The link to JIRA: https://issues.apache.org/jira/browse/SPARK-2177 Author: Yin Huai <huai@cse.ohio-state.edu> Closes #1118 from yhuai/SPARK-2177 and squashes the following commits: fd2534c [Yin Huai] Merge remote-tracking branch 'upstream/master' into SPARK-2177 b9b9aa5 [Yin Huai] rxin's comments. e7c4e72 [Yin Huai] Fix unit test. 656b068 [Yin Huai] 100 characters. 6387217 [Yin Huai] Merge remote-tracking branch 'upstream/master' into SPARK-2177 8003cf3 [Yin Huai] Generate strings with the format like Hive for unit tests. 9787fff [Yin Huai] Merge remote-tracking branch 'upstream/master' into SPARK-2177 440c5af [Yin Huai] rxin's comments. f1a417e [Yin Huai] Update doc. 83adb2f [Yin Huai] Merge remote-tracking branch 'upstream/master' into SPARK-2177 366f891 [Yin Huai] Add describe command. 74bd1d4 [Yin Huai] Merge remote-tracking branch 'upstream/master' into SPARK-2177 342fdf7 [Yin Huai] Split to up to 3 parts. 725e88c [Yin Huai] Merge remote-tracking branch 'upstream/master' into SPARK-2177 bb8bbef [Yin Huai] Split every string in the result of a describe command.	2014-06-19 23:41:38 -07:00
Michael Armbrust	d3b7671c1f	[SQL] Improve Speed of InsertIntoHiveTable Author: Michael Armbrust <michael@databricks.com> Closes #1130 from marmbrus/noFunctional and squashes the following commits: ccdb68c [Michael Armbrust] Remove functional programming and Array allocations from fast path in InsertIntoHiveTable.	2014-06-19 23:39:03 -07:00
Reynold Xin	278ec8a203	More minor scaladoc cleanup for Spark SQL. Author: Reynold Xin <rxin@apache.org> Closes #1142 from rxin/sqlclean and squashes the following commits: 67a789e [Reynold Xin] More minor scaladoc cleanup for Spark SQL.	2014-06-19 22:34:21 -07:00
Patrick Wendell	e5514790d7	HOTFIX: SPARK-2208 local metrics tests can fail on fast machines Author: Patrick Wendell <pwendell@gmail.com> Closes #1141 from pwendell/hotfix and squashes the following commits: 83e4c79 [Patrick Wendell] HOTFIX: SPARK-2208 local metrics tests can fail on fast machines	2014-06-19 21:06:28 -07:00
Reynold Xin	5464e79175	A few minor Spark SQL Scaladoc fixes. Author: Reynold Xin <rxin@apache.org> Closes #1139 from rxin/sparksqldoc and squashes the following commits: c3049d8 [Reynold Xin] Fixed line length. 66dc72c [Reynold Xin] A few minor Spark SQL Scaladoc fixes.	2014-06-19 18:24:05 -07:00
nravi	f14b00a9c6	[SPARK-2151] Recognize memory format for spark-submit int format expected for input memory parameter when spark-submit is invoked in standalone cluster mode. Make it consistent with rest of Spark. Author: nravi <nravi@c1704.halxg.cloudera.com> Closes #1095 from nishkamravi2/master and squashes the following commits: 2b630f9 [nravi] Accept memory input as "30g", "512M" instead of an int value, to be consistent with rest of Spark 3bf8fad [nravi] Merge branch 'master' of https://github.com/apache/spark 5423a03 [nravi] Merge branch 'master' of https://github.com/apache/spark eb663ca [nravi] Merge branch 'master' of https://github.com/apache/spark df2aeb1 [nravi] Improved fix for ConcurrentModificationIssue (Spark-1097, Hadoop-10456) 6b840f0 [nravi] Undo the fix for SPARK-1758 (the problem is fixed) 5108700 [nravi] Fix in Spark for the Concurrent thread modification issue (SPARK-1097, HADOOP-10456) 681b36f [nravi] Fix for SPARK-1758: failing test org.apache.spark.JavaAPISuite.wholeTextFiles	2014-06-19 17:11:06 -07:00
Michael Armbrust	777c5958c4	[SPARK-2191][SQL] Make sure InsertIntoHiveTable doesn't execute more than once. Author: Michael Armbrust <michael@databricks.com> Closes #1129 from marmbrus/doubleCreateAs and squashes the following commits: 9c6d9e4 [Michael Armbrust] Fix typo. 5128fe2 [Michael Armbrust] Make sure InsertIntoHiveTable doesn't execute each time you ask for its result.	2014-06-19 14:14:03 -07:00
witgo	bce0897bc6	[SPARK-2051]In yarn.ClientBase spark.yarn.dist.* do not work Author: witgo <witgo@qq.com> Closes #969 from witgo/yarn_ClientBase and squashes the following commits: 8117765 [witgo] review commit 3bdbc52 [witgo] Merge branch 'master' of https://github.com/apache/spark into yarn_ClientBase 5261b6c [witgo] fix sys.props.get("SPARK_YARN_DIST_FILES") e3c1107 [witgo] update docs b6a9aa1 [witgo] merge master c8b4554 [witgo] review commit 2f48789 [witgo] Merge branch 'master' of https://github.com/apache/spark into yarn_ClientBase 8d7b82f [witgo] Merge branch 'master' of https://github.com/apache/spark into yarn_ClientBase 1048549 [witgo] remove Utils.resolveURIs 871f1db [witgo] add spark.yarn.dist.* documentation 41bce59 [witgo] review commit 35d6fa0 [witgo] move to ClientArguments 55d72fc [witgo] Merge branch 'master' of https://github.com/apache/spark into yarn_ClientBase 9cdff16 [witgo] review commit 8bc2f4b [witgo] review commit 20e667c [witgo] Merge branch 'master' into yarn_ClientBase 0961151 [witgo] merge master ce609fc [witgo] Merge branch 'master' into yarn_ClientBase 8362489 [witgo] yarn.ClientBase spark.yarn.dist.* do not work	2014-06-19 12:11:26 -05:00
WangTao	67fca189c9	Minor fix The value "env" is never used in SparkContext.scala. Add detailed comment for method setDelaySeconds in MetadataCleaner.scala instead of the unsure one. Author: WangTao <barneystinson@aliyun.com> Closes #1105 from WangTaoTheTonic/master and squashes the following commits: 688358e [WangTao] Minor fix	2014-06-18 23:24:57 -07:00
Reynold Xin	640c294369	[SPARK-2187] Explain should not run the optimizer twice. @yhuai @marmbrus @concretevitamin Author: Reynold Xin <rxin@apache.org> Closes #1123 from rxin/explain and squashes the following commits: def83b0 [Reynold Xin] Update unit tests for explain. a9d3ba8 [Reynold Xin] [SPARK-2187] Explain should not run the optimizer twice.	2014-06-18 22:44:12 -07:00
Doris Xin	566f70f214	Squishing a typo bug before it causes real harm in updateNumRows method in RowMatrix Author: Doris Xin <doris.s.xin@gmail.com> Closes #1125 from dorx/updateNumRows and squashes the following commits: 8564aef [Doris Xin] Squishing a typo bug before it causes real harm	2014-06-18 22:19:06 -07:00
Michael Armbrust	5ff75c748a	[SPARK-2184][SQL] AddExchange isn't idempotent ...redPartitioning. Author: Michael Armbrust <michael@databricks.com> Closes #1122 from marmbrus/fixAddExchange and squashes the following commits: 3417537 [Michael Armbrust] Don't bind partitioning expressions as that breaks comparison with requiredPartitioning.	2014-06-18 17:52:42 -07:00
Doris Xin	45a95f82ca	Remove unicode operator from RDD.scala Some IDEs don’t support unicode characters in source code. Check if this breaks binary compatibility. Author: Doris Xin <doris.s.xin@gmail.com> Closes #1119 from dorx/unicode and squashes the following commits: 05618c3 [Doris Xin] Remove unicode operator from RDD.scala	2014-06-18 15:01:29 -07:00
Mark Hamstra	4cbeea83e0	SPARK-2158 Clean up core/stdout file from FileAppenderSuite @tdas Author: Mark Hamstra <markhamstra@gmail.com> Closes #1100 from markhamstra/SPARK-2158 and squashes the following commits: ae8e069 [Mark Hamstra] Response to TD's review 2f1e201 [Mark Hamstra] Cleanup 'stdout' file within FileAppenderSuite	2014-06-18 14:56:41 -07:00
Kay Ousterhout	3870248740	[SPARK-1466] Raise exception if pyspark Gateway process doesn't start. If the gateway process fails to start correctly (e.g., because JAVA_HOME isn't set correctly, there's no Spark jar, etc.), right now pyspark fails because of a very difficult-to-understand error, where we try to parse stdout to get the port where Spark started and there's nothing there. This commit properly catches the error and throws an exception that includes the stderr output for much easier debugging. Thanks to @shivaram and @stogers for helping to fix this issue! Author: Kay Ousterhout <kayousterhout@gmail.com> Closes #383 from kayousterhout/pyspark and squashes the following commits: 36dd54b [Kay Ousterhout] [SPARK-1466] Raise exception if Gateway process doesn't start.	2014-06-18 13:16:26 -07:00
Reynold Xin	dd96fcda01	Updated the comment for SPARK-2162. A follow up on #1103 @andrewor14 Author: Reynold Xin <rxin@apache.org> Closes #1117 from rxin/SPARK-2162 and squashes the following commits: a4231de [Reynold Xin] Updated the comment for SPARK-2162.	2014-06-18 12:48:58 -07:00
Raymond Liu	5ad5e3486a	[SPARK-2162] Double check in doGetLocal to avoid read on removed block. other wise, it will either read in vain in memory level case, or throw exception in disk level case when it believe the block is there while actually it had been removed. Author: Raymond Liu <raymond.liu@intel.com> Closes #1103 from colorant/bm and squashes the following commits: daac114 [Raymond Liu] Address comments d1ea287 [Raymond Liu] Double check in doGetLocal to avoid read on removed block.	2014-06-18 10:57:45 -07:00
Yin Huai	587d32012c	[SPARK-2176][SQL] Extra unnecessary exchange operator in the result of an explain command ``` hql("explain select * from src group by key").collect().foreach(println) [ExplainCommand [plan#27:0]] [ Aggregate false, [key#25], [key#25,value#26]] [ Exchange (HashPartitioning [key#25:0], 200)] [ Exchange (HashPartitioning [key#25:0], 200)] [ Aggregate true, [key#25], [key#25]] [ HiveTableScan [key#25,value#26], (MetastoreRelation default, src, None), None] ``` There are two exchange operators. However, if we do not use explain... ``` hql("select * from src group by key") res4: org.apache.spark.sql.SchemaRDD = SchemaRDD[8] at RDD at SchemaRDD.scala:100 == Query Plan == Aggregate false, [key#8], [key#8,value#9] Exchange (HashPartitioning [key#8:0], 200) Aggregate true, [key#8], [key#8] HiveTableScan [key#8,value#9], (MetastoreRelation default, src, None), None ``` The plan is fine. The cause of this bug is explained below. When we create an `execution.ExplainCommand`, we use the `executedPlan` as the child of this `ExplainCommand`. But, this `executedPlan` is prepared for execution again when we generate the `executedPlan` for the `ExplainCommand`. Basically, `prepareForExecution` is called twice on a physical plan. Because after `prepareForExecution` we have already bounded those references (in `BoundReference`s), `AddExchange` cannot figure out we are using the same partitioning (we use `AttributeReference`s to create an `ExchangeOperator` and then those references will be changed to `BoundReference`s after `prepareForExecution` is called). So, an extra `ExchangeOperator` is inserted. I think in `CommandStrategy`, we should just use the `sparkPlan` (`sparkPlan` is the input of `prepareForExecution`) to initialize the `ExplainCommand` instead of using `executedPlan`. The link to JIRA: https://issues.apache.org/jira/browse/SPARK-2176 Author: Yin Huai <huai@cse.ohio-state.edu> Closes #1116 from yhuai/SPARK-2176 and squashes the following commits: 197c19c [Yin Huai] Use sparkPlan to initialize a Physical Explain Command instead of using executedPlan.	2014-06-18 10:51:32 -07:00
Vadim Chekan	889f7b7624	[STREAMING] SPARK-2009 Key not found exception when slow receiver starts I got "java.util.NoSuchElementException: key not found: 1401756085000 ms" exception when using kafka stream and 1 sec batchPeriod. Investigation showed that the reason is that ReceiverLauncher.startReceivers is asynchronous (started in a thread). https://github.com/vchekan/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/scheduler/ReceiverTracker.scala#L206 In case of slow starting receiver, such as Kafka, it easily takes more than 2sec to start. In result, no single "compute" will be called on ReceiverInputDStream before first batch job is executed and receivedBlockInfo remains empty (obviously). Batch job will cause ReceiverInputDStream.getReceivedBlockInfo call and "key not found" exception. The patch makes getReceivedBlockInfo more robust by tolerating missing values. Author: Vadim Chekan <kot.begemot@gmail.com> Closes #961 from vchekan/branch-1.0 and squashes the following commits: e86f82b [Vadim Chekan] Fixed indentation 4609563 [Vadim Chekan] Key not found exception: if receiver is slow to start, it is possible that getReceivedBlockInfo will be called before compute has been called (cherry picked from commit `26f6b98931`) Signed-off-by: Patrick Wendell <pwendell@gmail.com>	2014-06-17 22:04:06 -07:00
Patrick Wendell	9e4b4bd083	Revert "SPARK-2038: rename "conf" parameters in the saveAsHadoop functions" This reverts commit `443f5e1bbc`. This commit unfortunately would break source compatibility if users have named the hadoopConf parameter.	2014-06-17 19:34:17 -07:00
Yin Huai	d2f4f30b12	[SPARK-2060][SQL] Querying JSON Datasets with SQL and DSL in Spark SQL JIRA: https://issues.apache.org/jira/browse/SPARK-2060 Programming guide: http://yhuai.github.io/site/sql-programming-guide.html Scala doc of SQLContext: http://yhuai.github.io/site/api/scala/index.html#org.apache.spark.sql.SQLContext Author: Yin Huai <huai@cse.ohio-state.edu> Closes #999 from yhuai/newJson and squashes the following commits: 227e89e [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson ce8eedd [Yin Huai] rxin's comments. bc9ac51 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson 94ffdaa [Yin Huai] Remove "get" from method names. ce31c81 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson e2773a6 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson 79ea9ba [Yin Huai] Fix typos. 5428451 [Yin Huai] Newline 1f908ce [Yin Huai] Remove extra line. d7a005c [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson 7ea750e [Yin Huai] marmbrus's comments. 6a5f5ef [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson 83013fb [Yin Huai] Update Java Example. e7a6c19 [Yin Huai] SchemaRDD.javaToPython should convert a field with the StructType to a Map. 6d20b85 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson 4fbddf0 [Yin Huai] Programming guide. 9df8c5a [Yin Huai] Python API. 7027634 [Yin Huai] Java API. cff84cc [Yin Huai] Use a SchemaRDD for a JSON dataset. d0bd412 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson ab810b0 [Yin Huai] Make JsonRDD private. 6df0891 [Yin Huai] Apache header. 8347f2e [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson 66f9e76 [Yin Huai] Update docs and use the entire dataset to infer the schema. 8ffed79 [Yin Huai] Update the example. a5a4b52 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson 4325475 [Yin Huai] If a sampled dataset is used for schema inferring, update the schema of the JsonTable after first execution. 65b87f0 [Yin Huai] Fix sampling... 8846af5 [Yin Huai] API doc. 52a2275 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson 0387523 [Yin Huai] Address PR comments. 666b957 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson a2313a6 [Yin Huai] Address PR comments. f3ce176 [Yin Huai] After type conflict resolution, if a NullType is found, StringType is used. 0576406 [Yin Huai] Add Apache license header. af91b23 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson f45583b [Yin Huai] Infer the schema of a JSON dataset (a text file with one JSON object per line or a RDD[String] with one JSON object per string) and returns a SchemaRDD. f31065f [Yin Huai] A query plan or a SchemaRDD can print out its schema.	2014-06-17 19:14:59 -07:00
Patrick Wendell	b2ebf429e2	HOTFIX: bug caused by #941 This patch should have qualified the use of PIPE. This needs to be back ported into 0.9 and 1.0. Author: Patrick Wendell <pwendell@gmail.com> Closes #1108 from pwendell/hotfix and squashes the following commits: 711c58d [Patrick Wendell] HOTFIX: bug caused by #941	2014-06-17 15:09:24 -07:00
Andrew Or	a14807e84c	[SPARK-2147 / 2161] Show removed executors on the UI This PR includes two changes - [SPARK-2147] When an application finishes cleanly (i.e. `sc.stop()` is called), all of its executors used to disappear from the Master UI. This no longer happens. - [SPARK-2161] This adds a "Removed Executors" table to Master UI, so the user can find out why their executors died from the logs, for instance. The equivalent table already existed in the Worker UI, but was hidden because of a bug (the comment `//scalastyle:off` disconnected the `Seq[Node]` that represents the HTML for table). This should go into 1.0.1 if possible. Author: Andrew Or <andrewor14@gmail.com> Closes #1102 from andrewor14/remember-removed-executors and squashes the following commits: 2e2298f [Andrew Or] Add hash code method to ExecutorInfo (minor) abd72e0 [Andrew Or] Merge branch 'master' of github.com:apache/spark into remember-removed-executors 792f992 [Andrew Or] Add missing equals method in ExecutorInfo 3390b49 [Andrew Or] Add executor state column to WorkerPage 161f8a2 [Andrew Or] Display finished executors table (fix bug) fbb65b8 [Andrew Or] Removed unused method c89bb6e [Andrew Or] Add table for removed executors in MasterWebUI fe47402 [Andrew Or] Show exited executors on the Master UI	2014-06-17 12:25:55 -07:00
CodingCat	443f5e1bbc	SPARK-2038: rename "conf" parameters in the saveAsHadoop functions to distinguish with SparkConf object https://issues.apache.org/jira/browse/SPARK-2038 Author: CodingCat <zhunansjtu@gmail.com> Closes #1087 from CodingCat/SPARK-2038 and squashes the following commits: 763975f [CodingCat] style fix d91288d [CodingCat] rename "conf" parameters in the saveAsHadoop functions	2014-06-17 12:17:48 -07:00
Sandy Ryza	2794990e9e	SPARK-2146. Fix takeOrdered doc Removes Python syntax in Scaladoc, corrects result in Scaladoc, and removes irrelevant cache() call in Python doc. Author: Sandy Ryza <sandy@cloudera.com> Closes #1086 from sryza/sandy-spark-2146 and squashes the following commits: 185ff18 [Sandy Ryza] Use Seq instead of Array c996120 [Sandy Ryza] SPARK-2146. Fix takeOrdered doc	2014-06-17 12:03:22 -07:00
Andrew Ash	b92d16b114	SPARK-1063 Add .sortBy(f) method on RDD This never got merged from the apache/incubator-spark repo (which is now deleted) but there had been several rounds of code review on this PR there. I think this is ready for merging. Author: Andrew Ash <andrew@andrewash.com> This patch had conflicts when merged, resolved by Committer: Reynold Xin <rxin@apache.org> Closes #369 from ash211/sortby and squashes the following commits: d09147a [Andrew Ash] Fix Ordering import 43d0a53 [Andrew Ash] Fix missing .collect() 29a54ed [Andrew Ash] Re-enable test by converting to a closure 5a95348 [Andrew Ash] Add license for RDDSuiteUtils 64ed6e3 [Andrew Ash] Remove leaked diff d4de69a [Andrew Ash] Remove scar tissue 63638b5 [Andrew Ash] Add Python version of .sortBy() 45e0fde [Andrew Ash] Add Java version of .sortBy() adf84c5 [Andrew Ash] Re-indent to keep line lengths under 100 chars 9d9b9d8 [Andrew Ash] Use parentheses on .collect() calls 0457b69 [Andrew Ash] Ignore failing test 99f0baf [Andrew Ash] Merge branch 'master' into sortby 222ae97 [Andrew Ash] Try moving Ordering objects out to a different class 3fd0dd3 [Andrew Ash] Add (failing) test for sortByKey with explicit Ordering b8b5bbc [Andrew Ash] Align remove extra spaces that were used to align ='s in test code 8c53298 [Andrew Ash] Actually use ascending and numPartitions parameters 381eef2 [Andrew Ash] Correct silly typo 7db3e84 [Andrew Ash] Support ascending and numPartitions params in sortBy() 0f685fd [Andrew Ash] Merge remote-tracking branch 'origin/master' into sortby ca4490d [Andrew Ash] Add .sortBy(f) method on RDD	2014-06-17 11:47:48 -07:00
Zongheng Yang	e243c5ffac	[SPARK-2053][SQL] Add Catalyst expressions for CASE WHEN. JIRA ticket: https://issues.apache.org/jira/browse/SPARK-2053 This PR adds support for two types of CASE statements present in Hive. The first type is of the form `CASE WHEN a THEN b [WHEN c THEN d]* [ELSE e] END`, with the semantics like a chain of if statements. The second type is of the form `CASE a WHEN b THEN c [WHEN d THEN e]* [ELSE f] END`, with the semantics like a switch statement on key `a`. Both forms are implemented in `CaseWhen`. [This link](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-ConditionalFunctions) contains more detailed descriptions on their semantics. Notes / Open issues: * Please check if any implicit contracts / invariants are broken in the implementations (especially for the operators). I am not very familiar with them and I currently find them tricky to spot. * We should decide whether or not a non-boolean condition is allowed in a branch of `CaseWhen`. Hive throws a `SemanticException` for this situation and I think it'd be good to mimic it -- the question is where in the whole Spark SQL pipeline should we signal an exception for such a query. Author: Zongheng Yang <zongheng.y@gmail.com> Closes #1055 from concretevitamin/caseWhen and squashes the following commits: 4226eb9 [Zongheng Yang] Comment. 79d26fc [Zongheng Yang] Merge branch 'master' into caseWhen caf9383 [Zongheng Yang] Update a FIXME. 9d26ab8 [Zongheng Yang] Add @transient marker. 788a0d9 [Zongheng Yang] Implement CastNulls, which fixes udf_case and udf_when. 7ef284f [Zongheng Yang] Refactors: remove redundant passes, improve toString, mark transient. f47ae7b [Zongheng Yang] Modify queries in tests to have shorter golden files. 1c1fbfc [Zongheng Yang] Cleanups per review comments. 7d2b7e2 [Zongheng Yang] Translate CaseKeyWhen to CaseWhen at parsing time. 47d406a [Zongheng Yang] Do toArray once and lazily outside of eval(). bb3d109 [Zongheng Yang] Update scaladoc of a method. aea3195 [Zongheng Yang] Fix bug that branchesArr is not used; remove unused import. 96870a8 [Zongheng Yang] Turn off scalastyle for some comments. 7392f3a [Zongheng Yang] Minor cleanup. 2cf08bb [Zongheng Yang] Merge branch 'master' into caseWhen 9f84b40 [Zongheng Yang] Add golden outputs from Hive. db51a85 [Zongheng Yang] Add allCondBooleans check; uncomment tests. 3f9ef0a [Zongheng Yang] Cleanups and bug fixes (mainly in eval() and resolved). be54bc8 [Zongheng Yang] Rewrite eval() to a low-level implementation. Separate two CASE stmts. f2bcb9d [Zongheng Yang] WIP 5906f75 [Zongheng Yang] WIP efd019b [Zongheng Yang] eval() and toString() bug fixes. 7d81e95 [Zongheng Yang] Clean up resolved. a31d782 [Zongheng Yang] Finish up Case.	2014-06-17 13:30:17 +02:00
Xi Liu	f5a4049e53	[SPARK-2164][SQL] Allow Hive UDF on columns of type struct Author: Xi Liu <xil@conviva.com> Closes #796 from xiliu82/sqlbug and squashes the following commits: 328dfc4 [Xi Liu] [Spark SQL] remove a temporary function after test 354386a [Xi Liu] [Spark SQL] add test suite for UDF on struct 8fc6f51 [Xi Liu] [SparkSQL] allow UDF on struct	2014-06-17 13:16:02 +02:00
Andrew Or	09deb3eee0	[SPARK-2144] ExecutorsPage reports incorrect # of RDD blocks This is reproducible whenever we drop a block because of memory pressure. This is because StorageStatusListener actually never removes anything from the block maps of its StorageStatuses. Instead, when a block is dropped, it sets the block's storage level to `StorageLevel.NONE`, when it should just remove it from the map. This PR includes this simple fix. Author: Andrew Or <andrewor14@gmail.com> Closes #1080 from andrewor14/ui-blocks and squashes the following commits: fcf9f1a [Andrew Or] Remove BlockStatus if it is no longer cached	2014-06-17 01:28:22 -07:00
Daniel Darabos	23a12ce20c	SPARK-2035: Store call stack for stages, display it on the UI. I'm not sure about the test -- I get a lot of unrelated failures for some reason. I'll try to sort it out. But hopefully the automation will test this for me if I send a pull request :). I'll attach a demo HTML in [Jira](https://issues.apache.org/jira/browse/SPARK-2035). Author: Daniel Darabos <darabos.daniel@gmail.com> Author: Patrick Wendell <pwendell@gmail.com> Closes #981 from darabos/darabos-call-stack and squashes the following commits: f7c6bfa [Daniel Darabos] Fix bad merge. I undid `83c226d454` by Doris. 3d0a48d [Daniel Darabos] Merge remote-tracking branch 'upstream/master' into darabos-call-stack b857849 [Daniel Darabos] Style: Break long line. ecb5690 [Daniel Darabos] Include the last Spark method in the full stack trace. Otherwise it is not visible if the stage name is overridden. d00a85b [Patrick Wendell] Make call sites for stages non-optional and well defined b9eba24 [Daniel Darabos] Make StageInfo.details non-optional. Add JSON serialization code for the new field. Verify JSON backward compatibility. 4312828 [Daniel Darabos] Remove Mima excludes for CallSite. They should be unnecessary now, with SPARK-2070 fixed. 0920750 [Daniel Darabos] Merge remote-tracking branch 'upstream/master' into darabos-call-stack a4b1faf [Daniel Darabos] Add Mima exclusions for the CallSite changes it has picked up. They are private methods/classes, so we ought to be safe. 932f810 [Daniel Darabos] Use empty CallSite instead of null in DAGSchedulerSuite. Outside of testing, this parameter always originates in SparkContext.scala, and will never be null. ccd89d1 [Daniel Darabos] Fix long lines. ac173e4 [Daniel Darabos] Hide "show details" if there are no details to show. 6182da6 [Daniel Darabos] Set a configurable limit on maximum call stack depth. It can be useful in memory-constrained situations with large numbers of stages. 8fe2e34 [Daniel Darabos] Store call stack for stages, display it on the UI.	2014-06-17 00:08:05 -07:00
Anant	8cd04c3eec	SPARK-1990: added compatibility for python 2.6 for ssh_read command https://issues.apache.org/jira/browse/SPARK-1990 There were some posts on the lists that spark-ec2 does not work with Python 2.6. In addition, we should check the Python version at the top of the script and exit if it's too old Author: Anant <anant.asty@gmail.com> Closes #941 from anantasty/SPARK-1990 and squashes the following commits: 4ca441d [Anant] Implmented check_optput withinthe module to work with python 2.6 c6ed85c [Anant] added compatibility for python 2.6 for ssh_read command	2014-06-16 23:43:01 -07:00
Kan Zhang	d81c08bac9	[SPARK-2130] End-user friendly String repr for StorageLevel in Python JIRA issue https://issues.apache.org/jira/browse/SPARK-2130 This PR adds an end-user friendly String representation for StorageLevel in Python, similar to ```StorageLevel.description``` in Scala. ``` >>> rdd = sc.parallelize([1,2]) >>> storage_level = rdd.getStorageLevel() >>> storage_level StorageLevel(False, False, False, False, 1) >>> print(storage_level) Serialized 1x Replicated ``` Author: Kan Zhang <kzhang@apache.org> Closes #1096 from kanzhang/SPARK-2130 and squashes the following commits: 7c8b98b [Kan Zhang] [SPARK-2130] Prettier epydoc output cc5bf45 [Kan Zhang] [SPARK-2130] End-user friendly String representation for StorageLevel in Python	2014-06-16 23:31:31 -07:00
Anatoli Fomenko	7afa912e74	MLlib documentation fix Synchronized mllib-optimization.md with Spark Scaladoc: removed reference to GradientDescent.runMiniBatchSGD method This is a temporary fix to remove a link from http://spark.apache.org/docs/latest/mllib-optimization.html to GradientDescent.runMiniBatchSGD which is not in the current online GradientDescent Scaladoc. FIXME: revert this commit after GradientDescent Scaladoc is updated. See images for details. ![mllib-docs-fix-1](https://cloud.githubusercontent.com/assets/1375501/3294410/ccf19bb8-f5a8-11e3-93f1-f593016209eb.png) ![mllib-docs-fix-2](https://cloud.githubusercontent.com/assets/1375501/3294411/d0b59a7e-f5a8-11e3-8fc8-329c177ef8c8.png) Author: Anatoli Fomenko <fa@apache.org> Closes #1098 from afomenko/master and squashes the following commits: 5cb0758 [Anatoli Fomenko] MLlib documentation fix	2014-06-16 23:10:36 -07:00
Cheng Lian	237b96bc59	Minor fix: made "EXPLAIN" output to play well with JDBC output format Fixed the broken JDBC output. Test from Shark `beeline`: ``` beeline> !connect jdbc:hive2://localhost:10000/ scan complete in 2ms Connecting to jdbc:hive2://localhost:10000/ Enter username for jdbc:hive2://localhost:10000/: lian Enter password for jdbc:hive2://localhost:10000/: Connected to: Hive (version 0.12.0) Driver: Hive (version 0.12.0) Transaction isolation: TRANSACTION_REPEATABLE_READ 0: jdbc:hive2://localhost:10000/> 0: jdbc:hive2://localhost:10000/> explain select * from src; +-------------------------------------------------------------------------------+ \| plan \| +-------------------------------------------------------------------------------+ \| ExplainCommand [plan#2:0] \| \| HiveTableScan [key#0,value#1], (MetastoreRelation default, src, None), None \| +-------------------------------------------------------------------------------+ 2 rows selected (1.386 seconds) ``` Before this change, the output looked something like this: ``` +-------------------------------------------------------------------------------+ \| plan \| +-------------------------------------------------------------------------------+ \| ExplainCommand [plan#2:0] HiveTableScan [key#0,value#1], (MetastoreRelation default, src, None), None \| +-------------------------------------------------------------------------------+ ``` Author: Cheng Lian <lian.cs.zju@gmail.com> Closes #1097 from liancheng/multiLineExplain and squashes the following commits: eb37967 [Cheng Lian] Made output of "EXPLAIN" play well with JDBC output format	2014-06-16 16:42:17 -07:00
Cheng Lian	273afcb254	[SQL][SPARK-2094] Follow up of PR #1071 for Java API Updated `JavaSQLContext` and `JavaHiveContext` similar to what we've done to `SQLContext` and `HiveContext` in PR #1071. Added corresponding test case for Spark SQL Java API. Author: Cheng Lian <lian.cs.zju@gmail.com> Closes #1085 from liancheng/spark-2094-java and squashes the following commits: 29b8a51 [Cheng Lian] Avoided instantiating JavaSparkContext & JavaHiveContext to workaround test failure 92bb4fb [Cheng Lian] Marked test cases in JavaHiveQLSuite with "ignore" 22aec97 [Cheng Lian] Follow up of PR #1071 for Java API	2014-06-16 21:32:51 +02:00
witgo	cdf2b04570	[SPARK-1930] The Container is running beyond physical memory limits, so as to be killed Author: witgo <witgo@qq.com> Closes #894 from witgo/SPARK-1930 and squashes the following commits: 564307e [witgo] Update the running-on-yarn.md 3747515 [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1930 172647b [witgo] add memoryOverhead docs a0ff545 [witgo] leaving only two configs a17bda2 [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1930 478ca15 [witgo] Merge branch 'master' into SPARK-1930 d1244a1 [witgo] Merge branch 'master' into SPARK-1930 8b967ae [witgo] Merge branch 'master' into SPARK-1930 655a820 [witgo] review commit 71859a7 [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1930 e3c531d [witgo] review commit e16f190 [witgo] different memoryOverhead ffa7569 [witgo] review commit 5c9581f [witgo] Merge branch 'master' into SPARK-1930 9a6bcf2 [witgo] review commit 8fae45a [witgo] fix NullPointerException e0dcc16 [witgo] Adding configuration items b6a989c [witgo] Fix container memory beyond limit, were killed	2014-06-16 14:27:31 -05:00
Kan Zhang	4fdb491775	[SPARK-2010] Support for nested data in PySpark SQL JIRA issue https://issues.apache.org/jira/browse/SPARK-2010 This PR adds support for nested collection types in PySpark SQL, including array, dict, list, set, and tuple. Example, ``` >>> from array import array >>> from pyspark.sql import SQLContext >>> sqlCtx = SQLContext(sc) >>> rdd = sc.parallelize([ ... {"f1" : array('i', [1, 2]), "f2" : {"row1" : 1.0}}, ... {"f1" : array('i', [2, 3]), "f2" : {"row2" : 2.0}}]) >>> srdd = sqlCtx.inferSchema(rdd) >>> srdd.collect() == [{"f1" : array('i', [1, 2]), "f2" : {"row1" : 1.0}}, ... {"f1" : array('i', [2, 3]), "f2" : {"row2" : 2.0}}] True >>> rdd = sc.parallelize([ ... {"f1" : [[1, 2], [2, 3]], "f2" : set([1, 2]), "f3" : (1, 2)}, ... {"f1" : [[2, 3], [3, 4]], "f2" : set([2, 3]), "f3" : (2, 3)}]) >>> srdd = sqlCtx.inferSchema(rdd) >>> srdd.collect() == \ ... [{"f1" : [[1, 2], [2, 3]], "f2" : set([1, 2]), "f3" : (1, 2)}, ... {"f1" : [[2, 3], [3, 4]], "f2" : set([2, 3]), "f3" : (2, 3)}] True ``` Author: Kan Zhang <kzhang@apache.org> Closes #1041 from kanzhang/SPARK-2010 and squashes the following commits: 1b2891d [Kan Zhang] [SPARK-2010] minor doc change and adding a TODO 504f27e [Kan Zhang] [SPARK-2010] Support for nested data in PySpark SQL	2014-06-16 11:11:29 -07:00

1 2 3 4 5 ...

7238 commits