Thanks go to @marmbrus for his implementation.
Author: Michael Armbrust <michael@databricks.com>
Author: Zongheng Yang <zongheng.y@gmail.com>
Closes#1074 from concretevitamin/option-treenode and squashes the following commits:
ef27b85 [Zongheng Yang] Merge pull request #1 from marmbrus/pr/1074
73133c2 [Michael Armbrust] TreeNodes can't be inner classes.
ab78420 [Zongheng Yang] Add a test.
2ccb721 [Michael Armbrust] Add support for transformation of optional children.
Added batching with default batch size 10 in SchemaRDD.javaToPython
Author: Kan Zhang <kzhang@apache.org>
Closes#1023 from kanzhang/SPARK-2079 and squashes the following commits:
2d1915e [Kan Zhang] [SPARK-2079] Add batching in SchemaRDD.javaToPython
19b0c09 [Kan Zhang] [SPARK-2079] Removing unnecessary wrapping in SchemaRDD.javaToPython
https://issues.apache.org/jira/browse/SPARK-2137
Author: Yin Huai <huai@cse.ohio-state.edu>
Closes#1081 from yhuai/SPARK-2137 and squashes the following commits:
c04f910 [Yin Huai] Merge remote-tracking branch 'upstream/master' into SPARK-2137
205f17b [Yin Huai] Make Hive UDF wrapper support Timestamp.
## Related JIRA issues
- Main issue:
  - [SPARK-2094](https://issues.apache.org/jira/browse/SPARK-2094): Ensure exactly once semantics for DDL/Commands
- Issues resolved as dependencies:
  - [SPARK-2081](https://issues.apache.org/jira/browse/SPARK-2081): Undefine output() from the abstract class Command and implement it in concrete subclasses
  - [SPARK-2128](https://issues.apache.org/jira/browse/SPARK-2128): No plan for DESCRIBE
  - [SPARK-1852](https://issues.apache.org/jira/browse/SPARK-1852): SparkSQL Queries with Sorts run before the user asks them to
- Other related issue:
  - [SPARK-2129](https://issues.apache.org/jira/browse/SPARK-2129): NPE thrown while lookup a view
Two test cases, `join_view` and `mergejoin_mixed`, within the `HiveCompatibilitySuite` are removed from the whitelist to work around this issue.
## PR Overview
This PR defines physical plans for DDL statements and commands and wraps their side effects in a lazy field `PhysicalCommand.sideEffectResult`, so that they are executed eagerly and exactly once. Also, as a positive side effect, DDL statements and commands can now be turned into proper `SchemaRDD`s, letting users query the execution results.
This PR defines schemas for the following DDL/commands:
- EXPLAIN command
  - `plan`: String, the plan explanation
- SET command
  - `key`: String, the key(s) of the property/properties being set or queried
  - `value`: String, the value(s) of the property/properties being queried
- Other Hive native commands
  - `result`: String, execution result returned by Hive
**NOTE**: We should refine schemas for different native commands by defining physical plans for them in the future.
## Examples
### EXPLAIN command
Taking the EXPLAIN command as an example: we first execute the command, obtaining a `SchemaRDD` at the same time, then query the `plan` field with the schema DSL:
```
scala> loadTestTable("src")
...
scala> val q0 = hql("EXPLAIN SELECT key, COUNT(*) FROM src GROUP BY key")
...
q0: org.apache.spark.sql.SchemaRDD =
SchemaRDD[0] at RDD at SchemaRDD.scala:98
== Query Plan ==
ExplainCommandPhysical [plan#11:0]
Aggregate false, [key#4], [key#4,SUM(PartialCount#6L) AS c_1#2L]
Exchange (HashPartitioning [key#4:0], 200)
Exchange (HashPartitioning [key#4:0], 200)
Aggregate true, [key#4], [key#4,COUNT(1) AS PartialCount#6L]
HiveTableScan [key#4], (MetastoreRelation default, src, None), None
scala> q0.select('plan).collect()
...
[ExplainCommandPhysical [plan#24:0]
Aggregate false, [key#17], [key#17,SUM(PartialCount#19L) AS c_1#2L]
Exchange (HashPartitioning [key#17:0], 200)
Exchange (HashPartitioning [key#17:0], 200)
Aggregate true, [key#17], [key#17,COUNT(1) AS PartialCount#19L]
HiveTableScan [key#17], (MetastoreRelation default, src, None), None]
scala>
```
### SET command
In this example we query all the properties set in `SQLConf`, register the result as a table, and then query the table with HiveQL:
```
scala> val q1 = hql("SET")
...
q1: org.apache.spark.sql.SchemaRDD =
SchemaRDD[7] at RDD at SchemaRDD.scala:98
== Query Plan ==
<SET command: executed by Hive, and noted by SQLContext>
scala> q1.registerAsTable("properties")
scala> hql("SELECT key, value FROM properties ORDER BY key LIMIT 10").foreach(println)
...
== Query Plan ==
TakeOrdered 10, [key#51:0 ASC]
Project [key#51:0,value#52:1]
SetCommandPhysical None, None, [key#55:0,value#56:1]), which has no missing parents
14/06/12 12:19:27 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from Stage 5 (SchemaRDD[21] at RDD at SchemaRDD.scala:98
== Query Plan ==
TakeOrdered 10, [key#51:0 ASC]
Project [key#51:0,value#52:1]
SetCommandPhysical None, None, [key#55:0,value#56:1])
...
[datanucleus.autoCreateSchema,true]
[datanucleus.autoStartMechanismMode,checked]
[datanucleus.cache.level2,false]
[datanucleus.cache.level2.type,none]
[datanucleus.connectionPoolingType,BONECP]
[datanucleus.fixedDatastore,false]
[datanucleus.identifierFactory,datanucleus1]
[datanucleus.plugin.pluginRegistryBundleCheck,LOG]
[datanucleus.rdbms.useLegacyNativeValueStrategy,true]
[datanucleus.storeManagerType,rdbms]
scala>
```
### "Exactly once" semantics
Finally, an example of the "exactly once" semantics:
```
scala> val q2 = hql("CREATE TABLE t1(key INT, value STRING)")
...
q2: org.apache.spark.sql.SchemaRDD =
SchemaRDD[28] at RDD at SchemaRDD.scala:98
== Query Plan ==
<Native command: executed by Hive>
scala> table("t1")
...
res9: org.apache.spark.sql.SchemaRDD =
SchemaRDD[32] at RDD at SchemaRDD.scala:98
== Query Plan ==
HiveTableScan [key#58,value#59], (MetastoreRelation default, t1, None), None
scala> q2.collect()
...
res10: Array[org.apache.spark.sql.Row] = Array([])
scala>
```
As we can see, the "CREATE TABLE" command is executed eagerly right after the `SchemaRDD` is created, and referencing the `SchemaRDD` again won't trigger duplicate execution.
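The exactly-once behavior hinges on Scala's `lazy val` initialization semantics; a minimal, Spark-free sketch (the `FakeCommand` class below is a hypothetical stand-in, not the actual `PhysicalCommand`):

```scala
// Minimal sketch of the lazy-field trick: the side effect runs on first
// access only, however many times the result is referenced afterwards.
class FakeCommand(run: () => Seq[String]) {  // hypothetical stand-in class
  var executions = 0                         // instrumentation for this sketch
  lazy val sideEffectResult: Seq[String] = {
    executions += 1
    run()
  }
}

val cmd = new FakeCommand(() => Seq("OK"))
cmd.sideEffectResult    // first reference: command executes
cmd.sideEffectResult    // second reference: cached, no re-execution
println(cmd.executions) // prints 1
```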
Author: Cheng Lian <lian.cs.zju@gmail.com>
Closes#1071 from liancheng/exactlyOnceCommand and squashes the following commits:
d005b03 [Cheng Lian] Made "SET key=value" returns the newly set key value pair
f6c7715 [Cheng Lian] Added test cases for DDL/command statement RDDs
1d00937 [Cheng Lian] Makes SchemaRDD DSLs work for DDL/command statement RDDs
5c7e680 [Cheng Lian] Bug fix: wrong type used in pattern matching
48aa2e5 [Cheng Lian] Refined SQLContext.emptyResult as an empty RDD[Row]
cc64f32 [Cheng Lian] Renamed physical plan classes for DDL/commands
74789c1 [Cheng Lian] Fixed failing test cases
0ad343a [Cheng Lian] Added physical plan for DDL and commands to ensure the "exactly once" semantics
Author: Michael Armbrust <michael@databricks.com>
Closes#1061 from marmbrus/timestamp and squashes the following commits:
79c3903 [Michael Armbrust] Add timestamp to HiveMetastoreTypes.toMetastoreType()
Author: Michael Armbrust <michael@databricks.com>
Closes#1072 from marmbrus/cachedStars and squashes the following commits:
8757c8e [Michael Armbrust] Use planner for in-memory scans.
Add optimization for `CaseConversionExpression`'s.
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes#990 from ueshin/issues/SPARK-2052 and squashes the following commits:
2568666 [Takuya UESHIN] Move some rules back.
dde7ede [Takuya UESHIN] Add tests to check if ConstantFolding can handle null literals and remove the unneeded rules from NullPropagation.
c4eea67 [Takuya UESHIN] Fix toString methods.
23e2363 [Takuya UESHIN] Make CaseConversionExpressions foldable if the child is foldable.
0ff7568 [Takuya UESHIN] Add tests for collapsing case statements.
3977d80 [Takuya UESHIN] Add optimization for CaseConversionExpression's.
This has been messing up the SQL PySpark tests on Jenkins.
Author: Patrick Wendell <pwendell@gmail.com>
Closes#1054 from pwendell/pyspark and squashes the following commits:
1eb5487 [Patrick Wendell] False change
06f062d [Patrick Wendell] HOTFIX: PySpark tests should be order insensitive
Some improvements for PR #837: add another case to the whitelist and use `filter` to build the result iterator.
Author: Daoyuan <daoyuan.wang@intel.com>
Closes#1049 from adrian-wang/clean-LeftSemiJoinHash and squashes the following commits:
b314d5a [Daoyuan] change hashSet name
27579a9 [Daoyuan] add semijoin to white list and use filter to create new iterator in LeftSemiJoinBNL
Signed-off-by: Michael Armbrust <michael@databricks.com>
This PR implements `take()` on a `SchemaRDD` by inserting a logical limit that is followed by a `collect()`. This is also accompanied by adding a catalyst optimizer rule for collapsing adjacent limits. Doing so prevents an unnecessary shuffle that is sometimes triggered by `take()`.
Author: Sameer Agarwal <sameer@databricks.com>
Closes#1048 from sameeragarwal/master and squashes the following commits:
3eeb848 [Sameer Agarwal] Fixing Tests
1b76ff1 [Sameer Agarwal] Deprecating limit(limitExpr: Expression) in v1.1.0
b723ac4 [Sameer Agarwal] Added limit folding tests
a0ff7c4 [Sameer Agarwal] Adding catalyst rule to fold two consecutive limits
8d42d03 [Sameer Agarwal] Implement take() as limit() followed by collect()
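The limit-folding rule can be illustrated on a toy plan tree; the `Plan`, `Scan`, and `Limit` types below are hypothetical stand-ins for the Catalyst classes:

```scala
// Toy illustration of folding two consecutive limits into one:
// Limit(a, Limit(b, child)) is equivalent to Limit(min(a, b), child),
// since the outer limit can never see more rows than the inner one emits.
sealed trait Plan
case class Scan(name: String) extends Plan           // hypothetical leaf node
case class Limit(n: Int, child: Plan) extends Plan

def collapseLimits(plan: Plan): Plan = plan match {
  case Limit(outer, Limit(inner, child)) =>
    collapseLimits(Limit(math.min(outer, inner), child)) // handles chains too
  case Limit(n, child) => Limit(n, collapseLimits(child))
  case leaf            => leaf
}

println(collapseLimits(Limit(10, Limit(5, Scan("src"))))) // Limit(5,Scan(src))
```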
Author: Qiuzhuang.Lian <Qiuzhuang.Lian@gmail.com>
Closes#1046 from Qiuzhuang/master and squashes the following commits:
0a9921a [Qiuzhuang.Lian] SPARK-2107: FilterPushdownSuite doesn't need Junit jar.
JIRA issue: [SPARK-1968](https://issues.apache.org/jira/browse/SPARK-1968)
This PR adds support for SQL/HiveQL commands for caching/uncaching tables:
```
scala> sql("CACHE TABLE src")
...
res0: org.apache.spark.sql.SchemaRDD =
SchemaRDD[0] at RDD at SchemaRDD.scala:98
== Query Plan ==
CacheCommandPhysical src, true
scala> table("src")
...
res1: org.apache.spark.sql.SchemaRDD =
SchemaRDD[3] at RDD at SchemaRDD.scala:98
== Query Plan ==
InMemoryColumnarTableScan [key#0,value#1], (HiveTableScan [key#0,value#1], (MetastoreRelation default, src, None), None), false
scala> isCached("src")
res2: Boolean = true
scala> sql("CACHE TABLE src")
...
res3: org.apache.spark.sql.SchemaRDD =
SchemaRDD[4] at RDD at SchemaRDD.scala:98
== Query Plan ==
CacheCommandPhysical src, false
scala> table("src")
...
res4: org.apache.spark.sql.SchemaRDD =
SchemaRDD[11] at RDD at SchemaRDD.scala:98
== Query Plan ==
HiveTableScan [key#2,value#3], (MetastoreRelation default, src, None), None
scala> isCached("src")
res5: Boolean = false
```
Things also work for `hql`.
Author: Cheng Lian <lian.cs.zju@gmail.com>
Closes#1038 from liancheng/sqlCacheTable and squashes the following commits:
ecb7194 [Cheng Lian] Trimmed the SQL string before parsing special commands
6f4ce42 [Cheng Lian] Moved logical command classes to a separate file
3458a24 [Cheng Lian] Added comment for public API
f0ffacc [Cheng Lian] Added isCached() predicate
15ec6d2 [Cheng Lian] Added "(UN)CACHE TABLE" SQL/HiveQL statements
`NullPropagation` should use the exact type value when transforming `Count` or `Sum`.
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes#1034 from ueshin/issues/SPARK-2093 and squashes the following commits:
65b6ff1 [Takuya UESHIN] Modify the literal value of the result of transformation from Sum to long value.
830c20b [Takuya UESHIN] Add Cast to the result of transformation from Count.
9314806 [Takuya UESHIN] Fix NullPropagation to use exact type value.
Thanks go to @liancheng, who pointed out that `sql/test-only *.SQLConfSuite *.SQLQuerySuite` passed but `sql/test-only *.SQLQuerySuite *.SQLConfSuite` failed. The reason is that some tests use the same test keys, and without clear()'ing them, they get carried over to other tests. This hotfix simply adds some `clear()` calls.
This problem was probably not evident on Jenkins before because `parallelExecution` is not set to `false` for `sqlCoreSettings`.
Author: Zongheng Yang <zongheng.y@gmail.com>
Closes#1040 from concretevitamin/sqlconf-tests and squashes the following commits:
6d14ceb [Zongheng Yang] HOTFIX: clear() confs in SQLConf related unit tests.
By @egraldlo.
Author: egraldlo <egraldlo@gmail.com>
Author: Michael Armbrust <michael@databricks.com>
Closes#1033 from marmbrus/pr/978 and squashes the following commits:
e228c5e [Michael Armbrust] Remove "test".
762aeaf [Michael Armbrust] Remove unneeded rule. More descriptive name for test table.
d414cd7 [egraldlo] fommatting issues
1153f75 [egraldlo] do best to avoid overflowing in function avg().
Following the rules described in https://cwiki.apache.org/confluence/display/Hive/OuterJoinBehavior, we can optimize the SQL join by pushing down the join predicate and the where predicate.
Author: Cheng Hao <hao.cheng@intel.com>
Closes#1015 from chenghao-intel/join_predicate_push_down and squashes the following commits:
10feff9 [Cheng Hao] fix bug of changing the join type in PredicatePushDownThroughJoin
44c6700 [Cheng Hao] Add logical to support pushdown the join filter
0bce426 [Cheng Hao] Pushdown the join filter & predicate for outer join
The package is `org.apache.spark.sql.hive.execution`, while the file was placed under `sql/hive/src/main/scala/org/apache/spark/sql/hive/`.
Author: Cheng Lian <lian.cs.zju@gmail.com>
Closes#1029 from liancheng/moveHiveOperators and squashes the following commits:
d632eb8 [Cheng Lian] Moved hiveOperators.scala to the right package folder
This PR (1) introduces a new class SQLConf that stores key-value properties for a SQLContext and (2) cleans up the semantics of various forms of SET commands.
The SQLConf class unlocks user-controllable optimization opportunities; for example, users can now override the number of partitions used during an Exchange. A SQLConf can be accessed and modified programmatically through its getters and setters. It can also be modified through SET commands executed by `sql()` or `hql()`. Note that users now have the ability to change a particular property for different queries inside the same Spark job, unlike settings configured in SparkConf.
For SET commands: "SET" will return all properties currently set in a SQLConf, "SET key" will return the key-value pair (if set) or an undefined message, and "SET key=value" will call the setter on SQLConf, and if a HiveContext is used, it will be executed in Hive as well.
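The three SET forms map naturally onto a pair of Options; a hypothetical, case-sensitive parsing sketch (not the actual parser):

```scala
// Sketch of the three SET forms mapped onto an Option pair:
// (None, None)        -> "SET"            lists all properties
// (Some(k), None)     -> "SET key"        queries one key
// (Some(k), Some(v))  -> "SET key=value"  sets a key
def parseSet(stmt: String): (Option[String], Option[String]) = {
  val body = stmt.trim.stripPrefix("SET").trim // case-sensitive for the sketch
  if (body.isEmpty) (None, None)
  else body.split("=", 2) match {              // split on the first '=' only
    case Array(k)    => (Some(k.trim), None)
    case Array(k, v) => (Some(k.trim), Some(v.trim))
  }
}

println(parseSet("SET"))                                  // (None,None)
println(parseSet("SET spark.sql.shuffle.partitions"))     // (Some(...),None)
println(parseSet("SET spark.sql.shuffle.partitions=10"))  // (Some(...),Some(10))
```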
Author: Zongheng Yang <zongheng.y@gmail.com>
Closes#956 from concretevitamin/sqlconf and squashes the following commits:
4968c11 [Zongheng Yang] Very minor cleanup.
d74dde5 [Zongheng Yang] Remove the redundant mkQueryExecution() method.
c129b86 [Zongheng Yang] Merge remote-tracking branch 'upstream/master' into sqlconf
26c40eb [Zongheng Yang] Make SQLConf a trait and have SQLContext mix it in.
dd19666 [Zongheng Yang] Update a comment.
baa5d29 [Zongheng Yang] Remove default param for shuffle partitions accessor.
5f7e6d8 [Zongheng Yang] Add default num partitions.
22d9ed7 [Zongheng Yang] Fix output() of Set physical. Add SQLConf param accessor method.
e9856c4 [Zongheng Yang] Use java.util.Collections.synchronizedMap on a Java HashMap.
88dd0c8 [Zongheng Yang] Remove redundant SET Keyword.
271f0b1 [Zongheng Yang] Minor change.
f8983d1 [Zongheng Yang] Minor changes per review comments.
1ce8a5e [Zongheng Yang] Invoke runSqlHive() in SQLConf#get for the HiveContext case.
b766af9 [Zongheng Yang] Remove a test.
d52e1bd [Zongheng Yang] De-hardcode number of shuffle partitions for BasicOperators (read from SQLConf).
555599c [Zongheng Yang] Bullet-proof (relatively) parsing SET per review comment.
c2067e8 [Zongheng Yang] Mark SQLContext transient and put it in a second param list.
2ea8cdc [Zongheng Yang] Wrap long line.
41d7f09 [Zongheng Yang] Fix imports.
13279e6 [Zongheng Yang] Refactor the logic of eagerly processing SET commands.
b14b83e [Zongheng Yang] In a HiveContext, make SQLConf a subset of HiveConf.
6983180 [Zongheng Yang] Move a SET test to SQLQuerySuite and make it complete.
5b67985 [Zongheng Yang] New line at EOF.
c651797 [Zongheng Yang] Add commands.scala.
efd82db [Zongheng Yang] Clean up semantics of several cases of SET.
c1017c2 [Zongheng Yang] WIP in changing SetCommand to take two Options (for different semantics of SETs).
0f00d86 [Zongheng Yang] Add a test for singleton set command in SQL.
41acd75 [Zongheng Yang] Add a test for hql() in HiveQuerySuite.
2276929 [Zongheng Yang] Fix default hive result for set commands in HiveComparisonTest.
3b0c71b [Zongheng Yang] Remove Parser for set commands. A few other fixes.
d0c4578 [Zongheng Yang] Tmux typo.
0ecea46 [Zongheng Yang] Changes for HiveQl and HiveContext.
ce22d80 [Zongheng Yang] Fix parsing issues.
cb722c1 [Zongheng Yang] Finish up SQLConf patch.
4ebf362 [Zongheng Yang] First cut at SQLConf inside SQLContext.
This PR attempts to resolve [SPARK-1704](https://issues.apache.org/jira/browse/SPARK-1704) by introducing a physical plan for EXPLAIN commands, which just prints out the debug string (containing various SparkSQL's plans) of the corresponding QueryExecution for the actual query.
Author: Zongheng Yang <zongheng.y@gmail.com>
Closes#1003 from concretevitamin/explain-cmd and squashes the following commits:
5b7911f [Zongheng Yang] Add a regression test.
1bfa379 [Zongheng Yang] Modify output().
719ada9 [Zongheng Yang] Override otherCopyArgs for ExplainCommandPhysical.
4318fd7 [Zongheng Yang] Make all output one Row.
439c6ab [Zongheng Yang] Minor cleanups.
408f574 [Zongheng Yang] SPARK-1704: Add CommandStrategy and ExplainCommandPhysical.
Just submitting another solution for #395.
Author: Daoyuan <daoyuan.wang@intel.com>
Author: Michael Armbrust <michael@databricks.com>
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Closes#837 from adrian-wang/left-semi-join-support and squashes the following commits:
d39cd12 [Daoyuan Wang] Merge pull request #1 from marmbrus/pr/837
6713c09 [Michael Armbrust] Better debugging for failed query tests.
035b73e [Michael Armbrust] Add test for left semi that can't be done with a hash join.
5ec6fa4 [Michael Armbrust] Add left semi to SQL Parser.
4c726e5 [Daoyuan] improvement according to Michael
8d4a121 [Daoyuan] add golden files for leftsemijoin
83a3c8a [Daoyuan] scala style fix
14cff80 [Daoyuan] add support for left semi join
Basically there is a race condition (possibly a Scala bug?) when these values are recomputed on all of the slaves, which results in an incorrect projection being generated (possibly because the GUID uniqueness contract is broken?).
In general we should probably enforce that all expression planning occurs on the driver, as is now the case here.
Author: Michael Armbrust <michael@databricks.com>
Closes#1004 from marmbrus/fixAggBug and squashes the following commits:
e0c116c [Michael Armbrust] Compute aggregate expression during planning instead of lazily on workers.
Followup: #989
Author: Michael Armbrust <michael@databricks.com>
Closes#994 from marmbrus/caseSensitiveFunctions2 and squashes the following commits:
9d9c8ed [Michael Armbrust] Fix DIV and BETWEEN.
Author: Michael Armbrust <michael@databricks.com>
Closes#989 from marmbrus/caseSensitiveFuncitons and squashes the following commits:
681de54 [Michael Armbrust] LIKE, RLIKE and IN in HQL should not be case sensitive.
Author: Michael Armbrust <michael@databricks.com>
Closes#985 from marmbrus/tableName and squashes the following commits:
3caaa27 [Michael Armbrust] Correctly analyze queries where columnName == tableName.
`CaseConversionExpression` should check if the evaluated value is `null`.
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes#982 from ueshin/issues/SPARK-2036 and squashes the following commits:
61e1c54 [Takuya UESHIN] Add check if the evaluated value is null.
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes#974 from ueshin/issues/SPARK-2029 and squashes the following commits:
e19e8f4 [Takuya UESHIN] Bump version number to 1.1.0-SNAPSHOT.
Stop resetting spark.driver.port in unit tests (Scala, Java and Python).
Author: Syed Hashmi <shashmi@cloudera.com>
Author: CodingCat <zhunansjtu@gmail.com>
Closes#943 from syedhashmi/master and squashes the following commits:
885f210 [Syed Hashmi] Removing unnecessary file (created by mergetool)
b8bd4b5 [Syed Hashmi] Merge remote-tracking branch 'upstream/master'
b895e59 [Syed Hashmi] Revert "[SPARK-1784] Add a new partitioner"
57b6587 [Syed Hashmi] Revert "[SPARK-1784] Add a balanced partitioner"
1574769 [Syed Hashmi] [SPARK-1942] Stop clearing spark.driver.port in unit tests
4354836 [Syed Hashmi] Revert "SPARK-1686: keep schedule() calling in the main thread"
fd36542 [Syed Hashmi] [SPARK-1784] Add a balanced partitioner
6668015 [CodingCat] SPARK-1686: keep schedule() calling in the main thread
4ca94cc [Syed Hashmi] [SPARK-1784] Add a new partitioner
This is a follow-up to PR #758.
The `unwrapHiveData` function is now composed statically, according to the field object inspector, before actual rows are scanned, to avoid dynamic dispatching cost.
According to the same micro benchmark used in PR #758, this simple change brings a slight performance boost: 2.5% for the CSV table and 1% for the RCFile table.
```
Optimized version:
CSV: 6870 ms, RCFile: 5687 ms
CSV: 6832 ms, RCFile: 5800 ms
CSV: 6822 ms, RCFile: 5679 ms
CSV: 6704 ms, RCFile: 5758 ms
CSV: 6819 ms, RCFile: 5725 ms
Original version:
CSV: 7042 ms, RCFile: 5667 ms
CSV: 6883 ms, RCFile: 5703 ms
CSV: 7115 ms, RCFile: 5665 ms
CSV: 7020 ms, RCFile: 5981 ms
CSV: 6871 ms, RCFile: 5906 ms
```
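The static-composition idea (pick the unwrapping function once per column, then apply it per row) can be sketched without Hive's object inspectors; the `FieldType` cases below are hypothetical stand-ins:

```scala
// Sketch: choose an unwrapper once, outside the row loop, instead of
// pattern matching on the field type for every single value.
sealed trait FieldType                   // hypothetical stand-ins for
case object IntType extends FieldType    // Hive object inspectors
case object StringType extends FieldType

def makeUnwrapper(t: FieldType): Any => Any = t match {
  case IntType    => (v: Any) => v.asInstanceOf[java.lang.Integer].intValue
  case StringType => (v: Any) => v.asInstanceOf[String]
}

val unwrap = makeUnwrapper(IntType)      // dispatched once per column
val column: Seq[Any] = Seq(Int.box(1), Int.box(2), Int.box(3))
println(column.map(unwrap))              // List(1, 2, 3)
```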
Author: Cheng Lian <lian.cs.zju@gmail.com>
Closes#935 from liancheng/staticUnwrapping and squashes the following commits:
c49c70c [Cheng Lian] Avoid dynamic dispatching when unwrapping Hive data.
I'm not sure whether now is the time to implement system functions for string operations in Spark SQL.
Author: egraldlo <egraldlo@gmail.com>
Closes#936 from egraldlo/stringoperator and squashes the following commits:
3c6c60a [egraldlo] Add UPPER, LOWER, MAX and MIN into hive parser
ea76d0a [egraldlo] modify the formatting issues
b49f25e [egraldlo] modify the formatting issues
1f0bbb5 [egraldlo] system function upper and lower supported
13d3267 [egraldlo] system function upper and lower supported
In cases like `Limit` and `TakeOrdered`, `executeCollect()` makes optimizations that `execute().collect()` will not.
Author: Cheng Lian <lian.cs.zju@gmail.com>
Closes#939 from liancheng/spark-1958 and squashes the following commits:
bdc4a14 [Cheng Lian] Copy rows to present immutable data to users
8250976 [Cheng Lian] Added return type explicitly for public API
192a25c [Cheng Lian] [SPARK-1958] Calling .collect() on a SchemaRDD should call executeCollect() on the underlying query plan.
Author: Michael Armbrust <michael@databricks.com>
Closes#913 from marmbrus/timestampMetastore and squashes the following commits:
8e0154f [Michael Armbrust] Add timestamp to hive metastore type parser.
The child of `SumDistinct` or `Average` should be widened to prevent overflow, just as with `Sum`.
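The overflow being guarded against is easy to reproduce in plain Scala:

```scala
// Why widening matters: summing Ints in an Int accumulator silently wraps
// around, while summing the same values widened to Long does not.
val values = Seq(Int.MaxValue, 1)

val intSum: Int   = values.sum               // wraps: -2147483648
val longSum: Long = values.map(_.toLong).sum // correct: 2147483648

println(intSum)
println(longSum)
```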
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes#902 from ueshin/issues/SPARK-1947 and squashes the following commits:
99c3dcb [Takuya UESHIN] Insert Cast for SumDistinct and Average.
JIRA issue: [SPARK-1959](https://issues.apache.org/jira/browse/SPARK-1959)
Author: Cheng Lian <lian.cs.zju@gmail.com>
Closes#909 from liancheng/spark-1959 and squashes the following commits:
306659c [Cheng Lian] [SPARK-1959] String "NULL" shouldn't be interpreted as null value
JIRA issue: [SPARK-1368](https://issues.apache.org/jira/browse/SPARK-1368)
This PR introduces two major updates:
- Replaced FP style code with `while` loop and reusable `GenericMutableRow` object in critical path of `HiveTableScan`.
- Using `ColumnProjectionUtils` to help optimizing RCFile and ORC column pruning.
My quick micro benchmark suggests these two optimizations made the optimized version 2x and 2.5x faster when scanning CSV table and RCFile table respectively:
```
Original:
[info] CSV: 27676 ms, RCFile: 26415 ms
[info] CSV: 27703 ms, RCFile: 26029 ms
[info] CSV: 27511 ms, RCFile: 25962 ms
Optimized:
[info] CSV: 13820 ms, RCFile: 10402 ms
[info] CSV: 14158 ms, RCFile: 10691 ms
[info] CSV: 13606 ms, RCFile: 10346 ms
```
The micro benchmark loads a 609MB CSV file (structurally similar to the `src` test table) into a normal Hive table with `LazySimpleSerDe` and into an RCFile table, then scans each table.
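The FP-to-imperative rewrite in the critical path can be sketched generically; the `MutableRow` class below is a hypothetical stand-in for Catalyst's `GenericMutableRow`:

```scala
// Sketch of the hot-path rewrite: instead of allocating a fresh row object
// per record (FP style), reuse one mutable buffer and fill it in a while loop.
final class MutableRow(size: Int) {            // hypothetical stand-in
  private val values = new Array[Any](size)
  def update(i: Int, v: Any): Unit = values(i) = v
  def apply(i: Int): Any = values(i)
}

def scan(records: Array[Array[Any]], numFields: Int): Long = {
  val row = new MutableRow(numFields)          // allocated once, reused
  var checksum = 0L
  var r = 0
  while (r < records.length) {
    var c = 0
    while (c < numFields) {
      row(c) = records(r)(c)                   // fill in place, no allocation
      c += 1
    }
    checksum += row(0).asInstanceOf[Int]       // consume the reused row
    r += 1
  }
  checksum
}

println(scan(Array(Array(1, "a"), Array(2, "b")), 2)) // 3
```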
Preparation code:
```scala
package org.apache.spark.examples.sql.hive

import org.apache.spark.sql.hive.LocalHiveContext
import org.apache.spark.{SparkConf, SparkContext}

object HiveTableScanPrepare extends App {
  val sparkContext = new SparkContext(
    new SparkConf()
      .setMaster("local")
      .setAppName(getClass.getSimpleName.stripSuffix("$")))

  val hiveContext = new LocalHiveContext(sparkContext)

  import hiveContext._

  hql("drop table scan_csv")
  hql("drop table scan_rcfile")

  hql("""create table scan_csv (key int, value string)
        | row format serde 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
        | with serdeproperties ('field.delim'=',')
      """.stripMargin)

  hql(s"""load data local inpath "${args(0)}" into table scan_csv""")

  hql("""create table scan_rcfile (key int, value string)
        | row format serde 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe'
        |stored as
        | inputformat 'org.apache.hadoop.hive.ql.io.RCFileInputFormat'
        | outputformat 'org.apache.hadoop.hive.ql.io.RCFileOutputFormat'
      """.stripMargin)

  hql(
    """
      |from scan_csv
      |insert overwrite table scan_rcfile
      |select scan_csv.key, scan_csv.value
    """.stripMargin)
}
```
Benchmark code:
```scala
package org.apache.spark.examples.sql.hive

import org.apache.spark.sql.hive.LocalHiveContext
import org.apache.spark.{SparkConf, SparkContext}

object HiveTableScanBenchmark extends App {
  val sparkContext = new SparkContext(
    new SparkConf()
      .setMaster("local")
      .setAppName(getClass.getSimpleName.stripSuffix("$")))

  val hiveContext = new LocalHiveContext(sparkContext)

  import hiveContext._

  val scanCsv = hql("select key from scan_csv")
  val scanRcfile = hql("select key from scan_rcfile")

  val csvDuration = benchmark(scanCsv.count())
  val rcfileDuration = benchmark(scanRcfile.count())

  println(s"CSV: $csvDuration ms, RCFile: $rcfileDuration ms")

  def benchmark(f: => Unit) = {
    val begin = System.currentTimeMillis()
    f
    val end = System.currentTimeMillis()
    end - begin
  }
}
```
@marmbrus Please help review, thanks!
Author: Cheng Lian <lian.cs.zju@gmail.com>
Closes#758 from liancheng/fastHiveTableScan and squashes the following commits:
4241a19 [Cheng Lian] Distinguishes sorted and possibly not sorted operations more accurately in HiveComparisonTest
cf640d8 [Cheng Lian] More HiveTableScan optimisations:
bf0e7dc [Cheng Lian] Added SortedOperation pattern to match *some* definitely sorted operations and avoid some sorting cost in HiveComparisonTest.
6d1c642 [Cheng Lian] Using ColumnProjectionUtils to optimise RCFile and ORC column pruning
eb62fd3 [Cheng Lian] [SPARK-1368] Optimized HiveTableScan
`ApproxCountDistinctMergeFunction` should return `Int` value because the `dataType` of `ApproxCountDistinct` is `IntegerType`.
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes#893 from ueshin/issues/SPARK-1938 and squashes the following commits:
3970e88 [Takuya UESHIN] Remove a superfluous line.
5ad7ec1 [Takuya UESHIN] Make dataType for each of CountDistinct, ApproxCountDistinctMerge and ApproxCountDistinct LongType.
cbe7c71 [Takuya UESHIN] Revert a change.
fc3ac0f [Takuya UESHIN] Fix evaluated value type of ApproxCountDistinctMergeFunction to Int.
Allow underscores in the column name of a struct field: https://issues.apache.org/jira/browse/SPARK-1922.
Author: LY Lai <ly.lai@vpon.com>
Closes#873 from lyuanlai/master and squashes the following commits:
2253263 [LY Lai] Allow underscore in struct field column name
Average values differ depending on whether the calculation is done partially or not,
because `AverageFunction` (used in the non-partial calculation) counts a row even if the evaluated value is null.
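The fix amounts to counting only rows whose value is present; a sketch over `Option` values, not the actual `AverageFunction` code:

```scala
// Null-aware average sketch: sum and count only the present values,
// matching the corrected behavior described above.
def average(values: Seq[Option[Double]]): Option[Double] = {
  val present = values.flatten        // drop the nulls before counting
  if (present.isEmpty) None
  else Some(present.sum / present.size)
}

println(average(Seq(Some(2.0), None, Some(4.0)))) // Some(3.0), null skipped
// The buggy variant would divide by 3 instead of 2, yielding 2.0.
```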
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes#862 from ueshin/issues/SPARK-1915 and squashes the following commits:
b1ff3c0 [Takuya UESHIN] Modify AverageFunction not to count if the evaluated value is null.
Nullability of `Max`/`Min`/`First` should be `true` because they return `null` if there are no rows.
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes#881 from ueshin/issues/SPARK-1926 and squashes the following commits:
322610f [Takuya UESHIN] Fix nullability of Min/Max/First.
`CountFunction` should count up only if the child's evaluated value is not null.
Because it traverses and evaluates all child expressions, it counts up whenever any of the children is not null, even when the child expression itself evaluates to null.
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes#861 from ueshin/issues/SPARK-1914 and squashes the following commits:
3b37315 [Takuya UESHIN] Merge branch 'master' into issues/SPARK-1914
2afa238 [Takuya UESHIN] Simplify CountFunction not to traverse to evaluate all child expressions.
```scala
rdd.aggregate(Sum('val))
```
is just shorthand for
```scala
rdd.groupBy()(Sum('val))
```
but seems more natural than doing a groupBy with no grouping expressions when you really just want an aggregation over all rows.
Did not add a JavaSchemaRDD or Python API, as these seem to be lacking several other methods like groupBy() already -- leaving that cleanup for future patches.
Author: Aaron Davidson <aaron@databricks.com>
Closes#874 from aarondav/schemardd and squashes the following commits:
e9e68ee [Aaron Davidson] Add comment
db6afe2 [Aaron Davidson] Introduce SchemaRDD#aggregate() for simple aggregations
Minor cleanup following #841.
Author: Reynold Xin <rxin@apache.org>
Closes#868 from rxin/schema-count and squashes the following commits:
5442651 [Reynold Xin] SPARK-1822: Some minor cleanup work on SchemaRDD.count()
Author: Kan Zhang <kzhang@apache.org>
Closes#841 from kanzhang/SPARK-1822 and squashes the following commits:
2f8072a [Kan Zhang] [SPARK-1822] Minor style update
cf4baa4 [Kan Zhang] [SPARK-1822] Adding Scaladoc
e67c910 [Kan Zhang] [SPARK-1822] SchemaRDD.count() should use optimizer
JIRA issue: [SPARK-1913](https://issues.apache.org/jira/browse/SPARK-1913)
When scanning Parquet tables, attributes referenced only in predicates that are pushed down are not passed to the `ParquetTableScan` operator, which causes an exception.
Author: Cheng Lian <lian.cs.zju@gmail.com>
Closes#863 from liancheng/spark-1913 and squashes the following commits:
f976b73 [Cheng Lian] Addessed the readability issue commented by @rxin
f5b257d [Cheng Lian] Added back comments deleted by mistake
ae60ab3 [Cheng Lian] [SPARK-1913] Attributes referenced only in predicates pushed down should remain in ParquetTableScan operator
When tables are equi-joined by multiple keys, `HashJoin` should be used, but `CartesianProduct` followed by `Filter` is used instead.
The join keys are paired by `And` expression so we need to apply `splitConjunctivePredicates` to join condition while finding join keys.
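`splitConjunctivePredicates` simply flattens a tree of `And` nodes into its conjuncts; a toy version with hypothetical `Expr` types:

```scala
// Toy version of splitConjunctivePredicates: recursively flatten nested
// And nodes so each equi-join predicate can be inspected as a join key.
sealed trait Expr
case class And(left: Expr, right: Expr) extends Expr
case class EqualTo(l: String, r: String) extends Expr // hypothetical leaf

def splitConjunctivePredicates(e: Expr): Seq[Expr] = e match {
  case And(l, r) => splitConjunctivePredicates(l) ++ splitConjunctivePredicates(r)
  case other     => Seq(other)
}

val cond = And(EqualTo("a.k1", "b.k1"), EqualTo("a.k2", "b.k2"))
println(splitConjunctivePredicates(cond))
// List(EqualTo(a.k1,b.k1), EqualTo(a.k2,b.k2))
```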
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes#836 from ueshin/issues/SPARK-1889 and squashes the following commits:
fe1c387 [Takuya UESHIN] Apply splitConjunctivePredicates to join condition while finding join keys.
The `lateral_view_outer` query sometimes returns a different set of 10 rows.
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#838 from tdas/hive-test-fix2 and squashes the following commits:
9128a0d [Tathagata Das] Blacklisted flaky HiveCompatibility test.
Author: witgo <witgo@qq.com>
Closes#824 from witgo/SPARK-1875_commons-lang-2.6 and squashes the following commits:
ef7231d [witgo] review commit
ead3c3b [witgo] SPARK-1875:NoClassDefFoundError: StringUtils when building against Hadoop 1
Simple filter predicates such as LessThan, GreaterThan, etc., where one side is a literal and the other a NamedExpression, are now pushed down to the underlying ParquetTableScan. Here are some results from a microbenchmark with a simple schema of six fields of different types, where most records failed the test:
|           | Uncompressed | Compressed |
| --------- | ------------ | ---------- |
| File size | 10 GB        | 2 GB       |
| Speedup   | 2            | 1.8        |
Since mileage may vary, I added a new option to SparkConf:
`org.apache.spark.sql.parquet.filter.pushdown`
The default value is `true`; setting it to `false` disables the pushdown. When most rows are expected to pass the filter, or when there are few fields, performance can be better with pushdown disabled. The default should fit situations with a reasonable number of (possibly nested) fields where not too many records on average pass the filter.
Because of an issue with Parquet ([see here](https://github.com/Parquet/parquet-mr/issues/371)), currently only predicates on non-nullable attributes are pushed down. If one knew that no optional fields of a given table have missing values, one could also allow overriding this.
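Disabling the pushdown might then look roughly like this (a sketch: it assumes the option is set through `SparkConf` under the name given above, and the app name is hypothetical):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: turn off Parquet filter pushdown via the new SparkConf option.
// How the option is read back is up to the ParquetTableScan implementation.
val conf = new SparkConf()
  .setMaster("local")
  .setAppName("ParquetScanNoPushdown") // hypothetical app name
  .set("org.apache.spark.sql.parquet.filter.pushdown", "false")

val sc = new SparkContext(conf)
```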
Author: Andre Schumacher <andre.schumacher@iki.fi>
Closes#511 from AndreSchumacher/parquet_filter and squashes the following commits:
16bfe83 [Andre Schumacher] Removing leftovers from merge during rebase
7b304ca [Andre Schumacher] Fixing formatting
c36d5cb [Andre Schumacher] Scalastyle
3da98db [Andre Schumacher] Second round of review feedback
7a78265 [Andre Schumacher] Fixing broken formatting in ParquetFilter
a86553b [Andre Schumacher] First round of code review feedback
b0f7806 [Andre Schumacher] Optimizing imports in ParquetTestData
85fea2d [Andre Schumacher] Adding SparkConf setting to disable filter predicate pushdown
f0ad3cf [Andre Schumacher] Undoing changes not needed for this PR
210e9cb [Andre Schumacher] Adding disjunctive filter predicates
a93a588 [Andre Schumacher] Adding unit test for filtering
6d22666 [Andre Schumacher] Extending ParquetFilters
93e8192 [Andre Schumacher] First commit Parquet record filtering