ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Reynold Xin	40c4cb2fe7	[SPARK-5579][SQL][DataFrame] Support for project/filter using SQL expressions ```scala df.selectExpr("abs(colA)", "colB") df.filter("age > 21") ``` Author: Reynold Xin <rxin@databricks.com> Closes #4348 from rxin/SPARK-5579 and squashes the following commits: 2baeef2 [Reynold Xin] Fix Python. b416372 [Reynold Xin] [SPARK-5579][SQL][DataFrame] Support for project/filter using SQL expressions.	2015-02-03 22:15:35 -08:00
Reynold Xin	1077f2e1de	[SPARK-5578][SQL][DataFrame] Provide a convenient way for Scala users to use UDFs A more convenient way to define user-defined functions. Author: Reynold Xin <rxin@databricks.com> Closes #4345 from rxin/defineUDF and squashes the following commits: 639c0f8 [Reynold Xin] udf tests. 0a0b339 [Reynold Xin] defineUDF -> udf. b452b8d [Reynold Xin] Fix UDF registration. d2e42c3 [Reynold Xin] SQLContext.udf.register() returns a UserDefinedFunction also. 4333605 [Reynold Xin] [SQL][DataFrame] defineUDF.	2015-02-03 20:07:46 -08:00
Davies Liu	068c0e2ee0	[SPARK-5554] [SQL] [PySpark] add more tests for DataFrame Python API Add more tests and docs for DataFrame Python API, improve test coverage, fix bugs. Author: Davies Liu <davies@databricks.com> Closes #4331 from davies/fix_df and squashes the following commits: dd9919f [Davies Liu] fix tests 467332c [Davies Liu] support string in cast() 83c92fe [Davies Liu] address comments c052f6f [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_df 8dd19a9 [Davies Liu] fix tests in python 2.6 35ccb9f [Davies Liu] fix build 78ebcfa [Davies Liu] add sql_test.py in run_tests 9ab78b4 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_df 6040ba7 [Davies Liu] fix docs 3ab2661 [Davies Liu] add more tests for DataFrame	2015-02-03 16:01:56 -08:00
Daoyuan Wang	db821ed2ed	[SPARK-4508] [SQL] build native date type to conform behavior to Hive The previous #3732 is reverted due to some test failure. Have fixed that. Author: Daoyuan Wang <daoyuan.wang@intel.com> Closes #4325 from adrian-wang/datenative and squashes the following commits: 096e20d [Daoyuan Wang] fix for mixed timezone 0ed0fdc [Daoyuan Wang] fix test data a2fdd4e [Daoyuan Wang] getDate c37832b [Daoyuan Wang] row to catalyst f0005b1 [Daoyuan Wang] add date in sql parser and java type conversion 024c9a6 [Daoyuan Wang] clean some import order d6715fc [Daoyuan Wang] refactoring Date as Primitive Int internally 374abd5 [Daoyuan Wang] spark native date type support	2015-02-03 12:21:45 -08:00
wangfei	5adbb39482	[SPARK-5383][SQL] Support alias for udtfs Add support for alias of udtfs, such as ``` select stack(2, key, value, key, value) as (a, b) from src limit 5; select a, b from (select stack(2, key, value, key, value) as (a, b) from src) t limit 5 ``` Author: wangfei <wangfei1@huawei.com> Author: scwf <wangfei1@huawei.com> Author: Fei Wang <wangfei1@huawei.com> Closes #4186 from scwf/multi-alias-names and squashes the following commits: c35e922 [wangfei] fix conflicts adc8311 [wangfei] minor format fix 2783aed [wangfei] convert it to a Generate instead of leaving it inside of a Project clause a87668a [wangfei] minor improvement b25d9b3 [wangfei] resolve conflicts d38f041 [wangfei] style fix 8cfcebf [wangfei] minor improvement 12a239e [wangfei] fix test case 050177f [wangfei] added extendedCheckRules 3d69329 [wangfei] added CheckMultiAlias to analyzer 324150d [wangfei] added multi alias node 74f5a81 [Fei Wang] imports order fix 5bc3f59 [scwf] style fix 3daec28 [scwf] support alias for udfs with multi output columns	2015-02-03 12:16:31 -08:00
Cheng Hao	ca7a6cdff0	[SPARK-5550] [SQL] Support the case insensitive for UDF SQL in HiveContext should be case insensitive; however, the following query fails. ```scala udf.register("random0", () => { Math.random()}) assert(sql("SELECT RANDOM0() FROM src LIMIT 1").head().getDouble(0) >= 0.0) ``` Author: Cheng Hao <hao.cheng@intel.com> Closes #4326 from chenghao-intel/udf_case_sensitive and squashes the following commits: 485cf66 [Cheng Hao] Support the case insensitive for UDF	2015-02-03 12:12:26 -08:00
Daoyuan Wang	0c20ce69fb	[SPARK-4987] [SQL] parquet timestamp type support Author: Daoyuan Wang <daoyuan.wang@intel.com> Closes #3820 from adrian-wang/parquettimestamp and squashes the following commits: b1e2a0d [Daoyuan Wang] fix for nanos 4dadef1 [Daoyuan Wang] fix wrong read 93f438d [Daoyuan Wang] parquet timestamp support	2015-02-03 12:06:06 -08:00
Reynold Xin	4204a1271d	[SQL] DataFrame API update 1. Added Java-friendly version of the expression operators (i.e. gt, geq) 2. Added JavaDoc for most operators 3. Simplified expression operators by having only one version of the function (that accepts Any). Previously we had two methods for each expression operator, one accepting Any and another accepting Column. 4. agg function now accepts varargs of (String, String). Author: Reynold Xin <rxin@databricks.com> Closes #4332 from rxin/df-update and squashes the following commits: ab0aa69 [Reynold Xin] Added Java friendly expression methods. Added JavaDoc. For each expression operator, have only one version of the function (that accepts Any). Previously we had two methods for each expression operator, one accepting Any and another accepting Column. 576d07a [Reynold Xin] random commit.	2015-02-03 10:34:56 -08:00
Reynold Xin	523a93523d	[SPARK-5551][SQL] Create type alias for SchemaRDD for source backward compatibility Author: Reynold Xin <rxin@databricks.com> Closes #4327 from rxin/schemarddTypeAlias and squashes the following commits: e5a8ff3 [Reynold Xin] [SPARK-5551][SQL] Create type alias for SchemaRDD for source backward compatibility	2015-02-03 00:29:23 -08:00
Reynold Xin	37df330135	[SQL][DataFrame] Remove DataFrameApi, ExpressionApi, and GroupedDataFrameApi They were there mostly for code review and easier check of the API. I don't think they need to be there anymore. Author: Reynold Xin <rxin@databricks.com> Closes #4328 from rxin/remove-df-api and squashes the following commits: 723d600 [Reynold Xin] [SQL][DataFrame] Remove DataFrameApi and ColumnApi.	2015-02-03 00:29:04 -08:00
Yin Huai	13531dd97c	[SPARK-5501][SPARK-5420][SQL] Write support for the data source API This PR aims to support `INSERT INTO/OVERWRITE TABLE tableName` and `CREATE TABLE tableName AS SELECT` for the data source API (partitioned tables are not supported). In this PR, I am also adding support for `IF NOT EXISTS` to our ddl parser. The current semantics of `IF NOT EXISTS` are as follows. * For a `CREATE TEMPORARY TABLE` statement, `IF NOT EXISTS` is not allowed for now. * For a `CREATE TABLE` statement (we are creating a metastore table), if there is an existing table having the same name ... * when `IF NOT EXISTS` clause is used, we will do nothing. * when `IF NOT EXISTS` clause is not used, the user will see an exception saying the table already exists. TODOs: - [x] CTAS support - [x] Programmatic APIs - [ ] Python API (another PR) - [x] More unit tests - [ ] Documents (another PR) marmbrus liancheng rxin Author: Yin Huai <yhuai@databricks.com> Closes #4294 from yhuai/writeSupport and squashes the following commits: 3db1539 [Yin Huai] save does not take overwrite. 1c98881 [Yin Huai] Fix test. 142372a [Yin Huai] Merge remote-tracking branch 'upstream/master' into writeSupport 34e1bfb [Yin Huai] Address comments. 1682ca6 [Yin Huai] Better support for CTAS statements. e789d64 [Yin Huai] For the Scala API, let users to use tuples to provide options. 0128065 [Yin Huai] Short hand versions of save and load. 66ebd74 [Yin Huai] Formatting. 9203ec2 [Yin Huai] Merge remote-tracking branch 'upstream/master' into writeSupport e5d29f2 [Yin Huai] Programmatic APIs. 1a719a5 [Yin Huai] CREATE TEMPORARY TABLE with IF NOT EXISTS is not allowed for now. 909924f [Yin Huai] Add saveAsTable for the data source API to DataFrame. 95a7c71 [Yin Huai] Fix bug when handling IF NOT EXISTS clause in a CREATE TEMPORARY TABLE statement. d37b19c [Yin Huai] Cheng's comments. fd6758c [Yin Huai] Use BeforeAndAfterAll. 7880891 [Yin Huai] Support CREATE TABLE AS SELECT STATEMENT and the IF NOT EXISTS clause. cb85b05 [Yin Huai] Initial write support. 2f91354 [Yin Huai] Make INSERT OVERWRITE/INTO statements consistent between HiveQL and SqlParser.	2015-02-02 23:30:44 -08:00
Tor Myklebust	8f471a66db	[SPARK-5472][SQL] A JDBC data source for Spark SQL. This pull request contains a Spark SQL data source that can pull data from, and can put data into, a JDBC database. I have tested both read and write support with H2, MySQL, and Postgres. It would surprise me if both read and write support worked flawlessly out-of-the-box for any other database; different databases have different names for different JDBC data types and different meanings for SQL types with the same name. However, this code is designed (see `DriverQuirks.scala`) to make it relatively painless to add support for another database by augmenting the type mapping contained in this PR. Author: Tor Myklebust <tmyklebu@gmail.com> Closes #4261 from tmyklebu/master and squashes the following commits: cf167ce [Tor Myklebust] Work around other Java tests ruining TestSQLContext. 67893bf [Tor Myklebust] Move the jdbcRDD methods into SQLContext itself. 585f95b [Tor Myklebust] Dependencies go into the project's pom.xml. 829d5ba [Tor Myklebust] Merge branch 'master' of https://github.com/apache/spark 41647ef [Tor Myklebust] Hide a couple things that don't need to be public. 7318aea [Tor Myklebust] Fix scalastyle warnings. a09eeac [Tor Myklebust] JDBC data source for Spark SQL. 176bb98 [Tor Myklebust] Add test deps for JDBC support.	2015-02-02 19:50:14 -08:00
Reynold Xin	554403fd91	[SQL] Improve DataFrame API error reporting 1. Throw UnsupportedOperationException if a Column is not computable. 2. Perform eager analysis on DataFrame so we can catch errors when they happen (not when an action is run). Author: Reynold Xin <rxin@databricks.com> Author: Davies Liu <davies@databricks.com> Closes #4296 from rxin/col-computability and squashes the following commits: 6527b86 [Reynold Xin] Merge pull request #8 from davies/col-computability fd92bc7 [Reynold Xin] Merge branch 'master' into col-computability f79034c [Davies Liu] fix python tests 5afe1ff [Reynold Xin] Fix scala test. 17f6bae [Reynold Xin] Various fixes. b932e86 [Reynold Xin] Added eager analysis for error reporting. e6f00b8 [Reynold Xin] [SQL][API] ComputableColumn vs IncomputableColumn	2015-02-02 19:01:47 -08:00
Patrick Wendell	eccb9fbb2d	Revert "[SPARK-4508] [SQL] build native date type to conform behavior to Hive" This reverts commit `1646f89d96`.	2015-02-02 17:52:17 -08:00
Reynold Xin	8aa3cfff66	[SPARK-5514] DataFrame.collect should call executeCollect Author: Reynold Xin <rxin@databricks.com> Closes #4313 from rxin/SPARK-5514 and squashes the following commits: e34e91b [Reynold Xin] [SPARK-5514] DataFrame.collect should call executeCollect	2015-02-02 16:55:36 -08:00
seayi	dca6faa29a	[SPARK-5195][sql]Update HiveMetastoreCatalog.scala(override the MetastoreRelation's sameresult method only compare databasename and table name) Override MetastoreRelation's sameResult method to compare only the database name and table name. Previously, after `cache table t1;`, the query `select count(*) from t1;` read data from memory, but `select count(*) from t1 t;` did not and instead read from HDFS. Cached data is keyed by logical plan and compared with sameResult, so the logical plan of a table with an alias is not the same as the logical plan of the same table without the alias. This change modifies sameResult to compare only the database name and table name. Author: seayi <405078363@qq.com> Author: Michael Armbrust <michael@databricks.com> Closes #3898 from seayi/branch-1.2 and squashes the following commits: 8f0c7d2 [seayi] Update CachedTableSuite.scala a277120 [seayi] Update HiveMetastoreCatalog.scala 8d910aa [seayi] Update HiveMetastoreCatalog.scala	2015-02-02 16:18:55 -08:00
Daoyuan Wang	1646f89d96	[SPARK-4508] [SQL] build native date type to conform behavior to Hive Store daysSinceEpoch as an Int value (4 bytes) to represent DateType, instead of using java.sql.Date (8 bytes as Long) in catalyst row. This ensures the same comparison behavior of Hive and Catalyst. Subsumes #3381. I think there are already some tests in JavaSQLSuite, and for Python it will not affect Python's datetime class. Author: Daoyuan Wang <daoyuan.wang@intel.com> Closes #3732 from adrian-wang/datenative and squashes the following commits: 0ed0fdc [Daoyuan Wang] fix test data a2fdd4e [Daoyuan Wang] getDate c37832b [Daoyuan Wang] row to catalyst f0005b1 [Daoyuan Wang] add date in sql parser and java type conversion 024c9a6 [Daoyuan Wang] clean some import order d6715fc [Daoyuan Wang] refactoring Date as Primitive Int internally 374abd5 [Daoyuan Wang] spark native date type support	2015-02-02 15:49:22 -08:00
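
The day-count representation described in SPARK-4508 above can be sketched in plain Scala. This is only an illustration, not the Catalyst code: `DateSketch`, `toDays`, and `fromDays` are hypothetical names, and the conversion assumes the JVM default time zone, as `java.sql.Date.valueOf` does.

```scala
import java.sql.Date
import java.time.LocalDate

// Hypothetical helper, not part of Spark: shows a date held as an Int
// count of days since 1970-01-01 instead of a java.sql.Date.
object DateSketch {
  // Convert a java.sql.Date to days since the epoch (fits in 4 bytes).
  def toDays(d: Date): Int = d.toLocalDate.toEpochDay.toInt

  // Convert the Int representation back to a java.sql.Date.
  def fromDays(days: Int): Date = Date.valueOf(LocalDate.ofEpochDay(days.toLong))
}
```

With this representation, comparing two dates reduces to a plain Int comparison, for example `DateSketch.toDays(a) < DateSketch.toDays(b)`.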
Liang-Chi Hsieh	683e938242	[SPARK-5212][SQL] Add support of schema-less, custom field delimiter and SerDe for HiveQL transform This pr adds the support of schema-less syntax, custom field delimiter and SerDe for HiveQL's transform. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #4014 from viirya/schema_less_trans and squashes the following commits: ac2d1fe [Liang-Chi Hsieh] Refactor codes for comments. a137933 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into schema_less_trans aa10fbd [Liang-Chi Hsieh] Add Hive golden answer files again. 575f695 [Liang-Chi Hsieh] Add Hive golden answer files for new unit tests. a422562 [Liang-Chi Hsieh] Use createQueryTest for unit tests and remove unnecessary imports. ccb71e3 [Liang-Chi Hsieh] Refactor codes for comments. 37bd391 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into schema_less_trans 6000889 [Liang-Chi Hsieh] Wrap input and output schema into ScriptInputOutputSchema. 21727f7 [Liang-Chi Hsieh] Move schema-less output to proper place. Use multilines instead of a long line SQL. 9a6dc04 [Liang-Chi Hsieh] setRecordReaderID is introduced in 0.13.1, use reflection API to call it. 7a14f31 [Liang-Chi Hsieh] Fix bug. 799b5e1 [Liang-Chi Hsieh] Call getSerializedClass instead of using Text. be2c3fc [Liang-Chi Hsieh] Fix style. 32d3046 [Liang-Chi Hsieh] Add SerDe support. ab22f7b [Liang-Chi Hsieh] Fix style. 7a48e42 [Liang-Chi Hsieh] Add support of custom field delimiter. b1729d9 [Liang-Chi Hsieh] Fix style. ccee49e [Liang-Chi Hsieh] Add unit test. f561c37 [Liang-Chi Hsieh] Add support of schema-less script transformation.	2015-02-02 13:53:55 -08:00
Cheng Lian	ec1003219b	[SPARK-5465] [SQL] Fixes filter push-down for Parquet data source Not all Catalyst filter expressions can be converted to Parquet filter predicates. We should try to convert each individual predicate and then collect those convertible ones. Author: Cheng Lian <lian@databricks.com> Closes #4255 from liancheng/spark-5465 and squashes the following commits: 14ccd37 [Cheng Lian] Fixes filter push-down for Parquet data source	2015-02-01 18:52:39 -08:00
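
The approach described in SPARK-5465 above (convert each predicate on its own, then collect the ones that converted) can be sketched as follows. All types and names here (`Expr`, `Eq`, `Gt`, `Udf`, `ParquetPred`, `PushDownSketch`) are hypothetical stand-ins for illustration; they are not Catalyst or Parquet classes, and the real patch may be structured differently.

```scala
// Hypothetical stand-ins for Catalyst expressions and Parquet predicates;
// none of these types are Spark or Parquet classes.
sealed trait Expr
case class Eq(column: String, value: Int) extends Expr
case class Gt(column: String, value: Int) extends Expr
case class Udf(name: String) extends Expr // assumed to have no Parquet equivalent

case class ParquetPred(text: String)

object PushDownSketch {
  // Try to convert one predicate; None means it cannot be pushed down.
  def convert(e: Expr): Option[ParquetPred] = e match {
    case Eq(c, v) => Some(ParquetPred(s"$c = $v"))
    case Gt(c, v) => Some(ParquetPred(s"$c > $v"))
    case _: Udf   => None
  }

  // Convert each predicate individually and keep only the convertible ones.
  def pushable(filters: Seq[Expr]): Seq[ParquetPred] = filters.flatMap(convert)
}
```

Returning `Option` per predicate means one unconvertible filter no longer prevents the others from being pushed down.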
Daoyuan Wang	8cf4a1f02e	[SPARK-5262] [SPARK-5244] [SQL] add coalesce in SQLParser and widen types for parameters of coalesce I'll add test case in #4040 Author: Daoyuan Wang <daoyuan.wang@intel.com> Closes #4057 from adrian-wang/coal and squashes the following commits: 4d0111a [Daoyuan Wang] address Yin's comments c393e18 [Daoyuan Wang] fix rebase conflicts e47c03a [Daoyuan Wang] add coalesce in parser c74828d [Daoyuan Wang] cast types for coalesce	2015-02-01 18:51:38 -08:00
OopsOutOfMemory	1b56f1d6bb	[SPARK-5196][SQL] Support `comment` in Create Table Field DDL Support `comment` in create a table field. __CREATE TEMPORARY TABLE people(name string `comment` "the name of a person")__ Author: OopsOutOfMemory <victorshengli@126.com> Closes #3999 from OopsOutOfMemory/meta_comment and squashes the following commits: 39150d4 [OopsOutOfMemory] add comment and refine test suite	2015-02-01 18:41:58 -08:00
Liang-Chi Hsieh	ef89b82d83	[Minor][SQL] Little refactor DataFrame related codes Simplify some codes related to DataFrame. * Calling `toAttributes` instead of a `map`. * Original `createDataFrame` creates the `StructType` and its attributes in a redundant way. Refactored it to create `StructType` and call `toAttributes` on it directly. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #4298 from viirya/refactor_df and squashes the following commits: 1d61c64 [Liang-Chi Hsieh] Revert it. f36efb5 [Liang-Chi Hsieh] Relax the constraint of toDataFrame. 2c9f370 [Liang-Chi Hsieh] Just refactor DataFrame codes.	2015-02-01 17:52:18 -08:00
kai	f54c9f607b	[SQL] remove redundant field "childOutput" from execution.Aggregate, use child.output instead Author: kai <kaizeng@eecs.berkeley.edu> Closes #4291 from kai-zeng/aggregate-fix and squashes the following commits: 78658ef [kai] remove redundant field "childOutput"	2015-01-30 23:19:10 -08:00
Joseph K. Bradley	e643de42a7	[SPARK-5504] [sql] convertToCatalyst should support nested arrays After the recent refactoring, convertToCatalyst in ScalaReflection does not recurse on Arrays. It should. The test suite modification made the test fail before the fix in ScalaReflection. The fix makes the test suite succeed. CC: marmbrus Author: Joseph K. Bradley <joseph@databricks.com> Closes #4295 from jkbradley/SPARK-5504 and squashes the following commits: 6b7276d [Joseph K. Bradley] Fixed issue in ScalaReflection.convertToCatalyst with Arrays with non-primitive types. Modified test suite so it failed before the fix and works after the fix.	2015-01-30 15:40:14 -08:00
Takuya UESHIN	6f21dce5f4	[SPARK-5457][SQL] Add missing DSL for ApproxCountDistinct. Author: Takuya UESHIN <ueshin@happy-camper.st> Closes #4250 from ueshin/issues/SPARK-5457 and squashes the following commits: 3c05e59 [Takuya UESHIN] Remove parameter to use default value of ApproxCountDistinct. faea19d [Takuya UESHIN] Use overload instead of default value for Java support. d1cca38 [Takuya UESHIN] Merge branch 'master' into issues/SPARK-5457 663d43d [Takuya UESHIN] Add missing DSL for ApproxCountDistinct.	2015-01-30 01:21:35 -08:00
Reynold Xin	80def9deb3	[SQL] Support df("*") to select all columns in a data frame. This PR makes Star a trait, and provides two implementations: UnresolvedStar (used for *, tblName.*) and ResolvedStar (used for df("*")). Author: Reynold Xin <rxin@databricks.com> Closes #4283 from rxin/df-star and squashes the following commits: c9cba3e [Reynold Xin] Removed mapFunction in UnresolvedStar. 1a3a1d7 [Reynold Xin] [SQL] Support df("*") to select all columns in a data frame.	2015-01-29 19:09:08 -08:00
Josh Rosen	22271f9693	[SPARK-5462] [SQL] Use analyzed query plan in DataFrame.apply() This patch changes DataFrame's `apply()` method to use an analyzed query plan when resolving column names. This fixes a bug where `apply` would throw "invalid call to qualifiers on unresolved object" errors when called on DataFrames constructed via `SQLContext.sql()`. Author: Josh Rosen <joshrosen@databricks.com> Closes #4282 from JoshRosen/SPARK-5462 and squashes the following commits: b9e6da2 [Josh Rosen] [SPARK-5462] Use analyzed query plan in DataFrame.apply().	2015-01-29 18:23:05 -08:00
Reynold Xin	ce9c43ba8c	[SQL] DataFrame API improvements 1. Added Dsl.column in case Dsl.col is shadowed. 2. Allow using String to specify the target data type in cast. 3. Support sorting on multiple columns using column names. 4. Added Java API test file. Author: Reynold Xin <rxin@databricks.com> Closes #4280 from rxin/dsl1 and squashes the following commits: 33ecb7a [Reynold Xin] Add the Java test. d06540a [Reynold Xin] [SQL] DataFrame API improvements.	2015-01-29 17:24:00 -08:00
Yin Huai	c00d517d66	[SPARK-4296][SQL] Trims aliases when resolving and checking aggregate expressions I believe that SPARK-4296 has been fixed by `3684fd21e1`. I am adding tests based #3910 (change the udf to HiveUDF instead). Author: Yin Huai <yhuai@databricks.com> Author: Cheng Lian <lian@databricks.com> Closes #4010 from yhuai/SPARK-4296-yin and squashes the following commits: 6343800 [Yin Huai] Merge remote-tracking branch 'upstream/master' into SPARK-4296-yin 6cfadd2 [Yin Huai] Actually, this issue has been fixed by `3684fd21e1`. d42b707 [Yin Huai] Update comment. 8b3a274 [Yin Huai] Since expressions in grouping expressions can have aliases, which can be used by the outer query block, revert this change. 443538d [Cheng Lian] Trims aliases when resolving and checking aggregate expressions	2015-01-29 15:49:34 -08:00
wangfei	c1b3eebf97	[SPARK-5373][SQL] Literal in agg grouping expressions leads to incorrect result `select key, count( * ) from src group by key, 1` will get the wrong answer. e.g. for this table ``` val testData2 = TestSQLContext.sparkContext.parallelize( TestData2(1, 1) :: TestData2(1, 2) :: TestData2(2, 1) :: TestData2(2, 2) :: TestData2(3, 1) :: TestData2(3, 2) :: Nil, 2).toSchemaRDD testData2.registerTempTable("testData2") ``` result of `SELECT a, count(1) FROM testData2 GROUP BY a, 1` is ``` [1,1] [2,2] [3,1] ``` Author: wangfei <wangfei1@huawei.com> Closes #4169 from scwf/agg-bug and squashes the following commits: 05751db [wangfei] fix bugs when literal in agg grouping expressioons	2015-01-29 15:47:18 -08:00
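
For the table in SPARK-5373 above, every value of `a` appears in two rows, so the count for each key should be 2. The sketch below reproduces that expected grouping with plain Scala collections. It does not use Spark, and `GroupByLiteralSketch` and `countByAAndLiteral` are hypothetical names introduced for illustration.

```scala
// Plain Scala collections stand in for the table; no Spark involved.
case class TestData2(a: Int, b: Int)

object GroupByLiteralSketch {
  val testData2: Seq[TestData2] = Seq(
    TestData2(1, 1), TestData2(1, 2),
    TestData2(2, 1), TestData2(2, 2),
    TestData2(3, 1), TestData2(3, 2))

  // GROUP BY a, 1: the literal is the same for every row, so it must not
  // split or merge any group.
  def countByAAndLiteral(rows: Seq[TestData2]): Seq[(Int, Int)] =
    rows.groupBy(r => (r.a, 1))
      .map { case ((a, _), group) => (a, group.size) }
      .toSeq
      .sortBy(_._1)
}
```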
wangfei	fbaf9e0896	[SPARK-5367][SQL] Support star expression in udf Spark SQL currently does not support a star expression in a UDF; running the following SQL with spark-sql produces an error: ``` select concat(*) from src ``` Author: wangfei <wangfei1@huawei.com> Author: scwf <wangfei1@huawei.com> Closes #4163 from scwf/udf-star and squashes the following commits: 9db7b39 [wangfei] addressed comments da1da09 [scwf] minor fix f87b5f9 [scwf] added test case 587bf7e [wangfei] compile fix eb93c16 [wangfei] fix star resolve issue in udf	2015-01-29 15:44:53 -08:00
Yash Datta	de221ea032	[SPARK-4786][SQL]: Parquet filter pushdown for castable types Enable parquet filter pushdown of castable types like short, byte that can be cast to integer Author: Yash Datta <Yash.Datta@guavus.com> Closes #4156 from saucam/filter_short and squashes the following commits: a403979 [Yash Datta] SPARK-4786: Fix styling issues d029866 [Yash Datta] SPARK-4786: Add test case cb2e0d9 [Yash Datta] SPARK-4786: Parquet filter pushdown for castable types	2015-01-29 15:42:23 -08:00
Michael Davies	940f375611	[SPARK-5309][SQL] Add support for dictionaries in PrimitiveConverter for Strings. Parquet Converters allow developers to take advantage of dictionary encoding of column data to reduce Column Binary decoding. The Spark PrimitiveConverter was not using that API and consequently, for String columns that used dictionary compression, it repeated Binary to String conversions for the same String. In measurements this could account for over 25% of entire query time. For example a 500M row table split across 16 blocks was aggregated and summed in a little under 30s before this change and a little under 20s after the change. Author: Michael Davies <Michael.BellDavies@gmail.com> Closes #4187 from MickDavies/SPARK-5309-2 and squashes the following commits: 327287e [Michael Davies] SPARK-5309: Add support for dictionaries in PrimitiveConverter for Strings. 33c002c [Michael Davies] SPARK-5309: Add support for dictionaries in PrimitiveConverter for Strings.	2015-01-29 15:40:59 -08:00
Liang-Chi Hsieh	bce0ba1fbd	[SPARK-5429][SQL] Use javaXML plan serialization for Hive golden answers on Hive 0.13.1 I found that running `HiveComparisonTest.createQueryTest` to generate Hive golden answer files on Hive 0.13.1 would throw KryoException. I am not sure if this can be reproduced by others. Since Hive 0.13.0, Kryo plan serialization is introduced to replace javaXML as default plan serialization format. This is a quick fix to set hive configuration to use javaXML serialization. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #4223 from viirya/fix_hivetest and squashes the following commits: 97a8760 [Liang-Chi Hsieh] Use javaXML plan serialization.	2015-01-29 15:28:22 -08:00
Reynold Xin	715632232d	[SPARK-5445][SQL] Consolidate Java and Scala DSL static methods. Turns out Scala does generate static methods for ones defined in a companion object. Finally no need to separate api.java.dsl and api.scala.dsl. Author: Reynold Xin <rxin@databricks.com> Closes #4276 from rxin/dsl and squashes the following commits: 30aa611 [Reynold Xin] Add all files. 1a9d215 [Reynold Xin] [SPARK-5445][SQL] Consolidate Java and Scala DSL static methods.	2015-01-29 15:13:09 -08:00
Reynold Xin	5ad78f6205	[SQL] Various DataFrame DSL update. 1. Added foreach, foreachPartition, flatMap to DataFrame. 2. Added col() in dsl. 3. Support renaming columns in toDataFrame. 4. Support type inference on arrays (in addition to Seq). 5. Updated mllib to use the new DSL. Author: Reynold Xin <rxin@databricks.com> Closes #4260 from rxin/sql-dsl-update and squashes the following commits: 73466c1 [Reynold Xin] Fixed LogisticRegression. Also added better error message for resolve. fab3ccc [Reynold Xin] Bug fix. d31fcd2 [Reynold Xin] Style fix. 62608c4 [Reynold Xin] [SQL] Various DataFrame DSL update.	2015-01-29 00:01:10 -08:00
Reynold Xin	5b9760de8d	[SPARK-5445][SQL] Made DataFrame dsl usable in Java Also removed the literal implicit transformation since it is pretty scary for API design. Instead, created a new lit method for creating literals. This doesn't break anything from a compatibility perspective because Literal was added two days ago. Author: Reynold Xin <rxin@databricks.com> Closes #4241 from rxin/df-docupdate and squashes the following commits: c0f4810 [Reynold Xin] Fix Python merge conflict. 094c7d7 [Reynold Xin] Minor style fix. Reset Python tests. 3c89f4a [Reynold Xin] Package. dfe6962 [Reynold Xin] Updated Python aggregate. 5dd4265 [Reynold Xin] Made dsl Java callable. 14b3c27 [Reynold Xin] Fix literal expression for symbols. 68b31cb [Reynold Xin] Literal. 4cfeb78 [Reynold Xin] [SPARK-5097][SQL] Address DataFrame code review feedback.	2015-01-28 19:10:32 -08:00
Reynold Xin	c8e934ef3c	[SPARK-5447][SQL] Replaced reference to SchemaRDD with DataFrame. and [SPARK-5448][SQL] Make CacheManager a concrete class and field in SQLContext Author: Reynold Xin <rxin@databricks.com> Closes #4242 from rxin/sqlCleanup and squashes the following commits: e351cb2 [Reynold Xin] Fixed toDataFrame. 6545c42 [Reynold Xin] More changes. 728c017 [Reynold Xin] [SPARK-5447][SQL] Replaced reference to SchemaRDD with DataFrame.	2015-01-28 12:10:01 -08:00
Reynold Xin	d74373225e	[SPARK-5097][SQL] Test cases for DataFrame expressions. Author: Reynold Xin <rxin@databricks.com> Closes #4235 from rxin/df-tests1 and squashes the following commits: f341db6 [Reynold Xin] [SPARK-5097][SQL] Test cases for DataFrame expressions.	2015-01-27 18:10:49 -08:00
Reynold Xin	119f45d61d	[SPARK-5097][SQL] DataFrame This pull request redesigns the existing Spark SQL dsl, which already provides data frame like functionalities. TODOs: With the exception of Python support, other tasks can be done in separate, follow-up PRs. - [ ] Audit of the API - [ ] Documentation - [ ] More test cases to cover the new API - [x] Python support - [ ] Type alias SchemaRDD Author: Reynold Xin <rxin@databricks.com> Author: Davies Liu <davies@databricks.com> Closes #4173 from rxin/df1 and squashes the following commits: 0a1a73b [Reynold Xin] Merge branch 'df1' of github.com:rxin/spark into df1 23b4427 [Reynold Xin] Mima. 828f70d [Reynold Xin] Merge pull request #7 from davies/df 257b9e6 [Davies Liu] add repartition 6bf2b73 [Davies Liu] fix collect with UDT and tests e971078 [Reynold Xin] Missing quotes. b9306b4 [Reynold Xin] Remove removeColumn/updateColumn for now. a728bf2 [Reynold Xin] Example rename. e8aa3d3 [Reynold Xin] groupby -> groupBy. 9662c9e [Davies Liu] improve DataFrame Python API 4ae51ea [Davies Liu] python API for dataframe 1e5e454 [Reynold Xin] Fixed a bug with symbol conversion. 2ca74db [Reynold Xin] Couple minor fixes. ea98ea1 [Reynold Xin] Documentation & literal expressions. 2b22684 [Reynold Xin] Got rid of IntelliJ problems. 02bbfbc [Reynold Xin] Tightening imports. ffbce66 [Reynold Xin] Fixed compilation error. 59b6d8b [Reynold Xin] Style violation. b85edfb [Reynold Xin] ALS. 8c37f0a [Reynold Xin] Made MLlib and examples compile 6d53134 [Reynold Xin] Hive module. d35efd5 [Reynold Xin] Fixed compilation error. ce4a5d2 [Reynold Xin] Fixed test cases in SQL except ParquetIOSuite. 66d5ef1 [Reynold Xin] SQLContext minor patch. c9bcdc0 [Reynold Xin] Checkpoint: SQL module compiles!	2015-01-27 16:08:24 -08:00
Cheng Hao	27bccc5ea9	[SPARK-5202] [SQL] Add hql variable substitution support https://cwiki.apache.org/confluence/display/Hive/LanguageManual+VariableSubstitution This is a blocking issue for CLI users; it affects existing hql scripts from Hive. Author: Cheng Hao <hao.cheng@intel.com> Closes #4003 from chenghao-intel/substitution and squashes the following commits: bb41fd6 [Cheng Hao] revert the removed the implicit conversion af7c31a [Cheng Hao] add hql variable substitution support	2015-01-21 17:34:18 -08:00
Cheng Lian	ba19689fe7	[SQL] [Minor] Remove deprecated parquet tests This PR removes the deprecated `ParquetQuerySuite`, renamed `ParquetQuerySuite2` to `ParquetQuerySuite`, and refactored changes introduced in #4115 to `ParquetFilterSuite` . It is a follow-up of #3644. Notice that test cases in the old `ParquetQuerySuite` have already been well covered by other test suites introduced in #3644. Author: Cheng Lian <lian@databricks.com> Closes #4116 from liancheng/remove-deprecated-parquet-tests and squashes the following commits: f73b8f9 [Cheng Lian] Removes deprecated Parquet test suite	2015-01-21 14:38:10 -08:00
Josh Rosen	b328ac6c8c	Revert "[SPARK-5244] [SQL] add coalesce() in sql parser" This reverts commit `812d3679f5`.	2015-01-21 14:27:43 -08:00
Cheng Hao	8361078efa	[SPARK-5009] [SQL] Long keyword support in SQL Parsers * The `SqlLexical.allCaseVersions` will cause `StackOverflowException` if the key word is too long, the patch will fix that by normalizing all of the keywords in `SqlLexical`. * And make a unified SparkSQLParser for sharing the common code. Author: Cheng Hao <hao.cheng@intel.com> Closes #3926 from chenghao-intel/long_keyword and squashes the following commits: 686660f [Cheng Hao] Support Long Keyword and Refactor the SQLParsers	2015-01-21 13:05:56 -08:00
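
The normalization idea in SPARK-5009 above can be illustrated with a short sketch: instead of enumerating every upper/lower-case spelling of a keyword (2^n strings for an n-letter keyword), each keyword and each token is normalized once before comparison. This is an assumption-laden illustration, not the `SqlLexical` code; `KeywordSketch`, `normalize`, `isKeyword`, the sample keyword set, and the choice of lower-casing are all hypothetical.

```scala
import java.util.Locale

// Hypothetical sketch, not the SqlLexical code: recognise keywords
// case-insensitively by normalising once instead of enumerating
// every upper/lower-case spelling of each keyword.
object KeywordSketch {
  private def normalize(s: String): String = s.toLowerCase(Locale.ROOT)

  val keywords: Set[String] = Set("SELECT", "FROM", "WHERE").map(normalize)

  // A token is a keyword if its normalised form is in the set.
  def isKeyword(token: String): Boolean = keywords.contains(normalize(token))
}
```

The cost of a lookup then depends on the token length only, regardless of how long the keyword is.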
Daoyuan Wang	812d3679f5	[SPARK-5244] [SQL] add coalesce() in sql parser Author: Daoyuan Wang <daoyuan.wang@intel.com> Closes #4040 from adrian-wang/coalesce and squashes the following commits: 0ac8e8f [Daoyuan Wang] add coalesce() in sql parser	2015-01-21 12:59:41 -08:00
Reynold Xin	d181c2a1fc	[SPARK-5323][SQL] Remove Row's Seq inheritance. Author: Reynold Xin <rxin@databricks.com> Closes #4115 from rxin/row-seq and squashes the following commits: e33abd8 [Reynold Xin] Fixed compilation error. cceb650 [Reynold Xin] Python test fixes, and removal of WrapDynamic. `0334a52` [Reynold Xin] mkString. 9cdeb7d [Reynold Xin] Hive tests. 15681c2 [Reynold Xin] Fix more test cases. ea9023a [Reynold Xin] Fixed a catalyst test. c5e2cb5 [Reynold Xin] Minor patch up. b9cab7c [Reynold Xin] [SPARK-5323][SQL] Remove Row's Seq inheritance.	2015-01-20 15:16:14 -08:00
Yin Huai	bc20a52b34	[SPARK-5287][SQL] Add defaultSizeOf to every data type. JIRA: https://issues.apache.org/jira/browse/SPARK-5287 This PR only add `defaultSizeOf` to data types and make those internal type classes `protected[sql]`. I will use another PR to cleanup the type hierarchy of data types. Author: Yin Huai <yhuai@databricks.com> Closes #4081 from yhuai/SPARK-5287 and squashes the following commits: 90cec75 [Yin Huai] Update unit test. e1c600c [Yin Huai] Make internal classes protected[sql]. 7eaba68 [Yin Huai] Add `defaultSize` method to data types. fd425e0 [Yin Huai] Add all native types to NativeType.defaultSizeOf.	2015-01-20 13:26:36 -08:00
Cheng Lian	8140802786	[SQL][Minor] Refactors deeply nested FP style code in BooleanSimplification This is a follow-up of #4090. The original deeply nested `reduceOption` code is hard to grasp. Author: Cheng Lian <lian@databricks.com> Closes #4091 from liancheng/refactor-boolean-simplification and squashes the following commits: cd8860b [Cheng Lian] Improves `compareConditions` to handle more subtle cases 1bf3258 [Cheng Lian] Avoids converting predicate sets to lists e833ca4 [Cheng Lian] Refactors deeply nested FP style code	2015-01-20 11:20:14 -08:00
Reynold Xin	debc031953	[SQL][minor] Add a log4j file for catalyst test. Author: Reynold Xin <rxin@databricks.com> Closes #4117 from rxin/catalyst-test-log4j and squashes the following commits: 8ad610b [Reynold Xin] [SQL][minor] Add a log4j file for catalyst test.	2015-01-20 00:55:25 -08:00
Yin Huai	2604bc35d7	[SPARK-5286][SQL] Fail to drop an invalid table when using the data source API JIRA: https://issues.apache.org/jira/browse/SPARK-5286 Author: Yin Huai <yhuai@databricks.com> Closes #4076 from yhuai/SPARK-5286 and squashes the following commits: 6b69ed1 [Yin Huai] Catch all exception when we try to uncache a query.	2015-01-19 10:45:29 -08:00

785 commits