ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Yijie Shen	6c4fdbec33	[SPARK-8887] [SQL] Explicit define which data types can be used as dynamic partition columns This PR enforce dynamic partition column data type requirements by adding analysis rules. JIRA: https://issues.apache.org/jira/browse/SPARK-8887 Author: Yijie Shen <henry.yijieshen@gmail.com> Closes #8201 from yjshen/dynamic_partition_columns.	2015-08-14 21:03:14 -07:00
Andrew Or	8187b3ae47	[SPARK-9580] [SQL] Replace singletons in SQL tests A fundamental limitation of the existing SQL tests is that there is simply no way to create your own `SparkContext`. This is a serious limitation because the user may wish to use a different master or config. As a case in point, `BroadcastJoinSuite` is entirely commented out because there is no way to make it pass with the existing infrastructure. This patch removes the singletons `TestSQLContext` and `TestData`, and instead introduces a `SharedSQLContext` that starts a context per suite. Unfortunately the singletons were so ingrained in the SQL tests that this patch necessarily needed to touch all the SQL test files. <!-- Reviewable:start --> [<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/8111) <!-- Reviewable:end --> Author: Andrew Or <andrew@databricks.com> Closes #8111 from andrewor14/sql-tests-refactor.	2015-08-13 17:42:01 -07:00
hyukjinkwon	c2520f501a	[SPARK-9935] [SQL] EqualNotNull not processed in ORC https://issues.apache.org/jira/browse/SPARK-9935 Author: hyukjinkwon <gurwls223@gmail.com> Closes #8163 from HyukjinKwon/master.	2015-08-13 16:07:03 -07:00
Cheng Lian	6993031011	[SPARK-9757] [SQL] Fixes persistence of Parquet relation with decimal column PR #7967 enables us to save data source relations to metastore in Hive compatible format when possible. But it fails to persist Parquet relations with decimal column(s) to Hive metastore of versions lower than 1.2.0. This is because `ParquetHiveSerDe` in Hive versions prior to 1.2.0 doesn't support decimal. This PR checks for this case and falls back to Spark SQL specific metastore table format. Author: Yin Huai <yhuai@databricks.com> Author: Cheng Lian <lian@databricks.com> Closes #8130 from liancheng/spark-9757/old-hive-parquet-decimal.	2015-08-13 16:16:50 +08:00
Yin Huai	84a27916a6	[SPARK-9885] [SQL] Also pass barrierPrefixes and sharedPrefixes to IsolatedClientLoader when hiveMetastoreJars is set to maven. https://issues.apache.org/jira/browse/SPARK-9885 cc marmbrus liancheng Author: Yin Huai <yhuai@databricks.com> Closes #8158 from yhuai/classloaderMaven.	2015-08-13 15:08:57 +08:00
Josh Rosen	7b13ed27c1	[SPARK-9870] Disable driver UI and Master REST server in SparkSubmitSuite I think that we should pass additional configuration flags to disable the driver UI and Master REST server in SparkSubmitSuite and HiveSparkSubmitSuite. This might cut down on port-contention-related flakiness in Jenkins. Author: Josh Rosen <joshrosen@databricks.com> Closes #8124 from JoshRosen/disable-ui-in-sparksubmitsuite.	2015-08-12 18:52:11 -07:00
Michael Armbrust	660e6dcff8	[SPARK-9449] [SQL] Include MetastoreRelation's inputFiles Author: Michael Armbrust <michael@databricks.com> Closes #8119 from marmbrus/metastoreInputFiles.	2015-08-12 17:07:29 -07:00
Yin Huai	7035d880a0	[SPARK-9894] [SQL] Json writer should handle MapData. https://issues.apache.org/jira/browse/SPARK-9894 Author: Yin Huai <yhuai@databricks.com> Closes #8137 from yhuai/jsonMapData.	2015-08-12 16:45:15 -07:00
Michel Lemay	ab7e721cfe	[SPARK-9826] [CORE] Fix cannot use custom classes in log4j.properties Refactor Utils class and create ShutdownHookManager. NOTE: Wasn't able to run /dev/run-tests on windows machine. Manual tests were conducted locally using custom log4j.properties file with Redis appender and logstash formatter (bundled in the fat-jar submitted to spark) ex: log4j.rootCategory=WARN,console,redis log4j.appender.console=org.apache.log4j.ConsoleAppender log4j.appender.console.target=System.err log4j.appender.console.layout=org.apache.log4j.PatternLayout log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n log4j.logger.org.eclipse.jetty=WARN log4j.logger.org.eclipse.jetty.util.component.AbstractLifeCycle=ERROR log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO log4j.logger.org.apache.spark.graphx.Pregel=INFO log4j.appender.redis=com.ryantenney.log4j.FailoverRedisAppender log4j.appender.redis.endpoints=hostname:port log4j.appender.redis.key=mykey log4j.appender.redis.alwaysBatch=false log4j.appender.redis.layout=net.logstash.log4j.JSONEventLayoutV1 Author: michellemay <mlemay@gmail.com> Closes #8109 from michellemay/SPARK-9826.	2015-08-12 16:41:35 -07:00
Marcelo Vanzin	57ec27dd77	[SPARK-9804] [HIVE] Use correct value for isSrcLocal parameter. If the correct parameter is not provided, Hive will run into an error because it calls methods that are specific to the local filesystem to copy the data. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #8086 from vanzin/SPARK-9804.	2015-08-12 10:38:30 -07:00
Cheng Lian	3ecb379430	[SPARK-9407] [SQL] Relaxes Parquet ValidTypeMap to allow ENUM predicates to be pushed down This PR adds a hacky workaround for PARQUET-201, and should be removed once we upgrade to parquet-mr 1.8.1 or higher versions. In Parquet, not all types of columns can be used for filter push-down optimization. The set of valid column types is controlled by `ValidTypeMap`. Unfortunately, in parquet-mr 1.7.0 and prior versions, this limitation is too strict, and doesn't allow `BINARY (ENUM)` columns to be pushed down. On the other hand, `BINARY (ENUM)` is commonly seen in Parquet files written by libraries like `parquet-avro`. This restriction is problematic for Spark SQL, because Spark SQL doesn't have a type that maps to Parquet `BINARY (ENUM)` directly, and always converts `BINARY (ENUM)` to Catalyst `StringType`. Thus, a predicate involving a `BINARY (ENUM)` is recognized as one involving a string field instead and can be pushed down by the query optimizer. Such predicates are actually perfectly legal except that it fails the `ValidTypeMap` check. The workaround added here is relaxing `ValidTypeMap` to include `BINARY (ENUM)`. I also took the chance to simplify `ParquetCompatibilityTest` a little bit when adding regression test. Author: Cheng Lian <lian@databricks.com> Closes #8107 from liancheng/spark-9407/parquet-enum-filter-push-down.	2015-08-12 20:01:34 +08:00
Josh Rosen	91e9389f39	[SPARK-9729] [SPARK-9363] [SQL] Use sort merge join for left and right outer join This patch adds a new `SortMergeOuterJoin` operator that performs left and right outer joins using sort merge join. It also refactors `SortMergeJoin` in order to improve performance and code clarity. Along the way, I also performed a couple pieces of minor cleanup and optimization: - Rename the `HashJoin` physical planner rule to `EquiJoinSelection`, since it's also used for non-hash joins. - Rewrite the comment at the top of `HashJoin` to better explain the precedence for choosing join operators. - Update `JoinSuite` to use `SqlTestUtils.withConf` for changing SQLConf settings. This patch incorporates several ideas from adrian-wang's patch, #5717. Closes #5717. <!-- Reviewable:start --> [<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/7904) <!-- Reviewable:end --> Author: Josh Rosen <joshrosen@databricks.com> Author: Daoyuan Wang <daoyuan.wang@intel.com> Closes #7904 from JoshRosen/outer-join-smj and squashes 1 commits.	2015-08-10 22:04:41 -07:00
Reynold Xin	40ed2af587	[SPARK-9763][SQL] Minimize exposure of internal SQL classes. There are a few changes in this pull request: 1. Moved all data sources to execution.datasources, except the public JDBC APIs. 2. In order to maintain backward compatibility from 1, added a backward compatibility translation map in data source resolution. 3. Moved ui and metric package into execution. 4. Added more documentation on some internal classes. 5. Renamed DataSourceRegister.format -> shortName. 6. Added "override" modifier on shortName. 7. Removed IntSQLMetric. Author: Reynold Xin <rxin@databricks.com> Closes #8056 from rxin/SPARK-9763 and squashes the following commits: 9df4801 [Reynold Xin] Removed hardcoded name in test cases. d9babc6 [Reynold Xin] Shorten. e484419 [Reynold Xin] Removed VisibleForTesting. 171b812 [Reynold Xin] MimaExcludes. 2041389 [Reynold Xin] Compile ... 79dda42 [Reynold Xin] Compile. 0818ba3 [Reynold Xin] Removed IntSQLMetric. c46884f [Reynold Xin] Two more fixes. f9aa88d [Reynold Xin] [SPARK-9763][SQL] Minimize exposure of internal SQL classes.	2015-08-10 13:49:23 -07:00
Yijie Shen	3ca995b78f	[SPARK-6212] [SQL] The EXPLAIN output of CTAS only shows the analyzed plan JIRA: https://issues.apache.org/jira/browse/SPARK-6212 Author: Yijie Shen <henry.yijieshen@gmail.com> Closes #7986 from yjshen/ctas_explain and squashes the following commits: bb6fee5 [Yijie Shen] refine test f731041 [Yijie Shen] address comment b2cf8ab [Yijie Shen] bug fix bd7eb20 [Yijie Shen] ctas explain	2015-08-08 21:05:50 -07:00
Joseph Batchik	a3aec918be	[SPARK-9486][SQL] Add data source aliasing for external packages Users currently have to provide the full class name for external data sources, like: `sqlContext.read.format("com.databricks.spark.avro").load(path)` This allows external data source packages to register themselves using a Service Loader so that they can add custom alias like: `sqlContext.read.format("avro").load(path)` This makes it so that using external data source packages uses the same format as the internal data sources like parquet, json, etc. Author: Joseph Batchik <joseph.batchik@cloudera.com> Author: Joseph Batchik <josephbatchik@gmail.com> Closes #7802 from JDrit/service_loader and squashes the following commits: 49a01ec [Joseph Batchik] fixed a couple of format / error bugs e5e93b2 [Joseph Batchik] modified rat file to only excluded added services 72b349a [Joseph Batchik] fixed error with orc data source actually 9f93ea7 [Joseph Batchik] fixed error with orc data source 87b7f1c [Joseph Batchik] fixed typo 101cd22 [Joseph Batchik] removing unneeded changes 8f3cf43 [Joseph Batchik] merged in changes b63d337 [Joseph Batchik] merged in master 95ae030 [Joseph Batchik] changed the new trait to be used as a mixin for data source to register themselves 74db85e [Joseph Batchik] reformatted class loader ac2270d [Joseph Batchik] removing some added test a6926db [Joseph Batchik] added test cases for data source loader 208a2a8 [Joseph Batchik] changes to do error catching if there are multiple data sources 946186e [Joseph Batchik] started working on service loader	2015-08-08 11:03:01 -07:00
Yijie Shen	23695f1d2d	[SPARK-9728][SQL]Support CalendarIntervalType in HiveQL This PR enables converting interval term in HiveQL to CalendarInterval Literal. JIRA: https://issues.apache.org/jira/browse/SPARK-9728 Author: Yijie Shen <henry.yijieshen@gmail.com> Closes #8034 from yjshen/interval_hiveql and squashes the following commits: 7fe9a5e [Yijie Shen] declare throw exception and add unit test fce7795 [Yijie Shen] convert hiveql interval term into CalendarInterval literal	2015-08-08 11:01:25 -07:00
Michael Armbrust	49702bd738	[SPARK-8890] [SQL] Fallback on sorting when writing many dynamic partitions Previously, we would open a new file for each new dynamic written out using `HadoopFsRelation`. For formats like parquet this is very costly due to the buffers required to get good compression. In this PR I refactor the code allowing us to fall back on an external sort when many partitions are seen. As such each task will open no more than `spark.sql.sources.maxFiles` files. I also did the following cleanup: - Instead of keying the file HashMap on an expensive to compute string representation of the partition, we now use a fairly cheap UnsafeProjection that avoids heap allocations. - The control flow for instantiating and invoking a writer container has been simplified. Now instead of switching in two places based on the use of partitioning, the specific writer container must implement a single method `writeRows` that is invoked using `runJob`. - `InternalOutputWriter` has been removed. Instead we have a `private[sql]` method `writeInternal` that converts and calls the public method. This method can be overridden by internal datasources to avoid the conversion. This change remove a lot of code duplication and per-row `asInstanceOf` checks. - `commands.scala` has been split up. Author: Michael Armbrust <michael@databricks.com> Closes #8010 from marmbrus/fsWriting and squashes the following commits: 00804fe [Michael Armbrust] use shuffleMemoryManager.pageSizeBytes 775cc49 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into fsWriting 17b690e [Michael Armbrust] remove comment 40f0372 [Michael Armbrust] address comments f5675bd [Michael Armbrust] char -> string 7e2d0a4 [Michael Armbrust] make sure we close current writer 8100100 [Michael Armbrust] delete empty commands.scala 71cc717 [Michael Armbrust] update comment 8ec75ac [Michael Armbrust] [SPARK-8890][SQL] Fallback on sorting when writing many dynamic partitions	2015-08-07 16:24:50 -07:00
Reynold Xin	05d04e10a8	[SPARK-9733][SQL] Improve physical plan explain for data sources All data sources show up as "PhysicalRDD" in physical plan explain. It'd be better if we can show the name of the data source. Without this patch: ``` == Physical Plan == NewAggregate with UnsafeHybridAggregationIterator ArrayBuffer(date#0, cat#1) ArrayBuffer((sum(CAST((CAST(count#2, IntegerType) + 1), LongType))2,mode=Final,isDistinct=false)) Exchange hashpartitioning(date#0,cat#1) NewAggregate with UnsafeHybridAggregationIterator ArrayBuffer(date#0, cat#1) ArrayBuffer((sum(CAST((CAST(count#2, IntegerType) + 1), LongType))2,mode=Partial,isDistinct=false)) PhysicalRDD [date#0,cat#1,count#2], MapPartitionsRDD[3] at ``` With this patch: ``` == Physical Plan == TungstenAggregate(key=[date#0,cat#1], value=[(sum(CAST((CAST(count#2, IntegerType) + 1), LongType)),mode=Final,isDistinct=false)] Exchange hashpartitioning(date#0,cat#1) TungstenAggregate(key=[date#0,cat#1], value=[(sum(CAST((CAST(count#2, IntegerType) + 1), LongType)),mode=Partial,isDistinct=false)] ConvertToUnsafe Scan ParquetRelation[file:/scratch/rxin/spark/sales4][date#0,cat#1,count#2] ``` Author: Reynold Xin <rxin@databricks.com> Closes #8024 from rxin/SPARK-9733 and squashes the following commits: 811b90e [Reynold Xin] Fixed Python test case. 52cab77 [Reynold Xin] Cast. eea9ccc [Reynold Xin] Fix test case. fcecb22 [Reynold Xin] [SPARK-9733][SQL] Improve explain message for data source scan node.	2015-08-07 13:41:45 -07:00
Reynold Xin	4309262ec9	[SPARK-9700] Pick default page size more intelligently. Previously, we use 64MB as the default page size, which was way too big for a lot of Spark applications (especially for single node). This patch changes it so that the default page size, if unset by the user, is determined by the number of cores available and the total execution memory available. Author: Reynold Xin <rxin@databricks.com> Closes #8012 from rxin/pagesize and squashes the following commits: 16f4756 [Reynold Xin] Fixed failing test. 5afd570 [Reynold Xin] private... 0d5fb98 [Reynold Xin] Update default value. 674a6cd [Reynold Xin] Address review feedback. dc00e05 [Reynold Xin] Merge with master. 73ebdb6 [Reynold Xin] [SPARK-9700] Pick default page size more intelligently.	2015-08-06 23:18:29 -07:00
Cheng Lian	f0cda587fb	[SPARK-7550] [SQL] [MINOR] Fixes logs when persisting DataFrames Author: Cheng Lian <lian@databricks.com> Closes #8021 from liancheng/spark-7550/fix-logs and squashes the following commits: b7bd0ed [Cheng Lian] Fixes logs	2015-08-06 22:49:01 -07:00
Yin Huai	3504bf3aa9	[SPARK-9630] [SQL] Clean up new aggregate operators (SPARK-9240 follow up) This is the followup of https://github.com/apache/spark/pull/7813. It renames `HybridUnsafeAggregationIterator` to `TungstenAggregationIterator` and makes it only work with `UnsafeRow`. Also, I add a `TungstenAggregate` that uses `TungstenAggregationIterator` and make `SortBasedAggregate` (renamed from `SortBasedAggregate`) only works with `SafeRow`. Author: Yin Huai <yhuai@databricks.com> Closes #7954 from yhuai/agg-followUp and squashes the following commits: 4d2f4fc [Yin Huai] Add comments and free map. 0d7ddb9 [Yin Huai] Add TungstenAggregationQueryWithControlledFallbackSuite to test fall back process. 91d69c2 [Yin Huai] Rename UnsafeHybridAggregationIterator to TungstenAggregateIteraotr and make it only work with UnsafeRow.	2015-08-06 15:04:44 -07:00
Christian Kadner	abfedb9cd7	[SPARK-9211] [SQL] [TEST] normalize line separators before generating MD5 hash The golden answer file names for the existing Hive comparison tests were generated using a MD5 hash of the query text which uses Unix-style line separator characters `\n` (LF). This PR ensures that all occurrences of the Windows-style line separator `\r\n` (CR) are replaced with `\n` (LF) before generating the MD5 hash to produce an identical MD5 hash for golden answer file names generated on Windows. Author: Christian Kadner <ckadner@us.ibm.com> Closes #7563 from ckadner/SPARK-9211_working and squashes the following commits: d541db0 [Christian Kadner] [SPARK-9211][SQL] normalize line separators before MD5 hash	2015-08-06 14:15:42 -07:00
Wenchen Fan	1f62f104c7	[SPARK-9632][SQL] update InternalRow.toSeq to make it accept data type info This re-applies #7955, which was reverted due to a race condition to fix build breaking. Author: Wenchen Fan <cloud0fan@outlook.com> Author: Reynold Xin <rxin@databricks.com> Closes #8002 from rxin/InternalRow-toSeq and squashes the following commits: 332416a [Reynold Xin] Merge pull request #7955 from cloud-fan/toSeq 21665e2 [Wenchen Fan] fix hive again... 4addf29 [Wenchen Fan] fix hive bc16c59 [Wenchen Fan] minor fix 33d802c [Wenchen Fan] pass data type info to InternalRow.toSeq 3dd033e [Wenchen Fan] move the default special getters implementation from InternalRow to BaseGenericInternalRow	2015-08-06 13:11:59 -07:00
Davies Liu	2eca46a17a	Revert "[SPARK-9632][SQL] update InternalRow.toSeq to make it accept data type info" This reverts commit `6e009cb9c4`.	2015-08-06 11:15:37 -07:00
Wenchen Fan	6e009cb9c4	[SPARK-9632][SQL] update InternalRow.toSeq to make it accept data type info Author: Wenchen Fan <cloud0fan@outlook.com> Closes #7955 from cloud-fan/toSeq and squashes the following commits: 21665e2 [Wenchen Fan] fix hive again... 4addf29 [Wenchen Fan] fix hive bc16c59 [Wenchen Fan] minor fix 33d802c [Wenchen Fan] pass data type info to InternalRow.toSeq 3dd033e [Wenchen Fan] move the default special getters implementation from InternalRow to BaseGenericInternalRow	2015-08-06 10:40:54 -07:00
Cheng Lian	9f94c85ff3	[SPARK-9593] [SQL] [HOTFIX] Makes the Hadoop shims loading fix more robust This is a follow-up of #7929. We found that Jenkins SBT master build still fails because of the Hadoop shims loading issue. But the failure doesn't appear to be deterministic. My suspect is that Hadoop `VersionInfo` class may fail to inspect Hadoop version, and the shims loading branch is skipped. This PR tries to make the fix more robust: 1. When Hadoop version is available, we load `Hadoop20SShims` for versions <= 2.0.x as srowen suggested in PR #7929. 2. Otherwise, we use `Path.getPathWithoutSchemeAndAuthority` as a probe method, which doesn't exist in Hadoop 1.x or 2.0.x. If this method is not found, `Hadoop20SShims` is also loaded. Author: Cheng Lian <lian@databricks.com> Closes #7994 from liancheng/spark-9593/fix-hadoop-shims and squashes the following commits: e1d3d70 [Cheng Lian] Fixes typo in comments 8d971da [Cheng Lian] Makes the Hadoop shims loading fix more robust	2015-08-06 09:53:53 -07:00
Yin Huai	d5a9af3230	[SPARK-9664] [SQL] Remove UDAFRegistration and add apply to UserDefinedAggregateFunction. https://issues.apache.org/jira/browse/SPARK-9664 Author: Yin Huai <yhuai@databricks.com> Closes #7982 from yhuai/udafRegister and squashes the following commits: 0cc2287 [Yin Huai] Remove UDAFRegistration and add apply to UserDefinedAggregateFunction.	2015-08-05 21:50:35 -07:00
Cheng Hao	119b590538	[SPARK-6923] [SPARK-7550] [SQL] Persists data source relations in Hive compatible format when possible This PR is a fork of PR #5733 authored by chenghao-intel. For committers who's going to merge this PR, please set the author to "Cheng Hao <hao.chengintel.com>". ---- When a data source relation meets the following requirements, we persist it in Hive compatible format, so that other systems like Hive can access it: 1. It's a `HadoopFsRelation` 2. It has only one input path 3. It's non-partitioned 4. It's data source provider can be naturally mapped to a Hive builtin SerDe (e.g. ORC and Parquet) Author: Cheng Lian <lian@databricks.com> Author: Cheng Hao <hao.cheng@intel.com> Closes #7967 from liancheng/spark-6923/refactoring-pr-5733 and squashes the following commits: 5175ee6 [Cheng Lian] Fixes an oudated comment 3870166 [Cheng Lian] Fixes build error and comments 864acee [Cheng Lian] Refactors PR #5733 3490cdc [Cheng Hao] update the scaladoc 6f57669 [Cheng Hao] write schema info to hivemetastore for data source	2015-08-06 11:13:44 +08:00
Michael Armbrust	23d982204b	[SPARK-9141] [SQL] Remove project collapsing from DataFrame API Currently we collapse successive projections that are added by `withColumn`. However, this optimization violates the constraint that adding nodes to a plan will never change its analyzed form and thus breaks caching. Instead of doing early optimization, in this PR I just fix some low-hanging slowness in the analyzer. In particular, I add a mechanism for skipping already analyzed subplans, `resolveOperators` and `resolveExpression`. Since trees are generally immutable after construction, it's safe to annotate a plan as already analyzed as any transformation will create a new tree with this bit no longer set. Together these result in a faster analyzer than before, even with added timing instrumentation. ``` Original Code [info] 3430ms [info] 2205ms [info] 1973ms [info] 1982ms [info] 1916ms Without Project Collapsing in DataFrame [info] 44610ms [info] 45977ms [info] 46423ms [info] 46306ms [info] 54723ms With analyzer optimizations [info] 6394ms [info] 4630ms [info] 4388ms [info] 4093ms [info] 4113ms With resolveOperators [info] 2495ms [info] 1380ms [info] 1685ms [info] 1414ms [info] 1240ms ``` Author: Michael Armbrust <michael@databricks.com> Closes #7920 from marmbrus/withColumnCache and squashes the following commits: 2145031 [Michael Armbrust] fix hive udfs tests 5a5a525 [Michael Armbrust] remove wrong comment 7a507d5 [Michael Armbrust] style b59d710 [Michael Armbrust] revert small change 1fa5949 [Michael Armbrust] move logic into LogicalPlan, add tests 0e2cb43 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into withColumnCache c926e24 [Michael Armbrust] naming e593a2d [Michael Armbrust] style f5a929e [Michael Armbrust] [SPARK-9141][SQL] Remove project collapsing from DataFrame API 38b1c83 [Michael Armbrust] WIP	2015-08-05 09:01:45 -07:00
Cheng Hao	519cf6d3f7	[SPARK-9381] [SQL] Migrate JSON data source to the new partitioning data source Support partitioning for the JSON data source. Still 2 open issues for the `HadoopFsRelation` - `refresh()` will invoke the `discoveryPartition()`, which will auto infer the data type for the partition columns, and maybe conflict with the given partition columns. (TODO enable `HadoopFsRelationSuite.Partition column type casting" - When insert data into a cached HadoopFsRelation based table, we need to invalidate the cache after the insertion (TODO enable `InsertSuite.Caching`) Author: Cheng Hao <hao.cheng@intel.com> Closes #7696 from chenghao-intel/json and squashes the following commits: d90b104 [Cheng Hao] revert the change for JacksonGenerator.apply 307111d [Cheng Hao] fix bug in the unit test 8738c8a [Cheng Hao] fix bug in unit testing 35f2cde [Cheng Hao] support partition for json format	2015-08-05 22:35:55 +08:00
Cheng Lian	70112ff22b	[SPARK-9593] [SQL] Fixes Hadoop shims loading This PR is used to workaround CDH Hadoop versions like 2.0.0-mr1-cdh4.1.1. Internally, Hive `ShimLoader` tries to load different versions of Hadoop shims by checking version information gathered from Hadoop jar files. If the major version number is 1, `Hadoop20SShims` will be loaded. Otherwise, if the major version number is 2, `Hadoop23Shims` will be chosen. However, CDH Hadoop versions like 2.0.0-mr1-cdh4.1.1 have 2 as major version number, but contain Hadoop 1 code. This confuses Hive `ShimLoader` and loads wrong version of shims. In this PR we check for existence of the `Path.getPathWithoutSchemeAndAuthority` method, which doesn't exist in Hadoop 1 (it's also the method that reveals this shims loading issue), and load `Hadoop20SShims` when it doesn't exist. Author: Cheng Lian <lian@databricks.com> Closes #7929 from liancheng/spark-9593/fix-hadoop-shims-loading and squashes the following commits: c99b497 [Cheng Lian] Narrows down the fix to handle "2.0.0-cdh4" Hadoop versions only b17e955 [Cheng Lian] Updates comments 490d8f2 [Cheng Lian] Fixes Scala style issue 9c6c12d [Cheng Lian] Fixes Hadoop shims loading	2015-08-05 20:03:54 +08:00
Davies Liu	781c8d71a0	[SPARK-9119] [SPARK-8359] [SQL] match Decimal.precision/scale with DecimalType Let Decimal carry the correct precision and scale with DecimalType. cc rxin yhuai Author: Davies Liu <davies@databricks.com> Closes #7925 from davies/decimal_scale and squashes the following commits: e19701a [Davies Liu] some tweaks 57d78d2 [Davies Liu] fix tests 5d5bc69 [Davies Liu] match precision and scale with DecimalType	2015-08-04 23:12:49 -07:00
zhichao.li	6f8f0e265a	[SPARK-7119] [SQL] Give script a default serde with the user specific types This is to address this issue that there would be not compatible type exception when running this: `from (from src select transform(key, value) using 'cat' as (thing1 int, thing2 string)) t select thing1 + 2;` 15/04/24 00:58:55 ERROR CliDriver: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.ClassCastException: org.apache.spark.sql.types.UTF8String cannot be cast to java.lang.Integer at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:106) at scala.math.Numeric$IntIsIntegral$.plus(Numeric.scala:57) at org.apache.spark.sql.catalyst.expressions.Add.eval(arithmetic.scala:127) at org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:118) at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:68) at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:52) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:819) at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:819) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1618) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1618) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63) at org.apache.spark.scheduler.Task.run(Task.scala:64) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:209) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:722) chenghao-intel marmbrus Author: zhichao.li <zhichao.li@intel.com> Closes #6638 from zhichao-li/transDataType2 and squashes the following commits: a36cc7c [zhichao.li] style b9252a8 [zhichao.li] delete cacheRow f6968a4 [zhichao.li] give script a default serde	2015-08-04 18:26:05 -07:00
Davies Liu	73dedb589d	[SPARK-8246] [SQL] Implement get_json_object This is based on #7485 , thanks to NathanHowell Tests were copied from Hive, but do not seem to be super comprehensive. I've generally replicated Hive's unusual behavior rather than following a JSONPath reference, except for one case (as noted in the comments). I don't know if there is a way of fully replicating Hive's behavior without a slower TreeNode implementation, so I've erred on the side of performance instead. Author: Davies Liu <davies@databricks.com> Author: Yin Huai <yhuai@databricks.com> Author: Nathan Howell <nhowell@godaddy.com> Closes #7901 from davies/get_json_object and squashes the following commits: 3ace9b9 [Davies Liu] Merge branch 'get_json_object' of github.com:davies/spark into get_json_object 98766fc [Davies Liu] Merge branch 'master' of github.com:apache/spark into get_json_object a7dc6d0 [Davies Liu] Update JsonExpressionsSuite.scala c818519 [Yin Huai] new results. 18ce26b [Davies Liu] fix tests 6ac29fb [Yin Huai] Golden files. 25eebef [Davies Liu] use HiveQuerySuite e0ac6ec [Yin Huai] Golden answer files. 940c060 [Davies Liu] tweat code style 44084c5 [Davies Liu] Merge branch 'master' of github.com:apache/spark into get_json_object 9192d09 [Nathan Howell] Match Hive’s behavior for unwrapping arrays of one element 8dab647 [Nathan Howell] [SPARK-8246] [SQL] Implement get_json_object	2015-08-04 09:07:09 -07:00
Sean Owen	76d74090d6	[SPARK-9534] [BUILD] Enable javac lint for scalac parity; fix a lot of build warnings, 1.5.0 edition Enable most javac lint warnings; fix a lot of build warnings. In a few cases, touch up surrounding code in the process. I'll explain several of the changes inline in comments. Author: Sean Owen <sowen@cloudera.com> Closes #7862 from srowen/SPARK-9534 and squashes the following commits: ea51618 [Sean Owen] Enable most javac lint warnings; fix a lot of build warnings. In a few cases, touch up surrounding code in the process.	2015-08-04 12:02:26 +01:00
Steve Loughran	a2409d1c8e	[SPARK-8064] [SQL] Build against Hive 1.2.1 Cherry picked the parts of the initial SPARK-8064 WiP branch needed to get sql/hive to compile against hive 1.2.1. That's the ASF release packaged under org.apache.hive, not any fork. Tests not run yet: that's what the machines are for Author: Steve Loughran <stevel@hortonworks.com> Author: Cheng Lian <lian@databricks.com> Author: Michael Armbrust <michael@databricks.com> Author: Patrick Wendell <patrick@databricks.com> Closes #7191 from steveloughran/stevel/feature/SPARK-8064-hive-1.2-002 and squashes the following commits: 7556d85 [Cheng Lian] Updates .q files and corresponding golden files ef4af62 [Steve Loughran] Merge commit '6a92bb09f46a04d6cd8c41bdba3ecb727ebb9030' into stevel/feature/SPARK-8064-hive-1.2-002 6a92bb0 [Cheng Lian] Overrides HiveConf time vars dcbb391 [Cheng Lian] Adds com.twitter:parquet-hadoop-bundle:1.6.0 for Hive Parquet SerDe 0bbe475 [Steve Loughran] SPARK-8064 scalastyle rejects the standard Hadoop ASF license header... fdf759b [Steve Loughran] SPARK-8064 classpath dependency suite to be in sync with shading in final (?) hive-exec spark 7a6c727 [Steve Loughran] SPARK-8064 switch to second staging repo of the spark-hive artifacts. This one has the protobuf-shaded hive-exec jar 376c003 [Steve Loughran] SPARK-8064 purge duplicate protobuf declaration 2c74697 [Steve Loughran] SPARK-8064 switch to the protobuf shaded hive-exec jar with tests to chase it down cc44020 [Steve Loughran] SPARK-8064 remove hadoop.version from runtest.py, as profile will fix that automatically. 6901fa9 [Steve Loughran] SPARK-8064 explicit protobuf import da310dc [Michael Armbrust] Fixes for Hive tests. a775a75 [Steve Loughran] SPARK-8064 cherry-pick-incomplete 7404f34 [Patrick Wendell] Add spark-hive staging repo 832c164 [Steve Loughran] SPARK-8064 try to supress compiler warnings on Complex.java pasted-thrift-code 312c0d4 [Steve Loughran] SPARK-8064 maven/ivy dependency purge; calcite declaration needed fa5ae7b [Steve Loughran] HIVE-8064 fix up hive-thriftserver dependencies and cut back on evicted references in the hive- packages; this keeps mvn and ivy resolution compatible, as the reconciliation policy is "by hand" c188048 [Steve Loughran] SPARK-8064 manage the Hive depencencies to that -things that aren't needed are excluded -sql/hive built with ivy is in sync with the maven reconciliation policy, rather than latest-first 4c8be8d [Cheng Lian] WIP: Partial fix for Thrift server and CLI tests 314eb3c [Steve Loughran] SPARK-8064 deprecation warning noise in one of the tests 17b0341 [Steve Loughran] SPARK-8064 IDE-hinted cleanups of Complex.java to reduce compiler warnings. It's all autogenerated code, so still ugly. d029b92 [Steve Loughran] SPARK-8064 rely on unescaping to have already taken place, so go straight to map of serde options 23eca7e [Steve Loughran] HIVE-8064 handle raw and escaped property tokens 54d9b06 [Steve Loughran] SPARK-8064 fix compilation regression surfacing from rebase 0b12d5f [Steve Loughran] HIVE-8064 use subset of hive complex type whose types deserialize fce73b6 [Steve Loughran] SPARK-8064 poms rely implicitly on the version of kryo chill provides fd3aa5d [Steve Loughran] SPARK-8064 version of hive to d/l from ivy is 1.2.1 dc73ece [Steve Loughran] SPARK-8064 revert to master's determinstic pushdown strategy d3c1e4a [Steve Loughran] SPARK-8064 purge UnionType 051cc21 [Steve Loughran] SPARK-8064 switch to an unshaded version of hive-exec-core, which must have been built with Kryo 2.21. This currently looks for a (locally built) version 1.2.1.spark 6684c60 [Steve Loughran] SPARK-8064 ignore RTE raised in blocking process.exitValue() call e6121e5 [Steve Loughran] SPARK-8064 address review comments aa43dc6 [Steve Loughran] SPARK-8064 more robust teardown on JavaMetastoreDatasourcesSuite f2bff01 [Steve Loughran] SPARK-8064 better takeup of asynchronously caught error text 8b1ef38 [Steve Loughran] SPARK-8064: on failures executing spark-submit in HiveSparkSubmitSuite, print command line and all logged output. 5a9ce6b [Steve Loughran] SPARK-8064 add explicit reason for kv split failure, rather than array OOB. does not address the issue 642b63a [Steve Loughran] SPARK-8064 reinstate something cut briefly during rebasing 97194dc [Steve Loughran] SPARK-8064 add extra logging to the YarnClusterSuite classpath test. There should be no reason why this is failing on jenkins, but as it is (and presumably its CP-related), improve the logging including any exception raised. 335357f [Steve Loughran] SPARK-8064 fail fast on thrive process spawning tests on exit codes and/or error string patterns seen in log. 3ed872f [Steve Loughran] SPARK-8064 rename field double to dbl bca55e5 [Steve Loughran] SPARK-8064 missed one of the `date` escapes 41d6479 [Steve Loughran] SPARK-8064 wrap tests with withTable() calls to avoid table-exists exceptions 2bc29a4 [Steve Loughran] SPARK-8064 ParquetSuites to escape `date` field name 1ab9bc4 [Steve Loughran] SPARK-8064 TestHive to use sered2.thrift.test.Complex bf3a249 [Steve Loughran] SPARK-8064: more resubmit than fix; tighten startup timeout to 60s. Still no obvious reason why jersey server code in spark-assembly isn't being picked up -it hasn't been shaded c829b8f [Steve Loughran] SPARK-8064: reinstate yarn-rm-server dependencies to hive-exec to ensure that jersey server is on classpath on hadoop versions < 2.6 0b0f738 [Steve Loughran] SPARK-8064: thrift server startup to fail fast on any exception in the main thread 13abaf1 [Steve Loughran] SPARK-8064 Hive compatibilty tests sin sync with explain/show output from Hive 1.2.1 d14d5ea [Steve Loughran] SPARK-8064: DATE is now a predicate; you can't use it as a field in select ops 26eef1c [Steve Loughran] SPARK-8064: HIVE-9039 renamed TOK_UNION => TOK_UNIONALL while adding TOK_UNIONDISTINCT 3d64523 [Steve Loughran] SPARK-8064 improve diagns on uknown token; fix scalastyle failure d0360f6 [Steve Loughran] SPARK-8064: delicate merge in of the branch vanzin/hive-1.1 1126e5a [Steve Loughran] SPARK-8064: name of unrecognized file format wasn't appearing in error text 8cb09c4 [Steve Loughran] SPARK-8064: test resilience/assertion improvements. Independent of the rest of the work; can be backported to earlier versions dec12cb [Steve Loughran] SPARK-8064: when a CLI suite test fails include the full output text in the raised exception; this ensures that the stdout/stderr is included in jenkins reports, so it becomes possible to diagnose the cause. 463a670 [Steve Loughran] SPARK-8064 run-tests.py adds a hadoop-2.6 profile, and changes info messages to say "w/Hive 1.2.1" in console output 2531099 [Steve Loughran] SPARK-8064 successful attempt to get rid of pentaho as a transitive dependency of hive-exec 1d59100 [Steve Loughran] SPARK-8064 (unsuccessful) attempt to get rid of pentaho as a transitive dependency of hive-exec 75733fc [Steve Loughran] SPARK-8064 change thrift binary startup message to "Starting ThriftBinaryCLIService on port" 3ebc279 [Steve Loughran] SPARK-8064 move strings used to check for http/bin thrift services up into constants c80979d [Steve Loughran] SPARK-8064: SparkSQLCLIDriver drops remote mode support. CLISuite Tests pass instead of timing out: undetected regression? 27e8370 [Steve Loughran] SPARK-8064 fix some style & IDE warnings 00e50d6 [Steve Loughran] SPARK-8064 stop excluding hive shims from dependency (commented out , for now) cb4f142 [Steve Loughran] SPARK-8054 cut pentaho dependency from calcite f7aa9cb [Steve Loughran] SPARK-8064 everything compiles with some commenting and moving of classes into a hive package 6c310b4 [Steve Loughran] SPARK-8064 subclass Hive ServerOptionsProcessor to make it public again f61a675 [Steve Loughran] SPARK-8064 thrift server switched to Hive 1.2.1, though it doesn't compile everywhere 4890b9d [Steve Loughran] SPARK-8064, build against Hive 1.2.1	2015-08-03 15:24:42 -07:00
Yin Huai	1ebd41b141	[SPARK-9240] [SQL] Hybrid aggregate operator using unsafe row This PR adds a base aggregation iterator `AggregationIterator`, which is used to create `SortBasedAggregationIterator` (for sort-based aggregation) and `UnsafeHybridAggregationIterator` (first it tries hash-based aggregation and falls back to the sort-based aggregation (using external sorter) if we cannot allocate memory for the map). With these two iterators, we will not need existing iterators and I am removing those. Also, we can use a single physical `Aggregate` operator and it internally determines what iterators to used. https://issues.apache.org/jira/browse/SPARK-9240 Author: Yin Huai <yhuai@databricks.com> Closes #7813 from yhuai/AggregateOperator and squashes the following commits: e317e2b [Yin Huai] Remove unnecessary change. 74d93c5 [Yin Huai] Merge remote-tracking branch 'upstream/master' into AggregateOperator ba6afbc [Yin Huai] Add a little bit more comments. c9cf3b6 [Yin Huai] update 0f1b06f [Yin Huai] Remove unnecessary code. 21fd15f [Yin Huai] Remove unnecessary change. 964f88b [Yin Huai] Implement fallback strategy. b1ea5cf [Yin Huai] wip 7fcbd87 [Yin Huai] Add a flag to control what iterator to use. 533d5b2 [Yin Huai] Prepare for fallback! 33b7022 [Yin Huai] wip bd9282b [Yin Huai] UDAFs now supports UnsafeRow. f52ee53 [Yin Huai] wip 3171f44 [Yin Huai] wip d2c45a0 [Yin Huai] wip f60cc83 [Yin Huai] Also check input schema. af32210 [Yin Huai] Check iter.hasNext before we create an iterator because the constructor of the iterato will read at least one row from a non-empty input iter. 299008c [Yin Huai] First round cleanup. 3915bac [Yin Huai] Create a base iterator class for aggregation iterators and add the initial version of the hybrid iterator.	2015-08-03 00:23:08 -07:00
Liang-Chi Hsieh	0722f43316	[SPARK-7937][SQL] Support comparison on StructType This brings #6519 up-to-date with master branch. Closes #6519. Author: Liang-Chi Hsieh <viirya@appier.com> Author: Liang-Chi Hsieh <viirya@gmail.com> Author: Reynold Xin <rxin@databricks.com> Closes #7877 from rxin/sort-struct and squashes the following commits: 4968231 [Reynold Xin] Minor fixes. 2537813 [Reynold Xin] Merge branch 'compare_named_struct' of github.com:viirya/spark-1 into sort-struct d2ba8ad [Liang-Chi Hsieh] Remove unused import. 3a3f40e [Liang-Chi Hsieh] Don't need to add compare to InternalRow because we can use RowOrdering. dae6aad [Liang-Chi Hsieh] Fix nested struct. d5349c7 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into compare_named_struct 43d4354 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into compare_named_struct 1f66196 [Liang-Chi Hsieh] Reuse RowOrdering and GenerateOrdering. f8b2e9c [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into compare_named_struct 1187a65 [Liang-Chi Hsieh] Fix scala style. 9d67f68 [Liang-Chi Hsieh] Fix wrongly merging. 8f4d775 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into compare_named_struct 94b27d5 [Liang-Chi Hsieh] Remove test for error on complex type comparison. 2071693 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into compare_named_struct 3c142e4 [Liang-Chi Hsieh] Fix scala style. cf58dc3 [Liang-Chi Hsieh] Use checkAnswer. f651b8d [Liang-Chi Hsieh] Remove Either and move orderings to BinaryComparison to reuse it. b6e1009 [Liang-Chi Hsieh] Fix scala style. 3922b54 [Liang-Chi Hsieh] Support ordering on named_struct.	2015-08-02 17:53:44 -07:00
Davies Liu	c1b0cbd762	[SPARK-8185] [SPARK-8188] [SPARK-8191] [SQL] function datediff, to_utc_timestamp, from_utc_timestamp This PR is based on #7643 , thanks to adrian-wang Author: Davies Liu <davies@databricks.com> Author: Daoyuan Wang <daoyuan.wang@intel.com> Closes #7847 from davies/datediff and squashes the following commits: 74333d7 [Davies Liu] fix bug 22d8a8c [Davies Liu] optimize 85cdd21 [Davies Liu] remove unnecessary tests 241d90c [Davies Liu] Merge branch 'master' of github.com:apache/spark into datediff e9dc0f5 [Davies Liu] fix datediff/to_utc_timestamp/from_utc_timestamp c360447 [Daoyuan Wang] function datediff, to_utc_timestamp, from_utc_timestamp (commits merged)	2015-08-01 21:46:46 -07:00
Wenchen Fan	1d59a4162b	[SPARK-9480][SQL] add MapData and cleanup internal row stuff This PR adds a `MapData` as internal representation of map type in Spark SQL, and provides a default implementation with just 2 `ArrayData`. After that, we have specialized getters for all internal type, so I removed generic getter in `ArrayData` and added specialized `toArray` for it. Also did some refactor and cleanup for `InternalRow` and its subclasses. Author: Wenchen Fan <cloud0fan@outlook.com> Closes #7799 from cloud-fan/map-data and squashes the following commits: 77d482f [Wenchen Fan] fix python e8f6682 [Wenchen Fan] skip MapData equality check in HiveInspectorSuite 40cc9db [Wenchen Fan] add toString 6e06ec9 [Wenchen Fan] some more cleanup a90aca1 [Wenchen Fan] add MapData	2015-08-01 00:17:15 -07:00
Yin Huai	3fc0cb9200	[SPARK-9233] [SQL] Enable code-gen in window function unit tests Since code-gen is enabled by default, it is better to run window function tests with code-gen. https://issues.apache.org/jira/browse/SPARK-9233 Author: Yin Huai <yhuai@databricks.com> Closes #7832 from yhuai/SPARK-9233 and squashes the following commits: 4e4e4cc [Yin Huai] style ca80e07 [Yin Huai] Test window function with codegen.	2015-07-31 14:13:06 -07:00
Herman van Hovell	39ab199a3f	[SPARK-8640] [SQL] Enable Processing of Multiple Window Frames in a Single Window Operator This PR enables the processing of multiple window frames in a single window operator. This should improve the performance of processing multiple window expressions wich share partition by/order by clauses, because it will be more efficient with respect to memory use and group processing. Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #7515 from hvanhovell/SPARK-8640 and squashes the following commits: f0e1c21 [Herman van Hovell] Changed Window Logical/Physical plans to use partition by/order by specs directly instead of using WindowSpec. e1711c2 [Herman van Hovell] Enabled the processing of multiple window frames in a single Window operator.	2015-07-31 12:08:25 -07:00
WangTaoTheTonic	a3a85d73da	[SPARK-9496][SQL]do not print the password in config https://issues.apache.org/jira/browse/SPARK-9496 We better do not print the password in log. Author: WangTaoTheTonic <wangtao111@huawei.com> Closes #7815 from WangTaoTheTonic/master and squashes the following commits: c7a5145 [WangTaoTheTonic] do not print the password in config	2015-07-30 23:50:06 -07:00
Wenchen Fan	c0cc0eaec6	[SPARK-9390][SQL] create a wrapper for array type Author: Wenchen Fan <cloud0fan@outlook.com> Closes #7724 from cloud-fan/array-data and squashes the following commits: d0408a1 [Wenchen Fan] fix python 661e608 [Wenchen Fan] rebase f39256c [Wenchen Fan] fix hive... 6dbfa6f [Wenchen Fan] fix hive again... 8cb8842 [Wenchen Fan] remove element type parameter from getArray 43e9816 [Wenchen Fan] fix mllib e719afc [Wenchen Fan] fix hive 4346290 [Wenchen Fan] address comment d4a38da [Wenchen Fan] remove sizeInBytes and add license 7e283e2 [Wenchen Fan] create a wrapper for array type	2015-07-30 10:04:30 -07:00
Joseph Batchik	1221849f91	[SPARK-8005][SQL] Input file name Users can now get the file name of the partition being read in. A thread local variable is in `SQLNewHadoopRDD` and is set when the partition is computed. `SQLNewHadoopRDD` is moved to core so that the catalyst package can reach it. This supports: `df.select(inputFileName())` and `sqlContext.sql("select input_file_name() from table")` Author: Joseph Batchik <josephbatchik@gmail.com> Closes #7743 from JDrit/input_file_name and squashes the following commits: abb8609 [Joseph Batchik] fixed failing test and changed the default value to be an empty string d2f323d [Joseph Batchik] updates per review 102061f [Joseph Batchik] updates per review 75313f5 [Joseph Batchik] small fixes c7f7b5a [Joseph Batchik] addeding input file name to Spark SQL	2015-07-29 23:35:55 -07:00
Josh Rosen	1b0099fc62	[SPARK-9411] [SQL] Make Tungsten page sizes configurable We need to make page sizes configurable so we can reduce them in unit tests and increase them in real production workloads. These sizes are now controlled by a new configuration, `spark.buffer.pageSize`. The new default is 64 megabytes. Author: Josh Rosen <joshrosen@databricks.com> Closes #7741 from JoshRosen/SPARK-9411 and squashes the following commits: a43c4db [Josh Rosen] Fix pow 2c0eefc [Josh Rosen] Fix MAXIMUM_PAGE_SIZE_BYTES comment + value bccfb51 [Josh Rosen] Lower page size to 4MB in TestHive ba54d4b [Josh Rosen] Make UnsafeExternalSorter's page size configurable 0045aa2 [Josh Rosen] Make UnsafeShuffle's page size configurable bc734f0 [Josh Rosen] Rename configuration e614858 [Josh Rosen] Makes BytesToBytesMap page size configurable	2015-07-29 16:00:30 -07:00
Reynold Xin	6662ee2124	[SPARK-9418][SQL] Use sort-merge join as the default shuffle join. Sort-merge join is more robust in Spark since sorting can be made using the Tungsten sort operator. Author: Reynold Xin <rxin@databricks.com> Closes #7733 from rxin/smj and squashes the following commits: 61e4d34 [Reynold Xin] Fixed test case. 5ffd731 [Reynold Xin] Fixed JoinSuite. a137dc0 [Reynold Xin] [SPARK-9418][SQL] Use sort-merge join as the default shuffle join.	2015-07-28 17:42:35 -07:00
Reynold Xin	b7f54119f8	[SPARK-9420][SQL] Move expressions in sql/core package to catalyst. Since catalyst package already depends on Spark core, we can move those expressions into catalyst, and simplify function registry. This is a followup of #7478. Author: Reynold Xin <rxin@databricks.com> Closes #7735 from rxin/SPARK-8003 and squashes the following commits: 2ffbdc3 [Reynold Xin] [SPARK-8003][SQL] Move expressions in sql/core package to catalyst.	2015-07-28 17:03:59 -07:00
Josh Rosen	59b92add7c	[SPARK-9393] [SQL] Fix several error-handling bugs in ScriptTransform operator SparkSQL's ScriptTransform operator has several serious bugs which make debugging fairly difficult: - If exceptions are thrown in the writing thread then the child process will not be killed, leading to a deadlock because the reader thread will block while waiting for input that will never arrive. - TaskContext is not propagated to the writer thread, which may cause errors in upstream pipelined operators. - Exceptions which occur in the writer thread are not propagated to the main reader thread, which may cause upstream errors to be silently ignored instead of killing the job. This can lead to silently incorrect query results. - The writer thread is not a daemon thread, but it should be. In addition, the code in this file is extremely messy: - Lots of fields are nullable but the nullability isn't clearly explained. - Many confusing variable names: for instance, there are variables named `ite` and `iterator` that are defined in the same scope. - Some code was misindented. - The `*serdeClass` variables are actually expected to be single-quoted strings, which is really confusing: I feel that this parsing / extraction should be performed in the analyzer, not in the operator itself. - There were no unit tests for the operator itself, only end-to-end tests. This pull request addresses these issues, borrowing some error-handling techniques from PySpark's PythonRDD. Author: Josh Rosen <joshrosen@databricks.com> Closes #7710 from JoshRosen/script-transform and squashes the following commits: 16c44e2 [Josh Rosen] Update some comments 983f200 [Josh Rosen] Use unescapeSQLString instead of stripQuotes 6a06a8c [Josh Rosen] Clean up handling of quotes in serde class name 494cde0 [Josh Rosen] Propagate TaskContext to writer thread 323bb2b [Josh Rosen] Fix error-swallowing bug b31258d [Josh Rosen] Rename iterator variables to disambiguate. 88278de [Josh Rosen] Split ScriptTransformation writer thread into own class. 8b162b6 [Josh Rosen] Add failing test which demonstrates exception masking issue 4ee36a2 [Josh Rosen] Kill script transform subprocess when error occurs in input writer. bd4c948 [Josh Rosen] Skip launching of external command for empty partitions. b43e4ec [Josh Rosen] Clean up nullability in ScriptTransformation fa18d26 [Josh Rosen] Add basic unit test for script transform with 'cat' command.	2015-07-28 16:04:48 -07:00
Joseph Batchik	b88b868eb3	[SPARK-8003][SQL] Added virtual column support to Spark Added virtual column support by adding a new resolution role to the query analyzer. Additional virtual columns can be added by adding case expressions to [the new rule](https://github.com/JDrit/spark/blob/virt_columns/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L1026) and my modifying the [logical plan](https://github.com/JDrit/spark/blob/virt_columns/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala#L216) to resolve them. This also solves [SPARK-8003](https://issues.apache.org/jira/browse/SPARK-8003) This allows you to perform queries such as: ```sql select spark__partition__id, count(*) as c from table group by spark__partition__id; ``` Author: Joseph Batchik <josephbatchik@gmail.com> Author: JD <jd@csh.rit.edu> Closes #7478 from JDrit/virt_columns and squashes the following commits: 7932bf0 [Joseph Batchik] adding spark__partition__id to hive as well f8a9c6c [Joseph Batchik] merging in master e49da48 [JD] fixes for @rxin's suggestions 60e120b [JD] fixing test in merge 4bf8554 [JD] merging in master c68bc0f [Joseph Batchik] Adding function register ability to SQLContext and adding a function for spark__partition__id()	2015-07-28 14:39:25 -07:00
Aaron Davidson	35ef853b3f	[SPARK-9397] DataFrame should provide an API to find source data files if applicable Certain applications would benefit from being able to inspect DataFrames that are straightforwardly produced by data sources that stem from files, and find out their source data. For example, one might want to display to a user the size of the data underlying a table, or to copy or mutate it. This PR exposes an `inputFiles` method on DataFrame which attempts to discover the source data in a best-effort manner, by inspecting HadoopFsRelations and JSONRelations. Author: Aaron Davidson <aaron@databricks.com> Closes #7717 from aarondav/paths and squashes the following commits: ff67430 [Aaron Davidson] inputFiles 0acd3ad [Aaron Davidson] [SPARK-9397] DataFrame should provide an API to find source data files if applicable	2015-07-28 10:12:09 -07:00
Yijie Shen	63a492b931	[SPARK-8828] [SQL] Revert SPARK-5680 JIRA: https://issues.apache.org/jira/browse/SPARK-8828 Author: Yijie Shen <henry.yijieshen@gmail.com> Closes #7667 from yjshen/revert_combinesum_2 and squashes the following commits: c37ccb1 [Yijie Shen] add test case 8377214 [Yijie Shen] revert spark.sql.useAggregate2 to its default value e2305ac [Yijie Shen] fix bug - avg on decimal column 7cb0e95 [Yijie Shen] [wip] resolving bugs 1fadb5a [Yijie Shen] remove occurance 17c6248 [Yijie Shen] revert SPARK-5680	2015-07-27 22:47:33 -07:00
Michael Armbrust	ce89ff477a	[SPARK-9386] [SQL] Feature flag for metastore partition pruning Since we have been seeing a lot of failures related to this new feature, lets put it behind a flag and turn it off by default. Author: Michael Armbrust <michael@databricks.com> Closes #7703 from marmbrus/optionalMetastorePruning and squashes the following commits: 6ad128c [Michael Armbrust] style 8447835 [Michael Armbrust] [SPARK-9386][SQL] Feature flag for metastore partition pruning fd37b87 [Michael Armbrust] add config flag	2015-07-27 17:32:34 -07:00
Cheng Lian	8e7d2bee23	[SPARK-9378] [SQL] Fixes test case "CTAS with serde" This is a proper version of PR #7693 authored by viirya The reason why "CTAS with serde" fails is that the `MetastoreRelation` gets converted to a Parquet data source relation by default. Author: Cheng Lian <lian@databricks.com> Closes #7700 from liancheng/spark-9378-fix-ctas-test and squashes the following commits: 4413af0 [Cheng Lian] Fixes test case "CTAS with serde"	2015-07-27 13:28:03 -07:00
Yin Huai	55946e76fd	[SPARK-9349] [SQL] UDAF cleanup https://issues.apache.org/jira/browse/SPARK-9349 With this PR, we only expose `UserDefinedAggregateFunction` (an abstract class) and `MutableAggregationBuffer` (an interface). Other internal wrappers and helper classes are moved to `org.apache.spark.sql.execution.aggregate` and marked as `private[sql]`. Author: Yin Huai <yhuai@databricks.com> Closes #7687 from yhuai/UDAF-cleanup and squashes the following commits: db36542 [Yin Huai] Add comments to UDAF examples. ae17f66 [Yin Huai] Address comments. 9c9fa5f [Yin Huai] UDAF cleanup.	2015-07-27 13:26:57 -07:00
Cheng Lian	72981bc8f0	[SPARK-7943] [SPARK-8105] [SPARK-8435] [SPARK-8714] [SPARK-8561] Fixes multi-database support This PR fixes a set of issues related to multi-database. A new data structure `TableIdentifier` is introduced to identify a table among multiple databases. We should stop using a single `String` (table name without database name), or `Seq[String]` (optional database name plus table name) to identify tables internally. Author: Cheng Lian <lian@databricks.com> Closes #7623 from liancheng/spark-8131-multi-db and squashes the following commits: f3bcd4b [Cheng Lian] Addresses PR comments e0eb76a [Cheng Lian] Fixes styling issues 41e2207 [Cheng Lian] Fixes multi-database support d4d1ec2 [Cheng Lian] Adds multi-database test cases	2015-07-27 17:15:35 +08:00
Wenchen Fan	4ffd3a1db5	[SPARK-9371][SQL] fix the support for special chars in column names for hive context Author: Wenchen Fan <cloud0fan@outlook.com> Closes #7684 from cloud-fan/hive and squashes the following commits: da21ffe [Wenchen Fan] fix the support for special chars in column names for hive context	2015-07-26 23:58:03 -07:00
Cheng Hao	1efe97dc9e	[SPARK-8867][SQL] Support list / describe function usage As Hive does, we need to list all of the registered UDF and its usage for user. We add the annotation to describe a UDF, so we can get the literal description info while registering the UDF. e.g. ```scala ExpressionDescription( usage = "_FUNC_(expr) - Returns the absolute value of the numeric value", extended = """> SELECT _FUNC_('-1') 1""") case class Abs(child: Expression) extends UnaryArithmetic { ... ``` Author: Cheng Hao <hao.cheng@intel.com> Closes #7259 from chenghao-intel/desc_function and squashes the following commits: cf29bba [Cheng Hao] fixing the code style issue 5193855 [Cheng Hao] Add more powerful parser for show functions c645a6b [Cheng Hao] fix bug in unit test 78d40f1 [Cheng Hao] update the padding issue for usage 48ee4b3 [Cheng Hao] update as feedback 70eb4e9 [Cheng Hao] add show/describe function support	2015-07-26 18:34:19 -07:00
Cheng Lian	c025c3d0a1	[SPARK-9095] [SQL] Removes the old Parquet support This PR removes the old Parquet support: - Removes the old `ParquetRelation` together with related SQL configuration, plan nodes, strategies, utility classes, and test suites. - Renames `ParquetRelation2` to `ParquetRelation` - Renames `RowReadSupport` and `RowRecordMaterializer` to `CatalystReadSupport` and `CatalystRecordMaterializer` respectively, and moved them to separate files. This follows naming convention used in other Parquet data models implemented in parquet-mr. It should be easier for developers who are familiar with Parquet to follow. There's still some other code that can be cleaned up. Especially `RowWriteSupport`. But I'd like to leave this part to SPARK-8848. Author: Cheng Lian <lian@databricks.com> Closes #7441 from liancheng/spark-9095 and squashes the following commits: c7b6e38 [Cheng Lian] Removes WriteToFile 2d688d6 [Cheng Lian] Renames ParquetRelation2 to ParquetRelation ca9e1b7 [Cheng Lian] Removes old Parquet support	2015-07-26 16:49:19 -07:00
Reynold Xin	6c400b4f39	[SPARK-9354][SQL] Remove InternalRow.get generic getter call in Hive integration code. Replaced them with get(ordinal, datatype) so we can use UnsafeRow here. I passed the data types throughout. Author: Reynold Xin <rxin@databricks.com> Closes #7669 from rxin/row-generic-getter-hive and squashes the following commits: 3467d8e [Reynold Xin] [SPARK-9354][SQL] Remove Internal.get generic getter call in Hive integration code.	2015-07-26 10:27:39 -07:00
Reynold Xin	4a01bfc2a2	[SPARK-9350][SQL] Introduce an InternalRow generic getter that requires a DataType Currently UnsafeRow cannot support a generic getter. However, if the data type is known, we can support a generic getter. Author: Reynold Xin <rxin@databricks.com> Closes #7666 from rxin/generic-getter-with-datatype and squashes the following commits: ee2874c [Reynold Xin] Add a default implementation for getStruct. 1e109a0 [Reynold Xin] [SPARK-9350][SQL] Introduce an InternalRow generic getter that requires a DataType. 033ee88 [Reynold Xin] Removed getAs in non test code.	2015-07-25 23:52:37 -07:00
Reynold Xin	b1f4b4abfd	[SPARK-9348][SQL] Remove apply method on InternalRow. Author: Reynold Xin <rxin@databricks.com> Closes #7665 from rxin/remove-row-apply and squashes the following commits: 0b43001 [Reynold Xin] support getString in UnsafeRow. 176d633 [Reynold Xin] apply -> get. 2941324 [Reynold Xin] [SPARK-9348][SQL] Remove apply method on InternalRow.	2015-07-25 18:41:51 -07:00
Cheng Lian	e2ec018e37	[SPARK-9285] [SQL] Fixes Row/InternalRow conversion for HadoopFsRelation This is a follow-up of #7626. It fixes `Row`/`InternalRow` conversion for data sources extending `HadoopFsRelation` with `needConversion` being `true`. Author: Cheng Lian <lian@databricks.com> Closes #7649 from liancheng/spark-9285-conversion-fix and squashes the following commits: 036a50c [Cheng Lian] Addresses PR comment f6d7c6a [Cheng Lian] Fixes Row/InternalRow conversion for HadoopFsRelation	2015-07-25 11:42:49 -07:00
Reynold Xin	431ca39be5	[SPARK-9285][SQL] Remove InternalRow's inheritance from Row. I also changed InternalRow's size/length function to numFields, to make it more obvious that it is not about bytes, but the number of fields. Author: Reynold Xin <rxin@databricks.com> Closes #7626 from rxin/internalRow and squashes the following commits: e124daf [Reynold Xin] Fixed test case. 805ceb7 [Reynold Xin] Commented out the failed test suite. f8a9ca5 [Reynold Xin] Fixed more bugs. Still at least one more remaining. 76d9081 [Reynold Xin] Fixed data sources. 7807f70 [Reynold Xin] Fixed DataFrameSuite. cb60cd2 [Reynold Xin] Code review & small bug fixes. 0a2948b [Reynold Xin] Fixed style. 3280d03 [Reynold Xin] [SPARK-9285][SQL] Remove InternalRow's inheritance from Row.	2015-07-24 09:37:36 -07:00
Davies Liu	8a94eb23d5	[SPARK-9069] [SPARK-9264] [SQL] remove unlimited precision support for DecimalType Romove Decimal.Unlimited (change to support precision up to 38, to match with Hive and other databases). In order to keep backward source compatibility, Decimal.Unlimited is still there, but change to Decimal(38, 18). If no precision and scale is provide, it's Decimal(10, 0) as before. Author: Davies Liu <davies@databricks.com> Closes #7605 from davies/decimal_unlimited and squashes the following commits: aa3f115 [Davies Liu] fix tests and style fb0d20d [Davies Liu] address comments bfaae35 [Davies Liu] fix style df93657 [Davies Liu] address comments and clean up 06727fd [Davies Liu] Merge branch 'master' of github.com:apache/spark into decimal_unlimited 4c28969 [Davies Liu] fix tests 8d783cc [Davies Liu] fix tests 788631c [Davies Liu] fix double with decimal in Union/except 1779bde [Davies Liu] fix scala style c9c7c78 [Davies Liu] remove Decimal.Unlimited	2015-07-23 18:31:13 -07:00
Reynold Xin	d71a13f475	[SPARK-9262][build] Treat Scala compiler warnings as errors I've seen a few cases in the past few weeks that the compiler is throwing warnings that are caused by legitimate bugs. This patch upgrades warnings to errors, except deprecation warnings. Note that ideally we should be able to mark deprecation warnings as errors as well. However, due to the lack of ability to suppress individual warning messages in the Scala compiler, we cannot do that (since we do need to access deprecated APIs in Hadoop). Most of the work are done by ericl. Author: Reynold Xin <rxin@databricks.com> Author: Eric Liang <ekl@databricks.com> Closes #7598 from rxin/warnings and squashes the following commits: beb311b [Reynold Xin] Fixed tests. 542c031 [Reynold Xin] Fixed one more warning. 87c354a [Reynold Xin] Fixed all non-deprecation warnings. 78660ac [Eric Liang] first effort to fix warnings	2015-07-22 21:02:19 -07:00
Matei Zaharia	fe26584a1f	[SPARK-9244] Increase some memory defaults There are a few memory limits that people hit often and that we could make higher, especially now that memory sizes have grown. - spark.akka.frameSize: This defaults at 10 but is often hit for map output statuses in large shuffles. This memory is not fully allocated up-front, so we can just make this larger and still not affect jobs that never sent a status that large. We increase it to 128. - spark.executor.memory: Defaults at 512m, which is really small. We increase it to 1g. Author: Matei Zaharia <matei@databricks.com> Closes #7586 from mateiz/configs and squashes the following commits: ce0038a [Matei Zaharia] [SPARK-9244] Increase some memory defaults	2015-07-22 15:28:09 -07:00
Yin Huai	c03299a18b	[SPARK-4233] [SPARK-4367] [SPARK-3947] [SPARK-3056] [SQL] Aggregation Improvement This is the first PR for the aggregation improvement, which is tracked by https://issues.apache.org/jira/browse/SPARK-4366 (umbrella JIRA). This PR contains work for its subtasks, SPARK-3056, SPARK-3947, SPARK-4233, and SPARK-4367. This PR introduces a new code path for evaluating aggregate functions. This code path is guarded by `spark.sql.useAggregate2` and by default the value of this flag is true. This new code path contains: * A new aggregate function interface (`AggregateFunction2`) and 7 built-int aggregate functions based on this new interface (`AVG`, `COUNT`, `FIRST`, `LAST`, `MAX`, `MIN`, `SUM`) * A UDAF interface (`UserDefinedAggregateFunction`) based on the new code path and two example UDAFs (`MyDoubleAvg` and `MyDoubleSum`). * A sort-based aggregate operator (`Aggregate2Sort`) for the new aggregate function interface . * A sort-based aggregate operator (`FinalAndCompleteAggregate2Sort`) for distinct aggregations (for distinct aggregations the query plan will use `Aggregate2Sort` and `FinalAndCompleteAggregate2Sort` together). With this change, `spark.sql.useAggregate2` is `true`, the flow of compiling an aggregation query is: 1. Our analyzer looks up functions and returns aggregate functions built based on the old aggregate function interface. 2. When our planner is compiling the physical plan, it tries try to convert all aggregate functions to the ones built based on the new interface. The planner will fallback to the old code path if any of the following two conditions is true: * code-gen is disabled. * there is any function that cannot be converted (right now, Hive UDAFs). * the schema of grouping expressions contain any complex data type. * There are multiple distinct columns. Right now, the new code path handles a single distinct column in the query (you can have multiple aggregate functions using that distinct column). For a query having a aggregate function with DISTINCT and regular aggregate functions, the generated plan will do partial aggregations for those regular aggregate function. Thanks chenghao-intel for his initial work on it. Author: Yin Huai <yhuai@databricks.com> Author: Michael Armbrust <michael@databricks.com> Closes #7458 from yhuai/UDAF and squashes the following commits: 7865f5e [Yin Huai] Put the catalyst expression in the comment of the generated code for it. b04d6c8 [Yin Huai] Remove unnecessary change. f1d5901 [Yin Huai] Merge remote-tracking branch 'upstream/master' into UDAF 35b0520 [Yin Huai] Use semanticEquals to replace grouping expressions in the output of the aggregate operator. 3b43b24 [Yin Huai] bug fix. 00eb298 [Yin Huai] Make it compile. a3ca551 [Yin Huai] Merge remote-tracking branch 'upstream/master' into UDAF e0afca3 [Yin Huai] Gracefully fallback to old aggregation code path. 8a8ac4a [Yin Huai] Merge remote-tracking branch 'upstream/master' into UDAF 88c7d4d [Yin Huai] Enable spark.sql.useAggregate2 by default for testing purpose. dc96fd1 [Yin Huai] Many updates: 85c9c4b [Yin Huai] newline. 43de3de [Yin Huai] Merge remote-tracking branch 'upstream/master' into UDAF c3614d7 [Yin Huai] Handle single distinct column. 68b8ee9 [Yin Huai] Support single distinct column set. WIP 3013579 [Yin Huai] Format. d678aee [Yin Huai] Remove AggregateExpressionSuite.scala since our built-in aggregate functions will be based on AlgebraicAggregate and we need to have another way to test it. e243ca6 [Yin Huai] Add aggregation iterators. a101960 [Yin Huai] Change MyJavaUDAF to MyDoubleSum. 594cdf5 [Yin Huai] Change existing AggregateExpression to AggregateExpression1 and add an AggregateExpression as the common interface for both AggregateExpression1 and AggregateExpression2. 380880f [Yin Huai] Merge remote-tracking branch 'upstream/master' into UDAF 0a827b3 [Yin Huai] Add comments and doc. Move some classes to the right places. a19fea6 [Yin Huai] Add UDAF interface. 262d4c4 [Yin Huai] Make it compile. b2e358e [Yin Huai] Merge remote-tracking branch 'upstream/master' into UDAF 6edb5ac [Yin Huai] Format update. 70b169c [Yin Huai] Remove groupOrdering. 4721936 [Yin Huai] Add CheckAggregateFunction to extendedCheckRules. d821a34 [Yin Huai] Cleanup. 32aea9c [Yin Huai] Merge remote-tracking branch 'upstream/master' into UDAF 5b46d41 [Yin Huai] Bug fix. aff9534 [Yin Huai] Make Aggregate2Sort work with both algebraic AggregateFunctions and non-algebraic AggregateFunctions. 2857b55 [Yin Huai] Merge remote-tracking branch 'upstream/master' into UDAF 4435f20 [Yin Huai] Add ConvertAggregateFunction to HiveContext's analyzer. 1b490ed [Michael Armbrust] make hive test 8cfa6a9 [Michael Armbrust] add test 1b0bb3f [Yin Huai] Do not bind references in AlgebraicAggregate and use code gen for all places. 072209f [Yin Huai] Bug fix: Handle expressions in grouping columns that are not attribute references. f7d9e54 [Michael Armbrust] Merge remote-tracking branch 'apache/master' into UDAF 39ee975 [Yin Huai] Code cleanup: Remove unnecesary AttributeReferences. b7720ba [Yin Huai] Add an analysis rule to convert aggregate function to the new version. 5c00f3f [Michael Armbrust] First draft of codegen 6bbc6ba [Michael Armbrust] now with correct answers\! f7996d0 [Michael Armbrust] Add AlgebraicAggregate dded1c5 [Yin Huai] wip	2015-07-21 23:26:11 -07:00
Dennis Huo	c07838b5a9	[SPARK-9206] [SQL] Fix HiveContext classloading for GCS connector. IsolatedClientLoader.isSharedClass includes all of com.google.\, presumably for Guava, protobuf, and/or other shared Google libraries, but needs to count com.google.cloud.\ as "hive classes" when determining which ClassLoader to use. Otherwise, things like HiveContext.parquetFile will throw a ClassCastException when fs.defaultFS is set to a Google Cloud Storage (gs://) path. On StackOverflow: http://stackoverflow.com/questions/31478955 EDIT: Adding yhuai who worked on the relevant classloading isolation pieces. Author: Dennis Huo <dhuo@google.com> Closes #7549 from dennishuo/dhuo-fix-hivecontext-gcs and squashes the following commits: 1f8db07 [Dennis Huo] Fix HiveContext classloading for GCS connector.	2015-07-21 13:12:11 -07:00
Reynold Xin	60c0ce134d	[SPARK-8906][SQL] Move all internal data source classes into execution.datasources. This way, the sources package contains only public facing interfaces. Author: Reynold Xin <rxin@databricks.com> Closes #7565 from rxin/move-ds and squashes the following commits: 7661aff [Reynold Xin] Mima 9d5196a [Reynold Xin] Rearranged imports. 3dd7174 [Reynold Xin] [SPARK-8906][SQL] Move all internal data source classes into execution.datasources.	2015-07-21 11:56:38 -07:00
Cheng Lian	d38c5029a2	[SPARK-9100] [SQL] Adds DataFrame reader/writer shortcut methods for ORC This PR adds DataFrame reader/writer shortcut methods for ORC in both Scala and Python. Author: Cheng Lian <lian@databricks.com> Closes #7444 from liancheng/spark-9100 and squashes the following commits: 284d043 [Cheng Lian] Fixes PySpark test cases and addresses PR comments e0b09fb [Cheng Lian] Adds DataFrame reader/writer shortcut methods for ORC	2015-07-21 15:08:44 +08:00
Cheng Lian	dde0e12f32	[SPARK-6910] [SQL] Support for pushing predicates down to metastore for partition pruning This PR forks PR #7421 authored by piaozhexiu and adds [a workaround] [1] for fixing the occasional test failures occurred in PR #7421. Please refer to these [two] [2] [comments] [3] for details. [1]: `536ac41a7e` [2]: https://github.com/apache/spark/pull/7421#issuecomment-122527391 [3]: https://github.com/apache/spark/pull/7421#issuecomment-122528059 Author: Cheolsoo Park <cheolsoop@netflix.com> Author: Cheng Lian <lian@databricks.com> Author: Michael Armbrust <michael@databricks.com> Closes #7492 from liancheng/pr-7421-workaround and squashes the following commits: 5599cc4 [Cheolsoo Park] Predicate pushdown to hive metastore 536ac41 [Cheng Lian] Sets hive.metastore.integral.jdo.pushdown to true to workaround test failures caused by in #7421	2015-07-20 15:12:14 -07:00
Reynold Xin	163e3f1df9	[SPARK-8241][SQL] string function: concat_ws. I also changed the semantics of concat w.r.t. null back to the same behavior as Hive. That is to say, concat now returns null if any input is null. Author: Reynold Xin <rxin@databricks.com> Closes #7504 from rxin/concat_ws and squashes the following commits: 83fd950 [Reynold Xin] Fixed type casting. 3ae85f7 [Reynold Xin] Write null better. cdc7be6 [Reynold Xin] Added code generation for pure string mode. a61c4e4 [Reynold Xin] Updated comments. 2d51406 [Reynold Xin] [SPARK-8241][SQL] string function: concat_ws.	2015-07-19 16:48:47 -07:00
Herman van Hovell	7a81245345	[SPARK-8638] [SQL] Window Function Performance Improvements - Cleanup This PR contains a few clean-ups that are a part of SPARK-8638: a few style issues got fixed, and a few tests were moved. Git commit message is wrong BTW :(... Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #7513 from hvanhovell/SPARK-8638-cleanup and squashes the following commits: 4e69d08 [Herman van Hovell] Fixed Perfomance Regression for Shrinking Window Frames (+Rebase)	2015-07-19 16:29:50 -07:00
Herman van Hovell	a9a0d0cebf	[SPARK-8638] [SQL] Window Function Performance Improvements ## Description Performance improvements for Spark Window functions. This PR will also serve as the basis for moving away from Hive UDAFs to Spark UDAFs. See JIRA tickets SPARK-8638 and SPARK-7712 for more information. ## Improvements * Much better performance (10x) in running cases (e.g. BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) and UNBOUDED FOLLOWING cases. The current implementation in spark uses a sliding window approach in these cases. This means that an aggregate is maintained for every row, so space usage is N (N being the number of rows). This also means that all these aggregates all need to be updated separately, this takes N(N-1)/2 updates. The running case differs from the Sliding case because we are only adding data to an aggregate function (no reset is required), we only need to maintain one aggregate (like in the UNBOUNDED PRECEDING AND UNBOUNDED case), update the aggregate for each row, and get the aggregate value after each update. This is what the new implementation does. This approach only uses 1 buffer, and only requires N updates; I am currently working on data with window sizes of 500-1000 doing running sums and this saves a lot of time. The CURRENT ROW AND UNBOUNDED FOLLOWING case also uses this approach and the fact that aggregate operations are communitative, there is one twist though it will process the input buffer in reverse. Fewer comparisons in the sliding case. The current implementation determines frame boundaries for every input row. The new implementation makes more use of the fact that the window is sorted, maintains the boundaries, and only moves them when the current row order changes. This is a minor improvement. * A single Window node is able to process all types of Frames for the same Partitioning/Ordering. This saves a little time/memory spent buffering and managing partitions. This will be enabled in a follow-up PR. * A lot of the staging code is moved from the execution phase to the initialization phase. Minor performance improvement, and improves readability of the execution code. ## Benchmarking I have done a small benchmark using [on time performance](http://www.transtats.bts.gov) data of the month april. I have used the origin as a partioning key, as a result there is quite some variation in window sizes. The code for the benchmark can be found in the JIRA ticket. These are the results per Frame type: Frame \| Master \| SPARK-8638 ----- \| ------ \| ---------- Entire Frame \| 2 s \| 1 s Sliding \| 18 s \| 1 s Growing \| 14 s \| 0.9 s Shrinking \| 13 s \| 1 s Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #7057 from hvanhovell/SPARK-8638 and squashes the following commits: 3bfdc49 [Herman van Hovell] Fixed Perfomance Regression for Shrinking Window Frames (+Rebase) 2eb3b33 [Herman van Hovell] Corrected reverse range frame processing. 2cd2d5b [Herman van Hovell] Corrected reverse range frame processing. b0654d7 [Herman van Hovell] Tests for exotic frame specifications. e75b76e [Herman van Hovell] More docs, added support for reverse sliding range frames, and some reorganization of code. 1fdb558 [Herman van Hovell] Changed Data In HiveDataFrameWindowSuite. ac2f682 [Herman van Hovell] Added a few more comments. 1938312 [Herman van Hovell] Added Documentation to the createBoundOrdering methods. bb020e6 [Herman van Hovell] Major overhaul of Window operator.	2015-07-18 23:44:38 -07:00
Tarek Auel	83b682beec	[SPARK-8199][SPARK-8184][SPARK-8183][SPARK-8182][SPARK-8181][SPARK-8180][SPARK-8179][SPARK-8177][SPARK-8178][SPARK-9115][SQL] date functions Jira: https://issues.apache.org/jira/browse/SPARK-8199 https://issues.apache.org/jira/browse/SPARK-8184 https://issues.apache.org/jira/browse/SPARK-8183 https://issues.apache.org/jira/browse/SPARK-8182 https://issues.apache.org/jira/browse/SPARK-8181 https://issues.apache.org/jira/browse/SPARK-8180 https://issues.apache.org/jira/browse/SPARK-8179 https://issues.apache.org/jira/browse/SPARK-8177 https://issues.apache.org/jira/browse/SPARK-8179 https://issues.apache.org/jira/browse/SPARK-9115 Regarding `day`and `dayofmonth` are both necessary? ~~I am going to add `Quarter` to this PR as well.~~ Done. ~~As soon as the Scala coding is reviewed and discussed, I'll add the python api.~~ Done Author: Tarek Auel <tarek.auel@googlemail.com> Author: Tarek Auel <tarek.auel@gmail.com> Closes #6981 from tarekauel/SPARK-8199 and squashes the following commits: f7b4c8c [Tarek Auel] [SPARK-8199] fixed bug in tests bb567b6 [Tarek Auel] [SPARK-8199] fixed test 3e095ba [Tarek Auel] [SPARK-8199] style and timezone fix 256c357 [Tarek Auel] [SPARK-8199] code cleanup 5983dcc [Tarek Auel] [SPARK-8199] whitespace fix 6e0c78f [Tarek Auel] [SPARK-8199] removed setTimeZone in tests, according to cloud-fans comment in #7488 4afc09c [Tarek Auel] [SPARK-8199] concise leap year handling ea6c110 [Tarek Auel] [SPARK-8199] fix after merging master 70238e0 [Tarek Auel] Merge branch 'master' into SPARK-8199 3c6ae2e [Tarek Auel] [SPARK-8199] removed binary search fb98ba0 [Tarek Auel] [SPARK-8199] python docstring fix cdfae27 [Tarek Auel] [SPARK-8199] cleanup & python docstring fix 746b80a [Tarek Auel] [SPARK-8199] build fix 0ad6db8 [Tarek Auel] [SPARK-8199] minor fix 523542d [Tarek Auel] [SPARK-8199] address comments 2259299 [Tarek Auel] [SPARK-8199] day_of_month alias d01b977 [Tarek Auel] [SPARK-8199] python underscore 56c4a92 [Tarek Auel] [SPARK-8199] update python docu e223bc0 [Tarek Auel] [SPARK-8199] refactoring d6aa14e [Tarek Auel] [SPARK-8199] fixed Hive compatibility b382267 [Tarek Auel] [SPARK-8199] fixed bug in day calculation; removed set TimeZone in HiveCompatibilitySuite for test purposes; removed Hive tests for second and minute, because we can cast '2015-03-18' to a timestamp and extract a minute/second from it 1b2e540 [Tarek Auel] [SPARK-8119] style fix 0852655 [Tarek Auel] [SPARK-8119] changed from ExpectsInputTypes to implicit casts ec87c69 [Tarek Auel] [SPARK-8119] bug fixing and refactoring 1358cdc [Tarek Auel] Merge remote-tracking branch 'origin/master' into SPARK-8199 740af0e [Tarek Auel] implement date function using a calculation based on days 4fb66da [Tarek Auel] WIP: date functions on calculation only 1a436c9 [Tarek Auel] wip f775f39 [Tarek Auel] fixed return type ad17e96 [Tarek Auel] improved implementation c42b444 [Tarek Auel] Removed merge conflict file ccb723c [Tarek Auel] [SPARK-8199] style and fixed merge issues 10e4ad1 [Tarek Auel] Merge branch 'master' into date-functions-fast 7d9f0eb [Tarek Auel] [SPARK-8199] git renaming issue f3e7a9f [Tarek Auel] [SPARK-8199] revert change in DataFrameFunctionsSuite 6f5d95c [Tarek Auel] [SPARK-8199] fixed year interval d9f8ac3 [Tarek Auel] [SPARK-8199] implement fast track 7bc9d93 [Tarek Auel] Merge branch 'master' into SPARK-8199 5a105d9 [Tarek Auel] [SPARK-8199] rebase after #6985 got merged eb6760d [Tarek Auel] Merge branch 'master' into SPARK-8199 f120415 [Tarek Auel] improved runtime a8edebd [Tarek Auel] use Calendar instead of SimpleDateFormat 5fe74e1 [Tarek Auel] fixed python style 3bfac90 [Tarek Auel] fixed style 356df78 [Tarek Auel] rely on cast mechanism of Spark. Simplified implementation 02efc5d [Tarek Auel] removed doubled code a5ea120 [Tarek Auel] added python api; changed test to be more meaningful b680db6 [Tarek Auel] added codegeneration to all functions c739788 [Tarek Auel] added support for quarter SPARK-8178 849fb41 [Tarek Auel] fixed stupid test 638596f [Tarek Auel] improved codegen 4d8049b [Tarek Auel] fixed tests and added type check 5ebb235 [Tarek Auel] resolved naming conflict d0e2f99 [Tarek Auel] date functions	2015-07-18 22:48:05 -07:00
Reynold Xin	9914b1b2c5	[SPARK-9150][SQL] Create CodegenFallback and Unevaluable trait It is very hard to track which expressions have code gen implemented or not. This patch removes the default fallback gencode implementation from Expression, and moves that into a new trait called CodegenFallback. Each concrete expression needs to either implement code generation, or mix in CodegenFallback. This makes it very easy to track which expressions have code generation implemented already. Additionally, this patch creates an Unevaluable trait that can be used to track expressions that don't support evaluation (e.g. Star). Author: Reynold Xin <rxin@databricks.com> Closes #7487 from rxin/codegenfallback and squashes the following commits: 14ebf38 [Reynold Xin] Fixed Conv 6c1c882 [Reynold Xin] Fixed Alias. b42611b [Reynold Xin] [SPARK-9150][SQL] Create a trait to track code generation for expressions. cb5c066 [Reynold Xin] Removed extra import. 39cbe40 [Reynold Xin] [SPARK-8240][SQL] string function: concat	2015-07-18 18:18:19 -07:00
Reynold Xin	6e1e2eba69	[SPARK-8240][SQL] string function: concat Author: Reynold Xin <rxin@databricks.com> Closes #7486 from rxin/concat and squashes the following commits: 5217d6e [Reynold Xin] Removed Hive's concat test. f5cb7a3 [Reynold Xin] Concat is never nullable. ae4e61f [Reynold Xin] Removed extra import. fddcbbd [Reynold Xin] Fixed NPE. 22e831c [Reynold Xin] Added missing file. 57a2352 [Reynold Xin] [SPARK-8240][SQL] string function: concat	2015-07-18 14:07:56 -07:00
Yijie Shen	529a2c2d92	[SPARK-8280][SPARK-8281][SQL]Handle NaN, null and Infinity in math JIRA: https://issues.apache.org/jira/browse/SPARK-8280 https://issues.apache.org/jira/browse/SPARK-8281 Author: Yijie Shen <henry.yijieshen@gmail.com> Closes #7451 from yijieshen/nan_null2 and squashes the following commits: 47a529d [Yijie Shen] style fix 63dee44 [Yijie Shen] handle log expressions similar to Hive 188be51 [Yijie Shen] null to nan in Math Expression	2015-07-17 17:33:19 -07:00
Reynold Xin	b2aa490bb6	[SPARK-9142] [SQL] Removing unnecessary self types in Catalyst. Just a small change to add Product type to the base expression/plan abstract classes, based on suggestions on #7434 and offline discussions. Author: Reynold Xin <rxin@databricks.com> Closes #7479 from rxin/remove-self-types and squashes the following commits: e407ffd [Reynold Xin] [SPARK-9142][SQL] Removing unnecessary self types in Catalyst.	2015-07-17 15:02:13 -07:00
Jan Prach	b536d5dc6c	[SPARK-9015] [BUILD] Clean project import in scala ide Cleanup maven for a clean import in scala-ide / eclipse. * remove groovy plugin which is really not needed at all * add-source from build-helper-maven-plugin is not needed as recent version of scala-maven-plugin do it automatically * add lifecycle-mapping plugin to hide a few useless warnings from ide Author: Jan Prach <jendap@gmail.com> Closes #7375 from jendap/clean-project-import-in-scala-ide and squashes the following commits: c4b4c0f [Jan Prach] fix whitespaces 5a83e07 [Jan Prach] Revert "remove java compiler warnings from java tests" 312007e [Jan Prach] scala-maven-plugin itself add scala sources by default f47d856 [Jan Prach] remove spark-1.4-staging repository c8a54db [Jan Prach] remove java compiler warnings from java tests 999a068 [Jan Prach] remove some maven warnings in scala ide 80fbdc5 [Jan Prach] remove groovy and gmavenplus plugin	2015-07-16 18:42:41 +01:00
Cheng Hao	e27212317c	[SPARK-8972] [SQL] Incorrect result for rollup We don't support the complex expression keys in the rollup/cube, and we even will not report it if we have the complex group by keys, that will cause very confusing/incorrect result. e.g. `SELECT key%100 FROM src GROUP BY key %100 with ROLLUP` This PR adds an additional project during the analyzing for the complex GROUP BY keys, and that projection will be the child of `Expand`, so to `Expand`, the GROUP BY KEY are always the simple key(attribute names). Author: Cheng Hao <hao.cheng@intel.com> Closes #7343 from chenghao-intel/expand and squashes the following commits: 1ebbb59 [Cheng Hao] update the comment 827873f [Cheng Hao] update as feedback 34def69 [Cheng Hao] Add more unit test and comments c695760 [Cheng Hao] fix bug of incorrect result for rollup	2015-07-15 23:35:27 -07:00
Steve Loughran	ec9b621647	SPARK-9070 JavaDataFrameSuite teardown NPEs if setup failed fix teardown to skip table delete if hive context is null Author: Steve Loughran <stevel@hortonworks.com> Closes #7425 from steveloughran/stevel/patches/SPARK-9070-JavaDataFrameSuite-NPE and squashes the following commits: 1982d38 [Steve Loughran] SPARK-9070 JavaDataFrameSuite teardown NPEs if setup failed	2015-07-15 12:15:35 -07:00
Yijie Shen	f0e129740d	[SPARK-8279][SQL]Add math function round JIRA: https://issues.apache.org/jira/browse/SPARK-8279 Author: Yijie Shen <henry.yijieshen@gmail.com> Closes #6938 from yijieshen/udf_round_3 and squashes the following commits: 07a124c [Yijie Shen] remove useless def children 392b65b [Yijie Shen] add negative scale test in DecimalSuite 61760ee [Yijie Shen] address reviews 302a78a [Yijie Shen] Add dataframe function test 31dfe7c [Yijie Shen] refactor round to make it readable 8c7a949 [Yijie Shen] rebase & inputTypes update 9555e35 [Yijie Shen] tiny style fix d10be4a [Yijie Shen] use TypeCollection to specify wanted input and implicit cast c3b9839 [Yijie Shen] rely on implict cast to handle string input b0bff79 [Yijie Shen] make round's inner method's name more meaningful 9bd6930 [Yijie Shen] revert accidental change e6f44c4 [Yijie Shen] refactor eval and genCode 1b87540 [Yijie Shen] modify checkInputDataTypes using foldable 5486b2d [Yijie Shen] DataFrame API modification 2077888 [Yijie Shen] codegen versioned eval 6cd9a64 [Yijie Shen] refactor Round's constructor 9be894e [Yijie Shen] add round functions in o.a.s.sql.functions 7c83e13 [Yijie Shen] more tests on round 56db4bb [Yijie Shen] Add decimal support to Round 7e163ae [Yijie Shen] style fix 653d047 [Yijie Shen] Add math function round	2015-07-14 23:30:41 -07:00
Michael Armbrust	c6b1a9e74e	Revert SPARK-6910 and SPARK-9027 Revert #7216 and #7386. These patch seems to be causing quite a few test failures: ``` Caused by: java.lang.reflect.InvocationTargetException at sun.reflect.GeneratedMethodAccessor322.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:351) at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getPartitionsByFilter$1.apply(ClientWrapper.scala:320) at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getPartitionsByFilter$1.apply(ClientWrapper.scala:318) at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:180) at org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala:135) at org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:172) at org.apache.spark.sql.hive.client.ClientWrapper.getPartitionsByFilter(ClientWrapper.scala:318) at org.apache.spark.sql.hive.client.HiveTable.getPartitions(ClientInterface.scala:78) at org.apache.spark.sql.hive.MetastoreRelation.getHiveQlPartitions(HiveMetastoreCatalog.scala:670) at org.apache.spark.sql.hive.execution.HiveTableScan.doExecute(HiveTableScan.scala:137) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:90) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:90) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:89) at org.apache.spark.sql.execution.Exchange$$anonfun$doExecute$1.apply(Exchange.scala:164) at org.apache.spark.sql.execution.Exchange$$anonfun$doExecute$1.apply(Exchange.scala:151) at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:48) ... 85 more Caused by: MetaException(message:Filtering is supported only on partition keys of type string) at org.apache.hadoop.hive.metastore.parser.ExpressionTree$FilterBuilder.setError(ExpressionTree.java:185) at org.apache.hadoop.hive.metastore.parser.ExpressionTree$LeafNode.getJdoFilterPushdownParam(ExpressionTree.java:452) at org.apache.hadoop.hive.metastore.parser.ExpressionTree$LeafNode.generateJDOFilterOverPartitions(ExpressionTree.java:357) at org.apache.hadoop.hive.metastore.parser.ExpressionTree$LeafNode.generateJDOFilter(ExpressionTree.java:279) at org.apache.hadoop.hive.metastore.parser.ExpressionTree$TreeNode.generateJDOFilter(ExpressionTree.java:243) at org.apache.hadoop.hive.metastore.parser.ExpressionTree.generateJDOFilterFragment(ExpressionTree.java:590) at org.apache.hadoop.hive.metastore.ObjectStore.makeQueryFilterString(ObjectStore.java:2417) at org.apache.hadoop.hive.metastore.ObjectStore.getPartitionsViaOrmFilter(ObjectStore.java:2029) at org.apache.hadoop.hive.metastore.ObjectStore.access$500(ObjectStore.java:146) at org.apache.hadoop.hive.metastore.ObjectStore$4.getJdoResult(ObjectStore.java:2332) ``` https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Test/job/Spark-Master-Maven-with-YARN/2945/HADOOP_PROFILE=hadoop-2.4,label=centos/testReport/junit/org.apache.spark.sql.hive.execution/SortMergeCompatibilitySuite/auto_sortmerge_join_16/ Author: Michael Armbrust <michael@databricks.com> Closes #7409 from marmbrus/revertMetastorePushdown and squashes the following commits: 92fabd3 [Michael Armbrust] Revert SPARK-6910 and SPARK-9027 5d3bdf2 [Michael Armbrust] Revert "[SPARK-9027] [SQL] Generalize metastore predicate pushdown"	2015-07-14 22:57:39 -07:00
Josh Rosen	11e5c37286	[SPARK-8962] Add Scalastyle rule to ban direct use of Class.forName; fix existing uses This pull request adds a Scalastyle regex rule which fails the style check if `Class.forName` is used directly. `Class.forName` always loads classes from the default / system classloader, but in a majority of cases, we should be using Spark's own `Utils.classForName` instead, which tries to load classes from the current thread's context classloader and falls back to the classloader which loaded Spark when the context classloader is not defined. <!-- Reviewable:start --> [<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/7350) <!-- Reviewable:end --> Author: Josh Rosen <joshrosen@databricks.com> Closes #7350 from JoshRosen/ban-Class.forName and squashes the following commits: e3e96f7 [Josh Rosen] Merge remote-tracking branch 'origin/master' into ban-Class.forName c0b7885 [Josh Rosen] Hopefully fix the last two cases d707ba7 [Josh Rosen] Fix uses of Class.forName that I missed in my first cleanup pass 046470d [Josh Rosen] Merge remote-tracking branch 'origin/master' into ban-Class.forName 62882ee [Josh Rosen] Fix uses of Class.forName or add exclusion. d9abade [Josh Rosen] Add stylechecker rule to ban uses of Class.forName	2015-07-14 16:08:17 -07:00
Michael Armbrust	37f2d9635f	[SPARK-9027] [SQL] Generalize metastore predicate pushdown Add support for pushing down metastore filters that are in different orders and add some unit tests. Author: Michael Armbrust <michael@databricks.com> Closes #7386 from marmbrus/metastoreFilters and squashes the following commits: 05a4524 [Michael Armbrust] [SPARK-9027][SQL] Generalize metastore predicate pushdown	2015-07-14 11:22:09 -07:00
Cheolsoo Park	408b384de9	[SPARK-6910] [SQL] Support for pushing predicates down to metastore for partition pruning This PR supersedes my old one #6921. Since my patch has changed quite a bit, I am opening a new PR to make it easier to review. The changes include- * Implement `toMetastoreFilter()` function in `HiveShim` that takes `Seq[Expression]` and converts them into a filter string for Hive metastore. * This functions matches all the `AttributeReference` + `BinaryComparisonOp` + `Integral/StringType` patterns in `Seq[Expression]` and fold them into a string. * Change `hiveQlPartitions` field in `MetastoreRelation` to `getHiveQlPartitions()` function that takes a filter string parameter. * Call `getHiveQlPartitions()` in `HiveTableScan` with a filter string. But there are some cases in which predicate pushdown is disabled- Case \| Predicate pushdown ------- \| ----------------------------- Hive integral and string types \| Yes Hive varchar type \| No Hive 0.13 and newer \| Yes Hive 0.12 and older \| No convertMetastoreParquet=false \| Yes convertMetastoreParquet=true \| No In case of `convertMetastoreParquet=true`, predicates are not pushed down because this conversion happens in an `Analyzer` rule (`HiveMetastoreCatalog.ParquetConversions`). At this point, `HiveTableScan` hasn't run, so predicates are not available. But reading the source code, I think it is intentional to convert the entire Hive table w/ all the partitions into `ParquetRelation` because then `ParquetRelation` can be cached and reused for any query against that table. Please correct me if I am wrong. cc marmbrus Author: Cheolsoo Park <cheolsoop@netflix.com> Closes #7216 from piaozhexiu/SPARK-6910-2 and squashes the following commits: aa1490f [Cheolsoo Park] Fix ordering of imports c212c4d [Cheolsoo Park] Incorporate review comments 5e93f9d [Cheolsoo Park] Predicate pushdown into Hive metastore	2015-07-13 19:45:10 -07:00
Jonathan Alter	e14b545d2d	[SPARK-7977] [BUILD] Disallowing println Author: Jonathan Alter <jonalter@users.noreply.github.com> Closes #7093 from jonalter/SPARK-7977 and squashes the following commits: ccd44cc [Jonathan Alter] Changed println to log in ThreadingSuite 7fcac3e [Jonathan Alter] Reverting to println in ThreadingSuite 10724b6 [Jonathan Alter] Changing some printlns to logs in tests eeec1e7 [Jonathan Alter] Merge branch 'master' of github.com:apache/spark into SPARK-7977 0b1dcb4 [Jonathan Alter] More println cleanup aedaf80 [Jonathan Alter] Merge branch 'master' of github.com:apache/spark into SPARK-7977 925fd98 [Jonathan Alter] Merge branch 'master' of github.com:apache/spark into SPARK-7977 0c16fa3 [Jonathan Alter] Replacing some printlns with logs 45c7e05 [Jonathan Alter] Merge branch 'master' of github.com:apache/spark into SPARK-7977 5c8e283 [Jonathan Alter] Allowing println in audit-release examples 5b50da1 [Jonathan Alter] Allowing printlns in example files ca4b477 [Jonathan Alter] Merge branch 'master' of github.com:apache/spark into SPARK-7977 83ab635 [Jonathan Alter] Fixing new printlns 54b131f [Jonathan Alter] Merge branch 'master' of github.com:apache/spark into SPARK-7977 1cd8a81 [Jonathan Alter] Removing some unnecessary comments and printlns b837c3a [Jonathan Alter] Disallowing println	2015-07-10 11:34:01 +01:00
Michael Armbrust	768907eb7b	[SPARK-8926][SQL] Good errors for ExpectsInputType expressions For example: `cannot resolve 'testfunction(null)' due to data type mismatch: argument 1 is expected to be of type int, however, null is of type datetype.` Author: Michael Armbrust <michael@databricks.com> Closes #7303 from marmbrus/expectsTypeErrors and squashes the following commits: c654a0e [Michael Armbrust] fix udts and make errors pretty 137160d [Michael Armbrust] style 5428fda [Michael Armbrust] style 10fac82 [Michael Armbrust] [SPARK-8926][SQL] Good errors for ExpectsInputType expressions	2015-07-08 22:05:58 -07:00
Andrew Or	47ef423f86	[SPARK-8910] Fix MiMa flaky due to port contention issue Due to the way MiMa works, we currently start a `SQLContext` pretty early on. This causes us to start a `SparkUI` that attempts to bind to port 4040. Because many tests run in parallel on the Jenkins machines, this causes port contention sometimes and fails the MiMa tests. Note that we already disabled the SparkUI for scalatests. However, the MiMa test is run before we even have a chance to load the default scalatest settings, so we need to explicitly disable the UI ourselves. Author: Andrew Or <andrew@databricks.com> Closes #7300 from andrewor14/mima-flaky and squashes the following commits: b55a547 [Andrew Or] Do not enable SparkUI during tests	2015-07-08 20:29:08 -07:00
Yijie Shen	a290814877	[SPARK-8866][SQL] use 1us precision for timestamp type JIRA: https://issues.apache.org/jira/browse/SPARK-8866 Author: Yijie Shen <henry.yijieshen@gmail.com> Closes #7283 from yijieshen/micro_timestamp and squashes the following commits: dc735df [Yijie Shen] update CastSuite to avoid round error 714eaea [Yijie Shen] add timestamp_udf into blacklist due to precision lose c3ca2f4 [Yijie Shen] fix unhandled case in CurrentTimestamp 8d4aa6b [Yijie Shen] use 1us precision for timestamp type	2015-07-08 20:20:17 -07:00
Cheng Lian	4ffc27caaf	[SPARK-6123] [SPARK-6775] [SPARK-6776] [SQL] Refactors Parquet read path for interoperability and backwards-compatibility This PR is a follow-up of #6617 and is part of [SPARK-6774] [2], which aims to ensure interoperability and backwards-compatibility for Spark SQL Parquet support. And this one fixes the read path. Now Spark SQL is expected to be able to read legacy Parquet data files generated by most (if not all) common libraries/tools like parquet-thrift, parquet-avro, and parquet-hive. However, we still need to refactor the write path to write standard Parquet LISTs and MAPs ([SPARK-8848] [4]). ### Major changes 1. `CatalystConverter` class hierarchy refactoring - Replaces `CatalystConverter` trait with a much simpler `ParentContainerUpdater`. Now instead of extending the original `CatalystConverter` trait, every converter class accepts an updater which is responsible for propagating the converted value to some parent container. For example, appending array elements to a parent array buffer, appending a key-value pairs to a parent mutable map, or setting a converted value to some specific field of a parent row. Root converter doesn't have a parent and thus uses a `NoopUpdater`. This simplifies the design since converters don't need to care about details of their parent converters anymore. - Unifies `CatalystRootConverter`, `CatalystGroupConverter` and `CatalystPrimitiveRowConverter` into `CatalystRowConverter` Specifically, now all row objects are represented by `SpecificMutableRow` during conversion. - Refactors `CatalystArrayConverter`, and removes `CatalystArrayContainsNullConverter` and `CatalystNativeArrayConverter` `CatalystNativeArrayConverter` was probably designed with the intention of avoiding boxing costs. However, the way it uses Scala generics actually doesn't achieve this goal. The new `CatalystArrayConverter` handles both nullable and non-nullable array elements in a consistent way. - Implements backwards-compatibility rules in `CatalystArrayConverter` When Parquet records are being converted, schema of Parquet files should have already been verified. So we only need to care about the structure rather than field names in the Parquet schema. Since all map objects represented in legacy systems have the same structure as the standard one (see [backwards-compatibility rules for MAP] [1]), we only need to deal with LIST (namely array) in `CatalystArrayConverter`. 2. Requested columns handling When specifying requested columns in `RowReadSupport`, we used to use a Parquet `MessageType` converted from a Catalyst `StructType` which contains all requested columns. This is not preferable when taking compatibility and interoperability into consideration. Because the actual Parquet file may have different physical structure from the converted schema. In this PR, the schema for requested columns is constructed using the following method: - For a column that exists in the target Parquet file, we extract the column type by name from the full file schema, and construct a single-field `MessageType` for that column. - For a column that doesn't exist in the target Parquet file, we create a single-field `StructType` and convert it to a `MessageType` using `CatalystSchemaConverter`. - Unions all single-field `MessageType`s into a full schema containing all requested fields With this change, we also fix [SPARK-6123] [3] by validating the global schema against each individual Parquet part-files. ### Testing This PR also adds compatibility tests for parquet-avro, parquet-thrift, and parquet-hive. Please refer to `README.md` under `sql/core/src/test` for more information about these tests. To avoid build time code generation and adding extra complexity to the build system, Java code generated from testing Thrift schema and Avro IDL is also checked in. [1]: https://github.com/apache/incubator-parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules-1 [2]: https://issues.apache.org/jira/browse/SPARK-6774 [3]: https://issues.apache.org/jira/browse/SPARK-6123 [4]: https://issues.apache.org/jira/browse/SPARK-8848 Author: Cheng Lian <lian@databricks.com> Closes #7231 from liancheng/spark-6776 and squashes the following commits: 360fe18 [Cheng Lian] Adds ParquetHiveCompatibilitySuite c6fbc06 [Cheng Lian] Removes WIP file committed by mistake b8c1295 [Cheng Lian] Excludes the whole parquet package from MiMa 598c3e8 [Cheng Lian] Adds extra Maven repo for hadoop-lzo, which is a transitive dependency of parquet-thrift 926af87 [Cheng Lian] Simplifies Parquet compatibility test suites 7946ee1 [Cheng Lian] Fixes Scala styling issues 3d7ab36 [Cheng Lian] Fixes .rat-excludes a8f13bb [Cheng Lian] Using Parquet writer API to do compatibility tests f2208cd [Cheng Lian] Adds README.md for Thrift/Avro code generation 1d390aa [Cheng Lian] Adds parquet-thrift compatibility test 440f7b3 [Cheng Lian] Adds generated files to .rat-excludes 13b9121 [Cheng Lian] Adds ParquetAvroCompatibilitySuite 06cfe9d [Cheng Lian] Adds comments about TimestampType handling a099d3e [Cheng Lian] More comments 0cc1b37 [Cheng Lian] Fixes MiMa checks 884d3e6 [Cheng Lian] Fixes styling issue and reverts unnecessary changes 802cbd7 [Cheng Lian] Fixes bugs related to schema merging and empty requested columns 38fe1e7 [Cheng Lian] Adds explicit return type 7fb21f1 [Cheng Lian] Reverts an unnecessary debugging change 1781dff [Cheng Lian] Adds test case for SPARK-8811 6437d4b [Cheng Lian] Assembles requested schema from Parquet file schema bcac49f [Cheng Lian] Removes the 16-byte restriction of decimals a74fb2c [Cheng Lian] More comments 0525346 [Cheng Lian] Removes old Parquet record converters 03c3bd9 [Cheng Lian] Refactors Parquet read path to implement backwards-compatibility rules	2015-07-08 15:51:01 -07:00
Keuntae Park	f031543782	[SPARK-8783] [SQL] CTAS with WITH clause does not work Currently, CTESubstitution only handles the case that WITH is on the top of the plan. I think it SHOULD handle the case that WITH is child of CTAS. This patch simply changes 'match' to 'transform' for recursive search of WITH in the plan. Author: Keuntae Park <sirpkt@apache.org> Closes #7180 from sirpkt/SPARK-8783 and squashes the following commits: e4428f0 [Keuntae Park] Merge remote-tracking branch 'upstream/master' into CTASwithWITH 1671c77 [Keuntae Park] WITH clause can be inside CTAS	2015-07-08 14:29:52 -07:00
Takeshi YAMAMURO	3e831a2696	[SPARK-6912] [SQL] Throw an AnalysisException when unsupported Java Map<K,V> types used in Hive UDF To make UDF developers understood, throw an exception when unsupported Map<K,V> types used in Hive UDF. This fix is the same with #7248. Author: Takeshi YAMAMURO <linguin.m.s@gmail.com> Closes #7257 from maropu/ThrowExceptionWhenMapUsed and squashes the following commits: 916099a [Takeshi YAMAMURO] Fix style errors 7886dcc [Takeshi YAMAMURO] Throw an exception when Map<> used in Hive UDF	2015-07-08 10:33:27 -07:00
Cheng Hao	351a36d0c5	[SPARK-8883][SQL]Remove the OverrideFunctionRegistry Remove the `OverrideFunctionRegistry` from the Spark SQL, as the subclasses of `FunctionRegistry` have their own way to the delegate to the right underlying `FunctionRegistry`. Author: Cheng Hao <hao.cheng@intel.com> Closes #7260 from chenghao-intel/override and squashes the following commits: 164d093 [Cheng Hao] enable the function registry 2ca8459 [Cheng Hao] remove the OverrideFunctionRegistry	2015-07-08 00:10:24 -07:00
Reynold Xin	770ff1025e	[SPARK-8876][SQL] Remove InternalRow type alias in expressions package. The type alias was there because initially when I moved Row around, I didn't want to do massive changes to the expression code. But now it should be pretty easy to just remove it. One less concept to worry about. Author: Reynold Xin <rxin@databricks.com> Closes #7270 from rxin/internalrow and squashes the following commits: 72fc842 [Reynold Xin] [SPARK-8876][SQL] Remove InternalRow type alias in expressions package.	2015-07-07 17:40:14 -07:00
Takeshi YAMAMURO	1821fc1658	[SPARK-6747] [SQL] Throw an AnalysisException when unsupported Java list types used in Hive UDF The current implementation can't handle List<> as a return type in Hive UDF and throws meaningless Match Error. We assume an UDF below; public class UDFToListString extends UDF { public List<String> evaluate(Object o) { return Arrays.asList("xxx", "yyy", "zzz"); } } An exception of scala.MatchError is thrown as follows when the UDF used; scala.MatchError: interface java.util.List (of class java.lang.Class) at org.apache.spark.sql.hive.HiveInspectors$class.javaClassToDataType(HiveInspectors.scala:174) at org.apache.spark.sql.hive.HiveSimpleUdf.javaClassToDataType(hiveUdfs.scala:76) at org.apache.spark.sql.hive.HiveSimpleUdf.dataType$lzycompute(hiveUdfs.scala:106) at org.apache.spark.sql.hive.HiveSimpleUdf.dataType(hiveUdfs.scala:106) at org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:131) at org.apache.spark.sql.catalyst.planning.PhysicalOperation$$anonfun$collectAliases$1.applyOrElse(patterns.scala:95) at org.apache.spark.sql.catalyst.planning.PhysicalOperation$$anonfun$collectAliases$1.applyOrElse(patterns.scala:94) at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33) at scala.collection.TraversableLike$$anonfun$collect$1.apply(TraversableLike.scala:278) ... To make udf developers more understood, we need to throw a more suitable exception. Author: Takeshi YAMAMURO <linguin.m.s@gmail.com> Closes #7248 from maropu/FixBugInHiveInspectors and squashes the following commits: 1c3df2a [Takeshi YAMAMURO] Fix comments 56305de [Takeshi YAMAMURO] Fix conflicts 92ed7a6 [Takeshi YAMAMURO] Throw an exception when java list type used 2844a8e [Takeshi YAMAMURO] Apply comments 7114a47 [Takeshi YAMAMURO] Add TODO comments in UDFToListString of HiveUdfSuite fdb2ae4 [Takeshi YAMAMURO] Add StringToUtf8 to comvert String into UTF8String af61f2e [Takeshi YAMAMURO] Remove a new type 7f812fd [Takeshi YAMAMURO] Fix code-style errors 6984bf4 [Takeshi YAMAMURO] Apply review comments 93e3d4e [Takeshi YAMAMURO] Add a blank line at the end of UDFToListString ee232db [Takeshi YAMAMURO] Support List as a return type in Hive UDF 1e82316 [Takeshi YAMAMURO] Apply comments 21e8763 [Takeshi YAMAMURO] Add TODO comments in UDFToListString of HiveUdfSuite a488712 [Takeshi YAMAMURO] Add StringToUtf8 to comvert String into UTF8String 1c7b9d1 [Takeshi YAMAMURO] Remove a new type f965c34 [Takeshi YAMAMURO] Fix code-style errors 9406416 [Takeshi YAMAMURO] Apply review comments e21ce7e [Takeshi YAMAMURO] Add a blank line at the end of UDFToListString e553f10 [Takeshi YAMAMURO] Support List as a return type in Hive UDF	2015-07-06 19:44:31 -07:00
Yin Huai	7b467cc934	[SPARK-8588] [SQL] Regression test This PR adds regression test for https://issues.apache.org/jira/browse/SPARK-8588 (fixed by `457d07eaa0`). Author: Yin Huai <yhuai@databricks.com> This patch had conflicts when merged, resolved by Committer: Michael Armbrust <michael@databricks.com> Closes #7103 from yhuai/SPARK-8588-test and squashes the following commits: eb5f418 [Yin Huai] Add a query test. c61a173 [Yin Huai] Regression test for SPARK-8588.	2015-07-06 16:28:47 -07:00
Cheng Lian	20a4d7dbd1	[SPARK-8501] [SQL] Avoids reading schema from empty ORC files ORC writes empty schema (`struct<>`) to ORC files containing zero rows. This is OK for Hive since the table schema is managed by the metastore. But it causes trouble when reading raw ORC files via Spark SQL since we have to discover the schema from the files. Notice that the ORC data source always avoids writing empty ORC files, but it's still problematic when reading Hive tables which contain empty part-files. Author: Cheng Lian <lian@databricks.com> Closes #7199 from liancheng/spark-8501 and squashes the following commits: bb8cd95 [Cheng Lian] Addresses comments a290221 [Cheng Lian] Avoids reading schema from empty ORC files	2015-07-02 21:30:57 -07:00

1 2 3 4 5 ...

750 commits