## What changes were proposed in this pull request?
This PR is a simple fix for an exception message to print `String[]` content correctly.
```java
String[] colPath = requestedSchema.getPaths().get(i);
...
- throw new IOException("Required column is missing in data file. Col: " + colPath);
+ throw new IOException("Required column is missing in data file. Col: " + Arrays.toString(colPath));
```
## How was this patch tested?
Manual.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #12041 from dongjoon-hyun/fix_exception_message_with_string_array.
## What changes were proposed in this pull request?
This PR fixes two trivial typos: 'does not **much**' --> 'does not **match**'.
## How was this patch tested?
Manual.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #12042 from dongjoon-hyun/fix_typo_by_replacing_much_with_match.
## What changes were proposed in this pull request?
Renames SQL option `spark.sql.parquet.fileScan` since now all `HadoopFsRelation` based data sources are being migrated to `FileScanRDD` code path.
## How was this patch tested?
None.
Author: Cheng Lian <lian@databricks.com>
Closes #12003 from liancheng/spark-14208-option-renaming.
## What changes were proposed in this pull request?
This PR implements `buildReader` for the JSON data source and enables it in the new data source code path.
## How was this patch tested?
existing tests
Author: Wenchen Fan <wenchen@databricks.com>
Closes #11960 from cloud-fan/json.
## What changes were proposed in this pull request?
This adds a metric to parquet scans that measures the time spent in just the scan phase. This is only possible when the scan returns ColumnarBatches; otherwise the overhead is too high.
This, combined with the pipeline metric, lets us easily see what percent of the time was spent in the scan.
Author: Nong Li <nong@databricks.com>
Closes #12007 from nongli/spark-14210.
## What changes were proposed in this pull request?
This improves the Filter codegen for NULLs by deferring loading the values for IsNotNull.
Instead of generating code like:
```java
boolean isNull = ...
int value = ...
if (isNull) continue;
```
we will generate:
```java
boolean isNull = ...
if (isNull) continue;
int value = ...
```
This is useful since retrieving the values can be non-trivial (they can be dictionary encoded
among other things). This currently only works when the attribute comes from the column batch
but could be extended to other cases in the future.
## How was this patch tested?
On tpcds q55, this fixes the regression from introducing the IsNotNull predicates.
```
TPCDS Snappy: Best/Avg Time(ms) Rate(M/s) Per Row(ns)
--------------------------------------------------------------------------------
q55 4564 / 5036 25.2 39.6
q55 4064 / 4340 28.3 35.3
```
Author: Nong Li <nong@databricks.com>
Closes #11792 from nongli/spark-13981.
### What changes were proposed in this pull request?
This PR migrates all HiveQl parsing to the new ANTLR4 parser. This PR is built on top of https://github.com/apache/spark/pull/12011, and we should wait with merging until that one is in (hence the WIP tag).
As soon as this PR is merged we can start removing much of the old parser infrastructure.
### How was this patch tested?
Existing Hive unit tests.
cc rxin andrewor14 yhuai
Author: Herman van Hovell <hvanhovell@questtec.nl>
Closes #12015 from hvanhovell/SPARK-14213.
## What changes were proposed in this pull request?
After DataFrame and Dataset are merged, the trait `Queryable` becomes unnecessary as it has only one implementation. We should remove it.
## How was this patch tested?
existing tests.
Author: Wenchen Fan <wenchen@databricks.com>
Closes #12001 from cloud-fan/df-ds.
## What changes were proposed in this pull request?
Before: We just pass all role commands to Hive even though they don't work.
After: We throw an `AnalysisException` that looks like this:
```
scala> sql("CREATE ROLE x")
org.apache.spark.sql.AnalysisException: Unsupported Hive operation: CREATE ROLE;
at org.apache.spark.sql.hive.HiveQl$$anonfun$parsePlan$1.apply(HiveQl.scala:213)
at org.apache.spark.sql.hive.HiveQl$$anonfun$parsePlan$1.apply(HiveQl.scala:208)
at org.apache.spark.sql.catalyst.parser.CatalystQl.safeParse(CatalystQl.scala:49)
at org.apache.spark.sql.hive.HiveQl.parsePlan(HiveQl.scala:208)
at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:198)
```
## How was this patch tested?
`HiveQuerySuite`
Author: Andrew Or <andrew@databricks.com>
Closes #11948 from andrewor14/ddl-role-management.
## What changes were proposed in this pull request?
Session catalog was added in #11750. However, it doesn't really support temporary functions properly; right now we only store the metadata in the form of `CatalogFunction`, but this doesn't make sense for temporary functions because there is no class name.
This patch moves the `FunctionRegistry` into the `SessionCatalog`. With this, the user can call `catalog.createTempFunction` and `catalog.lookupFunction` to use the function they registered previously. This is currently still dead code, however.
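A rough sketch of the intended round trip (only the two method names come from this description; the signatures and helper names here are assumptions):
```scala
// Register a temporary function, then resolve it through the catalog instead
// of a standalone FunctionRegistry. `strlenBuilder` is a hypothetical builder
// that turns child expressions into the function's expression tree.
catalog.createTempFunction("strlen", strlenBuilder, ignoreIfExists = false)
val resolved = catalog.lookupFunction("strlen", arguments)
```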
## How was this patch tested?
`SessionCatalogSuite`.
Author: Andrew Or <andrew@databricks.com>
Closes #11972 from andrewor14/temp-functions.
## What changes were proposed in this pull request?
Extract the workaround for HADOOP-10622 introduced by #11940 into UninterruptibleThread so that we can test and reuse it.
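A minimal usage sketch, assuming the `runUninterruptibly` shape described in #11940 (details are illustrative):
```scala
import org.apache.spark.util.UninterruptibleThread

// Code inside runUninterruptibly defers Thread.interrupt() until the block
// completes, so we cannot hit the HADOOP-10622 deadlock inside Shell.runCommand.
val thread = new UninterruptibleThread("stream-execution") {
  override def run(): Unit = {
    runUninterruptibly {
      // e.g. create a file via the HDFS API, which may invoke Shell.runCommand
    }
  }
}
thread.start()
```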
## How was this patch tested?
Unit tests
Author: Shixiong Zhu <shixiong@databricks.com>
Closes #11971 from zsxwing/uninterrupt.
## What changes were proposed in this pull request?
UserDefinedType is a developer API in Spark 1.x. With very high probability we will create a new API for user-defined types that works well with column batches as well as encoders (datasets). In Spark 2.0, let's make `UserDefinedType` `private[spark]` first.
## How was this patch tested?
Existing unit tests.
Author: Reynold Xin <rxin@databricks.com>
Closes #11955 from rxin/SPARK-14155.
## What changes were proposed in this pull request?
This patch addresses the remaining comments left in #11750 and #11918 after they are merged. For a full list of changes in this patch, just trace the commits.
## How was this patch tested?
`SessionCatalogSuite` and `CatalogTestCases`
Author: Andrew Or <andrew@databricks.com>
Closes #12006 from andrewor14/session-catalog-followup.
#### What changes were proposed in this pull request?
This PR adds all the current Spark SQL DDL commands to the new ANTLR 4 based SQL parser.
I have found a few inconsistencies in the current commands:
- Function has an alias field. This is actually the class name of the function.
- Partition specifications should contain nulls in some commands, and contain `None`s in others.
- `AlterTableSkewedLocation`: Should define which columns have skewed values, and should allow us to define storage for each skewed combination of values. We currently only allow one value per field.
- `AlterTableSetFileFormat`: Should only have one file format; it currently supports both.
I have implemented all these commands as they currently are, and I propose to improve them in follow-up PRs.
#### How was this patch tested?
The existing DDLCommandSuite.
cc rxin andrewor14 yhuai
Author: Herman van Hovell <hvanhovell@questtec.nl>
Closes #12011 from hvanhovell/SPARK-14086.
## What changes were proposed in this pull request?
Currently, for keys that cannot fit in a long, we build a hash map for UnsafeHashedRelation, which is converted to a BytesToBytesMap after serialization and deserialization. We should build a BytesToBytesMap directly for better memory efficiency.
In order to do that, BytesToBytesMap should support multiple (K, V) pairs with the same K. Location.putNewKey() is renamed to Location.append(), which can append multiple values for the same key (same Location). `Location.newValue()` is added to find the next value for the same key.
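A hedged sketch of the resulting access pattern (method names follow the description above; `map`, the buffers, and `process` are placeholders):
```scala
// Multiple values can now live under one key; iterate them at the same Location.
val loc = map.lookup(keyBase, keyOffset, keyLength)
loc.append(keyBase, keyOffset, keyLength, valBase, valOffset, valLength)

var more = loc.isDefined
while (more) {
  process(loc.getValueBase, loc.getValueOffset, loc.getValueLength)
  more = loc.newValue() // advance to the next value appended for the same key
}
```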
## How was this patch tested?
Existing tests. Added benchmark for broadcast hash join with duplicated keys.
Author: Davies Liu <davies@databricks.com>
Closes #11870 from davies/map2.
### What changes were proposed in this pull request?
The current ANTLR3 parser is quite complex to maintain and suffers from code blow-ups. This PR introduces a new parser that is based on ANTLR4.
This parser is based on [Presto's SQL parser](https://github.com/facebook/presto/blob/master/presto-parser/src/main/antlr4/com/facebook/presto/sql/parser/SqlBase.g4). The current implementation can parse and create Catalyst and SQL plans. Large parts of the HiveQl DDL and some of the DML functionality are currently missing; the plan is to add these in follow-up PRs.
This PR is a work in progress, and work needs to be done in the following areas:
- [x] Error handling should be improved.
- [x] Documentation should be improved.
- [x] Multi-Insert needs to be tested.
- [ ] Naming and package locations.
### How was this patch tested?
Catalyst and SQL unit tests.
Author: Herman van Hovell <hvanhovell@questtec.nl>
Closes #11557 from hvanhovell/ngParser.
## What changes were proposed in this pull request?
JIRA: https://issues.apache.org/jira/browse/SPARK-14156
In HiveComparisonTest, when Catalyst results differ from Hive results, we collect the messages for computed tables during the test. When creating the message, we use sparkPlan, but we actually run the query with executedPlan, so the error message is sometimes confusing.
For example, since whole-stage codegen is now enabled by default, the Spark plan shown for computed tables is the plan before whole-stage codegen.
A concrete example is the following error message, shown when running `HiveCompatibilityTest` `auto_join26`.
auto_join26 has one SQL statement to create a table:
```SQL
INSERT OVERWRITE TABLE dest_j1
SELECT x.key, count(1) FROM src1 x JOIN src y ON (x.key = y.key) group by x.key; -- (1)
```
Then a SQL statement to retrieve the result:
```SQL
select * from dest_j1 x order by x.key; -- (2)
```
When the above SQL (2) fails, in `HiveComparisonTest` we try to collect and show the generated data from table `dest_j1` using SQL (1)'s spark plan. Then you will see this error:
```
TungstenAggregate(key=[key#8804], functions=[(count(1),mode=Partial,isDistinct=false)], output=[key#8804,count#8834L])
+- Project [key#8804]
+- BroadcastHashJoin [key#8804], [key#8806], Inner, BuildRight, None
:- Filter isnotnull(key#8804)
: +- InMemoryColumnarTableScan [key#8804], [isnotnull(key#8804)], InMemoryRelation [key#8804,value#8805], true, 5, StorageLevel(true, true, false, true, 1), HiveTableScan [key#8717,value#8718], MetastoreRelation default, src1, None, Some(src1)
+- Filter isnotnull(key#8806)
+- InMemoryColumnarTableScan [key#8806], [isnotnull(key#8806)], InMemoryRelation [key#8806,value#8807], true, 5, StorageLevel(true, true, false, true, 1), HiveTableScan [key#8760,value#8761], MetastoreRelation default, src, None, Some(src)
at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47)
at org.apache.spark.sql.execution.aggregate.TungstenAggregate.doExecute(TungstenAggregate.scala:82)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:121)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:121)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:140)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:137)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:120)
at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.apply(TungstenAggregate.scala:87)
at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.apply(TungstenAggregate.scala:82)
at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:46)
... 70 more
Caused by: java.lang.UnsupportedOperationException: Filter does not implement doExecuteBroadcast
at org.apache.spark.sql.execution.SparkPlan.doExecuteBroadcast(SparkPlan.scala:221)
```
The message is confusing because it is not the plan actually run by the SparkSQL engine to create the generated table; the plan actually run is fine. But before this patch, we ran `e.sparkPlan.collect` to retrieve and show the generated data, and the spark plan is not a plan we can run, so the above error is shown.
After this patch, we won't see the error, because the executed plan works correctly.
## How was this patch tested?
Existing tests.
Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Closes #11957 from viirya/use-executedplan.
## What changes were proposed in this pull request?
The indentation of the debug log output by `CodeGenerator` is weird.
The first line of the generated code should be placed on the line after the log message header.
```
16/03/28 11:10:24 DEBUG CodeGenerator: /* 001 */
/* 002 */ public java.lang.Object generate(Object[] references) {
/* 003 */ return new SpecificSafeProjection(references);
...
```
After this patch is applied, we get debug logs like the following.
```
16/03/28 10:45:50 DEBUG CodeGenerator:
/* 001 */
/* 002 */ public java.lang.Object generate(Object[] references) {
/* 003 */ return new SpecificSafeProjection(references);
...
```
## How was this patch tested?
Ran some jobs and checked debug logs.
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes #11990 from sarutak/fix-debuglog-indentation.
## What changes were proposed in this pull request?
This PR fixes the following two test cases so that they test the correct usages.
```
checkSqlGeneration("SELECT substr('This is a test', 'is')")
checkSqlGeneration("SELECT substring('This is a test', 'is')")
```
Actually, the test cases pass, but they test exceptional cases.
```
scala> sql("SELECT substr('This is a test', 'is')")
res0: org.apache.spark.sql.DataFrame = [substring(This is a test, CAST(is AS INT), 2147483647): string]
scala> sql("SELECT substr('This is a test', 'is')").collect()
res1: Array[org.apache.spark.sql.Row] = Array([null])
```
## How was this patch tested?
Pass the modified unit tests.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #11963 from dongjoon-hyun/fix_substr_testcase.
#### What changes were proposed in this pull request?
This PR is to provide native parsing support for two DDL commands: ```Describe Database``` and ```Alter Database Set Properties```
Based on the Hive DDL document:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
##### 1. ALTER DATABASE
**Syntax:**
```SQL
ALTER (DATABASE|SCHEMA) database_name SET DBPROPERTIES (property_name=property_value, ...)
```
- `ALTER DATABASE` is to add new (key, value) pairs into `DBPROPERTIES`
##### 2. DESCRIBE DATABASE
**Syntax:**
```SQL
DESCRIBE DATABASE [EXTENDED] db_name
```
- `DESCRIBE DATABASE` shows the name of the database, its comment (if one has been set), and its root location on the filesystem. When `extended` is true, it also shows the database's properties
#### How was this patch tested?
Added the related test cases to `DDLCommandSuite`
Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
This patch had conflicts when merged, resolved by
Committer: Yin Huai <yhuai@databricks.com>
Closes #11977 from gatorsmile/parseAlterDatabase.
## What changes were proposed in this pull request?
This PR implements `FileFormat.buildReader()` for our ORC data source. It also fixes several minor styling issues related to the `HadoopFsRelation` planning code path.
Note that `OrcNewInputFormat` doesn't rely on `OrcNewSplit` for creating `OrcRecordReader`s, plain `FileSplit` is just fine. That's why we can simply create the record reader with the help of `OrcNewInputFormat` and `FileSplit`.
## How was this patch tested?
Existing test cases should do the work
Author: Cheng Lian <lian@databricks.com>
Closes #11936 from liancheng/spark-14116-build-reader-for-orc.
### What changes were proposed in this pull request?
Based on the Hive DDL document https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
The syntax of DDL command for Drop Database is
```SQL
DROP (DATABASE|SCHEMA) [IF EXISTS] database_name [RESTRICT|CASCADE];
```
- If `IF EXISTS` is not specified, the default behavior is to issue a warning message if `database_name` doesn't exist
- `RESTRICT` is the default behavior.
This PR is to provide a native parsing support for `DROP DATABASE`.
#### How was this patch tested?
Added a test case to `DDLCommandSuite`
Author: gatorsmile <gatorsmile@gmail.com>
Closes #11962 from gatorsmile/parseDropDatabase.
## What changes were proposed in this pull request?
1. Merge consumeChild into consume().
2. Always generate code for input variables and UnsafeRow; a plan can use either of them.
## How was this patch tested?
Existing tests.
Author: Davies Liu <davies@databricks.com>
Closes #11975 from davies/gen_refactor.
## What changes were proposed in this pull request?
This PR fixes some newly added java-lint errors (unused imports, line length).
## How was this patch tested?
Pass the Jenkins tests.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #11968 from dongjoon-hyun/SPARK-14167.
## What changes were proposed in this pull request?
HDFSMetadataLog uses the newer FileContext API to achieve atomic renaming. However, FileContext implementations may not exist for many schemes for which there are FileSystem implementations. In those cases, rather than failing completely, we should fall back to the FileSystem-based implementation and log a warning that there may be file consistency issues if the log directory is concurrently modified.
In addition I have also added more tests to increase the code coverage.
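The fallback logic, roughly (the two manager class names and the structure here are illustrative, not necessarily the exact code):
```scala
import org.apache.hadoop.fs.UnsupportedFileSystemException

// Prefer the FileContext API (atomic rename); fall back to FileSystem if the
// scheme has no FileContext implementation, warning about consistency.
val fileManager = try {
  new FileContextManager(metadataPath, hadoopConf)
} catch {
  case _: UnsupportedFileSystemException =>
    logWarning("Could not use FileContext API for managing metadata log files at " +
      s"$metadataPath. Using FileSystem API instead. The log may be inconsistent " +
      "if the log directory is concurrently modified.")
    new FileSystemManager(metadataPath, hadoopConf)
}
```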
## How was this patch tested?
Unit test.
Tested on cluster with custom file system.
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes #11925 from tdas/SPARK-14109.
## What changes were proposed in this pull request?
This PR moves flume back to Spark as per the discussion on the dev mailing list.
## How was this patch tested?
Existing Jenkins tests.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes #11895 from zsxwing/move-flume-back.
## What changes were proposed in this pull request?
There is a potential deadlock in Hadoop's Shell.runCommand before 2.5.0 ([HADOOP-10622](https://issues.apache.org/jira/browse/HADOOP-10622)). If we interrupt a thread running Shell.runCommand, we may hit this issue.
This PR adds some protection to prevent interrupting the microBatchThread when we may run into Shell.runCommand. There are two places that call Shell.runCommand now:
- offsetLog.add
- FileStreamSource.getOffset
They create a file using the HDFS API and call Shell.runCommand to set the file permissions.
## How was this patch tested?
Existing unit tests.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes #11940 from zsxwing/workaround-for-HADOOP-10622.
## What changes were proposed in this pull request?
This PR adds support for automatically inferring `IsNotNull` constraints from any non-nullable attributes that are part of an operator's output. This also fixes the issue that causes the optimizer to hit the maximum number of iterations for certain queries in https://github.com/apache/spark/pull/11828.
## How was this patch tested?
Unit test in `ConstraintPropagationSuite`
Author: Sameer Agarwal <sameer@databricks.com>
Closes #11953 from sameeragarwal/infer-isnotnull.
## What changes were proposed in this pull request?
JIRA: https://issues.apache.org/jira/browse/SPARK-12443
`constructorFor` calls `dataTypeFor` to determine whether a type is `ObjectType` or not. If there is no case for `Decimal`, it will be recognized as `ObjectType`, which causes the bug.
## How was this patch tested?
Test is added into `ExpressionEncoderSuite`.
Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes #10399 from viirya/fix-encoder-decimal.
## What changes were proposed in this pull request?
StateStoreCoordinator.reportActiveInstance is async, so subsequent state checks must be wrapped in `eventually`.
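A sketch of the fix pattern using ScalaTest's `Eventually` (the assertion body and names are illustrative):
```scala
import org.scalatest.concurrent.Eventually._
import org.scalatest.time.SpanSugar._

// reportActiveInstance is async, so retry the check until it passes or times out.
eventually(timeout(10.seconds)) {
  assert(coordinatorRef.verifyIfInstanceActive(storeId, executorId))
}
```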
## How was this patch tested?
Jenkins tests
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes #11924 from tdas/state-store-flaky-fix.
## What changes were proposed in this pull request?
This PR is a minor cleanup task as part of https://issues.apache.org/jira/browse/SPARK-14008 to explicitly identify/catch the `UnsupportedOperationException` while initializing the vectorized parquet reader. Other exceptions will simply be thrown back to `SqlNewHadoopPartition`.
## How was this patch tested?
N/A (cleanup only; no new functionality added)
Author: Sameer Agarwal <sameer@databricks.com>
Closes #11950 from sameeragarwal/parquet-cleanup.
## What changes were proposed in this pull request?
As we have `CreateArray` and `CreateStruct`, we should also have `CreateMap`. This PR adds the `CreateMap` expression, along with the DataFrame and Python APIs.
## How was this patch tested?
various new tests.
Author: Wenchen Fan <wenchen@databricks.com>
Closes #11879 from cloud-fan/create_map.
## What changes were proposed in this pull request?
This PR fixes the conflict between ColumnPruning and PushPredicatesThroughProject: ColumnPruning tries to insert a Project before a Filter, but PushPredicatesThroughProject moves the Filter before the Project. This is fixed by removing the Project before the Filter if the Project only does column pruning.
The RuleExecutor will now fail the test if it reaches the max iterations.
Closes #11745
## How was this patch tested?
Existing tests.
One test case is still failing; it is disabled for now and will be fixed by https://issues.apache.org/jira/browse/SPARK-14137
Author: Davies Liu <davies@databricks.com>
Closes #11828 from davies/fail_rule.
## What changes were proposed in this pull request?
In https://github.com/apache/spark/pull/11410, we missed a corner case: defining an inner class and using it in a `Dataset` at the same time in paste mode. In this case, the inner class and the `Dataset` are inside the same line object; when we build the `Dataset`, we try to get the outer pointer from the line object, and it fails because the line object is not initialized yet.
https://issues.apache.org/jira/browse/SPARK-13456?focusedCommentId=15209174&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15209174 is an example for this corner case.
This PR makes the process of getting the outer pointer from the line object lazy, so that we can successfully build the `Dataset` and finish initializing the line object.
## How was this patch tested?
new test in repl suite.
Author: Wenchen Fan <wenchen@databricks.com>
Closes #11931 from cloud-fan/repl.
## What changes were proposed in this pull request?
We ran into a problem today debugging a class loading problem during deserialization; the JVM was masking the underlying exception, which made it very difficult to debug. We can, however, log the exceptions ourselves using try/catch in serialization/deserialization. The good thing is that all these methods already use Utils.tryOrIOException, so we can put the try/catch and logging in a single place.
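A minimal sketch of the single-place change (the helper name comes from the description above; the body is illustrative):
```scala
import java.io.IOException
import scala.util.control.NonFatal

// Log before rethrowing so the JVM cannot mask the root cause during
// serialization/deserialization.
def tryOrIOException[T](block: => T): T = {
  try {
    block
  } catch {
    case e: IOException =>
      logError("Exception encountered", e)
      throw e
    case NonFatal(e) =>
      logError("Exception encountered", e)
      throw new IOException(e)
  }
}
```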
## How was this patch tested?
A logging change with a manual test.
Author: Reynold Xin <rxin@databricks.com>
Closes #11951 from rxin/SPARK-14149.
## What changes were proposed in this pull request?
This reopens #11836, which was merged but promptly reverted because it introduced flaky Hive tests.
## How was this patch tested?
See `CatalogTestCases`, `SessionCatalogSuite` and `HiveContextSuite`.
Author: Andrew Or <andrew@databricks.com>
Closes #11938 from andrewor14/session-catalog-again.
## What changes were proposed in this pull request?
Dataset has two variants of groupByKey, one for untyped and the other for typed. It actually doesn't make as much sense to have an untyped API here, since apps that want to use untyped APIs should just use the groupBy "DataFrame" API.
## How was this patch tested?
This patch removes a method, and removes the associated tests.
Author: Reynold Xin <rxin@databricks.com>
Closes #11949 from rxin/SPARK-14145.
## What changes were proposed in this pull request?
unionAll has been deprecated in SPARK-14088.
## How was this patch tested?
Should be covered by all existing tests.
Author: Reynold Xin <rxin@databricks.com>
Closes #11946 from rxin/SPARK-14142.
#### What changes were proposed in this pull request?
This PR is to support group by position in SQL. For example, when users input the following query
```SQL
select c1 as a, c2, c3, sum(*) from tbl group by 1, 3, c4
```
The ordinals are recognized as the positions in the select list. Thus, `Analyzer` converts it to
```SQL
select c1, c2, c3, sum(*) from tbl group by c1, c3, c4
```
This is controlled by the config option `spark.sql.groupByOrdinal`.
- When true, the ordinal numbers in group by clauses are treated as the position in the select list.
- When false, the ordinal numbers are ignored.
- Only integer literals are converted (not foldable expressions); foldable expressions are ignored.
- When a position specified in the group by clause corresponds to an aggregate function in the select list, an exception is thrown.
- A star is not allowed in the select list when users specify ordinals in group by.
Note: This PR is taken from https://github.com/apache/spark/pull/10731. When merging this PR, please give the credit to zhichao-li
Also cc all the people who are involved in the previous discussion: rxin cloud-fan marmbrus yhuai hvanhovell adrian-wang chenghao-intel tejasapatil
#### How was this patch tested?
Added a few test cases for both positive and negative test cases.
Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
Closes #11846 from gatorsmile/groupByOrdinal.
#### What changes were proposed in this pull request?
This PR is to support star expansion in hash. For example,
```scala
val structDf = testData2.select("a", "b").as("record")
structDf.select(hash($"*"))
```
In addition, it refactors the code for the rule `ResolveStar` and fixes a regression for star expansion in group by when using the SQL API. For example,
```SQL
SELECT * FROM testData2 group by a, b
```
cc cloud-fan Now, the code for star resolution is much cleaner. The coverage is better. Could you check if this refactoring is good? Thanks!
#### How was this patch tested?
Added a few test cases to cover it.
Author: gatorsmile <gatorsmile@gmail.com>
Closes #11904 from gatorsmile/starResolution.
## What changes were proposed in this pull request?
`SessionCatalog`, introduced in #11750, is a catalog that keeps track of temporary functions and tables, and delegates metastore operations to `ExternalCatalog`. This functionality overlaps a lot with the existing `analysis.Catalog`.
As of this commit, `SessionCatalog` and `ExternalCatalog` will no longer be dead code. There are still things that need to be done after this patch, namely:
- SPARK-14013: Properly implement temporary functions in `SessionCatalog`
- SPARK-13879: Decide which DDL/DML commands to support natively in Spark
- SPARK-?????: Implement the ones we do want to support through `SessionCatalog`.
- SPARK-?????: Merge SQL/HiveContext
## How was this patch tested?
This is largely a refactoring task so there are no new tests introduced. The particularly relevant tests are `SessionCatalogSuite` and `ExternalCatalogSuite`.
Author: Andrew Or <andrew@databricks.com>
Author: Yin Huai <yhuai@databricks.com>
Closes #11836 from andrewor14/use-session-catalog.
This PR adds a new `Sink` implementation that writes out Parquet files. In order to correctly handle partial failures while maintaining exactly once semantics, the files for each batch are written out to a unique directory and then atomically appended to a metadata log. When a parquet based `DataSource` is initialized for reading, we first check for this log directory and use it instead of file listing when present.
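A hedged sketch of this commit protocol (the names here are illustrative, not the actual implementation):
```scala
// Write each batch to a unique directory, then atomically record it in the
// metadata log; replaying a batch that is already logged becomes a no-op.
def addBatch(batchId: Long, data: DataFrame): Unit = {
  if (batchId <= fileLog.getLatestBatchId()) return   // already committed: skip
  val batchDir = new Path(basePath, java.util.UUID.randomUUID().toString)
  data.write.parquet(batchDir.toString)               // partial failures leave orphans,
  fileLog.add(batchId, Seq(batchDir.toString))        // but only logged dirs are read
}
```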
Unit tests are added, as well as a stress test that checks the answer after non-deterministic injected failures.
Author: Michael Armbrust <michael@databricks.com>
Closes #11897 from marmbrus/fileSink.
This PR introduces a 64-bit hashcode expression. Such an expression is especially useful for HyperLogLog++ and other probabilistic data structures.
I have implemented xxHash64, a 64-bit hashing algorithm created by Yann Collet and Mathias Westerdahl. This is a high-speed (the C implementation runs at memory bandwidth) and high-quality hashcode. It exploits both instruction-level parallelism (for speed) and the multiplication and rotation techniques (for quality) that MurmurHash uses.
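For intuition, one lane of the xxHash64 inner loop looks roughly like this (constants from the published xxHash reference; a sketch, not Spark's implementation):
```scala
// Multiply-and-rotate mixing step: the multiplications drive hash quality, and
// the four independent lanes in the full algorithm drive instruction-level
// parallelism.
val Prime64_1 = 0x9E3779B185EBCA87L
val Prime64_2 = 0xC2B2AE3D27D4EB4FL

def round(acc: Long, input: Long): Long =
  java.lang.Long.rotateLeft(acc + input * Prime64_2, 31) * Prime64_1
```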
The initial results are promising. I have added a CG'ed test to the `HashBenchmark`, which produces the following results (running from SBT):
```
Running benchmark: Hash For simple
Running case: interpreted version
Running case: codegen version
Running case: codegen version 64-bit
Intel(R) Core(TM) i7-4750HQ CPU 2.00GHz
Hash For simple: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
-------------------------------------------------------------------------------------------
interpreted version 1011 / 1016 132.8 7.5 1.0X
codegen version 1864 / 1869 72.0 13.9 0.5X
codegen version 64-bit 1614 / 1644 83.2 12.0 0.6X
Running benchmark: Hash For normal
Running case: interpreted version
Running case: codegen version
Running case: codegen version 64-bit
Intel(R) Core(TM) i7-4750HQ CPU 2.00GHz
Hash For normal: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
-------------------------------------------------------------------------------------------
interpreted version 2467 / 2475 0.9 1176.1 1.0X
codegen version 2008 / 2115 1.0 957.5 1.2X
codegen version 64-bit 728 / 758 2.9 347.0 3.4X
Running benchmark: Hash For array
Running case: interpreted version
Running case: codegen version
Running case: codegen version 64-bit
Intel(R) Core(TM) i7-4750HQ CPU 2.00GHz
Hash For array: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
-------------------------------------------------------------------------------------------
interpreted version 1544 / 1707 0.1 11779.6 1.0X
codegen version 2728 / 2745 0.0 20815.5 0.6X
codegen version 64-bit 2508 / 2549 0.1 19132.8 0.6X
Running benchmark: Hash For map
Running case: interpreted version
Running case: codegen version
Running case: codegen version 64-bit
Intel(R) Core(TM) i7-4750HQ CPU 2.00GHz
Hash For map: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
-------------------------------------------------------------------------------------------
interpreted version 1819 / 1826 0.0 444014.3 1.0X
codegen version 183 / 194 0.0 44642.9 9.9X
codegen version 64-bit 173 / 174 0.0 42120.9 10.5X
```
This shows that the algorithm is consistently faster than MurmurHash32 in all cases, and up to 3x (!) faster in the normal case.
I have also added this to HyperLogLog++ and it cuts the processing time of the following code in half:
```
val df = sqlContext.range(1<<25).agg(approxCountDistinct("id"))
df.explain()
val t = System.nanoTime()
df.show()
val ns = System.nanoTime() - t
// Before
ns: Long = 5821524302
// After
ns: Long = 2836418963
```
cc cloud-fan (you have been working on hashcodes) / rxin
Author: Herman van Hovell <hvanhovell@questtec.nl>
Closes #11209 from hvanhovell/xxHash.
## What changes were proposed in this pull request?
In this PR, I am implementing a new abstraction for the management of streaming state data: the State Store. It is a key-value store for persisting running aggregates for aggregate operations in streaming dataframes. The motivation and design are discussed here.
https://docs.google.com/document/d/1-ncawFx8JS5Zyfq1HAEGBx56RDet9wfVp_hDM8ZL254/edit#
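The core abstraction, sketched (method names are assumptions based on the description; see the design doc above for the real interface):
```scala
import org.apache.spark.sql.catalyst.expressions.UnsafeRow

sealed trait StoreUpdate // placeholder for added/updated/removed entries

// A versioned key-value store for running aggregates in streaming queries.
trait StateStore {
  def get(key: UnsafeRow): Option[UnsafeRow]       // current aggregate for a key
  def put(key: UnsafeRow, value: UnsafeRow): Unit  // upsert a new aggregate value
  def commit(): Long                               // atomically publish a version
  def updates(): Iterator[StoreUpdate]             // changes made in this version
}
```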
## How was this patch tested?
- [x] Unit tests
- [x] Cluster tests
**Coverage from unit tests**
<img width="952" alt="screen shot 2016-03-21 at 3 09 40 pm" src="https://cloud.githubusercontent.com/assets/663212/13935872/fdc8ba86-ef76-11e5-93e8-9fa310472c7b.png">
## TODO
- [x] Fix updates() iterator to avoid duplicate updates for same key
- [x] Use Coordinator in ContinuousQueryManager
- [x] Plugging in hadoop conf and other confs
- [x] Unit tests
- [x] StateStore object lifecycle and methods
- [x] StateStoreCoordinator communication and logic
- [x] StateStoreRDD fault-tolerance
- [x] StateStoreRDD preferred location using StateStoreCoordinator
- [ ] Cluster tests
- [ ] Whether preferred locations are set correctly
- [ ] Whether recovery works correctly with distributed storage
- [x] Basic performance tests
- [x] Docs
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes #11645 from tdas/state-store.
## What changes were proposed in this pull request?
This PR adds support for TimestampType in the vectorized parquet reader.
## How was this patch tested?
1. `VectorizedColumnReader` initially had a gating condition on `primitiveType.getPrimitiveTypeName() == PrimitiveType.PrimitiveTypeName.INT96)` that made us fall back on parquet-mr for handling timestamps. This condition is now removed.
2. The `ParquetHadoopFsRelationSuite` (that tests for all supported hive types -- including `TimestampType`) fails when the gating condition is removed (https://github.com/apache/spark/pull/11808) and should now pass with this change. Similarly, the `ParquetHiveCompatibilitySuite.SPARK-10177 timestamp` test that fails when the gating condition is removed, should now pass as well.
3. Added tests in `HadoopFsRelationTest` that test both the dictionary encoded and non-encoded versions across all supported datatypes.
Author: Sameer Agarwal <sameer@databricks.com>
Closes #11882 from sameeragarwal/timestamp-parquet.
## What changes were proposed in this pull request?
This PR rolls back some changes in #11274, which introduced a performance regression when doing a simple aggregation on a parquet scan with one integer column.
I do not really understand how this change introduced such a huge impact; it may be related to how the JIT compiler inlines functions (profiling showed very different stats).
## How was this patch tested?
Manually run the parquet reader benchmark, before this change:
```
Intel(R) Core(TM) i7-4558U CPU 2.80GHz
Int and String Scan: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
-------------------------------------------------------------------------------------------
SQL Parquet Vectorized 2391 / 3107 43.9 22.8 1.0X
```
After this change
```
Java HotSpot(TM) 64-Bit Server VM 1.7.0_60-b19 on Mac OS X 10.9.5
Intel(R) Core(TM) i7-4558U CPU 2.80GHz
Int and String Scan: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
-------------------------------------------------------------------------------------------
SQL Parquet Vectorized 2032 / 2626 51.6 19.4 1.0X
```
Author: Davies Liu <davies@databricks.com>
Closes #11912 from davies/fix_regression.
This patch refactors the `MemoryStore` so that it can be tested without needing to construct / mock an entire `BlockManager`.
- The block manager's serialization- and compression-related methods have been moved from `BlockManager` to `SerializerManager`.
- `BlockInfoManager` is now passed directly to classes that need it, rather than being passed via the `BlockManager`.
- The `MemoryStore` now calls `dropFromMemory` via a new `BlockEvictionHandler` interface rather than directly calling the `BlockManager`. This change helps to enforce a narrow interface between the `MemoryStore` and `BlockManager` functionality and makes this interface easier to mock in tests.
- Several of the block unrolling tests have been moved from `BlockManagerSuite` into a new `MemoryStoreSuite`.
Author: Josh Rosen <joshrosen@databricks.com>
Closes #11899 from JoshRosen/reduce-memorystore-blockmanager-coupling.
#### What changes were proposed in this pull request?
The PR https://github.com/apache/spark/pull/10541 changed the rule `CollapseProject` by enabling collapsing `Project` into `Aggregate`. It left a to-do item to remove the duplicate code. This PR finishes that to-do item and also adds a test case covering this change.
#### How was this patch tested?
Added a new test case.
liancheng Could you check if the code refactoring is fine? Thanks!
Author: gatorsmile <gatorsmile@gmail.com>
Closes #11427 from gatorsmile/collapseProjectRefactor.
## What changes were proposed in this pull request?
This PR does the renaming as suggested by marmbrus in [this comment][1].
## How was this patch tested?
Existing tests.
[1]: 6d37e1eb90 (commitcomment-16654694)
Author: Cheng Lian <lian@databricks.com>
Closes #11889 from liancheng/spark-13817-follow-up.
## What changes were proposed in this pull request?
This PR tries to fix a flaky hang.
## How was this patch tested?
Existing Jenkins test
Author: Shixiong Zhu <shixiong@databricks.com>
Closes #11909 from zsxwing/hotfix2.
## What changes were proposed in this pull request?
1. Deprecated unionAll. It is pretty confusing to have both "union" and "unionAll" when the two do the same thing in Spark but are different in SQL.
2. Rename reduce in KeyValueGroupedDataset to reduceGroups so it is more consistent with rest of the functions in KeyValueGroupedDataset. Also makes it more obvious what "reduce" and "reduceGroups" mean. Previously it was confusing because it could be reducing a Dataset, or just reducing groups.
3. Added a "name" function, which is more natural to name columns than "as" for non-SQL users.
4. Remove "subtract" function since it is just an alias for "except".
## How was this patch tested?
All changes should be covered by existing tests. Also added a couple of test cases to cover "name".
Author: Reynold Xin <rxin@databricks.com>
Closes #11908 from rxin/SPARK-14088.
## What changes were proposed in this pull request?
This PR updates `sql/README.md` according to the latest console output and removes some unused imports in the `sql` module. This was done manually, so there is no guarantee that all unused imports were removed.
## How was this patch tested?
Manual.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #11907 from dongjoon-hyun/update_sql_module.
## What changes were proposed in this pull request?
This patch fixes several SQL test warnings under the sql/core/src/test directory. The fixed warnings include "[unchecked]", "[rawtypes]", and "[varargs]".
## How was this patch tested?
All existing tests passed.
Author: Yong Tang <yong.tang.github@outlook.com>
Closes #11857 from yongtang/SPARK-13401.
## What changes were proposed in this pull request?
Round() in databases usually rounds the number half away from zero, which is different from Math.round() in Java.
For example:
```
scala> java.lang.Math.round(-3.5)
res3: Long = -3
```
In databases, we should return -4.0 in this case.
This PR removes the buggy special case for scale=0.
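For example, after this change (a sketch of the expected behavior):
```scala
// Databases round half away from zero, unlike java.lang.Math.round:
sqlContext.sql("SELECT round(-3.5)").collect()
// expected: Array([-4.0]) rather than the Math.round-style -3
```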
## How was this patch tested?
Add tests for negative values with tie.
Author: Davies Liu <davies@databricks.com>
Closes #11894 from davies/fix_round.
## What changes were proposed in this pull request?
Fixes an issue where DataFrameReaderWriterSuite may hang forever.
## How was this patch tested?
Existing tests.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes #11902 from zsxwing/hotfix.
## What changes were proposed in this pull request?
This patch moves the StringToColumn implicit class into SQLImplicits. It was kept in the SQLContext.implicits object for binary backward compatibility in the Spark 1.x series. It makes more sense for this API to be in SQLImplicits since that's the single class that defines all the SQL implicits.
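For reference, this is the implicit that powers the `$` string interpolator (usage sketch; `df` is an assumed DataFrame):
```scala
import sqlContext.implicits._   // brings StringToColumn into scope

// $"..." builds a Column from a string wherever the implicits are imported:
df.where($"age" > 21).select($"name")
```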
## How was this patch tested?
Should be covered by existing unit tests.
Author: Reynold Xin <rxin@databricks.com>
Author: Wenchen Fan <wenchen@databricks.com>
Closes #11878 from rxin/SPARK-14060.
## What changes were proposed in this pull request?
This patch changed the return type for SQLContext.range from `Dataset[Long]` (Scala primitive) to `Dataset[java.lang.Long]` (Java boxed long).
Previously, SPARK-13894 changed the return type of range from `Dataset[Row]` to `Dataset[Long]`. The problem is that due to https://issues.scala-lang.org/browse/SI-4388, Scala compiles primitive types in generics into just Object, i.e. range at the bytecode level now just returns `Dataset[Object]`. This is really bad for Java users because they lose type safety and also need to add a type cast every time they use range.
Talked to Jason Zaugg from Lightbend (Typesafe) who suggested the best approach is to return `Dataset[java.lang.Long]`. The downside is that when Scala users want to explicitly type a closure used on the dataset returned by range, they would need to use `java.lang.Long` instead of the Scala `Long`.
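The practical consequence for Scala callers, sketched (exact era-specific signatures may differ):
```scala
// range now returns Dataset[java.lang.Long]; explicitly typed closures must
// use the boxed type, as described in the trade-off above.
val ds: Dataset[java.lang.Long] = sqlContext.range(1000)
ds.filter((n: java.lang.Long) => n % 2 == 0).count()
```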
## How was this patch tested?
The signature change should be covered by existing unit tests and API tests. I also added a new test case in DatasetSuite for range.
Author: Reynold Xin <rxin@databricks.com>
Closes #11880 from rxin/SPARK-14063.
This PR relaxes the requirements of a `Sink` for structured streaming to only require idempotent appending of data. Previously the `Sink` needed to be able to transactionally append data while recording an opaque offset indicating how far into the stream we have processed.
In order to do this, a new write-ahead-log has been added to stream execution, which records the offsets that are present in each batch. The log is created in the newly added `checkpointLocation`, which defaults to `${spark.sql.streaming.checkpointLocation}/${queryName}` but can be overridden by setting `checkpointLocation` in `DataFrameWriter`.
In addition to making sinks easier to write, the addition of batchIds and a checkpoint location is done in anticipation of integration with the `StateStore` (#11645).
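The relaxed contract, roughly (an assumed shape matching the description, not necessarily the exact trait):
```scala
import org.apache.spark.sql.DataFrame

trait Sink {
  /**
   * Adds a batch of data. If a failure causes the same batchId to be replayed,
   * the sink must treat the repeated call as a no-op (idempotent append).
   */
  def addBatch(batchId: Long, data: DataFrame): Unit
}
```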
Author: Michael Armbrust <michael@databricks.com>
Closes #11804 from marmbrus/batchIds.
## What changes were proposed in this pull request?
Currently, **BooleanSimplification** optimization can handle the following cases.
* a && (!a || b ) ==> a && b
* a && (b || !a ) ==> a && b
However, it cannot handle the following cases, since those equations fail at the comparisons between their canonicalized forms.
* a < 1 && (!(a < 1) || b) ==> (a < 1) && b
* a <= 1 && (!(a <= 1) || b) ==> (a <= 1) && b
* a > 1 && (!(a > 1) || b) ==> (a > 1) && b
* a >= 1 && (!(a >= 1) || b) ==> (a >= 1) && b
This PR implements the above cases and also the followings, too.
* a < 1 && ((a >= 1) || b ) ==> (a < 1) && b
* a <= 1 && ((a > 1) || b ) ==> (a <= 1) && b
* a > 1 && ((a <= 1) || b) ==> (a > 1) && b
* a >= 1 && ((a < 1) || b) ==> (a >= 1) && b
## How was this patch tested?
Pass the Jenkins tests, including new test cases in BooleanSimplificationSuite.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #11851 from dongjoon-hyun/SPARK-14029.
SPARK-13774: IllegalArgumentException: Can not create a Path from an empty string for incorrect file path
**Overview:**
- If a non-existent path is given in this call
```
scala> sqlContext.read.format("csv").load("file-path-is-incorrect.csv")
```
it throws the following error:
`java.lang.IllegalArgumentException: Can not create a Path from an empty string` …..
It gets called from the inferSchema call in `org.apache.spark.sql.execution.datasources.DataSource.resolveRelation`.
- The purpose of this JIRA is to throw a better error message.
- With the fix, you will now get a _Path does not exist_ error message.
```
scala> sqlContext.read.format("csv").load("file-path-is-incorrect.csv")
org.apache.spark.sql.AnalysisException: Path does not exist: file:/Users/ksunitha/trunk/spark/file-path-is-incorrect.csv;
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:215)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:204)
...
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:204)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:131)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:141)
... 49 elided
```
**Details**
_Changes include:_
- Check whether the path exists in resolveRelation in DataSource, and throw an AnalysisException with a message like “Path does not exist: $path”
- AnalysisException is thrown similar to the exceptions thrown in resolveRelation.
- Both the glob path and the non-glob path are checked with minimal calls to path exists. If the globPath is empty, it is a nonexistent glob pattern and an error will be thrown. If it is not a glob path, only the first element in the Seq needs to be checked for validity.
_Test modifications:_
- Changes went in for 3 tests to account for this error checking.
- SQLQuerySuite:test("run sql directly on files") – Error message needed to be updated.
- 2 tests failed in MetastoreDataSourcesSuite because they used a dummy path; they are modified to use a temp dir so they can move past the check and continue to test the code path they were meant to test
_New Tests:_
2 new tests are added to DataFrameSuite to validate that both glob and non-glob paths throw the new error message.
_Testing:_
Unit tests were run with the fix.
**Notes/Questions to reviewers:**
- There is some code duplication in DataSource.scala between the resolveRelation method and createSource with respect to getting the paths. I have not made any changes to the createSource code path. Should we make the change there as well?
- From other JIRAs, I know there is restructuring and changes going on in this area, not sure how that will affect these changes, but since this seemed like a starter issue, I looked into it. If we prefer not to add the overhead of the checks, or if there is a better place to do so, let me know.
I would appreciate your review. Thanks for your time and comments.
Author: Sunitha Kambhampati <skambha@us.ibm.com>
Closes #11775 from skambha/improve_errmsg.
## What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-13953
Currently, the JSON data source creates a new field in `PERMISSIVE` mode for storing the malformed string.
This field can be renamed via the `spark.sql.columnNameOfCorruptRecord` option, but that is a global configuration.
This PR makes the option applicable per read, specified via `option()`. This overrides `spark.sql.columnNameOfCorruptRecord` if it is set.
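Usage sketch of the per-read option (the option key mirrors the SQL conf name; the file path is illustrative):
```scala
// Overrides spark.sql.columnNameOfCorruptRecord for this read only:
val df = sqlContext.read
  .option("columnNameOfCorruptRecord", "_malformed")
  .json("people.json")
```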
## How was this patch tested?
Unit tests were used and `./dev/run_tests` for coding style tests.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes #11881 from HyukjinKwon/SPARK-13953.
## What changes were proposed in this pull request?
This is a follow-up of PR #11348.
After PR #11348, a predicate is never pushed through a project as long as the project contains any non-deterministic fields. Thus, the candidate filter condition can never reference any non-deterministic projected fields, and the related logic can be safely cleaned up.
To be more specific, the following optimization is allowed:
```scala
// From:
df.select('a, 'b).filter('c > rand(42))
// To:
df.filter('c > rand(42)).select('a, 'b)
```
while this isn't:
```scala
// From:
df.select('a, rand('b) as 'rb, 'c).filter('c > 'rb)
// To:
df.filter('c > rand('b)).select('a, rand('b) as 'rb, 'c)
```
## How was this patch tested?
Existing test cases should do the work.
Author: Cheng Lian <lian@databricks.com>
Closes #11864 from liancheng/spark-13473-cleanup.
## What changes were proposed in this pull request?
As we have completed the `SQLBuilder`, we can safely turn on native view by default.
## How was this patch tested?
existing tests.
Author: Wenchen Fan <wenchen@databricks.com>
Closes #11872 from cloud-fan/native-view.
This PR implements the new `buildReader` interface for the Parquet `FileFormat`. A simple implementation of `FileScanRDD` is also included.
This code should be tested by the many existing tests for parquet.
Author: Michael Armbrust <michael@databricks.com>
Author: Sameer Agarwal <sameer@databricks.com>
Author: Nong Li <nong@databricks.com>
Closes #11709 from marmbrus/parquetReader.
## What changes were proposed in this pull request?
This patch adds support for reading `DecimalTypes` with high (> 18) precision in `VectorizedColumnReader`.
## How was this patch tested?
1. `VectorizedColumnReader` initially had a gating condition on `primitiveType.getDecimalMetadata().getPrecision() > Decimal.MAX_LONG_DIGITS()` that made us fall back on parquet-mr for handling high-precision decimals. This condition is now removed.
2. In particular, the `ParquetHadoopFsRelationSuite` (that tests for all supported hive types -- including `DecimalType(25, 5)`) fails when the gating condition is removed (https://github.com/apache/spark/pull/11808) and should now pass with this change.
Author: Sameer Agarwal <sameer@databricks.com>
Closes #11869 from sameeragarwal/bigdecimal-parquet.
This PR resolves two issues:
First, expanding * inside aggregate functions over structs when using the DataFrame/Dataset APIs. For example,
```scala
structDf.groupBy($"a").agg(min(struct($"record.*")))
```
Second, it improves the error messages for invalid star usage when using the DataFrame/Dataset APIs. For example,
```scala
pagecounts4PartitionsDS
.map(line => (line._1, line._3))
.toDF()
.groupBy($"_1")
.agg(sum("*") as "sumOccurances")
```
Before the fix, the invalid usage will issue a confusing error message, like:
```
org.apache.spark.sql.AnalysisException: cannot resolve '_1' given input columns _1, _2;
```
After the fix, the message is like:
```
org.apache.spark.sql.AnalysisException: Invalid usage of '*' in function 'sum'
```
cc: rxin nongli cloud-fan
Author: gatorsmile <gatorsmile@gmail.com>
Closes #11208 from gatorsmile/sumDataSetResolution.
## What changes were proposed in this pull request?
This patch merges DatasetHolder and DataFrameHolder. This makes more sense because DataFrame/Dataset are now one class.
In addition, it fixes some minor issues with pull request #11732.
## How was this patch tested?
Updated existing unit tests that test these implicits.
Author: Reynold Xin <rxin@databricks.com>
Closes #11737 from rxin/SPARK-13898.
## What changes were proposed in this pull request?
WholeStageCodegen naturally breaks the execution into pipelines whose durations are easier to measure. This is more granular than the task timings (a task can span multiple pipelines) and is integrated with the web UI.
We currently report total time (across all tasks) and min/max/median to get a sense of how long each is taking.
## How was this patch tested?
Manually tested looking at the web ui.
Author: Nong Li <nong@databricks.com>
Closes #11741 from nongli/spark-13916.
## What changes were proposed in this pull request?
There is only one exception: `PythonUDF`. However, I don't think the `PythonUDF#` prefix is useful, as we can only create a python udf in a python context. This PR removes the `PythonUDF#` prefix from `PythonUDF.toString`, so that it doesn't need to override `sql`.
## How was this patch tested?
existing tests.
Author: Wenchen Fan <wenchen@databricks.com>
Closes #11859 from cloud-fan/tmp.
## What changes were proposed in this pull request?
This PR generates code that gets a value in each column from ```ColumnVector``` instead of creating ```InternalRow``` when ```ColumnarBatch``` is accessed. This PR improves the benchmark program by up to 15%.
This PR consists of two parts:
1. Get a ```ColumnVector``` by using the ```ColumnarBatch.column()``` method
2. Get a value of each column by using ```rdd_col${COLIDX}.getInt(ROWIDX)``` instead of ```rdd_row.getInt(COLIDX)```
This is a motivating example.
````
sqlContext.conf.setConfString(SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key, "true")
sqlContext.conf.setConfString(SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key, "true")
val values = 10
withTempPath { dir =>
withTempTable("t1", "tempTable") {
sqlContext.range(values).registerTempTable("t1")
sqlContext.sql("select id % 2 as p, cast(id as INT) as id from t1")
.write.partitionBy("p").parquet(dir.getCanonicalPath)
sqlContext.read.parquet(dir.getCanonicalPath).registerTempTable("tempTable")
sqlContext.sql("select sum(p) from tempTable").collect
}
}
````
The original code
````java
...
/* 072 */ while (!shouldStop() && rdd_batchIdx < numRows) {
/* 073 */ InternalRow rdd_row = rdd_batch.getRow(rdd_batchIdx++);
/* 074 */ /*** CONSUME: TungstenAggregate(key=[], functions=[(sum(cast(p#4 as bigint)),mode=Partial,isDistinct=false)], output=[sum#10L]) */
/* 075 */ /* input[0, int] */
/* 076 */ boolean rdd_isNull = rdd_row.isNullAt(0);
/* 077 */ int rdd_value = rdd_isNull ? -1 : (rdd_row.getInt(0));
...
````
The code generated by this PR
````java
/* 072 */ while (!shouldStop() && rdd_batchIdx < numRows) {
/* 073 */ org.apache.spark.sql.execution.vectorized.ColumnVector rdd_col0 = rdd_batch.column(0);
/* 074 */ /*** CONSUME: TungstenAggregate(key=[], functions=[(sum(cast(p#4 as bigint)),mode=Partial,isDistinct=false)], output=[sum#10L]) */
/* 075 */ /* input[0, int] */
/* 076 */ boolean rdd_isNull = rdd_col0.getIsNull(rdd_batchIdx);
/* 077 */ int rdd_value = rdd_isNull ? -1 : (rdd_col0.getInt(rdd_batchIdx));
...
/* 128 */ rdd_batchIdx++;
/* 129 */ }
/* 130 */ if (shouldStop()) return;
````
Performance
Without this PR
````
model name : Intel(R) Xeon(R) CPU E5-2667 v2 3.30GHz
Partitioned Table: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
-------------------------------------------------------------------------------------------
Read data column 434 / 488 36.3 27.6 1.0X
Read partition column 302 / 346 52.1 19.2 1.4X
Read both columns 588 / 643 26.8 37.4 0.7X
````
With this PR
````
model name : Intel(R) Xeon(R) CPU E5-2667 v2 3.30GHz
Partitioned Table: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
-------------------------------------------------------------------------------------------
Read data column 392 / 516 40.1 24.9 1.0X
Read partition column 256 / 318 61.4 16.3 1.5X
Read both columns 523 / 539 30.1 33.3 0.7X
````
## How was this patch tested?
Tested by existing test suites and benchmark
Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Closes #11636 from kiszk/SPARK-13805.
## What changes were proposed in this pull request?
This PR tries to acquire the memory for the hash map in shuffled hash join up front, failing the task if there is not enough memory (otherwise it could OOM the executor).
It also removes the unused HashedRelation.
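A sketch of the fail-fast acquisition described above (illustrative; the message and variable names are not the exact code):
```scala
// Reserve memory for the hash map up front and fail the task rather than OOM.
val granted = taskMemoryManager.acquireExecutionMemory(requiredBytes, consumer)
if (granted < requiredBytes) {
  taskMemoryManager.releaseExecutionMemory(granted, consumer)
  throw new SparkException(
    s"Can't acquire $requiredBytes bytes of memory to build the hash relation")
}
```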
## How was this patch tested?
Existing unit tests. Manual tests with TPCDS Q78.
Author: Davies Liu <davies@databricks.com>
Closes #11826 from davies/cleanup_hash2.
## What changes were proposed in this pull request?
This is a more aggressive version of PR #11820, which not only fixes the original problem, but also makes the following updates to enforce the at-most-one-qualifier constraint:
- Renames `NamedExpression.qualifiers` to `NamedExpression.qualifier`
- Uses `Option[String]` rather than `Seq[String]` for `NamedExpression.qualifier`
Quoted PR description of #11820 here:
> Current implementations of `AttributeReference.sql` and `Alias.sql` joins all available qualifiers, which is logically wrong. But this implementation mistake doesn't cause any real SQL generation bugs though, since there is always at most one qualifier for any given `AttributeReference` or `Alias`.
## How was this patch tested?
Existing tests should be enough.
Author: Cheng Lian <lian@databricks.com>
Closes #11822 from liancheng/spark-14004-aggressive.
## What changes were proposed in this pull request?
Case classes defined in the REPL are wrapped by line classes, and we have a trick for the Scala 2.10 REPL to automatically register the wrapper classes to `OuterScope` so that we can use them when creating encoders.
However, this trick doesn't work right after we upgraded to Scala 2.11, and unfortunately the tests are only in Scala 2.10, which kept this bug hidden until now.
This PR moves the encoder tests to the Scala 2.11 `ReplSuite`, and fixes this bug by another approach (the previous trick can't be ported to the Scala 2.11 REPL): make `OuterScope` smarter so that it can detect classes defined in the REPL and load the singletons of line wrapper classes automatically.
## How was this patch tested?
the migrated encoder tests in `ReplSuite`
Author: Wenchen Fan <wenchen@databricks.com>
Closes #11410 from cloud-fan/repl.
## What changes were proposed in this pull request?
Ad-hoc Dataset API ScalaDoc fixes
## How was this patch tested?
By building and checking ScalaDoc locally.
Author: Cheng Lian <lian@databricks.com>
Closes #11862 from liancheng/ds-doc-fixes.
## What changes were proposed in this pull request?
`SubqueryHolder` is only used when generating SQL strings in `SQLBuilder`, so it's clearer to make it an inner class of `SQLBuilder`.
## How was this patch tested?
existing tests.
Author: Wenchen Fan <wenchen@databricks.com>
Closes #11861 from cloud-fan/gensql.
## What changes were proposed in this pull request?
When we validate an encoder, we may call `dataType` on unresolved expressions. This PR fixes the validation so that we resolve attributes first.
## How was this patch tested?
a new test in `DatasetSuite`
Author: Wenchen Fan <wenchen@databricks.com>
Closes #11816 from cloud-fan/encoder.
#### What changes were proposed in this pull request?
This PR is to support order by position in SQL, e.g.
```SQL
select c1, c2, c3 from tbl order by 1 desc, 3
```
should be equivalent to
```SQL
select c1, c2, c3 from tbl order by c1 desc, c3 asc
```
This is controlled by the config option `spark.sql.orderByOrdinal` (a usage sketch follows the list below).
- When true, the ordinal numbers are treated as positions in the select list.
- When false, ordinal numbers in the ORDER BY/SORT BY clauses are ignored.
- Only integer literals are converted (not general foldable expressions); foldable expressions are ignored.
- This also works with select *.
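A hedged usage sketch of the flag (hypothetical table `tbl` with columns `c1`, `c2`, `c3`):
```scala
// With the flag on, ordinals refer to select-list positions.
sqlContext.setConf("spark.sql.orderByOrdinal", "true")
sqlContext.sql("SELECT c1, c2, c3 FROM tbl ORDER BY 1 DESC, 3")  // sorts by c1 DESC, c3 ASC

// With the flag off, the integer literals are ignored as sort keys.
sqlContext.setConf("spark.sql.orderByOrdinal", "false")
sqlContext.sql("SELECT c1, c2, c3 FROM tbl ORDER BY 1 DESC, 3")
```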
**Question**: Do we still need to sort by columns that contain zero references? In this case, they have no impact on the sorting results. IMO, we should not allow users to do it. rxin cloud-fan marmbrus yhuai hvanhovell
-- Update: such sort columns are simply ignored.
**Note**: This PR is taken from https://github.com/apache/spark/pull/10731. When merging this PR, please give the credit to zhichao-li
Also cc all the people who are involved in the previous discussion: adrian-wang chenghao-intel tejasapatil
#### How was this patch tested?
Added a few test cases for both positive and negative test cases.
Author: gatorsmile <gatorsmile@gmail.com>
Closes#11815 from gatorsmile/orderByPosition.
## What changes were proposed in this pull request?
This PR adds some proper periods and spaces to Spark CLI help messages and SQL/YARN conf docs for consistency.
## How was this patch tested?
Manual.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#11848 from dongjoon-hyun/add_proper_period_and_space.
## What changes were proposed in this pull request?
[Spark Coding Style Guide](https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide) has a 100-character limit on lines, but it has been disabled for Java since 11/09/15. This PR enables the **LineLength** checkstyle rule again. To help with that, it also introduces **RedundantImport** and **RedundantModifier**. The following is the diff on `checkstyle.xml`.
```xml
- <!-- TODO: 11/09/15 disabled - the lengths are currently > 100 in many places -->
- <!--
<module name="LineLength">
<property name="max" value="100"/>
<property name="ignorePattern" value="^package.*|^import.*|a href|href|http://|https://|ftp://"/>
</module>
- -->
<module name="NoLineWrap"/>
<module name="EmptyBlock">
<property name="option" value="TEXT"/>
@@ -167,5 +164,7 @@
</module>
<module name="CommentsIndentation"/>
<module name="UnusedImports"/>
+ <module name="RedundantImport"/>
+ <module name="RedundantModifier"/>
```
## How was this patch tested?
Currently, `lint-java` is disabled in Jenkins, so this needs a manual test.
After passing the Jenkins tests, `dev/lint-java` should pass locally.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#11831 from dongjoon-hyun/SPARK-14011.
## What changes were proposed in this pull request?
Currently, there is no way to control the behaviour when the JSON data source fails to parse corrupt records.
This PR adds support for parse modes just like the CSV data source. There are three modes:
- `PERMISSIVE`: when a record fails to parse, the malformed fields are set to `null`. This is the default mode.
- `DROPMALFORMED`: when a record fails to parse, the whole record is dropped.
- `FAILFAST`: when a record fails to parse, an exception is thrown.
This PR also makes the JSON data source share the `ParseModes` with the CSV data source.
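A minimal usage sketch, assuming the option key is `mode` as in the CSV source (the path is hypothetical):
```scala
// PERMISSIVE is the default: malformed fields become null.
val permissive = sqlContext.read
  .option("mode", "PERMISSIVE")
  .json("/tmp/records.json")

// DROPMALFORMED drops whole malformed records.
val dropMalformed = sqlContext.read
  .option("mode", "DROPMALFORMED")
  .json("/tmp/records.json")

// FAILFAST throws on the first malformed record.
val failFast = sqlContext.read
  .option("mode", "FAILFAST")
  .json("/tmp/records.json")
```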
## How was this patch tested?
Unit tests were used and `./dev/run_tests` for code style tests.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#11756 from HyukjinKwon/SPARK-13764.
#### What changes were proposed in this pull request?
This PR adds a new Optimizer rule for pruning Sort when its SortOrders are no-ops. In the **Optimizer** phase, if a specific `SortOrder` does not have any references, it has no effect on the sorting results. If the resulting `Sort` is empty, remove the whole `Sort`.
For example, in the following SQL query
```SQL
SELECT * FROM t ORDER BY NULL + 5
```
Before the fix, the plan is like
```
== Analyzed Logical Plan ==
a: int, b: int
Sort [(cast(null as int) + 5) ASC], true
+- Project [a#92,b#93]
+- SubqueryAlias t
+- Project [_1#89 AS a#92,_2#90 AS b#93]
+- LocalRelation [_1#89,_2#90], [[1,2],[1,2]]
== Optimized Logical Plan ==
Sort [null ASC], true
+- LocalRelation [a#92,b#93], [[1,2],[1,2]]
== Physical Plan ==
WholeStageCodegen
: +- Sort [null ASC], true, 0
: +- INPUT
+- Exchange rangepartitioning(null ASC, 5), None
+- LocalTableScan [a#92,b#93], [[1,2],[1,2]]
```
After the fix, the plan is like
```
== Analyzed Logical Plan ==
a: int, b: int
Sort [(cast(null as int) + 5) ASC], true
+- Project [a#92,b#93]
+- SubqueryAlias t
+- Project [_1#89 AS a#92,_2#90 AS b#93]
+- LocalRelation [_1#89,_2#90], [[1,2],[1,2]]
== Optimized Logical Plan ==
LocalRelation [a#92,b#93], [[1,2],[1,2]]
== Physical Plan ==
LocalTableScan [a#92,b#93], [[1,2],[1,2]]
```
cc rxin cloud-fan marmbrus Thanks!
#### How was this patch tested?
Added a test suite for covering this rule
Author: gatorsmile <gatorsmile@gmail.com>
Closes#11840 from gatorsmile/sortElimination.
## What changes were proposed in this pull request?
Previously, Dataset.groupBy returned a GroupedData, and Dataset.groupByKey returned a GroupedDataset. The names are very similar, and unfortunately do not convey the real differences between the two.
Assume we are grouping by some keys (K). groupByKey is a key-value style group by, in which the schema of the returned dataset is a tuple of just two fields: key and value. groupBy, on the other hand, is a relational style group by, in which the schema of the returned dataset is flattened and contains |K| + |V| fields.
This pull request also removes the experimental tag from RelationalGroupedDataset. It has been with DataFrame since 1.3, and we have enough confidence now to stabilize it.
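A hedged sketch of the difference, assuming `sqlContext.implicits._` is in scope:
```scala
val ds = Seq(("a", 1), ("a", 2), ("b", 3)).toDS()

// Relational style: RelationalGroupedDataset; the result schema is flattened,
// e.g. columns (_1, sum(_2)).
val relational = ds.groupBy($"_1").sum("_2")

// Key-value style: the result pairs each key with the aggregate,
// e.g. a Dataset[(String, Long)] of (key, count).
val keyValue = ds.groupByKey(_._1).count()
```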
## How was this patch tested?
This is a rename to improve API understandability. Should be covered by all existing tests.
Author: Reynold Xin <rxin@databricks.com>
Closes#11841 from rxin/SPARK-13897.
## What changes were proposed in this pull request?
500L << 20 is actually pretty close to the 32-bit int limit. I was trying to increase this to 500L << 23 and got negative numbers instead.
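For illustration, the overflow bites when the shift ends up in 32-bit Int arithmetic; a Scala REPL session:
```scala
scala> 500 << 23      // 500 * 2^23 = 4194304000 exceeds Int.MaxValue (2147483647)
res0: Int = -100663296

scala> 500L << 23     // the same shift in Long arithmetic stays positive
res1: Long = 4194304000
```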
## How was this patch tested?
I'm only modifying test code.
Author: Reynold Xin <rxin@databricks.com>
Closes#11839 from rxin/SPARK-14018.
## What changes were proposed in this pull request?
This is a minor followup on https://github.com/apache/spark/pull/11799 that extracts out the `VectorizedColumnReader` from `VectorizedParquetRecordReader` into its own file.
## How was this patch tested?
N/A (refactoring only)
Author: Sameer Agarwal <sameer@databricks.com>
Closes#11834 from sameeragarwal/rename.
## What changes were proposed in this pull request?
This PR cleans up the new parquet record reader with the following changes:
1. Removes the non-vectorized parquet reader code from `UnsafeRowParquetRecordReader`.
2. Removes the non-vectorized column reader code from `ColumnReader`.
3. Renames `UnsafeRowParquetRecordReader` to `VectorizedParquetRecordReader` and `ColumnReader` to `VectorizedColumnReader`
4. Deprecates `PARQUET_UNSAFE_ROW_RECORD_READER_ENABLED`.
## How was this patch tested?
Refactoring only; Existing tests should reveal any problems.
Author: Sameer Agarwal <sameer@databricks.com>
Closes#11799 from sameeragarwal/vectorized-parquet.
## What changes were proposed in this pull request?
As part of testing SQL query generation from an analyzed SQL plan, we run the generated SQL in HiveComparisonTest. This PR makes the generated SQL eagerly analyzed, so that when a generated SQL string has any analysis error, we can see the error message created by
```
case NonFatal(e) => fail(
s"""Failed to analyze the converted SQL string:
|
|# Original HiveQL query string:
|$queryString
|
|# Resolved query plan:
|${originalQuery.analyzed.treeString}
|
|# Converted SQL query string:
|$convertedSQL
""".stripMargin, e)
```
Right now, if we can parse a generated SQL string but fail to analyze it, we will see an error message generated by the following code (it only mentions that we cannot execute the original query, i.e. `queryString`).
```
case e: Throwable =>
val errorMessage =
s"""
|Failed to execute query using catalyst:
|Error: ${e.getMessage}
|${stackTraceToString(e)}
|$queryString
|$query
|== HIVE - ${hive.size} row(s) ==
|${hive.mkString("\n")}
""".stripMargin
```
## How was this patch tested?
Existing tests.
Author: Yin Huai <yhuai@databricks.com>
Closes#11825 from yhuai/SPARK-13972-follow-up.
## What changes were proposed in this pull request?
ShuffledHashJoin (including outer join) was removed in 1.6 in favor of SortMergeJoin, which is more robust and also fast.
ShuffledHashJoin is still useful in this case: 1) one table is much smaller than the other one, so the cost of building a hash table on the smaller table is lower than sorting the larger table, and 2) any partition of the small table can fit in memory.
This PR brings back ShuffledHashJoin, basically reverting #9645 and fixing the conflicts. It also merges outer join and left-semi join into the same class. This PR does not implement full outer join, because it can't be implemented efficiently (it would require building hash tables on both sides).
A simple benchmark (one table is 5x smaller than the other one) shows that ShuffledHashJoin can be 2X faster than SortMergeJoin.
## How was this patch tested?
Added new unit tests for ShuffledHashJoin.
Author: Davies Liu <davies@databricks.com>
Closes#11788 from davies/shuffle_join.
## What changes were proposed in this pull request?
Current implementations of `AttributeReference.sql` and `Alias.sql` joins all available qualifiers, which is logically wrong. But this implementation mistake doesn't cause any real SQL generation bugs though, since there is always at most one qualifier for any given `AttributeReference` or `Alias`.
This PR fixes this issue by only picking the first qualifier.
## How was this patch tested?
Existing tests should be enough.
Author: Cheng Lian <lian@databricks.com>
Closes#11820 from liancheng/spark-14004-single-qualifier.
## What changes were proposed in this pull request?
Now we should be able to convert all logical plans to SQL strings, if they were parsed from a Hive query. This PR changes the error handling to throw exceptions instead of just logging.
We will send new PRs for spotted bugs, and merge this one after all bugs are fixed.
## How was this patch tested?
existing tests.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#11782 from cloud-fan/test.
## What changes were proposed in this pull request?
The fix is simple: use the existing `CombineUnions` rule to combine adjacent Unions before building the SQL string.
## How was this patch tested?
The re-enabled test
Author: Wenchen Fan <wenchen@databricks.com>
Closes#11818 from cloud-fan/bug-fix.
## What changes were proposed in this pull request?
This patch updates the documentation for Datasets. I also updated some internal documentation for exchange/broadcast.
## How was this patch tested?
Just documentation/api stability update.
Author: Reynold Xin <rxin@databricks.com>
Closes#11814 from rxin/dataset-docs.
## What changes were proposed in this pull request?
JIRA: https://issues.apache.org/jira/browse/SPARK-13930
Recently, fast serialization was introduced for collecting DataFrames/Datasets (#11664). The same technique can be used for the collect limit operator too.
## How was this patch tested?
Add a benchmark for collect limit to `BenchmarkWholeStageCodegen`.
Without this patch:
```
model name : Westmere E56xx/L56xx/X56xx (Nehalem-C)
collect limit: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
-------------------------------------------------------------------------------------------
collect limit 1 million 3413 / 3768 0.3 3255.0 1.0X
collect limit 2 millions 9728 / 10440 0.1 9277.3 0.4X
```
With this patch:
```
model name : Westmere E56xx/L56xx/X56xx (Nehalem-C)
collect limit: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
-------------------------------------------------------------------------------------------
collect limit 1 million 833 / 1284 1.3 794.4 1.0X
collect limit 2 millions 3348 / 4005 0.3 3193.3 0.2X
```
Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Closes#11759 from viirya/execute-take.
## What changes were proposed in this pull request?
This PR revises Dataset API ScalaDoc. All public methods are divided into the following groups
* `groupname basic`: Basic Dataset functions
* `groupname action`: Actions
* `groupname untypedrel`: Untyped Language Integrated Relational Queries
* `groupname typedrel`: Typed Language Integrated Relational Queries
* `groupname func`: Functional Transformations
* `groupname rdd`: RDD Operations
* `groupname output`: Output Operations
`since` tag and sample code are also updated. We may want to add more sample code for typed APIs.
## How was this patch tested?
Documentation change. Checked by building unidoc locally.
Author: Cheng Lian <lian@databricks.com>
Closes#11769 from liancheng/spark-13826-ds-api-doc.
PR #11696 introduced a complex pattern match that broke the Scala 2.10 match unreachability check and caused a build failure. This PR fixes the issue by expanding the pattern match into several simpler ones.
Note that tuning or turning off `-Dscalac.patmat.analysisBudget` doesn't work for this case.
Tested by compilation against Scala 2.10.
Author: tedyu <yuzhihong@gmail.com>
Closes#11798 from yy2016/master.
## What changes were proposed in this pull request?
We haven't figured out the correct logic to add sub-queries yet, so we should not clear all sub-queries before generating SQL. This PR changes the logic to only remove sub-queries above table relations.
an example for this bug, original SQL: `SELECT a FROM (SELECT a FROM tbl) t WHERE a = 1`
before this PR, we will generate:
```
SELECT attr_1 AS a FROM
SELECT attr_1 FROM (
SELECT a AS attr_1 FROM tbl
) AS sub_q0
WHERE attr_1 = 1
```
We missed a sub-query and this SQL string is illegal.
After this PR, we will generate:
```
SELECT attr_1 AS a FROM (
SELECT attr_1 FROM (
SELECT a AS attr_1 FROM tbl
) AS sub_q0
WHERE attr_1 = 1
) AS t
```
TODO: in the long term, we should find a way to add sub-queries correctly, so that arbitrary logical plans can be converted to SQL strings.
## How was this patch tested?
`LogicalPlanToSQLSuite`
Author: Wenchen Fan <wenchen@databricks.com>
Closes#11786 from cloud-fan/bug-fix.
## What changes were proposed in this pull request?
Sub-query names only need to be unique within a single SQL string generation, not across all generations. This PR moves the `newSubqueryName` method into `class SQLBuilder` and removes `object SQLBuilder`.
It also addresses two minor comments in https://github.com/apache/spark/pull/11696.
## How was this patch tested?
existing tests.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#11783 from cloud-fan/tmp.
## What changes were proposed in this pull request?
Compilation against Scala 2.10 fails with:
```
[error] [warn] /home/jenkins/workspace/spark-master-compile-sbt-scala-2.10/sql/hive/src/main/scala/org/apache/spark/sql/hive/SQLBuilder.scala:483: Cannot check match for unreachability.
[error] (The analysis required more space than allowed. Please try with scalac -Dscalac.patmat.analysisBudget=512 or -Dscalac.patmat.analysisBudget=off.)
[error] [warn] private def addSubqueryIfNeeded(plan: LogicalPlan): LogicalPlan = plan match {
```
## How was this patch tested?
Compilation against Scala 2.10
Author: tedyu <yuzhihong@gmail.com>
Closes#11787 from yy2016/master.
JIRA: https://issues.apache.org/jira/browse/SPARK-13838
## What changes were proposed in this pull request?
We should also clear the variable code in `BoundReference.genCode` to prevent it from being evaluated twice, as we did in `evaluateVariables`.
## How was this patch tested?
Existing tests.
Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Closes#11674 from viirya/avoid-reevaluate.
## What changes were proposed in this pull request?
Support queries that JOIN tables with a USING clause.
SELECT * FROM table1 JOIN table2 USING <column_list>
The USING clause can be used to simplify the join condition when:
1) equi-join semantics are desired, and
2) the columns in the equi-join have the same name in both tables.
We already support Natural Join in Spark. This PR makes
use of the existing natural-join infrastructure to
form the join condition and the projection list (see the sketch below).
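A hedged sketch of the feature, with hypothetical tables `t1` and `t2` that both have an `id` column:
```scala
// USING (id) is shorthand for an equi-join on id, with id appearing once in the output.
val joined = sqlContext.sql("SELECT * FROM t1 JOIN t2 USING (id)")
// Roughly equivalent to:
//   SELECT id, t1.other_cols, t2.other_cols FROM t1 JOIN t2 ON t1.id = t2.id
```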
## How was this patch tested?
Have added unit tests in SQLQuerySuite, CatalystQlSuite, ResolveNaturalJoinSuite
Author: Dilip Biswal <dbiswal@us.ibm.com>
Closes#11297 from dilipbiswal/spark-13427.
## What changes were proposed in this pull request?
This PR adds SQL generation support for the `Generate` operator. It always converts a `Generate` operator into `LATERAL VIEW` format, as there are many limitations on putting UDTFs in the project list.
This PR is based on https://github.com/apache/spark/pull/11658, please see the last commit to review the real changes.
Thanks dilipbiswal for his initial work! This takes over https://github.com/apache/spark/pull/11596.
## How was this patch tested?
new tests in `LogicalPlanToSQLSuite`
Author: Wenchen Fan <wenchen@databricks.com>
Closes#11696 from cloud-fan/generate.
## What changes were proposed in this pull request?
Logging was made private in Spark 2.0. If we move it, then users would be able to create a Logging trait themselves to avoid changing their own code.
## How was this patch tested?
existing tests.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#11764 from cloud-fan/logger.
This commit updates the HiveContext so that sc.hadoopConfiguration is used to instantiate its internal instances of HiveConf.
I tested this by overriding the S3 FileSystem implementation from spark-defaults.conf as "spark.hadoop.fs.s3.impl" (to avoid [HADOOP-12810](https://issues.apache.org/jira/browse/HADOOP-12810)).
Author: Ryan Blue <blue@apache.org>
Closes#11273 from rdblue/SPARK-13403-new-hive-conf-from-hadoop-conf.
Because ClassTags are available when constructing ShuffledRDD, we can use them to automatically select Kryo for shuffle serialization when the RDD's types are known to be compatible with Kryo.
This patch introduces `SerializerManager`, a component which picks the "best" serializer for a shuffle given the elements' ClassTags. It will automatically pick a Kryo serializer for ShuffledRDDs whose key, value, and/or combiner types are primitives, arrays of primitives, or strings. In the future we can use this class as a narrow extension point to integrate specialized serializers for other types, such as ByteBuffers.
In a planned followup patch, I will extend the BlockManager APIs so that we're able to use similar automatic serializer selection when caching RDDs (this is a little trickier because the ClassTags need to be threaded through many more places).
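A hedged sketch of the selection heuristic described above; this is illustrative, not the actual `SerializerManager` code:
```scala
import scala.reflect.ClassTag

// Kryo is a safe automatic choice when the element type is a primitive,
// an array of primitives, or a String.
def canUseKryo(ct: ClassTag[_]): Boolean = {
  val cls = ct.runtimeClass
  cls.isPrimitive ||
    (cls.isArray && cls.getComponentType.isPrimitive) ||
    cls == classOf[String]
}
```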
Author: Josh Rosen <joshrosen@databricks.com>
Closes#11755 from JoshRosen/automatically-pick-best-serializer.
## What changes were proposed in this pull request?
Since the developer API of the pluggable parser was removed in #10801, the docs should be updated accordingly.
## How was this patch tested?
This patch will not affect the real code path.
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Closes#11758 from adrian-wang/spark12855.
## What changes were proposed in this pull request?
This PR removes three minor duplicated lines. The first one causes the following unreachable-code warning.
```
JoinSuite.scala:52: unreachable code
[warn] case j: BroadcastHashJoin => j
```
The other two are just consecutive repetitions in a `Seq` of MiMa filters.
## How was this patch tested?
Pass the existing Jenkins test.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#11773 from dongjoon-hyun/remove_duplicated_line.
## What changes were proposed in this pull request?
Fix expression generation for optional types.
Standard Java reflection causes issues when dealing with synthetic Scala objects (things that do not map to Java and thus contain a dollar sign in their name). This patch introduces Scala reflection in such cases.
This patch also adds a regression test for Dataset's handling of classes defined in package objects (which was the initial purpose of this PR).
## How was this patch tested?
A new test in ExpressionEncoderSuite that tests optional inner classes and a regression test for Dataset's handling of package objects.
Author: Jakob Odersky <jakob@odersky.com>
Closes#11708 from jodersky/SPARK-13118-package-objects.
## What changes were proposed in this pull request?
We need to copy the UnsafeRow since a Join could produce multiple rows from single input rows. We could avoid that if there is no join (or the join will not produce multiple rows) inside WholeStageCodegen.
Updated the benchmark for `collect`; we see a 20-30% speedup.
## How was this patch tested?
existing unit tests.
Author: Davies Liu <davies@databricks.com>
Closes#11740 from davies/avoid_copy2.
## What changes were proposed in this pull request?
PR https://github.com/apache/spark/pull/2400 added support for parsing JSON rows wrapped in an array. However, this throws an exception when the given data contains array data and struct data in the same field, as below:
```json
{"a": {"b": 1}}
{"a": []}
```
and the schema is given as below:
```scala
val schema =
StructType(
StructField("a", StructType(
StructField("b", StringType) :: Nil
)) :: Nil)
```
- **Before**
```scala
sqlContext.read.schema(schema).json(path).show()
```
```scala
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 in stage 0.0 failed 4 times, most recent failure: Lost task 7.3 in stage 0.0 (TID 10, 192.168.1.170): java.lang.ClassCastException: org.apache.spark.sql.types.GenericArrayData cannot be cast to org.apache.spark.sql.catalyst.InternalRow
at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getStruct(rows.scala:50)
at org.apache.spark.sql.catalyst.expressions.GenericMutableRow.getStruct(rows.scala:247)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown Source)
...
```
- **After**
```scala
sqlContext.read.schema(schema).json(path).show()
```
```bash
+----+
| a|
+----+
| [1]|
|null|
+----+
```
For other data types, malformed values are converted to `null`; only this case emits an exception.
This PR makes the support for wrapped rows apply only at the top level.
## How was this patch tested?
Unit tests were used and `./dev/run_tests` for code style tests.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#11752 from HyukjinKwon/SPARK-3308-follow-up.
## What changes were proposed in this pull request?
As part of the effort to merge `SQLContext` and `HiveContext`, this patch implements an internal catalog called `SessionCatalog` that handles temporary functions and tables and delegates metastore operations to `ExternalCatalog`. Currently, this is still dead code, but in the future it will be part of `SessionState` and will replace `o.a.s.sql.catalyst.analysis.Catalog`.
A recent patch #11573 parses Hive commands ourselves in Spark, but still passes the entire query text to Hive. In a future patch, we will use `SessionCatalog` to implement the parsed commands.
## How was this patch tested?
800+ lines of tests in `SessionCatalogSuite`.
Author: Andrew Or <andrew@databricks.com>
Closes#11750 from andrewor14/temp-catalog.
## What changes were proposed in this pull request?
Narrow down the parameter type of `UserDefinedType#serialize()`. Currently, the parameter type is `Any`; however, it would logically make more sense to narrow it down to the actual user defined type.
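A hedged, simplified sketch of the narrowed signature (the real `UserDefinedType` extends `DataType` and has more members):
```scala
abstract class UserDefinedType[UserType] {
  // Before this change the parameter was Any:
  //   def serialize(obj: Any): Any
  // After, the compiler enforces the actual user type:
  def serialize(obj: UserType): Any
  def deserialize(datum: Any): UserType
}
```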
## How was this patch tested?
Existing tests were successfully run on local machine.
Author: Jakob Odersky <jakob@odersky.com>
Closes#11379 from jodersky/SPARK-11011-udt-types.
## What changes were proposed in this pull request?
**[I'll link it to the JIRA once ASF JIRA is back online]**
This PR modifies the existing `CombineFilters` rule to remove redundant conditions while combining individual filter predicates. For instance, queries of the form `table.where('a === 1 && 'b === 1).where('a === 1 && 'c === 1)` will now be optimized to `table.where('a === 1 && 'b === 1 && 'c === 1)` (instead of `table.where('a === 1 && 'a === 1 && 'b === 1 && 'c === 1)`).
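A hedged sketch of the deduplication idea, reusing Catalyst's `splitConjunctivePredicates` helper; this is not the actual rule implementation:
```scala
import org.apache.spark.sql.catalyst.expressions.{And, Expression, PredicateHelper}

object DedupFilters extends PredicateHelper {
  // Split "a && b && c" into Seq(a, b, c), drop duplicates, and re-AND.
  def combine(cond1: Expression, cond2: Expression): Expression = {
    val preds =
      (splitConjunctivePredicates(cond1) ++ splitConjunctivePredicates(cond2)).distinct
    preds.reduce(And)
  }
}
```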
## How was this patch tested?
Unit test in `FilterPushdownSuite`
Author: Sameer Agarwal <sameer@databricks.com>
Closes#11670 from sameeragarwal/combine-filters.
## What changes were proposed in this pull request?
This PR generalizes the `NullFiltering` optimizer rule in catalyst to `InferFiltersFromConstraints` that can automatically infer all relevant filters based on an operator's constraints while making sure of 2 things:
(a) no redundant filters are generated, and
(b) filters that do not contribute to any further optimizations are not generated.
## How was this patch tested?
Extended all tests in `InferFiltersFromConstraintsSuite` (which were initially based on `NullFilteringSuite`) to test filter inference in `Filter` and `Join` operators.
In particular, the two tests (`single inner join with pre-existing filters: filter out values on either side` and `multiple inner joins: filter out values on all sides on equi-join keys`) attempt to highlight/test the real potential of this rule for join optimization.
Author: Sameer Agarwal <sameer@databricks.com>
Closes#11665 from sameeragarwal/infer-filters.
#### What changes were proposed in this pull request?
This PR is to convert to SQL from analyzed logical plans containing operator `ScriptTransformation`.
For example, below is the SQL containing `Transform`
```
SELECT TRANSFORM (a, b, c, d) USING 'cat' FROM parquet_t2
```
Its logical plan is like
```
ScriptTransformation [a#210L,b#211L,c#212L,d#213L], cat, [key#208,value#209], HiveScriptIOSchema(List(),List(),Some(org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe),Some(org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe),List((field.delim, )),List((field.delim, )),Some(org.apache.hadoop.hive.ql.exec.TextRecordReader),Some(org.apache.hadoop.hive.ql.exec.TextRecordWriter),true)
+- SubqueryAlias parquet_t2
+- Relation[a#210L,b#211L,c#212L,d#213L] ParquetRelation
```
The generated SQL will be like
```
SELECT TRANSFORM (`parquet_t2`.`a`, `parquet_t2`.`b`, `parquet_t2`.`c`, `parquet_t2`.`d`) USING 'cat' AS (`key` string, `value` string) FROM `default`.`parquet_t2`
```
#### How was this patch tested?
Seven test cases are added to `LogicalPlanToSQLSuite`.
Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
Closes#11503 from gatorsmile/transformToSQL.
## What changes were proposed in this pull request?
This PR tries to solve a fundamental issue in the `SQLBuilder`. When we want to turn a logical plan into a SQL string and put it after a FROM clause, we need to wrap it with a sub-query. However, a logical plan is allowed to have same-name outputs with different qualifiers (e.g. the `Join` operator), and this kind of plan can't be put under a sub-query, as we would erase the qualifiers and assign a new one to all outputs, making it impossible to distinguish same-name outputs.
To solve this problem, this PR renames all attributes with globally unique names (using exprId), so that we don't need qualifiers to resolve ambiguity anymore.
For example, `SELECT x.key, MAX(y.key) OVER () FROM t x JOIN t y`, we will parse this SQL to a Window operator and a Project operator, and add a sub-query between them. The generated SQL looks like:
```
SELECT sq_1.key, sq_1.max
FROM (
SELECT sq_0.key, sq_0.key, MAX(sq_0.key) OVER () AS max
FROM (
SELECT x.key, y.key FROM t1 AS x JOIN t2 AS y
) AS sq_0
) AS sq_1
```
You can see, the `key` columns become ambiguous after `sq_0`.
After this PR, it will generate something like:
```
SELECT attr_30 AS key, attr_37 AS max
FROM (
SELECT attr_30, attr_37
FROM (
SELECT attr_30, attr_35, MAX(attr_35) AS attr_37
FROM (
SELECT attr_30, attr_35 FROM
(SELECT key AS attr_30 FROM t1) AS sq_0
INNER JOIN
(SELECT key AS attr_35 FROM t1) AS sq_1
) AS sq_2
) AS sq_3
) AS sq_4
```
The outermost SELECT is used to turn the generated names back into the real names, and the innermost SELECT is used to alias real columns to our generated names. Between them, there is no name ambiguity anymore.
## How was this patch tested?
existing tests and new tests in LogicalPlanToSQLSuite.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#11658 from cloud-fan/gensql.
## What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-13894
Change the return type of the `SQLContext.range` API from `DataFrame` to `Dataset`.
## How was this patch tested?
No additional unit test required.
Author: Cheng Hao <hao.cheng@intel.com>
Closes#11730 from chenghao-intel/range.
## What changes were proposed in this pull request?
There is a feature of hive SQL called multi-insert. For example:
```
FROM src
INSERT OVERWRITE TABLE dest1
SELECT key + 1
INSERT OVERWRITE TABLE dest2
SELECT key WHERE key > 2
INSERT OVERWRITE TABLE dest3
SELECT col EXPLODE(arr) exp AS col
...
```
We currently support it partially, with some limitations: 1) WHERE can't reference columns produced by LATERAL VIEW, and 2) it's not executed eagerly, i.e. `sql("...multi-insert clause...")` won't take effect right away like other commands such as CREATE TABLE.
This PR removes these limitations and makes us fully support multi-insert.
## How was this patch tested?
new tests in `SQLQuerySuite`
Author: Wenchen Fan <wenchen@databricks.com>
Closes#11754 from cloud-fan/lateral-view.
## What changes were proposed in this pull request?
Follow up to https://github.com/apache/spark/pull/11657
- Also update `String.getBytes("UTF-8")` to use `StandardCharsets.UTF_8`
- And fix one last new Coverity warning that turned up (use of unguarded `wait()` replaced by simpler/more robust `java.util.concurrent` classes in tests)
- And while we're here cleaning up Coverity warnings, just fix about 15 more build warnings
## How was this patch tested?
Jenkins tests
Author: Sean Owen <sowen@cloudera.com>
Closes#11725 from srowen/SPARK-13823.2.
## What changes were proposed in this pull request?
The purpose of [SPARK-12653](https://issues.apache.org/jira/browse/SPARK-12653) is re-enabling a regression test.
Historically, the target regression test was added by [SPARK-8498](093c34838d), but was temporarily disabled by [SPARK-12615](8ce645d4ee) due to a binary compatibility error.
The following is the current error message when submitting a Spark job with the pre-built `test.jar` file in the target regression test.
```
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.SparkContext$.$lessinit$greater$default$6()Lscala/collection/Map;
```
Simply rebuilding `test.jar` cannot restore the purpose of the test case, since we need to support both Scala 2.10 and 2.11 for a while. For example, we will face the following Scala 2.11 error if we use a `test.jar` built by Scala 2.10.
```
Exception in thread "main" java.lang.NoSuchMethodError: scala.reflect.api.JavaUniverse.runtimeMirror(Ljava/lang/ClassLoader;)Lscala/reflect/api/JavaMirrors$JavaMirror;
```
This PR replaces the existing `test.jar` with `test-2.10.jar` and `test-2.11.jar`, and improves the regression test to use the suitable jar file.
## How was this patch tested?
Pass the existing Jenkins test.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#11744 from dongjoon-hyun/SPARK-12653.
## What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-13899
This PR makes CSV data source produce `InternalRow` instead of `Row`.
Basically, this resembles the JSON data source and uses the same code for casting.
## How was this patch tested?
Unit tests were used within IDE and code style was checked by `./dev/run_tests`.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#11717 from HyukjinKwon/SPARK-13899.
## What changes were proposed in this pull request?
This PR brings codegen support for broadcast left-semi join.
## How was this patch tested?
Existing tests. Added a benchmark; the result shows a 7X speedup.
Author: Davies Liu <davies@databricks.com>
Closes#11742 from davies/gen_semi.
## What changes were proposed in this pull request?
Remove the wrong "expected" parameter in MathFunctionsSuite.scala's checkNaNWithoutCodegen.
This function checks NaN values, so the `expected` parameter is useless. Callers do not pass an `expected` value, and similar functions like `checkNaNWithGeneratedProjection` and `checkNaNWithOptimization` do not use it either.
Author: Yucai Yu <yucai.yu@intel.com>
Closes#11718 from yucai/unused_expected.
## What changes were proposed in this pull request?
This PR just moves some code from SortMergeOuterJoin into SortMergeJoin,
in order to support codegen for outer joins.
## How was this patch tested?
existing tests.
Author: Davies Liu <davies@databricks.com>
Closes#11743 from davies/gen_smjouter.
## What changes were proposed in this pull request?
This patch changes DataFrameReader.text()'s return type from DataFrame to Dataset[String].
Closes#11731.
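A small usage sketch of the new return type, assuming `sqlContext.implicits._` is in scope for the encoder:
```scala
import org.apache.spark.sql.Dataset

val lines: Dataset[String] = sqlContext.read.text("README.md")
val words = lines.flatMap(_.split(" "))  // typed operations work directly on strings
```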
## How was this patch tested?
Updated existing integration tests to reflect the change.
Author: Reynold Xin <rxin@databricks.com>
Closes#11739 from rxin/SPARK-13895.
## What changes were proposed in this pull request?
Change the return type of toJson in Dataset class
## How was this patch tested?
No additional unit test required.
Author: Stavros Kontopoulos <stavros.kontopoulos@typesafe.com>
Closes#11732 from skonto/fix_toJson.
## What changes were proposed in this pull request?
Our internal code can go through SessionState.catalog and SessionState.analyzer. This brings two small benefits:
1. Reduces internal dependency on SQLContext.
2. Removes 2 public methods in Java (Java does not obey package private visibility).
More importantly, according to the design in SPARK-13485, we'd need to claim this catalog function for the user-facing public functions, rather than having an internal field.
## How was this patch tested?
Existing unit/integration test code.
Author: Reynold Xin <rxin@databricks.com>
Closes#11716 from rxin/SPARK-13893.
## What changes were proposed in this pull request?
Use the method `testQuietly` to stop ContinuousQuerySuite from flooding the console with garbage logs.
This makes ContinuousQuerySuite not output logs to the console; the logs will still be written to unit-tests.log.
## How was this patch tested?
Just check Jenkins output.
Author: Xin Ren <iamshrek@126.com>
Closes#11703 from keypointt/SPARK-13660.
#### What changes were proposed in this pull request?
Before this PR, the two Optimizer rules `ColumnPruning` and `PushPredicateThroughProject` reversed each other's effects, so the Optimizer always reached the max number of iterations when optimizing some queries and extra `Project` operators were left in the plan. For example, below is the optimized plan after reaching 100 iterations:
```
Join Inner, Some((cast(id1#16 as bigint) = id1#18L))
:- Project [id1#16]
: +- Filter isnotnull(cast(id1#16 as bigint))
: +- Project [id1#16]
: +- Relation[id1#16,newCol#17] JSON part: struct<>, data: struct<id1:int,newCol:int>
+- Filter isnotnull(id1#18L)
+- Relation[id1#18L] JSON part: struct<>, data: struct<id1:bigint>
```
This PR splits the optimizer rule `ColumnPruning` to `ColumnPruning` and `EliminateOperators`
The issue becomes worse with another rule, `NullFiltering`, which can add extra Filters for `IsNotNull`. We have to be careful when introducing extra `Filter`s if the benefit is not large enough. Another PR will be submitted by sameeragarwal to handle this issue.
cc sameeragarwal marmbrus
In addition, `ColumnPruning` should not push `Project` through non-deterministic `Filter`. This could cause wrong results. This will be put in a separate PR.
cc davies cloud-fan yhuai
#### How was this patch tested?
Modified the existing test cases.
Author: gatorsmile <gatorsmile@gmail.com>
Closes#11682 from gatorsmile/viewDuplicateNames.
## What changes were proposed in this pull request?
In general it is better for internal classes to not depend on the external class (in this case SQLContext) to reduce coupling between user-facing APIs and the internal implementations. This patch removes SQLContext dependency from some internal classes such as SparkPlanner, SparkOptimizer.
As part of this patch, I also removed the following internal methods from SQLContext:
```
protected[sql] def functionRegistry: FunctionRegistry
protected[sql] def optimizer: Optimizer
protected[sql] def sqlParser: ParserInterface
protected[sql] def planner: SparkPlanner
protected[sql] def continuousQueryManager
protected[sql] def prepareForExecution: RuleExecutor[SparkPlan]
```
## How was this patch tested?
Existing unit/integration tests.
Author: Reynold Xin <rxin@databricks.com>
Closes#11712 from rxin/sqlContext-planner.
## What changes were proposed in this pull request?
When initially creating `CSVSuite.scala` in SPARK-12833, there was a typo in `scalastyle:on`: `scalstyle:on`. This mistakenly turned off ScalaStyle checking for the rest of the file, so it could not catch a violation in the SPARK-12668 code added recently. This issue fixes the existing escaping and adds a new escaping for the SPARK-12668 code like the following.
```scala
test("test aliases sep and encoding for delimiter and charset") {
+ // scalastyle:off
val cars = sqlContext
...
.load(testFile(carsFile8859))
+ // scalastyle:on
```
This will prevent future potential problems, too.
## How was this patch tested?
Pass the Jenkins test.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#11700 from dongjoon-hyun/SPARK-13870.
## What changes were proposed in this pull request?
This patch removes DescribeCommand's dependency on LogicalPlan. After this patch, DescribeCommand simply accepts a TableIdentifier. It minimizes the dependency, and blocks my next patch (removes SQLContext dependency from SparkPlanner).
## How was this patch tested?
Should be covered by existing unit tests and Hive compatibility tests that run describe table.
Author: Reynold Xin <rxin@databricks.com>
Closes#11710 from rxin/SPARK-13884.
## What changes were proposed in this pull request?
When we call DataFrame/Dataset.collect(), the Java serializer (or Kryo serializer) is used to serialize the UnsafeRows in the executor, then deserialize them into UnsafeRows in the driver. The Java serializer (and Kryo serializer) is slow on millions of rows, because it tries to find identical rows, but usually there are none.
This PR serializes the UnsafeRows as a byte array by packing them together, so the Java serializer (or Kryo serializer) serializes the bytes very fast (there are fewer blocks and byte arrays are not compared by content).
The UnsafeRow format is highly compressible, so the serialized bytes are also compressed (configurable by spark.io.compression.codec).
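A hedged sketch of the packing idea only, not Spark's actual internals: concatenate length-prefixed row bytes into one array, so the generic serializer sees a single byte array instead of millions of row objects.
```scala
import java.io.{ByteArrayOutputStream, DataOutputStream}

def packRows(rows: Iterator[Array[Byte]]): Array[Byte] = {
  val bos = new ByteArrayOutputStream()
  val out = new DataOutputStream(bos)
  rows.foreach { bytes =>
    out.writeInt(bytes.length)  // length prefix so rows can be split apart again
    out.write(bytes)
  }
  out.writeInt(-1)              // end marker
  out.flush()
  bos.toByteArray
}
```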
## How was this patch tested?
Existing unit tests.
Add a benchmark for collect, before this patch:
```
Intel(R) Core(TM) i7-4558U CPU 2.80GHz
collect: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
-------------------------------------------------------------------------------------------
collect 1 million 3991 / 4311 0.3 3805.7 1.0X
collect 2 millions 10083 / 10637 0.1 9616.0 0.4X
collect 4 millions 29551 / 30072 0.0 28182.3 0.1X
```
```
Intel(R) Core(TM) i7-4558U CPU 2.80GHz
collect: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
-------------------------------------------------------------------------------------------
collect 1 million 775 / 1170 1.4 738.9 1.0X
collect 2 millions 1153 / 1758 0.9 1099.3 0.7X
collect 4 millions 4451 / 5124 0.2 4244.9 0.2X
```
We can see about 5-7X speedup.
Author: Davies Liu <davies@databricks.com>
Closes#11664 from davies/serialize_row.
## What changes were proposed in this pull request?
Avoid the copy in HashedRelation, since most HashedRelations are built from an Array[Row]; added copy() for LeftSemiJoinHash. This can help reduce the memory consumption for broadcast joins.
## How was this patch tested?
Existing tests.
Author: Davies Liu <davies@databricks.com>
Closes#11666 from davies/remove_copy.
## What changes were proposed in this pull request?
1. Rename DataFrame.scala to Dataset.scala, since the class is now named Dataset.
2. Remove LegacyFunctions. It was introduced in Spark 1.6 for backward compatibility, and can be removed in Spark 2.0.
## How was this patch tested?
Should be covered by existing unit/integration tests.
Author: Reynold Xin <rxin@databricks.com>
Closes#11704 from rxin/SPARK-13880.
## What changes were proposed in this pull request?
- Add a MetadataLog interface for reliable metadata storage (a sketch of the interface follows the list).
- Add HDFSMetadataLog as a MetadataLog implementation based on HDFS.
- Update FileStreamSource to use HDFSMetadataLog instead of managing metadata by itself.
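A hedged sketch of what such an interface might look like; the method names are assumptions based on the description, not necessarily the exact Spark API:
```scala
trait MetadataLog[T] {
  /** Atomically store metadata for a batch; returns false if the batch already exists. */
  def add(batchId: Long, metadata: T): Boolean

  /** Retrieve the metadata for a given batch, if present. */
  def get(batchId: Long): Option[T]

  /** The most recent batch id and its metadata, if any. */
  def getLatest(): Option[(Long, T)]
}
```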
## How was this patch tested?
unit tests
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#11625 from zsxwing/metadata-log.
## What changes were proposed in this pull request?
We introduced some local operators in the org.apache.spark.sql.execution.local package but never fully wired the engine to actually use them. We still plan to implement a full local mode, but it's probably going to be fairly different from what the current iterator-based local mode would look like. Based on what we know right now, we might want a push-based columnar version of these operators.
Let's just remove them for now, and we can always re-introduce them in the future by looking at branch-1.6.
## How was this patch tested?
This is simply dead code removal.
Author: Reynold Xin <rxin@databricks.com>
Closes#11705 from rxin/SPARK-13882.
This PR adds a new strategy, `FileSourceStrategy`, that can be used for planning scans of collections of files that might be partitioned or bucketed.
Compared with the existing planning logic in `DataSourceStrategy` this version has the following desirable properties:
- It removes the need to have `RDD`, `broadcastedHadoopConf` and other distributed concerns in the public API of `org.apache.spark.sql.sources.FileFormat`
- Partition column appending is delegated to the format to avoid an extra copy / devectorization when appending partition columns
- It minimizes the amount of data that is shipped to each executor (i.e. it does not send the whole list of files to every worker in the form of a hadoop conf)
- It natively supports bucketing files into partitions, and thus does not require coalescing / creating a `UnionRDD` with the correct partitioning.
- Small files are automatically coalesced into fewer tasks using an approximate bin-packing algorithm.
Currently only a testing source is planned / tested using this strategy. In follow-up PRs we will port the existing formats to this API.
A stub for `FileScanRDD` is also added, but most methods remain unimplemented.
Other minor cleanups:
- partition pruning is pushed into `FileCatalog` so both the new and old code paths can use this logic. This will also allow future implementations to use indexes or other tricks (i.e. a MySQL metastore)
- The partitions from the `FileCatalog` now propagate information about file sizes all the way up to the planner so we can intelligently spread files out.
- `Array` -> `Seq` in some internal APIs to avoid unnecessary `toArray` calls
- Rename `Partition` to `PartitionDirectory` to differentiate partitions used earlier in pruning from those where we have already enumerated the files and their sizes.
Author: Michael Armbrust <michael@databricks.com>
Closes#11646 from marmbrus/fileStrategy.
Three different things were needed to get rid of spurious warnings:
- silence deprecation warnings when cloning configuration
- change the way SparkHadoopUtil instantiates SparkConf to silence
warnings
- avoid creating new SparkConf instances where it's not needed.
On top of that, I changed the way that Logging.scala detects the repl;
now it uses a method that is overridden in the repl's Main class, and
the hack in Utils.scala is not needed anymore. This makes the 2.11 repl
behave like the 2.10 one and set the default log level to WARN, which
is a lot better. Previously, this wasn't working because the 2.11 repl
triggers log initialization earlier than the 2.10 one.
I also removed and simplified some other code in the 2.11 repl's Main
to avoid replicating logic that already exists elsewhere in Spark.
Tested the 2.11 repl in local and yarn modes.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#11510 from vanzin/SPARK-13626.
Addressing outstanding comments in #11573.
Jenkins, new test case in `DDLCommandSuite`
Author: Andrew Or <andrew@databricks.com>
Closes#11667 from andrewor14/ddl-parser-followups.
If a `_SUCCESS` file appears in the inner partitioning dir, partition discovery will treat it as a data file. Then partition discovery will fail because it finds that the dir structure is not valid. We should ignore those `_SUCCESS` files.
In the future, it would be better to ignore all files/dirs starting with `_` or `.`. This PR does not make that change; I am trying to keep this change simple, so we can consider getting it into branch 1.6.
To ignore all files/dirs starting with `_` or `.`, the main change is to let ParquetRelation have another way to get metadata files. Right now, it relies on FileStatusCache's cachedLeafStatuses, which returns file statuses of both metadata files (e.g. metadata files used by parquet) and data files, which requires more changes.
https://issues.apache.org/jira/browse/SPARK-13207
Author: Yin Huai <yhuai@databricks.com>
Closes#11088 from yhuai/SPARK-13207.
## What changes were proposed in this pull request?
This PR fixes 135 typos over 107 files:
* 121 typos in comments
* 11 typos in testcase name
* 3 typos in log messages
## How was this patch tested?
Manual.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#11689 from dongjoon-hyun/fix_more_typos.
## What changes were proposed in this pull request?
- Fixes calls to `new String(byte[])` or `String.getBytes()` that rely on platform default encoding, to use UTF-8
- Same for `InputStreamReader` and `OutputStreamWriter` constructors
- Standardizes on UTF-8 everywhere
- Standardizes specifying the encoding with `StandardCharsets.UTF_8`, not the Guava constant or "UTF-8" (which requires handling `UnsupportedEncodingException`); a sketch of the standardized pattern follows this list
- (also addresses the other remaining Coverity scan issues, which are pretty trivial; these are separated into commit 1deecd8d9c )
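For reference, the standardized pattern uses only the plain JDK API:
```scala
import java.nio.charset.StandardCharsets

// No UnsupportedEncodingException to handle, unlike the String-name overloads.
val bytes = "caf\u00e9".getBytes(StandardCharsets.UTF_8)
val round = new String(bytes, StandardCharsets.UTF_8)
```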
## How was this patch tested?
Jenkins tests
Author: Sean Owen <sowen@cloudera.com>
Closes#11657 from srowen/SPARK-13823.