They don't bring much value since we now have better unit test coverage for hash joins. This will also help reduce the test time.
Author: Reynold Xin <rxin@databricks.com>
Closes #8542 from rxin/SPARK-10378.
DataFrame writes to a DB2 database fail because, by default, the JDBC data source implementation generates a table schema with data types that DB2 does not support: TEXT for String and BIT(1) for Boolean.
This patch registers a DB2 JDBC dialect that maps String and Boolean to valid DB2 data types.
Author: sureshthalamati <suresh.thalamati@gmail.com>
Closes #8393 from sureshthalamati/db2_dialect_spark-10170.
This PR includes the following changes:
- Add `LocalNodeTest` for local operator tests and add unit tests for FilterNode and ProjectNode.
- Add `LimitNode` and `UnionNode` and their unit tests to show how to use `LocalNodeTest`. (SPARK-9991, SPARK-9993)
Author: zsxwing <zsxwing@gmail.com>
Closes #8464 from zsxwing/local-execution.
This fixes the problem that scanning a partitioned table puts high memory pressure on the driver and can take down the cluster. With this fix, we can also correctly show the query plan of a query consuming partitioned tables.
https://issues.apache.org/jira/browse/SPARK-10339
https://issues.apache.org/jira/browse/SPARK-10334
Finally, this PR squeezes in a "quick fix" for SPARK-10301. It is not a real fix; it just throws a better error message to let users know what to do.
Author: Yin Huai <yhuai@databricks.com>
Closes #8515 from yhuai/partitionedTableScan.
SparkHadoopUtil contains methods that use reflection to work around TaskAttemptContext binary incompatibilities between Hadoop 1.x and 2.x. We should use these methods in more places.
Author: Josh Rosen <joshrosen@databricks.com>
Closes #8499 from JoshRosen/use-hadoop-reflection-in-more-places.
When I tested the latest version of Spark with input containing an exclamation mark, I got some errors. I then reset the Spark version and found that commit "a2409d1c8e8ddec04b529ac6f6a12b5993f0eeda" introduced the bug. With the jline version changing from 0.9.94 to 2.12 in that commit, the exclamation mark is treated as a special character in ConsoleReader.
Author: wangwei <wangwei82@huawei.com>
Closes #8420 from small-wang/jline-SPARK-10226.
Actually using this API requires access to a lot of classes that we might make private by accident. I've added some tests to prevent this.
Author: Michael Armbrust <michael@databricks.com>
Closes #8516 from marmbrus/extraStrategiesTests.
This PR introduces a direct write API for testing Parquet. It's a DSL-flavored version of the [`writeDirect` method] [1] that comes with the parquet-avro testing code. With this API, it's much easier to construct arbitrary Parquet structures. It's especially useful when adding regression tests for various compatibility corner cases.
Sample usage of this API can be found in the new test case added in `ParquetThriftCompatibilitySuite`.
[1]: https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.1/parquet-avro/src/test/java/org/apache/parquet/avro/TestArrayCompatibility.java#L945-L972
Author: Cheng Lian <lian@databricks.com>
Closes #8454 from liancheng/spark-10289/parquet-testing-direct-write-api.
After this PR, In/InSet/ArrayContains will return null instead of false if the value is null. They will also return null when no match is found and the set/array contains a null.
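The null handling above follows SQL three-valued logic. A minimal Python sketch, assuming `None` stands in for SQL NULL; `sql_in` is a hypothetical helper for illustration, not Spark's implementation:

```python
from typing import Any, Iterable, Optional


def sql_in(value: Any, candidates: Iterable[Any]) -> Optional[bool]:
    """Three-valued IN: returns True, False, or None (standing in for NULL)."""
    if value is None:
        # NULL IN (...) is NULL, not false.
        return None
    saw_null = False
    for candidate in candidates:
        if candidate is None:
            saw_null = True
        elif candidate == value:
            return True
    # No match: the answer is unknown if a NULL was present, false otherwise.
    return None if saw_null else False
```

A definite match still wins over a null in the list; only the "not found" case becomes null.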
Author: Davies Liu <davies@databricks.com>
Closes #8492 from davies/fix_in.
This commit fixes an issue where the public SQL `Row` class did not override `hashCode`, causing it to violate the hashCode() + equals() contract. To fix this, I simply ported the `hashCode` implementation from the 1.4.x version of `Row`.
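For illustration, the same contract expressed in Python terms (`__eq__` together with `__hash__`). This is a sketch only: the `Row` class below is a hypothetical toy type, not Spark's `Row`, and the hash function is an assumption rather than the ported 1.4.x implementation.

```python
class Row:
    """Toy row type: equal rows must also produce equal hashes."""

    def __init__(self, *values):
        self.values = tuple(values)

    def __eq__(self, other):
        return isinstance(other, Row) and self.values == other.values

    def __hash__(self):
        # Derive the hash from exactly the fields that __eq__ compares.
        return hash(self.values)
```

With both methods defined consistently, equal rows collapse to a single entry in hash-based collections such as sets and dict keys.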
Author: Josh Rosen <joshrosen@databricks.com>
Closes #8500 from JoshRosen/SPARK-10325 and squashes the following commits:
51ffea1 [Josh Rosen] Override hashCode() for public Row.
Add sizeInBytes to HadoopFsRelation to enable broadcast joins.
cc marmbrus
Author: Davies Liu <davies@databricks.com>
Closes #8490 from davies/sizeInByte.
https://issues.apache.org/jira/browse/SPARK-10287
After porting json to HadoopFsRelation, it seems hard to keep the behavior of picking up new files automatically for JSON. This PR removes this behavior, so JSON is consistent with others (ORC and Parquet).
Author: Yin Huai <yhuai@databricks.com>
Closes #8469 from yhuai/jsonRefresh.
In BigDecimal or java.math.BigDecimal, the precision can be smaller than the scale; for example, BigDecimal("0.001") has precision = 1 and scale = 3. But DecimalType requires that the precision be no smaller than the scale, so we should use the maximum of precision and scale when inferring the schema from a decimal literal.
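A small Python sketch of the inference rule, using the standard `decimal` module as a stand-in for `java.math.BigDecimal`. `infer_decimal_type` is a hypothetical helper, and it assumes a finite literal:

```python
from decimal import Decimal
from typing import Tuple


def infer_decimal_type(literal: str) -> Tuple[int, int]:
    """Return (precision, scale) for a finite decimal literal."""
    _sign, digits, exponent = Decimal(literal).as_tuple()
    precision = len(digits)  # number of unscaled digits
    scale = -exponent        # digits to the right of the decimal point
    # The precision must not be smaller than the scale.
    return max(precision, scale), scale
```

For the literal `0.001` the unscaled value has one digit and the scale is 3, so the inferred type is (3, 3) rather than the invalid (1, 3).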
Author: Davies Liu <davies@databricks.com>
Closes #8428 from davies/smaller_decimal.
This PR:
1. supports transferring arbitrarily nested arrays from the JVM to the R side in SerDe;
2. based on 1, improves the collect() implementation, which can now collect data of complex types from a DataFrame.
Author: Sun Rui <rui.sun@intel.com>
Closes #8276 from sun-rui/SPARK-10048.
Replace `JavaConversions` implicits with `JavaConverters`
Most occurrences I've seen so far are necessary conversions; a few have been avoidable. None are in critical code as far as I see, yet.
Author: Sean Owen <sowen@cloudera.com>
Closes #8033 from srowen/SPARK-9613.
Spark SQL's data sources API exposes Catalyst's internal types through its Filter interfaces. This is a problem because types like UTF8String are not stable developer APIs and should not be exposed to third parties.
This issue caused incompatibilities when upgrading our `spark-redshift` library to work against Spark 1.5.0. To avoid these issues in the future we should only expose public types through these Filter objects. This patch accomplishes this by using CatalystTypeConverters to add the appropriate conversions.
Author: Josh Rosen <joshrosen@databricks.com>
Closes #8403 from JoshRosen/datasources-internal-vs-external-types.
We misunderstood the Julian day and the nanoseconds of the day stored in Parquet (as TimestampType) by Hive/Impala: they overlap, so they can't be added together directly.
To avoid confusing rounding during the conversion, we use `2440588` as the Julian day of the Unix epoch (the exact value is 2440587.5).
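A minimal sketch of the conversion in Python. The constant `2440588` comes from the description above; the function name and the choice of microseconds as the output unit are assumptions for illustration.

```python
JULIAN_DAY_OF_EPOCH = 2440588  # Julian day used for 1970-01-01
SECONDS_PER_DAY = 24 * 60 * 60
MICROS_PER_SECOND = 1_000_000


def from_julian_day(day: int, nanos_of_day: int) -> int:
    """Convert (Julian day, nanoseconds within that day) to microseconds since the Unix epoch."""
    seconds = (day - JULIAN_DAY_OF_EPOCH) * SECONDS_PER_DAY
    # Whole days and the time within the day are combined only after
    # both have been put on the same epoch and unit.
    return seconds * MICROS_PER_SECOND + nanos_of_day // 1000
```

Using the integer day number avoids carrying the half-day offset of 2440587.5 through the arithmetic.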
Author: Davies Liu <davies@databricks.com>
Author: Cheng Lian <lian@databricks.com>
Closes #8400 from davies/timestamp_parquet.
This patch adds an analyzer rule to ensure that set operations (union, intersect, and except) are only applied to tables with the same number of columns. Without this rule, there are scenarios where invalid queries can return incorrect results instead of failing with error messages; SPARK-9813 provides one example of this problem. In other cases, the invalid query can crash at runtime with extremely confusing exceptions.
I also performed a bit of cleanup to refactor some of those logical operators' code into a common `SetOperation` base class.
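The check itself is simple. A sketch in Python, where `AnalysisError` and `check_set_operation` are hypothetical names and the error text is invented for illustration, not the analyzer's actual message:

```python
from typing import List


class AnalysisError(Exception):
    """Raised when a query fails analysis (hypothetical stand-in)."""


def check_set_operation(op: str, left_cols: List[str], right_cols: List[str]) -> None:
    """Reject union/intersect/except over inputs with different column counts."""
    if len(left_cols) != len(right_cols):
        raise AnalysisError(
            f"{op} requires the same number of columns on both sides, "
            f"but got {len(left_cols)} and {len(right_cols)}"
        )
```

Failing at analysis time replaces both the silent wrong results and the confusing runtime exceptions with one clear error.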
Author: Josh Rosen <joshrosen@databricks.com>
Closes #7631 from JoshRosen/SPARK-9293.
PR #8341 is a valid fix for SPARK-10136, but it didn't catch the real root cause. The real problem can be rather tricky to explain, and requires readers to be pretty familiar with the parquet-format spec, especially the details of the `LIST` backwards-compatibility rules. Let me try to explain it here.
The structure of the problematic Parquet schema generated by parquet-avro is something like this:
```
message m {
  <repetition> group f (LIST) {           // Level 1
    repeated group array (LIST) {         // Level 2
      repeated <primitive-type> array;    // Level 3
    }
  }
}
```
(The schema generated by parquet-thrift is structurally similar, just replace the `array` at level 2 with `f_tuple`, and the other one at level 3 with `f_tuple_tuple`.)
This structure consists of two nested legacy 2-level `LIST`-like structures:
1. The repeated group type at level 2 is the element type of the outer array defined at level 1.
   This group should map to a `CatalystArrayConverter.ElementConverter` when building converters.
2. The repeated primitive type at level 3 is the element type of the inner array defined at level 2.
   This type should also map to a `CatalystArrayConverter.ElementConverter`.
The root cause of SPARK-10136 is that the group at level 2 isn't properly recognized as the element type of level 1. Thus, according to the parquet-format spec, the repeated primitive at level 3 is left as a so-called "unannotated repeated primitive type" and is recognized as a required list of required primitive type, so a `RepeatedPrimitiveConverter` instead of a `CatalystArrayConverter.ElementConverter` is created for it.
According to the parquet-format spec, unannotated repeated types shouldn't appear in a `LIST`- or `MAP`-annotated group. PR #8341 fixed this issue by allowing such unannotated repeated types to appear in `LIST`-annotated groups, which is a non-standard, hacky, but valid fix. (I didn't realize this when authoring #8341 though.)
As for the reason why level 2 isn't recognized as a list element type, it's because of the following `LIST` backwards-compatibility rule defined in the parquet-format spec:
> If the repeated field is a group with one field and is named either `array` or uses the `LIST`-annotated group's name with `_tuple` appended then the repeated type is the element type and elements are required.
(The `array` part is for parquet-avro compatibility, while the `_tuple` part is for parquet-thrift.)
This rule is implemented in [`CatalystSchemaConverter.isElementType`] [1], but neglected in [`CatalystRowConverter.isElementType`] [2]. This PR delivers a more robust fix by adding this rule in the latter method.
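A rough Python sketch of the element-type check, with the quoted rule as its last branch. `ParquetField` is a hypothetical simplified model of a Parquet type. The first two branches (repeated primitives and multi-field groups) reflect my reading of the other backwards-compatibility rules in the parquet-format spec and are assumptions here; this is not the code in `CatalystRowConverter`.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ParquetField:
    """Simplified stand-in for a Parquet type (hypothetical)."""
    name: str
    is_group: bool = False
    children: List["ParquetField"] = field(default_factory=list)


def is_element_type(repeated: ParquetField, parent_name: str) -> bool:
    """Decide whether the repeated field of a LIST group is itself the element type."""
    if not repeated.is_group:
        # Assumed from the spec: a repeated primitive is the element type.
        return True
    if len(repeated.children) > 1:
        # Assumed from the spec: a multi-field group is the element type.
        return True
    # The rule quoted above: a single-field group named `array` (parquet-avro)
    # or `<parent>_tuple` (parquet-thrift) is the element type.
    return repeated.name == "array" or repeated.name == parent_name + "_tuple"
```

With this rule in place, the level 2 group `array` under `f` is recognized as an element type, while a standard 3-level `list`/`element` wrapper is not.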
Note that parquet-avro 1.7.0 also suffers from this issue. Details can be found at [PARQUET-364] [3].
[1]: 85f9a61357/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystSchemaConverter.scala (L259-L305)
[2]: 85f9a61357/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystRowConverter.scala (L456-L463)
[3]: https://issues.apache.org/jira/browse/PARQUET-364
Author: Cheng Lian <lian@databricks.com>
Closes #8361 from liancheng/spark-10136/proper-version.
In `HiveComparisonTest`s it is possible to fail a query of the form `SELECT * FROM dest1`, where `dest1` is the query that is actually computing the incorrect results. To aid debugging, this patch improves the harness to also print these query plans and their results.
Author: Michael Armbrust <michael@databricks.com>
Closes #8388 from marmbrus/generatedTables.
https://issues.apache.org/jira/browse/SPARK-10121
Looks like the problem is that if we add a jar through another thread, the thread handling the JDBC session will not get the latest classloader.
Author: Yin Huai <yhuai@databricks.com>
Closes #8368 from yhuai/SPARK-10121.
* Makes `SQLImplicits.rddToDataFrameHolder` scaladoc consistent with `SQLContext.createDataFrame[A <: Product](rdd: RDD[A])` since the former is essentially a wrapper for the latter
* Clarifies `createDataFrame[A <: Product]` scaladoc to apply for any `RDD[Product]`, not just case classes
Author: Feynman Liang <fliang@databricks.com>
Closes #8406 from feynmanliang/sql-doc-fixes.
Currently, we eagerly attempt to resolve functions, even before their children are resolved. However, this is not valid in cases where we need to know the types of the input arguments (i.e. when resolving Hive UDFs).
As a fix, this PR delays function resolution until the function's children are resolved. This change also necessitates a change to the way we resolve aggregate expressions that are not in aggregate operators (e.g., in `HAVING` or `ORDER BY` clauses). Specifically, we can't assume that these misplaced functions will be resolved, allowing us to differentiate aggregate functions from normal functions. To compensate for this change we now attempt to resolve these unresolved expressions in the context of the aggregate operator, before checking to see if any aggregate expressions are present.
Author: Michael Armbrust <michael@databricks.com>
Closes #8371 from marmbrus/hiveUDFResolution.
This adds a missing null check to the Decimal `toScala` converter in `CatalystTypeConverters`, fixing an NPE.
Author: Josh Rosen <joshrosen@databricks.com>
Closes #8401 from JoshRosen/SPARK-10190.
Move `test.org.apache.spark.sql.hive` package tests to apparent intended `org.apache.spark.sql.hive` as they don't intend to test behavior from outside org.apache.spark.*
Alternate take, per discussion at https://github.com/apache/spark/pull/8051
I think this is what vanzin and I had in mind but also CC rxin to cross-check, as this does indeed depend on whether these tests were accidentally in this package or not. Testing from a `test.org.apache.spark` package is legitimate but didn't seem to be the intent here.
Author: Sean Owen <sowen@cloudera.com>
Closes #8307 from srowen/SPARK-9758.
This PR refactors `ParquetHiveCompatibilitySuite` so that it's easier to add new test cases.
I hit two bugs, SPARK-10177 and HIVE-11625, while working on this. I added test cases for them and marked them as ignored for now. SPARK-10177 will be addressed in a separate PR.
Author: Cheng Lian <lian@databricks.com>
Closes #8392 from liancheng/spark-8580/parquet-hive-compat-tests.
This PR contains examples on how to use some of the Stat Functions available for DataFrames under `df.stat`.
rxin
Author: Burak Yavuz <brkyvz@gmail.com>
Closes #8378 from brkyvz/update-sql-docs.
https://issues.apache.org/jira/browse/SPARK-10143
With this PR, we will set the min split size to Parquet's block size (row group size) set in the conf if the min split size is smaller. This way, we avoid having too many tasks, and even useless tasks, when reading Parquet data.
I tested it locally. The table I have has 343MB and it is in my local FS. Because I did not set any min/max split size, the default split size was 32MB and the map stage had 11 tasks. But there were only three tasks that actually read data. With my PR, there were only three tasks in the map stage. Here is the difference.
Without this PR:
![image](https://cloud.githubusercontent.com/assets/2072857/9399179/8587dba6-4765-11e5-9189-7ebba52a2b6d.png)
With this PR:
![image](https://cloud.githubusercontent.com/assets/2072857/9399185/a4735d74-4765-11e5-8848-1f1e361a6b4b.png)
Even if the block size setting does not match the actual block size of the Parquet file, I think it is still generally good to use Parquet's block size setting if the min split size is smaller than this block size.
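The rule boils down to taking the larger of the two settings. A tiny Python sketch, where the function and parameter names are hypothetical and not actual configuration keys:

```python
MB = 1024 * 1024


def effective_min_split_size(min_split_size: int, parquet_block_size: int) -> int:
    """Raise the min split size to the Parquet block (row group) size if it is smaller."""
    return max(min_split_size, parquet_block_size)
```

A min split size that is already larger than the block size is left untouched, so an explicit user setting is only ever raised, never lowered.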
Tested it on a cluster using
```
val count = sqlContext.table("""store_sales""").groupBy().count().queryExecution.executedPlan(3).execute().count
```
Basically, it reads zero columns of table `store_sales`. My table has 1824 Parquet files with sizes from 80MB to 280MB (1 to 3 row group sizes). Without this patch, in a 16-worker cluster, the job had 5023 tasks and took 102s. With this patch, the job had 2893 tasks and took 64s. It is still not as good as using one mapper per file (1824 tasks and 42s), but it is much better than our master.
Author: Yin Huai <yhuai@databricks.com>
Closes #8346 from yhuai/parquetMinSplit.
Type coercion for IF should have its children resolved first; otherwise we could hit an unresolved exception.
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Closes #8331 from adrian-wang/spark10130.
This is based on #7779, thanks to tarekauel. It fixes the conflict and nullability.
Closes #7779 and #8274.
Author: Tarek Auel <tarek.auel@googlemail.com>
Author: Davies Liu <davies@databricks.com>
Closes #8330 from davies/stringLocate.
This class is identical to `org.apache.spark.sql.execution.datasources.jdbc.DefaultSource` and is not needed.
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes #8334 from cloud-fan/minor.
I caught SPARK-10136 while adding more test cases to `ParquetAvroCompatibilitySuite`. Actual bug fix code lies in `CatalystRowConverter.scala`.
Author: Cheng Lian <lian@databricks.com>
Closes #8341 from liancheng/spark-10136/parquet-avro-nested-primitive-array.
This improves performance by ~20-30% in one of my local tests and should fix the performance regression from 1.4 to 1.5 on ss_max.
Author: Reynold Xin <rxin@databricks.com>
Closes #8332 from rxin/SPARK-10100.
https://issues.apache.org/jira/browse/SPARK-10092
This PR is a follow-up for multi-DB support. It has the following changes:
* `HiveContext.refreshTable` now accepts `dbName.tableName`.
* `HiveContext.analyze` now accepts `dbName.tableName`.
* `CreateTableUsing`, `CreateTableUsingAsSelect`, `CreateTempTableUsing`, `CreateTempTableUsingAsSelect`, `CreateMetastoreDataSource`, and `CreateMetastoreDataSourceAsSelect` all take `TableIdentifier` instead of the string representation of table name.
* When you call `saveAsTable` with a specified database, the data will be saved to the correct location.
* Explicitly do not allow users to create a temporary table with a specified database name (users could not do this before either).
* When we save table to metastore, we also check if db name and table name can be accepted by hive (using `MetaStoreUtils.validateName`).
Author: Yin Huai <yhuai@databricks.com>
Closes #8324 from yhuai/saveAsTableDB.
A few minor changes:
1. Improved documentation
2. Rename apply(distinct....) to distinct.
3. Changed MutableAggregationBuffer from a trait to an abstract class.
4. Renamed returnDataType to dataType to be more consistent with other expressions.
And unrelated to UDAFs:
1. Renamed file names in expressions to use suffix "Expressions" to be more consistent.
2. Moved regexp-related expressions out to their own file.
3. Renamed StringComparison => StringPredicate.
Author: Reynold Xin <rxin@databricks.com>
Closes #8321 from rxin/SPARK-9242.
As I discussed with Lian,
1. I added EqualNullSafe to ParquetFilters
- It uses the same equality comparison filter as EqualTo, since the Parquet filter actually performs a null-safe equality comparison.
2. Updated the test code (ParquetFilterSuite)
- Convert catalyst.Expression to sources.Filter
- Removed Cast since only Literal is picked up as a proper Filter in DataSourceStrategy
- Added EqualNullSafe comparison
3. Removed deprecated createFilter for catalyst.Expression
Author: hyukjinkwon <gurwls223@gmail.com>
Author: 권혁진 <gurwls223@gmail.com>
Closes #8275 from HyukjinKwon/master.
create t1 (a decimal(7, 2), b long);
select case when 1=1 then a else 1.0 end from t1;
select case when 1=1 then a else b end from t1;
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Closes #8270 from adrian-wang/casewhenfractional.
Speculation hates direct output committer, as there are multiple corner cases that may cause data corruption and/or data loss.
Please see this [PR comment] [1] for more details.
[1]: https://github.com/apache/spark/pull/8191#issuecomment-131598385
Author: Cheng Lian <lian@databricks.com>
Closes #8317 from liancheng/spark-9899/speculation-hates-direct-output-committer.
We should round the result of decimal multiplication/division to the expected precision/scale, and also check for overflow.
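A minimal Python sketch of rounding to a target precision/scale with an overflow check, using the standard `decimal` module. `to_precision` is a hypothetical helper; the HALF_UP rounding mode and returning `None` on overflow are assumptions for illustration, not a statement of what the patch does.

```python
from decimal import Decimal, ROUND_HALF_UP, localcontext
from typing import Optional


def to_precision(value: Decimal, precision: int, scale: int) -> Optional[Decimal]:
    """Round `value` to `scale` fractional digits; return None if it needs more than `precision` digits."""
    quantum = Decimal((0, (1,), -scale))  # e.g. scale=2 -> Decimal("0.01")
    with localcontext() as ctx:
        ctx.prec = 100  # generous working precision for this sketch
        rounded = value.quantize(quantum, rounding=ROUND_HALF_UP)
    if len(rounded.as_tuple().digits) > precision:
        return None  # overflow: the result does not fit in (precision, scale)
    return rounded
```

Note that rounding can itself cause overflow: 999.995 rounded to scale 2 becomes 1000.00, which no longer fits in precision 5.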
Author: Davies Liu <davies@databricks.com>
Closes #8287 from davies/decimal_division.
`DictionaryEncoding` uses Scala runtime reflection to avoid boxing costs while building the dictionary array. However, this code path may hit [SI-6240] [1] and throw an exception.
[1]: https://issues.scala-lang.org/browse/SI-6240
Author: Cheng Lian <lian@databricks.com>
Closes #8306 from liancheng/spark-9627/in-memory-cache-scala-reflection.
DataFrame.withColumn in Python should be consistent with the Scala one (replacing the existing column that has the same name).
cc marmbrus
Author: Davies Liu <davies@databricks.com>
Closes #8300 from davies/with_column.
This is kind of a weird case, but given a sufficiently complex query plan (in this case a TungstenProject with an Exchange underneath), we could have NPEs on the executors due to the time at which we were calling transformAllExpressions.
In general we should ensure that all transformations occur on the driver and not on the executors. Some reasons for avoiding executor-side transformations include:
* (this case) Some operator constructors require state such as access to the Spark/SQL conf, so doing a makeCopy on the executor can fail.
* (unrelated reason for avoiding executor transformations) ExprIds are calculated using an atomic integer, so you can violate their uniqueness constraint by constructing them anywhere other than the driver.
This subsumes #8285.
Author: Reynold Xin <rxin@databricks.com>
Author: Michael Armbrust <michael@databricks.com>
Closes #8295 from rxin/SPARK-10096.
In UnsafeRow, we use a private field of BigInteger for better performance, but it doesn't actually contribute much (3% in one benchmark) to end-to-end runtime, and it makes the code non-portable (it may fail on other JVM implementations).
So we should use the public API instead.
cc rxin
Author: Davies Liu <davies@databricks.com>
Closes #8286 from davies/portable_decimal.
Scala process API has a known bug ([SI-8768] [1]), which may be the reason why several test suites which fork sub-processes are flaky.
This PR replaces the Scala process API with the Java process API in `CliSuite`, `HiveSparkSubmitSuite`, and `HiveThriftServer2` related test suites to see whether it fixes these flaky tests.
[1]: https://issues.scala-lang.org/browse/SI-8768
Author: Cheng Lian <lian@databricks.com>
Closes #8168 from liancheng/spark-9939/use-java-process-api.
It turns out that inner classes of inner objects are referenced directly, so moving them would break binary compatibility.
Author: Michael Armbrust <michael@databricks.com>
Closes #8281 from marmbrus/binaryCompat.
Parquet hard-codes a JUL logger that always writes to stdout. This PR redirects it via the SLF4J JUL bridge handler, so that we can control Parquet logs via `log4j.properties`.
This solution is inspired by https://github.com/Parquet/parquet-mr/issues/390#issuecomment-46064909.
Author: Cheng Lian <lian@databricks.com>
Closes #8196 from liancheng/spark-8118/redirect-parquet-jul.
The type for array of array in Java is slightly different than array of others.
cc cloud-fan
Author: Davies Liu <davies@databricks.com>
Closes #8250 from davies/array_binary.
https://issues.apache.org/jira/browse/SPARK-9592
#8113 has the fundamental fix. But, if we want to minimize the number of changed lines, we can go with this one. Then, in 1.6, we merge #8113.
Author: Yin Huai <yhuai@databricks.com>
Closes #8172 from yhuai/lastFix and squashes the following commits:
b28c42a [Yin Huai] Regression test.
af87086 [Yin Huai] Fix last.
JIRA: https://issues.apache.org/jira/browse/SPARK-9526
This PR is a follow-up to #7830, aiming to use randomized tests to reveal more potential bugs in SQL expressions.
Author: Yijie Shen <henry.yijieshen@gmail.com>
Closes #7855 from yjshen/property_check.
This PR uses `JDBCRDD.getConnector` to load JDBC driver before creating connection in `DataFrameReader.jdbc` and `DataFrameWriter.jdbc`.
Author: zsxwing <zsxwing@gmail.com>
Closes #8232 from zsxwing/SPARK-10036 and squashes the following commits:
adf75de [zsxwing] Add extraOptions to the connection properties
57f59d4 [zsxwing] Load JDBC driver in DataFrameReader.jdbc and DataFrameWriter.jdbc
This issue has been fixed by https://github.com/apache/spark/pull/8215; this PR adds a regression test for it.
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes #8222 from cloud-fan/minor and squashes the following commits:
0bbfb1c [Wenchen Fan] fix style...
7e2d8d9 [Wenchen Fan] add test
When inserting data into a `HadoopFsRelation`, if `commitTask()` of the writer container fails, `abortTask()` will be invoked. However, both `commitTask()` and `abortTask()` try to close the output writer(s). The problem is that closing underlying writers may not be an idempotent operation. E.g., `ParquetRecordWriter.close()` throws an NPE when called twice.
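One possible way to guard against the double close, sketched in Python. All class and method names here are hypothetical stand-ins, and this is not necessarily how the patch itself addresses it: the container forgets the writer before closing it, so a later abort cannot close it again.

```python
class FragileWriter:
    """Stand-in for a writer whose close() is not idempotent."""

    def __init__(self):
        self.close_calls = 0

    def close(self):
        self.close_calls += 1
        if self.close_calls > 1:
            raise RuntimeError("writer already closed")


class WriterContainer:
    def __init__(self, writer, fail_commit=False):
        self._writer = writer
        self._fail_commit = fail_commit
        self.aborted = False

    def _close_writer_once(self):
        # Drop the reference first so a second call becomes a no-op.
        writer, self._writer = self._writer, None
        if writer is not None:
            writer.close()

    def commit_task(self):
        self._close_writer_once()
        if self._fail_commit:
            raise IOError("simulated commit failure")

    def abort_task(self):
        self._close_writer_once()
        self.aborted = True


def run_task(container: WriterContainer) -> bool:
    """Return True if the task committed, False if it was aborted."""
    try:
        container.commit_task()
        return True
    except IOError:
        container.abort_task()
        return False
```

Even when the commit fails after the writer was closed, the abort path runs cleanly and the writer sees exactly one `close()` call.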
Author: Cheng Lian <lian@databricks.com>
Closes #8236 from liancheng/spark-7837/double-closing.
In case of schema merging, we only handled first level fields when converting Parquet groups to `InternalRow`s. Nested struct fields are not properly handled.
For example, the schema of a Parquet file to be read can be:
```
message individual {
  required group f1 {
    optional binary f11 (utf8);
  }
}
```
while the global schema is:
```
message global {
  required group f1 {
    optional binary f11 (utf8);
    optional int32 f12;
  }
}
```
This PR fixes this issue by padding missing fields when creating actual converters.
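The padding idea can be sketched in Python on plain dictionaries. This is a simplification: the patch pads fields when creating converters, whereas this hypothetical `pad_missing_fields` helper pads an already-decoded record, and schemas are modelled as nested dicts purely for illustration.

```python
from typing import Any, Dict, Optional

Schema = Dict[str, Any]  # field name -> type name (str) or nested Schema (dict)


def pad_missing_fields(record: Dict[str, Any], file_schema: Schema,
                       global_schema: Schema) -> Dict[str, Optional[Any]]:
    """Fill fields present in the global schema but absent from the file schema with None, recursively."""
    padded: Dict[str, Optional[Any]] = {}
    for name, global_type in global_schema.items():
        if name not in file_schema:
            padded[name] = None  # field missing from this file
        elif isinstance(global_type, dict):
            value = record.get(name)
            # Recurse so that nested structs are padded too, not just the first level.
            padded[name] = None if value is None else pad_missing_fields(
                value, file_schema[name], global_type)
        else:
            padded[name] = record.get(name)
    return padded
```

Applied to the two schemas above, a record `{"f1": {"f11": "x"}}` read from the individual file becomes `{"f1": {"f11": "x", "f12": None}}` under the global schema.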
Author: Cheng Lian <lian@databricks.com>
Closes #8228 from liancheng/spark-10005/nested-schema-merging.
The `initialSize` argument of `ColumnBuilder.initialize()` should be the
number of rows rather than bytes. However `InMemoryColumnarTableScan`
passes in a byte size, which makes Spark SQL allocate more memory than
necessary when building in-memory columnar buffers.
Author: Kun Xu <viper_kun@163.com>
Closes #8189 from viper-kun/errorSize.
We should skip unresolved `LogicalPlan`s in `PullOutNondeterministic`, as calling `output` on an unresolved `LogicalPlan` produces a confusing error message.
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes #8203 from cloud-fan/error-msg and squashes the following commits:
1c67ca7 [Wenchen Fan] move test
7593080 [Wenchen Fan] correct error message for aggregate
This pull request creates a new operator interface that is more similar to traditional database query iterators (with open/close/next/get).
These local operators are not currently used anywhere, but will become the basis for SPARK-9983 (local physical operators for query execution).
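To make the open/next/get/close style concrete, here is a small Python sketch of such an iterator interface. The class names echo the operators mentioned in this log, but the code is a hypothetical illustration and not the Scala interface added by the patch; `SeqScanNode` and `collect` in particular are assumed helper names.

```python
from typing import Any, Callable, List


class LocalNode:
    """Pull-based operator: open(), then next()/get() until next() is False, then close()."""

    def open(self) -> None:
        raise NotImplementedError

    def next(self) -> bool:
        raise NotImplementedError

    def get(self) -> Any:
        raise NotImplementedError

    def close(self) -> None:
        raise NotImplementedError


class SeqScanNode(LocalNode):
    """Leaf operator that scans an in-memory list."""

    def __init__(self, rows: List[Any]):
        self._rows = rows
        self._index = -1

    def open(self) -> None:
        self._index = -1

    def next(self) -> bool:
        self._index += 1
        return self._index < len(self._rows)

    def get(self) -> Any:
        return self._rows[self._index]

    def close(self) -> None:
        pass


class FilterNode(LocalNode):
    def __init__(self, predicate: Callable[[Any], bool], child: LocalNode):
        self._predicate = predicate
        self._child = child

    def open(self) -> None:
        self._child.open()

    def next(self) -> bool:
        # Advance the child until a row passes the predicate.
        while self._child.next():
            if self._predicate(self._child.get()):
                return True
        return False

    def get(self) -> Any:
        return self._child.get()

    def close(self) -> None:
        self._child.close()


class LimitNode(LocalNode):
    def __init__(self, limit: int, child: LocalNode):
        self._limit = limit
        self._child = child
        self._count = 0

    def open(self) -> None:
        self._child.open()
        self._count = 0

    def next(self) -> bool:
        if self._count >= self._limit:
            return False
        if self._child.next():
            self._count += 1
            return True
        return False

    def get(self) -> Any:
        return self._child.get()

    def close(self) -> None:
        self._child.close()


def collect(node: LocalNode) -> List[Any]:
    """Drive a plan to completion and gather its rows."""
    node.open()
    try:
        rows = []
        while node.next():
            rows.append(node.get())
        return rows
    finally:
        node.close()
```

Operators compose by wrapping a child, for example `collect(LimitNode(2, FilterNode(lambda r: r % 2 == 0, SeqScanNode([1, 2, 3, 4, 5, 6]))))`, and each row is pulled through the whole chain one at a time.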
cc zsxwing
Author: Reynold Xin <rxin@databricks.com>
Closes #8212 from rxin/SPARK-9984.
This PR enforces dynamic partition column data type requirements by adding analysis rules.
JIRA: https://issues.apache.org/jira/browse/SPARK-8887
Author: Yijie Shen <henry.yijieshen@gmail.com>
Closes #8201 from yjshen/dynamic_partition_columns.
Also alias the ExtractValue instead of wrapping it with UnresolvedAlias when resolving attributes in LogicalPlan, as this alias will be trimmed if it's unnecessary.
Based on #7957 without the changes to mllib, but instead maintaining earlier behavior when using `withColumn` on expressions that already have metadata.
Author: Wenchen Fan <cloud0fan@outlook.com>
Author: Michael Armbrust <michael@databricks.com>
Closes #8215 from marmbrus/pr/7957.
This bug is caused by a wrong column-exists check in `__getitem__` of the PySpark DataFrame. `DataFrame.apply` accepts not only top-level column names but also nested column names like `a.b`, so we should remove that check from `__getitem__`.
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes #8202 from cloud-fan/nested.
In MLlib we sometimes need to set metadata for the new column, so we alias the new column with metadata before calling `withColumn`, and `withColumn` then aliases this column again. Here I overloaded `withColumn` to allow users to set metadata, just like what we did for `Column.as`.
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes #8159 from cloud-fan/withColumn.
As `InternalRow` does not extend `Row` now, I think we can remove it.
Author: Liang-Chi Hsieh <viirya@appier.com>
Closes #8170 from viirya/remove_canequal.
Currently, the pageSize of TungstenSort is calculated from driver.memory; it should use executor.memory instead.
Also, in the worst case, the safeFactor could be 4 (because of rounding), so increase it to 16.
cc rxin
Author: Davies Liu <davies@databricks.com>
Closes #8175 from davies/page_size.
A fundamental limitation of the existing SQL tests is that *there is simply no way to create your own `SparkContext`*. This is a serious limitation because the user may wish to use a different master or config. As a case in point, `BroadcastJoinSuite` is entirely commented out because there is no way to make it pass with the existing infrastructure.
This patch removes the singletons `TestSQLContext` and `TestData`, and instead introduces a `SharedSQLContext` that starts a context per suite. Unfortunately the singletons were so ingrained in the SQL tests that this patch necessarily needed to touch *all* the SQL test files.
Author: Andrew Or <andrew@databricks.com>
Closes #8111 from andrewor14/sql-tests-refactor.
When free memory in the executor runs low, cached broadcast objects need to be serialized to disk, but currently the deserialized UnsafeHashedRelation can't be serialized and fails with an NPE. This PR fixes that.
cc rxin
Author: Davies Liu <davies@databricks.com>
Closes #8174 from davies/serialize_hashed.
PR #7967 enables us to save data source relations to metastore in Hive compatible format when possible. But it fails to persist Parquet relations with decimal column(s) to Hive metastore of versions lower than 1.2.0. This is because `ParquetHiveSerDe` in Hive versions prior to 1.2.0 doesn't support decimal. This PR checks for this case and falls back to Spark SQL specific metastore table format.
Author: Yin Huai <yhuai@databricks.com>
Author: Cheng Lian <lian@databricks.com>
Closes #8130 from liancheng/spark-9757/old-hive-parquet-decimal.
I made a mistake in #8049 by casting the literal value to the attribute's data type, which would simply truncate the literal value and push a wrong filter down.
JIRA: https://issues.apache.org/jira/browse/SPARK-9927
Author: Yijie Shen <henry.yijieshen@gmail.com>
Closes #8157 from yjshen/rever8049.
This patch adds a thread-safe lookup for BytesToBytesMap and uses it in the broadcasted HashedRelation.
Author: Davies Liu <davies@databricks.com>
Closes #8151 from davies/safeLookup.
https://issues.apache.org/jira/browse/SPARK-9920
Taking `sqlContext.sql("select i, sum(j1) as sum from testAgg group by i").explain()` as an example, the output of our current master is
```
== Physical Plan ==
TungstenAggregate(key=[i#0], value=[(sum(cast(j1#1 as bigint)),mode=Final,isDistinct=false)]
 TungstenExchange hashpartitioning(i#0)
  TungstenAggregate(key=[i#0], value=[(sum(cast(j1#1 as bigint)),mode=Partial,isDistinct=false)]
   Scan ParquetRelation[file:/user/hive/warehouse/testagg][i#0,j1#1]
```
With this PR, the output will be
```
== Physical Plan ==
TungstenAggregate(key=[i#0], functions=[(sum(cast(j1#1 as bigint)),mode=Final,isDistinct=false)], output=[i#0,sum#18L])
 TungstenExchange hashpartitioning(i#0)
  TungstenAggregate(key=[i#0], functions=[(sum(cast(j1#1 as bigint)),mode=Partial,isDistinct=false)], output=[i#0,currentSum#22L])
   Scan ParquetRelation[file:/user/hive/warehouse/testagg][i#0,j1#1]
```
Author: Yin Huai <yhuai@databricks.com>
Closes #8150 from yhuai/SPARK-9920.
Currently, UnsafeRowSerializer does not close the InputStream, which causes an fd leak if the InputStream holds an open fd.
TODO: the fd could still be leaked if any items in the stream are not consumed. Currently it relies on GC to close the fd in this case.
cc JoshRosen
Author: Davies Liu <davies@databricks.com>
Closes #8116 from davies/fd_leak.
I think that we should pass additional configuration flags to disable the driver UI and Master REST server in SparkSubmitSuite and HiveSparkSubmitSuite. This might cut down on port-contention-related flakiness in Jenkins.
Author: Josh Rosen <joshrosen@databricks.com>
Closes #8124 from JoshRosen/disable-ui-in-sparksubmitsuite.
Refactor Utils class and create ShutdownHookManager.
NOTE: Wasn't able to run /dev/run-tests on windows machine.
Manual tests were conducted locally using custom log4j.properties file with Redis appender and logstash formatter (bundled in the fat-jar submitted to spark)
ex:
log4j.rootCategory=WARN,console,redis
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
log4j.logger.org.eclipse.jetty=WARN
log4j.logger.org.eclipse.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
log4j.logger.org.apache.spark.graphx.Pregel=INFO
log4j.appender.redis=com.ryantenney.log4j.FailoverRedisAppender
log4j.appender.redis.endpoints=hostname:port
log4j.appender.redis.key=mykey
log4j.appender.redis.alwaysBatch=false
log4j.appender.redis.layout=net.logstash.log4j.JSONEventLayoutV1
Author: michellemay <mlemay@gmail.com>
Closes #8109 from michellemay/SPARK-9826.
If the correct parameter is not provided, Hive will run into an error
because it calls methods that are specific to the local filesystem to
copy the data.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes #8086 from vanzin/SPARK-9804.
This is the sister patch to #8011, but for aggregation.
In a nutshell: create the `TungstenAggregationIterator` before computing the parent partition. Internally this creates a `BytesToBytesMap` which acquires a page in the constructor as of this patch. This ensures that the aggregation operator is not starved since we reserve at least 1 page in advance.
rxin yhuai
Author: Andrew Or <andrew@databricks.com>
Closes #8038 from andrewor14/unsafe-starve-memory-agg.
This PR adds a hacky workaround for PARQUET-201, and should be removed once we upgrade to parquet-mr 1.8.1 or higher versions.
In Parquet, not all types of columns can be used for filter push-down optimization. The set of valid column types is controlled by `ValidTypeMap`. Unfortunately, in parquet-mr 1.7.0 and prior versions, this limitation is too strict, and doesn't allow `BINARY (ENUM)` columns to be pushed down. On the other hand, `BINARY (ENUM)` is commonly seen in Parquet files written by libraries like `parquet-avro`.
This restriction is problematic for Spark SQL, because Spark SQL doesn't have a type that maps to Parquet `BINARY (ENUM)` directly, and always converts `BINARY (ENUM)` to Catalyst `StringType`. Thus, a predicate involving a `BINARY (ENUM)` is recognized as one involving a string field instead and can be pushed down by the query optimizer. Such predicates are actually perfectly legal, except that they fail the `ValidTypeMap` check.
The workaround added here is relaxing `ValidTypeMap` to include `BINARY (ENUM)`. I also took the chance to simplify `ParquetCompatibilityTest` a little bit when adding regression test.
Author: Cheng Lian <lian@databricks.com>
Closes #8107 from liancheng/spark-9407/parquet-enum-filter-push-down.
This PR fixes the failure to push filters down to the JDBC source caused by `Cast` during pattern matching.
When comparing columns of different types, there's a good chance that a cast is needed on the column, so the pattern does not match directly on Attribute and the push-down fails.
Author: Yijie Shen <henry.yijieshen@gmail.com>
Closes #8049 from yjshen/jdbc_pushdown.
`RuleExecutor.timeMap` is currently a non-thread-safe mutable HashMap; this can lead to infinite loops if multiple threads are concurrently modifying the map. I believe that this is responsible for some hangs that I've observed in HiveQuerySuite.
This patch addresses this by using a Guava `AtomicLongMap`.
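For illustration only, a Python analogue of a thread-safe counter map. `AtomicCounterMap` is a hypothetical class guarded by a lock; it mimics the add-and-get style of an atomic long map but is not Guava's `AtomicLongMap` and not the code in `RuleExecutor`.

```python
import threading
from collections import defaultdict
from typing import Dict


class AtomicCounterMap:
    """Maps keys to integer totals; every update happens under a lock."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._values: Dict[str, int] = defaultdict(int)

    def add_and_get(self, key: str, delta: int) -> int:
        # The read-modify-write is a single critical section, so
        # concurrent updates can neither be lost nor corrupt the map.
        with self._lock:
            self._values[key] += delta
            return self._values[key]

    def get(self, key: str) -> int:
        with self._lock:
            return self._values.get(key, 0)
```

A rule executor could then record elapsed time with something like `time_map.add_and_get(rule_name, elapsed_nanos)` from any thread.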
Author: Josh Rosen <joshrosen@databricks.com>
Closes #8120 from JoshRosen/rule-executor-time-map-fix.
Author: Davies Liu <davies@databricks.com>
Closes #8117 from davies/fix_serialization and squashes the following commits:
d21ac71 [Davies Liu] fix serialization with empty broadcast
DirectParquetOutputCommitter was moved in SPARK-9763. However, users can explicitly set the class as a config option, so we must be able to resolve the old committer qualified name.
Author: Reynold Xin <rxin@databricks.com>
Closes #8114 from rxin/SPARK-9849.