ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
DB Tsai	ad853c5678	[SPARK-25956] Make Scala 2.12 as default Scala version in Spark 3.0 ## What changes were proposed in this pull request? This PR makes Scala 2.12 Spark's default Scala version, with Scala 2.11 as the alternative version. This implies that Scala 2.12 will be used by our CI builds, including pull request builds. We'll update Jenkins to include new compile-only jobs for Scala 2.11 to ensure the code still compiles with Scala 2.11. ## How was this patch tested? Existing tests. Closes #22967 from dbtsai/scala2.12. Authored-by: DB Tsai <d_tsai@apple.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2018-11-14 16:22:23 -08:00
Yuanjian Li	2977e2312d	[SPARK-25986][BUILD] Add rules to ban throw Errors in application code ## What changes were proposed in this pull request? Add Scala and Java lint check rules to ban the usage of `throw new xxxErrors`, and fix up all existing instances, following https://github.com/apache/spark/pull/22989#issuecomment-437939830. See more details in https://github.com/apache/spark/pull/22969. ## How was this patch tested? Local test with lint-scala and lint-java. Closes #22989 from xuanyuanking/SPARK-25986. Authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2018-11-14 13:05:18 -08:00
Shixiong Zhu	4035c98a0c	[SPARK-26042][SS][TESTS] Fix a potential hang in KafkaContinuousSourceTopicDeletionSuite ## What changes were proposed in this pull request? As initializing lazy vals shares the same lock, a thread that tries to initialize `executedPlan` while `toRdd` is running will hang forever. This PR just materializes `executedPlan` so that accessing it while `toRdd` is running doesn't need to wait for a lock. ## How was this patch tested? Jenkins Closes #23023 from zsxwing/SPARK-26042. Authored-by: Shixiong Zhu <zsxwing@gmail.com> Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>	2018-11-14 10:19:20 -08:00
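The lock-sharing behaviour described in this commit can be sketched in plain Scala. This is a minimal illustration, not Spark code: `Plan`, `slow` and `fast` are hypothetical names standing in for the query execution object, a long-running lazy val and `executedPlan`. It assumes Scala 2.x, where lazy vals of one instance are initialized under that instance's monitor; other Scala versions may behave differently.

```scala
object LazyValLockDemo {
  // Hypothetical stand-in for an object holding several lazy vals.
  class Plan {
    // Stands in for a lazy val whose initializer runs for a long time.
    lazy val slow: String = { Thread.sleep(500); "slow" }
    // Stands in for a cheap lazy val that other threads want to read.
    lazy val fast: String = "fast"
  }

  def main(args: Array[String]): Unit = {
    val plan = new Plan

    // Materialize the cheap lazy val before the slow initializer starts,
    // so later reads see an already-initialized value.
    plan.fast

    // Start the slow initialization on another thread.
    val worker = new Thread(new Runnable {
      def run(): Unit = { plan.slow; () }
    })
    worker.start()
    Thread.sleep(50) // give the worker time to enter the initializer

    val start = System.nanoTime()
    val value = plan.fast // already initialized, so no initialization lock is taken
    val elapsedMs = (System.nanoTime() - start) / 1000000
    println(s"read '$value' after $elapsedMs ms")

    worker.join()
  }
}
```

If the early `plan.fast` call is removed, the read on the main thread may have to wait for the slow initializer on Scala 2.x, which is the kind of wait the commit avoids by materializing the value up front.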
hyukjinkwon	a7a331df6e	[SPARK-26032][PYTHON] Break large sql/tests.py files into smaller files ## What changes were proposed in this pull request? This is the official first attempt to break huge single `tests.py` file - I did it locally before few times and gave up for some reasons. Now, currently it really makes the unittests super hard to read and difficult to check. To me, it even bothers me to to scroll down the big file. It's one single 7000 lines file! This is not only readability issue. Since one big test takes most of tests time, the tests don't run in parallel fully - although it will costs to start and stop the context. We could pick up one example and follow. Given my investigation, the current style looks closer to NumPy structure and looks easier to follow. Please see https://github.com/numpy/numpy/tree/master/numpy. Basically this PR proposes to break down `pyspark/sql/tests.py` into ...: ```bash pyspark ... ├── sql ... │ ├── tests # Includes all tests broken down from 'pyspark/sql/tests.py' │ │ │ # Each matchs to module in 'pyspark/sql'. Additionally, some logical group can │ │ │ # be added. For instance, 'test_arrow.py', 'test_datasources.py' ... │ │ ├── __init__.py │ │ ├── test_appsubmit.py │ │ ├── test_arrow.py │ │ ├── test_catalog.py │ │ ├── test_column.py │ │ ├── test_conf.py │ │ ├── test_context.py │ │ ├── test_dataframe.py │ │ ├── test_datasources.py │ │ ├── test_functions.py │ │ ├── test_group.py │ │ ├── test_pandas_udf.py │ │ ├── test_pandas_udf_grouped_agg.py │ │ ├── test_pandas_udf_grouped_map.py │ │ ├── test_pandas_udf_scalar.py │ │ ├── test_pandas_udf_window.py │ │ ├── test_readwriter.py │ │ ├── test_serde.py │ │ ├── test_session.py │ │ ├── test_streaming.py │ │ ├── test_types.py │ │ ├── test_udf.py │ │ └── test_utils.py ... ├── testing # Includes testing utils that can be used in unittests. │ ├── __init__.py │ └── sqlutils.py ... ``` ## How was this patch tested? Existing tests should cover. `cd python` and `./run-tests-with-coverage`. 
Manually checked they are actually being ran. Each test (not officially) can be ran via: ``` SPARK_TESTING=1 ./bin/pyspark pyspark.sql.tests.test_pandas_udf_scalar ``` Note that if you're using Mac and Python 3, you might have to `OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES`. Closes #23021 from HyukjinKwon/SPARK-25344. Authored-by: hyukjinkwon <gurwls223@apache.org> Signed-off-by: hyukjinkwon <gurwls223@apache.org>	2018-11-14 14:51:11 +08:00
Liang-Chi Hsieh	f26cd18816	[SPARK-25942][SQL] Aggregate expressions shouldn't be resolved on AppendColumns ## What changes were proposed in this pull request? `Dataset.groupByKey` will bring in new attributes from serializer. If key type is the same as original Dataset's object type, they have same serializer output and so the attribute names will conflict. This won't be a problem at most of cases, if we don't refer conflict attributes: ```scala val ds: Dataset[(ClassData, Long)] = Seq(ClassData("one", 1), ClassData("two", 2)).toDS() .map(c => ClassData(c.a, c.b + 1)) .groupByKey(p => p).count() ``` But if we use conflict attributes, `Analyzer` will complain about ambiguous references: ```scala val ds = Seq(1, 2, 3).toDS() val agg = ds.groupByKey(_ >= 2).agg(sum("value").as[Long], sum($"value" + 1).as[Long]) ``` We have discussed two fixes https://github.com/apache/spark/pull/22944#discussion_r230977212: 1. Implicitly add alias to key attribute: Works for primitive type. But for product type, we can't implicitly add aliases to key attributes because we might need to access key attributes by names in methods like `mapGroups`. 2. Detect conflict from key attributes and warn users to add alias manually This might work, but needs to add some hacks to Analyzer or AttributeSeq.resolve. This patch applies another simpler fix. We resolve aggregate expressions with `AppendColumns`'s children, instead of `AppendColumns`. `AppendColumns`'s output contains its children's output and serializer output, aggregate expressions shouldn't use serializer output. ## How was this patch tested? Added test. Closes #22944 from viirya/dataset_agg. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2018-11-14 01:57:10 +08:00
Marco Gaido	4b955625ee	[SPARK-25482][SQL] Avoid pushdown of subqueries to data source filters ## What changes were proposed in this pull request? An expression with a subquery can be pushed down as a data source filter. Even though the filter is not actively used, this still causes a re-execution of the subquery because the `ReuseSubquery` optimization rule is ineffective in this case. The PR avoids this problem by forbidding the pushdown of filters containing a subquery. ## How was this patch tested? Added UT. Closes #22518 from mgaido91/SPARK-25482. Authored-by: Marco Gaido <marcogaido91@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2018-11-14 01:52:33 +08:00
mu5358271	a7a051afa8	[SPARK-25947][SQL] Reduce memory usage in ShuffleExchangeExec by selecting only the sort columns ## What changes were proposed in this pull request? When sorting rows, ShuffleExchangeExec uses the entire row instead of just the columns references in SortOrder to create the RangePartitioner. This causes the RangePartitioner to sample entire rows to create rangeBounds and can cause OOM issues on the driver when rows contain large fields. This change creates a projection and only use columns involved in the SortOrder for the RangePartitioner ## How was this patch tested? Existing tests in spark-sql. Plus Started a local spark-shell with a small spark.driver.maxResultSize: ``` spark-shell --master 'local[16]' --conf spark.driver.maxResultSize=128M --driver-memory 4g ``` and ran the following script: ``` import com.google.common.io.Files import org.apache.spark.SparkContext import org.apache.spark.sql.SparkSession import scala.util.Random transient val sc = SparkContext.getOrCreate() transient val spark = SparkSession.builder().getOrCreate() import spark.implicits._ val path = Files.createTempDir().toString // this creates a dataset with 1024 entries, each 1MB in size, across 16 partitions sc.parallelize(0 until (1 << 10), sc.defaultParallelism). map(_ => Array.fill(1 << 18)(Random.nextInt)). toDS. write.mode("overwrite").parquet(path) spark.read.parquet(path). orderBy('value (0)). write.mode("overwrite").parquet(s"$path-sorted") spark.read.parquet(s"$path-sorted").show ``` execution would fail when initializing RangePartitioner without this change. execution succeeds and generates a correctly sorted dataset with this change. Please review http://spark.apache.org/contributing.html before opening a pull request. Closes #22961 from mu5358271/sort-improvement. Authored-by: mu5358271 <shuheng.dai@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2018-11-14 00:25:57 +08:00
Maxim Gekk	44683e0f7b	[SPARK-26023][SQL] Dumping truncated plans and generated code to a file ## What changes were proposed in this pull request? In the PR, I propose a new method for debugging queries by dumping info about their execution to a file. It saves the logical, optimized and physical plans, similar to the `explain()` method, plus the generated code. One advantage of the method over `explain` is that it does not materialize the full output as one string in memory, which can cause OOMs. ## How was this patch tested? Added a few tests to `QueryExecutionSuite` to check positive and negative scenarios. Closes #23018 from MaxGekk/truncated-plan-to-file. Authored-by: Maxim Gekk <maxim.gekk@databricks.com> Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>	2018-11-13 15:23:35 +01:00
Marco Gaido	8d7dbde914	[SPARK-26003] Improve SQLAppStatusListener.aggregateMetrics performance ## What changes were proposed in this pull request? In `SQLAppStatusListener.aggregateMetrics`, we use `metricIds` only to filter the relevant metrics, yet it is a Seq, which is also sorted. When there are many metrics involved, this can be pretty inefficient. The PR proposes to use a Set for it. ## How was this patch tested? NA Closes #23002 from mgaido91/SPARK-26003. Authored-by: Marco Gaido <marcogaido91@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2018-11-13 13:47:25 +08:00
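The difference between filtering against a Seq and against a Set can be sketched as follows. This is an illustration under assumptions, not the listener's code: `MetricUpdate`, `filterWithSeq` and `filterWithSet` are hypothetical names.

```scala
object MetricFilterDemo {
  // Hypothetical stand-in for a metric update: (metric id, value).
  final case class MetricUpdate(id: Long, value: Long)

  // Seq.contains scans the ids linearly for every update.
  def filterWithSeq(updates: Seq[MetricUpdate], ids: Seq[Long]): Seq[MetricUpdate] =
    updates.filter(u => ids.contains(u.id))

  // Set.contains is a hash lookup, so each membership test is cheap
  // regardless of how many ids there are.
  def filterWithSet(updates: Seq[MetricUpdate], ids: Set[Long]): Seq[MetricUpdate] =
    updates.filter(u => ids.contains(u.id))

  def main(args: Array[String]): Unit = {
    val updates = (0L until 10L).map(i => MetricUpdate(i, i * 100))
    val wanted = Seq(2L, 5L, 7L)

    val viaSeq = filterWithSeq(updates, wanted)
    val viaSet = filterWithSet(updates, wanted.toSet)

    // Both approaches select the same updates; only the lookup cost differs.
    assert(viaSeq == viaSet)
    println(viaSet.map(_.id).mkString(","))
  }
}
```

Since the ids are used only for membership tests, neither their order nor any sorting matters, which is why a Set is a natural fit here.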
Maxim Gekk	c491934377	[SPARK-26007][SQL] DataFrameReader.csv() respects to spark.sql.columnNameOfCorruptRecord ## What changes were proposed in this pull request? Pass the current value of the SQL config `spark.sql.columnNameOfCorruptRecord` to `CSVOptions` inside `DataFrameReader.csv()`. ## How was this patch tested? Added a test where the default value of `spark.sql.columnNameOfCorruptRecord` is changed. Closes #23006 from MaxGekk/csv-corrupt-sql-config. Lead-authored-by: Maxim Gekk <maxim.gekk@databricks.com> Co-authored-by: Dongjoon Hyun <dongjoon@apache.org> Co-authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: hyukjinkwon <gurwls223@apache.org>	2018-11-13 12:26:19 +08:00
Sean Owen	2d085c13b7	[SPARK-25984][CORE][SQL][STREAMING] Remove deprecated .newInstance(), primitive box class constructor calls ## What changes were proposed in this pull request? Replace Class.newInstance, which is deprecated in Java 11, with Class.getConstructor.newInstance, and primitive wrapper class constructors with valueOf or equivalent. ## How was this patch tested? Existing tests. Closes #22988 from srowen/SPARK-25984. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2018-11-10 09:52:14 -06:00
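The two replacements named in this commit look roughly like this in Scala. The helper `instantiate` and the object name are hypothetical; the JDK calls (`Class.getConstructor`, `Constructor.newInstance`, `Integer.valueOf`, `Boolean.valueOf`) are standard.

```scala
object ReflectionDemo {
  // Instantiate through the public no-arg constructor instead of the
  // deprecated Class.newInstance().
  def instantiate[T](cls: Class[T]): T =
    cls.getConstructor().newInstance()

  def main(args: Array[String]): Unit = {
    val sb = instantiate(classOf[java.lang.StringBuilder])
    sb.append("ok")

    // valueOf replaces the deprecated wrapper constructors such as
    // `new Integer(42)` and `new java.lang.Boolean(true)`.
    val boxedInt: java.lang.Integer = java.lang.Integer.valueOf(42)
    val boxedBool: java.lang.Boolean = java.lang.Boolean.valueOf(true)

    println(s"$sb $boxedInt $boxedBool")
  }
}
```

Unlike `Class.newInstance()`, `Constructor.newInstance()` wraps exceptions thrown by the constructor in an `InvocationTargetException`, so callers that catch specific exceptions may need adjusting.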
Dongjoon Hyun	d66a4e82ec	[SPARK-25102][SQL] Write Spark version to ORC/Parquet file metadata ## What changes were proposed in this pull request? Currently, Spark writes the Spark version number into Hive Table properties with `spark.sql.create.version`. ``` parameters:{ spark.sql.sources.schema.part.0={ "type":"struct", "fields":[{"name":"a","type":"integer","nullable":true,"metadata":{}}] }, transient_lastDdlTime=1541142761, spark.sql.sources.schema.numParts=1, spark.sql.create.version=2.4.0 } ``` This PR aims to write Spark versions to ORC/Parquet file metadata with `org.apache.spark.sql.create.version` because we used the `org.apache.` prefix in Parquet metadata already. It's different from the Hive Table property key `spark.sql.create.version`, but it seems that we cannot change the Hive Table property for backward compatibility. After this PR, ORC and Parquet files generated by Spark will have the following metadata. ORC (`native` and `hive` implementation) ``` $ orc-tools meta /tmp/o File Version: 0.12 with ... ... User Metadata: org.apache.spark.sql.create.version=3.0.0 ``` PARQUET ``` $ parquet-tools meta /tmp/p ... creator: parquet-mr version 1.10.0 (build 031a6654009e3b82020012a18434c582bd74c73a) extra: org.apache.spark.sql.create.version = 3.0.0 extra: org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"id","type":"long","nullable":false,"metadata":{}}]} ``` ## How was this patch tested? Pass the Jenkins with newly added test cases. This closes #22255. Closes #22932 from dongjoon-hyun/SPARK-25102. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2018-11-09 22:42:48 -08:00
Gengliang Wang	1db799795c	[SPARK-25979][SQL] Window function: allow parentheses around window reference ## What changes were proposed in this pull request? Very minor parser bug, but possibly problematic for code-generated queries: Consider the following two queries: ``` SELECT avg(k) OVER (w) FROM kv WINDOW w AS (PARTITION BY v ORDER BY w) ORDER BY 1 ``` and ``` SELECT avg(k) OVER w FROM kv WINDOW w AS (PARTITION BY v ORDER BY w) ORDER BY 1 ``` The former, with parens around the OVER condition, fails to parse while the latter, without parens, succeeds: ``` Error in SQL statement: ParseException: mismatched input '(' expecting {<EOF>, ',', 'FROM', 'WHERE', 'GROUP', 'ORDER', 'HAVING', 'LIMIT', 'LATERAL', 'WINDOW', 'UNION', 'EXCEPT', 'MINUS', 'INTERSECT', 'SORT', 'CLUSTER', 'DISTRIBUTE'}(line 1, pos 19) == SQL == SELECT avg(k) OVER (w) FROM kv WINDOW w AS (PARTITION BY v ORDER BY w) ORDER BY 1 -------------------^^^ ``` This was found when running the cockroach DB tests. I tried PostgreSQL, The SQL with parentheses is also workable. ## How was this patch tested? Unit test Closes #22987 from gengliangwang/windowParentheses. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2018-11-09 09:44:04 -08:00
gatorsmile	657fd00b52	[SPARK-25988][SQL] Keep names unchanged when deduplicating the column names in Analyzer ## What changes were proposed in this pull request? When the queries do not use the column names with the same case, users might hit various errors. Below is a typical test failure they can hit. ``` Expected only partition pruning predicates: ArrayBuffer(isnotnull(tdate#237), (cast(tdate#237 as string) >= 2017-08-15)); org.apache.spark.sql.AnalysisException: Expected only partition pruning predicates: ArrayBuffer(isnotnull(tdate#237), (cast(tdate#237 as string) >= 2017-08-15)); at org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$.prunePartitionsByFilter(ExternalCatalogUtils.scala:146) at org.apache.spark.sql.catalyst.catalog.InMemoryCatalog.listPartitionsByFilter(InMemoryCatalog.scala:560) at org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitionsByFilter(SessionCatalog.scala:925) ``` ## How was this patch tested? Added two test cases. Closes #22990 from gatorsmile/fix1283. Authored-by: gatorsmile <gatorsmile@gmail.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2018-11-09 08:22:26 -08:00
Yuming Wang	0558d021cc	[SPARK-25510][SQL][TEST][FOLLOW-UP] Remove BenchmarkWithCodegen ## What changes were proposed in this pull request? Remove `BenchmarkWithCodegen` as we don't use it anymore. More details: https://github.com/apache/spark/pull/22484#discussion_r221397904 ## How was this patch tested? N/A Closes #22985 from wangyum/SPARK-25510. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: hyukjinkwon <gurwls223@apache.org>	2018-11-09 11:45:03 +08:00
Maxim Gekk	79551f558d	[SPARK-25945][SQL] Support locale while parsing date/timestamp from CSV/JSON ## What changes were proposed in this pull request? In the PR, I propose to add new option `locale` into CSVOptions/JSONOptions to make parsing date/timestamps in local languages possible. Currently the locale is hard coded to `Locale.US`. ## How was this patch tested? Added two tests for parsing a date from CSV/JSON - `ноя 2018`. Closes #22951 from MaxGekk/locale. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: hyukjinkwon <gurwls223@apache.org>	2018-11-09 09:45:06 +08:00
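Why a hard-coded `Locale.US` prevents parsing dates in local languages can be shown with the JDK's `java.time` API alone. This sketch does not use Spark or its `locale` option; the helper `parseYearMonth` is hypothetical, and a French full month name is used here as an illustrative input instead of the `ноя 2018` value from the commit's tests.

```scala
import java.time.YearMonth
import java.time.format.DateTimeFormatter
import java.util.Locale
import scala.util.Try

object LocaleParsingDemo {
  // Parse a "month-name year" string using month names from the given locale.
  def parseYearMonth(text: String, locale: Locale): Try[YearMonth] = {
    val formatter = DateTimeFormatter.ofPattern("MMMM yyyy", locale)
    Try(YearMonth.parse(text, formatter))
  }

  def main(args: Array[String]): Unit = {
    val text = "novembre 2018"

    // With the matching locale, the month name is recognized.
    println(parseYearMonth(text, Locale.FRENCH))

    // With Locale.US, the formatter only knows English month names,
    // so parsing fails.
    println(parseYearMonth(text, Locale.US).isFailure)
  }
}
```

The same input succeeds or fails depending only on the locale handed to the formatter, which is the behaviour the new option exposes to CSV and JSON readers.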
Gengliang Wang	7bb901aa28	[SPARK-25964][SQL][MINOR] Revise OrcReadBenchmark/DataSourceReadBenchmark case names and execution instructions ## What changes were proposed in this pull request? 1. OrcReadBenchmark is under hive module, so the way to run it should be ``` build/sbt "hive/test:runMain <this class>" ``` 2. The benchmark "String with Nulls Scan" should be with case "String with Nulls Scan(5%/50%/95%)", not "(0.05%/0.5%/0.95%)" 3. Add the null value percentages in the test case names of DataSourceReadBenchmark, for the benchmark "String with Nulls Scan" . ## How was this patch tested? Re-run benchmarks Closes #22965 from gengliangwang/fixHiveOrcReadBenchmark. Lead-authored-by: Gengliang Wang <gengliang.wang@databricks.com> Co-authored-by: Gengliang Wang <ltnwgl@gmail.com> Co-authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2018-11-08 10:08:14 -08:00
Dongjoon Hyun	6abe90625e	[SPARK-25676][SQL][FOLLOWUP] Use 'foreach(_ => ())' ## What changes were proposed in this pull request? #22970 fixed Scala 2.12 build error, and this PR updates the function according to the review comments. ## How was this patch tested? This is also manually tested with Scala 2.12 build. Closes #22978 from dongjoon-hyun/SPARK-25676-3. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2018-11-08 23:37:14 +08:00
Takuya UESHIN	0d7396f3af	[SPARK-22827][SQL][FOLLOW-UP] Throw `SparkOutOfMemoryError` in `HashAggregateExec`, too. ## What changes were proposed in this pull request? This is a follow-up pr of #20014 which introduced `SparkOutOfMemoryError` to avoid killing the entire executor when an `OutOfMemoryError` is thrown. We should throw `SparkOutOfMemoryError` in `HashAggregateExec`, too. ## How was this patch tested? Existing tests. Closes #22969 from ueshin/issues/SPARK-22827/oome. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2018-11-08 03:51:55 -08:00
Dongjoon Hyun	a3004d084c	[SPARK-25971][SQL] Ignore partition byte-size statistics in SQLQueryTestSuite ## What changes were proposed in this pull request? Currently, `SQLQueryTestSuite` is sensitive in terms of the bytes of parquet files in table partitions. If we change the default file format (from Parquet to ORC) or update the metadata of them, the test case should be changed accordingly. This PR aims to make `SQLQueryTestSuite` more robust by ignoring the partition byte statistics. ``` -Partition Statistics 1144 bytes, 2 rows +Partition Statistics [not included in comparison] bytes, 2 rows ``` ## How was this patch tested? Pass the Jenkins with the newly updated test cases. Closes #22972 from dongjoon-hyun/SPARK-25971. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2018-11-08 03:40:28 -08:00
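The normalization shown in the diff above can be approximated with a regular-expression replacement. This is a sketch under assumptions: the object name, the `normalize` helper and the exact pattern are illustrative and not taken from `SQLQueryTestSuite`; only the placeholder text comes from the commit message.

```scala
object StatsNormalizationDemo {
  // Placeholder text as shown in the commit message.
  private val placeholder = "[not included in comparison]"

  // Replace the byte count after "Partition Statistics" with the placeholder,
  // leaving the rest of the line (such as the row count) untouched.
  // The regex is an assumption made for this illustration.
  def normalize(line: String): String =
    line.replaceAll("""Partition Statistics\s+\d+""", s"Partition Statistics $placeholder")

  def main(args: Array[String]): Unit = {
    println(normalize("Partition Statistics 1144 bytes, 2 rows"))
  }
}
```

Because the byte count no longer takes part in the comparison, the expected output stays stable when the file format or file metadata changes.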
Maxim Gekk	ee03f760b3	[SPARK-25955][TEST] Porting JSON tests for CSV functions ## What changes were proposed in this pull request? In the PR, I propose to port existing JSON tests from `JsonFunctionsSuite` that are applicable for CSV, and put them to `CsvFunctionsSuite`. In particular: - roundtrip `from_csv` to `to_csv`, and `to_csv` to `from_csv` - using `schema_of_csv` in `from_csv` - Java API `from_csv` - using `from_csv` and `to_csv` in exprs. Closes #22960 from MaxGekk/csv-additional-tests. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: hyukjinkwon <gurwls223@apache.org>	2018-11-08 14:51:29 +08:00
Maxim Gekk	17449a2e6b	[SPARK-25952][SQL] Passing actual schema to JacksonParser ## What changes were proposed in this pull request? The PR fixes an issue when the corrupt record column specified via `spark.sql.columnNameOfCorruptRecord` or JSON options `columnNameOfCorruptRecord` is propagated to JacksonParser, and returned row breaks an assumption in `FailureSafeParser` that the row must contain only actual data. The issue is fixed by passing actual schema without the corrupt record field into `JacksonParser`. ## How was this patch tested? Added a test with the corrupt record column in the middle of user's schema. Closes #22958 from MaxGekk/from_json-corrupt-record-schema. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: hyukjinkwon <gurwls223@apache.org>	2018-11-08 14:48:23 +08:00
Dongjoon Hyun	d68f3a726f	[SPARK-25676][FOLLOWUP][BUILD] Fix Scala 2.12 build error ## What changes were proposed in this pull request? This PR fixes the Scala-2.12 build. ## How was this patch tested? Manual build with Scala-2.12 profile. Closes #22970 from dongjoon-hyun/SPARK-25676-2.12. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: DB Tsai <d_tsai@apple.com>	2018-11-08 05:54:48 +00:00
Sean Owen	0025a8397f	[SPARK-25908][CORE][SQL] Remove old deprecated items in Spark 3 ## What changes were proposed in this pull request? - Remove some AccumulableInfo .apply() methods - Remove non-label-specific multiclass precision/recall/fScore in favor of accuracy - Remove toDegrees/toRadians in favor of degrees/radians (SparkR: only deprecated) - Remove approxCountDistinct in favor of approx_count_distinct (SparkR: only deprecated) - Remove unused Python StorageLevel constants - Remove Dataset unionAll in favor of union - Remove unused multiclass option in libsvm parsing - Remove references to deprecated spark configs like spark.yarn.am.port - Remove TaskContext.isRunningLocally - Remove ShuffleMetrics.shuffle* methods - Remove BaseReadWrite.context in favor of session - Remove Column.!== in favor of =!= - Remove Dataset.explode - Remove Dataset.registerTempTable - Remove SQLContext.getOrCreate, setActive, clearActive, constructors Not touched yet - everything else in MLLib - HiveContext - Anything deprecated more recently than 2.0.0, generally ## How was this patch tested? Existing tests Closes #22921 from srowen/SPARK-25908. Lead-authored-by: Sean Owen <sean.owen@databricks.com> Co-authored-by: hyukjinkwon <gurwls223@apache.org> Co-authored-by: Sean Owen <srowen@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2018-11-07 22:48:50 -06:00
Imran Rashid	8fbc1830f9	[SPARK-25904][CORE] Allocate arrays smaller than Int.MaxValue JVMs can't allocate arrays of length exactly Int.MaxValue, so ensure we never try to allocate an array that big. This commit changes some defaults & configs to gracefully fallover to something that doesn't require one large array in some cases; in other cases it simply improves an error message for cases which will still fail. Closes #22818 from squito/SPARK-25827. Authored-by: Imran Rashid <irashid@cloudera.com> Signed-off-by: Imran Rashid <irashid@cloudera.com>	2018-11-07 13:18:52 +01:00
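A guard of the kind this commit describes might look like the sketch below. The names `MaxSafeArrayLength` and `checkedLength` are hypothetical, and the cap of `Int.MaxValue - 15` is an assumption chosen for illustration; the real limit depends on the JVM.

```scala
object ArrayLimitDemo {
  // Assumed conservative cap, a little below Int.MaxValue.
  val MaxSafeArrayLength: Int = Int.MaxValue - 15

  // Validate a requested length before allocating, returning a clear
  // error message instead of letting the allocation fail later.
  def checkedLength(requested: Long): Either[String, Int] =
    if (requested < 0) {
      Left(s"Array length must be non-negative, got $requested")
    } else if (requested > MaxSafeArrayLength) {
      Left(s"Cannot allocate an array of $requested elements; the limit is $MaxSafeArrayLength")
    } else {
      Right(requested.toInt)
    }

  def main(args: Array[String]): Unit = {
    println(checkedLength(1024L))
    println(checkedLength(Int.MaxValue.toLong))
  }
}
```

A caller can use the `Left` branch either to report an improved error message or to fall back to a strategy that does not need one large array.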
Maxim Gekk	76813cfa1e	[SPARK-25950][SQL] from_csv should respect to spark.sql.columnNameOfCorruptRecord ## What changes were proposed in this pull request? Fix `CsvToStructs` to take into account the SQL config `spark.sql.columnNameOfCorruptRecord`, similar to `from_json`. ## How was this patch tested? Added a new test where `spark.sql.columnNameOfCorruptRecord` is set to a corrupt column name different from the default. Closes #22956 from MaxGekk/csv-tests. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: hyukjinkwon <gurwls223@apache.org>	2018-11-07 11:26:17 +08:00
yucai	63ca4bbe79	[SPARK-25676][SQL][TEST] Rename and refactor BenchmarkWideTable to use main method ## What changes were proposed in this pull request? Refactor BenchmarkWideTable to use main method. Generate benchmark result: ``` SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.WideTableBenchmark" ``` ## How was this patch tested? manual tests Closes #22823 from yucai/BenchmarkWideTable. Lead-authored-by: yucai <yyu1@ebay.com> Co-authored-by: Yucai Yu <yucai.yu@foxmail.com> Co-authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2018-11-06 15:40:56 -08:00
DB Tsai	3ed91c9b89	[SPARK-25946][BUILD] Upgrade ASM to 7.x to support JDK11 ## What changes were proposed in this pull request? Upgrade ASM to 7.x to support JDK11 ## How was this patch tested? Existing tests. Closes #22953 from dbtsai/asm7. Authored-by: DB Tsai <d_tsai@apple.com> Signed-off-by: DB Tsai <d_tsai@apple.com>	2018-11-06 05:38:59 +00:00
Takuya UESHIN	78fa1be29b	[SPARK-25926][CORE] Move config entries in core module to internal.config. ## What changes were proposed in this pull request? Currently definitions of config entries in `core` module are in several files separately. We should move them into `internal/config` to be easy to manage. ## How was this patch tested? Existing tests. Closes #22928 from ueshin/issues/SPARK-25926/single_config_file. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2018-11-06 09:18:17 +08:00
Shahid	fc65b4af00	[SPARK-25900][WEBUI] When the page number is more than the total page size, then fall back to the first page ## What changes were proposed in this pull request? When the given page number exceeds the maximum page number, the web UI throws an exception. It would be better to fall back to the default page instead of throwing the exception in the web UI. ## How was this patch tested? Before PR: ![screenshot from 2018-10-31 23-41-37](https://user-images.githubusercontent.com/23054875/47816448-354fbe80-dd79-11e8-83d8-6aab196642f7.png) After PR: ![screenshot from 2018-10-31 23-54-23](https://user-images.githubusercontent.com/23054875/47816461-3ed92680-dd79-11e8-959d-0c531b3a6b2d.png) Closes #22914 from shahidki31/pageFallBack. Authored-by: Shahid <shahidki31@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2018-11-05 09:13:53 -06:00
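The fallback rule can be expressed as a small pure function. This is a sketch only: `effectivePage` is a hypothetical helper, and treating page numbers below 1 the same way as page numbers past the end is an assumption not stated in the commit.

```scala
object PageFallbackDemo {
  // Return the requested page when it is in range, otherwise the first page.
  // Handling of values below 1 is an assumption made for this illustration.
  def effectivePage(requested: Int, totalPages: Int): Int =
    if (requested < 1 || requested > totalPages) 1 else requested

  def main(args: Array[String]): Unit = {
    println(effectivePage(3, 5))  // in range, kept as requested
    println(effectivePage(99, 5)) // past the last page, falls back to page 1
  }
}
```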
Maxim Gekk	950e7374a8	[SPARK-25913][SQL] Extend UnaryExecNode by unary SparkPlan nodes ## What changes were proposed in this pull request? In the PR, I propose to extend `UnaryExecNode` instead of `SparkPlan` by unary nodes. Closes #22925 from MaxGekk/unary-exec-node. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2018-11-04 17:41:42 -08:00
Maxim Gekk	39399f40b8	[SPARK-25638][SQL] Adding new function - to_csv() ## What changes were proposed in this pull request? The new function takes a struct and converts it to a CSV string using the passed CSV options. It accepts the same CSV options as the CSV data source does. ## How was this patch tested? Added `CsvExpressionsSuite`, `CsvFunctionsSuite` as well as R, Python and SQL tests similar to tests for `to_json()`. Closes #22626 from MaxGekk/to_csv. Lead-authored-by: Maxim Gekk <max.gekk@gmail.com> Co-authored-by: Maxim Gekk <maxim.gekk@databricks.com> Signed-off-by: hyukjinkwon <gurwls223@apache.org>	2018-11-04 14:57:38 +08:00
Maxim Gekk	42b6c1fb05	[SPARK-25931][SQL] Benchmarking creation of Jackson parser ## What changes were proposed in this pull request? Added new benchmark which forcibly invokes Jackson parser to check overhead of its creation for short and wide JSON strings. Existing benchmarks do not allow to check that due to an optimisation introduced by #21909 for empty schema pushed down to JSON datasource. The `count()` action passes empty schema as required schema to the datasource, and Jackson parser is not created at all in that case. Besides of new benchmark I also refactored existing benchmarks: - Added `numIters` to control number of iteration in each benchmark - Renamed `JSON per-line parsing` -> `count a short column`, `JSON parsing of wide lines` -> `count a wide column`, and `Count a dataset with 10 columns` -> `Select a subset of 10 columns`. Closes #22920 from MaxGekk/json-benchmark-follow-up. Lead-authored-by: Maxim Gekk <max.gekk@gmail.com> Co-authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2018-11-03 09:09:39 -07:00
Shahid	ed0c57e10d	[SPARK-25861][MINOR][WEBUI] Remove unused refreshInterval parameter from the headerSparkPage method. ## What changes were proposed in this pull request? 'refreshInterval' is not used any where in the headerSparkPage method. So, we don't need to pass the parameter while calling the 'headerSparkPage' method. ## How was this patch tested? Existing tests Closes #22864 from shahidki31/unusedCode. Authored-by: Shahid <shahidki31@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2018-11-02 17:17:48 -05:00
Dongjoon Hyun	e91b607719	[SPARK-25918][SQL] LOAD DATA LOCAL INPATH should handle a relative path ## What changes were proposed in this pull request? Unfortunately, it seems that we missed this in 2.4.0. In Spark 2.4, if the default file system is not the local file system, `LOAD DATA LOCAL INPATH` only works in case of absolute paths. This PR aims to fix it to support relative paths. This is a regression in 2.4.0. ```scala $ ls kv1.txt kv1.txt scala> spark.sql("LOAD DATA LOCAL INPATH 'kv1.txt' INTO TABLE t") org.apache.spark.sql.AnalysisException: LOAD DATA input path does not exist: kv1.txt; ``` ## How was this patch tested? Pass the Jenkins Closes #22927 from dongjoon-hyun/SPARK-LOAD. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2018-11-01 23:18:20 -07:00
Takuya UESHIN	cc82b9fed8	[SPARK-25884][SQL] Add TBLPROPERTIES and COMMENT, and use LOCATION when SHOW CREATE TABLE. ## What changes were proposed in this pull request? When running `SHOW CREATE TABLE` on Datasource tables, we are missing `TBLPROPERTIES` and `COMMENT`, and we should use `LOCATION` instead of path in `OPTION`. ## How was this patch tested? Split `ShowCreateTableSuite` to confirm it works with both `InMemoryCatalog` and `HiveExternalCatalog`, and added some tests. Closes #22892 from ueshin/issues/SPARK-25884/show_create_table. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2018-11-01 10:00:14 +08:00
hyukjinkwon	c9667aff4f	[SPARK-25672][SQL] schema_of_csv() - schema inference from an example ## What changes were proposed in this pull request? In the PR, I propose to add new function - schema_of_csv() which infers schema of CSV string literal. The result of the function is a string containing a schema in DDL format. For example: ```sql select schema_of_csv('1\|abc', map('delimiter', '\|')) ``` ``` struct<_c0:int,_c1:string> ``` ## How was this patch tested? Added new tests to `CsvFunctionsSuite`, `CsvExpressionsSuite` and SQL tests to `csv-functions.sql` Closes #22666 from MaxGekk/schema_of_csv-function. Lead-authored-by: hyukjinkwon <gurwls223@apache.org> Co-authored-by: Maxim Gekk <maxim.gekk@databricks.com> Signed-off-by: hyukjinkwon <gurwls223@apache.org>	2018-11-01 09:14:16 +08:00
Anton Okolnychyi	bc9f9b4d6e	[SPARK-25860][SQL] Replace Literal(null, _) with FalseLiteral whenever possible ## What changes were proposed in this pull request? This PR proposes a new optimization rule that replaces `Literal(null, _)` with `FalseLiteral` in conditions in `Join` and `Filter`, predicates in `If`, conditions in `CaseWhen`. The idea is that some expressions evaluate to `false` if the underlying expression is `null` (as an example see `GeneratePredicate$create` or `doGenCode` and `eval` methods in `If` and `CaseWhen`). Therefore, we can replace `Literal(null, _)` with `FalseLiteral`, which can lead to more optimizations later on. Let’s consider a few examples. ``` val df = spark.range(1, 100).select($"id".as("l"), ($"id" > 50).as("b")) df.createOrReplaceTempView("t") df.createOrReplaceTempView("p") ``` Case 1 ``` spark.sql("SELECT * FROM t WHERE if(l > 10, false, NULL)").explain(true) // without the new rule … == Optimized Logical Plan == Project [id#0L AS l#2L, cast(id#0L as string) AS s#3] +- Filter if ((id#0L > 10)) false else null +- Range (1, 100, step=1, splits=Some(12)) == Physical Plan == (1) Project [id#0L AS l#2L, cast(id#0L as string) AS s#3] +- (1) Filter if ((id#0L > 10)) false else null +- (1) Range (1, 100, step=1, splits=12) // with the new rule … == Optimized Logical Plan == LocalRelation <empty>, [l#2L, s#3] == Physical Plan == LocalTableScan <empty>, [l#2L, s#3] ``` Case 2* ``` spark.sql("SELECT * FROM t WHERE CASE WHEN l < 10 THEN null WHEN l > 40 THEN false ELSE null END”).explain(true) // without the new rule ... 
== Optimized Logical Plan == Project [id#0L AS l#2L, cast(id#0L as string) AS s#3] +- Filter CASE WHEN (id#0L < 10) THEN null WHEN (id#0L > 40) THEN false ELSE null END +- Range (1, 100, step=1, splits=Some(12)) == Physical Plan == (1) Project [id#0L AS l#2L, cast(id#0L as string) AS s#3] +- (1) Filter CASE WHEN (id#0L < 10) THEN null WHEN (id#0L > 40) THEN false ELSE null END +- (1) Range (1, 100, step=1, splits=12) // with the new rule ... == Optimized Logical Plan == LocalRelation <empty>, [l#2L, s#3] == Physical Plan == LocalTableScan <empty>, [l#2L, s#3] ``` Case 3 ``` spark.sql("SELECT * FROM t JOIN p ON IF(t.l > p.l, null, false)").explain(true) // without the new rule ... == Optimized Logical Plan == Join Inner, if ((l#2L > l#37L)) null else false :- Project [id#0L AS l#2L, cast(id#0L as string) AS s#3] : +- Range (1, 100, step=1, splits=Some(12)) +- Project [id#0L AS l#37L, cast(id#0L as string) AS s#38] +- Range (1, 100, step=1, splits=Some(12)) == Physical Plan == BroadcastNestedLoopJoin BuildRight, Inner, if ((l#2L > l#37L)) null else false :- (1) Project [id#0L AS l#2L, cast(id#0L as string) AS s#3] : +- (1) Range (1, 100, step=1, splits=12) +- BroadcastExchange IdentityBroadcastMode +- (2) Project [id#0L AS l#37L, cast(id#0L as string) AS s#38] +- (2) Range (1, 100, step=1, splits=12) // with the new rule ... == Optimized Logical Plan == LocalRelation <empty>, [l#2L, s#3, l#37L, s#38] ``` ## How was this patch tested? This PR comes with a set of dedicated tests. Closes #22857 from aokolnychyi/spark-25860. Authored-by: Anton Okolnychyi <aokolnychyi@apple.com> Signed-off-by: DB Tsai <d_tsai@apple.com>	2018-10-31 18:35:33 +00:00
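The reasoning in SPARK-25860 above, that a filter treats a `null` predicate exactly like `false`, can be illustrated outside Spark. The following is a minimal sketch in plain Java, assuming a nullable `Boolean` stands in for a SQL three-valued predicate; `NullAsFalseSketch`, `predicate`, `keep` and `filter` are hypothetical names for illustration, not Spark APIs.

```java
import java.util.ArrayList;
import java.util.List;

class NullAsFalseSketch {
    // Mirrors the shape of "if(l > 10, false, NULL)": the result is
    // either false or null, never true. (Hypothetical helper.)
    static Boolean predicate(long l) {
        return l > 10 ? Boolean.FALSE : null;
    }

    // A filter keeps a row only when the predicate is exactly TRUE,
    // so null and false are interchangeable in this position.
    static boolean keep(Boolean p) {
        return Boolean.TRUE.equals(p);
    }

    // Applies the predicate to every value in [start, end).
    static List<Long> filter(long start, long end) {
        List<Long> out = new ArrayList<>();
        for (long l = start; l < end; l++) {
            if (keep(predicate(l))) {
                out.add(l);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // No row survives, which is why the optimizer can collapse
        // the plan to an empty relation once null becomes false.
        System.out.println(filter(1, 100).isEmpty());
    }
}
```

Because `keep` cannot distinguish `null` from `false`, rewriting one into the other does not change which rows pass, and it lets later constant folding see a predicate that is always `false`.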
caoxuewen	3c0e9ce944	[SPARK-24901][SQL] Merge the codegen of RegularHashMap and fastHashMap to reduce compiler maxCodesize when VectorizedHashMap is false. ## What changes were proposed in this pull request? Currently, the generated code that updates the UnsafeRow in hash aggregation exists in two separate copies, one for FastHashMap and one for RegularHashMap. The two copies are needed only when VectorizedHashMap is true; in all other cases they can be merged to reduce the size of the generated code. ``` import org.apache.spark.sql.execution.debug._ sparkSession.range(1).selectExpr("id AS key", "id AS value").groupBy("key").sum("value").debugCodegen ``` The generated code looks like this. Before the change: ``` Generated code: /* 001 */ public Object generate(Object[] references) { /* 002 */ return new GeneratedIteratorForCodegenStage1(references); /* 003 */ } /* 004 */ ............... /* 420 */ if (agg_fastAggBuffer_0 != null) { /* 421 */ // common sub-expressions /* 422 */ /* 423 */ // evaluate aggregate function /* 424 */ agg_agg_isNull_14_0 = true; /* 425 */ long agg_value_15 = -1L; /* 426 */ do { /* 427 */ boolean agg_isNull_15 = agg_fastAggBuffer_0.isNullAt(0); /* 428 */ long agg_value_16 = agg_isNull_15 ? 
/* 429 */ -1L : (agg_fastAggBuffer_0.getLong(0)); /* 430 */ if (!agg_isNull_15) { /* 431 */ agg_agg_isNull_14_0 = false; /* 432 */ agg_value_15 = agg_value_16; /* 433 */ continue; /* 434 */ } /* 435 */ /* 436 */ // This comment is added for manually tracking reference of 0, false /* 437 */ /* 438 */ boolean agg_isNull_16 = false; /* 439 */ long agg_value_17 = -1L; /* 440 */ if (!false) { /* 441 */ agg_value_17 = (long) 0; /* 442 */ } /* 443 */ if (!agg_isNull_16) { /* 444 */ agg_agg_isNull_14_0 = false; /* 445 */ agg_value_15 = agg_value_17; /* 446 */ continue; /* 447 */ } /* 448 */ /* 449 */ } while (false); /* 450 */ /* 451 */ long agg_value_14 = -1L; /* 452 */ agg_value_14 = agg_value_15 + agg_expr_1_0; /* 453 */ // update fast row /* 454 */ agg_fastAggBuffer_0.setLong(0, agg_value_14); /* 455 */ } else { /* 456 */ // common sub-expressions /* 457 */ /* 458 */ // evaluate aggregate function /* 459 */ agg_agg_isNull_8_0 = true; /* 460 */ long agg_value_9 = -1L; /* 461 */ do { /* 462 */ boolean agg_isNull_9 = agg_unsafeRowAggBuffer_0.isNullAt(0); /* 463 */ long agg_value_10 = agg_isNull_9 ? /* 464 */ -1L : (agg_unsafeRowAggBuffer_0.getLong(0)); /* 465 */ if (!agg_isNull_9) { /* 466 */ agg_agg_isNull_8_0 = false; /* 467 */ agg_value_9 = agg_value_10; /* 468 */ continue; /* 469 */ } /* 470 */ /* 471 */ // This comment is added for manually tracking reference of 0, false /* 472 */ /* 473 */ boolean agg_isNull_10 = false; /* 474 */ long agg_value_11 = -1L; /* 475 */ if (!false) { /* 476 */ agg_value_11 = (long) 0; /* 477 */ } /* 478 */ if (!agg_isNull_10) { /* 479 */ agg_agg_isNull_8_0 = false; /* 480 */ agg_value_9 = agg_value_11; /* 481 */ continue; /* 482 */ } /* 483 */ /* 484 */ } while (false); /* 485 */ /* 486 */ long agg_value_8 = -1L; /* 487 */ agg_value_8 = agg_value_9 + agg_expr_1_0; /* 488 */ // update unsafe row buffer /* 489 */ agg_unsafeRowAggBuffer_0.setLong(0, agg_value_8); /* 490 */ /* 491 */ } ...................... 
``` After the change: ``` Generated code: /* 001 */ public Object generate(Object[] references) { /* 002 */ return new GeneratedIteratorForCodegenStage1(references); /* 003 */ } /* 004 */ ............. /* 423 */ // Updates the proper row buffer /* 424 */ UnsafeRow agg_aggBuffer_0 = null; /* 425 */ if (agg_fastAggBuffer_0 != null) { /* 426 */ agg_aggBuffer_0 = agg_fastAggBuffer_0; /* 427 */ } else { /* 428 */ agg_aggBuffer_0 = agg_unsafeRowAggBuffer_0; /* 429 */ } /* 430 */ /* 431 */ // common sub-expressions /* 432 */ /* 433 */ // evaluate aggregate function /* 434 */ agg_agg_isNull_8_0 = true; /* 435 */ long agg_value_9 = -1L; /* 436 */ do { /* 437 */ boolean agg_isNull_9 = agg_aggBuffer_0.isNullAt(0); /* 438 */ long agg_value_10 = agg_isNull_9 ? /* 439 */ -1L : (agg_aggBuffer_0.getLong(0)); /* 440 */ if (!agg_isNull_9) { /* 441 */ agg_agg_isNull_8_0 = false; /* 442 */ agg_value_9 = agg_value_10; /* 443 */ continue; /* 444 */ } /* 445 */ /* 446 */ // This comment is added for manually tracking reference of 0, false /* 447 */ /* 448 */ boolean agg_isNull_10 = false; /* 449 */ long agg_value_11 = -1L; /* 450 */ if (!false) { /* 451 */ agg_value_11 = (long) 0; /* 452 */ } /* 453 */ if (!agg_isNull_10) { /* 454 */ agg_agg_isNull_8_0 = false; /* 455 */ agg_value_9 = agg_value_11; /* 456 */ continue; /* 457 */ } /* 458 */ /* 459 */ } while (false); /* 460 */ /* 461 */ long agg_value_8 = -1L; /* 462 */ agg_value_8 = agg_value_9 + agg_expr_1_0; /* 463 */ // update unsafe row buffer /* 464 */ agg_aggBuffer_0.setLong(0, agg_value_8); ........... ``` ## How was this patch tested? Existing test cases. Closes #21860 from heary-cao/fastHashMap. Authored-by: caoxuewen <cao.xuewen@zte.com.cn> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2018-10-31 18:39:15 +08:00
yucai	f8484e49ef	[SPARK-25663][SPARK-25661][SQL][TEST] Refactor BuiltInDataSourceWriteBenchmark, DataSourceWriteBenchmark and AvroWriteBenchmark to use main method ## What changes were proposed in this pull request? Refactor BuiltInDataSourceWriteBenchmark, DataSourceWriteBenchmark and AvroWriteBenchmark to use main method. ``` SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.BuiltInDataSourceWriteBenchmark" SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "avro/test:runMain org.apache.spark.sql.execution.benchmark.AvroWriteBenchmark" ``` ## How was this patch tested? manual tests Closes #22861 from yucai/BuiltInDataSourceWriteBenchmark. Lead-authored-by: yucai <yyu1@ebay.com> Co-authored-by: Yucai Yu <yucai.yu@foxmail.com> Co-authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2018-10-31 03:03:42 -07:00
Reynold Xin	9cf9a83afa	[SPARK-25862][SQL] Remove rangeBetween APIs introduced in SPARK-21608 ## What changes were proposed in this pull request? This patch removes the rangeBetween functions introduced in SPARK-21608. As explained in SPARK-25841, these functions are confusing and don't quite work. We will redesign them and introduce better ones in SPARK-25843. ## How was this patch tested? Removed relevant test cases as well. These test cases will need to be added back in SPARK-25843. Closes #22870 from rxin/SPARK-25862. Lead-authored-by: Reynold Xin <rxin@databricks.com> Co-authored-by: hyukjinkwon <gurwls223@apache.org> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2018-10-30 21:27:17 -07:00
caoxuewen	f6ff6329ee	[SPARK-25847][SQL][TEST] Refactor JSONBenchmarks to use main method ## What changes were proposed in this pull request? Refactor JSONBenchmark to use main method. Use spark-submit: `bin/spark-submit --class org.apache.spark.sql.execution.datasources.json.JSONBenchmark --jars ./core/target/spark-core_2.11-3.0.0-SNAPSHOT-tests.jar,./sql/catalyst/target/spark-catalyst_2.11-3.0.0-SNAPSHOT-tests.jar ./sql/core/target/spark-sql_2.11-3.0.0-SNAPSHOT-tests.jar` Generate benchmark result: `SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.datasources.json.JSONBenchmark"` ## How was this patch tested? manual tests Closes #22844 from heary-cao/JSONBenchmarks. Lead-authored-by: caoxuewen <cao.xuewen@zte.com.cn> Co-authored-by: heary <cao.xuewen@zte.com.cn> Co-authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: hyukjinkwon <gurwls223@apache.org>	2018-10-31 10:28:17 +08:00
caoxuewen	94de5609be	[SPARK-25848][SQL][TEST] Refactor CSVBenchmarks to use main method ## What changes were proposed in this pull request? use spark-submit: `bin/spark-submit --class org.apache.spark.sql.execution.datasources.csv.CSVBenchmark --jars ./core/target/spark-core_2.11-3.0.0-SNAPSHOT-tests.jar,./sql/catalyst/target/spark-catalyst_2.11-3.0.0-SNAPSHOT-tests.jar ./sql/core/target/spark-sql_2.11-3.0.0-SNAPSHOT-tests.jar` Generate benchmark result: `SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.datasources.csv.CSVBenchmark"` ## How was this patch tested? manual tests Closes #22845 from heary-cao/CSVBenchmarks. Authored-by: caoxuewen <cao.xuewen@zte.com.cn> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2018-10-30 09:18:55 -07:00
caoxuewen	eab39f79e4	[SPARK-25755][SQL][TEST] Supplementation of non-CodeGen unit tested for BroadcastHashJoinExec ## What changes were proposed in this pull request? Currently, the BroadcastHashJoinExec physical plan supports both CodeGen and non-CodeGen execution, but the unit tests in InnerJoinSuite, OuterJoinSuite and ExistenceJoinSuite exercise only the CodeGen path; the non-CodeGen path is not tested. This PR adds the missing tests. ## How was this patch tested? Added new unit tests. Closes #22755 from heary-cao/AddTestToBroadcastHashJoinExec. Authored-by: caoxuewen <cao.xuewen@zte.com.cn> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2018-10-30 20:13:18 +08:00
Peter Toth	7fe5cff058	[SPARK-25767][SQL] Fix lazily evaluated stream of expressions in code generation ## What changes were proposed in this pull request? Code generation is incorrect if `outputVars` parameter of `consume` method in `CodegenSupport` contains a lazily evaluated stream of expressions. This PR fixes the issue by forcing the evaluation of `inputVars` before generating the code for UnsafeRow. ## How was this patch tested? Tested with the sample program provided in https://issues.apache.org/jira/browse/SPARK-25767 Closes #22789 from peter-toth/SPARK-25767. Authored-by: Peter Toth <peter.toth@gmail.com> Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>	2018-10-29 16:47:50 +01:00
yucai	409d688fb6	[SPARK-25864][SQL][TEST] Make main args accessible for BenchmarkBase's subclass ## What changes were proposed in this pull request? Set main args correctly in BenchmarkBase, to make it accessible for its subclass. It will benefit: - BuiltInDataSourceWriteBenchmark - AvroWriteBenchmark ## How was this patch tested? manual tests Closes #22872 from yucai/main_args. Authored-by: yucai <yyu1@ebay.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2018-10-29 20:00:31 +08:00
Peter Toth	ca2fca1432	[SPARK-25816][SQL] Fix attribute resolution in nested extractors ## What changes were proposed in this pull request? Extractors are made of 2 expressions: one defines the value to extract from (called `child`) and the other defines the way of extraction (called `extraction`). In these terms extractors have 2 children, so they shouldn't be `UnaryExpression`s. `ResolveReferences` was changed in commit `36b826f5d1`, which resulted in a regression with nested extractors. An extractor needs to define its children as the set of both `child` and `extraction`, and should try to resolve both in `ResolveReferences`. This PR changes `UnresolvedExtractValue` to a `BinaryExpression`. ## How was this patch tested? added UT Closes #22817 from peter-toth/SPARK-25816. Authored-by: Peter Toth <peter.toth@gmail.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2018-10-28 17:51:35 -07:00
liuxian	4427a96bce	[SPARK-25806][SQL] The instance of FileSplit is redundant ## What changes were proposed in this pull request? The instance of `FileSplit` is redundant for `ParquetFileFormat` and `hive\orc\OrcFileFormat` class. ## How was this patch tested? Existing unit tests in `ParquetQuerySuite.scala` and `HiveOrcQuerySuite.scala` Closes #22802 from 10110346/FileSplitnotneed. Authored-by: liuxian <liu.xian3@zte.com.cn> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2018-10-28 17:39:16 -05:00
Xingbo Jiang	a7ab7f2348	[SPARK-25845][SQL] Fix MatchError for calendar interval type in range frame left boundary ## What changes were proposed in this pull request? WindowSpecDefinition checks start < last, but CalendarIntervalType is not comparable, so it would throw the following exception at runtime: ``` scala.MatchError: CalendarIntervalType (of class org.apache.spark.sql.types.CalendarIntervalType$) at org.apache.spark.sql.catalyst.util.TypeUtils$.getInterpretedOrdering(TypeUtils.scala:58) at org.apache.spark.sql.catalyst.expressions.BinaryComparison.ordering$lzycompute(predicates.scala:592) at org.apache.spark.sql.catalyst.expressions.BinaryComparison.ordering(predicates.scala:592) at org.apache.spark.sql.catalyst.expressions.GreaterThan.nullSafeEval(predicates.scala:797) at org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:496) at org.apache.spark.sql.catalyst.expressions.SpecifiedWindowFrame.isGreaterThan(windowExpressions.scala:245) at org.apache.spark.sql.catalyst.expressions.SpecifiedWindowFrame.checkInputDataTypes(windowExpressions.scala:216) at org.apache.spark.sql.catalyst.expressions.Expression.resolved$lzycompute(Expression.scala:171) at org.apache.spark.sql.catalyst.expressions.Expression.resolved(Expression.scala:171) at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$childrenResolved$1.apply(Expression.scala:183) at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$childrenResolved$1.apply(Expression.scala:183) at scala.collection.IndexedSeqOptimized$class.prefixLengthImpl(IndexedSeqOptimized.scala:38) at scala.collection.IndexedSeqOptimized$class.forall(IndexedSeqOptimized.scala:43) at scala.collection.mutable.ArrayBuffer.forall(ArrayBuffer.scala:48) at org.apache.spark.sql.catalyst.expressions.Expression.childrenResolved(Expression.scala:183) at org.apache.spark.sql.catalyst.expressions.WindowSpecDefinition.resolved$lzycompute(windowExpressions.scala:48) at 
org.apache.spark.sql.catalyst.expressions.WindowSpecDefinition.resolved(windowExpressions.scala:48) at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$childrenResolved$1.apply(Expression.scala:183) at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$childrenResolved$1.apply(Expression.scala:183) at scala.collection.LinearSeqOptimized$class.forall(LinearSeqOptimized.scala:83) ``` We fix the issue by only performing the check on boundary expressions that are AtomicType. ## How was this patch tested? Added a new test case in `DataFrameWindowFramesSuite` Closes #22853 from jiangxb1987/windowBoundary. Authored-by: Xingbo Jiang <xingbo.jiang@databricks.com> Signed-off-by: Xingbo Jiang <xingbo.jiang@databricks.com>	2018-10-28 18:15:47 +08:00
Wenchen Fan	ff4bb836aa	[SPARK-25817][SQL] Dataset encoder should support combination of map and product type ## What changes were proposed in this pull request? After https://github.com/apache/spark/pull/22745 , Dataset encoder supports the combination of java bean and map type. This PR is to fix the Scala side. The reason why it didn't work before is that `CatalystToExternalMap` tries to get the data type of the input map expression, while that expression can be unresolved and its data type unknown. To fix it, we can follow `UnresolvedMapObjects` and create an `UnresolvedCatalystToExternalMap`, and only create `CatalystToExternalMap` when the input map expression is resolved and the data type is known. ## How was this patch tested? Enabled an old test case. Closes #22812 from cloud-fan/map. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2018-10-28 13:33:26 +08:00
Dilip Biswal	e545811346	[SPARK-19851][SQL] Add support for EVERY and ANY (SOME) aggregates ## What changes were proposed in this pull request? Implements Every, Some, Any aggregates in SQL. These new aggregate expressions are analyzed in normal way and rewritten to equivalent existing aggregate expressions in the optimizer. Every(x) => Min(x) where x is boolean. Some(x) => Max(x) where x is boolean. Any is a synonym for Some. SQL ``` explain extended select every(v) from test_agg group by k; ``` Plan : ``` == Parsed Logical Plan == 'Aggregate ['k], [unresolvedalias('every('v), None)] +- 'UnresolvedRelation `test_agg` == Analyzed Logical Plan == every(v): boolean Aggregate [k#0], [every(v#1) AS every(v)#5] +- SubqueryAlias `test_agg` +- Project [k#0, v#1] +- SubqueryAlias `test_agg` +- LocalRelation [k#0, v#1] == Optimized Logical Plan == Aggregate [k#0], [min(v#1) AS every(v)#5] +- LocalRelation [k#0, v#1] == Physical Plan == (2) HashAggregate(keys=[k#0], functions=[min(v#1)], output=[every(v)#5]) +- Exchange hashpartitioning(k#0, 200) +- (1) HashAggregate(keys=[k#0], functions=[partial_min(v#1)], output=[k#0, min#7]) +- LocalTableScan [k#0, v#1] Time taken: 0.512 seconds, Fetched 1 row(s) ``` ## How was this patch tested? Added tests in SQLQueryTestSuite, DataframeAggregateSuite Closes #22809 from dilipbiswal/SPARK-19851-specific-rewrite. Authored-by: Dilip Biswal <dbiswal@us.ibm.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2018-10-28 09:38:38 +08:00
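The rewrite described in SPARK-19851 above (`Every(x) => Min(x)`, `Some(x) => Max(x)` for boolean `x`) relies on `false` ordering before `true`. The following is a minimal sketch in plain Java, assuming non-empty, null-free inputs; `EveryAnySketch`, `every` and `any` are hypothetical names for illustration, not Spark APIs, and SQL null handling is deliberately left out.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class EveryAnySketch {
    // Boolean orders false before true, so the minimum of a group
    // is true only when every value in the group is true.
    static boolean every(List<Boolean> xs) {
        return Collections.min(xs);
    }

    // Likewise, the maximum is true as soon as any value is true.
    static boolean any(List<Boolean> xs) {
        return Collections.max(xs);
    }

    public static void main(String[] args) {
        List<Boolean> v = Arrays.asList(true, false, true);
        // Cross-check the min/max formulation against the direct one.
        System.out.println(every(v) == v.stream().allMatch(b -> b));
        System.out.println(any(v) == v.stream().anyMatch(b -> b));
    }
}
```

Expressing the new aggregates through `min` and `max` means the optimizer can reuse existing aggregate implementations, as the optimized plan in the commit message shows.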
Sean Owen	ca545f7941	[SPARK-25821][SQL] Remove SQLContext methods deprecated in 1.4 ## What changes were proposed in this pull request? Remove SQLContext methods deprecated in 1.4 ## How was this patch tested? Existing tests. Closes #22815 from srowen/SPARK-25821. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2018-10-26 16:49:48 -05:00
Gengliang Wang	d325ffbf3a	[SPARK-25851][SQL][MINOR] Fix deprecated API warning in SQLListener ## What changes were proposed in this pull request? In https://github.com/apache/spark/pull/21596, Jackson is upgraded to 2.9.6. There are some deprecated API warnings in SQLListener. Create a trivial PR to fix them. ``` [warn] SQLListener.scala:92: method uncheckedSimpleType in class TypeFactory is deprecated: see corresponding Javadoc for more information. [warn] val objectType = typeFactory.uncheckedSimpleType(classOf[Object]) [warn] [warn] SQLListener.scala:93: method constructSimpleType in class TypeFactory is deprecated: see corresponding Javadoc for more information. [warn] typeFactory.constructSimpleType(classOf[(_, _)], classOf[(_, _)], Array(objectType, objectType)) [warn] [warn] SQLListener.scala:97: method uncheckedSimpleType in class TypeFactory is deprecated: see corresponding Javadoc for more information. [warn] val longType = typeFactory.uncheckedSimpleType(classOf[Long]) [warn] [warn] SQLListener.scala:98: method constructSimpleType in class TypeFactory is deprecated: see corresponding Javadoc for more information. [warn] typeFactory.constructSimpleType(classOf[(_, _)], classOf[(_, _)], Array(longType, longType)) ``` ## How was this patch tested? Existing unit tests. Closes #22848 from gengliangwang/fixSQLListenerWarning. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2018-10-26 16:45:56 -05:00
hyukjinkwon	33e337c118	[SPARK-24709][SQL][FOLLOW-UP] Make schema_of_json's input json as literal only ## What changes were proposed in this pull request? The main purpose of `schema_of_json` is to be used in combination with `from_json` (to make up for the lack of schema inference), which takes its schema only as a literal; however, `schema_of_json` currently allows JSON input as non-literal expressions (e.g., a column). This was mistakenly allowed - we don't have to take usages other than the main purpose into account for now. This PR is a follow-up that only allows literals for `schema_of_json`'s JSON input. We can allow non-literal expressions later when it's needed or there is a use case for it. ## How was this patch tested? Unit tests were added. Closes #22775 from HyukjinKwon/SPARK-25447-followup. Lead-authored-by: hyukjinkwon <gurwls223@apache.org> Co-authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2018-10-26 22:14:43 +08:00
Reynold Xin	89d748b33c	[SPARK-25842][SQL] Deprecate rangeBetween APIs introduced in SPARK-21608 ## What changes were proposed in this pull request? See the detailed information at https://issues.apache.org/jira/browse/SPARK-25841 on why these APIs should be deprecated and redesigned. This patch also reverts `8acb51f08b` which applies to 2.4. ## How was this patch tested? Only deprecation and doc changes. Closes #22841 from rxin/SPARK-25842. Authored-by: Reynold Xin <rxin@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2018-10-26 13:17:24 +08:00
Shixiong Zhu	86d469aeaa	[SPARK-25822][PYSPARK] Fix a race condition when releasing a Python worker ## What changes were proposed in this pull request? There is a race condition when releasing a Python worker. If `ReaderIterator.handleEndOfDataSection` is not running in the task thread, then when a task is terminated early (such as by `take(N)`), the task completion listener may close the worker while `handleEndOfDataSection` can still put the worker back into the worker pool for reuse. `0e07b483d2` is a patch to reproduce this issue. I also found that a user reported this on the mailing list: http://mail-archives.apache.org/mod_mbox/spark-user/201610.mbox/%3CCAAUq=H+YLUEpd23nwvq13Ms5hOStkhX3ao4f4zQV6sgO5zM-xAmail.gmail.com%3E This PR fixes the issue by using `compareAndSet` to make sure we never return a closed worker to the worker pool. ## How was this patch tested? Jenkins. Closes #22816 from zsxwing/fix-socket-closed. Authored-by: Shixiong Zhu <zsxwing@gmail.com> Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2018-10-26 13:53:51 +09:00
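The `compareAndSet` guard mentioned in SPARK-25822 above can be sketched in isolation. The following plain Java example is an illustration under stated assumptions, not the code from the patch: `WorkerPoolSketch`, `Worker`, `handedOff`, `releaseForReuse` and `closeOnTaskCompletion` are all hypothetical names. It shows how a single atomic flag lets exactly one of the two competing paths (return to pool, or close) take effect.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.atomic.AtomicBoolean;

class WorkerPoolSketch {
    // Hypothetical stand-in for a reusable worker process.
    static final class Worker {
        // Flips to true exactly once, by whichever path gets there first.
        final AtomicBoolean handedOff = new AtomicBoolean(false);
        volatile boolean closed = false;
    }

    private final Deque<Worker> idle = new ArrayDeque<>();

    // Reader path: data section ended, try to return the worker for reuse.
    boolean releaseForReuse(Worker w) {
        if (!w.handedOff.compareAndSet(false, true)) {
            return false; // the completion listener already closed it
        }
        synchronized (idle) {
            idle.push(w);
        }
        return true;
    }

    // Task-completion path: the task ended early, try to close the worker.
    boolean closeOnTaskCompletion(Worker w) {
        if (!w.handedOff.compareAndSet(false, true)) {
            return false; // the reader already returned it to the pool
        }
        w.closed = true;
        return true;
    }

    int idleCount() {
        synchronized (idle) {
            return idle.size();
        }
    }

    public static void main(String[] args) {
        WorkerPoolSketch pool = new WorkerPoolSketch();
        Worker w = new Worker();
        System.out.println(pool.closeOnTaskCompletion(w)); // close wins
        System.out.println(pool.releaseForReuse(w));       // release is refused
        System.out.println(pool.idleCount());              // pool stays empty
    }
}
```

Whichever thread wins the `compareAndSet` decides the worker's fate, so a closed worker can never end up in the idle pool regardless of interleaving.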
Liang-Chi Hsieh	cb5ea201df	[SPARK-25746][SQL] Refactoring ExpressionEncoder to get rid of flat flag ## What changes were proposed in this pull request? This was inspired while implementing #21732. For now `ScalaReflection` needs to consider how `ExpressionEncoder` uses generated serializers and deserializers. And `ExpressionEncoder` has a weird `flat` flag. After discussion with cloud-fan, it seems to be better to refactor `ExpressionEncoder`. It should make SPARK-24762 easier to do. To summarize the proposed changes: 1. `serializerFor` and `deserializerFor` return expressions for serializing/deserializing an input expression for a given type. They are private and should not be called directly. 2. `serializerForType` and `deserializerForType` return an expression for serializing/deserializing an object of type T to/from the Spark SQL representation. They assume the input object/Spark SQL representation is located at ordinal 0 of a row. In other words, `serializerForType` and `deserializerForType` return expressions for atomically serializing/deserializing a JVM object to/from a Spark SQL value. A serializer returned by `serializerForType` will serialize an object at `row(0)` to a corresponding Spark SQL representation, e.g. primitive type, array, map, struct. A deserializer returned by `deserializerForType` will deserialize an input field at `row(0)` to an object with the given type. 3. The construction of `ExpressionEncoder` takes a pair of serializer and deserializer for type `T`. It uses them to create the serializer and deserializer for T <-> row serialization. Now `ExpressionEncoder` doesn't need to remember if the serializer is flat or not. When we need to construct a new `ExpressionEncoder` based on existing ones, we only need to change the input location in the atomic serializer and deserializer. ## How was this patch tested? Existing tests. Closes #22749 from viirya/SPARK-24762-refactor. 
Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2018-10-25 19:27:45 +08:00
Maxim Gekk	4d6704db4d	[SPARK-25243][SQL] Use FailureSafeParser in from_json ## What changes were proposed in this pull request? In the PR, I propose to switch `from_json` to `FailureSafeParser`, to make the function compatible with `PERMISSIVE` mode by default, and to support the `FAILFAST` mode as well. The `DROPMALFORMED` mode is not supported by `from_json`. ## How was this patch tested? It was tested by the existing `JsonSuite`/`CSVSuite`, `JsonFunctionsSuite` and `JsonExpressionsSuite` as well as new tests for `from_json` which check the different modes. Closes #22237 from MaxGekk/from_json-failuresafe. Lead-authored-by: Maxim Gekk <maxim.gekk@databricks.com> Co-authored-by: hyukjinkwon <gurwls223@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2018-10-24 19:09:15 +08:00
Vladimir Kuriatkov	584e767d37	[SPARK-25772][SQL] Fix java map of structs deserialization This is a follow-up PR for #22708. It considers another case of java beans deserialization: java maps with struct keys/values. When deserializing values of MapType with struct keys/values in java beans, fields of structs get mixed up. I suggest using struct data types retrieved from resolved input data instead of inferring them from java beans. ## What changes were proposed in this pull request? Invocations of "keyArray" and "valueArray" functions are used to extract arrays of keys and values. Struct type of keys or values is also inferred from java bean structure and ends up with mixed up field order. I created a new UnresolvedInvoke expression as a temporary substitution of Invoke expression while no actual data is available. It allows to provide the resulting data type during analysis based on the resolved input data, not on the java bean (similar to UnresolvedMapObjects). Key and value arrays are then fed to MapObjects expression which I replaced with UnresolvedMapObjects, just like in case of ArrayType. Finally I added resolution of UnresolvedInvoke expressions in Analyzer.resolveExpression method as an additional pattern matching case. ## How was this patch tested? Added a test case. Built complete project on travis. viirya kiszk cloud-fan michalsenkyr marmbrus liancheng Closes #22745 from vofque/SPARK-21402-FOLLOWUP. Lead-authored-by: Vladimir Kuriatkov <vofque@gmail.com> Co-authored-by: Vladimir Kuriatkov <Vladimir_Kuriatkov@epam.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2018-10-24 09:29:40 +08:00
Imran Rashid	78c8bd2e68	[SPARK-25805][SQL][TEST] Fix test for SPARK-25159 The original test would sometimes fail if the listener bus did not keep up, so just wait till the listener bus is empty. Tested by adding a sleep in the listener, which made the test consistently fail without the fix, but pass consistently after the fix. Closes #22799 from squito/SPARK-25805. Authored-by: Imran Rashid <irashid@cloudera.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2018-10-23 14:20:29 +08:00
Liang-Chi Hsieh	03e82e3689	[SPARK-25040][SQL] Empty string for non string types should be disallowed ## What changes were proposed in this pull request? This takes over original PR at #22019. The original proposal is to have null for float and double types. Later a more reasonable proposal is to disallow empty strings. This patch adds logic to throw exception when finding empty strings for non string types. ## How was this patch tested? Added test. Closes #22787 from viirya/SPARK-25040. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: hyukjinkwon <gurwls223@apache.org>	2018-10-23 13:43:53 +08:00
Liang-Chi Hsieh	ff9ede0929	[SPARK-25627][TEST] Reduce test time for ContinuousStressSuite ## What changes were proposed in this pull request? This reduces the test time for ContinuousStressSuite from 8 mins 13 sec to 43 seconds. The approach is to reduce the triggers and epochs to wait for, and to reduce the expected rows accordingly. ## How was this patch tested? Existing tests. Closes #22662 from viirya/SPARK-25627. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2018-10-22 13:18:29 -05:00
hyukjinkwon	b8c6ba9e64	[SPARK-25779][SQL][TESTS] Remove SQL query tests for function documentation by DESCRIBE FUNCTION at SQLQueryTestSuite Currently, there are some tests testing function descriptions: ```bash $ grep -ir "describe function" sql/core/src/test/resources/sql-tests/inputs sql/core/src/test/resources/sql-tests/inputs/json-functions.sql:describe function to_json; sql/core/src/test/resources/sql-tests/inputs/json-functions.sql:describe function extended to_json; sql/core/src/test/resources/sql-tests/inputs/json-functions.sql:describe function from_json; sql/core/src/test/resources/sql-tests/inputs/json-functions.sql:describe function extended from_json; ``` There seems to be little value in testing them, since we're not going to test the documentation itself. The `DESCRIBE FUNCTION` functionality itself is already tested here and there. See the test failures in https://github.com/apache/spark/pull/18749 (where I added examples to function descriptions). We should remove those tests so that people don't add such tests to the SQL tests. ## How was this patch tested? Manual. Closes #22776 from HyukjinKwon/SPARK-25779. Authored-by: hyukjinkwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2018-10-20 18:02:38 -07:00
Wenchen Fan	ab5752cb95	[SPARK-25747][SQL] remove ColumnarBatchScan.needsUnsafeRowConversion ## What changes were proposed in this pull request? `needsUnsafeRowConversion` is used in 2 places: 1. `ColumnarBatchScan.produceRows` 2. `FileSourceScanExec.doExecute` When we hit `ColumnarBatchScan.produceRows`, it means whole stage codegen is on but the vectorized reader is off. The vectorized reader can be off for several reasons: 1. the file format doesn't have a vectorized reader(json, csv, etc.) 2. the vectorized reader config is off 3. the schema is not supported Anyway when the vectorized reader is off, file format reader will always return unsafe rows, and other `ColumnarBatchScan` implementations also always return unsafe rows, so `ColumnarBatchScan.needsUnsafeRowConversion` is not needed. When we hit `FileSourceScanExec.doExecute`, it means whole stage codegen is off. For this case, we need the `needsUnsafeRowConversion` to convert `ColumnarRow` to `UnsafeRow`, if the file format reader returns batch. This PR removes `ColumnarBatchScan.needsUnsafeRowConversion`, and keep this flag only in `FileSourceScanExec` ## How was this patch tested? existing tests Closes #22750 from cloud-fan/minor. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2018-10-20 17:45:04 -07:00
Yuming Wang	62551cceeb	[SPARK-25492][TEST] Refactor WideSchemaBenchmark to use main method ## What changes were proposed in this pull request? Refactor `WideSchemaBenchmark` to use main method. 1. use `spark-submit`: ```console bin/spark-submit --class org.apache.spark.sql.execution.benchmark.WideSchemaBenchmark --jars ./core/target/spark-core_2.11-3.0.0-SNAPSHOT-tests.jar ./sql/core/target/spark-sql_2.11-3.0.0-SNAPSHOT-tests.jar ``` 2. Generate benchmark result: ```console SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.WideSchemaBenchmark" ``` ## How was this patch tested? manual tests Closes #22501 from wangyum/SPARK-25492. Lead-authored-by: Yuming Wang <yumwang@ebay.com> Co-authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2018-10-20 17:31:13 -07:00
hyukjinkwon	3370865b0e	[SPARK-25785][SQL] Add prettyNames for from_json, to_json, from_csv, and schema_of_json ## What changes were proposed in this pull request? This PR adds `prettyNames` for `from_json`, `to_json`, `from_csv`, and `schema_of_json` so that appropriate names are used. ## How was this patch tested? Unit tests Closes #22773 from HyukjinKwon/minor-prettyNames. Authored-by: hyukjinkwon <gurwls223@apache.org> Signed-off-by: hyukjinkwon <gurwls223@apache.org>	2018-10-20 10:15:53 +08:00
Yuming Wang	9ad0f6ea89	[SPARK-25269][SQL] SQL interface support specify StorageLevel when cache table ## What changes were proposed in this pull request? SQL interface support specify `StorageLevel` when cache table. The semantic is: ```sql CACHE TABLE tableName OPTIONS('storageLevel' 'DISK_ONLY'); ``` All supported `StorageLevel` are: `eefdf9f9dd/core/src/main/scala/org/apache/spark/storage/StorageLevel.scala (L172-L183)` ## How was this patch tested? unit tests and manual tests. manual tests configuration: ``` --executor-memory 15G --executor-cores 5 --num-executors 50 ``` Data: Input Size / Records: 1037.7 GB / 11732805788 Result: ![image](https://user-images.githubusercontent.com/5399861/47213362-56a1c980-d3cd-11e8-82e7-28d7abc5923e.png) Closes #22263 from wangyum/SPARK-25269. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2018-10-19 09:15:55 -07:00
maryannxue	e8167768cf	[SPARK-25044][FOLLOW-UP] Change ScalaUDF constructor signature ## What changes were proposed in this pull request? This is a follow-up PR for #22259. The extra field added in `ScalaUDF` with the original PR was declared optional, but should be indeed required, otherwise callers of `ScalaUDF`'s constructor could ignore this new field and cause the result to be incorrect. This PR makes the new field required and changes its name to `handleNullForInputs`. #22259 breaks the previous behavior for null-handling of primitive-type input parameters. For example, for `val f = udf({(x: Int, y: Any) => x})`, `f(null, "str")` should return `null` but would return `0` after #22259. In this PR, all UDF methods except `def udf(f: AnyRef, dataType: DataType): UserDefinedFunction` have been restored with the original behavior. The only exception is documented in the Spark SQL migration guide. In addition, now that we have this extra field indicating if a null-test should be applied on the corresponding input value, we can also make use of this flag to avoid the rule `HandleNullInputsForUDF` being applied infinitely. ## How was this patch tested? Added UT in UDFSuite Passed affected existing UTs: AnalysisSuite UDFSuite Closes #22732 from maryannxue/spark-25044-followup. Lead-authored-by: maryannxue <maryannxue@apache.org> Co-authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2018-10-19 21:03:59 +08:00
Russell Spitzer	6e0fc8b0fc	[SPARK-25560][SQL] Allow FunctionInjection in SparkExtensions This allows an implementer of Spark Session Extensions to utilize a method "injectFunction" which will add a new function to the default Spark Session Catalogue. ## What changes were proposed in this pull request? Adds a new function to SparkSessionExtensions def injectFunction(functionDescription: FunctionDescription) Where function description is a new type type FunctionDescription = (FunctionIdentifier, FunctionBuilder) The functions are loaded in BaseSessionBuilder when the function registry does not have a parent function registry to get loaded from. ## How was this patch tested? New unit tests are added for the extension in SparkSessionExtensionSuite Closes #22576 from RussellSpitzer/SPARK-25560. Authored-by: Russell Spitzer <Russell.Spitzer@gmail.com> Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>	2018-10-19 10:40:56 +02:00
Justin Uang	1e6c1d8bfb	[SPARK-25493][SQL] Use auto-detection for CRLF in CSV datasource multiline mode ## What changes were proposed in this pull request? CSVs with windows style crlf ('\r\n') don't work in multiline mode. They work fine in single line mode because the line separation is done by Hadoop, which can handle all the different types of line separators. This PR fixes it by enabling Univocity's line separator detection in multiline mode, which will detect '\r\n', '\r', or '\n' automatically as it is done by hadoop in single line mode. ## How was this patch tested? Unit test with a file with crlf line endings. Closes #22503 from justinuang/fix-clrf-multiline. Authored-by: Justin Uang <juang@palantir.com> Signed-off-by: hyukjinkwon <gurwls223@apache.org>	2018-10-19 11:13:02 +08:00
Yuming Wang	1117fc35ff	[SPARK-25760][SQL] Set AddJarCommand return empty ## What changes were proposed in this pull request? Only `AddJarCommand` returns `0`, and users may be confused about what it means. This PR sets it to empty. ```sql spark-sql> add jar /Users/yumwang/spark/sql/hive/src/test/resources/TestUDTF.jar; ADD JAR /Users/yumwang/spark/sql/hive/src/test/resources/TestUDTF.jar 0 spark-sql> ``` ## How was this patch tested? manual tests ```sql spark-sql> add jar /Users/yumwang/spark/sql/hive/src/test/resources/TestUDTF.jar; ADD JAR /Users/yumwang/spark/sql/hive/src/test/resources/TestUDTF.jar spark-sql> ``` Closes #22747 from wangyum/AddJarCommand. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2018-10-18 09:19:42 -07:00
Russell Spitzer	c3eaee7765	[SPARK-25003][PYSPARK] Use SessionExtensions in Pyspark Master ## What changes were proposed in this pull request? Previously Pyspark used the private constructor for SparkSession when building that object. This resulted in a SparkSession without checking the sql.extensions parameter for additional session extensions. To fix this we instead use the Session.builder() path as SparkR uses, this loads the extensions and allows their use in PySpark. ## How was this patch tested? An integration test was added which mimics the Scala test for the same feature. Closes #21990 from RussellSpitzer/SPARK-25003-master. Authored-by: Russell Spitzer <Russell.Spitzer@gmail.com> Signed-off-by: hyukjinkwon <gurwls223@apache.org>	2018-10-18 12:29:09 +08:00
Vladimir Kuriatkov	e5b8136f47	[SPARK-21402][SQL] Fix java array of structs deserialization When deserializing values of ArrayType with struct elements in java beans, fields of structs get mixed up. I suggest using struct data types retrieved from resolved input data instead of inferring them from java beans. ## What changes were proposed in this pull request? MapObjects expression is used to map array elements to java beans. Struct type of elements is inferred from java bean structure and ends up with mixed up field order. I used UnresolvedMapObjects instead of MapObjects, which allows the element type for MapObjects to be provided during analysis based on the resolved input data, not on the java bean. ## How was this patch tested? Added a test case. Built complete project on travis. michalsenkyr cloud-fan marmbrus liancheng Closes #22708 from vofque/SPARK-21402. Lead-authored-by: Vladimir Kuriatkov <vofque@gmail.com> Co-authored-by: Vladimir Kuriatkov <Vladimir_Kuriatkov@epam.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2018-10-17 22:13:05 +08:00
Wenchen Fan	9690eba16e	[SPARK-25680][SQL] SQL execution listener shouldn't happen on execution thread ## What changes were proposed in this pull request? The SQL execution listener framework was created from scratch(see https://github.com/apache/spark/pull/9078). It didn't leverage what we already have in the spark listener framework, and one major problem is, the listener runs on the spark execution thread, which means a bad listener can block spark's query processing. This PR re-implements the SQL execution listener framework. Now `ExecutionListenerManager` is just a normal spark listener, which watches the `SparkListenerSQLExecutionEnd` events and post events to the user-provided SQL execution listeners. ## How was this patch tested? existing tests. Closes #22674 from cloud-fan/listener. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2018-10-17 16:06:07 +08:00
Takeshi Yamamuro	a9f685bb70	[SPARK-25734][SQL] Literal should have a value corresponding to dataType ## What changes were proposed in this pull request? `Literal.value` should have a value corresponding to `dataType`. This PR adds code to verify it and fixes the existing tests to do so. ## How was this patch tested? Modified the existing tests. Closes #22724 from maropu/SPARK-25734. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2018-10-17 11:02:39 +08:00
Maxim Gekk	e9af9460bc	[SPARK-25393][SQL] Adding new function from_csv() ## What changes were proposed in this pull request? The PR adds new function `from_csv()` similar to `from_json()` to parse columns with CSV strings. I added the following methods: ```Scala def from_csv(e: Column, schema: StructType, options: Map[String, String]): Column ``` and this signature to call it from Python, R and Java: ```Scala def from_csv(e: Column, schema: String, options: java.util.Map[String, String]): Column ``` ## How was this patch tested? Added new test suites `CsvExpressionsSuite`, `CsvFunctionsSuite` and sql tests. Closes #22379 from MaxGekk/from_csv. Lead-authored-by: Maxim Gekk <maxim.gekk@databricks.com> Co-authored-by: Maxim Gekk <max.gekk@gmail.com> Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com> Co-authored-by: hyukjinkwon <gurwls223@apache.org> Signed-off-by: hyukjinkwon <gurwls223@apache.org>	2018-10-17 09:32:05 +08:00
Dongjoon Hyun	2c664edc06	[SPARK-25579][SQL] Use quoted attribute names if needed in pushed ORC predicates ## What changes were proposed in this pull request? This PR aims to fix an ORC performance regression at Spark 2.4.0 RCs from Spark 2.3.2. Currently, for column names with `.`, the pushed predicates are ignored. Test Data ```scala scala> val df = spark.range(Int.MaxValue).sample(0.2).toDF("col.with.dot") scala> df.write.mode("overwrite").orc("/tmp/orc") ``` Spark 2.3.2 ```scala scala> spark.sql("set spark.sql.orc.impl=native") scala> spark.sql("set spark.sql.orc.filterPushdown=true") scala> spark.time(spark.read.orc("/tmp/orc").where("`col.with.dot` < 10").show) +------------+ \|col.with.dot\| +------------+ \| 5\| \| 7\| \| 8\| +------------+ Time taken: 1542 ms scala> spark.time(spark.read.orc("/tmp/orc").where("`col.with.dot` < 10").show) +------------+ \|col.with.dot\| +------------+ \| 5\| \| 7\| \| 8\| +------------+ Time taken: 152 ms ``` Spark 2.4.0 RC3 ```scala scala> spark.time(spark.read.orc("/tmp/orc").where("`col.with.dot` < 10").show) +------------+ \|col.with.dot\| +------------+ \| 5\| \| 7\| \| 8\| +------------+ Time taken: 4074 ms scala> spark.time(spark.read.orc("/tmp/orc").where("`col.with.dot` < 10").show) +------------+ \|col.with.dot\| +------------+ \| 5\| \| 7\| \| 8\| +------------+ Time taken: 1771 ms ``` ## How was this patch tested? Pass the Jenkins with a newly added test case. Closes #22597 from dongjoon-hyun/SPARK-25579. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: hyukjinkwon <gurwls223@apache.org>	2018-10-16 20:30:23 +08:00
Wenchen Fan	e028fd3aed	[SPARK-25736][SQL][TEST] add tests to verify the behavior of multi-column count ## What changes were proposed in this pull request? AFAIK multi-column count is not widely supported by the mainstream databases(postgres doesn't support), and the SQL standard doesn't define it clearly, as near as I can tell. Since Spark supports it, we should clearly document the current behavior and add tests to verify it. ## How was this patch tested? N/A Closes #22728 from cloud-fan/doc. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: hyukjinkwon <gurwls223@apache.org>	2018-10-16 15:13:01 +08:00
Yuming Wang	5c7f6b6636	[SPARK-25629][TEST] Reduce ParquetFilterSuite: filter pushdown test time costs in Jenkins ## What changes were proposed in this pull request? Testing only these 4 cases is enough: `be2238fb50/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetWriteSupport.scala (L269-L279)` ## How was this patch tested? Manual tests on my local machine. before: ``` - filter pushdown - decimal (13 seconds, 683 milliseconds) ``` after: ``` - filter pushdown - decimal (9 seconds, 713 milliseconds) ``` Closes #22636 from wangyum/SPARK-25629. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: hyukjinkwon <gurwls223@apache.org>	2018-10-16 12:30:02 +08:00
Imran Rashid	fdaa99897a	[SPARK-25738][SQL] Fix LOAD DATA INPATH for hdfs port ## What changes were proposed in this pull request? LOAD DATA INPATH didn't work if the defaultFS included a port for hdfs. Handling this just requires a small change to use the correct URI constructor. ## How was this patch tested? Added a unit test, ran all tests via jenkins Closes #22733 from squito/SPARK-25738. Authored-by: Imran Rashid <irashid@cloudera.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2018-10-15 18:34:30 -07:00
gatorsmile	4cee191c04	[SPARK-25674][FOLLOW-UP] Update the stats for each ColumnarBatch ## What changes were proposed in this pull request? This PR is a follow-up of https://github.com/apache/spark/pull/22594 . This alternative can avoid the unneeded computation in the hot code path. - For row-based scan, we keep the original way. - For the columnar scan, we just need to update the stats after each batch. ## How was this patch tested? N/A Closes #22731 from gatorsmile/udpateStatsFileScanRDD. Authored-by: gatorsmile <gatorsmile@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2018-10-16 08:58:29 +08:00
Marco Gaido	56247c1d17	[SPARK-25727][FOLLOWUP] Move outputOrdering to case class field for InMemoryRelation ## What changes were proposed in this pull request? The PR addresses [the comment](https://github.com/apache/spark/pull/22715#discussion_r225024084) in the previous one. `outputOrdering` becomes a field of `InMemoryRelation`. ## How was this patch tested? existing UTs Closes #22726 from mgaido91/SPARK-25727_followup. Authored-by: Marco Gaido <marcogaido91@gmail.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2018-10-15 10:12:45 -07:00
gatorsmile	6c3f2c6a6a	[SPARK-25727][SQL] Add outputOrdering to otherCopyArgs in InMemoryRelation ## What changes were proposed in this pull request? Add `outputOrdering ` to `otherCopyArgs` in InMemoryRelation so that this field will be copied when we doing the tree transformation. ``` val data = Seq(100).toDF("count").cache() data.queryExecution.optimizedPlan.toJSON ``` The above code can generate the following error: ``` assertion failed: InMemoryRelation fields: output, cacheBuilder, statsOfPlanToCache, outputOrdering, values: List(count#178), CachedRDDBuilder(true,10000,StorageLevel(disk, memory, deserialized, 1 replicas),(1) Project [value#176 AS count#178] +- LocalTableScan [value#176] ,None), Statistics(sizeInBytes=12.0 B, hints=none) java.lang.AssertionError: assertion failed: InMemoryRelation fields: output, cacheBuilder, statsOfPlanToCache, outputOrdering, values: List(count#178), CachedRDDBuilder(true,10000,StorageLevel(disk, memory, deserialized, 1 replicas),(1) Project [value#176 AS count#178] +- LocalTableScan [value#176] ,None), Statistics(sizeInBytes=12.0 B, hints=none) at scala.Predef$.assert(Predef.scala:170) at org.apache.spark.sql.catalyst.trees.TreeNode.jsonFields(TreeNode.scala:611) at org.apache.spark.sql.catalyst.trees.TreeNode.org$apache$spark$sql$catalyst$trees$TreeNode$$collectJsonValue$1(TreeNode.scala:599) at org.apache.spark.sql.catalyst.trees.TreeNode.jsonValue(TreeNode.scala:604) at org.apache.spark.sql.catalyst.trees.TreeNode.toJSON(TreeNode.scala:590) ``` ## How was this patch tested? Added a test Closes #22715 from gatorsmile/copyArgs1. Authored-by: gatorsmile <gatorsmile@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2018-10-13 22:10:17 -07:00
Dongjoon Hyun	6bbceb9fef	[SPARK-25726][SQL][TEST] Fix flaky test in SaveIntoDataSourceCommandSuite ## What changes were proposed in this pull request? [SPARK-22479](https://github.com/apache/spark/pull/19708/files#diff-5c22ac5160d3c9d81225c5dd86265d27R31) adds a test case which sometimes fails because the used password string `123` matches `41230802`. This PR aims to fix the flakiness. - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97343/consoleFull ```scala SaveIntoDataSourceCommandSuite: - simpleString is redacted * FAILED * "SaveIntoDataSourceCommand .org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider41230802, Map(password -> *******(redacted), url -> *******(redacted), driver -> mydriver), ErrorIfExists +- Range (0, 1, step=1, splits=Some(2)) " contained "123" (SaveIntoDataSourceCommandSuite.scala:42) ``` ## How was this patch tested? Pass the Jenkins with the updated test case Closes #22716 from dongjoon-hyun/SPARK-25726. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2018-10-13 18:01:28 -07:00
Wenchen Fan	34f229bc21	[SPARK-25710][SQL] range should report metrics correctly ## What changes were proposed in this pull request? Currently `Range` reports metrics in batch granularity. This is acceptable, but it's better if we can make it row granularity without performance penalty. Before this PR, the metrics are updated when preparing the batch, which is before we actually consume data. In this PR, the metrics are updated after the data are consumed. There are 2 different cases: 1. The data processing loop has a stop check. The metrics are updated when we need to stop. 2. no stop check. The metrics are updated after the loop. ## How was this patch tested? existing tests and a new benchmark Closes #22698 from cloud-fan/range. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2018-10-13 13:55:28 +08:00
gatorsmile	c9ba59d38e	[SPARK-25714] Fix Null Handling in the Optimizer rule BooleanSimplification ## What changes were proposed in this pull request? ```Scala val df1 = Seq(("abc", 1), (null, 3)).toDF("col1", "col2") df1.write.mode(SaveMode.Overwrite).parquet("/tmp/test1") val df2 = spark.read.parquet("/tmp/test1") df2.filter("col1 = 'abc' OR (col1 != 'abc' AND col2 == 3)").show() ``` Before the PR, it returns both rows. After the fix, it returns `Row("abc", 1)`. This fixes the bug in NULL handling in BooleanSimplification, which was introduced in the Spark 1.6 release. ## How was this patch tested? Added test cases Closes #22702 from gatorsmile/fixBooleanSimplify2. Authored-by: gatorsmile <gatorsmile@gmail.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2018-10-12 21:02:38 -07:00
Yuming Wang	e965fb55ac	[SPARK-25664][SQL][TEST] Refactor JoinBenchmark to use main method ## What changes were proposed in this pull request? Refactor `JoinBenchmark` to use main method. 1. use `spark-submit`: ```console bin/spark-submit --class org.apache.spark.sql.execution.benchmark.JoinBenchmark --jars ./core/target/spark-core_2.11-3.0.0-SNAPSHOT-tests.jar ./sql/catalyst/target/spark-sql_2.11-3.0.0-SNAPSHOT-tests.jar ``` 2. Generate benchmark result: ```console SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.JoinBenchmark" ``` ## How was this patch tested? manual tests Closes #22661 from wangyum/SPARK-25664. Lead-authored-by: Yuming Wang <yumwang@ebay.com> Co-authored-by: Yuming Wang <wgyumg@gmail.com> Co-authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2018-10-12 16:08:12 -07:00
Mathieu St-Louis	4e141a4160	[STREAMING][DOC] Fix typo & formatting for JavaDoc ## What changes were proposed in this pull request? - Fixed typo for function outputMode - OutputMode.Complete(), changed `these is some updates` to `there are some updates` - Replaced hyphenized list by HTML unordered list tags in comments to fix the Javadoc documentation. Current render from most recent [Spark API Docs](https://spark.apache.org/docs/2.3.1/api/java/org/apache/spark/sql/streaming/DataStreamWriter.html): #### outputMode(OutputMode) - List formatted as a prose. ![image](https://user-images.githubusercontent.com/2295469/46250648-11086700-c3f4-11e8-8a5a-d88b079c165d.png) #### outputMode(String) - List formatted as a prose. ![image](https://user-images.githubusercontent.com/2295469/46250651-24b3cd80-c3f4-11e8-9dac-ae37599afbce.png) #### partitionBy(String*) - List formatted as a prose. ![image](https://user-images.githubusercontent.com/2295469/46250655-36957080-c3f4-11e8-990b-47bd612d3c51.png) ## How was this patch tested? This PR contains a document patch ergo no functional testing is required. Closes #22593 from niofire/fix-typo-datastreamwriter. Authored-by: Mathieu St-Louis <mastloui@microsoft.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2018-10-12 14:09:10 -05:00
Maxim Gekk	c7eadb5e66	[SPARK-25660][SQL] Fix for the backward slash as CSV fields delimiter ## What changes were proposed in this pull request? The PR addresses the exception raised on accessing chars out of delimiter string. In particular, the backward slash `\` as the CSV fields delimiter causes the following exception on reading `abc\1`: ```Scala String index out of range: 1 java.lang.StringIndexOutOfBoundsException: String index out of range: 1 at java.lang.String.charAt(String.java:658) ``` because `str.charAt(1)` tries to access a char out of `str` in `CSVUtils.toChar` ## How was this patch tested? Added tests for the empty string and for a string containing the backward slash to `CSVUtilsSuite`. Besides that, I added an end-to-end test to check how the backward slash is handled when reading a CSV string that contains it. Closes #22654 from MaxGekk/csv-slash-delim. Authored-by: Maxim Gekk <maxim.gekk@databricks.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2018-10-12 12:04:00 -07:00
Shahid	3494b12281	[SPARK-25566][SPARK-25567][WEBUI][SQL] Support pagination for SQL tab to avoid OOM ## What changes were proposed in this pull request? Currently SQL tab in the WEBUI doesn't support pagination. Because of that following issues are happening. 1) For large number of executions, SQL page is throwing OOM exception (around 40,000) 2) For large number of executions, loading SQL page is taking time. 3) Difficult to analyse the execution table for large number of execution. [Note: spark.sql.ui.retainedExecutions = 50000] All the tabs, Jobs, Stages etc. supports pagination. So, to make it consistent with other tabs SQL tab also should support pagination. I have followed the similar flow of the pagination code in the Jobs and Stages page for SQL page. Also, this patch doesn't make any behavior change for the SQL tab except the pagination support. ## How was this patch tested? bin/spark-shell --conf spark.sql.ui.retainedExecutions=50000 Run 50,000 sql queries. Before this PR ![screenshot from 2018-10-05 23-48-27](https://user-images.githubusercontent.com/23054875/46552750-4ed82480-c8f9-11e8-8b05-d60bedddd1b8.png) ![screenshot from 2018-10-05 22-58-11](https://user-images.githubusercontent.com/23054875/46550276-33b5e680-c8f2-11e8-9e32-9ae9c5b181e0.png) After this PR Loading of the page is faster, and OOM issue doesn't happen. ![screenshot from 2018-10-05 23-50-32](https://user-images.githubusercontent.com/23054875/46552814-8050f000-c8f9-11e8-96e9-42502d2cfaea.png) Closes #22645 from shahidki31/SPARK-25566. Authored-by: Shahid <shahidki31@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2018-10-12 12:36:35 -05:00
Wenchen Fan	78e133141c	[SPARK-25708][SQL] HAVING without GROUP BY means global aggregate ## What changes were proposed in this pull request? According to the SQL standard, when a query contains `HAVING`, it indicates an aggregate operator. For more details please refer to https://blog.jooq.org/2014/12/04/do-you-really-understand-sqls-group-by-and-having-clauses/ However, in the Spark SQL parser, we treat HAVING as a normal filter when there is no GROUP BY, which breaks SQL semantics and leads to wrong results. This PR fixes the parser. ## How was this patch tested? new test Closes #22696 from cloud-fan/having. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2018-10-12 00:24:06 -07:00
hyukjinkwon	39872af882	[SPARK-25684][SQL] Organize header related codes in CSV datasource ## What changes were proposed in this pull request? 1. Move `CSVDataSource.makeSafeHeader` to `CSVUtils.makeSafeHeader` (as is). - Historically and at the first place of refactoring (which I did), I intended to put all CSV specific handling (like options), filtering, extracting header, etc. - See `JsonDataSource`. Now `CSVDataSource` is quite consistent with `JsonDataSource`. Since CSV's code path is quite complicated, we might better match them as possible as we can. 2. Create `CSVHeaderChecker` and put `enforceSchema` logics into that. - The checking header and column pruning stuff were added (per https://github.com/apache/spark/pull/20894 and https://github.com/apache/spark/pull/21296) but some of codes such as https://github.com/apache/spark/pull/22123 are duplicated - Also, checking header code is basically here and there. We better put them in a single place, which was quite error-prone. See (https://github.com/apache/spark/pull/22656). 3. Move `CSVDataSource.checkHeaderColumnNames` to `CSVHeaderChecker.checkHeaderColumnNames` (as is). - Similar reasons above with 1. ## How was this patch tested? Existing tests should cover this. Closes #22676 from HyukjinKwon/refactoring-csv. Authored-by: hyukjinkwon <gurwls223@apache.org> Signed-off-by: hyukjinkwon <gurwls223@apache.org>	2018-10-12 09:16:41 +08:00
liuxian	69f5e9cce1	[SPARK-25674][SQL] If the records are incremented by more than 1 at a time,the number of bytes might rarely ever get updated ## What changes were proposed in this pull request? If the records are incremented by more than 1 at a time, the number of bytes might rarely ever get updated, because the count might skip over the values that are exact multiples of UPDATE_INPUT_METRICS_INTERVAL_RECORDS. This PR just checks whether the increment causes the value to exceed a higher multiple of UPDATE_INPUT_METRICS_INTERVAL_RECORDS. ## How was this patch tested? existing unit tests Closes #22594 from 10110346/inputMetrics. Authored-by: liuxian <liu.xian3@zte.com.cn> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2018-10-11 14:24:15 -07:00
caoxuewen	65f75db611	[MINOR][SQL] remove Redundant semicolons ## What changes were proposed in this pull request? Remove redundant semicolons in SortMergeJoinExec, thanks. ## How was this patch tested? N/A Closes #22695 from heary-cao/RedundantSemicolons. Authored-by: caoxuewen <cao.xuewen@zte.com.cn> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2018-10-11 14:03:41 -07:00
Peter Toth	8115e6b269	[SPARK-25662][SQL][TEST] Refactor DataSourceReadBenchmark to use main method ## What changes were proposed in this pull request? 1. Refactor DataSourceReadBenchmark ## How was this patch tested? Manually tested and regenerated results. ``` SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.DataSourceReadBenchmark" ``` Closes #22664 from peter-toth/SPARK-25662. Lead-authored-by: Peter Toth <peter.toth@gmail.com> Co-authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: DB Tsai <d_tsai@apple.com>	2018-10-11 20:27:07 +00:00
hyukjinkwon	83e19d5b80	[SPARK-25700][SQL] Creates ReadSupport in only Append Mode in Data Source V2 write path ## What changes were proposed in this pull request? This PR proposes to avoid creating a ReadSupport and reading the schema when writing in other save modes. `5fef6e3513` happened to create a ReadSupport in the write path, which ended up reading the schema from the ReadSupport in the write path. This breaks the `spark.range(1).format("source").write.save("non-existent-path")` case since there's no way to read the schema from "non-existent-path". See also https://github.com/apache/spark/pull/22009#discussion_r223982672 See also https://github.com/apache/spark/pull/22697 See also http://apache-spark-developers-list.1001551.n3.nabble.com/Possible-bug-in-DatasourceV2-td25343.html ## How was this patch tested? Unit test and manual tests. Closes #22688 from HyukjinKwon/append-revert-2. Authored-by: hyukjinkwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2018-10-11 09:35:49 -07:00
Gengliang Wang	6df2345794	[SPARK-25699][SQL] Partially push down conjunctive predicated in ORC ## What changes were proposed in this pull request? Inspired by https://github.com/apache/spark/pull/22574 . We can partially push down top level conjunctive predicates to Orc. This PR improves Orc predicate push down in both SQL and Hive module. ## How was this patch tested? New unit test. Closes #22684 from gengliangwang/pushOrcFilters. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: DB Tsai <d_tsai@apple.com>	2018-10-10 18:18:56 +00:00
gatorsmile	faf73dcd33	[SPARK-25559][FOLLOW-UP] Add comments for partial pushdown of conjuncts in Parquet ## What changes were proposed in this pull request? This is a follow up of https://github.com/apache/spark/pull/22574. Renamed the parameter and added comments. ## How was this patch tested? N/A Closes #22679 from gatorsmile/followupSPARK-25559. Authored-by: gatorsmile <gatorsmile@gmail.com> Signed-off-by: DB Tsai <d_tsai@apple.com>	2018-10-09 21:10:33 +00:00
Wenchen Fan	e3133f4abf	[SPARK-25497][SQL] Limit operation within whole stage codegen should not consume all the inputs ## What changes were proposed in this pull request? This PR is inspired by https://github.com/apache/spark/pull/22524, but proposes a safer fix. The current limit whole stage codegen has 2 problems: 1. It's only applied to `InputAdapter`, many leaf nodes can't stop earlier w.r.t. limit. 2. It needs to override a method, which will break if we have more than one limit in the whole-stage. The first problem is easy to fix, just figure out which nodes can stop earlier w.r.t. limit, and update them. This PR updates `RangeExec`, `ColumnarBatchScan`, `SortExec`, `HashAggregateExec`. The second problem is hard to fix. This PR proposes to propagate the limit counter variable name upstream, so that the upstream leaf/blocking nodes can check the limit counter and quit the loop earlier. For better performance, the implementation here follows `CodegenSupport.needStopCheck`, so that we only codegen the check only if there is limit in the query. For columnar node like range, we check the limit counter per-batch instead of per-row, to make the inner loop tight and fast. Why this is safer? 1. the leaf/blocking nodes don't have to check the limit counter and stop earlier. It's only for performance. (this is same as before) 2. The blocking operators can stop propagating the limit counter name, because the counter of limit after blocking operators will never increase, before blocking operators consume all the data from upstream operators. So the upstream operators don't care about limit after blocking operators. This is also for performance only, it's OK if we forget to do it for some new blocking operators. ## How was this patch tested? a new test Closes #22630 from cloud-fan/limit. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Kazuaki Ishizaki <ishizaki@jp.ibm.com>	2018-10-09 16:46:23 +09:00
Maxim Gekk	46fe40838a	[SPARK-25669][SQL] Check CSV header only when it exists ## What changes were proposed in this pull request? Currently the first row of dataset of CSV strings is compared to field names of user specified or inferred schema independently of presence of CSV header. It causes false-positive error messages. For example, parsing `"1,2"` outputs the error: ```java java.lang.IllegalArgumentException: CSV header does not conform to the schema. Header: 1, 2 Schema: _c0, _c1 Expected: _c0 but found: 1 ``` In the PR, I propose: - Checking CSV header only when it exists - Filter header from the input dataset only if it exists ## How was this patch tested? Added a test to `CSVSuite` which reproduces the issue. Closes #22656 from MaxGekk/inferred-header-check. Authored-by: Maxim Gekk <maxim.gekk@databricks.com> Signed-off-by: hyukjinkwon <gurwls223@apache.org>	2018-10-09 14:35:00 +08:00
Peter Toth	b0cee9605e	[SPARK-25062][SQL] Clean up BlockLocations in InMemoryFileIndex ## What changes were proposed in this pull request? `InMemoryFileIndex` contains a cache of `LocatedFileStatus` objects. Each `LocatedFileStatus` object can contain several `BlockLocation`s or some subclass of it. Filling up this cache by listing files happens recursively either on the driver or on the executors, depending on the parallel discovery threshold (`spark.sql.sources.parallelPartitionDiscovery.threshold`). If the listing happens on the executors block location objects are converted to simple `BlockLocation` objects to ensure serialization requirements. If it happens on the driver then there is no conversion and depending on the file system a `BlockLocation` object can be a subclass like `HdfsBlockLocation` and consume more memory. This PR adds the conversion to the latter case and decreases memory consumption. ## How was this patch tested? Added unit test. Closes #22603 from peter-toth/SPARK-25062. Authored-by: Peter Toth <peter.toth@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2018-10-06 14:50:03 -07:00
Dongjoon Hyun	9cbf105ab1	[SPARK-25644][SS][FOLLOWUP][BUILD] Fix Scala 2.12 build error due to foreachBatch ## What changes were proposed in this pull request? This PR fixes the Scala-2.12 build error due to ambiguity in `foreachBatch` test cases. - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12/428/console ```scala [error] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12/sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/sources/ForeachBatchSinkSuite.scala:102: ambiguous reference to overloaded definition, [error] both method foreachBatch in class DataStreamWriter of type (function: org.apache.spark.api.java.function.VoidFunction2[org.apache.spark.sql.Dataset[Int],Long])org.apache.spark.sql.streaming.DataStreamWriter[Int] [error] and method foreachBatch in class DataStreamWriter of type (function: (org.apache.spark.sql.Dataset[Int], Long) => Unit)org.apache.spark.sql.streaming.DataStreamWriter[Int] [error] match argument types ((org.apache.spark.sql.Dataset[Int], Any) => Unit) [error] ds.writeStream.foreachBatch((_, _) => {}).trigger(Trigger.Continuous("1 second")).start() [error] ^ [error] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12/sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/sources/ForeachBatchSinkSuite.scala:106: ambiguous reference to overloaded definition, [error] both method foreachBatch in class DataStreamWriter of type (function: org.apache.spark.api.java.function.VoidFunction2[org.apache.spark.sql.Dataset[Int],Long])org.apache.spark.sql.streaming.DataStreamWriter[Int] [error] and method foreachBatch in class DataStreamWriter of type (function: (org.apache.spark.sql.Dataset[Int], Long) => Unit)org.apache.spark.sql.streaming.DataStreamWriter[Int] [error] match argument types ((org.apache.spark.sql.Dataset[Int], Any) => Unit) [error] ds.writeStream.foreachBatch((_, _) => 
{}).partitionBy("value").start() [error] ^ ``` ## How was this patch tested? Manual. Since this failure occurs in Scala-2.12 profile and test cases, Jenkins will not test this. We need to build with Scala-2.12 and run the tests. Closes #22649 from dongjoon-hyun/SPARK-SCALA212. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2018-10-06 09:40:42 -07:00
Yuming Wang	edf4286611	[SPARK-25488][SQL][TEST] Refactor MiscBenchmark to use main method ## What changes were proposed in this pull request? Refactor `MiscBenchmark` to use main method. Generate benchmark result: ```sh SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.MiscBenchmark" ``` ## How was this patch tested? manual tests Closes #22500 from wangyum/SPARK-25488. Lead-authored-by: Yuming Wang <yumwang@ebay.com> Co-authored-by: Yuming Wang <wgyumg@gmail.com> Co-authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2018-10-06 08:47:43 -07:00
Gengliang Wang	1ee472eec1	[SPARK-25621][SPARK-25622][TEST] Reduce test time of BucketedReadWithHiveSupportSuite ## What changes were proposed in this pull request? Reduce the test time by replacing loops with a random possible value. - `read partitioning bucketed tables with bucket pruning filters`: reduced from 55s to 7s - `read partitioning bucketed tables having composite filters`: reduced from 54s to 8s - total time: reduced from 288s to 192s ## How was this patch tested? Unit test Closes #22640 from gengliangwang/fastenBucketedReadSuite. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: hyukjinkwon <gurwls223@apache.org>	2018-10-06 14:54:04 +08:00
Dilip Biswal	f2f4e7afe7	[SPARK-25600][SQL][MINOR] Make use of TypeCoercion.findTightestCommonType while inferring CSV schema. ## What changes were proposed in this pull request? Currently, the CSV schema inference code inlines `TypeCoercion.findTightestCommonType`. This is a minor refactor to make use of the common type coercion code when applicable. This way we can take advantage of any improvement to the base method. Thanks to MaxGekk for finding this while reviewing another PR. ## How was this patch tested? This is a minor refactor. Existing tests are used to verify the change. Closes #22619 from dilipbiswal/csv_minor. Authored-by: Dilip Biswal <dbiswal@us.ibm.com> Signed-off-by: hyukjinkwon <gurwls223@apache.org>	2018-10-06 14:49:51 +08:00
Parker Hegstrom	17781d7530	[SPARK-25202][SQL] Implements split with limit sql function ## What changes were proposed in this pull request? Adds support for setting a limit in the SQL `split` function. ## How was this patch tested? 1. Updated unit tests 2. Tested using Scala spark shell Closes #22227 from phegstrom/master. Authored-by: Parker Hegstrom <phegstrom@palantir.com> Signed-off-by: hyukjinkwon <gurwls223@apache.org>	2018-10-06 14:30:43 +08:00
Dilip Biswal	2c6f4d61bb	[SPARK-25610][SQL][TEST] Improve execution time of DatasetCacheSuite: cache UDF result correctly ## What changes were proposed in this pull request? In this test case, we verify that the result of a UDF is cached when the underlying data frame is cached and that the UDF is not evaluated again when the cached data frame is used. To reduce the runtime we: 1) Use a single partition dataframe, so the total execution time of the UDF is more deterministic. 2) Cut down the size of the dataframe from 10 to 2. 3) Reduce the sleep time in the UDF from 5 secs to 2 secs. 4) Reduce the `failAfter` condition from 3 to 2. With the above changes, it takes about 4 secs to cache the first dataframe, and the subsequent check takes a few hundred milliseconds. The new runtime for 5 consecutive runs of this test is as follows: ``` [info] - cache UDF result correctly (4 seconds, 906 milliseconds) [info] - cache UDF result correctly (4 seconds, 281 milliseconds) [info] - cache UDF result correctly (4 seconds, 288 milliseconds) [info] - cache UDF result correctly (4 seconds, 355 milliseconds) [info] - cache UDF result correctly (4 seconds, 280 milliseconds) ``` ## How was this patch tested? This is a test fix. Closes #22638 from dilipbiswal/SPARK-25610. Authored-by: Dilip Biswal <dbiswal@us.ibm.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2018-10-05 17:25:28 -07:00
Dongjoon Hyun	1c9486c1ac	[SPARK-25635][SQL][BUILD] Support selective direct encoding in native ORC write ## What changes were proposed in this pull request? Before ORC 1.5.3, `orc.dictionary.key.threshold` and `hive.exec.orc.dictionary.key.size.threshold` are applied to all columns. This has been a big hurdle to enabling dictionary encoding. From ORC 1.5.3, `orc.column.encoding.direct` is added to enforce direct encoding selectively in a column-wise manner. This PR aims to add that feature by upgrading ORC from 1.5.2 to 1.5.3. The following are the patches in ORC 1.5.3, and this feature is the only one related to Spark directly. ``` ORC-406: ORC: Char(n) and Varchar(n) writers truncate to n bytes & corrupts multi-byte data (gopalv) ORC-403: [C++] Add checks to avoid invalid offsets in InputStream ORC-405: Remove calcite as a dependency from the benchmarks. ORC-375: Fix libhdfs on gcc7 by adding #include <functional> two places. ORC-383: Parallel builds fails with ConcurrentModificationException ORC-382: Apache rat exclusions + add rat check to travis ORC-401: Fix incorrect quoting in specification. ORC-385: Change RecordReader to extend Closeable. ORC-384: [C++] fix memory leak when loading non-ORC files ORC-391: [c++] parseType does not accept underscore in the field name ORC-397: Allow selective disabling of dictionary encoding. Original patch was by Mithun Radhakrishnan. ORC-389: Add ability to not decode Acid metadata columns ``` ## How was this patch tested? Pass the Jenkins with newly added test cases. Closes #22622 from dongjoon-hyun/SPARK-25635. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2018-10-05 16:42:06 -07:00
Shixiong Zhu	7dcc90fbb8	[SPARK-25644][SS] Fix java foreachBatch in DataStreamWriter ## What changes were proposed in this pull request? The Java `foreachBatch` API in `DataStreamWriter` should accept `java.lang.Long` rather than `scala.Long`. ## How was this patch tested? New Java test. Closes #22633 from zsxwing/fix-java-foreachbatch. Authored-by: Shixiong Zhu <zsxwing@gmail.com> Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>	2018-10-05 10:45:15 -07:00
Michal Senkyr	434ada12a0	[SPARK-17952][SQL] Nested Java beans support in createDataFrame ## What changes were proposed in this pull request? When constructing a DataFrame from a Java bean, using nested beans throws an error despite [documentation](http://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection) stating otherwise. This PR aims to add that support. This PR does not yet add nested beans support in array or List fields. This can be added later or in another PR. ## How was this patch tested? Nested bean was added to the appropriate unit test. Also manually tested in Spark shell on code emulating the referenced JIRA: ``` scala> import scala.beans.BeanProperty import scala.beans.BeanProperty scala> class SubCategory(@BeanProperty var id: String, @BeanProperty var name: String) extends Serializable defined class SubCategory scala> class Category(@BeanProperty var id: String, @BeanProperty var subCategory: SubCategory) extends Serializable defined class Category scala> import scala.collection.JavaConverters._ import scala.collection.JavaConverters._ scala> spark.createDataFrame(Seq(new Category("s-111", new SubCategory("sc-111", "Sub-1"))).asJava, classOf[Category]) java.lang.IllegalArgumentException: The value (SubCategory65130cf2) of the type (SubCategory) cannot be converted to struct<id:string,name:string> at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:262) at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:238) at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103) at org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:396) at org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1108) at 
org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1108) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) at org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1.apply(SQLContext.scala:1108) at org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1.apply(SQLContext.scala:1106) at scala.collection.Iterator$$anon$11.next(Iterator.scala:410) at scala.collection.Iterator$class.toStream(Iterator.scala:1320) at scala.collection.AbstractIterator.toStream(Iterator.scala:1334) at scala.collection.TraversableOnce$class.toSeq(TraversableOnce.scala:298) at scala.collection.AbstractIterator.toSeq(Iterator.scala:1334) at org.apache.spark.sql.SparkSession.createDataFrame(SparkSession.scala:423) ... 51 elided ``` New behavior: ``` scala> spark.createDataFrame(Seq(new Category("s-111", new SubCategory("sc-111", "Sub-1"))).asJava, classOf[Category]) res0: org.apache.spark.sql.DataFrame = [id: string, subCategory: struct<id: string, name: string>] scala> res0.show() +-----+---------------+ \| id\| subCategory\| +-----+---------------+ \|s-111\|[sc-111, Sub-1]\| +-----+---------------+ ``` Closes #22527 from michalsenkyr/SPARK-17952. Authored-by: Michal Senkyr <mike.senkyr@gmail.com> Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2018-10-05 17:48:52 +09:00
Fokko Driesprong	ab1650d293	[SPARK-24601] Update Jackson to 2.9.6 Hi all, Jackson is incompatible with upstream versions, therefore bump the Jackson version to a more recent one. I bumped into some issues with Azure CosmosDB, which uses a more recent version of Jackson. This can be fixed by adding exclusions, and then it works without any issues. So there are no breaking changes in the APIs. I would also consider bumping the version of Jackson in Spark. I would suggest keeping up to date with the dependencies, since in the future this issue will pop up more frequently. ## What changes were proposed in this pull request? Bump Jackson to 2.9.6 ## How was this patch tested? Compiled and tested it locally to see if anything broke. Closes #21596 from Fokko/fd-bump-jackson. Authored-by: Fokko Driesprong <fokkodriesprong@godatadriven.com> Signed-off-by: hyukjinkwon <gurwls223@apache.org>	2018-10-05 16:40:08 +08:00
s71955	459700727f	[SPARK-25521][SQL] Job id showing null in the logs when insert into command Job is finished. ## What changes were proposed in this pull request? As part of the insert command in `FileFormatWriter`, a job context is created for handling the write operation. While initializing the job context using the `setupJob()` API in `HadoopMapReduceCommitProtocol`, we set the job ID in the `JobContext` configuration. In `FileFormatWriter`, since we get the job ID directly from the MapReduce `JobContext`, the job ID comes out as null when adding the log. As a solution, we get the job ID from the configuration of the MapReduce `JobContext`. ## How was this patch tested? Manually, verified the logs after the changes. ![spark-25521 1](https://user-images.githubusercontent.com/12999161/46164933-e95ab700-c2ac-11e8-88e9-49fa5100b872.PNG) Closes #22572 from sujith71955/master_log_issue. Authored-by: s71955 <sujithchacko.2010@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2018-10-05 13:09:16 +08:00
Marco Gaido	85a93595d5	[SPARK-25609][TESTS] Reduce time of test for SPARK-22226 ## What changes were proposed in this pull request? The PR changes the test introduced for SPARK-22226, so that we don't run analysis and optimization on the plan. The scope of the test is code generation, and running the above-mentioned operations is expensive and unnecessary for the test. The UT was also moved to the `CodeGenerationSuite`, which is a better place given the scope of the test. ## How was this patch tested? The UT fails before SPARK-22226 and passes after it. The execution time is about 50% of the original one. On my laptop this means that the test now runs in about 23 seconds (instead of 50 seconds). Closes #22629 from mgaido91/SPARK-25609. Authored-by: Marco Gaido <marcogaido91@gmail.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2018-10-04 18:46:16 -07:00
Yuming Wang	95ae209461	[SPARK-25479][TEST] Refactor DatasetBenchmark to use main method ## What changes were proposed in this pull request? Refactor `DatasetBenchmark` to use main method. Generate benchmark result: ```sh SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.DatasetBenchmark" ``` ## How was this patch tested? manual tests Closes #22488 from wangyum/SPARK-25479. Lead-authored-by: Yuming Wang <yumwang@ebay.com> Co-authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2018-10-04 11:58:16 -07:00
Wenchen Fan	71c24aad36	[SPARK-25602][SQL] SparkPlan.getByteArrayRdd should not consume the input when not necessary ## What changes were proposed in this pull request? In `SparkPlan.getByteArrayRdd`, we should only call `iter.hasNext` when the limit is not hit, as `iter.hasNext` may produce one row and buffer it, and cause wrong metrics. ## How was this patch tested? new tests Closes #22621 from cloud-fan/range. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2018-10-04 20:15:21 +08:00
Yuming Wang	56741c342d	[SPARK-25483][TEST] Refactor UnsafeArrayDataBenchmark to use main method ## What changes were proposed in this pull request? Refactor `UnsafeArrayDataBenchmark` to use main method. Generate benchmark result: ```sh SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.UnsafeArrayDataBenchmark" ``` ## How was this patch tested? manual tests Closes #22491 from wangyum/SPARK-25483. Lead-authored-by: Yuming Wang <yumwang@ebay.com> Co-authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2018-10-03 04:20:02 -07:00
Dongjoon Hyun	1a5d83bed8	[SPARK-25589][SQL][TEST] Add BloomFilterBenchmark ## What changes were proposed in this pull request? This PR aims to add `BloomFilterBenchmark`. For the ORC data source, Apache Spark has supported bloom filters for a long time. For the Parquet data source, support is expected to be added with the next Parquet release update. ## How was this patch tested? Manual. ```sh SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.BloomFilterBenchmark" ``` Closes #22605 from dongjoon-hyun/SPARK-25589. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2018-10-03 04:14:07 -07:00
Gengliang Wang	7b4e94f160	[SPARK-25581][SQL] Rename method `benchmark` as `runBenchmarkSuite` in `BenchmarkBase` ## What changes were proposed in this pull request? Rename method `benchmark` in `BenchmarkBase` as `runBenchmarkSuite`. Also add comments. Currently the method name `benchmark` is a bit confusing. Also the name is the same as instances of `Benchmark`: `f246813afb/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcReadBenchmark.scala (L330-L339)` ## How was this patch tested? Unit test. Closes #22599 from gengliangwang/renameBenchmarkSuite. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2018-10-02 10:04:47 -07:00
gatorsmile	9bf397c0e4	[SPARK-25592] Setting version to 3.0.0-SNAPSHOT ## What changes were proposed in this pull request? This patch is to bump the master branch version to 3.0.0-SNAPSHOT. ## How was this patch tested? N/A Closes #22606 from gatorsmile/bump3.0. Authored-by: gatorsmile <gatorsmile@gmail.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2018-10-02 08:48:24 -07:00
Shahid	3422fc0b6c	[SPARK-25575][WEBUI][SQL] SQL tab in the spark UI support hide tables, to make it consistent with other tabs. ## What changes were proposed in this pull request? Currently, the SQL tab in the web UI doesn't support hiding tables. Other tabs in the web UI, such as Jobs and Stages, support hiding tables (see SPARK-23024, https://github.com/apache/spark/pull/20216). This PR adds support for hiding tables in the SQL tab as well. ## How was this patch tested? bin/spark-shell ``` sql("create table a (id int)") for(i <- 1 to 100) sql(s"insert into a values ($i)") ``` Open SQL tab in the web UI Before fix: ![image](https://user-images.githubusercontent.com/23054875/46249137-f5c44880-c441-11e8-953a-a811e33ac24d.png) After fix: Consistent with the other tabs. ![screenshot from 2018-09-30 00-11-28](https://user-images.githubusercontent.com/23054875/46249354-75074b80-c445-11e8-9417-28751fd8628a.png) Closes #22592 from shahidki31/SPARK-25575. Authored-by: Shahid <shahidki31@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2018-10-01 17:45:12 -05:00
Yuming Wang	b96fd44f0e	[SPARK-25476][SPARK-25510][TEST] Refactor AggregateBenchmark and add a new trait to better support Dataset and DataFrame API ## What changes were proposed in this pull request? This PR does 2 things: 1. Add a new trait(`SqlBasedBenchmark`) to better support Dataset and DataFrame API. 2. Refactor `AggregateBenchmark` to use main method. Generate benchmark result: ``` SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.AggregateBenchmark" ``` ## How was this patch tested? manual tests Closes #22484 from wangyum/SPARK-25476. Lead-authored-by: Yuming Wang <yumwang@ebay.com> Co-authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2018-10-01 07:32:40 -07:00
Marco Gaido	fb8f4c0565	[SPARK-25505][SQL][FOLLOWUP] Fix for attributes cosmetically different in Pivot clause ## What changes were proposed in this pull request? #22519 introduced a bug when the attributes in the pivot clause are cosmetically different from the output ones (eg. different case). In particular, the problem is that the PR used a `Set[Attribute]` instead of an `AttributeSet`. ## How was this patch tested? added UT Closes #22582 from mgaido91/SPARK-25505_followup. Authored-by: Marco Gaido <marcogaido91@gmail.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2018-09-30 22:08:04 -07:00
hyukjinkwon	a2f502cf53	[SPARK-25565][BUILD] Add scalastyle rule to check add Locale.ROOT to .toLowerCase and .toUpperCase for internal calls ## What changes were proposed in this pull request? This PR adds a rule to force `.toLowerCase(Locale.ROOT)` or `.toUpperCase(Locale.ROOT)`. It produces an error as below: ``` [error] Are you sure that you want to use toUpperCase or toLowerCase without the root locale? In most cases, you [error] should use toUpperCase(Locale.ROOT) or toLowerCase(Locale.ROOT) instead. [error] If you must use toUpperCase or toLowerCase without the root locale, wrap the code block with [error] // scalastyle:off caselocale [error] .toUpperCase [error] .toLowerCase [error] // scalastyle:on caselocale ``` This PR excludes the cases above in the SQL code path for external calls such as table names, column names, etc. For test suites, or when it's clear there's no locale problem such as the Turkish locale problem, it uses `Locale.ROOT`. One minor problem is that `UTF8String` has both methods, `toLowerCase` and `toUpperCase`, and the new rule detects them as well. They are ignored. ## How was this patch tested? Manually tested, and Jenkins tests. Closes #22581 from HyukjinKwon/SPARK-25565. Authored-by: hyukjinkwon <gurwls223@apache.org> Signed-off-by: hyukjinkwon <gurwls223@apache.org>	2018-09-30 14:31:04 +08:00
Maxim Gekk	623c2ec4ef	[SPARK-25048][SQL] Pivoting by multiple columns in Scala/Java ## What changes were proposed in this pull request? In the PR, I propose to extend implementation of existing method: ``` def pivot(pivotColumn: Column, values: Seq[Any]): RelationalGroupedDataset ``` to support values of the struct type. This allows pivoting by multiple columns combined by `struct`: ``` trainingSales .groupBy($"sales.year") .pivot( pivotColumn = struct(lower($"sales.course"), $"training"), values = Seq( struct(lit("dotnet"), lit("Experts")), struct(lit("java"), lit("Dummies"))) ).agg(sum($"sales.earnings")) ``` ## How was this patch tested? Added a test for values specified via `struct` in Java and Scala. Closes #22316 from MaxGekk/pivoting-by-multiple-columns2. Lead-authored-by: Maxim Gekk <maxim.gekk@databricks.com> Co-authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: hyukjinkwon <gurwls223@apache.org>	2018-09-29 21:50:35 +08:00
Maxim Gekk	1007cae20e	[SPARK-25447][SQL] Support JSON options by schema_of_json() ## What changes were proposed in this pull request? In the PR, I propose to extend the `schema_of_json()` function to accept JSON options, since they can affect schema inference. The purpose is to support the same options that `from_json` can use during schema inference. ## How was this patch tested? Added SQL, Python and Scala tests (`JsonExpressionsSuite` and `JsonFunctionsSuite`) that check that JSON options are used. Closes #22442 from MaxGekk/schema_of_json-options. Authored-by: Maxim Gekk <maxim.gekk@databricks.com> Signed-off-by: hyukjinkwon <gurwls223@apache.org>	2018-09-29 17:53:30 +08:00
DB Tsai	5d726b8659	[SPARK-25559][SQL] Remove the unsupported predicates in Parquet when possible ## What changes were proposed in this pull request? Currently, in `ParquetFilters`, if one of the child predicates is not supported by Parquet, the entire predicate is thrown away. In fact, if the unsupported predicate is in the top-level `And` condition, or in a child before hitting a `Not` or `Or` condition, it can be safely removed. ## How was this patch tested? Tests are added. Closes #22574 from dbtsai/removeUnsupportedPredicatesInParquet. Lead-authored-by: DB Tsai <d_tsai@apple.com> Co-authored-by: Dongjoon Hyun <dongjoon@apache.org> Co-authored-by: DB Tsai <dbtsai@dbtsai.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2018-09-28 17:46:11 -07:00
Yuming Wang	a281465686	[SPARK-25429][SQL] Use Set instead of Array to improve lookup performance ## What changes were proposed in this pull request? Use `Set` instead of `Array` to improve `accumulatorIds.contains(acc.id)` performance. This PR close https://github.com/apache/spark/pull/22420 ## How was this patch tested? manual tests. Benchmark code: ```scala def benchmark(func: () => Unit): Long = { val start = System.currentTimeMillis() func() val end = System.currentTimeMillis() end - start } val range = Range(1, 1000000) val set = range.toSet val array = range.toArray for (i <- 0 until 5) { val setExecutionTime = benchmark(() => for (i <- 0 until 500) { set.contains(scala.util.Random.nextInt()) }) val arrayExecutionTime = benchmark(() => for (i <- 0 until 500) { array.contains(scala.util.Random.nextInt()) }) println(s"set execution time: $setExecutionTime, array execution time: $arrayExecutionTime") } ``` Benchmark result: ``` set execution time: 4, array execution time: 2760 set execution time: 1, array execution time: 1911 set execution time: 3, array execution time: 2043 set execution time: 12, array execution time: 2214 set execution time: 6, array execution time: 1770 ``` Closes #22579 from wangyum/SPARK-25429. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2018-09-28 15:08:15 -07:00
Dilip Biswal	7deef7a49b	[SPARK-25458][SQL] Support FOR ALL COLUMNS in ANALYZE TABLE ## What changes were proposed in this pull request? Description from the JIRA : Currently, to collect the statistics of all the columns, users need to specify the names of all the columns when calling the command "ANALYZE TABLE ... FOR COLUMNS...". This is not user friendly. Instead, we can introduce the following SQL command to achieve it without specifying the column names. ``` ANALYZE TABLE [db_name.]tablename COMPUTE STATISTICS FOR ALL COLUMNS; ``` ## How was this patch tested? Added new tests in SparkSqlParserSuite and StatisticsSuite Closes #22566 from dilipbiswal/SPARK-25458. Authored-by: Dilip Biswal <dbiswal@us.ibm.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2018-09-28 15:03:06 -07:00
maryannxue	e120a38c0c	[SPARK-25505][SQL] The output order of grouping columns in Pivot is different from the input order ## What changes were proposed in this pull request? The grouping columns from a Pivot query are inferred as "input columns - pivot columns - pivot aggregate columns", where input columns are the output of the child relation of Pivot. The grouping columns will be the leading columns in the pivot output and they should preserve the same order as specified by the input. For example, ``` SELECT * FROM ( SELECT course, earnings, "a" as a, "z" as z, "b" as b, "y" as y, "c" as c, "x" as x, "d" as d, "w" as w FROM courseSales ) PIVOT ( sum(earnings) FOR course IN ('dotNET', 'Java') ) ``` The output columns should be "a, z, b, y, c, x, d, w, ..." but now it is "a, b, c, d, w, x, y, z, ..." The fix is to use the child plan's `output` instead of `outputSet` so that the order can be preserved. ## How was this patch tested? Added UT. Closes #22519 from maryannxue/spark-25505. Authored-by: maryannxue <maryannxue@apache.org> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2018-09-28 00:09:06 -07:00
Chris Zhao	3b7395fe02	[SPARK-25459][SQL] Add viewOriginalText back to CatalogTable ## What changes were proposed in this pull request? `show create table` shows a lot of generated attributes for views created by older Spark versions. This PR basically reverts https://issues.apache.org/jira/browse/SPARK-19272, so that `DESC [FORMATTED\|EXTENDED] view` shows the original view DDL text. ## How was this patch tested? Unit test. Closes #22458 from zheyuan28/testbranch. Lead-authored-by: Chris Zhao <chris.zhao@databricks.com> Co-authored-by: Christopher Zhao <chris.zhao@databricks.com> Co-authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2018-09-27 17:55:08 -07:00
Wenchen Fan	a1adde5408	[SPARK-24341][SQL][FOLLOWUP] remove duplicated error checking ## What changes were proposed in this pull request? There are 2 places we check for problematic `InSubquery`: the rule `ResolveSubquery` and `InSubquery.checkInputDataTypes`. We should unify them. ## How was this patch tested? existing tests Closes #22563 from cloud-fan/followup. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2018-09-27 21:19:25 +08:00
Gengliang Wang	dd8f6b1ce8	[SPARK-25541][SQL][FOLLOWUP] Remove overriding filterKeys in CaseInsensitiveMap ## What changes were proposed in this pull request? As per the discussion in https://github.com/apache/spark/pull/22553#pullrequestreview-159192221, overriding `filterKeys` violates the documented semantics. This PR removes the override and adds documentation. It also fixes one potentially non-serializable map in `FileStreamOptions`. The only remaining call of `CaseInsensitiveMap`'s `filterKeys` is `c3c45cbd76/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveOptions.scala (L88-L90)`, but this one is OK. ## How was this patch tested? Existing unit tests. Closes #22562 from gengliangwang/SPARK-25541-FOLLOWUP. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2018-09-27 19:53:13 +08:00
Marco Gaido	86a2450e09	[SPARK-25551][SQL] Remove unused InSubquery expression ## What changes were proposed in this pull request? The PR removes the `InSubquery` expression which was introduced a long time ago and its only usage was removed in `4ce970d714`. Hence it is not used anymore. ## How was this patch tested? existing UTs Closes #22556 from mgaido91/minor_insubq. Authored-by: Marco Gaido <marcogaido91@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2018-09-27 19:34:05 +08:00
Dilip Biswal	d03e0af80d	[SPARK-25522][SQL] Improve type promotion for input arguments of elementAt function ## What changes were proposed in this pull request? In ElementAt, when first argument is MapType, we should coerce the key type and the second argument based on findTightestCommonType. This is not happening currently. We may produce wrong output as we will incorrectly downcast the right hand side double expression to int. ```SQL spark-sql> select element_at(map(1,"one", 2, "two"), 2.2); two ``` Also, when the first argument is ArrayType, the second argument should be an integer type or a smaller integral type that can be safely casted to an integer type. Currently we may do an unsafe cast. In the following case, we should fail with an error as 2.2 is not a integer index. But instead we down cast it to int currently and return a result instead. ```SQL spark-sql> select element_at(array(1,2), 1.24D); 1 ``` This PR also supports implicit cast between two MapTypes. I have followed similar logic that exists today to do implicit casts between two array types. ## How was this patch tested? Added new tests in DataFrameFunctionSuite, TypeCoercionSuite. Closes #22544 from dilipbiswal/SPARK-25522. Authored-by: Dilip Biswal <dbiswal@us.ibm.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2018-09-27 15:04:59 +08:00
yucai	9063b17f3d	[SPARK-25481][SQL][TEST] Refactor ColumnarBatchBenchmark to use main method ## What changes were proposed in this pull request? Refactor `ColumnarBatchBenchmark` to use main method. Generate benchmark result: ``` SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.vectorized.ColumnarBatchBenchmark" ``` ## How was this patch tested? manual tests Closes #22490 from yucai/SPARK-25481. Lead-authored-by: yucai <yyu1@ebay.com> Co-authored-by: Yucai Yu <yucai.yu@foxmail.com> Co-authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2018-09-26 20:40:10 -07:00
Wenchen Fan	d0990e3dfe	[SPARK-25454][SQL] add a new config for picking minimum precision for integral literals ## What changes were proposed in this pull request? https://github.com/apache/spark/pull/20023 proposed to allow precision loss during decimal operations, to reduce the possibility of overflow. This is a behavior change and is protected by the DECIMAL_OPERATIONS_ALLOW_PREC_LOSS config. However, that PR introduced another behavior change: pick a minimum precision for integral literals, which is not protected by a config. This PR adds a new config for it: `spark.sql.literal.pickMinimumPrecision`. This allows users to work around the issue in SPARK-25454, which is caused by a long-standing bug of negative scale. ## How was this patch tested? a new test Closes #22494 from cloud-fan/decimal. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2018-09-26 17:47:05 -07:00
Dongjoon Hyun	81cbcca600	[SPARK-25534][SQL] Make `SQLHelper` trait ## What changes were proposed in this pull request? Currently, Spark has 7 `withTempPath` and 6 `withSQLConf` functions. This PR aims to remove duplicated and inconsistent code and reduce them to the following meaningful implementations. withTempPath - `SQLHelper.withTempPath`: The one which was used in `SQLTestUtils`. withSQLConf - `SQLHelper.withSQLConf`: The one which was used in `PlanTest`. - `ExecutorSideSQLConfSuite.withSQLConf`: The one which doesn't throw `AnalysisException` on StaticConf changes. - `SQLTestUtils.withSQLConf`: The one which overrides intentionally to change the active session. ```scala protected override def withSQLConf(pairs: (String, String))(f: => Unit): Unit = { SparkSession.setActiveSession(spark) super.withSQLConf(pairs: _)(f) } ``` ## How was this patch tested? Pass the Jenkins with the existing tests. Closes #22548 from dongjoon-hyun/SPARK-25534. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2018-09-25 23:03:54 -07:00
Maxim Gekk	473d0d862d	[SPARK-25514][SQL] Generating pretty JSON by to_json ## What changes were proposed in this pull request? The PR introduces a new JSON option `pretty` which turns on the `DefaultPrettyPrinter` of `Jackson`'s JSON generator. The new option is useful for exploring deeply nested columns and for converting JSON columns into a more readable representation (see the added test). ## How was this patch tested? Added a round-trip test which converts a JSON string to its pretty representation via `from_json()` and `to_json()`. Closes #22534 from MaxGekk/pretty-json. Lead-authored-by: Maxim Gekk <maxim.gekk@databricks.com> Co-authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: hyukjinkwon <gurwls223@apache.org>	2018-09-26 09:52:15 +08:00
gatorsmile	8c2edf46d0	[SPARK-24324][PYTHON][FOLLOW-UP] Rename the Conf to spark.sql.legacy.execution.pandas.groupedMap.assignColumnsByName ## What changes were proposed in this pull request? Add the legacy prefix for spark.sql.execution.pandas.groupedMap.assignColumnsByPosition and rename it to spark.sql.legacy.execution.pandas.groupedMap.assignColumnsByName ## How was this patch tested? The existing tests. Closes #22540 from gatorsmile/renameAssignColumnsByPosition. Lead-authored-by: gatorsmile <gatorsmile@gmail.com> Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com> Signed-off-by: hyukjinkwon <gurwls223@apache.org>	2018-09-26 09:32:51 +08:00
yucai	04db035378	[SPARK-25486][TEST] Refactor SortBenchmark to use main method ## What changes were proposed in this pull request? Refactor SortBenchmark to use main method. Generate benchmark result: ``` SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.SortBenchmark" ``` ## How was this patch tested? manual tests Closes #22495 from yucai/SPARK-25486. Authored-by: yucai <yyu1@ebay.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2018-09-25 11:13:05 -07:00
Reynold Xin	9cbd001e24	[SPARK-23907][SQL] Revert regr_* functions entirely ## What changes were proposed in this pull request? This patch reverts entirely all the regr_* functions added in SPARK-23907. These were added by mgaido91 (and proposed by gatorsmile) to improve compatibility with other database systems, without any actual use cases. However, they are very rarely used, and in Spark there are much better ways to compute these functions, due to Spark's flexibility in exposing real programming APIs. I'm going through all the APIs added in Spark 2.4 and I think we should revert these. If there are strong enough demands and more use cases, we can add them back in the future pretty easily. ## How was this patch tested? Reverted test cases also. Closes #22541 from rxin/SPARK-23907. Authored-by: Reynold Xin <rxin@databricks.com> Signed-off-by: hyukjinkwon <gurwls223@apache.org>	2018-09-25 20:13:07 +08:00
Dilip Biswal	7d8f5b62c5	[SPARK-25519][SQL] ArrayRemove function may return incorrect result when right expression is implicitly downcasted. ## What changes were proposed in this pull request? In ArrayRemove, we currently cast the right hand side expression to match the element type of the left hand side Array. This may result in down casting and may return wrong result or questionable result. Example : ```SQL spark-sql> select array_remove(array(1,2,3), 1.23D); [2,3] ``` ```SQL spark-sql> select array_remove(array(1,2,3), 'foo'); NULL ``` We should safely coerce both left and right hand side expressions. ## How was this patch tested? Added tests in DataFrameFunctionsSuite Closes #22542 from dilipbiswal/SPARK-25519. Authored-by: Dilip Biswal <dbiswal@us.ibm.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2018-09-25 12:05:04 +08:00
Dilip Biswal	bb49661e19	[SPARK-25416][SQL] ArrayPosition function may return incorrect result when right expression is implicitly down casted ## What changes were proposed in this pull request? In ArrayPosition, we currently cast the right hand side expression to match the element type of the left hand side Array. This may result in down casting and may return wrong result or questionable result. Example : ```SQL spark-sql> select array_position(array(1), 1.34); 1 ``` ```SQL spark-sql> select array_position(array(1), 'foo'); null ``` We should safely coerce both left and right hand side expressions. ## How was this patch tested? Added tests in DataFrameFunctionsSuite Closes #22407 from dilipbiswal/SPARK-25416. Authored-by: Dilip Biswal <dbiswal@us.ibm.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2018-09-24 21:37:51 +08:00
Yuming Wang	c79072aafa	[SPARK-25478][SQL][TEST] Refactor CompressionSchemeBenchmark to use main method ## What changes were proposed in this pull request? Refactor `CompressionSchemeBenchmark` to use main method. Generate benchmark result: ```sh SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.columnar.compression.CompressionSchemeBenchmark" ``` ## How was this patch tested? manual tests Closes #22486 from wangyum/SPARK-25478. Lead-authored-by: Yuming Wang <yumwang@ebay.com> Co-authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2018-09-23 20:46:40 -07:00
Gengliang Wang	d25f425c96	[SPARK-25499][TEST] Refactor BenchmarkBase and Benchmark ## What changes were proposed in this pull request? Currently there are two classes with the same naming BenchmarkBase: 1. `org.apache.spark.util.BenchmarkBase` 2. `org.apache.spark.sql.execution.benchmark.BenchmarkBase` This is very confusing. And the benchmark object `org.apache.spark.sql.execution.benchmark.FilterPushdownBenchmark` is using the one in `org.apache.spark.util.BenchmarkBase`, while there is another class `BenchmarkBase` in the same package of it... Here I propose: 1. the package `org.apache.spark.util.BenchmarkBase` should be in test package of core module. Move it to package `org.apache.spark.benchmark` . 2. Move `org.apache.spark.util.Benchmark` to test package of core module. Move it to package `org.apache.spark.benchmark` . 3. Rename the class `org.apache.spark.sql.execution.benchmark.BenchmarkBase` as `BenchmarkWithCodegen` ## How was this patch tested? Unit test Closes #22513 from gengliangwang/refactorBenchmarkBase. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2018-09-21 22:20:55 +08:00
seancxmao	1f4ca6f5c5	[SPARK-25487][SQL][TEST] Refactor PrimitiveArrayBenchmark ## What changes were proposed in this pull request? Refactor PrimitiveArrayBenchmark to use main method and print the output as a separate file. Run the command below to generate benchmark results: ``` SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.PrimitiveArrayBenchmark" ``` ## How was this patch tested? Manual tests. Closes #22497 from seancxmao/SPARK-25487. Authored-by: seancxmao <seancxmao@gmail.com> Signed-off-by: Kazuaki Ishizaki <ishizaki@jp.ibm.com>	2018-09-21 15:04:47 +09:00
gatorsmile	5d25e15440	Revert "[SPARK-23715][SQL] the input of to/from_utc_timestamp can not have timezone ## What changes were proposed in this pull request? This reverts commit `417ad92502`. We decided to keep the current behaviors unchanged and will consider whether to deprecate these functions in 3.0. For more details, see the discussion in https://issues.apache.org/jira/browse/SPARK-23715 ## How was this patch tested? The existing tests. Closes #22505 from gatorsmile/revertSpark-23715. Authored-by: gatorsmile <gatorsmile@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2018-09-21 10:39:45 +08:00
Gengliang Wang	950ab79957	[SPARK-24777][SQL] Add write benchmark for AVRO ## What changes were proposed in this pull request? Refactor `DataSourceWriteBenchmark` and add write benchmark for AVRO. ## How was this patch tested? Build and run the benchmark. Closes #22451 from gengliangwang/avroWriteBenchmark. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2018-09-20 17:41:24 -07:00
Burak Yavuz	77e52448e7	[SPARK-25472][SS] Don't have legitimate stops of streams cause stream exceptions ## What changes were proposed in this pull request? Legitimate stops of streams may actually cause an exception to be captured by stream execution, because the job throws a SparkException regarding job cancellation during a stop. This PR makes the stop more graceful by swallowing this cancellation error. ## How was this patch tested? This is pretty hard to test. The existing tests should make sure that we're not swallowing other specific SparkExceptions. I've also run the `KafkaSourceStressForDontFailOnDataLossSuite` 100 times, and it didn't fail, whereas it used to be flaky. Closes #22478 from brkyvz/SPARK-25472. Authored-by: Burak Yavuz <brkyvz@gmail.com> Signed-off-by: Burak Yavuz <brkyvz@gmail.com>	2018-09-20 15:46:33 -07:00
Maxim Gekk	a86f84102e	[SPARK-25381][SQL] Stratified sampling by Column argument ## What changes were proposed in this pull request? In the PR, I propose to add an overloaded method for `sampleBy` which accepts the first argument of the `Column` type. This will allow to sample by any complex columns as well as sampling by multiple columns. For example: ```Scala spark.createDataFrame(Seq(("Bob", 17), ("Alice", 10), ("Nico", 8), ("Bob", 17), ("Alice", 10))).toDF("name", "age") .stat .sampleBy(struct($"name", $"age"), Map(Row("Alice", 10) -> 0.3, Row("Nico", 8) -> 1.0), 36L) .show() +-----+---+ \| name\|age\| +-----+---+ \| Nico\| 8\| \|Alice\| 10\| +-----+---+ ``` ## How was this patch tested? Added new test for sampling by multiple columns for Scala and test for Java, Python to check that `sampleBy` is able to sample by `Column` type argument. Closes #22365 from MaxGekk/sample-by-column. Authored-by: Maxim Gekk <maxim.gekk@databricks.com> Signed-off-by: hyukjinkwon <gurwls223@apache.org>	2018-09-21 01:11:40 +08:00
Dilip Biswal	67f2c6a554	[SPARK-25417][SQL] ArrayContains function may return incorrect result when right expression is implicitly down casted ## What changes were proposed in this pull request? In ArrayContains, we currently cast the right hand side expression to match the element type of the left hand side Array. This may result in down casting and may return wrong result or questionable result. Example : ```SQL spark-sql> select array_contains(array(1), 1.34); true ``` ```SQL spark-sql> select array_contains(array(1), 'foo'); null ``` We should safely coerce both left and right hand side expressions. ## How was this patch tested? Added tests in DataFrameFunctionsSuite Closes #22408 from dilipbiswal/SPARK-25417. Authored-by: Dilip Biswal <dbiswal@us.ibm.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2018-09-20 20:33:44 +08:00
hyukjinkwon	edf5cc64e4	[SPARK-25460][SS] DataSourceV2: SS sources do not respect SessionConfigSupport ## What changes were proposed in this pull request? This PR proposes to respect `SessionConfigSupport` in SS datasources as well. Currently these are only respected in batch sources: `e06da95cd9/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala (L198-L203)` `e06da95cd9/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala (L244-L249)` If a developer makes a datasource V2 that supports both structured streaming and batch jobs, batch jobs respect a specific configuration, let's say, URL to connect and fetch data (which end users might not be aware of); however, structured streaming ends up with not supporting this (and should explicitly be set into options). ## How was this patch tested? Unit tests were added. Closes #22462 from HyukjinKwon/SPARK-25460. Authored-by: hyukjinkwon <gurwls223@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2018-09-20 20:22:55 +08:00
Liang-Chi Hsieh	89671a27e7	Revert [SPARK-19355][SPARK-25352] ## What changes were proposed in this pull request? This goes to revert sequential PRs based on some discussion and comments at https://github.com/apache/spark/pull/16677#issuecomment-422650759. #22344 #22330 #22239 #16677 ## How was this patch tested? Existing tests. Closes #22481 from viirya/revert-SPARK-19355-1. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2018-09-20 20:18:31 +08:00
Yuming Wang	0e31a6f25e	[SPARK-25339][TEST] Refactor FilterPushdownBenchmark ## What changes were proposed in this pull request? Refactor `FilterPushdownBenchmark` to use the `main` method. There are now 3 ways to run this test: 1. bin/spark-submit --class org.apache.spark.sql.execution.benchmark.FilterPushdownBenchmark spark-sql_2.11-2.5.0-SNAPSHOT-tests.jar 2. build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.FilterPushdownBenchmark" 3. SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.FilterPushdownBenchmark" Methods 2 and 3 do not need to compile the `spark-sql_*-tests.jar` package, so they are mainly for developers who want to run the benchmark quickly. ## How was this patch tested? manual tests Closes #22443 from wangyum/SPARK-25339. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2018-09-20 12:34:39 +08:00
Marco Gaido	47d6e80a2e	[SPARK-25457][SQL] IntegralDivide returns data type of the operands ## What changes were proposed in this pull request? The PR proposes to return the data type of the operands as a result for the `div` operator. Before the PR, `bigint` is always returned. It introduces also a `spark.sql.legacy.integralDivide.returnBigint` config in order to let the users restore the legacy behavior. ## How was this patch tested? added UTs Closes #22465 from mgaido91/SPARK-25457. Authored-by: Marco Gaido <marcogaido91@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2018-09-20 10:23:37 +08:00
Wenchen Fan	a71f6a1750	[SPARK-25414][SS][TEST] make it clear that the numRows metrics should be counted for each scan of the source ## What changes were proposed in this pull request? For self-join/self-union, Spark will produce a physical plan which has multiple `DataSourceV2ScanExec` instances referring to the same `ReadSupport` instance. In this case, the streaming source is indeed scanned multiple times, and the `numInputRows` metrics should be counted for each scan. Actually we already have 2 test cases to verify the behavior: 1. `StreamingQuerySuite.input row calculation with same V2 source used twice in self-join` 2. `KafkaMicroBatchSourceSuiteBase.ensure stream-stream self-join generates only one offset in log and correct metrics`. However, in these 2 tests, the expected result is different, which is super confusing. It turns out that, the first test doesn't trigger exchange reuse, so the source is scanned twice. The second test triggers exchange reuse, and the source is scanned only once. This PR proposes to improve these 2 tests, to test with/without exchange reuse. ## How was this patch tested? test only change Closes #22402 from cloud-fan/bug. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2018-09-20 00:29:48 +08:00
Takeshi Yamamuro	12b1e91e6b	[SPARK-25358][SQL] MutableProjection supports fallback to an interpreted mode ## What changes were proposed in this pull request? In SPARK-23711, `UnsafeProjection` supports fallback to an interpreted mode. Therefore, this pr fixed code to support the same fallback mode in `MutableProjection` based on `CodeGeneratorWithInterpretedFallback`. ## How was this patch tested? Added tests in `CodeGeneratorWithInterpretedFallbackSuite`. Closes #22355 from maropu/SPARK-25358. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2018-09-19 19:54:49 +08:00
Imran Rashid	a6f37b0742	[SPARK-25456][SQL][TEST] Fix PythonForeachWriterSuite PythonForeachWriterSuite was failing because RowQueue now needs to have a handle on a SparkEnv with a SerializerManager, so added a mock env with a serializer manager. Also fixed a typo in the `finally` that was hiding the real exception. Tested PythonForeachWriterSuite locally, full tests via jenkins. Closes #22452 from squito/SPARK-25456. Authored-by: Imran Rashid <irashid@cloudera.com> Signed-off-by: Imran Rashid <irashid@cloudera.com>	2018-09-18 16:33:37 -05:00
Imran Rashid	8f5a5a9e5b	[PYSPARK][SQL] Updates to RowQueue Tested with updates to RowQueueSuite	2018-09-17 14:06:09 -05:00
Imran Rashid	58419b9267	[PYSPARK] Updates to pyspark broadcast	2018-09-17 14:06:09 -05:00
Marco Gaido	553af22f2c	[SPARK-16323][SQL] Add IntegralDivide expression ## What changes were proposed in this pull request? The PR takes over #14036 and it introduces a new expression `IntegralDivide` in order to avoid the several unneeded casts added previously. In order to prove the performance gain, the following benchmark has been run: ``` test("Benchmark IntegralDivide") { val r = new scala.util.Random(91) val nData = 1000000 val testDataInt = (1 to nData).map(_ => (r.nextInt(), r.nextInt())) val testDataLong = (1 to nData).map(_ => (r.nextLong(), r.nextLong())) val testDataShort = (1 to nData).map(_ => (r.nextInt().toShort, r.nextInt().toShort)) // old code val oldExprsInt = testDataInt.map(x => Cast(Divide(Cast(Literal(x._1), DoubleType), Cast(Literal(x._2), DoubleType)), LongType)) val oldExprsLong = testDataLong.map(x => Cast(Divide(Cast(Literal(x._1), DoubleType), Cast(Literal(x._2), DoubleType)), LongType)) val oldExprsShort = testDataShort.map(x => Cast(Divide(Cast(Literal(x._1), DoubleType), Cast(Literal(x._2), DoubleType)), LongType)) // new code val newExprsInt = testDataInt.map(x => IntegralDivide(x._1, x._2)) val newExprsLong = testDataLong.map(x => IntegralDivide(x._1, x._2)) val newExprsShort = testDataShort.map(x => IntegralDivide(x._1, x._2)) Seq(("Long", "old", oldExprsLong), ("Long", "new", newExprsLong), ("Int", "old", oldExprsInt), ("Int", "new", newExprsShort), ("Short", "old", oldExprsShort), ("Short", "new", oldExprsShort)).foreach { case (dt, t, ds) => val start = System.nanoTime() ds.foreach(e => e.eval(EmptyRow)) val endNoCodegen = System.nanoTime() println(s"Running $nData op with $t code on $dt (no-codegen): ${(endNoCodegen - start) / 1000000} ms") } } ``` The results on my laptop are: ``` Running 1000000 op with old code on Long (no-codegen): 600 ms Running 1000000 op with new code on Long (no-codegen): 112 ms Running 1000000 op with old code on Int (no-codegen): 560 ms Running 1000000 op with new code on Int (no-codegen): 135 ms 
Running 1000000 op with old code on Short (no-codegen): 317 ms Running 1000000 op with new code on Short (no-codegen): 153 ms ``` Showing a 2-5X improvement. The benchmark doesn't include code generation, as it is pretty hard to test the performance there: for such simple operations, most of the time is spent in the code generation/compilation process. ## How was this patch tested? added UTs Closes #22395 from mgaido91/SPARK-16323. Authored-by: Marco Gaido <marcogaido91@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2018-09-17 11:33:50 -07:00
Yuming Wang	4b9542e3a3	[SPARK-25423][SQL] Output "dataFilters" in DataSourceScanExec.metadata ## What changes were proposed in this pull request? Output `dataFilters` in `DataSourceScanExec.metadata`. ## How was this patch tested? unit tests Closes #22435 from wangyum/SPARK-25423. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2018-09-17 11:26:08 -07:00
Dongjoon Hyun	0dd61ec47d	[SPARK-25427][SQL][TEST] Add BloomFilter creation test cases ## What changes were proposed in this pull request? Spark supports BloomFilter creation for ORC files. This PR aims to add test coverages to prevent accidental regressions like [SPARK-12417](https://issues.apache.org/jira/browse/SPARK-12417). ## How was this patch tested? Pass the Jenkins with newly added test cases. Closes #22418 from dongjoon-hyun/SPARK-25427. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2018-09-17 19:33:51 +08:00
npoggi	02c2963f89	[SPARK-25439][TESTS][SQL] Fixes TPCHQuerySuite datatype of customer.c_nationkey to BIGINT according to spec ## What changes were proposed in this pull request? Fixes TPCH DDL datatype of `customer.c_nationkey` from `STRING` to `BIGINT` according to spec and `nation.nationkey` in `TPCHQuerySuite.scala`. The rest of the keys are OK. Note, this will lead to non-comparable previous results to new runs involving the customer table. ## How was this patch tested? Manual tests Author: npoggi <npmnpm@gmail.com> Closes #22430 from npoggi/SPARK-25439_Fix-TPCH-customer-c_nationkey.	2018-09-15 20:06:08 -07:00
Dongjoon Hyun	fefaa3c30d	[SPARK-25438][SQL][TEST] Fix FilterPushdownBenchmark to use the same memory assumption ## What changes were proposed in this pull request? This PR aims to fix three things in `FilterPushdownBenchmark`. 1. Use the same memory assumption. The following configurations are used in ORC and Parquet. - Memory buffer for writing - parquet.block.size (default: 128MB) - orc.stripe.size (default: 64MB) - Compression chunk size - parquet.page.size (default: 1MB) - orc.compress.size (default: 256KB) SPARK-24692 used 1MB, the default value of `parquet.page.size`, for `parquet.block.size` and `orc.stripe.size`. But, it missed to match `orc.compress.size`. So, the current benchmark shows the result from ORC with 256KB memory for compression and Parquet with 1MB. To compare correctly, we need to be consistent. 2. Dictionary encoding should not be enforced for all cases. SPARK-24206 enforced dictionary encoding for all test cases. This PR recovers the default behavior in general and enforces dictionary encoding only in case of `prepareStringDictTable`. 3. Generate test result on AWS r3.xlarge SPARK-24206 generated the result on AWS in order to reproduce and compare easily. This PR also aims to update the result on the same machine again in the same reason. Specifically, AWS r3.xlarge with Instance Store is used. ## How was this patch tested? Manual. Enable the test cases and run `FilterPushdownBenchmark` on `AWS r3.xlarge`. It takes about 4 hours 15 minutes. Closes #22427 from dongjoon-hyun/SPARK-25438. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2018-09-15 17:48:39 -07:00
Maxim Gekk	e06da95cd9	[SPARK-25425][SQL] Extra options should override session options in DataSource V2 ## What changes were proposed in this pull request? In the PR, I propose overriding session options by extra options in DataSource V2. Extra options are more specific and set via `.option()`, and should overwrite more generic session options. Entries from the second map overwrite entries with the same key from the first map, for example: ```Scala scala> Map("option" -> false) ++ Map("option" -> true) res0: scala.collection.immutable.Map[String,Boolean] = Map(option -> true) ``` ## How was this patch tested? Added a test for checking which option is propagated to a data source in `load()`. Closes #22413 from MaxGekk/session-options. Lead-authored-by: Maxim Gekk <maxim.gekk@databricks.com> Co-authored-by: Dongjoon Hyun <dongjoon@apache.org> Co-authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2018-09-15 17:24:11 -07:00
gatorsmile	bb2f069cf2	[SPARK-25436] Bump master branch version to 2.5.0-SNAPSHOT ## What changes were proposed in this pull request? In the dev list, we can still discuss whether the next version is 2.5.0 or 3.0.0. Let us first bump the master branch version to `2.5.0-SNAPSHOT`. ## How was this patch tested? N/A Closes #22426 from gatorsmile/bumpVersionMaster. Authored-by: gatorsmile <gatorsmile@gmail.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2018-09-15 16:24:02 -07:00
Takeshi Yamamuro	5ebef33c85	[SPARK-25426][SQL] Remove the duplicate fallback logic in UnsafeProjection ## What changes were proposed in this pull request? This pr removed the duplicate fallback logic in `UnsafeProjection`. This pr comes from #22355. ## How was this patch tested? Added tests in `CodeGeneratorWithInterpretedFallbackSuite`. Closes #22417 from maropu/SPARK-25426. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2018-09-15 16:20:45 -07:00
Kazuaki Ishizaki	f60cd7cc3c	[SPARK-25338][TEST] Ensure to call super.beforeAll() and super.afterAll() in test cases ## What changes were proposed in this pull request? This PR ensures to call `super.afterAll()` in `override afterAll()` method for test suites. * Some suites did not call `super.afterAll()` * Some suites may call `super.afterAll()` only under certain condition * Others never call `super.afterAll()`. This PR also ensures to call `super.beforeAll()` in `override beforeAll()` for test suites. ## How was this patch tested? Existing UTs Closes #22337 from kiszk/SPARK-25338. Authored-by: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2018-09-13 11:34:22 -07:00
Michael Allman	a7e5aa6cd4	[SPARK-25406][SQL] For ParquetSchemaPruningSuite.scala, move calls to `withSQLConf` inside calls to `test` (Link to Jira: https://issues.apache.org/jira/browse/SPARK-25406) ## What changes were proposed in this pull request? The current use of `withSQLConf` in `ParquetSchemaPruningSuite.scala` is incorrect. The desired configuration settings are not being set when running the test cases. This PR fixes that defective usage and addresses the test failures that were previously masked by that defect. ## How was this patch tested? I added code to relevant test cases to print the expected SQL configuration settings and found that the settings were not being set as expected. When I changed the order of calls to `test` and `withSQLConf` I found that the configuration settings were being set as expected. Closes #22394 from mallman/spark-25406-fix_broken_schema_pruning_tests. Authored-by: Michael Allman <msa@allman.ms> Signed-off-by: DB Tsai <d_tsai@apple.com>	2018-09-13 17:08:45 +00:00
Liang-Chi Hsieh	5b761c537a	[SPARK-25352][SQL][FOLLOWUP] Add helper method and address style issue ## What changes were proposed in this pull request? This follow-up patch addresses [the review comment](https://github.com/apache/spark/pull/22344/files#r217070658) by adding a helper method to simplify code and fixing style issue. ## How was this patch tested? Existing unit tests. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #22409 from viirya/SPARK-25352-followup.	2018-09-13 14:21:00 +02:00
LantaoJin	6dc5921e66	[SPARK-25357][SQL] Add metadata to SparkPlanInfo to dump more information like file path to event log ## What changes were proposed in this pull request? The metadata field was removed from SparkPlanInfo in #18600. Correspondingly, much metadata was also removed from the event SparkListenerSQLExecutionStart in the Spark event log. If we want to analyze the event log to get all input paths, we cannot get them. The simpleString of the SparkPlanInfo JSON displays only 100 characters, so it does not help. Before 2.3, the fragment of SparkListenerSQLExecutionStart in the event log looked like below (it contains the metadata field, which has the intact information): >{"Event":"org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionStart", Location: InMemoryFileIndex[hdfs://cluster1/sys/edw/test1/test2/test3/test4..., "metadata": {"Location": "InMemoryFileIndex[hdfs://cluster1/sys/edw/test1/test2/test3/test4/test5/snapshot/dt=20180904]","ReadSchema":"struct<snpsht_start_dt:date,snpsht_end_dt:date,am_ntlogin_name:string,am_first_name:string,am_last_name:string,isg_name:string,CRE_DATE:date,CRE_USER:string,UPD_DATE:timestamp,UPD_USER:string>"} After #18600, the metadata field was removed. >{"Event":"org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionStart", Location: InMemoryFileIndex[hdfs://cluster1/sys/edw/test1/test2/test3/test4..., So I add this field back to the SparkPlanInfo class. Then it will log the metadata to the event log. Intact information in the event log is very useful for offline job analysis. ## How was this patch tested? Unit test Closes #22353 from LantaoJin/SPARK-25357. Authored-by: LantaoJin <jinlantao@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2018-09-13 09:57:34 +08:00
Maxim Gekk	083c944767	[SPARK-25387][SQL] Fix for NPE caused by bad CSV input ## What changes were proposed in this pull request? The PR fixes NPE in `UnivocityParser` caused by malformed CSV input. In some cases, `uniVocity` parser can return `null` for bad input. In the PR, I propose to check result of parsing and not propagate NPE to upper layers. ## How was this patch tested? I added a test which reproduce the issue and tested by `CSVSuite`. Closes #22374 from MaxGekk/npe-on-bad-csv. Lead-authored-by: Maxim Gekk <max.gekk@gmail.com> Co-authored-by: Maxim Gekk <maxim.gekk@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2018-09-13 09:51:49 +08:00
Liang-Chi Hsieh	3030b82c89	[SPARK-25363][SQL] Fix schema pruning in where clause by ignoring unnecessary root fields ## What changes were proposed in this pull request? Schema pruning doesn't work if nested column is used in where clause. For example, ``` sql("select name.first from contacts where name.first = 'David'") == Physical Plan == (1) Project [name#19.first AS first#40] +- (1) Filter (isnotnull(name#19) && (name#19.first = David)) +- *(1) FileScan parquet [name#19] Batched: false, Format: Parquet, PartitionFilters: [], PushedFilters: [IsNotNull(name)], ReadSchema: struct<name:struct<first:string,middle:string,last:string>> ``` In above query plan, the scan node reads the entire schema of `name` column. This issue is reported by: https://github.com/apache/spark/pull/21320#issuecomment-419290197 The cause is that we infer a root field from expression `IsNotNull(name)`. However, for such expression, we don't really use the nested fields of this root field, so we can ignore the unnecessary nested fields. ## How was this patch tested? Unit tests. Closes #22357 from viirya/SPARK-25363. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: DB Tsai <d_tsai@apple.com>	2018-09-12 17:43:40 +00:00
Liang-Chi Hsieh	2f422398b5	[SPARK-25352][SQL] Perform ordered global limit when limit number is bigger than topKSortFallbackThreshold ## What changes were proposed in this pull request? We have optimization on global limit to evenly distribute limit rows across all partitions. This optimization doesn't work for ordered results. For a query ending with sort + limit, in most cases it is performed by `TakeOrderedAndProjectExec`. But if limit number is bigger than `SQLConf.TOP_K_SORT_FALLBACK_THRESHOLD`, global limit will be used. At this moment, we need to do ordered global limit. ## How was this patch tested? Unit tests. Closes #22344 from viirya/SPARK-25352. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2018-09-12 22:54:05 +08:00
gatorsmile	79cc59718f	[SPARK-25402][SQL] Null handling in BooleanSimplification ## What changes were proposed in this pull request? This PR is to fix the null handling in BooleanSimplification. In the rule BooleanSimplification, there are two cases that do not properly handle null values. The optimization is not right if either side is null. This PR is to fix them. ## How was this patch tested? Added test cases Closes #22390 from gatorsmile/fixBooleanSimplification. Authored-by: gatorsmile <gatorsmile@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2018-09-12 21:11:22 +08:00
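The commit message above does not name the two rewrites that were fixed, so the following is only a minimal sketch of why SQL's three-valued logic makes some boolean simplifications unsound. It models a SQL boolean as `Option[Boolean]` (with `None` standing for NULL); the object and function names are hypothetical and are not part of Spark. Under this model, an expression such as `a AND NOT a` (chosen here purely as an illustration) evaluates to NULL rather than false when `a` is NULL, so rewriting it to a constant `false` would change results.

```scala
object ThreeValuedLogic {
  // A SQL boolean: Some(true), Some(false), or None for NULL.
  type SqlBool = Option[Boolean]

  // NOT NULL is NULL; otherwise negate the value.
  def not(a: SqlBool): SqlBool = a.map(!_)

  // AND is false if either side is false, true if both are true, else NULL.
  def and(a: SqlBool, b: SqlBool): SqlBool = (a, b) match {
    case (Some(false), _) | (_, Some(false)) => Some(false)
    case (Some(true), Some(true))            => Some(true)
    case _                                   => None
  }

  // OR is true if either side is true, false if both are false, else NULL.
  def or(a: SqlBool, b: SqlBool): SqlBool = (a, b) match {
    case (Some(true), _) | (_, Some(true)) => Some(true)
    case (Some(false), Some(false))        => Some(false)
    case _                                 => None
  }
}
```

With a non-null input the usual identities hold, but with `None` both `and(a, not(a))` and `or(a, not(a))` stay `None`, which is the kind of case a simplification rule has to guard against.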
Mukul Murthy	9f5c5b4cca	[SPARK-25399][SS] Continuous processing state should not affect microbatch execution jobs ## What changes were proposed in this pull request? The leftover state from running a continuous processing streaming job should not affect later microbatch execution jobs. If a continuous processing job runs and the same thread gets reused for a microbatch execution job in the same environment, the microbatch job could get wrong answers because it can attempt to load the wrong version of the state. ## How was this patch tested? New and existing unit tests Closes #22386 from mukulmurthy/25399-streamthread. Authored-by: Mukul Murthy <mukul.murthy@gmail.com> Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>	2018-09-11 15:53:15 -07:00
Sean Owen	cfbdd6a1f5	[SPARK-25398] Minor bugs from comparing unrelated types ## What changes were proposed in this pull request? Correct some comparisons between unrelated types to what they seem to have been trying to do. ## How was this patch tested? Existing tests. Closes #22384 from srowen/SPARK-25398. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2018-09-11 14:46:03 -05:00
Mario Molina	c9cb393dc4	[SPARK-17916][SPARK-25241][SQL][FOLLOW-UP] Fix empty string being parsed as null when nullValue is set. ## What changes were proposed in this pull request? In the PR, I propose new CSV option `emptyValue` and an update in the SQL Migration Guide which describes how to revert previous behavior when empty strings were not written at all. Since Spark 2.4, empty strings are saved as `""` to distinguish them from saved `null`s. Closes #22234 Closes #22367 ## How was this patch tested? It was tested by `CSVSuite` and new tests added in the PR #22234 Closes #22389 from MaxGekk/csv-empty-value-master. Lead-authored-by: Mario Molina <mmolimar@gmail.com> Co-authored-by: Maxim Gekk <maxim.gekk@databricks.com> Signed-off-by: hyukjinkwon <gurwls223@apache.org>	2018-09-11 20:47:14 +08:00
Wenchen Fan	0e680dcf1e	[SPARK-25278][SQL][FOLLOWUP] remove the hack in ProgressReporter ## What changes were proposed in this pull request? It turns out it's a bug that a `DataSourceV2ScanExec` instance may be referred to in the execution plan multiple times. This bug is fixed by https://github.com/apache/spark/pull/22284 and now we have corrected SQL metrics for batch queries. Thus we don't need the hack in `ProgressReporter` anymore, which fixes the same metrics problem for streaming queries. ## How was this patch tested? existing tests Closes #22380 from cloud-fan/followup. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2018-09-11 19:38:45 +08:00
Marco Gaido	0736e72a66	[SPARK-25371][SQL] struct() should allow being called with 0 args ## What changes were proposed in this pull request? SPARK-21281 introduced a check for the inputs of `CreateStructLike` to be non-empty. This means that `struct()`, which was previously considered valid, now throws an Exception. This behavior change was introduced in 2.3.0. The change may break users' application on upgrade and it causes `VectorAssembler` to fail when an empty `inputCols` is defined. The PR removes the added check making `struct()` valid again. ## How was this patch tested? added UT Closes #22373 from mgaido91/SPARK-25371. Authored-by: Marco Gaido <marcogaido91@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2018-09-11 14:16:56 +08:00
Marco Gaido	12e3e9f17d	[SPARK-25278][SQL] Avoid duplicated Exec nodes when the same logical plan appears in the query ## What changes were proposed in this pull request? In the Planner, we collect the placeholders which need to be substituted in the query execution plan and, once we plan them, we substitute each placeholder with the effective plan. In this second phase, we rely on the `==` comparison, i.e. the `equals` method. This means that if two placeholder plans - which are different instances - have the same attributes (so that they are equal, according to the `equals` method) they are both substituted with their corresponding new physical plans. So, in such a situation, the first time we substitute both of them with the first of the 2 newly generated plans and the second time we substitute nothing. This is usually of no harm for the execution of the query itself, as the 2 plans are identical. But since they are now the same instance, the local variables are shared (which is unexpected). This causes issues for the metrics collected, as the same node is executed 2 times, so the metrics are accumulated 2 times, wrongly. The PR proposes to use the `eq` method in checking which placeholder needs to be substituted; thus in the previous situation, both of the two different physical nodes which are created (one for each time the logical plan appears in the query plan) are used and the metrics are collected properly for each of them. ## How was this patch tested? added UT Closes #22284 from mgaido91/SPARK-25278. Authored-by: Marco Gaido <marcogaido91@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2018-09-10 19:41:51 +08:00
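The distinction between `==` and `eq` that the commit above relies on can be sketched in plain Scala. This is an illustration only: `Placeholder` is a hypothetical case class standing in for a planner placeholder node, not a Spark type. Two separately constructed case-class instances with the same fields are equal under `==` (structural equality via `equals`) but are not the same reference under `eq`.

```scala
object EqVersusEquals {
  // Hypothetical stand-in for a placeholder plan node; not a Spark class.
  final case class Placeholder(name: String)

  // Count nodes that match the target by structural equality (`equals`).
  def matchesByEquals(target: Placeholder, nodes: Seq[Placeholder]): Int =
    nodes.count(_ == target)

  // Count nodes that match the target by reference identity (`eq`).
  def matchesByReference(target: Placeholder, nodes: Seq[Placeholder]): Int =
    nodes.count(_ eq target)
}
```

Given two distinct instances `Placeholder("scan")`, a substitution keyed on `==` would treat both as the target, while one keyed on `eq` touches only the exact instance, which is the behaviour the commit message describes wanting.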
caoxuewen	e7853dc103	[SPARK-24999][SQL] Reduce unnecessary 'new' memory operations ## What changes were proposed in this pull request? This PR improves the code generated by CodeGen for the fast hash map: there is no need to allocate a new block of memory for every new entry, because the UnsafeRow's memory can be reused. ## How was this patch tested? The existing test cases. Closes #21968 from heary-cao/updateNewMemory. Authored-by: caoxuewen <cao.xuewen@zte.com.cn> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2018-09-10 15:11:14 +08:00
Yuming Wang	f8b4d5aafd	[SPARK-25313][SQL][FOLLOW-UP] Fix InsertIntoHiveDirCommand output schema in Parquet issue ## What changes were proposed in this pull request? How to reproduce: ```scala spark.sql("CREATE TABLE tbl(id long)") spark.sql("INSERT OVERWRITE TABLE tbl VALUES 4") spark.sql("CREATE VIEW view1 AS SELECT id FROM tbl") spark.sql(s"INSERT OVERWRITE LOCAL DIRECTORY '/tmp/spark/parquet' " + "STORED AS PARQUET SELECT ID FROM view1") spark.read.parquet("/tmp/spark/parquet").schema scala> spark.read.parquet("/tmp/spark/parquet").schema res10: org.apache.spark.sql.types.StructType = StructType(StructField(id,LongType,true)) ``` The schema should be `StructType(StructField(ID,LongType,true))` as we `SELECT ID FROM view1`. This PR fixes this issue. ## How was this patch tested? unit tests Closes #22359 from wangyum/SPARK-25313-FOLLOW-UP. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2018-09-10 13:47:19 +08:00
seancxmao	a0aed475c5	[SPARK-25175][SQL] Field resolution should fail if there is ambiguity for ORC native data source table persisted in metastore ## What changes were proposed in this pull request? Apache Spark doesn't create a Hive table with duplicated fields in either case-sensitive or case-insensitive mode. However, if Spark first creates ORC files in case-sensitive mode and then creates a Hive table on that location, the table is created. In this situation, field resolution should fail in case-insensitive mode. Otherwise, we don't know which columns will be returned or filtered. Previously, SPARK-25132 fixed the same issue in Parquet. Here is a simple example: ``` val data = spark.range(5).selectExpr("id as a", "id * 2 as A") spark.conf.set("spark.sql.caseSensitive", true) data.write.format("orc").mode("overwrite").save("/user/hive/warehouse/orc_data") sql("CREATE TABLE orc_data_source (A LONG) USING orc LOCATION '/user/hive/warehouse/orc_data'") spark.conf.set("spark.sql.caseSensitive", false) sql("select A from orc_data_source").show +---+ \| A\| +---+ \| 3\| \| 2\| \| 4\| \| 1\| \| 0\| +---+ ``` See #22148 for more details about parquet data source reader. ## How was this patch tested? Unit tests added. Closes #22262 from seancxmao/SPARK-25175. Authored-by: seancxmao <seancxmao@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2018-09-09 19:22:47 -07:00
Yuming Wang	77c996403d	[SPARK-25368][SQL] Incorrect predicate pushdown returns wrong result ## What changes were proposed in this pull request? How to reproduce: ```scala val df1 = spark.createDataFrame(Seq( (1, 1) )).toDF("a", "b").withColumn("c", lit(null).cast("int")) val df2 = df1.union(df1).withColumn("d", spark_partition_id).filter($"c".isNotNull) df2.show +---+---+----+---+ \| a\| b\| c\| d\| +---+---+----+---+ \| 1\| 1\|null\| 0\| \| 1\| 1\|null\| 1\| +---+---+----+---+ ``` `filter($"c".isNotNull)` was transformed to `(null <=> c#10)` before https://github.com/apache/spark/pull/19201, but it is transformed to `(c#10 = null)` since https://github.com/apache/spark/pull/20155. This PR reverts it to `(null <=> c#10)` to fix this issue. ## How was this patch tested? unit tests Closes #22368 from wangyum/SPARK-25368. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2018-09-09 09:07:31 -07:00
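The commit above turns on the difference between SQL's `=` and the null-safe `<=>` operator. As a rough sketch of those two semantics only (not of Spark's optimizer rules), nullable integers can be modelled as `Option[Int]`, with `None` standing for NULL. The object and function names below are hypothetical.

```scala
object NullSafeEquality {
  // SQL `=`: the result is NULL (None) whenever either operand is NULL.
  def sqlEquals(a: Option[Int], b: Option[Int]): Option[Boolean] =
    for (x <- a; y <- b) yield x == y

  // SQL `<=>`: never NULL; two NULLs compare as equal.
  def nullSafeEquals(a: Option[Int], b: Option[Int]): Boolean = a == b
}
```

Because a filter keeps only rows whose predicate is true, a predicate of the form `c = null` can never keep a row, whereas `null <=> c` is true exactly when `c` is NULL. The two forms are therefore not interchangeable when a predicate is rewritten.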
gatorsmile	0b9ccd55c2	Revert [SPARK-10399] [SPARK-23879] [SPARK-23762] [SPARK-25317] ## What changes were proposed in this pull request? When running TPC-DS benchmarks on 2.4 release, npoggi and winglungngai saw more than 10% performance regression on the following queries: q67, q24a and q24b. After we applying the PR https://github.com/apache/spark/pull/22338, the performance regression still exists. If we revert the changes in https://github.com/apache/spark/pull/19222, npoggi and winglungngai found the performance regression was resolved. Thus, this PR is to revert the related changes for unblocking the 2.4 release. In the future release, we still can continue the investigation and find out the root cause of the regression. ## How was this patch tested? The existing test cases Closes #22361 from gatorsmile/revertMemoryBlock. Authored-by: gatorsmile <gatorsmile@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2018-09-09 21:25:19 +08:00
ptkool	78981efc2c	[SPARK-20636] Add new optimization rule to transpose adjacent Window expressions. ## What changes were proposed in this pull request? Add new optimization rule to eliminate unnecessary shuffling by flipping adjacent Window expressions. ## How was this patch tested? Tested with unit tests, integration tests, and manual tests. Closes #17899 from ptkool/adjacent_window_optimization. Authored-by: ptkool <michael.styles@shopify.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2018-09-08 11:36:55 -07:00
Dilip Biswal	6d7bc5af45	[SPARK-25267][SQL][TEST] Disable ConvertToLocalRelation in the test cases of sql/core and sql/hive ## What changes were proposed in this pull request? In SharedSparkSession and TestHive, we need to disable the rule ConvertToLocalRelation for better test case coverage. ## How was this patch tested? Identify the failures after excluding "ConvertToLocalRelation" rule. Closes #22270 from dilipbiswal/SPARK-25267-final. Authored-by: Dilip Biswal <dbiswal@us.ibm.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2018-09-06 23:35:02 -07:00
dujunling	ed249db9c4	[SPARK-25237][SQL] Remove updateBytesReadWithFileSize in FileScanRDD ## What changes were proposed in this pull request? This pr removed the method `updateBytesReadWithFileSize` in `FileScanRDD` because it computes input metrics by file size supported in Hadoop 2.5 and earlier. The current Spark does not support the versions, so it causes wrong input metric numbers. This is rework from #22232. Closes #22232 ## How was this patch tested? Added tests in `FileBasedDataSourceSuite`. Closes #22324 from maropu/pr22232-2. Lead-authored-by: dujunling <dujunling@huawei.com> Co-authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2018-09-06 21:44:46 -07:00
xuejianbest	f5817d8bb3	[SPARK-25108][SQL] Fix the show method to display the wide character alignment problem This is not a perfect solution. It is designed to minimize complexity on the basis of solving problems. It is effective for English, Chinese characters, Japanese, Korean and so on. ```scala before: +---+---------------------------+-------------+ \|id \|中国 \|s2 \| +---+---------------------------+-------------+ \|1 \|ab \|[a] \| \|2 \|null \|[中国, abc] \| \|3 \|ab1 \|[hello world]\| \|4 \|か行きゃ(kya) きゅ(kyu) きょ(kyo) \|[“中国] \| \|5 \|中国（你好）a \|[“中（国）, 312] \| \|6 \|中国山(东)服务区 \|[“中(国）] \| \|7 \|中国山东服务区 \|[中(国)] \| \|8 \| \|[中国] \| +---+---------------------------+-------------+ after: +---+-----------------------------------+----------------+ \|id \|中国 \|s2 \| +---+-----------------------------------+----------------+ \|1 \|ab \|[a] \| \|2 \|null \|[中国, abc] \| \|3 \|ab1 \|[hello world] \| \|4 \|か行きゃ(kya) きゅ(kyu) きょ(kyo) \|[“中国] \| \|5 \|中国（你好）a \|[“中（国）, 312]\| \|6 \|中国山(东)服务区 \|[“中(国）] \| \|7 \|中国山东服务区 \|[中(国)] \| \|8 \| \|[中国] \| +---+-----------------------------------+----------------+ ``` ## What changes were proposed in this pull request? When there are wide characters such as Chinese characters or Japanese characters in the data, the show method has a alignment problem. Try to fix this problem. ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) ![image](https://user-images.githubusercontent.com/13044869/44250564-69f6b400-a227-11e8-88b2-6cf6960377ff.png) Please review http://spark.apache.org/contributing.html before opening a pull request. Closes #22048 from xuejianbest/master. Authored-by: xuejianbest <384329882@qq.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2018-09-06 07:17:37 -07:00
Maxim Gekk	d749d034a8	[SPARK-25252][SQL] Support arrays of any types by to_json ## What changes were proposed in this pull request? In the PR, I propose to extended `to_json` and support any types as element types of input arrays. It should allow converting arrays of primitive types and arrays of arrays. For example: ``` select to_json(array('1','2','3')) > ["1","2","3"] select to_json(array(array(1,2,3),array(4))) > [[1,2,3],[4]] ``` ## How was this patch tested? Added a couple sql tests for arrays of primitive type and of arrays. Also I added round trip test `from_json` -> `to_json`. Closes #22226 from MaxGekk/to_json-array. Authored-by: Maxim Gekk <maxim.gekk@databricks.com> Signed-off-by: hyukjinkwon <gurwls223@apache.org>	2018-09-06 12:35:59 +08:00
Gengliang Wang	3d6b68b030	[SPARK-25313][SQL] Fix regression in FileFormatWriter output names ## What changes were proposed in this pull request? Let's see the follow example: ``` val location = "/tmp/t" val df = spark.range(10).toDF("id") df.write.format("parquet").saveAsTable("tbl") spark.sql("CREATE VIEW view1 AS SELECT id FROM tbl") spark.sql(s"CREATE TABLE tbl2(ID long) USING parquet location $location") spark.sql("INSERT OVERWRITE TABLE tbl2 SELECT ID FROM view1") println(spark.read.parquet(location).schema) spark.table("tbl2").show() ``` The output column name in schema will be `id` instead of `ID`, thus the last query shows nothing from `tbl2`. By enabling the debug message we can see that the output naming is changed from `ID` to `id`, and then the `outputColumns` in `InsertIntoHadoopFsRelationCommand` is changed in `RemoveRedundantAliases`. ![wechatimg5](https://user-images.githubusercontent.com/1097932/44947871-6299f200-ae46-11e8-9c96-d45fe368206c.jpeg) ![wechatimg4](https://user-images.githubusercontent.com/1097932/44947866-56ae3000-ae46-11e8-8923-8b3bbe060075.jpeg) To guarantee correctness, we should change the output columns from `Seq[Attribute]` to `Seq[String]` to avoid its names being replaced by optimizer. I will fix project elimination related rules in https://github.com/apache/spark/pull/22311 after this one. ## How was this patch tested? Unit test. Closes #22320 from gengliangwang/fixOutputSchema. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2018-09-06 10:37:52 +08:00
Wenchen Fan	71bd796517	[SPARK-23243][CORE] Fix RDD.repartition() data correctness issue ## What changes were proposed in this pull request? An alternative fix for https://github.com/apache/spark/pull/21698 When Spark rerun tasks for an RDD, there are 3 different behaviors: 1. determinate. Always return the same result with same order when rerun. 2. unordered. Returns same data set in random order when rerun. 3. indeterminate. Returns different result when rerun. Normally Spark doesn't need to care about it. Spark runs stages one by one, when a task is failed, just rerun it. Although the rerun task may return a different result, users will not be surprised. However, Spark may rerun a finished stage when seeing fetch failures. When this happens, Spark needs to rerun all the tasks of all the succeeding stages if the RDD output is indeterminate, because the input of the succeeding stages has been changed. If the RDD output is determinate, we only need to rerun the failed tasks of the succeeding stages, because the input doesn't change. If the RDD output is unordered, it's same as determinate, because shuffle partitioner is always deterministic(round-robin partitioner is not a shuffle partitioner that extends `org.apache.spark.Partitioner`), so the reducers will still get the same input data set. This PR fixed the failure handling for `repartition`, to avoid correctness issues. For `repartition`, it applies a stateful map function to generate a round-robin id, which is order sensitive and makes the RDD's output indeterminate. When the stage contains `repartition` reruns, we must also rerun all the tasks of all the succeeding stages. future improvement: 1. Currently we can't rollback and rerun a shuffle map stage, and just fail. We should fix it later. https://issues.apache.org/jira/browse/SPARK-25341 2. Currently we can't rollback and rerun a result stage, and just fail. We should fix it later. https://issues.apache.org/jira/browse/SPARK-25342 3. 
We should provide public API to allow users to tag the random level of the RDD's computing function. ## How is this pull request tested? a new test case Closes #22112 from cloud-fan/repartition. Lead-authored-by: Wenchen Fan <wenchen@databricks.com> Co-authored-by: Xingbo Jiang <xingbo.jiang@databricks.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com>	2018-09-05 15:36:34 -07:00
Dongjoon Hyun	c66eef8440	[SPARK-25306][SQL][FOLLOWUP] Change `test` to `ignore` in FilterPushdownBenchmark ## What changes were proposed in this pull request? This is a follow-up of #22313 and aim to ignore the micro benchmark test which takes over 2 minutes in Jenkins. - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.6/4939/consoleFull ## How was this patch tested? The test case should be ignored in Jenkins. ``` [info] FilterPushdownBenchmark: ... [info] - Pushdown benchmark with many filters !!! IGNORED !!! ``` Closes #22336 from dongjoon-hyun/SPARK-25306-2. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Xiao Li <gatorsmile@gmail.com>	2018-09-05 11:29:15 -07:00
Wenchen Fan	341b55a589	[SPARK-25044][SQL][FOLLOWUP] add back UserDefinedFunction.inputTypes ## What changes were proposed in this pull request? This is a followup of https://github.com/apache/spark/pull/22259 . Scala case class has a wide surface: apply, unapply, accessors, copy, etc. In https://github.com/apache/spark/pull/22259 , we change the type of `UserDefinedFunction.inputTypes` from `Option[Seq[DataType]]` to `Option[Seq[Schema]]`. This breaks backward compatibility. This PR changes the type back, and use a `var` to keep the new nullable info. ## How was this patch tested? N/A Closes #22319 from cloud-fan/revert. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2018-09-05 21:13:16 +08:00
Shixiong Zhu	2119e518d3	[SPARK-25336][SS]Revert SPARK-24863 and SPARK-24748 ## What changes were proposed in this pull request? Revert SPARK-24863 (#21819) and SPARK-24748 (#21721) as per discussion in #21721. We will revisit them when the data source v2 APIs are out. ## How was this patch tested? Jenkins Closes #22334 from zsxwing/revert-SPARK-24863-SPARK-24748. Authored-by: Shixiong Zhu <zsxwing@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2018-09-05 13:39:34 +08:00
Dongjoon Hyun	103f513231	[SPARK-25306][SQL] Avoid skewed filter trees to speed up `createFilter` in ORC ## What changes were proposed in this pull request? In both ORC data sources, `createFilter` function has exponential time complexity due to its skewed filter tree generation. This PR aims to improve it by using new `buildTree` function. REPRODUCE ```scala // Create and read 1 row table with 1000 columns sql("set spark.sql.orc.filterPushdown=true") val selectExpr = (1 to 1000).map(i => s"id c$i") spark.range(1).selectExpr(selectExpr: _*).write.mode("overwrite").orc("/tmp/orc") print(s"With 0 filters, ") spark.time(spark.read.orc("/tmp/orc").count) // Increase the number of filters (20 to 30).foreach { width => val whereExpr = (1 to width).map(i => s"c$i is not null").mkString(" and ") print(s"With $width filters, ") spark.time(spark.read.orc("/tmp/orc").where(whereExpr).count) } ``` RESULT ```scala With 0 filters, Time taken: 653 ms With 20 filters, Time taken: 962 ms With 21 filters, Time taken: 1282 ms With 22 filters, Time taken: 1982 ms With 23 filters, Time taken: 3855 ms With 24 filters, Time taken: 6719 ms With 25 filters, Time taken: 12669 ms With 26 filters, Time taken: 25032 ms With 27 filters, Time taken: 49585 ms With 28 filters, Time taken: 98980 ms // over 1 min 38 seconds With 29 filters, Time taken: 198368 ms // over 3 mins With 30 filters, Time taken: 393744 ms // over 6 mins ``` AFTER THIS PR ```scala With 0 filters, Time taken: 774 ms With 20 filters, Time taken: 601 ms With 21 filters, Time taken: 399 ms With 22 filters, Time taken: 679 ms With 23 filters, Time taken: 363 ms With 24 filters, Time taken: 342 ms With 25 filters, Time taken: 336 ms With 26 filters, Time taken: 352 ms With 27 filters, Time taken: 322 ms With 28 filters, Time taken: 302 ms With 29 filters, Time taken: 307 ms With 30 filters, Time taken: 301 ms ``` ## How was this patch tested? Pass the Jenkins with newly added test cases. Closes #22313 from dongjoon-hyun/SPARK-25306. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2018-09-05 10:24:13 +08:00
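The commit above mentions replacing a skewed filter tree with one produced by a `buildTree` function, but the message does not show that function. The following is a minimal sketch of the general idea under stated assumptions: the `Filter`, `Leaf`, and `And` types are hypothetical stand-ins rather than Spark's classes, and `balanced` is one plausible way to combine a list of conjuncts, not the code from the PR. It contrasts a left-deep (skewed) conjunction with one built by repeatedly splitting the list in half.

```scala
object FilterTreeShapes {
  // Hypothetical filter ADT for illustration; not Spark's Filter classes.
  sealed trait Filter
  final case class Leaf(name: String) extends Filter
  final case class And(left: Filter, right: Filter) extends Filter

  // Left-deep tree: And(And(And(f1, f2), f3), f4) ...
  def skewed(filters: Seq[Filter]): Filter = {
    require(filters.nonEmpty, "need at least one filter")
    filters.tail.foldLeft(filters.head)((acc, f) => And(acc, f))
  }

  // Balanced tree: split the conjuncts in half and recurse on each side.
  def balanced(filters: Seq[Filter]): Filter = {
    require(filters.nonEmpty, "need at least one filter")
    filters match {
      case Seq(single) => single
      case _ =>
        val (left, right) = filters.splitAt(filters.length / 2)
        And(balanced(left), balanced(right))
    }
  }

  // Number of nodes on the longest root-to-leaf path.
  def depth(f: Filter): Int = f match {
    case Leaf(_)    => 1
    case And(l, r)  => 1 + math.max(depth(l), depth(r))
  }
}
```

Both shapes express the same conjunction; only the nesting differs. The depth of the skewed tree grows linearly with the number of filters, while the balanced one grows logarithmically, which limits how deeply any recursive traversal has to nest.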
Xingbo Jiang	3aa60282cc	[SPARK-19355][SQL][FOLLOWUP][TEST] Properly recycle SparkSession on TakeOrderedAndProjectSuite finishes ## What changes were proposed in this pull request? Previously in `TakeOrderedAndProjectSuite` the SparkSession will not get recycled when the test suite finishes. ## How was this patch tested? N/A Closes #22330 from jiangxb1987/SPARK-19355. Authored-by: Xingbo Jiang <xingbo.jiang@databricks.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com>	2018-09-04 09:44:42 -07:00
Xiao Li	7fc8881b0f	[SPARK-25296][SQL][TEST] Create ExplainSuite ## What changes were proposed in this pull request? Move the output verification of Explain test cases to a new suite ExplainSuite. ## How was this patch tested? N/A Closes #22300 from gatorsmile/test3200. Authored-by: Xiao Li <gatorsmile@gmail.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com>	2018-08-31 08:47:20 -07:00


5293 commits