ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
hyukjinkwon	2877f1a522	[SPARK-16351][SQL] Avoid per-record type dispatch in JSON when writing ## What changes were proposed in this pull request? Currently, `JacksonGenerator.apply` is doing type-based dispatch for each row to write appropriate values. It might not have to be done like this because the schema is already kept. So, appropriate writers can be created first according to the schema once, and then apply them to each row. This approach is similar with `CatalystWriteSupport`. This PR corrects `JacksonGenerator` so that it creates all writers for the schema once and then applies them to each row rather than type dispatching for every row. Benchmark was proceeded with the codes below: ```scala test("Benchmark for JSON writer") { val N = 500 << 8 val row = """{"struct":{"field1": true, "field2": 92233720368547758070}, "structWithArrayFields":{"field1":[4, 5, 6], "field2":["str1", "str2"]}, "arrayOfString":["str1", "str2"], "arrayOfInteger":[1, 2147483647, -2147483648], "arrayOfLong":[21474836470, 9223372036854775807, -9223372036854775808], "arrayOfBigInteger":[922337203685477580700, -922337203685477580800], "arrayOfDouble":[1.2, 1.7976931348623157E308, 4.9E-324, 2.2250738585072014E-308], "arrayOfBoolean":[true, false, true], "arrayOfNull":[null, null, null, null], "arrayOfStruct":[{"field1": true, "field2": "str1"}, {"field1": false}, {"field3": null}], "arrayOfArray1":[[1, 2, 3], ["str1", "str2"]], "arrayOfArray2":[[1, 2, 3], [1.1, 2.1, 3.1]] }""" val df = spark.sqlContext.read.json(spark.sparkContext.parallelize(List.fill(N)(row))) val benchmark = new Benchmark("JSON writer", N) benchmark.addCase("writing JSON file", 10) { _ => withTempPath { path => df.write.format("json").save(path.getCanonicalPath) } } benchmark.run() } ``` This produced the results below - Before ``` JSON writer: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------ writing JSON file 1675 / 1767 0.1 13087.5 1.0X ``` - After ``` JSON writer: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------ writing JSON file 1597 / 1686 0.1 12477.1 1.0X ``` In addition, I ran this benchmark 10 times for each and calculated the average elapsed time as below: \| Before \| After\| \|---------------\|------------\| \|17478ms \|16669ms \| It seems roughly ~5% is improved. ## How was this patch tested? Existing tests should cover this. Author: hyukjinkwon <gurwls223@gmail.com> Closes #14028 from HyukjinKwon/SPARK-16351.	2016-07-18 09:49:14 -07:00
Reynold Xin	480c870644	[SPARK-16588][SQL] Deprecate monotonicallyIncreasingId in Scala/Java This patch deprecates monotonicallyIncreasingId in Scala/Java, as done in Python. This patch was originally written by HyukjinKwon. Closes #14236.	2016-07-17 22:48:00 -07:00
Dongjoon Hyun	c576f9fb90	[SPARK-16529][SQL][TEST] `withTempDatabase` should set `default` database before dropping ## What changes were proposed in this pull request? `SQLTestUtils.withTempDatabase` is a frequently used test harness to setup a temporary table and clean up finally. This issue improves like the following for usability. ```scala - try f(dbName) finally spark.sql(s"DROP DATABASE $dbName CASCADE") + try f(dbName) finally { + if (spark.catalog.currentDatabase == dbName) { + spark.sql(s"USE ${DEFAULT_DATABASE}") + } + spark.sql(s"DROP DATABASE $dbName CASCADE") + } ``` In case of forgetting to reset the databaes, `withTempDatabase` will not raise Exception. ## How was this patch tested? This improves test harness. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #14184 from dongjoon-hyun/SPARK-16529.	2016-07-15 00:51:11 +08:00
Dongjoon Hyun	56183b84fb	[SPARK-16543][SQL] Rename the columns of `SHOW PARTITION/COLUMNS` commands ## What changes were proposed in this pull request? This PR changes the name of columns returned by `SHOW PARTITION` and `SHOW COLUMNS` commands. Currently, both commands uses `result` as a column name. Comparison: Column Name Command\|Spark(Before)\|Spark(After)\|Hive ----------\|--------------\|------------\|----- SHOW PARTITIONS\|result\|partition\|partition SHOW COLUMNS\|result\|col_name\|field Note that Spark/Hive uses `col_name` in `DESC TABLES`. So, this PR chooses `col_name` for consistency among Spark commands. Before ```scala scala> sql("show partitions p").show() +------+ \|result\| +------+ \| b=2\| +------+ scala> sql("show columns in p").show() +------+ \|result\| +------+ \| a\| \| b\| +------+ ``` After ```scala scala> sql("show partitions p").show +---------+ \|partition\| +---------+ \| b=2\| +---------+ scala> sql("show columns in p").show +--------+ \|col_name\| +--------+ \| a\| \| b\| +--------+ ``` ## How was this patch tested? Manual. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #14199 from dongjoon-hyun/SPARK-16543.	2016-07-14 17:18:34 +02:00
Liwei Lin	39c836e976	[SPARK-16503] SparkSession should provide Spark version ## What changes were proposed in this pull request? This patch enables SparkSession to provide spark version. ## How was this patch tested? Manual test: ``` scala> sc.version res0: String = 2.1.0-SNAPSHOT scala> spark.version res1: String = 2.1.0-SNAPSHOT ``` ``` >>> sc.version u'2.1.0-SNAPSHOT' >>> spark.version u'2.1.0-SNAPSHOT' ``` Author: Liwei Lin <lwlin7@gmail.com> Closes #14165 from lw-lin/add-version.	2016-07-13 22:30:46 -07:00
gatorsmile	c5ec879828	[SPARK-16482][SQL] Describe Table Command for Tables Requiring Runtime Inferred Schema #### What changes were proposed in this pull request? If we create a table pointing to a parquet/json datasets without specifying the schema, describe table command does not show the schema at all. It only shows `# Schema of this table is inferred at runtime`. In 1.6, describe table does show the schema of such a table. ~~For data source tables, to infer the schema, we need to load the data source tables at runtime. Thus, this PR calls the function `lookupRelation`.~~ For data source tables, we infer the schema before table creation. Thus, this PR set the inferred schema as the table schema when table creation. #### How was this patch tested? Added test cases Author: gatorsmile <gatorsmile@gmail.com> Closes #14148 from gatorsmile/describeSchema.	2016-07-13 15:23:37 -07:00
Burak Yavuz	0744d84c91	[SPARK-16531][SQL][TEST] Remove timezone setting from DataFrameTimeWindowingSuite ## What changes were proposed in this pull request? It's unnecessary. `QueryTest` already sets it. Author: Burak Yavuz <brkyvz@gmail.com> Closes #14170 from brkyvz/test-tz.	2016-07-13 12:54:57 -07:00
Maciej Brynski	83879ebc58	[SPARK-16439] Fix number formatting in SQL UI ## What changes were proposed in this pull request? Spark SQL UI display numbers greater than 1000 with u00A0 as grouping separator. Problem exists when server locale has no-breaking space as separator. (for example pl_PL) This patch turns off grouping and remove this separator. The problem starts with this PR. https://github.com/apache/spark/pull/12425/files#diff-803f475b01acfae1c5c96807c2ea9ddcR125 ## How was this patch tested? Manual UI tests. Screenshot attached. ![image](https://cloud.githubusercontent.com/assets/4006010/16749556/5cb5a372-47cb-11e6-9a95-67fd3f9d1c71.png) Author: Maciej Brynski <maciej.brynski@adpilot.pl> Closes #14142 from maver1ck/master.	2016-07-13 10:50:26 +01:00
Xin Ren	f73891e0b9	[MINOR] Fix Java style errors and remove unused imports ## What changes were proposed in this pull request? Fix Java style errors and remove unused imports, which are randomly found ## How was this patch tested? Tested on my local machine. Author: Xin Ren <iamshrek@126.com> Closes #14161 from keypointt/SPARK-16437.	2016-07-13 10:47:07 +01:00
Sean Owen	c190d89bd3	[SPARK-15889][STREAMING] Follow-up fix to erroneous condition in StreamTest ## What changes were proposed in this pull request? A second form of AssertQuery now actually invokes the condition; avoids a build warning too ## How was this patch tested? Jenkins; running StreamTest Author: Sean Owen <sowen@cloudera.com> Closes #14133 from srowen/SPARK-15889.2.	2016-07-13 10:44:07 +01:00
petermaxlee	56bd399a86	[SPARK-16284][SQL] Implement reflect SQL function ## What changes were proposed in this pull request? This patch implements reflect SQL function, which can be used to invoke a Java method in SQL. Slightly different from Hive, this implementation requires the class name and the method name to be literals. This implementation also supports only a smaller number of data types, and requires the function to be static, as suggested by rxin in #13969. java_method is an alias for reflect, so this should also resolve SPARK-16277. ## How was this patch tested? Added expression unit tests and an end-to-end test. Author: petermaxlee <petermaxlee@gmail.com> Closes #14138 from petermaxlee/reflect-static.	2016-07-13 08:05:20 +08:00
Marcelo Vanzin	7f968867ff	[SPARK-16119][SQL] Support PURGE option to drop table / partition. This option is used by Hive to directly delete the files instead of moving them to the trash. This is needed in certain configurations where moving the files does not work. For non-Hive tables and partitions, Spark already behaves as if the PURGE option was set, so there's no need to do anything. Hive support for PURGE was added in 0.14 (for tables) and 1.2 (for partitions), so the code reflects that: trying to use the option with older versions of Hive will cause an exception to be thrown. The change is a little noisier than I would like, because of the code to propagate the new flag through all the interfaces and implementations; the main changes are in the parser and in HiveShim, aside from the tests (DDLCommandSuite, VersionsSuite). Tested by running sql and catalyst unit tests, plus VersionsSuite which has been updated to test the version-specific behavior. I also ran an internal test suite that uses PURGE and would not pass previously. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #13831 from vanzin/SPARK-16119.	2016-07-12 12:47:46 -07:00
Reynold Xin	c377e49e38	[SPARK-16489][SQL] Guard against variable reuse mistakes in expression code generation ## What changes were proposed in this pull request? In code generation, it is incorrect for expressions to reuse variable names across different instances of itself. As an example, SPARK-16488 reports a bug in which pmod expression reuses variable name "r". This patch updates ExpressionEvalHelper test harness to always project two instances of the same expression, which will help us catch variable reuse problems in expression unit tests. This patch also fixes the bug in crc32 expression. ## How was this patch tested? This is a test harness change, but I also created a new test suite for testing the test harness. Author: Reynold Xin <rxin@databricks.com> Closes #14146 from rxin/SPARK-16489.	2016-07-12 10:07:23 -07:00
Lianhui Wang	5ad68ba5ce	[SPARK-15752][SQL] Optimize metadata only query that has an aggregate whose children are deterministic project or filter operators. ## What changes were proposed in this pull request? when query only use metadata (example: partition key), it can return results based on metadata without scanning files. Hive did it in HIVE-1003. ## How was this patch tested? add unit tests Author: Lianhui Wang <lianhuiwang09@gmail.com> Author: Wenchen Fan <wenchen@databricks.com> Author: Lianhui Wang <lianhuiwang@users.noreply.github.com> Closes #13494 from lianhuiwang/metadata-only.	2016-07-12 18:52:15 +02:00
Takuya UESHIN	5b28e02584	[SPARK-16189][SQL] Add ExternalRDD logical plan for input with RDD to have a chance to eliminate serialize/deserialize. ## What changes were proposed in this pull request? Currently the input `RDD` of `Dataset` is always serialized to `RDD[InternalRow]` prior to being as `Dataset`, but there is a case that we use `map` or `mapPartitions` just after converted to `Dataset`. In this case, serialize and then deserialize happens but it would not be needed. This pr adds `ExistingRDD` logical plan for input with `RDD` to have a chance to eliminate serialize/deserialize. ## How was this patch tested? Existing tests. Author: Takuya UESHIN <ueshin@happy-camper.st> Closes #13890 from ueshin/issues/SPARK-16189.	2016-07-12 17:16:59 +08:00
petermaxlee	c9a6762150	[SPARK-16199][SQL] Add a method to list the referenced columns in data source Filter ## What changes were proposed in this pull request? It would be useful to support listing the columns that are referenced by a filter. This can help simplify data source planning, because with this we would be able to implement unhandledFilters method in HadoopFsRelation. This is based on rxin's patch (#13901) and adds unit tests. ## How was this patch tested? Added a new suite FiltersSuite. Author: petermaxlee <petermaxlee@gmail.com> Author: Reynold Xin <rxin@databricks.com> Closes #14120 from petermaxlee/SPARK-16199.	2016-07-11 22:23:32 -07:00
Russell Spitzer	b1e5281c5c	[SPARK-12639][SQL] Mark Filters Fully Handled By Sources with * ## What changes were proposed in this pull request? In order to make it clear which filters are fully handled by the underlying datasource we will mark them with an . This will give a clear visual queue to users that the filter is being treated differently by catalyst than filters which are just presented to the underlying DataSource. Examples from the FilteredScanSuite, in this example `c IN (...)` is handled by the source, `b < ...` is not ### Before ``` //SELECT a FROM oneToTenFiltered WHERE a + b > 9 AND b < 16 AND c IN ('bbbbbBBBBB', 'cccccCCCCC', 'dddddDDDDD', 'foo') == Physical Plan == Project [a#0] +- Filter (((a#0 + b#1) > 9) && (b#1 < 16)) +- Scan SimpleFilteredScan(1,10)[a#0,b#1] PushedFilters: [LessThan(b,16), In(c, [bbbbbBBBBB,cccccCCCCC,dddddDDDDD,foo]] ``` ### After ``` == Physical Plan == Project [a#0] +- Filter (((a#0 + b#1) > 9) && (b#1 < 16)) +- Scan SimpleFilteredScan(1,10)[a#0,b#1] PushedFilters: [LessThan(b,16), In(c, [bbbbbBBBBB,cccccCCCCC,dddddDDDDD,foo]] ``` ## How was the this patch tested? Manually tested with the Spark Cassandra Connector, a source which fully handles underlying filters. Now fully handled filters appear with an * next to their names. I can add an automated test as well if requested Post 1.6.1 Tested by modifying the FilteredScanSuite to run explains. Author: Russell Spitzer <Russell.Spitzer@gmail.com> Closes #11317 from RussellSpitzer/SPARK-12639-Star.	2016-07-11 21:40:09 -07:00
Sameer Agarwal	9cc74f95ed	[SPARK-16488] Fix codegen variable namespace collision in pmod and partitionBy ## What changes were proposed in this pull request? This patch fixes a variable namespace collision bug in pmod and partitionBy ## How was this patch tested? Regression test for one possible occurrence. A more general fix in `ExpressionEvalHelper.checkEvaluation` will be in a subsequent PR. Author: Sameer Agarwal <sameer@databricks.com> Closes #14144 from sameeragarwal/codegen-bug.	2016-07-11 20:26:01 -07:00
Tathagata Das	e50efd53f0	[SPARK-16430][SQL][STREAMING] Fixed bug in the maxFilesPerTrigger in FileStreamSource ## What changes were proposed in this pull request? Incorrect list of files were being allocated to a batch. This caused a file to read multiple times in the multiple batches. ## How was this patch tested? Added unit tests Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #14143 from tdas/SPARK-16430-1.	2016-07-11 18:41:36 -07:00
Shixiong Zhu	91a443b849	[SPARK-16433][SQL] Improve StreamingQuery.explain when no data arrives ## What changes were proposed in this pull request? Display `No physical plan. Waiting for data.` instead of `N/A` for StreamingQuery.explain when no data arrives because `N/A` doesn't provide meaningful information. ## How was this patch tested? Existing unit tests. Author: Shixiong Zhu <shixiong@databricks.com> Closes #14100 from zsxwing/SPARK-16433.	2016-07-11 18:11:06 -07:00
James Thomas	9e2c763dbb	[SPARK-16114][SQL] structured streaming event time window example ## What changes were proposed in this pull request? A structured streaming example with event time windowing. ## How was this patch tested? Run locally Author: James Thomas <jamesjoethomas@gmail.com> Closes #13957 from jjthomas/current.	2016-07-11 17:57:51 -07:00
Dongjoon Hyun	840853ed06	[SPARK-16458][SQL] SessionCatalog should support `listColumns` for temporary tables ## What changes were proposed in this pull request? Temporary tables are used frequently, but `spark.catalog.listColumns` does not support those tables. This PR make `SessionCatalog` supports temporary table column listing. Before ```scala scala> spark.range(10).createOrReplaceTempView("t1") scala> spark.catalog.listTables().collect() res1: Array[org.apache.spark.sql.catalog.Table] = Array(Table[name=`t1`, tableType=`TEMPORARY`, isTemporary=`true`]) scala> spark.catalog.listColumns("t1").collect() org.apache.spark.sql.AnalysisException: Table `t1` does not exist in database `default`.; ``` After ``` scala> spark.catalog.listColumns("t1").collect() res2: Array[org.apache.spark.sql.catalog.Column] = Array(Column[name='id', description='id', dataType='bigint', nullable='false', isPartition='false', isBucket='false']) ``` ## How was this patch tested? Pass the Jenkins tests including a new testcase. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #14114 from dongjoon-hyun/SPARK-16458.	2016-07-11 22:45:22 +02:00
Dongjoon Hyun	7ac79da0e4	[SPARK-16459][SQL] Prevent dropping current database ## What changes were proposed in this pull request? This PR prevents dropping the current database to avoid errors like the followings. ```scala scala> sql("create database delete_db") scala> sql("use delete_db") scala> sql("drop database delete_db") scala> sql("create table t as select 1") org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database `delete_db` not found; ``` ## How was this patch tested? Pass the Jenkins tests including an updated testcase. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #14115 from dongjoon-hyun/SPARK-16459.	2016-07-11 15:15:47 +02:00
gatorsmile	e226278941	[SPARK-16355][SPARK-16354][SQL] Fix Bugs When LIMIT/TABLESAMPLE is Non-foldable, Zero or Negative #### What changes were proposed in this pull request? Issue 1: When a query containing LIMIT/TABLESAMPLE 0, the statistics could be zero. Results are correct but it could cause a huge performance regression. For example, ```Scala Seq(("one", 1), ("two", 2), ("three", 3), ("four", 4)).toDF("k", "v") .createOrReplaceTempView("test") val df1 = spark.table("test") val df2 = spark.table("test").limit(0) val df = df1.join(df2, Seq("k"), "left") ``` The statistics of both `df` and `df2` are zero. The statistics values should never be zero; otherwise `sizeInBytes` of `BinaryNode` will also be zero (product of children). This PR is to increase it to `1` when the num of rows is equal to 0. Issue 2: When a query containing negative LIMIT/TABLESAMPLE, we should issue exceptions. Negative values could break the implementation assumption of multiple parts. For example, statistics calculation. Below is the example query. ```SQL SELECT * FROM testData TABLESAMPLE (-1 rows) SELECT * FROM testData LIMIT -1 ``` This PR is to issue an appropriate exception in this case. Issue 3: Spark SQL follows the restriction of LIMIT clause in Hive. The argument to the LIMIT clause must evaluate to a constant value. It can be a numeric literal, or another kind of numeric expression involving operators, casts, and function return values. You cannot refer to a column or use a subquery. Currently, we do not detect whether the expression in LIMIT clause is foldable or not. If non-foldable, we might issue a strange error message. For example, ```SQL SELECT * FROM testData LIMIT rand() > 0.2 ``` Then, a misleading error message is issued, like ``` assertion failed: No plan for GlobalLimit (_nondeterministic#203 > 0.2) +- Project [key#11, value#12, rand(-1441968339187861415) AS _nondeterministic#203] +- LocalLimit (_nondeterministic#202 > 0.2) +- Project [key#11, value#12, rand(-1308350387169017676) AS _nondeterministic#202] +- LogicalRDD [key#11, value#12] java.lang.AssertionError: assertion failed: No plan for GlobalLimit (_nondeterministic#203 > 0.2) +- Project [key#11, value#12, rand(-1441968339187861415) AS _nondeterministic#203] +- LocalLimit (_nondeterministic#202 > 0.2) +- Project [key#11, value#12, rand(-1308350387169017676) AS _nondeterministic#202] +- LogicalRDD [key#11, value#12] ``` This PR detects it and then issues a meaningful error message. #### How was this patch tested? Added test cases. Author: gatorsmile <gatorsmile@gmail.com> Closes #14034 from gatorsmile/limit.	2016-07-11 16:21:13 +08:00
petermaxlee	82f0874453	[SPARK-16318][SQL] Implement all remaining xpath functions ## What changes were proposed in this pull request? This patch implements all remaining xpath functions that Hive supports and not natively supported in Spark: xpath_int, xpath_short, xpath_long, xpath_float, xpath_double, xpath_string, and xpath. ## How was this patch tested? Added unit tests and end-to-end tests. Author: petermaxlee <petermaxlee@gmail.com> Closes #13991 from petermaxlee/SPARK-16318.	2016-07-11 13:28:34 +08:00
gatorsmile	7374e518e2	[SPARK-16401][SQL] Data Source API: Enable Extending RelationProvider and CreatableRelationProvider without Extending SchemaRelationProvider #### What changes were proposed in this pull request? When users try to implement a data source API with extending only `RelationProvider` and `CreatableRelationProvider`, they will hit an error when resolving the relation. ```Scala spark.read .format("org.apache.spark.sql.test.DefaultSourceWithoutUserSpecifiedSchema") .load() .write. format("org.apache.spark.sql.test.DefaultSourceWithoutUserSpecifiedSchema") .save() ``` The error they hit is like ``` org.apache.spark.sql.test.DefaultSourceWithoutUserSpecifiedSchema does not allow user-specified schemas.; org.apache.spark.sql.AnalysisException: org.apache.spark.sql.test.DefaultSourceWithoutUserSpecifiedSchema does not allow user-specified schemas.; at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:319) at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:494) at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:211) ``` Actually, the bug fix is simple. [`DataSource.createRelation(sparkSession.sqlContext, mode, options, data)`](`dd644f8117/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala (L429)`) already returns a BaseRelation. We should not assign schema to `userSpecifiedSchema`. That schema assignment only makes sense for the data sources that extend `FileFormat`. #### How was this patch tested? Added a test case. Author: gatorsmile <gatorsmile@gmail.com> Closes #14075 from gatorsmile/dataSource.	2016-07-09 20:35:45 +08:00
Dongjoon Hyun	3b22291b5f	[SPARK-16387][SQL] JDBC Writer should use dialect to quote field names. ## What changes were proposed in this pull request? Currently, JDBC Writer uses dialects to get datatypes, but doesn't to quote field names. This PR uses dialects to quote the field names, too. Reported Error Scenario (MySQL case) ```scala scala> val url="jdbc:mysql://localhost:3306/temp" scala> val prop = new java.util.Properties scala> prop.setProperty("user","root") scala> spark.createDataset(Seq("a","b","c")).toDF("order") scala> df.write.mode("overwrite").jdbc(url, "temptable", prop) ...MySQLSyntaxErrorException: ... near 'order TEXT ) ``` ## How was this patch tested? Pass the Jenkins tests and manually do the above case. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #14107 from dongjoon-hyun/SPARK-16387.	2016-07-08 16:07:12 -07:00
wujian	f5fef69143	[SPARK-16281][SQL] Implement parse_url SQL function ## What changes were proposed in this pull request? This PR adds parse_url SQL functions in order to remove Hive fallback. A new implementation of #13999 ## How was this patch tested? Pass the exist tests including new testcases. Author: wujian <jan.chou.wu@gmail.com> Closes #14008 from janplus/SPARK-16281.	2016-07-08 14:38:05 -07:00
Dongjoon Hyun	142df4834b	[SPARK-16429][SQL] Include `StringType` columns in `describe()` ## What changes were proposed in this pull request? Currently, Spark `describe` supports `StringType`. However, `describe()` returns a dataset for only all numeric columns. This PR aims to include `StringType` columns in `describe()`, `describe` without argument. Background ```scala scala> spark.read.json("examples/src/main/resources/people.json").describe("age", "name").show() +-------+------------------+-------+ \|summary\| age\| name\| +-------+------------------+-------+ \| count\| 2\| 3\| \| mean\| 24.5\| null\| \| stddev\|7.7781745930520225\| null\| \| min\| 19\| Andy\| \| max\| 30\|Michael\| +-------+------------------+-------+ ``` Before ```scala scala> spark.read.json("examples/src/main/resources/people.json").describe().show() +-------+------------------+ \|summary\| age\| +-------+------------------+ \| count\| 2\| \| mean\| 24.5\| \| stddev\|7.7781745930520225\| \| min\| 19\| \| max\| 30\| +-------+------------------+ ``` After ```scala scala> spark.read.json("examples/src/main/resources/people.json").describe().show() +-------+------------------+-------+ \|summary\| age\| name\| +-------+------------------+-------+ \| count\| 2\| 3\| \| mean\| 24.5\| null\| \| stddev\|7.7781745930520225\| null\| \| min\| 19\| Andy\| \| max\| 30\|Michael\| +-------+------------------+-------+ ``` ## How was this patch tested? Pass the Jenkins with a update testcase. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #14095 from dongjoon-hyun/SPARK-16429.	2016-07-08 14:36:50 -07:00
Jurriaan Pruis	38cf8f2a50	[SPARK-13638][SQL] Add quoteAll option to CSV DataFrameWriter ## What changes were proposed in this pull request? Adds an quoteAll option for writing CSV which will quote all fields. See https://issues.apache.org/jira/browse/SPARK-13638 ## How was this patch tested? Added a test to verify the output columns are quoted for all fields in the Dataframe Author: Jurriaan Pruis <email@jurriaanpruis.nl> Closes #13374 from jurriaan/csv-quote-all.	2016-07-08 11:45:41 -07:00
Dongjoon Hyun	a54438cb23	[SPARK-16285][SQL] Implement sentences SQL functions ## What changes were proposed in this pull request? This PR implements `sentences` SQL function. ## How was this patch tested? Pass the Jenkins tests with a new testcase. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #14004 from dongjoon-hyun/SPARK_16285.	2016-07-08 17:05:24 +08:00
Tathagata Das	5bce458093	[SPARK-16430][SQL][STREAMING] Add option maxFilesPerTrigger ## What changes were proposed in this pull request? An option that limits the file stream source to read 1 file at a time enables rate limiting. It has the additional convenience that a static set of files can be used like a stream for testing as this will allows those files to be considered one at a time. This PR adds option `maxFilesPerTrigger`. ## How was this patch tested? New unit test Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #14094 from tdas/SPARK-16430.	2016-07-07 23:19:41 -07:00
Liwei Lin	0f7175def9	[SPARK-16350][SQL] Fix support for incremental planning in wirteStream.foreach() ## What changes were proposed in this pull request? There are cases where `complete` output mode does not output updated aggregated value; for details please refer to [SPARK-16350](https://issues.apache.org/jira/browse/SPARK-16350). The cause is that, as we do `data.as[T].foreachPartition { iter => ... }` in `ForeachSink.addBatch()`, `foreachPartition()` does not support incremental planning for now. This patches makes `foreachPartition()` support incremental planning in `ForeachSink`, by making a special version of `Dataset` with its `rdd()` method supporting incremental planning. ## How was this patch tested? Added a unit test which failed before the change Author: Liwei Lin <lwlin7@gmail.com> Closes #14030 from lw-lin/fix-foreach-complete.	2016-07-07 10:40:42 -07:00
Reynold Xin	986b251401	[SPARK-16400][SQL] Remove InSet filter pushdown from Parquet ## What changes were proposed in this pull request? This patch removes InSet filter pushdown from Parquet data source, since row-based pushdown is not beneficial to Spark and brings extra complexity to the code base. ## How was this patch tested? N/A Author: Reynold Xin <rxin@databricks.com> Closes #14076 from rxin/SPARK-16400.	2016-07-07 18:09:18 +08:00
gatorsmile	ab05db0b48	[SPARK-16368][SQL] Fix Strange Errors When Creating View With Unmatched Column Num #### What changes were proposed in this pull request? When creating a view, a common user error is the number of columns produced by the `SELECT` clause does not match the number of column names specified by `CREATE VIEW`. For example, given Table `t1` only has 3 columns ```SQL create view v1(col2, col4, col3, col5) as select * from t1 ``` Currently, Spark SQL reports the following error: ``` requirement failed java.lang.IllegalArgumentException: requirement failed at scala.Predef$.require(Predef.scala:212) at org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:90) ``` This error message is very confusing. This PR is to detect the error and issue a meaningful error message. #### How was this patch tested? Added test cases Author: gatorsmile <gatorsmile@gmail.com> Closes #14047 from gatorsmile/viewMismatchedColumns.	2016-07-07 00:07:25 -07:00
hyukjinkwon	34283de160	[SPARK-14839][SQL] Support for other types for `tableProperty` rule in SQL syntax ## What changes were proposed in this pull request? Currently, Scala API supports to take options with the types, `String`, `Long`, `Double` and `Boolean` and Python API also supports other types. This PR corrects `tableProperty` rule to support other types (string, boolean, double and integer) so that support the options for data sources in a consistent way. This will affect other rules such as DBPROPERTIES and TBLPROPERTIES (allowing other types as values). Also, `TODO add bucketing and partitioning.` was removed because it was resolved in `24bea00047` ## How was this patch tested? Unit test in `MetastoreDataSourcesSuite.scala`. Author: hyukjinkwon <gurwls223@gmail.com> Closes #13517 from HyukjinKwon/SPARK-14839.	2016-07-06 23:57:18 -04:00
Reynold Xin	8e3e4ed6c0	[SPARK-16371][SQL] Two follow-up tasks ## What changes were proposed in this pull request? This is a small follow-up for SPARK-16371: 1. Hide removeMetadata from public API. 2. Add JIRA ticket number to test case name. ## How was this patch tested? Updated a test comment. Author: Reynold Xin <rxin@databricks.com> Closes #14074 from rxin/parquet-filter.	2016-07-06 15:04:37 -07:00
hyukjinkwon	4f8ceed593	[SPARK-16371][SQL] Do not push down filters incorrectly when inner name and outer name are the same in Parquet ## What changes were proposed in this pull request? Currently, if there is a schema as below: ``` root \|-- _1: struct (nullable = true) \| \|-- _1: integer (nullable = true) ``` and if we execute the codes below: ```scala df.filter("_1 IS NOT NULL").count() ``` This pushes down a filter although this filter is being applied to `StructType`.(If my understanding is correct, Spark does not pushes down filters for those). The reason is, `ParquetFilters.getFieldMap` produces results below: ``` (_1,StructType(StructField(_1,IntegerType,true))) (_1,IntegerType) ``` and then it becomes a `Map` ``` (_1,IntegerType) ``` Now, because of ` ....lift(dataTypeOf(name)).map(_(name, value))`, this pushes down filters for `_1` which Parquet thinks is `IntegerType`. However, it is actually `StructType`. So, Parquet filter2 produces incorrect results, for example, the codes below: ``` df.filter("_1 IS NOT NULL").count() ``` produces always 0. This PR prevents this by not finding nested fields. ## How was this patch tested? Unit test in `ParquetFilterSuite`. Author: hyukjinkwon <gurwls223@gmail.com> Closes #14067 from HyukjinKwon/SPARK-16371.	2016-07-06 12:42:16 -07:00
Cheng Lian	23eff5e512	[SPARK-15979][SQL] Renames CatalystWriteSupport to ParquetWriteSupport ## What changes were proposed in this pull request? PR #13696 renamed various Parquet support classes but left `CatalystWriteSupport` behind. This PR is renames it as a follow-up. ## How was this patch tested? N/A. Author: Cheng Lian <lian@databricks.com> Closes #14070 from liancheng/spark-15979-follow-up.	2016-07-06 10:36:45 -07:00
Reynold Xin	7e28fabdff	[SPARK-16388][SQL] Remove spark.sql.nativeView and spark.sql.nativeView.canonical config ## What changes were proposed in this pull request? These two configs should always be true after Spark 2.0. This patch removes them from the config list. Note that ideally this should've gone into branch-2.0, but due to the timing of the release we should only merge this in master for Spark 2.1. ## How was this patch tested? Updated test cases. Author: Reynold Xin <rxin@databricks.com> Closes #14061 from rxin/SPARK-16388.	2016-07-06 17:40:55 +08:00
Dongjoon Hyun	ec79183ac5	[SPARK-16340][SQL] Support column arguments for `regexp_replace` Dataset operation ## What changes were proposed in this pull request? Currently, `regexp_replace` function supports `Column` arguments in a query. This PR supports that in a `Dataset` operation, too. ## How was this patch tested? Pass the Jenkins tests with a updated testcase. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #14060 from dongjoon-hyun/SPARK-16340.	2016-07-05 22:11:40 -07:00
Dongjoon Hyun	d0d28507ca	[SPARK-16286][SQL] Implement stack table generating function ## What changes were proposed in this pull request? This PR implements `stack` table generating function. ## How was this patch tested? Pass the Jenkins tests including new testcases. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #14033 from dongjoon-hyun/SPARK-16286.	2016-07-06 10:54:43 +08:00
Dongjoon Hyun	4db63fd2b4	[SPARK-16383][SQL] Remove `SessionState.executeSql` ## What changes were proposed in this pull request? This PR removes `SessionState.executeSql` in favor of `SparkSession.sql`. We can remove this safely since the visibility `SessionState` is `private[sql]` and `executeSql` is only used in one ignored test, `test("Multiple Hive Instances")`. ## How was this patch tested? Pass the Jenkins tests. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #14055 from dongjoon-hyun/SPARK-16383.	2016-07-05 16:47:32 -07:00
Reynold Xin	16a2a7d714	[SPARK-16311][SQL] Metadata refresh should work on temporary views ## What changes were proposed in this pull request? This patch fixes the bug that the refresh command does not work on temporary views. This patch is based on https://github.com/apache/spark/pull/13989, but removes the public Dataset.refresh() API as well as improved test coverage. Note that I actually think the public refresh() API is very useful. We can in the future implement it by also invalidating the lazy vals in QueryExecution (or alternatively just create a new QueryExecution). ## How was this patch tested? Re-enabled a previously ignored test, and added a new test suite for Hive testing behavior of temporary views against MetastoreRelation. Author: Reynold Xin <rxin@databricks.com> Author: petermaxlee <petermaxlee@gmail.com> Closes #14009 from rxin/SPARK-16311.	2016-07-05 11:36:05 -07:00
hyukjinkwon	07d9c5327f	[SPARK-9876][SQL][FOLLOWUP] Enable string and binary tests for Parquet predicate pushdown and replace deprecated fromByteArray. ## What changes were proposed in this pull request? It seems Parquet has been upgraded to 1.8.1 by https://github.com/apache/spark/pull/13280. So, this PR enables string and binary predicate push down which was disabled due to [SPARK-11153](https://issues.apache.org/jira/browse/SPARK-11153) and [PARQUET-251](https://issues.apache.org/jira/browse/PARQUET-251) and cleans up some comments unremoved (I think by mistake). This PR also replace the API, `fromByteArray()` deprecated in [PARQUET-251](https://issues.apache.org/jira/browse/PARQUET-251). ## How was this patch tested? Unit tests in `ParquetFilters` Author: hyukjinkwon <gurwls223@gmail.com> Closes #13389 from HyukjinKwon/parquet-1.8-followup.	2016-07-05 16:59:40 +08:00
Dongjoon Hyun	7f7eb3934e	[SPARK-16360][SQL] Speed up SQL query performance by removing redundant `executePlan` call ## What changes were proposed in this pull request? Currently, there are a few reports about Spark 2.0 query performance regression for large queries. This PR speeds up SQL query processing performance by removing redundant consecutive `executePlan` call in `Dataset.ofRows` function and `Dataset` instantiation. Specifically, this PR aims to reduce the overhead of SQL query execution plan generation, not real query execution. So, we can not see the result in the Spark Web UI. Please use the following query script. The result is 25.78 sec -> 12.36 sec as expected. Sample Query ```scala val n = 4000 val values = (1 to n).map(_.toString).mkString(", ") val columns = (1 to n).map("column" + _).mkString(", ") val query = s""" \|SELECT $columns \|FROM VALUES ($values) T($columns) \|WHERE 1=2 AND 1 IN ($columns) \|GROUP BY $columns \|ORDER BY $columns \|""".stripMargin def time[R](block: => R): R = { val t0 = System.nanoTime() val result = block println("Elapsed time: " + ((System.nanoTime - t0) / 1e9) + "s") result } ``` Before ```scala scala> time(sql(query)) Elapsed time: 30.138142577s // First query has a little overhead of initialization. res0: org.apache.spark.sql.DataFrame = [column1: int, column2: int ... 3998 more fields] scala> time(sql(query)) Elapsed time: 25.787751452s // Let's compare this one. res1: org.apache.spark.sql.DataFrame = [column1: int, column2: int ... 3998 more fields] ``` After ```scala scala> time(sql(query)) Elapsed time: 17.500279659s // First query has a little overhead of initialization. res0: org.apache.spark.sql.DataFrame = [column1: int, column2: int ... 3998 more fields] scala> time(sql(query)) Elapsed time: 12.364812255s // This shows the real difference. The speed up is about 2 times. res1: org.apache.spark.sql.DataFrame = [column1: int, column2: int ... 3998 more fields] ``` ## How was this patch tested? Manual by the above script. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #14044 from dongjoon-hyun/SPARK-16360.	2016-07-05 16:19:22 +08:00
Koert Kuipers	8cdb81fa82	[SPARK-15204][SQL] improve nullability inference for Aggregator ## What changes were proposed in this pull request? TypedAggregateExpression sets nullable based on the schema of the outputEncoder ## How was this patch tested? Add test in DatasetAggregatorSuite Author: Koert Kuipers <koert@tresata.com> Closes #13532 from koertkuipers/feat-aggregator-nullable.	2016-07-04 12:14:14 +08:00
Dongjoon Hyun	88134e7368	[SPARK-16288][SQL] Implement inline table generating function ## What changes were proposed in this pull request? This PR implements `inline` table generating function. ## How was this patch tested? Pass the Jenkins tests with new testcase. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #13976 from dongjoon-hyun/SPARK-16288.	2016-07-04 01:57:45 +08:00
Dongjoon Hyun	54b27c1797	[SPARK-16278][SPARK-16279][SQL] Implement map_keys/map_values SQL functions ## What changes were proposed in this pull request? This PR adds `map_keys` and `map_values` SQL functions in order to remove Hive fallback. ## How was this patch tested? Pass the Jenkins tests including new testcases. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #13967 from dongjoon-hyun/SPARK-16278.	2016-07-03 16:59:40 +08:00
gatorsmile	ea990f9693	[SPARK-16329][SQL] Star Expansion over Table Containing No Column #### What changes were proposed in this pull request? Star expansion over a table containing zero column does not work since 1.6. However, it works in Spark 1.5.1. This PR is to fix the issue in the master branch. For example, ```scala val rddNoCols = sqlContext.sparkContext.parallelize(1 to 10).map(_ => Row.empty) val dfNoCols = sqlContext.createDataFrame(rddNoCols, StructType(Seq.empty)) dfNoCols.registerTempTable("temp_table_no_cols") sqlContext.sql("select * from temp_table_no_cols").show ``` Without the fix, users will get the following the exception: ``` java.lang.IllegalArgumentException: requirement failed at scala.Predef$.require(Predef.scala:221) at org.apache.spark.sql.catalyst.analysis.UnresolvedStar.expand(unresolved.scala:199) ``` #### How was this patch tested? Tests are added Author: gatorsmile <gatorsmile@gmail.com> Closes #14007 from gatorsmile/starExpansionTableWithZeroColumn.	2016-07-03 16:48:04 +08:00
Dongjoon Hyun	3000b4b29f	[MINOR][BUILD] Fix Java linter errors ## What changes were proposed in this pull request? This PR fixes the minor Java linter errors like the following. ``` - public int read(char cbuf[], int off, int len) throws IOException { + public int read(char[] cbuf, int off, int len) throws IOException { ``` ## How was this patch tested? Manual. ``` $ build/mvn -T 4 -q -DskipTests -Pyarn -Phadoop-2.3 -Pkinesis-asl -Phive -Phive-thriftserver install $ dev/lint-java Using `mvn` from path: /usr/local/bin/mvn Checkstyle checks passed. ``` Author: Dongjoon Hyun <dongjoon@apache.org> Closes #14017 from dongjoon-hyun/minor_build_java_linter_error.	2016-07-02 16:31:06 +01:00
Reynold Xin	d601894c04	[SPARK-16335][SQL] Structured streaming should fail if source directory does not exist ## What changes were proposed in this pull request? In structured streaming, Spark does not report errors when the specified directory does not exist. This is a behavior different from the batch mode. This patch changes the behavior to fail if the directory does not exist (when the path is not a glob pattern). ## How was this patch tested? Updated unit tests to reflect the new behavior. Author: Reynold Xin <rxin@databricks.com> Closes #14002 from rxin/SPARK-16335.	2016-07-01 15:16:04 -07:00
gatorsmile	0ad6ce7e54	[SPARK-16222][SQL] JDBC Sources - Handling illegal input values for `fetchsize` and `batchsize` #### What changes were proposed in this pull request? For JDBC data sources, users can specify `batchsize` for multi-row inserts and `fetchsize` for multi-row fetch. A few issues exist: - The property keys are case sensitive. Thus, the existing test cases for `fetchsize` use incorrect names, `fetchSize`. Basically, the test cases are broken. - No test case exists for `batchsize`. - We do not detect the illegal input values for `fetchsize` and `batchsize`. For example, when `batchsize` is zero, we got the following exception: ``` Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.ArithmeticException: / by zero ``` when `fetchsize` is less than zero, we got the exception from the underlying JDBC driver: ``` Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): org.h2.jdbc.JdbcSQLException: Invalid value "-1" for parameter "rows" [90008-183] ``` This PR fixes all the above issues, and issue the appropriate exceptions when detecting the illegal inputs for `fetchsize` and `batchsize`. Also update the function descriptions. #### How was this patch tested? Test cases are fixed and added. Author: gatorsmile <gatorsmile@gmail.com> Closes #13919 from gatorsmile/jdbcProperties.	2016-07-01 09:54:02 +01:00
petermaxlee	85f2303eca	[SPARK-16276][SQL] Implement elt SQL function ## What changes were proposed in this pull request? This patch implements the elt function, as it is implemented in Hive. ## How was this patch tested? Added expression unit test in StringExpressionsSuite and end-to-end test in StringFunctionsSuite. Author: petermaxlee <petermaxlee@gmail.com> Closes #13966 from petermaxlee/SPARK-16276.	2016-07-01 07:57:48 +08:00
Reynold Xin	3d75a5b2a7	[SPARK-16313][SQL] Spark should not silently drop exceptions in file listing ## What changes were proposed in this pull request? Spark silently drops exceptions during file listing. This is a very bad behavior because it can mask legitimate errors and the resulting plan will silently have 0 rows. This patch changes it to not silently drop the errors. ## How was this patch tested? Manually verified. Author: Reynold Xin <rxin@databricks.com> Closes #13987 from rxin/SPARK-16313.	2016-06-30 16:51:11 -07:00
petermaxlee	fb41670c92	[SPARK-16336][SQL] Suggest doing table refresh upon FileNotFoundException ## What changes were proposed in this pull request? This patch appends a message to suggest users running refresh table or reloading data frames when Spark sees a FileNotFoundException due to stale, cached metadata. ## How was this patch tested? Added a unit test for this in MetadataCacheSuite. Author: petermaxlee <petermaxlee@gmail.com> Closes #14003 from petermaxlee/SPARK-16336.	2016-06-30 16:49:59 -07:00
Dongjoon Hyun	46395db80e	[SPARK-16289][SQL] Implement posexplode table generating function ## What changes were proposed in this pull request? This PR implements `posexplode` table generating function. Currently, master branch raises the following exception for `map` argument. It's different from Hive. Before ```scala scala> sql("select posexplode(map('a', 1, 'b', 2))").show org.apache.spark.sql.AnalysisException: No handler for Hive UDF ... posexplode() takes an array as a parameter; line 1 pos 7 ``` After ```scala scala> sql("select posexplode(map('a', 1, 'b', 2))").show +---+---+-----+ \|pos\|key\|value\| +---+---+-----+ \| 0\| a\| 1\| \| 1\| b\| 2\| +---+---+-----+ ``` For `array` argument, `after` is the same with `before`. ``` scala> sql("select posexplode(array(1, 2, 3))").show +---+---+ \|pos\|col\| +---+---+ \| 0\| 1\| \| 1\| 2\| \| 2\| 3\| +---+---+ ``` ## How was this patch tested? Pass the Jenkins tests with newly added testcases. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #13971 from dongjoon-hyun/SPARK-16289.	2016-06-30 12:03:54 -07:00
Sital Kedia	07f46afc73	[SPARK-13850] Force the sorter to Spill when number of elements in th… ## What changes were proposed in this pull request? Force the sorter to Spill when number of elements in the pointer array reach a certain size. This is to workaround the issue of timSort failing on large buffer size. ## How was this patch tested? Tested by running a job which was failing without this change due to TimSort bug. Author: Sital Kedia <skedia@fb.com> Closes #13107 from sitalkedia/fix_TimSort.	2016-06-30 10:53:18 -07:00
WeichenXu	5344bade8e	[SPARK-15820][PYSPARK][SQL] Add Catalog.refreshTable into python API ## What changes were proposed in this pull request? Add Catalog.refreshTable API into python interface for Spark-SQL. ## How was this patch tested? Existing test. Author: WeichenXu <WeichenXu123@outlook.com> Closes #13558 from WeichenXu123/update_python_sql_interface_refreshTable.	2016-06-30 23:00:39 +08:00
petermaxlee	d3af6731fa	[SPARK-16274][SQL] Implement xpath_boolean ## What changes were proposed in this pull request? This patch implements xpath_boolean expression for Spark SQL, a xpath function that returns true or false. The implementation is modelled after Hive's xpath_boolean, except that how the expression handles null inputs. Hive throws a NullPointerException at runtime if either of the input is null. This implementation returns null if either of the input is null. ## How was this patch tested? Created two new test suites. One for unit tests covering the expression, and the other for end-to-end test in SQL. Author: petermaxlee <petermaxlee@gmail.com> Closes #13964 from petermaxlee/SPARK-16274.	2016-06-30 09:27:48 +08:00
Dongjoon Hyun	831a04f5d1	[SPARK-16267][TEST] Replace deprecated `CREATE TEMPORARY TABLE ... USING` from testsuites. ## What changes were proposed in this pull request? After SPARK-15674, `DDLStrategy` prints out the following deprecation messages in the testsuites. ``` 12:10:53.284 WARN org.apache.spark.sql.execution.SparkStrategies$DDLStrategy: CREATE TEMPORARY TABLE normal_orc_source USING... is deprecated, please use CREATE TEMPORARY VIEW viewName USING... instead ``` Total : 40 - JDBCWriteSuite: 14 - DDLSuite: 6 - TableScanSuite: 6 - ParquetSourceSuite: 5 - OrcSourceSuite: 2 - SQLQuerySuite: 2 - HiveCommandSuite: 2 - JsonSuite: 1 - PrunedScanSuite: 1 - FilteredScanSuite 1 This PR replaces `CREATE TEMPORARY TABLE` with `CREATE TEMPORARY VIEW` in order to remove the deprecation messages in the above testsuites except `DDLSuite`, `SQLQuerySuite`, `HiveCommandSuite`. The Jenkins results shows only remaining 10 messages. https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61422/consoleFull ## How was this patch tested? This is a testsuite-only change. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #13956 from dongjoon-hyun/SPARK-16267.	2016-06-29 17:29:17 -07:00
Wenchen Fan	d063898beb	[SPARK-16134][SQL] optimizer rules for typed filter ## What changes were proposed in this pull request? This PR adds 3 optimizer rules for typed filter: 1. push typed filter down through `SerializeFromObject` and eliminate the deserialization in filter condition. 2. pull typed filter up through `SerializeFromObject` and eliminate the deserialization in filter condition. 3. combine adjacent typed filters and share the deserialized object among all the condition expressions. This PR also adds `TypedFilter` logical plan, to separate it from normal filter, so that the concept is more clear and it's easier to write optimizer rules. ## How was this patch tested? `TypedFilterOptimizationSuite` Author: Wenchen Fan <wenchen@databricks.com> Closes #13846 from cloud-fan/filter.	2016-06-30 08:15:08 +08:00
Dongjoon Hyun	9b1b3ae771	[SPARK-16006][SQL] Attemping to write empty DataFrame with no fields throws non-intuitive exception ## What changes were proposed in this pull request? This PR allows `emptyDataFrame.write` since the user didn't specify any partition columns. Before ```scala scala> spark.emptyDataFrame.write.parquet("/tmp/t1") org.apache.spark.sql.AnalysisException: Cannot use all columns for partition columns; scala> spark.emptyDataFrame.write.csv("/tmp/t1") org.apache.spark.sql.AnalysisException: Cannot use all columns for partition columns; ``` After this PR, there occurs no exceptions and the created directory has only one file, `_SUCCESS`, as expected. ## How was this patch tested? Pass the Jenkins tests including updated test cases. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #13730 from dongjoon-hyun/SPARK-16006.	2016-06-29 15:00:41 -07:00
hyukjinkwon	cb1b9d34f3	[SPARK-14480][SQL] Remove meaningless StringIteratorReader for CSV data source. ## What changes were proposed in this pull request? This PR removes meaningless `StringIteratorReader` for CSV data source. In `CSVParser.scala`, there is an `Reader` wrapping `Iterator` but there are two problems by this. Firstly, it was actually not faster than processing line by line with Iterator due to additional logics to wrap `Iterator` to `Reader`. Secondly, this brought a bit of complexity because it needs additional logics to allow every line to be read bytes by bytes. So, it was pretty difficult to figure out issues about parsing, (eg. SPARK-14103). A benchmark was performed manually and the results were below: - Original codes with Reader wrapping Iterator \|End-to-end (ns) \| Parse Time (ns) \| \|-----------------------\|------------------------\| \|14116265034 \|2008277960 \| - New codes with Iterator \|End-to-end (ns) \| Parse Time (ns) \| \|-----------------------\|------------------------\| \|13451699644 \| 1549050564 \| For the details for the environment, dataset and methods, please refer the JIRA ticket. ## How was this patch tested? Existing tests should cover this. Author: hyukjinkwon <gurwls223@gmail.com> Closes #13808 from HyukjinKwon/SPARK-14480-small.	2016-06-29 11:42:51 -07:00
gatorsmile	7ee9e39cb4	[SPARK-16157][SQL] Add New Methods for comments in StructField and StructType #### What changes were proposed in this pull request? Based on the previous discussion with cloud-fan hvanhovell in another related PR https://github.com/apache/spark/pull/13764#discussion_r67994276, it looks reasonable to add convenience methods for users to add `comment` when defining `StructField`. Currently, the column-related `comment` attribute is stored in `Metadata` of `StructField`. For example, users can add the `comment` attribute using the following way: ```Scala StructType( StructField( "cl1", IntegerType, nullable = false, new MetadataBuilder().putString("comment", "test").build()) :: Nil) ``` This PR is to add more user friendly methods for the `comment` attribute when defining a `StructField`. After the changes, users are provided three different ways to do it: ```Scala val struct = (new StructType) .add("a", "int", true, "test1") val struct = (new StructType) .add("c", StringType, true, "test3") val struct = (new StructType) .add(StructField("d", StringType).withComment("test4")) ``` #### How was this patch tested? Added test cases: - `DataTypeSuite` is for testing three types of API changes, - `DataFrameReaderWriterSuite` is for parquet, json and csv formats - using in-memory catalog - `OrcQuerySuite.scala` is for orc format using Hive-metastore Author: gatorsmile <gatorsmile@gmail.com> Closes #13860 from gatorsmile/newMethodForComment.	2016-06-29 19:36:21 +08:00
Cheng Lian	d1e8108854	[SPARK-16291][SQL] CheckAnalysis should capture nested aggregate functions that reference no input attributes ## What changes were proposed in this pull request? `MAX(COUNT())` is invalid since aggregate expression can't be nested within another aggregate expression. This case should be captured at analysis phase, but somehow sneaks off to runtime. The reason is that when checking aggregate expressions in `CheckAnalysis`, a checking branch treats all expressions that reference no input attributes as valid ones. However, `MAX(COUNT())` is translated into `MAX(COUNT(1))` at analysis phase and also references no input attribute. This PR fixes this issue by removing the aforementioned branch. ## How was this patch tested? New test case added in `AnalysisErrorSuite`. Author: Cheng Lian <lian@databricks.com> Closes #13968 from liancheng/spark-16291-nested-agg-functions.	2016-06-29 19:08:36 +08:00
Holden Karau	757dc2c09d	[TRIVIAL][DOCS][STREAMING][SQL] The return type mentioned in the Javadoc is incorrect for toJavaRDD, … ## What changes were proposed in this pull request? Change the return type mentioned in the JavaDoc for `toJavaRDD` / `javaRDD` to match the actual return type & be consistent with the scala rdd return type. ## How was this patch tested? Docs only change. Author: Holden Karau <holden@us.ibm.com> Closes #13954 from holdenk/trivial-streaming-tojavardd-doc-fix.	2016-06-29 01:52:20 -07:00
Burak Yavuz	5545b79109	[MINOR][DOCS][STRUCTURED STREAMING] Minor doc fixes around `DataFrameWriter` and `DataStreamWriter` ## What changes were proposed in this pull request? Fixes a couple old references to `DataFrameWriter.startStream` to `DataStreamWriter.start Author: Burak Yavuz <brkyvz@gmail.com> Closes #13952 from brkyvz/minor-doc-fix.	2016-06-28 17:02:16 -07:00
Wenchen Fan	8a977b0654	[SPARK-16100][SQL] fix bug when use Map as the buffer type of Aggregator ## What changes were proposed in this pull request? The root cause is in `MapObjects`. Its parameter `loopVar` is not declared as child, but sometimes can be same with `lambdaFunction`(e.g. the function that takes `loopVar` and produces `lambdaFunction` may be `identity`), which is a child. This brings trouble when call `withNewChildren`, it may mistakenly treat `loopVar` as a child and cause `IndexOutOfBoundsException: 0` later. This PR fixes this bug by simply pulling out the paremters from `LambdaVariable` and pass them to `MapObjects` directly. ## How was this patch tested? new test in `DatasetAggregatorSuite` Author: Wenchen Fan <wenchen@databricks.com> Closes #13835 from cloud-fan/map-objects.	2016-06-29 06:39:28 +08:00
gatorsmile	25520e9762	[SPARK-16236][SQL] Add Path Option back to Load API in DataFrameReader #### What changes were proposed in this pull request? koertkuipers identified the PR https://github.com/apache/spark/pull/13727/ changed the behavior of `load` API. After the change, the `load` API does not add the value of `path` into the `options`. Thank you! This PR is to add the option `path` back to `load()` API in `DataFrameReader`, if and only if users specify one and only one `path` in the `load` API. For example, users can see the `path` option after the following API call, ```Scala spark.read .format("parquet") .load("/test") ``` #### How was this patch tested? Added test cases. Author: gatorsmile <gatorsmile@gmail.com> Closes #13933 from gatorsmile/optionPath.	2016-06-28 15:32:45 -07:00
Wenchen Fan	1f2776df6e	[SPARK-16181][SQL] outer join with isNull filter may return wrong result ## What changes were proposed in this pull request? The root cause is: the output attributes of outer join are derived from its children, while they are actually different attributes(outer join can return null). We have already added some special logic to handle it, e.g. `PushPredicateThroughJoin` won't push down predicates through outer join side, `FixNullability`. This PR adds one more special logic in `FoldablePropagation`. ## How was this patch tested? new test in `DataFrameSuite` Author: Wenchen Fan <wenchen@databricks.com> Closes #13884 from cloud-fan/bug.	2016-06-28 10:26:01 -07:00
Prashant Sharma	f6b497fcdd	[SPARK-16128][SQL] Allow setting length of characters to be truncated to, in Dataset.show function. ## What changes were proposed in this pull request? Allowing truncate to a specific number of character is convenient at times, especially while operating from the REPL. Sometimes those last few characters make all the difference, and showing everything brings in whole lot of noise. ## How was this patch tested? Existing tests. + 1 new test in DataFrameSuite. For SparkR and pyspark, existing tests and manual testing. Author: Prashant Sharma <prashsh1@in.ibm.com> Author: Prashant Sharma <prashant@apache.org> Closes #13839 from ScrapCodes/add_truncateTo_DF.show.	2016-06-28 17:11:06 +05:30
gatorsmile	4cbf611c1d	[SPARK-16202][SQL][DOC] Correct The Description of CreatableRelationProvider's createRelation #### What changes were proposed in this pull request? The API description of `createRelation` in `CreatableRelationProvider` is misleading. The current description only expects users to return the relation. ```Scala trait CreatableRelationProvider { def createRelation( sqlContext: SQLContext, mode: SaveMode, parameters: Map[String, String], data: DataFrame): BaseRelation } ``` However, the major goal of this API should also include saving the `DataFrame`. Since this API is critical for Data Source API developers, this PR is to correct the description. #### How was this patch tested? N/A Author: gatorsmile <gatorsmile@gmail.com> Closes #13903 from gatorsmile/readUnderscoreFiles.	2016-06-27 23:12:17 -07:00
Dongjoon Hyun	a0da854fb3	[SPARK-16221][SQL] Redirect Parquet JUL logger via SLF4J for WRITE operations ## What changes were proposed in this pull request? [SPARK-8118](https://github.com/apache/spark/pull/8196) implements redirecting Parquet JUL logger via SLF4J, but it is currently applied only when READ operations occurs. If users use only WRITE operations, there occurs many Parquet logs. This PR makes the redirection work on WRITE operations, too. Before ```scala scala> spark.range(10).write.format("parquet").mode("overwrite").save("/tmp/p") SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder". SLF4J: Defaulting to no-operation (NOP) logger implementation SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details. Jun 26, 2016 9:04:38 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: Compression: SNAPPY ............ about 70 lines Parquet Log ............. scala> spark.range(10).write.format("parquet").mode("overwrite").save("/tmp/p") ............ about 70 lines Parquet Log ............. ``` After ```scala scala> spark.range(10).write.format("parquet").mode("overwrite").save("/tmp/p") SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder". SLF4J: Defaulting to no-operation (NOP) logger implementation SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details. scala> spark.range(10).write.format("parquet").mode("overwrite").save("/tmp/p") ``` This PR also fixes some typos. ## How was this patch tested? Manual. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #13918 from dongjoon-hyun/SPARK-16221.	2016-06-28 13:01:18 +08:00
Herman van Hovell	02a029df43	[SPARK-16220][SQL] Add scope to show functions ## What changes were proposed in this pull request? Spark currently shows all functions when issue a `SHOW FUNCTIONS` command. This PR refines the `SHOW FUNCTIONS` command by allowing users to select all functions, user defined function or system functions. The following syntax can be used: ALL (default) ```SHOW FUNCTIONS``` ```SHOW ALL FUNCTIONS``` SYSTEM ```SHOW SYSTEM FUNCTIONS``` USER ```SHOW USER FUNCTIONS``` ## How was this patch tested? Updated tests and added tests to the DDLSuite Author: Herman van Hovell <hvanhovell@databricks.com> Closes #13929 from hvanhovell/SPARK-16220.	2016-06-27 16:57:34 -07:00
Bill Chambers	c48c8ebc0a	[SPARK-16220][SQL] Revert Change to Bring Back SHOW FUNCTIONS Functionality ## What changes were proposed in this pull request? - Fix tests regarding show functions functionality - Revert `catalog.ListFunctions` and `SHOW FUNCTIONS` to return to `Spark 1.X` functionality. Cherry picked changes from this PR: https://github.com/apache/spark/pull/13413/files ## How was this patch tested? Unit tests. Author: Bill Chambers <bill@databricks.com> Author: Bill Chambers <wchambers@ischool.berkeley.edu> Closes #13916 from anabranch/master.	2016-06-27 11:50:34 -07:00
Dongjoon Hyun	11f420b4bb	[SPARK-10591][SQL][TEST] Add a testcase to ensure if `checkAnswer` handles map correctly ## What changes were proposed in this pull request? This PR adds a testcase to ensure if `checkAnswer` handles Map type correctly. ## How was this patch tested? Pass the jenkins tests. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #13913 from dongjoon-hyun/SPARK-10591.	2016-06-27 19:04:50 +08:00
Felix Cheung	30b182bcc0	[SPARK-16184][SPARKR] conf API for SparkSession ## What changes were proposed in this pull request? Add `conf` method to get Runtime Config from SparkSession ## How was this patch tested? unit tests, manual tests This is how it works in sparkR shell: ``` SparkSession available as 'spark'. > conf() $hive.metastore.warehouse.dir [1] "file:/opt/spark-2.0.0-bin-hadoop2.6/R/spark-warehouse" $spark.app.id [1] "local-1466749575523" $spark.app.name [1] "SparkR" $spark.driver.host [1] "10.0.2.1" $spark.driver.port [1] "45629" $spark.executorEnv.LD_LIBRARY_PATH [1] "$LD_LIBRARY_PATH:/usr/lib/R/lib:/usr/lib/x86_64-linux-gnu:/usr/lib/jvm/default-java/jre/lib/amd64/server" $spark.executor.id [1] "driver" $spark.home [1] "/opt/spark-2.0.0-bin-hadoop2.6" $spark.master [1] "local[]" $spark.sql.catalogImplementation [1] "hive" $spark.submit.deployMode [1] "client" > conf("spark.master") $spark.master [1] "local[]" ``` Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #13885 from felixcheung/rconf.	2016-06-26 13:10:43 -07:00
Sital Kedia	bf665a9586	[SPARK-15958] Make initial buffer size for the Sorter configurable ## What changes were proposed in this pull request? Currently the initial buffer size in the sorter is hard coded inside the code and is too small for large workload. As a result, the sorter spends significant time expanding the buffer size and copying the data. It would be useful to have it configurable. ## How was this patch tested? Tested by running a job on the cluster. Author: Sital Kedia <skedia@fb.com> Closes #13699 from sitalkedia/config_sort_buffer_upstream.	2016-06-25 09:13:39 +01:00
Dongjoon Hyun	a7d29499dc	[SPARK-16186] [SQL] Support partition batch pruning with `IN` predicate in InMemoryTableScanExec ## What changes were proposed in this pull request? One of the most frequent usage patterns for Spark SQL is using cached tables. This PR improves `InMemoryTableScanExec` to handle `IN` predicate efficiently by pruning partition batches. Of course, the performance improvement varies over the queries and the datasets. But, for the following simple query, the query duration in Spark UI goes from 9 seconds to 50~90ms. It's about over 100 times faster. Before ```scala $ bin/spark-shell --driver-memory 6G scala> val df = spark.range(2000000000) scala> df.createOrReplaceTempView("t") scala> spark.catalog.cacheTable("t") scala> sql("select id from t where id = 1").collect() // About 2 mins scala> sql("select id from t where id = 1").collect() // less than 90ms scala> sql("select id from t where id in (1,2,3)").collect() // 9 seconds ``` After ```scala scala> sql("select id from t where id in (1,2,3)").collect() // less than 90ms ``` This PR has impacts over 35 queries of TPC-DS if the tables are cached. Note that this optimization is applied for `IN`. To apply `IN` predicate having more than 10 items, `spark.sql.optimizer.inSetConversionThreshold` option should be increased. ## How was this patch tested? Pass the Jenkins tests (including new testcases). Author: Dongjoon Hyun <dongjoon@apache.org> Closes #13887 from dongjoon-hyun/SPARK-16186.	2016-06-24 22:34:31 -07:00
Takeshi YAMAMURO	d2e44d7db8	[SPARK-16192][SQL] Add type checks in CollectSet ## What changes were proposed in this pull request? `CollectSet` cannot have map-typed data because MapTypeData does not implement `equals`. So, this pr is to add type checks in `CheckAnalysis`. ## How was this patch tested? Added tests to check failures when we found map-typed data in `CollectSet`. Author: Takeshi YAMAMURO <linguin.m.s@gmail.com> Closes #13892 from maropu/SPARK-16192.	2016-06-24 21:07:03 -07:00
Dilip Biswal	9053054c7f	[SPARK-16195][SQL] Allow users to specify empty over clause in window expressions through dataset API ## What changes were proposed in this pull request? Allow to specify empty over clause in window expressions through dataset API In SQL, its allowed to specify an empty OVER clause in the window expression. ```SQL select area, sum(product) over () as c from windowData where product > 3 group by area, product having avg(month) > 0 order by avg(month), product ``` In this case the analytic function sum is presented based on all the rows of the result set Currently its not allowed through dataset API and is handled in this PR. ## How was this patch tested? Added a new test in DataframeWindowSuite Author: Dilip Biswal <dbiswal@us.ibm.com> Closes #13897 from dilipbiswal/spark-empty-over.	2016-06-24 17:27:33 -07:00
Dongjoon Hyun	e5d0928e24	[SPARK-16173] [SQL] Can't join describe() of DataFrame in Scala 2.10 ## What changes were proposed in this pull request? This PR fixes `DataFrame.describe()` by forcing materialization to make the `Seq` serializable. Currently, `describe()` of DataFrame throws `Task not serializable` Spark exceptions when joining in Scala 2.10. ## How was this patch tested? Manual. (After building with Scala 2.10, test on `bin/spark-shell` and `bin/pyspark`.) Author: Dongjoon Hyun <dongjoon@apache.org> Closes #13900 from dongjoon-hyun/SPARK-16173.	2016-06-24 17:26:39 -07:00
Davies Liu	20768dade2	Revert "[SPARK-16186] [SQL] Support partition batch pruning with `IN` predicate in InMemoryTableScanExec" This reverts commit `a65bcbc27d`.	2016-06-24 17:21:18 -07:00
Dongjoon Hyun	a65bcbc27d	[SPARK-16186] [SQL] Support partition batch pruning with `IN` predicate in InMemoryTableScanExec ## What changes were proposed in this pull request? One of the most frequent usage patterns for Spark SQL is using cached tables. This PR improves `InMemoryTableScanExec` to handle `IN` predicate efficiently by pruning partition batches. Of course, the performance improvement varies over the queries and the datasets. But, for the following simple query, the query duration in Spark UI goes from 9 seconds to 50~90ms. It's about over 100 times faster. Before ```scala $ bin/spark-shell --driver-memory 6G scala> val df = spark.range(2000000000) scala> df.createOrReplaceTempView("t") scala> spark.catalog.cacheTable("t") scala> sql("select id from t where id = 1").collect() // About 2 mins scala> sql("select id from t where id = 1").collect() // less than 90ms scala> sql("select id from t where id in (1,2,3)").collect() // 9 seconds ``` After ```scala scala> sql("select id from t where id in (1,2,3)").collect() // less than 90ms ``` This PR has impacts over 35 queries of TPC-DS if the tables are cached. Note that this optimization is applied for `IN`. To apply `IN` predicate having more than 10 items, `spark.sql.optimizer.inSetConversionThreshold` option should be increased. ## How was this patch tested? Pass the Jenkins tests (including new testcases). Author: Dongjoon Hyun <dongjoon@apache.org> Closes #13887 from dongjoon-hyun/SPARK-16186.	2016-06-24 17:13:13 -07:00
Davies Liu	4435de1bd3	[SPARK-16179][PYSPARK] fix bugs for Python udf in generate ## What changes were proposed in this pull request? This PR fix the bug when Python UDF is used in explode (generator), GenerateExec requires that all the attributes in expressions should be resolvable from children when creating, we should replace the children first, then replace it's expressions. ``` >>> df.select(explode(f(df))).show() Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/home/vlad/dev/spark/python/pyspark/sql/dataframe.py", line 286, in show print(self._jdf.showString(n, truncate)) File "/home/vlad/dev/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 933, in __call__ File "/home/vlad/dev/spark/python/pyspark/sql/utils.py", line 63, in deco return f(a, **kw) File "/home/vlad/dev/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 312, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o52.showString. : org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, tree: Generate explode(<lambda>(_1#0L)), false, false, [col#15L] +- Scan ExistingRDD[_1#0L] at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:50) at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:387) at org.apache.spark.sql.execution.SparkPlan.makeCopy(SparkPlan.scala:69) at org.apache.spark.sql.execution.SparkPlan.makeCopy(SparkPlan.scala:45) at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:177) at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:144) at org.apache.spark.sql.execution.python.ExtractPythonUDFs$.org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract(ExtractPythonUDFs.scala:153) at org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:114) at org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:113) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179) at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:298) at org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:113) at org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:93) at org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:95) at org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:95) at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124) at scala.collection.immutable.List.foldLeft(List.scala:84) at org.apache.spark.sql.execution.QueryExecution.prepareForExecution(QueryExecution.scala:95) at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:85) at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:85) at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2557) at org.apache.spark.sql.Dataset.head(Dataset.scala:1923) at org.apache.spark.sql.Dataset.take(Dataset.scala:2138) at org.apache.spark.sql.Dataset.showString(Dataset.scala:239) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at py4j.Gateway.invoke(Gateway.java:280) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:211) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.reflect.InvocationTargetException at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$makeCopy$1$$anonfun$apply$13.apply(TreeNode.scala:413) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$makeCopy$1$$anonfun$apply$13.apply(TreeNode.scala:413) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$makeCopy$1.apply(TreeNode.scala:412) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$makeCopy$1.apply(TreeNode.scala:387) at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:49) ... 42 more Caused by: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: pythonUDF0#20 at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:50) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:278) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179) at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284) at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:268) at org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:87) at org.apache.spark.sql.execution.GenerateExec.<init>(GenerateExec.scala:63) ... 52 more Caused by: java.lang.RuntimeException: Couldn't find pythonUDF0#20 in [_1#0L] at scala.sys.package$.error(package.scala:27) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1$$anonfun$applyOrElse$1.apply(BoundAttribute.scala:94) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1$$anonfun$applyOrElse$1.apply(BoundAttribute.scala:88) at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:49) ... 67 more ``` ## How was this patch tested? Added regression tests. Author: Davies Liu <davies@databricks.com> Closes #13883 from davies/udf_in_generate.	2016-06-24 15:20:39 -07:00
Reynold Xin	5f8de21606	[SQL][MINOR] Simplify data source predicate filter translation. ## What changes were proposed in this pull request? This is a small patch to rewrite the predicate filter translation in DataSourceStrategy. The original code used excessive functional constructs (e.g. unzip) and was very difficult to understand. ## How was this patch tested? Should be covered by existing tests. Author: Reynold Xin <rxin@databricks.com> Closes #13889 from rxin/simplify-predicate-filter.	2016-06-24 14:44:24 -07:00
Sean Owen	158af162ea	[SPARK-16129][CORE][SQL] Eliminate direct use of commons-lang classes in favor of commons-lang3 ## What changes were proposed in this pull request? Replace use of `commons-lang` in favor of `commons-lang3` and forbid the former via scalastyle; remove `NotImplementedException` from `comons-lang` in favor of JDK `UnsupportedOperationException` ## How was this patch tested? Jenkins tests Author: Sean Owen <sowen@cloudera.com> Closes #13843 from srowen/SPARK-16129.	2016-06-24 10:35:54 +01:00
Wenchen Fan	6a3c6276f5	[SQL][MINOR] ParserUtils.operationNotAllowed should throw exception directly ## What changes were proposed in this pull request? It's weird that `ParserUtils.operationNotAllowed` returns an exception and the caller throw it. ## How was this patch tested? N/A Author: Wenchen Fan <wenchen@databricks.com> Closes #13874 from cloud-fan/style.	2016-06-23 20:20:55 -07:00
Sameer Agarwal	cc71d4fa37	[SPARK-16123] Avoid NegativeArraySizeException while reserving additional capacity in VectorizedColumnReader ## What changes were proposed in this pull request? This patch fixes an overflow bug in vectorized parquet reader where both off-heap and on-heap variants of `ColumnVector.reserve()` can unfortunately overflow while reserving additional capacity during reads. ## How was this patch tested? Manual Tests Author: Sameer Agarwal <sameer@databricks.com> Closes #13832 from sameeragarwal/negative-array.	2016-06-23 18:21:41 -07:00
Dongjoon Hyun	264bc63623	[SPARK-16165][SQL] Fix the update logic for InMemoryTableScanExec.readBatches ## What changes were proposed in this pull request? Currently, `readBatches` accumulator of `InMemoryTableScanExec` is updated only when `spark.sql.inMemoryColumnarStorage.partitionPruning` is true. Although this metric is used for only testing purpose, we had better have correct metric without considering SQL options. ## How was this patch tested? Pass the Jenkins tests (including a new testcase). Author: Dongjoon Hyun <dongjoon@apache.org> Closes #13870 from dongjoon-hyun/SPARK-16165.	2016-06-24 07:19:20 +08:00
Shixiong Zhu	0e4bdebece	[SPARK-15443][SQL] Fix 'explain' for streaming Dataset ## What changes were proposed in this pull request? - Fix the `explain` command for streaming Dataset/DataFrame. E.g., ``` == Parsed Logical Plan == 'SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true) AS value#7] +- 'MapElements <function1>, obj#6: java.lang.String +- 'DeserializeToObject unresolveddeserializer(createexternalrow(getcolumnbyordinal(0, StringType).toString, StructField(value,StringType,true))), obj#5: org.apache.spark.sql.Row +- Filter <function1>.apply +- StreamingRelation FileSource[/Users/zsx/stream], [value#0] == Analyzed Logical Plan == value: string SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true) AS value#7] +- MapElements <function1>, obj#6: java.lang.String +- DeserializeToObject createexternalrow(value#0.toString, StructField(value,StringType,true)), obj#5: org.apache.spark.sql.Row +- Filter <function1>.apply +- StreamingRelation FileSource[/Users/zsx/stream], [value#0] == Optimized Logical Plan == SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true) AS value#7] +- MapElements <function1>, obj#6: java.lang.String +- DeserializeToObject createexternalrow(value#0.toString, StructField(value,StringType,true)), obj#5: org.apache.spark.sql.Row +- Filter <function1>.apply +- StreamingRelation FileSource[/Users/zsx/stream], [value#0] == Physical Plan == SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true) AS value#7] +- MapElements <function1>, obj#6: java.lang.String +- DeserializeToObject createexternalrow(value#0.toString, StructField(value,StringType,true)), obj#5: org.apache.spark.sql.Row +- Filter <function1>.apply +- StreamingRelation FileSource[/Users/zsx/stream], [value#0] ``` - Add `StreamingQuery.explain` to display the last execution plan. E.g., ``` == Parsed Logical Plan == SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true) AS value#7] +- MapElements <function1>, obj#6: java.lang.String +- DeserializeToObject createexternalrow(value#12.toString, StructField(value,StringType,true)), obj#5: org.apache.spark.sql.Row +- Filter <function1>.apply +- Relation[value#12] text == Analyzed Logical Plan == value: string SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true) AS value#7] +- MapElements <function1>, obj#6: java.lang.String +- DeserializeToObject createexternalrow(value#12.toString, StructField(value,StringType,true)), obj#5: org.apache.spark.sql.Row +- Filter <function1>.apply +- Relation[value#12] text == Optimized Logical Plan == SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true) AS value#7] +- MapElements <function1>, obj#6: java.lang.String +- DeserializeToObject createexternalrow(value#12.toString, StructField(value,StringType,true)), obj#5: org.apache.spark.sql.Row +- Filter <function1>.apply +- Relation[value#12] text == Physical Plan == SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true) AS value#7] +- MapElements <function1>, obj#6: java.lang.String +- DeserializeToObject createexternalrow(value#12.toString, StructField(value,StringType,true)), obj#5: org.apache.spark.sql.Row +- Filter <function1>.apply +- *Scan text [value#12] Format: org.apache.spark.sql.execution.datasources.text.TextFileFormat1836ab91, InputPaths: file:/Users/zsx/stream/a.txt, file:/Users/zsx/stream/b.txt, file:/Users/zsx/stream/c.txt, PushedFilters: [], ReadSchema: struct<value:string> ``` ## How was this patch tested? The added unit tests. Author: Shixiong Zhu <shixiong@databricks.com> Closes #13815 from zsxwing/sdf-explain.	2016-06-23 16:04:16 -07:00
Davies Liu	10396d9505	[SPARK-16163] [SQL] Cache the statistics for logical plans ## What changes were proposed in this pull request? This calculation of statistics is not trivial anymore, it could be very slow on large query (for example, TPC-DS Q64 took several minutes to plan). During the planning of a query, the statistics of any logical plan should not change (even InMemoryRelation), so we should use `lazy val` to cache the statistics. For InMemoryRelation, the statistics could be updated after materialization, it's only useful when used in another query (before planning), because once we finished the planning, the statistics will not be used anymore. ## How was this patch tested? Testsed with TPC-DS Q64, it could be planned in a second after the patch. Author: Davies Liu <davies@databricks.com> Closes #13871 from davies/fix_statistics.	2016-06-23 11:48:48 -07:00
Shixiong Zhu	d85bb10ce4	[SPARK-16116][SQL] ConsoleSink should not require checkpointLocation ## What changes were proposed in this pull request? When the user uses `ConsoleSink`, we should use a temp location if `checkpointLocation` is not specified. ## How was this patch tested? The added unit test. Author: Shixiong Zhu <shixiong@databricks.com> Closes #13817 from zsxwing/console-checkpoint.	2016-06-23 10:46:20 -07:00
Cheng Lian	f34b5c62b2	[SQL][MINOR] Fix minor formatting issues in SHOW CREATE TABLE output ## What changes were proposed in this pull request? This PR fixes two minor formatting issues appearing in `SHOW CREATE TABLE` output. Before: ``` CREATE EXTERNAL TABLE ... ... WITH SERDEPROPERTIES ('serialization.format' = '1' ) ... TBLPROPERTIES ('avro.schema.url' = '/tmp/avro/test.avsc', 'transient_lastDdlTime' = '1466638180') ``` After: ``` CREATE EXTERNAL TABLE ... ... WITH SERDEPROPERTIES ( 'serialization.format' = '1' ) ... TBLPROPERTIES ( 'avro.schema.url' = '/tmp/avro/test.avsc', 'transient_lastDdlTime' = '1466638180' ) ``` ## How was this patch tested? Manually tested. Author: Cheng Lian <lian@databricks.com> Closes #13864 from liancheng/show-create-table-format-fix.	2016-06-22 22:28:54 -07:00
bomeng	925884a612	[SPARK-15230][SQL] distinct() does not handle column name with dot properly ## What changes were proposed in this pull request? When table is created with column name containing dot, distinct() will fail to run. For example, ```scala val rowRDD = sparkContext.parallelize(Seq(Row(1), Row(1), Row(2))) val schema = StructType(Array(StructField("column.with.dot", IntegerType, nullable = false))) val df = spark.createDataFrame(rowRDD, schema) ``` running the following will have no problem: ```scala df.select(new Column("`column.with.dot`")) ``` but running the query with additional distinct() will cause exception: ```scala df.select(new Column("`column.with.dot`")).distinct() ``` The issue is that distinct() will try to resolve the column name, but the column name in the schema does not have backtick with it. So the solution is to add the backtick before passing the column name to resolve(). ## How was this patch tested? Added a new test case. Author: bomeng <bmeng@us.ibm.com> Closes #13140 from bomeng/SPARK-15230.	2016-06-23 11:06:19 +08:00
Reynold Xin	37f3be5d29	[SPARK-16159][SQL] Move RDD creation logic from FileSourceStrategy.apply ## What changes were proposed in this pull request? We embed partitioning logic in FileSourceStrategy.apply, making the function very long. This is a small refactoring to move it into its own functions. Eventually we would be able to move the partitioning functions into a physical operator, rather than doing it in physical planning. ## How was this patch tested? This is a simple code move. Author: Reynold Xin <rxin@databricks.com> Closes #13862 from rxin/SPARK-16159.	2016-06-22 18:19:07 -07:00
gatorsmile	9f990fa3f9	[SPARK-16024][SQL][TEST] Verify Column Comment for Data Source Tables #### What changes were proposed in this pull request? This PR is to improve test coverage. It verifies whether `Comment` of `Column` can be appropriate handled. The test cases verify the related parts in Parser, both SQL and DataFrameWriter interface, and both Hive Metastore catalog and In-memory catalog. #### How was this patch tested? N/A Author: gatorsmile <gatorsmile@gmail.com> Closes #13764 from gatorsmile/dataSourceComment.	2016-06-23 09:12:20 +08:00
Wenchen Fan	01277d4b25	[SPARK-16097][SQL] Encoders.tuple should handle null object correctly ## What changes were proposed in this pull request? Although the top level input object can not be null, but when we use `Encoders.tuple` to combine 2 encoders, their input objects are not top level anymore and can be null. We should handle this case. ## How was this patch tested? new test in DatasetSuite Author: Wenchen Fan <wenchen@databricks.com> Closes #13807 from cloud-fan/bug.	2016-06-22 18:32:14 +08:00
Yin Huai	39ad53f7ff	[SPARK-16121] ListingFileCatalog does not list in parallel anymore ## What changes were proposed in this pull request? Seems the fix of SPARK-14959 breaks the parallel partitioning discovery. This PR fixes the problem ## How was this patch tested? Tested manually. (This PR also adds a proper test for SPARK-14959) Author: Yin Huai <yhuai@databricks.com> Closes #13830 from yhuai/SPARK-16121.	2016-06-22 18:07:07 +08:00
gatorsmile	0e3ce75332	[SPARK-15644][MLLIB][SQL] Replace SQLContext with SparkSession in MLlib #### What changes were proposed in this pull request? This PR is to use the latest `SparkSession` to replace the existing `SQLContext` in `MLlib`. `SQLContext` is removed from `MLlib`. Also fix a test case issue in `BroadcastJoinSuite`. BTW, `SQLContext` is not being used in the `MLlib` test suites. #### How was this patch tested? Existing test cases. Author: gatorsmile <gatorsmile@gmail.com> Author: xiaoli <lixiao1983@gmail.com> Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local> Closes #13380 from gatorsmile/sqlContextML.	2016-06-21 23:12:08 -07:00
hyukjinkwon	7580f3041a	[SPARK-16104] [SQL] Do not creaate CSV writer object for every flush when writing ## What changes were proposed in this pull request? This PR let `CsvWriter` object is not created for each time but able to be reused. This way was taken after from JSON data source. Original `CsvWriter` was being created for each row but it was enhanced in https://github.com/apache/spark/pull/13229. However, it still creates `CsvWriter` object for each `flush()` in `LineCsvWriter`. It seems it does not have to close the object and re-create this for every flush. It follows the original logic as it is but `CsvWriter` is reused by reseting `CharArrayWriter`. ## How was this patch tested? Existing tests should cover this. Author: hyukjinkwon <gurwls223@gmail.com> Closes #13809 from HyukjinKwon/write-perf.	2016-06-21 21:58:38 -07:00
Shixiong Zhu	c399c7f0e4	[SPARK-16002][SQL] Sleep when no new data arrives to avoid 100% CPU usage ## What changes were proposed in this pull request? Add a configuration to allow people to set a minimum polling delay when no new data arrives (default is 10ms). This PR also cleans up some INFO logs. ## How was this patch tested? Existing unit tests. Author: Shixiong Zhu <shixiong@databricks.com> Closes #13718 from zsxwing/SPARK-16002.	2016-06-21 12:42:49 -07:00
bomeng	f3a768b7b9	[SPARK-16084][SQL] Minor comments update for "DESCRIBE" table ## What changes were proposed in this pull request? 1. FORMATTED is actually supported, but partition is not supported; 2. Remove parenthesis as it is not necessary just like anywhere else. ## How was this patch tested? Minor issue. I do not think it needs a test case! Author: bomeng <bmeng@us.ibm.com> Closes #13791 from bomeng/SPARK-16084.	2016-06-21 08:51:43 +01:00
hyukjinkwon	4f7f1c4362	[SPARK-16044][SQL] input_file_name() returns empty strings in data sources based on NewHadoopRDD ## What changes were proposed in this pull request? This PR makes `input_file_name()` function return the file paths not empty strings for external data sources based on `NewHadoopRDD`, such as [spark-redshift](`cba5eee1ab/src/main/scala/com/databricks/spark/redshift/RedshiftRelation.scala (L149)`) and [spark-xml](https://github.com/databricks/spark-xml/blob/master/src/main/scala/com/databricks/spark/xml/util/XmlFile.scala#L39-L47). The codes with the external data sources below: ```scala df.select(input_file_name).show() ``` will produce - Before ``` +-----------------+ \|input_file_name()\| +-----------------+ \| \| +-----------------+ ``` - After ``` +--------------------+ \| input_file_name()\| +--------------------+ \|file:/private/var...\| +--------------------+ ``` ## How was this patch tested? Unit tests in `ColumnExpressionSuite`. Author: hyukjinkwon <gurwls223@gmail.com> Closes #13759 from HyukjinKwon/SPARK-16044.	2016-06-20 21:55:34 -07:00
gatorsmile	d9a3a2a0be	[SPARK-16056][SPARK-16057][SPARK-16058][SQL] Fix Multiple Bugs in Column Partitioning in JDBC Source #### What changes were proposed in this pull request? This PR is to fix the following bugs: Issue 1: Wrong Results when lowerBound is larger than upperBound in Column Partitioning ```scala spark.read.jdbc( url = urlWithUserAndPass, table = "TEST.seq", columnName = "id", lowerBound = 4, upperBound = 0, numPartitions = 3, connectionProperties = new Properties) ``` Before code changes: The returned results are wrong and the generated partitions are wrong: ``` Part 0 id < 3 or id is null Part 1 id >= 3 AND id < 2 Part 2 id >= 2 ``` After code changes: Issue an `IllegalArgumentException` exception: ``` Operation not allowed: the lower bound of partitioning column is larger than the upper bound. lowerBound: 5; higherBound: 1 ``` Issue 2: numPartitions is more than the number of key values between upper and lower bounds ```scala spark.read.jdbc( url = urlWithUserAndPass, table = "TEST.seq", columnName = "id", lowerBound = 1, upperBound = 5, numPartitions = 10, connectionProperties = new Properties) ``` Before code changes: Returned correct results but the generated partitions are very inefficient, like: ``` Partition 0: id < 1 or id is null Partition 1: id >= 1 AND id < 1 Partition 2: id >= 1 AND id < 1 Partition 3: id >= 1 AND id < 1 Partition 4: id >= 1 AND id < 1 Partition 5: id >= 1 AND id < 1 Partition 6: id >= 1 AND id < 1 Partition 7: id >= 1 AND id < 1 Partition 8: id >= 1 AND id < 1 Partition 9: id >= 1 ``` After code changes: Adjust `numPartitions` and can return the correct answers: ``` Partition 0: id < 2 or id is null Partition 1: id >= 2 AND id < 3 Partition 2: id >= 3 AND id < 4 Partition 3: id >= 4 ``` Issue 3: java.lang.ArithmeticException when numPartitions is zero ```Scala spark.read.jdbc( url = urlWithUserAndPass, table = "TEST.seq", columnName = "id", lowerBound = 0, upperBound = 4, numPartitions = 0, connectionProperties = new Properties) ``` Before code changes: Got the following exception: ``` java.lang.ArithmeticException: / by zero ``` After code changes: Able to return a correct answer by disabling column partitioning when numPartitions is equal to or less than zero #### How was this patch tested? Added test cases to verify the results Author: gatorsmile <gatorsmile@gmail.com> Closes #13773 from gatorsmile/jdbcPartitioning.	2016-06-20 21:49:33 -07:00
Reynold Xin	c775bf09e0	[SPARK-13792][SQL] Limit logging of bad records in CSV data source ## What changes were proposed in this pull request? This pull request adds a new option (maxMalformedLogPerPartition) in CSV reader to limit the maximum of logging message Spark generates per partition for malformed records. The error log looks something like ``` 16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4 16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4 16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4 16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4 16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4 16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4 16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4 16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4 16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4 16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4 16/06/20 18:50:14 WARN CSVRelation: More than 10 malformed records have been found on this partition. Malformed records from now on will not be logged. ``` Closes #12173 ## How was this patch tested? Manually tested. Author: Reynold Xin <rxin@databricks.com> Closes #13795 from rxin/SPARK-13792.	2016-06-20 21:46:12 -07:00
Kousuke Saruta	6daa8cf1a6	[SPARK-16061][SQL][MINOR] The property "spark.streaming.stateStore.maintenanceInterval" should be renamed to "spark.sql.streaming.stateStore.maintenanceInterval" ## What changes were proposed in this pull request? The property spark.streaming.stateStore.maintenanceInterval should be renamed and harmonized with other properties related to Structured Streaming like spark.sql.streaming.stateStore.minDeltasForSnapshot. ## How was this patch tested? Existing unit tests. Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #13777 from sarutak/SPARK-16061.	2016-06-20 15:12:40 -07:00
Tathagata Das	b99129cc45	[SPARK-15982][SPARK-16009][SPARK-16007][SQL] Harmonize the behavior of DataFrameReader.text/csv/json/parquet/orc ## What changes were proposed in this pull request? Issues with current reader behavior. - `text()` without args returns an empty DF with no columns -> inconsistent, its expected that text will always return a DF with `value` string field, - `textFile()` without args fails with exception because of the above reason, it expected the DF returned by `text()` to have a `value` field. - `orc()` does not have var args, inconsistent with others - `json(single-arg)` was removed, but that caused source compatibility issues - [SPARK-16009](https://issues.apache.org/jira/browse/SPARK-16009) - user specified schema was not respected when `text/csv/...` were used with no args - [SPARK-16007](https://issues.apache.org/jira/browse/SPARK-16007) The solution I am implementing is to do the following. - For each format, there will be a single argument method, and a vararg method. For json, parquet, csv, text, this means adding json(string), etc.. For orc, this means adding orc(varargs). - Remove the special handling of text(), csv(), etc. that returns empty dataframe with no fields. Rather pass on the empty sequence of paths to the datasource, and let each datasource handle it right. For e.g, text data source, should return empty DF with schema (value: string) - Deduped docs and fixed their formatting. ## How was this patch tested? Added new unit tests for Scala and Java tests Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #13727 from tdas/SPARK-15982.	2016-06-20 14:52:28 -07:00
Shixiong Zhu	5cfabec872	[SPARK-16050][TESTS] Remove the flaky test: ConsoleSinkSuite ## What changes were proposed in this pull request? ConsoleSinkSuite just collects content from stdout and compare them with the expected string. However, because Spark may not stop some background threads at once, there is a race condition that other threads are outputting logs to stdout while ConsoleSinkSuite is running. Then it will make ConsoleSinkSuite fail. Therefore, I just deleted `ConsoleSinkSuite`. If we want to test ConsoleSinkSuite in future, we should refactoring ConsoleSink to make it testable instead of depending on stdout. Therefore, this test is useless and I just delete it. ## How was this patch tested? Just removed a flaky test. Author: Shixiong Zhu <shixiong@databricks.com> Closes #13776 from zsxwing/SPARK-16050.	2016-06-20 10:35:37 -07:00
Yin Huai	905f774b71	[SPARK-16030][SQL] Allow specifying static partitions when inserting to data source tables ## What changes were proposed in this pull request? This PR adds the static partition support to INSERT statement when the target table is a data source table. ## How was this patch tested? New tests in InsertIntoHiveTableSuite and DataSourceAnalysisSuite. Note: This PR is based on https://github.com/apache/spark/pull/13766. The last commit is the actual change. Author: Yin Huai <yhuai@databricks.com> Closes #13769 from yhuai/SPARK-16030-1.	2016-06-20 20:17:47 +08:00
Yin Huai	6d0f921aed	[SPARK-16036][SPARK-16037][SPARK-16034][SQL] Follow up code clean up and improvement ## What changes were proposed in this pull request? This PR is the follow-up PR for https://github.com/apache/spark/pull/13754/files and https://github.com/apache/spark/pull/13749. I will comment inline to explain my changes. ## How was this patch tested? Existing tests. Author: Yin Huai <yhuai@databricks.com> Closes #13766 from yhuai/caseSensitivity.	2016-06-19 21:45:53 -07:00
Matei Zaharia	4f17fddcd5	[SPARK-16031] Add debug-only socket source in Structured Streaming ## What changes were proposed in this pull request? This patch adds a text-based socket source similar to the one in Spark Streaming for debugging and tutorials. The source is clearly marked as debug-only so that users don't try to run it in production applications, because this type of source cannot provide HA without storing a lot of state in Spark. ## How was this patch tested? Unit tests and manual tests in spark-shell. Author: Matei Zaharia <matei@databricks.com> Closes #13748 from mateiz/socket-source.	2016-06-19 21:27:04 -07:00
Sean Zhong	ce3b98bae2	[SPARK-16034][SQL] Checks the partition columns when calling dataFrame.write.mode("append").saveAsTable ## What changes were proposed in this pull request? `DataFrameWriter` can be used to append data to existing data source tables. It becomes tricky when partition columns used in `DataFrameWriter.partitionBy(columns)` don't match the actual partition columns of the underlying table. This pull request enforces the check so that the partition columns of these two always match. ## How was this patch tested? Unit test. Author: Sean Zhong <seanzhong@databricks.com> Closes #13749 from clockfly/SPARK-16034.	2016-06-18 10:41:33 -07:00
Wenchen Fan	3d010c8375	[SPARK-16036][SPARK-16037][SQL] fix various table insertion problems ## What changes were proposed in this pull request? The current table insertion has some weird behaviours: 1. inserting into a partitioned table with mismatch columns has confusing error message for hive table, and wrong result for datasource table 2. inserting into a partitioned table without partition list has wrong result for hive table. This PR fixes these 2 problems. ## How was this patch tested? new test in hive `SQLQuerySuite` Author: Wenchen Fan <wenchen@databricks.com> Closes #13754 from cloud-fan/insert2.	2016-06-18 10:32:27 -07:00
Andrew Or	35a2f3c012	[SPARK-16023][SQL] Move InMemoryRelation to its own file ## What changes were proposed in this pull request? Improve readability of `InMemoryTableScanExec.scala`, which has too much stuff in it. ## How was this patch tested? Jenkins Author: Andrew Or <andrew@databricks.com> Closes #13742 from andrewor14/move-inmemory-relation.	2016-06-17 23:41:09 -07:00
Shixiong Zhu	d0ac0e6f43	[SPARK-16020][SQL] Fix complete mode aggregation with console sink ## What changes were proposed in this pull request? We cannot use `limit` on DataFrame in ConsoleSink because it will use a wrong planner. This PR just collects `DataFrame` and calls `show` on a batch DataFrame based on the result. This is fine since ConsoleSink is only for debugging. ## How was this patch tested? Manually confirmed ConsoleSink now works with complete mode aggregation. Author: Shixiong Zhu <shixiong@databricks.com> Closes #13740 from zsxwing/complete-console.	2016-06-17 21:58:10 -07:00
Felix Cheung	8c198e246d	[SPARK-15159][SPARKR] SparkR SparkSession API ## What changes were proposed in this pull request? This PR introduces the new SparkSession API for SparkR. `sparkR.session.getOrCreate()` and `sparkR.session.stop()` "getOrCreate" is a bit unusual in R but it's important to name this clearly. SparkR implementation should - SparkSession is the main entrypoint (vs SparkContext; due to limited functionality supported with SparkContext in SparkR) - SparkSession replaces SQLContext and HiveContext (both a wrapper around SparkSession, and because of API changes, supporting all 3 would be a lot more work) - Changes to SparkSession is mostly transparent to users due to SPARK-10903 - Full backward compatibility is expected - users should be able to initialize everything just in Spark 1.6.1 (`sparkR.init()`), but with deprecation warning - Mostly cosmetic changes to parameter list - users should be able to move to `sparkR.session.getOrCreate()` easily - An advanced syntax with named parameters (aka varargs aka "...") is supported; that should be closer to the Builder syntax that is in Scala/Python (which unfortunately does not work in R because it will look like this: `enableHiveSupport(config(config(master(appName(builder(), "foo"), "local"), "first", "value"), "next, "value"))` - Updating config on an existing SparkSession is supported, the behavior is the same as Python, in which config is applied to both SparkContext and SparkSession - Some SparkSession changes are not matched in SparkR, mostly because it would be breaking API change: `catalog` object, `createOrReplaceTempView` - Other SQLContext workarounds are replicated in SparkR, eg. `tables`, `tableNames` - `sparkR` shell is updated to use the SparkSession entrypoint (`sqlContext` is removed, just like with Scale/Python) - All tests are updated to use the SparkSession entrypoint - A bug in `read.jdbc` is fixed TODO - [x] Add more tests - [ ] Separate PR - update all roxygen2 doc coding example - [ ] Separate PR - update SparkR programming guide ## How was this patch tested? unit tests, manual tests shivaram sun-rui rxin Author: Felix Cheung <felixcheung_m@hotmail.com> Author: felixcheung <felixcheung_m@hotmail.com> Closes #13635 from felixcheung/rsparksession.	2016-06-17 21:36:01 -07:00
Cheng Lian	10b671447b	[SPARK-16033][SQL] insertInto() can't be used together with partitionBy() ## What changes were proposed in this pull request? When inserting into an existing partitioned table, partitioning columns should always be determined by catalog metadata of the existing table to be inserted. Extra `partitionBy()` calls don't make sense, and mess up existing data because newly inserted data may have wrong partitioning directory layout. ## How was this patch tested? New test case added in `InsertIntoHiveTableSuite`. Author: Cheng Lian <lian@databricks.com> Closes #13747 from liancheng/spark-16033-insert-into-without-partition-by.	2016-06-17 20:13:04 -07:00
hyukjinkwon	ebb9a3b6fd	[SPARK-15916][SQL] JDBC filter push down should respect operator precedence ## What changes were proposed in this pull request? This PR fixes the problem that the precedence order is messed when pushing where-clause expression to JDBC layer. Case 1: For sql `select * from table where (a or b) and c`, the where-clause is wrongly converted to JDBC where-clause `a or (b and c)` after filter push down. The consequence is that JDBC may returns less or more rows than expected. Case 2: For sql `select * from table where always_false_condition`, the result table may not be empty if the JDBC RDD is partitioned using where-clause: ``` spark.read.jdbc(url, table, predicates = Array("partition 1 where clause", "partition 2 where clause"...) ``` ## How was this patch tested? Unit test. This PR also close #13640 Author: hyukjinkwon <gurwls223@gmail.com> Author: Sean Zhong <seanzhong@databricks.com> Closes #13743 from clockfly/SPARK-15916.	2016-06-17 17:11:38 -07:00
Sameer Agarwal	34d6c4cd11	Remove non-obvious conf settings from TPCDS benchmark ## What changes were proposed in this pull request? My fault -- these 2 conf entries are mysteriously hidden inside the benchmark code and makes it non-obvious to disable whole stage codegen and/or the vectorized parquet reader. PS: Didn't attach a JIRA as this change should otherwise be a no-op (both these conf are enabled by default in Spark) ## How was this patch tested? N/A Author: Sameer Agarwal <sameer@databricks.com> Closes #13726 from sameeragarwal/tpcds-conf.	2016-06-17 09:47:41 -07:00
Davies Liu	ef43b4ed87	[SPARK-15811][SQL] fix the Python UDF in Scala 2.10 ## What changes were proposed in this pull request? Iterator can't be serialized in Scala 2.10, we should force it into a array to make sure that . ## How was this patch tested? Build with Scala 2.10 and ran all the Python unit tests manually (will be covered by a jenkins build). Author: Davies Liu <davies@databricks.com> Closes #13717 from davies/fix_udf_210.	2016-06-17 00:34:33 -07:00
Pete Robbins	5ada606144	[SPARK-15822] [SQL] Prevent byte array backed classes from referencing freed memory ## What changes were proposed in this pull request? `UTF8String` and all `Unsafe` classes are backed by either on-heap or off-heap byte arrays. The code generated version `SortMergeJoin` buffers the left hand side join keys during iteration. This was actually problematic in off-heap mode when one of the keys is a `UTF8String` (or any other 'Unsafe` object) and the left hand side iterator was exhausted (and released its memory); the buffered keys would reference freed memory. This causes Seg-faults and all kinds of other undefined behavior when we would use one these buffered keys. This PR fixes this problem by creating copies of the buffered variables. I have added a general method to the `CodeGenerator` for this. I have checked all places in which this could happen, and only `SortMergeJoin` had this problem. This PR is largely based on the work of robbinspg and he should be credited for this. closes https://github.com/apache/spark/pull/13707 ## How was this patch tested? Manually tested on problematic workloads. Author: Pete Robbins <robbinspg@gmail.com> Author: Herman van Hovell <hvanhovell@databricks.com> Closes #13723 from hvanhovell/SPARK-15822-2.	2016-06-16 22:27:32 -07:00
Yin Huai	d9c6628c47	[SPARK-15991] SparkContext.hadoopConfiguration should be always the base of hadoop conf created by SessionState ## What changes were proposed in this pull request? Before this patch, after a SparkSession has been created, hadoop conf set directly to SparkContext.hadoopConfiguration will not affect the hadoop conf created by SessionState. This patch makes the change to always use SparkContext.hadoopConfiguration as the base. This patch also changes the behavior of hive-site.xml support added in https://github.com/apache/spark/pull/12689/. With this patch, we will load hive-site.xml to SparkContext.hadoopConfiguration. ## How was this patch tested? New test in SparkSessionBuilderSuite. Author: Yin Huai <yhuai@databricks.com> Closes #13711 from yhuai/SPARK-15991.	2016-06-16 17:06:24 -07:00
Huaxin Gao	62d2fa5e99	[SPARK-15749][SQL] make the error message more meaningful ## What changes were proposed in this pull request? For table test1 (C1 varchar (10), C2 varchar (10)), when I insert a row using ``` sqlContext.sql("insert into test1 values ('abc', 'def', 1)") ``` I got error message ``` Exception in thread "main" java.lang.RuntimeException: RelationC1#0,C2#1 JDBCRelation(test1) requires that the query in the SELECT clause of the INSERT INTO/OVERWRITE statement generates the same number of columns as its schema. ``` The error message is a little confusing. In my simple insert statement, it doesn't have a SELECT clause. I will change the error message to a more general one ``` Exception in thread "main" java.lang.RuntimeException: RelationC1#0,C2#1 JDBCRelation(test1) requires that the data to be inserted have the same number of columns as the target table. ``` ## How was this patch tested? I tested the patch using my simple unit test, but it's a very trivial change and I don't think I need to check in any test. Author: Huaxin Gao <huaxing@us.ibm.com> Closes #13492 from huaxingao/spark-15749.	2016-06-16 14:37:10 -07:00
Dongjoon Hyun	2d27eb1e75	[MINOR][DOCS][SQL] Fix some comments about types(TypeCoercion,Partition) and exceptions. ## What changes were proposed in this pull request? This PR contains a few changes on code comments. - `HiveTypeCoercion` is renamed into `TypeCoercion`. - `NoSuchDatabaseException` is only used for the absence of database. - For partition type inference, only `DoubleType` is considered. ## How was this patch tested? N/A Author: Dongjoon Hyun <dongjoon@apache.org> Closes #13674 from dongjoon-hyun/minor_doc_types.	2016-06-16 14:27:09 -07:00
Cheng Lian	7a89f2adbb	[SQL] Minor HashAggregateExec string output fixes ## What changes were proposed in this pull request? This PR fixes some minor `.toString` format issues for `HashAggregateExec`. Before: ``` HashAggregate(key=[a#234L,b#235L], functions=[count(1),max(c#236L)], output=[a#234L,b#235L,count(c)#247L,max(c)#248L]) ``` After: ``` HashAggregate(keys=[a#234L, b#235L], functions=[count(1), max(c#236L)], output=[a#234L, b#235L, count(c)#247L, max(c)#248L]) ``` ## How was this patch tested? Manually tested. Author: Cheng Lian <lian@databricks.com> Closes #13710 from liancheng/minor-agg-string-fix.	2016-06-16 14:20:44 -07:00
Herman van Hovell	f9bf15d9bd	[SPARK-15977][SQL] Fix TRUNCATE TABLE for Spark specific datasource tables ## What changes were proposed in this pull request? `TRUNCATE TABLE` is currently broken for Spark specific datasource tables (json, csv, ...). This PR correctly sets the location for these datasources which allows them to be truncated. ## How was this patch tested? Extended the datasources `TRUNCATE TABLE` tests in `DDLSuite`. Author: Herman van Hovell <hvanhovell@databricks.com> Closes #13697 from hvanhovell/SPARK-15977.	2016-06-16 13:47:36 -07:00
Cheng Lian	9ea0d5e326	[SPARK-15983][SQL] Removes FileFormat.prepareRead ## What changes were proposed in this pull request? Interface method `FileFormat.prepareRead()` was added in #12088 to handle a special case in the LibSVM data source. However, the semantics of this interface method isn't intuitive: it returns a modified version of the data source options map. Considering that the LibSVM case can be easily handled using schema metadata inside `inferSchema`, we can remove this interface method to keep the `FileFormat` interface clean. ## How was this patch tested? Existing tests. Author: Cheng Lian <lian@databricks.com> Closes #13698 from liancheng/remove-prepare-read.	2016-06-16 10:24:29 -07:00
gatorsmile	6451cf9270	[SPARK-15862][SQL] Better Error Message When Having Database Name in CACHE TABLE AS SELECT #### What changes were proposed in this pull request? ~~If the temp table already exists, we should not silently replace it when doing `CACHE TABLE AS SELECT`. This is inconsistent with the behavior of `CREAT VIEW` or `CREATE TABLE`. This PR is to fix this silent drop.~~ ~~Maybe, we also can introduce new syntax for replacing the existing one. For example, in Hive, to replace a view, the syntax should be like `ALTER VIEW AS SELECT` or `CREATE OR REPLACE VIEW AS SELECT`~~ The table name in `CACHE TABLE AS SELECT` should NOT contain database prefix like "database.table". Thus, this PR captures this in Parser and outputs a better error message, instead of reporting the view already exists. In addition, refactoring the `Parser` to generate table identifiers instead of returning the table name string. #### How was this patch tested? - Added a test case for caching and uncaching qualified table names - Fixed a few test cases that do not drop temp table at the end - Added the related test case for the issue resolved in this PR Author: gatorsmile <gatorsmile@gmail.com> Author: xiaoli <lixiao1983@gmail.com> Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local> Closes #13572 from gatorsmile/cacheTableAsSelect.	2016-06-16 10:01:59 -07:00
Narine Kokhlikyan	7c6c692637	[SPARK-12922][SPARKR][WIP] Implement gapply() on DataFrame in SparkR ## What changes were proposed in this pull request? gapply() applies an R function on groups grouped by one or more columns of a DataFrame, and returns a DataFrame. It is like GroupedDataSet.flatMapGroups() in the Dataset API. Please, let me know what do you think and if you have any ideas to improve it. Thank you! ## How was this patch tested? Unit tests. 1. Primitive test with different column types 2. Add a boolean column 3. Compute average by a group Author: Narine Kokhlikyan <narine.kokhlikyan@gmail.com> Author: NarineK <narine.kokhlikyan@us.ibm.com> Closes #12836 from NarineK/gapply2.	2016-06-15 21:42:05 -07:00
Herman van Hovell	b75f454f94	[SPARK-15824][SQL] Execute WITH .... INSERT ... statements immediately ## What changes were proposed in this pull request? We currently immediately execute `INSERT` commands when they are issued. This is not the case as soon as we use a `WITH` to define common table expressions, for example: ```sql WITH tbl AS (SELECT * FROM x WHERE id = 10) INSERT INTO y SELECT * FROM tbl ``` This PR fixes this problem. This PR closes https://github.com/apache/spark/pull/13561 (which fixes the a instance of this problem in the ThriftSever). ## How was this patch tested? Added a test to `InsertSuite` Author: Herman van Hovell <hvanhovell@databricks.com> Closes #13678 from hvanhovell/SPARK-15824.	2016-06-15 21:33:26 -07:00
Wayne Song	ebdd751272	[SPARK-13498][SQL] Increment the recordsRead input metric for JDBC data source ## What changes were proposed in this pull request? This patch brings https://github.com/apache/spark/pull/11373 up-to-date and increments the record count for JDBC data source. Closes #11373. ## How was this patch tested? N/A Author: Reynold Xin <rxin@databricks.com> Closes #13694 from rxin/SPARK-13498.	2016-06-15 20:09:47 -07:00
Reynold Xin	865e7cc38d	[SPARK-15979][SQL] Rename various Parquet support classes. ## What changes were proposed in this pull request? This patch renames various Parquet support classes from CatalystAbc to ParquetAbc. This new naming makes more sense for two reasons: 1. These are not optimizer related (i.e. Catalyst) classes. 2. We are in the Spark code base, and as a result it'd be more clear to call out these are Parquet support classes, rather than some Spark classes. ## How was this patch tested? Renamed test cases as well. Author: Reynold Xin <rxin@databricks.com> Closes #13696 from rxin/parquet-rename.	2016-06-15 20:05:08 -07:00
KaiXinXiaoLei	3e6d567a46	[SPARK-12492][SQL] Add missing SQLExecution.withNewExecutionId for hiveResultString ## What changes were proposed in this pull request? Add missing SQLExecution.withNewExecutionId for hiveResultString so that queries running in `spark-sql` will be shown in Web UI. Closes #13115 ## How was this patch tested? Existing unit tests. Author: KaiXinXiaoLei <huleilei1@huawei.com> Closes #13689 from zsxwing/pr13115.	2016-06-15 16:11:46 -07:00
Davies Liu	5389013acc	[SPARK-15888] [SQL] fix Python UDF with aggregate ## What changes were proposed in this pull request? After we move the ExtractPythonUDF rule into physical plan, Python UDF can't work on top of aggregate anymore, because they can't be evaluated before aggregate, should be evaluated after aggregate. This PR add another rule to extract these kind of Python UDF from logical aggregate, create a Project on top of Aggregate. ## How was this patch tested? Added regression tests. The plan of added test query looks like this: ``` == Parsed Logical Plan == 'Project [<lambda>('k, 's) AS t#26] +- Aggregate [<lambda>(key#5L)], [<lambda>(key#5L) AS k#17, sum(cast(<lambda>(value#6) as bigint)) AS s#22L] +- LogicalRDD [key#5L, value#6] == Analyzed Logical Plan == t: int Project [<lambda>(k#17, s#22L) AS t#26] +- Aggregate [<lambda>(key#5L)], [<lambda>(key#5L) AS k#17, sum(cast(<lambda>(value#6) as bigint)) AS s#22L] +- LogicalRDD [key#5L, value#6] == Optimized Logical Plan == Project [<lambda>(agg#29, agg#30L) AS t#26] +- Aggregate [<lambda>(key#5L)], [<lambda>(key#5L) AS agg#29, sum(cast(<lambda>(value#6) as bigint)) AS agg#30L] +- LogicalRDD [key#5L, value#6] == Physical Plan == Project [pythonUDF0#37 AS t#26] +- BatchEvalPython [<lambda>(agg#29, agg#30L)], [agg#29, agg#30L, pythonUDF0#37] +- HashAggregate(key=[<lambda>(key#5L)#31], functions=[sum(cast(<lambda>(value#6) as bigint))], output=[agg#29,agg#30L]) +- Exchange hashpartitioning(<lambda>(key#5L)#31, 200) +- *HashAggregate(key=[pythonUDF0#34 AS <lambda>(key#5L)#31], functions=[partial_sum(cast(pythonUDF1#35 as bigint))], output=[<lambda>(key#5L)#31,sum#33L]) +- BatchEvalPython [<lambda>(key#5L), <lambda>(value#6)], [key#5L, value#6, pythonUDF0#34, pythonUDF1#35] +- Scan ExistingRDD[key#5L,value#6] ``` Author: Davies Liu <davies@databricks.com> Closes #13682 from davies/fix_py_udf.	2016-06-15 13:38:04 -07:00
Yin Huai	e1585cc748	[SPARK-15959][SQL] Add the support of hive.metastore.warehouse.dir back ## What changes were proposed in this pull request? This PR adds the support of conf `hive.metastore.warehouse.dir` back. With this patch, the way of setting the warehouse dir is described as follows: * If `spark.sql.warehouse.dir` is set, `hive.metastore.warehouse.dir` will be automatically set to the value of `spark.sql.warehouse.dir`. The warehouse dir is effectively set to the value of `spark.sql.warehouse.dir`. * If `spark.sql.warehouse.dir` is not set but `hive.metastore.warehouse.dir` is set, `spark.sql.warehouse.dir` will be automatically set to the value of `hive.metastore.warehouse.dir`. The warehouse dir is effectively set to the value of `hive.metastore.warehouse.dir`. * If neither `spark.sql.warehouse.dir` nor `hive.metastore.warehouse.dir` is set, `hive.metastore.warehouse.dir` will be automatically set to the default value of `spark.sql.warehouse.dir`. The warehouse dir is effectively set to the default value of `spark.sql.warehouse.dir`. ## How was this patch tested? `set hive.metastore.warehouse.dir` in `HiveSparkSubmitSuite`. JIRA: https://issues.apache.org/jira/browse/SPARK-15959 Author: Yin Huai <yhuai@databricks.com> Closes #13679 from yhuai/hiveWarehouseDir.	2016-06-15 11:50:54 -07:00
Tathagata Das	9a5071996b	[SPARK-15953][WIP][STREAMING] Renamed ContinuousQuery to StreamingQuery Renamed for simplicity, so that its obvious that its related to streaming. Existing unit tests. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #13673 from tdas/SPARK-15953.	2016-06-15 10:46:07 -07:00
Herman van Hovell	de99c3d081	[SPARK-15960][SQL] Rename `spark.sql.enableFallBackToHdfsForStats` config ## What changes were proposed in this pull request? Since we are probably going to add more statistics related configurations in the future, I'd like to rename the newly added `spark.sql.enableFallBackToHdfsForStats` configuration option to `spark.sql.statistics.fallBackToHdfs`. This allows us to put all statistics related configurations in the same namespace. ## How was this patch tested? None - just a usability thing Author: Herman van Hovell <hvanhovell@databricks.com> Closes #13681 from hvanhovell/SPARK-15960.	2016-06-15 09:43:11 -07:00
bomeng	42a28caf10	[SPARK-15952][SQL] fix "show databases" ordering issue ## What changes were proposed in this pull request? Two issues I've found for "show databases" command: 1. The returned database name list was not sorted, it only works when "like" was used together; (HIVE will always return a sorted list) 2. When it is used as sql("show databases").show, it will output a table with column named as "result", but for sql("show tables").show, it will output the column name as "tableName", so I think we should be consistent and use "databaseName" at least. ## How was this patch tested? Updated existing test case to test its ordering as well. Author: bomeng <bmeng@us.ibm.com> Closes #13671 from bomeng/SPARK-15952.	2016-06-14 18:35:29 -07:00
Tathagata Das	214adb14b8	[SPARK-15933][SQL][STREAMING] Refactored DF reader-writer to use readStream and writeStream for streaming DFs ## What changes were proposed in this pull request? Currently, the DataFrameReader/Writer has method that are needed for streaming and non-streaming DFs. This is quite awkward because each method in them through runtime exception for one case or the other. So rather having half the methods throw runtime exceptions, its just better to have a different reader/writer API for streams. - [x] Python API!! ## How was this patch tested? Existing unit tests + two sets of unit tests for DataFrameReader/Writer and DataStreamReader/Writer. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #13653 from tdas/SPARK-15933.	2016-06-14 17:58:45 -07:00
Takeshi YAMAMURO	dae4d5db21	[SPARK-15247][SQL] Set the default number of partitions for reading parquet schemas ## What changes were proposed in this pull request? This pr sets the default number of partitions when reading parquet schemas. SQLContext#read#parquet currently yields at least n_executors * n_cores tasks even if parquet data consist of a single small file. This issue could increase the latency for small jobs. ## How was this patch tested? Manually tested and checked. Author: Takeshi YAMAMURO <linguin.m.s@gmail.com> Closes #13137 from maropu/SPARK-15247.	2016-06-14 13:05:56 -07:00
Cheng Lian	bd39ffe35c	[SPARK-15895][SQL] Filters out metadata files while doing partition discovery ## What changes were proposed in this pull request? Take the following directory layout as an example: ``` dir/ +- p0=0/ \|-_metadata +- p1=0/ \|-part-00001.parquet \|-part-00002.parquet \|-... ``` The `_metadata` file under `p0=0` shouldn't fail partition discovery. This PR filters output all metadata files whose names start with `_` while doing partition discovery. ## How was this patch tested? New unit test added in `ParquetPartitionDiscoverySuite`. Author: Cheng Lian <lian@databricks.com> Closes #13623 from liancheng/spark-15895-partition-disco-no-metafiles.	2016-06-14 12:13:12 -07:00
gatorsmile	df4ea6614d	[SPARK-15864][SQL] Fix Inconsistent Behaviors when Uncaching Non-cached Tables #### What changes were proposed in this pull request? To uncache a table, we have three different ways: - _SQL interface_: `UNCACHE TABLE` - _DataSet API_: `sparkSession.catalog.uncacheTable` - _DataSet API_: `sparkSession.table(tableName).unpersist()` When the table is not cached, - _SQL interface_: `UNCACHE TABLE non-cachedTable` -> no error message - _Dataset API_: `sparkSession.catalog.uncacheTable("non-cachedTable")` -> report a strange error message: ```requirement failed: Table [a: int] is not cached``` - _Dataset API_: `sparkSession.table("non-cachedTable").unpersist()` -> no error message This PR will make them consistent. No operation if the table has already been uncached. In addition, this PR also removes `uncacheQuery` and renames `tryUncacheQuery` to `uncacheQuery`, and documents it that it's noop if the table has already been uncached #### How was this patch tested? Improved the existing test case for verifying the cases when the table has not been cached. Also added test cases for verifying the cases when the table does not exist Author: gatorsmile <gatorsmile@gmail.com> Author: xiaoli <lixiao1983@gmail.com> Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local> Closes #13593 from gatorsmile/uncacheNonCachedTable.	2016-06-14 11:44:37 -07:00
Takuya UESHIN	c5b7355819	[SPARK-15915][SQL] Logical plans should use canonicalized plan when override sameResult. ## What changes were proposed in this pull request? `DataFrame` with plan overriding `sameResult` but not using canonicalized plan to compare can't cacheTable. The example is like: ``` val localRelation = Seq(1, 2, 3).toDF() localRelation.createOrReplaceTempView("localRelation") spark.catalog.cacheTable("localRelation") assert( localRelation.queryExecution.withCachedData.collect { case i: InMemoryRelation => i }.size == 1) ``` and this will fail as: ``` ArrayBuffer() had size 0 instead of expected size 1 ``` The reason is that when do `spark.catalog.cacheTable("localRelation")`, `CacheManager` tries to cache for the plan wrapped by `SubqueryAlias` but when planning for the DataFrame `localRelation`, `CacheManager` tries to find cached table for the not-wrapped plan because the plan for DataFrame `localRelation` is not wrapped. Some plans like `LocalRelation`, `LogicalRDD`, etc. override `sameResult` method, but not use canonicalized plan to compare so the `CacheManager` can't detect the plans are the same. This pr modifies them to use canonicalized plan when override `sameResult` method. ## How was this patch tested? Added a test to check if DataFrame with plan overriding sameResult but not using canonicalized plan to compare can cacheTable. Author: Takuya UESHIN <ueshin@happy-camper.st> Closes #13638 from ueshin/issues/SPARK-15915.	2016-06-14 10:52:13 -07:00
Sean Owen	6151d2641f	[MINOR] Clean up several build warnings, mostly due to internal use of old accumulators ## What changes were proposed in this pull request? Another PR to clean up recent build warnings. This particularly cleans up several instances of the old accumulator API usage in tests that are straightforward to update. I think this qualifies as "minor". ## How was this patch tested? Jenkins Author: Sean Owen <sowen@cloudera.com> Closes #13642 from srowen/BuildWarnings.	2016-06-14 09:40:07 -07:00
Sean Zhong	6e8cdef0cf	[SPARK-15914][SQL] Add deprecated method back to SQLContext for backward source code compatibility ## What changes were proposed in this pull request? Revert partial changes in SPARK-12600, and add some deprecated method back to SQLContext for backward source code compatibility. ## How was this patch tested? Manual test. Author: Sean Zhong <seanzhong@databricks.com> Closes #13637 from clockfly/SPARK-15914.	2016-06-14 09:10:27 -07:00
Sandeep Singh	1842cdd4ee	[SPARK-15663][SQL] SparkSession.catalog.listFunctions shouldn't include the list of built-in functions ## What changes were proposed in this pull request? SparkSession.catalog.listFunctions currently returns all functions, including the list of built-in functions. This makes the method not as useful because anytime it is run the result set contains over 100 built-in functions. ## How was this patch tested? CatalogSuite Author: Sandeep Singh <sandeep@techaddict.me> Closes #13413 from techaddict/SPARK-15663.	2016-06-13 21:58:52 -07:00
gatorsmile	5827b65e28	[SPARK-15808][SQL] File Format Checking When Appending Data #### What changes were proposed in this pull request? Issue: Got wrong results or strange errors when append data to a table with mismatched file format. _Example 1: PARQUET -> CSV_ ```Scala createDF(0, 9).write.format("parquet").saveAsTable("appendParquetToOrc") createDF(10, 19).write.mode(SaveMode.Append).format("orc").saveAsTable("appendParquetToOrc") ``` Error we got: ``` Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 (TID 2, localhost): java.lang.RuntimeException: file:/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/warehouse-bc8fedf2-aa6a-4002-a18b-524c6ac859d4/appendorctoparquet/part-r-00000-c0e3f365-1d46-4df5-a82c-b47d7af9feb9.snappy.orc is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [79, 82, 67, 23] ``` _Example 2: Json -> CSV_ ```Scala createDF(0, 9).write.format("json").saveAsTable("appendJsonToCSV") createDF(10, 19).write.mode(SaveMode.Append).format("parquet").saveAsTable("appendJsonToCSV") ``` No exception, but wrong results: ``` +----+----+ \| c1\| c2\| +----+----+ \|null\|null\| \|null\|null\| \|null\|null\| \|null\|null\| \| 0\|str0\| \| 1\|str1\| \| 2\|str2\| \| 3\|str3\| \| 4\|str4\| \| 5\|str5\| \| 6\|str6\| \| 7\|str7\| \| 8\|str8\| \| 9\|str9\| +----+----+ ``` _Example 3: Json -> Text_ ```Scala createDF(0, 9).write.format("json").saveAsTable("appendJsonToText") createDF(10, 19).write.mode(SaveMode.Append).format("text").saveAsTable("appendJsonToText") ``` Error we got: ``` Text data source supports only a single column, and you have 2 columns. ``` This PR is to issue an exception with appropriate error messages. #### How was this patch tested? Added test cases. Author: gatorsmile <gatorsmile@gmail.com> Closes #13546 from gatorsmile/fileFormatCheck.	2016-06-13 19:31:40 -07:00
Sean Zhong	7b9071eeaa	[SPARK-15910][SQL] Check schema consistency when using Kryo encoder to convert DataFrame to Dataset ## What changes were proposed in this pull request? This PR enforces schema check when converting DataFrame to Dataset using Kryo encoder. For example. Before the change: Schema is NOT checked when converting DataFrame to Dataset using kryo encoder. ``` scala> case class B(b: Int) scala> implicit val encoder = Encoders.kryo[B] scala> val df = Seq((1)).toDF("b") scala> val ds = df.as[B] // Schema compatibility is NOT checked ``` After the change: Report AnalysisException since the schema is NOT compatible. ``` scala> val ds = Seq((1)).toDF("b").as[B] org.apache.spark.sql.AnalysisException: cannot resolve 'CAST(`b` AS BINARY)' due to data type mismatch: cannot cast IntegerType to BinaryType; ... ``` ## How was this patch tested? Unit test. Author: Sean Zhong <seanzhong@databricks.com> Closes #13632 from clockfly/spark-15910.	2016-06-13 17:43:55 -07:00
Josh Rosen	a6babca1bf	[SPARK-15929] Fix portability of DataFrameSuite path globbing tests The DataFrameSuite regression tests for SPARK-13774 fail in my environment because they attempt to glob over all of `/mnt` and some of the subdirectories restrictive permissions which cause the test to fail. This patch rewrites those tests to remove all environment-specific assumptions; the tests now create their own unique temporary paths for use in the tests. Author: Josh Rosen <joshrosen@databricks.com> Closes #13649 from JoshRosen/SPARK-15929.	2016-06-13 17:06:22 -07:00
Wenchen Fan	c4b1ad0209	[SPARK-15887][SQL] Bring back the hive-site.xml support for Spark 2.0 ## What changes were proposed in this pull request? Right now, Spark 2.0 does not load hive-site.xml. Based on users' feedback, it seems make sense to still load this conf file. This PR adds a `hadoopConf` API in `SharedState`, which is `sparkContext.hadoopConfiguration` by default. When users are under hive context, `SharedState.hadoopConf` will load hive-site.xml and append its configs to `sparkContext.hadoopConfiguration`. When we need to read hadoop config in spark sql, we should call `SessionState.newHadoopConf`, which contains `sparkContext.hadoopConfiguration`, hive-site.xml and sql configs. ## How was this patch tested? new test in `HiveDataFrameSuite` Author: Wenchen Fan <wenchen@databricks.com> Closes #13611 from cloud-fan/hive-site.	2016-06-13 14:57:35 -07:00
Tathagata Das	c654ae2140	[SPARK-15889][SQL][STREAMING] Add a unique id to ContinuousQuery ## What changes were proposed in this pull request? ContinuousQueries have names that are unique across all the active ones. However, when queries are rapidly restarted with same name, it causes races conditions with the listener. A listener event from a stopped query can arrive after the query has been restarted, leading to complexities in monitoring infrastructure. Along with this change, I have also consolidated all the messy code paths to start queries with different sinks. ## How was this patch tested? Added unit tests, and existing unit tests. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #13613 from tdas/SPARK-15889.	2016-06-13 13:44:46 -07:00
Takeshi YAMAMURO	5ad4e32d46	[SPARK-15530][SQL] Set #parallelism for file listing in listLeafFilesInParallel ## What changes were proposed in this pull request? This pr is to set the number of parallelism to prevent file listing in `listLeafFilesInParallel` from generating many tasks in case of large #defaultParallelism. ## How was this patch tested? Manually checked Author: Takeshi YAMAMURO <linguin.m.s@gmail.com> Closes #13444 from maropu/SPARK-15530.	2016-06-13 13:41:26 -07:00
gatorsmile	3b7fb84cf8	[SPARK-15676][SQL] Disallow Column Names as Partition Columns For Hive Tables #### What changes were proposed in this pull request? When creating a Hive Table (not data source tables), a common error users might make is to specify an existing column name as a partition column. Below is what Hive returns in this case: ``` hive> CREATE TABLE partitioned (id bigint, data string) PARTITIONED BY (data string, part string); FAILED: SemanticException [Error 10035]: Column repeated in partitioning columns ``` Currently, the error we issued is very confusing: ``` org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:For direct MetaStore DB connections, we don't support retries at the client level.); ``` This PR is to fix the above issue by capturing the usage error in `Parser`. #### How was this patch tested? Added a test case to `DDLCommandSuite` Author: gatorsmile <gatorsmile@gmail.com> Closes #13415 from gatorsmile/partitionColumnsInTableSchema.	2016-06-13 13:22:46 -07:00
Tathagata Das	a6a18a4573	[HOTFIX][MINOR][SQL] Revert " Standardize 'continuous queries' to 'streaming D… This reverts commit `d32e227787`. Broke build - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-branch-2.0-compile-maven-hadoop-2.3/326/console Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #13645 from tdas/build-break.	2016-06-13 12:47:47 -07:00
Liwei Lin	d32e227787	[MINOR][SQL] Standardize 'continuous queries' to 'streaming Datasets/DataFrames' ## What changes were proposed in this pull request? This patch does some replacing (as `streaming Datasets/DataFrames` is the term we've chosen in [SPARK-15593](`00c310133d`)): - `continuous queries` -> `streaming Datasets/DataFrames` - `non-continuous queries` -> `non-streaming Datasets/DataFrames` This patch also adds `test("check foreach() can only be called on streaming Datasets/DataFrames")`. ## How was this patch tested? N/A Author: Liwei Lin <lwlin7@gmail.com> Closes #13595 from lw-lin/continuous-queries-to-streaming-dss-dfs.	2016-06-13 11:49:15 -07:00
Wenchen Fan	cd47e23374	[SPARK-15814][SQL] Aggregator can return null result ## What changes were proposed in this pull request? It's similar to the bug fixed in https://github.com/apache/spark/pull/13425, we should consider null object and wrap the `CreateStruct` with `If` to do null check. This PR also improves the test framework to test the objects of `Dataset[T]` directly, instead of calling `toDF` and compare the rows. ## How was this patch tested? new test in `DatasetAggregatorSuite` Author: Wenchen Fan <wenchen@databricks.com> Closes #13553 from cloud-fan/agg-null.	2016-06-13 09:58:48 -07:00
Wenchen Fan	e2ab79d5ea	[SPARK-15898][SQL] DataFrameReader.text should return DataFrame ## What changes were proposed in this pull request? We want to maintain API compatibility for DataFrameReader.text, and will introduce a new API called DataFrameReader.textFile which returns Dataset[String]. affected PRs: https://github.com/apache/spark/pull/11731 https://github.com/apache/spark/pull/13104 https://github.com/apache/spark/pull/13184 ## How was this patch tested? N/A Author: Wenchen Fan <wenchen@databricks.com> Closes #13604 from cloud-fan/revert.	2016-06-12 21:36:41 -07:00
Herman van Hövell tot Westerflier	1f8f2b5c2a	[SPARK-15370][SQL] Fix count bug # What changes were proposed in this pull request? This pull request fixes the COUNT bug in the `RewriteCorrelatedScalarSubquery` rule. After this change, the rule tests the expression at the root of the correlated subquery to determine whether the expression returns `NULL` on empty input. If the expression does not return `NULL`, the rule generates additional logic in the `Project` operator above the rewritten subquery. This additional logic intercepts `NULL` values coming from the outer join and replaces them with the value that the subquery's expression would return on empty input. This PR takes over https://github.com/apache/spark/pull/13155. It only fixes an issue with `Literal` construction and style issues. All credits should go frreiss. # How was this patch tested? Added regression tests to cover all branches of the updated rule (see changes to `SubquerySuite`). Ran all existing automated regression tests after merging with latest trunk. Author: frreiss <frreiss@us.ibm.com> Author: Herman van Hovell <hvanhovell@databricks.com> Closes #13629 from hvanhovell/SPARK-15370-cleanup.	2016-06-12 21:30:32 -07:00
Wenchen Fan	f5d38c3925	Revert "[SPARK-15753][SQL] Move Analyzer stuff to Analyzer from DataFrameWriter" This reverts commit `0ec279ffdf`.	2016-06-12 16:52:15 -07:00
Takuya UESHIN	caebd7f262	[SPARK-15870][SQL] DataFrame can't execute after uncacheTable. ## What changes were proposed in this pull request? If a cached `DataFrame` executed more than once and then do `uncacheTable` like the following: ``` val selectStar = sql("SELECT * FROM testData WHERE key = 1") selectStar.createOrReplaceTempView("selectStar") spark.catalog.cacheTable("selectStar") checkAnswer( selectStar, Seq(Row(1, "1"))) spark.catalog.uncacheTable("selectStar") checkAnswer( selectStar, Seq(Row(1, "1"))) ``` , then the uncached `DataFrame` can't execute because of `Task not serializable` exception like: ``` org.apache.spark.SparkException: Task not serializable at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298) at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288) at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108) at org.apache.spark.SparkContext.clean(SparkContext.scala:2038) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1897) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1912) at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:884) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:357) at org.apache.spark.rdd.RDD.collect(RDD.scala:883) at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:290) ... Caused by: java.lang.UnsupportedOperationException: Accumulator must be registered before send to executor at org.apache.spark.util.AccumulatorV2.writeReplace(AccumulatorV2.scala:153) at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at java.io.ObjectStreamClass.invokeWriteReplace(ObjectStreamClass.java:1118) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1136) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) ... ``` Notice that `DataFrame` uncached with `DataFrame.unpersist()` works, but with `spark.catalog.uncacheTable` doesn't work. This pr reverts a part of `cf38fe0` not to unregister `batchStats` accumulator, which is not needed to be unregistered here because it will be done by `ContextCleaner` after it is collected by GC. ## How was this patch tested? Added a test to check if DataFrame can execute after uncacheTable and other existing tests. But I made a test to check if the accumulator was cleared as `ignore` because the test would be flaky. Author: Takuya UESHIN <ueshin@happy-camper.st> Closes #13596 from ueshin/issues/SPARK-15870.	2016-06-12 16:37:44 -07:00
Herman van Hovell	20b8f2c32a	[SPARK-15370][SQL] Revert PR "Update RewriteCorrelatedSuquery rule" This reverts commit `9770f6ee60`. Author: Herman van Hovell <hvanhovell@databricks.com> Closes #13626 from hvanhovell/SPARK-15370-revert.	2016-06-12 15:06:37 -07:00
Ioana Delaney	0ff8a68b9f	[SPARK-15832][SQL] Embedded IN/EXISTS predicate subquery throws TreeNodeException ## What changes were proposed in this pull request? Queries with embedded existential sub-query predicates throws exception when building the physical plan. Example failing query: ```SQL scala> Seq((1, 1), (2, 2)).toDF("c1", "c2").createOrReplaceTempView("t1") scala> Seq((1, 1), (2, 2)).toDF("c1", "c2").createOrReplaceTempView("t2") scala> sql("select c1 from t1 where (case when c2 in (select c2 from t2) then 2 else 3 end) IN (select c2 from t1)").show() Binding attribute, tree: c2#239 org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: c2#239 at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:50) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88) ... at org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:87) at org.apache.spark.sql.execution.joins.HashJoin$$anonfun$4.apply(HashJoin.scala:66) at org.apache.spark.sql.execution.joins.HashJoin$$anonfun$4.apply(HashJoin.scala:66) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.immutable.List.foreach(List.scala:381) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.immutable.List.map(List.scala:285) at org.apache.spark.sql.execution.joins.HashJoin$class.org$apache$spark$sql$execution$joins$HashJoin$$x$8(HashJoin.scala:66) at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.org$apache$spark$sql$execution$joins$HashJoin$$x$8$lzycompute(BroadcastHashJoinExec.scala:38) at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.org$apache$spark$sql$execution$joins$HashJoin$$x$8(BroadcastHashJoinExec.scala:38) at org.apache.spark.sql.execution.joins.HashJoin$class.buildKeys(HashJoin.scala:63) at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.buildKeys$lzycompute(BroadcastHashJoinExec.scala:38) at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.buildKeys(BroadcastHashJoinExec.scala:38) at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.requiredChildDistribution(BroadcastHashJoinExec.scala:52) ``` Problem description: When the left hand side expression of an existential sub-query predicate contains another embedded sub-query predicate, the RewritePredicateSubquery optimizer rule does not resolve the embedded sub-query expressions into existential joins.For example, the above query has the following optimized plan, which fails during physical plan build. ```SQL == Optimized Logical Plan == Project [_1#224 AS c1#227] +- Join LeftSemi, (CASE WHEN predicate-subquery#255 [(_2#225 = c2#239)] THEN 2 ELSE 3 END = c2#228#262) : +- SubqueryAlias predicate-subquery#255 [(_2#225 = c2#239)] : +- LocalRelation [c2#239] :- LocalRelation [_1#224, _2#225] +- LocalRelation [c2#228#262] == Physical Plan == org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: c2#239 ``` Solution: In RewritePredicateSubquery, before rewriting the outermost predicate sub-query, resolve any embedded existential sub-queries. The Optimized plan for the above query after the changes looks like below. ```SQL == Optimized Logical Plan == Project [_1#224 AS c1#227] +- Join LeftSemi, (CASE WHEN exists#285 THEN 2 ELSE 3 END = c2#228#284) :- Join ExistenceJoin(exists#285), (_2#225 = c2#239) : :- LocalRelation [_1#224, _2#225] : +- LocalRelation [c2#239] +- LocalRelation [c2#228#284] == Physical Plan == Project [_1#224 AS c1#227] +- BroadcastHashJoin [CASE WHEN exists#285 THEN 2 ELSE 3 END], [c2#228#284], LeftSemi, BuildRight :- *BroadcastHashJoin [_2#225], [c2#239], ExistenceJoin(exists#285), BuildRight : :- LocalTableScan [_1#224, _2#225] : +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint))) : +- LocalTableScan [c2#239] +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint))) +- LocalTableScan [c2#228#284] +- LocalTableScan [c222#36], [[111],[222]] ``` ## How was this patch tested? Added new test cases in SubquerySuite.scala Author: Ioana Delaney <ioanamdelaney@gmail.com> Closes #13570 from ioana-delaney/fixEmbedSubPredV1.	2016-06-12 14:26:29 -07:00
frreiss	9770f6ee60	[SPARK-15370][SQL] Update RewriteCorrelatedScalarSubquery rule to fix COUNT bug ## What changes were proposed in this pull request? This pull request fixes the COUNT bug in the `RewriteCorrelatedScalarSubquery` rule. After this change, the rule tests the expression at the root of the correlated subquery to determine whether the expression returns NULL on empty input. If the expression does not return NULL, the rule generates additional logic in the Project operator above the rewritten subquery. This additional logic intercepts NULL values coming from the outer join and replaces them with the value that the subquery's expression would return on empty input. ## How was this patch tested? Added regression tests to cover all branches of the updated rule (see changes to `SubquerySuite.scala`). Ran all existing automated regression tests after merging with latest trunk. Author: frreiss <frreiss@us.ibm.com> Closes #13155 from frreiss/master.	2016-06-12 14:21:10 -07:00
Sean Owen	f51dfe616b	[SPARK-15086][CORE][STREAMING] Deprecate old Java accumulator API ## What changes were proposed in this pull request? - Deprecate old Java accumulator API; should use Scala now - Update Java tests and examples - Don't bother testing old accumulator API in Java 8 (too) - (fix a misspelling too) ## How was this patch tested? Jenkins tests Author: Sean Owen <sowen@cloudera.com> Closes #13606 from srowen/SPARK-15086.	2016-06-12 11:44:33 -07:00
hyukjinkwon	9e204c62c6	[SPARK-15840][SQL] Add two missing options in documentation and some option related changes ## What changes were proposed in this pull request? This PR 1. Adds the documentations for some missing options, `inferSchema` and `mergeSchema` for Python and Scala. 2. Fiixes `[[DataFrame]]` to ```:class:`DataFrame` ``` so that this can be shown - from ![2016-06-09 9 31 16](https://cloud.githubusercontent.com/assets/6477701/15929721/8b864734-2e89-11e6-83f6-207527de4ac9.png) - to (with class link) ![2016-06-09 9 31 00](https://cloud.githubusercontent.com/assets/6477701/15929717/8a03d728-2e89-11e6-8a3f-08294964db22.png) (Please refer [the latest documentation](https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/python/pyspark.sql.html)) 3. Moves `mergeSchema` option to `ParquetOptions` with removing unused options, `metastoreSchema` and `metastoreTableName`. They are not used anymore. They were removed in `e720dda42e` and there are no use cases as below: ```bash grep -r -e METASTORE_SCHEMA -e \"metastoreSchema\" -e \"metastoreTableName\" -e METASTORE_TABLE_NAME . ``` ``` ./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala: private[sql] val METASTORE_SCHEMA = "metastoreSchema" ./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala: private[sql] val METASTORE_TABLE_NAME = "metastoreTableName" ./sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala: ParquetFileFormat.METASTORE_TABLE_NAME -> TableIdentifier( ``` It only sets `metastoreTableName` in the last case but does not use the table name. 4. Sets the correct default values (in the documentation) for `compression` option for ORC(`snappy`, see [OrcOptions.scala#L33-L42](`3ded5bc4db/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcOptions.scala (L33-L42)`)) and Parquet(`the value specified in SQLConf`, see [ParquetOptions.scala#L38-L47](`3ded5bc4db/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetOptions.scala (L38-L47)`)) and `columnNameOfCorruptRecord` for JSON(`the value specified in SQLConf`, see [JsonFileFormat.scala#L53-L55](`4538443e27/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonFileFormat.scala (L53-L55)`) and [JsonFileFormat.scala#L105-L106](`4538443e27/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonFileFormat.scala (L105-L106)`)). ## How was this patch tested? Existing tests should cover this. Author: hyukjinkwon <gurwls223@gmail.com> Author: Hyukjin Kwon <gurwls223@gmail.com> Closes #13576 from HyukjinKwon/SPARK-15840.	2016-06-11 23:20:40 -07:00
Dongjoon Hyun	3fd2ff4dd8	[SPARK-15807][SQL] Support varargs for dropDuplicates in Dataset/DataFrame ## What changes were proposed in this pull request? This PR adds `varargs`-types `dropDuplicates` functions in `Dataset/DataFrame`. Currently, `dropDuplicates` supports only `Seq` or `Array`. Before ```scala scala> val ds = spark.createDataFrame(Seq(("a", 1), ("b", 2), ("a", 2))) ds: org.apache.spark.sql.DataFrame = [_1: string, _2: int] scala> ds.dropDuplicates(Seq("_1", "_2")) res0: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [_1: string, _2: int] scala> ds.dropDuplicates("_1", "_2") <console>:26: error: overloaded method value dropDuplicates with alternatives: (colNames: Array[String])org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] <and> (colNames: Seq[String])org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] <and> ()org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] cannot be applied to (String, String) ds.dropDuplicates("_1", "_2") ^ ``` After ```scala scala> val ds = spark.createDataFrame(Seq(("a", 1), ("b", 2), ("a", 2))) ds: org.apache.spark.sql.DataFrame = [_1: string, _2: int] scala> ds.dropDuplicates("_1", "_2") res0: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [_1: string, _2: int] ``` ## How was this patch tested? Pass the Jenkins tests with new testcases. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #13545 from dongjoon-hyun/SPARK-15807.	2016-06-11 15:47:51 -07:00
Eric Liang	c06c58bbbb	[SPARK-14851][CORE] Support radix sort with nullable longs ## What changes were proposed in this pull request? This adds support for radix sort of nullable long fields. When a sort field is null and radix sort is enabled, we keep nulls in a separate region of the sort buffer so that radix sort does not need to deal with them. This also has performance benefits when sorting smaller integer types, since the current representation of nulls in two's complement (Long.MIN_VALUE) otherwise forces a full-width radix sort. This strategy for nulls does mean the sort is no longer stable. cc davies ## How was this patch tested? Existing randomized sort tests for correctness. I also tested some TPCDS queries and there does not seem to be any significant regression for non-null sorts. Some test queries (best of 5 runs each). Before change: scala> val start = System.nanoTime; spark.range(5000000).selectExpr("if(id > 5, cast(hash(id) as long), NULL) as h").coalesce(1).orderBy("h").collect(); (System.nanoTime - start) / 1e6 start: Long = 3190437233227987 res3: Double = 4716.471091 After change: scala> val start = System.nanoTime; spark.range(5000000).selectExpr("if(id > 5, cast(hash(id) as long), NULL) as h").coalesce(1).orderBy("h").collect(); (System.nanoTime - start) / 1e6 start: Long = 3190367870952791 res4: Double = 2981.143045 Author: Eric Liang <ekl@databricks.com> Closes #13161 from ericl/sc-2998.	2016-06-11 15:42:58 -07:00
Wenchen Fan	75705e8dbb	[SPARK-15856][SQL] Revert API breaking changes made in SQLContext.range ## What changes were proposed in this pull request? It's easy for users to call `range(...).as[Long]` to get typed Dataset, and don't worth an API breaking change. This PR reverts it. ## How was this patch tested? N/A Author: Wenchen Fan <wenchen@databricks.com> Closes #13605 from cloud-fan/range.	2016-06-11 15:28:40 -07:00
Eric Liang	5bb4564cd4	[SPARK-15881] Update microbenchmark results for WideSchemaBenchmark ## What changes were proposed in this pull request? These were not updated after performance improvements. To make updating them easier, I also moved the results from inline comments out into a file, which is auto-generated when the benchmark is re-run. Author: Eric Liang <ekl@databricks.com> Closes #13607 from ericl/sc-3538.	2016-06-11 15:26:08 -07:00
Takeshi YAMAMURO	cb5d933d86	[SPARK-15585][SQL] Add doc for turning off quotations ## What changes were proposed in this pull request? This pr is to add doc for turning off quotations because this behavior is different from `com.databricks.spark.csv`. ## How was this patch tested? Check behavior to put an empty string in csv options. Author: Takeshi YAMAMURO <linguin.m.s@gmail.com> Closes #13616 from maropu/SPARK-15585-2.	2016-06-11 15:12:21 -07:00
Davies Liu	7504bc73f2	[SPARK-15759] [SQL] Fallback to non-codegen when fail to compile generated code ## What changes were proposed in this pull request? In case of any bugs in whole-stage codegen, the generated code can't be compiled, we should fallback to non-codegen to make sure that query could run. The batch mode of new parquet reader depends on codegen, can't be easily switched to non-batch mode, so we still use codegen for batched scan (for parquet). Because it only support primitive types and the number of columns is less than spark.sql.codegen.maxFields (100), it should not fail. This could be configurable by `spark.sql.codegen.fallback` ## How was this patch tested? Manual test it with buggy operator, it worked well. Author: Davies Liu <davies@databricks.com> Closes #13501 from davies/codegen_fallback.	2016-06-10 21:12:06 -07:00
Sameer Agarwal	468da03e23	[SPARK-15678] Add support to REFRESH data source paths ## What changes were proposed in this pull request? Spark currently incorrectly continues to use cached data even if the underlying data is overwritten. Current behavior: ```scala val dir = "/tmp/test" sqlContext.range(1000).write.mode("overwrite").parquet(dir) val df = sqlContext.read.parquet(dir).cache() df.count() // outputs 1000 sqlContext.range(10).write.mode("overwrite").parquet(dir) sqlContext.read.parquet(dir).count() // outputs 1000 <---- We are still using the cached dataset ``` This patch fixes this bug by adding support for `REFRESH path` that invalidates and refreshes all the cached data (and the associated metadata) for any dataframe that contains the given data source path. Expected behavior: ```scala val dir = "/tmp/test" sqlContext.range(1000).write.mode("overwrite").parquet(dir) val df = sqlContext.read.parquet(dir).cache() df.count() // outputs 1000 sqlContext.range(10).write.mode("overwrite").parquet(dir) spark.catalog.refreshResource(dir) sqlContext.read.parquet(dir).count() // outputs 10 <---- We are not using the cached dataset ``` ## How was this patch tested? Unit tests for overwrites and appends in `ParquetQuerySuite` and `CachedTableSuite`. Author: Sameer Agarwal <sameer@databricks.com> Closes #13566 from sameeragarwal/refresh-path-2.	2016-06-10 20:43:18 -07:00
Cheng Lian	8e7b56f3d4	Revert "[SPARK-15639][SQL] Try to push down filter at RowGroups level for parquet reader" This reverts commit `bba5d7999f`.	2016-06-10 20:41:48 -07:00
Liang-Chi Hsieh	bba5d7999f	[SPARK-15639][SQL] Try to push down filter at RowGroups level for parquet reader ## What changes were proposed in this pull request? The base class `SpecificParquetRecordReaderBase` used for vectorized parquet reader will try to get pushed-down filters from the given configuration. This pushed-down filters are used for RowGroups-level filtering. However, we don't set up the filters to push down into the configuration. In other words, the filters are not actually pushed down to do RowGroups-level filtering. This patch is to fix this and tries to set up the filters for pushing down to configuration for the reader. ## How was this patch tested? Existing tests should be passed. Author: Liang-Chi Hsieh <simonh@tw.ibm.com> Closes #13371 from viirya/vectorized-reader-push-down-filter.	2016-06-10 18:23:59 -07:00
Sela	127a6678d7	[SPARK-15489][SQL] Dataset kryo encoder won't load custom user settings ## What changes were proposed in this pull request? Serializer instantiation will consider existing SparkConf ## How was this patch tested? manual test with `ImmutableList` (Guava) and `kryo-serializers`'s `Immutable*Serializer` implementations. Added Test Suite. (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Author: Sela <ansela@paypal.com> Closes #13424 from amitsela/SPARK-15489.	2016-06-10 14:36:51 -07:00
Davies Liu	aec502d911	[SPARK-15654] [SQL] fix non-splitable files for text based file formats ## What changes were proposed in this pull request? Currently, we always split the files when it's bigger than maxSplitBytes, but Hadoop LineRecordReader does not respect the splits for compressed files correctly, we should have a API for FileFormat to check whether the file could be splitted or not. This PR is based on #13442, closes #13442 ## How was this patch tested? add regression tests. Author: Davies Liu <davies@databricks.com> Closes #13531 from davies/fix_split.	2016-06-10 14:32:43 -07:00
Herman van Hovell	e05a2feebe	[SPARK-15825] [SQL] Fix SMJ invalid results ## What changes were proposed in this pull request? Code generated `SortMergeJoin` failed with wrong results when using structs as keys. This could (eventually) be traced back to the use of a wrong row reference when comparing structs. ## How was this patch tested? TBD Author: Herman van Hovell <hvanhovell@databricks.com> Closes #13589 from hvanhovell/SPARK-15822.	2016-06-10 14:29:05 -07:00
wangyang	026eb90644	[SPARK-15875] Try to use Seq.isEmpty and Seq.nonEmpty instead of Seq.length == 0 and Seq.length > 0 ## What changes were proposed in this pull request? In scala, immutable.List.length is an expensive operation so we should avoid using Seq.length == 0 or Seq.lenth > 0, and use Seq.isEmpty and Seq.nonEmpty instead. ## How was this patch tested? existing tests Author: wangyang <wangyang@haizhi.com> Closes #13601 from yangw1234/isEmpty.	2016-06-10 13:10:03 -07:00
Sandeep Singh	865ec32dd9	[MINOR][X][X] Replace all occurrences of None: Option with Option.empty ## What changes were proposed in this pull request? Replace all occurrences of `None: Option[X]` with `Option.empty[X]` ## How was this patch tested? Exisiting Tests Author: Sandeep Singh <sandeep@techaddict.me> Closes #13591 from techaddict/minor-7.	2016-06-10 13:06:51 -07:00
Takuya UESHIN	667d4ea7b3	[SPARK-6320][SQL] Move planLater method into GenericStrategy. ## What changes were proposed in this pull request? This PR moves `QueryPlanner.planLater()` method into `GenericStrategy` for extra strategies to be able to use `planLater` in its strategy. ## How was this patch tested? Existing tests. Author: Takuya UESHIN <ueshin@happy-camper.st> Closes #13147 from ueshin/issues/SPARK-6320.	2016-06-10 13:06:18 -07:00
Liwei Lin	fb219029dd	[SPARK-15871][SQL] Add `assertNotPartitioned` check in `DataFrameWriter` ## What changes were proposed in this pull request? It doesn't make sense to specify partitioning parameters, when we write data out from Datasets/DataFrames into `jdbc` tables or streaming `ForeachWriter`s. This patch adds `assertNotPartitioned` check in `DataFrameWriter`. <table> <tr> <td align="center"><strong>operation</strong></td> <td align="center"><strong>should check not partitioned?</strong></td> </tr> <tr> <td align="center">mode</td> <td align="center"></td> </tr> <tr> <td align="center">outputMode</td> <td align="center"></td> </tr> <tr> <td align="center">trigger</td> <td align="center"></td> </tr> <tr> <td align="center">format</td> <td align="center"></td> </tr> <tr> <td align="center">option/options</td> <td align="center"></td> </tr> <tr> <td align="center">partitionBy</td> <td align="center"></td> </tr> <tr> <td align="center">bucketBy</td> <td align="center"></td> </tr> <tr> <td align="center">sortBy</td> <td align="center"></td> </tr> <tr> <td align="center">save</td> <td align="center"></td> </tr> <tr> <td align="center">queryName</td> <td align="center"></td> </tr> <tr> <td align="center">startStream</td> <td align="center"></td> </tr> <tr> <td align="center">foreach</td> <td align="center">yes</td> </tr> <tr> <td align="center">insertInto</td> <td align="center"></td> </tr> <tr> <td align="center">saveAsTable</td> <td align="center"></td> </tr> <tr> <td align="center">jdbc</td> <td align="center">yes</td> </tr> <tr> <td align="center">json</td> <td align="center"></td> </tr> <tr> <td align="center">parquet</td> <td align="center"></td> </tr> <tr> <td align="center">orc</td> <td align="center"></td> </tr> <tr> <td align="center">text</td> <td align="center"></td> </tr> <tr> <td align="center">csv</td> <td align="center"></td> </tr> </table> ## How was this patch tested? New dedicated tests. Author: Liwei Lin <lwlin7@gmail.com> Closes #13597 from lw-lin/add-assertNotPartitioned.	2016-06-10 13:01:29 -07:00
Dongjoon Hyun	2413fce9d6	[SPARK-15743][SQL] Prevent saving with all-column partitioning ## What changes were proposed in this pull request? When saving datasets on storage, `partitionBy` provides an easy way to construct the directory structure. However, if a user choose all columns as partition columns, some exceptions occurs. - ORC with all column partitioning: `AnalysisException` on future read due to schema inference failure. ```scala scala> spark.range(10).write.format("orc").mode("overwrite").partitionBy("id").save("/tmp/data") scala> spark.read.format("orc").load("/tmp/data").collect() org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC at /tmp/data. It must be specified manually; ``` - Parquet with all-column partitioning: `InvalidSchemaException` on write execution due to Parquet limitation. ```scala scala> spark.range(100).write.format("parquet").mode("overwrite").partitionBy("id").save("/tmp/data") [Stage 0:> (0 + 8) / 8]16/06/02 16:51:17 ERROR Utils: Aborting task org.apache.parquet.schema.InvalidSchemaException: A group type can not be empty. Parquet does not support empty group without leaves. Empty group: spark_schema ... (lots of error messages) ``` Although some formats like JSON support all-column partitioning without any problem, it seems not a good idea to make lots of empty directories. This PR prevents saving with all-column partitioning by consistently raising `AnalysisException` before executing save operation. ## How was this patch tested? Newly added `PartitioningUtilsSuite`. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #13486 from dongjoon-hyun/SPARK-15743.	2016-06-10 12:43:27 -07:00
Reynold Xin	254bc8c34e	[SPARK-15866] Rename listAccumulator collectionAccumulator ## What changes were proposed in this pull request? SparkContext.listAccumulator, by Spark's convention, makes it sound like "list" is a verb and the method should return a list of accumulators. This patch renames the method and the class collection accumulator. ## How was this patch tested? Updated test case to reflect the names. Author: Reynold Xin <rxin@databricks.com> Closes #13594 from rxin/SPARK-15866.	2016-06-10 11:08:39 -07:00
Liang-Chi Hsieh	0ec279ffdf	[SPARK-15753][SQL] Move Analyzer stuff to Analyzer from DataFrameWriter ## What changes were proposed in this pull request? This patch moves some codes in `DataFrameWriter.insertInto` that belongs to `Analyzer`. ## How was this patch tested? Existing tests. Author: Liang-Chi Hsieh <simonh@tw.ibm.com> Closes #13496 from viirya/move-analyzer-stuff.	2016-06-10 11:05:04 -07:00
Tathagata Das	abdb5d42c5	[SPARK-15812][SQ][STREAMING] Added support for sorting after streaming aggregation with complete mode ## What changes were proposed in this pull request? When the output mode is complete, then the output of a streaming aggregation essentially will contain the complete aggregates every time. So this is not different from a batch dataset within an incremental execution. Other non-streaming operations should be supported on this dataset. In this PR, I am just adding support for sorting, as it is a common useful functionality. Support for other operations will come later. ## How was this patch tested? Additional unit tests. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #13549 from tdas/SPARK-15812.	2016-06-10 10:48:28 -07:00
Shixiong Zhu	00c310133d	[SPARK-15593][SQL] Add DataFrameWriter.foreach to allow the user consuming data in ContinuousQuery ## What changes were proposed in this pull request? * Add DataFrameWriter.foreach to allow the user consuming data in ContinuousQuery * ForeachWriter is the interface for the user to consume partitions of data * Add a type parameter T to DataFrameWriter Usage ```Scala val ds = spark.read....stream().as[String] ds.....write .queryName(...) .option("checkpointLocation", ...) .foreach(new ForeachWriter[Int] { def open(partitionId: Long, version: Long): Boolean = { // prepare some resources for a partition // check `version` if possible and return `false` if this is a duplicated data to skip the data processing. } override def process(value: Int): Unit = { // process data } def close(errorOrNull: Throwable): Unit = { // release resources for a partition // check `errorOrNull` and handle the error if necessary. } }) ``` ## How was this patch tested? New unit tests. Author: Shixiong Zhu <shixiong@databricks.com> Closes #13342 from zsxwing/foreach.	2016-06-10 00:11:46 -07:00
Dongjoon Hyun	5a3533e779	[SPARK-15696][SQL] Improve `crosstab` to have a consistent column order ## What changes were proposed in this pull request? Currently, `crosstab` returns a Dataframe having random-order columns obtained by just `distinct`. Also, the documentation of `crosstab` shows the result in a sorted order which is different from the current implementation. This PR explicitly constructs the columns in a sorted order in order to improve user experience. Also, this implementation gives the same result with the documentation. Before ```scala scala> spark.createDataFrame(Seq((1, 1), (1, 2), (2, 1), (2, 1), (2, 3), (3, 2), (3, 3))).toDF("key", "value").stat.crosstab("key", "value").show() +---------+---+---+---+ \|key_value\| 3\| 2\| 1\| +---------+---+---+---+ \| 2\| 1\| 0\| 2\| \| 1\| 0\| 1\| 1\| \| 3\| 1\| 1\| 0\| +---------+---+---+---+ scala> spark.createDataFrame(Seq((1, "a"), (1, "b"), (2, "a"), (2, "a"), (2, "c"), (3, "b"), (3, "c"))).toDF("key", "value").stat.crosstab("key", "value").show() +---------+---+---+---+ \|key_value\| c\| a\| b\| +---------+---+---+---+ \| 2\| 1\| 2\| 0\| \| 1\| 0\| 1\| 1\| \| 3\| 1\| 0\| 1\| +---------+---+---+---+ ``` After ```scala scala> spark.createDataFrame(Seq((1, 1), (1, 2), (2, 1), (2, 1), (2, 3), (3, 2), (3, 3))).toDF("key", "value").stat.crosstab("key", "value").show() +---------+---+---+---+ \|key_value\| 1\| 2\| 3\| +---------+---+---+---+ \| 2\| 2\| 0\| 1\| \| 1\| 1\| 1\| 0\| \| 3\| 0\| 1\| 1\| +---------+---+---+---+ scala> spark.createDataFrame(Seq((1, "a"), (1, "b"), (2, "a"), (2, "a"), (2, "c"), (3, "b"), (3, "c"))).toDF("key", "value").stat.crosstab("key", "value").show() +---------+---+---+---+ \|key_value\| a\| b\| c\| +---------+---+---+---+ \| 2\| 2\| 0\| 1\| \| 1\| 1\| 1\| 0\| \| 3\| 0\| 1\| 1\| +---------+---+---+---+ ``` ## How was this patch tested? Pass the Jenkins tests with updated testcases. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #13436 from dongjoon-hyun/SPARK-15696.	2016-06-09 22:46:51 -07:00
Eric Liang	6c5fd977fb	[SPARK-15791] Fix NPE in ScalarSubquery ## What changes were proposed in this pull request? The fix is pretty simple, just don't make the executedPlan transient in `ScalarSubquery` since it is referenced at execution time. ## How was this patch tested? I verified the fix manually in non-local mode. It's not clear to me why the problem did not manifest in local mode, any suggestions? cc davies Author: Eric Liang <ekl@databricks.com> Closes #13569 from ericl/fix-scalar-npe.	2016-06-09 22:28:31 -07:00
Reynold Xin	16df133d7f	[SPARK-15850][SQL] Remove function grouping in SparkSession ## What changes were proposed in this pull request? SparkSession does not have that many functions due to better namespacing, and as a result we probably don't need the function grouping. This patch removes the grouping and also adds missing scaladocs for createDataset functions in SQLContext. Closes #13577. ## How was this patch tested? N/A - this is a documentation change. Author: Reynold Xin <rxin@databricks.com> Closes #13582 from rxin/SPARK-15850.	2016-06-09 18:58:24 -07:00
Shixiong Zhu	4d9d9cc585	[SPARK-15853][SQL] HDFSMetadataLog.get should close the input stream ## What changes were proposed in this pull request? This PR closes the input stream created in `HDFSMetadataLog.get` ## How was this patch tested? Jenkins unit tests. Author: Shixiong Zhu <shixiong@databricks.com> Closes #13583 from zsxwing/leak.	2016-06-09 18:45:19 -07:00
Eric Liang	b914e1930f	[SPARK-15794] Should truncate toString() of very wide plans ## What changes were proposed in this pull request? With very wide tables, e.g. thousands of fields, the plan output is unreadable and often causes OOMs due to inefficient string processing. This truncates all struct and operator field lists to a user configurable threshold to limit performance impact. It would also be nice to optimize string generation to avoid these sort of O(n^2) slowdowns entirely (i.e. use StringBuilder everywhere including expressions), but this is probably too large of a change for 2.0 at this point, and truncation has other benefits for usability. ## How was this patch tested? Added a microbenchmark that covers this case particularly well. I also ran the microbenchmark while varying the truncation threshold. ``` numFields = 5 wide shallowly nested struct field r/w: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------ 2000 wide x 50 rows (write in-mem) 2336 / 2558 0.0 23364.4 0.1X numFields = 25 wide shallowly nested struct field r/w: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------ 2000 wide x 50 rows (write in-mem) 4237 / 4465 0.0 42367.9 0.1X numFields = 100 wide shallowly nested struct field r/w: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------ 2000 wide x 50 rows (write in-mem) 10458 / 11223 0.0 104582.0 0.0X numFields = Infinity wide shallowly nested struct field r/w: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------ [info] java.lang.OutOfMemoryError: Java heap space ``` Author: Eric Liang <ekl@databricks.com> Author: Eric Liang <ekhliang@gmail.com> Closes #13537 from ericl/truncated-string.	2016-06-09 18:05:16 -07:00
Kevin Yu	99386fe398	[SPARK-15804][SQL] Include metadata in the toStructType ## What changes were proposed in this pull request? The help function 'toStructType' in the AttributeSeq class doesn't include the metadata when it builds the StructField, so it causes this reported problem https://issues.apache.org/jira/browse/SPARK-15804?jql=project%20%3D%20SPARK when spark writes the the dataframe with the metadata to the parquet datasource. The code path is when spark writes the dataframe to the parquet datasource through the InsertIntoHadoopFsRelationCommand, spark will build the WriteRelation container, and it will call the help function 'toStructType' to create StructType which contains StructField, it should include the metadata there, otherwise, we will lost the user provide metadata. ## How was this patch tested? added test case in ParquetQuerySuite.scala (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Author: Kevin Yu <qyu@us.ibm.com> Closes #13555 from kevinyu98/spark-15804.	2016-06-09 09:50:09 -07:00
Sandeep Singh	d5807def10	[MINOR][DOC] In Dataset docs, remove self link to Dataset and add link to Column ## What changes were proposed in this pull request? Documentation Fix ## How was this patch tested? Author: Sandeep Singh <sandeep@techaddict.me> Closes #13567 from techaddict/minor-4.	2016-06-08 23:41:29 -07:00
Wenchen Fan	afbe35cf5b	[SPARK-14670] [SQL] allow updating driver side sql metrics ## What changes were proposed in this pull request? On the SparkUI right now we have this SQLTab that displays accumulator values per operator. However, it only displays metrics updated on the executors, not on the driver. It is useful to also include driver metrics, e.g. broadcast time. This is a different version from https://github.com/apache/spark/pull/12427. This PR sends driver side accumulator updates right after the updating happens, not at the end of execution, by a new event. ## How was this patch tested? new test in `SQLListenerSuite` ![qq20160606-0](https://cloud.githubusercontent.com/assets/3182036/15841418/0eb137da-2c06-11e6-9068-5694eeb78530.png) Author: Wenchen Fan <wenchen@databricks.com> Closes #13189 from cloud-fan/metrics.	2016-06-08 22:47:29 -07:00
Sandeep Singh	f958c1c3e2	[MINOR] Fix Java Lint errors introduced by #13286 and #13280 ## What changes were proposed in this pull request? revived #13464 Fix Java Lint errors introduced by #13286 and #13280 Before: ``` Using `mvn` from path: /Users/pichu/Project/spark/build/apache-maven-3.3.9/bin/mvn Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512M; support was removed in 8.0 Checkstyle checks failed at following occurrences: [ERROR] src/main/java/org/apache/spark/launcher/LauncherServer.java:[340,5] (whitespace) FileTabCharacter: Line contains a tab character. [ERROR] src/main/java/org/apache/spark/launcher/LauncherServer.java:[341,5] (whitespace) FileTabCharacter: Line contains a tab character. [ERROR] src/main/java/org/apache/spark/launcher/LauncherServer.java:[342,5] (whitespace) FileTabCharacter: Line contains a tab character. [ERROR] src/main/java/org/apache/spark/launcher/LauncherServer.java:[343,5] (whitespace) FileTabCharacter: Line contains a tab character. [ERROR] src/main/java/org/apache/spark/sql/streaming/OutputMode.java:[41,28] (naming) MethodName: Method name 'Append' must match pattern '^[a-z][a-z0-9][a-zA-Z0-9_]$'. [ERROR] src/main/java/org/apache/spark/sql/streaming/OutputMode.java:[52,28] (naming) MethodName: Method name 'Complete' must match pattern '^[a-z][a-z0-9][a-zA-Z0-9_]$'. [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java:[61,8] (imports) UnusedImports: Unused import - org.apache.parquet.schema.PrimitiveType. [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java:[62,8] (imports) UnusedImports: Unused import - org.apache.parquet.schema.Type. ``` ## How was this patch tested? ran `dev/lint-java` locally Author: Sandeep Singh <sandeep@techaddict.me> Closes #13559 from techaddict/minor-3.	2016-06-08 14:51:00 +01:00
Shixiong Zhu	0cfd6192f3	[SPARK-15580][SQL] Add ContinuousQueryInfo to make ContinuousQueryListener events serializable ## What changes were proposed in this pull request? This PR adds ContinuousQueryInfo to make ContinuousQueryListener events serializable in order to support writing events into the event log. ## How was this patch tested? Jenkins unit tests. Author: Shixiong Zhu <shixiong@databricks.com> Closes #13335 from zsxwing/query-info.	2016-06-07 16:40:03 -07:00
Sean Zhong	890baaca50	[SPARK-15674][SQL] Deprecates "CREATE TEMPORARY TABLE USING...", uses "CREAT TEMPORARY VIEW USING..." instead ## What changes were proposed in this pull request? The current implementation of "CREATE TEMPORARY TABLE USING datasource..." is NOT creating any intermediate temporary data directory like temporary HDFS folder, instead, it only stores a SQL string in memory. Probably we should use "TEMPORARY VIEW" instead. This PR assumes a temporary table has to link with some temporary intermediate data. It follows the definition of temporary table like this (from [hortonworks doc](https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.0/bk_dataintegration/content/temp-tables.html)): > A temporary table is a convenient way for an application to automatically manage intermediate data generated during a complex query Example: ``` scala> spark.sql("CREATE temporary view my_tab7 (c1: String, c2: String) USING org.apache.spark.sql.execution.datasources.csv.CSVFileFormat OPTIONS (PATH '/Users/seanzhong/csv/cars.csv')") scala> spark.sql("select c1, c2 from my_tab7").show() +----+-----+ \| c1\| c2\| +----+-----+ \|year\| make\| \|2012\|Tesla\| ... ``` It NOW prints a deprecation warning if "CREATE TEMPORARY TABLE USING..." is used. ``` scala> spark.sql("CREATE temporary table my_tab7 (c1: String, c2: String) USING org.apache.spark.sql.execution.datasources.csv.CSVFileFormat OPTIONS (PATH '/Users/seanzhong/csv/cars.csv')") 16/05/31 10:39:27 WARN SparkStrategies$DDLStrategy: CREATE TEMPORARY TABLE tableName USING... is deprecated, please use CREATE TEMPORARY VIEW viewName USING... instead ``` ## How was this patch tested? Unit test. Author: Sean Zhong <seanzhong@databricks.com> Closes #13414 from clockfly/create_temp_view_using.	2016-06-07 15:21:55 -07:00
Sean Zhong	5f731d6859	[SPARK-15792][SQL] Allows operator to change the verbosity in explain output ## What changes were proposed in this pull request? This PR allows customization of verbosity in explain output. After change, `dataframe.explain()` and `dataframe.explain(true)` has different verbosity output for physical plan. Currently, this PR only enables verbosity string for operator `HashAggregateExec` and `SortAggregateExec`. We will gradually enable verbosity string for more operators in future. Less verbose mode: dataframe.explain(extended = false) `output=[count(a)#85L]` is NOT displayed for HashAggregate. ``` scala> Seq((1,2,3)).toDF("a", "b", "c").createTempView("df2") scala> spark.sql("select count(a) from df2").explain() == Physical Plan == HashAggregate(key=[], functions=[count(1)]) +- Exchange SinglePartition +- HashAggregate(key=[], functions=[partial_count(1)]) +- LocalTableScan ``` Verbose mode: dataframe.explain(extended = true) `output=[count(a)#85L]` is displayed for HashAggregate. ``` scala> spark.sql("select count(a) from df2").explain(true) // "output=[count(a)#85L]" is added ... == Physical Plan == HashAggregate(key=[], functions=[count(1)], output=[count(a)#85L]) +- Exchange SinglePartition +- HashAggregate(key=[], functions=[partial_count(1)], output=[count#87L]) +- LocalTableScan ``` ## How was this patch tested? Manual test. Author: Sean Zhong <seanzhong@databricks.com> Closes #13535 from clockfly/verbose_breakdown_2.	2016-06-06 22:59:25 -07:00
Sean Zhong	0e0904a2fc	[SPARK-15632][SQL] Typed Filter should NOT change the Dataset schema ## What changes were proposed in this pull request? This PR makes sure the typed Filter doesn't change the Dataset schema. Before the change: ``` scala> val df = spark.range(0,9) scala> df.schema res12: org.apache.spark.sql.types.StructType = StructType(StructField(id,LongType,false)) scala> val afterFilter = df.filter(_=>true) scala> afterFilter.schema // !!! schema is CHANGED!!! Column name is changed from id to value, nullable is changed from false to true. res13: org.apache.spark.sql.types.StructType = StructType(StructField(value,LongType,true)) ``` SerializeFromObject and DeserializeToObject are inserted to wrap the Filter, and these two can possibly change the schema of Dataset. After the change: ``` scala> afterFilter.schema // schema is NOT changed. res47: org.apache.spark.sql.types.StructType = StructType(StructField(id,LongType,false)) ``` ## How was this patch tested? Unit test. Author: Sean Zhong <seanzhong@databricks.com> Closes #13529 from clockfly/spark-15632.	2016-06-06 22:40:21 -07:00
Josh Rosen	0b8d694999	[SPARK-15764][SQL] Replace N^2 loop in BindReferences BindReferences contains a n^2 loop which causes performance issues when operating over large schemas: to determine the ordinal of an attribute reference, we perform a linear scan over the `input` array. Because input can sometimes be a `List`, the call to `input(ordinal).nullable` can also be O(n). Instead of performing a linear scan, we can convert the input into an array and build a hash map to map from expression ids to ordinals. The greater up-front cost of the map construction is offset by the fact that an expression can contain multiple attribute references, so the cost of the map construction is amortized across a number of lookups. Perf. benchmarks to follow. /cc ericl Author: Josh Rosen <joshrosen@databricks.com> Closes #13505 from JoshRosen/bind-references-improvement.	2016-06-06 11:44:51 -07:00
Zheng RuiFeng	fd8af39713	[MINOR] Fix Typos 'an -> a' ## What changes were proposed in this pull request? `an -> a` Use cmds like `find . -name '*.R' \| xargs -i sh -c "grep -in ' an [^aeiou]' {} && echo {}"` to generate candidates, and review them one by one. ## How was this patch tested? manual tests Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #13515 from zhengruifeng/an_a.	2016-06-06 09:35:47 +01:00
Reynold Xin	32f2f95dbd	Revert "[SPARK-15585][SQL] Fix NULL handling along with a spark-csv behaivour" This reverts commit `b7e8d1cb3c`.	2016-06-05 23:40:13 -07:00
Takeshi YAMAMURO	b7e8d1cb3c	[SPARK-15585][SQL] Fix NULL handling along with a spark-csv behaivour ## What changes were proposed in this pull request? This pr fixes the behaviour of `format("csv").option("quote", null)` along with one of spark-csv. Also, it explicitly sets default values for CSV options in python. ## How was this patch tested? Added tests in CSVSuite. Author: Takeshi YAMAMURO <linguin.m.s@gmail.com> Closes #13372 from maropu/SPARK-15585.	2016-06-05 23:35:04 -07:00
Hiroshi Inoue	79268aa461	[SPARK-15704][SQL] add a test case in DatasetAggregatorSuite for regression testing ## What changes were proposed in this pull request? This change fixes a crash in TungstenAggregate while executing "Dataset complex Aggregator" test case due to IndexOutOfBoundsException. jira entry for detail: https://issues.apache.org/jira/browse/SPARK-15704 ## How was this patch tested? Using existing unit tests (including DatasetBenchmark) Author: Hiroshi Inoue <inouehrs@jp.ibm.com> Closes #13446 from inouehrs/fix_aggregate.	2016-06-05 20:10:33 -07:00
Josh Rosen	26c1089c37	[SPARK-15748][SQL] Replace inefficient foldLeft() call with flatMap() in PartitionStatistics `PartitionStatistics` uses `foldLeft` and list concatenation (`++`) to flatten an iterator of lists, but this is extremely inefficient compared to simply doing `flatMap`/`flatten` because it performs many unnecessary object allocations. Simply replacing this `foldLeft` by a `flatMap` results in decent performance gains when constructing PartitionStatistics instances for tables with many columns. This patch fixes this and also makes two similar changes in MLlib and streaming to try to fix all known occurrences of this pattern. Author: Josh Rosen <joshrosen@databricks.com> Closes #13491 from JoshRosen/foldleft-to-flatmap.	2016-06-05 16:51:00 -07:00
Lianhui Wang	2ca563cc45	[SPARK-15756][SQL] Support command 'create table stored as orcfile/parquetfile/avrofile' ## What changes were proposed in this pull request? Now Spark SQL can support 'create table src stored as orc/parquet/avro' for orc/parquet/avro table. But Hive can support both commands: ' stored as orc/parquet/avro' and 'stored as orcfile/parquetfile/avrofile'. So this PR supports these keywords 'orcfile/parquetfile/avrofile' in Spark SQL. ## How was this patch tested? add unit tests Author: Lianhui Wang <lianhuiwang09@gmail.com> Closes #13500 from lianhuiwang/SPARK-15756.	2016-06-03 22:19:22 -07:00
Davies Liu	3074f575a3	[SPARK-15391] [SQL] manage the temporary memory of timsort ## What changes were proposed in this pull request? Currently, the memory for temporary buffer used by TimSort is always allocated as on-heap without bookkeeping, it could cause OOM both in on-heap and off-heap mode. This PR will try to manage that by preallocate it together with the pointer array, same with RadixSort. It both works for on-heap and off-heap mode. This PR also change the loadFactor of BytesToBytesMap to 0.5 (it was 0.70), it enables use to radix sort also makes sure that we have enough memory for timsort. ## How was this patch tested? Existing tests. Author: Davies Liu <davies@databricks.com> Closes #13318 from davies/fix_timsort.	2016-06-03 16:45:09 -07:00
Andrew Or	b1cc7da3e3	[SPARK-15722][SQL] Disallow specifying schema in CTAS statement ## What changes were proposed in this pull request? As of this patch, the following throws an exception because the schemas may not match: ``` CREATE TABLE students (age INT, name STRING) AS SELECT * FROM boxes ``` but this is OK: ``` CREATE TABLE students AS SELECT * FROM boxes ``` ## How was this patch tested? SQLQuerySuite, HiveDDLCommandSuite Author: Andrew Or <andrew@databricks.com> Closes #13490 from andrewor14/ctas-no-column.	2016-06-03 14:39:41 -07:00
Wenchen Fan	11c83f83d5	[SPARK-15140][SQL] make the semantics of null input object for encoder clear ## What changes were proposed in this pull request? For input object of non-flat type, we can't encode it to row if it's null, as Spark SQL doesn't allow row to be null, only its columns can be null. This PR explicitly add this constraint and throw exception if users break it. ## How was this patch tested? several new tests Author: Wenchen Fan <wenchen@databricks.com> Closes #13469 from cloud-fan/null-object.	2016-06-03 14:28:19 -07:00
Ioana Delaney	9e2eb13ca5	[SPARK-15677][SQL] Query with scalar sub-query in the SELECT list throws UnsupportedOperationException ## What changes were proposed in this pull request? Queries with scalar sub-query in the SELECT list run against a local, in-memory relation throw UnsupportedOperationException exception. Problem repro: ```SQL scala> Seq((1, 1), (2, 2)).toDF("c1", "c2").createOrReplaceTempView("t1") scala> Seq((1, 1), (2, 2)).toDF("c1", "c2").createOrReplaceTempView("t2") scala> sql("select (select min(c1) from t2) from t1").show() java.lang.UnsupportedOperationException: Cannot evaluate expression: scalar-subquery#62 [] at org.apache.spark.sql.catalyst.expressions.Unevaluable$class.eval(Expression.scala:215) at org.apache.spark.sql.catalyst.expressions.ScalarSubquery.eval(subquery.scala:62) at org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:142) at org.apache.spark.sql.catalyst.expressions.InterpretedProjection.apply(Projection.scala:45) at org.apache.spark.sql.catalyst.expressions.InterpretedProjection.apply(Projection.scala:29) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.immutable.List.foreach(List.scala:381) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.immutable.List.map(List.scala:285) at org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation$$anonfun$apply$37.applyOrElse(Optimizer.scala:1473) ``` The problem is specific to local, in memory relations. It is caused by rule ConvertToLocalRelation, which attempts to push down a scalar-subquery expression to the local tables. The solution prevents the rule to apply if Project references scalar subqueries. ## How was this patch tested? Added regression tests to SubquerySuite.scala Author: Ioana Delaney <ioanamdelaney@gmail.com> Closes #13418 from ioana-delaney/scalarSubV2.	2016-06-03 12:04:27 -07:00
Wenchen Fan	190ff274fd	[SPARK-15494][SQL] encoder code cleanup ## What changes were proposed in this pull request? Our encoder framework has been evolved a lot, this PR tries to clean up the code to make it more readable and emphasise the concept that encoder should be used as a container of serde expressions. 1. move validation logic to analyzer instead of encoder 2. only have a `resolveAndBind` method in encoder instead of `resolve` and `bind`, as we don't have the encoder life cycle concept anymore. 3. `Dataset` don't need to keep a resolved encoder, as there is no such concept anymore. bound encoder is still needed to do serialization outside of query framework. 4. Using `BoundReference` to represent an unresolved field in deserializer expression is kind of weird, this PR adds a `GetColumnByOrdinal` for this purpose. (serializer expression still use `BoundReference`, we can replace it with `GetColumnByOrdinal` in follow-ups) ## How was this patch tested? existing test Author: Wenchen Fan <wenchen@databricks.com> Author: Cheng Lian <lian@databricks.com> Closes #13269 from cloud-fan/clean-encoder.	2016-06-03 00:43:02 -07:00
Dongjoon Hyun	b9fcfb3bd1	[SPARK-15744][SQL] Rename two TungstenAggregation*Suites and update codgen/error messages/comments ## What changes were proposed in this pull request? For consistency, this PR updates some remaining `TungstenAggregation/SortBasedAggregate` after SPARK-15728. - Update a comment in codegen in `VectorizedHashMapGenerator.scala`. - `TungstenAggregationQuerySuite` --> `HashAggregationQuerySuite` - `TungstenAggregationQueryWithControlledFallbackSuite` --> `HashAggregationQueryWithControlledFallbackSuite` - Update two error messages in `SQLQuerySuite.scala` and `AggregationQuerySuite.scala`. - Update several comments. ## How was this patch tested? Manual (Only comment changes and test suite renamings). Author: Dongjoon Hyun <dongjoon@apache.org> Closes #13487 from dongjoon-hyun/SPARK-15744.	2016-06-03 00:36:06 -07:00
Xin Wu	76aa45d359	[SPARK-14959][SQL] handle partitioned table directories in distributed filesystem ## What changes were proposed in this pull request? ##### The root cause: When `DataSource.resolveRelation` is trying to build `ListingFileCatalog` object, `ListLeafFiles` is invoked where a list of `FileStatus` objects are retrieved from the provided path. These FileStatus objects include directories for the partitions (id=0 and id=2 in the jira). However, these directory `FileStatus` objects also try to invoke `getFileBlockLocations` where directory is not allowed for `DistributedFileSystem`, hence the exception happens. This PR is to remove the block of code that invokes `getFileBlockLocations` for every FileStatus object of the provided path. Instead, we call `HadoopFsRelation.listLeafFiles` directly because this utility method filters out the directories before calling `getFileBlockLocations` for generating `LocatedFileStatus` objects. ## How was this patch tested? Regtest is run. Manual test: ``` scala> spark.read.format("parquet").load("hdfs://bdavm009.svl.ibm.com:8020/user/spark/SPARK-14959_part").show +-----+---+ \| text\| id\| +-----+---+ \|hello\| 0\| \|world\| 0\| \|hello\| 1\| \|there\| 1\| +-----+---+ spark.read.format("orc").load("hdfs://bdavm009.svl.ibm.com:8020/user/spark/SPARK-14959_orc").show +-----+---+ \| text\| id\| +-----+---+ \|hello\| 0\| \|world\| 0\| \|hello\| 1\| \|there\| 1\| +-----+---+ ``` I also tried it with 2 level of partitioning. I have not found a way to add test case in the unit test bucket that can test a real hdfs file location. Any suggestions will be appreciated. Author: Xin Wu <xinwu@us.ibm.com> Closes #13463 from xwu0226/SPARK-14959.	2016-06-02 22:49:17 -07:00
Eric Liang	901b2e69ea	[SPARK-15724] Add benchmarks for performance over wide schemas ## What changes were proposed in this pull request? This adds microbenchmarks for tracking performance of queries over very wide or deeply nested DataFrames. It seems performance degrades when DataFrames get thousands of columns wide or hundreds of fields deep. ## How was this patch tested? Current results included. cc rxin JoshRosen Author: Eric Liang <ekl@databricks.com> Closes #13456 from ericl/sc-3468.	2016-06-02 19:42:05 -07:00
Wenchen Fan	6323e4bd76	[SPARK-15732][SQL] better error message when use java reserved keyword as field name ## What changes were proposed in this pull request? When users create a case class and use java reserved keyword as field name, spark sql will generate illegal java code and throw exception at runtime. This PR checks the field names when building the encoder, and if illegal field names are used, throw exception immediately with a good error message. ## How was this patch tested? new test in DatasetSuite Author: Wenchen Fan <wenchen@databricks.com> Closes #13485 from cloud-fan/java.	2016-06-02 18:13:04 -07:00
Wenchen Fan	f34aadc54c	[SPARK-15718][SQL] better error message for writing bucketed data ## What changes were proposed in this pull request? Currently we don't support bucketing for `save` and `insertInto`. For `save`, we just write the data out into a directory users specified, and it's not a table, we don't keep its metadata. When we read it back, we have no idea if the data is bucketed or not, so it doesn't make sense to use `save` to write bucketed data, as we can't use the bucket information anyway. We can support it in the future, once we have features like bucket discovery, or we save bucket information in the data directory too, so that we don't need to rely on a metastore. For `insertInto`, it inserts data into an existing table, so it doesn't make sense to specify bucket information, as we should get the bucket information from the existing table. This PR improves the error message for the above 2 cases. ## How was this patch tested? new test in `BukctedWriteSuite` Author: Wenchen Fan <wenchen@databricks.com> Closes #13452 from cloud-fan/error-msg.	2016-06-02 17:39:56 -07:00
Sean Zhong	985d532812	[SPARK-15734][SQL] Avoids printing internal row in explain output ## What changes were proposed in this pull request? This PR avoids printing internal rows in explain output for some operators. Before change: ``` scala> (1 to 10).toSeq.map(_ => (1,2,3)).toDF().createTempView("df3") scala> spark.sql("select * from df3 where 1=2").explain(true) ... == Analyzed Logical Plan == _1: int, _2: int, _3: int Project [_1#37,_2#38,_3#39] +- Filter (1 = 2) +- SubqueryAlias df3 +- LocalRelation [_1#37,_2#38,_3#39], [[0,1,2,3],[0,1,2,3],[0,1,2,3],[0,1,2,3],[0,1,2,3],[0,1,2,3],[0,1,2,3],[0,1,2,3],[0,1,2,3],[0,1,2,3]] ... == Physical Plan == LocalTableScan [_1#37,_2#38,_3#39] ``` After change: ``` scala> spark.sql("select * from df3 where 1=2").explain(true) ... == Analyzed Logical Plan == _1: int, _2: int, _3: int Project [_1#58,_2#59,_3#60] +- Filter (1 = 2) +- SubqueryAlias df3 +- LocalRelation [_1#58,_2#59,_3#60] ... == Physical Plan == LocalTableScan <empty>, [_1#58,_2#59,_3#60] ``` ## How was this patch tested? Manual test. Author: Sean Zhong <seanzhong@databricks.com> Closes #13471 from clockfly/verbose_breakdown_5.	2016-06-02 16:21:33 -07:00
Cheng Lian	4315427657	[SPARK-15719][SQL] Disables writing Parquet summary files by default ## What changes were proposed in this pull request? This PR disables writing Parquet summary files by default (i.e., when Hadoop configuration "parquet.enable.summary-metadata" is not set). Please refer to [SPARK-15719][1] for more details. ## How was this patch tested? New test case added in `ParquetQuerySuite` to check no summary files are written by default. [1]: https://issues.apache.org/jira/browse/SPARK-15719 Author: Cheng Lian <lian@databricks.com> Closes #13455 from liancheng/spark-15719-disable-parquet-summary-files.	2016-06-02 16:16:27 -07:00
Sean Zhong	d109a1beee	[SPARK-15711][SQL] Ban CREATE TEMPORARY TABLE USING AS SELECT ## What changes were proposed in this pull request? This PR bans syntax like `CREATE TEMPORARY TABLE USING AS SELECT` `CREATE TEMPORARY TABLE ... USING ... AS ...` is not properly implemented, the temporary data is not cleaned up when the session exits. Before a full fix, we probably should ban this syntax. This PR only impact syntax like `CREATE TEMPORARY TABLE ... USING ... AS ...`. Other syntax like `CREATE TEMPORARY TABLE .. USING ...` and `CREATE TABLE ... USING ...` are not impacted. ## How was this patch tested? Unit test. Author: Sean Zhong <seanzhong@databricks.com> Closes #13451 from clockfly/ban_create_temp_table_using_as.	2016-06-02 14:11:01 -07:00
gatorsmile	9aff6f3b19	[SPARK-15515][SQL] Error Handling in Running SQL Directly On Files #### What changes were proposed in this pull request? This PR is to address the following issues: - ISSUE 1: For ORC source format, we are reporting the strange error message when we did not enable Hive support: ```SQL SQL Example: select id from `org.apache.spark.sql.hive.orc`.`file_path` Error Message: Table or view not found: `org.apache.spark.sql.hive.orc`.`file_path` ``` Instead, we should issue the error message like: ``` Expected Error Message: The ORC data source must be used with Hive support enabled ``` - ISSUE 2: For the Avro format, we report the strange error message like: The example query is like ```SQL SQL Example: select id from `avro`.`file_path` select id from `com.databricks.spark.avro`.`file_path` Error Message: Table or view not found: `com.databricks.spark.avro`.`file_path` ``` The desired message should be like: ``` Expected Error Message: Failed to find data source: avro. Please use Spark package http://spark-packages.org/package/databricks/spark-avro" ``` - ~~ISSUE 3: Unable to detect incompatibility libraries for Spark 2.0 in Data Source Resolution. We report a strange error message:~~ Update: The latest code changes contains - For JDBC format, we added an extra checking in the rule `ResolveRelations` of `Analyzer`. Without the PR, Spark will return the error message like: `Option 'url' not specified`. Now, we are reporting `Unsupported data source type for direct query on files: jdbc` - Make data source format name case incensitive so that error handling behaves consistent with the normal cases. - Added the test cases for all the supported formats. #### How was this patch tested? Added test cases to cover all the above issues Author: gatorsmile <gatorsmile@gmail.com> Author: xiaoli <lixiao1983@gmail.com> Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local> Closes #13283 from gatorsmile/runSQLAgainstFile.	2016-06-02 13:22:43 -07:00
Reynold Xin	8900c8d8ff	[SPARK-15728][SQL] Rename aggregate operators: HashAggregate and SortAggregate ## What changes were proposed in this pull request? We currently have two physical aggregate operators: TungstenAggregate and SortBasedAggregate. These names don't make a lot of sense from an end-user point of view. This patch renames them HashAggregate and SortAggregate. ## How was this patch tested? Updated test cases. Author: Reynold Xin <rxin@databricks.com> Closes #13465 from rxin/SPARK-15728.	2016-06-02 12:34:51 -07:00
hyukjinkwon	252417fa21	[SPARK-15322][SQL][FOLLOWUP] Use the new long accumulator for old int accumulators. ## What changes were proposed in this pull request? This PR corrects the remaining cases for using old accumulators. This does not change some old accumulator usages below: - `ImplicitSuite.scala` - Tests dedicated to old accumulator, for implicits with `AccumulatorParam` - `AccumulatorSuite.scala` - Tests dedicated to old accumulator - `JavaSparkContext.scala` - For supporting old accumulators for Java API. - `debug.package.scala` - Usage with `HashSet[String]`. Currently, it seems no implementation for this. I might be able to write an anonymous class for this but I didn't because I think it is not worth writing a lot of codes only for this. - `SQLMetricsSuite.scala` - This uses the old accumulator for checking type boxing. It seems new accumulator does not require type boxing for this case whereas the old one requires (due to the use of generic). ## How was this patch tested? Existing tests cover this. Author: hyukjinkwon <gurwls223@gmail.com> Closes #13434 from HyukjinKwon/accum.	2016-06-02 11:16:24 -05:00
Dongjoon Hyun	b85d18f3bd	[SPARK-15709][SQL] Prevent `freqItems` from raising `UnsupportedOperationException: empty.min` ## What changes were proposed in this pull request? Currently, `freqItems` raises `UnsupportedOperationException` on `empty.min` usually when its `support` argument is high. ```scala scala> spark.createDataset(Seq(1, 2, 2, 3, 3, 3)).stat.freqItems(Seq("value"), 2) 16/06/01 11:11:38 ERROR Executor: Exception in task 5.0 in stage 0.0 (TID 5) java.lang.UnsupportedOperationException: empty.min ... ``` Also, the parameter checking message is wrong. ``` require(support >= 1e-4, s"support ($support) must be greater than 1e-4.") ``` This PR changes the logic to handle the `empty` case and also improves parameter checking. ## How was this patch tested? Pass the Jenkins tests (with a new testcase). Author: Dongjoon Hyun <dongjoon@apache.org> Closes #13449 from dongjoon-hyun/SPARK-15709.	2016-06-02 11:12:17 -05:00
Takeshi YAMAMURO	5eea332307	[SPARK-13484][SQL] Prevent illegal NULL propagation when filtering outer-join results ## What changes were proposed in this pull request? This PR add a rule at the end of analyzer to correct nullable fields of attributes in a logical plan by using nullable fields of the corresponding attributes in its children logical plans (these plans generate the input rows). This is another approach for addressing SPARK-13484 (the first approach is https://github.com/apache/spark/pull/11371). Close #113711 Author: Takeshi YAMAMURO <linguin.m.s@gmail.com> Author: Yin Huai <yhuai@databricks.com> Closes #13290 from yhuai/SPARK-13484.	2016-06-01 22:23:00 -07:00
jerryshao	8288e16a5a	[SPARK-15620][SQL] Fix transformed dataset attributes revolve failure ## What changes were proposed in this pull request? Join on transformed dataset has attributes conflicts, which make query execution failure, for example: ``` val dataset = Seq(1, 2, 3).toDs val mappedDs = dataset.map(_ + 1) mappedDs.as("t1").joinWith(mappedDs.as("t2"), $"t1.value" === $"t2.value").show() ``` will throw exception: ``` org.apache.spark.sql.AnalysisException: cannot resolve '`t1.value`' given input columns: [value]; at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:62) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:59) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:287) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:287) ``` ## How was this patch tested? Unit test. Author: jerryshao <sshao@hortonworks.com> Closes #13399 from jerryshao/SPARK-15620.	2016-06-01 21:58:05 -07:00
Yin Huai	6dddb70c38	[SPARK-15646][SQL] When spark.sql.hive.convertCTAS is true, the conversion rule needs to respect TEXTFILE/SEQUENCEFILE format and the user-defined location ## What changes were proposed in this pull request? When `spark.sql.hive.convertCTAS` is true, for a CTAS statement, we will create a data source table using the default source (i.e. parquet) if the CTAS does not specify any Hive storage format. However, there are two issues with this conversion logic. 1. First, we determine if a CTAS statement defines storage format by checking the serde. However, TEXTFILE/SEQUENCEFILE does not have a default serde. When we do the check, we have not set the default serde. So, a query like `CREATE TABLE abc STORED AS TEXTFILE AS SELECT ...` actually creates a data source parquet table. 2. In the conversion logic, we are ignoring the user-specified location. This PR fixes the above two issues. Also, this PR makes the parser throws an exception when a CTAS statement has a PARTITIONED BY clause. This change is made because Hive's syntax does not allow it and our current implementation actually does not work for this case (the insert operation always throws an exception because the insertion does not pick up the partitioning info). ## How was this patch tested? I am adding new tests in SQLQuerySuite and HiveDDLCommandSuite. Author: Yin Huai <yhuai@databricks.com> Closes #13386 from yhuai/SPARK-14507.	2016-06-01 17:55:37 -07:00
Sean Zhong	c8fb776d4a	[SPARK-15692][SQL] Improves the explain output of several physical plans by displaying embedded logical plan in tree style ## What changes were proposed in this pull request? Improves the explain output of several physical plans by displaying embedded logical plan in tree style Some physical plan contains a embedded logical plan, for example, `cache tableName query` maps to: ``` case class CacheTableCommand( tableName: String, plan: Option[LogicalPlan], isLazy: Boolean) extends RunnableCommand ``` It is easier to read the explain output if we can display the `plan` in tree style. Before change: Everything is messed in one line. ``` scala> Seq((1,2)).toDF().createOrReplaceTempView("testView") scala> spark.sql("cache table testView2 select * from testView").explain() == Physical Plan == ExecutedCommand CacheTableCommand testView2, Some('Project [] +- 'UnresolvedRelation `testView`, None ), false ``` After change:* ``` scala> spark.sql("cache table testView2 select * from testView").explain() == Physical Plan == ExecutedCommand : +- CacheTableCommand testView2, false : : +- 'Project [*] : : +- 'UnresolvedRelation `testView`, None ``` ## How was this patch tested? Manual test. Author: Sean Zhong <seanzhong@databricks.com> Closes #13433 from clockfly/verbose_breakdown_3_2.	2016-06-01 17:03:39 -07:00
Wenchen Fan	8640cdb836	[SPARK-15441][SQL] support null object in Dataset outer-join ## What changes were proposed in this pull request? Currently we can't encode top level null object into internal row, as Spark SQL doesn't allow row to be null, only its columns can be null. This is not a problem before, as we assume the input object is never null. However, for outer join, we do need the semantics of null object. This PR fixes this problem by making both join sides produce a single column, i.e. nest the logical plan output(by `CreateStruct`), so that we have an extra level to represent top level null obejct. ## How was this patch tested? new test in `DatasetSuite` Author: Wenchen Fan <wenchen@databricks.com> Closes #13425 from cloud-fan/outer-join2.	2016-06-01 16:16:54 -07:00
Cheng Lian	7bb64aae27	[SPARK-15269][SQL] Removes unexpected empty table directories created while creating external Spark SQL data sourcet tables. This PR is an alternative to #13120 authored by xwu0226. ## What changes were proposed in this pull request? When creating an external Spark SQL data source table and persisting its metadata to Hive metastore, we don't use the standard Hive `Table.dataLocation` field because Hive only allows directory paths as data locations while Spark SQL also allows file paths. However, if we don't set `Table.dataLocation`, Hive always creates an unexpected empty table directory under database location, but doesn't remove it while dropping the table (because the table is external). This PR works around this issue by explicitly setting `Table.dataLocation` and then manullay removing the created directory after creating the external table. Please refer to [this JIRA comment][1] for more details about why we chose this approach as a workaround. [1]: https://issues.apache.org/jira/browse/SPARK-15269?focusedCommentId=15297408&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15297408 ## How was this patch tested? 1. A new test case is added in `HiveQuerySuite` for this case 2. Updated `ShowCreateTableSuite` to use the same table name in all test cases. (This is how I hit this issue at the first place.) Author: Cheng Lian <lian@databricks.com> Closes #13270 from liancheng/spark-15269-unpleasant-fix.	2016-06-01 16:02:27 -07:00
Andrew Or	9e2643b21d	[SPARK-15596][SPARK-15635][SQL] ALTER TABLE RENAME fixes ## What changes were proposed in this pull request? SPARK-15596: Even after we renamed a cached table, the plan would remain in the cache with the old table name. If I created a new table using the old name then the old table would return incorrect data. Note that this applies only to Hive tables. SPARK-15635: Renaming a datasource table would render the table not query-able. This is because we store the location of the table in a "path" property, which was not updated to reflect Hive's change in table location following a rename. ## How was this patch tested? DDLSuite Author: Andrew Or <andrew@databricks.com> Closes #13416 from andrewor14/rename-table.	2016-06-01 14:26:24 -07:00
Reynold Xin	a71d1364ae	[SPARK-15686][SQL] Move user-facing streaming classes into sql.streaming ## What changes were proposed in this pull request? This patch moves all user-facing structured streaming classes into sql.streaming. As part of this, I also added some since version annotation to methods and classes that don't have them. ## How was this patch tested? Updated tests to reflect the moves. Author: Reynold Xin <rxin@databricks.com> Closes #13429 from rxin/SPARK-15686.	2016-06-01 10:14:40 -07:00
Sean Zhong	d5012c2740	[SPARK-15495][SQL] Improve the explain output for Aggregation operator ## What changes were proposed in this pull request? This PR improves the explain output of Aggregator operator. SQL: ``` Seq((1,2,3)).toDF("a", "b", "c").createTempView("df1") spark.sql("cache table df1") spark.sql("select count(a), count(c), b from df1 group by b").explain() ``` Before change: ``` TungstenAggregate(key=[b#8], functions=[count(1),count(1)], output=[count(a)#79L,count(c)#80L,b#8]) +- Exchange hashpartitioning(b#8, 200), None +- TungstenAggregate(key=[b#8], functions=[partial_count(1),partial_count(1)], output=[b#8,count#98L,count#99L]) +- InMemoryTableScan [b#8], InMemoryRelation [a#7,b#8,c#9], true, 10000, StorageLevel(disk=true, memory=true, offheap=false, deserialized=true, replication=1), LocalTableScan [a#7,b#8,c#9], [[1,2,3]], Some(df1) `````` After change: ``` Aggregate(key=[b#8], functions=[count(1),count(1)], output=[count(a)#79L,count(c)#80L,b#8]) +- Exchange hashpartitioning(b#8, 200), None +- Aggregate(key=[b#8], functions=[partial_count(1),partial_count(1)], output=[b#8,count#98L,count#99L]) +- InMemoryTableScan [b#8], InMemoryRelation [a#7,b#8,c#9], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas), LocalTableScan [a#7,b#8,c#9], [[1,2,3]], Some(df1) ``` ## How was this patch tested? Manual test and existing UT. Author: Sean Zhong <seanzhong@databricks.com> Closes #13363 from clockfly/verbose3.	2016-06-01 09:58:01 -07:00
Cheng Lian	1f43562daf	[SPARK-14343][SQL] Proper column pruning for text data source ## What changes were proposed in this pull request? Text data source ignores requested schema, and may give wrong result when the only data column is not requested. This may happen when only partitioning column(s) are requested for a partitioned text table. ## How was this patch tested? New test case added in `TextSuite`. Author: Cheng Lian <lian@databricks.com> Closes #13431 from liancheng/spark-14343-partitioned-text-table.	2016-06-01 07:30:55 -07:00
Andrew Or	1dd9256441	[HOTFIX] DDLSuite was broken by `93e9714`	2016-05-31 20:06:08 -07:00
Dongjoon Hyun	85d6b0db9f	[SPARK-15618][SQL][MLLIB] Use SparkSession.builder.sparkContext if applicable. ## What changes were proposed in this pull request? This PR changes function `SparkSession.builder.sparkContext(..)` from private[sql] into private[spark], and uses it if applicable like the followings. ``` - val spark = SparkSession.builder().config(sc.getConf).getOrCreate() + val spark = SparkSession.builder().sparkContext(sc).getOrCreate() ``` ## How was this patch tested? Pass the existing Jenkins tests. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #13365 from dongjoon-hyun/SPARK-15618.	2016-05-31 17:40:44 -07:00
Dongjoon Hyun	196a0d8273	[MINOR][SQL][DOCS] Fix docs of Dataset.scala and SQLImplicits.scala. This PR fixes a sample code, a description, and indentations in docs. Manual. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #13420 from dongjoon-hyun/minor_fix_dataset_doc.	2016-05-31 17:37:33 -07:00
Sean Zhong	06514d689c	[SPARK-12988][SQL] Can't drop top level columns that contain dots ## What changes were proposed in this pull request? Fixes "Can't drop top level columns that contain dots". This work is based on dilipbiswal's https://github.com/apache/spark/pull/10943. This PR fixes problems like: ``` scala> Seq((1, 2)).toDF("a.b", "a.c").drop("a.b") org.apache.spark.sql.AnalysisException: cannot resolve '`a.c`' given input columns: [a.b, a.c]; ``` `drop(columnName)` can only be used to drop top level column, so, we should parse the column name literally WITHOUT interpreting dot "." We should also NOT interpret back tick "`", otherwise it is hard to understand what ``` ```aaa```bbb`` ``` actually means. ## How was this patch tested? Unit tests. Author: Sean Zhong <seanzhong@databricks.com> Closes #13306 from clockfly/fix_drop_column.	2016-05-31 17:34:10 -07:00
Reynold Xin	223f1d58c4	[SPARK-15662][SQL] Add since annotation for classes in sql.catalog ## What changes were proposed in this pull request? This patch does a few things: 1. Adds since version annotation to methods and classes in sql.catalog. 2. Fixed a typo in FilterFunction and a whitespace issue in spark/api/java/function/package.scala 3. Added "database" field to Function class. ## How was this patch tested? Updated unit test case for "database" field in Function class. Author: Reynold Xin <rxin@databricks.com> Closes #13406 from rxin/SPARK-15662.	2016-05-31 17:29:10 -07:00
Tathagata Das	90b11439b3	[SPARK-15517][SQL][STREAMING] Add support for complete output mode in Structure Streaming ## What changes were proposed in this pull request? Currently structured streaming only supports append output mode. This PR adds the following. - Added support for Complete output mode in the internal state store, analyzer and planner. - Added public API in Scala and Python for users to specify output mode - Added checks for unsupported combinations of output mode and DF operations - Plans with no aggregation should support only Append mode - Plans with aggregation should support only Update and Complete modes - Default output mode is Append mode (Question: should we change this to automatically set to Complete mode when there is aggregation?) - Added support for Complete output mode in Memory Sink. So Memory Sink internally supports append and complete, update. But from public API only Complete and Append output modes are supported. ## How was this patch tested? Unit tests in various test suites - StreamingAggregationSuite: tests for complete mode - MemorySinkSuite: tests for checking behavior in Append and Complete modes. - UnsupportedOperationSuite: tests for checking unsupported combinations of DF ops and output modes - DataFrameReaderWriterSuite: tests for checking that output mode cannot be called on static DFs - Python doc test and existing unit tests modified to call write.outputMode. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #13286 from tdas/complete-mode.	2016-05-31 15:57:01 -07:00
Dilip Biswal	dfe2cbeb43	[SPARK-15557] [SQL] cast the string into DoubleType when it's used together with decimal In this case, the result type of the expression becomes DECIMAL(38, 36) as we promote the individual string literals to DECIMAL(38, 18) when we handle string promotions for `BinaryArthmaticExpression`. I think we need to cast the string literals to Double type instead. I looked at the history and found that this was changed to use decimal instead of double to avoid potential loss of precision when we cast decimal to double. To double check i ran the query against hive, mysql. This query returns non NULL result for both the databases and both promote the expression to use double. Here is the output. - Hive ```SQL hive> create table l2 as select (cast(99 as decimal(19,6)) + '2') from l1; OK hive> describe l2; OK _c0 double ``` - MySQL ```SQL mysql> create table foo2 as select (cast(99 as decimal(19,6)) + '2') from test; Query OK, 1 row affected (0.01 sec) Records: 1 Duplicates: 0 Warnings: 0 mysql> describe foo2; +-----------------------------------+--------+------+-----+---------+-------+ \| Field \| Type \| Null \| Key \| Default \| Extra \| +-----------------------------------+--------+------+-----+---------+-------+ \| (cast(99 as decimal(19,6)) + '2') \| double \| NO \| \| 0 \| \| +-----------------------------------+--------+------+-----+---------+-------+ ``` ## How was this patch tested? Added a new test in SQLQuerySuite Author: Dilip Biswal <dbiswal@us.ibm.com> Closes #13368 from dilipbiswal/spark-15557.	2016-05-31 15:49:45 -07:00
Davies Liu	2df6ca848e	[SPARK-15327] [SQL] fix split expression in whole stage codegen ## What changes were proposed in this pull request? Right now, we will split the code for expressions into multiple functions when it exceed 64k, which requires that the the expressions are using Row object, but this is not true for whole-state codegen, it will fail to compile after splitted. This PR will not split the code in whole-stage codegen. ## How was this patch tested? Added regression tests. Author: Davies Liu <davies@databricks.com> Closes #13235 from davies/fix_nested_codegen.	2016-05-31 15:36:02 -07:00
Shixiong Zhu	9a74de18a1	Revert "[SPARK-11753][SQL][TEST-HADOOP2.2] Make allowNonNumericNumbers option work ## What changes were proposed in this pull request? This reverts commit `c24b6b679c`. Sent a PR to run Jenkins tests due to the revert conflicts of `dev/deps/spark-deps-hadoop*`. ## How was this patch tested? Jenkins unit tests, integration tests, manual tests) Author: Shixiong Zhu <shixiong@databricks.com> Closes #13417 from zsxwing/revert-SPARK-11753.	2016-05-31 14:50:07 -07:00
Wenchen Fan	2bfed1a0c5	[SPARK-15658][SQL] UDT serializer should declare its data type as udt instead of udt.sqlType ## What changes were proposed in this pull request? When we build serializer for UDT object, we should declare its data type as udt instead of udt.sqlType, or if we deserialize it again, we lose the information that it's a udt object and throw analysis exception. ## How was this patch tested? new test in `UserDefiendTypeSuite` Author: Wenchen Fan <wenchen@databricks.com> Closes #13402 from cloud-fan/udt.	2016-05-31 11:00:38 -07:00
gatorsmile	d67c82e4b6	[SPARK-15647][SQL] Fix Boundary Cases in OptimizeCodegen Rule #### What changes were proposed in this pull request? The following condition in the Optimizer rule `OptimizeCodegen` is not right. ```Scala branches.size < conf.maxCaseBranchesForCodegen ``` - The number of branches in case when clause should be `branches.size + elseBranch.size`. - `maxCaseBranchesForCodegen` is the maximum boundary for enabling codegen. Thus, we should use `<=` instead of `<`. This PR is to fix this boundary case and also add missing test cases for verifying the conf `MAX_CASES_BRANCHES`. #### How was this patch tested? Added test cases in `SQLConfSuite` Author: gatorsmile <gatorsmile@gmail.com> Closes #13392 from gatorsmile/maxCaseWhen.	2016-05-31 10:08:00 -07:00
Reynold Xin	675921040e	[SPARK-15638][SQL] Audit Dataset, SparkSession, and SQLContext ## What changes were proposed in this pull request? This patch contains a list of changes as a result of my auditing Dataset, SparkSession, and SQLContext. The patch audits the categorization of experimental APIs, function groups, and deprecations. For the detailed list of changes, please see the diff. ## How was this patch tested? N/A Author: Reynold Xin <rxin@databricks.com> Closes #13370 from rxin/SPARK-15638.	2016-05-30 22:47:58 -07:00
Cheng Lian	1360a6d636	[SPARK-15112][SQL] Disables EmbedSerializerInFilter for plan fragments that change schema ## What changes were proposed in this pull request? `EmbedSerializerInFilter` implicitly assumes that the plan fragment being optimized doesn't change plan schema, which is reasonable because `Dataset.filter` should never change the schema. However, due to another issue involving `DeserializeToObject` and `SerializeFromObject`, typed filter does change plan schema (see [SPARK-15632][1]). This breaks `EmbedSerializerInFilter` and causes corrupted data. This PR disables `EmbedSerializerInFilter` when there's a schema change to avoid data corruption. The schema change issue should be addressed in follow-up PRs. ## How was this patch tested? New test case added in `DatasetSuite`. [1]: https://issues.apache.org/jira/browse/SPARK-15632 Author: Cheng Lian <lian@databricks.com> Closes #13362 from liancheng/spark-15112-corrupted-filter.	2016-05-29 23:19:12 -07:00
Sean Owen	ce1572d16f	[MINOR] Resolve a number of miscellaneous build warnings ## What changes were proposed in this pull request? This change resolves a number of build warnings that have accumulated, before 2.x. It does not address a large number of deprecation warnings, especially related to the Accumulator API. That will happen separately. ## How was this patch tested? Jenkins Author: Sean Owen <sowen@cloudera.com> Closes #13377 from srowen/BuildWarnings.	2016-05-29 16:48:14 -05:00
Yadong Qi	b4c32c4952	[SPARK-15549][SQL] Disable bucketing when the output doesn't contain all bucketing columns ## What changes were proposed in this pull request? I create a bucketed table bucketed_table with bucket column i, ```scala case class Data(i: Int, j: Int, k: Int) sc.makeRDD(Array((1, 2, 3))).map(x => Data(x._1, x._2, x._3)).toDF.write.bucketBy(2, "i").saveAsTable("bucketed_table") ``` and I run the following SQLs: ```sql SELECT j FROM bucketed_table; Error in query: bucket column i not found in existing columns (j); SELECT j, MAX(k) FROM bucketed_table GROUP BY j; Error in query: bucket column i not found in existing columns (j, k); ``` I think we should add a check that, we only enable bucketing when it satisfies all conditions below: 1. the conf is enabled 2. the relation is bucketed 3. the output contains all bucketing columns ## How was this patch tested? Updated test cases to reflect the changes. Author: Yadong Qi <qiyadong2010@gmail.com> Closes #13321 from watermen/SPARK-15549.	2016-05-28 10:19:29 -07:00
Liang-Chi Hsieh	f1b220eeee	[SPARK-15553][SQL] Dataset.createTempView should use CreateViewCommand ## What changes were proposed in this pull request? Let `Dataset.createTempView` and `Dataset.createOrReplaceTempView` use `CreateViewCommand`, rather than calling `SparkSession.createTempView`. Besides, this patch also removes `SparkSession.createTempView`. ## How was this patch tested? Existing tests. Author: Liang-Chi Hsieh <simonh@tw.ibm.com> Closes #13327 from viirya/dataset-createtempview.	2016-05-27 21:24:08 -07:00
Reynold Xin	73178c7556	[SPARK-15633][MINOR] Make package name for Java tests consistent ## What changes were proposed in this pull request? This is a simple patch that makes package names for Java 8 test suites consistent. I moved everything to test.org.apache.spark to we can test package private APIs properly. Also added "java8" as the package name so we can easily run all the tests related to Java 8. ## How was this patch tested? This is a test only change. Author: Reynold Xin <rxin@databricks.com> Closes #13364 from rxin/SPARK-15633.	2016-05-27 21:20:02 -07:00
Andrew Or	4a2fb8b87c	[SPARK-15594][SQL] ALTER TABLE SERDEPROPERTIES does not respect partition spec ## What changes were proposed in this pull request? These commands ignore the partition spec and change the storage properties of the table itself: ``` ALTER TABLE table_name PARTITION (a=1, b=2) SET SERDE 'my_serde' ALTER TABLE table_name PARTITION (a=1, b=2) SET SERDEPROPERTIES ('key1'='val1') ``` Now they change the storage properties of the specified partition. ## How was this patch tested? DDLSuite Author: Andrew Or <andrew@databricks.com> Closes #13343 from andrewor14/alter-table-serdeproperties.	2016-05-27 17:27:24 -07:00
Ryan Blue	776d183c82	[SPARK-9876][SQL] Update Parquet to 1.8.1. ## What changes were proposed in this pull request? This includes minimal changes to get Spark using the current release of Parquet, 1.8.1. ## How was this patch tested? This uses the existing Parquet tests. Author: Ryan Blue <blue@apache.org> Closes #13280 from rdblue/SPARK-9876-update-parquet.	2016-05-27 16:59:38 -07:00
Tejas Patil	a96e4151a9	[SPARK-14400][SQL] ScriptTransformation does not fail the job for bad user command ## What changes were proposed in this pull request? - Refer to the Jira for the problem: jira : https://issues.apache.org/jira/browse/SPARK-14400 - The fix is to check if the process has exited with a non-zero exit code in `hasNext()`. I have moved this and checking of writer thread exception to a separate method. ## How was this patch tested? - Ran a job which had incorrect transform script command and saw that the job fails - Existing unit tests for `ScriptTransformationSuite`. Added a new unit test Author: Tejas Patil <tejasp@fb.com> Closes #12194 from tejasapatil/script_transform.	2016-05-27 12:05:11 -07:00
Xinh Huynh	5bdbedf220	[MINOR][DOCS] Typo fixes in Dataset scaladoc ## What changes were proposed in this pull request? Minor typo fixes in Dataset scaladoc * Corrected context type as SparkSession, not SQLContext. liancheng rxin andrewor14 ## How was this patch tested? Compiled locally Author: Xinh Huynh <xinh_huynh@yahoo.com> Closes #13330 from xinhhuynh/fix-dataset-typos.	2016-05-27 11:13:53 -07:00
Reynold Xin	a52e681339	[SPARK-15597][SQL] Add SparkSession.emptyDataset ## What changes were proposed in this pull request? This patch adds a new function emptyDataset to SparkSession, for creating an empty dataset. ## How was this patch tested? Added a test case. Author: Reynold Xin <rxin@databricks.com> Closes #13344 from rxin/SPARK-15597.	2016-05-27 11:13:09 -07:00
Sameer Agarwal	635fb30f83	[SPARK-15599][SQL][DOCS] API docs for `createDataset` functions in SparkSession ## What changes were proposed in this pull request? Adds API docs and usage examples for the 3 `createDataset` calls in `SparkSession` ## How was this patch tested? N/A Author: Sameer Agarwal <sameer@databricks.com> Closes #13345 from sameeragarwal/dataset-doc.	2016-05-27 11:11:31 -07:00
Dongjoon Hyun	4538443e27	[SPARK-15584][SQL] Abstract duplicate code: `spark.sql.sources.` properties ## What changes were proposed in this pull request? This PR replaces `spark.sql.sources.` strings with `CreateDataSourceTableUtils.*` constant variables. ## How was this patch tested? Pass the existing Jenkins tests. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #13349 from dongjoon-hyun/SPARK-15584.	2016-05-27 11:10:31 -07:00
gatorsmile	c17272902c	[SPARK-15565][SQL] Add the File Scheme to the Default Value of WAREHOUSE_PATH #### What changes were proposed in this pull request? The default value of `spark.sql.warehouse.dir` is `System.getProperty("user.dir")/spark-warehouse`. Since `System.getProperty("user.dir")` is a local dir, we should explicitly set the scheme to local filesystem. cc yhuai #### How was this patch tested? Added two test cases Author: gatorsmile <gatorsmile@gmail.com> Closes #13348 from gatorsmile/addSchemeToDefaultWarehousePath.	2016-05-27 09:54:31 -07:00
gatorsmile	d5911d1173	[SPARK-15529][SQL] Replace SQLContext and HiveContext with SparkSession in Test #### What changes were proposed in this pull request? This PR is to use the new entrance `Sparksession` to replace the existing `SQLContext` and `HiveContext` in SQL test suites. No change is made in the following suites: - `ListTablesSuite` is to test the APIs of `SQLContext`. - `SQLContextSuite` is to test `SQLContext` - `HiveContextCompatibilitySuite` is to test `HiveContext` Update: Move tests in `ListTableSuite` to `SQLContextSuite` #### How was this patch tested? N/A Author: gatorsmile <gatorsmile@gmail.com> Author: xiaoli <lixiao1983@gmail.com> Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local> Closes #13337 from gatorsmile/sparkSessionTest.	2016-05-26 22:40:57 -07:00
Zheng RuiFeng	6b1a6180e7	[MINOR] Fix Typos 'a -> an' ## What changes were proposed in this pull request? `a` -> `an` I use regex to generate potential error lines: `grep -in ' a [aeiou]' mllib/src/main/scala/org/apache/spark/ml//scala` and review them line by line. ## How was this patch tested? local build `lint-java` checking Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #13317 from zhengruifeng/a_an.	2016-05-26 22:39:14 -07:00
Andrew Or	3fca635b4e	[SPARK-15583][SQL] Disallow altering datasource properties ## What changes were proposed in this pull request? Certain table properties (and SerDe properties) are in the protected namespace `spark.sql.sources.`, which we use internally for datasource tables. The user should not be allowed to (1) Create a Hive table setting these properties (2) Alter these properties in an existing table Previously, we threw an exception if the user tried to alter the properties of an existing datasource table. However, this is overly restrictive for datasource tables and does not do anything for Hive tables. ## How was this patch tested? DDLSuite Author: Andrew Or <andrew@databricks.com> Closes #13341 from andrewor14/alter-table-props.	2016-05-26 20:11:09 -07:00
Andrew Or	008a5377d5	[SPARK-15538][SPARK-15539][SQL] Truncate table fixes round 2 ## What changes were proposed in this pull request? Two more changes: (1) Fix truncate table for data source tables (only for cases without `PARTITION`) (2) Disallow truncating external tables or views ## How was this patch tested? `DDLSuite` Author: Andrew Or <andrew@databricks.com> Closes #13315 from andrewor14/truncate-table.	2016-05-26 19:01:41 -07:00
Yin Huai	3ac2363d75	[SPARK-15532][SQL] SQLContext/HiveContext's public constructors should use SparkSession.build.getOrCreate ## What changes were proposed in this pull request? This PR changes SQLContext/HiveContext's public constructor to use SparkSession.build.getOrCreate and removes isRootContext from SQLContext. ## How was this patch tested? Existing tests. Author: Yin Huai <yhuai@databricks.com> Closes #13310 from yhuai/SPARK-15532.	2016-05-26 16:53:31 -07:00
Cheng Lian	e7082caeb4	[SPARK-15550][SQL] Dataset.show() should show contents nested products as rows ## What changes were proposed in this pull request? This PR addresses two related issues: 1. `Dataset.showString()` should show case classes/Java beans at all levels as rows, while master code only handles top level ones. 2. `Dataset.showString()` should show full contents produced the underlying query plan Dataset is only a view of the underlying query plan. Columns not referred by the encoder are still reachable using methods like `Dataset.col`. So it probably makes more sense to show full contents of the query plan. ## How was this patch tested? Two new test cases are added in `DatasetSuite` to check `.showString()` output. Author: Cheng Lian <lian@databricks.com> Closes #13331 from liancheng/spark-15550-ds-show.	2016-05-26 16:23:48 -07:00
Sean Zhong	b5859e0bb8	[SPARK-13445][SQL] Improves error message and add test coverage for Window function ## What changes were proposed in this pull request? Add more verbose error message when order by clause is missed when using Window function. ## How was this patch tested? Unit test. Author: Sean Zhong <seanzhong@databricks.com> Closes #13333 from clockfly/spark-13445.	2016-05-26 14:50:00 -07:00
Reynold Xin	0f61d6efb4	[SPARK-15552][SQL] Remove unnecessary private[sql] methods in SparkSession ## What changes were proposed in this pull request? SparkSession has a list of unnecessary private[sql] methods. These methods cause some trouble because private[sql] doesn't apply in Java. In the cases that they are easy to remove, we can simply remove them. This patch does that. As part of this pull request, I also replaced a bunch of protected[sql] with private[sql], to tighten up visibility. ## How was this patch tested? Updated test cases to reflect the changes. Author: Reynold Xin <rxin@databricks.com> Closes #13319 from rxin/SPARK-15552.	2016-05-26 13:03:07 -07:00
Andrew Or	2b1ac6cea8	[SPARK-15539][SQL] DROP TABLE throw exception if table doesn't exist ## What changes were proposed in this pull request? Same as #13302, but for DROP TABLE. ## How was this patch tested? `DDLSuite` Author: Andrew Or <andrew@databricks.com> Closes #13307 from andrewor14/drop-table.	2016-05-26 12:04:18 -07:00
Reynold Xin	361ebc282b	[SPARK-15543][SQL] Rename DefaultSources to make them more self-describing ## What changes were proposed in this pull request? This patch renames various DefaultSources to make their names more self-describing. The choice of "DefaultSource" was from the days when we did not have a good way to specify short names. They are now named: - LibSVMFileFormat - CSVFileFormat - JdbcRelationProvider - JsonFileFormat - ParquetFileFormat - TextFileFormat Backward compatibility is maintained through aliasing. ## How was this patch tested? Updated relevant test cases too. Author: Reynold Xin <rxin@databricks.com> Closes #13311 from rxin/SPARK-15543.	2016-05-25 23:54:24 -07:00
Sameer Agarwal	06ed1fa3e4	[SPARK-15533][SQL] Deprecate Dataset.explode ## What changes were proposed in this pull request? This patch deprecates `Dataset.explode` and documents appropriate workarounds to use `flatMap()` or `functions.explode()` instead. ## How was this patch tested? N/A Author: Sameer Agarwal <sameer@databricks.com> Closes #13312 from sameeragarwal/deprecate.	2016-05-25 19:10:57 -07:00
Andrew Or	ee682fe293	[SPARK-15534][SPARK-15535][SQL] Truncate table fixes ## What changes were proposed in this pull request? Two changes: - When things fail, `TRUNCATE TABLE` just returns nothing. Instead, we should throw exceptions. - Remove `TRUNCATE TABLE ... COLUMN`, which was never supported by either Spark or Hive. ## How was this patch tested? Jenkins. Author: Andrew Or <andrew@databricks.com> Closes #13302 from andrewor14/truncate-table.	2016-05-25 15:08:39 -07:00
Jurriaan Pruis	c875d81a3d	[SPARK-15493][SQL] default QuoteEscapingEnabled flag to true when writing CSV ## What changes were proposed in this pull request? Default QuoteEscapingEnabled flag to true when writing CSV and add an escapeQuotes option to be able to change this. See `f3eb2af263/src/main/java/com/univocity/parsers/csv/CsvWriterSettings.java (L231-L247)` This change is needed to be able to write RFC 4180 compatible CSV files (https://tools.ietf.org/html/rfc4180#section-2) https://issues.apache.org/jira/browse/SPARK-15493 ## How was this patch tested? Added a test that verifies the output is quoted correctly. Author: Jurriaan Pruis <email@jurriaanpruis.nl> Closes #13267 from jurriaan/quote-escaping.	2016-05-25 12:40:16 -07:00
Takuya UESHIN	4b88067416	[SPARK-15483][SQL] IncrementalExecution should use extra strategies. ## What changes were proposed in this pull request? Extra strategies does not work for streams because `IncrementalExecution` uses modified planner with stateful operations but it does not include extra strategies. This pr fixes `IncrementalExecution` to include extra strategies to use them. ## How was this patch tested? I added a test to check if extra strategies work for streams. Author: Takuya UESHIN <ueshin@happy-camper.st> Closes #13261 from ueshin/issues/SPARK-15483.	2016-05-25 12:02:07 -07:00
lfzCarlosC	02c8072eea	[MINOR][MLLIB][STREAMING][SQL] Fix typos fixed typos for source code for components [mllib] [streaming] and [SQL] None and obvious. Author: lfzCarlosC <lfz.carlos@gmail.com> Closes #13298 from lfzCarlosC/master.	2016-05-25 10:53:57 -07:00
Jeff Zhang	01e7b9c85b	[SPARK-15345][SQL][PYSPARK] SparkSession's conf doesn't take effect when this already an existing SparkContext ## What changes were proposed in this pull request? Override the existing SparkContext is the provided SparkConf is different. PySpark part hasn't been fixed yet, will do that after the first round of review to ensure this is the correct approach. ## How was this patch tested? Manually verify it in spark-shell. rxin Please help review it, I think this is a very critical issue for spark 2.0 Author: Jeff Zhang <zjffdu@apache.org> Closes #13160 from zjffdu/SPARK-15345.	2016-05-25 10:46:51 -07:00
Reynold Xin	4f27b8dd58	[SPARK-15436][SQL] Remove DescribeFunction and ShowFunctions ## What changes were proposed in this pull request? This patch removes the last two commands defined in the catalyst module: DescribeFunction and ShowFunctions. They were unnecessary since the parser could just generate DescribeFunctionCommand and ShowFunctionsCommand directly. ## How was this patch tested? Created a new SparkSqlParserSuite. Author: Reynold Xin <rxin@databricks.com> Closes #13292 from rxin/SPARK-15436.	2016-05-25 19:17:53 +02:00
Wenchen Fan	50b660d725	[SPARK-15498][TESTS] fix slow tests ## What changes were proposed in this pull request? This PR fixes 3 slow tests: 1. `ParquetQuerySuite.read/write wide table`: This is not a good unit test as it runs more than 5 minutes. This PR removes it and add a new regression test in `CodeGenerationSuite`, which is more "unit". 2. `ParquetQuerySuite.returning batch for wide table`: reduce the threshold and use smaller data size. 3. `DatasetSuite.SPARK-14554: Dataset.map may generate wrong java code for wide table`: Improve `CodeFormatter.format`(introduced at https://github.com/apache/spark/pull/12979) can dramatically speed this it up. ## How was this patch tested? N/A Author: Wenchen Fan <wenchen@databricks.com> Closes #13273 from cloud-fan/test.	2016-05-24 21:23:39 -07:00
Parth Brahmbhatt	4acababcab	[SPARK-15365][SQL] When table size statistics are not available from metastore, we should fallback to HDFS ## What changes were proposed in this pull request? Currently if a table is used in join operation we rely on Metastore returned size to calculate if we can convert the operation to Broadcast join. This optimization only kicks in for table's that have the statistics available in metastore. Hive generally rolls over to HDFS if the statistics are not available directly from metastore and this seems like a reasonable choice to adopt given the optimization benefit of using broadcast joins. ## How was this patch tested? I have executed queries locally to test. Author: Parth Brahmbhatt <pbrahmbhatt@netflix.com> Closes #13150 from Parth-Brahmbhatt/SPARK-15365.	2016-05-24 20:58:20 -07:00
Dongjoon Hyun	f08bf587b1	[SPARK-15512][CORE] repartition(0) should raise IllegalArgumentException ## What changes were proposed in this pull request? Previously, SPARK-8893 added the constraints on positive number of partitions for repartition/coalesce operations in general. This PR adds one missing part for that and adds explicit two testcases. Before ```scala scala> sc.parallelize(1 to 5).coalesce(0) java.lang.IllegalArgumentException: requirement failed: Number of partitions (0) must be positive. ... scala> sc.parallelize(1 to 5).repartition(0).collect() res1: Array[Int] = Array() // empty scala> spark.sql("select 1").coalesce(0) res2: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [1: int] scala> spark.sql("select 1").coalesce(0).collect() java.lang.IllegalArgumentException: requirement failed: Number of partitions (0) must be positive. scala> spark.sql("select 1").repartition(0) res3: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [1: int] scala> spark.sql("select 1").repartition(0).collect() res4: Array[org.apache.spark.sql.Row] = Array() // empty ``` After ```scala scala> sc.parallelize(1 to 5).coalesce(0) java.lang.IllegalArgumentException: requirement failed: Number of partitions (0) must be positive. ... scala> sc.parallelize(1 to 5).repartition(0) java.lang.IllegalArgumentException: requirement failed: Number of partitions (0) must be positive. ... scala> spark.sql("select 1").coalesce(0) java.lang.IllegalArgumentException: requirement failed: Number of partitions (0) must be positive. ... scala> spark.sql("select 1").repartition(0) java.lang.IllegalArgumentException: requirement failed: Number of partitions (0) must be positive. ... ``` ## How was this patch tested? Pass the Jenkins tests with new testcases. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #13282 from dongjoon-hyun/SPARK-15512.	2016-05-24 18:55:23 -07:00
Tathagata Das	e631b819fe	[SPARK-15458][SQL][STREAMING] Disable schema inference for streaming datasets on file streams ## What changes were proposed in this pull request? If the user relies on the schema to be inferred in file streams can break easily for multiple reasons - accidentally running on a directory which has no data - schema changing underneath - on restart, the query will infer schema again, and may unexpectedly infer incorrect schema, as the file in the directory may be different at the time of the restart. To avoid these complicated scenarios, for Spark 2.0, we are going to disable schema inferencing by default with a config, so that user is forced to consider explicitly what is the schema it wants, rather than the system trying to infer it and run into weird corner cases. In this PR, I introduce a SQLConf that determines whether schema inference for file streams is allowed or not. It is disabled by default. ## How was this patch tested? Updated unit tests that test error behavior with and without schema inference enabled. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #13238 from tdas/SPARK-15458.	2016-05-24 14:27:39 -07:00
Dongjoon Hyun	f8763b80ec	[SPARK-13135] [SQL] Don't print expressions recursively in generated code ## What changes were proposed in this pull request? This PR is an up-to-date and a little bit improved version of #11019 of rxin for - (1) preventing recursive printing of expressions in generated code. Since the major function of this PR is indeed the above, he should be credited for the work he did. In addition to #11019, this PR improves the followings in code generation. - (2) Improve multiline comment indentation. - (3) Reduce the number of empty lines (mainly consecutive empty lines). - (4) Remove all space characters on empty lines. Example ```scala spark.range(1, 1000).select('id+1+2+3, 'id+4+5+6) ``` Before ``` Generated code: /* 001 / public Object generate(Object[] references) { ... / 005 / /* /* 006 / Codegend pipeline for /* 007 / Project [(((id#0L + 1) + 2) + 3) AS (((id + 1) + 2) + 3)#3L,(((id#0L + 4) + 5) + 6) AS (((id + 4) + 5) + 6)#4L] /* 008 / +- Range 1, 1, 8, 999, [id#0L] /* 009 / / ... /* 075 / // PRODUCE: Project [(((id#0L + 1) + 2) + 3) AS (((id + 1) + 2) + 3)#3L,(((id#0L + 4) + 5) + 6) AS (((id + 4) + 5) + 6)#4L] / 076 / / 077 / // PRODUCE: Range 1, 1, 8, 999, [id#0L] / 078 / / 079 / // initialize Range ... / 092 / // CONSUME: Project [(((id#0L + 1) + 2) + 3) AS (((id + 1) + 2) + 3)#3L,(((id#0L + 4) + 5) + 6) AS (((id + 4) + 5) + 6)#4L] / 093 / / 094 / // CONSUME: WholeStageCodegen / 095 / / 096 / // (((input[0, bigint, false] + 1) + 2) + 3) / 097 / // ((input[0, bigint, false] + 1) + 2) / 098 / // (input[0, bigint, false] + 1) ... / 107 / // (((input[0, bigint, false] + 4) + 5) + 6) / 108 / // ((input[0, bigint, false] + 4) + 5) / 109 / // (input[0, bigint, false] + 4) ... / 126 / } ``` After* ``` Generated code: /* 001 / public Object generate(Object[] references) { ... / 005 / /* /* 006 / Codegend pipeline for /* 007 / Project [(((id#0L + 1) + 2) + 3) AS (((id + 1) + 2) + 3)#3L,(((id#0L + 4) + 5) + 6) AS (((id + 4) + 5) + 6)#4L] /* 008 / +- Range 1, 1, 8, 999, [id#0L] /* 009 / / ... /* 075 / // PRODUCE: Project [(((id#0L + 1) + 2) + 3) AS (((id + 1) + 2) + 3)#3L,(((id#0L + 4) + 5) + 6) AS (((id + 4) + 5) + 6)#4L] / 076 / // PRODUCE: Range 1, 1, 8, 999, [id#0L] / 077 / // initialize Range ... / 090 / // CONSUME: Project [(((id#0L + 1) + 2) + 3) AS (((id + 1) + 2) + 3)#3L,(((id#0L + 4) + 5) + 6) AS (((id + 4) + 5) + 6)#4L] / 091 / // CONSUME: WholeStageCodegen / 092 / // (((input[0, bigint, false] + 1) + 2) + 3) ... / 101 / // (((input[0, bigint, false] + 4) + 5) + 6) ... / 118 */ } ``` ## How was this patch tested? Pass the Jenkins tests and see the result of the following command manually. ```scala scala> spark.range(1, 1000).select('id+1+2+3, 'id+4+5+6).queryExecution.debug.codegen() ``` Author: Dongjoon Hyun <dongjoonapache.org> Author: Reynold Xin <rxindatabricks.com> Author: Dongjoon Hyun <dongjoon@apache.org> Closes #13192 from dongjoon-hyun/SPARK-13135.	2016-05-24 10:08:14 -07:00
Liang-Chi Hsieh	c24b6b679c	[SPARK-11753][SQL][TEST-HADOOP2.2] Make allowNonNumericNumbers option work ## What changes were proposed in this pull request? Jackson suppprts `allowNonNumericNumbers` option to parse non-standard non-numeric numbers such as "NaN", "Infinity", "INF". Currently used Jackson version (2.5.3) doesn't support it all. This patch upgrades the library and make the two ignored tests in `JsonParsingOptionsSuite` passed. ## How was this patch tested? `JsonParsingOptionsSuite`. Author: Liang-Chi Hsieh <simonh@tw.ibm.com> Author: Liang-Chi Hsieh <viirya@appier.com> Closes #9759 from viirya/fix-json-nonnumric.	2016-05-24 09:43:39 -07:00
Daoyuan Wang	d642b27354	[SPARK-15397][SQL] fix string udf locate as hive ## What changes were proposed in this pull request? in hive, `locate("aa", "aaa", 0)` would yield 0, `locate("aa", "aaa", 1)` would yield 1 and `locate("aa", "aaa", 2)` would yield 2, while in Spark, `locate("aa", "aaa", 0)` would yield 1, `locate("aa", "aaa", 1)` would yield 2 and `locate("aa", "aaa", 2)` would yield 0. This results from the different understanding of the third parameter in udf `locate`. It means the starting index and starts from 1, so when we use 0, the return would always be 0. ## How was this patch tested? tested with modified `StringExpressionsSuite` and `StringFunctionsSuite` Author: Daoyuan Wang <daoyuan.wang@intel.com> Closes #13186 from adrian-wang/locate.	2016-05-23 23:29:15 -07:00
Andrew Or	de726b0d53	Revert "[SPARK-15285][SQL] Generated SpecificSafeProjection.apply method grows beyond 64 KB" This reverts commit `fa244e5a90`.	2016-05-23 21:43:11 -07:00
Kazuaki Ishizaki	fa244e5a90	[SPARK-15285][SQL] Generated SpecificSafeProjection.apply method grows beyond 64 KB ## What changes were proposed in this pull request? This PR splits the generated code for ```SafeProjection.apply``` by using ```ctx.splitExpressions()```. This is because the large code body for ```NewInstance``` may grow beyond 64KB bytecode size for ```apply()``` method. ## How was this patch tested? Added new tests Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #13243 from kiszk/SPARK-15285.	2016-05-23 21:12:34 -07:00
gatorsmile	5afd927a47	[SPARK-15311][SQL] Disallow DML on Regular Tables when Using In-Memory Catalog #### What changes were proposed in this pull request? So far, when using In-Memory Catalog, we allow DDL operations for the tables. However, the corresponding DML operations are not supported for the tables that are neither temporary nor data source tables. For example, ```SQL CREATE TABLE tabName(i INT, j STRING) SELECT * FROM tabName INSERT OVERWRITE TABLE tabName SELECT 1, 'a' ``` In the above example, before this PR fix, we will get very confusing exception messages for either `SELECT` or `INSERT` ``` org.apache.spark.sql.AnalysisException: unresolved operator 'SimpleCatalogRelation default, CatalogTable(`default`.`tbl`,CatalogTableType(MANAGED),CatalogStorageFormat(None,Some(org.apache.hadoop.mapred.TextInputFormat),Some(org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat),None,false,Map()),List(CatalogColumn(i,int,true,None), CatalogColumn(j,string,true,None)),List(),List(),List(),-1,,1463928681802,-1,Map(),None,None,None,List()), None; ``` This PR is to issue appropriate exceptions in this case. The message will be like ``` org.apache.spark.sql.AnalysisException: Please enable Hive support when operating non-temporary tables: `tbl`; ``` #### How was this patch tested? Added a test case in `DDLSuite`. Author: gatorsmile <gatorsmile@gmail.com> Author: xiaoli <lixiao1983@gmail.com> Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local> Closes #13093 from gatorsmile/selectAfterCreate.	2016-05-23 18:03:45 -07:00
Xin Wu	01659bc50c	[SPARK-15431][SQL] Support LIST FILE(s)\|JAR(s) command natively ## What changes were proposed in this pull request? Currently command `ADD FILE\|JAR <filepath \| jarpath>` is supported natively in SparkSQL. However, when this command is run, the file/jar is added to the resources that can not be looked up by `LIST FILE(s)\|JAR(s)` command because the `LIST` command is passed to Hive command processor in Spark-SQL or simply not supported in Spark-shell. There is no way users can find out what files/jars are added to the spark context. Refer to [Hive commands](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Cli) This PR is to support following commands: `LIST (FILE[s] [filepath ...] \| JAR[s] [jarfile ...])` ### For example: ##### LIST FILE(s) ``` scala> spark.sql("add file hdfs://bdavm009.svl.ibm.com:8020/tmp/test.txt") res1: org.apache.spark.sql.DataFrame = [] scala> spark.sql("add file hdfs://bdavm009.svl.ibm.com:8020/tmp/test1.txt") res2: org.apache.spark.sql.DataFrame = [] scala> spark.sql("list file hdfs://bdavm009.svl.ibm.com:8020/tmp/test1.txt").show(false) +----------------------------------------------+ \|result \| +----------------------------------------------+ \|hdfs://bdavm009.svl.ibm.com:8020/tmp/test1.txt\| +----------------------------------------------+ scala> spark.sql("list files").show(false) +----------------------------------------------+ \|result \| +----------------------------------------------+ \|hdfs://bdavm009.svl.ibm.com:8020/tmp/test1.txt\| \|hdfs://bdavm009.svl.ibm.com:8020/tmp/test.txt \| +----------------------------------------------+ ``` ##### LIST JAR(s) ``` scala> spark.sql("add jar /Users/xinwu/spark/core/src/test/resources/TestUDTF.jar") res9: org.apache.spark.sql.DataFrame = [result: int] scala> spark.sql("list jar TestUDTF.jar").show(false) +---------------------------------------------+ \|result \| +---------------------------------------------+ \|spark://192.168.1.234:50131/jars/TestUDTF.jar\| +---------------------------------------------+ scala> spark.sql("list jars").show(false) +---------------------------------------------+ \|result \| +---------------------------------------------+ \|spark://192.168.1.234:50131/jars/TestUDTF.jar\| +---------------------------------------------+ ``` ## How was this patch tested? New test cases are added for Spark-SQL, Spark-Shell and SparkContext API code path. Author: Xin Wu <xinwu@us.ibm.com> Author: xin Wu <xinwu@us.ibm.com> Closes #13212 from xwu0226/list_command.	2016-05-23 17:32:01 -07:00
sureshthalamati	03c7b7c4b9	[SPARK-15315][SQL] Adding error check to the CSV datasource writer for unsupported complex data types. ## What changes were proposed in this pull request? Adds error handling to the CSV writer for unsupported complex data types. Currently garbage gets written to the output csv files if the data frame schema has complex data types. ## How was this patch tested? Added new unit test case. Author: sureshthalamati <suresh.thalamati@gmail.com> Closes #13105 from sureshthalamati/csv_complex_types_SPARK-15315.	2016-05-23 17:15:19 -07:00
Dongjoon Hyun	37c617e4f5	[MINOR][SQL][DOCS] Add notes of the deterministic assumption on UDF functions ## What changes were proposed in this pull request? Spark assumes that UDF functions are deterministic. This PR adds explicit notes about that. ## How was this patch tested? It's only about docs. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #13087 from dongjoon-hyun/SPARK-15282.	2016-05-23 14:19:25 -07:00
Andrew Or	2585d2b322	[SPARK-15279][SQL] Catch conflicting SerDe when creating table ## What changes were proposed in this pull request? The user may do something like: ``` CREATE TABLE my_tab ROW FORMAT SERDE 'anything' STORED AS PARQUET CREATE TABLE my_tab ROW FORMAT SERDE 'anything' STORED AS ... SERDE 'myserde' CREATE TABLE my_tab ROW FORMAT DELIMITED ... STORED AS ORC CREATE TABLE my_tab ROW FORMAT DELIMITED ... STORED AS ... SERDE 'myserde' ``` None of these should be allowed because the SerDe's conflict. As of this patch: - `ROW FORMAT DELIMITED` is only compatible with `TEXTFILE` - `ROW FORMAT SERDE` is only compatible with `TEXTFILE`, `RCFILE` and `SEQUENCEFILE` ## How was this patch tested? New tests in `DDLCommandSuite`. Author: Andrew Or <andrew@databricks.com> Closes #13068 from andrewor14/row-format-conflict.	2016-05-23 11:55:03 -07:00
Davies Liu	80091b8a68	[SPARK-14031][SQL] speedup CSV writer ## What changes were proposed in this pull request? Currently, we create an CSVWriter for every row, it's very expensive and memory hungry, took about 15 seconds to write out 1 mm rows (two columns). This PR will write the rows in batch mode, create a CSVWriter for every 1k rows, which could write out 1 mm rows in about 1 seconds (15X faster). ## How was this patch tested? Manually benchmark it. Author: Davies Liu <davies@databricks.com> Closes #13229 from davies/csv_writer.	2016-05-23 10:48:25 -07:00
Sameer Agarwal	dafcb05c2e	[SPARK-15425][SQL] Disallow cross joins by default ## What changes were proposed in this pull request? In order to prevent users from inadvertently writing queries with cartesian joins, this patch introduces a new conf `spark.sql.crossJoin.enabled` (set to `false` by default) that if not set, results in a `SparkException` if the query contains one or more cartesian products. ## How was this patch tested? Added a test to verify the new behavior in `JoinSuite`. Additionally, `SQLQuerySuite` and `SQLMetricsSuite` were modified to explicitly enable cartesian products. Author: Sameer Agarwal <sameer@databricks.com> Closes #13209 from sameeragarwal/disallow-cartesian.	2016-05-22 23:32:39 -07:00
Tathagata Das	1ffa608ba5	[SPARK-15428][SQL] Disable multiple streaming aggregations ## What changes were proposed in this pull request? Incrementalizing plans of with multiple streaming aggregation is tricky and we dont have the necessary support for "delta" to implement correctly. So disabling the support for multiple streaming aggregations. ## How was this patch tested? Additional unit tests Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #13210 from tdas/SPARK-15428.	2016-05-22 02:08:18 -07:00
Reynold Xin	845e447fa0	[SPARK-15459][SQL] Make Range logical and physical explain consistent ## What changes were proposed in this pull request? This patch simplifies the implementation of Range operator and make the explain string consistent between logical plan and physical plan. To do this, I changed RangeExec to embed a Range logical plan in it. Before this patch (note that the logical Range and physical Range actually output different information): ``` == Optimized Logical Plan == Range 0, 100, 2, 2, [id#8L] == Physical Plan == Range 0, 2, 2, 50, [id#8L] ``` After this patch: If step size is 1: ``` == Optimized Logical Plan == Range(0, 100, splits=2) == Physical Plan == Range(0, 100, splits=2) ``` If step size is not 1: ``` == Optimized Logical Plan == Range (0, 100, step=2, splits=2) == Physical Plan == *Range (0, 100, step=2, splits=2) ``` ## How was this patch tested? N/A Author: Reynold Xin <rxin@databricks.com> Closes #13239 from rxin/SPARK-15459.	2016-05-22 00:03:37 -07:00
gatorsmile	a11175eeca	[SPARK-15312][SQL] Detect Duplicate Key in Partition Spec and Table Properties #### What changes were proposed in this pull request? When there are duplicate keys in the partition specs or table properties, we always use the last value and ignore all the previous values. This is caused by the function call `toMap`. partition specs or table properties are widely used in multiple DDL statements. This PR is to detect the duplicates and issue an exception if found. #### How was this patch tested? Added test cases in DDLSuite Author: gatorsmile <gatorsmile@gmail.com> Closes #13095 from gatorsmile/detectDuplicate.	2016-05-21 23:56:10 -07:00
Reynold Xin	6d0bfb9601	Small documentation and style fix.	2016-05-21 23:12:56 -07:00
Jurriaan Pruis	223f633908	[SPARK-15415][SQL] Fix BroadcastHint when autoBroadcastJoinThreshold is 0 or -1 ## What changes were proposed in this pull request? This PR makes BroadcastHint more deterministic by using a special isBroadcastable property instead of setting the sizeInBytes to 1. See https://issues.apache.org/jira/browse/SPARK-15415 ## How was this patch tested? Added testcases to test if the broadcast hash join is included in the plan when the BroadcastHint is supplied and also tests for propagation of the joins. Author: Jurriaan Pruis <email@jurriaanpruis.nl> Closes #13244 from jurriaan/broadcast-hint.	2016-05-21 23:01:14 -07:00
gatorsmile	8f0a3d5bcb	[SPARK-15330][SQL] Implement Reset Command #### What changes were proposed in this pull request? Like `Set` Command in Hive, `Reset` is also supported by Hive. See the link: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Cli Below is the related Hive JIRA: https://issues.apache.org/jira/browse/HIVE-3202 This PR is to implement such a command for resetting the SQL-related configuration to the default values. One of the use case shown in HIVE-3202 is listed below: > For the purpose of optimization we set various configs per query. It's worthy but all those configs should be reset every time for next query. #### How was this patch tested? Added a test case. Author: gatorsmile <gatorsmile@gmail.com> Author: xiaoli <lixiao1983@gmail.com> Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local> Closes #13121 from gatorsmile/resetCommand.	2016-05-21 20:07:34 -07:00
Reynold Xin	201a51f366	[SPARK-15452][SQL] Mark aggregator API as experimental ## What changes were proposed in this pull request? The Aggregator API was introduced in 2.0 for Dataset. All typed Dataset APIs should still be marked as experimental in 2.0. ## How was this patch tested? N/A - annotation only change. Author: Reynold Xin <rxin@databricks.com> Closes #13226 from rxin/SPARK-15452.	2016-05-21 12:46:25 -07:00

... 4 5 6 7 8 ...

2948 commits