ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Wojciech Jurczyk	6e0b3d795c	[DOCS] Fix Gini and Entropy scaladocs in context of multiclass classification The PR changes outdated scaladocs for Gini and Entropy classes. Since PR #886 Spark supports multiclass classification, but the docs tell only about binary classification. Author: Wojciech Jurczyk <wojciech.jurczyk@codilime.com> Closes #11252 from wjur/wjur/docs_multiclass.	2016-06-15 15:58:42 -07:00
Davies Liu	a153e41c08	Revert "[SPARK-15782][YARN] Set spark.jars system property in client mode" This reverts commit `4df8df5c2e`.	2016-06-15 15:55:07 -07:00
Reynold Xin	1a33f2e05c	Closing stale pull requests. Closes #13103 Closes #8320 Closes #7871 Closes #7461 Closes #9159 Closes #9150 Closes #9200 Closes #9089 Closes #8022 Closes #6767 Closes #8505 Closes #9457 Closes #9397 Closes #8563 Closes #10062 Closes #9944 Closes #10137 Closes #10148 Closes #9057 Closes #10163 Closes #8023 Closes #10302 Closes #8979 Closes #8981 Closes #10258 Closes #7345 Closes #9183 Closes #10087 Closes #10292 Closes #10254 Closes #10374 Closes #8915 Closes #10128 Closes #10666 Closes #8533 Closes #10625 Closes #8013 Closes #8427 Closes #7753 Closes #10116 Closes #11005 Closes #10797 Closes #11026 Closes #11009 Closes #10117 Closes #11382 Closes #9483 Closes #10566 Closes #10753 Closes #11386 Closes #9097 Closes #11245 Closes #11257 Closes #11045 Closes #10144 Closes #11066 Closes #8610 Closes #10634 Closes #11224 Closes #11212 Closes #11244 Closes #10326 Closes #13524	2016-06-15 15:43:45 -07:00
Nirman Narang	04d7b3d2b6	[SPARK-7848][STREAMING][UPDATE SPARKSTREAMING DOCS TO INCORPORATE IMPORTANT POINTS.] Updated the SparkStreaming Doc with some important points. Author: Nirman Narang <narang@us.ibm.com> Closes #11114 from nirmannarang/SPARK-7848.	2016-06-15 15:36:31 -07:00
Imran Rashid	cafc696d09	[HOTFIX][CORE] fix flaky BasicSchedulerIntegrationTest ## What changes were proposed in this pull request? SPARK-15927 exacerbated a race in BasicSchedulerIntegrationTest, so it went from very unlikely to fairly frequent. The issue is that stage numbering is not completely deterministic, but these tests treated it like it was. So turn off the tests. ## How was this patch tested? on my laptop the test failed abotu 10% of the time before this change, and didn't fail in 500 runs after the change. Author: Imran Rashid <irashid@cloudera.com> Closes #13688 from squito/hotfix_basic_scheduler.	2016-06-15 16:44:18 -05:00
Sean Zhong	9bd80ad6bd	[SPARK-15776][SQL] Divide Expression inside Aggregation function is casted to wrong type ## What changes were proposed in this pull request? This PR fixes the problem that Divide Expression inside Aggregation function is casted to wrong type, which cause `select 1/2` and `select sum(1/2)`returning different result. Before the change: ``` scala> sql("select 1/2 as a").show() +---+ \| a\| +---+ \|0.5\| +---+ scala> sql("select sum(1/2) as a").show() +---+ \| a\| +---+ \|0 \| +---+ scala> sql("select sum(1 / 2) as a").schema res4: org.apache.spark.sql.types.StructType = StructType(StructField(a,LongType,true)) ``` After the change: ``` scala> sql("select 1/2 as a").show() +---+ \| a\| +---+ \|0.5\| +---+ scala> sql("select sum(1/2) as a").show() +---+ \| a\| +---+ \|0.5\| +---+ scala> sql("select sum(1/2) as a").schema res4: org.apache.spark.sql.types.StructType = StructType(StructField(a,DoubleType,true)) ``` ## How was this patch tested? Unit test. This PR is based on https://github.com/apache/spark/pull/13524 by Sephiroth-Lin Author: Sean Zhong <seanzhong@databricks.com> Closes #13651 from clockfly/SPARK-15776.	2016-06-15 14:34:15 -07:00
Egor Pakhomov	049e639fc2	[SPARK-15934] [SQL] Return binary mode in ThriftServer Returning binary mode to ThriftServer for backward compatibility. Tested with Squirrel and Tableau. Author: Egor Pakhomov <egor@anchorfree.com> Closes #13667 from epahomov/SPARK-15095-2.0.	2016-06-15 14:29:32 -07:00
gatorsmile	09925735b5	[SPARK-15901][SQL][TEST] Verification of CONVERT_METASTORE_ORC and CONVERT_METASTORE_PARQUET #### What changes were proposed in this pull request? So far, we do not have test cases for verifying whether the external parameters `HiveUtils .CONVERT_METASTORE_ORC` and `HiveUtils.CONVERT_METASTORE_PARQUET` properly works when users use non-default values. This PR is to add such test cases for avoiding potential regression. #### How was this patch tested? N/A Author: gatorsmile <gatorsmile@gmail.com> Closes #13622 from gatorsmile/addTestCase4parquetOrcConversion.	2016-06-15 14:08:55 -07:00
Nezih Yigitbasi	4df8df5c2e	[SPARK-15782][YARN] Set spark.jars system property in client mode ## What changes were proposed in this pull request? When `--packages` is specified with `spark-shell` the classes from those packages cannot be found, which I think is due to some of the changes in `SPARK-12343`. In particular `SPARK-12343` removes a line that sets the `spark.jars` system property in client mode, which is used by the repl main class to set the classpath. ## How was this patch tested? Tested manually. This system property is used by the repl to populate its classpath. If this is not set properly the classes for external packages cannot be found. tgravescs vanzin as you may be familiar with this part of the code. Author: Nezih Yigitbasi <nyigitbasi@netflix.com> Closes #13527 from nezihyigitbasi/repl-fix.	2016-06-15 14:07:36 -07:00
Davies Liu	5389013acc	[SPARK-15888] [SQL] fix Python UDF with aggregate ## What changes were proposed in this pull request? After we move the ExtractPythonUDF rule into physical plan, Python UDF can't work on top of aggregate anymore, because they can't be evaluated before aggregate, should be evaluated after aggregate. This PR add another rule to extract these kind of Python UDF from logical aggregate, create a Project on top of Aggregate. ## How was this patch tested? Added regression tests. The plan of added test query looks like this: ``` == Parsed Logical Plan == 'Project [<lambda>('k, 's) AS t#26] +- Aggregate [<lambda>(key#5L)], [<lambda>(key#5L) AS k#17, sum(cast(<lambda>(value#6) as bigint)) AS s#22L] +- LogicalRDD [key#5L, value#6] == Analyzed Logical Plan == t: int Project [<lambda>(k#17, s#22L) AS t#26] +- Aggregate [<lambda>(key#5L)], [<lambda>(key#5L) AS k#17, sum(cast(<lambda>(value#6) as bigint)) AS s#22L] +- LogicalRDD [key#5L, value#6] == Optimized Logical Plan == Project [<lambda>(agg#29, agg#30L) AS t#26] +- Aggregate [<lambda>(key#5L)], [<lambda>(key#5L) AS agg#29, sum(cast(<lambda>(value#6) as bigint)) AS agg#30L] +- LogicalRDD [key#5L, value#6] == Physical Plan == Project [pythonUDF0#37 AS t#26] +- BatchEvalPython [<lambda>(agg#29, agg#30L)], [agg#29, agg#30L, pythonUDF0#37] +- HashAggregate(key=[<lambda>(key#5L)#31], functions=[sum(cast(<lambda>(value#6) as bigint))], output=[agg#29,agg#30L]) +- Exchange hashpartitioning(<lambda>(key#5L)#31, 200) +- *HashAggregate(key=[pythonUDF0#34 AS <lambda>(key#5L)#31], functions=[partial_sum(cast(pythonUDF1#35 as bigint))], output=[<lambda>(key#5L)#31,sum#33L]) +- BatchEvalPython [<lambda>(key#5L), <lambda>(value#6)], [key#5L, value#6, pythonUDF0#34, pythonUDF1#35] +- Scan ExistingRDD[key#5L,value#6] ``` Author: Davies Liu <davies@databricks.com> Closes #13682 from davies/fix_py_udf.	2016-06-15 13:38:04 -07:00
Tejas Patil	279bd4aa5f	[SPARK-15826][CORE] PipedRDD to allow configurable char encoding ## What changes were proposed in this pull request? Link to jira which describes the problem: https://issues.apache.org/jira/browse/SPARK-15826 The fix in this PR is to allow users specify encoding in the pipe() operation. For backward compatibility, keeping the default value to be system default. ## How was this patch tested? Ran existing unit tests Author: Tejas Patil <tejasp@fb.com> Closes #13563 from tejasapatil/pipedrdd_utf8.	2016-06-15 12:03:00 -07:00
Liwei Lin	9b234b55d1	[SPARK-15518][CORE][FOLLOW-UP] Rename LocalSchedulerBackendEndpoint -> LocalSchedulerBackend ## What changes were proposed in this pull request? This patch is a follow-up to https://github.com/apache/spark/pull/13288 completing the renaming: - LocalScheduler -> LocalSchedulerBackend~~Endpoint~~ ## How was this patch tested? Updated test cases to reflect the name change. Author: Liwei Lin <lwlin7@gmail.com> Closes #13683 from lw-lin/rename-backend.	2016-06-15 11:52:36 -07:00
Yin Huai	e1585cc748	[SPARK-15959][SQL] Add the support of hive.metastore.warehouse.dir back ## What changes were proposed in this pull request? This PR adds the support of conf `hive.metastore.warehouse.dir` back. With this patch, the way of setting the warehouse dir is described as follows: * If `spark.sql.warehouse.dir` is set, `hive.metastore.warehouse.dir` will be automatically set to the value of `spark.sql.warehouse.dir`. The warehouse dir is effectively set to the value of `spark.sql.warehouse.dir`. * If `spark.sql.warehouse.dir` is not set but `hive.metastore.warehouse.dir` is set, `spark.sql.warehouse.dir` will be automatically set to the value of `hive.metastore.warehouse.dir`. The warehouse dir is effectively set to the value of `hive.metastore.warehouse.dir`. * If neither `spark.sql.warehouse.dir` nor `hive.metastore.warehouse.dir` is set, `hive.metastore.warehouse.dir` will be automatically set to the default value of `spark.sql.warehouse.dir`. The warehouse dir is effectively set to the default value of `spark.sql.warehouse.dir`. ## How was this patch tested? `set hive.metastore.warehouse.dir` in `HiveSparkSubmitSuite`. JIRA: https://issues.apache.org/jira/browse/SPARK-15959 Author: Yin Huai <yhuai@databricks.com> Closes #13679 from yhuai/hiveWarehouseDir.	2016-06-15 11:50:54 -07:00
Tathagata Das	9a5071996b	[SPARK-15953][WIP][STREAMING] Renamed ContinuousQuery to StreamingQuery Renamed for simplicity, so that its obvious that its related to streaming. Existing unit tests. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #13673 from tdas/SPARK-15953.	2016-06-15 10:46:07 -07:00
Felix Cheung	d30b7e6696	[SPARK-15637][SPARK-15931][SPARKR] Fix R masked functions checks ## What changes were proposed in this pull request? Because of the fix in SPARK-15684, this exclusion is no longer necessary. ## How was this patch tested? unit tests shivaram Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #13636 from felixcheung/rendswith.	2016-06-15 10:29:07 -07:00
Herman van Hovell	de99c3d081	[SPARK-15960][SQL] Rename `spark.sql.enableFallBackToHdfsForStats` config ## What changes were proposed in this pull request? Since we are probably going to add more statistics related configurations in the future, I'd like to rename the newly added `spark.sql.enableFallBackToHdfsForStats` configuration option to `spark.sql.statistics.fallBackToHdfs`. This allows us to put all statistics related configurations in the same namespace. ## How was this patch tested? None - just a usability thing Author: Herman van Hovell <hvanhovell@databricks.com> Closes #13681 from hvanhovell/SPARK-15960.	2016-06-15 09:43:11 -07:00
Marcelo Vanzin	40eeef9525	[SPARK-15046][YARN] Parse value of token renewal interval correctly. Use the config variable definition both to set and parse the value, avoiding issues with code expecting the value in a different format. Tested by running spark-submit with --principal / --keytab. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #13669 from vanzin/SPARK-15046.	2016-06-15 09:09:21 -05:00
Shixiong Zhu	0ee9fd9e52	[SPARK-15935][PYSPARK] Fix a wrong format tag in the error message ## What changes were proposed in this pull request? A follow up PR for #13655 to fix a wrong format tag. ## How was this patch tested? Jenkins unit tests. Author: Shixiong Zhu <shixiong@databricks.com> Closes #13665 from zsxwing/fix.	2016-06-14 19:45:11 -07:00
Xiangrui Meng	63e0aebe22	[SPARK-15945][MLLIB] Conversion between old/new vector columns in a DataFrame (Scala/Java) ## What changes were proposed in this pull request? This PR provides conversion utils between old/new vector columns in a DataFrame. So users can use it to migrate their datasets and pipelines manually. The methods are implemented under `MLUtils` and called `convertVectorColumnsToML` and `convertVectorColumnsFromML`. Both take a DataFrame and a list of vector columns to be converted. It is a no-op on vector columns that are already converted. A warning message is logged if actual conversion happens. This is the first sub-task under SPARK-15944 to make it easier to migrate existing pipelines to Spark 2.0. ## How was this patch tested? Unit tests in Scala and Java. cc: yanboliang Author: Xiangrui Meng <meng@databricks.com> Closes #13662 from mengxr/SPARK-15945.	2016-06-14 18:57:45 -07:00
bomeng	42a28caf10	[SPARK-15952][SQL] fix "show databases" ordering issue ## What changes were proposed in this pull request? Two issues I've found for "show databases" command: 1. The returned database name list was not sorted, it only works when "like" was used together; (HIVE will always return a sorted list) 2. When it is used as sql("show databases").show, it will output a table with column named as "result", but for sql("show tables").show, it will output the column name as "tableName", so I think we should be consistent and use "databaseName" at least. ## How was this patch tested? Updated existing test case to test its ordering as well. Author: bomeng <bmeng@us.ibm.com> Closes #13671 from bomeng/SPARK-15952.	2016-06-14 18:35:29 -07:00
Herman van Hovell	0bd86c0fe4	[SPARK-15011][SQL] Re-enable 'analyze MetastoreRelations' in hive StatisticsSuite ## What changes were proposed in this pull request? This test re-enables the `analyze MetastoreRelations` in `org.apache.spark.sql.hive.StatisticsSuite`. The flakiness of this test was traced back to a shared configuration option, `hive.exec.compress.output`, in `TestHive`. This property was set to `true` by the `HiveCompatibilitySuite`. I have added configuration resetting logic to `HiveComparisonTest`, in order to prevent such a thing from happening again. ## How was this patch tested? Is a test. Author: Herman van Hovell <hvanhovell@databricks.com> Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #13498 from hvanhovell/SPARK-15011.	2016-06-14 18:24:59 -07:00
Tathagata Das	214adb14b8	[SPARK-15933][SQL][STREAMING] Refactored DF reader-writer to use readStream and writeStream for streaming DFs ## What changes were proposed in this pull request? Currently, the DataFrameReader/Writer has method that are needed for streaming and non-streaming DFs. This is quite awkward because each method in them through runtime exception for one case or the other. So rather having half the methods throw runtime exceptions, its just better to have a different reader/writer API for streams. - [x] Python API!! ## How was this patch tested? Existing unit tests + two sets of unit tests for DataFrameReader/Writer and DataStreamReader/Writer. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #13653 from tdas/SPARK-15933.	2016-06-14 17:58:45 -07:00
Kay Ousterhout	5d50d4f0f9	[SPARK-15927] Eliminate redundant DAGScheduler code. To try to eliminate redundant code to traverse the RDD dependency graph, this PR creates a new function getShuffleDependencies that returns shuffle dependencies that are immediate parents of a given RDD. This new function is used by getParentStages and getAncestorShuffleDependencies. Author: Kay Ousterhout <kayousterhout@gmail.com> Closes #13646 from kayousterhout/SPARK-15927.	2016-06-14 17:27:01 -07:00
Takeshi YAMAMURO	dae4d5db21	[SPARK-15247][SQL] Set the default number of partitions for reading parquet schemas ## What changes were proposed in this pull request? This pr sets the default number of partitions when reading parquet schemas. SQLContext#read#parquet currently yields at least n_executors * n_cores tasks even if parquet data consist of a single small file. This issue could increase the latency for small jobs. ## How was this patch tested? Manually tested and checked. Author: Takeshi YAMAMURO <linguin.m.s@gmail.com> Closes #13137 from maropu/SPARK-15247.	2016-06-14 13:05:56 -07:00
Cheng Lian	bd39ffe35c	[SPARK-15895][SQL] Filters out metadata files while doing partition discovery ## What changes were proposed in this pull request? Take the following directory layout as an example: ``` dir/ +- p0=0/ \|-_metadata +- p1=0/ \|-part-00001.parquet \|-part-00002.parquet \|-... ``` The `_metadata` file under `p0=0` shouldn't fail partition discovery. This PR filters output all metadata files whose names start with `_` while doing partition discovery. ## How was this patch tested? New unit test added in `ParquetPartitionDiscoverySuite`. Author: Cheng Lian <lian@databricks.com> Closes #13623 from liancheng/spark-15895-partition-disco-no-metafiles.	2016-06-14 12:13:12 -07:00
gatorsmile	df4ea6614d	[SPARK-15864][SQL] Fix Inconsistent Behaviors when Uncaching Non-cached Tables #### What changes were proposed in this pull request? To uncache a table, we have three different ways: - _SQL interface_: `UNCACHE TABLE` - _DataSet API_: `sparkSession.catalog.uncacheTable` - _DataSet API_: `sparkSession.table(tableName).unpersist()` When the table is not cached, - _SQL interface_: `UNCACHE TABLE non-cachedTable` -> no error message - _Dataset API_: `sparkSession.catalog.uncacheTable("non-cachedTable")` -> report a strange error message: ```requirement failed: Table [a: int] is not cached``` - _Dataset API_: `sparkSession.table("non-cachedTable").unpersist()` -> no error message This PR will make them consistent. No operation if the table has already been uncached. In addition, this PR also removes `uncacheQuery` and renames `tryUncacheQuery` to `uncacheQuery`, and documents it that it's noop if the table has already been uncached #### How was this patch tested? Improved the existing test case for verifying the cases when the table has not been cached. Also added test cases for verifying the cases when the table does not exist Author: gatorsmile <gatorsmile@gmail.com> Author: xiaoli <lixiao1983@gmail.com> Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local> Closes #13593 from gatorsmile/uncacheNonCachedTable.	2016-06-14 11:44:37 -07:00
Takuya UESHIN	c5b7355819	[SPARK-15915][SQL] Logical plans should use canonicalized plan when override sameResult. ## What changes were proposed in this pull request? `DataFrame` with plan overriding `sameResult` but not using canonicalized plan to compare can't cacheTable. The example is like: ``` val localRelation = Seq(1, 2, 3).toDF() localRelation.createOrReplaceTempView("localRelation") spark.catalog.cacheTable("localRelation") assert( localRelation.queryExecution.withCachedData.collect { case i: InMemoryRelation => i }.size == 1) ``` and this will fail as: ``` ArrayBuffer() had size 0 instead of expected size 1 ``` The reason is that when do `spark.catalog.cacheTable("localRelation")`, `CacheManager` tries to cache for the plan wrapped by `SubqueryAlias` but when planning for the DataFrame `localRelation`, `CacheManager` tries to find cached table for the not-wrapped plan because the plan for DataFrame `localRelation` is not wrapped. Some plans like `LocalRelation`, `LogicalRDD`, etc. override `sameResult` method, but not use canonicalized plan to compare so the `CacheManager` can't detect the plans are the same. This pr modifies them to use canonicalized plan when override `sameResult` method. ## How was this patch tested? Added a test to check if DataFrame with plan overriding sameResult but not using canonicalized plan to compare can cacheTable. Author: Takuya UESHIN <ueshin@happy-camper.st> Closes #13638 from ueshin/issues/SPARK-15915.	2016-06-14 10:52:13 -07:00
gatorsmile	bc02d01129	[SPARK-15655][SQL] Fix Wrong Partition Column Order when Fetching Partitioned Tables #### What changes were proposed in this pull request? When fetching the partitioned table, the output contains wrong results. The order of partition key values do not match the order of partition key columns in output schema. For example, ```SQL CREATE TABLE table_with_partition(c1 string) PARTITIONED BY (p1 string,p2 string,p3 string,p4 string,p5 string) INSERT OVERWRITE TABLE table_with_partition PARTITION (p1='a',p2='b',p3='c',p4='d',p5='e') SELECT 'blarr' SELECT p1, p2, p3, p4, p5, c1 FROM table_with_partition ``` ``` +---+---+---+---+---+-----+ \| p1\| p2\| p3\| p4\| p5\| c1\| +---+---+---+---+---+-----+ \| d\| e\| c\| b\| a\|blarr\| +---+---+---+---+---+-----+ ``` The expected result should be ``` +---+---+---+---+---+-----+ \| p1\| p2\| p3\| p4\| p5\| c1\| +---+---+---+---+---+-----+ \| a\| b\| c\| d\| e\|blarr\| +---+---+---+---+---+-----+ ``` This PR is to fix this by enforcing the order matches the table partition definition. #### How was this patch tested? Added a test case into `SQLQuerySuite` Author: gatorsmile <gatorsmile@gmail.com> Closes #13400 from gatorsmile/partitionedTableFetch.	2016-06-14 09:58:06 -07:00
Sean Owen	6151d2641f	[MINOR] Clean up several build warnings, mostly due to internal use of old accumulators ## What changes were proposed in this pull request? Another PR to clean up recent build warnings. This particularly cleans up several instances of the old accumulator API usage in tests that are straightforward to update. I think this qualifies as "minor". ## How was this patch tested? Jenkins Author: Sean Owen <sowen@cloudera.com> Closes #13642 from srowen/BuildWarnings.	2016-06-14 09:40:07 -07:00
Sean Zhong	6e8cdef0cf	[SPARK-15914][SQL] Add deprecated method back to SQLContext for backward source code compatibility ## What changes were proposed in this pull request? Revert partial changes in SPARK-12600, and add some deprecated method back to SQLContext for backward source code compatibility. ## How was this patch tested? Manual test. Author: Sean Zhong <seanzhong@databricks.com> Closes #13637 from clockfly/SPARK-15914.	2016-06-14 09:10:27 -07:00
Jeff Zhang	53bb030847	doc fix of HiveThriftServer ## What changes were proposed in this pull request? Just minor doc fix. \cc yhuai Author: Jeff Zhang <zjffdu@apache.org> Closes #13659 from zjffdu/doc_fix.	2016-06-14 14:28:40 +01:00
Adam Roberts	a431e3f1f8	[SPARK-15821][DOCS] Include parallel build info ## What changes were proposed in this pull request? We should mention that users can build Spark using multiple threads to decrease build times; either here or in "Building Spark" ## How was this patch tested? Built on machines with between one core to 192 cores using mvn -T 1C and observed faster build times with no loss in stability In response to the question here https://issues.apache.org/jira/browse/SPARK-15821 I think we should suggest this option as we know it works for Spark and can result in faster builds Author: Adam Roberts <aroberts@uk.ibm.com> Closes #13562 from a-roberts/patch-3.	2016-06-14 13:59:01 +01:00
Shixiong Zhu	96c3500c66	[SPARK-15935][PYSPARK] Enable test for sql/streaming.py and fix these tests ## What changes were proposed in this pull request? This PR just enables tests for sql/streaming.py and also fixes the failures. ## How was this patch tested? Existing unit tests. Author: Shixiong Zhu <shixiong@databricks.com> Closes #13655 from zsxwing/python-streaming-test.	2016-06-14 02:12:29 -07:00
Mortada Mehyar	a87a56f5c7	[DOCUMENTATION] fixed typos in python programming guide ## What changes were proposed in this pull request? minor typo ## How was this patch tested? minor typo in the doc, should be self explanatory Author: Mortada Mehyar <mortada.mehyar@gmail.com> Closes #13639 from mortada/typo.	2016-06-14 09:45:46 +01:00
Wenchen Fan	688b6ef9dc	[SPARK-15932][SQL][DOC] document the contract of encoder serializer expressions ## What changes were proposed in this pull request? In our encoder framework, we imply that serializer expressions should use `BoundReference` to refer to the input object, and a lot of codes depend on this contract(e.g. ExpressionEncoder.tuple). This PR adds some document and assert in `ExpressionEncoder` to make it clearer. ## How was this patch tested? existing tests Author: Wenchen Fan <wenchen@databricks.com> Closes #13648 from cloud-fan/comment.	2016-06-13 22:02:23 -07:00
Sandeep Singh	1842cdd4ee	[SPARK-15663][SQL] SparkSession.catalog.listFunctions shouldn't include the list of built-in functions ## What changes were proposed in this pull request? SparkSession.catalog.listFunctions currently returns all functions, including the list of built-in functions. This makes the method not as useful because anytime it is run the result set contains over 100 built-in functions. ## How was this patch tested? CatalogSuite Author: Sandeep Singh <sandeep@techaddict.me> Closes #13413 from techaddict/SPARK-15663.	2016-06-13 21:58:52 -07:00
Liang-Chi Hsieh	baa3e633e1	[SPARK-15364][ML][PYSPARK] Implement PySpark picklers for ml.Vector and ml.Matrix under spark.ml.python ## What changes were proposed in this pull request? Now we have PySpark picklers for new and old vector/matrix, individually. However, they are all implemented under `PythonMLlibAPI`. To separate spark.mllib from spark.ml, we should implement the picklers of new vector/matrix under `spark.ml.python` instead. ## How was this patch tested? Existing tests. Author: Liang-Chi Hsieh <simonh@tw.ibm.com> Closes #13219 from viirya/pyspark-pickler-ml.	2016-06-13 19:59:53 -07:00
gatorsmile	5827b65e28	[SPARK-15808][SQL] File Format Checking When Appending Data #### What changes were proposed in this pull request? Issue: Got wrong results or strange errors when append data to a table with mismatched file format. _Example 1: PARQUET -> CSV_ ```Scala createDF(0, 9).write.format("parquet").saveAsTable("appendParquetToOrc") createDF(10, 19).write.mode(SaveMode.Append).format("orc").saveAsTable("appendParquetToOrc") ``` Error we got: ``` Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 (TID 2, localhost): java.lang.RuntimeException: file:/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/warehouse-bc8fedf2-aa6a-4002-a18b-524c6ac859d4/appendorctoparquet/part-r-00000-c0e3f365-1d46-4df5-a82c-b47d7af9feb9.snappy.orc is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [79, 82, 67, 23] ``` _Example 2: Json -> CSV_ ```Scala createDF(0, 9).write.format("json").saveAsTable("appendJsonToCSV") createDF(10, 19).write.mode(SaveMode.Append).format("parquet").saveAsTable("appendJsonToCSV") ``` No exception, but wrong results: ``` +----+----+ \| c1\| c2\| +----+----+ \|null\|null\| \|null\|null\| \|null\|null\| \|null\|null\| \| 0\|str0\| \| 1\|str1\| \| 2\|str2\| \| 3\|str3\| \| 4\|str4\| \| 5\|str5\| \| 6\|str6\| \| 7\|str7\| \| 8\|str8\| \| 9\|str9\| +----+----+ ``` _Example 3: Json -> Text_ ```Scala createDF(0, 9).write.format("json").saveAsTable("appendJsonToText") createDF(10, 19).write.mode(SaveMode.Append).format("text").saveAsTable("appendJsonToText") ``` Error we got: ``` Text data source supports only a single column, and you have 2 columns. ``` This PR is to issue an exception with appropriate error messages. #### How was this patch tested? Added test cases. Author: gatorsmile <gatorsmile@gmail.com> Closes #13546 from gatorsmile/fileFormatCheck.	2016-06-13 19:31:40 -07:00
Sean Zhong	7b9071eeaa	[SPARK-15910][SQL] Check schema consistency when using Kryo encoder to convert DataFrame to Dataset ## What changes were proposed in this pull request? This PR enforces schema check when converting DataFrame to Dataset using Kryo encoder. For example. Before the change: Schema is NOT checked when converting DataFrame to Dataset using kryo encoder. ``` scala> case class B(b: Int) scala> implicit val encoder = Encoders.kryo[B] scala> val df = Seq((1)).toDF("b") scala> val ds = df.as[B] // Schema compatibility is NOT checked ``` After the change: Report AnalysisException since the schema is NOT compatible. ``` scala> val ds = Seq((1)).toDF("b").as[B] org.apache.spark.sql.AnalysisException: cannot resolve 'CAST(`b` AS BINARY)' due to data type mismatch: cannot cast IntegerType to BinaryType; ... ``` ## How was this patch tested? Unit test. Author: Sean Zhong <seanzhong@databricks.com> Closes #13632 from clockfly/spark-15910.	2016-06-13 17:43:55 -07:00
Josh Rosen	a6babca1bf	[SPARK-15929] Fix portability of DataFrameSuite path globbing tests The DataFrameSuite regression tests for SPARK-13774 fail in my environment because they attempt to glob over all of `/mnt` and some of the subdirectories restrictive permissions which cause the test to fail. This patch rewrites those tests to remove all environment-specific assumptions; the tests now create their own unique temporary paths for use in the tests. Author: Josh Rosen <joshrosen@databricks.com> Closes #13649 from JoshRosen/SPARK-15929.	2016-06-13 17:06:22 -07:00
Cheng Lian	ced8d669b3	[SPARK-15925][SQL][SPARKR] Replaces registerTempTable with createOrReplaceTempView ## What changes were proposed in this pull request? This PR replaces `registerTempTable` with `createOrReplaceTempView` as a follow-up task of #12945. ## How was this patch tested? Existing SparkR tests. Author: Cheng Lian <lian@databricks.com> Closes #13644 from liancheng/spark-15925-temp-view-for-r.	2016-06-13 15:46:50 -07:00
Wenchen Fan	c4b1ad0209	[SPARK-15887][SQL] Bring back the hive-site.xml support for Spark 2.0 ## What changes were proposed in this pull request? Right now, Spark 2.0 does not load hive-site.xml. Based on users' feedback, it seems make sense to still load this conf file. This PR adds a `hadoopConf` API in `SharedState`, which is `sparkContext.hadoopConfiguration` by default. When users are under hive context, `SharedState.hadoopConf` will load hive-site.xml and append its configs to `sparkContext.hadoopConfiguration`. When we need to read hadoop config in spark sql, we should call `SessionState.newHadoopConf`, which contains `sparkContext.hadoopConfiguration`, hive-site.xml and sql configs. ## How was this patch tested? new test in `HiveDataFrameSuite` Author: Wenchen Fan <wenchen@databricks.com> Closes #13611 from cloud-fan/hive-site.	2016-06-13 14:57:35 -07:00
Tathagata Das	c654ae2140	[SPARK-15889][SQL][STREAMING] Add a unique id to ContinuousQuery ## What changes were proposed in this pull request? ContinuousQueries have names that are unique across all the active ones. However, when queries are rapidly restarted with same name, it causes races conditions with the listener. A listener event from a stopped query can arrive after the query has been restarted, leading to complexities in monitoring infrastructure. Along with this change, I have also consolidated all the messy code paths to start queries with different sinks. ## How was this patch tested? Added unit tests, and existing unit tests. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #13613 from tdas/SPARK-15889.	2016-06-13 13:44:46 -07:00
Takeshi YAMAMURO	5ad4e32d46	[SPARK-15530][SQL] Set #parallelism for file listing in listLeafFilesInParallel ## What changes were proposed in this pull request? This pr is to set the number of parallelism to prevent file listing in `listLeafFilesInParallel` from generating many tasks in case of large #defaultParallelism. ## How was this patch tested? Manually checked Author: Takeshi YAMAMURO <linguin.m.s@gmail.com> Closes #13444 from maropu/SPARK-15530.	2016-06-13 13:41:26 -07:00
gatorsmile	3b7fb84cf8	[SPARK-15676][SQL] Disallow Column Names as Partition Columns For Hive Tables #### What changes were proposed in this pull request? When creating a Hive Table (not data source tables), a common error users might make is to specify an existing column name as a partition column. Below is what Hive returns in this case: ``` hive> CREATE TABLE partitioned (id bigint, data string) PARTITIONED BY (data string, part string); FAILED: SemanticException [Error 10035]: Column repeated in partitioning columns ``` Currently, the error we issued is very confusing: ``` org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:For direct MetaStore DB connections, we don't support retries at the client level.); ``` This PR is to fix the above issue by capturing the usage error in `Parser`. #### How was this patch tested? Added a test case to `DDLCommandSuite` Author: gatorsmile <gatorsmile@gmail.com> Closes #13415 from gatorsmile/partitionColumnsInTableSchema.	2016-06-13 13:22:46 -07:00
Tathagata Das	a6a18a4573	[HOTFIX][MINOR][SQL] Revert " Standardize 'continuous queries' to 'streaming D… This reverts commit `d32e227787`. Broke build - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-branch-2.0-compile-maven-hadoop-2.3/326/console Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #13645 from tdas/build-break.	2016-06-13 12:47:47 -07:00
Liwei Lin	d32e227787	[MINOR][SQL] Standardize 'continuous queries' to 'streaming Datasets/DataFrames' ## What changes were proposed in this pull request? This patch does some replacing (as `streaming Datasets/DataFrames` is the term we've chosen in [SPARK-15593](`00c310133d`)): - `continuous queries` -> `streaming Datasets/DataFrames` - `non-continuous queries` -> `non-streaming Datasets/DataFrames` This patch also adds `test("check foreach() can only be called on streaming Datasets/DataFrames")`. ## How was this patch tested? N/A Author: Liwei Lin <lwlin7@gmail.com> Closes #13595 from lw-lin/continuous-queries-to-streaming-dss-dfs.	2016-06-13 11:49:15 -07:00
Prashant Sharma	4134653e53	[SPARK-15697][REPL] Unblock some of the useful repl commands. ## What changes were proposed in this pull request? Unblock some of the useful repl commands. like, "implicits", "javap", "power", "type", "kind". As they are useful and fully functional and part of scala/scala project, I see no harm in having them either. Verbatim paste form JIRA description. "implicits", "javap", "power", "type", "kind" commands in repl are blocked. However, they work fine in all cases I have tried. It is clear we don't support them as they are part of the scala/scala repl project. What is the harm in unblocking them, given they are useful ? In previous versions of spark we disabled these commands because it was difficult to support them without customization and the associated maintenance. Since the code base of scala repl was actually ported and maintained under spark source. Now that is not the situation and one can benefit from these commands in Spark REPL as much as in scala repl. ## How was this patch tested? Existing tests and manual, by trying out all of the above commands. P.S. Symantics of reset are to be discussed in a separate issue. Author: Prashant Sharma <prashsh1@in.ibm.com> Closes #13437 from ScrapCodes/SPARK-15697/repl-unblock-commands.	2016-06-13 11:13:09 -07:00
Dongjoon Hyun	938434dc78	[SPARK-15913][CORE] Dispatcher.stopped should be enclosed by synchronized block. ## What changes were proposed in this pull request? `Dispatcher.stopped` is guarded by `this`, but it is used without synchronization in `postMessage` function. This PR fixes this and also the exception message became more accurate. ## How was this patch tested? Pass the existing Jenkins tests. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #13634 from dongjoon-hyun/SPARK-15913.	2016-06-13 10:30:17 -07:00
Wenchen Fan	cd47e23374	[SPARK-15814][SQL] Aggregator can return null result ## What changes were proposed in this pull request? It's similar to the bug fixed in https://github.com/apache/spark/pull/13425, we should consider null object and wrap the `CreateStruct` with `If` to do null check. This PR also improves the test framework to test the objects of `Dataset[T]` directly, instead of calling `toDF` and compare the rows. ## How was this patch tested? new test in `DatasetAggregatorSuite` Author: Wenchen Fan <wenchen@databricks.com> Closes #13553 from cloud-fan/agg-null.	2016-06-13 09:58:48 -07:00

1 2 3 4 5 ...

16643 commits