ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Huaxin Gao	98909c398d	[SPARK-23920][SQL] add array_remove to remove all elements that equal element from array ## What changes were proposed in this pull request? add array_remove to remove all elements that equal element from array ## How was this patch tested? add unit tests Author: Huaxin Gao <huaxing@us.ibm.com> Closes #21069 from huaxingao/spark-23920.	2018-05-31 22:04:26 -07:00
Bryan Cutler	b2d0226562	[SPARK-24444][DOCS][PYTHON] Improve Pandas UDF docs to explain column assignment ## What changes were proposed in this pull request? Added sections to pandas_udf docs, in the grouped map section, to indicate columns are assigned by position. ## How was this patch tested? NA Author: Bryan Cutler <cutlerb@gmail.com> Closes #21471 from BryanCutler/arrow-doc-pandas_udf-column_by_pos-SPARK-21427.	2018-06-01 11:58:59 +08:00
e-dorigatti	0ebb0c0d4d	[SPARK-23754][PYTHON] Re-raising StopIteration in client code ## What changes were proposed in this pull request? Make sure that `StopIteration`s raised in users' code do not silently interrupt processing by spark, but are raised as exceptions to the users. The users' functions are wrapped in `safe_iter` (in `shuffle.py`), which re-raises `StopIteration`s as `RuntimeError`s ## How was this patch tested? Unit tests, making sure that the exceptions are indeed raised. I am not sure how to check whether a `Py4JJavaError` contains my exception, so I simply looked for the exception message in the java exception's `toString`. Can you propose a better way? ## License This is my original work, licensed in the same way as spark Author: e-dorigatti <emilio.dorigatti@gmail.com> Author: edorigatti <emilio.dorigatti@gmail.com> Closes #21383 from e-dorigatti/fix_spark_23754.	2018-05-30 18:11:33 +08:00
Bryan Cutler	fa2ae9d201	[SPARK-24392][PYTHON] Label pandas_udf as Experimental ## What changes were proposed in this pull request? The pandas_udf functionality was introduced in 2.3.0, but is not completely stable and still evolving. This adds a label to indicate it is still an experimental API. ## How was this patch tested? NA Author: Bryan Cutler <cutlerb@gmail.com> Closes #21435 from BryanCutler/arrow-pandas_udf-experimental-SPARK-24392.	2018-05-28 12:56:05 +08:00
Marek Novotny	a6e883feb3	[SPARK-23935][SQL] Adding map_entries function ## What changes were proposed in this pull request? This PR adds `map_entries` function that returns an unordered array of all entries in the given map. ## How was this patch tested? New tests added into: - `CollectionExpressionSuite` - `DataFrameFunctionsSuite` ## CodeGen examples ### Primitive types ``` val df = Seq(Map(1 -> 5, 2 -> 6)).toDF("m") df.filter('m.isNotNull).select(map_entries('m)).debugCodegen ``` Result: ``` /* 042 */ boolean project_isNull_0 = false; /* 043 */ /* 044 */ ArrayData project_value_0 = null; /* 045 */ /* 046 */ final int project_numElements_0 = inputadapter_value_0.numElements(); /* 047 */ final ArrayData project_keys_0 = inputadapter_value_0.keyArray(); /* 048 */ final ArrayData project_values_0 = inputadapter_value_0.valueArray(); /* 049 */ /* 050 */ final long project_size_0 = UnsafeArrayData.calculateSizeOfUnderlyingByteArray( /* 051 */ project_numElements_0, /* 052 */ 32); /* 053 */ if (project_size_0 > 2147483632) { /* 054 */ final Object[] project_internalRowArray_0 = new Object[project_numElements_0]; /* 055 */ for (int z = 0; z < project_numElements_0; z++) { /* 056 */ project_internalRowArray_0[z] = new org.apache.spark.sql.catalyst.expressions.GenericInternalRow(new Object[]{project_keys_0.getInt(z), project_values_0.getInt(z)}); /* 057 */ } /* 058 */ project_value_0 = new org.apache.spark.sql.catalyst.util.GenericArrayData(project_internalRowArray_0); /* 059 */ /* 060 */ } else { /* 061 */ final byte[] project_arrayBytes_0 = new byte[(int)project_size_0]; /* 062 */ UnsafeArrayData project_unsafeArrayData_0 = new UnsafeArrayData(); /* 063 */ Platform.putLong(project_arrayBytes_0, 16, project_numElements_0); /* 064 */ project_unsafeArrayData_0.pointTo(project_arrayBytes_0, 16, (int)project_size_0); /* 065 */ /* 066 */ final int project_structsOffset_0 = UnsafeArrayData.calculateHeaderPortionInBytes(project_numElements_0) + project_numElements_0 * 8; /* 067 */ UnsafeRow project_unsafeRow_0 = new UnsafeRow(2); /* 068 */ for (int z = 0; z < project_numElements_0; z++) { /* 069 */ long offset = project_structsOffset_0 + z * 24L; /* 070 */ project_unsafeArrayData_0.setLong(z, (offset << 32) + 24L); /* 071 */ project_unsafeRow_0.pointTo(project_arrayBytes_0, 16 + offset, 24); /* 072 */ project_unsafeRow_0.setInt(0, project_keys_0.getInt(z)); /* 073 */ project_unsafeRow_0.setInt(1, project_values_0.getInt(z)); /* 074 */ } /* 075 */ project_value_0 = project_unsafeArrayData_0; /* 076 */ /* 077 */ } ``` ### Non-primitive types ``` val df = Seq(Map("a" -> "foo", "b" -> null)).toDF("m") df.filter('m.isNotNull).select(map_entries('m)).debugCodegen ``` Result: ``` /* 042 */ boolean project_isNull_0 = false; /* 043 */ /* 044 */ ArrayData project_value_0 = null; /* 045 */ /* 046 */ final int project_numElements_0 = inputadapter_value_0.numElements(); /* 047 */ final ArrayData project_keys_0 = inputadapter_value_0.keyArray(); /* 048 */ final ArrayData project_values_0 = inputadapter_value_0.valueArray(); /* 049 */ /* 050 */ final Object[] project_internalRowArray_0 = new Object[project_numElements_0]; /* 051 */ for (int z = 0; z < project_numElements_0; z++) { /* 052 */ project_internalRowArray_0[z] = new org.apache.spark.sql.catalyst.expressions.GenericInternalRow(new Object[]{project_keys_0.getUTF8String(z), project_values_0.getUTF8String(z)}); /* 053 */ } /* 054 */ project_value_0 = new org.apache.spark.sql.catalyst.util.GenericArrayData(project_internalRowArray_0); ``` Author: Marek Novotny <mn.mikke@gmail.com> Closes #21236 from mn-mikke/feature/array-api-map_entries-to-master.	2018-05-21 23:14:03 +09:00
Liang-Chi Hsieh	6d7d45a1af	[SPARK-24242][SQL] RangeExec should have correct outputOrdering and outputPartitioning ## What changes were proposed in this pull request? Logical `Range` node has been added with `outputOrdering` recently. It's used to eliminate redundant `Sort` during optimization. However, this `outputOrdering` is not propagated to the physical `RangeExec` node. We also add correct `outputPartitioning` to `RangeExec` node. ## How was this patch tested? Added test. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #21291 from viirya/SPARK-24242.	2018-05-21 15:39:35 +08:00
Marco Gaido	69350aa2f0	[SPARK-23922][SQL] Add arrays_overlap function ## What changes were proposed in this pull request? The PR adds the function `arrays_overlap`. This function returns `true` if the input arrays contain a non-null common element; if not, it returns `null` if any of the arrays contains a `null` element, `false` otherwise. ## How was this patch tested? added UTs Author: Marco Gaido <marcogaido91@gmail.com> Closes #21028 from mgaido91/SPARK-23922.	2018-05-17 20:45:32 +08:00
Florent Pépin	3e66350c24	[SPARK-23925][SQL] Add array_repeat collection function ## What changes were proposed in this pull request? The PR adds a new collection function, array_repeat. As there already was a function repeat with the same signature, with the only difference being the expected return type (String instead of Array), the new function is called array_repeat to distinguish. The behaviour of the function is based on Presto's one. The function creates an array containing a given element repeated the requested number of times. ## How was this patch tested? New unit tests added into: - CollectionExpressionsSuite - DataFrameFunctionsSuite Author: Florent Pépin <florentpepin.92@gmail.com> Author: Florent Pépin <florent.pepin14@imperial.ac.uk> Closes #21208 from pepinoflo/SPARK-23925.	2018-05-17 13:31:14 +09:00
Liang-Chi Hsieh	d610d2a3f5	[SPARK-24259][SQL] ArrayWriter for Arrow produces wrong output ## What changes were proposed in this pull request? Right now, the `ArrayWriter` used to output Arrow data for array types doesn't do `clear` or `reset` after each batch, so it produces wrong output. ## How was this patch tested? Added test. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #21312 from viirya/SPARK-24259.	2018-05-15 22:06:58 +08:00
Maxim Gekk	8cd83acf40	[SPARK-24027][SQL] Support MapType with StringType for keys as the root type by from_json ## What changes were proposed in this pull request? Currently, the from_json function supports StructType or ArrayType as the root type. The PR allows specifying MapType(StringType, DataType) as the root type in addition to those types. For example: ```scala import org.apache.spark.sql.types._ val schema = MapType(StringType, IntegerType) val in = Seq("""{"a": 1, "b": 2, "c": 3}""").toDS() in.select(from_json($"value", schema, Map[String, String]())).collect() ``` ``` res1: Array[org.apache.spark.sql.Row] = Array([Map(a -> 1, b -> 2, c -> 3)]) ``` ## How was this patch tested? It was checked by new tests for the map type with integer type and struct type as value types. Also roundtrip tests like from_json(to_json) and to_json(from_json) for MapType are added. Author: Maxim Gekk <maxim.gekk@databricks.com> Author: Maxim Gekk <max.gekk@gmail.com> Closes #21108 from MaxGekk/from_json-map-type.	2018-05-14 14:05:42 -07:00
aditkumar	92f6f52ff0	[MINOR][DOCS] Documenting months_between direction ## What changes were proposed in this pull request? It's useful to know what relationship between date1 and date2 results in a positive number. Author: aditkumar <aditkumar@gmail.com> Author: Adit Kumar <aditkumar@gmail.com> Closes #20787 from aditkumar/master.	2018-05-11 14:42:23 -05:00
Maxim Gekk	f4fed05121	[SPARK-24171] Adding a note for non-deterministic functions ## What changes were proposed in this pull request? I propose to add a clear statement for functions like `collect_list()` about non-deterministic behavior of such functions. The behavior must be taken into account by user while creating and running queries. Author: Maxim Gekk <maxim.gekk@databricks.com> Closes #21228 from MaxGekk/deterministic-comments.	2018-05-10 09:44:49 -07:00
Marcelo Vanzin	cc613b552e	[PYSPARK] Update py4j to version 0.10.7.	2018-05-09 10:47:35 -07:00
Marco Gaido	e35ad3cadd	[SPARK-23930][SQL] Add slice function ## What changes were proposed in this pull request? The PR adds the `slice` function. The behavior of the function is based on Presto's one. The function slices an array according to the requested start index and length. ## How was this patch tested? added UTs Author: Marco Gaido <marcogaido91@gmail.com> Closes #21040 from mgaido91/SPARK-23930.	2018-05-07 16:57:37 +09:00
Kazuaki Ishizaki	7564a9a706	[SPARK-23921][SQL] Add array_sort function ## What changes were proposed in this pull request? The PR adds the SQL function `array_sort`. The behavior of the function is based on Presto's one. The function sorts the input array in ascending order. The elements of the input array must be orderable. Null elements will be placed at the end of the returned array. ## How was this patch tested? Added UTs Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #21021 from kiszk/SPARK-23921.	2018-05-07 15:22:23 +09:00
Marcelo Vanzin	a634d66ce7	[SPARK-24126][PYSPARK] Use build-specific temp directory for pyspark tests. This avoids polluting and leaving garbage behind in /tmp, and allows the usual build tools to clean up any leftover files. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #21198 from vanzin/SPARK-24126.	2018-05-07 13:00:18 +08:00
Dongjoon Hyun	b857fb549f	[SPARK-23853][PYSPARK][TEST] Run Hive-related PySpark tests only for `-Phive` ## What changes were proposed in this pull request? When `PyArrow` or `Pandas` are not available, the corresponding PySpark tests are skipped automatically. Currently, PySpark tests fail when we are not using `-Phive`. This PR aims to skip Hive related PySpark tests when `-Phive` is not given. BEFORE ```bash $ build/mvn -DskipTests clean package $ python/run-tests.py --python-executables python2.7 --modules pyspark-sql File "/Users/dongjoon/spark/python/pyspark/sql/readwriter.py", line 295, in pyspark.sql.readwriter.DataFrameReader.table ... IllegalArgumentException: u"Error while instantiating 'org.apache.spark.sql.hive.HiveExternalCatalog':" ******************************************************************** 1 of 3 in pyspark.sql.readwriter.DataFrameReader.table Test Failed 1 failures. ``` AFTER ```bash $ build/mvn -DskipTests clean package $ python/run-tests.py --python-executables python2.7 --modules pyspark-sql ... Tests passed in 138 seconds Skipped tests in pyspark.sql.tests with python2.7: ... test_hivecontext (pyspark.sql.tests.HiveSparkSubmitTests) ... skipped 'Hive is not available.' ``` ## How was this patch tested? This is a test-only change. First, this should pass the Jenkins. Then, manually do the following. ```bash build/mvn -DskipTests clean package python/run-tests.py --python-executables python2.7 --modules pyspark-sql ``` Author: Dongjoon Hyun <dongjoon@apache.org> Closes #21141 from dongjoon-hyun/SPARK-23853.	2018-05-01 09:06:23 +08:00
Maxim Gekk	3121b411f7	[SPARK-23846][SQL] The samplingRatio option for CSV datasource ## What changes were proposed in this pull request? I propose to support the `samplingRatio` option for schema inferring of CSV datasource similar to the same option of JSON datasource: `b14993e1fc/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala (L49-L50)` ## How was this patch tested? Added 2 tests for json and 2 tests for csv datasources. The tests check that only a subset of the input dataset is used for schema inferring. Author: Maxim Gekk <maxim.gekk@databricks.com> Author: Maxim Gekk <max.gekk@gmail.com> Closes #20959 from MaxGekk/csv-sampling.	2018-04-30 09:45:22 +08:00
Maxim Gekk	bd14da6fd5	[SPARK-23094][SPARK-23723][SPARK-23724][SQL] Support custom encoding for json files ## What changes were proposed in this pull request? I propose a new option for the JSON datasource which allows specifying the encoding (charset) of input and output files. Here is an example of using the option: ``` spark.read.schema(schema) .option("multiline", "true") .option("encoding", "UTF-16LE") .json(fileName) ``` If the option is not specified, the charset auto-detection mechanism is used by default. The option can be used for saving datasets to jsons. Currently Spark is able to save datasets into json files in `UTF-8` charset only. The changes allow saving data in any supported charset. Here is the approximate list of supported charsets by Oracle Java SE: https://docs.oracle.com/javase/8/docs/technotes/guides/intl/encoding.doc.html . A user can specify the charset of output jsons via the charset option like `.option("charset", "UTF-16BE")`. By default the output charset is still `UTF-8` to keep backward compatibility. The solution has the following restrictions for per-line mode (`multiline = false`): - If charset is different from UTF-8, the lineSep option must be specified. The option is required because Hadoop LineReader cannot detect the line separator correctly. Here is the ticket for solving the issue: https://issues.apache.org/jira/browse/SPARK-23725 - Encodings with a [BOM](https://en.wikipedia.org/wiki/Byte_order_mark) are not supported. For example, the `UTF-16` and `UTF-32` encodings are blacklisted. The problem can be solved by https://github.com/MaxGekk/spark-1/pull/2 ## How was this patch tested? I added the following tests: - reads a json file in `UTF-16LE` encoding with BOM in `multiline` mode - read json file by using charset auto detection (`UTF-32BE` with BOM) - read json file using the user's charset (`UTF-16LE`) - saving in `UTF-32BE` and read the result by standard library (not by Spark) - checking that default charset is `UTF-8` - handling wrong (unsupported) charset Author: Maxim Gekk <maxim.gekk@databricks.com> Author: Maxim Gekk <max.gekk@gmail.com> Closes #20937 from MaxGekk/json-encoding-line-sep.	2018-04-29 11:25:31 +08:00
hyukjinkwon	f7435bec6a	[SPARK-24044][PYTHON] Explicitly print out skipped tests from unittest module ## What changes were proposed in this pull request? This PR proposes to remove duplicated dependency checking logic and also print out skipped tests from unittests. For example, as below: ``` Skipped tests in pyspark.sql.tests with pypy: test_createDataFrame_column_name_encoding (pyspark.sql.tests.ArrowTests) ... skipped 'Pandas >= 0.19.2 must be installed; however, it was not found.' test_createDataFrame_does_not_modify_input (pyspark.sql.tests.ArrowTests) ... skipped 'Pandas >= 0.19.2 must be installed; however, it was not found.' ... Skipped tests in pyspark.sql.tests with python3: test_createDataFrame_column_name_encoding (pyspark.sql.tests.ArrowTests) ... skipped 'PyArrow >= 0.8.0 must be installed; however, it was not found.' test_createDataFrame_does_not_modify_input (pyspark.sql.tests.ArrowTests) ... skipped 'PyArrow >= 0.8.0 must be installed; however, it was not found.' ... ``` Currently, it's not printed out in the console. I think we should print out skipped tests in the console. ## How was this patch tested? Manually tested. Also, fortunately, Jenkins has a good environment to test the skipped output. Author: hyukjinkwon <gurwls223@apache.org> Closes #21107 from HyukjinKwon/skipped-tests-print.	2018-04-26 15:11:42 -07:00
Huaxin Gao	4f1e38649e	[SPARK-24057][PYTHON] put the real data type in the AssertionError message ## What changes were proposed in this pull request? Print out the data type in the AssertionError message to make it more meaningful. ## How was this patch tested? I manually tested the changed code on my local, but didn't add any test. Author: Huaxin Gao <huaxing@us.ibm.com> Closes #21159 from huaxingao/spark-24057.	2018-04-26 14:21:22 -07:00
Marco Gaido	cd10f9df82	[SPARK-23916][SQL] Add array_join function ## What changes were proposed in this pull request? The PR adds the SQL function `array_join`. The behavior of the function is based on Presto's one. The function accepts an `array` of `string` which is to be joined, a `string` which is the delimiter to use between the items of the first argument and optionally a `string` which is used to replace `null` values. ## How was this patch tested? added UTs Author: Marco Gaido <marcogaido91@gmail.com> Closes #21011 from mgaido91/SPARK-23916.	2018-04-26 13:37:13 +09:00
Marco Gaido	58c55cb4a6	[SPARK-23902][SQL] Add roundOff flag to months_between ## What changes were proposed in this pull request? HIVE-15511 introduced the `roundOff` flag in order to disable the rounding to 8 digits which is performed in `months_between`. Since this can be a computationally intensive operation, skipping it may improve performance when the rounding is not needed. ## How was this patch tested? modified existing UT Author: Marco Gaido <marcogaido91@gmail.com> Closes #21008 from mgaido91/SPARK-23902.	2018-04-26 12:19:20 +09:00
Maxim Gekk	3f1e999d3d	[SPARK-23849][SQL] Tests for samplingRatio of json datasource ## What changes were proposed in this pull request? Added the `samplingRatio` option to the `json()` method of PySpark DataFrame Reader. Improving existing tests for Scala API according to review of the PR: https://github.com/apache/spark/pull/20959 ## How was this patch tested? Added new test for PySpark, updated 2 existing tests according to reviews of https://github.com/apache/spark/pull/20959 and added new negative test Author: Maxim Gekk <maxim.gekk@databricks.com> Closes #21056 from MaxGekk/json-sampling.	2018-04-26 09:14:24 +08:00
mn-mikke	5fea17b3be	[SPARK-23821][SQL] Collection function: flatten ## What changes were proposed in this pull request? This PR adds a new collection function that transforms an array of arrays into a single array. The PR comprises: - An expression for flattening array structure - Flatten function - A wrapper for PySpark ## How was this patch tested? New tests added into: - CollectionExpressionsSuite - DataFrameFunctionsSuite ## Codegen examples ### Primitive type ``` val df = Seq( Seq(Seq(1, 2), Seq(4, 5)), Seq(null, Seq(1)) ).toDF("i") df.filter($"i".isNotNull || $"i".isNull).select(flatten($"i")).debugCodegen ``` Result: ``` /* 033 */ boolean inputadapter_isNull = inputadapter_row.isNullAt(0); /* 034 */ ArrayData inputadapter_value = inputadapter_isNull ? /* 035 */ null : (inputadapter_row.getArray(0)); /* 036 */ /* 037 */ boolean filter_value = true; /* 038 */ /* 039 */ if (!(!inputadapter_isNull)) { /* 040 */ filter_value = inputadapter_isNull; /* 041 */ } /* 042 */ if (!filter_value) continue; /* 043 */ /* 044 */ ((org.apache.spark.sql.execution.metric.SQLMetric) references[0] /* numOutputRows */).add(1); /* 045 */ /* 046 */ boolean project_isNull = inputadapter_isNull; /* 047 */ ArrayData project_value = null; /* 048 */ /* 049 */ if (!inputadapter_isNull) { /* 050 */ for (int z = 0; !project_isNull && z < inputadapter_value.numElements(); z++) { /* 051 */ project_isNull |= inputadapter_value.isNullAt(z); /* 052 */ } /* 053 */ if (!project_isNull) { /* 054 */ long project_numElements = 0; /* 055 */ for (int z = 0; z < inputadapter_value.numElements(); z++) { /* 056 */ project_numElements += inputadapter_value.getArray(z).numElements(); /* 057 */ } /* 058 */ if (project_numElements > 2147483632) { /* 059 */ throw new RuntimeException("Unsuccessful try to flatten an array of arrays with " + /* 060 */ project_numElements + " elements due to exceeding the array size limit 2147483632."); /* 061 */ } /* 062 */ /* 063 */ long project_size = UnsafeArrayData.calculateSizeOfUnderlyingByteArray( /* 064 */ project_numElements, /* 065 */ 4); /* 066 */ if (project_size > 2147483632) { /* 067 */ throw new RuntimeException("Unsuccessful try to flatten an array of arrays with " + /* 068 */ project_size + " bytes of data due to exceeding the limit 2147483632" + /* 069 */ " bytes for UnsafeArrayData."); /* 070 */ } /* 071 */ /* 072 */ byte[] project_array = new byte[(int)project_size]; /* 073 */ UnsafeArrayData project_tempArrayData = new UnsafeArrayData(); /* 074 */ Platform.putLong(project_array, 16, project_numElements); /* 075 */ project_tempArrayData.pointTo(project_array, 16, (int)project_size); /* 076 */ int project_counter = 0; /* 077 */ for (int k = 0; k < inputadapter_value.numElements(); k++) { /* 078 */ ArrayData arr = inputadapter_value.getArray(k); /* 079 */ for (int l = 0; l < arr.numElements(); l++) { /* 080 */ if (arr.isNullAt(l)) { /* 081 */ project_tempArrayData.setNullAt(project_counter); /* 082 */ } else { /* 083 */ project_tempArrayData.setInt( /* 084 */ project_counter, /* 085 */ arr.getInt(l) /* 086 */ ); /* 087 */ } /* 088 */ project_counter++; /* 089 */ } /* 090 */ } /* 091 */ project_value = project_tempArrayData; /* 092 */ /* 093 */ } /* 094 */ /* 095 */ } ``` ### Non-primitive type ``` val df = Seq( Seq(Seq("a", "b"), Seq(null, "d")), Seq(null, Seq("a")) ).toDF("s") df.filter($"s".isNotNull || $"s".isNull).select(flatten($"s")).debugCodegen ``` Result: ``` /* 033 */ boolean inputadapter_isNull = inputadapter_row.isNullAt(0); /* 034 */ ArrayData inputadapter_value = inputadapter_isNull ? /* 035 */ null : (inputadapter_row.getArray(0)); /* 036 */ /* 037 */ boolean filter_value = true; /* 038 */ /* 039 */ if (!(!inputadapter_isNull)) { /* 040 */ filter_value = inputadapter_isNull; /* 041 */ } /* 042 */ if (!filter_value) continue; /* 043 */ /* 044 */ ((org.apache.spark.sql.execution.metric.SQLMetric) references[0] /* numOutputRows */).add(1); /* 045 */ /* 046 */ boolean project_isNull = inputadapter_isNull; /* 047 */ ArrayData project_value = null; /* 048 */ /* 049 */ if (!inputadapter_isNull) { /* 050 */ for (int z = 0; !project_isNull && z < inputadapter_value.numElements(); z++) { /* 051 */ project_isNull |= inputadapter_value.isNullAt(z); /* 052 */ } /* 053 */ if (!project_isNull) { /* 054 */ long project_numElements = 0; /* 055 */ for (int z = 0; z < inputadapter_value.numElements(); z++) { /* 056 */ project_numElements += inputadapter_value.getArray(z).numElements(); /* 057 */ } /* 058 */ if (project_numElements > 2147483632) { /* 059 */ throw new RuntimeException("Unsuccessful try to flatten an array of arrays with " + /* 060 */ project_numElements + " elements due to exceeding the array size limit 2147483632."); /* 061 */ } /* 062 */ /* 063 */ Object[] project_arrayObject = new Object[(int)project_numElements]; /* 064 */ int project_counter = 0; /* 065 */ for (int k = 0; k < inputadapter_value.numElements(); k++) { /* 066 */ ArrayData arr = inputadapter_value.getArray(k); /* 067 */ for (int l = 0; l < arr.numElements(); l++) { /* 068 */ project_arrayObject[project_counter] = arr.getUTF8String(l); /* 069 */ project_counter++; /* 070 */ } /* 071 */ } /* 072 */ project_value = new org.apache.spark.sql.catalyst.util.GenericArrayData(project_arrayObject); /* 073 */ /* 074 */ } /* 075 */ /* 076 */ } ``` Author: mn-mikke <mrkAha12346github> Closes #20938 from mn-mikke/feature/array-api-flatten-to-master.	2018-04-25 11:19:08 +09:00
mn-mikke	e6b466084c	[SPARK-23736][SQL] Extending the concat function to support array columns ## What changes were proposed in this pull request? The PR adds a logic for easy concatenation of multiple array columns and covers: - Concat expression has been extended to support array columns - A Python wrapper ## How was this patch tested? New tests added into: - CollectionExpressionsSuite - DataFrameFunctionsSuite - typeCoercion/native/concat.sql ## Codegen examples ### Primitive-type elements ``` val df = Seq( (Seq(1 ,2), Seq(3, 4)), (Seq(1, 2, 3), null) ).toDF("a", "b") df.filter('a.isNotNull).select(concat('a, 'b)).debugCodegen() ``` Result: ``` /* 033 */ boolean inputadapter_isNull = inputadapter_row.isNullAt(0); /* 034 */ ArrayData inputadapter_value = inputadapter_isNull ? /* 035 */ null : (inputadapter_row.getArray(0)); /* 036 */ /* 037 */ if (!(!inputadapter_isNull)) continue; /* 038 */ /* 039 */ ((org.apache.spark.sql.execution.metric.SQLMetric) references[0] /* numOutputRows */).add(1); /* 040 */ /* 041 */ ArrayData[] project_args = new ArrayData[2]; /* 042 */ /* 043 */ if (!false) { /* 044 */ project_args[0] = inputadapter_value; /* 045 */ } /* 046 */ /* 047 */ boolean inputadapter_isNull1 = inputadapter_row.isNullAt(1); /* 048 */ ArrayData inputadapter_value1 = inputadapter_isNull1 ? /* 049 */ null : (inputadapter_row.getArray(1)); /* 050 */ if (!inputadapter_isNull1) { /* 051 */ project_args[1] = inputadapter_value1; /* 052 */ } /* 053 */ /* 054 */ ArrayData project_value = new Object() { /* 055 */ public ArrayData concat(ArrayData[] args) { /* 056 */ for (int z = 0; z < 2; z++) { /* 057 */ if (args[z] == null) return null; /* 058 */ } /* 059 */ /* 060 */ long project_numElements = 0L; /* 061 */ for (int z = 0; z < 2; z++) { /* 062 */ project_numElements += args[z].numElements(); /* 063 */ } /* 064 */ if (project_numElements > 2147483632) { /* 065 */ throw new RuntimeException("Unsuccessful try to concat arrays with " + project_numElements + /* 066 */ " elements due to exceeding the array size limit 2147483632."); /* 067 */ } /* 068 */ /* 069 */ long project_size = UnsafeArrayData.calculateSizeOfUnderlyingByteArray( /* 070 */ project_numElements, /* 071 */ 4); /* 072 */ if (project_size > 2147483632) { /* 073 */ throw new RuntimeException("Unsuccessful try to concat arrays with " + project_size + /* 074 */ " bytes of data due to exceeding the limit 2147483632 bytes" + /* 075 */ " for UnsafeArrayData."); /* 076 */ } /* 077 */ /* 078 */ byte[] project_array = new byte[(int)project_size]; /* 079 */ UnsafeArrayData project_arrayData = new UnsafeArrayData(); /* 080 */ Platform.putLong(project_array, 16, project_numElements); /* 081 */ project_arrayData.pointTo(project_array, 16, (int)project_size); /* 082 */ int project_counter = 0; /* 083 */ for (int y = 0; y < 2; y++) { /* 084 */ for (int z = 0; z < args[y].numElements(); z++) { /* 085 */ if (args[y].isNullAt(z)) { /* 086 */ project_arrayData.setNullAt(project_counter); /* 087 */ } else { /* 088 */ project_arrayData.setInt( /* 089 */ project_counter, /* 090 */ args[y].getInt(z) /* 091 */ ); /* 092 */ } /* 093 */ project_counter++; /* 094 */ } /* 095 */ } /* 096 */ return project_arrayData; /* 097 */ } /* 098 */ }.concat(project_args); /* 099 */ boolean project_isNull = project_value == null; ``` ### Non-primitive-type elements ``` val df = Seq( (Seq("aa" ,"bb"), Seq("ccc", "ddd")), (Seq("x", "y"), null) ).toDF("a", "b") df.filter('a.isNotNull).select(concat('a, 'b)).debugCodegen() ``` Result: ``` /* 033 */ boolean inputadapter_isNull = inputadapter_row.isNullAt(0); /* 034 */ ArrayData inputadapter_value = inputadapter_isNull ? /* 035 */ null : (inputadapter_row.getArray(0)); /* 036 */ /* 037 */ if (!(!inputadapter_isNull)) continue; /* 038 */ /* 039 */ ((org.apache.spark.sql.execution.metric.SQLMetric) references[0] /* numOutputRows */).add(1); /* 040 */ /* 041 */ ArrayData[] project_args = new ArrayData[2]; /* 042 */ /* 043 */ if (!false) { /* 044 */ project_args[0] = inputadapter_value; /* 045 */ } /* 046 */ /* 047 */ boolean inputadapter_isNull1 = inputadapter_row.isNullAt(1); /* 048 */ ArrayData inputadapter_value1 = inputadapter_isNull1 ? /* 049 */ null : (inputadapter_row.getArray(1)); /* 050 */ if (!inputadapter_isNull1) { /* 051 */ project_args[1] = inputadapter_value1; /* 052 */ } /* 053 */ /* 054 */ ArrayData project_value = new Object() { /* 055 */ public ArrayData concat(ArrayData[] args) { /* 056 */ for (int z = 0; z < 2; z++) { /* 057 */ if (args[z] == null) return null; /* 058 */ } /* 059 */ /* 060 */ long project_numElements = 0L; /* 061 */ for (int z = 0; z < 2; z++) { /* 062 */ project_numElements += args[z].numElements(); /* 063 */ } /* 064 */ if (project_numElements > 2147483632) { /* 065 */ throw new RuntimeException("Unsuccessful try to concat arrays with " + project_numElements + /* 066 */ " elements due to exceeding the array size limit 2147483632."); /* 067 */ } /* 068 */ /* 069 */ Object[] project_arrayObjects = new Object[(int)project_numElements]; /* 070 */ int project_counter = 0; /* 071 */ for (int y = 0; y < 2; y++) { /* 072 */ for (int z = 0; z < args[y].numElements(); z++) { /* 073 */ project_arrayObjects[project_counter] = args[y].getUTF8String(z); /* 074 */ project_counter++; /* 075 */ } /* 076 */ } /* 077 */ return new org.apache.spark.sql.catalyst.util.GenericArrayData(project_arrayObjects); /* 078 */ } /* 079 */ }.concat(project_args); /* 080 */ boolean project_isNull = project_value == null; ``` Author: mn-mikke <mrkAha12346github> Closes #20858 from mn-mikke/feature/array-api-concat_arrays-to-master.	2018-04-20 14:58:11 +09:00
Kazuaki Ishizaki	46bb2b5129	[SPARK-23924][SQL] Add element_at function ## What changes were proposed in this pull request? The PR adds the SQL function `element_at`. The behavior of the function is based on Presto's one. This function returns the element of the array at the given index if the column is an array, or the value for the given key if the column is a map. ## How was this patch tested? Added UTs Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #21053 from kiszk/SPARK-23924.	2018-04-19 21:00:10 +09:00
Kazuaki Ishizaki	d5bec48b9c	[SPARK-23919][SQL] Add array_position function ## What changes were proposed in this pull request? The PR adds the SQL function `array_position`. The behavior of the function is based on Presto's one. The function returns the position of the first occurrence of the element in array x (or 0 if not found) using 1-based index as BigInt. ## How was this patch tested? Added UTs Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #21037 from kiszk/SPARK-23919.	2018-04-19 11:59:17 +09:00
mn-mikke	f81fa478ff	[SPARK-23926][SQL] Extending reverse function to support ArrayType arguments ## What changes were proposed in this pull request? This PR extends `reverse` functions to be able to operate over array columns and covers: - Introduction of `Reverse` expression that represents logic for reversing arrays and also strings - Removal of `StringReverse` expression - A wrapper for PySpark ## How was this patch tested? New tests added into: - CollectionExpressionsSuite - DataFrameFunctionsSuite ## Codegen examples ### Primitive type ``` val df = Seq( Seq(1, 3, 4, 2), null ).toDF("i") df.filter($"i".isNotNull || $"i".isNull).select(reverse($"i")).debugCodegen ``` Result: ``` /* 032 */ boolean inputadapter_isNull = inputadapter_row.isNullAt(0); /* 033 */ ArrayData inputadapter_value = inputadapter_isNull ? /* 034 */ null : (inputadapter_row.getArray(0)); /* 035 */ /* 036 */ boolean filter_value = true; /* 037 */ /* 038 */ if (!(!inputadapter_isNull)) { /* 039 */ filter_value = inputadapter_isNull; /* 040 */ } /* 041 */ if (!filter_value) continue; /* 042 */ /* 043 */ ((org.apache.spark.sql.execution.metric.SQLMetric) references[0] /* numOutputRows */).add(1); /* 044 */ /* 045 */ boolean project_isNull = inputadapter_isNull; /* 046 */ ArrayData project_value = null; /* 047 */ /* 048 */ if (!inputadapter_isNull) { /* 049 */ final int project_length = inputadapter_value.numElements(); /* 050 */ project_value = inputadapter_value.copy(); /* 051 */ for(int k = 0; k < project_length / 2; k++) { /* 052 */ int l = project_length - k - 1; /* 053 */ boolean isNullAtK = project_value.isNullAt(k); /* 054 */ boolean isNullAtL = project_value.isNullAt(l); /* 055 */ if(!isNullAtK) { /* 056 */ int el = project_value.getInt(k); /* 057 */ if(!isNullAtL) { /* 058 */ project_value.setInt(k, project_value.getInt(l)); /* 059 */ } else { /* 060 */ project_value.setNullAt(k); /* 061 */ } /* 062 */ project_value.setInt(l, el); /* 063 */ } else if (!isNullAtL) { /* 064 */ project_value.setInt(k, project_value.getInt(l)); /* 065 */ project_value.setNullAt(l); /* 066 */ } /* 067 */ } /* 068 */ /* 069 */ } ``` ### Non-primitive type ``` val df = Seq( Seq("a", "c", "d", "b"), null ).toDF("s") df.filter($"s".isNotNull || $"s".isNull).select(reverse($"s")).debugCodegen ``` Result: ``` /* 032 */ boolean inputadapter_isNull = inputadapter_row.isNullAt(0); /* 033 */ ArrayData inputadapter_value = inputadapter_isNull ? /* 034 */ null : (inputadapter_row.getArray(0)); /* 035 */ /* 036 */ boolean filter_value = true; /* 037 */ /* 038 */ if (!(!inputadapter_isNull)) { /* 039 */ filter_value = inputadapter_isNull; /* 040 */ } /* 041 */ if (!filter_value) continue; /* 042 */ /* 043 */ ((org.apache.spark.sql.execution.metric.SQLMetric) references[0] /* numOutputRows */).add(1); /* 044 */ /* 045 */ boolean project_isNull = inputadapter_isNull; /* 046 */ ArrayData project_value = null; /* 047 */ /* 048 */ if (!inputadapter_isNull) { /* 049 */ final int project_length = inputadapter_value.numElements(); /* 050 */ project_value = new org.apache.spark.sql.catalyst.util.GenericArrayData(new Object[project_length]); /* 051 */ for(int k = 0; k < project_length; k++) { /* 052 */ int l = project_length - k - 1; /* 053 */ project_value.update(k, inputadapter_value.getUTF8String(l)); /* 054 */ } /* 055 */ /* 056 */ } ``` Author: mn-mikke <mrkAha12346github> Closes #21034 from mn-mikke/feature/array-api-reverse-to-master.	2018-04-18 18:41:55 +09:00
Marco Gaido	14844a62c0	[SPARK-23918][SQL] Add array_min function ## What changes were proposed in this pull request? The PR adds the SQL function `array_min`. It takes an array as argument and returns the minimum value in it. ## How was this patch tested? added UTs Author: Marco Gaido <marcogaido91@gmail.com> Closes #21025 from mgaido91/SPARK-23918.	2018-04-17 17:55:35 +09:00
Marco Gaido	6931022031	[SPARK-23917][SQL] Add array_max function ## What changes were proposed in this pull request? The PR adds the SQL function `array_max`. It takes an array as argument and returns the maximum value in it. ## How was this patch tested? added UTs Author: Marco Gaido <marcogaido91@gmail.com> Closes #21024 from mgaido91/SPARK-23917.	2018-04-15 21:45:55 -07:00
hyukjinkwon	ab7b961a4f	[SPARK-23942][PYTHON][SQL] Makes collect in PySpark as action for a query executor listener ## What changes were proposed in this pull request? This PR proposes to add `collect` to a query executor as an action. It seems that `collect` / `collect` with Arrow are not recognised via `QueryExecutionListener` as an action. For example, if we have a custom listener as below: ```scala package org.apache.spark.sql import org.apache.spark.internal.Logging import org.apache.spark.sql.execution.QueryExecution import org.apache.spark.sql.util.QueryExecutionListener class TestQueryExecutionListener extends QueryExecutionListener with Logging { override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit = { logError("Look at me! I'm 'onSuccess'") } override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit = { } } ``` and set `spark.sql.queryExecutionListeners` to `org.apache.spark.sql.TestQueryExecutionListener`, other operations on the PySpark or Scala side seem fine: ```python >>> sql("SELECT * FROM range(1)").show() ``` ``` 18/04/09 17:02:04 ERROR TestQueryExecutionListener: Look at me! I'm 'onSuccess' +---+ | id| +---+ | 0| +---+ ``` ```scala scala> sql("SELECT * FROM range(1)").collect() ``` ``` 18/04/09 16:58:41 ERROR TestQueryExecutionListener: Look at me! I'm 'onSuccess' res1: Array[org.apache.spark.sql.Row] = Array([0]) ``` but .. Before ```python >>> sql("SELECT * FROM range(1)").collect() ``` ``` [Row(id=0)] ``` ```python >>> spark.conf.set("spark.sql.execution.arrow.enabled", "true") >>> sql("SELECT * FROM range(1)").toPandas() ``` ``` id 0 0 ``` After ```python >>> sql("SELECT * FROM range(1)").collect() ``` ``` 18/04/09 16:57:58 ERROR TestQueryExecutionListener: Look at me! I'm 'onSuccess' [Row(id=0)] ``` ```python >>> spark.conf.set("spark.sql.execution.arrow.enabled", "true") >>> sql("SELECT * FROM range(1)").toPandas() ``` ``` 18/04/09 17:53:26 ERROR TestQueryExecutionListener: Look at me! I'm 'onSuccess' id 0 0 ``` ## How was this patch tested? I have manually tested as described above and a unit test was added. Author: hyukjinkwon <gurwls223@apache.org> Closes #21007 from HyukjinKwon/SPARK-23942.	2018-04-13 11:28:13 +08:00
hyukjinkwon	c7622befda	[SPARK-23847][FOLLOWUP][PYTHON][SQL] Actually test [desc|asc]_nulls_[first|last] functions in PySpark ## What changes were proposed in this pull request? There was a mistake in `tests.py`: `assertEquals` was missing. ## How was this patch tested? Fixed tests. Author: hyukjinkwon <gurwls223@apache.org> Closes #21035 from HyukjinKwon/SPARK-23847.	2018-04-11 19:42:09 +08:00
Huaxin Gao	2c1fe64757	[SPARK-23847][PYTHON][SQL] Add asc_nulls_first, asc_nulls_last to PySpark ## What changes were proposed in this pull request? Column.scala and Functions.scala have asc_nulls_first, asc_nulls_last, desc_nulls_first and desc_nulls_last. Add the corresponding python APIs in column.py and functions.py ## How was this patch tested? Add doctest Author: Huaxin Gao <huaxing@us.ibm.com> Closes #20962 from huaxingao/spark-23847.	2018-04-08 12:09:06 +08:00
Li Jin	d766ea2ff2	[SPARK-23861][SQL][DOC] Clarify default window frame with and without orderBy clause ## What changes were proposed in this pull request? Add docstring to clarify default window frame boundaries with and without orderBy clause ## How was this patch tested? Manually generate doc and check. Author: Li Jin <ice.xelloss@gmail.com> Closes #20978 from icexelloss/SPARK-23861-window-doc.	2018-04-07 00:15:54 +08:00
hyukjinkwon	34c4b9c57e	[SPARK-23765][SQL] Supports custom line separator for json datasource ## What changes were proposed in this pull request? This PR proposes to add a `lineSep` option for a configurable line separator in the JSON datasource. It supports this option by passing it to the constructor of `LineRecordReader`. The approach is similar to https://github.com/apache/spark/pull/20727; one main difference, however, is that it uses the text datasource's `lineSep` option to parse line by line during JSON schema inference. ## How was this patch tested? Manually tested and unit tests were added. Author: hyukjinkwon <gurwls223@apache.org> Author: hyukjinkwon <gurwls223@gmail.com> Closes #20877 from HyukjinKwon/linesep-json.	2018-03-28 19:49:27 +08:00
Bryan Cutler	ed72badb04	[SPARK-23699][PYTHON][SQL] Raise same type of error caught with Arrow enabled ## What changes were proposed in this pull request? When using Arrow for createDataFrame or toPandas and an error is encountered with fallback disabled, this will raise the same type of error instead of a RuntimeError. This change also allows for the traceback of the error to be retained and prevents the accidental chaining of exceptions with Python 3. ## How was this patch tested? Updated existing tests to verify error type. Author: Bryan Cutler <cutlerb@gmail.com> Closes #20839 from BryanCutler/arrow-raise-same-error-SPARK-23699.	2018-03-27 20:06:12 -07:00
Michael (Stu) Stewart	087fb31420	[SPARK-23645][MINOR][DOCS][PYTHON] Add docs RE `pandas_udf` with keyword args ## What changes were proposed in this pull request? Add documentation about the limitations of `pandas_udf` with keyword arguments and related concepts, like `functools.partial` fn objects. NOTE: intermediate commits on this PR show some of the steps that can be taken to fix some (but not all) of these pain points. ### Survey of problems we face today: (Initialize) Note: python 3.6 and spark 2.4snapshot. ``` from pyspark.sql import SparkSession import inspect, functools from pyspark.sql.functions import pandas_udf, PandasUDFType, col, lit, udf spark = SparkSession.builder.getOrCreate() print(spark.version) df = spark.range(1,6).withColumn('b', col('id') * 2) def ok(a,b): return a+b ``` Using a keyword argument at the call site `b=...` (and yes, full stack trace below, haha): ``` ---> 14 df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')('id', b='id')).show() # no kwargs TypeError: wrapper() got an unexpected keyword argument 'b' ``` Using partial with a keyword argument where the kw-arg is the first argument of the fn: (Aside: kind of interesting that lines 15,16 work great and then 17 explodes) ``` --------------------------------------------------------------------------- ValueError Traceback (most recent call last) <ipython-input-9-e9f31b8799c1> in <module>() 15 df.withColumn('ok', pandas_udf(f=functools.partial(ok, 7), returnType='bigint')('id')).show() 16 df.withColumn('ok', pandas_udf(f=functools.partial(ok, b=7), returnType='bigint')('id')).show() ---> 17 df.withColumn('ok', pandas_udf(f=functools.partial(ok, a=7), returnType='bigint')('id')).show() /Users/stu/ZZ/spark/python/pyspark/sql/functions.py in pandas_udf(f, returnType, functionType) 2378 return functools.partial(_create_udf, returnType=return_type, evalType=eval_type) 2379 else: -> 2380 return _create_udf(f=f, returnType=return_type, evalType=eval_type) 2381 2382 
/Users/stu/ZZ/spark/python/pyspark/sql/udf.py in _create_udf(f, returnType, evalType) 54 argspec.varargs is None: 55 raise ValueError( ---> 56 "Invalid function: 0-arg pandas_udfs are not supported. " 57 "Instead, create a 1-arg pandas_udf and ignore the arg in your function." 58 ) ValueError: Invalid function: 0-arg pandas_udfs are not supported. Instead, create a 1-arg pandas_udf and ignore the arg in your function. ``` Author: Michael (Stu) Stewart <mstewart141@gmail.com> Closes #20900 from mstewart141/udfkw2.	2018-03-26 12:45:45 +09:00
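One plausible reading of the "0-arg" error in SPARK-23645 is that, on Python 3, binding the first parameter of a function by keyword with `functools.partial` turns every later parameter into a keyword-only one, so introspection reports no positional arguments. The sketch below only inspects argument specs with the standard library; it does not call `pandas_udf`, and the helper name `positional_args` is hypothetical.

```python
import functools
import inspect


def ok(a, b):
    return a + b


def positional_args(func):
    """List the positional parameter names that introspection reports."""
    return inspect.getfullargspec(func).args


# Binding `a` positionally leaves `b` as a positional parameter.
first_bound_positionally = positional_args(functools.partial(ok, 7))

# Binding `b` by keyword leaves `a` positional; `b` becomes keyword-only.
second_bound_by_keyword = positional_args(functools.partial(ok, b=7))

# Binding `a` by keyword makes both `a` and `b` keyword-only,
# so no positional parameters remain.
first_bound_by_keyword = positional_args(functools.partial(ok, a=7))
```

Under this reading, a check on the number of positional arguments would reject only the third form, which matches the traceback shown in the commit message.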
Bryan Cutler	a9350d7095	[SPARK-23700][PYTHON] Cleanup imports in pyspark.sql ## What changes were proposed in this pull request? This cleans up unused imports, mainly from the pyspark.sql module. Added a note in functions.py that `UserDefinedFunction` is imported only to maintain backwards compatibility for `from pyspark.sql.functions import UserDefinedFunction`. ## How was this patch tested? Existing tests and built docs. Author: Bryan Cutler <cutlerb@gmail.com> Closes #20892 from BryanCutler/pyspark-cleanup-imports-SPARK-23700.	2018-03-26 12:42:32 +09:00
hyukjinkwon	a649fcf32a	[MINOR][PYTHON] Remove unused codes in schema parsing logics of PySpark ## What changes were proposed in this pull request? This PR proposes to remove unused code, `_ignore_brackets_split` and `_BRACKETS`. `_ignore_brackets_split` was introduced in `d57daf1f77` to refactor and support `toDF("...")`; however, `ebc124d4c4` replaced that logic, and `_ignore_brackets_split` no longer seems to be referenced. `_BRACKETS` was introduced in `880eabec37`; however, all other usages were removed in `648a8626b8`. This is effectively a follow-up to `ebc124d4c4`, which I missed in that PR. ## How was this patch tested? Manually tested. Existing tests should cover this. I also double-checked with `grep` across the whole repo. Author: hyukjinkwon <gurwls223@apache.org> Closes #20878 from HyukjinKwon/minor-remove-unused.	2018-03-22 21:20:41 -07:00
hyukjinkwon	8d79113b81	[SPARK-23577][SQL] Supports custom line separator for text datasource ## What changes were proposed in this pull request? This PR proposes to add `lineSep` option for a configurable line separator in text datasource. It supports this option by using `LineRecordReader`'s functionality with passing it to the constructor. ## How was this patch tested? Manual tests and unit tests were added. Author: hyukjinkwon <gurwls223@gmail.com> Closes #20727 from HyukjinKwon/linesep-text.	2018-03-21 09:46:47 -07:00
hyukjinkwon	566321852b	[SPARK-23691][PYTHON] Use sql_conf util in PySpark tests where possible ## What changes were proposed in this pull request? `d6632d185e` added a useful util ```python @contextmanager def sql_conf(self, pairs): ... ``` to allow configuration set/unset within a block: ```python with self.sql_conf({"spark.blah.blah.blah": "blah"}): # test codes ``` This PR proposes to use this util where possible in PySpark tests. Note that there already appear to be a few places in the unittest classes that change configuration and affect tests without restoring the original value. ## How was this patch tested? Manually tested via: ``` ./run-tests --modules=pyspark-sql --python-executables=python2 ./run-tests --modules=pyspark-sql --python-executables=python3 ``` Author: hyukjinkwon <gurwls223@gmail.com> Closes #20830 from HyukjinKwon/cleanup-sql-conf.	2018-03-19 21:25:37 -07:00
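The set-then-restore pattern behind a util like `sql_conf` can be sketched without Spark. Everything here is an assumption made for illustration: `FakeConf` is a hypothetical stand-in for a session configuration, and this free function takes the configuration object explicitly rather than being a method on a test class as in the commit.

```python
from contextlib import contextmanager

_UNSET = object()  # marks keys that had no value before the block


class FakeConf:
    """Hypothetical in-memory stand-in for a session configuration."""

    def __init__(self):
        self._values = {}

    def get(self, key, default=None):
        return self._values.get(key, default)

    def set(self, key, value):
        self._values[key] = value

    def unset(self, key):
        self._values.pop(key, None)


@contextmanager
def sql_conf(conf, pairs):
    """Apply `pairs` for the duration of the block, then restore old values."""
    old_values = {key: conf.get(key, _UNSET) for key in pairs}
    for key, value in pairs.items():
        conf.set(key, value)
    try:
        yield
    finally:
        # Restore even if the block raised, so later tests are unaffected.
        for key, old_value in old_values.items():
            if old_value is _UNSET:
                conf.unset(key)
            else:
                conf.set(key, old_value)
```

Restoring in `finally` is the point of the pattern: a failing test body should not leak its configuration into the tests that run after it.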
hyukjinkwon	61487b308b	[SPARK-23706][PYTHON] spark.conf.get(value, default=None) should produce None in PySpark ## What changes were proposed in this pull request? Scala: ``` scala> spark.conf.get("hey", null) res1: String = null ``` ``` scala> spark.conf.get("spark.sql.sources.partitionOverwriteMode", null) res2: String = null ``` Python: Before ``` >>> spark.conf.get("hey", None) ... py4j.protocol.Py4JJavaError: An error occurred while calling o30.get. : java.util.NoSuchElementException: hey ... ``` ``` >>> spark.conf.get("spark.sql.sources.partitionOverwriteMode", None) u'STATIC' ``` After ``` >>> spark.conf.get("hey", None) is None True ``` ``` >>> spark.conf.get("spark.sql.sources.partitionOverwriteMode", None) is None True ``` *Note that this PR preserves the case below: ``` >>> spark.conf.get("spark.sql.sources.partitionOverwriteMode") u'STATIC' ``` ## How was this patch tested? Manually tested and unit tests were added. Author: hyukjinkwon <gurwls223@gmail.com> Closes #20841 from HyukjinKwon/spark-conf-get.	2018-03-18 20:24:14 +09:00
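Distinguishing "no default given" from "default is `None`" is commonly done with a private sentinel object. The sketch below shows that technique on a hypothetical `ConfSketch` class with made-up keys; it illustrates the before/after behavior listed in SPARK-23706 and is not PySpark's actual code.

```python
_NO_VALUE = object()  # sentinel: the caller did not pass a default at all


class ConfSketch:
    """Hypothetical configuration with built-in defaults and user-set values."""

    def __init__(self, builtin_defaults):
        self._builtin_defaults = dict(builtin_defaults)
        self._user_values = {}

    def set(self, key, value):
        self._user_values[key] = value

    def get(self, key, default=_NO_VALUE):
        if key in self._user_values:
            return self._user_values[key]
        if default is _NO_VALUE:
            # No default supplied: fall back to the built-in default,
            # and fail loudly if the key is unknown.
            return self._builtin_defaults[key]
        # An explicit default, including None, wins over the built-in one.
        return default
```

With `default=None` in the signature instead of the sentinel, the two call forms could not be told apart, which is the ambiguity the commit removes.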
Dongjoon Hyun	5414abca4f	[SPARK-23553][TESTS] Tests should not assume the default value of `spark.sql.sources.default` ## What changes were proposed in this pull request? Currently, some tests assume that `spark.sql.sources.default=parquet`. That assumption is in fact correct, but it makes it difficult to test a new data source format. This PR aims to - Make test suites more robust and make it easy to test new data sources in the future. - Test the new native ORC data source with the full existing Apache Spark test coverage. As an example, the PR uses `spark.sql.sources.default=orc` during reviews. The value should be `parquet` when this PR is accepted. ## How was this patch tested? Pass the Jenkins with updated tests. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #20705 from dongjoon-hyun/SPARK-23553.	2018-03-16 09:36:30 -07:00
Benjamin Peterson	7013eea11c	[SPARK-23522][PYTHON] always use sys.exit over builtin exit The exit() builtin is only for interactive use. Applications should use sys.exit(). ## What changes were proposed in this pull request? All usage of the builtin `exit()` function is replaced by `sys.exit()`. ## How was this patch tested? I ran `python/run-tests`. Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Benjamin Peterson <benjamin@python.org> Closes #20682 from benjaminp/sys-exit.	2018-03-08 20:38:34 +09:00
Li Jin	2cb23a8f51	[SPARK-23011][SQL][PYTHON] Support alternative function form with group aggregate pandas UDF ## What changes were proposed in this pull request? This PR proposes to support an alternative function form with group aggregate pandas UDF. The current form: ``` def foo(pdf): return ... ``` takes a single argument, a pandas DataFrame. With this PR, an alternative form is supported: ``` def foo(key, pdf): return ... ``` The alternative form takes two arguments: a tuple that represents the grouping key, and a pandas DataFrame that represents the data. ## How was this patch tested? GroupbyApplyTests Author: Li Jin <ice.xelloss@gmail.com> Closes #20295 from icexelloss/SPARK-23011-groupby-apply-key.	2018-03-08 20:29:07 +09:00
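Supporting both `foo(pdf)` and `foo(key, pdf)` requires deciding which form the user supplied. A minimal sketch of one way to do that, by counting positional parameters, is shown below; the helper `call_grouped` is hypothetical, plain lists stand in for pandas DataFrames, and the commit does not confirm that Spark dispatches this way.

```python
import inspect


def call_grouped(func, key, rows):
    """Call `func` with either (rows) or (key, rows), based on its arity.

    Hypothetical dispatcher: `rows` is a plain list standing in for the
    per-group data, and `key` is a tuple holding the grouping key.
    """
    arg_count = len(inspect.getfullargspec(func).args)
    if arg_count == 1:
        return func(rows)
    if arg_count == 2:
        return func(key, rows)
    raise ValueError("expected a function taking 1 or 2 arguments, got %d" % arg_count)


def total(rows):
    return sum(rows)


def total_with_key(key, rows):
    return key + (sum(rows),)
```

Passing the key as a tuple keeps the two-argument form uniform whether the grouping uses one column or several.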
hyukjinkwon	d6632d185e	[SPARK-23380][PYTHON] Adds a conf for Arrow fallback in toPandas/createDataFrame with Pandas DataFrame ## What changes were proposed in this pull request? This PR adds a configuration to control the fallback of Arrow optimization for `toPandas` and `createDataFrame` with Pandas DataFrame. ## How was this patch tested? Manually tested and unit tests added. You can test this by: `createDataFrame` ```python spark.conf.set("spark.sql.execution.arrow.enabled", False) pdf = spark.createDataFrame([[{'a': 1}]]).toPandas() spark.conf.set("spark.sql.execution.arrow.enabled", True) spark.conf.set("spark.sql.execution.arrow.fallback.enabled", True) spark.createDataFrame(pdf, "a: map<string, int>") ``` ```python spark.conf.set("spark.sql.execution.arrow.enabled", False) pdf = spark.createDataFrame([[{'a': 1}]]).toPandas() spark.conf.set("spark.sql.execution.arrow.enabled", True) spark.conf.set("spark.sql.execution.arrow.fallback.enabled", False) spark.createDataFrame(pdf, "a: map<string, int>") ``` `toPandas` ```python spark.conf.set("spark.sql.execution.arrow.enabled", True) spark.conf.set("spark.sql.execution.arrow.fallback.enabled", True) spark.createDataFrame([[{'a': 1}]]).toPandas() ``` ```python spark.conf.set("spark.sql.execution.arrow.enabled", True) spark.conf.set("spark.sql.execution.arrow.fallback.enabled", False) spark.createDataFrame([[{'a': 1}]]).toPandas() ``` Author: hyukjinkwon <gurwls223@gmail.com> Closes #20678 from HyukjinKwon/SPARK-23380-conf.	2018-03-08 20:22:07 +09:00
Mihaly Toth	a366b950b9	[SPARK-23329][SQL] Fix documentation of trigonometric functions ## What changes were proposed in this pull request? Provide more details in trigonometric function documentations. Referenced `java.lang.Math` for further details in the descriptions. ## How was this patch tested? Ran full build, checked generated documentation manually Author: Mihaly Toth <misutoth@gmail.com> Closes #20618 from misutoth/trigonometric-doc.	2018-03-05 23:46:40 +09:00
Anirudh	5ff72ffcf4	[SPARK-23566][MINOR][DOC] Argument name mismatch fixed Argument name mismatch fixed. ## What changes were proposed in this pull request? `col` changed to `new` in doc string to match the argument list. Patch file added: https://issues.apache.org/jira/browse/SPARK-23566 Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Anirudh <animenon@mail.com> Closes #20716 from animenon/master.	2018-03-05 23:17:16 +09:00
Michael (Stu) Stewart	7965c91d8a	[SPARK-23569][PYTHON] Allow pandas_udf to work with python3 style type-annotated functions ## What changes were proposed in this pull request? Check python version to determine whether to use `inspect.getargspec` or `inspect.getfullargspec` before applying `pandas_udf` core logic to a function. The former is python2.7 (deprecated in python3) and the latter is python3.x. The latter correctly accounts for type annotations, which are syntax errors in python2.x. ## How was this patch tested? Locally, on python 2.7 and 3.6. Author: Michael (Stu) Stewart <mstewart141@gmail.com> Closes #20728 from mstewart141/pandas_udf_fix.	2018-03-05 13:36:42 +09:00
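The version check described in SPARK-23569 can be sketched with the standard library alone. The wrapper name `get_argspec` is hypothetical; `inspect.getargspec` and `inspect.getfullargspec` are the two functions the commit message names, and only the latter accepts annotated functions.

```python
import inspect
import sys


def get_argspec(func):
    """Pick the introspection function that suits the running interpreter."""
    if sys.version_info[0] < 3:
        # Python 2 has no getfullargspec (and no annotation syntax).
        return inspect.getargspec(func)
    return inspect.getfullargspec(func)


def add_one(x: int) -> int:
    return x + 1


spec = get_argspec(add_one)
```

On Python 3 the returned spec also carries the annotations, so a caller that only needs the positional names can keep reading `spec.args` on both interpreter lines.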
Liang-Chi Hsieh	b14993e1fc	[SPARK-23448][SQL] Clarify JSON and CSV parser behavior in document ## What changes were proposed in this pull request? Clarify JSON and CSV reader behavior in the documentation. JSON doesn't support partial results for corrupted records. CSV only supports partial results for records with more or fewer tokens than expected. ## How was this patch tested? Pass existing tests. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #20666 from viirya/SPARK-23448-2.	2018-02-28 11:00:54 +09:00
hyukjinkwon	c5857e496f	[SPARK-23446][PYTHON] Explicitly check supported types in toPandas ## What changes were proposed in this pull request? This PR explicitly specifies and checks the types we supported in `toPandas`. This was a hole. For example, we haven't finished the binary type support in Python side yet but now it allows as below: ```python spark.conf.set("spark.sql.execution.arrow.enabled", "false") df = spark.createDataFrame([[bytearray("a")]]) df.toPandas() spark.conf.set("spark.sql.execution.arrow.enabled", "true") df.toPandas() ``` ``` _1 0 [97] _1 0 a ``` This should be disallowed. I think the same things also apply to nested timestamps too. I also added some nicer message about `spark.sql.execution.arrow.enabled` in the error message. ## How was this patch tested? Manually tested and tests added in `python/pyspark/sql/tests.py`. Author: hyukjinkwon <gurwls223@gmail.com> Closes #20625 from HyukjinKwon/pandas_convertion_supported_type.	2018-02-16 09:41:17 -08:00
gatorsmile	407f672496	[SPARK-20090][FOLLOW-UP] Revert the deprecation of `names` in PySpark ## What changes were proposed in this pull request? Deprecating the field `name` in PySpark is not expected. This PR is to revert the change. ## How was this patch tested? N/A Author: gatorsmile <gatorsmile@gmail.com> Closes #20595 from gatorsmile/removeDeprecate.	2018-02-13 15:05:13 +09:00
hyukjinkwon	c338c8cf82	[SPARK-23352][PYTHON] Explicitly specify supported types in Pandas UDFs ## What changes were proposed in this pull request? This PR targets to explicitly specify supported types in Pandas UDFs. The main change here is to add a deduplicated and explicit type checking in `returnType` ahead with documenting this; however, it happened to fix multiple things. 1. Currently, we don't support `BinaryType` in Pandas UDFs, for example, see: ```python from pyspark.sql.functions import pandas_udf pudf = pandas_udf(lambda x: x, "binary") df = spark.createDataFrame([[bytearray(1)]]) df.select(pudf("_1")).show() ``` ``` ... TypeError: Unsupported type in conversion to Arrow: BinaryType ``` We can document this behaviour for its guide. 2. Also, the grouped aggregate Pandas UDF fails fast on `ArrayType` but seems we can support this case. ```python from pyspark.sql.functions import pandas_udf, PandasUDFType foo = pandas_udf(lambda v: v.mean(), 'array<double>', PandasUDFType.GROUPED_AGG) df = spark.range(100).selectExpr("id", "array(id) as value") df.groupBy("id").agg(foo("value")).show() ``` ``` ... NotImplementedError: ArrayType, StructType and MapType are not supported with PandasUDFType.GROUPED_AGG ``` 3. Since we can check the return type ahead, we can fail fast before actual execution. ```python # we can fail fast at this stage because we know the schema ahead pandas_udf(lambda x: x, BinaryType()) ``` ## How was this patch tested? Manually tested and unit tests for `BinaryType` and `ArrayType(...)` were added. Author: hyukjinkwon <gurwls223@gmail.com> Closes #20531 from HyukjinKwon/pudf-cleanup.	2018-02-12 20:49:36 +09:00
xubo245	eacb62fbbe	[SPARK-22624][PYSPARK] Expose range partitioning shuffle introduced by spark-22614 ## What changes were proposed in this pull request? Expose range partitioning shuffle introduced by spark-22614 ## How was this patch tested? Unit test in dataframe.py Please review http://spark.apache.org/contributing.html before opening a pull request. Author: xubo245 <601450868@qq.com> Closes #20456 from xubo245/SPARK22624_PysparkRangePartition.	2018-02-11 19:23:15 +09:00
Huaxin Gao	8acb51f08b	[SPARK-23084][PYTHON] Add unboundedPreceding(), unboundedFollowing() and currentRow() to PySpark ## What changes were proposed in this pull request? Added unboundedPreceding(), unboundedFollowing() and currentRow() to PySpark, also updated the rangeBetween API ## How was this patch tested? did unit test on my local. Please let me know if I need to add unit test in tests.py Author: Huaxin Gao <huaxing@us.ibm.com> Closes #20400 from huaxingao/spark_23084.	2018-02-11 18:55:38 +09:00
Li Jin	a34fce19bc	[SPARK-23314][PYTHON] Add ambiguous=False when localizing tz-naive timestamps in Arrow codepath to deal with dst ## What changes were proposed in this pull request? When calling tz_localize on a tz-naive timestamp, pandas throws an exception if the timestamp falls in an ambiguous daylight saving time period, e.g., `2015-11-01 01:30:00`. This PR fixes the issue by setting `ambiguous=False` when calling tz_localize, which matches the default behavior of pytz. ## How was this patch tested? Add `test_timestamp_dst` Author: Li Jin <ice.xelloss@gmail.com> Closes #20537 from icexelloss/SPARK-23314.	2018-02-11 17:31:35 +09:00
Takuya UESHIN	97a224a855	[SPARK-23360][SQL][PYTHON] Get local timezone from environment via pytz, or dateutil. ## What changes were proposed in this pull request? Currently we use `tzlocal()` to get the Python local timezone, but it sometimes causes unexpected behavior. I changed the way the Python local timezone is obtained: use pytz if the timezone is specified in an environment variable, or the timezone file via dateutil. ## How was this patch tested? Added a test and existing tests. Author: Takuya UESHIN <ueshin@databricks.com> Closes #20559 from ueshin/issues/SPARK-23360/master.	2018-02-11 01:08:02 +09:00
hyukjinkwon	4b4ee26010	[SPARK-23328][PYTHON] Disallow default value None in na.replace/replace when 'to_replace' is not a dictionary ## What changes were proposed in this pull request? This PR proposes to disallow the default value None when 'to_replace' is not a dictionary. It seems odd that we set the default value of `value` to `None` and ended up allowing the case below: ```python >>> df.show() ``` ``` +----+------+-----+ | age|height| name| +----+------+-----+ | 10| 80|Alice| ... ``` ```python >>> df.na.replace('Alice').show() ``` ``` +----+------+----+ | age|height|name| +----+------+----+ | 10| 80|null| ... ``` After: this PR disallows the case above: ```python >>> df.na.replace('Alice').show() ``` ``` ... TypeError: value is required when to_replace is not a dictionary. ``` while it is still allowed when `to_replace` is a dictionary: ```python >>> df.na.replace({'Alice': None}).show() ``` ``` +----+------+----+ | age|height|name| +----+------+----+ | 10| 80|null| ... ``` ## How was this patch tested? Manually tested, tests were added in `python/pyspark/sql/tests.py` and doctests were fixed. Author: hyukjinkwon <gurwls223@gmail.com> Closes #20499 from HyukjinKwon/SPARK-19454-followup.	2018-02-09 14:21:10 +08:00
Takuya UESHIN	a62f30d3fa	[SPARK-23319][TESTS][FOLLOWUP] Fix a test for Python 3 without pandas. ## What changes were proposed in this pull request? This is a follow-up PR of #20487. When importing a module that doesn't exist, the error message differs slightly between Python 2 and 3. E.g., in Python 2: ``` No module named pandas ``` in Python 3: ``` No module named 'pandas' ``` So, one test that checks for an import error fails in Python 3 without pandas. This PR fixes it. ## How was this patch tested? Tested manually in my local environment. Author: Takuya UESHIN <ueshin@databricks.com> Closes #20538 from ueshin/issues/SPARK-23319/fup1.	2018-02-08 12:46:10 +09:00
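A test that must pass on both interpreters can match the two message variants quoted above with one pattern. This is only a sketch of that idea with a hypothetical helper name; the commit message does not say how the actual test was changed.

```python
import re

# The quotes around the module name are optional: Python 2 omits them,
# Python 3 includes them.
MISSING_PANDAS_PATTERN = re.compile(r"No module named '?pandas'?")


def mentions_missing_pandas(message):
    """Return True if `message` reports a missing pandas module."""
    return MISSING_PANDAS_PATTERN.search(message) is not None
```

A pattern like this could be passed to a regex-based assertion so the same test body serves both Python versions.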
hyukjinkwon	71cfba04ae	[SPARK-23319][TESTS] Explicitly specify Pandas and PyArrow versions in PySpark tests (to skip or test) ## What changes were proposed in this pull request? This PR proposes to explicitly specify Pandas and PyArrow versions in PySpark tests to skip or test. We declared the extra dependencies: `b8bfce51ab/python/setup.py (L204)` In case of PyArrow: Currently we only check if pyarrow is installed or not without checking the version. It already fails to run tests. For example, if PyArrow 0.7.0 is installed: ``` ====================================================================== ERROR: test_vectorized_udf_wrong_return_type (pyspark.sql.tests.ScalarPandasUDF) ---------------------------------------------------------------------- Traceback (most recent call last): File "/.../spark/python/pyspark/sql/tests.py", line 4019, in test_vectorized_udf_wrong_return_type f = pandas_udf(lambda x: x * 1.0, MapType(LongType(), LongType())) File "/.../spark/python/pyspark/sql/functions.py", line 2309, in pandas_udf return _create_udf(f=f, returnType=return_type, evalType=eval_type) File "/.../spark/python/pyspark/sql/udf.py", line 47, in _create_udf require_minimum_pyarrow_version() File "/.../spark/python/pyspark/sql/utils.py", line 132, in require_minimum_pyarrow_version "however, your version was %s." % pyarrow.__version__) ImportError: pyarrow >= 0.8.0 must be installed on calling Python process; however, your version was 0.7.0. ---------------------------------------------------------------------- Ran 33 tests in 8.098s FAILED (errors=33) ``` In case of Pandas: There are few tests for old Pandas which were tested only when Pandas version was lower, and I rewrote them to be tested when both Pandas version is lower and missing. ## How was this patch tested? Manually tested by modifying the condition: ``` test_createDataFrame_column_name_encoding (pyspark.sql.tests.ArrowTests) ... skipped 'Pandas >= 1.19.2 must be installed; however, your version was 0.19.2.' 
test_createDataFrame_does_not_modify_input (pyspark.sql.tests.ArrowTests) ... skipped 'Pandas >= 1.19.2 must be installed; however, your version was 0.19.2.' test_createDataFrame_respect_session_timezone (pyspark.sql.tests.ArrowTests) ... skipped 'Pandas >= 1.19.2 must be installed; however, your version was 0.19.2.' ``` ``` test_createDataFrame_column_name_encoding (pyspark.sql.tests.ArrowTests) ... skipped 'Pandas >= 0.19.2 must be installed; however, it was not found.' test_createDataFrame_does_not_modify_input (pyspark.sql.tests.ArrowTests) ... skipped 'Pandas >= 0.19.2 must be installed; however, it was not found.' test_createDataFrame_respect_session_timezone (pyspark.sql.tests.ArrowTests) ... skipped 'Pandas >= 0.19.2 must be installed; however, it was not found.' ``` ``` test_createDataFrame_column_name_encoding (pyspark.sql.tests.ArrowTests) ... skipped 'PyArrow >= 1.8.0 must be installed; however, your version was 0.8.0.' test_createDataFrame_does_not_modify_input (pyspark.sql.tests.ArrowTests) ... skipped 'PyArrow >= 1.8.0 must be installed; however, your version was 0.8.0.' test_createDataFrame_respect_session_timezone (pyspark.sql.tests.ArrowTests) ... skipped 'PyArrow >= 1.8.0 must be installed; however, your version was 0.8.0.' ``` ``` test_createDataFrame_column_name_encoding (pyspark.sql.tests.ArrowTests) ... skipped 'PyArrow >= 0.8.0 must be installed; however, it was not found.' test_createDataFrame_does_not_modify_input (pyspark.sql.tests.ArrowTests) ... skipped 'PyArrow >= 0.8.0 must be installed; however, it was not found.' test_createDataFrame_respect_session_timezone (pyspark.sql.tests.ArrowTests) ... skipped 'PyArrow >= 0.8.0 must be installed; however, it was not found.' ``` Author: hyukjinkwon <gurwls223@gmail.com> Closes #20487 from HyukjinKwon/pyarrow-pandas-skip.	2018-02-07 23:28:10 +09:00
gatorsmile	9775df67f9	[SPARK-23122][PYSPARK][FOLLOWUP] Replace registerTempTable by createOrReplaceTempView ## What changes were proposed in this pull request? Replace `registerTempTable` by `createOrReplaceTempView`. ## How was this patch tested? N/A Author: gatorsmile <gatorsmile@gmail.com> Closes #20523 from gatorsmile/updateExamples.	2018-02-07 23:24:16 +09:00
gatorsmile	c36fecc3b4	[SPARK-23327][SQL] Update the description and tests of three external API or functions ## What changes were proposed in this pull request? Update the description and tests of three external API or functions `createFunction`, `length` and `repartitionByRange` ## How was this patch tested? N/A Author: gatorsmile <gatorsmile@gmail.com> Closes #20495 from gatorsmile/updateFunc.	2018-02-06 16:46:43 -08:00
Li Jin	caf3044563	[MINOR][TEST] Fix class name for Pandas UDF tests ## What changes were proposed in this pull request? In `b2ce17b4c9`, I mistakenly renamed `VectorizedUDFTests` to `ScalarPandasUDF`. This PR fixes the mistake. ## How was this patch tested? Existing tests. Author: Li Jin <ice.xelloss@gmail.com> Closes #20489 from icexelloss/fix-scalar-udf-tests.	2018-02-06 12:30:04 -08:00
Takuya UESHIN	63c5bf13ce	[SPARK-23334][SQL][PYTHON] Fix pandas_udf with return type StringType() to handle str type properly in Python 2. ## What changes were proposed in this pull request? In Python 2, execution fails when a `pandas_udf` returns string values created in the UDF with `".."`. E.g., ```python from pyspark.sql.functions import pandas_udf, col import pandas as pd df = spark.range(10) str_f = pandas_udf(lambda x: pd.Series(["%s" % i for i in x]), "string") df.select(str_f(col('id'))).show() ``` raises the following exception: ``` ... java.lang.AssertionError: assertion failed: Invalid schema from pandas_udf: expected StringType, got BinaryType at scala.Predef$.assert(Predef.scala:170) at org.apache.spark.sql.execution.python.ArrowEvalPythonExec$$anon$2.<init>(ArrowEvalPythonExec.scala:93) ... ``` It seems that pyarrow ignores the `type` parameter of `pa.Array.from_pandas()` and treats the data as binary type when the type is string type and the string values are `str` instead of `unicode` in Python 2. This PR adds a workaround for that case. ## How was this patch tested? Added a test and existing tests. Author: Takuya UESHIN <ueshin@databricks.com> Closes #20507 from ueshin/issues/SPARK-23334.	2018-02-06 18:30:50 +09:00
Takuya UESHIN	a24c03138a	[SPARK-23290][SQL][PYTHON] Use datetime.date for date type when converting Spark DataFrame to Pandas DataFrame. ## What changes were proposed in this pull request? In #18664, there was a change in how `DateType` is being returned to users ([line 1968 in dataframe.py](https://github.com/apache/spark/pull/18664/files#diff-6fc344560230bf0ef711bb9b5573f1faR1968)). This can cause client code which works in Spark 2.2 to fail. See [SPARK-23290](https://issues.apache.org/jira/browse/SPARK-23290?focusedCommentId=16350917&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16350917) for an example. This pr modifies to use `datetime.date` for date type as Spark 2.2 does. ## How was this patch tested? Tests modified to fit the new behavior and existing tests. Author: Takuya UESHIN <ueshin@databricks.com> Closes #20506 from ueshin/issues/SPARK-23290.	2018-02-06 14:52:25 +08:00
hyukjinkwon	551dff2bcc	[SPARK-21658][SQL][PYSPARK] Revert "[] Add default None for value in na.replace in PySpark" This reverts commit `0fcde87aad`. See the discussion in [SPARK-21658](https://issues.apache.org/jira/browse/SPARK-21658), [SPARK-19454](https://issues.apache.org/jira/browse/SPARK-19454) and https://github.com/apache/spark/pull/16793 Author: hyukjinkwon <gurwls223@gmail.com> Closes #20496 from HyukjinKwon/revert-SPARK-21658.	2018-02-03 10:40:21 -08:00
Takuya UESHIN	07cee33736	[SPARK-22274][PYTHON][SQL][FOLLOWUP] Use `assertRaisesRegexp` instead of `assertRaisesRegex`. ## What changes were proposed in this pull request? This is a follow-up pr of #19872 which uses `assertRaisesRegex` but it doesn't exist in Python 2, so some tests fail when running tests in Python 2 environment. Unfortunately, we missed it because currently Python 2 environment of the pr builder doesn't have proper versions of pandas or pyarrow, so the tests were skipped. This pr modifies to use `assertRaisesRegexp` instead of `assertRaisesRegex`. ## How was this patch tested? Tested manually in my local environment. Author: Takuya UESHIN <ueshin@databricks.com> Closes #20467 from ueshin/issues/SPARK-22274/fup1.	2018-01-31 22:26:27 -08:00
Henry Robinson	f470df2fcf	[SPARK-23157][SQL][FOLLOW-UP] DataFrame -> SparkDataFrame in R comment Author: Henry Robinson <henry@cloudera.com> Closes #20443 from henryr/SPARK-23157.	2018-02-01 11:15:17 +09:00
jerryshao	3d0911bbe4	[SPARK-23228][PYSPARK] Add Python Created jsparkSession to JVM's defaultSession ## What changes were proposed in this pull request? In the current PySpark code, the Python-created `jsparkSession` is not added to the JVM's `defaultSession`, so this `SparkSession` object cannot be fetched from the Java side and the Scala code below fails when loaded in a PySpark application. ```scala class TestSparkSession extends SparkListener with Logging { override def onOtherEvent(event: SparkListenerEvent): Unit = { event match { case CreateTableEvent(db, table) => val session = SparkSession.getActiveSession.orElse(SparkSession.getDefaultSession) assert(session.isDefined) val tableInfo = session.get.sharedState.externalCatalog.getTable(db, table) logInfo(s"Table info ${tableInfo}") case e => logInfo(s"event $e") } } } ``` This PR proposes to add the freshly created `jsparkSession` to `defaultSession`. ## How was this patch tested? Manual verification. Author: jerryshao <sshao@hortonworks.com> Author: hyukjinkwon <gurwls223@gmail.com> Author: Saisai Shao <sai.sai.shao@gmail.com> Closes #20404 from jerryshao/SPARK-23228.	2018-01-31 20:04:51 +09:00
gatorsmile	7a2ada223e	[SPARK-23261][PYSPARK] Rename Pandas UDFs ## What changes were proposed in this pull request? Rename the public APIs and names of pandas udfs. - `PANDAS SCALAR UDF` -> `SCALAR PANDAS UDF` - `PANDAS GROUP MAP UDF` -> `GROUPED MAP PANDAS UDF` - `PANDAS GROUP AGG UDF` -> `GROUPED AGG PANDAS UDF` ## How was this patch tested? The existing tests Author: gatorsmile <gatorsmile@gmail.com> Closes #20428 from gatorsmile/renamePandasUDFs.	2018-01-30 21:55:55 +09:00
Henry Robinson	8b983243e4	[SPARK-23157][SQL] Explain restriction on column expression in withColumn() ## What changes were proposed in this pull request? It's not obvious from the comments that any added column must be a function of the dataset that we are adding it to. Add a comment to that effect to Scala, Python and R Data* methods. Author: Henry Robinson <henry@cloudera.com> Closes #20429 from henryr/SPARK-23157.	2018-01-29 22:19:59 -08:00
hyukjinkwon	3227d14feb	[SPARK-23233][PYTHON] Reset the cache in asNondeterministic to set deterministic properly ## What changes were proposed in this pull request? Reproducer: ```python from pyspark.sql.functions import udf f = udf(lambda x: x) spark.range(1).select(f("id")) # cache JVM UDF instance. f = f.asNondeterministic() spark.range(1).select(f("id"))._jdf.logicalPlan().projectList().head().deterministic() ``` It should return `False` but the current master returns `True`. Seems it's because we cache the JVM UDF instance and then we reuse it even after setting `deterministic` disabled once it's called. ## How was this patch tested? Manually tested. I am not sure if I should add the test with a lot of JVM accesses with the internal stuff .. Let me know if anyone feels so. I will add. Author: hyukjinkwon <gurwls223@gmail.com> Closes #20409 from HyukjinKwon/SPARK-23233.	2018-01-27 11:26:09 -08:00
Huaxin Gao	8480c0c576	[SPARK-23081][PYTHON] Add colRegex API to PySpark ## What changes were proposed in this pull request? Add colRegex API to PySpark ## How was this patch tested? add a test in sql/tests.py Author: Huaxin Gao <huaxing@us.ibm.com> Closes #20390 from huaxingao/spark-23081.	2018-01-26 07:50:48 +09:00
Liang-Chi Hsieh	a3911cf896	[SPARK-23177][SQL][PYSPARK] Extract zero-parameter UDFs from aggregate ## What changes were proposed in this pull request? We extract Python UDFs in logical aggregate which depends on aggregate expression or grouping key in ExtractPythonUDFFromAggregate rule. But Python UDFs which don't depend on above expressions should also be extracted to avoid the issue reported in the JIRA. A small code snippet to reproduce that issue looks like: ```python import pyspark.sql.functions as f df = spark.createDataFrame([(1,2), (3,4)]) f_udf = f.udf(lambda: str("const_str")) df2 = df.distinct().withColumn("a", f_udf()) df2.show() ``` Error exception is raised as: ``` : org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: pythonUDF0#50 at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:91) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:90) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272) at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256) at org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:90) at org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$38.apply(HashAggregateExec.scala:514) at org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$38.apply(HashAggregateExec.scala:513) ``` This exception raises because `HashAggregateExec` tries to bind the aliased Python UDF expression (e.g., `pythonUDF0#50 AS a#44`) to grouping key. ## How was this patch tested? Added test. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #20360 from viirya/SPARK-23177.	2018-01-24 11:43:48 +09:00
Li Jin	b2ce17b4c9	[SPARK-22274][PYTHON][SQL] User-defined aggregation functions with pandas udf (full shuffle) ## What changes were proposed in this pull request? Add support for using pandas UDFs with groupby().agg(). This PR introduces a new type of pandas UDF - group aggregate pandas UDF. This type of UDF defines a transformation of multiple pandas Series -> a scalar value. Group aggregate pandas UDFs can be used with groupby().agg(). Note group aggregate pandas UDF doesn't support partial aggregation, i.e., a full shuffle is required. This PR doesn't support group aggregate pandas UDFs that return ArrayType, StructType or MapType. Support for these types is left for future PR. ## How was this patch tested? GroupbyAggPandasUDFTests Author: Li Jin <ice.xelloss@gmail.com> Closes #19872 from icexelloss/SPARK-22274-groupby-agg.	2018-01-23 14:11:30 +09:00
gatorsmile	73281161fc	[SPARK-23122][PYSPARK][FOLLOW-UP] Update the docs for UDF Registration ## What changes were proposed in this pull request? This PR is to update the docs for UDF registration ## How was this patch tested? N/A Author: gatorsmile <gatorsmile@gmail.com> Closes #20348 from gatorsmile/testUpdateDoc.	2018-01-22 04:27:59 -08:00
Takuya UESHIN	568055da93	[SPARK-23054][SQL][PYSPARK][FOLLOWUP] Use sqlType casting when casting PythonUserDefinedType to String. ## What changes were proposed in this pull request? This is a follow-up of #20246. If a UDT in Python doesn't have its corresponding Scala UDT, cast to string will be the raw string of the internal value, e.g. `"org.apache.spark.sql.catalyst.expressions.UnsafeArrayDataxxxxxxxx"` if the internal type is `ArrayType`. This pr fixes it by using its `sqlType` casting. ## How was this patch tested? Added a test and existing tests. Author: Takuya UESHIN <ueshin@databricks.com> Closes #20306 from ueshin/issues/SPARK-23054/fup1.	2018-01-19 11:37:08 +08:00
Tathagata Das	2d41f040a3	[SPARK-23143][SS][PYTHON] Added python API for setting continuous trigger ## What changes were proposed in this pull request? Self-explanatory. ## How was this patch tested? New python tests. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #20309 from tdas/SPARK-23143.	2018-01-18 12:25:52 -08:00
Takuya UESHIN	5063b74811	[SPARK-23141][SQL][PYSPARK] Support data type string as a returnType for registerJavaFunction. ## What changes were proposed in this pull request? Currently `UDFRegistration.registerJavaFunction` doesn't support data type string as a `returnType` whereas `UDFRegistration.register`, `udf`, or `pandas_udf` does. We can support it for `UDFRegistration.registerJavaFunction` as well. ## How was this patch tested? Added a doctest and existing tests. Author: Takuya UESHIN <ueshin@databricks.com> Closes #20307 from ueshin/issues/SPARK-23141.	2018-01-18 22:33:04 +09:00
hyukjinkwon	39d244d921	[SPARK-23122][PYTHON][SQL] Deprecate register* for UDFs in SQLContext and Catalog in PySpark ## What changes were proposed in this pull request? This PR proposes to deprecate `register` for UDFs in `SQLContext` and `Catalog` in Spark 2.3.0. These are inconsistent with the Scala / Java APIs and do basically the same thing as `spark.udf.register`. Also, this PR moves the logic from `[sqlContext\|spark.catalog].register` to `spark.udf.register` and reuses the docstring. This PR also handles minor doc corrections. It also includes https://github.com/apache/spark/pull/20158 ## How was this patch tested? Manually tested, manually checked the API documentation and tests added to check if deprecated APIs call the aliases correctly. Author: hyukjinkwon <gurwls223@gmail.com> Closes #20288 from HyukjinKwon/deprecate-udf.	2018-01-18 14:51:05 +09:00
Henry Robinson	1f3d933e0b	[SPARK-23062][SQL] Improve EXCEPT documentation ## What changes were proposed in this pull request? Make the default behavior of EXCEPT (i.e. EXCEPT DISTINCT) more explicit in the documentation, and call out the change in behavior from 1.x. Author: Henry Robinson <henry@cloudera.com> Closes #20254 from henryr/spark-23062.	2018-01-17 16:01:41 +08:00
gatorsmile	b85eb946ac	[SPARK-22978][PYSPARK] Register Vectorized UDFs for SQL Statement ## What changes were proposed in this pull request? Register Vectorized UDFs for SQL Statement. For example, ```Python >>> from pyspark.sql.functions import pandas_udf, PandasUDFType >>> @pandas_udf("integer", PandasUDFType.SCALAR) ... def add_one(x): ... return x + 1 ... >>> _ = spark.udf.register("add_one", add_one) >>> spark.sql("SELECT add_one(id) FROM range(3)").collect() [Row(add_one(id)=1), Row(add_one(id)=2), Row(add_one(id)=3)] ``` ## How was this patch tested? Added test cases Author: gatorsmile <gatorsmile@gmail.com> Closes #20171 from gatorsmile/supportVectorizedUDF.	2018-01-16 20:20:33 +09:00
Takeshi Yamamuro	b59808385c	[SPARK-23023][SQL] Cast field data to strings in showString ## What changes were proposed in this pull request? The current `Dataset.showString` prints rows through `RowEncoder` deserializers, like this: ``` scala> Seq(Seq(Seq(1, 2), Seq(3), Seq(4, 5, 6))).toDF("a").show(false) +------------------------------------------------------------+ \|a \| +------------------------------------------------------------+ \|[WrappedArray(1, 2), WrappedArray(3), WrappedArray(4, 5, 6)]\| +------------------------------------------------------------+ ``` This result is incorrect; the correct output is: ``` scala> Seq(Seq(Seq(1, 2), Seq(3), Seq(4, 5, 6))).toDF("a").show(false) +------------------------+ \|a \| +------------------------+ \|[[1, 2], [3], [4, 5, 6]]\| +------------------------+ ``` So, this pr fixes the code in `showString` to cast field data to strings before printing. ## How was this patch tested? Added tests in `DataFrameSuite`. Author: Takeshi Yamamuro <yamamuro@apache.org> Closes #20214 from maropu/SPARK-23023.	2018-01-15 16:26:52 +08:00
hyukjinkwon	cd9f49a2ae	[SPARK-22980][PYTHON][SQL] Clarify the length of each series is of each batch within scalar Pandas UDF ## What changes were proposed in this pull request? This PR proposes to add a note saying that the length of a scalar Pandas UDF's `Series` is that of the batch, not of the whole input column. We are fine for a group map UDF because the usage is different from our typical UDF but scalar UDFs might cause confusion with the normal UDF. For example, please consider this example: ```python from pyspark.sql.functions import pandas_udf, col, lit from pyspark.sql.types import LongType df = spark.range(1) f = pandas_udf(lambda x, y: len(x) + y, LongType()) df.select(f(lit('text'), col('id'))).show() ``` ``` +------------------+ \|<lambda>(text, id)\| +------------------+ \| 1\| +------------------+ ``` ```python from pyspark.sql.functions import udf, col, lit df = spark.range(1) f = udf(lambda x, y: len(x) + y, "long") df.select(f(lit('text'), col('id'))).show() ``` ``` +------------------+ \|<lambda>(text, id)\| +------------------+ \| 4\| +------------------+ ``` ## How was this patch tested? Manually built the doc and checked the output. Author: hyukjinkwon <gurwls223@gmail.com> Closes #20237 from HyukjinKwon/SPARK-22980.	2018-01-13 16:13:44 +09:00
Bryan Cutler	e599837248	[SPARK-23009][PYTHON] Fix for non-str col names to createDataFrame from Pandas ## What changes were proposed in this pull request? This fixes the case when calling `SparkSession.createDataFrame` using a Pandas DataFrame that has non-str column labels. The column name conversion logic to handle non-string or unicode in python2 is: ``` if column is not any type of string: name = str(column) else if column is unicode in Python 2: name = column.encode('utf-8') ``` ## How was this patch tested? Added a new test with a Pandas DataFrame that has int column labels Author: Bryan Cutler <cutlerb@gmail.com> Closes #20210 from BryanCutler/python-createDataFrame-int-col-error-SPARK-23009.	2018-01-10 14:55:24 +09:00
Bryan Cutler	7bcc266681	[SPARK-23018][PYTHON] Fix createDataFrame from Pandas timestamp series assignment ## What changes were proposed in this pull request? This fixes createDataFrame from Pandas to only assign modified timestamp series back to a copied version of the Pandas DataFrame. Previously, if the Pandas DataFrame was only a reference (e.g. a slice of another) each series will still get assigned back to the reference even if it is not a modified timestamp column. This caused the following warning "SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame." ## How was this patch tested? existing tests Author: Bryan Cutler <cutlerb@gmail.com> Closes #20213 from BryanCutler/pyspark-createDataFrame-copy-slice-warn-SPARK-23018.	2018-01-10 14:00:07 +09:00
Guilherme Berger	3e40eb3f1f	[SPARK-22566][PYTHON] Better error message for `_merge_type` in Pandas to Spark DF conversion ## What changes were proposed in this pull request? It provides a better error message when doing `spark_session.createDataFrame(pandas_df)` with no schema and an error occurs in the schema inference due to incompatible types. The Pandas column names are propagated down and the error message mentions which column had the merging error. https://issues.apache.org/jira/browse/SPARK-22566 ## How was this patch tested? Manually in the `./bin/pyspark` console, and with new tests: `./python/run-tests` <img width="873" alt="screen shot 2017-11-21 at 13 29 49" src="https://user-images.githubusercontent.com/3977115/33080121-382274e0-cecf-11e7-808f-057a65bb7b00.png"> I state that the contribution is my original work and that I license the work to the Apache Spark project under the project’s open source license. Author: Guilherme Berger <gberger@palantir.com> Closes #19792 from gberger/master.	2018-01-08 14:32:05 +09:00
hyukjinkwon	993f21567a	[SPARK-22901][PYTHON][FOLLOWUP] Adds the doc for asNondeterministic for wrapped UDF function ## What changes were proposed in this pull request? This PR wraps the `asNondeterministic` attribute in the wrapped UDF function to set the docstring properly. ```python from pyspark.sql.functions import udf help(udf(lambda x: x).asNondeterministic) ``` Before: ``` Help on function <lambda> in module pyspark.sql.udf: <lambda> lambda (END ``` After: ``` Help on function asNondeterministic in module pyspark.sql.udf: asNondeterministic() Updates UserDefinedFunction to nondeterministic. .. versionadded:: 2.3 (END) ``` ## How was this patch tested? Manually tested and a simple test was added. Author: hyukjinkwon <gurwls223@gmail.com> Closes #20173 from HyukjinKwon/SPARK-22901-followup.	2018-01-06 23:08:26 +08:00
Li Jin	f2dd8b9237	[SPARK-22930][PYTHON][SQL] Improve the description of Vectorized UDFs for non-deterministic cases ## What changes were proposed in this pull request? Add tests for using non-deterministic UDFs in aggregate. Update pandas_udf docstring with respect to determinism. ## How was this patch tested? test_nondeterministic_udf_in_aggregate Author: Li Jin <ice.xelloss@gmail.com> Closes #20142 from icexelloss/SPARK-22930-pandas-udf-deterministic.	2018-01-06 16:11:20 +08:00
gatorsmile	5aadbc929c	[SPARK-22939][PYSPARK] Support Spark UDF in registerFunction ## What changes were proposed in this pull request? ```Python import random from pyspark.sql.functions import udf from pyspark.sql.types import IntegerType, StringType random_udf = udf(lambda: int(random.random() * 100), IntegerType()).asNondeterministic() spark.catalog.registerFunction("random_udf", random_udf, StringType()) spark.sql("SELECT random_udf()").collect() ``` We will get the following error. ``` Py4JError: An error occurred while calling o29.__getnewargs__. Trace: py4j.Py4JException: Method __getnewargs__([]) does not exist at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318) at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326) at py4j.Gateway.invoke(Gateway.java:274) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:214) at java.lang.Thread.run(Thread.java:745) ``` This PR is to support it. ## How was this patch tested? WIP Author: gatorsmile <gatorsmile@gmail.com> Closes #20137 from gatorsmile/registerFunction.	2018-01-04 21:07:31 +08:00
Felix Cheung	df95a908ba	[SPARK-22933][SPARKR] R Structured Streaming API for withWatermark, trigger, partitionBy ## What changes were proposed in this pull request? R Structured Streaming API for withWatermark, trigger, partitionBy ## How was this patch tested? manual, unit tests Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #20129 from felixcheung/rwater.	2018-01-03 21:43:14 -08:00
Bryan Cutler	1c9f95cb77	[SPARK-22530][PYTHON][SQL] Adding Arrow support for ArrayType ## What changes were proposed in this pull request? This change adds `ArrayType` support for working with Arrow in pyspark when creating a DataFrame, calling `toPandas()`, and using vectorized `pandas_udf`. ## How was this patch tested? Added new Python unit tests using Array data. Author: Bryan Cutler <cutlerb@gmail.com> Closes #20114 from BryanCutler/arrow-ArrayType-support-SPARK-22530.	2018-01-02 07:13:27 +09:00
Takeshi Yamamuro	f2b3525c17	[SPARK-22771][SQL] Concatenate binary inputs into a binary output ## What changes were proposed in this pull request? This pr modified `concat` to concat binary inputs into a single binary output. `concat` in the current master always output data as a string. But, in some databases (e.g., PostgreSQL), if all inputs are binary, `concat` also outputs binary. ## How was this patch tested? Added tests in `SQLQueryTestSuite` and `TypeCoercionSuite`. Author: Takeshi Yamamuro <yamamuro@apache.org> Closes #19977 from maropu/SPARK-22771.	2017-12-30 14:09:56 +08:00
Takuya UESHIN	11a849b3a7	[SPARK-22370][SQL][PYSPARK][FOLLOW-UP] Fix a test failure when xmlrunner is installed. ## What changes were proposed in this pull request? This is a follow-up pr of #19587. If `xmlrunner` is installed, `VectorizedUDFTests.test_vectorized_udf_check_config` fails by the following error because the `self` which is a subclass of `unittest.TestCase` in the UDF `check_records_per_batch` can't be pickled anymore. ``` PicklingError: Cannot pickle files that are not opened for reading: w ``` This changes the UDF not to refer the `self`. ## How was this patch tested? Tested locally. Author: Takuya UESHIN <ueshin@databricks.com> Closes #20115 from ueshin/issues/SPARK-22370_fup1.	2017-12-29 23:04:28 +09:00
soonmok-kwon	ffe6fd77a4	[SPARK-22818][SQL] csv escape of quote escape ## What changes were proposed in this pull request? Escape of escape should be considered when using the UniVocity csv encoding/decoding library. Ref: https://github.com/uniVocity/univocity-parsers#escaping-quote-escape-characters One option is added for reading and writing CSV: `escapeQuoteEscaping` ## How was this patch tested? Unit test added. Author: soonmok-kwon <soonmok.kwon@navercorp.com> Closes #20004 from ep1804/SPARK-22818.	2017-12-29 07:30:06 +08:00
Marco Gaido	ff48b1b338	[SPARK-22901][PYTHON] Add deterministic flag to pyspark UDF ## What changes were proposed in this pull request? In SPARK-20586 the flag `deterministic` was added to Scala UDF, but it is not available for python UDF. This flag is useful for cases when the UDF's code can return different result with the same input. Due to optimization, duplicate invocations may be eliminated or the function may even be invoked more times than it is present in the query. This can lead to unexpected behavior. This PR adds the deterministic flag, via the `asNondeterministic` method, to let the user mark the function as non-deterministic and therefore avoid the optimizations which might lead to strange behaviors. ## How was this patch tested? Manual tests: ``` >>> from pyspark.sql.functions import * >>> from pyspark.sql.types import * >>> df_br = spark.createDataFrame([{'name': 'hello'}]) >>> import random >>> udf_random_col = udf(lambda: int(100*random.random()), IntegerType()).asNondeterministic() >>> df_br = df_br.withColumn('RAND', udf_random_col()) >>> random.seed(1234) >>> udf_add_ten = udf(lambda rand: rand + 10, IntegerType()) >>> df_br.withColumn('RAND_PLUS_TEN', udf_add_ten('RAND')).show() +-----+----+-------------+ \| name\|RAND\|RAND_PLUS_TEN\| +-----+----+-------------+ \|hello\| 3\| 13\| +-----+----+-------------+ ``` Author: Marco Gaido <marcogaido91@gmail.com> Author: Marco Gaido <mgaido@hortonworks.com> Closes #19929 from mgaido91/SPARK-22629.	2017-12-26 06:39:40 -08:00
Takuya UESHIN	eb386be1ed	[SPARK-21552][SQL] Add DecimalType support to ArrowWriter. ## What changes were proposed in this pull request? Decimal type is not yet supported in `ArrowWriter`. This is adding the decimal type support. ## How was this patch tested? Added a test to `ArrowConvertersSuite`. Author: Takuya UESHIN <ueshin@databricks.com> Closes #18754 from ueshin/issues/SPARK-21552.	2017-12-26 21:37:25 +09:00
Takuya UESHIN	12d20dd75b	[SPARK-22874][PYSPARK][SQL][FOLLOW-UP] Modify error messages to show actual versions. ## What changes were proposed in this pull request? This is a follow-up pr of #20054 modifying error messages for both pandas and pyarrow to show actual versions. ## How was this patch tested? Existing tests. Author: Takuya UESHIN <ueshin@databricks.com> Closes #20074 from ueshin/issues/SPARK-22874_fup1.	2017-12-25 20:29:10 +09:00
Takuya UESHIN	13190a4f60	[SPARK-22874][PYSPARK][SQL] Modify checking pandas version to use LooseVersion. ## What changes were proposed in this pull request? Currently we check pandas version by capturing if `ImportError` for the specific imports is raised or not but we can compare `LooseVersion` of the version strings as the same as we're checking pyarrow version. ## How was this patch tested? Existing tests. Author: Takuya UESHIN <ueshin@databricks.com> Closes #20054 from ueshin/issues/SPARK-22874.	2017-12-22 20:09:51 +09:00


735 commits