ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Dongjoon Hyun	14869ae64e	[SPARK-14639] [PYTHON] [R] Add `bround` function in Python/R. ## What changes were proposed in this pull request? This issue aims to expose Scala `bround` function in Python/R API. `bround` function is implemented in SPARK-14614 by extending current `round` function. We used the following semantics from Hive. ```java public static double bround(double input, int scale) { if (Double.isNaN(input) \|\| Double.isInfinite(input)) { return input; } return BigDecimal.valueOf(input).setScale(scale, RoundingMode.HALF_EVEN).doubleValue(); } ``` After this PR, `pyspark` and `sparkR` also support `bround` function. PySpark ```python >>> from pyspark.sql.functions import bround >>> sqlContext.createDataFrame([(2.5,)], ['a']).select(bround('a', 0).alias('r')).collect() [Row(r=2.0)] ``` SparkR ```r > df = createDataFrame(sqlContext, data.frame(x = c(2.5, 3.5))) > head(collect(select(df, bround(df$x, 0)))) bround(x, 0) 1 2 2 4 ``` ## How was this patch tested? Pass the Jenkins tests (including new testcases). Author: Dongjoon Hyun <dongjoon@apache.org> Closes #12509 from dongjoon-hyun/SPARK-14639.	2016-04-19 22:28:11 -07:00
felixcheung	3664142350	[SPARK-14717] [PYTHON] Scala, Python APIs for Dataset.unpersist differ in default blocking value ## What changes were proposed in this pull request? Change unpersist blocking parameter default value to match Scala ## How was this patch tested? unit tests, manual tests jkbradley davies Author: felixcheung <felixcheung_m@hotmail.com> Closes #12507 from felixcheung/pyunpersist.	2016-04-19 17:29:28 -07:00
Holden Karau	478af2f455	[SPARK-14573][PYSPARK][BUILD] Fix PyDoc Makefile & highlighting issues ## What changes were proposed in this pull request? The PyDoc Makefile used "=" rather than "?=" for setting env variables so it overwrote the user values. This ignored the environment variables we set for linting allowing warnings through. This PR also fixes the warnings that had been introduced. ## How was this patch tested? manual local export & make Author: Holden Karau <holden@us.ibm.com> Closes #12336 from holdenk/SPARK-14573-fix-pydoc-makefile.	2016-04-14 09:42:15 +01:00
Burak Yavuz	9ee5c25717	[SPARK-14353] Dataset Time Window `window` API for Python, and SQL ## What changes were proposed in this pull request? The `window` function was added to Dataset with [this PR](https://github.com/apache/spark/pull/12008). This PR adds the Python, and SQL, API for this function. With this PR, SQL, Java, and Scala will share the same APIs as in users can use: - `window(timeColumn, windowDuration)` - `window(timeColumn, windowDuration, slideDuration)` - `window(timeColumn, windowDuration, slideDuration, startTime)` In Python, users can access all APIs above, but in addition they can do - In Python: `window(timeColumn, windowDuration, startTime=...)` that is, they can provide the startTime without providing the `slideDuration`. In this case, we will generate tumbling windows. ## How was this patch tested? Unit tests + manual tests Author: Burak Yavuz <brkyvz@gmail.com> Closes #12136 from brkyvz/python-windows.	2016-04-05 13:18:39 -07:00
Davies Liu	cc70f17416	[SPARK-14334] [SQL] add toLocalIterator for Dataset/DataFrame ## What changes were proposed in this pull request? RDD.toLocalIterator() could be used to fetch one partition at a time to reduce the memory usage. Right now, for Dataset/Dataframe we have to use df.rdd.toLocalIterator, which is super slow also requires lots of memory (because of the Java serializer or even Kyro serializer). This PR introduce an optimized toLocalIterator for Dataset/DataFrame, which is much faster and requires much less memory. For a partition with 5 millions rows, `df.rdd.toIterator` took about 100 seconds, but df.toIterator took less than 7 seconds. For 10 millions row, rdd.toIterator will crash (not enough memory) with 4G heap, but df.toLocalIterator could finished in 12 seconds. The JDBC server has been updated to use DataFrame.toIterator. ## How was this patch tested? Existing tests. Author: Davies Liu <davies@databricks.com> Closes #12114 from davies/local_iterator.	2016-04-04 13:31:44 -07:00
Davies Liu	5743c6476d	[SPARK-12981] [SQL] extract Pyhton UDF in physical plan ## What changes were proposed in this pull request? Currently we extract Python UDFs into a special logical plan EvaluatePython in analyzer, But EvaluatePython is not part of catalyst, many rules have no knowledge of it , which will break many things (for example, filter push down or column pruning). We should treat Python UDFs as normal expressions, until we want to evaluate in physical plan, we could extract them in end of optimizer, or physical plan. This PR extract Python UDFs in physical plan. Closes #10935 ## How was this patch tested? Added regression tests. Author: Davies Liu <davies@databricks.com> Closes #12127 from davies/py_udf.	2016-04-04 10:56:26 -07:00
hyukjinkwon	2262a93358	[SPARK-14231] [SQL] JSON data source infers floating-point values as a double when they do not fit in a decimal ## What changes were proposed in this pull request? https://issues.apache.org/jira/browse/SPARK-14231 Currently, JSON data source supports to infer `DecimalType` for big numbers and `floatAsBigDecimal` option which reads floating-point values as `DecimalType`. But there are few restrictions in Spark `DecimalType` below: 1. The precision cannot be bigger than 38. 2. scale cannot be bigger than precision. Currently, both restrictions are not being handled. This PR handles the cases by inferring them as `DoubleType`. Also, the option name was changed from `floatAsBigDecimal` to `prefersDecimal` as suggested [here](https://issues.apache.org/jira/browse/SPARK-14231?focusedCommentId=15215579&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15215579). So, the codes below: ```scala def doubleRecords: RDD[String] = sqlContext.sparkContext.parallelize( s"""{"a": 1${"0" * 38}, "b": 0.01}""" :: s"""{"a": 2${"0" * 38}, "b": 0.02}""" :: Nil) val jsonDF = sqlContext.read .option("prefersDecimal", "true") .json(doubleRecords) jsonDF.printSchema() ``` produces below: - Before ```scala org.apache.spark.sql.AnalysisException: Decimal scale (2) cannot be greater than precision (1).; at org.apache.spark.sql.types.DecimalType.<init>(DecimalType.scala:44) at org.apache.spark.sql.execution.datasources.json.InferSchema$.org$apache$spark$sql$execution$datasources$json$InferSchema$$inferField(InferSchema.scala:144) at org.apache.spark.sql.execution.datasources.json.InferSchema$.org$apache$spark$sql$execution$datasources$json$InferSchema$$inferField(InferSchema.scala:108) at ... ``` - After ```scala root \|-- a: double (nullable = true) \|-- b: double (nullable = true) ``` ## How was this patch tested? Unit tests were used and `./dev/run_tests` for coding style tests. Author: hyukjinkwon <gurwls223@gmail.com> Closes #12030 from HyukjinKwon/SPARK-14231.	2016-04-02 23:12:04 -07:00
Davies Liu	f0afafdc5d	[SPARK-14267] [SQL] [PYSPARK] execute multiple Python UDFs within single batch ## What changes were proposed in this pull request? This PR support multiple Python UDFs within single batch, also improve the performance. ```python >>> from pyspark.sql.types import IntegerType >>> sqlContext.registerFunction("double", lambda x: x * 2, IntegerType()) >>> sqlContext.registerFunction("add", lambda x, y: x + y, IntegerType()) >>> sqlContext.sql("SELECT double(add(1, 2)), add(double(2), 1)").explain(True) == Parsed Logical Plan == 'Project [unresolvedalias('double('add(1, 2)), None),unresolvedalias('add('double(2), 1), None)] +- OneRowRelation$ == Analyzed Logical Plan == double(add(1, 2)): int, add(double(2), 1): int Project [double(add(1, 2))#14,add(double(2), 1)#15] +- Project [double(add(1, 2))#14,add(double(2), 1)#15] +- Project [pythonUDF0#16 AS double(add(1, 2))#14,pythonUDF0#18 AS add(double(2), 1)#15] +- EvaluatePython [add(pythonUDF1#17, 1)], [pythonUDF0#18] +- EvaluatePython [double(add(1, 2)),double(2)], [pythonUDF0#16,pythonUDF1#17] +- OneRowRelation$ == Optimized Logical Plan == Project [pythonUDF0#16 AS double(add(1, 2))#14,pythonUDF0#18 AS add(double(2), 1)#15] +- EvaluatePython [add(pythonUDF1#17, 1)], [pythonUDF0#18] +- EvaluatePython [double(add(1, 2)),double(2)], [pythonUDF0#16,pythonUDF1#17] +- OneRowRelation$ == Physical Plan == WholeStageCodegen : +- Project [pythonUDF0#16 AS double(add(1, 2))#14,pythonUDF0#18 AS add(double(2), 1)#15] : +- INPUT +- !BatchPythonEvaluation [add(pythonUDF1#17, 1)], [pythonUDF0#16,pythonUDF1#17,pythonUDF0#18] +- !BatchPythonEvaluation [double(add(1, 2)),double(2)], [pythonUDF0#16,pythonUDF1#17] +- Scan OneRowRelation[] ``` ## How was this patch tested? Added new tests. Using the following script to benchmark 1, 2 and 3 udfs, ``` df = sqlContext.range(1, 1 << 23, 1, 4) double = F.udf(lambda x: x * 2, LongType()) print df.select(double(df.id)).count() print df.select(double(df.id), double(df.id + 1)).count() print df.select(double(df.id), double(df.id + 1), double(df.id + 2)).count() ``` Here is the results: N \| Before \| After \| speed up ---- \|------------ \| -------------\|------ 1 \| 22 s \| 7 s \| 3.1X 2 \| 38 s \| 13 s \| 2.9X 3 \| 58 s \| 16 s \| 3.6X This benchmark ran locally with 4 CPUs. For 3 UDFs, it launched 12 Python before before this patch, 4 process after this patch. After this patch, it will use less memory for multiple UDFs than before (less buffering). Author: Davies Liu <davies@databricks.com> Closes #12057 from davies/multi_udfs.	2016-03-31 16:40:20 -07:00
Herman van Hovell	a9b93e0739	[SPARK-14211][SQL] Remove ANTLR3 based parser ### What changes were proposed in this pull request? This PR removes the ANTLR3 based parser, and moves the new ANTLR4 based parser into the `org.apache.spark.sql.catalyst.parser package`. ### How was this patch tested? Existing unit tests. cc rxin andrewor14 yhuai Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #12071 from hvanhovell/SPARK-14211.	2016-03-31 09:25:09 -07:00
Davies Liu	a7a93a116d	[SPARK-14215] [SQL] [PYSPARK] Support chained Python UDFs ## What changes were proposed in this pull request? This PR brings the support for chained Python UDFs, for example ```sql select udf1(udf2(a)) select udf1(udf2(a) + 3) select udf1(udf2(a) + udf3(b)) ``` Also directly chained unary Python UDFs are put in single batch of Python UDFs, others may require multiple batches. For example, ```python >>> sqlContext.sql("select double(double(1))").explain() == Physical Plan == WholeStageCodegen : +- Project [pythonUDF#10 AS double(double(1))#9] : +- INPUT +- !BatchPythonEvaluation double(double(1)), [pythonUDF#10] +- Scan OneRowRelation[] >>> sqlContext.sql("select double(double(1) + double(2))").explain() == Physical Plan == WholeStageCodegen : +- Project [pythonUDF#19 AS double((double(1) + double(2)))#16] : +- INPUT +- !BatchPythonEvaluation double((pythonUDF#17 + pythonUDF#18)), [pythonUDF#17,pythonUDF#18,pythonUDF#19] +- !BatchPythonEvaluation double(2), [pythonUDF#17,pythonUDF#18] +- !BatchPythonEvaluation double(1), [pythonUDF#17] +- Scan OneRowRelation[] ``` TODO: will support multiple unrelated Python UDFs in one batch (another PR). ## How was this patch tested? Added new unit tests for chained UDFs. Author: Davies Liu <davies@databricks.com> Closes #12014 from davies/py_udfs.	2016-03-29 15:06:29 -07:00
Herman van Hovell	600c0b69ca	[SPARK-13713][SQL] Migrate parser from ANTLR3 to ANTLR4 ### What changes were proposed in this pull request? The current ANTLR3 parser is quite complex to maintain and suffers from code blow-ups. This PR introduces a new parser that is based on ANTLR4. This parser is based on the [Presto's SQL parser](https://github.com/facebook/presto/blob/master/presto-parser/src/main/antlr4/com/facebook/presto/sql/parser/SqlBase.g4). The current implementation can parse and create Catalyst and SQL plans. Large parts of the HiveQl DDL and some of the DML functionality is currently missing, the plan is to add this in follow-up PRs. This PR is a work in progress, and work needs to be done in the following area's: - [x] Error handling should be improved. - [x] Documentation should be improved. - [x] Multi-Insert needs to be tested. - [ ] Naming and package locations. ### How was this patch tested? Catalyst and SQL unit tests. Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #11557 from hvanhovell/ngParser.	2016-03-28 12:31:12 -07:00
Wenchen Fan	43b15e01c4	[SPARK-14061][SQL] implement CreateMap ## What changes were proposed in this pull request? As we have `CreateArray` and `CreateStruct`, we should also have `CreateMap`. This PR adds the `CreateMap` expression, and the DataFrame API, and python API. ## How was this patch tested? various new tests. Author: Wenchen Fan <wenchen@databricks.com> Closes #11879 from cloud-fan/create_map.	2016-03-25 09:50:06 -07:00
Andrew Or	20ddf5fddf	[SPARK-14014][SQL] Integrate session catalog (attempt #2 ) ## What changes were proposed in this pull request? This reopens #11836, which was merged but promptly reverted because it introduced flaky Hive tests. ## How was this patch tested? See `CatalogTestCases`, `SessionCatalogSuite` and `HiveContextSuite`. Author: Andrew Or <andrew@databricks.com> Closes #11938 from andrewor14/session-catalog-again.	2016-03-24 22:59:35 -07:00
Reynold Xin	3619fec1ec	[SPARK-14142][SQL] Replace internal use of unionAll with union ## What changes were proposed in this pull request? unionAll has been deprecated in SPARK-14088. ## How was this patch tested? Should be covered by all existing tests. Author: Reynold Xin <rxin@databricks.com> Closes #11946 from rxin/SPARK-14142.	2016-03-24 22:34:55 -07:00
Andrew Or	c44d140cae	Revert "[SPARK-14014][SQL] Replace existing catalog with SessionCatalog" This reverts commit `5dfc01976b`.	2016-03-23 22:21:15 -07:00
Andrew Or	5dfc01976b	[SPARK-14014][SQL] Replace existing catalog with SessionCatalog ## What changes were proposed in this pull request? `SessionCatalog`, introduced in #11750, is a catalog that keeps track of temporary functions and tables, and delegates metastore operations to `ExternalCatalog`. This functionality overlaps a lot with the existing `analysis.Catalog`. As of this commit, `SessionCatalog` and `ExternalCatalog` will no longer be dead code. There are still things that need to be done after this patch, namely: - SPARK-14013: Properly implement temporary functions in `SessionCatalog` - SPARK-13879: Decide which DDL/DML commands to support natively in Spark - SPARK-?????: Implement the ones we do want to support through `SessionCatalog`. - SPARK-?????: Merge SQL/HiveContext ## How was this patch tested? This is largely a refactoring task so there are no new tests introduced. The particularly relevant tests are `SessionCatalogSuite` and `ExternalCatalogSuite`. Author: Andrew Or <andrew@databricks.com> Author: Yin Huai <yhuai@databricks.com> Closes #11836 from andrewor14/use-session-catalog.	2016-03-23 13:34:22 -07:00
Reynold Xin	926a93e54b	[SPARK-14088][SQL] Some Dataset API touch-up ## What changes were proposed in this pull request? 1. Deprecated unionAll. It is pretty confusing to have both "union" and "unionAll" when the two do the same thing in Spark but are different in SQL. 2. Rename reduce in KeyValueGroupedDataset to reduceGroups so it is more consistent with rest of the functions in KeyValueGroupedDataset. Also makes it more obvious what "reduce" and "reduceGroups" mean. Previously it was confusing because it could be reducing a Dataset, or just reducing groups. 3. Added a "name" function, which is more natural to name columns than "as" for non-SQL users. 4. Remove "subtract" function since it is just an alias for "except". ## How was this patch tested? All changes should be covered by existing tests. Also added couple test cases to cover "name". Author: Reynold Xin <rxin@databricks.com> Closes #11908 from rxin/SPARK-14088.	2016-03-22 23:43:09 -07:00
hyukjinkwon	4e09a0d5ea	[SPARK-13953][SQL] Specifying the field name for corrupted record via option at JSON datasource ## What changes were proposed in this pull request? https://issues.apache.org/jira/browse/SPARK-13953 Currently, JSON data source creates a new field in `PERMISSIVE` mode for storing malformed string. This field can be renamed via `spark.sql.columnNameOfCorruptRecord` option but it is a global configuration. This PR make that option can be applied per read and can be specified via `option()`. This will overwrites `spark.sql.columnNameOfCorruptRecord` if it is set. ## How was this patch tested? Unit tests were used and `./dev/run_tests` for coding style tests. Author: hyukjinkwon <gurwls223@gmail.com> Closes #11881 from HyukjinKwon/SPARK-13953.	2016-03-22 20:30:48 +08:00
zero323	8193a266b5	[SPARK-14058][PYTHON] Incorrect docstring in Window.order ## What changes were proposed in this pull request? Replaces current docstring ("Creates a :class:`WindowSpec` with the partitioning defined.") with "Creates a :class:`WindowSpec` with the ordering defined." ## How was this patch tested? PySpark unit tests (no regression introduced). No changes to the code. Author: zero323 <matthew.szymkiewicz@gmail.com> Closes #11877 from zero323/order-by-description.	2016-03-21 23:52:33 -07:00
hyukjinkwon	e474088144	[SPARK-13764][SQL] Parse modes in JSON data source ## What changes were proposed in this pull request? Currently, there is no way to control the behaviour when fails to parse corrupt records in JSON data source . This PR adds the support for parse modes just like CSV data source. There are three modes below: - `PERMISSIVE` : When it fails to parse, this sets `null` to to field. This is a default mode when it has been this mode. - `DROPMALFORMED`: When it fails to parse, this drops the whole record. - `FAILFAST`: When it fails to parse, it just throws an exception. This PR also make JSON data source share the `ParseModes` in CSV data source. ## How was this patch tested? Unit tests were used and `./dev/run_tests` for code style tests. Author: hyukjinkwon <gurwls223@gmail.com> Closes #11756 from HyukjinKwon/SPARK-13764.	2016-03-21 15:42:35 +08:00
Reynold Xin	8e0b030606	[SPARK-10380][SQL] Fix confusing documentation examples for astype/drop_duplicates. ## What changes were proposed in this pull request? We have seen users getting confused by the documentation for astype and drop_duplicates, because the examples in them do not use these functions (but do uses their aliases). This patch simply removes all examples for these functions, and say that they are aliases. ## How was this patch tested? Existing PySpark unit tests. Closes #11543. Author: Reynold Xin <rxin@databricks.com> Closes #11698 from rxin/SPARK-10380.	2016-03-14 19:25:49 -07:00
Davies Liu	ba8c86d06f	[SPARK-13671] [SPARK-13311] [SQL] Use different physical plans for RDD and data sources ## What changes were proposed in this pull request? This PR split the PhysicalRDD into two classes, PhysicalRDD and PhysicalScan. PhysicalRDD is used for DataFrames that is created from existing RDD. PhysicalScan is used for DataFrame that is created from data sources. This enable use to apply different optimization on both of them. Also fix the problem for sameResult() on two DataSourceScan. Also fix the equality check to toString for `In`. It's better to use Seq there, but we can't break this public API (sad). ## How was this patch tested? Existing tests. Manually tested with TPCDS query Q59 and Q64, all those duplicated exchanges can be re-used now, also saw there are 40+% performance improvement (saving half of the scan). Author: Davies Liu <davies@databricks.com> Closes #11514 from davies/existing_rdd.	2016-03-12 00:48:36 -08:00
Tristan Reid	5f7dbdba6f	[MINOR] Fix typo in 'hypot' docstring Minor typo: docstring for pyspark.sql.functions: hypot has extra characters N/A Author: Tristan Reid <treid@netflix.com> Closes #11616 from tristanreid/master.	2016-03-09 18:05:03 -08:00
Wenchen Fan	d57daf1f77	[SPARK-13593] [SQL] improve the `createDataFrame` to accept data type string and verify the data ## What changes were proposed in this pull request? This PR improves the `createDataFrame` method to make it also accept datatype string, then users can convert python RDD to DataFrame easily, for example, `df = rdd.toDF("a: int, b: string")`. It also supports flat schema so users can convert an RDD of int to DataFrame directly, we will automatically wrap int to row for users. If schema is given, now we checks if the real data matches the given schema, and throw error if it doesn't. ## How was this patch tested? new tests in `test.py` and doc test in `types.py` Author: Wenchen Fan <wenchen@databricks.com> Closes #11444 from cloud-fan/pyrdd.	2016-03-08 14:00:03 -08:00
Wenchen Fan	d5ce61722f	[SPARK-13740][SQL] add null check for _verify_type in types.py ## What changes were proposed in this pull request? This PR adds null check in `_verify_type` according to the nullability information. ## How was this patch tested? new doc tests Author: Wenchen Fan <wenchen@databricks.com> Closes #11574 from cloud-fan/py-null-check.	2016-03-08 13:46:17 -08:00
gatorsmile	adce5ee721	[SPARK-12720][SQL] SQL Generation Support for Cube, Rollup, and Grouping Sets #### What changes were proposed in this pull request? This PR is for supporting SQL generation for cube, rollup and grouping sets. For example, a query using rollup: ```SQL SELECT count(*) as cnt, key % 5, grouping_id() FROM t1 GROUP BY key % 5 WITH ROLLUP ``` Original logical plan: ``` Aggregate [(key#17L % cast(5 as bigint))#47L,grouping__id#46], [(count(1),mode=Complete,isDistinct=false) AS cnt#43L, (key#17L % cast(5 as bigint))#47L AS _c1#45L, grouping__id#46 AS _c2#44] +- Expand [List(key#17L, value#18, (key#17L % cast(5 as bigint))#47L, 0), List(key#17L, value#18, null, 1)], [key#17L,value#18,(key#17L % cast(5 as bigint))#47L,grouping__id#46] +- Project [key#17L, value#18, (key#17L % cast(5 as bigint)) AS (key#17L % cast(5 as bigint))#47L] +- Subquery t1 +- Relation[key#17L,value#18] ParquetRelation ``` Converted SQL: ```SQL SELECT count( 1) AS `cnt`, (`t1`.`key` % CAST(5 AS BIGINT)), grouping_id() AS `_c2` FROM `default`.`t1` GROUP BY (`t1`.`key` % CAST(5 AS BIGINT)) GROUPING SETS (((`t1`.`key` % CAST(5 AS BIGINT))), ()) ``` #### How was the this patch tested? Added eight test cases in `LogicalPlanToSQLSuite`. Author: gatorsmile <gatorsmile@gmail.com> Author: xiaoli <lixiao1983@gmail.com> Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local> Closes #11283 from gatorsmile/groupingSetsToSQL.	2016-03-05 19:25:03 +08:00
Wenchen Fan	15d57f9c23	[SPARK-13647] [SQL] also check if numeric value is within allowed range in _verify_type ## What changes were proposed in this pull request? This PR makes the `_verify_type` in `types.py` more strict, also check if numeric value is within allowed range. ## How was this patch tested? newly added doc test. Author: Wenchen Fan <wenchen@databricks.com> Closes #11492 from cloud-fan/py-verify.	2016-03-03 20:16:37 -08:00
hyukjinkwon	cf95d728c6	[SPARK-13543][SQL] Support for specifying compression codec for Parquet/ORC via option() ## What changes were proposed in this pull request? This PR adds the support to specify compression codecs for both ORC and Parquet. ## How was this patch tested? unittests within IDE and code style tests with `dev/run_tests`. Author: hyukjinkwon <gurwls223@gmail.com> Closes #11464 from HyukjinKwon/SPARK-13543.	2016-03-03 10:30:55 -08:00
Wenchen Fan	4dd24811d9	[SPARK-13594][SQL] remove typed operations(e.g. map, flatMap) from python DataFrame ## What changes were proposed in this pull request? Remove `map`, `flatMap`, `mapPartitions` from python DataFrame, to prepare for Dataset API in the future. ## How was this patch tested? existing tests Author: Wenchen Fan <wenchen@databricks.com> Closes #11445 from cloud-fan/python-clean.	2016-03-02 15:26:34 -08:00
hyukjinkwon	02aa499dfb	[SPARK-13509][SPARK-13507][SQL] Support for writing CSV with a single function call https://issues.apache.org/jira/browse/SPARK-13507 https://issues.apache.org/jira/browse/SPARK-13509 ## What changes were proposed in this pull request? This PR adds the support to write CSV data directly by a single call to the given path. Several unitests were added for each functionality. ## How was this patch tested? This was tested with unittests and with `dev/run_tests` for coding style Author: hyukjinkwon <gurwls223@gmail.com> Author: Hyukjin Kwon <gurwls223@gmail.com> Closes #11389 from HyukjinKwon/SPARK-13507-13509.	2016-02-29 09:44:29 -08:00
Joseph K. Bradley	13ce10e954	[SPARK-13479][SQL][PYTHON] Added Python API for approxQuantile ## What changes were proposed in this pull request? * Scala DataFrameStatFunctions: Added version of approxQuantile taking a List instead of an Array, for Python compatbility * Python DataFrame and DataFrameStatFunctions: Added approxQuantile ## How was this patch tested? * unit test in sql/tests.py Documentation was copied from the existing approxQuantile exactly. Author: Joseph K. Bradley <joseph@databricks.com> Closes #11356 from jkbradley/approx-quantile-python.	2016-02-24 23:15:36 -08:00
Nong Li	5a7af9e7ac	[SPARK-13250] [SQL] Update PhysicallRDD to convert to UnsafeRow if using the vectorized scanner. Some parts of the engine rely on UnsafeRow which the vectorized parquet scanner does not want to produce. This add a conversion in Physical RDD. In the case where codegen is used (and the scan is the start of the pipeline), there is no requirement to use UnsafeRow. This patch adds update PhysicallRDD to support codegen, which eliminates the need for the UnsafeRow conversion in all cases. The result of these changes for TPCDS-Q19 at the 10gb sf reduces the query time from 9.5 seconds to 6.5 seconds. Author: Nong Li <nong@databricks.com> Closes #11141 from nongli/spark-13250.	2016-02-24 17:16:45 -08:00
Wenchen Fan	a60f91284c	[SPARK-13467] [PYSPARK] abstract python function to simplify pyspark code ## What changes were proposed in this pull request? When we pass a Python function to JVM side, we also need to send its context, e.g. `envVars`, `pythonIncludes`, `pythonExec`, etc. However, it's annoying to pass around so many parameters at many places. This PR abstract python function along with its context, to simplify some pyspark code and make the logic more clear. ## How was the this patch tested? by existing unit tests. Author: Wenchen Fan <wenchen@databricks.com> Closes #11342 from cloud-fan/python-clean.	2016-02-24 12:44:54 -08:00
Davies Liu	c481bdf512	[SPARK-13329] [SQL] considering output for statistics of logical plan The current implementation of statistics of UnaryNode does not considering output (for example, Project may product much less columns than it's child), we should considering it to have a better guess. We usually only join with few columns from a parquet table, the size of projected plan could be much smaller than the original parquet files. Having a better guess of size help we choose between broadcast join or sort merge join. After this PR, I saw a few queries choose broadcast join other than sort merge join without turning spark.sql.autoBroadcastJoinThreshold for every query, ended up with about 6-8X improvements on end-to-end time. We use `defaultSize` of DataType to estimate the size of a column, currently For DecimalType/StringType/BinaryType and UDT, we are over-estimate too much (4096 Bytes), so this PR change them to some more reasonable values. Here are the new defaultSize for them: DecimalType: 8 or 16 bytes, based on the precision StringType: 20 bytes BinaryType: 100 bytes UDF: default size of SQL type These numbers are not perfect (hard to have a perfect number for them), but should be better than 4096. Author: Davies Liu <davies@databricks.com> Closes #11210 from davies/statics.	2016-02-23 12:55:44 -08:00
Dongjoon Hyun	024482bf51	[MINOR][DOCS] Fix all typos in markdown files of `doc` and similar patterns in other comments ## What changes were proposed in this pull request? This PR tries to fix all typos in all markdown files under `docs` module, and fixes similar typos in other comments, too. ## How was the this patch tested? manual tests. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #11300 from dongjoon-hyun/minor_fix_typos.	2016-02-22 09:52:07 +00:00
Franklyn D'souza	0f90f4e6ac	[SPARK-13410][SQL] Support unionAll for DataFrames with UDT columns. ## What changes were proposed in this pull request? This PR adds equality operators to UDT classes so that they can be correctly tested for dataType equality during union operations. This was previously causing `"AnalysisException: u"unresolved operator 'Union;""` when trying to unionAll two dataframes with UDT columns as below. ``` from pyspark.sql.tests import PythonOnlyPoint, PythonOnlyUDT from pyspark.sql import types schema = types.StructType([types.StructField("point", PythonOnlyUDT(), True)]) a = sqlCtx.createDataFrame([[PythonOnlyPoint(1.0, 2.0)]], schema) b = sqlCtx.createDataFrame([[PythonOnlyPoint(3.0, 4.0)]], schema) c = a.unionAll(b) ``` ## How was the this patch tested? Tested using two unit tests in sql/test.py and the DataFrameSuite. Additional information here : https://issues.apache.org/jira/browse/SPARK-13410 Author: Franklyn D'souza <franklynd@gmail.com> Closes #11279 from damnMeddlingKid/udt-union-all.	2016-02-21 16:58:17 -08:00
Cheng Lian	d9efe63ecd	[SPARK-12799] Simplify various string output for expressions This PR introduces several major changes: 1. Replacing `Expression.prettyString` with `Expression.sql` The `prettyString` method is mostly an internal, developer faced facility for debugging purposes, and shouldn't be exposed to users. 1. Using SQL-like representation as column names for selected fields that are not named expression (back-ticks and double quotes should be removed) Before, we were using `prettyString` as column names when possible, and sometimes the result column names can be weird. Here are several examples: Expression \| `prettyString` \| `sql` \| Note ------------------ \| -------------- \| ---------- \| --------------- `a && b` \| `a && b` \| `a AND b` \| `a.getField("f")` \| `a[f]` \| `a.f` \| `a` is a struct 1. Adding trait `NonSQLExpression` extending from `Expression` for expressions that don't have a SQL representation (e.g. Scala UDF/UDAF and Java/Scala object expressions used for encoders) `NonSQLExpression.sql` may return an arbitrary user facing string representation of the expression. Author: Cheng Lian <lian@databricks.com> Closes #10757 from liancheng/spark-12799.simplify-expression-string-methods.	2016-02-21 22:53:15 +08:00
Reynold Xin	6624a588c1	Revert "[SPARK-12567] [SQL] Add aes_{encrypt,decrypt} UDFs" This reverts commit `4f9a664818`.	2016-02-19 22:44:20 -08:00
Kai Jiang	4f9a664818	[SPARK-12567] [SQL] Add aes_{encrypt,decrypt} UDFs Author: Kai Jiang <jiangkai@gmail.com> Closes #10527 from vectorijk/spark-12567.	2016-02-19 22:28:47 -08:00
Reynold Xin	354d4c24be	[SPARK-13296][SQL] Move UserDefinedFunction into sql.expressions. This pull request has the following changes: 1. Moved UserDefinedFunction into expressions package. This is more consistent with how we structure the packages for window functions and UDAFs. 2. Moved UserDefinedPythonFunction into execution.python package, so we don't have a random private class in the top level sql package. 3. Move everything in execution/python.scala into the newly created execution.python package. Most of the diffs are just straight copy-paste. Author: Reynold Xin <rxin@databricks.com> Closes #11181 from rxin/SPARK-13296.	2016-02-13 21:06:31 -08:00
Yanbo Liang	90de6b2fae	[SPARK-12962] [SQL] [PySpark] PySpark support covar_samp and covar_pop PySpark support ```covar_samp``` and ```covar_pop```. cc rxin davies marmbrus Author: Yanbo Liang <ybliang8@gmail.com> Closes #10876 from yanboliang/spark-12962.	2016-02-12 12:43:13 -08:00
Davies Liu	b5761d150b	[SPARK-12706] [SQL] grouping() and grouping_id() Grouping() returns a column is aggregated or not, grouping_id() returns the aggregation levels. grouping()/grouping_id() could be used with window function, but does not work in having/sort clause, will be fixed by another PR. The GROUPING__ID/grouping_id() in Hive is wrong (according to docs), we also did it wrongly, this PR change that to match the behavior in most databases (also the docs of Hive). Author: Davies Liu <davies@databricks.com> Closes #10677 from davies/grouping.	2016-02-10 20:13:38 -08:00
Tommy YU	81da3bee66	[SPARK-5865][API DOC] Add doc warnings for methods that return local data structures rxin srowen I work out note message for rdd.take function, please help to review. If it's fine, I can apply to all other function later. Author: Tommy YU <tummyyu@163.com> Closes #10874 from Wenpei/spark-5865-add-warning-for-localdatastructure.	2016-02-06 17:29:09 +00:00
Herman van Hovell	5a8b978fab	[SPARK-13049] Add First/last with ignore nulls to functions.scala This PR adds the ability to specify the ```ignoreNulls``` option to the functions dsl, e.g: ```df.select($"id", last($"value", ignoreNulls = true).over(Window.partitionBy($"id").orderBy($"other"))``` This PR is some where between a bug fix (see the JIRA) and a new feature. I am not sure if we should backport to 1.6. cc yhuai Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #10957 from hvanhovell/SPARK-13049.	2016-01-31 13:56:13 -08:00
Brandon Bradley	3a40c0e575	[SPARK-12749][SQL] add json option to parse floating-point types as DecimalType I tried to add this via `USE_BIG_DECIMAL_FOR_FLOATS` option from Jackson with no success. Added test for non-complex types. Should I add a test for complex types? Author: Brandon Bradley <bradleytastic@gmail.com> Closes #10936 from blbradley/spark-12749.	2016-01-28 15:25:57 -08:00
Jason Lee	edd473751b	[SPARK-10847][SQL][PYSPARK] Pyspark - DataFrame - Optional Metadata with `None` triggers cryptic failure The error message is now changed from "Do not support type class scala.Tuple2." to "Do not support type class org.json4s.JsonAST$JNull$" to be more informative about what is not supported. Also, StructType metadata now handles JNull correctly, i.e., {'a': None}. test_metadata_null is added to tests.py to show the fix works. Author: Jason Lee <cjlee@us.ibm.com> Closes #8969 from jasoncl/SPARK-10847.	2016-01-27 09:55:10 -08:00
Cheng Lian	3327fd2817	[SPARK-12624][PYSPARK] Checks row length when converting Java arrays to Python rows When actual row length doesn't conform to specified schema field length, we should give a better error message instead of throwing an unintuitive `ArrayOutOfBoundsException`. Author: Cheng Lian <lian@databricks.com> Closes #10886 from liancheng/spark-12624.	2016-01-24 19:40:34 -08:00
Jeff Zhang	e789b1d2c1	[SPARK-12120][PYSPARK] Improve exception message when failing to init… …ialize HiveContext in PySpark davies Mind to review ? This is the error message after this PR ``` 15/12/03 16:59:53 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException /Users/jzhang/github/spark/python/pyspark/sql/context.py:689: UserWarning: You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt assembly warnings.warn("You must build Spark with Hive. " Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/Users/jzhang/github/spark/python/pyspark/sql/context.py", line 663, in read return DataFrameReader(self) File "/Users/jzhang/github/spark/python/pyspark/sql/readwriter.py", line 56, in __init__ self._jreader = sqlContext._ssql_ctx.read() File "/Users/jzhang/github/spark/python/pyspark/sql/context.py", line 692, in _ssql_ctx raise e py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.hive.HiveContext. : java.lang.RuntimeException: java.net.ConnectException: Call From jzhangMBPr.local/127.0.0.1 to 0.0.0.0:9000 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522) at org.apache.spark.sql.hive.client.ClientWrapper.<init>(ClientWrapper.scala:194) at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:238) at org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:218) at org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:208) at org.apache.spark.sql.hive.HiveContext.functionRegistry$lzycompute(HiveContext.scala:462) at org.apache.spark.sql.hive.HiveContext.functionRegistry(HiveContext.scala:461) at org.apache.spark.sql.UDFRegistration.<init>(UDFRegistration.scala:40) at org.apache.spark.sql.SQLContext.<init>(SQLContext.scala:330) at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:90) at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:101) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:526) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381) at py4j.Gateway.invoke(Gateway.java:214) at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79) at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68) at py4j.GatewayConnection.run(GatewayConnection.java:209) at java.lang.Thread.run(Thread.java:745) ``` Author: Jeff Zhang <zjffdu@apache.org> Closes #10126 from zjffdu/SPARK-12120.	2016-01-24 12:29:26 -08:00
Gábor Lipták	9bb35c5b59	[SPARK-11295][PYSPARK] Add packages to JUnit output for Python tests This is #9263 from gliptak (improving grouping/display of test case results) with a small fix of bisecting k-means unit test. Author: Gábor Lipták <gliptak@gmail.com> Author: Xiangrui Meng <meng@databricks.com> Closes #10850 from mengxr/SPARK-11295.	2016-01-20 11:11:10 -08:00
Xiangrui Meng	beda901422	Revert "[SPARK-11295] Add packages to JUnit output for Python tests" This reverts commit `c6f971b4ae`.	2016-01-19 16:51:17 -08:00

1 2 3 4 5 ...

297 commits