## What changes were proposed in this pull request?
This issue aims to expose Scala `bround` function in Python/R API.
`bround` function is implemented in SPARK-14614 by extending current `round` function.
We used the following semantics from Hive.
```java
public static double bround(double input, int scale) {
if (Double.isNaN(input) || Double.isInfinite(input)) {
return input;
}
return BigDecimal.valueOf(input).setScale(scale, RoundingMode.HALF_EVEN).doubleValue();
}
```
After this PR, `pyspark` and `sparkR` also support `bround` function.
**PySpark**
```python
>>> from pyspark.sql.functions import bround
>>> sqlContext.createDataFrame([(2.5,)], ['a']).select(bround('a', 0).alias('r')).collect()
[Row(r=2.0)]
```
**SparkR**
```r
> df = createDataFrame(sqlContext, data.frame(x = c(2.5, 3.5)))
> head(collect(select(df, bround(df$x, 0))))
bround(x, 0)
1 2
2 4
```
## How was this patch tested?
Pass the Jenkins tests (including new testcases).
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#12509 from dongjoon-hyun/SPARK-14639.
## What changes were proposed in this pull request?
Change unpersist blocking parameter default value to match Scala
## How was this patch tested?
unit tests, manual tests
jkbradley davies
Author: felixcheung <felixcheung_m@hotmail.com>
Closes#12507 from felixcheung/pyunpersist.
## What changes were proposed in this pull request?
The PyDoc Makefile used "=" rather than "?=" for setting env variables so it overwrote the user values. This ignored the environment variables we set for linting allowing warnings through. This PR also fixes the warnings that had been introduced.
## How was this patch tested?
manual local export & make
Author: Holden Karau <holden@us.ibm.com>
Closes#12336 from holdenk/SPARK-14573-fix-pydoc-makefile.
## What changes were proposed in this pull request?
The `window` function was added to Dataset with [this PR](https://github.com/apache/spark/pull/12008).
This PR adds the Python, and SQL, API for this function.
With this PR, SQL, Java, and Scala will share the same APIs as in users can use:
- `window(timeColumn, windowDuration)`
- `window(timeColumn, windowDuration, slideDuration)`
- `window(timeColumn, windowDuration, slideDuration, startTime)`
In Python, users can access all APIs above, but in addition they can do
- In Python:
`window(timeColumn, windowDuration, startTime=...)`
that is, they can provide the startTime without providing the `slideDuration`. In this case, we will generate tumbling windows.
## How was this patch tested?
Unit tests + manual tests
Author: Burak Yavuz <brkyvz@gmail.com>
Closes#12136 from brkyvz/python-windows.
## What changes were proposed in this pull request?
RDD.toLocalIterator() could be used to fetch one partition at a time to reduce the memory usage. Right now, for Dataset/Dataframe we have to use df.rdd.toLocalIterator, which is super slow also requires lots of memory (because of the Java serializer or even Kyro serializer).
This PR introduce an optimized toLocalIterator for Dataset/DataFrame, which is much faster and requires much less memory. For a partition with 5 millions rows, `df.rdd.toIterator` took about 100 seconds, but df.toIterator took less than 7 seconds. For 10 millions row, rdd.toIterator will crash (not enough memory) with 4G heap, but df.toLocalIterator could finished in 12 seconds.
The JDBC server has been updated to use DataFrame.toIterator.
## How was this patch tested?
Existing tests.
Author: Davies Liu <davies@databricks.com>
Closes#12114 from davies/local_iterator.
## What changes were proposed in this pull request?
Currently we extract Python UDFs into a special logical plan EvaluatePython in analyzer, But EvaluatePython is not part of catalyst, many rules have no knowledge of it , which will break many things (for example, filter push down or column pruning).
We should treat Python UDFs as normal expressions, until we want to evaluate in physical plan, we could extract them in end of optimizer, or physical plan.
This PR extract Python UDFs in physical plan.
Closes#10935
## How was this patch tested?
Added regression tests.
Author: Davies Liu <davies@databricks.com>
Closes#12127 from davies/py_udf.
## What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-14231
Currently, JSON data source supports to infer `DecimalType` for big numbers and `floatAsBigDecimal` option which reads floating-point values as `DecimalType`.
But there are few restrictions in Spark `DecimalType` below:
1. The precision cannot be bigger than 38.
2. scale cannot be bigger than precision.
Currently, both restrictions are not being handled.
This PR handles the cases by inferring them as `DoubleType`. Also, the option name was changed from `floatAsBigDecimal` to `prefersDecimal` as suggested [here](https://issues.apache.org/jira/browse/SPARK-14231?focusedCommentId=15215579&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15215579).
So, the codes below:
```scala
def doubleRecords: RDD[String] =
sqlContext.sparkContext.parallelize(
s"""{"a": 1${"0" * 38}, "b": 0.01}""" ::
s"""{"a": 2${"0" * 38}, "b": 0.02}""" :: Nil)
val jsonDF = sqlContext.read
.option("prefersDecimal", "true")
.json(doubleRecords)
jsonDF.printSchema()
```
produces below:
- **Before**
```scala
org.apache.spark.sql.AnalysisException: Decimal scale (2) cannot be greater than precision (1).;
at org.apache.spark.sql.types.DecimalType.<init>(DecimalType.scala:44)
at org.apache.spark.sql.execution.datasources.json.InferSchema$.org$apache$spark$sql$execution$datasources$json$InferSchema$$inferField(InferSchema.scala:144)
at org.apache.spark.sql.execution.datasources.json.InferSchema$.org$apache$spark$sql$execution$datasources$json$InferSchema$$inferField(InferSchema.scala:108)
at
...
```
- **After**
```scala
root
|-- a: double (nullable = true)
|-- b: double (nullable = true)
```
## How was this patch tested?
Unit tests were used and `./dev/run_tests` for coding style tests.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#12030 from HyukjinKwon/SPARK-14231.
## What changes were proposed in this pull request?
This PR support multiple Python UDFs within single batch, also improve the performance.
```python
>>> from pyspark.sql.types import IntegerType
>>> sqlContext.registerFunction("double", lambda x: x * 2, IntegerType())
>>> sqlContext.registerFunction("add", lambda x, y: x + y, IntegerType())
>>> sqlContext.sql("SELECT double(add(1, 2)), add(double(2), 1)").explain(True)
== Parsed Logical Plan ==
'Project [unresolvedalias('double('add(1, 2)), None),unresolvedalias('add('double(2), 1), None)]
+- OneRowRelation$
== Analyzed Logical Plan ==
double(add(1, 2)): int, add(double(2), 1): int
Project [double(add(1, 2))#14,add(double(2), 1)#15]
+- Project [double(add(1, 2))#14,add(double(2), 1)#15]
+- Project [pythonUDF0#16 AS double(add(1, 2))#14,pythonUDF0#18 AS add(double(2), 1)#15]
+- EvaluatePython [add(pythonUDF1#17, 1)], [pythonUDF0#18]
+- EvaluatePython [double(add(1, 2)),double(2)], [pythonUDF0#16,pythonUDF1#17]
+- OneRowRelation$
== Optimized Logical Plan ==
Project [pythonUDF0#16 AS double(add(1, 2))#14,pythonUDF0#18 AS add(double(2), 1)#15]
+- EvaluatePython [add(pythonUDF1#17, 1)], [pythonUDF0#18]
+- EvaluatePython [double(add(1, 2)),double(2)], [pythonUDF0#16,pythonUDF1#17]
+- OneRowRelation$
== Physical Plan ==
WholeStageCodegen
: +- Project [pythonUDF0#16 AS double(add(1, 2))#14,pythonUDF0#18 AS add(double(2), 1)#15]
: +- INPUT
+- !BatchPythonEvaluation [add(pythonUDF1#17, 1)], [pythonUDF0#16,pythonUDF1#17,pythonUDF0#18]
+- !BatchPythonEvaluation [double(add(1, 2)),double(2)], [pythonUDF0#16,pythonUDF1#17]
+- Scan OneRowRelation[]
```
## How was this patch tested?
Added new tests.
Using the following script to benchmark 1, 2 and 3 udfs,
```
df = sqlContext.range(1, 1 << 23, 1, 4)
double = F.udf(lambda x: x * 2, LongType())
print df.select(double(df.id)).count()
print df.select(double(df.id), double(df.id + 1)).count()
print df.select(double(df.id), double(df.id + 1), double(df.id + 2)).count()
```
Here is the results:
N | Before | After | speed up
---- |------------ | -------------|------
1 | 22 s | 7 s | 3.1X
2 | 38 s | 13 s | 2.9X
3 | 58 s | 16 s | 3.6X
This benchmark ran locally with 4 CPUs. For 3 UDFs, it launched 12 Python before before this patch, 4 process after this patch. After this patch, it will use less memory for multiple UDFs than before (less buffering).
Author: Davies Liu <davies@databricks.com>
Closes#12057 from davies/multi_udfs.
### What changes were proposed in this pull request?
This PR removes the ANTLR3 based parser, and moves the new ANTLR4 based parser into the `org.apache.spark.sql.catalyst.parser package`.
### How was this patch tested?
Existing unit tests.
cc rxin andrewor14 yhuai
Author: Herman van Hovell <hvanhovell@questtec.nl>
Closes#12071 from hvanhovell/SPARK-14211.
## What changes were proposed in this pull request?
This PR brings the support for chained Python UDFs, for example
```sql
select udf1(udf2(a))
select udf1(udf2(a) + 3)
select udf1(udf2(a) + udf3(b))
```
Also directly chained unary Python UDFs are put in single batch of Python UDFs, others may require multiple batches.
For example,
```python
>>> sqlContext.sql("select double(double(1))").explain()
== Physical Plan ==
WholeStageCodegen
: +- Project [pythonUDF#10 AS double(double(1))#9]
: +- INPUT
+- !BatchPythonEvaluation double(double(1)), [pythonUDF#10]
+- Scan OneRowRelation[]
>>> sqlContext.sql("select double(double(1) + double(2))").explain()
== Physical Plan ==
WholeStageCodegen
: +- Project [pythonUDF#19 AS double((double(1) + double(2)))#16]
: +- INPUT
+- !BatchPythonEvaluation double((pythonUDF#17 + pythonUDF#18)), [pythonUDF#17,pythonUDF#18,pythonUDF#19]
+- !BatchPythonEvaluation double(2), [pythonUDF#17,pythonUDF#18]
+- !BatchPythonEvaluation double(1), [pythonUDF#17]
+- Scan OneRowRelation[]
```
TODO: will support multiple unrelated Python UDFs in one batch (another PR).
## How was this patch tested?
Added new unit tests for chained UDFs.
Author: Davies Liu <davies@databricks.com>
Closes#12014 from davies/py_udfs.
### What changes were proposed in this pull request?
The current ANTLR3 parser is quite complex to maintain and suffers from code blow-ups. This PR introduces a new parser that is based on ANTLR4.
This parser is based on the [Presto's SQL parser](https://github.com/facebook/presto/blob/master/presto-parser/src/main/antlr4/com/facebook/presto/sql/parser/SqlBase.g4). The current implementation can parse and create Catalyst and SQL plans. Large parts of the HiveQl DDL and some of the DML functionality is currently missing, the plan is to add this in follow-up PRs.
This PR is a work in progress, and work needs to be done in the following area's:
- [x] Error handling should be improved.
- [x] Documentation should be improved.
- [x] Multi-Insert needs to be tested.
- [ ] Naming and package locations.
### How was this patch tested?
Catalyst and SQL unit tests.
Author: Herman van Hovell <hvanhovell@questtec.nl>
Closes#11557 from hvanhovell/ngParser.
## What changes were proposed in this pull request?
As we have `CreateArray` and `CreateStruct`, we should also have `CreateMap`. This PR adds the `CreateMap` expression, and the DataFrame API, and python API.
## How was this patch tested?
various new tests.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#11879 from cloud-fan/create_map.
## What changes were proposed in this pull request?
This reopens#11836, which was merged but promptly reverted because it introduced flaky Hive tests.
## How was this patch tested?
See `CatalogTestCases`, `SessionCatalogSuite` and `HiveContextSuite`.
Author: Andrew Or <andrew@databricks.com>
Closes#11938 from andrewor14/session-catalog-again.
## What changes were proposed in this pull request?
unionAll has been deprecated in SPARK-14088.
## How was this patch tested?
Should be covered by all existing tests.
Author: Reynold Xin <rxin@databricks.com>
Closes#11946 from rxin/SPARK-14142.
## What changes were proposed in this pull request?
`SessionCatalog`, introduced in #11750, is a catalog that keeps track of temporary functions and tables, and delegates metastore operations to `ExternalCatalog`. This functionality overlaps a lot with the existing `analysis.Catalog`.
As of this commit, `SessionCatalog` and `ExternalCatalog` will no longer be dead code. There are still things that need to be done after this patch, namely:
- SPARK-14013: Properly implement temporary functions in `SessionCatalog`
- SPARK-13879: Decide which DDL/DML commands to support natively in Spark
- SPARK-?????: Implement the ones we do want to support through `SessionCatalog`.
- SPARK-?????: Merge SQL/HiveContext
## How was this patch tested?
This is largely a refactoring task so there are no new tests introduced. The particularly relevant tests are `SessionCatalogSuite` and `ExternalCatalogSuite`.
Author: Andrew Or <andrew@databricks.com>
Author: Yin Huai <yhuai@databricks.com>
Closes#11836 from andrewor14/use-session-catalog.
## What changes were proposed in this pull request?
1. Deprecated unionAll. It is pretty confusing to have both "union" and "unionAll" when the two do the same thing in Spark but are different in SQL.
2. Rename reduce in KeyValueGroupedDataset to reduceGroups so it is more consistent with rest of the functions in KeyValueGroupedDataset. Also makes it more obvious what "reduce" and "reduceGroups" mean. Previously it was confusing because it could be reducing a Dataset, or just reducing groups.
3. Added a "name" function, which is more natural to name columns than "as" for non-SQL users.
4. Remove "subtract" function since it is just an alias for "except".
## How was this patch tested?
All changes should be covered by existing tests. Also added couple test cases to cover "name".
Author: Reynold Xin <rxin@databricks.com>
Closes#11908 from rxin/SPARK-14088.
## What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-13953
Currently, JSON data source creates a new field in `PERMISSIVE` mode for storing malformed string.
This field can be renamed via `spark.sql.columnNameOfCorruptRecord` option but it is a global configuration.
This PR make that option can be applied per read and can be specified via `option()`. This will overwrites `spark.sql.columnNameOfCorruptRecord` if it is set.
## How was this patch tested?
Unit tests were used and `./dev/run_tests` for coding style tests.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#11881 from HyukjinKwon/SPARK-13953.
## What changes were proposed in this pull request?
Replaces current docstring ("Creates a :class:`WindowSpec` with the partitioning defined.") with "Creates a :class:`WindowSpec` with the ordering defined."
## How was this patch tested?
PySpark unit tests (no regression introduced). No changes to the code.
Author: zero323 <matthew.szymkiewicz@gmail.com>
Closes#11877 from zero323/order-by-description.
## What changes were proposed in this pull request?
Currently, there is no way to control the behaviour when fails to parse corrupt records in JSON data source .
This PR adds the support for parse modes just like CSV data source. There are three modes below:
- `PERMISSIVE` : When it fails to parse, this sets `null` to to field. This is a default mode when it has been this mode.
- `DROPMALFORMED`: When it fails to parse, this drops the whole record.
- `FAILFAST`: When it fails to parse, it just throws an exception.
This PR also make JSON data source share the `ParseModes` in CSV data source.
## How was this patch tested?
Unit tests were used and `./dev/run_tests` for code style tests.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#11756 from HyukjinKwon/SPARK-13764.
## What changes were proposed in this pull request?
We have seen users getting confused by the documentation for astype and drop_duplicates, because the examples in them do not use these functions (but do uses their aliases). This patch simply removes all examples for these functions, and say that they are aliases.
## How was this patch tested?
Existing PySpark unit tests.
Closes#11543.
Author: Reynold Xin <rxin@databricks.com>
Closes#11698 from rxin/SPARK-10380.
## What changes were proposed in this pull request?
This PR split the PhysicalRDD into two classes, PhysicalRDD and PhysicalScan. PhysicalRDD is used for DataFrames that is created from existing RDD. PhysicalScan is used for DataFrame that is created from data sources. This enable use to apply different optimization on both of them.
Also fix the problem for sameResult() on two DataSourceScan.
Also fix the equality check to toString for `In`. It's better to use Seq there, but we can't break this public API (sad).
## How was this patch tested?
Existing tests. Manually tested with TPCDS query Q59 and Q64, all those duplicated exchanges can be re-used now, also saw there are 40+% performance improvement (saving half of the scan).
Author: Davies Liu <davies@databricks.com>
Closes#11514 from davies/existing_rdd.
Minor typo: docstring for pyspark.sql.functions: hypot has extra characters
N/A
Author: Tristan Reid <treid@netflix.com>
Closes#11616 from tristanreid/master.
## What changes were proposed in this pull request?
This PR improves the `createDataFrame` method to make it also accept datatype string, then users can convert python RDD to DataFrame easily, for example, `df = rdd.toDF("a: int, b: string")`.
It also supports flat schema so users can convert an RDD of int to DataFrame directly, we will automatically wrap int to row for users.
If schema is given, now we checks if the real data matches the given schema, and throw error if it doesn't.
## How was this patch tested?
new tests in `test.py` and doc test in `types.py`
Author: Wenchen Fan <wenchen@databricks.com>
Closes#11444 from cloud-fan/pyrdd.
## What changes were proposed in this pull request?
This PR adds null check in `_verify_type` according to the nullability information.
## How was this patch tested?
new doc tests
Author: Wenchen Fan <wenchen@databricks.com>
Closes#11574 from cloud-fan/py-null-check.
#### What changes were proposed in this pull request?
This PR is for supporting SQL generation for cube, rollup and grouping sets.
For example, a query using rollup:
```SQL
SELECT count(*) as cnt, key % 5, grouping_id() FROM t1 GROUP BY key % 5 WITH ROLLUP
```
Original logical plan:
```
Aggregate [(key#17L % cast(5 as bigint))#47L,grouping__id#46],
[(count(1),mode=Complete,isDistinct=false) AS cnt#43L,
(key#17L % cast(5 as bigint))#47L AS _c1#45L,
grouping__id#46 AS _c2#44]
+- Expand [List(key#17L, value#18, (key#17L % cast(5 as bigint))#47L, 0),
List(key#17L, value#18, null, 1)],
[key#17L,value#18,(key#17L % cast(5 as bigint))#47L,grouping__id#46]
+- Project [key#17L,
value#18,
(key#17L % cast(5 as bigint)) AS (key#17L % cast(5 as bigint))#47L]
+- Subquery t1
+- Relation[key#17L,value#18] ParquetRelation
```
Converted SQL:
```SQL
SELECT count( 1) AS `cnt`,
(`t1`.`key` % CAST(5 AS BIGINT)),
grouping_id() AS `_c2`
FROM `default`.`t1`
GROUP BY (`t1`.`key` % CAST(5 AS BIGINT))
GROUPING SETS (((`t1`.`key` % CAST(5 AS BIGINT))), ())
```
#### How was the this patch tested?
Added eight test cases in `LogicalPlanToSQLSuite`.
Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
Closes#11283 from gatorsmile/groupingSetsToSQL.
## What changes were proposed in this pull request?
This PR makes the `_verify_type` in `types.py` more strict, also check if numeric value is within allowed range.
## How was this patch tested?
newly added doc test.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#11492 from cloud-fan/py-verify.
## What changes were proposed in this pull request?
This PR adds the support to specify compression codecs for both ORC and Parquet.
## How was this patch tested?
unittests within IDE and code style tests with `dev/run_tests`.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#11464 from HyukjinKwon/SPARK-13543.
## What changes were proposed in this pull request?
Remove `map`, `flatMap`, `mapPartitions` from python DataFrame, to prepare for Dataset API in the future.
## How was this patch tested?
existing tests
Author: Wenchen Fan <wenchen@databricks.com>
Closes#11445 from cloud-fan/python-clean.
https://issues.apache.org/jira/browse/SPARK-13507https://issues.apache.org/jira/browse/SPARK-13509
## What changes were proposed in this pull request?
This PR adds the support to write CSV data directly by a single call to the given path.
Several unitests were added for each functionality.
## How was this patch tested?
This was tested with unittests and with `dev/run_tests` for coding style
Author: hyukjinkwon <gurwls223@gmail.com>
Author: Hyukjin Kwon <gurwls223@gmail.com>
Closes#11389 from HyukjinKwon/SPARK-13507-13509.
## What changes were proposed in this pull request?
* Scala DataFrameStatFunctions: Added version of approxQuantile taking a List instead of an Array, for Python compatbility
* Python DataFrame and DataFrameStatFunctions: Added approxQuantile
## How was this patch tested?
* unit test in sql/tests.py
Documentation was copied from the existing approxQuantile exactly.
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#11356 from jkbradley/approx-quantile-python.
Some parts of the engine rely on UnsafeRow which the vectorized parquet scanner does not want
to produce. This add a conversion in Physical RDD. In the case where codegen is used (and the
scan is the start of the pipeline), there is no requirement to use UnsafeRow. This patch adds
update PhysicallRDD to support codegen, which eliminates the need for the UnsafeRow conversion
in all cases.
The result of these changes for TPCDS-Q19 at the 10gb sf reduces the query time from 9.5 seconds
to 6.5 seconds.
Author: Nong Li <nong@databricks.com>
Closes#11141 from nongli/spark-13250.
## What changes were proposed in this pull request?
When we pass a Python function to JVM side, we also need to send its context, e.g. `envVars`, `pythonIncludes`, `pythonExec`, etc. However, it's annoying to pass around so many parameters at many places. This PR abstract python function along with its context, to simplify some pyspark code and make the logic more clear.
## How was the this patch tested?
by existing unit tests.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#11342 from cloud-fan/python-clean.
The current implementation of statistics of UnaryNode does not considering output (for example, Project may product much less columns than it's child), we should considering it to have a better guess.
We usually only join with few columns from a parquet table, the size of projected plan could be much smaller than the original parquet files. Having a better guess of size help we choose between broadcast join or sort merge join.
After this PR, I saw a few queries choose broadcast join other than sort merge join without turning spark.sql.autoBroadcastJoinThreshold for every query, ended up with about 6-8X improvements on end-to-end time.
We use `defaultSize` of DataType to estimate the size of a column, currently For DecimalType/StringType/BinaryType and UDT, we are over-estimate too much (4096 Bytes), so this PR change them to some more reasonable values. Here are the new defaultSize for them:
DecimalType: 8 or 16 bytes, based on the precision
StringType: 20 bytes
BinaryType: 100 bytes
UDF: default size of SQL type
These numbers are not perfect (hard to have a perfect number for them), but should be better than 4096.
Author: Davies Liu <davies@databricks.com>
Closes#11210 from davies/statics.
## What changes were proposed in this pull request?
This PR tries to fix all typos in all markdown files under `docs` module,
and fixes similar typos in other comments, too.
## How was the this patch tested?
manual tests.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#11300 from dongjoon-hyun/minor_fix_typos.
## What changes were proposed in this pull request?
This PR adds equality operators to UDT classes so that they can be correctly tested for dataType equality during union operations.
This was previously causing `"AnalysisException: u"unresolved operator 'Union;""` when trying to unionAll two dataframes with UDT columns as below.
```
from pyspark.sql.tests import PythonOnlyPoint, PythonOnlyUDT
from pyspark.sql import types
schema = types.StructType([types.StructField("point", PythonOnlyUDT(), True)])
a = sqlCtx.createDataFrame([[PythonOnlyPoint(1.0, 2.0)]], schema)
b = sqlCtx.createDataFrame([[PythonOnlyPoint(3.0, 4.0)]], schema)
c = a.unionAll(b)
```
## How was the this patch tested?
Tested using two unit tests in sql/test.py and the DataFrameSuite.
Additional information here : https://issues.apache.org/jira/browse/SPARK-13410
Author: Franklyn D'souza <franklynd@gmail.com>
Closes#11279 from damnMeddlingKid/udt-union-all.
This PR introduces several major changes:
1. Replacing `Expression.prettyString` with `Expression.sql`
The `prettyString` method is mostly an internal, developer faced facility for debugging purposes, and shouldn't be exposed to users.
1. Using SQL-like representation as column names for selected fields that are not named expression (back-ticks and double quotes should be removed)
Before, we were using `prettyString` as column names when possible, and sometimes the result column names can be weird. Here are several examples:
Expression | `prettyString` | `sql` | Note
------------------ | -------------- | ---------- | ---------------
`a && b` | `a && b` | `a AND b` |
`a.getField("f")` | `a[f]` | `a.f` | `a` is a struct
1. Adding trait `NonSQLExpression` extending from `Expression` for expressions that don't have a SQL representation (e.g. Scala UDF/UDAF and Java/Scala object expressions used for encoders)
`NonSQLExpression.sql` may return an arbitrary user facing string representation of the expression.
Author: Cheng Lian <lian@databricks.com>
Closes#10757 from liancheng/spark-12799.simplify-expression-string-methods.
This pull request has the following changes:
1. Moved UserDefinedFunction into expressions package. This is more consistent with how we structure the packages for window functions and UDAFs.
2. Moved UserDefinedPythonFunction into execution.python package, so we don't have a random private class in the top level sql package.
3. Move everything in execution/python.scala into the newly created execution.python package.
Most of the diffs are just straight copy-paste.
Author: Reynold Xin <rxin@databricks.com>
Closes#11181 from rxin/SPARK-13296.
PySpark support ```covar_samp``` and ```covar_pop```.
cc rxin davies marmbrus
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#10876 from yanboliang/spark-12962.
Grouping() returns a column is aggregated or not, grouping_id() returns the aggregation levels.
grouping()/grouping_id() could be used with window function, but does not work in having/sort clause, will be fixed by another PR.
The GROUPING__ID/grouping_id() in Hive is wrong (according to docs), we also did it wrongly, this PR change that to match the behavior in most databases (also the docs of Hive).
Author: Davies Liu <davies@databricks.com>
Closes#10677 from davies/grouping.
rxin srowen
I work out note message for rdd.take function, please help to review.
If it's fine, I can apply to all other function later.
Author: Tommy YU <tummyyu@163.com>
Closes#10874 from Wenpei/spark-5865-add-warning-for-localdatastructure.
This PR adds the ability to specify the ```ignoreNulls``` option to the functions dsl, e.g:
```df.select($"id", last($"value", ignoreNulls = true).over(Window.partitionBy($"id").orderBy($"other"))```
This PR is some where between a bug fix (see the JIRA) and a new feature. I am not sure if we should backport to 1.6.
cc yhuai
Author: Herman van Hovell <hvanhovell@questtec.nl>
Closes#10957 from hvanhovell/SPARK-13049.
I tried to add this via `USE_BIG_DECIMAL_FOR_FLOATS` option from Jackson with no success.
Added test for non-complex types. Should I add a test for complex types?
Author: Brandon Bradley <bradleytastic@gmail.com>
Closes#10936 from blbradley/spark-12749.
The error message is now changed from "Do not support type class scala.Tuple2." to "Do not support type class org.json4s.JsonAST$JNull$" to be more informative about what is not supported. Also, StructType metadata now handles JNull correctly, i.e., {'a': None}. test_metadata_null is added to tests.py to show the fix works.
Author: Jason Lee <cjlee@us.ibm.com>
Closes#8969 from jasoncl/SPARK-10847.
When actual row length doesn't conform to specified schema field length, we should give a better error message instead of throwing an unintuitive `ArrayOutOfBoundsException`.
Author: Cheng Lian <lian@databricks.com>
Closes#10886 from liancheng/spark-12624.
…ialize HiveContext in PySpark
davies Mind to review ?
This is the error message after this PR
```
15/12/03 16:59:53 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
/Users/jzhang/github/spark/python/pyspark/sql/context.py:689: UserWarning: You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt assembly
warnings.warn("You must build Spark with Hive. "
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/jzhang/github/spark/python/pyspark/sql/context.py", line 663, in read
return DataFrameReader(self)
File "/Users/jzhang/github/spark/python/pyspark/sql/readwriter.py", line 56, in __init__
self._jreader = sqlContext._ssql_ctx.read()
File "/Users/jzhang/github/spark/python/pyspark/sql/context.py", line 692, in _ssql_ctx
raise e
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.
: java.lang.RuntimeException: java.net.ConnectException: Call From jzhangMBPr.local/127.0.0.1 to 0.0.0.0:9000 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
at org.apache.spark.sql.hive.client.ClientWrapper.<init>(ClientWrapper.scala:194)
at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:238)
at org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:218)
at org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:208)
at org.apache.spark.sql.hive.HiveContext.functionRegistry$lzycompute(HiveContext.scala:462)
at org.apache.spark.sql.hive.HiveContext.functionRegistry(HiveContext.scala:461)
at org.apache.spark.sql.UDFRegistration.<init>(UDFRegistration.scala:40)
at org.apache.spark.sql.SQLContext.<init>(SQLContext.scala:330)
at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:90)
at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:101)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
at py4j.Gateway.invoke(Gateway.java:214)
at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68)
at py4j.GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Thread.java:745)
```
Author: Jeff Zhang <zjffdu@apache.org>
Closes#10126 from zjffdu/SPARK-12120.
This is #9263 from gliptak (improving grouping/display of test case results) with a small fix of bisecting k-means unit test.
Author: Gábor Lipták <gliptak@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>
Closes#10850 from mengxr/SPARK-11295.