## What changes were proposed in this pull request?
Running a barrier job after a normal Spark job causes the barrier job to run without a BarrierTaskContext. This happens with Python worker reuse: after a normal Spark job is submitted first, `BarrierTaskContext._getOrCreate()` still returns a plain `TaskContext`, and we get `AttributeError: 'TaskContext' object has no attribute 'barrier'`. Fix this by adding check logic in `BarrierTaskContext._getOrCreate()` to make sure it returns a `BarrierTaskContext` in this scenario.
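A minimal sketch of the added check (illustrative only; the actual patch may differ in details):
```python
# Sketch: on a reused worker a plain TaskContext can be left over from a
# previous normal job, so replace it before returning.
@classmethod
def _getOrCreate(cls):
    if not isinstance(cls._taskContext, BarrierTaskContext):
        cls._taskContext = object.__new__(cls)
    return cls._taskContext
```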
## How was this patch tested?
Add new UT in pyspark-core.
Closes#22962 from xuanyuanking/SPARK-25921.
Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
Clarify Bucketizer handleInvalid docs. Just a resubmit of https://github.com/apache/spark/pull/17169
## How was this patch tested?
N/A
Closes#23003 from srowen/SPARK-19714.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Added JSON options for `json()` in streaming.py that are present in the corresponding method in readwriter.py. In particular, the missing options are `dropFieldIfAllNull` and `encoding`.
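For illustration, a hedged example of using the newly exposed options on the streaming reader (the schema and input path are placeholders):
```python
# Hypothetical usage; `schema` and the path are placeholders.
df = spark.readStream.schema(schema).json(
    "/path/to/json", dropFieldIfAllNull=True, encoding="UTF-8")
```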
Closes#22973 from MaxGekk/streaming-missed-options.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
In the PR, I propose to add a new option `locale` into CSVOptions/JSONOptions to make it possible to parse dates/timestamps in local languages. Currently the locale is hard-coded to `Locale.US`.
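As a hedged illustration of the proposed option (the path, date format, and locale tag are placeholders):
```python
# Hypothetical example: parse a Russian month abbreviation like "ноя 2018"
# using the new `locale` option.
df = (spark.read
      .option("locale", "ru-RU")
      .option("dateFormat", "MMM yyyy")
      .schema("d DATE")
      .csv("/path/to/dates.csv"))
```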
## How was this patch tested?
Added two tests for parsing a date from CSV/JSON - `ноя 2018`.
Closes#22951 from MaxGekk/locale.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
- Remove some AccumulableInfo .apply() methods
- Remove non-label-specific multiclass precision/recall/fScore in favor of accuracy
- Remove toDegrees/toRadians in favor of degrees/radians (SparkR: only deprecated)
- Remove approxCountDistinct in favor of approx_count_distinct (SparkR: only deprecated)
- Remove unused Python StorageLevel constants
- Remove Dataset unionAll in favor of union
- Remove unused multiclass option in libsvm parsing
- Remove references to deprecated spark configs like spark.yarn.am.port
- Remove TaskContext.isRunningLocally
- Remove ShuffleMetrics.shuffle* methods
- Remove BaseReadWrite.context in favor of session
- Remove Column.!== in favor of =!=
- Remove Dataset.explode
- Remove Dataset.registerTempTable
- Remove SQLContext.getOrCreate, setActive, clearActive, constructors
Not touched yet:
- everything else in MLLib
- HiveContext
- Anything deprecated more recently than 2.0.0, generally
## How was this patch tested?
Existing tests
Closes#22921 from srowen/SPARK-25908.
Lead-authored-by: Sean Owen <sean.owen@databricks.com>
Co-authored-by: hyukjinkwon <gurwls223@apache.org>
Co-authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Fix typos and misspellings, per https://github.com/apache/spark-website/pull/158#issuecomment-435790366
## How was this patch tested?
Existing tests.
Closes#22950 from srowen/Typos.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
The PR proposes to deprecate the `computeCost` method on `BisectingKMeans` in favor of the adoption of `ClusteringEvaluator` in order to evaluate the clustering.
## How was this patch tested?
NA
Closes#22869 from mgaido91/SPARK-25758_3.0.
Authored-by: Marco Gaido <marcogaido91@gmail.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
## What changes were proposed in this pull request?
The new function takes a struct and converts it to a CSV string using the passed CSV options. It accepts the same CSV options as the CSV data source does.
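A hedged sketch of calling it from PySpark (column names are placeholders):
```python
from pyspark.sql.functions import struct, to_csv

# Convert a struct of two placeholder columns to a CSV string,
# passing a CSV option just as the CSV data source accepts it.
df.select(to_csv(struct("a", "b"), {"sep": "|"}).alias("csv"))
```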
## How was this patch tested?
Added `CsvExpressionsSuite`, `CsvFunctionsSuite` as well as R, Python and SQL tests similar to tests for `to_json()`
Closes#22626 from MaxGekk/to_csv.
Lead-authored-by: Maxim Gekk <max.gekk@gmail.com>
Co-authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
In the PR, I propose to add a new function - *schema_of_csv()* - which infers the schema of a CSV string literal. The result of the function is a string containing a schema in DDL format. For example:
```sql
select schema_of_csv('1|abc', map('delimiter', '|'))
```
```
struct<_c0:int,_c1:string>
```
## How was this patch tested?
Added new tests to `CsvFunctionsSuite`, `CsvExpressionsSuite` and SQL tests to `csv-functions.sql`
Closes#22666 from MaxGekk/schema_of_csv-function.
Lead-authored-by: hyukjinkwon <gurwls223@apache.org>
Co-authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
add getActiveSession in session.py
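A hedged usage example of the new API:
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Returns the active session for the current thread, or None if none is set.
assert SparkSession.getActiveSession() is spark
```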
## How was this patch tested?
add doctest
Closes#22295 from huaxingao/spark25255.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Holden Karau <holden@pigscanfly.ca>
## What changes were proposed in this pull request?
The main purpose of `schema_of_json` is its use in combination with `from_json` (to make up for the lack of schema inference there), which takes its schema only as a literal; however, `schema_of_json` currently allows JSON input as non-literal expressions (e.g., a column).
This was mistakenly allowed - we don't have to take usages other than the main purpose into account for now.
This PR makes a followup to only allow literals for `schema_of_json`'s JSON input. We can allow non-literal expressions later when it's needed or there are some use cases for it.
## How was this patch tested?
Unit tests were added.
Closes#22775 from HyukjinKwon/SPARK-25447-followup.
Lead-authored-by: hyukjinkwon <gurwls223@apache.org>
Co-authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
See the detailed information at https://issues.apache.org/jira/browse/SPARK-25841 on why these APIs should be deprecated and redesigned.
This patch also reverts 8acb51f08b which applies to 2.4.
## How was this patch tested?
Only deprecation and doc changes.
Closes#22841 from rxin/SPARK-25842.
Authored-by: Reynold Xin <rxin@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
Remove JavaSparkContextVarargsWorkaround
## How was this patch tested?
Existing tests.
Closes#22729 from srowen/SPARK-25737.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
In the PR, I propose to switch `from_json` to `FailureSafeParser`, to make the function compatible with `PERMISSIVE` mode by default, and to support the `FAILFAST` mode as well. The `DROPMALFORMED` mode is not supported by `from_json`.
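A hedged illustration of selecting the parsing mode via `from_json` options (the column and schema are placeholders):
```python
from pyspark.sql.functions import from_json

# PERMISSIVE is the default: malformed records yield null fields.
df.select(from_json("value", "a INT, b STRING"))
# FAILFAST raises an exception on the first malformed record.
df.select(from_json("value", "a INT, b STRING", {"mode": "FAILFAST"}))
```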
## How was this patch tested?
It was tested by existing `JsonSuite`/`CSVSuite`, `JsonFunctionsSuite` and `JsonExpressionsSuite` as well as new tests for `from_json` which checks different modes.
Closes#22237 from MaxGekk/from_json-failuresafe.
Lead-authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Co-authored-by: hyukjinkwon <gurwls223@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
The PR proposes to deprecate the `computeCost` method on `BisectingKMeans` in favor of the adoption of `ClusteringEvaluator` in order to evaluate the clustering.
## How was this patch tested?
NA
Closes#22756 from mgaido91/SPARK-25758.
Authored-by: Marco Gaido <marcogaido91@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
## What changes were proposed in this pull request?
Currently each test in `SQLTest` in PySpark is not cleaned up properly.
We should introduce and use more `contextmanager`s to make it convenient to clean up the context properly.
## How was this patch tested?
Modified tests.
Closes#22762 from ueshin/issues/SPARK-25763/cleanup_sqltests.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
Previously PySpark used the private constructor for SparkSession when building that object. This resulted in a SparkSession that never checked the sql.extensions parameter for additional session extensions. To fix this we instead use the Session.builder() path as SparkR does; this loads the extensions and allows their use in PySpark.
## How was this patch tested?
An integration test was added which mimics the Scala test for the same feature.
Closes#21990 from RussellSpitzer/SPARK-25003-master.
Authored-by: Russell Spitzer <Russell.Spitzer@gmail.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
The PR adds a new function `from_csv()`, similar to `from_json()`, to parse columns with CSV strings. I added the following methods:
```Scala
def from_csv(e: Column, schema: StructType, options: Map[String, String]): Column
```
and this signature to call it from Python, R and Java:
```Scala
def from_csv(e: Column, schema: String, options: java.util.Map[String, String]): Column
```
## How was this patch tested?
Added new test suites `CsvExpressionsSuite`, `CsvFunctionsSuite` and sql tests.
Closes#22379 from MaxGekk/from_csv.
Lead-authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Co-authored-by: hyukjinkwon <gurwls223@apache.org>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
Remove Kafka 0.8 integration
## How was this patch tested?
Existing tests, build scripts
Closes#22703 from srowen/SPARK-25705.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Removes all vestiges of Flume in the build, for Spark 3.
I don't think this needs Jenkins config changes.
## How was this patch tested?
Existing tests.
Closes#22692 from srowen/SPARK-25598.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
This PR proposes to add explicit type inference and simple end-to-end tests. It looks like we are not cleanly testing that logic.
For instance, see 08c76b5d39/python/pyspark/sql/types.py (L894-L905)
It looks like we intended to support `datetime.time` and `None` for type inference too, but it does not work:
```
>>> spark.createDataFrame([[datetime.time()]])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/.../spark/python/pyspark/sql/session.py", line 751, in createDataFrame
rdd, schema = self._createFromLocal(map(prepare, data), schema)
File "/.../spark/python/pyspark/sql/session.py", line 432, in _createFromLocal
data = [schema.toInternal(row) for row in data]
File "/.../spark/python/pyspark/sql/types.py", line 604, in toInternal
for f, v, c in zip(self.fields, obj, self._needConversion))
File "/.../spark/python/pyspark/sql/types.py", line 604, in <genexpr>
for f, v, c in zip(self.fields, obj, self._needConversion))
File "/.../spark/python/pyspark/sql/types.py", line 442, in toInternal
return self.dataType.toInternal(obj)
File "/.../spark/python/pyspark/sql/types.py", line 193, in toInternal
else time.mktime(dt.timetuple()))
AttributeError: 'datetime.time' object has no attribute 'timetuple'
>>> spark.createDataFrame([[None]])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/.../spark/python/pyspark/sql/session.py", line 751, in createDataFrame
rdd, schema = self._createFromLocal(map(prepare, data), schema)
File "/.../spark/python/pyspark/sql/session.py", line 419, in _createFromLocal
struct = self._inferSchemaFromList(data, names=schema)
File "/.../python/pyspark/sql/session.py", line 353, in _inferSchemaFromList
raise ValueError("Some of types cannot be determined after inferring")
ValueError: Some of types cannot be determined after inferring
```
## How was this patch tested?
Manual tests and unit tests were added.
Closes#22653 from HyukjinKwon/SPARK-25659.
Authored-by: hyukjinkwon <gurwls223@apache.org>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
We are facing some problems with type conversions between Python data and SQL types in UDFs (Pandas UDFs as well).
It's even difficult to identify the problems (see https://github.com/apache/spark/pull/20163 and https://github.com/apache/spark/pull/22610).
This PR targets to internally document the type conversion table. Some of the conversions look buggy and we should fix them.
```python
import sys
import array
import datetime
from decimal import Decimal

from pyspark.sql import Row
from pyspark.sql.types import *
from pyspark.sql.functions import udf

if sys.version >= '3':
    long = int

data = [
    None,
    True,
    1,
    long(1),
    "a",
    u"a",
    datetime.date(1970, 1, 1),
    datetime.datetime(1970, 1, 1, 0, 0),
    1.0,
    array.array("i", [1]),
    [1],
    (1,),
    bytearray([65, 66, 67]),
    Decimal(1),
    {"a": 1},
    Row(kwargs=1),
    Row("namedtuple")(1),
]

types = [
    BooleanType(),
    ByteType(),
    ShortType(),
    IntegerType(),
    LongType(),
    StringType(),
    DateType(),
    TimestampType(),
    FloatType(),
    DoubleType(),
    ArrayType(IntegerType()),
    BinaryType(),
    DecimalType(10, 0),
    MapType(StringType(), IntegerType()),
    StructType([StructField("_1", IntegerType())]),
]

df = spark.range(1)
results = []
count = 0
total = len(types) * len(data)
spark.sparkContext.setLogLevel("FATAL")
for t in types:
    result = []
    for v in data:
        try:
            row = df.select(udf(lambda: v, t)()).first()
            ret_str = repr(row[0])
        except Exception:
            ret_str = "X"
        result.append(ret_str)
        progress = "SQL Type: [%s]\n Python Value: [%s(%s)]\n Result Python Value: [%s]" % (
            t.simpleString(), str(v), type(v).__name__, ret_str)
        count += 1
        print("%s/%s:\n %s" % (count, total, progress))
    results.append([t.simpleString()] + list(map(str, result)))

schema = ["SQL Type \\ Python Value(Type)"] + list(map(lambda v: "%s(%s)" % (str(v), type(v).__name__), data))
strings = spark.createDataFrame(results, schema=schema)._jdf.showString(20, 20, False)
print("\n".join(map(lambda line: " # %s # noqa" % line, strings.strip().split("\n"))))
```
This table was generated under Python 2 but the code above is Python 3 compatible as well.
## How was this patch tested?
Manually tested and lint check.
Closes#22655 from HyukjinKwon/SPARK-25666.
Authored-by: hyukjinkwon <gurwls223@apache.org>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
If we use accumulators in more than one UDF, it is possible to overwrite deserialized accumulators and their values. We should check if an accumulator was deserialized before overwriting it in the accumulator registry.
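A minimal sketch of the guard, assuming the registry shape in `pyspark/accumulators.py` (the exact patch may differ):
```python
# Sketch: reuse an already-deserialized accumulator for this id instead of
# overwriting it, so updates from multiple UDFs accumulate correctly.
def _deserialize_accumulator(aid, zero_value, accum_param):
    from pyspark.accumulators import Accumulator, _accumulatorRegistry
    if aid in _accumulatorRegistry:
        return _accumulatorRegistry[aid]
    accum = Accumulator(aid, zero_value, accum_param)
    accum._deserialized = True
    _accumulatorRegistry[aid] = accum
    return accum
```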
## How was this patch tested?
Added test.
Closes#22635 from viirya/SPARK-25591.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
For Pandas UDFs, we get the Arrow type from the defined Catalyst return data type of the UDF, and we use this Arrow type to serialize the data. If the defined return data type doesn't match the actual return type of the Pandas.Series returned by the UDF, there is a risk of returning incorrect data from the Python side.
Currently we don't have a reliable approach to check whether the data conversion is safe or not, so for now we leave a note in the documentation to warn users. When a PyArrow upgrade that we can use for this check becomes available, we should add an option to perform it.
## How was this patch tested?
Only document change.
Closes#22610 from viirya/SPARK-25461.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
Adds support for setting the limit in the SQL `split` function.
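A hedged example through the PySpark API (function shape assumed):
```python
from pyspark.sql.functions import split

df = spark.createDataFrame([("one,two,three",)], ["s"])
# With limit=2, splitting stops after the first match.
df.select(split(df.s, ",", 2).alias("parts")).first().parts
# ['one', 'two,three']
```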
## How was this patch tested?
1. Updated unit tests
2. Tested using Scala spark shell
Closes#22227 from phegstrom/master.
Authored-by: Parker Hegstrom <phegstrom@palantir.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
This PR proposes to allow registering grouped aggregate (vectorized) pandas UDFs so they can be used in SQL statements, for instance:
```python
from pyspark.sql.functions import pandas_udf, PandasUDFType
pandas_udf("integer", PandasUDFType.GROUPED_AGG)
def sum_udf(v):
return v.sum()
spark.udf.register("sum_udf", sum_udf)
q = "SELECT v2, sum_udf(v1) FROM VALUES (3, 0), (2, 0), (1, 1) tbl(v1, v2) GROUP BY v2"
spark.sql(q).show()
```
```
+---+-----------+
| v2|sum_udf(v1)|
+---+-----------+
| 1| 1|
| 0| 5|
+---+-----------+
```
## How was this patch tested?
Manual test and unit test.
Closes#22620 from HyukjinKwon/SPARK-25601.
Authored-by: hyukjinkwon <gurwls223@apache.org>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
This patch is to bump the master branch version to 3.0.0-SNAPSHOT.
## How was this patch tested?
N/A
Closes#22606 from gatorsmile/bump3.0.
Authored-by: gatorsmile <gatorsmile@gmail.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
## What changes were proposed in this pull request?
Add more data types for Pandas UDF Tests for PySpark SQL
## How was this patch tested?
manual tests
Closes#22568 from AlexanderKoryagin/new_types_for_pandas_udf_tests.
Lead-authored-by: Aleksandr Koriagin <aleksandr_koriagin@epam.com>
Co-authored-by: hyukjinkwon <gurwls223@apache.org>
Co-authored-by: Alexander Koryagin <AlexanderKoryagin@users.noreply.github.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
In the PR, I propose to extend the `schema_of_json()` function to accept JSON options, since they can impact schema inference. The purpose is to support the same options that `from_json` can use during schema inference.
## How was this patch tested?
Added SQL, Python and Scala tests (`JsonExpressionsSuite` and `JsonFunctionsSuite`) that check the JSON options are used.
Closes#22442 from MaxGekk/schema_of_json-options.
Authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
Thanks to bahchis for reporting this. This is more of a follow-up to #16581; this PR fixes the scenario of a Python UDF accessing attributes from both sides of a join in the join condition.
## How was this patch tested?
Add regression tests in PySpark and `BatchEvalPythonExecSuite`.
Closes#22326 from xuanyuanking/SPARK-25314.
Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
We have an agreement that the behavior of `from/to_utc_timestamp` is correct, although the function itself doesn't make much sense in Spark: https://issues.apache.org/jira/browse/SPARK-23715
This PR improves the document.
## How was this patch tested?
N/A
Closes#22543 from cloud-fan/doc.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
In [SPARK-20946](https://issues.apache.org/jira/browse/SPARK-20946), we modified `SparkSession.getOrCreate` to not update conf for existing `SparkContext` because `SparkContext` is shared by all sessions.
We should not update it on the PySpark side either.
## How was this patch tested?
Added tests.
Closes#22545 from ueshin/issues/SPARK-25525/not_update_existing_conf.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
In Scala, `HiveContext` sets a config `spark.sql.catalogImplementation` of the given `SparkContext` and then passes to `SparkSession.builder`.
The `HiveContext` in PySpark should behave the same as in Scala.
## How was this patch tested?
Existing tests.
Closes#22552 from ueshin/issues/SPARK-25540/hive_context.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
The PR introduces a new JSON option `pretty` which allows turning on `Jackson`'s `DefaultPrettyPrinter` in the JSON generator. The new option is useful for exploring deeply nested columns and for converting JSON columns into a more readable representation (see the added test).
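A hedged illustration of the option (column names are placeholders):
```python
from pyspark.sql.functions import struct, to_json

# Emit multi-line, indented JSON instead of the compact default.
df.select(to_json(struct("a", "b"), {"pretty": "true"}))
```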
## How was this patch tested?
Added a round-trip test which converts a JSON string to a pretty representation via `from_json()` and `to_json()`.
Closes#22534 from MaxGekk/pretty-json.
Lead-authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
Add the legacy prefix for spark.sql.execution.pandas.groupedMap.assignColumnsByPosition and rename it to spark.sql.legacy.execution.pandas.groupedMap.assignColumnsByName
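For illustration, setting the renamed config might look like the following (the value semantics are assumed from the rename, with `ByName` the negation of the old `ByPosition`):
```python
# Hypothetical example: restore positional assignment by disabling
# assignment by name under the new legacy key.
spark.conf.set(
    "spark.sql.legacy.execution.pandas.groupedMap.assignColumnsByName",
    "false")
```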
## How was this patch tested?
The existing tests.
Closes#22540 from gatorsmile/renameAssignColumnsByPosition.
Lead-authored-by: gatorsmile <gatorsmile@gmail.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
This PR does not fix the problem itself; it just adds a few comments about running PySpark tests on Python 3.6 and macOS High Sierra, since the problem actually blocks running tests in that environment.
The problem looks to occur because we fork Python workers and the forked workers somehow call Objective-C libraries somewhere in CPython's implementation. After debugging a while, I suspect `pickle` in Python 3.6 has some changes:
58419b9267/python/pyspark/serializers.py (L577)
In particular, it also looks related to which objects are serialized and which are not.
This link (http://sealiesoftware.com/blog/archive/2017/6/5/Objective-C_and_fork_in_macOS_1013.html) and this link (https://blog.phusion.nl/2017/10/13/why-ruby-app-servers-break-on-macos-high-sierra-and-what-can-be-done-about-it/) were helpful for me to understand this.
I am still debugging this, but my gut says it's difficult to fix or work around within the Spark side.
## How was this patch tested?
Manually tested:
Before `OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES`:
```
/usr/local/Cellar/python/3.6.5/Frameworks/Python.framework/Versions/3.6/lib/python3.6/subprocess.py:766: ResourceWarning: subprocess 27563 is still running
ResourceWarning, source=self)
[Stage 0:> (0 + 1) / 1]objc[27586]: +[__NSPlaceholderDictionary initialize] may have been in progress in another thread when fork() was called.
objc[27586]: +[__NSPlaceholderDictionary initialize] may have been in progress in another thread when fork() was called. We cannot safely call it or ignore it in the fork() child process. Crashing instead. Set a breakpoint on objc_initializeAfterForkError to debug.
ERROR
======================================================================
ERROR: test_streaming_foreach_with_simple_function (pyspark.sql.tests.SQLTests)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/.../spark/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/.../spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o54.processAllAvailable.
: org.apache.spark.sql.streaming.StreamingQueryException: Writing job aborted.
=== Streaming Query ===
Identifier: [id = f508d634-407c-4232-806b-70e54b055c42, runId = 08d1435b-5358-4fb6-b167-811584a3163e]
Current Committed Offsets: {}
Current Available Offsets: {FileStreamSource[file:/var/folders/71/484zt4z10ks1vydt03bhp6hr0000gp/T/tmpolebys1s]: {"logOffset":0}}
Current State: ACTIVE
Thread State: RUNNABLE
Logical Plan:
FileStreamSource[file:/var/folders/71/484zt4z10ks1vydt03bhp6hr0000gp/T/tmpolebys1s]
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:295)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189)
Caused by: org.apache.spark.SparkException: Writing job aborted.
at org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec.doExecute(WriteToDataSourceV2Exec.scala:91)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
```
After `OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES`:
```
test_streaming_foreach_with_simple_function (pyspark.sql.tests.SQLTests) ...
ok
```
Closes#22480 from HyukjinKwon/SPARK-25473.
Authored-by: hyukjinkwon <gurwls223@apache.org>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
Always close the tempFile after `serializer.dump_stream(data, tempFile)` in _serialize_to_jvm
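A minimal sketch of the change (the surrounding `_serialize_to_jvm` code is elided and assumed):
```python
from tempfile import NamedTemporaryFile

# Sketch: ensure the temp file is closed even if dump_stream raises.
tempFile = NamedTemporaryFile(delete=False, dir=sc._temp_dir)
try:
    serializer.dump_stream(data, tempFile)
finally:
    tempFile.close()
```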
## How was this patch tested?
N/A
Closes#22523 from gatorsmile/fixMinor.
Authored-by: gatorsmile <gatorsmile@gmail.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
In the PR, I propose to add an overloaded method for `sampleBy` which accepts the first argument of the `Column` type. This will allow sampling by any complex column as well as sampling by multiple columns. For example:
```Scala
spark.createDataFrame(Seq(("Bob", 17), ("Alice", 10), ("Nico", 8), ("Bob", 17),
    ("Alice", 10))).toDF("name", "age")
  .stat
  .sampleBy(struct($"name", $"age"), Map(Row("Alice", 10) -> 0.3, Row("Nico", 8) -> 1.0), 36L)
  .show()
+-----+---+
| name|age|
+-----+---+
| Nico| 8|
|Alice| 10|
+-----+---+
```
## How was this patch tested?
Added a new test for sampling by multiple columns for Scala, and tests for Java and Python to check that `sampleBy` is able to sample by a `Column`-type argument.
Closes#22365 from MaxGekk/sample-by-column.
Authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
This PR proposes to add a helper in `PythonUtils` instead of directly accessing the Scala package.
## How was this patch tested?
Jenkins tests.
Closes#22483 from HyukjinKwon/minor-refactoring.
Authored-by: hyukjinkwon <gurwls223@apache.org>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
In ArrayContains, we currently cast the right-hand side expression to match the element type of the left-hand side array. This may result in down-casting and may return a wrong or questionable result.
Example :
```SQL
spark-sql> select array_contains(array(1), 1.34);
true
```
```SQL
spark-sql> select array_contains(array(1), 'foo');
null
```
We should safely coerce both left and right hand side expressions.
## How was this patch tested?
Added tests in DataFrameFunctionsSuite
Closes#22408 from dilipbiswal/SPARK-25417.
Authored-by: Dilip Biswal <dbiswal@us.ibm.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
Fix a test that constructs a Pandas DataFrame by specifying the column order. Previously this test assumed the columns would be sorted alphabetically; however, when using Python 3.6 with Pandas 0.23 or higher, the original column order is maintained. This caused the columns to get mixed up and the test to error.
Manually tested with `python/run-tests` using Python 3.6.6 and Pandas 0.23.4
Closes#22477 from BryanCutler/pyspark-tests-py36-pd23-SPARK-25471.
Authored-by: Bryan Cutler <cutlerb@gmail.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
In the dev list, we can still discuss whether the next version is 2.5.0 or 3.0.0. Let us first bump the master branch version to `2.5.0-SNAPSHOT`.
## How was this patch tested?
N/A
Closes#22426 from gatorsmile/bumpVersionMaster.
Authored-by: gatorsmile <gatorsmile@gmail.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
(This change is a subset of the changes needed for the JIRA; see https://github.com/apache/spark/pull/22231)
## What changes were proposed in this pull request?
Use raw strings and simpler regex syntax consistently in Python, which also avoids warnings from pycodestyle about accidentally relying on Python's non-escaping of non-reserved chars in normal strings. Also, fix a few long lines.
## How was this patch tested?
Existing tests, and some manual double-checking of the behavior of regexes in Python 2/3 to be sure.
Closes#22400 from srowen/SPARK-25238.2.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
In the PR, I propose a new CSV option `emptyValue` and an update to the SQL Migration Guide which describes how to revert to the previous behavior, when empty strings were not written at all. Since Spark 2.4, empty strings are saved as `""` to distinguish them from saved `null`s.
Closes#22234
Closes#22367
## How was this patch tested?
It was tested by `CSVSuite` and new tests added in the PR #22234.
Closes#22389 from MaxGekk/csv-empty-value-master.
Lead-authored-by: Mario Molina <mmolimar@gmail.com>
Co-authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
Clarify docstring for Scalar functions
## How was this patch tested?
Adds a unit test showing use similar to wordcount; there's an existing unit test for arrays of floats as well.
Closes#20908 from holdenk/SPARK-23672-document-support-for-nested-return-types-in-scalar-with-arrow-udfs.
Authored-by: Holden Karau <holden@pigscanfly.ca>
Signed-off-by: Bryan Cutler <cutlerb@gmail.com>
## What changes were proposed in this pull request?
Deprecate public APIs from ImageSchema.
## How was this patch tested?
N/A
Closes#22349 from WeichenXu123/image_api_deprecate.
Authored-by: WeichenXu <weichen.xu@databricks.com>
Signed-off-by: Xiangrui Meng <meng@databricks.com>
## What changes were proposed in this pull request?
Add a value-length check in `_create_row`, forbidding extra values for custom Rows in PySpark.
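A minimal sketch of the check (the real `_create_row` does more; this is illustrative):
```python
# Sketch: reject rows constructed with more values than declared fields.
def _create_row(fields, values):
    if len(values) > len(fields):
        raise ValueError("Cannot create Row with fields %s: expected %d values "
                         "but got %d" % (fields, len(fields), len(values)))
    # ... construct the Row as before ...
```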
## How was this patch tested?
New UT in pyspark-sql
Closes#22140 from xuanyuanking/SPARK-25072.
Lead-authored-by: liyuanjian <liyuanjian@baidu.com>
Co-authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Bryan Cutler <cutlerb@gmail.com>
## What changes were proposed in this pull request?
This PR proposes to add another example for multiple grouping keys in grouped aggregate pandas UDFs, since this feature can still confuse users.
## How was this patch tested?
Manually tested and documentation built.
Closes#22329 from HyukjinKwon/SPARK-25328.
Authored-by: hyukjinkwon <gurwls223@apache.org>
Signed-off-by: Bryan Cutler <cutlerb@gmail.com>
## What changes were proposed in this pull request?
In the PR, I propose to extend `to_json` and support any type as the element type of input arrays. It should allow converting arrays of primitive types and arrays of arrays. For example:
```
select to_json(array('1','2','3'))
> ["1","2","3"]
select to_json(array(array(1,2,3),array(4)))
> [[1,2,3],[4]]
```
## How was this patch tested?
Added a couple of SQL tests for arrays of primitive types and of arrays. Also I added a round-trip test `from_json` -> `to_json`.
Closes#22226 from MaxGekk/to_json-array.
Authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
Implement an image schema datasource.
This image datasource supports:
- partition discovery (loading partitioned images)
- dropImageFailures (the same behavior as `ImageSchema.readImage`)
- path wildcard matching (the same behavior as `ImageSchema.readImage`)
- loading recursively from a directory (different from `ImageSchema.readImage`; use a path such as `/path/to/dir/**`)
This datasource does **NOT** support:
- specifying `numPartitions` (it will be determined by the datasource automatically)
- sampling (you can use `df.sample` later, but the sampling operator won't be pushed down to the datasource)
## How was this patch tested?
Unit tests.
## Benchmark
I benchmarked and compared the time cost of the old `ImageSchema.read` API and my image datasource.
**cluster**: 4 nodes, each with 64GB memory, 8 cores CPU
**test dataset**: Flickr8k_Dataset (about 8091 images)
**time cost**:
- My image datasource time (automatically generate 258 partitions): 38.04s
- `ImageSchema.read` time (set 16 partitions): 68.4s
- `ImageSchema.read` time (set 258 partitions): 90.6s
**time cost when the image count is doubled (clone Flickr8k_Dataset and load twice the number of images)**:
- My image datasource time (automatically generate 515 partitions): 95.4s
- `ImageSchema.read` (set 32 partitions): 109s
- `ImageSchema.read` (set 515 partitions): 105s
So we can see that my image datasource implementation (this PR) brings some performance improvement compared with the old `ImageSchema.read` API.
Closes#22328 from WeichenXu123/image_datasource.
Authored-by: WeichenXu <weichen.xu@databricks.com>
Signed-off-by: Xiangrui Meng <meng@databricks.com>
## What changes were proposed in this pull request?
The PR adds the lift measure to Association rules.
## How was this patch tested?
existing and modified UTs
Closes#22236 from mgaido91/SPARK-10697.
Authored-by: Marco Gaido <marcogaido91@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
cc: HyukjinKwon
## What changes were proposed in this pull request?
Add `from pyspark.util import _exception_message` to python/pyspark/java_gateway.py.
## How was this patch tested?
[flake8](http://flake8.pycqa.org) testing of https://github.com/apache/spark on Python 3.7.0
$ __flake8 . --count --select=E901,E999,F821,F822,F823 --show-source --statistics__
```
./python/pyspark/java_gateway.py:172:20: F821 undefined name '_exception_message'
emsg = _exception_message(e)
^
1 F821 undefined name '_exception_message'
1
```
Closes#22265 from cclauss/patch-2.
Authored-by: cclauss <cclauss@bluewin.ch>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
I made one pass over the Python APIs for barrier mode and updated them to match the Scala doc in #22240. Major changes:
* export the public classes
* expand the docs
* add doc for BarrierTaskInfo.address
cc: jiangxb1987
Closes#22261 from mengxr/SPARK-25248.1.
Authored-by: Xiangrui Meng <meng@databricks.com>
Signed-off-by: Xiangrui Meng <meng@databricks.com>
## What changes were proposed in this pull request?
This changes the calls of `toPandas()` and `createDataFrame()` to use the Arrow stream format, when Arrow is enabled. Previously, Arrow data was written to byte arrays where each chunk is an output of the Arrow file format. This was mainly due to constraints at the time, and caused some overhead by writing the schema/footer on each chunk of data and then having to read multiple Arrow file inputs and concat them together.
Using the Arrow stream format has improved things by increasing performance, lowering memory overhead in the average case, and simplifying the code. Here are the details of this change:
**toPandas()**
_Before:_
Spark internal rows are converted to Arrow file format; each group of records is a complete Arrow file which contains the schema and other metadata. Next a collect is done and an array of Arrow files is the result. After that, each Arrow file is sent to the Python driver, which then loads each file and concatenates them into a single DataFrame.
_After:_
Spark internal rows are converted to ArrowRecordBatches directly, which is the simplest Arrow component for IPC data transfers. The driver JVM then immediately starts serving data to Python as an Arrow stream, sending the schema first. It then starts a Spark job with a custom handler that sends Arrow RecordBatches to Python. Partitions arriving in order are sent immediately, and out-of-order partitions are buffered until the ones that precede them come in. This improves performance, simplifies memory usage on executors, and improves the average memory usage on the JVM driver. Since the order of partitions must be preserved, the worst case is that the first partition is the last to arrive, so all data must be buffered in memory until then. This case is no worse than before when doing a full collect.
**createDataFrame()**
_Before:_
A Pandas DataFrame is split into parts and each part is made into an Arrow file. Then each file is prefixed by the buffer size and written to a temp file. The temp file is read and each Arrow file is parallelized as a byte array.
_After:_
A Pandas DataFrame is split into parts, then an Arrow stream is written to a temp file where each part is an ArrowRecordBatch. The temp file is read as a stream and the Arrow messages are examined. If a message is an ArrowRecordBatch, the data is saved as a byte array. After reading the file, each ArrowRecordBatch is parallelized as a byte array. This has slightly more processing than before because we must inspect each Arrow message to extract the record batches, but performance ends up a little better. It is cleaner in the sense that IPC from Python to JVM is done over a single Arrow stream.
## How was this patch tested?
Added new unit tests for the additions to ArrowConverters in Scala, existing tests for Python.
## Performance Tests - toPandas
Tests run on a 4 node standalone cluster with 32 cores total, 14.04.1-Ubuntu and OpenJDK 8
measured wall clock time to execute `toPandas()` and took the average best time of 5 runs/5 loops each.
Test code
```python
import time
from pyspark.sql.functions import rand

df = spark.range(1 << 25, numPartitions=32).toDF("id") \
    .withColumn("x1", rand()).withColumn("x2", rand()) \
    .withColumn("x3", rand()).withColumn("x4", rand())
for i in range(5):
    start = time.time()
    _ = df.toPandas()
    elapsed = time.time() - start
```
Current Master | This PR
---------------------|------------
5.803557 | 5.16207
5.409119 | 5.133671
5.493509 | 5.147513
5.433107 | 5.105243
5.488757 | 5.018685
Avg Master | Avg This PR
------------------|--------------
5.5256098 | 5.1134364
Speedup of **1.08060595**
## Performance Tests - createDataFrame
Tests run on a 4 node standalone cluster with 32 cores total, 14.04.1-Ubuntu and OpenJDK 8
measured wall clock time to execute `createDataFrame()` and get the first record. Took the average best time of 5 runs/5 loops each.
Test code
```python
import gc
import time

import numpy as np
import pandas as pd

def run():
    pdf = pd.DataFrame(np.random.rand(10000000, 10))
    spark.createDataFrame(pdf).first()

for i in range(6):
    start = time.time()
    run()
    elapsed = time.time() - start
    gc.collect()
    print("Run %d: %f" % (i, elapsed))
```
Current Master | This PR
--------------------|----------
6.234608 | 5.665641
6.32144 | 5.3475
6.527859 | 5.370803
6.95089 | 5.479151
6.235046 | 5.529167
Avg Master | Avg This PR
---------------|----------------
6.4539686 | 5.4784524
Speedup of **1.178064192**
## Memory Improvements
**toPandas()**
The most significant improvement is the reduction of the upper-bound space complexity in the JVM driver. Before, the entire dataset was collected in the JVM first before sending it to Python. With this change, as soon as a partition is collected, the result handler immediately sends it to Python, so the upper bound is the size of the largest partition. Also, using the Arrow stream format is more efficient because the schema is written once per stream, followed by record batches. The schema is now only sent from the driver JVM to Python. Before, multiple Arrow file formats were used, each containing the schema. This duplicated schema was created in the executors, sent to the driver JVM, and then to Python, where all but the first one received are discarded.
I verified the upper bound limit by running a test that would collect data that would exceed the amount of driver JVM memory available. Using these settings on a standalone cluster:
```
spark.driver.memory 1g
spark.executor.memory 5g
spark.sql.execution.arrow.enabled true
spark.sql.execution.arrow.fallback.enabled false
spark.sql.execution.arrow.maxRecordsPerBatch 0
spark.driver.maxResultSize 2g
```
Test code:
```python
from pyspark.sql.functions import rand
df = spark.range(1 << 25, numPartitions=32).toDF("id").withColumn("x1", rand()).withColumn("x2", rand()).withColumn("x3", rand())
df.toPandas()
```
This makes total data size of 33554432×8×4 = 1073741824
With the current master, it fails with OOM but passes using this PR.
**createDataFrame()**
No significant change in memory, except that using the stream format instead of separate file formats avoids duplicating the schema, similar to toPandas above. The process of reading the stream and parallelizing the batches does cause the record batch message metadata to be copied, but its size is insignificant.
Closes#21546 from BryanCutler/arrow-toPandas-stream-SPARK-23030.
Authored-by: Bryan Cutler <cutlerb@gmail.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
This eliminates some duplication in the code to connect to a server on localhost to talk directly to the JVM. It also gives consistent IPv6 and error handling. Two other incidental changes, which shouldn't matter:
1) python barrier tasks perform authentication immediately (rather than waiting for the BARRIER_FUNCTION indicator)
2) for `rdd._load_from_socket`, the timeout is only increased after authentication.
Closes#22247 from squito/py_connection_refactor.
Authored-by: Imran Rashid <irashid@cloudera.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
This adds `spark.executor.pyspark.memory` to configure Python's address space limit, [`resource.RLIMIT_AS`](https://docs.python.org/3/library/resource.html#resource.RLIMIT_AS). Limiting Python's address space allows Python to participate in memory management. In practice, we see fewer cases of Python taking too much memory because it doesn't know to run garbage collection. This results in YARN killing fewer containers. This also improves error messages so users know that Python is consuming too much memory:
```
File "build/bdist.linux-x86_64/egg/package/library.py", line 265, in fe_engineer
fe_eval_rec.update(f(src_rec_prep, mat_rec_prep))
File "build/bdist.linux-x86_64/egg/package/library.py", line 163, in fe_comp
comparisons = EvaluationUtils.leven_list_compare(src_rec_prep.get(item, []), mat_rec_prep.get(item, []))
File "build/bdist.linux-x86_64/egg/package/evaluationutils.py", line 25, in leven_list_compare
permutations = sorted(permutations, reverse=True)
MemoryError
```
The new pyspark memory setting is used to increase requested YARN container memory, instead of sharing overhead memory between python and off-heap JVM activity.
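A hedged sketch of the mechanism on the worker side (the config-to-bytes conversion is elided; the value is a placeholder):
```python
import resource

# Sketch: cap the Python worker's address space so allocations beyond the
# configured spark.executor.pyspark.memory raise MemoryError instead of
# growing the container past its YARN limit.
limit_bytes = 512 * 1024 * 1024  # placeholder value
resource.setrlimit(resource.RLIMIT_AS, (limit_bytes, limit_bytes))
```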
## How was this patch tested?
Tested memory limits in our YARN cluster and verified that MemoryError is thrown.
Author: Ryan Blue <blue@apache.org>
Closes#21977 from rdblue/SPARK-25004-add-python-memory-limit.
## What changes were proposed in this pull request?
The PR excludes Python UDF filters in FileSourceStrategy so that they don't cause the ExtractPythonUDF rule to throw an exception. It doesn't make sense to pass Python UDF filters to FileSourceStrategy anyway, because they cannot be used as push-down filters.
## How was this patch tested?
Add a new regression test
Closes#22104 from icexelloss/SPARK-24721-udf-filter.
Authored-by: Li Jin <ice.xelloss@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
In feature.py, VectorSizeHint's setSize and getSize don't return a value. Add the missing returns.
## How was this patch tested?
I tested the changes locally.
Closes#22136 from huaxingao/spark-25124.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
## What changes were proposed in this pull request?
Include `PandasUDFType` in the `__all__` import list of pyspark.sql.functions.
## How was this patch tested?
Run the test case from the pyspark shell from the jira [spark-25105](https://jira.apache.org/jira/browse/SPARK-25105?jql=project%20%3D%20SPARK%20AND%20component%20in%20(ML%2C%20PySpark%2C%20SQL%2C%20%22Structured%20Streaming%22))
I manually test on pyspark-shell:
before:
```
>>> from pyspark.sql.functions import *
>>> foo = pandas_udf(lambda x: x, 'v int', PandasUDFType.GROUPED_MAP)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'PandasUDFType' is not defined
>>>
```
after:
```
>>> from pyspark.sql.functions import *
>>> foo = pandas_udf(lambda x: x, 'v int', PandasUDFType.GROUPED_MAP)
>>>
```
Closes#22100 from kevinyu98/spark-25105.
Authored-by: Kevin Yu <qyu@us.ibm.com>
Signed-off-by: Bryan Cutler <cutlerb@gmail.com>
## What changes were proposed in this pull request?
Fix issues arising from the fact that builtins `file`, `long`, `raw_input()`, `unicode`, `xrange()`, etc. were all removed from Python 3. **Undefined names** have the potential to raise [NameError](https://docs.python.org/3/library/exceptions.html#NameError) at runtime.
## How was this patch tested?
* $ __python2 -m flake8 . --count --select=E9,F82 --show-source --statistics__
* $ __python3 -m flake8 . --count --select=E9,F82 --show-source --statistics__
cc: holdenk
flake8 testing of https://github.com/apache/spark on Python 3.6.3
$ __python3 -m flake8 . --count --select=E901,E999,F821,F822,F823 --show-source --statistics__
```
./dev/merge_spark_pr.py:98:14: F821 undefined name 'raw_input'
result = raw_input("\n%s (y/n): " % prompt)
^
./dev/merge_spark_pr.py:136:22: F821 undefined name 'raw_input'
primary_author = raw_input(
^
./dev/merge_spark_pr.py:186:16: F821 undefined name 'raw_input'
pick_ref = raw_input("Enter a branch name [%s]: " % default_branch)
^
./dev/merge_spark_pr.py:233:15: F821 undefined name 'raw_input'
jira_id = raw_input("Enter a JIRA id [%s]: " % default_jira_id)
^
./dev/merge_spark_pr.py:278:20: F821 undefined name 'raw_input'
fix_versions = raw_input("Enter comma-separated fix version(s) [%s]: " % default_fix_versions)
^
./dev/merge_spark_pr.py:317:28: F821 undefined name 'raw_input'
raw_assignee = raw_input(
^
./dev/merge_spark_pr.py:430:14: F821 undefined name 'raw_input'
pr_num = raw_input("Which pull request would you like to merge? (e.g. 34): ")
^
./dev/merge_spark_pr.py:442:18: F821 undefined name 'raw_input'
result = raw_input("Would you like to use the modified title? (y/n): ")
^
./dev/merge_spark_pr.py:493:11: F821 undefined name 'raw_input'
while raw_input("\n%s (y/n): " % pick_prompt).lower() == "y":
^
./dev/create-release/releaseutils.py:58:16: F821 undefined name 'raw_input'
response = raw_input("%s [y/n]: " % msg)
^
./dev/create-release/releaseutils.py:152:38: F821 undefined name 'unicode'
author = unidecode.unidecode(unicode(author, "UTF-8")).strip()
^
./python/setup.py:37:11: F821 undefined name '__version__'
VERSION = __version__
^
./python/pyspark/cloudpickle.py:275:18: F821 undefined name 'buffer'
dispatch[buffer] = save_buffer
^
./python/pyspark/cloudpickle.py:807:18: F821 undefined name 'file'
dispatch[file] = save_file
^
./python/pyspark/sql/conf.py:61:61: F821 undefined name 'unicode'
if not isinstance(obj, str) and not isinstance(obj, unicode):
^
./python/pyspark/sql/streaming.py:25:21: F821 undefined name 'long'
intlike = (int, long)
^
./python/pyspark/streaming/dstream.py:405:35: F821 undefined name 'long'
return self._sc._jvm.Time(long(timestamp * 1000))
^
./sql/hive/src/test/resources/data/scripts/dumpdata_script.py:21:10: F821 undefined name 'xrange'
for i in xrange(50):
^
./sql/hive/src/test/resources/data/scripts/dumpdata_script.py:22:14: F821 undefined name 'xrange'
for j in xrange(5):
^
./sql/hive/src/test/resources/data/scripts/dumpdata_script.py:23:18: F821 undefined name 'xrange'
for k in xrange(20022):
^
20 F821 undefined name 'raw_input'
20
```
Closes#20838 from cclauss/fix-undefined-names.
Authored-by: cclauss <cclauss@bluewin.ch>
Signed-off-by: Bryan Cutler <cutlerb@gmail.com>
## What changes were proposed in this pull request?
Add methods `barrier()` and `getTaskInfos()` in the Python TaskContext; these two methods are only allowed for barrier tasks.
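A hedged usage example, using the public names as they appear in PySpark (the RDD contents and partitioning are placeholders):
```python
from pyspark import BarrierTaskContext

def f(iterator):
    ctx = BarrierTaskContext.get()
    ctx.barrier()               # wait until all tasks in the stage reach here
    infos = ctx.getTaskInfos()  # info (e.g. addresses) for all barrier tasks
    yield len(infos)

sc.parallelize(range(4), 2).barrier().mapPartitions(f).collect()
```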
## How was this patch tested?
Add new tests in `tests.py`
Closes#22085 from jiangxb1987/python.barrier.
Authored-by: Xingbo Jiang <xingbo.jiang@databricks.com>
Signed-off-by: Xiangrui Meng <meng@databricks.com>
## What changes were proposed in this pull request?
Sometimes the Python worker can't open a socket in `_load_from_socket` for some reason, but it's difficult to figure out why because the exception doesn't even contain the messages from the `socket.error`s.
We should at least add the error messages when raising the exception.
## How was this patch tested?
Manually in my local environment.
Closes#22132 from ueshin/issues/SPARK-25142/socket_error.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
Adding `BinaryType` support for Arrow in pyspark, conditional on using pyarrow >= 0.10.0. Earlier versions will continue to raise a TypeError.
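A hedged example of the newly supported type under Arrow (the config key is the Spark 2.x name):
```python
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

df = spark.createDataFrame([(bytearray(b"ab"),)], "b binary")
pdf = df.toPandas()  # served via Arrow when pyarrow >= 0.10.0 is installed
```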
## How was this patch tested?
Additional unit tests in pyspark for code paths that use Arrow for createDataFrame, toPandas, and scalar pandas_udfs.
Closes#20725 from BryanCutler/arrow-binary-type-support-SPARK-23555.
Authored-by: Bryan Cutler <cutlerb@gmail.com>
Signed-off-by: Bryan Cutler <cutlerb@gmail.com>
## What changes were proposed in this pull request?
Follow-up for SPARK-24665; fixes some other hard-coded values found during code review.
## How was this patch tested?
Existing UT.
Closes#22122 from xuanyuanking/SPARK-24665-follow.
Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
Upgrade Apache Arrow to 0.10.0
Version 0.10.0 has a number of bug fixes and improvements with the following pertaining directly to usage in Spark:
* Allow for adding BinaryType support ARROW-2141
* Bug fix related to array serialization ARROW-1973
* Python2 str will be made into an Arrow string instead of bytes ARROW-2101
* Python bytearrays are supported as input to pyarrow ARROW-2141
* Java has common interface for reset to cleanup complex vectors in Spark ArrowWriter ARROW-1962
* Cleanup pyarrow type equality checks ARROW-2423
* ArrowStreamWriter should not hold references to ArrowBlocks ARROW-2632, ARROW-2645
* Improved low level handling of messages for RecordBatch ARROW-2704
## How was this patch tested?
existing tests
Author: Bryan Cutler <cutlerb@gmail.com>
Closes#21939 from BryanCutler/arrow-upgrade-010.
## What changes were proposed in this pull request?
The PR removes a restriction on element types of the array type which exists in `from_json` for the root type. Currently, the function can handle only arrays of structs; even arrays of primitive types are disallowed. The PR allows arrays of any type currently supported by the JSON datasource. Here is an example of an array of a primitive type:
```
scala> import org.apache.spark.sql.functions._
scala> val df = Seq("[1, 2, 3]").toDF("a")
scala> val schema = new ArrayType(IntegerType, false)
scala> val arr = df.select(from_json($"a", schema))
scala> arr.printSchema
root
 |-- jsontostructs(a): array (nullable = true)
 |    |-- element: integer (containsNull = true)
```
and the result of converting the JSON string to the `ArrayType`:
```
scala> arr.show
+----------------+
|jsontostructs(a)|
+----------------+
| [1, 2, 3]|
+----------------+
```
## How was this patch tested?
I added a few positive and negative tests:
- array of primitive types
- array of arrays
- array of structs
- array of maps
Closes#21439 from MaxGekk/from_json-array.
Lead-authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
When the grid of parameters is created in `ParamGridBuilder`, implicit type coercion is not enforced, so using an integer in the list of values for a parameter accepting a double can cause a ClassCastException.
The PR proposes to enforce the type coercion when building the parameters.
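A hedged example of the failure mode being fixed (the estimator and param are placeholders):
```python
from pyspark.ml.regression import LinearRegression
from pyspark.ml.tuning import ParamGridBuilder

lr = LinearRegression()
# regParam expects a double; plain Python ints in the grid previously risked
# a ClassCastException on the JVM side and are now coerced to floats.
grid = ParamGridBuilder().addGrid(lr.regParam, [0, 1]).build()
```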
## How was this patch tested?
added UT
Closes#22076 from mgaido91/SPARK-25090.
Authored-by: Marco Gaido <marcogaido91@gmail.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
Small typo fixes in Pyspark. These were the only ones I stumbled across after looking around for a while.
## How was this patch tested?
Manually
Closes#22016 from tynan-cr/typo-fix-pyspark.
Authored-by: Tynan CR <tynancr@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
## What changes were proposed in this pull request?
Fixing typos is sometimes very hard. It's not so easy to visually review them. Recently, I discovered a very useful tool for it, [misspell](https://github.com/client9/misspell).
This pull request fixes minor typos detected by [misspell](https://github.com/client9/misspell) except for the false positives. If you would like me to work on other files as well, let me know.
## How was this patch tested?
### before
```
$ misspell . | grep -v '.js'
R/pkg/R/SQLContext.R:354:43: "definiton" is a misspelling of "definition"
R/pkg/R/SQLContext.R:424:43: "definiton" is a misspelling of "definition"
R/pkg/R/SQLContext.R:445:43: "definiton" is a misspelling of "definition"
R/pkg/R/SQLContext.R:495:43: "definiton" is a misspelling of "definition"
NOTICE-binary:454:16: "containd" is a misspelling of "contained"
R/pkg/R/context.R:46:43: "definiton" is a misspelling of "definition"
R/pkg/R/context.R:74:43: "definiton" is a misspelling of "definition"
R/pkg/R/DataFrame.R:591:48: "persistance" is a misspelling of "persistence"
R/pkg/R/streaming.R:166:44: "occured" is a misspelling of "occurred"
R/pkg/inst/worker/worker.R:65:22: "ouput" is a misspelling of "output"
R/pkg/tests/fulltests/test_utils.R:106:25: "environemnt" is a misspelling of "environment"
common/kvstore/src/test/java/org/apache/spark/util/kvstore/InMemoryStoreSuite.java:38:39: "existant" is a misspelling of "existent"
common/kvstore/src/test/java/org/apache/spark/util/kvstore/LevelDBSuite.java:83:39: "existant" is a misspelling of "existent"
common/network-common/src/main/java/org/apache/spark/network/crypto/TransportCipher.java:243:46: "transfered" is a misspelling of "transferred"
common/network-common/src/main/java/org/apache/spark/network/sasl/SaslEncryption.java:234:19: "transfered" is a misspelling of "transferred"
common/network-common/src/main/java/org/apache/spark/network/sasl/SaslEncryption.java:238:63: "transfered" is a misspelling of "transferred"
common/network-common/src/main/java/org/apache/spark/network/sasl/SaslEncryption.java:244:46: "transfered" is a misspelling of "transferred"
common/network-common/src/main/java/org/apache/spark/network/sasl/SaslEncryption.java:276:39: "transfered" is a misspelling of "transferred"
common/network-common/src/main/java/org/apache/spark/network/util/AbstractFileRegion.java:27:20: "transfered" is a misspelling of "transferred"
common/unsafe/src/test/scala/org/apache/spark/unsafe/types/UTF8StringPropertyCheckSuite.scala:195:15: "orgin" is a misspelling of "origin"
core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala:621:39: "gauranteed" is a misspelling of "guaranteed"
core/src/main/scala/org/apache/spark/status/storeTypes.scala:113:29: "ect" is a misspelling of "etc"
core/src/main/scala/org/apache/spark/storage/DiskStore.scala:282:18: "transfered" is a misspelling of "transferred"
core/src/main/scala/org/apache/spark/util/ListenerBus.scala:64:17: "overriden" is a misspelling of "overridden"
core/src/test/scala/org/apache/spark/ShuffleSuite.scala:211:7: "substracted" is a misspelling of "subtracted"
core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala:1922:49: "agriculteur" is a misspelling of "agriculture"
core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala:2468:84: "truely" is a misspelling of "truly"
core/src/test/scala/org/apache/spark/storage/FlatmapIteratorSuite.scala:25:18: "persistance" is a misspelling of "persistence"
core/src/test/scala/org/apache/spark/storage/FlatmapIteratorSuite.scala:26:69: "persistance" is a misspelling of "persistence"
data/streaming/AFINN-111.txt:1219:0: "humerous" is a misspelling of "humorous"
dev/run-pip-tests:55:28: "enviroments" is a misspelling of "environments"
dev/run-pip-tests:91:37: "virutal" is a misspelling of "virtual"
dev/merge_spark_pr.py:377:72: "accross" is a misspelling of "across"
dev/merge_spark_pr.py:378:66: "accross" is a misspelling of "across"
dev/run-pip-tests:126:25: "enviroments" is a misspelling of "environments"
docs/configuration.md:1830:82: "overriden" is a misspelling of "overridden"
docs/structured-streaming-programming-guide.md:525:45: "processs" is a misspelling of "processes"
docs/structured-streaming-programming-guide.md:1165:61: "BETWEN" is a misspelling of "BETWEEN"
docs/sql-programming-guide.md:1891:810: "behaivor" is a misspelling of "behavior"
examples/src/main/python/sql/arrow.py:98:8: "substract" is a misspelling of "subtract"
examples/src/main/python/sql/arrow.py:103:27: "substract" is a misspelling of "subtract"
licenses/LICENSE-heapq.txt:5:63: "Stichting" is a misspelling of "Stitching"
licenses/LICENSE-heapq.txt:6:2: "Mathematisch" is a misspelling of "Mathematics"
licenses/LICENSE-heapq.txt:262:29: "Stichting" is a misspelling of "Stitching"
licenses/LICENSE-heapq.txt:262:39: "Mathematisch" is a misspelling of "Mathematics"
licenses/LICENSE-heapq.txt:269:49: "Stichting" is a misspelling of "Stitching"
licenses/LICENSE-heapq.txt:269:59: "Mathematisch" is a misspelling of "Mathematics"
licenses/LICENSE-heapq.txt:274:2: "STICHTING" is a misspelling of "STITCHING"
licenses/LICENSE-heapq.txt:274:12: "MATHEMATISCH" is a misspelling of "MATHEMATICS"
licenses/LICENSE-heapq.txt:276:29: "STICHTING" is a misspelling of "STITCHING"
licenses/LICENSE-heapq.txt:276:39: "MATHEMATISCH" is a misspelling of "MATHEMATICS"
licenses-binary/LICENSE-heapq.txt:5:63: "Stichting" is a misspelling of "Stitching"
licenses-binary/LICENSE-heapq.txt:6:2: "Mathematisch" is a misspelling of "Mathematics"
licenses-binary/LICENSE-heapq.txt:262:29: "Stichting" is a misspelling of "Stitching"
licenses-binary/LICENSE-heapq.txt:262:39: "Mathematisch" is a misspelling of "Mathematics"
licenses-binary/LICENSE-heapq.txt:269:49: "Stichting" is a misspelling of "Stitching"
licenses-binary/LICENSE-heapq.txt:269:59: "Mathematisch" is a misspelling of "Mathematics"
licenses-binary/LICENSE-heapq.txt:274:2: "STICHTING" is a misspelling of "STITCHING"
licenses-binary/LICENSE-heapq.txt:274:12: "MATHEMATISCH" is a misspelling of "MATHEMATICS"
licenses-binary/LICENSE-heapq.txt:276:29: "STICHTING" is a misspelling of "STITCHING"
licenses-binary/LICENSE-heapq.txt:276:39: "MATHEMATISCH" is a misspelling of "MATHEMATICS"
mllib/src/main/resources/org/apache/spark/ml/feature/stopwords/hungarian.txt:170:0: "teh" is a misspelling of "the"
mllib/src/main/resources/org/apache/spark/ml/feature/stopwords/portuguese.txt:53:0: "eles" is a misspelling of "eels"
mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala:99:20: "Euclidian" is a misspelling of "Euclidean"
mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala:539:11: "Euclidian" is a misspelling of "Euclidean"
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala:77:36: "Teh" is a misspelling of "The"
mllib/src/main/scala/org/apache/spark/mllib/clustering/StreamingKMeans.scala:230:24: "inital" is a misspelling of "initial"
mllib/src/main/scala/org/apache/spark/mllib/stat/MultivariateOnlineSummarizer.scala:276:9: "Euclidian" is a misspelling of "Euclidean"
mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala:237:26: "descripiton" is a misspelling of "descriptions"
python/pyspark/find_spark_home.py:30:13: "enviroment" is a misspelling of "environment"
python/pyspark/context.py:937:12: "supress" is a misspelling of "suppress"
python/pyspark/context.py:938:12: "supress" is a misspelling of "suppress"
python/pyspark/context.py:939:12: "supress" is a misspelling of "suppress"
python/pyspark/context.py:940:12: "supress" is a misspelling of "suppress"
python/pyspark/heapq3.py:6:63: "Stichting" is a misspelling of "Stitching"
python/pyspark/heapq3.py:7:2: "Mathematisch" is a misspelling of "Mathematics"
python/pyspark/heapq3.py:263:29: "Stichting" is a misspelling of "Stitching"
python/pyspark/heapq3.py:263:39: "Mathematisch" is a misspelling of "Mathematics"
python/pyspark/heapq3.py:270:49: "Stichting" is a misspelling of "Stitching"
python/pyspark/heapq3.py:270:59: "Mathematisch" is a misspelling of "Mathematics"
python/pyspark/heapq3.py:275:2: "STICHTING" is a misspelling of "STITCHING"
python/pyspark/heapq3.py:275:12: "MATHEMATISCH" is a misspelling of "MATHEMATICS"
python/pyspark/heapq3.py:277:29: "STICHTING" is a misspelling of "STITCHING"
python/pyspark/heapq3.py:277:39: "MATHEMATISCH" is a misspelling of "MATHEMATICS"
python/pyspark/heapq3.py:713:8: "probabilty" is a misspelling of "probability"
python/pyspark/ml/clustering.py:1038:8: "Currenlty" is a misspelling of "Currently"
python/pyspark/ml/stat.py:339:23: "Euclidian" is a misspelling of "Euclidean"
python/pyspark/ml/regression.py:1378:20: "paramter" is a misspelling of "parameter"
python/pyspark/mllib/stat/_statistics.py:262:8: "probabilty" is a misspelling of "probability"
python/pyspark/rdd.py:1363:32: "paramter" is a misspelling of "parameter"
python/pyspark/streaming/tests.py:825:42: "retuns" is a misspelling of "returns"
python/pyspark/sql/tests.py:768:29: "initalization" is a misspelling of "initialization"
python/pyspark/sql/tests.py:3616:31: "initalize" is a misspelling of "initialize"
resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerBackendUtil.scala:120:39: "arbitary" is a misspelling of "arbitrary"
resource-managers/mesos/src/test/scala/org/apache/spark/deploy/mesos/MesosClusterDispatcherArgumentsSuite.scala:26:45: "sucessfully" is a misspelling of "successfully"
resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala:358:27: "constaints" is a misspelling of "constraints"
resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/YarnClusterSuite.scala:111:24: "senstive" is a misspelling of "sensitive"
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala:1063:5: "overwirte" is a misspelling of "overwrite"
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala:1348:17: "compatability" is a misspelling of "compatibility"
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala:77:36: "paramter" is a misspelling of "parameter"
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:1374:22: "precendence" is a misspelling of "precedence"
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisSuite.scala:238:27: "unnecassary" is a misspelling of "unnecessary"
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ConditionalExpressionSuite.scala:212:17: "whn" is a misspelling of "when"
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinHelper.scala:147:60: "timestmap" is a misspelling of "timestamp"
sql/core/src/test/scala/org/apache/spark/sql/TPCDSQuerySuite.scala:150:45: "precentage" is a misspelling of "percentage"
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchemaSuite.scala:135:29: "infered" is a misspelling of "inferred"
sql/hive/src/test/resources/golden/udf_instr-1-2e76f819563dbaba4beb51e3a130b922:1:52: "occurance" is a misspelling of "occurrence"
sql/hive/src/test/resources/golden/udf_instr-2-32da357fc754badd6e3898dcc8989182:1:52: "occurance" is a misspelling of "occurrence"
sql/hive/src/test/resources/golden/udf_locate-1-6e41693c9c6dceea4d7fab4c02884e4e:1:63: "occurance" is a misspelling of "occurrence"
sql/hive/src/test/resources/golden/udf_locate-2-d9b5934457931447874d6bb7c13de478:1:63: "occurance" is a misspelling of "occurrence"
sql/hive/src/test/resources/golden/udf_translate-2-f7aa38a33ca0df73b7a1e6b6da4b7fe8:9:79: "occurence" is a misspelling of "occurrence"
sql/hive/src/test/resources/golden/udf_translate-2-f7aa38a33ca0df73b7a1e6b6da4b7fe8:13:110: "occurence" is a misspelling of "occurrence"
sql/hive/src/test/resources/ql/src/test/queries/clientpositive/annotate_stats_join.q:46:105: "distint" is a misspelling of "distinct"
sql/hive/src/test/resources/ql/src/test/queries/clientpositive/auto_sortmerge_join_11.q:29:3: "Currenly" is a misspelling of "Currently"
sql/hive/src/test/resources/ql/src/test/queries/clientpositive/avro_partitioned.q:72:15: "existant" is a misspelling of "existent"
sql/hive/src/test/resources/ql/src/test/queries/clientpositive/decimal_udf.q:25:3: "substraction" is a misspelling of "subtraction"
sql/hive/src/test/resources/ql/src/test/queries/clientpositive/groupby2_map_multi_distinct.q:16:51: "funtion" is a misspelling of "function"
sql/hive/src/test/resources/ql/src/test/queries/clientpositive/groupby_sort_8.q:15:30: "issueing" is a misspelling of "issuing"
sql/hive/src/test/scala/org/apache/spark/sql/sources/HadoopFsRelationTest.scala:669:52: "wiht" is a misspelling of "with"
sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/session/HiveSessionImpl.java:474:9: "Refering" is a misspelling of "Referring"
```
### after
```
$ misspell . | grep -v '.js'
common/network-common/src/main/java/org/apache/spark/network/util/AbstractFileRegion.java:27:20: "transfered" is a misspelling of "transferred"
core/src/main/scala/org/apache/spark/status/storeTypes.scala:113:29: "ect" is a misspelling of "etc"
core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala:1922:49: "agriculteur" is a misspelling of "agriculture"
data/streaming/AFINN-111.txt:1219:0: "humerous" is a misspelling of "humorous"
licenses/LICENSE-heapq.txt:5:63: "Stichting" is a misspelling of "Stitching"
licenses/LICENSE-heapq.txt:6:2: "Mathematisch" is a misspelling of "Mathematics"
licenses/LICENSE-heapq.txt:262:29: "Stichting" is a misspelling of "Stitching"
licenses/LICENSE-heapq.txt:262:39: "Mathematisch" is a misspelling of "Mathematics"
licenses/LICENSE-heapq.txt:269:49: "Stichting" is a misspelling of "Stitching"
licenses/LICENSE-heapq.txt:269:59: "Mathematisch" is a misspelling of "Mathematics"
licenses/LICENSE-heapq.txt:274:2: "STICHTING" is a misspelling of "STITCHING"
licenses/LICENSE-heapq.txt:274:12: "MATHEMATISCH" is a misspelling of "MATHEMATICS"
licenses/LICENSE-heapq.txt:276:29: "STICHTING" is a misspelling of "STITCHING"
licenses/LICENSE-heapq.txt:276:39: "MATHEMATISCH" is a misspelling of "MATHEMATICS"
licenses-binary/LICENSE-heapq.txt:5:63: "Stichting" is a misspelling of "Stitching"
licenses-binary/LICENSE-heapq.txt:6:2: "Mathematisch" is a misspelling of "Mathematics"
licenses-binary/LICENSE-heapq.txt:262:29: "Stichting" is a misspelling of "Stitching"
licenses-binary/LICENSE-heapq.txt:262:39: "Mathematisch" is a misspelling of "Mathematics"
licenses-binary/LICENSE-heapq.txt:269:49: "Stichting" is a misspelling of "Stitching"
licenses-binary/LICENSE-heapq.txt:269:59: "Mathematisch" is a misspelling of "Mathematics"
licenses-binary/LICENSE-heapq.txt:274:2: "STICHTING" is a misspelling of "STITCHING"
licenses-binary/LICENSE-heapq.txt:274:12: "MATHEMATISCH" is a misspelling of "MATHEMATICS"
licenses-binary/LICENSE-heapq.txt:276:29: "STICHTING" is a misspelling of "STITCHING"
licenses-binary/LICENSE-heapq.txt:276:39: "MATHEMATISCH" is a misspelling of "MATHEMATICS"
mllib/src/main/resources/org/apache/spark/ml/feature/stopwords/hungarian.txt:170:0: "teh" is a misspelling of "the"
mllib/src/main/resources/org/apache/spark/ml/feature/stopwords/portuguese.txt:53:0: "eles" is a misspelling of "eels"
mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala:99:20: "Euclidian" is a misspelling of "Euclidean"
mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala:539:11: "Euclidian" is a misspelling of "Euclidean"
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala:77:36: "Teh" is a misspelling of "The"
mllib/src/main/scala/org/apache/spark/mllib/stat/MultivariateOnlineSummarizer.scala:276:9: "Euclidian" is a misspelling of "Euclidean"
python/pyspark/heapq3.py:6:63: "Stichting" is a misspelling of "Stitching"
python/pyspark/heapq3.py:7:2: "Mathematisch" is a misspelling of "Mathematics"
python/pyspark/heapq3.py:263:29: "Stichting" is a misspelling of "Stitching"
python/pyspark/heapq3.py:263:39: "Mathematisch" is a misspelling of "Mathematics"
python/pyspark/heapq3.py:270:49: "Stichting" is a misspelling of "Stitching"
python/pyspark/heapq3.py:270:59: "Mathematisch" is a misspelling of "Mathematics"
python/pyspark/heapq3.py:275:2: "STICHTING" is a misspelling of "STITCHING"
python/pyspark/heapq3.py:275:12: "MATHEMATISCH" is a misspelling of "MATHEMATICS"
python/pyspark/heapq3.py:277:29: "STICHTING" is a misspelling of "STITCHING"
python/pyspark/heapq3.py:277:39: "MATHEMATISCH" is a misspelling of "MATHEMATICS"
python/pyspark/ml/stat.py:339:23: "Euclidian" is a misspelling of "Euclidean"
```
Closes#22070 from seratch/fix-typo.
Authored-by: Kazuhiro Sera <seratch@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
## What changes were proposed in this pull request?
This PR adds Python support for barrier execution mode, enabling the launch of a job containing barrier stage(s) from PySpark.
We simply mirror the existing `RDDBarrier` and `RDD.barrier()` in the Python API.
## How was this patch tested?
Manually tested:
```
>>> rdd = sc.parallelize([1, 2, 3, 4])
>>> def f(iterator): yield sum(iterator)
...
>>> rdd.barrier().mapPartitions(f).isBarrier() == True
True
```
Unit tests will be added in a follow-up PR that implements BarrierTaskContext on python side.
Closes#22011 from jiangxb1987/python.
Authored-by: Xingbo Jiang <xingbo.jiang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
Parquet provides six compression codecs: "snappy", "gzip", "lzo", "lz4", "brotli", and "zstd".
This PR adds the missing compression codecs: "lz4", "brotli", and "zstd".
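For illustration, a sketch of requesting one of the newly allowed codecs via the writer option (`df` is assumed to be an existing DataFrame and the paths are illustrative; actually using "brotli" or "zstd" still depends on the corresponding codec library being available at runtime):
```python
# Sketch: any of the newly allowed codecs can be requested the same way.
df.write.option("compression", "lz4").parquet("/tmp/parquet_lz4")
df.write.option("compression", "zstd").parquet("/tmp/parquet_zstd")
```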
## How was this patch tested?
N/A
Closes#22068 from 10110346/nosupportlz4.
Authored-by: liuxian <liu.xian3@zte.com.cn>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
This PR fixes typos of the form `auxiliary verb + verb[s]`. This is a follow-on of #21956.
## How was this patch tested?
N/A
Closes#22040 from kiszk/spellcheck1.
Authored-by: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
The PR adds the SQL function `array_intersect`. The behavior of the function is based on Presto's one.
This function returns an array of the elements in the intersection of array1 and array2.
Note: The order of elements in the result is not defined.
## How was this patch tested?
Added UTs
Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Closes#21102 from kiszk/SPARK-23913.
## What changes were proposed in this pull request?
The `dropFieldIfAllNull` parameter of the `json` method wasn't set as an option. This PR fixes that.
## How was this patch tested?
I added a test to `sql/test.py`
Author: Maxim Gekk <maxim.gekk@databricks.com>
Closes#22002 from MaxGekk/drop-field-if-all-null.
## What changes were proposed in this pull request?
In the PR, I propose a column-based API for the `pivot()` function. It allows using any column expression as the pivot column, and also makes `pivot()` consistent with how groupBy() works.
## How was this patch tested?
I added new tests to `DataFramePivotSuite` and updated PySpark examples for the `pivot()` function.
Author: Maxim Gekk <maxim.gekk@databricks.com>
Closes#21699 from MaxGekk/pivot-column.
## What changes were proposed in this pull request?
Fixed a small typo in a comment
## How was this patch tested?
Manual test
Author: Onwuka Gideon <dongidomed@gmail.com>
Closes#21992 from dongido001/patch-1.
## What changes were proposed in this pull request?
jira: https://issues.apache.org/jira/browse/SPARK-25011
add prefix to __all__ in fpm.py
## How was this patch tested?
existing unit test.
Author: Yuhao Yang <yuhao.yang@intel.com>
Closes#21981 from hhbyyh/prefixall.
## What changes were proposed in this pull request?
This PR fixes the following lint-python errors:
```
./python/pyspark/accumulators.py:231:9: E306 expected 1 blank line before a nested definition, found 0
./python/pyspark/accumulators.py:257:101: E501 line too long (107 > 100 characters)
./python/pyspark/accumulators.py:264:1: E302 expected 2 blank lines, found 1
./python/pyspark/accumulators.py:281:1: E302 expected 2 blank lines, found 1
```
## How was this patch tested?
Executed lint-python manually.
Author: Takuya UESHIN <ueshin@databricks.com>
Closes#21973 from ueshin/issues/build/1/fix_lint-python.
## What changes were proposed in this pull request?
The PR adds the SQL function `array_except`. The behavior of the function is based on Presto's one.
This function returns an array of the elements in array1 but not in array2.
Note: The order of elements in the result is not defined.
## How was this patch tested?
Added UTs.
Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Closes#21103 from kiszk/SPARK-23915.
## What changes were proposed in this pull request?
See [ARROW-2432](https://jira.apache.org/jira/browse/ARROW-2432). It seems using `from_pandas` to convert decimals fails if it encounters a value of `None`:
```python
import pyarrow as pa
import pandas as pd
from decimal import Decimal
pa.Array.from_pandas(pd.Series([Decimal('3.14'), None]), type=pa.decimal128(3, 2))
```
**Arrow 0.8.0**
```
<pyarrow.lib.Decimal128Array object at 0x10a572c58>
[
Decimal('3.14'),
NA
]
```
**Arrow 0.9.0**
```
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "array.pxi", line 383, in pyarrow.lib.Array.from_pandas
File "array.pxi", line 177, in pyarrow.lib.array
File "error.pxi", line 77, in pyarrow.lib.check_status
File "error.pxi", line 77, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Error converting from Python objects to Decimal: Got Python object of type NoneType but can only handle these types: decimal.Decimal
```
This PR proposes to work around this via Decimal NaN:
```python
pa.Array.from_pandas(pd.Series([Decimal('3.14'), Decimal('NaN')]), type=pa.decimal128(3, 2))
```
```
<pyarrow.lib.Decimal128Array object at 0x10ffd2e68>
[
Decimal('3.14'),
NA
]
```
## How was this patch tested?
Manually tested:
```bash
SPARK_TESTING=1 ./bin/pyspark pyspark.sql.tests ScalarPandasUDFTests
```
**Before**
```
Traceback (most recent call last):
File "/.../spark/python/pyspark/sql/tests.py", line 4672, in test_vectorized_udf_null_decimal
self.assertEquals(df.collect(), res.collect())
File "/.../spark/python/pyspark/sql/dataframe.py", line 533, in collect
sock_info = self._jdf.collectToPython()
File "/.../spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/.../spark/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/.../spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling o51.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 1.0 failed 1 times, most recent failure: Lost task 3.0 in stage 1.0 (TID 7, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/.../spark/python/pyspark/worker.py", line 320, in main
process()
File "/.../spark/python/pyspark/worker.py", line 315, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/.../spark/python/pyspark/serializers.py", line 274, in dump_stream
batch = _create_batch(series, self._timezone)
File "/.../spark/python/pyspark/serializers.py", line 243, in _create_batch
arrs = [create_array(s, t) for s, t in series]
File "/.../spark/python/pyspark/serializers.py", line 241, in create_array
return pa.Array.from_pandas(s, mask=mask, type=t)
File "array.pxi", line 383, in pyarrow.lib.Array.from_pandas
File "array.pxi", line 177, in pyarrow.lib.array
File "error.pxi", line 77, in pyarrow.lib.check_status
File "error.pxi", line 77, in pyarrow.lib.check_status
ArrowInvalid: Error converting from Python objects to Decimal: Got Python object of type NoneType but can only handle these types: decimal.Decimal
```
**After**
```
Running tests...
----------------------------------------------------------------------
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
.......S.............................
----------------------------------------------------------------------
Ran 37 tests in 21.980s
```
Author: hyukjinkwon <gurwls223@apache.org>
Closes#21928 from HyukjinKwon/SPARK-24976.
## What changes were proposed in this pull request?
Add numIter to Python version of ClusteringSummary
## How was this patch tested?
Modified existing UT test_multiclass_logistic_regression_summary
Author: Huaxin Gao <huaxing@us.ibm.com>
Closes#21925 from huaxingao/spark-24973.
## What changes were proposed in this pull request?
Update Pandas UDFs section in sql-programming-guide. Add section for grouped aggregate pandas UDF.
## How was this patch tested?
Author: Li Jin <ice.xelloss@gmail.com>
Closes#21887 from icexelloss/SPARK-23633-sql-programming-guide.
## What changes were proposed in this pull request?
Implements INTERSECT ALL clause through query rewrites using existing operators in Spark. Please refer to [Link](https://drive.google.com/open?id=1nyW0T0b_ajUduQoPgZLAsyHK8s3_dko3ulQuxaLpUXE) for the design.
Input Query
``` SQL
SELECT c1 FROM ut1 INTERSECT ALL SELECT c1 FROM ut2
```
Rewritten Query
```SQL
SELECT c1
FROM (
SELECT replicate_row(min_count, c1)
FROM (
SELECT c1,
IF (vcol1_cnt > vcol2_cnt, vcol2_cnt, vcol1_cnt) AS min_count
FROM (
SELECT c1, count(vcol1) as vcol1_cnt, count(vcol2) as vcol2_cnt
FROM (
SELECT c1, true as vcol1, null as vcol2 FROM ut1
UNION ALL
SELECT c1, null as vcol1, true as vcol2 FROM ut2
) AS union_all
GROUP BY c1
HAVING vcol1_cnt >= 1 AND vcol2_cnt >= 1
)
)
)
```
## How was this patch tested?
Added test cases in SQLQueryTestSuite, DataFrameSuite, SetOperationSuite
Author: Dilip Biswal <dbiswal@us.ibm.com>
Closes#21886 from dilipbiswal/dkb_intersect_all_final.
## What changes were proposed in this pull request?
Implements EXCEPT ALL clause through query rewrites using existing operators in Spark. In this PR, an internal UDTF (replicate_rows) is added to aid in preserving duplicate rows. Please refer to [Link](https://drive.google.com/open?id=1nyW0T0b_ajUduQoPgZLAsyHK8s3_dko3ulQuxaLpUXE) for the design.
**Note** This proposed UDTF is kept as an internal function, purely used to aid with this particular rewrite, giving us the flexibility to change to a more generalized UDTF in the future.
Input Query
``` SQL
SELECT c1 FROM ut1 EXCEPT ALL SELECT c1 FROM ut2
```
Rewritten Query
```SQL
SELECT c1
FROM (
SELECT replicate_rows(sum_val, c1)
FROM (
SELECT c1, sum_val
FROM (
SELECT c1, sum(vcol) AS sum_val
FROM (
SELECT 1L as vcol, c1 FROM ut1
UNION ALL
SELECT -1L as vcol, c1 FROM ut2
) AS union_all
GROUP BY union_all.c1
)
WHERE sum_val > 0
)
)
```
## How was this patch tested?
Added test cases in SQLQueryTestSuite, DataFrameSuite and SetOperationSuite
Author: Dilip Biswal <dbiswal@us.ibm.com>
Closes#21857 from dilipbiswal/dkb_except_all_final.
## What changes were proposed in this pull request?
This PR adds a new collection function: shuffle. It generates a random permutation of the given array. This implementation uses the "inside-out" version of the Fisher-Yates algorithm.
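For reference, a minimal Python sketch of the inside-out Fisher-Yates variant (the actual implementation lives in Catalyst, not Python):
```python
import random

def inside_out_shuffle(source):
    # Builds a shuffled copy in one pass: element i is placed at a random
    # slot j <= i, and the previous occupant of slot j is moved to slot i.
    result = []
    for i, x in enumerate(source):
        j = random.randint(0, i)
        if j == i:
            result.append(x)
        else:
            result.append(result[j])
            result[j] = x
    return result
```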
## How was this patch tested?
New tests are added to CollectionExpressionsSuite.scala and DataFrameFunctionsSuite.scala.
Author: Takuya UESHIN <ueshin@databricks.com>
Author: pkuwm <ihuizhi.lu@gmail.com>
Closes#21802 from ueshin/issues/SPARK-23928/shuffle.
## What changes were proposed in this pull request?
Add support for custom encoding in the CSV writer, see https://issues.apache.org/jira/browse/SPARK-19018
## How was this patch tested?
Added two unit tests in CSVSuite
Author: crafty-coder <carlospb86@gmail.com>
Author: Carlos <crafty-coder@users.noreply.github.com>
Closes#20949 from crafty-coder/master.
## What changes were proposed in this pull request?
Fix a typo in pyspark sql tests
Author: William Sheu <william.sheu@databricks.com>
Closes#21833 from PenguinToast/fix-test-typo.
## What changes were proposed in this pull request?
Deprecate `KMeans.computeCost` which was introduced as a temp fix and now it is not needed anymore, since we introduced `ClusteringEvaluator`.
## How was this patch tested?
manual test (deprecation warning displayed)
Scala
```
...
scala> model.computeCost(dataset)
warning: there was one deprecation warning; re-run with -deprecation for details
res1: Double = 0.0
```
Python
```
>>> import warnings
>>> warnings.simplefilter('always', DeprecationWarning)
...
>>> model.computeCost(df)
/Users/mgaido/apache/spark/python/pyspark/ml/clustering.py:330: DeprecationWarning: Deprecated in 2.4.0. It will be removed in 3.0.0. Use ClusteringEvaluator instead.
" instead.", DeprecationWarning)
```
Author: Marco Gaido <marcogaido91@gmail.com>
Closes#20629 from mgaido91/SPARK-23451.
## What changes were proposed in this pull request?
Add ```sequence``` in functions.py
## How was this patch tested?
Add doctest.
Author: Huaxin Gao <huaxing@us.ibm.com>
Closes#21820 from huaxingao/spark-24868.
## What changes were proposed in this pull request?
In the PR, I propose to output a warning if the `addFile()` or `addJar()` methods are called more than once for the same path. Currently, overwriting already added files is not supported. The new comments and the warning reflect the existing behaviour.
Author: Maxim Gekk <maxim.gekk@databricks.com>
Closes#21771 from MaxGekk/warning-on-adding-file.
## What changes were proposed in this pull request?
Use longs in calculating min hash to avoid bias due to int overflow.
## How was this patch tested?
Existing tests.
Author: Sean Owen <srowen@gmail.com>
Closes#21750 from srowen/SPARK-24754.
## What changes were proposed in this pull request?
The PR is a follow-up to move the test cases introduced by the original PR to their proper location.
## How was this patch tested?
moved UTs
Author: Marco Gaido <marcogaido91@gmail.com>
Closes#21751 from mgaido91/SPARK-24208_followup.
## What changes were proposed in this pull request?
The PR adds the SQL function `array_union`. The behavior of the function is based on Presto's one.
This function returns an array of the elements in the union of array1 and array2.
Note: The order of elements in the result is not defined.
## How was this patch tested?
Added UTs
Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Closes#21061 from kiszk/SPARK-23914.
## What changes were proposed in this pull request?
In the PR, I propose to extend `RuntimeConfig` with a new method `isModifiable()`, which returns `true` if a config parameter can be modified at runtime (for the current session state). For static SQL and core parameters, the method returns `false`.
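A sketch of the intended behaviour, assuming the method is also surfaced on PySpark's `spark.conf` wrapper (the PR itself targets the Scala `RuntimeConfig`):
```python
# Assumes an active SparkSession `spark` and that isModifiable() is exposed
# on the Python RuntimeConfig wrapper.
spark.conf.isModifiable("spark.sql.shuffle.partitions")  # True: session-level SQL conf
spark.conf.isModifiable("spark.sql.warehouse.dir")       # False: static SQL conf
spark.conf.isModifiable("spark.executor.cores")          # False: Spark core conf
```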
## How was this patch tested?
Added new test to `RuntimeConfigSuite` for checking Spark core and SQL parameters.
Author: Maxim Gekk <maxim.gekk@databricks.com>
Closes#21730 from MaxGekk/is-modifiable.
## What changes were proposed in this pull request?
A self-join on a dataset which contains a `FlatMapGroupsInPandas` fails because of duplicate attributes. This happens because we do not handle this specific case in our `dedupAttr` rules.
The PR fixes the issue by adding handling of this specific case.
## How was this patch tested?
added UT + manual tests
Author: Marco Gaido <marcogaido91@gmail.com>
Author: Marco Gaido <mgaido@hortonworks.com>
Closes#21737 from mgaido91/SPARK-24208.
## What changes were proposed in this pull request?
Implement the map_concat high-order function.
This implementation does not pick a winner when the specified maps have overlapping keys. Therefore, this implementation preserves existing duplicate keys in the maps and potentially introduces new duplicates (After discussion with ueshin, we settled on option 1 from [here](https://issues.apache.org/jira/browse/SPARK-23936?focusedCommentId=16464245&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16464245)).
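For illustration, a sketch of the overlapping-key case under that choice (assuming an active `spark` session):
```python
# Key 2 appears in both maps; both entries are kept rather than one winning,
# so the resulting map may contain duplicate keys.
spark.sql(
    "SELECT map_concat(map(1, 'a', 2, 'b'), map(2, 'c', 3, 'd'))"
).show(truncate=False)
```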
## How was this patch tested?
New tests
Manual tests
Run all sbt SQL tests
Run all pyspark sql tests
Author: Bruce Robbins <bersprockets@gmail.com>
Closes#21073 from bersprockets/SPARK-23936.
## What changes were proposed in this pull request?
This PR adds support for column arguments in the timezone parameter of `from_utc_timestamp`/`to_utc_timestamp` (follow-up of #21693).
## How was this patch tested?
Added tests.
Author: Takeshi Yamamuro <yamamuro@apache.org>
Closes#21723 from maropu/SPARK-24673-FOLLOWUP.
## What changes were proposed in this pull request?
Fixed a small typo in the code that caused 20 random characters to be added to the UID, rather than 12.
Author: mcteo <mc_teo@live.ie>
Closes#21675 from mcteo/SPARK-24698-fix.
## What changes were proposed in this pull request?
In the PR, I propose to add a new function, *schema_of_json()*, which infers the schema of a JSON string literal. The result of the function is a string containing a schema in DDL format.
One of the use cases is using *schema_of_json()* in combination with *from_json()*. Currently, _from_json()_ requires a schema as a mandatory argument. The *schema_of_json()* function allows pointing to a JSON string as an example which has the same schema as the first argument of _from_json()_. For instance:
```sql
select from_json(json_column, schema_of_json('{"c1": [0], "c2": [{"c3":0}]}'))
from json_table;
```
## How was this patch tested?
Added new test to `JsonFunctionsSuite`, `JsonExpressionsSuite` and SQL tests to `json-functions.sql`
Author: Maxim Gekk <maxim.gekk@databricks.com>
Closes#21686 from MaxGekk/infer_schema_json.
## What changes were proposed in this pull request?
Use SQLConf for PySpark to manage all SQL configs, dropping all the hard-coded values in config usage.
## How was this patch tested?
Existing UT.
Author: Yuanjian Li <xyliyuanjian@gmail.com>
Closes#21648 from xuanyuanking/SPARK-24665.
## What changes were proposed in this pull request?
add distanceMeasure to BisectingKMeans in Python.
## How was this patch tested?
added doctest and also manually tested it.
Author: Huaxin Gao <huaxing@us.ibm.com>
Closes#21557 from huaxingao/spark-24439.
## What changes were proposed in this pull request?
Adds basic PMML export support for Spark ML stages to PySpark as was previously done in Scala. Includes LinearRegressionModel as the first stage to implement.
## How was this patch tested?
Doctest, the main testing work for this is on the Scala side. (TODO holden add the unittest once I finish locally).
Author: Holden Karau <holden@pigscanfly.ca>
Closes#21172 from holdenk/SPARK-23120-add-pmml-export-support-to-pyspark.
## What changes were proposed in this pull request?
[SPARK-14712](https://issues.apache.org/jira/browse/SPARK-14712)
spark.mllib LogisticRegressionModel overrides toString to print a little model info. We should do the same in spark.ml and override repr in pyspark.
## How was this patch tested?
LogisticRegressionSuite.scala
Python doctest in pyspark.ml.classification.py
Author: bravo-zhang <mzhang1230@gmail.com>
Closes#18826 from bravo-zhang/spark-14712.
## What changes were proposed in this pull request?
Address comments in #21370 and add more tests.
## How was this patch tested?
Enhance test in pyspark/sql/test.py and DataFrameSuite
Author: Yuanjian Li <xyliyuanjian@gmail.com>
Closes#21553 from xuanyuanking/SPARK-24215-follow.
## What changes were proposed in this pull request?
Currently, a `pandas_udf` of type `PandasUDFType.GROUPED_MAP` will assign the resulting columns based on the index of the returned pandas.DataFrame. If a new DataFrame is returned and constructed using a dict, then the order of the columns could be arbitrary and differ from the schema defined for the UDF. If the schema types still match, no error is raised and the user sees column names and column data mixed up.
This change will first try to assign columns using the return type's field names. If a KeyError occurs, the column index is checked: if it is string-based, the error is raised, since it is most likely a naming mistake; otherwise it falls back to assigning columns by position, raising a TypeError if the field types do not match.
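For illustration, a sketch of the dict-construction case (assuming a DataFrame `df` with columns `id` and `v`; the UDF itself is illustrative):
```python
from pyspark.sql.functions import pandas_udf, PandasUDFType
import pandas as pd

@pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP)
def subtract_mean(pdf):
    # Building the result from a dict gives no guaranteed column order;
    # with this change the columns are matched to the schema fields by name.
    return pd.DataFrame({"v": pdf.v - pdf.v.mean(), "id": pdf.id})

df.groupby("id").apply(subtract_mean).show()
```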
## How was this patch tested?
Added a test that returns a new DataFrame with column order different than the schema.
Author: Bryan Cutler <cutlerb@gmail.com>
Closes#21427 from BryanCutler/arrow-grouped-map-mixesup-cols-SPARK-24324.
## What changes were proposed in this pull request?
Fix for SyntaxWarning on tests.py
## How was this patch tested?
./dev/run-tests
Author: Rekha Joshi <rekhajoshm@gmail.com>
Closes#21604 from rekhajoshm/SPARK-24614.
## What changes were proposed in this pull request?
Add array_distinct to remove duplicate values from an array.
## How was this patch tested?
Add unit tests
Author: Huaxin Gao <huaxing@us.ibm.com>
Closes#21050 from huaxingao/spark-23912.
## What changes were proposed in this pull request?
Currently, the micro-batches in MicroBatchExecution are not exposed to the user through any public API. This was because we did not want to expose the micro-batches, so that every API we expose could eventually be supported in the Continuous engine. But now that we have a better sense of building a ContinuousExecution, I am considering adding APIs which will run only on MicroBatchExecution. I have quite a few use cases where exposing the micro-batch output as a dataframe is useful.
- Pass the output rows of each batch to a library that is designed only for batch jobs (for example, many ML libraries need to collect() while learning).
- Reuse batch data sources for output whose streaming version does not exist (e.g. the redshift data source).
- Write the output rows to multiple places by writing twice for each batch. This is not the most elegant thing to do for multiple-output streaming queries, but is likely to be better than running two streaming queries processing the same data twice.
The proposal is to add a method `foreachBatch(f: Dataset[T] => Unit)` to Scala/Java/Python `DataStreamWriter`.
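A minimal sketch of what the Python side could look like (assuming an existing streaming DataFrame `streaming_df`; the sink and path are illustrative):
```python
def write_batch(batch_df, batch_id):
    # batch_df is an ordinary (batch) DataFrame, so any batch sink can be
    # reused here, covering the use cases listed above.
    batch_df.write.mode("append").parquet("/tmp/stream_out")

query = streaming_df.writeStream.foreachBatch(write_batch).start()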
## How was this patch tested?
New unit tests.
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#21571 from tdas/foreachBatch.
## What changes were proposed in this pull request?
This PR added a new JSON option `dropFieldIfAllNull` to ignore columns with all null values, or empty arrays/structs, during JSON schema inference.
## How was this patch tested?
Added tests in `JsonSuite`.
Author: Takeshi Yamamuro <yamamuro@apache.org>
Author: Xiangrui Meng <meng@databricks.com>
Closes#20929 from maropu/SPARK-23772.
## What changes were proposed in this pull request?
This PR adds `foreach` for streaming queries in Python. Users will be able to specify their processing logic in two different ways.
- As a function that takes a row as input.
- As an object that has `open`, `process`, and `close` methods.
See the python docs in this PR for more details.
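For illustration, a minimal sketch of the two styles (the processing logic and class name are illustrative, not the PR's test code):
```python
# Style 1: a plain function called once per row.
query = df.writeStream.foreach(lambda row: print(row)).start()

# Style 2: an object with open/process/close, allowing per-partition
# setup and teardown.
class RowPrinter:
    def open(self, partition_id, epoch_id):
        return True  # True means: go ahead and process this partition
    def process(self, row):
        print(row)
    def close(self, error):
        pass

query = df.writeStream.foreach(RowPrinter()).start()
```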
## How was this patch tested?
Added java and python unit tests
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#21477 from tdas/SPARK-24396.
## What changes were proposed in this pull request?
Fix typo in exception raised in Python serializer
## How was this patch tested?
No code changes
Author: Ruben Berenguel Montoro <ruben@mostlymaths.net>
Closes#21566 from rberenguel/fix_typo_pyspark_serializers.
## What changes were proposed in this pull request?
In the PR, I propose to support any DataType represented as DDL string for the from_json function. After the changes, it will be possible to specify `MapType` in SQL like:
```sql
select from_json('{"a":1, "b":2}', 'map<string, int>')
```
and in Scala (similar in other languages)
```scala
val in = Seq("""{"a": {"b": 1}}""").toDS()
val schema = "map<string, map<string, int>>"
val out = in.select(from_json($"value", schema, Map.empty[String, String]))
```
## How was this patch tested?
Added a couple of SQL tests and modified existing tests for Python and Scala. The existing tests were modified because it does not matter for them in which format the schema for `from_json` is provided.
Author: Maxim Gekk <maxim.gekk@databricks.com>
Closes#21550 from MaxGekk/from_json-ddl-schema.
…ark shell
## What changes were proposed in this pull request?
This PR catches TypeError when testing existence of HiveConf when creating pyspark shell
## How was this patch tested?
Manually tested. Here are the manual test cases:
Build with hive:
```
(pyarrow-dev) Lis-MacBook-Pro:spark icexelloss$ bin/pyspark
Python 3.6.5 | packaged by conda-forge | (default, Apr 6 2018, 13:44:09)
[GCC 4.2.1 Compatible Apple LLVM 6.1.0 (clang-602.0.53)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
18/06/14 14:55:41 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 2.4.0-SNAPSHOT
/_/
Using Python version 3.6.5 (default, Apr 6 2018 13:44:09)
SparkSession available as 'spark'.
>>> spark.conf.get('spark.sql.catalogImplementation')
'hive'
```
Build without hive:
```
(pyarrow-dev) Lis-MacBook-Pro:spark icexelloss$ bin/pyspark
Python 3.6.5 | packaged by conda-forge | (default, Apr 6 2018, 13:44:09)
[GCC 4.2.1 Compatible Apple LLVM 6.1.0 (clang-602.0.53)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
18/06/14 15:04:52 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 2.4.0-SNAPSHOT
/_/
Using Python version 3.6.5 (default, Apr 6 2018 13:44:09)
SparkSession available as 'spark'.
>>> spark.conf.get('spark.sql.catalogImplementation')
'in-memory'
```
Failed to start shell:
```
(pyarrow-dev) Lis-MacBook-Pro:spark icexelloss$ bin/pyspark
Python 3.6.5 | packaged by conda-forge | (default, Apr 6 2018, 13:44:09)
[GCC 4.2.1 Compatible Apple LLVM 6.1.0 (clang-602.0.53)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
18/06/14 15:07:53 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
/Users/icexelloss/workspace/spark/python/pyspark/shell.py:45: UserWarning: Failed to initialize Spark session.
warnings.warn("Failed to initialize Spark session.")
Traceback (most recent call last):
File "/Users/icexelloss/workspace/spark/python/pyspark/shell.py", line 41, in <module>
spark = SparkSession._create_shell_session()
File "/Users/icexelloss/workspace/spark/python/pyspark/sql/session.py", line 581, in _create_shell_session
return SparkSession.builder.getOrCreate()
File "/Users/icexelloss/workspace/spark/python/pyspark/sql/session.py", line 168, in getOrCreate
raise py4j.protocol.Py4JError("Fake Py4JError")
py4j.protocol.Py4JError: Fake Py4JError
(pyarrow-dev) Lis-MacBook-Pro:spark icexelloss$
```
Author: Li Jin <ice.xelloss@gmail.com>
Closes#21569 from icexelloss/SPARK-24563-fix-pyspark-shell-without-hive.
## What changes were proposed in this pull request?
This PR enables using grouped aggregate pandas UDFs as window functions. The semantics are the same as using a SQL aggregation function as a window function.
```
>>> from pyspark.sql.functions import pandas_udf, PandasUDFType
>>> from pyspark.sql import Window
>>> df = spark.createDataFrame(
... [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
... ("id", "v"))
>>> @pandas_udf("double", PandasUDFType.GROUPED_AGG)
... def mean_udf(v):
... return v.mean()
>>> w = Window.partitionBy('id')
>>> df.withColumn('mean_v', mean_udf(df['v']).over(w)).show()
+---+----+------+
| id| v|mean_v|
+---+----+------+
| 1| 1.0| 1.5|
| 1| 2.0| 1.5|
| 2| 3.0| 6.0|
| 2| 5.0| 6.0|
| 2|10.0| 6.0|
+---+----+------+
```
The scope of this PR is somewhat limited in terms of:
(1) Only supports unbounded windows, which act essentially as a group by.
(2) Only supports aggregation functions, not "transform"-like window functions (n -> n mapping).
Both of these are left as future work. In particular, (1) needs careful thinking w.r.t. how to pass rolling window data to Python efficiently; (2) is a bit easier but does require more changes, therefore I think it's better to leave it as a separate PR.
## How was this patch tested?
WindowPandasUDFTests
Author: Li Jin <ice.xelloss@gmail.com>
Closes#21082 from icexelloss/SPARK-22239-window-udf.
## What changes were proposed in this pull request?
The PR adds the SQL function `map_from_arrays`. The behavior of the function is based on Presto's `map`. Since Spark SQL already had a `map` function, we chose a different name for this behavior.
This function returns a map built from a pair of arrays for keys and values.
## How was this patch tested?
Added UTs
Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Closes#21258 from kiszk/SPARK-23933.
Signed-off-by: DylanGuedes <djmgguedes@gmail.com>
## What changes were proposed in this pull request?
Addition of arrays_zip function to spark sql functions.
## How was this patch tested?
Unit tests that check if the results are correct.
Author: DylanGuedes <djmgguedes@gmail.com>
Closes#21045 from DylanGuedes/SPARK-23931.
## What changes were proposed in this pull request?
Add locale support for `StopWordsRemover`.
## How was this patch tested?
[Scala|Python] unit tests.
Author: Lee Dongjin <dongjin@apache.org>
Closes#21501 from dongjinleekr/feature/SPARK-15064.
## What changes were proposed in this pull request?
add spark.ml Python API for PIC
## How was this patch tested?
add doctest
Author: Huaxin Gao <huaxing@us.ibm.com>
Closes#21513 from huaxingao/spark--19826.
## What changes were proposed in this pull request?
SPARK-23754 was fixed in #21383 by changing the UDF code to wrap the user function, but this required a hack to save its argspec. This PR reverts this change and fixes the `StopIteration` bug in the worker
## How does this work?
The root of the problem is that when a user-supplied function raises a `StopIteration`, pyspark might stop processing data if this function is used in a for-loop. The solution is to catch `StopIteration` exceptions and re-raise them as `RuntimeError`s, so that the execution fails and the error is reported to the user. This is done using the `fail_on_stopiteration` wrapper (sketched after this list), in different ways depending on where the function is used:
- In RDDs, the user function is wrapped in the driver, because this function is also called in the driver itself.
- In SQL UDFs, the function is wrapped in the worker, since all processing happens there. Moreover, the worker needs the signature of the user function, which is lost when wrapping it, and passing this signature to the worker requires a not-so-nice hack.
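For illustration, a minimal sketch of such a wrapper (the `fail_on_stopiteration` name is from the PR text; this is not the exact Spark source):
```python
def fail_on_stopiteration(f):
    def wrapper(*args, **kwargs):
        try:
            return f(*args, **kwargs)
        except StopIteration as e:
            # Re-raise so the task fails loudly instead of silently
            # ending the surrounding data-processing loop.
            raise RuntimeError("StopIteration raised in user's code", e)
    return wrapper
```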
## How was this patch tested?
Same tests, plus tests for pandas UDFs
Author: edorigatti <emilio.dorigatti@gmail.com>
Closes#21467 from e-dorigatti/fix_udf_hack.
## What changes were proposed in this pull request?
This PR proposes to wrap the transformed RDD within `TransformFunction`, since `PythonTransformFunction` appears to require a `JavaRDD` to be returned in `_jrdd`.
39e2bad6a8/python/pyspark/streaming/util.py (L67)
6ee28423ad/streaming/src/main/scala/org/apache/spark/streaming/api/python/PythonDStream.scala (L43)
However, this could be `JavaPairRDD` by some APIs, for example, `zip` in PySpark's RDD API.
`_jrdd` could be checked as below:
```python
>>> rdd.zip(rdd)._jrdd.getClass().toString()
u'class org.apache.spark.api.java.JavaPairRDD'
```
So, here, I wrapped it with `map` to ensure a `JavaRDD` is returned.
```python
>>> rdd.zip(rdd).map(lambda x: x)._jrdd.getClass().toString()
u'class org.apache.spark.api.java.JavaRDD'
```
I tried to elaborate some failure cases as below:
```python
from pyspark.streaming import StreamingContext
ssc = StreamingContext(spark.sparkContext, 10)
ssc.queueStream([sc.range(10)]) \
.transform(lambda rdd: rdd.cartesian(rdd)) \
.pprint()
ssc.start()
```
```python
from pyspark.streaming import StreamingContext
ssc = StreamingContext(spark.sparkContext, 10)
ssc.queueStream([sc.range(10)]).foreachRDD(lambda rdd: rdd.cartesian(rdd))
ssc.start()
```
```python
from pyspark.streaming import StreamingContext
ssc = StreamingContext(spark.sparkContext, 10)
ssc.queueStream([sc.range(10)]).foreachRDD(lambda rdd: rdd.zip(rdd))
ssc.start()
```
```python
from pyspark.streaming import StreamingContext
ssc = StreamingContext(spark.sparkContext, 10)
ssc.queueStream([sc.range(10)]).foreachRDD(lambda rdd: rdd.zip(rdd).union(rdd.zip(rdd)))
ssc.start()
```
```python
from pyspark.streaming import StreamingContext
ssc = StreamingContext(spark.sparkContext, 10)
ssc.queueStream([sc.range(10)]).foreachRDD(lambda rdd: rdd.zip(rdd).coalesce(1))
ssc.start()
```
## How was this patch tested?
Unit tests were added in `python/pyspark/streaming/tests.py` and manually tested.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#19498 from HyukjinKwon/SPARK-17756.
## What changes were proposed in this pull request?
This PR attaches submodules to ml's `__init__.py` module.
Also, adds `ImageSchema` into `image.py` explicitly.
## How was this patch tested?
Before:
```python
>>> from pyspark import ml
>>> ml.image
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'module' object has no attribute 'image'
>>> ml.image.ImageSchema
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'module' object has no attribute 'image'
```
```python
>>> "image" in globals()
False
>>> from pyspark.ml import *
>>> "image" in globals()
False
>>> image
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
NameError: name 'image' is not defined
```
After:
```python
>>> from pyspark import ml
>>> ml.image
<module 'pyspark.ml.image' from '/.../spark/python/pyspark/ml/image.pyc'>
>>> ml.image.ImageSchema
<pyspark.ml.image._ImageSchema object at 0x10d973b10>
```
```python
>>> "image" in globals()
False
>>> from pyspark.ml import *
>>> "image" in globals()
True
>>> image
<module 'pyspark.ml.image' from '/.../spark/python/pyspark/ml/image.pyc'>
```
Author: hyukjinkwon <gurwls223@apache.org>
Closes#21483 from HyukjinKwon/SPARK-24454.
Currently, in spark-shell, if the session fails to start, the
user sees a bunch of unrelated errors which are caused by code
in the shell initialization that references the "spark" variable,
which does not exist in that case. Things like:
```
<console>:14: error: not found: value spark
import spark.sql
```
The user is also left with a non-working shell (unless they want
to just write non-Spark Scala or Python code, that is).
This change fails the whole shell session at the point where the
failure occurs, so that the last error message is the one with
the actual information about the failure.
For the python error handling, I moved the session initialization code
to session.py, so that traceback.print_exc() only shows the last error.
Otherwise, the printed exception would contain all previous exceptions
with a message "During handling of the above exception, another
exception occurred", making the actual error kinda hard to parse.
Tested with spark-shell, pyspark (with 2.7 and 3.5), by forcing an
error during SparkContext initialization.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#21368 from vanzin/SPARK-16451.
## What changes were proposed in this pull request?
Currently, column names of headers in CSV files are not checked against the provided schema of CSV data. It could cause errors like those shown in [SPARK-23786](https://issues.apache.org/jira/browse/SPARK-23786) and https://github.com/apache/spark/pull/20894#issuecomment-375957777. I introduced a new CSV option - `enforceSchema`. If it is enabled (the default, `true`), Spark forcibly applies the provided or inferred schema to CSV files. In that case, CSV headers are ignored and not checked against the schema. If `enforceSchema` is set to `false`, additional checks can be performed. For example, if a column in the CSV header and in the schema have different ordering, the following exception is thrown:
```
java.lang.IllegalArgumentException: CSV file header does not contain the expected fields
Header: depth, temperature
Schema: temperature, depth
CSV file: marina.csv
```
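For illustration, a reader call that opts into the stricter check might look like this (a sketch; the file and schema echo the example above):
```python
# Assumes an active SparkSession `spark`; "marina.csv" is the example file.
df = (spark.read
      .option("header", True)
      .option("enforceSchema", False)  # validate the CSV header against the schema
      .schema("temperature DOUBLE, depth DOUBLE")
      .csv("marina.csv"))
```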
## How was this patch tested?
The changes were tested by existing tests of CSVSuite and by 2 new tests.
Author: Maxim Gekk <maxim.gekk@databricks.com>
Author: Maxim Gekk <max.gekk@gmail.com>
Closes#20894 from MaxGekk/check-column-names.
## What changes were proposed in this pull request?
Add array_remove to remove from an array all elements that are equal to a given element.
## How was this patch tested?
add unit tests
Author: Huaxin Gao <huaxing@us.ibm.com>
Closes#21069 from huaxingao/spark-23920.
## What changes were proposed in this pull request?
Added sections to pandas_udf docs, in the grouped map section, to indicate columns are assigned by position.
## How was this patch tested?
NA
Author: Bryan Cutler <cutlerb@gmail.com>
Closes#21471 from BryanCutler/arrow-doc-pandas_udf-column_by_pos-SPARK-21427.
## What changes were proposed in this pull request?
This adds a new API `TaskContext.getLocalProperty(key)` to the Python TaskContext. It mirrors the Java TaskContext API of returning a string value if the key exists, or None if the key does not exist.
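A minimal sketch of the mirrored behaviour (the property key and value are illustrative):
```python
from pyspark import TaskContext

sc.setLocalProperty("my.prop", "some-value")

def read_prop(_):
    # Returns the string value if the key exists, or None otherwise.
    return TaskContext.get().getLocalProperty("my.prop")

sc.parallelize([0], 1).map(read_prop).collect()  # ['some-value']
```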
## How was this patch tested?
New test added.
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#21437 from tdas/SPARK-24397.
## What changes were proposed in this pull request?
spark.ml parity for sequential pattern mining - PrefixSpan: Python API
## How was this patch tested?
doctests
Author: WeichenXu <weichen.xu@databricks.com>
Closes#21265 from WeichenXu123/prefix_span_py.
## What changes were proposed in this pull request?
Add featureSubsetStrategy to GBTClassifier and GBTRegressor. Also make GBTClassificationModel inherit from JavaClassificationModel instead of the prediction model so it will have numClasses.
## How was this patch tested?
Add tests in doctest
Author: Huaxin Gao <huaxing@us.ibm.com>
Closes#21413 from huaxingao/spark-23161.
## What changes were proposed in this pull request?
Make sure that `StopIteration`s raised in users' code do not silently interrupt processing by Spark, but are raised as exceptions to the users. The users' functions are wrapped in `safe_iter` (in `shuffle.py`), which re-raises `StopIteration`s as `RuntimeError`s
## How was this patch tested?
Unit tests, making sure that the exceptions are indeed raised. I am not sure how to check whether a `Py4JJavaError` contains my exception, so I simply looked for the exception message in the java exception's `toString`. Can you propose a better way?
## License
This is my original work, licensed in the same way as spark
Author: e-dorigatti <emilio.dorigatti@gmail.com>
Author: edorigatti <emilio.dorigatti@gmail.com>
Closes#21383 from e-dorigatti/fix_spark_23754.
## What changes were proposed in this pull request?
The pandas_udf functionality was introduced in 2.3.0, but is not completely stable and still evolving. This adds a label to indicate it is still an experimental API.
## How was this patch tested?
NA
Author: Bryan Cutler <cutlerb@gmail.com>
Closes#21435 from BryanCutler/arrow-pandas_udf-experimental-SPARK-24392.
## What changes were proposed in this pull request?
The logical `Range` node has recently gained `outputOrdering`, which is used to eliminate redundant `Sort`s during optimization. However, this `outputOrdering` does not propagate to the physical `RangeExec` node.
We also add the correct `outputPartitioning` to the `RangeExec` node.
## How was this patch tested?
Added test.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#21291 from viirya/SPARK-24242.
## What changes were proposed in this pull request?
cloudpickle 0.4.4 is released - https://github.com/cloudpipe/cloudpickle/releases/tag/v0.4.4
There's no invasive change - the main difference is that we are now able to pickle the root logger, and that fix is pretty isolated.
## How was this patch tested?
Jenkins tests.
Author: hyukjinkwon <gurwls223@apache.org>
Closes#21350 from HyukjinKwon/SPARK-24303.
## What changes were proposed in this pull request?
The PR adds the function `arrays_overlap`. This function returns `true` if the input arrays contain a non-null common element; if not, it returns `null` if any of the arrays contains a `null` element, `false` otherwise.
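Illustrative calls covering the three outcomes (a sketch, assuming an active `spark` session):
```python
spark.sql("SELECT arrays_overlap(array(1, 2), array(2, 3))").show()  # true: common element
spark.sql("SELECT arrays_overlap(array(1, NULL), array(3))").show()  # null: no match, but a null is present
spark.sql("SELECT arrays_overlap(array(1), array(2))").show()        # false: no match, no nulls
```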
## How was this patch tested?
added UTs
Author: Marco Gaido <marcogaido91@gmail.com>
Closes#21028 from mgaido91/SPARK-23922.
## What changes were proposed in this pull request?
The PR adds a new collection function, array_repeat. As there already was a function repeat with the same signature, with the only difference being the expected return type (String instead of Array), the new function is called array_repeat to distinguish.
The behaviour of the function is based on Presto's one.
The function creates an array containing a given element repeated the requested number of times.
## How was this patch tested?
New unit tests added into:
- CollectionExpressionsSuite
- DataFrameFunctionsSuite
Author: Florent Pépin <florentpepin.92@gmail.com>
Author: Florent Pépin <florent.pepin14@imperial.ac.uk>
Closes#21208 from pepinoflo/SPARK-23925.
## What changes were proposed in this pull request?
### Problem
When we run the _PySpark shell in Yarn client mode_, the specified `--py-files` are not recognised on the _driver side_.
Here are the steps I took to check:
```bash
$ cat /home/spark/tmp.py
def testtest():
return 1
```
```bash
$ ./bin/pyspark --master yarn --deploy-mode client --py-files /home/spark/tmp.py
```
```python
>>> def test():
... import tmp
... return tmp.testtest()
...
>>> spark.range(1).rdd.map(lambda _: test()).collect() # executor side
[1]
>>> test() # driver side
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 2, in test
ImportError: No module named tmp
```
### How did it happen?
Unlike Yarn cluster and client mode with spark-submit, Yarn client mode with the PySpark shell works as follows:
1. It first runs Python shell via:
3cb82047f2/launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java (L158) as pointed out by tgravescs in the JIRA.
2. this triggers shell.py and submit another application to launch a py4j gateway:
209b9361ac/python/pyspark/java_gateway.py (L45-L60)
3. it runs a Py4J gateway:
3cb82047f2/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala (L425)
4. it copies (or downloads) --py-files into local temp directory:
3cb82047f2/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala (L365-L376)
and then these files are set up to `spark.submit.pyFiles`
5. Py4J JVM is launched and then the Python paths are set via:
7013eea11c/python/pyspark/context.py (L209-L216)
However, these are not actually set because those files were copied into a tmp directory in step 4, whereas this code path looks for `SparkFiles.getRootDirectory`, where files are stored only when `SparkContext.addFile()` is called.
In other cluster mode, `spark.files` are set via:
3cb82047f2/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala (L554-L555)
and those files are explicitly added via:
ecb8b383af/core/src/main/scala/org/apache/spark/SparkContext.scala (L395)
So we are fine in other modes.
In the case of Yarn client and cluster mode with _submit_, these are handled manually. In particular, https://github.com/apache/spark/pull/6360 added most of the logic. In this case, the Python path looks to be set manually via, for example, `deploy.PythonRunner`. We don't use `spark.files` here.
### How does the PR fix the problem?
I tried to keep the approach as isolated as possible: simply copy .py or .zip files into `SparkFiles.getRootDirectory()` on the driver side if they don't already exist. Another possible way is to set `spark.files`, but that does unnecessary extra work and sounds a bit invasive.
**Before**
```python
>>> def test():
...     import tmp
...     return tmp.testtest()
...
>>> spark.range(1).rdd.map(lambda _: test()).collect()
[1]
>>> test()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 2, in test
ImportError: No module named tmp
```
**After**
```python
>>> def test():
...     import tmp
...     return tmp.testtest()
...
>>> spark.range(1).rdd.map(lambda _: test()).collect()
[1]
>>> test()
1
```
## How was this patch tested?
I manually tested in standalone and Yarn cluster modes with the PySpark shell. .zip and .py files were also tested with steps similar to the above. It's difficult to add a test.
Author: hyukjinkwon <gurwls223@apache.org>
Closes#21267 from HyukjinKwon/SPARK-21945.
## What changes were proposed in this pull request?
See SPARK-23455 for reference. Default params in ML are now saved separately in the metadata file on the Scala side. We must change this for Python for Spark 2.4.0 as well in order to keep them in sync.
## How was this patch tested?
Added test.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#21153 from viirya/SPARK-24058.
## What changes were proposed in this pull request?
Add evaluateEachIteration for GBTClassificationModel and GBTRegressionModel.
## How was this patch tested?
doctest
Author: Lu WANG <lu.wang@databricks.com>
Closes#21335 from ludatabricks/SPARK-14682.
## What changes were proposed in this pull request?
Right now `ArrayWriter`, which is used to output Arrow data for array types, doesn't do `clear` or `reset` after each batch, so it produces wrong output.
## How was this patch tested?
Added test.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#21312 from viirya/SPARK-24259.
## What changes were proposed in this pull request?
Currently, the from_json function supports StructType or ArrayType as the root type. This PR additionally allows specifying MapType(StringType, DataType) as the root type. For example:
```scala
import org.apache.spark.sql.types._
val schema = MapType(StringType, IntegerType)
val in = Seq("""{"a": 1, "b": 2, "c": 3}""").toDS()
in.select(from_json($"value", schema, Map[String, String]())).collect()
```
```
res1: Array[org.apache.spark.sql.Row] = Array([Map(a -> 1, b -> 2, c -> 3)])
```
## How was this patch tested?
It was checked by new tests for the map type with integer type and struct type as value types. Also roundtrip tests like from_json(to_json) and to_json(from_json) for MapType are added.
Author: Maxim Gekk <maxim.gekk@databricks.com>
Author: Maxim Gekk <max.gekk@gmail.com>
Closes#21108 from MaxGekk/from_json-map-type.
## What changes were proposed in this pull request?
Updates `functon` to `function`. This was called out in holdenk's PyCon 2018 conference talk. Didn't see any existing PRs for this.
holdenk: happy to fix the Pandas.Series bug too, but will need a bit more guidance.
Author: Kelley Robinson <krobinson@twilio.com>
Closes#21304 from robinske/master.
## What changes were proposed in this pull request?
It's useful to know what relationship between date1 and date2 results in a positive number.
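For example, a sketch of the sign convention in the PySpark shell (where `spark` is predefined):
```python
from pyspark.sql import functions as F

df = spark.createDataFrame([("2018-04-01", "2018-01-01")], ["date1", "date2"])
# Positive when the first argument is later than the second
df.select(F.months_between(df.date1, df.date2)).show()  # 3.0
df.select(F.datediff(df.date1, df.date2)).show()        # 90
```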
Author: aditkumar <aditkumar@gmail.com>
Author: Adit Kumar <aditkumar@gmail.com>
Closes#20787 from aditkumar/master.
## What changes were proposed in this pull request?
I propose to add a clear statement for functions like `collect_list()` about the non-deterministic behavior of such functions. The behavior must be taken into account by users while creating and running queries.
Author: Maxim Gekk <maxim.gekk@databricks.com>
Closes#21228 from MaxGekk/deterministic-comments.
## What changes were proposed in this pull request?
Get closer to the Scala API behavior by throwing an exception when the input can't be parsed. Add tests.
## How was this patch tested?
Added tests.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#21211 from viirya/SPARK-24131-followup.
## What changes were proposed in this pull request?
Change FPGrowth from private to private[spark]. If no numPartitions is specified, the default value -1 is used. But -1 is only valid in FPGrowth's constructor, not in setNumPartitions. So I make this change and use the constructor directly rather than the set method.
## How was this patch tested?
Unit test is added
Author: Jeff Zhang <zjffdu@apache.org>
Closes#13493 from zjffdu/SPARK-15750.
## What changes were proposed in this pull request?
The PR adds the `slice` function. The behavior of the function is based on Presto's one.
The function slices an array according to the requested start index and length.
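A minimal sketch in the PySpark shell (where `spark` is predefined):
```python
# Start at 1-based index 2 and take 2 elements: [2, 3]
spark.sql("SELECT slice(array(1, 2, 3, 4), 2, 2)").show()
```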
## How was this patch tested?
added UTs
Author: Marco Gaido <marcogaido91@gmail.com>
Closes#21040 from mgaido91/SPARK-23930.
## What changes were proposed in this pull request?
The PR adds the SQL function `array_sort`. The behavior of the function is based on Presto's one.
The function sorts the input array in ascending order. The elements of the input array must be orderable. Null elements will be placed at the end of the returned array.
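A minimal sketch in the PySpark shell (where `spark` is predefined):
```python
# Ascending order with nulls placed last: [1, 2, 3, null]
spark.sql("SELECT array_sort(array(3, null, 1, 2))").show(truncate=False)
```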
## How was this patch tested?
Added UTs
Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Closes#21021 from kiszk/SPARK-23921.
This avoids polluting and leaving garbage behind in /tmp, and allows the
usual build tools to clean up any leftover files.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#21198 from vanzin/SPARK-24126.
## What changes were proposed in this pull request?
We need to determine Spark major and minor versions in PySpark. We can add a `majorMinorVersion` API to PySpark which is similar to the Scala API in `VersionUtils.majorMinorVersion`.
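A hedged sketch of the intended usage (the exact module path, assumed here to be `pyspark.util`, may differ):
```python
from pyspark.util import VersionUtils  # assumed location of the new helper

VersionUtils.majorMinorVersion("2.4.0")           # (2, 4)
VersionUtils.majorMinorVersion("2.3.1-SNAPSHOT")  # (2, 3)
```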
## How was this patch tested?
Added tests.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#21203 from viirya/SPARK-24131.
## What changes were proposed in this pull request?
When `PyArrow` or `Pandas` are not available, the corresponding PySpark tests are skipped automatically. Currently, PySpark tests fail when we are not using `-Phive`. This PR aims to skip Hive-related PySpark tests when `-Phive` is not given.
**BEFORE**
```bash
$ build/mvn -DskipTests clean package
$ python/run-tests.py --python-executables python2.7 --modules pyspark-sql
File "/Users/dongjoon/spark/python/pyspark/sql/readwriter.py", line 295, in pyspark.sql.readwriter.DataFrameReader.table
...
IllegalArgumentException: u"Error while instantiating 'org.apache.spark.sql.hive.HiveExternalCatalog':"
**********************************************************************
1 of 3 in pyspark.sql.readwriter.DataFrameReader.table
***Test Failed*** 1 failures.
```
**AFTER**
```bash
$ build/mvn -DskipTests clean package
$ python/run-tests.py --python-executables python2.7 --modules pyspark-sql
...
Tests passed in 138 seconds
Skipped tests in pyspark.sql.tests with python2.7:
...
test_hivecontext (pyspark.sql.tests.HiveSparkSubmitTests) ... skipped 'Hive is not available.'
```
## How was this patch tested?
This is a test-only change. First, this should pass the Jenkins. Then, manually do the following.
```bash
build/mvn -DskipTests clean package
python/run-tests.py --python-executables python2.7 --modules pyspark-sql
```
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#21141 from dongjoon-hyun/SPARK-23853.
## What changes were proposed in this pull request?
I propose to support the `samplingRatio` option for schema inferring of the CSV datasource, similar to the same option of the JSON datasource:
b14993e1fc/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala (L49-L50)
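A sketch of the proposed option in the PySpark shell (where `spark` is predefined); only the `samplingRatio` option is new here:
```python
# Infer the schema from roughly 10% of the input rows instead of all of them
df = (spark.read
      .option("samplingRatio", 0.1)
      .option("inferSchema", True)
      .option("header", True)
      .csv("/path/to/data.csv"))
```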
## How was this patch tested?
Added 2 tests for the JSON datasource and 2 tests for the CSV datasource. The tests check that only a subset of the input dataset is used for schema inferring.
Author: Maxim Gekk <maxim.gekk@databricks.com>
Author: Maxim Gekk <max.gekk@gmail.com>
Closes#20959 from MaxGekk/csv-sampling.
## What changes were proposed in this pull request?
I propose a new option for the JSON datasource which allows specifying the encoding (charset) of input and output files. Here is an example of using the option:
```
spark.read.schema(schema)
  .option("multiline", "true")
  .option("encoding", "UTF-16LE")
  .json(fileName)
```
If the option is not specified, the charset auto-detection mechanism is used by default.
The option can also be used for saving datasets to JSON files. Currently Spark is able to save datasets into JSON files in the `UTF-8` charset only. The changes allow saving data in any supported charset. Here is the approximate list of charsets supported by Oracle Java SE: https://docs.oracle.com/javase/8/docs/technotes/guides/intl/encoding.doc.html . A user can specify the charset of output JSON via the charset option like `.option("charset", "UTF-16BE")`. By default the output charset is still `UTF-8` to keep backward compatibility.
The solution has the following restrictions for per-line mode (`multiline = false`):
- If the charset is different from UTF-8, the lineSep option must be specified. The option is required because Hadoop's LineReader cannot detect the line separator correctly. Here is the ticket for solving the issue: https://issues.apache.org/jira/browse/SPARK-23725
- Encodings with a [BOM](https://en.wikipedia.org/wiki/Byte_order_mark) are not supported. For example, the `UTF-16` and `UTF-32` encodings are blacklisted. The problem can be solved by https://github.com/MaxGekk/spark-1/pull/2
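For the write path, a sketch of what the description implies (assuming the writer exposes the same options):
```python
# Read UTF-16LE JSON in multiline mode, then save the dataset in UTF-32BE
df = (spark.read
      .option("multiline", "true")
      .option("encoding", "UTF-16LE")
      .json("/path/to/in.json"))
(df.write
   .option("encoding", "UTF-32BE")
   .json("/path/to/out"))
```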
## How was this patch tested?
I added the following tests:
- reading a JSON file in `UTF-16LE` encoding with a BOM in `multiline` mode
- reading a JSON file using charset auto-detection (`UTF-32BE` with BOM)
- reading a JSON file using a user-specified charset (`UTF-16LE`)
- saving in `UTF-32BE` and reading the result with the standard library (not with Spark)
- checking that the default charset is `UTF-8`
- handling a wrong (unsupported) charset
Author: Maxim Gekk <maxim.gekk@databricks.com>
Author: Maxim Gekk <max.gekk@gmail.com>
Closes#20937 from MaxGekk/json-encoding-line-sep.
## What changes were proposed in this pull request?
This PR proposes to remove duplicated dependency-checking logic and also print out skipped tests from unittests.
For example, as below:
```
Skipped tests in pyspark.sql.tests with pypy:
test_createDataFrame_column_name_encoding (pyspark.sql.tests.ArrowTests) ... skipped 'Pandas >= 0.19.2 must be installed; however, it was not found.'
test_createDataFrame_does_not_modify_input (pyspark.sql.tests.ArrowTests) ... skipped 'Pandas >= 0.19.2 must be installed; however, it was not found.'
...
Skipped tests in pyspark.sql.tests with python3:
test_createDataFrame_column_name_encoding (pyspark.sql.tests.ArrowTests) ... skipped 'PyArrow >= 0.8.0 must be installed; however, it was not found.'
test_createDataFrame_does_not_modify_input (pyspark.sql.tests.ArrowTests) ... skipped 'PyArrow >= 0.8.0 must be installed; however, it was not found.'
...
```
Currently, these are not printed out to the console. I think we should print out skipped tests in the console.
## How was this patch tested?
Manually tested. Also, fortunately, Jenkins has good environment to test the skipped output.
Author: hyukjinkwon <gurwls223@apache.org>
Closes#21107 from HyukjinKwon/skipped-tests-print.
## What changes were proposed in this pull request?
Print out the data type in the AssertionError message to make it more meaningful.
## How was this patch tested?
I manually tested the changed code on my local, but didn't add any test.
Author: Huaxin Gao <huaxing@us.ibm.com>
Closes#21159 from huaxingao/spark-24057.
## What changes were proposed in this pull request?
The PR adds the SQL function `array_join`. The behavior of the function is based on Presto's one.
The function accepts an `array` of `string` to be joined, a `string` delimiter to use between the items of the first argument, and optionally a `string` used to replace `null` values.
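A minimal sketch in the PySpark shell (where `spark` is predefined):
```python
# Nulls are skipped unless a replacement string is supplied
spark.sql("SELECT array_join(array('a', null, 'b'), ', ')").show()       # a, b
spark.sql("SELECT array_join(array('a', null, 'b'), ', ', '?')").show()  # a, ?, b
```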
## How was this patch tested?
added UTs
Author: Marco Gaido <marcogaido91@gmail.com>
Closes#21011 from mgaido91/SPARK-23916.
## What changes were proposed in this pull request?
HIVE-15511 introduced the `roundOff` flag in order to disable the rounding to 8 digits which is performed in `months_between`. Since this can be a computationally intensive operation, skipping it may improve performance when the rounding is not needed.
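A sketch of the new flag in SQL, via the PySpark shell (where `spark` is predefined):
```python
# Default behavior: the result is rounded to 8 digits
spark.sql("SELECT months_between('1997-02-28 10:30:00', '1996-10-30')").show()
# roundOff = false: skip the rounding step
spark.sql("SELECT months_between('1997-02-28 10:30:00', '1996-10-30', false)").show()
```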
## How was this patch tested?
modified existing UT
Author: Marco Gaido <marcogaido91@gmail.com>
Closes#21008 from mgaido91/SPARK-23902.
## What changes were proposed in this pull request?
Added the `samplingRatio` option to the `json()` method of the PySpark DataFrame reader. Also improved existing tests for the Scala API according to the review of the PR: https://github.com/apache/spark/pull/20959
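A minimal sketch of the new parameter in the PySpark shell (where `spark` is predefined):
```python
# Infer the schema from roughly 20% of the input rows
df = spark.read.json("/path/to/data.json", samplingRatio=0.2)
```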
## How was this patch tested?
Added a new test for PySpark, updated 2 existing tests according to reviews of https://github.com/apache/spark/pull/20959, and added a new negative test.
Author: Maxim Gekk <maxim.gekk@databricks.com>
Closes#21056 from MaxGekk/json-sampling.
## What changes were proposed in this pull request?
The PR adds the SQL function `element_at`. The behavior of the function is based on Presto's one.
This function returns the element of the array at the given index if the column is an array, or the value for the given key if the column is a map.
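A minimal sketch in the PySpark shell (where `spark` is predefined):
```python
# Array: the element at 1-based index 2
spark.sql("SELECT element_at(array(10, 20, 30), 2)").show()    # 20
# Map: the value for the given key
spark.sql("SELECT element_at(map(1, 'a', 2, 'b'), 2)").show()  # b
```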
## How was this patch tested?
Added UTs
Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Closes#21053 from kiszk/SPARK-23924.
## What changes were proposed in this pull request?
The PR adds the SQL function `array_position`. The behavior of the function is based on Presto's one.
The function returns the position of the first occurrence of the element in array x (or 0 if not found), using a 1-based index, as a BigInt.
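A minimal sketch in the PySpark shell (where `spark` is predefined):
```python
spark.sql("SELECT array_position(array(10, 20, 30), 20)").show()  # 2 (1-based)
spark.sql("SELECT array_position(array(10, 20, 30), 99)").show()  # 0 (not found)
```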
## How was this patch tested?
Added UTs
Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Closes#21037 from kiszk/SPARK-23919.
## What changes were proposed in this pull request?
The `StreamingListener` on the PySpark side seems to lack the `onStreamingStarted` method. This patch adds it and a test for it.
This patch also includes a trivial doc improvement for `createDirectStream`.
Original PR is #21057.
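A hedged sketch of overriding the new callback (the fields available on the event object are an assumption):
```python
from pyspark.streaming.listener import StreamingListener

class MyListener(StreamingListener):
    def onStreamingStarted(self, streamingStarted):
        # Called once the StreamingContext has started
        print("streaming started")
```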
## How was this patch tested?
Added test.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#21098 from viirya/SPARK-24014.
## What changes were proposed in this pull request?
Python API for DataFrame-based multivariate summarizer.
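A hedged sketch of the resulting API, assuming it mirrors `pyspark.ml.stat.Summarizer` on the Scala side:
```python
from pyspark.ml.linalg import Vectors
from pyspark.ml.stat import Summarizer

df = spark.createDataFrame(
    [(Vectors.dense(1.0, 2.0),), (Vectors.dense(3.0, 4.0),)], ["features"])
# Request several metrics at once, or a single metric directly
df.select(Summarizer.metrics("mean", "count").summary(df.features)).show(truncate=False)
df.select(Summarizer.mean(df.features)).show(truncate=False)
```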
## How was this patch tested?
doctest added.
Author: WeichenXu <weichen.xu@databricks.com>
Closes#20695 from WeichenXu123/py_summarizer.
## What changes were proposed in this pull request?
The PR adds the SQL function `array_min`. It takes an array as argument and returns the minimum value in it.
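A minimal sketch in the PySpark shell (where `spark` is predefined):
```python
spark.sql("SELECT array_min(array(20, 10, 30))").show()  # 10
```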
## How was this patch tested?
added UTs
Author: Marco Gaido <marcogaido91@gmail.com>
Closes#21025 from mgaido91/SPARK-23918.
## What changes were proposed in this pull request?
Add Python API for collecting sub-models during CrossValidator/TrainValidationSplit fitting.
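A hedged sketch (the parameter and attribute names, `collectSubModels` and `subModels`, follow the Scala counterpart and are assumptions here):
```python
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

lr = LogisticRegression()
grid = ParamGridBuilder().addGrid(lr.maxIter, [5, 10]).build()
cv = CrossValidator(estimator=lr, estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(),
                    collectSubModels=True)
# After cvModel = cv.fit(train), the per-fold models are kept in cvModel.subModels
```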
## How was this patch tested?
UT added.
Author: WeichenXu <weichen.xu@databricks.com>
Closes#19627 from WeichenXu123/expose-model-list-py.
## What changes were proposed in this pull request?
The PR adds the SQL function `array_max`. It takes an array as argument and returns the maximum value in it.
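A minimal sketch in the PySpark shell (where `spark` is predefined):
```python
spark.sql("SELECT array_max(array(20, 10, 30))").show()  # 30
```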
## How was this patch tested?
added UTs
Author: Marco Gaido <marcogaido91@gmail.com>
Closes#21024 from mgaido91/SPARK-23917.
## What changes were proposed in this pull request?
This PR proposes to add `collect` to a query executor as an action.
It seems `collect` / `collect` with Arrow is not recognised via `QueryExecutionListener` as an action. For example, if we have a custom listener as below:
```scala
package org.apache.spark.sql
import org.apache.spark.internal.Logging
import org.apache.spark.sql.execution.QueryExecution
import org.apache.spark.sql.util.QueryExecutionListener
class TestQueryExecutionListener extends QueryExecutionListener with Logging {
  override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit = {
    logError("Look at me! I'm 'onSuccess'")
  }

  override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit = { }
}
```
and set `spark.sql.queryExecutionListeners` to `org.apache.spark.sql.TestQueryExecutionListener`.
Other operations on the PySpark or Scala side seem fine:
```python
>>> sql("SELECT * FROM range(1)").show()
```
```
18/04/09 17:02:04 ERROR TestQueryExecutionListener: Look at me! I'm 'onSuccess'
+---+
| id|
+---+
| 0|
+---+
```
```scala
scala> sql("SELECT * FROM range(1)").collect()
```
```
18/04/09 16:58:41 ERROR TestQueryExecutionListener: Look at me! I'm 'onSuccess'
res1: Array[org.apache.spark.sql.Row] = Array([0])
```
but ..
**Before**
```python
>>> sql("SELECT * FROM range(1)").collect()
```
```
[Row(id=0)]
```
```python
>>> spark.conf.set("spark.sql.execution.arrow.enabled", "true")
>>> sql("SELECT * FROM range(1)").toPandas()
```
```
id
0 0
```
**After**
```python
>>> sql("SELECT * FROM range(1)").collect()
```
```
18/04/09 16:57:58 ERROR TestQueryExecutionListener: Look at me! I'm 'onSuccess'
[Row(id=0)]
```
```python
>>> spark.conf.set("spark.sql.execution.arrow.enabled", "true")
>>> sql("SELECT * FROM range(1)").toPandas()
```
```
18/04/09 17:53:26 ERROR TestQueryExecutionListener: Look at me! I'm 'onSuccess'
id
0 0
```
## How was this patch tested?
I have manually tested as described above and unit test was added.
Author: hyukjinkwon <gurwls223@apache.org>
Closes#21007 from HyukjinKwon/SPARK-23942.
## What changes were proposed in this pull request?
MultilayerPerceptronClassifier had 4 occurrences of the typo being fixed.
Author: JBauerKogentix <37910022+JBauerKogentix@users.noreply.github.com>
Closes#21030 from JBauerKogentix/patch-1.
## What changes were proposed in this pull request?
There was a mistake in `tests.py`: a missing `assertEquals`.
## How was this patch tested?
Fixed tests.
Author: hyukjinkwon <gurwls223@apache.org>
Closes#21035 from HyukjinKwon/SPARK-23847.
## What changes were proposed in this pull request?
Add Python API for VectorAssembler handleInvalid.
## How was this patch tested?
Add doctest
Author: Huaxin Gao <huaxing@us.ibm.com>
Closes#21003 from huaxingao/spark-23871.
## What changes were proposed in this pull request?
Kolmogorov-Smirnov test Python API in `pyspark.ml`.
**Note**: an API with `CDF` is a little difficult to support in Python. We can add it in a following PR.
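A hedged sketch of the new API (class and method names assumed to mirror the Scala `KolmogorovSmirnovTest`):
```python
from pyspark.ml.stat import KolmogorovSmirnovTest

df = spark.createDataFrame([(-1.0,), (0.0,), (1.0,)], ["sample"])
# Test the sample column against a standard normal distribution N(0, 1)
result = KolmogorovSmirnovTest.test(df, "sample", "norm", 0.0, 1.0).first()
print(result.pValue, result.statistic)
```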
## How was this patch tested?
doctest
Author: WeichenXu <weichen.xu@databricks.com>
Closes#20904 from WeichenXu123/ks-test-py.
## What changes were proposed in this pull request?
Column.scala and functions.scala have asc_nulls_first, asc_nulls_last, desc_nulls_first and desc_nulls_last. Add the corresponding Python APIs in column.py and functions.py.
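A sketch of the added APIs in the PySpark shell (where `spark` is predefined):
```python
from pyspark.sql import functions as F

df = spark.createDataFrame([(1,), (None,), (2,)], ["v"])
df.orderBy(df.v.asc_nulls_last()).show()    # 1, 2, null
df.orderBy(F.desc_nulls_first("v")).show()  # null, 2, 1
```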
## How was this patch tested?
Add doctest
Author: Huaxin Gao <huaxing@us.ibm.com>
Closes#20962 from huaxingao/spark-23847.
## What changes were proposed in this pull request?
The Scala StringIndexerModel has an alternate constructor that will create the model from an array of label strings. Add the corresponding Python API:
```python
model = StringIndexerModel.from_labels(["a", "b", "c"])
```
## How was this patch tested?
Add doctest and unit test.
Author: Huaxin Gao <huaxing@us.ibm.com>
Closes#20968 from huaxingao/spark-23828.
## What changes were proposed in this pull request?
Add docstring to clarify default window frame boundaries with and without orderBy clause
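A sketch of the distinction being documented, in the PySpark shell (where `spark` is predefined):
```python
from pyspark.sql import Window
from pyspark.sql import functions as F

df = spark.createDataFrame([("a", 1), ("a", 2), ("b", 3)], ["key", "value"])
# Without orderBy, the default frame is the entire partition
w1 = Window.partitionBy("key")
# With orderBy, the default frame is unboundedPreceding to currentRow (a running sum)
w2 = Window.partitionBy("key").orderBy("value")
df.select("key", "value",
          F.sum("value").over(w1).alias("total"),
          F.sum("value").over(w2).alias("running")).show()
```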
## How was this patch tested?
Manually generate doc and check.
Author: Li Jin <ice.xelloss@gmail.com>
Closes#20978 from icexelloss/SPARK-23861-window-doc.
## What changes were proposed in this pull request?
Adding test for default params for `CountVectorizerModel` constructed from vocabulary. This required that the param `maxDF` be added, which was done in SPARK-23615.
## How was this patch tested?
Added an explicit test for CountVectorizerModel in DefaultValuesTests.
Author: Bryan Cutler <cutlerb@gmail.com>
Closes#20942 from BryanCutler/pyspark-CountVectorizerModel-default-param-test-SPARK-15009.
## What changes were proposed in this pull request?
This PR proposes to add lineSep option for a configurable line separator in text datasource.
It supports this option by using `LineRecordReader`'s functionality with passing it to the constructor.
The approach is similar to https://github.com/apache/spark/pull/20727; however, one main difference is that it uses the text datasource's `lineSep` option to parse line by line in JSON's schema inference.
## How was this patch tested?
Manually tested and unit tests were added.
Author: hyukjinkwon <gurwls223@apache.org>
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#20877 from HyukjinKwon/linesep-json.
## What changes were proposed in this pull request?
When using Arrow for createDataFrame or toPandas and an error is encountered with fallback disabled, this will raise the same type of error instead of a RuntimeError. This change also allows for the traceback of the error to be retained and prevents the accidental chaining of exceptions with Python 3.
## How was this patch tested?
Updated existing tests to verify error type.
Author: Bryan Cutler <cutlerb@gmail.com>
Closes#20839 from BryanCutler/arrow-raise-same-error-SPARK-23699.
## What changes were proposed in this pull request?
Adding r2adj in LinearRegressionSummary for Python API.
## How was this patch tested?
Added unit tests to exercise the api calls for the summary classes in tests.py.
Author: Kevin Yu <qyu@us.ibm.com>
Closes#20842 from kevinyu98/spark-23162.
## What changes were proposed in this pull request?
Add documentation about the limitations of `pandas_udf` with keyword arguments and related concepts, like `functools.partial` fn objects.
NOTE: intermediate commits on this PR show some of the steps that can be taken to fix some (but not all) of these pain points.
### Survey of problems we face today:
(Initialize) Note: Python 3.6 and Spark 2.4-SNAPSHOT.
```
from pyspark.sql import SparkSession
import inspect, functools
from pyspark.sql.functions import pandas_udf, PandasUDFType, col, lit, udf
spark = SparkSession.builder.getOrCreate()
print(spark.version)
df = spark.range(1,6).withColumn('b', col('id') * 2)
def ok(a,b): return a+b
```
Using a keyword argument at the call site `b=...` (and yes, *full* stack trace below, haha):
```
---> 14 df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')('id', b='id')).show() # no kwargs
TypeError: wrapper() got an unexpected keyword argument 'b'
```
Using partial with a keyword argument where the kw-arg is the first argument of the fn:
*(Aside: kind of interesting that lines 15,16 work great and then 17 explodes)*
```
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-9-e9f31b8799c1> in <module>()
15 df.withColumn('ok', pandas_udf(f=functools.partial(ok, 7), returnType='bigint')('id')).show()
16 df.withColumn('ok', pandas_udf(f=functools.partial(ok, b=7), returnType='bigint')('id')).show()
---> 17 df.withColumn('ok', pandas_udf(f=functools.partial(ok, a=7), returnType='bigint')('id')).show()
/Users/stu/ZZ/spark/python/pyspark/sql/functions.py in pandas_udf(f, returnType, functionType)
2378 return functools.partial(_create_udf, returnType=return_type, evalType=eval_type)
2379 else:
-> 2380 return _create_udf(f=f, returnType=return_type, evalType=eval_type)
2381
2382
/Users/stu/ZZ/spark/python/pyspark/sql/udf.py in _create_udf(f, returnType, evalType)
54 argspec.varargs is None:
55 raise ValueError(
---> 56 "Invalid function: 0-arg pandas_udfs are not supported. "
57 "Instead, create a 1-arg pandas_udf and ignore the arg in your function."
58 )
ValueError: Invalid function: 0-arg pandas_udfs are not supported. Instead, create a 1-arg pandas_udf and ignore the arg in your function.
```
Author: Michael (Stu) Stewart <mstewart141@gmail.com>
Closes#20900 from mstewart141/udfkw2.
## What changes were proposed in this pull request?
This cleans up unused imports, mainly from the pyspark.sql module. Added a note in functions.py that imports `UserDefinedFunction` only to maintain backwards compatibility for `from pyspark.sql.functions import UserDefinedFunction`.
## How was this patch tested?
Existing tests and built docs.
Author: Bryan Cutler <cutlerb@gmail.com>
Closes#20892 from BryanCutler/pyspark-cleanup-imports-SPARK-23700.
## What changes were proposed in this pull request?
The maxDF parameter is for filtering out frequently occurring terms. This param was recently added to the Scala CountVectorizer and needs to be added to Python also.
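A hedged sketch of the new param (semantics assumed to mirror `minDF`: values >= 1 are an absolute document count, values below 1 a fraction of documents):
```python
from pyspark.ml.feature import CountVectorizer

# Ignore terms that appear in more than 10 documents
cv = CountVectorizer(inputCol="words", outputCol="features", maxDF=10.0)
```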
## How was this patch tested?
add test
Author: Huaxin Gao <huaxing@us.ibm.com>
Closes#20777 from huaxingao/spark-23615.
## What changes were proposed in this pull request?
Currently when a PySpark Model is transformed, default params that have not been explicitly set are then set on the Java side on the call to `wrapper._transfer_values_to_java`. This incorrectly changes the state of the Param as it should still be marked as a default value only.
## How was this patch tested?
Added a new test to verify that when transferring Params to Java, default params have their state preserved.
Author: Bryan Cutler <cutlerb@gmail.com>
Closes#18982 from BryanCutler/pyspark-ml-param-to-java-defaults-SPARK-21685.
## What changes were proposed in this pull request?
This PR proposes to remove out unused codes, `_ignore_brackets_split` and `_BRACKETS`.
`_ignore_brackets_split` was introduced in d57daf1f77 to refactor and support `toDF("...")`; however, ebc124d4c4 replaced the logic there. It seems `_ignore_brackets_split` is not referenced anymore.
`_BRACKETS` was introduced in 880eabec37; however, all other usages were removed in 648a8626b8.
This is rather a followup for ebc124d4c4 which I missed in that PR.
## How was this patch tested?
Manually tested. Existing tests should cover this. I also double checked by `grep` in the whole repo.
Author: hyukjinkwon <gurwls223@apache.org>
Closes#20878 from HyukjinKwon/minor-remove-unused.
## What changes were proposed in this pull request?
This PR proposes to add `lineSep` option for a configurable line separator in text datasource.
It supports this option by using `LineRecordReader`'s functionality with passing it to the constructor.
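A sketch of the option in the PySpark shell (where `spark` is predefined); the custom separator is illustrative:
```python
# Split records on '||' instead of the default newline handling
df = spark.read.option("lineSep", "||").text("/path/to/file.txt")
```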
## How was this patch tested?
Manual tests and unit tests were added.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#20727 from HyukjinKwon/linesep-text.
## What changes were proposed in this pull request?
d6632d185e added a useful util:
```python
@contextmanager
def sql_conf(self, pairs):
    ...
```
to allow configuration set/unset within a block:
```python
with self.sql_conf({"spark.blah.blah.blah": "blah"}):
    # test codes
```
This PR proposes to use this util where possible in PySpark tests.
Note that there already appear to be a few places affecting tests without restoring the original value back in unittest classes.
## How was this patch tested?
Manually tested via:
```
./run-tests --modules=pyspark-sql --python-executables=python2
./run-tests --modules=pyspark-sql --python-executables=python3
```
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#20830 from HyukjinKwon/cleanup-sql-conf.