ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Bryan Cutler	ddc2052ebd	[SPARK-23836][PYTHON] Add support for StructType return in Scalar Pandas UDF ## What changes were proposed in this pull request? This change adds support for returning StructType from a scalar Pandas UDF, where the return value of the function is a pandas.DataFrame. Nested structs are not supported and an error will be raised, child types can be any other type currently supported. ## How was this patch tested? Added additional unit tests to `test_pandas_udf_scalar` Closes #23900 from BryanCutler/pyspark-support-scalar_udf-StructType-SPARK-23836. Authored-by: Bryan Cutler <cutlerb@gmail.com> Signed-off-by: Bryan Cutler <cutlerb@gmail.com>	2019-03-07 08:52:24 -08:00
Hyukjin Kwon	a56b3511fc	[SPARK-26945][PYTHON][SS][TESTS] Fix flaky test_*_await_termination in PySpark SS tests ## What changes were proposed in this pull request? This PR proposes to make sure processing all available data before stopping and delete the temp directory. See https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/102518/console ``` ERROR: test_query_manager_await_termination (pyspark.sql.tests.test_streaming.StreamingTests) ---------------------------------------------------------------------- Traceback (most recent call last): File "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/sql/tests/test_streaming.py", line 259, in test_query_manager_await_termination shutil.rmtree(tmpPath) File "/home/anaconda/lib/python2.7/shutil.py", line 256, in rmtree onerror(os.rmdir, path, sys.exc_info()) File "/home/anaconda/lib/python2.7/shutil.py", line 254, in rmtree os.rmdir(path) OSError: [Errno 39] Directory not empty: '/home/jenkins/workspace/SparkPullRequestBuilder/python/target/072153bd-f981-47be-bda2-e2b657a16f65/tmp4WGp7n' ``` See https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/102311/console ``` ERROR: test_stream_await_termination (pyspark.sql.tests.test_streaming.StreamingTests) ---------------------------------------------------------------------- Traceback (most recent call last): File "/home/jenkins/workspace/SparkPullRequestBuilder2/python/pyspark/sql/tests/test_streaming.py", line 202, in test_stream_await_termination shutil.rmtree(tmpPath) File "/usr/lib64/pypy-2.5.1/lib-python/2.7/shutil.py", line 256, in rmtree onerror(os.rmdir, path, sys.exc_info()) File "/usr/lib64/pypy-2.5.1/lib-python/2.7/shutil.py", line 254, in rmtree os.rmdir(path) OSError: [Errno 39] Directory not empty: '/home/jenkins/workspace/SparkPullRequestBuilder2/python/target/7244f4ff-6b60-4f6c-b787-de4f15922bf5/tmpQbMZSo' ``` ## How was this patch tested? Jenkins tests - I should run multiple times to see if there are other flaky tests + if this PR really fixes it. Closes #23870 from HyukjinKwon/SPARK-26945. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2019-02-23 14:57:04 +08:00
Takuya UESHIN	4a4e7aeca7	[SPARK-26887][SQL][PYTHON][NS] Create datetime.date directly instead of creating datetime64 as intermediate data. ## What changes were proposed in this pull request? Currently `DataFrame.toPandas()` with arrow enabled or `ArrowStreamPandasSerializer` for pandas UDF with pyarrow<0.12 creates `datetime64[ns]` type series as intermediate data and then convert to `datetime.date` series, but the intermediate `datetime64[ns]` might cause an overflow even if the date is valid. ``` >>> import datetime >>> >>> t = [datetime.date(2262, 4, 12), datetime.date(2263, 4, 12)] >>> >>> df = spark.createDataFrame(t, 'date') >>> df.show() +----------+ \| value\| +----------+ \|2262-04-12\| \|2263-04-12\| +----------+ >>> >>> spark.conf.set("spark.sql.execution.arrow.enabled", "true") >>> >>> df.toPandas() value 0 1677-09-21 1 1678-09-21 ``` We should avoid creating such intermediate data and create `datetime.date` series directly instead. ## How was this patch tested? Modified some tests to include the date which overflow caused by the intermediate conversion. Run tests with pyarrow 0.8, 0.10, 0.11, 0.12 in my local environment. Closes #23795 from ueshin/issues/SPARK-26887/date_as_object. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2019-02-18 11:48:10 +08:00
Dilip Biswal	7f44c9a252	[SPARK-26864][SQL] Query may return incorrect result when python udf is used as a join condition and the udf uses attributes from both legs of left semi join. ## What changes were proposed in this pull request? In SPARK-25314, we supported the scenario of having a python UDF that refers to attributes from both legs of a join condition by rewriting the plan to convert an inner join or left semi join to a filter over a cross join. In case of left semi join, this transformation may cause incorrect results when the right leg of join condition produces duplicate rows based on the join condition. This fix disallows the rewrite for left semi join and raises an error in the case like we do for other types of join. In future, we should have separate rule in optimizer to convert left semi join to inner join (I am aware of one case we could do it if we leverage informational constraint i.e when we know the right side does not produce duplicates). Python ```SQL >>> from pyspark import SparkContext >>> from pyspark.sql import SparkSession, Column, Row >>> from pyspark.sql.functions import UserDefinedFunction, udf >>> from pyspark.sql.types import * >>> from pyspark.sql.utils import AnalysisException >>> >>> spark.conf.set("spark.sql.crossJoin.enabled", "True") >>> left = spark.createDataFrame([Row(lc1=1, lc2=1), Row(lc1=2, lc2=2)]) >>> right = spark.createDataFrame([Row(rc1=1, rc2=1), Row(rc1=1, rc2=1)]) >>> func = udf(lambda a, b: a == b, BooleanType()) >>> df = left.join(right, func("lc1", "rc1"), "leftsemi").show() 19/02/12 16:07:10 WARN PullOutPythonUDFInJoinCondition: The join condition:<lambda>(lc1#0L, rc1#4L) of the join plan contains PythonUDF only, it will be moved out and the join plan will be turned to cross join. +---+---+ \|lc1\|lc2\| +---+---+ \| 1\| 1\| \| 1\| 1\| +---+---+ ``` Scala ```SQL scala> val left = Seq((1, 1), (2, 2)).toDF("lc1", "lc2") left: org.apache.spark.sql.DataFrame = [lc1: int, lc2: int] scala> val right = Seq((1, 1), (1, 1)).toDF("rc1", "rc2") right: org.apache.spark.sql.DataFrame = [rc1: int, rc2: int] scala> val equal = udf((p1: Integer, p2: Integer) => { \| p1 == p2 \| }) equal: org.apache.spark.sql.expressions.UserDefinedFunction = SparkUserDefinedFunction($Lambda$2141/11016292394666f1b5,BooleanType,List(Some(Schema(IntegerType,true)), Some(Schema(IntegerType,true))),None,false,true) scala> val df = left.join(right, equal(col("lc1"), col("rc1")), "leftsemi") df: org.apache.spark.sql.DataFrame = [lc1: int, lc2: int] scala> df.show() +---+---+ \|lc1\|lc2\| +---+---+ \| 1\| 1\| +---+---+ ``` ## How was this patch tested? Modified existing tests. Closes #23769 from dilipbiswal/dkb_python_udf_in_join. Authored-by: Dilip Biswal <dbiswal@us.ibm.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-02-13 21:14:19 +08:00
Bryan Cutler	16990f9299	[SPARK-26566][PYTHON][SQL] Upgrade Apache Arrow to version 0.12.0 ## What changes were proposed in this pull request? Upgrade Apache Arrow to version 0.12.0. This includes the Java artifacts and fixes to enable usage with pyarrow 0.12.0 Version 0.12.0 includes the following selected fixes/improvements relevant to Spark users: * Safe cast fails from numpy float64 array with nans to integer, ARROW-4258 * Java, Reduce heap usage for variable width vectors, ARROW-4147 * Binary identity cast not implemented, ARROW-4101 * pyarrow open_stream deprecated, use ipc.open_stream, ARROW-4098 * conversion to date object no longer needed, ARROW-3910 * Error reading IPC file with no record batches, ARROW-3894 * Signed to unsigned integer cast yields incorrect results when type sizes are the same, ARROW-3790 * from_pandas gives incorrect results when converting floating point to bool, ARROW-3428 * Import pyarrow fails if scikit-learn is installed from conda (boost-cpp / libboost issue), ARROW-3048 * Java update to official Flatbuffers version 1.9.0, ARROW-3175 complete list [here](https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20(Resolved%2C%20Closed)%20AND%20fixVersion%20%3D%200.12.0) PySpark requires the following fixes to work with PyArrow 0.12.0 * Encrypted pyspark worker fails due to ChunkedStream missing closed property * pyarrow now converts dates as objects by default, which causes error because type is assumed datetime64 * ArrowTests fails due to difference in raised error message * pyarrow.open_stream deprecated * tests fail because groupby adds index column with duplicate name ## How was this patch tested? Ran unit tests with pyarrow versions 0.8.0, 0.10.0, 0.11.1, 0.12.0 Closes #23657 from BryanCutler/arrow-upgrade-012. Authored-by: Bryan Cutler <cutlerb@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2019-01-29 14:18:45 +08:00
Liang-Chi Hsieh	f92d276653	[SPARK-25811][PYSPARK] Raise a proper error when unsafe cast is detected by PyArrow ## What changes were proposed in this pull request? Since 0.11.0, PyArrow supports to raise an error for unsafe cast ([PR](https://github.com/apache/arrow/pull/2504)). We should use it to raise a proper error for pandas udf users when such cast is detected. Added a SQL config `spark.sql.execution.pandas.arrowSafeTypeConversion` to disable Arrow safe type check. ## How was this patch tested? Added test and manually test. Closes #22807 from viirya/SPARK-25811. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2019-01-22 14:54:41 +08:00
Hyukjin Kwon	75d84498a4	[SPARK-26676][PYTHON] Make HiveContextSQLTests.test_unbounded_frames test compatible with Python 2 and PyPy ## What changes were proposed in this pull request? This particular test is being skipped at PyPy and Python 2. ``` Skipped tests in pyspark.sql.tests.test_context with pypy: test_unbounded_frames (pyspark.sql.tests.test_context.HiveContextSQLTests) ... skipped "Unittest < 3.3 doesn't support mocking" Skipped tests in pyspark.sql.tests.test_context with python2.7: test_unbounded_frames (pyspark.sql.tests.test_context.HiveContextSQLTests) ... skipped "Unittest < 3.3 doesn't support mocking" ``` We don't have to use unittest 3.3 module to mock. And looks the test itself isn't compatible with Python 2. This PR makes: - Manually monkey-patch `sys.maxsize` to get rid of unittest 3.3 condition - Use the built-in `reload` in Python 2, and `importlib.reload` in Python 3 ## How was this patch tested? Manually tested, and unit test is fixed. Closes #23604 from HyukjinKwon/test-window. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Bryan Cutler <cutlerb@gmail.com>	2019-01-21 14:27:17 -08:00
Marco Gaido	6d9c54b62c	[SPARK-26645][PYTHON] Support decimals with negative scale when parsing datatype ## What changes were proposed in this pull request? When parsing datatypes from the json internal representation, PySpark doesn't support decimals with negative scales. Since they are allowed and can actually happen, PySpark should be able to successfully parse them. ## How was this patch tested? added test Closes #23575 from mgaido91/SPARK-26645. Authored-by: Marco Gaido <marcogaido91@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2019-01-20 17:43:50 +08:00
Li Jin	86100df54b	[SPARK-24561][SQL][PYTHON] User-defined window aggregation functions with Pandas UDF (bounded window) ## What changes were proposed in this pull request? This PR implements a new feature - window aggregation Pandas UDF for bounded window. #### Doc: https://docs.google.com/document/d/14EjeY5z4-NC27-SmIP9CsMPCANeTcvxN44a7SIJtZPc/edit#heading=h.c87w44wcj3wj #### Example: ``` from pyspark.sql.functions import pandas_udf, PandasUDFType from pyspark.sql.window import Window df = spark.range(0, 10, 2).toDF('v') w1 = Window.partitionBy().orderBy('v').rangeBetween(-2, 4) w2 = Window.partitionBy().orderBy('v').rowsBetween(-2, 2) pandas_udf('double', PandasUDFType.GROUPED_AGG) def avg(v): return v.mean() df.withColumn('v_mean', avg(df['v']).over(w1)).show() # +---+------+ # \| v\|v_mean\| # +---+------+ # \| 0\| 1.0\| # \| 2\| 2.0\| # \| 4\| 4.0\| # \| 6\| 6.0\| # \| 8\| 7.0\| # +---+------+ df.withColumn('v_mean', avg(df['v']).over(w2)).show() # +---+------+ # \| v\|v_mean\| # +---+------+ # \| 0\| 2.0\| # \| 2\| 3.0\| # \| 4\| 4.0\| # \| 6\| 5.0\| # \| 8\| 6.0\| # +---+------+ ``` #### High level changes: This PR modifies the existing WindowInPandasExec physical node to deal with unbounded (growing, shrinking and sliding) windows. * `WindowInPandasExec` now share the same base class as `WindowExec` and share utility functions. See `WindowExecBase` * `WindowFunctionFrame` now has two new functions `currentLowerBound` and `currentUpperBound` - to return the lower and upper window bound for the current output row. It is also modified to allow `AggregateProcessor` == null. Null aggregator processor is used for `WindowInPandasExec` where we don't have an aggregator and only uses lower and upper bound functions from `WindowFunctionFrame` * The biggest change is in `WindowInPandasExec`, where it is modified to take `currentLowerBound` and `currentUpperBound` and write those values together with the input data to the python process for rolling window aggregation. See `WindowInPandasExec` for more details. #### Discussion In benchmarking, I found numpy variant of the rolling window UDF is much faster than the pandas version: Spark SQL window function: 20s Pandas variant: ~80s Numpy variant: 10s Numpy variant with numba: 4s Allowing numpy variant of the vectorized UDFs is something I want to discuss because of the performance improvement, but doesn't have to be in this PR. ## How was this patch tested? New tests Closes #22305 from icexelloss/SPARK-24561-bounded-window-udf. Authored-by: Li Jin <ice.xelloss@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2018-12-18 09:15:21 +08:00
Li Jin	160e583a17	[SPARK-26364][PYTHON][TESTING] Clean up imports in test_pandas_udf* ## What changes were proposed in this pull request? Clean up unconditional import statements and move them to the top. Conditional imports (pandas, numpy, pyarrow) are left as-is. ## How was this patch tested? Exising tests. Closes #23314 from icexelloss/clean-up-test-imports. Authored-by: Li Jin <ice.xelloss@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2018-12-14 10:45:24 +08:00
Takuya UESHIN	8edae94fa7	[SPARK-26355][PYSPARK] Add a workaround for PyArrow 0.11. ## What changes were proposed in this pull request? In PyArrow 0.11, there is a API breaking change. - [ARROW-1949](https://issues.apache.org/jira/browse/ARROW-1949) - [Python/C++] Add option to Array.from_pandas and pyarrow.array to perform unsafe casts. This causes test failures in `ScalarPandasUDFTests.test_vectorized_udf_null_(byte\|short\|int\|long)`: ``` File "/Users/ueshin/workspace/apache-spark/spark/python/pyspark/worker.py", line 377, in main process() File "/Users/ueshin/workspace/apache-spark/spark/python/pyspark/worker.py", line 372, in process serializer.dump_stream(func(split_index, iterator), outfile) File "/Users/ueshin/workspace/apache-spark/spark/python/pyspark/serializers.py", line 317, in dump_stream batch = _create_batch(series, self._timezone) File "/Users/ueshin/workspace/apache-spark/spark/python/pyspark/serializers.py", line 286, in _create_batch arrs = [create_array(s, t) for s, t in series] File "/Users/ueshin/workspace/apache-spark/spark/python/pyspark/serializers.py", line 284, in create_array return pa.Array.from_pandas(s, mask=mask, type=t) File "pyarrow/array.pxi", line 474, in pyarrow.lib.Array.from_pandas return array(obj, mask=mask, type=type, safe=safe, from_pandas=True, File "pyarrow/array.pxi", line 169, in pyarrow.lib.array return _ndarray_to_array(values, mask, type, from_pandas, safe, File "pyarrow/array.pxi", line 69, in pyarrow.lib._ndarray_to_array check_status(NdarrayToArrow(pool, values, mask, from_pandas, File "pyarrow/error.pxi", line 81, in pyarrow.lib.check_status raise ArrowInvalid(message) ArrowInvalid: Floating point value truncated ``` We should add a workaround to support PyArrow 0.11. ## How was this patch tested? In my local environment. Closes #23305 from ueshin/issues/SPARK-26355/pyarrow_0.11. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2018-12-13 13:14:59 +08:00
Wenchen Fan	7d5f6e8c49	[SPARK-26293][SQL] Cast exception when having python udf in subquery ## What changes were proposed in this pull request? This is a regression introduced by https://github.com/apache/spark/pull/22104 at Spark 2.4.0. When we have Python UDF in subquery, we will hit an exception ``` Caused by: java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.AttributeReference cannot be cast to org.apache.spark.sql.catalyst.expressions.PythonUDF at scala.collection.immutable.Stream.map(Stream.scala:414) at org.apache.spark.sql.execution.python.EvalPythonExec.$anonfun$doExecute$2(EvalPythonExec.scala:98) at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:815) ... ``` https://github.com/apache/spark/pull/22104 turned `ExtractPythonUDFs` from a physical rule to optimizer rule. However, there is a difference between a physical rule and optimizer rule. A physical rule always runs once, an optimizer rule may be applied twice on a query tree even the rule is located in a batch that only runs once. For a subquery, the `OptimizeSubqueries` rule will execute the entire optimizer on the query plan inside subquery. Later on subquery will be turned to joins, and the optimizer rules will be applied to it again. Unfortunately, the `ExtractPythonUDFs` rule is not idempotent. When it's applied twice on a query plan inside subquery, it will produce a malformed plan. It extracts Python UDF from Python exec plans. This PR proposes 2 changes to be double safe: 1. `ExtractPythonUDFs` should skip python exec plans, to make the rule idempotent 2. `ExtractPythonUDFs` should skip subquery ## How was this patch tested? a new test. Closes #23248 from cloud-fan/python. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2018-12-11 14:16:51 +08:00
Bryan Cutler	ecaa495b1f	[SPARK-25274][PYTHON][SQL] In toPandas with Arrow send un-ordered record batches to improve performance ## What changes were proposed in this pull request? When executing `toPandas` with Arrow enabled, partitions that arrive in the JVM out-of-order must be buffered before they can be send to Python. This causes an excess of memory to be used in the driver JVM and increases the time it takes to complete because data must sit in the JVM waiting for preceding partitions to come in. This change sends un-ordered partitions to Python as soon as they arrive in the JVM, followed by a list of partition indices so that Python can assemble the data in the correct order. This way, data is not buffered at the JVM and there is no waiting on particular partitions so performance will be increased. Followup to #21546 ## How was this patch tested? Added new test with a large number of batches per partition, and test that forces a small delay in the first partition. These test that partitions are collected out-of-order and then are are put in the correct order in Python. ## Performance Tests - toPandas Tests run on a 4 node standalone cluster with 32 cores total, 14.04.1-Ubuntu and OpenJDK 8 measured wall clock time to execute `toPandas()` and took the average best time of 5 runs/5 loops each. Test code ```python df = spark.range(1 << 25, numPartitions=32).toDF("id").withColumn("x1", rand()).withColumn("x2", rand()).withColumn("x3", rand()).withColumn("x4", rand()) for i in range(5): start = time.time() _ = df.toPandas() elapsed = time.time() - start ``` Spark config ``` spark.driver.memory 5g spark.executor.memory 5g spark.driver.maxResultSize 2g spark.sql.execution.arrow.enabled true ``` Current Master w/ Arrow stream \| This PR ---------------------\|------------ 5.16207 \| 4.342533 5.133671 \| 4.399408 5.147513 \| 4.468471 5.105243 \| 4.36524 5.018685 \| 4.373791 Avg Master \| Avg This PR ------------------\|-------------- 5.1134364 \| 4.3898886 Speedup of 1.164821449 Closes #22275 from BryanCutler/arrow-toPandas-oo-batches-SPARK-25274. Authored-by: Bryan Cutler <cutlerb@gmail.com> Signed-off-by: Bryan Cutler <cutlerb@gmail.com>	2018-12-06 10:07:28 -08:00
DylanGuedes	28d3374407	[SPARK-23647][PYTHON][SQL] Adds more types for hint in pyspark Signed-off-by: DylanGuedes <djmgguedesgmail.com> ## What changes were proposed in this pull request? Addition of float, int and list hints for `pyspark.sql` Hint. ## How was this patch tested? I did manual tests following the same principles used in the Scala version, and also added unit tests. Closes #20788 from DylanGuedes/jira-21030. Authored-by: DylanGuedes <djmgguedes@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2018-12-01 10:37:03 +08:00
Wenchen Fan	affe80958d	[SPARK-26147][SQL] only pull out unevaluable python udf from join condition ## What changes were proposed in this pull request? https://github.com/apache/spark/pull/22326 made a mistake that, not all python UDFs are unevaluable in join condition. Only python UDFs that refer to attributes from both join side are unevaluable. This PR fixes this mistake. ## How was this patch tested? a new test Closes #23153 from cloud-fan/join. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2018-11-28 20:38:42 +08:00
hyukjinkwon	03306a6df3	[SPARK-26036][PYTHON] Break large tests.py files into smaller files ## What changes were proposed in this pull request? This PR continues to break down a big large file into smaller files. See https://github.com/apache/spark/pull/23021. It targets to follow https://github.com/numpy/numpy/tree/master/numpy. Basically this PR proposes to break down `pyspark/tests.py` into ...: ``` pyspark ... ├── testing ... │ └── utils.py ├── tests │ ├── __init__.py │ ├── test_appsubmit.py │ ├── test_broadcast.py │ ├── test_conf.py │ ├── test_context.py │ ├── test_daemon.py │ ├── test_join.py │ ├── test_profiler.py │ ├── test_rdd.py │ ├── test_readwrite.py │ ├── test_serializers.py │ ├── test_shuffle.py │ ├── test_taskcontext.py │ ├── test_util.py │ └── test_worker.py ... ``` ## How was this patch tested? Existing tests should cover. `cd python` and .`/run-tests-with-coverage`. Manually checked they are actually being ran. Each test (not officially) can be ran via: ```bash SPARK_TESTING=1 ./bin/pyspark pyspark.tests.test_context ``` Note that if you're using Mac and Python 3, you might have to `OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES`. Closes #23033 from HyukjinKwon/SPARK-26036. Authored-by: hyukjinkwon <gurwls223@apache.org> Signed-off-by: hyukjinkwon <gurwls223@apache.org>	2018-11-15 12:30:52 +08:00
hyukjinkwon	a7a331df6e	[SPARK-26032][PYTHON] Break large sql/tests.py files into smaller files ## What changes were proposed in this pull request? This is the official first attempt to break huge single `tests.py` file - I did it locally before few times and gave up for some reasons. Now, currently it really makes the unittests super hard to read and difficult to check. To me, it even bothers me to to scroll down the big file. It's one single 7000 lines file! This is not only readability issue. Since one big test takes most of tests time, the tests don't run in parallel fully - although it will costs to start and stop the context. We could pick up one example and follow. Given my investigation, the current style looks closer to NumPy structure and looks easier to follow. Please see https://github.com/numpy/numpy/tree/master/numpy. Basically this PR proposes to break down `pyspark/sql/tests.py` into ...: ```bash pyspark ... ├── sql ... │ ├── tests # Includes all tests broken down from 'pyspark/sql/tests.py' │ │ │ # Each matchs to module in 'pyspark/sql'. Additionally, some logical group can │ │ │ # be added. For instance, 'test_arrow.py', 'test_datasources.py' ... │ │ ├── __init__.py │ │ ├── test_appsubmit.py │ │ ├── test_arrow.py │ │ ├── test_catalog.py │ │ ├── test_column.py │ │ ├── test_conf.py │ │ ├── test_context.py │ │ ├── test_dataframe.py │ │ ├── test_datasources.py │ │ ├── test_functions.py │ │ ├── test_group.py │ │ ├── test_pandas_udf.py │ │ ├── test_pandas_udf_grouped_agg.py │ │ ├── test_pandas_udf_grouped_map.py │ │ ├── test_pandas_udf_scalar.py │ │ ├── test_pandas_udf_window.py │ │ ├── test_readwriter.py │ │ ├── test_serde.py │ │ ├── test_session.py │ │ ├── test_streaming.py │ │ ├── test_types.py │ │ ├── test_udf.py │ │ └── test_utils.py ... ├── testing # Includes testing utils that can be used in unittests. │ ├── __init__.py │ └── sqlutils.py ... ``` ## How was this patch tested? Existing tests should cover. `cd python` and `./run-tests-with-coverage`. Manually checked they are actually being ran. Each test (not officially) can be ran via: ``` SPARK_TESTING=1 ./bin/pyspark pyspark.sql.tests.test_pandas_udf_scalar ``` Note that if you're using Mac and Python 3, you might have to `OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES`. Closes #23021 from HyukjinKwon/SPARK-25344. Authored-by: hyukjinkwon <gurwls223@apache.org> Signed-off-by: hyukjinkwon <gurwls223@apache.org>	2018-11-14 14:51:11 +08:00

17 commits