ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
zero323	31a16fbb40	[SPARK-32714][PYTHON] Initial pyspark-stubs port ### What changes were proposed in this pull request? This PR proposes migration of [`pyspark-stubs`](https://github.com/zero323/pyspark-stubs) into the Spark codebase. ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? Yes. This PR adds type annotations directly to the Spark source. This can affect how development tools behave for users who haven't used `pyspark-stubs`. ### How was this patch tested? - [x] MyPy tests of the PySpark source ``` mypy --no-incremental --config python/mypy.ini python/pyspark ``` - [x] MyPy tests of Spark examples ``` MYPYPATH=python/ mypy --no-incremental --config python/mypy.ini examples/src/main/python/ml examples/src/main/python/sql examples/src/main/python/sql/streaming ``` - [x] Existing Flake8 linter - [x] Existing unit tests Tested against: - `mypy==0.790+dev.e959952d9001e9713d329a2f9b196705b028f894` - `mypy==0.782` Closes #29591 from zero323/SPARK-32681. Authored-by: zero323 <mszymkiewicz@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-09-24 14:15:36 +09:00
Fokko Driesprong	a1e459ed9f	[SPARK-32719][PYTHON] Add Flake8 check missing imports https://issues.apache.org/jira/browse/SPARK-32719 ### What changes were proposed in this pull request? Add a check to detect missing imports. This makes sure that if we use a specific class, it should be explicitly imported (not using a wildcard). ### Why are the changes needed? To make sure that the quality of the Python code is up to standard. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing unit-tests and Flake8 static analysis Closes #29563 from Fokko/fd-add-check-missing-imports. Authored-by: Fokko Driesprong <fokko@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-08-31 11:23:31 +09:00
Fokko Driesprong	9fcf0ea718	[SPARK-32319][PYSPARK] Disallow the use of unused imports Disallow the use of unused imports: - Unnecessary increases the memory footprint of the application - Removes the imports that are required for the examples in the docstring from the file-scope to the example itself. This keeps the files itself clean, and gives a more complete example as it also includes the imports :) ``` fokkodriesprongFan spark % flake8 python \| grep -i "imported but unused" python/pyspark/cloudpickle.py:46:1: F401 'functools.partial' imported but unused python/pyspark/cloudpickle.py:55:1: F401 'traceback' imported but unused python/pyspark/heapq3.py:868:5: F401 '_heapq.' imported but unused python/pyspark/__init__.py:61:1: F401 'pyspark.version.__version__' imported but unused python/pyspark/__init__.py:62:1: F401 'pyspark._globals._NoValue' imported but unused python/pyspark/__init__.py:115:1: F401 'pyspark.sql.SQLContext' imported but unused python/pyspark/__init__.py:115:1: F401 'pyspark.sql.HiveContext' imported but unused python/pyspark/__init__.py:115:1: F401 'pyspark.sql.Row' imported but unused python/pyspark/rdd.py:21:1: F401 're' imported but unused python/pyspark/rdd.py:29:1: F401 'tempfile.NamedTemporaryFile' imported but unused python/pyspark/mllib/regression.py:26:1: F401 'pyspark.mllib.linalg.SparseVector' imported but unused python/pyspark/mllib/clustering.py:28:1: F401 'pyspark.mllib.linalg.SparseVector' imported but unused python/pyspark/mllib/clustering.py:28:1: F401 'pyspark.mllib.linalg.DenseVector' imported but unused python/pyspark/mllib/classification.py:26:1: F401 'pyspark.mllib.linalg.SparseVector' imported but unused python/pyspark/mllib/feature.py:28:1: F401 'pyspark.mllib.linalg.DenseVector' imported but unused python/pyspark/mllib/feature.py:28:1: F401 'pyspark.mllib.linalg.SparseVector' imported but unused python/pyspark/mllib/feature.py:30:1: F401 'pyspark.mllib.regression.LabeledPoint' imported but unused 
python/pyspark/mllib/tests/test_linalg.py:18:1: F401 'sys' imported but unused python/pyspark/mllib/tests/test_linalg.py:642:5: F401 'pyspark.mllib.tests.test_linalg.' imported but unused python/pyspark/mllib/tests/test_feature.py:21:1: F401 'numpy.random' imported but unused python/pyspark/mllib/tests/test_feature.py:21:1: F401 'numpy.exp' imported but unused python/pyspark/mllib/tests/test_feature.py:23:1: F401 'pyspark.mllib.linalg.Vector' imported but unused python/pyspark/mllib/tests/test_feature.py:23:1: F401 'pyspark.mllib.linalg.VectorUDT' imported but unused python/pyspark/mllib/tests/test_feature.py:185:5: F401 'pyspark.mllib.tests.test_feature.' imported but unused python/pyspark/mllib/tests/test_util.py:97:5: F401 'pyspark.mllib.tests.test_util.' imported but unused python/pyspark/mllib/tests/test_stat.py:23:1: F401 'pyspark.mllib.linalg.Vector' imported but unused python/pyspark/mllib/tests/test_stat.py:23:1: F401 'pyspark.mllib.linalg.SparseVector' imported but unused python/pyspark/mllib/tests/test_stat.py:23:1: F401 'pyspark.mllib.linalg.DenseVector' imported but unused python/pyspark/mllib/tests/test_stat.py:23:1: F401 'pyspark.mllib.linalg.VectorUDT' imported but unused python/pyspark/mllib/tests/test_stat.py:23:1: F401 'pyspark.mllib.linalg._convert_to_vector' imported but unused python/pyspark/mllib/tests/test_stat.py:23:1: F401 'pyspark.mllib.linalg.DenseMatrix' imported but unused python/pyspark/mllib/tests/test_stat.py:23:1: F401 'pyspark.mllib.linalg.SparseMatrix' imported but unused python/pyspark/mllib/tests/test_stat.py:23:1: F401 'pyspark.mllib.linalg.MatrixUDT' imported but unused python/pyspark/mllib/tests/test_stat.py:181:5: F401 'pyspark.mllib.tests.test_stat.' 
imported but unused python/pyspark/mllib/tests/test_streaming_algorithms.py:18:1: F401 'time.time' imported but unused python/pyspark/mllib/tests/test_streaming_algorithms.py:18:1: F401 'time.sleep' imported but unused python/pyspark/mllib/tests/test_streaming_algorithms.py:470:5: F401 'pyspark.mllib.tests.test_streaming_algorithms.' imported but unused python/pyspark/mllib/tests/test_algorithms.py:295:5: F401 'pyspark.mllib.tests.test_algorithms.' imported but unused python/pyspark/tests/test_serializers.py:90:13: F401 'xmlrunner' imported but unused python/pyspark/tests/test_rdd.py:21:1: F401 'sys' imported but unused python/pyspark/tests/test_rdd.py:29:1: F401 'pyspark.resource.ResourceProfile' imported but unused python/pyspark/tests/test_rdd.py:885:5: F401 'pyspark.tests.test_rdd.' imported but unused python/pyspark/tests/test_readwrite.py:19:1: F401 'sys' imported but unused python/pyspark/tests/test_readwrite.py:22:1: F401 'array.array' imported but unused python/pyspark/tests/test_readwrite.py:309:5: F401 'pyspark.tests.test_readwrite.' imported but unused python/pyspark/tests/test_join.py:62:5: F401 'pyspark.tests.test_join.' imported but unused python/pyspark/tests/test_taskcontext.py:19:1: F401 'shutil' imported but unused python/pyspark/tests/test_taskcontext.py:325:5: F401 'pyspark.tests.test_taskcontext.' imported but unused python/pyspark/tests/test_conf.py:36:5: F401 'pyspark.tests.test_conf.' imported but unused python/pyspark/tests/test_broadcast.py:148:5: F401 'pyspark.tests.test_broadcast.' imported but unused python/pyspark/tests/test_daemon.py:76:5: F401 'pyspark.tests.test_daemon.' imported but unused python/pyspark/tests/test_util.py:77:5: F401 'pyspark.tests.test_util.' imported but unused python/pyspark/tests/test_pin_thread.py:19:1: F401 'random' imported but unused python/pyspark/tests/test_pin_thread.py:149:5: F401 'pyspark.tests.test_pin_thread.' 
imported but unused python/pyspark/tests/test_worker.py:19:1: F401 'sys' imported but unused python/pyspark/tests/test_worker.py:26:5: F401 'resource' imported but unused python/pyspark/tests/test_worker.py:203:5: F401 'pyspark.tests.test_worker.' imported but unused python/pyspark/tests/test_profiler.py:101:5: F401 'pyspark.tests.test_profiler.' imported but unused python/pyspark/tests/test_shuffle.py:18:1: F401 'sys' imported but unused python/pyspark/tests/test_shuffle.py:171:5: F401 'pyspark.tests.test_shuffle.' imported but unused python/pyspark/tests/test_rddbarrier.py:43:5: F401 'pyspark.tests.test_rddbarrier.' imported but unused python/pyspark/tests/test_context.py:129:13: F401 'userlibrary.UserClass' imported but unused python/pyspark/tests/test_context.py:140:13: F401 'userlib.UserClass' imported but unused python/pyspark/tests/test_context.py:310:5: F401 'pyspark.tests.test_context.' imported but unused python/pyspark/tests/test_appsubmit.py:241:5: F401 'pyspark.tests.test_appsubmit.' imported but unused python/pyspark/streaming/dstream.py:18:1: F401 'sys' imported but unused python/pyspark/streaming/tests/test_dstream.py:27:1: F401 'pyspark.RDD' imported but unused python/pyspark/streaming/tests/test_dstream.py:647:5: F401 'pyspark.streaming.tests.test_dstream.' imported but unused python/pyspark/streaming/tests/test_kinesis.py:83:5: F401 'pyspark.streaming.tests.test_kinesis.' imported but unused python/pyspark/streaming/tests/test_listener.py:152:5: F401 'pyspark.streaming.tests.test_listener.' imported but unused python/pyspark/streaming/tests/test_context.py:178:5: F401 'pyspark.streaming.tests.test_context.' 
imported but unused python/pyspark/testing/utils.py:30:5: F401 'scipy.sparse' imported but unused python/pyspark/testing/utils.py:36:5: F401 'numpy as np' imported but unused python/pyspark/ml/regression.py:25:1: F401 'pyspark.ml.tree._TreeEnsembleParams' imported but unused python/pyspark/ml/regression.py:25:1: F401 'pyspark.ml.tree._HasVarianceImpurity' imported but unused python/pyspark/ml/regression.py:29:1: F401 'pyspark.ml.wrapper.JavaParams' imported but unused python/pyspark/ml/util.py:19:1: F401 'sys' imported but unused python/pyspark/ml/__init__.py:25:1: F401 'pyspark.ml.pipeline' imported but unused python/pyspark/ml/pipeline.py:18:1: F401 'sys' imported but unused python/pyspark/ml/stat.py:22:1: F401 'pyspark.ml.linalg.DenseMatrix' imported but unused python/pyspark/ml/stat.py:22:1: F401 'pyspark.ml.linalg.Vectors' imported but unused python/pyspark/ml/tests/test_training_summary.py:18:1: F401 'sys' imported but unused python/pyspark/ml/tests/test_training_summary.py:364:5: F401 'pyspark.ml.tests.test_training_summary.' imported but unused python/pyspark/ml/tests/test_linalg.py:381:5: F401 'pyspark.ml.tests.test_linalg.' imported but unused python/pyspark/ml/tests/test_tuning.py:427:9: F401 'pyspark.sql.functions as F' imported but unused python/pyspark/ml/tests/test_tuning.py:757:5: F401 'pyspark.ml.tests.test_tuning.' imported but unused python/pyspark/ml/tests/test_wrapper.py:120:5: F401 'pyspark.ml.tests.test_wrapper.' imported but unused python/pyspark/ml/tests/test_feature.py:19:1: F401 'sys' imported but unused python/pyspark/ml/tests/test_feature.py:304:5: F401 'pyspark.ml.tests.test_feature.' imported but unused python/pyspark/ml/tests/test_image.py:19:1: F401 'py4j' imported but unused python/pyspark/ml/tests/test_image.py:22:1: F401 'pyspark.testing.mlutils.PySparkTestCase' imported but unused python/pyspark/ml/tests/test_image.py:71:5: F401 'pyspark.ml.tests.test_image.' 
imported but unused python/pyspark/ml/tests/test_persistence.py:456:5: F401 'pyspark.ml.tests.test_persistence.' imported but unused python/pyspark/ml/tests/test_evaluation.py:56:5: F401 'pyspark.ml.tests.test_evaluation.' imported but unused python/pyspark/ml/tests/test_stat.py:43:5: F401 'pyspark.ml.tests.test_stat.' imported but unused python/pyspark/ml/tests/test_base.py:70:5: F401 'pyspark.ml.tests.test_base.' imported but unused python/pyspark/ml/tests/test_param.py:20:1: F401 'sys' imported but unused python/pyspark/ml/tests/test_param.py:375:5: F401 'pyspark.ml.tests.test_param.' imported but unused python/pyspark/ml/tests/test_pipeline.py:62:5: F401 'pyspark.ml.tests.test_pipeline.' imported but unused python/pyspark/ml/tests/test_algorithms.py:333:5: F401 'pyspark.ml.tests.test_algorithms.' imported but unused python/pyspark/ml/param/__init__.py:18:1: F401 'sys' imported but unused python/pyspark/resource/tests/test_resources.py:17:1: F401 'random' imported but unused python/pyspark/resource/tests/test_resources.py:20:1: F401 'pyspark.resource.ResourceProfile' imported but unused python/pyspark/resource/tests/test_resources.py:75:5: F401 'pyspark.resource.tests.test_resources.' 
imported but unused python/pyspark/sql/functions.py:32:1: F401 'pyspark.sql.udf.UserDefinedFunction' imported but unused python/pyspark/sql/functions.py:34:1: F401 'pyspark.sql.pandas.functions.pandas_udf' imported but unused python/pyspark/sql/session.py:30:1: F401 'pyspark.sql.types.Row' imported but unused python/pyspark/sql/session.py:30:1: F401 'pyspark.sql.types.StringType' imported but unused python/pyspark/sql/readwriter.py:1084:5: F401 'pyspark.sql.Row' imported but unused python/pyspark/sql/context.py:26:1: F401 'pyspark.sql.types.IntegerType' imported but unused python/pyspark/sql/context.py:26:1: F401 'pyspark.sql.types.Row' imported but unused python/pyspark/sql/context.py:26:1: F401 'pyspark.sql.types.StringType' imported but unused python/pyspark/sql/context.py:27:1: F401 'pyspark.sql.udf.UDFRegistration' imported but unused python/pyspark/sql/streaming.py:1212:5: F401 'pyspark.sql.Row' imported but unused python/pyspark/sql/tests/test_utils.py:55:5: F401 'pyspark.sql.tests.test_utils.' imported but unused python/pyspark/sql/tests/test_pandas_map.py:18:1: F401 'sys' imported but unused python/pyspark/sql/tests/test_pandas_map.py:22:1: F401 'pyspark.sql.functions.pandas_udf' imported but unused python/pyspark/sql/tests/test_pandas_map.py:22:1: F401 'pyspark.sql.functions.PandasUDFType' imported but unused python/pyspark/sql/tests/test_pandas_map.py:119:5: F401 'pyspark.sql.tests.test_pandas_map.' imported but unused python/pyspark/sql/tests/test_catalog.py:193:5: F401 'pyspark.sql.tests.test_catalog.' imported but unused python/pyspark/sql/tests/test_group.py:39:5: F401 'pyspark.sql.tests.test_group.' imported but unused python/pyspark/sql/tests/test_session.py:361:5: F401 'pyspark.sql.tests.test_session.' imported but unused python/pyspark/sql/tests/test_conf.py:49:5: F401 'pyspark.sql.tests.test_conf.' 
imported but unused python/pyspark/sql/tests/test_pandas_cogrouped_map.py:19:1: F401 'sys' imported but unused python/pyspark/sql/tests/test_pandas_cogrouped_map.py:21:1: F401 'pyspark.sql.functions.sum' imported but unused python/pyspark/sql/tests/test_pandas_cogrouped_map.py:21:1: F401 'pyspark.sql.functions.PandasUDFType' imported but unused python/pyspark/sql/tests/test_pandas_cogrouped_map.py:29:5: F401 'pandas.util.testing.assert_series_equal' imported but unused python/pyspark/sql/tests/test_pandas_cogrouped_map.py:32:5: F401 'pyarrow as pa' imported but unused python/pyspark/sql/tests/test_pandas_cogrouped_map.py:248:5: F401 'pyspark.sql.tests.test_pandas_cogrouped_map.' imported but unused python/pyspark/sql/tests/test_udf.py:24:1: F401 'py4j' imported but unused python/pyspark/sql/tests/test_pandas_udf_typehints.py:246:5: F401 'pyspark.sql.tests.test_pandas_udf_typehints.' imported but unused python/pyspark/sql/tests/test_functions.py:19:1: F401 'sys' imported but unused python/pyspark/sql/tests/test_functions.py:362:9: F401 'pyspark.sql.functions.exists' imported but unused python/pyspark/sql/tests/test_functions.py:387:5: F401 'pyspark.sql.tests.test_functions.' imported but unused python/pyspark/sql/tests/test_pandas_udf_scalar.py:21:1: F401 'sys' imported but unused python/pyspark/sql/tests/test_pandas_udf_scalar.py:45:5: F401 'pyarrow as pa' imported but unused python/pyspark/sql/tests/test_pandas_udf_window.py:355:5: F401 'pyspark.sql.tests.test_pandas_udf_window.' 
imported but unused python/pyspark/sql/tests/test_arrow.py:38:5: F401 'pyarrow as pa' imported but unused python/pyspark/sql/tests/test_pandas_grouped_map.py:20:1: F401 'sys' imported but unused python/pyspark/sql/tests/test_pandas_grouped_map.py:38:5: F401 'pyarrow as pa' imported but unused python/pyspark/sql/tests/test_dataframe.py:382:9: F401 'pyspark.sql.DataFrame' imported but unused python/pyspark/sql/avro/functions.py:125:5: F401 'pyspark.sql.Row' imported but unused python/pyspark/sql/pandas/functions.py:19:1: F401 'sys' imported but unused ``` After: ``` fokkodriesprongFan spark % flake8 python \| grep -i "imported but unused" fokkodriesprongFan spark % ``` ### What changes were proposed in this pull request? Removing unused imports from the Python files to keep everything nice and tidy. ### Why are the changes needed? Cleaning up of the imports that aren't used, and suppressing the imports that are used as references to other modules, preserving backward compatibility. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Adding the rule to the existing Flake8 checks. Closes #29121 from Fokko/SPARK-32319. Authored-by: Fokko Driesprong <fokko@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-08-08 08:51:57 -07:00
HyukjinKwon	a82aee0441	[SPARK-32435][PYTHON] Remove heapq3 port from Python 3 ### What changes were proposed in this pull request? This PR removes the manual port of `heapq3.py` introduced in SPARK-3073. The main reason for the port was to support Python 2.6 and 2.7, because Python 2's `heapq.merge()` does not support `key` and `reverse`. See - https://docs.python.org/2/library/heapq.html#heapq.merge in Python 2 - https://docs.python.org/3.8/library/heapq.html#heapq.merge in Python 3 Since we dropped Python 2 in SPARK-32138, we can remove it. ### Why are the changes needed? To remove unnecessary code. Also, we can leverage bug fixes made to `heapq` in Python 3.x. ### Does this PR introduce _any_ user-facing change? No, dev-only. ### How was this patch tested? Existing tests should cover it. I ran and verified locally: ```bash ./python/run-tests --python-executable=python3 --testname="pyspark.tests.test_shuffle" ./python/run-tests --python-executable=python3 --testname="pyspark.shuffle ExternalSorter" ./python/run-tests --python-executable=python3 --testname="pyspark.tests.test_rdd RDDTests.test_external_group_by_key" ``` Closes #29229 from HyukjinKwon/SPARK-32435. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-07-27 20:10:13 +09:00
HyukjinKwon	6ab29b37cf	[SPARK-32179][SPARK-32188][PYTHON][DOCS] Replace and redesign the documentation base ### What changes were proposed in this pull request? This PR proposes to redesign the PySpark documentation. I made a demo site to make it easier to review: https://hyukjin-spark.readthedocs.io/en/stable/reference/index.html. Here is the initial draft for the final PySpark docs shape: https://hyukjin-spark.readthedocs.io/en/latest/index.html. In more details, this PR proposes: 1. Use [pydata_sphinx_theme](https://github.com/pandas-dev/pydata-sphinx-theme) theme - [pandas](https://pandas.pydata.org/docs/) and [Koalas](https://koalas.readthedocs.io/en/latest/) use this theme. The CSS overwrite is ported from Koalas. The colours in the CSS were actually chosen by designers to use in Spark. 2. Use the Sphinx option to separate `source` and `build` directories as the documentation pages will likely grow. 3. Port current API documentation into the new style. It mimics Koalas and pandas to use the theme most effectively. One disadvantage of this approach is that you should list up APIs or classes; however, I think this isn't a big issue in PySpark since we're being conservative on adding APIs. I also intentionally listed classes only instead of functions in ML and MLlib to make it relatively easier to manage. ### Why are the changes needed? Often I hear the complaints, from the users, that current PySpark documentation is pretty messy to read - https://spark.apache.org/docs/latest/api/python/index.html compared other projects such as [pandas](https://pandas.pydata.org/docs/) and [Koalas](https://koalas.readthedocs.io/en/latest/). It would be nicer if we can make it more organised instead of just listing all classes, methods and attributes to make it easier to navigate. Also, the documentation has been there from almost the very first version of PySpark. Maybe it's time to update it. ### Does this PR introduce _any_ user-facing change? 
Yes, PySpark API documentation will be redesigned. ### How was this patch tested? Manually tested, and the demo site was made to show. Closes #29188 from HyukjinKwon/SPARK-32179. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-07-27 17:49:21 +09:00
HyukjinKwon	ea9e8f365a	[SPARK-32094][PYTHON] Update cloudpickle to v1.5.0 ### What changes were proposed in this pull request? This PR aims to upgrade PySpark's embedded cloudpickle to the latest cloudpickle v1.5.0 (See https://github.com/cloudpipe/cloudpickle/blob/v1.5.0/cloudpickle/cloudpickle.py) ### Why are the changes needed? There are many bug fixes. For example, the bug described in the JIRA: dill unpickling fails because they define `types.ClassType`, which is undefined in dill. This results in the following error: ``` Traceback (most recent call last): File "/usr/local/lib/python3.6/site-packages/apache_beam/internal/pickler.py", line 279, in loads return dill.loads(s) File "/usr/local/lib/python3.6/site-packages/dill/_dill.py", line 317, in loads return load(file, ignore) File "/usr/local/lib/python3.6/site-packages/dill/_dill.py", line 305, in load obj = pik.load() File "/usr/local/lib/python3.6/site-packages/dill/_dill.py", line 577, in _load_type return _reverse_typemap[name] KeyError: 'ClassType' ``` See also https://github.com/cloudpipe/cloudpickle/issues/82. This was fixed in cloudpickle 1.3.0+ (https://github.com/cloudpipe/cloudpickle/pull/337), but PySpark's cloudpickle.py doesn't have this change yet. More notably, it now supports the C pickle implementation with Python 3.8, which hugely improves performance. This has already been adopted in other projects such as Ray. ### Does this PR introduce _any_ user-facing change? Yes, the bug fixes described above. Internally, users could also leverage the fast cloudpickle backed by C pickle. ### How was this patch tested? Jenkins will test it out. Closes #29114 from HyukjinKwon/SPARK-32094. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-07-17 11:49:18 +09:00
Nicholas Chammas	b4748ca0ab	[SPARK-31155] Remove pydocstyle tests ### What changes were proposed in this pull request? As discovered here https://github.com/apache/spark/pull/27910#issuecomment-599027190, pydocstyle tests were not running anywhere (not on Jenkins; not on GitHub). ~This PR enables those tests.~ It also seems like a [large hill to climb](https://github.com/apache/spark/pull/27912#issuecomment-599167117) to enable any meaningful checks, so we're going to just rip pydocstyle out for now. ### Why are the changes needed? Presumably, we defined those doc style tests because we care about whatever it is they enforce. Since we're not actually testing anything, though, it's better to clear the cruft. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Will check the GitHub workflow logs on this PR. Closes #27912 from nchammas/SPARK-31155-pydocstyle. Authored-by: Nicholas Chammas <nicholas.chammas@liveramp.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-03-17 10:41:41 +09:00
Eric Chang	ed8a260749	[SPARK-30450][INFRA] Exclude .git folder for python linter ### What changes were proposed in this pull request? This excludes the .git folder when the python linter runs. We want to exclude it because there may be files in .git from other branches that could cause the linter to fail. ### Why are the changes needed? I ran into a case where a branch name ended with a ".py" suffix, so there were git ref files in the .git folder, under .git/logs/refs and .git/refs/remotes. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Manual. ``` $ git branch 3.py $ git checkout 3.py Switched to branch '3.py' $ dev/lint-python starting python compilation test... Python compilation failed with the following errors: * Error compiling './.git/logs/refs/heads/3.py'... File "./.git/logs/refs/heads/3.py", line 1 0000000000000000000000000000000000000000 `895e572b73` Dongjoon Hyun <dhyun@apple.com> 1578438255 -0800 branch: Created from master ^ SyntaxError: invalid syntax * Error compiling './.git/refs/heads/3.py'... File "./.git/refs/heads/3.py", line 1 `895e572b73` ^ SyntaxError: invalid syntax ``` Closes #27120 from ericfchang/master. Authored-by: Eric Chang <eric.chang@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-01-07 15:14:17 -08:00
Rekha Joshi	d5573c578a	[SPARK-23367][BUILD] Include python document style checking ## What changes were proposed in this pull request? Includes python document style checking. - Use sphinx like check, run only if pydocstyle installed on machine/jenkins - use pydocstyle rather than single file version pep257.py, which is much older and had some known issues - verify pydocstyle latest 3.0.0 is in use, to ensure latest doc checks are getting executed - ignore (inclusion/exclusion error codes) features and support via tox.ini - Be non-breaking change and allow updating docstyle to standards at easy pace ## How was this patch tested? ./dev/run-tests Closes #22425 from rekhajoshm/SPARK-23367-2. Authored-by: Rekha Joshi <rekhajoshm@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2018-10-27 08:20:42 -05:00
cclauss	9bb798f2e6	[SPARK-25238][PYTHON] lint-python: Upgrade pycodestyle to v2.4.0 See https://pycodestyle.readthedocs.io/en/latest/developer.html#changes for changes made in this release. ## What changes were proposed in this pull request? Upgrade pycodestyle to v2.4.0 ## How was this patch tested? __pycodestyle__ Please review http://spark.apache.org/contributing.html before opening a pull request. Closes #22231 from cclauss/patch-1. Authored-by: cclauss <cclauss@bluewin.ch> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2018-09-14 20:13:07 -05:00
Sean Suchter	f433ef7867	[SPARK-23010][K8S] Initial checkin of k8s integration tests. These tests were developed in the https://github.com/apache-spark-on-k8s/spark-integration repo by several contributors. This is a copy of the current state into the main apache spark repo. The only changes from the current spark-integration repo state are: * Move the files from the repo root into resource-managers/kubernetes/integration-tests * Add a reference to these tests in the root README.md * Fix a path reference in dev/dev-run-integration-tests.sh * Add a TODO in include/util.sh ## What changes were proposed in this pull request? Incorporation of Kubernetes integration tests. ## How was this patch tested? This code has its own unit tests, but the main purpose is to provide the integration tests. I tested this on my laptop by running dev/dev-run-integration-tests.sh --spark-tgz ~/spark-2.4.0-SNAPSHOT-bin--.tgz The spark-integration tests have already been running for months in AMPLab, here is an example: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-scheduled-spark-integration-master/ Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Sean Suchter <sean-github@suchter.com> Author: Sean Suchter <ssuchter@pepperdata.com> Closes #20697 from ssuchter/ssuchter-k8s-integration-tests.	2018-06-08 15:15:24 -07:00
Rekha Joshi	7af1a325da	[SPARK-23174][BUILD][PYTHON] python code style checker update ## What changes were proposed in this pull request? Referencing latest python code style checking from PyPi/pycodestyle Removed pending TODO For now, in tox.ini excluded the additional style error discovered on existing python due to latest style checker (will fallback on review comment to finalize exclusion or fix py) Any further code styling requirement needs to be part of pycodestyle, not in SPARK. ## How was this patch tested? ./dev/run-tests Author: Rekha Joshi <rekhajoshm@gmail.com> Author: rjoshi2 <rekhajoshm@gmail.com> Closes #20338 from rekhajoshm/SPARK-11222.	2018-01-24 21:13:47 +09:00
Xin Lu	544a1ba678	[SPARK-22375][TEST] Test script can fail if eggs are installed by set… …up.py during test process ## What changes were proposed in this pull request? Ignore the python/.eggs folder when running lint-python ## How was this patch tested? 1) put a bad python file in python/.eggs and ran the original script. results were: xins-MBP:spark xinlu$ dev/lint-python PEP8 checks failed. ./python/.eggs/worker.py:33:4: E121 continuation line under-indented for hanging indent ./python/.eggs/worker.py:34:5: E131 continuation line unaligned for hanging indent 2) test same situation with change: xins-MBP:spark xinlu$ dev/lint-python PEP8 checks passed. The sphinx-build command was not found. Skipping pydoc checks for now Author: Xin Lu <xlu@salesforce.com> Closes #19597 from xynny/SPARK-22375.	2017-10-29 15:29:23 +09:00
hyukjinkwon	0e31e28d48	[MINOR][PYTHON] Ignore pep8 on test scripts generated in tests in work directory ## What changes were proposed in this pull request? Currently, if we run `./python/run-tests.py` and they are aborted without cleaning up this directory, it fails pep8 check due to some Python scripts generated. For example, `7387126f83/python/pyspark/tests.py (L1955-L1968)` ``` PEP8 checks failed. ./work/app-20170531190857-0000/0/test.py:5:55: W292 no newline at end of file ./work/app-20170531190909-0000/0/test.py:5:55: W292 no newline at end of file ./work/app-20170531190924-0000/0/test.py:3:1: E302 expected 2 blank lines, found 1 ./work/app-20170531190924-0000/0/test.py:7:52: W292 no newline at end of file ./work/app-20170531191016-0000/0/test.py:5:55: W292 no newline at end of file ./work/app-20170531191030-0000/0/test.py:5:55: W292 no newline at end of file ./work/app-20170531191045-0000/0/test.py:3:1: E302 expected 2 blank lines, found 1 ./work/app-20170531191045-0000/0/test.py:7:52: W292 no newline at end of file ``` For me, it is sometimes a bit annoying. This PR proposes to exclude these (assuming we want to skip per https://github.com/apache/spark/blob/master/.gitignore#L73). Also, it moves other pep8 configurations in the script into ini configuration file in pep8. ## How was this patch tested? Manually tested via `./dev/lint-python`. Author: hyukjinkwon <gurwls223@gmail.com> Closes #18161 from HyukjinKwon/work-exclude-pep8.	2017-06-02 14:25:38 +01:00
Sean Owen	0eea12a3d9	[SPARK-13596][BUILD] Move misc top-level build files into appropriate subdirs ## What changes were proposed in this pull request? Move many top-level files in dev/ or other appropriate directory. In particular, put `make-distribution.sh` in `dev` and update docs accordingly. Remove deprecated `sbt/sbt`. I was (so far) unable to figure out how to move `tox.ini`. `scalastyle-config.xml` should be movable but edits to the project `.sbt` files didn't work; config file location is updatable for compile but not test scope. ## How was this patch tested? `./dev/run-tests` to verify RAT and checkstyle work. Jenkins tests for the rest. Author: Sean Owen <sowen@cloudera.com> Closes #11522 from srowen/SPARK-13596.	2016-03-07 14:48:02 -08:00

15 commits
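
Note on SPARK-32435: the `heapq3.py` port existed only because Python 2's `heapq.merge()` lacked the `key` and `reverse` arguments. The sketch below shows what the Python 3 standard library offers directly. It is an illustration under stated assumptions, not the code PySpark uses: the helper name `merge_sorted_runs` and the sample data are hypothetical.

```python
import heapq


def merge_sorted_runs(runs, key=None, reverse=False):
    """Merge already-sorted runs into one sorted list.

    Hypothetical helper for illustration only. Each run must already be
    sorted consistently with `key` and `reverse`; heapq.merge does not
    sort its inputs, it only interleaves them lazily.
    """
    return list(heapq.merge(*runs, key=key, reverse=reverse))


# Sample (made-up) runs of (name, value) pairs, each sorted by value, descending.
runs = [
    [("a", 9), ("b", 5), ("c", 1)],
    [("d", 8), ("e", 4)],
    [("f", 7), ("g", 2)],
]

merged = merge_sorted_runs(runs, key=lambda kv: kv[1], reverse=True)
print(merged)
```

Both `key` and `reverse` have been accepted by `heapq.merge()` since Python 3.5, which is why a separate port is no longer needed once Python 2 support is dropped.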