ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
HyukjinKwon	4ad9bfd53b	[SPARK-32138] Drop Python 2.7, 3.4 and 3.5 ### What changes were proposed in this pull request? This PR aims to drop Python 2.7, 3.4 and 3.5. Roughly speaking, it removes all the widely known Python 2 compatibility workarounds such as `sys.version` comparison, `__future__`. Also, it removes the Python 2 dedicated codes such as `ArrayConstructor` in Spark. ### Why are the changes needed? 1. Unsupport EOL Python versions 2. Reduce maintenance overhead and remove a bit of legacy codes and hacks for Python 2. 3. PyPy2 has a critical bug that causes a flaky test, SPARK-28358 given my testing and investigation. 4. Users can use Python type hints with Pandas UDFs without thinking about Python version 5. Users can leverage one latest cloudpickle, https://github.com/apache/spark/pull/28950. With Python 3.8+ it can also leverage C pickle. ### Does this PR introduce _any_ user-facing change? Yes, users cannot use Python 2.7, 3.4 and 3.5 in the upcoming Spark version. ### How was this patch tested? Manually tested and also tested in Jenkins. Closes #28957 from HyukjinKwon/SPARK-32138. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-07-14 11:22:44 +09:00
HyukjinKwon	56d4f27cf6	[SPARK-31966][ML][TESTS][PYTHON] Increase the timeout for StreamingLogisticRegressionWithSGDTests.test_training_and_prediction ### What changes were proposed in this pull request? This is similar with `64cb6f7066` The test `StreamingLogisticRegressionWithSGDTests.test_training_and_prediction` seems also flaky. This PR just increases the timeout to 3 mins too. The cause is very likely the time elapsed. See https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123787/testReport/pyspark.mllib.tests.test_streaming_algorithms/StreamingLogisticRegressionWithSGDTests/test_training_and_prediction/ ``` Traceback (most recent call last): File "/home/jenkins/workspace/SparkPullRequestBuilder2/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 330, in test_training_and_prediction eventually(condition, timeout=60.0) File "/home/jenkins/workspace/SparkPullRequestBuilder2/python/pyspark/testing/utils.py", line 90, in eventually % (timeout, lastValue)) AssertionError: Test failed due to timeout after 60 sec, with last condition returning: Latest errors: 0.67, 0.71, 0.78, 0.7, 0.75, 0.74, 0.73, 0.69, 0.62, 0.71, 0.69, 0.75, 0.72, 0.77, 0.71, 0.74, 0.76, 0.78, 0.7, 0.78, 0.8, 0.74, 0.77, 0.75, 0.76, 0.76, 0.75 ``` ### Why are the changes needed? To make PR builds more stable. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Jenkins will test them out. Closes #28798 from HyukjinKwon/SPARK-31966. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-06-10 21:56:35 -07:00
HyukjinKwon	64cb6f7066	[SPARK-29137][ML][PYTHON][TESTS] Increase the timeout for StreamingLinearRegressionWithTests.test_train_prediction ### What changes were proposed in this pull request? It increases the timeout for `StreamingLinearRegressionWithTests.test_train_prediction` ``` Traceback (most recent call last): File "/home/jenkins/workspace/SparkPullRequestBuilder3/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 503, in test_train_prediction self._eventually(condition) File "/home/jenkins/workspace/SparkPullRequestBuilder3/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 69, in _eventually lastValue = condition() File "/home/jenkins/workspace/SparkPullRequestBuilder3/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 498, in condition self.assertGreater(errors[1] - errors[-1], 2) AssertionError: 1.672640157855923 not greater than 2 ``` This could likely happen when the PySpark tests run in parallel and it become slow. ### Why are the changes needed? To make the tests less flaky. Seems it's being reported multiple times: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123144/consoleFull https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123146/testReport/ https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123141/testReport/ https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123142/testReport/ ### Does this PR introduce _any_ user-facing change? No, test-only. ### How was this patch tested? Jenkins will test it out. Closes #28701 from HyukjinKwon/SPARK-29137. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-06-01 22:03:47 -07:00
Dongjoon Hyun	534f5d409a	[SPARK-29138][PYTHON][TEST] Increase timeout of StreamingLogisticRegressionWithSGDTests.test_parameter_accuracy ### What changes were proposed in this pull request? This PR aims to increase the timeout of `StreamingLogisticRegressionWithSGDTests.test_parameter_accuracy` from 30s (default) to 60s. In this PR, before increasing the timeout, 1. I verified that this is not a JDK11 environmental issue by repeating 3 times first. 2. I reproduced the accuracy failure by reducing the timeout in Jenkins (https://github.com/apache/spark/pull/27424#issuecomment-580981262) Then, the final commit passed the Jenkins. ### Why are the changes needed? This seems to happen when Jenkins environment has congestion and the jobs are slowdown. The streaming job seems to be unable to repeat the designed iteration `numIteration=25` in 30 seconds. Since the error is decreasing at each iteration, the failure occurs. By reducing the timeout, we can reproduce the similar issue locally like Jenkins. ```python - eventually(condition, catch_assertions=True) + eventually(condition, timeout=10.0, catch_assertions=True) ``` ``` $ python/run-tests --testname 'pyspark.mllib.tests.test_streaming_algorithms StreamingLogisticRegressionWithSGDTests.test_parameter_accuracy' --python-executables=python ... ====================================================================== FAIL: test_parameter_accuracy (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests) ---------------------------------------------------------------------- Traceback (most recent call last): File "/Users/dongjoon/PRS/SPARK-TEST/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 229, in test_parameter_accuracy eventually(condition, timeout=10.0, catch_assertions=True) File "/Users/dongjoon/PRS/SPARK-TEST/python/pyspark/testing/utils.py", line 86, in eventually raise lastValue Reproduce the error File "/Users/dongjoon/PRS/SPARK-TEST/python/pyspark/testing/utils.py", line 77, in eventually lastValue = condition() File "/Users/dongjoon/PRS/SPARK-TEST/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 226, in condition self.assertAlmostEqual(rel, 0.1, 1) AssertionError: 0.25749106949322637 != 0.1 within 1 places (0.15749106949322636 difference) ---------------------------------------------------------------------- Ran 1 test in 14.814s FAILED (failures=1) ``` ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Pass the Jenkins (and manual check by reducing the timeout). Since this is a flakiness issue depending on the Jenkins job situation, it's difficult to reproduce there. Closes #27424 from dongjoon-hyun/SPARK-TEST. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-02-01 15:38:16 +09:00
HyukjinKwon	4ec04e5ef3	[SPARK-22340][PYTHON] Add a mode to pin Python thread into JVM's ## What changes were proposed in this pull request? This PR proposes to add Single threading model design (pinned thread model) mode which is an experimental mode to sync threads on PVM and JVM. See https://www.py4j.org/advanced_topics.html#using-single-threading-model-pinned-thread ### Multi threading model Currently, PySpark uses this model. Threads on PVM and JVM are independent. For instance, in a different Python thread, callbacks are received and relevant Python codes are executed. JVM threads are reused when possible. Py4J will create a new thread every time a command is received and there is no thread available. See the current model we're using - https://www.py4j.org/advanced_topics.html#the-multi-threading-model One problem in this model is that we can't sync threads on PVM and JVM out of the box. This leads to some problems in particular at some codes related to threading in JVM side. See: `7056e004ee/core/src/main/scala/org/apache/spark/SparkContext.scala (L334)` Due to reusing JVM threads, seems the job groups in Python threads cannot be set in each thread as described in the JIRA. ### Single threading model design (pinned thread model) This mode pins and syncs the threads on PVM and JVM to work around the problem above. For instance, in the same Python thread, callbacks are received and relevant Python codes are executed. See https://www.py4j.org/advanced_topics.html#the-single-threading-model Even though this mode can sync threads on PVM and JVM for other thread related code paths, this might cause another problem: seems unable to inherit properties as below (assuming multi-thread mode still creates new threads when existing threads are busy, I suspect this issue already exists when multiple jobs are submitted in multi-thread mode; however, it can be always seen in single threading mode): ```bash $ PYSPARK_PIN_THREAD=true ./bin/pyspark ``` ```python import threading spark.sparkContext.setLocalProperty("a", "hi") def print_prop(): print(spark.sparkContext.getLocalProperty("a")) threading.Thread(target=print_prop).start() ``` ``` None ``` Unlike Scala side: ```scala spark.sparkContext.setLocalProperty("a", "hi") new Thread(new Runnable { def run() = println(spark.sparkContext.getLocalProperty("a")) }).start() ``` ``` hi ``` This behaviour potentially could cause weird issues but this PR currently does not target this fix this for now since this mode is experimental. ### How does this PR fix? Basically there are two types of Py4J servers `GatewayServer` and `ClientServer`. The former is for multi threading and the latter is for single threading. This PR adds a switch to use the latter. In Scala side: The logic to select a server is encapsulated in `Py4JServer` and use `Py4JServer` at `PythonRunner` for Spark summit and `PythonGatewayServer` for Spark shell. Each uses `ClientServer` when `PYSPARK_PIN_THREAD` is `true` and `GatewayServer` otherwise. In Python side: Simply do an if-else to switch the server to talk. It uses `ClientServer` when `PYSPARK_PIN_THREAD` is `true` and `GatewayServer` otherwise. This is disabled by default for now. ## How was this patch tested? Manually tested. This can be tested via: ```python PYSPARK_PIN_THREAD=true ./bin/pyspark ``` and/or ```bash cd python ./run-tests --python-executables=python --testnames "pyspark.tests.test_pin_thread" ``` Also, ran the Jenkins tests with `PYSPARK_PIN_THREAD` enabled. Closes #24898 from HyukjinKwon/pinned-thread. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-11-08 06:44:58 +09:00
Henry D	a32c92c0cd	[SPARK-28140][MLLIB][PYTHON] Accept DataFrames in RowMatrix and IndexedRowMatrix constructors ## What changes were proposed in this pull request? In both cases, the input `DataFrame` schema must contain only the information that's required for the matrix object, so a vector column in the case of `RowMatrix` and long and vector columns for `IndexedRowMatrix`. ## How was this patch tested? Unit tests that verify: - `RowMatrix` and `IndexedRowMatrix` can be created from `DataFrame`s - If the schema does not match expectations, we throw an `IllegalArgumentException` Please review https://spark.apache.org/contributing.html before opening a pull request. Closes #24953 from henrydavidge/row-matrix-df. Authored-by: Henry D <henrydavidge@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-07-09 16:39:21 -05:00
HyukjinKwon	7c05f61514	[SPARK-28130][PYTHON] Print pretty messages for skipped tests when xmlrunner is available in PySpark ## What changes were proposed in this pull request? Currently, pretty skipped message added by `f7435bec6a` mechanism seems not working when xmlrunner is installed apparently. This PR fixes two things: 1. When `xmlrunner` is installed, seems `xmlrunner` does not respect `vervosity` level in unittests (default is level 1). So the output looks as below ``` Running tests... ---------------------------------------------------------------------- SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS ---------------------------------------------------------------------- ``` So it is not caught by our message detection mechanism. 2. If we manually set the `vervocity` level to `xmlrunner`, it prints messages as below: ``` test_mixed_udf (pyspark.sql.tests.test_pandas_udf_scalar.ScalarPandasUDFTests) ... SKIP (0.000s) test_mixed_udf_and_sql (pyspark.sql.tests.test_pandas_udf_scalar.ScalarPandasUDFTests) ... SKIP (0.000s) ... ``` This is different in our Jenkins machine: ``` test_createDataFrame_column_name_encoding (pyspark.sql.tests.test_arrow.ArrowTests) ... skipped 'Pandas >= 0.23.2 must be installed; however, it was not found.' test_createDataFrame_does_not_modify_input (pyspark.sql.tests.test_arrow.ArrowTests) ... skipped 'Pandas >= 0.23.2 must be installed; however, it was not found.' ... ``` Note that last `SKIP` is different. This PR fixes the regular expression to catch `SKIP` case as well. ## How was this patch tested? Manually tested. Before: ``` Starting test(python2.7): pyspark.... Finished test(python2.7): pyspark.... (0s) ... Tests passed in 562 seconds ======================================================================== ... ``` After: ``` Starting test(python2.7): pyspark.... Finished test(python2.7): pyspark.... (48s) ... 93 tests were skipped ... Tests passed in 560 seconds Skipped tests pyspark.... with python2.7: pyspark...(...) ... SKIP (0.000s) ... ======================================================================== ... ``` Closes #24927 from HyukjinKwon/SPARK-28130. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-06-24 09:58:17 +09:00
Giovanni Lanzani	92530c7db1	[SPARK-9792] Make DenseMatrix equality semantical Before, you could have this code ``` A = SparseMatrix(2, 2, [0, 2, 3], [0], [2]) B = DenseMatrix(2, 2, [2, 0, 0, 0]) B == A # False A == B # True ``` The second would be `True` as `SparseMatrix` already checks for semantic equality. This commit changes `DenseMatrix` so that equality is semantical as well. ## What changes were proposed in this pull request? Better semantic equality for DenseMatrix ## How was this patch tested? Unit tests were added, plus manual testing. Note that the code falls back to the old behavior when `other` is not a SparseMatrix. Closes #17968 from gglanzani/SPARK-9792. Authored-by: Giovanni Lanzani <giovanni@lanzani.nl> Signed-off-by: Holden Karau <holden@pigscanfly.ca>	2019-04-01 09:30:33 -07:00
Liang-Chi Hsieh	8503aa3007	[SPARK-26646][TEST][PYSPARK] Fix flaky test: pyspark.mllib.tests.test_streaming_algorithms StreamingLogisticRegressionWithSGDTests.test_training_and_prediction ## What changes were proposed in this pull request? The test pyspark.mllib.tests.test_streaming_algorithms StreamingLogisticRegressionWithSGDTests.test_training_and_prediction looks sometimes flaky. ``` ====================================================================== FAIL: test_training_and_prediction (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests) Test that the model improves on toy data with no. of batches ---------------------------------------------------------------------- Traceback (most recent call last): File "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 367, in test_training_and_prediction self._eventually(condition, timeout=60.0) File "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 69, in _eventually lastValue = condition() File "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 362, in condition self.assertGreater(errors[1] - errors[-1], 0.3) AssertionError: -0.070000000000000062 not greater than 0.3 ---------------------------------------------------------------------- Ran 13 tests in 198.327s FAILED (failures=1, skipped=1) Had test failures in pyspark.mllib.tests.test_streaming_algorithms with python3.4; see logs ``` The predict stream can possibly be consumed to the end before the input stream. When it happens, the model improvement is not high as expected and causes test failed. This patch tries to increase number of batches of streams. This won't increase test time because we have a timeout there. ## How was this patch tested? Manually test. Closes #23586 from viirya/SPARK-26646. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2019-01-18 23:53:11 +08:00
Hyukjin Kwon	ab76900fed	[SPARK-26275][PYTHON][ML] Increases timeout for StreamingLogisticRegressionWithSGDTests.test_training_and_prediction test ## What changes were proposed in this pull request? Looks this test is flaky https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99704/console https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99569/console https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99644/console https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99548/console https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99454/console https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99609/console ``` ====================================================================== FAIL: test_training_and_prediction (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests) Test that the model improves on toy data with no. of batches ---------------------------------------------------------------------- Traceback (most recent call last): File "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 367, in test_training_and_prediction self._eventually(condition) File "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 78, in _eventually % (timeout, lastValue)) AssertionError: Test failed due to timeout after 30 sec, with last condition returning: Latest errors: 0.67, 0.71, 0.78, 0.7, 0.75, 0.74, 0.73, 0.69, 0.62, 0.71, 0.69, 0.75, 0.72, 0.77, 0.71, 0.74 ---------------------------------------------------------------------- Ran 13 tests in 185.051s FAILED (failures=1, skipped=1) ``` This looks happening after increasing the parallelism in Jenkins to speed up at https://github.com/apache/spark/pull/23111. I am able to reproduce this manually when the resource usage is heavy (with manual decrease of timeout). ## How was this patch tested? Manually tested by ``` cd python ./run-tests --testnames 'pyspark.mllib.tests.test_streaming_algorithms StreamingLogisticRegressionWithSGDTests.test_training_and_prediction' --python-executables=python ``` Closes #23236 from HyukjinKwon/SPARK-26275. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2018-12-06 09:14:46 +08:00
Hyukjin Kwon	518a3d10c8	[SPARK-26033][SPARK-26034][PYTHON][FOLLOW-UP] Small cleanup and deduplication in ml/mllib tests ## What changes were proposed in this pull request? This PR is a small follow up that puts some logic and functions into smaller scope and make it localized, and deduplicate. ## How was this patch tested? Manually tested. Jenkins tests as well. Closes #23200 from HyukjinKwon/followup-SPARK-26034-SPARK-26033. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Bryan Cutler <cutlerb@gmail.com>	2018-12-03 14:03:10 -08:00
hyukjinkwon	bbbdaa82a4	[SPARK-26105][PYTHON] Clean unittest2 imports up that were added for Python 2.6 before ## What changes were proposed in this pull request? Currently, some of PySpark tests sill assume the tests could be ran in Python 2.6 by importing `unittest2`. For instance: ```python if sys.version_info[:2] <= (2, 6): try: import unittest2 as unittest except ImportError: sys.stderr.write('Please install unittest2 to test with Python 2.6 or earlier') sys.exit(1) else: import unittest ``` While I am here, I removed some of unused imports and reordered imports per PEP 8. We officially dropped Python 2.6 support a while ago and started to discuss about Python 2 drop. It's better to remove them out. ## How was this patch tested? Manually tests, and existing tests via Jenkins. Closes #23077 from HyukjinKwon/SPARK-26105. Lead-authored-by: hyukjinkwon <gurwls223@apache.org> Co-authored-by: Bryan Cutler <cutlerb@gmail.com> Signed-off-by: hyukjinkwon <gurwls223@apache.org>	2018-11-19 09:22:32 +08:00
Bryan Cutler	a2fc48c28c	[SPARK-26034][PYTHON][TESTS] Break large mllib/tests.py file into smaller files ## What changes were proposed in this pull request? This PR breaks down the large mllib/tests.py file that contains all Python MLlib unit tests into several smaller test files to be easier to read and maintain. The tests are broken down as follows: ``` pyspark ├── __init__.py ... ├── mllib │ ├── __init__.py ... │ ├── tests │ │ ├── __init__.py │ │ ├── test_algorithms.py │ │ ├── test_feature.py │ │ ├── test_linalg.py │ │ ├── test_stat.py │ │ ├── test_streaming_algorithms.py │ │ └── test_util.py ... ├── testing ... │ ├── mllibutils.py ... ``` ## How was this patch tested? Ran tests manually by module to ensure test count was the same, and ran `python/run-tests --modules=pyspark-mllib` to verify all passing with Python 2.7 and Python 3.6. Also installed scipy to include optional tests in test_linalg. Closes #23056 from BryanCutler/python-test-breakup-mllib-SPARK-26034. Authored-by: Bryan Cutler <cutlerb@gmail.com> Signed-off-by: hyukjinkwon <gurwls223@apache.org>	2018-11-17 00:12:17 +08:00

13 commits