spark-instrumented-optimizer/python/run-tests

#!/usr/bin/env bash

#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#


# Figure out where the Spark framework is installed
FWDIR="$(cd `dirname $0`; cd ../; pwd)"

# CD into the python directory to find things on the right path
cd "$FWDIR/python"

FAILED=0

rm -f unit-tests.log

# Remove the metastore and warehouse directory created by the HiveContext tests in SparkSQL
rm -rf metastore warehouse

function run_test() {
    echo "Running test: $1"
    SPARK_TESTING=1 $FWDIR/bin/pyspark $1 2>&1 | tee -a > unit-tests.log
    FAILED=$((PIPESTATUS[0]||$FAILED))

    # Fail and exit on the first test failure.
    if [[ $FAILED != 0 ]]; then
        cat unit-tests.log | grep -v "^[0-9][0-9]*" # filter all lines starting with a number.
        echo -en "\033[31m"  # Red
        echo "Had test failures; see logs."
        echo -en "\033[0m"  # No color
        exit -1
    fi
}

echo "Running PySpark tests. Output is in python/unit-tests.log."

run_test "pyspark/rdd.py"
run_test "pyspark/context.py"
run_test "pyspark/conf.py"
if [ -n "$_RUN_SQL_TESTS" ]; then
  run_test "pyspark/sql.py"
fi
# These tests are included in the module-level docs, and so must
# be handled on a higher level rather than within the python file.
export PYSPARK_DOC_TEST=1
run_test "pyspark/broadcast.py"
run_test "pyspark/accumulators.py"
run_test "pyspark/serializers.py"
unset PYSPARK_DOC_TEST
run_test "pyspark/tests.py"
run_test "pyspark/mllib/_common.py"
run_test "pyspark/mllib/classification.py"
run_test "pyspark/mllib/clustering.py"
run_test "pyspark/mllib/linalg.py"
run_test "pyspark/mllib/recommendation.py"
run_test "pyspark/mllib/regression.py"
run_test "pyspark/mllib/tests.py"

if [[ $FAILED == 0 ]]; then
    echo -en "\033[32m"  # Green
    echo "Tests passed."
    echo -en "\033[0m"  # No color
fi

# TODO: in the long-run, it would be nice to use a test runner like `nose`.
# The doctest fixtures are the current barrier to doing this.
Add `pyspark` script to replace the other scripts. Expand the PySpark programming guide. 2013-01-02 00:25:49 -05:00			`#!/usr/bin/env bash`

Add Apache license headers and LICENSE and NOTICE files 2013-07-16 20:21:33 -04:00			`#`
			`# Licensed to the Apache Software Foundation (ASF) under one or more`
			`# contributor license agreements. See the NOTICE file distributed with`
			`# this work for additional information regarding copyright ownership.`
			`# The ASF licenses this file to You under the Apache License, Version 2.0`
			`# (the "License"); you may not use this file except in compliance with`
			`# the License. You may obtain a copy of the License at`
			`#`
			`# http://www.apache.org/licenses/LICENSE-2.0`
			`#`
			`# Unless required by applicable law or agreed to in writing, software`
			`# distributed under the License is distributed on an "AS IS" BASIS,`
			`# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.`
			`# See the License for the specific language governing permissions and`
			`# limitations under the License.`
			`#`


			`# Figure out where the Spark framework is installed`
Add `pyspark` script to replace the other scripts. Expand the PySpark programming guide. 2013-01-02 00:25:49 -05:00			FWDIR="$(cd `dirname $0`; cd ../; pwd)"

Allow python/run-tests to run from any directory 2013-07-28 23:39:27 -04:00			`# CD into the python directory to find things on the right path`
			`cd "$FWDIR/python"`

Indicate success/failure in PySpark test script. 2013-01-09 23:30:36 -05:00			`FAILED=0`

Fix PySpark unit tests on Python 2.6. 2013-08-14 18:12:12 -04:00			`rm -f unit-tests.log`

SPARK-1374: PySpark API for SparkSQL An initial API that exposes SparkSQL functionality in PySpark. A PythonRDD composed of dictionaries, with string keys and primitive values (boolean, float, int, long, string) can be converted into a SchemaRDD that supports sql queries. ``` from pyspark.context import SQLContext sqlCtx = SQLContext(sc) rdd = sc.parallelize([{"field1" : 1, "field2" : "row1"}, {"field1" : 2, "field2": "row2"}, {"field1" : 3, "field2": "row3"}]) srdd = sqlCtx.applySchema(rdd) sqlCtx.registerRDDAsTable(srdd, "table1") srdd2 = sqlCtx.sql("SELECT field1 AS f1, field2 as f2 from table1") srdd2.collect() ``` The last line yields ```[{"f1" : 1, "f2" : "row1"}, {"f1" : 2, "f2": "row2"}, {"f1" : 3, "f2": "row3"}]``` Author: Ahir Reddy <ahirreddy@gmail.com> Author: Michael Armbrust <michael@databricks.com> Closes #363 from ahirreddy/pysql and squashes the following commits: 0294497 [Ahir Reddy] Updated log4j properties to supress Hive Warns 307d6e0 [Ahir Reddy] Style fix 6f7b8f6 [Ahir Reddy] Temporary fix MIMA checker. Since we now assemble Spark jar with Hive, we don't want to check the interfaces of all of our hive dependencies 3ef074a [Ahir Reddy] Updated documentation because classes moved to sql.py 29245bf [Ahir Reddy] Cache underlying SchemaRDD instead of generating and caching PythonRDD f2312c7 [Ahir Reddy] Moved everything into sql.py a19afe4 [Ahir Reddy] Doc fixes 6d658ba [Ahir Reddy] Remove the metastore directory created by the HiveContext tests in SparkSQL 521ff6d [Ahir Reddy] Trying to get spark to build with hive ab95eba [Ahir Reddy] Set SPARK_HIVE=true on jenkins ded03e7 [Ahir Reddy] Added doc test for HiveContext 22de1d4 [Ahir Reddy] Fixed maven pyrolite dependency e4da06c [Ahir Reddy] Display message if hive is not built into spark 227a0be [Michael Armbrust] Update API links. Fix Hive example. 58e2aa9 [Michael Armbrust] Build Docs for pyspark SQL Api. Minor fixes. 4285340 [Michael Armbrust] Fix building of Hive API Docs. 38a92b0 [Michael Armbrust] Add note to future non-python developers about python docs. 337b201 [Ahir Reddy] Changed com.clearspring.analytics stream version from 2.4.0 to 2.5.1 to match SBT build, and added pyrolite to maven build 40491c9 [Ahir Reddy] PR Changes + Method Visibility 1836944 [Michael Armbrust] Fix comments. e00980f [Michael Armbrust] First draft of python sql programming guide. b0192d3 [Ahir Reddy] Added Long, Double and Boolean as usable types + unit test f98a422 [Ahir Reddy] HiveContexts 79621cf [Ahir Reddy] cleaning up cruft b406ba0 [Ahir Reddy] doctest formatting 20936a5 [Ahir Reddy] Added tests and documentation e4d21b4 [Ahir Reddy] Added pyrolite dependency 79f739d [Ahir Reddy] added more tests 7515ba0 [Ahir Reddy] added more tests :) d26ec5e [Ahir Reddy] added test e9f5b8d [Ahir Reddy] adding tests 906d180 [Ahir Reddy] added todo explaining cost of creating Row object in python 251f99d [Ahir Reddy] for now only allow dictionaries as input 09b9980 [Ahir Reddy] made jrdd explicitly lazy c608947 [Ahir Reddy] SchemaRDD now has all RDD operations 725c91e [Ahir Reddy] awesome row objects 55d1c76 [Ahir Reddy] return row objects 4fe1319 [Ahir Reddy] output dictionaries correctly be079de [Ahir Reddy] returning dictionaries works cd5f79f [Ahir Reddy] Switched to using Scala SQLContext e948bd9 [Ahir Reddy] yippie 4886052 [Ahir Reddy] even better c0fb1c6 [Ahir Reddy] more working 043ca85 [Ahir Reddy] working 5496f9f [Ahir Reddy] doesn't crash b8b904b [Ahir Reddy] Added schema rdd class 67ba875 [Ahir Reddy] java to python, and python to java bcc0f23 [Ahir Reddy] Java to python ab6025d [Ahir Reddy] compiling 2014-04-15 03:07:55 -04:00			`# Remove the metastore and warehouse directory created by the HiveContext tests in SparkSQL`
			`rm -rf metastore warehouse`

Fix PySpark unit tests on Python 2.6. 2013-08-14 18:12:12 -04:00			`function run_test() {`
HOTFIX: Fix Python tests on Jenkins. Author: Patrick Wendell <pwendell@gmail.com> Closes #1036 from pwendell/jenkins-test and squashes the following commits: 9c99856 [Patrick Wendell] Better output during tests 71e7b74 [Patrick Wendell] Removing incorrect python path 74984db [Patrick Wendell] HOTFIX: Allow PySpark tests to run on Jenkins. 2014-06-10 16:13:17 -04:00			`echo "Running test: $1"`
			`SPARK_TESTING=1 $FWDIR/bin/pyspark $1 2>&1 \| tee -a > unit-tests.log`
Fix PySpark unit tests on Python 2.6. 2013-08-14 18:12:12 -04:00			`FAILED=$((PIPESTATUS[0]\|\|$FAILED))`
[WIP] SPARK-1430: Support sparse data in Python MLlib This PR adds a SparseVector class in PySpark and updates all the regression, classification and clustering algorithms and models to support sparse data, similar to MLlib. I chose to add this class because SciPy is quite difficult to install in many environments (more so than NumPy), but I plan to add support for SciPy sparse vectors later too, and make the methods work transparently on objects of either type. On the Scala side, we keep Python sparse vectors sparse and pass them to MLlib. We always return dense vectors from our models. Some to-do items left: - [x] Support SciPy's scipy.sparse matrix objects when SciPy is available. We can easily add a function to convert these to our own SparseVector. - [x] MLlib currently uses a vector with one extra column on the left to represent what we call LabeledPoint in Scala. Do we really want this? It may get annoying once you deal with sparse data since you must add/subtract 1 to each feature index when training. We can remove this API in 1.0 and use tuples for labeling. - [x] Explain how to use these in the Python MLlib docs. CC @mengxr, @joshrosen Author: Matei Zaharia <matei@databricks.com> Closes #341 from mateiz/py-ml-update and squashes the following commits: d52e763 [Matei Zaharia] Remove no-longer-needed slice code and handle review comments ea5a25a [Matei Zaharia] Fix remaining uses of copyto() after merge b9f97a3 [Matei Zaharia] Fix test 1e1bd0f [Matei Zaharia] Add MLlib logistic regression example in Python 88bc01f [Matei Zaharia] Clean up inheritance of LinearModel in Python, and expose its parametrs 37ab747 [Matei Zaharia] Fix some examples and docs due to changes in MLlib API da0f27e [Matei Zaharia] Added a MLlib K-means example and updated docs to discuss sparse data c48e85a [Matei Zaharia] Added some tests for passing lists as input, and added mllib/tests.py to run-tests script. a07ba10 [Matei Zaharia] Fix some typos and calculation of initial weights 74eefe7 [Matei Zaharia] Added LabeledPoint class in Python 889dde8 [Matei Zaharia] Support scipy.sparse matrices in all our algorithms and models ab244d1 [Matei Zaharia] Allow SparseVectors to be initialized using a dict a5d6426 [Matei Zaharia] Add linalg.py to run-tests script 0e7a3d8 [Matei Zaharia] Keep vectors sparse in Java when reading LabeledPoints eaee759 [Matei Zaharia] Update regression, classification and clustering models for sparse data 2abbb44 [Matei Zaharia] Further work to get linear models working with sparse data 154f45d [Matei Zaharia] Update docs, name some magic values 881fef7 [Matei Zaharia] Added a sparse vector in Python and made Java-Python format more compact 2014-04-15 23:33:24 -04:00
SPARK-1336 Reducing the output of run-tests script. Author: Prashant Sharma <prashant.s@imaginea.com> Author: Prashant Sharma <scrapcodes@gmail.com> Closes #262 from ScrapCodes/SPARK-1336/ReduceVerbosity and squashes the following commits: 87dfa54 [Prashant Sharma] Further reduction in noise and made pyspark tests to fail fast. 811170f [Prashant Sharma] Reducing the ouput of run-tests script. 2014-03-30 02:03:03 -04:00			`# Fail and exit on the first test failure.`
			`if [[ $FAILED != 0 ]]; then`
			`cat unit-tests.log \| grep -v "^[0-9][0-9]*" # filter all lines starting with a number.`
			`echo -en "\033[31m" # Red`
			`echo "Had test failures; see logs."`
			`echo -en "\033[0m" # No color`
			`exit -1`
			`fi`
Fix PySpark unit tests on Python 2.6. 2013-08-14 18:12:12 -04:00			`}`

HOTFIX: Fix Python tests on Jenkins. Author: Patrick Wendell <pwendell@gmail.com> Closes #1036 from pwendell/jenkins-test and squashes the following commits: 9c99856 [Patrick Wendell] Better output during tests 71e7b74 [Patrick Wendell] Removing incorrect python path 74984db [Patrick Wendell] HOTFIX: Allow PySpark tests to run on Jenkins. 2014-06-10 16:13:17 -04:00			`echo "Running PySpark tests. Output is in python/unit-tests.log."`

Fix PySpark unit tests on Python 2.6. 2013-08-14 18:12:12 -04:00			`run_test "pyspark/rdd.py"`
			`run_test "pyspark/context.py"`
Fix some other Python tests due to initializing JVM in a different way The test in context.py created two different instances of the SparkContext class by copying "globals", so that some tests can have a global "sc" object and others can try initializing their own contexts. This led to two JVM gateways being created since SparkConf also looked at pyspark.context.SparkContext to get the JVM. 2013-12-29 14:31:45 -05:00			`run_test "pyspark/conf.py"`
FIX: Don't build Hive in assembly unless running Hive tests. This will make the tests more stable when not running SQL tests. Author: Patrick Wendell <pwendell@gmail.com> Closes #439 from pwendell/hive-tests and squashes the following commits: 88a6032 [Patrick Wendell] FIX: Don't build Hive in assembly unless running Hive tests. 2014-04-17 20:24:00 -04:00			`if [ -n "$_RUN_SQL_TESTS" ]; then`
			`run_test "pyspark/sql.py"`
			`fi`
HOTFIX: A few PySpark tests were not actually run This is a hot fix for the hot fix in fb499be1ac935b6f91046ec8ff23ac1267c82342. The changes in that commit did not actually cause the `doctest` module in python to be loaded for the following tests: - pyspark/broadcast.py - pyspark/accumulators.py - pyspark/serializers.py (@pwendell I might have told you the wrong thing) Author: Andrew Or <andrewor14@gmail.com> Closes #1053 from andrewor14/python-test-fix and squashes the following commits: d2e5401 [Andrew Or] Explain why these tests are handled differently 0bd6fdd [Andrew Or] Fix 3 pyspark tests not being invoked 2014-06-11 15:11:46 -04:00			`# These tests are included in the module-level docs, and so must`
			`# be handled on a higher level rather than within the python file.`
			`export PYSPARK_DOC_TEST=1`
HOTFIX: Fix Python tests on Jenkins. Author: Patrick Wendell <pwendell@gmail.com> Closes #1036 from pwendell/jenkins-test and squashes the following commits: 9c99856 [Patrick Wendell] Better output during tests 71e7b74 [Patrick Wendell] Removing incorrect python path 74984db [Patrick Wendell] HOTFIX: Allow PySpark tests to run on Jenkins. 2014-06-10 16:13:17 -04:00			`run_test "pyspark/broadcast.py"`
			`run_test "pyspark/accumulators.py"`
			`run_test "pyspark/serializers.py"`
HOTFIX: A few PySpark tests were not actually run This is a hot fix for the hot fix in fb499be1ac935b6f91046ec8ff23ac1267c82342. The changes in that commit did not actually cause the `doctest` module in python to be loaded for the following tests: - pyspark/broadcast.py - pyspark/accumulators.py - pyspark/serializers.py (@pwendell I might have told you the wrong thing) Author: Andrew Or <andrewor14@gmail.com> Closes #1053 from andrewor14/python-test-fix and squashes the following commits: d2e5401 [Andrew Or] Explain why these tests are handled differently 0bd6fdd [Andrew Or] Fix 3 pyspark tests not being invoked 2014-06-11 15:11:46 -04:00			`unset PYSPARK_DOC_TEST`
Fix PySpark unit tests on Python 2.6. 2013-08-14 18:12:12 -04:00			`run_test "pyspark/tests.py"`
Re-enable Python MLlib tests (require Python 2.7 and NumPy 1.7+) 2014-01-14 15:14:48 -05:00			`run_test "pyspark/mllib/_common.py"`
			`run_test "pyspark/mllib/classification.py"`
			`run_test "pyspark/mllib/clustering.py"`
[WIP] SPARK-1430: Support sparse data in Python MLlib This PR adds a SparseVector class in PySpark and updates all the regression, classification and clustering algorithms and models to support sparse data, similar to MLlib. I chose to add this class because SciPy is quite difficult to install in many environments (more so than NumPy), but I plan to add support for SciPy sparse vectors later too, and make the methods work transparently on objects of either type. On the Scala side, we keep Python sparse vectors sparse and pass them to MLlib. We always return dense vectors from our models. Some to-do items left: - [x] Support SciPy's scipy.sparse matrix objects when SciPy is available. We can easily add a function to convert these to our own SparseVector. - [x] MLlib currently uses a vector with one extra column on the left to represent what we call LabeledPoint in Scala. Do we really want this? It may get annoying once you deal with sparse data since you must add/subtract 1 to each feature index when training. We can remove this API in 1.0 and use tuples for labeling. - [x] Explain how to use these in the Python MLlib docs. CC @mengxr, @joshrosen Author: Matei Zaharia <matei@databricks.com> Closes #341 from mateiz/py-ml-update and squashes the following commits: d52e763 [Matei Zaharia] Remove no-longer-needed slice code and handle review comments ea5a25a [Matei Zaharia] Fix remaining uses of copyto() after merge b9f97a3 [Matei Zaharia] Fix test 1e1bd0f [Matei Zaharia] Add MLlib logistic regression example in Python 88bc01f [Matei Zaharia] Clean up inheritance of LinearModel in Python, and expose its parametrs 37ab747 [Matei Zaharia] Fix some examples and docs due to changes in MLlib API da0f27e [Matei Zaharia] Added a MLlib K-means example and updated docs to discuss sparse data c48e85a [Matei Zaharia] Added some tests for passing lists as input, and added mllib/tests.py to run-tests script. a07ba10 [Matei Zaharia] Fix some typos and calculation of initial weights 74eefe7 [Matei Zaharia] Added LabeledPoint class in Python 889dde8 [Matei Zaharia] Support scipy.sparse matrices in all our algorithms and models ab244d1 [Matei Zaharia] Allow SparseVectors to be initialized using a dict a5d6426 [Matei Zaharia] Add linalg.py to run-tests script 0e7a3d8 [Matei Zaharia] Keep vectors sparse in Java when reading LabeledPoints eaee759 [Matei Zaharia] Update regression, classification and clustering models for sparse data 2abbb44 [Matei Zaharia] Further work to get linear models working with sparse data 154f45d [Matei Zaharia] Update docs, name some magic values 881fef7 [Matei Zaharia] Added a sparse vector in Python and made Java-Python format more compact 2014-04-15 23:33:24 -04:00			`run_test "pyspark/mllib/linalg.py"`
Re-enable Python MLlib tests (require Python 2.7 and NumPy 1.7+) 2014-01-14 15:14:48 -05:00			`run_test "pyspark/mllib/recommendation.py"`
			`run_test "pyspark/mllib/regression.py"`
[WIP] SPARK-1430: Support sparse data in Python MLlib This PR adds a SparseVector class in PySpark and updates all the regression, classification and clustering algorithms and models to support sparse data, similar to MLlib. I chose to add this class because SciPy is quite difficult to install in many environments (more so than NumPy), but I plan to add support for SciPy sparse vectors later too, and make the methods work transparently on objects of either type. On the Scala side, we keep Python sparse vectors sparse and pass them to MLlib. We always return dense vectors from our models. Some to-do items left: - [x] Support SciPy's scipy.sparse matrix objects when SciPy is available. We can easily add a function to convert these to our own SparseVector. - [x] MLlib currently uses a vector with one extra column on the left to represent what we call LabeledPoint in Scala. Do we really want this? It may get annoying once you deal with sparse data since you must add/subtract 1 to each feature index when training. We can remove this API in 1.0 and use tuples for labeling. - [x] Explain how to use these in the Python MLlib docs. CC @mengxr, @joshrosen Author: Matei Zaharia <matei@databricks.com> Closes #341 from mateiz/py-ml-update and squashes the following commits: d52e763 [Matei Zaharia] Remove no-longer-needed slice code and handle review comments ea5a25a [Matei Zaharia] Fix remaining uses of copyto() after merge b9f97a3 [Matei Zaharia] Fix test 1e1bd0f [Matei Zaharia] Add MLlib logistic regression example in Python 88bc01f [Matei Zaharia] Clean up inheritance of LinearModel in Python, and expose its parametrs 37ab747 [Matei Zaharia] Fix some examples and docs due to changes in MLlib API da0f27e [Matei Zaharia] Added a MLlib K-means example and updated docs to discuss sparse data c48e85a [Matei Zaharia] Added some tests for passing lists as input, and added mllib/tests.py to run-tests script. a07ba10 [Matei Zaharia] Fix some typos and calculation of initial weights 74eefe7 [Matei Zaharia] Added LabeledPoint class in Python 889dde8 [Matei Zaharia] Support scipy.sparse matrices in all our algorithms and models ab244d1 [Matei Zaharia] Allow SparseVectors to be initialized using a dict a5d6426 [Matei Zaharia] Add linalg.py to run-tests script 0e7a3d8 [Matei Zaharia] Keep vectors sparse in Java when reading LabeledPoints eaee759 [Matei Zaharia] Update regression, classification and clustering models for sparse data 2abbb44 [Matei Zaharia] Further work to get linear models working with sparse data 154f45d [Matei Zaharia] Update docs, name some magic values 881fef7 [Matei Zaharia] Added a sparse vector in Python and made Java-Python format more compact 2014-04-15 23:33:24 -04:00			`run_test "pyspark/mllib/tests.py"`
Add RDD checkpointing to Python API. 2013-01-16 22:15:14 -05:00
SPARK-1336 Reducing the output of run-tests script. Author: Prashant Sharma <prashant.s@imaginea.com> Author: Prashant Sharma <scrapcodes@gmail.com> Closes #262 from ScrapCodes/SPARK-1336/ReduceVerbosity and squashes the following commits: 87dfa54 [Prashant Sharma] Further reduction in noise and made pyspark tests to fail fast. 811170f [Prashant Sharma] Reducing the ouput of run-tests script. 2014-03-30 02:03:03 -04:00			`if [[ $FAILED == 0 ]]; then`
Indicate success/failure in PySpark test script. 2013-01-09 23:30:36 -05:00			`echo -en "\033[32m" # Green`
			`echo "Tests passed."`
			`echo -en "\033[0m" # No color`
			`fi`
Add `pyspark` script to replace the other scripts. Expand the PySpark programming guide. 2013-01-02 00:25:49 -05:00
			# TODO: in the long-run, it would be nice to use a test runner like `nose`.
Indicate success/failure in PySpark test script. 2013-01-09 23:30:36 -05:00			`# The doctest fixtures are the current barrier to doing this.`