Commit graph

251 commits

Author SHA1 Message Date
hyukjinkwon a1877f45c3 [SPARK-22597][SQL] Add spark-sql cmd script for Windows users
## What changes were proposed in this pull request?

This PR proposes to add cmd scripts so that Windows users can also run the `spark-sql` script.

## How was this patch tested?

Manually tested on Windows.

**Before**

```cmd
C:\...\spark>.\bin\spark-sql
'.\bin\spark-sql' is not recognized as an internal or external command,
operable program or batch file.

C:\...\spark>.\bin\spark-sql.cmd
'.\bin\spark-sql.cmd' is not recognized as an internal or external command,
operable program or batch file.
```

**After**

```cmd
C:\...\spark>.\bin\spark-sql
...
spark-sql> SELECT 'Hello World !!';
...
Hello World !!
```

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #19808 from HyukjinKwon/spark-sql-cmd.
2017-11-24 19:55:26 +01:00
Jakub Nowacki b4edafa99b [SPARK-22495] Fix setup of SPARK_HOME variable on Windows
## What changes were proposed in this pull request?

This fixes how `SPARK_HOME` is resolved on Windows. The previous version worked with the built release download, but the set of directories changed slightly for the PySpark `pip` or `conda` install. This had been reflected in the Linux files in `bin` but not in the Windows `cmd` files.

The first fix improves how the `jars` directory is found, as this was stopping the Windows `pip/conda` install from working; JARs were not found during Session/Context setup.

The second fix adds a `find-spark-home.cmd` script which, like the Linux version, uses the `find_spark_home.py` script to resolve `SPARK_HOME`. It is based on the `find-spark-home` bash script, though some operations are done in a different order due to limitations of the `cmd` script language. If the environment variable is already set, the Python script `find_spark_home.py` will not be run. The process can fail if Python is not installed, but this path is mostly taken when PySpark is installed via `pip/conda`, in which case some Python is present on the system.
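
For reference, a minimal bash sketch of the resolution order that `find-spark-home` follows and that the new `find-spark-home.cmd` mirrors (illustrative, not the exact script text):

```bash
# Sketch only: resolve SPARK_HOME for a pip/conda install.
if [ -z "${SPARK_HOME}" ]; then
  if command -v python > /dev/null; then
    # Ask the installed package where it lives; find_spark_home.py ships next to this script.
    export SPARK_HOME="$(python "$(dirname "$0")/find_spark_home.py")"
  else
    # Fall back to the parent of this script's directory.
    export SPARK_HOME="$(cd "$(dirname "$0")/.." && pwd)"
  fi
fi
```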

## How was this patch tested?

Tested on local installation.

Author: Jakub Nowacki <j.s.nowacki@gmail.com>

Closes #19370 from jsnowacki/fix_spark_cmds.
2017-11-23 12:47:38 +09:00
Kent Yao ee571d79e5 [SPARK-22466][SPARK SUBMIT] export SPARK_CONF_DIR while conf is default
## What changes were proposed in this pull request?

We use `SPARK_CONF_DIR` to point to an alternative Spark conf directory, and it is visible to applications if we explicitly export it in `spark-env.sh`, but with the default settings this can't be done. This PR exports `SPARK_CONF_DIR` even when it is left at the default.
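
A minimal bash sketch of the idea (the exact placement in the launch scripts may differ):

```bash
# Sketch: export SPARK_CONF_DIR even when the user did not set it,
# so child processes and applications can read it.
if [ -z "${SPARK_CONF_DIR}" ]; then
  export SPARK_CONF_DIR="${SPARK_HOME}/conf"
fi
```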

### Before

```
KentKentsMacBookPro  ~/Documents/spark-packages/spark-2.3.0-SNAPSHOT-bin-master  bin/spark-shell --master local
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/11/08 10:28:44 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/11/08 10:28:45 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
Spark context Web UI available at http://169.254.168.63:4041
Spark context available as 'sc' (master = local, app id = local-1510108125770).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.0-SNAPSHOT
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_65)
Type in expressions to have them evaluated.
Type :help for more information.

scala> sys.env.get("SPARK_CONF_DIR")
res0: Option[String] = None
```

### After

```
scala> sys.env.get("SPARK_CONF_DIR")
res0: Option[String] = Some(/Users/Kent/Documents/spark/conf)
```
## How was this patch tested?

vanzin

Author: Kent Yao <yaooqinn@hotmail.com>

Closes #19688 from yaooqinn/SPARK-22466.
2017-11-09 14:33:08 +09:00
minixalpha c7b46d4d8a [SPARK-21877][DEPLOY, WINDOWS] Handle quotes in Windows command scripts
## What changes were proposed in this pull request?

All the Windows command scripts cannot handle quotes in parameters.

Running a Windows command script with a parameter that contains quotes reproduces the bug:

```
C:\Users\meng\software\spark-2.2.0-bin-hadoop2.7> bin\spark-shell --driver-java-options " -Dfile.encoding=utf-8 "
'C:\Users\meng\software\spark-2.2.0-bin-hadoop2.7\bin\spark-shell2.cmd" --driver-java-options "' is not recognized as an internal or external command,
operable program or batch file.
```

Windows recognizes "--driver-java-options" as part of the command.
All the Windows command scripts that contain the following code have this bug.

```
cmd /V /E /C "<other command>" %*
```

We should quote the command and parameters like this:

```
cmd /V /E /C ""<other command>" %*"
```

## How was this patch tested?

Test manually on Windows 10 and Windows 7

We can verify it by the following demo:

```
C:\Users\meng\program\demo>cat a.cmd
echo off
cmd /V /E /C "b.cmd" %*

C:\Users\meng\program\demo>cat b.cmd
echo off
echo %*

C:\Users\meng\program\demo>cat c.cmd
echo off
cmd /V /E /C ""b.cmd" %*"

C:\Users\meng\program\demo>a.cmd "123"
'b.cmd" "123' is not recognized as an internal or external command,
operable program or batch file.

C:\Users\meng\program\demo>c.cmd "123"
"123"
```

Taking spark-shell.cmd as an example, changing it to the following code makes the command execute successfully.

```
cmd /V /E /C ""%~dp0spark-shell2.cmd" %*"
```

```
C:\Users\meng\software\spark-2.2.0-bin-hadoop2.7> bin\spark-shell  --driver-java-options " -Dfile.encoding=utf-8 "
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
...

```

Author: minixalpha <xkzalpha@gmail.com>

Closes #19090 from minixalpha/master.
2017-10-06 23:38:47 +09:00
Jakub Nowacki c11f24a940 [SPARK-18136] Fix SPARK_JARS_DIR for Python pip install on Windows
## What changes were proposed in this pull request?

Fixes the setup of `SPARK_JARS_DIR` on Windows: the script looks for the `%SPARK_HOME%\RELEASE` file instead of `%SPARK_HOME%\jars` as it should. The RELEASE file is not included in the `pip` build of PySpark.
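
For comparison, roughly the check used on the Linux side, which the `cmd` file now mirrors by testing for the `jars` directory (a sketch; variable names are illustrative):

```bash
# Prefer the packaged 'jars' directory; fall back to the source-tree layout.
if [ -d "${SPARK_HOME}/jars" ]; then
  SPARK_JARS_DIR="${SPARK_HOME}/jars"
else
  SPARK_JARS_DIR="${SPARK_HOME}/assembly/target/scala-${SPARK_SCALA_VERSION}/jars"
fi
```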

## How was this patch tested?

Local install of PySpark on Anaconda 4.4.0 (Python 3.6.1).

Author: Jakub Nowacki <j.s.nowacki@gmail.com>

Closes #19310 from jsnowacki/master.
2017-09-23 21:04:10 +09:00
Sean Owen 12ab7f7e89 [SPARK-14280][BUILD][WIP] Update change-version.sh and pom.xml to add Scala 2.12 profiles and enable 2.12 compilation
…build; fix some things that will be warnings or errors in 2.12; restore Scala 2.12 profile infrastructure

## What changes were proposed in this pull request?

This change adds back the infrastructure for a Scala 2.12 build, but does not enable it in the release or Python test scripts.

In order to make that meaningful, it also resolves compile errors that the code hits in 2.12 only, in a way that still works with 2.11.

It also updates dependencies to the earliest minor release of dependencies whose current version does not yet support Scala 2.12. This is in a sense covered by other JIRAs under the main umbrella, but implemented here. The versions below still work with 2.11, and are the _latest_ maintenance release in the _earliest_ viable minor release.

- Scalatest 2.x -> 3.0.3
- Chill 0.8.0 -> 0.8.4
- Clapper 1.0.x -> 1.1.2
- json4s 3.2.x -> 3.4.2
- Jackson 2.6.x -> 2.7.9 (required by json4s)

This change does _not_ fully enable a Scala 2.12 build:

- It will also require dropping support for Kafka before 0.10. Easy enough, just didn't do it yet here
- It will require recreating `SparkILoop` and `Main` for REPL 2.12, which is SPARK-14650. Possible to do here too.

What it does do is make changes that resolve much of the remaining gap without affecting the current 2.11 build.

## How was this patch tested?

Existing tests and build. Manually tested with `./dev/change-scala-version.sh 2.12` to verify it compiles, modulo the exceptions above.

Author: Sean Owen <sowen@cloudera.com>

Closes #18645 from srowen/SPARK-14280.
2017-09-01 19:21:21 +01:00
Sean Owen 425c4ada4c [SPARK-19810][BUILD][CORE] Remove support for Scala 2.10
## What changes were proposed in this pull request?

- Remove Scala 2.10 build profiles and support
- Replace some 2.10 support in scripts with commented placeholders for 2.12 later
- Remove deprecated API calls from 2.10 support
- Remove usages of deprecated context bounds where possible
- Remove Scala 2.10 workarounds like ScalaReflectionLock
- Other minor Scala warning fixes

## How was this patch tested?

Existing tests

Author: Sean Owen <sowen@cloudera.com>

Closes #17150 from srowen/SPARK-19810.
2017-07-13 17:06:24 +08:00
Bryan Cutler d03aebbe65 [SPARK-13534][PYSPARK] Using Apache Arrow to increase performance of DataFrame.toPandas
## What changes were proposed in this pull request?
Integrate Apache Arrow with Spark to increase performance of `DataFrame.toPandas`.  This has been done by using Arrow to convert data partitions on the executor JVM to Arrow payload byte arrays where they are then served to the Python process.  The Python DataFrame can then collect the Arrow payloads where they are combined and converted to a Pandas DataFrame.  Data types except complex, date, timestamp, and decimal  are currently supported, otherwise an `UnsupportedOperation` exception is thrown.

Additions to Spark include a Scala package private method `Dataset.toArrowPayload` that will convert data partitions in the executor JVM to `ArrowPayload`s as byte arrays so they can be easily served.  A package private class/object `ArrowConverters` provides data type mappings and conversion routines.  In Python, a private method `DataFrame._collectAsArrow` is added to collect Arrow payloads and a SQLConf "spark.sql.execution.arrow.enable" can be used in `toPandas()` to enable using Arrow (uses the old conversion by default).
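
One possible way to try this out from the shell (the config name is the one introduced here, and the feature stays off unless it is set):

```bash
# Start PySpark with the Arrow-based conversion enabled; inside the shell,
# df.toPandas() then goes through the Arrow path for supported types.
bin/pyspark --conf spark.sql.execution.arrow.enable=true
```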

## How was this patch tested?
Added a new test suite `ArrowConvertersSuite` that will run tests on conversion of Datasets to Arrow payloads for supported types.  The suite will generate a Dataset and matching Arrow JSON data, then the dataset is converted to an Arrow payload and finally validated against the JSON data.  This will ensure that the schema and data have been converted correctly.

Added PySpark tests to verify the `toPandas` method is producing equal DataFrames with and without pyarrow.  A roundtrip test ensures the pandas DataFrame produced by pyspark is equal to one made directly with pandas.

Author: Bryan Cutler <cutlerb@gmail.com>
Author: Li Jin <ice.xelloss@gmail.com>
Author: Li Jin <li.jin@twosigma.com>
Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes #18459 from BryanCutler/toPandas_with_arrow-SPARK-13534.
2017-07-10 15:21:03 -07:00
Dongjoon Hyun c8d0aba198 [SPARK-21278][PYSPARK] Upgrade to Py4J 0.10.6
## What changes were proposed in this pull request?

This PR aims to bump Py4J in order to fix the following float/double bug.
Py4J 0.10.5 fixes this (https://github.com/bartdag/py4j/issues/272) and the latest Py4J is 0.10.6.

**BEFORE**
```
>>> df = spark.range(1)
>>> df.select(df['id'] + 17.133574204226083).show()
+--------------------+
|(id + 17.1335742042)|
+--------------------+
|       17.1335742042|
+--------------------+
```

**AFTER**
```
>>> df = spark.range(1)
>>> df.select(df['id'] + 17.133574204226083).show()
+-------------------------+
|(id + 17.133574204226083)|
+-------------------------+
|       17.133574204226083|
+-------------------------+
```

## How was this patch tested?

Manual.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #18546 from dongjoon-hyun/SPARK-21278.
2017-07-05 16:33:23 -07:00
Wenchen Fan 838effb98a Revert "[SPARK-13534][PYSPARK] Using Apache Arrow to increase performance of DataFrame.toPandas"
This reverts commit e44697606f.
2017-06-28 14:28:40 +08:00
Bryan Cutler e44697606f [SPARK-13534][PYSPARK] Using Apache Arrow to increase performance of DataFrame.toPandas
## What changes were proposed in this pull request?
Integrate Apache Arrow with Spark to increase performance of `DataFrame.toPandas`.  This has been done by using Arrow to convert data partitions on the executor JVM to Arrow payload byte arrays where they are then served to the Python process.  The Python DataFrame can then collect the Arrow payloads where they are combined and converted to a Pandas DataFrame.  All non-complex data types are currently supported, otherwise an `UnsupportedOperation` exception is thrown.

Additions to Spark include a Scala package private method `Dataset.toArrowPayloadBytes` that will convert data partitions in the executor JVM to `ArrowPayload`s as byte arrays so they can be easily served.  A package private class/object `ArrowConverters` provides data type mappings and conversion routines.  In Python, a public method `DataFrame.collectAsArrow` is added to collect Arrow payloads and an optional flag in `toPandas(useArrow=False)` to enable using Arrow (uses the old conversion by default).

## How was this patch tested?
Added a new test suite `ArrowConvertersSuite` that will run tests on conversion of Datasets to Arrow payloads for supported types.  The suite will generate a Dataset and matching Arrow JSON data, then the dataset is converted to an Arrow payload and finally validated against the JSON data.  This will ensure that the schema and data have been converted correctly.

Added PySpark tests to verify the `toPandas` method is producing equal DataFrames with and without pyarrow.  A roundtrip test ensures the pandas DataFrame produced by pyspark is equal to one made directly with pandas.

Author: Bryan Cutler <cutlerb@gmail.com>
Author: Li Jin <ice.xelloss@gmail.com>
Author: Li Jin <li.jin@twosigma.com>
Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes #15821 from BryanCutler/wip-toPandas_with_arrow-SPARK-13534.
2017-06-23 09:01:13 +08:00
Jarrett Meyer b9ad2d1916 [SPARK-20613] Remove excess quotes in Windows executable
## What changes were proposed in this pull request?

Quotes are already added to the RUNNER variable on line 54. There is no need to put quotes on line 67. If you do, you will get an error when launching Spark.

'""C:\Program' is not recognized as an internal or external command, operable program or batch file.

## How was this patch tested?

Tested manually on Windows 10.

Author: Jarrett Meyer <jarrettmeyer@gmail.com>

Closes #17861 from jarrettmeyer/fix-windows-cmd.
2017-05-05 08:30:42 -07:00
jyu00 5773ab121d [SPARK-20546][DEPLOY] spark-class gets syntax error in posix mode
## What changes were proposed in this pull request?

Updated spark-class to turn off posix mode so the process substitution doesn't cause a syntax error.
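
Roughly what the change amounts to (a sketch, not the exact diff):

```bash
# POSIX mode (e.g. when bash runs as 'sh') rejects the process substitution
# spark-class uses to read the launcher's output, so turn it off first.
set +o posix
```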

## How was this patch tested?

Existing unit tests, manual spark-shell testing with posix mode on

Author: jyu00 <jessieyu@us.ibm.com>

Closes #17852 from jyu00/master.
2017-05-05 11:36:51 +01:00
Felix Cheung a8877bdbba [SPARK-19237][SPARKR][CORE] On Windows spark-submit should handle when java is not installed
## What changes were proposed in this pull request?

When SparkR is installed as an R package there might not be any Java runtime.
If it is not there, SparkR's `sparkR.session()` will block waiting until the connection times out, hanging the R IDE/shell, without any notification or message.

## How was this patch tested?

manually

- [x] need to test on Windows

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #16596 from felixcheung/rcheckjava.
2017-03-21 14:24:41 -07:00
Holden Karau a36a76ac43 [SPARK-1267][SPARK-18129] Allow PySpark to be pip installed
## What changes were proposed in this pull request?

This PR aims to provide a pip installable PySpark package. This does a bunch of work to copy the jars over and package them with the Python code (to prevent challenges from trying to use different versions of the Python code with different versions of the JAR). It does not currently publish to PyPI but that is the natural follow up (SPARK-18129).
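
A sketch of the intended end-user flow (publishing to PyPI is follow-up work, so installing by name is not possible yet; a locally built artifact behaves the same way):

```bash
# Install the packaged PySpark from a locally built artifact
# (after SPARK-18129, this would become: pip install pyspark).
pip install /path/to/pyspark-*.tar.gz
# The pip-installed entry point locates its bundled jars on its own.
pyspark
```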

Done:
- pip installable on conda [manual tested]
- setup.py installed on a non-pip managed system (RHEL) with YARN [manual tested]
- Automated testing of this (virtualenv)
- packaging and signing with release-build*

Possible follow up work:
- release-build update to publish to PyPI (SPARK-18128)
- figure out who owns the pyspark package name on prod PyPI (is it someone within the project or should we ask PyPI or should we choose a different name to publish with like ApachePySpark?)
- Windows support and or testing ( SPARK-18136 )
- investigate details of wheel caching and see if we can avoid cleaning the wheel cache during our test
- consider how we want to number our dev/snapshot versions

Explicitly out of scope:
- Using pip installed PySpark to start a standalone cluster
- Using pip installed PySpark for non-Python Spark programs

*I've done some work to test release-build locally but as a non-committer I've just done local testing.
## How was this patch tested?

Automated testing with virtualenv, manual testing with conda, a system wide install, and YARN integration.

release-build changes tested locally as a non-committer (no testing of upload artifacts to Apache staging websites)

Author: Holden Karau <holden@us.ibm.com>
Author: Juliet Hougland <juliet@cloudera.com>
Author: Juliet Hougland <not@myemail.com>

Closes #15659 from holdenk/SPARK-1267-pip-install-pyspark.
2016-11-16 14:22:15 -08:00
Jagadeesan 595893d33a [SPARK-17960][PYSPARK][UPGRADE TO PY4J 0.10.4]
## What changes were proposed in this pull request?

1) Upgrade the Py4J version on the Java side
2) Update the py4j src zip file we bundle with Spark

## How was this patch tested?

Existing doctests & unit tests pass

Author: Jagadeesan <as2@us.ibm.com>

Closes #15514 from jagadeesanas2/SPARK-17960.
2016-10-21 09:48:24 +01:00
Sean Owen 0b3a4be92c [SPARK-16781][PYSPARK] java launched by PySpark as gateway may not be the same java used in the spark environment
## What changes were proposed in this pull request?

Update to py4j 0.10.3 to enable JAVA_HOME support

## How was this patch tested?

Pyspark tests

Author: Sean Owen <sowen@cloudera.com>

Closes #14748 from srowen/SPARK-16781.
2016-08-24 20:04:09 +01:00
Marcelo Vanzin 1739e75fec [SPARK-16586][CORE] Handle JVM errors printed to stdout.
Some very rare JVM errors are printed to stdout, and that confuses
the code in spark-class. So add a check so that those cases are
detected and the proper error message is shown to the user.
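
A hedged sketch of the kind of check this adds: the last token of the launcher's output should be a numeric exit code, so anything else means the JVM wrote an error to stdout.

```bash
# Sketch only: CMD holds the launcher's NUL-delimited output plus a trailing
# exit status. If that last element is not a number, the JVM printed an error.
LAUNCHER_EXIT_CODE=${CMD[${#CMD[@]}-1]}
if ! [[ $LAUNCHER_EXIT_CODE =~ ^[0-9]+$ ]]; then
  echo "${CMD[@]}" | head -n -1 1>&2
  exit 1
fi
```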

Tested by running spark-submit after setting "ulimit -v 32000".

Closes #14231

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #14508 from vanzin/SPARK-16586.
2016-08-08 10:34:54 -07:00
MechCoder 6343f66557 [SPARK-16399][PYSPARK] Force PYSPARK_PYTHON to python
## What changes were proposed in this pull request?

I would like to change

```bash
if hash python2.7 2>/dev/null; then
  # Attempt to use Python 2.7, if installed:
  DEFAULT_PYTHON="python2.7"
else
  DEFAULT_PYTHON="python"
fi
```

to just ```DEFAULT_PYTHON="python"```

I'm not sure if it is a great assumption that python2.7 is used by default, when python points to something else.

## How was this patch tested?

Author: MechCoder <mks542@nyu.edu>

Closes #14016 from MechCoder/followup.
2016-07-07 11:31:10 +01:00
MechCoder 66283ee0b2 [SPARK-15761][MLLIB][PYSPARK] Load ipython when default python is Python3
## What changes were proposed in this pull request?

I would like to use IPython with Python 3.5. It is annoying when it fails with "IPython requires Python 2.7+; please install python2.7 or set PYSPARK_PYTHON" when I have a version greater than 2.7.

## How was this patch tested?
It now works with IPython and Python3

Author: MechCoder <mks542@nyu.edu>

Closes #13503 from MechCoder/spark-15761.
2016-07-01 09:27:34 +01:00
Sean Owen 623aae5907 [SPARK-15531][DEPLOY] spark-class tries to use too much memory when running Launcher
## What changes were proposed in this pull request?

Explicitly limit launcher JVM memory to modest 128m
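
Roughly the effect (a sketch): the short-lived JVM that only builds the command line gets a small fixed heap.

```bash
# Sketch: cap the launcher JVM at 128m; the driver/executor memory settings
# still apply to the command it produces. RUNNER and LAUNCH_CLASSPATH come
# from the surrounding script.
"${RUNNER}" -Xmx128m -cp "${LAUNCH_CLASSPATH}" org.apache.spark.launcher.Main "$@"
```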

## How was this patch tested?

Jenkins tests.

Author: Sean Owen <sowen@cloudera.com>

Closes #13360 from srowen/SPARK-15531.
2016-05-27 11:28:28 -07:00
Holden Karau 382dbc12bb [SPARK-15061][PYSPARK] Upgrade to Py4J 0.10.1
## What changes were proposed in this pull request?

This upgrades to Py4J 0.10.1, which reduces syscall overhead in the Java gateway (see https://github.com/bartdag/py4j/issues/201). Related: https://issues.apache.org/jira/browse/SPARK-6728.

## How was this patch tested?

Existing doctests & unit tests pass

Author: Holden Karau <holden@us.ibm.com>

Closes #13064 from holdenk/SPARK-15061-upgrade-to-py4j-0.10.1.
2016-05-13 08:59:18 +01:00
Marcelo Vanzin 36c5892b46 [SPARK-13670][LAUNCHER] Propagate error from launcher to shell.
bash doesn't really propagate errors from subshells when using redirection
the way spark-class does; so, instead, this change captures the exit code
of the launcher process in the command array, and checks it before executing
the actual command.
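
A sketch of the pattern described above (function and variable names are illustrative):

```bash
# Append the launcher's exit status to its NUL-delimited output...
build_command() {
  "$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@"
  printf "%d\0" $?
}

# ...collect everything into an array, then check the trailing status
# before exec'ing the actual command.
CMD=()
while IFS= read -d '' -r ARG; do
  CMD+=("$ARG")
done < <(build_command "$@")

LAUNCHER_EXIT_CODE=${CMD[${#CMD[@]}-1]}
if [ "$LAUNCHER_EXIT_CODE" != 0 ]; then
  exit "$LAUNCHER_EXIT_CODE"
fi
exec "${CMD[@]:0:${#CMD[@]}-1}"
```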

Tested by injecting an error in Main.java (the launcher entry point) and
verifying the shell gets the right exit code from spark-class.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #12910 from vanzin/SPARK-13670.
2016-05-10 10:34:26 -07:00
pshearer 0368ff30dd [SPARK-13973][PYSPARK] Make pyspark fail noisily if IPYTHON or IPYTHON_OPTS are set
## What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-13973

Following discussion with srowen the IPYTHON and IPYTHON_OPTS variables are removed. If they are set in the user's environment, pyspark will not execute and prints an error message. Failing noisily will force users to remove these options and learn the new configuration scheme, which is much more sustainable and less confusing.
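
For reference, the replacement scheme (the variable names are the current ones; the values are only examples):

```bash
# Instead of IPYTHON=1 / IPYTHON_OPTS="notebook":
export PYSPARK_DRIVER_PYTHON=ipython
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
./bin/pyspark
```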

## How was this patch tested?

Manual testing; set IPYTHON=1 and verified that the error message prints.

Author: pshearer <pshearer@massmutual.com>
Author: shearerp <shearerp@umich.edu>

Closes #12528 from shearerp/master.
2016-04-30 10:15:20 +01:00
Mark Grover ff9ae61a3b [SPARK-14601][DOC] Minor doc/usage changes related to removal of Spark assembly
## What changes were proposed in this pull request?

Removing references to assembly jar in documentation.
Adding an additional (previously undocumented) usage of spark-submit to run examples.

## How was this patch tested?

Ran spark-submit usage to ensure formatting was fine. Ran examples using SparkSubmit.

Author: Mark Grover <mark@apache.org>

Closes #12365 from markgrover/spark-14601.
2016-04-14 18:51:43 -07:00
Holden Karau 457e58befe [SPARK-14424][BUILD][DOCS] Update the build docs to switch from assembly to package and add a no…
## What changes were proposed in this pull request?

Change our build docs & shell scripts so that developers are aware of the change from "assembly" to "package"

## How was this patch tested?

Manually ran ./bin/spark-shell after ./build/sbt assembly and verified error message printed, ran new suggested build target and verified ./bin/spark-shell runs after this.

Author: Holden Karau <holden@pigscanfly.ca>
Author: Holden Karau <holden@us.ibm.com>

Closes #12197 from holdenk/SPARK-1424-spark-class-broken-fix-build-docs.
2016-04-06 16:00:29 -07:00
Marcelo Vanzin 24d7d2e453 [SPARK-13579][BUILD] Stop building the main Spark assembly.
This change modifies the "assembly/" module to just copy needed
dependencies to its build directory, and modifies the packaging
script to pick those up (and remove duplicate jars packaged in the
examples module).

I also made some minor adjustments to dependencies to remove some
test jars from the final packaging, and remove jars that conflict with each
other when packaged separately (e.g. servlet api).

Also note that this change restores guava in applications' classpaths, even
though it's still shaded inside Spark. This is now needed for the Hadoop
libraries that are packaged with Spark, which now are not processed by
the shade plugin.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #11796 from vanzin/SPARK-13579.
2016-04-04 16:52:22 -07:00
Rekha Joshi a91784fb6e [SPARK-13973][PYSPARK] `ipython notebook` is going away
## What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-13973

## How was this patch tested?
Pyspark

Author: Rekha Joshi <rekhajoshm@gmail.com>
Author: Joshi <rekhajoshm@gmail.com>

Closes #11829 from rekhajoshm/SPARK-13973.
2016-03-26 12:53:37 +00:00
Marcelo Vanzin 48978abfa4 [SPARK-13576][BUILD] Don't create assembly for examples.
As part of the goal to stop creating assemblies in Spark, this change
modifies the mvn and sbt builds to not create an assembly for examples.

Instead, dependencies are copied to the build directory (under
target/scala-xx/jars), and in the final archive, into the "examples/jars"
directory.

To avoid having to deal too much with Windows batch files, I made examples
run through the launcher library; the spark-submit launcher now has a
special mode to run examples, which adds all the necessary jars to the
spark-submit command line, and replaces the bash and batch scripts that
were used to run examples. The scripts are now just a thin wrapper around
spark-submit; another advantage is that now all spark-submit options are
supported.

There are a few glitches; in the mvn build, a lot of duplicated dependencies
get copied, because they are promoted to "compile" scope due to extra
dependencies in the examples module (such as HBase). In the sbt build,
all dependencies are copied, because there doesn't seem to be an easy
way to filter things.

I plan to clean some of this up when the rest of the tasks are finished.
When the main assembly is replaced with jars, we can remove duplicate jars
from the examples directory during packaging.

Tested by running SparkPi in: maven build, sbt build, dist created by
make-distribution.sh.

Finally: note that running the "assembly" target in sbt doesn't build
the examples anymore. You need to run "package" for that.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #11452 from vanzin/SPARK-13576.
2016-03-15 09:44:51 -07:00
Josh Rosen 07cb323e7a [SPARK-13848][SPARK-5185] Update to Py4J 0.9.2 in order to fix classloading issue
This patch upgrades Py4J from 0.9.1 to 0.9.2 in order to include a patch which modifies Py4J to use the current thread's ContextClassLoader when performing reflection / class loading. This is necessary in order to fix [SPARK-5185](https://issues.apache.org/jira/browse/SPARK-5185), a longstanding issue affecting the use of `--jars` and `--packages` in PySpark.

In order to demonstrate that the fix works, I removed the workarounds which were added as part of [SPARK-6027](https://issues.apache.org/jira/browse/SPARK-6027) / #4779 and other patches.

Py4J diff: https://github.com/bartdag/py4j/compare/0.9.1...0.9.2

/cc zsxwing tdas davies brkyvz

Author: Josh Rosen <joshrosen@databricks.com>

Closes #11687 from JoshRosen/py4j-0.9.2.
2016-03-14 12:22:02 -07:00
Marcelo Vanzin 45f8053be5 [SPARK-13578][CORE] Modify launch scripts to not use assemblies.
Instead of looking for a specially-named assembly, the scripts now will
blindly add all jars under the libs directory to the classpath. This
libs directory is still currently the old assembly dir, so things should
keep working the same way as before until we make more packaging changes.

The only lost feature is the detection of multiple assemblies; I consider
that a minor nicety that only really affects few developers, so it's probably
ok.

Tested locally by running spark-shell; also did some minor Win32 testing
(just made sure spark-shell started).

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #11591 from vanzin/SPARK-13578.
2016-03-14 11:13:26 -07:00
Masayoshi TSUZUKI e617508244 [SPARK-13673][WINDOWS] Fixed not to pollute environment variables.
## What changes were proposed in this pull request?

This patch fixes the problem that `bin\beeline.cmd` pollutes environment variables.
The similar problem is reported and fixed in https://issues.apache.org/jira/browse/SPARK-3943, but `bin\beeline.cmd` seems to be added later.

## How was this patch tested?

manual tests:
  I executed the new `bin\beeline.cmd` and confirmed that %SPARK_HOME% doesn't remain in the command prompt.

Author: Masayoshi TSUZUKI <tsudukim@oss.nttdata.co.jp>

Closes #11516 from tsudukim/feature/SPARK-13673.
2016-03-04 13:53:53 +00:00
Masayoshi TSUZUKI 12a2a57e1a [SPARK-13592][WINDOWS] fix path of spark-submit2.cmd in spark-submit.cmd
## What changes were proposed in this pull request?

This patch fixes the problem that pyspark fails on Windows because pyspark can't find `spark-submit2.cmd`.

## How was this patch tested?

manual tests:
  I ran `bin\pyspark.cmd` and checked that pyspark launches correctly after this patch is applied.

Author: Masayoshi TSUZUKI <tsudukim@oss.nttdata.co.jp>

Closes #11442 from tsudukim/feature/SPARK-13592.
2016-03-01 14:37:36 +00:00
Jon Maurer 2ba9b6a2df [SPARK-11518][DEPLOY, WINDOWS] Handle spaces in Windows command scripts
Author: Jon Maurer <tritab@gmail.com>
Author: Jonathan Maurer <jmaurer@Jonathans-MacBook-Pro.local>

Closes #10789 from tritab/cmd_updates.
2016-02-10 09:54:22 +00:00
Shixiong Zhu 4f60651cbe [SPARK-12652][PYSPARK] Upgrade Py4J to 0.9.1
- [x] Upgrade Py4J to 0.9.1
- [x] SPARK-12657: Revert SPARK-12617
- [x] SPARK-12658: Revert SPARK-12511
  - Still keep the change that reads the checkpoint only once. This is a manual change and worth taking a careful look at. bfd4b5c040
- [x] Verify no leak any more after reverting our workarounds

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #10692 from zsxwing/py4j-0.9.1.
2016-01-12 14:27:05 -08:00
Jeff Zhang 708129187a [SPARK-12166][TEST] Unset hadoop related environment in testing
Author: Jeff Zhang <zjffdu@apache.org>

Closes #10172 from zjffdu/SPARK-12166.
2015-12-08 11:05:06 +00:00
wangt 9f3e59a168 [SPARK-11880][WINDOWS][SPARK SUBMIT] bin/load-spark-env.cmd loads spark-env.cmd from wrong directory
* On windows the `bin/load-spark-env.cmd` tries to load `spark-env.cmd` from `%~dp0..\..\conf`, where `~dp0` points to `bin` and `conf` is only one level up.
* Updated `bin/load-spark-env.cmd` to load `spark-env.cmd` from `%~dp0..\conf`, instead of `%~dp0..\..\conf`

Author: wangt <wangtao.upc@gmail.com>

Closes #9863 from toddwan/master.
2015-11-25 11:41:05 -08:00
jerryshao 8aff36e91d [SPARK-2960][DEPLOY] Support executing Spark from symlinks (reopen)
This PR is based on the work of roji to support running Spark scripts from symlinks. Thanks for the great work, roji. Would you mind taking a look at this PR? Thanks a lot.

For releases like HDP and others, the Spark executables are normally exposed as symlinks placed on the `PATH`, but Spark's current scripts do not support finding the real path from a symlink recursively, which makes Spark fail to execute from a symlink. This PR tries to solve this issue by finding the absolute path behind the symlink.

Unlike the earlier PR (https://github.com/apache/spark/pull/2386), this does not use `readlink -f`, since `-f` is not supported on Mac, so the path is resolved manually in a loop.
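
A sketch of the usual portable resolution loop (not necessarily the exact script text):

```bash
# Follow symlinks one hop at a time, since 'readlink -f' is not available on macOS.
SOURCE="$0"
while [ -h "$SOURCE" ]; do
  DIR="$(cd -P "$(dirname "$SOURCE")" && pwd)"
  SOURCE="$(readlink "$SOURCE")"
  # A relative link is resolved against the directory containing the link.
  [[ "$SOURCE" != /* ]] && SOURCE="$DIR/$SOURCE"
done
SPARK_HOME="$(cd -P "$(dirname "$SOURCE")/.." && pwd)"
```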

I've tested on Mac and Linux (CentOS); it looks fine.

This PR does not fix the scripts under the `sbin` folder; I'm not sure whether they need to be fixed as well.

Please help to review, any comment is greatly appreciated.

Author: jerryshao <sshao@hortonworks.com>
Author: Shay Rojansky <roji@roji.org>

Closes #8669 from jerryshao/SPARK-2960.
2015-11-04 10:49:34 +00:00
Jeffrey Naisbitt 28132ceb10 [SPARK-11264] bin/spark-class can't find assembly jars with certain GREP_OPTIONS set
Temporarily remove GREP_OPTIONS if set in bin/spark-class.

Some GREP_OPTIONS will modify the output of the grep commands that are looking for the assembly jars.
For example, if the -n option is specified, the grep output will look like:
5:spark-assembly-1.5.1-hadoop2.4.0.jar

This will not match the regular expressions, and so the jar files will not be found.  We could improve the regular expression to handle this case and trim off extra characters, but it is difficult to know which options may or may not be set.  Unsetting GREP_OPTIONS within the script handles all the cases and gives the desired output.
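
A sketch of the fix and the failure mode it avoids (paths and variable names are illustrative):

```bash
# With e.g. GREP_OPTIONS='-n' the lookup below would return
# '5:spark-assembly-1.5.1-hadoop2.4.0.jar', which the later checks reject.
unset GREP_OPTIONS
ASSEMBLY_JAR="$(ls "$ASSEMBLY_DIR" | grep '^spark-assembly.*\.jar$')"
```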

Author: Jeffrey Naisbitt <jnaisbitt@familysearch.org>

Closes #9231 from naisbitt/unset-GREP_OPTIONS.
2015-10-24 18:21:36 +01:00
Holden Karau e18b571c33 [SPARK-10447][SPARK-3842][PYSPARK] upgrade pyspark to py4j0.9
Upgrade to Py4J 0.9

Author: Holden Karau <holden@pigscanfly.ca>
Author: Holden Karau <holden@us.ibm.com>

Closes #8615 from holdenk/SPARK-10447-upgrade-pyspark-to-py4j0.9.
2015-10-20 10:52:49 -07:00
Marcelo Vanzin c53c902fa9 [SPARK-9284] [TESTS] Allow all tests to run without an assembly.
This change aims at speeding up the dev cycle a little bit, by making
sure that all tests behave the same w.r.t. where the code to be tested
is loaded from. Namely, that means that tests don't rely on the assembly
anymore, rather loading all needed classes from the build directories.

The main change is to make sure all build directories (classes and test-classes)
are added to the classpath of child processes when running tests.

YarnClusterSuite required some custom code since the executors are run
differently (i.e. not through the launcher library, like standalone and
Mesos do).

I also found a couple of tests that could leak a SparkContext on failure,
and added code to handle those.

With this patch, it's possible to run the following command from a clean
source directory and have all tests pass:

  mvn -Pyarn -Phadoop-2.4 -Phive-thriftserver install

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #7629 from vanzin/SPARK-9284.
2015-08-28 12:33:40 -07:00
Cheolsoo Park 9a11396113 [SPARK-9270] [PYSPARK] allow --name option in pyspark
This is continuation of #7512 which added `--name` option to spark-shell. This PR adds the same option to pyspark.

Note that `--conf spark.app.name` on the command line has no effect in spark-shell and pyspark. Instead, `--name` must be used. This is in fact an inconsistency with spark-sql, which doesn't accept the `--name` option while it accepts `--conf spark.app.name`. I am not fixing this inconsistency in this PR. IMO, one of `--name` and `--conf spark.app.name` is needed, not both. But since I cannot decide which to choose, I am not making any change here.

Author: Cheolsoo Park <cheolsoop@netflix.com>

Closes #7610 from piaozhexiu/SPARK-9270 and squashes the following commits:

763e86d [Cheolsoo Park] Update windows script
400b7f9 [Cheolsoo Park] Allow --name option to pyspark
2015-07-24 11:56:55 -07:00
Kenichi Maehashi 430cd7815d [SPARK-9180] fix spark-shell to accept --name option
This patch fixes [[SPARK-9180]](https://issues.apache.org/jira/browse/SPARK-9180).
Users can now set the app name of spark-shell using `spark-shell --name "whatever"`.

Author: Kenichi Maehashi <webmaster@kenichimaehashi.com>

Closes #7512 from kmaehashi/fix-spark-shell-app-name and squashes the following commits:

e24991a [Kenichi Maehashi] use setIfMissing instead of setAppName
18aa4ad [Kenichi Maehashi] fix spark-shell to accept --name option
2015-07-22 16:15:44 -07:00
Sean Owen e84815dc33 [SPARK-7733] [CORE] [BUILD] Update build, code to use Java 7 for 1.5.0+
Update build to use Java 7, and remove some comments and special-case support for Java 6.

Author: Sean Owen <sowen@cloudera.com>

Closes #6265 from srowen/SPARK-7733 and squashes the following commits:

59bda4e [Sean Owen] Update build to use Java 7, and remove some comments and special-case support for Java 6
2015-06-07 20:18:13 +01:00
Marcelo Vanzin 700312e12f [SPARK-6324] [CORE] Centralize handling of script usage messages.
Reorganize code so that the launcher library handles most of the work
of printing usage messages, instead of having an awkward protocol between
the library and the scripts for that.

This mostly applies to SparkSubmit, since the launcher lib does not do
command line parsing for classes invoked in other ways, and thus cannot
handle failures for those. Most scripts end up going through SparkSubmit,
though, so it all works.

The change adds a new, internal command line switch, "--usage-error",
which prints the usage message and exits with a non-zero status. Scripts
can override the command printed in the usage message by setting an
environment variable - this avoids having to grep the output of
SparkSubmit to remove references to the "spark-submit" script.
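
For example, a launcher script can set the override before delegating to spark-submit (a sketch based on how `bin/spark-shell` does it; treat the exact variable and arguments as approximate):

```bash
# Override the command name shown in usage output, then delegate to spark-submit.
export _SPARK_CMD_USAGE="Usage: ./bin/spark-shell [options]"
exec "${SPARK_HOME}"/bin/spark-submit --class org.apache.spark.repl.Main --name "Spark shell" "$@"
```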

The only sub-optimal part of the change is the special handling for the
spark-sql usage, which is now done in SparkSubmitArguments.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #5841 from vanzin/SPARK-6324 and squashes the following commits:

2821481 [Marcelo Vanzin] Merge branch 'master' into SPARK-6324
bf139b5 [Marcelo Vanzin] Filter output of Spark SQL CLI help.
c6609bf [Marcelo Vanzin] Fix exit code never being used when printing usage messages.
6bc1b41 [Marcelo Vanzin] [SPARK-6324] [core] Centralize handling of script usage messages.
2015-06-05 14:32:00 +02:00
Michael Nazario 1c5b19827a [SPARK-7899] [PYSPARK] Fix Python 3 pyspark/sql/types module conflict
This PR makes the types module in `pyspark/sql/types` work with pylint static analysis by removing the dynamic naming of the `pyspark/sql/_types` module to `pyspark/sql/types`.

Tests are now loaded using `$PYSPARK_DRIVER_PYTHON -m module` rather than `$PYSPARK_DRIVER_PYTHON module.py`. The old method adds the location of `module.py` to `sys.path`, so this change prevents accidental use of relative paths in Python.
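
Illustratively, the difference in how a test module is launched (the module name is just an example):

```bash
# Old: running the file adds its directory to sys.path, allowing accidental relative imports.
$PYSPARK_DRIVER_PYTHON python/pyspark/sql/tests.py
# New: run it as a module so imports resolve against the package.
$PYSPARK_DRIVER_PYTHON -m pyspark.sql.tests
```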

Author: Michael Nazario <mnazario@palantir.com>

Closes #6439 from mnazario/feature/SPARK-7899 and squashes the following commits:

366ef30 [Michael Nazario] Remove hack on random.py
bb8b04d [Michael Nazario] Make doctests consistent with other tests
6ee4f75 [Michael Nazario] Change test scripts to use "-m"
673528f [Michael Nazario] Move _types back to types
2015-05-29 14:13:44 -07:00
Chris Biow c8c481da18 Limit help option regex
Added word-boundary delimiters so that embedded text such as "-h" within command line options and values doesn't trigger the usage script and exit.
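
A sketch of the idea with word-boundary-style anchoring (the actual pattern in the script may differ):

```bash
# Only treat '-h'/'--help' as a request for usage when it appears as a whole
# word, not embedded inside another option or value.
if echo "$@" | grep -Eq '(^| )(-h|--help)( |$)'; then
  echo "Usage: ..." 1>&2
  exit 1
fi
```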

Author: Chris Biow <chris.biow@10gen.com>

Closes #5816 from cbiow/patch-1 and squashes the following commits:

36b3726 [Chris Biow] Limit help option regex
2015-05-01 19:26:55 +01:00
Masayoshi TSUZUKI 268c419f15 [SPARK-6435] spark-shell --jars option does not add all jars to classpath
Modified to accept double-quotated args properly in spark-shell.cmd.

Author: Masayoshi TSUZUKI <tsudukim@oss.nttdata.co.jp>

Closes #5227 from tsudukim/feature/SPARK-6435-2 and squashes the following commits:

ac55787 [Masayoshi TSUZUKI] removed unnecessary argument.
60789a7 [Masayoshi TSUZUKI] Merge branch 'master' of https://github.com/apache/spark into feature/SPARK-6435-2
1fee420 [Masayoshi TSUZUKI] fixed test code for escaping '='.
0d4dc41 [Masayoshi TSUZUKI] - escaped comma and semicolon in CommandBuilderUtils.java - added random string to the temporary filename - double-quotation followed by `cmd /c` did not work properly - no need to escape `=` by `^` - if double-quoted string ended with `\` like classpath, the last `\` is parsed as the escape character and the closing `"` didn't work properly
2a332e5 [Masayoshi TSUZUKI] Merge branch 'master' into feature/SPARK-6435-2
04f4291 [Masayoshi TSUZUKI] [SPARK-6435] spark-shell --jars option does not add all jars to classpath
2015-04-28 07:56:36 -04:00
Davies Liu 04e44b37cc [SPARK-4897] [PySpark] Python 3 support
This PR updates PySpark to support Python 3 (tested with 3.4).

Known issue: unpickle array from Pyrolite is broken in Python 3, those tests are skipped.

TODO: ec2/spark-ec2.py is not fully tested with python3.

Author: Davies Liu <davies@databricks.com>
Author: twneale <twneale@gmail.com>
Author: Josh Rosen <joshrosen@databricks.com>

Closes #5173 from davies/python3 and squashes the following commits:

d7d6323 [Davies Liu] fix tests
6c52a98 [Davies Liu] fix mllib test
99e334f [Davies Liu] update timeout
b716610 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
cafd5ec [Davies Liu] adddress comments from @mengxr
bf225d7 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
179fc8d [Davies Liu] tuning flaky tests
8c8b957 [Davies Liu] fix ResourceWarning in Python 3
5c57c95 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
4006829 [Davies Liu] fix test
2fc0066 [Davies Liu] add python3 path
71535e9 [Davies Liu] fix xrange and divide
5a55ab4 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
125f12c [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
ed498c8 [Davies Liu] fix compatibility with python 3
820e649 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
e8ce8c9 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
ad7c374 [Davies Liu] fix mllib test and warning
ef1fc2f [Davies Liu] fix tests
4eee14a [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
20112ff [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
59bb492 [Davies Liu] fix tests
1da268c [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
ca0fdd3 [Davies Liu] fix code style
9563a15 [Davies Liu] add imap back for python 2
0b1ec04 [Davies Liu] make python examples work with Python 3
d2fd566 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
a716d34 [Davies Liu] test with python 3.4
f1700e8 [Davies Liu] fix test in python3
671b1db [Davies Liu] fix test in python3
692ff47 [Davies Liu] fix flaky test
7b9699f [Davies Liu] invalidate import cache for Python 3.3+
9c58497 [Davies Liu] fix kill worker
309bfbf [Davies Liu] keep compatibility
5707476 [Davies Liu] cleanup, fix hash of string in 3.3+
8662d5b [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
f53e1f0 [Davies Liu] fix tests
70b6b73 [Davies Liu] compile ec2/spark_ec2.py in python 3
a39167e [Davies Liu] support customize class in __main__
814c77b [Davies Liu] run unittests with python 3
7f4476e [Davies Liu] mllib tests passed
d737924 [Davies Liu] pass ml tests
375ea17 [Davies Liu] SQL tests pass
6cc42a9 [Davies Liu] rename
431a8de [Davies Liu] streaming tests pass
78901a7 [Davies Liu] fix hash of serializer in Python 3
24b2f2e [Davies Liu] pass all RDD tests
35f48fe [Davies Liu] run future again
1eebac2 [Davies Liu] fix conflict in ec2/spark_ec2.py
6e3c21d [Davies Liu] make cloudpickle work with Python3
2fb2db3 [Josh Rosen] Guard more changes behind sys.version; still doesn't run
1aa5e8f [twneale] Turned out `pickle.DictionaryType is dict` == True, so swapped it out
7354371 [twneale] buffer --> memoryview  I'm not super sure if this a valid change, but the 2.7 docs recommend using memoryview over buffer where possible, so hoping it'll work.
b69ccdf [twneale] Uses the pure python pickle._Pickler instead of c-extension _pickle.Pickler. It appears pyspark 2.7 uses the pure python pickler as well, so this shouldn't degrade pickling performance (?).
f40d925 [twneale] xrange --> range
e104215 [twneale] Replaces 2.7 types.InstsanceType with 3.4 `object`....could be horribly wrong depending on how types.InstanceType is used elsewhere in the package--see http://bugs.python.org/issue8206
79de9d0 [twneale] Replaces python2.7 `file` with 3.4 _io.TextIOWrapper
2adb42d [Josh Rosen] Fix up some import differences between Python 2 and 3
854be27 [Josh Rosen] Run `futurize` on Python code:
7c5b4ce [Josh Rosen] Remove Python 3 check in shell.py.
2015-04-16 16:20:57 -07:00
Marcelo Vanzin 9717389365 [SPARK-6890] [core] Fix launcher lib work with SPARK_PREPEND_CLASSES.
The fix for SPARK-6406 broke the case where sub-processes are launched
when SPARK_PREPEND_CLASSES is set, because the code now would only add
the launcher's build directory to the sub-process's classpath instead
of the complete assembly.

This patch fixes the problem by having the launch scripts stash the
assembly's location in an environment variable. This is not the prettiest
solution, but it avoids having to plumb that location all the way through
the Worker code that launches executors. The env variable is always
set by the launch scripts, so users cannot override it.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #5504 from vanzin/SPARK-6890 and squashes the following commits:

7aec921 [Marcelo Vanzin] Fix tests.
ff87a60 [Marcelo Vanzin] Merge branch 'master' into SPARK-6890
31d3ce8 [Marcelo Vanzin] [SPARK-6890] [core] Fix launcher lib work with SPARK_PREPEND_CLASSES.
2015-04-14 18:51:39 -07:00