[SPARK-13521][BUILD] Remove reference to Tachyon in cluster & release scripts

## What changes were proposed in this pull request?
We provide a very limited set of cluster management scripts in Spark for Tachyon, even though Tachyon itself provides a much better version of them. Given that Spark users can now simply use Tachyon as a normal file system without extensive configuration, we can remove these management capabilities to simplify Spark's bash scripts.

Note that this also reduces coupling between a third-party external system and Spark's release scripts, and eliminates the possibility of failures caused by, for example, Tachyon being renamed or its tarballs being relocated.
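
For illustration, accessing Tachyon as a normal file system only requires its Hadoop-compatible URI. A minimal Scala sketch (not part of this change; it assumes a Tachyon master at the old default `tachyon://localhost:19998` from `spark.externalBlockStore.url` and the Tachyon client jar on the application classpath; the paths are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object TachyonAsFileSystem {
  def main(args: Array[String]): Unit = {
    // Tachyon is addressed like any other Hadoop-compatible file system.
    // Assumptions: a Tachyon master at localhost:19998 and the Tachyon
    // client jar on the classpath; the paths below are placeholders.
    val sc = new SparkContext(new SparkConf().setAppName("tachyon-as-fs"))

    val lines = sc.textFile("tachyon://localhost:19998/input/events.txt")
    val upper = lines.map(_.toUpperCase)
    upper.saveAsTextFile("tachyon://localhost:19998/output/events-upper")

    sc.stop()
  }
}
```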

## How was this patch tested?
N/A

Author: Reynold Xin <rxin@databricks.com>

Closes #11400 from rxin/release-script.
Committed by: Reynold Xin, 2016-02-26 22:35:12 -08:00
Commit: 59e3e10be2 (parent: f77dc4e1e2)
9 changed files with 6 additions and 160 deletions


@@ -929,30 +929,6 @@ Apart from these, the following properties are also available, and may be useful
mapping has high overhead for blocks close to or below the page size of the operating system.
</td>
</tr>
- <tr>
-   <td><code>spark.externalBlockStore.blockManager</code></td>
-   <td>org.apache.spark.storage.TachyonBlockManager</td>
-   <td>
-     Implementation of external block manager (file system) that store RDDs. The file system's URL is set by
-     <code>spark.externalBlockStore.url</code>.
-   </td>
- </tr>
- <tr>
-   <td><code>spark.externalBlockStore.baseDir</code></td>
-   <td>System.getProperty("java.io.tmpdir")</td>
-   <td>
-     Directories of the external block store that store RDDs. The file system's URL is set by
-     <code>spark.externalBlockStore.url</code> It can also be a comma-separated list of multiple
-     directories on Tachyon file system.
-   </td>
- </tr>
- <tr>
-   <td><code>spark.externalBlockStore.url</code></td>
-   <td>tachyon://localhost:19998 for Tachyon</td>
-   <td>
-     The URL of the underlying external blocker file system in the external block store.
-   </td>
- </tr>
</table>

#### Networking


@@ -54,8 +54,7 @@ an application to gain back cores on one node when it has work to do. To use this
Note that none of the modes currently provide memory sharing across applications. If you would like to share
data this way, we recommend running a single server application that can serve multiple requests by querying
- the same RDDs. In future releases, in-memory storage systems such as [Tachyon](http://tachyon-project.org) will
- provide another approach to share RDDs.
+ the same RDDs.

## Dynamic Resource Allocation


@@ -1177,7 +1177,7 @@ that originally created it.
In addition, each persisted RDD can be stored using a different *storage level*, allowing you, for example,
to persist the dataset on disk, persist it in memory but as serialized Java objects (to save space),
- replicate it across nodes, or store it off-heap in [Tachyon](http://tachyon-project.org/).
+ replicate it across nodes.
These levels are set by passing a
`StorageLevel` object ([Scala](api/scala/index.html#org.apache.spark.storage.StorageLevel),
[Java](api/java/index.html?org/apache/spark/storage/StorageLevel.html),

@@ -1218,24 +1218,11 @@ storage levels is:
<td> MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc. </td>
<td> Same as the levels above, but replicate each partition on two cluster nodes. </td>
</tr>
- <tr>
- <td> OFF_HEAP (experimental) </td>
- <td> Store RDD in serialized format in <a href="http://tachyon-project.org">Tachyon</a>.
- Compared to MEMORY_ONLY_SER, OFF_HEAP reduces garbage collection overhead and allows executors
- to be smaller and to share a pool of memory, making it attractive in environments with
- large heaps or multiple concurrent applications. Furthermore, as the RDDs reside in Tachyon,
- the crash of an executor does not lead to losing the in-memory cache. In this mode, the memory
- in Tachyon is discardable. Thus, Tachyon does not attempt to reconstruct a block that it evicts
- from memory. If you plan to use Tachyon as the off heap store, Spark is compatible with Tachyon
- out-of-the-box. Please refer to this <a href="http://tachyon-project.org/master/Running-Spark-on-Tachyon.html">page</a>
- for the suggested version pairings.
- </td>
- </tr>
</table>

**Note:** *In Python, stored objects will always be serialized with the [Pickle](https://docs.python.org/2/library/pickle.html) library,
so it does not matter whether you choose a serialized level. The available storage levels in Python include `MEMORY_ONLY`, `MEMORY_ONLY_2`,
- `MEMORY_AND_DISK`, `MEMORY_AND_DISK_2`, `DISK_ONLY`, `DISK_ONLY_2` and `OFF_HEAP`.*
+ `MEMORY_AND_DISK`, `MEMORY_AND_DISK_2`, `DISK_ONLY`, and `DISK_ONLY_2`.*

Spark also automatically persists some intermediate data in shuffle operations (e.g. `reduceByKey`), even without users calling `persist`. This is done to avoid recomputing the entire input if a node fails during the shuffle. We still recommend users call `persist` on the resulting RDD if they plan to reuse it.

@@ -1259,11 +1246,6 @@ requests from a web application). *All* the storage levels provide full fault tolerance by
recomputing lost data, but the replicated ones let you continue running tasks on the RDD without
waiting to recompute a lost partition.

- * In environments with high amounts of memory or multiple applications, the experimental `OFF_HEAP`
-   mode has several advantages:
-   * It allows multiple executors to share the same pool of memory in Tachyon.
-   * It significantly reduces garbage collection costs.
-   * Cached data is not lost if individual executors crash.

### Removing Data
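
For reference, the storage levels discussed in the guide above are applied by passing a `StorageLevel` to `persist()`. A minimal Scala sketch (not part of this change; it assumes an existing `SparkContext` named `sc`, as in the guide's examples, and `data.txt` is a placeholder path):

```scala
import org.apache.spark.storage.StorageLevel

// Assumption: `sc` is an existing SparkContext; data.txt is a placeholder path.
val counts = sc.textFile("data.txt")
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

// MEMORY_AND_DISK_2: deserialized objects in memory, spilled to disk when needed,
// and each partition replicated on two cluster nodes (one of the levels listed above).
counts.persist(StorageLevel.MEMORY_AND_DISK_2)

println(counts.count())  // the first action materializes and caches the RDD
```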


@@ -32,11 +32,6 @@ set -x
SPARK_HOME="$(cd "`dirname "$0"`"; pwd)"
DISTDIR="$SPARK_HOME/dist"

- SPARK_TACHYON=false
- TACHYON_VERSION="0.8.2"
- TACHYON_TGZ="tachyon-${TACHYON_VERSION}-bin.tar.gz"
- TACHYON_URL="http://tachyon-project.org/downloads/files/${TACHYON_VERSION}/${TACHYON_TGZ}"

MAKE_TGZ=false
NAME=none
MVN="$SPARK_HOME/build/mvn"

@@ -45,7 +40,7 @@ function exit_with_usage {
echo "make-distribution.sh - tool for making binary distributions of Spark"
echo ""
echo "usage:"
- cl_options="[--name] [--tgz] [--mvn <mvn-command>] [--with-tachyon]"
+ cl_options="[--name] [--tgz] [--mvn <mvn-command>]"
echo "./make-distribution.sh $cl_options <maven build options>"
echo "See Spark's \"Building Spark\" doc for correct Maven options."
echo ""

@@ -69,9 +64,6 @@ while (( "$#" )); do
echo "Error: '--with-hive' is no longer supported, use Maven options -Phive and -Phive-thriftserver"
exit_with_usage
;;
- --with-tachyon)
-   SPARK_TACHYON=true
-   ;;
--tgz)
MAKE_TGZ=true
;;

@@ -150,12 +142,6 @@ else
echo "Making distribution for Spark $VERSION in $DISTDIR..."
fi

- if [ "$SPARK_TACHYON" == "true" ]; then
-   echo "Tachyon Enabled"
- else
-   echo "Tachyon Disabled"
- fi

# Build uber fat JAR
cd "$SPARK_HOME"

@@ -219,40 +205,6 @@ if [ -d "$SPARK_HOME"/R/lib/SparkR ]; then
cp "$SPARK_HOME/R/lib/sparkr.zip" "$DISTDIR"/R/lib
fi

- # Download and copy in tachyon, if requested
- if [ "$SPARK_TACHYON" == "true" ]; then
-   TMPD=`mktemp -d 2>/dev/null || mktemp -d -t 'disttmp'`
-   pushd "$TMPD" > /dev/null
-   echo "Fetching tachyon tgz"
-   TACHYON_DL="${TACHYON_TGZ}.part"
-   if [ $(command -v curl) ]; then
-     curl --silent -k -L "${TACHYON_URL}" > "${TACHYON_DL}" && mv "${TACHYON_DL}" "${TACHYON_TGZ}"
-   elif [ $(command -v wget) ]; then
-     wget --quiet "${TACHYON_URL}" -O "${TACHYON_DL}" && mv "${TACHYON_DL}" "${TACHYON_TGZ}"
-   else
-     printf "You do not have curl or wget installed. please install Tachyon manually.\n"
-     exit -1
-   fi
-   tar xzf "${TACHYON_TGZ}"
-   cp "tachyon-${TACHYON_VERSION}/assembly/target/tachyon-assemblies-${TACHYON_VERSION}-jar-with-dependencies.jar" "$DISTDIR/lib"
-   mkdir -p "$DISTDIR/tachyon/src/main/java/tachyon/web"
-   cp -r "tachyon-${TACHYON_VERSION}"/{bin,conf,libexec} "$DISTDIR/tachyon"
-   cp -r "tachyon-${TACHYON_VERSION}"/servers/src/main/java/tachyon/web "$DISTDIR/tachyon/src/main/java/tachyon/web"
-   if [[ `uname -a` == Darwin* ]]; then
-     # need to run sed differently on osx
-     nl=$'\n'; sed -i "" -e "s|export TACHYON_JAR=\$TACHYON_HOME/target/\(.*\)|# This is set for spark's make-distribution\\$nl export TACHYON_JAR=\$TACHYON_HOME/../lib/\1|" "$DISTDIR/tachyon/libexec/tachyon-config.sh"
-   else
-     sed -i "s|export TACHYON_JAR=\$TACHYON_HOME/target/\(.*\)|# This is set for spark's make-distribution\n export TACHYON_JAR=\$TACHYON_HOME/../lib/\1|" "$DISTDIR/tachyon/libexec/tachyon-config.sh"
-   fi
-   popd > /dev/null
-   rm -rf "$TMPD"
- fi

if [ "$MAKE_TGZ" == "true" ]; then
TARDIR_NAME=spark-$VERSION-bin-$NAME
TARDIR="$SPARK_HOME/$TARDIR_NAME"


@@ -25,22 +25,11 @@ if [ -z "${SPARK_HOME}" ]; then
export SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
fi

- TACHYON_STR=""
- while (( "$#" )); do
-   case $1 in
-     --with-tachyon)
-       TACHYON_STR="--with-tachyon"
-       ;;
-   esac
-   shift
- done

# Load the Spark configuration
. "${SPARK_HOME}/sbin/spark-config.sh"

# Start Master
- "${SPARK_HOME}/sbin"/start-master.sh $TACHYON_STR
+ "${SPARK_HOME}/sbin"/start-master.sh

# Start Workers
- "${SPARK_HOME}/sbin"/start-slaves.sh $TACHYON_STR
+ "${SPARK_HOME}/sbin"/start-slaves.sh


@@ -39,21 +39,6 @@ fi
ORIGINAL_ARGS="$@"

- START_TACHYON=false
- while (( "$#" )); do
-   case $1 in
-     --with-tachyon)
-       if [ ! -e "${SPARK_HOME}"/tachyon/bin/tachyon ]; then
-         echo "Error: --with-tachyon specified, but tachyon not found."
-         exit -1
-       fi
-       START_TACHYON=true
-       ;;
-   esac
-   shift
- done

. "${SPARK_HOME}/sbin/spark-config.sh"
. "${SPARK_HOME}/bin/load-spark-env.sh"

@@ -73,9 +58,3 @@ fi
"${SPARK_HOME}/sbin"/spark-daemon.sh start $CLASS 1 \
  --ip $SPARK_MASTER_IP --port $SPARK_MASTER_PORT --webui-port $SPARK_MASTER_WEBUI_PORT \
  $ORIGINAL_ARGS

- if [ "$START_TACHYON" == "true" ]; then
-   "${SPARK_HOME}"/tachyon/bin/tachyon bootstrap-conf $SPARK_MASTER_IP
-   "${SPARK_HOME}"/tachyon/bin/tachyon format -s
-   "${SPARK_HOME}"/tachyon/bin/tachyon-start.sh master
- fi


@@ -23,21 +23,6 @@ if [ -z "${SPARK_HOME}" ]; then
export SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
fi

- START_TACHYON=false
- while (( "$#" )); do
-   case $1 in
-     --with-tachyon)
-       if [ ! -e "${SPARK_HOME}/sbin"/../tachyon/bin/tachyon ]; then
-         echo "Error: --with-tachyon specified, but tachyon not found."
-         exit -1
-       fi
-       START_TACHYON=true
-       ;;
-   esac
-   shift
- done

. "${SPARK_HOME}/sbin/spark-config.sh"
. "${SPARK_HOME}/bin/load-spark-env.sh"

@@ -50,12 +35,5 @@ if [ "$SPARK_MASTER_IP" = "" ]; then
SPARK_MASTER_IP="`hostname`"
fi

- if [ "$START_TACHYON" == "true" ]; then
-   "${SPARK_HOME}/sbin/slaves.sh" cd "${SPARK_HOME}" \; "${SPARK_HOME}/sbin"/../tachyon/bin/tachyon bootstrap-conf "$SPARK_MASTER_IP"
-   # set -t so we can call sudo
-   SPARK_SSH_OPTS="-o StrictHostKeyChecking=no -t" "${SPARK_HOME}/sbin/slaves.sh" cd "${SPARK_HOME}" \; "${SPARK_HOME}/tachyon/bin/tachyon-start.sh" worker SudoMount \; sleep 1
- fi

# Launch the slaves
"${SPARK_HOME}/sbin/slaves.sh" cd "${SPARK_HOME}" \; "${SPARK_HOME}/sbin/start-slave.sh" "spark://$SPARK_MASTER_IP:$SPARK_MASTER_PORT"


@@ -26,7 +26,3 @@ fi
. "${SPARK_HOME}/sbin/spark-config.sh"

"${SPARK_HOME}/sbin"/spark-daemon.sh stop org.apache.spark.deploy.master.Master 1

- if [ -e "${SPARK_HOME}/sbin"/../tachyon/bin/tachyon ]; then
-   "${SPARK_HOME}/sbin"/../tachyon/bin/tachyon killAll tachyon.master.Master
- fi


@@ -25,9 +25,4 @@ fi
. "${SPARK_HOME}/bin/load-spark-env.sh"

- # do before the below calls as they exec
- if [ -e "${SPARK_HOME}/sbin"/../tachyon/bin/tachyon ]; then
-   "${SPARK_HOME}/sbin/slaves.sh" cd "${SPARK_HOME}" \; "${SPARK_HOME}/sbin"/../tachyon/bin/tachyon killAll tachyon.worker.Worker
- fi

"${SPARK_HOME}/sbin/slaves.sh" cd "${SPARK_HOME}" \; "${SPARK_HOME}/sbin"/stop-slave.sh