05454fd8ae
This PR switches Spark SQL's Hive support to use the isolated Hive client interface introduced by #5851, instead of directly interacting with the client. By using this isolated client, users can now dynamically configure the version of Hive they are connecting to by setting `spark.sql.hive.metastore.version`, without needing to recompile. This also greatly reduces the surface area of our interaction with the Hive libraries, hopefully making it easier to support other versions in the future.
Jars for the desired hive version can be configured using `spark.sql.hive.metastore.jars`, which accepts the following options:
- a colon-separated list of jar files or directories for Hive and Hadoop.
- `builtin` - attempt to discover the jars that were used to load Spark SQL and use those. This option is only valid when using the execution version of Hive.
- `maven` - download the correct version of Hive on demand from Maven.
By default, `builtin` is used for Hive 13.
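As an illustration of the two settings above, the following is a minimal sketch of how they might be passed on the command line through `--conf`. The variable names `HIVE_METASTORE_VERSION` and `HIVE_METASTORE_JARS` are hypothetical, and the version string `0.12.0` is an assumed example value rather than one taken from this PR. The sketch only prints the resulting command, it does not launch Spark.

```bash
#!/usr/bin/env bash
# Sketch only: build --conf flags for a chosen metastore version.
# HIVE_METASTORE_VERSION and HIVE_METASTORE_JARS are hypothetical variable
# names; "0.12.0" is an assumed example version string.
HIVE_METASTORE_VERSION="${HIVE_METASTORE_VERSION:-0.12.0}"
HIVE_METASTORE_JARS="${HIVE_METASTORE_JARS:-maven}"

# Keep the flags in an array so each one stays a separate argument.
conf_args=(
  --conf "spark.sql.hive.metastore.version=${HIVE_METASTORE_VERSION}"
  --conf "spark.sql.hive.metastore.jars=${HIVE_METASTORE_JARS}"
)

# Print the command instead of running it, so the sketch has no side effects.
echo "bin/spark-sql ${conf_args[*]}"
```

Replacing `maven` with `builtin` or with a colon-separated list of jar paths selects the other options listed above.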
This PR also removes the test step for building against Hive 12, as this will no longer be required to talk to Hive 12 metastores. However, the full removal of the Shim is deferred until a later PR.
Remaining TODOs:
- Remove the Hive Shims and inline code for Hive 13.
- Several HiveCompatibility tests are not yet passing.
  - `nullformatCTAS` - As detailed below, we now handle CTAS parsing ourselves instead of hacking into the Hive semantic analyzer. However, we currently only handle the common cases and not things like CTAS where the null format is specified.
  - `combine1` now leaks state about compression somehow, breaking all subsequent tests. As such, we currently add it to the blacklist.
  - `part_inherit_tbl_props` and `part_inherit_tbl_props_with_star` do not work anymore. We are correctly propagating the information
  - `load_dyn_part14.*` - These tests pass when run on their own, but fail when run with all other tests. It seems our `RESET` mechanism may not be as robust as it used to be?
Other required changes:
- `CreateTableAsSelect` no longer carries parts of the HiveQL AST with it through the query execution pipeline. Instead, we parse CTAS during the HiveQL conversion and construct a `HiveTable`. The full parsing here is not yet complete as detailed above in the remaining TODOs. Since the operator is Hive specific, it is moved to the hive package.
- `Command` is simplified to a trait that acts only as a marker for a LogicalPlan that should be eagerly evaluated.
Author: Michael Armbrust <michael@databricks.com>
Closes #5876 from marmbrus/useIsolatedClient and squashes the following commits:
258d000 [Michael Armbrust] really really correct path handling
e56fd4a [Michael Armbrust] getAbsolutePath
5a259f5 [Michael Armbrust] fix typos
81bb366 [Michael Armbrust] comments from vanzin
5f3945e [Michael Armbrust] Merge remote-tracking branch 'origin/master' into useIsolatedClient
4b5cd41 [Michael Armbrust] yin's comments
f5de7de [Michael Armbrust] cleanup
11e9c72 [Michael Armbrust] better coverage in versions suite
7e8f010 [Michael Armbrust] better error messages and jar handling
e7b3941 [Michael Armbrust] more permisive checking for function registration
da91ba7 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into useIsolatedClient
5fe5894 [Michael Armbrust] fix serialization suite
81711c4 [Michael Armbrust] Initial support for running without maven
1d8ae44 [Michael Armbrust] fix final tests?
1c50813 [Michael Armbrust] more comments
a3bee70 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into useIsolatedClient
a6f5df1 [Michael Armbrust] style
ab07f7e [Michael Armbrust] WIP
4d8bf02 [Michael Armbrust] Remove hive 12 compilation
8843a25 [Michael Armbrust] [SPARK-6908] [SQL] Use isolated Hive client
(cherry picked from commit cd1d4110cf)
Signed-off-by: Yin Huai <yhuai@databricks.com>
233 lines
7.5 KiB
Bash
Executable file
#!/usr/bin/env bash

#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# Go to the Spark project root directory
FWDIR="$(cd "`dirname $0`"/..; pwd)"
cd "$FWDIR"

# Clean up work directory and caches
rm -rf ./work
rm -rf ~/.ivy2/local/org.apache.spark
rm -rf ~/.ivy2/cache/org.apache.spark

source "$FWDIR/dev/run-tests-codes.sh"

CURRENT_BLOCK=$BLOCK_GENERAL

function handle_error () {
  echo "[error] Got a return code of $? on line $1 of the run-tests script."
  exit $CURRENT_BLOCK
}


# Build against the right version of Hadoop.
{
  if [ -n "$AMPLAB_JENKINS_BUILD_PROFILE" ]; then
    if [ "$AMPLAB_JENKINS_BUILD_PROFILE" = "hadoop1.0" ]; then
      export SBT_MAVEN_PROFILES_ARGS="-Dhadoop.version=1.0.4"
    elif [ "$AMPLAB_JENKINS_BUILD_PROFILE" = "hadoop2.0" ]; then
      export SBT_MAVEN_PROFILES_ARGS="-Dhadoop.version=2.0.0-mr1-cdh4.1.1"
    elif [ "$AMPLAB_JENKINS_BUILD_PROFILE" = "hadoop2.2" ]; then
      export SBT_MAVEN_PROFILES_ARGS="-Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0"
    elif [ "$AMPLAB_JENKINS_BUILD_PROFILE" = "hadoop2.3" ]; then
      export SBT_MAVEN_PROFILES_ARGS="-Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0"
    fi
  fi

  if [ -z "$SBT_MAVEN_PROFILES_ARGS" ]; then
    export SBT_MAVEN_PROFILES_ARGS="-Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0"
  fi
}

export SBT_MAVEN_PROFILES_ARGS="$SBT_MAVEN_PROFILES_ARGS -Pkinesis-asl"

# Determine Java path and version.
{
  if test -x "$JAVA_HOME/bin/java"; then
    declare java_cmd="$JAVA_HOME/bin/java"
  else
    declare java_cmd=java
  fi

  # We can't use sed -r -e due to OS X / BSD compatibility; hence, all the parentheses.
  JAVA_VERSION=$(
    $java_cmd -version 2>&1 \
    | grep -e "^java version" --max-count=1 \
    | sed "s/java version \"\(.*\)\.\(.*\)\.\(.*\)\"/\1\2/"
  )

  if [ "$JAVA_VERSION" -lt 18 ]; then
    echo "[warn] Java 8 tests will not run because JDK version is < 1.8."
  fi
}

# Only run Hive tests if there are SQL changes.
# Partial solution for SPARK-1455.
if [ -n "$AMPLAB_JENKINS" ]; then
  git fetch origin master:master

  sql_diffs=$(
    git diff --name-only master \
    | grep -e "^sql/" -e "^bin/spark-sql" -e "^sbin/start-thriftserver.sh"
  )

  non_sql_diffs=$(
    git diff --name-only master \
    | grep -v -e "^sql/" -e "^bin/spark-sql" -e "^sbin/start-thriftserver.sh"
  )

  if [ -n "$sql_diffs" ]; then
    echo "[info] Detected changes in SQL. Will run Hive test suite."
    _RUN_SQL_TESTS=true

    if [ -z "$non_sql_diffs" ]; then
      echo "[info] Detected no changes except in SQL. Will only run SQL tests."
      _SQL_TESTS_ONLY=true
    fi
  fi
fi

set -o pipefail
trap 'handle_error $LINENO' ERR

echo ""
echo "========================================================================="
echo "Running Apache RAT checks"
echo "========================================================================="

CURRENT_BLOCK=$BLOCK_RAT

./dev/check-license

echo ""
echo "========================================================================="
echo "Running Scala style checks"
echo "========================================================================="

CURRENT_BLOCK=$BLOCK_SCALA_STYLE

./dev/lint-scala

echo ""
echo "========================================================================="
echo "Running Python style checks"
echo "========================================================================="

CURRENT_BLOCK=$BLOCK_PYTHON_STYLE

./dev/lint-python

echo ""
echo "========================================================================="
echo "Building Spark"
echo "========================================================================="

CURRENT_BLOCK=$BLOCK_BUILD

{
  HIVE_BUILD_ARGS="$SBT_MAVEN_PROFILES_ARGS -Phive -Phive-thriftserver"
  echo "[info] Compile with Hive 0.13.1"
  [ -d "lib_managed" ] && rm -rf lib_managed
  echo "[info] Building Spark with these arguments: $HIVE_BUILD_ARGS"

  if [ "${AMPLAB_JENKINS_BUILD_TOOL}" == "maven" ]; then
    build/mvn $HIVE_BUILD_ARGS clean package -DskipTests
  else
    echo -e "q\n" \
      | build/sbt $HIVE_BUILD_ARGS package assembly/assembly streaming-kafka-assembly/assembly \
      | grep -v -e "info.*Resolving" -e "warn.*Merging" -e "info.*Including"
  fi
}

echo ""
echo "========================================================================="
echo "Detecting binary incompatibilities with MiMa"
echo "========================================================================="

CURRENT_BLOCK=$BLOCK_MIMA

./dev/mima

echo ""
echo "========================================================================="
echo "Running Spark unit tests"
echo "========================================================================="

CURRENT_BLOCK=$BLOCK_SPARK_UNIT_TESTS

{
  # If the Spark SQL tests are enabled, run the tests with the Hive profiles enabled.
  # This must be a single argument, as it is.
  if [ -n "$_RUN_SQL_TESTS" ]; then
    SBT_MAVEN_PROFILES_ARGS="$SBT_MAVEN_PROFILES_ARGS -Phive -Phive-thriftserver"
  fi

  if [ -n "$_SQL_TESTS_ONLY" ]; then
    # This must be an array of individual arguments. Otherwise, having one long string
    # will be interpreted as a single test, which doesn't work.
    SBT_MAVEN_TEST_ARGS=("catalyst/test" "sql/test" "hive/test" "hive-thriftserver/test" "mllib/test")
  else
    SBT_MAVEN_TEST_ARGS=("test")
  fi

  echo "[info] Running Spark tests with these arguments: $SBT_MAVEN_PROFILES_ARGS ${SBT_MAVEN_TEST_ARGS[@]}"

  if [ "${AMPLAB_JENKINS_BUILD_TOOL}" == "maven" ]; then
    build/mvn test $SBT_MAVEN_PROFILES_ARGS --fail-at-end
  else
    # NOTE: echo "q" is needed because sbt on encountering a build file with failure
    # (either resolution or compilation) prompts the user for input either q, r, etc
    # to quit or retry. This echo is there to make it not block.
    # NOTE: Do not quote $SBT_MAVEN_PROFILES_ARGS or else it will be interpreted as a
    # single argument!
    # "${SBT_MAVEN_TEST_ARGS[@]}" is cool because it's an array.
    # QUESTION: Why doesn't 'yes "q"' work?
    # QUESTION: Why doesn't 'grep -v -e "^\[info\] Resolving"' work?
    echo -e "q\n" \
      | build/sbt $SBT_MAVEN_PROFILES_ARGS "${SBT_MAVEN_TEST_ARGS[@]}" \
      | grep -v -e "info.*Resolving" -e "warn.*Merging" -e "info.*Including"
  fi
}

echo ""
echo "========================================================================="
echo "Running PySpark tests"
echo "========================================================================="

CURRENT_BLOCK=$BLOCK_PYSPARK_UNIT_TESTS

# add path for python 3 in jenkins
export PATH="${PATH}:/home/anaonda/envs/py3k/bin"
./python/run-tests

echo ""
echo "========================================================================="
echo "Running SparkR tests"
echo "========================================================================="

CURRENT_BLOCK=$BLOCK_SPARKR_UNIT_TESTS

if [ $(command -v R) ]; then
  ./R/install-dev.sh
  ./R/run-tests.sh
else
  echo "Ignoring SparkR tests as R was not found in PATH"
fi