spark-instrumented-optimizer/python/pyspark
hyukjinkwon aa4cf2b19e [SPARK-22651][PYTHON][ML] Prevent initiating multiple Hive clients for ImageSchema.readImages
## What changes were proposed in this pull request?

Calling `ImageSchema.readImages` multiple times as below in PySpark shell:

```python
from pyspark.ml.image import ImageSchema
data_path = 'data/mllib/images/kittens'
_ = ImageSchema.readImages(data_path, recursive=True, dropImageFailures=True).collect()
_ = ImageSchema.readImages(data_path, recursive=True, dropImageFailures=True).collect()
```

throws an error as below:

```
...
org.datanucleus.exceptions.NucleusDataStoreException: Unable to open a test connection to the given database. JDBC url = jdbc:derby:;databaseName=metastore_db;create=true, username = APP. Terminating connection pool (set lazyInit to true if you expect to start your database after your app). Original Exception: ------
java.sql.SQLException: Failed to start database 'metastore_db' with class loader org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1742f639f, see the next exception for details.
...
	at org.apache.derby.jdbc.AutoloadedDriver.connect(Unknown Source)
...
	at org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:5762)
...
	at org.apache.spark.sql.hive.client.HiveClientImpl.newState(HiveClientImpl.scala:180)
...
	at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply$mcZ$sp(HiveExternalCatalog.scala:195)
	at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:195)
	at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:195)
	at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
	at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:194)
	at org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:100)
	at org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:88)
	at org.apache.spark.sql.hive.HiveSessionStateBuilder.externalCatalog(HiveSessionStateBuilder.scala:39)
	at org.apache.spark.sql.hive.HiveSessionStateBuilder.catalog$lzycompute(HiveSessionStateBuilder.scala:54)
	at org.apache.spark.sql.hive.HiveSessionStateBuilder.catalog(HiveSessionStateBuilder.scala:52)
	at org.apache.spark.sql.hive.HiveSessionStateBuilder$$anon$1.<init>(HiveSessionStateBuilder.scala:69)
	at org.apache.spark.sql.hive.HiveSessionStateBuilder.analyzer(HiveSessionStateBuilder.scala:69)
	at org.apache.spark.sql.internal.BaseSessionStateBuilder$$anonfun$build$2.apply(BaseSessionStateBuilder.scala:293)
	at org.apache.spark.sql.internal.BaseSessionStateBuilder$$anonfun$build$2.apply(BaseSessionStateBuilder.scala:293)
	at org.apache.spark.sql.internal.SessionState.analyzer$lzycompute(SessionState.scala:79)
	at org.apache.spark.sql.internal.SessionState.analyzer(SessionState.scala:79)
	at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:70)
	at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:68)
	at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:51)
	at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:70)
	at org.apache.spark.sql.SparkSession.internalCreateDataFrame(SparkSession.scala:574)
	at org.apache.spark.sql.SparkSession.createDataFrame(SparkSession.scala:593)
	at org.apache.spark.sql.SparkSession.createDataFrame(SparkSession.scala:348)
	at org.apache.spark.sql.SparkSession.createDataFrame(SparkSession.scala:348)
	at org.apache.spark.ml.image.ImageSchema$$anonfun$readImages$2$$anonfun$apply$1.apply(ImageSchema.scala:253)
...
Caused by: ERROR XJ040: Failed to start database 'metastore_db' with class loader org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1742f639f, see the next exception for details.
	at org.apache.derby.iapi.error.StandardException.newException(Unknown Source)
	at org.apache.derby.impl.jdbc.SQLExceptionFactory.wrapArgsForTransportAcrossDRDA(Unknown Source)
	... 121 more
Caused by: ERROR XSDB6: Another instance of Derby may have already booted the database /.../spark/metastore_db.
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../spark/python/pyspark/ml/image.py", line 190, in readImages
    dropImageFailures, float(sampleRatio), seed)
  File "/.../spark/python/lib/py4j-0.10.6-src.zip/py4j/java_gateway.py", line 1160, in __call__
  File "/.../spark/python/pyspark/sql/utils.py", line 69, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u'java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;'
```

Seems we better stick to `SparkSession.builder.getOrCreate()` like:

51620e288b/python/pyspark/sql/streaming.py (L329)

dc5d34d8dc/python/pyspark/sql/column.py (L541)

33d43bf1b6/python/pyspark/sql/readwriter.py (L105)

## How was this patch tested?

This was tested as below in PySpark shell:

```python
from pyspark.ml.image import ImageSchema
data_path = 'data/mllib/images/kittens'
_ = ImageSchema.readImages(data_path, recursive=True, dropImageFailures=True).collect()
_ = ImageSchema.readImages(data_path, recursive=True, dropImageFailures=True).collect()
```

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #19845 from HyukjinKwon/SPARK-22651.
2017-12-02 11:55:43 +09:00
..
ml [SPARK-22651][PYTHON][ML] Prevent initiating multiple Hive clients for ImageSchema.readImages 2017-12-02 11:55:43 +09:00
mllib [SPARK-22399][ML] update the location of reference paper 2017-10-31 08:20:23 +00:00
sql [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp values for Pandas to respect session timezone 2017-11-28 16:45:22 +08:00
streaming [SPARK-22313][PYTHON] Mark/print deprecation warnings as DeprecationWarning for deprecated APIs 2017-10-24 12:44:47 +09:00
__init__.py [MINOR] Fix some typo of the document 2017-06-19 20:35:58 +01:00
accumulators.py [SPARK-8652] [PYSPARK] Check return value for all uses of doctest.testmod() 2015-06-26 08:12:22 -07:00
broadcast.py [SPARK-12717][PYTHON] Adding thread-safe broadcast pickle registry 2017-08-02 07:12:23 +09:00
cloudpickle.py [SPARK-21070][PYSPARK] Attempt to update cloudpickle again 2017-08-22 11:17:53 +09:00
conf.py [SPARK-18447][DOCS] Fix the markdown for Note:/NOTE:/Note that across Python API documentation 2016-11-22 11:40:18 +00:00
context.py [SPARK-20791][PYSPARK] Use Arrow to create Spark DataFrame from Pandas 2017-11-13 13:16:01 +09:00
daemon.py [SPARK-4897] [PySpark] Python 3 support 2015-04-16 16:20:57 -07:00
files.py [SPARK-3309] [PySpark] Put all public API in __all__ 2014-09-03 11:49:45 -07:00
find_spark_home.py [SPARK-1267][SPARK-18129] Allow PySpark to be pip installed 2016-11-16 14:22:15 -08:00
heapq3.py [SPARK-8652] [PYSPARK] Check return value for all uses of doctest.testmod() 2015-06-26 08:12:22 -07:00
java_gateway.py [SPARK-20791][PYSPARK] Use Arrow to create Spark DataFrame from Pandas 2017-11-13 13:16:01 +09:00
join.py [SPARK-14202] [PYTHON] Use generator expression instead of list comp in python_full_outer_jo… 2016-03-28 14:51:36 -07:00
profiler.py [SPARK-8652] [PYSPARK] Check return value for all uses of doctest.testmod() 2015-06-26 08:12:22 -07:00
rdd.py [SPARK-22409] Introduce function type argument in pandas_udf 2017-11-17 16:43:08 +01:00
rddsampler.py [SPARK-4897] [PySpark] Python 3 support 2015-04-16 16:20:57 -07:00
resultiterable.py [SPARK-3074] [PySpark] support groupByKey() with single huge key 2015-04-09 17:07:23 -07:00
serializers.py [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp values for Pandas to respect session timezone 2017-11-28 16:45:22 +08:00
shell.py [SPARK-19570][PYSPARK] Allow to disable hive in pyspark shell 2017-04-12 10:54:50 -07:00
shuffle.py [SPARK-10710] Remove ability to disable spilling in core and SQL 2015-09-19 21:40:21 -07:00
statcounter.py [SPARK-6919] [PYSPARK] Add asDict method to StatCounter 2015-09-29 13:38:15 -07:00
status.py [SPARK-4172] [PySpark] Progress API in Python 2015-02-17 13:36:43 -08:00
storagelevel.py [SPARK-13992][CORE][PYSPARK][FOLLOWUP] Update OFF_HEAP semantics for Java api and Python api 2016-04-12 23:06:55 -07:00
taskcontext.py [SPARK-18576][PYTHON] Add basic TaskContext information to PySpark 2016-12-20 15:51:21 -08:00
tests.py [SPARK-22043][PYTHON] Improves error message for show_profiles and dump_profiles 2017-09-18 13:20:11 +09:00
traceback_utils.py [SPARK-1087] Move python traceback utilities into new traceback_utils.py file. 2014-09-15 19:28:17 -07:00
util.py [SPARK-19505][PYTHON] AttributeError on Exception.message in Python3 2017-04-11 12:18:31 -07:00
version.py [MINOR] Bump SparkR and PySpark version to 2.3.0. 2017-06-19 11:13:03 +01:00
worker.py [SPARK-22395][SQL][PYTHON] Fix the behavior of timestamp values for Pandas to respect session timezone 2017-11-28 16:45:22 +08:00