ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Bryan Cutler	16990f9299	[SPARK-26566][PYTHON][SQL] Upgrade Apache Arrow to version 0.12.0 ## What changes were proposed in this pull request? Upgrade Apache Arrow to version 0.12.0. This includes the Java artifacts and fixes to enable usage with pyarrow 0.12.0 Version 0.12.0 includes the following selected fixes/improvements relevant to Spark users: * Safe cast fails from numpy float64 array with nans to integer, ARROW-4258 * Java, Reduce heap usage for variable width vectors, ARROW-4147 * Binary identity cast not implemented, ARROW-4101 * pyarrow open_stream deprecated, use ipc.open_stream, ARROW-4098 * conversion to date object no longer needed, ARROW-3910 * Error reading IPC file with no record batches, ARROW-3894 * Signed to unsigned integer cast yields incorrect results when type sizes are the same, ARROW-3790 * from_pandas gives incorrect results when converting floating point to bool, ARROW-3428 * Import pyarrow fails if scikit-learn is installed from conda (boost-cpp / libboost issue), ARROW-3048 * Java update to official Flatbuffers version 1.9.0, ARROW-3175 complete list [here](https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20(Resolved%2C%20Closed)%20AND%20fixVersion%20%3D%200.12.0) PySpark requires the following fixes to work with PyArrow 0.12.0 * Encrypted pyspark worker fails due to ChunkedStream missing closed property * pyarrow now converts dates as objects by default, which causes error because type is assumed datetime64 * ArrowTests fails due to difference in raised error message * pyarrow.open_stream deprecated * tests fail because groupby adds index column with duplicate name ## How was this patch tested? Ran unit tests with pyarrow versions 0.8.0, 0.10.0, 0.11.1, 0.12.0 Closes #23657 from BryanCutler/arrow-upgrade-012. Authored-by: Bryan Cutler <cutlerb@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2019-01-29 14:18:45 +08:00
Takeshi Yamamuro	92706e6576	[SPARK-26747][SQL] Makes GetMapValue nullability more precise ## What changes were proposed in this pull request? In master, `GetMapValue` is always nullable; `cf133e6110/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeExtractors.scala (L371)` But if the input key is foldable, we could make its nullability more precise. This fix is the same as SPARK-26637 (#23566). ## How was this patch tested? Added tests in `ComplexTypeSuite`. Closes #23669 from maropu/SPARK-26747. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2019-01-28 13:39:50 -08:00
Marcelo Vanzin	2a67dbfbd3	[SPARK-26595][CORE] Allow credential renewal based on kerberos ticket cache. This change adds a new mode for credential renewal that does not require a keytab; it uses the local ticket cache instead, so it works while the user keeps the cache valid. This can be useful for, e.g., people running long spark-shell sessions where their kerberos login is kept up-to-date. The main change to enable this behavior is in HadoopDelegationTokenManager, with a small change in the HDFS token provider. The other changes are to avoid creating duplicate tokens when submitting the application to YARN; they allow the tokens from the scheduler to be sent to the YARN AM, reducing the round trips to HDFS. For that, the scheduler initialization code was changed a little bit so that the tokens are available when the YARN client is initialized. That basically takes care of a long-standing TODO that was in the code to clean up configuration propagation to the driver's RPC endpoint (in CoarseGrainedSchedulerBackend). Tested with an app designed to stress this functionality, with both keytab and cache-based logins. Some basic kerberos tests on k8s also. Closes #23525 from vanzin/SPARK-26595. Authored-by: Marcelo Vanzin <vanzin@cloudera.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-01-28 13:32:34 -08:00
Sean Owen	8baf3ba35b	[SPARK-26660][FOLLOWUP] Add warning logs when broadcasting large task binary ## What changes were proposed in this pull request? The warning introduced in https://github.com/apache/spark/pull/23580 has a bug: https://github.com/apache/spark/pull/23580#issuecomment-458000380 This just fixes the logic. ## How was this patch tested? N/A Closes #23668 from srowen/SPARK-26660.2. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-01-28 13:47:32 -06:00
s71955	dfed439e33	[SPARK-26432][CORE] Obtain HBase delegation token operation compatible with HBase 2.x.x version API ## What changes were proposed in this pull request? While obtaining a token from the HBase service, Spark uses a deprecated HBase API, ```public static Token<AuthenticationTokenIdentifier> obtainToken(Configuration conf)``` This deprecated API has already been removed as part of the HBase 2.x major release. https://issues.apache.org/jira/browse/HBASE-14713_ There is a more stable API, ```public static Token<AuthenticationTokenIdentifier> obtainToken(Connection conn)```, in the TokenUtil class; Spark should use this stable API to get the delegation token. To invoke this API, a connection object must first be retrieved from ConnectionFactory, and the same connection can then be passed to obtainToken(Connection conn) to get the token. E.g., call ```public static Connection createConnection(Configuration conf)```, then call ```public static Token<AuthenticationTokenIdentifier> obtainToken( Connection conn)```. ## How was this patch tested? Manual testing was done. Manual test result: Before fix: ![hbase-dep-obtaintok 1](https://user-images.githubusercontent.com/12999161/50699264-64cac200-106d-11e9-81b4-e50ae8097f27.png) After fix: 1. 
Create 2 tables in hbase shell >Launch hbase shell >Enter commands to create tables and load data create 'table1','cf' put 'table1','row1','cf:cid','20' create 'table2','cf' put 'table2','row1','cf:cid','30' >Show values command get 'table1','row1','cf:cid' will display value as 20 get 'table2','row1','cf:cid' will display value as 30 2. Run SparkHbasetoHbase class in testSpark.jar using spark-submit spark-submit --master yarn-cluster --class com.mrs.example.spark.SparkHbasetoHbase --conf "spark.yarn.security.credentials.hbase.enabled"="true" --conf "spark.security.credentials.hbase.enabled"="true" --keytab /opt/client/user.keytab --principal sen testSpark.jar The SparkHbasetoHbase test class will update the value of table2 with the sum of the values of table1 & table2. table2 = table1+table2 As we can see in the snapshot, the Spark job was able to interact with the HBase service and update the row count. ![obtaintok_success 1](https://user-images.githubusercontent.com/12999161/50699393-bd9a5a80-106d-11e9-96c6-6c250d561efa.png) Closes #23429 from sujith71955/master_hbase_service. Authored-by: s71955 <sujithchacko.2010@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-01-28 10:08:23 -08:00
Xianjin YE	1280bfd756	[SPARK-26713][CORE] Interrupt pipe IO threads in PipedRDD when task is finished ## What changes were proposed in this pull request? Manually release stdin writer and stderr reader thread when task is finished. This commit also marks ShuffleBlockFetchIterator as fully consumed if isZombie is set. ## How was this patch tested? Added new test Closes #23638 from advancedxy/SPARK-26713. Authored-by: Xianjin YE <advancedxy@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-01-28 10:54:18 -06:00
Maxim Gekk	58e42cf506	[SPARK-26719][SQL] Get rid of java.util.Calendar in DateTimeUtils ## What changes were proposed in this pull request? - Replacing `java.util.Calendar` in `DateTimeUtils.truncTimestamp` and in `DateTimeUtils.getOffsetFromLocalMillis` by equivalent code using Java 8 API for timestamp manipulations. The reason is that `java.util.Calendar` is based on the hybrid calendar (Julian+Gregorian), but java.time classes use the Proleptic Gregorian calendar, which is assumed by the SQL standard. - Replacing `Calendar.getInstance()` in `DateTimeUtilsSuite` by similar code in `DateTimeTestUtils` using java.time classes ## How was this patch tested? The changes were tested by existing suites: `DateExpressionsSuite`, `DateFunctionsSuite` and `DateTimeUtilsSuite`. Closes #23641 from MaxGekk/cleanup-date-time-utils. Authored-by: Maxim Gekk <maxim.gekk@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-01-28 10:52:17 -06:00
Wenchen Fan	ed71a825c5	[SPARK-26700][CORE] enable fetch-big-block-to-disk by default ## What changes were proposed in this pull request? This is a followup of #16989 The fetch-big-block-to-disk feature is disabled by default, because it's not compatible with external shuffle service prior to Spark 2.2. The client sends stream request to fetch block chunks, and old shuffle service can't support it. After 2 years, Spark 2.2 has EOL, and now it's safe to turn on this feature by default ## How was this patch tested? existing tests Closes #23625 from cloud-fan/minor. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-01-28 23:41:55 +08:00
Maxim Gekk	bd027f6e0e	[SPARK-26656][SQL] Benchmarks for date and timestamp functions ## What changes were proposed in this pull request? Added the following benchmarks: - Extract components from timestamp like year, month, day and etc. - Current date and time - Date arithmetic like date_add, date_sub - Format dates and timestamps - Convert timestamps from/to UTC Closes #23661 from MaxGekk/datetime-benchmark. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>	2019-01-28 14:21:21 +01:00
Sean Owen	d53e11ffce	[SPARK-26725][TEST] Fix the input values of UnifiedMemoryManager constructor in test suites ## What changes were proposed in this pull request? Adjust mem settings in UnifiedMemoryManager used in test suites to have execution memory > 0 Ref: https://github.com/apache/spark/pull/23457#issuecomment-457409976 ## How was this patch tested? Existing tests Closes #23645 from srowen/SPARK-26725. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-01-28 12:42:14 +08:00
Hyukjin Kwon	3a17c6a06b	[SPARK-26743][PYTHON] Adds a test to check the actual resource limit set via 'spark.executor.pyspark.memory' ## What changes were proposed in this pull request? https://github.com/apache/spark/pull/21977 added a feature to limit Python worker resource limit. This PR is kind of a followup of it. It proposes to add a test that checks the actual resource limit set by 'spark.executor.pyspark.memory'. ## How was this patch tested? Unit tests were added. Closes #23663 from HyukjinKwon/test_rlimit. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2019-01-28 10:02:27 +08:00
maryannxue	ce7e7df99d	[SPARK-26708][SQL] Incorrect result caused by inconsistency between a SQL cache's cached RDD and its physical plan ## What changes were proposed in this pull request? When performing non-cascading cache invalidation, `recache` is called on the other cache entries which are dependent on the cache being invalidated. It leads to the physical plans of those cache entries being re-compiled. For those cache entries, if the cache RDD has already been persisted, chances are there will be inconsistency between the data and the new plan. It can cause a correctness issue if the new plan's `outputPartitioning` or `outputOrdering` is different from that of the actual data, and meanwhile the cache is used by another query that asks for specific `outputPartitioning` or `outputOrdering` which happens to match the new plan but not the actual data. The fix is to keep the cache entry as it is if the data has been loaded, otherwise re-build the cache entry, with a new plan and an empty cache buffer. ## How was this patch tested? Added UT. Closes #23644 from maryannxue/spark-26708. Lead-authored-by: maryannxue <maryannxue@apache.org> Co-authored-by: Xiao Li <gatorsmile@gmail.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2019-01-27 11:39:27 -08:00
Gengliang Wang	36a2e6371b	[SPARK-26716][SQL] FileFormat: the supported types of read/write should be consistent ## What changes were proposed in this pull request? 1. Remove parameter `isReadPath`. The supported types of read/write should be the same. 2. Disallow reading `NullType` for ORC data source. In #21667 and #21389, it was supposed that ORC supports reading `NullType`, but can't write it. This doesn't make sense. I read docs and did some tests. ORC doesn't support `NullType`. ## How was this patch tested? Unit test Closes #23639 from gengliangwang/supportDataType. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2019-01-27 10:11:42 -08:00
Dongjoon Hyun	1ca6b8bc3d	[SPARK-26379][SS][FOLLOWUP] Use dummy TimeZoneId to avoid UnresolvedException in CurrentBatchTimestamp ## What changes were proposed in this pull request? Spark replaces `CurrentTimestamp` with `CurrentBatchTimestamp`. However, `CurrentBatchTimestamp` is `TimeZoneAwareExpression` while `CurrentTimestamp` isn't. Without TimeZoneId, `CurrentBatchTimestamp` becomes unresolved and raises `UnresolvedException`. Since `CurrentDate` is `TimeZoneAwareExpression`, there is no problem with `CurrentDate`. This PR reverts the [previous patch](https://github.com/apache/spark/pull/23609) on `MicroBatchExecution` and fixes the root cause. ## How was this patch tested? Pass the Jenkins with the updated test cases. Closes #23660 from dongjoon-hyun/SPARK-26379. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2019-01-27 10:04:51 -08:00
Kris Mok	860336d31e	[SPARK-26735][SQL] Verify plan integrity for special expressions ## What changes were proposed in this pull request? Add verification of plan integrity with regards to special expressions being hosted only in supported operators. Specifically: - `AggregateExpression`: should only be hosted in `Aggregate`, or indirectly in `Window` - `WindowExpression`: should only be hosted in `Window` - `Generator`: should only be hosted in `Generate` This will help us catch errors in future optimizer rules that incorrectly hoist special expression out of their supported operator. TODO: This PR actually caught a bug in the analyzer in the test case `SPARK-23957 Remove redundant sort from subquery plan(scalar subquery)` in `SubquerySuite`, where a `max()` aggregate function is hosted in a `Sort` operator in the analyzed plan, which is invalid. That test case is disabled in this PR. SPARK-26741 has been opened to track the fix in the analyzer. ## How was this patch tested? Added new test case in `OptimizerStructuralIntegrityCheckerSuite` Closes #23658 from rednaxelafx/plan-integrity. Authored-by: Kris Mok <kris.mok@databricks.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2019-01-26 22:26:10 -08:00
hyukjinkwon	e8982ca7ad	[SPARK-25981][R] Enables Arrow optimization from R DataFrame to Spark DataFrame ## What changes were proposed in this pull request? This PR aims to support Arrow optimization for conversion from R DataFrame to Spark DataFrame. As on the PySpark side, it falls back to the non-optimized code path when it's unable to use Arrow optimization. This can be tested as below: ```bash $ ./bin/sparkR --conf spark.sql.execution.arrow.enabled=true ``` ```r collect(createDataFrame(mtcars)) ``` ### Requirements - R 3.5.x - Arrow package 0.12+ ```bash Rscript -e 'remotes::install_github("apache/arrow@apache-arrow-0.12.0", subdir = "r")' ``` Note: currently, the Arrow R package is not in CRAN. Please take a look at ARROW-3204. Note: currently, the Arrow R package does not seem to support Windows. Please take a look at ARROW-3204. ### Benchmarks Shell ```bash sync && sudo purge ./bin/sparkR --conf spark.sql.execution.arrow.enabled=false ``` ```bash sync && sudo purge ./bin/sparkR --conf spark.sql.execution.arrow.enabled=true ``` R code ```r createDataFrame(mtcars) # Initializes rdf <- read.csv("500000.csv") test <- function() { options(digits.secs = 6) # milliseconds start.time <- Sys.time() createDataFrame(rdf) end.time <- Sys.time() time.taken <- end.time - start.time print(time.taken) } test() ``` Data (350 MB): ```r object.size(read.csv("500000.csv")) 350379504 bytes ``` "500000 Records" http://eforexcel.com/wp/downloads-16-sample-csv-files-data-sets-for-testing/ Results ``` Time difference of 29.9468 secs ``` ``` Time difference of 3.222129 secs ``` The performance improvement was around 950%. Actually, this PR improves around 1200%+ because this PR includes a small optimization about regular R DataFrame -> Spark DataFrame. See https://github.com/apache/spark/pull/22954#discussion_r231847272 ### Limitations: For now, Arrow optimization with R is not supported when the data is `raw`, or when the user explicitly gives a float type in the schema. These cases produce corrupt values. 
In these cases, we fall back to the non-optimized code path. ## How was this patch tested? A small test was added. I manually forced this optimization to `true` for _all_ R tests and they _all_ passed (with a few fallback warnings). TODOs: - [x] Draft codes - [x] make the tests passed - [x] make the CRAN check pass - [x] Performance measurement - [x] Supportability investigation (for instance types) - [x] Wait for Arrow 0.12.0 release - [x] Fix and match it to Arrow 0.12.0 Closes #22954 from HyukjinKwon/r-arrow-createdataframe. Lead-authored-by: hyukjinkwon <gurwls223@apache.org> Co-authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2019-01-27 10:45:49 +08:00
heguozi	e71acd9a23	[SPARK-26630][SQL] Support reading Hive-serde tables whose INPUTFORMAT is org.apache.hadoop.mapreduce ## What changes were proposed in this pull request? When we read a hive table and create RDDs in `TableReader`, it'll throw exception `java.lang.ClassCastException: org.apache.hadoop.mapreduce.lib.input.TextInputFormat cannot be cast to org.apache.hadoop.mapred.InputFormat` if the input format class of the table is from mapreduce package. Now we use NewHadoopRDD to deal with the new input format and keep HadoopRDD to the old one. This PR is from #23506. We can reproduce this issue by executing the new test with the code in old version. When create a table with `org.apache.hadoop.mapreduce.....` input format, we will find the exception thrown in `org.apache.spark.rdd.HadoopRDD.getInputFormat(HadoopRDD.scala:190)` ## How was this patch tested? Added a new test. Closes #23559 from Deegue/fix-hadoopRDD. Lead-authored-by: heguozi <zyzzxycj@gmail.com> Co-authored-by: Yizhong Zhang <zyzzxycj@163.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2019-01-26 10:17:03 -08:00
SongYadong	aa3d16d68b	[SPARK-26698][CORE] Use ConfigEntry for hardcoded configs for memory and storage categories ## What changes were proposed in this pull request? This PR makes hardcoded configs about spark memory and storage to use `ConfigEntry` and put them in the config package. ## How was this patch tested? Existing unit tests. Closes #23623 from SongYadong/configEntry_for_mem_storage. Authored-by: SongYadong <song.yadong1@zte.com.cn> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-01-25 22:28:12 -06:00
Bruce Robbins	f17a3d9c3a	[SPARK-26711][SQL] Lazily convert string values to BigDecimal during JSON schema inference ## What changes were proposed in this pull request? This PR fixes a bug where JSON schema inference attempts to convert every String value to a BigDecimal regardless of the setting of "prefersDecimal". With that bug, behavior is still correct, but performance is impacted. This PR makes this conversion lazy, so it is only performed if prefersDecimal is set to true. Using Spark with a single executor thread to infer the schema of a single-column, 100M row JSON file, the performance impact is as follows: option \| baseline \| pr -----\|----\|----- inferTimestamp=_default_<br>prefersDecimal=_default_ \| 12.5 minutes \| 6.1 minutes \| inferTimestamp=false<br>prefersDecimal=_default_ \| 6.5 minutes \| 49 seconds \| inferTimestamp=false<br>prefersDecimal=true \| 6.5 minutes \| 6.5 minutes \| ## How was this patch tested? I ran JsonInferSchemaSuite and JsonSuite. Also, I ran manual tests to see performance impact (see above). Closes #23653 from bersprockets/SPARK-26711_improved. Authored-by: Bruce Robbins <bersprockets@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2019-01-25 16:14:38 -08:00
Jungtaek Lim (HeartSaVioR)	a4e48359ac	[SPARK-26379][SS] Fix issue on adding current_timestamp/current_date to streaming query ## What changes were proposed in this pull request? This patch proposes to fix issue on adding `current_timestamp` / `current_date` with streaming query. The root reason is that Spark transforms `CurrentTimestamp`/`CurrentDate` to `CurrentBatchTimestamp` in MicroBatchExecution which makes transformed attributes not-yet-resolved. They will be resolved by IncrementalExecution. (In ContinuousExecution, Spark doesn't allow using `current_timestamp` and `current_date` so it has been OK.) It's OK for DataSource V1 sink because it simply leverages transformed logical plan and don't evaluate until they're resolved, but for DataSource V2 sink, Spark tries to extract the schema of transformed logical plan in prior to IncrementalExecution, and unresolved attributes will raise errors. This patch fixes the issue via having separate pre-resolved logical plan to pass the schema to StreamingWriteSupport safely. ## How was this patch tested? Added UT. Closes #23609 from HeartSaVioR/SPARK-26379. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2019-01-25 14:58:03 -08:00
Jungtaek Lim (HeartSaVioR)	5f3658a8d8	[SPARK-26170][SS] Add missing metrics in FlatMapGroupsWithState ## What changes were proposed in this pull request? This patch addresses measuring possible metrics in StateStoreWriter to FlatMapGroupsWithStateExec. Please note that some metrics like time to remove elements are not addressed because they are coupled with state function. ## How was this patch tested? Manually tested with https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/sql/streaming/StructuredSessionization.scala. Snapshots below: ![screen shot 2018-11-26 at 4 13 40 pm](https://user-images.githubusercontent.com/1317309/48999346-b5f7b400-f199-11e8-89c7-8795f13470d6.png) ![screen shot 2018-11-26 at 4 13 54 pm](https://user-images.githubusercontent.com/1317309/48999347-b5f7b400-f199-11e8-91ef-ef0b2f816b2e.png) Closes #23142 from HeartSaVioR/SPARK-26170. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com> Signed-off-by: Jose Torres <torres.joseph.f+github@gmail.com>	2019-01-25 13:37:42 -08:00
Devaraj K	f06bc0cd1d	[SPARK-22404][YARN] Provide an option to use unmanaged AM in yarn-client mode ## What changes were proposed in this pull request? Providing a new configuration "spark.yarn.un-managed-am" (defaults to false) to enable the Unmanaged AM Application in Yarn Client mode which launches the Application Master service as part of the Client. It utilizes the existing code for communicating between the Application Master <-> Task Scheduler for the container requests/allocations/launch, and eliminates these, 1. Allocating and launching the Application Master container 2. Remote Node/Process communication between Application Master <-> Task Scheduler ## How was this patch tested? I verified manually running the applications in yarn-client mode with "spark.yarn.un-managed-am" enabled, and also ensured that there is no impact to the existing execution flows. I would like to hear others feedback/thoughts on this. Closes #19616 from devaraj-kavali/SPARK-22404. Authored-by: Devaraj K <devaraj@apache.org> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-01-25 11:52:45 -08:00
Gabor Somogyi	773efede20	[SPARK-26254][CORE] Extract Hive + Kafka dependencies from Core. ## What changes were proposed in this pull request? There are ugly provided dependencies inside core for the following: * Hive * Kafka In this PR I've extracted them out. This PR contains the following: * Token providers are now loaded with service loader * Hive token provider moved to hive project * Kafka token provider extracted into a new project ## How was this patch tested? Existing + newly added unit tests. Additionally tested on cluster. Closes #23499 from gaborgsomogyi/SPARK-26254. Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-01-25 10:36:00 -08:00
ankurgupta	b484490824	[SPARK-26694][CORE] Progress bar should be enabled by default for spark-shell ## What changes were proposed in this pull request? SPARK-21568 made a change to ensure that progress bar is enabled for spark-shell by default but not for other apps. Before that change, this was distinguished using log-level which is not a good way to determine the same as users can change the default log-level. That commit changed the way to determine whether current app is running in spark-shell or not but it left the log-level part as it is, which causes this regression. SPARK-25118 changed the default log level to INFO for spark-shell because of which the progress bar is not enabled anymore. This commit will remove the log-level check for enabling progress bar for spark-shell as it is not necessary and seems to be a leftover from SPARK-21568 ## How was this patch tested? 1. Ensured that progress bar is enabled with spark-shell by default 2. Ensured that progress bar is not enabled with spark-submit Closes #23618 from ankuriitg/ankurgupta/SPARK-26694. Authored-by: ankurgupta <ankur.gupta@cloudera.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-01-25 10:21:26 -08:00
Maxim Gekk	e3411a82c3	[SPARK-26720][SQL] Remove DateTimeUtils methods based on system default time zone ## What changes were proposed in this pull request? In the PR, I propose to remove the following methods from `DateTimeUtils`: - `timestampAddInterval` and `stringToTimestamp` - used only in test suites - `truncTimestamp`, `getSeconds`, `getMinutes`, `getHours` - those methods assume system default time zone. They are not used in Spark. ## How was this patch tested? This was tested by `DateTimeUtilsSuite` and `UnsafeArraySuite`. Closes #23643 from MaxGekk/unused-date-time-utils. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2019-01-25 17:06:22 +08:00
Gabor Somogyi	9452e0508a	[SPARK-26649][SS] Add DSv2 noop sink ## What changes were proposed in this pull request? Noop data source for batch was added in [#23471](https://github.com/apache/spark/pull/23471). In this PR I've added the streaming part. ## How was this patch tested? Additional unit tests. Closes #23631 from gaborgsomogyi/SPARK-26649. Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2019-01-24 19:25:38 -08:00
Gengliang Wang	f5b9370da2	[SPARK-26709][SQL] OptimizeMetadataOnlyQuery does not handle empty records correctly ## What changes were proposed in this pull request? When reading from empty tables, the optimization `OptimizeMetadataOnlyQuery` may return wrong results: ``` sql("CREATE TABLE t (col1 INT, p1 INT) USING PARQUET PARTITIONED BY (p1)") sql("INSERT INTO TABLE t PARTITION (p1 = 5) SELECT ID FROM range(1, 1)") sql("SELECT MAX(p1) FROM t") ``` The result is supposed to be `null`. However, with the optimization the result is `5`. The rule is originally ported from https://issues.apache.org/jira/browse/HIVE-1003 in #13494. In Hive, the rule is disabled by default in a later release(https://issues.apache.org/jira/browse/HIVE-15397), due to the same problem. It is hard to completely avoid the correctness issue. Because data sources like Parquet can be metadata-only. Spark can't tell whether it is empty or not without actually reading it. This PR disable the optimization by default. ## How was this patch tested? Unit test Closes #23635 from gengliangwang/optimizeMetadata. Lead-authored-by: Gengliang Wang <gengliang.wang@databricks.com> Co-authored-by: Xiao Li <gatorsmile@gmail.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2019-01-24 18:24:49 -08:00
Hyukjin Kwon	d2ff10cbe1	[SPARK-23674][ML] Adds Spark ML Events to Instrumentation ## What changes were proposed in this pull request? This PR proposes to add ML events to Instrumentation, and use it in Pipeline so that other developers can track and add some actions for them. ## Introduction ML events (like SQL events) can be quite useful when people want to track and make some actions for corresponding ML operations. For instance, I have been working on integrating Apache Spark with [Apache Atlas](https://atlas.apache.org/QuickStart.html). With some custom changes with this PR, I can visualise ML pipeline as below: ![spark_ml_streaming_lineage](https://user-images.githubusercontent.com/6477701/49682779-394bca80-faf5-11e8-85b8-5fae28b784b3.png) Another good thing that might have to be considered is, that we can interact this with other SQL/Streaming events. For instance, where the input `Dataset` is originated. For instance, with current Apache Spark, I can visualise SQL operations as below: ![screen shot 2018-12-10 at 9 41 36 am](https://user-images.githubusercontent.com/6477701/49706269-d9bdfe00-fc5f-11e8-943a-3309d1856ba5.png) I think we can combine those existing lineages together to easily understand where the data comes and goes. Currently, ML side is a hole so the lineages can't be connected for the current Apache Spark .. To add up, I think it's not to mention how useful it is to track the SQL/Streaming operations. Likewise, I would like to propose ML events as well (as lowest stability `Unstable` APIs for now - no guarantee about stability). 
## Implementation Details ### Sends event (but not expose ML specific listener) `mllib/src/main/scala/org/apache/spark/ml/events.scala` ```scala @Unstable case class ...StartEvent(caller, input) @Unstable case class ...EndEvent(caller, output) trait MLEvents { // Wrappers to send events: // def with...Event(body) = { // body() // SparkContext.getOrCreate().listenerBus.post(event) // } } ``` This trait is used by `Instrumentation`. ```scala class Instrumentation ... with MLEvents { ``` and used as below: ```scala instrumented { instr => instr.with...Event(...) { ... } } ``` This way mimics both: 1. Catalog events (see `org/apache/spark/sql/catalyst/catalog/events.scala`) - This allows a Catalog specific listener to be added `ExternalCatalogEventListener` - It's implemented in a way of wrapping whole `ExternalCatalog` named `ExternalCatalogWithListener` which delegates the operations to `ExternalCatalog` This is not quite possible in this case because most of instances (like `Pipeline`) will be directly created in most of cases. We might be able to do that via extending `ListenerBus` for all possible instances but IMHO it's too invasive. Also, exposing another ML specific listener sounds a bit too much at this stage. Therefore, I simply borrowed file name and structures here 2. SQL execution events (see `org/apache/spark/sql/execution/SQLExecution.scala`) - Add an object that wraps a body to send events The current approach is rather close to this. It has a `with...` wrapper to send events. I borrowed this approach to be consistent. ## Usage It needs a custom implementation for a query listener. For instance, with the custom listener below: ```scala class CustomMLListener extends SparkListener { def onOtherEvent(e) = e match { case e: MLEvent => // do something case _ => // pass } } ``` There are two (existing) ways to use this. ```scala spark.sparkContext.addSparkListener(new CustomMLListener) ``` ```bash spark-submit ...\ --conf spark.extraListeners=CustomMLListener\ ... 
``` This is similar to the existing implementations on the SQL side. ## Target users 1. Users in general would likely use this feature like other event listeners. There is visible outside interest in comparable listeners: - SQL Listener - https://stackoverflow.com/questions/46409339/spark-listener-to-an-sql-query - http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-Custom-Query-Execution-listener-via-conf-properties-td30979.html - Streaming Query Listener - https://jhui.github.io/2017/01/15/Apache-Spark-Streaming/ - http://apache-spark-developers-list.1001551.n3.nabble.com/Structured-Streaming-with-Watermark-td25413.html#a25416 2. Someone would likely run this via Atlas. The plugin mirror is intentionally exposed at [spark-atlas-connector](https://github.com/hortonworks-spark/spark-atlas-connector) so that anyone can work on lineage and governance in Atlas. I'm trying to show integrated lineage in Apache Spark, but this is a missing piece. ## How was this patch tested? Manually tested and unit tests were added. Closes #23263 from HyukjinKwon/SPARK-23674-1. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2019-01-25 10:11:49 +08:00
Hyukjin Kwon	d97481ba17	[MINOR] Add Jenkins and AppVeyor badges at README.md ## What changes were proposed in this pull request? This PR adds Jenkins build and AppVeyor build badges Take a look at `cc24b9e0b3/README.md` ## How was this patch tested? Manually tested by `grip`. Closes #23629 from HyukjinKwon/minor-badges. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2019-01-25 10:04:21 +08:00
Ilya Matiach	b2d36f65db	[SPARK-19591][ML][MLLIB] Add sample weights to decision trees This is updated PR https://github.com/apache/spark/pull/16722 to latest master ## What changes were proposed in this pull request? This patch adds support for sample weights to DecisionTreeRegressor and DecisionTreeClassifier. Note: This patch does not add support for sample weights to RandomForest. As discussed in the JIRA, we would like to add sample weights into the bagging process. This patch is large enough as is, and there are some additional considerations to be made for random forests. Since the machinery introduced here needs to be present regardless, I have opted to leave random forests for a follow up pr. ## How was this patch tested? The algorithms are tested to ensure that: 1. Arbitrary scaling of constant weights has no effect 2. Outliers with small weights do not affect the learned model 3. Oversampling and weighting are equivalent Unit tests are also added to test other smaller components. ## Summary of changes - Impurity aggregators now store weighted sufficient statistics. They also store a raw count, however, since this is needed to use minInstancesPerNode. - Impurity aggregators now also hold the raw count. - This patch maintains the meaning of minInstancesPerNode, in that the parameter still corresponds to raw, unweighted counts. It also adds a new parameter minWeightFractionPerNode which requires that nodes must contain at least minWeightFractionPerNode * weightedNumExamples total weight. - This patch modifies findSplitsForContinuousFeatures to use weighted sums. Unit tests are added. 
- TreePoint is modified to hold a sample weight - BaggedPoint is modified from: ``` Scala private[spark] class BaggedPoint[Datum](val datum: Datum, val subsampleWeights: Array[Double]) extends Serializable ``` to ``` Scala private[spark] class BaggedPoint[Datum]( val datum: Datum, val subsampleCounts: Array[Int], val sampleWeight: Double) extends Serializable ``` We do not simply multiply the counts by the weight and store that because we need the raw counts and the weight in order to use both minInstancesPerNode and minWeightPerNode Note: many of the changed files are due simply to using Instance instead of LabeledPoint Closes #21632 from imatiach-msft/ilmat/sample-weights. Authored-by: Ilya Matiach <ilmat@microsoft.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-01-24 18:20:28 -07:00
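Note on SPARK-19591 above: the description says a node must satisfy both the raw-count constraint (`minInstancesPerNode`) and the weighted constraint (`minWeightFractionPerNode * weightedNumExamples`). The following is a minimal sketch of that double check, not the code from the patch; `NodeStats`, `SplitValidity` and `isValidChild` are hypothetical names used only for illustration.

```scala
// Hypothetical, simplified stand-in for the per-node statistics:
// a raw (unweighted) row count plus the sum of sample weights.
final case class NodeStats(rawCount: Long, weightSum: Double)

object SplitValidity {
  /** A child node is accepted only if it meets both the raw-count and the weight constraint. */
  def isValidChild(
      child: NodeStats,
      minInstancesPerNode: Int,
      minWeightFractionPerNode: Double,
      weightedNumExamples: Double): Boolean = {
    // Minimum total weight a node must hold, as described in the commit message.
    val minWeight = minWeightFractionPerNode * weightedNumExamples
    child.rawCount >= minInstancesPerNode && child.weightSum >= minWeight
  }

  def main(args: Array[String]): Unit = {
    val total = 100.0
    // Many rows with tiny weights: enough rows, too little weight.
    println(isValidChild(NodeStats(50, 0.5), 10, 0.05, total))
    // Few rows with heavy weights: enough weight, too few rows.
    println(isValidChild(NodeStats(2, 60.0), 10, 0.05, total))
    // Enough rows and enough weight.
    println(isValidChild(NodeStats(20, 30.0), 10, 0.05, total))
  }
}
```

Keeping the raw count next to the weight sum is what makes both checks possible, which is the reason the message gives for not storing only count times weight.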
Imran Rashid	3699763fda	[SPARK-26697][CORE] Log local & remote block sizes. ## What changes were proposed in this pull request? To help debug failed or slow tasks, it's really useful to know the size of the blocks getting fetched. Though that is available at the debug level, debug logs aren't on in general -- but there is already an info level log line that this augments a little. ## How was this patch tested? Ran very basic local-cluster mode app, looked at logs. Example line: ``` INFO ShuffleBlockFetcherIterator: Getting 2 (194.0 B) non-empty blocks including 1 (97.0 B) local blocks and 1 (97.0 B) remote blocks ``` Full suite via jenkins. Closes #23621 from squito/SPARK-26697. Authored-by: Imran Rashid <irashid@cloudera.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-01-24 16:13:58 -08:00
Liupengcheng	8d667c511c	[SPARK-26530][CORE] Validate heartheat arguments in HeartbeatReceiver ## What changes were proposed in this pull request? Currently, heartbeat-related arguments are not validated in Spark, so if they are improperly specified, the application may run for a while and not fail until the maximum number of executor failures is reached (especially with spark.dynamicAllocation.enabled=true), which wastes resources. This PR prechecks these arguments in HeartbeatReceiver to fix this problem. ## How was this patch tested? NA-just validation changes Closes #23445 from liupc/validate-heartbeat-arguments-in-SparkSubmitArguments. Authored-by: Liupengcheng <liupengcheng@xiaomi.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-01-24 15:12:57 -08:00
Rob Vesse	69dab94b13	[SPARK-26687][K8S] Fix handling of custom Dockerfile paths ## What changes were proposed in this pull request? With the changes from vanzin's PR #23019 (SPARK-26025) we use a pared down temporary Docker build context which significantly improves build times. However the way this is implemented leads to non-intuitive behaviour when supplying custom Docker file paths. This is because of the following code snippet: ``` (cd $(img_ctx_dir base) && docker build $NOCACHEARG "${BUILD_ARGS[@]}" \ -t $(image_ref spark) \ -f "$BASEDOCKERFILE" .) ``` Since the script changes to the temporary build context directory and then runs `docker build` there, any path given for the Docker file is taken as relative to the temporary build context directory rather than to the directory where the user invoked the script. This is rather unintuitive and produces somewhat unhelpful errors e.g. ``` > ./bin/docker-image-tool.sh -r rvesse -t badpath -p resource-managers/kubernetes/docker/src/main/dockerfiles/spark/bindings/python/Dockerfile build Sending build context to Docker daemon 218.4MB Step 1/15 : FROM openjdk:8-alpine ---> 5801f7d008e5 Step 2/15 : ARG spark_uid=185 ---> Using cache ---> 5fd63df1ca39 ... Successfully tagged rvesse/spark:badpath unable to prepare context: unable to evaluate symlinks in Dockerfile path: lstat /Users/rvesse/Documents/Work/Code/spark/target/tmp/docker/pyspark/resource-managers: no such file or directory Failed to build PySpark Docker image, please refer to Docker build output for details. ``` Here we can see that the relative path that was valid where the user typed the command was not valid inside the build context directory. To resolve this we need to ensure that we are resolving relative paths to Docker files appropriately, which we do by adding a `resolve_file` function to the script and invoking that on the supplied Docker file paths ## How was this patch tested? 
Validated that relative paths now work as expected: ``` > ./bin/docker-image-tool.sh -r rvesse -t badpath -p resource-managers/kubernetes/docker/src/main/dockerfiles/spark/bindings/python/Dockerfile build Sending build context to Docker daemon 218.4MB Step 1/15 : FROM openjdk:8-alpine ---> 5801f7d008e5 Step 2/15 : ARG spark_uid=185 ---> Using cache ---> 5fd63df1ca39 Step 3/15 : RUN set -ex && apk upgrade --no-cache && apk add --no-cache bash tini libc6-compat linux-pam krb5 krb5-libs && mkdir -p /opt/spark && mkdir -p /opt/spark/examples && mkdir -p /opt/spark/work-dir && touch /opt/spark/RELEASE && rm /bin/sh && ln -sv /bin/bash /bin/sh && echo "auth required pam_wheel.so use_uid" >> /etc/pam.d/su && chgrp root /etc/passwd && chmod ug+rw /etc/passwd ---> Using cache ---> eb0a568e032f Step 4/15 : COPY jars /opt/spark/jars ... Successfully tagged rvesse/spark:badpath Sending build context to Docker daemon 6.599MB Step 1/13 : ARG base_img Step 2/13 : ARG spark_uid=185 Step 3/13 : FROM $base_img ---> 8f4fff16f903 Step 4/13 : WORKDIR / ---> Running in 25466e66f27f Removing intermediate container 25466e66f27f ---> 1470b6efae61 Step 5/13 : USER 0 ---> Running in b094b739df37 Removing intermediate container b094b739df37 ---> 6a27eb4acad3 Step 6/13 : RUN mkdir ${SPARK_HOME}/python ---> Running in bc8002c5b17c Removing intermediate container bc8002c5b17c ---> 19bb12f4286a Step 7/13 : RUN apk add --no-cache python && apk add --no-cache python3 && python -m ensurepip && python3 -m ensurepip && rm -r /usr/lib/python*/ensurepip && pip install --upgrade pip setuptools && rm -r /root/.cache ---> Running in 12dcba5e527f ... Successfully tagged rvesse/spark-py:badpath ``` Closes #23613 from rvesse/SPARK-26687. Authored-by: Rob Vesse <rvesse@dotnetrdf.org> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-01-24 10:11:55 -08:00
Tom van Bussel	9813b1d074	[SPARK-26690] Track query execution and time cost for checkpoints ## What changes were proposed in this pull request? Checkpoints of Dataframes currently do not show up in SQL UI. This PR fixes that by setting an execution id for the execution of the checkpoint by wrapping the checkpoint code with a `withAction`. ## How was this patch tested? A unit test was added to DatasetSuite. Closes #23636 from tomvanbussel/SPARK-26690. Authored-by: Tom van Bussel <tom.vanbussel@databricks.com> Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>	2019-01-24 16:44:39 +01:00
Bruce Robbins	d4a30fa9af	[SPARK-26680][SQL] Eagerly create inputVars while conditions are appropriate ## What changes were proposed in this pull request? When a user passes a Stream to groupBy, ```CodegenSupport.consume``` ends up lazily generating ```inputVars``` from a Stream, since the field ```output``` will be a Stream. At the time ```output.zipWithIndex.map``` is called, conditions are correct. However, by the time the map operation actually executes, conditions are no longer appropriate. The closure used by the map operation ends up using a reference to the partially created ```inputVars```. As a result, a StackOverflowError occurs. This PR ensures that ```inputVars``` is eagerly created while conditions are appropriate. It seems this was also an issue with the code path for creating ```inputVars``` from ```outputVars``` (SPARK-25767). I simply extended the solution for that code path to encompass both code paths. ## How was this patch tested? SQL unit tests new test python tests Closes #23617 from bersprockets/SPARK-26680_opt1. Authored-by: Bruce Robbins <bersprockets@gmail.com> Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>	2019-01-24 11:18:08 +01:00
Ryan Blue	d5a97c1c2c	[SPARK-26682][SQL] Use taskAttemptID instead of attemptNumber for Hadoop. ## What changes were proposed in this pull request? Updates the attempt ID used by FileFormatWriter. Tasks in stage attempts use the same task attempt number and could conflict. Using Spark's task attempt ID guarantees that Hadoop TaskAttemptID instances are unique. ## How was this patch tested? Existing tests. Also validated that we no longer detect this failure case in our logs after deployment. Closes #23608 from rdblue/SPARK-26682-fix-hadoop-task-attempt-id. Authored-by: Ryan Blue <blue@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-01-24 12:45:25 +08:00
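Note on SPARK-26682 above: the message says tasks in different stage attempts can share the same attempt number, while Spark's task attempt ID is unique. The sketch below illustrates that collision with plain Scala. `TaskInfo`, `AttemptIds` and the ID string layout are hypothetical and are not Hadoop's real `TaskAttemptID` format.

```scala
// Hypothetical model of a running task; field names are illustrative only.
final case class TaskInfo(
    stageId: Int,
    stageAttempt: Int,
    partition: Int,
    attemptNumber: Int, // restarts from 0 within each stage attempt
    taskAttemptId: Long) // assumed unique across the whole application

object AttemptIds {
  // Simplified ID built from the per-task attempt number (can collide).
  def fromAttemptNumber(t: TaskInfo): String =
    s"attempt_${t.stageId}_${t.partition}_${t.attemptNumber}"

  // Simplified ID built from the unique task attempt ID.
  def fromTaskAttemptId(t: TaskInfo): String =
    s"attempt_${t.stageId}_${t.partition}_${t.taskAttemptId}"

  def main(args: Array[String]): Unit = {
    // The same partition run by two attempts of the same stage.
    val first = TaskInfo(stageId = 3, stageAttempt = 0, partition = 7, attemptNumber = 0, taskAttemptId = 41L)
    val retry = TaskInfo(stageId = 3, stageAttempt = 1, partition = 7, attemptNumber = 0, taskAttemptId = 58L)
    println(fromAttemptNumber(first) == fromAttemptNumber(retry)) // the two IDs collide
    println(fromTaskAttemptId(first) == fromTaskAttemptId(retry)) // the two IDs differ
  }
}
```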
Dave DeCaprio	d0e9219e03	[SPARK-26617][SQL] Cache manager locks ## What changes were proposed in this pull request? Fixed several places in CacheManager where a write lock was being held while running the query optimizer. This could cause a very long block if the query optimization takes a long time. This builds on changes from [SPARK-26548] that fixed this issue for one specific case in the CacheManager. gatorsmile This is very similar to the PR you approved last week. ## How was this patch tested? Has been tested on a live system where the blocking was causing major issues and it is working well. CacheManager has no explicit unit test but is used in many places internally as part of the SharedState. Closes #23539 from DaveDeCaprio/cache-manager-locks. Lead-authored-by: Dave DeCaprio <daved@alum.mit.edu> Co-authored-by: David DeCaprio <daved@alum.mit.edu> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-01-24 10:48:48 +08:00
ayudovin	11be22bb5e	[SPARK-25713][SQL] implementing copy for ColumnArray ## What changes were proposed in this pull request? Implement copy() for ColumnarArray ## How was this patch tested? Updating test case to existing tests in ColumnVectorSuite Closes #23569 from ayudovin/copy-for-columnArray. Authored-by: ayudovin <a.yudovin6695@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-01-24 10:35:44 +08:00
Anton Okolnychyi	0df29bfbdc	[SPARK-26706][SQL] Fix `illegalNumericPrecedence` for ByteType ## What changes were proposed in this pull request? This PR contains a minor change in `Cast$mayTruncate` that fixes its logic for bytes. Right now, `mayTruncate(ByteType, LongType)` returns `false` while `mayTruncate(ShortType, LongType)` returns `true`. Consequently, `spark.range(1, 3).as[Byte]` and `spark.range(1, 3).as[Short]` behave differently. Potentially, this bug can silently corrupt someone's data. ```scala // executes silently even though Long is converted into Byte spark.range(Long.MaxValue - 10, Long.MaxValue).as[Byte] .map(b => b - 1) .show() +-----+ \|value\| +-----+ \| -12\| \| -11\| \| -10\| \| -9\| \| -8\| \| -7\| \| -6\| \| -5\| \| -4\| \| -3\| +-----+ // throws an AnalysisException: Cannot up cast `id` from bigint to smallint as it may truncate spark.range(Long.MaxValue - 10, Long.MaxValue).as[Short] .map(s => s - 1) .show() ``` ## How was this patch tested? This PR comes with a set of unit tests. Closes #23632 from aokolnychyi/cast-fix. Authored-by: Anton Okolnychyi <aokolnychyi@apple.com> Signed-off-by: DB Tsai <d_tsai@apple.com>	2019-01-24 00:12:26 +00:00
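Note on SPARK-26706 above: a rough, Spark-free sketch of the two ideas in the message. First, narrowing a `Long` to a `Byte` keeps only the low 8 bits, which reproduces the values in the table shown. Second, a precedence-based truncation check has to treat the narrowest type like any other. `NumericPrecedenceSketch`, its string type names and the `>= 0` index check are assumptions for illustration, not the code in `Cast`.

```scala
object NumericPrecedenceSketch {
  // Narrowest to widest. Plain strings here, not Spark DataType objects.
  val precedence: IndexedSeq[String] =
    IndexedSeq("byte", "short", "int", "long", "float", "double")

  /** True when casting `from` to `to` moves to a narrower numeric type. */
  def mayTruncate(from: String, to: String): Boolean = {
    val f = precedence.indexOf(from)
    val t = precedence.indexOf(to)
    // Use >= 0 for the "known type" check: index 0 (byte) is a valid position.
    f >= 0 && t >= 0 && t < f
  }

  /** Mirrors the example in the message: Long values narrowed to Byte, then decremented. */
  def narrowed: List[Int] =
    ((Long.MaxValue - 10) until Long.MaxValue).map(l => l.toByte - 1).toList

  def main(args: Array[String]): Unit = {
    println(mayTruncate("long", "byte"))  // narrowing to the narrowest type
    println(mayTruncate("long", "short")) // narrowing
    println(mayTruncate("byte", "long"))  // widening
    println(narrowed)
  }
}
```

With the check written this way, `long` to `byte` and `long` to `short` are both flagged, so the two `as[...]` calls in the message would behave the same.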
Liupengcheng	0446363ef4	[SPARK-26660] Add warning logs when broadcasting large task binary ## What changes were proposed in this pull request? Currently, some ML libraries may generate large models, which may be referenced in the task closure, so the driver will broadcast a large task binary, and the executor may not be able to deserialize it, resulting in OOM failures (for instance, when the executor's memory is not enough). This problem does not only affect apps using ML libraries; user-specified closures or functions that refer to large data may also have this problem. To make it easier to debug memory problems caused by a large taskBinary broadcast, we can add some warning logs for it. This PR adds warning logs on the driver side when broadcasting a large task binary, and it also includes some minor log changes in the reading of broadcasts. ## How was this patch tested? NA-Just log changes. Closes #23580 from liupc/Add-warning-logs-for-large-taskBinary-size. Authored-by: Liupengcheng <liupengcheng@xiaomi.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-01-23 08:51:39 -06:00
Ryan Blue	d008e23ab5	[SPARK-26681][SQL] Support Ammonite inner-class scopes. ## What changes were proposed in this pull request? This adds a new pattern to recognize Ammonite REPL classes and return the correct scope. ## How was this patch tested? Manually tested with Spark in an Ammonite session. Closes #23607 from rdblue/SPARK-26681-support-ammonite-scopes. Authored-by: Ryan Blue <blue@apache.org> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-01-23 08:50:03 -06:00
Maxim Gekk	46d5bb9a0f	[SPARK-26653][SQL] Use Proleptic Gregorian calendar in parsing JDBC lower/upper bounds ## What changes were proposed in this pull request? In the PR, I propose using the `stringToDate` and `stringToTimestamp` methods to parse JDBC lower/upper bounds of the partition column if it has `DateType` or `TimestampType`. Since those methods have been ported to the Proleptic Gregorian calendar by #23512, the PR switches parsing of JDBC bounds of the partition column to that calendar as well. ## How was this patch tested? This was tested by `JDBCSuite`. Closes #23597 from MaxGekk/jdbc-parse-timestamp-bounds. Lead-authored-by: Maxim Gekk <maxim.gekk@databricks.com> Co-authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-01-23 20:23:17 +08:00
Takeshi Yamamuro	1ed1b4d8e1	[SPARK-26637][SQL] Makes GetArrayItem nullability more precise ## What changes were proposed in this pull request? In the master, GetArrayItem nullable is always true; `cf133e6110/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeExtractors.scala (L236)` But, If input array size is constant and ordinal is foldable, we could make GetArrayItem nullability more precise. This pr added code to make `GetArrayItem` nullability more precise. ## How was this patch tested? Added tests in `ComplexTypeSuite`. Closes #23566 from maropu/GetArrayItemNullability. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-01-23 15:33:02 +08:00
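Note on SPARK-26637 above: the idea is that when the array has a known size and the ordinal is a constant, the nullability of the selected element can be used instead of always answering `true`. The following is a simplified model of that rule, not the Catalyst implementation; `Expr`, `Lit`, `Column`, `MakeArray` and `itemNullable` are hypothetical names.

```scala
// Hypothetical miniature expression model; only nullability is tracked.
sealed trait Expr { def nullable: Boolean }
final case class Lit(value: Option[Int]) extends Expr {
  val nullable: Boolean = value.isEmpty
}
final case class Column(name: String, nullable: Boolean) extends Expr
final case class MakeArray(children: Seq[Expr])

object ArrayItemNullability {
  /**
   * `ordinal = Some(i)` models a foldable (constant) ordinal,
   * `None` models an ordinal that is only known at runtime.
   */
  def itemNullable(array: MakeArray, ordinal: Option[Int]): Boolean = ordinal match {
    // In bounds and constant: the answer is the nullability of that element.
    case Some(i) if i >= 0 && i < array.children.length => array.children(i).nullable
    // Out of bounds or unknown ordinal: stay conservative.
    case _ => true
  }

  def main(args: Array[String]): Unit = {
    val arr = MakeArray(Seq(Lit(Some(1)), Column("c", nullable = true), Lit(None)))
    println(itemNullable(arr, Some(0))) // non-null literal
    println(itemNullable(arr, Some(1))) // nullable column
    println(itemNullable(arr, Some(5))) // out of bounds
    println(itemNullable(arr, None))    // ordinal unknown
  }
}
```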
Ngone51	3da71f2da1	[SPARK-22465][CORE][FOLLOWUP] Use existing partitioner when defaultNumPartitions is equal to maxPartitioner.numPartitions ## What changes were proposed in this pull request? Followup of #20091. We could also use existing partitioner when defaultNumPartitions is equal to the maxPartitioner's numPartitions. ## How was this patch tested? Existed. Closes #23581 from Ngone51/dev-use-existing-partitioner-when-defaultNumPartitions-equalTo-MaxPartitioner#-numPartitions. Authored-by: Ngone51 <ngone_5451@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-01-23 10:23:40 +08:00
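Note on the SPARK-22465 follow-up above: the change lets the existing partitioner be reused when `defaultNumPartitions` is equal to, and not only smaller than, the partition count of the largest existing partitioner. A minimal sketch of that comparison follows. `PartitionerChoice.reuse` is a hypothetical helper that models partitioners by their partition counts alone, and it leaves out any other eligibility conditions the real code may apply.

```scala
object PartitionerChoice {
  /**
   * Returns Some(n) when an existing partitioner with n partitions can be reused,
   * or None when a new partitioner with `defaultNumPartitions` would be created.
   */
  def reuse(existingPartitionCounts: Seq[Int], defaultNumPartitions: Int): Option[Int] =
    existingPartitionCounts
      .reduceOption(_ max _)             // the partitioner with the most partitions, if any
      .filter(defaultNumPartitions <= _) // "<=" so that equality also reuses it

  def main(args: Array[String]): Unit = {
    println(reuse(Seq(4, 8), 8)) // equal: reuse the 8-partition partitioner
    println(reuse(Seq(4, 8), 9)) // default is larger: create a new one
    println(reuse(Nil, 8))       // nothing to reuse
  }
}
```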
Sean Owen	6dcad38ba3	[SPARK-26228][MLLIB] OOM issue encountered when computing Gramian matrix ## What changes were proposed in this pull request? Avoid memory problems in closure cleaning when handling large Gramians (>= 16K rows/cols) by using null as zeroValue ## How was this patch tested? Existing tests. Note that it's hard to test the case that triggers this issue as it would require a large amount of memory and run a while. I confirmed locally that a 16K x 16K Gramian failed with tons of driver memory before, and didn't fail upfront after this change. Closes #23600 from srowen/SPARK-26228. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-01-22 19:22:06 -06:00
Shahid	0d35f9ea3a	[SPARK-24484][MLLIB] Power Iteration Clustering is giving incorrect clustering results when there are mutiple leading eigen values. ## What changes were proposed in this pull request? ![image](https://user-images.githubusercontent.com/23054875/41823325-e83e1d34-781b-11e8-8c34-fc6e7a042f3f.png) ![image](https://user-images.githubusercontent.com/23054875/41823367-733c9ba4-781c-11e8-8da2-b26460c2af63.png) ![image](https://user-images.githubusercontent.com/23054875/41823409-179dd910-781d-11e8-8d8c-9865156fad15.png) Method to determine if the top eigenvalues have the same magnitude but opposite signs: the vector is written as a linear combination of the eigenvectors at iteration k. ![image](https://user-images.githubusercontent.com/23054875/41822941-f8b13d4c-7814-11e8-8091-54c02721c1c5.png) ![image](https://user-images.githubusercontent.com/23054875/41822982-b80a6fc4-7815-11e8-9129-ed96a14f037f.png) ![image](https://user-images.githubusercontent.com/23054875/41823022-5b69e906-7816-11e8-847a-8fa5f0b6200e.png) ![image](https://user-images.githubusercontent.com/23054875/41823087-54311398-7817-11e8-90bf-e1be2bbff323.png) ![image](https://user-images.githubusercontent.com/23054875/41823121-e0b78324-7817-11e8-9596-379bd2e518af.png) ![image](https://user-images.githubusercontent.com/23054875/41823151-965319d2-7818-11e8-8b91-10f6276ace62.png) ![image](https://user-images.githubusercontent.com/23054875/41823182-75cdbad6-7819-11e8-912f-23c66a8359de.png) ![image](https://user-images.githubusercontent.com/23054875/41823221-1ca77a36-781a-11e8-9a40-48bd165797cc.png) ![image](https://user-images.githubusercontent.com/23054875/41823272-f6962b2a-781a-11e8-9978-1b2dc0dc8b2c.png) ![image](https://user-images.githubusercontent.com/23054875/41823303-75b296f0-781b-11e8-8501-6133b04769c8.png) So, we need to check if the Rayleigh coefficient at convergence is less than the norm of the estimated eigenvector before normalizing 
Added a UT. Closes #21627 from shahidki31/picConvergence. Authored-by: Shahid <shahidki31@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-01-22 18:29:18 -06:00
Darcy Shen	9d2a11554b	[MINOR][DOC] Documentation on JVM options for SBT ## What changes were proposed in this pull request? Documentation and .gitignore ## How was this patch tested? Manual test that SBT honors the settings in .jvmopts if present Closes #23615 from sadhen/impr/gitignore. Authored-by: Darcy Shen <sadhen@zoho.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-01-22 18:27:24 -06:00
Kris Mok	02d8ae3d59	[SPARK-26661][SQL] Show actual class name of the writing command in CTAS explain ## What changes were proposed in this pull request? The explain output of the Hive CTAS command, regardless of whether it's actually writing via Hive's SerDe or converted into using Spark's data source, would always show that it's using `InsertIntoHiveTable` because it's hardcoded. e.g. ``` Execute OptimizedCreateHiveTableAsSelectCommand [Database:default, TableName: foo, InsertIntoHiveTable] ``` This CTAS is converted into using Spark's data source, but it still says `InsertIntoHiveTable` in the explain output. It's better to show the actual class name of the writing command used. For the example above, it'd be: ``` Execute OptimizedCreateHiveTableAsSelectCommand [Database:default, TableName: foo, InsertIntoHadoopFsRelationCommand] ``` ## How was this patch tested? Added test case in `HiveExplainSuite` Closes #23582 from rednaxelafx/fix-explain-1. Authored-by: Kris Mok <kris.mok@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2019-01-22 13:55:41 -08:00
Rob Vesse	dc2da72100	[SPARK-26685][K8S] Correct placement of ARG declaration Latest Docker releases are stricter in their enforcement of build argument scope. The location of the `ARG spark_uid` declaration in the Python and R Dockerfiles means the variable is out of scope by the time it is used in a `USER` declaration resulting in a container running as root rather than the default/configured UID. Also with some of the refactoring of the script that has happened since my PR that introduced the configurable UID it turns out the `-u <uid>` argument is not being properly passed to the Python and R image builds when those are opted into ## What changes were proposed in this pull request? This commit moves the `ARG` declaration to just before the argument is used such that it is in scope. It also ensures that Python and R image builds receive the build arguments that include the `spark_uid` argument where relevant ## How was this patch tested? Prior to the patch images are produced where the Python and R images ignore the default/configured UID: ``` > docker run -it --entrypoint /bin/bash rvesse/spark-py:uid456 bash-4.4# whoami root bash-4.4# id -u 0 bash-4.4# exit > docker run -it --entrypoint /bin/bash rvesse/spark:uid456 bash-4.4$ id -u 456 bash-4.4$ exit ``` Note that the Python image is still running as `root` having ignored the configured UID of 456 while the base image has the correct UID because the relevant `ARG` declaration is correctly in scope. After the patch the correct UID is observed: ``` > docker run -it --entrypoint /bin/bash rvesse/spark-r:uid456 bash-4.4$ id -u 456 bash-4.4$ exit exit > docker run -it --entrypoint /bin/bash rvesse/spark-py:uid456 bash-4.4$ id -u 456 bash-4.4$ exit exit > docker run -it --entrypoint /bin/bash rvesse/spark:uid456 bash-4.4$ id -u 456 bash-4.4$ exit ``` Closes #23611 from rvesse/SPARK-26685. Authored-by: Rob Vesse <rvesse@dotnetrdf.org> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-01-22 10:31:17 -08:00
Rob Vesse	c542c247bb	[SPARK-25887][K8S] Configurable K8S context support This enhancement allows for specifying the desired context to use for the initial K8S client auto-configuration. This allows users to more easily access alternative K8S contexts without having to first explicitly change their current context via kubectl. Explicitly set my K8S context to a context pointing to a non-existent cluster, then launched Spark jobs with explicitly specified contexts via the new `spark.kubernetes.context` configuration property. Example Output: ``` > kubectl config current-context minikube > minikube status minikube: Stopped cluster: kubectl: > ./spark-submit --master k8s://https://localhost:6443 --deploy-mode cluster --name spark-pi --class org.apache.spark.examples.SparkPi --conf spark.executor.instances=2 --conf spark.kubernetes.context=docker-for-desktop --conf spark.kubernetes.container.image=rvesse/spark:debian local:///opt/spark/examples/jars/spark-examples_2.11-3.0.0-SNAPSHOT.jar 4 18/10/31 11:57:51 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... 
using builtin-java classes where applicable 18/10/31 11:57:51 INFO SparkKubernetesClientFactory: Auto-configuring K8S client using context docker-for-desktop from users K8S config file 18/10/31 11:57:52 INFO LoggingPodStatusWatcherImpl: State changed, new state: pod name: spark-pi-1540987071845-driver namespace: default labels: spark-app-selector -> spark-2c4abc226ed3415986eb602bd13f3582, spark-role -> driver pod uid: 32462cac-dd04-11e8-b6c6-025000000001 creation time: 2018-10-31T11:57:52Z service account name: default volumes: spark-local-dir-1, spark-conf-volume, default-token-glpfv node name: N/A start time: N/A phase: Pending container status: N/A 18/10/31 11:57:52 INFO LoggingPodStatusWatcherImpl: State changed, new state: pod name: spark-pi-1540987071845-driver namespace: default labels: spark-app-selector -> spark-2c4abc226ed3415986eb602bd13f3582, spark-role -> driver pod uid: 32462cac-dd04-11e8-b6c6-025000000001 creation time: 2018-10-31T11:57:52Z service account name: default volumes: spark-local-dir-1, spark-conf-volume, default-token-glpfv node name: docker-for-desktop start time: N/A phase: Pending container status: N/A ... 
18/10/31 11:58:03 INFO LoggingPodStatusWatcherImpl: State changed, new state: pod name: spark-pi-1540987071845-driver namespace: default labels: spark-app-selector -> spark-2c4abc226ed3415986eb602bd13f3582, spark-role -> driver pod uid: 32462cac-dd04-11e8-b6c6-025000000001 creation time: 2018-10-31T11:57:52Z service account name: default volumes: spark-local-dir-1, spark-conf-volume, default-token-glpfv node name: docker-for-desktop start time: 2018-10-31T11:57:52Z phase: Succeeded container status: container name: spark-kubernetes-driver container image: rvesse/spark:debian container state: terminated container started at: 2018-10-31T11:57:54Z container finished at: 2018-10-31T11:58:02Z exit code: 0 termination reason: Completed ``` Without the `spark.kubernetes.context` setting this will fail because the current context - `minikube` - is pointing to a non-running cluster e.g. ``` > ./spark-submit --master k8s://https://localhost:6443 --deploy-mode cluster --name spark-pi --class org.apache.spark.examples.SparkPi --conf spark.executor.instances=2 --conf spark.kubernetes.container.image=rvesse/spark:debian local:///opt/spark/examples/jars/spark-examples_2.11-3.0.0-SNAPSHOT.jar 4 18/10/31 12:02:30 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... 
using builtin-java classes where applicable 18/10/31 12:02:30 INFO SparkKubernetesClientFactory: Auto-configuring K8S client using current context from users K8S config file 18/10/31 12:02:31 WARN WatchConnectionManager: Exec Failure javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target at sun.security.ssl.Alerts.getSSLException(Alerts.java:192) at sun.security.ssl.SSLSocketImpl.fatal(SSLSocketImpl.java:1949) at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:302) at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:296) at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1509) at sun.security.ssl.ClientHandshaker.processMessage(ClientHandshaker.java:216) at sun.security.ssl.Handshaker.processLoop(Handshaker.java:979) at sun.security.ssl.Handshaker.process_record(Handshaker.java:914) at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:1062) at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1375) at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1403) at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1387) at okhttp3.internal.connection.RealConnection.connectTls(RealConnection.java:281) at okhttp3.internal.connection.RealConnection.establishProtocol(RealConnection.java:251) at okhttp3.internal.connection.RealConnection.connect(RealConnection.java:151) at okhttp3.internal.connection.StreamAllocation.findConnection(StreamAllocation.java:195) at okhttp3.internal.connection.StreamAllocation.findHealthyConnection(StreamAllocation.java:121) at okhttp3.internal.connection.StreamAllocation.newStream(StreamAllocation.java:100) at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:42) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92) at 
okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67) at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67) at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92) at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:120) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67) at io.fabric8.kubernetes.client.utils.BackwardsCompatibilityInterceptor.intercept(BackwardsCompatibilityInterceptor.java:119) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67) at io.fabric8.kubernetes.client.utils.ImpersonatorInterceptor.intercept(ImpersonatorInterceptor.java:66) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67) at io.fabric8.kubernetes.client.utils.HttpClientUtils$2.intercept(HttpClientUtils.java:109) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67) at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:185) at okhttp3.RealCall$AsyncCall.execute(RealCall.java:135) at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: 
sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target at sun.security.validator.PKIXValidator.doBuild(PKIXValidator.java:387) at sun.security.validator.PKIXValidator.engineValidate(PKIXValidator.java:292) at sun.security.validator.Validator.validate(Validator.java:260) at sun.security.ssl.X509TrustManagerImpl.validate(X509TrustManagerImpl.java:324) at sun.security.ssl.X509TrustManagerImpl.checkTrusted(X509TrustManagerImpl.java:229) at sun.security.ssl.X509TrustManagerImpl.checkServerTrusted(X509TrustManagerImpl.java:124) at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1491) ... 39 more Caused by: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target at sun.security.provider.certpath.SunCertPathBuilder.build(SunCertPathBuilder.java:141) at sun.security.provider.certpath.SunCertPathBuilder.engineBuild(SunCertPathBuilder.java:126) at java.security.cert.CertPathBuilder.build(CertPathBuilder.java:280) at sun.security.validator.PKIXValidator.doBuild(PKIXValidator.java:382) ... 
45 more Exception in thread "kubernetes-dispatcher-0" Exception in thread "main" java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@611a9c09 rejected from java.util.concurrent.ScheduledThreadPoolExecutor@404819e4[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0] at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2047) at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823) at java.util.concurrent.ScheduledThreadPoolExecutor.delayedExecute(ScheduledThreadPoolExecutor.java:326) at java.util.concurrent.ScheduledThreadPoolExecutor.schedule(ScheduledThreadPoolExecutor.java:533) at java.util.concurrent.ScheduledThreadPoolExecutor.submit(ScheduledThreadPoolExecutor.java:632) at java.util.concurrent.Executors$DelegatedExecutorService.submit(Executors.java:678) at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager.scheduleReconnect(WatchConnectionManager.java:300) at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager.access$800(WatchConnectionManager.java:48) at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$2.onFailure(WatchConnectionManager.java:213) at okhttp3.internal.ws.RealWebSocket.failWebSocket(RealWebSocket.java:543) at okhttp3.internal.ws.RealWebSocket$2.onFailure(RealWebSocket.java:208) at okhttp3.RealCall$AsyncCall.execute(RealCall.java:148) at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) io.fabric8.kubernetes.client.KubernetesClientException: Failed to start websocket at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$2.onFailure(WatchConnectionManager.java:204) at
okhttp3.internal.ws.RealWebSocket.failWebSocket(RealWebSocket.java:543) at okhttp3.internal.ws.RealWebSocket$2.onFailure(RealWebSocket.java:208) at okhttp3.RealCall$AsyncCall.execute(RealCall.java:148) at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target at sun.security.ssl.Alerts.getSSLException(Alerts.java:192) at sun.security.ssl.SSLSocketImpl.fatal(SSLSocketImpl.java:1949) at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:302) at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:296) at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1509) at sun.security.ssl.ClientHandshaker.processMessage(ClientHandshaker.java:216) at sun.security.ssl.Handshaker.processLoop(Handshaker.java:979) at sun.security.ssl.Handshaker.process_record(Handshaker.java:914) at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:1062) at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1375) at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1403) at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1387) at okhttp3.internal.connection.RealConnection.connectTls(RealConnection.java:281) at okhttp3.internal.connection.RealConnection.establishProtocol(RealConnection.java:251) at okhttp3.internal.connection.RealConnection.connect(RealConnection.java:151) at okhttp3.internal.connection.StreamAllocation.findConnection(StreamAllocation.java:195) at okhttp3.internal.connection.StreamAllocation.findHealthyConnection(StreamAllocation.java:121) at 
okhttp3.internal.connection.StreamAllocation.newStream(StreamAllocation.java:100) at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:42) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67) at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67) at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92) at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:120) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67) at io.fabric8.kubernetes.client.utils.BackwardsCompatibilityInterceptor.intercept(BackwardsCompatibilityInterceptor.java:119) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67) at io.fabric8.kubernetes.client.utils.ImpersonatorInterceptor.intercept(ImpersonatorInterceptor.java:66) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67) at io.fabric8.kubernetes.client.utils.HttpClientUtils$2.intercept(HttpClientUtils.java:109) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67) at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:185) at okhttp3.RealCall$AsyncCall.execute(RealCall.java:135) ... 
4 more Caused by: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target at sun.security.validator.PKIXValidator.doBuild(PKIXValidator.java:387) at sun.security.validator.PKIXValidator.engineValidate(PKIXValidator.java:292) at sun.security.validator.Validator.validate(Validator.java:260) at sun.security.ssl.X509TrustManagerImpl.validate(X509TrustManagerImpl.java:324) at sun.security.ssl.X509TrustManagerImpl.checkTrusted(X509TrustManagerImpl.java:229) at sun.security.ssl.X509TrustManagerImpl.checkServerTrusted(X509TrustManagerImpl.java:124) at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1491) ... 39 more Caused by: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target at sun.security.provider.certpath.SunCertPathBuilder.build(SunCertPathBuilder.java:141) at sun.security.provider.certpath.SunCertPathBuilder.engineBuild(SunCertPathBuilder.java:126) at java.security.cert.CertPathBuilder.build(CertPathBuilder.java:280) at sun.security.validator.PKIXValidator.doBuild(PKIXValidator.java:382) ... 45 more 18/10/31 12:02:31 INFO ShutdownHookManager: Shutdown hook called 18/10/31 12:02:31 INFO ShutdownHookManager: Deleting directory /private/var/folders/6b/y1010qp107j9w2dhhy8csvz0000xq3/T/spark-5e649891-8a0f-4f17-bf3a-33b34082eba8 ``` Suggested reviews: mccheah liyinan926 - this is the follow up fix to the bug discovered while working on SPARK-25809 (PR #22805) Closes #22904 from rvesse/SPARK-25887. Authored-by: Rob Vesse <rvesse@dotnetrdf.org> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-01-22 10:25:21 -08:00

23697 commits