ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
LantaoJin	69dd44af19	[SPARK-27216][CORE] Upgrade RoaringBitmap to 0.7.45 to fix Kryo unsafe ser/dser issue ## What changes were proposed in this pull request? HighlyCompressedMapStatus uses RoaringBitmap to record the empty blocks. But RoaringBitmap couldn't be ser/deser with unsafe KryoSerializer. It's a bug of RoaringBitmap-0.5.11 and fixed in latest version. This is an update of #24157 ## How was this patch tested? Add a UT Closes #24264 from LantaoJin/SPARK-27216. Lead-authored-by: LantaoJin <jinlantao@gmail.com> Co-authored-by: Lantao Jin <jinlantao@gmail.com> Signed-off-by: Imran Rashid <irashid@cloudera.com>	2019-04-03 20:09:50 -05:00
Dongjoon Hyun	b51763612a	Revert "[SPARK-27278][SQL] Optimize GetMapValue when the map is a foldable and the key is not" This reverts commit `5888b15d9c`.	2019-04-03 09:41:13 -07:00
Wenchen Fan	ffb362a705	[SPARK-19712][SQL][FOLLOW-UP] reduce code duplication ## What changes were proposed in this pull request? abstract some common code into a method. ## How was this patch tested? existing tests Closes #24281 from cloud-fan/minor. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-04 00:37:57 +08:00
Liang-Chi Hsieh	d04a7371da	[MINOR][DOC][SQL] Remove out-of-date doc about ORC in DataFrameReader and Writer ## What changes were proposed in this pull request? According to the current status, `orc` is available even when Hive support isn't enabled. This is a minor doc change to reflect that. ## How was this patch tested? Doc only change. Closes #24280 from viirya/fix-orc-doc. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-04-03 09:11:09 -07:00
Maxim Gekk	1bc672366d	[SPARK-27344][SQL][TEST] Support the LocalDate and Instant classes in Java Bean encoders ## What changes were proposed in this pull request? - Added new test for Java Bean encoder of the classes: `java.time.LocalDate` and `java.time.Instant`. - Updated comment for `Encoders.bean` - New Row getters: `getLocalDate` and `getInstant` - Extended `inferDataType` to infer types for `java.time.LocalDate` -> `DateType` and `java.time.Instant` -> `TimestampType`. ## How was this patch tested? By `JavaBeanDeserializationSuite` Closes #24273 from MaxGekk/bean-instant-localdate. Lead-authored-by: Maxim Gekk <max.gekk@gmail.com> Co-authored-by: Maxim Gekk <maxim.gekk@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-03 17:45:59 +08:00
Dilip Biswal	3286bff942	[SPARK-27255][SQL] Report error when illegal expressions are hosted by a plan operator. ## What changes were proposed in this pull request? In the PR, we raise an AnalysisError when we detect the presence of aggregate expressions in a where clause. Here is the problem description from the JIRA. Aggregate functions should not be allowed in WHERE clause. But Spark SQL throws an exception when generating code. It is supposed to throw an exception during parsing or analyzing. Here is an example: ``` val df = spark.sql("select * from t where sum(ta) > 0") df.explain(true) df.show() ``` Resulting exception: ``` Exception in thread "main" java.lang.UnsupportedOperationException: Cannot generate code for expression: sum(cast(input[0, int, false] as bigint)) at org.apache.spark.sql.catalyst.expressions.Unevaluable.doGenCode(Expression.scala:291) at org.apache.spark.sql.catalyst.expressions.Unevaluable.doGenCode$(Expression.scala:290) at org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression.doGenCode(interfaces.scala:87) at org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:138) at scala.Option.getOrElse(Option.scala:138) ``` Checked the behaviour of other databases and all of them return an exception: PostgreSQL ``` select * from foo where max(c1) > 0; Error ERROR: aggregate functions are not allowed in WHERE Position: 25 ``` DB2 ``` db2 => select * from foo where max(c1) > 0; SQL0120N Invalid use of an aggregate function or OLAP function. ``` Oracle ``` select * from foo where max(c1) > 0; ORA-00934: group function is not allowed here ``` MySQL ``` select * from foo where max(c1) > 0; Invalid use of group function ``` Update This PR has been enhanced to report error when expressions such as Aggregate, Window, Generate are hosted by operators where they are invalid. ## How was this patch tested? Added tests in AnalysisErrorSuite and group-by.sql Closes #24209 from dilipbiswal/SPARK-27255. Authored-by: Dilip Biswal <dbiswal@us.ibm.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-03 13:05:06 +08:00
Maxim Gekk	1d20d13149	[SPARK-25496][SQL] Deprecate from_utc_timestamp and to_utc_timestamp ## What changes were proposed in this pull request? In the PR, I propose to deprecate `from_utc_timestamp()` and `to_utc_timestamp`, and to disable them by default. The functions can be re-enabled via the SQL config `spark.sql.legacy.utcTimestampFunc.enabled`. By default, any call to the functions throws an analysis exception. One of the reasons for the deprecation is that the functions violate the semantics of `TimestampType`, which is the number of microseconds since the epoch in the UTC time zone. Shifting microseconds since the epoch by a time zone offset doesn't make sense because the result no longer represents microseconds since the epoch in the UTC time zone, and cannot be considered a `TimestampType`. ## How was this patch tested? The changes were tested by `DateExpressionsSuite` and `DateFunctionsSuite`. Closes #24195 from MaxGekk/conv-utc-timestamp-deprecate. Lead-authored-by: Maxim Gekk <max.gekk@gmail.com> Co-authored-by: Maxim Gekk <maxim.gekk@databricks.com> Co-authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-03 10:55:56 +08:00
Dilip Biswal	b8b5acdd41	[SPARK-19712][SQL][FOLLOW-UP] Don't do partial pushdown when pushing down LeftAnti joins below Aggregate or Window operators. ## What changes were proposed in this pull request? After [23750](https://github.com/apache/spark/pull/23750), we may pushdown left anti joins below aggregate and window operators with a partial join condition. This is not correct and was pointed out by hvanhovell and cloud-fan [here](https://github.com/apache/spark/pull/23750#discussion_r270017097). This pr addresses their comments. ## How was this patch tested? Added two new tests to verify the behaviour. Closes #24253 from dilipbiswal/SPARK-19712-followup. Authored-by: Dilip Biswal <dbiswal@us.ibm.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-03 09:56:27 +08:00
Gabor Somogyi	3628242bd0	[MINOR][DSTREAMS] Add DStreamCheckpointData.cleanup warning if delete returns false ## What changes were proposed in this pull request? While reviewing #24235, I found a possible minor addition: `FileSystem.delete` returns a boolean that is not yet checked. In this PR I've added a warning message when it returns false. I've marked this as MINOR because no control flow change is introduced. ## How was this patch tested? Existing unit tests. Closes #24263 from gaborgsomogyi/SPARK-27301-minor. Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-04-02 18:34:40 -05:00
Hyukjin Kwon	d7dd59a6b4	[SPARK-26224][SQL][PYTHON][R][FOLLOW-UP] Add notes about many projects in withColumn at SparkR and PySpark as well ## What changes were proposed in this pull request? This is a followup of https://github.com/apache/spark/pull/23285. This PR adds the notes into PySpark and SparkR documentation as well. While I am here, I revised the doc a bit to make it sound a bit more neutral ## How was this patch tested? Manually built the doc and verified. Closes #24272 from HyukjinKwon/SPARK-26224. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2019-04-03 08:30:24 +09:00
Hyukjin Kwon	949d712839	[SPARK-27346][SQL] Loosen the newline assert condition on 'examples' field in ExpressionInfo ## What changes were proposed in this pull request? I haven't tested this myself on Windows and I am not 100% sure whether it causes an actual problem. However, this one line: `827383a97c/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/ExpressionInfo.java (L82)` made me investigate a lot today. My speculation is that if Spark is built on Linux and executed on Windows, multiline strings such as `5264164a67/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/mathExpressions.scala (L146-L150)` could throw an exception because the newline in the binary is `\n` but `System.lineSeparator` returns `\r\n`. I think this has not been found yet because this particular code is not released yet (see SPARK-26426). It looks better to just loosen the condition. This should be backported into branch-2.4 as well. ## How was this patch tested? N/A Closes #24274 from HyukjinKwon/SPARK-27346. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2019-04-03 08:27:41 +09:00
Yuming Wang	13c5c1fb4b	[SPARK-27180][BUILD][YARN] Fix testing issues with yarn module in Hadoop-3 ## What changes were proposed in this pull request? Fix testing issues with `yarn` module in Hadoop-3: 1. Upgrade jersey-1 to `1.19` to fix ```Cause: java.lang.NoClassDefFoundError: com/sun/jersey/spi/container/servlet/ServletContainer```. 2. Copy `ServerSocketUtil` from hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/net/ServerSocketUtil.java to fix ```java.lang.NoClassDefFoundError: org/apache/hadoop/net/ServerSocketUtil```. 3. Adapt `SessionHandler` from jetty-9.3.25.v20180904/jetty-server/src/main/java/org/eclipse/jetty/server/session/SessionHandler.java to fix ```java.lang.NoSuchMethodError: org.eclipse.jetty.server.session.SessionHandler.getSessionManager()Lorg/eclipse/jetty/server/SessionManager```. ## How was this patch tested? manual tests: ```shell build/sbt yarn/test -Pyarn build/sbt yarn/test -Phadoop-3.2 -Pyarn build/mvn -Dtest=none -DwildcardSuites=org.apache.spark.deploy.yarn.YarnClusterSuite -pl resource-managers/yarn test -Pyarn build/mvn -Dtest=none -DwildcardSuites=org.apache.spark.deploy.yarn.YarnClusterSuite -pl resource-managers/yarn test -Pyarn -Phadoop-3.2 ``` Closes #24115 from wangyum/hadoop3-yarn. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-04-02 15:38:26 -05:00
Gabor Somogyi	57aff93886	[SPARK-26998][CORE] Remove SSL configuration from executors ## What changes were proposed in this pull request? Different SSL passwords shown up as command line argument on executor side in standalone mode: * keyStorePassword * keyPassword * trustStorePassword In this PR I've removed SSL configurations from executors. ## How was this patch tested? Existing + additional unit tests. Additionally tested with standalone mode and checked the command line arguments: ``` [gaborsomogyi:~/spark] SPARK-26998(+4/-0,3)+ ± jps 94803 CoarseGrainedExecutorBackend 94818 Jps 90149 RemoteMavenServer 91925 Nailgun 94793 SparkSubmit 94680 Worker 94556 Master 398 [gaborsomogyi:~/spark] SPARK-26998(+4/-1,3)+ ± ps -ef \| egrep "94556\|94680\|94793\|94803" 502 94556 1 0 2:02PM ttys007 0:07.39 /Library/Java/JavaVirtualMachines/jdk1.8.0_152.jdk/Contents/Home/bin/java -cp /Users/gaborsomogyi/spark/conf/:/Users/gaborsomogyi/spark/assembly/target/scala-2.12/jars/* -Xmx1g org.apache.spark.deploy.master.Master --host gsomogyi-MBP.local --port 7077 --webui-port 8080 --properties-file conf/spark-defaults.conf 502 94680 1 0 2:02PM ttys007 0:07.27 /Library/Java/JavaVirtualMachines/jdk1.8.0_152.jdk/Contents/Home/bin/java -cp /Users/gaborsomogyi/spark/conf/:/Users/gaborsomogyi/spark/assembly/target/scala-2.12/jars/* -Xmx1g org.apache.spark.deploy.worker.Worker --webui-port 8081 --properties-file conf/spark-defaults.conf spark://gsomogyi-MBP.local:7077 502 94793 94782 0 2:02PM ttys007 0:35.52 /Library/Java/JavaVirtualMachines/jdk1.8.0_152.jdk/Contents/Home/bin/java -cp /Users/gaborsomogyi/spark/conf/:/Users/gaborsomogyi/spark/assembly/target/scala-2.12/jars/* -Dscala.usejavacp=true -Xmx1g org.apache.spark.deploy.SparkSubmit --master spark://gsomogyi-MBP.local:7077 --class org.apache.spark.repl.Main --name Spark shell spark-shell 502 94803 94680 0 2:03PM ttys007 0:05.20 /Library/Java/JavaVirtualMachines/jdk1.8.0_152.jdk/Contents/Home/bin/java -cp 
/Users/gaborsomogyi/spark/conf/:/Users/gaborsomogyi/spark/assembly/target/scala-2.12/jars/* -Xmx1024M -Dspark.ssl.ui.port=0 -Dspark.driver.port=60902 org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler172.30.65.186:60902 --executor-id 0 --hostname 172.30.65.186 --cores 8 --app-id app-20190326140311-0000 --worker-url spark://Worker172.30.65.186:60899 502 94910 57352 0 2:05PM ttys008 0:00.00 egrep 94556\|94680\|94793\|94803 ``` Closes #24170 from gaborgsomogyi/SPARK-26998. Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-04-02 09:18:43 -07:00
Sean Owen	d4420b455a	[SPARK-27323][CORE][SQL][STREAMING] Use Single-Abstract-Method support in Scala 2.12 to simplify code ## What changes were proposed in this pull request? Use Single Abstract Method syntax where possible (and minor related cleanup). Comments below. No logic should change here. ## How was this patch tested? Existing tests. Closes #24241 from srowen/SPARK-27323. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-04-02 07:37:05 -07:00
Dongjoon Hyun	d575a453db	Revert "[SPARK-25496][SQL] Deprecate from_utc_timestamp and to_utc_timestamp" This reverts commit `c5e83ab92c`.	2019-04-02 01:05:54 -07:00
Dongjoon Hyun	a0d807d5ab	[SPARK-26856][PYSPARK][FOLLOWUP] Fix UT failure due to wrong patterns for Kinesis assembly ## What changes were proposed in this pull request? After [SPARK-26856](https://github.com/apache/spark/pull/23797), `Kinesis` Python UT fails with `Found multiple JARs` exception due to a wrong pattern. - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/104171/console ``` Exception: Found multiple JARs: .../spark-streaming-kinesis-asl-assembly-3.0.0-SNAPSHOT.jar, .../spark-streaming-kinesis-asl-assembly_2.12-3.0.0-SNAPSHOT.jar; please remove all but one ``` It's because the pattern was changed in a wrong way. Original ```python kinesis_asl_assembly_dir, "target/scala-*/%s-*.jar" % name_prefix)) kinesis_asl_assembly_dir, "target/%s_*.jar" % name_prefix)) ``` After SPARK-26856 ```python project_full_path, "target/scala-*/%s*.jar" % jar_name_prefix)) project_full_path, "target/%s*.jar" % jar_name_prefix)) ``` The actual kinesis assembly jar files look like the following. SBT Build ``` -rw-r--r-- 1 dongjoon staff 87459461 Apr 1 19:01 spark-streaming-kinesis-asl-assembly-3.0.0-SNAPSHOT.jar -rw-r--r-- 1 dongjoon staff 309 Apr 1 18:58 spark-streaming-kinesis-asl-assembly_2.12-3.0.0-SNAPSHOT-tests.jar -rw-r--r-- 1 dongjoon staff 309 Apr 1 18:58 spark-streaming-kinesis-asl-assembly_2.12-3.0.0-SNAPSHOT.jar ``` MAVEN Build ``` -rw-r--r-- 1 dongjoon staff 8.6K Apr 1 18:55 spark-streaming-kinesis-asl-assembly_2.12-3.0.0-SNAPSHOT-sources.jar -rw-r--r-- 1 dongjoon staff 8.6K Apr 1 18:55 spark-streaming-kinesis-asl-assembly_2.12-3.0.0-SNAPSHOT-test-sources.jar -rw-r--r-- 1 dongjoon staff 8.7K Apr 1 18:55 spark-streaming-kinesis-asl-assembly_2.12-3.0.0-SNAPSHOT-tests.jar -rw-r--r-- 1 dongjoon staff 21M Apr 1 18:55 spark-streaming-kinesis-asl-assembly_2.12-3.0.0-SNAPSHOT.jar ``` In addition, after SPARK-26856, the utility function `search_jar` is shared to find `avro` jar files which are identical for both `sbt` and `mvn`. To sum up, the current jar pattern parameter cannot handle both `kinesis` and `avro` jars. This PR splits the single pattern into two patterns. ## How was this patch tested? Manual. Please note that this will remove only `Found multiple JARs` exception. Kinesis tests need more configurations to run locally. ``` $ build/sbt -Pkinesis-asl test:package streaming-kinesis-asl-assembly/assembly $ export ENABLE_KINESIS_TESTS=1 $ python/run-tests.py --python-executables python2.7 --module pyspark-streaming ``` Closes #24268 from dongjoon-hyun/SPARK-26856. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2019-04-02 14:52:56 +09:00
Marco Gaido	0b150f833c	[SPARK-26224][SQL] Advice the user when creating many project on subsequent calls to withColumn ## What changes were proposed in this pull request? We have seen many cases where users make several subsequent calls to `withColumn` on a Dataset. This now leads to the generation of many `Project` nodes on top of the plan, which causes serious problems and can even lead to `StackOverflowException`s. The PR improves the doc of `withColumn` to advise the user to avoid this pattern and do something different, i.e. a single select with all the columns they need. ## How was this patch tested? NA Closes #23285 from mgaido91/SPARK-26224. Authored-by: Marco Gaido <marcogaido91@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2019-04-02 14:12:47 +09:00
Maxim Gekk	c5e83ab92c	[SPARK-25496][SQL] Deprecate from_utc_timestamp and to_utc_timestamp ## What changes were proposed in this pull request? In the PR, I propose to deprecate `from_utc_timestamp()` and `to_utc_timestamp`, and to disable them by default. The functions can be re-enabled via the SQL config `spark.sql.legacy.utcTimestampFunc.enabled`. By default, any call to the functions throws an analysis exception. One of the reasons for the deprecation is that the functions violate the semantics of `TimestampType`, which is the number of microseconds since the epoch in the UTC time zone. Shifting microseconds since the epoch by a time zone offset doesn't make sense because the result no longer represents microseconds since the epoch in the UTC time zone, and cannot be considered a `TimestampType`. ## How was this patch tested? The changes were tested by `DateExpressionsSuite` and `DateFunctionsSuite`. Closes #24195 from MaxGekk/conv-utc-timestamp-deprecate. Lead-authored-by: Maxim Gekk <max.gekk@gmail.com> Co-authored-by: Maxim Gekk <maxim.gekk@databricks.com> Co-authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-02 10:20:06 +08:00
Liang-Chi Hsieh	eaf008ad0e	[SPARK-27329][SQL] Pruning nested field in map of map key and value from object serializers ## What changes were proposed in this pull request? If an object serializer has a map of map key/value, pruning nested fields should work. Previously, the object serializer pruner didn't recursively prune nested fields located deep in a map key or value. This patch addresses that by slightly refactoring the pruning logic. ## How was this patch tested? Added tests. Closes #24260 from viirya/SPARK-27329. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-04-01 13:53:55 -07:00
Giovanni Lanzani	92530c7db1	[SPARK-9792] Make DenseMatrix equality semantical Before, you could have this code ``` A = SparseMatrix(2, 2, [0, 2, 3], [0], [2]) B = DenseMatrix(2, 2, [2, 0, 0, 0]) B == A # False A == B # True ``` The second would be `True` as `SparseMatrix` already checks for semantic equality. This commit changes `DenseMatrix` so that equality is semantical as well. ## What changes were proposed in this pull request? Better semantic equality for DenseMatrix ## How was this patch tested? Unit tests were added, plus manual testing. Note that the code falls back to the old behavior when `other` is not a SparseMatrix. Closes #17968 from gglanzani/SPARK-9792. Authored-by: Giovanni Lanzani <giovanni@lanzani.nl> Signed-off-by: Holden Karau <holden@pigscanfly.ca>	2019-04-01 09:30:33 -07:00
Marco Gaido	5888b15d9c	[SPARK-27278][SQL] Optimize GetMapValue when the map is a foldable and the key is not ## What changes were proposed in this pull request? When `GetMapValue` contains a foldable Map and a non-foldable key, `SimplifyExtractValueOps` fails to optimize it transforming it into case when statements. The PR adds a case for covering this situation too. ## How was this patch tested? added UT Closes #24223 from mgaido91/SPARK-27278. Authored-by: Marco Gaido <marcogaido91@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-04-01 09:09:06 -07:00
Maxim Gekk	d332958109	[SPARK-27325][SQL] Add implicit encoders for LocalDate and Instant ## What changes were proposed in this pull request? Added implicit encoders for the `java.time.LocalDate` and `java.time.Instant` classes. This allows creation of datasets from instances of the types. ## How was this patch tested? Added new tests to `JavaDatasetSuite` and `DatasetSuite`. Closes #24249 from MaxGekk/instant-localdate-encoders. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-01 23:02:48 +08:00
Marco Gaido	8012f55a9b	[SPARK-26812][SQL] Report correct nullability for complex datatypes in Union ## What changes were proposed in this pull request? When there is a `Union`, the reported output datatypes are the ones of the first plan and the nullability is updated according to all the plans. For complex types, though, the nullability of their elements is not updated using the types from the other plans. This means that the nullability of the inner elements is the one of the first plan. If this is not compatible with the one of other plans, errors can happen (as reported in the JIRA). The PR proposes to update the nullability of the inner elements of complex datatypes according to most permissive value of all the plans. ## How was this patch tested? added UT Closes #23726 from mgaido91/SPARK-26812. Authored-by: Marco Gaido <marcogaido91@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-01 22:22:10 +08:00
Yuming Wang	f799e34962	[MINOR][BUILD] Upgrade apache-rat to 0.13 ## What changes were proposed in this pull request? This PR upgrades `apache-rat` to 0.13. Issues fixed by 0.13: https://issues.apache.org/jira/issues/?jql=project%20%3D%20RAT%20AND%20fixVersion%20%3D%200.13 ## How was this patch tested? manual tests Closes #24262 from wangyum/apache-rat. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2019-04-01 16:44:42 +09:00
“attilapiros”	9eb896cc3b	[SPARK-27333][TEST] Update thread audit whitelist to skip broadcast-exchange-.*, process reaper and StatisticsDataReferenceCleaner threads ## What changes were proposed in this pull request? Update thread audit whitelist to skip threads of the global broadcast exchange thread pool, process reaper and Hadoop FS statistics data reference cleaner thread. ## How was this patch tested? Via existing UT using broadcast exchange via `sbt` i.e: ``` > project sql > testOnly *.SessionStateSuite -- -z "fork new sessions and run query on inherited table" ``` Before (long line wrapped manually to save horizontal scrolling for reviewers): ``` ===== POSSIBLE THREAD LEAK IN SUITE o.a.s.sql.SessionStateSuite, thread names: broadcast-exchange-6, broadcast-exchange-0, broadcast-exchange-2, broadcast-exchange-5, broadcast-exchange-7, broadcast-exchange-4, broadcast-exchange-1, process reaper, broadcast-exchange-3, org.apache.hadoop.fs.FileSystem$Statistics$StatisticsDataReferenceCleaner ===== ``` After this change no possible thread leak is detected. Closes #24244 from attilapiros/thread-audit-minor. Authored-by: “attilapiros” <piros.attila.zsolt@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-03-31 17:33:31 -07:00
chakravarthiT	fc9aad0957	[SPARK-27253][SQL] Prioritizes parent session's SQLConf over SparkConf when cloning a session ## What changes were proposed in this pull request? A cloned session should prioritize the parent's `SQLConf` over `SparkConf`. Currently, when a session is cloned, the child session takes the configuration set in `SparkConf` even when the same properties are set in its parent's `SQLConf`: `mergeSparkConf` in `BaseSessionStateBuilder`'s `conf` overwrites `SQLConf` values with those set in `SparkConf`. This PR proposes to call `mergeSparkConf` only when the parent session is empty. See the code references below. 1. Parent's `sessionState` `c26379b446/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala (L268)` `c26379b446/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala (L157-L161)` `5dab5f651f/sql/core/src/main/scala/org/apache/spark/sql/internal/BaseSessionStateBuilder.scala (L88-L90)` 2. Child `sessionState` `c26379b446/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala (L269)` `c26379b446/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala (L155)` `c26379b446/sql/core/src/main/scala/org/apache/spark/sql/internal/SessionState.scala (L102)` `c26379b446/sql/core/src/main/scala/org/apache/spark/sql/internal/SessionState.scala (L74)` `5dab5f651f/sql/core/src/main/scala/org/apache/spark/sql/internal/BaseSessionStateBuilder.scala (L305)` `5dab5f651f/sql/core/src/main/scala/org/apache/spark/sql/internal/BaseSessionStateBuilder.scala (L283)` `5dab5f651f/sql/core/src/main/scala/org/apache/spark/sql/internal/BaseSessionStateBuilder.scala (L292)` `5dab5f651f/sql/core/src/main/scala/org/apache/spark/sql/internal/BaseSessionStateBuilder.scala (L88-L90)` ## How was this patch tested? Added UT and with existing Unit Tests. Closes #24189 from chakravarthiT/CloneDiscardsConf. Authored-by: chakravarthiT <tcchakra@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2019-04-01 09:33:18 +09:00
Takeshi Yamamuro	885aab40a2	[SPARK-27266][SQL] Support ANALYZE TABLE to collect tables stats for cached catalog views ## What changes were proposed in this pull request? The current master doesn't support ANALYZE TABLE to collect table stats for catalog views, even if they are cached, as follows: ```scala scala> sql(s"CREATE VIEW v AS SELECT 1 c") scala> sql(s"CACHE LAZY TABLE v") scala> sql(s"ANALYZE TABLE v COMPUTE STATISTICS") org.apache.spark.sql.AnalysisException: ANALYZE TABLE is not supported on views.; ... ``` Since SPARK-25196 added support for an ANALYZE command that collects column statistics for cached catalog views, we could support table stats, too. ## How was this patch tested? Added tests in `StatisticsCollectionSuite` and `InMemoryColumnarQuerySuite`. Closes #24200 from maropu/SPARK-27266. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-03-31 17:24:21 -07:00
Maxim Gekk	6115a5e1a0	[SPARK-27327][SQL] New JSON benchmarks: functions, Dataset[String] ## What changes were proposed in this pull request? Added new benchmarks for: 1. JSON functions: `from_json`, `json_tuple` and `get_json_object` 2. Parsing `Dataset[String]` with JSON records 3. Comparing just splitting input text by lines with schema inferring, per-line parsing when encoding is set and not set. Also existing benchmarks were refactored to use the `NoOp` datasource to eliminate overhead of triggers like `.filter((_: Row) => true).count()`. ## How was this patch tested? By running `JSONBenchmark` locally. Closes #24252 from MaxGekk/json-benchmark-func. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2019-04-01 08:33:16 +09:00
gatorsmile	92b6f86f6d	[SPARK-27244][CORE][TEST][FOLLOWUP] toDebugString redacts sensitive information ## What changes were proposed in this pull request? This PR is a FollowUp of https://github.com/apache/spark/pull/24196. It improves the test case by using the parameters that are being used in the actual scenarios. ## How was this patch tested? N/A Closes #24257 from gatorsmile/followupSPARK-27244. Authored-by: gatorsmile <gatorsmile@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-03-30 22:58:28 -07:00
Yuming Wang	b670f39fc6	[SPARK-24793][FOLLOW-UP][K8S] Remove duplicate declaration of mockito-core ## What changes were proposed in this pull request? ``` [WARNING] Some problems were encountered while building the effective model for org.apache.spark:spark-kubernetes_2.12:jar:3.0.0-SNAPSHOT [WARNING] 'dependencies.dependency.(groupId:artifactId:type:classifier)' must be unique: org.mockito:mockito-core:jar -> duplicate declaration of version (?) org.apache.spark:spark-kubernetes_2.12:[unknown-version], /Users/yumwang/spark/resource-managers/kubernetes/core/pom.xml, line 98, column 17 ``` This PR removes the duplicate declaration of `mockito-core`. ## How was this patch tested? N/A Closes #24256 from wangyum/SPARK-24793-FOLLOW-UP. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-03-30 21:29:32 -07:00
Felix Cheung	fa0f791d4d	[MINOR][R] fix R project description ## What changes were proposed in this pull request? update as per this NOTE when running CRAN check ``` The Title field should be in title case, current version then in title case: ‘R Front end for 'Apache Spark'’ ‘R Front End for 'Apache Spark'’ ``` Closes #24255 from felixcheung/rdesc. Authored-by: Felix Cheung <felixcheung_m@hotmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-03-30 21:00:46 -07:00
Sean Owen	754f820035	[SPARK-26918][DOCS] All .md should have ASF license header ## What changes were proposed in this pull request? Add AL2 license to metadata of all .md files. This seemed to be the tidiest way as it will get ignored by .md renderers and other tools. Attempts to write them as markdown comments revealed that there is no such standard thing. ## How was this patch tested? Doc build Closes #24243 from srowen/SPARK-26918. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-30 19:49:45 -05:00
Dongjoon Hyun	88ea319871	Revert "[SPARK-27192][CORE] spark.task.cpus should be less or equal than spark.executor.cores" This reverts commit `f8fa564dec`.	2019-03-30 16:35:34 -07:00
Gengliang Wang	5dab5f651f	[SPARK-27326][SQL] Fall back all v2 file sources in `InsertIntoTable` to V1 FileFormat ## What changes were proposed in this pull request? In the first PR for file source V2, there was a rule for falling back Orc V2 table to OrcFileFormat: https://github.com/apache/spark/pull/23383/files#diff-57e8244b6964e4f84345357a188421d5R34 As we are migrating more file sources to data source V2, we should make the rule more generic. This PR proposes to: 1. Rename the rule `FallbackOrcDataSourceV2` to `FallBackFileSourceV2`. The name is more generic, and "fall back" is used as a verb while "fallback" is a noun. 2. Rename the method `fallBackFileFormat` in `FileDataSourceV2` to `fallbackFileFormat`, since "fallback" is used as a noun here. 3. Add a new method `fallbackFileFormat` in `FileTable`. This is for falling back to V1 in the rule `FallbackOrcDataSourceV2`. ## How was this patch tested? Existing Unit tests. Closes #24251 from gengliangwang/fallbackV1Rule. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-03-30 14:38:26 -07:00
Yuming Wang	0cbef34ede	[MINOR][BUILD] Add ASF license header to plugins.sbt ## What changes were proposed in this pull request? This PR adds the ASF license header to plugins.sbt, otherwise: ![image](https://user-images.githubusercontent.com/5399861/55273959-670b8800-530d-11e9-9b6f-214a3cde802e.png) ## How was this patch tested? The warning disappears after adding the ASF license header: ![image](https://user-images.githubusercontent.com/5399861/55273961-6c68d280-530d-11e9-9d15-5fb73a1b991e.png) Closes #24248 from wangyum/plugins.sbt. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-30 12:47:02 -05:00
Yuming Wang	44b0d328e5	[MINOR] Update the scala version of LICENSE-binary to 2.12 ## What changes were proposed in this pull request? Update the scala version of `LICENSE-binary` to 2.12. ## How was this patch tested? N/A Closes #24250 from wangyum/LICENSE-binary. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-30 12:46:08 -05:00
liulijia	f8fa564dec	[SPARK-27192][CORE] spark.task.cpus should be less or equal than spark.executor.cores ## What changes were proposed in this pull request? spark.task.cpus should be less than or equal to spark.executor.cores when using static executor allocation. ## How was this patch tested? manual Closes #24131 from liutang123/SPARK-27192. Authored-by: liulijia <liutang123@yeah.net> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-30 12:38:05 -05:00
Sean Owen	2ec650d843	[SPARK-27267][CORE] Update snappy to avoid error when decompressing empty serialized data ## What changes were proposed in this pull request? (See JIRA for problem statement) Update snappy 1.1.7.1 -> 1.1.7.3 to pick up an empty-stream and Java 9 fix. There appear to be no other changes of consequence: https://github.com/xerial/snappy-java/blob/master/Milestone.md ## How was this patch tested? Existing tests Closes #24242 from srowen/SPARK-27267. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-30 02:41:24 -05:00
Kent Yao	f4c73b7c68	[SPARK-27301][DSTREAM] Shorten the FileSystem cached life cycle to the cleanup method inner scope ## What changes were proposed in this pull request? The cached FileSystem's token will expire if no tokens explicitly are add into it. ```scala 19/03/28 13:40:16 INFO storage.BlockManager: Removing RDD 83189 19/03/28 13:40:16 INFO rdd.MapPartitionsRDD: Removing RDD 82860 from persistence list 19/03/28 13:40:16 INFO spark.ContextCleaner: Cleaned shuffle 6005 19/03/28 13:40:16 INFO storage.BlockManager: Removing RDD 82860 19/03/28 13:40:16 INFO scheduler.ReceivedBlockTracker: Deleting batches: 19/03/28 13:40:16 INFO scheduler.InputInfoTracker: remove old batch metadata: 1553750250000 ms 19/03/28 13:40:17 WARN security.UserGroupInformation: PriviledgedActionException as:ursHADOOP.HZ.NETEASE.COM (auth:KERBEROS) cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): token (HDFS_DELEGATION_TOKEN token 53240500 for urs) is expired, current time: 2019-03-28 13:40:17,010+0800 expected renewal time: 2019-03-28 13:39:48,523+0800 19/03/28 13:40:17 WARN ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): token (HDFS_DELEGATION_TOKEN token 53240500 for urs) is expired, current time: 2019-03-28 13:40:17,010+0800 expected renewal time: 2019-03-28 13:39:48,523+0800 19/03/28 13:40:17 WARN security.UserGroupInformation: PriviledgedActionException as:ursHADOOP.HZ.NETEASE.COM (auth:KERBEROS) cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): token (HDFS_DELEGATION_TOKEN token 53240500 for urs) is expired, current time: 2019-03-28 13:40:17,010+0800 expected renewal time: 2019-03-28 13:39:48,523+0800 19/03/28 13:40:17 WARN hdfs.LeaseRenewer: Failed to renew lease for [DFSClient_NONMAPREDUCE_-1396157959_1] for 53 seconds. Will retry shortly ... 
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): token (HDFS_DELEGATION_TOKEN token 53240500 for urs) is expired, current time: 2019-03-28 13:40:17,010+0800 expected renewal time: 2019-03-28 13:39:48,523+0800 at org.apache.hadoop.ipc.Client.call(Client.java:1468) at org.apache.hadoop.ipc.Client.call(Client.java:1399) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232) at com.sun.proxy.$Proxy11.renewLease(Unknown Source) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.renewLease(ClientNamenodeProtocolTranslatorPB.java:571) at sun.reflect.GeneratedMethodAccessor40.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) at com.sun.proxy.$Proxy12.renewLease(Unknown Source) at org.apache.hadoop.hdfs.DFSClient.renewLease(DFSClient.java:878) at org.apache.hadoop.hdfs.LeaseRenewer.renew(LeaseRenewer.java:417) at org.apache.hadoop.hdfs.LeaseRenewer.run(LeaseRenewer.java:442) at org.apache.hadoop.hdfs.LeaseRenewer.access$700(LeaseRenewer.java:71) at org.apache.hadoop.hdfs.LeaseRenewer$1.run(LeaseRenewer.java:298) at java.lang.Thread.run(Thread.java:748) ``` This PR shorten the FileSystem cached life cycle to the cleanup method inner scope in case of token expiry. ## How was this patch tested? existing ut Closes #24235 from yaooqinn/SPARK-27301. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-30 02:35:49 -05:00
Sean Owen	e6d8d0f13f	[SPARK-27121][REPL] Resolve Scala compiler failure for Java 9+ in REPL ## What changes were proposed in this pull request? Avoid trying to extract the classpath of the environment from a URLClassLoader in Java 11, as the default classloader isn't one. Use `java.class.path` instead. ## How was this patch tested? Existing tests, manually tested under Java 11. Closes #24239 from srowen/SPARK-27121.0. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-30 02:30:34 -05:00
10129659	144b35fe3a	[SPARK-27320][SQL] Replacing index with iterator to traverse the expressions list in AggregationIterator, which make it simpler ## What changes were proposed in this pull request? In AggregationIterator's loop function, we access the expressions by `expressions(i)`; the type of `expressions` is `::`, a subtype of list. ``` while (i < expressionsLength) { val func = expressions(i).aggregateFunction ``` This PR replaces the index with an iterator to access the expressions list, which makes it simpler. ## How was this patch tested? Existing tests. Closes #24238 from eatoncys/array. Authored-by: 10129659 <chen.yanshan@zte.com.cn> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-30 02:27:12 -05:00
Takuya UESHIN	f176dd3f28	[SPARK-27314][SQL] Deduplicate exprIds for Union. ## What changes were proposed in this pull request? We have been having a potential problem with `Union` when the children have the same expression id in their outputs, which happens when self-union. ## How was this patch tested? Modified some tests to adjust plan changes. Closes #24236 from ueshin/issues/SPARK-27314/dedup_union. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-03-29 14:05:38 -07:00
Maxim Gekk	61561c1c2d	[SPARK-27252][SQL][FOLLOWUP] Calculate min and max days independently from time zone in ComputeCurrentTimeSuite ## What changes were proposed in this pull request? This fixes the `analyzer should replace current_date with literals` test in `ComputeCurrentTimeSuite` by making calculation of `min` and `max` days independent from time zone. ## How was this patch tested? by `ComputeCurrentTimeSuite`. Closes #24240 from MaxGekk/current-date-followup. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-29 14:28:36 -05:00
Ninad Ingole	dbc7ce18b9	[SPARK-27244][CORE] Redact Passwords While Using Option logConf=true ## What changes were proposed in this pull request? When logConf is set to true, config keys that contain passwords were printed in cleartext in the driver log. This change uses the existing redact method in Utils to redact all passwords based on the redaction pattern in SparkConf, and then prints the conf to the driver log, ensuring that sensitive information like passwords is not printed in clear text. ## How was this patch tested? This patch was tested through `SparkConfSuite` and then the entire unit test suite through sbt. Closes #24196 from ninadingole/SPARK-27244. Authored-by: Ninad Ingole <robert.wallis@example.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-29 14:16:53 -05:00
Maxim Gekk	06abd06112	[SPARK-27252][SQL] Make current_date() independent from time zones ## What changes were proposed in this pull request? This makes the `CurrentDate` expression and `current_date` function independent from time zone settings. The new result is the number of days since the epoch in the `UTC` time zone. Previously, Spark shifted the current date (in the `UTC` time zone) according to the session time zone, which violates the definition of `DateType`: the number of days since the epoch (which is an absolute point in time, midnight of Jan 1 1970 in UTC time). The change makes `CurrentDate` consistent with `CurrentTimestamp`, which is independent from time zone too. ## How was this patch tested? The changes were tested by existing test suites like `DateExpressionsSuite`. Closes #24185 from MaxGekk/current-date. Lead-authored-by: Maxim Gekk <max.gekk@gmail.com> Co-authored-by: Maxim Gekk <maxim.gekk@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-03-28 18:44:08 -07:00
Xianyang Liu	50cded590f	[MINOR] Move java file to java directory ## What changes were proposed in this pull request? move ```scala org.apache.spark.sql.execution.streaming.BaseStreamingSource org.apache.spark.sql.execution.streaming.BaseStreamingSink ``` to java directory ## How was this patch tested? Existing UT. Closes #24222 from ConeyLiu/move-scala-to-java. Authored-by: Xianyang Liu <xianyang.liu@intel.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-28 12:11:00 -05:00
zhoukang	43bf4ae641	[SPARK-26914][SQL] Fix scheduler pool may be unpredictable when we only want to use default pool and do not set spark.scheduler.pool for the session ## What changes were proposed in this pull request? When using fair scheduler mode for the Thrift server, we may get unpredictable results. ``` val pool = sessionToActivePool.get(parentSession.getSessionHandle) if (pool != null) { sqlContext.sparkContext.setLocalProperty(SparkContext.SPARK_SCHEDULER_POOL, pool) } ``` The cause is that we use a thread pool to execute queries for the Thrift server, and when we call setLocalProperty we may get unpredictable behavior. ``` /** * Set a local property that affects jobs submitted from this thread, such as the Spark fair * scheduler pool. User-defined properties may also be set here. These properties are propagated * through to worker tasks and can be accessed there via * [[org.apache.spark.TaskContext#getLocalProperty]]. * * These properties are inherited by child threads spawned from this thread. This * may have unexpected consequences when working with thread pools. The standard java * implementation of thread pools have worker threads spawn other worker threads. * As a result, local properties may propagate unpredictably. */ def setLocalProperty(key: String, value: String) { if (value == null) { localProperties.get.remove(key) } else { localProperties.get.setProperty(key, value) } } ``` I posted an example at https://jira.apache.org/jira/browse/SPARK-26914 . ## How was this patch tested? UT Closes #23826 from caneGuy/zhoukang/fix-scheduler-error. Authored-by: zhoukang <zhoukang199191@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-28 09:24:16 -05:00
Wenchen Fan	e4a968d829	[MINOR][CORE] Remove import scala.collection.Set in TaskSchedulerImpl ## What changes were proposed in this pull request? I was playing with the scheduler and found this weird thing. In `TaskSchedulerImpl` we import `scala.collection.Set` without any reason. This is bad in practice, as it silently changes the actual class when we simply type `Set`, which by default should point to the immutable set. This change only affects one method: `getExecutorsAliveOnHost`. I checked all the caller side and none of them need a general `Set` type. ## How was this patch tested? N/A Closes #24231 from cloud-fan/minor. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2019-03-28 21:12:18 +09:00
Stavros Kontopoulos	39577a27a0	[SPARK-24902][K8S] Add PV integration tests ## What changes were proposed in this pull request? - Adds persistent volume integration tests - Adds a custom tag to the test to exclude it if it is run against a cloud backend. - Assumes default fs type for the host, AFAIK that is ext4. ## How was this patch tested? Manually run the tests against minikube as usual: ``` [INFO] --- scalatest-maven-plugin:1.0:test (integration-test) spark-kubernetes-integration-tests_2.12 --- Discovery starting. Discovery completed in 192 milliseconds. Run starting. Expected test count is: 16 KubernetesSuite: - Run SparkPi with no resources - Run SparkPi with a very long application name. - Use SparkLauncher.NO_RESOURCE - Run SparkPi with a master URL without a scheme. - Run SparkPi with an argument. - Run SparkPi with custom labels, annotations, and environment variables. - Run extraJVMOptions check on driver - Run SparkRemoteFileTest using a remote data file - Run SparkPi with env and mount secrets. - Run PySpark on simple pi.py example - Run PySpark with Python2 to test a pyfiles example - Run PySpark with Python3 to test a pyfiles example - Run PySpark with memory customization - Run in client mode. - Start pod creation from template - Test PVs with local storage ``` Closes #23514 from skonto/pvctests. Authored-by: Stavros Kontopoulos <stavros.kontopoulos@lightbend.com> Signed-off-by: shane knapp <incomplete@gmail.com>	2019-03-27 13:00:56 -07:00
Gengliang Wang	49b0411549	[SPARK-27291][SQL] PartitioningAwareFileIndex: Filter out empty files on listing files ## What changes were proposed in this pull request? In https://github.com/apache/spark/pull/23130, all empty files are excluded from target file splits in `FileSourceScanExec`. In File source V2, we should keep the same behavior. This PR suggests to filter out empty files on listing files in `PartitioningAwareFileIndex` so that the upper level doesn't need to handle them. ## How was this patch tested? Unit test Closes #24227 from gengliangwang/ignoreEmptyFile. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-03-27 10:08:38 -07:00

24403 commits