ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Sean Owen	b76f262fc8	[SPARK-27032][TEST] De-flake org.apache.spark.sql.execution.streaming.HDFSMetadataLogSuite.HDFSMetadataLog: metadata directory collision ## What changes were proposed in this pull request? Reduce work in HDFSMetadataLogSuite test to possibly de-flake it. ## How was this patch tested? Existing tests Closes #23937 from srowen/SPARK-27032. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2019-03-04 13:36:41 +09:00
Jungtaek Lim (HeartSaVioR)	34f606678a	[SPARK-27001][SQL] Refactor "serializerFor" method between ScalaReflection and JavaTypeInference ## What changes were proposed in this pull request? This patch proposes refactoring `serializerFor` method between `ScalaReflection` and `JavaTypeInference`, being consistent with what we refactored for `deserializerFor` in #23854. This patch also extracts the logic on recording walk type path since the logic is duplicated across `serializerFor` and `deserializerFor` with `ScalaReflection` and `JavaTypeInference`. ## How was this patch tested? Existing tests. Closes #23908 from HeartSaVioR/SPARK-27001. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-03-04 10:45:48 +08:00
Dilip Biswal	04ad559ab6	[SPARK-27016][SQL][BUILD] Treat all antlr warnings as errors while generating parser from the sql grammar file. ## What changes were proposed in this pull request? Use the maven plugin option `treatWarningsAsErrors` to make sure the warnings are treated as errors while generating the parser file. In the absence of it, we may inadvertently introduce problems while making grammar changes. Please refer to [PR-23897](https://github.com/apache/spark/pull/23897) to know more about the context. ## How was this patch tested? We can use two ways to build Spark 1) sbt 2) Maven In this PR, we made a change to configure the maven antlr plugin to include a parameter that makes antlr4 report error on warning. However, when spark is built using sbt, we use the sbt antlr plugin which does not allow us to pass this additional compilation flag. More info on sbt-antlr plugin can be found at [link](https://github.com/ihji/sbt-antlr4/blob/master/src/main/scala/com/simplytyped/Antlr4Plugin.scala) In summary, this fix is only applicable when we use maven to build. Closes #23925 from dilipbiswal/antlr_fix. Authored-by: Dilip Biswal <dbiswal@us.ibm.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-03 10:02:25 -06:00
Sean Owen	d8754df2bf	[SPARK-27029][BUILD] Update Thrift to 0.12.0 ## What changes were proposed in this pull request? Update Thrift to 0.12.0 to pick up bug and security fixes. Changes: https://github.com/apache/thrift/blob/master/CHANGES.md The important one is for https://issues.apache.org/jira/browse/THRIFT-4506 ## How was this patch tested? Existing tests. A quick local test suggests this works. Closes #23935 from srowen/SPARK-27029. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-03-02 17:28:37 -08:00
Marcelo Vanzin	d00eca75b3	[SPARK-26048][BUILD] Enable flume profile when creating 2.x releases. Closes #23931 from vanzin/SPARK-26048. Authored-by: Marcelo Vanzin <vanzin@cloudera.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-03-02 08:14:06 -08:00
Huaxin Gao	be5d95adc6	[SPARK-27007][PYTHON] add rawPrediction to OneVsRest in PySpark ## What changes were proposed in this pull request? Add RawPrediction to OneVsRest in PySpark to make it consistent with scala implementation ## How was this patch tested? Add doctest Closes #23910 from huaxingao/spark-27007. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-02 09:09:28 -06:00
Sean Owen	a97a19dd93	[SPARK-26807][DOCS] Clarify that Pyspark is on PyPi now ## What changes were proposed in this pull request? Docs still say that Spark will be available on PyPi "in the future"; just needs to be updated. ## How was this patch tested? Doc build Closes #23933 from srowen/SPARK-26807. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2019-03-02 14:23:53 +09:00
Dilip Biswal	4a486d6716	[SPARK-26982][SQL] Enhance describe framework to describe the output of a query. ## What changes were proposed in this pull request? Currently we can use `df.printSchema` to discover the schema information for a query. We should have a way to describe the output schema of a query using SQL interface. Example: DESCRIBE SELECT * FROM desc_table DESCRIBE QUERY SELECT * FROM desc_table ```SQL spark-sql> create table desc_table (c1 int comment 'c1-comment', c2 decimal comment 'c2-comment', c3 string); spark-sql> desc select * from desc_table; c1 int c1-comment c2 decimal(10,0) c2-comment c3 string NULL ``` ## How was this patch tested? Added a new test under SQLQueryTestSuite and SparkSqlParserSuite Closes #23883 from dilipbiswal/dkb_describe_query. Authored-by: Dilip Biswal <dbiswal@us.ibm.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-03-02 11:21:23 +08:00
manuzhang	81dd21fda9	[SPARK-26977][CORE] Fix warn against subclassing scala.App ## What changes were proposed in this pull request? Fix warn against subclassing scala.App ## How was this patch tested? Manual test Closes #23903 from manuzhang/fix_submit_warning. Authored-by: manuzhang <owenzhang1990@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-01 17:37:58 -06:00
Dilip Biswal	5fd62ca65a	[SPARK-26215][SQL][FOLLOW-UP][MINOR] Fix the warning from ANTLR4 ## What changes were proposed in this pull request? I see the following new warning from ANTLR4 after SPARK-26215 after it added `SCHEMA` keyword in the reserved/unreserved list. This is a minor PR to clean up the warning. ``` WARNING] warning(125): org/apache/spark/sql/catalyst/parser/SqlBase.g4:784:90: implicit definition of token SCHEMA in parser [WARNING] .../apache/spark/org/apache/spark/sql/catalyst/parser/SqlBase.g4 [784:90]: implicit definition of token SCHEMA in parser ``` ## How was this patch tested? Manually built catalyst after the fix to verify Closes #23897 from dilipbiswal/minor_parser_token. Authored-by: Dilip Biswal <dbiswal@us.ibm.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-03-01 12:34:15 -08:00
Marcelo Vanzin	9f16af6366	[K8S][MINOR] Log minikube version when running integration tests. Closes #23893 from vanzin/minikube-version. Authored-by: Marcelo Vanzin <vanzin@cloudera.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-03-01 11:24:08 -08:00
SongYadong	86b25c4350	[SPARK-26967][CORE] Put MetricsSystem instance names together for clearer management ## What changes were proposed in this pull request? `MetricsSystem` instance creations have a scattered distribution in the project code. So do their names. It may cause some inconvenience for browsing and management. This PR tries to put them together. In this way, we can have a uniform location for adding or removing them, and have an overall view of `MetricsSystem` instances in current project. It's also helpful for maintaining user documents by avoiding missing something. ## How was this patch tested? Existing unit tests. Closes #23869 from SongYadong/metrics_system_inst_manage. Lead-authored-by: SongYadong <song.yadong1@zte.com.cn> Co-authored-by: walter2001 <ydsong2007@163.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-01 11:49:43 -06:00
liuxian	02bbe977ab	[MINOR] Remove unnecessary gets when getting a value from map. ## What changes were proposed in this pull request? Redundant `get` when getting a value from `Map` given a key. ## How was this patch tested? N/A Closes #23901 from 10110346/removegetfrommap. Authored-by: liuxian <liu.xian3@zte.com.cn> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-01 11:48:07 -06:00
Sean Owen	131b464d0c	[SPARK-26986][ML][FOLLOWUP] Add JAXB reference impl to build for Java 9+ ## What changes were proposed in this pull request? Remove a few new JAXB dependencies that shouldn't be necessary now. See https://github.com/apache/spark/pull/23890#issuecomment-468299922 ## How was this patch tested? Existing tests Closes #23923 from srowen/SPARK-26986.2. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-01 11:23:40 -06:00
Yifei Huang	bc7592ba11	[SPARK-27009][TEST] Add Standard Deviation to benchmark results ## What changes were proposed in this pull request? Add standard deviation to the stats taken during benchmark testing. ## How was this patch tested? Manually ran a few benchmark tests locally and visually inspected the output Closes #23914 from yifeih/spark-27009-stdev. Authored-by: Yifei Huang <yifeih@palantir.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-02-28 20:55:55 -08:00
Marcelo Vanzin	14f714fb30	[SPARK-26420][K8S] Generate more unique IDs when creating k8s resource names. Using the current time as an ID is more prone to clashes than people generally realize, so try to make things a bit more unique without necessarily using a UUID, which would eat too much space in the names otherwise. The implemented approach uses some bits from the current time, plus some random bits, which should be more resistant to clashes. Closes #23805 from vanzin/SPARK-26420. Authored-by: Marcelo Vanzin <vanzin@cloudera.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-02-28 20:39:13 -08:00
Maxim Gekk	8e5f9995ca	[SPARK-27008][SQL] Support java.time.LocalDate as an external type of DateType ## What changes were proposed in this pull request? In the PR, I propose to add new Catalyst type converter for `DateType`. It should be able to convert `java.time.LocalDate` to/from `DateType`. Main motivations for the changes: - Smoothly support Java 8 time API - Avoid inconsistency of calendars used inside of Spark 3.0 (Proleptic Gregorian calendar) and `java.sql.Date` (hybrid calendar - Julian + Gregorian). - Make conversion independent from current system timezone. By default, Spark converts values of `DateType` to `java.sql.Date` instances but the SQL config `spark.sql.datetime.java8API.enabled` can change the behavior. If it is set to `true`, Spark uses `java.time.LocalDate` as external type for `DateType`. ## How was this patch tested? Added new tests to `CatalystTypeConvertersSuite` to check conversion of `DateType` to/from `java.time.LocalDate`, `JavaUDFSuite`/ `UDFSuite` to test usage of `LocalDate` type in Scala/Java UDFs. Closes #23913 from MaxGekk/date-localdate. Authored-by: Maxim Gekk <maxim.gekk@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-03-01 11:04:28 +08:00
Imran Rashid	c8e7eb1fa7	[SPARK-26774][CORE] Update some docs on TaskSchedulerImpl. A couple of places in TaskSchedulerImpl could use a minor doc update on threading concerns. There is one bug fix here, but only in sc.killTaskAttempt() which is probably not used much. Closes #23874 from squito/SPARK-26774. Authored-by: Imran Rashid <irashid@cloudera.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-02-28 11:30:20 -08:00
zhengruifeng	acd086f207	[SPARK-19591][ML][PYSPARK][FOLLOWUP] Add sample weights to decision trees ## What changes were proposed in this pull request? Add sample weights to decision trees ## How was this patch tested? updated testsuites Closes #23818 from zhengruifeng/py_tree_support_sample_weight. Authored-by: zhengruifeng <ruifengz@foxmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-02-27 21:11:30 -06:00
Hyukjin Kwon	6e31ccf2a1	[SPARK-26895][CORE][FOLLOW-UP] Uninitializing log after `prepareSubmitEnvironment` in SparkSubmit ## What changes were proposed in this pull request? Currently, if I run `spark-shell` in my local, it started to show the logs as below: ``` $ ./bin/spark-shell ... 19/02/28 04:42:43 INFO SecurityManager: Changing view acls to: hkwon 19/02/28 04:42:43 INFO SecurityManager: Changing modify acls to: hkwon 19/02/28 04:42:43 INFO SecurityManager: Changing view acls groups to: 19/02/28 04:42:43 INFO SecurityManager: Changing modify acls groups to: 19/02/28 04:42:43 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hkwon); groups with view permissions: Set(); users with modify permissions: Set(hkwon); groups with modify permissions: Set() 19/02/28 04:42:43 INFO SignalUtils: Registered signal handler for INT 19/02/28 04:42:48 INFO SparkContext: Running Spark version 3.0.0-SNAPSHOT 19/02/28 04:42:48 INFO SparkContext: Submitted application: Spark shell 19/02/28 04:42:48 INFO SecurityManager: Changing view acls to: hkwon ``` Seems to be the cause is https://github.com/apache/spark/pull/23806 and `prepareSubmitEnvironment` looks actually reinitializing the logging again. This PR proposes to uninitializing log later after `prepareSubmitEnvironment`. ## How was this patch tested? Manually tested. Closes #23911 from HyukjinKwon/SPARK-26895. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-02-27 17:01:30 -08:00
Gabor Somogyi	76e0b6bafb	[SPARK-27002][SS] Get kafka delegation tokens right before consumer/producer created ## What changes were proposed in this pull request? Spark not always picking up the latest Kafka delegation tokens even if a new one properly obtained. In the PR I'm setting delegation tokens right before `KafkaConsumer` and `KafkaProducer` creation to be on the safe side. ## How was this patch tested? Long running Kafka to Kafka tests on 4 node cluster with randomly thrown artificial exceptions. Test scenario: * 4 node cluster * Yarn * Kafka broker version 2.1.0 * security.protocol = SASL_SSL * sasl.mechanism = SCRAM-SHA-512 Kafka broker settings: * delegation.token.expiry.time.ms=600000 (10 min) * delegation.token.max.lifetime.ms=1200000 (20 min) * delegation.token.expiry.check.interval.ms=300000 (5 min) After each 7.5 minutes new delegation token obtained from Kafka broker (10 min * 0.75). But when token expired after 10 minutes (Spark obtains new one and doesn't renew the old), the brokers expiring thread comes after each 5 minutes (invalidates expired tokens) and artificial exception has been thrown inside the Spark application (such case Spark closes connection), then the latest delegation token not always picked up. Closes #23906 from gaborgsomogyi/SPARK-27002. Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-02-27 10:07:02 -08:00
Gabor Somogyi	c4bbfd177b	[SPARK-24063][SS] Add maximum epoch queue threshold for ContinuousExecution ## What changes were proposed in this pull request? Continuous processing is waiting on epochs which are not yet complete (for example one partition is not making progress) and stores pending items in queues. These queues are unbounded and can consume up all the memory easily. In this PR I've added `spark.sql.streaming.continuous.epochBacklogQueueSize` configuration possibility to make them bounded. If the related threshold reached then the query will stop with `IllegalStateException`. ## How was this patch tested? Existing + additional unit tests. Closes #23156 from gaborgsomogyi/SPARK-24063. Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-02-27 09:52:43 -08:00
Marcelo Vanzin	a6ddc9d083	[SPARK-24736][K8S] Let spark-submit handle dependency resolution. Before this change, there was some code in the k8s backend to deal with how to resolve dependencies and make them available to the Spark application. It turns out that none of that code is necessary, since spark-submit already handles all that for applications started in client mode - like the k8s driver that is run inside a Spark-created pod. For that reason, specifically for pyspark, there's no need for the k8s backend to deal with PYTHONPATH; or, in general, to change the URIs provided by the user at all. spark-submit takes care of that. For testing, I created a pyspark script that depends on another module that is shipped with --py-files. Then I used: - --py-files http://.../dep.py http://.../test.py - --py-files http://.../dep.zip http://.../test.py - --py-files local:/.../dep.py local:/.../test.py - --py-files local:/.../dep.zip local:/.../test.py Without this change, all of the above commands fail. With the change, the driver is able to see the dependencies in all the above cases; but executors don't see the dependencies in the last two. That's a bug in shared Spark code that deals with local: dependencies in pyspark (SPARK-26934). I also tested a Scala app using the main jar from an http server. Closes #23793 from vanzin/SPARK-24736. Authored-by: Marcelo Vanzin <vanzin@cloudera.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-02-27 09:49:31 -08:00
Hyukjin Kwon	a67e8426e3	[SPARK-27000][PYTHON] Upgrades cloudpickle to v0.8.0 ## What changes were proposed in this pull request? After upgrading cloudpickle to 0.6.1 at https://github.com/apache/spark/pull/20691, one regression was found. Cloudpickle had a critical fix, https://github.com/cloudpipe/cloudpickle/pull/240, for that. Basically, it currently looks existing globals would override globals shipped in a function's, meaning: Before: ```python >>> def hey(): ... return "Hi" ... >>> spark.range(1).rdd.map(lambda _: hey()).collect() ['Hi'] >>> def hey(): ... return "Yeah" ... >>> spark.range(1).rdd.map(lambda _: hey()).collect() ['Hi'] ``` After: ```python >>> def hey(): ... return "Hi" ... >>> spark.range(1).rdd.map(lambda _: hey()).collect() ['Hi'] >>> >>> def hey(): ... return "Yeah" ... >>> spark.range(1).rdd.map(lambda _: hey()).collect() ['Yeah'] ``` Therefore, this PR upgrades cloudpickle to 0.8.0. Note that cloudpickle's release cycle is quite short. Between 0.6.1 and 0.7.0, it contains minor bug fixes. I don't see notable changes to double check and/or avoid. There is virtually only this fix between 0.7.0 and 0.8.0 - other fixes are about testing. ## How was this patch tested? Manually tested, tests were added. Verified unit tests were added in cloudpickle. Closes #23904 from HyukjinKwon/SPARK-27000. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2019-02-28 02:33:10 +09:00
Luca Canali	5fd28e8f5c	[SPARK-26890][DOC] Add list of available Dropwizard metrics in Spark and add additional configuration details to the monitoring documentation ## What changes were proposed in this pull request? This PR proposes to extend the documentation of the Spark metrics system in the monitoring guide. In particular by: - adding a list of the available metrics grouped per component instance - adding information on configuration parameters that can be used to configure the metrics system in alternative to the metrics.properties file - adding information on the configuration parameters needed to enable certain metrics - it also propose to add an example of Graphite sink configuration in metrics.properties.template Closes #23798 from LucaCanali/metricsDocUpdate. Authored-by: Luca Canali <luca.canali@cern.ch> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-02-27 10:07:15 -06:00
Oliver Urs Lenz	28e1695e17	[SPARK-26803][PYTHON] Add sbin subdirectory to pyspark ## What changes were proposed in this pull request? Modifies `setup.py` so that `sbin` subdirectory is included in pyspark ## How was this patch tested? Manually tested with python 2.7 and python 3.7 ```sh $ ./build/mvn -D skipTests -P hive -P hive-thriftserver -P yarn -P mesos clean package $ cd python $ python setup.py sdist $ pip install dist/pyspark-2.1.0.dev0.tar.gz ``` Checked manually that `sbin` is now present in install directory. srowen holdenk Closes #23715 from oulenz/pyspark_sbin. Authored-by: Oliver Urs Lenz <oliver.urs.lenz@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-02-27 08:39:55 -06:00
liuxian	7912dbb88f	[MINOR] Simplify boolean expression ## What changes were proposed in this pull request? Comparing whether Boolean expression is equal to true is redundant For example: The datatype of `a` is boolean. Before: if (a == true) After: if (a) ## How was this patch tested? N/A Closes #23884 from 10110346/simplifyboolean. Authored-by: liuxian <liu.xian3@zte.com.cn> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-02-27 08:38:00 -06:00
Maxim Gekk	b0450d07bd	[SPARK-26902][SQL] Support java.time.Instant as an external type of TimestampType ## What changes were proposed in this pull request? In the PR, I propose to add new Catalyst type converter for `TimestampType`. It should be able to convert `java.time.Instant` to/from `TimestampType`. Main motivations for the changes: - Smoothly support Java 8 time API - Avoid inconsistency of calendars used inside of Spark 3.0 (Proleptic Gregorian calendar) and `java.sql.Timestamp` (hybrid calendar - Julian + Gregorian). - Make conversion independent from current system timezone. By default, Spark converts values of `TimestampType` to `java.sql.Timestamp` instances but the SQL config `spark.sql.catalyst.timestampType` can change the behavior. It accepts two values `Timestamp` (default) and `Instant`. If the latter one is set, Spark returns `java.time.Instant` instances for timestamp values. ## How was this patch tested? Added new tests to `CatalystTypeConvertersSuite` to check conversion of `TimestampType` to/from `java.time.Instant`. Closes #23811 from MaxGekk/timestamp-instant. Lead-authored-by: Maxim Gekk <maxim.gekk@databricks.com> Co-authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-02-27 21:05:19 +08:00
Gengliang Wang	95e55720d4	[SPARK-26990][SQL] FileIndex: use user specified field names if possible ## What changes were proposed in this pull request? With the following file structure: ``` /tmp/data └── a=5 ``` In the previous release: ``` scala> spark.read.schema("A int, ID long").parquet("/tmp/data/").printSchema root |-- ID: long (nullable = true) |-- A: integer (nullable = true) ``` While in current code: ``` scala> spark.read.schema("A int, ID long").parquet("/tmp/data/").printSchema root |-- ID: long (nullable = true) |-- a: integer (nullable = true) ``` We can see that the partition column name `a` is different from `A` as user specified. This PR is to fix the case and make it more user-friendly. ## How was this patch tested? Unit test Closes #23894 from gengliangwang/fileIndexSchema. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-02-27 14:38:35 +08:00
Jungtaek Lim (HeartSaVioR)	dea18ee85b	[SPARK-22000][SQL] Address missing Upcast in JavaTypeInference.deserializerFor ## What changes were proposed in this pull request? Spark expects the type of column and the type of matching field to be the same when deserializing to Object, but Spark hasn't actually restricted it (at least for Java bean encoder) and some users just do it and experience undefined behavior (in SPARK-22000, Spark throws compilation failure on generated code because it calls `.toString()` against primitive type). It doesn't produce error in Scala side because `ScalaReflection.deserializerFor` properly injects Upcast if necessary. This patch proposes applying same thing to `JavaTypeInference.deserializerFor` as well. Credit to srowen, maropu, and cloud-fan since they provided various approaches to solve this. ## How was this patch tested? Added UT which query is slightly modified based on sample code in attachment on JIRA issue. Closes #23854 from HeartSaVioR/SPARK-22000. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-02-27 13:47:20 +08:00
Hyukjin Kwon	88bc481b9e	[SPARK-26830][SQL][R] Vectorized R dapply() implementation ## What changes were proposed in this pull request? This PR targets to add vectorized `dapply()` in R, Arrow optimization. This can be tested as below: ```bash $ ./bin/sparkR --conf spark.sql.execution.arrow.enabled=true ``` ```r df <- createDataFrame(mtcars) collect(dapply(df, function(rdf) { data.frame(rdf$gear + 1) }, structType("gear double"))) ``` ### Requirements - R 3.5.x - Arrow package 0.12+ ```bash Rscript -e 'remotes::install_github("apache/arrowapache-arrow-0.12.0", subdir = "r")' ``` Note: currently, Arrow R package is not in CRAN. Please take a look at ARROW-3204. Note: currently, Arrow R package seems not supporting Windows. Please take a look at ARROW-3204. ### Benchmarks Shell ```bash sync && sudo purge ./bin/sparkR --conf spark.sql.execution.arrow.enabled=false --driver-memory 4g ``` ```bash sync && sudo purge ./bin/sparkR --conf spark.sql.execution.arrow.enabled=true --driver-memory 4g ``` R code ```r rdf <- read.csv("500000.csv") df <- cache(createDataFrame(rdf)) count(df) test <- function() { options(digits.secs = 6) # milliseconds start.time <- Sys.time() count(cache(dapply(df, function(rdf) { rdf }, schema(df)))) end.time <- Sys.time() time.taken <- end.time - start.time print(time.taken) } test() ``` Data (350 MB): ```r object.size(read.csv("500000.csv")) 350379504 bytes ``` "500000 Records" http://eforexcel.com/wp/downloads-16-sample-csv-files-data-sets-for-testing/ Results ``` Time difference of 13.42037 mins ``` ``` Time difference of 30.64156 secs ``` The performance improvement was around 2627%. ### Limitations - For now, Arrow optimization with R does not support when the data is `raw`, and when user explicitly gives float type in the schema. They produce corrupt values. - Due to ARROW-4512, it cannot send and receive batch by batch. It has to send all batches in Arrow stream format at once. It needs improvement later. ## How was this patch tested? Unit tests were added, and manually tested. Closes #23787 from HyukjinKwon/SPARK-26830-1. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2019-02-27 14:29:58 +09:00
Liang-Chi Hsieh	0f2c0b53e8	[SPARK-26837][SQL] Pruning nested fields from object serializers ## What changes were proposed in this pull request? In SPARK-26619, we make change to prune unnecessary individual serializers when serializing objects. This is extension to SPARK-26619. We can further prune nested fields from object serializers if they are not used. For example, in following query, we only use one field in a struct column: ```scala val data = Seq((("a", 1), 1), (("b", 2), 2), (("c", 3), 3)) val df = data.toDS().map(t => (t._1, t._2 + 1)).select("_1._1") ``` So, instead of having a serializer to create a two fields struct, we can prune unnecessary field from it. This is what this PR proposes to do. In order to make this change conservative and safer, a SQL config is added to control it. It is disabled by default. TODO: Support to prune nested fields inside MapType's key and value. ## How was this patch tested? Added tests. Closes #23740 from viirya/nested-pruning-serializer-2. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-02-27 12:45:24 +08:00
Sean Owen	9c283662c6	[SPARK-26986][ML] Add JAXB reference impl to build for Java 9+ ## What changes were proposed in this pull request? Add reference JAXB impl for Java 9+ from Glassfish. Right now it's only apparently necessary in MLlib but can be expanded later. ## How was this patch tested? Existing tests particularly PMML-related ones, which use JAXB. This works on Java 11. Closes #23890 from srowen/SPARK-26986. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-02-26 18:26:49 -06:00
Hellsen83	387efe29b7	[SPARK-26449][PYTHON] Add transform method to DataFrame API ## What changes were proposed in this pull request? Added .transform() method to Python DataFrame API to be in sync with Scala API. ## How was this patch tested? Addition has been tested manually. Closes #23877 from Hellsen83/pyspark-dataframe-transform. Authored-by: Hellsen83 <erik.christiansen83@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-02-26 18:22:36 -06:00
Jungtaek Lim (HeartSaVioR)	c17150a5f5	[SPARK-22860][CORE][YARN] Redact command line arguments for running Driver and Executor before logging (standalone and YARN) ## What changes were proposed in this pull request? This patch applies redaction to command line arguments before logging them. This applies to two resource managers: standalone cluster and YARN. This patch only concerns about arguments starting with `-D` since Spark is likely passing the Spark configuration to command line arguments as `-Dspark.blabla=blabla`. More change is necessary if we also want to handle the case of `--conf spark.blabla=blabla`. ## How was this patch tested? Added UT for redact logic. This patch only touches how to log so not easy to add UT regarding it. Closes #23820 from HeartSaVioR/MINOR-redact-command-line-args-for-running-driver-executor. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-02-26 14:49:46 -08:00
Marcelo Vanzin	afbff6446f	Revert "[SPARK-26742][K8S] Update Kubernetes-Client version to 4.1.2" This reverts commit `a3192d966a`.	2019-02-26 13:42:07 -08:00
Maxim Gekk	a2a41b7bf2	[SPARK-26978][CORE][SQL] Avoid magic time constants ## What changes were proposed in this pull request? In the PR, I propose to refactor existing code related to date/time conversions, and replace constants like `1000` and `1000000` by `DateTimeUtils` constants and transformation functions from `java.util.concurrent.TimeUnit._`. ## How was this patch tested? The changes are tested by existing test suites. Closes #23878 from MaxGekk/magic-time-constants. Lead-authored-by: Maxim Gekk <max.gekk@gmail.com> Co-authored-by: Maxim Gekk <maxim.gekk@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-02-26 09:08:12 -06:00
seancxmao	3d2e55abd0	[MINOR][DOCS] Remove Akka leftover ## What changes were proposed in this pull request? Since Spark 2.0, Akka is not used anymore and Akka related stuff were removed. However there are still some leftover. This PR aims to remove these leftover. * `/pom.xml` has a comment about Akka, which is not needed anymore. ## How was this patch tested? Existing tests. Closes #23885 from seancxmao/remove-akka-leftover. Authored-by: seancxmao <seancxmao@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-02-26 08:31:02 -06:00
Xianyang Liu	bc03c8b3fa	[SPARK-26952][SQL] Row count statistics should respect the data reported by data source ## What changes were proposed in this pull request? In data source v2, if the data source scan implements `SupportsReportStatistics`, `DataSourceV2Relation` should respect the row count reported by the data source. ## How was this patch tested? New UT test. Closes #23853 from ConeyLiu/report-row-count. Authored-by: Xianyang Liu <xianyang.liu@intel.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-02-26 14:10:54 +08:00
liupengcheng	52a180f25f	[SPARK-26674][CORE] Consolidate CompositeByteBuf when reading large frame ## What changes were proposed in this pull request? Currently, TransportFrameDecoder will not consolidate the buffers read from network which may cause memory waste. Actually, bytebuf's writtenIndex is far less than its capacity in most cases, so we can optimize it by doing consolidation. This PR will do this optimization. Related codes: `9a30e23211/common/network-common/src/main/java/org/apache/spark/network/util/TransportFrameDecoder.java (L143)` ## How was this patch tested? UT Please review http://spark.apache.org/contributing.html before opening a pull request. Closes #23602 from liupc/Reduce-memory-consumption-in-TransportFrameDecoder. Lead-authored-by: liupengcheng <liupengcheng@xiaomi.com> Co-authored-by: Liupengcheng <liupengcheng@xiaomi.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-02-25 16:40:46 -08:00
Gengliang Wang	4baa2d4449	[SPARK-26673][FOLLOWUP][SQL] File Source V2: check existence of output path before delete it ## What changes were proposed in this pull request? This is a followup PR to resolve comment: https://github.com/apache/spark/pull/23601#pullrequestreview-207101115 When Spark writes DataFrame with "overwrite" mode, it deletes the output path before actual writes. To safely handle the case that the output path doesn't exist, it is suggested to follow the V1 code by checking the existence. ## How was this patch tested? Apply https://github.com/apache/spark/pull/23836 and run unit tests Closes #23889 from gengliangwang/checkFileBeforeOverwrite. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2019-02-25 16:20:06 -08:00
Ilya Matiach	b66be0e490	[SPARK-24103][ML][MLLIB] ML Evaluators should use weight column - added weight column for binary classification evaluator ## What changes were proposed in this pull request? The evaluators BinaryClassificationEvaluator, RegressionEvaluator, and MulticlassClassificationEvaluator and the corresponding metrics classes BinaryClassificationMetrics, RegressionMetrics and MulticlassMetrics should use sample weight data. I've closed the PR: https://github.com/apache/spark/pull/16557 as recommended in favor of creating three pull requests, one for each of the evaluators (binary/regression/multiclass) to make it easier to review/update. ## How was this patch tested? I added tests to the metrics and evaluators classes. Closes #17084 from imatiach-msft/ilmat/binary-evalute. Authored-by: Ilya Matiach <ilmat@microsoft.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-02-25 17:16:51 -06:00
Marcelo Vanzin	4808393449	[SPARK-26788][YARN] Remove SchedulerExtensionService. Since the yarn module is actually private to Spark, this interface was never actually "public". Since it has no use inside of Spark, let's avoid adding a yarn-specific extension that isn't public, and point any potential users at more general solutions (like using a SparkListener). Closes #23839 from vanzin/SPARK-26788. Authored-by: Marcelo Vanzin <vanzin@cloudera.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-02-25 13:57:37 -06:00
“attilapiros”	0ac516bebd	[SPARK-25035][CORE] Avoiding memory mapping at disk-stored blocks replication Before this PR the method `BlockManager#putBlockDataAsStream()` (which is used during block replication where the block data is received as a stream) was reading the whole block content into memory even at the DISK_ONLY storage level. With this change the received block data (which was temporarily stored in a file) is simply moved into the right location backing the target block. This way a possible OOM error is avoided. In this implementation, to avoid code duplication, the method `doPutBytes` is refactored into a template method called `BlockStoreUpdater` which has separate implementations to handle byte buffer based and temporary file based block store updates. Tested with existing unit tests of `DistributedSuite` (the ones dealing with replications): - caching on disk, replicated (encryption = off) (with replication as stream) - caching on disk, replicated (encryption = on) (with replication as stream) - caching in memory, serialized, replicated (encryption = on) (with replication as stream) - caching in memory, serialized, replicated (encryption = off) (with replication as stream) - etc. And with new unit tests testing the `putBlockDataAsStream` method directly: - test putBlockDataAsStream with caching (encryption = off) - test putBlockDataAsStream with caching (encryption = on) - test putBlockDataAsStream with caching on disk (encryption = off) - test putBlockDataAsStream with caching on disk (encryption = on) Closes #23688 from attilapiros/SPARK-25035. Authored-by: “attilapiros” <piros.attila.zsolt@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-02-25 11:43:39 -08:00
Jungtaek Lim (HeartSaVioR)	c5de804093	[MINOR][BUILD] Update all checkstyle dtd to use "https://checkstyle.org" ## What changes were proposed in this pull request? The build below failed in the Java checkstyle test, but instead of a violation it shows FileNotFound for the dtd file. https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/102751/ It looks like the link to the dtd file is dead: `http://www.puppycrawl.com/dtds/configuration_1_3.dtd`. This patch updates the dtd link to "https://checkstyle.org/dtds/", given that the checkstyle repository also updated the URL path. https://github.com/checkstyle/checkstyle/issues/5601 ## How was this patch tested? Checked the new links. Closes #23887 from HeartSaVioR/java-checkstyle-dtd-change-url. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-02-25 11:25:53 -08:00
Jiaxin Shan	a3192d966a	[SPARK-26742][K8S] Update Kubernetes-Client version to 4.1.2 ## What changes were proposed in this pull request? Changed the `kubernetes-client` version to 4.1.2. The latest version fixes an error with exec credentials (used by AWS EKS), and the client is used to talk to the Kubernetes API server. With this patch, users can submit Spark jobs to the EKS API endpoint. ## How was this patch tested? Unit tests and manual tests. Closes #23814 from Jeffwan/update_k8s_sdk. Authored-by: Jiaxin Shan <seedjeffwan@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-02-25 04:56:04 -06:00
Sean Owen	d2529788ed	[SPARK-26966][ML] Update to JPMML 1.4.8 ## What changes were proposed in this pull request? JPMML apparently only supports Java 9 in 1.4.2+. We are seeing test failures from JPMML relating to JAXB when running on Java 11. It's shaded and not a big change, so should be safe. ## How was this patch tested? Existing tests. Closes #23868 from srowen/SPARK-26966. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-02-25 04:37:45 -06:00
Maxim Gekk	2d2fb34b93	[SPARK-26953][CORE][TEST] Test TimSort for ArrayIndexOutOfBoundsException ## What changes were proposed in this pull request? In the PR, I propose to test the input shown at the end of the article: https://arxiv.org/pdf/1805.08612.pdf . The difference between this test and the paper's test is the type of array: this test allocates arrays of bytes instead of arrays of ints. ## How was this patch tested? New test is added to `SorterSuite`. Closes #23856 from MaxGekk/timsort-bug-fix. Authored-by: Maxim Gekk <maxim.gekk@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-02-24 17:37:32 -06:00
Douglas R Colkitt	faa61980c4	[SPARK-26935][SQL] Skip DataFrameReader's CSV first line scan when not used Prior to this patch, all DataFrameReader.csv() calls would collect the first line from the CSV input iterator. This is done to allow schema inference from the header row. However, when the schema is already specified this is a wasteful operation. It results in an unnecessary compute step on the first partition. This can be expensive if the CSV itself is expensive to generate (e.g. it's the product of a long-running external pipe()). This patch short-circuits the first-line collection in DataFrameReader.csv() when the schema is specified, thereby improving CSV read performance in certain cases. ## What changes were proposed in this pull request? Short-circuiting DataFrameReader.csv() first-line read when schema is user-specified. ## How was this patch tested? Compiled and tested against several CSV datasets. Closes #23830 from Mister-Meeseeks/master. Authored-by: Douglas R Colkitt <douglas.colkitt@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-02-23 14:00:57 -06:00
Maxim Gekk	75c48ac36d	[SPARK-26908][SQL] Fix DateTimeUtils.toMillis and millisToDays ## What changes were proposed in this pull request? `DateTimeUtils.toMillis` can produce an inaccurate result for some negative values (timestamps before the epoch). The error can be around 1ms. In the PR, I propose to use `Math.floorDiv` when converting microseconds to milliseconds, and milliseconds to days since the epoch. ## How was this patch tested? Added a new test to `DateTimeUtilsSuite`, and tested by `CastSuite` as well. Closes #23815 from MaxGekk/micros-to-millis. Lead-authored-by: Maxim Gekk <max.gekk@gmail.com> Co-authored-by: Maxim Gekk <maxim.gekk@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-02-23 11:35:11 -06:00
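
The rounding issue described in SPARK-26908 above can be illustrated with a minimal sketch. This is not the actual `DateTimeUtils` code; the class `MicrosToMillis` and its two helper methods are hypothetical names chosen for illustration. Plain integer division in Java truncates toward zero, so a timestamp one microsecond before the epoch lands in millisecond 0, whereas `Math.floorDiv` rounds toward negative infinity and places it in millisecond -1.

```java
public class MicrosToMillis {
    // Hypothetical helper: plain division truncates toward zero.
    public static long truncatingMillis(long micros) {
        return micros / 1000L;
    }

    // Hypothetical helper: floor division rounds toward negative infinity.
    public static long floorMillis(long micros) {
        return Math.floorDiv(micros, 1000L);
    }

    public static void main(String[] args) {
        long beforeEpoch = -1L; // one microsecond before 1970-01-01T00:00:00Z
        System.out.println(truncatingMillis(beforeEpoch)); // prints 0
        System.out.println(floorMillis(beforeEpoch));      // prints -1
        System.out.println(floorMillis(1500L));            // prints 1
    }
}
```

For non-negative inputs the two helpers agree; they differ only for negative values that are not exact multiples of 1000, which matches the "around 1ms" error the commit message mentions.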

24027 commits