ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Arun Mahadevan	710e4e81a8	[SPARK-24308][SQL] Handle DataReaderFactory to InputPartition rename in left over classes ## What changes were proposed in this pull request? SPARK-24073 renames DataReaderFactory -> InputPartition and DataReader -> InputPartitionReader. Some classes still reflects the old name and causes confusion. This patch renames the left over classes to reflect the new interface and fixes a few comments. ## How was this patch tested? Existing unit tests. Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Arun Mahadevan <arunm@apache.org> Closes #21355 from arunmahadevan/SPARK-24308.	2018-05-18 14:37:01 -07:00
Takeshi Yamamuro	a53ea70c1d	[SPARK-23856][SQL] Add an option `queryTimeout` in JDBCOptions ## What changes were proposed in this pull request? This pr added an option `queryTimeout` for the number of seconds the the driver will wait for a Statement object to execute. ## How was this patch tested? Added tests in `JDBCSuite`. Author: Takeshi Yamamuro <yamamuro@apache.org> Closes #21173 from maropu/SPARK-23856.	2018-05-18 13:38:36 -07:00
Marco Gaido	3159ee085b	[SPARK-24149][YARN] Retrieve all federated namespaces tokens ## What changes were proposed in this pull request? Hadoop 3 introduces HDFS federation. This means that multiple namespaces are allowed on the same HDFS cluster. In Spark, we need to ask the delegation token for all the namenodes (for each namespace), otherwise accessing any other namespace different from the default one (for which we already fetch the delegation token) fails. The PR adds the automatic discovery of all the namenodes related to all the namespaces available according to the configs in hdfs-site.xml. ## How was this patch tested? manual tests in dockerized env Author: Marco Gaido <marcogaido91@gmail.com> Closes #21216 from mgaido91/SPARK-24149.	2018-05-18 13:04:00 -07:00
Dongjoon Hyun	7f82c4a47e	[SPARK-24312][SQL] Upgrade to 2.3.3 for Hive Metastore Client 2.3 ## What changes were proposed in this pull request? Hive 2.3.3 was [released on April 3rd](https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12342162&styleName=Text&projectId=12310843). This PR aims to upgrade Hive Metastore Client 2.3 from 2.3.2 to 2.3.3. ## How was this patch tested? Pass the Jenkins with the existing tests. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #21359 from dongjoon-hyun/SPARK-24312.	2018-05-18 12:54:19 -07:00
gatorsmile	1c4553d67d	Revert "[SPARK-24277][SQL] Code clean up in SQL module: HadoopMapReduceCommitProtocol" This reverts commit `7b2dca5b12`.	2018-05-18 12:51:09 -07:00
Marcelo Vanzin	ed7ba7db8f	[SPARK-23850][SQL] Add separate config for SQL options redaction. The old code was relying on a core configuration and extended its default value to include things that redact desired things in the app's environment. Instead, add a SQL-specific option for which options to redact, and apply both the core and SQL-specific rules when redacting the options in the save command. This is a little sub-optimal since it adds another config, but it retains the current default behavior. While there I also fixed a typo and a couple of minor config API usage issues in the related redaction option that SQL already had. Tested with existing unit tests, plus checking the env page on a shell UI. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #21158 from vanzin/SPARK-23850.	2018-05-18 11:14:22 -07:00
Tathagata Das	807ba44cb7	[SPARK-24159][SS] Enable no-data micro batches for streaming mapGroupswithState ## What changes were proposed in this pull request? Enabled no-data batches in flatMapGroupsWithState in following two cases. - When ProcessingTime timeout is used, then we always run a batch every trigger interval. - When event-time watermark is defined, then the user may be doing arbitrary logic against the watermark value even if timeouts are not set. In such cases, it's best to run batches whenever the watermark has changed, irrespective of whether timeouts (i.e. event-time timeout) have been explicitly enabled. ## How was this patch tested? updated tests Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #21345 from tdas/SPARK-24159.	2018-05-18 10:35:43 -07:00
Soham Aurangabadkar	7696b9de0d	[SPARK-20538][SQL] Wrap Dataset.reduce with withNewRddExecutionId. ## What changes were proposed in this pull request? Wrap Dataset.reduce with `withNewExecutionId`. Author: Soham Aurangabadkar <sohama4@gmail.com> Closes #21316 from sohama4/dataset_reduce_withexecutionid.	2018-05-18 10:29:34 -07:00
hyukjinkwon	0cf59fcbe3	[SPARK-24303][PYTHON] Update cloudpickle to v0.4.4 ## What changes were proposed in this pull request? cloudpickle 0.4.4 is released - https://github.com/cloudpipe/cloudpickle/releases/tag/v0.4.4 There's no invasive change - the main difference is that we are now able to pickle the root logger, which fix is pretty isolated. ## How was this patch tested? Jenkins tests. Author: hyukjinkwon <gurwls223@apache.org> Closes #21350 from HyukjinKwon/SPARK-24303.	2018-05-18 09:53:24 -07:00
Gengliang Wang	7b2dca5b12	[SPARK-24277][SQL] Code clean up in SQL module: HadoopMapReduceCommitProtocol ## What changes were proposed in this pull request? In HadoopMapReduceCommitProtocol and FileFormatWriter, there are unnecessary settings in hadoop configuration. Also clean up some code in SQL module. ## How was this patch tested? Unit test Author: Gengliang Wang <gengliang.wang@databricks.com> Closes #21329 from gengliangwang/codeCleanWrite.	2018-05-18 15:32:29 +08:00
Sandor Murakozi	d4a0895c62	[SPARK-22884][ML] ML tests for StructuredStreaming: spark.ml.clustering ## What changes were proposed in this pull request? Converting clustering tests to also check code with structured streaming, using the ML testing infrastructure implemented in SPARK-22882. This PR is a new version of https://github.com/apache/spark/pull/20319 Author: Sandor Murakozi <smurakozi@gmail.com> Author: Joseph K. Bradley <joseph@databricks.com> Closes #21358 from jkbradley/smurakozi-SPARK-22884.	2018-05-17 16:33:06 -07:00
Bago Amirbekian	439c695118	[SPARK-24114] Add instrumentation to FPGrowth. ## What changes were proposed in this pull request? Have FPGrowth keep track of model training using the Instrumentation class. ## How was this patch tested? manually Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Bago Amirbekian <bago@databricks.com> Closes #21344 from MrBago/fpgrowth-instr.	2018-05-17 13:42:10 -07:00
Bago Amirbekian	a7a9b18378	[SPARK-24115] Have logging pass through instrumentation class. ## What changes were proposed in this pull request? Fixes to tuning instrumentation. ## How was this patch tested? Existing tests. Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Bago Amirbekian <bago@databricks.com> Closes #21340 from MrBago/tunning-instrumentation.	2018-05-17 11:13:16 -07:00
jinxing	8a837bf4f3	[SPARK-24193] create TakeOrderedAndProjectExec only when the limit number is below spark.sql.execution.topKSortFallbackThreshold. ## What changes were proposed in this pull request? Physical plan of `select colA from t order by colB limit M` is `TakeOrderedAndProject`; Currently `TakeOrderedAndProject` sorts data in memory, see https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala#L158 We can add a config – if the number of limit (M) is too big, we can sort by disk. Thus memory issue can be resolved. ## How was this patch tested? Test added Author: jinxing <jinxing6042@126.com> Closes #21252 from jinxing64/SPARK-24193.	2018-05-17 22:29:18 +08:00
Marco Gaido	69350aa2f0	[SPARK-23922][SQL] Add arrays_overlap function ## What changes were proposed in this pull request? The PR adds the function `arrays_overlap`. This function returns `true` if the input arrays contain a non-null common element; if not, it returns `null` if any of the arrays contains a `null` element, `false` otherwise. ## How was this patch tested? added UTs Author: Marco Gaido <marcogaido91@gmail.com> Closes #21028 from mgaido91/SPARK-23922.	2018-05-17 20:45:32 +08:00
Wenchen Fan	6ec05826d7	[SPARK-24107][CORE][FOLLOWUP] ChunkedByteBuffer.writeFully method has not reset the limit value ## What changes were proposed in this pull request? According to the discussion in https://github.com/apache/spark/pull/21175 , this PR proposes 2 improvements: 1. add comments to explain why we call `limit` to write out `ByteBuffer` with slices. 2. remove the `try ... finally` ## How was this patch tested? existing tests Author: Wenchen Fan <wenchen@databricks.com> Closes #21327 from cloud-fan/minor.	2018-05-17 20:42:40 +08:00
Artem Rudoy	6c35865d94	[SPARK-22371][CORE] Return None instead of throwing an exception when an accumulator is garbage collected. ## What changes were proposed in this pull request? There's a period of time when an accumulator has been garbage collected, but hasn't been removed from AccumulatorContext.originals by ContextCleaner. When an update is received for such accumulator it will throw an exception and kill the whole job. This can happen when a stage completes, but there're still running tasks from other attempts, speculation etc. Since AccumulatorContext.get() returns an option we can just return None in such case. ## How was this patch tested? Unit test. Author: Artem Rudoy <artem.rudoy@gmail.com> Closes #21114 from artemrd/SPARK-22371.	2018-05-17 18:49:46 +08:00
Florent Pépin	3e66350c24	[SPARK-23925][SQL] Add array_repeat collection function ## What changes were proposed in this pull request? The PR adds a new collection function, array_repeat. As there already was a function repeat with the same signature, with the only difference being the expected return type (String instead of Array), the new function is called array_repeat to distinguish. The behaviour of the function is based on Presto's one. The function creates an array containing a given element repeated the requested number of times. ## How was this patch tested? New unit tests added into: - CollectionExpressionsSuite - DataFrameFunctionsSuite Author: Florent Pépin <florentpepin.92@gmail.com> Author: Florent Pépin <florent.pepin14@imperial.ac.uk> Closes #21208 from pepinoflo/SPARK-23925.	2018-05-17 13:31:14 +09:00
hyukjinkwon	9a641e7f72	[SPARK-21945][YARN][PYTHON] Make --py-files work with PySpark shell in Yarn client mode ## What changes were proposed in this pull request? ### Problem When we run _PySpark shell with Yarn client mode_, specified `--py-files` are not recognised in _driver side_. Here are the steps I took to check: ```bash $ cat /home/spark/tmp.py def testtest(): return 1 ``` ```bash $ ./bin/pyspark --master yarn --deploy-mode client --py-files /home/spark/tmp.py ``` ```python >>> def test(): ... import tmp ... return tmp.testtest() ... >>> spark.range(1).rdd.map(lambda _: test()).collect() # executor side [1] >>> test() # driver side Traceback (most recent call last): File "<stdin>", line 1, in <module> File "<stdin>", line 2, in test ImportError: No module named tmp ``` ### How did it happen? Unlike Yarn cluster and client mode with Spark submit, when Yarn client mode with PySpark shell specifically, 1. It first runs Python shell via: `3cb82047f2/launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java (L158)` as pointed out by tgravescs in the JIRA. 2. this triggers shell.py and submit another application to launch a py4j gateway: `209b9361ac/python/pyspark/java_gateway.py (L45-L60)` 3. it runs a Py4J gateway: `3cb82047f2/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala (L425)` 4. it copies (or downloads) --py-files into local temp directory: `3cb82047f2/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala (L365-L376)` and then these files are set up to `spark.submit.pyFiles` 5. Py4J JVM is launched and then the Python paths are set via: `7013eea11c/python/pyspark/context.py (L209-L216)` However, these are not actually set because those files were copied into a tmp directory in 4. whereas this code path looks for `SparkFiles.getRootDirectory` where the files are stored only when `SparkContext.addFile()` is called. In other cluster mode, `spark.files` are set via: `3cb82047f2/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala (L554-L555)` and those files are explicitly added via: `ecb8b383af/core/src/main/scala/org/apache/spark/SparkContext.scala (L395)` So we are fine in other modes. In case of Yarn client and cluster with _submit_, these are manually being handled. In particular https://github.com/apache/spark/pull/6360 added most of the logics. In this case, the Python path looks manually set via, for example, `deploy.PythonRunner`. We don't use `spark.files` here. ### How does the PR fix the problem? I tried to make an isolated approach as possible as I can: simply copy py file or zip files into `SparkFiles.getRootDirectory()` in driver side if not existing. Another possible way is to set `spark.files` but it does unnecessary stuff together and sounds a bit invasive. Before ```python >>> def test(): ... import tmp ... return tmp.testtest() ... >>> spark.range(1).rdd.map(lambda _: test()).collect() [1] >>> test() Traceback (most recent call last): File "<stdin>", line 1, in <module> File "<stdin>", line 2, in test ImportError: No module named tmp ``` After ```python >>> def test(): ... import tmp ... return tmp.testtest() ... >>> spark.range(1).rdd.map(lambda _: test()).collect() [1] >>> test() 1 ``` ## How was this patch tested? I manually tested in standalone and yarn cluster with PySpark shell. .zip and .py files were also tested with the similar steps above. It's difficult to add a test. Author: hyukjinkwon <gurwls223@apache.org> Closes #21267 from HyukjinKwon/SPARK-21945.	2018-05-17 12:07:58 +08:00
Lu WANG	bfd75cdfb2	[SPARK-22210][ML] Add seed for LDA variationalTopicInference ## What changes were proposed in this pull request? - Add seed parameter for variationalTopicInference - Add seed for calling variationalTopicInference in submitMiniBatch - Add var seed in LDAModel so that it can take the seed from LDA and use it for the function call of variationalTopicInference in logLikelihoodBound, topicDistributions, getTopicDistributionMethod, and topicDistribution. ## How was this patch tested? Check the test result in mllib.clustering.LDASuite to make sure the result is repeatable with the seed. Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Lu WANG <lu.wang@databricks.com> Closes #21183 from ludatabricks/SPARK-22210.	2018-05-16 17:54:06 -07:00
Tathagata Das	991726f31a	[SPARK-24158][SS] Enable no-data batches for streaming joins ## What changes were proposed in this pull request? This is a continuation of the larger task of enabling zero-data batches for more eager state cleanup. This PR enables it for stream-stream joins. ## How was this patch tested? - Updated join tests. Additionally, updated them to not use `CheckLastBatch` anywhere to set good precedence for future. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #21253 from tdas/SPARK-24158.	2018-05-16 14:55:02 -07:00
Marcelo Vanzin	8e60a16b73	[SPARK-23601][BUILD][FOLLOW-UP] Keep md5 checksums for nexus artifacts. The repository.apache.org server still requires md5 checksums or it won't publish the staging repo. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #21338 from vanzin/SPARK-23601.	2018-05-16 13:34:54 -07:00
Gengliang Wang	6fb7d6c4f7	[SPARK-24275][SQL] Revise doc comments in InputPartition ## What changes were proposed in this pull request? In #21145, DataReaderFactory is renamed to InputPartition. This PR is to revise wording in the comments to make it more clear. ## How was this patch tested? None Author: Gengliang Wang <gengliang.wang@databricks.com> Closes #21326 from gengliangwang/revise_reader_comments.	2018-05-17 00:40:39 +08:00
Wenchen Fan	943493b165	Revert "[SPARK-22938][SQL][FOLLOWUP] Assert that SQLConf.get is acces… …sed only on the driver" This reverts commit `a4206d58e0`. This is from https://github.com/apache/spark/pull/21299 and to ease the review of it. Author: Wenchen Fan <wenchen@databricks.com> Closes #21341 from cloud-fan/revert.	2018-05-16 22:01:24 +08:00
Liang-Chi Hsieh	8a13c50968	[SPARK-24058][ML][PYSPARK] Default Params in ML should be saved separately: Python API ## What changes were proposed in this pull request? See SPARK-23455 for reference. Now default params in ML are saved separately in metadata file in Scala. We must change it for Python for Spark 2.4.0 as well in order to keep them in sync. ## How was this patch tested? Added test. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #21153 from viirya/SPARK-24058.	2018-05-15 16:50:09 -07:00
Lu WANG	6b94420f6c	[SPARK-24231][PYSPARK][ML] Provide Python API for evaluateEachIteration for spark.ml GBTs ## What changes were proposed in this pull request? Add evaluateEachIteration for GBTClassification and GBTRegressionModel ## How was this patch tested? doctest Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Lu WANG <lu.wang@databricks.com> Closes #21335 from ludatabricks/SPARK-14682.	2018-05-15 14:16:31 -07:00
Jose Torres	3fabbc5762	[SPARK-24040][SS] Support single partition aggregates in continuous processing. ## What changes were proposed in this pull request? Support aggregates with exactly 1 partition in continuous processing. A few small tweaks are needed to make this work: * Replace currentEpoch tracking with an ThreadLocal. This means that current epoch is scoped to a task rather than a node, but I think that's sustainable even once we add shuffle. * Add a new testing-only flag to disable the UnsupportedOperationChecker whitelist of allowed continuous processing nodes. I think this is preferable to writing a pile of custom logic to enforce that there is in fact only 1 partition; we plan to support multi-partition aggregates before the next Spark release, so we'd just have to tear that logic back out. * Restart continuous processing queries from the first available uncommitted epoch, rather than one that's guaranteed to be unused. This is required for stateful operators to overwrite partial state from the previous attempt at the epoch, and there was no specific motivation for the original strategy. In another PR before stabilizing the StreamWriter API, we'll need to narrow down and document more precise semantic guarantees for the epoch IDs. * We need a single-partition ContinuousMemoryStream. The way MemoryStream is constructed means it can't be a text option like it is for rate source, unfortunately. ## How was this patch tested? new unit tests Author: Jose Torres <torres.joseph.f+github@gmail.com> Closes #21239 from jose-torres/withAggr.	2018-05-15 10:25:29 -07:00
Liang-Chi Hsieh	d610d2a3f5	[SPARK-24259][SQL] ArrayWriter for Arrow produces wrong output ## What changes were proposed in this pull request? Right now `ArrayWriter` used to output Arrow data for array type, doesn't do `clear` or `reset` after each batch. It produces wrong output. ## How was this patch tested? Added test. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #21312 from viirya/SPARK-24259.	2018-05-15 22:06:58 +08:00
Kent Yao	4a2b15f0af	[SPARK-24241][SUBMIT] Do not fail fast when dynamic resource allocation enabled with 0 executor ## What changes were proposed in this pull request? ``` ~/spark-2.3.0-bin-hadoop2.7$ bin/spark-sql --num-executors 0 --conf spark.dynamicAllocation.enabled=true Java HotSpot(TM) 64-Bit Server VM warning: ignoring option PermSize=1024m; support was removed in 8.0 Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=1024m; support was removed in 8.0 Error: Number of executors must be a positive number Run with --help for usage help or --verbose for debug output ``` Actually, we could start up with min executor number with 0 before if dynamically ## How was this patch tested? ut added Author: Kent Yao <yaooqinn@hotmail.com> Closes #21290 from yaooqinn/SPARK-24241.	2018-05-15 16:04:17 +08:00
maryannxue	80c6d35a3e	[SPARK-24035][SQL] SQL syntax for Pivot - fix antlr warning ## What changes were proposed in this pull request? 1. Change antlr rule to fix the warning. 2. Add PIVOT/LATERAL check in AstBuilder with a more meaningful error message. ## How was this patch tested? 1. Add a counter case in `PlanParserSuite.test("lateral view")` Author: maryannxue <maryann.xue@gmail.com> Closes #21324 from maryannxue/spark-24035-fix.	2018-05-14 23:34:42 -07:00
Goun Na	e29176fd7d	[SPARK-23627][SQL] Provide isEmpty in Dataset ## What changes were proposed in this pull request? This PR adds isEmpty() in DataSet ## How was this patch tested? Unit tests added Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Goun Na <gounna@gmail.com> Author: goungoun <gounna@gmail.com> Closes #20800 from goungoun/SPARK-23627.	2018-05-15 14:11:20 +08:00
Felix Cheung	9059f1ee6a	[SPARK-23780][R] Failed to use googleVis library with new SparkR ## What changes were proposed in this pull request? change generic to get it to work with googleVis also fix lintr ## How was this patch tested? manual test, unit tests Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #21315 from felixcheung/googvis.	2018-05-14 19:20:25 -07:00
Henry Robinson	061e0084ce	[SPARK-23852][SQL] Add withSQLConf(...) to test case ## What changes were proposed in this pull request? Add a `withSQLConf(...)` wrapper to force Parquet filter pushdown for a test that relies on it. ## How was this patch tested? Test passes Author: Henry Robinson <henry@apache.org> Closes #21323 from henryr/spark-23582.	2018-05-14 14:35:08 -07:00
Maxim Gekk	8cd83acf40	[SPARK-24027][SQL] Support MapType with StringType for keys as the root type by from_json ## What changes were proposed in this pull request? Currently, the from_json function support StructType or ArrayType as the root type. The PR allows to specify MapType(StringType, DataType) as the root type additionally to mentioned types. For example: ```scala import org.apache.spark.sql.types._ val schema = MapType(StringType, IntegerType) val in = Seq("""{"a": 1, "b": 2, "c": 3}""").toDS() in.select(from_json($"value", schema, Map[String, String]())).collect() ``` ``` res1: Array[org.apache.spark.sql.Row] = Array([Map(a -> 1, b -> 2, c -> 3)]) ``` ## How was this patch tested? It was checked by new tests for the map type with integer type and struct type as value types. Also roundtrip tests like from_json(to_json) and to_json(from_json) for MapType are added. Author: Maxim Gekk <maxim.gekk@databricks.com> Author: Maxim Gekk <max.gekk@gmail.com> Closes #21108 from MaxGekk/from_json-map-type.	2018-05-14 14:05:42 -07:00
Lu WANG	075d678c88	[SPARK-24155][ML] Instrumentation improvements for clustering ## What changes were proposed in this pull request? changed the instrument for all of the clustering methods ## How was this patch tested? N/A Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Lu WANG <lu.wang@databricks.com> Closes #21218 from ludatabricks/SPARK-23686-1.	2018-05-14 13:35:54 -07:00
Shixiong Zhu	c26f673252	[SPARK-24246][SQL] Improve AnalysisException by setting the cause when it's available ## What changes were proposed in this pull request? If there is an exception, it's better to set it as the cause of AnalysisException since the exception may contain useful debug information. ## How was this patch tested? Jenkins Author: Shixiong Zhu <zsxwing@gmail.com> Closes #21297 from zsxwing/SPARK-24246.	2018-05-14 11:37:57 -07:00
Felix Cheung	1430fa80e3	[SPARK-24263][R] SparkR java check breaks with openjdk ## What changes were proposed in this pull request? Change text to grep for. ## How was this patch tested? manual test Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #21314 from felixcheung/openjdkver.	2018-05-14 10:49:12 -07:00
Kazuaki Ishizaki	b6c50d7820	[SPARK-24228][SQL] Fix Java lint errors ## What changes were proposed in this pull request? This PR fixes the following Java lint errors due to importing unimport classes ``` $ dev/lint-java Using `mvn` from path: /usr/bin/mvn Checkstyle checks failed at following occurrences: [ERROR] src/main/java/org/apache/spark/sql/sources/v2/reader/partitioning/Distribution.java:[25] (sizes) LineLength: Line is longer than 100 characters (found 109). [ERROR] src/main/java/org/apache/spark/sql/sources/v2/reader/streaming/ContinuousReader.java:[38] (sizes) LineLength: Line is longer than 100 characters (found 102). [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java:[21,8] (imports) UnusedImports: Unused import - java.io.ByteArrayInputStream. [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedPlainValuesReader.java:[29,8] (imports) UnusedImports: Unused import - org.apache.spark.unsafe.Platform. [ERROR] src/test/java/test/org/apache/spark/sql/sources/v2/JavaAdvancedDataSourceV2.java:[110] (sizes) LineLength: Line is longer than 100 characters (found 101). ``` With this PR ``` $ dev/lint-java Using `mvn` from path: /usr/bin/mvn Checkstyle checks passed. ``` ## How was this patch tested? Existing UTs. Also manually run checkstyles against these two files. Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #21301 from kiszk/SPARK-24228.	2018-05-14 10:57:10 +08:00
Maxim Gekk	7a2d4895c7	[SPARK-17916][SQL] Fix empty string being parsed as null when nullValue is set. ## What changes were proposed in this pull request? I propose to bump version of uniVocity parser up to 2.6.3 where quoted empty strings are replaced by the empty value (passed to `setEmptyValue`) instead of `null` values as in the current version 2.5.9: https://github.com/uniVocity/univocity-parsers/blob/v2.6.3/src/main/java/com/univocity/parsers/csv/CsvParser.java#L125 Empty value for writer is set to `""`. So, empty string in dataframe/dataset is stored as empty quoted string `""`. Empty value for reader is set to empty string (zero size). In this way, saved empty quoted string will be read as just empty string. Please, look at the tests for more details. Here are main changes made in [2.6.0](https://github.com/uniVocity/univocity-parsers/releases/tag/v2.6.0), [2.6.1](https://github.com/uniVocity/univocity-parsers/releases/tag/v2.6.1), [2.6.2](https://github.com/uniVocity/univocity-parsers/releases/tag/v2.6.2), [2.6.3](https://github.com/uniVocity/univocity-parsers/releases/tag/v2.6.3): - CSV parser now parses quoted values ~30% faster - CSV format detection process has option provide a list of possible delimiters, in order of priority ( i.e. settings.detectFormatAutomatically( '-', '.');) - https://github.com/uniVocity/univocity-parsers/issues/214 - Implemented trim quoted values support - https://github.com/uniVocity/univocity-parsers/issues/230 - NullPointer when stopping parser when nothing is parsed - https://github.com/uniVocity/univocity-parsers/issues/219 - Concurrency issue when calling stopParsing() - https://github.com/uniVocity/univocity-parsers/issues/231 Closes #20068 ## How was this patch tested? Added tests from the PR https://github.com/apache/spark/pull/20068 Author: Maxim Gekk <maxim.gekk@databricks.com> Closes #21273 from MaxGekk/univocity-2.6.	2018-05-14 10:01:06 +08:00
Huaxin Gao	3f0e801c11	[SPARK-24186][R][SQL] change reverse and concat to collection functions in R ## What changes were proposed in this pull request? reverse and concat are already in functions.R as column string functions. Since now these two functions are categorized as collection functions in scala and python, we will do the same in R. ## How was this patch tested? Add test in test_sparkSQL.R Author: Huaxin Gao <huaxing@us.ibm.com> Closes #21307 from huaxingao/spark_24186.	2018-05-14 09:48:54 +08:00
Fan Donglai	2fa33649d9	Update StreamingKMeans.scala ## What changes were proposed in this pull request? I think the ‘n_t+t’ in the following code may be wrong, it shoud be ‘n_t+1’ that means is the number of points to the cluster after it finish the no.t+1 min-batch. * <blockquote> * $$ * \begin{align} * c_t+1 &= [(c_t * n_t * a) + (x_t * m_t)] / [n_t + m_t] \\ * n_t+t &= n_t * a + m_t * \end{align} * $$ * </blockquote> Author: Fan Donglai <ddna_1022@163.com> Closes #21179 from ddna1021/master.	2018-05-13 18:10:00 -05:00
Kelley Robinson	0d210ec8b6	[SPARK-24262][PYTHON] Fix typo in UDF type match error message ## What changes were proposed in this pull request? Updates `functon` to `function`. This was called out in holdenk's PyCon 2018 conference talk. Didn't see any existing PR's for this. holdenk happy to fix the Pandas.Series bug too but will need a bit more guidance. Author: Kelley Robinson <krobinson@twilio.com> Closes #21304 from robinske/master.	2018-05-13 13:19:03 -07:00
Cody Allen	32acfa78c6	Improve implicitNotFound message for Encoder The `implicitNotFound` message for `Encoder` doesn't mention the name of the type for which it can't find an encoder. Furthermore, it covers up the fact that `Encoder` is the name of the relevant type class. Hopefully this new message provides a little more specific type detail while still giving the general message about which types are supported. ## What changes were proposed in this pull request? Augment the existing message to mention that it's looking for an `Encoder` and what the type of the encoder is. For example instead of: ``` Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases. ``` return this message: ``` Unable to find encoder for type Exception. An implicit Encoder[Exception] is needed to store Exception instances in a Dataset. Primitive types (Int, String, etc) and Product types (ca se classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases. ``` ## How was this patch tested? It was tested manually in the Scala REPL, since triggering this in a test would cause a compilation error. ``` scala> implicitly[Encoder[Exception]] <console>:51: error: Unable to find encoder for type Exception. An implicit Encoder[Exception] is needed to store Exception instances in a Dataset. Primitive types (Int, String, etc) and Product types (ca se classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases. implicitly[Encoder[Exception]] ^ ``` Author: Cody Allen <ceedubs@gmail.com> Closes #20869 from ceedubs/encoder-implicit-msg.	2018-05-12 14:35:40 -05:00
Sean Owen	348ddfd20f	[BUILD] Close stale PRs Closes https://github.com/apache/spark/pull/20458 Closes https://github.com/apache/spark/pull/20530 Closes https://github.com/apache/spark/pull/20557 Closes https://github.com/apache/spark/pull/20966 Closes https://github.com/apache/spark/pull/20857 Closes https://github.com/apache/spark/pull/19694 Closes https://github.com/apache/spark/pull/18227 Closes https://github.com/apache/spark/pull/20683 Closes https://github.com/apache/spark/pull/20881 Closes https://github.com/apache/spark/pull/20347 Closes https://github.com/apache/spark/pull/20825 Closes https://github.com/apache/spark/pull/20078 Closes https://github.com/apache/spark/pull/21281 Closes https://github.com/apache/spark/pull/19951 Closes https://github.com/apache/spark/pull/20905 Closes https://github.com/apache/spark/pull/20635 Author: Sean Owen <srowen@gmail.com> Closes #21303 from srowen/ClosePRs.	2018-05-12 08:35:14 -05:00
Marek Novotny	5902125ac7	[SPARK-24198][SPARKR][SQL] Adding slice function to SparkR ## What changes were proposed in this pull request? The PR adds the `slice` function to SparkR. The function returns a subset of consecutive elements from the given array. ``` > df <- createDataFrame(cbind(model = rownames(mtcars), mtcars)) > tmp <- mutate(df, v1 = create_array(df$mpg, df$cyl, df$hp)) > head(select(tmp, slice(tmp$v1, 2L, 2L))) ``` ``` slice(v1, 2, 2) 1 6, 110 2 6, 110 3 4, 93 4 6, 110 5 8, 175 6 6, 105 ``` ## How was this patch tested? A test added into R/pkg/tests/fulltests/test_sparkSQL.R Author: Marek Novotny <mn.mikke@gmail.com> Closes #21298 from mn-mikke/SPARK-24198.	2018-05-12 19:21:42 +08:00
Reynold Xin	e3dabdf6ef	[SPARK-23907] Removes regr_* functions in functions.scala ## What changes were proposed in this pull request? This patch removes the various regr_* functions in functions.scala. They are so uncommon that I don't think they deserve real estate in functions.scala. We can consider adding them later if more users need them. ## How was this patch tested? Removed the associated test case as well. Author: Reynold Xin <rxin@databricks.com> Closes #21309 from rxin/SPARK-23907.	2018-05-12 12:15:36 +08:00
Shivaram Venkataraman	f27a035daf	[SPARKR] Require Java 8 for SparkR This change updates the SystemRequirements and also includes a runtime check if the JVM is being launched by R. The runtime check is done by querying `java -version` ## How was this patch tested? Tested on a Mac and Windows machine Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu> Closes #21278 from shivaram/sparkr-skip-solaris.	2018-05-11 17:00:51 -07:00
aditkumar	92f6f52ff0	[MINOR][DOCS] Documenting months_between direction ## What changes were proposed in this pull request? It's useful to know what relationship between date1 and date2 results in a positive number. Author: aditkumar <aditkumar@gmail.com> Author: Adit Kumar <aditkumar@gmail.com> Closes #20787 from aditkumar/master.	2018-05-11 14:42:23 -05:00
Wenchen Fan	928845a422	[SPARK-24172][SQL] we should not apply operator pushdown to data source v2 many times ## What changes were proposed in this pull request? In `PushDownOperatorsToDataSource`, we use `transformUp` to match `PhysicalOperation` and apply pushdown. This is problematic if we have multiple `Filter` and `Project` above the data source v2 relation. e.g. for a query ``` Project Filter DataSourceV2Relation ``` The pattern match will be triggered twice and we will do operator pushdown twice. This is unnecessary, we can use `mapChildren` to only apply pushdown once. ## How was this patch tested? existing test Author: Wenchen Fan <wenchen@databricks.com> Closes #21230 from cloud-fan/step2.	2018-05-11 10:00:28 -07:00
Marcelo Vanzin	54032682b9	[SPARK-24182][YARN] Improve error message when client AM fails. Instead of always throwing a generic exception when the AM fails, print a generic error and throw the exception with the YARN diagnostics containing the reason for the failure. There was an issue with YARN sometimes providing a generic diagnostic message, even though the AM provides a failure reason when unregistering. That was happening because the AM was registering too late, and if errors happened before the registration, YARN would just create a generic "ExitCodeException" which wasn't very helpful. Since most errors in this path are a result of not being able to connect to the driver, this change modifies the AM registration a bit so that the AM is registered before the connection to the driver is established. That way, errors are properly propagated through YARN back to the driver. As part of that, I also removed the code that retried connections to the driver from the client AM. At that point, the driver should already be up and waiting for connections, so it's unlikely that retrying would help - and in case it does, that means a flaky network, which would mean problems would probably show up again. The effect of that is that connection-related errors are reported back to the driver much faster now (through the YARN report). One thing to note is that there seems to be a race on the YARN side that causes a report to be sent to the client without the corresponding diagnostics string from the AM; the diagnostics are available later from the RM web page. For that reason, the generic error messages are kept in the Spark scheduler code, to help guide users to a way of debugging their failure. Also of note is that if YARN's max attempts configuration is lower than Spark's, Spark will not unregister the AM with a proper diagnostics message. Unfortunately there seems to be no way to unregister the AM and still allow further re-attempts to happen. Testing: - existing unit tests - some of our integration tests - hardcoded an invalid driver address in the code and verified the error in the shell. e.g. ``` scala> 18/05/04 15:09:34 ERROR cluster.YarnClientSchedulerBackend: YARN application has exited unexpectedly with state FAILED! Check the YARN application logs for more details. 18/05/04 15:09:34 ERROR cluster.YarnClientSchedulerBackend: Diagnostics message: Uncaught exception: org.apache.spark.SparkException: Exception thrown in awaitResult: <AM stack trace> Caused by: java.io.IOException: Failed to connect to localhost/127.0.0.1:1234 <More stack trace> ``` Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #21243 from vanzin/SPARK-24182.	2018-05-11 17:40:35 +08:00

... 4 5 6 7 8 ...

22195 commits