ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Marcelo Vanzin	95aef660b7	[SPARK-20205][CORE] Make sure StageInfo is updated before sending event. The DAGScheduler was sending a "stage submitted" event before it properly updated the event's information. This meant that a listener (e.g. the event logging listener) could record wrong information about the event. This change sets the stage's submission time before the event is submitted, when there are tasks to be executed in the stage. Tested with existing unit tests. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #17925 from vanzin/SPARK-20205.	2017-05-24 16:57:17 -07:00
Reynold Xin	a64746677b	[SPARK-20867][SQL] Move hints from Statistics into HintInfo class ## What changes were proposed in this pull request? This is a follow-up to SPARK-20857 to move the broadcast hint from Statistics into a new HintInfo class, so we can be more flexible in adding new hints in the future. ## How was this patch tested? Updated test cases to reflect the change. Author: Reynold Xin <rxin@databricks.com> Closes #18087 from rxin/SPARK-20867.	2017-05-24 13:57:19 -07:00
Liang-Chi Hsieh	f72ad303f0	[SPARK-20848][SQL] Shutdown the pool after reading parquet files ## What changes were proposed in this pull request? From JIRA: On each call to spark.read.parquet, a new ForkJoinPool is created. One of the threads in the pool is kept in the WAITING state, and never stopped, which leads to unbounded growth in number of threads. We should shutdown the pool after reading parquet files. ## How was this patch tested? Added a test to ParquetFileFormatSuite. Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #18073 from viirya/SPARK-20848.	2017-05-25 00:35:40 +08:00
Bago Amirbekian	bc66a77bbe	[SPARK-20862][MLLIB][PYTHON] Avoid passing float to ndarray.reshape in LogisticRegressionModel ## What changes were proposed in this pull request? Fixed TypeError with python3 and numpy 1.12.1. Numpy's `reshape` no longer takes floats as arguments as of 1.12. Also, python3 uses float division for `/`, we should be using `//` to ensure that `_dataWithBiasSize` doesn't get set to a float. ## How was this patch tested? Existing tests run using python3 and numpy 1.12. Author: Bago Amirbekian <bago@databricks.com> Closes #18081 from MrBago/BF-py3floatbug.	2017-05-24 22:55:38 +08:00
zero323	1816eb3bef	[SPARK-20631][FOLLOW-UP] Fix incorrect tests. ## What changes were proposed in this pull request? - Fix incorrect tests for `_check_thresholds`. - Move test to `ParamTests`. ## How was this patch tested? Unit tests. Author: zero323 <zero323@users.noreply.github.com> Closes #18085 from zero323/SPARK-20631-FOLLOW-UP.	2017-05-24 19:57:44 +08:00
Peng	9afcf127d3	[SPARK-20764][ML][PYSPARK][FOLLOWUP] Fix visibility discrepancy with numInstances and degreesOfFreedom in LR and GLR - Python version ## What changes were proposed in this pull request? Add test cases for PR-18062 ## How was this patch tested? The existing UT Author: Peng <peng.meng@intel.com> Closes #18068 from mpjlu/moreTest.	2017-05-24 19:54:17 +08:00
Xingbo Jiang	d76633e3ca	[SPARK-18406][CORE] Race between end-of-task and completion iterator read lock release ## What changes were proposed in this pull request? When a TaskContext is not propagated properly to all child threads for the task, as in the cases reported in this issue, we fail to get the TID from the TaskContext, which makes it impossible to release the lock and causes assertion failures. To resolve this, we have to explicitly pass the TID value to the `unlock` method. ## How was this patch tested? Add new failing regression test case in `RDDSuite`. Author: Xingbo Jiang <xingbo.jiang@databricks.com> Closes #18076 from jiangxb1987/completion-iterator.	2017-05-24 15:43:23 +08:00
Bago Amirbekian	9434280cfd	[SPARK-20861][ML][PYTHON] Delegate looping over paramMaps to estimators Changes: pyspark.ml Estimators can take either a list of param maps or a dict of params. This change allows the CrossValidator and TrainValidationSplit Estimators to pass through lists of param maps to the underlying estimators so that those estimators can handle parallelization when appropriate (eg distributed hyper parameter tuning). Testing: Existing unit tests. Author: Bago Amirbekian <bago@databricks.com> Closes #18077 from MrBago/delegate_params.	2017-05-23 20:56:01 -07:00
Kirby Linvill	4816c2ef5e	[SPARK-15648][SQL] Add teradataDialect for JDBC connection to Teradata The contribution is my original work and I license the work to the project under the project’s open source license. Note: the Teradata JDBC connector limits the row size to 64K. The default string datatype equivalent I used is a 255 character/byte length varchar. This effectively limits the max number of string columns to 250 when using the Teradata jdbc connector. ## What changes were proposed in this pull request? Added a teradataDialect for JDBC connection to Teradata. The Teradata dialect uses VARCHAR(255) in place of TEXT for string datatypes, and CHAR(1) in place of BIT(1) for boolean datatypes. ## How was this patch tested? I added two unit tests to double check that the types get set correctly for a teradata jdbc url. I also ran a couple manual tests to make sure the jdbc connector worked with teradata and to make sure that an error was thrown if a row could potentially exceed 64K (this error comes from the teradata jdbc connector, not from the spark code). I did not check how string columns longer than 255 characters are handled. Author: Kirby Linvill <kirby.linvill@teradata.com> Author: klinvill <kjlinvill@gmail.com> Closes #16746 from klinvill/master.	2017-05-23 12:00:58 -07:00
Reynold Xin	0d589ba00b	[SPARK-20857][SQL] Generic resolved hint node ## What changes were proposed in this pull request? This patch renames BroadcastHint to ResolvedHint (and Hint to UnresolvedHint) so the hint framework is more generic and would allow us to introduce other hint types in the future without introducing new hint nodes. ## How was this patch tested? Updated test cases. Author: Reynold Xin <rxin@databricks.com> Closes #18072 from rxin/SPARK-20857.	2017-05-23 18:44:49 +02:00
Yanbo Liang	ad09e4ca04	[MINOR][SPARKR][ML] Joint coefficients with intercept for SparkR linear SVM summary. ## What changes were proposed in this pull request? Joint coefficients with intercept for SparkR linear SVM summary. ## How was this patch tested? Existing tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #18035 from yanboliang/svm-r.	2017-05-23 16:16:14 +08:00
Liang-Chi Hsieh	442287ae29	[SPARK-20399][SQL][FOLLOW-UP] Add a config to fallback string literal parsing consistent with old sql parser behavior ## What changes were proposed in this pull request? As srowen pointed in `609ba5f2b9 (commitcomment-22221259)`, the previous tests are not proper. This follow-up is going to fix the tests. ## How was this patch tested? Jenkins tests. Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #18048 from viirya/SPARK-20399-follow-up.	2017-05-23 16:09:38 +08:00
Shivaram Venkataraman	d06610f992	[SPARK-20727] Skip tests that use Hadoop utils on CRAN Windows ## What changes were proposed in this pull request? This change skips tests that use the Hadoop libraries while running on CRAN check with Windows as the operating system. This is to handle cases where the Hadoop winutils binaries are missing on the target system. The skipped tests consist of 1. Tests that save, load a model in MLlib 2. Tests that save, load CSV, JSON and Parquet files in SQL 3. Hive tests ## How was this patch tested? Tested by running on a local windows VM with HADOOP_HOME unset. Also testing with https://win-builder.r-project.org Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu> Closes #17966 from shivaram/sparkr-windows-cran.	2017-05-22 23:04:22 -07:00
James Shuster	4dbb63f085	[SPARK-20815][SPARKR] NullPointerException in RPackageUtils#checkManifestForR ## What changes were proposed in this pull request? - Add a null check to RPackageUtils#checkManifestForR so that jars w/o manifests don't NPE. ## How was this patch tested? - Unit tests and manual tests. Author: James Shuster <jshuster@palantir.com> Closes #18040 from jrshust/feature/r-package-utils.	2017-05-22 21:41:11 -07:00
Xiao Li	a2460be9c3	[SPARK-17410][SPARK-17284] Move Hive-generated Stats Info to HiveClientImpl ### What changes were proposed in this pull request? After adding a new field `stats` to `CatalogTable`, we should not expose Hive-specific stats metadata to `MetastoreRelation`. Doing so complicates all the related code and also introduces a bug in `SHOW CREATE TABLE`: the statistics-related table properties should be skipped by `SHOW CREATE TABLE`, since they could be incorrect in the newly created table. See the Hive JIRA: https://issues.apache.org/jira/browse/HIVE-13792 This PR also fixes the issue of filling Hive-generated RowCounts into our stats, and handles Hive-specific stats metadata in `HiveClientImpl`. ### How was this patch tested? Added a few test cases. Author: Xiao Li <gatorsmile@gmail.com> Closes #14971 from gatorsmile/showCreateTableNew.	2017-05-22 17:28:30 -07:00
Yuming Wang	9b09101938	[SPARK-20751][SQL][FOLLOWUP] Add cot test in MathExpressionsSuite ## What changes were proposed in this pull request? Add cot test in MathExpressionsSuite as https://github.com/apache/spark/pull/17999#issuecomment-302832794. ## How was this patch tested? unit tests Author: Yuming Wang <wgyumg@gmail.com> Closes #18039 from wangyum/SPARK-20751-test.	2017-05-22 13:05:05 -07:00
Marcelo Vanzin	df64fa79d6	[SPARK-20814][MESOS] Restore support for spark.executor.extraClassPath. Restore code that was removed as part of SPARK-17979, but instead of using the deprecated env variable name to propagate the class path, use a new one. Verified by running "./bin/spark-class o.a.s.executor.CoarseGrainedExecutorBackend" manually. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #18037 from vanzin/SPARK-20814.	2017-05-22 12:34:15 -07:00
Zheng RuiFeng	4be3375835	[SPARK-15767][ML][SPARKR] Decision Tree wrapper in SparkR ## What changes were proposed in this pull request? support decision tree in R ## How was this patch tested? added tests Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #17981 from zhengruifeng/dt_r.	2017-05-22 10:40:49 -07:00
Mark Grover	3630911004	[SPARK-20756][YARN] yarn-shuffle jar references unshaded guava and contains scala classes ## What changes were proposed in this pull request? This change ensures that all references to guava from within the yarn shuffle jar point to the shaded guava class already provided in the jar. Also, it explicitly excludes scala classes from being added to the jar. ## How was this patch tested? Ran unit tests on the module and they passed. javap now returns the expected result - reference to the shaded guava under `org/spark_project` (previously this was referring to `com.google...`) ``` javap -cp common/network-yarn/target/scala-2.11/spark-2.3.0-SNAPSHOT-yarn-shuffle.jar -c org/apache/spark/network/yarn/YarnShuffleService | grep Lists 57: invokestatic #138 // Method org/spark_project/guava/collect/Lists.newArrayList:()Ljava/util/ArrayList; ``` Guava is still shaded in the jar: ``` jar -tf common/network-yarn/target/scala-2.11/spark-2.3.0-SNAPSHOT-yarn-shuffle.jar | grep guava | head META-INF/maven/com.google.guava/ META-INF/maven/com.google.guava/guava/ META-INF/maven/com.google.guava/guava/pom.properties META-INF/maven/com.google.guava/guava/pom.xml org/spark_project/guava/ org/spark_project/guava/annotations/ org/spark_project/guava/annotations/Beta.class org/spark_project/guava/annotations/GwtCompatible.class org/spark_project/guava/annotations/GwtIncompatible.class org/spark_project/guava/annotations/VisibleForTesting.class ``` (not sure if the above META-INF/* is a problem or not) I took this jar, deployed it on a yarn cluster with shuffle service enabled, and made sure the YARN node managers came up. An application with a shuffle was run and it succeeded. Author: Mark Grover <mark@apache.org> Closes #17990 from markgrover/spark-20756.	2017-05-22 10:10:41 -07:00
Peng	cfca01136b	[SPARK-20764][ML][PYSPARK] Fix visibility discrepancy with numInstances and degreesOfFreedom in LR and GLR - Python version ## What changes were proposed in this pull request? SPARK-20097 exposed degreesOfFreedom in LinearRegressionSummary and numInstances in GeneralizedLinearRegressionSummary. Python API should be updated to reflect these changes. ## How was this patch tested? The existing UT Author: Peng <peng.meng@intel.com> Closes #18062 from mpjlu/spark-20764.	2017-05-22 22:42:37 +08:00
gatorsmile	f3ed62a381	[SPARK-20831][SQL] Fix INSERT OVERWRITE data source tables with IF NOT EXISTS ### What changes were proposed in this pull request? Currently, we have a bug when we specify `IF NOT EXISTS` in `INSERT OVERWRITE` data source tables. For example, given a query: ```SQL INSERT OVERWRITE TABLE $tableName partition (b=2, c=3) IF NOT EXISTS SELECT 9, 10 ``` we will get the following error: ``` unresolved operator 'InsertIntoTable Relation[a#425,d#426,b#427,c#428] parquet, Map(b -> Some(2), c -> Some(3)), true, true;; 'InsertIntoTable Relation[a#425,d#426,b#427,c#428] parquet, Map(b -> Some(2), c -> Some(3)), true, true +- Project [cast(9#423 as int) AS a#429, cast(10#424 as int) AS d#430] +- Project [9 AS 9#423, 10 AS 10#424] +- OneRowRelation$ ``` This PR is to fix the issue to follow the behavior of Hive serde tables > INSERT OVERWRITE will overwrite any existing data in the table or partition unless IF NOT EXISTS is provided for a partition ### How was this patch tested? Modified an existing test case Author: gatorsmile <gatorsmile@gmail.com> Closes #18050 from gatorsmile/insertPartitionIfNotExists.	2017-05-22 22:24:50 +08:00
jinxing	2597674bcc	[SPARK-20801] Record accurate size of blocks in MapStatus when it's above threshold. ## What changes were proposed in this pull request? Currently, when the number of reduces is above 2000, HighlyCompressedMapStatus is used to store the sizes of blocks. In HighlyCompressedMapStatus, only the average size is stored for non-empty blocks, which is not good for memory control when we shuffle blocks. It makes sense to store the accurate size of a block when it is above the threshold. ## How was this patch tested? Added test in MapStatusSuite. Author: jinxing <jinxing6042@126.com> Closes #18031 from jinxing64/SPARK-20801.	2017-05-22 22:09:49 +08:00
John Lee	aea73be1b4	[SPARK-20813][WEB UI] Fixed Web UI executor page tab search by status not working ## What changes were proposed in this pull request? On status column of the table, I removed the condition that forced only the display value to take on values Active, Blacklisted and Dead. Before the removal, values used for sort and filter for that particular column was True and False. ## How was this patch tested? Tested with Active, Blacklisted and Dead present as current status. Author: John Lee <jlee2@yahoo-inc.com> Closes #18036 from yoonlee95/SPARK-20813.	2017-05-22 14:24:49 +01:00
caoxuewen	f1ffc6e71f	[SPARK-20609][CORE] Running the SortShuffleSuite unit tests leaves a residual spark_* system directory ## What changes were proposed in this pull request? This PR removes the residual spark_* system directory left behind after running the SortShuffleSuite unit tests. For example, on Windows 7, after running the SortShuffleSuite unit tests, the system TMP directory still contains '..\AppData\Local\Temp\spark-f64121f9-11b4-4ffd-a4f0-cfca66643503', which was not deleted. ## How was this patch tested? Run SortShuffleSuite unit test. Author: caoxuewen <cao.xuewen@zte.com.cn> Closes #17869 from heary-cao/SortShuffleSuite.	2017-05-22 14:23:23 +01:00
fjh100456	190d8b0b63	[SPARK-20591][WEB UI] Succeeded tasks num not equal in all jobs page and job detail page on spark web ui when speculative task(s) exist. ## What changes were proposed in this pull request? Changed the succeeded count in the job detail page from "completed = stageData.completedIndices.size" to "completed = stageData.numCompleteTasks", which makes the number of succeeded tasks consistent between the all jobs page and the job detail page, and makes it easier to find which stages the speculative task(s) were in. ## How was this patch tested? manual tests Author: fjh100456 <fu.jinhua6@zte.com.cn> Closes #17923 from fjh100456/master.	2017-05-22 13:58:42 +01:00
Nick Pentreath	be846db48b	[SPARK-20506][DOCS] Add HTML links to highlight list in MLlib guide for 2.2 Quick follow up to #17996 - forgot to add the HTML links to the relevant sections of the guide in the highlights list. ## How was this patch tested? Built docs locally and tested links. Author: Nick Pentreath <nickp@za.ibm.com> Closes #18043 from MLnick/SPARK-20506-2.2-migration-guide-2.	2017-05-22 12:29:29 +02:00
Ignacio Bermudez	06dda1d58f	[SPARK-20687][MLLIB] mllib.Matrices.fromBreeze may crash when converting from Breeze sparse matrix ## What changes were proposed in this pull request? When an operation is applied to two Breeze SparseMatrices, the result matrix may contain extra provisional 0 values in its rowIndices and data arrays. This makes those arrays inconsistent with the colPtrs data, but Breeze gets away with the inconsistency by keeping a counter of the valid data. In Spark, when these matrices are converted to SparseMatrices, Spark relies solely on rowIndices, data, and colPtrs, which might be incorrect because of Breeze's internal hacks. Therefore, we need to slice both rowIndices and data using their counter of active data. This method is called at least by BlockMatrix when performing distributed block operations, causing exceptions on valid operations. See http://stackoverflow.com/questions/33528555/error-thrown-when-using-blockmatrix-add ## How was this patch tested? Added a test to MatricesSuite that verifies that the conversions are valid and that the code doesn't crash. Originally the same code would crash on Spark. Bugfix for https://issues.apache.org/jira/browse/SPARK-20687 Author: Ignacio Bermudez <ignaciobermudez@gmail.com> Author: Ignacio Bermudez Corrales <icorrales@splunk.com> Closes #17940 from ghoto/bug-fix/SPARK-20687.	2017-05-22 10:27:28 +01:00
Michal Senkyr	a2b3b67624	[SPARK-19089][SQL] Add support for nested sequences ## What changes were proposed in this pull request? Replaced specific sequence encoders with generic sequence encoder to enable nesting of sequences. Does not add support for nested arrays as that cannot be solved in this way. ## How was this patch tested? ```bash build/mvn -DskipTests clean package && dev/run-tests ``` Additionally in Spark shell: ``` scala> Seq(Seq(Seq(1))).toDS.collect() res0: Array[Seq[Seq[Int]]] = Array(List(List(1))) ``` Author: Michal Senkyr <mike.senkyr@gmail.com> Closes #18011 from michalsenkyr/dataset-seq-nested.	2017-05-22 16:49:19 +08:00
Kazuaki Ishizaki	833c8d4152	[SPARK-20770][SQL] Improve ColumnStats ## What changes were proposed in this pull request? This PR improves the implementation of `ColumnStats` by using the following approaches. 1. Declare subclasses of `ColumnStats` as `final` 2. Remove unnecessary call of `row.isNullAt(ordinal)` 3. Remove the dependency on `GenericInternalRow` For 1., this declaration encourages method inlining and other optimizations by the JIT compiler. For 2., in `gatherStats()`, while the previous code in subclasses of `ColumnStats` always calls `row.isNullAt()` twice, the PR calls `row.isNullAt()` only once. For 3., `collectedStatistics()` returns `Array[Any]` instead of `GenericInternalRow`. This removes the dependency on an unnecessary package and reduces the number of allocations of `GenericInternalRow`. In addition to that, in the future, `gatherValueStats()`, which is specialized for each data type, can be effectively called from the generated code without using the generic data structure `InternalRow`. ## How was this patch tested? Tested by existing test suite Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #18002 from kiszk/SPARK-20770.	2017-05-22 16:23:23 +08:00
caoxuewen	3c9eef35a8	[SPARK-20786][SQL] Improve ceil and floor handling of values that produce unexpected results ## What changes were proposed in this pull request? spark-sql>SELECT ceil(1234567890123456); 1234567890123456 spark-sql>SELECT ceil(12345678901234567); 12345678901234568 spark-sql>SELECT ceil(123456789012345678); 123456789012345680 When the length of getText is greater than 16, converting long to double loses precision, but MySQL handles these values correctly. mysql> SELECT ceil(1234567890123456); +------------------------+ | ceil(1234567890123456) | +------------------------+ | 1234567890123456 | +------------------------+ 1 row in set (0.00 sec) mysql> SELECT ceil(12345678901234567); +-------------------------+ | ceil(12345678901234567) | +-------------------------+ | 12345678901234567 | +-------------------------+ 1 row in set (0.00 sec) mysql> SELECT ceil(123456789012345678); +--------------------------+ | ceil(123456789012345678) | +--------------------------+ | 123456789012345678 | +--------------------------+ 1 row in set (0.00 sec) ## How was this patch tested? Supplement the unit test. Author: caoxuewen <cao.xuewen@zte.com.cn> Closes #18016 from heary-cao/ceil_long.	2017-05-21 22:39:07 -07:00
Wayne Zhang	0f2f56c37b	[SPARK-20736][PYTHON] PySpark StringIndexer supports StringOrderType ## What changes were proposed in this pull request? PySpark StringIndexer supports StringOrderType added in #17879. Author: Wayne Zhang <actuaryzhang@uber.com> Closes #17978 from actuaryzhang/PythonStringIndexer.	2017-05-21 16:51:55 -07:00
Tathagata Das	9d6661c829	[SPARK-20792][SS] Support same timeout operations in mapGroupsWithState function in batch queries as in streaming queries ## What changes were proposed in this pull request? Currently, in the batch queries, timeout is disabled (i.e. GroupStateTimeout.NoTimeout) which means any GroupState.setTimeout*** operation would throw UnsupportedOperationException. This makes it weird when converting a streaming query into a batch query by changing the input DF from streaming to a batch DF. If the timeout was enabled and used, then the batch query will start throwing UnsupportedOperationException. This PR creates the dummy state in batch queries with the provided timeoutConf so that it behaves in the same way. The code has been refactored to make it obvious when the state is being created for a batch query or a streaming query. ## How was this patch tested? Additional tests Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #18024 from tdas/SPARK-20792.	2017-05-21 13:07:25 -07:00
Sean Owen	bbd8d7def1	[SPARK-20806][DEPLOY] Launcher: redundant check for Spark lib dir ## What changes were proposed in this pull request? Remove redundant check for libdir in CommandBuilderUtils ## How was this patch tested? Existing tests Author: Sean Owen <sowen@cloudera.com> Closes #18032 from srowen/SPARK-20806.	2017-05-20 15:27:13 +01:00
liuzhaokun	749418d285	[SPARK-20781] the location of Dockerfile in docker.properties.template is wrong [https://issues.apache.org/jira/browse/SPARK-20781](https://issues.apache.org/jira/browse/SPARK-20781) the location of Dockerfile in docker.properties.template should be "../external/docker/spark-mesos/Dockerfile" Author: liuzhaokun <liu.zhaokun@zte.com.cn> Closes #18013 from liu-zhaokun/dockerfile_location.	2017-05-19 20:47:30 +01:00
Nick Pentreath	b5d8d9ba17	[SPARK-20506][DOCS] 2.2 migration guide Update ML guide for migration `2.1` -> `2.2` and the previous version migration guide section. ## How was this patch tested? Build doc locally. Author: Nick Pentreath <nickp@za.ibm.com> Closes #17996 from MLnick/SPARK-20506-2.2-migration-guide.	2017-05-19 20:51:56 +02:00
Wayne Zhang	7f203a248f	[SPARKR] Fix bad examples in DataFrame methods and style issues ## What changes were proposed in this pull request? Some examples in the DataFrame methods are syntactically wrong, even though they are pseudo code. Fix these and some style issues. Author: Wayne Zhang <actuaryzhang@uber.com> Closes #18003 from actuaryzhang/sparkRDoc3.	2017-05-19 11:18:20 -07:00
zero323	2d90c04f23	[SPARKR][DOCS][MINOR] Use consistent names in rollup and cube examples ## What changes were proposed in this pull request? Rename `carsDF` to `df` in SparkR `rollup` and `cube` examples. ## How was this patch tested? Manual tests. Author: zero323 <zero323@users.noreply.github.com> Closes #17988 from zero323/cube-docs.	2017-05-19 11:04:38 -07:00
liuxian	ea3b1e352a	[SPARK-20763][SQL] The `month` and `day` functions return unexpected values. ## What changes were proposed in this pull request? spark-sql>select month("1582-09-28"); spark-sql>10 For this case, the expected result is 9, but it is 10. spark-sql>select day("1582-04-18"); spark-sql>28 For this case, the expected result is 18, but it is 28. When the date is before "1582-10-04", the `month` and `day` functions return unexpected values. ## How was this patch tested? unit tests Author: liuxian <liu.xian3@zte.com.cn> Closes #17997 from 10110346/wip_lx_0516.	2017-05-19 10:25:21 -07:00
Yuming Wang	bff021dfaf	[SPARK-20751][SQL] Add built-in SQL Function - COT ## What changes were proposed in this pull request? Add built-in SQL Function - COT. ## How was this patch tested? unit tests Author: Yuming Wang <wgyumg@gmail.com> Closes #17999 from wangyum/SPARK-20751.	2017-05-19 09:40:22 -07:00
liuzhaokun	dba2ca2c12	[SPARK-20759] SCALA_VERSION in _config.yml should be consistent with pom.xml [https://issues.apache.org/jira/browse/SPARK-20759](https://issues.apache.org/jira/browse/SPARK-20759) SCALA_VERSION in _config.yml is 2.11.7, but 2.11.8 in pom.xml. So I think SCALA_VERSION in _config.yml should be consistent with pom.xml. Author: liuzhaokun <liu.zhaokun@zte.com.cn> Closes #17992 from liu-zhaokun/new.	2017-05-19 15:26:39 +01:00
caoxuewen	f398640daa	[SPARK-20607][CORE] Add new unit tests to ShuffleSuite ## What changes were proposed in this pull request? This PR makes two updates: 1. Adds new unit tests, which verify that when there is no shuffle stage, shuffle does not generate the data file or the index files. 2. Modifies the '[SPARK-4085] rerun map stage if reduce stage cannot find its local shuffle file' unit test: parallelize uses 1 rather than 2, and the index file is checked and deleted. ## How was this patch tested? The new unit test. Author: caoxuewen <cao.xuewen@zte.com.cn> Closes #17868 from heary-cao/ShuffleSuite.	2017-05-19 15:25:03 +01:00
tpoterba	3f2cd51ee0	[SPARK-20773][SQL] ParquetWriteSupport.writeFields is quadratic in number of fields Fix quadratic List indexing in ParquetWriteSupport. I noticed this function while profiling some code with today. It showed up as a significant factor in a table with twenty columns; with hundreds of columns, it could dominate any other function call. ## What changes were proposed in this pull request? The writeFields method iterates from 0 until number of fields, indexing into rootFieldWriters for each element. rootFieldWriters is a List, so indexing is a linear operation. The complexity of the writeFields method is thus quadratic in the number of fields. Solution: explicitly convert rootFieldWriters to Array (implicitly converted to WrappedArray) for constant-time indexing. ## How was this patch tested? This is a one-line change for performance reasons. Author: tpoterba <tpoterba@broadinstitute.org> Author: Tim Poterba <tpoterba@gmail.com> Closes #18005 from tpoterba/tpoterba-patch-1.	2017-05-19 14:17:12 +02:00
Ala Luszczak	ce8edb8bf4	[SPARK-20798] GenerateUnsafeProjection should check if a value is null before calling the getter ## What changes were proposed in this pull request? GenerateUnsafeProjection.writeStructToBuffer() did not honor the assumption that the caller must make sure that a value is not null before using the getter. This could lead to various errors. This change fixes that behavior. Example of code generated before: ```scala /* 059 */ final UTF8String fieldName = value.getUTF8String(0); /* 060 */ if (value.isNullAt(0)) { /* 061 */ rowWriter1.setNullAt(0); /* 062 */ } else { /* 063 */ rowWriter1.write(0, fieldName); /* 064 */ } ``` Example of code generated now: ```scala /* 060 */ boolean isNull1 = value.isNullAt(0); /* 061 */ UTF8String value1 = isNull1 ? null : value.getUTF8String(0); /* 062 */ if (isNull1) { /* 063 */ rowWriter1.setNullAt(0); /* 064 */ } else { /* 065 */ rowWriter1.write(0, value1); /* 066 */ } ``` ## How was this patch tested? Adds GenerateUnsafeProjectionSuite. Author: Ala Luszczak <ala@databricks.com> Closes #18030 from ala/fix-generate-unsafe-projection.	2017-05-19 13:18:48 +02:00
Yash Sharma	92580bd0ea	[DSTREAM][DOC] Add documentation for kinesis retry configurations ## What changes were proposed in this pull request? The changes were merged as part of - https://github.com/apache/spark/pull/17467. The documentation was missed somewhere in the review iterations. Adding the documentation where it belongs. ## How was this patch tested? Docs. Not tested. cc budde , brkyvz Author: Yash Sharma <ysharma@atlassian.com> Closes #18028 from yssharma/ysharma/kinesis_retry_docs.	2017-05-18 11:24:33 -07:00
hyukjinkwon	8fb3d5c6da	[SPARK-20364][SQL] Disable Parquet predicate pushdown for fields having dots in the names ## What changes were proposed in this pull request? This is an alternative workaround that simply avoids predicate pushdown for columns having dots in their names. This approach differs from https://github.com/apache/spark/pull/17680. The downside of this PR is that it does not push down filters on columns having dots in Parquet files at all (neither at record level nor at rowgroup level), whereas the downside of the approach in that PR is that it does not use Parquet's API properly but supports this case in a hacky way. I assume we prefer a safe way here by using the Parquet API properly but this does close that PR as we are basically just avoiding here. This looks like a simple workaround and is probably fine given that the problem is arguably a corner case (although it might end up reading whole row groups under the hood, so neither approach looks ideal). Currently, if there are dots in the column name, predicate pushdown seems to fail in Parquet. With dots ```scala val path = "/tmp/abcde" Seq(Some(1), None).toDF("col.dots").write.parquet(path) spark.read.parquet(path).where("`col.dots` IS NOT NULL").show() ``` ``` +--------+ |col.dots| +--------+ +--------+ ``` Without dots ```scala val path = "/tmp/abcde" Seq(Some(1), None).toDF("coldots").write.parquet(path) spark.read.parquet(path).where("`coldots` IS NOT NULL").show() ``` ``` +-------+ |coldots| +-------+ | 1| +-------+ ``` After ```scala val path = "/tmp/abcde" Seq(Some(1), None).toDF("col.dots").write.parquet(path) spark.read.parquet(path).where("`col.dots` IS NOT NULL").show() ``` ``` +--------+ |col.dots| +--------+ | 1| +--------+ ``` ## How was this patch tested? Unit tests added in `ParquetFilterSuite`. Author: hyukjinkwon <gurwls223@gmail.com> Closes #18000 from HyukjinKwon/SPARK-20364-workaround.	2017-05-18 10:52:23 -07:00
liuzhaokun	99452df44f	[SPARK-20796] the location of start-master.sh in spark-standalone.md is wrong [https://issues.apache.org/jira/browse/SPARK-20796](https://issues.apache.org/jira/browse/SPARK-20796) the location of start-master.sh in spark-standalone.md should be "sbin/start-master.sh" rather than "bin/start-master.sh". Author: liuzhaokun <liu.zhaokun@zte.com.cn> Closes #18027 from liu-zhaokun/sbin.	2017-05-18 17:44:40 +01:00
zuotingbing	4779b86b5a	[SPARK-20779][EXAMPLES] The ASF header placed in an incorrect location in some files. ## What changes were proposed in this pull request? The license is not at the top in some files. and it will be best if we update these places of the ASF header to be consistent with other files. ## How was this patch tested? manual tests Author: zuotingbing <zuo.tingbing9@zte.com.cn> Closes #18012 from zuotingbing/spark-license.	2017-05-18 17:28:14 +01:00
hyukjinkwon	5d2750aa2d	[INFRA] Close stale PRs ## What changes were proposed in this pull request? This PR proposes to close PRs ... - inactive to the review comments more than a month - WIP and inactive more than a month - with Jenkins build failure but inactive more than a month - suggested to be closed and no comment against that - obviously looking inappropriate (e.g., Branch 0.5) To make sure, I left a comment for each PR about a week ago and I could not have a response back from the author in these PRs below: Closes #11129 Closes #12085 Closes #12162 Closes #12419 Closes #12420 Closes #12491 Closes #13762 Closes #13837 Closes #13851 Closes #13881 Closes #13891 Closes #13959 Closes #14091 Closes #14481 Closes #14547 Closes #14557 Closes #14686 Closes #15594 Closes #15652 Closes #15850 Closes #15914 Closes #15918 Closes #16285 Closes #16389 Closes #16652 Closes #16743 Closes #16893 Closes #16975 Closes #17001 Closes #17088 Closes #17119 Closes #17272 Closes #17971 Added: Closes #17778 Closes #17303 Closes #17872 ## How was this patch tested? N/A Author: hyukjinkwon <gurwls223@gmail.com> Closes #18017 from HyukjinKwon/close-inactive-prs.	2017-05-18 08:58:23 +01:00
Xingbo Jiang	b7aac15d56	[SPARK-20700][SQL] InferFiltersFromConstraints stackoverflows for query (v2) ## What changes were proposed in this pull request? In the previous approach we used `aliasMap` to link an `Attribute` to the expression with potentially the form `f(a, b)`, but we only searched the `expressions` and `children.expressions` for this, which is not enough when an `Alias` may lie deep in the logical plan. In that case, we can't generate the valid equivalent constraint classes and thus we fail at preventing the recursive deductions. We fix this problem by collecting all `Alias`s from the logical plan. ## How was this patch tested? No additional test case is added, but one existing test case is modified to cover this situation. Author: Xingbo Jiang <xingbo.jiang@databricks.com> Closes #18020 from jiangxb1987/inferConstrants.	2017-05-17 23:32:31 -07:00
Yanbo Liang	697a5e5517	[SPARK-20505][ML] Add docs and examples for ml.stat.Correlation and ml.stat.ChiSquareTest. ## What changes were proposed in this pull request? Add docs and examples for ```ml.stat.Correlation``` and ```ml.stat.ChiSquareTest```. ## How was this patch tested? Generate docs and run examples manually, successfully. Author: Yanbo Liang <ybliang8@gmail.com> Closes #17994 from yanboliang/spark-20505.	2017-05-18 11:54:09 +08:00

19702 commits
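
The precision loss reported in SPARK-20786 (commit 3c9eef35a8 above) can be reproduced outside Spark. The following is a minimal Python sketch, not Spark code: the function names `ceil_via_double` and `ceil_exact` are hypothetical, and the sketch only assumes that the old behavior is equivalent to converting a 64-bit integer to a double before taking the ceiling.

```python
import math


def ceil_via_double(n: int) -> int:
    """Ceiling taken after converting to a 64-bit float.

    Integers above 2**53 are not all representable as doubles,
    so the conversion may round to a neighboring value.
    """
    return math.ceil(float(n))


def ceil_exact(n: int) -> int:
    """The ceiling of an integer is the integer itself; no conversion needed."""
    return n


# Inputs taken from the commit message of SPARK-20786.
for n in (1234567890123456, 12345678901234567, 123456789012345678):
    print(n, ceil_via_double(n), ceil_exact(n))
```

The first input is below 2**53 and survives the round trip; the other two come back as 12345678901234568 and 123456789012345680, matching the spark-sql output quoted in the commit message.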