ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Tarek Auel	83b682beec	[SPARK-8199][SPARK-8184][SPARK-8183][SPARK-8182][SPARK-8181][SPARK-8180][SPARK-8179][SPARK-8177][SPARK-8178][SPARK-9115][SQL] date functions Jira: https://issues.apache.org/jira/browse/SPARK-8199 https://issues.apache.org/jira/browse/SPARK-8184 https://issues.apache.org/jira/browse/SPARK-8183 https://issues.apache.org/jira/browse/SPARK-8182 https://issues.apache.org/jira/browse/SPARK-8181 https://issues.apache.org/jira/browse/SPARK-8180 https://issues.apache.org/jira/browse/SPARK-8179 https://issues.apache.org/jira/browse/SPARK-8177 https://issues.apache.org/jira/browse/SPARK-8179 https://issues.apache.org/jira/browse/SPARK-9115 Regarding `day`and `dayofmonth` are both necessary? ~~I am going to add `Quarter` to this PR as well.~~ Done. ~~As soon as the Scala coding is reviewed and discussed, I'll add the python api.~~ Done Author: Tarek Auel <tarek.auel@googlemail.com> Author: Tarek Auel <tarek.auel@gmail.com> Closes #6981 from tarekauel/SPARK-8199 and squashes the following commits: f7b4c8c [Tarek Auel] [SPARK-8199] fixed bug in tests bb567b6 [Tarek Auel] [SPARK-8199] fixed test 3e095ba [Tarek Auel] [SPARK-8199] style and timezone fix 256c357 [Tarek Auel] [SPARK-8199] code cleanup 5983dcc [Tarek Auel] [SPARK-8199] whitespace fix 6e0c78f [Tarek Auel] [SPARK-8199] removed setTimeZone in tests, according to cloud-fans comment in #7488 4afc09c [Tarek Auel] [SPARK-8199] concise leap year handling ea6c110 [Tarek Auel] [SPARK-8199] fix after merging master 70238e0 [Tarek Auel] Merge branch 'master' into SPARK-8199 3c6ae2e [Tarek Auel] [SPARK-8199] removed binary search fb98ba0 [Tarek Auel] [SPARK-8199] python docstring fix cdfae27 [Tarek Auel] [SPARK-8199] cleanup & python docstring fix 746b80a [Tarek Auel] [SPARK-8199] build fix 0ad6db8 [Tarek Auel] [SPARK-8199] minor fix 523542d [Tarek Auel] [SPARK-8199] address comments 2259299 [Tarek Auel] [SPARK-8199] day_of_month alias d01b977 [Tarek Auel] [SPARK-8199] python underscore 56c4a92 [Tarek Auel] 
[SPARK-8199] update python docu e223bc0 [Tarek Auel] [SPARK-8199] refactoring d6aa14e [Tarek Auel] [SPARK-8199] fixed Hive compatibility b382267 [Tarek Auel] [SPARK-8199] fixed bug in day calculation; removed set TimeZone in HiveCompatibilitySuite for test purposes; removed Hive tests for second and minute, because we can cast '2015-03-18' to a timestamp and extract a minute/second from it 1b2e540 [Tarek Auel] [SPARK-8119] style fix 0852655 [Tarek Auel] [SPARK-8119] changed from ExpectsInputTypes to implicit casts ec87c69 [Tarek Auel] [SPARK-8119] bug fixing and refactoring 1358cdc [Tarek Auel] Merge remote-tracking branch 'origin/master' into SPARK-8199 740af0e [Tarek Auel] implement date function using a calculation based on days 4fb66da [Tarek Auel] WIP: date functions on calculation only 1a436c9 [Tarek Auel] wip f775f39 [Tarek Auel] fixed return type ad17e96 [Tarek Auel] improved implementation c42b444 [Tarek Auel] Removed merge conflict file ccb723c [Tarek Auel] [SPARK-8199] style and fixed merge issues 10e4ad1 [Tarek Auel] Merge branch 'master' into date-functions-fast 7d9f0eb [Tarek Auel] [SPARK-8199] git renaming issue f3e7a9f [Tarek Auel] [SPARK-8199] revert change in DataFrameFunctionsSuite 6f5d95c [Tarek Auel] [SPARK-8199] fixed year interval d9f8ac3 [Tarek Auel] [SPARK-8199] implement fast track 7bc9d93 [Tarek Auel] Merge branch 'master' into SPARK-8199 5a105d9 [Tarek Auel] [SPARK-8199] rebase after #6985 got merged eb6760d [Tarek Auel] Merge branch 'master' into SPARK-8199 f120415 [Tarek Auel] improved runtime a8edebd [Tarek Auel] use Calendar instead of SimpleDateFormat 5fe74e1 [Tarek Auel] fixed python style 3bfac90 [Tarek Auel] fixed style 356df78 [Tarek Auel] rely on cast mechanism of Spark. 
Simplified implementation 02efc5d [Tarek Auel] removed doubled code a5ea120 [Tarek Auel] added python api; changed test to be more meaningful b680db6 [Tarek Auel] added codegeneration to all functions c739788 [Tarek Auel] added support for quarter SPARK-8178 849fb41 [Tarek Auel] fixed stupid test 638596f [Tarek Auel] improved codegen 4d8049b [Tarek Auel] fixed tests and added type check 5ebb235 [Tarek Auel] resolved naming conflict d0e2f99 [Tarek Auel] date functions	2015-07-18 22:48:05 -07:00
Forest Fang	6cb6096c01	[SPARK-8443][SQL] Split GenerateMutableProjection Codegen due to JVM Code Size Limits By grouping projection calls into multiple apply function, we are able to push the number of projections codegen can handle from ~1k to ~60k. I have set the unit test to test against 5k as 60k took 15s for the unit test to complete. Author: Forest Fang <forest.fang@outlook.com> Closes #7076 from saurfang/codegen_size_limit and squashes the following commits: b7a7635 [Forest Fang] [SPARK-8443][SQL] Execute and verify split projections in test adef95a [Forest Fang] [SPARK-8443][SQL] Use safer factor and rewrite splitting code 1b5aa7e [Forest Fang] [SPARK-8443][SQL] inline execution if one block only 9405680 [Forest Fang] [SPARK-8443][SQL] split projection code by size limit	2015-07-18 21:05:44 -07:00
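The grouping idea in this commit can be illustrated with a toy generator that emits Java source as strings. This is only a sketch: `splitProjections`, `chunkSize`, and the `apply_N` method names are hypothetical, and it groups by a fixed snippet count, whereas the squashed commits mention splitting by a size limit.

```scala
object SplitProjectionSketch {
  /**
   * Groups per-column code snippets into helper methods holding at most
   * `chunkSize` snippets each, so that no single generated method grows
   * without bound. Returns the helper method sources and the body of the
   * top-level method that calls them in order.
   */
  def splitProjections(snippets: Seq[String], chunkSize: Int): (Seq[String], String) = {
    require(chunkSize > 0, "chunkSize must be positive")
    val helpers = snippets.grouped(chunkSize).zipWithIndex.map { case (chunk, i) =>
      s"private void apply_$i(Object row) {\n${chunk.mkString("\n")}\n}"
    }.toVector
    // The top-level method only dispatches to the helpers.
    val calls = helpers.indices.map(i => s"apply_$i(row);").mkString("\n")
    (helpers, calls)
  }
}
```

Five snippets with `chunkSize = 2` yield three helper methods, and the top-level body shrinks to three calls.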
Reynold Xin	45d798c323	[SPARK-8278] Remove non-streaming JSON reader. Author: Reynold Xin <rxin@databricks.com> Closes #7501 from rxin/jsonrdd and squashes the following commits: 767ec55 [Reynold Xin] More Mima 51f456e [Reynold Xin] Mima exclude. 789cb80 [Reynold Xin] Fixed compilation error. b4cf50d [Reynold Xin] [SPARK-8278] Remove non-streaming JSON reader.	2015-07-18 20:27:55 -07:00
Reynold Xin	9914b1b2c5	[SPARK-9150][SQL] Create CodegenFallback and Unevaluable trait It is very hard to track which expressions have code gen implemented or not. This patch removes the default fallback gencode implementation from Expression, and moves that into a new trait called CodegenFallback. Each concrete expression needs to either implement code generation, or mix in CodegenFallback. This makes it very easy to track which expressions have code generation implemented already. Additionally, this patch creates an Unevaluable trait that can be used to track expressions that don't support evaluation (e.g. Star). Author: Reynold Xin <rxin@databricks.com> Closes #7487 from rxin/codegenfallback and squashes the following commits: 14ebf38 [Reynold Xin] Fixed Conv 6c1c882 [Reynold Xin] Fixed Alias. b42611b [Reynold Xin] [SPARK-9150][SQL] Create a trait to track code generation for expressions. cb5c066 [Reynold Xin] Removed extra import. 39cbe40 [Reynold Xin] [SPARK-8240][SQL] string function: concat	2015-07-18 18:18:19 -07:00
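A toy model of the pattern described above, assuming a much simplified `Expression` that evaluates integers. The trait names mirror the commit message, but the method signatures and the `AddOne` and `Square` expressions are hypothetical and are not Spark's real classes.

```scala
object CodegenFallbackSketch {
  // Toy stand-in for an expression node; not Spark's real Expression.
  trait Expression {
    def eval(input: Int): Int
    // No default body: each concrete expression must state how it generates code.
    def genCode(inputVar: String): String
  }

  // Opt-in fallback: the "generated" code just calls back into eval.
  trait CodegenFallback extends Expression {
    override def genCode(inputVar: String): String =
      s"/* fallback */ expr.eval($inputVar)"
  }

  // Implements code generation natively.
  final case class AddOne() extends Expression {
    def eval(input: Int): Int = input + 1
    def genCode(inputVar: String): String = s"($inputVar + 1)"
  }

  // Has no native code generation yet, so it mixes in the fallback explicitly.
  final case class Square() extends Expression with CodegenFallback {
    def eval(input: Int): Int = input * input
  }

  // Finding expressions that still lack code generation becomes a type test.
  def hasNativeCodegen(e: Expression): Boolean = !e.isInstanceOf[CodegenFallback]
}
```

Because `genCode` has no default, forgetting both options is a compile error rather than a silent fallback.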
Reynold Xin	e16a19a39e	[SPARK-9174][SQL] Add documentation for all public SQLConfs. Author: Reynold Xin <rxin@databricks.com> Closes #7500 from rxin/sqlconf and squashes the following commits: a5726c8 [Reynold Xin] [SPARK-9174][SQL] Add documentation for all public SQLConfs.	2015-07-18 15:29:38 -07:00
Reynold Xin	6e1e2eba69	[SPARK-8240][SQL] string function: concat Author: Reynold Xin <rxin@databricks.com> Closes #7486 from rxin/concat and squashes the following commits: 5217d6e [Reynold Xin] Removed Hive's concat test. f5cb7a3 [Reynold Xin] Concat is never nullable. ae4e61f [Reynold Xin] Removed extra import. fddcbbd [Reynold Xin] Fixed NPE. 22e831c [Reynold Xin] Added missing file. 57a2352 [Reynold Xin] [SPARK-8240][SQL] string function: concat	2015-07-18 14:07:56 -07:00
Yijie Shen	3d2134fc0d	[SPARK-9055][SQL] WidenTypes should also support Intersect and Except JIRA: https://issues.apache.org/jira/browse/SPARK-9055 cc rxin Author: Yijie Shen <henry.yijieshen@gmail.com> Closes #7491 from yijieshen/widen and squashes the following commits: 079fa52 [Yijie Shen] widenType support for intersect and expect	2015-07-18 12:57:53 -07:00
Reynold Xin	cdc36eef41	Closes #6122	2015-07-18 12:25:04 -07:00
Liang-Chi Hsieh	225de8da2b	[SPARK-9151][SQL] Implement code generation for Abs JIRA: https://issues.apache.org/jira/browse/SPARK-9151 Add codegen support for `Abs`. Author: Liang-Chi Hsieh <viirya@appier.com> Closes #7498 from viirya/abs_codegen and squashes the following commits: 0c8410f [Liang-Chi Hsieh] Implement code generation for Abs.	2015-07-18 12:11:37 -07:00
Wenchen Fan	86c50bf72c	[SPARK-9171][SQL] add and improve tests for nondeterministic expressions Author: Wenchen Fan <cloud0fan@outlook.com> Closes #7496 from cloud-fan/tests and squashes the following commits: 0958f90 [Wenchen Fan] improve test for nondeterministic expressions	2015-07-18 11:58:53 -07:00
Wenchen Fan	692378c01d	[SPARK-9167][SQL] use UTC Calendar in `stringToDate` Fix 2 bugs introduced in https://github.com/apache/spark/pull/7353 1. We should use a UTC Calendar when casting a string to a date. Before #7353, we used `DateTimeUtils.fromJavaDate(Date.valueOf(s.toString))` to cast a string to a date, and `fromJavaDate` calls `millisToDays` to avoid the time zone issue. Now that we use `DateTimeUtils.stringToDate(s)`, we should create a Calendar with UTC at the beginning. 2. We should not change the default time zone in test cases. The `threadLocalLocalTimeZone` and `threadLocalTimestampFormat` in `DateTimeUtils` are only evaluated once for each thread, so we can't set the default time zone back afterwards. Author: Wenchen Fan <cloud0fan@outlook.com> Closes #7488 from cloud-fan/datetime and squashes the following commits: 9cd6005 [Wenchen Fan] address comments 21ef293 [Wenchen Fan] fix 2 bugs in datetime	2015-07-18 11:25:16 -07:00
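Why a UTC calendar matters for the first point can be sketched with plain `java.util` classes. This is not Spark's `stringToDate`; the helper `daysSinceEpoch` is hypothetical and takes already-parsed year, month, and day values.

```scala
import java.util.{GregorianCalendar, TimeZone}

object UtcDateSketch {
  private val MillisPerDay: Long = 24L * 60 * 60 * 1000

  /**
   * Days since 1970-01-01 for a calendar date. Building the calendar in UTC
   * makes midnight of the date fall exactly on a day boundary, so the result
   * does not depend on the JVM's default time zone.
   */
  def daysSinceEpoch(year: Int, month: Int, day: Int): Int = {
    val c = new GregorianCalendar(TimeZone.getTimeZone("UTC"))
    c.clear()                    // drop the current time of day, including milliseconds
    c.set(year, month - 1, day)  // Calendar months are 0-based
    Math.floorDiv(c.getTimeInMillis, MillisPerDay).toInt
  }
}
```

With a calendar in a zone east of UTC, local midnight would map to an instant on the previous UTC day, which is the kind of off-by-one the commit guards against.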
Wenchen Fan	1b4ff05538	[SPARK-9142][SQL] remove more self type in catalyst a follow up of https://github.com/apache/spark/pull/7479. The `TreeNode` is the root case of the requirement of `self: Product =>` stuff, so why not make `TreeNode` extend `Product`? Author: Wenchen Fan <cloud0fan@outlook.com> Closes #7495 from cloud-fan/self-type and squashes the following commits: 8676af7 [Wenchen Fan] remove more self type	2015-07-18 11:13:49 -07:00
Josh Rosen	b8aec6cd23	[SPARK-9143] [SQL] Add planner rule for automatically inserting Unsafe <-> Safe row format converters Now that we have two different internal row formats, UnsafeRow and the old Java-object-based row format, we end up having to perform conversions between these two formats. These conversions should not be performed by the operators themselves; instead, the planner should be responsible for inserting appropriate format conversions when they are needed. This patch makes the following changes: - Add two new physical operators for performing row format conversions, `ConvertToUnsafe` and `ConvertFromUnsafe`. - Add new methods to `SparkPlan` to allow operators to express whether they output UnsafeRows and whether they can handle safe or unsafe rows as inputs. - Implement an `EnsureRowFormats` rule to automatically insert converter operators where necessary. Author: Josh Rosen <joshrosen@databricks.com> Closes #7482 from JoshRosen/unsafe-converter-planning and squashes the following commits: 7450fa5 [Josh Rosen] Resolve conflicts in favor of choosing UnsafeRow 5220cce [Josh Rosen] Add roundtrip converter test 2bb8da8 [Josh Rosen] Add Union unsafe support + tests to bump up test coverage 6f79449 [Josh Rosen] Add even more assertions to execute() 08ce199 [Josh Rosen] Rename ConvertFromUnsafe -> ConvertToSafe 0e2d548 [Josh Rosen] Add assertion if operators' input rows are in different formats cabb703 [Josh Rosen] Add tests for Filter 3b11ce3 [Josh Rosen] Add missing test file. ae2195a [Josh Rosen] Fixes 0fef0f8 [Josh Rosen] Rename file. d5f9005 [Josh Rosen] Finish writing EnsureRowFormats planner rule b5df19b [Josh Rosen] Merge remote-tracking branch 'origin/master' into unsafe-converter-planning 9ba3038 [Josh Rosen] WIP	2015-07-18 11:08:18 -07:00
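The planner rule described above can be sketched on a toy plan tree. Everything here is a simplified assumption: `Plan`, `Scan`, and `Op` are hypothetical stand-ins for `SparkPlan` operators, each operator accepts exactly one input format, and the converter back to the old format is called `ConvertToSafe`, following the rename listed in the squashed commits.

```scala
object EnsureRowFormatsSketch {
  sealed trait Plan {
    def outputsUnsafeRows: Boolean
  }

  // Leaf operator that produces rows in one of the two formats.
  final case class Scan(outputsUnsafeRows: Boolean) extends Plan

  // Operator that accepts only one input format and emits the same format.
  final case class Op(name: String, needsUnsafeInput: Boolean, child: Plan) extends Plan {
    val outputsUnsafeRows: Boolean = needsUnsafeInput
  }

  final case class ConvertToUnsafe(child: Plan) extends Plan {
    val outputsUnsafeRows: Boolean = true
  }

  final case class ConvertToSafe(child: Plan) extends Plan {
    val outputsUnsafeRows: Boolean = false
  }

  /** Inserts a converter wherever a child's output format differs from what its parent accepts. */
  def ensureRowFormats(plan: Plan): Plan = plan match {
    case op @ Op(_, needsUnsafe, child) =>
      val fixedChild = ensureRowFormats(child)
      val converted =
        if (needsUnsafe && !fixedChild.outputsUnsafeRows) ConvertToUnsafe(fixedChild)
        else if (!needsUnsafe && fixedChild.outputsUnsafeRows) ConvertToSafe(fixedChild)
        else fixedChild
      op.copy(child = converted)
    case other => other
  }
}
```

Operators stay free of conversion logic; they only declare what they produce and accept, and the rule fixes mismatches in one pass.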
Reynold Xin	fba3f5ba85	[SPARK-9169][SQL] Improve unit test coverage for null expressions. Author: Reynold Xin <rxin@databricks.com> Closes #7490 from rxin/unit-test-null-funcs and squashes the following commits: 7b276f0 [Reynold Xin] Move isNaN. 8307287 [Reynold Xin] [SPARK-9169][SQL] Improve unit test coverage for null expressions.	2015-07-18 11:06:46 -07:00
Paweł Kozikowski	b9ef7ac98c	[MLLIB] [DOC] Seed fix in mllib naive bayes example Previous seed resulted in empty test data set. Author: Paweł Kozikowski <mupakoz@gmail.com> Closes #7477 from mupakoz/patch-1 and squashes the following commits: f5d41ee [Paweł Kozikowski] Mllib Naive Bayes example data set enlarged	2015-07-18 10:12:48 -07:00
Rekha Joshi	1017908205	[SPARK-9118] [ML] Implement IntArrayParam in mllib Implement IntArrayParam in mllib Author: Rekha Joshi <rekhajoshm@gmail.com> Author: Joshi <rekhajoshm@gmail.com> Closes #7481 from rekhajoshm/SPARK-9118 and squashes the following commits: d3b1766 [Joshi] Implement IntArrayParam 0be142d [Rekha Joshi] Merge pull request #3 from apache/master 106fd8e [Rekha Joshi] Merge pull request #2 from apache/master e3677c9 [Rekha Joshi] Merge pull request #1 from apache/master	2015-07-17 20:02:05 -07:00
Yu ISHIKAWA	34a889db85	[SPARK-7879] [MLLIB] KMeans API for spark.ml Pipelines I Implemented the KMeans API for spark.ml Pipelines. But it doesn't include clustering abstractions for spark.ml (SPARK-7610). It would fit for another issues. And I'll try it later, since we are trying to add the hierarchical clustering algorithms in another issue. Thanks. [SPARK-7879] KMeans API for spark.ml Pipelines - ASF JIRA https://issues.apache.org/jira/browse/SPARK-7879 Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #6756 from yu-iskw/SPARK-7879 and squashes the following commits: be752de [Yu ISHIKAWA] Add assertions a14939b [Yu ISHIKAWA] Fix the dashed line's length in pyspark.ml.rst 4c61693 [Yu ISHIKAWA] Remove the test about whether "features" and "prediction" columns exist or not in Python fb2417c [Yu ISHIKAWA] Use getInt, instead of get f397be4 [Yu ISHIKAWA] Switch the comparisons. ca78b7d [Yu ISHIKAWA] Add the Scala docs about the constraints of each parameter. effc650 [Yu ISHIKAWA] Using expertSetParam and expertGetParam c8dc6e6 [Yu ISHIKAWA] Remove an unnecessary test 19a9d63 [Yu ISHIKAWA] Include spark.ml.clustering to python tests 1abb19c [Yu ISHIKAWA] Add the statements about spark.ml.clustering into pyspark.ml.rst f8338bc [Yu ISHIKAWA] Add the placeholders in Python 4a03003 [Yu ISHIKAWA] Test for contains in Python 6566c8b [Yu ISHIKAWA] Use `get`, instead of `apply` 288e8d5 [Yu ISHIKAWA] Using `contains` to check the column names 5a7d574 [Yu ISHIKAWA] Renamce `validateInitializationMode` to `validateInitMode` and remove throwing exception 97cfae3 [Yu ISHIKAWA] Fix the type of return value of `KMeans.copy` e933723 [Yu ISHIKAWA] Remove the default value of seed from the Model class 978ee2c [Yu ISHIKAWA] Modify the docs of KMeans, according to mllib's KMeans 2ec80bc [Yu ISHIKAWA] Fit on 1 line e186be1 [Yu ISHIKAWA] Make a few variables, setters and getters be expert ones b2c205c [Yu ISHIKAWA] Rename the method `getInitializationSteps` to `getInitSteps` and 
`setInitializationSteps` to `setInitSteps` in Scala and Python f43f5b4 [Yu ISHIKAWA] Rename the method `getInitializationMode` to `getInitMode` and `setInitializationMode` to `setInitMode` in Scala and Python 3cb5ba4 [Yu ISHIKAWA] Modify the description about epsilon and the validation 4fa409b [Yu ISHIKAWA] Add a comment about the default value of epsilon 2f392e1 [Yu ISHIKAWA] Make some variables `final` and Use `IntParam` and `DoubleParam` 19326f8 [Yu ISHIKAWA] Use `udf`, instead of callUDF 4d2ad1e [Yu ISHIKAWA] Modify the indentations 0ae422f [Yu ISHIKAWA] Add a test for `setParams` 4ff7913 [Yu ISHIKAWA] Add "ml.clustering" to `javacOptions` in SparkBuild.scala 11ffdf1 [Yu ISHIKAWA] Use `===` and the variable 220a176 [Yu ISHIKAWA] Set a random seed in the unit testing 92c3efc [Yu ISHIKAWA] Make the points for a test be fewer c758692 [Yu ISHIKAWA] Modify the parameters of KMeans in Python 6aca147 [Yu ISHIKAWA] Add some unit testings to validate the setter methods 687cacc [Yu ISHIKAWA] Alias mllib.KMeans as MLlibKMeans in KMeansSuite.scala a4dfbef [Yu ISHIKAWA] Modify the last brace and indentations 5bedc51 [Yu ISHIKAWA] Remve an extra new line 444c289 [Yu ISHIKAWA] Add the validation for `runs` e41989c [Yu ISHIKAWA] Modify how to validate `initStep` 7ea133a [Yu ISHIKAWA] Change how to validate `initMode` 7991e15 [Yu ISHIKAWA] Add a validation for `k` c2df35d [Yu ISHIKAWA] Make `predict` private 93aa2ff [Yu ISHIKAWA] Use `withColumn` in `transform` d3a79f7 [Yu ISHIKAWA] Remove the inhefited docs e9532e1 [Yu ISHIKAWA] make `parentModel` of KMeansModel private 8559772 [Yu ISHIKAWA] Remove the `paramMap` parameter of KMeans 6684850 [Yu ISHIKAWA] Rename `initializationSteps` to `initSteps` 99b1b96 [Yu ISHIKAWA] Rename `initializationMode` to `initMode` 79ea82b [Yu ISHIKAWA] Modify the parameters of KMeans docs 6569bcd [Yu ISHIKAWA] Change how to set the default values with `setDefault` 20a795a [Yu ISHIKAWA] Change how to set the default values with `setDefault` 11c2a12 
[Yu ISHIKAWA] Limit the imports badb481 [Yu ISHIKAWA] Alias spark.mllib.{KMeans, KMeansModel} f80319a [Yu ISHIKAWA] Rebase mater branch and add copy methods 85d92b1 [Yu ISHIKAWA] Add `KMeans.setPredictionCol` aa9469d [Yu ISHIKAWA] Fix a python test suite error caused by python 3.x c2d6bcb [Yu ISHIKAWA] ADD Java test suites of the KMeans API for spark.ml Pipeline 598ed2e [Yu ISHIKAWA] Implement the KMeans API for spark.ml Pipelines in Python 63ad785 [Yu ISHIKAWA] Implement the KMeans API for spark.ml Pipelines in Scala	2015-07-17 18:30:04 -07:00
Yijie Shen	529a2c2d92	[SPARK-8280][SPARK-8281][SQL]Handle NaN, null and Infinity in math JIRA: https://issues.apache.org/jira/browse/SPARK-8280 https://issues.apache.org/jira/browse/SPARK-8281 Author: Yijie Shen <henry.yijieshen@gmail.com> Closes #7451 from yijieshen/nan_null2 and squashes the following commits: 47a529d [Yijie Shen] style fix 63dee44 [Yijie Shen] handle log expressions similar to Hive 188be51 [Yijie Shen] null to nan in Math Expression	2015-07-17 17:33:19 -07:00
Daoyuan Wang	1707238601	[SPARK-7026] [SQL] fix left semi join with equi key and non-equi condition When the `condition` extracted by `ExtractEquiJoinKeys` contain join Predicate for left semi join, we can not plan it as semiJoin. Such as SELECT * FROM testData2 x LEFT SEMI JOIN testData2 y ON x.b = y.b AND x.a >= y.a + 2 Condition `x.a >= y.a + 2` can not evaluate on table `x`, so it throw errors Author: Daoyuan Wang <daoyuan.wang@intel.com> Closes #5643 from adrian-wang/spark7026 and squashes the following commits: cc09809 [Daoyuan Wang] refactor semijoin and add plan test 575a7c8 [Daoyuan Wang] fix notserializable 27841de [Daoyuan Wang] fix rebase 10bf124 [Daoyuan Wang] fix style 72baa02 [Daoyuan Wang] fix style 8e0afca [Daoyuan Wang] merge commits for rebase	2015-07-17 16:45:46 -07:00
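The semantics of the query in this commit can be sketched with plain Scala collections, independent of how Spark plans it. The `Row` type, the sample rows, and `leftSemiJoin` are hypothetical; the actual contents of `testData2` are not given here.

```scala
object LeftSemiJoinSketch {
  final case class Row(a: Int, b: Int)

  /**
   * Keeps each left row for which at least one right row satisfies the
   * condition. The condition sees both rows, which is why a predicate such
   * as x.a >= y.a + 2 cannot be evaluated on the left table alone.
   */
  def leftSemiJoin(left: Seq[Row], right: Seq[Row])(condition: (Row, Row) => Boolean): Seq[Row] =
    left.filter(x => right.exists(y => condition(x, y)))
}
```

For example, joining `Seq(Row(1, 1), Row(1, 2), Row(3, 1), Row(3, 2))` with itself on `x.b == y.b && x.a >= y.a + 2` keeps only the two rows with `a == 3`.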
Tathagata Das	b13ef7723f	[SPARK-9030] [STREAMING] Add Kinesis.createStream unit tests that actual sends data Current Kinesis unit tests do not test createStream by sending data. This PR is to add such unit test. Note that this unit will not run by default. It will only run when the relevant environment variables are set. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #7413 from tdas/kinesis-tests and squashes the following commits: 0e16db5 [Tathagata Das] Added more comments regarding testOrIgnore 1ea5ce0 [Tathagata Das] Added more comments c7caef7 [Tathagata Das] Address comments a297b59 [Tathagata Das] Reverted unnecessary change in KafkaStreamSuite 90c9bde [Tathagata Das] Removed scalatest.FunSuite deb7f4f [Tathagata Das] Removed scalatest.FunSuite 18c2208 [Tathagata Das] Changed how SparkFunSuite is inherited dbb33a5 [Tathagata Das] Added license 88f6dab [Tathagata Das] Added scala docs c6be0d7 [Tathagata Das] minor changes 24a992b [Tathagata Das] Moved KinesisTestUtils to src instead of test for future python usage 465b55d [Tathagata Das] Made unit tests optional in a nice way 4d70703 [Tathagata Das] Added license 129d436 [Tathagata Das] Minor updates cc36510 [Tathagata Das] Added KinesisStreamSuite	2015-07-17 16:43:18 -07:00
Wenchen Fan	bd903ee89f	[SPARK-9117] [SQL] fix BooleanSimplification in case-insensitive Author: Wenchen Fan <cloud0fan@outlook.com> Closes #7452 from cloud-fan/boolean-simplify and squashes the following commits: 2a6e692 [Wenchen Fan] fix style d3cfd26 [Wenchen Fan] fix BooleanSimplification in case-insensitive	2015-07-17 16:28:24 -07:00
Wenchen Fan	fd6b3101fb	[SPARK-9113] [SQL] enable analysis check code for self join The check was unreachable before, as `case operator: LogicalPlan` catches everything already. Author: Wenchen Fan <cloud0fan@outlook.com> Closes #7449 from cloud-fan/tmp and squashes the following commits: 2bb6637 [Wenchen Fan] add test 5493aea [Wenchen Fan] add the check back 27221a7 [Wenchen Fan] remove unnecessary analysis check code for self join	2015-07-17 16:03:33 -07:00
Yijie Shen	15fc2ffe55	[SPARK-9080][SQL] add isNaN predicate expression JIRA: https://issues.apache.org/jira/browse/SPARK-9080 cc rxin Author: Yijie Shen <henry.yijieshen@gmail.com> Closes #7464 from yijieshen/isNaN and squashes the following commits: 11ae039 [Yijie Shen] add isNaN in functions 666718e [Yijie Shen] add isNaN predicate expression	2015-07-17 15:49:31 -07:00
Reynold Xin	b2aa490bb6	[SPARK-9142] [SQL] Removing unnecessary self types in Catalyst. Just a small change to add Product type to the base expression/plan abstract classes, based on suggestions on #7434 and offline discussions. Author: Reynold Xin <rxin@databricks.com> Closes #7479 from rxin/remove-self-types and squashes the following commits: e407ffd [Reynold Xin] [SPARK-9142][SQL] Removing unnecessary self types in Catalyst.	2015-07-17 15:02:13 -07:00
Joshi	42d8a012f6	[SPARK-8593] [CORE] Sort app attempts by start time. This makes sure attempts are listed in the order they were executed, and that the app's state matches the state of the most current attempt. Author: Joshi <rekhajoshm@gmail.com> Author: Rekha Joshi <rekhajoshm@gmail.com> Closes #7253 from rekhajoshm/SPARK-8593 and squashes the following commits: 874dd80 [Joshi] History Server: updated order for multiple attempts(logcleaner) 716e0b1 [Joshi] History Server: updated order for multiple attempts(descending start time works everytime) 548c753 [Joshi] History Server: updated order for multiple attempts(descending start time works everytime) 83306a8 [Joshi] History Server: updated order for multiple attempts(descending start time) b0fc922 [Joshi] History Server: updated order for multiple attempts(updated comment) cc0fda7 [Joshi] History Server: updated order for multiple attempts(updated test) 304cb0b [Joshi] History Server: updated order for multiple attempts(reverted HistoryPage) 85024e8 [Joshi] History Server: updated order for multiple attempts a41ac4b [Joshi] History Server: updated order for multiple attempts ab65fa1 [Joshi] History Server: some attempt completed to work with showIncomplete 0be142d [Rekha Joshi] Merge pull request #3 from apache/master 106fd8e [Rekha Joshi] Merge pull request #2 from apache/master e3677c9 [Rekha Joshi] Merge pull request #1 from apache/master	2015-07-17 22:47:28 +01:00
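A minimal sketch of the ordering described above, assuming the descending start time mentioned in the squashed commits. The `Attempt` case class and both helpers are hypothetical and are not the History Server's real types.

```scala
object AttemptOrderSketch {
  final case class Attempt(id: String, startTime: Long, completed: Boolean)

  /** Most recently started attempt first. */
  def newestFirst(attempts: Seq[Attempt]): Seq[Attempt] =
    attempts.sortBy(_.startTime)(Ordering[Long].reverse)

  /** The application's state follows its most recent attempt. */
  def appCompleted(attempts: Seq[Attempt]): Boolean =
    newestFirst(attempts).headOption.exists(_.completed)
}
```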
Bryan Cutler	8b8be1f5d6	[SPARK-7127] [MLLIB] Adding broadcast of model before prediction for ensembles Broadcast of ensemble models in transformImpl before call to predict Author: Bryan Cutler <bjcutler@us.ibm.com> Closes #6300 from BryanCutler/bcast-ensemble-models-7127 and squashes the following commits: 86e73de [Bryan Cutler] [SPARK-7127] Replaced deprecated callUDF with udf 40a139d [Bryan Cutler] Merge branch 'master' into bcast-ensemble-models-7127 9afad56 [Bryan Cutler] [SPARK-7127] Simplified calls by overriding transformImpl and using broadcasted model in callUDF to make prediction 1f34be4 [Bryan Cutler] [SPARK-7127] Removed accidental newline 171a6ce [Bryan Cutler] [SPARK-7127] Used modelAccessor parameter in predictImpl to access broadcasted model 6fd153c [Bryan Cutler] [SPARK-7127] Applied broadcasting to remaining ensemble models aaad77b [Bryan Cutler] [SPARK-7127] Removed abstract class for broadcasting model, instead passing a prediction function as param to transform 83904bb [Bryan Cutler] [SPARK-7127] Adding broadcast of model before prediction in RandomForestClassifier	2015-07-17 14:10:16 -07:00
Yanbo Liang	830666f6fe	[SPARK-8792] [ML] Add Python API for PCA transformer Add Python API for PCA transformer Author: Yanbo Liang <ybliang8@gmail.com> Closes #7190 from yanboliang/spark-8792 and squashes the following commits: 8f4ac31 [Yanbo Liang] address comments 8a79cc0 [Yanbo Liang] Add Python API for PCA transformer	2015-07-17 14:08:06 -07:00
Feynman Liang	6da1069696	[SPARK-9090] [ML] Fix definition of residual in LinearRegressionSummary, EnsembleTestHelper, and SquaredError Make the definition of residuals in Spark consistent with literature. We have been using `prediction - label` for residuals, but literature usually defines `residual = label - prediction`. Author: Feynman Liang <fliang@databricks.com> Closes #7435 from feynmanliang/SPARK-9090-Fix-LinearRegressionSummary-Residuals and squashes the following commits: f4b39d8 [Feynman Liang] Fix doc bc12a92 [Feynman Liang] Tweak EnsembleTestHelper and SquaredError residuals 63f0d60 [Feynman Liang] Fix definition of residual	2015-07-17 14:00:53 -07:00
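The convention adopted here, `residual = label - prediction`, can be written out as a small helper. The function below is a hypothetical illustration on plain sequences, not the code in `LinearRegressionSummary`.

```scala
object ResidualSketch {
  /** Residuals under the convention residual = label - prediction. */
  def residuals(labels: Seq[Double], predictions: Seq[Double]): Seq[Double] = {
    require(labels.length == predictions.length, "labels and predictions must align")
    labels.zip(predictions).map { case (label, prediction) => label - prediction }
  }
}
```

Under this definition a positive residual means the model under-predicted the label. The previous `prediction - label` convention flips every sign.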
zsxwing	ad0954f6de	[SPARK-5681] [STREAMING] Move 'stopReceivers' to the event loop to resolve the race condition This is an alternative way to fix `SPARK-5681`. It minimizes the changes. Closes #4467 Author: zsxwing <zsxwing@gmail.com> Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #6294 from zsxwing/pr4467 and squashes the following commits: 709ac1f [zsxwing] Fix the comment e103e8a [zsxwing] Move ReceiverTracker.stop into ReceiverTracker.stop f637142 [zsxwing] Address minor code style comments a178d37 [zsxwing] Move 'stopReceivers' to the event looop to resolve the race condition 51fb07e [zsxwing] Fix the code style 3cb19a3 [zsxwing] Merge branch 'master' into pr4467 b4c29e7 [zsxwing] Stop receiver only if we start it c41ee94 [zsxwing] Make stopReceivers private 7c73c1f [zsxwing] Use trackerStateLock to protect trackerState a8120c0 [zsxwing] Merge branch 'master' into pr4467 7b1d9af [zsxwing] "case Throwable" => "case NonFatal" 15ed4a1 [zsxwing] Register before starting the receiver fff63f9 [zsxwing] Use a lock to eliminate the race condition when stopping receivers and registering receivers happen at the same time. e0ef72a [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into tracker_status_timeout 19b76d9 [Liang-Chi Hsieh] Remove timeout. 34c18dc [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into tracker_status_timeout c419677 [Liang-Chi Hsieh] Fix style. 9e1a760 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into tracker_status_timeout 355f9ce [Liang-Chi Hsieh] Separate register and start events for receivers. 3d568e8 [Liang-Chi Hsieh] Let receivers get registered first before going started. ae0d9fd [Liang-Chi Hsieh] Merge branch 'master' into tracker_status_timeout 77983f3 [Liang-Chi Hsieh] Add tracker status and stop to receive messages when stopping tracker.	2015-07-17 14:00:31 -07:00
Wenchen Fan	074085d678	[SPARK-9136] [SQL] fix several bugs in DateTimeUtils.stringToTimestamp a follow up of https://github.com/apache/spark/pull/7353 1. we should use `Calendar.HOUR_OF_DAY` instead of `Calendar.HOUR`(this is for AM, PM). 2. we should call `c.set(Calendar.MILLISECOND, 0)` after `Calendar.getInstance` I'm not sure why the tests didn't fail in jenkins, but I ran latest spark master branch locally and `DateTimeUtilsSuite` failed. Author: Wenchen Fan <cloud0fan@outlook.com> Closes #7473 from cloud-fan/datetime and squashes the following commits: 66cdaf2 [Wenchen Fan] fix several bugs in DateTimeUtils.stringToTimestamp	2015-07-17 13:57:31 -07:00
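The difference between the two fields in the first point can be reproduced with `java.util` alone. This sketch assumes a calendar that already holds an afternoon instant, much as `Calendar.getInstance` holds the current time; the helper name and the fixed instant are hypothetical.

```scala
import java.util.{Calendar, GregorianCalendar, TimeZone}

object HourFieldSketch {
  // 2015-07-17 20:00:00 UTC, an instant in the PM half of the day.
  private val EveningMillis: Long = 16633L * 86400000L + 20L * 3600000L

  /** Sets one hour field on a calendar holding a PM instant and reads back the 24-hour value. */
  def hourOfDayAfterSet(field: Int, value: Int): Int = {
    val c = new GregorianCalendar(TimeZone.getTimeZone("UTC"))
    c.setTimeInMillis(EveningMillis)
    // Calendar.HOUR is the 12-hour field and is combined with the existing AM_PM value;
    // Calendar.HOUR_OF_DAY is the 24-hour field and stands on its own.
    c.set(field, value)
    c.get(Calendar.HOUR_OF_DAY)
  }
}
```

Setting `Calendar.HOUR` to 3 on this calendar keeps the PM marker and lands at 15:00, while setting `Calendar.HOUR_OF_DAY` to 3 gives 03:00 as intended.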
Yanbo Liang	9974642870	[SPARK-8600] [ML] Naive Bayes API for spark.ml Pipelines Naive Bayes API for spark.ml Pipelines Author: Yanbo Liang <ybliang8@gmail.com> Closes #7284 from yanboliang/spark-8600 and squashes the following commits: bc890f7 [Yanbo Liang] remove labels valid check c3de687 [Yanbo Liang] remove labels from ml.NaiveBayesModel a2b3088 [Yanbo Liang] address comments 3220b82 [Yanbo Liang] trigger jenkins 3018a41 [Yanbo Liang] address comments 208e166 [Yanbo Liang] Naive Bayes API for spark.ml Pipelines	2015-07-17 13:55:17 -07:00
Yuhao Yang	806c579f43	[SPARK-9062] [ML] Change output type of Tokenizer to Array(String, true) jira: https://issues.apache.org/jira/browse/SPARK-9062 Currently output type of Tokenizer is Array(String, false), which is not compatible with Word2Vec and Other transformers since their input type is Array(String, true). Seq[String] in udf will be treated as Array(String, true) by default. I'm not sure what's the recommended way for Tokenizer to handle the null value in the input. Any suggestion will be welcome. Author: Yuhao Yang <hhbyyh@gmail.com> Closes #7414 from hhbyyh/tokenizer and squashes the following commits: c01bd7a [Yuhao Yang] change output type of tokenizer	2015-07-17 13:43:19 -07:00
Davies Liu	f9a82a884e	[SPARK-9138] [MLLIB] fix Vectors.dense Vectors.dense() should accept numbers directly, like the one in Scala. We already use it in doctests, it worked by luck. cc mengxr jkbradley Author: Davies Liu <davies@databricks.com> Closes #7476 from davies/fix_vectors_dense and squashes the following commits: e0fd292 [Davies Liu] fix Vectors.dense	2015-07-17 12:43:58 -07:00
tien-dungle	587c315b20	[SPARK-9109] [GRAPHX] Keep the cached edge in the graph The change here is to keep the cached RDDs in the graph object so that when graph.unpersist() is called these RDDs are correctly unpersisted.
```scala
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD
import org.slf4j.LoggerFactory
import org.apache.spark.graphx.util.GraphGenerators

// Create an RDD for the vertices
val users: RDD[(VertexId, (String, String))] =
  sc.parallelize(Array((3L, ("rxin", "student")), (7L, ("jgonzal", "postdoc")),
    (5L, ("franklin", "prof")), (2L, ("istoica", "prof"))))

// Create an RDD for edges
val relationships: RDD[Edge[String]] =
  sc.parallelize(Array(Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"),
    Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))

// Define a default user in case there are relationships with a missing user
val defaultUser = ("John Doe", "Missing")

// Build the initial Graph
val graph = Graph(users, relationships, defaultUser)
graph.cache().numEdges
graph.unpersist()

// List the RDDs that are still persisted after unpersist()
sc.getPersistentRDDs.foreach(r => println(r._2.toString))
```
Author: tien-dungle <tien-dung.le@realimpactanalytics.com> Closes #7469 from tien-dungle/SPARK-9109_Graphx-unpersist and squashes the following commits: 8d87997 [tien-dungle] Keep the cached edge in the graph	2015-07-17 12:11:32 -07:00
Liang-Chi Hsieh	eba6a1af4c	[SPARK-8945][SQL] Add add and subtract expressions for IntervalType JIRA: https://issues.apache.org/jira/browse/SPARK-8945 Add add and subtract expressions for IntervalType. Author: Liang-Chi Hsieh <viirya@appier.com> This patch had conflicts when merged, resolved by Committer: Reynold Xin <rxin@databricks.com> Closes #7398 from viirya/interval_add_subtract and squashes the following commits: acd1f1e [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into interval_add_subtract 5abae28 [Liang-Chi Hsieh] For comments. 6f5b72e [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into interval_add_subtract dbe3906 [Liang-Chi Hsieh] For comments. 13a2fc5 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into interval_add_subtract 83ec129 [Liang-Chi Hsieh] Remove intervalMethod. acfe1ab [Liang-Chi Hsieh] Fix scala style. d3e9d0e [Liang-Chi Hsieh] Add add and subtract expressions for IntervalType.	2015-07-17 09:38:08 -07:00
zhichao.li	305e77cd83	[SPARK-8209][SQL] Add function conv cc chenghao-intel adrian-wang Author: zhichao.li <zhichao.li@intel.com> Closes #6872 from zhichao-li/conv and squashes the following commits: 6ef3b37 [zhichao.li] add unittest and comments 78d9836 [zhichao.li] polish dataframe api and add unittest e2bace3 [zhichao.li] update to use ImplicitCastInputTypes cbcad3f [zhichao.li] add function conv	2015-07-17 09:32:27 -07:00
Wenchen Fan	59d24c226a	[SPARK-9130][SQL] throw exception when check equality between external and internal row Instead of returning false, it is better to throw an exception when checking equality between an external and an internal row. Author: Wenchen Fan <cloud0fan@outlook.com> Closes #7460 from cloud-fan/row-compare and squashes the following commits: 8a20911 [Wenchen Fan] improve equals 402daa8 [Wenchen Fan] throw exception when check equality between external and internal row	2015-07-17 09:31:13 -07:00
Yanbo Liang	441e072a22	[MINOR] [ML] fix wrong annotation of RFormula.formula fix wrong annotation of RFormula.formula Author: Yanbo Liang <ybliang8@gmail.com> Closes #7470 from yanboliang/RFormula and squashes the following commits: 61f1919 [Yanbo Liang] fix wrong annotation	2015-07-17 09:00:41 -07:00
Hari Shreedharan	c043a3e9df	[SPARK-8851] [YARN] In Client mode, make sure the client logs in and updates tokens On the client side, the flow is SparkSubmit -> SparkContext -> yarn/Client. Since the yarn client only gets a cloned config and the staging dir is set here, it is not really possible to do re-logins in the SparkContext. So, do the initial logins in Spark Submit and do re-logins as we do now in the AM, but the Client behaves like an executor in this specific context and reads the credentials file to update the tokens. This way, even if the streaming context is started up from checkpoint - it is fine since we have logged in from SparkSubmit itself. Author: Hari Shreedharan <hshreedharan@apache.org> Closes #7394 from harishreedharan/yarn-client-login and squashes the following commits: 9a2166f [Hari Shreedharan] make it possible to use command line args and config parameters together. de08f57 [Hari Shreedharan] Fix import order. 5c4fa63 [Hari Shreedharan] Add a comment explaining what is being done in YarnClientSchedulerBackend. c872caa [Hari Shreedharan] Fix typo in log message. 2c80540 [Hari Shreedharan] Move token renewal to YarnClientSchedulerBackend. 0c48ac2 [Hari Shreedharan] Remove direct use of ExecutorDelegationTokenUpdater in Client. 26f8bfa [Hari Shreedharan] [SPARK-8851][YARN] In Client mode, make sure the client logs in and updates tokens. 58b1969 [Hari Shreedharan] Simple attempt 1.	2015-07-17 09:38:08 -05:00
Davies Liu	ec8973d124	[SPARK-9022] [SQL] Generated projections for UnsafeRow Added two projections: GenerateUnsafeProjection and FromUnsafeProjection, which could be used to convert UnsafeRow from/to GenericInternalRow. They will re-use the buffer during projection, similar to MutableProjection (without all the interface MutableProjection has). cc rxin JoshRosen Author: Davies Liu <davies@databricks.com> Closes #7437 from davies/unsafe_proj2 and squashes the following commits: dbf538e [Davies Liu] test with all the expression (only for supported types) dc737b2 [Davies Liu] address comment e424520 [Davies Liu] fix scala style 70e231c [Davies Liu] address comments 729138d [Davies Liu] Merge branch 'master' of github.com:apache/spark into unsafe_proj2 5a26373 [Davies Liu] unsafe projections	2015-07-17 01:27:14 -07:00
Yu ISHIKAWA	5a3c1ad087	[SPARK-9093] [SPARKR] Fix single-quotes strings in SparkR [[SPARK-9093] Fix single-quotes strings in SparkR - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-9093) This is the result of lintr at the revision 011551620faa87107a787530f074af3d9be7e695: [[SPARK-9093] The result of lintr at `011551620f`](https://gist.github.com/yu-iskw/8c47acf3202796da4d01) Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #7439 from yu-iskw/SPARK-9093 and squashes the following commits: 61c391e [Yu ISHIKAWA] [SPARK-9093][SparkR] Fix single-quotes strings in SparkR	2015-07-17 17:00:50 +09:00
Wenchen Fan	3f6d28a5ca	[SPARK-9102] [SQL] Improve project collapse with nondeterministic expressions Currently we stop project collapse when the lower projection has nondeterministic expressions. However, this is sometimes overkill: we should be able to optimize `df.select(Rand(10)).select('a)` to `df.select('a)`. Author: Wenchen Fan <cloud0fan@outlook.com> Closes #7445 from cloud-fan/non-deterministic and squashes the following commits: 0deaef6 [Wenchen Fan] Improve project collapse with nondeterministic expressions	2015-07-17 00:59:15 -07:00
Reynold Xin	111c05538d	Added inline comment for the canEqual PR by @cloud-fan.	2015-07-16 23:13:06 -07:00
Xiangrui Meng	358e7bf652	[SPARK-9126] [MLLIB] do not assert on time taken by Thread.sleep() Measure lower and upper bounds for task time and use them for validation. This PR also implements `Stopwatch.toString`. This suite should finish in less than 1 second. jkbradley pwendell Author: Xiangrui Meng <meng@databricks.com> Closes #7457 from mengxr/SPARK-9126 and squashes the following commits: 4b40faa [Xiangrui Meng] simplify tests 739f5bd [Xiangrui Meng] do not assert on time taken by Thread.sleep()	2015-07-16 23:02:06 -07:00
Joseph K. Bradley	322d286bb7	[SPARK-7131] [ML] Copy Decision Tree, Random Forest impl to spark.ml This PR copies the RandomForest implementation from spark.mllib to spark.ml. Note that this includes the DecisionTree implementation, but not the GradientBoostedTrees one (which will come later). I essentially copied a minimal amount of code to spark.ml, removed the use of bins (and only used splits), and modified code only as much as necessary to get it to compile. The spark.ml implementation still uses some spark.mllib classes (privately), which can be moved in future PRs. This refactoring will be helpful in extending the node representation to include more information, such as class probabilities. Specifically: * Copied code from spark.mllib to spark.ml: * mllib.tree.DecisionTree, mllib.tree.RandomForest copied to ml.tree.impl.RandomForest (main implementation) * NodeIdCache (needed to use splits instead of bins) * TreePoint (use splits instead of bins) * Added ml.tree.LearningNode used in RandomForest training (needed vars) * Removed bins from implementation, and only used splits * Small fix in JavaDecisionTreeRegressorSuite CC: mengxr manishamde codedeft chouqin Author: Joseph K. Bradley <joseph@databricks.com> Closes #7294 from jkbradley/dt-move-impl and squashes the following commits: 48749be [Joseph K. Bradley] cleanups based on code review, mostly style bea9703 [Joseph K. Bradley] scala style fixes. added some scala doc 4e6d2a4 [Joseph K. Bradley] removed unnecessary use of copyValues, setParent for trees 9a4d721 [Joseph K. Bradley] cleanups. removed InfoGainStats from ml, using old one for now. 836e7d4 [Joseph K. Bradley] Fixed test suite failures bd5e063 [Joseph K. Bradley] fixed bucketizing issue 0df3759 [Joseph K. Bradley] Need to remove use of Bucketizer d5224a9 [Joseph K. Bradley] modified tree and forest to use moved impl cc01823 [Joseph K. Bradley] still editing RF to get it to work 19143fb [Joseph K. Bradley] More progress, but not done yet. Rebased with master after 1.4 release.	2015-07-16 22:26:59 -07:00
Wenchen Fan	f893955b9c	[SPARK-8899] [SQL] remove duplicated equals method for Row Author: Wenchen Fan <cloud0fan@outlook.com> Closes #7291 from cloud-fan/row and squashes the following commits: a11addf [Wenchen Fan] move hashCode back to internal row 2de6180 [Wenchen Fan] making apply() call to get() fbe1b24 [Wenchen Fan] add null check ebdf148 [Wenchen Fan] address comments 25ef087 [Wenchen Fan] remove duplicated equals method for Row	2015-07-16 21:41:36 -07:00
zsxwing	812b63bbee	[SPARK-8857][SPARK-8859][Core]Add an internal flag to Accumulable and send internal accumulator updates to the driver via heartbeats This PR includes the following changes: 1. Remove the thread local `Accumulators.localAccums`. Instead, all Accumulators in the executors will register with their TaskContext. 2. Add an internal flag to Accumulable. For internal Accumulators, their updates will be sent to the driver via heartbeats. Author: zsxwing <zsxwing@gmail.com> Closes #7448 from zsxwing/accumulators and squashes the following commits: c24bc5b [zsxwing] Add comments bd7dcf1 [zsxwing] Add an internal flag to Accumulable and send internal accumulator updates to the driver via heartbeats	2015-07-16 21:09:09 -07:00
Andrew Or	96aa3340f4	[SPARK-8119] HeartbeatReceiver should replace executors, not kill Symptom. If an executor in an application times out, `HeartbeatReceiver` attempts to kill it. After this happens, however, the application never gets an executor back even when there are cluster resources available. Cause. The issue is that `sc.killExecutor` automatically assumes that the application wishes to adjust its resource requirements permanently downwards. This is not the intention in `HeartbeatReceiver`, however, which simply wants a replacement for the expired executor. Fix. Differentiate between the intention to kill and the intention to replace an executor with a fresh one. More details can be found in the commit message. Author: Andrew Or <andrew@databricks.com> Closes #7107 from andrewor14/heartbeat-no-kill and squashes the following commits: 1cd2cd7 [Andrew Or] Add regression test for SPARK-8119 25a347d [Andrew Or] Reuse more code in scheduler backend 31ebd40 [Andrew Or] Differentiate between kill and replace	2015-07-16 19:39:54 -07:00
Timothy Chen	d86bbb4e28	[SPARK-6284] [MESOS] Add mesos role, principal and secret Mesos supports setting framework authentication and a role per framework. The role identifies the framework and affects its sharing weight in resource allocation, and the optional authentication information allows the framework to connect to the master. Author: Timothy Chen <tnachen@gmail.com> Closes #4960 from tnachen/mesos_fw_auth and squashes the following commits: 0f9f03e [Timothy Chen] Fix review comments. 8f9488a [Timothy Chen] Fix rebase f7fc2a9 [Timothy Chen] Add mesos role, auth and secret.	2015-07-16 19:37:15 -07:00
Lianhui Wang	49351c7f59	[SPARK-8646] PySpark does not run on YARN if master not provided in command line andrewor14 davies vanzin can you take a look at this? thanks Author: Lianhui Wang <lianhuiwang09@gmail.com> Closes #7438 from lianhuiwang/SPARK-8646 and squashes the following commits: cb3f12d [Lianhui Wang] add whitespace 6d874a6 [Lianhui Wang] support pyspark for yarn-client	2015-07-16 19:31:45 -07:00

11971 commits