ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Yuanjian Li	01cc852982	[SPARK-30803][DOCS] Fix the home page link for Scala API document ### What changes were proposed in this pull request? Change the link to the Scala API document. ``` $ git grep "#org.apache.spark.package" docs/_layouts/global.html: <li><a href="api/scala/index.html#org.apache.spark.package">Scala</a></li> docs/index.md:* [Spark Scala API (Scaladoc)](api/scala/index.html#org.apache.spark.package) docs/rdd-programming-guide.md:[Scala](api/scala/#org.apache.spark.package), [Java](api/java/), [Python](api/python/) and [R](api/R/). ``` ### Why are the changes needed? The home page link for Scala API document is incorrect after upgrade to 3.0 ### Does this PR introduce any user-facing change? Document UI change only. ### How was this patch tested? Local test, attach screenshots below: Before: ![image](https://user-images.githubusercontent.com/4833765/74335713-c2385300-4dd7-11ea-95d8-f5a3639d2578.png) After: ![image](https://user-images.githubusercontent.com/4833765/74335727-cbc1bb00-4dd7-11ea-89d9-4dcc1310e679.png) Closes #27549 from xuanyuanking/scala-doc. Authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-02-16 09:55:03 -06:00
Sean Owen	754f820035	[SPARK-26918][DOCS] All .md should have ASF license header ## What changes were proposed in this pull request? Add AL2 license to metadata of all .md files. This seemed to be the tidiest way as it will get ignored by .md renderers and other tools. Attempts to write them as markdown comments revealed that there is no such standard thing. ## How was this patch tested? Doc build Closes #24243 from srowen/SPARK-26918. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-30 19:49:45 -05:00
“attilapiros”	2fbed378bf	[MINOR][DOC] Add missing space after comma Adding missing spaces after commas. Closes #24205 from attilapiros/minor-doc-changes. Authored-by: “attilapiros” <piros.attila.zsolt@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-25 15:22:07 -05:00
Sean Owen	0e2405490f	[SPARK-19550][BUILD][CORE][WIP] Remove Java 7 support - Move external/java8-tests tests into core, streaming, sql and remove - Remove MaxPermGen and related options - Fix some reflection / TODOs around Java 8+ methods - Update doc references to 1.7/1.8 differences - Remove Java 7/8 related build profiles - Update some plugins for better Java 8 compatibility - Fix a few Java-related warnings For the future: - Update Java 8 examples to fully use Java 8 - Update Java tests to use lambdas for simplicity - Update Java internal implementations to use lambdas ## How was this patch tested? Existing tests Author: Sean Owen <sowen@cloudera.com> Closes #16871 from srowen/SPARK-19493.	2017-02-16 12:32:45 +00:00
Niranjan Padmanabhan	a1e40b1f5d	[MINOR][DOCS] Remove consecutive duplicated words/typo in Spark Repo ## What changes were proposed in this pull request? There are many locations in the Spark repo where the same word occurs consecutively. Sometimes they are appropriately placed, but many times they are not. This PR removes the inappropriately duplicated words. ## How was this patch tested? N/A since only docs or comments were updated. Author: Niranjan Padmanabhan <niranjan.padmanabhan@gmail.com> Closes #16455 from neurons/np.structure_streaming_doc.	2017-01-04 15:07:29 +00:00
Joseph K. Bradley	5ffd5d3838	[SPARK-14817][ML][MLLIB][DOC] Made DataFrame-based API primary in MLlib guide ## What changes were proposed in this pull request? Made DataFrame-based API primary * Spark doc menu bar and other places now link to ml-guide.html, not mllib-guide.html * mllib-guide.html keeps RDD-specific list of features, with a link at the top redirecting people to ml-guide.html * ml-guide.html includes a "maintenance mode" announcement about the RDD-based API * Reviewers: please check this carefully * (minor) Titles for DF API no longer include "- spark.ml" suffix. Titles for RDD API have "- RDD-based API" suffix * Moved migration guide to ml-guide from mllib-guide * Also moved past guides from mllib-migration-guides to ml-migration-guides, with a redirect link on mllib-migration-guides * Reviewers: I did not change any of the content of the migration guides. Reorganized DataFrame-based guide: * ml-guide.html mimics the old mllib-guide.html page in terms of content: overview, migration guide, etc. * Moved Pipeline description into ml-pipeline.html and moved tuning into ml-tuning.html * Reviewers: I did not change the content of these guides, except some intro text. * Sidebar remains the same, but with pipeline and tuning sections added Other: * ml-classification-regression.html: Moved text about linear methods to new section in page ## How was this patch tested? Generated docs locally Author: Joseph K. Bradley <joseph@databricks.com> Closes #14213 from jkbradley/ml-guide-2.0.	2016-07-15 13:38:23 -07:00
Dongjoon Hyun	ad102af169	[SPARK-15883][MLLIB][DOCS] Fix broken links in mllib documents ## What changes were proposed in this pull request? This issue fixes all broken links on Spark 2.0 preview MLLib documents. Also, this contains some editorial change. Fix broken links * mllib-data-types.md * mllib-decision-tree.md * mllib-ensembles.md * mllib-feature-extraction.md * mllib-pmml-model-export.md * mllib-statistics.md Fix malformed section header and scala coding style * mllib-linear-methods.md Replace indirect forward links with direct one * ml-classification-regression.md ## How was this patch tested? Manual tests (with `cd docs; jekyll build`.) Author: Dongjoon Hyun <dongjoon@apache.org> Closes #13608 from dongjoon-hyun/SPARK-15883.	2016-06-11 12:55:38 +01:00
Xin Ren	dd9ca7b960	[SPARK-13019][DOCS] fix for scala-2.10 build: Replace example code in mllib-statistics.md using include_example ## What changes were proposed in this pull request? This PR for ticket SPARK-13019 is based on previous PR(https://github.com/apache/spark/pull/11108). Since PR(https://github.com/apache/spark/pull/11108) is breaking scala-2.10 build, more work is needed to fix build errors. What I did new in this PR is adding keyword argument for 'fractions': ` val approxSample = data.sampleByKey(withReplacement = false, fractions = fractions)` ` val exactSample = data.sampleByKeyExact(withReplacement = false, fractions = fractions)` I reopened ticket on JIRA but sorry I don't know how to reopen a GitHub pull request, so I just submitting a new pull request. ## How was this patch tested? Manual build testing on local machine, build based on scala-2.10. Author: Xin Ren <iamshrek@126.com> Closes #11901 from keypointt/SPARK-13019.	2016-03-24 09:34:54 +00:00
Xiangrui Meng	43ef1e52bf	Revert "[SPARK-13019][DOCS] Replace example code in mllib-statistics.md using include_example" This reverts commit `1af8de200c`.	2016-03-21 17:42:30 -07:00
Xin Ren	1af8de200c	[SPARK-13019][DOCS] Replace example code in mllib-statistics.md using include_example https://issues.apache.org/jira/browse/SPARK-13019 The example code in the user guide is embedded in the markdown and hence it is not easy to test. It would be nice to automatically test them. This JIRA is to discuss options to automate example code testing and see what we can do in Spark 1.6. Goal is to move actual example code to spark/examples and test compilation in Jenkins builds. Then in the markdown, we can reference part of the code to show in the user guide. This requires adding a Jekyll tag that is similar to https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, e.g., called include_example. `{% include_example scala/org/apache/spark/examples/mllib/SummaryStatisticsExample.scala %}` Jekyll will find `examples/src/main/scala/org/apache/spark/examples/mllib/SummaryStatisticsExample.scala` and pick code blocks marked "example" and replace code block in `{% highlight %}` in the markdown. See more sub-tasks in parent ticket: https://issues.apache.org/jira/browse/SPARK-11337 Author: Xin Ren <iamshrek@126.com> Closes #11108 from keypointt/SPARK-13019.	2016-03-21 16:09:34 -07:00
Zheng RuiFeng	204c9dec2c	[MINOR][DOC] Add JavaStreamingTestExample ## What changes were proposed in this pull request? Add the java example of StreamingTest ## How was this patch tested? manual tests in CLI: bin/run-example mllib.JavaStreamingTestExample dataDir 5 100 Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #11776 from zhengruifeng/streaming_je.	2016-03-17 11:09:02 +02:00
Timothy Hunter	2ecbe02d5b	[SPARK-12212][ML][DOC] Clarifies the difference between spark.ml, spark.mllib and mllib in the documentation. Replaces a number of occurences of `MLlib` in the documentation that were meant to refer to the `spark.mllib` package instead. It should clarify for new users the difference between `spark.mllib` (the package) and MLlib (the umbrella project for ML in spark). It also removes some files that I forgot to delete with #10207 Author: Timothy Hunter <timhunter@databricks.com> Closes #10234 from thunterdb/12212.	2015-12-10 12:50:46 -08:00
Feynman Liang	5535888930	[SPARK-11960][MLLIB][DOC] User guide for streaming tests CC jkbradley mengxr josepablocam Author: Feynman Liang <feynman.liang@gmail.com> Closes #10005 from feynmanliang/streaming-test-user-guide.	2015-11-30 15:38:44 -08:00
Sean Owen	d981902101	[SPARK-11476][DOCS] Incorrect function referred to in MLib Random data generation documentation Fix Python example to use normalRDD as advertised Author: Sean Owen <sowen@cloudera.com> Closes #9529 from srowen/SPARK-11476.	2015-11-08 11:15:58 +00:00
Xin Ren	27cdde2ff8	[SPARK-10669] [DOCS] Link to each language's API in codetabs in ML docs: spark.mllib In the Markdown docs for the spark.mllib Programming Guide, we have code examples with codetabs for each language. We should link to each language's API docs within the corresponding codetab, but we are inconsistent about this. For an example of what we want to do, see the "ChiSqSelector" section in `64743870f2/docs/mllib-feature-extraction.md` This JIRA is just for spark.mllib, not spark.ml. Please let me know if more work is needed, thanks a lot. Author: Xin Ren <iamshrek@126.com> Closes #8977 from keypointt/SPARK-10669.	2015-10-07 15:00:19 +01:00
jose.cambronero	c90c605dc6	[SPARK-9902] [MLLIB] Add Java and Python examples to user guide for 1-sample KS test added doc examples for python. Author: jose.cambronero <jose.cambronero@cloudera.com> Closes #8154 from josepablocam/spark_9902.	2015-08-17 19:09:45 -07:00
Sandy Ryza	f9d1a92aa1	[SPARK-7707] User guide and example code for KernelDensity Author: Sandy Ryza <sandy@cloudera.com> Closes #8230 from sryza/sandy-spark-7707.	2015-08-17 17:57:51 -07:00
Sean Owen	873ab0f969	[SPARK-9490] [DOCS] [MLLIB] MLlib evaluation metrics guide example python code uses deprecated print statement Use print(x) not print x for Python 3 in eval examples CC sethah mengxr -- just wanted to close this out before 1.5 Author: Sean Owen <sowen@cloudera.com> Closes #7822 from srowen/SPARK-9490 and squashes the following commits: 01abeba [Sean Owen] Change "print x" to "print(x)" in the rest of the docs too bd7f7fb [Sean Owen] Use print(x) not print x for Python 3 in eval examples	2015-07-31 13:45:28 -07:00
jose.cambronero	9c5075775d	[SPARK-8598] [MLLIB] Implementation of 1-sample, two-sided, Kolmogorov Smirnov Test for RDDs This contribution is my original work and I license it to the project under it's open source license. Author: jose.cambronero <jose.cambronero@cloudera.com> Closes #6994 from josepablocam/master and squashes the following commits: bbb30b1 [jose.cambronero] renamed KSTestResult to KolmogorovSmirnovTestResult, to stay consistent with method name 0d0c201 [jose.cambronero] kstTest -> kolmogorovSmirnovTest in statistics.md 1f56371 [jose.cambronero] changed ksTest in public API to kolmogorovSmirnovTest for clarity a48ae7b [jose.cambronero] refactor code to account for serializable RealDistribution. Reuse testOneSample( _, cdf) 1bb44bd [jose.cambronero] style and doc changes. Factored out ks test into 2 separate tests 2ec2aa6 [jose.cambronero] initialize to stdnormal when no params passed (and log). Change unit tests to approximate equivalence rather than strict a4bc0c7 [jose.cambronero] changed ksTest(data, distName) to ksTest(data, distName, params*) after api discussions. Changed tests and docs accordingly 7e66f57 [jose.cambronero] copied implementation note to public api docs, and added @see for links to wiki info e760ebd [jose.cambronero] line length changes to fit style check 3288e42 [jose.cambronero] addressed style changes, correctness change to simpler approach, and fixed edge case for foldLeft in searchOneSampleCandidates when a partition is empty 9026895 [jose.cambronero] addressed style changes, correctness change to simpler approach, and fixed edge case for foldLeft in searchOneSampleCandidates when a partition is empty 1226b30 [jose.cambronero] reindent multi-line lambdas, prior intepretation of style guide was wrong on my part 9c0f1af [jose.cambronero] additional style changes incorporated and added documentation to mllib statistics docs 3f81ad2 [jose.cambronero] renamed ks1 sample test for clarity 992293b [jose.cambronero] Style changes as per comments and added implementation note explaining the distributed approach. 6a4784f [jose.cambronero] specified what distributions are available for the convenience method ksTest(data, name) (solely standard normal) 4b8ba61 [jose.cambronero] fixed off by 1/N in cases when post-constant adjustment ecdf is above cdf, but prior to adj it was below 0b5e8ec [jose.cambronero] changed KS one sample test to perform just 1 distributed pass (in addition to the sorting pass), operates on each partition separately. Implementation of Sandy Ryza's algorithm 16b5c4c [jose.cambronero] renamed dat to data and eliminated recalc of RDD size by sharing as argument between empirical and evalOneSampleP `c18dc66` [jose.cambronero] removed ksTestOpt from API and changed comments in HypothesisTestSuite accordingly f6951b6 [jose.cambronero] changed style and some comments based on feedback from pull request b9cff3a [jose.cambronero] made small changes to pass style check ce8e9a1 [jose.cambronero] added kstest testing in HypothesisTestSuite 4da189b [jose.cambronero] added user facing ks test functions c659ea1 [jose.cambronero] created KS test class 13dfe4d [jose.cambronero] created test result class for ks test	2015-07-10 20:55:45 -07:00
Kamil Smuga	6ef48632fb	SPARK-6454 [DOCS] Fix links to pyspark api Author: Kamil Smuga <smugakamil@gmail.com> Author: stderr <smugakamil@gmail.com> Closes #5120 from kamilsmuga/master and squashes the following commits: fee3281 [Kamil Smuga] more python api links fixed for docs 13240cb [Kamil Smuga] resolved merge conflicts with upstream/master 6649b3b [Kamil Smuga] fix broken docs links to Python API 92f03d7 [stderr] Fix links to pyspark api	2015-03-22 15:56:25 +00:00
Davies Liu	c8abddc516	[SPARK-3964] [MLlib] [PySpark] add Hypothesis test Python API ``` pyspark.mllib.stat.StatisticschiSqTest(observed, expected=None) :: Experimental :: If `observed` is Vector, conduct Pearson's chi-squared goodness of fit test of the observed data against the expected distribution, or againt the uniform distribution (by default), with each category having an expected frequency of `1 / len(observed)`. (Note: `observed` cannot contain negative values) If `observed` is matrix, conduct Pearson's independence test on the input contingency matrix, which cannot contain negative entries or columns or rows that sum up to 0. If `observed` is an RDD of LabeledPoint, conduct Pearson's independence test for every feature against the label across the input RDD. For each feature, the (feature, label) pairs are converted into a contingency matrix for which the chi-squared statistic is computed. All label and feature values must be categorical. :param observed: it could be a vector containing the observed categorical counts/relative frequencies, or the contingency matrix (containing either counts or relative frequencies), or an RDD of LabeledPoint containing the labeled dataset with categorical features. Real-valued features will be treated as categorical for each distinct value. :param expected: Vector containing the expected categorical counts/relative frequencies. `expected` is rescaled if the `expected` sum differs from the `observed` sum. :return: ChiSquaredTest object containing the test statistic, degrees of freedom, p-value, the method used, and the null hypothesis. ``` Author: Davies Liu <davies@databricks.com> Closes #3091 from davies/his and squashes the following commits: 145d16c [Davies Liu] address comments 0ab0764 [Davies Liu] fix float 5097d54 [Davies Liu] add Hypothesis test Python API	2014-11-04 21:35:52 -08:00
Kousuke Saruta	f799700eec	[SPARK-4055][MLlib] Inconsistent spelling 'MLlib' and 'MLLib' Thare are some inconsistent spellings 'MLlib' and 'MLLib' in some documents and source codes. Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #2903 from sarutak/SPARK-4055 and squashes the following commits: b031640 [Kousuke Saruta] Fixed inconsistent spelling "MLlib and MLLib"	2014-10-23 09:19:32 -07:00
Xiangrui Meng	43dfc84f88	[SPARK-2830][MLLIB] doc update for 1.1 1. renamed mllib-basics to mllib-data-types 1. renamed mllib-stats to mllib-statistics 1. moved random data generation to the bottom of mllib-stats 1. updated toc accordingly atalwalkar Author: Xiangrui Meng <meng@databricks.com> Closes #2151 from mengxr/mllib-doc-1.1 and squashes the following commits: 0bd79f3 [Xiangrui Meng] add mllib-data-types b64a5d7 [Xiangrui Meng] update the content list of basis statistics in mllib-guide f625cc2 [Xiangrui Meng] move mllib-basics to mllib-data-types 4d69250 [Xiangrui Meng] move random data generation to the bottom of statistics e64f3ce [Xiangrui Meng] move mllib-stats.md to mllib-statistics.md	2014-08-27 01:19:48 -07:00

23 commits