It does not make much sense to set `spark.shuffle.spill` or `spark.sql.planner.externalSort` to false: I believe that these configurations were initially added as "escape hatches" to guard against bugs in the external operators, but these operators are now mature and well-tested. In addition, these configurations are not handled in a consistent way anymore: SQL's Tungsten codepath ignores these configurations and will continue to use spilling operators. Similarly, Spark Core's `tungsten-sort` shuffle manager does not respect `spark.shuffle.spill=false`.
This pull request removes these configurations, adds warnings at the appropriate places, and deletes a large amount of code which was only used in code paths that did not support spilling.
Author: Josh Rosen <joshrosen@databricks.com>
Closes #8831 from JoshRosen/remove-ability-to-disable-spilling.
As ```assertEquals``` is deprecated, we need to change ```assertEquals``` to ```assertEqual``` in the existing Python unit tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #8814 from yanboliang/spark-10615.
JIRA: https://issues.apache.org/jira/browse/SPARK-10642
When calling `rdd.lookup()` on an RDD with tuple keys, `portable_hash` will return a long. That causes `DAGScheduler.submitJob` to throw `java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.Integer`.
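For illustration, a minimal sketch of the kind of fix described (this is not the merged patch; it just shows how masking keeps the tuple hash inside a 32-bit range so it can be cast to a Java `Integer`):
```python
# Illustrative sketch only: keep the tuple hash within 32-bit range so the
# partition index derived from it can be cast to a Java Integer on the JVM side.
def portable_hash(x):
    if x is None:
        return 0
    if isinstance(x, tuple):
        h = 0x345678
        for i in x:
            h ^= portable_hash(i)
            h *= 1000003
            h &= 0xffffffff   # mask so the running value never grows into a long
        h ^= len(x)
        return -2 if h == -1 else h
    return hash(x)
```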
Author: Liang-Chi Hsieh <viirya@appier.com>
Closes #8796 from viirya/fix-pyrdd-lookup.
Missed this when reviewing `pyspark.mllib.random` for SPARK-10275.
Author: noelsmith <mail@noelsmith.com>
Closes #8773 from noel-smith/mllib-random-versionadded-fix.
Duplicated the `since` decorator from pyspark.sql into pyspark (also tweaked to handle functions without docstrings).
Added `since` to methods + "versionadded::" to classes (derived from the git file history in pyspark).
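A rough sketch of what such a decorator can look like (the real decorator also handles docstring indentation; the decorated function here is purely illustrative):
```python
# Rough sketch: append a ".. versionadded::" note to the docstring, falling
# back to an empty docstring for functions that do not have one.
def since(version):
    def deco(f):
        f.__doc__ = (f.__doc__ or "") + "\n\n.. versionadded:: %s\n" % version
        return f
    return deco

@since("1.5.0")
def normal_rdd(sc, size, numPartitions=None, seed=None):
    """Generates an RDD of i.i.d. samples from the standard normal distribution."""
```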
Author: noelsmith <mail@noelsmith.com>
Closes #8633 from noel-smith/SPARK-10273-since-mllib-feature.
The PySpark DenseVector and SparseVector ```__eq__``` methods should use semantic equality, and a DenseVector should be comparable with a SparseVector.
Implement the PySpark DenseVector and SparseVector ```__hash__``` methods based on the first 16 entries. That will let PySpark Vector objects be used in collections.
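A hedged sketch of the idea (the real classes live in `pyspark.mllib.linalg`; names and details here are illustrative, not the merged code):
```python
# Hedged sketch: semantic equality via dense arrays plus a hash over the first
# 16 entries, so vectors can be used as dict keys or set members.
import numpy as np

class DenseVector(object):
    def __init__(self, values):
        self._values = np.asarray(values, dtype=np.float64)

    def toArray(self):
        return self._values

    def __eq__(self, other):
        # DenseVector and SparseVector compare through their dense form
        return hasattr(other, "toArray") and np.array_equal(self.toArray(), other.toArray())

    def __ne__(self, other):          # needed on Python 2 for a consistent !=
        return not self.__eq__(other)

    def __hash__(self):
        # only the first 16 entries, so hashing long vectors stays cheap
        result = 31
        for v in self.toArray()[:16]:
            result = 31 * result + hash(float(v))
        return result & 0x7fffffff
```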
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #8166 from yanboliang/spark-9793.
[SPARK-3382](https://issues.apache.org/jira/browse/SPARK-3382) added a ```convergenceTol``` parameter for GradientDescent-based methods in Scala. We need that parameter in Python; otherwise, Python users will not be able to adjust that behavior (or even reproduce behavior from previous releases since the default changed).
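A hedged usage sketch once the parameter is exposed in Python (`training_rdd` is an assumed RDD of LabeledPoint):
```python
# Assumed usage: pass convergenceTol through the Python train() wrapper, just
# as the Scala GradientDescent-based methods accept it.
from pyspark.mllib.regression import LinearRegressionWithSGD

model = LinearRegressionWithSGD.train(training_rdd, iterations=100, step=0.01,
                                      convergenceTol=1e-3)
```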
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #8457 from yanboliang/spark-10194.
Adding STDDEV support for DataFrame using a one-pass online/parallel algorithm to compute variance. Please review the code change.
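For context, a minimal Python sketch of the kind of one-pass online/parallel variance update being described (the merged change is a Scala aggregate expression; names here are illustrative):
```python
# Welford-style single-pass variance with a parallel merge step.
class StddevAgg(object):
    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0      # sum of squared deviations from the running mean

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def merge(self, other):
        # Chan et al. combine of two partial aggregates from different partitions
        if other.count == 0:
            return self
        delta = other.mean - self.mean
        total = self.count + other.count
        self.m2 += other.m2 + delta * delta * self.count * other.count / total
        self.mean += delta * other.count / total
        self.count = total
        return self

    def stddev(self):
        return (self.m2 / (self.count - 1)) ** 0.5 if self.count > 1 else float('nan')
```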
Author: JihongMa <linlin200605@gmail.com>
Author: Jihong MA <linlin200605@gmail.com>
Author: Jihong MA <jihongma@jihongs-mbp.usca.ibm.com>
Author: Jihong MA <jihongma@Jihongs-MacBook-Pro.local>
Closes #6297 from JihongMA/SPARK-SQL.
Just fixing a typo in the exception message raised when attempting to pickle a SparkContext.
Author: Icaro Medeiros <icaro.medeiros@gmail.com>
Closes #8724 from icaromedeiros/master.
Changes:
* Make Scala doc for StringIndexerInverse clearer. Also remove Scala doc from transformSchema, so that the doc is inherited.
* MetadataUtils.scala: "Helper utilities for tree-based algorithms" -> not just trees anymore
CC: holdenk mengxr
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #8679 from jkbradley/doc-fixes-1.5.
LinearRegression and LogisticRegression lack some Params on the Python side, and some Params are not shared classes, which means we need to write them for each class. These Params are listed here:
```scala
HasElasticNetParam
HasFitIntercept
HasStandardization
HasThresholds
```
Here we implement them as shared params on the Python side and make the LinearRegression/LogisticRegression parameters match their Scala counterparts.
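A hedged sketch of one such shared param on the Python side (modeled on the existing `pyspark.ml.param.shared` mixins; the real generated code may differ in detail):
```python
from pyspark.ml.param import Param, Params

class HasElasticNetParam(Params):
    """Mixin for param elasticNetParam (illustrative sketch)."""

    # the ElasticNet mixing parameter: 0 gives an L2 penalty, 1 gives an L1 penalty
    elasticNetParam = Param(Params._dummy(), "elasticNetParam",
                            "the ElasticNet mixing parameter, in range [0, 1]")

    def __init__(self):
        super(HasElasticNetParam, self).__init__()
        self._setDefault(elasticNetParam=0.0)

    def getElasticNetParam(self):
        return self.getOrDefault(self.elasticNetParam)
```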
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #8508 from yanboliang/spark-10026.
Missing methods of ml.feature are listed here:
```StringIndexer``` lacks the parameter ```handleInvalid```.
```StringIndexerModel``` lacks the method ```labels```.
```VectorIndexerModel``` lacks the methods ```numFeatures``` and ```categoryMaps```.
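A hedged usage sketch of the API once these are exposed in Python (the DataFrame `df` and column names are assumed):
```python
from pyspark.ml.feature import StringIndexer

indexer = StringIndexer(inputCol="category", outputCol="categoryIndex",
                        handleInvalid="error")
model = indexer.fit(df)
print(model.labels)   # ordered list of labels learned by the StringIndexerModel
```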
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #8313 from yanboliang/spark-10027.
Modified class-level docstrings to mark all feature transformers in pyspark.ml as experimental.
Author: noelsmith <mail@noelsmith.com>
Closes #8623 from noel-smith/SPARK-10094-mark-pyspark-ml-trans-exp.
- Fixed information around Python API tags in streaming programming guides
- Added missing content to the Python docs
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes #8595 from tdas/SPARK-10440.
The `pyspark.sql.column.Column` object has a `__getitem__` method, which makes it iterable in Python. It has `__getitem__` to handle the case when the column is a list or dict, so that you can access a certain element of it in the DataFrame API. The ability to iterate over it is just a side effect that might confuse people getting familiar with Spark DataFrames (as you might iterate this way over a Pandas DataFrame, for instance).
Issue reproduction:
```
df = sqlContext.jsonRDD(sc.parallelize(['{"name": "El Magnifico"}']))
for i in df["name"]: print i
```
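A minimal sketch of the kind of fix (the real change is in `pyspark.sql.column.Column`; this standalone class just illustrates the idea):
```python
class Column(object):
    def __getitem__(self, k):
        # kept so that element access such as df["m"]["key"] still works
        ...

    def __iter__(self):
        # failing explicitly keeps `for x in col` from silently looping via __getitem__
        raise TypeError("Column is not iterable")
```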
Author: 0x0FFF <programmerag@gmail.com>
Closes #8574 from 0x0FFF/SPARK-10417.
This PR addresses issue [SPARK-10392](https://issues.apache.org/jira/browse/SPARK-10392)
The problem is that for the "start of epoch" date (01 Jan 1970), the PySpark class DateType returns 0 instead of a `datetime.date`, due to the implementation of its return statement.
Issue reproduction on master:
```
>>> from pyspark.sql.types import *
>>> a = DateType()
>>> a.fromInternal(0)
0
>>> a.fromInternal(1)
datetime.date(1970, 1, 2)
```
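A hedged sketch consistent with the behavior above: the conversion should test for `None` explicitly rather than for truthiness, so that day 0 (1970-01-01) converts like any other day.
```python
import datetime

EPOCH_ORDINAL = datetime.datetime(1970, 1, 1).toordinal()

def from_internal(days):
    # explicit None check: 0 is falsy but is still a valid day offset
    if days is not None:
        return datetime.date.fromordinal(days + EPOCH_ORDINAL)
```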
Author: 0x0FFF <programmerag@gmail.com>
Closes #8556 from 0x0FFF/SPARK-10392.
This PR addresses [SPARK-10162](https://issues.apache.org/jira/browse/SPARK-10162)
The issue is with the DataFrame filter() function when a datetime.datetime is passed to it:
* Timezone information of this datetime is ignored
* This datetime is assumed to be in the local timezone, which depends on the OS timezone setting
The fix includes both a code change and a regression test. Problem reproduction code on master:
```python
import pytz
from datetime import datetime
from pyspark.sql import *
from pyspark.sql.types import *
sqc = SQLContext(sc)
df = sqc.createDataFrame([], StructType([StructField("dt", TimestampType())]))
m1 = pytz.timezone('UTC')
m2 = pytz.timezone('Etc/GMT+3')
df.filter(df.dt > datetime(2000, 01, 01, tzinfo=m1)).explain()
df.filter(df.dt > datetime(2000, 01, 01, tzinfo=m2)).explain()
```
It gives the same timestamp, ignoring the time zone:
```
>>> df.filter(df.dt > datetime(2000, 01, 01, tzinfo=m1)).explain()
Filter (dt#0 > 946713600000000)
Scan PhysicalRDD[dt#0]
>>> df.filter(df.dt > datetime(2000, 01, 01, tzinfo=m2)).explain()
Filter (dt#0 > 946713600000000)
Scan PhysicalRDD[dt#0]
```
After the fix:
```
>>> df.filter(df.dt > datetime(2000, 01, 01, tzinfo=m1)).explain()
Filter (dt#0 > 946684800000000)
Scan PhysicalRDD[dt#0]
>>> df.filter(df.dt > datetime(2000, 01, 01, tzinfo=m2)).explain()
Filter (dt#0 > 946695600000000)
Scan PhysicalRDD[dt#0]
```
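For reference, a hedged sketch of the kind of conversion change involved (not the exact patch): respect `tzinfo` when turning a `datetime` into Spark's internal microsecond timestamp, and only fall back to the OS timezone for naive values.
```python
import calendar
import time

def to_internal_timestamp(dt):
    if dt.tzinfo is not None:
        # timezone-aware: convert through the UTC time tuple
        seconds = calendar.timegm(dt.utctimetuple())
    else:
        # naive: keep the old behavior and assume the local timezone
        seconds = time.mktime(dt.timetuple())
    return int(seconds) * 1000000 + dt.microsecond
```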
PR [8536](https://github.com/apache/spark/pull/8536) was accidentally closed by me when I dropped the repo
Author: 0x0FFF <programmerag@gmail.com>
Closes #8555 from 0x0FFF/SPARK-10162.
* Added an isLargerBetter() method to the PySpark Evaluator to match the Scala version.
* JavaEvaluator delegates isLargerBetter() to the underlying Scala object.
* Added a check for isLargerBetter() in CrossValidator to determine whether to use argmin or argmax (see the sketch below).
* Added test cases for a metric where smaller is better (RMSE) and one where larger is better (R-Squared).
(This contribution is my original work and I license the work to the project under Spark's open source license.)
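A hedged sketch of the selection logic (the real code lives inside CrossValidator.fit(); this standalone helper just shows the argmin/argmax decision):
```python
import numpy as np

def select_best_index(metrics, is_larger_better):
    """Pick the winning parameter map: argmax for metrics to maximize (e.g. R^2),
    argmin for metrics to minimize (e.g. RMSE)."""
    return int(np.argmax(metrics)) if is_larger_better else int(np.argmin(metrics))

# e.g. RMSE averaged over folds: smaller is better
select_best_index([1.2, 0.8, 1.5], is_larger_better=False)   # -> 1
```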
Author: noelsmith <mail@noelsmith.com>
Closes #8399 from noel-smith/pyspark-rmse-xval-fix.
The PySpark DataFrameReader should accept an RDD of Strings (like the Scala version does) for JSON, rather than only taking a path.
If this PR is merged, it should be duplicated to cover the other input types (not just JSON).
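A hedged usage sketch of the behavior being added (assuming an existing `sc` and `sqlContext`; the path is illustrative):
```python
# JSON can now come from an RDD of strings, mirroring the Scala API,
# in addition to the existing path-based form.
rdd = sc.parallelize(['{"a": 1}', '{"a": 2}'])
df_from_rdd = sqlContext.read.json(rdd)
df_from_path = sqlContext.read.json("examples/src/main/resources/people.json")
```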
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #8444 from yanboliang/spark-9964.
Replace `JavaConversions` implicits with `JavaConverters`
Most occurrences I've seen so far are necessary conversions; a few have been avoidable. None are in critical code as far as I see, yet.
Author: Sean Owen <sowen@cloudera.com>
Closes #8033 from srowen/SPARK-9613.
This PR removed the `outputFile` configuration from pom.xml and updated `tests.py` to search jars for both sbt build and maven build.
I ran `mvn -Pkinesis-asl -DskipTests clean install` locally and verified that the jars in my local repository were correct. I also ran the Python tests against the Maven build, and they all passed.
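A hedged sketch of the kind of search `tests.py` can do (directory layouts are illustrative of sbt vs. Maven output, not copied from the patch):
```python
import glob
import os

def find_kinesis_asl_assembly_jar(spark_home):
    # sbt puts the assembly under a scala-version subdirectory; Maven writes it
    # directly into target/. Search both instead of relying on a fixed outputFile.
    patterns = [
        "extras/kinesis-asl-assembly/target/scala-*/spark-streaming-kinesis-asl-assembly-*.jar",
        "extras/kinesis-asl-assembly/target/spark-streaming-kinesis-asl-assembly-*.jar",
    ]
    jars = [j for p in patterns for j in glob.glob(os.path.join(spark_home, p))]
    if not jars:
        raise Exception("Failed to find the Kinesis assembly jar; build it first")
    return jars[0]
```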
Author: zsxwing <zsxwing@gmail.com>
Closes #8373 from zsxwing/SPARK-10168 and squashes the following commits:
e0b5818 [zsxwing] Fix the sbt build
c697627 [zsxwing] Add the jar pathes to the exception message
be1d8a5 [zsxwing] Fix the issue that maven publishes wrong artifact jars
The current code only checks checkpoint files in the local filesystem, and always tries to create a new Python SparkContext (even if one already exists). The solution is to do the following:
1. Use the same code path as Java to check whether a valid checkpoint exists
2. Create a new Python SparkContext only if there is no active one.
There is no test for this path, as it is hard to test with distributed filesystem paths in a local unit test. I am going to test it with a distributed file system manually to verify that this patch works.
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes #8366 from tdas/SPARK-10142 and squashes the following commits:
3afa666 [Tathagata Das] Added tests
2dd4ae5 [Tathagata Das] Added the check to not create a context if one already exists
9bf151b [Tathagata Das] Made python checkpoint recovery use java to find the checkpoint files
Details of the bug and explanations can be seen in [SPARK-10122](https://issues.apache.org/jira/browse/SPARK-10122).
tdas, please help to review.
Author: jerryshao <sshao@hortonworks.com>
Closes #8347 from jerryshao/SPARK-10122 and squashes the following commits:
4039b16 [jerryshao] Fix getOffsetRanges in transform() bug
This PR includes the following fixes:
1. Use `range` instead of `xrange` in `queue_stream.py` to support Python 3.
2. Fix the issue that `utf8_decoder` will return `bytes` rather than `str` when receiving an empty `bytes` in Python 3 (see the sketch after this list).
3. Fix the commands in the docs so that the user can copy them directly to the command line. The previous commands were broken in the middle of a path, so when copying to the command line, the path would be split into two parts by the extra spaces, which forced the user to fix it manually.
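A minimal sketch of fix 2 above (the real helper lives in the PySpark streaming modules): decode even when the input is an empty `bytes` object, so Python 3 callers always get `str` back.
```python
def utf8_decoder(s):
    """Decode bytes to str, treating an empty bytes value as valid input."""
    if s is None:
        return None
    return s.decode('utf-8')
```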
Author: zsxwing <zsxwing@gmail.com>
Closes #8315 from zsxwing/SPARK-9812.
DataFrame.withColumn in Python should be consistent with the Scala one (replacing the existing column that has the same name).
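A hedged usage sketch of the behavior (assuming an existing `sqlContext`):
```python
df = sqlContext.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
# replaces the existing "value" column instead of adding a duplicate one,
# matching the Scala behavior
df2 = df.withColumn("value", df["id"] * 2)
df2.columns   # ['id', 'value']
```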
cc marmbrus
Author: Davies Liu <davies@databricks.com>
Closes #8300 from davies/with_column.
Previously, users of evaluators (`CrossValidator` and `TrainValidationSplit`) would only maximize the metric from the evaluator, leading to a hacky solution which negated metrics to be minimized and caused erroneous negative values to be reported to the user.
This PR adds an `isLargerBetter` attribute to the `Evaluator` base class, telling users of `Evaluator` whether the chosen metric should be maximized or minimized.
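A hedged Python-style sketch of the contract (the change itself is in the Scala `Evaluator`; subclass and metric names here are illustrative):
```python
class Evaluator(object):
    def isLargerBetter(self):
        # default: the metric returned by evaluate() should be maximized
        return True

class RegressionEvaluator(Evaluator):
    def __init__(self, metricName="rmse"):
        self.metricName = metricName

    def isLargerBetter(self):
        # error metrics such as RMSE/MSE/MAE should be minimized; R^2 maximized
        return self.metricName in ("r2",)
```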
CC jkbradley
Author: Feynman Liang <fliang@databricks.com>
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #8290 from feynmanliang/SPARK-10097.