ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
hyukjinkwon	695647bf2e	[SPARK-21640][SQL][PYTHON][R][FOLLOWUP] Add errorifexists in SparkR and other documentations ## What changes were proposed in this pull request? This PR proposes to add `errorifexists` to SparkR API and fix the rest of them describing the mode, mainly, in API documentations as well. This PR also replaces `convertToJSaveMode` to `setWriteMode` so that string as is is passed to JVM and executes: `b034f2565f/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala (L72-L82)` and remove the duplication here: `3f958a9992/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala (L187-L194)` ## How was this patch tested? Manually checked the built documentation. These were mainly found by `` grep -r `error` `` and `grep -r 'error'`. Also, unit tests added in `test_sparkSQL.R`. Author: hyukjinkwon <gurwls223@gmail.com> Closes #19673 from HyukjinKwon/SPARK-21640-followup.	2017-11-09 15:00:31 +09:00
ptkool	d01044233c	[SPARK-22456][SQL] Add support for dayofweek function ## What changes were proposed in this pull request? This PR adds support for a new function called `dayofweek` that returns the day of the week of the given argument as an integer value in the range 1-7, where 1 represents Sunday. ## How was this patch tested? Unit tests and manual tests. Author: ptkool <michael.styles@shopify.com> Closes #19672 from ptkool/day_of_week_function.	2017-11-09 14:44:39 +09:00
Bryan Cutler	1d341042d6	[SPARK-22417][PYTHON] Fix for createDataFrame from pandas.DataFrame with timestamp ## What changes were proposed in this pull request? Currently, a pandas.DataFrame that contains a timestamp of type 'datetime64[ns]' when converted to a Spark DataFrame with `createDataFrame` will interpret the values as LongType. This fix will check for a timestamp type and convert it to microseconds which will allow Spark to read as TimestampType. ## How was this patch tested? Added unit test to verify Spark schema is expected for TimestampType and DateType when created from pandas Author: Bryan Cutler <cutlerb@gmail.com> Closes #19646 from BryanCutler/pyspark-non-arrow-createDataFrame-ts-fix-SPARK-22417.	2017-11-07 21:32:37 +01:00
Marco Gaido	e7adb7d7a6	[SPARK-22437][PYSPARK] default mode for jdbc is wrongly set to None ## What changes were proposed in this pull request? When writing using jdbc with python currently we are wrongly assigning by default None as writing mode. This is due to wrongly calling mode on the `_jwrite` object instead of `self` and it causes an exception. ## How was this patch tested? manual tests Author: Marco Gaido <mgaido@hortonworks.com> Closes #19654 from mgaido91/SPARK-22437.	2017-11-04 16:59:58 +09:00
hyukjinkwon	41b60125b6	[SPARK-22369][PYTHON][DOCS] Exposes catalog API documentation in PySpark ## What changes were proposed in this pull request? This PR proposes to add a link from `spark.catalog(..)` to `Catalog` and expose Catalog APIs in PySpark as below: <img width="740" alt="2017-10-29 12 25 46" src="https://user-images.githubusercontent.com/6477701/32135863-f8e9b040-bc40-11e7-92ad-09c8043a1295.png"> <img width="1131" alt="2017-10-29 12 26 33" src="https://user-images.githubusercontent.com/6477701/32135849-bb257b86-bc40-11e7-9eda-4d58fc1301c2.png"> Note that this is not shown in the list on the top - https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql <img width="674" alt="2017-10-29 12 30 58" src="https://user-images.githubusercontent.com/6477701/32135854-d50fab16-bc40-11e7-9181-812c56fd22f5.png"> This is basically similar with `DataFrameReader` and `DataFrameWriter`. ## How was this patch tested? Manually built the doc. Author: hyukjinkwon <gurwls223@gmail.com> Closes #19596 from HyukjinKwon/SPARK-22369.	2017-11-02 15:22:52 +01:00
Liang-Chi Hsieh	07f390a27d	[SPARK-22347][PYSPARK][DOC] Add document to notice users for using udfs with conditional expressions ## What changes were proposed in this pull request? Under the current execution mode of Python UDFs, we don't well support Python UDFs as branch values or else value in CaseWhen expression. Since to fix it might need the change not small (e.g., #19592) and this issue has simpler workaround. We should just notice users in the document about this. ## How was this patch tested? Only document change. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #19617 from viirya/SPARK-22347-3.	2017-11-01 13:09:35 +01:00
bomeng	aa6db57e39	[SPARK-22399][ML] update the location of reference paper ## What changes were proposed in this pull request? Update the url of reference paper. ## How was this patch tested? It is comments, so nothing tested. Author: bomeng <bmeng@us.ibm.com> Closes #19614 from bomeng/22399.	2017-10-31 08:20:23 +00:00
hyukjinkwon	188b47e683	[SPARK-22379][PYTHON] Reduce duplication setUpClass and tearDownClass in PySpark SQL tests ## What changes were proposed in this pull request? This PR propose to add `ReusedSQLTestCase` which deduplicate `setUpClass` and `tearDownClass` in `sql/tests.py`. ## How was this patch tested? Jenkins tests and manual tests. Author: hyukjinkwon <gurwls223@gmail.com> Closes #19595 from HyukjinKwon/reduce-dupe.	2017-10-30 11:50:22 +09:00
Takuya UESHIN	4c5269f1aa	[SPARK-22370][SQL][PYSPARK] Config values should be captured in Driver. ## What changes were proposed in this pull request? `ArrowEvalPythonExec` and `FlatMapGroupsInPandasExec` are refering config values of `SQLConf` in function for `mapPartitions`/`mapPartitionsInternal`, but we should capture them in Driver. ## How was this patch tested? Added a test and existing tests. Author: Takuya UESHIN <ueshin@databricks.com> Closes #19587 from ueshin/issues/SPARK-22370.	2017-10-28 18:33:09 +01:00
WeichenXu	20eb95e5e9	[SPARK-21911][ML][PYSPARK] Parallel Model Evaluation for ML Tuning in PySpark ## What changes were proposed in this pull request? Add parallelism support for ML tuning in pyspark. ## How was this patch tested? Test updated. Author: WeichenXu <weichen.xu@databricks.com> Closes #19122 from WeichenXu123/par-ml-tuning-py.	2017-10-27 15:19:27 -07:00
Bryan Cutler	17af727e38	[SPARK-21375][PYSPARK][SQL] Add Date and Timestamp support to ArrowConverters for toPandas() Conversion ## What changes were proposed in this pull request? Adding date and timestamp support with Arrow for `toPandas()` and `pandas_udf`s. Timestamps are stored in Arrow as UTC and manifested to the user as timezone-naive localized to the Python system timezone. ## How was this patch tested? Added Scala tests for date and timestamp types under ArrowConverters, ArrowUtils, and ArrowWriter suites. Added Python tests for `toPandas()` and `pandas_udf`s with date and timestamp types. Author: Bryan Cutler <cutlerb@gmail.com> Author: Takuya UESHIN <ueshin@databricks.com> Closes #18664 from BryanCutler/arrow-date-timestamp-SPARK-21375.	2017-10-26 23:02:46 -07:00
hyukjinkwon	d9798c834f	[SPARK-22313][PYTHON] Mark/print deprecation warnings as DeprecationWarning for deprecated APIs ## What changes were proposed in this pull request? This PR proposes to mark the existing warnings as `DeprecationWarning` and print out warnings for deprecated functions. This could be actually useful for Spark app developers. I use (old) PyCharm and this IDE can detect this specific `DeprecationWarning` in some cases: Before <img src="https://user-images.githubusercontent.com/6477701/31762664-df68d9f8-b4f6-11e7-8773-f0468f70a2cc.png" height="45" /> After <img src="https://user-images.githubusercontent.com/6477701/31762662-de4d6868-b4f6-11e7-98dc-3c8446a0c28a.png" height="70" /> For console usage, `DeprecationWarning` is usually disabled (see https://docs.python.org/2/library/warnings.html#warning-categories and https://docs.python.org/3/library/warnings.html#warning-categories): ``` >>> import warnings >>> filter(lambda f: f[2] == DeprecationWarning, warnings.filters) [('ignore', <_sre.SRE_Pattern object at 0x10ba58c00>, <type 'exceptions.DeprecationWarning'>, <_sre.SRE_Pattern object at 0x10bb04138>, 0), ('ignore', None, <type 'exceptions.DeprecationWarning'>, None, 0)] ``` so, it won't actually mess up the terminal much unless it is intended. If this is intendedly enabled, it'd should as below: ``` >>> import warnings >>> warnings.simplefilter('always', DeprecationWarning) >>> >>> from pyspark.sql import functions >>> functions.approxCountDistinct("a") .../spark/python/pyspark/sql/functions.py:232: DeprecationWarning: Deprecated in 2.1, use approx_count_distinct instead. "Deprecated in 2.1, use approx_count_distinct instead.", DeprecationWarning) ... ``` These instances were found by: ``` cd python/pyspark grep -r "Deprecated" . grep -r "deprecated" . grep -r "deprecate" . ``` ## How was this patch tested? Manually tested. Author: hyukjinkwon <gurwls223@gmail.com> Closes #19535 from HyukjinKwon/deprecated-warning.	2017-10-24 12:44:47 +09:00
Takuya UESHIN	b8624b06e5	[SPARK-20396][SQL][PYSPARK][FOLLOW-UP] groupby().apply() with pandas udf ## What changes were proposed in this pull request? This is a follow-up of #18732. This pr modifies `GroupedData.apply()` method to convert pandas udf to grouped udf implicitly. ## How was this patch tested? Exisiting tests. Author: Takuya UESHIN <ueshin@databricks.com> Closes #19517 from ueshin/issues/SPARK-20396/fup2.	2017-10-20 12:44:30 -07:00
Zhenhua Wang	655f6f86f8	[SPARK-22208][SQL] Improve percentile_approx by not rounding up targetError and starting from index 0 ## What changes were proposed in this pull request? Currently percentile_approx never returns the first element when percentile is in (relativeError, 1/N], where relativeError default 1/10000, and N is the total number of elements. But ideally, percentiles in [0, 1/N] should all return the first element as the answer. For example, given input data 1 to 10, if a user queries 10% (or even less) percentile, it should return 1, because the first value 1 already reaches 10%. Currently it returns 2. Based on the paper, targetError is not rounded up, and searching index should start from 0 instead of 1. By following the paper, we should be able to fix the cases mentioned above. ## How was this patch tested? Added a new test case and fix existing test cases. Author: Zhenhua Wang <wzh_zju@163.com> Closes #19438 from wzhfy/improve_percentile_approx.	2017-10-11 00:16:12 -07:00
Li Jin	bfc7e1fe1a	[SPARK-20396][SQL][PYSPARK] groupby().apply() with pandas udf ## What changes were proposed in this pull request? This PR adds an apply() function on df.groupby(). apply() takes a pandas udf that is a transformation on `pandas.DataFrame` -> `pandas.DataFrame`. Static schema ------------------- ``` schema = df.schema pandas_udf(schema) def normalize(df): df = df.assign(v1 = (df.v1 - df.v1.mean()) / df.v1.std() return df df.groupBy('id').apply(normalize) ``` Dynamic schema ----------------------- This use case is removed from the PR and we will discuss this as a follow up. See discussion https://github.com/apache/spark/pull/18732#pullrequestreview-66583248 Another example to use pd.DataFrame dtypes as output schema of the udf: ``` sample_df = df.filter(df.id == 1).toPandas() def foo(df): ret = # Some transformation on the input pd.DataFrame return ret foo_udf = pandas_udf(foo, foo(sample_df).dtypes) df.groupBy('id').apply(foo_udf) ``` In interactive use case, user usually have a sample pd.DataFrame to test function `foo` in their notebook. Having been able to use `foo(sample_df).dtypes` frees user from specifying the output schema of `foo`. Design doc: https://github.com/icexelloss/spark/blob/pandas-udf-doc/docs/pyspark-pandas-udf.md ## How was this patch tested? * Added GroupbyApplyTest Author: Li Jin <ice.xelloss@gmail.com> Author: Takuya UESHIN <ueshin@databricks.com> Author: Bryan Cutler <cutlerb@gmail.com> Closes #18732 from icexelloss/groupby-apply-SPARK-20396.	2017-10-11 07:32:01 +09:00
Takuya UESHIN	af8a34c787	[SPARK-22159][SQL][FOLLOW-UP] Make config names consistently end with "enabled". ## What changes were proposed in this pull request? This is a follow-up of #19384. In the previous pr, only definitions of the config names were modified, but we also need to modify the names in runtime or tests specified as string literal. ## How was this patch tested? Existing tests but modified the config names. Author: Takuya UESHIN <ueshin@databricks.com> Closes #19462 from ueshin/issues/SPARK-22159/fup1.	2017-10-09 22:35:34 -07:00
Nick Pentreath	98057583dd	[SPARK-20679][ML] Support recommending for a subset of users/items in ALSModel This PR adds methods `recommendForUserSubset` and `recommendForItemSubset` to `ALSModel`. These allow recommending for a specified set of user / item ids rather than for every user / item (as in the `recommendForAllX` methods). The subset methods take a `DataFrame` as input, containing ids in the column specified by the param `userCol` or `itemCol`. The model will generate recommendations for each _unique_ id in this input dataframe. ## How was this patch tested? New unit tests in `ALSSuite` and Python doctests in `ALS`. Ran updated examples locally. Author: Nick Pentreath <nickp@za.ibm.com> Closes #18748 from MLnick/als-recommend-df.	2017-10-09 10:42:33 +02:00
Sean Owen	0c03297bf0	[SPARK-22142][BUILD][STREAMING] Move Flume support behind a profile, take 2 ## What changes were proposed in this pull request? Move flume behind a profile, take 2. See https://github.com/apache/spark/pull/19365 for most of the back-story. This change should fix the problem by removing the examples module dependency and moving Flume examples to the module itself. It also adds deprecation messages, per a discussion on dev about deprecating for 2.3.0. ## How was this patch tested? Existing tests, which still enable flume integration. Author: Sean Owen <sowen@cloudera.com> Closes #19412 from srowen/SPARK-22142.2.	2017-10-06 15:08:28 +01:00
gatorsmile	472864014c	Revert "[SPARK-22142][BUILD][STREAMING] Move Flume support behind a profile" This reverts commit `a2516f41ae`.	2017-09-29 11:45:58 -07:00
Sean Owen	a2516f41ae	[SPARK-22142][BUILD][STREAMING] Move Flume support behind a profile ## What changes were proposed in this pull request? Add 'flume' profile to enable Flume-related integration modules ## How was this patch tested? Existing tests; no functional change Author: Sean Owen <sowen@cloudera.com> Closes #19365 from srowen/SPARK-22142.	2017-09-29 08:26:53 +01:00
Bryan Cutler	7bf4da8a33	[MINOR] Fixed up pandas_udf related docs and formatting ## What changes were proposed in this pull request? Fixed some minor issues with pandas_udf related docs and formatting. ## How was this patch tested? NA Author: Bryan Cutler <cutlerb@gmail.com> Closes #19375 from BryanCutler/arrow-pandas_udf-cleanup-minor.	2017-09-28 10:24:51 +09:00
Takuya UESHIN	09cbf3df20	[SPARK-22125][PYSPARK][SQL] Enable Arrow Stream format for vectorized UDF. ## What changes were proposed in this pull request? Currently we use Arrow File format to communicate with Python worker when invoking vectorized UDF but we can use Arrow Stream format. This pr replaces the Arrow File format with the Arrow Stream format. ## How was this patch tested? Existing tests. Author: Takuya UESHIN <ueshin@databricks.com> Closes #19349 from ueshin/issues/SPARK-22125.	2017-09-27 23:21:44 +09:00
goldmedal	1fdfe69352	[SPARK-22112][PYSPARK] Supports RDD of strings as input in spark.read.csv in PySpark ## What changes were proposed in this pull request? We added a method to the scala API for creating a `DataFrame` from `DataSet[String]` storing CSV in [SPARK-15463](https://issues.apache.org/jira/browse/SPARK-15463) but PySpark doesn't have `Dataset` to support this feature. Therfore, I add an API to create a `DataFrame` from `RDD[String]` storing csv and it's also consistent with PySpark's `spark.read.json`. For example as below ``` >>> rdd = sc.textFile('python/test_support/sql/ages.csv') >>> df2 = spark.read.csv(rdd) >>> df2.dtypes [('_c0', 'string'), ('_c1', 'string')] ``` ## How was this patch tested? add unit test cases. Author: goldmedal <liugs963@gmail.com> Closes #19339 from goldmedal/SPARK-22112.	2017-09-27 11:19:45 +09:00
Bryan Cutler	d8e825e3bc	[SPARK-22106][PYSPARK][SQL] Disable 0-parameter pandas_udf and add doctests ## What changes were proposed in this pull request? This change disables the use of 0-parameter pandas_udfs due to the API being overly complex and awkward, and can easily be worked around by using an index column as an input argument. Also added doctests for pandas_udfs which revealed bugs for handling empty partitions and using the pandas_udf decorator. ## How was this patch tested? Reworked existing 0-parameter test to verify error is raised, added doctest for pandas_udf, added new tests for empty partition and decorator usage. Author: Bryan Cutler <cutlerb@gmail.com> Closes #19325 from BryanCutler/arrow-pandas_udf-0-param-remove-SPARK-22106.	2017-09-26 10:54:00 +09:00
Zhenhua Wang	365a29bdbf	[SPARK-22100][SQL] Make percentile_approx support date/timestamp type and change the output type to be the same as input type ## What changes were proposed in this pull request? The `percentile_approx` function previously accepted numeric type input and output double type results. But since all numeric types, date and timestamp types are represented as numerics internally, `percentile_approx` can support them easily. After this PR, it supports date type, timestamp type and numeric types as input types. The result type is also changed to be the same as the input type, which is more reasonable for percentiles. This change is also required when we generate equi-height histograms for these types. ## How was this patch tested? Added a new test and modified some existing tests. Author: Zhenhua Wang <wangzhenhua@huawei.com> Closes #19321 from wzhfy/approx_percentile_support_types.	2017-09-25 09:28:42 -07:00
Liang-Chi Hsieh	3e6a714c9e	[SPARK-21766][PYSPARK][SQL] DataFrame toPandas() raises ValueError with nullable int columns ## What changes were proposed in this pull request? When calling `DataFrame.toPandas()` (without Arrow enabled), if there is a `IntegralType` column (`IntegerType`, `ShortType`, `ByteType`) that has null values the following exception is thrown: ValueError: Cannot convert non-finite values (NA or inf) to integer This is because the null values first get converted to float NaN during the construction of the Pandas DataFrame in `from_records`, and then it is attempted to be converted back to to an integer where it fails. The fix is going to check if the Pandas DataFrame can cause such failure when converting, if so, we don't do the conversion and use the inferred type by Pandas. Closes #18945 ## How was this patch tested? Added pyspark test. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #19319 from viirya/SPARK-21766.	2017-09-22 22:39:47 +09:00
Bryan Cutler	27fc536d9a	[SPARK-21190][PYSPARK] Python Vectorized UDFs This PR adds vectorized UDFs to the Python API Proposed API Introduce a flag to turn on vectorization for a defined UDF, for example: ``` pandas_udf(DoubleType()) def plus(a, b) return a + b ``` or ``` plus = pandas_udf(lambda a, b: a + b, DoubleType()) ``` Usage is the same as normal UDFs 0-parameter UDFs pandas_udf functions can declare an optional `kwargs` and when evaluated, will contain a key "size" that will give the required length of the output. For example: ``` pandas_udf(LongType()) def f0(kwargs): return pd.Series(1).repeat(kwargs["size"]) df.select(f0()) ``` Added new unit tests in pyspark.sql that are enabled if pyarrow and Pandas are available. - [x] Fix support for promoted types with null values - [ ] Discuss 0-param UDF API (use of kwargs) - [x] Add tests for chained UDFs - [ ] Discuss behavior when pyarrow not installed / enabled - [ ] Cleanup pydoc and add user docs Author: Bryan Cutler <cutlerb@gmail.com> Author: Takuya UESHIN <ueshin@databricks.com> Closes #18659 from BryanCutler/arrow-vectorized-udfs-SPARK-21404.	2017-09-22 16:17:50 +08:00
Marco Gaido	5ac96854cc	[SPARK-21981][PYTHON][ML] Added Python interface for ClusteringEvaluator ## What changes were proposed in this pull request? Added Python interface for ClusteringEvaluator ## How was this patch tested? Manual test, eg. the example Python code in the comments. cc yanboliang Author: Marco Gaido <mgaido@hortonworks.com> Author: Marco Gaido <marcogaido91@gmail.com> Closes #19204 from mgaido91/SPARK-21981.	2017-09-22 13:12:33 +08:00
Sean Owen	e17901d6df	[SPARK-22049][DOCS] Confusing behavior of from_utc_timestamp and to_utc_timestamp ## What changes were proposed in this pull request? Clarify behavior of to_utc_timestamp/from_utc_timestamp with an example ## How was this patch tested? Doc only change / existing tests Author: Sean Owen <sowen@cloudera.com> Closes #19276 from srowen/SPARK-22049.	2017-09-20 20:47:17 +09:00
Yanbo Liang	2f962422a2	[MINOR][ML] Remove unnecessary default value setting for evaluators. ## What changes were proposed in this pull request? Remove unnecessary default value setting for all evaluators, as we have set them in corresponding _HasXXX_ base classes. ## How was this patch tested? Existing tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #19262 from yanboliang/evaluation.	2017-09-19 22:22:35 +08:00
hyukjinkwon	7c7266208a	[SPARK-22043][PYTHON] Improves error message for show_profiles and dump_profiles ## What changes were proposed in this pull request? This PR proposes to improve error message from: ``` >>> sc.show_profiles() Traceback (most recent call last): File "<stdin>", line 1, in <module> File ".../spark/python/pyspark/context.py", line 1000, in show_profiles self.profiler_collector.show_profiles() AttributeError: 'NoneType' object has no attribute 'show_profiles' >>> sc.dump_profiles("/tmp/abc") Traceback (most recent call last): File "<stdin>", line 1, in <module> File ".../spark/python/pyspark/context.py", line 1005, in dump_profiles self.profiler_collector.dump_profiles(path) AttributeError: 'NoneType' object has no attribute 'dump_profiles' ``` to ``` >>> sc.show_profiles() Traceback (most recent call last): File "<stdin>", line 1, in <module> File ".../spark/python/pyspark/context.py", line 1003, in show_profiles raise RuntimeError("'spark.python.profile' configuration must be set " RuntimeError: 'spark.python.profile' configuration must be set to 'true' to enable Python profile. >>> sc.dump_profiles("/tmp/abc") Traceback (most recent call last): File "<stdin>", line 1, in <module> File ".../spark/python/pyspark/context.py", line 1012, in dump_profiles raise RuntimeError("'spark.python.profile' configuration must be set " RuntimeError: 'spark.python.profile' configuration must be set to 'true' to enable Python profile. ``` ## How was this patch tested? Unit tests added in `python/pyspark/tests.py` and manual tests. Author: hyukjinkwon <gurwls223@gmail.com> Closes #19260 from HyukjinKwon/profile-errors.	2017-09-18 13:20:11 +09:00
Andrew Ray	6adf67dd14	[SPARK-21985][PYSPARK] PairDeserializer is broken for double-zipped RDDs ## What changes were proposed in this pull request? (edited) Fixes a bug introduced in #16121 In PairDeserializer convert each batch of keys and values to lists (if they do not have `__len__` already) so that we can check that they are the same size. Normally they already are lists so this should not have a performance impact, but this is needed when repeated `zip`'s are done. ## How was this patch tested? Additional unit test Author: Andrew Ray <ray.andrew@gmail.com> Closes #19226 from aray/SPARK-21985.	2017-09-18 02:46:27 +09:00
Maciej Bryński	f4073020ad	[SPARK-22032][PYSPARK] Speed up StructType conversion ## What changes were proposed in this pull request? StructType.fromInternal is calling f.fromInternal(v) for every field. We can use precalculated information about type to limit the number of function calls. (its calculated once per StructType and used in per record calculations) Benchmarks (Python profiler) ``` df = spark.range(10000000).selectExpr("id as id0", "id as id1", "id as id2", "id as id3", "id as id4", "id as id5", "id as id6", "id as id7", "id as id8", "id as id9", "struct(id) as s").cache() df.count() df.rdd.map(lambda x: x).count() ``` Before ``` 310274584 function calls (300272456 primitive calls) in 1320.684 seconds Ordered by: internal time, cumulative time ncalls tottime percall cumtime percall filename:lineno(function) 10000000 253.417 0.000 486.991 0.000 types.py:619(<listcomp>) 30000000 192.272 0.000 1009.986 0.000 types.py:612(fromInternal) 100000000 176.140 0.000 176.140 0.000 types.py:88(fromInternal) 20000000 156.832 0.000 328.093 0.000 types.py:1471(_create_row) 14000 107.206 0.008 1237.917 0.088 {built-in method loads} 20000000 80.176 0.000 1090.162 0.000 types.py:1468(<lambda>) ``` After ``` 210274584 function calls (200272456 primitive calls) in 1035.974 seconds Ordered by: internal time, cumulative time ncalls tottime percall cumtime percall filename:lineno(function) 30000000 215.845 0.000 698.748 0.000 types.py:612(fromInternal) 20000000 165.042 0.000 351.572 0.000 types.py:1471(_create_row) 14000 116.834 0.008 946.791 0.068 {built-in method loads} 20000000 87.326 0.000 786.073 0.000 types.py:1468(<lambda>) 20000000 85.477 0.000 134.607 0.000 types.py:1519(__new__) 10000000 65.777 0.000 126.712 0.000 types.py:619(<listcomp>) ``` Main difference is types.py:619(<listcomp>) and types.py:88(fromInternal) (which is removed in After) The number of function calls is 100 million less. And performance is 20% better. Benchmark (worst case scenario.) Test ``` df = spark.range(1000000).selectExpr("current_timestamp as id0", "current_timestamp as id1", "current_timestamp as id2", "current_timestamp as id3", "current_timestamp as id4", "current_timestamp as id5", "current_timestamp as id6", "current_timestamp as id7", "current_timestamp as id8", "current_timestamp as id9").cache() df.count() df.rdd.map(lambda x: x).count() ``` Before ``` 31166064 function calls (31163984 primitive calls) in 150.882 seconds ``` After ``` 31166064 function calls (31163984 primitive calls) in 153.220 seconds ``` IMPORTANT: The benchmark was done on top of https://github.com/apache/spark/pull/19246. Without https://github.com/apache/spark/pull/19246 the performance improvement will be even greater. ## How was this patch tested? Existing tests. Performance benchmark. Author: Maciej Bryński <maciek-github@brynski.pl> Closes #19249 from maver1ck/spark_22032.	2017-09-18 02:34:44 +09:00
goldmedal	a28728a9af	[SPARK-21513][SQL][FOLLOWUP] Allow UDF to_json support converting MapType to json for PySpark and SparkR ## What changes were proposed in this pull request? In previous work SPARK-21513, we has allowed `MapType` and `ArrayType` of `MapType`s convert to a json string but only for Scala API. In this follow-up PR, we will make SparkSQL support it for PySpark and SparkR, too. We also fix some little bugs and comments of the previous work in this follow-up PR. ### For PySpark ``` >>> data = [(1, {"name": "Alice"})] >>> df = spark.createDataFrame(data, ("key", "value")) >>> df.select(to_json(df.value).alias("json")).collect() [Row(json=u'{"name":"Alice")'] >>> data = [(1, [{"name": "Alice"}, {"name": "Bob"}])] >>> df = spark.createDataFrame(data, ("key", "value")) >>> df.select(to_json(df.value).alias("json")).collect() [Row(json=u'[{"name":"Alice"},{"name":"Bob"}]')] ``` ### For SparkR ``` # Converts a map into a JSON object df2 <- sql("SELECT map('name', 'Bob')) as people") df2 <- mutate(df2, people_json = to_json(df2$people)) # Converts an array of maps into a JSON array df2 <- sql("SELECT array(map('name', 'Bob'), map('name', 'Alice')) as people") df2 <- mutate(df2, people_json = to_json(df2$people)) ``` ## How was this patch tested? Add unit test cases. cc viirya HyukjinKwon Author: goldmedal <liugs963@gmail.com> Closes #19223 from goldmedal/SPARK-21513-fp-PySaprkAndSparkR.	2017-09-15 11:53:10 +09:00
Yanbo Liang	c76153cc7d	[SPARK-18608][ML][FOLLOWUP] Fix double caching for PySpark OneVsRest. ## What changes were proposed in this pull request? #19197 fixed double caching for MLlib algorithms, but missed PySpark ```OneVsRest```, this PR fixed it. ## How was this patch tested? Existing tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #19220 from yanboliang/SPARK-18608.	2017-09-14 14:09:44 +08:00
Ming Jiang	8d8641f122	[SPARK-21854] Added LogisticRegressionTrainingSummary for MultinomialLogisticRegression in Python API ## What changes were proposed in this pull request? Added LogisticRegressionTrainingSummary for MultinomialLogisticRegression in Python API ## How was this patch tested? Added unit test Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Ming Jiang <mjiang@fanatics.com> Author: Ming Jiang <jmwdpk@gmail.com> Author: jmwdpk <jmwdpk@gmail.com> Closes #19185 from jmwdpk/SPARK-21854.	2017-09-14 13:53:28 +08:00
Sean Owen	4fbf748bf8	[SPARK-21893][BUILD][STREAMING][WIP] Put Kafka 0.8 behind a profile ## What changes were proposed in this pull request? Put Kafka 0.8 support behind a kafka-0-8 profile. ## How was this patch tested? Existing tests, but, until PR builder and Jenkins configs are updated the effect here is to not build or test Kafka 0.8 support at all. Author: Sean Owen <sowen@cloudera.com> Closes #19134 from srowen/SPARK-21893.	2017-09-13 10:10:40 +01:00
Ajay Saini	720c94fe77	[SPARK-21027][ML][PYTHON] Added tunable parallelism to one vs. rest in both Scala mllib and Pyspark # What changes were proposed in this pull request? Added tunable parallelism to the pyspark implementation of one vs. rest classification. Added a parallelism parameter to the Scala implementation of one vs. rest along with functionality for using the parameter to tune the level of parallelism. I take this PR #18281 over because the original author is busy but we need merge this PR soon. After this been merged, we can close #18281 . ## How was this patch tested? Test suite added. Author: Ajay Saini <ajays725@gmail.com> Author: WeichenXu <weichen.xu@databricks.com> Closes #19110 from WeichenXu123/spark-21027.	2017-09-12 10:02:27 -07:00
Chunsheng Ji	4bab8f5996	[SPARK-21856] Add probability and rawPrediction to MLPC for Python Probability and rawPrediction has been added to MultilayerPerceptronClassifier for Python Add unit test. Author: Chunsheng Ji <chunsheng.ji@gmail.com> Closes #19172 from chunshengji/SPARK-21856.	2017-09-11 16:52:48 +08:00
Peter Szalai	520d92a191	[SPARK-20098][PYSPARK] dataType's typeName fix ## What changes were proposed in this pull request? `typeName` classmethod has been fixed by using type -> typeName map. ## How was this patch tested? local build Author: Peter Szalai <szalaipeti.vagyok@gmail.com> Closes #17435 from szalai1/datatype-gettype-fix.	2017-09-10 17:47:45 +09:00
Yanbo Liang	e4d8f9a36a	[MINOR][SQL] Correct DataFrame doc. ## What changes were proposed in this pull request? Correct DataFrame doc. ## How was this patch tested? Only doc change, no tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #19173 from yanboliang/df-doc.	2017-09-09 09:25:12 -07:00
Xin Ren	31c74fec24	[SPARK-19866][ML][PYSPARK] Add local version of Word2Vec findSynonyms for spark.ml: Python API https://issues.apache.org/jira/browse/SPARK-19866 ## What changes were proposed in this pull request? Add Python API for findSynonymsArray matching Scala API. ## How was this patch tested? Manual test `./python/run-tests --python-executables=python2.7 --modules=pyspark-ml` Author: Xin Ren <iamshrek@126.com> Author: Xin Ren <renxin.ubc@gmail.com> Author: Xin Ren <keypointt@users.noreply.github.com> Closes #17451 from keypointt/SPARK-19866.	2017-09-08 12:09:00 -07:00
hyukjinkwon	8598d03a00	[SPARK-15243][ML][SQL][PYTHON] Add missing support for unicode in Param methods & functions in dataframe ## What changes were proposed in this pull request? This PR proposes to support unicodes in Param methods in ML, other missed functions in DataFrame. For example, this causes a `ValueError` in Python 2.x when param is a unicode string: ```python >>> from pyspark.ml.classification import LogisticRegression >>> lr = LogisticRegression() >>> lr.hasParam("threshold") True >>> lr.hasParam(u"threshold") Traceback (most recent call last): ... raise TypeError("hasParam(): paramName must be a string") TypeError: hasParam(): paramName must be a string ``` This PR is based on https://github.com/apache/spark/pull/13036 ## How was this patch tested? Unit tests in `python/pyspark/ml/tests.py` and `python/pyspark/sql/tests.py`. Author: hyukjinkwon <gurwls223@gmail.com> Author: sethah <seth.hendrickson16@gmail.com> Closes #17096 from HyukjinKwon/SPARK-15243.	2017-09-08 11:57:33 -07:00
Takuya UESHIN	57bc1e9eb4	[SPARK-21950][SQL][PYTHON][TEST] pyspark.sql.tests.SQLTests2 should stop SparkContext. ## What changes were proposed in this pull request? `pyspark.sql.tests.SQLTests2` doesn't stop newly created spark context in the test and it might affect the following tests. This pr makes `pyspark.sql.tests.SQLTests2` stop `SparkContext`. ## How was this patch tested? Existing tests. Author: Takuya UESHIN <ueshin@databricks.com> Closes #19158 from ueshin/issues/SPARK-21950.	2017-09-08 14:26:07 +09:00
hyukjinkwon	07fd68a29f	[SPARK-21897][PYTHON][R] Add unionByName API to DataFrame in Python and R ## What changes were proposed in this pull request? This PR proposes to add a wrapper for `unionByName` API to R and Python as well. Python ```python df1 = spark.createDataFrame([[1, 2, 3]], ["col0", "col1", "col2"]) df2 = spark.createDataFrame([[4, 5, 6]], ["col1", "col2", "col0"]) df1.unionByName(df2).show() ``` ``` +----+----+----+ \|col0\|col1\|col3\| +----+----+----+ \| 1\| 2\| 3\| \| 6\| 4\| 5\| +----+----+----+ ``` R ```R df1 <- select(createDataFrame(mtcars), "carb", "am", "gear") df2 <- select(createDataFrame(mtcars), "am", "gear", "carb") head(unionByName(limit(df1, 2), limit(df2, 2))) ``` ``` carb am gear 1 4 1 4 2 4 1 4 3 4 1 4 4 4 1 4 ``` ## How was this patch tested? Doctests for Python and unit test added in `test_sparkSQL.R` for R. Author: hyukjinkwon <gurwls223@gmail.com> Closes #19105 from HyukjinKwon/unionByName-r-python.	2017-09-03 21:03:21 +09:00
hyukjinkwon	648a8626b8	[SPARK-21789][PYTHON] Remove obsolete codes for parsing abstract schema strings ## What changes were proposed in this pull request? This PR proposes to remove private functions that look not used in the main codes, `_split_schema_abstract`, `_parse_field_abstract`, `_parse_schema_abstract` and `_infer_schema_type`. ## How was this patch tested? Existing tests. Author: hyukjinkwon <gurwls223@gmail.com> Closes #18647 from HyukjinKwon/remove-abstract.	2017-09-01 13:09:24 +09:00
hyukjinkwon	5cd8ea99f0	[SPARK-21779][PYTHON] Simpler DataFrame.sample API in Python ## What changes were proposed in this pull request? This PR make `DataFrame.sample(...)` can omit `withReplacement` defaulting `False`, consistently with equivalent Scala / Java API. In short, the following examples are allowed: ```python >>> df = spark.range(10) >>> df.sample(0.5).count() 7 >>> df.sample(fraction=0.5).count() 3 >>> df.sample(0.5, seed=42).count() 5 >>> df.sample(fraction=0.5, seed=42).count() 5 ``` In addition, this PR also adds some type checking logics as below: ```python >>> df = spark.range(10) >>> df.sample().count() ... TypeError: withReplacement (optional), fraction (required) and seed (optional) should be a bool, float and number; however, got []. >>> df.sample(True).count() ... TypeError: withReplacement (optional), fraction (required) and seed (optional) should be a bool, float and number; however, got [<type 'bool'>]. >>> df.sample(42).count() ... TypeError: withReplacement (optional), fraction (required) and seed (optional) should be a bool, float and number; however, got [<type 'int'>]. >>> df.sample(fraction=False, seed="a").count() ... TypeError: withReplacement (optional), fraction (required) and seed (optional) should be a bool, float and number; however, got [<type 'bool'>, <type 'str'>]. >>> df.sample(seed=[1]).count() ... TypeError: withReplacement (optional), fraction (required) and seed (optional) should be a bool, float and number; however, got [<type 'list'>]. >>> df.sample(withReplacement="a", fraction=0.5, seed=1) ... TypeError: withReplacement (optional), fraction (required) and seed (optional) should be a bool, float and number; however, got [<type 'str'>, <type 'float'>, <type 'int'>]. ``` ## How was this patch tested? Manually tested, unit tests added in doc tests and manually checked the built documentation for Python. Author: hyukjinkwon <gurwls223@gmail.com> Closes #18999 from HyukjinKwon/SPARK-21779.	2017-09-01 13:01:23 +09:00
Liang-Chi Hsieh	ecf437a648	[SPARK-21534][SQL][PYSPARK] PickleException when creating dataframe from python row with empty bytearray ## What changes were proposed in this pull request? `PickleException` is thrown when creating dataframe from python row with empty bytearray spark.createDataFrame(spark.sql("select unhex('') as xx").rdd.map(lambda x: {"abc": x.xx})).show() net.razorvine.pickle.PickleException: invalid pickle data for bytearray; expected 1 or 2 args, got 0 at net.razorvine.pickle.objects.ByteArrayConstructor.construct(ByteArrayConstructor.java ... `ByteArrayConstructor` doesn't deal with empty byte array pickled by Python3. ## How was this patch tested? Added test. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #19085 from viirya/SPARK-21534.	2017-08-31 12:55:38 +09:00
Dongjoon Hyun	d8f4540863	[SPARK-21839][SQL] Support SQL config for ORC compression ## What changes were proposed in this pull request? This PR aims to support `spark.sql.orc.compression.codec` like Parquet's `spark.sql.parquet.compression.codec`. Users can use SQLConf to control ORC compression, too. ## How was this patch tested? Pass the Jenkins with new and updated test cases. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #19055 from dongjoon-hyun/SPARK-21839.	2017-08-31 08:16:58 +09:00
vinodkc	51620e288b	[SPARK-21756][SQL] Add JSON option to allow unquoted control characters ## What changes were proposed in this pull request? This patch adds allowUnquotedControlChars option in JSON data source to allow JSON Strings to contain unquoted control characters (ASCII characters with value less than 32, including tab and line feed characters) ## How was this patch tested? Add new test cases Author: vinodkc <vinod.kc.in@gmail.com> Closes #19008 from vinodkc/br_fix_SPARK-21756.	2017-08-25 10:18:03 -07:00

1 2 3 4 5 ...

1676 commits