ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Felix Cheung	823518c2b5	[SPARKR] add csv tests ## What changes were proposed in this pull request? Add unit tests for csv data for SPARKR ## How was this patch tested? unit tests Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #13904 from felixcheung/rcsv.	2016-06-28 17:08:28 -07:00
Burak Yavuz	5545b79109	[MINOR][DOCS][STRUCTURED STREAMING] Minor doc fixes around `DataFrameWriter` and `DataStreamWriter` ## What changes were proposed in this pull request? Fixes a couple old references to `DataFrameWriter.startStream` to `DataStreamWriter.start Author: Burak Yavuz <brkyvz@gmail.com> Closes #13952 from brkyvz/minor-doc-fix.	2016-06-28 17:02:16 -07:00
James Thomas	3554713a16	[SPARK-16114][SQL] structured streaming network word count examples ## What changes were proposed in this pull request? Network word count example for structured streaming ## How was this patch tested? Run locally Author: James Thomas <jamesjoethomas@gmail.com> Author: James Thomas <jamesthomas@Jamess-MacBook-Pro.local> Closes #13816 from jjthomas/master.	2016-06-28 16:12:48 -07:00
Wenchen Fan	8a977b0654	[SPARK-16100][SQL] fix bug when use Map as the buffer type of Aggregator ## What changes were proposed in this pull request? The root cause is in `MapObjects`. Its parameter `loopVar` is not declared as child, but sometimes can be same with `lambdaFunction`(e.g. the function that takes `loopVar` and produces `lambdaFunction` may be `identity`), which is a child. This brings trouble when call `withNewChildren`, it may mistakenly treat `loopVar` as a child and cause `IndexOutOfBoundsException: 0` later. This PR fixes this bug by simply pulling out the paremters from `LambdaVariable` and pass them to `MapObjects` directly. ## How was this patch tested? new test in `DatasetAggregatorSuite` Author: Wenchen Fan <wenchen@databricks.com> Closes #13835 from cloud-fan/map-objects.	2016-06-29 06:39:28 +08:00
gatorsmile	25520e9762	[SPARK-16236][SQL] Add Path Option back to Load API in DataFrameReader #### What changes were proposed in this pull request? koertkuipers identified the PR https://github.com/apache/spark/pull/13727/ changed the behavior of `load` API. After the change, the `load` API does not add the value of `path` into the `options`. Thank you! This PR is to add the option `path` back to `load()` API in `DataFrameReader`, if and only if users specify one and only one `path` in the `load` API. For example, users can see the `path` option after the following API call, ```Scala spark.read .format("parquet") .load("/test") ``` #### How was this patch tested? Added test cases. Author: gatorsmile <gatorsmile@gmail.com> Closes #13933 from gatorsmile/optionPath.	2016-06-28 15:32:45 -07:00
Davies Liu	35438fb0ad	[SPARK-16175] [PYSPARK] handle None for UDT ## What changes were proposed in this pull request? Scala UDT will bypass all the null and will not pass them into serialize() and deserialize() of UDT, this PR update the Python UDT to do this as well. ## How was this patch tested? Added tests. Author: Davies Liu <davies@databricks.com> Closes #13878 from davies/udt_null.	2016-06-28 14:09:38 -07:00
Davies Liu	1aad8c6e59	[SPARK-16259][PYSPARK] cleanup options in DataFrame read/write API ## What changes were proposed in this pull request? There are some duplicated code for options in DataFrame reader/writer API, this PR clean them up, it also fix a bug for `escapeQuotes` of csv(). ## How was this patch tested? Existing tests. Author: Davies Liu <davies@databricks.com> Closes #13948 from davies/csv_options.	2016-06-28 13:43:59 -07:00
Tom Magrino	ae14f36235	[SPARK-16148][SCHEDULER] Allow for underscores in TaskLocation in the Executor ID ## What changes were proposed in this pull request? Previously, the TaskLocation implementation would not allow for executor ids which include underscores. This tweaks the string split used to get the hostname and executor id, allowing for underscores in the executor id. This addresses the JIRA found here: https://issues.apache.org/jira/browse/SPARK-16148 This is moved over from a previous PR against branch-1.6: https://github.com/apache/spark/pull/13857 ## How was this patch tested? Ran existing unit tests for core and streaming. Manually ran a simple streaming job with an executor whose id contained underscores and confirmed that the job ran successfully. This is my original work and I license the work to the project under the project's open source license. Author: Tom Magrino <tmagrino@fb.com> Closes #13858 from tmagrino/fixtasklocation.	2016-06-28 13:36:41 -07:00
WeichenXu	d59ba8e307	[MINOR][SPARKR] update sparkR DataFrame.R comment ## What changes were proposed in this pull request? update sparkR DataFrame.R comment SQLContext ==> SparkSession ## How was this patch tested? N/A Author: WeichenXu <WeichenXu123@outlook.com> Closes #13946 from WeichenXu123/sparkR_comment_update_sparkSession.	2016-06-28 12:12:20 -07:00
Yanbo Liang	26252f7064	[SPARK-15643][DOC][ML] Update spark.ml and spark.mllib migration guide from 1.6 to 2.0 ## What changes were proposed in this pull request? Update ```spark.ml``` and ```spark.mllib``` migration guide from 1.6 to 2.0. ## How was this patch tested? Docs update, no tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #13378 from yanboliang/spark-13448.	2016-06-28 11:54:25 -07:00
Wenchen Fan	1f2776df6e	[SPARK-16181][SQL] outer join with isNull filter may return wrong result ## What changes were proposed in this pull request? The root cause is: the output attributes of outer join are derived from its children, while they are actually different attributes(outer join can return null). We have already added some special logic to handle it, e.g. `PushPredicateThroughJoin` won't push down predicates through outer join side, `FixNullability`. This PR adds one more special logic in `FoldablePropagation`. ## How was this patch tested? new test in `DataFrameSuite` Author: Wenchen Fan <wenchen@databricks.com> Closes #13884 from cloud-fan/bug.	2016-06-28 10:26:01 -07:00
Yin Huai	0923c4f567	[SPARK-16224] [SQL] [PYSPARK] SparkSession builder's configs need to be set to the existing Scala SparkContext's SparkConf ## What changes were proposed in this pull request? When we create a SparkSession at the Python side, it is possible that a SparkContext has been created. For this case, we need to set configs of the SparkSession builder to the Scala SparkContext's SparkConf (we need to do so because conf changes on a active Python SparkContext will not be propagated to the JVM side). Otherwise, we may create a wrong SparkSession (e.g. Hive support is not enabled even if enableHiveSupport is called). ## How was this patch tested? New tests and manual tests. Author: Yin Huai <yhuai@databricks.com> Closes #13931 from yhuai/SPARK-16224.	2016-06-28 07:54:44 -07:00
Yanbo Liang	e158478a9f	[SPARK-16242][MLLIB][PYSPARK] Conversion between old/new matrix columns in a DataFrame (Python) ## What changes were proposed in this pull request? This PR implements python wrappers for #13888 to convert old/new matrix columns in a DataFrame. ## How was this patch tested? Doctest in python. Author: Yanbo Liang <ybliang8@gmail.com> Closes #13935 from yanboliang/spark-16242.	2016-06-28 06:28:22 -07:00
Prashant Sharma	f6b497fcdd	[SPARK-16128][SQL] Allow setting length of characters to be truncated to, in Dataset.show function. ## What changes were proposed in this pull request? Allowing truncate to a specific number of character is convenient at times, especially while operating from the REPL. Sometimes those last few characters make all the difference, and showing everything brings in whole lot of noise. ## How was this patch tested? Existing tests. + 1 new test in DataFrameSuite. For SparkR and pyspark, existing tests and manual testing. Author: Prashant Sharma <prashsh1@in.ibm.com> Author: Prashant Sharma <prashant@apache.org> Closes #13839 from ScrapCodes/add_truncateTo_DF.show.	2016-06-28 17:11:06 +05:30
gatorsmile	4cbf611c1d	[SPARK-16202][SQL][DOC] Correct The Description of CreatableRelationProvider's createRelation #### What changes were proposed in this pull request? The API description of `createRelation` in `CreatableRelationProvider` is misleading. The current description only expects users to return the relation. ```Scala trait CreatableRelationProvider { def createRelation( sqlContext: SQLContext, mode: SaveMode, parameters: Map[String, String], data: DataFrame): BaseRelation } ``` However, the major goal of this API should also include saving the `DataFrame`. Since this API is critical for Data Source API developers, this PR is to correct the description. #### How was this patch tested? N/A Author: gatorsmile <gatorsmile@gmail.com> Closes #13903 from gatorsmile/readUnderscoreFiles.	2016-06-27 23:12:17 -07:00
Yin Huai	dd6b7dbe70	[SPARK-15863][SQL][DOC][FOLLOW-UP] Update SQL programming guide. ## What changes were proposed in this pull request? This PR makes several updates to SQL programming guide. Author: Yin Huai <yhuai@databricks.com> Closes #13938 from yhuai/doc.	2016-06-27 22:44:08 -07:00
Dongjoon Hyun	a0da854fb3	[SPARK-16221][SQL] Redirect Parquet JUL logger via SLF4J for WRITE operations ## What changes were proposed in this pull request? [SPARK-8118](https://github.com/apache/spark/pull/8196) implements redirecting Parquet JUL logger via SLF4J, but it is currently applied only when READ operations occurs. If users use only WRITE operations, there occurs many Parquet logs. This PR makes the redirection work on WRITE operations, too. Before ```scala scala> spark.range(10).write.format("parquet").mode("overwrite").save("/tmp/p") SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder". SLF4J: Defaulting to no-operation (NOP) logger implementation SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details. Jun 26, 2016 9:04:38 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: Compression: SNAPPY ............ about 70 lines Parquet Log ............. scala> spark.range(10).write.format("parquet").mode("overwrite").save("/tmp/p") ............ about 70 lines Parquet Log ............. ``` After ```scala scala> spark.range(10).write.format("parquet").mode("overwrite").save("/tmp/p") SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder". SLF4J: Defaulting to no-operation (NOP) logger implementation SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details. scala> spark.range(10).write.format("parquet").mode("overwrite").save("/tmp/p") ``` This PR also fixes some typos. ## How was this patch tested? Manual. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #13918 from dongjoon-hyun/SPARK-16221.	2016-06-28 13:01:18 +08:00
Dongjoon Hyun	50fdd866b5	[SPARK-16111][SQL][DOC] Hide SparkOrcNewRecordReader in API docs ## What changes were proposed in this pull request? Currently, Spark Scala/Java API documents shows org.apache.hadoop.hive.ql.io.orc package at the top. http://spark.apache.org/docs/2.0.0-preview/api/scala/index.html#org.apache.spark.package http://spark.apache.org/docs/2.0.0-preview/api/java/index.html This PR hides `SparkOrcNewRecordReader` from API docs. ## How was this patch tested? Manual. (`build/sbt unidoc`). The following is the screenshot after this PR. Scala API doc ![Scala API doc](https://app.box.com/representation/file_version_75673952621/image_2048/1.png?shared_name=2mdqydygs8le6q9x00356898662zjwz6) Java API doc ![Java API doc](https://app.box.com/representation/file_version_75673951725/image_2048/1.png?shared_name=iv23eeqy3avvkqz203v9ygfaqeyml85j) Author: Dongjoon Hyun <dongjoon@apache.org> Closes #13914 from dongjoon-hyun/SPARK-16111.	2016-06-27 21:58:16 -07:00
Junyang Qian	1b7fc58172	[SPARK-16143][R] group AFT survival regression methods docs in a single Rd ## What changes were proposed in this pull request? This PR groups `spark.survreg`, `summary(AFT)`, `predict(AFT)`, `write.ml(AFT)` for survival regression into a single Rd. ## How was this patch tested? Manually checked generated HTML doc. See attached screenshots. ![screen shot 2016-06-27 at 10 28 20 am](https://cloud.githubusercontent.com/assets/15318264/16392008/a14cf472-3c5e-11e6-9ce5-490ed1a52249.png) ![screen shot 2016-06-27 at 10 28 35 am](https://cloud.githubusercontent.com/assets/15318264/16392009/a14e333c-3c5e-11e6-8bd7-c2e9ba71f8e2.png) Author: Junyang Qian <junyangq@databricks.com> Closes #13927 from junyangq/SPARK-16143.	2016-06-27 20:32:27 -07:00
Herman van Hovell	02a029df43	[SPARK-16220][SQL] Add scope to show functions ## What changes were proposed in this pull request? Spark currently shows all functions when issue a `SHOW FUNCTIONS` command. This PR refines the `SHOW FUNCTIONS` command by allowing users to select all functions, user defined function or system functions. The following syntax can be used: ALL (default) ```SHOW FUNCTIONS``` ```SHOW ALL FUNCTIONS``` SYSTEM ```SHOW SYSTEM FUNCTIONS``` USER ```SHOW USER FUNCTIONS``` ## How was this patch tested? Updated tests and added tests to the DDLSuite Author: Herman van Hovell <hvanhovell@databricks.com> Closes #13929 from hvanhovell/SPARK-16220.	2016-06-27 16:57:34 -07:00
Imran Rashid	c15b552dd5	[SPARK-16106][CORE] TaskSchedulerImpl should properly track executors added to existing hosts ## What changes were proposed in this pull request? TaskSchedulerImpl used to only set `newExecAvailable` when a new host was added, not when a new executor was added to an existing host. It also didn't update some internal state tracking live executors until a task was scheduled on the executor. This patch changes it to properly update as soon as it knows about a new executor. ## How was this patch tested? added a unit test, ran everything via jenkins. Author: Imran Rashid <irashid@cloudera.com> Closes #13826 from squito/SPARK-16106_executorByHosts.	2016-06-27 16:38:03 -05:00
Imran Rashid	282158914d	[SPARK-16136][CORE] Fix flaky TaskManagerSuite ## What changes were proposed in this pull request? TaskManagerSuite "Kill other task attempts when one attempt belonging to the same task succeeds" was flaky. When checking whether a task is speculatable, at least one millisecond must pass since the task was submitted. Use a manual clock to avoid the problem. I noticed these tests were leaving lots of threads lying around as well (which prevented me from running the test repeatedly), so I fixed that too. ## How was this patch tested? Ran the test 1k times on my laptop, passed every time (it failed about 20% of the time before this). Author: Imran Rashid <irashid@cloudera.com> Closes #13848 from squito/fix_flaky_taskmanagersuite.	2016-06-27 16:28:59 -05:00
Bryan Cutler	1aa191e58e	[SPARK-16231][PYSPARK][ML][EXAMPLES] dataframe_example.py fails to convert ML style vectors ## What changes were proposed in this pull request? Need to convert ML Vectors to the old MLlib style before doing Statistics.colStats operations on the DataFrame ## How was this patch tested? Ran example, local tests Author: Bryan Cutler <cutlerb@gmail.com> Closes #13928 from BryanCutler/pyspark-ml-example-vector-conv-SPARK-16231.	2016-06-27 12:58:39 -07:00
Yuhao Yang	c17b1abff8	[SPARK-16187][ML] Implement util method for ML Matrix conversion in scala/java ## What changes were proposed in this pull request? jira: https://issues.apache.org/jira/browse/SPARK-16187 This is to provide conversion utils between old/new vector columns in a DataFrame. So users can use it to migrate their datasets and pipelines manually. ## How was this patch tested? java and scala ut Author: Yuhao Yang <yuhao.yang@intel.com> Closes #13888 from hhbyyh/matComp.	2016-06-27 12:27:39 -07:00
Bill Chambers	c48c8ebc0a	[SPARK-16220][SQL] Revert Change to Bring Back SHOW FUNCTIONS Functionality ## What changes were proposed in this pull request? - Fix tests regarding show functions functionality - Revert `catalog.ListFunctions` and `SHOW FUNCTIONS` to return to `Spark 1.X` functionality. Cherry picked changes from this PR: https://github.com/apache/spark/pull/13413/files ## How was this patch tested? Unit tests. Author: Bill Chambers <bill@databricks.com> Author: Bill Chambers <wchambers@ischool.berkeley.edu> Closes #13916 from anabranch/master.	2016-06-27 11:50:34 -07:00
Takeshi YAMAMURO	3e4e868c85	[SPARK-16135][SQL] Remove hashCode and euqals in ArrayBasedMapData ## What changes were proposed in this pull request? This pr is to remove `hashCode` and `equals` in `ArrayBasedMapData` because the type cannot be used as join keys, grouping keys, or in equality tests. ## How was this patch tested? Add a new test suite `MapDataSuite` for comparison tests. Author: Takeshi YAMAMURO <linguin.m.s@gmail.com> Closes #13847 from maropu/UnsafeMapTest.	2016-06-27 21:45:22 +08:00
Dongjoon Hyun	11f420b4bb	[SPARK-10591][SQL][TEST] Add a testcase to ensure if `checkAnswer` handles map correctly ## What changes were proposed in this pull request? This PR adds a testcase to ensure if `checkAnswer` handles Map type correctly. ## How was this patch tested? Pass the jenkins tests. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #13913 from dongjoon-hyun/SPARK-10591.	2016-06-27 19:04:50 +08:00
jerryshao	52d4fe0579	[MINOR][CORE] Fix display wrong free memory size in the log ## What changes were proposed in this pull request? Free memory size displayed in the log is wrong (used memory), fix to make it correct. ## How was this patch tested? N/A Author: jerryshao <sshao@hortonworks.com> Closes #13804 from jerryshao/memory-log-fix.	2016-06-27 09:23:58 +01:00
杨浩	b452026324	[SPARK-16214][EXAMPLES] fix the denominator of SparkPi ## What changes were proposed in this pull request? reduce the denominator of SparkPi by 1 ## How was this patch tested? integration tests Author: 杨浩 <yanghaogn@163.com> Closes #13910 from yanghaogn/patch-1.	2016-06-27 08:31:52 +01:00
Felix Cheung	30b182bcc0	[SPARK-16184][SPARKR] conf API for SparkSession ## What changes were proposed in this pull request? Add `conf` method to get Runtime Config from SparkSession ## How was this patch tested? unit tests, manual tests This is how it works in sparkR shell: ``` SparkSession available as 'spark'. > conf() $hive.metastore.warehouse.dir [1] "file:/opt/spark-2.0.0-bin-hadoop2.6/R/spark-warehouse" $spark.app.id [1] "local-1466749575523" $spark.app.name [1] "SparkR" $spark.driver.host [1] "10.0.2.1" $spark.driver.port [1] "45629" $spark.executorEnv.LD_LIBRARY_PATH [1] "$LD_LIBRARY_PATH:/usr/lib/R/lib:/usr/lib/x86_64-linux-gnu:/usr/lib/jvm/default-java/jre/lib/amd64/server" $spark.executor.id [1] "driver" $spark.home [1] "/opt/spark-2.0.0-bin-hadoop2.6" $spark.master [1] "local[]" $spark.sql.catalogImplementation [1] "hive" $spark.submit.deployMode [1] "client" > conf("spark.master") $spark.master [1] "local[]" ``` Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #13885 from felixcheung/rconf.	2016-06-26 13:10:43 -07:00
Sean Owen	e87741589a	[SPARK-16193][TESTS] Address flaky ExternalAppendOnlyMapSuite spilling tests ## What changes were proposed in this pull request? Make spill tests wait until job has completed before returning the number of stages that spilled ## How was this patch tested? Existing Jenkins tests. Author: Sean Owen <sowen@cloudera.com> Closes #13896 from srowen/SPARK-16193.	2016-06-25 12:14:14 +01:00
Alex Bozarth	3ee9695d1f	[SPARK-1301][WEB UI] Added anchor links to Accumulators and Tasks on StagePage ## What changes were proposed in this pull request? Sometimes the "Aggregated Metrics by Executor" table on the Stage page can get very long so actor links to the Accumulators and Tasks tables below it have been added to the summary at the top of the page. This has been done in the same way as the Jobs and Stages pages. Note: the Accumulators link only displays when the table exists. ## How was this patch tested? Manually Tested and dev/run-tests ![justtasks](https://cloud.githubusercontent.com/assets/13952758/15165269/6e8efe8c-16c9-11e6-9784-cffe966fdcf0.png) ![withaccumulators](https://cloud.githubusercontent.com/assets/13952758/15165270/7019ec9e-16c9-11e6-8649-db69ed7a317d.png) Author: Alex Bozarth <ajbozart@us.ibm.com> Closes #13037 from ajbozarth/spark1301.	2016-06-25 09:27:22 +01:00
Sital Kedia	bf665a9586	[SPARK-15958] Make initial buffer size for the Sorter configurable ## What changes were proposed in this pull request? Currently the initial buffer size in the sorter is hard coded inside the code and is too small for large workload. As a result, the sorter spends significant time expanding the buffer size and copying the data. It would be useful to have it configurable. ## How was this patch tested? Tested by running a job on the cluster. Author: Sital Kedia <skedia@fb.com> Closes #13699 from sitalkedia/config_sort_buffer_upstream.	2016-06-25 09:13:39 +01:00
José Antonio	a3c7b4187b	[MLLIB] org.apache.spark.mllib.util.SVMDataGenerator generates ArrayIndexOutOfBoundsException. I have found the bug and tested the solution. ## What changes were proposed in this pull request? Just adjust the size of an array in line 58 so it does not cause an ArrayOutOfBoundsException in line 66. ## How was this patch tested? Manual tests. I have recompiled the entire project with the fix, it has been built successfully and I have run the code, also with good results. line 66: val yD = blas.ddot(trueWeights.length, x, 1, trueWeights, 1) + rnd.nextGaussian() * 0.1 crashes because trueWeights has length "nfeatures + 1" while "x" has length "features", and they should have the same length. To fix this just make trueWeights be the same length as x. I have recompiled the project with the change and it is working now: [spark-1.6.1]$ spark-submit --master local[*] --class org.apache.spark.mllib.util.SVMDataGenerator mllib/target/spark-mllib_2.11-1.6.1.jar local /home/user/test And it generates the data successfully now in the specified folder. Author: José Antonio <joseanmunoz@gmail.com> Closes #13895 from j4munoz/patch-2.	2016-06-25 09:11:25 +01:00
Dongjoon Hyun	a7d29499dc	[SPARK-16186] [SQL] Support partition batch pruning with `IN` predicate in InMemoryTableScanExec ## What changes were proposed in this pull request? One of the most frequent usage patterns for Spark SQL is using cached tables. This PR improves `InMemoryTableScanExec` to handle `IN` predicate efficiently by pruning partition batches. Of course, the performance improvement varies over the queries and the datasets. But, for the following simple query, the query duration in Spark UI goes from 9 seconds to 50~90ms. It's about over 100 times faster. Before ```scala $ bin/spark-shell --driver-memory 6G scala> val df = spark.range(2000000000) scala> df.createOrReplaceTempView("t") scala> spark.catalog.cacheTable("t") scala> sql("select id from t where id = 1").collect() // About 2 mins scala> sql("select id from t where id = 1").collect() // less than 90ms scala> sql("select id from t where id in (1,2,3)").collect() // 9 seconds ``` After ```scala scala> sql("select id from t where id in (1,2,3)").collect() // less than 90ms ``` This PR has impacts over 35 queries of TPC-DS if the tables are cached. Note that this optimization is applied for `IN`. To apply `IN` predicate having more than 10 items, `spark.sql.optimizer.inSetConversionThreshold` option should be increased. ## How was this patch tested? Pass the Jenkins tests (including new testcases). Author: Dongjoon Hyun <dongjoon@apache.org> Closes #13887 from dongjoon-hyun/SPARK-16186.	2016-06-24 22:34:31 -07:00
Takeshi YAMAMURO	d2e44d7db8	[SPARK-16192][SQL] Add type checks in CollectSet ## What changes were proposed in this pull request? `CollectSet` cannot have map-typed data because MapTypeData does not implement `equals`. So, this pr is to add type checks in `CheckAnalysis`. ## How was this patch tested? Added tests to check failures when we found map-typed data in `CollectSet`. Author: Takeshi YAMAMURO <linguin.m.s@gmail.com> Closes #13892 from maropu/SPARK-16192.	2016-06-24 21:07:03 -07:00
Dilip Biswal	9053054c7f	[SPARK-16195][SQL] Allow users to specify empty over clause in window expressions through dataset API ## What changes were proposed in this pull request? Allow to specify empty over clause in window expressions through dataset API In SQL, its allowed to specify an empty OVER clause in the window expression. ```SQL select area, sum(product) over () as c from windowData where product > 3 group by area, product having avg(month) > 0 order by avg(month), product ``` In this case the analytic function sum is presented based on all the rows of the result set Currently its not allowed through dataset API and is handled in this PR. ## How was this patch tested? Added a new test in DataframeWindowSuite Author: Dilip Biswal <dbiswal@us.ibm.com> Closes #13897 from dilipbiswal/spark-empty-over.	2016-06-24 17:27:33 -07:00
Dongjoon Hyun	e5d0928e24	[SPARK-16173] [SQL] Can't join describe() of DataFrame in Scala 2.10 ## What changes were proposed in this pull request? This PR fixes `DataFrame.describe()` by forcing materialization to make the `Seq` serializable. Currently, `describe()` of DataFrame throws `Task not serializable` Spark exceptions when joining in Scala 2.10. ## How was this patch tested? Manual. (After building with Scala 2.10, test on `bin/spark-shell` and `bin/pyspark`.) Author: Dongjoon Hyun <dongjoon@apache.org> Closes #13900 from dongjoon-hyun/SPARK-16173.	2016-06-24 17:26:39 -07:00
Davies Liu	20768dade2	Revert "[SPARK-16186] [SQL] Support partition batch pruning with `IN` predicate in InMemoryTableScanExec" This reverts commit `a65bcbc27d`.	2016-06-24 17:21:18 -07:00
Dongjoon Hyun	a65bcbc27d	[SPARK-16186] [SQL] Support partition batch pruning with `IN` predicate in InMemoryTableScanExec ## What changes were proposed in this pull request? One of the most frequent usage patterns for Spark SQL is using cached tables. This PR improves `InMemoryTableScanExec` to handle `IN` predicate efficiently by pruning partition batches. Of course, the performance improvement varies over the queries and the datasets. But, for the following simple query, the query duration in Spark UI goes from 9 seconds to 50~90ms. It's about over 100 times faster. Before ```scala $ bin/spark-shell --driver-memory 6G scala> val df = spark.range(2000000000) scala> df.createOrReplaceTempView("t") scala> spark.catalog.cacheTable("t") scala> sql("select id from t where id = 1").collect() // About 2 mins scala> sql("select id from t where id = 1").collect() // less than 90ms scala> sql("select id from t where id in (1,2,3)").collect() // 9 seconds ``` After ```scala scala> sql("select id from t where id in (1,2,3)").collect() // less than 90ms ``` This PR has impacts over 35 queries of TPC-DS if the tables are cached. Note that this optimization is applied for `IN`. To apply `IN` predicate having more than 10 items, `spark.sql.optimizer.inSetConversionThreshold` option should be increased. ## How was this patch tested? Pass the Jenkins tests (including new testcases). Author: Dongjoon Hyun <dongjoon@apache.org> Closes #13887 from dongjoon-hyun/SPARK-16186.	2016-06-24 17:13:13 -07:00
Davies Liu	4435de1bd3	[SPARK-16179][PYSPARK] fix bugs for Python udf in generate ## What changes were proposed in this pull request? This PR fix the bug when Python UDF is used in explode (generator), GenerateExec requires that all the attributes in expressions should be resolvable from children when creating, we should replace the children first, then replace it's expressions. ``` >>> df.select(explode(f(df))).show() Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/home/vlad/dev/spark/python/pyspark/sql/dataframe.py", line 286, in show print(self._jdf.showString(n, truncate)) File "/home/vlad/dev/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 933, in __call__ File "/home/vlad/dev/spark/python/pyspark/sql/utils.py", line 63, in deco return f(a, **kw) File "/home/vlad/dev/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 312, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o52.showString. : org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, tree: Generate explode(<lambda>(_1#0L)), false, false, [col#15L] +- Scan ExistingRDD[_1#0L] at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:50) at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:387) at org.apache.spark.sql.execution.SparkPlan.makeCopy(SparkPlan.scala:69) at org.apache.spark.sql.execution.SparkPlan.makeCopy(SparkPlan.scala:45) at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:177) at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:144) at org.apache.spark.sql.execution.python.ExtractPythonUDFs$.org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract(ExtractPythonUDFs.scala:153) at org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:114) at org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:113) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179) at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:298) at org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:113) at org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:93) at org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:95) at org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:95) at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124) at scala.collection.immutable.List.foldLeft(List.scala:84) at org.apache.spark.sql.execution.QueryExecution.prepareForExecution(QueryExecution.scala:95) at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:85) at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:85) at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2557) at org.apache.spark.sql.Dataset.head(Dataset.scala:1923) at org.apache.spark.sql.Dataset.take(Dataset.scala:2138) at org.apache.spark.sql.Dataset.showString(Dataset.scala:239) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at py4j.Gateway.invoke(Gateway.java:280) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:211) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.reflect.InvocationTargetException at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$makeCopy$1$$anonfun$apply$13.apply(TreeNode.scala:413) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$makeCopy$1$$anonfun$apply$13.apply(TreeNode.scala:413) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$makeCopy$1.apply(TreeNode.scala:412) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$makeCopy$1.apply(TreeNode.scala:387) at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:49) ... 42 more Caused by: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: pythonUDF0#20 at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:50) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:278) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179) at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284) at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:268) at org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:87) at org.apache.spark.sql.execution.GenerateExec.<init>(GenerateExec.scala:63) ... 52 more Caused by: java.lang.RuntimeException: Couldn't find pythonUDF0#20 in [_1#0L] at scala.sys.package$.error(package.scala:27) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1$$anonfun$applyOrElse$1.apply(BoundAttribute.scala:94) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1$$anonfun$applyOrElse$1.apply(BoundAttribute.scala:88) at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:49) ... 67 more ``` ## How was this patch tested? Added regression tests. Author: Davies Liu <davies@databricks.com> Closes #13883 from davies/udf_in_generate.	2016-06-24 15:20:39 -07:00
Reynold Xin	5f8de21606	[SQL][MINOR] Simplify data source predicate filter translation. ## What changes were proposed in this pull request? This is a small patch to rewrite the predicate filter translation in DataSourceStrategy. The original code used excessive functional constructs (e.g. unzip) and was very difficult to understand. ## How was this patch tested? Should be covered by existing tests. Author: Reynold Xin <rxin@databricks.com> Closes #13889 from rxin/simplify-predicate-filter.	2016-06-24 14:44:24 -07:00
Davies Liu	d48935400c	[SPARK-16077] [PYSPARK] catch the exception from pickle.whichmodule() ## What changes were proposed in this pull request? In the case that we don't know which module a object came from, will call pickle.whichmodule() to go throught all the loaded modules to find the object, which could fail because some modules, for example, six, see https://bitbucket.org/gutworth/six/issues/63/importing-six-breaks-pickling We should ignore the exception here, use `__main__` as the module name (it means we can't find the module). ## How was this patch tested? Manual tested. Can't have a unit test for this. Author: Davies Liu <davies@databricks.com> Closes #13788 from davies/whichmodule.	2016-06-24 14:35:34 -07:00
Liwei Lin	a4851ed050	[SPARK-15963][CORE] Catch `TaskKilledException` correctly in Executor.TaskRunner ## The problem Before this change, if either of the following cases happened to a task , the task would be marked as `FAILED` instead of `KILLED`: - the task was killed before it was deserialized - `executor.kill()` marked `taskRunner.killed`, but before calling `task.killed()` the worker thread threw the `TaskKilledException` The reason is, in the `catch` block of the current [Executor.TaskRunner](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L362)'s implementation, we are mistakenly catching: ```scala case _: TaskKilledException \| _: InterruptedException if task.killed => ... ``` the semantics of which is: - (`TaskKilledException` OR `InterruptedException`) AND `task.killed` Then when `TaskKilledException` is thrown but `task.killed` is not marked, we would mark the task as `FAILED` (which should really be `KILLED`). ## What changes were proposed in this pull request? This patch alters the catch condition's semantics from: - (`TaskKilledException` OR `InterruptedException`) AND `task.killed` to - `TaskKilledException` OR (`InterruptedException` AND `task.killed`) so that we can catch `TaskKilledException` correctly and mark the task as `KILLED` correctly. ## How was this patch tested? Added unit test which failed before the change, ran new test 1000 times manually Author: Liwei Lin <lwlin7@gmail.com> Closes #13685 from lw-lin/fix-task-killed.	2016-06-24 10:09:04 -05:00
GayathriMurali	be88383e15	[SPARK-15997][DOC][ML] Update user guide for HashingTF, QuantileVectorizer and CountVectorizer ## What changes were proposed in this pull request? Made changes to HashingTF,QuantileVectorizer and CountVectorizer Author: GayathriMurali <gayathri.m@intel.com> Closes #13745 from GayathriMurali/SPARK-15997.	2016-06-24 13:25:40 +02:00
Sean Owen	158af162ea	[SPARK-16129][CORE][SQL] Eliminate direct use of commons-lang classes in favor of commons-lang3 ## What changes were proposed in this pull request? Replace use of `commons-lang` in favor of `commons-lang3` and forbid the former via scalastyle; remove `NotImplementedException` from `comons-lang` in favor of JDK `UnsupportedOperationException` ## How was this patch tested? Jenkins tests Author: Sean Owen <sowen@cloudera.com> Closes #13843 from srowen/SPARK-16129.	2016-06-24 10:35:54 +01:00
peng.zhang	f4fd7432fb	[SPARK-16125][YARN] Fix not test yarn cluster mode correctly in YarnClusterSuite ## What changes were proposed in this pull request? Since SPARK-13220(Deprecate "yarn-client" and "yarn-cluster"), YarnClusterSuite doesn't test "yarn cluster" mode correctly. This pull request fixes it. ## How was this patch tested? Unit test (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Author: peng.zhang <peng.zhang@xiaomi.com> Closes #13836 from renozhang/SPARK-16125-test-yarn-cluster-mode.	2016-06-24 08:28:32 +01:00
Cheng Lian	2d2f607bfa	[SPARK-13709][SQL] Initialize deserializer with both table and partition properties when reading partitioned tables ## What changes were proposed in this pull request? When reading partitions of a partitioned Hive SerDe table, we only initializes the deserializer using partition properties. However, for SerDes like `AvroSerDe`, essential properties (e.g. Avro schema information) may be defined in table properties. We should merge both table properties and partition properties before initializing the deserializer. Note that an individual partition may have different properties than the one defined in the table properties (e.g. partitions within a table can have different SerDes). Thus, for any property key defined in both partition and table properties, the value set in partition properties wins. ## How was this patch tested? New test case added in `QueryPartitionSuite`. Author: Cheng Lian <lian@databricks.com> Closes #13865 from liancheng/spark-13709-partitioned-avro-table.	2016-06-23 23:11:46 -07:00
Yuhao Yang	cc6778ee0b	[SPARK-16133][ML] model loading backward compatibility for ml.feature ## What changes were proposed in this pull request? model loading backward compatibility for ml.feature, ## How was this patch tested? existing ut and manual test for loading 1.6 models. Author: Yuhao Yang <yuhao.yang@intel.com> Author: Yuhao Yang <hhbyyh@gmail.com> Closes #13844 from hhbyyh/featureComp.	2016-06-23 21:50:25 -07:00
Xiangrui Meng	4a40d43bb2	[SPARK-16142][R] group naiveBayes method docs in a single Rd ## What changes were proposed in this pull request? This PR groups `spark.naiveBayes`, `summary(NB)`, `predict(NB)`, and `write.ml(NB)` into a single Rd. ## How was this patch tested? Manually checked generated HTML doc. See attached screenshots. ![screen shot 2016-06-23 at 2 11 00 pm](https://cloud.githubusercontent.com/assets/829644/16320452/a5885e92-394c-11e6-994f-2ab5cddad86f.png) ![screen shot 2016-06-23 at 2 11 15 pm](https://cloud.githubusercontent.com/assets/829644/16320455/aad1f6d8-394c-11e6-8ef4-13bee989f52f.png) Author: Xiangrui Meng <meng@databricks.com> Closes #13877 from mengxr/SPARK-16142.	2016-06-23 21:43:13 -07:00

1 2 3 4 5 ...

16831 commits