Part of task for [SPARK-11219](https://issues.apache.org/jira/browse/SPARK-11219) to make PySpark MLlib parameter description formatting consistent. This is for the classification module.
Author: vijaykiran <mail@vijaykiran.com>
Author: Bryan Cutler <cutlerb@gmail.com>
Closes#11183 from BryanCutler/pyspark-consistent-param-classification-SPARK-12630.
PySpark support ```covar_samp``` and ```covar_pop```.
cc rxin davies marmbrus
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#10876 from yanboliang/spark-12962.
https://issues.apache.org/jira/browse/SPARK-13260
This is a quicky fix for `count(*)`.
When the `requiredColumns` is empty, currently it returns `sqlContext.sparkContext.emptyRDD[Row]` which does not have the count.
Just like JSON datasource, this PR lets the CSV datasource count the rows but do not parse each set of tokens.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#11169 from HyukjinKwon/SPARK-13260.
Previously we were using Option[String] and None to indicate the case when Spark fails to generate SQL. It is easier to just use exceptions to propagate error cases, rather than having for comprehension everywhere. I also introduced a "build" function that simplifies string concatenation (i.e. no need to reason about whether we have an extra space or not).
Author: Reynold Xin <rxin@databricks.com>
Closes#11171 from rxin/SPARK-13282.
The current implementation of ResolveSortReferences can only push one missing attributes into it's child, it failed to analyze TPCDS Q98, because of there are two missing attributes in that (one from Window, another from Aggregate).
Author: Davies Liu <davies@databricks.com>
Closes#11153 from davies/resolve_sort.
We should have lint rules using sphinx to automatically catch the pydoc issues that are sometimes introduced.
Right now ./dev/lint-python will skip building the docs if sphinx isn't present - but it might make sense to fail hard - just a matter of if we want to insist all PySpark developers have sphinx present.
Author: Holden Karau <holden@us.ibm.com>
Closes#11109 from holdenk/SPARK-13154-add-pydoc-lint-for-docs.
This JIRA is related to
https://github.com/apache/spark/pull/5852
Had to do some minor rework and test to make sure it
works with current version of spark.
Author: Sanket <schintap@untilservice-lm>
Closes#10838 from redsanket/limit-outbound-connections.
When the HistoryServer is showing an incomplete app, it needs to check if there is a newer version of the app available. It does this by checking if a version of the app has been loaded with a larger *filesize*. If so, it detaches the current UI, attaches the new one, and redirects back to the same URL to show the new UI.
https://issues.apache.org/jira/browse/SPARK-7889
Author: Steve Loughran <stevel@hortonworks.com>
Author: Imran Rashid <irashid@cloudera.com>
Closes#11118 from squito/SPARK-7889-alternate.
Fix this defect by check default value exist or not.
yanboliang Please help to review.
Author: Tommy YU <tummyyu@163.com>
Closes#11043 from Wenpei/spark-13153-handle-param-withnodefaultvalue.
It is possible to create faulty but legal ANTLR grammars. ANTLR will produce warnings but also a valid compileable parser. This PR makes sure we treat such warnings as build errors.
cc rxin / viirya
Author: Herman van Hovell <hvanhovell@questtec.nl>
Closes#11174 from hvanhovell/ANTLR-warnings-as-errors.
Pyspark Params class has a method `hasParam(paramName)` which returns `True` if the class has a parameter by that name, but throws an `AttributeError` otherwise. There is not currently a way of getting a Boolean to indicate if a class has a parameter. With Spark 2.0 we could modify the existing behavior of `hasParam` or add an additional method with this functionality.
In Python:
```python
from pyspark.ml.classification import NaiveBayes
nb = NaiveBayes()
print nb.hasParam("smoothing")
print nb.hasParam("notAParam")
```
produces:
> True
> AttributeError: 'NaiveBayes' object has no attribute 'notAParam'
However, in Scala:
```scala
import org.apache.spark.ml.classification.NaiveBayes
val nb = new NaiveBayes()
nb.hasParam("smoothing")
nb.hasParam("notAParam")
```
produces:
> true
> false
cc holdenk
Author: sethah <seth.hendrickson16@gmail.com>
Closes#10962 from sethah/SPARK-13047.
jkbradley I tried to improve the function to export a model. When I tried to export a model to S3 under Spark 1.6, we couldn't do that. So, it should offer S3 besides HDFS. Can you review it when you have time? Thanks!
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
Closes#11151 from yu-iskw/SPARK-13265.
This commit removes an unnecessary duplicate check in addPendingTask that meant
that scheduling a task set took time proportional to (# tasks)^2.
Author: Sital Kedia <skedia@fb.com>
Closes#11167 from sitalkedia/fix_stuck_driver and squashes the following commits:
3fe1af8 [Sital Kedia] [SPARK-13279] Remove unnecessary duplicate check in addPendingTask function
Add the table name validation at the temp table creation
Author: jayadevanmurali <jayadevan.m@tcs.com>
Closes#11051 from jayadevanmurali/branch-0.2-SPARK-12982.
JIRA: https://issues.apache.org/jira/browse/SPARK-13277
There is an ANTLR warning during compilation:
warning(200): org/apache/spark/sql/catalyst/parser/SparkSqlParser.g:938:7:
Decision can match input such as "KW_USING Identifier" using multiple alternatives: 2, 3
As a result, alternative(s) 3 were disabled for that input
This patch is to fix it.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#11168 from viirya/fix-parser-using.
Under some corner cases, the test suite failed to shutdown the SparkContext causing cascaded failures. This fix does two things
- Makes sure no SparkContext is active after every test
- Makes sure StreamingContext is always shutdown (prevents leaking of StreamingContexts as well, just in case)
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#11166 from tdas/fix-failuresuite.
Made sure the old tables continue to use the old css and the new DataTables use the new css. Also fixed it so the Safari Web Inspector doesn't throw errors when on the new DataTables pages.
Author: Alex Bozarth <ajbozart@us.ibm.com>
Closes#11038 from ajbozarth/spark13124.
The "getPersistentRDDs()" is a useful API of SparkContext to get cached RDDs. However, the JavaSparkContext does not have this API.
Add a simple getPersistentRDDs() to get java.util.Map<Integer, JavaRDD> for Java users.
Author: Junyang <fly.shenjy@gmail.com>
Closes#10978 from flyjy/master.
In spark-env.sh.template, there are multi-byte characters, this PR will remove it.
Author: Sasaki Toru <sasakitoa@nttdata.co.jp>
Closes#11149 from sasakitoa/remove_multibyte_in_sparkenv.
The parser currently parses the following strings without a hitch:
* Table Identifier:
* `a.b.c` should fail, but results in the following table identifier `a.b`
* `table!#` should fail, but results in the following table identifier `table`
* Expression
* `1+2 r+e` should fail, but results in the following expression `1 + 2`
This PR fixes this by adding terminated rules for both expression parsing and table identifier parsing.
cc cloud-fan (we discussed this in https://github.com/apache/spark/pull/10649) jayadevanmurali (this causes your PR https://github.com/apache/spark/pull/11051 to fail)
Author: Herman van Hovell <hvanhovell@questtec.nl>
Closes#11159 from hvanhovell/SPARK-13276.
For lots of SQL operators, we have metrics for both of input and output, the number of input rows should be exactly the number of output rows of child, we could only have metrics for output rows.
After we improved the performance using whole stage codegen, the overhead of SQL metrics are not trivial anymore, we should avoid that if it's not necessary.
This PR remove all the SQL metrics for number of input rows, add SQL metric of number of output rows for all LeafNode. All remove the SQL metrics from those operators that have the same number of rows from input and output (for example, Projection, we may don't need that).
The new SQL UI will looks like:
![metrics](https://cloud.githubusercontent.com/assets/40902/12965227/63614e5e-d009-11e5-88b3-84fea04f9c20.png)
Author: Davies Liu <davies@databricks.com>
Closes#11163 from davies/remove_metrics.
Grouping() returns a column is aggregated or not, grouping_id() returns the aggregation levels.
grouping()/grouping_id() could be used with window function, but does not work in having/sort clause, will be fixed by another PR.
The GROUPING__ID/grouping_id() in Hive is wrong (according to docs), we also did it wrongly, this PR change that to match the behavior in most databases (also the docs of Hive).
Author: Davies Liu <davies@databricks.com>
Closes#10677 from davies/grouping.
This PR addresses two issues:
- Self join does not work in SQL Generation
- When creating new instances for `LogicalRelation`, `metastoreTableIdentifier` is lost.
liancheng Could you please review the code changes? Thank you!
Author: gatorsmile <gatorsmile@gmail.com>
Closes#11084 from gatorsmile/selfJoinInSQLGen.
Some analysis rules generate aliases or auxiliary attribute references with the same name but different expression IDs. For example, `ResolveAggregateFunctions` introduces `havingCondition` and `aggOrder`, and `DistinctAggregationRewriter` introduces `gid`.
This is OK for normal query execution since these attribute references get expression IDs. However, it's troublesome when converting resolved query plans back to SQL query strings since expression IDs are erased.
Here's an example Spark 1.6.0 snippet for illustration:
```scala
sqlContext.range(10).select('id as 'a, 'id as 'b).registerTempTable("t")
sqlContext.sql("SELECT SUM(a) FROM t GROUP BY a, b ORDER BY COUNT(a), COUNT(b)").explain(true)
```
The above code produces the following resolved plan:
```
== Analyzed Logical Plan ==
_c0: bigint
Project [_c0#101L]
+- Sort [aggOrder#102L ASC,aggOrder#103L ASC], true
+- Aggregate [a#47L,b#48L], [(sum(a#47L),mode=Complete,isDistinct=false) AS _c0#101L,(count(a#47L),mode=Complete,isDistinct=false) AS aggOrder#102L,(count(b#48L),mode=Complete,isDistinct=false) AS aggOrder#103L]
+- Subquery t
+- Project [id#46L AS a#47L,id#46L AS b#48L]
+- LogicalRDD [id#46L], MapPartitionsRDD[44] at range at <console>:26
```
Here we can see that both aggregate expressions in `ORDER BY` are extracted into an `Aggregate` operator, and both of them are named `aggOrder` with different expression IDs.
The solution is to automatically add the expression IDs into the attribute name for the Alias and AttributeReferences that are generated by Analyzer in SQL Generation.
In this PR, it also resolves another issue. Users could use the same name as the internally generated names. The duplicate names should not cause name ambiguity. When resolving the column, Catalyst should not pick the column that is internally generated.
Could you review the solution? marmbrus liancheng
I did not set the newly added flag for all the alias and attribute reference generated by Analyzers. Please let me know if I should do it? Thank you!
Author: gatorsmile <gatorsmile@gmail.com>
Closes#11050 from gatorsmile/namingConflicts.
Update Aggregator links to point to #org.apache.spark.sql.expressions.Aggregator
Author: raela <raela@databricks.com>
Closes#11158 from raelawang/master.
### Management API for Continuous Queries
**API for getting status of each query**
- Whether active or not
- Unique name of each query
- Status of the sources and sinks
- Exceptions
**API for managing each query**
- Immediately stop an active query
- Waiting for a query to be terminated, correctly or with error
**API for managing multiple queries**
- Listing all active queries
- Getting an active query by name
- Waiting for any one of the active queries to be terminated
**API for listening to query life cycle events**
- ContinuousQueryListener API for query start, progress and termination events.
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#11030 from tdas/streaming-df-management-api.
Remove spark.closure.serializer option and use JavaSerializer always
CC andrewor14 rxin I see there's a discussion in the JIRA but just thought I'd offer this for a look at what the change would be.
Author: Sean Owen <sowen@cloudera.com>
Closes#11150 from srowen/SPARK-12414.
This pr adds benchmark codes for in-memory cache compression to make future developments and discussions more smooth.
Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>
Closes#10965 from maropu/ImproveColumnarCache.
The right margin of the history page is little bit off. A simple fix for that issue.
Author: zhuol <zhuol@yahoo-inc.com>
Closes#11029 from zhuoliu/13126.
The column width for the new DataTables now adjusts for the current page rather than being hard-coded for the entire table's data.
Author: Alex Bozarth <ajbozart@us.ibm.com>
Closes#11057 from ajbozarth/spark13163.
The patch for SPARK-8964 ("use Exchange to perform shuffle in Limit" / #7334) inadvertently broke the planning of the TakeOrderedAndProject operator: because ReturnAnswer was the new root of the query plan, the TakeOrderedAndProject rule was unable to match before BasicOperators.
This patch fixes this by moving the `TakeOrderedAndCollect` and `CollectLimit` rules into the same strategy.
In addition, I made changes to the TakeOrderedAndProject operator in order to make its `doExecute()` method lazy and added a new TakeOrderedAndProjectSuite which tests the new code path.
/cc davies and marmbrus for review.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#11145 from JoshRosen/take-ordered-and-project-fix.
This is the next iteration of tnachen's previous PR: https://github.com/apache/spark/pull/4027
In that PR, we resolved with andrewor14 and pwendell to implement the Mesos scheduler's support of `spark.executor.cores` to be consistent with YARN and Standalone. This PR implements that resolution.
This PR implements two high-level features. These two features are co-dependent, so they're implemented both here:
- Mesos support for spark.executor.cores
- Multiple executors per slave
We at Mesosphere have been working with Typesafe on a Spark/Mesos integration test suite: https://github.com/typesafehub/mesos-spark-integration-tests, which passes for this PR.
The contribution is my original work and I license the work to the project under the project's open source license.
Author: Michael Gummelt <mgummelt@mesosphere.io>
Closes#10993 from mgummelt/executor_sizing.
`FileStreamSource` is an implementation of `org.apache.spark.sql.execution.streaming.Source`. It takes advantage of the existing `HadoopFsRelationProvider` to support various file formats. It remembers files in each batch and stores it into the metadata files so as to recover them when restarting. The metadata files are stored in the file system. There will be a further PR to clean up the metadata files periodically.
This is based on the initial work from marmbrus.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#11034 from zsxwing/stream-df-file-source.