[SPARK-13942][CORE][DOCS] Remove Shark-related docs for 2.x
## What changes were proposed in this pull request?

`Shark` was merged into `Spark SQL` in [July 2014](https://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html). The sections below seem to be the only remaining legacy, so we had better clean them up for Spark 2.x.

**Migration Guide**

```
- ## Migration Guide for Shark Users
- ...
- ### Scheduling
- ...
- ### Reducer number
- ...
- ### Caching
```

## How was this patch tested?

Pass the Jenkins tests.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #11770 from dongjoon-hyun/SPARK-13942.
parent 27e1f38851
commit 4ce2d24e2a
docs/sql-programming-guide.md
@@ -2356,51 +2356,6 @@ Python UDF registration is unchanged.

When using DataTypes in Python you will need to construct them (i.e. `StringType()`) instead of
referencing a singleton.
## Migration Guide for Shark Users

### Scheduling

To set a [Fair Scheduler](job-scheduling.html#fair-scheduler-pools) pool for a JDBC client session,
users can set the `spark.sql.thriftserver.scheduler.pool` variable:

    SET spark.sql.thriftserver.scheduler.pool=accounting;
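
As an aside (an addition, not part of the removed text): a minimal sketch of how this was used, assuming a fair-scheduler pool named `accounting` is defined and `logs_last_month` is the example table from these docs:

    -- The SET applies to the current JDBC session only; subsequent
    -- queries in this session run in the "accounting" pool.
    SET spark.sql.thriftserver.scheduler.pool=accounting;
    SELECT count(*) FROM logs_last_month;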
### Reducer number

In Shark, the default reducer number is 1 and is controlled by the property `mapred.reduce.tasks`. Spark
SQL deprecates this property in favor of `spark.sql.shuffle.partitions`, whose default value
is 200. Users may customize this property via `SET`:

    SET spark.sql.shuffle.partitions=10;
    SELECT page, count(*) c
    FROM logs_last_month_cached
    GROUP BY page ORDER BY c DESC LIMIT 10;
You may also put this property in `hive-site.xml` to override the default value.

For now, the `mapred.reduce.tasks` property is still recognized and is converted to
`spark.sql.shuffle.partitions` automatically.
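
As an illustrative sketch (added here, not part of the removed docs), the legacy and new properties could be used interchangeably in a session:

    -- The legacy Hive property is still recognized and is converted
    -- to spark.sql.shuffle.partitions automatically.
    SET mapred.reduce.tasks=10;
    -- Equivalent modern form:
    SET spark.sql.shuffle.partitions=10;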
### Caching

The `shark.cache` table property no longer exists, and tables whose names end with `_cached` are no
longer automatically cached. Instead, we provide `CACHE TABLE` and `UNCACHE TABLE` statements to
let users control table caching explicitly:

    CACHE TABLE logs_last_month;
    UNCACHE TABLE logs_last_month;

**NOTE:** `CACHE TABLE tbl` is now __eager__ by default, not __lazy__. You no longer need to trigger cache materialization manually.
Since Spark 1.2.0, Spark SQL provides a statement that lets users control whether table caching is lazy:

    CACHE [LAZY] TABLE [AS SELECT] ...
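
As a concrete reading of that grammar (a sketch added here; `logs_summary` is a hypothetical table name, while `logs_last_month` comes from the examples above):

    -- Lazily cache the result of a query; materialization is deferred
    -- until the cached table is first scanned.
    CACHE LAZY TABLE logs_summary AS
    SELECT page, count(*) c FROM logs_last_month GROUP BY page;

    -- The first scan triggers materialization of the cache.
    SELECT * FROM logs_summary ORDER BY c DESC LIMIT 10;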
Several caching related features are not supported yet:

* User defined partition level cache eviction policy
* RDD reloading
* In-memory cache write through policy
## Compatibility with Apache Hive

Spark SQL is designed to be compatible with the Hive Metastore, SerDes and UDFs.