Commit graph

12184 commits

Author SHA1 Message Date
Hari Shreedharan c1be9f309a [SPARK-8988] [YARN] Make sure driver log links appear in secure cluster mode.

The NodeReports API currently used does not work in secure mode since we do not get RM tokens. Instead this patch just uses environment vars exported by YARN to create the log links.

Author: Hari Shreedharan <hshreedharan@apache.org>

Closes #7624 from harishreedharan/driver-logs-env and squashes the following commits:

7368c7e [Hari Shreedharan] [SPARK-8988][YARN] Make sure driver log links appear in secure cluster mode.
2015-07-27 15:16:46 -07:00
Wenchen Fan 3ab7525dce [SPARK-9355][SQL] Remove InternalRow.get generic getter call in columnar cache code
Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #7673 from cloud-fan/row-generic-getter-columnar and squashes the following commits:

88b1170 [Wenchen Fan] fix style
eeae712 [Wenchen Fan] Remove Internal.get generic getter call in columnar cache code
2015-07-27 13:40:50 -07:00
Cheng Lian 8e7d2bee23 [SPARK-9378] [SQL] Fixes test case "CTAS with serde"
This is a proper version of PR #7693 authored by viirya

The reason why "CTAS with serde" fails is that the `MetastoreRelation` gets converted to a Parquet data source relation by default.

Author: Cheng Lian <lian@databricks.com>

Closes #7700 from liancheng/spark-9378-fix-ctas-test and squashes the following commits:

4413af0 [Cheng Lian] Fixes test case "CTAS with serde"
2015-07-27 13:28:03 -07:00
Yin Huai 55946e76fd [SPARK-9349] [SQL] UDAF cleanup
https://issues.apache.org/jira/browse/SPARK-9349

With this PR, we only expose `UserDefinedAggregateFunction` (an abstract class) and `MutableAggregationBuffer` (an interface). Other internal wrappers and helper classes are moved to `org.apache.spark.sql.execution.aggregate` and marked as `private[sql]`.

Author: Yin Huai <yhuai@databricks.com>

Closes #7687 from yhuai/UDAF-cleanup and squashes the following commits:

db36542 [Yin Huai] Add comments to UDAF examples.
ae17f66 [Yin Huai] Address comments.
9c9fa5f [Yin Huai] UDAF cleanup.
2015-07-27 13:26:57 -07:00
Reynold Xin fa84e4a7ba Closes #7690 since it has been merged into branch-1.4. 2015-07-27 13:21:04 -07:00
Reynold Xin 85a50a6352 [HOTFIX] Disable pylint since it is failing master. 2015-07-27 12:25:34 -07:00
Wenchen Fan 75438422c2 [SPARK-9369][SQL] Support IntervalType in UnsafeRow
Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #7688 from cloud-fan/interval and squashes the following commits:

5b36b17 [Wenchen Fan] fix codegen
a99ed50 [Wenchen Fan] address comment
9e6d319 [Wenchen Fan] Support IntervalType in UnsafeRow
2015-07-27 11:28:22 -07:00
Wenchen Fan dd9ae7945a [SPARK-9351] [SQL] remove literals from grouping expressions in Aggregate
Literals in grouping expressions have no effect at all and only make the grouping key bigger, so we should remove them in the Optimizer.

This also makes the old and new aggregation code consistent about literals in grouping. The old aggregation code already removes literals from grouping expressions, but the new aggregation code does not, so this PR makes the removal an explicit rule in the Optimizer.

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #7583 from cloud-fan/minor and squashes the following commits:

471adff [Wenchen Fan] add test
0839925 [Wenchen Fan] use transformDown when rewrite final result expressions
2015-07-27 11:23:29 -07:00
George Dittmar 1f7b3d9dc7 [SPARK-7423] [MLLIB] Modify ClassificationModel and Probabalistic model to use Vector.argmax
Use Vector.argmax call instead of converting to dense vector before calculating predictions.
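
To illustrate why an argmax can be computed without building a dense copy, here is a minimal sketch over a sparse representation (size, sorted indices, stored values). `SparseArgmaxSketch` and its signature are hypothetical and are not the MLlib `Vector.argmax` implementation; NaN handling is ignored.

```scala
object SparseArgmaxSketch {
  // Returns the index of the first maximum of a sparse vector, or -1 if it is empty.
  // `indices` must be strictly increasing; entries not listed are implicit zeros.
  def argmax(size: Int, indices: Array[Int], values: Array[Double]): Int = {
    if (size == 0) -1
    else {
      // Position of the first largest explicitly stored value.
      var bestPos = -1
      var k = 0
      while (k < values.length) {
        if (bestPos < 0 || values(k) > values(bestPos)) bestPos = k
        k += 1
      }
      val hasImplicitZero = indices.length < size
      if (bestPos >= 0 && (values(bestPos) > 0.0 || !hasImplicitZero)) {
        // A positive stored value beats every implicit zero.
        indices(bestPos)
      } else {
        // The maximum is 0.0: find the first index that is not stored.
        var firstMissing = 0
        while (firstMissing < indices.length && indices(firstMissing) == firstMissing) {
          firstMissing += 1
        }
        if (bestPos >= 0 && values(bestPos) == 0.0) math.min(indices(bestPos), firstMissing)
        else firstMissing
      }
    }
  }
}
```

The only work is one pass over the stored values plus a scan for the first missing index, so the cost depends on the number of stored entries rather than on the vector size.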

Author: George Dittmar <georgedittmar@gmail.com>

Closes #7670 from GeorgeDittmar/sprk-7423 and squashes the following commits:

e796747 [George Dittmar] Changing ClassificationModel and ProbabilisticClassificationModel to use Vector.argmax instead of converting to DenseVector
2015-07-27 11:16:33 -07:00
Wenchen Fan e2f38167f8 [SPARK-9376] [SQL] use a seed in RandomDataGeneratorSuite
Make this test deterministic, i.e. make sure it passes no matter how many times we run it.

The original implementation uses a random seed, so there is a chance that the null check assertion `assert(Iterator.fill(100)(generator()).contains(null))` fails.
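
A minimal sketch of the idea, assuming a hypothetical nullable generator (`nullableIntGenerator` is not the real `RandomDataGenerator` API): once the `Random` instance is built from a fixed seed, every run draws the same sequence, so an assertion over the sampled values either always passes or always fails.

```scala
import scala.util.Random

object SeededGeneratorSketch {
  // Hypothetical stand-in for a nullable data generator: yields null with
  // probability `nullProbability`, otherwise a random Int.
  def nullableIntGenerator(rand: Random, nullProbability: Double): () => Any =
    () => if (rand.nextDouble() < nullProbability) null else rand.nextInt()

  // Draws `n` values from a fresh generator built from the given seed.
  def sample(seed: Long, n: Int): List[Any] = {
    val generator = nullableIntGenerator(new Random(seed), 0.1)
    List.fill(n)(generator())
  }
}
```

Calling `SeededGeneratorSketch.sample(42L, 100)` twice returns equal lists, which is the property the test relies on.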

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #7691 from cloud-fan/seed and squashes the following commits:

eae7281 [Wenchen Fan] use a seed in RandomDataGeneratorSuite
2015-07-27 11:02:16 -07:00
Ryan Williams c0b7df68f8 [SPARK-9366] use task's stageAttemptId in TaskEnd event
Author: Ryan Williams <ryan.blake.williams@gmail.com>

Closes #7681 from ryan-williams/task-stage-attempt and squashes the following commits:

d6d5f0f [Ryan Williams] use task's stageAttemptId in TaskEnd event
2015-07-27 12:54:08 -05:00
Josh Rosen ecad9d4346 [SPARK-9364] Fix array out of bounds and use-after-free bugs in UnsafeExternalSorter
This patch fixes two bugs in UnsafeExternalSorter and UnsafeExternalRowSorter:

- UnsafeExternalSorter does not properly update freeSpaceInCurrentPage, which can cause it to write past the end of memory pages and trigger segfaults.
- UnsafeExternalRowSorter has a use-after-free bug when returning the last row from an iterator.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #7680 from JoshRosen/SPARK-9364 and squashes the following commits:

590f311 [Josh Rosen] null out row
f4cf91d [Josh Rosen] Fix use-after-free bug in UnsafeExternalRowSorter.
8abcf82 [Josh Rosen] Properly decrement freeSpaceInCurrentPage in UnsafeExternalSorter
2015-07-27 09:34:49 -07:00
Alexander Ulanov 90006f3c51 Pregel example type fix
The Pregel example that expresses single source shortest path in https://spark.apache.org/docs/latest/graphx-programming-guide.html#pregel-api does not work due to an incorrect type: `GraphGenerators.logNormalGraph` returns a graph with `Long` vertices. This changes `val graph: Graph[Int, Double]` to `val graph: Graph[Long, Double]`.

Author: Alexander Ulanov <nashb@yandex.ru>

Closes #7695 from avulanov/SPARK-9380-pregel-doc and squashes the following commits:

c269429 [Alexander Ulanov] Pregel example type fix
2015-07-28 01:33:31 +09:00
Rene Treffer aa19c696e2 [SPARK-4176] [SQL] Supports decimal types with precision > 18 in Parquet
This PR is based on #6796 authored by rtreffer.

To support large decimal precisions (> 18), we do the following things in this PR:

1. Making `CatalystSchemaConverter` support large decimal precision

   Decimal types with large precision are always converted to fixed-length byte array.

2. Making `CatalystRowConverter` support reading decimal values with large precision

   When the precision is > 18, constructs `Decimal` values with an unscaled `BigInteger` rather than an unscaled `Long`.

3. Making `RowWriteSupport` support writing decimal values with large precision

   In this PR we always write decimals as fixed-length byte arrays, because the Parquet write path hasn't been refactored to conform to the Parquet format spec (see SPARK-6774 & SPARK-8848).
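
The precision threshold of 18 and the byte-array encoding can be sketched as follows. This is an illustration under stated assumptions, not the converter code in this PR: `LargeDecimalSketch` and its method names are hypothetical, and the byte layout assumed here is big-endian two's complement, which is what `java.math.BigInteger` reads and writes.

```scala
import java.math.{BigDecimal => JBigDecimal, BigInteger}

object LargeDecimalSketch {
  // Every unscaled value with at most 18 decimal digits fits in a signed 64-bit Long.
  val MaxLongDigits = 18

  def fitsInLong(precision: Int): Boolean = precision <= MaxLongDigits

  // Smallest number of bytes whose signed range holds every unscaled value
  // with `precision` decimal digits.
  def minBytesForPrecision(precision: Int): Int = {
    val maxUnscaled = BigInteger.TEN.pow(precision).subtract(BigInteger.ONE)
    // bitLength() excludes the sign bit, so add one bit before rounding up to bytes.
    (maxUnscaled.bitLength() + 1 + 7) / 8
  }

  // Rebuilds a decimal from big-endian two's-complement bytes and a scale.
  def fromUnscaledBytes(bytes: Array[Byte], scale: Int): JBigDecimal =
    new JBigDecimal(new BigInteger(bytes), scale)
}
```

Under these assumptions, precision 18 needs 8 bytes (the width of a `Long`), precision 19 needs 9, and precision 38 needs 16.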

Two follow-up tasks should be done in future PRs:

- [ ] Writing decimals as `INT32`, `INT64` when possible while fixing SPARK-8848
- [ ] Adding compatibility tests as part of SPARK-5463

Author: Cheng Lian <lian@databricks.com>

Closes #7455 from liancheng/spark-4176 and squashes the following commits:

a543d10 [Cheng Lian] Fixes errors introduced while rebasing
9e31cdf [Cheng Lian] Supports decimals with precision > 18 for Parquet
2015-07-27 23:29:40 +08:00
Carson Wang 6228381657 [SPARK-8405] [DOC] Add how to view logs on Web UI when yarn log aggregation is enabled
Some users may not be aware that the logs are available on the Web UI even if YARN log aggregation is enabled. Update the doc to make this clear and describe what needs to be configured.

Author: Carson Wang <carson.wang@intel.com>

Closes #7463 from carsonwang/YarnLogDoc and squashes the following commits:

274c054 [Carson Wang] Minor text fix
74df3a1 [Carson Wang] address comments
5a95046 [Carson Wang] Update the text in the doc
e5775c1 [Carson Wang] Update doc about how to view the logs on Web UI when yarn log aggregation is enabled
2015-07-27 08:02:40 -05:00
Cheng Lian 72981bc8f0 [SPARK-7943] [SPARK-8105] [SPARK-8435] [SPARK-8714] [SPARK-8561] Fixes multi-database support
This PR fixes a set of issues related to multi-database. A new data structure `TableIdentifier` is introduced to identify a table among multiple databases. We should stop using a single `String` (table name without database name), or `Seq[String]` (optional database name plus table name) to identify tables internally.

Author: Cheng Lian <lian@databricks.com>

Closes #7623 from liancheng/spark-8131-multi-db and squashes the following commits:

f3bcd4b [Cheng Lian] Addresses PR comments
e0eb76a [Cheng Lian] Fixes styling issues
41e2207 [Cheng Lian] Fixes multi-database support
d4d1ec2 [Cheng Lian] Adds multi-database test cases
2015-07-27 17:15:35 +08:00
Wenchen Fan 4ffd3a1db5 [SPARK-9371][SQL] fix the support for special chars in column names for hive context
Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #7684 from cloud-fan/hive and squashes the following commits:

da21ffe [Wenchen Fan] fix the support for special chars in column names for hive context
2015-07-26 23:58:03 -07:00
Reynold Xin aa80c64fcf [SPARK-9368][SQL] Support get(ordinal, dataType) generic getter in UnsafeRow.
Author: Reynold Xin <rxin@databricks.com>

Closes #7682 from rxin/unsaferow-generic-getter and squashes the following commits:

3063788 [Reynold Xin] Reset the change for real this time.
0f57c55 [Reynold Xin] Reset the changes in ExpressionEvalHelper.
fb6ca30 [Reynold Xin] Support BinaryType.
24a3e46 [Reynold Xin] Added support for DateType/TimestampType.
9989064 [Reynold Xin] JoinedRow.
11f80a3 [Reynold Xin] [SPARK-9368][SQL] Support get(ordinal, dataType) generic getter in UnsafeRow.
2015-07-26 23:01:04 -07:00
Liang-Chi Hsieh 945d8bcbf6 [SPARK-9306] [SQL] Don't use SortMergeJoin when joining on unsortable columns
JIRA: https://issues.apache.org/jira/browse/SPARK-9306

Author: Liang-Chi Hsieh <viirya@appier.com>

Closes #7645 from viirya/smj_unsortable and squashes the following commits:

a240707 [Liang-Chi Hsieh] Use forall instead of exists for readability.
55221fa [Liang-Chi Hsieh] Shouldn't use SortMergeJoin when joining on unsortable columns.
2015-07-26 22:13:37 -07:00
Cheng Hao 1efe97dc9e [SPARK-8867][SQL] Support list / describe function usage
As Hive does, we need to list all of the registered UDFs and their usage for the user.

We add an annotation to describe a UDF, so we can get the literal description info while registering the UDF,
e.g.
```scala
@ExpressionDescription(
    usage = "_FUNC_(expr) - Returns the absolute value of the numeric value",
    extended = """> SELECT _FUNC_('-1')
                  1""")
case class Abs(child: Expression) extends UnaryArithmetic {
  // ...
}
```

Author: Cheng Hao <hao.cheng@intel.com>

Closes #7259 from chenghao-intel/desc_function and squashes the following commits:

cf29bba [Cheng Hao] fixing the code style issue
5193855 [Cheng Hao] Add more powerful parser for show functions
c645a6b [Cheng Hao] fix bug in unit test
78d40f1 [Cheng Hao] update the padding issue for usage
48ee4b3 [Cheng Hao] update as feedback
70eb4e9 [Cheng Hao] add show/describe function support
2015-07-26 18:34:19 -07:00
Cheng Lian c025c3d0a1 [SPARK-9095] [SQL] Removes the old Parquet support
This PR removes the old Parquet support:

- Removes the old `ParquetRelation` together with related SQL configuration, plan nodes, strategies, utility classes, and test suites.

- Renames `ParquetRelation2` to `ParquetRelation`

- Renames `RowReadSupport` and `RowRecordMaterializer` to `CatalystReadSupport` and `CatalystRecordMaterializer` respectively, and moves them to separate files.

  This follows naming convention used in other Parquet data models implemented in parquet-mr. It should be easier for developers who are familiar with Parquet to follow.

There's still some other code that can be cleaned up, especially `RowWriteSupport`, but I'd like to leave this part to SPARK-8848.

Author: Cheng Lian <lian@databricks.com>

Closes #7441 from liancheng/spark-9095 and squashes the following commits:

c7b6e38 [Cheng Lian] Removes WriteToFile
2d688d6 [Cheng Lian] Renames ParquetRelation2 to ParquetRelation
ca9e1b7 [Cheng Lian] Removes old Parquet support
2015-07-26 16:49:19 -07:00
Kay Ousterhout 6b2baec04f [SPARK-9326] Close lock file used for file downloads.
A lock file is used to ensure multiple executors running on the
same machine don't download the same file concurrently. Spark never
closes these lock files (releasing the lock does not close the
underlying file); this commit fixes that.
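
A minimal sketch of the distinction between releasing a lock and closing its file, using `java.io.RandomAccessFile` and `java.nio.channels.FileLock`. `LockFileSketch` and its helpers are hypothetical and are not the code changed by this commit.

```scala
import java.io.{File, RandomAccessFile}

object LockFileSketch {
  // Shows that the channel is still open after the lock has been released.
  def channelStaysOpenAfterRelease(lockFile: File): Boolean = {
    val raf = new RandomAccessFile(lockFile, "rw")
    try {
      val lock = raf.getChannel().lock()
      lock.release()
      raf.getChannel().isOpen()
    } finally {
      raf.close()
    }
  }

  // Runs `body` while holding an exclusive lock, then releases the lock
  // and closes the underlying file so no file descriptor is leaked.
  def withFileLock[T](lockFile: File)(body: => T): T = {
    val raf = new RandomAccessFile(lockFile, "rw")
    try {
      val lock = raf.getChannel().lock()
      try body finally lock.release()
    } finally {
      raf.close()
    }
  }
}
```

The nested `try`/`finally` blocks keep the two cleanup steps separate: the lock is released first, and the file is closed even if the body or the release throws.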

cc vanzin (looks like you've been involved in various other fixes surrounding these lock files)

Author: Kay Ousterhout <kayousterhout@gmail.com>

Closes #7650 from kayousterhout/SPARK-9326 and squashes the following commits:

0401bd1 [Kay Ousterhout] Close lock file used for file downloads.
2015-07-26 13:35:16 -07:00
Andrew Or 1cf19760d6 [SPARK-9352] [SPARK-9353] Add tests for standalone scheduling code
This also fixes a small issue in the standalone Master that was uncovered by the new tests. For more detail, read the description of SPARK-9353.

Author: Andrew Or <andrew@databricks.com>

Closes #7668 from andrewor14/standalone-scheduling-tests and squashes the following commits:

d852faf [Andrew Or] Add tests + fix scheduling with memory limits
2015-07-26 13:03:13 -07:00
Yijie Shen fb5d43fb25 [SPARK-9356][SQL]Remove the internal use of DecimalType.Unlimited
JIRA: https://issues.apache.org/jira/browse/SPARK-9356

Author: Yijie Shen <henry.yijieshen@gmail.com>

Closes #7671 from yjshen/deprecated_unlimit and squashes the following commits:

c707f56 [Yijie Shen] remove pattern matching in changePrecision
4a1823c [Yijie Shen] remove internal occurrence of Decimal.Unlimited
2015-07-26 10:29:22 -07:00
Reynold Xin 6c400b4f39 [SPARK-9354][SQL] Remove InternalRow.get generic getter call in Hive integration code.
Replaced them with get(ordinal, datatype) so we can use UnsafeRow here.

I passed the data types throughout.

Author: Reynold Xin <rxin@databricks.com>

Closes #7669 from rxin/row-generic-getter-hive and squashes the following commits:

3467d8e [Reynold Xin] [SPARK-9354][SQL] Remove Internal.get generic getter call in Hive integration code.
2015-07-26 10:27:39 -07:00
Yuhao Yang b79bf1df62 [SPARK-9337] [MLLIB] Add an ut for Word2Vec to verify the empty vocabulary check
jira: https://issues.apache.org/jira/browse/SPARK-9337

Word2Vec should throw an exception when the vocabulary is empty.

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #7660 from hhbyyh/ut4Word2vec and squashes the following commits:

17a18cb [Yuhao Yang] add ut for word2vec
2015-07-26 14:02:20 +01:00
Reynold Xin 4a01bfc2a2 [SPARK-9350][SQL] Introduce an InternalRow generic getter that requires a DataType
Currently UnsafeRow cannot support a generic getter. However, if the data type is known, we can support a generic getter.

Author: Reynold Xin <rxin@databricks.com>

Closes #7666 from rxin/generic-getter-with-datatype and squashes the following commits:

ee2874c [Reynold Xin] Add a default implementation for getStruct.
1e109a0 [Reynold Xin] [SPARK-9350][SQL] Introduce an InternalRow generic getter that requires a DataType.
033ee88 [Reynold Xin] Removed getAs in non test code.
2015-07-25 23:52:37 -07:00
Nishkam Ravi 41a7cdf85d [SPARK-8881] [SPARK-9260] Fix algorithm for scheduling executors on workers
Current scheduling algorithm allocates one core at a time and in doing so ends up ignoring spark.executor.cores. As a result, when spark.cores.max/spark.executor.cores (i.e, num_executors) < num_workers, executors are not launched and the app hangs. This PR fixes and refactors the scheduling algorithm.
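
The failure mode can be reproduced with a toy allocator. This is a simplified sketch, not the code in Master.scala: `ExecutorSchedulingSketch`, its methods, and the cluster sizes used below are assumptions chosen for illustration.

```scala
object ExecutorSchedulingSketch {
  // Assigns cores to workers round-robin, `step` cores at a time, until
  // `coresToAssign` is used up or no worker can take another step.
  def assignCores(freeCores: Array[Int], coresToAssign: Int, step: Int): Array[Int] = {
    val assigned = Array.fill(freeCores.length)(0)
    var remaining = coresToAssign
    var progressed = true
    while (remaining >= step && progressed) {
      progressed = false
      for (i <- freeCores.indices) {
        if (remaining >= step && freeCores(i) - assigned(i) >= step) {
          assigned(i) += step
          remaining -= step
          progressed = true
        }
      }
    }
    assigned
  }

  // Number of executors that can launch when each needs `coresPerExecutor` cores.
  def executorsLaunched(assigned: Array[Int], coresPerExecutor: Int): Int =
    assigned.map(_ / coresPerExecutor).sum
}
```

With 4 workers of 16 free cores each, a cap of 48 cores, and 16 cores per executor, stepping one core at a time leaves 12 cores on every worker and launches no executor. Stepping by 16 assigns 16 cores to three workers and launches three executors.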

andrewor14

Author: Nishkam Ravi <nravi@cloudera.com>
Author: nishkamravi2 <nishkamravi@gmail.com>

Closes #7274 from nishkamravi2/master_scheduler and squashes the following commits:

b998097 [nishkamravi2] Update Master.scala
da0f491 [Nishkam Ravi] Update Master.scala
79084e8 [Nishkam Ravi] Update Master.scala
1daf25f [Nishkam Ravi] Update Master.scala
f279cdf [Nishkam Ravi] Update Master.scala
adec84b [Nishkam Ravi] Update Master.scala
a06da76 [nishkamravi2] Update Master.scala
40c8f9f [nishkamravi2] Update Master.scala (to trigger retest)
c11c689 [nishkamravi2] Update EventLoggingListenerSuite.scala
5d6a19c [nishkamravi2] Update Master.scala (for the purpose of issuing a retest)
2d6371c [Nishkam Ravi] Update Master.scala
66362d5 [nishkamravi2] Update Master.scala
ee7cf0e [Nishkam Ravi] Improved scheduling algorithm for executors
2015-07-25 22:56:25 -07:00
Reynold Xin b1f4b4abfd [SPARK-9348][SQL] Remove apply method on InternalRow.
Author: Reynold Xin <rxin@databricks.com>

Closes #7665 from rxin/remove-row-apply and squashes the following commits:

0b43001 [Reynold Xin] support getString in UnsafeRow.
176d633 [Reynold Xin] apply -> get.
2941324 [Reynold Xin] [SPARK-9348][SQL] Remove apply method on InternalRow.
2015-07-25 18:41:51 -07:00
Wenchen Fan 2c94d0f24a [SPARK-9192][SQL] add initialization phase for nondeterministic expression
Currently nondeterministic expressions are broken without an explicit initialization phase.

Take `MonotonicallyIncreasingID` as an example. This expression needs mutable state to remember how many times it has been evaluated, so we use `transient var count: Long` there. Because it is transient, `count` is reset to 0, and **only** to 0, when the expression is serialized and deserialized, since deserializing a transient variable yields its default value. There is *no way* to use another initial value for `count` until we add an explicit initialization phase.

Another use case is local execution for `LocalRelation`: there is no serialize and deserialize phase, and thus we can't reset mutable states for it.
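
The transient behavior described above can be seen with plain Java serialization. This is a minimal sketch: `Counter`, `initialize`, and `roundTrip` are hypothetical names, not the expression classes or the initialization API added by this PR.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

object TransientStateSketch {
  // Hypothetical stand-in for a stateful, nondeterministic expression.
  class Counter extends Serializable {
    @transient var count: Long = 0L

    // Returns the current count and advances it.
    def next(): Long = { val c = count; count += 1; c }

    // Explicit initialization hook: the only way to start from a non-default value.
    def initialize(start: Long): Unit = { count = start }
  }

  // Serializes and deserializes a value with Java serialization.
  def roundTrip[T <: AnyRef](value: T): T = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(value)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    try in.readObject().asInstanceOf[T] finally in.close()
  }
}
```

After `roundTrip`, the copy's `count` is back at 0 no matter how often the original was evaluated, and only a call such as `initialize(100L)` can give it a different starting value.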

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #7535 from cloud-fan/init and squashes the following commits:

6c6f332 [Wenchen Fan] add test
ef68ff4 [Wenchen Fan] fix comments
9eac85e [Wenchen Fan] move init code to interpreted class
bb7d838 [Wenchen Fan] pulls out nondeterministic expressions into a project
b4a4fc7 [Wenchen Fan] revert a refactor
86fee36 [Wenchen Fan] add initialization phase for nondeterministic expression
2015-07-25 12:10:02 -07:00
Cheng Lian e2ec018e37 [SPARK-9285] [SQL] Fixes Row/InternalRow conversion for HadoopFsRelation
This is a follow-up of #7626. It fixes `Row`/`InternalRow` conversion for data sources extending `HadoopFsRelation` with `needConversion` being `true`.

Author: Cheng Lian <lian@databricks.com>

Closes #7649 from liancheng/spark-9285-conversion-fix and squashes the following commits:

036a50c [Cheng Lian] Addresses PR comment
f6d7c6a [Cheng Lian] Fixes Row/InternalRow conversion for HadoopFsRelation
2015-07-25 11:42:49 -07:00
Sean Owen c980e20cf1 [SPARK-9304] [BUILD] Improve backwards compatibility of SPARK-8401
Add back change-version-to-X.sh scripts, as wrappers for new script, for backwards compatibility

Author: Sean Owen <sowen@cloudera.com>

Closes #7639 from srowen/SPARK-9304 and squashes the following commits:

9ab2681 [Sean Owen] Add deprecation message to wrappers
3c8c202 [Sean Owen] Add back change-version-to-X.sh scripts, as wrappers for new script, for backwards compatibility
2015-07-25 11:05:08 +01:00
Reynold Xin 215713e199 [SPARK-9334][SQL] Remove UnsafeRowConverter in favor of UnsafeProjection.
The two are redundant.

Once this patch is merged, I plan to remove the inbound conversions from unsafe aggregates.

Author: Reynold Xin <rxin@databricks.com>

Closes #7658 from rxin/unsafeconverters and squashes the following commits:

ed19e6c [Reynold Xin] Updated support types.
2a56d7e [Reynold Xin] [SPARK-9334][SQL] Remove UnsafeRowConverter in favor of UnsafeProjection.
2015-07-25 01:37:41 -07:00
Reynold Xin f0ebab3f6d [SPARK-9336][SQL] Remove extra JoinedRows
They were added to improve performance (so JIT can inline the JoinedRow calls). However, we can also just improve it by projecting output out to UnsafeRow in Tungsten variant of the operators.

Author: Reynold Xin <rxin@databricks.com>

Closes #7659 from rxin/remove-joinedrows and squashes the following commits:

7510447 [Reynold Xin] [SPARK-9336][SQL] Remove extra JoinedRows
2015-07-25 01:28:46 -07:00
JD 723db13e06 [Spark-8668][SQL] Adding expr to functions
Author: JD <jd@csh.rit.edu>
Author: Joseph Batchik <josephbatchik@gmail.com>

Closes #7606 from JDrit/expr and squashes the following commits:

ad7f607 [Joseph Batchik] fixing python linter error
9d6daea [Joseph Batchik] removed order by per @rxin's comment
707d5c6 [Joseph Batchik] Added expr to fuctions.py
79df83c [JD] added example to the docs
b89eec8 [JD] moved function up as per @rxin's comment
4960909 [JD] updated per @JoshRosen's comment
2cb329c [JD] updated per @rxin's comment
9a9ad0c [JD] removing unused import
6dc26d0 [JD] removed split
7f2222c [JD] Adding expr function as per SPARK-8668
2015-07-25 00:34:59 -07:00
Patrick Wendell 19bcd6ab12 [HOTFIX] - Disable Kinesis tests due to rate limits 2015-07-24 22:57:01 -07:00
Reynold Xin c84acd4aa4 [SPARK-9331][SQL] Add a code formatter to auto-format generated code.
The generated expression code can be hard to read since it is not indented well. This patch adds a code formatter that formats the code automatically when we output it to the screen.

Author: Reynold Xin <rxin@databricks.com>

Closes #7656 from rxin/codeformatter and squashes the following commits:

5ba0e90 [Reynold Xin] [SPARK-9331][SQL] Add a code formatter to auto-format generated code.
2015-07-24 19:35:24 -07:00
Reynold Xin f99cb5615c [SPARK-9330][SQL] Create specialized getStruct getter in InternalRow.
Also took the chance to rearrange some of the methods in UnsafeRow to group static/private/public things together.

Author: Reynold Xin <rxin@databricks.com>

Closes #7654 from rxin/getStruct and squashes the following commits:

b491a09 [Reynold Xin] Fixed typo.
48d77e5 [Reynold Xin] [SPARK-9330][SQL] Create specialized getStruct getter in InternalRow.
2015-07-24 19:29:01 -07:00
MechCoder a400ab516f [SPARK-7045] [MLLIB] Avoid intermediate representation when creating model
Word2Vec used to convert from an Array[Float] representation to a Map[String, Array[Float]] and then back to an Array[Float] through Word2VecModel.

This patch avoids this conversion while still supporting the older method of supplying a Map.

Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #5748 from MechCoder/spark-7045 and squashes the following commits:

e308913 [MechCoder] move docs
5703116 [MechCoder] minor
fa04313 [MechCoder] style fixes
b1d61c4 [MechCoder] better errors and tests
3b32c8c [MechCoder] [SPARK-7045] Avoid intermediate representation when creating model
2015-07-24 14:58:07 -07:00
Liang-Chi Hsieh 64135cbb33 [SPARK-9067] [SQL] Close reader in NewHadoopRDD early if there is no more data
JIRA: https://issues.apache.org/jira/browse/SPARK-9067

According to the description of the JIRA ticket, calling `reader.close()` only after the task is finished causes memory and open-file-limit problems, since these resources stay occupied even when we don't need them anymore.

This PR simply closes the reader early when we know there is no more data to read.
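
The pattern can be sketched with a line iterator that closes its reader as soon as it sees the end of the input. `ClosingLineIterator` is a hypothetical example, not the `NewHadoopRDD` code; the idempotent `close()` means a later cleanup callback can still call it safely.

```scala
import java.io.{BufferedReader, Closeable}

object EarlyCloseSketch {
  // Iterates over lines and closes the reader once the last line has been handed out.
  class ClosingLineIterator(reader: BufferedReader) extends Iterator[String] with Closeable {
    private var nextLine: String = reader.readLine()
    private var closed = false
    // Empty input: nothing to read, so release the reader right away.
    if (nextLine == null) close()

    def isClosed: Boolean = closed

    override def hasNext: Boolean = nextLine != null

    override def next(): String = {
      if (nextLine == null) throw new NoSuchElementException("end of input")
      val line = nextLine
      nextLine = reader.readLine()
      // No more data: close now instead of waiting for the task to finish.
      if (nextLine == null) close()
      line
    }

    // Safe to call more than once.
    override def close(): Unit = if (!closed) { closed = true; reader.close() }
  }
}
```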

Author: Liang-Chi Hsieh <viirya@appier.com>

Closes #7424 from viirya/close_reader and squashes the following commits:

3ff64e5 [Liang-Chi Hsieh] For comments.
3d20267 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into close_reader
e152182 [Liang-Chi Hsieh] For comments.
5116cbe [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into close_reader
3ceb755 [Liang-Chi Hsieh] For comments.
e34d98e [Liang-Chi Hsieh] For comments.
50ed729 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into close_reader
216912f [Liang-Chi Hsieh] Fix it.
f429016 [Liang-Chi Hsieh] Release reader if we don't need it.
a305621 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into close_reader
67569da [Liang-Chi Hsieh] Close reader early if there is no more data.
2015-07-24 12:36:44 -07:00
Cheolsoo Park 9a11396113 [SPARK-9270] [PYSPARK] allow --name option in pyspark
This is a continuation of #7512, which added the `--name` option to spark-shell. This PR adds the same option to pyspark.

Note that `--conf spark.app.name` on the command line has no effect in spark-shell and pyspark. Instead, `--name` must be used. This is in fact inconsistent with spark-sql, which doesn't accept the `--name` option while it accepts `--conf spark.app.name`. I am not fixing this inconsistency in this PR. IMO, only one of `--name` and `--conf spark.app.name` is needed, not both. But since I cannot decide which to choose, I am not making any change here.

Author: Cheolsoo Park <cheolsoop@netflix.com>

Closes #7610 from piaozhexiu/SPARK-9270 and squashes the following commits:

763e86d [Cheolsoo Park] Update windows script
400b7f9 [Cheolsoo Park] Allow --name option to pyspark
2015-07-24 11:56:55 -07:00
Marcelo Vanzin 8399ba1487 [SPARK-9261] [STREAMING] Avoid calling APIs that expose shaded classes.
Doing this may cause weird errors when tests are run on maven, depending
on the flags used. Instead, expose the needed functionality through methods
that do not expose shaded classes.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #7601 from vanzin/SPARK-9261 and squashes the following commits:

4f64a16 [Marcelo Vanzin] [SPARK-9261] [streaming] Avoid calling APIs that expose shaded classes.
2015-07-24 11:53:16 -07:00
Josh Rosen 6aceaf3d62 [SPARK-9295] Analysis should detect sorting on unsupported column types
This patch extends CheckAnalysis to throw errors for queries that try to sort on unsupported column types, such as ArrayType.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #7633 from JoshRosen/SPARK-9295 and squashes the following commits:

23b2fbf [Josh Rosen] Embed function in foreach
bfe1451 [Josh Rosen] Update to allow sorting by null literals
2f1b802 [Josh Rosen] Add analysis rule to detect sorting on unsupported column types (SPARK-9295)
2015-07-24 11:34:23 -07:00
MechCoder e253124513 [SPARK-9222] [MLlib] Make class instantiation variables in DistributedLDAModel private[clustering]
This makes it easier to test all the class variables of the DistributedLDAModel.

Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #7573 from MechCoder/lda_test and squashes the following commits:

2f1a293 [MechCoder] [SPARK-9222] [MLlib] Make class instantiation variables in DistributedLDAModel private[clustering]
2015-07-24 10:56:48 -07:00
Josh Rosen c2b50d693e [SPARK-9292] Analysis should check that join conditions' data types are BooleanType
This patch adds an analysis check to ensure that join conditions' data types are BooleanType. This check is necessary in order to report proper errors for non-boolean DataFrame join conditions.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #7630 from JoshRosen/SPARK-9292 and squashes the following commits:

aec6c7b [Josh Rosen] Check condition type in resolved()
75a3ea6 [Josh Rosen] Fix SPARK-9292.
2015-07-24 09:49:50 -07:00
Reynold Xin c8d71a4183 [SPARK-9305] Rename org.apache.spark.Row to Item.
It's a thing used in test cases, but named Row. Pretty annoying because every time I search for Row, it shows up before the Spark SQL Row, which is what a developer wants most of the time.

Author: Reynold Xin <rxin@databricks.com>

Closes #7638 from rxin/remove-row and squashes the following commits:

aeda52d [Reynold Xin] [SPARK-9305] Rename org.apache.spark.Row to Item.
2015-07-24 09:38:13 -07:00
Reynold Xin 431ca39be5 [SPARK-9285][SQL] Remove InternalRow's inheritance from Row.
I also changed InternalRow's size/length function to numFields, to make it more obvious that it is not about bytes, but the number of fields.

Author: Reynold Xin <rxin@databricks.com>

Closes #7626 from rxin/internalRow and squashes the following commits:

e124daf [Reynold Xin] Fixed test case.
805ceb7 [Reynold Xin] Commented out the failed test suite.
f8a9ca5 [Reynold Xin] Fixed more bugs. Still at least one more remaining.
76d9081 [Reynold Xin] Fixed data sources.
7807f70 [Reynold Xin] Fixed DataFrameSuite.
cb60cd2 [Reynold Xin] Code review & small bug fixes.
0a2948b [Reynold Xin] Fixed style.
3280d03 [Reynold Xin] [SPARK-9285][SQL] Remove InternalRow's inheritance from Row.
2015-07-24 09:37:36 -07:00
Yu ISHIKAWA 3aec9f4e2d [SPARK-9249] [SPARKR] local variable assigned but may not be used
[[SPARK-9249] local variable assigned but may not be used - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-9249)

https://gist.github.com/yu-iskw/0e5b0253c11769457ea5

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #7640 from yu-iskw/SPARK-9249 and squashes the following commits:

7a51cab [Yu ISHIKAWA] [SPARK-9249][SparkR] local variable assigned but may not be used
2015-07-24 09:10:57 -07:00
François Garillot 428cde5d1c [SPARK-9250] Make change-scala-version more helpful w.r.t. valid Scala versions
Author: François Garillot <francois@garillot.net>

Closes #7595 from huitseeker/issue/SPARK-9250 and squashes the following commits:

80a0218 [François Garillot] [SPARK-9250] Make change-scala-version's usage more explicit, introduce a -h|--help option.
2015-07-24 17:09:33 +01:00
zhichao.li 846cf46282 [SPARK-9238] [SQL] Remove two extra useless entries for bytesOfCodePointInUTF8
This is only a trial; I am not sure I understand correctly, but I believe only 2 entries in `bytesOfCodePointInUTF8` are enough for the case of a 6-byte code point (1111110x).
Details can be found in the "Description" section of https://en.wikipedia.org/wiki/UTF-8.
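
The count of two can be checked by deriving the sequence length from the bit pattern of the first byte. This is a sketch of the original (pre-RFC 3629) UTF-8 scheme that allowed up to 6 bytes; `numBytesForFirstByte` is a hypothetical helper, not the `bytesOfCodePointInUTF8` lookup table itself.

```scala
object Utf8LeadByteSketch {
  // Length in bytes of a UTF-8 sequence, given its first byte.
  // Returns 0 for bytes that cannot start a sequence.
  def numBytesForFirstByte(b: Byte): Int = {
    val v = b & 0xFF
    if (v < 0x80) 1        // 0xxxxxxx
    else if (v < 0xC0) 0   // 10xxxxxx: continuation byte
    else if (v < 0xE0) 2   // 110xxxxx
    else if (v < 0xF0) 3   // 1110xxxx
    else if (v < 0xF8) 4   // 11110xxx
    else if (v < 0xFC) 5   // 111110xx
    else if (v < 0xFE) 6   // 1111110x: only 0xFC and 0xFD
    else 0                 // 0xFE and 0xFF never appear in UTF-8
  }
}
```

Only the two byte values 0xFC and 0xFD match the pattern 1111110x, which is why two table entries suffice for the 6-byte case.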

Author: zhichao.li <zhichao.li@intel.com>

Closes #7582 from zhichao-li/utf8 and squashes the following commits:

8bddd01 [zhichao.li] two extra entries
2015-07-24 08:34:50 -07:00