Commit graph

11438 commits

Author SHA1 Message Date
MechCoder 6765ef98df [SPARK-6390] [SQL] [MLlib] Port MatrixUDT to PySpark
MatrixUDT was recently implemented in Scala; this change ports it to PySpark.

Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #6354 from MechCoder/spark-6390 and squashes the following commits:

fc4dc1e [MechCoder] Better error message
c940a44 [MechCoder] Added test
aa9c391 [MechCoder] Add pyUDT to MatrixUDT
62a2a7d [MechCoder] [SPARK-6390] Port MatrixUDT to PySpark
2015-06-17 11:10:16 -07:00
Liang-Chi Hsieh 104f30c36f [SPARK-7199] [SQL] Add date and timestamp support to UnsafeRow
JIRA: https://issues.apache.org/jira/browse/SPARK-7199

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #5984 from viirya/add_date_timestamp and squashes the following commits:

7f21ce9 [Liang-Chi Hsieh] For comment.
0b89698 [Liang-Chi Hsieh] Add timestamp to settableFieldTypes.
c30d490 [Liang-Chi Hsieh] Use default IntUnsafeColumnWriter and LongUnsafeColumnWriter.
672ef17 [Liang-Chi Hsieh] Remove getter/setter for Date and Timestamp and use Int and Long for them.
9f3e577 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into add_date_timestamp
281e844 [Liang-Chi Hsieh] Fix scala style.
fb532b5 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into add_date_timestamp
80af342 [Liang-Chi Hsieh] Fix compiling error.
f4f5de6 [Liang-Chi Hsieh] Fix scala style.
a463e83 [Liang-Chi Hsieh] Use Long to store timestamp for rows.
635388a [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into add_date_timestamp
46946c6 [Liang-Chi Hsieh] Adapt for moved DateUtils.
b16994e [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into add_date_timestamp
752251f [Liang-Chi Hsieh] Support setDate. Fix failed test.
fcf8db9 [Liang-Chi Hsieh] Add functions for Date and Timestamp to SpecificRow.
e42a809 [Liang-Chi Hsieh] Fix style.
4c07b57 [Liang-Chi Hsieh] Add date and timestamp support to UnsafeRow.
2015-06-17 09:00:37 -07:00
Vyacheslav Baranov c13da20a55 [SPARK-8309] [CORE] Support for more than 12M items in OpenHashMap
The problem occurs because the position mask `0xEFFFFFF` is incorrect: its 25th bit is zero, so once the capacity grows beyond 2^24, `OpenHashMap` computes an incorrect index into the `_values` array.

I've also added a size check in `rehash()`, so that it fails instead of reporting invalid item indices.
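To see why the zeroed bit matters, here is a minimal sketch of the masking arithmetic (the corrected mask value is an assumption for illustration, not taken from the patch):

```scala
object PositionMaskDemo extends App {
  val badMask  = 0xEFFFFFF      // bit 24 is zero by mistake
  val goodMask = (1 << 29) - 1  // assumed correction: all low position bits set
  val pos = 1 << 24             // a slot only reachable once capacity > 2^24
  println(pos & badMask)        // 0 -- the index is silently corrupted
  println(pos & goodMask)       // 16777216 -- the intended index
}
```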

Author: Vyacheslav Baranov <slavik.baranov@gmail.com>

Closes #6763 from SlavikBaranov/SPARK-8309 and squashes the following commits:

8557445 [Vyacheslav Baranov] Resolved review comments
4d5b954 [Vyacheslav Baranov] Resolved review comments
eaf1e68 [Vyacheslav Baranov] Fixed failing test
f9284fd [Vyacheslav Baranov] Resolved review comments
3920656 [Vyacheslav Baranov] SPARK-8309: Support for more than 12M items in OpenHashMap
2015-06-17 09:42:29 +01:00
Reynold Xin e3de14d3b2 Closes #6850. 2015-06-17 00:28:40 -07:00
dragonli bedff7d532 [SPARK-8220][SQL] Add positive identity function
chenghao-intel adrian-wang

Author: dragonli <lisurprise@gmail.com>
Author: zhichao.li <zhichao.li@intel.com>

Closes #6838 from zhichao-li/positive and squashes the following commits:

e1032a0 [dragonli] remove useless import and refactor code
624d438 [zhichao.li] add positive identity function
2015-06-16 23:44:10 -07:00
baishuo 0b8c8fdc12 [SPARK-8156] [SQL] create table in the database specified by 'use dbname'
When I test the following code:
```scala
hiveContext.sql("use testdb")
val df = (1 to 3).map(i => (i, s"val_$i", i * 2)).toDF("a", "b", "c")
df.write
  .format("parquet")
  .mode(SaveMode.Overwrite)
  .saveAsTable("ttt3")
hiveContext.sql("show TABLES in default")
```

I found that the table ttt3 was created under the database "default" instead of "testdb".

Author: baishuo <vc_java@hotmail.com>

Closes #6695 from baishuo/SPARK-8516-use-database and squashes the following commits:

9e155f9 [baishuo] remove no use comment
cb9f027 [baishuo] modify testcase
00a7a2d [baishuo] modify testcase
4df48c7 [baishuo] modify testcase
b742e69 [baishuo] modify testcase
3d19ad9 [baishuo] create table to specific database
2015-06-16 16:40:02 -07:00
Yanbo Liang ca998757e8 [SPARK-7916] [MLLIB] MLlib Python doc parity check for classification and regression
Check and update the MLlib Python classification and regression docs to be as complete as the Scala docs.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #6460 from yanboliang/spark-7916 and squashes the following commits:

f8deda4 [Yanbo Liang] trigger jenkins
6dc4d99 [Yanbo Liang] address comments
ce2a43e [Yanbo Liang] truncate too long line and remove extra sparse
3eaf6ad [Yanbo Liang] MLlib Python doc parity check for classification and regression
2015-06-16 14:30:30 -07:00
Marcelo Vanzin cebf241184 [SPARK-8126] [BUILD] Make sure temp dir exists when running tests.
If you ran "clean" at the top-level sbt project, the temp dir would
go away, so running "test" without restarting sbt would fail. This
fixes that by making sure the temp dir exists before running tests.
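A hedged sketch of such a guard (the path handling and names are illustrative, not the actual SparkBuild code):

```scala
import java.io.File

// Recreate the scratch directory if a prior `clean` removed it, so that
// `test` can run without restarting sbt.
def ensureTempDir(path: String): File = {
  val dir = new File(path)
  if (!dir.isDirectory && !dir.mkdirs()) {
    sys.error(s"Could not create temp dir at $path")
  }
  dir
}
```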

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #6805 from vanzin/SPARK-8126-fix and squashes the following commits:

12d7768 [Marcelo Vanzin] [SPARK-8126] [build] Make sure temp dir exists when running tests.
2015-06-16 21:10:18 +01:00
Radek Ostrowski 4bd10fd509 [SQL] [DOC] improved a comment
[SQL][DOC] I found it a bit confusing when I came across it for the first time in the docs

Author: Radek Ostrowski <dest.hawaii@gmail.com>
Author: radek <radek@radeks-MacBook-Pro-2.local>

Closes #6332 from radek1st/master and squashes the following commits:

dae3347 [Radek Ostrowski] fixed typo
c76bb3a [radek] improved a comment
2015-06-16 21:04:26 +01:00
Moussa Taifi dc455b8833 [SPARK-DOCS] [SPARK-SQL] Update sql-programming-guide.md
Typo in thriftserver section

Author: Moussa Taifi <moutai10@gmail.com>

Closes #6847 from moutai/patch-1 and squashes the following commits:

1bd29df [Moussa Taifi] Update sql-programming-guide.md
2015-06-16 20:59:22 +01:00
hushan[胡珊] 29c5025a70 [SPARK-8387] [WEBUI] Only show 4096 bytes of content for executor logs instead of showing everything
Author: hushan[胡珊] <hushan@xiaomi.com>

Closes #6834 from suyanNone/small-display and squashes the following commits:

744212f [hushan[胡珊]] Only show 4096 bytes of content for executor log instead of all
2015-06-16 20:48:33 +01:00
Kan Zhang 658814c898 [SPARK-8129] [CORE] [Sec] Pass auth secrets to executors via env variables
Environment variables are not visible to non-Spark users; this approach is based on a suggestion from vanzin.
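As a hedged sketch of the idea (the environment variable name below is hypothetical, for illustration only):

```scala
// Prefer a secret passed through the executor's environment over one stored in
// the conf, so it never shows up in command lines or on the web UI.
// "SPARK_AUTH_SECRET" is a hypothetical variable name.
def executorAuthSecret(conf: Map[String, String]): Option[String] =
  sys.env.get("SPARK_AUTH_SECRET")
    .orElse(conf.get("spark.authenticate.secret"))
```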

Author: Kan Zhang <kzhang@apache.org>

Closes #6774 from kanzhang/env and squashes the following commits:

5dd84c6 [Kan Zhang] remove auth secret conf from initial set up for executors
90cb7d2 [Kan Zhang] always filter out auth secret
af4d89d [Kan Zhang] minor refactoring
e88993e [Kan Zhang] pass auth secret to executors via env variable
2015-06-16 08:18:26 +02:00
huangzhaowei ccf010f27b [SPARK-8367] [STREAMING] Add a limit for `spark.streaming.blockInterval` to prevent a data loss bug.
The bug was reported in JIRA [SPARK-8367](https://issues.apache.org/jira/browse/SPARK-8367).
The resolution is limiting the configuration `spark.streaming.blockInterval` to a positive number.
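A minimal sketch of the check (the commit list below indicates `require` is used; the default shown is the documented 200ms, and the exact message is an assumption):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
val blockIntervalMs = conf.getTimeAsMs("spark.streaming.blockInterval", "200ms")
// Fail fast on a non-positive interval instead of silently losing data.
require(blockIntervalMs > 0,
  s"'spark.streaming.blockInterval' should be a positive value, got $blockIntervalMs ms")
```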

Author: huangzhaowei <carlmartinmax@gmail.com>
Author: huangzhaowei <SaintBacchus@users.noreply.github.com>

Closes #6818 from SaintBacchus/SPARK-8367 and squashes the following commits:

c9d1927 [huangzhaowei] Update BlockGenerator.scala
bd3f71a [huangzhaowei] Use require instead of if
3d17796 [huangzhaowei] [SPARK_8367][Streaming]Add a limit for 'spark.streaming.blockInterval' since a data loss bug.
2015-06-16 08:16:09 +02:00
Davies Liu bc76a0f750 [SPARK-7184] [SQL] enable codegen by default
In order to have better performance out of the box, this PR turns on codegen by default, so that codegen can be exercised by sql/test and hive/test.

This PR also fixes some corner cases in codegen.

Before the 1.5 release, we should revisit this and turn it off if it's not stable or is causing regressions.

cc rxin JoshRosen

Author: Davies Liu <davies@databricks.com>

Closes #6726 from davies/enable_codegen and squashes the following commits:

f3b25a5 [Davies Liu] fix warning
73750ea [Davies Liu] fix long overflow when compare
3017a47 [Davies Liu] Merge branch 'master' of github.com:apache/spark into enable_codegen
a7d75da [Davies Liu] Merge branch 'master' of github.com:apache/spark into enable_codegen
ff5b75a [Davies Liu] Merge branch 'master' of github.com:apache/spark into enable_codegen
f4cf2c2 [Davies Liu] fix style
99fc139 [Davies Liu] Merge branch 'enable_codegen' of github.com:davies/spark into enable_codegen
91fc7a2 [Davies Liu] disable codegen for ScalaUDF
207e339 [Davies Liu] Update CodeGenerator.scala
44573a3 [Davies Liu] check thread safety of expression
f3886fa [Davies Liu] don't inline primitiveTerm for null literal
c8e7cd2 [Davies Liu] address comment
a8618c9 [Davies Liu] enable codegen by default
2015-06-15 23:03:14 -07:00
tedyu 1a62d61696 SPARK-8336 Fix NullPointerException with functions.rand()
This PR fixes the problem reported by Justin Yip in the thread 'NullPointerException with functions.rand()'

Tested using spark-shell and verified that the following works:

```scala
sqlContext.createDataFrame(Seq((1, 2), (3, 100))).withColumn("index", rand(30)).show()
```

Author: tedyu <yuzhihong@gmail.com>

Closes #6793 from tedyu/master and squashes the following commits:

62fd97b [tedyu] Create RandomSuite
750f92c [tedyu] Add test for Rand() with seed
a1d66c5 [tedyu] Fix NullPointerException with functions.rand()
2015-06-15 17:00:38 -07:00
Yadong Qi 6ae21a944a [SPARK-6583] [SQL] Support aggregate functions in ORDER BY
Add aggregates in ORDER BY clauses to the `Aggregate` operator beneath.  Project these results away after the Sort.

Based on work by watermen.  Also Closes #5290.
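For example, a query of this shape (table and column names are illustrative) is now planned with the aggregate computed beneath the Sort and projected away afterwards:

```scala
// Aggregate expressions may now appear directly in ORDER BY.
sqlContext.sql(
  """SELECT key
    |FROM src
    |GROUP BY key
    |ORDER BY max(value)""".stripMargin)
```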

Author: Yadong Qi <qiyadong2010@gmail.com>
Author: Michael Armbrust <michael@databricks.com>

Closes #6816 from marmbrus/pr/5290 and squashes the following commits:

3226a97 [Michael Armbrust] consistent ordering
eb8938d [Michael Armbrust] no vars
c8b25c1 [Yadong Qi] move the test data.
7f9b736 [Yadong Qi] delete Substring case
a1e87c1 [Yadong Qi] fix conflict
f119849 [Yadong Qi] order by aggregated function
2015-06-15 12:01:52 -07:00
andrewor14 56d4e8a2d0 [SPARK-8350] [R] Log R unit test output to "unit-tests.log"
Right now it's logged to "R-unit-tests.log". Jenkins currently only archives files named "unit-tests.log", and this is what all other modules (e.g. SQL, network, REPL) use.
1. We should be consistent
2. I don't want to reconfigure Jenkins to accept a different file

shivaram

Author: andrewor14 <andrew@databricks.com>
Author: Andrew Or <andrew@databricks.com>

Closes #6807 from andrewor14/r-logs and squashes the following commits:

96005d2 [andrewor14] Nest unit-tests.log further until R
407c46c [andrewor14] Add target to log path
d7b68ae [Andrew Or] Log R unit test output to "unit-tests.log"
2015-06-15 08:16:22 -07:00
Nicholas Chammas 4c5889e8f5 [SPARK-8316] Upgrade to Maven 3.3.3
Versions of Maven older than 3.3.0 apparently have [a bug in how they handle transitive dependencies](https://github.com/apache/spark/pull/6492#issuecomment-111001101).

I confirmed that upgrading to Maven 3.3.3 resolves at least the particular manifestation of this bug that I ran into.

Author: Nicholas Chammas <nicholas.chammas@gmail.com>

Closes #6770 from nchammas/maven-333 and squashes the following commits:

6bed2d9 [Nicholas Chammas] upgrade to Maven 3.3.3
2015-06-15 08:18:01 +01:00
Marcelo Vanzin 4eb48ed1da [SPARK-8065] [SQL] Add support for Hive 0.14 metastores
This change has two parts.

The first one gets rid of "ReflectionMagic". That worked well for the differences between 0.12 and 0.13, but it breaks in 0.14, since some of the APIs that need to be used take primitive types, and I could not figure out a way to make that class work with primitives. So instead I wrote some shims (I can already hear the collective sigh) that find the appropriate methods via reflection. This should be faster, since the method instances are cached, and the code is not much uglier than before, with the advantage that all the ugliness is local to one file (instead of multiple switch statements on the version being used scattered throughout ClientWrapper).

The second part is simple: add code to handle Hive 0.14. A few new methods had to be added to the new shims.
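A minimal sketch of such a shim, with the reflected `Method` cached after the first lookup (the class and method names below are illustrative, not the actual shim code):

```scala
import java.lang.reflect.Method

// Look the target method up once via reflection, cache the Method instance,
// and reuse it on every call. classOf[Int] maps to the primitive int class,
// which is what broke the old ReflectionMagic approach.
class MethodShim(clazz: Class[_], name: String, argTypes: Class[_]*) {
  private lazy val method: Method = clazz.getMethod(name, argTypes: _*)
  def invoke(target: AnyRef, args: AnyRef*): AnyRef = method.invoke(target, args: _*)
}
```

For example, `new MethodShim(classOf[String], "substring", classOf[Int]).invoke("hello", Int.box(2))` returns `"llo"` while paying the lookup cost only once.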

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #6627 from vanzin/SPARK-8065 and squashes the following commits:

3fa4270 [Marcelo Vanzin] Indentation style.
4b8a3d4 [Marcelo Vanzin] Fix dep exclusion.
be3d0cc [Marcelo Vanzin] Merge branch 'master' into SPARK-8065
ca3fb1e [Marcelo Vanzin] Merge branch 'master' into SPARK-8065
b43f13e [Marcelo Vanzin] Since exclusions seem to work, clean up some of the code.
73bd161 [Marcelo Vanzin] Botched merge.
d2ddf01 [Marcelo Vanzin] Comment about excluded dep.
0c929d1 [Marcelo Vanzin] Merge branch 'master' into SPARK-8065
2c3c02e [Marcelo Vanzin] Try to fix tests by adding support for exclusions.
0a03470 [Marcelo Vanzin] Try to fix tests by upgrading calcite dependency.
13b2dfa [Marcelo Vanzin] Fix NPE.
6439d88 [Marcelo Vanzin] Minor style thing.
69b017b [Marcelo Vanzin] Style.
a21cad8 [Marcelo Vanzin] Part II: Add shims / version for Hive 0.14.
ae98c87 [Marcelo Vanzin] PART I: Get rid of reflection magic.
2015-06-14 11:49:22 -07:00
Peter Hoffmann f3f2a4397d fix read/write mixup
Author: Peter Hoffmann <ph@peter-hoffmann.com>

Closes #6815 from hoffmann/patch-1 and squashes the following commits:

2abb6da [Peter Hoffmann] fix read/write mixup
2015-06-14 11:41:16 -07:00
Reynold Xin 53c16b92a5 [SPARK-8362] [SQL] Add unit tests for +, -, *, /, %
Added unit tests for all supported data types for:
- Add
- Subtract
- Multiply
- Divide
- UnaryMinus
- Remainder

Fixed bugs caught by the unit tests.
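In sketch form, each case follows the usual expression-test pattern (assuming the `checkEvaluation` helper available to catalyst expression suites; shown as an illustration, not a quote from the patch):

```scala
import org.apache.spark.sql.catalyst.expressions._

// Evaluate an expression tree and compare against the expected result.
checkEvaluation(Add(Literal(1), Literal(2)), 3)
checkEvaluation(UnaryMinus(Literal(5)), -5)
checkEvaluation(Remainder(Literal(7), Literal(3)), 1)
```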

Author: Reynold Xin <rxin@databricks.com>

Closes #6813 from rxin/SPARK-8362 and squashes the following commits:

fb3fe62 [Reynold Xin] Added Remainder.
3b266ba [Reynold Xin] [SPARK-8362] Add unit tests for +, -, *, /.
2015-06-14 11:23:23 -07:00
Michael Armbrust 9073a426e4 [SPARK-8358] [SQL] Wait for child resolution when resolving generators
Author: Michael Armbrust <michael@databricks.com>

Closes #6811 from marmbrus/aliasExplodeStar and squashes the following commits:

fbd2065 [Michael Armbrust] more style
806a373 [Michael Armbrust] fix style
7cbb530 [Michael Armbrust] [SPARK-8358][SQL] Wait for child resolution when resolving generators
2015-06-14 11:21:42 -07:00
Josh Rosen ea7fd2ff64 [SPARK-8354] [SQL] Fix off-by-factor-of-8 error when allocating scratch space in UnsafeFixedWidthAggregationMap
UnsafeFixedWidthAggregationMap contains an off-by-factor-of-8 error when allocating row conversion scratch space: we take a size requirement, measured in bytes, then allocate a long array of that size.  This means that we end up allocating 8x too much conversion space.

This patch fixes this by allocating a `byte[]` array instead.  This doesn't impose any new limitations on the maximum sizes of UnsafeRows, since UnsafeRowConverter already used integers when calculating the size requirements for rows.
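The unit mismatch in miniature (illustrative, not the patch itself):

```scala
val sizeRequirementInBytes = 1024
val before = new Array[Long](sizeRequirementInBytes) // 1024 longs = 8192 bytes: 8x too much
val after  = new Array[Byte](sizeRequirementInBytes) // 1024 bytes, as intended
```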

Author: Josh Rosen <joshrosen@databricks.com>

Closes #6809 from JoshRosen/sql-bytes-vs-words-fix and squashes the following commits:

6520339 [Josh Rosen] Updates to reflect fact that UnsafeRow max size is constrained by max byte[] size
2015-06-14 09:34:35 -07:00
Liang-Chi Hsieh cb7ada1196 [SPARK-8342][SQL] Fix Decimal setOrNull
JIRA: https://issues.apache.org/jira/browse/SPARK-8342

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #6797 from viirya/fix_decimal and squashes the following commits:

8a447b1 [Liang-Chi Hsieh] Add unit test.
d67a5ea [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into fix_decimal
ab6d8af [Liang-Chi Hsieh] Fix setOrNull.
2015-06-13 22:42:28 -07:00
Mike Dusenberry 35d1267cf8 [SPARK-8343] [STREAMING] [DOCS] Improve Spark Streaming Guides.
This improves the Spark Streaming Guides by fixing broken links, rewording confusing sections, fixing typos, adding missing words, etc.

Author: Mike Dusenberry <dusenberrymw@gmail.com>

Closes #6801 from dusenberrymw/SPARK-8343_Improve_Spark_Streaming_Guides_MERGED and squashes the following commits:

6688090 [Mike Dusenberry] Improvements to the Spark Streaming Custom Receiver Guide, including slight rewording of confusing sections, and fixing typos & missing words.
436fbd8 [Mike Dusenberry] Bunch of improvements to the Spark Streaming Guide, including fixing broken links, slight rewording of confusing sections, fixing typos & missing words, etc.
2015-06-13 21:22:46 -07:00
Reynold Xin 2d71ba4c8a [SPARK-8349] [SQL] Use expression constructors (rather than apply) in FunctionRegistry
Author: Reynold Xin <rxin@databricks.com>

Closes #6806 from rxin/gs and squashes the following commits:

ed1aebb [Reynold Xin] Fixed style.
c7fc3e6 [Reynold Xin] [SPARK-8349][SQL] Use expression constructors (rather than apply) in FunctionRegistry
2015-06-13 18:22:17 -07:00
Reynold Xin a138953391 [SPARK-8347][SQL] Add unit tests for abs.
Also addressed code review feedback from #6754

Author: Reynold Xin <rxin@databricks.com>

Closes #6803 from rxin/abs and squashes the following commits:

d07beba [Reynold Xin] [SPARK-8347] Add unit tests for abs.
2015-06-13 17:10:13 -07:00
Liang-Chi Hsieh ddec45279e [SPARK-8052] [SQL] Use java.math.BigDecimal for casting String to Decimal instead of using toDouble
JIRA: https://issues.apache.org/jira/browse/SPARK-8052
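The motivation in miniature: routing through `Double` loses precision for long decimal strings, while `java.math.BigDecimal` parses them exactly:

```scala
val s = "12345678901234567890"
println(s.toDouble)                  // 1.2345678901234567E19 -- precision lost
println(new java.math.BigDecimal(s)) // 12345678901234567890 -- exact
```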

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #6645 from viirya/cast_string_integraltype and squashes the following commits:

e19c6a3 [Liang-Chi Hsieh] For comment.
c3e472a [Liang-Chi Hsieh] Add test.
7ced9b0 [Liang-Chi Hsieh] Use java.math.BigDecimal for casting String to Decimal instead of using toDouble.
2015-06-13 16:39:52 -07:00
Josh Rosen af31335adc [SPARK-8319] [CORE] [SQL] Update logic related to key orderings in shuffle dependencies
This patch updates two pieces of logic that are related to handling of keyOrderings in ShuffleDependencies:

- The Tungsten ShuffleManager falls back to regular SortShuffleManager whenever the shuffle dependency specifies a key ordering, but technically we only need to fall back when an aggregator is also specified. This patch updates the fallback logic to reflect this so that the Tungsten optimizations can apply to more workloads.

- The SQL Exchange operator performs defensive copying of shuffle inputs when a key ordering is specified, but this is unnecessary. The copying was added to guard against cases where ExternalSorter would buffer non-serialized records in memory.  When ExternalSorter is configured without an aggregator, it uses the following logic to determine whether to buffer records in a serialized or deserialized format:

   ```scala
     private val useSerializedPairBuffer =
        ordering.isEmpty &&
        conf.getBoolean("spark.shuffle.sort.serializeMapOutputs", true) &&
        ser.supportsRelocationOfSerializedObjects
   ```

   The `newOrdering.isDefined` branch in `ExternalSorter.needToCopyObjectsBeforeShuffle`, removed by this patch, is not necessary:

   - It was checked even if we weren't using sort-based shuffle, but this was unnecessary because only SortShuffleManager performs map-side sorting.
   - Map-side sorting during shuffle writing is only performed for shuffles that perform map-side aggregation as part of the shuffle (to see this, look at how SortShuffleWriter constructs ExternalSorter).  Since SQL never pushes aggregation into Spark's shuffle, we can guarantee that both the aggregator and ordering will be empty and Spark SQL always uses serializers that support relocation, so sort-shuffle will use the serialized pair buffer unless the user has explicitly disabled it via the SparkConf feature-flag.  Therefore, I think my optimization in Exchange should be safe.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #6773 from JoshRosen/SPARK-8319 and squashes the following commits:

7a14129 [Josh Rosen] Revise comments; add handler to guard against future ShuffleManager implementations
07bb2c9 [Josh Rosen] Update comment to clarify circumstances under which shuffle operates on serialized records
269089a [Josh Rosen] Avoid unnecessary copy in SQL Exchange
34e526e [Josh Rosen] Enable Tungsten shuffle for non-agg shuffles w/ key orderings
2015-06-13 16:14:24 -07:00
Davies Liu ce1041c38f [SPARK-8346] [SQL] Use InternalRow instead of catalyst.InternalRow
cc rxin marmbrus

Author: Davies Liu <davies@databricks.com>

Closes #6802 from davies/cleanup_internalrow and squashes the following commits:

769d2aa [Davies Liu] remove not needed cast
4acbbe4 [Davies Liu] catalyst.Internal -> InternalRow
2015-06-13 16:13:26 -07:00
Rene Treffer d986fb9a37 [SPARK-7897] Improve type for jdbc/"unsigned bigint"
The original fix used DecimalType.Unlimited, which is harder to handle afterwards. There is no scale, and most values fit into a long, so DecimalType(20, 0) is the better choice.
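The reasoning, sketched: an unsigned BIGINT can reach 2^64 - 1 (20 decimal digits), which overflows a signed long, so a zero-scale decimal with precision 20 covers the full range:

```scala
import org.apache.spark.sql.types.DecimalType

// 2^64 - 1 = 18446744073709551615 has 20 digits; scale is 0 since the type is integral.
val unsignedBigintCatalystType = DecimalType(20, 0)
```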

Author: Rene Treffer <treffer@measite.de>

Closes #6789 from rtreffer/spark-7897-unsigned-bigint-as-decimal and squashes the following commits:

2006613 [Rene Treffer] Fix type for "unsigned bigint" jdbc loading.
2015-06-13 11:58:22 -07:00
Michael Armbrust 4aed66f299 [SPARK-8329][SQL] Allow _ in DataSource options
Author: Michael Armbrust <michael@databricks.com>

Closes #6786 from marmbrus/optionsParser and squashes the following commits:

e7d18ef [Michael Armbrust] add dots
99a3452 [Michael Armbrust] [SPARK-8329][SQL] Allow _ in DataSource options
2015-06-12 23:11:16 -07:00
Davies Liu d46f8e5d4b [SPARK-7186] [SQL] Decouple internal Row from external Row
Currently, we use o.a.s.sql.Row both internally and externally. The external interface is wider than what the internals need, because it is designed to facilitate end-user programming. This design has proven to be very error prone and cumbersome for internal Row implementations.

As a first step, we create an InternalRow interface in the catalyst module, which is identical to the current Row interface, and we switch all internal operators/expressions to use InternalRow instead. When we need to expose Row, we convert the InternalRow implementation into Row for users.

For all public APIs, we use Row (for example, the data source APIs), which is converted into/from InternalRow by CatalystTypeConverters.

For all internal data sources (Json, Parquet, JDBC, Hive), we use InternalRow for better performance, cast into Row in buildScan() (without changing the public API). When creating a PhysicalRDD, we cast them back to InternalRow.
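A sketch of the boundary conversion this describes (the schema and row below are illustrative):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.CatalystTypeConverters
import org.apache.spark.sql.types._

// External Rows are converted to the internal representation at the API boundary.
val schema = StructType(Seq(StructField("a", IntegerType), StructField("b", StringType)))
val internal = CatalystTypeConverters.convertToCatalyst(Row(1, "x"), schema)
```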

cc rxin marmbrus JoshRosen

Author: Davies Liu <davies@databricks.com>

Closes #6792 from davies/internal_row and squashes the following commits:

f2abd13 [Davies Liu] fix scalastyle
a7e025c [Davies Liu] move InternalRow into catalyst
30db8ba [Davies Liu] Merge branch 'master' of github.com:apache/spark into internal_row
7cbced8 [Davies Liu] separate Row and InternalRow
2015-06-12 23:06:31 -07:00
Roger Menezes 6e9c3ff1ec [SPARK-8314][MLlib] improvement in performance of MLUtils.appendBias
MLUtils.appendBias is heavily used to create intercepts for linear models. It used Breeze's vector concatenation, which is very slow compared to a plain System.arraycopy, so this change switches the implementation to System.arraycopy.

I saw the following improvements after the change, benchmarking with the MNIST dataset over 50 runs:
MLUtils.appendBias (SparseVector, before): 47320 ms
MLUtils.appendBias (SparseVector, after): 1935 ms
MLUtils.appendBias (DenseVector, before): 5340 ms
MLUtils.appendBias (DenseVector, after): 4080 ms
That is almost a 24x performance boost for SparseVector.
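A minimal sketch of the dense case (the real MLUtils.appendBias also handles SparseVector):

```scala
// Append a 1.0 bias term with System.arraycopy instead of Breeze concatenation.
def appendBiasDense(values: Array[Double]): Array[Double] = {
  val out = new Array[Double](values.length + 1)
  System.arraycopy(values, 0, out, 0, values.length)
  out(values.length) = 1.0
  out
}
```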

Author: Roger Menezes <rmenezes@netflix.com>

Closes #6768 from rogermenezes/improve-append-bias and squashes the following commits:

4e42f75 [Roger Menezes] address feedback
e999d79 [Roger Menezes] first commit
2015-06-12 18:29:58 -07:00
Tathagata Das e9471d3414 [SPARK-7284] [STREAMING] Updated streaming documentation
- Kinesis API updated
- Kafka version updated, and Python API for Direct Kafka added
- Added SQLContext.getOrCreate()
- Added information on how to get partitionId in foreachRDD

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #6781 from tdas/SPARK-7284 and squashes the following commits:

aac7be0 [Tathagata Das] Added information on how to get partition id
a66ec22 [Tathagata Das] Complete the incomplete line
a92ca39 [Tathagata Das] Updated streaming documentation
2015-06-12 15:22:59 -07:00
Andrew Or 8860405151 [SPARK-8330] DAG visualization: trim whitespace from input
Safeguard against DOM rewriting.

Author: Andrew Or <andrew@databricks.com>

Closes #6787 from andrewor14/dag-viz-trim and squashes the following commits:

0fb4afe [Andrew Or] Trim input metadata from DOM
2015-06-12 11:14:55 -07:00
akhilthatipamula 19834fa918 [SPARK-7993] [SQL] Improved DataFrame.show() output
Closes #6633

Author: akhilthatipamula <130050068@iitb.ac.in>
Author: zsxwing <zsxwing@gmail.com>

Closes #6784 from zsxwing/pr6633 and squashes the following commits:

5da1c51 [zsxwing] Address comments and add unit tests
17eab7b [akhilthatipamula] refactored code
19874b3 [akhilthatipamula] Update DataFrame.scala
0a76a5e [akhilthatipamula] Optimised showString()
e3dd03f [akhilthatipamula] Modified showString() method
a21012b [akhilthatipamula] improved the show()
4bb742f [akhilthatipamula] Modified dataframe.show() method
2015-06-12 10:40:28 -07:00
Mark Smith 71cc17bfa7 [SPARK-8322] [EC2] Added spark 1.4.0 into the VALID_SPARK_VERSIONS and SPARK_TACHYON_MAP

This contribution is my original work and I license the work to the project under the project's open source license.

Author: Mark Smith <mark.smith@bronto.com>

Closes #6776 from markmsmith/SPARK-8322 and squashes the following commits:

d744244 [Mark Smith] [SPARK-8322][EC2] Fixed tachyon map entry to point to 0.6.4
e4f14d3 [Mark Smith] [SPARK-8322][EC2] Added spark 1.4.0 into the VALID_SPARK_VERSIONS and SPARK_TACHYON_MAP
2015-06-12 08:19:03 -07:00
Wenchen Fan c19c78577a [SQL] [MINOR] correct semanticEquals logic
It's a follow-up of https://github.com/apache/spark/pull/6173 for expressions like `Coalesce` that hold a `Seq[Expression]`: when we do a semantic-equality check on such an expression, we need to check all of its children. We can also just use `Seq[(Expression, NamedExpression)]` instead of `Map[Expression, NamedExpression]`, as we only search it with `find`.

chenghao-intel, I agree that we can probably never define `semanticEquals` in a fully general way, but we have done something similar in `TreeNode`, so we can use the same kind of logic here and handle expressions like `Coalesce(children: Seq[Expression])` correctly.
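A simplified sketch of the child-wise check (the real implementation also normalizes cosmetic differences such as expression IDs):

```scala
import org.apache.spark.sql.catalyst.expressions.Expression

// Two expressions are semantically equal only if they are the same kind of
// node and all of their children are pairwise semantically equal.
def semanticEquals(a: Expression, b: Expression): Boolean =
  a.getClass == b.getClass &&
    a.children.length == b.children.length &&
    a.children.zip(b.children).forall { case (l, r) => semanticEquals(l, r) }
```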

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #6261 from cloud-fan/tmp and squashes the following commits:

4daef88 [Wenchen Fan] address comments
dd8fbd9 [Wenchen Fan] correct semanticEquals
2015-06-12 16:38:28 +08:00
Yash Datta e428b3a951 [SPARK-6566] [SQL] Related changes for newer parquet version
This brings in a major improvement: footers are no longer read on the driver. This also cleans up the code in ParquetTableOperations, where we had to override getSplits to eliminate multiple listStatus calls.

cc liancheng

Are there any other changes we need for this?

Author: Yash Datta <Yash.Datta@guavus.com>

Closes #5889 from saucam/parquet_1.6 and squashes the following commits:

d1bf41e [Yash Datta] SPARK-7340: Fix scalastyle and incorporate review comments
c9aa042 [Yash Datta] SPARK-7340: Use the new user defined filter predicate for pushing down inset into parquet
56bc750 [Yash Datta] SPARK-7340: Change parquet version to latest release
2015-06-12 13:44:09 +08:00
zhichao.li 2dd7f93080 [SPARK-7862] [SQL] Fix the deadlock in script transformation for stderr
[Related PR SPARK-7044](https://github.com/apache/spark/pull/5671)

Author: zhichao.li <zhichao.li@intel.com>

Closes #6404 from zhichao-li/transform and squashes the following commits:

8418c97 [zhichao.li] add comments and remove useless failAfter logic
d9677e1 [zhichao.li] redirect the error destination to be the same as the current process
2015-06-11 22:28:28 -07:00
Josh Rosen b9d177c511 [SPARK-8317] [SQL] Do not push sort into shuffle in Exchange operator
In some cases, Spark SQL pushes sorting operations into the shuffle layer by specifying a key ordering as part of the shuffle dependency. I think that we should not do this:

- Since we do not delegate aggregation to Spark's shuffle, specifying the keyOrdering as part of the shuffle has no effect on the shuffle map side.
- By performing the sort ourselves (by inserting a sort operator after the shuffle instead), we can use the Exchange planner to choose specialized sorting implementations based on the types of rows being sorted.
- We can remove some complexity from SqlSerializer2 by not requiring it to know about sort orderings, since SQL's own sort operators will already perform the necessary defensive copying.

This patch removes Exchange's `canSortWithShuffle` path and the associated code in `SqlSerializer2`.  Shuffles that used to go through the `canSortWithShuffle` path would always wind up using Spark's `ExternalSorter` (inside of `HashShuffleReader`); to avoid a performance regression as a result of handling these shuffles ourselves, I've changed the SQLConf defaults so that external sorting is enabled by default.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #6772 from JoshRosen/SPARK-8317 and squashes the following commits:

ebf9c0f [Josh Rosen] Do not push sort into shuffle in Exchange operator
bf3b4c8 [Josh Rosen] Enable external sort by default
2015-06-11 22:15:15 -07:00
Cheng Hao 767cc94ca6 [SPARK-7158] [SQL] Fix bug where cached data cannot be used in collect() after cache()
When the df.cache() method is called, the `withCachedData` of its `QueryExecution` has already been created, which means it will not look up the cached tables when an action method is called afterward.

Author: Cheng Hao <hao.cheng@intel.com>

Closes #5714 from chenghao-intel/SPARK-7158 and squashes the following commits:

58ea8aa [Cheng Hao] style issue
2bf740f [Cheng Hao] create new QueryExecution instance for CacheManager
a5647d9 [Cheng Hao] hide the queryExecution of DataFrame
fbfd3c5 [Cheng Hao] make the DataFrame.queryExecution mutable for cache/persist/unpersist
2015-06-11 18:01:47 -07:00
Reynold Xin 337c16d57e [SQL] Miscellaneous SQL/DF expression changes.
SPARK-8201 conditional function: if
SPARK-8205 conditional function: nvl
SPARK-8208 math function: ceiling
SPARK-8210 math function: degrees
SPARK-8211 math function: radians
SPARK-8219 math function: negative
SPARK-8216 math function: rename log -> ln
SPARK-8222 math function: alias power / pow
SPARK-8225 math function: alias sign / signum
SPARK-8228 conditional function: isnull
SPARK-8229 conditional function: isnotnull
SPARK-8250 string function: alias lower/lcase
SPARK-8251 string function: alias upper / ucase

Author: Reynold Xin <rxin@databricks.com>

Closes #6754 from rxin/expressions-misc and squashes the following commits:

35fce15 [Reynold Xin] Removed println.
2647067 [Reynold Xin] Promote to string type.
3c32bbc [Reynold Xin] Fixed if.
de827ac [Reynold Xin] Fixed style
b201cd4 [Reynold Xin] Removed if.
6b21a9b [Reynold Xin] [SQL] Miscellaneous SQL/DF expression changes.
2015-06-11 17:06:21 -07:00
Zhongshuai Pei 7914c720bf [SPARK-7824] [SQL] Collapse operator reordering and constant folding into a single batch.
SQL
```
select * from tableA join tableB on (a > 3 and b = d) or (a > 3 and b = e)
```
Plan before modify
```
== Optimized Logical Plan ==
Project [a#293,b#294,c#295,d#296,e#297]
 Join Inner, Some(((a#293 > 3) && ((b#294 = d#296) || (b#294 = e#297))))
  MetastoreRelation default, tablea, None
  MetastoreRelation default, tableb, None
```
Plan after modify
```
== Optimized Logical Plan ==
Project [a#293,b#294,c#295,d#296,e#297]
 Join Inner, Some(((b#294 = d#296) || (b#294 = e#297)))
  Filter (a#293 > 3)
   MetastoreRelation default, tablea, None
  MetastoreRelation default, tableb, None
```

CombineLimits rewrites to `Limit(If(LessThan(ne, le), ne, le), grandChild)`, and `LessThan` is handled by BooleanSimplification, so CombineLimits must run before BooleanSimplification, and BooleanSimplification must run before PushPredicateThroughJoin.

Author: Zhongshuai Pei <799203320@qq.com>
Author: DoingDone9 <799203320@qq.com>

Closes #6351 from DoingDone9/master and squashes the following commits:

20de7be [Zhongshuai Pei] Update Optimizer.scala
7bc7d28 [Zhongshuai Pei] Merge pull request #17 from apache/master
0ba5f42 [Zhongshuai Pei] Update Optimizer.scala
f8b9314 [Zhongshuai Pei] Update FilterPushdownSuite.scala
c529d9f [Zhongshuai Pei] Update FilterPushdownSuite.scala
ae3af6d [Zhongshuai Pei] Update FilterPushdownSuite.scala
a04ffae [Zhongshuai Pei] Update Optimizer.scala
11beb61 [Zhongshuai Pei] Update FilterPushdownSuite.scala
f2ee5fe [Zhongshuai Pei] Update Optimizer.scala
be6b1d5 [Zhongshuai Pei] Update Optimizer.scala
b01e622 [Zhongshuai Pei] Merge pull request #15 from apache/master
8df716a [Zhongshuai Pei] Update FilterPushdownSuite.scala
d98bc35 [Zhongshuai Pei] Update FilterPushdownSuite.scala
fa65718 [Zhongshuai Pei] Update Optimizer.scala
ab8e9a6 [Zhongshuai Pei] Merge pull request #14 from apache/master
14952e2 [Zhongshuai Pei] Merge pull request #13 from apache/master
f03fe7f [Zhongshuai Pei] Merge pull request #12 from apache/master
f12fa50 [Zhongshuai Pei] Merge pull request #10 from apache/master
f61210c [Zhongshuai Pei] Merge pull request #9 from apache/master
34b1a9a [Zhongshuai Pei] Merge pull request #8 from apache/master
802261c [DoingDone9] Merge pull request #7 from apache/master
d00303b [DoingDone9] Merge pull request #6 from apache/master
98b134f [DoingDone9] Merge pull request #5 from apache/master
161cae3 [DoingDone9] Merge pull request #4 from apache/master
c87e8b6 [DoingDone9] Merge pull request #3 from apache/master
cb1852d [DoingDone9] Merge pull request #2 from apache/master
c3f046f [DoingDone9] Merge pull request #1 from apache/master
2015-06-11 17:01:07 -07:00
Reynold Xin 7d669a56ff [SPARK-8286] Rewrite UTF8String in Java and move it into unsafe package.
Unit test is still in Scala.

Author: Reynold Xin <rxin@databricks.com>

Closes #6738 from rxin/utf8string-java and squashes the following commits:

562dc6e [Reynold Xin] Flag...
98e600b [Reynold Xin] Another try with encoding setting ..
cfa6bdf [Reynold Xin] Merge branch 'master' into utf8string-java
a3b124d [Reynold Xin] Try different UTF-8 encoded characters.
1ff7c82 [Reynold Xin] Enable UTF-8 encoding.
82d58cc [Reynold Xin] Reset run-tests.
2cb3c69 [Reynold Xin] Use utf-8 encoding in set bytes.
53f8ef4 [Reynold Xin] Hack Jenkins to run one test.
9a48e8d [Reynold Xin] Fixed runtime compilation error.
911c450 [Reynold Xin] Moved unit test also to Java.
4eff7bd [Reynold Xin] Improved unit test coverage.
8e89a3c [Reynold Xin] Fixed tests.
77c64bd [Reynold Xin] Fixed string type codegen.
ffedb62 [Reynold Xin] Code review feedback.
0967ce6 [Reynold Xin] Fixed import ordering.
45a123d [Reynold Xin] [SPARK-8286] Rewrite UTF8String in Java and move it into unsafe package.
2015-06-11 16:07:15 -07:00
Marcelo Vanzin 9cbdf31ec1 [SPARK-6511] [docs] Fix example command in hadoop-provided docs.
Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #6766 from vanzin/SPARK-6511 and squashes the following commits:

49f0f67 [Marcelo Vanzin] [SPARK-6511] [docs] Fix example command in hadoop-provided docs.
2015-06-11 15:29:03 -07:00
zsxwing 95690a17d3 [SPARK-7444] [TESTS] Eliminate noisy css warn/error logs for UISeleniumSuite
Eliminate the following noisy logs for `UISeleniumSuite`:
```
15/05/07 10:09:50.196 pool-1-thread-1-ScalaTest-running-UISeleniumSuite WARN DefaultCssErrorHandler: CSS error: 'http://192.168.0.170:4040/static/bootstrap.min.css' [793:167] Error in style rule. (Invalid token "*". Was expecting one of: <EOF>, <S>, <IDENT>, "}", ";".)
15/05/07 10:09:50.196 pool-1-thread-1-ScalaTest-running-UISeleniumSuite WARN DefaultCssErrorHandler: CSS warning: 'http://192.168.0.170:4040/static/bootstrap.min.css' [793:167] Ignoring the following declarations in this rule.
15/05/07 10:09:50.197 pool-1-thread-1-ScalaTest-running-UISeleniumSuite WARN DefaultCssErrorHandler: CSS error: 'http://192.168.0.170:4040/static/bootstrap.min.css' [799:325] Error in style rule. (Invalid token "*". Was expecting one of: <EOF>, <S>, <IDENT>, "}", ";".)
15/05/07 10:09:50.197 pool-1-thread-1-ScalaTest-running-UISeleniumSuite WARN DefaultCssErrorHandler: CSS warning: 'http://192.168.0.170:4040/static/bootstrap.min.css' [799:325] Ignoring the following declarations in this rule.
15/05/07 10:09:50.198 pool-1-thread-1-ScalaTest-running-UISeleniumSuite WARN DefaultCssErrorHandler: CSS error: 'http://192.168.0.170:4040/static/bootstrap.min.css' [805:18] Error in style rule. (Invalid token "*". Was expecting one of: <EOF>, <S>, <IDENT>, "}", ";".)
15/05/07 10:09:50.198 pool-1-thread-1-ScalaTest-running-UISeleniumSuite WARN DefaultCssErrorHandler: CSS warning: 'http://192.168.0.170:4040/static/bootstrap.min.css' [805:18] Ignoring the following declarations in this rule.
```

Author: zsxwing <zsxwing@gmail.com>

Closes #5983 from zsxwing/SPARK-7444 and squashes the following commits:

4202728 [zsxwing] Add SparkUICssErrorHandler for all tests
d1398ad [zsxwing] Merge remote-tracking branch 'origin/master' into SPARK-7444
7bb7f11 [zsxwing] Merge branch 'master' into SPARK-7444
a59f40e [zsxwing] Eliminate noisy css warn/error logs for UISeleniumSuite
2015-06-11 14:21:49 -07:00
Cheng Hao 040f223c5b [SPARK-7915] [SQL] Support specifying the column list for target table in CTAS
```
create table t1 (a int, b string) as select key, value from src;

desc t1;
key	int	NULL
value	string	NULL
```

As shown above, Hive doesn't support specifying the column list for the target table in CTAS. We should either throw an exception explicitly or support this feature; we picked the latter, which seems useful and straightforward.

Author: Cheng Hao <hao.cheng@intel.com>

Closes #6458 from chenghao-intel/ctas_column and squashes the following commits:

d1fa9b6 [Cheng Hao] bug in unittest
4e701aa [Cheng Hao] update as feedback
f305ec1 [Cheng Hao] support specifying the column list for target table in CTAS
2015-06-11 14:03:08 -07:00
Shivaram Venkataraman c8d551d546 [SPARK-8310] [EC2] Updates the master branch EC2 versions
Will send another PR for `branch-1.4`

Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

Closes #6764 from shivaram/SPARK-8310 and squashes the following commits:

d8cd3b3 [Shivaram Venkataraman] This updates the master branch EC2 versions
2015-06-11 13:18:42 -07:00