The issue link [SPARK-8379](https://issues.apache.org/jira/browse/SPARK-8379)
Currently, when we insert data into a dynamic partition with speculative tasks enabled, we get the following exception:
```
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
Lease mismatch on /tmp/hive-jeanlyn/hive_2015-06-15_15-20-44_734_8801220787219172413-1/-ext-10000/ds=2015-06-15/type=2/part-00301.lzo
owned by DFSClient_attempt_201506031520_0011_m_000189_0_-1513487243_53
but is accessed by DFSClient_attempt_201506031520_0011_m_000042_0_-1275047721_57
```
This PR writes the data to a temporary directory when using dynamic partitioning, to avoid speculative tasks writing to the same file.
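The idea can be sketched with a toy Python analogue (not the actual Scala change in this PR; the per-attempt naming scheme below is hypothetical): each attempt writes to its own temp file, and only the commit step publishes it under the final name, so two speculative attempts never hold a lease on the same file.

```python
import os
import tempfile

def write_partition(data, final_path, attempt_id):
    # Each attempt writes to its own temp file (the suffix scheme here is
    # illustrative, not Spark's actual one), then the commit step
    # atomically publishes it under the final name.
    tmp = f"{final_path}._tmp_{attempt_id}"
    with open(tmp, "w") as f:
        f.write(data)
    os.replace(tmp, final_path)

out_dir = tempfile.mkdtemp()
final = os.path.join(out_dir, "part-00000")
write_partition("hello", final, attempt_id=0)
write_partition("hello", final, attempt_id=1)  # speculative duplicate is safe
```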
Author: jeanlyn <jeanlyn92@gmail.com>
Closes #6833 from jeanlyn/speculation and squashes the following commits:
64bbfab [jeanlyn] use FileOutputFormat.getTaskOutputPath to get the path
8860af0 [jeanlyn] remove the never using code
e19a3bd [jeanlyn] avoid speculative tasks write same file
(cherry picked from commit a1e3649c87)
Signed-off-by: Cheng Lian <lian@databricks.com>
JIRA: https://issues.apache.org/jira/browse/SPARK-8468
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes #6905 from viirya/cv_min and squashes the following commits:
930d3db [Liang-Chi Hsieh] Fix python unit test and add document.
d632135 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into cv_min
16e3b2c [Liang-Chi Hsieh] Take the negative instead of reciprocal.
c3dd8d9 [Liang-Chi Hsieh] For comments.
b5f52c1 [Liang-Chi Hsieh] Add param to CrossValidator for choosing whether to maximize evaulation value.
(cherry picked from commit 0b8995168f)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
Author: cody koeninger <cody@koeninger.org>
Closes #6863 from koeninger/SPARK-8390 and squashes the following commits:
26a06bd [cody koeninger] Merge branch 'master' into SPARK-8390
3744492 [cody koeninger] [Streaming][Kafka][SPARK-8390] doc changes per TD, test to make sure approach shown in docs actually compiles + runs
b108c9d [cody koeninger] [Streaming][Kafka][SPARK-8390] further doc fixes, clean up spacing
bb4336b [cody koeninger] [Streaming][Kafka][SPARK-8390] fix docs related to HasOffsetRanges, cleanup
3f3c57a [cody koeninger] [Streaming][Kafka][SPARK-8389] Example of getting offset ranges out of the existing java direct stream api
…f the existing java direct stream api
Author: cody koeninger <cody@koeninger.org>
Closes #6846 from koeninger/SPARK-8389 and squashes the following commits:
3f3c57a [cody koeninger] [Streaming][Kafka][SPARK-8389] Example of getting offset ranges out of the existing java direct stream api
**Summary of the problem in SPARK-8470.** When using `HiveContext` to create a data frame of a user case class, Spark throws `scala.reflect.internal.MissingRequirementError` when it tries to infer the schema using reflection. This is caused by `HiveContext` silently overwriting the context class loader containing the user classes.
**What this issue is about.** This issue adds regression tests for SPARK-8470, which is already fixed in #6891. We closed SPARK-8470 as a duplicate because it is a different manifestation of the same problem in SPARK-8368. Due to the complexity of the reproduction, this requires us to pre-package a special test jar and include it in the Spark project itself.
I tested this with and without the fix in #6891 and verified that it passes only if the fix is present.
Author: Andrew Or <andrew@databricks.com>
Closes #6909 from andrewor14/SPARK-8498 and squashes the following commits:
5e9d688 [Andrew Or] Add regression test for SPARK-8470
(cherry picked from commit 093c34838d)
Signed-off-by: Yin Huai <yhuai@databricks.com>
Author: Yin Huai <yhuai@databricks.com>
Closes #6913 from yhuai/branch-1.4-hotfix and squashes the following commits:
7f91fa0 [Yin Huai] [HOT-FIX] Fix compilation (caused by 0131142d98).
This pull request adds the following methods to SparkR:
```R
setJobGroup()
cancelJobGroup()
clearJobGroup()
```
For each method, the Spark context is passed as the first argument. There does not seem to be a good way to test these in R.
cc shivaram and davies
Author: Hossein <hossein@databricks.com>
Closes #6889 from falaki/SPARK-8452 and squashes the following commits:
9ce9f1e [Hossein] Added basic tests to verify methods can be called and won't throw errors
c706af9 [Hossein] Added examples
a2c19af [Hossein] taking spark context as first argument
343ca77 [Hossein] Added setJobGroup, cancelJobGroup and clearJobGroup to SparkR
(cherry picked from commit 1fa29c2df2)
Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
This is for 1.4 branch (based on https://github.com/apache/spark/pull/6891).
Author: Yin Huai <yhuai@databricks.com>
Closes #6895 from yhuai/SPARK-8368-1.4 and squashes the following commits:
adbbbc9 [Yin Huai] Minor update.
3cca0e9 [Yin Huai] Correctly set the class loader in the conf of the state in client wrapper.
b1e14a9 [Yin Huai] Failed tests.
This PR solves three SerializationDebugger issues.
* SPARK-7180 - SerializationDebugger fails with ArrayOutOfBoundsException
* SPARK-8090 - SerializationDebugger does not handle classes with writeReplace correctly
* SPARK-8091 - SerializationDebugger does not handle classes with writeObject method
The solutions for each are explained as follows
* SPARK-7180 - The wrong slot desc was used for getting the value of the fields in the object being tested.
* SPARK-8090 - Test the type of the replaced object.
* SPARK-8091 - Use a dummy ObjectOutputStream to collect all the objects written by the writeObject() method, and then test those objects as usual.
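The SPARK-8091 technique (a dummy stream that records every object written so each can be tested individually) can be sketched with a toy Python analogue; `RecordingPickler` is a hypothetical name and this is not the Scala SerializationDebugger implementation:

```python
import io
import pickle

class RecordingPickler(pickle.Pickler):
    """Records every object visited during serialization so each one
    can later be checked for serializability on its own."""
    def __init__(self, file):
        super().__init__(file)
        self.seen = []

    def persistent_id(self, obj):
        # Called for every object being pickled; returning None lets
        # serialization proceed normally while we keep a record.
        self.seen.append(obj)
        return None

buf = io.BytesIO()
p = RecordingPickler(buf)
p.dump({"a": 1, "b": [2, 3]})
```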
I also added more tests in the testsuite to increase code coverage. For example, added tests for cases where there are not serializability issues.
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes #6625 from tdas/SPARK-7180 and squashes the following commits:
c7cb046 [Tathagata Das] Addressed comments on docs
ae212c8 [Tathagata Das] Improved docs
304c97b [Tathagata Das] Fixed build error
26b5179 [Tathagata Das] more tests.....92% line coverage
7e2fdcf [Tathagata Das] Added more tests
d1967fb [Tathagata Das] Added comments.
da75d34 [Tathagata Das] Removed unnecessary lines.
50a608d [Tathagata Das] Fixed bugs and added support for writeObject
Clarify what may cause long-running Spark apps to preserve shuffle files
Author: Sean Owen <sowen@cloudera.com>
Closes #6901 from srowen/SPARK-5836 and squashes the following commits:
a9faef0 [Sean Owen] Clarify what may cause long-running Spark apps to preserve shuffle files
(cherry picked from commit 4be53d0395)
Signed-off-by: Andrew Or <andrew@databricks.com>
This patch also reenables the tests. Now that we have access to the log4j logs it should be easier to debug the flakiness.
yhuai brkyvz
Author: Andrew Or <andrew@databricks.com>
Closes #6886 from andrewor14/spark-submit-suite-fix and squashes the following commits:
3f99ff1 [Andrew Or] Move destroy to finally block
9a62188 [Andrew Or] Re-enable ignored tests
2382672 [Andrew Or] Check for exit code
(cherry picked from commit 68a2dca292)
Signed-off-by: Andrew Or <andrew@databricks.com>
andrewor14 can you take a look? Thanks.
Author: Lianhui Wang <lianhuiwang09@gmail.com>
Closes #6873 from lianhuiwang/SPARK-8430 and squashes the following commits:
51c47ca [Lianhui Wang] update andrewor's comments
2b27b19 [Lianhui Wang] support UnsafeShuffleManager
(cherry picked from commit 9baf093014)
Signed-off-by: Andrew Or <andrew@databricks.com>
`itertools.islice` requires an integer for the stop argument. Switching to integer division here prevents a ValueError when `vs` is evaluated above.
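A minimal illustration of the failure mode, in plain Python (independent of PySpark's actual serializer code):

```python
from itertools import islice

items = range(10)
batch = 7

# True division yields a float, which islice rejects:
try:
    list(islice(items, batch / 2))
    raised = False
except ValueError:
    raised = True

# Integer division keeps the stop argument an int:
half = list(islice(items, batch // 2))
```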
davies
This is my original work, and I license it to the project.
Author: Kevin Conor <kevin@discoverybayconsulting.com>
Closes #6794 from kconor/kconor-patch-1 and squashes the following commits:
da5e700 [Kevin Conor] Integer division for batch size
(cherry picked from commit fdf63f1249)
Signed-off-by: Davies Liu <davies@databricks.com>
`Path.toUri.getPath` strips scheme part of output path (from `file:///foo` to `/foo`), which causes ORC data source only writes to the file system configured in Hadoop configuration. Should use `Path.toString` instead.
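The same pitfall can be demonstrated with Python's URL parsing, as an analogue of `Path.toUri.getPath` (this is not the actual Scala code):

```python
from urllib.parse import urlparse

output_path = "file:///tmp/orc-out"

# Extracting only the path component silently drops the scheme, so a
# writer using it would fall back to the default (Hadoop-configured)
# filesystem instead of the local one the user asked for:
path_only = urlparse(output_path).path
```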
Author: Cheng Lian <lian@databricks.com>
Closes #6892 from liancheng/spark-8458 and squashes the following commits:
87f8199 [Cheng Lian] Don't strip scheme of output path when writing ORC files
(cherry picked from commit a71cbbdea5)
Signed-off-by: Cheng Lian <lian@databricks.com>
tdas zsxwing, this is the new PR for SPARK-8080.
I have merged https://github.com/apache/spark/pull/6659
Also worth mentioning: with the MEMORY_ONLY storage level, when a block cannot be unrolled safely to memory because there is not enough space, the BlockManager will not attempt to put the block, and ReceivedBlockHandler will throw a SparkException since it cannot find the block id in the PutResult. Thus the number of records in the block is not counted if the block fails to unroll in memory, which is fine.
With MEMORY_AND_DISK, if the BlockManager cannot unroll the block to memory, the block still gets written to disk. The same holds for the WAL-based store. So in those cases (storage level = memory + disk) the number of records is counted even though the block could not be unrolled to memory.
Thus I added isFullyConsumed to the CountingIterator but have not used it, since it can never happen that a block is not fully consumed while ReceivedBlockHandler still gets the block id.
I have added a few test cases to cover those block unrolling scenarios as well.
Author: Dibyendu Bhattacharya <dibyendu.bhattacharya1@pearson.com>
Author: U-PEROOT\UBHATD1 <UBHATD1@PIN-L-PI046.PEROOT.com>
Closes #6707 from dibbhatt/master and squashes the following commits:
f6cb6b5 [Dibyendu Bhattacharya] [SPARK-8080][STREAMING] Receiver.store with Iterator does not give correct count at Spark UI
f37cfd8 [Dibyendu Bhattacharya] [SPARK-8080][STREAMING] Receiver.store with Iterator does not give correct count at Spark UI
5a8344a [Dibyendu Bhattacharya] [SPARK-8080][STREAMING] Receiver.store with Iterator does not give correct count at Spark UI Count ByteBufferBlock as 1 count
fceac72 [Dibyendu Bhattacharya] [SPARK-8080][STREAMING] Receiver.store with Iterator does not give correct count at Spark UI
0153e7e [Dibyendu Bhattacharya] [SPARK-8080][STREAMING] Receiver.store with Iterator does not give correct count at Spark UI Fixed comments given by @zsxwing
4c5931d [Dibyendu Bhattacharya] [SPARK-8080][STREAMING] Receiver.store with Iterator does not give correct count at Spark UI
01e6dc8 [U-PEROOT\UBHATD1] A
This fixes various minor documentation issues on the Spark SQL page
Author: Lars Francke <lars.francke@gmail.com>
Closes #6890 from lfrancke/SPARK-8462 and squashes the following commits:
dd7e302 [Lars Francke] Merge branch 'master' into SPARK-8462
34eff2c [Lars Francke] Minor documentation fixes
(cherry picked from commit 4ce3bab89f)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
This patch introduces `SparkPlanTest`, a base class for unit tests of SparkPlan physical operators. This is analogous to Spark SQL's existing `QueryTest`, which does something similar for end-to-end tests with actual queries.
These helper methods provide nicer error output when tests fail and help developers to avoid writing lots of boilerplate in order to execute manually constructed physical plans.
Author: Josh Rosen <joshrosen@databricks.com>
Author: Josh Rosen <rosenville@gmail.com>
Author: Michael Armbrust <michael@databricks.com>
Closes #6885 from JoshRosen/spark-plan-test and squashes the following commits:
f8ce275 [Josh Rosen] Fix some IntelliJ inspections and delete some dead code
84214be [Josh Rosen] Add an extra column which isn't part of the sort
ae1896b [Josh Rosen] Provide implicits automatically
a80f9b0 [Josh Rosen] Merge pull request #4 from marmbrus/pr/6885
d9ab1e4 [Michael Armbrust] Add simple resolver
c60a44d [Josh Rosen] Manually bind references
996332a [Josh Rosen] Add types so that tests compile
a46144a [Josh Rosen] WIP
(cherry picked from commit 207a98ca59)
Signed-off-by: Michael Armbrust <michael@databricks.com>
Commons Lang 3 has been added as one of the dependencies of Spark Flume Sink since #5703. This PR updates the doc for it.
Author: zsxwing <zsxwing@gmail.com>
Closes #6829 from zsxwing/flume-sink-dep and squashes the following commits:
f8617f0 [zsxwing] Add common lang3 to the Spark Flume Sink doc
(cherry picked from commit 24e53793b4)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
This patch uses [AnchorJS](https://bryanbraun.github.io/anchorjs/) to show deep anchor links when hovering over headers in the Spark documentation. For example:
![image](https://cloud.githubusercontent.com/assets/50748/8240800/1502f85c-15ba-11e5-819a-97b231370a39.png)
This makes it easier for users to link to specific sections of the documentation.
I also removed some dead Javascript which isn't used in our current docs (it was introduced for the old AMPCamp training, but isn't used anymore).
Author: Josh Rosen <joshrosen@databricks.com>
Closes #6808 from JoshRosen/SPARK-8353 and squashes the following commits:
e59d8a7 [Josh Rosen] Suppress underline on hover
f518b6a [Josh Rosen] Turn on for all headers, since we use H1s in a bunch of places
a9fec01 [Josh Rosen] Add anchor links when hovering over headers; remove some dead JS code
(cherry picked from commit 44c931f006)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
The batch size during external sort grows up to a max of 10000, then shrinks down to zero, causing an infinite loop.
Given the assumption that the items usually have similar sizes, we don't need to adjust the batch size after the first spill.
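A toy Python sketch of why an integer batch size that halves on every spill is dangerous (this mirrors the failure mode, not PySpark's actual shuffle code):

```python
from itertools import islice

batch = 10000
spills = 0
while batch > 0:
    batch //= 2   # shrink on every (simulated) spill
    spills += 1

# After enough halvings the batch hits zero, and islice with a stop of 0
# never yields anything, so a read loop keyed on the chunk would make no
# progress:
empty_chunk = list(islice(iter(range(5)), batch))
```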
cc JoshRosen rxin angelini
Author: Davies Liu <davies@databricks.com>
Closes #6714 from davies/batch_size and squashes the following commits:
b170dfb [Davies Liu] update test
b9be832 [Davies Liu] Merge branch 'batch_size' of github.com:davies/spark into batch_size
6ade745 [Davies Liu] update test
5c21777 [Davies Liu] Update shuffle.py
e746aec [Davies Liu] fix batch size during sort
Dependencies of artifacts in the local ivy cache were not being resolved properly. The dependencies were not being picked up. Now they should be.
cc andrewor14
Author: Burak Yavuz <brkyvz@gmail.com>
Closes #6788 from brkyvz/local-ivy-fix and squashes the following commits:
2875bf4 [Burak Yavuz] fix temp dir bug
48cc648 [Burak Yavuz] improve deletion
a69e3e6 [Burak Yavuz] delete cache before test as well
0037197 [Burak Yavuz] fix merge conflicts
f60772c [Burak Yavuz] use different folder for m2 cache during testing
b6ef038 [Burak Yavuz] [SPARK-8095] Resolve dependencies of Spark Packages in local ivy cache
Conflicts:
core/src/test/scala/org/apache/spark/deploy/SparkSubmitUtilsSuite.scala
```scala
def getAllNodes: Seq[RDDOperationNode] =
  _childNodes ++ _childClusters.flatMap(_.childNodes)
```
When `_childClusters` contains many nodes, this call can hang. I think we can improve the efficiency here.
Author: xutingjun <xutingjun@huawei.com>
Closes #6839 from XuTingjun/DAGImprove and squashes the following commits:
53b03ea [xutingjun] change code to more concise and easier to read
f98728b [xutingjun] fix words: node -> nodes
f87c663 [xutingjun] put the filter inside
81f9fd2 [xutingjun] put the filter inside
(cherry picked from commit e2cdb0568b)
Signed-off-by: Andrew Or <andrew@databricks.com>
https://issues.apache.org/jira/browse/SPARK-8306
I will try to add a test later.
marmbrus aarondav
Author: Yin Huai <yhuai@databricks.com>
Closes #6758 from yhuai/SPARK-8306 and squashes the following commits:
1292346 [Yin Huai] [SPARK-8306] AddJar command needs to set the new class loader to the HiveConf inside executionHive.state.
(cherry picked from commit 302556ff99)
Signed-off-by: Michael Armbrust <michael@databricks.com>
Conflicts:
sql/hive/src/main/scala/org/apache/spark/sql/hive/client/ClientWrapper.scala
KafkaStreamSuite, DirectKafkaStreamSuite, JavaKafkaStreamSuite and JavaDirectKafkaStreamSuite use non-thread-safe collections to collect data in one thread and check it in another thread. This may cause the tests to fail.
This PR changes them to thread-safe collections.
Note: I cannot reproduce the test failures in my environment. But at least, this PR should make the tests more reliable.
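As a hedged Python analogue of the fix (the actual change is in the Scala test suites): collect results in a thread-safe structure such as `queue.Queue` rather than a plain collection shared across threads.

```python
import threading
import queue

results = queue.Queue()  # thread-safe; a plain list offers no such guarantee

def worker(n):
    # Producer thread: puts items while another thread may be reading
    for i in range(n):
        results.put(i)

threads = [threading.Thread(target=worker, args=(50,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

collected = sorted(results.queue)  # all 4 x 50 items arrived intact
```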
Author: zsxwing <zsxwing@gmail.com>
Closes #6852 from zsxwing/fix-KafkaStreamSuite and squashes the following commits:
d464211 [zsxwing] Use thread-safe collections to make the tests more reliable
(cherry picked from commit a06d9c8e76)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
This PR fixes the sum issue and also adds `emptyRDD` so that it's easy to create a test case.
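The underlying issue is the classic reduce-vs-fold distinction; here is a plain-Python sketch (PySpark's actual fix lives in its RDD code, this is only an analogue):

```python
from functools import reduce
from operator import add

partitions = []  # an empty RDD yields nothing to reduce

try:
    reduce(add, partitions)        # no initial value -> TypeError on empty input
    reduce_worked = True
except TypeError:
    reduce_worked = False

total = reduce(add, partitions, 0)  # folding with a zero value handles emptiness
```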
Author: zsxwing <zsxwing@gmail.com>
Closes #6826 from zsxwing/python-emptyRDD and squashes the following commits:
b36993f [zsxwing] Update the return type to JavaRDD[T]
71df047 [zsxwing] Add emptyRDD to pyspark and fix the issue when calling sum on an empty RDD
(cherry picked from commit 0fc4b96f3e)
Signed-off-by: Andrew Or <andrew@databricks.com>
The history server may show an incorrect App ID for an incomplete application like `<App ID>.inprogress`. This app info never disappears, even after the app completes.
![incorrectappinfo](https://cloud.githubusercontent.com/assets/9278199/8156147/2a10fdbe-137d-11e5-9620-c5b61d93e3c1.png)
The cause of the issue is that a log path name is used as the app id when the app id cannot be obtained during replay.
Author: Carson Wang <carson.wang@intel.com>
Closes #6827 from carsonwang/SPARK-8372 and squashes the following commits:
cdbb089 [Carson Wang] Fix code style
3e46b35 [Carson Wang] Update code style
90f5dde [Carson Wang] Add a unit test
d8c9cd0 [Carson Wang] Replaying events only return information when app is started
(cherry picked from commit 2837e06709)
Signed-off-by: Andrew Or <andrew@databricks.com>
externalBlockStoreInitialized is never set to true, which means blocks stored in ExternalBlockStore cannot be removed.
Author: Mingfei <mingfei.shi@intel.com>
Closes #6702 from shimingfei/SetTrue and squashes the following commits:
add61d8 [Mingfei] Set externalBlockStoreInitialized to be true, after ExternalBlockStore is initialized
(cherry picked from commit 7ad8c5d869)
Signed-off-by: Andrew Or <andrew@databricks.com>
Now PySpark on YARN with cluster mode is supported, so let's update the doc.
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes #6040 from sarutak/update-doc-for-pyspark-on-yarn and squashes the following commits:
ad9f88c [Kousuke Saruta] Brushed up sentences
469fd2e [Kousuke Saruta] Merge branch 'master' of https://github.com/apache/spark into update-doc-for-pyspark-on-yarn
fcfdb92 [Kousuke Saruta] Updated doc for PySpark on YARN with cluster mode
Author: Punya Biswal <pbiswal@palantir.com>
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes #6842 from punya/feature/SPARK-7515 and squashes the following commits:
0b83648 [Punya Biswal] Merge remote-tracking branch 'origin/branch-1.4' into feature/SPARK-7515
de025cd [Kousuke Saruta] [SPARK-7515] [DOC] Update documentation for PySpark on YARN with cluster mode
start-slave.sh no longer takes a worker # param in 1.4+
Author: Sean Owen <sowen@cloudera.com>
Closes #6855 from srowen/SPARK-8395 and squashes the following commits:
300278e [Sean Owen] start-slave.sh no longer takes a worker # param in 1.4+
(cherry picked from commit f005be0273)
Signed-off-by: Andrew Or <andrew@databricks.com>
The problem occurs because the position mask `0xEFFFFFF` is incorrect: its 25th bit is zero, so when the capacity grows beyond 2^24, `OpenHashMap` computes an incorrect index into the `_values` array.
I've also added a size check in `rehash()`, so that it fails instead of reporting invalid item indices.
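The arithmetic behind the bug, sketched in Python (the broken constant is quoted from the description above; the corrected mask value is an assumption based on the stated bit layout):

```python
BROKEN_MASK = 0xEFFFFFF    # 25th bit (bit index 24) is zero
FIXED_MASK = 0xFFFFFFF     # all 28 low bits set

pos = 1 << 24              # a slot index just past 2**24 capacity

# With the broken mask, bit 24 of the position is silently cleared,
# so distinct positions collapse onto the same (wrong) value index:
broken = pos & BROKEN_MASK
fixed = pos & FIXED_MASK
```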
Author: Vyacheslav Baranov <slavik.baranov@gmail.com>
Closes #6763 from SlavikBaranov/SPARK-8309 and squashes the following commits:
8557445 [Vyacheslav Baranov] Resolved review comments
4d5b954 [Vyacheslav Baranov] Resolved review comments
eaf1e68 [Vyacheslav Baranov] Fixed failing test
f9284fd [Vyacheslav Baranov] Resolved review comments
3920656 [Vyacheslav Baranov] SPARK-8309: Support for more than 12M items in OpenHashMap
(cherry picked from commit c13da20a55)
Signed-off-by: Sean Owen <sowen@cloudera.com>
rxin this is the fix you requested for the break introduced by backporting #6793
Author: Punya Biswal <pbiswal@palantir.com>
Closes #6850 from punya/feature/fix-backport-break and squashes the following commits:
fdc3693 [Punya Biswal] Fix break introduced by backport
Check and update the MLlib Python classification and regression docs to be as complete as the Scala docs.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #6460 from yanboliang/spark-7916 and squashes the following commits:
f8deda4 [Yanbo Liang] trigger jenkins
6dc4d99 [Yanbo Liang] address comments
ce2a43e [Yanbo Liang] truncate too long line and remove extra sparse
3eaf6ad [Yanbo Liang] MLlib Python doc parity check for classification and regression
(cherry picked from commit ca998757e8)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
If you ran "clean" at the top-level sbt project, the temp dir would
go away, so running "test" without restarting sbt would fail. This
fixes that by making sure the temp dir exists before running tests.
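The fix amounts to an idempotent mkdir before tests run; a Python equivalent of what the sbt change does (`os.makedirs(..., exist_ok=True)`):

```python
import os
import tempfile

def ensure_tmp_dir(path):
    # Safe to call whether or not a prior "clean" removed the directory
    os.makedirs(path, exist_ok=True)
    return path

base = tempfile.mkdtemp()
scratch = os.path.join(base, "unit-tests-tmp")
ensure_tmp_dir(scratch)
os.rmdir(scratch)          # simulate "clean" at the top level
ensure_tmp_dir(scratch)    # tests can now run without restarting sbt
```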
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes #6805 from vanzin/SPARK-8126-fix and squashes the following commits:
12d7768 [Marcelo Vanzin] [SPARK-8126] [build] Make sure temp dir exists when running tests.
(cherry picked from commit cebf241184)
Signed-off-by: Sean Owen <sowen@cloudera.com>
[SQL][DOC] I found it a bit confusing when I came across it for the first time in the docs
Author: Radek Ostrowski <dest.hawaii@gmail.com>
Author: radek <radek@radeks-MacBook-Pro-2.local>
Closes #6332 from radek1st/master and squashes the following commits:
dae3347 [Radek Ostrowski] fixed typo
c76bb3a [radek] improved a comment
(cherry picked from commit 4bd10fd509)
Signed-off-by: Sean Owen <sowen@cloudera.com>
Typo in thriftserver section
Author: Moussa Taifi <moutai10@gmail.com>
Closes #6847 from moutai/patch-1 and squashes the following commits:
1bd29df [Moussa Taifi] Update sql-programming-guide.md
(cherry picked from commit dc455b8833)
Signed-off-by: Sean Owen <sowen@cloudera.com>
The bug was reported in [SPARK-8367](https://issues.apache.org/jira/browse/SPARK-8367).
The resolution is to limit the configuration `spark.streaming.blockInterval` to a positive number.
Author: huangzhaowei <carlmartinmax@gmail.com>
Author: huangzhaowei <SaintBacchus@users.noreply.github.com>
Closes #6818 from SaintBacchus/SPARK-8367 and squashes the following commits:
c9d1927 [huangzhaowei] Update BlockGenerator.scala
bd3f71a [huangzhaowei] Use requre instead of if
3d17796 [huangzhaowei] [SPARK_8367][Streaming]Add a limit for 'spark.streaming.blockInterval' since a data loss bug.
(cherry picked from commit ccf010f27b)
Signed-off-by: Sean Owen <sowen@cloudera.com>
This PR fixes the problem reported by Justin Yip in the thread "NullPointerException with functions.rand()".
Tested using spark-shell and verified that the following works:
```scala
sqlContext.createDataFrame(Seq((1, 2), (3, 100))).withColumn("index", rand(30)).show()
```
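The expected behavior (a seeded call is deterministic, an unseeded call still works instead of failing) can be sanity-checked with a plain-Python analogue; this is not Catalyst's `Rand` expression, and the `rand` helper below is hypothetical:

```python
import random

def rand(seed=None):
    # Handle a missing seed instead of letting it propagate,
    # the analogue of the NullPointerException fixed here.
    rng = random.Random(seed)  # Random(None) seeds from the OS
    return rng.random()

seeded_a = rand(30)
seeded_b = rand(30)
unseeded = rand()
```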
Author: tedyu <yuzhihong@gmail.com>
Closes #6793 from tedyu/master and squashes the following commits:
62fd97b [tedyu] Create RandomSuite
750f92c [tedyu] Add test for Rand() with seed
a1d66c5 [tedyu] Fix NullPointerException with functions.rand()
(cherry picked from commit 1a62d61696)
Signed-off-by: Reynold Xin <rxin@databricks.com>
Author: Peter Hoffmann <ph@peter-hoffmann.com>
Closes #6815 from hoffmann/patch-1 and squashes the following commits:
2abb6da [Peter Hoffmann] fix read/write mixup
(cherry picked from commit f3f2a4397d)
Signed-off-by: Reynold Xin <rxin@databricks.com>
Author: Michael Armbrust <michael@databricks.com>
Closes #6811 from marmbrus/aliasExplodeStar and squashes the following commits:
fbd2065 [Michael Armbrust] more style
806a373 [Michael Armbrust] fix style
7cbb530 [Michael Armbrust] [SPARK-8358][SQL] Wait for child resolution when resolving generatorsa
(cherry picked from commit 9073a426e4)
Signed-off-by: Michael Armbrust <michael@databricks.com>
UnsafeFixedWidthAggregationMap contains an off-by-factor-of-8 error when allocating row conversion scratch space: we take a size requirement, measured in bytes, then allocate a long array of that size. This means that we end up allocating 8x too much conversion space.
This patch fixes this by allocating a `byte[]` array instead. This doesn't impose any new limitations on the maximum sizes of UnsafeRows, since UnsafeRowConverter already used integers when calculating the size requirements for rows.
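The size arithmetic is easy to check; a Python sketch of the 8x over-allocation, using the `array` module as a stand-in for JVM arrays:

```python
import array

required_bytes = 64

# Mistake: allocate a long array whose *length* is the byte requirement,
# so the real footprint is 8x larger than needed:
long_buf = array.array('q', [0] * required_bytes)
over_allocated = long_buf.itemsize * len(long_buf)

# Fix: allocate bytes directly.
byte_buf = bytearray(required_bytes)
```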
Author: Josh Rosen <joshrosen@databricks.com>
Closes #6809 from JoshRosen/sql-bytes-vs-words-fix and squashes the following commits:
6520339 [Josh Rosen] Updates to reflect fact that UnsafeRow max size is constrained by max byte[] size
(cherry picked from commit ea7fd2ff64)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
This improves the Spark Streaming Guides by fixing broken links, rewording confusing sections, fixing typos, adding missing words, etc.
Author: Mike Dusenberry <dusenberrymw@gmail.com>
Closes #6801 from dusenberrymw/SPARK-8343_Improve_Spark_Streaming_Guides_MERGED and squashes the following commits:
6688090 [Mike Dusenberry] Improvements to the Spark Streaming Custom Receiver Guide, including slight rewording of confusing sections, and fixing typos & missing words.
436fbd8 [Mike Dusenberry] Bunch of improvements to the Spark Streaming Guide, including fixing broken links, slight rewording of confusing sections, fixing typos & missing words, etc.
(cherry picked from commit 35d1267cf8)
Signed-off-by: Reynold Xin <rxin@databricks.com>
Author: Michael Armbrust <michael@databricks.com>
Closes #6786 from marmbrus/optionsParser and squashes the following commits:
e7d18ef [Michael Armbrust] add dots
99a3452 [Michael Armbrust] [SPARK-8329][SQL] Allow _ in DataSource options
(cherry picked from commit 4aed66f299)
Signed-off-by: Reynold Xin <rxin@databricks.com>
- Kinesis API updated
- Kafka version updated, and Python API for Direct Kafka added
- Added SQLContext.getOrCreate()
- Added information on how to get partitionId in foreachRDD
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes #6781 from tdas/SPARK-7284 and squashes the following commits:
aac7be0 [Tathagata Das] Added information on how to get partition id
a66ec22 [Tathagata Das] Complete the line incomplete line,
a92ca39 [Tathagata Das] Updated streaming documentation
(cherry picked from commit e9471d3414)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
Safeguard against DOM rewriting.
Author: Andrew Or <andrew@databricks.com>
Closes #6787 from andrewor14/dag-viz-trim and squashes the following commits:
0fb4afe [Andrew Or] Trim input metadata from DOM
(cherry picked from commit 8860405151)
Signed-off-by: Andrew Or <andrew@databricks.com>
… SPARK_TACHYON_MAP
Author: Mark Smith <mark.smith@bronto.com>
Closes #6777 from markmsmith/branch-1.4 and squashes the following commits:
a218cfa [Mark Smith] [SPARK-8322][EC2] Fixed tachyon mapp entry to point to 0.6.4
90d1655 [Mark Smith] [SPARK-8322][EC2] Added spark 1.4.0 into the VALID_SPARK_VERSIONS and SPARK_TACHYON_MAP
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes #6766 from vanzin/SPARK-6511 and squashes the following commits:
49f0f67 [Marcelo Vanzin] [SPARK-6511] [docs] Fix example command in hadoop-provided docs.
(cherry picked from commit 9cbdf31ec1)
Signed-off-by: Reynold Xin <rxin@databricks.com>
cc pwendell -- We should probably update our release guidelines to change this when we cut a release branch?
Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Closes #6765 from shivaram/SPARK-8310-14 and squashes the following commits:
066e44e [Shivaram Venkataraman] Update spark-ec2 branch to 1.4