ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
lockwobr	27693e1757	[SQL] [DOCS] updated the documentation for explode the syntax was incorrect in the example in explode Author: lockwobr <lockwobr@gmail.com> Closes #6943 from lockwobr/master and squashes the following commits: 3d864d1 [lockwobr] updated the documentation for explode (cherry picked from commit `4f7fbefb8d`) Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.co.jp>	2015-06-24 02:51:36 +09:00
Josh Rosen	77cb1d5ed1	Revert "[SPARK-8498] [TUNGSTEN] fix npe in errorhandling path in unsafeshuffle writer" This reverts commit `3348245055`. Reverting because `catch (Exception e) ... throw e` doesn't compile under Java 6 unless the method declares that it throws Exception.	2015-06-23 09:19:11 -07:00
Holden Karau	3348245055	[SPARK-8498] [TUNGSTEN] fix npe in errorhandling path in unsafeshuffle writer Author: Holden Karau <holden@pigscanfly.ca> Closes #6918 from holdenk/SPARK-8498-fix-npe-in-errorhandling-path-in-unsafeshuffle-writer and squashes the following commits: f807832 [Holden Karau] Log error if we can't throw it 855f9aa [Holden Karau] Spelling - not my strongest suite. Fix Propegates to Propagates. 039d620 [Holden Karau] Add missing closeandwriteoutput 30e558d [Holden Karau] go back to try/finally e503b8c [Holden Karau] Improve the test to ensure we aren't masking the underlying exception ae0b7a7 [Holden Karau] Fix the test 2e6abf7 [Holden Karau] Be more cautious when cleaning up during failed write and re-throw user exceptions (cherry picked from commit `0f92be5b5f`) Signed-off-by: Josh Rosen <joshrosen@databricks.com>	2015-06-23 09:08:49 -07:00
Hari Shreedharan	9294796750	[SPARK-8483] [STREAMING] Remove commons-lang3 dependency from Flume Sink Author: Hari Shreedharan <hshreedharan@apache.org> Closes #6910 from harishreedharan/remove-commons-lang3 and squashes the following commits: 9875f7d [Hari Shreedharan] Revert back to Flume 1.4.0 ca35eb0 [Hari Shreedharan] [SPARK-8483][Streaming] Remove commons-lang3 dependency from Flume Sink. Also bump Flume version to 1.6.0	2015-06-22 23:41:35 -07:00
Scott Taylor	d0943afbcf	[SPARK-8541] [PYSPARK] test the absolute error in approx doctests A minor change but one which is (presumably) visible on the public api docs webpage. Author: Scott Taylor <github@megatron.me.uk> Closes #6942 from megatron-me-uk/patch-3 and squashes the following commits: fbed000 [Scott Taylor] test the absolute error in approx doctests (cherry picked from commit `f0dcbe8a7c`) Signed-off-by: Josh Rosen <joshrosen@databricks.com>	2015-06-22 23:38:21 -07:00
Holden Karau	22cc1ab66e	[SPARK-7781] [MLLIB] gradient boosted trees.train regressor missing max bins Author: Holden Karau <holden@pigscanfly.ca> Closes #6331 from holdenk/SPARK-7781-GradientBoostedTrees.trainRegressor-missing-max-bins and squashes the following commits: 2894695 [Holden Karau] remove extra blank line 2573e8d [Holden Karau] Update the scala side of the pythonmllibapi and make the test a bit nicer too 3a09170 [Holden Karau] add maxBins to to the train method as well af7f274 [Holden Karau] Add maxBins to GradientBoostedTrees.trainRegressor and correctly mention the default of 32 in other places where it mentioned 100 (cherry picked from commit `164fe2aa44`) Signed-off-by: Joseph K. Bradley <joseph@databricks.com>	2015-06-22 22:40:31 -07:00
Patrick Wendell	1cfa7302ee	Preparing development version 1.4.2-SNAPSHOT	2015-06-22 22:21:31 -07:00
Patrick Wendell	d0a5560ce4	Preparing Spark release v1.4.1-rc1	2015-06-22 22:21:26 -07:00
Patrick Wendell	48d6830144	[BUILD] Preparing Spark release 1.4.1	2015-06-22 22:18:52 -07:00
Yu ISHIKAWA	250179485b	[SPARK-8548] [SPARKR] Remove the trailing whitespaces from the SparkR files [[SPARK-8548] Remove the trailing whitespaces from the SparkR files - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-8548) - This is the result of `lint-r` https://gist.github.com/yu-iskw/0019b37a2c1167f33986 Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #6945 from yu-iskw/SPARK-8548 and squashes the following commits: 0bd567a [Yu ISHIKAWA] [SPARK-8548][SparkR] Remove the trailing whitespaces from the SparkR files (cherry picked from commit `44fa7df64d`) Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>	2015-06-22 20:55:55 -07:00
Cheng Hao	d73900a903	[SPARK-7859] [SQL] Collect_set() behavior differences which fails the unit test under jdk8 To reproduce that: ``` JAVA_HOME=/home/hcheng/Java/jdk1.8.0_45 \| build/sbt -Phadoop-2.3 -Phive 'test-only org.apache.spark.sql.hive.execution.HiveWindowFunctionQueryWithoutCodeGenSuite' ``` A simple workaround to fix that is update the original query, for getting the output size instead of the exact elements of the array (output by collect_set()) Author: Cheng Hao <hao.cheng@intel.com> Closes #6402 from chenghao-intel/windowing and squashes the following commits: 99312ad [Cheng Hao] add order by for the select clause edf8ce3 [Cheng Hao] update the code as suggested 7062da7 [Cheng Hao] fix the collect_set() behaviour differences under different versions of JDK (cherry picked from commit `13321e6555`) Signed-off-by: Yin Huai <yhuai@databricks.com>	2015-06-22 20:05:00 -07:00
Yin Huai	994abbaeb3	[SPARK-8532] [SQL] In Python's DataFrameWriter, save/saveAsTable/json/parquet/jdbc always override mode https://issues.apache.org/jira/browse/SPARK-8532 This PR has two changes. First, it fixes the bug that save actions (i.e. `save/saveAsTable/json/parquet/jdbc`) always override mode. Second, it adds input argument `partitionBy` to `save/saveAsTable/parquet`. Author: Yin Huai <yhuai@databricks.com> Closes #6937 from yhuai/SPARK-8532 and squashes the following commits: f972d5d [Yin Huai] davies's comment. d37abd2 [Yin Huai] style. d21290a [Yin Huai] Python doc. 889eb25 [Yin Huai] Minor refactoring and add partitionBy to save, saveAsTable, and parquet. 7fbc24b [Yin Huai] Use None instead of "error" as the default value of mode since JVM-side already uses "error" as the default value. d696dff [Yin Huai] Python style. 88eb6c4 [Yin Huai] If mode is "error", do not call mode method. c40c461 [Yin Huai] Regression test. (cherry picked from commit `5ab9fcfb01`) Signed-off-by: Yin Huai <yhuai@databricks.com>	2015-06-22 13:51:34 -07:00
Yu ISHIKAWA	507381d393	[SPARK-8511] [PYSPARK] Modify a test to remove a saved model in `regression.py` [[SPARK-8511] Modify a test to remove a saved model in `regression.py` - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-8511) Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #6926 from yu-iskw/SPARK-8511 and squashes the following commits: 7cd0948 [Yu ISHIKAWA] Use `shutil.rmtree()` to temporary directories for saving model testings, instead of `os.removedirs()` 4a01c9e [Yu ISHIKAWA] [SPARK-8511][pyspark] Modify a test to remove a saved model in `regression.py` (cherry picked from commit `5d89d9f00b`) Signed-off-by: Joseph K. Bradley <joseph@databricks.com> Conflicts: python/pyspark/mllib/tests.py	2015-06-22 11:59:53 -07:00
Michael Armbrust	65981619b2	[SPARK-8420] [SQL] Fix comparision of timestamps/dates with strings (branch-1.4) This is branch 1.4 backport of https://github.com/apache/spark/pull/6888. Below is the original description. In earlier versions of Spark SQL we casted `TimestampType` and `DataType` to `StringType` when it was involved in a binary comparison with a `StringType`. This allowed comparing a timestamp with a partial date as a user would expect. - `time > "2014-06-10"` - `time > "2014"` In 1.4.0 we tried to cast the String instead into a Timestamp. However, since partial dates are not a valid complete timestamp this results in `null` which results in the tuple being filtered. This PR restores the earlier behavior. Note that we still special case equality so that these comparisons are not affected by not printing zeros for subsecond precision. Author: Michael Armbrust <michaeldatabricks.com> Closes #6888 from marmbrus/timeCompareString and squashes the following commits: bdef29c [Michael Armbrust] test partial date 1f09adf [Michael Armbrust] special handling of equality 1172c60 [Michael Armbrust] more test fixing 4dfc412 [Michael Armbrust] fix tests aaa9508 [Michael Armbrust] newline 04d908f [Michael Armbrust] [SPARK-8420][SQL] Fix comparision of timestamps/dates with strings Conflicts: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/HiveTypeCoercion.scala sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala Author: Michael Armbrust <michael@databricks.com> Closes #6914 from yhuai/timeCompareString-1.4 and squashes the following commits: 9882915 [Michael Armbrust] [SPARK-8420] [SQL] Fix comparision of timestamps/dates with strings	2015-06-22 10:45:33 -07:00
Cheng Lian	451c8722af	[SPARK-8406] [SQL] Backports SPARK-8406 and PR #6864 to branch-1.4 Author: Cheng Lian <lian@databricks.com> Closes #6932 from liancheng/spark-8406-for-1.4 and squashes the following commits: a0168fe [Cheng Lian] Backports SPARK-8406 and PR #6864 to branch-1.4	2015-06-22 10:04:29 -07:00
Liang-Chi Hsieh	b836bac3fe	[HOTFIX] Hotfix branch-1.4 building by removing avgMetrics in CrossValidatorSuite Ref. #6905 ping yhuai Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #6929 from viirya/hot_fix_cv_test and squashes the following commits: `b1aec53` [Liang-Chi Hsieh] Hotfix branch-1.4 by removing avgMetrics in CrossValidatorSuite.	2015-06-21 22:25:08 -07:00
Joseph K. Bradley	2a7ea31a9e	[SPARK-7715] [MLLIB] [ML] [DOC] Updated MLlib programming guide for release 1.4 Reorganized docs a bit. Added migration guides. Q: Do we want to say more for the 1.3 -> 1.4 migration guide for ```spark.ml```? It would be a lot. CC: mengxr Author: Joseph K. Bradley <joseph@databricks.com> Closes #6897 from jkbradley/ml-guide-1.4 and squashes the following commits: 4bf26d6 [Joseph K. Bradley] tiny fix 8085067 [Joseph K. Bradley] fixed spacing/layout issues in ml guide from previous commit in this PR 6cd5c78 [Joseph K. Bradley] Updated MLlib programming guide for release 1.4 (cherry picked from commit `a1894422ad`) Signed-off-by: Xiangrui Meng <meng@databricks.com>	2015-06-21 16:27:14 -07:00
jeanlyn	f0e4040202	[SPARK-8379] [SQL] avoid speculative tasks write to the same file The issue link [SPARK-8379](https://issues.apache.org/jira/browse/SPARK-8379) Currently,when we insert data to the dynamic partition with speculative tasks we will get the Exception ``` org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): Lease mismatch on /tmp/hive-jeanlyn/hive_2015-06-15_15-20-44_734_8801220787219172413-1/-ext-10000/ds=2015-06-15/type=2/part-00301.lzo owned by DFSClient_attempt_201506031520_0011_m_000189_0_-1513487243_53 but is accessed by DFSClient_attempt_201506031520_0011_m_000042_0_-1275047721_57 ``` This pr try to write the data to temporary dir when using dynamic parition avoid the speculative tasks writing the same file Author: jeanlyn <jeanlyn92@gmail.com> Closes #6833 from jeanlyn/speculation and squashes the following commits: 64bbfab [jeanlyn] use FileOutputFormat.getTaskOutputPath to get the path 8860af0 [jeanlyn] remove the never using code e19a3bd [jeanlyn] avoid speculative tasks write same file (cherry picked from commit `a1e3649c87`) Signed-off-by: Cheng Lian <lian@databricks.com>	2015-06-21 00:13:55 -07:00
Liang-Chi Hsieh	fe59a4a5f5	[SPARK-8468] [ML] Take the negative of some metrics in RegressionEvaluator to get correct cross validation JIRA: https://issues.apache.org/jira/browse/SPARK-8468 Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #6905 from viirya/cv_min and squashes the following commits: 930d3db [Liang-Chi Hsieh] Fix python unit test and add document. d632135 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into cv_min 16e3b2c [Liang-Chi Hsieh] Take the negative instead of reciprocal. c3dd8d9 [Liang-Chi Hsieh] For comments. b5f52c1 [Liang-Chi Hsieh] Add param to CrossValidator for choosing whether to maximize evaulation value. (cherry picked from commit `0b8995168f`) Signed-off-by: Joseph K. Bradley <joseph@databricks.com>	2015-06-20 13:02:14 -07:00
Andrew Or	9b16508d2c	[HOTFIX] [SPARK-8489] Correct JIRA number in previous commit It should be SPARK-8489, not SPARK-8498.	2015-06-19 17:40:21 -07:00
cody koeninger	a7b773a8b5	[SPARK-8390] [STREAMING] [KAFKA] fix docs related to HasOffsetRanges Author: cody koeninger <cody@koeninger.org> Closes #6863 from koeninger/SPARK-8390 and squashes the following commits: 26a06bd [cody koeninger] Merge branch 'master' into SPARK-8390 3744492 [cody koeninger] [Streaming][Kafka][SPARK-8390] doc changes per TD, test to make sure approach shown in docs actually compiles + runs b108c9d [cody koeninger] [Streaming][Kafka][SPARK-8390] further doc fixes, clean up spacing bb4336b [cody koeninger] [Streaming][Kafka][SPARK-8390] fix docs related to HasOffsetRanges, cleanup 3f3c57a [cody koeninger] [Streaming][Kafka][SPARK-8389] Example of getting offset ranges out of the existing java direct stream api	2015-06-19 17:36:59 -07:00
cody koeninger	78d0ceea82	[SPARK-8389] [STREAMING] [KAFKA] Example of getting offset ranges out o… …f the existing java direct stream api Author: cody koeninger <cody@koeninger.org> Closes #6846 from koeninger/SPARK-8389 and squashes the following commits: 3f3c57a [cody koeninger] [Streaming][Kafka][SPARK-8389] Example of getting offset ranges out of the existing java direct stream api	2015-06-19 17:36:54 -07:00
Andrew Or	2248ad8b70	[SPARK-8498] [SQL] Add regression test for SPARK-8470 Summary of the problem in SPARK-8470. When using `HiveContext` to create a data frame of a user case class, Spark throws `scala.reflect.internal.MissingRequirementError` when it tries to infer the schema using reflection. This is caused by `HiveContext` silently overwriting the context class loader containing the user classes. What this issue is about. This issue adds regression tests for SPARK-8470, which is already fixed in #6891. We closed SPARK-8470 as a duplicate because it is a different manifestation of the same problem in SPARK-8368. Due to the complexity of the reproduction, this requires us to pre-package a special test jar and include it in the Spark project itself. I tested this with and without the fix in #6891 and verified that it passes only if the fix is present. Author: Andrew Or <andrew@databricks.com> Closes #6909 from andrewor14/SPARK-8498 and squashes the following commits: 5e9d688 [Andrew Or] Add regression test for SPARK-8470 (cherry picked from commit `093c34838d`) Signed-off-by: Yin Huai <yhuai@databricks.com>	2015-06-19 17:34:36 -07:00
Yin Huai	2510365faa	[HOT-FIX] Fix compilation (caused by `0131142d98`) Author: Yin Huai <yhuai@databricks.com> Closes #6913 from yhuai/branch-1.4-hotfix and squashes the following commits: 7f91fa0 [Yin Huai] [HOT-FIX] Fix compilation (caused by `0131142d98`).	2015-06-19 17:29:51 -07:00
Nathan Howell	0131142d98	[SPARK-8093] [SQL] Remove empty structs inferred from JSON documents Author: Nathan Howell <nhowell@godaddy.com> Closes #6799 from NathanHowell/spark-8093 and squashes the following commits: 76ac3e8 [Nathan Howell] [SPARK-8093] [SQL] Remove empty structs inferred from JSON documents (cherry picked from commit `9814b971f0`) Signed-off-by: Yin Huai <yhuai@databricks.com> Conflicts: sql/core/src/test/scala/org/apache/spark/sql/json/TestJsonData.scala	2015-06-19 16:23:11 -07:00
Hossein	1a6b510784	[SPARK-8452] [SPARKR] expose jobGroup API in SparkR This pull request adds following methods to SparkR: ```R setJobGroup() cancelJobGroup() clearJobGroup() ``` For each method, the spark context is passed as the first argument. There does not seem to be a good way to test these in R. cc shivaram and davies Author: Hossein <hossein@databricks.com> Closes #6889 from falaki/SPARK-8452 and squashes the following commits: 9ce9f1e [Hossein] Added basic tests to verify methods can be called and won't throw errors c706af9 [Hossein] Added examples a2c19af [Hossein] taking spark context as first argument 343ca77 [Hossein] Added setJobGroup, cancelJobGroup and clearJobGroup to SparkR (cherry picked from commit `1fa29c2df2`) Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>	2015-06-19 15:52:27 -07:00
Yin Huai	9ac8393663	[SPARK-8368] [SPARK-8058] [SQL] HiveContext may override the context class loader of the current thread (branch 1.4) This is for 1.4 branch (based on https://github.com/apache/spark/pull/6891). Author: Yin Huai <yhuai@databricks.com> Closes #6895 from yhuai/SPARK-8368-1.4 and squashes the following commits: adbbbc9 [Yin Huai] Minor update. 3cca0e9 [Yin Huai] Correctly set the class loader in the conf of the state in client wrapper. b1e14a9 [Yin Huai] Failed tests.	2015-06-19 11:15:28 -07:00
Tathagata Das	4b2c793a27	[SPARK-7180] [SPARK-8090] [SPARK-8091] Fix a number of SerializationDebugger bugs and limitations This PR solves three SerializationDebugger issues. * SPARK-7180 - SerializationDebugger fails with ArrayOutOfBoundsException * SPARK-8090 - SerializationDebugger does not handle classes with writeReplace correctly * SPARK-8091 - SerializationDebugger does not handle classes with writeObject method The solutions for each are explained as follows * SPARK-7180 - The wrong slot desc was used for getting the value of the fields in the object being tested. * SPARK-8090 - Test the type of the replaced object. * SPARK-8091 - Use a dummy ObjectOutputStream to collect all the objects written by the writeObject() method, and then test those objects as usual. I also added more tests in the testsuite to increase code coverage. For example, added tests for cases where there are not serializability issues. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #6625 from tdas/SPARK-7180 and squashes the following commits: c7cb046 [Tathagata Das] Addressed comments on docs ae212c8 [Tathagata Das] Improved docs 304c97b [Tathagata Das] Fixed build error 26b5179 [Tathagata Das] more tests.....92% line coverage 7e2fdcf [Tathagata Das] Added more tests d1967fb [Tathagata Das] Added comments. da75d34 [Tathagata Das] Removed unnecessary lines. 50a608d [Tathagata Das] Fixed bugs and added support for writeObject	2015-06-19 11:06:32 -07:00
Sean Owen	3415fb978b	[SPARK-5836] [DOCS] [STREAMING] Clarify what may cause long-running Spark apps to preserve shuffle files Clarify what may cause long-running Spark apps to preserve shuffle files Author: Sean Owen <sowen@cloudera.com> Closes #6901 from srowen/SPARK-5836 and squashes the following commits: a9faef0 [Sean Owen] Clarify what may cause long-running Spark apps to preserve shuffle files (cherry picked from commit `4be53d0395`) Signed-off-by: Andrew Or <andrew@databricks.com>	2015-06-19 11:03:12 -07:00
Andrew Or	aedd893b42	[SPARK-8451] [SPARK-7287] SparkSubmitSuite should check exit code This patch also reenables the tests. Now that we have access to the log4j logs it should be easier to debug the flakiness. yhuai brkyvz Author: Andrew Or <andrew@databricks.com> Closes #6886 from andrewor14/spark-submit-suite-fix and squashes the following commits: 3f99ff1 [Andrew Or] Move destroy to finally block 9a62188 [Andrew Or] Re-enable ignored tests 2382672 [Andrew Or] Check for exit code (cherry picked from commit `68a2dca292`) Signed-off-by: Andrew Or <andrew@databricks.com>	2015-06-19 10:56:36 -07:00
Lianhui Wang	6f2e411084	[SPARK-8430] ExternalShuffleBlockResolver of shuffle service should support UnsafeShuffleManager andrewor14 can you take a look?thanks Author: Lianhui Wang <lianhuiwang09@gmail.com> Closes #6873 from lianhuiwang/SPARK-8430 and squashes the following commits: 51c47ca [Lianhui Wang] update andrewor's comments 2b27b19 [Lianhui Wang] support UnsafeShuffleManager (cherry picked from commit `9baf093014`) Signed-off-by: Andrew Or <andrew@databricks.com>	2015-06-19 10:47:15 -07:00
Xiangrui Meng	1f2dafb77f	[SPARK-8151] [MLLIB] pipeline components should correctly implement copy Otherwise, extra params get ignored in `PipelineModel.transform`. jkbradley Author: Xiangrui Meng <meng@databricks.com> Closes #6622 from mengxr/SPARK-8087 and squashes the following commits: 0e4c8c4 [Xiangrui Meng] fix merge issues 26fc1f0 [Xiangrui Meng] address comments e607a04 [Xiangrui Meng] merge master b85b57e [Xiangrui Meng] fix examples/compile d6f7891 [Xiangrui Meng] rename defaultCopyWithParams to defaultCopy 84ec278 [Xiangrui Meng] remove setter checks due to generics 2cf2ed0 [Xiangrui Meng] snapshot 291814f [Xiangrui Meng] OneVsRest.copy 1dfe3bd [Xiangrui Meng] PipelineModel.copy should copy stages (cherry picked from commit `43c7ec6384`) Signed-off-by: Xiangrui Meng <meng@databricks.com>	2015-06-19 10:05:07 -07:00
Kevin Conor	164b9d32e7	[SPARK-8339] [PYSPARK] integer division for python 3 Itertools islice requires an integer for the stop argument. Switching to integer division here prevents a ValueError when vs is evaluated above. davies This is my original work, and I license it to the project. Author: Kevin Conor <kevin@discoverybayconsulting.com> Closes #6794 from kconor/kconor-patch-1 and squashes the following commits: da5e700 [Kevin Conor] Integer division for batch size (cherry picked from commit `fdf63f1249`) Signed-off-by: Davies Liu <davies@databricks.com>	2015-06-19 00:12:43 -07:00
Cheng Lian	f48f3a2e2f	[SPARK-8458] [SQL] Don't strip scheme part of output path when writing ORC files `Path.toUri.getPath` strips scheme part of output path (from `file:///foo` to `/foo`), which causes ORC data source only writes to the file system configured in Hadoop configuration. Should use `Path.toString` instead. Author: Cheng Lian <lian@databricks.com> Closes #6892 from liancheng/spark-8458 and squashes the following commits: 87f8199 [Cheng Lian] Don't strip scheme of output path when writing ORC files (cherry picked from commit `a71cbbdea5`) Signed-off-by: Cheng Lian <lian@databricks.com>	2015-06-18 22:02:13 -07:00
Dibyendu Bhattacharya	b55e4b9a52	[SPARK-8080] [STREAMING] Receiver.store with Iterator does not give correct count at Spark UI tdas zsxwing this is the new PR for Spark-8080 I have merged https://github.com/apache/spark/pull/6659 Also to mention , for MEMORY_ONLY settings , when Block is not able to unrollSafely to memory if enough space is not there, BlockManager won't try to put the block and ReceivedBlockHandler will throw SparkException as it could not find the block id in PutResult. Thus number of records in block won't be counted if Block failed to unroll in memory. Which is fine. For MEMORY_DISK settings , if BlockManager not able to unroll block to memory, block will still get deseralized to Disk. Same for WAL based store. So for those cases ( storage level = memory + disk ) number of records will be counted even though the block not able to unroll to memory. thus I added the isFullyConsumed in the CountingIterator but have not used it as such case will never happen that block not fully consumed and ReceivedBlockHandler still get the block ID. I have added few test cases to cover those block unrolling scenarios also. Author: Dibyendu Bhattacharya <dibyendu.bhattacharya1@pearson.com> Author: U-PEROOT\UBHATD1 <UBHATD1@PIN-L-PI046.PEROOT.com> Closes #6707 from dibbhatt/master and squashes the following commits: f6cb6b5 [Dibyendu Bhattacharya] [SPARK-8080][STREAMING] Receiver.store with Iterator does not give correct count at Spark UI f37cfd8 [Dibyendu Bhattacharya] [SPARK-8080][STREAMING] Receiver.store with Iterator does not give correct count at Spark UI 5a8344a [Dibyendu Bhattacharya] [SPARK-8080][STREAMING] Receiver.store with Iterator does not give correct count at Spark UI Count ByteBufferBlock as 1 count fceac72 [Dibyendu Bhattacharya] [SPARK-8080][STREAMING] Receiver.store with Iterator does not give correct count at Spark UI 0153e7e [Dibyendu Bhattacharya] [SPARK-8080][STREAMING] Receiver.store with Iterator does not give correct count at Spark UI Fixed comments given by @zsxwing 4c5931d [Dibyendu Bhattacharya] [SPARK-8080][STREAMING] Receiver.store with Iterator does not give correct count at Spark UI 01e6dc8 [U-PEROOT\UBHATD1] A	2015-06-18 20:04:29 -07:00
Lars Francke	bd9bbd6119	[SPARK-8462] [DOCS] Documentation fixes for Spark SQL This fixes various minor documentation issues on the Spark SQL page Author: Lars Francke <lars.francke@gmail.com> Closes #6890 from lfrancke/SPARK-8462 and squashes the following commits: dd7e302 [Lars Francke] Merge branch 'master' into SPARK-8462 34eff2c [Lars Francke] Minor documentation fixes (cherry picked from commit `4ce3bab89f`) Signed-off-by: Josh Rosen <joshrosen@databricks.com>	2015-06-18 19:40:55 -07:00
Josh Rosen	152f4465d3	[SPARK-8446] [SQL] Add helper functions for testing SparkPlan physical operators This patch introduces `SparkPlanTest`, a base class for unit tests of SparkPlan physical operators. This is analogous to Spark SQL's existing `QueryTest`, which does something similar for end-to-end tests with actual queries. These helper methods provide nicer error output when tests fail and help developers to avoid writing lots of boilerplate in order to execute manually constructed physical plans. Author: Josh Rosen <joshrosen@databricks.com> Author: Josh Rosen <rosenville@gmail.com> Author: Michael Armbrust <michael@databricks.com> Closes #6885 from JoshRosen/spark-plan-test and squashes the following commits: f8ce275 [Josh Rosen] Fix some IntelliJ inspections and delete some dead code 84214be [Josh Rosen] Add an extra column which isn't part of the sort ae1896b [Josh Rosen] Provide implicits automatically a80f9b0 [Josh Rosen] Merge pull request #4 from marmbrus/pr/6885 d9ab1e4 [Michael Armbrust] Add simple resolver c60a44d [Josh Rosen] Manually bind references 996332a [Josh Rosen] Add types so that tests compile a46144a [Josh Rosen] WIP (cherry picked from commit `207a98ca59`) Signed-off-by: Michael Armbrust <michael@databricks.com>	2015-06-18 16:45:27 -07:00
zsxwing	9f293a9eb6	[SPARK-8376] [DOCS] Add common lang3 to the Spark Flume Sink doc Commons Lang 3 has been added as one of the dependencies of Spark Flume Sink since #5703. This PR updates the doc for it. Author: zsxwing <zsxwing@gmail.com> Closes #6829 from zsxwing/flume-sink-dep and squashes the following commits: f8617f0 [zsxwing] Add common lang3 to the Spark Flume Sink doc (cherry picked from commit `24e53793b4`) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>	2015-06-18 16:00:42 -07:00
Josh Rosen	c1da5cf029	[SPARK-8353] [DOCS] Show anchor links when hovering over documentation headers This patch uses [AnchorJS](https://bryanbraun.github.io/anchorjs/) to show deep anchor links when hovering over headers in the Spark documentation. For example: ![image](https://cloud.githubusercontent.com/assets/50748/8240800/1502f85c-15ba-11e5-819a-97b231370a39.png) This makes it easier for users to link to specific sections of the documentation. I also removed some dead Javascript which isn't used in our current docs (it was introduced for the old AMPCamp training, but isn't used anymore). Author: Josh Rosen <joshrosen@databricks.com> Closes #6808 from JoshRosen/SPARK-8353 and squashes the following commits: e59d8a7 [Josh Rosen] Suppress underline on hover f518b6a [Josh Rosen] Turn on for all headers, since we use H1s in a bunch of places a9fec01 [Josh Rosen] Add anchor links when hovering over headers; remove some dead JS code (cherry picked from commit `44c931f006`) Signed-off-by: Josh Rosen <joshrosen@databricks.com>	2015-06-18 15:10:33 -07:00
Davies Liu	ca23c3b014	[SPARK-8202] [PYSPARK] fix infinite loop during external sort in PySpark The batch size during external sort will grow up to max 10000, then shrink down to zero, causing infinite loop. Given the assumption that the items usually have similar size, so we don't need to adjust the batch size after first spill. cc JoshRosen rxin angelini Author: Davies Liu <davies@databricks.com> Closes #6714 from davies/batch_size and squashes the following commits: b170dfb [Davies Liu] update test b9be832 [Davies Liu] Merge branch 'batch_size' of github.com:davies/spark into batch_size 6ade745 [Davies Liu] update test 5c21777 [Davies Liu] Update shuffle.py e746aec [Davies Liu] fix batch size during sort	2015-06-18 13:49:32 -07:00
Burak Yavuz	9dabc12936	[SPARK-8095] Resolve dependencies of --packages in local ivy cache Dependencies of artifacts in the local ivy cache were not being resolved properly. The dependencies were not being picked up. Now they should be. cc andrewor14 Author: Burak Yavuz <brkyvz@gmail.com> Closes #6788 from brkyvz/local-ivy-fix and squashes the following commits: 2875bf4 [Burak Yavuz] fix temp dir bug 48cc648 [Burak Yavuz] improve deletion a69e3e6 [Burak Yavuz] delete cache before test as well 0037197 [Burak Yavuz] fix merge conflicts f60772c [Burak Yavuz] use different folder for m2 cache during testing b6ef038 [Burak Yavuz] [SPARK-8095] Resolve dependencies of Spark Packages in local ivy cache Conflicts: core/src/test/scala/org/apache/spark/deploy/SparkSubmitUtilsSuite.scala	2015-06-17 22:36:28 -07:00
xutingjun	67ad12d793	[SPARK-8392] RDDOperationGraph: getting cached nodes is slow ```def getAllNodes: Seq[RDDOperationNode] = { _childNodes ++ _childClusters.flatMap(_.childNodes) }``` when the ```_childClusters``` has so many nodes, the process will hang on. I think we can improve the efficiency here. Author: xutingjun <xutingjun@huawei.com> Closes #6839 from XuTingjun/DAGImprove and squashes the following commits: 53b03ea [xutingjun] change code to more concise and easier to read f98728b [xutingjun] fix words: node -> nodes f87c663 [xutingjun] put the filter inside 81f9fd2 [xutingjun] put the filter inside (cherry picked from commit `e2cdb0568b`) Signed-off-by: Andrew Or <andrew@databricks.com>	2015-06-17 22:31:39 -07:00
Yin Huai	73cf5def06	[SPARK-8306] [SQL] AddJar command needs to set the new class loader to the HiveConf inside executionHive.state. https://issues.apache.org/jira/browse/SPARK-8306 I will try to add a test later. marmbrus aarondav Author: Yin Huai <yhuai@databricks.com> Closes #6758 from yhuai/SPARK-8306 and squashes the following commits: 1292346 [Yin Huai] [SPARK-8306] AddJar command needs to set the new class loader to the HiveConf inside executionHive.state. (cherry picked from commit `302556ff99`) Signed-off-by: Michael Armbrust <michael@databricks.com> Conflicts: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/ClientWrapper.scala	2015-06-17 15:14:42 -07:00
zsxwing	5aedfa2ceb	[SPARK-8404] [STREAMING] [TESTS] Use thread-safe collections to make the tests more reliable KafkaStreamSuite, DirectKafkaStreamSuite, JavaKafkaStreamSuite and JavaDirectKafkaStreamSuite use non-thread-safe collections to collect data in one thread and check it in another thread. It may fail the tests. This PR changes them to thread-safe collections. Note: I cannot reproduce the test failures in my environment. But at least, this PR should make the tests more reliable. Author: zsxwing <zsxwing@gmail.com> Closes #6852 from zsxwing/fix-KafkaStreamSuite and squashes the following commits: d464211 [zsxwing] Use thread-safe collections to make the tests more reliable (cherry picked from commit `a06d9c8e76`) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>	2015-06-17 15:00:17 -07:00
zsxwing	5e7973df0e	[SPARK-8373] [PYSPARK] Add emptyRDD to pyspark and fix the issue when calling sum on an empty RDD This PR fixes the sum issue and also adds `emptyRDD` so that it's easy to create a test case. Author: zsxwing <zsxwing@gmail.com> Closes #6826 from zsxwing/python-emptyRDD and squashes the following commits: b36993f [zsxwing] Update the return type to JavaRDD[T] 71df047 [zsxwing] Add emptyRDD to pyspark and fix the issue when calling sum on an empty RDD (cherry picked from commit `0fc4b96f3e`) Signed-off-by: Andrew Or <andrew@databricks.com>	2015-06-17 13:59:47 -07:00
Carson Wang	f0513733d4	[SPARK-8372] History server shows incorrect information for application not started The history server may show an incorrect App ID for an incomplete application like <App ID>.inprogress. This app info will never disappear even after the app is completed. ![incorrectappinfo](https://cloud.githubusercontent.com/assets/9278199/8156147/2a10fdbe-137d-11e5-9620-c5b61d93e3c1.png) The cause of the issue is that a log path name is used as the app id when app id cannot be got during replay. Author: Carson Wang <carson.wang@intel.com> Closes #6827 from carsonwang/SPARK-8372 and squashes the following commits: cdbb089 [Carson Wang] Fix code style 3e46b35 [Carson Wang] Update code style 90f5dde [Carson Wang] Add a unit test d8c9cd0 [Carson Wang] Replaying events only return information when app is started (cherry picked from commit `2837e06709`) Signed-off-by: Andrew Or <andrew@databricks.com>	2015-06-17 13:42:46 -07:00
Mingfei	d75c53d88d	[SPARK-8161] Set externalBlockStoreInitialized to be true, after ExternalBlockStore is initialized externalBlockStoreInitialized is never set to be true, which causes the blocks stored in ExternalBlockStore can not be removed. Author: Mingfei <mingfei.shi@intel.com> Closes #6702 from shimingfei/SetTrue and squashes the following commits: add61d8 [Mingfei] Set externalBlockStoreInitialized to be true, after ExternalBlockStore is initialized (cherry picked from commit `7ad8c5d869`) Signed-off-by: Andrew Or <andrew@databricks.com>	2015-06-17 13:40:16 -07:00
Punya Biswal	a7f6979d0f	[SPARK-7515] [DOC] Update documentation for PySpark on YARN with cluster mode Now PySpark on YARN with cluster mode is supported so let's update doc. Author: Kousuke Saruta <sarutakoss.nttdata.co.jp> Closes #6040 from sarutak/update-doc-for-pyspark-on-yarn and squashes the following commits: ad9f88c [Kousuke Saruta] Brushed up sentences 469fd2e [Kousuke Saruta] Merge branch 'master' of https://github.com/apache/spark into update-doc-for-pyspark-on-yarn fcfdb92 [Kousuke Saruta] Updated doc for PySpark on YARN with cluster mode Author: Punya Biswal <pbiswal@palantir.com> Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #6842 from punya/feature/SPARK-7515 and squashes the following commits: 0b83648 [Punya Biswal] Merge remote-tracking branch 'origin/branch-1.4' into feature/SPARK-7515 de025cd [Kousuke Saruta] [SPARK-7515] [DOC] Update documentation for PySpark on YARN with cluster mode	2015-06-17 13:37:20 -07:00
Sean Owen	320c4420b9	[SPARK-8395] [DOCS] start-slave.sh docs incorrect start-slave.sh no longer takes a worker # param in 1.4+ Author: Sean Owen <sowen@cloudera.com> Closes #6855 from srowen/SPARK-8395 and squashes the following commits: 300278e [Sean Owen] start-slave.sh no longer takes a worker # param in 1.4+ (cherry picked from commit `f005be0273`) Signed-off-by: Andrew Or <andrew@databricks.com>	2015-06-17 13:31:17 -07:00
Vyacheslav Baranov	a5f602efcf	[SPARK-8309] [CORE] Support for more than 12M items in OpenHashMap The problem occurs because the position mask `0xEFFFFFF` is incorrect. It has zero 25th bit, so when capacity grows beyond 2^24, `OpenHashMap` calculates incorrect index of value in `_values` array. I've also added a size check in `rehash()`, so that it fails instead of reporting invalid item indices. Author: Vyacheslav Baranov <slavik.baranov@gmail.com> Closes #6763 from SlavikBaranov/SPARK-8309 and squashes the following commits: 8557445 [Vyacheslav Baranov] Resolved review comments 4d5b954 [Vyacheslav Baranov] Resolved review comments eaf1e68 [Vyacheslav Baranov] Fixed failing test f9284fd [Vyacheslav Baranov] Resolved review comments 3920656 [Vyacheslav Baranov] SPARK-8309: Support for more than 12M items in OpenHashMap (cherry picked from commit `c13da20a55`) Signed-off-by: Sean Owen <sowen@cloudera.com>	2015-06-17 09:42:41 +01:00

1 2 3 4 5 ...

11385 commits