ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Takuya UESHIN	9fe693b5b6	[SPARK-2446][SQL] Add BinaryType support to Parquet I/O. Note that this commit changes the semantics when loading in data that was created with prior versions of Spark SQL. Before, we were writing out strings as Binary data without adding any other annotations. Thus, when data is read in from prior versions, data that was StringType will now become BinaryType. Users that need strings can CAST that column to a String. It was decided that while this breaks compatibility, it does make us compatible with other systems (Hive, Thrift, etc) and adds support for Binary data, so this is the right decision long term. To support `BinaryType`, the following changes are needed: - Make `StringType` use `OriginalType.UTF8` - Add `BinaryType` using `PrimitiveTypeName.BINARY` without `OriginalType` Author: Takuya UESHIN <ueshin@happy-camper.st> Closes #1373 from ueshin/issues/SPARK-2446 and squashes the following commits: ecacb92 [Takuya UESHIN] Add BinaryType support to Parquet I/O. 616e04a [Takuya UESHIN] Make StringType use OriginalType.UTF8.	2014-07-14 15:42:35 -07:00
li-zhihui	3dd8af7a66	[SPARK-1946] Submit tasks after (configured ratio) executors have been registered Because submitting tasks and registering executors are asynchronous, in most situations the tasks of early stages run without preferred locality. A simple workaround is to sleep for a few seconds in the application, so that executors have enough time to register. This PR adds 2 configuration properties that make TaskScheduler submit tasks only after enough executors have been registered. # Submit tasks only after (registered executors / total executors) reaches the ratio, default value is 0 spark.scheduler.minRegisteredExecutorsRatio = 0.8 # Whether or not minRegisteredExecutorsRatio has been reached, submit tasks after maxRegisteredWaitingTime (milliseconds), default value is 30000 spark.scheduler.maxRegisteredExecutorsWaitingTime = 5000 Author: li-zhihui <zhihui.li@intel.com> Closes #900 from li-zhihui/master and squashes the following commits: b9f8326 [li-zhihui] Add logs & edit docs 1ac08b1 [li-zhihui] Add new configs to user docs 22ead12 [li-zhihui] Move waitBackendReady to postStartHook c6f0522 [li-zhihui] Bug fix: numExecutors wasn't set & use constant DEFAULT_NUMBER_EXECUTORS 4d6d847 [li-zhihui] Move waitBackendReady to TaskSchedulerImpl.start & some code refactor 0ecee9a [li-zhihui] Move waitBackendReady from DAGScheduler.submitStage to TaskSchedulerImpl.submitTasks 4261454 [li-zhihui] Add docs for new configs & code style ce0868a [li-zhihui] Code style, rename configuration property name of minRegisteredRatio & maxRegisteredWaitingTime 6cfb9ec [li-zhihui] Code style, revert default minRegisteredRatio of yarn to 0, driver get --num-executors in yarn/alpha 812c33c [li-zhihui] Fix driver lost --num-executors option in yarn-cluster mode e7b6272 [li-zhihui] support yarn-cluster 37f7dc2 [li-zhihui] support yarn mode(percentage style) 3f8c941 [li-zhihui] submit stage after (configured ratio of) executors have been registered	2014-07-14 15:32:49 -05:00
Zongheng Yang	d60b09bb60	[SPARK-2443][SQL] Fix slow read from partitioned tables This fix obtains a performance boost comparable to [PR #1390](https://github.com/apache/spark/pull/1390) by moving an array update and deserializer initialization out of a potentially very long loop. Suggested by yhuai. The below results are updated for this fix. ## Benchmarks Generated a local text file with 10M rows of simple key-value pairs. The data is loaded as a table through Hive. Results are obtained on my local machine using hive/console. Without the fix: Type | Non-partitioned | Partitioned (1 part) ------------ | ------------ | ------------- First run | 9.52s end-to-end (1.64s Spark job) | 36.6s (28.3s) Stabilized runs | 1.21s (1.18s) | 27.6s (27.5s) With this fix: Type | Non-partitioned | Partitioned (1 part) ------------ | ------------ | ------------- First run | 9.57s (1.46s) | 11.0s (1.69s) Stabilized runs | 1.13s (1.10s) | 1.23s (1.19s) Author: Zongheng Yang <zongheng.y@gmail.com> Closes #1408 from concretevitamin/slow-read-2 and squashes the following commits: d86e437 [Zongheng Yang] Move update & initialization out of potentially long loop.	2014-07-14 13:22:24 -07:00
Daoyuan	38ccd6ebd4	move some test file to match src code Just move some test suite to corresponding package Author: Daoyuan <daoyuan.wang@intel.com> Closes #1401 from adrian-wang/movetestfiles and squashes the following commits: d1a6803 [Daoyuan] move some test file to match src code	2014-07-14 10:40:44 -07:00
Prashant Sharma	aab5349660	Made rdd.py pep8 compliant by using Autopep8 and a little manual editing. Author: Prashant Sharma <prashant.s@imaginea.com> Closes #1354 from ScrapCodes/pep8-comp-1 and squashes the following commits: 9858ea8 [Prashant Sharma] Code Review d8851b7 [Prashant Sharma] Found # noqa works even inside comment blocks. Not sure if it works with all versions of python. 10c0cef [Prashant Sharma] Made rdd.py pep8 complaint by using Autopep8 and a little manual tweaking.	2014-07-14 00:42:59 -07:00
Sean Owen	635888cbed	SPARK-2363. Clean MLlib's sample data files (Just made a PR for this, mengxr was the reporter of:) MLlib has sample data under several folders: 1) data/mllib 2) data/ 3) mllib/data/* Per previous discussion with Matei Zaharia, we want to put them under `data/mllib` and clean outdated files. Author: Sean Owen <sowen@cloudera.com> Closes #1394 from srowen/SPARK-2363 and squashes the following commits: 54313dd [Sean Owen] Move ML example data from /mllib/data/ and /data/ into /data/mllib/	2014-07-13 19:27:43 -07:00
Sandy Ryza	4c8be64e76	SPARK-2462. Make Vector.apply public. Apologies if there's an already-discussed reason I missed for why this doesn't make sense. Author: Sandy Ryza <sandy@cloudera.com> Closes #1389 from sryza/sandy-spark-2462 and squashes the following commits: 2e5e201 [Sandy Ryza] SPARK-2462. Make Vector.apply public.	2014-07-12 16:55:15 -07:00
Michael Armbrust	1a7d7cc85f	[SPARK-2405][SQL] Reuse same byte buffers when creating new instance of InMemoryRelation Reuse byte buffers when creating unique attributes for multiple instances of an InMemoryRelation in a single query plan. Author: Michael Armbrust <michael@databricks.com> Closes #1332 from marmbrus/doubleCache and squashes the following commits: 4a19609 [Michael Armbrust] Clean up concurrency story by calculating buffersn the constructor. b39c931 [Michael Armbrust] Allocations are kind of a side effect. f67eff7 [Michael Armbrust] Reusue same byte buffers when creating new instance of InMemoryRelation	2014-07-12 12:13:32 -07:00
Michael Armbrust	7e26b57615	[SPARK-2441][SQL] Add more efficient distinct operator. Author: Michael Armbrust <michael@databricks.com> Closes #1366 from marmbrus/partialDistinct and squashes the following commits: 12a31ab [Michael Armbrust] Add more efficient distinct operator.	2014-07-12 12:07:27 -07:00
Ankur Dave	7a01352931	[SPARK-2455] Mark (Shippable)VertexPartition serializable VertexPartition and ShippableVertexPartition are contained in RDDs but are not marked Serializable, leading to NotSerializableExceptions when using Java serialization. The fix is simply to mark them as Serializable. This PR does that and adds a test for serializing them using Java and Kryo serialization. Author: Ankur Dave <ankurdave@gmail.com> Closes #1376 from ankurdave/SPARK-2455 and squashes the following commits: ed4a51b [Ankur Dave] Make (Shippable)VertexPartition serializable 1fd42c5 [Ankur Dave] Add failing tests for Java serialization	2014-07-12 12:05:34 -07:00
Daniel Darabos	2245c87af4	Use the Executor's ClassLoader in sc.objectFile(). This makes it possible to read classes from the object file which were specified in the user-provided jars. (By default ObjectInputStream uses latestUserDefinedLoader, which may or may not be the right one.) I created this because I ran into the following problem. I have x:RDD[X] with X being defined in the jar that I provide to SparkContext. I save it with x.saveAsObjectFile("x"). I try to load it with sc.objectFile[X]("x"). It fails with ClassNotFoundException. After a good while of debugging I figured out that Utils.deserialize() most likely uses the ClassLoader of Utils. This is the bootstrap ClassLoader, so it is not aware of the dynamically added jars. This patch fixes the issue. A more robust fix would be to always default to Thread.currentThread.getContextClassLoader. This would prevent this problem from biting anyone in the future. It would be a bit harder to test though. On the topic of testing, if you'd like to see tests for this, I will need some hand-holding. Thanks! Author: Daniel Darabos <darabos.daniel@gmail.com> Closes #181 from darabos/master and squashes the following commits: 45a011a [Daniel Darabos] Add test for SPARK-1877. (Fixed in 52eb54d.) e13e090 [Daniel Darabos] Merge branch 'master' of https://github.com/apache/spark 61fe0d0 [Daniel Darabos] Fix style (line too long). 1b5df2c [Daniel Darabos] Use the Executor's ClassLoader in sc.objectFile(). This makes it possible to read classes from the object file which were specified in the user-provided jars. (By default ObjectInputStream uses latestUserDefinedLoader, which may or may not be the right one.)	2014-07-12 00:07:42 -07:00
Li Pu	d38887b8a0	use specialized axpy in RowMatrix for SVD After running some more tests on large matrix, found that the BV axpy (breeze/linalg/Vector.scala, axpy) is slower than the BSV axpy (breeze/linalg/operators/SparseVectorOps.scala, sv_dv_axpy), 8s v.s. 2s for each multiplication. The BV axpy operates on an iterator while BSV axpy directly operates on the underlying array. I think the overhead comes from creating the iterator (with a zip) and advancing the pointers. Author: Li Pu <lpu@twitter.com> Author: Xiangrui Meng <meng@databricks.com> Author: Li Pu <li.pu@outlook.com> Closes #1378 from vrilleup/master and squashes the following commits: 6fb01a3 [Li Pu] use specialized axpy in RowMatrix 5255f2a [Li Pu] Merge remote-tracking branch 'upstream/master' 7312ec1 [Li Pu] very minor comment fix 4c618e9 [Li Pu] Merge pull request #1 from mengxr/vrilleup-master a461082 [Xiangrui Meng] make superscript show up correctly in doc 861ec48 [Xiangrui Meng] simplify axpy 62969fa [Xiangrui Meng] use BDV directly in symmetricEigs change the computation mode to local-svd, local-eigs, and dist-eigs update tests and docs c273771 [Li Pu] automatically determine SVD compute mode and parameters 7148426 [Li Pu] improve RowMatrix multiply 5543cce [Li Pu] improve svd api 819824b [Li Pu] add flag for dense svd or sparse svd eb15100 [Li Pu] fix binary compatibility 4c7aec3 [Li Pu] improve comments e7850ed [Li Pu] use aggregate and axpy 827411b [Li Pu] fix EOF new line 9c80515 [Li Pu] use non-sparse implementation when k = n fe983b0 [Li Pu] improve scala style 96d2ecb [Li Pu] improve eigenvalue sorting e1db950 [Li Pu] SPARK-1782: svd for sparse matrix using ARPACK	2014-07-11 23:26:47 -07:00
DB Tsai	5596086935	[SPARK-1969][MLlib] Online summarizer APIs for mean, variance, min, and max This moves the private ColumnStatisticsAggregator class out of RowMatrix into a publicly available DeveloperApi with documentation and unit tests. Changes: 1) Moved the private implementation from org.apache.spark.mllib.linalg.ColumnStatisticsAggregator to org.apache.spark.mllib.stat.MultivariateOnlineSummarizer 2) When creating an OnlineSummarizer object, the number of columns is not needed in the constructor. It's determined when users add the first sample. 3) Added API documentation for MultivariateOnlineSummarizer. 4) Added unit tests for MultivariateOnlineSummarizer. Author: DB Tsai <dbtsai@dbtsai.com> Closes #955 from dbtsai/dbtsai-summarizer and squashes the following commits: b13ac90 [DB Tsai] dbtsai-summarizer	2014-07-11 23:04:43 -07:00
Kousuke Saruta	cbff18774b	[SPARK-2457] Inconsistent description in README about build option Now, we should use -Pyarn instead of SPARK_YARN when building but README says as follows. For Apache Hadoop 2.2.X, 2.1.X, 2.0.X, 0.23.x, Cloudera CDH MRv2, and other Hadoop versions with YARN, also set `SPARK_YARN=true`: # Apache Hadoop 2.0.5-alpha $ sbt/sbt -Dhadoop.version=2.0.5-alpha -Pyarn assembly # Cloudera CDH 4.2.0 with MapReduce v2 $ sbt/sbt -Dhadoop.version=2.0.0-cdh4.2.0 -Pyarn assembly # Apache Hadoop 2.2.X and newer $ sbt/sbt -Dhadoop.version=2.2.0 -Pyarn assembly Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #1382 from sarutak/SPARK-2457 and squashes the following commits: e7b2d64 [Kousuke Saruta] Replaced "SPARK_YARN=true" with "-Pyarn" in README	2014-07-11 21:10:26 -07:00
Prashant Sharma	b23e9c3e40	[SPARK-2437] Rename MAVEN_PROFILES to SBT_MAVEN_PROFILES and add SBT_MAVEN_PROPERTIES NOTE: It is not possible to use both env variable `SBT_MAVEN_PROFILES` and `-P` flag at same time. `-P` if specified takes precedence. Author: Prashant Sharma <prashant.s@imaginea.com> Closes #1374 from ScrapCodes/SPARK-2437/rename-MAVEN_PROFILES and squashes the following commits: 8694bde [Prashant Sharma] [SPARK-2437] Rename MAVEN_PROFILES to SBT_MAVEN_PROFILES and add SBT_MAVEN_PROPERTIES	2014-07-11 11:52:35 -07:00
Andrew Or	f4f46dec5a	[Minor] Remove unused val in Master Author: Andrew Or <andrewor14@gmail.com> Closes #1365 from andrewor14/master-fs and squashes the following commits: 497f100 [Andrew Or] Sneak in a space and hope no one will notice 05ba6da [Andrew Or] Remove unused val	2014-07-11 00:21:16 -07:00
CrazyJvm	282cca0e49	fix Graph partitionStrategy comment Author: CrazyJvm <crazyjvm@gmail.com> Closes #1368 from CrazyJvm/graph-comment-1 and squashes the following commits: d47f3c5 [CrazyJvm] fix style e190d6f [CrazyJvm] fix Graph partitionStrategy comment	2014-07-11 00:02:24 -07:00
Xiangrui Meng	2f59ce7dbe	[SPARK-2358][MLLIB] Add an option to include native BLAS/LAPACK loader in the build It would be easy for users to include the netlib-java jniloader in the spark jar, which is LGPL-licensed. We can follow the same approach as ganglia support in Spark, which could be enabled by turning on "-Pganglia-lgpl" at build time. We can use "-Pnetlib-lgpl" flag for this. Author: Xiangrui Meng <meng@databricks.com> Closes #1295 from mengxr/netlib-lgpl and squashes the following commits: aebf001 [Xiangrui Meng] add a profile to optionally include native BLAS/LAPACK loader in mllib	2014-07-10 21:57:54 -07:00
Takuya UESHIN	10b59ba230	[SPARK-2428][SQL] Add except and intersect methods to SchemaRDD. Author: Takuya UESHIN <ueshin@happy-camper.st> Closes #1355 from ueshin/issues/SPARK-2428 and squashes the following commits: b6fa264 [Takuya UESHIN] Add except and intersect methods to SchemaRDD.	2014-07-10 19:27:24 -07:00
Takuya UESHIN	f5abd27129	[SPARK-2415] [SQL] RowWriteSupport should handle empty ArrayType correctly. `RowWriteSupport` doesn't write empty `ArrayType` value, so the read value becomes `null`. It should write empty `ArrayType` value as it is. Author: Takuya UESHIN <ueshin@happy-camper.st> Closes #1339 from ueshin/issues/SPARK-2415 and squashes the following commits: 32afc87 [Takuya UESHIN] Merge branch 'master' into issues/SPARK-2415 2f05196 [Takuya UESHIN] Fix RowWriteSupport to handle empty ArrayType correctly.	2014-07-10 19:23:44 -07:00
Takuya UESHIN	f62c427289	[SPARK-2431][SQL] Refine StringComparison and related codes. Refine `StringComparison` and related codes as follows: - `StringComparison` could be similar to `StringRegexExpression` or `CaseConversionExpression`. - Nullability of `StringRegexExpression` could depend on children's nullabilities. - Add a case that the like condition includes no wildcard to `LikeSimplification`. Author: Takuya UESHIN <ueshin@happy-camper.st> Closes #1357 from ueshin/issues/SPARK-2431 and squashes the following commits: 77766f5 [Takuya UESHIN] Add a case that the like condition includes no wildcard to LikeSimplification. b9da9d2 [Takuya UESHIN] Fix nullability of StringRegexExpression. 680bb72 [Takuya UESHIN] Refine StringComparison.	2014-07-10 19:20:00 -07:00
Artjom-Metro	ae8ca4dfba	SPARK-2427: Fix Scala examples that use the wrong command line arguments index The Scala examples HBaseTest and HdfsTest don't use the correct indexes for the command line arguments. This is due to the fix for JIRA 1565, where these examples were not correctly adapted to the new usage of the submit script. Author: Artjom-Metro <Artjom-Metro@users.noreply.github.com> Author: Artjom-Metro <artjom31415@googlemail.com> Closes #1353 from Artjom-Metro/fix_examples and squashes the following commits: 6111801 [Artjom-Metro] Reduce the default number of iterations cfaa73c [Artjom-Metro] Fix some examples that use the wrong index to access the command line arguments	2014-07-10 16:03:30 -07:00
Issac Buenrostro	2dd6724850	[SPARK-1341] [Streaming] Throttle BlockGenerator to limit rate of data consumption. Author: Issac Buenrostro <buenrostro@ooyala.com> Closes #945 from ibuenros/SPARK-1341-throttle and squashes the following commits: 5514916 [Issac Buenrostro] Formatting changes, added documentation for streaming throttling, stricter unit tests for throttling. 62f395f [Issac Buenrostro] Add comments and license to streaming RateLimiter.scala 7066438 [Issac Buenrostro] Moved throttle code to RateLimiter class, smoother pushing when throttling active ccafe09 [Issac Buenrostro] Throttle BlockGenerator to limit rate of data consumption.	2014-07-10 16:01:08 -07:00
tmalaska	40a8fef4e6	[SPARK-1478].3: Upgrade FlumeInputDStream's FlumeReceiver to support FLUME-1915 This is a modified version of this PR https://github.com/apache/spark/pull/1168 done by @tmalaska Adds MIMA binary check exclusions. Author: tmalaska <ted.malaska@cloudera.com> Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #1347 from tdas/FLUME-1915 and squashes the following commits: 96065df [Tathagata Das] Added Mima exclusion for FlumeReceiver. 41d5338 [tmalaska] Address line 57 that was too long 12617e5 [tmalaska] SPARK-1478: Upgrade FlumeInputDStream's Flume...	2014-07-10 13:15:02 -07:00
Nicholas Chammas	369aa84e8f	name ec2 instances and security groups consistently Security groups created by `spark-ec2` do not prepend “spark-“ to the name. Since naming the instances themselves is new to `spark-ec2`, it’s better to change that pattern to match the existing naming pattern for the security groups, rather than the other way around. Author: Nicholas Chammas <nicholas.chammas@gmail.com> Author: nchammas <nicholas.chammas@gmail.com> Closes #1344 from nchammas/master and squashes the following commits: f7e4581 [Nicholas Chammas] unrelated pep8 fix a36eed0 [Nicholas Chammas] name ec2 instances and security groups consistently de7292a [nchammas] Merge pull request #4 from apache/master 2e4fe00 [nchammas] Merge pull request #3 from apache/master 89fde08 [nchammas] Merge pull request #2 from apache/master 69f6e22 [Nicholas Chammas] PEP8 fixes 2627247 [Nicholas Chammas] broke up lines before they hit 100 chars 6544b7e [Nicholas Chammas] [SPARK-2065] give launched instances names 69da6cf [nchammas] Merge pull request #1 from apache/master	2014-07-10 12:56:00 -07:00
Patrick Wendell	88006a6237	HOTFIX: Minor doc update for sbt change	2014-07-10 11:11:00 -07:00
Prashant Sharma	628932b8d0	[SPARK-1776] Have Spark's SBT build read dependencies from Maven. Patch introduces the new way of working also retaining the existing ways of doing things. For example build instruction for yarn in maven is `mvn -Pyarn -PHadoop2.2 clean package -DskipTests` in sbt it can become `MAVEN_PROFILES="yarn, hadoop-2.2" sbt/sbt clean assembly` Also supports `sbt/sbt -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0 clean assembly` Author: Prashant Sharma <prashant.s@imaginea.com> Author: Patrick Wendell <pwendell@gmail.com> Closes #772 from ScrapCodes/sbt-maven and squashes the following commits: a8ac951 [Prashant Sharma] Updated sbt version. 62b09bb [Prashant Sharma] Improvements. fa6221d [Prashant Sharma] Excluding sql from mima 4b8875e [Prashant Sharma] Sbt assembly no longer builds tools by default. 72651ca [Prashant Sharma] Addresses code reivew comments. acab73d [Prashant Sharma] Revert "Small fix to run-examples script." ac4312c [Prashant Sharma] Revert "minor fix" 6af91ac [Prashant Sharma] Ported oldDeps back. + fixes issues with prev commit. 65cf06c [Prashant Sharma] Servelet API jars mess up with the other servlet jars on the class path. 446768e [Prashant Sharma] minor fix 89b9777 [Prashant Sharma] Merge conflicts d0a02f2 [Prashant Sharma] Bumped up pom versions, Since the build now depends on pom it is better updated there. + general cleanups. dccc8ac [Prashant Sharma] updated mima to check against 1.0 a49c61b [Prashant Sharma] Fix for tools jar a2f5ae1 [Prashant Sharma] Fixes a bug in dependencies. cf88758 [Prashant Sharma] cleanup 9439ea3 [Prashant Sharma] Small fix to run-examples script. 96cea1f [Prashant Sharma] SPARK-1776 Have Spark's SBT build read dependencies from Maven. `36efa62` [Patrick Wendell] Set project name in pom files and added eclipse/intellij plugins. 4973dbd [Patrick Wendell] Example build using pom reader.	2014-07-10 11:03:37 -07:00
Masayoshi TSUZUKI	c2babc089b	SPARK-2115: Stage kill link is too close to stage details link Moved (kill) link to the right side. Add confirmation dialog when (kill) link is clicked. Author: Masayoshi TSUZUKI <tsudukim@oss.nttdata.co.jp> Closes #1350 from tsudukim/feature/SPARK-2115 and squashes the following commits: e2263b0 [Masayoshi TSUZUKI] Moved (kill) link to the right side. Add confirmation dialog when (kill) link is clicked.	2014-07-10 01:18:37 -07:00
Raymond Liu	2b18ea9826	Clean up SparkKMeans example's code remove unused code Author: Raymond Liu <raymond.liu@intel.com> Closes #1352 from colorant/kmeans and squashes the following commits: ddcd1dd [Raymond Liu] Clean up SparkKMeans example's code	2014-07-09 23:39:29 -07:00
Patrick Wendell	553c578de1	HOTFIX: Remove persistently failing test in master. Apparently this functionality is going to be removed soon anyway.	2014-07-09 19:44:24 -07:00
Patrick Wendell	dd22bc2d57	Revert "[HOTFIX] Synchronize on SQLContext.settings in tests." This reverts commit `d4c30cd991`.	2014-07-09 19:36:38 -07:00
Patrick Wendell	2e0a037dff	SPARK-2416: Allow richer reporting of unit test results The built-in Jenkins integration is pretty bad. It's very confusing to users whether tests have passed or failed and we can't easily customize the message. With some small scripting around the Github API we can do much better than this. Author: Patrick Wendell <pwendell@gmail.com> Closes #1340 from pwendell/better-qa-messages and squashes the following commits: fd6077d [Patrick Wendell] Better automation for unit tests.	2014-07-09 19:26:16 -07:00
Li Pu	1f33e1f201	SPARK-1782: svd for sparse matrix using ARPACK copy ARPACK dsaupd/dseupd code from latest breeze change RowMatrix to use sparse SVD change tests for sparse SVD All tests passed. I will run it against some large matrices. Author: Li Pu <lpu@twitter.com> Author: Xiangrui Meng <meng@databricks.com> Author: Li Pu <li.pu@outlook.com> Closes #964 from vrilleup/master and squashes the following commits: 7312ec1 [Li Pu] very minor comment fix 4c618e9 [Li Pu] Merge pull request #1 from mengxr/vrilleup-master a461082 [Xiangrui Meng] make superscript show up correctly in doc 861ec48 [Xiangrui Meng] simplify axpy 62969fa [Xiangrui Meng] use BDV directly in symmetricEigs change the computation mode to local-svd, local-eigs, and dist-eigs update tests and docs c273771 [Li Pu] automatically determine SVD compute mode and parameters 7148426 [Li Pu] improve RowMatrix multiply 5543cce [Li Pu] improve svd api 819824b [Li Pu] add flag for dense svd or sparse svd eb15100 [Li Pu] fix binary compatibility 4c7aec3 [Li Pu] improve comments e7850ed [Li Pu] use aggregate and axpy 827411b [Li Pu] fix EOF new line 9c80515 [Li Pu] use non-sparse implementation when k = n fe983b0 [Li Pu] improve scala style 96d2ecb [Li Pu] improve eigenvalue sorting e1db950 [Li Pu] SPARK-1782: svd for sparse matrix using ARPACK	2014-07-09 12:15:08 -07:00
johnnywalleye	d35e3db232	[SPARK-2417][MLlib] Fix DecisionTree tests Fixes test failures introduced by https://github.com/apache/spark/pull/1316. For both the regression and classification cases, val stats is the InformationGainStats for the best tree split. stats.predict is the predicted value for the data, before the split is made. Since 600 of the 1,000 values generated by DecisionTreeSuite.generateCategoricalDataPoints() are 1.0 and the rest 0.0, the regression tree and classification tree both correctly predict a value of 0.6 for this data now, and the assertions have been changed to reflect that. Author: johnnywalleye <jsondag@gmail.com> Closes #1343 from johnnywalleye/decision-tree-tests and squashes the following commits: ef80603 [johnnywalleye] [SPARK-2417][MLlib] Fix DecisionTree tests	2014-07-09 11:06:34 -07:00
Manuel Laflamme	0eb11527d1	[STREAMING] SPARK-2343: Fix QueueInputDStream with oneAtATime false Fix QueueInputDStream which was not removing dequeued items when used with the oneAtATime flag disabled. Author: Manuel Laflamme <manuel.laflamme@gmail.com> Closes #1285 from mlaflamm/spark-2343 and squashes the following commits: 61c9e38 [Manuel Laflamme] Unit tests for queue input stream c51d029 [Manuel Laflamme] Fix QueueInputDStream with oneAtATime false	2014-07-09 10:45:45 -07:00
Kay Ousterhout	339441f545	[SPARK-2384] Add tooltips to UI. This patch adds tooltips to clarify some points of confusion in the UI. When users mouse over some of the table headers (shuffle read, write, and input size) as well as over the "scheduler delay" metric shown for each stage, a black tool tip (see image below) pops up describing the metric in more detail. After the tooltip mechanism is added by this commit, I imagine others may want to add more tooltips for other things in the UI, but I think this is a good starting point. ![tooltip](https://cloud.githubusercontent.com/assets/1108612/3491905/994e179e-059f-11e4-92f2-c6c12d248d81.jpg) This looks scary-big but much of it is adding the bootstrap tool tip JavaScript. Also I have no idea what to put for the license in tooltip (I left it the same -- the Twitter apache header) or for JQuery (left it as nothing) -- @mateiz what's the right thing here? cc @pwendell @andrewor14 @rxin Author: Kay Ousterhout <kayousterhout@gmail.com> Closes #1314 from kayousterhout/tooltips and squashes the following commits: 19981b5 [Kay Ousterhout] Exclude non-licensed javascript files from style check d9ab5a9 [Kay Ousterhout] Response to Andrew's review 7752449 [Kay Ousterhout] [SPARK-2384] Add tooltips to UI.	2014-07-08 22:57:21 -07:00
johnnywalleye	1114207cc8	[SPARK-2152][MLlib] fix bin offset in DecisionTree node aggregations (also resolves SPARK-2160) Hi, this pull fixes (what I believe to be) a bug in DecisionTree.scala. In the extractLeftRightNodeAggregates function, the first set of rightNodeAgg values for Regression are set in line 792 as follows: rightNodeAgg(featureIndex)(2 * (numBins - 2)) = binData(shift + (2 * numBins - 1))) Then there is a loop that sets the rest of the values, as in line 809: rightNodeAgg(featureIndex)(2 * (numBins - 2 - splitIndex)) = binData(shift + (2 * (numBins - 2 - splitIndex))) + rightNodeAgg(featureIndex)(2 * (numBins - 1 - splitIndex)) But since splitIndex starts at 1, this ends up skipping a set of binData values. The changes here address this issue, for both the Regression and Classification cases. Author: johnnywalleye <jsondag@gmail.com> Closes #1316 from johnnywalleye/master and squashes the following commits: 73809da [johnnywalleye] fix bin offset in DecisionTree node aggregations	2014-07-08 19:17:26 -07:00
DB Tsai	ac9cdc116e	[SPARK-2413] Upgrade junit_xml_listener to 0.5.1 which fixes the following issues 1) fix the class name to be fully qualified classpath 2) make sure the reporting time is in seconds, not milliseconds, which was causing JUnit HTML to report incorrect numbers 3) make sure the durations of the tests are cumulative. Author: DB Tsai <dbtsai@alpinenow.com> Closes #1333 from dbtsai/dbtsai-junit and squashes the following commits: bbeac4b [DB Tsai] Upgrade junit_xml_listener to 0.5.1 which fixes the following issues	2014-07-08 17:50:36 -07:00
Andrew Or	bf04a390e4	[SPARK-2392] Executors should not start their own HTTP servers Executors currently start their own unused HTTP file servers. This is because we use the same SparkEnv class for both executors and drivers, and we do not distinguish this case. In the longer term, we should separate out SparkEnv for the driver and SparkEnv for the executors. Author: Andrew Or <andrewor14@gmail.com> Closes #1335 from andrewor14/executor-http-server and squashes the following commits: 46ef263 [Andrew Or] Start HTTP server only on the driver	2014-07-08 17:35:31 -07:00
Gabriele Nizzoli	e6f7bfcfbf	[SPARK-2362] Fix for newFilesOnly logic in file DStream The newFilesOnly logic should be inverted: the logic should be that if the flag newFilesOnly==true then only start reading files older than current time. As the code is now if newFilesOnly==true then it will start to read files that are older than 0L (that is: every file in the directory). Author: Gabriele Nizzoli <mail@nizzoli.net> Closes #1077 from gabrielenizzoli/master and squashes the following commits: 4f1d261 [Gabriele Nizzoli] Fix for newFilesOnly logic in file DStream	2014-07-08 14:23:38 -07:00
Reynold Xin	32516f866a	[SPARK-2409] Make SQLConf thread safe. Author: Reynold Xin <rxin@apache.org> Closes #1334 from rxin/sqlConfThreadSafetuy and squashes the following commits: c1e0a5a [Reynold Xin] Fixed the duplicate comment. 7614372 [Reynold Xin] [SPARK-2409] Make SQLConf thread safe.	2014-07-08 14:00:47 -07:00
CrazyJvm	b520b6453e	SPARK-2400 : fix spark.yarn.max.executor.failures explanation According to ```scala private val maxNumExecutorFailures = sparkConf.getInt("spark.yarn.max.executor.failures", sparkConf.getInt("spark.yarn.max.worker.failures", math.max(args.numExecutors * 2, 3))) ``` the default value should be numExecutors * 2, with a minimum of 3, and the same applies to the config `spark.yarn.max.worker.failures` Author: CrazyJvm <crazyjvm@gmail.com> Closes #1282 from CrazyJvm/yarn-doc and squashes the following commits: 1a5f25b [CrazyJvm] remove deprecated config c438aec [CrazyJvm] fix style 86effa6 [CrazyJvm] change expression 211f130 [CrazyJvm] fix html tag 2900d23 [CrazyJvm] fix style a4b2e27 [CrazyJvm] fix configuration spark.yarn.max.executor.failures	2014-07-08 13:55:42 -05:00
Daniel Darabos	c8a2313cdf	[SPARK-2403] Catch all errors during serialization in DAGScheduler https://issues.apache.org/jira/browse/SPARK-2403 Spark hangs for us whenever we forget to register a class with Kryo. This should be a simple fix for that. But let me know if you have a better suggestion. I did not write a new test for this. It would be pretty complicated and I'm not sure it's worthwhile for such a simple change. Let me know if you disagree. Author: Daniel Darabos <darabos.daniel@gmail.com> Closes #1329 from darabos/spark-2403 and squashes the following commits: 3aceaad [Daniel Darabos] Print full stack trace for miscellaneous exceptions during serialization. 52c22ba [Daniel Darabos] Only catch NonFatal exceptions. 361e962 [Daniel Darabos] Catch all errors during serialization in DAGScheduler.	2014-07-08 10:43:46 -07:00
Michael Armbrust	cc3e0a14da	[SPARK-2395][SQL] Optimize common LIKE patterns. Author: Michael Armbrust <michael@databricks.com> Closes #1325 from marmbrus/slowLike and squashes the following commits: 023c3eb [Michael Armbrust] add comment. 8b421c2 [Michael Armbrust] Handle the case where the final % is actually escaped. d34d37e [Michael Armbrust] add periods. 3bbf35f [Michael Armbrust] Roll back changes to SparkBuild 53894b1 [Michael Armbrust] Fix grammar. 4094462 [Michael Armbrust] Fix grammar. 6d3d0a0 [Michael Armbrust] Optimize common LIKE patterns.	2014-07-08 10:36:18 -07:00
Andrew Or	56e009d4f0	[EC2] Add default history server port to ec2 script Right now I have to open it manually Author: Andrew Or <andrewor14@gmail.com> Closes #1296 from andrewor14/hist-serv-port and squashes the following commits: 8895a1f [Andrew Or] Add default history server port to ec2 script	2014-07-08 16:49:31 +09:00
Michael Armbrust	5a4063645d	[SPARK-2391][SQL] Custom take() for LIMIT queries. Using Spark's take can result in an entire in-memory partition to be shipped in order to retrieve a single row. Author: Michael Armbrust <michael@databricks.com> Closes #1318 from marmbrus/takeLimit and squashes the following commits: 77289a5 [Michael Armbrust] Update scala doc 32f0674 [Michael Armbrust] Custom take implementation for LIMIT queries.	2014-07-08 00:41:46 -07:00
witgo	3cd5029be7	Resolve sbt warnings during build Ⅱ Author: witgo <witgo@qq.com> Closes #1153 from witgo/expectResult and squashes the following commits: 97541d8 [witgo] merge master ead26e7 [witgo] Resolve sbt warnings during build	2014-07-08 00:31:42 -07:00
Rishi Verma	0128905eea	Updated programming-guide.md Made sure that readers know the random number generator seed argument, within the 'takeSample' method, is optional. Author: Rishi Verma <riverma@apache.org> Closes #1324 from riverma/patch-1 and squashes the following commits: 4699676 [Rishi Verma] Updated programming-guide.md	2014-07-08 00:29:23 -07:00
Yanjie Gao	50561f4396	[SPARK-2235][SQL]Spark SQL basicOperator add Intersect operator Hi all, I want to submit a basic operator, Intersect. For example, in SQL: select * from table1 intersect select * from table2 I want this operator to support that functionality in Spark SQL. The operator returns the intersection of the RDDs of its SparkPlan children. JIRA:https://issues.apache.org/jira/browse/SPARK-2235 Author: Yanjie Gao <gaoyanjie55@163.com> Author: YanjieGao <396154235@qq.com> Closes #1150 from YanjieGao/patch-5 and squashes the following commits: 4629afe [YanjieGao] reformat the code bdc2ac0 [YanjieGao] reformat the code as Michael's suggestion 3b29ad6 [YanjieGao] Merge remote branch 'upstream/master' into patch-5 1cfbfe6 [YanjieGao] refomat some files ea78f33 [YanjieGao] resolve conflict and add annotation on basicOperator and remove HiveQl 0c7cca5 [YanjieGao] modify format problem a802ca8 [YanjieGao] Merge remote branch 'upstream/master' into patch-5 5e374c7 [YanjieGao] resolve conflict in SparkStrategies and basicOperator f7961f6 [Yanjie Gao] update the line less than bdc4a05 [Yanjie Gao] Update basicOperators.scala 0b49837 [Yanjie Gao] delete the annotation f1288b4 [Yanjie Gao] delete annotation e2b64be [Yanjie Gao] Update basicOperators.scala 4dd453e [Yanjie Gao] Update SQLQuerySuite.scala 790765d [Yanjie Gao] Update SparkStrategies.scala ac73e60 [Yanjie Gao] Update basicOperators.scala d4ac5e5 [Yanjie Gao] Update HiveQl.scala 61e88e7 [Yanjie Gao] Update SqlParser.scala 469f099 [Yanjie Gao] Update basicOperators.scala e5bff61 [Yanjie Gao] Spark SQL basicOperator add Intersect operator	2014-07-07 19:40:04 -07:00
Yin Huai	4352a2fdaa	[SPARK-2376][SQL] Selecting list values inside nested JSON objects raises java.lang.IllegalArgumentException JIRA: https://issues.apache.org/jira/browse/SPARK-2376 Author: Yin Huai <huai@cse.ohio-state.edu> Closes #1320 from yhuai/SPARK-2376 and squashes the following commits: 0107417 [Yin Huai] Merge remote-tracking branch 'upstream/master' into SPARK-2376 480803d [Yin Huai] Correctly handling JSON arrays in PySpark.	2014-07-07 18:37:38 -07:00
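
Note on b520b6453e (SPARK-2400): the lookup order quoted in that commit message can be sketched in plain Python. This is only an illustration of the nested fallback, not Spark code; the function names and the use of a `dict` in place of SparkConf are assumptions made for the example.

```python
def default_max_executor_failures(num_executors: int) -> int:
    # Computed fallback from the quoted expression: max(numExecutors * 2, 3).
    return max(num_executors * 2, 3)


def max_executor_failures(conf: dict, num_executors: int) -> int:
    # Hypothetical stand-in for the nested getInt calls:
    # the older worker key (or the computed default) is the fallback
    # for the executor key, which takes precedence when present.
    fallback = int(conf.get("spark.yarn.max.worker.failures",
                            default_max_executor_failures(num_executors)))
    return int(conf.get("spark.yarn.max.executor.failures", fallback))
```

With neither property set, a single executor yields the floor of 3 and ten executors yield 20; setting only `spark.yarn.max.worker.failures` replaces the computed default, and `spark.yarn.max.executor.failures` wins when both are present.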

7793 commits