ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Alex Liu	1e56eba5d9	[SPARK-4925][SQL] Publish Spark SQL hive-thriftserver maven artifact Author: Alex Liu <alex_liu68@yahoo.com> Closes #3766 from alexliu68/SPARK-SQL-4925 and squashes the following commits: 3137b51 [Alex Liu] [SPARK-4925][SQL] Remove sql/hive-thriftserver module from pom.xml 15f2e38 [Alex Liu] [SPARK-4925][SQL] Publish Spark SQL hive-thriftserver maven artifact	2015-01-10 13:19:12 -08:00
luogankun	545dfcb92f	[SPARK-5141][SQL]CaseInsensitiveMap throws java.io.NotSerializableException CaseInsensitiveMap throws java.io.NotSerializableException. Author: luogankun <luogankun@gmail.com> Closes #3944 from luogankun/SPARK-5141 and squashes the following commits: b6d63d5 [luogankun] [SPARK-5141]CaseInsensitiveMap throws java.io.NotSerializableException	2015-01-09 20:38:41 -08:00
MechCoder	4554529dce	[SPARK-4406] [MLib] FIX: Validate k in SVD Raise exception when k is non-positive in SVD Author: MechCoder <manojkumarsivaraj334@gmail.com> Closes #3945 from MechCoder/spark-4406 and squashes the following commits: 64e6d2d [MechCoder] TST: Add better test errors and messages 12dae73 [MechCoder] [SPARK-4406] FIX: Validate k in SVD	2015-01-09 17:45:18 -08:00
WangTaoTheTonic	8782eb992f	[SPARK-4990][Deploy]to find default properties file, search SPARK_CONF_DIR first https://issues.apache.org/jira/browse/SPARK-4990 Author: WangTaoTheTonic <barneystinson@aliyun.com> Author: WangTao <barneystinson@aliyun.com> Closes #3823 from WangTaoTheTonic/SPARK-4990 and squashes the following commits: 133c43e [WangTao] Update spark-submit2.cmd b1ab402 [WangTao] Update spark-submit 4cc7f34 [WangTaoTheTonic] rebase 55300bc [WangTaoTheTonic] use export to make it global d8d3cb7 [WangTaoTheTonic] remove blank line 07b9ebf [WangTaoTheTonic] check SPARK_CONF_DIR instead of checking properties file c5a85eb [WangTaoTheTonic] to find default properties file, search SPARK_CONF_DIR first	2015-01-09 17:10:02 -08:00
bilna	4e1f12d997	[Minor] Fix import order and other coding style fixed import order and other coding style Author: bilna <bilnap@am.amrita.edu> Author: Bilna P <bilna.p@gmail.com> Closes #3966 from Bilna/master and squashes the following commits: 5e76f04 [bilna] fix import order and other coding style 5718d66 [bilna] Merge remote-tracking branch 'upstream/master' ae56514 [bilna] Merge remote-tracking branch 'upstream/master' acea3a3 [bilna] Adding dependency with scope test 28681fa [bilna] Merge remote-tracking branch 'upstream/master' fac3904 [bilna] Correction in Indentation and coding style ed9db4c [bilna] Merge remote-tracking branch 'upstream/master' 4b34ee7 [Bilna P] Update MQTTStreamSuite.scala 04503cf [bilna] Added embedded broker service for mqtt test 89d804e [bilna] Merge remote-tracking branch 'upstream/master' fc8eb28 [bilna] Merge remote-tracking branch 'upstream/master' 4b58094 [Bilna P] Update MQTTStreamSuite.scala b1ac4ad [bilna] Added BeforeAndAfter 5f6bfd2 [bilna] Added BeforeAndAfter e8b6623 [Bilna P] Update MQTTStreamSuite.scala 5ca6691 [Bilna P] Update MQTTStreamSuite.scala 8616495 [bilna] [SPARK-4631] unit test for MQTT	2015-01-09 14:45:28 -08:00
Kousuke Saruta	ae628725ab	[DOC] Fixed Mesos version in doc from 0.18.1 to 0.21.0 #3934 upgraded Mesos version so we should also fix docs right? This issue is really minor so I don't file in JIRA. Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #3982 from sarutak/fix-mesos-version and squashes the following commits: 9a86ee3 [Kousuke Saruta] Fixed mesos version from 0.18.1 to 0.21.0	2015-01-09 14:40:45 -08:00
mcheah	e0f28e010c	[SPARK-4737] Task set manager properly handles serialization errors Dealing with [SPARK-4737], the handling of serialization errors should not be the DAGScheduler's responsibility. The task set manager now catches the error and aborts the stage. If the TaskSetManager throws a TaskNotSerializableException, the TaskSchedulerImpl will return an empty list of task descriptions, because no tasks were started. The scheduler should abort the stage gracefully. Note that I'm not too familiar with this part of the codebase and its place in the overall architecture of the Spark stack. If implementing it this way will have any averse side effects please voice that loudly. Author: mcheah <mcheah@palantir.com> Closes #3638 from mccheah/task-set-manager-properly-handle-ser-err and squashes the following commits: 1545984 [mcheah] Some more style fixes from Andrew Or. 5267929 [mcheah] Fixing style suggestions from Andrew Or. dfa145b [mcheah] Fixing style from Josh Rosen's feedback b2a430d [mcheah] Not returning empty seq when a task set cannot be serialized. 94844d7 [mcheah] Fixing compilation error, one brace too many 5f486f4 [mcheah] Adding license header for fake task class bf5e706 [mcheah] Fixing indentation. 097e7a2 [mcheah] [SPARK-4737] Catching task serialization exception in TaskSetManager	2015-01-09 14:16:20 -08:00
WangTaoTheTonic	e966452060	[SPARK-1953][YARN]yarn client mode Application Master memory size is same as driver memory... ... size Ways to set Application Master's memory on yarn-client mode: 1. `spark.yarn.am.memory` in SparkConf or System Properties 2. default value 512m Note: this arguments is only available in yarn-client mode. Author: WangTaoTheTonic <barneystinson@aliyun.com> Closes #3607 from WangTaoTheTonic/SPARK4181 and squashes the following commits: d5ceb1b [WangTaoTheTonic] spark.driver.memeory is used in both modes 6c1b264 [WangTaoTheTonic] rebase b8410c0 [WangTaoTheTonic] minor optiminzation ddcd592 [WangTaoTheTonic] fix the bug produced in rebase and some improvements 3bf70cc [WangTaoTheTonic] rebase and give proper hint 987b99d [WangTaoTheTonic] disable --driver-memory in client mode 2b27928 [WangTaoTheTonic] inaccurate description b7acbb2 [WangTaoTheTonic] incorrect method invoked 2557c5e [WangTaoTheTonic] missing a single blank 42075b0 [WangTaoTheTonic] arrange the args and warn logging 69c7dba [WangTaoTheTonic] rebase 1960d16 [WangTaoTheTonic] fix wrong comment 7fa9e2e [WangTaoTheTonic] log a warning f6bee0e [WangTaoTheTonic] docs issue d619996 [WangTaoTheTonic] Merge branch 'master' into SPARK4181 b09c309 [WangTaoTheTonic] use code format ab16bb5 [WangTaoTheTonic] fix bug and add comments 44e48c2 [WangTaoTheTonic] minor fix 6fd13e1 [WangTaoTheTonic] add overhead mem and remove some configs 0566bb8 [WangTaoTheTonic] yarn client mode Application Master memory size is same as driver memory size	2015-01-09 13:23:13 -08:00
Joseph K. Bradley	7e8e62aec1	[SPARK-5015] [mllib] Random seed for GMM + make test suite deterministic Issues: * From JIRA: GaussianMixtureEM uses randomness but does not take a random seed. It should take one as a parameter. * This also makes the test suite flaky since initialization can fail due to stochasticity. Fix: * Add random seed * Use it in test suite CC: mengxr tgaloppo Author: Joseph K. Bradley <joseph@databricks.com> Closes #3981 from jkbradley/gmm-seed and squashes the following commits: f0df4fd [Joseph K. Bradley] Added seed parameter to GMM. Updated test suite to use seed to prevent flakiness	2015-01-09 13:00:15 -08:00
Jongyoul Lee	454fe129ee	[SPARK-3619] Upgrade to Mesos 0.21 to work around MESOS-1688 - update version from 0.18.1 to 0.21.0 - I'm doing some tests in order to verify some spark jobs work fine on mesos 0.21.0 environment. Author: Jongyoul Lee <jongyoul@gmail.com> Closes #3934 from jongyoul/SPARK-3619 and squashes the following commits: ab994fa [Jongyoul Lee] [SPARK-3619] Upgrade to Mesos 0.21 to work around MESOS-1688 - update version from 0.18.1 to 0.21.0	2015-01-09 10:47:08 -08:00
Liang-Chi Hsieh	e9ca16ec94	[SPARK-5145][Mllib] Add BLAS.dsyr and use it in GaussianMixtureEM This pr uses BLAS.dsyr to replace few implementations in GaussianMixtureEM. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #3949 from viirya/blas_dsyr and squashes the following commits: 4e4d6cf [Liang-Chi Hsieh] Add unit test. Rename function name, modify doc and style. 3f57fd2 [Liang-Chi Hsieh] Add BLAS.dsyr and use it in GaussianMixtureEM.	2015-01-09 10:27:33 -08:00
Kay Ousterhout	b6aa557300	[SPARK-1143] Separate pool tests into their own suite. The current TaskSchedulerImplSuite includes some tests that are actually for the TaskSchedulerImpl, but the remainder of the tests avoid using the TaskSchedulerImpl entirely, and actually test the pool and scheduling algorithm mechanisms. This commit separates the pool/scheduling algorithm tests into their own suite, and also simplifies those tests. The pull request replaces #339. Author: Kay Ousterhout <kayousterhout@gmail.com> Closes #3967 from kayousterhout/SPARK-1143 and squashes the following commits: 8a898c4 [Kay Ousterhout] [SPARK-1143] Separate pool tests into their own suite.	2015-01-09 09:47:06 -08:00
Patrick Wendell	1790b38695	HOTFIX: Minor improvements to make-distribution.sh 1. Renames $FWDIR to $SPARK_HOME (vast majority of diff). 2. Use Spark-provided Maven. 3. Logs build flags in the RELEASE file. Author: Patrick Wendell <pwendell@gmail.com> Closes #3973 from pwendell/master and squashes the following commits: 340a2fa [Patrick Wendell] HOTFIX: Minor improvements to make-distribution.sh	2015-01-09 09:40:18 -08:00
Sean Owen	547df97715	SPARK-5136 [DOCS] Improve documentation around setting up Spark IntelliJ project This PR simply points to the IntelliJ wiki page instead of also including IntelliJ notes in the docs. The intent however is to also update the wiki page with updated tips. This is the text I propose for the IntelliJ section on the wiki. I realize it omits some of the existing instructions on the wiki, about enabling Hive, but I think those are actually optional. ------ IntelliJ supports both Maven- and SBT-based projects. It is recommended, however, to import Spark as a Maven project. Choose "Import Project..." from the File menu, and select the `pom.xml` file in the Spark root directory. It is fine to leave all settings at their default values in the Maven import wizard, with two caveats. First, it is usually useful to enable "Import Maven projects automatically", sincchanges to the project structure will automatically update the IntelliJ project. Second, note the step that prompts you to choose active Maven build profiles. As documented above, some build configuration require specific profiles to be enabled. The same profiles that are enabled with `-P[profile name]` above may be enabled on this screen. For example, if developing for Hadoop 2.4 with YARN support, enable profiles `yarn` and `hadoop-2.4`. These selections can be changed later by accessing the "Maven Projects" tool window from the View menu, and expanding the Profiles section. "Rebuild Project" can fail the first time the project is compiled, because generate source files are not automatically generated. Try clicking the "Generate Sources and Update Folders For All Projects" button in the "Maven Projects" tool window to manually generate these sources. Compilation may fail with an error like "scalac: bad option: -P:/home/jakub/.m2/repository/org/scalamacros/paradise_2.10.4/2.0.1/paradise_2.10.4-2.0.1.jar". If so, go to Preferences > Build, Execution, Deployment > Scala Compiler and clear the "Additional compiler options" field. It will work then although the option will come back when the project reimports. Author: Sean Owen <sowen@cloudera.com> Closes #3952 from srowen/SPARK-5136 and squashes the following commits: f3baa66 [Sean Owen] Point to new IJ / Eclipse wiki link 016b7df [Sean Owen] Point to IntelliJ wiki page instead of also including IntelliJ notes in the docs	2015-01-09 09:35:46 -08:00
Aaron Davidson	b4034c3f88	[Minor] Fix test RetryingBlockFetcherSuite after changed config name Flakey due to the default retry interval being the same as our test's wait timeout. Author: Aaron Davidson <aaron@databricks.com> Closes #3972 from aarondav/fix-test and squashes the following commits: db77cab [Aaron Davidson] [Minor] Fix test after changed config name	2015-01-09 09:20:16 -08:00
WangTaoTheTonic	f3da4bd728	[SPARK-5169][YARN]fetch the correct max attempts Soryy for fetching the wrong max attempts in this commit `8fdd48959c`. We need to fix it now. tgravescs If we set an spark.yarn.maxAppAttempts which is larger than `yarn.resourcemanager.am.max-attempts` in yarn side, it will be overrided as described here: >The maximum number of application attempts. It's a global setting for all application masters. Each application master can specify its individual maximum number of application attempts via the API, but the individual number cannot be more than the global upper bound. If it is, the resourcemanager will override it. The default number is set to 2, to allow at least one retry for AM. http://hadoop.apache.org/docs/r2.6.0/hadoop-yarn/hadoop-yarn-common/yarn-default.xml Author: WangTaoTheTonic <barneystinson@aliyun.com> Closes #3942 from WangTaoTheTonic/HOTFIX and squashes the following commits: 9ac16ce [WangTaoTheTonic] fetch the correct max attempts	2015-01-09 08:10:09 -06:00
Nicholas Chammas	167a5ab0bd	[SPARK-5122] Remove Shark from spark-ec2 I moved the Spark-Shark version map [to the wiki](https://cwiki.apache.org/confluence/display/SPARK/Spark-Shark+version+mapping). This PR has a [matching PR in mesos/spark-ec2](https://github.com/mesos/spark-ec2/pull/89). Author: Nicholas Chammas <nicholas.chammas@gmail.com> Closes #3939 from nchammas/remove-shark and squashes the following commits: 66e0841 [Nicholas Chammas] fix style ceeab85 [Nicholas Chammas] show default Spark GitHub repo 7270126 [Nicholas Chammas] validate Spark hashes db4935d [Nicholas Chammas] validate spark version upfront fc0d5b9 [Nicholas Chammas] remove Shark	2015-01-08 17:42:08 -08:00
Marcelo Vanzin	48cecf673c	[SPARK-4048] Enhance and extend hadoop-provided profile. This change does a few things to make the hadoop-provided profile more useful: - Create new profiles for other libraries / services that might be provided by the infrastructure - Simplify and fix the poms so that the profiles are only activated while building assemblies. - Fix tests so that they're able to run when the profiles are activated - Add a new env variable to be used by distributions that use these profiles to provide the runtime classpath for Spark jobs and daemons. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #2982 from vanzin/SPARK-4048 and squashes the following commits: 82eb688 [Marcelo Vanzin] Add a comment. eb228c0 [Marcelo Vanzin] Fix borked merge. 4e38f4e [Marcelo Vanzin] Merge branch 'master' into SPARK-4048 9ef79a3 [Marcelo Vanzin] Alternative way to propagate test classpath to child processes. 371ebee [Marcelo Vanzin] Review feedback. 52f366d [Marcelo Vanzin] Merge branch 'master' into SPARK-4048 83099fc [Marcelo Vanzin] Merge branch 'master' into SPARK-4048 7377e7b [Marcelo Vanzin] Merge branch 'master' into SPARK-4048 322f882 [Marcelo Vanzin] Fix merge fail. f24e9e7 [Marcelo Vanzin] Merge branch 'master' into SPARK-4048 8b00b6a [Marcelo Vanzin] Merge branch 'master' into SPARK-4048 9640503 [Marcelo Vanzin] Cleanup child process log message. 115fde5 [Marcelo Vanzin] Simplify a comment (and make it consistent with another pom). e3ab2da [Marcelo Vanzin] Fix hive-thriftserver profile. 7820d58 [Marcelo Vanzin] Fix CliSuite with provided profiles. 1be73d4 [Marcelo Vanzin] Restore flume-provided profile. d1399ed [Marcelo Vanzin] Restore jetty dependency. 82a54b9 [Marcelo Vanzin] Remove unused profile. 5c54a25 [Marcelo Vanzin] Fix HiveThriftServer2Suite with *-provided profiles. 1fc4d0b [Marcelo Vanzin] Update dependencies for hive-thriftserver. f7b3bbe [Marcelo Vanzin] Add snappy to hadoop-provided list. 9e4e001 [Marcelo Vanzin] Remove duplicate hive profile. d928d62 [Marcelo Vanzin] Redirect child stderr to parent's log. 4d67469 [Marcelo Vanzin] Propagate SPARK_DIST_CLASSPATH on Yarn. 417d90e [Marcelo Vanzin] Introduce "SPARK_DIST_CLASSPATH". 2f95f0d [Marcelo Vanzin] Propagate classpath to child processes during testing. 1adf91c [Marcelo Vanzin] Re-enable maven-install-plugin for a few projects. 284dda6 [Marcelo Vanzin] Rework the "hadoop-provided" profile, add new ones.	2015-01-08 17:15:13 -08:00
RJ Nowling	c9c8b219ad	[SPARK-4891][PySpark][MLlib] Add gamma/log normal/exp dist sampling to P... ...ySpark MLlib This is a follow up to PR3680 https://github.com/apache/spark/pull/3680 . Author: RJ Nowling <rnowling@gmail.com> Closes #3955 from rnowling/spark4891 and squashes the following commits: 1236a01 [RJ Nowling] Fix Python style issues 7a01a78 [RJ Nowling] Fix Python style issues 174beab [RJ Nowling] [SPARK-4891][PySpark][MLlib] Add gamma/log normal/exp dist sampling to PySpark MLlib	2015-01-08 15:03:43 -08:00
Kousuke Saruta	a00af6bec5	[SPARK-4973][CORE] Local directory in the driver of client-mode continues remaining even if application finished when external shuffle is enabled When we enables external shuffle service, local directories in the driver of client-mode continue remaining even if application has finished. I think local directories for drivers should be deleted. Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #3811 from sarutak/SPARK-4973 and squashes the following commits: ad944ab [Kousuke Saruta] Fixed DiskBlockManager to cleanup local directory if it's the driver 43770da [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-4973 88feecd [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-4973 d99718e [Kousuke Saruta] Fixed SparkSubmit.scala and DiskBlockManager.scala in order to delete local directories of the driver of local-mode when external shuffle service is enabled	2015-01-08 13:43:09 -08:00
Fernando Otero (ZeoS)	72df5a301e	SPARK-5148 [MLlib] Make usersOut/productsOut storagelevel in ALS configurable Author: Fernando Otero (ZeoS) <fotero@gmail.com> Closes #3953 from zeitos/storageLevel and squashes the following commits: 0f070b9 [Fernando Otero (ZeoS)] fix imports 6869e80 [Fernando Otero (ZeoS)] fix comment length 90c9f7e [Fernando Otero (ZeoS)] fix comment length 18a992e [Fernando Otero (ZeoS)] changing storage level	2015-01-08 12:42:54 -08:00
Eric Moyer	538f221627	Document that groupByKey will OOM for large keys This pull request is my own work and I license it under Spark's open-source license. This contribution is an improvement to the documentation. I documented that the maximum number of values per key for groupByKey is limited by available RAM (see [Datablox][datablox link] and [the spark mailing list][list link]). Just saying that better performance is available is not sufficient. Sometimes you need to do a group-by - your operation needs all the items available in order to complete. This warning explains the problem. [datablox link]: http://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html [list link]: http://apache-spark-user-list.1001560.n3.nabble.com/Understanding-RDD-GroupBy-OutOfMemory-Exceptions-tp11427p11466.html Author: Eric Moyer <eric_moyer@yahoo.com> Closes #3936 from RadixSeven/better-group-by-docs and squashes the following commits: 5b6f4e9 [Eric Moyer] groupByKey docs naming updates 238e81b [Eric Moyer] Doc that groupByKey will OOM for large keys	2015-01-08 11:55:23 -08:00
WangTaoTheTonic	0760787da8	[SPARK-5130][Deploy]Take yarn-cluster as cluster mode in spark-submit https://issues.apache.org/jira/browse/SPARK-5130 Author: WangTaoTheTonic <barneystinson@aliyun.com> Closes #3929 from WangTaoTheTonic/SPARK-5130 and squashes the following commits: c490648 [WangTaoTheTonic] take yarn-cluster as cluster mode in spark-submit	2015-01-08 11:45:42 -08:00
Kousuke Saruta	0a597276db	[Minor] Fix the value represented by spark.executor.id for consistency. The property `spark.executor.id` can represent both `driver` and `<driver>` for one driver. It's inconsistent. This issue is minor so I didn't file this in JIRA. Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #3812 from sarutak/fix-driver-identifier and squashes the following commits: d885498 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into fix-driver-identifier 4275663 [Kousuke Saruta] Fixed the value represented by spark.executor.id of local mode	2015-01-08 11:35:56 -08:00
Zhang, Liye	06dc4b5206	[SPARK-4989][CORE] avoid wrong eventlog conf cause cluster down in standalone mode when enabling eventlog in standalone mode, if give the wrong configuration, the standalone cluster will down (cause master restart, lose connection with workers). How to reproduce: just give an invalid value to "spark.eventLog.dir", for example: spark.eventLog.dir=hdfs://tmp/logdir1, hdfs://tmp/logdir2. This will throw illegalArgumentException, which will cause the Master restart. And the whole cluster is not available. Author: Zhang, Liye <liye.zhang@intel.com> Closes #3824 from liyezhang556520/wrongConf4Cluster and squashes the following commits: 3c24d98 [Zhang, Liye] revert change with logwarning and excetption for FileNotFoundException 3c1ac2e [Zhang, Liye] change var to val a49c52f [Zhang, Liye] revert wrong modification 12eee85 [Zhang, Liye] add more message in log and on webUI 5c1fa33 [Zhang, Liye] cache exceptions when eventlog with wrong conf	2015-01-08 10:40:26 -08:00
Takeshi Yamamuro	f825e193f3	[SPARK-4917] Add a function to convert into a graph with canonical edges in GraphOps Convert bi-directional edges into uni-directional ones instead of 'canonicalOrientation' in GraphLoader.edgeListFile. This function is useful when a graph is loaded as it is and then is transformed into one with canonical edges. It rewrites the vertex ids of edges so that srcIds are bigger than dstIds, and merges the duplicated edges. Author: Takeshi Yamamuro <linguin.m.s@gmail.com> Closes #3760 from maropu/ConvertToCanonicalEdgesSpike and squashes the following commits: 7f8b580 [Takeshi Yamamuro] Add a function to convert into a graph with canonical edges in GraphOps	2015-01-08 09:55:12 -08:00
Sandy Ryza	8d45834deb	SPARK-5087. [YARN] Merge yarn.Client and yarn.ClientBase Author: Sandy Ryza <sandy@cloudera.com> Closes #3896 from sryza/sandy-spark-5087 and squashes the following commits: 65611d0 [Sandy Ryza] Review feedback 3294176 [Sandy Ryza] SPARK-5087. [YARN] Merge yarn.Client and yarn.ClientBase	2015-01-08 09:25:43 -08:00
Patrick Wendell	c08238570c	MAINTENANCE: Automated closing of pull requests. This commit exists to close the following pull requests on Github: Closes #3880 (close requested by 'ash211') Closes #3649 (close requested by 'marmbrus') Closes #3791 (close requested by 'mengxr') Closes #3559 (close requested by 'andrewor14') Closes #3879 (close requested by 'ash211')	2015-01-07 23:25:56 -08:00
Shuo Xiang	c66a976300	[SPARK-5116][MLlib] Add extractor for SparseVector and DenseVector Add extractor for SparseVector and DenseVector in MLlib to save some code while performing pattern matching on Vectors. For example, previously we may use: vec match { case dv: DenseVector => val values = dv.values ... case sv: SparseVector => val indices = sv.indices val values = sv.values val size = sv.size ... } with extractor it is: vec match { case DenseVector(values) => ... case SparseVector(size, indices, values) => ... } Author: Shuo Xiang <shuoxiangpub@gmail.com> Closes #3919 from coderxiang/extractor and squashes the following commits: 359e8d5 [Shuo Xiang] merge master ca5fc3e [Shuo Xiang] merge master 0b1e190 [Shuo Xiang] use extractor for vectors in RowMatrix.scala e961805 [Shuo Xiang] use extractor for vectors in StandardScaler.scala c2bbdaf [Shuo Xiang] use extractor for vectors in IDFscala 8433922 [Shuo Xiang] use extractor for vectors in NaiveBayes.scala and Normalizer.scala d83c7ca [Shuo Xiang] use extractor for vectors in Vectors.scala 5523dad [Shuo Xiang] Add extractor for SparseVector and DenseVector	2015-01-07 23:22:37 -08:00
zsxwing	2b729d2250	[SPARK-5126][Core] Verify Spark urls before creating Actors so that invalid urls can crash the process. Because `actorSelection` will return `deadLetters` for an invalid path, Worker keeps quiet for an invalid master url. It's better to log an error so that people can find such problem quickly. This PR will check the url before sending to `actorSelection`, throw and log a SparkException for an invalid url. Author: zsxwing <zsxwing@gmail.com> Closes #3927 from zsxwing/SPARK-5126 and squashes the following commits: 9d429ee [zsxwing] Create a utility method in Utils to parse Spark url; verify urls before creating Actors so that invalid urls can crash the process. 8286e51 [zsxwing] Check the url before sending to Akka and log the error if the url is invalid	2015-01-07 23:01:30 -08:00
hushan[胡珊]	d345ebebd5	[SPARK-5132][Core]Correct stage Attempt Id key in stageInfofromJson SPARK-5132: stageInfoToJson: Stage Attempt Id stageInfoFromJson: Attempt Id Author: hushan[胡珊] <hushan@xiaomi.com> Closes #3932 from suyanNone/json-stage and squashes the following commits: 41419ab [hushan[胡珊]] Correct stage Attempt Id key in stageInfofromJson	2015-01-07 12:09:12 -08:00
DB Tsai	60e2d9e290	[SPARK-5128][MLLib] Add common used log1pExp API in MLUtils When `x` is positive and large, computing `math.log(1 + math.exp(x))` will lead to arithmetic overflow. This will happen when `x > 709.78` which is not a very large number. It can be addressed by rewriting the formula into `x + math.log1p(math.exp(-x))` when `x > 0`. Author: DB Tsai <dbtsai@alpinenow.com> Closes #3915 from dbtsai/mathutil and squashes the following commits: bec6a84 [DB Tsai] remove empty line 3239541 [DB Tsai] revert part of patch into another PR 23144f3 [DB Tsai] doc 49f3658 [DB Tsai] temp 6c29ed3 [DB Tsai] formating f8447f9 [DB Tsai] address another overflow issue in gradientMultiplier in LOR gradient code 64eefd0 [DB Tsai] first commit	2015-01-07 10:13:41 -08:00
Masayoshi TSUZUKI	6e74edeca3	[SPARK-2458] Make failed application log visible on History Server Enabled HistoryServer to show incomplete applications. We can see the log for incomplete applications by clicking the bottom link. Author: Masayoshi TSUZUKI <tsudukim@oss.nttdata.co.jp> Closes #3467 from tsudukim/feature/SPARK-2458-2 and squashes the following commits: 76205d2 [Masayoshi TSUZUKI] Fixed and added test code. 29a04a9 [Masayoshi TSUZUKI] Merge branch 'master' of github.com:tsudukim/spark into feature/SPARK-2458-2 f9ef854 [Masayoshi TSUZUKI] Added space between "if" and "(". Fixed "Incomplete" as capitalized in the web UI. Modified double negative variable name. 9b465b0 [Masayoshi TSUZUKI] Modified typo and better implementation. 3ed8a41 [Masayoshi TSUZUKI] Modified too long lines. 08ea14d [Masayoshi TSUZUKI] [SPARK-2458] Make failed application log visible on History Server	2015-01-07 07:32:53 -08:00
WangTaoTheTonic	8fdd48959c	[SPARK-2165][YARN]add support for setting maxAppAttempts in the ApplicationSubmissionContext ...xt https://issues.apache.org/jira/browse/SPARK-2165 I still have 2 questions: * If this config is not set, we should use yarn's corresponding value or a default value(like 2) on spark side? * Is the config name best? Or "spark.yarn.am.maxAttempts"? Author: WangTaoTheTonic <barneystinson@aliyun.com> Closes #3878 from WangTaoTheTonic/SPARK-2165 and squashes the following commits: 1416c83 [WangTaoTheTonic] use the name spark.yarn.maxAppAttempts 202ac85 [WangTaoTheTonic] rephrase some afdfc99 [WangTaoTheTonic] more detailed description 91562c6 [WangTaoTheTonic] add support for setting maxAppAttempts in the ApplicationSubmissionContext	2015-01-07 08:14:39 -06:00
huangzhaowei	5fde66163f	[YARN][SPARK-4929] Bug fix: fix the yarn-client code to support HA Nowadays, yarn-client will exit directly when the HA change happens no matter how many times the am should retry. The reason may be that the default final status only considerred the sys.exit, and the yarn-client HA cann't benefit from this. So we should distinct the default final status between client and cluster, because the SUCCEEDED status may cause the HA failed in client mode and UNDEFINED may cause the error reporter in cluster when using sys.exit. Author: huangzhaowei <carlmartinmax@gmail.com> Closes #3771 from SaintBacchus/YarnHA and squashes the following commits: c02bfcc [huangzhaowei] Improve the comment of the funciton 'getDefaultFinalStatus' 0e69924 [huangzhaowei] Bug fix: fix the yarn-client code to support HA	2015-01-07 08:10:42 -06:00
Liang-Chi Hsieh	e21acc1978	[SPARK-5099][Mllib] Simplify logistic loss function This is a minor pr where I think that we can simply take minus of `margin`, instead of subtracting `margin`. Mathematically, they are equal. But the modified equation is the common form of logistic loss function and so more readable. It also computes more accurate value as some quick tests show. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #3899 from viirya/logit_func and squashes the following commits: 91a3860 [Liang-Chi Hsieh] Modified for comment. 0aa51e4 [Liang-Chi Hsieh] Further simplified. 72a295e [Liang-Chi Hsieh] Revert LogLoss back and add more considerations in Logistic Loss. a3f83ca [Liang-Chi Hsieh] Fix a bug. 2bc5712 [Liang-Chi Hsieh] Simplify loss function.	2015-01-06 21:23:31 -08:00
Liang-Chi Hsieh	bb38ebb1ab	[SPARK-5050][Mllib] Add unit test for sqdist Related to #3643. Follow the previous suggestion to add unit test for `sqdist` in `VectorsSuite`. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #3869 from viirya/sqdist_test and squashes the following commits: fb743da [Liang-Chi Hsieh] Modified for comment and fix bug. 90a08f3 [Liang-Chi Hsieh] Modified for comment. 39a3ca6 [Liang-Chi Hsieh] Take care of special case. b789f42 [Liang-Chi Hsieh] More proper unit test with random sparsity pattern. c36be68 [Liang-Chi Hsieh] Add unit test for sqdist.	2015-01-06 14:00:45 -08:00
Travis Galoppo	4108e5f36f	SPARK-5017 [MLlib] - Use SVD to compute determinant and inverse of covariance matrix MultivariateGaussian was calling both pinv() and det() on the covariance matrix, effectively performing two matrix decompositions. Both values are now computed using the singular value decompositon. Both the pseudo-inverse and the pseudo-determinant are used to guard against singular matrices. Author: Travis Galoppo <tjg2107@columbia.edu> Closes #3871 from tgaloppo/spark-5017 and squashes the following commits: 383b5b3 [Travis Galoppo] MultivariateGaussian - minor optimization in density calculation a5b8bc5 [Travis Galoppo] Added additional points to tests in test suite. Fixed comment in MultivariateGaussian 629d9d0 [Travis Galoppo] Moved some test values from var to val. dc3d0f7 [Travis Galoppo] Catch potential exception calculating pseudo-determinant. Style improvements. d448137 [Travis Galoppo] Added test suite for MultivariateGaussian, including test for degenerate case. 1989be0 [Travis Galoppo] SPARK-5017 - Fixed to use SVD to compute determinant and inverse of covariance matrix. Previous code called both pinv() and det(), effectively performing two matrix decompositions. Additionally, the pinv() implementation in Breeze is known to fail for singular matrices. b4415ea [Travis Galoppo] Merge branch 'spark-5017' of https://github.com/tgaloppo/spark into spark-5017 6f11b6d [Travis Galoppo] SPARK-5017 - Use SVD to compute determinant and inverse of covariance matrix. Code was calling both det() and pinv(), effectively performing two matrix decompositions. Futhermore, Breeze pinv() currently fails for singular matrices. fd9784c [Travis Galoppo] SPARK-5017 - Use SVD to compute determinant and inverse of covariance matrix	2015-01-06 13:57:42 -08:00
Sean Owen	4cba6eb420	SPARK-4159 [CORE] Maven build doesn't run JUnit test suites This PR: - Reenables `surefire`, and copies config from `scalatest` (which is itself an old fork of `surefire`, so similar) - Tells `surefire` to test only Java tests - Enables `surefire` and `scalatest` for all children, and in turn eliminates some duplication. For me this causes the Scala and Java tests to be run once each, it seems, as desired. It doesn't affect the SBT build but works for Maven. I still need to verify that all of the Scala tests and Java tests are being run. Author: Sean Owen <sowen@cloudera.com> Closes #3651 from srowen/SPARK-4159 and squashes the following commits: 2e8a0af [Sean Owen] Remove specialized SPARK_HOME setting for REPL, YARN tests as it appears to be obsolete 12e4558 [Sean Owen] Append to unit-test.log instead of overwriting, so that both surefire and scalatest output is preserved. Also standardize/correct comments a bit. e6f8601 [Sean Owen] Reenable Java tests by reenabling surefire with config cloned from scalatest; centralize test config in the parent	2015-01-06 12:02:08 -08:00
kj-ki	5e3ec11104	[Minor] Fix comments for GraphX 2D partitioning strategy The sum of vertices on matrix (v0 to v11) is 12. And, I think one same block overlaps in this strategy. This is minor PR, so I didn't file in JIRA. Author: kj-ki <kikushima.kenji@lab.ntt.co.jp> Closes #3904 from kj-ki/fix-partitionstrategy-comments and squashes the following commits: 79829d9 [kj-ki] Fix comments for 2D partitioning.	2015-01-06 09:49:37 -08:00
Josh Rosen	a6394bc2c0	[SPARK-1600] Refactor FileInputStream tests to remove Thread.sleep() calls and SystemClock usage This patch refactors Spark Streaming's FileInputStream tests to remove uses of Thread.sleep() and SystemClock, which should hopefully resolve some longstanding flakiness in these tests (see SPARK-1600). Key changes: - Modify FileInputDStream to use the scheduler's Clock instead of System.currentTimeMillis(); this allows it to be tested using ManualClock. - Fix a synchronization issue in ManualClock's `currentTime` method. - Add a StreamingTestWaiter class which allows callers to block until a certain number of batches have finished. - Change the FileInputStream tests so that files' modification times are manually set based off of ManualClock; this eliminates many Thread.sleep calls. - Update these tests to use the withStreamingContext fixture. Author: Josh Rosen <joshrosen@databricks.com> Closes #3801 from JoshRosen/SPARK-1600 and squashes the following commits: e4494f4 [Josh Rosen] Address a potential race when setting file modification times 8340bd0 [Josh Rosen] Use set comparisons for output. 0b9c252 [Josh Rosen] Fix some ManualClock usage problems. 1cc689f [Josh Rosen] ConcurrentHashMap -> SynchronizedMap db26c3a [Josh Rosen] Use standard timeout in ScalaTest `eventually` blocks. 3939432 [Josh Rosen] Rename StreamingTestWaiter to BatchCounter 0b9c3a1 [Josh Rosen] Wait for checkpoint to complete 863d71a [Josh Rosen] Remove Thread.sleep that was used to make task run slowly b4442c3 [Josh Rosen] batchTimeToSelectedFiles should be thread-safe 15b48ee [Josh Rosen] Replace several TestWaiter methods w/ ScalaTest eventually. fffc51c [Josh Rosen] Revert "Remove last remaining sleep() call" dbb8247 [Josh Rosen] Remove last remaining sleep() call 566a63f [Josh Rosen] Fix log message and comment typos da32f3f [Josh Rosen] Fix log message and comment typos 3689214 [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-1600 c8f06b1 [Josh Rosen] Remove Thread.sleep calls in FileInputStream CheckpointSuite test. d4f2d87 [Josh Rosen] Refactor file input stream tests to not rely on SystemClock. dda1403 [Josh Rosen] Add StreamingTestWaiter class. 3c3efc3 [Josh Rosen] Synchronize `currentTime` in ManualClock a95ddc4 [Josh Rosen] Modify FileInputDStream to use Clock class.	2015-01-06 00:31:19 -08:00
Kostas Sakellis	451546aa6d	SPARK-4843 [YARN] Squash ExecutorRunnableUtil and ExecutorRunnable ExecutorRunnableUtil is a parent of ExecutorRunnable because of the yarn-alpha and yarn-stable split. Now that yarn-alpha is gone, this commit squashes the unnecessary hierarchy. The methods from ExecutorRunnableUtil are added as private. Author: Kostas Sakellis <kostas@cloudera.com> Closes #3696 from ksakellis/kostas-spark-4843 and squashes the following commits: 486716f [Kostas Sakellis] Moved prepareEnvironment call to after yarnConf declaration 470e22e [Kostas Sakellis] Fixed indentation and renamed sparkConf variable 9b1b1c9 [Kostas Sakellis] SPARK-4843 [YARN] Squash ExecutorRunnableUtil and ExecutorRunnable	2015-01-05 23:26:33 -08:00
Reynold Xin	04d55d8e8e	[SPARK-5040][SQL] Support expressing unresolved attributes using $"attribute name" notation in SQL DSL. Author: Reynold Xin <rxin@databricks.com> Closes #3862 from rxin/stringcontext-attr and squashes the following commits: 9b10f57 [Reynold Xin] Rename StrongToAttributeConversionHelper 72121af [Reynold Xin] [SPARK-5040][SQL] Support expressing unresolved attributes using $"attribute name" notation in SQL DSL.	2015-01-05 15:34:22 -08:00
Reynold Xin	bbcba3a943	[SPARK-5093] Set spark.network.timeout to 120s consistently. Author: Reynold Xin <rxin@databricks.com> Closes #3903 from rxin/timeout-120 and squashes the following commits: 7c2138e [Reynold Xin] [SPARK-5093] Set spark.network.timeout to 120s consistently.	2015-01-05 15:19:53 -08:00
freeman	6c6f325740	[SPARK-5089][PYSPARK][MLLIB] Fix vector convert This is a small change addressing a potentially significant bug in how PySpark + MLlib handles non-float64 numpy arrays. The automatic conversion to `DenseVector` that occurs when passing RDDs to MLlib algorithms in PySpark should automatically upcast to float64s, but currently this wasn't actually happening. As a result, non-float64 would be silently parsed inappropriately during SerDe, yielding erroneous results when running, for example, KMeans. The PR includes the fix, as well as a new test for the correct conversion behavior. davies Author: freeman <the.freeman.lab@gmail.com> Closes #3902 from freeman-lab/fix-vector-convert and squashes the following commits: 764db47 [freeman] Add a test for proper conversion behavior 704f97e [freeman] Return array after changing type	2015-01-05 13:10:59 -08:00
Jongyoul Lee	1c0e7ce056	[SPARK-4465] runAsSparkUser doesn't affect TaskRunner in Mesos environme... ...nt at all. - fixed a scope of runAsSparkUser from MesosExecutorDriver.run to MesosExecutorBackend.launchTask - See the Jira Issue for more details. Author: Jongyoul Lee <jongyoul@gmail.com> Closes #3741 from jongyoul/SPARK-4465 and squashes the following commits: 46ad71e [Jongyoul Lee] [SPARK-4465] runAsSparkUser doesn't affect TaskRunner in Mesos environment at all. - Removed unused import 3d6631f [Jongyoul Lee] [SPARK-4465] runAsSparkUser doesn't affect TaskRunner in Mesos environment at all. - Removed comments and adjusted indentations 2343f13 [Jongyoul Lee] [SPARK-4465] runAsSparkUser doesn't affect TaskRunner in Mesos environment at all. - fixed a scope of runAsSparkUser from MesosExecutorDriver.run to MesosExecutorBackend.launchTask	2015-01-05 12:05:09 -08:00
WangTao	ce39b34404	[SPARK-5057] Log message in failed askWithReply attempts https://issues.apache.org/jira/browse/SPARK-5057 Author: WangTao <barneystinson@aliyun.com> Author: WangTaoTheTonic <barneystinson@aliyun.com> Closes #3875 from WangTaoTheTonic/SPARK-5057 and squashes the following commits: 1503487 [WangTao] use string interpolation 706c8a7 [WangTaoTheTonic] log more messages	2015-01-05 12:00:02 -08:00
Varun Saxena	d3f07fd23c	[SPARK-4688] Have a single shared network timeout in Spark [SPARK-4688] Have a single shared network timeout in Spark Author: Varun Saxena <vsaxena.varun@gmail.com> Author: varunsaxena <vsaxena.varun@gmail.com> Closes #3562 from varunsaxena/SPARK-4688 and squashes the following commits: 6e97f72 [Varun Saxena] [SPARK-4688] Single shared network timeout cd783a2 [Varun Saxena] SPARK-4688 d6f8c29 [Varun Saxena] SCALA-4688 9562b15 [Varun Saxena] SPARK-4688 a75f014 [varunsaxena] SPARK-4688 594226c [varunsaxena] SPARK-4688	2015-01-05 10:32:37 -08:00
zsxwing	5c506cecb9	[SPARK-5074][Core] Fix a non-deterministic test failure Add `assert(sc.listenerBus.waitUntilEmpty(WAIT_TIMEOUT_MILLIS))` to make sure `sparkListener` receive the message. Author: zsxwing <zsxwing@gmail.com> Closes #3889 from zsxwing/SPARK-5074 and squashes the following commits: e61c198 [zsxwing] Fix a non-deterministic test failure	2015-01-04 21:18:33 -08:00
zsxwing	27e7f5a723	[SPARK-5083][Core] Fix a flaky test in TaskResultGetterSuite Because `sparkEnv.blockManager.master.removeBlock` is asynchronous, we need to make sure the block has already been removed before calling `super.enqueueSuccessfulTask`. Author: zsxwing <zsxwing@gmail.com> Closes #3894 from zsxwing/SPARK-5083 and squashes the following commits: d97c03d [zsxwing] Fix a flaky test in TaskResultGetterSuite	2015-01-04 21:09:21 -08:00

... 2 3 4 5 6 ...

9421 commits