ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Kan Zhang	7dd9fc67a6	[SPARK-1837] NumericRange should be partitioned in the same way as other... ... sequences Author: Kan Zhang <kzhang@apache.org> Closes #776 from kanzhang/SPARK-1837 and squashes the following commits: e48f018 [Kan Zhang] [SPARK-1837] code refactoring 67c33b5 [Kan Zhang] minor change 403f9b1 [Kan Zhang] [SPARK-1837] NumericRange should be partitioned in the same way as other sequences	2014-06-14 14:31:28 -07:00
Kan Zhang	b52603b039	[SPARK-2013] Documentation for saveAsPickleFile and pickleFile in Python Author: Kan Zhang <kzhang@apache.org> Closes #983 from kanzhang/SPARK-2013 and squashes the following commits: 0e128bb [Kan Zhang] [SPARK-2013] minor update e728516 [Kan Zhang] [SPARK-2013] Documentation for saveAsPickleFile and pickleFile in Python	2014-06-14 13:22:30 -07:00
Kan Zhang	2550533a28	[SPARK-2079] Support batching when serializing SchemaRDD to Python Added batching with default batch size 10 in SchemaRDD.javaToPython Author: Kan Zhang <kzhang@apache.org> Closes #1023 from kanzhang/SPARK-2079 and squashes the following commits: 2d1915e [Kan Zhang] [SPARK-2079] Add batching in SchemaRDD.javaToPython 19b0c09 [Kan Zhang] [SPARK-2079] Removing unnecessary wrapping in SchemaRDD.javaToPython	2014-06-14 13:17:22 -07:00
Yin Huai	8919685091	[Spark-2137][SQL] Timestamp UDFs broken https://issues.apache.org/jira/browse/SPARK-2137 Author: Yin Huai <huai@cse.ohio-state.edu> Closes #1081 from yhuai/SPARK-2137 and squashes the following commits: c04f910 [Yin Huai] Merge remote-tracking branch 'upstream/master' into SPARK-2137 205f17b [Yin Huai] Make Hive UDF wrapper support Timestamp.	2014-06-13 23:28:57 -07:00
akkomar	edb1f0e316	Small correction in Streaming Programming Guide doc Corrected description of `repartition` function under 'Level of Parallelism in Data Receiving'. Author: akkomar <ak.komar@gmail.com> Closes #1079 from akkomar/streaming-guide-doc and squashes the following commits: 32dfc62 [akkomar] Corrected description of `repartition` function under 'Level of Parallelism in Data Receiving'.	2014-06-13 15:37:26 -07:00
Cheng Lian	ac96d9657c	[SPARK-2094][SQL] "Exactly once" semantics for DDL and command statements ## Related JIRA issues - Main issue: - [SPARK-2094](https://issues.apache.org/jira/browse/SPARK-2094): Ensure exactly once semantics for DDL/Commands - Issues resolved as dependencies: - [SPARK-2081](https://issues.apache.org/jira/browse/SPARK-2081): Undefine output() from the abstract class Command and implement it in concrete subclasses - [SPARK-2128](https://issues.apache.org/jira/browse/SPARK-2128): No plan for DESCRIBE - [SPARK-1852](https://issues.apache.org/jira/browse/SPARK-1852): SparkSQL Queries with Sorts run before the user asks them to - Other related issue: - [SPARK-2129](https://issues.apache.org/jira/browse/SPARK-2129): NPE thrown while lookup a view Two test cases, `join_view` and `mergejoin_mixed`, within the `HiveCompatibilitySuite` are removed from the whitelist to workaround this issue. ## PR Overview This PR defines physical plans for DDL statements and commands and wraps their side effects in a lazy field `PhysicalCommand.sideEffectResult`, so that they are executed eagerly and exactly once. Also, as a positive side effect, now DDL statements and commands can be turned into proper `SchemaRDD`s and let user query the execution results. This PR defines schemas for the following DDL/commands: - EXPLAIN command - `plan`: String, the plan explanation - SET command - `key`: String, the key(s) of the propert(y/ies) being set or queried - `value`: String, the value(s) of the propert(y/ies) being queried - Other Hive native command - `result`: String, execution result returned by Hive NOTE: We should refine schemas for different native commands by defining physical plans for them in the future. ## Examples ### EXPLAIN command Take the "EXPLAIN" command as an example, we first execute the command and obtain a `SchemaRDD` at the same time, then query the `plan` field with the schema DSL: ``` scala> loadTestTable("src") ... scala> val q0 = hql("EXPLAIN SELECT key, COUNT(*) FROM src GROUP BY key") ... q0: org.apache.spark.sql.SchemaRDD = SchemaRDD[0] at RDD at SchemaRDD.scala:98 == Query Plan == ExplainCommandPhysical [plan#11:0] Aggregate false, [key#4], [key#4,SUM(PartialCount#6L) AS c_1#2L] Exchange (HashPartitioning [key#4:0], 200) Exchange (HashPartitioning [key#4:0], 200) Aggregate true, [key#4], [key#4,COUNT(1) AS PartialCount#6L] HiveTableScan [key#4], (MetastoreRelation default, src, None), None scala> q0.select('plan).collect() ... [ExplainCommandPhysical [plan#24:0] Aggregate false, [key#17], [key#17,SUM(PartialCount#19L) AS c_1#2L] Exchange (HashPartitioning [key#17:0], 200) Exchange (HashPartitioning [key#17:0], 200) Aggregate true, [key#17], [key#17,COUNT(1) AS PartialCount#19L] HiveTableScan [key#17], (MetastoreRelation default, src, None), None] scala> ``` ### SET command In this example we query all the properties set in `SQLConf`, register the result as a table, and then query the table with HiveQL: ``` scala> val q1 = hql("SET") ... q1: org.apache.spark.sql.SchemaRDD = SchemaRDD[7] at RDD at SchemaRDD.scala:98 == Query Plan == <SET command: executed by Hive, and noted by SQLContext> scala> q1.registerAsTable("properties") scala> hql("SELECT key, value FROM properties ORDER BY key LIMIT 10").foreach(println) ... == Query Plan == TakeOrdered 10, [key#51:0 ASC] Project [key#51:0,value#52:1] SetCommandPhysical None, None, [key#55:0,value#56:1]), which has no missing parents 14/06/12 12:19:27 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from Stage 5 (SchemaRDD[21] at RDD at SchemaRDD.scala:98 == Query Plan == TakeOrdered 10, [key#51:0 ASC] Project [key#51:0,value#52:1] SetCommandPhysical None, None, [key#55:0,value#56:1]) ... [datanucleus.autoCreateSchema,true] [datanucleus.autoStartMechanismMode,checked] [datanucleus.cache.level2,false] [datanucleus.cache.level2.type,none] [datanucleus.connectionPoolingType,BONECP] [datanucleus.fixedDatastore,false] [datanucleus.identifierFactory,datanucleus1] [datanucleus.plugin.pluginRegistryBundleCheck,LOG] [datanucleus.rdbms.useLegacyNativeValueStrategy,true] [datanucleus.storeManagerType,rdbms] scala> ``` ### "Exactly once" semantics At last, an example of the "exactly once" semantics: ``` scala> val q2 = hql("CREATE TABLE t1(key INT, value STRING)") ... q2: org.apache.spark.sql.SchemaRDD = SchemaRDD[28] at RDD at SchemaRDD.scala:98 == Query Plan == <Native command: executed by Hive> scala> table("t1") ... res9: org.apache.spark.sql.SchemaRDD = SchemaRDD[32] at RDD at SchemaRDD.scala:98 == Query Plan == HiveTableScan [key#58,value#59], (MetastoreRelation default, t1, None), None scala> q2.collect() ... res10: Array[org.apache.spark.sql.Row] = Array([]) scala> ``` As we can see, the "CREATE TABLE" command is executed eagerly right after the `SchemaRDD` is created, and referencing the `SchemaRDD` again won't trigger a duplicated execution. Author: Cheng Lian <lian.cs.zju@gmail.com> Closes #1071 from liancheng/exactlyOnceCommand and squashes the following commits: d005b03 [Cheng Lian] Made "SET key=value" returns the newly set key value pair `f6c7715` [Cheng Lian] Added test cases for DDL/command statement RDDs 1d00937 [Cheng Lian] Makes SchemaRDD DSLs work for DDL/command statement RDDs 5c7e680 [Cheng Lian] Bug fix: wrong type used in pattern matching 48aa2e5 [Cheng Lian] Refined SQLContext.emptyResult as an empty RDD[Row] cc64f32 [Cheng Lian] Renamed physical plan classes for DDL/commands 74789c1 [Cheng Lian] Fixed failing test cases 0ad343a [Cheng Lian] Added physical plan for DDL and commands to ensure the "exactly once" semantics	2014-06-13 12:59:48 -07:00
Michael Armbrust	1c2fd015b0	[SPARK-1964][SQL] Add timestamp to HiveMetastoreTypes.toMetastoreType Author: Michael Armbrust <michael@databricks.com> Closes #1061 from marmbrus/timestamp and squashes the following commits: 79c3903 [Michael Armbrust] Add timestamp to HiveMetastoreTypes.toMetastoreType()	2014-06-13 12:55:15 -07:00
nravi	70c8116c0a	Workaround in Spark for ConcurrentModification issue (JIRA Hadoop-10456, Spark-1097) This fix has gone into Hadoop 2.4.1. For developers using < 2.4.1, it would be good to have a workaround in Spark as well. Fix has been tested for performance as well, no regressions found. Author: nravi <nravi@c1704.halxg.cloudera.com> Closes #1000 from nishkamravi2/master and squashes the following commits: eb663ca [nravi] Merge branch 'master' of https://github.com/apache/spark df2aeb1 [nravi] Improved fix for ConcurrentModificationIssue (Spark-1097, Hadoop-10456) 6b840f0 [nravi] Undo the fix for SPARK-1758 (the problem is fixed) 5108700 [nravi] Fix in Spark for the Concurrent thread modification issue (SPARK-1097, HADOOP-10456) 681b36f [nravi] Fix for SPARK-1758: failing test org.apache.spark.JavaAPISuite.wholeTextFiles	2014-06-13 10:52:21 -07:00
Xiangrui Meng	b3736e3d2f	[HOTFIX] add math3 version to pom Passed `mvn package`. Author: Xiangrui Meng <meng@databricks.com> Closes #1075 from mengxr/takeSample-fix and squashes the following commits: 45b4590 [Xiangrui Meng] add math3 version to pom	2014-06-13 02:59:38 -07:00
Michael Armbrust	13f8cfdc04	[SPARK-2135][SQL] Use planner for in-memory scans Author: Michael Armbrust <michael@databricks.com> Closes #1072 from marmbrus/cachedStars and squashes the following commits: 8757c8e [Michael Armbrust] Use planner for in-memory scans.	2014-06-12 23:09:41 -07:00
John Zhao	f95ac686bc	[SPARK-1516]Throw exception in yarn client instead of run system.exit directly. All the changes is in the package of "org.apache.spark.deploy.yarn": 1) Throw exception in ClinetArguments and ClientBase instead of exit directly. 2) in Client's main method, if exception is caught, it will exit with code 1, otherwise exit with code 0. After the fix, if user integrate the spark yarn client into their applications, when the argument is wrong or the running is finished, the application won't be terminated. Author: John Zhao <jzhao@alpinenow.com> Closes #490 from codeboyyong/jira_1516_systemexit_inyarnclient and squashes the following commits: 138cb48 [John Zhao] [SPARK-1516]Throw exception in yarn clinet instead of run system.exit directly. All the changes is in the package of "org.apache.spark.deploy.yarn": 1) Add a ClientException with an exitCode 2) Throws exception in ClinetArguments and ClientBase instead of exit directly 3) in Client's main method, catch exception and exit with the exitCode.	2014-06-12 21:39:00 -07:00
Andrew Or	44daec5abd	[Minor] Fix style, formatting and naming in BlockManager etc. This is a precursor to a bigger change. I wanted to separate out the relatively insignificant changes so the ultimate PR is not inflated. (Warning: this PR is full of unimportant nitpicks) Author: Andrew Or <andrewor14@gmail.com> Closes #1058 from andrewor14/bm-minor and squashes the following commits: 8e12eaf [Andrew Or] SparkException -> BlockException c36fd53 [Andrew Or] Make parts of BlockManager more readable 0a5f378 [Andrew Or] Entry -> MemoryEntry e9762a5 [Andrew Or] Tone down string interpolation (minor reverts) c4de9ac [Andrew Or] Merge branch 'master' of github.com:apache/spark into bm-minor b3470f1 [Andrew Or] More string interpolation (minor) 7f9dcab [Andrew Or] Use string interpolation (minor) 94a425b [Andrew Or] Refactor against duplicate code + minor changes 8a6a7dc [Andrew Or] Exception -> SparkException 97c410f [Andrew Or] Deal with MIMA excludes 2480f1d [Andrew Or] Fixes in StorgeLevel.scala abb0163 [Andrew Or] Style, formatting and naming fixes	2014-06-12 20:40:58 -07:00
Doris Xin	1de1d703bf	SPARK-1939 Refactor takeSample method in RDD to use ScaSRS Modified the takeSample method in RDD to use the ScaSRS sampling technique to improve performance. Added a private method that computes sampling rate > sample_size/total to ensure sufficient sample size with success rate >= 0.9999. Added a unit test for the private method to validate choice of sampling rate. Author: Doris Xin <doris.s.xin@gmail.com> Author: dorx <doris.s.xin@gmail.com> Author: Xiangrui Meng <meng@databricks.com> Closes #916 from dorx/takeSample and squashes the following commits: 5b061ae [Doris Xin] merge master 444e750 [Doris Xin] edge cases 3de882b [dorx] Merge pull request #2 from mengxr/SPARK-1939 82dde31 [Xiangrui Meng] update pyspark's takeSample 48d954d [Doris Xin] remove unused imports from RDDSuite fb1452f [Doris Xin] allowing num to be greater than count in all cases 1481b01 [Doris Xin] washing test tubes and making coffee dc699f3 [Doris Xin] give back imports removed by accident in rdd.py 64e445b [Doris Xin] logwarnning as soon as it enters the while loop 55518ed [Doris Xin] added TODO for logging in rdd.py eff89e2 [Doris Xin] addressed reviewer comments. ecab508 [Doris Xin] "fixed checkstyle violation 0a9b3e3 [Doris Xin] "reviewer comment addressed" f80f270 [Doris Xin] Merge branch 'master' into takeSample ae3ad04 [Doris Xin] fixed edge cases to prevent overflow 065ebcd [Doris Xin] Merge branch 'master' into takeSample 9bdd36e [Doris Xin] Check sample size and move computeFraction e3fd6a6 [Doris Xin] Merge branch 'master' into takeSample 7cab53a [Doris Xin] fixed import bug in rdd.py ffea61a [Doris Xin] SPARK-1939: Refactor takeSample method in RDD 1441977 [Doris Xin] SPARK-1939 Refactor takeSample method in RDD to use ScaSRS	2014-06-12 19:44:27 -07:00
Ariel Rabkin	0154587ab7	document laziness of parallelize Took me several hours to figure out this behavior. It would be good to highlight it in the documentation. Author: Ariel Rabkin <asrabkin@cs.princeton.edu> Closes #1070 from asrabkin/master and squashes the following commits: 29a076e [Ariel Rabkin] doc fix	2014-06-12 17:51:33 -07:00
Shuo Xiang	a6e0afdcf0	SPARK-2085: [MLlib] Apply user-specific regularization instead of uniform regularization in ALS The current implementation of ALS takes a single regularization parameter and apply it on both of the user factors and the product factors. This kind of regularization can be less effective while user number is significantly larger than the number of products (and vice versa). For example, if we have 10M users and 1K product, regularization on user factors will dominate. Following the discussion in [this thread](http://apache-spark-user-list.1001560.n3.nabble.com/possible-bug-in-Spark-s-ALS-implementation-tt2567.html#a2704), the implementation in this PR will regularize each factor vector by #ratings * lambda. Author: Shuo Xiang <sxiang@twitter.com> Closes #1026 from coderxiang/als-reg and squashes the following commits: 93dfdb4 [Shuo Xiang] Merge remote-tracking branch 'upstream/master' into als-reg b98f19c [Shuo Xiang] merge latest master 52c7b58 [Shuo Xiang] Apply user-specific regularization instead of uniform regularization in Alternating Least Squares (ALS)	2014-06-12 17:37:06 -07:00
Patrick Wendell	1c04652c8f	SPARK-1843: Replace assemble-deps with env variable. (This change is actually small, I moved some logic into compute-classpath that was previously in spark-class). Assemble deps has existed for a while to allow developers to run local code with new changes quickly. When I'm developing I typically use a simpler approach which just prepends the Spark classes to the classpath before the assembly jar. This is well defined in the JVM and the Spark classes take precedence over those in the assembly. This approach is portable across both builds which is the main reason I'd like to switch to it. It's also a bit easier to toggle on and off quickly. The way you use this is the following: ``` $ ./bin/spark-shell # Use spark with the normal assembly $ export SPARK_PREPEND_CLASSES=true $ ./bin/spark-shell # Now it's using compiled classes $ unset SPARK_PREPEND_CLASSES $ ./bin/spark-shell # Back to normal ``` Author: Patrick Wendell <pwendell@gmail.com> Closes #877 from pwendell/assemble-deps and squashes the following commits: 8a11345 [Patrick Wendell] Merge remote-tracking branch 'apache/master' into assemble-deps faa3168 [Patrick Wendell] Adding a warning for compatibility 3f151a7 [Patrick Wendell] Small fix bbfb73c [Patrick Wendell] Review feedback 328e9f8 [Patrick Wendell] SPARK-1843: Replace assemble-deps with env variable.	2014-06-12 15:43:32 -07:00
Marcelo Vanzin	ecde5b8375	[SPARK-2080] Yarn: report HS URL in client mode, correct user in cluster mode. Yarn client mode was not setting the app's tracking URL to the History Server's URL when configured by the user. Now client mode behaves the same as cluster mode. In SparkContext.scala, the "user.name" system property had precedence over the SPARK_USER environment variable. This means that SPARK_USER was never used, since "user.name" is always set by the JVM. In Yarn cluster mode, this means the application always reported itself as being run by user "yarn" (or whatever user was running the Yarn NM). One could argue that the correct fix would be to use UGI.getCurrentUser() here, but at least for Yarn that will match what SPARK_USER is set to. Author: Marcelo Vanzin <vanzin@cloudera.com> This patch had conflicts when merged, resolved by Committer: Thomas Graves <tgraves@apache.org> Closes #1002 from vanzin/yarn-client-url and squashes the following commits: 4046e04 [Marcelo Vanzin] Set HS link in yarn-alpha also. 4c692d9 [Marcelo Vanzin] Yarn: report HS URL in client mode, correct user in cluster mode.	2014-06-12 16:19:36 -05:00
Doris Xin	83c226d454	[SPARK-2088] fix NPE in toString After deserialization, the transient field creationSiteInfo does not get backfilled with the default value, but the toString method, which is invoked by the serializer, expects the field to always be non-null. An NPE is thrown when toString is called by the serializer when creationSiteInfo is null. Author: Doris Xin <doris.s.xin@gmail.com> Closes #1028 from dorx/toStringNPE and squashes the following commits: f20021e [Doris Xin] unit test for toString after desrialization 6f0a586 [Doris Xin] Merge branch 'master' into toStringNPE f47fecf [Doris Xin] Merge branch 'master' into toStringNPE 76199c6 [Doris Xin] [SPARK-2088] fix NPE in toString	2014-06-12 12:53:07 -07:00
Sandy Ryza	ce92a9c18f	SPARK-554. Add aggregateByKey. Author: Sandy Ryza <sandy@cloudera.com> Closes #705 from sryza/sandy-spark-554 and squashes the following commits: 2302b8f [Sandy Ryza] Add MIMA exclude f52e0ad [Sandy Ryza] Fix Python tests for real 2f3afa3 [Sandy Ryza] Fix Python test 0b735e9 [Sandy Ryza] Fix line lengths ae56746 [Sandy Ryza] Fix doc (replace T with V) c2be415 [Sandy Ryza] Java and Python aggregateByKey 23bf400 [Sandy Ryza] SPARK-554. Add aggregateByKey.	2014-06-12 08:14:25 -07:00
Jeff Thompson	43d53d51c9	fixed typo in docstring for min() Hi, I found this typo while learning spark and thought I'd do a pull request. Author: Jeff Thompson <jeffreykeatingthompson@gmail.com> Closes #1065 from jkthompson/docstring-typo-minmax and squashes the following commits: 29b6a26 [Jeff Thompson] fixed typo in docstring for min()	2014-06-12 08:10:51 -07:00
Henry Saputra	4d8ae709fb	Cleanup on Connection and ConnectionManager Simple cleanup on Connection and ConnectionManager to make IDE happy while working of issue: 1. Replace var with var 2. Add parentheses to Queue#dequeu to be consistent with side-effects. 3. Remove return on final line of a method. Author: Henry Saputra <henry.saputra@gmail.com> Closes #1060 from hsaputra/cleanup_connection_classes and squashes the following commits: 245fd09 [Henry Saputra] Cleanup on Connection and ConnectionManager to make IDE happy while working of issue: 1. Replace var with var 2. Add parentheses to Queue#dequeu to be consistent with side-effects. 3. Remove return on final line of a method.	2014-06-11 23:17:51 -07:00
Yadong	e056320cc8	'killFuture' is never used Author: Yadong <qiyadong2010@gmail.com> Closes #1052 from watermen/bug-fix1 and squashes the following commits: 409d09a [Yadong] 'killFuture' is never used	2014-06-11 20:58:39 -07:00
Matei Zaharia	508fd371d6	[SPARK-2044] Pluggable interface for shuffles This is a first cut at moving shuffle logic behind a pluggable interface, as described at https://issues.apache.org/jira/browse/SPARK-2044, to let us more easily experiment with new shuffle implementations. It moves the existing shuffle code to a class HashShuffleManager behind a general ShuffleManager interface. Two things are still missing to make this complete: * MapOutputTracker needs to be hidden behind the ShuffleManager interface; this will also require adding methods to ShuffleManager that will let the DAGScheduler interact with it as it does with the MapOutputTracker today * The code to do map-sides and reduce-side combine in ShuffledRDD, PairRDDFunctions, etc needs to be moved into the ShuffleManager's readers and writers However, some of these may also be done later after we merge the current interface. Author: Matei Zaharia <matei@databricks.com> Closes #1009 from mateiz/pluggable-shuffle and squashes the following commits: 7a09862 [Matei Zaharia] review comments be33d3f [Matei Zaharia] review comments 1513d4e [Matei Zaharia] Add ASF header ac56831 [Matei Zaharia] Bug fix and better error message 4f681ba [Matei Zaharia] Move write part of ShuffleMapTask to ShuffleManager f6f011d [Matei Zaharia] Move hash shuffle reader behind ShuffleManager interface 55c7717 [Matei Zaharia] Changed RDD code to use ShuffleReader 75cc044 [Matei Zaharia] Partial work to move hash shuffle in	2014-06-11 20:45:29 -07:00
Tor Myklebust	d9203350b0	[SPARK-1672][MLLIB] Separate user and product partitioning in ALS Some clean up work following #593. 1. Allow to set different number user blocks and number product blocks in `ALS`. 2. Update `MovieLensALS` to reflect the change. Author: Tor Myklebust <tmyklebu@gmail.com> Author: Xiangrui Meng <meng@databricks.com> Closes #1014 from mengxr/SPARK-1672 and squashes the following commits: 0e910dd [Xiangrui Meng] change private[this] to private[recommendation] 36420c7 [Xiangrui Meng] set exclusion rules for ALS 9128b77 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-1672 294efe9 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-1672 9bab77b [Xiangrui Meng] clean up add numUserBlocks and numProductBlocks to MovieLensALS 84c8e8c [Xiangrui Meng] Merge branch 'master' into SPARK-1672 d17a8bf [Xiangrui Meng] merge master a4925fd [Tor Myklebust] Style. bd8a75c [Tor Myklebust] Merge branch 'master' of github.com:apache/spark into alsseppar 021f54b [Tor Myklebust] Separate user and product blocks. dcf583a [Tor Myklebust] Remove the partitioner member variable; instead, thread that needle everywhere it needs to go. 23d6f91 [Tor Myklebust] Stop making the partitioner configurable. 495784f [Tor Myklebust] Merge branch 'master' of https://github.com/apache/spark 674933a [Tor Myklebust] Fix style. 40edc23 [Tor Myklebust] Fix missing space. f841345 [Tor Myklebust] Fix daft bug creating 'pairs', also for -> foreach. 5ec9e6c [Tor Myklebust] Clean a couple of things up using 'map'. 36a0f43 [Tor Myklebust] Make the partitioner private. d872b09 [Tor Myklebust] Add negative id ALS test. df27697 [Tor Myklebust] Support custom partitioners. Currently we use the same partitioner for users and products. c90b6d8 [Tor Myklebust] Scramble user and product ids before bucketing. c774d7d [Tor Myklebust] Make the partitioner a member variable and use it instead of modding directly.	2014-06-11 18:16:33 -07:00
Takuya UESHIN	9a2448daf9	[SPARK-2052] [SQL] Add optimization for CaseConversionExpression's. Add optimization for `CaseConversionExpression`'s. Author: Takuya UESHIN <ueshin@happy-camper.st> Closes #990 from ueshin/issues/SPARK-2052 and squashes the following commits: 2568666 [Takuya UESHIN] Move some rules back. dde7ede [Takuya UESHIN] Add tests to check if ConstantFolding can handle null literals and remove the unneeded rules from NullPropagation. c4eea67 [Takuya UESHIN] Fix toString methods. 23e2363 [Takuya UESHIN] Make CaseConversionExpressions foldable if the child is foldable. 0ff7568 [Takuya UESHIN] Add tests for collapsing case statements. 3977d80 [Takuya UESHIN] Add optimization for CaseConversionExpression's.	2014-06-11 17:58:35 -07:00
Patrick Wendell	d45e0c6b98	HOTFIX: Forgot to remove false change in previous commit	2014-06-11 15:55:41 -07:00
Patrick Wendell	14e6dc94f6	HOTFIX: PySpark tests should be order insensitive. This has been messing up the SQL PySpark tests on Jenkins. Author: Patrick Wendell <pwendell@gmail.com> Closes #1054 from pwendell/pyspark and squashes the following commits: 1eb5487 [Patrick Wendell] False change 06f062d [Patrick Wendell] HOTFIX: PySpark tests should be order insensitive	2014-06-11 15:54:41 -07:00
Andrew Or	fe78b8b6f7	HOTFIX: A few PySpark tests were not actually run This is a hot fix for the hot fix in `fb499be1ac`. The changes in that commit did not actually cause the `doctest` module in python to be loaded for the following tests: - pyspark/broadcast.py - pyspark/accumulators.py - pyspark/serializers.py (@pwendell I might have told you the wrong thing) Author: Andrew Or <andrewor14@gmail.com> Closes #1053 from andrewor14/python-test-fix and squashes the following commits: d2e5401 [Andrew Or] Explain why these tests are handled differently 0bd6fdd [Andrew Or] Fix 3 pyspark tests not being invoked	2014-06-11 12:11:46 -07:00
Daoyuan	ce6deb1e5b	[SQL] Code Cleanup: Left Semi Hash Join Some improvement for PR #837, add another case to white list and use `filter` to build result iterator. Author: Daoyuan <daoyuan.wang@intel.com> Closes #1049 from adrian-wang/clean-LeftSemiJoinHash and squashes the following commits: b314d5a [Daoyuan] change hashSet name 27579a9 [Daoyuan] add semijoin to white list and use filter to create new iterator in LeftSemiJoinBNL Signed-off-by: Michael Armbrust <michael@databricks.com>	2014-06-11 12:09:42 -07:00
Sameer Agarwal	4107cce58c	[SPARK-2042] Prevent unnecessary shuffle triggered by take() This PR implements `take()` on a `SchemaRDD` by inserting a logical limit that is followed by a `collect()`. This is also accompanied by adding a catalyst optimizer rule for collapsing adjacent limits. Doing so prevents an unnecessary shuffle that is sometimes triggered by `take()`. Author: Sameer Agarwal <sameer@databricks.com> Closes #1048 from sameeragarwal/master and squashes the following commits: 3eeb848 [Sameer Agarwal] Fixing Tests 1b76ff1 [Sameer Agarwal] Deprecating limit(limitExpr: Expression) in v1.1.0 b723ac4 [Sameer Agarwal] Added limit folding tests a0ff7c4 [Sameer Agarwal] Adding catalyst rule to fold two consecutive limits 8d42d03 [Sameer Agarwal] Implement trigger() as limit() followed by collect()	2014-06-11 12:01:04 -07:00
Lars Albertsson	4d5c12aa1c	SPARK-2113: awaitTermination() after stop() will hang in Spark Stremaing Author: Lars Albertsson <lalle@spotify.com> Closes #1001 from lallea/contextwaiter_stopped and squashes the following commits: 93cd314 [Lars Albertsson] Mend StreamingContext stop() followed by awaitTermination().	2014-06-11 10:54:45 -07:00
Prashant Sharma	e508f599f8	[SPARK-2108] Mark SparkContext methods that return block information as developer API's Author: Prashant Sharma <prashant.s@imaginea.com> Closes #1047 from ScrapCodes/SPARK-2108/mark-as-dev-api and squashes the following commits: 073ee34 [Prashant Sharma] [SPARK-2108] Mark SparkContext methods that return block information as developer API's	2014-06-11 10:49:34 -07:00
Prashant Sharma	5b754b45f3	[SPARK-2069] MIMA false positives Fixes SPARK 2070 and 2071 Author: Prashant Sharma <prashant.s@imaginea.com> Closes #1021 from ScrapCodes/SPARK-2070/package-private-methods and squashes the following commits: 7979a57 [Prashant Sharma] addressed code review comments 558546d [Prashant Sharma] A little fancy error message. 59275ab [Prashant Sharma] SPARK-2071 Mima ignores classes and its members from previous versions too. 0c4ff2b [Prashant Sharma] SPARK-2070 Ignore methods along with annotated classes.	2014-06-11 10:47:06 -07:00
Sandy Ryza	2a4225dd94	SPARK-1639. Tidy up some Spark on YARN code This contains a bunch of small tidyings of the Spark on YARN code. I focused on the yarn stable code. @tgravescs, let me know if you'd like me to make these for the alpha code as well. Author: Sandy Ryza <sandy@cloudera.com> Closes #561 from sryza/sandy-spark-1639 and squashes the following commits: 72b6a02 [Sandy Ryza] Fix comment and set name on driver thread c2190b2 [Sandy Ryza] SPARK-1639. Tidy up some Spark on YARN code	2014-06-11 07:57:28 -05:00
Qiuzhuang.Lian	6e11930310	SPARK-2107: FilterPushdownSuite doesn't need Junit jar. Author: Qiuzhuang.Lian <Qiuzhuang.Lian@gmail.com> Closes #1046 from Qiuzhuang/master and squashes the following commits: 0a9921a [Qiuzhuang.Lian] SPARK-2107: FilterPushdownSuite doesn't need Junit jar.	2014-06-11 00:36:06 -07:00
Xiangrui Meng	0f1dc3a73d	[SPARK-2091][MLLIB] use numpy.dot instead of ndarray.dot `ndarray.dot` is not available in numpy 1.4. This PR makes pyspark/mllib compatible with numpy 1.4. Author: Xiangrui Meng <meng@databricks.com> Closes #1035 from mengxr/numpy-1.4 and squashes the following commits: 7ad2f0c [Xiangrui Meng] use numpy.dot instead of ndarray.dot	2014-06-11 00:22:40 -07:00
Cheng Lian	0266a0c8a7	[SPARK-1968][SQL] SQL/HiveQL command for caching/uncaching tables JIRA issue: [SPARK-1968](https://issues.apache.org/jira/browse/SPARK-1968) This PR added support for SQL/HiveQL command for caching/uncaching tables: ``` scala> sql("CACHE TABLE src") ... res0: org.apache.spark.sql.SchemaRDD = SchemaRDD[0] at RDD at SchemaRDD.scala:98 == Query Plan == CacheCommandPhysical src, true scala> table("src") ... res1: org.apache.spark.sql.SchemaRDD = SchemaRDD[3] at RDD at SchemaRDD.scala:98 == Query Plan == InMemoryColumnarTableScan [key#0,value#1], (HiveTableScan [key#0,value#1], (MetastoreRelation default, src, None), None), false scala> isCached("src") res2: Boolean = true scala> sql("CACHE TABLE src") ... res3: org.apache.spark.sql.SchemaRDD = SchemaRDD[4] at RDD at SchemaRDD.scala:98 == Query Plan == CacheCommandPhysical src, false scala> table("src") ... res4: org.apache.spark.sql.SchemaRDD = SchemaRDD[11] at RDD at SchemaRDD.scala:98 == Query Plan == HiveTableScan [key#2,value#3], (MetastoreRelation default, src, None), None scala> isCached("src") res5: Boolean = false ``` Things also work for `hql`. Author: Cheng Lian <lian.cs.zju@gmail.com> Closes #1038 from liancheng/sqlCacheTable and squashes the following commits: ecb7194 [Cheng Lian] Trimmed the SQL string before parsing special commands 6f4ce42 [Cheng Lian] Moved logical command classes to a separate file 3458a24 [Cheng Lian] Added comment for public API f0ffacc [Cheng Lian] Added isCached() predicate 15ec6d2 [Cheng Lian] Added "(UN)CACHE TABLE" SQL/HiveQL statements	2014-06-11 00:06:50 -07:00
Takuya UESHIN	0402bd77ec	[SPARK-2093] [SQL] NullPropagation should use exact type value. `NullPropagation` should use exact type value when transform `Count` or `Sum`. Author: Takuya UESHIN <ueshin@happy-camper.st> Closes #1034 from ueshin/issues/SPARK-2093 and squashes the following commits: 65b6ff1 [Takuya UESHIN] Modify the literal value of the result of transformation from Sum to long value. 830c20b [Takuya UESHIN] Add Cast to the result of transformation from Count. 9314806 [Takuya UESHIN] Fix NullPropagation to use exact type value.	2014-06-10 23:13:48 -07:00
Zongheng Yang	601032f5bf	HOTFIX: clear() configs in SQLConf-related unit tests. Thanks goes to @liancheng, who pointed out that `sql/test-only .SQLConfSuite .SQLQuerySuite` passed but `sql/test-only .SQLQuerySuite .SQLConfSuite` failed. The reason is that some tests use the same test keys and without clear()'ing, they get carried over to other tests. This hotfix simply adds some `clear()` calls. This problem was not evident on Jenkins before probably because `parallelExecution` is not set to `false` for `sqlCoreSettings`. Author: Zongheng Yang <zongheng.y@gmail.com> Closes #1040 from concretevitamin/sqlconf-tests and squashes the following commits: 6d14ceb [Zongheng Yang] HOTFIX: clear() confs in SQLConf related unit tests.	2014-06-10 21:59:01 -07:00
Nicholas Chammas	a2052a44f3	[SPARK-2065] give launched instances names This update resolves [SPARK-2065](https://issues.apache.org/jira/browse/SPARK-2065). It gives launched EC2 instances descriptive names by using instance tags. Launched instances now show up in the EC2 console with these names. I used `format()` with named parameters, which I believe is the recommended practice for string formatting in Python, but which doesn’t seem to be used elsewhere in the script. Author: Nicholas Chammas <nicholas.chammas@gmail.com> Author: nchammas <nicholas.chammas@gmail.com> Closes #1043 from nchammas/master and squashes the following commits: 69f6e22 [Nicholas Chammas] PEP8 fixes 2627247 [Nicholas Chammas] broke up lines before they hit 100 chars 6544b7e [Nicholas Chammas] [SPARK-2065] give launched instances names 69da6cf [nchammas] Merge pull request #1 from apache/master	2014-06-10 21:49:08 -07:00
witgo	c48b6222ea	Resolve scalatest warnings during build Author: witgo <witgo@qq.com> Closes #1032 from witgo/ShouldMatchers and squashes the following commits: 7ebf34c [witgo] Resolve scalatest warnings during build	2014-06-10 20:24:05 -07:00
Tathagata Das	4823bf470e	[SPARK-1940] Enabling rolling of executor logs, and automatic cleanup of old executor logs Currently, in the default log4j configuration, all the executor logs get sent to the file <code>[executor-working-dir]/stderr</code>. This does not all log files to be rolled, so old logs cannot be removed. Using log4j RollingFileAppender allows log4j logs to be rolled, but all the logs get sent to a different set of files, other than the files <code>stdout</code> and <code>stderr</code> . So the logs are not visible in the Spark web UI any more as Spark web UI only reads the files <code>stdout</code> and <code>stderr</code>. Furthermore, it still does not allow the stdout and stderr to be cleared periodically in case a large amount of stuff gets written to them (e.g. by explicit `println` inside map function). This PR solves this by implementing a simple `RollingFileAppender` within Spark (disabled by default). When enabled (using configuration parameter `spark.executor.rollingLogs.enabled`), the logs can get rolled over either by time interval (set with `spark.executor.rollingLogs.interval`, set to daily by default), or by size of logs (set with `spark.executor.rollingLogs.size`). Finally, old logs can be automatically deleted by specifying how many of the latest log files to keep (set with `spark.executor.rollingLogs.keepLastN`). The web UI has also been modified to show the logs across the rolled-over files. You can test this locally (without waiting a whole day) by setting configuration `spark.executor.rollingLogs.enabled=true` and `spark.executor.rollingLogs.interval=minutely`. Continuously generate logs by running spark jobs and the generated logs files would look like this (`stderr` and `stdout` are the most current log file that are being written to). ``` stderr stderr--2014-05-27--14-37 stderr--2014-05-27--14-47 stderr--2014-05-27--15-05 stdout stdout--2014-05-27--14-47 ``` The web ui should show logs across these files. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #895 from tdas/rolling-logs and squashes the following commits: fd8f87f [Tathagata Das] Minor change. d326aee [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into rolling-logs ad956c1 [Tathagata Das] Scala style fix. 1f0a6ec [Tathagata Das] Some more changes based on Patrick's PR comments. c8bfe4e [Tathagata Das] Refactore FileAppender to a package spark.util.logging and broke up the file into multiple files. Changed configuration parameter names. 4224409 [Tathagata Das] Style fix. 108a9f8 [Tathagata Das] Added better constraint handling for rolling policies. f7da977 [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into rolling-logs 9134495 [Tathagata Das] Simplified rolling logs by removing Daily/Hourly/MinutelyRollingFileAppender, and removing the setting rollingLogs.enabled 312d874 [Tathagata Das] Minor fixes based on PR comments. 8a67d83 [Tathagata Das] Fixed comments. b36cfd6 [Tathagata Das] Implemented RollingPolicy, TimeBasedRollingPolicy and SizeBasedRollingPolicy, and changed RollingFileAppender accordingly. b7e8272 [Tathagata Das] Style fix, 374c9a9 [Tathagata Das] Added missing license. 24354ea [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into rolling-logs 6cc09c7 [Tathagata Das] Fixed bugs in rolling logs, and added more debug statements. adf4910 [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into rolling-logs 931f8fb [Tathagata Das] Changed log viewer in Spark web UI to handle rolling log files. cb4fb6d [Tathagata Das] Added FileAppender and RollingFileAppender to generate rolling executor logs.	2014-06-10 20:22:02 -07:00
joyyoj	2966044307	[SPARK-1998] SparkFlumeEvent with body bigger than 1020 bytes are not re... flume event sent to Spark will fail if the body is too large and numHeaders is greater than zero Author: joyyoj <sunshch@gmail.com> Closes #951 from joyyoj/master and squashes the following commits: f4660c5 [joyyoj] [SPARK-1998] SparkFlumeEvent with body bigger than 1020 bytes are not read properly	2014-06-10 17:26:17 -07:00
egraldlo	1abbde0e89	[SQL] Add average overflow test case from #978 By @egraldlo. Author: egraldlo <egraldlo@gmail.com> Author: Michael Armbrust <michael@databricks.com> Closes #1033 from marmbrus/pr/978 and squashes the following commits: e228c5e [Michael Armbrust] Remove "test". 762aeaf [Michael Armbrust] Remove unneeded rule. More descriptive name for test table. d414cd7 [egraldlo] fommatting issues 1153f75 [egraldlo] do best to avoid overflowing in function avg().	2014-06-10 14:07:55 -07:00
Ankur Dave	55a0e87ee4	HOTFIX: Increase time limit for Bagel test The test was timing out on some slow EC2 workers. Author: Ankur Dave <ankurdave@gmail.com> Closes #1037 from ankurdave/bagel-test-time-limit and squashes the following commits: 67fd487 [Ankur Dave] Increase time limit for Bagel test	2014-06-10 13:15:11 -07:00
Patrick Wendell	fb499be1ac	HOTFIX: Fix Python tests on Jenkins. Author: Patrick Wendell <pwendell@gmail.com> Closes #1036 from pwendell/jenkins-test and squashes the following commits: 9c99856 [Patrick Wendell] Better output during tests 71e7b74 [Patrick Wendell] Removing incorrect python path 74984db [Patrick Wendell] HOTFIX: Allow PySpark tests to run on Jenkins.	2014-06-10 13:13:17 -07:00
Cheng Hao	db0c038a66	[SPARK-2076][SQL] Pushdown the join filter & predication for outer join As the rule described in https://cwiki.apache.org/confluence/display/Hive/OuterJoinBehavior, we can optimize the SQL Join by pushing down the Join predicate and Where predicate. Author: Cheng Hao <hao.cheng@intel.com> Closes #1015 from chenghao-intel/join_predicate_push_down and squashes the following commits: 10feff9 [Cheng Hao] fix bug of changing the join type in PredicatePushDownThroughJoin 44c6700 [Cheng Hao] Add logical to support pushdown the join filter 0bce426 [Cheng Hao] Pushdown the join filter & predicate for outer join	2014-06-10 12:59:52 -07:00
witgo	884ca718b2	[SPARK-1978] In some cases, spark-yarn does not automatically restart the failed container Author: witgo <witgo@qq.com> Closes #921 from witgo/allocateExecutors and squashes the following commits: bc3aa66 [witgo] review commit 8800eba [witgo] Merge branch 'master' of https://github.com/apache/spark into allocateExecutors 32ac7af [witgo] review commit 056b8c7 [witgo] Merge branch 'master' of https://github.com/apache/spark into allocateExecutors 04c6f7e [witgo] Merge branch 'master' into allocateExecutors aff827c [witgo] review commit 5c376e0 [witgo] Merge branch 'master' of https://github.com/apache/spark into allocateExecutors 1faf4f4 [witgo] Merge branch 'master' into allocateExecutors 3c464bd [witgo] add time limit to allocateExecutors e00b656 [witgo] In some cases, yarn does not automatically restart the container	2014-06-10 10:34:57 -05:00
Cheng Lian	a9a461c594	Moved hiveOperators.scala to the right package folder The package is `org.apache.spark.sql.hive.execution`, while the file was placed under `sql/hive/src/main/scala/org/apache/spark/sql/hive/`. Author: Cheng Lian <lian.cs.zju@gmail.com> Closes #1029 from liancheng/moveHiveOperators and squashes the following commits: d632eb8 [Cheng Lian] Moved hiveOperators.scala to the right package folder	2014-06-10 01:14:44 -07:00
Zongheng Yang	08ed9ad813	[SPARK-1508][SQL] Add SQLConf to SQLContext. This PR (1) introduces a new class SQLConf that stores key-value properties for a SQLContext (2) clean up the semantics of various forms of SET commands. The SQLConf class unlocks user-controllable optimization opportunities; for example, user can now override the number of partitions used during an Exchange. A SQLConf can be accessed and modified programmatically through its getters and setters. It can also be modified through SET commands executed by `sql()` or `hql()`. Note that users now have the ability to change a particular property for different queries inside the same Spark job, unlike settings configured in SparkConf. For SET commands: "SET" will return all properties currently set in a SQLConf, "SET key" will return the key-value pair (if set) or an undefined message, and "SET key=value" will call the setter on SQLConf, and if a HiveContext is used, it will be executed in Hive as well. Author: Zongheng Yang <zongheng.y@gmail.com> Closes #956 from concretevitamin/sqlconf and squashes the following commits: 4968c11 [Zongheng Yang] Very minor cleanup. d74dde5 [Zongheng Yang] Remove the redundant mkQueryExecution() method. c129b86 [Zongheng Yang] Merge remote-tracking branch 'upstream/master' into sqlconf 26c40eb [Zongheng Yang] Make SQLConf a trait and have SQLContext mix it in. dd19666 [Zongheng Yang] Update a comment. baa5d29 [Zongheng Yang] Remove default param for shuffle partitions accessor. 5f7e6d8 [Zongheng Yang] Add default num partitions. 22d9ed7 [Zongheng Yang] Fix output() of Set physical. Add SQLConf param accessor method. e9856c4 [Zongheng Yang] Use java.util.Collections.synchronizedMap on a Java HashMap. 88dd0c8 [Zongheng Yang] Remove redundant SET Keyword. 271f0b1 [Zongheng Yang] Minor change. f8983d1 [Zongheng Yang] Minor changes per review comments. 1ce8a5e [Zongheng Yang] Invoke runSqlHive() in SQLConf#get for the HiveContext case. b766af9 [Zongheng Yang] Remove a test. d52e1bd [Zongheng Yang] De-hardcode number of shuffle partitions for BasicOperators (read from SQLConf). 555599c [Zongheng Yang] Bullet-proof (relatively) parsing SET per review comment. c2067e8 [Zongheng Yang] Mark SQLContext transient and put it in a second param list. 2ea8cdc [Zongheng Yang] Wrap long line. 41d7f09 [Zongheng Yang] Fix imports. 13279e6 [Zongheng Yang] Refactor the logic of eagerly processing SET commands. b14b83e [Zongheng Yang] In a HiveContext, make SQLConf a subset of HiveConf. 6983180 [Zongheng Yang] Move a SET test to SQLQuerySuite and make it complete. 5b67985 [Zongheng Yang] New line at EOF. c651797 [Zongheng Yang] Add commands.scala. efd82db [Zongheng Yang] Clean up semantics of several cases of SET. c1017c2 [Zongheng Yang] WIP in changing SetCommand to take two Options (for different semantics of SETs). 0f00d86 [Zongheng Yang] Add a test for singleton set command in SQL. 41acd75 [Zongheng Yang] Add a test for hql() in HiveQuerySuite. 2276929 [Zongheng Yang] Fix default hive result for set commands in HiveComparisonTest. 3b0c71b [Zongheng Yang] Remove Parser for set commands. A few other fixes. d0c4578 [Zongheng Yang] Tmux typo. 0ecea46 [Zongheng Yang] Changes for HiveQl and HiveContext. ce22d80 [Zongheng Yang] Fix parsing issues. cb722c1 [Zongheng Yang] Finish up SQLConf patch. 4ebf362 [Zongheng Yang] First cut at SQLConf inside SQLContext.	2014-06-10 00:49:09 -07:00

1 2 3 4 5 ...

7182 commits