I tested the various ways of specifying the number of executors with both spark-submit and spark-class, in both client and cluster mode where applicable:
--num-workers, --num-executors, spark.executor.instances, SPARK_EXECUTOR_INSTANCES, and the default with nothing supplied.
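For reference, here is a minimal sketch of the precedence these options are expected to follow (a hypothetical helper, not Spark's actual resolution code; the default of 2 is an assumption):
```scala
// Hypothetical helper: an explicit flag wins, then spark.executor.instances,
// then the SPARK_EXECUTOR_INSTANCES environment variable, then a default.
def resolveNumExecutors(cliFlag: Option[Int], conf: Map[String, String]): Int =
  cliFlag
    .orElse(conf.get("spark.executor.instances").map(_.toInt))
    .orElse(sys.env.get("SPARK_EXECUTOR_INSTANCES").map(_.toInt))
    .getOrElse(2)
```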
Author: Thomas Graves <tgraves@staydecay.corp.gq1.yahoo.com>
Closes#9523 from tgravescs/SPARK-11555.
This PR implements the default save/load for non-meta estimators and transformers using the JSON serialization of param values. The saved metadata includes:
* class name
* uid
* timestamp
* paramMap
The save/load interface is similar to DataFrames. We use the current active context by default, which should be sufficient for most use cases.
~~~scala
instance.save("path")
instance.write.context(sqlContext).overwrite().save("path")
Instance.load("path")
~~~
The param handling is different from the design doc. We didn't save default and user-set params separately, so when we load an instance back, all parameters are user-set. This does cause issues, but saving them separately would also cause issues if we later modify the default params.
TODOs:
* [x] Java test
* [ ] a follow-up PR to implement default save/load for all non-meta estimators and transformers
cc jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Closes#9454 from mengxr/SPARK-11217.
This PR enables the Expand operator to process and produce Unsafe Rows.
Author: Herman van Hovell <hvanhovell@questtec.nl>
Closes#9414 from hvanhovell/SPARK-11450.
https://issues.apache.org/jira/browse/SPARK-10116
This is really trivial, just happened to notice it -- if `XORShiftRandom.hashSeed` is really supposed to have random bits throughout (as the comment implies), it needs to do something for the conversion to `long`.
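As a rough illustration of the idea (a sketch, not necessarily the exact change in the patch), one way to get random bits in both halves is to hash all 8 bytes of the seed:
```scala
import java.nio.ByteBuffer
import scala.util.hashing.MurmurHash3

// Sketch: hash the full 8 bytes of the seed so the high 32 bits of the result
// are randomized as well, instead of hashing only the lower half.
def hashSeed(seed: Long): Long = {
  val bytes = ByteBuffer.allocate(java.lang.Long.BYTES).putLong(seed).array()
  val lowBits = MurmurHash3.bytesHash(bytes)
  val highBits = MurmurHash3.bytesHash(bytes, lowBits)
  (highBits.toLong << 32) | (lowBits.toLong & 0xFFFFFFFFL)
}
```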
mengxr mkolod
Author: Imran Rashid <irashid@cloudera.com>
Closes#8314 from squito/SPARK-10116.
This PR adds test cases that test various column pruning and filter push-down cases.
Author: Cheng Lian <lian@databricks.com>
Closes#9468 from liancheng/spark-10978.follow-up.
JIRA: https://issues.apache.org/jira/browse/SPARK-9162
Currently ScalaUDF extends CodegenFallback and doesn't provide a code generation implementation. This patch implements code generation for ScalaUDF.
Author: Liang-Chi Hsieh <viirya@appier.com>
Closes#9270 from viirya/scalaudf-codegen.
This just ignores `InputDStream`s that have a null `rememberDuration` in `DStreamGraph.getMaxInputStreamRememberDuration`.
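A minimal sketch of that guard (illustrative only, assuming at least one non-null duration):
```scala
import org.apache.spark.streaming.Duration

// Sketch: ignore entries that are still null (uninitialized) when taking the maximum.
def maxRememberDuration(durations: Seq[Duration]): Duration =
  durations.filter(_ != null).maxBy(_.milliseconds)

// e.g. maxRememberDuration(Seq(null, Duration(60000), Duration(120000))) == Duration(120000)
```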
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#9476 from zsxwing/SPARK-11511.
A cleanup for https://github.com/apache/spark/pull/9085.
`DecimalLit` is very similar to `FloatLit`; we can keep just one of them.
Also added a low-level unit test in `SqlParserSuite`.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#9482 from cloud-fan/parser.
This PR adds the ability to do typed SQL aggregations. We will likely also want to provide an interface to allow users to do aggregations on objects, but this is deferred to another PR.
```scala
val ds = Seq(("a", 10), ("a", 20), ("b", 1), ("b", 2), ("c", 1)).toDS()
ds.groupBy(_._1).agg(sum("_2").as[Int]).collect()
res0: Array(("a", 30), ("b", 3), ("c", 1))
```
Author: Michael Armbrust <michael@databricks.com>
Closes#9499 from marmbrus/dataset-agg.
This adds support for off-heap memory for the arrays inside BytesToBytesMap and InMemorySorter, so that all execution memory can be allocated off-heap.
Closes#8068
Author: Davies Liu <davies@databricks.com>
Closes#9477 from davies/unsafe_timsort.
sbt's version resolution code always picks the most recent version, and we
don't want that for guava.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#9508 from vanzin/SPARK-11538.
Currently the Yarn AM proxy filter configuration is recovered from the checkpoint file when a Spark Streaming application is restarted, which leads to some unwanted behaviors:
1. Wrong RM address if the RM is redeployed after a failure.
2. Wrong proxyBase, since the app id changes on restart and the old app id used for proxyBase is stale.
So instead of recovering them from the checkpoint file, these configurations should be reloaded each time the application starts.
This problem only exists in Yarn cluster mode; for Yarn client mode, these configurations are updated with the RPC message `AddWebUIFilter`.
Please help to review tdas harishreedharan vanzin, thanks a lot.
Author: jerryshao <sshao@hortonworks.com>
Closes#9412 from jerryshao/SPARK-11457.
Currently, if a Timestamp is before the epoch (1970-01-01), the hours, minutes and seconds will be negative (and also rounded up).
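As a minimal illustration (a hypothetical helper, not the patch itself), floor division/modulo keeps the hour of a pre-epoch timestamp in [0, 23] instead of letting it go negative:
```scala
// Sketch: compute the UTC hour of day with floor semantics so pre-epoch values stay non-negative.
def hourOfDay(millisSinceEpoch: Long): Int = {
  val seconds = java.lang.Math.floorDiv(millisSinceEpoch, 1000L)
  val secondsOfDay = java.lang.Math.floorMod(seconds, 86400L)
  (secondsOfDay / 3600L).toInt
}

hourOfDay(-1000L)  // 23 (one second before 1970-01-01 00:00:00 UTC), not -1
```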
Author: Davies Liu <davies@databricks.com>
Closes#9502 from davies/neg_hour.
Because deparse() breaks long strings into multiple lines, deserialization will fail.
Author: Davies Liu <davies@databricks.com>
Closes#9510 from davies/fix_glm.
The main problem is that we interpret column names with special handling of `.` for DataFrames. This enables us to write something like `df("a.b")` to get the field `b` of `a`. However, we don't need this feature in `DataFrame.apply("*")` or `DataFrame.withColumnRenamed`: in these 2 cases the column name is already the final name, and no extra processing is needed to interpret it.
The solution is simple: use `queryExecution.analyzed.output` to get the resolved columns directly, instead of using `DataFrame.resolve`.
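A minimal sketch of the idea (a hypothetical helper built on the public `queryExecution` hook, not the exact code in this PR):
```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.catalyst.expressions.Attribute

// Sketch: look the column up in the analyzed plan's output by its exact,
// already-resolved name instead of re-interpreting dots via resolve().
def resolvedColumn(df: DataFrame, name: String): Option[Attribute] =
  df.queryExecution.analyzed.output.find(_.name == name)
```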
close https://github.com/apache/spark/pull/8811
Author: Wenchen Fan <wenchen@databricks.com>
Closes#9462 from cloud-fan/special-chars.
This is the alternative/agreed upon solution to PR #8780.
Create an OracleDialect to handle the nonspecific numeric types that can be defined in Oracle.
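A rough sketch of the shape of such a dialect (the specific mapping below is an assumption for illustration, not necessarily the one chosen in this PR):
```scala
import java.sql.Types
import org.apache.spark.sql.jdbc.JdbcDialect
import org.apache.spark.sql.types.{DataType, DecimalType, MetadataBuilder}

// Sketch: map Oracle NUMBER columns reported without precision/scale to a concrete
// Catalyst type instead of failing; the (38, 10) choice here is illustrative only.
case object OracleDialectSketch extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:oracle")

  override def getCatalystType(
      sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] = {
    if (sqlType == Types.NUMERIC && size == 0) Some(DecimalType(38, 10)) else None
  }
}

// Registration would look like: JdbcDialects.registerDialect(OracleDialectSketch)
```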
Author: Travis Hegner <thegner@trilliumit.com>
Closes#9495 from travishegner/OracleDialect.
This internal implicit conversion has been a source of confusion for a lot of new developers.
Author: Reynold Xin <rxin@databricks.com>
Closes#9479 from rxin/SPARK-11513.
Use the proxyBase set by the AM; if it is not found, fall back to the environment variable. This fixes the issue where somebody accidentally sets APPLICATION_WEB_PROXY_BASE to the wrong proxyBase.
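A minimal sketch of that fallback order (a hypothetical helper, not the actual code):
```scala
// Sketch: prefer the proxy base provided by the AM; only fall back to the
// environment variable when the AM did not provide one.
def resolveProxyBase(fromAm: Option[String]): Option[String] =
  fromAm.filter(_.nonEmpty).orElse(sys.env.get("APPLICATION_WEB_PROXY_BASE"))
```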
Author: Srinivasa Reddy Vundela <vsr@cloudera.com>
Closes#9448 from vundela/master.
Following up on [SPARK-9836](https://issues.apache.org/jira/browse/SPARK-9836), we should also support summary statistics for the ```intercept```.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#9485 from yanboliang/spark-11473.
In DefaultDataSource.scala, it has

```scala
override def createRelation(
    sqlContext: SQLContext,
    parameters: Map[String, String]): BaseRelation
```

where `parameters` is a CaseInsensitiveMap. After this line

```scala
parameters.foreach(kv => properties.setProperty(kv._1, kv._2))
```

`properties` contains all lower-case key/value pairs, so `fetchSize` becomes `fetchsize`.
However, the `compute` method in JDBCRDD has

```scala
val fetchSize = properties.getProperty("fetchSize", "0").toInt
```

so the `fetchSize` value is always 0 and never gets set correctly.
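One way around this (a hypothetical helper for illustration, not necessarily what this PR does) is to read the property case-insensitively:
```scala
import java.util.Properties

// Sketch: accept the lower-cased key coming from a CaseInsensitiveMap as well.
def getFetchSize(properties: Properties): Int =
  Option(properties.getProperty("fetchSize"))
    .orElse(Option(properties.getProperty("fetchsize")))
    .map(_.toInt)
    .getOrElse(0)
```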
Author: Huaxin Gao <huaxing@oc0558782468.ibm.com>
Closes#9473 from huaxingao/spark-11474.
spark.rpc is supposed to be configurable but currently isn't (it doesn't get propagated to executors because RpcEnv.create is called before the driver properties are fetched).
Author: Nishkam Ravi <nishkamravi@gmail.com>
Closes#9460 from nishkamravi2/master_akka.
We should use ```coefficients``` rather than ```weights``` in the user guide, so that newcomers learn the right conventional name at the outset. mengxr vectorijk
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#9493 from yanboliang/docs-coefficients.
`jars` in the log line is an array, so `$jars` doesn't print its content.
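A minimal illustration of the difference (hypothetical values):
```scala
val jars = Array("a.jar", "b.jar")
println(s"jars: $jars")                   // prints the array reference, e.g. [Ljava.lang.String;@1b6d3586
println(s"jars: ${jars.mkString(", ")}")  // prints the contents: a.jar, b.jar
```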
Author: Cheng Lian <lian@databricks.com>
Closes#9494 from liancheng/minor.log-fix.
In file LDAOptimizer.scala:
line 441: since `idx` was never used, replaced the unneeded zipWithIndex.foreach with foreach.
- nonEmptyDocs.zipWithIndex.foreach { case ((_, termCounts: Vector), idx: Int) =>
+ nonEmptyDocs.foreach { case (_, termCounts: Vector) =>
Author: a1singh <a1singh@ucsd.edu>
Closes#9456 from a1singh/master.
```PortableDataStream``` maintains some internal state. This makes it tricky to reuse a stream (one needs to call ```close``` on both the ```PortableDataStream``` and the ```InputStream``` it produces).
This PR removes all state from ```PortableDataStream``` and effectively turns it into an ```InputStream```/```Array[Byte]``` factory. This makes the user responsible for managing the ```InputStream``` it returns.
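A usage sketch under the new contract (illustrative snippet; the path and read logic are placeholders):
```scala
import org.apache.spark.SparkContext

// Sketch: the caller owns the InputStream returned by open() and must close it.
def readBinaryFiles(sc: SparkContext, path: String) =
  sc.binaryFiles(path).mapValues { pds =>
    val in = pds.open()
    try {
      in.available() // placeholder for actually consuming the stream
    } finally {
      in.close()
    }
  }
```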
cc srowen
Author: Herman van Hovell <hvanhovell@questtec.nl>
Closes#9417 from hvanhovell/SPARK-11449.
This adds a failing test checking that `awaitTerminationOrTimeout` returns the expected value, and then fixes that failing test with the addition of a `return`.
tdas zsxwing
Author: Nick Evans <me@nicolasevans.org>
Closes#9336 from manygrams/fix_await_termination_or_timeout.
After aggregation, the dataset could be smaller than the input, so it's better to do hash-based aggregation over all inputs first, and then use sort-based aggregation to merge them.
Author: Davies Liu <davies@databricks.com>
Closes#9383 from davies/fix_switch.
OutputCommitCoordinator uses a map in a place where an array would suffice, increasing its memory consumption for result stages with millions of tasks.
This patch replaces that map with an array. The only tricky part of this is reasoning about the range of possible array indexes in order to make sure that we never index out of bounds.
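A schematic sketch of the data-structure change (simplified, not the actual OutputCommitCoordinator code):
```scala
// Sketch: partition ids are dense in [0, numPartitions), so the authorized attempt per
// partition can live in a fixed-size array instead of a map keyed by partition id.
class CommitterState(numPartitions: Int) {
  private val NoAuthorizedCommitter = -1
  private val authorizedCommitters = Array.fill(numPartitions)(NoAuthorizedCommitter)

  def canCommit(partition: Int, attemptNumber: Int): Boolean = synchronized {
    if (authorizedCommitters(partition) == NoAuthorizedCommitter) {
      authorizedCommitters(partition) = attemptNumber
      true
    } else {
      authorizedCommitters(partition) == attemptNumber
    }
  }
}
```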
Author: Josh Rosen <joshrosen@databricks.com>
Closes#9274 from JoshRosen/SPARK-11307.
1. def dialectClassName in HiveContext is unnecessary.
In HiveContext, if conf.dialect == "hiveql", getSQLDialect() will return new HiveQLDialect(this);
else it will use super.getSQLDialect(). Then super.getSQLDialect() calls dialectClassName, which is overridden in HiveContext and still returns super.dialectClassName.
So we'll never reach the code "classOf[HiveQLDialect].getCanonicalName" of def dialectClassName in HiveContext.
2. When we start bin/spark-sql, the default context is HiveContext, and the corresponding dialect is hiveql.
However, if we type "set spark.sql.dialect;", the result is "sql", which is inconsistent with the actual dialect and is misleading. For example, we can use sql like "create table" which is only allowed in hiveql, but this dialect conf shows it's "sql".
Although this problem does not cause any execution error, it's misleading to Spark SQL users, so I think we should fix it.
In this PR, while processing "set spark.sql.dialect" in SetCommand, I use "conf.dialect" instead of "getConf()" for the case key == SQLConf.DIALECT.key, so that it returns the right dialect conf.
Author: Zhenhua Wang <wangzhenhua@huawei.com>
Closes#9349 from wzhfy/dialect.
Spark should build against Scala 2.10.5, since that includes a fix for Scaladoc that will fix doc snapshot publishing: https://issues.scala-lang.org/browse/SI-8479
Author: Josh Rosen <joshrosen@databricks.com>
Closes#9450 from JoshRosen/upgrade-to-scala-2.10.5.
We have some aggregate function tests in both DataFrameAggregateSuite and SQLQuerySuite. The two have almost the same coverage and we should just remove the SQL one.
Author: Reynold Xin <rxin@databricks.com>
Closes#9475 from rxin/SPARK-11510.
Since we have 4 bytes for the number of records at the beginning of a page, the address can never be zero, so we do not need the bitset.
As for performance, the bitset could speed up a false lookup when the slot is empty (the bitset is smaller than the longArray, so its cache hit rate is higher). In practice, the map is filled to 35% - 70% (use 50% as the average), so only half of the false lookups can benefit from it; all the others pay the cost of loading the bitset (and still need to access the longArray anyway).
For aggregation, we always need to access the longArray (we insert a new key after a false lookup), which is also confirmed by a benchmark.
For broadcast hash join, there could be a regression, but a simple benchmark showed that there may not be one (most of the lookups are false):
```
sqlContext.range(1<<20).write.parquet("small")
df = sqlContext.read.parquet('small')
for i in range(3):
    t = time.time()
    df2 = sqlContext.range(1<<26).selectExpr("id * 1111111111 % 987654321 as id2")
    df2.join(df, df.id == df2.id2).count()
    print time.time() - t
```
Having bitset (used time in seconds):
```
17.5404241085
10.2758829594
10.5786800385
```
After removing bitset (used time in seconds):
```
21.8939979076
12.4132959843
9.97224712372
```
cc rxin nongli
Author: Davies Liu <davies@databricks.com>
Closes#9452 from davies/remove_bitset.
This is an updated version of #8995 by a-roberts. Original description follows:
Snappy now supports concatenation of serialized streams. This patch contains a version number change, and the "does not support" test is now a "supports" test (see the sketch after the changelog excerpt below).
Snappy 1.1.2 changelog mentions:
> snappy-java-1.1.2 (22 September 2015)
> This is a backward compatible release for 1.1.x.
> Add AIX (32-bit) support.
> There is no upgrade for the native libraries of the other platforms.
> A major change since 1.1.1 is a support for reading concatenated results of SnappyOutputStream(s)
> snappy-java-1.1.2-RC2 (18 May 2015)
> Fix#107: SnappyOutputStream.close() is not idempotent
> snappy-java-1.1.2-RC1 (13 May 2015)
> SnappyInputStream now supports reading concatenated compressed results of SnappyOutputStream
> There has been no compressed format change since 1.0.5.x, so you can read the compressed results interchangeably between these versions.
> Fixes a problem when java.io.tmpdir does not exist.
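A small sketch of the behavior the updated test relies on (illustrative, not the actual test code): two independently compressed blocks, concatenated byte-for-byte, can be read back as a single stream with snappy-java 1.1.2+.
```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import org.xerial.snappy.{SnappyInputStream, SnappyOutputStream}

// Sketch: compress two strings separately, concatenate the results, and read them
// back through a single SnappyInputStream.
def compress(s: String): Array[Byte] = {
  val buffer = new ByteArrayOutputStream()
  val out = new SnappyOutputStream(buffer)
  out.write(s.getBytes("UTF-8"))
  out.close()
  buffer.toByteArray
}

val concatenated = compress("Hello ") ++ compress("world")
val in = new SnappyInputStream(new ByteArrayInputStream(concatenated))
val decompressed = scala.io.Source.fromInputStream(in, "UTF-8").mkString
// decompressed == "Hello world"
```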
Closes#8995.
Author: Adam Roberts <aroberts@uk.ibm.com>
Author: Josh Rosen <joshrosen@databricks.com>
Closes#9439 from JoshRosen/update-snappy.
functions.scala was getting pretty long. I broke it into multiple files.
I also added explicit data types for some public vals, and renamed aggregate function pretty names to lower case, which is more consistent with the rest of the functions.
Author: Reynold Xin <rxin@databricks.com>
Closes#9471 from rxin/SPARK-11505.