Commit graph

15174 commits

Author SHA1 Message Date
Reynold Xin 276c2d51a3 [SPARK-13890][SQL] Remove some internal classes' dependency on SQLContext
## What changes were proposed in this pull request?
In general, it is better for internal classes not to depend on the external class (in this case SQLContext), to reduce coupling between user-facing APIs and the internal implementations. This patch removes the SQLContext dependency from some internal classes such as SparkPlanner and SparkOptimizer.

As part of this patch, I also removed the following internal methods from SQLContext:
```
protected[sql] def functionRegistry: FunctionRegistry
protected[sql] def optimizer: Optimizer
protected[sql] def sqlParser: ParserInterface
protected[sql] def planner: SparkPlanner
protected[sql] def continuousQueryManager
protected[sql] def prepareForExecution: RuleExecutor[SparkPlan]
```

## How was this patch tested?
Existing unit/integration tests.

Author: Reynold Xin <rxin@databricks.com>

Closes #11712 from rxin/sqlContext-planner.
2016-03-14 23:58:57 -07:00
Dongjoon Hyun a51f877b5d [SPARK-13870][SQL] Add scalastyle escaping correctly in CSVSuite.scala
## What changes were proposed in this pull request?

When `CSVSuite.scala` was initially created in SPARK-12833, there was a typo in a `scalastyle:on` comment: it was written as `scalstyle:on`. As a result, ScalaStyle checking was mistakenly turned off for the rest of the file, so it could not catch a violation in the recently added `SPARK-12668` code. This patch fixes the existing escaping and adds a new escaping around the `SPARK-12668` code, like the following.

```scala
   test("test aliases sep and encoding for delimiter and charset") {
+    // scalastyle:off
     val cars = sqlContext
...
       .load(testFile(carsFile8859))
+    // scalastyle:on
```
This will also prevent similar problems in the future.

## How was this patch tested?

Pass the Jenkins test.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #11700 from dongjoon-hyun/SPARK-13870.
2016-03-14 23:23:05 -07:00
Shixiong Zhu 43304b1758 [SPARK-13888][DOC] Remove Akka Receiver doc and refer to the DStream Akka project
## What changes were proposed in this pull request?

I have copied the docs of Streaming Akka to https://github.com/spark-packages/dstream-akka/blob/master/README.md

So we can remove them from Spark now.

## How was this patch tested?

Only document changes.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #11711 from zsxwing/remove-akka-doc.
2016-03-14 23:21:30 -07:00
Reynold Xin e64958001c [SPARK-13884][SQL] Remove DescribeCommand's dependency on LogicalPlan
## What changes were proposed in this pull request?
This patch removes DescribeCommand's dependency on LogicalPlan. After this patch, DescribeCommand simply accepts a TableIdentifier. It minimizes the dependency and unblocks my next patch (which removes the SQLContext dependency from SparkPlanner).

## How was this patch tested?
Should be covered by existing unit tests and Hive compatibility tests that run describe table.

Author: Reynold Xin <rxin@databricks.com>

Closes #11710 from rxin/SPARK-13884.
2016-03-14 23:09:10 -07:00
Davies Liu f72743d971 [SPARK-13353][SQL] fast serialization for collecting DataFrame/Dataset
## What changes were proposed in this pull request?

When we call DataFrame/Dataset.collect(), the Java serializer (or Kryo serializer) is used to serialize the UnsafeRows in the executors and then deserialize them into UnsafeRows in the driver. The Java serializer (and Kryo serializer) is slow on millions of rows, because it tries to detect identical rows, but usually there are none.

This PR serializes the UnsafeRows as a byte array by packing them together; the Java serializer (or Kryo serializer) then serializes the bytes very quickly (there are fewer objects, and byte arrays are not compared by content).

The UnsafeRow format is highly compressible, so the serialized bytes are also compressed (configurable via spark.io.compression.codec).
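
For illustration, here is a minimal, hypothetical sketch of the length-prefixed packing idea using plain byte arrays (it is not the actual UnsafeRow packing code in this patch): rows are written back-to-back into a single buffer, so the generic serializer only sees one large byte array.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}

object RowPacking {
  // Pack rows (each already a byte array) into one length-prefixed buffer.
  def pack(rows: Seq[Array[Byte]]): Array[Byte] = {
    val bos = new ByteArrayOutputStream()
    val out = new DataOutputStream(bos)
    rows.foreach { row =>
      out.writeInt(row.length)   // length prefix
      out.write(row)             // raw row bytes
    }
    out.writeInt(-1)             // end marker
    out.flush()
    bos.toByteArray
  }

  // Unpack on the driver side, reading rows until the end marker.
  def unpack(bytes: Array[Byte]): Seq[Array[Byte]] = {
    val in = new DataInputStream(new ByteArrayInputStream(bytes))
    Iterator.continually(in.readInt()).takeWhile(_ >= 0).map { len =>
      val row = new Array[Byte](len)
      in.readFully(row)
      row
    }.toSeq
  }
}
```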

## How was this patch tested?

Existing unit tests.

Added a benchmark for collect. Before this patch:
```
Intel(R) Core(TM) i7-4558U CPU  2.80GHz
collect:                        Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------
collect 1 million                      3991 / 4311          0.3        3805.7       1.0X
collect 2 millions                  10083 / 10637          0.1        9616.0       0.4X
collect 4 millions                  29551 / 30072          0.0       28182.3       0.1X
```

After this patch:

```
Intel(R) Core(TM) i7-4558U CPU  2.80GHz
collect:                        Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------
collect 1 million                        775 / 1170          1.4         738.9       1.0X
collect 2 millions                     1153 / 1758          0.9        1099.3       0.7X
collect 4 millions                     4451 / 5124          0.2        4244.9       0.2X
```

We can see about a 5-7X speedup.

Author: Davies Liu <davies@databricks.com>

Closes #11664 from davies/serialize_row.
2016-03-14 22:32:22 -07:00
Davies Liu 9256840cb6 [SPARK-13661][SQL] avoid the copy in HashedRelation
## What changes were proposed in this pull request?

Avoid the copy in HashedRelation, since most HashedRelations are built from an Array[Row]; a copy() is added for LeftSemiJoinHash instead. This helps reduce the memory consumption of broadcast joins.

## How was this patch tested?

Existing tests.

Author: Davies Liu <davies@databricks.com>

Closes #11666 from davies/remove_copy.
2016-03-14 22:25:57 -07:00
Reynold Xin e76679a814 [SPARK-13880][SPARK-13881][SQL] Rename DataFrame.scala Dataset.scala, and remove LegacyFunctions
## What changes were proposed in this pull request?
1. Rename DataFrame.scala to Dataset.scala, since the class is now named Dataset.
2. Remove LegacyFunctions. It was introduced in Spark 1.6 for backward compatibility, and can be removed in Spark 2.0.

## How was this patch tested?
Should be covered by existing unit/integration tests.

Author: Reynold Xin <rxin@databricks.com>

Closes #11704 from rxin/SPARK-13880.
2016-03-15 10:39:07 +08:00
Shixiong Zhu b5e3bd87f5 [SPARK-13791][SQL] Add MetadataLog and HDFSMetadataLog
## What changes were proposed in this pull request?

- Add a MetadataLog interface for reliably storing metadata (a minimal sketch of such an interface is shown below).
- Add HDFSMetadataLog as a MetadataLog implementation based on HDFS.
- Update FileStreamSource to use HDFSMetadataLog instead of managing metadata by itself.
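
As a rough illustration only (the actual trait added by this patch may differ in names and signatures), a batch-id-keyed metadata log interface could look like this:

```scala
// A hedged sketch of a batch-keyed metadata log interface; method names and
// signatures here are illustrative, not the PR's exact API.
trait MetadataLog[T] {
  /** Store metadata for a batch; returns false if the batch was already written. */
  def add(batchId: Long, metadata: T): Boolean

  /** Retrieve the metadata for a given batch, if it exists. */
  def get(batchId: Long): Option[T]

  /** Return all batches with ids in [startId, endId], in order. */
  def get(startId: Option[Long], endId: Option[Long]): Array[(Long, T)]

  /** The most recently committed batch, if any. */
  def getLatest(): Option[(Long, T)]
}
```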

## How was this patch tested?

unit tests

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #11625 from zsxwing/metadata-log.
2016-03-14 19:28:13 -07:00
Reynold Xin 8e0b030606 [SPARK-10380][SQL] Fix confusing documentation examples for astype/drop_duplicates.
## What changes were proposed in this pull request?
We have seen users getting confused by the documentation for astype and drop_duplicates, because the examples in them do not use these functions (but instead use their aliases). This patch simply removes all examples for these functions and states that they are aliases.

## How was this patch tested?
Existing PySpark unit tests.

Closes #11543.

Author: Reynold Xin <rxin@databricks.com>

Closes #11698 from rxin/SPARK-10380.
2016-03-14 19:25:49 -07:00
Reynold Xin 4bf4609795 [SPARK-13882][SQL] Remove org.apache.spark.sql.execution.local
## What changes were proposed in this pull request?
We introduced some local operators in org.apache.spark.sql.execution.local package but never fully wired the engine to actually use these. We still plan to implement a full local mode, but it's probably going to be fairly different from what the current iterator-based local mode would look like. Based on what we know right now, we might want a push-based columnar version of these operators.

Let's just remove them for now; we can always re-introduce them in the future by looking at branch-1.6.

## How was this patch tested?
This is simply dead code removal.

Author: Reynold Xin <rxin@databricks.com>

Closes #11705 from rxin/SPARK-13882.
2016-03-14 19:22:11 -07:00
Michael Armbrust 17eec0a71b [SPARK-13664][SQL] Add a strategy for planning partitioned and bucketed scans of files
This PR adds a new strategy, `FileSourceStrategy`, that can be used for planning scans of collections of files that might be partitioned or bucketed.

Compared with the existing planning logic in `DataSourceStrategy` this version has the following desirable properties:
 - It removes the need to have `RDD`, `broadcastedHadoopConf` and other distributed concerns  in the public API of `org.apache.spark.sql.sources.FileFormat`
 - Partition column appending is delegated to the format to avoid an extra copy / devectorization when appending partition columns
 - It minimizes the amount of data that is shipped to each executor (i.e. it does not send the whole list of files to every worker in the form of a hadoop conf)
 - It natively supports bucketing files into partitions, and thus does not require coalescing / creating a `UnionRDD` with the correct partitioning.
 - Small files are automatically coalesced into fewer tasks using an approximate bin-packing algorithm (see the sketch below).
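
The following is a hedged sketch of the greedy bin-packing idea mentioned above; the `FileInfo` type and the ordering/splitting details are illustrative, not the strategy's actual implementation.

```scala
// Hypothetical file descriptor; the real strategy works with partitioned file metadata.
case class FileInfo(path: String, sizeInBytes: Long)

object BinPackFiles {
  /** Greedily group files into tasks whose total size is at most maxBytesPerTask
   *  (an oversized file gets a task of its own). */
  def pack(files: Seq[FileInfo], maxBytesPerTask: Long): Seq[Seq[FileInfo]] = {
    val tasks = scala.collection.mutable.ArrayBuffer.empty[Seq[FileInfo]]
    var current = scala.collection.mutable.ArrayBuffer.empty[FileInfo]
    var currentSize = 0L

    // Largest files first tends to produce a more even packing.
    files.sortBy(-_.sizeInBytes).foreach { f =>
      if (current.nonEmpty && currentSize + f.sizeInBytes > maxBytesPerTask) {
        tasks += current.toSeq            // close the current task
        current = scala.collection.mutable.ArrayBuffer.empty[FileInfo]
        currentSize = 0L
      }
      current += f
      currentSize += f.sizeInBytes
    }
    if (current.nonEmpty) tasks += current.toSeq
    tasks.toSeq
  }
}
```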

Currently only a testing source is planned / tested using this strategy.  In follow-up PRs we will port the existing formats to this API.

A stub for `FileScanRDD` is also added, but most methods remain unimplemented.

Other minor cleanups:
 - partition pruning is pushed into `FileCatalog` so both the new and old code paths can use this logic.  This will also allow future implementations to use indexes or other tricks (i.e. a MySQL metastore)
 - The partitions from the `FileCatalog` now propagate information about file sizes all the way up to the planner so we can intelligently spread files out.
 - `Array` -> `Seq` in some internal APIs to avoid unnecessary `toArray` calls
 - Rename `Partition` to `PartitionDirectory` to differentiate partitions used earlier in pruning from those where we have already enumerated the files and their sizes.

Author: Michael Armbrust <michael@databricks.com>

Closes #11646 from marmbrus/fileStrategy.
2016-03-14 19:21:12 -07:00
Ehsan M.Kermani 992142b87e [SPARK-11826][MLLIB] Refactor add() and subtract() methods
srowen Could you please check this when you have time?

Author: Ehsan M.Kermani <ehsanmo1367@gmail.com>

Closes #9916 from ehsanmok/JIRA-11826.
2016-03-14 19:17:09 -07:00
Shixiong Zhu 06dec37455 [SPARK-13843][STREAMING] Remove streaming-flume, streaming-mqtt, streaming-zeromq, streaming-akka, streaming-twitter to Spark packages
## What changes were proposed in this pull request?

Currently there are a few sub-projects, each for integrating with different external sources for Streaming.  Now that we have better ability to include external libraries (spark packages) and with Spark 2.0 coming up, we can move the following projects out of Spark to https://github.com/spark-packages

- streaming-flume
- streaming-akka
- streaming-mqtt
- streaming-zeromq
- streaming-twitter

They are just ancillary packages, and considering the overhead of maintenance, running tests, and PR failures, it's better to maintain them outside of Spark. In addition, these projects can have their own release cycles, and we can release them faster.

I have already copied these projects to https://github.com/spark-packages

## How was this patch tested?

Jenkins tests

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #11672 from zsxwing/remove-external-pkg.
2016-03-14 16:56:04 -07:00
Marcelo Vanzin 8301fadd8d [SPARK-13626][CORE] Avoid duplicate config deprecation warnings.
Three different things were needed to get rid of spurious warnings:
- silence deprecation warnings when cloning configuration
- change the way SparkHadoopUtil instantiates SparkConf to silence
  warnings
- avoid creating new SparkConf instances where it's not needed.

On top of that, I changed the way that Logging.scala detects the repl;
now it uses a method that is overridden in the repl's Main class, and
the hack in Utils.scala is not needed anymore. This makes the 2.11 repl
behave like the 2.10 one and set the default log level to WARN, which
is a lot better. Previously, this wasn't working because the 2.11 repl
triggers log initialization earlier than the 2.10 one.

I also removed and simplified some other code in the 2.11 repl's Main
to avoid replicating logic that already exists elsewhere in Spark.

Tested the 2.11 repl in local and yarn modes.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #11510 from vanzin/SPARK-13626.
2016-03-14 14:27:33 -07:00
Josh Rosen 38529d8f23 [SPARK-10907][SPARK-6157] Remove pendingUnrollMemory from MemoryStore
This patch refactors the MemoryStore to remove the concept of `pendingUnrollMemory`. It also fixes SPARK-6157: "Unrolling with MEMORY_AND_DISK should always release memory".

Key changes:

- Inline `MemoryStore.tryToPut` at its three call sites in the `MemoryStore`.
- Inline `Memory.unrollSafely` at its only call site (in `MemoryStore.putIterator`).
- Inline `MemoryManager.acquireStorageMemory` at its call sites.
- Simplify the code as a result of this inlining (some parameters have fixed values after inlining, so lots of branches can be removed).
- Remove the `pendingUnrollMemory` map by returning the amount of unrollMemory allocated when returning an iterator after a failed `putIterator` call.
- Change `putIterator` to return an instance of `PartiallyUnrolledIterator`, a special iterator subclass which will automatically free the unroll memory of its partially-unrolled elements when the iterator is consumed. To handle cases where the iterator is not consumed (e.g. when a MEMORY_ONLY put fails), `PartiallyUnrolledIterator` exposes a `close()` method which may be called to discard the unrolled values and free their memory (a simplified sketch follows).
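
Below is a simplified sketch of such an iterator (illustrative only, not the actual MemoryStore code): it frees the unroll memory once the partially-unrolled values have been consumed, or when `close()` is called.

```scala
// A simplified sketch of an iterator that releases its unroll memory once it is
// either fully consumed or explicitly closed.
class PartiallyUnrolledIterator[T](
    unrolled: Iterator[T],          // values already unrolled in memory
    rest: Iterator[T],              // remaining values from the original iterator
    freeUnrollMemory: () => Unit    // callback that releases the unroll memory
  ) extends Iterator[T] {

  private var unrollMemoryFreed = false

  private def maybeFree(): Unit = {
    if (!unrollMemoryFreed && !unrolled.hasNext) {
      freeUnrollMemory()
      unrollMemoryFreed = true
    }
  }

  override def hasNext: Boolean = {
    maybeFree()
    unrolled.hasNext || rest.hasNext
  }

  override def next(): T = {
    val value = if (unrolled.hasNext) unrolled.next() else rest.next()
    maybeFree()
    value
  }

  /** Discard the unrolled values and free their memory without consuming them. */
  def close(): Unit = {
    if (!unrollMemoryFreed) {
      freeUnrollMemory()
      unrollMemoryFreed = true
    }
  }
}
```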

Author: Josh Rosen <joshrosen@databricks.com>

Closes #11613 from JoshRosen/cleanup-unroll-memory.
2016-03-14 14:26:39 -07:00
Dongjoon Hyun a48296f4fe [SPARK-13686][MLLIB][STREAMING] Add a constructor parameter regParam to (Streaming)LinearRegressionWithSGD
## What changes were proposed in this pull request?

`LinearRegressionWithSGD` and `StreamingLinearRegressionWithSGD` do not have `regParam` among their constructor arguments. They just depend on GradientDescent's default regParam value.
To be consistent with other algorithms, we should add it. The same default value is used.

## How was this patch tested?

Pass the existing unit test.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #11527 from dongjoon-hyun/SPARK-13686.
2016-03-14 12:46:53 -07:00
Thomas Graves 23385e853e [SPARK-13054] Always post TaskEnd event for tasks
I am using dynamic container allocation and speculation and am seeing issues with the active-task accounting. The Executor UI still shows active tasks on an executor, but the job/stage has completed. I think it is also preventing dynamic allocation from releasing containers, because it thinks there are still tasks.
There are multiple issues with this:
- If the task end for a task (in this case probably because of speculation) comes in after the stage is finished, DAGScheduler.handleTaskCompletion will skip the task completion event

Author: Thomas Graves <tgraves@prevailsail.corp.gq1.yahoo.com>
Author: Thomas Graves <tgraves@staydecay.corp.gq1.yahoo.com>
Author: Tom Graves <tgraves@yahoo-inc.com>

Closes #10951 from tgravescs/SPARK-11701.
2016-03-14 12:31:46 -07:00
Bjorn Jonsson e06493cb7b [MINOR][COMMON] Fix copy-paste oversight in variable naming
## What changes were proposed in this pull request?

JavaUtils.java has methods to convert time and byte strings for internal use. This change renames a variable used in byteStringAs() from timeError to byteError.

Author: Bjorn Jonsson <bjornjon@gmail.com>

Closes #11695 from bjornjon/master.
2016-03-14 12:27:49 -07:00
Daniel Santana 9f13f0fc17 [MINOR][DOCS] Added Missing back slashes
## What changes were proposed in this pull request?

When studying Spark, many users just copy examples from the documentation and paste them into their terminals, and because of the missing backslashes they run into shell errors.

The added backslashes avoid that problem for those users.

## How was this patch tested?

I generated the documentation locally using Jekyll and checked the generated pages.

Author: Daniel Santana <mestresan@gmail.com>

Closes #11699 from danielsan/master.
2016-03-14 12:26:08 -07:00
Bertrand Bossy 310981d49a [SPARK-12583][MESOS] Mesos shuffle service: Don't delete shuffle files before application has stopped
## Problem description:

The Mesos shuffle service has been completely unusable since Spark 1.6.0. The problem seems to have been introduced by the move from Akka to Netty in the networking layer. Until now, a connection from the driver to each shuffle service was used as a signal for the shuffle service to determine whether the driver was still running. Since 1.6.0, this connection is closed after spark.shuffle.io.connectionTimeout (or spark.network.timeout if the former is not set) because it is idle. The shuffle service interprets this as a signal that the driver has stopped, even though the driver is still alive. Thus, shuffle files are deleted before the application has stopped.

### Context and analysis:

spark shuffle fails with mesos after 2mins: https://issues.apache.org/jira/browse/SPARK-12583
External shuffle service broken w/ Mesos: https://issues.apache.org/jira/browse/SPARK-13159

This is a follow up on #11207 .

## What changes were proposed in this pull request?

This PR adds a heartbeat signal from the Driver (in MesosExternalShuffleClient) to all registered external mesos shuffle service instances. In MesosExternalShuffleBlockHandler, a thread periodically checks whether a driver has timed out and cleans an application's shuffle files if this is the case.
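
A hedged sketch of the timeout-checking idea follows; the class and method names are illustrative, not the actual MesosExternalShuffleBlockHandler code.

```scala
// Every driver's last heartbeat is tracked; apps whose heartbeat is older than
// the timeout have their shuffle files cleaned up by a periodic background task.
import java.util.concurrent.{ConcurrentHashMap, Executors, TimeUnit}

class HeartbeatMonitor(timeoutMs: Long, cleanup: String => Unit) {
  private val lastHeartbeat = new ConcurrentHashMap[String, java.lang.Long]()
  private val scheduler = Executors.newSingleThreadScheduledExecutor()

  def onHeartbeat(appId: String): Unit =
    lastHeartbeat.put(appId, System.currentTimeMillis())

  def start(): Unit = {
    val task = new Runnable {
      override def run(): Unit = {
        val now = System.currentTimeMillis()
        val it = lastHeartbeat.entrySet().iterator()
        while (it.hasNext) {
          val entry = it.next()
          if (now - entry.getValue > timeoutMs) {
            cleanup(entry.getKey)   // remove the timed-out app's shuffle files
            it.remove()
          }
        }
      }
    }
    scheduler.scheduleAtFixedRate(task, timeoutMs, timeoutMs, TimeUnit.MILLISECONDS)
  }
}
```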

## How was this patch tested?

This patch has been tested on a small mesos test cluster using the spark-shell. Log output from mesos shuffle service:
```
16/02/19 15:13:45 INFO mesos.MesosExternalShuffleBlockHandler: Received registration request from app 294def07-3249-4e0f-8d71-bf8c83c58a50-0018 (remote address /xxx.xxx.xxx.xxx:52391, heartbeat timeout 120000 ms).
16/02/19 15:13:47 INFO shuffle.ExternalShuffleBlockResolver: Registered executor AppExecId{appId=294def07-3249-4e0f-8d71-bf8c83c58a50-0018, execId=3} with ExecutorShuffleInfo{localDirs=[/foo/blockmgr-c84c0697-a3f9-4f61-9c64-4d3ee227c047], subDirsPerLocalDir=64, shuffleManager=sort}
16/02/19 15:13:47 INFO shuffle.ExternalShuffleBlockResolver: Registered executor AppExecId{appId=294def07-3249-4e0f-8d71-bf8c83c58a50-0018, execId=7} with ExecutorShuffleInfo{localDirs=[/foo/blockmgr-bf46497a-de80-47b9-88f9-563123b59e03], subDirsPerLocalDir=64, shuffleManager=sort}
16/02/19 15:16:02 INFO mesos.MesosExternalShuffleBlockHandler: Application 294def07-3249-4e0f-8d71-bf8c83c58a50-0018 timed out. Removing shuffle files.
16/02/19 15:16:02 INFO shuffle.ExternalShuffleBlockResolver: Application 294def07-3249-4e0f-8d71-bf8c83c58a50-0018 removed, cleanupLocalDirs = true
16/02/19 15:16:02 INFO shuffle.ExternalShuffleBlockResolver: Cleaning up executor AppExecId{appId=294def07-3249-4e0f-8d71-bf8c83c58a50-0018, execId=3}'s 1 local dirs
16/02/19 15:16:02 INFO shuffle.ExternalShuffleBlockResolver: Cleaning up executor AppExecId{appId=294def07-3249-4e0f-8d71-bf8c83c58a50-0018, execId=7}'s 1 local dirs
```
Note: there are 2 executors running on this slave.

Author: Bertrand Bossy <bertrand.bossy@teralytics.net>

Closes #11272 from bbossy/SPARK-12583-mesos-shuffle-service-heartbeat.
2016-03-14 12:22:57 -07:00
Josh Rosen 07cb323e7a [SPARK-13848][SPARK-5185] Update to Py4J 0.9.2 in order to fix classloading issue
This patch upgrades Py4J from 0.9.1 to 0.9.2 in order to include a patch which modifies Py4J to use the current thread's ContextClassLoader when performing reflection / class loading. This is necessary in order to fix [SPARK-5185](https://issues.apache.org/jira/browse/SPARK-5185), a longstanding issue affecting the use of `--jars` and `--packages` in PySpark.

In order to demonstrate that the fix works, I removed the workarounds which were added as part of [SPARK-6027](https://issues.apache.org/jira/browse/SPARK-6027) / #4779 and other patches.

Py4J diff: https://github.com/bartdag/py4j/compare/0.9.1...0.9.2

/cc zsxwing tdas davies brkyvz

Author: Josh Rosen <joshrosen@databricks.com>

Closes #11687 from JoshRosen/py4j-0.9.2.
2016-03-14 12:22:02 -07:00
Liang-Chi Hsieh 6a4bfcd62b [SPARK-13658][SQL] BooleanSimplification rule is slow with large boolean expressions
JIRA: https://issues.apache.org/jira/browse/SPARK-13658

## What changes were proposed in this pull request?

Quoted from the JIRA description: when running TPCDS Q3 [1] with lots of predicates to filter out the partitions, the optimizer rule BooleanSimplification takes about 2 seconds (it makes many semanticEquals calls, which require copying the whole tree).

It would be great if we could speed it up.

[1] https://github.com/cloudera/impala-tpcds-kit/blob/master/queries/q3.sql

How to speed it up:

When we ask for the canonicalized expression in `Expression`, it calls `Canonicalize.execute` on itself. `Canonicalize.execute` basically transforms up all expressions included in this expression. However, we don't keep the canonicalized versions of these child expressions. So the next time we ask for the canonicalized versions of the child expressions (e.g., in `BooleanSimplification`), we rerun `Canonicalize.execute` on each of them, which wastes a lot of time.

By forcing the child expressions to compute and keep their canonicalized versions first, we can avoid re-canonicalizing these expressions.
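
A minimal sketch of the memoization idea (not Spark's actual Expression class): each node caches its canonical form once, built from its children's cached canonical forms, so repeated semantic comparisons do not re-canonicalize the whole subtree.

```scala
// Illustrative trait only; names and structure are assumptions, not Spark's API.
trait Expr {
  def children: Seq[Expr]
  protected def canonicalizeSelf(canonicalChildren: Seq[Expr]): Expr

  // Computed once and cached; children canonicalize (and cache) themselves first.
  lazy val canonicalized: Expr = canonicalizeSelf(children.map(_.canonicalized))

  // Semantic comparison reuses the cached canonical forms.
  def semanticEquals(other: Expr): Boolean = canonicalized == other.canonicalized
}
```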

I benchmarked it with an expression that is part of the WHERE clause in TPCDS Q3:

    val testRelation = LocalRelation('ss_sold_date_sk.int, 'd_moy.int, 'i_manufact_id.int, 'ss_item_sk.string, 'i_item_sk.string, 'd_date_sk.int)

    val input = ('d_date_sk === 'ss_sold_date_sk) && ('ss_item_sk === 'i_item_sk) && ('i_manufact_id === 436) && ('d_moy === 12) && (('ss_sold_date_sk > 2415355 && 'ss_sold_date_sk < 2415385) || ('ss_sold_date_sk > 2415720 && 'ss_sold_date_sk < 2415750) || ('ss_sold_date_sk > 2416085 && 'ss_sold_date_sk < 2416115) || ('ss_sold_date_sk > 2416450 && 'ss_sold_date_sk < 2416480) || ('ss_sold_date_sk > 2416816 && 'ss_sold_date_sk < 2416846) || ('ss_sold_date_sk > 2417181 && 'ss_sold_date_sk < 2417211) || ('ss_sold_date_sk > 2417546 && 'ss_sold_date_sk < 2417576) || ('ss_sold_date_sk > 2417911 && 'ss_sold_date_sk < 2417941) || ('ss_sold_date_sk > 2418277 && 'ss_sold_date_sk < 2418307) || ('ss_sold_date_sk > 2418642 && 'ss_sold_date_sk < 2418672) || ('ss_sold_date_sk > 2419007 && 'ss_sold_date_sk < 2419037) || ('ss_sold_date_sk > 2419372 && 'ss_sold_date_sk < 2419402) || ('ss_sold_date_sk > 2419738 && 'ss_sold_date_sk < 2419768) || ('ss_sold_date_sk > 2420103 && 'ss_sold_date_sk < 2420133) || ('ss_sold_date_sk > 2420468 && 'ss_sold_date_sk < 2420498) || ('ss_sold_date_sk > 2420833 && 'ss_sold_date_sk < 2420863) || ('ss_sold_date_sk > 2421199 && 'ss_sold_date_sk < 2421229) || ('ss_sold_date_sk > 2421564 && 'ss_sold_date_sk < 2421594) || ('ss_sold_date_sk > 2421929 && 'ss_sold_date_sk < 2421959) || ('ss_sold_date_sk > 2422294 && 'ss_sold_date_sk < 2422324) || ('ss_sold_date_sk > 2422660 && 'ss_sold_date_sk < 2422690) || ('ss_sold_date_sk > 2423025 && 'ss_sold_date_sk < 2423055) || ('ss_sold_date_sk > 2423390 && 'ss_sold_date_sk < 2423420) || ('ss_sold_date_sk > 2423755 && 'ss_sold_date_sk < 2423785) || ('ss_sold_date_sk > 2424121 && 'ss_sold_date_sk < 2424151) || ('ss_sold_date_sk > 2424486 && 'ss_sold_date_sk < 2424516) || ('ss_sold_date_sk > 2424851 && 'ss_sold_date_sk < 2424881) || ('ss_sold_date_sk > 2425216 && 'ss_sold_date_sk < 2425246) || ('ss_sold_date_sk > 2425582 && 'ss_sold_date_sk < 2425612) || ('ss_sold_date_sk > 2425947 && 'ss_sold_date_sk < 2425977) || ('ss_sold_date_sk > 2426312 && 'ss_sold_date_sk < 2426342) || ('ss_sold_date_sk > 2426677 && 'ss_sold_date_sk < 2426707) || ('ss_sold_date_sk > 2427043 && 'ss_sold_date_sk < 2427073) || ('ss_sold_date_sk > 2427408 && 'ss_sold_date_sk < 2427438) || ('ss_sold_date_sk > 2427773 && 'ss_sold_date_sk < 2427803) || ('ss_sold_date_sk > 2428138 && 'ss_sold_date_sk < 2428168) || ('ss_sold_date_sk > 2428504 && 'ss_sold_date_sk < 2428534) || ('ss_sold_date_sk > 2428869 && 'ss_sold_date_sk < 2428899) || ('ss_sold_date_sk > 2429234 && 'ss_sold_date_sk < 2429264) || ('ss_sold_date_sk > 2429599 && 'ss_sold_date_sk < 2429629) || ('ss_sold_date_sk > 2429965 && 'ss_sold_date_sk < 2429995) || ('ss_sold_date_sk > 2430330 && 'ss_sold_date_sk < 2430360) || ('ss_sold_date_sk > 2430695 && 'ss_sold_date_sk < 2430725) || ('ss_sold_date_sk > 2431060 && 'ss_sold_date_sk < 2431090) || ('ss_sold_date_sk > 2431426 && 'ss_sold_date_sk < 2431456) || ('ss_sold_date_sk > 2431791 && 'ss_sold_date_sk < 2431821) || ('ss_sold_date_sk > 2432156 && 'ss_sold_date_sk < 2432186) || ('ss_sold_date_sk > 2432521 && 'ss_sold_date_sk < 2432551) || ('ss_sold_date_sk > 2432887 && 'ss_sold_date_sk < 2432917) || ('ss_sold_date_sk > 2433252 && 'ss_sold_date_sk < 2433282) || ('ss_sold_date_sk > 2433617 && 'ss_sold_date_sk < 2433647) || ('ss_sold_date_sk > 2433982 && 'ss_sold_date_sk < 2434012) || ('ss_sold_date_sk > 2434348 && 'ss_sold_date_sk < 2434378) || ('ss_sold_date_sk > 2434713 && 'ss_sold_date_sk < 2434743)))

    val plan = testRelation.where(input).analyze
    val actual = Optimize.execute(plan)

With this patch:

    352 milliseconds
    346 milliseconds
    340 milliseconds

Without this patch:

    585 milliseconds
    880 milliseconds
    677 milliseconds

## How was this patch tested?

Existing tests should pass.

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #11647 from viirya/improve-expr-canonicalize.
2016-03-14 11:23:29 -07:00
Ryan Blue 63f642aea3 [SPARK-13779][YARN] Avoid cancelling non-local container requests.
To maximize locality, the YarnAllocator would cancel any requests with a
stale locality preference or no locality preference. This assumed that
the majority of tasks had locality preferences, but that may not be the case
when scanning S3. This caused container requests for S3 tasks to be
constantly cancelled and resubmitted.

This changes the allocator's logic to cancel only stale requests and
just enough requests without locality preferences to submit requests
with locality preferences. This avoids cancelling requests without
locality preferences that would be resubmitted without locality
preferences.

We've deployed this patch on our clusters and verified that jobs that couldn't get executors because they kept canceling and resubmitting requests are fixed. Large jobs are running fine.

Author: Ryan Blue <blue@apache.org>

Closes #11612 from rdblue/SPARK-13779-fix-yarn-allocator-requests.
2016-03-14 11:18:37 -07:00
Marcelo Vanzin 45f8053be5 [SPARK-13578][CORE] Modify launch scripts to not use assemblies.
Instead of looking for a specially-named assembly, the scripts now will
blindly add all jars under the libs directory to the classpath. This
libs directory is still currently the old assembly dir, so things should
keep working the same way as before until we make more packaging changes.

The only lost feature is the detection of multiple assemblies; I consider
that a minor nicety that only really affects few developers, so it's probably
ok.

Tested locally by running spark-shell; also did some minor Win32 testing
(just made sure spark-shell started).

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #11591 from vanzin/SPARK-13578.
2016-03-14 11:13:26 -07:00
Josh Rosen 9a87afd7d1 [SPARK-13833] Guard against race condition when re-caching disk blocks in memory
When reading data from the DiskStore and attempting to cache it back into the memory store, we should guard against race conditions where multiple readers are attempting to re-cache the same block in memory.

This patch accomplishes this by synchronizing on the block's `BlockInfo` object while trying to re-cache a block.
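
A minimal sketch of the guarding idea, using a hypothetical store rather than Spark's actual BlockManager/MemoryStore API: readers synchronize on the block's info object, so only one of them re-populates the in-memory cache for a given block.

```scala
import java.util.concurrent.ConcurrentHashMap

final class BlockInfo  // illustrative per-block lock object

class SimpleBlockStore {
  private val infos = new ConcurrentHashMap[String, BlockInfo]()
  private val memoryCache = new ConcurrentHashMap[String, Array[Byte]]()

  def readAndMaybeRecache(blockId: String, readFromDisk: () => Array[Byte]): Array[Byte] = {
    infos.putIfAbsent(blockId, new BlockInfo)
    val info = infos.get(blockId)
    info.synchronized {
      val cached = memoryCache.get(blockId)
      if (cached != null) {
        cached
      } else {
        val bytes = readFromDisk()        // read from disk once
        memoryCache.put(blockId, bytes)   // re-cache into memory under the lock
        bytes
      }
    }
  }
}
```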

(Will file JIRA as soon as ASF JIRA stops being down / laggy).

Author: Josh Rosen <joshrosen@databricks.com>

Closes #11660 from JoshRosen/concurrent-recaching-fixes.
2016-03-14 10:48:24 -07:00
Andrew Or 9a1680c2c8 [SPARK-13139][SQL] Follow-ups to #11573
Addressing outstanding comments in #11573.

Jenkins, plus a new test case in `DDLCommandSuite`.

Author: Andrew Or <andrew@databricks.com>

Closes #11667 from andrewor14/ddl-parser-followups.
2016-03-14 09:59:22 -07:00
Yin Huai 250832c733 [SPARK-13207][SQL] Make partitioning discovery ignore _SUCCESS files.
If a _SUCCESS file appears in an inner partitioning dir, partition discovery will treat it as a data file. Then partition discovery will fail because it finds that the dir structure is not valid. We should ignore those `_SUCCESS` files.

In the future, it would be better to ignore all files/dirs starting with `_` or `.`. This PR does not make that change. I am keeping this change simple, so we can consider getting it into branch 1.6.

To ignore all files/dirs starting with `_` or `.`, the main change would be to let ParquetRelation have another way to get metadata files. Right now, it relies on FileStatusCache's cachedLeafStatuses, which returns file statuses of both metadata files (e.g. metadata files used by Parquet) and data files, so that requires more changes.

https://issues.apache.org/jira/browse/SPARK-13207

Author: Yin Huai <yhuai@databricks.com>

Closes #11088 from yhuai/SPARK-13207.
2016-03-14 09:03:13 -07:00
Wilson Wu 31d069d4c2 [SPARK-13746][TESTS] stop using deprecated SynchronizedSet
The trait SynchronizedSet in package mutable is deprecated (one possible replacement is sketched below).
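
One common replacement is a concurrent Java set wrapped as a Scala mutable set; this is a hedged sketch of that option, not necessarily the approach taken in this PR.

```scala
import java.util.concurrent.ConcurrentHashMap
import scala.collection.JavaConverters._

object SyncSetExample {
  // A thread-safe mutable Set backed by ConcurrentHashMap, instead of mixing in
  // the deprecated mutable.SynchronizedSet trait.
  val activeIds: scala.collection.mutable.Set[String] =
    ConcurrentHashMap.newKeySet[String]().asScala

  def main(args: Array[String]): Unit = {
    activeIds += "task-1"                 // thread-safe add
    assert(activeIds.contains("task-1"))
    activeIds -= "task-1"                 // thread-safe remove
  }
}
```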

Author: Wilson Wu <wilson888888888@gmail.com>

Closes #11580 from wilson888888888/spark-synchronizedset.
2016-03-14 09:13:29 +00:00
Dongjoon Hyun acdf219703 [MINOR][DOCS] Fix more typos in comments/strings.
## What changes were proposed in this pull request?

This PR fixes 135 typos over 107 files:
* 121 typos in comments
* 11 typos in test case names
* 3 typos in log messages

## How was this patch tested?

Manual.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #11689 from dongjoon-hyun/fix_more_typos.
2016-03-14 09:07:39 +00:00
Reynold Xin e58fa19d17 Closes #11668 2016-03-13 22:14:59 -07:00
Sean Owen 1840852841 [SPARK-13823][CORE][STREAMING][SQL] Always specify Charset in String <-> byte[] conversions (and remaining Coverity items)
## What changes were proposed in this pull request?

- Fixes calls to `new String(byte[])` or `String.getBytes()` that rely on the platform default encoding, to use UTF-8
- Same for `InputStreamReader` and `OutputStreamWriter` constructors
- Standardizes on UTF-8 everywhere
- Standardizes on specifying the encoding with `StandardCharsets.UTF_8`, not the Guava constant or "UTF-8" (which means handling `UnsupportedEncodingException`); see the example below
- (also addresses the other remaining Coverity scan issues, which are pretty trivial; these are separated into commit 1deecd8d9c )
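
For example, the standardized pattern looks like this (a small standalone illustration, not code taken from the PR):

```scala
import java.nio.charset.StandardCharsets

object Utf8Example {
  def main(args: Array[String]): Unit = {
    // Always pass an explicit charset instead of relying on the platform default.
    val bytes: Array[Byte] = "héllo".getBytes(StandardCharsets.UTF_8)
    val text: String = new String(bytes, StandardCharsets.UTF_8)
    assert(text == "héllo")
  }
}
```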

## How was this patch tested?

Jenkins tests

Author: Sean Owen <sowen@cloudera.com>

Closes #11657 from srowen/SPARK-13823.
2016-03-13 21:03:49 -07:00
Dongjoon Hyun 473263f959 [SPARK-13834][BUILD] Update sbt and sbt plugins for 2.x.
## What changes were proposed in this pull request?

For 2.0.0, we had better bring **sbt** and the **sbt plugins** up to date. This PR checks the status of each plugin and bumps the following.

* sbt: 0.13.9 --> 0.13.11
* sbteclipse-plugin: 2.2.0 --> 4.0.0
* sbt-dependency-graph: 0.7.4 --> 0.8.2
* sbt-mima-plugin: 0.1.6 --> 0.1.9
* sbt-revolver: 0.7.2 --> 0.8.0

All other plugins are up-to-date. (Note that `sbt-avro` seems to have changed from 0.3.2 to 1.0.1, but it's not published in the repository.)

During the upgrade, this PR also updated the following MiMa exclusion. Note that the related excluding filter was already registered correctly; the change seems to be due to a difference in the problem type that MiMa now reports.
```
 // SPARK-12896 Send only accumulator updates to driver, not TaskMetrics
 ProblemFilters.exclude[IncompatibleMethTypeProblem]("org.apache.spark.Accumulable.this"),
-ProblemFilters.exclude[IncompatibleMethTypeProblem]("org.apache.spark.Accumulator.this"),
+ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.Accumulator.this"),
```

## How was this patch tested?

Pass the Jenkins build.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #11669 from dongjoon-hyun/update_mima.
2016-03-13 18:47:04 -07:00
Jacky Li f3daa099bf [SQL] fix typo in DataSourceRegister
## What changes were proposed in this pull request?
fix typo in DataSourceRegister

## How was this patch tested?

Found when going through the latest code.

Author: Jacky Li <jacky.likun@huawei.com>

Closes #11686 from jackylk/patch-12.
2016-03-13 18:44:02 -07:00
Sun Rui c7e68c3968 [SPARK-13812][SPARKR] Fix SparkR lint-r test errors.
## What changes were proposed in this pull request?

This PR fixes all newly captured SparkR lint-r errors after the lintr package was updated from GitHub.

## How was this patch tested?

dev/lint-r
SparkR unit tests

Author: Sun Rui <rui.sun@intel.com>

Closes #11652 from sun-rui/SPARK-13812.
2016-03-13 14:30:44 -07:00
Bjorn Jonsson 515e4afbc7 [SPARK-13810][CORE] Add Port Configuration Suggestions on Bind Exceptions
## What changes were proposed in this pull request?
Currently, when a java.net.BindException is thrown, it displays the following message:

java.net.BindException: Address already in use: Service '$serviceName' failed after 16 retries!

This change adds port configuration suggestions to the BindException message. For example, for the UI it now displays:

java.net.BindException: Address already in use: Service 'SparkUI' failed after 16 retries! Consider explicitly setting the appropriate port for 'SparkUI' (for example spark.ui.port for SparkUI) to an available port or increasing spark.port.maxRetries.
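
A hedged sketch (illustrative helper, not the exact Spark code) of how such a suggestion can be appended to the exception message:

```scala
import java.net.BindException

object PortSuggestion {
  // Wrap a BindException with a message that tells the user which config key to set.
  def withSuggestion(e: BindException, serviceName: String, confKey: String, maxRetries: Int): BindException = {
    val msg = s"${e.getMessage}: Service '$serviceName' failed after $maxRetries retries! " +
      s"Consider explicitly setting the appropriate port for '$serviceName' " +
      s"(for example $confKey for $serviceName) to an available port or increasing spark.port.maxRetries."
    val wrapped = new BindException(msg)
    wrapped.setStackTrace(e.getStackTrace)  // keep the original stack trace
    wrapped
  }
}
```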

## How was this patch tested?
Manual tests

Author: Bjorn Jonsson <bjornjon@gmail.com>

Closes #11644 from bjornjon/master.
2016-03-13 10:18:24 +00:00
Dongjoon Hyun db88d0204e [MINOR][DOCS] Replace DataFrame with Dataset in Javadoc.
## What changes were proposed in this pull request?

SPARK-13817 (PR #11656) replaces `DataFrame` with `Dataset` from Java. This PR fixes the remaining broken links and sample Java code in `package-info.java`. As a result, it will update the following Javadoc.

* http://spark.apache.org/docs/latest/api/java/org/apache/spark/ml/attribute/package-summary.html
* http://spark.apache.org/docs/latest/api/java/org/apache/spark/ml/feature/package-summary.html

## How was this patch tested?

Manual.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #11675 from dongjoon-hyun/replace_dataframe_with_dataset_in_javadoc.
2016-03-13 12:11:18 +08:00
Cheng Lian c079420d7c [SPARK-13841][SQL] Removes Dataset.collectRows()/takeRows()
## What changes were proposed in this pull request?

This PR removes two methods, `collectRows()` and `takeRows()`, from `Dataset[T]`. These methods were added in PR #11443, and were later considered not useful.

## How was this patch tested?

Existing tests should do the work.

Author: Cheng Lian <lian@databricks.com>

Closes #11678 from liancheng/remove-collect-rows-and-take-rows.
2016-03-13 12:02:52 +08:00
Cheng Lian 4eace4d384 [SPARK-13828][SQL] Bring back stack trace of AnalysisException thrown from QueryExecution.assertAnalyzed
PR #11443 added an extra `plan: Option[LogicalPlan]` argument to `AnalysisException` and attached partially analyzed plan to thrown `AnalysisException` in `QueryExecution.assertAnalyzed()`.  However, the original stack trace wasn't properly inherited.  This PR fixes this issue by inheriting the stack trace.

A test case is added to verify that the first entry of `AnalysisException` stack trace isn't from `QueryExecution`.
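
A minimal sketch of the stack-trace inheritance idea (names here are illustrative, not the actual AnalysisException code):

```scala
// Re-wrap an exception with extra context while preserving where it was originally thrown.
class AnalysisError(message: String) extends Exception(message)

object StackTraceExample {
  def rethrowWithPlanInfo(original: Exception, extra: String): AnalysisError = {
    val e = new AnalysisError(s"${original.getMessage} ($extra)")
    e.setStackTrace(original.getStackTrace)  // inherit the original stack trace
    e
  }
}
```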

Author: Cheng Lian <lian@databricks.com>

Closes #11677 from liancheng/analysis-exception-stacktrace.
2016-03-12 11:25:15 -08:00
Davies Liu ba8c86d06f [SPARK-13671] [SPARK-13311] [SQL] Use different physical plans for RDD and data sources
## What changes were proposed in this pull request?

This PR splits PhysicalRDD into two classes, PhysicalRDD and PhysicalScan. PhysicalRDD is used for DataFrames that are created from existing RDDs. PhysicalScan is used for DataFrames that are created from data sources. This enables us to apply different optimizations to each of them.

Also fixes the problem with sameResult() on two DataSourceScans.

Also fixes the equality check for `In` by comparing toString. It would be better to use Seq there, but we can't break this public API (sad).

## How was this patch tested?

Existing tests. Manually tested with TPCDS queries Q59 and Q64; all those duplicated exchanges can now be re-used, and there is a 40+% performance improvement (saving half of the scan).

Author: Davies Liu <davies@databricks.com>

Closes #11514 from davies/existing_rdd.
2016-03-12 00:48:36 -08:00
Davies Liu 2ef4c5963b [SPARK-13830] prefer block manager than direct result for large result
## What changes were proposed in this pull request?

The current RPC can't handle large blocks very well; it's very slow to fetch a 100 MB block (about 1 minute). After switching to the block manager to fetch it, it took about 10 seconds (which could still be improved).

## How was this patch tested?

existing unit tests.

Author: Davies Liu <davies@databricks.com>

Closes #11659 from davies/direct_result.
2016-03-11 15:39:21 -08:00
Andrew Or 66d9d0edfe [SPARK-13139][SQL] Parse Hive DDL commands ourselves
## What changes were proposed in this pull request?

This patch is ported over from viirya's changes in #11048. Currently, for most DDLs, we just pass the query text directly to Hive. Instead, we should parse these commands ourselves and in the future (not part of this patch) use the `HiveCatalog` to process these DDLs. This is a precursor to merging `SQLContext` and `HiveContext`.

Note: As of this patch we still pass the query text to Hive. The difference is that we now parse the commands ourselves so in the future we can just use our own catalog.

## How was this patch tested?

Jenkins, plus the new `DDLCommandSuite`, which comprises about 40% of the changes here.

Author: Andrew Or <andrew@databricks.com>

Closes #11573 from andrewor14/parser-plus-plus.
2016-03-11 15:13:48 -08:00
Zheng RuiFeng 42afd72c65 [SPARK-13814] [PYSPARK] Delete unnecessary imports in python examples files
JIRA:  https://issues.apache.org/jira/browse/SPARK-13814

## What changes were proposed in this pull request?

Delete unnecessary imports in Python example files.

## How was this patch tested?

manual tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #11651 from zhengruifeng/del_import_pe.
2016-03-11 13:49:37 -08:00
Josh Rosen 073bf9d4d9 [SPARK-13807] De-duplicate Python*Helper instantiation code in PySpark streaming
This patch de-duplicates code in PySpark streaming which loads the `Python*Helper` classes. I also changed a few `raise e` statements to simply `raise` in order to preserve the full exception stacktrace when re-throwing.

Here's a link to the whitespace-change-free diff: https://github.com/apache/spark/compare/master...JoshRosen:pyspark-reflection-deduplication?w=0

Author: Josh Rosen <joshrosen@databricks.com>

Closes #11641 from JoshRosen/pyspark-reflection-deduplication.
2016-03-11 11:18:51 -08:00
Nezih Yigitbasi ff776b2fc1 [SPARK-13328][CORE] Poor read performance for broadcast variables with dynamic resource allocation
When dynamic resource allocation is enabled, fetching broadcast variables from removed executors was causing job failures, and SPARK-9591 fixed this problem by trying all locations of a block before giving up. However, the locations of a block are retrieved only once from the driver in this process, and the locations in this list can become stale due to dynamic resource allocation. This situation gets worse when running on a large cluster, as the size of this location list can be on the order of several hundred, out of which there may be tens of stale entries. What we have observed is that with the default settings of 3 max retries and 5s between retries (that's 15s per location), the time it takes to read a broadcast variable can be as high as ~17m (70 failed attempts * 15s/attempt).

Author: Nezih Yigitbasi <nyigitbasi@netflix.com>

Closes #11241 from nezihyigitbasi/SPARK-13328.
2016-03-11 11:11:53 -08:00
Liwei Lin eb650a81f1 [STREAMING][MINOR] Fix a duplicate "be" in comments
Author: Liwei Lin <proflin.me@gmail.com>

Closes #11650 from lw-lin/typo.
2016-03-11 11:07:27 -08:00
Marcelo Vanzin 99b7187c2d [SPARK-13780][SQL] Add missing dependency to build.
This is needed to avoid odd compiler errors when building just the
sql package with maven, because of odd interactions between scalac
and shaded classes.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #11640 from vanzin/SPARK-13780.
2016-03-11 10:27:38 -08:00
Cheng Lian 6d37e1eb90 [SPARK-13817][BUILD][SQL] Re-enable MiMA and removes object DataFrame
## What changes were proposed in this pull request?

PR #11443 temporarily disabled MiMA check, this PR re-enables it.

One extra change is that `object DataFrame` is also removed. The only purpose of introducing `object DataFrame` was to use it as an internal factory for creating `Dataset[Row]`. By replacing this internal factory with `Dataset.newDataFrame`, both `DataFrame` and `DataFrame$` are entirely removed from the API, so that we can simply put a `MissingClassProblem` filter in `MimaExcludes.scala` for most DataFrame API  changes.

## How was this patch tested?

Tested by MiMA check triggered by Jenkins.

Author: Cheng Lian <lian@databricks.com>

Closes #11656 from liancheng/re-enable-mima.
2016-03-11 22:17:50 +08:00
Marcelo Vanzin 07f1c54477 [SPARK-13577][YARN] Allow Spark jar to be multiple jars, archive.
In preparation for the demise of assemblies, this change allows the
YARN backend to use multiple jars and globs as the "Spark jar". The
config option has been renamed to "spark.yarn.jars" to reflect that.

A second option "spark.yarn.archive" was also added; if set, this
takes precedence and uploads an archive expected to contain the jar
files with the Spark code and its dependencies.

Existing deployments should keep working, mostly. This change drops
support for the "SPARK_JAR" environment variable, and also does not
fall back to using "jarOfClass" if no configuration is set, falling
back to finding files under SPARK_HOME instead. This should be fine
since "jarOfClass" probably wouldn't work unless you were using
spark-submit anyway.

Tested with the unit tests, and trying the different config options
on a YARN cluster.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #11500 from vanzin/SPARK-13577.
2016-03-11 07:54:57 -06:00
Nick Pentreath 8fff0f92a4 [HOT-FIX][SQL][ML] Fix compile error from use of DataFrame in Java MaxAbsScaler example
## What changes were proposed in this pull request?

Fix build failure introduced in #11392 (change `DataFrame` -> `Dataset<Row>`).

## How was this patch tested?

Existing build/unit tests

Author: Nick Pentreath <nick.pentreath@gmail.com>

Closes #11653 from MLnick/java-maxabs-example-fix.
2016-03-11 10:20:39 +02:00
sethah 234f781ae1 [SPARK-13787][ML][PYSPARK] Pyspark feature importances for decision tree and random forest
## What changes were proposed in this pull request?

This patch adds a `featureImportance` property to the Pyspark API for `DecisionTreeRegressionModel`, `DecisionTreeClassificationModel`, `RandomForestRegressionModel` and `RandomForestClassificationModel`.

## How was this patch tested?

Python doc tests for the affected classes were updated to check feature importances.

Author: sethah <seth.hendrickson16@gmail.com>

Closes #11622 from sethah/SPARK-13787.
2016-03-11 09:54:23 +02:00