ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Xusen Yin	454a00df2a	[SPARK-13993][PYSPARK] Add pyspark Rformula/RforumlaModel save/load ## What changes were proposed in this pull request? https://issues.apache.org/jira/browse/SPARK-13993 ## How was this patch tested? doctest Author: Xusen Yin <yinxusen@gmail.com> Closes #11807 from yinxusen/SPARK-13993.	2016-03-20 15:34:34 -07:00
sethah	811a524722	[SPARK-12182][ML] Distributed binning for trees in spark.ml This PR changes the `findSplits` method in spark.ml to perform split calculations on the workers. This PR is meant to copy [PR-8246](https://github.com/apache/spark/pull/8246) which added the same feature for MLlib. Author: sethah <seth.hendrickson16@gmail.com> Closes #10231 from sethah/SPARK-12182.	2016-03-20 12:31:28 -07:00
Shixiong Zhu	d630a203d6	[SPARK-10680][TESTS] Increase 'connectionTimeout' to make RequestTimeoutIntegrationSuite more stable ## What changes were proposed in this pull request? Increase 'connectionTimeout' to make RequestTimeoutIntegrationSuite more stable ## How was this patch tested? Existing unit tests Author: Shixiong Zhu <shixiong@databricks.com> Closes #11833 from zsxwing/SPARK-10680.	2016-03-19 12:35:35 -07:00
Reynold Xin	dcaa016610	[SPARK-13897][SQL] RelationalGroupedDataset and KeyValueGroupedDataset ## What changes were proposed in this pull request? Previously, Dataset.groupBy returns a GroupedData, and Dataset.groupByKey returns a GroupedDataset. The naming is very similar, and unfortunately does not convey the real differences between the two. Assume we are grouping by some keys (K). groupByKey is a key-value style group by, in which the schema of the returned dataset is a tuple of just two fields: key and value. groupBy, on the other hand, is a relational style group by, in which the schema of the returned dataset is flattened and contain \|K\| + \|V\| fields. This pull request also removes the experimental tag from RelationalGroupedDataset. It has been with DataFrame since 1.3, and we have enough confidence now to stabilize it. ## How was this patch tested? This is a rename to improve API understandability. Should be covered by all existing tests. Author: Reynold Xin <rxin@databricks.com> Closes #11841 from rxin/SPARK-13897.	2016-03-19 11:23:14 -07:00
Dongjoon Hyun	2082a49569	[MINOR][DOCS] Use `spark-submit` instead of `sparkR` to submit R script. ## What changes were proposed in this pull request? Since `sparkR` is not used for submitting R Scripts from Spark 2.0, a user faces the following error message if he follows the instruction on `R/README.md`. This PR updates `R/README.md`. ```bash $ ./bin/sparkR examples/src/main/r/dataframe.R Running R applications through 'sparkR' is not supported as of Spark 2.0. Use ./bin/spark-submit <R file> ``` ## How was this patch tested? Manual. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #11842 from dongjoon-hyun/update_r_readme.	2016-03-19 13:23:34 +00:00
Reynold Xin	1970d911d9	[SPARK-14018][SQL] Use 64-bit num records in BenchmarkWholeStageCodegen ## What changes were proposed in this pull request? 500L << 20 is actually pretty close to 32-bit int limit. I was trying to increase this to 500L << 23 and got negative numbers instead. ## How was this patch tested? I'm only modifying test code. Author: Reynold Xin <rxin@databricks.com> Closes #11839 from rxin/SPARK-14018.	2016-03-19 00:27:23 -07:00
Sameer Agarwal	b39594472b	[SPARK-14012][SQL] Extract VectorizedColumnReader from VectorizedParquetRecordReader ## What changes were proposed in this pull request? This is a minor followup on https://github.com/apache/spark/pull/11799 that extracts out the `VectorizedColumnReader` from `VectorizedParquetRecordReader` into its own file. ## How was this patch tested? N/A (refactoring only) Author: Sameer Agarwal <sameer@databricks.com> Closes #11834 from sameeragarwal/rename.	2016-03-18 22:33:43 -07:00
Dongjoon Hyun	c11ea2e413	[MINOR][DOCS] Update build descriptions and commands ## What changes were proposed in this pull request? This PR updates Scala and Hadoop versions in the build description and commands in `Building Spark` documents. ## How was this patch tested? N/A Author: Dongjoon Hyun <dongjoon@apache.org> Closes #11838 from dongjoon-hyun/fix_doc_building_spark.	2016-03-18 21:32:48 -07:00
Yuhao Yang	f43a26ef92	[SPARK-13629][ML] Add binary toggle Param to CountVectorizer ## What changes were proposed in this pull request? This is a continued work for https://github.com/apache/spark/pull/11536#issuecomment-198511013, containing some comment update and style adjustment. jkbradley ## How was this patch tested? unit tests. Author: Yuhao Yang <hhbyyh@gmail.com> Closes #11830 from hhbyyh/cvToggle.	2016-03-18 17:34:33 -07:00
Sameer Agarwal	54794113a6	[SPARK-13989] [SQL] Remove non-vectorized/unsafe-row parquet record reader ## What changes were proposed in this pull request? This PR cleans up the new parquet record reader with the following changes: 1. Removes the non-vectorized parquet reader code from `UnsafeRowParquetRecordReader`. 2. Removes the non-vectorized column reader code from `ColumnReader`. 3. Renames `UnsafeRowParquetRecordReader` to `VectorizedParquetRecordReader` and `ColumnReader` to `VectorizedColumnReader` 4. Deprecate `PARQUET_UNSAFE_ROW_RECORD_READER_ENABLED` ## How was this patch tested? Refactoring only; Existing tests should reveal any problems. Author: Sameer Agarwal <sameer@databricks.com> Closes #11799 from sameeragarwal/vectorized-parquet.	2016-03-18 14:04:42 -07:00
Yin Huai	238fb485be	[SPARK-13972][SQL][FOLLOW-UP] When creating the query execution for a converted SQL query, we eagerly trigger analysis ## What changes were proposed in this pull request? As part of testing generating SQL query from a analyzed SQL plan, we run the generated SQL for tests in HiveComparisonTest. This PR makes the generated SQL get eagerly analyzed. So, when a generated SQL has any analysis error, we can see the error message created by ``` case NonFatal(e) => fail( s"""Failed to analyze the converted SQL string: \| \|# Original HiveQL query string: \|$queryString \| \|# Resolved query plan: \|${originalQuery.analyzed.treeString} \| \|# Converted SQL query string: \|$convertedSQL """.stripMargin, e) ``` Right now, if we can parse a generated SQL but fail to analyze it, we will see error message generated by the following code (it only mentions that we cannot execute the original query, i.e. `queryString`). ``` case e: Throwable => val errorMessage = s""" \|Failed to execute query using catalyst: \|Error: ${e.getMessage} \|${stackTraceToString(e)} \|$queryString \|$query \|== HIVE - ${hive.size} row(s) == \|${hive.mkString("\n")} """.stripMargin ``` ## How was this patch tested? Existing tests. Author: Yin Huai <yhuai@databricks.com> Closes #11825 from yhuai/SPARK-13972-follow-up.	2016-03-18 13:40:53 -07:00
Sital Kedia	2e0c5284fd	[SPARK-13958] Executor OOM due to unbounded growth of pointer array in… ## What changes were proposed in this pull request? This change fixes the executor OOM which was recently introduced in PR apache/spark#11095 (Please fill in changes proposed in this fix) ## How was this patch tested? Tested by running a spark job on the cluster. (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) … Sorter Author: Sital Kedia <skedia@fb.com> Closes #11794 from sitalkedia/SPARK-13958.	2016-03-18 12:56:06 -07:00
jerryshao	3537782168	[SPARK-13885][YARN] Fix attempt id regression for Spark running on Yarn ## What changes were proposed in this pull request? This regression is introduced in #9182, previously attempt id is simply as counter "1" or "2". With the change of #9182, it is changed to full name as "appattemtp-xxx-00001", this will affect all the parts which uses this attempt id, like event log file name, history server app url link. So here change it back to the counter to keep consistent with previous code. Also revert back this patch #11518, this patch fix the url link of history log according to the new way of attempt id, since here we change back to the previous way, so this patch is not necessary, here to revert it. Also clean "spark.yarn.app.id" and "spark.yarn.app.attemptId", since it is useless now. ## How was this patch tested? Test it with unit test and manually test different scenario: 1. application running in yarn-client mode. 2. application running in yarn-cluster mode. 3. application running in yarn-cluster mode with multiple attempts. Checked both the event log file name and url link. CC vanzin tgravescs , please help to review, thanks a lot. Author: jerryshao <sshao@hortonworks.com> Closes #11721 from jerryshao/SPARK-13885.	2016-03-18 12:39:49 -07:00
Davies Liu	9c23c818ca	[SPARK-13977] [SQL] Brings back Shuffled hash join ## What changes were proposed in this pull request? ShuffledHashJoin (also outer join) is removed in 1.6, in favor of SortMergeJoin, which is more robust and also fast. ShuffledHashJoin is still useful in this case: 1) one table is much smaller than the other one, then cost to build a hash table on smaller table is smaller than sorting the larger table 2) any partition of the small table could fit in memory. This PR brings back ShuffledHashJoin, basically revert #9645, and fix the conflict. Also merging outer join and left-semi join into the same class. This PR does not implement full outer join, because it's not implemented efficiently (requiring build hash table on both side). A simple benchmark (one table is 5x smaller than other one) show that ShuffledHashJoin could be 2X faster than SortMergeJoin. ## How was this patch tested? Added new unit tests for ShuffledHashJoin. Author: Davies Liu <davies@databricks.com> Closes #11788 from davies/shuffle_join.	2016-03-18 10:32:53 -07:00
Cheng Lian	14c7236dc6	[SPARK-14004][SQL][MINOR] AttributeReference and Alias should only use the first qualifier to generate SQL strings ## What changes were proposed in this pull request? Current implementations of `AttributeReference.sql` and `Alias.sql` joins all available qualifiers, which is logically wrong. But this implementation mistake doesn't cause any real SQL generation bugs though, since there is always at most one qualifier for any given `AttributeReference` or `Alias`. This PR fixes this issue by only picking the first qualifiers. ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) Existing tests should be enough. Author: Cheng Lian <lian@databricks.com> Closes #11820 from liancheng/spark-14004-single-qualifier.	2016-03-19 00:22:17 +08:00
Wenchen Fan	0acb32a3f1	[SPARK-13972][SQ] hive tests should fail if SQL generation failed ## What changes were proposed in this pull request? Now we should be able to convert all logical plans to SQL string, if they are parsed from hive query. This PR changes the error handling to throw exceptions instead of just log. We will send new PRs for spotted bugs, and merge this one after all bugs are fixed. ## How was this patch tested? existing tests. Author: Wenchen Fan <wenchen@databricks.com> Closes #11782 from cloud-fan/test.	2016-03-18 23:16:14 +08:00
Zheng RuiFeng	53f32a22da	[MINOR][DOC] Fix nits in JavaStreamingTestExample ## What changes were proposed in this pull request? Fix some nits discussed in https://github.com/apache/spark/pull/11776#issuecomment-198207419 use !rdd.isEmpty instead of rdd.count > 0 use static instead of AtomicInteger remove unneeded "throws Exception" ## How was this patch tested? manual tests Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #11821 from zhengruifeng/je_fix.	2016-03-18 12:34:14 +00:00
Wenchen Fan	0f1015ffdd	[SPARK-14001][SQL] support multi-children Union in SQLBuilder ## What changes were proposed in this pull request? The fix is simple, use the existing `CombineUnions` rule to combine adjacent Unions before build SQL string. ## How was this patch tested? The re-enabled test Author: Wenchen Fan <wenchen@databricks.com> Closes #11818 from cloud-fan/bug-fix.	2016-03-18 19:42:33 +08:00
Yanbo Liang	7783b6f38f	[MINOR][ML] When trainingSummary is None, it should throw RuntimeException. ## What changes were proposed in this pull request? When trainingSummary is None, it should throw ```RuntimeException```. cc mengxr ## How was this patch tested? Existing tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #11784 from yanboliang/fix-summary.	2016-03-18 11:23:17 +00:00
Reynold Xin	bb1fda01fe	[SPARK-13826][SQL] Addendum: update documentation for Datasets ## What changes were proposed in this pull request? This patch updates documentations for Datasets. I also updated some internal documentation for exchange/broadcast. ## How was this patch tested? Just documentation/api stability update. Author: Reynold Xin <rxin@databricks.com> Closes #11814 from rxin/dataset-docs.	2016-03-18 00:57:23 -07:00
Liang-Chi Hsieh	750ed64cd9	[SPARK-13930] [SQL] Apply fast serialization on collect limit operator ## What changes were proposed in this pull request? JIRA: https://issues.apache.org/jira/browse/SPARK-13930 Recently the fast serialization has been introduced to collecting DataFrame/Dataset (#11664). The same technology can be used on collect limit operator too. ## How was this patch tested? Add a benchmark for collect limit to `BenchmarkWholeStageCodegen`. Without this patch: model name : Westmere E56xx/L56xx/X56xx (Nehalem-C) collect limit: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------- collect limit 1 million 3413 / 3768 0.3 3255.0 1.0X collect limit 2 millions 9728 / 10440 0.1 9277.3 0.4X With this patch: model name : Westmere E56xx/L56xx/X56xx (Nehalem-C) collect limit: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------- collect limit 1 million 833 / 1284 1.3 794.4 1.0X collect limit 2 millions 3348 / 4005 0.3 3193.3 0.2X Author: Liang-Chi Hsieh <simonh@tw.ibm.com> Closes #11759 from viirya/execute-take.	2016-03-17 23:24:44 -07:00
Cheng Lian	10ef4f3e77	[SPARK-13826][SQL] Revises Dataset ScalaDoc ## What changes were proposed in this pull request? This PR revises Dataset API ScalaDoc. All public methods are divided into the following groups * `groupname basic`: Basic Dataset functions * `groupname action`: Actions * `groupname untypedrel`: Untyped Language Integrated Relational Queries * `groupname typedrel`: Typed Language Integrated Relational Queries * `groupname func`: Functional Transformations * `groupname rdd`: RDD Operations * `groupname output`: Output Operations `since` tag and sample code are also updated. We may want to add more sample code for typed APIs. ## How was this patch tested? Documentation change. Checked by building unidoc locally. Author: Cheng Lian <lian@databricks.com> Closes #11769 from liancheng/spark-13826-ds-api-doc.	2016-03-17 21:31:11 -07:00
tedyu	90a1d8db70	[SPARK-12719][HOTFIX] Fix compilation against Scala 2.10 PR #11696 introduced a complex pattern match that broke Scala 2.10 match unreachability check and caused build failure. This PR fixes this issue by expanding this pattern match into several simpler ones. Note that tuning or turning off `-Dscalac.patmat.analysisBudget` doesn't work for this case. Compilation against Scala 2.10 Author: tedyu <yuzhihong@gmail.com> Closes #11798 from yy2016/master.	2016-03-18 12:11:32 +08:00
Josh Rosen	6c2d894a2f	[SPARK-13921] Store serialized blocks as multiple chunks in MemoryStore This patch modifies the BlockManager, MemoryStore, and several other storage components so that serialized cached blocks are stored as multiple small chunks rather than as a single contiguous ByteBuffer. This change will help to improve the efficiency of memory allocation and the accuracy of memory accounting when serializing blocks. Our current serialization code uses a ByteBufferOutputStream, which doubles and re-allocates its backing byte array; this increases the peak memory requirements during serialization (since we need to hold extra memory while expanding the array). In addition, we currently don't account for the extra wasted space at the end of the ByteBuffer's backing array, so a 129 megabyte serialized block may actually consume 256 megabytes of memory. After switching to storing blocks in multiple chunks, we'll be able to efficiently trim the backing buffers so that no space is wasted. This change is also a prerequisite to being able to cache blocks which are larger than 2GB (although full support for that depends on several other changes which have not bee implemented yet). Author: Josh Rosen <joshrosen@databricks.com> Closes #11748 from JoshRosen/chunked-block-serialization.	2016-03-17 20:00:56 -07:00
Wenchen Fan	6037ed0a1d	[SPARK-13976][SQL] do not remove sub-queries added by user when generate SQL ## What changes were proposed in this pull request? We haven't figured out the corrected logical to add sub-queries yet, so we should not clear all sub-queries before generate SQL. This PR changed the logic to only remove sub-queries above table relation. an example for this bug, original SQL: `SELECT a FROM (SELECT a FROM tbl) t WHERE a = 1` before this PR, we will generate: ``` SELECT attr_1 AS a FROM SELECT attr_1 FROM ( SELECT a AS attr_1 FROM tbl ) AS sub_q0 WHERE attr_1 = 1 ``` We missed a sub-query and this SQL string is illegal. After this PR, we will generate: ``` SELECT attr_1 AS a FROM ( SELECT attr_1 FROM ( SELECT a AS attr_1 FROM tbl ) AS sub_q0 WHERE attr_1 = 1 ) AS t ``` TODO: for long term, we should find a way to add sub-queries correctly, so that arbitrary logical plans can be converted to SQL string. ## How was this patch tested? `LogicalPlanToSQLSuite` Author: Wenchen Fan <wenchen@databricks.com> Closes #11786 from cloud-fan/bug-fix.	2016-03-18 10:16:48 +08:00
Wenchen Fan	453455c479	[SPARK-13974][SQL] sub-query names do not need to be globally unique while generate SQL ## What changes were proposed in this pull request? We only need to make sub-query names unique every time we generate a SQL string, but not all the time. This PR moves the `newSubqueryName` method to `class SQLBuilder` and remove `object SQLBuilder`. also addressed 2 minor comments in https://github.com/apache/spark/pull/11696 ## How was this patch tested? existing tests. Author: Wenchen Fan <wenchen@databricks.com> Closes #11783 from cloud-fan/tmp.	2016-03-18 09:30:36 +08:00
sethah	1614485fd9	[SPARK-10788][MLLIB][ML] Remove duplicate bins for decision trees Decision trees in spark.ml (RandomForest.scala) communicate twice as much data as needed for unordered categorical features. Here's an example. Say there are 3 categories A, B, C. We consider 3 splits: * A vs. B, C * A, B vs. C * A, C vs. B Currently, we collect statistics for each of the 6 subsets of categories (3 * 2 = 6). However, we could instead collect statistics for the 3 subsets on the left-hand side of the 3 possible splits: A and A,B and A,C. If we also have stats for the entire node, then we can compute the stats for the 3 subsets on the right-hand side of the splits. In pseudomath: stats(B,C) = stats(A,B,C) - stats(A). This patch adds a parent stats array to the `DTStatsAggregator` so that the right child stats do not need to be stored. The right child stats are computed by subtracting left child stats from the parent stats for unordered categorical features. Author: sethah <seth.hendrickson16@gmail.com> Closes #9474 from sethah/SPARK-10788.	2016-03-17 16:44:41 -07:00
Joseph K. Bradley	b39e80d39d	[SPARK-13761][ML] Remove remaining uses of validateParams ## What changes were proposed in this pull request? Cleanups from [https://github.com/apache/spark/pull/11620]: remove remaining uses of validateParams, and put functionality into transformSchema ## How was this patch tested? Existing unit tests, modified to check using transformSchema instead of validateParams Author: Joseph K. Bradley <joseph@databricks.com> Closes #11790 from jkbradley/SPARK-13761-cleanup.	2016-03-17 13:23:07 -07:00
Yin Huai	4c08e2c085	Revert "[SPARK-12719][HOTFIX] Fix compilation against Scala 2.10" This reverts commit `3ee7996187`.	2016-03-17 11:16:03 -07:00
Xusen Yin	edf8b8775b	[SPARK-11891] Model export/import for RFormula and RFormulaModel https://issues.apache.org/jira/browse/SPARK-11891 Author: Xusen Yin <yinxusen@gmail.com> Closes #9884 from yinxusen/SPARK-11891.	2016-03-17 10:19:10 -07:00
Bryan Cutler	828213d4ca	[SPARK-13937][PYSPARK][ML] Change JavaWrapper _java_obj from static to member variable ## What changes were proposed in this pull request? In PySpark wrapper.py JavaWrapper change _java_obj from an unused static variable to a member variable that is consistent with usage in derived classes. ## How was this patch tested? Ran python tests for ML and MLlib. Author: Bryan Cutler <cutlerb@gmail.com> Closes #11767 from BryanCutler/JavaWrapper-static-_java_obj-SPARK-13937.	2016-03-17 10:16:51 -07:00
tedyu	3ee7996187	[SPARK-12719][HOTFIX] Fix compilation against Scala 2.10 ## What changes were proposed in this pull request? Compilation against Scala 2.10 fails with: ``` [error] [warn] /home/jenkins/workspace/spark-master-compile-sbt-scala-2.10/sql/hive/src/main/scala/org/apache/spark/sql/hive/SQLBuilder.scala:483: Cannot check match for unreachability. [error] (The analysis required more space than allowed. Please try with scalac -Dscalac.patmat.analysisBudget=512 or -Dscalac.patmat.analysisBudget=off.) [error] [warn] private def addSubqueryIfNeeded(plan: LogicalPlan): LogicalPlan = plan match { ``` ## How was this patch tested? Compilation against Scala 2.10 Author: tedyu <yuzhihong@gmail.com> Closes #11787 from yy2016/master.	2016-03-17 10:09:37 -07:00
Liang-Chi Hsieh	5f3bda6fe2	[SPARK-13838] [SQL] Clear variable code to prevent it to be re-evaluated in BoundAttribute JIRA: https://issues.apache.org/jira/browse/SPARK-13838 ## What changes were proposed in this pull request? We should also clear the variable code in `BoundReference.genCode` to prevent it to be evaluated twice, as we did in `evaluateVariables`. ## How was this patch tested? Existing tests. Author: Liang-Chi Hsieh <simonh@tw.ibm.com> Closes #11674 from viirya/avoid-reevaluate.	2016-03-17 10:08:42 -07:00
Dilip Biswal	637a78f1d3	[SPARK-13427][SQL] Support USING clause in JOIN. ## What changes were proposed in this pull request? Support queries that JOIN tables with USING clause. SELECT * from table1 JOIN table2 USING <column_list> USING clause can be used as a means to simplify the join condition when : 1) Equijoin semantics is desired and 2) The column names in the equijoin have the same name. We already have the support for Natural Join in Spark. This PR makes use of the already existing infrastructure for natural join to form the join condition and also the projection list. ## How was the this patch tested? Have added unit tests in SQLQuerySuite, CatalystQlSuite, ResolveNaturalJoinSuite Author: Dilip Biswal <dbiswal@us.ibm.com> Closes #11297 from dilipbiswal/spark-13427.	2016-03-17 10:01:41 -07:00
Shixiong Zhu	65b75e66e8	[SPARK-13776][WEBUI] Limit the max number of acceptors and selectors for Jetty ## What changes were proposed in this pull request? As each acceptor/selector in Jetty will use one thread, the number of threads should at least be the number of acceptors and selectors plus 1. Otherwise, the thread pool of Jetty server may be exhausted by acceptors/selectors and not be able to response any request. To avoid wasting threads, the PR limits the max number of acceptors and selectors and also updates the max thread number if necessary. ## How was this patch tested? Just make sure we don't break any existing tests Author: Shixiong Zhu <shixiong@databricks.com> Closes #11615 from zsxwing/SPARK-13776.	2016-03-17 13:05:29 +00:00
Wenchen Fan	1974d1d34d	[SPARK-12719][SQL] SQL generation support for Generate ## What changes were proposed in this pull request? This PR adds SQL generation support for `Generate` operator. It always converts `Generate` operator into `LATERAL VIEW` format as there are many limitations to put UDTF in project list. This PR is based on https://github.com/apache/spark/pull/11658, please see the last commit to review the real changes. Thanks dilipbiswal for his initial work! Takes over https://github.com/apache/spark/pull/11596 ## How was this patch tested? new tests in `LogicalPlanToSQLSuite` Author: Wenchen Fan <wenchen@databricks.com> Closes #11696 from cloud-fan/generate.	2016-03-17 20:25:05 +08:00
Wenchen Fan	8ef3399aff	[SPARK-13928] Move org.apache.spark.Logging into org.apache.spark.internal.Logging ## What changes were proposed in this pull request? Logging was made private in Spark 2.0. If we move it, then users would be able to create a Logging trait themselves to avoid changing their own code. ## How was this patch tested? existing tests. Author: Wenchen Fan <wenchen@databricks.com> Closes #11764 from cloud-fan/logger.	2016-03-17 19:23:38 +08:00
trueyao	ea9ca6f04c	[SPARK-13901][CORE] correct the logDebug information when jump to the next locality level JIRA Issue:https://issues.apache.org/jira/browse/SPARK-13901 In getAllowedLocalityLevel method of TaskSetManager,we get wrong logDebug information when jump to the next locality level.So we should fix it. Author: trueyao <501663994@qq.com> Closes #11719 from trueyao/logDebug-localityWait.	2016-03-17 09:45:06 +00:00
Yuhao Yang	357d82d84d	[SPARK-13629][ML] Add binary toggle Param to CountVectorizer ## What changes were proposed in this pull request? It would be handy to add a binary toggle Param to CountVectorizer, as in the scikit-learn one: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html If set, then all non-zero counts will be set to 1. ## How was this patch tested? unit tests Author: Yuhao Yang <hhbyyh@gmail.com> Closes #11536 from hhbyyh/cvToggle.	2016-03-17 11:21:11 +02:00
Zheng RuiFeng	204c9dec2c	[MINOR][DOC] Add JavaStreamingTestExample ## What changes were proposed in this pull request? Add the java example of StreamingTest ## How was this patch tested? manual tests in CLI: bin/run-example mllib.JavaStreamingTestExample dataDir 5 100 Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #11776 from zhengruifeng/streaming_je.	2016-03-17 11:09:02 +02:00
Davies Liu	30c18841e4	Revert "[SPARK-13840][SQL] Split Optimizer Rule ColumnPruning to ColumnPruning and EliminateOperator" This reverts commit `99bd2f0e94`.	2016-03-16 23:11:13 -07:00
Josh Rosen	82066a1667	[SPARK-13948] MiMa check should catch if the visibility changes to private MiMa excludes are currently generated using both the current Spark version's classes and Spark 1.2.0's classes, but this doesn't make sense: we should only be ignoring classes which were `private` in the previous Spark version, not classes which became private in the current version. This patch updates `dev/mima` to only generate excludes with respect to the previous artifacts that MiMa checks against. It also updates `MimaBuild` so that `excludeClass` only applies directly to the class being excluded and not to its companion object (since a class and its companion object can have different accessibility). Author: Josh Rosen <joshrosen@databricks.com> Closes #11774 from JoshRosen/SPARK-13948.	2016-03-16 23:02:25 -07:00
Ryan Blue	5faba9facc	[SPARK-13403][SQL] Pass hadoopConfiguration to HiveConf constructors. This commit updates the HiveContext so that sc.hadoopConfiguration is used to instantiate its internal instances of HiveConf. I tested this by overriding the S3 FileSystem implementation from spark-defaults.conf as "spark.hadoop.fs.s3.impl" (to avoid [HADOOP-12810](https://issues.apache.org/jira/browse/HADOOP-12810)). Author: Ryan Blue <blue@apache.org> Closes #11273 from rdblue/SPARK-13403-new-hive-conf-from-hadoop-conf.	2016-03-16 22:57:06 -07:00
Josh Rosen	de1a84e56e	[SPARK-13926] Automatically use Kryo serializer when shuffling RDDs with simple types Because ClassTags are available when constructing ShuffledRDD we can use them to automatically use Kryo for shuffle serialization when the RDD's types are known to be compatible with Kryo. This patch introduces `SerializerManager`, a component which picks the "best" serializer for a shuffle given the elements' ClassTags. It will automatically pick a Kryo serializer for ShuffledRDDs whose key, value, and/or combiner types are primitives, arrays of primitives, or strings. In the future we can use this class as a narrow extension point to integrate specialized serializers for other types, such as ByteBuffers. In a planned followup patch, I will extend the BlockManager APIs so that we're able to use similar automatic serializer selection when caching RDDs (this is a little trickier because the ClassTags need to be threaded through many more places). Author: Josh Rosen <joshrosen@databricks.com> Closes #11755 from JoshRosen/automatically-pick-best-serializer.	2016-03-16 22:52:55 -07:00
Daoyuan Wang	d1c193a2f1	[SPARK-12855][MINOR][SQL][DOC][TEST] remove spark.sql.dialect from doc and test ## What changes were proposed in this pull request? Since developer API of plug-able parser has been removed in #10801 , docs should be updated accordingly. ## How was this patch tested? This patch will not affect the real code path. Author: Daoyuan Wang <daoyuan.wang@intel.com> Closes #11758 from adrian-wang/spark12855.	2016-03-16 22:52:10 -07:00
Dongjoon Hyun	c890c359b1	[MINOR][SQL][BUILD] Remove duplicated lines ## What changes were proposed in this pull request? This PR removes three minor duplicated lines. First one is making the following unreachable code warning. ``` JoinSuite.scala:52: unreachable code [warn] case j: BroadcastHashJoin => j ``` The other two are just consecutive repetitions in `Seq` of MiMa filters. ## How was this patch tested? Pass the existing Jenkins test. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #11773 from dongjoon-hyun/remove_duplicated_line.	2016-03-16 22:48:58 -07:00
Jakob Odersky	7eef2463ad	[SPARK-13118][SQL] Expression encoding for optional synthetic classes ## What changes were proposed in this pull request? Fix expression generation for optional types. Standard Java reflection causes issues when dealing with synthetic Scala objects (things that do not map to Java and thus contain a dollar sign in their name). This patch introduces Scala reflection in such cases. This patch also adds a regression test for Dataset's handling of classes defined in package objects (which was the initial purpose of this PR). ## How was this patch tested? A new test in ExpressionEncoderSuite that tests optional inner classes and a regression test for Dataset's handling of package objects. Author: Jakob Odersky <jakob@odersky.com> Closes #11708 from jodersky/SPARK-13118-package-objects.	2016-03-16 21:53:16 -07:00
Davies Liu	c100d31ddc	[SPARK-13873] [SQL] Avoid copy of UnsafeRow when there is no join in whole stage codegen ## What changes were proposed in this pull request? We need to copy the UnsafeRow since a Join could produce multiple rows from single input rows. We could avoid that if there is no join (or the join will not produce multiple rows) inside WholeStageCodegen. Updated the benchmark for `collect`, we could see 20-30% speedup. ## How was this patch tested? existing unit tests. Author: Davies Liu <davies@databricks.com> Closes #11740 from davies/avoid_copy2.	2016-03-16 21:46:04 -07:00
hyukjinkwon	917f4000b4	[SPARK-13719][SQL] Parse JSON rows having an array type and a struct type in the same fieild ## What changes were proposed in this pull request? This https://github.com/apache/spark/pull/2400 added the support to parse JSON rows wrapped with an array. However, this throws an exception when the given data contains array data and struct data in the same field as below: ```json {"a": {"b": 1}} {"a": []} ``` and the schema is given as below: ```scala val schema = StructType( StructField("a", StructType( StructField("b", StringType) :: Nil )) :: Nil) ``` - Before ```scala sqlContext.read.schema(schema).json(path).show() ``` ```scala Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 in stage 0.0 failed 4 times, most recent failure: Lost task 7.3 in stage 0.0 (TID 10, 192.168.1.170): java.lang.ClassCastException: org.apache.spark.sql.types.GenericArrayData cannot be cast to org.apache.spark.sql.catalyst.InternalRow at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getStruct(rows.scala:50) at org.apache.spark.sql.catalyst.expressions.GenericMutableRow.getStruct(rows.scala:247) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown Source) ... ``` - After ```scala sqlContext.read.schema(schema).json(path).show() ``` ```bash +----+ \| a\| +----+ \| [1]\| \|null\| +----+ ``` For other data types, in this case it converts the given values are `null` but only this case emits an exception. This PR makes the support for wrapped rows applied only at the top level. ## How was this patch tested? Unit tests were used and `./dev/run_tests` for code style tests. Author: hyukjinkwon <gurwls223@gmail.com> Closes #11752 from HyukjinKwon/SPARK-3308-follow-up.	2016-03-16 18:20:30 -07:00
Andrew Or	ca9ef86c84	[SPARK-13923][SQL] Implement SessionCatalog ## What changes were proposed in this pull request? As part of the effort to merge `SQLContext` and `HiveContext`, this patch implements an internal catalog called `SessionCatalog` that handles temporary functions and tables and delegates metastore operations to `ExternalCatalog`. Currently, this is still dead code, but in the future it will be part of `SessionState` and will replace `o.a.s.sql.catalyst.analysis.Catalog`. A recent patch #11573 parses Hive commands ourselves in Spark, but still passes the entire query text to Hive. In a future patch, we will use `SessionCatalog` to implement the parsed commands. ## How was this patch tested? 800+ lines of tests in `SessionCatalogSuite`. Author: Andrew Or <andrew@databricks.com> Closes #11750 from andrewor14/temp-catalog.	2016-03-16 18:02:43 -07:00

... 3 4 5 6 7 ...

15416 commits