ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Yu ISHIKAWA	26e760581f	[SPARK-9871] [SPARKR] Add expression functions into SparkR which have a variable parameter ### Summary - Add `lit` function - Add `concat`, `greatest`, `least` functions I think we need to improve the `collect` function in order to implement the `struct` function, since `collect` doesn't work with arguments which include a nested `list` variable. It seems that a list against `struct` still has `jobj` classes. So it would be better to solve this problem in another issue. ### JIRA [[SPARK-9871] Add expression functions into SparkR which have a variable parameter - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-9871) Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #8194 from yu-iskw/SPARK-9856.	2015-08-16 23:33:20 -07:00
Cheng Lian	ae2370e72f	[SPARK-10005] [SQL] Fixes schema merging for nested structs In case of schema merging, we only handled first level fields when converting Parquet groups to `InternalRow`s. Nested struct fields are not properly handled. For example, the schema of a Parquet file to be read can be: ``` message individual { required group f1 { optional binary f11 (utf8); } } ``` while the global schema is: ``` message global { required group f1 { optional binary f11 (utf8); optional int32 f12; } } ``` This PR fixes this issue by padding missing fields when creating actual converters. Author: Cheng Lian <lian@databricks.com> Closes #8228 from liancheng/spark-10005/nested-schema-merging.	2015-08-16 10:17:58 -07:00
Matei Zaharia	cf016075a0	[SPARK-10008] Ensure shuffle locality doesn't take precedence over narrow deps The shuffle locality patch made the DAGScheduler aware of shuffle data, but for RDDs that have both narrow and shuffle dependencies, it can cause them to place tasks based on the shuffle dependency instead of the narrow one. This case is common in iterative join-based algorithms like PageRank and ALS, where one RDD is hash-partitioned and one isn't. Author: Matei Zaharia <matei@databricks.com> Closes #8220 from mateiz/shuffle-loc-fix.	2015-08-16 00:34:58 -07:00
Sun Rui	5f9ce738fe	[SPARK-8844] [SPARKR] head/collect is broken in SparkR. This is a WIP patch for SPARK-8844 for collecting reviews. This bug is about reading an empty DataFrame. in readCol(), lapply(1:numRows, function(x) { does not take into consideration the case where numRows = 0. Will add unit test case. Author: Sun Rui <rui.sun@intel.com> Closes #7419 from sun-rui/SPARK-8844.	2015-08-16 00:30:02 -07:00
Kun Xu	182f9b7a6d	[SPARK-9973] [SQL] Correct in-memory columnar buffer size The `initialSize` argument of `ColumnBuilder.initialize()` should be the number of rows rather than bytes. However `InMemoryColumnarTableScan` passes in a byte size, which makes Spark SQL allocate more memory than necessary when building in-memory columnar buffers. Author: Kun Xu <viper_kun@163.com> Closes #8189 from viper-kun/errorSize.	2015-08-16 14:44:45 +08:00
Joseph K. Bradley	1db7179fae	[SPARK-9805] [MLLIB] [PYTHON] [STREAMING] Added _eventually for ml streaming pyspark tests Recently, PySpark ML streaming tests have been flaky, most likely because of the batches not being processed in time. Proposal: Replace the use of _ssc_wait (which waits for a fixed amount of time) with a method which waits for a fixed amount of time but can terminate early based on a termination condition method. With this, we can extend the waiting period (to make tests less flaky) but also stop early when possible (making tests faster on average, which I verified locally). CC: mengxr tdas freeman-lab Author: Joseph K. Bradley <joseph@databricks.com> Closes #8087 from jkbradley/streaming-ml-tests.	2015-08-15 18:48:20 -07:00
Wenchen Fan	570567258b	[SPARK-9955] [SQL] correct error message for aggregate We should skip unresolved `LogicalPlan`s for `PullOutNondeterministic`, as calling `output` on unresolved `LogicalPlan` will produce confusing error message. Author: Wenchen Fan <cloud0fan@outlook.com> Closes #8203 from cloud-fan/error-msg and squashes the following commits: 1c67ca7 [Wenchen Fan] move test 7593080 [Wenchen Fan] correct error message for aggregate	2015-08-15 14:13:12 -07:00
Herman van Hovell	a85fb6c07f	[SPARK-9980] [BUILD] Fix SBT publishLocal error due to invalid characters in doc Tiny modification to a few comments to make ```sbt publishLocal``` work again. Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #8209 from hvanhovell/SPARK-9980.	2015-08-15 10:46:04 +01:00
Davies Liu	7c1e56825b	[SPARK-9725] [SQL] fix serialization of UTF8String across different JVM The BYTE_ARRAY_OFFSET could be different in JVMs with different configurations (for example, different heap sizes: 24 if heap > 32G, otherwise 16), so the offset of UTF8String is not portable; we should handle that during serialization. Author: Davies Liu <davies@databricks.com> Closes #8210 from davies/serialize_utf8string.	2015-08-14 22:30:35 -07:00
zc he	71a3af8a94	[SPARK-9960] [GRAPHX] sendMessage type fix in LabelPropagation.scala Author: zc he <farseer90718@gmail.com> Closes #8188 from farseer90718/farseer-patch-1.	2015-08-14 21:28:50 -07:00
Reynold Xin	609ce3c07d	[SPARK-9984] [SQL] Create local physical operator interface. This pull request creates a new operator interface that is more similar to traditional database query iterators (with open/close/next/get). These local operators are not currently used anywhere, but will become the basis for SPARK-9983 (local physical operators for query execution). cc zsxwing Author: Reynold Xin <rxin@databricks.com> Closes #8212 from rxin/SPARK-9984.	2015-08-14 21:12:11 -07:00
Yijie Shen	6c4fdbec33	[SPARK-8887] [SQL] Explicitly define which data types can be used as dynamic partition columns This PR enforces dynamic partition column data type requirements by adding analysis rules. JIRA: https://issues.apache.org/jira/browse/SPARK-8887 Author: Yijie Shen <henry.yijieshen@gmail.com> Closes #8201 from yjshen/dynamic_partition_columns.	2015-08-14 21:03:14 -07:00
Wenchen Fan	ec29f2034a	[SPARK-9634] [SPARK-9323] [SQL] cleanup unnecessary Aliases in LogicalPlan at the end of analysis Also alias the ExtractValue instead of wrapping it with UnresolvedAlias when resolve attribute in LogicalPlan, as this alias will be trimmed if it's unnecessary. Based on #7957 without the changes to mllib, but instead maintaining earlier behavior when using `withColumn` on expressions that already have metadata. Author: Wenchen Fan <cloud0fan@outlook.com> Author: Michael Armbrust <michael@databricks.com> Closes #8215 from marmbrus/pr/7957.	2015-08-14 20:59:54 -07:00
Davies Liu	37586e5449	[HOTFIX] fix duplicated braces Author: Davies Liu <davies@databricks.com> Closes #8219 from davies/fix_typo.	2015-08-14 20:56:55 -07:00
Reynold Xin	e5fd60415f	[SPARK-9934] Deprecate NIO ConnectionManager. Deprecate NIO ConnectionManager in Spark 1.5.0, before removing it in Spark 1.6.0. Author: Reynold Xin <rxin@databricks.com> Closes #8162 from rxin/SPARK-9934.	2015-08-14 20:55:32 -07:00
Yin Huai	932b24fd14	[SPARK-9949] [SQL] Fix TakeOrderedAndProject's output. https://issues.apache.org/jira/browse/SPARK-9949 Author: Yin Huai <yhuai@databricks.com> Closes #8179 from yhuai/SPARK-9949.	2015-08-14 17:35:17 -07:00
Tathagata Das	18a761ef7a	[SPARK-9968] [STREAMING] Reduced time spent within synchronized block to prevent lock starvation When the rate limiter is actually limiting the rate at which data is inserted into the buffer, the synchronized block of BlockGenerator.addData stays blocked for long time. This causes the thread switching the buffer and generating blocks (synchronized with addData) to starve and not generate blocks for seconds. The correct solution is to not block on the rate limiter within the synchronized block for adding data to the buffer. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #8204 from tdas/SPARK-9968 and squashes the following commits: 8cbcc1b [Tathagata Das] Removed unused val a73b645 [Tathagata Das] Reduced time spent within synchronized block	2015-08-14 15:54:14 -07:00
Tathagata Das	f3bfb711c1	[SPARK-9966] [STREAMING] Handle couple of corner cases in PIDRateEstimator 1. The rate estimator should not estimate any rate when there are no records in the batch, as there is no data to estimate the rate. In the current state, it estimates and sets the rate to zero. That is incorrect. 2. The rate estimator should never set the rate to zero under any circumstances. Otherwise the system will stop receiving data, and stop generating useful estimates (see reason 1). So the fix is to define a parameter that sets a lower bound on the estimated rate, so that the system always receives some data. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #8199 from tdas/SPARK-9966 and squashes the following commits: 829f793 [Tathagata Das] Fixed unit test and added comments 3a994db [Tathagata Das] Added min rate and updated tests in PIDRateEstimator	2015-08-14 15:10:01 -07:00
Wenchen Fan	1150a19b18	[SPARK-8670] [SQL] Nested columns can't be referenced in pyspark This bug is caused by a wrong column-exist-check in `__getitem__` of pyspark dataframe. `DataFrame.apply` accepts not only top level column names, but also nested column name like `a.b`, so we should remove that check from `__getitem__`. Author: Wenchen Fan <cloud0fan@outlook.com> Closes #8202 from cloud-fan/nested.	2015-08-14 14:09:46 -07:00
Joseph K. Bradley	2a6590e510	[SPARK-9981] [ML] Made labels public for StringIndexerModel Also added unit test for integration between StringIndexerModel and IndexToString CC: holdenk We realized we should have left in your unit test (to catch the issue with removing the inverse() method), so this adds it back. mengxr Author: Joseph K. Bradley <joseph@databricks.com> Closes #8211 from jkbradley/stridx-labels.	2015-08-14 14:05:03 -07:00
Davies Liu	11ed2b180e	[SPARK-9978] [PYSPARK] [SQL] fix Window.orderBy and doc of ntile() Author: Davies Liu <davies@databricks.com> Closes #8213 from davies/fix_window.	2015-08-14 13:55:29 -07:00
jerryshao	9407baa2a7	[SPARK-9877] [CORE] Fix StandaloneRestServer NPE when submitting application Detailed exception log can be seen in [SPARK-9877](https://issues.apache.org/jira/browse/SPARK-9877), the problem is when creating `StandaloneRestServer`, `self` (`masterEndpoint`) is null. So this fix is creating `StandaloneRestServer` when `self` is available. Author: jerryshao <sshao@hortonworks.com> Closes #8127 from jerryshao/SPARK-9877.	2015-08-14 13:44:38 -07:00
Andrew Or	6518ef6303	[SPARK-9948] Fix flaky AccumulatorSuite - internal accumulators In these tests, we use a custom listener and we assert on fields in the stage / task completion events. However, these events are posted in a separate thread so they're not guaranteed to be posted in time. This commit fixes this flakiness through a job end registration callback. Author: Andrew Or <andrew@databricks.com> Closes #8176 from andrewor14/fix-accumulator-suite.	2015-08-14 13:42:53 -07:00
Carson Wang	33bae585d4	[SPARK-9809] Task crashes because the internal accumulators are not properly initialized When a stage failed and another stage was resubmitted with only part of partitions to compute, all the tasks failed with error message: java.util.NoSuchElementException: key not found: peakExecutionMemory. This is because the internal accumulators are not properly initialized for this stage while other codes assume the internal accumulators always exist. Author: Carson Wang <carson.wang@intel.com> Closes #8090 from carsonwang/SPARK-9809.	2015-08-14 13:38:25 -07:00
MechCoder	ffa05c84fe	[SPARK-9828] [PYSPARK] Mutable values should not be default arguments Author: MechCoder <manojkumarsivaraj334@gmail.com> Closes #8110 from MechCoder/spark-9828.	2015-08-14 12:46:05 -07:00
Andrew Or	ece00566e4	[SPARK-9561] Re-enable BroadcastJoinSuite We can do this now that SPARK-9580 is resolved. Author: Andrew Or <andrew@databricks.com> Closes #8208 from andrewor14/reenable-sql-tests.	2015-08-14 12:37:21 -07:00
Davies Liu	3bc5528722	[SPARK-9946] [SPARK-9589] [SQL] fix NPE and thread-safety in TaskMemoryManager Currently, we access the `page.pageNumber` after it's freed; it could be modified by another thread, causing an NPE. The same TaskMemoryManager could be used by multiple threads (for example, Python UDF and TransportScript), so it should be thread safe to allocate/free memory/page. The underlying Bitset and HashSet are not thread safe, so we should put them inside a synchronized block. cc JoshRosen Author: Davies Liu <davies@databricks.com> Closes #8177 from davies/memory_manager.	2015-08-14 12:32:35 -07:00
Neelesh Srinivas Salian	57c2d08800	[SPARK-9923] [CORE] ShuffleMapStage.numAvailableOutputs should be an Int instead of Long Modified type of ShuffleMapStage.numAvailableOutputs from Long to Int Author: Neelesh Srinivas Salian <nsalian@cloudera.com> Closes #8183 from nssalian/SPARK-9923.	2015-08-14 20:03:50 +01:00
Wenchen Fan	34d610be85	[SPARK-9929] [SQL] support metadata in withColumn In MLlib, sometimes we need to set metadata for the new column, thus we alias the new column with metadata before calling `withColumn`, and in `withColumn` we alias this column again. Here I overloaded `withColumn` to allow users to set metadata, just like what we did for `Column.as`. Author: Wenchen Fan <cloud0fan@outlook.com> Closes #8159 from cloud-fan/withColumn.	2015-08-14 12:00:01 -07:00
Holden Karau	a7317ccdc2	[SPARK-8744] [ML] Add a public constructor to StringIndexer It would be helpful to allow users to pass a pre-computed index to create an indexer, rather than always going through StringIndexer to create the model. Author: Holden Karau <holden@pigscanfly.ca> Closes #7267 from holdenk/SPARK-8744-StringIndexerModel-should-have-public-constructor.	2015-08-14 11:22:10 -07:00
Joseph K. Bradley	7ecf0c4699	[SPARK-9956] [ML] Make trees work with one-category features This modifies DecisionTreeMetadata construction to treat 1-category features as continuous, so that trees do not fail with such features. It is important for the pipelines API, where VectorIndexer can automatically categorize certain features as categorical. As stated in the JIRA, this is a temp fix which we can improve upon later by automatically filtering out those features. That will take longer, though, since it will require careful indexing. Targeted for 1.5 and master CC: manishamde mengxr yanboliang Author: Joseph K. Bradley <joseph@databricks.com> Closes #8187 from jkbradley/tree-1cat.	2015-08-14 10:48:02 -07:00
Xiangrui Meng	a0e1abbd01	[SPARK-9661] [MLLIB] minor clean-up of SPARK-9661 Some minor clean-ups after SPARK-9661. See my inline comments. MechCoder jkbradley Author: Xiangrui Meng <meng@databricks.com> Closes #8190 from mengxr/SPARK-9661-fix.	2015-08-14 10:25:11 -07:00
zsxwing	c8677d7366	[SPARK-9958] [SQL] Make HiveThriftServer2Listener thread-safe and update the tab name to "JDBC/ODBC Server" This PR fixes the thread-safety issue of HiveThriftServer2Listener, and also changes the tab name to "JDBC/ODBC Server" since it conflicts with the new SQL tab. Author: zsxwing <zsxwing@gmail.com> Closes #8185 from zsxwing/SPARK-9958.	2015-08-14 14:41:53 +08:00
Liang-Chi Hsieh	7c7c7529a1	[MINOR] [SQL] Remove canEqual in Row As `InternalRow` does not extend `Row` now, I think we can remove it. Author: Liang-Chi Hsieh <viirya@appier.com> Closes #8170 from viirya/remove_canequal.	2015-08-13 22:06:09 -07:00
Davies Liu	bd35385d53	[SPARK-9945] [SQL] pageSize should be calculated from executor.memory Currently, pageSize of TungstenSort is calculated from driver.memory, it should use executor.memory instead. Also, in the worst case, the safeFactor could be 4 (because of rounding), increase it to 16. cc rxin Author: Davies Liu <davies@databricks.com> Closes #8175 from davies/page_size.	2015-08-13 21:12:59 -07:00
Andrew Or	8187b3ae47	[SPARK-9580] [SQL] Replace singletons in SQL tests A fundamental limitation of the existing SQL tests is that there is simply no way to create your own `SparkContext`. This is a serious limitation because the user may wish to use a different master or config. As a case in point, `BroadcastJoinSuite` is entirely commented out because there is no way to make it pass with the existing infrastructure. This patch removes the singletons `TestSQLContext` and `TestData`, and instead introduces a `SharedSQLContext` that starts a context per suite. Unfortunately the singletons were so ingrained in the SQL tests that this patch necessarily needed to touch all the SQL test files. Author: Andrew Or <andrew@databricks.com> Closes #8111 from andrewor14/sql-tests-refactor.	2015-08-13 17:42:01 -07:00
Davies Liu	c50f97dafd	[SPARK-9943] [SQL] deserialized UnsafeHashedRelation should be serializable When the free memory in an executor goes low, the cached broadcast objects need to be serialized to disk, but currently the deserialized UnsafeHashedRelation can't be serialized and fails with an NPE. This PR fixes that. cc rxin Author: Davies Liu <davies@databricks.com> Closes #8174 from davies/serialize_hashed.	2015-08-13 17:35:11 -07:00
Davies Liu	693949ba40	[SPARK-8976] [PYSPARK] fix open mode in python3 This bug only happens on Python 3 and Windows. I tested this manually with Python 3 and the Python daemon disabled; no unit test yet. Author: Davies Liu <davies@databricks.com> Closes #8181 from davies/open_mode.	2015-08-13 17:33:37 -07:00
Xiangrui Meng	6c5858bc65	[SPARK-9922] [ML] rename StringIndexerReverse to IndexToString What `StringIndexerInverse` does is not strictly associated with `StringIndexer`, and the name is not clearly describing the transformation. Renaming to `IndexToString` might be better. ~~I also changed `invert` to `inverse` without arguments. `inputCol` and `outputCol` could be set after.~~ I also removed `invert`. jkbradley holdenk Author: Xiangrui Meng <meng@databricks.com> Closes #8152 from mengxr/SPARK-9922.	2015-08-13 16:52:17 -07:00
hyukjinkwon	c2520f501a	[SPARK-9935] [SQL] EqualNotNull not processed in ORC https://issues.apache.org/jira/browse/SPARK-9935 Author: hyukjinkwon <gurwls223@gmail.com> Closes #8163 from HyukjinKwon/master.	2015-08-13 16:07:03 -07:00
Davies Liu	a8d2f4c5f9	[SPARK-9942] [PYSPARK] [SQL] ignore exceptions while trying to import pandas If pandas is broken (it can't be imported and raises exceptions other than ImportError), pyspark can't be imported either; we should ignore all the exceptions. Author: Davies Liu <davies@databricks.com> Closes #8173 from davies/fix_pandas.	2015-08-13 14:03:55 -07:00
MechCoder	864de8eaf4	[SPARK-9661] [MLLIB] [ML] Java compatibility I skimmed through the docs for various instances of Object and replaced them with Java compatible versions of the same. 1. Some methods in LDAModel. 2. runMiniBatchSGD 3. kolmogorovSmirnovTest Author: MechCoder <manojkumarsivaraj334@gmail.com> Closes #8126 from MechCoder/java_incop.	2015-08-13 13:42:35 -07:00
Andrew Or	8815ba2f67	[SPARK-9649] Fix MasterSuite, third time's a charm This particular test did not load the default configurations so it continued to start the REST server, which causes port bind exceptions.	2015-08-13 11:31:10 -07:00
Xiangrui Meng	65fec798ce	[MINOR] [DOC] fix mllib pydoc warnings Switch to correct Sphinx syntax. MechCoder Author: Xiangrui Meng <meng@databricks.com> Closes #8169 from mengxr/mllib-pydoc-fix.	2015-08-13 10:16:40 -07:00
Yanbo Liang	4b70798c96	[MINOR] [ML] change MultilayerPerceptronClassifierModel to MultilayerPerceptronClassificationModel To follow the naming rule of ML, change `MultilayerPerceptronClassifierModel` to `MultilayerPerceptronClassificationModel` like `DecisionTreeClassificationModel`, `GBTClassificationModel` and so on. Author: Yanbo Liang <ybliang8@gmail.com> Closes #8164 from yanboliang/mlp-name.	2015-08-13 09:31:14 -07:00
Rosstin	7a539ef3b1	[SPARK-8965] [DOCS] Add ml-guide Python Example: Estimator, Transformer, and Param Added ml-guide Python Example: Estimator, Transformer, and Param /docs/_site/ml-guide.html Author: Rosstin <asterazul@gmail.com> Closes #8081 from Rosstin/SPARK-8965.	2015-08-13 09:18:39 -07:00
lewuathe	2932e25da4	[SPARK-9073] [ML] spark.ml Models copy() should call setParent when there is a parent Copied ML models must have the same parent of original ones Author: lewuathe <lewuathe@me.com> Author: Lewuathe <lewuathe@me.com> Closes #7447 from Lewuathe/SPARK-9073.	2015-08-13 09:17:19 -07:00
Cheng Lian	6993031011	[SPARK-9757] [SQL] Fixes persistence of Parquet relation with decimal column PR #7967 enables us to save data source relations to metastore in Hive compatible format when possible. But it fails to persist Parquet relations with decimal column(s) to Hive metastore of versions lower than 1.2.0. This is because `ParquetHiveSerDe` in Hive versions prior to 1.2.0 doesn't support decimal. This PR checks for this case and falls back to Spark SQL specific metastore table format. Author: Yin Huai <yhuai@databricks.com> Author: Cheng Lian <lian@databricks.com> Closes #8130 from liancheng/spark-9757/old-hive-parquet-decimal.	2015-08-13 16:16:50 +08:00
Yin Huai	84a27916a6	[SPARK-9885] [SQL] Also pass barrierPrefixes and sharedPrefixes to IsolatedClientLoader when hiveMetastoreJars is set to maven. https://issues.apache.org/jira/browse/SPARK-9885 cc marmbrus liancheng Author: Yin Huai <yhuai@databricks.com> Closes #8158 from yhuai/classloaderMaven.	2015-08-13 15:08:57 +08:00
Xiangrui Meng	68f9957149	[SPARK-9918] [MLLIB] remove runs from k-means and rename epsilon to tol This requires some discussion. I'm not sure whether `runs` is a useful parameter. It certainly complicates the implementation. We might want to optimize the k-means implementation with block matrix operations. In this case, having `runs` may not be worth the trade-off. Also it increases the communication cost in a single job, which might cause other issues. This PR also renames `epsilon` to `tol` to have consistent naming among algorithms. The Python constructor is updated to include all parameters. jkbradley yu-iskw Author: Xiangrui Meng <meng@databricks.com> Closes #8148 from mengxr/SPARK-9918 and squashes the following commits: 149b9e5 [Xiangrui Meng] fix constructor in Python and rename epsilon to tol 3cc15b3 [Xiangrui Meng] fix test and change initStep to initSteps in python a0a0274 [Xiangrui Meng] remove runs from k-means in the pipeline API	2015-08-12 23:04:59 -07:00
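
The SPARK-9805 entry above proposes replacing a fixed-length wait with one that has an upper bound but can stop early once a termination condition holds. A minimal sketch of that idea follows; the name `eventually`, its parameters, and the `batch_processed` condition are hypothetical and chosen for illustration only, so this is not the actual helper added to the PySpark tests.

```python
import time


def eventually(condition, timeout=30.0, poll_interval=0.01):
    """Poll `condition` until it returns True or `timeout` seconds elapse.

    Hypothetical helper for illustration; not the actual PySpark test utility.
    """
    deadline = time.monotonic() + timeout
    while True:
        # Stop early as soon as the termination condition holds.
        if condition():
            return True
        # Otherwise give up once the upper bound on waiting is reached.
        if time.monotonic() >= deadline:
            raise AssertionError("condition not met within %.1f s" % timeout)
        time.sleep(poll_interval)


# Usage: the condition succeeds on the third poll, so the wait ends
# long before the 5-second timeout.
calls = []


def batch_processed():
    calls.append(1)
    return len(calls) >= 3


result = eventually(batch_processed, timeout=5.0)
```

With this shape the timeout can be made generous to reduce flakiness, while tests that finish quickly do not pay for the full waiting period.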


12581 commits