ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Shixiong Zhu	16ba71aba4	[SPARK-15416][SQL] Display a better message for not finding classes removed in Spark 2.0 ## What changes were proposed in this pull request? If finding `NoClassDefFoundError` or `ClassNotFoundException`, check if the class name is removed in Spark 2.0. If so, the user must be using an incompatible library and we can provide a better message. ## How was this patch tested? 1. Run `bin/pyspark --packages com.databricks:spark-avro_2.10:2.0.1` 2. type `sqlContext.read.format("com.databricks.spark.avro").load("src/test/resources/episodes.avro")`. It will show `java.lang.ClassNotFoundException: org.apache.spark.sql.sources.HadoopFsRelationProvider is removed in Spark 2.0. Please check if your library is compatible with Spark 2.0` Author: Shixiong Zhu <shixiong@databricks.com> Closes #13201 from zsxwing/better-message.	2016-05-19 18:31:05 -07:00
Yanbo Liang	6643677817	[MINOR][ML][PYSPARK] ml.evaluation Scala and Python API sync ## What changes were proposed in this pull request? ```ml.evaluation``` Scala and Python API sync. ## How was this patch tested? Only API docs change, no new tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #13195 from yanboliang/evaluation-doc.	2016-05-19 17:56:21 -07:00
Yanbo Liang	f8107c7846	[SPARK-15341][DOC][ML] Add documentation for "model.write" to clarify "summary" was not saved ## What changes were proposed in this pull request? Currently in ```model.write```, we don't save ```summary```(if applicable). We should add documentation to clarify it. We fixed the incorrect link ```[[MLWriter]]``` to ```[[org.apache.spark.ml.util.MLWriter]]``` BTW. ## How was this patch tested? Documentation update, no unit test. Author: Yanbo Liang <ybliang8@gmail.com> Closes #13131 from yanboliang/spark-15341.	2016-05-19 17:54:18 -07:00
jerryshao	dcf407de67	[SPARK-15375][SQL][STREAMING] Add ConsoleSink to structure streaming ## What changes were proposed in this pull request? Add ConsoleSink to structure streaming, user could use it to display dataframes on the console (useful for debugging and demostrating), similar to the functionality of `DStream#print`, to use it: ``` val query = result.write .format("console") .trigger(ProcessingTime("2 seconds")) .startStream() ``` ## How was this patch tested? local verified. Not sure it is suitable to add into structure streaming, please review and help to comment, thanks a lot. Author: jerryshao <sshao@hortonworks.com> Closes #13162 from jerryshao/SPARK-15375.	2016-05-19 17:42:59 -07:00
Sandeep Singh	ef43a5fe51	[SPARK-15414][MLLIB] Make the mllib,ml linalg type conversion APIs public ## What changes were proposed in this pull request? Open up APIs for converting between new, old linear algebra types (in spark.mllib.linalg): `Sparse`/`Dense` X `Vector`/`Matrices` `.asML` and `.fromML` ## How was this patch tested? Existing Tests Author: Sandeep Singh <sandeep@techaddict.me> Closes #13202 from techaddict/SPARK-15414.	2016-05-19 17:24:42 -07:00
Yanbo Liang	59e6c5560d	[SPARK-15361][ML] ML 2.0 QA: Scala APIs audit for ml.clustering ## What changes were proposed in this pull request? Audit Scala API for ml.clustering. Fix some wrong API documentations and update outdated one. ## How was this patch tested? Existing unit tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #13148 from yanboliang/spark-15361.	2016-05-19 13:26:41 -07:00
DB Tsai	5255e55c84	[SPARK-15411][ML] Add @since to ml.stat.MultivariateOnlineSummarizer.scala ## What changes were proposed in this pull request? Add since to ml.stat.MultivariateOnlineSummarizer.scala ## How was this patch tested? unit tests Author: DB Tsai <dbt@netflix.com> Closes #13197 from dbtsai/cleanup.	2016-05-19 13:10:51 -07:00
Shixiong Zhu	305263954a	Fix the compiler error introduced by #13153 for Scala 2.10	2016-05-19 12:36:44 -07:00
Davies Liu	5ccecc078a	[SPARK-15392][SQL] fix default value of size estimation of logical plan ## What changes were proposed in this pull request? We use autoBroadcastJoinThreshold + 1L as the default value of size estimation, that is not good in 2.0, because we will calculate the size based on size of schema, then the estimation could be less than autoBroadcastJoinThreshold if you have an SELECT on top of an DataFrame created from RDD. This PR change the default value to Long.MaxValue. ## How was this patch tested? Added regression tests. Author: Davies Liu <davies@databricks.com> Closes #13183 from davies/fix_default_size.	2016-05-19 12:12:42 -07:00
Shixiong Zhu	4e3cb7a5d9	[SPARK-15317][CORE] Don't store accumulators for every task in listeners ## What changes were proposed in this pull request? In general, the Web UI doesn't need to store the Accumulator/AccumulableInfo for every task. It only needs the Accumulator values. In this PR, it creates new UIData classes to store the necessary fields and make `JobProgressListener` store only these new classes, so that `JobProgressListener` won't store Accumulator/AccumulableInfo and the size of `JobProgressListener` becomes pretty small. I also eliminates `AccumulableInfo` from `SQLListener` so that we don't keep any references for those unused `AccumulableInfo`s. ## How was this patch tested? I ran two tests reported in JIRA locally: The first one is: ``` val data = spark.range(0, 10000, 1, 10000) data.cache().count() ``` The retained size of JobProgressListener decreases from 60.7M to 6.9M. The second one is: ``` import org.apache.spark.ml.CC import org.apache.spark.sql.SQLContext val sqlContext = SQLContext.getOrCreate(sc) CC.runTest(sqlContext) ``` This test won't cause OOM after applying this patch. Author: Shixiong Zhu <shixiong@databricks.com> Closes #13153 from zsxwing/memory.	2016-05-19 12:05:17 -07:00
Cheng Lian	6ac1c3a040	[SPARK-14346][SQL] Lists unsupported Hive features in SHOW CREATE TABLE output ## What changes were proposed in this pull request? This PR is a follow-up of #13079. It replaces `hasUnsupportedFeatures: Boolean` in `CatalogTable` with `unsupportedFeatures: Seq[String]`, which contains unsupported Hive features of the underlying Hive table. In this way, we can accurately report all unsupported Hive features in the exception message. ## How was this patch tested? Updated existing test case to check exception message. Author: Cheng Lian <lian@databricks.com> Closes #13173 from liancheng/spark-14346-follow-up.	2016-05-19 12:02:41 -07:00
Holden Karau	e71cd96bf7	[SPARK-15316][PYSPARK][ML] Add linkPredictionCol to GeneralizedLinearRegression ## What changes were proposed in this pull request? Add linkPredictionCol to GeneralizedLinearRegression and fix the PyDoc to generate the bullet list ## How was this patch tested? doctests & built docs locally Author: Holden Karau <holden@us.ibm.com> Closes #13106 from holdenk/SPARK-15316-add-linkPredictionCol-toGeneralizedLinearRegression.	2016-05-19 20:59:19 +02:00
hyukjinkwon	f5065abf49	[SPARK-15322][SQL][FOLLOW-UP] Update deprecated accumulator usage into accumulatorV2 ## What changes were proposed in this pull request? This PR corrects another case that uses deprecated `accumulableCollection` to use `listAccumulator`, which seems the previous PR missed. Since `ArrayBuffer[InternalRow].asJava` is `java.util.List[InternalRow]`, it seems ok to replace the usage. ## How was this patch tested? Related existing tests `InMemoryColumnarQuerySuite` and `CachedTableSuite`. Author: hyukjinkwon <gurwls223@gmail.com> Closes #13187 from HyukjinKwon/SPARK-15322.	2016-05-19 11:54:50 -07:00
Kousuke Saruta	faafd1e9db	[SPARK-15387][SQL] SessionCatalog in SimpleAnalyzer does not need to make database directory. ## What changes were proposed in this pull request? After #12871 is fixed, we are forced to make `/user/hive/warehouse` when SimpleAnalyzer is used but SimpleAnalyzer may not need the directory. ## How was this patch tested? Manual test. Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #13175 from sarutak/SPARK-15387.	2016-05-19 11:51:59 -07:00
Davies Liu	ad182086cc	[SPARK-15300] Fix writer lock conflict when remove a block ## What changes were proposed in this pull request? A writer lock could be acquired when 1) create a new block 2) remove a block 3) evict a block to disk. 1) and 3) could happen in the same time within the same task, all of them could happen in the same time outside a task. It's OK that when someone try to grab the write block for a block, but the block is acquired by another one that has the same task attempt id. This PR remove the check. ## How was this patch tested? Updated existing tests. Author: Davies Liu <davies@databricks.com> Closes #13082 from davies/write_lock_conflict.	2016-05-19 11:47:17 -07:00
gatorsmile	ef7a5e0bca	[SPARK-14603][SQL][FOLLOWUP] Verification of Metadata Operations by Session Catalog #### What changes were proposed in this pull request? This follow-up PR is to address the remaining comments in https://github.com/apache/spark/pull/12385 The major change in this PR is to issue better error messages in PySpark by using the mechanism that was proposed by davies in https://github.com/apache/spark/pull/7135 For example, in PySpark, if we input the following statement: ```python >>> l = [('Alice', 1)] >>> df = sqlContext.createDataFrame(l) >>> df.createTempView("people") >>> df.createTempView("people") ``` Before this PR, the exception we will get is like ``` Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/Users/xiaoli/IdeaProjects/sparkDelivery/python/pyspark/sql/dataframe.py", line 152, in createTempView self._jdf.createTempView(name) File "/Users/xiaoli/IdeaProjects/sparkDelivery/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 933, in __call__ File "/Users/xiaoli/IdeaProjects/sparkDelivery/python/pyspark/sql/utils.py", line 63, in deco return f(a, *kw) File "/Users/xiaoli/IdeaProjects/sparkDelivery/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 312, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o35.createTempView. : org.apache.spark.sql.catalyst.analysis.TempTableAlreadyExistsException: Temporary table 'people' already exists; at org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTempView(SessionCatalog.scala:324) at org.apache.spark.sql.SparkSession.createTempView(SparkSession.scala:523) at org.apache.spark.sql.Dataset.createTempView(Dataset.scala:2328) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at py4j.Gateway.invoke(Gateway.java:280) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:211) at java.lang.Thread.run(Thread.java:745) ``` After this PR, the exception we will get become cleaner: ``` Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/Users/xiaoli/IdeaProjects/sparkDelivery/python/pyspark/sql/dataframe.py", line 152, in createTempView self._jdf.createTempView(name) File "/Users/xiaoli/IdeaProjects/sparkDelivery/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 933, in __call__ File "/Users/xiaoli/IdeaProjects/sparkDelivery/python/pyspark/sql/utils.py", line 75, in deco raise AnalysisException(s.split(': ', 1)[1], stackTrace) pyspark.sql.utils.AnalysisException: u"Temporary table 'people' already exists;" ``` #### How was this patch tested? Fixed an existing PySpark test case Author: gatorsmile <gatorsmile@gmail.com> Closes #13126 from gatorsmile/followup-14684.	2016-05-19 11:46:11 -07:00
Davies Liu	9308bf1192	[SPARK-15390] fix broadcast with 100 millions rows ## What changes were proposed in this pull request? When broadcast a table with more than 100 millions rows (should not ideally), the size of needed memory will overflow. This PR fix the overflow by converting it to Long when calculating the size of memory. Also add more checking in broadcast to show reasonable messages. ## How was this patch tested? Add test. Author: Davies Liu <davies@databricks.com> Closes #13182 from davies/fix_broadcast.	2016-05-19 11:45:18 -07:00
Pravin Gadakh	31f63ac25d	[SPARK-14613][ML] Add @Since into the matrix and vector classes in spark-mllib-local ## What changes were proposed in this pull request? This PR add `Since` annotations in `Vectors.scala` and `Matrices.scala` of spark-mllib-local. ## How was this patch tested? Scala Style Checks. Author: Pravin Gadakh <prgadakh@in.ibm.com> Closes #13191 from pravingadakh/SPARK-14613.	2016-05-19 10:59:07 -07:00
Yanbo Liang	8ecf7f77b2	[SPARK-15292][ML] ML 2.0 QA: Scala APIs audit for classification ## What changes were proposed in this pull request? Audit Scala API for classification, almost all issues were related ```MultilayerPerceptronClassifier``` in this section. * Fix one wrong param getter function: ```getOptimizer``` -> ```getSolver``` * Add missing setter function for ```solver``` and ```stepSize```. * Make ```GD``` solver take effect. * Update docs, annotations and fix other minor issues. ## How was this patch tested? Existing unit tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #13076 from yanboliang/spark-15292.	2016-05-19 10:27:17 -07:00
Yanbo Liang	1052d3644d	[SPARK-15362][ML] Make spark.ml KMeansModel load backwards compatible ## What changes were proposed in this pull request? [SPARK-14646](https://issues.apache.org/jira/browse/SPARK-14646) makes ```KMeansModel``` store the cluster centers one per row. ```KMeansModel.load()``` method needs to be updated in order to load models saved with Spark 1.6. ## How was this patch tested? Since ```save/load``` is ```Experimental``` for 1.6, I think offline test for backwards compatibility is enough. Author: Yanbo Liang <ybliang8@gmail.com> Closes #13149 from yanboliang/spark-15362.	2016-05-19 10:25:33 -07:00
Sandeep Singh	3facca5152	[CORE][MINOR] Remove redundant set master in OutputCommitCoordinatorIntegrationSuite ## What changes were proposed in this pull request? Remove redundant set master in OutputCommitCoordinatorIntegrationSuite, as we are already setting it in SparkContext below on line 43. ## How was this patch tested? existing tests Author: Sandeep Singh <sandeep@techaddict.me> Closes #13168 from techaddict/minor-1.	2016-05-19 10:44:26 +01:00
Dongjoon Hyun	5907ebfc11	[SPARK-14939][SQL] Add FoldablePropagation optimizer ## What changes were proposed in this pull request? This PR aims to add new FoldablePropagation optimizer that propagates foldable expressions by replacing all attributes with the aliases of original foldable expression. Other optimizations will take advantage of the propagated foldable expressions: e.g. `EliminateSorts` optimizer now can handle the following Case 2 and 3. (Case 1 is the previous implementation.) 1. Literals and foldable expression, e.g. "ORDER BY 1.0, 'abc', Now()" 2. Foldable ordinals, e.g. "SELECT 1.0, 'abc', Now() ORDER BY 1, 2, 3" 3. Foldable aliases, e.g. "SELECT 1.0 x, 'abc' y, Now() z ORDER BY x, y, z" This PR has been generalized based on cloud-fan 's key ideas many times; he should be credited for the work he did. Before ``` scala> sql("SELECT 1.0, Now() x ORDER BY 1, x").explain == Physical Plan == WholeStageCodegen : +- Sort [1.0#5 ASC,x#0 ASC], true, 0 : +- INPUT +- Exchange rangepartitioning(1.0#5 ASC, x#0 ASC, 200), None +- WholeStageCodegen : +- Project [1.0 AS 1.0#5,1461873043577000 AS x#0] : +- INPUT +- Scan OneRowRelation[] ``` After ``` scala> sql("SELECT 1.0, Now() x ORDER BY 1, x").explain == Physical Plan == WholeStageCodegen : +- Project [1.0 AS 1.0#5,1461873079484000 AS x#0] : +- INPUT +- Scan OneRowRelation[] ``` ## How was this patch tested? Pass the Jenkins tests including a new test case. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #12719 from dongjoon-hyun/SPARK-14939.	2016-05-19 15:57:44 +08:00
hyukjinkwon	e2ec32dab8	[SPARK-15031][EXAMPLES][FOLLOW-UP] Make Python param example working with SparkSession ## What changes were proposed in this pull request? It seems most of Python examples were changed to use SparkSession by https://github.com/apache/spark/pull/12809. This PR said both examples below: - `simple_params_example.py` - `aft_survival_regression.py` are not changed because it dose not work. It seems `aft_survival_regression.py` is changed by https://github.com/apache/spark/pull/13050 but `simple_params_example.py` is not yet. This PR corrects the example and make this use SparkSession. In more detail, it seems `threshold` is replaced to `thresholds` here and there by `5a23213c14`. However, when it calls `lr.fit(training, paramMap)` this overwrites the values. So, `threshold` was 5 and `thresholds` becomes 5.5 (by `1 / (1 + thresholds(0) / thresholds(1)`). According to the comment below. this is not allowed, `354f8f11bd/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala (L58-L61)`. So, in this PR, it sets the equivalent value so that this does not throw an exception. ## How was this patch tested? Manully (`mvn package -DskipTests && spark-submit simple_params_example.py`) Author: hyukjinkwon <gurwls223@gmail.com> Closes #13135 from HyukjinKwon/SPARK-15031.	2016-05-19 08:52:41 +02:00
Wenchen Fan	661c21049b	[SPARK-15381] [SQL] physical object operator should define reference correctly ## What changes were proposed in this pull request? Whole Stage Codegen depends on `SparkPlan.reference` to do some optimization. For physical object operators, they should be consistent with their logical version and set the `reference` correctly. ## How was this patch tested? new test in DatasetSuite Author: Wenchen Fan <wenchen@databricks.com> Closes #13167 from cloud-fan/bug.	2016-05-18 21:43:07 -07:00
Shixiong Zhu	5c9117a3ed	[SPARK-15395][CORE] Use getHostString to create RpcAddress ## What changes were proposed in this pull request? Right now the netty RPC uses `InetSocketAddress.getHostName` to create `RpcAddress` for network events. If we use an IP address to connect, then the RpcAddress's host will be a host name (if the reverse lookup successes) instead of the IP address. However, some places need to compare the original IP address and the RpcAddress in `onDisconnect` (e.g., CoarseGrainedExecutorBackend), and this behavior will make the check incorrect. This PR uses `getHostString` to resolve the issue. ## How was this patch tested? Jenkins unit tests. Author: Shixiong Zhu <shixiong@databricks.com> Closes #13185 from zsxwing/host-string.	2016-05-18 20:15:00 -07:00
Bryan Cutler	b1bc5ebdd5	[DOC][MINOR] ml.feature Scala and Python API sync ## What changes were proposed in this pull request? I reviewed Scala and Python APIs for ml.feature and corrected discrepancies. ## How was this patch tested? Built docs locally, ran style checks Author: Bryan Cutler <cutlerb@gmail.com> Closes #13159 from BryanCutler/ml.feature-api-sync.	2016-05-19 04:48:36 +02:00
Reynold Xin	4987f39ac7	[SPARK-14463][SQL] Document the semantics for read.text ## What changes were proposed in this pull request? This patch is a follow-up to https://github.com/apache/spark/pull/13104 and adds documentation to clarify the semantics of read.text with respect to partitioning. ## How was this patch tested? N/A Author: Reynold Xin <rxin@databricks.com> Closes #13184 from rxin/SPARK-14463.	2016-05-18 19:16:28 -07:00
gatorsmile	9c2a376e41	[SPARK-15297][SQL] Fix Set -V Command #### What changes were proposed in this pull request? The command `SET -v` always outputs the default values even if we set the parameter. This behavior is incorrect. Instead, if users override it, we should output the user-specified value. In addition, the output schema of `SET -v` is wrong. We should use the column `value` instead of `default` for the parameter value. This PR is to fix the above two issues. #### How was this patch tested? Added a test case. Author: gatorsmile <gatorsmile@gmail.com> Closes #13081 from gatorsmile/setVcommand.	2016-05-19 10:05:53 +08:00
Wenchen Fan	ebfe3a1f2c	[SPARK-15192][SQL] null check for SparkSession.createDataFrame ## What changes were proposed in this pull request? This PR adds null check in `SparkSession.createDataFrame`, so that we can make sure the passed in rows matches the given schema. ## How was this patch tested? new tests in `DatasetSuite` Author: Wenchen Fan <wenchen@databricks.com> Closes #13008 from cloud-fan/row-encoder.	2016-05-18 18:06:38 -07:00
Jurriaan Pruis	32be51fba4	[SPARK-15323][SPARK-14463][SQL] Fix reading of partitioned format=text datasets https://issues.apache.org/jira/browse/SPARK-15323 I was using partitioned text datasets in Spark 1.6.1 but it broke in Spark 2.0.0. It would be logical if you could also write those, but not entirely sure how to solve this with the new DataSet implementation. Also it doesn't work using `sqlContext.read.text`, since that method returns a `DataSet[String]`. See https://issues.apache.org/jira/browse/SPARK-14463 for that issue. Author: Jurriaan Pruis <email@jurriaanpruis.nl> Closes #13104 from jurriaan/fix-partitioned-text-reads.	2016-05-18 16:15:09 -07:00
Davies Liu	84b23453dd	Revert "[SPARK-15392][SQL] fix default value of size estimation of logical plan" This reverts commit `fc29b896da`.	2016-05-18 16:02:52 -07:00
Davies Liu	fc29b896da	[SPARK-15392][SQL] fix default value of size estimation of logical plan ## What changes were proposed in this pull request? We use autoBroadcastJoinThreshold + 1L as the default value of size estimation, that is not good in 2.0, because we will calculate the size based on size of schema, then the estimation could be less than autoBroadcastJoinThreshold if you have an SELECT on top of an DataFrame created from RDD. This PR change the default value to Long.MaxValue. ## How was this patch tested? Added regression tests. Author: Davies Liu <davies@databricks.com> Closes #13179 from davies/fix_default_size.	2016-05-18 15:45:59 -07:00
Dongjoon Hyun	cc6a47dd81	[SPARK-15373][WEB UI] Spark UI should show consistent timezones. ## What changes were proposed in this pull request? Currently, SparkUI shows two timezones in a single page when the timezone of browser is different from the server JVM timezone. The following is an example on Databricks CE which uses 'Etc/UTC' timezone. - The time of `submitted` column of list and pop-up description shows `2016/05/18 00:03:07` - The time of `timeline chart` shows `2016/05/17 17:03:07`. ![Different Timezone](https://issues.apache.org/jira/secure/attachment/12804553/12804553_timezone.png) This PR fixes the timeline chart to use the same timezone by the followings. - Upgrade `vis` from 3.9.0(2015-01-16) to 4.16.1(2016-04-18) - Override `moment` of `vis` to get `offset` - Update `AllJobsPage`, `JobPage`, and `StagePage`. ## How was this patch tested? Manual. Run the following command and see the Spark UI's event timelines. ``` $ SPARK_SUBMIT_OPTS="-Dscala.usejavacp=true -Duser.timezone=Etc/UTC" bin/spark-submit --class org.apache.spark.repl.Main ... scala> sql("select 1").head ``` Author: Dongjoon Hyun <dongjoon@apache.org> Closes #13158 from dongjoon-hyun/SPARK-15373.	2016-05-18 23:19:55 +01:00
Sean Owen	4768d037b7	[SPARK-15386][CORE] Master doesn't compile against Java 1.7 / Process.isAlive ## What changes were proposed in this pull request? Remove call to Process.isAlive -- Java 8 only. Introduced in https://github.com/apache/spark/pull/13042 / SPARK-15263 ## How was this patch tested? Jenkins tests Author: Sean Owen <sowen@cloudera.com> Closes #13174 from srowen/SPARK-15386.	2016-05-18 14:22:21 -07:00
Nick Pentreath	e8b79afa02	[SPARK-14891][ML] Add schema validation for ALS This PR adds schema validation to `ml`'s ALS and ALSModel. Currently, no schema validation was performed as `transformSchema` was never called in `ALS.fit` or `ALSModel.transform`. Furthermore, due to no schema validation, if users passed in Long (or Float etc) ids, they would be silently cast to Int with no warning or error thrown. With this PR, ALS now supports all numeric types for `user`, `item`, and `rating` columns. The rating column is cast to `Float` and the user and item cols are cast to `Int` (as is the case currently) - however for user/item, the cast throws an error if the value is outside integer range. Behavior for rating col is unchanged (as it is not an issue). ## How was this patch tested? New test cases in `ALSSuite`. Author: Nick Pentreath <nickp@za.ibm.com> Closes #12762 from MLnick/SPARK-14891-als-validate-schema.	2016-05-18 21:13:12 +02:00
Liang-Chi Hsieh	3d1e67f903	[SPARK-15342] [SQL] [PYSPARK] PySpark test for non ascii column name does not actually test with unicode column name ## What changes were proposed in this pull request? The PySpark SQL `test_column_name_with_non_ascii` wants to test non-ascii column name. But it doesn't actually test it. We need to construct an unicode explicitly using `unicode` under Python 2. ## How was this patch tested? Existing tests. Author: Liang-Chi Hsieh <simonh@tw.ibm.com> Closes #13134 from viirya/correct-non-ascii-colname-pytest.	2016-05-18 11:18:33 -07:00
Davies Liu	8fb1d1c7f3	[SPARK-15357] Cooperative spilling should check consumer memory mode ## What changes were proposed in this pull request? Since we support forced spilling for Spillable, which only works in OnHeap mode, different from other SQL operators (could be OnHeap or OffHeap), we should considering the mode of consumer before calling trigger forced spilling. ## How was this patch tested? Add new test. Author: Davies Liu <davies@databricks.com> Closes #13151 from davies/fix_mode.	2016-05-18 09:44:21 -07:00
Tejas Patil	c1fd9cacba	[SPARK-15263][CORE] Make shuffle service dir cleanup faster by using `rm -rf` ## What changes were proposed in this pull request? Jira: https://issues.apache.org/jira/browse/SPARK-15263 The current logic for directory cleanup is slow because it does directory listing, recurses over child directories, checks for symbolic links, deletes leaf files and finally deletes the dirs when they are empty. There is back-and-forth switching from kernel space to user space while doing this. Since most of the deployment backends would be Unix systems, we could essentially just do `rm -rf` so that entire deletion logic runs in kernel space. The current Java based impl in Spark seems to be similar to what standard libraries like guava and commons IO do (eg. http://svn.apache.org/viewvc/commons/proper/io/trunk/src/main/java/org/apache/commons/io/FileUtils.java?view=markup#l1540). However, guava removed this method in favour of shelling out to an operating system command (like in this PR). See the `Deprecated` note in older javadocs for guava for details : http://google.github.io/guava/releases/10.0.1/api/docs/com/google/common/io/Files.html#deleteRecursively(java.io.File) Ideally, Java should be providing such APIs so that users won't have to do such things to get platform specific code. Also, its not just about speed, but also handling race conditions while doing at FS deletions is tricky. I could find this bug for Java in similar context : http://bugs.java.com/bugdatabase/view_bug.do?bug_id=7148952 ## How was this patch tested? I am relying on existing test cases to test the method. If there are suggestions about testing it, welcome to hear about it. ## Performance gains Input setup : Created a nested directory structure of depth 3 and each entry having 50 sub-dirs. The input being cleaned up had total ~125k dirs. Ran both approaches (in isolation) for 6 times to get average numbers: Native Java cleanup \| `rm -rf` as a separate process ------------ \| ------------- 10.04 sec \| 4.11 sec This change made deletion 2.4 times faster for the given test input. Author: Tejas Patil <tejasp@fb.com> Closes #13042 from tejasapatil/delete_recursive.	2016-05-18 12:10:32 +01:00
DLucky	420b700695	[SPARK-15346][MLLIB] Reduce duplicate computation in picking initial points mateiz srowen I state that the contribution is my original work and that I license the work to the project under the project's open source license There's some format problems with my last PR, with HyukjinKwon 's help I read the guidance, re-check my code and PR, then run the tests, finally re-submit the PR request here. The related JIRA issue though marked as resolved, this change may relate to it I think. ## Proposed Change After picking each new initial centers, it's unnecessary to compute the distances between all the points and the old ones. Instead this change keeps the distance between all the points and their closest centers, and compare to the distance of them with the new center then update them. ## Test result One can find an easy test way in (https://issues.apache.org/jira/browse/SPARK-6706) I test the KMeans++ method for a small dataset with 16k points, and the whole KMeans\|\| with a large one with 240k points. The data has 4096 features and I tunes K from 100 to 500. The test environment was on my 4 machine cluster, I also tested a 3M points data on a larger cluster with 25 machines and got similar results, which I would not draw the detail curve. The result of the first two exps are shown below ### Local KMeans++ test: Dataset:4m_ini_center Data_size:16234 Dimension:4096 Lloyd's Iteration = 10 The y-axis is time in sec, the x-axis is tuning the K. ![image](https://cloud.githubusercontent.com/assets/10915169/15175831/d0c92b82-179a-11e6-8b68-4e165fc2fdff.png) ![local_total](https://cloud.githubusercontent.com/assets/10915169/15175957/6b21c3b0-179b-11e6-9741-66dfe4e23eb7.jpg) ### On a larger dataset An improve show in the graph but not commit in this file: In this experiment I also have an improvement for calculation in normalization data (the distance is convert to the cosine distance). As if the data is normalized into (0,1), one improvement in the original vesion for util.MLUtils.fastSauaredDistance would have no effect (the precisionBound 2.0 * EPSILON * sumSquaredNorm / (normDiff * normDiff + EPSILON) will never less then precision in this case). Therefore I design an early terminal method when comparing two distance (used for findClosest). But I don't include this improve in this file, you may only refer to the curves without "normalize" for comparing the results. Dataset:4k24 Data_size:243960 Dimension:4096 Normlize Enlarge Initialize Lloyd's_Iteration NO 1 3 5 YES 10000 3 5 Notice: the normlized data is enlarged to ensure precision The cost time: x-for value of K, y-for time in sec ![4k24_total](https://cloud.githubusercontent.com/assets/10915169/15176635/9a54c0bc-179e-11e6-81c5-238e0c54bce2.jpg) SE for unnormalized data between two version, to ensure the correctness ![4k24_unnorm_se](https://cloud.githubusercontent.com/assets/10915169/15176661/b85dabc8-179e-11e6-9269-fe7d2101dd48.jpg) Here is the SE between normalized data just for reference, it's also correct. ![4k24_norm_se](https://cloud.githubusercontent.com/assets/10915169/15176742/1fbde940-179f-11e6-8290-d24b0dd4a4f7.jpg) Author: DLucky <mouendless@gmail.com> Closes #13133 from mouendless/patch-2.	2016-05-18 12:05:21 +01:00
Cheng Lian	c4a45fd855	[SPARK-15334][SQL][HOTFIX] Fixes compilation error for Scala 2.10 ## What changes were proposed in this pull request? This PR fixes a Scala 2.10 compilation failure introduced in PR #13127. ## How was this patch tested? Jenkins build. Author: Cheng Lian <lian@databricks.com> Closes #13166 from liancheng/hotfix-for-scala-2.10.	2016-05-18 18:58:24 +08:00
Dongjoon Hyun	d2f81df1ba	[MINOR][SQL] Remove unused pattern matching variables in Optimizers. ## What changes were proposed in this pull request? This PR removes unused pattern matching variable in Optimizers in order to improve readability. ## How was this patch tested? Pass the existing Jenkins tests. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #13145 from dongjoon-hyun/remove_unused_pattern_matching_variables.	2016-05-18 11:51:50 +01:00
WeichenXu	2f9047b5eb	[SPARK-15322][MLLIB][CORE][SQL] update deprecate accumulator usage into accumulatorV2 in spark project ## What changes were proposed in this pull request? I use Intellj-IDEA to search usage of deprecate SparkContext.accumulator in the whole spark project, and update the code.(except those test code for accumulator method itself) ## How was this patch tested? Exisiting unit tests Author: WeichenXu <WeichenXu123@outlook.com> Closes #13112 from WeichenXu123/update_accuV2_in_mllib.	2016-05-18 11:48:46 +01:00
Davies Liu	33814f887a	[SPARK-15307][SQL] speed up listing files for data source ## What changes were proposed in this pull request? Currently, listing files is very slow if there is thousands files, especially on local file system, because: 1) FileStatus.getPermission() is very slow on local file system, which is launch a subprocess and parse the stdout. 2) Create an JobConf is very expensive (ClassUtil.findContainingJar() is slow). This PR improve these by: 1) Use another constructor of LocatedFileStatus to avoid calling FileStatus.getPermission, the permissions are not used for data sources. 2) Only create an JobConf once within one task. ## How was this patch tested? Manually tests on a partitioned table with 1828 partitions, decrease the time to load the table from 22 seconds to 1.6 seconds (Most of time are spent in merging schema now). Author: Davies Liu <davies@databricks.com> Closes #13094 from davies/listing.	2016-05-18 18:46:57 +08:00
Sean Zhong	6e02aec44b	[SPARK-15334][SQL] HiveClient facade not compatible with Hive 0.12 ## What changes were proposed in this pull request? HiveClient facade is not compatible with Hive 0.12. This PR Fixes the following compatibility issues: 1. `org.apache.spark.sql.hive.client.HiveClientImpl` use `AddPartitionDesc(db, table, ignoreIfExists)` to create partitions, however, Hive 0.12 doesn't have this constructor for `AddPartitionDesc`. 2. `HiveClientImpl` uses `PartitionDropOptions` when dropping partition, however, class `PartitionDropOptions` doesn't exist in Hive 0.12. 3. Hive 0.12 doesn't support adding permanent functions. It is not valid to call `org.apache.hadoop.hive.ql.metadata.Hive.createFunction`, `org.apache.hadoop.hive.ql.metadata.Hive.alterFunction`, and `org.apache.hadoop.hive.ql.metadata.Hive.alterFunction` 4. `org.apache.spark.sql.hive.client.VersionsSuite` doesn't have enough test coverage for different hive versions 0.12, 0.13, 0.14, 1.0.0, 1.1.0, 1.2.0. ## How was this patch tested? Unit test. Author: Sean Zhong <seanzhong@databricks.com> Closes #13127 from clockfly/versionSuite.	2016-05-18 16:00:02 +08:00
Takuya Kuwahara	411c04adb5	[SPARK-14978][PYSPARK] PySpark TrainValidationSplitModel should support validationMetrics ## What changes were proposed in this pull request? This pull request includes supporting validationMetrics for TrainValidationSplitModel with Python and test for it. ## How was this patch tested? test in `python/pyspark/ml/tests.py` Author: Takuya Kuwahara <taakuu19@gmail.com> Closes #12767 from taku-k/spark-14978.	2016-05-18 08:29:47 +02:00
Yin Huai	2a5db9c140	[SPARK-14346] Fix scala-2.10 build ## What changes were proposed in this pull request? Scala 2.10 build was broken by #13079. I am reverting the change of that line. Author: Yin Huai <yhuai@databricks.com> Closes #13157 from yhuai/SPARK-14346-fix-scala2.10.	2016-05-17 18:02:31 -07:00
Sean Zhong	25b315e6ca	[SPARK-15171][SQL] Remove the references to deprecated method dataset.registerTempTable ## What changes were proposed in this pull request? Update the unit test code, examples, and documents to remove calls to deprecated method `dataset.registerTempTable`. ## How was this patch tested? This PR only changes the unit test code, examples, and comments. It should be safe. This is a follow up of PR https://github.com/apache/spark/pull/12945 which was merged. Author: Sean Zhong <seanzhong@databricks.com> Closes #13098 from clockfly/spark-15171-remove-deprecation.	2016-05-18 09:01:59 +08:00
Cheng Lian	b674e67c22	[SPARK-14346][SQL] Native SHOW CREATE TABLE for Hive tables/views ## What changes were proposed in this pull request? This is a follow-up of #12781. It adds native `SHOW CREATE TABLE` support for Hive tables and views. A new field `hasUnsupportedFeatures` is added to `CatalogTable` to indicate whether all table metadata retrieved from the concrete underlying external catalog (i.e. Hive metastore in this case) can be mapped to fields in `CatalogTable`. This flag is useful when the target Hive table contains structures that can't be handled by Spark SQL, e.g., skewed columns and storage handler, etc.. ## How was this patch tested? New test cases are added in `ShowCreateTableSuite` to do round-trip tests. Author: Cheng Lian <lian@databricks.com> Closes #13079 from liancheng/spark-14346-show-create-table-for-hive-tables.	2016-05-17 15:56:44 -07:00
Shixiong Zhu	8e8bc9f957	[SPARK-11735][CORE][SQL] Add a check in the constructor of SQLContext/SparkSession to make sure its SparkContext is not stopped ## What changes were proposed in this pull request? Add a check in the constructor of SQLContext/SparkSession to make sure its SparkContext is not stopped. ## How was this patch tested? Jenkins unit tests. Author: Shixiong Zhu <shixiong@databricks.com> Closes #13154 from zsxwing/check-spark-context-stop.	2016-05-17 14:57:21 -07:00
Dongjoon Hyun	0f576a5748	[SPARK-15244] [PYTHON] Type of column name created with createDataFrame is not consistent. ## What changes were proposed in this pull request? createDataFrame returns inconsistent types for column names. ```python >>> from pyspark.sql.types import StructType, StructField, StringType >>> schema = StructType([StructField(u"col", StringType())]) >>> df1 = spark.createDataFrame([("a",)], schema) >>> df1.columns # "col" is str ['col'] >>> df2 = spark.createDataFrame([("a",)], [u"col"]) >>> df2.columns # "col" is unicode [u'col'] ``` The reason is only StructField has the following code. ``` if not isinstance(name, str): name = name.encode('utf-8') ``` This PR adds the same logic into createDataFrame for consistency. ``` if isinstance(schema, list): schema = [x.encode('utf-8') if not isinstance(x, str) else x for x in schema] ``` ## How was this patch tested? Pass the Jenkins test (with new python doctest) Author: Dongjoon Hyun <dongjoon@apache.org> Closes #13097 from dongjoon-hyun/SPARK-15244.	2016-05-17 13:05:07 -07:00

1 2 3 4 5 ...

16363 commits