ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Davies Liu	37526aca24	[SPARK-10980] [SQL] fix bug in create Decimal The created decimal is wrong if using `Decimal(unscaled, precision, scale)` with unscaled > 1e18 and and precision > 18 and scale > 0. This bug exists since the beginning. Author: Davies Liu <davies@databricks.com> Closes #9014 from davies/fix_decimal.	2015-10-07 15:51:09 -07:00
Yanbo Liang	7bf07faa71	[SPARK-10490] [ML] Consolidate the Cholesky solvers in WeightedLeastSquares and ALS Consolidate the Cholesky solvers in WeightedLeastSquares and ALS. Author: Yanbo Liang <ybliang8@gmail.com> Closes #8936 from yanboliang/spark-10490.	2015-10-07 15:50:45 -07:00
Reynold Xin	6dbfd7ecf4	[SPARK-10982] [SQL] Rename ExpressionAggregate -> DeclarativeAggregate. DeclarativeAggregate matches more closely with ImperativeAggregate we already have. Author: Reynold Xin <rxin@databricks.com> Closes #9013 from rxin/SPARK-10982.	2015-10-07 15:38:46 -07:00
Evan Chen	da936fbb74	[SPARK-10779] [PYSPARK] [MLLIB] Set initialModel for KMeans model in PySpark (spark.mllib) Provide initialModel param for pyspark.mllib.clustering.KMeans Author: Evan Chen <chene@us.ibm.com> Closes #8967 from evanyc15/SPARK-10779-pyspark-mllib.	2015-10-07 15:04:53 -07:00
navis.ryu	713e4f44e9	[SPARK-10679] [CORE] javax.jdo.JDOFatalUserException in executor HadoopRDD throws exception in executor, something like below. {noformat} 5/09/17 18:51:21 INFO metastore.HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore 15/09/17 18:51:21 INFO metastore.ObjectStore: ObjectStore, initialize called 15/09/17 18:51:21 WARN metastore.HiveMetaStore: Retrying creating default database after error: Class org.datanucleus.api.jdo.JDOPersistenceManagerFactory was not found. javax.jdo.JDOFatalUserException: Class org.datanucleus.api.jdo.JDOPersistenceManagerFactory was not found. at javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1175) at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:808) at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:701) at org.apache.hadoop.hive.metastore.ObjectStore.getPMF(ObjectStore.java:365) at org.apache.hadoop.hive.metastore.ObjectStore.getPersistenceManager(ObjectStore.java:394) at org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:291) at org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:258) at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73) at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133) at org.apache.hadoop.hive.metastore.RawStoreProxy.<init>(RawStoreProxy.java:57) at org.apache.hadoop.hive.metastore.RawStoreProxy.getProxy(RawStoreProxy.java:66) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newRawStore(HiveMetaStore.java:593) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:571) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:620) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:461) at org.apache.hadoop.hive.metastore.RetryingHMSHandler.<init>(RetryingHMSHandler.java:66) at org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:72) at org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:5762) at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:199) at org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.<init>(SessionHiveMetaStoreClient.java:74) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:526) at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1521) at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:86) at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:132) at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:104) at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3005) at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3024) at org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1234) at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:174) at org.apache.hadoop.hive.ql.metadata.Hive.<clinit>(Hive.java:166) at org.apache.hadoop.hive.ql.plan.PlanUtils.configureJobPropertiesForStorageHandler(PlanUtils.java:803) at org.apache.hadoop.hive.ql.plan.PlanUtils.configureInputJobPropertiesForStorageHandler(PlanUtils.java:782) at org.apache.spark.sql.hive.HadoopTableReader$.initializeLocalJobConfFunc(TableReader.scala:298) at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$12.apply(TableReader.scala:274) at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$12.apply(TableReader.scala:274) at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176) at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176) at scala.Option.map(Option.scala:145) at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:176) at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:220) at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:216) at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at org.apache.spark.scheduler.Task.run(Task.scala:88) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {noformat} Author: navis.ryu <navis@apache.org> Closes #8804 from navis/SPARK-10679.	2015-10-07 14:56:02 -07:00
Liang-Chi Hsieh	c14aee4da9	[SPARK-10856][SQL] Mapping TimestampType to DATETIME for SQL Server jdbc dialect JIRA: https://issues.apache.org/jira/browse/SPARK-10856 For Microsoft SQL Server, TimestampType should be mapped to DATETIME instead of TIMESTAMP. Related information for the datatype mapping: https://msdn.microsoft.com/en-us/library/ms378878(v=sql.110).aspx Author: Liang-Chi Hsieh <viirya@appier.com> Closes #8978 from viirya/mysql-jdbc-timestamp.	2015-10-07 14:49:08 -07:00
Marcelo Vanzin	94fc57afdf	[SPARK-10300] [BUILD] [TESTS] Add support for test tags in run-tests.py. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #8775 from vanzin/SPARK-10300.	2015-10-07 14:11:21 -07:00
Josh Rosen	a9ecd06149	[SPARK-10941] [SQL] Refactor AggregateFunction2 and AlgebraicAggregate interfaces to improve code clarity This patch refactors several of the Aggregate2 interfaces in order to improve code clarity. The biggest change is a refactoring of the `AggregateFunction2` class hierarchy. In the old code, we had a class named `AlgebraicAggregate` that inherited from `AggregateFunction2`, added a new set of methods, then banned the use of the inherited methods. I found this to be fairly confusing because. If you look carefully at the existing code, you'll see that subclasses of `AggregateFunction2` fall into two disjoint categories: imperative aggregation functions which directly extended `AggregateFunction2` and declarative, expression-based aggregate functions which extended `AlgebraicAggregate`. In order to make this more explicit, this patch refactors things so that `AggregateFunction2` is a sealed abstract class with two subclasses, `ImperativeAggregateFunction` and `ExpressionAggregateFunction`. The superclass, `AggregateFunction2`, now only contains methods and fields that are common to both subclasses. After making this change, I updated the various AggregationIterator classes to comply with this new naming scheme. I also performed several small renamings in the aggregate interfaces themselves in order to improve clarity and rewrote or expanded a number of comments. Author: Josh Rosen <joshrosen@databricks.com> Closes #8973 from JoshRosen/tungsten-agg-comments.	2015-10-07 13:19:49 -07:00
Holden Karau	5be5d24744	[SPARK-9841] [ML] Make clear public It is currently impossible to clear Param values once set. It would be helpful to be able to. Author: Holden Karau <holden@pigscanfly.ca> Closes #8619 from holdenk/SPARK-9841-params-clear-needs-to-be-public.	2015-10-07 12:00:56 -07:00
Marcelo Vanzin	6ca27f8550	[SPARK-10964] [YARN] Correctly register the AM with the driver. The `self` method returns null when called from the constructor; instead, registration should happen in the `onStart` method, at which point the `self` reference has already been initialized. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #9005 from vanzin/SPARK-10964.	2015-10-07 11:38:47 -07:00
Marcelo Vanzin	4b74755122	[SPARK-10812] [YARN] Fix shutdown of token renewer. A recent change to fix the referenced bug caused this exception in the `SparkContext.stop()` path: org.apache.spark.SparkException: YarnSparkHadoopUtil is not available in non-YARN mode! at org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$.get(YarnSparkHadoopUtil.scala:167) at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.stop(YarnClientSchedulerBackend.scala:182) at org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:440) at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1579) at org.apache.spark.SparkContext$$anonfun$stop$7.apply$mcV$sp(SparkContext.scala:1730) at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1185) at org.apache.spark.SparkContext.stop(SparkContext.scala:1729) Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #8996 from vanzin/SPARK-10812.	2015-10-07 11:38:07 -07:00
Michael Armbrust	f5d154bc73	[SPARK-10966] [SQL] Codegen framework cleanup This PR is mostly cosmetic and cleans up some warts in codegen (nearly all of which were inherited from the original quasiquote version). - Add lines numbers to errors (in stacktraces when debug logging is on, and always for compile fails) - Use a variable for input row instead of hardcoding "i" everywhere - rename `primitive` -> `value` (since its often actually an object) Author: Michael Armbrust <michael@databricks.com> Closes #9006 from marmbrus/codegen-cleanup.	2015-10-07 10:17:29 -07:00
Kevin Cox	9672602c7e	[SPARK-10952] Only add hive to classpath if HIVE_HOME is set. Currently if it isn't set it scans `/lib/*` and adds every dir to the classpath which makes the env too large and every command called afterwords fails. Author: Kevin Cox <kevincox@kevincox.ca> Closes #8994 from kevincox/kevincox-only-add-hive-to-classpath-if-var-is-set.	2015-10-07 09:54:02 -07:00
Sun Rui	f57c63d4c3	[SPARK-10752] [SPARKR] Implement corr() and cov in DataFrameStatFunctions. Author: Sun Rui <rui.sun@intel.com> Closes #8869 from sun-rui/SPARK-10752.	2015-10-07 09:46:37 -07:00
Xin Ren	27cdde2ff8	[SPARK-10669] [DOCS] Link to each language's API in codetabs in ML docs: spark.mllib In the Markdown docs for the spark.mllib Programming Guide, we have code examples with codetabs for each language. We should link to each language's API docs within the corresponding codetab, but we are inconsistent about this. For an example of what we want to do, see the "ChiSqSelector" section in `64743870f2/docs/mllib-feature-extraction.md` This JIRA is just for spark.mllib, not spark.ml. Please let me know if more work is needed, thanks a lot. Author: Xin Ren <iamshrek@126.com> Closes #8977 from keypointt/SPARK-10669.	2015-10-07 15:00:19 +01:00
zsxwing	ffe6831e49	[SPARK-10885] [STREAMING] Display the failed output op in Streaming UI This PR implements the following features for both `master` and `branch-1.5`. 1. Display the failed output op count in the batch list 2. Display the failure reason of output op in the batch detail page Screenshots: <img width="1356" alt="1" src="https://cloud.githubusercontent.com/assets/1000778/10198387/5b2b97ec-67ce-11e5-81c2-f818b9d2f3ad.png"> <img width="1356" alt="2" src="https://cloud.githubusercontent.com/assets/1000778/10198388/5b76ac14-67ce-11e5-8c8b-de2683c5b485.png"> There are still two remaining problems in the UI. 1. If an output operation doesn't run any spark job, we cannot get the its duration since now it's the sum of all jobs' durations. 2. If an output operation doesn't run any spark job, we cannot get the description since it's the latest job's call site. We need to add new `StreamingListenerEvent` about output operations to fix them. So I'd like to fix them only for `master` in another PR. Author: zsxwing <zsxwing@gmail.com> Closes #8950 from zsxwing/batch-failure.	2015-10-06 16:51:03 -07:00
Xiangrui Meng	5e035403d4	[SPARK-10957] [ML] setParams changes quantileProbabilities unexpectly in PySpark's AFTSurvivalRegression If user doesn't specify `quantileProbs` in `setParams`, it will get reset to the default value. We don't need special handling here. vectorijk yanboliang Author: Xiangrui Meng <meng@databricks.com> Closes #9001 from mengxr/SPARK-10957.	2015-10-06 14:58:42 -07:00
vectorijk	5952bdb7df	[SPARK-10688] [ML] [PYSPARK] Python API for AFTSurvivalRegression Implement Python API for AFTSurvivalRegression Author: vectorijk <jiangkai@gmail.com> Closes #8926 from vectorijk/spark-10688.	2015-10-06 12:43:28 -07:00
Thomas Graves	e978360159	[SPARK-10901] [YARN] spark.yarn.user.classpath.first doesn't work This should go into 1.5.2 also. The issue is we were no longer adding the __app__.jar to the system classpath. Author: Thomas Graves <tgraves@staydecay.corp.gq1.yahoo.com> Author: Tom Graves <tgraves@yahoo-inc.com> Closes #8959 from tgravescs/SPARK-10901.	2015-10-06 10:18:50 -07:00
Marcelo Vanzin	744f03e700	[SPARK-10916] [YARN] Set perm gen size when launching containers on YARN. This makes YARN containers behave like all other processes launched by Spark, which launch with a default perm gen size of 256m unless overridden by the user (or not needed by the vm). Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #8970 from vanzin/SPARK-10916.	2015-10-06 10:17:12 -07:00
Davies Liu	27ecfe61f0	[SPARK-10938] [SQL] remove typeId in columnar cache This PR remove the typeId in columnar cache, it's not needed anymore, it also remove DATE and TIMESTAMP (use INT/LONG instead). Author: Davies Liu <davies@databricks.com> Closes #8989 from davies/refactor_cache.	2015-10-06 08:45:31 -07:00
Wenchen Fan	4e0027feae	[SPARK-10585] [SQL] [FOLLOW-UP] remove no-longer-necessary code for unsafe generation These code was left there to produce clear diff for https://github.com/apache/spark/pull/8747 Author: Wenchen Fan <cloud0fan@163.com> Closes #8991 from cloud-fan/clean.	2015-10-05 23:24:12 -07:00
zsxwing	be7c5ff1ad	[SPARK-10900] [STREAMING] Add output operation events to StreamingListener Add output operation events to StreamingListener so as to implement the following UI features: 1. Progress bar of a batch in the batch list. 2. Be able to display output operation `description` and `duration` when there is no spark job in a Streaming job. Author: zsxwing <zsxwing@gmail.com> Closes #8958 from zsxwing/output-operation-events.	2015-10-05 19:23:41 -07:00
Wenchen Fan	a609eb20d9	[SPARK-10934] [SQL] handle hashCode of unsafe array correctly `Murmur3_x86_32.hashUnsafeWords` only accepts word-aligned bytes, but unsafe array is not. Author: Wenchen Fan <cloud0fan@163.com> Closes #8987 from cloud-fan/hash.	2015-10-05 17:31:54 -07:00
Wenchen Fan	c4871369db	[SPARK-10585] [SQL] only copy data once when generate unsafe projection This PR is a completely rewritten of GenerateUnsafeProjection, to accomplish the goal of copying data only once. The old code of GenerateUnsafeProjection is still there to reduce review difficulty. Instead of creating unsafe conversion code for struct, array and map, we create code of writing the content to the global row buffer. Author: Wenchen Fan <cloud0fan@163.com> Author: Wenchen Fan <cloud0fan@outlook.com> Closes #8747 from cloud-fan/copy-once.	2015-10-05 13:00:58 -07:00
Avrohom Katz	883bd8fccf	[SPARK-10889] [STREAMING] Bump KCL to add MillisBehindLatest metric I don't believe the API changed at all. Author: Avrohom Katz <iambpentameter@gmail.com> Closes #8957 from akatz/kcl-upgrade.	2015-10-04 09:36:07 +01:00
Sean Owen	82bbc2a5f2	[SPARK-9570] [DOCS] Consistent recommendation for submitting spark apps to YARN, -master yarn --deploy-mode x vs -master yarn-x'. Recommend `--master yarn --deploy-mode {cluster,client}` consistently in docs. Follow-on to https://github.com/apache/spark/pull/8385 CC nssalian Author: Sean Owen <sowen@cloudera.com> Closes #8968 from srowen/SPARK-9570.	2015-10-04 09:31:52 +01:00
felixcheung	721e8b5f35	[SPARK-10904] [SPARKR] Fix to support `select(df, c("col1", "col2"))` The fix is to coerce `c("a", "b")` into a list such that it could be serialized to call JVM with. Author: felixcheung <felixcheung_m@hotmail.com> Closes #8961 from felixcheung/rselect.	2015-10-03 22:42:36 -07:00
Reynold Xin	ae6570ec2b	Remove TODO in ShuffleMemoryManager.	2015-10-03 18:08:25 -07:00
Guillaume Poulin	be0dcd6eb1	FIX: rememberDuration reassignment error message I was reading throught the scheduler and found this small mistake. Author: Guillaume Poulin <guillaume@hopper.com> Closes #8966 from gpoulin/remember_duration_typo.	2015-10-03 12:14:00 +01:00
zsxwing	107320c9bb	[SPARK-6028] [CORE] Remerge #6457 : new RPC implemetation and also pick #8905 This PR just reverted `02144d6745` to remerge #6457 and also included the commits in #8905. Author: zsxwing <zsxwing@gmail.com> Closes #8944 from zsxwing/SPARK-6028.	2015-10-03 01:04:35 -07:00
gweidner	314bc68435	[SPARK-7275] [SQL] Make LogicalRelation public Given LogicalRelation (and other classes) were moved from sources package to execution.sources package, removed private[sql] to make LogicalRelation public to facilitate access for data sources. Author: gweidner <gweidner@us.ibm.com> Closes #8965 from gweidner/SPARK-7275.	2015-10-03 01:04:14 -07:00
Joshi	f85aa06464	[SPARK-10317] [CORE] Compatibility between history server script and functionality Compatibility between history server script and functionality The history server has its argument parsing class in HistoryServerArguments. However, this doesn't get involved in the start-history-server.sh codepath where the $0 arg is assigned to spark.history.fs.logDirectory and all other arguments discarded (e.g --property-file.) This stops the other options being usable from this script Author: Joshi <rekhajoshm@gmail.com> Author: Rekha Joshi <rekhajoshm@gmail.com> Closes #8758 from rekhajoshm/SPARK-10317.	2015-10-02 15:26:11 -07:00
Yin Huai	b0baa11d3b	[HOT-FIX] Fix style. https://github.com/apache/spark/pull/8882 broke our build. Author: Yin Huai <yhuai@databricks.com> Closes #8964 from yhuai/fixStyle.	2015-10-02 11:23:08 -07:00
Xusen Yin	633aaae0a1	[SPARK-6530] [ML] Add chi-square selector for ml package See JIRA [here](https://issues.apache.org/jira/browse/SPARK-6530). Author: Xusen Yin <yinxusen@gmail.com> Closes #5742 from yinxusen/SPARK-6530.	2015-10-02 10:25:58 -07:00
Xusen Yin	23a9448c04	[SPARK-5890] [ML] Add feature discretizer JIRA issue [here](https://issues.apache.org/jira/browse/SPARK-5890). I borrow the code of `findSplits` from `RandomForest`. I don't think it's good to call it from `RandomForest` directly. Author: Xusen Yin <yinxusen@gmail.com> Closes #5779 from yinxusen/SPARK-5890.	2015-10-02 10:19:18 -07:00
Rerngvit Yanggratoke	2a717821bb	[SPARK-9798] [ML] CrossValidatorModel Documentation Improvements Document CrossValidatorModel members: bestModel and avgMetrics Author: Rerngvit Yanggratoke <rerngvit@kth.se> Closes #8882 from rerngvit/Spark-9798.	2015-10-02 10:15:02 -07:00
Takeshi YAMAMURO	2272962eb0	[SPARK-9867] [SQL] Move utilities for binary data into ByteArray The utilities such as Substring#substringBinarySQL and BinaryPrefixComparator#computePrefix for binary data are put together in ByteArray for easy-to-read. Author: Takeshi YAMAMURO <linguin.m.s@gmail.com> Closes #8122 from maropu/CleanUpForBinaryType.	2015-10-01 21:33:27 -04:00
Cheng Lian	01cd688f52	[SPARK-10400] [SQL] Renames SQLConf.PARQUET_FOLLOW_PARQUET_FORMAT_SPEC We introduced SQL option `spark.sql.parquet.followParquetFormatSpec` while working on implementing Parquet backwards-compatibility rules in SPARK-6777. It indicates whether we should use legacy Parquet format adopted by Spark 1.4 and prior versions or the standard format defined in parquet-format spec to write Parquet files. This option defaults to `false` and is marked as a non-public option (`isPublic = false`) because we haven't finished refactored Parquet write path. The problem is, the name of this option is somewhat confusing, because it's not super intuitive why we shouldn't follow the spec. Would be nice to rename it to `spark.sql.parquet.writeLegacyFormat`, and invert its default value (the two option names have opposite meanings). Although this option is private in 1.5, we'll make it public in 1.6 after refactoring Parquet write path. So that users can decide whether to write Parquet files in standard format or legacy format. Author: Cheng Lian <lian@databricks.com> Closes #8566 from liancheng/spark-10400/deprecate-follow-parquet-format-spec.	2015-10-01 17:23:27 -07:00
Wenchen Fan	02026a8132	[SPARK-10671] [SQL] Throws an analysis exception if we cannot find Hive UDFs Takes over https://github.com/apache/spark/pull/8800 Author: Wenchen Fan <cloud0fan@163.com> Closes #8941 from cloud-fan/hive-udf.	2015-10-01 13:23:59 -07:00
Cheng Hao	4d8c7c6d1c	[SPARK-10865] [SPARK-10866] [SQL] Fix bug of ceil/floor, which should returns long instead of the Double type Floor & Ceiling function should returns Long type, rather than Double. Verified with MySQL & Hive. Author: Cheng Hao <hao.cheng@intel.com> Closes #8933 from chenghao-intel/ceiling.	2015-10-01 11:48:15 -07:00
zsxwing	9b3e7768a2	[SPARK-10058] [CORE] [TESTS] Fix the flaky tests in HeartbeatReceiverSuite Fixed the test failure here: https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Test/job/Spark-1.5-SBT/116/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.2,label=spark-test/testReport/junit/org.apache.spark/HeartbeatReceiverSuite/normal_heartbeat/ This failure is because `HeartbeatReceiverSuite. heartbeatReceiver` may receive `SparkListenerExecutorAdded("driver")` sent from [LocalBackend](`8fb3a65cbb/core/src/main/scala/org/apache/spark/scheduler/local/LocalBackend.scala (L121)`). There are other race conditions in `HeartbeatReceiverSuite` because `HeartbeatReceiver.onExecutorAdded` and `HeartbeatReceiver.onExecutorRemoved` are asynchronous. This PR also fixed them. Author: zsxwing <zsxwing@gmail.com> Closes #8946 from zsxwing/SPARK-10058.	2015-10-01 07:09:31 -07:00
Oscar D. Lara Yejas	f21e2da03f	[SPARK-10807] [SPARKR] Added as.data.frame as a synonym for collect Created method as.data.frame as a synonym for collect(). Author: Oscar D. Lara Yejas <olarayej@mail.usf.edu> Author: olarayej <oscar.lara.yejas@us.ibm.com> Author: Oscar D. Lara Yejas <oscar.lara.yejas@us.ibm.com> Closes #8908 from olarayej/SPARK-10807.	2015-09-30 18:03:31 -07:00
Nathan Howell	89ea0041ae	[SPARK-9617] [SQL] Implement json_tuple This is an implementation of Hive's `json_tuple` function using Jackson Streaming. Author: Nathan Howell <nhowell@godaddy.com> Closes #7946 from NathanHowell/SPARK-9617.	2015-09-30 15:33:12 -07:00
Reynold Xin	03cca5dce2	[SPARK-10770] [SQL] SparkPlan.executeCollect/executeTake should return InternalRow rather than external Row. Author: Reynold Xin <rxin@databricks.com> Closes #8900 from rxin/SPARK-10770-1.	2015-09-30 14:36:54 -04:00
Sun Rui	c7b29ae641	[SPARK-10851] [SPARKR] Exception not failing R applications (in yarn cluster mode) The YARN backend doesn't like when user code calls System.exit, since it cannot know the exit status and thus cannot set an appropriate final status for the application. This PR remove the usage of system.exit to exit the RRunner. Instead, when the R process running an SparkR script returns an exit code other than 0, throws SparkUserAppException which will be caught by ApplicationMaster and ApplicationMaster knows it failed. For other failures, throws SparkException. Author: Sun Rui <rui.sun@intel.com> Closes #8938 from sun-rui/SPARK-10851.	2015-09-30 11:03:08 -07:00
Herman van Hovell	16fd2a2f42	[SPARK-9741] [SQL] Approximate Count Distinct using the new UDAF interface. This PR implements a HyperLogLog based Approximate Count Distinct function using the new UDAF interface. The implementation is inspired by the ClearSpring HyperLogLog implementation and should produce the same results. There is still some documentation and testing left to do. cc yhuai Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #8362 from hvanhovell/SPARK-9741.	2015-09-30 10:12:52 -07:00
Yanbo Liang	2931e89f0c	[SPARK-10736] [ML] Use 1 for all ratings if $(ratingCol) = "" For some implicit dataset, ratings may not exist in the training data. In this case, we can assume all observed pairs to be positive and treat their ratings as 1. This should happen when users set ```ratingCol``` to an empty string. Author: Yanbo Liang <ybliang8@gmail.com> Closes #8937 from yanboliang/spark-10736.	2015-09-29 23:58:32 -07:00
Cheng Lian	4d5a005b0d	[SPARK-10811] [SQL] Eliminates unnecessary byte array copying When reading Parquet string and binary-backed decimal values, Parquet `Binary.getBytes` always returns a copied byte array, which is unnecessary. Since the underlying implementation of `Binary` values there is guaranteed to be `ByteArraySliceBackedBinary`, and Parquet itself never reuses underlying byte arrays, we can use `Binary.toByteBuffer.array()` to steal the underlying byte arrays without copying them. This brings performance benefits when scanning Parquet string and binary-backed decimal columns. Note that, this trick doesn't cover binary-backed decimals with precision greater than 18. My micro-benchmark result is that, this brings a ~15% performance boost for scanning TPC-DS `store_sales` table (scale factor 15). Another minor optimization done in this PR is that, now we directly construct a Java `BigDecimal` in `Decimal.toJavaBigDecimal` without constructing a Scala `BigDecimal` first. This brings another ~5% performance gain. Author: Cheng Lian <lian@databricks.com> Closes #8907 from liancheng/spark-10811/eliminate-array-copying.	2015-09-29 23:30:27 -07:00
asokadiggs	c1ad373f26	[SPARK-10782] [PYTHON] Update dropDuplicates documentation Documentation for dropDuplicates() and drop_duplicates() is one and the same. Resolved the error in the example for drop_duplicates using the same approach used for groupby and groupBy, by indicating that dropDuplicates and drop_duplicates are aliases. Author: asokadiggs <asoka.diggs@intel.com> Closes #8930 from asokadiggs/jira-10782.	2015-09-29 17:45:18 -04:00

1 2 3 4 5 ...

13126 commits