ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
zero323	a97d6f3a58	[SPARK-11281][SPARKR] Add tests covering the issue. The goal of this PR is to add tests covering the issue to ensure that it was resolved by [SPARK-11086](https://issues.apache.org/jira/browse/SPARK-11086). Author: zero323 <matthew.szymkiewicz@gmail.com> Closes #9743 from zero323/SPARK-11281-tests.	2015-11-18 08:32:03 -08:00
Jeff Zhang	3a6807fdf0	[SPARK-11804] [PYSPARK] Exception raise when using Jdbc predicates option in PySpark Author: Jeff Zhang <zjffdu@apache.org> Closes #9791 from zjffdu/SPARK-11804.	2015-11-18 08:18:54 -08:00
Viveka Kulharia	1429e0a2b5	rmse was wrongly calculated It was multiplying with U instead of dividing by U Author: Viveka Kulharia <vivkul@iitk.ac.in> Closes #9771 from vivkul/patch-1.	2015-11-18 09:10:15 +00:00
Sean Owen	9631ca3527	[SPARK-11652][CORE] Remote code execution with InvokerTransformer Update to Commons Collections 3.2.2 to avoid any potential remote code execution vulnerability Author: Sean Owen <sowen@cloudera.com> Closes #9731 from srowen/SPARK-11652.	2015-11-18 08:59:20 +00:00
Jean-Baptiste Onofré	e62820c85f	[SPARK-6541] Sort executors by ID (numeric) "Force" the executor ID sort with Int. Author: Jean-Baptiste Onofré <jbonofre@apache.org> Closes #9165 from jbonofre/SPARK-6541.	2015-11-18 08:57:58 +00:00
somideshmukh	b8f4379ba1	[SPARK-10946][SQL] JDBC - Use Statement.executeUpdate instead of PreparedStatement.executeUpdate for DDLs New changes with JDBCRDD Author: somideshmukh <somilde@us.ibm.com> Closes #9733 from somideshmukh/SomilBranch-1.1.	2015-11-18 08:51:01 +00:00
Yin Huai	1714350bdd	[SPARK-11792][SQL] SizeEstimator cannot provide a good size estimation of UnsafeHashedRelations https://issues.apache.org/jira/browse/SPARK-11792 Right now, SizeEstimator will "think" a small UnsafeHashedRelation is several GBs. Author: Yin Huai <yhuai@databricks.com> Closes #9788 from yhuai/SPARK-11792.	2015-11-18 00:42:52 -08:00
Reynold Xin	5e2b44474c	[SPARK-11802][SQL] Kryo-based encoder for opaque types in Datasets I also found a bug with self-joins returning incorrect results in the Dataset API. Two test cases attached and filed SPARK-11803. Author: Reynold Xin <rxin@databricks.com> Closes #9789 from rxin/SPARK-11802.	2015-11-18 00:09:29 -08:00
Wenchen Fan	8019f66df5	[SPARK-10186][SQL][FOLLOW-UP] simplify test Author: Wenchen Fan <wenchen@databricks.com> Closes #9783 from cloud-fan/postgre.	2015-11-17 23:51:05 -08:00
Xusen Yin	9154f89bef	[SPARK-11728] Replace example code in ml-ensembles.md using include_example JIRA issue https://issues.apache.org/jira/browse/SPARK-11728. The ml-ensembles.md file contains `OneVsRestExample`. Instead of writing new code files of two `OneVsRestExample`s, I use two existing files in the examples directory, they are `OneVsRestExample.scala` and `JavaOneVsRestExample.scala`. Author: Xusen Yin <yinxusen@gmail.com> Closes #9716 from yinxusen/SPARK-11728.	2015-11-17 23:44:06 -08:00
Davies Liu	2f191c66b6	[SPARK-11643] [SQL] parse year with leading zero Support the years between 0 <= year < 1000 Author: Davies Liu <davies@databricks.com> Closes #9701 from davies/leading_zero.	2015-11-17 23:14:05 -08:00
RoyGaoVLIS	67a5132c21	[SPARK-7013][ML][TEST] Add unit test for spark.ml StandardScaler I have added unit test for ML's StandardScaler By comparing with R's output, please review for me. Thx. Author: RoyGaoVLIS <roygao@zju.edu.cn> Closes #6665 from RoyGao/7013.	2015-11-17 23:00:49 -08:00
tedyu	446738e51f	[SPARK-11761] Prevent the call to StreamingContext#stop() in the listener bus's thread See discussion toward the tail of https://github.com/apache/spark/pull/9723 From zsxwing : ``` The user should not call stop or other long-time work in a listener since it will block the listener thread, and prevent from stopping SparkContext/StreamingContext. I cannot see an approach since we need to stop the listener bus's thread before stopping SparkContext/StreamingContext totally. ``` Proposed solution is to prevent the call to StreamingContext#stop() in the listener bus's thread. Author: tedyu <yuzhihong@gmail.com> Closes #9741 from tedyu/master.	2015-11-17 22:47:53 -08:00
Yanbo Liang	8fb775ba87	[SPARK-11755][R] SparkR should export "predict" The bug described at [SPARK-11755](https://issues.apache.org/jira/browse/SPARK-11755), after exporting ```predict``` we can both get the help information from the SparkR and base R package like the following: ```Java > help(predict) Help on topic ‘predict’ was found in the following packages: Package Library SparkR /Users/yanboliang/data/trunk2/spark/R/lib stats /Library/Frameworks/R.framework/Versions/3.2/Resources/library Choose one 1: Make predictions from a model {SparkR} 2: Model Predictions {stats} ``` Author: Yanbo Liang <ybliang8@gmail.com> Closes #9732 from yanboliang/spark-11755.	2015-11-17 22:13:15 -08:00
Reynold Xin	91f4b6f2db	[SPARK-11797][SQL] collect, first, and take should use encoders for serialization They were previously using Spark's default serializer for serialization. Author: Reynold Xin <rxin@databricks.com> Closes #9787 from rxin/SPARK-11797.	2015-11-17 21:40:58 -08:00
Davies Liu	98be8169f0	[SPARK-11737] [SQL] Fix serialization of UTF8String with Kyro The default implementation of serialization UTF8String with Kyro may be not correct (BYTE_ARRAY_OFFSET could be different across JVM) Author: Davies Liu <davies@databricks.com> Closes #9704 from davies/kyro_string.	2015-11-17 19:50:02 -08:00
Kent Yao	e33053ee00	[SPARK-11583] [CORE] MapStatus Using RoaringBitmap More Properly This PR upgrade the version of RoaringBitmap to 0.5.10, to optimize the memory layout, will be much smaller when most of blocks are empty. This PR is based on #9661 (fix conflicts), see all of the comments at https://github.com/apache/spark/pull/9661 . Author: Kent Yao <yaooqinn@hotmail.com> Author: Davies Liu <davies@databricks.com> Author: Charles Allen <charles@allen-net.com> Closes #9746 from davies/roaring_mapstatus.	2015-11-17 19:44:29 -08:00
Davies Liu	bf25f9bdfc	[SPARK-11016] Move RoaringBitmap to explicit Kryo serializer Fix the serialization of RoaringBitmap with Kyro serializer This PR came from https://github.com/metamx/spark/pull/1, thanks to drcrallen Author: Davies Liu <davies@databricks.com> Author: Charles Allen <charles@allen-net.com> Closes #9748 from davies/SPARK-11016.	2015-11-17 19:39:39 -08:00
Reynold Xin	ed8d1531f9	[SPARK-11793][SQL] Dataset should set the resolved encoders internally for maps. I also wrote a test case -- but unfortunately the test case is not working due to SPARK-11795. Author: Reynold Xin <rxin@databricks.com> Closes #9784 from rxin/SPARK-11503.	2015-11-17 19:02:44 -08:00
jerryshao	75a2922910	[SPARK-9065][STREAMING][PYSPARK] Add MessageHandler for Kafka Python API Fixed the merge conflicts in #7410 Closes #7410 Author: Shixiong Zhu <shixiong@databricks.com> Author: jerryshao <saisai.shao@intel.com> Author: jerryshao <sshao@hortonworks.com> Closes #9742 from zsxwing/pr7410.	2015-11-17 16:57:52 -08:00
Jacek Lewandowski	b362d50fca	[SPARK-11726] Throw exception on timeout when waiting for REST server response Author: Jacek Lewandowski <lewandowski.jacek@gmail.com> Closes #9692 from jacek-lewandowski/SPARK-11726.	2015-11-17 16:00:00 -08:00
Holden Karau	52c734b589	[SPARK-11771][YARN][TRIVIAL] maximum memory in yarn is controlled by two params have both in error msg When we exceed the max memory tell users to increase both params instead of just the one. Author: Holden Karau <holden@us.ibm.com> Closes #9758 from holdenk/SPARK-11771-maximum-memory-in-yarn-is-controlled-by-two-params-have-both-in-error-msg.	2015-11-17 15:51:03 -08:00
Shixiong Zhu	3720b1480c	[SPARK-11790][STREAMING][TESTS] Increase the connection timeout Sometimes, EmbeddedZookeeper may need more than 6 seconds to setup up in a slow Jenkins worker. So just increase the timeout, it won't increase the test time if the test passes. Author: Shixiong Zhu <shixiong@databricks.com> Closes #9778 from zsxwing/SPARK-11790.	2015-11-17 15:47:39 -08:00
Rohan Bhanderi	e29656f8e7	[MINOR] Correct comments in JavaDirectKafkaWordCount Author: Rohan Bhanderi <rohan.bhanderi@sjsu.edu> Closes #9781 from RohanBhanderi/patch-3.	2015-11-17 15:45:46 -08:00
Grace	965245d087	[SPARK-9552] Add force control for killExecutors to avoid false killing for those busy executors By using the dynamic allocation, sometimes it occurs false killing for those busy executors. Some executors with assignments will be killed because of being idle for enough time (say 60 seconds). The root cause is that the Task-Launch listener event is asynchronized. For example, some executors are under assigning tasks, but not sending out the listener notification yet. Meanwhile, the dynamic allocation's executor idle time is up (e.g., 60 seconds). It will trigger killExecutor event at the same time. 1. the timer expiration starts before the listener event arrives. 2. Then, the task is going to run on top of that killed/killing executor. It will lead to task failure finally. Here is the proposal to fix it. We can add the force control for killExecutor. If the force control is not set (i.e., false), we'd better to check if the executor under killing is idle or busy. If the current executor has some assignment, we should not kill that executor and return back false (to indicate killing failure). In dynamic allocation, we'd better to turn off force killing (i.e., force = false), we will meet killing failure if tries to kill a busy executor. And then, the executor timer won't be invalid. Later on, the task assignment event arrives, we can remove the idle timer accordingly. So that we can avoid false killing for those busy executors in dynamic allocation. For the rest of usages, the end users can decide if to use force killing or not by themselves. If to turn on that option, the killExecutor will do the action without any status checking. Author: Grace <jie.huang@intel.com> Author: Andrew Or <andrew@databricks.com> Author: Jie Huang <jie.huang@intel.com> Closes #7888 from GraceH/forcekill.	2015-11-17 15:43:35 -08:00
Shixiong Zhu	928d631625	[SPARK-11740][STREAMING] Fix the race condition of two checkpoints in a batch We will do checkpoint when generating a batch and completing a batch. When the processing time of a batch is greater than the batch interval, checkpointing for completing an old batch may run after checkpointing for generating a new batch. If this happens, checkpoint of an old batch actually has the latest information, so we want to recovery from it. This PR will use the latest checkpoint time as the file name, so that we can always recovery from the latest checkpoint file. Author: Shixiong Zhu <shixiong@databricks.com> Closes #9707 from zsxwing/fix-checkpoint.	2015-11-17 14:48:29 -08:00
Marcelo Vanzin	936bc0bcbf	[SPARK-11786][CORE] Tone down messages from akka error monitor. These events happen normally during the app's lifecycle, so printing out ERROR logs all the time is misleading, and can actually affect usability of interactive shells. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #9772 from vanzin/SPARK-11786.	2015-11-17 14:23:28 -08:00
Xiangrui Meng	3e9e638023	[SPARK-11764][ML] make Param.jsonEncode/jsonDecode support Vector This PR makes the default read/write work with simple transformers/estimators that have params of type `Param[Vector]`. jkbradley Author: Xiangrui Meng <meng@databricks.com> Closes #9776 from mengxr/SPARK-11764.	2015-11-17 14:04:49 -08:00
Joseph K. Bradley	6eb7008b7f	[SPARK-11763][ML] Add save,load to LogisticRegression Estimator Add save/load to LogisticRegression Estimator, and refactor tests a little to make it easier to add similar support to other Estimator, Model pairs. Moved LogisticRegressionReader/Writer to within LogisticRegressionModel CC: mengxr Author: Joseph K. Bradley <joseph@databricks.com> Closes #9749 from jkbradley/lr-io-2.	2015-11-17 14:03:49 -08:00
Xusen Yin	328eb49e62	[SPARK-11729] Replace example code in ml-linear-methods.md using include_example JIRA link: https://issues.apache.org/jira/browse/SPARK-11729 Author: Xusen Yin <yinxusen@gmail.com> Closes #9713 from yinxusen/SPARK-11729.	2015-11-17 13:59:59 -08:00
Timothy Hunter	fa603e08de	[SPARK-11732] Removes some MiMa false positives This adds an extra filter for private or protected classes. We only filter for package private right now. Author: Timothy Hunter <timhunter@databricks.com> Closes #9697 from thunterdb/spark-11732.	2015-11-17 20:51:20 +00:00
Davies Liu	5aca6ad00c	[SPARK-11767] [SQL] limit the size of cached batch Currently the size of cached batch is only controlled by `batchSize` (default value is 10000), which does not work well with the size of serialized columns (for example, complex types). The memory used to build the batch is not accounted, it's easy to OOM (especially after unified memory management). This PR introduce a hard limit as 4M for total columns (up to 50 columns of uncompressed primitive columns). This also change the way to grow buffer, double it each time, then trim it once finished. cc liancheng Author: Davies Liu <davies@databricks.com> Closes #9760 from davies/cache_limit.	2015-11-17 12:50:01 -08:00
Joseph K. Bradley	d98d1cb000	[SPARK-11769][ML] Add save, load to all basic Transformers This excludes Estimators and ones which include Vector and other non-basic types for Params or data. This adds: * Bucketizer * DCT * HashingTF * Interaction * NGram * Normalizer * OneHotEncoder * PolynomialExpansion * QuantileDiscretizer * RFormula * SQLTransformer * StopWordsRemover * StringIndexer * Tokenizer * VectorAssembler * VectorSlicer CC: mengxr Author: Joseph K. Bradley <joseph@databricks.com> Closes #9755 from jkbradley/transformer-io.	2015-11-17 12:43:56 -08:00
Wenchen Fan	d925149664	[SPARK-10186][SQL] support postgre array type in JDBCRDD Add ARRAY support to `PostgresDialect`. Nested ARRAY is not allowed for now because it's hard to get the array dimension info. See http://stackoverflow.com/questions/16619113/how-to-get-array-base-type-in-postgres-via-jdbc Thanks for the initial work from mariusvniekerk ! Close https://github.com/apache/spark/pull/9137 Author: Wenchen Fan <wenchen@databricks.com> Closes #9662 from cloud-fan/postgre.	2015-11-17 11:29:02 -08:00
gatorsmile	0158ff7737	[SPARK-8658][SQL][FOLLOW-UP] AttributeReference's equals method compares all the members Based on the comment of cloud-fan in https://github.com/apache/spark/pull/9216, update the AttributeReference's hashCode function by including the hashCode of the other attributes including name, nullable and qualifiers. Here, I am not 100% sure if we should include name in the hashCode calculation, since the original hashCode calculation does not include it. marmbrus cloud-fan Please review if the changes are good. Author: gatorsmile <gatorsmile@gmail.com> Closes #9761 from gatorsmile/hashCodeNamedExpression.	2015-11-17 11:23:54 -08:00
Cheng Lian	7b1407c7b9	[SPARK-11089][SQL] Adds option for disabling multi-session in Thrift server This PR adds a new option `spark.sql.hive.thriftServer.singleSession` for disabling multi-session support in the Thrift server. Note that this option is added as a Spark configuration (retrieved from `SparkConf`) rather than Spark SQL configuration (retrieved from `SQLConf`). This is because all SQL configurations are session-ized. Since multi-session support is by default on, no JDBC connection can modify global configurations like the newly added one. Author: Cheng Lian <lian@databricks.com> Closes #9740 from liancheng/spark-11089.single-session-option.	2015-11-17 11:17:52 -08:00
mayuanwen	e8833dd12c	[SPARK-11679][SQL] Invoking method " apply(fields: java.util.List[StructField])" in "StructType" gets ClassCastException In the previous method, fields.toArray will cast java.util.List[StructField] into Array[Object] which can not cast into Array[StructField], thus when invoking this method will throw "java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to [Lorg.apache.spark.sql.types.StructField;" I directly cast java.util.List[StructField] into Array[StructField] in this patch. Author: mayuanwen <mayuanwen@qiyi.com> Closes #9649 from jackieMaKing/Spark-11679.	2015-11-17 11:15:46 -08:00
Xiangrui Meng	21fac54341	[SPARK-11766][MLLIB] add toJson/fromJson to Vector/Vectors This is to support JSON serialization of Param[Vector] in the pipeline API. It could be used for other purposes too. The schema is the same as `VectorUDT`. jkbradley Author: Xiangrui Meng <meng@databricks.com> Closes #9751 from mengxr/SPARK-11766.	2015-11-17 10:17:16 -08:00
Chris Bannister	cc567b6634	[SPARK-11695][CORE] Set s3a credentials Set s3a credentials when creating a new default hadoop configuration. Author: Chris Bannister <chris.bannister@swiftkey.com> Closes #9663 from Zariel/set-s3a-creds.	2015-11-17 10:03:46 -08:00
jerryshao	6fc2740ebb	[SPARK-11744][LAUNCHER] Fix print version throw exception when using pyspark shell Exception details can be seen here (https://issues.apache.org/jira/browse/SPARK-11744). Author: jerryshao <sshao@hortonworks.com> Closes #9721 from jerryshao/SPARK-11744.	2015-11-17 10:01:33 -08:00
Philipp Hoffmann	15cc36b778	[SPARK-11779][DOCS] Fix reference to deprecated MESOS_NATIVE_LIBRARY MESOS_NATIVE_LIBRARY was renamed in favor of MESOS_NATIVE_JAVA_LIBRARY. This commit fixes the reference in the documentation. Author: Philipp Hoffmann <mail@philipphoffmann.de> Closes #9768 from philipphoffmann/patch-2.	2015-11-17 14:13:13 +00:00
yangping.wu	7276fa9aa9	[SPARK-11751] Doc describe error in the "Spark Streaming Programming Guide" page In the [Task Launching Overheads](http://spark.apache.org/docs/latest/streaming-programming-guide.html#task-launching-overheads) section, >Task Serialization: Using Kryo serialization for serializing tasks can reduce the task sizes, and therefore reduce the time taken to send them to the slaves. as we known Task Serialization is configuration by spark.closure.serializer parameter, but currently only the Java serializer is supported. If we set spark.closure.serializer to org.apache.spark.serializer.KryoSerializer, then this will throw a exception. Author: yangping.wu <wyphao.2007@163.com> Closes #9734 from 397090770/397090770-patch-1.	2015-11-17 14:11:34 +00:00
Cheng Lian	fa13301ae4	[SPARK-11191][SQL][FOLLOW-UP] Cleans up unnecessary anonymous HiveFunctionRegistry According to discussion in PR #9664, the anonymous `HiveFunctionRegistry` in `HiveContext` can be removed now. Author: Cheng Lian <lian@databricks.com> Closes #9737 from liancheng/spark-11191.follow-up.	2015-11-17 18:11:08 +08:00
Liang-Chi Hsieh	d79d8b08ff	[MINOR] [SQL] Fix randomly generated ArrayData in RowEncoderSuite The randomly generated ArrayData used for the UDT `ExamplePoint` in `RowEncoderSuite` sometimes doesn't have enough elements. In this case, this test will fail. This patch is to fix it. Author: Liang-Chi Hsieh <viirya@appier.com> Closes #9757 from viirya/fix-randomgenerated-udt.	2015-11-16 23:16:17 -08:00
Kevin Yu	e01865af0d	[SPARK-11447][SQL] change NullType to StringType during binaryComparison between NullType and StringType During executing PromoteStrings rule, if one side of binaryComparison is StringType and the other side is not StringType, the current code will promote(cast) the StringType to DoubleType, and if the StringType doesn't contain the numbers, it will get null value. So if it is doing <=> (NULL-safe equal) with Null, it will not filter anything, caused the problem reported by this jira. I proposal to the changes through this PR, can you review my code changes ? This problem only happen for <=>, other operators works fine. scala> val filteredDF = df.filter(df("column") > (new Column(Literal(null)))) filteredDF: org.apache.spark.sql.DataFrame = [column: string] scala> filteredDF.show +------+ |column| +------+ +------+ scala> val filteredDF = df.filter(df("column") === (new Column(Literal(null)))) filteredDF: org.apache.spark.sql.DataFrame = [column: string] scala> filteredDF.show +------+ |column| +------+ +------+ scala> df.registerTempTable("DF") scala> sqlContext.sql("select * from DF where 'column' = NULL") res27: org.apache.spark.sql.DataFrame = [column: string] scala> res27.show +------+ |column| +------+ +------+ Author: Kevin Yu <qyu@us.ibm.com> Closes #9720 from kevinyu98/working_on_spark-11447.	2015-11-16 22:54:29 -08:00
hyukjinkwon	75d2020731	[SPARK-11694][FOLLOW-UP] Clean up imports, use a common function for metadata and add a test for FIXED_LEN_BYTE_ARRAY As discussed https://github.com/apache/spark/pull/9660 https://github.com/apache/spark/pull/9060, I cleaned up unused imports, added a test for fixed-length byte array and used a common function for writing metadata for Parquet. For the test for fixed-length byte array, I have tested and checked the encoding types with [parquet-tools](https://github.com/Parquet/parquet-mr/tree/master/parquet-tools). Author: hyukjinkwon <gurwls223@gmail.com> Closes #9754 from HyukjinKwon/SPARK-11694-followup.	2015-11-17 14:35:00 +08:00
Reynold Xin	fbad920dbf	[SPARK-11768][SPARK-9196][SQL] Support now function in SQL (alias for current_timestamp). This patch adds an alias for current_timestamp (now function). Also fixes SPARK-9196 to re-enable the test case for current_timestamp. Author: Reynold Xin <rxin@databricks.com> Closes #9753 from rxin/SPARK-11768.	2015-11-16 20:47:46 -08:00
Marcelo Vanzin	540bf58f18	[SPARK-11617][NETWORK] Fix leak in TransportFrameDecoder. The code was using the wrong API to add data to the internal composite buffer, causing buffers to leak in certain situations. Use the right API and enhance the tests to catch memory leaks. Also, avoid reusing the composite buffers when downstream handlers keep references to them; this seems to cause a few different issues even though the ref counting code seems to be correct, so instead pay the cost of copying a few bytes when that situation happens. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #9619 from vanzin/SPARK-11617.	2015-11-16 17:28:11 -08:00
Joseph K. Bradley	1c5475f140	[SPARK-11612][ML] Pipeline and PipelineModel persistence Pipeline and PipelineModel extend Readable and Writable. Persistence succeeds only when all stages are Writable. Note: This PR reinstates tests for other read/write functionality. It should probably not get merged until [https://issues.apache.org/jira/browse/SPARK-11672] gets fixed. CC: mengxr Author: Joseph K. Bradley <joseph@databricks.com> Closes #9674 from jkbradley/pipeline-io.	2015-11-16 17:12:39 -08:00
jerryshao	bd10eb81c9	[EXAMPLE][MINOR] Add missing awaitTermination in click stream example Author: jerryshao <sshao@hortonworks.com> Closes #9730 from jerryshao/clickstream-fix.	2015-11-16 17:02:21 -08:00


13721 commits