Commit graph

13696 commits

Shixiong Zhu 928d631625 [SPARK-11740][STREAMING] Fix the race condition of two checkpoints in a batch
We checkpoint both when generating a batch and when completing a batch. When the processing time of a batch is greater than the batch interval, checkpointing for completing an old batch may run after checkpointing for generating a new batch. If this happens, the checkpoint of the old batch actually has the latest information, so we want to recover from it. This PR uses the latest checkpoint time as the file name, so that we can always recover from the latest checkpoint file.
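
As a rough Scala sketch of the idea (not the PR's code; the `checkpoint-<time>` naming and the helper below are illustrative assumptions), recovery picks the file whose embedded checkpoint time is largest:

```
// Hypothetical helper: among checkpoint files named "checkpoint-<timeMs>",
// pick the one with the largest embedded time so recovery uses the latest data.
def latestCheckpointFile(files: Seq[String]): Option[String] = {
  val TimePattern = """checkpoint-(\d+)""".r
  files
    .flatMap(f => TimePattern.findFirstMatchIn(f).map(m => (m.group(1).toLong, f)))
    .sortBy(_._1)
    .lastOption
    .map(_._2)
}
```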

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #9707 from zsxwing/fix-checkpoint.
2015-11-17 14:48:29 -08:00
Marcelo Vanzin 936bc0bcbf [SPARK-11786][CORE] Tone down messages from akka error monitor.
These events happen normally during the app's lifecycle, so printing
out ERROR logs all the time is misleading, and can actually affect usability
of interactive shells.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #9772 from vanzin/SPARK-11786.
2015-11-17 14:23:28 -08:00
Xiangrui Meng 3e9e638023 [SPARK-11764][ML] make Param.jsonEncode/jsonDecode support Vector
This PR makes the default read/write work with simple transformers/estimators that have params of type `Param[Vector]`. jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #9776 from mengxr/SPARK-11764.
2015-11-17 14:04:49 -08:00
Joseph K. Bradley 6eb7008b7f [SPARK-11763][ML] Add save,load to LogisticRegression Estimator
Add save/load to LogisticRegression Estimator, and refactor tests a little to make it easier to add similar support to other Estimator, Model pairs.

Moved LogisticRegressionReader/Writer to within LogisticRegressionModel

CC: mengxr

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #9749 from jkbradley/lr-io-2.
2015-11-17 14:03:49 -08:00
Xusen Yin 328eb49e62 [SPARK-11729] Replace example code in ml-linear-methods.md using include_example
JIRA link: https://issues.apache.org/jira/browse/SPARK-11729

Author: Xusen Yin <yinxusen@gmail.com>

Closes #9713 from yinxusen/SPARK-11729.
2015-11-17 13:59:59 -08:00
Timothy Hunter fa603e08de [SPARK-11732] Removes some MiMa false positives
This adds an extra filter for private or protected classes. We currently only filter for package-private classes.

Author: Timothy Hunter <timhunter@databricks.com>

Closes #9697 from thunterdb/spark-11732.
2015-11-17 20:51:20 +00:00
Davies Liu 5aca6ad00c [SPARK-11767] [SQL] limit the size of cached batch
Currently the size of a cached batch is only controlled by `batchSize` (default value is 10000), which does not work well with the size of serialized columns (for example, complex types). The memory used to build the batch is not accounted for, so it's easy to OOM (especially after unified memory management).

This PR introduces a hard limit of 4MB for total columns (up to 50 columns of uncompressed primitive columns).

This also changes the way the buffer grows: double it each time, then trim it once finished.
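
For context only, a minimal sketch (not the PR's code; all names are assumptions) of the grow-by-doubling-then-trim strategy described above:

```
import java.nio.ByteBuffer

// Illustrative only: double the buffer when it runs out of room while building a batch,
// then trim to the bytes actually written so the cached batch keeps no spare capacity.
final class GrowableBuffer(initialSize: Int = 1024) {
  private var buf = ByteBuffer.allocate(initialSize)

  def ensureFree(bytes: Int): Unit = {
    if (buf.remaining() < bytes) {
      var newCapacity = buf.capacity()
      while (newCapacity - buf.position() < bytes) newCapacity *= 2  // double each time
      val bigger = ByteBuffer.allocate(newCapacity)
      buf.flip()
      bigger.put(buf)
      buf = bigger
    }
  }

  // Copy only the bytes written so far (the "trim" step once finished).
  def trimmed(): Array[Byte] = java.util.Arrays.copyOf(buf.array(), buf.position())
}
```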

cc liancheng

Author: Davies Liu <davies@databricks.com>

Closes #9760 from davies/cache_limit.
2015-11-17 12:50:01 -08:00
Joseph K. Bradley d98d1cb000 [SPARK-11769][ML] Add save, load to all basic Transformers
This excludes Estimators and ones which include Vector and other non-basic types for Params or data.  This adds:
* Bucketizer
* DCT
* HashingTF
* Interaction
* NGram
* Normalizer
* OneHotEncoder
* PolynomialExpansion
* QuantileDiscretizer
* RFormula
* SQLTransformer
* StopWordsRemover
* StringIndexer
* Tokenizer
* VectorAssembler
* VectorSlicer

CC: mengxr

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #9755 from jkbradley/transformer-io.
2015-11-17 12:43:56 -08:00
Wenchen Fan d925149664 [SPARK-10186][SQL] support postgres array type in JDBCRDD
Add ARRAY support to `PostgresDialect`.

Nested ARRAY is not allowed for now because it's hard to get the array dimension info. See http://stackoverflow.com/questions/16619113/how-to-get-array-base-type-in-postgres-via-jdbc

Thanks for the initial work from mariusvniekerk !

Close https://github.com/apache/spark/pull/9137

Author: Wenchen Fan <wenchen@databricks.com>

Closes #9662 from cloud-fan/postgre.
2015-11-17 11:29:02 -08:00
gatorsmile 0158ff7737 [SPARK-8658][SQL][FOLLOW-UP] AttributeReference's equals method compares all the members
Based on the comment of cloud-fan in https://github.com/apache/spark/pull/9216, update AttributeReference's hashCode function to also include the hash codes of the other members: name, nullability, and qualifiers.

Here, I am not 100% sure if we should include name in the hashCode calculation, since the original hashCode calculation does not include it.
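
As a generic illustration (not the actual AttributeReference code), combining the compared members into the hash code might look like this:

```
// Illustrative sketch: include the same members in hashCode that equals compares.
case class AttrRef(name: String, dataType: String, nullable: Boolean, qualifiers: Seq[String]) {
  override def hashCode(): Int = {
    var h = 17
    h = h * 37 + name.hashCode
    h = h * 37 + dataType.hashCode
    h = h * 37 + nullable.hashCode
    h = h * 37 + qualifiers.hashCode
    h
  }
}
```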

marmbrus cloud-fan Please review if the changes are good.

Author: gatorsmile <gatorsmile@gmail.com>

Closes #9761 from gatorsmile/hashCodeNamedExpression.
2015-11-17 11:23:54 -08:00
Cheng Lian 7b1407c7b9 [SPARK-11089][SQL] Adds option for disabling multi-session in Thrift server
This PR adds a new option `spark.sql.hive.thriftServer.singleSession` for disabling multi-session support in the Thrift server.

Note that this option is added as a Spark configuration (retrieved from `SparkConf`) rather than a Spark SQL configuration (retrieved from `SQLConf`). This is because all SQL configurations are session-ized. Since multi-session support is on by default, no JDBC connection can modify global configurations like the newly added one.
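
A hedged usage sketch (exact usage not taken from this PR), showing the option being set through `SparkConf` before the server or context starts:

```
import org.apache.spark.SparkConf

// Spark (not SQL) configuration, so it must be set before startup rather than per session.
val conf = new SparkConf()
  .set("spark.sql.hive.thriftServer.singleSession", "true")
```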

Author: Cheng Lian <lian@databricks.com>

Closes #9740 from liancheng/spark-11089.single-session-option.
2015-11-17 11:17:52 -08:00
mayuanwen e8833dd12c [SPARK-11679][SQL] Invoking method "apply(fields: java.util.List[StructField])" in "StructType" gets ClassCastException
In the previous implementation, fields.toArray returns an Array[Object], which cannot be cast to Array[StructField]; invoking this method therefore throws "java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to [Lorg.apache.spark.sql.types.StructField;".
This patch converts the java.util.List[StructField] directly into an Array[StructField].
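
For illustration, a sketch (not necessarily the exact code in this patch) of converting a `java.util.List[StructField]` without hitting the `Array[Object]` cast problem:

```
import java.util.{Arrays => JArrays, List => JList}
import scala.collection.JavaConverters._

import org.apache.spark.sql.types.{StringType, StructField, StructType}

val fields: JList[StructField] = JArrays.asList(StructField("a", StringType))

// list.toArray returns Array[Object]; converting element-wise preserves the element type.
val asStructFields: Array[StructField] = fields.asScala.toArray
val schema = StructType(asStructFields)
```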

Author: mayuanwen <mayuanwen@qiyi.com>

Closes #9649 from jackieMaKing/Spark-11679.
2015-11-17 11:15:46 -08:00
Xiangrui Meng 21fac54341 [SPARK-11766][MLLIB] add toJson/fromJson to Vector/Vectors
This is to support JSON serialization of Param[Vector] in the pipeline API. It could be used for other purposes too. The schema is the same as `VectorUDT`. jkbradley
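
A hedged usage sketch of the round trip (method placement assumed from the description above):

```
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Round-trip a vector through JSON; per the description, the schema matches VectorUDT.
val v: Vector = Vectors.dense(1.0, 0.0, 3.0)
val json: String = v.toJson               // assumed instance method added by this PR
val back: Vector = Vectors.fromJson(json) // assumed factory method added by this PR
assert(v == back)
```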

Author: Xiangrui Meng <meng@databricks.com>

Closes #9751 from mengxr/SPARK-11766.
2015-11-17 10:17:16 -08:00
Chris Bannister cc567b6634 [SPARK-11695][CORE] Set s3a credentials
Set s3a credentials when creating a new default hadoop configuration.
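
For context, a sketch of the Hadoop-side properties involved (standard s3a keys; not code from this commit, and the env-var fallback is illustrative):

```
import org.apache.hadoop.conf.Configuration

// Standard s3a credential properties on a Hadoop Configuration.
val hadoopConf = new Configuration()
hadoopConf.set("fs.s3a.access.key", sys.env.getOrElse("AWS_ACCESS_KEY_ID", ""))
hadoopConf.set("fs.s3a.secret.key", sys.env.getOrElse("AWS_SECRET_ACCESS_KEY", ""))
```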

Author: Chris Bannister <chris.bannister@swiftkey.com>

Closes #9663 from Zariel/set-s3a-creds.
2015-11-17 10:03:46 -08:00
jerryshao 6fc2740ebb [SPARK-11744][LAUNCHER] Fix exception thrown when printing the version in the pyspark shell
Exception details can be seen here (https://issues.apache.org/jira/browse/SPARK-11744).

Author: jerryshao <sshao@hortonworks.com>

Closes #9721 from jerryshao/SPARK-11744.
2015-11-17 10:01:33 -08:00
Philipp Hoffmann 15cc36b778 [SPARK-11779][DOCS] Fix reference to deprecated MESOS_NATIVE_LIBRARY
MESOS_NATIVE_LIBRARY was renamed in favor of MESOS_NATIVE_JAVA_LIBRARY. This commit fixes the reference in the documentation.

Author: Philipp Hoffmann <mail@philipphoffmann.de>

Closes #9768 from philipphoffmann/patch-2.
2015-11-17 14:13:13 +00:00
yangping.wu 7276fa9aa9 [SPARK-11751] Fix a documentation error in the "Spark Streaming Programming Guide" page
In the **[Task Launching Overheads](http://spark.apache.org/docs/latest/streaming-programming-guide.html#task-launching-overheads)** section,
>Task Serialization: Using Kryo serialization for serializing tasks can reduce the task sizes, and therefore reduce the time taken to send them to the slaves.

As we know, **Task Serialization** is configured by the **spark.closure.serializer** parameter, but currently only the Java serializer is supported. If we set **spark.closure.serializer** to **org.apache.spark.serializer.KryoSerializer**, an exception is thrown.
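
For reference, the setting the guide was describing; as noted above, only the Java serializer is supported, so the Kryo value below is exactly what does not work (illustrative snippet, not from the docs change itself):

```
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Only the default Java serializer is supported for closures; setting
  // KryoSerializer here leads to an exception, which is why the doc was fixed.
  .set("spark.closure.serializer", "org.apache.spark.serializer.KryoSerializer")
```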

Author: yangping.wu <wyphao.2007@163.com>

Closes #9734 from 397090770/397090770-patch-1.
2015-11-17 14:11:34 +00:00
Cheng Lian fa13301ae4 [SPARK-11191][SQL][FOLLOW-UP] Cleans up unnecessary anonymous HiveFunctionRegistry
According to discussion in PR #9664, the anonymous `HiveFunctionRegistry` in `HiveContext` can be removed now.

Author: Cheng Lian <lian@databricks.com>

Closes #9737 from liancheng/spark-11191.follow-up.
2015-11-17 18:11:08 +08:00
Liang-Chi Hsieh d79d8b08ff [MINOR] [SQL] Fix randomly generated ArrayData in RowEncoderSuite
The randomly generated ArrayData used for the UDT `ExamplePoint` in `RowEncoderSuite` sometimes doesn't have enough elements, which makes the test fail. This patch fixes it.

Author: Liang-Chi Hsieh <viirya@appier.com>

Closes #9757 from viirya/fix-randomgenerated-udt.
2015-11-16 23:16:17 -08:00
Kevin Yu e01865af0d [SPARK-11447][SQL] change NullType to StringType during binaryComparison between NullType and StringType
While executing the PromoteStrings rule, if one side of a binary comparison is StringType and the other side is not, the current code promotes (casts) the StringType side to DoubleType; if the string doesn't contain a number, it becomes a null value. So when doing <=> (null-safe equal) with null, the comparison does not filter anything, which causes the problem reported by this JIRA.

I propose these changes through this PR; please review my code changes.

This problem only happens for <=>; other operators work fine.

scala> val filteredDF = df.filter(df("column") > (new Column(Literal(null))))
filteredDF: org.apache.spark.sql.DataFrame = [column: string]

scala> filteredDF.show
+------+
|column|
+------+
+------+

scala> val filteredDF = df.filter(df("column") === (new Column(Literal(null))))
filteredDF: org.apache.spark.sql.DataFrame = [column: string]

scala> filteredDF.show
+------+
|column|
+------+
+------+

scala> df.registerTempTable("DF")

scala> sqlContext.sql("select * from DF where 'column' = NULL")
res27: org.apache.spark.sql.DataFrame = [column: string]

scala> res27.show
+------+
|column|
+------+
+------+

Author: Kevin Yu <qyu@us.ibm.com>

Closes #9720 from kevinyu98/working_on_spark-11447.
2015-11-16 22:54:29 -08:00
hyukjinkwon 75d2020731 [SPARK-11694][FOLLOW-UP] Clean up imports, use a common function for metadata and add a test for FIXED_LEN_BYTE_ARRAY
As discussed https://github.com/apache/spark/pull/9660 https://github.com/apache/spark/pull/9060, I cleaned up unused imports, added a test for fixed-length byte array and used a common function for writing metadata for Parquet.

For the test for fixed-length byte array, I have tested and checked the encoding types with [parquet-tools](https://github.com/Parquet/parquet-mr/tree/master/parquet-tools).

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #9754 from HyukjinKwon/SPARK-11694-followup.
2015-11-17 14:35:00 +08:00
Reynold Xin fbad920dbf [SPARK-11768][SPARK-9196][SQL] Support now function in SQL (alias for current_timestamp).
This patch adds an alias for current_timestamp (now function).

Also fixes SPARK-9196 to re-enable the test case for current_timestamp.

Author: Reynold Xin <rxin@databricks.com>

Closes #9753 from rxin/SPARK-11768.
2015-11-16 20:47:46 -08:00
Marcelo Vanzin 540bf58f18 [SPARK-11617][NETWORK] Fix leak in TransportFrameDecoder.
The code was using the wrong API to add data to the internal composite
buffer, causing buffers to leak in certain situations. Use the right
API and enhance the tests to catch memory leaks.

Also, avoid reusing the composite buffers when downstream handlers keep
references to them; this seems to cause a few different issues even though
the ref counting code seems to be correct, so instead pay the cost of copying
a few bytes when that situation happens.
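
As background (a sketch, not the Spark fix itself): with Netty's `CompositeByteBuf`, `addComponent(buf)` appends the buffer but does not advance the writer index, so the added bytes are not readable until the index is moved explicitly:

```
import io.netty.buffer.{ByteBuf, Unpooled}

val composite = Unpooled.compositeBuffer()
val part: ByteBuf = Unpooled.wrappedBuffer("hello".getBytes("UTF-8"))

composite.addComponent(part)  // bytes appended, but writerIndex is unchanged
composite.writerIndex(composite.writerIndex() + part.readableBytes())  // make them readable
```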

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #9619 from vanzin/SPARK-11617.
2015-11-16 17:28:11 -08:00
Joseph K. Bradley 1c5475f140 [SPARK-11612][ML] Pipeline and PipelineModel persistence
Pipeline and PipelineModel extend Readable and Writable.  Persistence succeeds only when all stages are Writable.

Note: This PR reinstates tests for other read/write functionality.  It should probably not get merged until [https://issues.apache.org/jira/browse/SPARK-11672] gets fixed.

CC: mengxr

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #9674 from jkbradley/pipeline-io.
2015-11-16 17:12:39 -08:00
jerryshao bd10eb81c9 [EXAMPLE][MINOR] Add missing awaitTermination in click stream example
Author: jerryshao <sshao@hortonworks.com>

Closes #9730 from jerryshao/clickstream-fix.
2015-11-16 17:02:21 -08:00
Andrew Or 33a0ec9377 [SPARK-11710] Document new memory management model
Author: Andrew Or <andrew@databricks.com>

Closes #9676 from andrewor14/memory-management-docs.
2015-11-16 17:00:18 -08:00
Kousuke Saruta 30f3cfda1c [SPARK-11480][CORE][WEBUI] Wrong callsite is displayed when using AsyncRDDActions#takeAsync
When we call AsyncRDDActions#takeAsync, another DAGScheduler#runJob is actually called from another thread, so we cannot get proper callsite information.

Following screenshots are before this patch applied and after.

Before:
<img width="1268" alt="2015-11-04 1 26 40" src="https://cloud.githubusercontent.com/assets/4736016/10914069/0ffc1306-8294-11e5-8e89-c4fadf58dd12.png">
<img width="1258" alt="2015-11-04 1 26 52" src="https://cloud.githubusercontent.com/assets/4736016/10914070/0ffe84ce-8294-11e5-8b2a-69d36276bedb.png">

After:
<img width="1268" alt="2015-11-04 0 48 07" src="https://cloud.githubusercontent.com/assets/4736016/10914080/1d8cfb7a-8294-11e5-9e09-ede25c2563e8.png">
<img width="1269" alt="2015-11-04 0 48 26" src="https://cloud.githubusercontent.com/assets/4736016/10914081/1d934e3a-8294-11e5-8b5e-e3dc37aaced3.png">

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #9437 from sarutak/SPARK-11480.
2015-11-16 16:59:16 -08:00
Shivaram Venkataraman ea6f53e48a [SPARKR][HOTFIX] Disable flaky SparkR package build test
See https://github.com/apache/spark/pull/9390#issuecomment-157160063 and https://gist.github.com/shivaram/3a2fecce60768a603dac for more information

Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

Closes #9744 from shivaram/sparkr-package-test-disable.
2015-11-16 16:57:50 -08:00
Wenchen Fan fd14936be7 [SPARK-11625][SQL] add java test for typed aggregate
Author: Wenchen Fan <wenchen@databricks.com>

Closes #9591 from cloud-fan/agg-test.
2015-11-16 15:32:49 -08:00
gatorsmile 75ee12f09c [SPARK-8658][SQL] AttributeReference's equals method compares all the members
This fix changes the equals method to check all of the specified fields of AttributeReference for equality.

Author: gatorsmile <gatorsmile@gmail.com>

Closes #9216 from gatorsmile/namedExpressEqual.
2015-11-16 15:22:12 -08:00
Bartlomiej Alberski 31296628ac [SPARK-11553][SQL] Primitive Row accessors should not convert null to default value
Invoking getters for types extending AnyVal returns the default value (if the field value is null) instead of throwing an NPE. Please check the comments on the SPARK-11553 issue for more details.

Author: Bartlomiej Alberski <bartlomiej.alberski@allegrogroup.com>

Closes #9642 from alberskib/bugfix/SPARK-11553.
2015-11-16 15:14:38 -08:00
Shixiong Zhu bcea0bfda6 [SPARK-11742][STREAMING] Add the failure info to the batch lists
<img width="1365" alt="screen shot 2015-11-13 at 9 57 43 pm" src="https://cloud.githubusercontent.com/assets/1000778/11162322/9b88e204-8a51-11e5-8c57-a44889cab713.png">

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #9711 from zsxwing/failure-info.
2015-11-16 15:06:06 -08:00
Davies Liu 3c025087b5 Revert "[SPARK-11271][SPARK-11016][CORE] Use Spark BitSet instead of RoaringBitmap to reduce memory usage"
This reverts commit e209fa271a.
2015-11-16 14:50:38 -08:00
Zee Chen 985b38dd2f [SPARK-11390][SQL] Query plan with/without filterPushdown indistinguishable
Propagate pushed filters to PhysicalRDD in DataSourceStrategy.apply

Author: Zee Chen <zeechen@us.ibm.com>

Closes #9679 from zeocio/spark-11390.
2015-11-16 14:21:28 -08:00
Wenchen Fan b1a9662623 [SPARK-11754][SQL] consolidate ExpressionEncoder.tuple and Encoders.tuple
These 2 are very similar, we can consolidate them into one.

Also add tests for it and fix a bug.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #9729 from cloud-fan/tuple.
2015-11-16 12:45:34 -08:00
jerryshao 24477d2705 [SPARK-11718][YARN][CORE] Fix explicitly killed executor dies silently issue
Currently, if dynamic allocation is enabled, explicitly killing an executor gets no response, so the executor metadata on the driver side is wrong, which makes dynamic allocation on YARN fail to work.

The problem is that `disableExecutor` returns false for pending-kill executors when `onDisconnect` is detected, so nothing further is done.

One solution is to bypass these explicitly killed executors and use `super.onDisconnect` to remove the executor. This is simple.

Another solution is to still query the loss reason for these explicitly killed executors. Since an executor may get killed and reported in the same AM-RM communication, the current way of adding a pending loss-reason request does not work (container completion is already processed), so here we should store the loss reason for later query.

Here this PR chooses solution 2.

Please help to review. vanzin I think this part was changed by you previously, would you please help to review? Thanks a lot.

Author: jerryshao <sshao@hortonworks.com>

Closes #9684 from jerryshao/SPARK-11718.
2015-11-16 11:43:18 -08:00
Daniel Jalova ace0db4714 [SPARK-6328][PYTHON] Python API for StreamingListener
Author: Daniel Jalova <djalova@us.ibm.com>

Closes #9186 from djalova/SPARK-6328.
2015-11-16 11:29:27 -08:00
Burak Yavuz de5e531d33 [SPARK-11731][STREAMING] Enable batching on Driver WriteAheadLog by default
Using batching on the driver for the WriteAheadLog should be an improvement for all environments and use cases. Users will be able to scale to much higher number of receivers with the BatchedWriteAheadLog. Therefore we should turn it on by default, and QA it in the QA period.

I've also added some tests to make sure the default configurations are correct regarding recent additions:
 - batching on by default
 - closeFileAfterWrite off by default
 - parallelRecovery off by default

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #9695 from brkyvz/enable-batch-wal.
2015-11-16 11:21:17 -08:00
Liang-Chi Hsieh b0c3fd34e4 [SPARK-11743] [SQL] Add UserDefinedType support to RowEncoder
JIRA: https://issues.apache.org/jira/browse/SPARK-11743

RowEncoder doesn't currently support UserDefinedType. This patch adds support for it.

Author: Liang-Chi Hsieh <viirya@appier.com>

Closes #9712 from viirya/rowencoder-udt.
2015-11-16 09:03:42 -08:00
Wenchen Fan 06f1fdba6d [SPARK-11752] [SQL] fix timezone problem for DateTimeUtils.getSeconds
code snippet to reproduce it:
```
// Assumed imports for this snippet (not shown in the original message);
// `===` comes from ScalaTest's assertions.
import java.sql.Timestamp
import java.util.TimeZone
import org.apache.spark.sql.catalyst.util.DateTimeUtils.{fromJavaTimestamp, getSeconds}

TimeZone.setDefault(TimeZone.getTimeZone("Asia/Shanghai"))
val t = Timestamp.valueOf("1900-06-11 12:14:50.789")
val us = fromJavaTimestamp(t)
assert(getSeconds(us) === t.getSeconds)
```

It would be good to add a regression test for it, but the reproducing code needs to change the default timezone, and even if we change it back, the `lazy val defaultTimeZone` in `DateTimeUtils` has already been evaluated and is fixed.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #9728 from cloud-fan/seconds.
2015-11-16 08:58:40 -08:00
xin Wu 0e79604aed [SPARK-11522][SQL] input_file_name() returns "" for external tables
When computing a partition for a non-Parquet relation, `HadoopRDD.compute` is used, but it does not set the thread-local variable `inputFileName` in `NewSqlHadoopRDD`, as `NewSqlHadoopRDD.compute` does. Yet when reading the `inputFileName`, `NewSqlHadoopRDD.inputFileName` is expected, which is currently empty.
Setting `inputFileName` in `HadoopRDD.compute` resolves this issue.
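
A quick usage sketch that exercises this path (assuming a spark-shell `sqlContext`; the table name is hypothetical):

```
// input_file_name() should now return the source file for external, non-Parquet tables.
// "ext_logs" is a hypothetical external table backed by text files.
sqlContext.sql(
  "SELECT input_file_name(), count(*) FROM ext_logs GROUP BY input_file_name()").show()
```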

Author: xin Wu <xinwu@us.ibm.com>

Closes #9542 from xwu0226/SPARK-11522.
2015-11-16 08:10:48 -08:00
hyukjinkwon e388b39d10 [SPARK-11692][SQL] Support for Parquet logical types, JSON and BSON (embedded types)
Parquet supports some JSON and BSON datatypes. They are represented as binary for BSON and string (UTF-8) for JSON internally.

I searched a bit and found that Apache Drill also supports both in this way: [link](https://drill.apache.org/docs/parquet-format/).

Author: hyukjinkwon <gurwls223@gmail.com>
Author: Hyukjin Kwon <gurwls223@gmail.com>

Closes #9658 from HyukjinKwon/SPARK-11692.
2015-11-16 21:59:33 +08:00
hyukjinkwon 7f8eb3bf6e [SPARK-11044][SQL] Parquet writer version fixed as version1
https://issues.apache.org/jira/browse/SPARK-11044

Spark writes Parquet files only with writer version 1, ignoring the writer version given by the user.

So, in this PR, it keeps the user-specified writer version if given, and otherwise sets version 1 as the default.
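
A hedged sketch of how a user might request writer version 2 (the property key and enum values are assumed parquet-mr conventions, not quoted from this PR; `sc` and `df` come from a shell session):

```
// Assumed parquet-mr property controlling the writer version; typical values
// are "PARQUET_1_0" and "PARQUET_2_0".
sc.hadoopConfiguration.set("parquet.writer.version", "PARQUET_2_0")
df.write.parquet("/tmp/out-v2")
```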

Author: hyukjinkwon <gurwls223@gmail.com>
Author: HyukjinKwon <gurwls223@gmail.com>

Closes #9060 from HyukjinKwon/SPARK-11044.
2015-11-16 21:30:10 +08:00
Reynold Xin 42de5253f3 [SPARK-11745][SQL] Enable more JSON parsing options
This patch adds the following options to the JSON data source, for dealing with non-standard JSON files:
* `allowComments` (default `false`): ignores Java/C++ style comment in JSON records
* `allowUnquotedFieldNames` (default `false`): allows unquoted JSON field names
* `allowSingleQuotes` (default `true`): allows single quotes in addition to double quotes
* `allowNumericLeadingZeros` (default `false`): allows leading zeros in numbers (e.g. 00012)

To avoid passing a lot of options throughout the json package, I introduced a new JSONOptions case class to define all JSON config options.

Also updated documentation to explain these options.
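
A usage sketch with the new options (option names from the list above; the path and `sqlContext` are illustrative):

```
// Read non-standard JSON that contains comments and unquoted field names.
val df = sqlContext.read
  .option("allowComments", "true")
  .option("allowUnquotedFieldNames", "true")
  .json("/path/to/records.json")
```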

Scala

![screen shot 2015-11-15 at 6 12 12 pm](https://cloud.githubusercontent.com/assets/323388/11172965/e3ace6ec-8bc4-11e5-805e-2d78f80d0ed6.png)

Python

![screen shot 2015-11-15 at 6 11 28 pm](https://cloud.githubusercontent.com/assets/323388/11172964/e23ed6ee-8bc4-11e5-8216-312f5983acd5.png)

Author: Reynold Xin <rxin@databricks.com>

Closes #9724 from rxin/SPARK-11745.
2015-11-16 00:06:14 -08:00
Josh Rosen fd50fa4c3e Revert "[SPARK-11572] Exit AsynchronousListenerBus thread when stop() is called"
This reverts commit 3e0a6cf1e0.
2015-11-15 22:38:30 -08:00
gatorsmile b58765caa6 [SPARK-9928][SQL] Removal of LogicalLocalTable
LogicalLocalTable in ExistingRDD.scala is replaced by localRelation in LocalRelation.scala?

Do you know any reason why we still keep this class?

Author: gatorsmile <gatorsmile@gmail.com>

Closes #9717 from gatorsmile/LogicalLocalTable.
2015-11-15 21:10:46 -08:00
Sun Rui 835a79d78e [SPARK-10500][SPARKR] sparkr.zip cannot be created if /R/lib is unwritable
The basic idea is that:
The archive of the SparkR package itself, that is sparkr.zip, is created during the build process and is contained in the Spark binary distribution. It is not changed after the distribution is installed, since the directory it resides in ($SPARK_HOME/R/lib) may not be writable.

When there is R source code contained in jars or Spark packages specified with "--jars" or "--packages" command line option, a temporary directory is created by calling Utils.createTempDir() where the R packages built from the R source code will be installed. The temporary directory is writable, and won't interfere with each other when there are multiple SparkR sessions, and will be deleted when this SparkR session ends. The R binary packages installed in the temporary directory then are packed into an archive named rpkg.zip.

sparkr.zip and rpkg.zip are distributed to the cluster in YARN modes.

The distribution of rpkg.zip in Standalone modes is not supported in this PR, and will be addressed in another PR.

Various R files are updated to accept multiple lib paths (one for the SparkR package, the other for additional R packages) so that these packages can be accessed in R.

Author: Sun Rui <rui.sun@intel.com>

Closes #9390 from sun-rui/SPARK-10500.
2015-11-15 19:29:09 -08:00
zero323 d7d9fa0b87 [SPARK-11086][SPARKR] Use dropFactors column-wise instead of nested loop when createDataFrame
Use `dropFactors` column-wise instead of nested loop when `createDataFrame` from a `data.frame`

At this moment SparkR's createDataFrame uses a nested loop to convert factors to character when called on a local data.frame. It works but is incredibly slow, especially with data.table (~2 orders of magnitude compared to the PySpark / Pandas version on a DataFrame of size 1M rows x 2 columns).

A simple improvement is to apply `dropFactors` column-wise and then reshape the output list.

It should at least partially address [SPARK-8277](https://issues.apache.org/jira/browse/SPARK-8277).

Author: zero323 <matthew.szymkiewicz@gmail.com>

Closes #9099 from zero323/SPARK-11086.
2015-11-15 19:15:27 -08:00
Yu Gao 72c1d68b4a [SPARK-10181][SQL] Do kerberos login for credentials during hive client initialization
On driver process startup, UserGroupInformation.loginUserFromKeytab is called with the principal and keytab passed in, so the static var UserGroupInformation.loginUser is set to that principal with Kerberos credentials saved in its private credential set, and all threads within the driver process are supposed to see and use these login credentials to authenticate with Hive and Hadoop. However, because of IsolatedClientLoader, the UserGroupInformation class is not shared with Hive metastore clients; instead it is loaded separately and of course is not able to see the prepared Kerberos login credentials from the main thread.

The first proposed fix would cause other classloader conflict errors, and is not an appropriate solution. This new change does kerberos login during hive client initialization, which will make credentials ready for the particular hive client instance.
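
The underlying Hadoop call involved looks roughly like this (a sketch only; the principal and keytab path are placeholders, and this is not the patch itself):

```
import org.apache.hadoop.security.UserGroupInformation

// Log in from a keytab so this process holds Kerberos credentials
// before the Hive metastore client is created.
UserGroupInformation.loginUserFromKeytab(
  "spark/host.example.com@EXAMPLE.COM", "/etc/security/keytabs/spark.keytab")
```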

 yhuai Please take a look and let me know. If you are not the right person to talk to, could you point me to someone responsible for this?

Author: Yu Gao <ygao@us.ibm.com>
Author: gaoyu <gaoyu@gaoyu-macbookpro.roam.corp.google.com>
Author: Yu Gao <crystalgaoyu@gmail.com>

Closes #9272 from yolandagao/master.
2015-11-15 14:53:59 -08:00
Yin Huai 3e2e1873b2 [SPARK-11738] [SQL] Making ArrayType orderable
https://issues.apache.org/jira/browse/SPARK-11738

Author: Yin Huai <yhuai@databricks.com>

Closes #9718 from yhuai/makingArrayOrderable.
2015-11-15 13:59:59 -08:00