ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
jerryshao	f97e9323b5	[SPARK-10739] [YARN] Add application attempt window for Spark on Yarn Add application attempt window for Spark on Yarn to ignore old out of window failures, this is useful for long running applications to recover from failures. Author: jerryshao <sshao@hortonworks.com> Closes #8857 from jerryshao/SPARK-10739 and squashes the following commits: 36eabdc [jerryshao] change the doc 7f9b77d [jerryshao] Style change 1c9afd0 [jerryshao] Address the comments caca695 [jerryshao] Add application attempt window for Spark on Yarn	2015-10-12 18:18:19 -07:00
Kay Ousterhout	091c2c3ecd	[SPARK-11056] Improve documentation of SBT build. This commit improves the documentation around building Spark to (1) recommend using SBT interactive mode to avoid the overhead of launching SBT and (2) refer to the wiki page that documents using SPARK_PREPEND_CLASSES to avoid creating the assembly jar for each compile. cc srowen Author: Kay Ousterhout <kayousterhout@gmail.com> Closes #9068 from kayousterhout/SPARK-11056.	2015-10-12 14:23:29 -07:00
Yin Huai	8a354bef55	[SPARK-11042] [SQL] Add a mechanism to ban creating multiple root SQLContexts/HiveContexts in a JVM https://issues.apache.org/jira/browse/SPARK-11042 Author: Yin Huai <yhuai@databricks.com> Closes #9058 from yhuai/SPARK-11042.	2015-10-12 13:50:34 -07:00
Ashwin Shankar	2e572c4135	[SPARK-8170] [PYTHON] Add signal handler to trap Ctrl-C in pyspark and cancel all running jobs This patch adds a signal handler to trap Ctrl-C and cancels running job. Author: Ashwin Shankar <ashankar@netflix.com> Closes #9033 from ashwinshankar77/master.	2015-10-12 11:06:21 -07:00
Marcelo Vanzin	149472a01d	[SPARK-11023] [YARN] Avoid creating URIs from local paths directly. The issue is that local paths on Windows, when provided with drive letters or backslashes, are not valid URIs. Instead of trying to figure out whether paths are URIs or not, use Utils.resolveURI() which does that for us. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #9049 from vanzin/SPARK-11023 and squashes the following commits: 77021f2 [Marcelo Vanzin] [SPARK-11023] [yarn] Avoid creating URIs from local paths directly.	2015-10-12 10:21:57 -07:00
Cheng Lian	64b1d00e1a	[SPARK-11007] [SQL] Adds dictionary aware Parquet decimal converters For Parquet decimal columns that are encoded using plain-dictionary encoding, we can make the upper level converter aware of the dictionary, so that we can pre-instantiate all the decimals to avoid duplicated instantiation. Note that plain-dictionary encoding isn't available for `FIXED_LEN_BYTE_ARRAY` for Parquet writer version `PARQUET_1_0`. So currently only decimals written as `INT32` and `INT64` can benefit from this optimization. Author: Cheng Lian <lian@databricks.com> Closes #9040 from liancheng/spark-11007.decimal-converter-dict-support.	2015-10-12 10:17:19 -07:00
Liang-Chi Hsieh	fcb37a0417	[SPARK-10960] [SQL] SQL with windowing function should be able to refer column in inner select JIRA: https://issues.apache.org/jira/browse/SPARK-10960 When accessing a column in inner select from a select with window function, `AnalysisException` will be thrown. For example, an query like this: select area, rank() over (partition by area order by tmp.month) + tmp.tmp1 as c1 from (select month, area, product, 1 as tmp1 from windowData) tmp Currently, the rule `ExtractWindowExpressions` in `Analyzer` only extracts regular expressions from `WindowFunction`, `WindowSpecDefinition` and `AggregateExpression`. We need to also extract other attributes as the one in `Alias` as shown in the above query. Author: Liang-Chi Hsieh <viirya@appier.com> Closes #9011 from viirya/fix-window-inner-column.	2015-10-12 09:16:14 -07:00
Josh Rosen	595012ea8b	[SPARK-11053] Remove use of KVIterator in SortBasedAggregationIterator SortBasedAggregationIterator uses a KVIterator interface in order to process input rows as key-value pairs, but this use of KVIterator is unnecessary, slightly complicates the code, and might hurt performance. This patch refactors this code to remove the use of this extra layer of iterator wrapping and simplifies other parts of the code in the process. Author: Josh Rosen <joshrosen@databricks.com> Closes #9066 from JoshRosen/sort-iterator-cleanup.	2015-10-11 18:11:08 -07:00
Jacker Hu	a16396df76	[SPARK-10772] [STREAMING] [SCALA] NullPointerException when transform function in DStream returns NULL Currently, the ```TransformedDStream``` will using ```Some(transformFunc(parentRDDs, validTime))``` as compute return value, when the ```transformFunc``` somehow returns null as return value, the followed operator will have NullPointerExeception. This fix uses the ```Option()``` instead of ```Some()``` to deal with the possible null value. When ```transformFunc``` returns ```null```, the option will transform null to ```None```, the downstream can handle ```None``` correctly. NOTE (2015-09-25): The latest fix will check the return value of transform function, if it is ```NULL```, a spark exception will be thrown out Author: Jacker Hu <gt.hu.chang@gmail.com> Author: jhu-chang <gt.hu.chang@gmail.com> Closes #8881 from jhu-chang/Fix_Transform.	2015-10-10 11:36:18 +01:00
Sun Rui	864de3bf40	[SPARK-10079] [SPARKR] Make 'column' and 'col' functions be S4 functions. 1. Add a "col" function into DataFrame. 2. Move the current "col" function in Column.R to functions.R, convert it to S4 function. 3. Add a s4 "column" function in functions.R. 4. Convert the "column" function in Column.R to S4 function. This is for private use. Author: Sun Rui <rui.sun@intel.com> Closes #8864 from sun-rui/SPARK-10079.	2015-10-09 23:05:38 -07:00
Vladimir Vladimirov	c1b4ce4326	[SPARK-10535] Sync up API for matrix factorization model between Scala and PySpark Support for recommendUsersForProducts and recommendProductsForUsers in matrix factorization model for PySpark Author: Vladimir Vladimirov <vladimir.vladimirov@magnetic.com> Closes #8700 from smartkiwi/SPARK-10535_.	2015-10-09 14:16:13 -07:00
Tom Graves	63c340a710	[SPARK-10858] YARN: archives/jar/files rename with # doesn't work unl https://issues.apache.org/jira/browse/SPARK-10858 The issue here is that in resolveURI we default to calling new File(path).getAbsoluteFile().toURI(). But if the path passed in already has a # in it then File(path) will think that is supposed to be part of the actual file path and not a fragment so it changes # to %23. Then when we try to parse that later in Client as a URI it doesn't recognize there is a fragment. so to fix we just check if there is a fragment, still create the File like we did before and then add the fragment back on. Author: Tom Graves <tgraves@yahoo-inc.com> Closes #9035 from tgravescs/SPARK-10858.	2015-10-09 14:06:25 -07:00
Rick Hillegas	12b7191d20	[SPARK-10855] [SQL] Add a JDBC dialect for Apache Derby marmbrus rxin This patch adds a JdbcDialect class, which customizes the datatype mappings for Derby backends. The patch also adds unit tests for the new dialect, corresponding to the existing tests for other JDBC dialects. JDBCSuite runs cleanly for me with this patch. So does JDBCWriteSuite, although it produces noise as described here: https://issues.apache.org/jira/browse/SPARK-10890 This patch is my original work, which I license to the ASF. I am a Derby contributor, so my ICLA is on file under SVN id "rhillegas": http://people.apache.org/committer-index.html Touches the following files: --------------------------------- org.apache.spark.sql.jdbc.JdbcDialects Adds a DerbyDialect. --------------------------------- org.apache.spark.sql.jdbc.JDBCSuite Adds unit tests for the new DerbyDialect. Author: Rick Hillegas <rhilleg@us.ibm.com> Closes #8982 from rick-ibm/b_10855.	2015-10-09 13:36:51 -07:00
Marcelo Vanzin	015f7ef503	[SPARK-8673] [LAUNCHER] API and infrastructure for communicating with child apps. This change adds an API that encapsulates information about an app launched using the library. It also creates a socket-based communication layer for apps that are launched as child processes; the launching application listens for connections from launched apps, and once communication is established, the channel can be used to send updates to the launching app, or to send commands to the child app. The change also includes hooks for local, standalone/client and yarn masters. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #7052 from vanzin/SPARK-8673.	2015-10-09 15:28:09 -05:00
Rerngvit Yanggratoke	70f44ad2d8	[SPARK-10905] [SPARKR] Export freqItems() for DataFrameStatFunctions [SPARK-10905][SparkR]: Export freqItems() for DataFrameStatFunctions - Add function (together with roxygen2 doc) to DataFrame.R and generics.R - Expose the function in NAMESPACE - Add unit test for the function Author: Rerngvit Yanggratoke <rerngvit@kth.se> Closes #8962 from rerngvit/SPARK-10905.	2015-10-09 09:36:40 -07:00
Nick Pritchard	5994cfe812	[SPARK-10875] [MLLIB] Computed covariance matrix should be symmetric Compute upper triangular values of the covariance matrix, then copy to lower triangular values. Author: Nick Pritchard <nicholas.pritchard@falkonry.com> Closes #8940 from pnpritchard/SPARK-10875.	2015-10-08 22:22:20 -07:00
Bryan Cutler	5410747a84	[SPARK-10959] [PYSPARK] StreamingLogisticRegressionWithSGD does not train with given regParam and convergenceTol parameters These params were being passed into the StreamingLogisticRegressionWithSGD constructor, but not transferred to the call for model training. Same with StreamingLinearRegressionWithSGD. I added the params as named arguments to the call and also fixed the intercept parameter, which was being passed as regularization value. Author: Bryan Cutler <bjcutler@us.ibm.com> Closes #9002 from BryanCutler/StreamingSGD-convergenceTol-bug-10959.	2015-10-08 22:21:07 -07:00
Andrew Or	67fbecbf32	[SPARK-10956] Common MemoryManager interface for storage and execution This patch introduces a `MemoryManager` that is the central arbiter of how much memory to grant to storage and execution. This patch is primarily concerned only with refactoring while preserving the existing behavior as much as possible. This is the first step away from the existing rigid separation of storage and execution memory, which has several major drawbacks discussed on the [issue](https://issues.apache.org/jira/browse/SPARK-10956). It is the precursor of a series of patches that will attempt to address those drawbacks. Author: Andrew Or <andrew@databricks.com> Author: Josh Rosen <joshrosen@databricks.com> Author: andrewor14 <andrew@databricks.com> Closes #9000 from andrewor14/memory-manager.	2015-10-08 21:44:59 -07:00
Hari Shreedharan	0984129005	[SPARK-10955] [STREAMING] Add a warning if dynamic allocation for Streaming applications Dynamic allocation can be painful for streaming apps and can lose data. Log a warning for streaming applications if dynamic allocation is enabled. Author: Hari Shreedharan <hshreedharan@apache.org> Closes #8998 from harishreedharan/ss-log-error and squashes the following commits: 462b264 [Hari Shreedharan] Improve log message. 2733d94 [Hari Shreedharan] Minor change to warning message. eaa48cc [Hari Shreedharan] Log a warning instead of failing the application if dynamic allocation is enabled. 725f090 [Hari Shreedharan] Add config parameter to allow dynamic allocation if the user explicitly sets it. b3f9a95 [Hari Shreedharan] Disable dynamic allocation and kill app if it is enabled. a4a5212 [Hari Shreedharan] [streaming] SPARK-10955. Disable dynamic allocation for Streaming applications.	2015-10-08 18:53:38 -07:00
Hari Shreedharan	fa3e4d8f52	[SPARK-11019] [STREAMING] [FLUME] Gracefully shutdown Flume receiver th… …reads. Wait for a minute for the receiver threads to shutdown before interrupting them. Author: Hari Shreedharan <hshreedharan@apache.org> Closes #9041 from harishreedharan/flume-graceful-shutdown.	2015-10-08 18:50:27 -07:00
zero323	8e67882b90	[SPARK-10973] [ML] [PYTHON] __gettitem__ method throws IndexError exception when we… __gettitem__ method throws IndexError exception when we try to access index after the last non-zero entry from pyspark.mllib.linalg import Vectors sv = Vectors.sparse(5, {1: 3}) sv[0] ## 0.0 sv[1] ## 3.0 sv[2] ## Traceback (most recent call last): ## File "<stdin>", line 1, in <module> ## File "/python/pyspark/mllib/linalg/__init__.py", line 734, in __getitem__ ## row_ind = inds[insert_index] ## IndexError: index out of bounds Author: zero323 <matthew.szymkiewicz@gmail.com> Closes #9009 from zero323/sparse_vector_index_error.	2015-10-08 18:34:15 -07:00
Davies Liu	3390b400d0	[SPARK-10810] [SPARK-10902] [SQL] Improve session management in SQL This PR improve the sessions management by replacing the thread-local based to one SQLContext per session approach, introduce separated temporary tables and UDFs/UDAFs for each session. A new session of SQLContext could be created by: 1) create an new SQLContext 2) call newSession() on existing SQLContext For HiveContext, in order to reduce the cost for each session, the classloader and Hive client are shared across multiple sessions (created by newSession). CacheManager is also shared by multiple sessions, so cache a table multiple times in different sessions will not cause multiple copies of in-memory cache. Added jars are still shared by all the sessions, because SparkContext does not support sessions. cc marmbrus yhuai rxin Author: Davies Liu <davies@databricks.com> Closes #8909 from davies/sessions.	2015-10-08 17:34:24 -07:00
Reynold Xin	84ea287178	[SPARK-10914] UnsafeRow serialization breaks when two machines have different Oops size. UnsafeRow contains 3 pieces of information when pointing to some data in memory (an object, a base offset, and length). When the row is serialized with Java/Kryo serialization, the object layout in memory can change if two machines have different pointer width (Oops in JVM). To reproduce, launch Spark using MASTER=local-cluster[2,1,1024] bin/spark-shell --conf "spark.executor.extraJavaOptions=-XX:-UseCompressedOops" And then run the following scala> sql("select 1 xx").collect() Author: Reynold Xin <rxin@databricks.com> Closes #9030 from rxin/SPARK-10914.	2015-10-08 17:25:14 -07:00
Cheng Lian	02149ff08e	[SPARK-8848] [SQL] Refactors Parquet write path to follow parquet-format This PR refactors Parquet write path to follow parquet-format spec. It's a successor of PR #7679, but with less non-essential changes. Major changes include: 1. Replaces `RowWriteSupport` and `MutableRowWriteSupport` with `CatalystWriteSupport` - Writes Parquet data using standard layout defined in parquet-format Specifically, we are now writing ... - ... arrays and maps in standard 3-level structure with proper annotations and field names - ... decimals as `INT32` and `INT64` whenever possible, and taking `FIXED_LEN_BYTE_ARRAY` as the final fallback - Supports legacy mode which is compatible with Spark 1.4 and prior versions The legacy mode is by default off, and can be turned on by flipping SQL option `spark.sql.parquet.writeLegacyFormat` to `true`. - Eliminates per value data type dispatching costs via prebuilt composed writer functions 1. Cleans up the last pieces of old Parquet support code As pointed out by rxin previously, we probably want to rename all those `Catalyst` Parquet classes to `Parquet` for clarity. But I'd like to do this in a follow-up PR to minimize code review noises in this one. Author: Cheng Lian <lian@databricks.com> Closes #8988 from liancheng/spark-8848/standard-parquet-write-path.	2015-10-08 16:18:35 -07:00
Josh Rosen	2816c89b6a	[SPARK-10988] [SQL] Reduce duplication in Aggregate2's expression rewriting logic In `aggregate/utils.scala`, there is a substantial amount of duplication in the expression-rewriting logic. As a prerequisite to supporting imperative aggregate functions in `TungstenAggregate`, this patch refactors this file so that the same expression-rewriting logic is used for both `SortAggregate` and `TungstenAggregate`. In order to allow both operators to use the same rewriting logic, `TungstenAggregationIterator. generateResultProjection()` has been updated so that it first evaluates all declarative aggregate functions' `evaluateExpression`s and writes the results into a temporary buffer, and then uses this temporary buffer and the grouping expressions to evaluate the final resultExpressions. This matches the logic in SortAggregateIterator, where this two-pass approach is necessary in order to support imperative aggregates. If this change turns out to cause performance regressions, then we can look into re-implementing the single-pass evaluation in a cleaner way as part of a followup patch. Since the rewriting logic is now shared across both operators, this patch also extracts that logic and places it in `SparkStrategies`. This makes the rewriting logic a bit easier to follow, I think. Author: Josh Rosen <joshrosen@databricks.com> Closes #9015 from JoshRosen/SPARK-10988.	2015-10-08 14:56:27 -07:00
Michael Armbrust	9e66a53c99	[SPARK-10993] [SQL] Inital code generated encoder for product types This PR is a first cut at code generating an encoder that takes a Scala `Product` type and converts it directly into the tungsten binary format. This is done through the addition of a new set of expression that can be used to invoke methods on raw JVM objects, extracting fields and converting the result into the required format. These can then be used directly in an `UnsafeProjection` allowing us to leverage the existing encoding logic. According to some simple benchmarks, this can significantly speed up conversion (~4x). However, replacing CatalystConverters is deferred to a later PR to keep this PR at a reasonable size. ```scala case class SomeInts(a: Int, b: Int, c: Int, d: Int, e: Int) val data = SomeInts(1, 2, 3, 4, 5) val encoder = ProductEncoder[SomeInts] val converter = CatalystTypeConverters.createToCatalystConverter(ScalaReflection.schemaFor[SomeInts].dataType) (1 to 5).foreach {iter => benchmark(s"converter $iter") { var i = 100000000 while (i > 0) { val res = converter(data).asInstanceOf[InternalRow] assert(res.getInt(0) == 1) assert(res.getInt(1) == 2) i -= 1 } } benchmark(s"encoder $iter") { var i = 100000000 while (i > 0) { val res = encoder.toRow(data) assert(res.getInt(0) == 1) assert(res.getInt(1) == 2) i -= 1 } } } ``` Results: ``` [info] converter 1: 7170ms [info] encoder 1: 1888ms [info] converter 2: 6763ms [info] encoder 2: 1824ms [info] converter 3: 6912ms [info] encoder 3: 1802ms [info] converter 4: 7131ms [info] encoder 4: 1798ms [info] converter 5: 7350ms [info] encoder 5: 1912ms ``` Author: Michael Armbrust <michael@databricks.com> Closes #9019 from marmbrus/productEncoder.	2015-10-08 14:28:14 -07:00
Michael Armbrust	a8226a9f14	Revert [SPARK-8654] [SQL] Fix Analysis exception when using NULL IN This reverts commit `dcbd58a929` from #8983 Author: Michael Armbrust <michael@databricks.com> Closes #9034 from marmbrus/revert8654.	2015-10-08 13:49:10 -07:00
Wenchen Fan	af2a554487	[SPARK-10337] [SQL] fix hive views on non-hive-compatible tables. add a new config to deal with this special case. Author: Wenchen Fan <cloud0fan@163.com> Closes #8990 from cloud-fan/view-master.	2015-10-08 12:42:10 -07:00
Yin Huai	82d275f27c	[SPARK-10887] [SQL] Build HashedRelation outside of HashJoinNode. This PR refactors `HashJoinNode` to take a existing `HashedRelation`. So, we can reuse this node for both `ShuffledHashJoin` and `BroadcastHashJoin`. https://issues.apache.org/jira/browse/SPARK-10887 Author: Yin Huai <yhuai@databricks.com> Closes #8953 from yhuai/SPARK-10887.	2015-10-08 11:56:44 -07:00
tedyu	2a6f614cd6	[SPARK-11006] Rename NullColumnAccess as NullColumnAccessor davies I think NullColumnAccessor follows same convention for other accessors Author: tedyu <yuzhihong@gmail.com> Closes #9028 from tedyu/master.	2015-10-08 11:51:58 -07:00
Yanbo Liang	2268356002	[SPARK-7770] [ML] GBT validationTol change to compare with relative or absolute error GBT compare ValidateError with tolerance switching between relative and absolute ones, where the former one is relative to the current loss on the training set. Author: Yanbo Liang <ybliang8@gmail.com> Closes #8549 from yanboliang/spark-7770.	2015-10-08 11:27:46 -07:00
Holden Karau	0903c6489e	[SPARK-9718] [ML] linear regression training summary all columns LinearRegression training summary: The transformed dataset should hold all columns, not just selected ones like prediction and label. There is no real need to remove some, and the user may find them useful. Author: Holden Karau <holden@pigscanfly.ca> Closes #8564 from holdenk/SPARK-9718-LinearRegressionTrainingSummary-all-columns.	2015-10-08 11:16:20 -07:00
Dilip Biswal	dcbd58a929	[SPARK-8654] [SQL] Fix Analysis exception when using NULL IN (...) In the analysis phase , while processing the rules for IN predicate, we compare the in-list types to the lhs expression type and generate cast operation if necessary. In the case of NULL [NOT] IN expr1 , we end up generating cast between in list types to NULL like cast (1 as NULL) which is not a valid cast. The fix is to not generate such a cast if the lhs type is a NullType instead we translate the expression to Literal(Null). Author: Dilip Biswal <dbiswal@us.ibm.com> Closes #8983 from dilipbiswal/spark_8654.	2015-10-08 10:41:45 -07:00
Michael Armbrust	5c9fdf74e3	[SPARK-10998] [SQL] Show non-children in default Expression.toString Its pretty hard to debug problems with expressions when you can't see all the arguments. Before: `invoke()` After: `invoke(inputObject#1, intField, IntegerType)` Author: Michael Armbrust <michael@databricks.com> Closes #9022 from marmbrus/expressionToString.	2015-10-08 10:22:06 -07:00
Narine Kokhlikyan	e8f90d9dda	[SPARK-10836] [SPARKR] Added sort(x, decreasing, col, ... ) method to DataFrame the sort function can be used as an alternative to arrange(... ). As arguments it accepts x - dataframe, decreasing - TRUE/FALSE, a list of orderings for columns and the list of columns, represented as string names for example: sort(df, TRUE, "col1","col2","col3","col5") # for example, if we want to sort some of the columns in the same order sort(df, decreasing=TRUE, "col1") sort(df, decreasing=c(TRUE,FALSE), "col1","col2") Author: Narine Kokhlikyan <narine.kokhlikyan@gmail.com> Closes #8920 from NarineK/sparkrsort.	2015-10-08 09:53:44 -07:00
Marcelo Vanzin	56a9692fc0	[SPARK-10987] [YARN] Workaround for missing netty rpc disconnection event. In YARN client mode, when the AM connects to the driver, it may be the case that the driver never needs to send a message back to the AM (i.e., no dynamic allocation or preemption). This triggers an issue in the netty rpc backend where no disconnection event is sent to endpoints, and the AM never exits after the driver shuts down. The real fix is too complicated, so this is a quick hack to unblock YARN client mode until we can work on the real fix. It forces the driver to send a message to the AM when the AM registers, thus establishing that connection and enabling the disconnection event when the driver goes away. Also, a minor side issue: when the executor is shutting down, it needs to send an "ack" back to the driver when using the netty rpc backend; but that "ack" wasn't being sent because the handler was shutting down the rpc env before returning. So added a change to delay the shutdown a little bit, allowing the ack to be sent back. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #9021 from vanzin/SPARK-10987.	2015-10-08 09:47:58 -07:00
Cheng Lian	2df882ef14	[SPARK-5775] [SPARK-5508] [SQL] Re-enable Hive Parquet array reading tests Since SPARK-5508 has already been fixed. Author: Cheng Lian <lian@databricks.com> Closes #8999 from liancheng/spark-5775.enable-array-tests.	2015-10-08 09:22:42 -07:00
Cheng Lian	59b0606f33	[SPARK-10999] [SQL] Coalesce should be able to handle UnsafeRow Author: Cheng Lian <lian@databricks.com> Closes #9024 from liancheng/spark-10999.coalesce-unsafe-row-handling.	2015-10-08 09:20:36 -07:00
Jean-Baptiste Onofré	60150cf00a	[SPARK-10883] Add a note about how to build Spark sub-modules (reactor) Author: Jean-Baptiste Onofré <jbonofre@apache.org> Closes #8993 from jbonofre/SPARK-10883-2.	2015-10-08 11:38:39 +01:00
admackin	cd28139c9b	Akka framesize units should be specified 1.4 docs noted that the units were MB - i have assumed this is still the case Author: admackin <admackin@users.noreply.github.com> Closes #9025 from admackin/master.	2015-10-08 00:01:23 -07:00
0x0FFF	b8f849b546	[SPARK-7869][SQL] Adding Postgres JSON and JSONb data types support This PR addresses [SPARK-7869](https://issues.apache.org/jira/browse/SPARK-7869) Before the patch, attempt to load the table from Postgres with JSON/JSONb datatype caused error `java.sql.SQLException: Unsupported type 1111` Postgres data types JSON and JSONb are now mapped to String on Spark side thus they can be loaded into DF and processed on Spark side Example Postgres: ``` create table test_json (id int, value json); create table test_jsonb (id int, value jsonb); insert into test_json (id, value) values (1, '{"field1":"value1","field2":"value2","field3":[1,2,3]}'::json), (2, '{"field1":"value3","field2":"value4","field3":[4,5,6]}'::json), (3, '{"field3":"value5","field4":"value6","field3":[7,8,9]}'::json); insert into test_jsonb (id, value) values (4, '{"field1":"value1","field2":"value2","field3":[1,2,3]}'::jsonb), (5, '{"field1":"value3","field2":"value4","field3":[4,5,6]}'::jsonb), (6, '{"field3":"value5","field4":"value6","field3":[7,8,9]}'::jsonb); ``` PySpark: ``` >>> import json >>> df1 = sqlContext.read.jdbc("jdbc:postgresql://127.0.0.1:5432/test?user=testuser", "test_json") >>> df1.map(lambda x: (x.id, json.loads(x.value))).map(lambda (id, value): (id, value.get('field3'))).collect() [(1, [1, 2, 3]), (2, [4, 5, 6]), (3, [7, 8, 9])] >>> df2 = sqlContext.read.jdbc("jdbc:postgresql://127.0.0.1:5432/test?user=testuser", "test_jsonb") >>> df2.map(lambda x: (x.id, json.loads(x.value))).map(lambda (id, value): (id, value.get('field1'))).collect() [(4, u'value1'), (5, u'value3'), (6, None)] ``` Author: 0x0FFF <programmerag@gmail.com> Closes #8948 from 0x0FFF/SPARK-7869.	2015-10-07 23:12:35 -07:00
Holden Karau	3aff0866a8	[SPARK-9774] [ML] [PYSPARK] Add python api for ml regression isotonicregression Add the Python API for isotonicregression. Author: Holden Karau <holden@pigscanfly.ca> Closes #8214 from holdenk/SPARK-9774-add-python-api-for-ml-regression-isotonicregression.	2015-10-07 17:50:35 -07:00
Nathan Howell	1bc435ae3a	[SPARK-10064] [ML] Parallelize decision tree bin split calculations Reimplement `DecisionTree.findSplitsBins` via `RDD` to parallelize bin calculation. With large feature spaces the current implementation is very slow. This change limits the features that are distributed (or collected) to just the continuous features, and performs the split calculations in parallel. It completes on a real multi terabyte dataset in less than a minute instead of multiple hours. Author: Nathan Howell <nhowell@godaddy.com> Closes #8246 from NathanHowell/SPARK-10064.	2015-10-07 17:46:16 -07:00
Davies Liu	075a0b6582	[SPARK-10917] [SQL] improve performance of complex type in columnar cache This PR improve the performance of complex types in columnar cache by using UnsafeProjection instead of KryoSerializer. A simple benchmark show that this PR could improve the performance of scanning a cached table with complex columns by 15x (comparing to Spark 1.5). Here is the code used to benchmark: ``` df = sc.range(1<<23).map(lambda i: Row(a=Row(b=i, c=str(i)), d=range(10), e=dict(zip(range(10), [str(i) for i in range(10)])))).toDF() df.write.parquet("table") ``` ``` df = sqlContext.read.parquet("table") df.cache() df.count() t = time.time() print df.select("*")._jdf.queryExecution().toRdd().count() print time.time() - t ``` Author: Davies Liu <davies@databricks.com> Closes #8971 from davies/complex.	2015-10-07 15:58:07 -07:00
DB Tsai	dd36ec6bc5	[SPARK-10738] [ML] Refactoring `Instance` out from LOR and LIR, and also cleaning up some code Refactoring `Instance` case class out from LOR and LIR, and also cleaning up some code. Author: DB Tsai <dbt@netflix.com> Closes #8853 from dbtsai/refactoring.	2015-10-07 15:56:57 -07:00
Josh Rosen	7e2e268289	[SPARK-9702] [SQL] Use Exchange to implement logical Repartition operator This patch allows `Repartition` to support UnsafeRows. This is accomplished by implementing the logical `Repartition` operator in terms of `Exchange` and a new `RoundRobinPartitioning`. Author: Josh Rosen <joshrosen@databricks.com> Author: Liang-Chi Hsieh <viirya@appier.com> Closes #8083 from JoshRosen/SPARK-9702.	2015-10-07 15:53:37 -07:00
Davies Liu	37526aca24	[SPARK-10980] [SQL] fix bug in create Decimal The created decimal is wrong if using `Decimal(unscaled, precision, scale)` with unscaled > 1e18 and and precision > 18 and scale > 0. This bug exists since the beginning. Author: Davies Liu <davies@databricks.com> Closes #9014 from davies/fix_decimal.	2015-10-07 15:51:09 -07:00
Yanbo Liang	7bf07faa71	[SPARK-10490] [ML] Consolidate the Cholesky solvers in WeightedLeastSquares and ALS Consolidate the Cholesky solvers in WeightedLeastSquares and ALS. Author: Yanbo Liang <ybliang8@gmail.com> Closes #8936 from yanboliang/spark-10490.	2015-10-07 15:50:45 -07:00
Reynold Xin	6dbfd7ecf4	[SPARK-10982] [SQL] Rename ExpressionAggregate -> DeclarativeAggregate. DeclarativeAggregate matches more closely with ImperativeAggregate we already have. Author: Reynold Xin <rxin@databricks.com> Closes #9013 from rxin/SPARK-10982.	2015-10-07 15:38:46 -07:00
Evan Chen	da936fbb74	[SPARK-10779] [PYSPARK] [MLLIB] Set initialModel for KMeans model in PySpark (spark.mllib) Provide initialModel param for pyspark.mllib.clustering.KMeans Author: Evan Chen <chene@us.ibm.com> Closes #8967 from evanyc15/SPARK-10779-pyspark-mllib.	2015-10-07 15:04:53 -07:00

1 2 3 4 5 ...

13172 commits