Use Iterators in columnSimilarities to allow mapPartitionsWithIndex to spill to disk. This matters for a large, dense column: Spark can spill the pairs to disk as they are produced instead of building all the pairs in memory before handing them to Spark.
Another PR coming to update documentation.
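A minimal sketch of the idea (simplified shapes, not the actual columnSimilarities code): return a lazy `Iterator` from the flatMap function instead of a materialized buffer, so Spark can consume, and spill, the pairs incrementally.

```scala
// A minimal sketch: emit column-pair contributions lazily instead of
// buffering them all in memory before handing them to Spark.
def pairContributions(
    indices: Array[Int],
    values: Array[Double]): Iterator[((Int, Int), Double)] = {
  indices.indices.iterator.flatMap { i =>
    ((i + 1) until indices.length).iterator.map { j =>
      ((indices(i), indices(j)), values(i) * values(j))
    }
  }
}
```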
Author: Reza Zadeh <reza@databricks.com>
Closes #5364 from rezazadeh/optmemsim and squashes the following commits:
47c90ba [Reza Zadeh] Iterators in columnSimilarities for flatMap
Reduce "is the same as ending offset" message to INFO level per JIRA discussion
Author: Sean Owen <sowen@cloudera.com>
Closes #5366 from srowen/SPARK-6569 and squashes the following commits:
8a5b992 [Sean Owen] Reduce "is the same as ending offset" message to INFO level per JIRA discussion
Added an equivalent script to load-spark-env.sh.
Author: Masayoshi TSUZUKI <tsudukim@oss.nttdata.co.jp>
Closes #5328 from tsudukim/feature/SPARK-6673 and squashes the following commits:
aaefb19 [Masayoshi TSUZUKI] removed dust.
be3405e [Masayoshi TSUZUKI] [SPARK-6673] spark-shell.cmd can't start in Windows even when spark was built
This is the second PR for [SPARK-6602]. It updated MapOutputTrackerMasterActor and its unit tests.
cc rxin
Author: zsxwing <zsxwing@gmail.com>
Closes #5371 from zsxwing/rpc-rewrite-part2 and squashes the following commits:
fcf3816 [zsxwing] Fix the code style
4013a22 [zsxwing] Add doc for uncaught exceptions in RpcEnv
93c6c20 [zsxwing] Add an example of UnserializableException and add ErrorMonitor to monitor errors from Akka
134fe7b [zsxwing] Update MapOutputTrackerMasterActor to MapOutputTrackerMasterEndpoint
Add the following methods to pyspark's MultivariateStatisticalSummary:
- normL1
- normL2
Author: lewuathe <lewuathe@me.com>
Closes #5359 from Lewuathe/SPARK-6262 and squashes the following commits:
cbe439e [lewuathe] Implement missing methods for MultivariateStatisticalSummary
This PR replaced the following `Actor`s with `RpcEndpoint`s:
1. HeartbeatReceiver
1. ExecutorActor
1. BlockManagerMasterActor
1. BlockManagerSlaveActor
1. CoarseGrainedExecutorBackend and subclasses
1. CoarseGrainedSchedulerBackend.DriverActor
This is the first PR. I will split the work of SPARK-6602 into several PRs for code review.
Author: zsxwing <zsxwing@gmail.com>
Closes #5268 from zsxwing/rpc-rewrite and squashes the following commits:
287e9f8 [zsxwing] Fix the code style
26c56b7 [zsxwing] Merge branch 'master' into rpc-rewrite
9cc825a [zsxwing] Rmove setupThreadSafeEndpoint and add ThreadSafeRpcEndpoint
30a9036 [zsxwing] Make self return null after stopping RpcEndpointRef; fix docs and error messages
705245d [zsxwing] Fix some bugs after rebasing the changes on the master
003cf80 [zsxwing] Update CoarseGrainedExecutorBackend and CoarseGrainedSchedulerBackend to use RpcEndpoint
7d0e6dc [zsxwing] Update BlockManagerSlaveActor to use RpcEndpoint
f5d6543 [zsxwing] Update BlockManagerMaster to use RpcEndpoint
30e3f9f [zsxwing] Update ExecutorActor to use RpcEndpoint
478b443 [zsxwing] Update HeartbeatReceiver to use RpcEndpoint
'(' and ')' are special characters used in Parquet schemas for type annotation. When we run an aggregation query, we obtain attribute names such as "MAX(a)".
If we directly store the generated DataFrame as a Parquet file, reading and parsing the stored schema string fails.
Several methods can be adopted to solve this. This PR uses the simplest one: just replace attribute names before generating the Parquet schema based on these attributes.
Another possible method might be modifying all aggregation expression names from "func(column)" to "func[column]".
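For example (a hedged sketch with a hypothetical `df` and output path), users can sidestep the generated name entirely by aliasing the aggregate before saving:

```scala
import org.apache.spark.sql.functions.max

// Hypothetical example: the generated name "MAX(a)" contains '(' and ')',
// which break the stored Parquet schema string, so alias the column instead.
val result = df.groupBy("key").agg(max("a").as("max_a"))
result.saveAsParquetFile("/tmp/agg")  // schema contains "max_a", not "MAX(a)"
```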
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes #5263 from viirya/parquet_aggregation_name and squashes the following commits:
2d70542 [Liang-Chi Hsieh] Address comment.
463dff4 [Liang-Chi Hsieh] Instead of replacing special chars, showing error message to user to suggest using Alias.
1de001d [Liang-Chi Hsieh] Replace special characters '(' and ')' of Parquet schema.
Author: Yin Huai <yhuai@databricks.com>
Closes #5353 from yhuai/wrongFS and squashes the following commits:
849603b [Yin Huai] Not use deprecated method.
6d6ae34 [Yin Huai] Use path.makeQualified.
Currently, trait `StringComparison` is a `BinaryExpression`; in fact, it should be a `BinaryPredicate`.
By making `StringComparison` a `BinaryPredicate`, we can throw an error when an `expressions.Predicate` can't be translated to a data source `Filter` in the function `selectFilters`.
Without this modification, because we wrap a `Filter` outside the scanned results in `pruneFilterProjectRaw`, we can't detect that something went wrong when translating predicates to filters in `selectFilters`.
The unit test of #5285 demonstrates this problem: in that PR, even though `expressions.Contains` is not properly translated to `sources.StringContains`, the filtering is still performed by the `Filter`, so the test passes.
Of course, with this modification, every `expressions.Predicate` class needs a corresponding data source `Filter`.
There was also a small bug in `FilteredScanSuite`'s `StringEndsWith` filter test; this PR fixes it too.
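A simplified sketch of the stricter translation (illustrative only; the real `selectFilters` handles many more cases):

```scala
// Every Predicate must map to a data source Filter; otherwise we throw
// instead of silently dropping the predicate.
def translate(predicate: expressions.Predicate): sources.Filter = predicate match {
  case expressions.StartsWith(a: Attribute, Literal(v, StringType)) =>
    sources.StringStartsWith(a.name, v.toString)
  case expressions.EndsWith(a: Attribute, Literal(v, StringType)) =>
    sources.StringEndsWith(a.name, v.toString)
  case expressions.Contains(a: Attribute, Literal(v, StringType)) =>
    sources.StringContains(a.name, v.toString)
  case other =>
    sys.error(s"Predicate $other cannot be translated to a data source Filter")
}
```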
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes #5309 from viirya/translate_predicate and squashes the following commits:
b176385 [Liang-Chi Hsieh] Address comment.
275a493 [Liang-Chi Hsieh] More properly test for StringStartsWith, StringEndsWith and StringContains.
caf2347 [Liang-Chi Hsieh] Make trait StringComparison as BinaryPredicate and throw error when Predicate can't translate to data source Filter.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes #5340 from vanzin/SPARK-6688 and squashes the following commits:
ccfddd9 [Marcelo Vanzin] Resolve at the source.
20d2a34 [Marcelo Vanzin] [SPARK-6688] [core] Always use resolved URIs in EventLoggingListener.
This PR moved the code of creating `HeartbeatReceiver` above the code of creating `schedulerBackend` to resolve the race condition.
Author: zsxwing <zsxwing@gmail.com>
Closes #5306 from zsxwing/SPARK-6640 and squashes the following commits:
840399d [zsxwing] Don't send TaskScheduler through Akka
a90616a [zsxwing] Fix docs
dd202c7 [zsxwing] Fix typo
d7c250d [zsxwing] Fix the race condition of creating HeartbeatReceiver and retrieving HeartbeatReceiver
I've added a timeout and retry loop around the SparkContext shutdown code that should fix this deadlock. If a SparkContext shutdown is in progress when another thread comes knocking, it will wait 10 seconds for the lock, then fall through, letting the outer loop re-submit the request.
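A minimal sketch of that shape (hypothetical names, not the exact SparkContext code):

```scala
import java.util.concurrent.TimeUnit
import java.util.concurrent.locks.ReentrantLock

// Wait up to 10 seconds for the shutdown lock; on timeout, fall through
// and let the outer loop re-submit the stop request.
val shutdownLock = new ReentrantLock()

def stopWithRetry(doStop: () => Unit): Unit = {
  var done = false
  while (!done) {
    if (shutdownLock.tryLock(10, TimeUnit.SECONDS)) {
      try {
        doStop()
        done = true
      } finally {
        shutdownLock.unlock()
      }
    }
    // else: timed out while another shutdown was in progress; retry
  }
}
```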
Author: Ilya Ganelin <ilya.ganelin@capitalone.com>
Closes #5277 from ilganeli/SPARK-6492 and squashes the following commits:
8617a7e [Ilya Ganelin] Resolved merge conflict
2fbab66 [Ilya Ganelin] Added MIMA Exclude
a0e2c70 [Ilya Ganelin] Deleted stale imports
fa28ce7 [Ilya Ganelin] reverted to just having a single stopped
76fc825 [Ilya Ganelin] Updated to use atomic booleans instead of the synchronized vars
6e8a7f7 [Ilya Ganelin] Removing unecessary null check for now since i'm not fixing stop ordering yet
cdf7073 [Ilya Ganelin] [SPARK-6492] Moved stopped=true back to the start of the shutdown sequence so this can be addressed in a seperate PR
7fb795b [Ilya Ganelin] Spacing
b7a0c5c [Ilya Ganelin] Import ordering
df8224f [Ilya Ganelin] Added comment for added lock
343cb94 [Ilya Ganelin] [SPARK-6492] Added timeout/retry logic to fix a deadlock in SparkContext shutdown
When unioning non-decimal types with decimals, we use the following rules:
- FIRST apply `intTypeToFixed`; then a union of fixed decimals with precision/scale p1/s1 and p2/s2 will be promoted to
DecimalType(max(p1, p2), max(s1, s2)) (see the worked example after the list)
- FLOAT and DOUBLE cause fixed-length decimals to turn into DOUBLE (this is the same as Hive,
but note that unlimited decimals are considered bigger than doubles in WidenTypes)
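A worked example of the first rule (precision/scale values chosen for illustration):

```scala
// Unioning DecimalType(5, 2) with DecimalType(7, 4) promotes both sides to
// DecimalType(max(5, 7), max(2, 4)) = DecimalType(7, 4).
val (p1, s1) = (5, 2)
val (p2, s2) = (7, 4)
val promoted = (math.max(p1, p2), math.max(s1, s2))  // (7, 4)
```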
Author: guowei2 <guowei2@asiainfo.com>
Closes #4004 from guowei2/SPARK-5203 and squashes the following commits:
ff50f5f [guowei2] fix code style
11df1bf [guowei2] fix decimal union with double, double->Decimal(15,15)
0f345f9 [guowei2] fix structType merge with decimal
101ed4d [guowei2] fix build error after rebase
0b196e4 [guowei2] code style
fe2c2ca [guowei2] handle union decimal precision in 'DecimalPrecision'
421d840 [guowei2] fix union types for decimal precision
ef2c661 [guowei2] fix union with different decimal type
Just fix a typo.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes #5352 from viirya/fix_a_typo and squashes the following commits:
303b2d2 [Liang-Chi Hsieh] Fix typo.
This is the sub-task of SPARK-6254.
Wrap missing methods for `Word2Vec` and `Word2VecModel`.
Author: lewuathe <lewuathe@me.com>
Closes #5296 from Lewuathe/SPARK-6615 and squashes the following commits:
f14c304 [lewuathe] Reorder tests
1d326b9 [lewuathe] Merge master
e2bedfb [lewuathe] Modify test cases
afb866d [lewuathe] [SPARK-6615] Python API for Word2Vec
There's no corresponding println in linear regression. Here was my previous PR (something weird happened and I can't reopen it): https://github.com/apache/spark/pull/5272
Author: Omede Firouz <ofirouz@palantir.com>
Closes #5338 from oefirouz/println and squashes the following commits:
3f3dbf4 [Omede Firouz] [MLLIB] Remove println
If there is a failure in the Hadoop backend while calling
writer.write, we should remember this original exception,
and try to call writer.close(), but if that fails as well,
still report the original exception.
Note that if writer.write fails, the writer was likely left in an
invalid state, which actually makes it more likely that writer.close
will also fail, and that only increases the chances for
writer.write's exception to be suppressed.
This patch introduces an admittedly potentially too cute
Utils.tryWithSafeFinally method to handle the try/finally
gyrations.
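A hedged sketch of the helper's shape (simplified relative to the actual Utils.tryWithSafeFinally):

```scala
// If the body throws and the finally block then throws too, keep the body's
// exception instead of letting the cleanup failure suppress it.
def tryWithSafeFinally[T](block: => T)(finallyBlock: => Unit): T = {
  var originalThrowable: Throwable = null
  try {
    block
  } catch {
    case t: Throwable =>
      originalThrowable = t
      throw t
  } finally {
    try {
      finallyBlock
    } catch {
      case t: Throwable if originalThrowable != null =>
        // e.g. writer.close() failed after writer.write() already failed:
        // record the cleanup failure, re-throw the original.
        originalThrowable.addSuppressed(t)
    }
  }
}
```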
Author: Stephen Haberman <stephen@exigencecorp.com>
Closes #5223 from stephenh/do_not_suppress_writer_exception and squashes the following commits:
c7ad53f [Stephen Haberman] [SPARK-6560][CORE] Do not suppress exceptions from writer.write.
This builds on my earlier pull requests and turns on the explicit type checking in scalastyle.
Author: Reynold Xin <rxin@databricks.com>
Closes #5342 from rxin/SPARK-6428 and squashes the following commits:
7b531ab [Reynold Xin] import ordering
2d9a8a5 [Reynold Xin] jl
e668b1c [Reynold Xin] override
9b9e119 [Reynold Xin] Parenthesis.
82e0cf5 [Reynold Xin] [SPARK-6428] Turn on explicit type checking for public methods.
https://issues.apache.org/jira/browse/SPARK-6575
Author: Yin Huai <yhuai@databricks.com>
This patch had conflicts when merged, resolved by
Committer: Cheng Lian <lian@databricks.com>
Closes #5339 from yhuai/parquetRelationCache and squashes the following commits:
b0e1a42 [Yin Huai] Address comments.
83d9846 [Yin Huai] Remove unnecessary change.
c0dc7a4 [Yin Huai] Cache converted parquet relations.
Author: zsxwing <zsxwing@gmail.com>
Closes #5280 from zsxwing/SPARK-6621 and squashes the following commits:
521125e [zsxwing] Fix the bug that calling EventLoop.stop in EventLoop.onReceive and EventLoop.onError doesn't call onStop
This patch fixes a reported bug causing model updates to not properly propagate to model predictions during streaming regression. These minor changes in model declaration fix the problem, and I expanded the tests to include the scenario in which the bug was arising. The two new tests failed prior to the patch and now pass.
cc mengxr
Author: freeman <the.freeman.lab@gmail.com>
Closes #5037 from freeman-lab/train-predict-fix and squashes the following commits:
3af953e [freeman] Expand test coverage to include combined training and prediction
8f84fc8 [freeman] Move model declaration
The config option is spark.history.fs.logDirectory, not spark.fs.history.logDirectory, so the description should be changed. Thanks.
Author: KaiXinXiaoLei <huleilei1@huawei.com>
Closes #5332 from KaiXinXiaoLei/historyConfig and squashes the following commits:
5ffbfb5 [KaiXinXiaoLei] the describe of jobHistory config is error
This fixes the thread leak. I also changed the unit test to keep track
of allocated contexts and make sure they're closed after tests are
run; this is needed since some tests use this pattern:
```scala
val sc = createContext()
doSomethingThatMayThrow()
sc.stop()
```
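A minimal sketch of the tracking approach (hypothetical scaffolding, not the exact suite code):

```scala
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.{SparkConf, SparkContext}

// Remember every created context; the suite's cleanup hook stops any that
// are still running, even when the test body threw before sc.stop().
private val contexts = new ArrayBuffer[SparkContext]()

def createContext(conf: SparkConf): SparkContext = {
  val sc = new SparkContext(conf)
  contexts += sc
  sc
}

def stopAllContexts(): Unit = {  // call from the suite's afterEach()
  try contexts.foreach(_.stop())
  finally contexts.clear()
}
```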
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes #5311 from vanzin/SPARK-6650 and squashes the following commits:
652c73b [Marcelo Vanzin] Nits.
5711512 [Marcelo Vanzin] More exception safety.
cc5a744 [Marcelo Vanzin] Stop alloc manager before scheduler.
9886f69 [Marcelo Vanzin] [SPARK-6650] [core] Stop ExecutorAllocationManager when context stops.
This is a workaround for a problem reported on the user list. This doesn't fix the core problem, but in general is a more robust way to do renames.
Author: Michael Armbrust <michael@databricks.com>
Closes #5337 from marmbrus/toDFrename and squashes the following commits:
6a3159d [Michael Armbrust] [SPARK-6686][SQL] Use resolved output instead of names for toDF rename
It did not consider that order.dataType may not match NativeType, so I added `case other => ...` for other scenarios.
Author: DoingDone9 <799203320@qq.com>
Closes #4959 from DoingDone9/case_ and squashes the following commits:
6278846 [DoingDone9] Update rows.scala
cb1852d [DoingDone9] Merge pull request #2 from apache/master
c3f046f [DoingDone9] Merge pull request #1 from apache/master
This PR addresses rxin's comments in PR #5210.
Author: Cheng Lian <lian@databricks.com>
Closes #5219 from liancheng/spark-6554-followup and squashes the following commits:
41f3a09 [Cheng Lian] Addresses comments in #5210
In Scala 2.10, NotImplementedError is a fatal exception, which is not very nice to throw when the situation is not actually fatal.
Author: Michael Armbrust <michael@databricks.com>
Closes #5315 from marmbrus/throwUnsupported and squashes the following commits:
c29e03b [Michael Armbrust] [SQL] Throw UnsupportedOperationException instead of NotImplementedError
052e05b [Michael Armbrust] [SQL] Throw UnsupportedOperationException instead of NotImplementedError
Use Option for ActiveJob.properties to avoid NPE bug
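The shape of the fix, as a hedged sketch (illustrative, not the exact diff):

```scala
// Before: job.properties.getProperty(...) NPEs when properties is null.
// After: wrap the possibly-null Properties in an Option first.
val groupId: Option[String] =
  Option(job.properties).map(_.getProperty(SparkContext.SPARK_JOB_GROUP_ID))
```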
Author: Hung Lin <hung.lin@gmail.com>
Closes #5124 from hunglin/SPARK-6414 and squashes the following commits:
2290b6b [Hung Lin] [SPARK-6414][core] Fix NPE in SparkContext.cancelJobGroup()
Reusing the address on the server side caused the server to fail to acknowledge connected connections, so remove it.
This PR retries once after a timeout, and it also adds a timeout on the client side.
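A hedged sketch of the client-side shape (plain java.net sockets, hypothetical helper):

```scala
import java.net.{InetSocketAddress, Socket, SocketTimeoutException}

// Bound the connect with a timeout and retry once after it fires,
// instead of hanging forever on a connection the server never acknowledged.
def connectWithRetry(host: String, port: Int, timeoutMs: Int): Socket = {
  def attempt(): Socket = {
    val sock = new Socket()
    sock.connect(new InetSocketAddress(host, port), timeoutMs)
    sock
  }
  try attempt() catch {
    case _: SocketTimeoutException => attempt()  // retry once after timeout
  }
}
```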
Author: Davies Liu <davies@databricks.com>
Closes #5324 from davies/collect_hang and squashes the following commits:
e5a51a2 [Davies Liu] remove setReuseAddress
7977c2f [Davies Liu] do retry on client side
b838f35 [Davies Liu] retry after timeout
We assume that `RDD[Row]` contains Scala types. So we need to convert them into catalyst types in createDataFrame. liancheng
Author: Xiangrui Meng <meng@databricks.com>
Closes #5329 from mengxr/SPARK-6672 and squashes the following commits:
2d52644 [Xiangrui Meng] set needsConversion = false in jsonRDD
06896e4 [Xiangrui Meng] add createDataFrame without conversion
4a3767b [Xiangrui Meng] convert Row to catalyst
Before diving into review #4450 I did a look through the existing shuffle
code to learn how it works. Unfortunately, there are some very
confusing things in this code. This patch makes a few small changes
to simplify things. It is not easy to describe the changes concisely
because of how convoluted the issues were, but they are fairly small
logically:
1. There is a trait named `ShuffleBlockManager` that only deals with
one logical function, which is retrieving shuffle block data given shuffle
block coordinates. This trait has two implementors, FileShuffleBlockManager
and IndexShuffleBlockManager. Confusingly, the vast majority of those
implementations have nothing to do with this particular functionality.
So I've renamed the trait to ShuffleBlockResolver and documented it.
2. The aforementioned trait had two almost identical methods, for no good
reason. I removed one method (getBytes) and modified callers to use the
other one. I think the behavior is preserved in all cases.
3. The sort shuffle code uses an identifier "0" in the reduce slot of a
BlockID as a placeholder. I made it into a constant since it needs to
be consistent across multiple places.
I think for (3) there is actually a better solution that would avoid the
need to do this type of workaround/hack in the first place, but it's more
complex so I'm punting it for now.
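For reference, a hedged sketch of the renamed trait's single responsibility (simplified signatures):

```scala
// Map shuffle block coordinates to the bytes backing that block; nothing else.
private[spark] trait ShuffleBlockResolver {
  /** Retrieve the data for the given shuffle block. */
  def getBlockData(blockId: ShuffleBlockId): ManagedBuffer

  def stop(): Unit
}
```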
Author: Patrick Wendell <patrick@databricks.com>
Closes #5286 from pwendell/cleanup and squashes the following commits:
c71fbc7 [Patrick Wendell] Open interface back up for testing
f36edd5 [Patrick Wendell] Code review feedback
d1c0494 [Patrick Wendell] Style fix
a406079 [Patrick Wendell] [HOTFIX] Some clean-up in shuffle code.
In order to do in-bounds checking and type conversion, we should use Literal.create() instead of the constructor.
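For illustration (hypothetical values), the difference in a nutshell:

```scala
// Constructor: takes the value as-is, with no checking or conversion.
val a = Literal(1)

// Literal.create: checks the value and converts it to the catalyst
// representation for the given data type.
val b = Literal.create(1, IntegerType)
```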
Author: Davies Liu <davies@databricks.com>
Closes #5320 from davies/literal and squashes the following commits:
1667604 [Davies Liu] fix style and add comment
5f8c0fd [Davies Liu] use Literal.create instread of constructor
First contribution here; I would love to get some more code contributions in soon. Let me know if there's anything about the contribution process I should improve.
Author: Chet Mancini <chetmancini@gmail.com>
Closes #5316 from chetmancini/SPARK_6658_dataframe_doc and squashes the following commits:
53b627a [Chet Mancini] [SQL] SPARK-6658: Update DataFrame documentation to refer to correct types
Author: Reynold Xin <rxin@databricks.com>
Closes #5319 from rxin/SPARK-6578 and squashes the following commits:
7c62a64 [Reynold Xin] Small rewrite to make the logic more clear in transferTo.
davies
Author: Xiangrui Meng <meng@databricks.com>
Closes #5318 from mengxr/SPARK-6660 and squashes the following commits:
0f66ec2 [Xiangrui Meng] recognize object arrays
ad8c42f [Xiangrui Meng] add a test for SPARK-6660
Use `f.__repr__()` instead of `f.__name__` when instantiating `UserDefinedFunction`s, so `functools.partial`s may be used.
Author: ksonj <kson@siberie.de>
Closes #5206 from ksonj/partials and squashes the following commits:
ea66f3d [ksonj] Inserted blank lines for PEP8 compliance
d81b02b [ksonj] added tests for udf with partial function and callable object
2c76100 [ksonj] Makes UDFs work with all types of callables
b814a12 [ksonj] support functools.partial as udf
(cherry picked from commit 98f72dfc17)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
Support indexing in DenseMatrices in PySpark
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes #5232 from MechCoder/SPARK-6576 and squashes the following commits:
a735078 [MechCoder] Change bounds
a062025 [MechCoder] Matrices are stored in column order
7917bc1 [MechCoder] [SPARK-6576] DenseMatrix in PySpark should support indexing
This PR changes lambda scaling from number of users/items to number of explicit ratings. The latter is the behavior in 1.2. Slight refactor of NormalEquation to make it independent of ALS models. srowen codexiang
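In other words, a hedged sketch of the 1.2-style objective assumed here, where n_u and m_i count the explicit ratings of user u and item i:

```latex
\min_{x,y} \sum_{(u,i)\in R} \left( r_{ui} - x_u^\top y_i \right)^2
  + \lambda \left( \sum_u n_u \|x_u\|^2 + \sum_i m_i \|y_i\|^2 \right)
```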
Author: Xiangrui Meng <meng@databricks.com>
Closes #5314 from mengxr/SPARK-6642 and squashes the following commits:
dc655a1 [Xiangrui Meng] relax python tests
f410df2 [Xiangrui Meng] use 1.2 scaling and remove addImplicit from NormalEquation
While the inbound path of a netty pipeline is thread-safe, the outbound
path is not. That means that multiple threads can compete to write messages
to the next stage of the pipeline.
The network library sometimes breaks a single RPC message into multiple
buffers internally to avoid copying data (see MessageEncoder). This can
result in the following scenario (where "FxBy" means "frame x, buffer y"):
```
T1:       F1B1         F1B2
              \            \
               \            \
socket:     F1B1  F2B1  F1B2  F2B2
               /            /
              /            /
T2:       F2B1         F2B2
```
And the frames now cannot be rebuilt on the receiving side because the
different messages have been mixed up on the wire.
The fix wraps these multi-buffer messages into a `FileRegion` object
so that these messages are written "atomically" to the next pipeline handler.
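A simplified sketch of the wrapper's shape (not the actual class; Netty 4.0 method names assumed):

```scala
import java.nio.channels.WritableByteChannel
import io.netty.buffer.ByteBuf
import io.netty.channel.FileRegion
import io.netty.util.AbstractReferenceCounted

// Expose header + body as a single FileRegion so the next pipeline handler
// writes the whole message without interleaving frames from other threads.
class MessageWithHeader(header: ByteBuf, body: ByteBuf)
    extends AbstractReferenceCounted with FileRegion {

  private val totalCount = header.readableBytes().toLong + body.readableBytes()
  private var totalWritten = 0L

  override def position(): Long = 0
  override def transfered(): Long = totalWritten  // Netty 4.0 spelling
  override def count(): Long = totalCount

  override def transferTo(target: WritableByteChannel, position: Long): Long = {
    // Drain the header first, then the body; Netty may call transferTo
    // repeatedly until count() bytes have been written.
    var written = 0L
    if (header.readableBytes() > 0) {
      val n = target.write(header.nioBuffer())
      header.skipBytes(n)
      written += n
    }
    if (header.readableBytes() == 0 && body.readableBytes() > 0) {
      val n = target.write(body.nioBuffer())
      body.skipBytes(n)
      written += n
    }
    totalWritten += written
    written
  }

  override def deallocate(): Unit = {
    header.release()
    body.release()
  }
}
```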
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes #5234 from vanzin/SPARK-6578 and squashes the following commits:
16b2d70 [Marcelo Vanzin] Forgot to update a type.
c9c2e4e [Marcelo Vanzin] Review comments: simplify some code.
9c888ac [Marcelo Vanzin] Small style nits.
8474bab [Marcelo Vanzin] Fix multiple calls to MessageWithHeader.transferTo().
e26509f [Marcelo Vanzin] Merge branch 'master' into SPARK-6578
c503f6c [Marcelo Vanzin] Implement a custom FileRegion instead of using locks.
84aa7ce [Marcelo Vanzin] Rename handler to the correct name.
432f3bd [Marcelo Vanzin] Remove unneeded method.
8d70e60 [Marcelo Vanzin] Fix thread-safety issue in outbound path of network library.
Fixed Python doc build warnings.
CC whoever wants to review: rxin mengxr davies
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #5317 from jkbradley/python-doc-warnings and squashes the following commits:
4cd43c2 [Joseph K. Bradley] fixed python doc build warnings
Users should be able to use numpy operators directly on dense vectors. davies atalwalkar
Author: Xiangrui Meng <meng@databricks.com>
Closes #5312 from mengxr/SPARK-6651 and squashes the following commits:
e665c5c [Xiangrui Meng] wrap the result in a dense vector
23dfca3 [Xiangrui Meng] delegate dense vector arithmetics to the underlying numpy array
1. Test JARs are built & published
1. log4j.resources is explicitly excluded. Without this, downstream test run logging depends on the order the JARs are listed/loaded
1. sql/hive pulls in spark-sql & spark-catalyst for its test runs
1. The copied-in test classes were rm'd, and a test was edited to remove its now-duplicate assert method
1. Spark Streaming is now built with the same plugin/phase as the rest, but its shade plugin declaration is kept in (so it differs from the rest of the test plugins). Due to (#2), this means the test JAR no longer includes its log4j file.
Outstanding issues:
* should the JARs be shaded? `spark-streaming-test.jar` does, but given these are test jars for developers only, especially in the same spark source tree, it's hard to justify.
* `maven-jar-plugin` v2.6 was explicitly selected; without this, the apache-1.4 parent template's JAR plugin version (2.4) is chosen.
* Are there any other resources to exclude?
Author: Steve Loughran <stevel@hortonworks.com>
Closes #5119 from steveloughran/stevel/patches/SPARK-6433-test-jars and squashes the following commits:
81ceb01 [Steve Loughran] SPARK-6433 add a clearer comment explaining what the plugin is doing & why
a6dca33 [Steve Loughran] SPARK-6433 : pull configuration section form archive plugin
c2b5f89 [Steve Loughran] SPARK-6433 omit "jar" goal from jar plugin
fdac51b [Steve Loughran] SPARK-6433 -002; indentation & delegate plugin version to parent
650f442 [Steve Loughran] SPARK-6433 patch 001: test JARs are built; sql/hive pulls in spark-sql & spark-catalyst for its test runs