Previously we could be getting garbage data if the number of bytes is 0, or on JVMs that are 4 byte aligned, or when compressedoops is on.
Author: Reynold Xin <rxin@databricks.com>
Closes#7789 from rxin/utf8string and squashes the following commits:
86ffa3e [Reynold Xin] Mask out data outside of valid range.
4d647ed [Reynold Xin] Mask out data.
c6e8794 [Reynold Xin] [SPARK-9460] Fix prefix generation for UTF8String.
unix_timestamp(): long
Gets current Unix timestamp in seconds.
unix_timestamp(string|date): long
Converts time string in format yyyy-MM-dd HH:mm:ss to Unix timestamp (in seconds), using the default timezone and the default locale, return null if fail: unix_timestamp('2009-03-20 11:30:01') = 1237573801
unix_timestamp(string date, string pattern): long
Convert time string with given pattern (see [http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html]) to Unix time stamp (in seconds), return null if fail: unix_timestamp('2009-03-20', 'yyyy-MM-dd') = 1237532400.
from_unixtime(bigint unixtime[, string format]): string
Converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the format of "1970-01-01 00:00:00".
Jira:
https://issues.apache.org/jira/browse/SPARK-8174https://issues.apache.org/jira/browse/SPARK-8175
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Closes#7644 from adrian-wang/udfunixtime and squashes the following commits:
2fe20c4 [Daoyuan Wang] util.Date
ea2ec16 [Daoyuan Wang] use util.Date for better performance
a2cf929 [Daoyuan Wang] doc return null instead of 0
f6f070a [Daoyuan Wang] address comments from davies
6a4cbb3 [Daoyuan Wang] temp
56ded53 [Daoyuan Wang] rebase and address comments
14a8b37 [Daoyuan Wang] function unix_timestamp, from_unixtime
https://issues.apache.org/jira/browse/SPARK-9437
Author: Imran Rashid <irashid@cloudera.com>
Closes#7750 from squito/SPARK-9437_size_estimator_overflow and squashes the following commits:
29493f1 [Imran Rashid] prevent another potential overflow
bc1cb82 [Imran Rashid] avoid overflow
This pull request enables Unsafe mode by default in Spark SQL. In order to do this, we had to fix a number of small issues:
**List of fixed blockers**:
- [x] Make some default buffer sizes configurable so that HiveCompatibilitySuite can run properly (#7741).
- [x] Memory leak on grouped aggregation of empty input (fixed by #7560 to fix this)
- [x] Update planner to also check whether codegen is enabled before planning unsafe operators.
- [x] Investigate failing HiveThriftBinaryServerSuite test. This turns out to be caused by a ClassCastException that occurs when Exchange tries to apply an interpreted RowOrdering to an UnsafeRow when range partitioning an RDD. This could be fixed by #7408, but a shorter-term fix is to just skip the Unsafe exchange path when RangePartitioner is used.
- [x] Memory leak exceptions masking exceptions that actually caused tasks to fail (will be fixed by #7603).
- [x] ~~https://issues.apache.org/jira/browse/SPARK-9162, to implement code generation for ScalaUDF. This is necessary for `UDFSuite` to pass. For now, I've just ignored this test in order to try to find other problems while we wait for a fix.~~ This is no longer necessary as of #7682.
- [x] Memory leaks from Limit after UnsafeExternalSort cause the memory leak detector to fail tests. This is a huge problem in the HiveCompatibilitySuite (fixed by f4ac642a4e5b2a7931c5e04e086bb10e263b1db6).
- [x] Tests in `AggregationQuerySuite` are failing due to NaN-handling issues in UnsafeRow, which were fixed in #7736.
- [x] `org.apache.spark.sql.ColumnExpressionSuite.rand` needs to be updated so that the planner check also matches `TungstenProject`.
- [x] After having lowered the buffer sizes to 4MB so that most of HiveCompatibilitySuite runs:
- [x] Wrong answer in `join_1to1` (fixed by #7680)
- [x] Wrong answer in `join_nulls` (fixed by #7680)
- [x] Managed memory OOM / leak in `lateral_view`
- [x] Seems to hang indefinitely in `partcols1`. This might be a deadlock in script transformation or a bug in error-handling code? The hang was fixed by #7710.
- [x] Error while freeing memory in `partcols1`: will be fixed by #7734.
- [x] After fixing the `partcols1` hang, it appears that a number of later tests have issues as well.
- [x] Fix thread-safety bug in codegen fallback expression evaluation (#7759).
Author: Josh Rosen <joshrosen@databricks.com>
Closes#7564 from JoshRosen/unsafe-by-default and squashes the following commits:
83c0c56 [Josh Rosen] Merge remote-tracking branch 'origin/master' into unsafe-by-default
f4cc859 [Josh Rosen] Merge remote-tracking branch 'origin/master' into unsafe-by-default
963f567 [Josh Rosen] Reduce buffer size for R tests
d6986de [Josh Rosen] Lower page size in PySpark tests
013b9da [Josh Rosen] Also match TungstenProject in checkNumProjects
5d0b2d3 [Josh Rosen] Add task completion callback to avoid leak in limit after sort
ea250da [Josh Rosen] Disable unsafe Exchange path when RangePartitioning is used
715517b [Josh Rosen] Enable Unsafe by default
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#7706 from vanzin/SPARK-9388 and squashes the following commits:
028b990 [Marcelo Vanzin] Single log statement.
3c5fb6a [Marcelo Vanzin] YARN not Yarn.
5bcd7a0 [Marcelo Vanzin] [SPARK-9388] [yarn] Make executor info log messages easier to read.
This change adds code to notify the scheduler backend when a container dies in YARN.
Author: Mridul Muralidharan <mridulm@yahoo-inc.com>
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#7431 from vanzin/SPARK-8297 and squashes the following commits:
471e4a0 [Marcelo Vanzin] Fix unit test after merge.
d4adf4e [Marcelo Vanzin] Merge branch 'master' into SPARK-8297
3b262e8 [Marcelo Vanzin] Merge branch 'master' into SPARK-8297
537da6f [Marcelo Vanzin] Make an expected log less scary.
04dc112 [Marcelo Vanzin] Use driver <-> AM communication to send "remove executor" request.
8855b97 [Marcelo Vanzin] Merge remote-tracking branch 'mridul/fix_yarn_scheduler_bug' into SPARK-8297
687790f [Mridul Muralidharan] Merge branch 'fix_yarn_scheduler_bug' of github.com:mridulm/spark into fix_yarn_scheduler_bug
e1b0067 [Mridul Muralidharan] Fix failing testcase, fix merge issue from our 1.3 -> master
9218fcc [Mridul Muralidharan] Fix failing testcase
362d64a [Mridul Muralidharan] Merge branch 'fix_yarn_scheduler_bug' of github.com:mridulm/spark into fix_yarn_scheduler_bug
62ad0cc [Mridul Muralidharan] Merge branch 'fix_yarn_scheduler_bug' of github.com:mridulm/spark into fix_yarn_scheduler_bug
bbf8811 [Mridul Muralidharan] Merge branch 'fix_yarn_scheduler_bug' of github.com:mridulm/spark into fix_yarn_scheduler_bug
9ee1307 [Mridul Muralidharan] Fix SPARK-8297
a3a0f01 [Mridul Muralidharan] Fix SPARK-8297
JIRA: https://issues.apache.org/jira/browse/SPARK-9361
Currently, we call `aggregate.Utils.tryConvert` in many places to check it the logical.Aggregate can be run with new aggregation. But looks like `aggregate.Utils.tryConvert` will cost considerable time to run. We should only call `tryConvert` once and keep it value in `logical.Aggregate` and reuse it.
In `org.apache.spark.sql.execution.aggregate.Utils`, the codes involving with `tryConvert` should be moved to catalyst because it actually doesn't deal with execution details.
Author: Liang-Chi Hsieh <viirya@appier.com>
Closes#7677 from viirya/refactor_aggregate and squashes the following commits:
babea30 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into refactor_aggregate
9a589d7 [Liang-Chi Hsieh] Fix scala style.
0a91329 [Liang-Chi Hsieh] Refactor new aggregation code to reduce the times to call tryConvert.
cc srowen
Author: François Garillot <francois@garillot.net>
Closes#7678 from huitseeker/master and squashes the following commits:
5e99f57 [François Garillot] [SPARK-9267][Core] Retire stringify(Partial)?Value from Accumulators
### JIRA
[[SPARK-9248] Closing curly-braces should always be on their own line - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-9248)
## The result of `dev/lint-r`
[The result of `dev/lint-r` for SPARK-9248 at the revistion:6175d6cfe795fbd88e3ee713fac375038a3993a8](https://gist.github.com/yu-iskw/96cadcea4ce664c41f81)
Author: Yuu ISHIKAWA <yuu.ishikawa@gmail.com>
Closes#7795 from yu-iskw/SPARK-9248 and squashes the following commits:
c8eccd3 [Yuu ISHIKAWA] [SPARK-9248][SparkR] Closing curly-braces should always be on their own line
This is #7791 for Python. hhbyyh
Author: Xiangrui Meng <meng@databricks.com>
Closes#7798 from mengxr/regex-tok-py and squashes the following commits:
baa2dcd [Xiangrui Meng] fix doc for RegexTokenizer
Check that SparseVector size is at least as big as the number of indices/values provided. And add tests for constructor checks.
CC MechCoder jkbradley -- I am not sure if a change needs to also happen in the Python API? I didn't see it had any similar checks to begin with, but I don't know it well.
Author: Sean Owen <sowen@cloudera.com>
Closes#7794 from srowen/SPARK-9277 and squashes the following commits:
e8dc31e [Sean Owen] Fix scalastyle
6ffe34a [Sean Owen] Check that SparseVector size is at least as big as the number of indices/values provided. And add tests for constructor checks.
Add unit tests for running LDA with empty documents.
Both EMLDAOptimizer and OnlineLDAOptimizer are tested.
feynmanliang
Author: Meihua Wu <meihuawu@umich.edu>
Closes#7620 from rotationsymmetry/SPARK-9225 and squashes the following commits:
3ed7c88 [Meihua Wu] Incorporate reviewer's further comments
f9432e8 [Meihua Wu] Incorporate reviewer's comments
8e1b9ec [Meihua Wu] Merge remote-tracking branch 'upstream/master' into SPARK-9225
ad55665 [Meihua Wu] Add unit tests for running LDA with empty documents
A trivial fix for the comments of RegexTokenizer.
Maybe this is too small, yet I just noticed it and think it can be quite misleading. I can create a jira if necessary.
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes#7791 from hhbyyh/docFix and squashes the following commits:
cdf2542 [Yuhao Yang] minor fix on tokenizer doc
Continuation of work by zhangjiajin
Closes#7412
Author: zhangjiajin <zhangjiajin@huawei.com>
Author: Feynman Liang <fliang@databricks.com>
Author: zhang jiajin <zhangjiajin@huawei.com>
Closes#7783 from feynmanliang/SPARK-8998-improve-distributed and squashes the following commits:
a61943d [Feynman Liang] Collect small patterns to local
4ddf479 [Feynman Liang] Parallelize freqItemCounts
ad23aa9 [zhang jiajin] Merge pull request #1 from feynmanliang/SPARK-8998-collectBeforeLocal
87fa021 [Feynman Liang] Improve extend prefix readability
c2caa5c [Feynman Liang] Readability improvements and comments
1235cfc [Feynman Liang] Use Iterable[Array[_]] over Array[Array[_]] for database
da0091b [Feynman Liang] Use lists for prefixes to reuse data
cb2a4fc [Feynman Liang] Inline code for readability
01c9ae9 [Feynman Liang] Add getters
6e149fa [Feynman Liang] Fix splitPrefixSuffixPairs
64271b3 [zhangjiajin] Modified codes according to comments.
d2250b7 [zhangjiajin] remove minPatternsBeforeLocalProcessing, add maxSuffixesBeforeLocalProcessing.
b07e20c [zhangjiajin] Merge branch 'master' of https://github.com/apache/spark into CollectEnoughPrefixes
095aa3a [zhangjiajin] Modified the code according to the review comments.
baa2885 [zhangjiajin] Modified the code according to the review comments.
6560c69 [zhangjiajin] Add feature: Collect enough frequent prefixes before projection in PrefixeSpan
a8fde87 [zhangjiajin] Merge branch 'master' of https://github.com/apache/spark
4dd1c8a [zhangjiajin] initialize file before rebase.
078d410 [zhangjiajin] fix a scala style error.
22b0ef4 [zhangjiajin] Add feature: Collect enough frequent prefixes before projection in PrefixSpan.
ca9c4c8 [zhangjiajin] Modified the code according to the review comments.
574e56c [zhangjiajin] Add new object LocalPrefixSpan, and do some optimization.
ba5df34 [zhangjiajin] Fix a Scala style error.
4c60fb3 [zhangjiajin] Fix some Scala style errors.
1dd33ad [zhangjiajin] Modified the code according to the review comments.
89bc368 [zhangjiajin] Fixed a Scala style error.
a2eb14c [zhang jiajin] Delete PrefixspanSuite.scala
951fd42 [zhang jiajin] Delete Prefixspan.scala
575995f [zhangjiajin] Modified the code according to the review comments.
91fd7e6 [zhangjiajin] Add new algorithm PrefixSpan and test file.
PeriodicGraphCheckpointer was introduced for Latent Dirichlet Allocation (LDA), but it was meant to be generalized to work with Graphs, RDDs, and other data structures based on RDDs. This PR generalizes it.
For those who are not familiar with the periodic checkpointer, it tries to automatically handle persisting/unpersisting and checkpointing/removing checkpoint files in a lineage of RDD-based objects.
I need it generalized to use with GradientBoostedTrees [https://issues.apache.org/jira/browse/SPARK-6684]. It should be useful for other iterative algorithms as well.
Changes I made:
* Copied PeriodicGraphCheckpointer to PeriodicCheckpointer.
* Within PeriodicCheckpointer, I created abstract methods for the basic operations (checkpoint, persist, etc.).
* The subclasses for Graphs and RDDs implement those abstract methods.
* I copied the test suite for the graph checkpointer and made tiny modifications to make it work for RDDs.
To review this PR, I recommend doing 2 diffs:
(1) diff between the old PeriodicGraphCheckpointer.scala and the new PeriodicCheckpointer.scala
(2) diff between the 2 test suites
CCing andrewor14 in case there are relevant changes to checkpointing.
CCing feynmanliang in case you're interested in learning about checkpointing.
CCing mengxr for final OK.
Thanks all!
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#7728 from jkbradley/gbt-checkpoint and squashes the following commits:
d41902c [Joseph K. Bradley] Oops, forgot to update an extra time in the checkpointer tests, after the last commit. I'll fix that. I'll also make some of the checkpointer methods protected, which I should have done before.
32b23b8 [Joseph K. Bradley] fixed usage of checkpointer in lda
0b3dbc0 [Joseph K. Bradley] Changed checkpointer constructor not to take initial data.
568918c [Joseph K. Bradley] Generalized PeriodicGraphCheckpointer to PeriodicCheckpointer, with subclasses for RDDs and Graphs.
jira: https://issues.apache.org/jira/browse/SPARK-7368
Add QR decomposition for RowMatrix.
I'm not sure what's the blueprint about the distributed Matrix from community and whether this will be a desirable feature , so I sent a prototype for discussion. I'll go on polish the code and provide ut and performance statistics if it's acceptable.
The implementation refers to the [paper: https://www.cs.purdue.edu/homes/dgleich/publications/Benson%202013%20-%20direct-tsqr.pdf]
Austin R. Benson, David F. Gleich, James Demmel. "Direct QR factorizations for tall-and-skinny matrices in MapReduce architectures", 2013 IEEE International Conference on Big Data, which is a stable algorithm with good scalability.
Currently I tried it on a 400000 * 500 rowMatrix (16 partitions) and it can bring down the computation time from 8.8 mins (using breeze.linalg.qr.reduced) to 2.6 mins on a 4 worker cluster. I think there will still be some room for performance improvement.
Any trial and suggestion is welcome.
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes#5909 from hhbyyh/qrDecomposition and squashes the following commits:
cec797b [Yuhao Yang] remove unnecessary qr
0fb1012 [Yuhao Yang] hierarchy R computing
3fbdb61 [Yuhao Yang] update qr to indirect and add ut
0d913d3 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into qrDecomposition
39213c3 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into qrDecomposition
c0fc0c7 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into qrDecomposition
39b0b22 [Yuhao Yang] initial draft for discussion
JIRA: https://issues.apache.org/jira/browse/SPARK-8838
Currently all part-files are merged when merging parquet schema. However, in case there are many part-files and we can make sure that all the part-files have the same schema as their summary file. If so, we provide a configuration to disable merging part-files when merging parquet schema.
In short, we need to merge parquet schema because different summary files may contain different schema. But the part-files are confirmed to have the same schema with summary files.
Author: Liang-Chi Hsieh <viirya@appier.com>
Closes#7238 from viirya/option_partfile_merge and squashes the following commits:
71d5b5f [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into option_partfile_merge
8816f44 [Liang-Chi Hsieh] For comments.
dbc8e6b [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into option_partfile_merge
afc2fa1 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into option_partfile_merge
d4ed7e6 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into option_partfile_merge
df43027 [Liang-Chi Hsieh] Get dataStatuses' partitions based on all paths.
4eb2f00 [Liang-Chi Hsieh] Use given parameter.
ea8f6e5 [Liang-Chi Hsieh] Correct the code comments.
a57be0e [Liang-Chi Hsieh] Merge part-files if there are no summary files.
47df981 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into option_partfile_merge
4caf293 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into option_partfile_merge
0e734e0 [Liang-Chi Hsieh] Use correct API.
3b6be5b [Liang-Chi Hsieh] Fix key not found.
4bdd7e0 [Liang-Chi Hsieh] Don't read footer files if we can skip them.
8bbebcb [Liang-Chi Hsieh] Figure out how to test the config.
bbd4ce7 [Liang-Chi Hsieh] Add config to enable/disable merging part-files when merging parquet schema.
SparkEnv might not have been set in local unit tests.
Author: Reynold Xin <rxin@databricks.com>
Closes#7784 from rxin/HashedRelationSuite and squashes the following commits:
435d64b [Reynold Xin] Fix flaky HashedRelationSuite
Author: zsxwing <zsxwing@gmail.com>
Closes#7711 from zsxwing/SPARK-9335-test and squashes the following commits:
c13ec2f [zsxwing] environs -> environ
69c2865 [zsxwing] Merge remote-tracking branch 'origin/master' into SPARK-9335-test
ef84a08 [zsxwing] Revert "Modify the Kinesis project to trigger ENABLE_KINESIS_TESTS"
f691028 [zsxwing] Modify the Kinesis project to trigger ENABLE_KINESIS_TESTS
7618205 [zsxwing] Enable Kinesis tests only when files in extras/kinesis-asl are changed
Users can now get the file name of the partition being read in. A thread local variable is in `SQLNewHadoopRDD` and is set when the partition is computed. `SQLNewHadoopRDD` is moved to core so that the catalyst package can reach it.
This supports:
`df.select(inputFileName())`
and
`sqlContext.sql("select input_file_name() from table")`
Author: Joseph Batchik <josephbatchik@gmail.com>
Closes#7743 from JDrit/input_file_name and squashes the following commits:
abb8609 [Joseph Batchik] fixed failing test and changed the default value to be an empty string
d2f323d [Joseph Batchik] updates per review
102061f [Joseph Batchik] updates per review
75313f5 [Joseph Batchik] small fixes
c7f7b5a [Joseph Batchik] addeding input file name to Spark SQL
JIRA: https://issues.apache.org/jira/browse/SPARK-9428
Author: Yijie Shen <henry.yijieshen@gmail.com>
Closes#7748 from yjshen/string_cleanup and squashes the following commits:
e0c2b3d [Yijie Shen] update codegen in RegExpExtract and RegExpReplace
26614d2 [Yijie Shen] MathFunctionSuite
a402859 [Yijie Shen] complex_create, conditional and cast
6e4e608 [Yijie Shen] arithmetic and cast
52593c1 [Yijie Shen] null input test cases for StringExpressionSuite
Also we could create a Python UDT without having a Scala one, it's important for Python users.
cc mengxr JoshRosen
Author: Davies Liu <davies@databricks.com>
Closes#7453 from davies/class_in_main and squashes the following commits:
4dfd5e1 [Davies Liu] add tests for Python and Scala UDT
793d9b2 [Davies Liu] Merge branch 'master' of github.com:apache/spark into class_in_main
dc65f19 [Davies Liu] address comment
a9a3c40 [Davies Liu] Merge branch 'master' of github.com:apache/spark into class_in_main
a86e1fc [Davies Liu] fix serialization
ad528ba [Davies Liu] Merge branch 'master' of github.com:apache/spark into class_in_main
63f52ef [Davies Liu] fix pylint check
655b8a9 [Davies Liu] Merge branch 'master' of github.com:apache/spark into class_in_main
316a394 [Davies Liu] support Python UDT with UTF
0bcb3ef [Davies Liu] fix bug in mllib
de986d6 [Davies Liu] fix test
83d65ac [Davies Liu] fix bug in StructType
55bb86e [Davies Liu] support Python UDT in __main__ (without Scala one)
`names` is not defined in this context, I think you meant `self.names`.
davies
Author: Alex Angelini <alex.louis.angelini@gmail.com>
Closes#7766 from angelini/fix_struct_type_names and squashes the following commits:
01543a1 [Alex Angelini] Fix reference to self.names in StructType
Author: Reynold Xin <rxin@databricks.com>
Closes#7767 from rxin/SPARK-9462 and squashes the following commits:
ef3e2d9 [Reynold Xin] Removed println
713ac3a [Reynold Xin] More unit tests.
bb5c334 [Reynold Xin] [SPARK-9462][SQL] Initialize nondeterministic expressions in code gen fallback mode.
As of today, StringPrefixComparator converts the long values back to byte arrays in order to compare them. This patch optimizes this to compare the longs directly, rather than turning the longs into byte arrays and comparing them byte by byte (unsigned).
This only works on little-endian architecture right now.
Author: Reynold Xin <rxin@databricks.com>
Closes#7765 from rxin/SPARK-9460 and squashes the following commits:
e4908cc [Reynold Xin] Stricter randomized tests.
4c8d094 [Reynold Xin] [SPARK-9460] Avoid byte array allocation in StringPrefixComparator.
In our existing sort prefix generation code, we use expression's eval method to generate the prefix, which results in object allocation for every prefix. We can use the specialized getters available on InternalRow directly to avoid the object allocation.
I also removed the FLOAT prefix, opting for converting float directly to double.
Author: Reynold Xin <rxin@databricks.com>
Closes#7763 from rxin/sort-prefix and squashes the following commits:
5dc2f06 [Reynold Xin] [SPARK-9458] Avoid object allocation in prefix generation.
jkbradley MechCoder
Resolves blocking issue for SPARK-6793. Please review after #7705 is merged.
Author: Feynman Liang <fliang@databricks.com>
Closes#7757 from feynmanliang/SPARK-9940-localSaveLoad and squashes the following commits:
d0d8cf4 [Feynman Liang] Fix thisClassName
0f30109 [Feynman Liang] Fix tests after changing LDAModel public API
dc61981 [Feynman Liang] Add hyperparams to LocalLDAModel save/load
Author: sethah <seth.hendrickson16@gmail.com>
Closes#7655 from sethah/Working_on_6129 and squashes the following commits:
253db2d [sethah] removed number formatting from example code
b769cab [sethah] rewording threshold section
d5dad4d [sethah] adding some explanations of concepts to the eval metrics user guide
3a61ff9 [sethah] Removing unnecessary latex commands from metrics guide
c9dd058 [sethah] Cleaning up and formatting metrics user guide section
6f31c21 [sethah] All example code for metrics section done
98813fe [sethah] Most java and python example code added. Further latex formatting
53a24fc [sethah] Adding documentations of metrics for ML algorithms to user guide
Implement the classification trait for RandomForestClassifiers. The plan is to use this in the future to providing thresholding for RandomForestClassifiers (as well as other classifiers that implement that trait).
Author: Holden Karau <holden@pigscanfly.ca>
Closes#7432 from holdenk/SPARK-9016-make-random-forest-classifiers-implement-classification-trait and squashes the following commits:
bf22fa6 [Holden Karau] Add missing imports for testing suite
e948f0d [Holden Karau] Check the prediction generation from rawprediciton
25320c3 [Holden Karau] Don't supply numClasses when not needed, assert model classes are as expected
1a67e04 [Holden Karau] Use old decission tree stuff instead
673e0c3 [Holden Karau] Merge branch 'master' into SPARK-9016-make-random-forest-classifiers-implement-classification-trait
0d15b96 [Holden Karau] FIx typo
5eafad4 [Holden Karau] add a constructor for rootnode + num classes
fc6156f [Holden Karau] scala style fix
2597915 [Holden Karau] take num classes in constructor
3ccfe4a [Holden Karau] Merge in master, make pass numClasses through randomforest for training
222a10b [Holden Karau] Increase numtrees to 3 in the python test since before the two were equal and the argmax was selecting the last one
16aea1c [Holden Karau] Make tests match the new models
b454a02 [Holden Karau] Make the Tree classifiers extends the Classifier base class
77b4114 [Holden Karau] Import vectors lib
Author: Bimal Tandel <bimal@bimal-MBP.local>
Closes#7730 from BimalTandel/branch_spark_8921 and squashes the following commits:
3ea230a [Bimal Tandel] Spark 8921 add @since tags
We accidentally moved the list of expressions from the generated code instance to the class wrapper, and as a result, different threads are sharing the same set of expressions, which cause problems for expressions with mutable state.
This pull request fixed that problem, and also added unit tests for all codegen classes, except GeneratedOrdering (which will never need any expressions since sort now only accepts bound references.
Author: Reynold Xin <rxin@databricks.com>
Closes#7759 from rxin/SPARK-9448 and squashes the following commits:
c09b50f [Reynold Xin] [SPARK-9448][SQL] GenerateUnsafeProjection should not share expressions across instances.
We need to make page sizes configurable so we can reduce them in unit tests and increase them in real production workloads. These sizes are now controlled by a new configuration, `spark.buffer.pageSize`. The new default is 64 megabytes.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#7741 from JoshRosen/SPARK-9411 and squashes the following commits:
a43c4db [Josh Rosen] Fix pow
2c0eefc [Josh Rosen] Fix MAXIMUM_PAGE_SIZE_BYTES comment + value
bccfb51 [Josh Rosen] Lower page size to 4MB in TestHive
ba54d4b [Josh Rosen] Make UnsafeExternalSorter's page size configurable
0045aa2 [Josh Rosen] Make UnsafeShuffle's page size configurable
bc734f0 [Josh Rosen] Rename configuration
e614858 [Josh Rosen] Makes BytesToBytesMap page size configurable
Pregel code contains two consecutive joins:
```
g.vertices.innerJoin(messages)(vprog)
...
g = g.outerJoinVertices(newVerts)
{ (vid, old, newOpt) => newOpt.getOrElse(old) }
```
This can be simplified with one join. ankurdave proposed a patch based on our discussion in the mailing list: https://www.mail-archive.com/devspark.apache.org/msg10316.html
Author: Alexander Ulanov <nashb@yandex.ru>
Closes#7749 from avulanov/SPARK-9436-pregel and squashes the following commits:
8568e06 [Alexander Ulanov] Pregel simplification patch
We want to introduce a new IntervalType in 1.6 that is based on only the number of microseoncds,
so interval can be compared.
Renaming the existing IntervalType to CalendarIntervalType so we can do that in the future.
Author: Reynold Xin <rxin@databricks.com>
Closes#7745 from rxin/calendarintervaltype and squashes the following commits:
99f64e8 [Reynold Xin] One more line ...
13466c8 [Reynold Xin] Fixed tests.
e20f24e [Reynold Xin] [SPARK-9430][SQL] Rename IntervalType to CalendarIntervalType.
Based on #7471.
- [x] add a test that exercises the publish path from driver to receiver
- [ ] remove Serializable from `RateController` and `RateEstimator`
Author: Iulian Dragos <jaguarul@gmail.com>
Author: François Garillot <francois@garillot.net>
Closes#7600 from dragos/topic/streaming-bp/rate-controller and squashes the following commits:
f168c94 [Iulian Dragos] Latest review round.
5125e60 [Iulian Dragos] Fix style.
a2eb3b9 [Iulian Dragos] Merge remote-tracking branch 'upstream/master' into topic/streaming-bp/rate-controller
475e346 [Iulian Dragos] Latest round of reviews.
e9fb45e [Iulian Dragos] - Add a test for checkpointing - fixed serialization for RateController.executionContext
715437a [Iulian Dragos] Review comments and added a `reset` call in ReceiverTrackerTest.
e57c66b [Iulian Dragos] Added a couple of tests for the full scenario from driver to receivers, with several rate updates.
b425d32 [Iulian Dragos] Removed DeveloperAPI, removed rateEstimator field, removed Noop rate estimator, changed logic for initialising rate estimator.
238cfc6 [Iulian Dragos] Merge remote-tracking branch 'upstream/master' into topic/streaming-bp/rate-controller
34a389d [Iulian Dragos] Various style changes and a first test for the rate controller.
d32ca36 [François Garillot] [SPARK-8977][Streaming] Defines the RateEstimator interface, and implements the ReceiverRateController
8941cf9 [Iulian Dragos] Renames and other nitpicks.
162d9e5 [Iulian Dragos] Use Reflection for accessing truly private `executor` method and use the listener bus to know when receivers have registered (`onStart` is called before receivers have registered, leading to flaky behavior).
210f495 [Iulian Dragos] Revert "Added a few tests that measure the receiver’s rate."
0c51959 [Iulian Dragos] Added a few tests that measure the receiver’s rate.
261a051 [Iulian Dragos] - removed field to hold the current rate limit in rate limiter - made rate limit a Long and default to Long.MaxValue (consequence of the above) - removed custom `waitUntil` and replaced it by `eventually`
cd1397d [Iulian Dragos] Add a test for the propagation of a new rate limit from driver to receivers.
6369b30 [Iulian Dragos] Merge pull request #15 from huitseeker/SPARK-8975
d15de42 [François Garillot] [SPARK-8975][Streaming] Adds Ratelimiter unit tests w.r.t. spark.streaming.receiver.maxRate
4721c7d [François Garillot] [SPARK-8975][Streaming] Add a mechanism to send a new rate from the driver to the block generator
Added a custom Kryo serializer for generic Avro records to reduce the network IO
involved during a shuffle. This compresses the schema and allows for users to
register their schemas ahead of time to further reduce traffic.
Currently Kryo tries to use its default serializer for generic Records, which will include
a lot of unneeded data in each record.
Author: Joseph Batchik <joseph.batchik@cloudera.com>
Author: Joseph Batchik <josephbatchik@gmail.com>
Closes#7004 from JDrit/Avro_serialization and squashes the following commits:
8158d51 [Joseph Batchik] updated per feedback
c0cf329 [Joseph Batchik] implemented @squito suggestion for SparkEnv
dd71efe [Joseph Batchik] fixed bug with serializing
1183a48 [Joseph Batchik] updated codec settings
fa9298b [Joseph Batchik] forgot a couple of fixes
c5fe794 [Joseph Batchik] implemented @squito suggestion
0f5471a [Joseph Batchik] implemented @squito suggestion to use a codec that is already in spark
6d1925c [Joseph Batchik] fixed to changes suggested by @squito
d421bf5 [Joseph Batchik] updated pom to removed versions
ab46d10 [Joseph Batchik] Changed Avro dependency to be similar to parent
f4ae251 [Joseph Batchik] fixed serialization error in that SparkConf cannot be serialized
2b545cc [Joseph Batchik] started working on fixes for pr
97fba62 [Joseph Batchik] Added a custom Kryo serializer for generic Avro records to reduce the network IO involved during a shuffle. This compresses the schema and allows for users to register their schemas ahead of time to further reduce traffic.
Author: Reynold Xin <rxin@databricks.com>
Closes#7747 from rxin/SPARK-9127 and squashes the following commits:
e851418 [Reynold Xin] [SPARK-9127][SQL] Rand/Randn codegen fails with long seed.
as an offline discussion with rxin , it's weird to be computing stuff while doing sorting, we should only order by bound reference during execution.
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#7593 from cloud-fan/sort and squashes the following commits:
7b1bef7 [Wenchen Fan] add test
daf206d [Wenchen Fan] add more comments
289bee0 [Wenchen Fan] do not order by expressions which still need evaluation
Right now, we use double to parse all the float number in SQL. When it's used in expression together with DecimalType, it will turn the decimal into double as well. Also it will loss some precision when using double.
This PR change to parse float number to decimal or double, based on it's using scientific notation or not, see https://msdn.microsoft.com/en-us/library/ms179899.aspx
This is a break change, should we doc it somewhere?
Author: Davies Liu <davies@databricks.com>
Closes#7642 from davies/parse_decimal and squashes the following commits:
1f576d9 [Davies Liu] Merge branch 'master' of github.com:apache/spark into parse_decimal
5e142b6 [Davies Liu] fix scala style
eca99de [Davies Liu] fix tests
2afe702 [Davies Liu] Merge branch 'master' of github.com:apache/spark into parse_decimal
f4a320b [Davies Liu] Update SqlParser.scala
1c48e34 [Davies Liu] use decimal or double when parsing SQL
Spark's ShuffleMemoryManager and MemoryStore track memory on a per-thread basis, which causes problems in the handful of cases where we have tasks that use multiple threads. In PythonRDD, RRDD, ScriptTransformation, and PipedRDD we consume the input iterator in a separate thread in order to write it to an external process. As a result, these RDD's input iterators are consumed in a different thread than the thread that created them, which can cause problems in our memory allocation tracking. For example, if allocations are performed in one thread but deallocations are performed in a separate thread then memory may be leaked or we may get errors complaining that more memory was allocated than was freed.
I think that the right way to fix this is to change our accounting to be performed on a per-task instead of per-thread basis. Note that the current per-thread tracking has caused problems in the past; SPARK-3731 (#2668) fixes a memory leak in PythonRDD that was caused by this issue (that fix is no longer necessary as of this patch).
Author: Josh Rosen <joshrosen@databricks.com>
Closes#7734 from JoshRosen/memory-tracking-fixes and squashes the following commits:
b4b1702 [Josh Rosen] Propagate TaskContext to writer threads.
57c9b4e [Josh Rosen] Merge remote-tracking branch 'origin/master' into memory-tracking-fixes
ed25d3b [Josh Rosen] Address minor PR review comments
44f6497 [Josh Rosen] Fix long line.
7b0f04b [Josh Rosen] Fix ShuffleMemoryManagerSuite
f57f3f2 [Josh Rosen] More thread -> task changes
fa78ee8 [Josh Rosen] Move Executor's cleanup into Task so that TaskContext is defined when cleanup is performed
5e2f01e [Josh Rosen] Fix capitalization
1b0083b [Josh Rosen] Roll back fix in PySpark, which is no longer necessary
2e1e0f8 [Josh Rosen] Use TaskAttemptIds to track shuffle memory
c9e8e54 [Josh Rosen] Use TaskAttemptIds to track unroll memory
We will do local projection for LocalRelation, and thus reuse the same Expression object among multiply evaluations. We should reset the mutable states of Expression before evaluate it.
Fix `PullOutNondeterministic` rule to make it work for `Sort`.
Also got a chance to cleanup the dataframe test suite.
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#7674 from cloud-fan/show and squashes the following commits:
888934f [Wenchen Fan] fix sort
c0e93e8 [Wenchen Fan] local DataFrame with random columns should return same value when call `show`
https://issues.apache.org/jira/browse/SPARK-9422
Author: Yin Huai <yhuai@databricks.com>
Closes#7737 from yhuai/removePlaceHolder and squashes the following commits:
ec29b44 [Yin Huai] Remove placeholder attributes.
UnsafeRow.getDouble and getFloat() return NaN when called on columns that are null, which is inconsistent with the behavior of other row classes (which is to return 0.0).
In addition, the generic get(ordinal, dataType) method should always return null for a null literal, but currently it handles nulls by calling the type-specific accessors.
This patch addresses both of these issues and adds a regression test.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#7736 from JoshRosen/unsafe-row-null-fixes and squashes the following commits:
c8eb2ee [Josh Rosen] Fix test in UnsafeRowConverterSuite
6214682 [Josh Rosen] Fixes to null handling in UnsafeRow
Sort-merge join is more robust in Spark since sorting can be made using the Tungsten sort operator.
Author: Reynold Xin <rxin@databricks.com>
Closes#7733 from rxin/smj and squashes the following commits:
61e4d34 [Reynold Xin] Fixed test case.
5ffd731 [Reynold Xin] Fixed JoinSuite.
a137dc0 [Reynold Xin] [SPARK-9418][SQL] Use sort-merge join as the default shuffle join.