Commit graph

345 commits

Author SHA1 Message Date
Davies Liu ec1adecbb7 [SPARK-3430] [PySpark] [Doc] generate PySpark API docs using Sphinx
Using Sphinx to generate API docs for PySpark.

requirement: Sphinx

```
$ cd python/docs/
$ make html
```

The generated API docs will be located at python/docs/_build/html/index.html

It can coexist with the docs generated by Epydoc.

This is the first working version; after it is merged, we can continue to improve it and eventually replace Epydoc.

Author: Davies Liu <davies.liu@gmail.com>

Closes #2292 from davies/sphinx and squashes the following commits:

425a3b1 [Davies Liu] cleanup
1573298 [Davies Liu] move docs to python/docs/
5fe3903 [Davies Liu] Merge branch 'master' into sphinx
9468ab0 [Davies Liu] fix makefile
b408f38 [Davies Liu] address all comments
e2ccb1b [Davies Liu] update name and version
9081ead [Davies Liu] generate PySpark API docs using Sphinx
2014-09-16 12:51:58 -07:00
Aaron Staple 8e7ae477ba [SPARK-2314][SQL] Override collect and take in python library, and count in java library, with optimized versions.
SchemaRDD overrides RDD functions, including collect, count, and take, with optimized versions that make use of the query optimizer.  The Java and Python interface classes wrapping SchemaRDD need to ensure the optimized versions are called as well.  This patch overrides the relevant calls in the Python and Java interfaces with optimized versions.

Adds a new Row serialization pathway between Python and Java, based on JList[Array[Byte]] versus the existing RDD[Array[Byte]]. I wasn't overjoyed about doing this, but I noticed that some QueryPlans implement optimizations in executeCollect(), which outputs an Array[Row] rather than the typical RDD[Row] that can be shipped to Python using the existing serialization code. To me it made sense to ship the Array[Row] over to Python directly instead of converting it back to an RDD[Row] just to reuse the existing serialization code.

Author: Aaron Staple <aaron.staple@gmail.com>

Closes #1592 from staple/SPARK-2314 and squashes the following commits:

89ff550 [Aaron Staple] Merge with master.
6bb7b6c [Aaron Staple] Fix typo.
b56d0ac [Aaron Staple] [SPARK-2314][SQL] Override count in JavaSchemaRDD, forwarding to SchemaRDD's count.
0fc9d40 [Aaron Staple] Fix comment typos.
f03cdfa [Aaron Staple] [SPARK-2314][SQL] Override collect and take in sql.py, forwarding to SchemaRDD's collect.
2014-09-16 11:45:35 -07:00
Matthew Farrellee 9d5fa763d8 [SPARK-3519] add distinct(n) to PySpark
Added missing rdd.distinct(numPartitions) and associated tests
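A quick, hypothetical usage sketch of the new signature (the data is illustrative):

```python
# distinct() now also accepts a number of partitions, mirroring the Scala API.
rdd = sc.parallelize([1, 1, 2, 3, 3, 3])
print(sorted(rdd.distinct().collect()))   # [1, 2, 3]
print(sorted(rdd.distinct(2).collect()))  # same result, computed with 2 partitions
```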

Author: Matthew Farrellee <matt@redhat.com>

Closes #2383 from mattf/SPARK-3519 and squashes the following commits:

30b837a [Matthew Farrellee] Combine test cases to save on JVM startups
6bc4a2c [Matthew Farrellee] [SPARK-3519] add distinct(n) to SchemaRDD in PySpark
7a17f2b [Matthew Farrellee] [SPARK-3519] add distinct(n) to PySpark
2014-09-16 11:39:57 -07:00
Aaron Staple 60050f4288 [SPARK-1087] Move python traceback utilities into new traceback_utils.py file.
Also made some cosmetic cleanups.

Author: Aaron Staple <aaron.staple@gmail.com>

Closes #2385 from staple/SPARK-1087 and squashes the following commits:

7b3bb13 [Aaron Staple] Address review comments, cosmetic cleanups.
10ba6e1 [Aaron Staple] [SPARK-1087] Move python traceback utilities into new traceback_utils.py file.
2014-09-15 19:28:17 -07:00
Davies Liu da33acb8b6 [SPARK-2951] [PySpark] support unpickle array.array for Python 2.6
Pyrolite cannot unpickle array.array objects pickled by Python 2.6; this patch fixes it by extending Pyrolite.

There is a bug in Pyrolite when unpickling arrays of float/double; this patch works around it by reversing the endianness for float/double. The workaround should be removed once Pyrolite has a new release that fixes the issue.

I have sent a PR to Pyrolite to fix it: https://github.com/irmen/Pyrolite/pull/11

Author: Davies Liu <davies.liu@gmail.com>

Closes #2365 from davies/pickle and squashes the following commits:

f44f771 [Davies Liu] enable tests about array
3908f5c [Davies Liu] Merge branch 'master' into pickle
c77c87b [Davies Liu] cleanup debugging code
60e4e2f [Davies Liu] support unpickle array.array for Python 2.6
2014-09-15 18:57:25 -07:00
qiping.lqp fdb302f49c [SPARK-3516] [mllib] DecisionTree: Add minInstancesPerNode, minInfoGain params to example and Python API
Added minInstancesPerNode, minInfoGain params to:
* DecisionTreeRunner.scala example
* Python API (tree.py)

Also:
* Fixed typo in tree suite test "do not choose split that does not satisfy min instance per node requirements"
* small style fixes
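A hedged sketch of how the parameters listed above could be passed through the Python API; the dataset path is illustrative and the exact trainClassifier keyword arguments are an assumption:

```python
from pyspark.mllib.tree import DecisionTree
from pyspark.mllib.util import MLUtils

# Assumed dataset path; minInstancesPerNode and minInfoGain are the
# parameters added by this patch.
data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
model = DecisionTree.trainClassifier(
    data, numClasses=2, categoricalFeaturesInfo={},
    minInstancesPerNode=2,   # require at least 2 instances in each child node
    minInfoGain=0.0)         # minimum information gain for a split to be used
```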

CC: mengxr

Author: qiping.lqp <qiping.lqp@alibaba-inc.com>
Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>
Author: chouqin <liqiping1991@gmail.com>

Closes #2349 from jkbradley/chouqin-dt-preprune and squashes the following commits:

61b2e72 [Joseph K. Bradley] Added max of 10GB for maxMemoryInMB in Strategy.
a95e7c8 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into chouqin-dt-preprune
95c479d [Joseph K. Bradley] * Fixed typo in tree suite test "do not choose split that does not satisfy min instance per node requirements" * small style fixes
e2628b6 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into chouqin-dt-preprune
19b01af [Joseph K. Bradley] Merge remote-tracking branch 'chouqin/dt-preprune' into chouqin-dt-preprune
f1d11d1 [chouqin] fix typo
c7ebaf1 [chouqin] fix typo
39f9b60 [chouqin] change edge `minInstancesPerNode` to 2 and add one more test
c6e2dfc [Joseph K. Bradley] Added minInstancesPerNode and minInfoGain parameters to DecisionTreeRunner.scala and to Python API in tree.py
0278a11 [chouqin] remove `noSplit` and set `Predict` private to tree
d593ec7 [chouqin] fix docs and change minInstancesPerNode to 1
efcc736 [qiping.lqp] fix bug
10b8012 [qiping.lqp] fix style
6728fad [qiping.lqp] minor fix: remove empty lines
bb465ca [qiping.lqp] Merge branch 'master' of https://github.com/apache/spark into dt-preprune
cadd569 [qiping.lqp] add api docs
46b891f [qiping.lqp] fix bug
e72c7e4 [qiping.lqp] add comments
845c6fa [qiping.lqp] fix style
f195e83 [qiping.lqp] fix style
987cbf4 [qiping.lqp] fix bug
ff34845 [qiping.lqp] separate calculation of predict of node from calculation of info gain
ac42378 [qiping.lqp] add min info gain and min instances per node parameters in decision tree
2014-09-15 17:43:26 -07:00
Davies Liu 4e3fbe8cdb [SPARK-3463] [PySpark] aggregate and show spilled bytes in Python
Aggregate the number of bytes spilled to disk during aggregation or sorting, and show them in the web UI.

![spilled](https://cloud.githubusercontent.com/assets/40902/4209758/4b995562-386d-11e4-97c1-8e838ee1d4e3.png)

This patch is blocked by SPARK-3465. (It includes a fix for that).

Author: Davies Liu <davies.liu@gmail.com>

Closes #2336 from davies/metrics and squashes the following commits:

e37df38 [Davies Liu] remove outdated comments
1245eb7 [Davies Liu] remove the temporary fix
ebd2f43 [Davies Liu] Merge branch 'master' into metrics
7e4ad04 [Davies Liu] Merge branch 'master' into metrics
fbe9029 [Davies Liu] show spilled bytes in Python in web ui
2014-09-13 22:31:21 -07:00
Davies Liu 2aea0da84c [SPARK-3030] [PySpark] Reuse Python worker
Reuse the Python worker to avoid the overhead of forking a Python process for each task. It also tracks the broadcasts for each worker, avoiding resending repeated broadcasts.

This can reduce the time for a dummy task from 22ms to 13ms (-40%). It can help to reduce the latency of Spark Streaming.

For a job with a broadcast variable (43M after compression):
```
    b = sc.broadcast(set(range(30000000)))
    print sc.parallelize(range(24000), 100).filter(lambda x: x in b.value).count()
```
It finishes in 281s without a reused worker, and in 65s with a reused worker (4 CPUs). Reusing the worker saves about 9 seconds per task on transferring and deserializing the broadcast.

It's enabled by default and can be disabled with `spark.python.worker.reuse = false`.
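For example, a minimal sketch of turning the flag off explicitly (the config key is the one named above):

```python
from pyspark import SparkConf, SparkContext

# Worker reuse is on by default; set the flag to "false" to get the old
# one-worker-per-task behaviour.
conf = SparkConf().set("spark.python.worker.reuse", "false")
sc = SparkContext(conf=conf)
```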

Author: Davies Liu <davies.liu@gmail.com>

Closes #2259 from davies/reuse-worker and squashes the following commits:

f11f617 [Davies Liu] Merge branch 'master' into reuse-worker
3939f20 [Davies Liu] fix bug in serializer in mllib
cf1c55e [Davies Liu] address comments
3133a60 [Davies Liu] fix accumulator with reused worker
760ab1f [Davies Liu] do not reuse worker if there are any exceptions
7abb224 [Davies Liu] refactor: sychronized with itself
ac3206e [Davies Liu] renaming
8911f44 [Davies Liu] synchronized getWorkerBroadcasts()
6325fc1 [Davies Liu] bugfix: bid >= 0
e0131a2 [Davies Liu] fix name of config
583716e [Davies Liu] only reuse completed and not interrupted worker
ace2917 [Davies Liu] kill python worker after timeout
6123d0f [Davies Liu] track broadcasts for each worker
8d2f08c [Davies Liu] reuse python worker
2014-09-13 16:22:04 -07:00
Davies Liu 885d1621bc [SPARK-3500] [SQL] use JavaSchemaRDD as SchemaRDD._jschema_rdd
Currently, SchemaRDD._jschema_rdd is a SchemaRDD, so the Scala API (coalesce(), repartition()) cannot easily be called from Python; there is no way to specify the implicit parameter `ord`. The _jrdd is a JavaRDD, so _jschema_rdd should also be a JavaSchemaRDD.

This patch changes _jschema_rdd to a JavaSchemaRDD and also adds an assert for it. If some methods are missing from JavaSchemaRDD, they are called via _jschema_rdd.baseSchemaRDD().xxx().

BTW, do we need JavaSQLContext?

Author: Davies Liu <davies.liu@gmail.com>

Closes #2369 from davies/fix_schemardd and squashes the following commits:

abee159 [Davies Liu] use JavaSchemaRDD as SchemaRDD._jschema_rdd
2014-09-12 19:05:39 -07:00
Davies Liu 71af030b46 [SPARK-3094] [PySpark] compatitable with PyPy
After this patch, we can run PySpark on PyPy (tested with PyPy 2.3.1 on Mac OS X 10.9), for example:

```
PYSPARK_PYTHON=pypy ./bin/spark-submit wordcount.py
```

The performance speedup depends on the workload (from 20% to 3000%). Here are some benchmarks:

 Job | CPython 2.7 | PyPy 2.3.1  | Speed up
 ------- | ------------ | ------------- | -------
 Word Count | 41s   | 15s  | 2.7x
 Sort | 46s |  44s | 1.05x
 Stats | 174s | 3.6s | 48x

Here is the code used for benchmark:

```python
rdd = sc.textFile("text")
def wordcount():
    rdd.flatMap(lambda x:x.split('/'))\
        .map(lambda x:(x,1)).reduceByKey(lambda x,y:x+y).collectAsMap()
def sort():
    rdd.sortBy(lambda x:x, 1).count()
def stats():
    sc.parallelize(range(1024), 20).flatMap(lambda x: xrange(5024)).stats()
```

Author: Davies Liu <davies.liu@gmail.com>

Closes #2144 from davies/pypy and squashes the following commits:

9aed6c5 [Davies Liu] use protocol 2 in CloudPickle
4bc1f04 [Davies Liu] refactor
b20ab3a [Davies Liu] pickle sys.stdout and stderr in portable way
3ca2351 [Davies Liu] Merge branch 'master' into pypy
fae8b19 [Davies Liu] improve attrgetter, add tests
591f830 [Davies Liu] try to run tests with PyPy in run-tests
c8d62ba [Davies Liu] cleanup
f651fd0 [Davies Liu] fix tests using array with PyPy
1b98fb3 [Davies Liu] serialize itemgetter/attrgetter in portable ways
3c1dbfe [Davies Liu] Merge branch 'master' into pypy
42fb5fa [Davies Liu] Merge branch 'master' into pypy
cb2d724 [Davies Liu] fix tests
9986692 [Davies Liu] Merge branch 'master' into pypy
25b4ca7 [Davies Liu] support PyPy
2014-09-12 18:42:50 -07:00
RJ Nowling 533377621f [PySpark] Add blank line so that Python RDD.top() docstring renders correctly
Author: RJ Nowling <rnowling@gmail.com>

Closes #2370 from rnowling/python_rdd_docstrings and squashes the following commits:

5230574 [RJ Nowling] Add blank line so that Python RDD.top() docstring renders correctly
2014-09-12 09:46:21 -07:00
Davies Liu 1ef656ea85 [SPARK-3047] [PySpark] add an option to use str in textFileRDD
str is much more efficient than unicode (both CPU and memory), so it's better to use str in textFileRDD. To keep compatibility, unicode is used by default (maybe change it in the future).

use_unicode=True:

daviesliudm:~/work/spark$ time python wc.py
(u'./universe/spark/sql/core/target/java/org/apache/spark/sql/execution/ExplainCommand$.java', 7776)

real	2m8.298s
user	0m0.185s
sys	0m0.064s

use_unicode=False

daviesliudm:~/work/spark$ time python wc.py
('./universe/spark/sql/core/target/java/org/apache/spark/sql/execution/ExplainCommand$.java', 7776)

real	1m26.402s
user	0m0.182s
sys	0m0.062s

We can see a 32% improvement!
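A short sketch of the new option (the path is illustrative):

```python
# Default behaviour: lines are unicode objects (kept for compatibility).
unicode_lines = sc.textFile("hdfs:///data/logs")
# Opt in to plain str lines to save CPU and memory.
str_lines = sc.textFile("hdfs:///data/logs", use_unicode=False)
```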

Author: Davies Liu <davies.liu@gmail.com>

Closes #1951 from davies/unicode and squashes the following commits:

8352d57 [Davies Liu] update version number
a286f2f [Davies Liu] rollback loads()
85246e5 [Davies Liu] add docs for use_unicode
a0295e1 [Davies Liu] add an option to use str in textFile()
2014-09-11 11:50:36 -07:00
Matthew Farrellee 25b5b867d5 [SPARK-3458] enable python "with" statements for SparkContext
allow for best practice code,

```
try:
  sc = SparkContext()
  app(sc)
finally:
  sc.stop()
```

to be written using a "with" statement,

```
with SparkContext() as sc:
  app(sc)
```

Author: Matthew Farrellee <matt@redhat.com>

Closes #2335 from mattf/SPARK-3458 and squashes the following commits:

5b4e37c [Matthew Farrellee] [SPARK-3458] enable python "with" statements for SparkContext
2014-09-09 18:54:54 -07:00
Xiangrui Meng 50a4fa774a [SPARK-3443][MLLIB] update default values of tree:
Adjust the default values of decision tree, based on the memory requirement discussed in https://github.com/apache/spark/pull/2125 :

1. maxMemoryInMB: 128 -> 256
2. maxBins: 100 -> 32
3. maxDepth: 4 -> 5 (in some example code)

jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #2322 from mengxr/tree-defaults and squashes the following commits:

cda453a [Xiangrui Meng] fix tests
5900445 [Xiangrui Meng] update comments
8c81831 [Xiangrui Meng] update default values of tree:
2014-09-08 18:59:57 -07:00
Matthew Rocklin 939a322c85 [SPARK-3417] Use new-style classes in PySpark
Tiny PR making SQLContext a new-style class.  This allows various type logic to work more effectively.

```Python
In [1]: import pyspark

In [2]: pyspark.sql.SQLContext.mro()
Out[2]: [pyspark.sql.SQLContext, object]
```

Author: Matthew Rocklin <mrocklin@gmail.com>

Closes #2288 from mrocklin/sqlcontext-new-style-class and squashes the following commits:

4aadab6 [Matthew Rocklin] update other old-style classes
a2dc02f [Matthew Rocklin] pyspark.sql.SQLContext is new-style class
2014-09-08 15:45:36 -07:00
Sandy Ryza 16a73c2473 SPARK-2978. Transformation with MR shuffle semantics
I didn't add this to the transformations list in the docs because it's kind of obscure, but would be happy to do so if others think it would be helpful.

Author: Sandy Ryza <sandy@cloudera.com>

Closes #2274 from sryza/sandy-spark-2978 and squashes the following commits:

4a5332a [Sandy Ryza] Fix Java test
c04b447 [Sandy Ryza] Fix Python doc and add back deleted code
433ad5b [Sandy Ryza] Add Java test
4c25a54 [Sandy Ryza] Add s at the end and a couple other fixes
9b0ba99 [Sandy Ryza] Fix compilation
36e0571 [Sandy Ryza] Fix import ordering
48c12c2 [Sandy Ryza] Add Java version and additional doc
e5381cd [Sandy Ryza] Fix python style warnings
f147634 [Sandy Ryza] SPARK-2978. Transformation with MR shuffle semantics
2014-09-08 11:20:00 -07:00
Ward Viaene ecfa76cdfe [SPARK-3415] [PySpark] removes SerializingAdapter code
This code removes the SerializingAdapter code that was copied from PiCloud.

Author: Ward Viaene <ward.viaene@bigdatapartnership.com>

Closes #2287 from wardviaene/feature/pythonsys and squashes the following commits:

5f0d426 [Ward Viaene] SPARK-3415: modified test class to do dump and load
5f5d559 [Ward Viaene] SPARK-3415: modified test class name and call cloudpickle.dumps instead using StringIO
afc4a9a [Ward Viaene] SPARK-3415: added newlines to pass lint
aaf10b7 [Ward Viaene] SPARK-3415: removed references to SerializingAdapter and rewrote test
65ffeff [Ward Viaene] removed duplicate test
a958866 [Ward Viaene] SPARK-3415: test script
e263bf5 [Ward Viaene] SPARK-3415: removes legacy SerializingAdapter code
2014-09-07 18:54:36 -07:00
Davies Liu 110fb8b24d [SPARK-2334] fix AttributeError when call PipelineRDD.id()
The underlying JavaRDD for a PipelineRDD is created lazily; it is not created until _jrdd is accessed.

The id of the JavaRDD is cached as `_id`, which saves a py4j RPC call on later invocations.
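A minimal sketch of the call that used to fail (values are illustrative):

```python
rdd = sc.parallelize([1, 2, 3])
doubled = rdd.map(lambda x: x * 2)  # a pipelined RDD (PipelineRDD)
print(doubled.id())                 # previously raised AttributeError; now returns the RDD id
```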

closes #1276

Author: Davies Liu <davies.liu@gmail.com>

Closes #2296 from davies/id and squashes the following commits:

e197958 [Davies Liu] fix style
9721716 [Davies Liu] fix id of PipelineRDD
2014-09-06 16:12:29 -07:00
GuoQiang Li 21a1e1bb89 [SPARK-3273][SPARK-3301]We should read the version information from the same place
Author: GuoQiang Li <witgo@qq.com>

Closes #2175 from witgo/SPARK-3273 and squashes the following commits:

cf9c65a [GuoQiang Li] We should read the version information from the same place
2a44e2f [GuoQiang Li] The spark version in the welcome message of pyspark is not correct
2014-09-06 15:08:43 -07:00
Holden Karau da35330e83 Spark-3406 add a default storage level to python RDD persist API
Author: Holden Karau <holden@pigscanfly.ca>

Closes #2280 from holdenk/SPARK-3406-Python-RDD-persist-api-does-not-have-default-storage-level and squashes the following commits:

33eaade [Holden Karau] As Josh pointed out, sql also override persist. Make persist behave the same as in the underlying RDD as well
e658227 [Holden Karau] Fix the test I added
e95a6c5 [Holden Karau] The Python persist function did not have a default storageLevel unlike the Scala API. Noticed this issue because we got a bug report back from the book where we had documented it as if it was the same as the Scala API
2014-09-06 14:49:25 -07:00
Andrew Ash ba5bcaddec SPARK-3211 .take() is OOM-prone with empty partitions
Instead of jumping straight from 1 partition to all partitions, grow exponentially:
double the number of partitions to attempt each time.
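The strategy, sketched under assumptions (the scan_partitions helper is hypothetical; this is not the actual implementation):

```python
# Illustrative sketch of the scan strategy described above: instead of
# falling back from 1 partition straight to all partitions, grow the
# number of partitions scanned on each attempt.
def take_sketch(rdd, n, scan_partitions):
    taken, start, num_to_try = [], 0, 1
    total = rdd.getNumPartitions()
    while len(taken) < n and start < total:
        parts = range(start, min(start + num_to_try, total))
        taken.extend(scan_partitions(rdd, parts))  # fetch elements from only these partitions
        start += num_to_try
        num_to_try *= 2                            # exponential growth between attempts
    return taken[:n]
```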

Fix proposed by Paul Nepywoda

Author: Andrew Ash <andrew@andrewash.com>

Closes #2117 from ash211/SPARK-3211 and squashes the following commits:

8b2299a [Andrew Ash] Quadruple instead of double for a minor speedup
e5f7e4d [Andrew Ash] Update comment to better reflect what we're doing
09a27f7 [Andrew Ash] Update PySpark to be less OOM-prone as well
3a156b8 [Andrew Ash] SPARK-3211 .take() is OOM-prone with empty partitions
2014-09-05 18:52:05 -07:00
Kousuke Saruta dc1ba9e9fc [SPARK-3378] [DOCS] Replace the word "SparkSQL" with right word "Spark SQL"
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #2251 from sarutak/SPARK-3378 and squashes the following commits:

0bfe234 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-3378
bb5938f [Kousuke Saruta] Replaced rest of "SparkSQL" with "Spark SQL"
6df66de [Kousuke Saruta] Replaced "SparkSQL" with "Spark SQL"
2014-09-04 15:06:08 -07:00
Matthew Farrellee 7c6e71f05f [SPARK-2435] Add shutdown hook to pyspark
Author: Matthew Farrellee <matt@redhat.com>

Closes #2183 from mattf/SPARK-2435 and squashes the following commits:

ee0ee99 [Matthew Farrellee] [SPARK-2435] Add shutdown hook to pyspark
2014-09-03 19:37:37 -07:00
Davies Liu c5cbc49233 [SPARK-3335] [SQL] [PySpark] support broadcast in Python UDF
After this patch, broadcast can be used in Python UDF.

Author: Davies Liu <davies.liu@gmail.com>

Closes #2243 from davies/udf_broadcast and squashes the following commits:

7b88861 [Davies Liu] support broadcast in UDF
2014-09-03 19:08:39 -07:00
Davies Liu 6481d27425 [SPARK-3309] [PySpark] Put all public API in __all__
Put all public APIs in __all__ and also re-export them in pyspark/__init__.py, so that we can get the documentation for the whole public API with `pydoc pyspark`. Other programs (such as Sphinx or Epydoc) can also use it to generate documents only for public APIs.
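Roughly, the pattern looks like this (the modules and name lists are illustrative, not the exact lists from the patch):

```python
# pyspark/rdd.py (illustrative): only the names in __all__ are public.
__all__ = ["RDD"]

# pyspark/__init__.py (illustrative): re-export the public API so that
# `pydoc pyspark` and doc generators see everything in one place.
from pyspark.conf import SparkConf
from pyspark.context import SparkContext
from pyspark.rdd import RDD

__all__ = ["SparkConf", "SparkContext", "RDD"]
```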

Author: Davies Liu <davies.liu@gmail.com>

Closes #2205 from davies/public and squashes the following commits:

c6c5567 [Davies Liu] fix message
f7b35be [Davies Liu] put SchemeRDD, Row in pyspark.sql module
7e3016a [Davies Liu] add __all__ in mllib
6281b48 [Davies Liu] fix doc for SchemaRDD
6caab21 [Davies Liu] add public interfaces into pyspark.__init__.py
2014-09-03 11:49:45 -07:00
Davies Liu e2c901b4c7 [SPARK-2871] [PySpark] add countApproxDistinct() API
RDD.countApproxDistinct(relativeSD=0.05):

        :: Experimental ::
        Return approximate number of distinct elements in the RDD.

        The algorithm used is based on streamlib's implementation of
        "HyperLogLog in Practice: Algorithmic Engineering of a State
        of The Art Cardinality Estimation Algorithm", available
        <a href="http://dx.doi.org/10.1145/2452376.2452456">here</a>.

        This supports all types of objects that are supported by
        Pyrolite, i.e. nearly all builtin types.

        param relativeSD Relative accuracy. Smaller values create
                           counters that require more space.
                           It must be greater than 0.000017.

        >>> n = sc.parallelize(range(1000)).map(str).countApproxDistinct()
        >>> 950 < n < 1050
        True
        >>> n = sc.parallelize([i % 20 for i in range(1000)]).countApproxDistinct()
        >>> 18 < n < 22
        True

Author: Davies Liu <davies.liu@gmail.com>

Closes #2142 from davies/countApproxDistinct and squashes the following commits:

e20da47 [Davies Liu] remove the correction in Python
c38c4e4 [Davies Liu] fix doc tests
2ab157c [Davies Liu] fix doc tests
9d2565f [Davies Liu] add commments and link for hash collision correction
d306492 [Davies Liu] change range of hash of tuple to [0, maxint]
ded624f [Davies Liu] calculate hash in Python
4cba98f [Davies Liu] add more tests
a85a8c6 [Davies Liu] Merge branch 'master' into countApproxDistinct
e97e342 [Davies Liu] add countApproxDistinct()
2014-09-02 15:47:47 -07:00
Holden Karau ba78383bac SPARK-3318: Documentation update in addFile on how to use SparkFiles.get
Rather than specifying the path, we need to use just the filename with SparkFiles.get.
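A short sketch of the documented pattern (the file path is illustrative):

```python
from pyspark import SparkFiles

sc.addFile("/local/path/lookup.txt")   # ship the file to every node

def read_first_line(_):
    # Pass only the file name, not the original path, to SparkFiles.get().
    with open(SparkFiles.get("lookup.txt")) as f:
        return f.readline()

print(sc.parallelize([0]).map(read_first_line).collect())
```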

Author: Holden Karau <holden@pigscanfly.ca>

Closes #2210 from holdenk/SPARK-3318-documentation-for-addfiles-should-say-to-use-file-not-path and squashes the following commits:

a25d27a [Holden Karau] Update the JavaSparkContext addFile method to be clear about using fileName with SparkFiles as well
0ebcb05 [Holden Karau] Documentation update in addFile on how to use SparkFiles.get to specify filename rather than path
2014-08-30 16:58:17 -07:00
Davies Liu e248328b39 [SPARK-3307] [PySpark] Fix doc string of SparkContext.broadcast()
remove invalid docs

Author: Davies Liu <davies.liu@gmail.com>

Closes #2202 from davies/keep and squashes the following commits:

aa3b44f [Davies Liu] remove invalid docs
2014-08-29 11:47:49 -07:00
Davies Liu 4fa2fda88f [SPARK-2871] [PySpark] add RDD.lookup(key)
RDD.lookup(key)

        Return the list of values in the RDD for key `key`. This operation
        is done efficiently if the RDD has a known partitioner by only
        searching the partition that the key maps to.

        >>> l = range(1000)
        >>> rdd = sc.parallelize(zip(l, l), 10)
        >>> rdd.lookup(42)  # slow
        [42]
        >>> sorted = rdd.sortByKey()
        >>> sorted.lookup(42)  # fast
        [42]

It also cleans up the code in RDD.py and fixes several bugs (related to preservesPartitioning).

Author: Davies Liu <davies.liu@gmail.com>

Closes #2093 from davies/lookup and squashes the following commits:

1789cd4 [Davies Liu] `f` in foreach could be generator or not.
2871b80 [Davies Liu] Merge branch 'master' into lookup
c6390ea [Davies Liu] address all comments
0f1bce8 [Davies Liu] add test case for lookup()
be0e8ba [Davies Liu] fix preservesPartitioning
eb1305d [Davies Liu] add RDD.lookup(key)
2014-08-27 13:18:33 -07:00
Andrew Or 7557c4cfef [SPARK-3167] Handle special driver configs in Windows
This is an effort to bring the Windows scripts up to speed after recent splashing changes in #1845.

Author: Andrew Or <andrewor14@gmail.com>

Closes #2129 from andrewor14/windows-config and squashes the following commits:

881a8f0 [Andrew Or] Add reference to Windows taskkill
92e6047 [Andrew Or] Update a few comments (minor)
22b1acd [Andrew Or] Fix style again (minor)
afcffea [Andrew Or] Fix style (minor)
72004c2 [Andrew Or] Actually respect --driver-java-options
803218b [Andrew Or] Actually respect SPARK_*_CLASSPATH
eeb34a0 [Andrew Or] Update outdated comment (minor)
35caecc [Andrew Or] In Windows, actually kill Java processes on exit
f97daa2 [Andrew Or] Fix Windows spark shell stdin issue
83ebe60 [Andrew Or] Parse special driver configs in Windows (broken)
2014-08-26 22:52:16 -07:00
Davies Liu f1e71d4c3b [SPARK-3073] [PySpark] use external sort in sortBy() and sortByKey()
Use external sorting to support sorting large datasets in the reduce stage.

Author: Davies Liu <davies.liu@gmail.com>

Closes #1978 from davies/sort and squashes the following commits:

bbcd9ba [Davies Liu] check spilled bytes in tests
b125d2f [Davies Liu] add test for external sort in rdd
eae0176 [Davies Liu] choose different disks from different processes and instances
1f075ed [Davies Liu] Merge branch 'master' into sort
eb53ca6 [Davies Liu] Merge branch 'master' into sort
644abaf [Davies Liu] add license in LICENSE
19f7873 [Davies Liu] improve tests
55602ee [Davies Liu] use external sort in sortBy() and sortByKey()
2014-08-26 16:57:40 -07:00
Takuya UESHIN 98c2bb0bbd [SPARK-2969][SQL] Make ScalaReflection be able to handle ArrayType.containsNull and MapType.valueContainsNull.
Make `ScalaReflection` be able to handle like:

- `Seq[Int]` as `ArrayType(IntegerType, containsNull = false)`
- `Seq[java.lang.Integer]` as `ArrayType(IntegerType, containsNull = true)`
- `Map[Int, Long]` as `MapType(IntegerType, LongType, valueContainsNull = false)`
- `Map[Int, java.lang.Long]` as `MapType(IntegerType, LongType, valueContainsNull = true)`

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #1889 from ueshin/issues/SPARK-2969 and squashes the following commits:

24f1c5c [Takuya UESHIN] Change the default value of ArrayType.containsNull to true in Python API.
79f5b65 [Takuya UESHIN] Change the default value of ArrayType.containsNull to true in Java API.
7cd1a7a [Takuya UESHIN] Fix json test failures.
2cfb862 [Takuya UESHIN] Change the default value of ArrayType.containsNull to true.
2f38e61 [Takuya UESHIN] Revert the default value of MapTypes.valueContainsNull.
9fa02f5 [Takuya UESHIN] Fix a test failure.
1a9a96b [Takuya UESHIN] Modify ScalaReflection to handle ArrayType.containsNull and MapType.valueContainsNull.
2014-08-26 13:22:55 -07:00
Davies Liu 3cedc4f4d7 [SPARK-2871] [PySpark] add histgram() API
RDD.histogram(buckets)

        Compute a histogram using the provided buckets. The buckets
        are all open to the right except for the last, which is closed.
        E.g. [1,10,20,50] means the buckets are [1,10) [10,20) [20,50],
        i.e. 1<=x<10, 10<=x<20, 20<=x<=50. On the input of 1
        and 50 we would have a histogram of 1,0,1.

        If your histogram is evenly spaced (e.g. [0, 10, 20, 30]),
        this can be switched from an O(log n) insertion to O(1) per
        element (where n = # buckets).

        Buckets must be sorted, must not contain any duplicates, and
        must have at least two elements.

        If `buckets` is a number, it generates that many buckets, evenly
        spaced between the minimum and maximum of the RDD. For
        example, if the min value is 0, the max is 100 and `buckets`
        is 2, the resulting buckets will be [0,50) [50,100]. `buckets`
        must be at least 1. An exception is raised if the RDD contains
        infinity or NaN. If the elements in the RDD do not vary
        (max == min), a single bucket is always returned.

        It returns a tuple of (buckets, histogram).

        >>> rdd = sc.parallelize(range(51))
        >>> rdd.histogram(2)
        ([0, 25, 50], [25, 26])
        >>> rdd.histogram([0, 5, 25, 50])
        ([0, 5, 25, 50], [5, 20, 26])
        >>> rdd.histogram([0, 15, 30, 45, 60], True)
        ([0, 15, 30, 45, 60], [15, 15, 15, 6])
        >>> rdd = sc.parallelize(["ab", "ac", "b", "bd", "ef"])
        >>> rdd.histogram(("a", "b", "c"))
        (('a', 'b', 'c'), [2, 2])

Closes #122, which is a duplicate.

Author: Davies Liu <davies.liu@gmail.com>

Closes #2091 from davies/histgram and squashes the following commits:

a322f8a [Davies Liu] fix deprecation of e.message
84e85fa [Davies Liu] remove evenBuckets, add more tests (including str)
d9a0722 [Davies Liu] address comments
0e18a2d [Davies Liu] add histgram() API
2014-08-26 13:04:30 -07:00
Davies Liu fb0db77242 [SPARK-2871] [PySpark] add zipWithIndex() and zipWithUniqueId()
RDD.zipWithIndex()

        Zips this RDD with its element indices.

        The ordering is first based on the partition index and then the
        ordering of items within each partition. So the first item in
        the first partition gets index 0, and the last item in the last
        partition receives the largest index.

        This method needs to trigger a Spark job when this RDD contains
        more than one partition.

        >>> sc.parallelize(range(4), 2).zipWithIndex().collect()
        [(0, 0), (1, 1), (2, 2), (3, 3)]

RDD.zipWithUniqueId()

        Zips this RDD with generated unique Long ids.

        Items in the kth partition will get ids k, n+k, 2*n+k, ..., where
        n is the number of partitions. So there may exist gaps, but this
        method won't trigger a Spark job, which is different from
        L{zipWithIndex}.

        >>> sc.parallelize(range(4), 2).zipWithUniqueId().collect()
        [(0, 0), (2, 1), (1, 2), (3, 3)]

Author: Davies Liu <davies.liu@gmail.com>

Closes #2092 from davies/zipWith and squashes the following commits:

cebe5bf [Davies Liu] improve test cases, reverse the order of index
0d2a128 [Davies Liu] add zipWithIndex() and zipWithUniqueId()
2014-08-24 21:16:05 -07:00
Davies Liu 8df4dad495 [SPARK-2871] [PySpark] add approx API for RDD
RDD.countApprox(self, timeout, confidence=0.95)

        :: Experimental ::
        Approximate version of count() that returns a potentially incomplete
        result within a timeout, even if not all tasks have finished.

        >>> rdd = sc.parallelize(range(1000), 10)
        >>> rdd.countApprox(1000, 1.0)
        1000

RDD.sumApprox(self, timeout, confidence=0.95)

        Approximate operation to return the sum within a timeout
        or meet the confidence.

        >>> rdd = sc.parallelize(range(1000), 10)
        >>> r = sum(xrange(1000))
        >>> (rdd.sumApprox(1000) - r) / r < 0.05

RDD.meanApprox(self, timeout, confidence=0.95)

        :: Experimental ::
        Approximate operation to return the mean within a timeout
        or meet the confidence.

        >>> rdd = sc.parallelize(range(1000), 10)
        >>> r = sum(xrange(1000)) / 1000.0
        >>> (rdd.meanApprox(1000) - r) / r < 0.05
        True

Author: Davies Liu <davies.liu@gmail.com>

Closes #2095 from davies/approx and squashes the following commits:

e8c252b [Davies Liu] add approx API for RDD
2014-08-23 19:33:34 -07:00
Davies Liu db436e36c4 [SPARK-2871] [PySpark] add key argument for max(), min() and top(n)
RDD.max(key=None)

        param key: A function used to generate key for comparing

        >>> rdd = sc.parallelize([1.0, 5.0, 43.0, 10.0])
        >>> rdd.max()
        43.0
        >>> rdd.max(key=str)
        5.0

RDD.min(key=None)

        Find the minimum item in this RDD.

        param key: A function used to generate key for comparing

        >>> rdd = sc.parallelize([2.0, 5.0, 43.0, 10.0])
        >>> rdd.min()
        2.0
        >>> rdd.min(key=str)
        10.0

RDD.top(num, key=None)

        Get the top N elements from an RDD.

        Note: It returns the list sorted in descending order.
        >>> sc.parallelize([10, 4, 2, 12, 3]).top(1)
        [12]
        >>> sc.parallelize([2, 3, 4, 5, 6], 2).top(2)
        [6, 5]
        >>> sc.parallelize([10, 4, 2, 12, 3]).top(3, key=str)
        [4, 3, 2]

Author: Davies Liu <davies.liu@gmail.com>

Closes #2094 from davies/cmp and squashes the following commits:

ccbaf25 [Davies Liu] add `key` to top()
ad7e374 [Davies Liu] fix tests
2f63512 [Davies Liu] change `comp` to `key` in min/max
dd91e08 [Davies Liu] add `comp` argument for RDD.max() and RDD.min()
2014-08-23 18:55:13 -07:00
Andrew Or ba3c730e35 [SPARK-3140] Clarify confusing PySpark exception message
We read the py4j port from the stdout of the `bin/spark-submit` subprocess. If there is interference in stdout (e.g. a random echo in `spark-submit`), we throw an exception with a warning message. We do not, however, distinguish between this case from the case where no stdout is produced at all.

I wasted a non-trivial amount of time being baffled by this exception in search of places where I print random whitespace (in vain, of course). A clearer exception message that distinguishes between these cases will prevent similar headaches that I have gone through.

Author: Andrew Or <andrewor14@gmail.com>

Closes #2067 from andrewor14/python-exception and squashes the following commits:

742f823 [Andrew Or] Further clarify warning messages
e96a7a0 [Andrew Or] Distinguish between unexpected output and no output at all
2014-08-20 17:07:39 -07:00
Davies Liu 0a7ef6339f [SPARK-3141] [PySpark] fix sortByKey() with take()
Fix sortByKey() with take()

The function `f` used in mapPartitions should always return an iterator.
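For reference, a minimal sketch of a conforming mapPartitions function (it yields, so it returns an iterator):

```python
# The function given to mapPartitions receives an iterator over a
# partition and must itself return an iterator (e.g. a generator).
def add_one(iterator):
    for x in iterator:
        yield x + 1

print(sc.parallelize([1, 2, 3], 2).mapPartitions(add_one).collect())  # [2, 3, 4]
```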

Author: Davies Liu <davies.liu@gmail.com>

Closes #2045 from davies/fix_sortbykey and squashes the following commits:

1160f59 [Davies Liu] fix sortByKey() with take()
2014-08-19 22:43:49 -07:00
Josh Rosen ebcb94f701 [SPARK-2974] [SPARK-2975] Fix two bugs related to spark.local.dirs
This PR fixes two bugs related to `spark.local.dirs` and `SPARK_LOCAL_DIRS`, one where `Utils.getLocalDir()` might return an invalid directory (SPARK-2974) and another where the `SPARK_LOCAL_DIRS` override didn't affect the driver, which could cause problems when running tasks in local mode (SPARK-2975).

This patch fixes both issues: the new `Utils.getOrCreateLocalRootDirs(conf: SparkConf)` utility method manages the creation of local directories and handles the precedence among the different configuration options, so we should see the same behavior whether we're running in local mode or on a worker.

It's kind of a pain to mock out environment variables in tests (no easy way to mock System.getenv), so I added a `private[spark]` method to SparkConf for accessing environment variables (by default, it just delegates to System.getenv).  By subclassing SparkConf and overriding this method, we can mock out SPARK_LOCAL_DIRS in tests.

I also fixed a typo in PySpark where we used `SPARK_LOCAL_DIR` instead of `SPARK_LOCAL_DIRS` (I think this was technically innocuous, but it seemed worth fixing).

Author: Josh Rosen <joshrosen@apache.org>

Closes #2002 from JoshRosen/local-dirs and squashes the following commits:

efad8c6 [Josh Rosen] Address review comments:
1dec709 [Josh Rosen] Minor updates to Javadocs.
7f36999 [Josh Rosen] Use env vars to detect if running in YARN container.
399ac25 [Josh Rosen] Update getLocalDir() documentation.
bb3ad89 [Josh Rosen] Remove duplicated YARN getLocalDirs() code.
3e92d44 [Josh Rosen] Move local dirs override logic into Utils; fix bugs:
b2c4736 [Josh Rosen] Add failing tests for SPARK-2974 and SPARK-2975.
007298b [Josh Rosen] Allow environment variables to be mocked in tests.
6d9259b [Josh Rosen] Fix typo in PySpark: SPARK_LOCAL_DIR should be SPARK_LOCAL_DIRS
2014-08-19 22:42:50 -07:00
Xiangrui Meng 825d4fe47b [SPARK-3136][MLLIB] Create Java-friendly methods in RandomRDDs
Though we don't use default arguments for methods in RandomRDDs, it is still not easy for Java users to use them because the output type is either `RDD[Double]` or `RDD[Vector]`. Java users should expect `JavaDoubleRDD` and `JavaRDD[Vector]`, respectively. We should create dedicated methods for Java users, and allow default arguments in Scala methods in RandomRDDs, to make life easier for both Java and Scala users. This PR also contains documentation for random data generation. brkyvz

Author: Xiangrui Meng <meng@databricks.com>

Closes #2041 from mengxr/stat-doc and squashes the following commits:

fc5eedf [Xiangrui Meng] add missing comma
ffde810 [Xiangrui Meng] address comments
aef6d07 [Xiangrui Meng] add doc for random data generation
b99d94b [Xiangrui Meng] add java-friendly methods to RandomRDDs
2014-08-19 16:06:48 -07:00
Davies Liu d7e80c2597 [SPARK-2790] [PySpark] fix zip with serializers which have different batch sizes.
If two RDDs have different batch sizes in their serializers, it will re-serialize the one with the smaller batch size and then call RDD.zip() in Spark.

Author: Davies Liu <davies.liu@gmail.com>

Closes #1894 from davies/zip and squashes the following commits:

c4652ea [Davies Liu] add more test cases
6d05fc8 [Davies Liu] Merge branch 'master' into zip
813b1e4 [Davies Liu] add more tests for failed cases
a4aafda [Davies Liu] fix zip with serializers which have different batch sizes.
2014-08-19 14:46:32 -07:00
Josh Rosen 1f1819b20f [SPARK-3114] [PySpark] Fix Python UDFs in Spark SQL.
This fixes SPARK-3114, an issue where we inadvertently broke Python UDFs in Spark SQL.

This PR modifies the test runner script to always run the PySpark SQL tests, irrespective of whether SparkSQL itself has been modified.  It also includes Davies' fix for the bug.

Closes #2026.

Author: Josh Rosen <joshrosen@apache.org>
Author: Davies Liu <davies.liu@gmail.com>

Closes #2027 from JoshRosen/pyspark-sql-fix and squashes the following commits:

9af2708 [Davies Liu] bugfix: disable compression of command
0d8d3a4 [Josh Rosen] Always run Python Spark SQL tests.
2014-08-18 20:42:19 -07:00
Joseph K. Bradley c8b16ca0d8 [SPARK-2850] [SPARK-2626] [mllib] MLlib stats examples + small fixes
Added examples for statistical summarization:
* Scala: StatisticalSummary.scala
** Tests: correlation, MultivariateOnlineSummarizer
* python: statistical_summary.py
** Tests: correlation (since MultivariateOnlineSummarizer has no Python API)

Added examples for random and sampled RDDs:
* Scala: RandomAndSampledRDDs.scala
* python: random_and_sampled_rdds.py
* Both test:
** RandomRDDGenerators.normalRDD, normalVectorRDD
** RDD.sample, takeSample, sampleByKey

Added sc.stop() to all examples.

CorrelationSuite.scala
* Added 1 test for RDDs with only 1 value

RowMatrix.scala
* numCols(): Added check for numRows = 0, with error message.
* computeCovariance(): Added check for numRows <= 1, with error message.

Python SparseVector (pyspark/mllib/linalg.py)
* Added toDense() function

python/run-tests script
* Added stat.py (doc test)

CC: mengxr dorx  Main changes were examples to show usage across APIs.

Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>

Closes #1878 from jkbradley/mllib-stats-api-check and squashes the following commits:

ea5c047 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
dafebe2 [Joseph K. Bradley] Bug fixes for examples SampledRDDs.scala and sampled_rdds.py: Check for division by 0 and for missing key in maps.
8d1e555 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
60c72d9 [Joseph K. Bradley] Fixed stat.py doc test to work for Python versions printing nan or NaN.
b20d90a [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
4e5d15e [Joseph K. Bradley] Changed pyspark/mllib/stat.py doc tests to use NaN instead of nan.
32173b7 [Joseph K. Bradley] Stats examples update.
c8c20dc [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
cf70b07 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
0b7cec3 [Joseph K. Bradley] Small updates based on code review.  Renamed statistical_summary.py to correlations.py
ab48f6e [Joseph K. Bradley] RowMatrix.scala * numCols(): Added check for numRows = 0, with error message. * computeCovariance(): Added check for numRows <= 1, with error message.
65e4ebc [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
8195c78 [Joseph K. Bradley] Added examples for random and sampled RDDs: * Scala: RandomAndSampledRDDs.scala * python: random_and_sampled_rdds.py * Both test: ** RandomRDDGenerators.normalRDD, normalVectorRDD ** RDD.sample, takeSample, sampleByKey
064985b [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
ee918e9 [Joseph K. Bradley] Added examples for statistical summarization: * Scala: StatisticalSummary.scala ** Tests: correlation, MultivariateOnlineSummarizer * python: statistical_summary.py ** Tests: correlation (since MultivariateOnlineSummarizer has no Python API)
2014-08-18 18:01:39 -07:00
Joseph K. Bradley 115eeb30dd [mllib] DecisionTree: treeAggregate + Python example bug fix
Small DecisionTree updates:
* Changed main DecisionTree aggregate to treeAggregate.
* Fixed bug in python example decision_tree_runner.py with missing argument (since categoricalFeaturesInfo is no longer an optional argument for trainClassifier).
* Fixed same bug in python doc tests, and added tree.py to doc tests.

CC: mengxr

Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>

Closes #2015 from jkbradley/dt-opt2 and squashes the following commits:

b5114fa [Joseph K. Bradley] Fixed python tree.py doc test (extra newline)
8e4665d [Joseph K. Bradley] Added tree.py to python doc tests.  Fixed bug from missing categoricalFeaturesInfo argument.
b7b2922 [Joseph K. Bradley] Fixed bug in python example decision_tree_runner.py with missing argument.  Changed main DecisionTree aggregate to treeAggregate.
85bbc1f [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt2
66d076f [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt2
a0ed0da [Joseph K. Bradley] Renamed DTMetadata to DecisionTreeMetadata.  Small doc updates.
3726d20 [Joseph K. Bradley] Small code improvements based on code review.
ac0b9f8 [Joseph K. Bradley] Small updates based on code review. Main change: Now using << instead of math.pow.
db0d773 [Joseph K. Bradley] scala style fix
6a38f48 [Joseph K. Bradley] Added DTMetadata class for cleaner code
931a3a7 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt2
797f68a [Joseph K. Bradley] Fixed DecisionTreeSuite bug for training second level.  Needed to update treePointToNodeIndex with groupShift.
f40381c [Joseph K. Bradley] Merge branch 'dt-opt1' into dt-opt2
5f2dec2 [Joseph K. Bradley] Fixed scalastyle issue in TreePoint
6b5651e [Joseph K. Bradley] Updates based on code review.  1 major change: persisting to memory + disk, not just memory.
2d2aaaf [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt1
26d10dd [Joseph K. Bradley] Removed tree/model/Filter.scala since no longer used.  Removed debugging println calls in DecisionTree.scala.
356daba [Joseph K. Bradley] Merge branch 'dt-opt1' into dt-opt2
430d782 [Joseph K. Bradley] Added more debug info on binning error.  Added some docs.
d036089 [Joseph K. Bradley] Print timing info to logDebug.
e66f1b1 [Joseph K. Bradley] TreePoint * Updated doc * Made some methods private
8464a6e [Joseph K. Bradley] Moved TimeTracker to tree/impl/ in its own file, and cleaned it up.  Removed debugging println calls from DecisionTree.  Made TreePoint extend Serialiable
a87e08f [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt1
c1565a5 [Joseph K. Bradley] Small DecisionTree updates: * Simplification: Updated calculateGainForSplit to take aggregates for a single (feature, split) pair. * Internal doc: findAggForOrderedFeatureClassification
b914f3b [Joseph K. Bradley] DecisionTree optimization: eliminated filters + small changes
b2ed1f3 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt
0f676e2 [Joseph K. Bradley] Optimizations + Bug fix for DecisionTree
3211f02 [Joseph K. Bradley] Optimizing DecisionTree * Added TreePoint representation to avoid calling findBin multiple times. * (not working yet, but debugging)
f61e9d2 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-timing
bcf874a [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-timing
511ec85 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-timing
a95bc22 [Joseph K. Bradley] timing for DecisionTree internals
2014-08-18 14:40:05 -07:00
Davies Liu d1d0ee41c2 [SPARK-3103] [PySpark] fix saveAsTextFile() with utf-8
Bugfix: it raises an exception when it tries to encode non-ASCII strings into unicode. It should only encode unicode objects as "utf-8".
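A minimal Python 2 sketch of the case being fixed (the output path is illustrative):

```python
# -*- coding: utf-8 -*-
# The RDD mixes plain byte strings and unicode objects; saveAsTextFile()
# should UTF-8 encode only the unicode values and pass str through
# unchanged -- trying to encode the non-ASCII byte string was the bug.
rdd = sc.parallelize(["plain ascii", u"caf\u00e9", "non-ascii bytes \xe2\x98\x83"])
rdd.saveAsTextFile("/tmp/utf8-out")
```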

Author: Davies Liu <davies.liu@gmail.com>

Closes #2018 from davies/fix_utf8 and squashes the following commits:

4db7967 [Davies Liu] fix saveAsTextFile() with utf-8
2014-08-18 13:58:35 -07:00
Davies Liu 2fc8aca086 [SPARK-1065] [PySpark] improve supporting for large broadcast
Passing large objects through py4j is very slow (and costs a lot of memory), so pass broadcast objects via files (similar to parallelize()).

Add an option to keep the object in the driver (False by default) to save memory in the driver.

Author: Davies Liu <davies.liu@gmail.com>

Closes #1912 from davies/broadcast and squashes the following commits:

e06df4a [Davies Liu] load broadcast from disk in driver automatically
db3f232 [Davies Liu] fix serialization of accumulator
631a827 [Davies Liu] Merge branch 'master' into broadcast
c7baa8c [Davies Liu] compress serrialized broadcast and command
9a7161f [Davies Liu] fix doc tests
e93cf4b [Davies Liu] address comments: add test
6226189 [Davies Liu] improve large broadcast
2014-08-16 16:59:34 -07:00
iAmGhost 379e7585c3 [SPARK-3035] Wrong example with SparkContext.addFile
https://issues.apache.org/jira/browse/SPARK-3035

Fix for incorrect documentation.

Author: iAmGhost <kdh7807@gmail.com>

Closes #1942 from iAmGhost/master and squashes the following commits:

487528a [iAmGhost] [SPARK-3035] Wrong example with SparkContext.addFile fix for wrong document.
2014-08-16 16:48:38 -07:00
Xiangrui Meng ac6411c6e7 [SPARK-3081][MLLIB] rename RandomRDDGenerators to RandomRDDs
`RandomRDDGenerators` means factory for `RandomRDDGenerator`. However, its methods return RDDs but not RDDGenerators. So a more proper (and shorter) name would be `RandomRDDs`.

dorx brkyvz

Author: Xiangrui Meng <meng@databricks.com>

Closes #1979 from mengxr/randomrdds and squashes the following commits:

b161a2d [Xiangrui Meng] rename RandomRDDGenerators to RandomRDDs
2014-08-16 15:14:43 -07:00
Cheng Lian b4a05928e9 [SQL] Using safe floating-point numbers in doctest
Test code in `sql.py` tries to compare two floating-point numbers directly, and caused [build failure(s)](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18365/consoleFull).

[Doctest documentation](https://docs.python.org/3/library/doctest.html#warnings) recommends using numbers in the form of `I/2**J` to avoid the precision issue.

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #1925 from liancheng/fix-pysql-fp-test and squashes the following commits:

0fbf584 [Cheng Lian] Removed unnecessary `...' from inferSchema doctest
e8059d4 [Cheng Lian] Using safe floating-point numbers in doctest
2014-08-16 11:26:51 -07:00
Ahir Reddy fde692b361 [SQL] Python JsonRDD UTF8 Encoding Fix
Only encode unicode objects to UTF-8, not plain strings.

Author: Ahir Reddy <ahirreddy@gmail.com>

Closes #1914 from ahirreddy/json-rdd-unicode-fix1 and squashes the following commits:

ca4e9ba [Ahir Reddy] Encoding Fix
2014-08-14 10:48:52 -07:00