The semantics of Python's countByValue differ from the Scala API: it currently behaves more like countDistinctValue. This change makes it consistent with the Scala/Java API.
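To illustrate the difference in semantics, here is a minimal plain-Python sketch. It does not use Spark, and both function names are hypothetical. Counting by value yields one count per distinct element, whereas counting distinct values yields a single number.

```python
from collections import Counter


def count_by_value(batch):
    """Map each distinct element to the number of times it occurs."""
    return dict(Counter(batch))


def count_distinct_values(batch):
    """Return how many distinct elements the batch contains."""
    return len(set(batch))


batch = ["a", "b", "a", "c", "a"]
by_value = count_by_value(batch)         # one entry per distinct value
distinct = count_distinct_values(batch)  # a single number
```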
Author: jerryshao <sshao@hortonworks.com>
Closes #10350 from jerryshao/SPARK-12353.
After reading the JIRA https://issues.apache.org/jira/browse/SPARK-12520, I double checked the code.
For example, users can perform an equi-join like this:
```df.join(df2, 'name', 'outer').select('name', 'height').collect()```
- There is a bug in 1.4 and 1.5: the code ignores the third parameter (the join type) that users pass. The join is always executed as `Inner`, even if the user specifies another type (e.g., `Outer`).
- Since PR https://github.com/apache/spark/pull/8600, 1.6 does not have this issue, but the description has not been updated.
I plan to submit another PR to fix 1.5 and to issue an error message if users specify a non-inner join type when using an equi-join.
Author: gatorsmile <gatorsmile@gmail.com>
Closes #10477 from gatorsmile/pyOuterJoin.
Instead of just cancelling the registrationRetryTimer to stop the driver from retrying its connection to the master, change the function to schedule.
There is no need to register with the master iteratively.
Author: echo2mei <534384876@qq.com>
Closes #10447 from echoTomei/master.
In SparkContext method `setCheckpointDir`, a warning is issued when spark master is not local and the passed directory for the checkpoint dir appears to be local.
In practice, when relying on the HDFS configuration file and using a relative path for the checkpoint directory (an incomplete URI without the HDFS scheme, ...), this warning should not be issued and might be confusing.
In fact, in this case, the checkpoint directory is successfully created, and the checkpointing mechanism works as expected.
This PR uses the `FileSystem` instance created with the given directory, and checks whether it is local or not.
(The rationale is that this same `FileSystem` instance is used to create the checkpoint dir anyway, so it can be reliably used to determine whether the directory is local.)
The warning is only issued if the directory is not local, on top of the existing conditions.
Author: pierre-borckmans <pierre.borckmans@realimpactanalytics.com>
Closes #10392 from pierre-borckmans/SPARK-12440_CheckpointDir_Warning_NonLocal.
In the past Spark JDBC write only worked with technologies which support the following INSERT statement syntax (JdbcUtils.scala: insertStatement()):
INSERT INTO $table VALUES ( ?, ?, ..., ? )
But some technologies require a list of column names:
INSERT INTO $table ( $colNameList ) VALUES ( ?, ?, ..., ? )
This was blocking the use of e.g. the Progress JDBC Driver for Cassandra.
Another limitation is that syntax 1 relies on the DataFrame field ordering matching that of the target table. This works fine as long as the target table has been created by writer.jdbc().
If the target table contains more columns (i.e. it was not created by writer.jdbc()), then the insert fails due to a mismatch in the number of columns or their data types.
This PR switches to the recommended second INSERT syntax. Column names are taken from the DataFrame field names.
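As a rough illustration of the second syntax, the following plain-Python sketch builds such a statement from a table name and a list of field names. The function name `insert_statement` mirrors the one mentioned above, but this is not the Scala code in JdbcUtils.scala, and the sketch assumes the names need no quoting or escaping.

```python
def insert_statement(table, column_names):
    """Build a parameterized INSERT that names every target column explicitly."""
    if not column_names:
        raise ValueError("at least one column name is required")
    columns = ", ".join(column_names)
    # One "?" placeholder per column, in the same order as the column list.
    placeholders = ", ".join("?" for _ in column_names)
    return f"INSERT INTO {table} ( {columns} ) VALUES ( {placeholders} )"


sql = insert_statement("people", ["name", "age"])
```

Because the columns are named, the statement no longer depends on the field order or the total number of columns in the target table.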
Author: CK50 <christian.kurz@oracle.com>
Closes #10380 from CK50/master-SPARK-12010-2.
Restore the original value of os.arch property after each test
Since some tests force a specific value for the os.arch property, we need to restore the original value afterwards.
Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Closes #10289 from kiszk/SPARK-12311.
Fix an exception with the IBM JDK by removing the update field from the JavaVersion tuple. This is needed because the IBM JDK version string does not include the update suffix '_xx'.
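A minimal sketch of the idea, assuming a three-field JavaVersion tuple and a hypothetical helper `parse_java_version`; the actual parsing code may differ. Only the major, minor and patch parts are read, so version strings with and without an update suffix parse the same way.

```python
import re
from collections import namedtuple

# Three fields only; the update number is intentionally not part of the tuple.
JavaVersion = namedtuple("JavaVersion", ["major", "minor", "patch"])


def parse_java_version(raw_version):
    """Parse strings such as '1.8.0_66' or '1.8.0', ignoring any update suffix."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", raw_version)
    if match is None:
        raise ValueError(f"unrecognized Java version string: {raw_version!r}")
    return JavaVersion(*(int(part) for part in match.groups()))


with_update = parse_java_version("1.8.0_66")
without_update = parse_java_version("1.8.0")
```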
Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Closes #10463 from kiszk/SPARK-12502.
Allow the user to override MAVEN_OPTS (2GB wasn't sufficient for me)
Author: Adrian Bridgett <adrian@smop.co.uk>
Closes #10448 from abridgett/feature/do_not_force_maven_opts.
Fix Tachyon deprecations; pull Tachyon dependency into `TachyonBlockManager` only
CC calvinjia as I probably need a double-check that the usage of the new API is correct.
Author: Sean Owen <sowen@cloudera.com>
Closes #10449 from srowen/SPARK-12500.
Accessing null elements in an array field fails when Tungsten is enabled.
It works in Spark 1.3.1, and in Spark > 1.5 with Tungsten disabled.
This PR solves this by checking if the accessed element in the array field is null, in the generated code.
Example:
```
// Array of String
case class AS( as: Seq[String] )
val dfAS = sc.parallelize( Seq( AS ( Seq("a",null,"b") ) ) ).toDF
dfAS.registerTempTable("T_AS")
for (i <- 0 to 2) { println(i + " = " + sqlContext.sql(s"select as[$i] from T_AS").collect.mkString(","))}
```
With Tungsten disabled:
```
0 = [a]
1 = [null]
2 = [b]
```
With Tungsten enabled:
```
0 = [a]
15/12/22 09:32:50 ERROR Executor: Exception in task 7.0 in stage 1.0 (TID 15)
java.lang.NullPointerException
at org.apache.spark.sql.catalyst.expressions.UnsafeRowWriters$UTF8StringWriter.getSize(UnsafeRowWriters.java:90)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
at org.apache.spark.sql.execution.TungstenProject$$anonfun$3$$anonfun$apply$3.apply(basicOperators.scala:90)
at org.apache.spark.sql.execution.TungstenProject$$anonfun$3$$anonfun$apply$3.apply(basicOperators.scala:88)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
```
Author: pierre-borckmans <pierre.borckmans@realimpactanalytics.com>
Closes #10429 from pierre-borckmans/SPARK-12477_Tungsten-Projection-Null-Element-In-Array.
When the filter is ```"b in ('1', '2')"```, the filter is not pushed down to Parquet. Thanks!
Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
Closes #10278 from gatorsmile/parquetFilterNot.
When creating extractors for product types (i.e. case classes and tuples), a null check is missing, thus we always assume input product values are non-null.
This PR adds a null check in the extractor expression for product types. The null check is stripped off for top level product fields, which are mapped to the outermost `Row`s, since they can't be null.
Thanks to cloud-fan for helping investigate this issue!
Author: Cheng Lian <lian@databricks.com>
Closes #10431 from liancheng/spark-12478.top-level-null-field.
This PR adds Scala, Java and Python examples to show how to use Accumulator and Broadcast in Spark Streaming to support checkpointing.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes #10385 from zsxwing/accumulator-broadcast-example.
Compare both the left and right sides of the case expression, ignoring nullability, when checking for type equality.
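The idea can be sketched in plain Python with a hypothetical `ArrayType` stand-in (not Spark's own class). Two types that differ only in their nullability flag are treated as equal, while the ordinary equality check still distinguishes them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArrayType:
    """Hypothetical stand-in for an array type carrying a nullability flag."""
    element_type: object
    contains_null: bool


def same_type_ignoring_nullability(left, right):
    """Compare two types structurally while skipping the nullability flags."""
    if isinstance(left, ArrayType) and isinstance(right, ArrayType):
        return same_type_ignoring_nullability(left.element_type, right.element_type)
    return left == right


nullable_strings = ArrayType("string", contains_null=True)
non_null_strings = ArrayType("string", contains_null=False)

strict_equal = nullable_strings == non_null_strings
relaxed_equal = same_type_ignoring_nullability(nullable_strings, non_null_strings)
```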
Author: Dilip Biswal <dbiswal@us.ibm.com>
Closes #10156 from dilipbiswal/spark-12102.
First try, not sure how much information we need to provide in the usage part.
Author: Xiu Guo <xguo27@gmail.com>
Closes #10423 from xguo27/SPARK-12456.
We should update to the latest version of Zinc in order to match our SBT version.
Author: Josh Rosen <joshrosen@databricks.com>
Closes #10426 from JoshRosen/update-zinc.
https://issues.apache.org/jira/browse/SPARK-11677
Although the existing tests correctly check the filters through the number of results when ORC filter push-down is enabled, the filters themselves are not tested.
So this PR adds tests similar to `ParquetFilterSuite`.
Since the results are already checked by `OrcQuerySuite`, this `OrcFilterSuite` only checks whether the appropriate filters are created; this is the one difference from `ParquetFilterSuite`.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes #10341 from HyukjinKwon/SPARK-11677-followup.
This PR adds a new expression `AssertNotNull` to ensure non-nullable fields of products and case classes don't receive null values at runtime.
Author: Cheng Lian <lian@databricks.com>
Closes #10331 from liancheng/dataset-nullability-check.
Some methods are missing, such as ways to access the std, mean, etc. This PR is for feature parity for pyspark.mllib.feature.StandardScaler & StandardScalerModel.
Author: Holden Karau <holden@us.ibm.com>
Closes #10298 from holdenk/SPARK-12296-feature-parity-pyspark-mllib-StandardScalerModel.
This patch fixes a flaky "test jdbc cancel" test in HiveThriftBinaryServerSuite. This test is prone to a race condition which causes it to block indefinitely while waiting for an extremely slow query to complete, which caused many Jenkins builds to time out.
For more background, see my comments on #6207 (the PR which introduced this test).
Author: Josh Rosen <joshrosen@databricks.com>
Closes #10425 from JoshRosen/SPARK-11823.
According to the benchmark [1], LZ4-java could be 80% (or 30%) faster than Snappy.
After changing the compressor to LZ4, I saw 20% improvement on end-to-end time for a TPCDS query (Q4).
[1] https://github.com/ning/jvm-compressor-benchmark/wiki
cc rxin
Author: Davies Liu <davies@databricks.com>
Closes #10342 from davies/lz4.
```
[info] ReplayListenerSuite:
[info] - Simple replay (58 milliseconds)
java.lang.NullPointerException
at org.apache.spark.deploy.master.Master$$anonfun$asyncRebuildSparkUI$1.applyOrElse(Master.scala:982)
at org.apache.spark.deploy.master.Master$$anonfun$asyncRebuildSparkUI$1.applyOrElse(Master.scala:980)
```
https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Test/job/Spark-Master-SBT/4316/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.2,label=spark-test/consoleFull
This was introduced in #10284. It's harmless because the NPE is caused by a race that occurs mainly in `local-cluster` tests (but doesn't actually fail the tests).
Tested locally to verify that the NPE is gone.
Author: Andrew Or <andrew@databricks.com>
Closes #10417 from andrewor14/fix-harmless-npe.
Updates made in SPARK-11206 missed an edge case which causes a NullPointerException when a task is killed. In some cases when a task ends in failure, taskMetrics is initialized as null (see JobProgressListener.onTaskEnd()). To address this, a null check was added. Before the changes in SPARK-11206, this null check was performed at the start of the updateTaskAccumulatorValues() function.
Author: Alex Bozarth <ajbozart@us.ibm.com>
Closes #10405 from ajbozarth/spark12339.
When multiple workers exist in a host, we can bypass unnecessary remote access for broadcasts; block managers fetch broadcast blocks from the same host instead of remote hosts.
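The ordering can be sketched in plain Python as follows. `BlockLocation` and `prefer_same_host` are hypothetical names used only for illustration, not Spark's own types. Locations on the local host are moved to the front, and the relative order within each group is kept.

```python
from collections import namedtuple

# Hypothetical stand-in for a block manager location.
BlockLocation = namedtuple("BlockLocation", ["executor_id", "host"])


def prefer_same_host(locations, local_host):
    """Put locations on the local host first so they are tried before remote ones."""
    same_host = [loc for loc in locations if loc.host == local_host]
    other_hosts = [loc for loc in locations if loc.host != local_host]
    return same_host + other_hosts


locations = [
    BlockLocation("1", "host-b"),
    BlockLocation("2", "host-a"),
    BlockLocation("3", "host-c"),
    BlockLocation("4", "host-a"),
]
ordered = prefer_same_host(locations, "host-a")
```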
Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>
Closes #10346 from maropu/OptimizeBlockLocationOrder.
Based on the suggestions from marmbrus, added logical/physical operators for Range to improve performance.
Also added another API to resolve the JIRA SPARK-12150.
Could you take a look at my implementation, marmbrus? If it is not good, I can rework it. : )
Thank you very much!
Author: gatorsmile <gatorsmile@gmail.com>
Closes #10335 from gatorsmile/rangeOperators.
When a DataFrame or Dataset has a long schema, we should intelligently truncate to avoid flooding the screen with unreadable information.
// Standard output
[a: int, b: int]
// Truncate many top level fields
[a: int, b: string ... 10 more fields]
// Truncate long inner structs
[a: struct<a: Int ... 10 more fields>]
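A minimal plain-Python sketch of the truncation for top-level fields, assuming a hypothetical helper `truncated_schema` and a hypothetical limit parameter `max_fields`; the real limit and the exact formatting rules may differ.

```python
def truncated_schema(fields, max_fields=2):
    """Render a field list, replacing the tail with a count once it gets too long."""
    if len(fields) <= max_fields:
        return "[" + ", ".join(fields) + "]"
    shown = fields[:max_fields]
    hidden = len(fields) - max_fields
    return "[" + ", ".join(shown) + f" ... {hidden} more fields]"


short_schema = truncated_schema(["a: int", "b: int"])

many_fields = ["a: int", "b: string"] + [f"c{i}: int" for i in range(10)]
long_schema = truncated_schema(many_fields)
```

The same helper could be applied recursively to the fields of an inner struct to produce the third form shown above.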
Author: Dilip Biswal <dbiswal@us.ibm.com>
Closes #10373 from dilipbiswal/spark-12398.
No JIRA was created since this is a trivial change.
davies, please help review it.
Author: Jeff Zhang <zjffdu@apache.org>
Closes #10143 from zjffdu/pyspark_typo.
Only load explainedVariance in PCAModel if it was written with Spark > 1.6.x
jkbradley is this kind of what you had in mind?
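One way to express such a version gate, sketched in plain Python. The helper name `has_explained_variance` is hypothetical, and the sketch assumes that "> 1.6.x" means a major.minor version strictly greater than 1.6; unparseable version strings are treated as not having the field.

```python
import re


def has_explained_variance(spark_version):
    """Return True only for versions whose major.minor is strictly above 1.6."""
    match = re.match(r"(\d+)\.(\d+)", spark_version)
    if match is None:
        # Unknown format: be conservative and do not try to load the field.
        return False
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) > (1, 6)


old_model = has_explained_variance("1.6.0")
new_model = has_explained_variance("2.0.0")
```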
Author: Sean Owen <sowen@cloudera.com>
Closes #10327 from srowen/SPARK-12349.
Added catch for casting Long to Int exception when PySpark ALS Ratings are serialized. It is easy to accidentally use Long IDs for user/product and before, it would fail with a somewhat cryptic "ClassCastException: java.lang.Long cannot be cast to java.lang.Integer." Now if this is done, a more descriptive error is shown, e.g. "PickleException: Ratings id 1205640308657491975 exceeds max integer value of 2147483647."
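The check can be sketched in plain Python as below. `check_rating_id` is a hypothetical helper, `ValueError` stands in for the PickleException mentioned above, and the sketch only covers the upper bound of the 32-bit integer range.

```python
MAX_INT = 2**31 - 1  # 2147483647, the largest 32-bit signed integer


def check_rating_id(rating_id):
    """Return the id unchanged, or fail with a descriptive error if it is too large."""
    if rating_id > MAX_INT:
        raise ValueError(
            f"Ratings id {rating_id} exceeds max integer value of {MAX_INT}"
        )
    return rating_id


valid_id = check_rating_id(42)
```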
Author: Bryan Cutler <bjcutler@us.ibm.com>
Closes #9361 from BryanCutler/als-pyspark-long-id-error-SPARK-10158.
The current default storage level of the Python persist API is MEMORY_ONLY_SER. This is different from the default level MEMORY_ONLY in the official documentation and the RDD APIs.
davies Is this inconsistency intentional? Thanks!
Updates: Since the data is always serialized on the Python side, the storage levels of JAVA-specific deserialization are not removed, such as MEMORY_ONLY.
Updates: Based on the reviewers' feedback. In Python, stored objects will always be serialized with the [Pickle](https://docs.python.org/2/library/pickle.html) library, so it does not matter whether you choose a serialized level. The available storage levels in Python include `MEMORY_ONLY`, `MEMORY_ONLY_2`, `MEMORY_AND_DISK`, `MEMORY_AND_DISK_2`, `DISK_ONLY`, `DISK_ONLY_2` and `OFF_HEAP`.
Author: gatorsmile <gatorsmile@gmail.com>
Closes #10092 from gatorsmile/persistStorageLevel.
It is usually an invalid location on the remote machine executing the job.
It is picked up by the Mesos support in cluster mode, and most of the time causes
the job to fail.
Fixes SPARK-12345
Author: Luc Bourlier <luc.bourlier@typesafe.com>
Closes #10329 from skyluc/issue/SPARK_HOME.