ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
mike	7d16776d28	[SPARK-21255][SQL][WIP] Fixed NPE when creating encoder for enum ## What changes were proposed in this pull request? Fixed NPE when creating encoder for enum. When you try to create an encoder for Enum type (or bean with enum property) via Encoders.bean(...), it fails with NullPointerException at TypeToken:495. I did a little research and it turns out, that in JavaTypeInference following code ``` def getJavaBeanReadableProperties(beanClass: Class[_]): Array[PropertyDescriptor] = { val beanInfo = Introspector.getBeanInfo(beanClass) beanInfo.getPropertyDescriptors.filterNot(_.getName == "class") .filter(_.getReadMethod != null) } ``` filters out properties named "class", because we wouldn't want to serialize that. But enum types have another property of type Class named "declaringClass", which we are trying to inspect recursively. Eventually we try to inspect ClassLoader class, which has property "defaultAssertionStatus" with no read method, which leads to NPE at TypeToken:495. I added property name "declaringClass" to filtering to resolve this. ## How was this patch tested? Unit test in JavaDatasetSuite which creates an encoder for enum Author: mike <mike0sv@gmail.com> Author: Mikhail Sveshnikov <mike0sv@gmail.com> Closes #18488 from mike0sv/enum-support.	2017-08-25 07:22:34 +01:00
Yuhao Yang	f3676d6391	[SPARK-21108][ML] convert LinearSVC to aggregator framework ## What changes were proposed in this pull request? convert LinearSVC to new aggregator framework ## How was this patch tested? existing unit test. Author: Yuhao Yang <yuhao.yang@intel.com> Closes #18315 from hhbyyh/svcAggregator.	2017-08-25 10:22:27 +08:00
Herman van Hovell	05af2de0fd	[SPARK-21830][SQL] Bump ANTLR version and fix a few issues. ## What changes were proposed in this pull request? This PR bumps the ANTLR version to 4.7, and fixes a number of small parser related issues uncovered by the bump. The main reason for upgrading is that in some cases the current version of ANTLR (4.5) can exhibit exponential slowdowns if it needs to parse boolean predicates. For example the following query will take forever to parse: ```sql SELECT * FROM RANGE(1000) WHERE TRUE AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' ``` This is caused by a know bug in ANTLR (https://github.com/antlr/antlr4/issues/994), which was fixed in version 4.6. ## How was this patch tested? Existing tests. Author: Herman van Hovell <hvanhovell@databricks.com> Closes #19042 from hvanhovell/SPARK-21830.	2017-08-24 16:33:55 -07:00
xu.zhang	763b83ee84	[SPARK-21701][CORE] Enable RPC client to use `SO_RCVBUF` and `SO_SNDBUF` in SparkConf. ## What changes were proposed in this pull request? TCP parameters like SO_RCVBUF and SO_SNDBUF can be set in SparkConf, and `org.apache.spark.network.server.TransportServer` can use those parameters to build the server by leveraging Netty. But for TransportClientFactory, there is no way to set those parameters from SparkConf. This can make the server and client sides inconsistent when people set these parameters in SparkConf. So this PR enables the RPC client to use those TCP parameters as well. ## How was this patch tested? Existing tests. Author: xu.zhang <xu.zhang@hulu.com> Closes #18964 from neoremind/add_client_param.	2017-08-24 14:27:52 -07:00
Shixiong Zhu	d3abb36990	[SPARK-21788][SS] Handle more exceptions when stopping a streaming query ## What changes were proposed in this pull request? Add more cases we should view as a normal query stop rather than a failure. ## How was this patch tested? The new unit tests. Author: Shixiong Zhu <zsxwing@gmail.com> Closes #18997 from zsxwing/SPARK-21788.	2017-08-24 10:23:59 -07:00
Wenchen Fan	2dd37d827f	[SPARK-21826][SQL] outer broadcast hash join should not throw NPE ## What changes were proposed in this pull request? This is a bug introduced by https://github.com/apache/spark/pull/11274/files#diff-7adb688cbfa583b5711801f196a074bbL274 . Non-equal join condition should only be applied when the equal-join condition matches. ## How was this patch tested? regression test Author: Wenchen Fan <wenchen@databricks.com> Closes #19036 from cloud-fan/bug.	2017-08-24 16:44:12 +02:00
Liang-Chi Hsieh	183d4cb71f	[SPARK-21759][SQL] In.checkInputDataTypes should not wrongly report unresolved plans for IN correlated subquery ## What changes were proposed in this pull request? With the check for structural integrity proposed in SPARK-21726, it is found that the optimization rule `PullupCorrelatedPredicates` can produce unresolved plans. For a correlated IN query looks like: SELECT t1.a FROM t1 WHERE t1.a IN (SELECT t2.c FROM t2 WHERE t1.b < t2.d); The query plan might look like: Project [a#0] +- Filter a#0 IN (list#4 [b#1]) : +- Project [c#2] : +- Filter (outer(b#1) < d#3) : +- LocalRelation <empty>, [c#2, d#3] +- LocalRelation <empty>, [a#0, b#1] After `PullupCorrelatedPredicates`, it produces query plan like: 'Project [a#0] +- 'Filter a#0 IN (list#4 [(b#1 < d#3)]) : +- Project [c#2, d#3] : +- LocalRelation <empty>, [c#2, d#3] +- LocalRelation <empty>, [a#0, b#1] Because the correlated predicate involves another attribute `d#3` in subquery, it has been pulled out and added into the `Project` on the top of the subquery. When `list` in `In` contains just one `ListQuery`, `In.checkInputDataTypes` checks if the size of `value` expressions matches the output size of subquery. In the above example, there is only `value` expression and the subquery output has two attributes `c#2, d#3`, so it fails the check and `In.resolved` returns `false`. We should not let `In.checkInputDataTypes` wrongly report unresolved plans to fail the structural integrity check. ## How was this patch tested? Added test. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #18968 from viirya/SPARK-21759.	2017-08-24 21:46:58 +08:00
Takuya UESHIN	9e33954ddf	[SPARK-21745][SQL] Refactor ColumnVector hierarchy to make ColumnVector read-only and to introduce WritableColumnVector. ## What changes were proposed in this pull request? This is a refactoring of `ColumnVector` hierarchy and related classes. 1. make `ColumnVector` read-only 2. introduce `WritableColumnVector` with write interface 3. remove `ReadOnlyColumnVector` ## How was this patch tested? Existing tests. Author: Takuya UESHIN <ueshin@databricks.com> Closes #18958 from ueshin/issues/SPARK-21745.	2017-08-24 21:13:44 +08:00
hyukjinkwon	dc5d34d8dc	[SPARK-19165][PYTHON][SQL] PySpark APIs using columns as arguments should validate input types for column ## What changes were proposed in this pull request? While preparing to take over https://github.com/apache/spark/pull/16537, I realised a (I think) better approach to make the exception handling in one point. This PR proposes to fix `_to_java_column` in `pyspark.sql.column`, which most of functions in `functions.py` and some other APIs use. This `_to_java_column` basically looks not working with other types than `pyspark.sql.column.Column` or string (`str` and `unicode`). If this is not `Column`, then it calls `_create_column_from_name` which calls `functions.col` within JVM: `42b9eda80e/sql/core/src/main/scala/org/apache/spark/sql/functions.scala (L76)` And it looks we only have `String` one with `col`. So, these should work: ```python >>> from pyspark.sql.column import _to_java_column, Column >>> _to_java_column("a") JavaObject id=o28 >>> _to_java_column(u"a") JavaObject id=o29 >>> _to_java_column(spark.range(1).id) JavaObject id=o33 ``` whereas these do not: ```python >>> _to_java_column(1) ``` ``` ... py4j.protocol.Py4JError: An error occurred while calling z:org.apache.spark.sql.functions.col. Trace: py4j.Py4JException: Method col([class java.lang.Integer]) does not exist ... ``` ```python >>> _to_java_column([]) ``` ``` ... py4j.protocol.Py4JError: An error occurred while calling z:org.apache.spark.sql.functions.col. Trace: py4j.Py4JException: Method col([class java.util.ArrayList]) does not exist ... ``` ```python >>> class A(): pass >>> _to_java_column(A()) ``` ``` ... AttributeError: 'A' object has no attribute '_get_object_id' ``` Meaning most of functions using `_to_java_column` such as `udf` or `to_json` or some other APIs throw an exception as below: ```python >>> from pyspark.sql.functions import udf >>> udf(lambda x: x)(None) ``` ``` ... 
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.sql.functions.col. : java.lang.NullPointerException ... ``` ```python >>> from pyspark.sql.functions import to_json >>> to_json(None) ``` ``` ... py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.sql.functions.col. : java.lang.NullPointerException ... ``` After this PR: ```python >>> from pyspark.sql.functions import udf >>> udf(lambda x: x)(None) ... ``` ``` TypeError: Invalid argument, not a string or column: None of type <type 'NoneType'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' functions. ``` ```python >>> from pyspark.sql.functions import to_json >>> to_json(None) ``` ``` ... TypeError: Invalid argument, not a string or column: None of type <type 'NoneType'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' functions. ``` ## How was this patch tested? Unit tests added in `python/pyspark/sql/tests.py` and manual tests. Author: hyukjinkwon <gurwls223@gmail.com> Author: zero323 <zero323@users.noreply.github.com> Closes #19027 from HyukjinKwon/SPARK-19165.	2017-08-24 20:29:03 +09:00
Jen-Ming Chung	95713eb4f2	[SPARK-21804][SQL] json_tuple returns null values within repeated columns except the first one ## What changes were proposed in this pull request? When json_tuple extracts values from JSON, it returns null for repeated columns except the first one, as shown below: ```scala scala> spark.sql("""SELECT json_tuple('{"a":1, "b":2}', 'a', 'b', 'a')""").show() +---+---+----+ | c0| c1| c2| +---+---+----+ | 1| 2|null| +---+---+----+ ``` I think this should be consistent with Hive's implementation: ``` hive> SELECT json_tuple('{"a": 1, "b": 2}', 'a', 'a'); ... 1 1 ``` In this PR, we locate all the matched indices in `fieldNames` instead of returning only the first matched index, i.e., indexOf. ## How was this patch tested? Added test in JsonExpressionsSuite. Author: Jen-Ming Chung <jenmingisme@gmail.com> Closes #19017 from jmchung/SPARK-21804.	2017-08-24 19:24:00 +09:00
lufei	846bc61cf5	[MINOR][SQL] The comment of Class ExchangeCoordinator exist a typing and context error ## What changes were proposed in this pull request? The example given in the comment of class ExchangeCoordinator has four post-shuffle partitions, but the current comment says “three”. ## How was this patch tested? Author: lufei <lu.fei80@zte.com.cn> Closes #19028 from figo77/SPARK-21816.	2017-08-24 10:07:27 +01:00
Susan X. Huynh	ce0d3bb377	[SPARK-21694][MESOS] Support Mesos CNI network labels JIRA ticket: https://issues.apache.org/jira/browse/SPARK-21694 ## What changes were proposed in this pull request? Spark already supports launching containers attached to a given CNI network by specifying it via the config `spark.mesos.network.name`. This PR adds support to pass in network labels to CNI plugins via a new config option `spark.mesos.network.labels`. These network labels are key-value pairs that are set in the `NetworkInfo` of both the driver and executor tasks. More details in the related Mesos documentation: http://mesos.apache.org/documentation/latest/cni/#mesos-meta-data-to-cni-plugins ## How was this patch tested? Unit tests, for both driver and executor tasks. Manual integration test to submit a job with the `spark.mesos.network.labels` option, hit the mesos/state.json endpoint, and check that the labels are set in the driver and executor tasks. ArtRand skonto Author: Susan X. Huynh <xhuynh@mesosphere.com> Closes #18910 from susanxhuynh/sh-mesos-cni-labels.	2017-08-24 10:05:38 +01:00
Felix Cheung	43cbfad999	[SPARK-21805][SPARKR] Disable R vignettes code on Windows ## What changes were proposed in this pull request? Code in vignettes requires winutils on Windows to run. When publishing to CRAN or building from source, winutils might not be available, so it's better to disable code execution (the resulting vignettes will not have output from the code, but the text and the code are still there). This fixes the warning * checking re-building of vignette outputs ... WARNING and the message > %LOCALAPPDATA% not found. Please define the environment variable or restart and enter an installation path in localDir. ## How was this patch tested? jenkins, appveyor, r-hub before: https://artifacts.r-hub.io/SparkR_2.2.0.tar.gz-49cecef3bb09db1db130db31604e0293/SparkR.Rcheck/00check.log after: https://artifacts.r-hub.io/SparkR_2.2.0.tar.gz-86a066c7576f46794930ad114e5cff7c/SparkR.Rcheck/00check.log Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #19016 from felixcheung/rvigwind.	2017-08-23 21:35:17 -07:00
10129659	b8aaef49fb	[SPARK-21807][SQL] Override ++ operation in ExpressionSet to reduce clone time ## What changes were proposed in this pull request? The getAliasedConstraints function in LogicalPlan.scala clones the expression set each time an element is added, which takes a long time. This PR adds a function that adds multiple elements at once to reduce the clone time. Before the change, the cost of getAliasedConstraints is: 100 expressions: 41 seconds 150 expressions: 466 seconds After the change, the cost of getAliasedConstraints is: 100 expressions: 1.8 seconds 150 expressions: 6.5 seconds The test is like this: test("getAliasedConstraints") { val expressionNum = 150 val aggExpression = (1 to expressionNum).map(i => Alias(Count(Literal(1)), s"cnt$i")()) val aggPlan = Aggregate(Nil, aggExpression, LocalRelation()) val beginTime = System.currentTimeMillis() val expressions = aggPlan.validConstraints println(s"validConstraints cost: ${System.currentTimeMillis() - beginTime}ms") // The size of Aliased expression is n * (n - 1) / 2 + n assert( expressions.size === expressionNum * (expressionNum - 1) / 2 + expressionNum) } ## How was this patch tested? Ran the newly added test. Author: 10129659 <chen.yanshan@zte.com.cn> Closes #19022 from eatoncys/getAliasedConstraints.	2017-08-23 20:35:08 -07:00
Takeshi Yamamuro	6942aeeb0a	[SPARK-21603][SQL][FOLLOW-UP] Change the default value of maxLinesPerFunction into 4000 ## What changes were proposed in this pull request? This pr changed the default value of `maxLinesPerFunction` into `4000`. In #18810, we had this new option to disable code generation for too long functions and I found this option only affected `Q17` and `Q66` in TPC-DS. But, `Q66` had some performance regression: ``` Q17 w/o #18810, 3224ms --> q17 w/#18810, 2627ms (improvement) Q66 w/o #18810, 1712ms --> q66 w/#18810, 3032ms (regression) ``` To keep the previous performance in TPC-DS, we better set higher value at `maxLinesPerFunction` by default. ## How was this patch tested? Existing tests. Author: Takeshi Yamamuro <yamamuro@apache.org> Closes #19021 from maropu/SPARK-21603-FOLLOWUP-1.	2017-08-23 12:02:24 -07:00
Sanket Chintapalli	1662e93119	[SPARK-21501] Change CacheLoader to limit entries based on memory footprint Right now the Spark shuffle service has a cache for index files. It is based on the # of files cached (spark.shuffle.service.index.cache.entries). This can cause issues if people have a lot of reducers, because the size of each entry can fluctuate based on the # of reducers. We saw an issue with a job that had 170000 reducers, and it caused the NM with the Spark shuffle service to use 700-800MB of memory in the NM by itself. We should change this cache to be memory based and only allow a certain amount of memory to be used. By memory based I mean the cache should have a limit of, say, 100MB. https://issues.apache.org/jira/browse/SPARK-21501 Manual testing with 170000 reducers has been performed with the cache loaded up to the max 100MB default limit, with each shuffle index file of size 1.3MB. Eviction takes place as soon as the total cache size reaches the 100MB limit, and the objects become ready for garbage collection, thereby preventing the NM from crashing. No notable difference in runtime has been observed. Author: Sanket Chintapalli <schintap@yahoo-inc.com> Closes #18940 from redsanket/SPARK-21501.	2017-08-23 11:51:11 -05:00
Weichen Xu	d6b30edd49	[SPARK-12664][ML] Expose probability in mlp model ## What changes were proposed in this pull request? Modify MLP model to inherit `ProbabilisticClassificationModel` and so that it can expose the probability column when transforming data. ## How was this patch tested? Test added. Author: WeichenXu <WeichenXu123@outlook.com> Closes #17373 from WeichenXu123/expose_probability_in_mlp_model.	2017-08-22 21:16:34 -07:00
Jane Wang	d58a3507ed	[SPARK-19326] Speculated task attempts do not get launched in few scenarios ## What changes were proposed in this pull request? Add a new listener event that is fired when a speculative task is created, and notify ExecutorAllocationManager of it so that it can request more executors. ## How was this patch tested? - Added unit tests. - For the test snippet in the jira: val n = 100 val someRDD = sc.parallelize(1 to n, n) someRDD.mapPartitionsWithIndex( (index: Int, it: Iterator[Int]) => { if (index == 1) { Thread.sleep(Long.MaxValue) // fake long running task(s) } it.toList.map(x => index + ", " + x).iterator }).collect With this code change, Spark indicates 101 jobs are running (99 succeeded, 2 running and 1 is a speculative job) Author: Jane Wang <janewang@fb.com> Closes #18492 from janewangfb/speculated_task_not_launched.	2017-08-23 11:31:54 +08:00
Yanbo Liang	3429619055	[ML][MINOR] Make sharedParams update. ## What changes were proposed in this pull request? ```sharedParams.scala``` is generated by ```SharedParamsCodeGen```, but it is not up to date in master. Someone may have updated ```sharedParams.scala``` manually; this PR fixes the issue. ## How was this patch tested? Offline check. Author: Yanbo Liang <ybliang8@gmail.com> Closes #19011 from yanboliang/sharedParams.	2017-08-23 11:06:53 +08:00
Jose Torres	3c0c2d09ca	[SPARK-21765] Set isStreaming on leaf nodes for streaming plans. ## What changes were proposed in this pull request? All streaming logical plans will now have isStreaming set. This involved adding isStreaming as a case class arg in a few cases, since a node might be logically streaming depending on where it came from. ## How was this patch tested? Existing unit tests - no functional change is intended in this PR. Author: Jose Torres <joseph-torres@databricks.com> Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #18973 from joseph-torres/SPARK-21765.	2017-08-22 19:07:43 -07:00
Bryan Cutler	41bb1ddc63	[SPARK-10931][ML][PYSPARK] PySpark Models Copy Param Values from Estimator ## What changes were proposed in this pull request? Added call to copy values of Params from Estimator to Model after fit in PySpark ML. This will copy values for any params that are also defined in the Model. Since currently most Models do not define the same params from the Estimator, also added method to create new Params from looking at the Java object if they do not exist in the Python object. This is a temporary fix that can be removed once the PySpark models properly define the params themselves. ## How was this patch tested? Refactored the `check_params` test to optionally check if the model params for Python and Java match and added this check to an existing fitted model that shares params between Estimator and Model. Author: Bryan Cutler <cutlerb@gmail.com> Closes #17849 from BryanCutler/pyspark-models-own-params-SPARK-10931.	2017-08-22 17:40:50 -07:00
Weichen Xu	d56c262109	[SPARK-21681][ML] fix bug of MLOR do not work correctly when featureStd contains zero ## What changes were proposed in this pull request? fix bug of MLOR do not work correctly when featureStd contains zero We can reproduce the bug through such dataset (features including zero variance), will generate wrong result (all coefficients becomes 0) ``` val multinomialDatasetWithZeroVar = { val nPoints = 100 val coefficients = Array( -0.57997, 0.912083, -0.371077, -0.16624, -0.84355, -0.048509) val xMean = Array(5.843, 3.0) val xVariance = Array(0.6856, 0.0) // including zero variance val testData = generateMultinomialLogisticInput( coefficients, xMean, xVariance, addIntercept = true, nPoints, seed) val df = sc.parallelize(testData, 4).toDF().withColumn("weight", lit(1.0)) df.cache() df } ``` ## How was this patch tested? testcase added. Author: WeichenXu <WeichenXu123@outlook.com> Closes #18896 from WeichenXu123/fix_mlor_stdvalue_zero_bug.	2017-08-22 16:55:34 -07:00
gatorsmile	01a8e46278	[SPARK-21769][SQL] Add a table-specific option for always respecting schemas inferred/controlled by Spark SQL ## What changes were proposed in this pull request? For Hive-serde tables, we always respect the schema stored in Hive metastore, because the schema could be altered by the other engines that share the same metastore. Thus, we always trust the metastore-controlled schema for Hive-serde tables when the schemas are different (without considering the nullability and cases). However, in some scenarios, Hive metastore also could INCORRECTLY overwrite the schemas when the serde and Hive metastore built-in serde are different. The proposed solution is to introduce a table-specific option for such scenarios. For a specific table, users can make Spark always respect Spark-inferred/controlled schema instead of trusting metastore-controlled schema. By default, we trust Hive metastore-controlled schema. ## How was this patch tested? Added a cross-version test case Author: gatorsmile <gatorsmile@gmail.com> Closes #19003 from gatorsmile/respectSparkSchema.	2017-08-22 13:12:59 -07:00
gatorsmile	43d71d9659	[SPARK-21499][SQL] Support creating persistent function for Spark UDAF(UserDefinedAggregateFunction) ## What changes were proposed in this pull request? This PR is to enable users to create persistent Scala UDAF (that extends UserDefinedAggregateFunction). ```SQL CREATE FUNCTION myDoubleAvg AS 'test.org.apache.spark.sql.MyDoubleAvg' ``` Before this PR, Spark UDAF only can be registered through the API `spark.udf.register(...)` ## How was this patch tested? Added test cases Author: gatorsmile <gatorsmile@gmail.com> Closes #18700 from gatorsmile/javaUDFinScala.	2017-08-22 13:01:35 -07:00
jerryshao	3ed1ae1005	[SPARK-20641][CORE] Add missing kvstore module in Laucher and SparkSubmit code There are two places in Launcher and SparkSubmit that explicitly list all the Spark submodules. The newly added kvstore module is missing from both, so this minor PR adds it. Author: jerryshao <sshao@hortonworks.com> Closes #19014 from jerryshao/missing-kvstore.	2017-08-22 10:14:45 -07:00
gatorsmile	be72b157ea	[SPARK-21803][TEST] Remove the HiveDDLCommandSuite ## What changes were proposed in this pull request? We do not have any Hive-specific parser. It does not make sense to keep a parser-specific test suite `HiveDDLCommandSuite.scala` in the Hive package. This PR is to remove it. ## How was this patch tested? N/A Author: gatorsmile <gatorsmile@gmail.com> Closes #19015 from gatorsmile/combineDDL.	2017-08-22 17:54:39 +08:00
Andrew Ray	5c9b301727	[SPARK-21584][SQL][SPARKR] Update R method for summary to call new implementation ## What changes were proposed in this pull request? SPARK-21100 introduced a new `summary` method to the Scala/Java Dataset API that included expanded statistics (vs `describe`) and control over which statistics to compute. Currently in the R API `summary` acts as an alias for `describe`. This patch updates the R API to call the new `summary` method in the JVM that includes additional statistics and ability to select which to compute. This does not break the current interface as the present `summary` method does not take additional arguments like `describe` and the output was never meant to be used programmatically. ## How was this patch tested? Modified and additional unit tests. Author: Andrew Ray <ray.andrew@gmail.com> Closes #18786 from aray/summary-r.	2017-08-21 23:08:27 -07:00
Kyle Kelley	751f513367	[SPARK-21070][PYSPARK] Attempt to update cloudpickle again ## What changes were proposed in this pull request? Based on https://github.com/apache/spark/pull/18282 by rgbkrk this PR attempts to update to the current released cloudpickle and minimize the difference between Spark cloudpickle and "stock" cloud pickle with the goal of eventually using the stock cloud pickle. Some notable changes: * Import submodules accessed by pickled functions (cloudpipe/cloudpickle#80) * Support recursive functions inside closures (cloudpipe/cloudpickle#89, cloudpipe/cloudpickle#90) * Fix ResourceWarnings and DeprecationWarnings (cloudpipe/cloudpickle#88) * Assume modules with __file__ attribute are not dynamic (cloudpipe/cloudpickle#85) * Make cloudpickle Python 3.6 compatible (cloudpipe/cloudpickle#72) * Allow pickling of builtin methods (cloudpipe/cloudpickle#57) * Add ability to pickle dynamically created modules (cloudpipe/cloudpickle#52) * Support method descriptor (cloudpipe/cloudpickle#46) * No more pickling of closed files, was broken on Python 3 (cloudpipe/cloudpickle#32) * Remove non-standard __transient__check (cloudpipe/cloudpickle#110) -- while we don't use this internally, and have no tests or documentation for its use, downstream code may use __transient__, although it has never been part of the API, if we merge this we should include a note about this in the release notes. * Support for pickling loggers (yay!) (cloudpipe/cloudpickle#96) * BUG: Fix crash when pickling dynamic class cycles. (cloudpipe/cloudpickle#102) ## How was this patch tested? Existing PySpark unit tests + the unit tests from the cloudpickle project on their own. Author: Holden Karau <holden@us.ibm.com> Author: Kyle Kelley <rgbkrk@gmail.com> Closes #18734 from holdenk/holden-rgbkrk-cloudpickle-upgrades.	2017-08-22 11:17:53 +09:00
Yanbo Liang	c108a5d30e	[SPARK-19762][ML][FOLLOWUP] Add necessary comments to L2Regularization. ## What changes were proposed in this pull request? MLlib ```LinearRegression/LogisticRegression/LinearSVC``` always standardize the data during training to improve the rate of convergence regardless of _standardization_ is true or false. If _standardization_ is false, we perform reverse standardization by penalizing each component differently to get effectively the same objective function when the training dataset is not standardized. We should keep these comments in the code to let developers understand how we handle it correctly. ## How was this patch tested? Existing tests, only adding some comments in code. Author: Yanbo Liang <ybliang8@gmail.com> Closes #18992 from yanboliang/SPARK-19762.	2017-08-22 08:43:18 +08:00
Marcelo Vanzin	84b5b16ea6	[SPARK-21617][SQL] Store correct table metadata when altering schema in Hive metastore. For Hive tables, the current "replace the schema" code is the correct path, except that an exception in that path should result in an error, and not in retrying in a different way. For data source tables, Spark may generate a non-compatible Hive table; but for that to work with Hive 2.1, the detection of data source tables needs to be fixed in the Hive client, to also consider the raw tables used by code such as `alterTableSchema`. Tested with existing and added unit tests (plus internal tests with a 2.1 metastore). Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #18849 from vanzin/SPARK-21617.	2017-08-21 15:09:02 -07:00
Yuming Wang	ba843292e3	[SPARK-21790][TESTS][FOLLOW-UP] Add filter pushdown verification back. ## What changes were proposed in this pull request? The previous PR (https://github.com/apache/spark/pull/19000) removed filter pushdown verification. This PR adds it back. ## How was this patch tested? Manual tests. Author: Yuming Wang <wgyumg@gmail.com> Closes #19002 from wangyum/SPARK-21790-follow-up.	2017-08-21 10:16:56 -07:00
Nick Pentreath	988b84d7ed	[SPARK-21468][PYSPARK][ML] Python API for FeatureHasher Add Python API for `FeatureHasher` transformer. ## How was this patch tested? New doc test. Author: Nick Pentreath <nickp@za.ibm.com> Closes #18970 from MLnick/SPARK-21468-pyspark-hasher.	2017-08-21 14:35:38 +02:00
Sean Owen	b3a07526fe	[SPARK-21718][SQL] Heavy log of type: "Skipping partition based on stats ..." ## What changes were proposed in this pull request? Reduce 'Skipping partitions' message to debug ## How was this patch tested? Existing tests Author: Sean Owen <sowen@cloudera.com> Closes #19010 from srowen/SPARK-21718.	2017-08-21 14:20:40 +02:00
Sergey Serebryakov	77d046ec47	[SPARK-21782][CORE] Repartition creates skews when numPartitions is a power of 2 ## Problem When an RDD (particularly with a low item-per-partition ratio) is repartitioned to numPartitions = power of 2, the resulting partitions are very uneven-sized, due to using fixed seed to initialize PRNG, and using the PRNG only once. See details in https://issues.apache.org/jira/browse/SPARK-21782 ## What changes were proposed in this pull request? Instead of directly using `0, 1, 2,...` seeds to initialize `Random`, hash them with `scala.util.hashing.byteswap32()`. ## How was this patch tested? `build/mvn -Dtest=none -DwildcardSuites=org.apache.spark.rdd.RDDSuite test` Author: Sergey Serebryakov <sserebryakov@tesla.com> Closes #18990 from megaserg/repartition-skew.	2017-08-21 08:21:25 +01:00
Liang-Chi Hsieh	28a6cca7df	[SPARK-21721][SQL][FOLLOWUP] Clear FileSystem deleteOnExit cache when paths are successfully removed ## What changes were proposed in this pull request? Fix a typo in test. ## How was this patch tested? Jenkins tests. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #19005 from viirya/SPARK-21721-followup.	2017-08-21 00:45:23 +08:00
hyukjinkwon	41e0eb71a6	[SPARK-21773][BUILD][DOCS] Installs mkdocs if missing in the path in SQL documentation build ## What changes were proposed in this pull request? This PR proposes to install `mkdocs` by `pip install` if missing in the path. Mainly to fix Jenkins's documentation build failure in `spark-master-docs`. See https://amplab.cs.berkeley.edu/jenkins/job/spark-master-docs/3580/console. It also adds `mkdocs` as requirements in `docs/README.md`. ## How was this patch tested? I manually ran `jekyll build` under `docs` directory after manually removing `mkdocs` via `pip uninstall mkdocs`. Also, tested this in the same way but on CentOS Linux release 7.3.1611 (Core) where I built Spark few times but never built documentation before and `mkdocs` is not installed. ``` ... Moving back into docs dir. Moving to SQL directory and building docs. Missing mkdocs in your path, trying to install mkdocs for SQL documentation generation. Collecting mkdocs Downloading mkdocs-0.16.3-py2.py3-none-any.whl (1.2MB) 100% \|████████████████████████████████\| 1.2MB 574kB/s Requirement already satisfied: PyYAML>=3.10 in /usr/lib64/python2.7/site-packages (from mkdocs) Collecting livereload>=2.5.1 (from mkdocs) Downloading livereload-2.5.1-py2-none-any.whl Collecting tornado>=4.1 (from mkdocs) Downloading tornado-4.5.1.tar.gz (483kB) 100% \|████████████████████████████████\| 491kB 1.4MB/s Collecting Markdown>=2.3.1 (from mkdocs) Downloading Markdown-2.6.9.tar.gz (271kB) 100% \|████████████████████████████████\| 276kB 2.4MB/s Collecting click>=3.3 (from mkdocs) Downloading click-6.7-py2.py3-none-any.whl (71kB) 100% \|████████████████████████████████\| 71kB 2.8MB/s Requirement already satisfied: Jinja2>=2.7.1 in /usr/lib/python2.7/site-packages (from mkdocs) Requirement already satisfied: six in /usr/lib/python2.7/site-packages (from livereload>=2.5.1->mkdocs) Requirement already satisfied: backports.ssl_match_hostname in /usr/lib/python2.7/site-packages (from tornado>=4.1->mkdocs) 
Collecting singledispatch (from tornado>=4.1->mkdocs) Downloading singledispatch-3.4.0.3-py2.py3-none-any.whl Collecting certifi (from tornado>=4.1->mkdocs) Downloading certifi-2017.7.27.1-py2.py3-none-any.whl (349kB) 100% \|████████████████████████████████\| 358kB 2.1MB/s Collecting backports_abc>=0.4 (from tornado>=4.1->mkdocs) Downloading backports_abc-0.5-py2.py3-none-any.whl Requirement already satisfied: MarkupSafe>=0.23 in /usr/lib/python2.7/site-packages (from Jinja2>=2.7.1->mkdocs) Building wheels for collected packages: tornado, Markdown Running setup.py bdist_wheel for tornado ... done Stored in directory: /root/.cache/pip/wheels/84/83/cd/6a04602633457269d161344755e6766d24307189b7a67ff4b7 Running setup.py bdist_wheel for Markdown ... done Stored in directory: /root/.cache/pip/wheels/bf/46/10/c93e17ae86ae3b3a919c7b39dad3b5ccf09aeb066419e5c1e5 Successfully built tornado Markdown Installing collected packages: singledispatch, certifi, backports-abc, tornado, livereload, Markdown, click, mkdocs Successfully installed Markdown-2.6.9 backports-abc-0.5 certifi-2017.7.27.1 click-6.7 livereload-2.5.1 mkdocs-0.16.3 singledispatch-3.4.0.3 tornado-4.5.1 Generating markdown files for SQL documentation. Generating HTML files for SQL documentation. INFO - Cleaning site directory INFO - Building documentation to directory: .../spark/sql/site Moving back into docs dir. Making directory api/sql cp -r ../sql/site/. api/sql Source: .../spark/docs Destination: .../spark/docs/_site Generating... done. Auto-regeneration: disabled. Use --watch to enable. ``` Author: hyukjinkwon <gurwls223@gmail.com> Closes #18984 from HyukjinKwon/sql-doc-mkdocs.	2017-08-20 19:48:04 +09:00
Cédric Pelvet	73e04ecc4f	[MINOR] Correct validateAndTransformSchema in GaussianMixture and AFTSurvivalRegression ## What changes were proposed in this pull request? The line SchemaUtils.appendColumn(schema, $(predictionCol), IntegerType) did not modify the variable schema, hence only the last line had any effect. A temporary variable is used to correctly append the two columns predictionCol and probabilityCol. ## How was this patch tested? Manually. Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Cédric Pelvet <cedric.pelvet@gmail.com> Closes #18980 from sharp-pixel/master.	2017-08-20 11:05:54 +01:00
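The bug described in the commit above (a call that returns a new schema while the caller keeps using the old one) can be illustrated with a small plain-Python sketch. This is not Spark code: `append_column`, `transform_buggy`, `transform_fixed`, the tuple-based schema, and the `"vector"` type label are all hypothetical stand-ins, chosen only to show why a temporary variable is needed when the append operation does not mutate its input.

```python
from typing import Tuple

# Hypothetical immutable schema: a tuple of (column name, type name) pairs.
Schema = Tuple[Tuple[str, str], ...]


def append_column(schema: Schema, name: str, dtype: str) -> Schema:
    """Return a NEW schema with the column appended; the input is not modified."""
    return schema + ((name, dtype),)


def transform_buggy(schema: Schema) -> Schema:
    # The result of the first call is discarded, so only the last append has an effect.
    append_column(schema, "prediction", "int")
    return append_column(schema, "probability", "vector")


def transform_fixed(schema: Schema) -> Schema:
    # Keep the intermediate schema in a temporary variable and append to it.
    with_prediction = append_column(schema, "prediction", "int")
    return append_column(with_prediction, "probability", "vector")


base: Schema = (("features", "vector"),)
buggy_names = [name for name, _ in transform_buggy(base)]
fixed_names = [name for name, _ in transform_fixed(base)]
```

In this sketch `buggy_names` lacks the `prediction` column while `fixed_names` contains both appended columns, which mirrors the behaviour the commit message describes.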
Yuming Wang	72b738d8dc	[SPARK-21790][TESTS] Fix Docker-based Integration Test errors. ## What changes were proposed in this pull request? [SPARK-17701](https://github.com/apache/spark/pull/18600/files#diff-b9f96d092fb3fea76bcf75e016799678L77) removed `metadata` function, this PR removed the Docker-based Integration module that has been relevant to `SparkPlan.metadata`. ## How was this patch tested? manual tests Author: Yuming Wang <wgyumg@gmail.com> Closes #19000 from wangyum/SPARK-21709.	2017-08-19 11:41:32 -07:00
Andrew Ray	10be01848e	[SPARK-21566][SQL][PYTHON] Python method for summary ## What changes were proposed in this pull request? Adds the recently added `summary` method to the python dataframe interface. ## How was this patch tested? Additional inline doctests. Author: Andrew Ray <ray.andrew@gmail.com> Closes #18762 from aray/summary-py.	2017-08-18 18:10:54 -07:00
Andrew Ash	a2db5c5761	[MINOR][TYPO] Fix typos: runnning and Excecutors ## What changes were proposed in this pull request? Fix typos ## How was this patch tested? Existing tests Author: Andrew Ash <andrew@andrewash.com> Closes #18996 from ash211/patch-2.	2017-08-18 13:43:42 -07:00
Wenchen Fan	7880909c45	[SPARK-21743][SQL][FOLLOW-UP] top-most limit should not cause memory leak ## What changes were proposed in this pull request? This is a follow-up of https://github.com/apache/spark/pull/18955 , to fix a bug that we break whole stage codegen for `Limit`. ## How was this patch tested? existing tests. Author: Wenchen Fan <wenchen@databricks.com> Closes #18993 from cloud-fan/bug.	2017-08-18 11:19:22 -07:00
Masha Basmanova	23ea898080	[SPARK-21213][SQL] Support collecting partition-level statistics: rowCount and sizeInBytes ## What changes were proposed in this pull request? Added support for ANALYZE TABLE [db_name].tablename PARTITION (partcol1[=val1], partcol2[=val2], ...) COMPUTE STATISTICS [NOSCAN] SQL command to calculate total number of rows and size in bytes for a subset of partitions. Calculated statistics are stored in Hive Metastore as user-defined properties attached to partition objects. Property names are the same as the ones used to store table-level statistics: spark.sql.statistics.totalSize and spark.sql.statistics.numRows. When partition specification contains all partition columns with values, the command collects statistics for a single partition that matches the specification. When some partition columns are missing or listed without their values, the command collects statistics for all partitions which match a subset of partition column values specified. For example, table t has 4 partitions with the following specs: * Partition1: (ds='2008-04-08', hr=11) * Partition2: (ds='2008-04-08', hr=12) * Partition3: (ds='2008-04-09', hr=11) * Partition4: (ds='2008-04-09', hr=12) 'ANALYZE TABLE t PARTITION (ds='2008-04-09', hr=11)' command will collect statistics only for partition 3. 'ANALYZE TABLE t PARTITION (ds='2008-04-09')' command will collect statistics for partitions 3 and 4. 'ANALYZE TABLE t PARTITION (ds, hr)' command will collect statistics for all four partitions. When the optional parameter NOSCAN is specified, the command doesn't count number of rows and only gathers size in bytes. The statistics gathered by ANALYZE TABLE command can be fetched using DESC EXTENDED [db_name.]tablename PARTITION command. ## How was this patch tested? Added tests. Author: Masha Basmanova <mbasmanova@fb.com> Closes #18421 from mbasmanova/mbasmanova-analyze-partition.	2017-08-18 09:54:39 -07:00
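The partition-matching rule in the commit above can be sketched in plain Python. This is only an illustration of the semantics stated in the message, not Spark's implementation: `matching_partitions` is a hypothetical helper, partitions are modelled as dictionaries of strings, and a column listed without a value is represented as `None`.

```python
from typing import Dict, List, Optional

Partition = Dict[str, str]
PartitionSpec = Dict[str, Optional[str]]


def matching_partitions(partitions: List[Partition], spec: PartitionSpec) -> List[Partition]:
    """Select partitions that agree with every column that has a value in the spec.

    Columns mapped to None (listed without a value) and columns missing
    from the spec place no constraint on the partition.
    """
    return [
        p for p in partitions
        if all(value is None or p.get(col) == value for col, value in spec.items())
    ]


# The four partitions of table t from the commit message.
partitions = [
    {"ds": "2008-04-08", "hr": "11"},
    {"ds": "2008-04-08", "hr": "12"},
    {"ds": "2008-04-09", "hr": "11"},
    {"ds": "2008-04-09", "hr": "12"},
]

only_p3 = matching_partitions(partitions, {"ds": "2008-04-09", "hr": "11"})
p3_and_p4 = matching_partitions(partitions, {"ds": "2008-04-09"})
all_four = matching_partitions(partitions, {"ds": None, "hr": None})
```

The three calls correspond to `PARTITION (ds='2008-04-09', hr=11)`, `PARTITION (ds='2008-04-09')`, and `PARTITION (ds, hr)` in the examples above.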
Reynold Xin	07a2b8738e	[SPARK-21778][SQL] Simpler Dataset.sample API in Scala / Java ## What changes were proposed in this pull request? Dataset.sample requires a boolean flag withReplacement as the first argument. However, most of the time users simply want to sample some records without replacement. This ticket introduces a new sample function that simply takes in the fraction and seed. ## How was this patch tested? Tested manually. Not sure yet if we should add a test case for just this wrapper ... Author: Reynold Xin <rxin@databricks.com> Closes #18988 from rxin/SPARK-21778.	2017-08-18 23:58:20 +09:00
donnyzone	310454be3b	[SPARK-21739][SQL] Cast expression should initialize timezoneId when it is called statically to convert something into TimestampType ## What changes were proposed in this pull request? https://issues.apache.org/jira/projects/SPARK/issues/SPARK-21739 This issue is caused by introducing TimeZoneAwareExpression. When the Cast expression converts something into TimestampType, it must be resolved with `timezoneId` set. In general, it is resolved in the LogicalPlan phase. However, there are still some places that use the Cast expression statically to convert datatypes without setting `timezoneId`. In such cases, `NoSuchElementException: None.get` will be thrown for TimestampType. This PR is proposed to fix the issue. We have checked the whole project and found two such usages (i.e., in `TableReader` and `HiveTableScanExec`). ## How was this patch tested? unit test Author: donnyzone <wellfengzhu@gmail.com> Closes #18960 from DonnyZone/spark-21739.	2017-08-17 22:37:32 -07:00
gatorsmile	2caaed970e	[SPARK-21767][TEST][SQL] Add Decimal Test For Avro in VersionSuite ## What changes were proposed in this pull request? Decimal is a logical type of AVRO. We need to ensure the support of Hive's AVRO serde works well in Spark ## How was this patch tested? N/A Author: gatorsmile <gatorsmile@gmail.com> Closes #18977 from gatorsmile/addAvroTest.	2017-08-17 16:33:39 -07:00
Jen-Ming Chung	7ab951885f	[SPARK-21677][SQL] json_tuple throws NullPointException when column is null as string type ## What changes were proposed in this pull request? ``` scala scala> Seq(("""{"Hyukjin": 224, "John": 1225}""")).toDS.selectExpr("json_tuple(value, trim(null))").show() ... java.lang.NullPointerException at ... ``` Currently the `null` field name will throw NullPointException. As a given field name null can't be matched with any field names in json, we just output null as its column value. This PR achieves it by returning a very unlikely column name `__NullFieldName` in evaluation of the field names. ## How was this patch tested? Added unit test. Author: Jen-Ming Chung <jenmingisme@gmail.com> Closes #18930 from jmchung/SPARK-21677.	2017-08-17 15:59:45 -07:00
ArtRand	bfdc361ede	[SPARK-16742] Mesos Kerberos Support ## What changes were proposed in this pull request? Add Kerberos Support to Mesos. This includes kinit and --keytab support, but does not include delegation token renewal. ## How was this patch tested? Manually against a Secure DC/OS Apache HDFS cluster. Author: ArtRand <arand@soe.ucsc.edu> Author: Michael Gummelt <mgummelt@mesosphere.io> Closes #18519 from mgummelt/SPARK-16742-kerberos.	2017-08-17 15:47:07 -07:00
Takeshi Yamamuro	6aad02d036	[SPARK-18394][SQL] Make an AttributeSet.toSeq output order consistent ## What changes were proposed in this pull request? This pr sorted output attributes on their name and exprId in `AttributeSet.toSeq` to make the order consistent. If the order is different, spark possibly generates different code and then misses cache in `CodeGenerator`, e.g., `GenerateColumnAccessor` generates code depending on an input attribute order. ## How was this patch tested? Added tests in `AttributeSetSuite` and manually checked if the cache worked well in the given query of the JIRA. Author: Takeshi Yamamuro <yamamuro@apache.org> Closes #18959 from maropu/SPARK-18394.	2017-08-17 22:47:14 +02:00
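The idea behind the commit above, producing a stable sequence from an unordered set by sorting on name and then expression ID, can be sketched in plain Python. The `Attribute` tuple and `to_seq` function here are hypothetical stand-ins for illustration, not the Spark classes.

```python
from typing import List, NamedTuple, Set


class Attribute(NamedTuple):
    """Hypothetical stand-in for an attribute: a name plus a unique expression ID."""
    name: str
    expr_id: int


def to_seq(attrs: Set[Attribute]) -> List[Attribute]:
    """Return the attributes in a deterministic order: by name, then by expr_id."""
    return sorted(attrs, key=lambda a: (a.name, a.expr_id))


attrs = {Attribute("b", 2), Attribute("a", 3), Attribute("a", 1)}
ordered = to_seq(attrs)
```

Because the output no longer depends on set iteration order, any text generated from `ordered` is identical across runs, which is what allows a cache keyed on generated code to hit.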
gatorsmile	ae9e424792	[SQL][MINOR][TEST] Set spark.unsafe.exceptionOnMemoryLeak to true ## What changes were proposed in this pull request? When running IntelliJ, we are unable to capture the exception of memory leak detection. > org.apache.spark.executor.Executor: Managed memory leak detected Explicitly setting `spark.unsafe.exceptionOnMemoryLeak` in SparkConf when building the SparkSession, instead of reading it from system properties. ## How was this patch tested? N/A Author: gatorsmile <gatorsmile@gmail.com> Closes #18967 from gatorsmile/setExceptionOnMemoryLeak.	2017-08-17 13:00:37 -07:00
Kent Yao	b83b502c41	[SPARK-21428] Turn IsolatedClientLoader off while using builtin Hive jars for reusing CliSessionState ## What changes were proposed in this pull request? Set isolated to false while using builtin hive jars and `SessionState.get` returns a `CliSessionState` instance. ## How was this patch tested? 1 Unit Tests 2 Manually verified: `hive.exec.scratchdir` was only created once because of reusing cliSessionState ```java ➜ spark git:(SPARK-21428) ✗ bin/spark-sql --conf spark.sql.hive.metastore.jars=builtin log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell). log4j:WARN Please initialize the log4j system properly. log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info. Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties 17/07/16 23:59:27 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 17/07/16 23:59:27 INFO HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore 17/07/16 23:59:27 INFO ObjectStore: ObjectStore, initialize called 17/07/16 23:59:28 INFO Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored 17/07/16 23:59:28 INFO Persistence: Property datanucleus.cache.level2 unknown - will be ignored 17/07/16 23:59:29 INFO ObjectStore: Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order" 17/07/16 23:59:30 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table. 17/07/16 23:59:30 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table. 
17/07/16 23:59:31 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table. 17/07/16 23:59:31 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table. 17/07/16 23:59:31 INFO MetaStoreDirectSql: Using direct SQL, underlying DB is DERBY 17/07/16 23:59:31 INFO ObjectStore: Initialized ObjectStore 17/07/16 23:59:31 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0 17/07/16 23:59:31 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException 17/07/16 23:59:32 INFO HiveMetaStore: Added admin role in metastore 17/07/16 23:59:32 INFO HiveMetaStore: Added public role in metastore 17/07/16 23:59:32 INFO HiveMetaStore: No user is added in admin role, since config is empty 17/07/16 23:59:32 INFO HiveMetaStore: 0: get_all_databases 17/07/16 23:59:32 INFO audit: ugi=Kent ip=unknown-ip-addr cmd=get_all_databases 17/07/16 23:59:32 INFO HiveMetaStore: 0: get_functions: db=default pat=* 17/07/16 23:59:32 INFO audit: ugi=Kent ip=unknown-ip-addr cmd=get_functions: db=default pat=* 17/07/16 23:59:32 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MResourceUri" is tagged as "embedded-only" so does not have its own datastore table. 
17/07/16 23:59:32 INFO SessionState: Created local directory: /var/folders/k2/04p4k4ws73l6711h_mz2_tq00000gn/T/beea7261-221a-4711-89e8-8b12a9d37370_resources 17/07/16 23:59:32 INFO SessionState: Created HDFS directory: /tmp/hive/Kent/beea7261-221a-4711-89e8-8b12a9d37370 17/07/16 23:59:32 INFO SessionState: Created local directory: /var/folders/k2/04p4k4ws73l6711h_mz2_tq00000gn/T/Kent/beea7261-221a-4711-89e8-8b12a9d37370 17/07/16 23:59:32 INFO SessionState: Created HDFS directory: /tmp/hive/Kent/beea7261-221a-4711-89e8-8b12a9d37370/_tmp_space.db 17/07/16 23:59:32 INFO SparkContext: Running Spark version 2.3.0-SNAPSHOT 17/07/16 23:59:32 INFO SparkContext: Submitted application: SparkSQL::10.0.0.8 17/07/16 23:59:32 INFO SecurityManager: Changing view acls to: Kent 17/07/16 23:59:32 INFO SecurityManager: Changing modify acls to: Kent 17/07/16 23:59:32 INFO SecurityManager: Changing view acls groups to: 17/07/16 23:59:32 INFO SecurityManager: Changing modify acls groups to: 17/07/16 23:59:32 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(Kent); groups with view permissions: Set(); users with modify permissions: Set(Kent); groups with modify permissions: Set() 17/07/16 23:59:33 INFO Utils: Successfully started service 'sparkDriver' on port 51889. 
17/07/16 23:59:33 INFO SparkEnv: Registering MapOutputTracker 17/07/16 23:59:33 INFO SparkEnv: Registering BlockManagerMaster 17/07/16 23:59:33 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information 17/07/16 23:59:33 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up 17/07/16 23:59:33 INFO DiskBlockManager: Created local directory at /private/var/folders/k2/04p4k4ws73l6711h_mz2_tq00000gn/T/blockmgr-9cfae28a-01e9-4c73-a1f1-f76fa52fc7a5 17/07/16 23:59:33 INFO MemoryStore: MemoryStore started with capacity 366.3 MB 17/07/16 23:59:33 INFO SparkEnv: Registering OutputCommitCoordinator 17/07/16 23:59:33 INFO Utils: Successfully started service 'SparkUI' on port 4040. 17/07/16 23:59:33 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://10.0.0.8:4040 17/07/16 23:59:33 INFO Executor: Starting executor ID driver on host localhost 17/07/16 23:59:33 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 51890. 17/07/16 23:59:33 INFO NettyBlockTransferService: Server created on 10.0.0.8:51890 17/07/16 23:59:33 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy 17/07/16 23:59:33 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 10.0.0.8, 51890, None) 17/07/16 23:59:33 INFO BlockManagerMasterEndpoint: Registering block manager 10.0.0.8:51890 with 366.3 MB RAM, BlockManagerId(driver, 10.0.0.8, 51890, None) 17/07/16 23:59:33 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 10.0.0.8, 51890, None) 17/07/16 23:59:33 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 10.0.0.8, 51890, None) 17/07/16 23:59:34 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/Users/Kent/Documents/spark/spark-warehouse'). 
17/07/16 23:59:34 INFO SharedState: Warehouse path is 'file:/Users/Kent/Documents/spark/spark-warehouse'. 17/07/16 23:59:34 INFO HiveUtils: Initializing HiveMetastoreConnection version 1.2.1 using Spark classes. 17/07/16 23:59:34 INFO HiveClientImpl: Warehouse location for Hive client (version 1.2.2) is /user/hive/warehouse 17/07/16 23:59:34 INFO HiveMetaStore: 0: get_database: default 17/07/16 23:59:34 INFO audit: ugi=Kent ip=unknown-ip-addr cmd=get_database: default 17/07/16 23:59:34 INFO HiveClientImpl: Warehouse location for Hive client (version 1.2.2) is /user/hive/warehouse 17/07/16 23:59:34 INFO HiveMetaStore: 0: get_database: global_temp 17/07/16 23:59:34 INFO audit: ugi=Kent ip=unknown-ip-addr cmd=get_database: global_temp 17/07/16 23:59:34 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException 17/07/16 23:59:34 INFO HiveClientImpl: Warehouse location for Hive client (version 1.2.2) is /user/hive/warehouse 17/07/16 23:59:34 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint spark-sql> ``` cc cloud-fan gatorsmile Author: Kent Yao <yaooqinn@hotmail.com> Author: hzyaoqin <hzyaoqin@corp.netease.com> Closes #18648 from yaooqinn/SPARK-21428.	2017-08-18 00:24:45 +08:00


20386 commits