[MINOR][DOC][SQL][CORE] Fix typo in document and comments
### What changes were proposed in this pull request?

Fixed typos in the `docs` directory and in other directories:
1. Found typos in `docs` and applied the fixes to files in all directories
2. Fixed `the the` -> `the`

### Why are the changes needed?

Better readability of documents.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

No test needed.

Closes #26976 from kiszk/typo_20191221.

Authored-by: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
This commit is contained in:
parent: 8384ff4c9d
commit: f31d9a629b
```diff
@@ -42,7 +42,7 @@ private[spark] trait Clock {
  *
  * TL;DR: on modern (2.6.32+) Linux kernels with modern (AMD K8+) CPUs, the values returned by
  * `System.nanoTime()` are consistent across CPU cores *and* packages, and provide always
- * increasing values (although it may not be completely monotonic when the the system clock is
+ * increasing values (although it may not be completely monotonic when the system clock is
  * adjusted by NTP daemons using time slew).
  */
 // scalastyle:on line.size.limit
```
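The comment fixed above is the reason Spark-style timing code prefers `System.nanoTime()` over wall-clock time. As a hedged illustration (the `ElapsedTime` helper below is invented for this note, not code from this commit):

```scala
// Illustrative only: elapsed-time measurement via System.nanoTime(), which keeps
// increasing even while an NTP daemon slews the wall clock.
object ElapsedTime {
  def timed[T](body: => T): (T, Long) = {
    val start = System.nanoTime()
    val result = body
    val elapsedMs = (System.nanoTime() - start) / 1000000L
    (result, elapsedMs)
  }

  def main(args: Array[String]): Unit = {
    val (_, ms) = timed(Thread.sleep(100))
    println(s"slept for ~$ms ms")
  }
}
```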
```diff
@@ -83,7 +83,7 @@ class SparkListenerSuite extends SparkFunSuite with LocalSparkContext with Match
     (1 to 5).foreach { _ => bus.post(SparkListenerJobEnd(0, jobCompletionTime, JobSucceeded)) }

     // Five messages should be marked as received and queued, but no messages should be posted to
-    // listeners yet because the the listener bus hasn't been started.
+    // listeners yet because the listener bus hasn't been started.
     assert(bus.metrics.numEventsPosted.getCount === 5)
     assert(bus.queuedEvents.size === 5)
```
```diff
@@ -206,7 +206,7 @@ class SparkListenerSuite extends SparkFunSuite with LocalSparkContext with Match
     assert(sharedQueueSize(bus) === 1)
     assert(numDroppedEvents(bus) === 1)

-    // Allow the the remaining events to be processed so we can stop the listener bus:
+    // Allow the remaining events to be processed so we can stop the listener bus:
     listenerWait.release(2)
     bus.stop()
   }
```
```diff
@@ -436,7 +436,7 @@ class ExternalAppendOnlyMapSuite extends SparkFunSuite
     val it = map.iterator
     assert(it.isInstanceOf[CompletionIterator[_, _]])
     // org.apache.spark.util.collection.AppendOnlyMap.destructiveSortedIterator returns
-    // an instance of an annonymous Iterator class.
+    // an instance of an anonymous Iterator class.

     val underlyingMapRef = WeakReference(map.currentMap)

```
```diff
@@ -233,5 +233,5 @@
       url: sql-ref-functions-udf-scalar.html
     - text: Aggregate functions
       url: sql-ref-functions-udf-aggregate.html
-    - text: Arthmetic operations
+    - text: Arithmetic operations
       url: sql-ref-arithmetic-ops.html
```
```diff
@@ -2423,7 +2423,7 @@ showDF(properties, numRows = 200, truncate = FALSE)
     Interval at which data received by Spark Streaming receivers is chunked
     into blocks of data before storing them in Spark. Minimum recommended - 50 ms. See the
     <a href="streaming-programming-guide.html#level-of-parallelism-in-data-receiving">performance
-    tuning</a> section in the Spark Streaming programing guide for more details.
+    tuning</a> section in the Spark Streaming programming guide for more details.
   </td>
 </tr>
 <tr>
```
```diff
@@ -2434,7 +2434,7 @@ showDF(properties, numRows = 200, truncate = FALSE)
     Effectively, each stream will consume at most this number of records per second.
     Setting this configuration to 0 or a negative number will put no limit on the rate.
     See the <a href="streaming-programming-guide.html#deploying-applications">deployment guide</a>
-    in the Spark Streaming programing guide for mode details.
+    in the Spark Streaming programming guide for mode details.
   </td>
 </tr>
 <tr>
```
```diff
@@ -2444,7 +2444,7 @@ showDF(properties, numRows = 200, truncate = FALSE)
     Enable write-ahead logs for receivers. All the input data received through receivers
     will be saved to write-ahead logs that will allow it to be recovered after driver failures.
     See the <a href="streaming-programming-guide.html#deploying-applications">deployment guide</a>
-    in the Spark Streaming programing guide for more details.
+    in the Spark Streaming programming guide for more details.
   </td>
 </tr>
 <tr>
```
```diff
@@ -670,7 +670,7 @@ others.
     <tr>
       <td>Gamma</td>
       <td>Continuous</td>
-      <td>Inverse*, Idenity, Log</td>
+      <td>Inverse*, Identity, Log</td>
     </tr>
     <tr>
       <td>Tweedie</td>
```
```diff
@@ -254,7 +254,7 @@ Deprecations in the `spark.mllib` and `spark.ml` packages include:
   We move all functionality in overridden methods to the corresponding `transformSchema`.
 * [SPARK-14829](https://issues.apache.org/jira/browse/SPARK-14829):
   In `spark.mllib` package, `LinearRegressionWithSGD`, `LassoWithSGD`, `RidgeRegressionWithSGD` and `LogisticRegressionWithSGD` have been deprecated.
-  We encourage users to use `spark.ml.regression.LinearRegresson` and `spark.ml.classification.LogisticRegresson`.
+  We encourage users to use `spark.ml.regression.LinearRegression` and `spark.ml.classification.LogisticRegression`.
 * [SPARK-14900](https://issues.apache.org/jira/browse/SPARK-14900):
   In `spark.mllib.evaluation.MulticlassMetrics`, the parameters `precision`, `recall` and `fMeasure` have been deprecated in favor of `accuracy`.
 * [SPARK-15644](https://issues.apache.org/jira/browse/SPARK-15644):
```
```diff
@@ -266,12 +266,12 @@ Deprecations in the `spark.mllib` and `spark.ml` packages include:
 Changes of behavior in the `spark.mllib` and `spark.ml` packages include:

 * [SPARK-7780](https://issues.apache.org/jira/browse/SPARK-7780):
-  `spark.mllib.classification.LogisticRegressionWithLBFGS` directly calls `spark.ml.classification.LogisticRegresson` for binary classification now.
+  `spark.mllib.classification.LogisticRegressionWithLBFGS` directly calls `spark.ml.classification.LogisticRegression` for binary classification now.
   This will introduce the following behavior changes for `spark.mllib.classification.LogisticRegressionWithLBFGS`:
     * The intercept will not be regularized when training binary classification model with L1/L2 Updater.
     * If users set without regularization, training with or without feature scaling will return the same solution by the same convergence rate.
 * [SPARK-13429](https://issues.apache.org/jira/browse/SPARK-13429):
-  In order to provide better and consistent result with `spark.ml.classification.LogisticRegresson`,
+  In order to provide better and consistent result with `spark.ml.classification.LogisticRegression`,
   the default value of `spark.mllib.classification.LogisticRegressionWithLBFGS`: `convergenceTol` has been changed from 1E-4 to 1E-6.
 * [SPARK-12363](https://issues.apache.org/jira/browse/SPARK-12363):
   Fix a bug of `PowerIterationClustering` which will likely change its result.
```
```diff
@@ -640,7 +640,7 @@ A list of the available metrics, with a short description:

 ### Executor Metrics

-Executor-level metrics are sent from each executor to the driver as part of the Heartbeat to describe the performance metrics of Executor itself like JVM heap memory, GC infomation. Metrics `peakExecutorMetrics.*` are only enabled if `spark.eventLog.logStageExecutorMetrics.enabled` is true.
+Executor-level metrics are sent from each executor to the driver as part of the Heartbeat to describe the performance metrics of Executor itself like JVM heap memory, GC information. Metrics `peakExecutorMetrics.*` are only enabled if `spark.eventLog.logStageExecutorMetrics.enabled` is true.
 A list of the available metrics, with a short description:

 <table class="table">
```
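The gating flag named in the corrected line is ordinary Spark configuration. A minimal sketch, assuming the standard `SparkConf` API (the app name is arbitrary):

```scala
import org.apache.spark.SparkConf

object ExecutorMetricsConf {
  // spark.eventLog.logStageExecutorMetrics.enabled is the flag referenced in the
  // monitoring doc above; event logging itself must also be on for it to matter.
  val conf: SparkConf = new SparkConf()
    .setAppName("executor-metrics-demo")
    .set("spark.eventLog.enabled", "true")
    .set("spark.eventLog.logStageExecutorMetrics.enabled", "true")
}
```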
```diff
@@ -245,7 +245,7 @@ Data source options of Avro can be set via:
     <td>None</td>
     <td>Optional Avro schema (in JSON format) that was used to serialize the data. This should be set if the schema provided
       for deserialization is compatible with - but not the same as - the one used to originally convert the data to Avro.
-      For more information on Avro's schema evolution and compatability, please refer to the [documentation of Confluent](https://docs.confluent.io/current/schema-registry/avro.html).
+      For more information on Avro's schema evolution and compatibility, please refer to the [documentation of Confluent](https://docs.confluent.io/current/schema-registry/avro.html).
     </td>
     <td>function <code>from_avro</code></td>
   </tr>
```
```diff
@@ -220,7 +220,7 @@ license: |

 - Since Spark 3.0, when casting interval values to string type, there is no "interval" prefix, e.g. `1 days 2 hours`. In Spark version 2.4 and earlier, the string contains the "interval" prefix like `interval 1 days 2 hours`.

-- Since Spark 3.0, when casting string value to integral types(tinyint, smallint, int and bigint), datetime types(date, timestamp and interval) and boolean type, the leading and trailing whitespaces(<= ACSII 32) will be trimmed before converted to these type values, e.g. `cast(' 1\t' as int)` results `1`, `cast(' 1\t' as boolean)` results `true`, `cast('2019-10-10\t as date)` results the date value `2019-10-10`. In Spark version 2.4 and earlier, while casting string to integrals and booleans, it will not trim the whitespaces from both ends, the foregoing results will be `null`, while to datetimes, only the trailing spaces(= ASCII 32) will be removed.
+- Since Spark 3.0, when casting string value to integral types(tinyint, smallint, int and bigint), datetime types(date, timestamp and interval) and boolean type, the leading and trailing whitespaces (<= ASCII 32) will be trimmed before converted to these type values, e.g. `cast(' 1\t' as int)` results `1`, `cast(' 1\t' as boolean)` results `true`, `cast('2019-10-10\t as date)` results the date value `2019-10-10`. In Spark version 2.4 and earlier, while casting string to integrals and booleans, it will not trim the whitespaces from both ends, the foregoing results will be `null`, while to datetimes, only the trailing spaces (= ASCII 32) will be removed.

 - Since Spark 3.0, numbers written in scientific notation(e.g. `1E2`) would be parsed as Double. In Spark version 2.4 and earlier, they're parsed as Decimal. To restore the behavior before Spark 3.0, you can set `spark.sql.legacy.exponentLiteralAsDecimal.enabled` to `true`.
```
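The trimming behavior described in the corrected migration note can be verified directly. A minimal sketch, assuming a local `SparkSession` (names are illustrative):

```scala
import org.apache.spark.sql.SparkSession

object CastTrimCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("cast-trim").getOrCreate()
    // Per the note above, Spark 3.0 trims leading/trailing whitespace (<= ASCII 32)
    // before casting, so both columns come back non-null ("\t" is a real tab here).
    spark.sql("SELECT cast(' 1\t' as int) AS as_int, cast(' 1\t' as boolean) AS as_bool").show()
    spark.stop()
  }
}
```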
```diff
@@ -255,7 +255,7 @@ different than a Pandas timestamp. It is recommended to use Pandas time series f
 working with timestamps in `pandas_udf`s to get the best performance, see
 [here](https://pandas.pydata.org/pandas-docs/stable/timeseries.html) for details.

-### Compatibiliy Setting for PyArrow >= 0.15.0 and Spark 2.3.x, 2.4.x
+### Compatibility Setting for PyArrow >= 0.15.0 and Spark 2.3.x, 2.4.x

 Since Arrow 0.15.0, a change in the binary IPC format requires an environment variable to be
 compatible with previous versions of Arrow <= 0.14.1. This is only necessary to do for PySpark
```
```diff
@@ -25,14 +25,14 @@ A column is associated with a data type and represents
 a specific attribute of an entity (for example, `age` is a column of an
 entity called `person`). Sometimes, the value of a column
 specific to a row is not known at the time the row comes into existence.
-In `SQL`, such values are represnted as `NULL`. This section details the
+In `SQL`, such values are represented as `NULL`. This section details the
 semantics of `NULL` values handling in various operators, expressions and
 other `SQL` constructs.

 1. [Null handling in comparison operators](#comp-operators)
 2. [Null handling in Logical operators](#logical-operators)
 3. [Null handling in Expressions](#expressions)
-  1. [Null handling in null-in-tolerant expressions](#null-in-tolerant)
+  1. [Null handling in null-intolerant expressions](#null-intolerant)
   2. [Null handling Expressions that can process null value operands](#can-process-null)
   3. [Null handling in built-in aggregate expressions](#built-in-aggregate)
 4. [Null handling in WHERE, HAVING and JOIN conditions](#condition-expressions)
```
```diff
@@ -61,10 +61,10 @@ the `age` column and this table will be used in various examples in the sections
 <tr><td>700</td><td>Dan</td><td>50</td></tr>
 </table>

-### Comparision operators <a name="comp-operators"></a>
+### Comparison operators <a name="comp-operators"></a>

 Apache spark supports the standard comparison operators such as '>', '>=', '=', '<' and '<='.
-The result of these operators is unknown or `NULL` when one of the operarands or both the operands are
+The result of these operators is unknown or `NULL` when one of the operands or both the operands are
 unknown or `NULL`. In order to compare the `NULL` values for equality, Spark provides a null-safe
 equal operator ('<=>'), which returns `False` when one of the operand is `NULL` and returns 'True` when
 both the operands are `NULL`. The following table illustrates the behaviour of comparison operators when
```
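The null-safe equality semantics described in this hunk are easy to check. A minimal sketch, assuming a local `SparkSession`:

```scala
import org.apache.spark.sql.SparkSession

object NullSafeEqualCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("null-safe-eq").getOrCreate()
    // Plain `=` yields NULL when either operand is NULL; `<=>` yields false for
    // one NULL operand and true when both operands are NULL, as the doc states.
    spark.sql("SELECT 5 = NULL AS eq, 5 <=> NULL AS one_null, NULL <=> NULL AS both_null").show()
    spark.stop()
  }
}
```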
```diff
@@ -152,7 +152,7 @@ SELECT NULL <=> NULL;
 Spark supports standard logical operators such as `AND`, `OR` and `NOT`. These operators take `Boolean` expressions
 as the arguments and return a `Boolean` value.

-The following tables illustrate the behavior of logical opeators when one or both operands are `NULL`.
+The following tables illustrate the behavior of logical operators when one or both operands are `NULL`.

 <table class="tsclass" border="1">
   <tr>
```
```diff
@@ -236,12 +236,12 @@ The comparison operators and logical operators are treated as expressions in
 Spark. Other than these two kinds of expressions, Spark supports other form of
 expressions such as function expressions, cast expressions, etc. The expressions
 in Spark can be broadly classified as :
-- Null in-tolerent expressions
+- Null intolerant expressions
 - Expressions that can process `NULL` value operands
   - The result of these expressions depends on the expression itself.

-#### Null in-tolerant expressions <a name="null-in-tolerant"></a>
-Null in-tolerant expressions return `NULL` when one or more arguments of
+#### Null intolerant expressions <a name="null-intolerant"></a>
+Null intolerant expressions return `NULL` when one or more arguments of
 expression are `NULL` and most of the expressions fall in this category.

 ##### Examples
```
```diff
@@ -297,7 +297,7 @@ SELECT isnull(null) AS expression_output;
 |true             |
 +-----------------+

--- Returns the first occurence of non `NULL` value.
+-- Returns the first occurrence of non `NULL` value.
 SELECT coalesce(null, null, 3, null) AS expression_output;
 +-----------------+
 |expression_output|
```
```diff
@@ -460,7 +460,7 @@ WHERE p1.age <=> p2.age
 {% endhighlight %}

 ### Aggregate operator (GROUP BY, DISTINCT) <a name="aggregate-operator"></a>
-As discussed in the previous section [comparison operator](sql-ref-null-semantics.html#comparision-operators),
+As discussed in the previous section [comparison operator](sql-ref-null-semantics.html#comparison-operators),
 two `NULL` values are not equal. However, for the purpose of grouping and distinct processing, the two or more
 values with `NULL data`are grouped together into the same bucket. This behaviour is conformant with SQL
 standard and with other enterprise database management systems.
```
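For the grouping behavior this hunk describes (all `NULL` keys fall into one bucket), a minimal sketch assuming a local `SparkSession`:

```scala
import org.apache.spark.sql.SparkSession

object NullGroupingCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("null-grouping").getOrCreate()
    import spark.implicits._
    val df = Seq(Some(50), None, None, Some(30)).toDF("age")
    // The two NULL ages are grouped into a single bucket with count 2.
    df.groupBy("age").count().show()
    spark.stop()
  }
}
```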
```diff
@@ -120,7 +120,7 @@ ALTER TABLE table_identifier [ partition_spec ] SET SERDE serde_class_name

 #### SET LOCATION And SET FILE FORMAT
 `ALTER TABLE SET` command can also be used for changing the file location and file format for
-exsisting tables.
+existing tables.

 ##### Syntax
 {% highlight sql %}
```
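As a hedged companion to the `SET LOCATION` section being fixed here (table name and path below are hypothetical):

```scala
import org.apache.spark.sql.SparkSession

object AlterTableLocationDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("alter-table").getOrCreate()
    spark.sql("CREATE TABLE tbl (id INT, name STRING) USING parquet")
    // Point the existing table at a new file location.
    spark.sql("ALTER TABLE tbl SET LOCATION '/tmp/new/tbl/location'")
    spark.stop()
  }
}
```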
```diff
@@ -55,7 +55,7 @@ INSERT INTO [ TABLE ] table_identifier [ partition_spec ]

 <dl>
   <dt><code><em>VALUES ( { value | NULL } [ , ... ] ) [ , ( ... ) ]</em></code></dt>
-  <dd>Specifies the values to be inserted. Either an explicitly specified value or a NULL can be inserted. A comma must be used to seperate each value in the clause. More than one set of values can be specified to insert multiple rows.</dd>
+  <dd>Specifies the values to be inserted. Either an explicitly specified value or a NULL can be inserted. A comma must be used to separate each value in the clause. More than one set of values can be specified to insert multiple rows.</dd>
 </dl>

 <dl>
```
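A hedged sketch of the `VALUES` clause described above, with a hypothetical table, multiple rows, and an explicit `NULL`:

```scala
import org.apache.spark.sql.SparkSession

object InsertValuesDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("insert-values").getOrCreate()
    spark.sql("CREATE TABLE students (name STRING, age INT) USING parquet")
    // Commas separate values within a row; extra parenthesized rows insert multiple rows.
    spark.sql("INSERT INTO students VALUES ('Alice', 23), ('Bob', NULL)")
    spark.sql("SELECT * FROM students").show()
    spark.stop()
  }
}
```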
```diff
@@ -233,7 +233,7 @@ For data stores that support transactions, saving offsets in the same transactio
 {% highlight scala %}
 // The details depend on your data store, but the general idea looks like this

-// begin from the the offsets committed to the database
+// begin from the offsets committed to the database
 val fromOffsets = selectOffsetsFromYourDatabase.map { resultSet =>
   new TopicPartition(resultSet.string("topic"), resultSet.int("partition")) -> resultSet.long("offset")
 }.toMap
```
```diff
@@ -263,7 +263,7 @@ stream.foreachRDD { rdd =>
 {% highlight java %}
 // The details depend on your data store, but the general idea looks like this

-// begin from the the offsets committed to the database
+// begin from the offsets committed to the database
 Map<TopicPartition, Long> fromOffsets = new HashMap<>();
 for (resultSet : selectOffsetsFromYourDatabase)
   fromOffsets.put(new TopicPartition(resultSet.string("topic"), resultSet.int("partition")), resultSet.long("offset"));
```
```diff
@@ -405,7 +405,7 @@ The following configurations are optional:
   </td>
   <td>latest</td>
   <td>batch query</td>
-  <td>The end point when a batch query is ended, a json string specifying an ending timesamp for each TopicPartition.
+  <td>The end point when a batch query is ended, a json string specifying an ending timestamp for each TopicPartition.
   The returned offset for each partition is the earliest offset whose timestamp is greater than or equal to
   the given timestamp in the corresponding partition. If the matched offset doesn't exist, the offset will
   be set to latest.<p/>
```
```diff
@@ -444,7 +444,7 @@ The third section has the SQL statistics of the submitted operations.
 * _Canceled_, final state when the execution is canceled.
 * _Finished_ processing and waiting to fetch results.
 * _Closed_, final state when client closed the statement.
-* **Detail** of the execution plan with parsed logical plan, analyzed logical plan, optimized logical plan and physical plan or errors in the the SQL statement.
+* **Detail** of the execution plan with parsed logical plan, analyzed logical plan, optimized logical plan and physical plan or errors in the SQL statement.

 <p style="text-align: center;">
   <img src="img/JDBCServer3.png" title="JDBC/ODBC SQL Statistics" alt="JDBC/ODBC SQL Statistics">
```
```diff
@@ -23,7 +23,7 @@ import org.apache.spark.sql.util.CaseInsensitiveStringMap


 /**
- * Class to calculate offset ranges to process based on the the from and until offsets, and
+ * Class to calculate offset ranges to process based on the from and until offsets, and
  * the configured `minPartitions`.
  */
 private[kafka010] class KafkaOffsetRangeCalculator(val minPartitions: Option[Int]) {
```
```diff
@@ -148,7 +148,7 @@ private[spark] object DecisionTreeMetadata extends Logging {
     require(maxCategoriesPerFeature <= maxPossibleBins,
       s"DecisionTree requires maxBins (= $maxPossibleBins) to be at least as large as the " +
       s"number of values in each categorical feature, but categorical feature $maxCategory " +
-      s"has $maxCategoriesPerFeature values. Considering remove this and other categorical " +
+      s"has $maxCategoriesPerFeature values. Consider removing this and other categorical " +
       "features with a large number of values, or add more training examples.")
   }

```
```diff
@@ -132,7 +132,7 @@ object InterpretedUnsafeProjection {
       dt: DataType,
       nullable: Boolean): (SpecializedGetters, Int) => Unit = {

-    // Create the the basic writer.
+    // Create the basic writer.
     val unsafeWriter: (SpecializedGetters, Int) => Unit = dt match {
       case BooleanType =>
         (v, i) => writer.write(i, v.getBoolean(i))
```
```diff
@@ -406,7 +406,7 @@ object RemoveRedundantAliases extends Rule[LogicalPlan] {

     // Create the attribute mapping. Note that the currentNextAttrPairs can contain duplicate
     // keys in case of Union (this is caused by the PushProjectionThroughUnion rule); in this
-    // case we use the the first mapping (which should be provided by the first child).
+    // case we use the first mapping (which should be provided by the first child).
     val mapping = AttributeMap(currentNextAttrPairs)

     // Create a an expression cleaning function for nodes that can actually produce redundant
```
```diff
@@ -118,7 +118,7 @@ class AnalysisHelperSuite extends SparkFunSuite {

   test("do not allow transform in analyzer") {
     val plan = Project(Nil, LocalRelation())
-    // These should be OK since we are not in the analzyer
+    // These should be OK since we are not in the analyzer
     plan.transform { case p: Project => p }
     plan.transformUp { case p: Project => p }
     plan.transformDown { case p: Project => p }
```
```diff
@@ -158,7 +158,7 @@ class ObjectAggregationIterator(
       val buffer: InternalRow = getAggregationBufferByKey(hashMap, groupingKey)
       processRow(buffer, newInput)

-      // The the hash map gets too large, makes a sorted spill and clear the map.
+      // The hash map gets too large, makes a sorted spill and clear the map.
       if (hashMap.size >= fallbackCountThreshold) {
         logInfo(
           s"Aggregation hash map size ${hashMap.size} reaches threshold " +
```
```diff
@@ -78,7 +78,7 @@ object PythonForeachWriter {
  *
  * Internally, it uses a [[HybridRowQueue]] to buffer the rows in a practically unlimited queue
  * across memory and local disk. However, HybridRowQueue is designed to be used only with
- * EvalPythonExec where the reader is always behind the the writer, that is, the reader does not
+ * EvalPythonExec where the reader is always behind the writer, that is, the reader does not
  * try to read n+1 rows if the writer has only written n rows at any point of time. This
  * assumption is not true for PythonForeachWriter where rows may be added at a different rate as
  * they are consumed by the python worker. Hence, to maintain the invariant of the reader being
```
```diff
@@ -110,7 +110,7 @@ import org.apache.spark.util.{CompletionIterator, SerializableConfiguration}
  *
  * 3. When both window in join key and time range conditions are present, case 1 + 2.
  *    In this case, since window equality is a stricter condition than the time range, we can
- *    use the the State Key Watermark = event time watermark to discard state (similar to case 1).
+ *    use the State Key Watermark = event time watermark to discard state (similar to case 1).
  *
  * @param leftKeys Expression to generate key rows for joining from left input
  * @param rightKeys Expression to generate key rows for joining from right input
```
```diff
@@ -25,7 +25,7 @@ import org.apache.spark.sql.connector.write.streaming.{StreamingDataWriterFactor
 import org.apache.spark.sql.types.StructType
 import org.apache.spark.sql.util.CaseInsensitiveStringMap

-/** Common methods used to create writes for the the console sink */
+/** Common methods used to create writes for the console sink */
 class ConsoleWrite(schema: StructType, options: CaseInsensitiveStringMap)
     extends StreamingWrite with Logging {

```
```diff
@@ -93,7 +93,7 @@ import org.apache.spark.sql.catalyst.plans.logical.LogicalGroupState
  *    any trigger and timeout function call will not occur until there is data.
  *  - Since the processing time timeout is based on the clock time, it is affected by the
  *    variations in the system clock (i.e. time zone changes, clock skew, etc.).
- *  - With `EventTimeTimeout`, the user also has to specify the the the event time watermark in
+ *  - With `EventTimeTimeout`, the user also has to specify the event time watermark in
  *    the query using `Dataset.withWatermark()`. With this setting, data that is older than the
  *    watermark are filtered out. The timeout can be set for a group by setting a timeout timestamp
  *    using`GroupState.setTimeoutTimestamp()`, and the timeout would occur when the watermark
```
```diff
@@ -36,7 +36,7 @@ abstract class NestedSchemaPruningBenchmark extends SqlBasedBenchmark {

   // We use `col1 BIGINT, col2 STRUCT<_1: BIGINT, _2: STRING>,
   // col3 ARRAY<STRUCT<_1: BIGINT, _2: STRING>>` as a test schema.
-  // col1, col2._1 and col3._1 are used for comparision. col2._2 and col3._2 mimics the burden
+  // col1, col2._1 and col3._1 are used for comparison. col2._2 and col3._2 mimics the burden
   // for the other columns
   private val df = spark
     .range(N * 10)
```