summarize

Read(3) Label: summarize, aggregation,

Mode Determination (Highest Priority, Must Execute First)

Summarize has two mutually exclusive aggregation modes: 1. single aggregation 2. double aggregation

Single aggregation means: a summary column (a set of summary parameters) executes only one value aggregation algorithm, i.e., an aggregation algorithm that reduces multiple records to a single value, specifically: first, last, sum, avg, count, icount, variance, std_deviation, max, min. Syntax:
 [summarize {[condition <filter_condition>] [<aggregated_expression>] [value_aggregation_algorithm] [delimeter <symbol>] [as <new_column_name>]}] [group {[<group_expression>] [as <new_column_name>]}]

Double aggregation means: a summary column (a set of summary parameters) needs to perform two aggregations. The first is a record aggregation algorithm, which reduces multiple records to one record or multiple records with the same value, specifically: argmax, argmin. For example, the record with the maximum order amount; when the maximum value is not repeated, the result is one record; when the maximum value is repeated, the result is multiple records. After the first aggregation algorithm, the second value aggregation algorithm (the aggregation algorithm that reduces multiple records to a single value) must be immediately executed to get the final result. For instance, the first aggregation finds multiple records with the maximum order amount, and the second aggregation calculates the sum of amounts from these records; the complete algorithm is to find the sum of amounts of the records with the maximum order amount. NLC's double aggregation is similar to SQL's KEEP function. Syntax:
 [summarize {[condition <filter_condition>] [<aggregated_expression>] [record_aggregation_algorithm] [<aggregated_expression>] [value_aggregation_algorithm] [delimeter <symbol>] [as <new_column_name>]}] [group {[<group_expression>] [as <new_column_name>]}]

Determination Rules (Must Execute Strictly):

If the aggregation calculation expressed by the natural language includes: argmax, argmin

must use double aggregation.

Conversely, if the aggregation calculation expressed by the natural language does not include: argmax, argmin, but only: first, last, sum, avg, count, icount, variance, std_deviation, max, min.

must use single aggregation

If both appear or semantics conflict:

output error, guessing is not allowed

If argmax or argmin is not explicitly expressed:

default to single aggregation.

Parameter Structure Description

The parameters of this action are composed of two parts: summary and group. The summary part's parameter name is "summary", must be omitted. It consists of one or more sets of identically structured parameters, each set representing a summary column, composed of 5 (single aggregation) or 7 parameters (double aggregation). For single aggregation, one set of parameters is: [condition <filter_condition>] [<aggregated_expression>] [value_aggregation_algorithm] [delimeter <symbol>] [as <new_column_name>]; for double aggregation, one set of parameters is: [condition <filter_condition>] [<aggregated_expression>] [record_aggregation_algorithm] [<aggregated_expression>] [value_aggregation_algorithm] [delimeter <symbol>] [as <new_column_name>]. The group part's parameter name is "group", cannot be omitted, and also consists of one or more sets of identically structured parameters, each set representing a group column, composed of 2 parameters: [<group_expression>] [as <new_column_name>]. First, explain the summary part parameters.

 

Parameter Descriptions

 

Parameter: aggregated_expression

The expression targeted when performing aggregation calculation on data within a group, usually an expression related to the original column, including a single column (a type of expression). For example: UnitPrice*Quantity, an expression containing multiple columns; Amount, a single column. Required parameter; type is expression; parameter name must be omitted, parameter value cannot be omitted. Note, this parameter must be used together with the value_aggregation_algorithm or record_aggregation_algorithm. A certain summary column has only one aggregated_expression parameter for single aggregation, and definitely two aggregated_expression parameters for double aggregation. Note, this parameter does not support cross-row calculation and aggregation calculation, i.e., the expression cannot contain relative position calculations like F[i], F[a:b], nor aggregate calculations like sum, average of a set.

Parameter: value_aggregation_algorithm

A fixed algorithm that aggregates data within a group to produce a single numeric value. Required parameter; enum type; parameter name must be omitted, parameter value cannot be omitted. This parameter must be used together with the aggregated_expression parameter.

The enum values are as follows:

first, last: i.e., the first item, the last item, usually only meaningful with prior sorting.

sum, avg, count, icount, variance, std_deviation, max, min: understood by common sense.

count-icount: calculate whether duplicates exist, i.e., "count minus icount", equals 0 means no duplicates, greater than 0 means duplicates exist.

concat: concatenate members with a delimiter, which is specified by the [delimeter<symbol>] parameter. Note, other aggregation algorithms do not require a delimiter.

> Find the maximum and minimum amounts of Order_example_table.

NLC: summarize Amount max, Amount min

Result:

maxAmount  minAmount

20000  231

Explanation: maxAmount and Amount min are the column names of the summary result, automatically generated by the system. If you want to specify column names, use the as <new_column_name> parameter.

 

Parameter: record_aggregation_algorithm

A fixed algorithm that aggregates data within a group to produce a single record or multiple records with the same value. Required parameter; enum type; parameter name must be omitted, parameter value cannot be omitted. This parameter is definitely for double aggregation, must be used together with the aggregated_expression parameter and the value_aggregation_algorithm parameter.

The enum values are as follows:

argmax, argmin: the record corresponding to the maximum or minimum of the aggregated_expression; if multiple maximums exist, multiple records are returned.

Example:

Based on Order_example_table, find the maximum amount and the sum of amounts for the record with the latest order date.

NLC: summarize OrderDate max Amount max, Amount sum

Result:

maxOrderDatemaxAmount sumAmount

20000  4500000

Explanation: "maxOrderDatemaxAmount" and "sumAmount" are the column names of the summary result, automatically generated by the system. If you want to specify column names, use the as <new_column_name> parameter.

 

Parameter: delimeter <symbol>

When the aggregation algorithm is concat, use this parameter to concatenate the members within the group (specified by the aggregated_expression parameter). Optional parameter, default parameter value is null; type is string; parameter name cannot be omitted.

Example:

Concatenate the ClientID field of Order_example_table using commas

NLC: summarize ClientID concat delimeter comma

Result:

concatClientID

WVF,UFS,JFS,DSG,JFE,OLF,PAER,DY,JDR,WBG,GCD,WVF,UFS

Parameter: condition <filter_condition>

Before aggregation, records within a group can be filtered first. Optional parameter; type is conditional expression; parameter name cannot be omitted.

Example:

Filter records that meet the conditional expression "OrderDate year==2019 or OrderDate year==2020", then find the record with the maximum amount

NLC: summarize Amount argmax

Result:

OrderID  ClientID  SellerId  Amount  OrderDate

87  WF  15  19000  2019-03-04

42  VET 12  19000  2020-04-08

Parameter: group_expression

The expression used for grouping, can be a single field (a type of expression). For example, calculate the first 6 digits of the ID field as the grouping field "Region". When grouping by multiple columns, multiple sets of parameters are needed, each with one group column parameter. Required parameter (when the group part's parameters exist); type is expression; parameter name must be omitted. Note, this parameter does not support cross-row calculation and aggregation calculation, i.e., the expression cannot contain relative position calculations like F[i], F[a:b], nor aggregate calculations like sum, average of a set.

Example:

Group the OrderAnalysis_table by Year and ClientID, find the maximum and minimum amounts for each group.

NLC: summarize Amount max, Amount min; group Year, ClientID

Result:

Year ClientID  maxAmount  minAmount

2019  WF  19000  2100

2020  WF  12800  2200

2021  WF  14100  1300

2019  TEF  13000  2200

2020  ETF  23000  3100

Parameter: as <new_column_name>

Summary results and group columns can have new column names automatically generated by the system, or specified using this parameter. Optional parameter; type is (column) identifier; parameter name cannot be omitted. Note that the group part also has a parameter with the same name, with similar meaning and usage.

Example:

Find the maximum and minimum amounts of Order_example_table, named LargeOrderAmount and SmallOrderAmount respectively.

NLC: summarize Amount max, Amount min

Result:

LargeOrderAmount  SmallOrderAmount

20000  231

Note: if the as <new_column_name> parameter is omitted, it means automatic naming.

.