distinct

Description:

Keeps unique records based on one or more specified columns. Optionally, it can instead delete every record that has duplicates, or delete every record that has none.

Syntax:
<column> [<option>] [partition <partition_field>]

Parameter: column

One or more columns used to identify duplicates. Required parameter; if it is omitted, the focus column of the current context is used. Type: identifier or set of identifiers. The parameter name must be omitted.

Parameter: option

Selects one of three deduplication modes. Optional parameter; enum type. The parameter name must be omitted; the parameter value cannot be omitted. The enum values are:

keep_unique: default value. When N records are duplicates of each other, remove the 2nd through Nth records and keep only the 1st.

kill_dups: delete every record that has duplicates, which is equivalent to keeping only singleton records.

dups_only: delete non-duplicate records, i.e., delete all singleton records.

Example:

The supermarket order example table is as follows:

OrderID  Product     Salesperson  OrderAmount
1        Watermelon  Zhang San    100
2        Watermelon  Zhang San    200
3        Watermelon  Zhang San    300
4        Apple       Zhang San    400
5        Apple       Zhang San    500
6        Watermelon  Li Si        600

Goal: remove duplicate records based on the Product and Salesperson fields.

NLC: distinct  Product,Salesperson

Result:

OrderID  Product     Salesperson  OrderAmount
1        Watermelon  Zhang San    100
4        Apple       Zhang San    400
6        Watermelon  Li Si        600

Example:

Using the supermarket order example table above, delete every record that has duplicates based on the Product and Salesperson fields.

NLC: distinct  Product,Salesperson; kill_dups

Result:

OrderID  Product     Salesperson  OrderAmount
6        Watermelon  Li Si        600

Example:

Using the supermarket order example table above, delete non-duplicate records based on the Product and Salesperson fields.

NLC: distinct  Product,Salesperson; dups_only

Result:

OrderID  Product     Salesperson  OrderAmount
1        Watermelon  Zhang San    100
2        Watermelon  Zhang San    200
3        Watermelon  Zhang San    300
4        Apple       Zhang San    400
5        Apple       Zhang San    500
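
The three modes shown in the examples above can be modeled with a minimal Python sketch. It assumes that two records are duplicates when they have equal values in all of the listed columns, and that the record kept by keep_unique is the first one in table order. The function name `distinct`, the `ORDERS` list, and the dict-per-row representation are illustrative only and are not part of NLC.

```python
from collections import Counter

# The supermarket order example table, one dict per record.
COLUMNS = ("OrderID", "Product", "Salesperson", "OrderAmount")
ORDERS = [dict(zip(COLUMNS, row)) for row in [
    (1, "Watermelon", "Zhang San", 100),
    (2, "Watermelon", "Zhang San", 200),
    (3, "Watermelon", "Zhang San", 300),
    (4, "Apple", "Zhang San", 400),
    (5, "Apple", "Zhang San", 500),
    (6, "Watermelon", "Li Si", 600),
]]


def distinct(rows, columns, option="keep_unique"):
    """Illustrative model of the three modes (not the NLC implementation)."""
    if option not in ("keep_unique", "kill_dups", "dups_only"):
        raise ValueError(f"unknown option: {option}")

    def key(row):
        # Records are duplicates when this tuple of column values is equal.
        return tuple(row[c] for c in columns)

    counts = Counter(key(r) for r in rows)

    if option == "kill_dups":
        # Keep only singletons.
        return [r for r in rows if counts[key(r)] == 1]
    if option == "dups_only":
        # Keep only records that have at least one duplicate.
        return [r for r in rows if counts[key(r)] > 1]

    # keep_unique: keep the first record of each group of duplicates.
    seen, kept = set(), []
    for r in rows:
        k = key(r)
        if k not in seen:
            seen.add(k)
            kept.append(r)
    return kept


def ids(rows):
    return [r["OrderID"] for r in rows]


by = ["Product", "Salesperson"]
print(ids(distinct(ORDERS, by)))               # [1, 4, 6]
print(ids(distinct(ORDERS, by, "kill_dups")))  # [6]
print(ids(distinct(ORDERS, by, "dups_only")))  # [1, 2, 3, 4, 5]
```

The printed OrderID lists match the three result tables above.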

Parameter: partition

Deduplicates within each partition independently; partitions do not affect each other. This is conceptually similar to PARTITION BY in SQL. Optional parameter; identifier type. The parameter name cannot be omitted.
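
The effect of partition can be illustrated with a small Python sketch. It rests on an assumed reading of the description above: records are first grouped by the partition field, and duplicates are then looked for only among records in the same group. The function `distinct`, its `partition` argument, and the `ORDERS` list are hypothetical names used for illustration; they are not NLC syntax or its implementation.

```python
from collections import Counter, defaultdict

# The supermarket order example table, one dict per record.
COLUMNS = ("OrderID", "Product", "Salesperson", "OrderAmount")
ORDERS = [dict(zip(COLUMNS, row)) for row in [
    (1, "Watermelon", "Zhang San", 100),
    (2, "Watermelon", "Zhang San", 200),
    (3, "Watermelon", "Zhang San", 300),
    (4, "Apple", "Zhang San", 400),
    (5, "Apple", "Zhang San", 500),
    (6, "Watermelon", "Li Si", 600),
]]


def distinct(rows, columns, option="keep_unique", partition=None):
    """Assumed model: deduplicate separately inside each partition."""
    if option not in ("keep_unique", "kill_dups", "dups_only"):
        raise ValueError(f"unknown option: {option}")

    def key(row):
        return tuple(row[c] for c in columns)

    # Group row positions by partition value (one group if no partition).
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        groups[row[partition] if partition else None].append(i)

    keep = set()
    for positions in groups.values():
        # Counts and "seen" are reset per partition, so groups are independent.
        counts = Counter(key(rows[i]) for i in positions)
        seen = set()
        for i in positions:
            k = key(rows[i])
            if option == "kill_dups":
                ok = counts[k] == 1
            elif option == "dups_only":
                ok = counts[k] > 1
            else:  # keep_unique
                ok = k not in seen
                seen.add(k)
            if ok:
                keep.add(i)

    # Preserve the original table order in the result.
    return [row for i, row in enumerate(rows) if i in keep]


def ids(rows):
    return [r["OrderID"] for r in rows]


print(ids(distinct(ORDERS, ["Product"])))                           # [1, 4]
print(ids(distinct(ORDERS, ["Product"], partition="Salesperson")))  # [1, 4, 6]
```

Without a partition, order 6 is removed as a duplicate of order 1 on Product. With Salesperson as the partition field, Li Si's records are deduplicated separately from Zhang San's, so order 6 is kept.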