Augmenter Denormalizations

Augmenter will be able to generate various denormalizations that are maintained automatically and cannot be altered by the session user.

The denormalizations fall into three categories:

  • Column values copied from a parent table to a child table via a specified foreign key
  • Column values aggregated from a child table to a parent table via a specified foreign key and aggregate function, e.g. SUM, MIN, MAX, AVG, etc. Functions defined using CREATE AGGREGATE are also allowed.
  • Column values calculated from other columns in the table

[jmafc] For the last item, I'd prefer to use C.J. Date's terminology. In Tutorial D syntax, the computed value is specified with EXTEND. The BNF is:

<extend> ::= EXTEND <relation_exp> ADD ( <extend add commalist> )
<extend add> ::= <exp> AS <attribute_name>

An example given by Date is:

EXTEND P ADD ( WEIGHT * 454 ) AS GMWT

In terms of the YAML file, this could be specified as

table t1:
  extend columns:
  - gmwt:
      expression: weight * 454

Tutorial D also has a SUMMARIZE operator which would be suitable for the aggregates.

[rhunwicks] I'd be inclined to:

table t1:
  columns:
  - gmwt:
      extend_expression: __weight__ * 454

[rhunwicks] The advantage of keeping the columns schema structure rather than creating a new extend columns is that round-tripping will work better - i.e. I think some users will want to make changes to the database, run dbtoyaml, and still end up with a file that they can use. It will be harder for them to keep track of what has changed if a column moves from extend columns to columns.

In a simple case all that is required is to define the EXTEND expression used to derive the column value:

table t1:
  columns:
  - total_amount:
      extend_expression: __qty__ * __item_value__

The __ delimiter identifies a column in the current table. This definition will generate plpgsql like:

NEW.total_amount := NEW.qty * NEW.item_value;

In some situations the column value cannot be derived from the NEW values of the other columns alone. If you need to access the OLD value of a column, which is not defined for an INSERT statement, then you need separate expressions:

table t1:
  columns:
  - change_at_last_measurement:
      insert_expression: NEW.item_value
      update_expression: NEW.item_value - OLD.item_value

This YAML will generate plpgsql like:

IF TG_OP = 'INSERT' THEN
  NEW.change_at_last_measurement = NEW.item_value;
ELSIF TG_OP = 'UPDATE' THEN
  NEW.change_at_last_measurement = NEW.item_value - OLD.item_value;
END IF;

If performance is a concern, for example if calculating the value is computationally expensive, then you may want to recalculate the value only when the input values have changed:

table t1:
  columns:
  - change_at_last_measurement:
      insert_expression: NEW.item_value
      update_expression: function_to_calculate_change(NEW.item_value, NEW.item_timestamp)
      depends_on:
      - item_value
      - item_timestamp

This YAML will generate plpgsql like:

IF TG_OP = 'INSERT' THEN
  NEW.change_at_last_measurement = NEW.item_value;
ELSIF TG_OP = 'UPDATE' AND
  (NEW.item_value IS DISTINCT FROM OLD.item_value OR
   NEW.item_timestamp IS DISTINCT FROM OLD.item_timestamp)
THEN
  NEW.change_at_last_measurement = function_to_calculate_change(NEW.item_value, NEW.item_timestamp);
END IF;

A simple example of a copy denormalization, where the value of parent_other_column is copied from parent_table.other_column into denorm_examples using the child_parent_fk foreign key:

schema example_schema:
  table denorm_examples:
    columns:
    - child_id:
        default: nextval('example_schema.example_seq'::regclass)
        not_null: true
        type: integer
    - parent_id:
        not_null: true
        type: integer
    - parent_other_column:
        not_null: true
        type: character varying(20)
        copied_from:
            foreign_key: child_parent_fk
            column: other_column
    foreign_keys:
      child_parent_fk:
        columns:
        - parent_id
        on_delete: restrict
        on_update: restrict
        references:
          columns:
          - parent_id
          schema: example_schema
          table: parent_table
    indexes:
      child_parent_fk_i:
        access_method: btree
        columns:
        - parent_id
    primary_key:
      denorm_examples_pkey:
        access_method: btree
        columns:
        - child_id

[rhunwicks] Start

This example is one where I am much more comfortable with a single yaml definition covering both schema and business logic. I can't instinctively see where you can split it out without making the extend.yaml completely dependent on the schema.yaml - the denormalization has to specify the foreign key to use, and if it is specified in a separate file I think we will make it harder for the user, who will eventually spell the foreign key name incorrectly and have to debug it, or worse, specify the wrong foreign key entirely and not realize it - not very likely, but not unheard of where there are two foreign keys to the same parent table.

That said, I created a separate yaml example here in case you want to include it or use it to replace the example above.

This example can also be done using separate schema and extension definition files:

schema.yaml

schema example_schema:
  table denorm_examples:
    columns:
    - child_id:
        default: nextval('example_schema.example_seq'::regclass)
        not_null: true
        type: integer
    - parent_id:
        not_null: true
        type: integer
    - parent_other_column:
        not_null: true
        type: character varying(20)
    foreign_keys:
      child_parent_fk:
        columns:
        - parent_id
        on_delete: restrict
        on_update: restrict
        references:
          columns:
          - parent_id
          schema: example_schema
          table: parent_table
    indexes:
      child_parent_fk_i:
        access_method: btree
        columns:
        - parent_id
    primary_key:
      denorm_examples_pkey:
        access_method: btree
        columns:
        - child_id

extend.yaml

schema example_schema:
  table denorm_examples:
    columns:
    - parent_other_column:
        copied_from:
            foreign_key: child_parent_fk
            column: other_column

We could also exclude parent_other_column from schema.yaml completely and just define it in extend.yaml, but I think that in many cases (i.e. mine :-) ) the column will already exist in the schema definition because it is currently being maintained by a hand-coded trigger or application code. I also think that excluding it will make it harder for the user to understand what is going on - i.e. if they know the column is supposed to be in the database, they will expect to see it in the schema.yaml file.

Alternatively we could define the denormalization on the parent table by specifying what tables to copy it to and which column in those tables should receive the denormalized value, but I think this will be more confusing and error-prone - if I add a new table and want it to include a denormalized value, it is easier for me to define it while I am defining the new table, when I know what the foreign key is called, etc., rather than have to navigate to a completely different part of a potentially very large file to add the definition to the column in the parent table.

[rhunwicks] End

Note that the denormalized column can only be not_null: true if all the columns that make up the foreign key are not_null: true, and the column in the parent table is also not_null: true.

A simple example of an aggregate denormalization, where the value of parent_aggregate is summed from child_table.other_column into denorm_parents using the child_parent_fk foreign key:

schema example_schema:
  table denorm_parents:
    columns:
    - parent_id:
        default: nextval('example_schema.example_seq'::regclass)
        not_null: true
        type: integer
    - parent_aggregate:
        type: integer
        aggregated_from:
            schema: example_schema
            table: child_table
            foreign_key: child_parent_fk
            column: other_column
            aggregate: sum
    primary_key:
      denorm_parents_pkey:
        access_method: btree
        columns:
        - parent_id

[rhunwicks] Start

This example can also be done using separate schema and extension definition files:

schema.yaml

schema example_schema:
  table denorm_parents:
    columns:
    - parent_id:
        default: nextval('example_schema.example_seq'::regclass)
        not_null: true
        type: integer
    - parent_aggregate:
        type: integer
    primary_key:
      denorm_parents_pkey:
        access_method: btree
        columns:
        - parent_id

extend.yaml

schema example_schema:
  table denorm_parents:
    columns:
    - parent_aggregate:
        type: integer
        aggregated_from:
            schema: example_schema
            table: child_table
            foreign_key: child_parent_fk
            column: other_column
            aggregate: sum

As before, you could exclude the parent_aggregate column from schema.yaml and define the type only in extend.yaml, but I think it should be defined in schema.yaml for the same reasons I gave for the copied_from example.

Similarly, you could define the aggregate on the column in the child table by specifying which column in which parent table should receive the aggregate, but I think that it is more intuitive to define it on the table that is receiving the value. After all, if I want to know where a column value is coming from, I will start by looking at the definition for that table, rather than searching the whole file for that column name to find the place where it is defined against a child table.

We need to decide what to do when the parent table has no child records - the logical SQL thing would be to say that the aggregate value is NULL, which in turn means that the column in the parent table cannot be not_null: true. However, in business terms, the denormalization is often for things like total_order_value or number_of_line_items where the logical answer is 0. Note that this only holds true for sum and count; for min and max, having 0 as a default/empty value makes less sense.
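
If we did default sum and count to 0, the generated recalculation could wrap the aggregate in COALESCE. A sketch, assuming the denormalize trigger on the parent table recomputes the value with a query along these lines (the parent_id foreign key column on child_table is an assumption):

-- Sketch: COALESCE defaults the empty aggregate to 0 for sum/count
SELECT COALESCE(sum(other_column), 0)
INTO NEW.parent_aggregate
FROM example_schema.child_table
WHERE parent_id = NEW.parent_id;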

[rhunwicks] End

A simple example of a calculated denormalization, where the value of total_value is calculated from item_price and qty:

schema example_schema:
  table order_items:
    columns:
    - order_id:
        not_null: true
        type: integer
    - item_no:
        not_null: true
        type: integer
    - item_price:
        not_null: true
        type: numeric(10,2)
    - qty:
        not_null: true
        type: integer
    - total_value:
        type: numeric(10,2)
        calculated_from: "{item_price} * {qty}"
    primary_key:
      order_items_pkey:
        access_method: btree
        columns:
        - order_id
        - item_no

[rhunwicks] Start

This example can also be done using separate schema and extension definition files:

schema.yaml

schema example_schema:
  table order_items:
    columns:
    - order_id:
        not_null: true
        type: integer
    - item_no:
        not_null: true
        type: integer
    - item_price:
        not_null: true
        type: numeric(10,2)
    - qty:
        not_null: true
        type: integer
    - total_value:
        type: numeric(10,2)
    primary_key:
      order_items_pkey:
        access_method: btree
        columns:
        - order_id
        - item_no

extend.yaml

schema example_schema:
  table order_items:
    columns:
    - total_value:
        calculated_from: "{item_price} * {qty}"

As before, you could exclude the total_value column from schema.yaml and define the type in extend.yaml, but I think it should be defined in schema.yaml for the same reasons I gave for the copied_from example.

We need to decide how you declare the calculation such that dbaugment can get the SQL correct. There are two obvious answers: either a delimiter that indicates which tokens in the calculation are column names (as above), or a straight piece of SQL, e.g. NEW.item_price * NEW.qty.
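
For example, by analogy with the __ delimiter shown earlier, the {column} form above might generate plpgsql like:

-- Sketch: each {name} token in calculated_from is rewritten as NEW.name
NEW.total_value := NEW.item_price * NEW.qty;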

[rhunwicks] End

The plpgsql implementation for denormalizations creates a trigger for each table with denormalized columns. The trigger is named tablename_40_denormalize (or alias_40_denormalize). It fires BEFORE INSERT OR UPDATE FOR EACH ROW and executes a function named tablename_denormalize.

The trigger function calculates the correct value for the column if any of the columns it depends on have changed:

  • for a copied_from denormalization: on UPDATE these are the referencing columns that are part of the foreign key, and the column itself; on INSERT the column is always derived
  • for an aggregated_from denormalization: on UPDATE these are the columns in this table that are referenced by the foreign key constraint on the referencing table, and the column itself; on INSERT the column will always be NULL because there are no rows in the referencing table that reference this row
  • for an extend denormalization: on UPDATE these are the columns referenced by the extend expression; on INSERT the column is always derived

If the operation is an UPDATE and none of the conditions that cause a recalculation are met, then the old value of the column will be used to prevent users from manually overriding denormalized columns.

If a table is the source of a copied_from denormalization then it will have an additional trigger named tablename_60_cascade (or alias_60_cascade), with a corresponding function named tablename_cascade, that is responsible for making sure that changes to the source columns in the referenced table are cascaded to the referencing table. The trigger fires AFTER UPDATE FOR EACH ROW, and if any of the source columns have changed then the function forces an update to the referencing table by setting the value of the denormalized column in that table to NULL.

If a table is the source of an aggregated_from denormalization then it will have an additional trigger named tablename_80_aggregate (or alias_80_aggregate), with a corresponding function named tablename_aggregate, that is responsible for making sure that changes to the source columns in the referencing table are cascaded to the referenced table. The trigger fires AFTER INSERT OR UPDATE OR DELETE FOR EACH ROW, and if any of the source columns have changed then the function forces an update to the referenced table by setting the value of the denormalized column in that table to NULL. If one of the referencing columns in the foreign key is changed by an UPDATE then the aggregate in the referenced table will need to be recalculated for both the OLD and NEW values of the foreign key.

The plpgsql generated for the copied_from example above:

CREATE OR REPLACE FUNCTION example_schema.denorm_examples_denormalize()
  RETURNS trigger AS
$BODY$
BEGIN
  -- Denormalize parent_other_column via child_parent_fk
  IF TG_OP = 'INSERT' THEN
    SELECT other_column
    INTO NEW.parent_other_column
    FROM example_schema.parent_table
    WHERE parent_id = NEW.parent_id;
  ELSIF TG_OP = 'UPDATE' AND (
      NEW.parent_id IS DISTINCT FROM OLD.parent_id OR 
      NEW.parent_other_column IS NULL) THEN
    SELECT other_column
    INTO NEW.parent_other_column
    FROM example_schema.parent_table
    WHERE parent_id = NEW.parent_id;
  ELSE
    NEW.parent_other_column = OLD.parent_other_column;
  END IF;
  RETURN NEW;
END;
$BODY$
  LANGUAGE plpgsql;

CREATE TRIGGER denorm_examples_40_denormalize
  BEFORE INSERT OR UPDATE
  ON example_schema.denorm_examples
  FOR EACH ROW
  EXECUTE PROCEDURE example_schema.denorm_examples_denormalize();

CREATE OR REPLACE FUNCTION example_schema.parent_table_cascade()
  RETURNS trigger AS
$BODY$
BEGIN
  -- Force example_schema.denorm_examples to copy the denormalized columns
  IF TG_OP = 'UPDATE' AND (
       NEW.other_column IS DISTINCT FROM OLD.other_column
    ) THEN
    UPDATE example_schema.denorm_examples SET
    parent_other_column = NULL
    WHERE parent_id = NEW.parent_id;
  END IF;
  RETURN NULL;
END;
$BODY$
  LANGUAGE plpgsql;

CREATE TRIGGER parent_table_60_cascade
  AFTER UPDATE
  ON example_schema.parent_table
  FOR EACH ROW
  EXECUTE PROCEDURE example_schema.parent_table_cascade();
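
The SQL for the aggregated_from example has not been written yet (see the note below), but following the pattern above it might look something like this sketch. The function and trigger names follow the naming conventions described earlier; the recalculation query assumes that child_table's foreign key column is parent_id, and whether an empty aggregate should default to NULL or 0 is still an open question (see above):

CREATE OR REPLACE FUNCTION example_schema.denorm_parents_denormalize()
  RETURNS trigger AS
$BODY$
BEGIN
  -- Denormalize parent_aggregate via child_parent_fk
  IF TG_OP = 'INSERT' THEN
    -- No referencing rows can exist yet, so the aggregate is NULL
    NEW.parent_aggregate := NULL;
  ELSIF TG_OP = 'UPDATE' AND (
      NEW.parent_id IS DISTINCT FROM OLD.parent_id OR
      NEW.parent_aggregate IS NULL) THEN
    SELECT sum(other_column)
    INTO NEW.parent_aggregate
    FROM example_schema.child_table
    WHERE parent_id = NEW.parent_id;
  ELSE
    NEW.parent_aggregate = OLD.parent_aggregate;
  END IF;
  RETURN NEW;
END;
$BODY$
  LANGUAGE plpgsql;

CREATE TRIGGER denorm_parents_40_denormalize
  BEFORE INSERT OR UPDATE
  ON example_schema.denorm_parents
  FOR EACH ROW
  EXECUTE PROCEDURE example_schema.denorm_parents_denormalize();

CREATE OR REPLACE FUNCTION example_schema.child_table_aggregate()
  RETURNS trigger AS
$BODY$
BEGIN
  -- Force example_schema.denorm_parents to recalculate the aggregate
  IF TG_OP = 'INSERT' THEN
    UPDATE example_schema.denorm_parents SET
    parent_aggregate = NULL
    WHERE parent_id = NEW.parent_id;
  ELSIF TG_OP = 'DELETE' THEN
    UPDATE example_schema.denorm_parents SET
    parent_aggregate = NULL
    WHERE parent_id = OLD.parent_id;
  ELSIF TG_OP = 'UPDATE' AND (
      NEW.parent_id IS DISTINCT FROM OLD.parent_id OR
      NEW.other_column IS DISTINCT FROM OLD.other_column) THEN
    -- If the foreign key changed, both the OLD and NEW parents are affected
    UPDATE example_schema.denorm_parents SET
    parent_aggregate = NULL
    WHERE parent_id IN (OLD.parent_id, NEW.parent_id);
  END IF;
  RETURN NULL;
END;
$BODY$
  LANGUAGE plpgsql;

CREATE TRIGGER child_table_80_aggregate
  AFTER INSERT OR UPDATE OR DELETE
  ON example_schema.child_table
  FOR EACH ROW
  EXECUTE PROCEDURE example_schema.child_table_aggregate();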

[rhunwicks] Start

I need to add the intermediate yaml and the SQL for the remaining examples.

[rhunwicks] End
