Augmenter Denormalizations
Augmenter will be able to generate various denormalizations that are maintained automatically and cannot be altered by the session user.
The denormalizations fall into three categories:
- Column values copied from a parent table to a child table via a specified foreign key
- Column values aggregated from a child table to a parent table via a specified foreign key and aggregate function, e.g. SUM, MIN, MAX, AVG, etc. Functions defined using CREATE AGGREGATE are also allowed.
- Column values calculated from other columns in the table
[jmafc] For the last item, I'd prefer to use C.J. Date's terminology. In Tutorial D syntax, the computed value is specified with EXTEND. The BNF is:
<extend> ::= EXTEND <relation_exp> ADD ( <extend add commalist> )
<extend add> ::= <exp> AS <attribute_name>
An example given by Date is:
EXTEND P ADD ( WEIGHT * 454 ) AS GMWT
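In SQL terms this is roughly equivalent to deriving the column in a query (a sketch; P is Date's parts table):
-- Derive GMWT (weight in grams) from WEIGHT on the fly
SELECT p.*, weight * 454 AS gmwt
FROM p;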
In terms of the YAML file, this could be specified as
table t1:
  extend columns:
  - gmwt:
      expression: weight * 454
Tutorial D also has a SUMMARIZE operator which would be suitable for the aggregates.
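In SQL terms SUMMARIZE corresponds to an aggregate query; a sketch using the names from the aggregate example further down this page:
-- One aggregate row per parent, as SUMMARIZE would produce
SELECT parent_id, sum(other_column) AS parent_aggregate
FROM example_schema.child_table
GROUP BY parent_id;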
[rhunwicks] I'd be inclined to use:
table t1:
  columns:
  - gmwt:
      extend_expression: __weight__ * 454
[rhunwicks] The advantage of keeping the columns schema structure rather than creating a new extend columns is that round-tripping will work better. I.e. I think some users will want to make changes to the database, run dbtoyaml and still end up with a file that they can use. It will be harder for them to keep track of what has changed if a column moves from extend columns to columns.
In a simple case all that is required is to define the EXTEND expression used to derive the column value:
table t1:
  columns:
  - total_amount:
      extend_expression: __qty__ * __item_value__
The __ delimiters mark a reference to a column in the current table. This definition will generate plpgsql like:
NEW.total_amount := NEW.qty * NEW.item_value;
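For context, a minimal sketch of the trigger function this generated assignment would live in, following the trigger naming scheme described at the end of this page (the names t1_denormalize and t1_40_denormalize are illustrative):
CREATE OR REPLACE FUNCTION t1_denormalize()
    RETURNS trigger AS
$BODY$
BEGIN
    -- Generated from extend_expression: __qty__ * __item_value__
    NEW.total_amount := NEW.qty * NEW.item_value;
    RETURN NEW;
END;
$BODY$
LANGUAGE plpgsql;

CREATE TRIGGER t1_40_denormalize
    BEFORE INSERT OR UPDATE
    ON t1
    FOR EACH ROW
    EXECUTE PROCEDURE t1_denormalize();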
In some situations a single expression will not work for both INSERT and UPDATE. If you need to access the OLD value of a column, which is not defined for an INSERT statement, then you need separate expressions:
table t1:
  columns:
  - change_at_last_measurement:
      insert_expression: NEW.item_value
      update_expression: NEW.item_value - OLD.item_value
This YAML will generate plpgsql like:
IF TG_OP = 'INSERT' THEN
    NEW.change_at_last_measurement := NEW.item_value;
ELSIF TG_OP = 'UPDATE' THEN
    NEW.change_at_last_measurement := NEW.item_value - OLD.item_value;
END IF;
If performance is a concern, for example if calculating the value is computationally expensive, then you may want to recalculate the value only if the input values have changed:
table t1:
  columns:
  - change_at_last_measurement:
      insert_expression: NEW.item_value
      update_expression: function_to_calculate_change(NEW.item_value, NEW.item_timestamp)
      depends_on:
      - item_value
      - item_timestamp
This YAML will generate plpgsql like:
IF TG_OP = 'INSERT' THEN
    NEW.change_at_last_measurement := NEW.item_value;
ELSIF TG_OP = 'UPDATE' AND
      (NEW.item_value IS DISTINCT FROM OLD.item_value OR
       NEW.item_timestamp IS DISTINCT FROM OLD.item_timestamp)
THEN
    NEW.change_at_last_measurement := function_to_calculate_change(NEW.item_value, NEW.item_timestamp);
END IF;
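The generated tests use IS DISTINCT FROM rather than <> so that changes to or from NULL are detected; with a plain inequality the comparison yields NULL and the recalculation would be skipped:
SELECT NULL <> 1;                -- NULL, so a plain <> test would not fire
SELECT NULL IS DISTINCT FROM 1;  -- true, so the change is detected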
A simple copy denormalization, where the value of parent_other_column is copied from parent_table.other_column into the child table denorm_examples using the child_parent_fk foreign key:
schema example_schema:
  table denorm_examples:
    columns:
    - child_id:
        default: nextval('example_schema.example_seq'::regclass)
        not_null: true
        type: integer
    - parent_id:
        not_null: true
        type: integer
    - parent_other_column:
        not_null: true
        type: character varying(20)
        copied_from:
          foreign_key: child_parent_fk
          column: other_column
    foreign_keys:
      child_parent_fk:
        columns:
        - parent_id
        on_delete: restrict
        on_update: restrict
        references:
          columns:
          - parent_id
          schema: example_schema
          table: parent_table
    indexes:
      child_parent_fk_i:
        access_method: btree
        columns:
        - parent_id
    primary_key:
      denorm_examples_pkey:
        access_method: btree
        columns:
        - child_id
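With these definitions in place the denormalized column is populated automatically; for illustration, assuming a parent_table row with parent_id = 1 already exists:
INSERT INTO example_schema.denorm_examples (parent_id)
VALUES (1);
-- parent_other_column is filled in from parent_table.other_column by the
-- generated trigger; any value supplied by the session user is overwritten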
[rhunwicks] Start
This example is one where I am much more comfortable with a single YAML definition covering both schema and business logic. I can't see where to split it out without making the extend.yaml completely dependent on the schema.yaml: the denormalization has to specify the foreign key to use, and if it is specified in a separate file I think we will make it harder for the user, who will eventually spell the foreign key incorrectly and have to debug it, or worse, specify the wrong foreign key entirely and not realize it. That is not very likely, but not unheard of where there are two foreign keys to the same parent table.
That said, I created a separate yaml example here in case you want to include it or use it to replace the example above.
This example can also be done using separate schema and extension definition files:
schema.yaml
schema example_schema:
  table denorm_examples:
    columns:
    - child_id:
        default: nextval('example_schema.example_seq'::regclass)
        not_null: true
        type: integer
    - parent_id:
        not_null: true
        type: integer
    - parent_other_column:
        not_null: true
        type: character varying(20)
    foreign_keys:
      child_parent_fk:
        columns:
        - parent_id
        on_delete: restrict
        on_update: restrict
        references:
          columns:
          - parent_id
          schema: example_schema
          table: parent_table
    indexes:
      child_parent_fk_i:
        access_method: btree
        columns:
        - parent_id
    primary_key:
      denorm_examples_pkey:
        access_method: btree
        columns:
        - child_id
extend.yaml
schema example_schema:
  table denorm_examples:
    columns:
    - parent_other_column:
        copied_from:
          foreign_key: child_parent_fk
          column: other_column
We could also exclude parent_other_column from schema.yaml completely and just define it in extend.yaml, but I think that in many cases (i.e. mine :-) ) the column will already exist in the schema definition because it is currently being maintained by a hand-coded trigger or application code. I also think that this will make it harder for the user to understand what is going on - i.e. if they know the column is supposed to be in the database they will expect to see it in the schema.yaml file.
Alternatively we could define the denormalization on the parent table by specifying which tables to copy it to and which column in those tables should receive the denormalized value, but I think this will be more confusing and error prone: if I add a new table and want it to include a denormalized value, it is easier for me to define it while I am defining the new table, when I know what the foreign key is called, etc., rather than have to navigate to a completely different part of a potentially very large file to add the definition to the column in the parent table.
[rhunwicks] End
Note that the denormalized column can only be not_null: true if all the columns that make up the foreign key are not_null: true, and the column in the parent table is also not_null: true.
A simple example of an aggregate denormalization, where the value of parent_aggregate is summed from child_table.other_column into the parent table denorm_parents using the child_parent_fk foreign key:
schema example_schema:
  table denorm_parents:
    columns:
    - parent_id:
        default: nextval('example_schema.example_seq'::regclass)
        not_null: true
        type: integer
    - parent_aggregate:
        type: integer
        aggregated_from:
          schema: example_schema
          table: child_table
          foreign_key: child_parent_fk
          column: other_column
          aggregate: sum
    primary_key:
      denorm_parents_pkey:
        access_method: btree
        columns:
        - parent_id
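The generated SQL for this example is still to be added (see the note at the end of this page), but as a sketch, the recalculation inside the parent's denormalize trigger would be along these lines:
-- Recalculate parent_aggregate from the child rows that reference this row
SELECT sum(other_column)
  INTO NEW.parent_aggregate
  FROM example_schema.child_table
 WHERE parent_id = NEW.parent_id;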
[rhunwicks] Start
This example can also be done using separate schema and extension definition files:
schema.yaml
schema example_schema:
  table denorm_parents:
    columns:
    - parent_id:
        default: nextval('example_schema.example_seq'::regclass)
        not_null: true
        type: integer
    - parent_aggregate:
        type: integer
    primary_key:
      denorm_parents_pkey:
        access_method: btree
        columns:
        - parent_id
extend.yaml
schema example_schema:
  table denorm_parents:
    columns:
    - parent_aggregate:
        type: integer
        aggregated_from:
          schema: example_schema
          table: child_table
          foreign_key: child_parent_fk
          column: other_column
          aggregate: sum
As before you could exclude the parent_aggregate column from schema.yaml and define the type in the extend.yaml, but I think it should be defined in the schema.yaml for the same reasons I gave for the copied_from example.
Similarly, you could define the aggregate on the column in the child table by specifying which column in which parent table should receive the aggregate, but I think that it is more intuitive to define it on the table that is receiving the value. After all, if I want to know where a column value is coming from, I will start by looking at the definition for that table, rather than searching the whole file for that column name to find the place where it is defined against a child table.
We need to decide what to do when the parent table has no child records. The logical SQL thing would be to say that the aggregate value is NULL, which in turn means that the column in the parent table cannot be not_null: true. However, in business terms, the denormalization is often for things like total_order_value or number_of_line_items, where the logical answer is 0. Note that this only holds true for sum and count; for min and max, having 0 as a default/empty value makes less sense.
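For illustration, if we settled on 0 for sum, the generated recalculation could coalesce the empty-set NULL (a sketch only, pending that decision):
-- sum() over zero rows yields NULL; COALESCE turns that into 0
SELECT COALESCE(sum(other_column), 0)
  INTO NEW.parent_aggregate
  FROM example_schema.child_table
 WHERE parent_id = NEW.parent_id;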
[rhunwicks] End
A simple example of a calculated denormalization, where the value of total_value is calculated from item_price and qty:
schema example_schema:
  table order_items:
    columns:
    - order_id:
        not_null: true
        type: integer
    - item_no:
        not_null: true
        type: integer
    - item_price:
        not_null: true
        type: numeric(10,2)
    - qty:
        not_null: true
        type: integer
    - total_value:
        type: numeric(10,2)
        calculated_from: '{item_price} * {qty}'
    primary_key:
      order_items_pkey:
        access_method: btree
        columns:
        - order_id
        - item_no
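Assuming the {column} tokens are replaced with NEW references (the delimiter question is discussed below), this definition would generate plpgsql like:
NEW.total_value := NEW.item_price * NEW.qty;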
[rhunwicks] Start
This example can also be done using separate schema and extension definition files:
schema.yaml
schema example_schema:
  table order_items:
    columns:
    - order_id:
        not_null: true
        type: integer
    - item_no:
        not_null: true
        type: integer
    - item_price:
        not_null: true
        type: numeric(10,2)
    - qty:
        not_null: true
        type: integer
    - total_value:
        type: numeric(10,2)
    primary_key:
      order_items_pkey:
        access_method: btree
        columns:
        - order_id
        - item_no
extend.yaml
schema example_schema:
  table order_items:
    columns:
    - total_value:
        calculated_from: '{item_price} * {qty}'
As before you could exclude the total_value column from schema.yaml and define the type in the extend.yaml, but I think it should be defined in the schema.yaml for the same reasons I gave for the copied_from example.
We need to decide how to declare the calculation such that dbaugment can get the SQL correct. Two obvious answers: either a delimiter that indicates which tokens in the calculation are column names (as above), or a straight piece of SQL, e.g. NEW.item_price * NEW.qty.
[rhunwicks] End
The plpgsql implementation for denormalizations creates a trigger for each table with denormalized columns. The trigger is named tablename_40_denormalize (or alias_40_denormalize). It fires BEFORE INSERT OR UPDATE FOR EACH ROW and executes a correspondingly named trigger function.
The trigger function calculates the correct value for the column if any of the columns it depends on have changed:
- for a copied_from denormalization: on UPDATE these are the referencing columns that are part of the foreign key, and the column itself; on INSERT the column is always derived
- for an aggregated_from denormalization: on UPDATE these are the columns in this table that are referenced by the foreign key constraint on the referencing table, and the column itself; on INSERT the column will always be NULL because there are no rows in the referencing table that reference this row
- for an extend denormalization: on UPDATE these are the columns referenced by the extend expression; on INSERT the column is always derived
If the operation is an UPDATE and none of the conditions that cause a recalculation are met, then the old value of the column is reused; this prevents users from manually overriding denormalized columns.
If a table is the source of a copied_from denormalization then it will have an additional trigger named tablename_60_cascade (or alias_60_cascade), with a correspondingly named trigger function, that is responsible for making sure that changes to the copied columns in the referenced table are cascaded to the referencing table. The trigger fires AFTER UPDATE FOR EACH ROW, and if any of the copied columns have been changed then the function forces an update to the referencing table by setting the value of the denormalized column in that table to NULL.
If a table is the source of an aggregated_from denormalization then it will have an additional trigger named tablename_80_aggregate (or alias_80_aggregate), with a correspondingly named trigger function, that is responsible for making sure that changes to the aggregated columns in the referencing table are cascaded to the referenced table. The trigger fires AFTER INSERT OR UPDATE OR DELETE FOR EACH ROW, and if any of the aggregated columns have been changed then the function forces an update to the referenced table by setting the value of the denormalized column in that table to NULL. If one of the referencing columns in the foreign key is changed by an UPDATE then the aggregate in the referenced table will need to be recalculated for both the OLD and NEW values of the foreign key.
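The SQL for the aggregate example has not been written yet (see the note at the end of this page), but a sketch of the cascade just described, using the names from the aggregated_from example above:
CREATE OR REPLACE FUNCTION example_schema.child_table_aggregate()
    RETURNS trigger AS
$BODY$
BEGIN
    -- Force example_schema.denorm_parents to recalculate parent_aggregate
    IF TG_OP = 'INSERT' THEN
        UPDATE example_schema.denorm_parents
           SET parent_aggregate = NULL
         WHERE parent_id = NEW.parent_id;
    ELSIF TG_OP = 'DELETE' THEN
        UPDATE example_schema.denorm_parents
           SET parent_aggregate = NULL
         WHERE parent_id = OLD.parent_id;
    ELSIF NEW.parent_id IS DISTINCT FROM OLD.parent_id THEN
        -- The row moved to a different parent: both aggregates are stale
        UPDATE example_schema.denorm_parents
           SET parent_aggregate = NULL
         WHERE parent_id IN (OLD.parent_id, NEW.parent_id);
    ELSIF NEW.other_column IS DISTINCT FROM OLD.other_column THEN
        UPDATE example_schema.denorm_parents
           SET parent_aggregate = NULL
         WHERE parent_id = NEW.parent_id;
    END IF;
    RETURN NULL;
END;
$BODY$
LANGUAGE plpgsql;

CREATE TRIGGER child_table_80_aggregate
    AFTER INSERT OR UPDATE OR DELETE
    ON example_schema.child_table
    FOR EACH ROW
    EXECUTE PROCEDURE example_schema.child_table_aggregate();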
The generated plpgsql for the copied_from example above:
CREATE OR REPLACE FUNCTION example_schema.denorm_examples_denormalize()
    RETURNS trigger AS
$BODY$
BEGIN
    -- Denormalize parent_other_column via child_parent_fk
    IF TG_OP = 'INSERT' THEN
        SELECT other_column
          INTO NEW.parent_other_column
          FROM example_schema.parent_table
         WHERE parent_id = NEW.parent_id;
    ELSIF TG_OP = 'UPDATE' AND (
            NEW.parent_id IS DISTINCT FROM OLD.parent_id OR
            NEW.parent_other_column IS NULL) THEN
        SELECT other_column
          INTO NEW.parent_other_column
          FROM example_schema.parent_table
         WHERE parent_id = NEW.parent_id;
    ELSE
        NEW.parent_other_column := OLD.parent_other_column;
    END IF;
    RETURN NEW;
END;
$BODY$
LANGUAGE plpgsql;

CREATE TRIGGER denorm_examples_40_denormalize
    BEFORE INSERT OR UPDATE
    ON example_schema.denorm_examples
    FOR EACH ROW
    EXECUTE PROCEDURE example_schema.denorm_examples_denormalize();
CREATE OR REPLACE FUNCTION example_schema.parent_table_cascade()
    RETURNS trigger AS
$BODY$
BEGIN
    -- Force example_schema.denorm_examples to copy the denormalized columns
    IF TG_OP = 'UPDATE' AND (
            NEW.other_column IS DISTINCT FROM OLD.other_column) THEN
        UPDATE example_schema.denorm_examples
           SET parent_other_column = NULL
         WHERE parent_id = NEW.parent_id;
    END IF;
    RETURN NULL;
END;
$BODY$
LANGUAGE plpgsql;

CREATE TRIGGER parent_table_60_cascade
    AFTER UPDATE
    ON example_schema.parent_table
    FOR EACH ROW
    EXECUTE PROCEDURE example_schema.parent_table_cascade();
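For illustration, updating the source column on the parent then cascades to the child (assuming the rows exist):
UPDATE example_schema.parent_table
   SET other_column = 'new value'
 WHERE parent_id = 1;
-- parent_table_60_cascade sets denorm_examples.parent_other_column to NULL,
-- which fires denorm_examples_40_denormalize and re-copies the new value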
[rhunwicks] Start
I need to add the intermediate yaml and the SQL for the remaining examples.
[rhunwicks] End