work

   dbt uses the unique keys from the delta and the snapshot definition to identify what changed, then adds its processing columns (dbt_valid_from, dbt_valid_to, etc.)

23/24 - SCD2

config(
    materialized='delta',  -- table loading mechanism
    unique_key='id',  -- PK of the table
    strategy='check',  -- how new incoming data is handled: 'check' always compares the incoming data against the existing table
    properties={"partitioned_by": "ARRAY['dbt_valid_from']"},  -- any specific properties the table should have
    target_schema='pcc',  -- schema where the table will be deployed
    check_cols='all',  -- which columns are tested in the SCD2 comparison
    invalidate_hard_deletes=True,  -- switches between type 23 and 24; default is False (type 24), True makes it type 23
    updated_at=var('BUSINESS_DATE')  -- sets the 'dbt_valid_from' (start) date; when no value is set it defaults to CURRENT_DATE. Can be passed via dbt config, when calling dbt run, or via Airflow.
)
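
Putting these settings together, a complete snapshot file might look like the sketch below. The snapshot name, source, and columns are assumptions for illustration, and adapter-specific options such as materialized='delta' and properties depend on the adapter in use:

```sql
{% snapshot brkr_snapshot %}

{{
    config(
        unique_key='id',
        strategy='check',
        check_cols='all',
        target_schema='pcc',
        invalidate_hard_deletes=True,
        updated_at=var('BUSINESS_DATE')
    )
}}

-- source query: the current state coming from the source system
select id, name, address
from {{ source('src', 'brkr') }}

{% endsnapshot %}
```

Across runs, dbt maintains dbt_valid_from / dbt_valid_to on this table to track each record's validity period.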

25 - transactional

config(
    materialized='incremental',  -- table loading mechanism
    incremental_strategy='append',  -- how new incoming data is handled: 'append' adds the new data to the existing data; 'truncate+insert' deletes everything and reloads
    properties={"partitioned_by": "ARRAY['dbt_valid_from']"},  -- any specific properties the table should have
    target='pcc',  -- logical tag to group objects together
    updated_at=var('BUSINESS_DATE')  -- sets the 'dbt_valid_from' (start) date; when no value is set it defaults to CURRENT_DATE. Can be passed via dbt config, when calling dbt run, or via Airflow.
)
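
A minimal incremental model using the append strategy could look like this (model and column names are illustrative):

```sql
{{
    config(
        materialized='incremental',
        incremental_strategy='append'
    )
}}

select id, amount, business_date
from {{ source('src', 'payments') }}

{% if is_incremental() %}
-- on incremental runs, only append rows newer than what is already loaded
where business_date > (select max(business_date) from {{ this }})
{% endif %}
```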


================================================

ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
(shorthand: ROWS UNBOUNDED PRECEDING) - used to generate BRKR_ID

SELECT
    max_id.max_brkr_id + SUM(CAST(1 AS INTEGER)) OVER (ROWS UNBOUNDED PRECEDING) AS brkr_id,
    -- assumption: the hub has records; when max_id.max_brkr_id is NULL or negative, the id should default to 0
    src.compliment_key,
    src.tpmi_sao_key
FROM (
    SELECT DISTINCT a.compliment_key, a.tpmi_sao_key
    FROM acq a
    LEFT JOIN vr_h_brkr b
        ON a.compliment_key = b.compliment_key
        AND a.tpmi_sao_key = b.tpmi_sao_key
    WHERE a.dup_sqn = 1
      AND a.src_dlet_flag = 0
      AND b.brkr IS NULL
      -- this identifies brokers present in the source but not in the hub, so the hub will insert these records
      AND (
          SELECT business_date
          FROM gcfr_process
          WHERE process_name = 'tx_h_brkr'
      ) BETWEEN a.src_strt_ts AND a.src_end_ts
) src
CROSS JOIN (
    SELECT COALESCE(MAX(brkr_id), 0) AS max_brkr_id
    FROM vr_h_brkr
) max_id;


HUB - transactional data (25). Satellite - SCD2 (23/24; based on the data in the source, not on dates)

23 --> full

If a previous open record exists, it is changed: its end date (high date 9999) is closed off; otherwise nothing is touched.

A full snapshot means the source table data arrives as an everyday snapshot of the whole table. If some data is missing for the day, it means the data was deleted in the source, and it is marked with del flag = 1 in the target.

Example: we have acct no, name, address. Today one row comes in with values for these attributes; tomorrow the address changes, later something else changes. Each time a new record comes in, the existing record's end date should be closed and the new record inserted.

=> Input - source data for the GCFR full transform process is a full snapshot of the current state from the source.

=> Output - the target table will store data with a start date and end date that represent the valid period of the data.

=> Process - the input data is compared with the open records (end date = high date) in the target; changes are captured and updated into the target table.
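
The compare-and-update step for a full (type 23) load can be sketched in two statements, under assumed names (target tgt, staged snapshot stg, high date 9999-12-31):

```sql
-- 1. Close open target records whose attributes differ from today's snapshot
--    (or which are missing from the snapshot entirely = deleted in source)
UPDATE tgt
SET end_date = CURRENT_DATE - 1
WHERE tgt.end_date = DATE '9999-12-31'
  AND NOT EXISTS (
      SELECT 1 FROM stg
      WHERE stg.acct_no = tgt.acct_no
        AND stg.name    = tgt.name
        AND stg.address = tgt.address
  );

-- 2. Insert snapshot rows that no longer have an open record in the target
INSERT INTO tgt (acct_no, name, address, start_date, end_date)
SELECT stg.acct_no, stg.name, stg.address, CURRENT_DATE, DATE '9999-12-31'
FROM stg
WHERE NOT EXISTS (
    SELECT 1 FROM tgt
    WHERE tgt.acct_no = stg.acct_no
      AND tgt.end_date = DATE '9999-12-31'
);
```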

24 --> delta

The source has only delta data - only new and changed records. Same as full, but data missing from the source will not be marked with del flag = 1 in the target.

=> Input - source data for the GCFR delta transform process is changed data from the source with a change flag (GCFR action flag). The change flags are:

0 - new record

1 - update

2 - delete

=> Output - same as full: the target will store data with a start date and end date that represent the valid period of the data.

=> Process - the GCFR process will apply the changes to the target table according to the change flag.
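
The flag-driven apply step for a delta (type 24) load might be sketched like this, with illustrative names (target tgt, staged delta stg):

```sql
-- Route delta rows by GCFR action flag (0 = new, 1 = update, 2 = delete).

-- updates and deletes both close the current open record
UPDATE tgt
SET end_date = CURRENT_DATE - 1
WHERE tgt.end_date = DATE '9999-12-31'
  AND EXISTS (
      SELECT 1 FROM stg
      WHERE stg.acct_no = tgt.acct_no
        AND stg.gcfr_action_flag IN (1, 2)
  );

-- new records and updates insert a fresh open version
INSERT INTO tgt (acct_no, name, address, start_date, end_date)
SELECT acct_no, name, address, CURRENT_DATE, DATE '9999-12-31'
FROM stg
WHERE gcfr_action_flag IN (0, 1);
```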


25 --> transactional

Previous records never change, and the end date is always 9999 (high date).

Once data flows in, it cannot change. For example, if we make a payment to a vendor now, the transaction time, amount, and details can never change; a record cannot come tomorrow saying something was updated in a previous record, because that is not possible. So any real-life event, transaction, or activity that happens on an account is 'transactional data'.

=> Input - source data for the GCFR transactional transform process is a transaction or event from the source.

=> Output - the target table is a transaction-type table where the start date is set to the business date and the end date to the high date.

=> Process - the GCFR process will insert only new transactions into the target table.
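
An insert-only (type 25) load reduces to a single statement; the names are illustrative:

```sql
-- Type 25: previous rows are never touched; every run only appends new transactions
INSERT INTO tgt_txn (txn_id, acct_no, amount, start_date, end_date)
SELECT s.txn_id, s.acct_no, s.amount,
       s.business_date,      -- start date = business date of the run
       DATE '9999-12-31'     -- end date is always the high date
FROM stg_txn s
WHERE NOT EXISTS (
    SELECT 1 FROM tgt_txn t
    WHERE t.txn_id = s.txn_id   -- skip transactions already loaded
);
```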



================================================

SNAPSHOT TO SCD2

We had full snapshot data loaded from source to target - the full data loaded every day - so storage kept going up. We will convert it to SCD2.

The GCFR tool does the conversion to SCD2 automatically; we need to specify the TX (which tells how to select the data from the source to be loaded into the target) and the kind of load: 23/24 or 25 (full/delta, or transactional).

We created a new table with OPT and two TXs:

1. one taking data from _d for the daily load, and
2. one to load history data, using the full-snapshot table in the TX, where fl_dt is the snapshot date from the source.

Control-M - we created only one stream. With one stream we run both the history TX and the _d TX to load into the same table, because all the details are the same, and GCFR requires the same process name in the table.

Cross join with the calendar to get the previous date, because downstream we are using fl_dt, and our goal is to remove fl_dt: the table should be SCD2, but fl_dt changes every day, so the table treats it as transactional and inserts all the data every day.
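
A sketch of that calendar cross join, assuming illustrative names (fullsnap for the snapshot source; sys_calendar.calendar is Teradata's built-in calendar view):

```sql
-- derive the previous business date once, instead of carrying the
-- ever-changing fl_dt through to the SCD2 compare
SELECT s.acct_no, s.name, s.address,
       c.prev_dt
FROM fullsnap s
CROSS JOIN (
    SELECT MAX(calendar_date) AS prev_dt
    FROM sys_calendar.calendar
    WHERE calendar_date < CURRENT_DATE
) c;
```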





=======================================================

CONTINUOUS IMPROVEMENT

  • Reduce joins as much as possible with CTEs and separate SELECT statements.
  • Add filters on the record-delete flag and on end date (for active records) before joins on the base tables.
  • Collect stats on join, WHERE, and PI columns; use partitioning, compression, and a proper PI.
  • Left join gcfr_process with: AND start_date = (SELECT business_date FROM gcfr_v.gcfr_process_id WHERE prd_name = '')

In one TX we reduced the joins to CTEs: runtime down 82% (from 40 min to 10 min), spool space down 99% (500 GB to 3 GB), CPU time down 87%.
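
The stats bullet in Teradata syntax, with illustrative table and column names:

```sql
-- collect stats on the PI, join, and filter columns the TX uses
COLLECT STATISTICS
    COLUMN (compliment_key),
    COLUMN (tpmi_sao_key),
    COLUMN (brkr_id)
ON vr_h_brkr;
```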
===========================================================

OBA

Airflow DAG and Teradata table













