Teradata
👉 RDBMS is a database system that stores structured data in tables and maintains relationships between them using keys.
Teradata is a massively parallel processing (MPP) relational database management system (RDBMS) designed for large-scale data warehousing and analytics.
It is mainly used by enterprises to store and analyze very large volumes of structured data.
=====================================================================
🔹 Data Warehouse – Definition
A Data Warehouse (DW) is a centralized system used to store large volumes of historical, structured data from multiple sources for reporting and analytics.
👉 It is mainly used for business intelligence (BI) and decision-making.
👉 It is optimized for OLAP (Online Analytical Processing).
🔹 Popular Data Warehouse Platforms
| Teradata |
| Amazon Redshift |
| Snowflake |
| Google BigQuery |
| Azure Synapse Analytics |
=====================================================================
🔹 Data Mart – Definition
A Data Mart is a subset of a Data Warehouse designed for a specific department or business function (like Sales, Finance, HR).
👉 It contains focused, subject-specific data.
👉 Used for faster reporting for a particular team.
🔹 Example
If a company has a Data Warehouse containing all data:
- The Sales team needs only sales data
- The Finance team needs only revenue & expense data
👉 Separate Data Marts are created for each team.
🔹 Types of Data Mart
There are 3 main types of Data Marts:
📌 1️⃣ Dependent Data Mart
| Feature | Description |
|---|---|
| Source | Created from Data Warehouse |
| Data Quality | High (already cleaned & integrated) |
| Control | Centralized |
| Example | Sales Data Mart from Enterprise DW |
👉 Most common in large enterprises
👉 Ensures consistency across departments
📌 2️⃣ Independent Data Mart
| Feature | Description |
|---|---|
| Source | Directly from operational systems |
| Data Quality | May vary |
| Control | Decentralized |
| Example | HR team creates own reporting DB |
👉 Faster to implement
👉 Risk of data inconsistency
📌 3️⃣ Hybrid Data Mart
| Feature | Description |
|---|---|
| Source | Combination of DW + operational systems |
| Flexibility | High |
| Use Case | When some data is in DW and some external |
👉 Balanced approach
🔹 Quick Comparison Table
| Feature | Dependent | Independent | Hybrid |
|---|---|---|---|
| Source | Data Warehouse | Operational systems | Both |
| Data Consistency | High | Medium | High |
| Implementation Speed | Medium | Fast | Medium |
| Best For | Large organizations | Small teams | Mixed requirements |
Relational Database Terminology
🔷 1. Tables
Core structure for storing data.
Consist of rows (records) and columns (fields).
Supports indexing, primary keys, foreign keys, etc.
Column - A column always contains the same kind of information; every value in it describes the same attribute.
Row - A row is one instance of all the columns in the table.
Database - The database is a collection of logically related data. Many users access them for different purposes.
Primary Key - The primary key is used to identify a row in a table uniquely. No duplicate values are allowed in a primary key column, and they cannot accept NULL values. It is a mandatory field in a table.
Foreign Key - Foreign keys are used to build a relationship between the tables. A foreign key in a child table is defined as the primary key in the parent table.
A table can have more than one foreign key. It can accept duplicate values and also null values. Foreign keys are optional in a table.
🔷 2. Views
Virtual tables based on one or more tables or views.
Do not store data physically—just SQL definitions.
Use cases:
Restrict access to certain columns/rows.
Simplify complex joins/queries.
Standardize reporting logic.
X-Views: Special views that only show data specific to the accessing user.
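As a sketch (the table and column names are hypothetical), a view that restricts both columns and rows might look like:

```sql
-- Hypothetical base table: employees(emp_id, emp_name, salary, dept_id)
-- The view hides the salary column and exposes only one department's rows.
CREATE VIEW v_sales_employees AS
SELECT emp_id, emp_name, dept_id
FROM employees
WHERE dept_id = 10;
```

Users granted access to the view can query it without having any rights on the underlying table's sensitive columns.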
🔷 3. Macros
Store predefined SQL commands under a name.
Can accept parameters.
Useful for repeating business logic or securing access.
Example:
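A minimal sketch of a parameterized macro (the emp table and its columns are assumptions for illustration):

```sql
-- Define a macro that accepts a department number as a parameter.
CREATE MACRO get_dept_emps (dept_no INTEGER) AS (
  SELECT emp_id, emp_name
  FROM emp
  WHERE dept_id = :dept_no;   -- :dept_no references the macro parameter
);

-- Execute it:
EXEC get_dept_emps(10);
```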
🔷 4. Triggers
A set of SQL statements that automatically executes in response to data changes (INSERT, UPDATE, DELETE).
Attached to a table.
Used for enforcing rules or auditing changes.
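A hedged sketch of an auditing trigger (both tables and their columns are hypothetical):

```sql
-- Hypothetical tables:
--   employees(emp_id, salary)
--   salary_audit(emp_id, old_salary, new_salary, changed_at)
CREATE TRIGGER trg_salary_audit
AFTER UPDATE OF (salary) ON employees
REFERENCING OLD AS o NEW AS n
FOR EACH ROW
(
  -- Record the before and after salary whenever a row is updated.
  INSERT INTO salary_audit
  VALUES (n.emp_id, o.salary, n.salary, CURRENT_TIMESTAMP);
);
```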
🔷 5. Stored Procedures
A program written in SQL and stored on the database.
Can contain control-of-flow logic (IF, LOOP, etc.).
Can be scheduled, parameterized, or reused.
Example use case: ETL workflows, validations, auditing.
Supports error handling and transaction control.
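A sketch of a stored procedure with a parameter and simple flow control (table and column names are assumptions):

```sql
CREATE PROCEDURE update_bonus (IN p_dept INTEGER, OUT p_rows INTEGER)
BEGIN
  -- Give a 10% bonus to everyone in the requested department.
  UPDATE employees
  SET bonus = salary * 0.10
  WHERE dept_id = :p_dept;

  -- ACTIVITY_COUNT holds the number of rows affected by the last statement.
  SET p_rows = ACTIVITY_COUNT;
END;
```

It would be invoked with `CALL update_bonus(10, cnt);`.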
🔷 6. User-Defined Functions (UDFs)
Custom functions written in C/C++ or SQL.
Types:
Scalar UDFs – return a single value.
Aggregate UDFs – return a summarized value.
Table UDFs – return a table.
Used for:
Complex calculations.
Reusability.
Protecting PII (Personally Identifiable Information).
🔷 7. Indexes
Used to optimize data access.
Types:
Primary Index (UPI/NUPI) – determines data distribution.
Secondary Index – improves access for non-primary keys.
Join Index – speeds up joins.
Hash Index – supports fast lookups (less common now).
🔹 Teradata Components
Teradata Corporation built Teradata using MPP (Massively Parallel Processing) architecture.
Teradata works using multiple parallel components.

🔹 1️⃣ Parsing Engine (PE)
The Parsing Engine is a virtual processor (vproc) that interprets SQL requests, builds an execution plan, and dispatches the work. It sends its messages over the BYNET to the AMPs (Access Module Processors).
| Function | Explanation |
|---|---|
| SQL Parsing | Checks syntax |
| Optimization | Creates execution plan |
| Session Management | Handles user sessions |
| Request Distribution | Sends work to AMPs |
👉 PE does NOT store data.
🔹 2️⃣ BYNET
This is the message-passing layer or simply the networking layer in Teradata. It receives the execution plan from the parsing engine and passes it to AMPs and the nodes. After that, it gets the processed output from the AMPs and sends it back to the parsing engine.
BYNET is Teradata’s high-speed interconnect network that allows nodes, PEs, and AMPs to communicate internally in a Massively Parallel Processing (MPP) environment.
To maintain availability, two BYNETs (BYNET 0 and BYNET 1) run in parallel, so a secondary BYNET is always available if the primary one fails.
| Feature | Explanation |
|---|---|
| High-speed network | Connects nodes |
| Message passing | Transfers data between AMPs |
| Parallel communication | Enables MPP processing |
👉 Acts like backbone network.
🔹 3️⃣ AMP (Access Module Processor)
These are the virtual processors of Teradata. They receive the execution plan and the data from the parsing engine. The data will undergo any required conversion, filtering, aggregation, sorting, etc., and will be further sent to the corresponding disks for storage.
The AMP is a virtual processor (vproc) designed for and dedicated to managing a portion of the entire database. It performs all database management functions such as sorting, aggregating, and formatting data. The AMP receives data from the PE, formats rows, and distributes them to the disk storage units it controls. The AMP also retrieves the rows requested by the Parsing Engine.
Table rows are distributed across the AMPs for storage. Each AMP can read and write only the disks it owns (a shared-nothing design).
| Function | Explanation |
|---|---|
| Data Storage | Stores actual table data |
| Data Retrieval | Fetches rows |
| Index Handling | Manages primary index |
| Parallel Processing | Each AMP works independently |
👉 Data is distributed across AMPs.
🔹 4️⃣ Node
A Node is a physical server in the Teradata system. It includes both hardware and software components that support the execution of Teradata’s parallel database processing.
| Feature | Description |
|---|---|
| Physical Server | Runs Teradata software |
| Contains | Multiple AMPs + PE |
| Scalable | Add nodes to scale system |
🔹 Parallel Database Extensions (PDE)
The Parallel Database Extensions is a software interface layer that lies between the operating system and database. It controls the virtual processors (vprocs).
🔹 Disks
Disks are disk drives associated with an AMP that store the data rows. On current systems, they are implemented using a disk array.
🔄 Query Execution Flow in Teradata
Query Submission
A user or application submits an SQL query.
Parsing Engine (PE)
Parser: Checks SQL syntax and semantics.
Security: Validates user access.
Optimizer: Generates the most cost-effective execution plan.
Dispatcher: Sends the execution steps to BYNET.
BYNET
Routes the steps from the PE to the appropriate AMPs.
AMPs (Access Module Processors)
Execute the steps (e.g., read/write data).
Retrieve or update data from their own disks.
Send results back to the PE.
PE (again)
Assembles the rows.
Sends the final result back to the client.
=====================================================================
🚀 How Teradata Distributes Rows Across AMPs
In Teradata, row distribution is fully automatic, even, and hash-based. The core idea is that each AMP (Access Module Processor) is responsible for a portion of each table’s data, and Teradata ensures the rows are distributed evenly using a hashing algorithm.
🔁 Hash-Based Distribution Mechanism
When a row is inserted into a table:
Teradata hashes the Primary Index column(s) value.
The resulting hash value is mapped to an AMP.
That AMP stores the row on its associated disk.
📌 Hashing is done on Primary Index (PI) — it’s how Teradata determines where to store a row.
🎯 Why Even Distribution Matters
Ensures parallelism is fully utilized.
Avoids skew (when some AMPs hold significantly more rows than others).
Prevents slowest AMP bottleneck in full table operations.
✅ Ideally:
Each table's rows are evenly distributed across all AMPs.
All AMPs perform roughly equal work on large queries.
🔄 Row Distribution Mechanism
Teradata uses a hashing algorithm on the PI column(s).
Hash result determines the AMP where the row will be stored.
Leads to parallelism across AMPs.
🔄 Row Distribution via Hashing in Teradata
📥 How Data Gets Distributed Across AMPs
When a row is inserted into a Teradata table:
The Primary Index (PI) value of the row is input into a hashing algorithm.
The algorithm outputs a 32-bit Row Hash.
The high-order 16 bits are used to find a Hash Bucket Number.
The Hash Map uses that bucket number to determine which AMP will store the row.
The AMP stores the row in its associated vDisk.
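The steps above can be observed directly: Teradata exposes hash functions that show, for any PI value, the row hash, hash bucket, and target AMP (the customer table and cust_id column here are hypothetical):

```sql
SELECT cust_id,
       HASHROW(cust_id)                       AS row_hash,    -- 32-bit row hash
       HASHBUCKET(HASHROW(cust_id))           AS hash_bucket, -- bucket number
       HASHAMP(HASHBUCKET(HASHROW(cust_id)))  AS target_amp   -- AMP that stores the row
FROM customer
SAMPLE 5;
```

Grouping by `target_amp` over a whole table is a common way to check for skew.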
=====================================================================
🔹 Index Types in Teradata
In Teradata, indexes are mainly used for data distribution and faster retrieval.
1️⃣ PI (Primary Index)
The Primary Index (PI) is the core mechanism for distributing rows across AMPs and retrieving data efficiently.
The Primary Index decides:
- Where the data is stored (which AMP)
- How data is distributed across AMPs
👉 Every table MUST have one PI (unless it is defined as a NoPI table)
👉 It does NOT have to be unique
✅ Key Characteristics
| Attribute | Description |
|---|---|
| Purpose | Determines how rows are distributed and accessed |
| When defined? | At CREATE TABLE time only – cannot be altered later |
| Max Columns | Can consist of 1 to 64 columns |
| Stored in | Data Dictionary (DD) |
🧠 Best Practices for Choosing a PI
Choose a highly unique column to avoid skew (UPI preferred).
Consider query access patterns (frequent filters, joins).
Try to align PI with foreign keys in child tables for AMP-local joins.
Avoid columns with few distinct values (e.g., Gender, Status) as PI → causes skew.
📈 Why PI Matters for Performance
Every row access by PI is a 1-AMP operation → fastest.
Good PI choice = even AMP workload → maximum parallelism.
Bad PI choice = skew → one AMP overworked, others underutilized.
🚫 Common Mistakes
Using low cardinality columns as PI (e.g., "India", "Yes/No").
Forgetting that PI ≠ PK in Teradata.
Choosing PI without analyzing access patterns.
🔹 2️⃣ UPI (Unique Primary Index)
A Primary Index where column values are unique.
👉 No duplicate values allowed
👉 Best performance
👉 Even data distribution
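A UPI is declared at CREATE TABLE time; a minimal sketch with hypothetical names:

```sql
-- UPI: unique values, best (even) distribution
CREATE TABLE employees
( emp_id   INTEGER
, emp_name VARCHAR(50)
, dept_id  INTEGER )
UNIQUE PRIMARY INDEX (emp_id);

-- A NUPI on dept_id would instead be declared as:
--   PRIMARY INDEX (dept_id)   -- duplicates allowed; skew risk if few distinct values
```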
🔹 3️⃣ NUPI (Non-Unique Primary Index)
Primary Index where duplicates are allowed.
The hashing algorithm still determines the target AMP, but:
All rows with the same NUPI value go to the same AMP.
This can lead to uneven (skewed) distribution.
Example: Skew from Poor NUPI Choice
If you choose Order_Status (values = "C", "O") as a NUPI:
- All rows hash to just 2 AMPs.
- The remaining AMPs stay idle → severe performance degradation.
👉 Most commonly used
👉 Can cause data skew if not chosen properly
🔹 4️⃣ SI (Secondary Index)
A Secondary Index (SI) is an additional index created on a column other than the Primary Index to improve query performance.
A Secondary Index (SI) in Teradata is a physical access path to the data that is separate from the Primary Index (PI). It is used to improve query performance when the query's search condition does not use the PI columns.
👉 It does NOT control data distribution
👉 It helps in faster data retrieval
👉 It uses extra storage
🔍 Why Use a Secondary Index?
When queries frequently use non-PI columns in WHERE conditions.
To avoid full table scans.
To reduce query response time.
To optimize joins or aggregations on non-PI columns.
Additional index created for faster search on non-PI columns.
- Optional (a table can have zero or more SIs)
- Used for non-PI columns
- Improves read performance
- Requires extra storage
- Increases write overhead
🚦 Performance Tips
USI: Great for point queries (e.g., WHERE emp_name = 'Alice')
NUSI: Better for range queries (e.g., WHERE salary BETWEEN 50000 AND 70000)
Too many SIs can increase storage and slow down DML operations like INSERT/UPDATE.
🛠️ How It Works
When you define a SI, Teradata creates a sub-table that holds:
The index value.
The row ID (ROWID) of the corresponding data row in the base table.
For USI, the sub-table is hash-distributed.
For NUSI, the sub-table is locally stored on the same AMP as the base table row (no redistribution).
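As a sketch (the table and column names are assumptions), secondary indexes are created with CREATE INDEX:

```sql
-- USI: enforces uniqueness; typically a 2-AMP access path
CREATE UNIQUE INDEX (email_id) ON customer;

-- NUSI: duplicates allowed; subtable rows kept AMP-locally
CREATE INDEX (status) ON orders;
```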
🔹 5️⃣ USI (Unique Secondary Index)
A Unique Secondary Index (USI) is a secondary index created on a column where duplicate values are NOT allowed.
👉 Ensures uniqueness
👉 Used for fast lookup
👉 Stored in a separate index subtable
👉 Secondary index with unique values
👉 Stored in separate subtable
👉 Very fast lookup
🔹 How USI Works Internally
1. User runs query:
SELECT * FROM customer WHERE email_id = 'rk@gmail.com';
2. Teradata hashes the USI column value
3. Goes to the specific AMP storing the USI subtable row
4. Finds the RowID
5. Fetches the actual row from the base table
👉 Very fast because the value is unique (at most one row qualifies).
🔹 6️⃣ NUSI (Non-Unique Secondary Index)
A Non-Unique Secondary Index (NUSI) is a secondary index created on a column where duplicate values are allowed.
👉 Used for filtering
👉 Improves SELECT performance
👉 Does NOT control data distribution
👉 Allows duplicates
👉 Stored on same AMP
👉 Helps filtering
🔹 How NUSI Works Internally
-
Query runs:
SELECT * FROM orders WHERE status = 'SHIPPED'; -
Teradata checks NUSI subtable
-
Finds list of RowIDs for matching rows
-
Fetches rows from base table
👉 Since duplicates exist, multiple rows are returned.
✅ What is a NoPI Table?
A NoPI (No Primary Index) Table is a table that does not have a defined Primary Index. This was introduced in Teradata 13.0 to improve bulk load performance and simplify certain workloads.
CREATE TABLE table_x
(col_x INTEGER
,col_y CHAR(10)
,col_z DATE)
NO PRIMARY INDEX;

📦 Why Use NoPI Tables?
Ideal for staging or landing tables in ETL pipelines.
Great for bulk loading large volumes of data quickly.
Often used temporarily before redistributing data into final structured tables.
🧠 Best Practices:
Use NoPI tables for staging or intermediate ETL steps.
Avoid long-term use unless you really do not need joins or indexed access.
After loading, redistribute data into base tables with a PI for performance.
=====================================================================
🔀 Partitioned Primary Index (PPI)
🔹 What is PPI?
PPI is a Teradata feature that divides rows into partitions based on a defined column or expression, while still distributing rows across AMPs based on the Primary Index (PI).
👉 Data is distributed across AMPs (like normal PI)
👉 Within each AMP, data is divided into partitions
👉 Improves range query performance
Key Benefits:
Improves query performance for range-based queries.
Allows partition elimination, where irrelevant partitions are excluded at query time.
Ideal for incremental loads, archiving, and deletions.
⚙️ How PPI Works:
Hashing for Distribution:
PI → Hash Value → Determines which AMP stores the row.
Partitioning for Storage Order:
Once inside an AMP, rows are grouped into partitions.
Within each partition, data is sorted by Row Hash (row ID).
Partition Elimination:
The optimizer analyzes query filters (e.g., WHERE order_date BETWEEN …) and skips partitions that won’t qualify. This reduces I/O and scan time dramatically.
SELECT * FROM sales WHERE order_date = DATE '2024-05-10';
👉 Only May partition scanned
👉 Faster query
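The table behind such a query could be declared with monthly partitions like this (a hypothetical sales table):

```sql
CREATE TABLE sales
( order_id   INTEGER
, order_date DATE
, amount     DECIMAL(10,2) )
PRIMARY INDEX (order_id)
PARTITION BY RANGE_N (
  order_date BETWEEN DATE '2024-01-01' AND DATE '2024-12-31'
  EACH INTERVAL '1' MONTH );
```

Rows are still hashed to AMPs by `order_id`; within each AMP they are grouped into one partition per month, so a filter on `order_date` scans only the qualifying partition.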
🔹 When NOT to Use PPI
❌ Small tables
❌ Queries without filter on partition column
❌ Very high insert workloads (slightly more overhead)
🌲 Multilevel Partitioned Primary Index (MLPPI)
🔹 What is MLPPI?
MLPPI allows sub-partitioning — i.e., a partition within a partition — for greater granularity and performance.
Example Use Case:
Partition level 1: by Claim_Date
Sub-partition: by State
This improves query efficiency when filters match both levels, enabling multi-level partition elimination.
In Teradata, when using Multi-Level Partitioned Primary Index (MLPPI), you can define the partitions using either:
RANGE_N — for range-based partitioning (like dates, numbers)
CASE_N — for custom logic or discrete values
🧩 Use RANGE_N When:
You are partitioning by a range of values (e.g., dates or numeric ranges).
Best for time-series data, like logs or transaction tables.
🧩 Use CASE_N When:
You want to partition using categorical values, custom conditions, or non-uniform ranges.
Good when data doesn't fit into neat, equal ranges.
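A hedged MLPPI sketch combining both forms (the claims table and its columns are assumptions):

```sql
-- Level 1: monthly ranges via RANGE_N; Level 2: state buckets via CASE_N.
CREATE TABLE claims
( claim_id   INTEGER
, claim_date DATE
, state_cd   CHAR(2)
, amount     DECIMAL(12,2) )
PRIMARY INDEX (claim_id)
PARTITION BY (
  RANGE_N (claim_date BETWEEN DATE '2024-01-01' AND DATE '2024-12-31'
           EACH INTERVAL '1' MONTH),
  CASE_N (state_cd = 'CA', state_cd = 'TX', state_cd = 'NY', NO CASE) );
```

A query filtering on both `claim_date` and `state_cd` can then eliminate partitions at both levels.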
=====================================================================
Column Partitioning (also called Columnar or CPPI – Column-Partitioned Primary Index) stores columns separately instead of storing full rows together.
👉 Traditional Teradata = Row-based storage
👉 Column Partitioning = Column-based storage
👉 Improves performance for analytical queries
It can be applied to:
Tables (especially NoPI tables)
Join Indexes
You can also combine Column Partitioning with Row Partitioning (MLPPI) for a multi-dimensional partitioning strategy.
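A sketch of a column-partitioned table (names are illustrative; column partitioning applies to NoPI tables):

```sql
CREATE TABLE web_clicks
( click_ts TIMESTAMP(0)
, user_id  INTEGER
, url      VARCHAR(500) )
NO PRIMARY INDEX
PARTITION BY COLUMN;
```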
🔹 Normal Table (Row Storage)
If the table has 20 columns:
SELECT col1, col2 FROM big_table;
👉 Teradata reads all 20 columns from disk
👉 Higher I/O
🔹 Column Partitioning
👉 Data stored column-wise
👉 Only required columns read
SELECT col1, col2 FROM big_table;
👉 Only col1 and col2 scanned
👉 Reduced I/O
👉 Faster query
🔹 Row vs Column Partitioning
| Feature | Row-Based | Column Partitioning |
|---|---|---|
| Storage style | Row | Column |
| Best for | OLTP | Analytics |
| I/O for few columns | High | Low |
| Insert performance | Faster | Slightly slower |
🔹 Limitations
❌ Slightly slower inserts
❌ Not ideal for transactional workloads
❌ More complex storage structure
=========================================================================
🔹 Join Index (JI) in Teradata
A Join Index is a system-maintained physical structure that stores pre-joined or pre-aggregated data from one or more tables to speed up query performance.
👉 Reduces repeated join computation
👉 Stored physically like a table
👉 Can be row-based or columnar (Columnar Join Index = CJI)
Eliminate base table joins
Reduce data redistribution
Avoid aggregate processing
Improve query response time
🔷 1. What is a Single-Table Join Index (STJI)?
A Single-Table Join Index is a special type of join index that:
Is defined on only one table.
Uses a different column as its Primary Index (PI) than the base table’s PI.
Helps optimize joins between foreign keys and primary keys—especially in star schema or snowflake models.
Avoids row redistribution and base table access during queries.
🧩 Why Use STJI?
Teradata stores rows based on the PI. If you frequently join on a non-PI column (e.g., a foreign key), Teradata may redistribute data to perform the join.
✅ STJI prevents redistribution by pre-sorting and rehashing rows using the foreign key (or any chosen column).
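A sketch of such an index (assuming an orders table with PI order_id, a customer_id foreign key, and an order_amt column):

```sql
CREATE JOIN INDEX ji_orders_by_cust AS
SELECT customer_id, order_id, order_amt
FROM orders
PRIMARY INDEX (customer_id);
```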
What This Does:
Creates a copy of the orders table, hashed on customer_id instead of order_id.
When a query joins orders.customer_id to customers.customer_id, Teradata can use this index.
No data redistribution is needed because both tables are now hashed on the same column.
🔹 2. Multi-Table Join Index (MTJI) in Teradata
📌 Definition
A Multi-Table Join Index (MTJI) is a Join Index that combines two or more base tables into a precomputed, stored join.
👉 Speeds up queries that join multiple tables frequently
👉 Can be row-based or columnar
👉 Stored physically like a table
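A hedged sketch prejoining three hypothetical tables that are frequently queried together:

```sql
CREATE JOIN INDEX mtji_order_detail AS
SELECT o.order_id, c.cust_name, p.product_name, o.amount
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
JOIN products  p ON o.product_id  = p.product_id
PRIMARY INDEX (order_id);
```

When a user query matches this join, the optimizer can read the index instead of re-executing the three-way join.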
🔹 Advantages
✅ Speeds up multi-table queries
✅ Reduces CPU & AMP load
✅ Precomputed join simplifies SQL
✅ Good for data warehouse reporting
🔹 Disadvantages
❌ Uses extra disk space
❌ Slower inserts/updates/deletes on base tables (index maintenance)
❌ Needs careful design to select only necessary columns
🔹 3. Aggregate Join Index (AJI) in Teradata
An Aggregate Join Index (AJI) is a precomputed, stored table that combines aggregation and joins.
👉 Optimized for analytical queries
👉 Stores aggregated values (SUM, COUNT, AVG, etc.)
👉 Can be row-based or columnar
👉 Reduces query processing time on large fact tables
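A sketch of an AJI that precomputes daily totals (the sales table is hypothetical; aggregate expressions in an AJI must be aliased):

```sql
CREATE JOIN INDEX aji_daily_sales AS
SELECT sale_date, SUM(amount) AS total_amount
FROM sales
GROUP BY sale_date
PRIMARY INDEX (sale_date);
```

Queries summing `amount` by `sale_date` can then be answered from the index without scanning the fact table.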
🔹 Advantages
✅ Improves query performance for aggregation queries
✅ Reduces CPU and I/O on large tables
✅ Simplifies complex queries for users
✅ Can be combined with columnar storage for further optimization
🔹 Disadvantages
❌ Extra storage required
❌ Insert/Update/Delete slower due to maintenance
❌ Needs careful selection of aggregation columns
🔹 4. Columnar Join Index (CJI) in Teradata
A Columnar Join Index (CJI) is a precomputed, columnar storage index based on a join of one or more tables to improve query performance for analytical workloads.
👉 Combines benefits of Join Index (prejoin data) and Columnar storage (column-based access)
👉 Speeds up queries with aggregations, filters, and joins
🔹 How Columnar Join Index Works
1️⃣ Base tables: sales and customer
2️⃣ Query often executed:
SELECT c.region, SUM(s.amount)
FROM sales s
JOIN customer c
ON s.cust_id = c.cust_id
GROUP BY c.region;
3️⃣ CJI created:
CREATE JOIN INDEX cji_sales_customer AS
SELECT s.cust_id, s.amount, c.region
FROM sales s
JOIN customer c
ON s.cust_id = c.cust_id
PRIMARY INDEX (cust_id)
PARTITION BY COLUMN;
- Stores only the needed columns (amount and region)
- Precomputed join stored in columnar format
- Queries hitting the CJI skip join computation
🔹 Advantages
✅ Reduces query I/O
✅ Speeds up aggregation & join queries
✅ Only scans required columns
✅ Reduces CPU and AMP workload
🔹 Disadvantages
❌ Requires extra disk space
❌ Insert/Update/Delete may have slight overhead
❌ Needs careful column selection
🔍 What is a Full Table Scan (FTS)?
A Full Table Scan happens when Teradata (or any database engine) reads every row in a table to find rows that match a query’s condition—because no index or partitioning strategy can narrow it down.
⚠️ This is often the least efficient way to access data, especially for large tables.
❌ What happens here?
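For example, with a hypothetical employees table whose PI is emp_id:

```sql
-- PI is on emp_id; filtering on a non-indexed column forces a full table scan.
SELECT *
FROM employees
WHERE dept_id = 10;
```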
The Primary Index (PI) is on emp_id, but the query filters on dept_id, which is not indexed.
Result: Teradata must read every row to find those where dept_id = 10 → Full Table Scan.
=========================================================================
✅ Best Practices
Choose PKs based on business rules and uniqueness.
Choose PIs based on query patterns, join usage, and data distribution.
Don't always assume PK = PI. Analyze based on performance needs.
=====================================================================
🔹 Block-Level Compression (BLC)
| Feature | Description |
|---|---|
| Works at | Data block level |
| Compresses | Entire block of rows |
| Best For | Large tables |
| Enabled | Automatically in newer versions |
=====================================================================
🔹 Space in Teradata
In Teradata, Space means the disk storage allocated to a database or user to store tables, indexes, temporary data, etc.
👉 Space is allocated at Database/User level
👉 Managed automatically across AMPs
👉 Measured in bytes (KB, MB, GB)
🔍 1. Perm Space
Permanent Space (Perm Space) is the disk space allocated to store permanent database objects like tables, indexes, views, macros, etc.
👉 Stores actual business data
👉 Not automatically cleared
👉 Allocated to Database/User
👉 Distributed across AMPs
Definition: Maximum allowed space a user/database can occupy permanently.
Not pre-allocated, only consumed as needed.
Used for:
Table data (rows)
Stored procedures
User-Defined Functions (UDFs)
Required for creating tables.
| Feature | Description |
|---|---|
| Stores | Tables, indexes, views |
| Allocated to | Database/User |
| Remains until | Object is dropped |
| Most important | Yes |
🔍 2. Spool Space
Spool Space is temporary disk space used by Teradata to store intermediate results while executing a query.
👉 Used during SELECT, JOIN, GROUP BY, ORDER BY
👉 Automatically cleared after query finishes
👉 If exceeded → query fails
Definition: Temporary space used during SQL execution.
Used for:
Sorting
Joining
Aggregation
Storing query results
Reclaimed automatically when the query ends.
Inherits limits from parent user/database unless explicitly set.
📌 Example:
You have 100 GB total disk space.
10 GB used for user data
10 GB for overhead
❓ Available spool space?
✅ Answer: 80 GB
| Feature | Description |
|---|---|
| Used for | Intermediate query results |
| Automatically freed | After query ends |
| Very important | Yes (query may fail if exceeded) |
🔍 3. Temp Space
Temp Space (Temporary Space) is the disk space used to store Global Temporary Tables (GTTs) during a user session.
👉 Used only for temporary tables
👉 Data is session-specific
👉 Automatically cleared after session ends
Tables created in Temp Space survive restarts (not the data).
Treated like Perm Space, but temporarily allocated for session data.
| Feature | Description |
|---|---|
| Used for | Global Temporary Tables |
| Released | At session end |
| Stored in | Temp Space allocation |
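A sketch of a Global Temporary Table (names are illustrative); the definition persists in the dictionary while the rows live in Temp Space for the session:

```sql
CREATE GLOBAL TEMPORARY TABLE gtt_stage
( id  INTEGER
, val VARCHAR(100) )
ON COMMIT PRESERVE ROWS;  -- keep rows until the session ends, not just the transaction
```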
🔍 4. Journal Space
Journal Space is the disk space used to store before and after images of rows for data recovery purposes.
👉 Helps in recovering data if transaction fails
👉 Used for rollback and permanent journal
| Feature | Description |
|---|---|
| Used for | Recovery |
| Stores | Before/After image of rows |
| Optional | Can be disabled |
=====================================================================
📌 What is COLLECT STATISTICS in Teradata?
COLLECT STATISTICS (a.k.a. COLLECT STATS) gathers data demographics (like row count, value distribution, uniqueness) on columns, indexes, or partitions.
Collect Statistics is a process in Teradata that gathers metadata about table columns or indexes to help the optimizer choose the best execution plan.
✅ These stats are used by the Teradata Optimizer to:
Choose optimal query plans
Decide whether to use an index or full table scan
Improve join strategies, aggregations, and access paths
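Typical COLLECT STATISTICS forms look like this (the sales table and its columns are assumptions):

```sql
-- Single column
COLLECT STATISTICS COLUMN (cust_id) ON sales;

-- Multi-column (helps filters/joins that use both columns together)
COLLECT STATISTICS COLUMN (cust_id, order_date) ON sales;

-- Index statistics
COLLECT STATISTICS INDEX (order_id) ON sales;
```

Re-running the same statement refreshes the statistics after significant data changes.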
💡 Why It Matters
Without stats, Teradata assumes uniform distribution and guesses row counts, which can result in:
Wrong join order
Full table scans
Skewed AMP workload
Longer response times
🔹 Why We Use Collect Statistics
| Scenario | Benefit |
|---|---|
| Joins on large tables | Helps optimizer choose best join method (merge/hash/nested) |
| Aggregation queries | Chooses fastest access path |
| Skewed data | Helps optimizer avoid skew |
| Where / Group By / Order By | Provides accurate row estimates |
| Indexes | Helps optimizer decide which index to use |
🔹 How It Helps Optimizer
-
Determines row count and value distribution
-
Helps decide join strategy:
-
Merge Join
-
Hash Join
-
Product Join
-
-
Helps in AMP assignment
-
Reduces spool space usage
📊 What Does It Collect?
Distinct values count
Null count
Min/Max values
Frequency histogram
Row count
Skew factor
These help the optimizer estimate:
Selectivity (how many rows a filter returns)
Join cardinality
Join cost and order
Whether to use an index, join index, or full table scan
=====================================================================
🔐 Teradata Data Protection Features
Teradata provides multi-layered protection, both at the system level (hardware & OS) and database level (AMP-level redundancy, journaling, and transaction integrity).
🔁 Real-world analogy
Imagine a shopping mall with multiple backup generators (nodes) and alternate power cables (buses). If one cable or generator fails, power (data processing) continues seamlessly.
🗃️ DATABASE-LEVEL PROTECTION
1. Fallback
Each row stored twice on different AMPs (primary + fallback copy).
Fallback is a data protection feature in Teradata that ensures AMP-level fault tolerance by keeping a duplicate copy of each row of a table on a different AMP in the same cluster.
If one AMP goes down, fallback AMP serves the data.
✅ Transparent to users, but requires:
2x disk space
Extra I/O for write operations
🔁 Real-world analogy:
Imagine every important document is kept in two different filing cabinets. If one cabinet is lost in a fire (AMP failure), the second one keeps business running.
📦 How It Works:
For every row inserted into a Fallback table, a second copy is created.
The primary row goes to one AMP (based on hashing).
The fallback row is stored on a different AMP in the same cluster.
If an AMP fails, queries will automatically read from the fallback AMP.
🧠 Example:
Suppose a table employees is defined with Fallback.
Row emp_id = 101 hashes to AMP 1 → the primary row is stored there.
Teradata automatically stores a fallback copy on AMP 2 (same cluster).
If AMP 1 fails, Teradata accesses the copy on AMP 2 seamlessly — no data loss and no user disruption.
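Fallback is declared as a table option; a minimal sketch with hypothetical names:

```sql
CREATE TABLE employees ,FALLBACK
( emp_id   INTEGER
, emp_name VARCHAR(50) )
PRIMARY INDEX (emp_id);
```

Omitting the option (or specifying NO FALLBACK) stores only the primary copy of each row.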
2. Transient Journal (TJ)
The Transient Journal (TJ) is a temporary, automatic rollback mechanism used to protect data integrity during a transaction.
Keeps before images of rows during transactions.
To ensure atomicity in transactions — either all changes are committed, or none (rollback if failure occurs).
If transaction fails → rollback to original state.
🧩 How it works:
During a transaction, before any row is changed, the "before image" of the row is saved in the Transient Journal on the same AMP.
If the transaction succeeds (reaches END TRANSACTION), the saved copies are discarded.
If the transaction fails (e.g., AMP crash, session drop), the changes are rolled back using the saved before images.
✅ Fully automatic and transparent to the user.
3. Down AMP Recovery Journal
Activated when an AMP goes down.
Tracks changes that would have affected that AMP.
Ensures no data is lost during recovery.
4. Permanent Journal (Optional)
Logs before/after images permanently for auditing or full rollback.
Useful for regulatory or historical tracking.
5. Locks
Ensure data consistency.
Prevents multiple users from changing same data simultaneously.
🔐 Types: Read lock, Write lock, Exclusive lock, etc.
🛡️ What is RAID?
RAID (Redundant Array of Independent Disks) is a hardware-level protection mechanism that improves data availability and fault tolerance by distributing or duplicating data across multiple disk drives.
RAID is different from Fallback, which is a Teradata software-level protection strategy.
🔰 Types of RAID Supported in Teradata
🔁 RAID 1 – Mirroring
How it works: Every disk has an exact mirror copy on another disk.
Redundancy: 100%
If one disk fails: The mirrored disk is used — no performance loss.
Disk Space Cost: High (needs 2× storage).
Performance: Excellent, especially for read operations.
✅ Recommended for high-performance, critical systems.
🧮 RAID 5 – Parity-based
How it works: Data and parity blocks are striped across the disks in a rank, with the parity rotating among them; the parity allows any one missing block to be reconstructed.
Redundancy: Can tolerate one disk failure per rank.
If a disk fails: Data is recalculated using parity — performance decreases slightly.
Disk Space Cost: Low (more efficient than RAID 1).
Performance: Good under normal conditions, slower during recovery.
✅ Used where cost-efficiency is more important than top-tier speed.
🧮 RAID S – (RAID 5 variant used with EMC)
Similar to RAID 5
Found in EMC systems
Parity and striping logic is optimized for EMC array design
📘 What is BTEQ in Teradata?
BTEQ (Basic Teradata Query) is a command-line utility used to submit SQL to Teradata interactively or in batch scripts, with basic import/export and report-formatting capabilities.
🚀 When to Use BTEQ
Running SQL jobs in batch mode (cron jobs or scheduled tasks)
Exporting large volumes of data from Teradata
Loading data quickly into staging tables
Writing ETL-style scripts
Running SQL scripts with conditional or iterative logic
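A minimal batch script sketch (system name, credentials, file names, and the employee table are placeholders):

```sql
.LOGON tdprod/rk_user,mypassword;      -- placeholder system/credentials

-- Export a simple report to a flat file
.EXPORT REPORT FILE = emp_report.txt;
SELECT emp_id, name FROM employee ORDER BY emp_id;
.EXPORT RESET;

-- Conditional logic: abort the job if the previous statement failed
.IF ERRORCODE <> 0 THEN .QUIT 8;

.LOGOFF;
.QUIT 0;
```

Saved as a script file, this can run in batch mode, e.g. `bteq < job.bteq > job.log`.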
==========================================================================
🔹 FastLoad – Key Characteristics
- Loads only empty tables: Target table must be empty before loading.
- Supports only INSERT operations: Cannot perform updates or deletes.
- Requires no secondary indexes or referential integrity on target table.
- Uses multiple sessions (parallelism) for speed.
- Supports error handling with error tables.
- Supports checkpoints for recovery.
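The characteristics above show up directly in a FastLoad script sketch (system name, credentials, staging table, and input file are assumptions):

```sql
LOGON tdprod/rk_user,mypassword;          -- placeholder credentials

BEGIN LOADING staging_db.sales_stg        -- target must be an empty table
   ERRORFILES staging_db.sales_err1, staging_db.sales_err2   -- error tables
   CHECKPOINT 100000;                     -- checkpoint every 100k rows

SET RECORD VARTEXT ",";                   -- comma-delimited input
DEFINE in_id  (VARCHAR(10)),
       in_amt (VARCHAR(20))
   FILE = sales.csv;

INSERT INTO staging_db.sales_stg (sale_id, amount)
VALUES (:in_id, :in_amt);                 -- INSERT only; no UPDATE/DELETE

END LOADING;
LOGOFF;
```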
🔹 Table Constraints
Constraints are rules applied on tables or columns to enforce data integrity in Teradata.
👉 Ensures accuracy, consistency, and validity of data
👉 Supported mostly for Primary Indexes, Primary Keys, Foreign Keys, and Check conditions
In Teradata, a table constraint is a rule applied at the table level (not just column level) to enforce data integrity. Table constraints are typically used for:
Composite primary keys
Composite unique constraints
Foreign key constraints referencing multiple columns
Check constraints on multiple columns
1. Composite PRIMARY KEY (Table-level)
This ensures each (order_id, product_id) pair is unique.
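A sketch of the statement described above (table and column names are illustrative):

```sql
CREATE TABLE order_items (
   order_id   INTEGER NOT NULL,
   product_id INTEGER NOT NULL,
   quantity   INTEGER,
   PRIMARY KEY (order_id, product_id)   -- table-level composite key
);
```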
2. Composite UNIQUE Constraint
Ensures no two employees have the same email + phone combo.
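A possible sketch (names illustrative; Teradata requires NOT NULL on unique-constraint columns):

```sql
CREATE TABLE employee_contacts (
   emp_id INTEGER NOT NULL,
   email  VARCHAR(100) NOT NULL,
   phone  VARCHAR(20)  NOT NULL,
   UNIQUE (email, phone)    -- table-level composite unique constraint
);
```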
3. Table-Level FOREIGN KEY Constraint
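A sketch of a multi-column foreign key (both tables are hypothetical; the referenced columns must form the parent's primary key or a unique constraint):

```sql
CREATE TABLE orders (
   order_id INTEGER     NOT NULL,
   region   VARCHAR(10) NOT NULL,
   PRIMARY KEY (order_id, region)
);

CREATE TABLE order_lines (
   order_id INTEGER     NOT NULL,
   region   VARCHAR(10) NOT NULL,
   qty      INTEGER,
   FOREIGN KEY (order_id, region) REFERENCES orders (order_id, region)
);
```

Appending WITH NO CHECK OPTION to the REFERENCES clause declares "soft" RI that the optimizer can use but the engine does not enforce.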
4. CHECK Constraint
Restricts values based on multiple column logic.
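A sketch with a check spanning two columns (names illustrative):

```sql
CREATE TABLE bookings (
   booking_id INTEGER NOT NULL,
   start_dt   DATE,
   end_dt     DATE,
   CHECK (end_dt >= start_dt)   -- table-level, multi-column condition
) PRIMARY INDEX (booking_id);
```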
🧠 Notes for Teradata:
In Teradata, CHECK constraints and standard foreign keys are enforced by the database engine; only "soft" referential integrity (REFERENCES ... WITH NO CHECK OPTION) is declarative and unenforced.
If you use soft RI, validate referential integrity manually via joins or scripts.
🔹 Types of Tables
🔹 1. Permanent Table
Definition: Stores actual business data permanently.
- Persistence: Data remains until dropped manually; survives session end.
- Data is written to disk and protected using Fallback and/or RAID.
- Uses: OLTP/OLAP workloads, fact and dimension tables.
- Space: Uses Perm Space.
🧾 Example:
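A minimal sketch (names illustrative):

```sql
CREATE TABLE employee_perm (
   emp_id  INTEGER NOT NULL,
   name    VARCHAR(50),
   dept_id INTEGER
) PRIMARY INDEX (emp_id);   -- lives in Perm Space until explicitly dropped
```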
🔹 2. Volatile Table
Definition: Temporary tables whose definition and data exist only during the session.
- Persistence: Automatically dropped at session end.
- Data is stored in Spool space and lost after logout.
- Good for staging or session-specific processing.
- Uses: Intermediate calculations, session-level processing.
- Space: Uses Spool Space (not Perm Space).
🧾 Example:
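A minimal sketch (names illustrative):

```sql
CREATE VOLATILE TABLE stg_emp (
   emp_id INTEGER,
   name   VARCHAR(50)
) ON COMMIT PRESERVE ROWS;   -- keep rows between queries within the session
```

The table and its data disappear automatically at logoff.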
🔹 3. Global Temporary Table (GTT)
Definition: Permanent table definition, temporary data for each session.
- Persistence: Table definition persists; data is session-specific and deleted at session end.
- Requires an ON COMMIT clause (DELETE ROWS or PRESERVE ROWS).
- Uses: Temporary processing across multiple queries within a session.
- Space: Uses Temp Space.
🧾 Example:
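A minimal sketch (names illustrative):

```sql
CREATE GLOBAL TEMPORARY TABLE gt_emp (
   emp_id INTEGER,
   name   VARCHAR(50)
) ON COMMIT PRESERVE ROWS;   -- definition persists; each session gets its own rows
```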
🔹 4. Derived Table
Definition: Tables created on-the-fly inside queries, no explicit storage.
- Persistence: Exists only for the duration of the query; cannot be reused.
- Virtual table created by a subquery in the FROM clause.
- Uses: Query processing, subqueries.
- Space: Uses Spool Space.
🧾 Example:
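A minimal sketch (names illustrative):

```sql
SELECT d.dept_id, d.avg_sal
FROM (
   SELECT dept_id, AVG(salary) AS avg_sal
   FROM employee
   GROUP BY dept_id
) AS d                       -- derived table, materialized in Spool
WHERE d.avg_sal > 50000;
```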
🔹 5. Queue Table
Used for asynchronous processing (message queuing).
Follows FIFO (First In First Out) principle.
Supports consume-mode reads via SELECT AND CONSUME, which returns and deletes the oldest row.
🧾 Example:
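A minimal sketch (names illustrative; the first column must be the QITS insertion timestamp):

```sql
CREATE TABLE msg_queue, QUEUE (
   qits    TIMESTAMP(6) NOT NULL DEFAULT CURRENT_TIMESTAMP(6),  -- queue insertion time
   msg_id  INTEGER,
   payload VARCHAR(200)
) PRIMARY INDEX (msg_id);

-- Returns (and removes) the oldest row; waits if the queue is empty
SELECT AND CONSUME TOP 1 * FROM msg_queue;
```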
🔹 6. Multiset Table
Allows duplicate rows.
Must be explicitly specified.
🧾 Example:
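A minimal sketch (names illustrative):

```sql
CREATE MULTISET TABLE web_clicks (
   user_id  INTEGER,
   click_ts TIMESTAMP(0)
) PRIMARY INDEX (user_id);   -- duplicate rows are allowed
```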
🔹 7. Set Table
Does not allow duplicate rows.
Teradata performs a duplicate row check on insert.
🧾 Example:
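A minimal sketch (names illustrative):

```sql
CREATE SET TABLE customers (
   cust_id INTEGER NOT NULL,
   name    VARCHAR(50)
) PRIMARY INDEX (cust_id);   -- inserting an exact duplicate row is rejected
```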
🔹 8. NoPI Table (No Primary Index)
Rows are not hashed by PI, but distributed round-robin.
Good for staging and bulk loading.
Can reduce skew in large loads.
🧾 Example:
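A minimal sketch (names illustrative; NoPI tables are always MULTISET):

```sql
CREATE MULTISET TABLE stg_load (
   col1 INTEGER,
   col2 VARCHAR(100)
) NO PRIMARY INDEX;          -- rows distributed round-robin, not by hash
```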
🔹 9. Column-Partitioned Table (Columnar)
Stores data column-wise instead of row-wise.
Improves performance for analytic queries on few columns.
🧾 Example:
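A minimal sketch (names illustrative; column partitioning requires a NoPI table):

```sql
CREATE MULTISET TABLE sales_cp (
   sale_id INTEGER,
   amount  DECIMAL(12,2),
   sale_dt DATE
) NO PRIMARY INDEX
PARTITION BY COLUMN;         -- each column stored in its own containers
```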
🔹 10. Fallback Table
Any table (except volatile and queue) can be fallback.
Stores duplicate copy of each row on a different AMP for data protection.
🧾 Example:
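A minimal sketch (names illustrative):

```sql
CREATE TABLE accounts, FALLBACK (
   acct_id INTEGER NOT NULL,
   balance DECIMAL(15,2)
) PRIMARY INDEX (acct_id);   -- each row also copied to a different AMP
```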
🔹 Types of SQL in Teradata
SQL commands are categorized based on purpose and functionality.
🔹 1️⃣ DDL – Data Definition Language
- Purpose: Defines or modifies database objects like tables, indexes, databases.
- Commands: CREATE, ALTER, DROP, RENAME (note: Teradata does not support TRUNCATE; use DELETE ALL instead)
- Example:
CREATE TABLE employee (
emp_id INT,
name VARCHAR(50),
dept_id INT
);
🔹 2️⃣ DML – Data Manipulation Language
- Purpose: Manipulates data stored in tables.
- Commands: INSERT, UPDATE, DELETE, MERGE
- Example:
INSERT INTO employee (emp_id, name, dept_id)
VALUES (1, 'RK', 10);
🔹 3️⃣ DQL – Data Query Language
- Purpose: Retrieves data from database tables.
- Command: SELECT
- Example:
SELECT name, dept_id
FROM employee
WHERE emp_id = 1;
🔹 4️⃣ DCL – Data Control Language
- Purpose: Controls access and permissions in the database.
- Commands: GRANT, REVOKE
- Example:
GRANT SELECT ON employee TO analyst_role;
🔹 5️⃣ TCL – Transaction Control Language
- Purpose: Manages transactions in the database.
- Commands: COMMIT, ROLLBACK, SAVEPOINT
- Example:
BEGIN TRANSACTION;
UPDATE employee SET dept_id = 20 WHERE emp_id = 1;
ROLLBACK; -- undo changes
🔹 Summary Table
| SQL Type | Purpose | Examples |
|---|---|---|
| DDL | Define/alter objects | CREATE, ALTER, DROP |
| DML | Manipulate data | INSERT, UPDATE, DELETE, MERGE |
| DQL | Query data | SELECT |
| DCL | Control access | GRANT, REVOKE |
| TCL | Manage transactions | COMMIT, ROLLBACK, SAVEPOINT |
🔹 Set Operators in Teradata
Set operators in Teradata are used to combine results of two or more queries into a single result set.
🔹 Rules for Set Operators
- Same number of columns in each query
- Compatible data types in corresponding columns
- Result column names are taken from the first query
- UNION → removes duplicates → slower than UNION ALL
- UNION ALL → faster, preserves duplicates
- INTERSECT / MINUS → useful for comparing datasets
- ORDER BY can be applied once, at the end of the full query
- Even though UNION ALL keeps all duplicate rows, SQL tables and set operations are inherently unordered.
- The order of rows in the result set is not guaranteed unless you explicitly use an ORDER BY clause.
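A sketch of the operators above (table names are illustrative):

```sql
-- UNION: removes duplicates across the two results
SELECT emp_id FROM current_emps
UNION
SELECT emp_id FROM former_emps
ORDER BY 1;                  -- single ORDER BY at the end of the full query

-- MINUS: rows in the first query but not the second
SELECT emp_id FROM current_emps
MINUS
SELECT emp_id FROM former_emps;
```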
Teradata String Manipulation Functions
| Function | Description | Syntax | Example | Output |
|---|---|---|---|---|
| CONCAT | Concatenates two or more strings | CONCAT(str1, str2, ...) | CONCAT('Hello',' ','RK') | Hello RK |
| LENGTH / CHARACTER_LENGTH | Returns number of characters in string | LENGTH(str) or CHARACTER_LENGTH(str) | CHARACTER_LENGTH('Robert') | 6 |
| LOWER | Converts string to lowercase | LOWER(str) | LOWER('ROBERT') | robert |
| UPPER | Converts string to uppercase | UPPER(str) | UPPER('Robert') | ROBERT |
| LPAD | Pads left side of string to specified length | LPAD(str, size, padstr) | LPAD('RK',5,'*') | ***RK |
| RPAD | Pads right side of string to specified length | RPAD(str, size, padstr) | RPAD('RK',5,'*') | RK*** |
| TRIM | Removes leading/trailing characters (default: spaces) | `TRIM([LEADING | TRAILING | BOTH] [char] FROM str)` | TRIM('  RK  ') | RK |
| LTRIM | Removes leading spaces | LTRIM(str) | LTRIM(' Robert') | Robert |
| RTRIM | Removes trailing spaces | RTRIM(str) | RTRIM('Robert ') | Robert |
| REPLACE (remove) | Removes search string | REPLACE(str, search) | REPLACE('Hello RK','RK') | Hello |
| REPLACE (replace) | Replaces search string with another | REPLACE(str, search, replace) | REPLACE('Hello RK','RK','AI') | Hello AI |
| REVERSE | Reverses the string | REVERSE(str) | REVERSE('Robert') | treboR |
| STRTOK | Returns the nth token split by a delimiter | STRTOK(str, delimiter, tokennum) | STRTOK('a,b,c', ',', 2) | b |
| STRPOS / POSITION | Returns starting position of substring | STRPOS(str, substr) or POSITION(substr IN str) | POSITION('e' IN 'Robert') | 4 |
| SUBSTR / SUBSTRING | Returns substring from position with length | SUBSTRING(str FROM pos FOR len) or SUBSTR(str,pos,len) | SUBSTRING('Robert' FROM 1 FOR 3) | Rob |
| CHR | Returns character from ASCII value | CHR(n) | CHR(65) | A |
| TO_UTF8 | Encodes string to UTF-8 varbinary | TO_UTF8(str) | TO_UTF8('RK') | 0x524B (binary) |
| FROM_UTF8 | Decodes UTF-8 varbinary to string | FROM_UTF8(binary) | FROM_UTF8(0x524B) | RK |
| TRANSLATE | Replaces characters in string based on mapping | SELECT TRANSLATE(str, from, to) | TRANSLATE('abc','abc','xyz') | xyz |
| INDEX | Locates position of character (Teradata extension) | INDEX(str, char) | INDEX('Robert','b') | 3 |
🔹 Teradata Date/Time Functions
1️⃣ Built-in Current Functions
| Function | Description | Example | Output |
|---|---|---|---|
| CURRENT_DATE | Returns the current date | SELECT CURRENT_DATE; | 2024-05-22 |
| CURRENT_TIME | Returns the current time | SELECT CURRENT_TIME; | 10:01:13 |
| CURRENT_TIMESTAMP | Returns current timestamp (date + time) | SELECT CURRENT_TIMESTAMP; | 2024-05-22 10:01:13.990000+00:00 |
| DATE | Returns current date (legacy) | SELECT DATE; | 2024-05-22 |
| TIME | Returns current time (legacy) | SELECT TIME; | 10:01:13 |
| DATABASE | Returns current database name | SELECT DATABASE; | TDUSER |
2️⃣ EXTRACT Function
Extracts portions of DATE, TIME, or TIMESTAMP.
| Part | Syntax | Example | Output |
|---|---|---|---|
| Year | EXTRACT(YEAR FROM date) | EXTRACT(YEAR FROM CURRENT_DATE); | 2024 |
| Month | EXTRACT(MONTH FROM date) | EXTRACT(MONTH FROM CURRENT_DATE); | 5 |
| Day | EXTRACT(DAY FROM date) | EXTRACT(DAY FROM CURRENT_DATE); | 22 |
| Hour | EXTRACT(HOUR FROM timestamp) | EXTRACT(HOUR FROM CURRENT_TIMESTAMP); | 10 |
| Minute | EXTRACT(MINUTE FROM timestamp) | EXTRACT(MINUTE FROM CURRENT_TIMESTAMP); | 1 |
| Second | EXTRACT(SECOND FROM timestamp) | EXTRACT(SECOND FROM CURRENT_TIMESTAMP); | 13.990000 |
3️⃣ Interval Arithmetic
Teradata INTERVAL functions allow date/time arithmetic.
| Interval Type | Description | Example | Output |
|---|---|---|---|
| YEAR | Add years | CURRENT_DATE + INTERVAL '04' YEAR; | 2028-05-22 |
| YEAR TO MONTH | Add years & months | CURRENT_DATE + INTERVAL '04-03' YEAR TO MONTH; | 2028-08-22 |
| DAY TO MINUTE | Add days, hours, minutes | CURRENT_TIMESTAMP + INTERVAL '03 05:10' DAY TO MINUTE; | 2024-05-25 15:11:13 |
| DAY TO SECOND | Add days, hours, minutes, seconds | Similar syntax | - |
4️⃣ Common Date Functions
| Function | Description | Example | Output |
|---|---|---|---|
| LAST_DAY(date) | Returns last day of the month | LAST_DAY(CURRENT_DATE); | 2024-05-31 |
| NEXT_DAY(date, weekday) | Returns next specified weekday | NEXT_DAY(CURRENT_DATE, 'MONDAY'); | 2024-05-27 |
| MONTHS_BETWEEN(date1, date2) | Returns months between two dates | MONTHS_BETWEEN(DATE '2024-05-22', DATE '2024-01-22'); | 4 |
| ADD_MONTHS(date, n) | Adds n months | ADD_MONTHS(CURRENT_DATE, 3); | 2024-08-22 |
| TO_DATE(string, format) | Converts string to DATE | TO_DATE('2024-05-22','YYYY-MM-DD'); | 2024-05-22 |
| TO_TIMESTAMP(string) | Converts string to TIMESTAMP | TO_TIMESTAMP('2024-05-22 10:01:13'); | 2024-05-22 10:01:13 |
| TRUNC(date, unit) | Truncates to unit | TRUNC(CURRENT_DATE,'MONTH'); | 2024-05-01 |
| ROUND(date, unit) | Rounds to nearest unit | ROUND(CURRENT_DATE,'MONTH'); | 2024-06-01 |
| NUMTODSINTERVAL(n,'DAY') | Converts numeric to day-time INTERVAL | NUMTODSINTERVAL(3,'DAY'); | 3 00:00:00 |
| NUMTOYMINTERVAL(n,'MONTH') | Converts numeric to year-month INTERVAL | NUMTOYMINTERVAL(5,'MONTH'); | 0-05 |
5️⃣ DISTINCT Option (Aggregate on Dates or Numeric)
- DISTINCT ensures the aggregate operates on unique values only.
Example:
SELECT COUNT(DISTINCT Deduction) FROM Salary;
SELECT SUM(DISTINCT x), AVG(DISTINCT y) FROM Salary;
🔹 CASE, COALESCE, and NULLIF
🔹 1️⃣ CASE Expression
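A sketch of all three expressions against a hypothetical Salary table (column names are assumptions):

```sql
SELECT emp_id,
       CASE                                    -- conditional branching
          WHEN net_pay >= 75000 THEN 'High'
          WHEN net_pay >= 35000 THEN 'Medium'
          ELSE 'Low'
       END AS pay_band,
       COALESCE(bonus, 0)   AS bonus_or_zero,  -- first non-NULL argument
       NULLIF(deduction, 0) AS ded_or_null     -- NULL when deduction = 0
FROM Salary;
```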
🔹 Teradata Joins Overview
Joins are used to combine rows from two or more tables based on related columns.
| Join Type | Description | Syntax | Example / Output |
|---|---|---|---|
| INNER JOIN | Returns rows only when there is a match in both tables | SELECT cols FROM Table1 INNER JOIN Table2 ON Table1.col = Table2.col; | Example: Join Student and Attendance on RollNo. RollNo 1005 is excluded because it has no attendance record. Output:1001 → 90%1002 → 72%1003 → 68%1004 → 95% |
| LEFT OUTER JOIN | Returns all rows from left table + matching rows from right table | SELECT cols FROM Table1 LEFT OUTER JOIN Table2 ON Table1.col = Table2.col; | RollNo 1005 from Student is included; % is NULL because attendance is missing. Output: 1001 → 90%, …, 1005 → NULL |
| RIGHT OUTER JOIN | Returns all rows from right table + matching rows from left table | SELECT cols FROM Table1 RIGHT OUTER JOIN Table2 ON Table1.col = Table2.col; | Similar to LEFT JOIN, but ensures all right table rows appear |
| FULL OUTER JOIN | Returns all rows from both tables, matching where possible | SELECT cols FROM Table1 FULL OUTER JOIN Table2 ON Table1.col = Table2.col; | Combines LEFT and RIGHT OUTER JOIN results. Non-matches from either table show NULL. |
| CROSS JOIN | Returns all combinations of rows from both tables (Cartesian product) | SELECT cols FROM Table1 CROSS JOIN Table2; | Example: Student RollNo = 1001 CROSS JOIN Attendance Output: 1001-1001 → 90%, 1001-1002 → 72%, 1001-1003 → 68%, 1001-1004 → 95% |
| SELF JOIN | Joins a table with itself | SELECT A.col, B.col FROM Table A, Table B WHERE A.col = B.col; | Useful to compare rows in the same table (e.g., Employee-Manager relationships) |
🔹 Important Teradata Notes on Joins with NULLs
- In SQL, NULL ≠ NULL, so joins on NULL columns usually do not match.
- Teradata follows the ANSI standard: rows with NULL in the join column do not match any other row (even another NULL) for INNER, LEFT, and RIGHT joins.
- CROSS JOIN is unaffected because it does not use an ON clause.
Tables
🔹 LEFT JOIN (t1 LEFT JOIN t2 ON t1.id = t2.id)
🔹 RIGHT JOIN (t1 RIGHT JOIN t2 ON t1.id = t2.id)
🔹 INNER JOIN (t1 INNER JOIN t2 ON t1.id = t2.id)
❌ No rows match, per the ANSI standard: in Teradata, NULL does not equal NULL, so an inner join on NULL values returns no row.
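A sketch of this behavior (assume t1.id holds 1, 2, NULL and t2.id holds 1, NULL):

```sql
SELECT t1.id
FROM t1
INNER JOIN t2 ON t1.id = t2.id;
-- Returns only id = 1; the NULL rows never match each other
```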