Teradata SQL



Teradata is a widely used Relational Database Management System (RDBMS) that is well suited for large data warehousing applications.

Features of Teradata

Here are some essential features of Teradata:


1) Linear Scalability

Teradata systems are linearly and highly scalable. They can handle large volumes of data efficiently and can be scaled up to a maximum of 2048 nodes, which enhances system performance.

2) Unlimited Parallelism

The architecture of Teradata is based on Massively Parallel Processing (MPP), which divides a large workload into smaller tasks that execute in parallel. This way of execution completes complex tasks quickly.

3) Mature Optimizer

The Teradata Optimizer is one of the most mature optimizers available in the market. It can handle up to 64 joins in a single query.

4) Shared Nothing Architecture

The Teradata architecture is also known as a Shared Nothing Architecture because the Teradata nodes, Access Module Processors (AMPs), and disks all work independently. These resources are not shared, so each one is fully dedicated to its own task, which enhances efficiency by decreasing execution time.

5) Connectivity

The parallel system of Teradata can connect to channel-attached systems such as mainframes, as well as to network-attached systems.

6) Low TCO

The Teradata system offers a low total cost of ownership, as it is easy to set up, maintain, and administer.

7) SQL

Teradata supports SQL for interacting with the data stored in tables, and it also provides its own extensions to the language.

8) Load & Unload utilities

Teradata provides load and unload utilities to move data into and out of the Teradata system.

9) Automatic Distribution

Teradata can efficiently distribute the data to its disks without any external intervention.

10) Robust Utilities

Teradata has many robust utilities to move data into and out of Teradata systems. FastLoad, MultiLoad, FastExport, and TPT are a few of them that help in importing and exporting data.

Components of Teradata

Teradata is an efficient, inexpensive, and high-quality Relational Database Management System.

Teradata is based on Massively Parallel Processing (MPP) architecture. It is made of Parsing Engine (PE), BYNET, Access Module Processors (AMPs), and other components such as nodes.

Below are some significant components of Teradata:


1. Parsing Engine: Parsing Engine is the base component that receives the queries from the clients and prepares the execution plan.

2. BYNET: This is the message-passing layer or simply the networking layer in Teradata. It receives the execution plan from the parsing engine and passes it to AMPs and the nodes. After that, it gets the processed output from the AMPs and sends it back to the parsing engine.

To maintain adequate availability, two BYNETs, BYNET 0 and BYNET 1, are provided. This ensures that a secondary BYNET is available in case the primary BYNET fails.

3. Access Module Processors (AMPs): These are the virtual processors of Teradata. They receive the execution plan and the data from the parsing engine. The data will undergo any required conversion, filtering, aggregation, sorting, etc., and will be further sent to the corresponding disks for storage.

Table rows are distributed across the AMPs for storage. Only the AMP that owns a disk has permission to read or write data on it.

4. Nodes: The basic unit of a Teradata system is called a Node. Each node has its own operating system, CPU, memory, a copy of the Teradata RDBMS software, and some disk space. A single cabinet can have one or more nodes in it.

Teradata Architecture

The architecture of Teradata is a Massively Parallel processing architecture. The Teradata system has four components.

  • Parsing Engine
  • BYNET
  • AMPs
  • Disk


Based on the principal functions of the Teradata system, the architecture can be categorized into two sections:

  1. Storage Architecture
  2. Retrieval Architecture

Storage Architecture

The storage architecture covers the Parsing Engine and the AMPs. When the client runs a query to insert records, the Parsing Engine sends the records to BYNET. BYNET forwards each row to the target AMP, and the AMP stores the rows on its disks.


1. Parsing Engine

When a user fires an SQL query, the session first connects to the Parsing Engine. Planning the query and distributing the data to the AMPs are done here, and the best plan for executing the query is chosen. The Parsing Engine performs the following processes:

  • Parser: The Parser checks the syntax of the query and, if it is valid, forwards the query to the Session Handler.
  • Session Handler: It does all the security checks, such as verifying login credentials and checking whether the user has permission to execute the query.
  • Optimizer: It finds out the best possible and optimized plan to execute the query.
  • Dispatcher: The Dispatcher forwards the query to the AMPs.

2. AMP

The Access Module Processor is a virtual processor that connects to the PE via BYNET. Each AMP has its own disk and is allowed to read and write only on that disk. This is called a 'Shared Nothing Architecture'.

When a query is fired, Teradata distributes the rows of the table across all the AMPs. An AMP executes an SQL request in the following steps:

  1. Lock the table.
  2. Execute the operation requested.
  3. End the transaction.

Retrieval Architecture

This section covers the remaining two components of the Teradata architecture, BYNET and the disks. When the client runs a query to retrieve records, the Parsing Engine sends a request to BYNET. BYNET sends the retrieval request to the appropriate AMPs. The AMPs then search their disks in parallel, identify the required records, and forward them to BYNET. BYNET sends the records to the Parsing Engine, which in turn sends them to the client.


1. BYNET

The BYNET acts as a channel between the PE and the AMPs. There are two BYNETs in Teradata: 'BYNET 0' and 'BYNET 1'.

  1. If one BYNET fails, the second one can take its place.
  2. When the data volume is large, both BYNETs can be made functional, which improves the communication between the PE and the AMPs, thus speeding up the process.

2. Disk

Teradata provides a set of virtual disks for each AMP. The storage area of each AMP is called a Virtual Disk, or vdisk. The following steps are involved in executing a query:

Step 1: The user submits a query, which is sent to the PE.

Step 2: PE checks the security and syntax, and finds out the best optimal plan to execute the query.

Step 3: The AMPs that hold the relevant table rows are contacted, and the data is retrieved from their disks.

Step 4: The AMP sends back the data through BYNET to PE.

Step 5: PE returns the data to the user.

Teradata Relational Concepts

Relational Database Management System is a DBMS software that helps to interact with databases. The relational database is a collection of objects in the form of tables, views, macros, stored procedures, and triggers.

Users interact with the data stored in the tables using Structured Query Language (SQL), either directly or through individually developed applications.

Relational Database Terminology

Relational databases are a generalization of the relations of mathematical set theory, although the correspondences between set theory and relational database terms are not always direct.

The following table shows the correspondence between set theory and relational database terms.

Set Theory Term      Relational Database Term
Relation             Table
Tuple                Row
Attribute            Column

Tables

Tables are the basic unit in RDBMS where the data is stored. Tables are two-dimensional objects consisting of rows and columns. Data is organized in tabular format and presented to the users of a relational database.

References between tables define the relationships and constraints of data inside the tables themselves.

For example, the student data is shown in the tabular format, such as:

Roll no     Student Name     Date of Birth
10001       Mike             1/2/1996
10002       James            5/1/1997
10003       Robert           15/7/1997
10004       Paul             16/2/1996
10005       Peter            5/9/1998

Table Constraints

We can define conditions that must be met before Teradata Database writes a given value to a column in a table. These conditions are called constraints.

Constraints can include value ranges, equality or inequality conditions, and intercolumn dependencies. Teradata Database supports constraints at both the column and table levels.

During table creation and modification, you can specify constraints on single-column values as part of a column definition or on multiple columns using the CREATE and ALTER TABLE statements.
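As a sketch of both kinds of constraint (the table, column, and constraint names below are hypothetical, not taken from the text above):

  CREATE TABLE Employee_Demo (
     Employee_Id INTEGER NOT NULL,
     Age         INTEGER CHECK (Age BETWEEN 18 AND 65),      -- column-level value-range constraint
     Base_Salary DECIMAL(10,2),
     Bonus       DECIMAL(10,2),
     CONSTRAINT bonus_chk CHECK (Bonus <= Base_Salary)       -- table-level (intercolumn) constraint
  )
  UNIQUE PRIMARY INDEX (Employee_Id);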

Columns

A column always contains the same kind of information or contains similar data.

For example, a student information table would have a column for date of birth, and nothing other than the student date of birth should be placed in that column.

Date of Birth
1/2/1996
5/1/1997
15/7/1997
16/2/1996
5/9/1998

Row

A row is one instance of all the columns in the table.

An entity is a person, place, or thing about which the table contains information. The table mentioned in the previous paragraphs includes information on the student entity. Each table holds only one kind of row.

The relational model requires that each row in a table be uniquely identified. To accomplish this, we define a uniqueness constraint to identify each row in the table.

For example, each row in the student table contains, among other things, the roll number, the student's name, and the date of birth for that student. The columns in a table represent entities, relationships, or attributes.

Roll no     Student Name     Date of Birth
10001       Mike             1/2/1996

Database

The database is a collection of logically related data. Many users access them for different purposes.

For example, a sales database contains complete information about sales, which is stored in many tables.

Primary Key

The primary key is used to identify a row in a table uniquely. No duplicate values are allowed in a primary key column, and they cannot accept NULL values. It is a mandatory field in a table.
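A minimal sketch of declaring a primary key (the table below is hypothetical):

  CREATE TABLE Student_PK (
     RollNo      INTEGER NOT NULL PRIMARY KEY,   -- unique, non-NULL identifier for each row
     StudentName VARCHAR(30)
  );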

Foreign Key

Foreign keys are used to build a relationship between tables. A foreign key in a child table references the primary key of the parent table.

A table can have more than one foreign key. It can accept duplicate values and also null values. Foreign keys are optional in a table.
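A minimal sketch of a foreign key, reusing the hypothetical Student_PK table from the sketch above:

  CREATE TABLE Enrollment (
     Enroll_Id INTEGER NOT NULL PRIMARY KEY,
     RollNo    INTEGER REFERENCES Student_PK (RollNo)   -- foreign key pointing at the parent table's primary key
  );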

Teradata Tables

Tables in the relational model are defined as collections of data represented as rows and columns.

Types of Table

Teradata supports the following types of tables:

  • Permanent Table: This is the default table, and it contains data inserted by the user and permanently stores the data.
  • Volatile Table: The data inserted into a volatile table is retained only during the user session. The table and data are dropped at the end of the session. These tables are mainly used to hold the intermediate data during data transformation.
  • Global Temporary Table: The definition of the Global Temporary table is persistent, but the data in the table is deleted at the end of the user session.
  • Derived Table: A derived table holds the intermediate results of a query. Its lifetime is within the query in which it is created, used, and dropped.

Set vs Multiset

Teradata classifies tables as SET or MULTISET tables based on how duplicate rows are handled.

A table defined as a SET table doesn't store duplicate rows, whereas a MULTISET table can store duplicate rows.
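A small sketch of the difference (the table names here are hypothetical):

  CREATE SET TABLE T_Set (col1 INTEGER) PRIMARY INDEX (col1);
  CREATE MULTISET TABLE T_Multi (col1 INTEGER) PRIMARY INDEX (col1);

  INSERT INTO T_Set   VALUES (1);
  INSERT INTO T_Set   VALUES (1);   -- rejected with a duplicate row error (SET table)
  INSERT INTO T_Multi VALUES (1);
  INSERT INTO T_Multi VALUES (1);   -- allowed: a MULTISET table keeps duplicate rows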

1. Create Table

CREATE Table command is used to create tables in Teradata.

Syntax

Following is the generic syntax of the CREATE TABLE statement.

  CREATE <SET/MULTISET> TABLE <Tablename>
  <Table Options>
  <Column Definitions>
  <Index Definitions>;
  • Table Options: It specifies the physical attributes of the table, such as Journal and Fallback.
  • Column Definition: It specifies the list of columns, data types, and attributes.
  • Index Definition: It is an additional indexing option such as Primary Index, Secondary Index, and Partitioned Primary Index.

Example

In the following example, we create a table called student with a FALLBACK option. The table contains four columns with rollno as the Unique Primary Index.

  CREATE SET TABLE STUDENT, FALLBACK (
     RollNo INTEGER,
     FirstName VARCHAR(10),
     LastName VARCHAR(10),
     DOB DATE FORMAT 'YYYY-MM-DD'
  )
  UNIQUE PRIMARY INDEX ( RollNo );

Once the table is created, we can use the SHOW TABLE command to view the Definition of the table.

  SHOW TABLE Student;

  CREATE SET TABLE STUDENT, FALLBACK,
     NO BEFORE JOURNAL,
     NO AFTER JOURNAL,
     CHECKSUM = DEFAULT,
     DEFAULT MERGEBLOCKRATIO
     (
        RollNo INTEGER,
        FirstName VARCHAR(10) CHARACTER SET LATIN NOT CASESPECIFIC,
        LastName VARCHAR(10) CHARACTER SET LATIN NOT CASESPECIFIC,
        DOB DATE FORMAT 'YYYY-MM-DD'
     )
  UNIQUE PRIMARY INDEX ( RollNo );

2. Alter Table

ALTER table command is used to add or drop columns from an existing table. We can also use an ALTER table command to modify the attributes of the existing columns.

Syntax

Following is the generic syntax for ALTER table.

  ALTER TABLE <tablename>
  ADD <columnname> <column attributes>
  DROP <columnname>;

Example

In the following example, we drop the column DOB and add a new column BirthDate.

  ALTER TABLE STUDENT
  ADD BirthDate DATE FORMAT 'YYYY-MM-DD',
  DROP DOB;

We can run the SHOW TABLE command to view the changes to the table. In the following output, we can see that the column DOB is removed, and BirthDate is added.

  SHOW TABLE student;

  CREATE SET TABLE student, FALLBACK,
     NO BEFORE JOURNAL,
     NO AFTER JOURNAL,
     CHECKSUM = DEFAULT,
     DEFAULT MERGEBLOCKRATIO
     (
        RollNo INTEGER,
        FirstName VARCHAR(10) CHARACTER SET LATIN NOT CASESPECIFIC,
        LastName VARCHAR(10) CHARACTER SET LATIN NOT CASESPECIFIC,
        BirthDate DATE FORMAT 'YYYY-MM-DD'
     )
  UNIQUE PRIMARY INDEX ( RollNo );

3. Drop Table

DROP TABLE command is used to drop a table. When the DROP TABLE is issued, data in the table is deleted, and the table is dropped.

Syntax

Following is the generic syntax for DROP TABLE.

  DROP TABLE <tablename>;

Example

In the following example, we drop the "student" table.

  DROP TABLE student;

If we run the SHOW TABLE command after this, we will get an error message stating that the table doesn't exist.

  SHOW TABLE student;
  *** Failure 3807 Object 'student' does not exist.
      Statement# 1, Info = 0

Teradata Data Manipulation

This section covers the SQL commands used to manipulate the data stored in Teradata tables.

Insert Records

INSERT INTO statement is used to insert records into the table.

Syntax

Following is the generic syntax for INSERT INTO:

  INSERT INTO <tablename>
  (column1, column2, column3, ...)
  VALUES
  (value1, value2, value3, ...);

Example

In the following example, we insert a record into the Student table.

  INSERT INTO Student (
     RollNo,
     FirstName,
     LastName,
     BirthDate
  )
  VALUES (
     10001,
     'Jack',
     'Nicholson',
     '1990-05-06'
  );

Once the above record is inserted, we can use the SELECT statement to view the records of the table.

RollNo     FirstName     LastName      BirthDate
10001      Jack          Nicholson     5/6/1990

Insert from another Table

INSERT SELECT statement is used to insert records from another table.

Syntax

Following is the generic syntax for INSERT ... SELECT.

  INSERT INTO <tablename>
  (column1, column2, column3, ...)
  SELECT
  column1, column2, column3, ...
  FROM
  <source table>;

Example

In the following example, we insert records from the Student table into the Student_Bkup table. Create a table called Student_Bkup with the same column definitions as the Student table before running the following insert query.

  INSERT INTO Student_Bkup (
     RollNo,
     FirstName,
     LastName,
     BirthDate
  )
  SELECT
     RollNo,
     FirstName,
     LastName,
     BirthDate
  FROM
     Student;

When the above query is executed, it will insert all records from the student table into the student_bkup table.

Rules

Here are some specific rules for inserting records:

  • The number of columns specified in the VALUES list should match with the columns specified in the INSERT INTO clause.
  • Values are mandatory for NOT NULL columns.
  • If no values are specified, then NULL is inserted for nullable fields.
  • The data types of columns specified in the VALUES clause should be compatible with the data types of columns in the INSERT clause.

Update Records

The UPDATE statement is used to update records in a table.

Syntax

Following is the generic syntax for UPDATE.

  UPDATE <tablename>
  SET <columnname> = <new value>
  [WHERE condition];

Example

In the following example, we update the BirthDate to 5/6/1997 for student 10001.

  UPDATE Student
  SET BirthDate = DATE '1997-05-06'
  WHERE RollNo = 10001;

The following output shows that the BirthDate is updated from 5/6/1990 to 5/6/1997 for RollNo 10001.

  SELECT RollNo, BirthDate FROM Student;
  /* Query completed. One row found. 2 columns returned. */
  /* Total elapsed time was 1 second. */

   RollNo      BirthDate
  --------     ---------
    10001       5/6/1997

Rules

Here are some specific rules for updating records in a table:

  • We can update one or more values of the table.
  • If the WHERE condition is not specified, then all rows of the table are impacted.
  • We can update a table with values from another table, as sketched below.
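A minimal sketch of updating one table from another, using Teradata's FROM-clause UPDATE and assuming the Student_Bkup table created earlier:

  UPDATE S
  FROM Student AS S, Student_Bkup AS B
  SET BirthDate = B.BirthDate        -- take the value from the matching backup row
  WHERE S.RollNo = B.RollNo;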

Delete Records

The DELETE FROM statement is used to delete records from a table.

Syntax

Following is the generic syntax for DELETE FROM.

  DELETE FROM <tablename>
  [WHERE condition];

Example

In the following example, we delete student 10001 from the Student table.

  DELETE FROM Student
  WHERE RollNo = 10001;

In the following output, we can see that student 10001 is deleted from the table.

  SELECT RollNo FROM Student;
  /* Query completed. No rows found. */
  /* Total elapsed time was 1 second. */

Rules

Here are some specific rules for deleting records from a table:

  • We can delete one or more records of the table.
  • If the WHERE condition is not specified, then all rows of the table are deleted.
  • We can delete rows based on values from another table, as sketched below.
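A minimal sketch of deleting rows based on another table, again assuming the Student_Bkup table:

  DELETE FROM Student
  WHERE RollNo IN (SELECT RollNo FROM Student_Bkup);   -- remove students that already exist in the backup table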

Teradata Set Operator

SET operators combine results from multiple SELECT statements. This may look similar to joins, but joins combine columns from multiple tables, whereas SET operators combine rows from multiple result sets.

Rules for Set Operator

Here are the rules for using set operators:

  • The number of columns from each SELECT statement should be the same.
  • The data types from each SELECT must be compatible.
  • ORDER BY should be included only in the final SELECT statement.

Teradata SQL Set Operators

Teradata SQL supports the following set operators:

Set Operator      Function
INTERSECT         Returns the rows that appear in all answer sets generated by the individual SELECT statements.
MINUS / EXCEPT    Returns the rows returned by the first SELECT except those also selected by the second SELECT.
UNION             Combines the results of two or more SELECT statements.

1. UNION

The UNION statement is used to combine results from multiple SELECT statements. It ignores duplicates.

Syntax

Following is the basic syntax of the UNION statement.

  SELECT col1, col2, col3...
  FROM
  <table T1>
  [WHERE condition]
  UNION
  SELECT col1, col2, col3...
  FROM
  <table T2>
  [WHERE condition];

Example

Consider the following student table as T1 and attendance table as T2.

RollNo     FirstName     LastName     BirthDate
1001       Mike          Richard      1/2/1996
1002       Robert        Williams     3/5/1995
1003       Peter         Collin       4/1/1994
1004       Alexa         Stuart       11/6/1995
1005       Robert        Peterson     12/1/1997

RollNo     Present     Absent     Percentage
1001       200         20         90%
1002       160         60         72%
1003       150         70         68%
1004       210         10         95%

The following UNION query combines the RollNo value from both T1 and T2 tables.

  SELECT RollNo
  FROM
  T1
  UNION
  SELECT RollNo
  FROM
  T2;

When the query is executed, it gives the following output.

RollNo
1001
1002
1003
1004
1005

2. UNION ALL

UNION ALL statement is similar to the UNION statement. It combines results from multiple tables, including duplicate rows.

Syntax

Following is the basic syntax of the UNION ALL statement.

  SELECT col1, col2, col3...
  FROM
  <table 1>
  [WHERE condition]
  UNION ALL
  SELECT col1, col2, col3...
  FROM
  <table 2>
  [WHERE condition];

Example

Following is an example for UNION ALL statement.

  SELECT RollNo
  FROM
  T1
  UNION ALL
  SELECT RollNo
  FROM
  T2;

When the above query is executed, it produces the following output, including the duplicate rows.

RollNo
1001
1002
1003
1004
1005
1001
1002
1003
1004

3. INTERSECT

INTERSECT command is also used to combine results from multiple SELECT statements.

It returns the rows from the first SELECT statement that has a corresponding match in the second SELECT statement.

Syntax

Following is the basic syntax of the INTERSECT statement.

  SELECT col1, col2, col3...
  FROM
  <table 1>
  [WHERE condition]
  INTERSECT
  SELECT col1, col2, col3...
  FROM
  <table 2>
  [WHERE condition];

Example

Following is an example of the INTERSECT statement. It returns the RollNo values that exist in both tables.

  SELECT RollNo
  FROM
  T1
  INTERSECT
  SELECT RollNo
  FROM
  T2;

When the above query is executed, it returns the following records. RollNo 1005 is excluded since it doesn't exist in the T2 table.

RollNo
1001
1002
1003
1004

4. MINUS/EXCEPT

The MINUS/EXCEPT commands combine rows from multiple tables and return the rows that are in the first SELECT but not in the second SELECT. They both return the same results.

Syntax

Following is the basic syntax of the MINUS statement.

  SELECT col1, col2, col3...
  FROM
  <table 1>
  [WHERE condition]
  MINUS
  SELECT col1, col2, col3...
  FROM
  <table 2>
  [WHERE condition];

Example

Following is an example of a MINUS statement.

  SELECT RollNo
  FROM
  T1
  MINUS
  SELECT RollNo
  FROM
  T2;

When this query is executed, it gives the following output.

RollNo
1005


Teradata String Manipulation

Teradata provides several functions to manipulate the strings. These functions are compatible with the ANSI standard.

Teradata supports most of the standard string functions, along with Teradata extensions to those functions.

String Function                        Explanation
Concat(string1, ..., stringN)          Returns the concatenation of two or more string values. Provides the same functionality as the SQL-standard concatenation operator (||).
Length(string)                         Returns the number of characters in the string.
Lower(string)                          Converts a string to lower case.
Upper(string)                          Converts a string to upper case.
Lpad(string, size, padstring)          Pads the left side of the string with characters to create a new string.
Rpad(string, size, padstring)          Pads the right side of the string with characters to create a new string.
Trim(string)                           Removes leading and trailing whitespace from the given string.
Ltrim(string)                          Removes leading whitespace from the string.
Rtrim(string)                          Removes trailing whitespace from the string.
Replace(string, search)                Removes the search string from the given string.
Replace(string, search, replace)       Replaces all instances of search with the replace string.
Reverse(string)                        Returns the string's characters in reverse order.
Split(string, delimiter)               Splits the given string on the delimiter and returns an array of strings.
Strpos(string, substring)              Returns the starting position of the first instance of a substring in the given string.
Position(substring IN string)          Returns the starting position of the first instance of a substring in the given string.
Substr(string, start, length)          Returns a substring of string that begins at the start position and is length characters long.
Chr(n)                                 Returns the character with the specified ASCII value.
To_utf8(string)                        Encodes a string into a UTF-8 varbinary representation.
From_utf8(binary)                      Decodes a UTF-8 encoded string from binary.
Translate(string, from, to)            Replaces any character in the string that matches a character in the from set with the corresponding character in the to set.
Index(string, substring)               Locates the position of a substring in a string (Teradata extension).

UPPER & LOWER Function

The UPPER and LOWER functions convert character values to all uppercase and all lowercase, respectively. UPPER and LOWER are ANSI compliant.

Syntax

  UPPER (expression) -> returns the expression as uppercase
  LOWER (expression) -> returns the expression as lowercase

Example

The following example converts the string 'Robert' to uppercase.

  SELECT UPPER('Robert');
  /* Query completed. One row found. One column returned */
  /* Total elapsed time was 0.00 second */

After executing the above code, it will give the following output.

ROBERT

Now we convert the string 'ROBERT' to lowercase.

  SELECT LOWER('ROBERT');
  /* Query completed. One row found. One column returned */
  /* Total elapsed time was 0.00 second */

Output

robert

CHARACTER_LENGTH Function

The CHARACTER_LENGTH function returns the number of characters in a character string expression.

  • The result is an integer number that represents the length.
  • For a fixed-length CHAR column, the result is always the defined length.
  • For a variable-length VARCHAR column, the result varies with the stored value.
  • Spaces are valid characters, so they are counted in the length.

Syntax

  CHARACTER_LENGTH (expression)

Example

The following example will return the number of characters of the "Robert" string.

  SELECT CHARACTER_LENGTH('Robert');
  /* Query completed. One row found. One column returned */
  /* Total elapsed time was 0.00 second */

Executing the above code returns the length of the string 'Robert', as shown below.

6

TRIM Function

The TRIM function is used to remove a particular set of leading and/or trailing characters (spaces by default) from an expression. TRIM is ANSI standard.

Syntax

  TRIM ([LEADING | BOTH | TRAILING] [trim_character] FROM expression)

Example

The following example removes the spaces from both ends of the string '   Robert    '.

  SELECT TRIM('   Robert    ');
  /* Query completed. One row found. One column returned */
  /* Total elapsed time was 0.00 second */

When we execute the above code, it trims the spaces from both ends of the string and gives the following output.

Robert
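The optional LEADING/TRAILING keywords and trim_character from the syntax above can also be used; a small sketch:

  SELECT TRIM(LEADING '0' FROM '00012345');   -- removes only leading zeros, returns '12345'
  SELECT TRIM(TRAILING FROM 'Robert   ');     -- removes only trailing spaces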

POSITION Function

The POSITION function is used to return the position of a substring inside a string. Only the position of the first occurrence of the substring is returned.

Syntax

  POSITION (substr IN str)

Example

The following example returns the position of 'e' in the string 'Robert'.

  SELECT POSITION('e' IN 'Robert');
  /* Query completed. One row found. One column returned */
  /* Total elapsed time was 0.00 second */

After executing the above code, it returns the position of the substring 'e' in 'Robert' as output.

4

SUBSTRING Function

The SUBSTRING function is used to return a specified number of characters from a particular position of a given string. SUBSTRING function is ANSI standard.

Syntax

  SUBSTRING (str FROM pos FOR len)

It returns a string (str), starting at the position (pos), and length (len) in characters.

Example

The following example returns the character from the 1st position for 3 characters.

  SELECT SUBSTRING('Robert' FROM 1 FOR 3);
  /* Query completed. One row found. One column returned */
  /* Total elapsed time was 0.00 second */

The above code returns 3 characters from the 1st position of the string "Robert" as the output.

Rob


Teradata Date/Time Functions

Date/Time functions operate on either Date/Time or Interval values and provide a Date/Time value as a result.

The supported Date/Time functions are:

  • CURRENT_DATE
  • CURRENT_TIME
  • CURRENT_TIMESTAMP
  • EXTRACT

To avoid synchronization problems, operations among these functions are guaranteed to use identical definitions for DATE, TIME, and TIMESTAMP, so the following equalities are always valid:

  • CURRENT_DATE = CURRENT_DATE
  • CURRENT_TIME = CURRENT_TIME
  • CURRENT_TIMESTAMP = CURRENT_TIMESTAMP
  • CURRENT_DATE and CURRENT_TIMESTAMP always identify the same DATE
  • CURRENT_TIME and CURRENT_TIMESTAMP always identify the same TIME

The values reflect the time when the request started and do not change during the request's duration.

Date Storage

Dates are stored internally as integers using the following formula.

  ((YEAR - 1900) * 10000) + (MONTH * 100) + DAY

To check how a date is stored, use the following query.

  SELECT CAST(CURRENT_DATE AS INTEGER);

Since the dates are stored as an integer, we can perform some arithmetic operations on them.
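For example, applying the formula to 22 May 2020 gives the stored integer value:

  ((2020 - 1900) * 10000) + (5 * 100) + 22 = 1200000 + 500 + 22 = 1200522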

Teradata supports most of the standard date functions. Some of the commonly used date functions are listed below:

Date Function       Explanation
LAST_DAY            Returns the last day of the given month. The input may contain timestamp values as well.
NEXT_DAY            Returns the date of the weekday that follows a particular date.
MONTHS_BETWEEN      Returns the number of months between two date (timestamp) values.
ADD_MONTHS          Adds months to the given date (timestamp) value and returns the resulting date value.
OADD_MONTHS         Oracle-compatible version of ADD_MONTHS; adds months to the given date (timestamp) value and returns the resulting date value.
TO_DATE             Converts a string value to a DATE value and returns the resulting date value.
TO_TIMESTAMP        Converts a string value to a TIMESTAMP value and returns the resulting timestamp value.
TRUNC               Returns a DATE value with the time portion truncated to the unit specified by a format string.
ROUND               Returns a DATE value with the time portion rounded to the unit specified by a format string.
NUMTODSINTERVAL     Converts a numeric value to an INTERVAL DAY TO SECOND value.
NUMTOYMINTERVAL     Converts a numeric value to an INTERVAL YEAR TO MONTH value.
TO_DSINTERVAL       Converts a string value to an INTERVAL DAY TO SECOND value.
TO_YMINTERVAL       Converts a string value to an INTERVAL YEAR TO MONTH value.
EXTRACT             Extracts portions of the day, month, and year from a given date value.
INTERVAL            Used to perform arithmetic operations on DATE and TIME values.

EXTRACT

EXTRACT function is used to extract portions of the day, month, and year from a DATE value. This function is also used to extract hour, minute, and second from TIME/TIMESTAMP value.

Examples

1. The following example shows how to extract Year value from Date and Timestamp values.

  SELECT EXTRACT(YEAR FROM CURRENT_DATE);
  EXTRACT(YEAR FROM Date)

Output

2020

2. The following example shows how to extract Month values from Date and Timestamp values.

  SELECT EXTRACT(MONTH FROM CURRENT_DATE);
  EXTRACT(MONTH FROM Date)

Output

5

3. The following example shows how to extract Day values from Date and Timestamp values.

  SELECT EXTRACT(DAY FROM CURRENT_DATE);
  EXTRACT(DAY FROM Date)

Output

22

4. The following example shows how to extract Hour values from Date and Timestamp values.

  SELECT EXTRACT(HOUR FROM CURRENT_TIMESTAMP);
  EXTRACT(HOUR FROM Current TimeStamp(6))

Output

6

5. The following example shows how to extract Minute values from Date and Timestamp values.

  SELECT EXTRACT(MINUTE FROM CURRENT_TIMESTAMP);
  EXTRACT(MINUTE FROM Current TimeStamp(6))

Output

46

6. The following example shows how to extract the Second values from Date and Timestamp values.

  SELECT EXTRACT(SECOND FROM CURRENT_TIMESTAMP);
  EXTRACT(SECOND FROM Current TimeStamp(6))

Output

25.150000

INTERVAL

Teradata provides the INTERVAL function to perform arithmetic operations on DATE and TIME values. There are two types of INTERVAL functions:

1. Year-Month Interval

  • YEAR
  • YEAR TO MONTH
  • MONTH

2. Day-Time Interval

  • DAY
  • DAY TO HOUR
  • DAY TO MINUTE
  • DAY TO SECOND
  • HOUR
  • HOUR TO MINUTE
  • HOUR TO SECOND
  • MINUTE
  • MINUTE TO SECOND
  • SECOND

Examples

1. The following example adds 4 years to the current date.

  SELECT CURRENT_DATE, CURRENT_DATE + INTERVAL '04' YEAR;

Output

05/22/2024

2. The following example adds 4 years and 03 months to the current date.

  SELECT CURRENT_DATE, CURRENT_DATE + INTERVAL '04-03' YEAR TO MONTH;

Output

08/22/2024

3. The following example adds 03 days, 05 hours, and 10 minutes to the current timestamp.

  SELECT CURRENT_TIMESTAMP, CURRENT_TIMESTAMP + INTERVAL '03 05:10' DAY TO MINUTE;

Output

05-25-2020 10:07:25.150000+00.00


Teradata Built-In Functions

Teradata provides built-in functions, which are extensions to SQL. The Built-in functions return information about the system.

Built-in functions are sometimes referred to as special registers. They can be used anywhere that a literal can appear.

If a SELECT statement that contains a built-in function references a table name, then the result of the query contains one row for every row of the table that satisfies the search condition.

Some common built-in functions are listed below with examples, such as:

S. No.   Function                    Example Result
1        SELECT DATE;                Date                   2018/02/10
2        SELECT CURRENT_DATE;        Date                   2020/05/23
3        SELECT TIME;                Time                   09:02:00
4        SELECT CURRENT_TIME;        Time                   10:01:13
5        SELECT CURRENT_TIMESTAMP;   Current TimeStamp(6)   2020-05-23 10:01:13.990000+00.00
6        SELECT DATABASE;            Database               TDUSER


DISTINCT Option

The DISTINCT option specifies that duplicate values are not to be used when an expression is processed.

The following SELECT returns the number of unique Deduction values in the Salary table.

  SELECT COUNT(DISTINCT Deduction) FROM Salary;

Output

If the distinct Deduction values in the table are 2000, 1000, 0, and 3000, the query returns 4.

A query can have multiple aggregate functions that use DISTINCT with the same expression, such as:

  SELECT SUM(DISTINCT x), AVG(DISTINCT x) FROM Salary;

A query can also have multiple aggregate functions that use DISTINCT with different expressions, such as:

  SELECT SUM(DISTINCT x), AVG(DISTINCT y) FROM Salary;

Teradata CASE & COALESCE

CASE and COALESCE are both used in Teradata, but for different purposes and with different functionality.

CASE Expression

The Teradata CASE expression provides the flexibility to return alternate values for a column based on the conditions specified in the expression.

The CASE expression evaluates each row against a condition or WHEN clause and returns the result of the first match. If there are no matches, the result from the ELSE part is returned.

Syntax

Following is the syntax of the CASE expression.

  CASE <expression>
  WHEN <expression> THEN result-1
  WHEN <expression> THEN result-2
  ...
  ELSE
     result-n
  END

Example

Consider the following Employee records in the table below.

Employee_Id     Age     Salary     Marital_Status
202001          22      20,000     1
202002          25      25,000     1
202003          30      35,000     2
202004          26      32,000     2
202005          32      40,000     2

In this example, we evaluate the Marital_Status column: the CASE expression returns 'Single' if the value is 1, 'Married' if the value is 2, and 'Not Sure' otherwise.

Now, we will apply the CASE statement on the Marital_Status column as follows:

  SELECT
     Employee_Id,
     Age,
     Salary,
     CASE WHEN Marital_Status = 1 THEN 'Single'
          WHEN Marital_Status = 2 THEN 'Married'
          ELSE 'Not Sure'
     END AS Marital_Status
  FROM Employee;

After executing the above code, it produces the following output.

Employee_Id     Age     Salary     Marital_Status
202001          22      20,000     Single
202002          25      25,000     Single
202003          30      35,000     Married
202004          26      32,000     Married
202005          32      40,000     Married

The above CASE expression can also be written in the following way, which will produce the same result as above.

  SELECT
     Employee_Id,
     Age,
     Salary,
     CASE Marital_Status WHEN 1 THEN 'Single'
                         WHEN 2 THEN 'Married'
                         ELSE 'Not Sure'
     END AS Marital_Status
  FROM Employee;

COALESCE Expression

Teradata COALESCE is used for NULL handling. COALESCE is an expression that returns its first non-NULL argument. It returns NULL if all of its arguments evaluate to NULL.

Syntax

Here is the basic syntax of the COALESCE function:

  COALESCE(expression-1, expression-2, ..., expression-n)

Example

Consider the following Student table.

(Student table with columns Roll_No, First_Name, Last_Name, Landline_no, and Mobile_no; some of the phone numbers are NULL.)

Now we can prioritize which phone number to select using COALESCE function as follows:

  SELECT
     Roll_No,
     First_Name,
     Last_Name,
     COALESCE(Landline_no, Mobile_no, 'phone# not available') AS phone_no
  FROM Student;

In the above example, COALESCE checks Landline_no first. If that is NULL, it checks Mobile_no. If both numbers are NULL, it returns the literal 'phone# not available'; otherwise it returns the first non-NULL phone number found.

When we execute the above query, it generates the following output.

(Output: one phone_no per student, taken from Landline_no, then Mobile_no, or the text 'phone# not available'.)

NULLIF

The NULLIF statement returns NULL if the arguments are equal.

Syntax

Following is the syntax of the NULLIF statement.

  NULLIF(expression-1, expression-2)

Example

The following example returns NULL if the Mobile_No is equal to 0. Otherwise, it returns the Mobile_No value.

  SELECT
     Roll_No,
     NULLIF(Mobile_no, 0) AS Mobile
  FROM Student;

The above query returns the following records. We can see that Roll_No 101 and 104 have Mobile as NULL.

(Output: Roll_No and Mobile, with Mobile shown as NULL for Roll_No 101 and 104.)


Teradata Primary Index

The primary index is used to specify where the data resides in Teradata. It is used to determine which AMP gets the data row.

In Teradata, each table is required to have a primary index defined. If the primary index is not defined, Teradata automatically assigns the primary index.

The primary index provides the fastest way to access the data. A primary index may have a maximum of 64 columns. The primary index is defined while creating a table, and it cannot be altered or modified afterwards.

The primary index is the most preferred and essential index for:

  • Data distribution
  • Known access path
  • Improves join performance

Rules for Primary Index

Here are some specific rules for the Primary Index:

Rule 1: One Primary index per table.

Rule 2: A Primary index value can be unique or non-unique.

Rule 3: A Primary index value can be NULL.

Rule 4: The Primary index of a populated table cannot be modified.

Rule 5: The Primary index value of a row can be modified.

Rule 6: A Primary index has a limit of 64 columns.

Types of Primary Index

There are two types of Primary Indexes.

  1. Unique Primary Index (UPI)
  2. Non-Unique Primary Index (NUPI)

1. Unique Primary Index (UPI)

In the Unique Primary Index table, the column should not have any duplicate values. If any duplicate values are inserted, they will be rejected. The Unique Primary index enforces uniqueness for a column.

A Unique Primary Index (UPI) will always spread the rows of the table evenly amongst the AMPs. UPI access is always a one-AMP operation.

How to create a Unique Primary Index?

In the following example, we create the Student table with columns Roll_no, First_name, and Last_name.

Roll_no     First_name     Last_name
1001        Mike           Richard
1002        Robert         Williams
1003        Peter          Collin
1004        Alexa          Stuart
1005        Robert         Peterson

We have selected Roll_no to be our Primary Index. Because we have designated Roll_no as a Unique Primary Index, there can be no duplicates of student roll numbers in the table.

  CREATE SET TABLE Student
  (
     Roll_no INTEGER,
     First_name VARCHAR(10),
     Last_name VARCHAR(10)
  )
  UNIQUE PRIMARY INDEX(Roll_no);

2. Non-Unique Primary Index (NUPI)

A Non-Unique Primary Index (NUPI) means that the values for the selected column can be non-unique.

A Non-Unique Primary Index will generally not spread the table rows evenly. An all-AMP operation will take longer if the data is unevenly distributed.

We might pick a NUPI over a UPI because the NUPI column may be more useful for query access and joins.

How to Create a Non-Unique Primary Index?

In the following example, we create the employee table with columns Employee_Id, Name, Department, and City.

Employee_Id     Name      Department       City
202001          Max       Sales            London
202002          Erika     Finance          Washington
202003          Nancy     Management       Paris
202004          Bella     Human Resource   London
202005          Sam       Marketing        London

Every employee has a different employee Id, name, and department, but many employees belong to the same city in this table. Therefore we have selected City to be our Non-Unique Primary Index.

  CREATE SET TABLE Employee
  (
     Employee_Id INTEGER,
     Name VARCHAR(10),
     Department VARCHAR(10),
     City VARCHAR(10)
  )
  PRIMARY INDEX(City);

Multi-Column Primary Indexes

Teradata allows more than one column to be designated as the Primary Index. It is still only one Primary Index, but it is made up of multiple columns combined.

Teradata Multi-column Primary index allows up to 64 combined columns to make up the one Primary Index required for a table.

Example

In the following example, we designate First_Name and Last_Name combined to make up the Primary Index (a sketch follows the list below).

This is very useful and beneficial for two reasons:

  1. To get better data distribution among the AMPs.
  2. And for users who often use multiple keys consistently to query.
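A minimal sketch of such a table (the table name and column sizes here are assumptions):

  CREATE SET TABLE Employee_Phone
  (
     First_Name VARCHAR(20),
     Last_Name  VARCHAR(20),
     Phone_No   VARCHAR(15)
  )
  PRIMARY INDEX (First_Name, Last_Name);   -- one Primary Index made up of two columns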

Data distribution using Primary Index

When a user submits an SQL request against a table using a Primary Index, the request becomes a one-AMP operation, which is the most direct and efficient way for the system to find a row. The process is explained below.

The complete process is explained in the steps below.

Hashing Process

The hashing process consists of the following steps:

Step 1: In the first step, the Primary index value goes into the hashing algorithm.

Step 2: The output of the hashing algorithm is the row hash value.

Step 3: The hash map points to the specific AMP where the row resides.

Step 4: The PE sends the request directly to the identified AMP.

Step 5: The AMP locates the row(s) on its vdisk.

Step 6: The data is sent to PE through BYNET, and PE sends the answer set to the client application.
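For example, a query that supplies the Primary Index value follows exactly this path and touches only one AMP (assuming the Student table defined earlier):

  SELECT *
  FROM Student
  WHERE Roll_no = 1001;   -- the PE hashes 1001, the hash map identifies the owning AMP, and only that AMP reads the row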

Duplicate Row Hash Values

The hashing algorithm can end up with the same row hash value for two different rows.

This can happen in two ways:

  1. Duplicate NUPI values: If a Non-Unique Primary Index is used, duplicate NUPI values will produce the same row hash value.
  2. Hash synonym: It is also called a hash collision. It occurs when the hashing algorithm calculates an identical row hash value for two different Primary Index values.

To differentiate each row in a table, every row is assigned a unique Row ID. The Row ID is the combination of the row hash value and unique value.

Row ID = Row Hash Value + Uniqueness Value

The uniqueness value is used to differentiate between rows whose Primary Index values generate identical row hash values. In most cases, only the row hash value portion of the Row ID is needed to locate the row.


When each row is inserted, the AMP adds the row ID, stored as a prefix of the row.

The first row inserted with a particular row hash value is assigned a uniqueness value of 1, and the uniqueness value is incremented by 1 for each additional row inserted with the same row hash value.


Teradata Joins

Join is used to combine records from more than one table. Tables are joined based on the common columns and values from these tables.

There are different types of Joins available in Teradata.

  • Inner Join
  • Left Outer Join
  • Right Outer Join
  • Full Outer Join
  • Self Join
  • Cross Join
  • Cartesian Product Join

INNER JOIN

Inner Join combines records from multiple tables and returns the values that exist in both the tables.

Syntax

Following is the syntax of the INNER JOIN statement.

  SELECT col1, col2, col3...
  FROM
  Table-1
  INNER JOIN
  Table-2
  ON (col1 = col2)
  <WHERE condition>;

Example

Consider the following two tables, such as the student table and attendance table.

Student Table:

RollNo     FirstName     LastName     BirthDate
1001       Mike          Richard      1/2/1996
1002       Robert        Williams     3/5/1995
1003       Peter         Collin       4/1/1994
1004       Alexa         Stuart       11/6/1995
1005       Robert        Peterson     12/1/1997

Attendance Table:

RollNo     Present     Absent     Percentage
1001       200         20         90%
1002       160         60         72%
1003       150         70         68%
1004       210         10         95%

The following query joins the Student table and the Attendance table on the common column RollNo. Each table is assigned an alias (A and B), and the columns are referenced with the correct alias.

  SELECT A.RollNo, B.Percentage
  FROM
  Student A
  INNER JOIN
  Attendance B
  ON (A.RollNo = B.RollNo);

When the above query is executed, it returns the following records. Rollno 1005 is not included in the result since it doesn't have matching records in the attendance table.

  RollNo     Percentage
  1001       90%
  1002       72%
  1003       68%
  1004       95%

OUTER JOIN

LEFT OUTER JOIN and RIGHT OUTER JOIN also combine the results from multiple tables.

  • LEFT OUTER JOIN: It returns all the records from the left table and only the matching records from the right table.
  • RIGHT OUTER JOIN: It returns all the records from the right table and only the matching rows from the left table.
  • FULL OUTER JOIN: It combines the results of the LEFT OUTER and RIGHT OUTER JOINs. It returns both matching and non-matching rows from the joined tables.

Syntax

Following is the syntax of the OUTER JOIN statement. We need to use one of the options from LEFT OUTER JOIN, RIGHT OUTER JOIN, or FULL OUTER JOIN.

  SELECT col1, col2, col3...
  FROM
  Table-1
  LEFT OUTER JOIN/RIGHT OUTER JOIN/FULL OUTER JOIN
  Table-2
  ON (col1 = col2)
  <WHERE condition>;

Example

Consider the following example of the LEFT OUTER JOIN query. It returns all the records from the Student table and matching records from the attendance table.

  SELECT A.RollNo, B.Percentage
  FROM
  Student A
  LEFT OUTER JOIN
  Attendance B
  ON (A.RollNo = B.RollNo)
  ORDER BY A.RollNo;

When the above query is executed, it produces the following output. For student 1005, the Percentage value is NULL, since there is no matching record in the Attendance table.

  RollNo     Percentage
  1001       90%
  1002       72%
  1003       68%
  1004       95%
  1005       -
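A FULL OUTER JOIN follows the same pattern; rows that exist on only one side appear with NULLs for the other side's columns. A sketch using the same tables:

  SELECT A.RollNo, B.Percentage
  FROM
  Student A
  FULL OUTER JOIN
  Attendance B
  ON (A.RollNo = B.RollNo)
  ORDER BY A.RollNo;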

CROSS JOIN

Cross Join joins every row from the left table to every row from the right table.

A Cross Join is equivalent to a product join, and there is no "ON" clause in a Cross Join.

Syntax

Following is the syntax of the CROSS JOIN statement.

  SELECT col1, col2, col3...
  FROM
  Table-1
  CROSS JOIN
  Table-2
  <WHERE condition>;

Example

Consider the following example of the CROSS JOIN query.

  SELECT A.RollNo, B.RollNo, B.Percentage
  FROM
  Student A
  CROSS JOIN
  Attendance B
  WHERE A.RollNo = 1001
  ORDER BY B.RollNo;

When the above query is executed, it produces the following output. RollNo 1001 from the Student table is joined with each and every record from Attendance Table.

RollNo     RollNo     Percentage
1001       1001       90%
1001       1002       72%
1001       1003       68%
1001       1004       95%


Teradata Substring

The Teradata SUBSTRING or SUBSTR function is one of the Teradata string functions, and it is used to cut a substring from a string based on its position.

SUBSTR and SUBSTRING work the same way in Teradata, but their syntax is different.

SUBSTRING uses the ANSI syntax, while SUBSTR uses the Teradata syntax. The ANSI syntax is designed to be compatible with other database systems.

Syntax

  SUBSTR (<column-name>, <start-location> [, <length>])

Or

  SUBSTRING (<column-name> FROM <start-location> [FOR <length>])
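Both forms return the same result for the same arguments. For example, reusing the 'Robert' string from the earlier SUBSTRING example:

  SELECT SUBSTR('Robert', 1, 3);              -- Teradata syntax, returns 'Rob'
  SELECT SUBSTRING('Robert' FROM 1 FOR 3);    -- ANSI syntax, returns 'Rob'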

Argument Types and Rules

SUBSTRING and SUBSTR operate on the following types of arguments:

  • Character
  • Byte
  • Numeric

If the string argument is numeric, it is implicitly converted to a character type. User-defined types (UDTs) are supported if they have an implicit cast to any of the following predefined types:

  • Character
  • Numeric
  • Byte
  • DATE

To define an implicit cast for a UDT, we use the CREATE CAST statement and specify the AS ASSIGNMENT clause.

Implicit type conversion of UDTs for system operators and functions, including SUBSTRING and SUBSTR, is a Teradata extension to the ANSI SQL standard. To disable this extension, set the DisableUDTImplCastForSysFuncOp field of the DBS Control Record to TRUE.

Result Type and Attributes

Here are the default result type and attributes for SUBSTR and SUBSTRING:

If the string argument is a:

  • BLOB, then the result type is BLOB(n).
  • Byte string other than BLOB, the result type is VARBYTE(n).
  • CLOB, then the result type is CLOB(n).
  • Numeric or character string other than CLOB, the result type is VARCHAR(n).

In ANSI mode, the value of n for the resulting BLOB(n), VARBYTE(n), CLOB(n), or VARCHAR(n) is the same as the original string.

In Teradata mode, the value of n for the result type depends on the number of characters or bytes in the resulting string. To get the data type of the resulting string, we can use the TYPE function.

Difference between SUBSTRING and SUBSTR

The SUBSTR function is the original Teradata substring operation. It is written to be compatible with DB/2.

  • It can be used in the SELECT list, to return any portion of the character data stored in a column to a client, or in the WHERE clause.
  • When we use the SUBSTR function, as with SUBSTRING, the name of the column needs to be provided along with the starting character location and the length (number of characters) to return.
  • The main difference is that commas are used as delimiters between these three parameters instead of FROM and FOR.

The length argument is optional. When it is not included, all remaining characters to the end of the column are returned. In earlier releases of Teradata, SUBSTR was much more restrictive in the values it allowed, which increased the chances of an SQL statement failing due to unexpected data.

  • Both SUBSTRING and SUBSTR allow for partial character data strings to be returned, even in ANSI mode. These functions only store the requested data in a spool, not the entire column. Therefore, the amount of spool space required can be reduced or tuned using the substring functions.
  • SUBSTR is now as compatible and tolerant regarding the parameter values passed to it as the newer SUBSTRING.
  • The SUBSTRING is the ANSI standard, and therefore, it is the better choice between these two functions.

Teradata Table Types

Teradata supports the following table types; several of them are designed to hold temporary data.

  1. ANSI Temporal: ANSI-compliant support for temporal tables. Using temporal tables, Teradata Database can process statements and queries that include time-based reasoning. Temporal tables record both system time (the time when the information was recorded in the database) and valid time (when the information is in effect or correct in a real-world application).
  2. Derived: A derived table is a type of temporary table obtained from one or more other tables as the result of a subquery. It is specified in an SQL SELECT statement. It avoids the need to use CREATE TABLE and DROP TABLE statements for storing retrieved information, and it is useful for more sophisticated and complex query code.
  3. Error Logging: Error logging tables store information about errors on an associated permanent table, including log information about insert and update errors.
  4. Global Temporary: The contents of a global temporary table are private to the session and are dropped automatically at the end of the session. The table has a persistent definition stored in the Data Dictionary. The saved definition may be shared by multiple users and sessions, with each session getting its own instance of the table.
  5. Global Temporary Trace: Global temporary trace tables store trace output for the length of the session. They have a persistent table definition stored in the Data Dictionary. They are useful for debugging SQL stored procedures (via a call to an external stored procedure written to the trace output) and external routines (UDFs, UDMs, and external stored procedures).
  6. NoPI: NoPI tables are permanent tables that do not have primary indexes defined on them. They provide a performance advantage when used as staging tables to load data from FastLoad or TPump Array INSERT. They can have secondary indexes defined on them to avoid full-table scans during row access.
  7. Permanent: Permanent tables allow different sessions and users to share table content.
  8. Queue: Queue tables are permanent tables with a timestamp column. The timestamp indicates when each row was inserted into the table. This establishes first-in first-out (FIFO) ordering of table contents, which is needed for customer applications requiring event processing.
  9. Volatile: Volatile tables are used only when:
     • One session needs the table.
     • Only the creator needs to access the table.
     • We want better performance than a global temporary table.
     • We do not need the table definition after the session ends.
     Note: The definition of a volatile table can survive across a system restart if it is contained in a macro.

Derived Table

Derived tables are created, used, and dropped within a query. These are used to store intermediate results within a query.

Example

Consider the following employee records stored in two tables.

Emp Table:

Employee_Id     First_Name     Last_Name     Department_No
202001          Mike           Richard       1
202002          Robert         Williams      2
202003          Peter          Collin        2
202004          Alexa          Stuart        1
202005          Robert         Peterson      1

Salary Table:

Employee_IdFirst_NameLast_NameDepartment_No
20200140,0004,00036,000
20200280,0006,00074,000
20200390,0007,00083,000
20200475,0005,00070,000
20200580,0000080,000

The following query creates a derived table EmpSal with the records of employees whose NetPay is 80,000 or higher.

  SELECT
     Emp.Employee_Id,
     Emp.First_Name,
     EmpSal.NetPay
  FROM
     Emp,
     (SELECT Employee_Id, NetPay
      FROM Salary
      WHERE NetPay >= 80000) EmpSal
  WHERE Emp.Employee_Id = EmpSal.Employee_Id;

When the above query is executed, it returns the employee's records with a salary higher than or equal to 80,000.

Employee_Id       First_Name       NetPay
202003            Peter            83,000
202005            Robert           80,000

Volatile Table

Volatile tables are created and dropped within a user session. Their definition is not stored in the data dictionary. They hold intermediate data of the query, which is frequently used.

Syntax

  1. CREATE [SET|MULTISET] VOLATILE TABLE tablename   
  2. <table definitions>   
  3. <column definitions>   
  4. <index definitions>   
  5. ON COMMIT [DELETE|PRESERVE] ROWS  

Example

In the following example, we create a VOLATILE table named dept_stat.

  1. CREATE VOLATILE TABLE dept_stat (   
  2.    dept_no INTEGER,   
  3.    avg_salary INTEGER,   
  4.    max_salary INTEGER,   
  5.    min_salary INTEGER   
  6. )   
  7. PRIMARY INDEX(dept_no)   
  8. ON COMMIT PRESERVE ROWS;  
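Creating the table does not load any data into it. As a minimal sketch, assuming the Emp and Salary tables shown earlier in the derived-table example, the volatile table could be populated with an INSERT ... SELECT such as:

    INSERT INTO dept_stat
    SELECT e.Department_No,
           AVG(s.NetPay),
           MAX(s.NetPay),
           MIN(s.NetPay)
    FROM Emp e
    INNER JOIN Salary s
    ON e.Employee_Id = s.Employee_Id
    GROUP BY e.Department_No;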

Once the table has been populated this way, querying dept_stat returns the average, maximum, and minimum salary per department, for example:

Dept_no     avg_salary     max_salary     min_salary 
 1           62,000         80,000         36,000
 2           78,500         83,000         74,000

Global Temporary Table

The definition of the Global Temporary table is stored in the data dictionary, and they can be used by many users and sessions. But the data loaded into a global temporary table is retained only during the session.

We can materialize up to 2000 global temporary tables per session.

Syntax

Following is the syntax of the Global Temporary table.

  1. CREATE [SET|MULTISET] GLOBAL TEMPORARY TABLE tablename   
  2. <table definitions>   
  3. <column definitions>   
  4. <index definitions>   

Example

The following query creates a Global Temporary table:

  1. CREATE SET GLOBAL TEMPORARY TABLE dept_stat (   
  2.    dept_no INTEGER,   
  3.    avg_salary INTEGER,   
  4.    max_salary INTEGER,   
  5.    min_salary INTEGER   
  6. )   
  7. PRIMARY INDEX(dept_no);  

When the above query is executed, it returns the following output.

A table has been created.
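Any session can now materialize its own private instance of dept_stat simply by referencing it. A rough sketch follows; note that because the definition above uses the default ON COMMIT DELETE ROWS, the inserted rows survive only until the end of the transaction, so ON COMMIT PRESERVE ROWS would be added to the definition if the rows should last for the whole session.

    -- Materializes a session-local instance of the global temporary table
    INSERT INTO dept_stat VALUES (1, 62000, 80000, 36000);

    -- Visible only in this session, and only while the rows are retained
    SELECT * FROM dept_stat;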

Teradata Space Concepts

Teradata is designed to reduce the DBA's administrative functions when it comes to space management. Space concepts are configured in the following ways in the Teradata system.

  1. Permanent Space
  2. Spool Space
  3. Temporary Space

Permanent Space (Perm Space)

Permanent space is the maximum amount of space available for the user and database to hold data rows. Permanent tables, journals, fallback tables, and secondary index sub-tables use permanent space.

Permanent space is not pre-allocated for the database and user. The amount of permanent space is divided by the number of AMPs. Whenever the per-AMP limit is exceeded, an error message is generated.

Equal distribution is necessary because there is a high probability that the objects will be spread across all the AMPs, and at the time of data retrieval, all AMPs will work in parallel to fetch the data.

Unlike other relational databases, the Teradata database does not physically define the Perm space at the time of object creation. Instead of that, it represents the upper limit for the Perm space, and then Perm space is used dynamically by the objects.

[Figure: Teradata Space Concepts]

Example

Suppose we have a Teradata system with 500 GB of perm space. Initially, DBC owns 100% of the perm space of a new system. Because everything in Teradata is parallel, perm space is distributed evenly across all AMPs.

With 500 GB of perm space and 5 AMPs, each AMP gets 100 GB of perm space so queries can execute in parallel. Now suppose DBC creates two users, sysdba and appdba, with 20 GB and 100 GB of perm space respectively.

This space is taken from the parent, i.e., DBC, so DBC is left with 380 GB of perm space. This mechanism ensures enough space is available to execute all processes in the Teradata system.
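A minimal sketch of the two CREATE USER statements implied above (the passwords and the exponential byte notation are illustrative; exact options vary by site):

    CREATE USER sysdba FROM DBC
    AS PERM = 20e9,          -- 20 GB of permanent space
       PASSWORD = sysdba_pwd;

    CREATE USER appdba FROM DBC
    AS PERM = 100e9,         -- 100 GB of permanent space
       PASSWORD = appdba_pwd;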

Spool Space

Spool space is the amount of space on the system that has not been allocated. It is used by the system to keep the intermediate results of the SQL query. Users without spool space cannot execute any query.

When executing a conditional query, all the qualifying rows that satisfy the given condition will be stored in the Spool space for further processing by the query. Any Perm space currently unassigned is available as Spool space.

Spool space defines the maximum amount of space the user can use. Establishing a Spool space limit is not required when Users and Databases are created. But it is highly recommended to define the upper limit of Spool space for any object.

[Figure: Teradata Space Concepts]

Example

Suppose 10 GB of permanent space has been allocated to a database, but the tables in it actually occupy only 50% of that perm space, i.e., 5 GB.

The remaining 5 GB is available system-wide as spool. Spool space for a child is not subtracted from its immediate parent, and a child database's spool space can be equal to that of its parent. For example, the spool space of sysdba and appdba can be the same as that of DBC.

Temp Space

Temp space is the unused permanent space that is used by Global Temporary tables. Temp space is also divided by the number of AMPs.

Unused Perm space is used as Temp space for storing the table and its data. If we specify a value for Temp space, it should not exceed the value of the parent database or user.

If we do not specify a value, then the maximum value is inherited from that of the parent.
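To see how these limits are being consumed, a commonly used check is to query the DBC.DiskSpaceV dictionary view, summed over all AMPs per database (a sketch; the column names come from the standard view):

    SELECT DatabaseName,
           SUM(MaxPerm)     AS MaxPermSpace,
           SUM(CurrentPerm) AS UsedPermSpace,
           SUM(MaxSpool)    AS MaxSpoolSpace
    FROM DBC.DiskSpaceV
    GROUP BY DatabaseName
    ORDER BY DatabaseName;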

Teradata Statistics

The Teradata optimizer generates an execution strategy for every SQL query. This execution strategy is based on the statistics collected on the tables used within the SQL query. Statistics on a table are collected using the COLLECT STATISTICS command.

The COLLECT STATISTICS (Optimizer form) statement collects demographic data for one or more columns of a base table, hash index, or join index, computes a statistical profile of the collected data, and stores the synopsis in the Data Dictionary.

The Optimizer uses the synopsis data when it generates its table access and join plans.

Environment Information

The environment information used by the Optimizer includes the following:

  • Number of Nodes, AMPs, and CPUs
  • Amount of memory

Data Demographics

Data demographics include the following:

  • Number of rows
  • Row size
  • Range of values in the table
  • Number of rows per value
  • Number of Nulls

Usage

We should collect statistics on newly created, empty data tables. An empty collection defines the columns, indexes, and synoptic data structures for later collections on the loaded table.

We can then easily re-collect statistics after the table is populated for prototyping, and again when it is in production.

We can collect statistics in the following ways.

  1. A unique index, which can be:
    • Primary or secondary
    • Single or multiple columns
    • Partitioned or non-partitioned
  2. A non-unique index, which can be:
    • Primary or secondary
    • Single or multiple columns
    • Partitioned or non-partitioned
    • With or without COMPRESS fields
  3. A non-indexed column or set of columns, which can be:
    • Partitioned or non-partitioned
    • With or without COMPRESS fields
  4. A temporary table
    • If we specify the TEMPORARY keyword but a materialized table does not exist, the system first materializes an instance based on the specified column names and indexes.
      This means that after a valid instance is created, we can re-collect statistics on the columns by entering COLLECT STATISTICS and the TEMPORARY keyword without having to specify the desired columns and index.
    • If we omit the TEMPORARY keyword, but the table is temporary, statistics are collected for an empty base table rather than the materialized instance.
  5. Sample (system-selected percentage) of the rows of a data table or index, to detect data skew and dynamically increase the sample size when found.
    • The system does not store both sampled and defined statistics for the same index or column set. Once sampled statistics have been collected, implicit recollection hits the same columns and indexes and operates in the same mode.
  6. Join index
  7. Hash index
  8. NoPI table

How to approach Collect Statistics

There are three approaches to collect statistics on the table.

  • Random AMP Sampling
  • Full statistics collection
  • Using the SAMPLE option
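As a rough illustration of the last two approaches (the SAMPLE percentage clause is an assumption and depends on the Teradata release), full and sampled collection on the Employee table used in the next example might look like this:

    -- Full statistics: reads every row for the column
    COLLECT STATISTICS COLUMN (Department_No) ON Employee;

    -- Sampled statistics: reads only a percentage of the rows
    COLLECT STATISTICS USING SAMPLE 10 PERCENT COLUMN (Department_No) ON Employee;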

Collecting Statistics on Table

COLLECT STATISTICS command is used to collect statistics on a table.

Syntax

Following is the basic syntax to collect statistics on a table.

  1. COLLECT [SUMMARY] STATISTICS   
  2. [ INDEX (indexname) | COLUMN (columnname) ]   
  3. ON <tablename>;  

Example

Consider an Employee table with the following records, such as:

Emp_Id    First_Name   Last_Name   Department_No
202001    Mike         Richard     1
202002    Robert       Williams    2
202003    Peter        Collin      2
202004    Alexa        Stuart      1
202005    Robert       Peterson    1

We are going to run the following query to collect statistics for the Emp_Id, First_Name columns of the Employee table.

  1. COLLECT STATISTICS COLUMN(Emp_Id, First_Name) ON Employee;  
  2.   
  3.           /* Total elapsed time was 0.00 second */  

When the above query is executed, it produces the following output.

Update completed. 2 rows changed.

Viewing Statistics

We can view the collected statistics using the HELP STATISTICS command.

Syntax

Following is the syntax to view the statistics collected.

  1. HELP STATISTICS <tablename>;  

Example

Following is an example to view the statistics collected on the Employee table.

  1. HELP STATISTICS Employee;  

When the above query is executed, it produces the following table with updated columns and their values.

Date        Time        Unique Values    Column Names
6/2/20     10:05:02      5                  *
6/2/20     10:05:02      5                  Emp_Id, First_Name

Teradata Compression

Compression reduces the physical size of the stored information. The goal of compression is to represent information accurately using the fewest number of bits.

The Compression methods are either logical or physical. Physical data compression re-encodes information independently of its meaning, and logical data compression substitutes one set of data with another, more compact set.

In Teradata, compression can compress up to 255 distinct values, including NULL. Since the storage is reduced, Teradata can store more records in a block. This results in improved query response time since any input operation can process more rows per block.

Compression can be added at table creation using the CREATE TABLE statement or after table creation using the ALTER TABLE statement.
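As a hedged sketch of adding multi-value compression to an existing column with ALTER TABLE (assuming the Employee table used elsewhere in this tutorial; the value list is illustrative):

    ALTER TABLE Employee
    ADD Department_No COMPRESS (1, 2, 3);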

Compression has the following essential reasons, such as:

  • To reduce storage costs.
  • To enhance system performance.

Compression reduces storage costs by storing more logical data per unit of physical capacity. Compression produces smaller rows, resulting in more rows stored per data block and fewer data blocks.

Compression enhances system performance because there is less physical data to retrieve per row for queries. Because compressed data remains compressed while in memory, the FSG cache can hold more rows, reducing disk input.

Rules

Teradata Compression method has the following rules to compress stored data, such as:

  • Only 255 values can be compressed per column.
  • The Primary Index column cannot be compressed.
  • Volatile tables cannot be compressed.

Types of Compression

Teradata Database uses several types of compression.

Column values: The storage of those values one time only in the table header, not in the row itself, pointing to them using an array of presence bits in the row header. It applies to:
  • Multi-value compression
  • Algorithmic compression
We cannot apply either multi-value compression or algorithmic compression to row-level security constraint columns.

Hash and Join indexes: A logical row compression in which multiple sets of non-repeating column values are appended to a single set of repeating column values.
This allows the system to store the repeating value set only once, while any non-repeating column values are stored as logical segmental extensions of the base repeating set.

Data blocks: The storage of primary table data, or join or hash index subtable data.

Partition containers: The auto-compression method set determined by Teradata Database to apply to a container of a column-partitioned table or join index when we have not specified the NO AUTO COMPRESS option at the time the object was created.

Multi-Value Compression (MVC)

Multi-value compression (MVC) compresses repeating values in a column when we specify the value in a compression list in the column definition.

When data in the column matches a value specified in the compression list, the database stores it only once in the table header, regardless of how many times it occurs as a field value for the column. The database then stores a smaller substitute value, often as small as 2 bits, in each row where the value occurs.

MVC provides the best cost and benefit ratio compared to other methods because it requires minimal resources to uncompress the data during query processing. We can use MVC for frequently used data without compromising query and load performance.

MVC is also considered the easiest to implement of all the compression methods. MVC is a logical data compression form.

Multi-value compression has the following performance impacts, such as:

  • It improves table scan response times for most configurations and workloads.
  • It provides moderate to little CPU savings.

Example

The following example compresses the field Department_No for the values 1, 2, and 3. When compression is applied on a column, the values for this column are not stored with the row.

Instead, the values are stored in the Table header in each AMP. And only presence bits are added to the row to indicate the value.

  1. CREATE SET TABLE Employee   
  2. (   
  3.    Employee_No integer,   
  4.    First_Name CHAR(10),   
  5.    Last_Name CHAR(10),   
  6.    BirthDate DATE FORMAT 'YYYY-MM-DD',   
  7.      
  8.    Department_No CHAR(02) COMPRESS(1,2,3)   
  9. )   
  10. UNIQUE PRIMARY INDEX(Employee_No);  

The system automatically compresses the specified values when data moves into the table and uncompresses them when the containing rows are accessed.

NOTE

  • The system automatically compresses NULLs when specifying COMPRESS.
  • MVC cannot be applied to row-level security columns.

Algorithmic Compression

Teradata Database software includes several standard compression algorithms, in the form of UDFs, which are used to compress many types of data by table column. We can also create custom compression and decompression algorithms in the UDF format.

When column values are unique, then algorithmic compression (ALC) provides better compression results than MVC.

If columns have repeated values, then we can use ALC and MVC concurrently on the same column, but the system does not apply ALC to any value covered by MVC.
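A hedged sketch of how ALC is specified on a column follows. The TransUnicodeToUTF8/TransUTF8ToUnicode pair is one of the Teradata-supplied compression UDF pairs; the table and column names here are hypothetical, and availability of the UDFs should be verified for the release in use.

    CREATE TABLE Product_Notes
    (
       Product_Id INTEGER,
       Note VARCHAR(1000) CHARACTER SET UNICODE
            COMPRESS USING TD_SYSFNLIB.TransUnicodeToUTF8
            DECOMPRESS USING TD_SYSFNLIB.TransUTF8ToUnicode
    )
    PRIMARY INDEX (Product_Id);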

NOTE

  • Specify only one set of compression or decompression UDFs for a particular column.
  • The system automatically compresses nulls when specifying COMPRESS.

Row Compression

Row compression is a form of logical data compression in which Teradata Database stores a repeating column value set only once.

And other non-repeating column values that belong to that set are stored as logical segmental extensions of the base repeating set. Row compression is a lossless method.

Like multi-value compression, there is no decompression necessary to access row compressed data values.

Block-Level Compression

A data block is a physical unit of input that defines how Teradata handles data. When we specify block-level compression (BLC), Teradata stores data blocks in a compressed format to save storage space.

The goals of block-level compression are to save storage space and to reduce disk input bandwidth. Block-level compression can use significantly more CPU to compress and decompress data dynamically. Thus, whether query performance is enhanced with block-level compression depends on whether performance is more limited by disk input bandwidth or CPU usage.

Teradata Database offers software-based and hardware-based BLC. We select the BLC method when we specify the compression algorithm in the Compression Algorithm field in DBS Control.

  1. Hardware-Based Block-Level Compression
    Hardware-based BLC creates little or no CPU resource contention with other database operations.
    Hardware-based block-level compression provides the same functions as software-based block-level compression.
  2. Software-Based Block-Level Compression
    Software-based BLC is designed for use on large volumes of seldom-accessed data. BLC can also achieve relatively good compression on unknown and unclassified data across different column types, including the internal data structures and the row header.
    Block-level compression can be enabled at the system level using the DBS Control utility and at the table level using SQL DDL.
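A hedged sketch of enabling BLC at the table level via a CREATE TABLE option (the BLOCKCOMPRESSION table option is assumed to be available in the release in use; the table itself is illustrative):

    CREATE TABLE Sales_Archive,
       BLOCKCOMPRESSION = MANUAL
    (
       Sale_Id   INTEGER,
       Sale_Date DATE FORMAT 'YYYY-MM-DD',
       Amount    DECIMAL(12,2)
    )
    PRIMARY INDEX (Sale_Id);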

Auto Compression

When we create a column-partitioned table or join index, Teradata Database attempts to use one or more methods to compress the data that is inserted into the physical rows of the object, unless we specify the NO AUTO COMPRESS option when we create it, or NO AUTO COMPRESS is set as the default.

The process of selecting and applying appropriate compression methods to the physical containers of a column-partitioned table or join index is referred to as Auto Compression.

Auto Compression is most useful for a column partition with a single column and COLUMN format.

Teradata Database applies Auto Compression only to column partitions with COLUMN format, and only when it reduces the size of a container. The compression methods chosen also satisfy the following requirements:

  • Minimal CPU resources are required to decompress the data for reading.
  • Teradata Database does not need to decompress many values to find a single value.

For some values, there are no applicable compression techniques that can reduce the size of the physical row, so Teradata Database does not compress the values for that physical row. Still, the system attempts to compress physical row values using one of the Auto Compression methods available to it.

Teradata Database automatically decompresses any compressed column partition values when we retrieve rows from a column-partitioned table.

Teradata Explain

The EXPLAIN command shows the Parsing Engine's (PE's) plan that would be sent to the AMPs. It returns the execution plan of the parsing engine in an English translation. It can be used with any SQL statement except another EXPLAIN command.

When a query is preceded with the EXPLAIN command, the execution plan of the Parsing Engine is returned to the user instead of being sent to the AMPs for execution. The explain plan gives a clear picture of the way the optimizer will execute the query.

It is a good practice to analyze the explain plan for a newly developed query before running it in the Teradata system.

The explain plan can be obtained in two ways: the first is to add the EXPLAIN keyword before the query, and the second is to simply press the "F6" key on the keyboard.

Explain plan is beneficial to analyze the performance issue of a query as it breaks down any complex query into the lowest level. The Explain plan provides a lot of information, such as:

  • Access path: Whether data is fetched using a full table scan or an index path such as the primary index, a secondary index, or any other index.
  • Confidence level: Whether the optimizer can use collected statistics or whether statistics are missing.
  • Join information: What kind of join is going to happen.
  • Time estimation: An estimated query completion time.

When we put EXPLAIN in front of a query statement (or simply press the F6 key), the optimizer reports one of the following estimated confidence levels to the user:

  • High Confidence: Statistics available on an Index or Column.
  • Low Confidence: Random sampling of INDEX, or Statistics available but AND/OR condition exists.
  • No Confidence: Random sampling based on AMP row count. No statistics are collected.

Explain Plan Keywords

To understand the EXPLAIN plan, we should know the following Keywords.

Locking Pseudo Table: Serial lock on a symbolic table. Every table has one. It is used to prevent deadlock situations between users.
Locking table for: Indicates that an ACCESS, READ, WRITE, or EXCLUSIVE lock has been placed on the table.
Locking rows for: Indicates that an ACCESS, READ, or WRITE lock is placed on the rows read or written.
Do an ABORT test: Guarantees a transaction is not in progress for this user.
All AMPs retrieve: All AMPs are receiving the AMP steps and are involved in providing the answer set.
By way of an all-rows scan: Rows are read sequentially on all AMPs.
By way of the primary index: Rows are read using the Primary Index column(s).
By way of index number: Rows are read using the Secondary Index; the number comes from HELP INDEX.
BMSMS: Bit Map Set Manipulation Step, an alternative direct access technique used when multiple NUSI columns are referenced in the WHERE clause.
Residual conditions: WHERE clause conditions, other than those of a join.
Eliminating duplicate rows: Providing unique values, usually the result of DISTINCT, GROUP BY, or a subquery.
Where unknown comparison will be ignored: Indicates that NULL values will not compare to TRUE or FALSE. Seen in a subquery using NOT IN or NOT = ALL because no rows will be returned on the ignored comparison.
Nested join: The fastest join possible. It uses a UPI to retrieve a single row after using a UPI or a USI in the WHERE clause to reduce the join to a single row.
Merge join: Rows of one table are matched to the other table on common domain columns after being sorted into the same sequence, normally Row Hash.
Product join: Rows of one table are matched to all rows of another table with no concern for a domain match.
ROWID join: A very fast join. It uses the ROWID of a UPI to retrieve a single row after using a UPI or a USI in the WHERE clause to reduce the join to a single row.
Duplicated on all AMPs: Participating rows for the table of a join are duplicated on all AMPs.
Hash redistributed on all AMPs: Participating rows of a join are hashed on the join column and sent to the same AMP that stores the matching row of the table to join.
SMS: Set Manipulation Step, the result of an INTERSECT, UNION, EXCEPT, or MINUS operation.
Last use: The SPOOL file is no longer needed after the step, and its space is released.
Built locally on the AMPs: As rows are read, they are put into SPOOL on the same AMP.
Aggregate Intermediate Results computed locally: The aggregation values are all on the same AMP; therefore, there is no need to redistribute them to work with rows on other AMPs.
Aggregate Intermediate Results computed globally: The aggregation values are not all on the same AMP and must be redistributed to one AMP to accompany the same value from the other AMPs.

How to EXPLAIN Plan Works

The EXPLAIN request modifier precedes any SQL request and causes Teradata Database to display the execution plan for that request. The request itself is not submitted for execution.

  • When we perform an EXPLAIN against any SQL request, that request is parsed and optimized.
  • The access and join plans generated by the Optimizer are returned in the form of a text file that explains the (possibly parallel) steps used in the execution of the request.
  • The plan also includes the relative cost required to complete the request, given the statistics with which the Optimizer had to work.
  • If the statistics are not reasonably accurate, the cost estimate may not be accurate.

Benefits of Using EXPLAIN

Here are some essential benefits of using the Explain Plan, such as:

  1. EXPLAIN helps to evaluate complex queries and to develop alternative, more efficient, processing strategies.
  2. We can get a better plan by collecting more statistics on more columns, or by defining additional indexes.
  3. Knowing the actual demographics information may allow identifying row count estimates that seem badly wrong and help pinpoint areas where additional statistics would be helpful.

Examples of EXPLAIN

Consider the Employee table with the following definition.

  1. CREATE SET TABLE EMPLOYEE, FALLBACK (   
  2.    Employee_Id INTEGER,   
  3.    First_Name VARCHAR(10),   
  4.    Last_Name VARCHAR(10),  
  5.    DOB DATE FORMAT 'YYYY-MM-DD',   
  6.      
  7.    Department_No BYTEINT   
  8. )   
  9. UNIQUE PRIMARY INDEX ( Employee_Id );  

Full Table Scan (FTS)

When no conditions are specified in the SELECT statement, then the optimizer may use the Full Table Scan where each and every row of the table is accessed.

Example

Following is a query where the optimizer may choose FTS.

  1. EXPLAIN SELECT * FROM Employee;  

When the above query is executed, it produces the following output. As can be seen, the optimizer chooses to access all AMPs and all rows within the AMP.

1. First, we lock a distinct TDUSER."pseudo table" for reading on a RowHash to prevent global deadlock for TDUSER.Employee.  
2. Next, we lock TDUSER.Employee to read.   
3. We do an all-AMPs RETRIEVE step from TDUSER.Employee by way of an all-rows scan with no residual conditions into Spool 1 (group_amps) built locally on the AMPs.  The size of Spool 1 is estimated with low confidence to be 2 rows (116 bytes). The estimated time for this step is 0.03 seconds.
4. Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request. 

//The contents of Spool 1 are sent back to the user as the result of statement 1.
//The total estimated time is 0.03 seconds.

Unique Primary Index

When the rows are accessed using the Unique Primary Index, it is a one-AMP operation.

  1. EXPLAIN SELECT * FROM Employee WHERE Employee_Id = 1001;  

When the above query is executed, it produces the following output. The optimizer uses the unique primary index for a single-AMP retrieval of the row.

1. First, we do a single-AMP RETRIEVE step from TDUSER.Employee by way of the unique primary index "TDUSER.Employee.Employee_Id = 1001" with no residual conditions.  

//The row is sent directly back to the user as the result of statement 1.  
//The total estimated time is 0.01 seconds.

Unique Secondary Index

When the rows are accessed using a Unique Secondary Index, it is a two-AMP operation.

Example

Consider the Salary table with the following definition.

  1. CREATE SET TABLE SALARY, FALLBACK   
  2. (   
  3.    Employee_Id INTEGER,   
  4.    Gross INTEGER,   
  5.    Deduction INTEGER,   
  6.    NetPay INTEGER   
  7. )  
  8. PRIMARY INDEX (Employee_Id)   
  9. UNIQUE INDEX (Employee_Id);  

Consider the following SELECT statement.

  1. EXPLAIN SELECT * FROM Salary WHERE Employee_Id = 1001;  

When the above query is executed, the plan shows that the optimizer retrieves the row in a two-AMP operation using the unique secondary index.

1. First, we do a two-AMP RETRIEVE step from TDUSER.Salary by way of unique index # 4 "TDUSER.Salary.Employee_Id = 1001" with no residual conditions.   

//The row is sent directly back to the user as the result of statement 1.  
//The total estimated time is 0.01 seconds.


Teradata Join Index

The Teradata Join index stores pre-joined tables. Join indexes are comparable to materialized views.

Join indexes are designed to permit queries to be resolved by accessing the index instead of accessing, and possibly joining, their underlying base tables.

A join index can contain one or more tables and can also provide pre-aggregated data. Join indexes are mainly used for improving performance.

Although join indexes are an index structure, they are better thought of as an added layer of pre-aggregated, pre-joined, permanent tables with their own row-level partitioning, Primary Index, and Secondary Indexes.

Even if a join index does not entirely cover a query, the optimizer can use it to join its underlying base tables in a way that provides better query optimization than scanning the base tables for all the columns specified in the request.

Join indexes are useful for queries that aggregate columns from tables with large cardinalities. These indexes play the role of pre-join and summary tables without denormalizing the logical design of the database, even though denormalization itself often enhances the performance of a particular query or queries.

There are three types of join indexes available in Teradata.

  1. Single Table Join Index (STJI)
  2. Multi-Table Join Index (MTJI)
  3. Aggregate Join Index (AJI)

Rules

Here are some essential rules of the Join Index, such as:

  • The Usage of the FULL OUTER JOIN is not allowed.
  • At least one non-null column must be selected on the inner side of the LEFT or RIGHT JOIN.
  • OUTER JOIN is preferable because it increases the likelihood that the join index will be used, but it is not allowed for the Aggregate Join Index.
  • HAVING and QUALIFY keywords are not allowed.
  • Set Operations are not allowed, such as UNION, INTERSECT, and MINUS.
  • No Subqueries are allowed.
  • Inequality conditions in ON clauses of join index definitions are not allowed.
  • Only <, <=, >=, > comparison operators are allowed.
  • During the restoration of a base table or database, the join index is marked as invalid.
  • A maximum of 64 columns per table per Join Index is allowed.
  • A maximum of 32 indexes can be defined on a table, including join indexes.
  • A UPI is allowed only for a single-table join index.
  • Value ordered NUSI and Partition Expression on the same Join Index columns are not allowed.

Single Table Join Index

Single-table join indexes are created from exactly one base table. Their purpose is to have the same table available with a different primary index or partitioning, or as a smaller table (the join index table) with fewer columns to be spooled.

This improves the performance of joins as no distribution or duplication is needed. The user will query on the base table, but PE will decide whether to access the base table or single table join index.

Syntax

Following is the syntax of a JOIN INDEX.

  1. CREATE JOIN INDEX <index name>   
  2. AS   
  3. <SELECT Query>   
  4. <Index Definition>;  

Example

Consider the following Employee and Salary tables.

  1. CREATE SET TABLE EMPLOYEE, FALLBACK   
  2. (   
  3.    Employee_Id INTEGER,   
  4.    First_Name VARCHAR(10),   
  5.    Last_Name VARCHAR(10),   
  6.    DOB DATE FORMAT 'YYYY-MM-DD',   
  7.      
  8.    DepartmentNo BYTEINT   
  9. )   
  10. UNIQUE PRIMARY INDEX (Employee_Id);  
  11.   
  12. CREATE SET TABLE SALARY, FALLBACK   
  13. (   
  14.    Employee_Id INTEGER,   
  15.    Gross INTEGER,   
  16.    Deduction INTEGER,   
  17.    NetPay INTEGER   
  18. )   
  19. PRIMARY INDEX (Employee_Id)   
  20. UNIQUE INDEX (Employee_Id);  

Following is an example of creating a Join index named Employee_JI on the Employee table.

  1. CREATE JOIN INDEX Employee_JI   
  2. AS   
  3. SELECT Employee_Id, First_Name, Last_Name,   
  4. DOB, DepartmentNo   
  5. FROM Employee   
  6. PRIMARY INDEX (First_Name);  

If the user submits a query with a WHERE clause on Employee_Id, then the system will query the Employee table using the unique primary index.

If the user queries the Employee table using First_Name, then the system can access the join index Employee_JI using First_Name.

The rows of the join index are hashed on the First_Name column. Without the join index (and with First_Name not defined as a secondary index), the system would perform a full table scan to access the rows, which is time-consuming.

We can run the following EXPLAIN and verify the optimizer's plan. In the following example, the optimizer uses the Join Index instead of the base Employee table when the table is queried using the First_Name column.

  1. EXPLAIN SELECT * FROM EMPLOYEE WHERE First_Name='Alexa';   
  2.   
  3.        /* Help information returned. 8 rows. */   
  4.        /* Total elapsed time was 1 second. */   

Explanation

  1. First, we do a single-AMP RETRIEVE operation from EMPLOYEE_JI by way of the primary index "EMPLOYEE_JI.First_Name = 'Alexa'" with no residual conditions into Spool 1 (one-amp), built locally on that AMP. The size of Spool 1 is estimated with low confidence to be 2 rows (232 bytes). The estimated time for this step is 0.01 seconds.  
  2.   
  3. //The contents of Spool 1 are sent back to the user as the result of statement 1.  
  4. //The total estimated time is 0.01 seconds.  

Multi-Table Join Index

A multi-table join index is created by joining more than one table. A multi-table join index can be used to store the result set of frequently joined tables to improve the performance.

A multi-table join index holds a pre-joined result set from two or more tables. During join processing, the PE may therefore decide to access data from the multi-table join index rather than joining the underlying base tables again. We should define a multi-table join index only after thorough analysis based on the frequency and cost of the join.

Multi-Table Join Indexes allow us to move resource-intensive joins from the online to the batch window.

Shifting the workload does not reduce the total workload, but it moves it to a point in time that is beneficial for the system's overall performance.

Example

The following example creates a JOIN INDEX named Employee_Salary_JI by joining Employee and Salary tables.

  1. CREATE JOIN INDEX Employee_Salary_JI   
  2. AS   
  3. SELECT a.Employee_Id,a.First_Name,a.Last_Name,   
  4. a.DOB,a.DepartmentNo,b.Gross,b.Deduction,b.NetPay   
  5. FROM Employee a   
  6. INNER JOIN Salary b   
  7. ON(a.Employee_Id = b.Employee_Id)   
  8. PRIMARY INDEX(First_Name);  

When the base tables Employee or Salary are updated, the Join index Employee_Salary_JI is also automatically updated.

If we are running a query joining Employee and Salary tables, then the optimizer may choose to access the data from Employee_Salary_JI directly instead of joining the tables. EXPLAIN plan on the query can be used to verify if the optimizer will choose the base table or Join index.

Aggregate Join Index

The aggregate join index method is used to resolve queries that frequently specify the same aggregate operations on the same column or columns.

When aggregate join indexes are available, then the system does not repeat aggregate calculations for every query. An aggregate join index can be defined on two or more tables or a single table.

If a table is consistently aggregated on specific columns, then an aggregate join index can be defined on the table to improve performance. One limitation of the aggregate join index is that it supports only the following operations.

  • SUM function
  • COUNT function
  • GROUP BY clause

Example

In the following example, the Employee and Salary are joined to identify the total salary per Department.

  1. CREATE JOIN INDEX Dept_Salary_JI   
  2. AS   
  3. SELECT a.DepartmentNo,SUM(b.NetPay) AS TotalPay   
  4. FROM Employee a   
  5. INNER JOIN Salary b   
  6. ON(a.Employee_Id = b.Employee_Id)  
  7. GROUP BY a.DepartmentNo   
  8. PRIMARY INDEX(DepartmentNo);   

Teradata Partitioned Primary Index

Partitioned Primary Index (PPI) is one of the powerful features of Teradata, which allows a user to access a part of the table rather than full table access. PPI is an indexing mechanism that is useful in improving the performance of specific queries. PPI works the same as Primary Index for the data distribution but creates partitions according to range or case as specified in the table.

A partitioned primary index (PPI) allows rows to be assigned to user-defined data partitions on the AMPs, enabling enhanced performance for range queries predicated on primary index values.

When rows are inserted into a table, they are stored in an AMP and arranged by their row hash order. And when a table is defined with PPI, then the rows are sorted by their partition number.

Within each partition, they are arranged by their row hash. Rows are assigned to a partition based on the partition expression defined. Below are the following significant points of the PPI, such as:

  • PPI is used to improve performance for large tables when we submit queries that specify a range constraint.
  • PPI allows reducing the number of rows to be processed by using partition elimination.
  • PPI will increase performance for incremental data loads, deletes, and data access when working with large tables with range constraints.

Commonly queried rows are stored together in partitions so that we can restrict data retrieval to those partitions that contain the rows required by your SQL statement. Other partitions that do not provide requested rows are not read.

Incidentally, other database systems, such as Netezza, go the other way round by storing the information about the data blocks where the requested rows are not stored. Netezza called this method "Data Skipping".

Partitioning will not help to avoid table scans, but it will reduce the number of data blocks that each AMP must move to its local memory (FSG cache) because only relevant partitions need to be accessed.

Characteristics of PPI

Here are some characteristics of the Partitioned Primary Index, such as:

  1. Partitioning can be done on volatile, global temporary, and permanent tables.
  2. The partitioning column does not have to be part of the Primary Index when the Primary Index is non-unique.
  3. If the Primary Index is unique, the partitioning column must be part of it.
  4. The partitioning column does not decide which AMP a row goes to; that still depends on the Primary Index.
  5. Rows within an AMP are sorted by partition number first, and then by row hash within each partition.
  6. PPI can be defined at multi-level.
  7. Teradata uses a partition elimination strategy to avoid a Full Table Scan (FTS) for partitioned tables.
  8. Partitioned tables may reduce input for range queries by reducing the number of rows to be processed.
  9. Details about the partitions, constraints & conditions can be retrieved from dbc.PartitioningConstraintsV view.

Advantages of PPI

Following are the advantages of the Partitioned Primary Index:

  • It avoids a full table scan for specific queries.
  • Unlike a secondary index, PPI does not require an additional physical subtable structure or the associated maintenance input.
  • It can access a subset of a large table quickly.
  • PPI makes it fast to drop old data and add new data.
  • It can be created on a global temp table, volatile table, and non-compressed join indexes.
  • For range-based queries, we can remove SIs and use PPI, which will reduce the space of an overhead SI Subtable.

Disadvantages

Partitioned Primary Index also has the following disadvantages, such as:

  • Partitioning can make single-row or primary index access to the table slower if the partitioning column is not part of the primary index.
  • An additional 2 bytes are added to each row, which increases perm space usage.
  • Joining a PPI table with a non-partitioned table can take longer.
  • Access on columns other than the partitioning column can take more time.

Types of PPI

There are four types of Partitioned Primary Index:

  1. Case partitioning
  2. Range-based partitioning
  3. Multi-level partitioning
  4. Character-based partitioning

Example

Consider the following Orders table with Primary Index on OrderNo.

StoreNo   OrderNo   OrderDate    OrderTotal
2001      75        2017-11-02   900
2001      99        2017-12-10   1,200
2002      77        2017-06-09   3,000
2002      18        2017-09-05   2,454
2001      71        2017-11-03   1,201
2003      70        2017-04-18   2,454
2001      80        2017-10-08   1,201
2001      65        2017-05-15   1,201

Assume that the records are distributed between AMPs as shown in the following tables. Records are stored in the AMPs, sorted by their row hash.

AMP 1

RowHash   OrderNo   OrderDate
1         71        2017-11-03
2         18        2017-09-05
3         75        2017-11-02
4         65        2017-05-15

AMP 2

RowHash   OrderNo   OrderDate
1         80        2017-10-08
2         99        2017-12-10
3         70        2017-04-18
4         77        2017-06-09

If we run a query to extract the orders for a particular date, then the optimizer may choose to use Full Table Scan, then all the records within the AMP may be accessed. We define the order date as Partitioned Primary Index to avoid the full table scan. When rows are inserted into the orders table, they are partitioned by the order date. Within each partition, they will be ordered by their row hash.

The following data shows how the records will be stored in AMPs if they are partitioned by Order Date. If a query is run to access the records by Order Date, only the partition that contains the records for that particular order will be obtained.

AMP 1

Partition   RowHash   OrderNo   OrderDate
0           3         75        2017-11-02
1           1         71        2017-11-03
1           2         18        2017-09-05
2           4         65        2017-05-15

AMP 2

Partition   RowHash   OrderNo   OrderDate
0           2         99        2017-12-10
0           4         77        2017-06-09
1           3         70        2017-04-18
2           1         80        2017-10-08

Following is an example of creating a table with a partition primary Index. PARTITION BY clause is used to define the partition.

  1. CREATE SET TABLE Orders   
  2. (  
  3.    StoreNo SMALLINT,   
  4.    OrderNo INTEGER,   
  5.    OrderDate DATE FORMAT 'YYYY-MM-DD',   
  6.    OrderTotal INTEGER   
  7. )   
  8. PRIMARY INDEX(OrderNo)   
  9. PARTITION BY RANGE_N    
  10. (  
  11.    OrderDate BETWEEN DATE '2015-01-01' AND DATE '2020-12-31' EACH INTERVAL '1' DAY  
  12. );  

In the above example, the table is partitioned by the OrderDate column. There will be one separate partition for each day.
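Before looking at further RANGE_N forms, here is a minimal sketch of case partitioning (the first type in the list above) using CASE_N on the same kind of Orders data; the table name and the OrderTotal bands are illustrative:

    CREATE SET TABLE Orders_ByValue
    (
       StoreNo SMALLINT,
       OrderNo INTEGER,
       OrderDate DATE FORMAT 'YYYY-MM-DD',
       OrderTotal INTEGER
    )
    PRIMARY INDEX (OrderNo)
    PARTITION BY CASE_N
    (
       OrderTotal <  1000,
       OrderTotal <  2000,
       OrderTotal >= 2000,
       NO CASE, UNKNOWN
    );

Rows whose OrderTotal matches none of the conditions go to the NO CASE partition, and rows with a NULL OrderTotal go to the UNKNOWN partition.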


PARTITION BY RANGE_N(totalorders BETWEEN *, 100, 1000 AND *,
                         UNKNOWN);

In the example, RANGE_N specifies four partitions to which a row can be assigned, based on the value of the totalorders column:

Partition 1: The value of the totalorders column is less than 100.
Partition 2: The value of the totalorders column is less than 1000, but greater than or equal to 100.
Partition 3: The value of the totalorders column is greater than or equal to 1000.
Partition 4: The totalorders column is NULL, so the range is UNKNOWN.

PARTITION BY (RANGE_N(totalorders BETWEEN *, 100, 1000 AND *)
                  ,RANGE_N(orderdate BETWEEN *, '2005-12-31' AND *) );

The example defines six partitions to which a row can be assigned. The first RANGE_N expression defines three partitions based on the value of the totalorders column. The second RANGE_N expression subdivides each of the three partitions into two partitions based on the value of the orderdate column.

Level 1 = 1, Level 2 = 1: totalorders < 100 and orderdate < '2005-12-31'.
Level 1 = 1, Level 2 = 2: totalorders < 100 and orderdate >= '2005-12-31'.
Level 1 = 2, Level 2 = 1: totalorders < 1000 but >= 100, and orderdate < '2005-12-31'.
Level 1 = 2, Level 2 = 2: totalorders < 1000 but >= 100, and orderdate >= '2005-12-31'.
Level 1 = 3, Level 2 = 1: totalorders >= 1000 and orderdate < '2005-12-31'.
Level 1 = 3, Level 2 = 2: totalorders >= 1000 and orderdate >= '2005-12-31'.
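Putting the two RANGE_N expressions together, a hedged sketch of a complete multi-level partitioned table follows; the table name and the non-partitioning column are assumptions for illustration:

    CREATE SET TABLE ordertbl
    (
       orderno     INTEGER,
       totalorders INTEGER,
       orderdate   DATE FORMAT 'YYYY-MM-DD'
    )
    PRIMARY INDEX (orderno)
    PARTITION BY (
       RANGE_N(totalorders BETWEEN *, 100, 1000 AND *),
       RANGE_N(orderdate BETWEEN *, DATE '2005-12-31' AND *)
    );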

Teradata Secondary Indexes

Secondary indexes are never required for Teradata Database tables, but they can often improve system performance.

A Secondary Index (SI) is an alternate data access path. It allows accessing the data without having to do a full-table scan.

The secondary index can be created explicitly using the CREATE TABLE and CREATE INDEX statements. Teradata Database can implicitly create unique secondary indexes.

For example, when we use a CREATE TABLE statement that specifies a primary index, Teradata Database implicitly creates unique secondary indexes on column sets that we specify using PRIMARY KEY or UNIQUE constraints.
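For instance, a hedged sketch of a table definition where a UNIQUE constraint causes the system to maintain a unique secondary index on the constrained column (the table is illustrative):

    CREATE TABLE Department
    (
       Department_No   INTEGER,
       Department_Name CHAR(20) NOT NULL UNIQUE
    )
    PRIMARY INDEX (Department_No);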

Secondary Indexes are stored in separate subtables that require additional disk space and maintenance, which is handled automatically by the system. These subtables are built on all AMPs.

These subtables contain the index rows, which add another set of rows that must be updated each time the base table changes.

Secondary index subtables are also duplicated whenever a table is defined with FALLBACK. The Secondary index is not involved in the data distribution.

The purpose of the secondary index subtable is to point back to the real row in the base table via the Row ID.

Secondary Index Rules

Here are some essential rules for the secondary index:

Rule 1: Secondary Indexes are optional.
Rule 2: Secondary Index values can be unique or non-unique.
Rule 3: Secondary Index values can be NULL.
Rule 4: Secondary Index values can be modified.
Rule 5: Secondary Indexes can be changed.
Rule 6: A Secondary Index has a limit of 64 columns.

Secondary Indexes Types

There are two types of secondary indexes in Teradata:

  • Unique Secondary Index (USI)
  • Non-Unique Secondary Index (NUSI)

Unique Secondary Index (USI)

A Unique Secondary Index allows unique values for the columns defined as USI.

A Unique Secondary Index (USI) serves two purposes.

  • Enforces uniqueness on a column or group of columns.
  • Speeds up access to a row (data retrieval speed).

When a USI is created, Teradata immediately builds a secondary index subtable on each AMP.

Each AMP will then hash the secondary index value for each of their rows in the base table.

Syntax

Below is the syntax to create a Unique Secondary Index:

  1. CREATE UNIQUE INDEX (column_list) ON <tablename>;  

Example

First create an Employee table with the following records, such as:

Emp_Id    First_Name   Last_Name   Department_No
202001    Mike         Richard     1
202002    Robert       Williams    2
202003    Peter        Collin      2
202004    Alexa        Stuart      1
202005    Robert       Peterson    1

The following example creates USI on the Emp_Id column of the employee table.

  1. CREATE UNIQUE INDEX(Emp_Id) on Employee;  

The output of the Emp_Id hash will utilize the hash map to point to a specific AMP and that AMP will hold the secondary index subtable row for the secondary index value.

That means the subtable row will hold the base table Row-ID, and Teradata will then find the base row immediately.

Non-Unique Secondary Index (NUSI)

A Non-Unique Secondary Index (NUSI) allows duplicate values for the columns defined as NUSI.

A Non-Unique Secondary Index is usually specified to prevent full-table scans, in which every row of a table is read.

A USI is always a two-AMP operation, so it is almost as fast as a Primary Index access. A NUSI, on the other hand, is an all-AMP operation, though not a full table scan.

When a NUSI is created, Teradata immediately builds a secondary index subtable on each AMP.

Each AMP will hold the secondary index values for their rows in the base table only. There can be up to 32 Secondary Indexes on a table.

Syntax

Below is the generic syntax to create a Non-Unique Secondary Index:

  1. CREATE INDEX (column_list) ON <tablename>;  

Example

The following example creates NUSI on the First_Name column of the employee table.

  1. CREATE INDEX(First_Name) on Employee;  

In the above example, each AMP holds the name column for all employee rows in the base table on their AMP (AMP local).

Each AMP Local name will have the Base Table Row-ID (pointer), so the AMP can retrieve it quickly if needed.

If an AMP contains duplicate first names, only one subtable row for that name is built with multiple Base Row-IDs.

Teradata Hashing Algorithm

The hashing algorithm is a piece of code that acts as a translation table. A row is assigned to a particular AMP based on the primary index value. Teradata uses a hashing algorithm to determine which AMP gets the row.

The Teradata Database hashing algorithms are proprietary mathematical functions that transform an input data value of any length into a 32-bit value referred to as a rowhash, which is used to assign a row to an AMP.

Whether the input is a combination of different column values, comes from differently-sized values from a variable-length column, or is composed of different data types, the hashing algorithm's output will always be a fixed size and format.

Dealing with uniformly sized row identifiers simplifies the database's effort when storing, aggregating, sorting, and joining rows.

Following is a high-level diagram on the hashing algorithm.

[Diagram: Teradata Hashing Algorithm]

Here is the explanation of the above diagram, how to insert the data:

  • First, the client submits a query.
  • Then, the parser receives the query and passes the PI value of the record to the hashing algorithm.
  • The hashing algorithm hashes the primary index value and returns a 32-bit number, called Row Hash.
  • The higher-order bits of the row hash (first 16 bits) are used to identify the hash map entry. The hash map contains one AMP #. Hash map is an array of buckets that contain specific AMP #.
  • BYNET sends the data to the identified AMP.
  • AMP uses the 32 bit Row hash to locate the row within its disk.
  • If there is any record with the same row hash, it increments the uniqueness ID, a 32-bit number. For new row hash, uniqueness ID is assigned as 1 and incremented whenever a record with the same row hash is inserted.
  • The combination of Row hash and Uniqueness ID is called Row ID.
  • Row ID prefixes each record in the disk.
  • Each table row in the AMP is logically sorted by their Row IDs.

Hashing Functions

Teradata SQL provides several functions that can be used to analyze the hashing characteristics of the existing indexes and candidate indexes.

These functions are documented fully in SQL Functions, Operators, Expressions, and Predicates. There are four types of hashing functions available in the Teradata.

1. HASHROW: It returns the hexadecimal row hash value for an expression. The query gives the same result every time it is run on the same data.

The HASHROW function produces the 32-bit binary row hash that is stored as part of the data row. It can return a maximum of 4,294,967,295 unique values.

Syntax

  1. HASHROW( [ < data-column-value > [..., < data-column-value >...] ] )  

2. HASHAMP: The HASHAMP function returns the identification number of the primary AMP for any Hash Bucket number.

When no value is passed to the HASHAMP function, it returns one less than the number of AMPs in the current system configuration.

Syntax

  1. HASHAMP ( [ <hash-bucket> ] )  

3. HASHBUCKET: The HASHBUCKET function produces the binary hash bucket number that is used with the Hash Map to determine the AMP that stores and retrieves the data row.

On systems that use 20-bit hash buckets, the values range from 0 to 1,048,575 (older 16-bit systems range from 0 to 65,535), not counting NULL as a possible result.

Syntax

  1. HASHBUCKET ( [ <row-hash> ] )  

4. HASHBAKAMP: The HASHBAKAMP function returns the identification number of the Fallback AMP for any Hash Bucket number.

Syntax

  1. HASHBAKAMP ( [ <hash-bucket> ] )  
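A common way to use these functions together is to check how evenly a candidate primary index column would distribute rows across the AMPs. A minimal sketch, assuming the Employee table used earlier in this tutorial:

    SELECT HASHAMP(HASHBUCKET(HASHROW(Employee_Id))) AS AMP_No,
           COUNT(*) AS Row_Count
    FROM Employee
    GROUP BY 1
    ORDER BY 2 DESC;

A heavily skewed Row_Count across AMP_No values would suggest that Employee_Id hashes unevenly and may be a poor primary index choice.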

Hash Collisions

The Hash Collision is a situation in which the rowhash value for different rows is identical, making it difficult for a system to discriminate among the hash synonyms when one unique row is requested for retrieval from a set of hash synonyms.

Minimizing Hash Collisions

To minimize the Hash Collision problem, Teradata Database defines 4.2 billion hash values. The AMP software adds a system-generated 32-bit uniqueness value to the rowhash value.

The resulting 64-bit value prefixed with an internal partition number is called the rowID. This value uniquely identifies each row in a system, making a scan to retrieve a particular row among several having the same rowhash a trivial task.

A scan must check each of the rows to determine if it has the searched-for value and not another value with the same rowhash value.

Hash Maps

A hash map is a mechanism for determining which AMP a row is stored on, or, in the case of the Open PDE hash map, the destination AMP for a message sent by the Parser subsystem.

Each cell in the map array corresponds to a hash bucket, and each hash bucket is assigned to an AMP. Hash map entries are maintained by the BYNET.


Teradata Join Strategies

Teradata join strategies are utilized by the optimizer to choose the minimum-cost plan and deliver better performance.

The strategy is chosen based on the information available to the optimizer, such as table size, Primary Index information, and statistics.

Teradata join strategies are of the following types:

  • Merge (Exclusion)
  • Nested
  • Row Hash
  • Product

Merge Join Strategies

Merge Join method takes place when the join is based on the equality condition.

Merge Join requires the joining rows to be on the same AMP. Rows are joined based on their row hash.

Merge Join uses four different merge join strategies based on the redistribution to bring the rows to the same AMP.

Strategy 1

The 1st merge join will utilize the Primary Index on both tables in the join condition.

[Figure: Teradata Join Strategies]

The inner join above focuses on returning all rows when there is a match between the two tables.

The ON clause is essential because this join establishes the join (equality) condition.

[Figure: Teradata Join Strategies]

Each matching row is joined where Emp = Emp, as stated in the ON clause of the JOIN.

EMP is the Primary Index for both tables. This first merge join type is extremely efficient because both columns in the ON clause are the Primary Indexes of their respective tables. When this occurs, NO data has to be moved into the spool, and the joins can be performed in what is called AMP LOCAL.

Teradata can perform this join rapidly because less data has to be moved to complete the join.
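As a concrete illustration using the Employee and Salary tables defined earlier in this tutorial (both have Employee_Id as their Primary Index), a join like the following can be satisfied AMP-locally with a merge join:

    SELECT e.Employee_Id,
           e.First_Name,
           s.NetPay
    FROM Employee e
    INNER JOIN Salary s
    ON e.Employee_Id = s.Employee_Id;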

Strategy 2

This strategy is used when a join is performed on a Primary Index column of one table to a non-primary indexed column of another table.

[Figure: Teradata Join Strategies]

Two tables are being joined based on the DEPT column. In the department table, the Primary Index column is DEPT.

The employee table has EMP as the Primary Index column. The primary objective is to bring the rows together from each table on the same AMPs.

There are several options that the Teradata Optimizer could choose to complete this task, such as:

  • The first option is to duplicate the smaller table on all AMPs.
  • The second option is to leave the Department table, whose Primary Index column is used in the equality condition, stationary on its AMPs.
  • The next step would be to move the rows from the Employee table into spool. This is accomplished by hashing the join column values of the Employee table rows and then transferring those rows in spool to the appropriate AMPs where the matching Department table rows reside.

[Figure: Teradata Join Strategies]

Strategy 3

This strategy is used when neither table is being joined on its Primary Index. In this case, Teradata will redistribute both tables into spool and sort them by hash code.

To redistribute and sort by hash code, the system hashes the join (non-primary-index) columns and moves the rows to spool on the target AMPs, where they are sorted by hash.

Once this is accomplished, then the appropriate rows are together in spool on all the AMPs.

[Figure: Teradata Join Strategies]

The Primary Index of the department table is DEPT, and the Primary Index for the manager table is LOC. In this case, both columns being utilized in this join equality are not part of the Primary Index columns.

Rows from both tables will need to be rehashed and redistributed into SPOOL because neither column selected in the ON Clause is the Primary Index of the respective tables. Therefore, both tables are redistributed based on the ON clause columns.

[Figure: Teradata Join Strategies]

The next step in this process is to redistribute the rows and locate them to the matching AMPs.

When this is completed, the rows from both tables will be located in two different spools. The rows in each spool will be joined together to bring back the matching rows.

Merge Join Strategy 4

The fourth merge join strategy is called the big table - small table join. If one of the tables being joined is small, then Teradata may choose a plan that will duplicate the smaller table across all the AMPs.

The key to this strategy is that, regardless of whether the join column is a Primary Index column or not, Teradata can still choose to duplicate the smaller table across all the AMPs.

[Figure: Teradata Join Strategies]

The two tables involved in the join are the Employee table and the Department table.

The Dept column is the join equality that is making the match between the two tables.

The DEPT column is the Primary Index Column in the Department table.

The Employee table has the Emp column as the Primary Index. The Department table is small.

To join these two tables together:

  • The first step is to get the rows together on the same AMP. In this case, since the Department table is small, Teradata will choose to duplicate the entire Department table on each AMP into a spool.
  • Then, each AMP joins its base Employee rows with the duplicated Department rows.

Nested Join

A nested join is designed to utilize a unique index type (either a Unique Primary Index or a Unique Secondary Index) from one of the tables in the join statement to retrieve a single row.

Then it matches that row to one or more rows on the other table being used in the join.

The nested join has the join equality (ON) condition based on the Dept column. The Dept column is the Primary Index Column on the department table.

The Dept column is the Secondary Index Column in the employee table. The nested join can move a single row into the spool and then match that row against the other table, which may contain several matching rows.

Analysis of this join statement indicates a new clause has been added to this statement. This is known as the WHERE option.

When utilized, the WHERE option allows for a single row to be retrieved from a table. A nested join will always use a unique index to isolate that single record and then join that record to another table.
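A hedged sketch of a query that could qualify for a nested join, assuming Dept is the Unique Primary Index of the hypothetical DeptTable and a secondary index exists on EmployeeTable.Dept:

-- the WHERE condition isolates a single DeptTable row through its unique index;
-- that row is then matched against the EmployeeTable rows for the same Dept
SELECT E.Emp, E.Name, D.Dept
FROM EmployeeTable E
INNER JOIN DeptTable D
ON E.Dept = D.Dept
WHERE D.Dept = 10;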

The other table may use an index, or it may not. The utilization of indexes in join statements will improve performance and utilize fewer resources, as shown in the below diagram.

(Diagram: Teradata Join Strategies)

Since there is only one row in the department table with a match for department =10, which is based on the AND option in the join statement, the Teradata Optimizer will choose a path to move the department table columns into the spool and duplicate them across all the AMP.

Once that is complete, the match proceeds with that single record (10 and SALES) against the second table, which did not move from its base AMP.

Nested Joins are great in an OLTP Environment because of the usage of both Unique and Non-Unique Indexes.

Row Hash Join

The Hash Join is part of the Merge Join. The key to a Merge Join is based on equality condition, such as E.Dept = D.Dept in the ON clause of the join statement.

A Hash Join can only occur if one or both of the tables on each AMP can fit entirely inside the AMP's memory.

  • The column's names can be different, but the row information has to be similar for the match to work.
  • Both the EMP and MGREMP columns have the same type of information, so therefore a join based on these column names will be successful.
  • EMP column is the Primary Index column on the employee table. However, the MGREMP column is not an index column in the department table.
  • The key to determining a Hash Join is if the SMALLER TABLE can be entirely held in each AMP's MEMORY.

The Hash Join process is where the smaller table is sorted by row hash and duplicated on every AMP.

Here the key is a smaller table that is required to be held entirely in each AMP's memory.

Teradata will use the join column of the larger table to search for a match. The row hash join is extremely efficient because it eliminates the sorting, redistribution, and copying of the larger table into the spool.

The rows that are duplicated into the AMP's memory yield increased performance because the rows never go into a spool. Rows that go into spool always have to involve disk activity.
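For illustration only (hypothetical tables; EMP is the Primary Index of EMPLOYEETABLE, while MGREMP is not an index of DEPTTABLE), a join of the kind described above might be written as:

-- if the smaller DEPTTABLE fits in each AMP's memory,
-- the optimizer may choose a hash join and avoid spooling the larger table
SELECT E.EMP, D.DEPT
FROM EMPLOYEETABLE E
INNER JOIN DEPTTABLE D
ON E.EMP = D.MGREMP;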

Product Joins

Product Joins compare every row of one table to every row of another table. They are called product joins because they are a product of the number of rows in table one multiplied by the number of rows in table two.

For example, if one table had six rows and the other had six rows, then the Product Join would compare 6 x 6 rows with a potential of 36 rows.

  1. SELECT E.EMP,D.DEPT   
  2. FROM EMPLOYEETABLE E,DEPTTABLE D  
  3. WHERE  
  4. EMP LIKE '_b%'  

Many times, product joins are major mistakes because all rows in both tables will be compared.

Teradata tables have the potential to contain millions of rows. If a user accidentally writes a product join against two tables that have 1 million rows each, the result set would return one trillion rows. To avoid a product join, check the syntax to ensure that the join is based on an EQUALITY condition.

In the above example, the WHERE clause reads WHERE EMP LIKE '_b%'. Because this clause is not based on a common domain condition between the two tables (e.dept = d.dept), the result is a product join.

Another cause of a product join is when aliases are not used after being established. So firstly, ensure the WHERE clause is not missing in the join syntax.
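As a sketch of the fix, adding an equality condition on the common domain column (while keeping the LIKE filter) turns the product join into a normal merge join; the table and column names follow the hypothetical example above:

-- the equality condition e.dept = d.dept gives the optimizer a join path,
-- so every row no longer has to be compared with every other row
SELECT E.EMP, D.DEPT
FROM EMPLOYEETABLE E, DEPTTABLE D
WHERE E.DEPT = D.DEPT
AND E.EMP LIKE '_b%';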

Cartesian product Join

A Cartesian Product Join will join every row in one table to every row in another table. The number of rows returned is the product of the number of rows in the two tables.

If one table had 5 rows and another had 10 rows, then we will always get 50 rows returned.

  1. SELECT E.EMP,D.DEPT   
  2. FROM EMPLOYEETABLE E,DEPTTABLE D;  

Most of the time, a Cartesian Product Join is a major problem because all rows in both tables will be joined.

To avoid a Cartesian Product Join, check the syntax to ensure that the join is based on an EQUALITY condition.

In the above example, the WHERE clause is missing. Because there is no common domain condition between the two tables (e.dept = d.dept), the result is a Cartesian product join.

Another cause of a product join is when aliases are not used after being established.

Exclusion Join

Exclusion Joins have one primary function. They exclude rows during a join.

  1. SELECT EMP, DEPT, NAME   
  2. FROM   
  3. EMPLOYEETABLE  
  4. WHERE DEPT=10 and   
  5. EMP NOT IN (SELECT MGREMP from DEPTTABLE WHERE MGREMP IS NOT NULL)  

In the above example, the join utilizes the NOT IN statement. Exclusion joins are used for finding rows that don't have a matching row in the other table.

Queries with the NOT IN operator are the types of queries that always provide exclusion joins results. In this case, this query will find all the employees who belong to department 10 who are NOT managers.

These joins will always involve a Full Table Scan because Teradata will need to compare every record to eliminate rows that will need to be excluded.

This type of join can be resource-intensive if the two tables in this comparison are large.

The biggest problem with Exclusion Joins arises when the NOT IN list contains a NULL. Because NULLs are treated as unknowns, the comparison cannot be resolved and the query returns no rows. There are two ways to avoid this problem:

  1. Define NOT IN columns as NOT NULL on the CREATE TABLE.
  2. Add the "AND WHERE Column IS NOT NULL" to the end of the JOIN.

Teradata Views

Views are database objects that are built by the query. Views can be created using a single table or multiple tables by way of joining.

Their definition is stored permanently in a data dictionary, but they don't save a copy. Data for the view is built dynamically.

A view may contain a subset of the table's rows or a subset of columns.

Create a View

Views are created using a special form of Data Definition Language (DDL). The CREATE requests a new VIEW, provides the name and the SELECT for the view. It is suggested the name of the view either start with "v_" or end with "_v" to identify it as a view name.

That way, it is visibly obvious to people that this is a view and not a table. The name of the view must be unique from the names of other objects in the database. The CREATE VIEW verifies that the name does not already exist and returns an error if it does.

Syntax

Following is the syntax for creating a view.

  1. CREATE/REPLACE VIEW <view_name>   
  2. AS    
  3. <select query>;   
  1. view_name: Name of the new view. If view_name is not fully qualified, the default database is used.
  2. database_name / user_name: Name of the database or user to contain view_name, if it is something other than the current database or user.
  3. column_name: Name of a view column. If more than one column is specified, list their names in the order in which each column is to be displayed for the view.
  4. AS: Introduces the view definition.

NOTE

  • Views that reference a row-level security table can include columns based on row-level security constraints, but it is not required. However, the view enforces any security constraints in the base table, whether or not they are included in the view definition.
  • A view can reference both row-level security tables and non-row-level security tables, but all referenced row-level security tables must contain the same security constraint columns or subsequent requests to access the view fail with an error.

Example

Consider the following Employee table.

Emp_Id    First_Name    Last_Name    Department_No    BirthDate
202001    Mike          Richard      1                4/8/1988
202002    Robert        Williams     2                8/5/1991
202003    Peter         Collin       2                5/9/1990
202004    Alexa         Stuart       1                15/11/1989
202005    Robert        Peterson     1                22/2/1990

The following example creates a view on the Employee table.

  1. CREATE VIEW Employee_View   
  2. AS   
  3. SELECT   
  4. Emp_Id,   
  5. First_Name,   
  6. Last_Name,  
  7. Department_No,  
  8. BirthDate   
  9. FROM    
  10. Employee;  

Using Views

We can use a regular SELECT statement to retrieve data from Views.

Example

The following example retrieves the records from Employee_View;

  1. SELECT Emp_Id,   
  2. First_Name,   
  3. Last_Name,   
  4. Department_No,  
  5. BirthDate   
  6. FROM Employee_View;  

Modifying Views

A current view can be modified using the REPLACE VIEW statement.

REPLACE VIEW redefines an existing view or, if the specified view does not exist, it creates a new view with the specified name.

Syntax

Following is the syntax to modify a view.

  1. REPLACE VIEW <viewname>   
  2. AS    
  3. <select query>;  

Example

The following example modifies the view Employee_View for adding additional columns.

  1. REPLACE VIEW Employee_View   
  2. AS   
  3. SELECT   
  4. Employee_Id,   
  5. First_Name,   
  6. Last_Name,  
  7. BirthDate,  
  8. JoinedDate,   
  9. Department_No   
  10. FROM    
  11. Employee;   

Drop View

A current view can be dropped using the DROP VIEW statement.

Syntax

Following is the syntax of DROP VIEW.

  1. DROP VIEW <viewname>;   

Example

Following is an example to drop the view Employee_View.

  1. DROP VIEW Employee_View;   

Advantages of Views

Here are some advantages of using Views in Teradata, such as:

  • Views provide an additional level of security by restricting the rows or columns of a table.
  • Users can be given access only to views instead of base tables.
  • Simplifies the use of multiple tables by pre-joining those using Views.
  • Views return only the required rows and columns from the tables.
  • Views can reduce network bandwidth because only the needed data is returned.
  • They provide a stable interface to the table data even if the underlying table structure changes.

Teradata Macros

Macro is a set of SQL statements that are stored and executed by calling the Macro name. The definition of Macros is stored in the Data Dictionary.

Users only need EXEC privilege to execute the Macro. Users don't need separate rights on the database objects used inside the Macro.

A macro consists of one or more statements that can be executed by performing a single statement. Each time the Macro is executed, one or more rows of data can be returned.

Macro statements are executed as a single transaction. If one of the SQL statements in Macro fails, then all the statements are rolled back. Macros can accept parameters. Macros can contain DDL statements, but that should be the last statement in Macro.

A frequently used SQL statement or series of statements can be incorporated into a macro and defined using the SQL CREATE MACRO statement. The contents of the Macro are executed using the EXECUTE statement. A macro can include an EXECUTE statement that executes another macro.

Create Macros

To create a macro, you need the CREATE MACRO privilege on the database or user in which the Macro is to be created. The creator automatically gets the DROP MACRO and EXECUTE privileges WITH GRANT OPTION. The immediate owner of the Macro:

  • Is the database in which it exists, not the user who created it.
  • Determines the access privileges of the Macro.
  • Must have all the appropriate privileges for executing the Macro, including WITH GRANT OPTION.

Syntax

Macros are created using the CREATE MACRO statement. Following is the generic syntax of the CREATE MACRO command.

  1. CREATE MACRO <macroname> [(parameter1, parameter2, ...)] AS   
  2. (   
  3.    <sql statements>   
  4. );  
  • macro_name: The name of the new Macro. If a fully qualified name is not specified, the default database or user is used. The name of a macro must be unique within its containing user or database.
Example

Consider the following Employee and Salary tables.

Employee Table:

Emp_Id    First_Name    Last_Name    BirthDate
202001    Mike          Richard      11/05/1990
202002    Robert        Williams     20/01/1988
202003    Peter         Collin       21/08/1985
202004    Alexa         Stuart       10/12/1992
202005    Robert        Peterson     19/03/1987

Salary Table:

Emp_Id    Gross     Deduction    NetPay
202001    40,000    4,000        36,000
202002    80,000    6,000        74,000
202003    90,000    7,000        83,000
202004    75,000    5,000        70,000
202005    80,000    0            80,000

The following example creates a Macro called Get_Emp. It contains a select statement to retrieve records from the employee table.

  1. CREATE MACRO Get_Emp AS   
  2. (   
  3.    SELECT   
  4.    Emp_Id,   
  5.    First_Name,   
  6.    Last_Name   
  7.    FROM    
  8.    employee   
  9.    ORDER BY Emp_Id;   
  10. );  

Executing Macros

The immediate owner of the Macro (the database in which the Macro resides) must have the necessary privileges on objects named in the request set for statements that are contained in the Macro.

Syntax

Macros are executed using the EXEC command. Following is the syntax of the EXECUTE MACRO command.

  1. EXEC <macroname>;  

Example

The following example executes the Macro named Get_Emp. When the following command is executed, it retrieves all records from the employee table.

  1. EXEC Get_Emp;   
  2.       /* Query completed. 5 rows found. 3 columns returned */  
  3.       /* Total elapsed time was 1 second */  
  4.   
  5. Emp_Id          First_Name           Last_Name   
  6. 202001           Mike                 Richard  
  7. 202002           Robert               Williams  
  8. 202003           Peter                Collin   
  9. 202004           Alexa                Stuart  
  10. 202005           Robert               Peterson    

Parameterized Macros

Teradata Macros can accept parameters. Within a Macro, these parameters are referenced with a colon (:).

Example

Following is an example of a Macro that accepts parameters.

  1. CREATE MACRO Get_Emp_Salary(Emp_Id INTEGER) AS   
  2. (   
  3.    SELECT   
  4.    Emp_Id,   
  5.    NetPay   
  6.    FROM    
  7.    Salary   
  8.    WHERE Emp_Id = :Emp_Id;   
  9. );  

Executing Parameterized Macros

Macros are executed using the EXEC command. We need EXEC privilege to execute the Macros.

Syntax

Following is the syntax of the EXECUTE MACRO statement.

  1. EXEC <macroname>(value);  

Example

The following example executes the Macro named Get_Emp_Salary. It accepts an employee Id as a parameter and extracts records from the Salary table for that employee.

  1. EXEC Get_Emp_Salary(202001);   
  2.   
  3.        /* Query completed. One row found. 2 columns returned */  
  4.        /* Total elapsed time was 1 second */  
  5.    
  6. Emp_Id                  NetPay   
  7. 202001                  36,000  

Replace a Macro

REPLACE privilege depends on whether the Macro being replaced already exists.

  • If the Macro already exists, you need the DROP MACRO privilege on the Macro or on its containing database or user.
  • If the Macro does not already exist, you need the CREATE MACRO privilege on its containing database or user.

NOTE

  • Once a macro has been replaced, its immediate owner is the database in which it exists, not the user who replaced it.
  • The immediately owning database must have all the appropriate privileges for executing the Macro, including WITH GRANT OPTION.

Teradata OLAP Functions

OLAP functions are similar to aggregate functions except that the aggregate functions will return only one value. In contrast, the OLAP function will provide the individual rows in addition to the aggregates.

The OLAP functions are built into the Teradata database to provide data mining capabilities and trend analysis.

These functions provide processing not available using the standard aggregation. The OLAP functions give the result of their operation and display the detail data values used in the function.

The detail row data is displayed as part of the answer set since the output is in row format and not in a report format like WITH.

The OLAP functions may be performed on all tables or views and used to populate tables in conjunction with INSERT/SELECT. The most significant difference is that these functions can be used in Queryman, unlike WITH.

Syntax

Following is the syntax of the OLAP function.

  1. <aggregate function> OVER    
  2. ([PARTITION BY columnname] [ORDER BY columnname]   
  3. [ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING])  

NOTE: Aggregation functions can be SUM, COUNT, MIN, MAX, and AVG.

Example

Consider the following Salary table of the employees.

Emp_Id    Gross     Deduction    NetPay
202001    40,000    4,000        36,000
202002    80,000    6,000        74,000
202003    90,000    7,000        83,000
202004    75,000    5,000        70,000

Following is an example to find the cumulative sum or running total of NetPay on the Salary table.

Records are sorted by Emp_Id, and the cumulative sum is calculated on the NetPay column.

  1. SELECT    
  2. Emp_Id, NetPay,   
  3. SUM(Netpay) OVER(ORDER BY Emp_Id ROWS    
  4. UNBOUNDED PRECEDING) as TotalSalary   
  5. FROM Salary;  

After executing the above query, it produces the following output.

  1. Emp_Id           NetPay     TotalSalary   
  2. 202001           36000        36000   
  3. 202002           74000        110000   
  4. 202003           83000        193000   
  5. 202004           70000        263000   

QUANTILE Function

A QUANTILE function is used to divide rows into several partitions of roughly the same number of rows in each partition. The percentile is the QUANTILE most commonly used in business.

By default, both the QUANTILE column and the QUANTILE value itself will be output in ascending sequence.

In some cases, the ORDER BY clause may be used to reorder the output for display. Here the order of the output does not change the meaning of the output, unlike a summation where the values are being added together and all need to appear in the proper sequence.

Syntax

Following is the syntax of the QUANTILE function.

  1. SELECT QUANTILE (<partitions>, <column_name> ,<sort_key> [DESC | ASC])  
  2. FROM <table_name>  
  3. [QUALIFY QUANTILE (<column_name>) {< | > | = | <= | >=} <number of rows>];  
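As a minimal sketch using the Salary table shown earlier, the query below buckets employees into 100 partitions (percentiles) by NetPay and keeps only the lowest quartile; the threshold value of 25 is purely illustrative:

-- QUANTILE(100, NetPay) assigns each row a percentile from 0 to 99;
-- QUALIFY then filters on that computed value
SELECT Emp_Id, NetPay, QUANTILE(100, NetPay)
FROM Salary
QUALIFY QUANTILE(100, NetPay) <= 25;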

RANK Function

RANK function orders the records based on the column provided. The RANK function can also filter the number of records returned based on the Rank.

The RANK function allows a column to be evaluated and compared based on either high or low order, against all other rows to create the output set.

The order will be sorted by default in the descending sequence of the ranking column, which correlates to descending Rank.

The output of the RANK function is the highest or the lowest data values in the column, depending on the sort requested.

Syntax

Following is the syntax to use the RANK function.

  1. SELECT RANK( <column_name> [DESC | ASC] )  
  2. FROM <table_name>  
  3. [GROUP BY <column_name>, [<column_number> ] ] [{ ASC | DESC } ]  
  4. [QUALIFY RANK(<column_name>) {< | <=} <number of rows> ];  

Example

Consider the following Employee table.

Emp_Id    First_Name    Last_Name    JoinedDate    Department_No    BirthDate
202001    Mike          Richard      3/27/2008     1                1/5/1980
202002    Robert        Williams     4/25/2013     1                3/5/1983
202003    Peter         Collin       3/21/2010     1                4/1/1983
202004    Alexa         Stuart       1/2/2017      2                11/6/1984
202005    Robert        Peterson     1/4/2015      2                12/1/1984

The following query orders the records of the employee table by Joined Date and assigns the ranking on Joined Date.

  1. SELECT Emp_Id, JoinedDate, RANK()   
  2. OVER(ORDER BY JoinedDate) as Seniority   
  3. FROM Employee;  

When the above query is executed, it produces the following output.

  1. Emp_Id        JoinedDate       Seniority    
  2. 202001        03/27/2008         1   
  3. 202003        03/21/2010         2   
  4. 202002        04/25/2013         3   
  5. 202005        01/04/2015         4   
  6. 202004        01/02/2017         5   

PARTITION BY clause groups the data by the columns defined in the PARTITION BY clause and performs the OLAP function within each group. Following is an example of the query that uses the PARTITION BY clause.

  1. SELECT Emp_Id, DepartmentNo, JoinedDate, RANK()   
  2. OVER(PARTITION BY DepartmentNo ORDER BY JoinedDate) as Seniority   
  3. FROM Employee;  

When the above query is executed, it produces the following output. We can see that the Rank is reset for each Department.

  1. Emp_Id         DepartmentNo       JoinedDate         Seniority    
  2. 202001           1                03/27/2008              1   
  3. 202003           1                03/21/2010              2   
  4. 202002           1                04/25/2013              3   
  5. 202005           2                01/04/2015              1   
  6. 202004           2                01/02/2017              2  


Teradata Data Protection

Teradata has many features to protect against potential data loss. Each of these features protects the data differently and on a different level of granularity. Some processes for protection are automatic, and some of them are optional.

Here are some features available for data protection in Teradata.

  1. RAID
  2. Fallback
  3. Cliques
  4. Journals
  5. Hot Standby Nodes

RAID

RAID mechanism is used to protect data from Disk Failures. Disk Array consists of a set of disks that are grouped as a logical unit. This unit looks like a single unit to the user, but it can be spread across several disks.

RAID 1 is commonly used in Teradata. In RAID 1, each disk is associated with a mirror disk. Any changes to the data in the primary disk are reflected in mirror copy as well. If the primary disk fails, then the data from the mirror disk can be accessed.

Although each AMP has exactly one virtual disk (VDISK) assigned, several physical disks make up this VDISK. Half of the disks keep the mirrored data.

As in the case of fallback protection, the cost is the usage of twice as much disk space.

The RAID 1 configuration ensures enhanced data accessibility and security. If a single disk fails, data will serve from the mirror disk. Only if both the primary disk and the mirror disk fail, and no fallback protection were enabled, the data would be lost.

Fallback

Fallback protects the table data by storing a second copy of the rows of a table on another AMP, called the Fallback AMP. If one AMP fails, then the fallback rows are accessed. With this, even if one AMP fails, data is still available through the fallback AMP. The fallback option can be used at table creation or after table creation.
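A minimal sketch of both approaches, using a hypothetical Emp_FB table (the column names are assumptions):

-- create the table without fallback first
CREATE TABLE tduser.Emp_FB
(
   EmpId INTEGER,
   FirstName VARCHAR(30)
)
UNIQUE PRIMARY INDEX (EmpId);

-- add fallback protection after table creation
ALTER TABLE tduser.Emp_FB, FALLBACK;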

Fallback ensures that the second copy of the table's rows is always stored in another AMP to protect the data from AMP failure. However, fallback occupies twice the storage, and Inserts, Deletes, and Updates require twice the I/O.

The advantage of fallback protection is that the data may be accessible in case of an AMP failure, as the auxiliary AMP will take over its tasks until the failed AMP comes back online.

Only in the unlikely event that the fallback AMP fails as well will the table ultimately become inaccessible.

The following diagram shows how a fallback copy of the rows is stored in another AMP.

(Diagram: Teradata Data Protection)

To achieve maximum protection, AMP's are grouped in clusters. Main AMP and fallback AMP always belong to the same cluster and protect each other.

The main AMP and the fallback AMP are never stored physically on the same node, a prudent design choice given the nature of hardware failures. Even if a complete node fails, the fallback protection still allows us to access the data.

Cliques

Clique is a mechanism used by Teradata to protect data from node failures. A clique is nothing but a set of Teradata nodes that share a common set of Disk Arrays. When a node fails, the AMPs of the failed node migrate to another node belonging to the same clique, where they stay fully functional and continue to access their disk arrays.

This protection mechanism adds another level of security to the Teradata System.

Before the AMP migration can occur, a system restart is required, followed by another restart as soon as the failed node comes back online. To remove this restriction, Teradata offers hot standby nodes to take over the AMPs of a failed node.

As hot standby node AMPs are not engaged in routine operations in the ordinary course of business, no performance degradation occurs.

Journals

Journals are kept on the system to provide data availability in the event of a component or process failure. Journals are the following types in the Teradata:

1. Recovery Journal: The Teradata Database automatically uses Recovery Journals to maintain data integrity in the case of:

  • An interrupted transaction.
  • An AMP failure.

Recovery Journals are created, maintained, and purged by the system automatically, so no DBA intervention is required.

Recovery Journals are tables stored on disk arrays like user data is, so they take up disk space on the system.

2. Down-AMP Recovery Journal: The Down-AMP Recovery Journal allows continued system operation while an AMP is down.

A Down-AMP Recovery Journal is used with Fallback-protected tables to maintain a record of write transactions (updates, creates, inserts, deletes, etc.) against the failed AMP while it is unavailable.

The Down-AMP Recovery Journal starts after the loss of an AMP in a cluster automatically. Any changes to the data on the failed AMP are logged into the Down-AMP Recovery Journal by the other AMPs in the group.

When the failed AMP is online again, the restart process includes applying the changes in the Down-AMP Recovery Journal to the recovered AMP. The journal is discarded when the process is complete, and the AMP is brought online and fully recovered.

3. Transient Journal: A Transient Journal maintains data integrity when in-flight transactions are interrupted. Data is returned to its original state after transaction failure.

A Transient Journal is used during regular system operation to keep "before images" of changed rows so the data can be restored to its previous state if the transaction is not completed. This happens on each AMP as changes occur. When a transaction is started, the system automatically stores a copy of all the rows affected by the transaction in the Transient Journal until the transaction is completed.

Once the transaction is complete, the "before images" are purged. In the event of a transaction failure, "before images" are reapplied to affected tables, deleted from the journal, and the "rollback" operation is completed.

4. Permanent Journal: Permanent Journals are an optional feature used to provide an additional level of data protection. We can specify the use of Permanent Journals at the table level. It offers full-table recovery for a specific point in time. It can also reduce the need for costly and time-consuming full-table backups.

Permanent Journals are tables stored on disk arrays like user data, so they take up additional disk space. The Database Administrator maintains the Permanent Journal entries (deleting, archiving, and so on).

The additional disk space required may be calculated in advance to ensure adequate resources. The Database Administrator must periodically archive the Permanent Journal to external media, which reduces the need for full-table backups since only changes are backed up rather than the entire database.
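A hedged sketch of enabling a permanent journal at the table level; this assumes the containing database already defines a DEFAULT JOURNAL TABLE, and the table and column names are hypothetical:

-- keep before-images and after-images of changed rows in the permanent journal
CREATE TABLE tduser.Emp_PJ,
   FALLBACK,
   BEFORE JOURNAL,
   AFTER JOURNAL
(
   EmpId INTEGER,
   FirstName VARCHAR(30)
)
UNIQUE PRIMARY INDEX (EmpId);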

Hot Standby Node

Hot Standby Node is a node that does not participate in the production environment.

Teradata Database can use spare nodes to improve availability and maintain performance levels in the event of a node failure. If a node fails, then the vprocs from the failed node will migrate to the hot standby node.

Hot Standby nodes are used to maintain performance in case of node failures. When the failed node is recovered and restarted, it becomes the new hot standby node; a second restart of Teradata is not needed.

Hot standby node has the following characteristics:

  • A node that is a member of a clique.
  • Hot Standby Node does not regularly participate in the trusted parallel application (TPA).
  • It can be brought into the configuration when a node fails in the clique.
  • It helps with unplanned outages.
  • It also eliminates the need for a restart to bring a failed node back into service.

Teradata Stored Procedure

A stored procedure contains a set of SQL statements and procedural statements. They consist of a set of control and condition handling statements that make SQL a computationally complete programming language. The definition of the stored procedure is stored in the database, and the parameters are stored in data dictionary tables.

These features provide a server-based procedural interface to Teradata Database for application programmers.

Stored procedures can be a great benefit for some tactical query applications. This section provides:

  • Some examples of using stored procedures to process complex updates.
  • A comparison of the relative efficiency of stored procedures and macros for different tactical query applications.

Advantages

Teradata Stored Procedure provides the following essential advantages, such as:

  • Stored procedures reduce the network load between the client and the server.
  • It provides better security since the data is accessed through stored procedures instead of accessing them directly.
  • It gives better maintenance since the business logic is tested and stored in the server.

Elements of Stored Procedures

The set of statements constituting the main tasks of the stored procedure is called the stored procedure body, consisting of a single statement or a compound statement or block.

A single statement stored procedure body can contain one control statement, such as LOOP or WHILE, or one SQL DDL, DML, or DCL statement, including dynamic SQL. The following statements are not allowed, such as:

  • Any declaration statement, such as local variable, cursor, condition, or condition handler.
  • A cursor statement (OPEN, FETCH, or CLOSE).

A compound statement stored procedure body consists of a BEGIN-END statement enclosing a set of declarations and statements, including:

  • Local variable declarations.
  • Cursor declarations.
  • Condition declarations.
  • Condition handler declaration statements.
  • Control statements.
  • SQL DML, DDL, and DCL statements supported by stored procedures, including dynamic SQL.
  • Multi-statement requests (including dynamic multi-statement requests) delimited by the keywords BEGIN REQUEST and END REQUEST.

Creating a Stored Procedure

Teradata stored procedure is created from the following commands:

  • BTEQ utility using the COMPILE command.
  • CLIv2 applications, ODBC, JDBC, and Teradata SQL Assistant (formerly called Queryman) using the SQL CREATE PROCEDURE or REPLACE PROCEDURE statement.
  • Stored Procedures are created using the CREATE PROCEDURE statement.

The procedures are stored in the user database space as objects and are executed on the server.

Syntax

Following is the syntax of the CREATE PROCEDURE statement.

  1. CREATE PROCEDURE <procedurename> ( [parameter 1 data type, parameter 2 data type..] )   
  2. BEGIN   
  3.    <SQL or SPL statements>;   
  4. END;  

Example

Consider the following Salary Table of the employees.

EmployeeId    Gross     Deduction    NetPay
202001        40,000    2,000        38,000
202002        80,000    3,000        77,000
202003        85,000    5,000        80,000
202004        90,000    2,000        88,000

The following example creates a stored procedure named InsertSalary to accept the values and insert them into the Salary Table.

  1. CREATE PROCEDURE InsertSalary   
  2. (   
  3.    IN in_EmployeeId INTEGER,   
  4.    IN in_Gross INTEGER,   
  5.    IN in_Deduction INTEGER,   
  6.    IN in_NetPay INTEGER   
  7. )   
  8. BEGIN   
  9.    INSERT INTO Salary   
  10. (   
  11.       EmployeeId,   
  12.       Gross,   
  13.       Deduction,   
  14.       NetPay   
  15.  )   
  16.    VALUES   
  17. (   
  18.       :in_EmployeeId,   
  19.       :in_Gross,   
  20.       :in_Deduction,   
  21.       :in_NetPay   
  22.   );  
  23.    
  24. END;  

Executing Stored Procedures

If we have sufficient privileges, we can execute a stored procedure from any supporting client utility or interface using the SQL CALL statement.

We can also execute a stored procedure from an external stored procedure written in C, C++, or Java. We have to specify arguments for all the parameters contained in the stored procedure.

Syntax

Following is the generic syntax of the CALL statement.

  1. CALL <procedure name> [(parameter values)];  

Example

The below example calls the stored procedure InsertSalary and inserts records to Salary Table of the employees.
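The CALL statement itself is not shown here; a call consistent with the resulting table below (it inserts the 202005 row) would be:

-- the four arguments map to the IN parameters of InsertSalary
CALL InsertSalary(202005, 20000, 1000, 19000);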

EmployeeId    Gross     Deduction    NetPay
202001        40,000    2,000        38,000
202002        80,000    3,000        77,000
202003        85,000    5,000        80,000
202004        90,000    2,000        88,000
202005        20,000    1,000        19,000

Teradata User Management

The Teradata Database Administrators are responsible for managing the users in the database. They will create the Teradata account for each user with limited space and access to the database objects.

Users

A user is created using the CREATE USER command. In Teradata, a user is also similar to a database. They both can be assigned space and contain database objects except that the user is assigned a password.

We must explicitly grant the CREATE USER or USER privilege to any user that requires it, except user DBC, which has these privileges by default. The creator of a user automatically receives the DROP USER privilege on the user.

Syntax

Following is the syntax for CREATE USER.

  1. CREATE USER username   
  2. AS    
  3. [PERMANENT/PERM] = n BYTES   
  4. PASSWORD = password   
  5. TEMPORARY = n BYTES   
  6. SPOOL = n BYTES;  
  • CREATE USER user_name: The name of an individual Teradata Database user. We cannot allow multiple users to share a username for accessing the database.
  • PERM: The space in bytes available to contain objects that the user creates or owns. Permanent space cannot exceed the current PERM of the user's immediate owner, which is usually the database or user specified in the FROM clause of the CREATE USER statement.
  • PASSWORD = password: A temporary password. If the system is set up according to the recommendations in creating the security administrator user, the user is prompted to change the temporary password to a private password at first login.
  • TEMPORARY: Only objects such as global temporary tables and volatile tables require temporary space. We should develop a strategy for using the database objects that need temp space before assigning temp space to users. We can add a temp space specification later with a MODIFY USER or MODIFY PROFILE statement.
  • SPOOL: Defines the maximum amount of space available to the user for intermediate query results, volatile tables, or formatted answer sets to queries. The system borrows spool space from unused system perm space, such as the spool reserve in creating the spool space reserve.
    The spool space specified in a profile takes precedence over spool space defined in a CREATE USER statement.
    Specify spool space as part of a profile unless the user requires unique spool considerations.
    The spool space allocation should be approximately 20% of the perm space available to the most extensive database the user accesses.

NOTE: While creating a user, the values for user name, Permanent space, and Password are mandatory; the other fields are optional.

Example

Following is an example to create user T001.

  1. CREATE USER T001   
  2. AS    
  3. PERMANENT = 1000000 BYTES   
  4. PASSWORD = ABC@999   
  5. TEMPORARY = 1000000 BYTES   
  6. SPOOL = 1000000 BYTES;  

Accounts

The ACCOUNT option in CREATE USER is used to assign an account. The default account for a new user is determined as follows:

  • A profile is assigned to the user, and the profile contains more than one account: the default is the first account in the string. The user can submit a SET SESSION ACCOUNT statement to set the default for the current session to one of the other accounts.
  • A profile is assigned to the user, and the profile contains only one account: the default is the account defined in the profile.
  • A profile is assigned to the user, but the profile does not include an account: the default is the account identifier of the immediate owner of the user.
  • No profile is assigned to the user: the default is the account identifier of the user's immediate owner.

NOTE

  1. If a user definition specifies an account but not a profile, the default account is the account string in the user definition.
  2. If you assign a profile to a user, and the profile specifies an account string, the profile account is in effect for AMPUsage and DBQL for that user. However, the system uses the value found in the DBC.DatabaseV view for space accounting.

Syntax

Following is the syntax for CREATE USER with the account option.

  1. CREATE USER username   
  2. AS   
  3. PERMANENT = n BYTES   
  4. PASSWORD = password   
  5. ACCOUNT = 'accountid';  

Example

The following example creates the user T002 and assigns the account as IT and Admin.

  1. CREATE USER T002   
  2. AS    
  3. PERMANENT = 1000000 BYTES   
  4. PASSWORD = ABC@123   
  5. TEMPORARY = 1000000 BYTES   
  6. SPOOL = 1000000 BYTES   
  7. ACCOUNT = ('IT','Admin');  

The user can specify the account id while logging into the Teradata system or after being logged into the system using the SET SESSION command.
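For illustration (a sketch only; the logon values reuse the example above, and the exact client syntax may vary slightly):

-- BTEQ logon that passes the account id 'IT' as the third element
.LOGON 192.168.1.102/T002,ABC@123,'IT';

-- switch to another of the user's accounts for the rest of the session
SET SESSION ACCOUNT = 'Admin' FOR SESSION;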

Grant Privileges

GRANT command is used to assign one or more privileges on the database objects to the user or database.

Syntax

Following is the syntax of the GRANT command.

  1. GRANT privileges ON objectname TO username;  

NOTE
Privileges can be INSERT, SELECT, UPDATE, REFERENCES.

Example

Following is an example of the GRANT statement.

  1. GRANT SELECT, INSERT, UPDATE ON Employee TO T001;  

Revoke Privileges

REVOKE command removes the privileges from the users or databases. The REVOKE command can only remove explicit privileges.

Syntax

Following is the basic syntax for the REVOKE command.

  1. REVOKE [ALL/privileges] ON objectname FROM username;  

Example

Below is an example of the REVOKE command.

  1. REVOKE INSERT, SELECT ON Employee FROM T001;  

Teradata FastLoad

Teradata FastLoad is used to load a large amount of data in an empty table on a Teradata System or load data from the client to the Database system.

It allows fast loading of a single empty table with no secondary indexes (SIs), join indexes (JIs), hash indexes (HIs), referential integrity (foreign keys), or column partitioning.

FastLoad loads large amounts of data into an empty table, with or without a primary index, on Teradata Database. However, row size and the number of columns, more than any other factor, affect performance.

Teradata FastLoad is a command-driven utility which can be invoked either in batch or interactive mode.

The Teradata FastLoad utility establishes multiple sessions (by default, one session per AMP), does not use the transient journal during its operation, and transfers data in blocks rather than row by row. That is why it is so fast.

It loads only one table at a time. If we want to load multiple tables simultaneously, invoke multiple FastLoad jobs. FastLoad can load data from:

  • Input files on a network-attached workstation.
  • Disk or tape files on a channel-attached client system.
  • Individual input module (INMOD) routines written to select, validate, and preprocess input data.
  • Any other device providing correctly formatted source data.

Teradata FastLoad Prerequisite

There are three key components specified to run any Teradata FastLoad.

1. Log Table: One log table is needed to track the status of every FastLoad session running on the system.

Teradata maintains one table called fastlog under the database SYSADMIN. To use this table, we need the INSERT, DELETE, and UPDATE privileges on it.

2. Empty Target Table: FastLoad needs an empty target table before inserting data into it. FastLoad does not care how this has been accomplished.

We can achieve this by deleting the data from the target table or by dropping and recreating the target table before starting the load.

3. Two ERROR Tables: Two error tables are required by FastLoad to capture any errors that occur during the FastLoad process. These tables are created automatically.

  • The First ERROR Table: It is used to capture any translation errors or constraint violations.
    For example, a row with the wrong data type in a column would be reported to the first error table.
  • The Second ERROR Table: It is used for errors caused by duplicate values for the Unique Primary Index (UPI).
    FastLoad will load only one instance for every UPI and stores the duplicate occurrences in the second error table. However, if the entire row is a duplicate, FastLoad counts it but does not store the row.

NOTE
Teradata will allow a maximum of 15 FastLoads, Multiloads, or FastExports at the same time. This protects the system from giving all the resources to only load utilities.

Phases of FastLoad

FastLoad divides the whole process into two phases:

Phase 1: Acquisition

  • This phase's main objective is to transfer the data from the Host computer to the Teradata environment for Access Module Processor (AMP) as quickly as possible.
  • Teradata Parsing Engine reads the records from the input file and sends a block to each AMP.
  • Then PE opens a session from the FastLoad client to the AMPs directly. By default, it will create one session per AMP.
  • One of the client sessions will pack the raw data into 64K blocks.
  • Then this large block of data will be sent to the AMP randomly without any concern for which AMPs get the data.
  • After receipt of each data block, every AMP hashes its rows based on the primary index and redistributes them to the proper AMP.
  • After redistribution of rows, data will be written to an internal Worktable on the AMPs but remain unsorted.
  • At the end of Phase 1, each AMP has its rows, but they are not in row hash sequence.

Phase 2: Application

  • The main objective of this phase is to write the data to the actual table space.
  • Phase 2 starts when FastLoad receives the END LOADING statement.
  • Each AMP will sort the rows in its work table by row hash.
  • After sorting, these rows will be written to the actual data block permanently in the disk.
  • Locks on the target table are released, and the error tables are dropped.

Example

Create a text file with the following records and name the file as emp.txt.

  1. 202001,James,1980/01/05,2010/03/01,1    
  2. 202002,Robert,1983/03/05,2010/09/01,1   
  3. 202003,Peter,1983/04/01,2009/02/12,2   
  4. 202004,Mike,1984/11/06,2014/01/01,2   
  5. 202005,Robert,1984/12/01,2015/03/09,3  

Following is a sample FastLoad script to load the above file into the Emp_Stg table.

  1. LOGON 192.168.1.102/dbc,dbc;    
  2.    DATABASE tduser;    
  3.    BEGIN LOADING tduser.Emp_Stg    
  4.       ERRORFILES Emp_ET, Emp_UV    
  5.       CHECKPOINT 10;    
  6.       SET RECORD VARTEXT ",";    
  7.      DEFINE   
  8.          in_EmpId (VARCHAR(10)),   
  9.          in_FirstName (VARCHAR(10)),   
  10.          in_BirthDate (VARCHAR(10)),   
  11.          in_JoinedDate (VARCHAR(10)),   
  12.          in_DepartmentNo (VARCHAR(05)),   
  13.          FILE = emp.txt;  
  14.       INSERT INTO Emp_Stg   
  15.     (  
  16.          EmpId,  
  17.          FirstName,  
  18.          BirthDate,  
  19.          JoinedDate,   
  20.          DepartmentNo  
  21.      )   
  22.   
  23.       VALUES   
  24.     (    
  25.          :in_EmpId,   
  26.          :in_FirstName,   
  27.          :in_BirthDate (FORMAT 'YYYY/MM/DD'),   
  28.          :in_JoinedDate (FORMAT 'YYYY/MM/DD'),  
  29.          :in_DepartmentNo  
  30.       );   
  31.    END LOADING;    
  32. LOGOFF;  

Executing a FastLoad Script

Once the input file emp.txt is created, and the FastLoad script is named as EmpLoad.fl, we can run the FastLoad script using the following command in UNIX and Windows.

  1. FastLoad < EmpLoad.fl;  

Once the above command is executed, the FastLoad script will run and produce a log. The log shows the number of records processed by FastLoad and the status code.

  1. /* 03:19:14 END LOADING COMPLETE   
  2.    Total Records Read = 5   
  3.    Total Error Table 1 = 0 (Table has been dropped)  
  4.    Total Error Table 2 = 0 (Table has been dropped)  
  5.    Total Inserts Applied = 5   
  6.    Total Duplicate Rows = 0   
  7.    Start: Sat June 27 05:20:14 2020   
  8.    End: Sat June 27 05:20:15 2020   
  9.   
  10. /* 05:20:15 Application Phase statistics:   
  11.               Elapsed time: 00:00:01 (in hh:mm:ss)   
  12. 0008  LOGOFF;   
  13. /* 05:20:16 Logging off all sessions  

Teradata MultiLoad

Teradata MultiLoad, or MLoad, is a command-driven utility that can do fast, high-volume data maintenance on multiple tables and views of a Teradata Relational Database Management System (RDBMS).

Using a single MultiLoad job, we can do several different imports and delete tasks on RDBMS tables and views:

  • Each MultiLoad import task can do multiple data insert, update, and delete functions on up to five different tables or views.
  • Each MultiLoad delete task can remove large numbers of rows from a single table.

We can use MultiLoad to import data from:

  • Disk or tape files on a channel-attached client system.
  • Input files on a network-attached workstation.
  • Individual input module (INMOD) programs written to select, validate, and preprocess input data.
  • Access modules.
  • Any device providing correctly formatted source data.

The table or view in the RDBMS receiving the data can be any existing table or view for which we have access privileges for the maintenance tasks we want.

MultiLoad Modes

Teradata has the two modes of the MultiLoad, such as:

1. MultiLoad Import

  • Each MultiLoad import task can perform multiple INSERT, UPDATE, DELETE, and UPSERT operations on up to five target tables in parallel.
  • It can also import data from the network-attached system or mainframe attached system using a custom access module.

2. MultiLoad DELETE

  • Each MultiLoad delete can remove large volumes of data from a single table.
  • The MultiLoad DELETE is used to perform global (all amps) delete operation on just one table.
  • The DELETE's main feature is that it bypasses the transient journal (TJ) and can be restarted if it fails before finishing.
  • We can't use the primary index in the MultiLoad DELETE operation as primary index access is AMP specific, but MultiLoad DELETE is built for global delete.

MultiLoad requires a work table, a log table, and two error tables in addition to the target table.

  • Log Table: Maintains the checkpoints taken during the load and the results from each phase of MultiLoad, which are used for restart purposes.
  • Error Tables: These tables are populated during the load when an error occurs. The first error table stores conversion errors, whereas the second error table stores duplicate records.
  • Work Table: MultiLoad creates one work table per target table. The work table is used to keep the DML tasks and the input data.

MultiLoad Features

Teradata MultiLoad has the following features, such as:

  • High-performance maintenance operations apply updates to multiple tables in a single pass.
  • Best suited when more than 1-2% of the rows in a table are changed.

Data Conversion Capabilities

If an input data field with character data type is targeted for a column with date data type specification, Teradata MultiLoad can change the input data specification to date before inserting it into the target table.

Following are the conversions supported by Teradata MultiLoad:

  • Numeric-to-numeric
  • Character-to-numeric
  • Character-to-date
  • Date-to-character

Teradata MultiLoad Limitations

Teradata MultiLoad has the following limitations, such as:

  • No Unique Secondary Index: MultiLoad does not support unique secondary indexes (USI), the same as FastLoad, because for a USI the subtable row may be created on a different AMP from the one where the actual row resides.
    But unlike FastLoad, it supports non-unique secondary indexes (NUSI) because, in this case, the subtable row is created on the same AMP. In MultiLoad, each AMP works in parallel and independently; that is why it supports NUSI but not USI.
  • No Referential Integrity: Referential Integrity (RI) on the target table is not supported by Teradata MultiLoad. RI requires too much system checking to maintain the referential constraints to a different table.
  • Triggers are not allowed: Triggers involve more than one table, while MultiLoad deals with only one target table. ALTER the Triggers to the DISABLED status before using MultiLoad.
  • No concatenation of the input files: It could impact the restart process if the data were concatenated in different sequences.
  • No Aggregate, exponential operator, or arithmetic functions: If we need data conversions, then we must use an INMOD to prepare the data before loading it.
  • MultiLoad does not support SELECT, Foreign Key references, Hash Indexes, Join Indexes, or NoPI tables.

How MultiLoad Works

Teradata MultiLoad import has five phases:

  • Phase 1 (Preliminary Phase): It performs basic setup activities.
  • Phase 2 (DML Transaction Phase): It verifies the syntax of DML statements and brings them to the Teradata system.
  • Phase 3 (Acquisition Phase): It brings the input data into work tables and locks it.
  • Phase 4 (Application Phase): It applies all DML operations.
  • Phase 5 (Cleanup Phase): It releases the table lock.

The following steps involved in a MultiLoad script:

Step 1: Set up the log table.

Step 2: Log on to Teradata.

Step 3: Then, specify the Target, Work, and Error tables.

Step 4: Define the INPUT file layout.

Step 5: Also, define the DML queries.

Step 6: Name the IMPORT file.

Step 7: And specify the LAYOUT to be used.

Step 8: Initiate the Load.

Step 9: And last, Finish the load and terminate the sessions.

Example

Create a text file with the following records and name the file as emp.txt.

  1. 202001,James,1980/01/05,2010/03/01,1    
  2. 202002,Robert,1983/03/05,2010/09/01,1   
  3. 202003,Peter,1983/04/01,2009/02/12,2   
  4. 202004,Mike,1984/11/06,2014/01/01,2   
  5. 202005,Robert,1984/12/01,2015/03/09,3  

The following example is a MultiLoad script that reads records from the employee table and loads into the Emp_Stg table.

  1. .LOGTABLE tduser.Emp_log;    
  2. .LOGON 192.168.1.102/dbc,dbc;   
  3.    .BEGIN MLOAD TABLES Emp_Stg;    
  4.       .LAYOUT Emp;    
  5.       .FIELD in_EmpId * VARCHAR(10);    
  6.       .FIELD in_FirstName * VARCHAR(10);     
  7.       .FIELD in_BirthDate * VARCHAR(10);   
  8.       .FIELD in_JoinedDate * VARCHAR(10);    
  9.       .FIELD in_DepartmentNo * VARCHAR(05);  
  10.   
  11.       .DML LABEL EmpLabel;   
  12.       INSERT INTO Emp_Stg   
  13.       (  
  14.          EmpId,  
  15.          FirstName,  
  16.          BirthDate,  
  17.          JoinedDate,  
  18.          DepartmentNo  
  19.       )    
  20.       VALUES   
  21.       (  
  22.          :in_EmpId,  
  23.          :in_FirstName,  
  24.          :in_BirthDate,  
  25.          :in_JoinedDate,  
  26.          :in_DepartmentNo  
  27.       );  
  28.       .IMPORT INFILE emp.txt    
  29.       FORMAT VARTEXT ','  
  30.       LAYOUT Emp  
  31.       APPLY EmpLabel;    
  32.    .END MLOAD;    
  33. .LOGOFF;  

Executing a MultiLoad Script

Once the input file emp.txt is created, and the MultiLoad script is named as EmpLoad.ml, then we can run the MultiLoad script using the following command in UNIX and Windows.

  1. Multiload < EmpLoad.ml;  

Teradata BTEQ

Teradata BTEQ stands for Basic Teradata Query. It is a command-driven utility that enables users to interact with one or more Teradata Database Systems.

BTEQ utility is used in both batch and interactive mode. It can be used to run any DML statement, DDL statement, create Macros, and stored procedures. BTEQ can be used to import data into Teradata tables from a flat file, and it can also be used to extract data from tables into files or reports.

BTEQ Terms

Below are some terms that commonly used in BTEQ scripts.

  • LOGON: It is used to log into the Teradata system.
  • ACTIVITYCOUNT: It returns the number of rows affected by the previous query.
  • ERRORCODE: It returns the status code of the previous query.
  • DATABASE: It sets the default database.
  • LABEL:It assigns a name to a set of SQL commands.
  • RUN FILE:It executes the query contained in a file.
  • GOTO:It transfers control to a label.
  • LOGOFF:It logs off from the database and terminates all sessions.
  • IMPORT:It specifies the input file path.
  • EXPORT:It specifies the output file path and initiates the export.

BTEQ Features

Even though Teradata BTEQ has some limitations, it provides the following salient features:

  • It can submit SQL in both modes, either in batch or interactive way. Interactive users can send SQL and get an answer set on the screen. Users also can submit BTEQ jobs from the batch scripts, have error checking and conditional logic, and allow for the work to be done in the background.
  • Teradata SQL Assistant (Queryman) outputs data in a spreadsheet format wherein using BTEQ users can get the output in the report format.
  • BTEQ is an excellent tool for importing and exporting data.

Importing Data

BTEQ can import the data from the source and incorporate it into SQL to modify the content of one or more tables.

Syntax of the IMPORT command:

  1. .IMPORT <mode> {FILE | DDNAME} = <filename> [, SKIP = n];  

Exporting Data

The data can be written to either mainframe or LAN attached computer using SELECT from Teradata. Data can be exported in a variety of formats such as:

  • Record/DATA mode: In this mode, data will be written to a flat file in native format, i.e., INTEGER data is written as a 4-byte binary field. It cannot be read and understood using a regular text editor. It is set by .EXPORT DATA.
  • Field/Report mode: This is the default mode of BTEQ. The output of this BTEQ export includes column headers for the fields and white space, and can be understood using a text editor (see the sketch after this list). It is set by .EXPORT REPORT.
  • Indicator mode: It is similar to DATA mode, but it also includes a bitmap to recognize NULL. It helps to acknowledge NULL values into another RDBMS. And it is set by .EXPORT INDICDATA.
  • DIF mode: Allows users to export data from Teradata that is directly utilized for spreadsheet applications such as Excel, FoxPro, and Lotus.
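A minimal Field/Report mode sketch (the file name is an assumption):

-- write the formatted answer set to a text file, then close the export
.EXPORT REPORT FILE = emp_report.txt;
SELECT * FROM Emp;
.EXPORT RESET;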

Example

Following is an example of the Teradata BTEQ script.

  1. .LOGON 192.168.1.102/dbc,dbc;   
  2.    DATABASE tduser;  
  3.   
  4.    CREATE TABLE emp_bkup   
  5.    (   
  6.       EmpId INTEGER,   
  7.       FirstName CHAR(10),   
  8.       LastName CHAR(10),   
  9.       DepartmentNo SMALLINT,   
  10.       NetPay INTEGER   
  11.    )  
  12.    Unique Primary Index(EmpId);  
  13.   
  14.    .IF ERRORCODE <> 0 THEN .EXIT ERRORCODE;  
  15.     
  16.    SELECT * FROM    
  17.    Emp  
  18.    Sample 1;   
  19.    .IF ACTIVITYCOUNT <> 0 THEN .GOTO InsertEmp;    
  20.   
  21.    DROP TABLE emp_bkup;  
  22.     
  23.    .IF ERRORCODE <> 0 THEN .EXIT ERRORCODE;   
  24.    
  25.    .LABEL InsertEmp   
  26.    INSERT INTO emp_bkup   
  27.    SELECT a.EmpId,   
  28.       a.FirstName,   
  29.       a.LastName,   
  30.       a.DepartmentNo,   
  31.       b.NetPay   
  32.    FROM    
  33.    Emp a INNER JOIN Salary b   
  34.    ON (a.EmpId = b.EmpId);    
  35.   
  36.    .IF ERRORCODE <> 0 THEN .EXIT ERRORCODE;   
  37. .LOGOFF;   

The above Teradata BTEQ script performs the above task in the following steps.

Step 1: Logs into Teradata System.

Step 2: Sets the Default Database.

Step 3: Creates a table called emp_bkup.

Step 4: Selects one record from the Emp table to check if the table has any records.

Step 5: Drops emp_bkup table if the table is empty.

Step 6: Transfers the control to a Label InsertEmp, which inserts records into the emp_bkup table.

Step 7: Then, check ERRORCODE to ensure that the statement is successful, following each SQL statement.

Step 8: ACTIVITYCOUNT returns the number of records selected/impacted by the previous SQL query.


Teradata Performance Tuning

In Teradata, performance tuning is done to identify all the bottlenecks and then resolve them.

A bottleneck is not a form of error, but it does cause a certain amount of delay.

The following techniques can be used to identify and resolve bottlenecks:

EXPLAIN Plan

The first step in performance tuning is the use of EXPLAIN on the query. When a query is preceded with the EXPLAIN modifier, the Parsing Engine returns the execution plan to the user instead of sending it to the AMPs.

The EXPLAIN plan gives a clear picture of the way the optimizer will execute a query. In the EXPLAIN plan, check for keywords such as:

  • Confidence level
  • Join strategy used
  • Spool file size
  • Redistribution, etc.
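
For example, prefixing a query with EXPLAIN returns the step-by-step plan, including confidence levels and any spool or redistribution steps, without executing the query (the Emp table and the filter value below are only illustrative):

EXPLAIN
SELECT EmpId, FirstName, LastName
FROM Emp
WHERE DepartmentNo = 401;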

Collect Statistics

The optimizer uses Data demographics to come up with an effective execution strategy. COLLECT STATISTICS command is used to collect data demographics of the table. Make sure that the statistics collected on the columns are up to date.

  • Collect statistics on the columns that are used in the WHERE clause and on the columns used in the joining condition.
  • Collect statistics on the Unique Primary Index columns.
  • Collect statistics on Non-Unique Secondary Index columns. The optimizer will decide if it can use NUSI or Full Table Scan.
  • Collect statistics on the Join Index columns, even though statistics on the base table have already been collected.
  • Collect statistics on the partitioning columns.

Environment Information

  • Number of Nodes, AMPs and CPUs
  • Amount of memory

Data Demographics

  • Number of rows
  • Row size
  • Range of values in the table
  • Number of rows per value
  • Number of Nulls

There are three approaches to collect statistics on the table.

  • Random AMP Sampling
  • Full statistics collection
  • Using SAMPLE option

Collecting Statistics

COLLECT STATISTICS command is used to collect statistics on a table.

Syntax

Following is the basic syntax to collect statistics on a table.

COLLECT [SUMMARY] STATISTICS   
[ INDEX (indexname) | COLUMN (columnname) ]
ON <tablename>;

Example

The following example collects statistics on EmployeeNo column of Employee table.

COLLECT STATISTICS COLUMN(EmployeeNo) ON Employee;

Viewing Statistics

You can view the collected statistics using HELP STATISTICS command.

Syntax

Following is the syntax to view the statistics collected.

HELP STATISTICS <tablename>; 

Example

Following is an example to view the statistics collected on Employee table.

HELP STATISTICS employee;

Using the THRESHOLD Options to Collect and Recollect Statistics

Statistics collection thresholds enable you to minimize the unnecessary collection of statistics. For a system-determined threshold, the Optimizer automatically determines whether a recollection is needed or whether extrapolation is adequate. You can collect and recollect threshold statistics on tables.

The THRESHOLD options enable you to skip recollecting statistics if the amount of data change since the last statistics collection or the age of the current statistics is under the thresholds in effect for the statistics.

The Optimizer can automatically determine the appropriate thresholds to apply based on previously collected statistics, column histories, change counts, and other factors, or you can explicitly specify the thresholds as part of a COLLECT STATISTICS request by specifying a change percentage, a number of days, or both.

You can submit collect statistics requests at regular intervals, and the Optimizer can use stored threshold and historical data to determine whether the specified recollection of statistics is necessary or not. The Optimizer can use the same data to determine when to recollect statistics for those columns that exceed the specified threshold. For example, if you determine that the threshold is to be a 10% data change and you submit a request to recollect statistics, the Optimizer does not recollect those statistics if the change in the data from the last collection is less than 10%.

If you specify a THRESHOLD option for first time statistics collections, the Optimizer collects the statistics and stores the THRESHOLD options you specified for future use without consulting the threshold values you specify. The Optimizer uses the saved collection thresholds to determine whether the current statistics collection can be skipped or not when you submit recollection requests at a later time.

Specifying a new THRESHOLD option for statistics recollection overrides the current setting, if any. The Optimizer applies the new THRESHOLD value to the current request, and remembers the updated THRESHOLD option for future statistics recollection.
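
As a sketch (the Employee table, the column, and the 10 percent / 7 day values are assumptions, and the exact USING syntax may vary by release), thresholds can be stated directly in the request:

COLLECT STATISTICS
USING THRESHOLD 10 PERCENT AND THRESHOLD 7 DAYS
COLUMN (LastName)
ON Employee;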

COLLECT STATS in Teradata collects statistics for one or more columns of the specified base table, and also for hash indexes, join indexes, etc. The primary purpose is to collect statistical profiles of the required columns and store them in the data dictionary.

What is Collect stats in Teradata?

Collect Stats is used to gather the statistics on various columns as per the requirement. Thereafter Teradata’s optimizer creates an execution strategy that is based on these statistics collected by the COLLECT STATS command.

This COLLECT STATS command gathers data demographics and environment information, which the optimizer utilizes in order to optimize the plan for the SQL based on these tables.

Environment information consists of the below set of info:

  • The amount of memory
  • The number of nodes
  • The number of AMPs
  • The number of CPUs

Data demographics consist of the below set of info:

  • The row size of the table
  • The number of rows in the table
  • The range of values present in the table
  • The number of rows per value
  • The number of NULL values in the table
You can specify the following kinds of thresholds to control recollecting statistics:

  • Change-based
  • Time-based

The COLLECT STATISTICS options that control sampling and thresholds are described below.

NO SAMPLE

Use a full-table scan to collect the specified statistics.

You can only specify this option if you also specify an explicit index or column set.

You cannot specify NO SAMPLE for a standard recollection of statistics on an implicitly specified index or column set.
  • For the first collection of statistics, NO SAMPLE overrides the default.
  • For recollections of statistics, NO SAMPLE overrides the previous SAMPLE options and collects full statistics.

NO THRESHOLD
  • Do not apply any thresholds to the collection of statistics.
  • Remove the existing threshold before collecting the statistics.

You can only specify this option if you also specify an explicit column or index.

You cannot specify NO THRESHOLD for a standard recollection of statistics on an implicitly specified index or column set.
  • For the first collection of statistics, NO THRESHOLD overrides the default settings.
  • For recollections of statistics, NO THRESHOLD overrides any previously specified THRESHOLD options and recollects the statistics without any thresholds.

NO THRESHOLD DAYS
  • Do not apply a DAYS threshold to the collection of statistics.
  • Remove the existing DAYS threshold before collecting the statistics.

You can only specify this option if you also specify an explicit index or column set.

You cannot specify NO THRESHOLD DAYS for a standard recollection of statistics on an implicitly specified index or column set.
  • For the first collection of statistics, NO THRESHOLD DAYS overrides the default setting.
  • For recollection of statistics, NO THRESHOLD DAYS overrides any previous DAYS threshold specification.

NO THRESHOLD PERCENT
  • Do not apply a PERCENT change threshold to the collection of statistics.
  • Remove the existing PERCENT change threshold before collecting the statistics.

You can only specify this option if you also specify an explicit column or index.

You cannot specify NO THRESHOLD PERCENT for a standard recollection of statistics on an implicitly specified index or column set.
  • For the first collection of statistics, NO THRESHOLD PERCENT overrides the default.
  • For recollections of statistics, NO THRESHOLD PERCENT overrides any previous change threshold percent specification.

SAMPLE

Scan a system-determined percentage of table rows to collect the specified statistics.

SAMPLE has the same meaning as SYSTEM SAMPLE and is only provided for backward compatibility to enable existing COLLECT STATISTICS scripts that specify the USING SAMPLE option to continue to run.

You should use the SYSTEM SAMPLE option instead of SAMPLE.

SAMPLE n PERCENT

Scans the percentage of table rows that you specify, rather than scanning all of the rows in the table, to collect statistics.

The value for n can be a decimal number or integer from 2 through 100.

Specifying SAMPLE 100 PERCENT is equivalent to collecting full statistics.

You can only specify this option if you also specify an explicit index or column set.

You cannot specify SAMPLE n PERCENT for a standard recollection of statistics on an implicitly specified index or column set.
  • For the first collection of statistics, the specified sample percentage overrides the default.
  • For recollection of statistics, SAMPLE n PERCENT overrides any previous SAMPLE option specifications and instead scans n percent of the rows in the table.

THRESHOLD n DAYS

Do not recollect statistics if the age of the statistics is less than the number of days specified.

The value for n must be an integer number that represents the number of days.

The valid range for n is 1 - 9999.

You can only specify this option if you also specify an explicit index or column set.

You cannot specify THRESHOLD n DAYS for a standard recollection of statistics on an implicitly specified index or column set.
  • For the first collection of statistics, collecting statistics is not skipped because no current histogram exists to use to determine the age of the current statistics.

    Instead, the specified number of days overrides the default setting.

  • For recollection of statistics, THRESHOLD n DAYS overrides the previous THRESHOLD n DAYS specification and instead applies the specified number of days threshold to the collection.

The DBS Control field DefaultTimeThreshold in the STATISTICS field group contains the default for this option. For more information, see Teradata Vantage™ - Database Utilities, B035-1102.

THRESHOLD n PERCENT

Do not recollect statistics if the percentage of data change since the last collection is less than the specified percentage.

The value for n can be either a decimal number or an integer number.

The valid range of n is 1 - 9999.99.

You can only specify this option if you also specify an explicit column or index.

You cannot specify THRESHOLD n PERCENT for a standard recollection of statistics on an implicitly specified index or column set.
  • For the first collection of statistics, THRESHOLD n PERCENT overrides the default setting.
  • For recollection of statistics, THRESHOLD n PERCENT overrides the previous threshold change percentage specification and instead applies the specified threshold percentage.

The DBS Control field DefaultUserChangeThreshold in the STATISTICS field group contains the default for this option. For more information, see Teradata Vantage™ - Database Utilities, B035-1102.
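
For instance, a sampled collection might be requested as in the sketch below (the table and column names are assumptions):

COLLECT STATISTICS
USING SAMPLE 10 PERCENT
COLUMN (EmployeeNo)
ON Employee;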

The three collection approaches are typically applied as follows:

  • Using the SAMPLE option: this includes unique index columns and nearly unique indexes or columns.
  • Full statistics collection: this usually includes non-indexed columns, the PARTITION column of all tables (whether partitioned or not), and full statistics on other relevant columns.
  • It also covers most NUPI (non-unique primary index) columns, single-column join constraints, and NUSIs. (UPI stands for unique primary index.)
  • Random AMP sampling: this involves USIs or UPIs when they are used only with equality predicates.

How to Collect Stats in Teradata?

Below is the syntax of COLLECT STATS statement in Teradata:

COLLECT [SUMMARY] STATISTICS
[ INDEX (name_of_the_index) | COLUMN (col_name) ]
ON <table_name>;

Here the keyword SUMMARY is optional, and the user may skip it if not required.

Let’s take up some examples to understand how to collect stats in Teradata in detail:

COLLECT STATISTICS COLUMN(roll_number) ON Student_Table;
  • This will collect stats on the roll_number column of the student_table

Let’s take an example to understand the optimization in detail.

Suppose we have two tables table1 and table2.

table1 contains the details of the students, such as id, name, age, and marks,

whereas table2 contains the geographic information of the students, such as address and location, along with the primary key ID.

SELECT
   a.id,
   a.name,
   a.age,
   a.marks,
   b.address,
   b.location
FROM table1 AS a
LEFT JOIN table2 AS b
ON a.id = b.id;

Let’s consider two cases wherein the above-mentioned query gets executed.

CASE 1: When we do not have any statistics for the columns of table1 and table2. In this case, the execution plan for the above query will be more costly.

CASE 2: When we do have the required statistics for the columns of table1 and table2. In this case, the execution plan for the above query will be less costly.

The reason is that, for the join on the id column, the matching rows from table1 and table2 must be on the same AMP before the data can be joined.

Suppose table1 contains 100 records having IDs from 1 to 100, distributed evenly over 10 AMPs in the following fashion:

  • The records having the ID from 1 to 10 in AMP1
  • The records having the ID from 11 to 20 in AMP2
  • The records having the ID from 21 to 30 in AMP3
  • The records having the ID from 31 to 40 in AMP4
  • and so on…
  • The records having the ID from 91 to 100 in AMP10

And table2 has only 80 records, with IDs ranging from 1 to 100 and 20 IDs missing:

  • The records having the ID from 1 to 8 in AMP1
  • The records having the ID from 9 to 10 and 15 to 20 in AMP2
  • The records having the ID from 21 to 28 in AMP3
  • The records having the ID from 29 to 36 in AMP4
  • and so on…
  • The records having the ID from 92 to 100 in AMP8

Now, for the join to happen, matching IDs must be available on the same AMP.

Generally, the data from the smaller table is redistributed, so here the redistribution will happen for table2.
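
In a case like this, collecting statistics on the join column of both tables, as sketched below with the hypothetical table1 and table2, gives the optimizer the row counts and value ranges it needs to choose the cheaper redistribution plan:

COLLECT STATISTICS COLUMN (id) ON table1;
COLLECT STATISTICS COLUMN (id) ON table2;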

How can we see the statistics collected for the tables in Teradata?

The collected stats can be seen using the below query:

HELP STATISTICS <table_name>;

Let’s see the stats collected on Student_table

HELP STATISTICS student_table;

When the above query is executed, the result lists each collected statistic along with its collection date and time, the number of unique values, and the column name.

Data Types

Each column in a table is associated with a data type. Data types specify what kind of values will be stored in the column. Every data value belongs to an SQL data type.

Make sure that proper data types are used. This avoids using more storage than required.

Conversion

Ensure that the data types of the columns used in join conditions are compatible to avoid explicit data conversions.

Sort

Remove ORDER BY clauses unless they are required.

Spool Space Issue

A spool space error is generated if the query exceeds the per-AMP spool space limit for that user.

Verify the EXPLAIN plan and identify the step that consumes the most spool space. Such queries can be split into smaller pieces that build temporary tables.

Primary Index

The primary index is used to specify where the data resides in Teradata. Each table is required to have a primary index defined, so make sure the Primary Index is correctly set for the table.

The primary index column should evenly distribute the data and should be frequently used to access the data.

Partition Primary Index

If a Partitioned Primary Index is defined on a table, try to use it: include the partitioning column in the filter condition, otherwise performance will degrade, as in the sketch below.
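
A minimal sketch, assuming a hypothetical Orders table partitioned by month on order_date: the query filters on the partitioning column, so the optimizer can eliminate partitions instead of scanning the whole table.

CREATE TABLE Orders
(
   OrderId INTEGER,
   CustomerId INTEGER,
   order_date DATE
)
PRIMARY INDEX (OrderId)
PARTITION BY RANGE_N (order_date BETWEEN DATE '2023-01-01'
                      AND DATE '2024-12-31' EACH INTERVAL '1' MONTH);

-- filtering on the partitioning column enables partition elimination
SELECT OrderId, CustomerId
FROM Orders
WHERE order_date BETWEEN DATE '2024-01-01' AND DATE '2024-01-31';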

SET Table

If we define a SET table, then the optimizer will check whether the row is a duplicate for each and every row inserted. To avoid this duplicate row check, define a Unique Secondary Index on the table.

UPDATE on Large Table

Updating a large table will be time-consuming. Instead of updating the table, we can delete the records and insert the records with the modified values.

Dropping Temporary Tables

Drop the temporary tables (staging tables) and volatile tables if they are no longer needed. This will free up permanent space and spool space.

MULTISET Table

If we have confirmed that the input records will not contain duplicate rows, we can define the target table as a MULTISET table to avoid the duplicate row check performed by a SET table, as sketched below.
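
A minimal sketch of the difference (the staging table names and columns are assumptions): the SET table checks every insert for an exact duplicate row, while the MULTISET table skips that check.

-- duplicate row check on every insert
CREATE SET TABLE emp_stage_set
(
   EmpId INTEGER,
   FirstName CHAR(10)
)
PRIMARY INDEX (EmpId);

-- no duplicate row check; duplicate rows are allowed
CREATE MULTISET TABLE emp_stage_multi
(
   EmpId INTEGER,
   FirstName CHAR(10)
)
PRIMARY INDEX (EmpId);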

IN Clause

Avoid the IN clause in filter conditions when there can be a huge number of values in the WHERE condition. In that case it is better to insert such values into a volatile table and join the volatile table to the main query with an INNER JOIN, as sketched below.
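
A sketch of this rewrite, assuming a hypothetical Customer table and a customer_id filter list:

CREATE VOLATILE TABLE filter_ids
(
   customer_id INTEGER
)
PRIMARY INDEX (customer_id)
ON COMMIT PRESERVE ROWS;

INSERT INTO filter_ids VALUES (1001);
INSERT INTO filter_ids VALUES (1002);

-- join to the volatile table instead of using a long IN list
SELECT c.customer_id, c.customer_name
FROM Customer c
INNER JOIN filter_ids f
ON c.customer_id = f.customer_id;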

UPDATE Clause

Avoid UPDATE statements that have only a SET clause and no WHERE clause. Even if the target and source have only one row, add a WHERE clause on the PI column, as in the sketch below.
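
A sketch, assuming emp_bkup from the earlier BTEQ example with EmpId as its Primary Index:

-- avoid: updates every row in the table
UPDATE emp_bkup
SET NetPay = 0;

-- prefer: restrict the update with a WHERE clause on the PI column
UPDATE emp_bkup
SET NetPay = 0
WHERE EmpId = 1001;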


 Structured Query Language (SQL) is the industry language for communicating with Relational Database Management Systems (RDBMS) 

In an RDBMS, data is stored in tables in which a row represents one instance of an entity and the columns represent the data fields that make up a row.

 

SQL is used to construct queries to communicate with the Teradata Vantage Advanced SQL Engine and define the answer set that is returned to the user.

-----------------------------------------------------

SQL STATEMENT TYPE


1. Data Definition Language (DDL) consists of all statements that support the definition of database objects.

 

DDL is used to create, modify database objects such as tables, views, macros, databases as well as restructure the database.

DDL Statements:

 

CREATE

DROP

ALTER

RENAME

MODIFY

REPLACE

COLLECT STATISTICS

SET

BEGIN

END

FLUSH

CHECKPOINT

HELP

SHOW
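
For example (the object names below are illustrative only):

CREATE TABLE tduser.emp_contact
(
   EmpId INTEGER,
   Email VARCHAR(100)
)
UNIQUE PRIMARY INDEX (EmpId);

-- add a column to the existing table
ALTER TABLE tduser.emp_contact
ADD Phone VARCHAR(20);

-- remove the table when it is no longer needed
DROP TABLE tduser.emp_contact;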


2. Data Manipulation Language (DML) consists of all SQL statements that support the manipulation and processing of database objects.

 

DML is used to work with data, including inserting data, updating existing data or performing queries on the data.

 

The DML subset will be the focal point of this fundamentals course.


DML Statements:

 

SELECT

INSERT

UPDATE

DELETE
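
For example, against the emp_bkup table created in the earlier BTEQ script (the values are illustrative):

INSERT INTO emp_bkup (EmpId, FirstName, LastName, DepartmentNo, NetPay)
VALUES (1001, 'Peter', 'Parker', 401, 45000);

UPDATE emp_bkup
SET NetPay = 47000
WHERE EmpId = 1001;

SELECT EmpId, FirstName, NetPay
FROM emp_bkup
WHERE DepartmentNo = 401;

DELETE FROM emp_bkup
WHERE EmpId = 1001;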


3. Data Control Language (DCL) is a subset of the SQL language that defines and supports the security authorization for accessing database objects.

 

DCL statements are used for administrative tasks such as granting / revoking privileges or controlling object ownership.


DCL Statements:

 

 

GRANT

REVOKE

GIVE
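
For example (the user name analyst_user and the tduser database are assumptions):

GRANT SELECT, INSERT ON tduser TO analyst_user;

REVOKE INSERT ON tduser FROM analyst_user;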


4. Query and Workload Analysis statements are used for collecting and analyzing query and data demographics.

 

Some statements in this category are also functionally aligned with the DDL SQL category.

 

For more information on SQL statements in this category, see references in help section.

SQL Statements in this category:

 

 

BEGIN/END QUERY LOGGING

SHOW QUERY LOGGING

COLLECT DEMOGRAPHICS

COLLECT STATISTICS

DROP STATISTICS

DUMP EXPLAIN

INITIATE INDEX ANALYSIS

INITIATE PARTITION ANALYSIS

INSERT EXPLAIN

RESTART INDEX ANALYSIS


-------------------------------------------------------------

 HELP


The HELP command displays information about database objects (databases, tables, views, macros, etc.).

 

This information is pulled from the data dictionary and includes information such as object attributes, function parameters, and column data types.
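
For example (the object names are taken from earlier examples):

HELP DATABASE tduser;
HELP TABLE tduser.emp_bkup;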



-----------------------------------------------------------------------------------------------------------------

The SHOW command displays the current data definition language (DDL) of a database object.

 

It returns the last definition statement performed against the object (which may not be the same DDL used to initially create it).
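
For example (emp_bkup is from the earlier BTEQ script; emp_view is a hypothetical view):

SHOW TABLE tduser.emp_bkup;
SHOW VIEW tduser.emp_view;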


------------------------------------------------

EXPLAIN

EXPLAIN precedes any SQL request and returns the Optimizer's execution plan for that request to the user instead of executing it.


=========================================================

Introduction to the Teradata Database

The Teradata database is a Relational Database Management System (RDBMS) that drives a company's data warehouse.

Teradata Parallel Architecture

      
There are many reasons customers choose the Teradata database as the preferred platform for enterprise data warehousing:

  • Supports more warehouse data than all competitors combined. There are over 5400 1TB or larger warehouses in the field.
  • Supports scalability from small (10GB) to massive (50+PB) databases.
  • Provides a parallel-aware Optimizer that makes it unnecessary to perform complicated tuning to get a query to run efficiently.
  • Automatic data distribution eliminates complex indexing schemes and time-consuming reorganizations.
  • Designed and built on a parallel architecture from the start (not a parallel retrofit).
  • Supports ad-hoc queries using ANSI-standard Structured Query Language (SQL) which is used to communicate with an RDBMS. This allows Teradata to interface with 3rd party BI tools and submit queries from other database systems.
  • Single operational view of the entire MPP system via the Service Workstation (SWS) and single point of control for the DBA with Teradata Viewpoint.

Teradata Corporation has been doing data warehousing longer than any other vendor. The Teradata database was built to do Decision Support.

Relational Database Advantages

      
The Teradata database provides customers with unlimited, proven scalability. The Teradata database can scale from 10 gigabytes to over 100 petabytes of data on a single system. When we talk about scalability, it isn't just about being able to grow very large and handle enormous amounts of data. It's about growth without losing any performance capabilities. This scalability provides investment protection for customers' growth and application development.

As is proven in the benchmarks we perform, the Teradata database can handle the most concurrent users, who are often running complex queries.

The parallelism of the Teradata database is unlimited. The Teradata database performs every task in parallel, including joins, sorts, and aggregations.

The optimizer is parallel-aware and the most robust in the industry, able to handle multiple complex queries, joins per query, and unlimited ad-hoc processing.

The Teradata File System manages data and disk space automatically. This is what makes the rebuilding of indexes unnecessary.

The Teradata database provides a low total cost of ownership (TCO) due to:

  • Ease of setup and maintenance
  • No reorganization of data needed
  • Automatic data distribution
  • Most robust utilities in the industry
  • Low cost of disk to data ratio
  • Ease in expanding the system

High availability is also a major advantage because with the Teradata database architecture, there is no single point of failure - fault tolerance is built-in to the system. The Teradata database is the only database that is truly linearly scalable, and this extends to data loading with the use of our parallel utilities.

Teradata Database Features

      
The Teradata database is a high-performance database system that processes enormous quantities of detail data that are beyond the capability of conventional systems.

The system is specifically designed for large data warehouses. From the beginning, the Teradata database was created to store, access and manage large amounts of data.

Hundreds of terabytes of storage capacity are currently available, making it an ideal solution for enterprise data warehouses and even for smaller data marts.

Parallel processing distributes data across multiple processors evenly. The system is designed such that the components divide the work up into approximately equal pieces. This keeps all the parts busy all the time and enables the system to accommodate a larger number of users and/or more data.

Open architecture adapts readily to new technology. As higher-performance industry standard computer chips and disk drives are made available, they are easily incorporated into the architecture. Teradata runs on industry standard operating systems as well.

Linear scalability enables the system to grow to support more users/data/queries/query complexity, without experiencing performance degradation. As the configuration grows, the performance increase is linear, with a slope of 1.

The Teradata database currently runs as a database server on a variety of hardware platforms for single node or Symmetric Multi-Processor (SMP) systems, and on Teradata hardware for multi-node Massively Parallel Processing (MPP) systems.

What is a Database?

      
A database is a collection of permanently stored data that is used by an application or enterprise.

A database contains logically related data, which means that the database was created with a specific purpose in mind. A database supports shared access by many users. One characteristic of a database is that many people use it, often for many different purposes. A database is protected to control access and managed to retain its value and integrity.

One example of a database is payroll data that includes the names of the employees, their employee numbers, and their salary history. This database is logically related - it's all about payroll. It must have shared access, since it will be used by the payroll department to generate checks, and also by management to make decisions. This database must be protected; much of the information is confidential and must be managed to ensure the accuracy of the records.

The Teradata database is a relational database. Relational databases are based on the relational model, which is founded on mathematical Set Theory. The relational model uses and extends many principles of Set Theory to provide a disciplined approach to data management. Users and applications access data in an RDBMS using industry-standard SQL statements. SQL is a set-oriented language for relational database management.

Later in this course we will provide another definition of database that is specific to the Teradata database.

Logical/Relational Modeling

      
The logical model should be independent of usage. A variety of front-end tools can be accommodated so that the database can be created quickly.

The design of the data model is the same regardless of data volume.

An enterprise model is one that provides the ability to look across functional processes.

Normalization is the process of reducing a complex data structure into a simple, stable one. Generally this process involves removing redundant attributes, keys, and relationships from the conceptual data model.

Normalization theory is constructed around the concept of normal forms that define a system of constraints. If a relation meets the constraints of a particular normal form, we say that relation is "in normal form."

The intent of normalizing a relational database is to put one fact in one place. By decomposing your relations into normalized forms, you can eliminate the majority of update anomalies that can occur when data is stored in de-normalized tables.

A slightly more detailed statement of this principle would be the definition of a relation (or table) in a normalized relational database: A relation consists of a primary key, which uniquely identifies any tuple (an ordered set of values), and zero or more additional attributes, each of which represents a single-valued (atomic) property of the entity type identified by the primary key. The separator for each value is often a comma. Common uses for the tuple as a data type are:

  1. For passing a string of parameters from one program to another
  2. Representing a set of value attributes in a relational database

The Logical Model

  • Should be designed without regard to usage
    • Accommodates a wide variety of front end tools
    • Allows the database to be created more quickly
  • Should be the same regardless of data volume
  • Data is organized according to what it represents - real world business data in table (relational) form
  • Includes all the data definitions within the scope of the application or enterprise
  • Is generic - the logical model is the template for physical implementation on any RDBMS platform

Normalization

  • Process of reducing a complex data structure into a simple, stable one
  • Involves removing redundant attributes, keys, and relationships from the conceptual data model

Relational Databases

      
Relational databases are solidly founded on Set Theory and are based on the Relational Model, which is a set of principles for relational databases formalized by Dr. E.F. Codd in the late 1960s. The relational model of data prescribes how the data should be represented in terms of structure, integrity, and manipulation. The rules of the relational model are widely accepted in the information technology industry, though actual implementations may vary.

The key to understanding relational databases is the concept of the table, which is made up of rows and columns. The relational model provides the system with rules and a structure to build on. The relational terms we will be discussing in this course are independent of the database you may be running on.

A column is an attribute of the entity that a table represents. A column always contains like data, such as part name, supplier name, or employee number. In the example below, the column named LAST NAME contains last name and never anything else. Columns should contain atomic data, so a phone number might be divided into three columns: area code, prefix, and suffix, so drill-down analysis can be performed. Column position in the table is arbitrary. Missing data values are represented by "nulls", or the absence of a value.

A row is one instance of all the columns of a table. It is a single occurrence of an entity in the table. In our example, all information about a single employee is in one row. The sequence of the rows in a table is arbitrary.

In a relational database, tables are defined as a named collection of one or more named columns that can have zero or many rows of related information. Notice that the information in each row of the table refers to only one person. There are no rows with data on two people, nor are there rows with information on anything other than people. This may seem obvious, but the concept underlying it is very important.

Each row represents an occurrence of an entity defined by the table. An entity is defined as a person, place, thing, or event about which the table contains information. In this case, the entity is the employee.

Note: In relational math we use the term:

  • Table to mean a relation.
  • Row to mean a tuple.
  • Column to mean an attribute.
The EMPLOYEE table below is a relation. Each row is a single entity (one employee) and each column is an attribute.

EMPLOYEE (Relation)

EMPLOYEE | MANAGER  | DEPARTMENT | JOB    | LAST     | FIRST   | HIRE   | BIRTH  | SALARY
NUMBER   | EMPLOYEE | NUMBER     | CODE   | NAME     | NAME    | DATE   | DATE   | AMOUNT
         | NUMBER   |            |        |          |         |        |        |
---------|----------|------------|--------|----------|---------|--------|--------|--------
1006     | 1019     | 301        | 312101 | Stein    | John    | 861015 | 631015 | 3945000
1008     | 1019     | 301        | 312102 | Kanieski | Carol   | 870201 | 680517 | 3925000
1005     | 0801     | 403        | 431100 | Ryan     | Loretta | 861015 | 650910 | 4120000
1004     | 1003     | 401        | 412101 | Johnson  | Darlene | 861015 | 560423 | 4630000
1007     |          |            |        | Villegas | Arnando | 870102 | 470131 | 5970000
1003     | 0801     | 401        | 411100 | Trader   | James   | 860731 | 570619 | 4785000

The employee table has:

  • Nine columns of data
  • Six rows of data - one per employee
  • Missing data values represented by nulls
  • Column and row order are arbitrary

Normalized Data Model vs. Star Schema Model

      
As a model is refined, it passes through different states which can be referred to as normal forms. A normalized model includes:

  • Entities - One record in a table
  • Attributes - Columns
  • Relationships - Between tables

First normal form (1NF) rules state that each and every attribute within an entity instance has one and only one value. No repeating groups are allowed within entities.

Second normal form (2NF) requires that the entity must conform to the first normal form rules. Every non-key attribute within an entity is fully dependent upon the entire key (key attributes) of the entity, not a subset of the key.

Third normal form (3NF) requires that the entity must conform to the first and second normal form rules. In addition, no non-key attributes within an entity is functionally dependent upon another non-key attribute within the same entity.

While the Teradata database can support any data model that can be processed via SQL, an advantage of a normalized data model is the ability to support previously unknown (ad-hoc) questions.

Star Schema

The star schema (sometimes referenced as star join schema) is the simplest style of data warehouse schema. The star schema consists of a few fact tables (possibly only one, justifying the name) referencing any number of dimension tables. The star schema is considered an important special case of the snowflake schema.

Some characteristics of a Star Schema model include:

  • They tend to have fewer entities
  • They advocate a greater level of denormalization

Primary Keys

      
Tables are made up of rows and columns and represent entities and attributes. Entities are the people, places, things, or events that the tables model. A Primary Key is required for every logical model version of a table, because each entity holds only one type of tuple (i.e., a row about a person, place, thing or event), and each tuple is uniquely identified within an entity by a Primary Key (PK).

Primary Key Rules

  • A Primary Key uniquely identifies each tuple within an entity. A Primary Key is required, because each tuple within an entity must be able to be uniquely identified.

  • No duplicate values are allowed. The Primary Key for the EMPLOYEE table is the employee number, because no two employees can have the same number.

  • Because it is used for identification, the Primary Key cannot be null. There must be something in that field to uniquely identify each occurrence.

  • Primary Key values should not be changed. Historical information, as well as relationships with other entities, may be lost if a PK value is changed or re-used.

  • Primary Key can include more than one attribute. In fact, there is no limit to the number of attributes allowed in the PK.

  • Only one Primary Key is allowed per entity.

Selecting a Primary Key

Each column of a row is called an attribute of that row. A database designer can select any attribute to be a Primary Key but, as a result of the rules listed above, many attributes will not qualify as a Primary Key candidate. For example, we could have selected Last Name as the PK of the EMPLOYEE table, but as soon as the company hires two people with the same last name, the PK is no longer unique. Even if we made the PK Last Name and First Name (possible since PKs can be made up of more than one column) we could still have two employees with the same name. Moreover, some employees may choose to change their last names.

Many data modelers recommend using system-assigned sequential integers for Primary Keys. This assures uniqueness and gives users control over Primary Key values, but it also takes up database space.

Primary Key (PK) values uniquely identify each entry (tuple) of an entity.

Employee

EMPLOYEE | MANAGER  | DEPARTMENT | JOB    | LAST     | FIRST   | HIRE   | BIRTH  | SALARY
NUMBER   | EMPLOYEE | NUMBER     | CODE   | NAME     | NAME    | DATE   | DATE   | AMOUNT
(PK)     | NUMBER   |            |        |          |         |        |        |
---------|----------|------------|--------|----------|---------|--------|--------|--------
1006     | 1019     | 301        | 312101 | Stein    | John    | 861015 | 631015 | 3945000
1008     | 1019     | 301        | 312102 | Kanieski | Carol   | 870201 | 680517 | 3925000
1005     | 0801     | 403        | 431100 | Ryan     | Loretta | 861015 | 650910 | 4120000
1004     | 1003     | 401        | 412101 | Johnson  | Darlene | 861015 | 560423 | 4630000
1007     |          |            |        | Villegas | Arnando | 870102 | 470131 | 5970000
1003     | 0801     | 401        | 411100 | Trader   | James   | 860731 | 570619 | 4785000

Foreign Keys

      
Relational databases permit data values to associate across more than one entity. A Foreign Key (FK) value identifies entity relationships.

Below you will see that the employee table has three FK attributes, one of which models the relationship between employees and their departments. A second FK attribute models the relationship between employees and their jobs.

A third FK attribute is used to model the relationship between employees and each other. This is called a recursive relationship.

Foreign Key Rules

  • Duplicate values are allowed in a FK attribute.
  • NULLs are allowed in a FK attribute.
  • Values may be changed in a FK attribute.
  • Each FK must exist elsewhere as a Primary Key.

Note that Department_Number is the Primary Key for the DEPARTMENT entity.

Remember, these terms are not Teradata specific - they are just general relational concepts.

EMPLOYEE (partial listing)

EMPLOYEE | MANAGER  | DEPARTMENT | JOB    | LAST     | FIRST   | HIRE   | BIRTH  | SALARY
NUMBER   | EMPLOYEE | NUMBER     | CODE   | NAME     | NAME    | DATE   | DATE   | AMOUNT
(PK)     | NUMBER   | (FK)       | (FK)   |          |         |        |        |
         | (FK)     |            |        |          |         |        |        |
---------|----------|------------|--------|----------|---------|--------|--------|--------
1006     | 1019     | 301        | 312101 | Stein    | John    | 861015 | 631015 | 3945000
1008     | 1019     | 301        | 312102 | Kanieski | Carol   | 870201 | 680517 | 3925000
1005     | 0801     | 403        | 431100 | Ryan     | Loretta | 861015 | 650910 | 4120000
1004     | 1003     | 401        | 412101 | Johnson  | Darlene | 861015 | 560423 | 4630000
1007     |          |            |        | Villegas | Arnando | 870102 | 470131 | 5970000
1003     | 0801     | 401        | 411100 | Trader   | James   | 860731 | 570619 | 4785000

Foreign Key (FK) values model relationships.

DEPARTMENT

DEPARTMENT | DEPARTMENT               | BUDGET   | MANAGER
NUMBER     | NAME                     | AMOUNT   | EMPLOYEE
(PK)       |                          |          | NUMBER (FK)
-----------|--------------------------|----------|------------
501        | marketing sales          | 80050000 | 1017
301        | research and development | 46560000 | 1019
302        | product planning         | 22600000 | 1016
403        | education                | 93200000 | 1005
402        | software support         | 30800000 | 1011
401        | customer support         | 98230000 | 1003
201        | technical operations     | 29380000 | 1025
  • Foreign Keys (FK) are optional.
  • An entity may have more than one FK.
  • A FK may consist of more than one attribute.
  • FK values may be duplicated.
  • FK values may be null.
  • FK values may be changed.
  • FK values must exist elsewhere as a PK.
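
As a sketch of how these rules can be declared in SQL (the column names follow the sample tables above; the data types are assumptions):

CREATE TABLE Department
(
   Department_Number       INTEGER NOT NULL PRIMARY KEY,
   Department_Name         VARCHAR(30),
   Budget_Amount           DECIMAL(10,2),
   Manager_Employee_Number INTEGER
);

CREATE TABLE Employee
(
   Employee_Number         INTEGER NOT NULL PRIMARY KEY,
   Manager_Employee_Number INTEGER,
   Department_Number       INTEGER,
   Job_Code                INTEGER,
   Last_Name               CHAR(20),
   First_Name              VARCHAR(30),
   -- each FK value must exist as a PK value in the referenced table
   FOREIGN KEY (Department_Number) REFERENCES Department (Department_Number)
);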

Relational Advantages

      
Advantages of a Relational Database (physical and logical) compared to other database methodologies include:
  • Flexible
    Flexibility provides substantial benefits. The user does not need to know the access path to the data; the RDBMS keeps track of where everything is. Relational databases use atomic data - breaking data down into its smallest parts to allow maximum flexibility in selecting and using data. In addition, a single copy of the data can serve multiple purposes.

  • Responds quickly
    In a non-relational database, adding a field means that all programs that use the database must be rewritten to become aware of the new data structure. In a relational database, programs do not need to be re-written when a field is added.

  • Data-driven
    Relational databases are designed to represent a business and its practices - not the application or the computer system.

  • Business-oriented
    The two tables we have looked at, EMPLOYEE and DEPARTMENT, are organized to reflect the way the business really works.

  • Simple and easy to use and understand
    Simplicity is useful not only to the people who ask the questions, but also to the people who have to figure out how to retrieve information from the database. Understanding how an RDBMS functions is not necessary.

  • Easier to build applications
    Relational databases make the data do more work. Programs and transactions are simpler, which makes it easier to build applications.

  • Support the trend toward end-user computing
    The trend is moving away from organizations funneling all data requests through the IT people who know how the system works. As systems get easier to use, more people have access to them. This is called the "democratization of data."

  • Set Theory
    Set theory is the mathematical science of the infinite. It studies properties of sets, abstract objects that pervade the whole of modern mathematics. The language of set theory, in its simplicity, is sufficiently universal to formalize all mathematical concepts and thus set theory, along with predicate calculus, constitutes the true foundations of mathematics.

Evolution of Data Processing

      
Traditionally, data processing has been divided into two categories: On-Line Transaction Processing (OLTP) and Decision Support Systems (DSS). For either, requests are handled as transactions. A transaction is a logical unit of work, such as a request to update an account.

Traditional

A transaction is a logical unit of work.
  • On-Line Transaction Processing (OLTP)
    OLTP is typified by a small number of rows (or records) or a few of many possible tables being accessed in a matter of seconds or less. Very little I/O processing is required to complete the transaction. This type of transaction takes place when we take out money at an ATM. Once our card is validated, a debit transaction takes place against our current balance to reflect the amount of cash withdrawn. This type of transaction also takes place when we deposit money into a checking account and the balance gets updated. We expect these transactions to be performed quickly. They must occur in real time.

  • Decision Support Systems (DSS)
    Decision support systems include batch reports, which roll-up numbers to give business the big picture, and over time, have evolved. Instead of pre-written scripts, users now require the ability to do ad hoc queries, analysis, and predictive what-if type queries that are often complex and unpredictable in their processing. These types of questions are essential for long range, strategic planning. DSS systems often process huge volumes of detail data and rely less on summary data.

Today

  • On-line Analytical Processing (OLAP)
    OLAP is a modern form of analytic processing within a DSS environment. OLAP tools (e.g., from companies like MicroStrategy and Cognos) provide an easy to use Graphical User Interface to allow "slice and dice" analysis along multiple dimensions (e.g., products, locations, sales teams, inventories, etc.). With OLAP, the user may be looking for historical trends, sales rankings or seasonal inventory fluctuations for the entire corporation. Usually, this involves a lot of detail data to be retrieved, processed and analyzed. Therefore, response time can be in seconds or minutes.

  • Data Mining (DM)
    DM (predictive modeling) involves analyzing moderate to large amounts of detailed historical data to detect behavioral patterns (e.g., buying, attrition, or fraud patterns) that are used to predict future behavior. An "analytic model" is built from the historical data (Phase 1: minutes to hours) incorporating the detected patterns. The model is then run against current detail data ("scoring") to predict likely outcomes (Phase 2: seconds or less). Likely outcomes, for example, include scores on likelihood of purchasing a product, switching to a competitor, or being fraudulent.
Type: OLTP
Examples: Update a checking account to reflect a deposit. A debit transaction takes place against the current balance to reflect the amount of money withdrawn at an ATM.
Number of Rows Accessed: Small
Response Time: Seconds

Type: DSS
Examples: How many child-size blue jeans were sold across all of our Eastern stores in the month of March? What were the monthly shoe sales for retailer X?
Number of Rows Accessed: Large
Response Time: Seconds or minutes

Type: OLAP
Examples: Show the top ten selling items across all stores for 1997. Show a comparison of sales from this week to last week.
Number of Rows Accessed: Large number of detail rows or moderate number of summary rows
Response Time: Seconds or minutes

Type: Data Mining
Examples: Which customers are most likely to leave? Which customers are most likely to respond to this promotion?
Number of Rows Accessed: Moderate to large number of detailed historical rows
Response Time: Phase 1: Minutes or hours; Phase 2: Seconds or less


Row vs. Set Processing

      
Both cursor and set processing define a set (or sets) of rows of data to process; but while a cursor processes the rows sequentially, set processing operates on the entire set at once. Both can be processed with a single command.

What is Row-by-Row Processing?

Where there are many rows to process, one row is fetched at a time and all calculations are done on it, then it is updated or inserted. Then the next row is fetched and processed. This is row-by-row processing and it makes for a slow program.

A benefit of row processing is that there is less lock contention.

What is Set Processing?

A lot of data processing is set processing, which is what relational databases do best.

Instead of processing row-by-row sequentially, you can process relational data set-by-set, without a cursor. For example, to sum all payment rows with balances of $100 or less, a single SQL statement completely processes all rows that meet the condition as a set.

With sufficient rows to process, this can be 10 to 30 or more times faster than row-at-a-time processing.

Some characteristics of set processing include:

  • An update with all AMPs involved
  • Single session processing which takes advantage of parallel processing
  • Efficient updates of large amounts of data.
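
The payments example above can be written as single set-based statements, as in the sketch below (the Payments table and its column names are assumptions):

-- sums every qualifying row as one set, in parallel across the AMPs
SELECT SUM(payment_amount)
FROM Payments
WHERE balance_amount <= 100;

-- a set-based update touches all qualifying rows in one statement
UPDATE Payments
SET payment_status = 'PROCESSED'
WHERE balance_amount <= 100;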

Response Time vs. Throughput

      
When determining how fast something is, there are two kinds of measures. You can measure how long it takes to do something or you can measure how much gets done per unit time. The former is referred to as response time, access time, transmission time, or execution time depending on the context. The latter is referred to as throughput.

Response Time

This speed measure is specified by an elapsed time from the initiation of some activity until its completion. The phrase response time is often used in operating systems contexts.

Throughput

A throughput measure is an amount of something per unit time. For operating systems, throughput is often measured as tasks or transactions per unit time. For storage systems or networks, throughput is measured as bytes or bits per unit time. For processors, the number of instructions executed per unit time is an important component of performance.

What Does this Mean to Teradata?

Response Time                            | Throughput
-----------------------------------------|------------------------------------------------------------------
Measures the average duration of queries | Measures the quantity of queries completed during a time interval
A measure of process completion          | A measure of the amount of work processed
How long the processing takes            | How many queries were processed
The elapsed time per query               | The number of queries executed in an hour

In order to improve both response time and throughput on a Teradata system, you could increase CPU power (i.e., add nodes), implement workload management to control resources, and decrease the number of concurrent users.


The Advantage of Using Detail Data

      
Until recently, most business decisions were based on summary data.

The problem is that summarized data is not as useful as detail data and cannot answer some questions with accuracy. With summarized data, peaks and valleys are leveled when the peaks fall at the end of a reporting period and are cut in half, as shown in the example below.

Here's another example. Think of your monthly bank statement that records checking account activity. If it only told you the total amount of deposits and withdrawals, would you be able to tell if a certain check had cleared? To answer that question you need a list of every check received by your bank. You need detail data.

Decision support, answering business questions, is the real purpose of databases. To answer business questions, decision-makers must have four things:

  • The right data
  • Enough detail data
  • Proper data structure
  • Enough computer power to access and produce reports on the data

Strategic workloads tend to rely heavily on detail data.

Consider your own business and how it uses data. Is that data detailed or summarized? If it's summarized, are there questions it cannot answer?

QUESTION: How effective was the national advertisement for jeans that ran June 6 through June 8?

DETAIL DATA vs. SUMMARY DATA

  • Detail data gives a more accurate picture.
  • Correct business decisions result.

The Data Warehouse

      
A data warehouse is a central, enterprise-wide database that contains information extracted from the operational systems (this is where the saying, "Load once, use many" comes from). A data warehouse has a centrally located logical architecture, using a single data store, which minimizes data synchronization, utilizes less space, and provides a single version of the truth.

Data warehouses have become more common in corporations where enterprise-wide detail data may be used in on-line analytical processing to make strategic and tactical business decisions. Warehouses often carry many years' worth of detail data so that historical trends may be analyzed using the full power of the data.

Many data warehouses get their data directly from operational systems so that the data is timely and accurate, providing less disparate results. While data warehouses may begin somewhat small in scope and purpose, they often grow quite large as their utility becomes more fully exploited by the enterprise.

The data warehouse carries a massive volume of detail data, so it is well suited to the parallel architecture of the Teradata database. Because the Teradata database has traditionally specialized in large-scale, decision-support queries against huge volumes of detail data, its use for data warehousing becomes an obvious solution.

The Teradata database can run in single-node environments with a database of a terabyte or less. As requirements grow, the single node can easily be expanded to multiple nodes with the Teradata database's linear scalability. In the multi-node environment, the Teradata database can scale upward to systems supporting more than 50 petabytes of data and process hundreds of concurrent sessions. With the power of multiple CPUs propelling multiple AMPs over multiple disks, getting answers from the detail data in the warehouse becomes fast and efficient.

The Teradata database supports normalized logical models because it can perform 128 joins in a single query as well as large aggregations during very complex queries using its robust parser and optimizer. A key Teradata database strength is the ability to model the customer's business. With a business model, data is organized to what it represents. The Teradata database can do Star Schema and other forms of relational modeling - but Third Normal Form is the methodology we recommend to our customers. The Teradata database also has:

  • A Priority Scheduler and workload management tools that prioritize resources among competing applications based on importance
  • A powerful suite of load and maintenance utilities
  • Numerous third-party strategic and alliance partners

A central, enterprise-wide database that contains information extracted from operational systems.

  • Based on enterprise-wide model
  • Can begin small but may grow large rapidly
  • Populated by extraction/loading of data from operational systems
  • Responds to users "what if" queries
  • Minimizes data movement/ synchronization
  • Provides a single view of the business

Data Marts

      
A data mart is a special-purpose subset of data used by a particular department, function, or application. Data marts may have both summary and detail data for a particular use rather than for general use. Usually the data has been pre-aggregated or transformed in some way to better handle the particular type of requests of a specific user community. Some reasons data marts are implemented are that a department wants to own its own data, a quick ROI is required, and, in some cases, canned queries may offer users better performance.

Data Mart Types

  • Independent Data Marts
    Independent data marts are created directly from operational systems, just as a data warehouse is. In the data mart, the data is usually transformed as part of the load process. Data might be aggregated, dimensionalized or summarized historically, as the requirements of the data mart dictate.

  • Logical Data Marts
    Logical data marts are not separate physical structures or a data load from a data warehouse, but rather are an existing part of the data warehouse. Because in theory the data warehouse contains the detail data of the entire enterprise, a logical view of the warehouse might provide the specific information for a given user community, much as a physical data mart would. Without the proper technology, a logical data mart can be a slow and frustrating experience for end users. With the proper technology (Teradata), it removes the need for massive data loading and transformation, making a single data store available for all user needs.

  • Dependent Data Marts
    Dependent data marts are created from the detail data in the data warehouse. While having many of the advantages of the logical data mart, this approach still requires the movement and transformation of data but may provide a better vehicle for some performance-critical user queries.

Data Models - Enterprise vs. Application

To build an EDW, an enterprise data model should be leveraged. An enterprise data model serves as a neutral data model that is normalized to address all business areas and not specific to any function or group, whereas an application model is built for a specific business area. The application data model only looks at one aspect of the business whereas an enterprise logical data model integrates all aspects of the business.

In addition, an enterprise data model is more extensible than an application data model. It is intended to encompass the entire enterprise.

Teradata Functional Overview

      
The Parsing Engine is responsible for:
  • Managing individual sessions (up to 120 sessions per PE)
  • Parsing and optimizing your SQL requests
  • Building query plans with the parallel-aware, cost-based, intelligent Optimizer
  • Dispatching the optimized plan to the AMPs
  • EBCDIC/ASCII input conversion (if necessary)
  • Sending the answer set response back to the requesting client

The Parsing Engine (PE) is a virtual processor (vproc). It is made up of the following software components: Session Control, the Parser, the Optimizer, and the Dispatcher.

Once a valid session has been established, the PE is the component that manages the dialogue between the client application and the RDBMS. Each PE can handle up to 120 sessions, and each session can handle multiple requests.

Session Control

The major functions performed by Session Control are logon and logoff. Logon takes a textual request for session authorization, verifies it, and returns a yes or no answer. Logoff terminates any ongoing activity and deletes the session's context.

Parsing

The Parser interprets SQL statements, checks them for proper SQL syntax and evaluates them semantically. The Parser also consults the Data Dictionary to ensure that all objects and columns exist and that the user has authority to access these objects.

Optimizing

The Optimizer is responsible for developing the least expensive plan to return the requested response set. Processing alternatives are evaluated and the fastest plan is chosen. This plan is converted to executable steps, to be performed by the AMPs, which are then passed to the dispatcher. In order to maximize throughput and minimize resource contention, the optimizer must know about system configuration, available units of parallelism (number of AMPs and PE's), and data demographics of the tables in the query. The Teradata database Optimizer is robust and intelligent. The optimizer enables the Teradata database to handle multiple complex, ad hoc queries efficiently. It is parallel-aware and cost-based.

Dispatching

The Dispatcher controls the sequence in which the steps are executed and passes the steps on to the AMPs via the BYNET. It is composed of execution-control and response-control tasks. Execution control receives the step definitions from the Parser and transmits them to the appropriate AMP(s) for processing, receives status reports from the AMPs as they process the steps, and passes the results on to response control once the AMPs have completed processing. Response control returns the results to the user. The Dispatcher sees that all AMPs have finished a step before the next step is dispatched. Depending on the nature of the SQL request, a step will be sent to one AMP, or broadcast to all AMPs.

Note: By default, Teradata Gateway software can support up to 1200 sessions per processing node. Therefore a maximum of 10 Parsing Engines can be defined for a node using the Gateway. It is possible to configure a node to support more than 1200 sessions.

Major Components of the Teradata Database

      
We have looked at the overall node, and now we will describe the components that make up a node in detail.

Parallel Database Extensions (PDE)

The Parallel Database Extensions is a software interface layer that lies between the operating system and database. It controls the virtual processors (vprocs).

Parsing Engine (PE)

The Parsing Engine is a virtual processor (vproc) that interprets SQL requests, receives input records, and passes data. To do that, it sends the messages over the BYNET to the AMPs (Access Module Processor).

BYNET

The BYNET is the message-passing layer. It determines which AMP(s) should receive a message.

Access Module Processor (AMP)

The AMP is a virtual processor (vproc) designed for and dedicated to managing a portion of the entire database. It performs all database management functions such as sorting, aggregating, and formatting data. The AMP receives data from the PE, formats rows, and distributes them to the disk storage units it controls. The AMP also retrieves the rows requested by the Parsing Engine.

Disks

Disks are disk drives associated with an AMP that store the data rows. On current systems, they are implemented using a disk array.

What Happens When a Query is Submitted?

  1. The Parser evaluates the SQL for proper syntax.
  2. The Optimizer develops the least expensive plan.
  3. The BYNET passes the steps from the Dispatcher to the AMPs.


The Parsing Engine (PE)

      
Parsing Engine (PE) is a virtual processor (vproc). It is made up of the following software components: Session Control, the Parser, the Optimizer, and the Dispatcher.

Once a valid session has been established, the PE is the component that manages the dialogue between the client application and the RDBMS. Each PE can handle up to 120 sessions, and each session can handle multiple requests.

Session Control

The major functions performed by Session Control are logon and logoff. Logon takes a textual request for session authorization, verifies it, and returns a yes or no answer. Logoff terminates any ongoing activity and deletes the session's context.

Parsing

The Parser interprets SQL statements, checks them for proper SQL syntax and evaluates them semantically. The Parser also consults the Data Dictionary to ensure that all objects and columns exist and that the user has authority to access these objects.

Optimizing

The Optimizer is responsible for developing the least expensive plan to return the requested response set. Processing alternatives are evaluated and the fastest plan is chosen. This plan is converted into executable steps, which are then passed to the Dispatcher and performed by the AMPs. In order to maximize throughput and minimize resource contention, the Optimizer must know the system configuration, the available units of parallelism (number of AMPs and PEs), and the data demographics of the tables in the query. The Teradata Database Optimizer is robust and intelligent; it enables the Teradata database to handle multiple complex, ad hoc queries efficiently. It is parallel-aware and cost-based.
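
You can see the plan the Optimizer builds by prefixing a request with the EXPLAIN modifier. The example below is only an illustration (the Orders table is the sample table used later in this document):

EXPLAIN
SELECT   Order_Status
        ,COUNT(*)
FROM     Orders
GROUP BY Order_Status;

-- Returns the step-by-step plan (locks, scans, aggregation, spool usage)
-- that the Dispatcher will send to the AMPs, without executing the query.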

Dispatching

The Dispatcher controls the sequence in which the steps are executed and passes the steps on to the AMPs via the BYNET. It is composed of execution-control and response-control tasks. Execution control receives the step definitions from the Parser and transmits them to the appropriate AMP(s) for processing, receives status reports from the AMPs as they process the steps, and passes the results on to response control once the AMPs have completed processing. Response control returns the results to the user. The Dispatcher sees that all AMPs have finished a step before the next step is dispatched. Depending on the nature of the SQL request, a step will be sent to one AMP, or broadcast to all AMPs.

Note: By default, Teradata Gateway software can support up to 1200 sessions per processing node. Therefore a maximum of 10 Parsing Engines can be defined for a node using the Gateway. It is possible to configure a node to support more than 1200 sessions.

The Parsing Engine is responsible for:

  • Managing individual sessions (up to 120 sessions per PE)
  • Parsing and optimizing your SQL requests
  • Building query plans with the parallel-aware, cost-based, intelligent Optimizer
  • Dispatching the optimized plan to the AMPs
  • EBCDIC/ASCII input conversion (if necessary)
  • Sending the answer set response back to the requesting client


The BYNET

      
The BYNET (pronounced, "bye-net") is a high-speed interconnect (network) that enables multiple nodes in the system to communicate.

The BYNET handles the internal communication of the Teradata database. All communication between PEs and AMPs is done via the BYNET.

When the PE dispatches the steps for the AMPs to perform, they are dispatched onto the BYNET. Messages are routed to the appropriate AMP(s) where results sets and status information are generated. This response information is also routed back to the requesting PE via the BYNET.

Depending on the nature of the dispatch request, the communication between nodes may be a:

  • Broadcast - message is routed to all nodes in the system.
  • Point-to-point - message is routed to one specific node in the system.

Once the message is on a participating node, PDE directs the message to the appropriate AMPs on that node. All AMPs receive a broadcast message. With a point-to-point or multicast (multiple AMPs) message, the message is directed only to the appropriate AMP(s) on that node by PDE.

So, while a Teradata database system does do multicast messaging, the BYNET hardware alone cannot do it - the BYNET can only do point-to-point and broadcast between nodes.

The BYNET has several unique features:

  • Fault tolerant: each network has multiple connection paths. If the BYNET detects an unusable path in either network, it will automatically reconfigure that network so all messages avoid the unusable path. Additionally, in the rare case that BYNET 0 cannot be reconfigured, hardware on BYNET 0 is disabled and messages are re-routed to BYNET 1, and vice versa.

  • Load balanced: traffic is automatically and dynamically distributed between both BYNETs.

  • Scalable: as you add nodes to the system, overall network bandwidth scales linearly - meaning an increase in system size without loss of performance.

  • High Performance: an MPP system has two BYNET networks providing hardware redundancy. Because both networks are active, the system benefits from the full aggregate bandwidth.

The technology of the BYNET is what makes the scalable parallelism of the Teradata database possible.

BYNET merges final answer set

Dual redundant, fault-tolerant, bi-directional interconnect network that enables:
  • Automatic load balancing of message traffic
  • Automatic reconfiguration after fault detection
  • Scalable bandwidth as nodes are added

The BYNET connects and communicates with all the AMPs and PEs in the system:

  • Between nodes, the BYNET hardware carries broadcast and point-to-point communications
  • On a node, BYNET software and PDE together control which AMPs receive a multicast communication
  • Merges the final answer set


The Access Module Processor (AMP)

      
The Access Module Processor (AMP) is the virtual processor (vproc) in the architecture that is responsible for managing a portion of the database. Each AMP will manage some portion of each table on the system. AMPs do the physical work associated with generating an answer set, including sorting, aggregating, formatting, and converting. The AMPs retrieve and perform all database management functions on the requested rows from a table.

An AMP accesses data from its single associated vdisk, which is made up of multiple ranks of disks. An AMP responds to Parser/Optimizer steps transmitted across the BYNET by selecting data from or storing data to its disks. For some requests, the AMPs may redistribute a copy of the data to other AMPs.

The Database Manager subsystem resides on each AMP. This subsystem will:

  • Lock databases and tables.
  • Create, modify, or delete definitions of tables.
  • Insert, delete, or modify rows within the tables.
  • Retrieve information from definitions and tables.
  • Return responses to the Dispatcher.

Earlier in this course, we discussed the logical organization of data into tables. The Database Manager subsystem provides a bridge between that logical organization and the physical organization of the data on disks. The Database Manager performs space-management functions that control the use and allocation of space.

The Teradata database performs all tasks in parallel, providing exceptional performance. The greater the number of tasks processed in parallel, the better the system performance. Many databases call themselves "parallel," but they can only perform some tasks in parallel.

Shared Nothing Architecture

The Teradata database virtual processors, or vprocs (which are the PEs and AMPs), share the components of the nodes (memory and cpu). The main component of the "shared-nothing" architecture is that each AMP manages its own dedicated portion of the system's disk space (called the vdisk) and this space is not shared with other AMPs. Each AMP uses system resources independently of the other AMPs so they can all work in parallel for high system performance overall.

AMPs perform all tasks in parallel

The AMP is responsible for:
  • Storing rows to and retrieving rows from its vdisk
  • Lock management
  • Sorting rows and aggregating columns
  • Join processing
  • Output conversion and formatting (ASCII, EBCDIC)
  • Creating answer sets for clients
  • Disk space management and accounting
  • Special utility protocols
  • Recovery processing

What is a Node?

      
Up to this point, we have discussed relational databases in terms of how the user sees them - as a collection of tables that relate to one another. Now we'll look at the physical components of the system.

A node is made up of various hardware and software components. All applications run under Linux or Windows, and all the Teradata database software runs under Parallel Database Extensions (PDE). All share the resources of CPU and memory on the node.

Access Module Processors (AMPs) and Parsing Engines (PEs) are virtual processors (vprocs) running under control of the PDE. Their numbers are software configurable.

AMPs are associated with virtual disks (vdisks), which are configured as ranks of a disk array.

The "Shared Nothing" architecture of the Teradata database means that each vproc is responsible for its own portion of the database, and for all activities to be performed on that part of the system.

All AMPs and PEs communicate via the BYNET, which is a message passing system. In an SMP (Symmetric Multi-Processing) system, it is implemented as a BYNET driver, and in MPP (Massively Parallel Processing) systems, it is implemented as a software and hardware solution. The BYNET allows multiple vprocs on multiple nodes to communicate with each other. PDE actually controls message-passing activity, while the BYNET handles message queuing and flow control.

An application that runs under the control of PDE, such as the Teradata database, is considered a Trusted Parallel Application (TPA).

Note: The versatility of the Teradata database is based on virtual processors (vprocs) that eliminate dependency on specialized physical processors. These vprocs are a set of software processes that run on a node under Parallel Database Extensions (PDE) and the multitasking environment of the operating system.

In a massively parallel processing (MPP) system, there is hardware redundancy among both the multiple nodes and the BYNET.



Parallelism

      
Parallelism is at the very heart of the Teradata database. There is virtually no part of the system where parallelism has not been built in. Without the parallelism of the system, managing enormous amounts of data would either not be possible or, best case, would be prohibitively expensive and inefficient.

Each PE can support up to 120 user sessions in parallel. This could be 120 distinct users, or a single user harnessing the power of all 120 sessions for a single application. Each session may handle multiple requests concurrently. While only one request at a time may be active on behalf of a session, the session itself can manage the activities of 16 requests and their associated answer sets.

The Message Passing Layer was designed such that it can never be a bottleneck for the system. Because the MPL is implemented differently for different platforms, this means that it will always be well within the needed bandwidth for each particular platform's maximum throughput.

Each AMP can perform up to 80 tasks (or more) in parallel. This means that AMPs are not dedicated at any moment in time to the servicing of only one request, but rather are multi-threading multiple requests concurrently. The value 80 represents the number of AMP Worker Tasks and may be changed on some systems. Because AMPs are designed to operate on only one portion of the database, they must operate in parallel to accomplish their intended results.

In truth, parallelism is built into the Teradata database from the ground up!

Notes:

  • Each PE can handle up to 120 sessions in parallel.
  • Each Session can handle multiple REQUESTS.
  • The Message Passing Layer can handle all message activity in parallel.
  • Each AMP can perform up to 80 tasks in parallel (can be configured for more).
  • All AMPs can work together in parallel to service any request.
  • Each AMP can work on several requests in parallel.

Teradata Compression Comparison

      

Multi-value Compression

Multi-value compression has three variations:
  • COMPRESS – Nulls are compressed.
  • COMPRESS NULL – Nulls are compressed. (COMPRESS & COMPRESS NULL mean the same thing)
  • COMPRESS <constant> – Nulls and the specified <constant> value are compressed.
With algorithmic compression, you must specify a UDF for compression and a UDF for decompression.

Example:

CREATE TABLE bank_accounts
 (customer_id INTEGER
 ,primary_bank INTEGER COMPRESS -- only NULL is compressed
 ,account_type CHAR(10) COMPRESS 'Savings' -- both NULL and 'Savings' are compressed
 ,account_notes VARCHAR(500) CHARACTER SET UNICODE
	COMPRESS USING TD_SYSFNLIB.TransUnicodeToUTF8
	DECOMPRESS USING TD_SYSFNLIB.TransUTF8ToUnicode -- algorithmic compression
 . . . );

Multi-value Compression vs. VARCHAR

For character-based data, an alternative to Teradata compression is the VARCHAR(N) data type. The number of bytes used to store each field is the length of the data item plus two bytes. Contrast this with a fixed-length CHAR(N) data type that takes N bytes per row, regardless of the actual number of characters in each field. Combining Teradata compression with a fixed-length character data type can be a very effective space saver. MVC can compress up to 255 distinct values in a column.

Data demographics can help determine whether variable-length character data type or fixed length plus compression is more efficient. The most important factors are the maximum field length, the average field length, and the compressibility of the data.
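
As a rough sketch of the trade-off (the table, column names, and city values are illustrative, not from the course material):

CREATE TABLE customer_address
 (cust_id    INTEGER
 ,city_fixed CHAR(20) COMPRESS ('Chicago', 'New York', 'Los Angeles')  -- fixed width + MVC for frequent values
 ,city_var   VARCHAR(20));                                             -- actual length + 2 bytes per value, no compression

Whether the fixed-length-plus-compression column or the VARCHAR column uses less space depends on the demographics described above.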

Algorithmic Compression

When column values are mostly unique, Algorithmic Compression can provide better compression results than Multi-value Compression. You can define your own compression and decompression algorithms and apply them to data at the column level.

A user can create their own compression algorithms as external C/C++ scalar UDFs, and then specify them in the column definition of a CREATE TABLE/ALTER TABLE statement. Teradata invokes these algorithms to compress and decompress the data when the data is inserted into or retrieved from the table.

ALC allows you to implement the compression scheme that is most suitable for data in a particular column. The cost of compression and decompression depends on the algorithm chosen.

Block Level Compression

Performs compression on whole data blocks at the file system level before they are written to storage.

  • BLC will compress/decompress only data blocks but will not be used on any file system structures such as Master/Cylinder indexes, the WAL log and table headers.
  • Only primary data subtable and fallback copies can be compressed. Both objects are either compressed or neither is compressed.
  • Secondary Indexes (USI/NUSI) cannot be compressed but Join Indexes can be compressed since they are effectively user tables.
  • Only the compressed form of the data block is cached, so each block access must decompress the data block.
  • On initial delivery, a single compression algorithm will be supplied and used.
  • Spool data blocks can be compressed via a system-wide tunable, as well as Temporary, WORK (sort Work space), and permanent journal.
  • Once enabled, reversion to an earlier release for compressed tables isn't allowed.
  • Teradata 14.0 added Temperature Based Block Level Compression (TBBLC) to enable greater compression flexibility while minimizing the potential performance impact.
Multi-value Compression (MVC) vs. Algorithmic Compression (ALC) vs. Block Level Compression (BLC / TBBLC)

Ease of Use
  • MVC: Easy to apply to well-understood data columns and values
  • ALC: Easy to apply on a column with CREATE TABLE
  • BLC: Set once and forget

Analysis Required
  • MVC: Need to analyze data for common values
  • ALC: Use Teradata algorithms or user-defined compression algorithms to match unique data patterns
  • BLC: Need to analyze the CPU overhead trade-off (can be turned on for all data on the system or applied on a per-table basis)

Flexibility
  • MVC: Works for a wide variety of data and situations
  • ALC: Automatically invoked for values not replaced by MVC
  • BLC: Automatically combined with other compression mechanisms

Performance Impact
  • MVC: No or minimal CPU usage
  • ALC: Depends on the compression algorithm used
  • BLC: Reduced I/O due to compressed data blocks (CPU cycles are used to compress/decompress)

Applicability
  • MVC: Replaces common values
  • ALC: Industry data, UNICODE, Latin data
  • BLC: All data

A customer can choose any combination or all three on a column/table.


Layers of the Data Warehouse Architecture

      
There are several "layers" that make up the EDW environment. These layers include:

  • Staging Layer
    The primary purpose of the staging layer is to perform data transformation and loading during either the ETL or ELT process.

  • Access Layer
    Access is often provided via views and business intelligence (BI) tools, whether a Teradata application or a third-party tool.

  • Physical Layer
    The physical layer is where denormalizations that make access more efficient occur: pre-aggregations, summary tables, join indexes, etc.

  • Data Lab
    This is Teradata's implementation of a controlled "sandbox" environment. It enables rapid loading and analysis of new data combined with existing warehouse data. It quickly assimilates untested data into a "non-production" portion of the warehouse, providing a self-provisioning, self-service environment for prototyping and analysis on new, un-cleansed data, which is temporary in nature.

    Data Lab is composed of two Viewpoint portlets.

Teradata Data Lab - Interface

Teradata Database Objects

      
A "database" or "user" in Teradata database systems is a collection of objects such as tables, views, macros, triggers, stored procedures, user-defined functionsor indexes (join and hash). Database objects are created and accessed using standard Structured Query Language or SQL. Starting with Teradata Release 14.10, extended object names feature, allows object names to be up to 128 characters where prior it was a 30-byte limit.

All database object definitions are stored in a system database called the Data Dictionary/Directory (DD/D).

Databases provide a logical grouping for information. They are also the foundation for space allocation and access control.

Tables

A table is the logical structure of data in a relational database. It is a two-dimensional structure made up of columns and rows.

Views

A view is a pre-defined subset of one or more tables or other views. It does not exist as a real table, but serves as a reference to existing tables or views. One way to think of a view is as a virtual table. Views have definitions in the data dictionary, but do not contain any physical rows. Views can be used to implement security and control access to the underlying tables. They can hide columns from users, insulate applications from changes, and simplify or standardize access techniques. They provide a mechanism to include a locking modifier. Views provide well-defined and tested high-performance access to data. X-Views are special views that limit the display of data to the user who accesses them.
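
A minimal sketch of a view over the Customer table used later in this document, hiding one column and adding a locking modifier (the view name is illustrative):

CREATE VIEW Customer_V AS
LOCKING ROW FOR ACCESS        -- locking modifier included in the view
SELECT Cust
      ,Name                   -- Phone is not exposed through the view
FROM   Customer;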

Macros

A macro is a predefined, stored set of one or more SQL commands and optionally, report formatting commands. Macros are used to simplify the execution of frequently used SQL commands. Macros may also be used to limit user access to data.
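
A minimal sketch of a macro, assuming the Customer table used later in this document (the macro and parameter names are illustrative):

CREATE MACRO Get_Customer (cust_no INTEGER) AS
 (SELECT Cust, Name, Phone
  FROM   Customer
  WHERE  Cust = :cust_no;);   -- parameter is referenced with a colon

EXEC Get_Customer (45);       -- executes the stored SQL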

Triggers

A trigger is a set of SQL statements usually associated with a table or column; when that column (or the table) changes, the trigger is fired, effectively executing the SQL statements.

Stored Procedures

A stored procedure is a program that is stored within Teradata and executes within the Teradata database. A stored procedure uses permanent disk space and may be used to limit a user's access to data.

User Defined Functions (UDF)

A User-Defined Function (UDF) allows authorized users to write external functions. Teradata allows users to create scalar functions to return single value results, aggregate functions to return summary results, and table functions to return tables. UDFs may be used to protect sensitive data such as personally identifiable information (PII), even from the DBA(s).

X-Views limit the display of results to the user accessing them.

Teradata Database Space

      
There are three types of space within the Teradata database:

  1. Perm Space
    All databases and users have a defined upper limit of Permanent Space. Permanent Space is used for storing the data rows of tables. Perm Space is not pre-allocated. It represents a maximum limit.

  2. Spool Space
    All databases also have an upper limit of Spool Space. If there is no limit defined for a particular database or user, limits are inherited from parents. Theoretically, a user could use all of the unallocated space in the system for their query. Spool Space is unused Perm Space that is used to hold intermediate query results or formatted answer sets for queries. Once the query is complete, the Spool Space is released.

    Example: You have a database with total disk space of 100GB. You have 10GB of user data and an additional 10GB of overhead. What is the maximum amount of Spool Space available for queries?

    Answer: 80GB. All of the remaining space in the system is available for spool.

  3. Temp Space
    The third type of space is Temporary Space. Temp Space is used for global temporary tables, and these results remain available to the user until their session is terminated. Tables created in Temp Space will survive a restart. Temp Space is permanent space currently not used.

Space Management

      
The way the Teradata database handles space management is different from other database implementations.

The initial Teradata database system comes with several users and databases:

  • Users: DBC, SysAdmin, SystemFE, TDPuser
  • Databases: Crashdumps, Default, All, Public

Initially, most of the space in the system belongs to user DBC, with some allocated to each of the users or databases listed above. Database DBC contains all of the RDBMS software components and system tables.

Before defining application users and databases, the Database Administrator should first create a special administrative user and assign most of the space in the system to that user. This space comes from user DBC. User DBC becomes the owner of the administrative user.

There should be enough space left in user DBC to accommodate the growth of system tables and logs.

As the administrative user creates additional users and databases, space assigned to those objects will be subtracted from the administrative user's space. As these users and databases create subordinate objects, they will give up some of their space to these new users and databases.

The clear boxes represent all the space available in the system. The shaded boxes represent space that is being used in the system by tables. Space that is not currently being used by tables is available for use as spool or temporary space. If a table is dropped, that space becomes available to the system again.

Database 1 and Database 2 contain the production tables. Database 3 contains views and macros to control access to the data by end users and does not consume any database space. User D is an example of an end user who has no Perm Space but has been granted rights on certain views and macros to accomplish work. This user can still execute queries and create views or macros. User A, User B, and User C are examples of application developers who need some Perm Space to hold sample data for program testing and validation.

Most database vendors do not handle space this way. Once space is allocated to a table, it cannot be made available again without the Database Administrator having to perform a re-organization and re-allocate the space and partition the data.

  • A new database or user must be created from an existing database or user.
  • All Perm Space limits are subtracted from the owner.
  • Perm Space is a zero-sum game - the total of all Perm Space limits must equal the total amount of disk space available.
  • Perm Space currently not being used is available as Spool Space or Temp Space.
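
A hedged sketch of how an administrative user and a database might be created with explicit space limits (names and sizes are illustrative):

CREATE USER Sysdba FROM DBC AS
  PERM      = 500e9           -- subtracted from DBC's Perm Space limit
 ,SPOOL     = 200e9           -- upper limit for intermediate results
 ,TEMPORARY = 100e9           -- upper limit for global temporary tables
 ,PASSWORD  = temp_pwd;

CREATE DATABASE Database_1 FROM Sysdba AS
  PERM = 100e9;               -- subtracted from Sysdba's Perm Space limit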

How Does the Teradata Database Distribute Rows?

      
Each AMP is designed to hold a portion of the rows of each table. An AMP is responsible for the storage, maintenance and retrieval of the data under its control. The Teradata database uses hashing to dynamically distribute data across all AMPs.

The Teradata database's automatic hash distribution eliminates costly data maintenance tasks. There is no specific order to the placement of the data. The benefits of having unordered data are that they don't need any maintenance to preserve order, and they are independent of any query being submitted. There is never a need to reorganize data or index space which makes the database easier to manage and maintain. As a result, strategic business data is more accessible to the users.

The benefits of automatic data placement include:

  • Distribution is the same regardless of data volume.
  • Distribution is based on row content, not data demographics.
  • Rows are moved or re-blocked automatically.

Ideally, the rows of every table will be distributed among all of the AMPs. There may be some circumstances where this is not true. For example, what if there are fewer rows than AMPs? In this case, at least some AMPs will hold no rows from that table. (This example should be considered the exception, and not the rule.)

Ideally, the rows of each table will be evenly distributed across all of the AMPs. Even distribution is desirable because in an operation involving all rows of the table (such as a full-table scan), each AMP will have an equal portion of work to do. When workloads are not evenly distributed, the desired response can only be as fast as the slowest AMP.

Remember: The Teradata database uses a Shared-Nothing Architecture - each AMP is responsible for its own portion of the database and does not share it with any other AMP.

Data Management

Many other databases require you to manually partition the data. Teradata automates the placement and maintenance of physical data location.

With the Teradata database, your DBA can spend more time with users developing strategic applications to beat your competition.

Loading, inserting, updating, and deleting data affect manual data distribution schemes, thereby reducing query performance and requiring reorganization.

Primary Keys and Primary Indexes

      
While it is true that many tables use the same columns for both Primary Indexes and Primary Keys, Indexes are conceptually different from Keys. The table on this slide summarizes those differences.

A Primary Key is a relational data modeling term that defines, in the logical model, the columns that uniquely identify a row. A Primary Index is a physical database implementation term that defines the actual columns used to distribute and access rows in a table.

It is also true that a significant percentage of the tables in any database will use the same column(s) for both the PI and the PK. However, one should expect that in any real-world scenario there would be some tables that will not conform to this simplistic rule. Only through a careful analysis of the type of processing that will take place can the tables be properly evaluated for PI candidates. Remember, changing your mind about the columns that comprise the PI means recreating (and reloading) the table.

  • Indexes are conceptually different from keys.
  • A PK is a relational modeling convention which allows each row to be uniquely identified.
  • A PI is a Teradata convention which determines how the row will be stored and accessed.
  • A significant percentage of tables MAY use the same columns for both the PK and the PI.
  • A well-designed database WILL use a PI that is different from the PK for some tables.

Primary Key (PK) vs. Primary Index (PI)
  • PK: Logical concept of data modeling | PI: Physical mechanism for access and storage
  • PK: Teradata does not need to recognize the PK | PI: Each table can have (at most) one primary index
  • PK: No limit on number of columns | PI: 64-column limit
  • PK: Documented in the data model (optional in CREATE TABLE) | PI: Defined in the CREATE TABLE statement
  • PK: Must be unique | PI: May be unique or non-unique
  • PK: Identifies each row | PI: Identifies 1 (UPI) or multiple rows (NUPI)
  • PK: Values should not change | PI: Values may be changed (Delete + Insert)
  • PK: May not be NULL - requires a value | PI: May be NULL
  • PK: Does not imply an access path | PI: Defines most efficient access path
  • PK: Chosen for logical correctness | PI: Chosen for physical performance



Primary Key (PK) vs. Primary Index (PI)

      
Primary Indexes are conceptually different from Primary Keys.

  • Primary Key (PK) is a relational data-modeling term that defines the attribute(s) used to uniquely identify a tuple in an entity.

  • Primary Index (PI) is a physical database implementation term that defines the column(s) used to distribute the rows in a table. The Primary Index column(s) value may be unique or non-unique.

  • Unique Primary Index (UPI) the values in the Primary Index column(s) are unique which results in even row distribution and the elimination of duplicate row checks during data load operations.

  • Non-Unique Primary Index (NUPI) the values in the Primary Index column(s) are non-unique which results in potentially skewed row distribution - all duplicate Primary Index values go to the same AMP.

Accessing a row by its Primary Index value is the most efficient way to access a row and is always a one-AMP operation. Choosing a Primary Index is not an exact science. It requires analysis and thought for some tables and will be completely self-evident on others. Sometimes the obvious choice will be the Primary Key, which is known to be unique. Sometimes the choice of Primary Index may have to do with join performance and known access paths, and will therefore be a different choice than the Primary Key of its source entity.

Example: PK vs. PI. An insurance claims table may contain the following columns: Claim ID, Date, Insurance Provider ID, Patient ID, and Claim Amount. The Primary Key is defined as Claim ID and Date. The same claim number may be submitted on one date and then resubmitted on another date, so you need both columns to look up a claim and its details. Other users may often be analyzing provider fraud behavior and need to access the claim by the Provider ID. In this case, the Primary Index may be the Provider ID column. Even if data distribution may be skewed, a choice like this might be made to provide efficient access to a table.
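
A hedged sketch of how such a claims table might be defined, with the Primary Key carried as a constraint and a different column chosen as the Primary Index (column names and data types are assumptions):

CREATE TABLE Claims
 (Claim_Id     INTEGER NOT NULL
 ,Claim_Date   DATE    NOT NULL
 ,Provider_Id  INTEGER
 ,Patient_Id   INTEGER
 ,Claim_Amount DECIMAL(10,2)
 ,PRIMARY KEY (Claim_Id, Claim_Date))   -- enforced by Teradata as a unique index
PRIMARY INDEX (Provider_Id);            -- NUPI chosen for provider-based access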

Primary Key (PK) vs. Primary Index (PI)
  • PK: Logical concept of data modeling | PI: Mechanism for row distribution and access
  • PK: Teradata does not need the PK defined | PI: A table can have one (or none) Primary Index
  • PK: No limit on the number of attributes | PI: May be from 1 to 64 columns
  • PK: Documented in the logical data model | PI: Defined in the CREATE TABLE statement
  • PK: Must be unique | PI: Value may be unique or non-unique
  • PK: Unique identifier | PI: Used to place a row on an AMP
  • PK: Value should not change | PI: Value may be changed (Updated)
  • PK: May not be NULL | PI: May be NULL
  • PK: Does not imply access path | PI: Defines the most efficient access path
  • PK: Chosen for logical correctness | PI: Chosen for physical performance


Primary Indexes

      
With the Teradata database, the Primary Index (PI) is the physical mechanism for assigning a data row to an AMP and a location on its disks. Indexes are also used to access rows from a table without having to search the entire table. A table may have a Primary Index. The Primary Index of the table cannot be changed (i.e., altered, added, or dropped).

Choosing a Primary Index for a table is perhaps the most critical decision a database designer makes. The choice will affect the distribution of the rows in the table and the performance of the table in a production environment. Although many tables use combined columns as the Primary Index choice, the examples we use here are single-column indexes.

There are two types of primary indexes: unique (UPI) and non-unique (NUPI).

Unique Primary Index (UPI) is a column that has no duplicate values. UPIs are desirable because they guarantee even distribution of table rows. With a UPI, there is no duplicate row checking done during a load, which makes it a faster operation.

Because it is not always feasible to pick a Unique Primary Index, it is sometimes necessary to pick a column (or columns) which have non-unique, or duplicate values. This type of index is called a Non-Unique Primary Index or NUPI. A NUPI allows duplicate values as well as access to the non-unique columns. A NUPI can cause skewed data. While not a guarantor of uniform row distribution, the degree of uniqueness of the index will determine the degree of uniformity of the distribution. Because all rows with the same PI value end up on the same AMP, columns with a small number of distinct values that are repeated frequently do not make good PI candidates.

Accessing the row by its Primary Index value is the most efficient way to access a row and is always a one-AMP operation. Choosing a Primary Index is not an exact science. It requires analysis and thought for some tables and will be completely self-evident on others. Sometimes the obvious choice will be the Primary Key, which is known to be unique. Sometimes the choice of Primary Index may have to do with join performance (enabling AMP-local joins to other tables) and known access paths (allowing local joins to similar entities), and will therefore be a different choice than the Primary Key of the table.

The Teradata database is unique in hashing data directly to a physical address on disk, creating an even distribution of data. This allows a balanced application of parallelism, and avoids imbalance due to data skew. Rows are hashed based on the Primary Index.

Defining the Primary Index

      
When a table is created, the Primary Index should be specified. The Primary Index may consist of a single column or a combination up to 64 columns. Columns are assigned data types and may optionally be assigned attributes and constraints.

Once you choose the Primary Index for a table, it cannot be changed. The existing table must be dropped and a new table created to change the Primary Index definition.

Note: While you cannot change the Primary Index itself, values in a Primary Index column may be changed. The Teradata database simply rehashes that row to its new location, based on the Primary Index value.

The table definition is stored in the Data Dictionary (DD).

UPI Table:

CREATE TABLE Table1
  ( Col1  INTEGER
   ,Col2  INTEGER
   ,Col3  INTEGER )
UNIQUE PRIMARY INDEX (Col1);

NUPI Table:

CREATE TABLE Table2
  ( Col1  INTEGER
   ,Col2  INTEGER
   ,Col3  INTEGER )
PRIMARY INDEX (Col2);

  • The Primary Index is the mechanism used to assign a row to an AMP.
  • The Primary Index may be Unique (UPI) or Non-Unique (NUPI).
  • Unique Primary Indexes result in even row distribution and eliminate duplicate row checking.
  • Non-Unique Primary Indexes result in row distribution proportional to the number of duplicate values, which may cause skewed distribution.

Row Distribution via Hashing

      
The diagram below gives an overview of Primary Index hash distribution, which is the process by which all data is distributed in the Teradata database.

The Primary Index value is fed into the Hashing Algorithm, which produces the Row Hash. The row goes onto the Communications Layer. The Hash Map, in combination with the Row Hash, determines which AMP gets the row. The Hash Map is a part of the Communications Layer interface.

The Teradata database's hash partitioning is the only approach which offers balanced processing and balanced data placement, while minimizing interconnect traffic and freeing the DBA from time-consuming reorganizations. The Teradata File System eliminates the need to rebuild indexes when tables are updated or structures change.

Hashing Down to the AMPs

The Teradata database is unique in hashing data directly to a physical address on disk, creating an even distribution of data. This allows a balanced application of parallelism, and avoids imbalance due to data skew. The Teradata database's hashed data distribution scheme provides optimized performance with minimal tuning and no manual reorganizations, resulting in lower administration costs and reduced development time.

The rows of all tables are distributed across the AMPs according to their primary index value. The primary index value goes into the hashing algorithm. The output is a 32-bit row hash. The high-order 16 bits are referred to as the hash bucket number and are used to identify a hash-map entry. This entry is used to identify the AMP to be targeted. The remaining 16 bits are not used to locate the AMP.

Teradata provides customers with two choices for the size and number of the hash buckets in a system:

  • 16 bits (65,536 hash buckets, the same as in previous releases)
  • 20 bits (1,048,576 hash buckets)

For larger systems, such as those with 1,000 AMPs or more, the 20-bit option will result in more hash buckets to allow workloads to be distributed more evenly over the AMPs. New systems are also candidates for the expanded hash bucket option.

The entire 32-bit row hash will be used by the selected AMP to locate the row within its disk space.

Hash maps are uniquely configured for each size of system. A 50-AMP system will have a different hash map than a 30-AMP system.

A hash map is simply an array that associates bucket numbers with specific AMPs. When a system grows, new AMPs are typically added. Adding AMPs requires a change to the hash map to reflect the new total number of AMPs.
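
One way to observe the result of hash distribution is to ask how many rows of a table each AMP owns, using the standard hash functions (the table and column names are illustrative):

SELECT HASHAMP(HASHBUCKET(HASHROW(Order_Number))) AS AMP_No     -- AMP that owns each row
      ,COUNT(*)                                   AS Row_Count
FROM   Orders
GROUP BY 1
ORDER BY 1;

An even Row_Count across AMPs indicates good distribution; a few AMPs holding most of the rows indicates skew.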

UPI Row Distribution

      
The hashing algorithm predictably distributes rows across all AMPs. The same value stored in the same data type will always produce the same hash value. If the Primary Index is unique, the Teradata database can distribute the rows evenly. If the Primary Index is non-unique, but there are a fairly even number of rows per index value, the table will still distribute evenly. But if there are hundreds or thousands of rows for some index values, the distribution will probably be skewed.

In the example below, the Order_Number is used as a unique primary index. Since the primary index value for Order_Number is unique, the distribution of rows among AMPs is uniform. This assures maximum efficiency because each AMP does approximately the same amount of work. No AMPs sit idle waiting for another AMP to finish a task.

This way of storing the data provides for maximum efficiency and makes the best use of the parallel features of the Teradata database.

Orders table

Order_Number   Customer_Number   Order_Date   Order_Status
(PK, UPI)
7325           2                 4/13         O
7324           3                 4/13         O
7415           1                 4/13         C
7103           1                 4/10         O
7225           2                 4/15         C
7384           1                 4/12         C
7402           3                 4/16         C
7188           1                 4/13         C
7202           2                 4/09         C
  • Order_Number values are unique (UPI).
  • The rows will distribute evenly across the AMPs.


Unique Primary Index (UPI) Access

      
Accessing a table using a Primary Index is the most efficient way to access the data and is always a one-AMP operation. In the case of a UPI, the one-AMP access can return, at most, one row. In the example below, we are looking for the row whose primary index value is 45. By specifying the PI value as part of our selection criteria, we are guaranteed that only the AMP containing the specified row will be searched.

The correct AMP is located by taking the PI value and passing it through the hashing algorithm. Hashing takes place in the Parsing Engine. The output of the hashing algorithm contains information that will point to a specific AMP. Once it has isolated the appropriate AMP, finding the row is quick and efficient.

SELECT *
FROM  Customer
WHERE Cust = 45;

CREATE TABLE Customer
  ( Cust  INTEGER
   ,Name  CHAR(10)
   ,Phone CHAR(8) )
UNIQUE PRIMARY INDEX (Cust);
       
CUSTOMER table
Cust   Name     Phone
(PK, UPI)
37     White    555-4444
98     Brown    333-9999
74     Smith    555-6666
95     Peters   555-7777
27     Jones    222-8888
56     Smith    555-7777
45     Adams    444-6666
84     Rice     666-5555
49     Smith    111-6666
51     Marsh    888-2222
31     Adams    111-2222
62     Black    444-5555
12     Young    777-4444
77     Jones    777-6666


NUPI Row Distribution

      
In the example below, Customer_Number has been used as a non-unique Primary Index (NUPI). Note that row distribution among AMPs is uneven. All rows with the same primary index value (with the same customer number) are stored on the same AMP.

Customer_Number has three possible values, so all the rows are hashed to three AMPs, leaving the fourth AMP without rows from this table. While this distribution will work, it is not as efficient as spreading all rows among all AMPs.

AMP 2 has a disproportionate number of rows and AMP 3 has none. In an all-AMP operation, AMP 2 will take longer than the other AMPs. The operation cannot complete until AMP 2 completes its tasks. Overall operation time is increased and some AMPs are under-utilized.

This illustrates how NUPIs can create irregular distributions, called skewed distributions. AMPs that have more than an average number of rows will take longer for full-table operations than other AMPs. Because an operation is not complete until all AMPs have finished, the operation where distribution is skewed will take longer than it would if all AMPs were utilized evenly.

Orders table

Order_Number   Customer_Number   Order_Date   Order_Status
(PK)           (NUPI)
7325           2                 4/13         O
7324           3                 4/13         O
7415           1                 4/13         C
7103           1                 4/10         O
7225           2                 4/15         C
7384           1                 4/12         C
7402           3                 4/16         C
7188           1                 4/13         C
7202           2                 4/09         C
  • Customer_Number values are non-unique (NUPI).
  • Rows with the same PI value distribute to the same AMP causing skewed row distribution.


Non-Unique Primary Index (NUPI) Access

      
Accessing a table using a Non-Unique Primary Index (NUPI) is also a one-AMP operation. In the case of a NUPI, the one-AMP access can return zero to many rows. In the example below, we are looking for the rows whose primary index value is 555-7777. By specifying the PI value as part of our selection criteria, we are once again guaranteeing that only the AMP containing the required rows will need to be searched.

The correct AMP is located by taking the PI value and passing it through the hashing algorithm executed in the Parsing Engine. The output of the hashing algorithm will once again point to a specific AMP. Once it has isolated the appropriate AMP, it must find all rows that have the specified value. In the example, the AMP returns two rows.

SELECT *
FROM  Customer
WHERE Phone = '555-7777';

CREATE TABLE Customer
  ( Cust  INTEGER
   ,Name  CHAR(10)
   ,Phone CHAR(8) )
PRIMARY INDEX (Phone);
       
CUSTOMER table
Cust   Name     Phone
(PK)            (NUPI)
37     White    555-4444
98     Brown    333-9999
74     Smith    555-6666
95     Peters   555-7777
27     Jones    222-8888
56     Smith    555-7777
45     Adams    444-6666
84     Rice     666-5555
49     Smith    111-6666
51     Marsh    888-2222
31     Adams    111-2222
62     Black    444-5555
12     Young    777-4444
77     Jones    777-6666
72     Adams    666-7777
40     Smith    222-3333

Single AMP access with 0 to n rows returned.


Highly Non-Unique NUPI Row Distribution

      
The example below uses Order_Status as a NUPI. Order_Status is a poor choice, because it yields the most uneven distribution. Because there are only two possible values for Order_Status, all rows are placed on two AMPs.

This is an example of a highly non-unique Primary Index.

When choosing a Primary Index, never pick a column with a severely-limited value set. The degree of uniqueness is critical to efficiency. Choose NUPIs that allow all AMPs to participate fairly equally.

Orders table

Order_Number   Customer_Number   Order_Date   Order_Status
(PK)                                          (NUPI)
7325           2                 4/13         O
7324           3                 4/13         O
7415           1                 4/13         C
7103           1                 4/10         O
7225           2                 4/15         C
7384           1                 4/12         C
7402           3                 4/16         C
7188           1                 4/13         C
7202           2                 4/09         C
  • Order_Status values are highly non-unique (NUPI).
  • Only two values exist. The rows will be distributed to two AMPs.
  • This table will not perform well in parallel operations.
  • Highly non-unique columns are poor PI choices.
  • The degree of uniqueness is critical to efficiency.



Partitioned Primary Index (PPI)

      
The Teradata database has a partitioning mechanism called Partitioned Primary Index (PPI). PPI is used to improve performance for large tables when you submit queries that specify a range constraint. PPI allows you to reduce the number of rows to be processed by using a technique called partition elimination. PPI will increase performance for incremental data loads, deletes, and data access when working with large tables with range constraints. PPI is useful for instantaneous dropping of old data and rapid addition of new data. Using PPI lets you avoid having to do a full-table scan without the overhead of a Secondary Index.

How Does PPI Work?

Data distribution with PPI is still based on the Primary Index:

  • Primary Index >> Hash Value >> Determines which AMP gets the row

With PPI, the ORDER in which the rows are stored on the AMP is affected. Using the traditional method, a Non-Partitioned Primary Index (NPPI), the rows are stored in row hash order.

Using PPI, the rows are stored first by partition and then in row hash/row ID order. In our example, there are four partitions. Within each partition, the rows are stored in row hash order.

Data Storage Using PPI

To store rows using PPI, specify Partitioning in the CREATE TABLE statement. The query will run through the hashing algorithm as normal, and come out with the Base Table ID, the Partition number(s), the Row Hash, and the Primary Index values.

Partition Elimination is used to access tables with a PPI. The term "partition elimination" refers to an automatic optimization in which the optimizer determines, based on query conditions, that some partitions can't contain qualifying rows, and causes those partitions to be skipped. Partitions that are skipped for a particular query are called excluded partitions. Generally, the greatest benefit of a PPI table is obtained from partition elimination.

If the partitioning expression references any columns that aren't part of the primary index, then PI access may be slower. When all partitioning expression columns are contained in the PI, then PI access is unchanged.

The Orders table defined with a Non-Partitioned Primary Index (NPPI) on Order_Number (O_#).

Partitioned Primary Indexes:

  • Improve performance on range constraint queries
  • Use partition elimination to reduce the number of rows accessed
  • May use a CASE_N or RANGE_N partitioning expression
The Orders table defined with a Primary Index on Order_Number (O_#) Partitioned By Order_Date (O_Date) (PPI).
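
A hedged sketch of such a definition, partitioning the Orders table by month of Order_Date (the date range and interval are illustrative):

CREATE TABLE Orders
 (Order_Number    INTEGER NOT NULL
 ,Customer_Number INTEGER
 ,Order_Date      DATE NOT NULL
 ,Order_Status    CHAR(1))
PRIMARY INDEX (Order_Number)
PARTITION BY RANGE_N (Order_Date BETWEEN DATE '2012-01-01'
                                     AND DATE '2012-12-31'
                                     EACH INTERVAL '1' MONTH);  -- one partition per month

A query constrained on Order_Date can then read only the partitions for the requested months (partition elimination).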


Multilevel Partitioned Primary Index (MLPPI)

      
A PPI can be sub-partitioned, increasing the opportunities for partition elimination at the various additional levels (or combinations of levels). With a multilevel PPI, you create multiple access paths to the rows in the base table that the Optimizer can choose from. This improves response to business questions by improving the performance of queries which take advantage of partition elimination.

  • Allows multiple partitioning expressions on a table or a non-compressed join index - including base tables, Global Temp tables, and Volatile tables.
  • Multilevel partitioning allows each partition at a level to be sub-partitioned. Each partitioning level is defined independently using a RANGE_N or CASE_N expression.
  • A multilevel PPI allows efficient searches by using partition elimination at the various levels or combination of levels.
  • Allows flexibility on which partitioning expression to use when there are multiple choices of partitioning expressions.
For example, an insurance claims table could be partitioned by claim date and then sub-partitioned by state. The analysis performed for a specific state (such as Connecticut) within a date range that is a small percentage of the many years of claims history in the data warehouse (such as June 2012) would take advantage of partition elimination for faster performance.

Note: an MLPPI table must have at least two partition levels defined.

Advantages of partitioned tables:

  • They reduce the I/O for range constraint queries.
  • They take advantage of dynamic partition elimination.
  • They provide multiple access paths to the data, and an MLPPI provides even more partition elimination and more partitioning expression choices (i.e., you can use last name or some other value that is more readily available to query on).
  • The Primary Index may be either a UPI or a NUPI; a NUPI allows local joins to other similar entities.
  • Row hash locks are used for SELECT with equality conditions on the PI columns.
  • Partitioned tables allow for fast deletes of data in a partition.
  • They allow for range queries without a secondary index.
  • Specific partitions may be archived or deleted.
  • May be created on volatile tables, global temp tables, base tables, and non-compressed join indexes.
  • May replace a Value Ordered NUSI for access.

Disadvantages of partitioned tables:

  • Rows in a partitioned table are eight bytes longer
  • Access via the Primary Index may take longer
  • Full table joins to an NPPI table with the same PI may take longer

Syntax:
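
The original syntax illustration is not reproduced here; the following is a hedged sketch of a two-level definition for the insurance claims example above (column names, ranges, and states are illustrative):

CREATE TABLE Claims_History
 (Claim_Id     INTEGER NOT NULL
 ,Claim_Date   DATE    NOT NULL
 ,State_Code   CHAR(2) NOT NULL
 ,Claim_Amount DECIMAL(10,2))
PRIMARY INDEX (Claim_Id)
PARTITION BY (RANGE_N (Claim_Date BETWEEN DATE '2008-01-01'
                                      AND DATE '2012-12-31'
                                      EACH INTERVAL '1' MONTH)   -- level 1: claim month
             ,CASE_N (State_Code = 'CT'
                     ,State_Code = 'NY'
                     ,NO CASE, UNKNOWN));                        -- level 2: state sub-partitions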

What is a NoPI Table?

      
A NoPI Table is simply a table without a primary index.

Prior to Teradata database 13.0, Teradata tables required a primary index. The primary index was used to hash and distribute rows to the AMPs according to hash ownership. The objective was to divide data as evenly as possible among the AMPs to make use of Teradata's parallel processing. Each row stored in a table has a RowID which includes the row hash that is generated by hashing the primary index value. For example, the optimizer can choose an efficient single-AMP execution plan for SQL requests that specify values for the columns of the primary index.

Starting with Teradata database 13.0, a table can be defined without a primary index. This feature is referred to as the NoPI Table feature. NoPI stands for No Primary Index.

Without a PI, the hash value as well as AMP ownership of a row is arbitrary. Within the AMP, there are no row-ordering constraints and therefore rows can be appended to the end of the table as if it were a spool table. Each row in a NoPI table has a hash bucket value that is internally generated. A NoPI table is internally treated as a hashed table; it is just that typically all the rows on one AMP will have the same hash bucket value.
With a NoPI table:

  • INSERTS/UPDATES to the table are always appended at the end of the table
  • The table must always be defined as MULTISET
  • There are no duplicate row checks performed upon load

If you use FastLoad to load rows into a NoPI table, you cannot have secondary or join indexes, CHECK constraints, or referential integrity constraints on the table, as FastLoad cannot load a table defined with any of these features. You can, however, use FastLoad to load rows into un-partitioned NoPI tables with row-level security defined.

When FastLoading to a NoPI table, the rows are distributed a block at a time.

Example of creating a NoPI table:

CREATE TABLE table_x
 	(col_x INTEGER
 	,col_y CHAR(10)
 	,col_z DATE)
NO PRIMARY INDEX;


Teradata Columnar

      
Teradata Columnar, or Column Partitioning (CP), is a physical database design implementation option (starting with Teradata 14.0) that allows single columns or sets of columns of a NoPI table to be stored in separate partitions. Column partitioning can also be applied to join indexes.

Columnar is simply a new approach for organizing the data of a user-defined table or join index on disk.

Columnar offers the ability to partition a table or join index by column. It can be used alone or in combination with row partitioning in multilevel partitioning definitions. Column partitions may be stored using traditional 'ROW' storage or alternatively stored using the new 'COLUMN' storage option. In either case, columnar can automatically compress physical rows where appropriate.

The key benefit in defining row-partitioned (PPI) tables is when queries access a subset of rows based on constraints on one or more partitioning columns. The major advantage of using column partitioning is to improve the performance of queries that access a subset of the columns from a table, either for predicates (e.g., WHERE clause) or projections (i.e., SELECTed columns).

Because sets of one or more columns can be stored in separate column partitions, only the column partitions that contain the columns needed by the query need to be accessed. Just as row-partitioning can eliminate rows that need not be read, column partitioning eliminates columns that are not needed.

The advantages of both can be combined, meaning even less data moved and thus reduced I/O. Fewer data blocks need to be read since more data of interest is packed together into fewer data blocks.

Columnar makes more sense in CPU-rich environments because CPU cycles are needed to "glue" columns back together into rows, for compression and for different join strategies (mainly hash joins).

Example of creating a Columnar table:

CREATE TABLE table_y
	(col_a CHAR(25) NOT NULL
	,col_b CHAR(25) NOT NULL
	,col_c DATE NOT NULL
	,col_d CHAR(7) NOT NULL
	,col_e INTEGER)
PARTITION BY COLUMN;
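
Column partitioning can also be combined with row partitioning in one definition; a hedged sketch (the table, columns, and date range are illustrative):

CREATE TABLE table_z
 (col_a CHAR(25) NOT NULL
 ,col_b CHAR(25) NOT NULL
 ,col_c DATE     NOT NULL
 ,col_d INTEGER)
NO PRIMARY INDEX
PARTITION BY (COLUMN                                            -- column partitions
             ,RANGE_N (col_c BETWEEN DATE '2014-01-01'
                                 AND DATE '2014-12-31'
                                 EACH INTERVAL '1' MONTH));     -- plus row partitions by month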


Teradata Columnar Benefits

      
Benefits of using the Teradata Columnar feature include:
  • Improved query performance: Column partitioning can be used to improve query performance via column partition elimination. Column partition elimination reduces the need to access all the data in a row while row partition elimination reduces the need to access all the rows.

  • Reduced disk space: The feature also allows for the possibility of using a new auto-compression capability which allows data to be automatically (as applicable) compressed as physical rows are inserted into a column-partitioned table or join index.

  • Increased flexibility: Provides a new physical database design option to improve performance for suitable classes of workloads.

  • Reduced I/O: Allows fast and efficient access to selected data from column partitions, thus reducing query I/O.

  • Ease of use: Provides simple default syntax for the CREATE TABLE and CREATE JOIN INDEX statements. No change is needed to queries.

Columnar Join Indexes

A join index can also be created as column-partitioned for either a columnar table or a non-columnar table. Conversely, a join index can be created as non-columnar for either type of table as well.

Sometimes within a mixed workload, some queries perform better when the data is not column partitioned and others perform better when it is. Or perhaps some queries perform better with one type of partitioning on a table, whereas other queries do better with another type of partitioning. Join indexes allow creation of alternate physical layouts for the data, with the Optimizer automatically choosing whether to access the base table and/or one of its join indexes.

A column-partitioned join index must be a single-table, non-aggregate, non-compressed, join index with no primary index, and no value-ordering, and must include RowID of the base table. A column-partitioned join index may optionally be row partitioned. It may also be a sparse join index.

Secondary Indexes

      
A secondary index is an alternate path to the data. Secondary Indexes are used to improve performance by allowing the user to avoid scanning the entire table. A Secondary Index is like a Primary Index in that it allows the user to locate rows. It is unlike a Primary Index in that it has no influence on the way rows are distributed among AMPs. A database designer typically chooses a secondary index because it provides faster set selection.

There are three general ways to access a table:

  • Primary Index access (one AMP access)
  • Secondary Index access (two or all AMP access)
  • Full Table Scan (all AMP access)

Primary Index requests require the services of only one AMP to access rows, while secondary indexes require at least two and possibly all AMPs, depending on the index and the type of operation. A secondary index search will typically be less expensive than a full table scan.

Secondary indexes add overhead to the table, both in terms of disk space and maintenance; however they may be dropped when not needed, and recreated whenever they would be helpful.

  • A secondary index provides an alternate path to the rows of a table.
  • A secondary index can be used to maintain uniqueness within a column or set of columns.
  • A table can have from 0 to 32 secondary indexes (each can have 1 to 64 columns).
  • Secondary Indexes:
    • Do not affect table distribution.
    • Add overhead, both in terms of disk space (sub-tables) and maintenance.
    • May be added or dropped dynamically as needed.
    • Are only chosen to improve query performance.

Defining a Secondary Index

      
Just as with primary indexes, there are two types of secondary indexes – unique (USI) and nonunique (NUSI). Secondary Indexes may be specified at table creation or at any time during the life of the table. A secondary index may consist of up to 64 columns; however, to get the benefit of the index, the query must specify a value for every column in the index.

Unique Secondary Indexes (USI) have two possible purposes. They can speed up access to a row which otherwise might require a full table scan without having to rely on the primary index. Additionally, they can be used to enforce uniqueness on a column or set of columns. This is the case when a PRIMARY KEY or UNIQUE constraint is defined and the table has a specified Primary Index or No Primary Index option. Making it a USI has the effect of enforcing the uniqueness of a PRIMARY KEY or UNIQUE constraint.

Non-Unique Secondary Indexes (NUSI) are usually specified in order to prevent full table scans. However, a NUSI does activate all AMPs – after all, the value being sought might well exist on many different AMPs (only Primary Indexes have same values on same AMPs). If the optimizer decides that the cost of using the secondary index is greater than a full table scan would be, it opts for the table scan.

All secondary indexes cause an AMP subtable to be built and maintained as rows are inserted, updated, or deleted. Secondary index subtables consist of rows which associate the secondary index value with one or more rows in the base table. When the index is dropped, the subtable is physically removed.

A Secondary Index may be defined ...

  • at table creation (CREATE TABLE) (see the sketch after the examples below)
  • following table creation (CREATE INDEX)
  • with up to 64 columns
If the index choice of column(s) is unique, it is called a USI.
Unique Secondary Index
Accessing a row via a USI is a 2 AMP operation.

CREATE UNIQUE INDEX
(EmpNumber) ON Employee;

If the index choice of column(s) is nonunique, it is called a NUSI.
Non-Unique Secondary Index
Accessing row(s) via a NUSI is an all AMP operation.

CREATE INDEX
(LastName) ON Employee;
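
Secondary indexes can also be declared when the table is created rather than afterwards. A minimal sketch, assuming a hypothetical Employee-style table (names illustrative):

CREATE TABLE Employee_2
	(EmpNumber INTEGER NOT NULL
	,LastName CHAR(25)
	,DeptNumber INTEGER)
PRIMARY INDEX (DeptNumber)
UNIQUE INDEX (EmpNumber)	-- USI defined at table creation
INDEX (LastName);		-- NUSI defined at table creation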

Notes:

  • Creating a Secondary Index causes an internal sub-table to be built.
  • Dropping a Secondary Index causes the sub-table to be deleted.



Other Types of Secondary Indexes

      
There are a few other Secondary Indexes available with Teradata:

Sparse Index

Any join index, whether simple or aggregate, multi-table or single-table, can be sparse. A sparse join index uses a constant expression in the WHERE clause of its definition to narrowly filter its row population. This is known as a Sparse Join Index.
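
As a hedged sketch of a sparse join index (the table, index, and column names are hypothetical), the constant condition in the WHERE clause limits which rows are stored in the index:

CREATE JOIN INDEX Sales_2024_JI AS
	SELECT store_id, sales_date, amount
	FROM Sales
	WHERE sales_date >= DATE '2024-01-01';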

Hash Index

Hash indexes are used for the same purpose as single-table join indexes. Hash indexes create a full or partial replication of a base table with a primary index on a foreign key column to facilitate joins of very large tables by hashing them to the same AMP.

You can define a hash index on a single table only. Hash indexes are not indexes in the usual sense of the word. They are base tables that cannot be accessed directly by a query.

Value-Ordered NUSI

Value-ordered NUSIs are efficient for range conditions and conditions with an inequality on the secondary index column set. Because the NUSI rows are sorted by data value, it is possible to search only a portion of the index subtable for a given range of key values. Thus the major advantage of a value-ordered NUSI is in the performance of range queries.

Value-ordered NUSIs have the following limitations.

  • The sort key is limited to a single numeric column.
  • The sort key column cannot exceed four bytes.
  • They count as two indexes against the total of 32 non-primary indexes you can define on a base or join index table.
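
A hedged sketch of a value-ordered NUSI follows (table and column names are hypothetical); because the subtable is sorted by value, a range predicate on the sort column can probe only the relevant portion of the index:

CREATE INDEX (sales_date) ORDER BY VALUES (sales_date) ON Daily_Sales;

SELECT *
FROM Daily_Sales
WHERE sales_date BETWEEN DATE '2024-06-01' AND DATE '2024-06-30';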


Full-Table Scans

      
A full table scan is another way to access data without using any Primary or Secondary Indexes.

In evaluating an SQL request, the Parser examines all possible access methods and chooses the one it believes to be the most efficient. The coding of the SQL request, along with the demographics of the table and the availability of indexes, all play a role in the decision of the Parser. Some coding constructs, such as those in the examples below, always cause a full table scan. In other cases, it might be chosen because it is the most efficient method. In general, if the number of physical reads exceeds the number of data blocks, then the optimizer may decide that a full-table scan is faster.

With a full table scan, each data block is found using the Master and Cylinder Indexes and each data row is accessed only once.

As long as the choice of Primary Index has caused the table rows to distribute evenly across all of the AMPs, the parallel processing of the AMPs can do the full table scan quickly. The file system keeps each table on as few cylinders as practical to help reduce the cost of full table scans.

While full table scans are impractical and even disallowed on some systems, the Vantage Advanced SQL Engine routinely executes ad hoc queries with full table scans.

Every row of the table must be read. All AMPs scan their portion of the table in parallel.

  • Fast and efficient on Teradata due to parallelism.
Full table scans typically occur when either:
  • An index is not used in the query
  • An index is used in a non-equality test
Example table: CUSTOMER with columns CustID (USI), CustName, and CustPhone (NUPI).

Examples of Full Table Scans:

SELECT * FROM Customer WHERE CustPhone LIKE '858-485-_ _ _ _ ';

SELECT * FROM Customer WHERE CustName = 'Lund';

SELECT * FROM Customer WHERE CustID > 1000;

Locks

      
Locking serializes access and prevents multiple users who are trying to update the same data at the same time from violating data integrity. This concurrency control is implemented by locking the desired data. Locks are automatically acquired during the processing of a request and released at the termination of the request. In addition, users can specify locks. There are four types of locks:
  1. Exclusive Locks - prevents any other type of concurrent access. Exclusive locks are applied only to databases or tables, never to rows. They are the most restrictive types of lock; all other users are locked out. Exclusive locks are typically used only when structural changes are being made to the database or table, such as changing the block size on a table, as in the case of a DDL statement.

  2. Write Locks - prevents other Read, Write, Exclusive locks. Write locks enable users to modify data while locking out all other users except readers not concerned about data consistency (access lock readers). Until a write lock is released, no new read or write locks are allowed. When you update a table without a WHERE clause, the system places a write lock on the table.

  3. Read Locks - prevents Write and Exclusive locks. Read locks are used to ensure consistency during read operations. Several users may hold concurrent read locks on the same data, during which no modification of the data is permitted.

  4. Access Locks - prevents Exclusive locks only. Users who are not concerned about data consistency can specify access locks. Using an access lock allows for reading data while modifications are in process. Access locks are designed for decision support on large tables that are updated only by small, single-row changes. Access locks are sometimes called stale read or dirty read locks. You may get stale data that hasn't been updated, or has been overwritten.

Lock Levels

Three levels of database locking are provided:

  1. Database - applies to all tables/views in the database. Locks all objects in the database.
  2. Table/View - applies to all rows in the table/view. Locks all rows in the table or view.
  3. Row Hash - applies to all rows with same row hash. Locks all rows with the same row hash (primary and Fallback rows, and Secondary Index subtable rows).

The type and level of locks are automatically chosen based on the type of SQL command issued. The user has, in some cases, the ability to upgrade or downgrade the lock.

  • SELECT - requests a Read lock
  • UPDATE - requests a Write lock
  • CREATE TABLE - requests an Exclusive lock
Lock requests may be upgraded or downgraded:

LOCKING TABLE Table1 FOR ACCESS . . .
LOCKING TABLE Table1 FOR EXCLUSIVE . . .
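
As a hedged example of downgrading the default READ lock to an ACCESS lock for a dirty read (table and column names are hypothetical):

LOCKING TABLE Sales FOR ACCESS
SELECT store_id, SUM(amount) AS total_amount
FROM Sales
GROUP BY 1;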


Data Protection Features

      
Facilities that provide system-level protection:
  • Disk Arrays
    • RAID data protection (e.g., RAID 1)
    • Redundant data buses and array controllers

  • Cliques and Vproc Migration
    • SMP or Linux failures; Vprocs can migrate to other nodes (or HSN) within the clique.
Facilities that provide Teradata database protection
  • Fallback - provides data access with a “down” AMP
  • Locks - provide data integrity
  • Transient Journal - automatic rollback of aborted transactions
  • Down AMP Recovery Journal - fast recovery of fallback rows for AMPs
  • Permanent Journal - optional before and after-image journaling
  • Unity - integrated portfolio that turns a multi-system environment into an orchestrated ecosystem
  • ARC - Archive/Restore facility
  • NetBackup and Tivoli - provide backup storage management, ARC script creation, and scheduling capabilities
Disk Arrays – Disk arrays provide RAID 1 or RAID 6 protection. If a disk drive fails, the array subsystem provides continuous access to the data.

Clique – a set of Teradata nodes that share a common set of disk arrays. In the event of node failure, all vprocs can migrate to another available node in the clique. All nodes in the clique must have access to the same disk arrays.

Locks – Locking prevents multiple users who are trying to change the same data at the same time from violating the data's integrity. This concurrency control is implemented by locking the desired data. Locks are automatically acquired and released during the processing of a request.

Fallback – protects your data by storing a second copy of each row of a table on an alternative “fallback AMP” and provides protection in the event of an AMP failure. Fallback provides AMP fault tolerance at the table level.

Down-AMP Recovery Journal – started automatically when the system has a failed or down AMP. Its purpose is to log any changes to rows which reside on the down AMP.

Transient Journal – the TJ is part of the WAL log and exists to permit the successful rollback of a failed transaction.

ARC/DSA and NetBackup/Tivoli – ARC and DSA command scripts provide the capability to backup and restore the Teradata database.

Transient Journal

      
The transient journal (or Online Transaction Journal) permits the successful rollback of a failed transaction (TXN). Transactions are not committed to the database until the AMPs have received an End Transaction request, either implicitly or explicitly. There is always the possibility that the transaction may fail. If so, data is returned to its original state after transaction failure.

The transient journal automatically maintains a copy on each AMP of before images of all rows affected by the transaction. In the event of a transaction failure, the before images are reapplied to the affected table, and a rollback operation is completed. In the event of transaction success, the before images for the transaction are discarded from the journal at the point of transaction commit.

Journal entries are made on and by the AMP making the change. Transient Journal activities are automatic and transparent to the user.

ALTER TABLE Misconception

An often repeated misconception is that the when an ALTER TABLE command is executed, the “altered data” is written to the transient journal/WAL (Write-Ahead Logging or Write-Ahead Logic) log.

For example, assume an ALTER TABLE ADD COLUMN is executed. Assuming the altered table is a large table, this operation may be time consuming. Why?

The reason that ALTER TABLE ADD COLUMN may be time consuming is that each row is rebuilt, each block is rebuilt, and the cylinder is rebuilt (re-lays out each cylinder). It does so one cylinder at a time in order to avoid having to make a full copy of the table. A new cylinder is allocated; each datablock is read; the records are reformatted and written to a new datablock which is placed on the new cylinder. When all the blocks are complete, the new cylinder is committed to disk. Then the old cylinder is released and the process is repeated until all cylinders are completed. This work reformats the rows and blocks in the same way as an INSERT-SELECT.

The only thing written to the WAL log is a single record per cylinder as a cylinder is committed. This is used in case the ALTER TABLE is interrupted – e.g., by a Teradata restart. The last cylinder log record is used to restart the process from the last completed cylinder and complete it. This is also why an ALTER TABLE cannot be aborted once it has been started.

Fallback

      
Fallback protects your data by storing a second copy of each row of a table on an alternate, Fallback AMP in the same cluster. If an AMP fails, the system accesses the Fallback rows to meet requests. Fallback provides AMP fault tolerance at the table level. With Fallback tables, if one AMP fails, all table data is still available. Users may continue to use Fallback tables without any loss of available data.

During table creation or after a table is created, you may specify whether or not the system should keep a Fallback copy of the table. If Fallback is specified, it is automatic and transparent.
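
A minimal sketch of specifying Fallback, using a hypothetical table (names illustrative); the protection can also be changed after creation:

CREATE TABLE Employee_FB, FALLBACK
	(EmpNumber INTEGER NOT NULL
	,LastName CHAR(25))
UNIQUE PRIMARY INDEX (EmpNumber);

ALTER TABLE Employee_FB, NO FALLBACK;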

Fallback guarantees that the two copies of a row will always be on different AMPs in the same cluster. A cluster is a group of AMPs that act as a single Fallback unit. Clustering has no effect on primary row distribution of the table, but the Fallback row will always go to another AMP in the same cluster. If either AMP fails, the alternate row copy is still available on the other AMP.

There is a benefit to protecting your data, but there are costs associated with that benefit. With Fallback use, you need twice the disk space for storage and twice the I/O for INSERTs, UPDATEs, and DELETEs. (The Fallback option does not require any extra I/O for SELECT operations and the Fallback I/O will be performed in parallel with the primary I/O.)

The benefits of Fallback include:

  • Protects your data from an AMP failure. An AMP failure may occur because of physical disk failures - e.g., both disk drives in a RAID 1 drive group have failed or three drives in the same RAID 6 drive group. An AMP may also fail because of an AMP software failure or because access to the disk array has failed.
  • Protects your data from software (node) failure.
  • Automatically recovers with minimum recovery time, after repairs or fixes are complete.
  • May be specified at the table or database level
  • Permits access to table data during AMP off-line period
  • Adds a level of data protection beyond disk array RAID 1 & 5
  • Highest level of data protection is RAID 1 and Fallback
  • Critical for high availability applications

Loss of any two AMPs in a cluster causes RDBMS to halt!

Down AMP Recovery Journal

      
After the loss of any AMP, a down-AMP recovery journal is started automatically to log any changes to rows which reside on the down AMP. Any inserts, updates, or deletes affecting rows on the down AMP are applied to the Fallback copy within the cluster. The AMP that holds the Fallback copy logs the Row-ID in its recovery journal.

This process will continue until the down AMP is brought back on-line. As part of the restart activity, the recovery journal is read and changed rows are applied to the recovered AMP. When the journal has been exhausted, it is discarded and the AMP is brought on-line, fully recovered.

Recovery Journal is:

  • Automatically activated when an AMP is taken off-line
  • Maintained by other AMPs in the cluster
  • Totally transparent to users of the system
While AMP is off-line:
  • Journal is active
  • Table updates continue as normal
  • Journal logs Row-IDs of changed rows for down-AMP
When AMP is back on-line:
  • Restores rows on recovered AMP to current status
  • Journal discarded when recovery complete

For Fallback Tables

RAID Protection

      
There are many forms of disk array protection. The Teradata database supports the following protection schemes:

  • RAID 1
  • RAID 5
  • RAID S

RAID 1 (Mirroring) uses a disk-mirroring technique. Each physical disk is mirrored elsewhere in the array. This requires the array controllers to write all data to two separate locations, which means data can be read from two locations as well. In the event of a disk failure, the mirror disk becomes the primary disk to the array controller and performance is unchanged.

RAID 5 (Parity) uses a parity-checking technique. For every three blocks of data (spread over three disks), there is a fourth block on a fourth disk that contains parity information. This allows any one of the four blocks to be reconstructed by using the information on the other three. If two of the disks fail, the rank becomes unavailable. A rank is defined as a group of disks that have the same SCSI ID (the disks in the rank are not on the same channel, but are on different channels). The array controller does the recalculation of the information for the missing block. Recalculation will have some impact on performance, but at a much lower cost in terms of disk space.

RAID S uses a parity-checking technique similar to RAID 5 and is used with EMC disk arrays.

Teradata systems use disk arrays that employ redundant controllers (DRDAC) and redundant disk drives. When a drive fails, the array switches to the mirror drive or uses parity to reconstruct the data.

Note: With Teradata databases, RAID 1 and RAID 5 are commonly used. Unless there is a reason not to do so, Teradata recommends that all our customers implement RAID 1.

Summary:

  • RAID-1 - Good performance with disk failures; higher cost in terms of disk space
  • RAID-5 - Reduced performance with disk failures; lower cost in terms of disk space

Hot Standby Nodes (HSN)

      
A Hot Standby Node (HSN) is a node which is a member of a clique that is not configured (initially) to execute any Teradata vprocs. If a node in the clique fails, Teradata restarts and the AMPs from the failed node move to the hot standby node. The performance degradation is 0%.

When the failed node is recovered/repaired and restarted, it becomes the new hot standby node. A second restart of Teradata is not needed.

Characteristics of a hot standby node are:

  • A node that is a member of a clique.
  • Does not normally participate in the trusted parallel application (TPA).
  • Will be brought into the configuration when a node fails in the clique.
  • Helps with unplanned outages.
  • Helps with planned outages
  • Eliminates the need for a restart to bring a failed node back into service.

Hot standby nodes are positioned as a performance continuity feature.


This example illustrates vproc migration using a Hot Standby Node.

  1. Performance Degradation is 0% as AMPs are moved to the Hot Standby Node.
  2. When Node 1 is recovered, it becomes the new Hot Standby Node.

Teradata Tools and Utilities Overview

      
Teradata has a robust suite of client utilities that enable users and system administrators to enjoy optimal response time and system manageability. Various client utilities are available for tasks from loading data to managing the system.

Teradata utilities leverage the Teradata database's high performance capabilities and are fully parallel and scalable. The same utilities run on smaller entry-level systems, and the largest MPP implementations.

Teradata database client utilities include the following, described in this section:

  • Query Submitting Utilities - Query tools that submit SQL queries
    • BTEQ
    • Teradata Studio Express

  • Load and Unload Utilities
    • Teradata Parallel Transporter (TPT)

  • Administrative Utilities - for managing workloads
    • Teradata Studio
    • Priority Scheduler
    • Database Query Log (DBQL)
    • Teradata Workload Analyzer
    • Teradata Integrated Workload Management (TIWM)
    • Teradata Active Systems Management (TASM)

  • Tools that support a Multi-System Environment
    • Viewpoint
    • Unity
    • Teradata Multi-Systems Manager (TMSM)
    • Data Mover

BTEQ

      
BTEQ stands for Basic Teradata Query utility. It is a SQL front-end tool that runs on all client platforms. It supports both interactive ad hoc and scripted batch queries and provides standard report writing and formatting functionality. It also provides a powerful import/export capability even though there are other utilities available for these functions as well, particularly when larger tables are involved. When a user logs on with BTEQ, they can set the number of sessions they need before logging on. Using multiple sessions may improve performance for certain operations. This feature exploits the parallelism of the Teradata database.

BTEQ uses the CLI application programming interface.
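
A minimal sketch of a BTEQ session follows; the logon string, file name, table name, and session count are placeholders, not working values:

.SET SESSIONS 4
.LOGON tdpid/dbuser,dbpassword

SELECT COUNT(*) FROM Employee;

.EXPORT REPORT FILE = emp_report.txt
SELECT LastName, DeptNumber FROM Employee ORDER BY 1;
.EXPORT RESET

.LOGOFF
.QUIT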

Key Points about BTEQ

  • Basic Teradata Query utility
  • SQL front-end
  • Report writing and formatting features
  • Interactive and scripted batch queries
  • Import/Export across all platforms
  • Supports ad-hoc querying
  • Supports table updates with import and export
  • Runs on all platforms

BTEQ Uses the CLI API


Teradata Parallel Transporter

      
Teradata Parallel Transporter (TPT) is the replacement for Teradata Warehouse Builder (TWB). This utility is effectively an object-oriented software system that executes multiple instances of data extraction, transformation, and load functions in a scalable, high-speed parallel processing environment.

TPT is scalable and enables end-to-end parallelism. The previous versions of utilities (like FastLoad) allow you to load data into Teradata in parallel, but you still have a single input stream. Teradata Parallel Transporter allows you to run multiple instances of the extract, optional transformation, and load. You can have as many loads as you have sources in the same job. With multiple sources of data coming from multiple platforms, integration is important in a parallel environment.

TPT eliminates the need for persistent storage. It stores data into data buffers so you no longer need to write data into a flat file.

TPT provides a single scripting language. You can do the extract, some transformation, and loads all in one SQL-like scripting language.

Once the dynamics of the language are learned, you can perform multiple tasks with a single script.

TPT supports FastLoad INMODs, FastExport OUTMODs, and Access Modules to provide access to all the data sources you use today. TPT also provides a Direct API interface that can be used by Third Party partners. They can effectively write 'C' code to directly load/unload data to/from Teradata.

It is effectively a parallel load and export utility. Characteristics of this utility include:

    High performance
    • Parallel Export and Load operators eliminate sequential bottlenecks.
    • Data Streams eliminate the overhead of intermediate (persistent) storage.
    • Scalable
    • End-to-end parallelism

    Easy to use

    • Single scripting language
    • Access to various data sources

    Extensible

    • Direct API enables Third Party partners to write 'C' code to directly load/unload data to/from Teradata.

TPT Operators

      
The operators are components that "plug" into the TPT infrastructure and actually perform the functions.

  • The FastLoad and FastExport operators support the current FastLoad and FastExport INMOD/OUTMOD features.
  • The DataConnector operator is an adapter for the Access Module or non-Teradata files.
  • The SQL Select and Insert operators submit the Teradata SELECT and INSERT commands.
  • The Load, Update, Export and Stream operators are similar to the FastLoad, MultiLoad, FastExport, and TPump utilities, but built for the TPT parallel environment.

The INMOD and OUTMOD adapters, DataConnector operator, and the SQL Select/Insert operators are included when you purchase the Infrastructure. The Load, Update, Export, and Stream operators are purchased separately.

Here is a comparison of the Teradata Parallel Transporter Operators with the classic utilities that we just covered.

TPT Operator (Teradata Utility) - Description

  • LOAD (FastLoad) - A consumer-type operator that uses the Teradata FastLoad protocol to load empty tables. Supports error limits, checkpoint/restart, multi-value compression, and PPI.
  • UPDATE (MultiLoad) - Uses the Teradata MultiLoad protocol to update tables. This allows highly scalable and parallel inserts and updates to a populated table.
  • EXPORT (FastExport) - A producer operator that emulates FastExport.
  • STREAM (TPump) - Uses multiple sessions to perform DML transactions in near real-time.
  • DataConnector (N/A) - Emulates the DataConnector API. Reads external data files, writes data to external data files, and reads an unspecified number of data files.
  • ODBC (N/A) - Reads data from an ODBC provider.
  • SQL INSERT (BTEQ) - A consumer operator that inserts data into a table, similar to a BTEQ import operation.
  • SQL SELECT (BTEQ) - A producer operator that submits SQL SELECT statements to retrieve data from a table.


TPT Architecture

      
There are two types of parallelism used by TPT: pipeline parallelism and data parallelism.

Pipeline Parallelism

Pipeline parallelism (also called "functional parallelism") is achieved by connecting operator instances through data streams in a single job. A "producer" operator extracts data from a data source and writes it to the data stream, that is, it 'produces' a data stream. The "consumer" operator then 'consumes' the data from the data stream and writes the data to the target.

There are two operators involved here, producer and consumer. Both operators can operate concurrently and independently of each other thereby creating a pipeline effect without the need for intermediary files.

For example:

A "producer" operator extracts data from a data source and writes ("produces") it to a data stream. A "consumer" operator reads ("consumes") the data from the data stream and writes or loads data to the target. Both operators can operate concurrently and independently of each other. This creates a pipeline effect without the need for intermediary files.

TPT Parallelism

      
Imagine you have the same job as described above, but that it is handling a large quantity of data coming from several different input files. Each file can be handled by a different instance of the producer operator so that the files are read concurrently. The operator instances can then merge these files into a single data stream, which can then be input to multiple consumer operators.

Each source can be handled by a different instance of the producer operator; these instances merge the files into a single data stream.

The data stream can then be input to multiple consumer operators.


===============================================================

Teradata SQL for Applications Development


Teradata Database has the ability to insert a row using only the DEFAULT VALUES keywords. For this feature to work successfully, one of the following statements must be true for each column of the table:


the column has defined default values

the column has a default system value specified

the column permits nulls

If none of these statements is true, an insert using DEFAULT VALUES will fail. Note that such an insert may be executed multiple times as long as no uniqueness attributes of the table are violated.



The SQL command INSERT INTO tablename DEFAULT VALUES;


Will insert defined default values into each column.

Will insert a null if no default is defined.

Will fail if no default is defined and a null is not allowed.


Some options for NULL and DEFAULT attributes:


NOT NULL - Nulls are not permitted for this column

DEFAULT n - Unless explicitly specified, assign the column a value of n. n must be of the same data type as the column.

Example: DEFAULT 'Unknown'


DEFAULT DATE 'xxxx-xx-xx' - Unless explicitly specified, assign the date xxxx-xx-xx

Example: DEFAULT DATE '2099-12-31'


WITH DEFAULT - Assign the system default (e.g. spaces for char strings, zero for numeric data types and current date for date data type)

DEFAULT TIME - Assign the current time to the integer data type

DEFAULT USER - Assign the user id of the session to the character string
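
A hedged sketch of these attributes in use (the table and column names are hypothetical):

CREATE TABLE Contact
	(ContactId INTEGER DEFAULT 0
	,FullName CHAR(30) DEFAULT 'Unknown'
	,AddedDate DATE WITH DEFAULT	-- system default: the current date
	,Phone CHAR(12))		-- nullable, no default: NULL is inserted
PRIMARY INDEX (ContactId);

INSERT INTO Contact DEFAULT VALUES;	-- inserts (0, 'Unknown', current date, NULL)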







^ - A circumflex character indicates that the character (or characters) that follow the circumflex in the regular expression must occur at the beginning of the source string.
Example: '^E' requires that the source string begin with an upper case E.
$ - A dollar sign character indicates that the preceding character (or characters) must occur at the end of the source string.
Example ':$' requires that the source string ends with a colon.
() - Parentheses enclose a group of characters that is to be treated as a single regular expression entity.
Example: '(ABC)+' would match one or more successive occurrences of ABC at the corresponding position in a source string.
[] - Square brackets enclose a subset of characters.
Example: '[0-9]' would match any single digit at the corresponding position in a source string.
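
These metacharacters can be exercised with Teradata's regular-expression functions; a hedged sketch (table and column names are hypothetical):

SELECT LastName
FROM Employee
WHERE REGEXP_SIMILAR(LastName, '^E.*', 'c') = 1;	-- last names beginning with an upper case E

SELECT REGEXP_SUBSTR(CustPhone, '[0-9]+')
FROM Customer;						-- first run of digits in the phone column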



















How many partitions can you have?
The maximum number of partitions is the same as a row-partitioned table: 9,223,372,036,854,775,807. The maximum number of column partitions is limited by the limit on number of columns (2048).









Teradata Advanced SQL Engine 17.10
A database provides an organized mechanism for storing, managing, and retrieving data. Most modern databases are made up of tables.

Types of Databases

A relational database is structured to recognize connections between stored items of information. Some of the more common relational databases are Teradata and Oracle. The standard user and application program interface to a relational database is the structured query language (SQL).

If the goal is only to retrieve information, a NoSQL database will likely suffice. DynamoDB and MongoDB are two well-known examples of NoSQL databases today.


Relational Database?

If the database is relational, which most databases are, it can cross-reference records in different tables. This means that you can create relationships between tables.
Cardinality refers to the number of rows in a table; degree refers to the number of columns.

The term schema refers to organizing the data as a blueprint that shows how the database is divided into tables. The formal definition of a database schema is a set of formulas or sentences called integrity constraints imposed on a database.



A foreign key is a column or group of columns in a relational database table that defines the relationship between the data in two tables. Foreign keys and their implementation require planning.

data model:

Data Models define how elements of data are connected or related.
As a data model is refined, it passes through different states which can be referred to as normal forms. Another familiar term is third normal form (3NF), a set of rules and guidelines about how the data model should look. In practice, this simply means which columns should belong to which tables.
There are three main types of data models:
  • Relational
  • Dimensional
  • Logical and Physical

Relational Data Model

It is the most common type of data model. A relational data model should reflect the business rules. 

Dimensional Data Models

A dimensional model is specially designed for reporting, as it provides navigation paths and aligns with the work processes of business users. A dimensional model emphasizes usability because the tables are:

  • Defined in a way that a business uses the data
  • Organized in a way that a given user or group of users can easily understand the data

Its purpose is to allow efficient, user-friendly filtering, sorting and summing of data specific to the needs of one or more business area(s).
Every dimensional model is composed of one table with a multipart key, called a fact table, and a set of smaller tables called dimension tables. Each dimension table has a single-part primary key that corresponds exactly to one of the components of the multipart key in the fact table.
Normalization vs. Denormalization

Normalization is the process of reducing a complex database schema into a simple, stable one. In order to normalize, a user must be able to uniquely identify each record. Generally, this process involves removing redundant attributes, keys, and relationships from the conceptual data model. Normalized data is optimized for entity level transactions.

Dimension denormalization supports dimensional modeling's twin objectives of speed and simplicity. De-normalized data is optimized for answering business questions and driving decision making.

Dimensional Model

In a dimensional model, fact tables always represent M:M (many to many) relationships. According to the model, a fact table should contain one or more numerical measures (the "facts" of the fact table) that occur for the combination of keys that define each tuple in the table.

DBMS

A DBMS is the application and system software used for creating and managing databases. The DBMS provides users and programmers with a systematic way to create, retrieve, update and manage data. A DBMS creates reports, enforces database rules and constraints, and maintains the database schema. Without a DBMS, a database is just a collection of bits and bytes. The most common type of database management system is called a Relational Database Management System (RDBMS).

SQL?

Relational Database Management Systems use Structured Query Language (SQL) as a database query language. Just as humans communicate with the help of language, SQL is the language that a database understands. SQL is used to create, manage, manipulate, and query database objects such as tables.

Data Warehouse

A data warehouse is a specially constructed data repository where data is organized so that it can be easily accessed by end users for various applications. Data warehouses are common in corporations where enterprise-wide detailed data is analyzed to make strategic and tactical business decisions. This is because data warehouses often carry many years’ worth of detailed data so historical trends may be analyzed. Data warehouses may begin as small in scope and purpose, but often grow quite large if their utility becomes more fully exploited by the enterprise.

Data Marts

Data marts can be independent of, dependent on, or part of a data warehouse. Data warehouses have an enterprise-wide depth. The information in data marts pertains to a single department.

Independent data marts are isolated entities entirely separate from the enterprise data warehouse. The independent data mart data is derived from independent sources and should be viewed as data pirates in the context of the enterprise data warehouse. The independent inputs, which are entirely separate from the enterprise data warehouse, have a high likelihood of producing data that does not match that of the warehouse.

Dependent data marts are derived from the enterprise data warehouse. Depending on how a dependent data mart is configured, it might or might not be useful. The recommended process uses only data that is derived from the enterprise data warehouse data store.  This permits users to have full access to the enterprise data store when the need to investigate enterprise-wide issues arises.

The logical data mart is a form of dependent data mart that is constructed virtually from the physical data warehouse. Data is presented to users of the mart using a series of SQL views that make it appear that a physical data mart underlies the data available for analysis.


What is Analytic Processing?

A data warehouse was originally defined as a decision support system (DSS). As earlier discussed, data warehouses are computer-based information systems that support business or organizational decision-making activities. OLAP and OLTP are sets of operations that can be performed on a data set.
OLAP (On-Line Analytic Processing) operations comprise the complex analyses that are performed on a data set within a data warehouse. These operations include data mining, querying, pivoting, slicing, dicing, drilling, reporting and other decision-support applications. A data warehouse can be used without OLAP by just running reports.


OLTP (On-Line Transaction Processing) manages business applications and collects the business’s day-to-day data like order entry, financial transactions, or sales. This is different than a data warehouse. OLTP accesses a small number of rows (or records), or a few tables, in a matter of seconds.


Reasons to Deploy a Data Warehouse

For companies to be successful they must make good business decisions. Making good business decisions requires businesses to have the ability to analyze their related, relevant business data. A data warehouse is an indispensable tool for businesses. While business data can be useful, it means little if it cannot be stored and analyzed in a useful way. A key driver of analytics is strategic queries – business questions that are intended to provide a strategic advantage through forecasting. Businesses can also learn more about their customers, transactions, and profits – studying these items to find patterns which can assist them in avoiding costly mistakes, enhance business productivity, and manage customer relationships.


Data Warehouse Architecture - Layers

The reference information architecture is broken down into three major data layers:

  • Acquisition - As the first point of entry into the data warehouse, raw data is acquired from various source systems such as mainframes, servers, etc.
  • Integration - The integration layer is primarily responsible for integrating data from multiple systems, both normalized and potentially de-normalized. It also may create common metrics and summaries which are widely used within an organization.
  • Access - The access layer's primary responsibility is to provide easy access to the data using various analytic methods.

Types of implementation possibilities

Only a few of the multiple types of implementations of a data warehouse are listed below
A centralized data warehouse is useful for small and mid-size data warehouses. It is a single physical repository that can contain organizational data for a specific utility area, department, branch, division or the whole organization. Centralized data warehouses serve the needs of several separate business units within an organization at the same time using a single data model.
A federated data warehouse shares information among a number of different systems. Critical master files will be shared, and the other systems will be able to use this information. An example of this would be adding financial information to advertising data. This will allow an organization to better market its products to customers. If the data is federated, organizations can reduce response time.
A data mart exists within a single organizational data warehouse repository. Ideally, it is a condensed and more focused version of the data warehouse that stores data dedicated to a specific business function or department within the organization. This subset of data may span across many, or all, of the organization’s functional subject areas. It is common for multiple data marts to be used in order to serve the needs of each individual business unit such as accounting, marketing, sales, etc.

Deployments Options

-->The on-premises deployment option involves buying the software and hardware from a data warehouse company. Some examples of these companies would be Teradata, Oracle, or IBM. The on-premises (sometimes abbreviated as on-prem) data warehouse is installed, runs, and is overseen on the premises of the organization using it. Benefits of this option are that it gives the organization total control, flexibility, accessibility, and predictability, and dependable performance. Some consider this option to also be the most secure and might even be required to use on-premises depending on the sensitivity of the organization’s data. Examples of this might be highly regulated markets such as healthcare or financial. There are also some countries that restrict access to internet content so on-premises would be the only viable solution.
-->If a company chooses to implement its data warehouse in the public or private cloud then it will run on a computing platform that belongs to the cloud provider, and access to it is granted, as a service, by a provider. Some examples of companies that offer this deployment option are Amazon (AWS), or Microsoft (Azure), or Google cloud platform. A public cloud solution can be quick to set up, is scalable, accessible, and easy to use. As mentioned, while discussing the on-premises solution, security might be an issue for some organizations depending on the market they are in. Trust issues are often the most difficult to overcome for IT executives when considering whether or not to move to a cloud-based data warehouse. One consideration that organizations need to take into account, is that cloud options require high bandwidth which may not be an option for some smaller organizations. A private cloud, such as Teradata’s IntelliCloud option, can provide a more custom, secure virtual environment than a public cloud.

Benefits of TD

-->Unlimited Parallelism means that every request is broken down into components which are all worked on at the same time (parallel). Parallelism is at the very heart of the Teradata Advanced SQL Engine and is built into the system from the ground up. There is virtually no part of the system where parallelism has not been built in.
-->Because the Teradata Advanced SQL Engine is highly scalable it is available everywhere.
-->The Teradata Advanced SQL Engine is highly scalable which makes it easy for customers to expand the system as they need to for their growing businesses.
-->
With its high availability, there is no single point of failure. The Teradata Advanced SQL Engine has the ability to make data available in the event of errors or failures, because both hardware and software provide fault tolerance. For example, fallback tables are duplicate copies of primary tables. Each fallback row in a fallback table is stored on an AMP different from the one to which the primary row hashes. This storage technique maintains availability should the system lose an AMP and its associated disk storage in a cluster. In that event, the system would access data in the fallback rows.

Partitioning

Partitioning is a database process where very large tables are divided into multiple smaller parts. By splitting a large table into smaller individual tables, queries that access only a fraction of the data can run faster because there is less data to scan.

Object Management

In order to manage the database objects, Teradata needs to store information about all of the objects. This information is referred as metadata. Metadata can include:

  • Means of creation of the data
  • Purpose of the data
  • Time and date of creation along with the creator and author of the data
  • Location on a computer network where the data was created
  • Standards used and file size

Data Dictionary

The data dictionary is owned by system user DBC and is composed of tables and views. The tables are reserved for use by the system and contain the metadata about database objects such as privileges, system events, and system usage. Data dictionary views provide access to the information in the tables.
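For example, a hedged sketch of querying a dictionary view (the database name is hypothetical):

SELECT DatabaseName, TableName, TableKind
FROM DBC.TablesV
WHERE DatabaseName = 'Sales_DB'
ORDER BY TableName;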

What is Data Distribution?
The Teradata Advanced SQL Engine uses hashing to dynamically distribute data across all AMPs. Hashing is the transformation of a string of characters into a usually shorter fixed-length value or key that represents the original string.
Teradata Advanced SQL Engine generates a row hash by hashing the values of the PI columns. The row hash and a sequence number, which is assigned to distinguish between rows with the same row hash within a table, are collectively called a row identifier and uniquely identify each row in a table.
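The hashing functions can be used to inspect how the rows of an existing table spread across AMPs; a hedged sketch (table and column names are hypothetical):

SELECT HASHAMP(HASHBUCKET(HASHROW(OrderId))) AS amp_no
	,COUNT(*) AS row_count
FROM Orders
GROUP BY 1
ORDER BY 2 DESC;	-- a heavily skewed column shows a few AMPs holding most of the rows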

Metadata

Metadata is stored in tables that belong to user DBC and is reserved for use by the system. This information is stored in the Data Dictionary. The Data Dictionary contains metadata about the objects in the system like privileges, system events, and system usage. Views provide access to the information in the tables.
Index

Indexes can provide an easier and faster way of accessing and locating data and thus reducing unwanted inputs and outputs.
The use of Unique Primary Index (UPI) and Unique Secondary Index (USI) ensures effective maintenance of uniqueness on the table.

Ideally, the rows of every large table will be distributed among all of the AMPs. There may be some circumstances where this is not true. A column-partitioned table or join index may have a primary AMP index as an alternative to having no primary index. A primary AMP index (PA) allows for single-AMP access when the values of the primary AMP index column(s) are specified or to provide better distribution of the rows.

 

Choosing a Primary Index for a table is perhaps the most critical decision a database designer makes. The choice will affect the distribution of the rows of the table and, consequently, the performance of the table in a production environment. Unique Primary Indexes (UPI's) are often desirable because they guarantee the uniform distribution of the rows of that table.

 

Because it is not always feasible to pick a Unique Primary Index, it is sometimes necessary to pick a column (or columns) which have non-unique values; that is, there are duplicate values. This type of index is called a Non-Unique Primary Index or NUPI. While not a guarantor of uniform row distribution, the degree of uniqueness of the index will determine the degree of uniformity of the distribution.

Improve Performance

Accessing data with equality on the Primary Index results in a one-AMP operation, which is extremely fast and efficient. Similar improvements can be seen with the use of other indexes.

Types and Usage

The Teradata Advanced SQL Engine has several types of indexes that are used to distribute data, retrieve data, and enhance performance.

-->A Primary Index (PI) is a column or set of columns that provide a mechanism for physical row storage and access. 


The PI is defined when the table is created and is there for two major reasons:

  • To determine data distribution
  • Enhance performance

-->A No Primary Index (NOPI) table is simply a table without a primary index. NOPI tables are typically used as staging tables for the initial data load.
-->
A Secondary Index (SI) can be used to enforce uniqueness on a column. The SI also provides an alternate path to access data.
-->
A join index is an indexing structure containing columns from multiple tables, specifically the resulting columns from one or more tables. The join Index (JI) is an optional index which may be created by the user to improve performance of complex queries.

There are three general ways to access a table; two of which have been covered already:
--Primary Index access (one AMP access)
--Full Table Scan (all AMP access)
The third option is via a Secondary Index access (two or all AMP access). As we discussed earlier, there are two types of secondary indexes USI (unique) and NUSI (non unique).

UPI

Guarantees uniform distribution of rows, has no duplicate values, and is made up of one or more columns.


--Efficient way to access data. Accessing data through a UPI is a one-AMP operation that leads directly to the single row with the desired UPI value

--The same value stored in the same data type will always produce the same hash value. If the Primary Index is unique, Teradata can distribute the rows evenly.


NUPI

Non-unique Primary Index

Can cause skewed data, allows duplicate values, and is made up of one or more columns.
--Accessing data through a NUPI is a one-AMP operation that may lead to multiple rows with the desired NUPI value. The system creates a spool file during a NUPI access, if needed. NUPI access is efficient if the number of physical block reads is small.


--If, however, the order number was used as a non-unique Primary Index (NUPI), then the row distribution among AMPs will possibly be uneven. All rows with the same primary index value (in other words, with the same order number) will be stored on the same AMP. If the Primary Index is slightly non-unique, that is, there are only four or five rows per index value; the table will still distribute evenly. But what if there are hundreds or thousands of rows for some index values? The rows with the same primary index value would all be sent to the same AMP resulting in skewed data.
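
A hedged sketch contrasting the two choices (table and column names are hypothetical):

-- UPI: unique order numbers guarantee even row distribution
CREATE TABLE Orders_UPI
	(OrderId INTEGER NOT NULL
	,CustId INTEGER)
UNIQUE PRIMARY INDEX (OrderId);

-- NUPI: duplicate customer ids hash to the same AMP and may skew the data
CREATE TABLE Orders_NUPI
	(OrderId INTEGER NOT NULL
	,CustId INTEGER)
PRIMARY INDEX (CustId);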

USI
Unique Secondary Index
Not involved in data distribution, has no duplicate values, enables data access via an alternate path.
--Efficient way to access data. Data access through a USI is always a two-AMP operation, which leads directly to the single row with the desired USI value.
USI retrieval will always retrieve one row at most, whereas the number of rows a NUPI will retrieve depends upon the selectivity of the PI value. The performance of a NUPI will depend upon how non-unique the PI is.

NUSI
Non-unique secondary index
One or more columns, may have duplicate values and is used to improve access.
--(NUSI) is efficient only if the number of rows accessed is a small percentage of the total data rows in the table. NUSI access is an all-AMPs operation since the NUSI subtables must be scanned on each AMP. It is a multiple rows operation since there can be many rows per NUSI value. A spool file will be created, if needed
PPI
Partition Primary Index
Can quickly improve the performance of complex range-based queries. Partitioned Primary Index (PPI) is an indexing mechanism that is useful in improving the performance of certain queries. When rows are inserted into a table, they are stored in an AMP and arranged by their row hash order. When a table is defined with PPI, the rows are sorted by their partition number. Within each partition, they are arranged by their row hash. Rows are assigned to a partition based on the partition expression defined.
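
A hedged sketch of a PPI definition (the table, column names, and date range are hypothetical):

CREATE TABLE Sales_PPI
	(store_id INTEGER NOT NULL
	,sales_date DATE NOT NULL
	,amount DECIMAL(10,2))
PRIMARY INDEX (store_id)
PARTITION BY RANGE_N(sales_date BETWEEN DATE '2024-01-01'
                                AND     DATE '2024-12-31'
                                EACH INTERVAL '1' MONTH);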
NOPI
No Primary Index
--Simply a table without a primary index. As rows are inserted into a NoPI table, they are appended at the end of the table and never inserted in the middle of a hash sequence. Retrievals are likely to be full table scans (FTS) unless other index or table structures are in place e.g. columnar
--Supports faster loads and inserts into tables with no primary index defined.

Data is effectively appended to an AMP in a NoPI table. In addition, a NoPI table:
--Has no duplicate row checking. Duplicate rows can exist in a NoPI table after a TPT LOAD operation.
--A table without a primary index. Data is distributed to the AMPs by methods other than the primary index (i.e., Random Generator).  
--Is useful in staging and in data labs when the data is not known and there is no obvious primary index.
--Is useful as a staging table and can be loaded quickly using Teradata Parallel Transporter (TPT) LOAD.
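
A minimal sketch of a NoPI staging table (names hypothetical):

CREATE TABLE Stage_Sales
	(store_id INTEGER
	,sales_date DATE
	,amount DECIMAL(10,2))
NO PRIMARY INDEX;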

Full table scan
A full table scan is efficient in that each row is scanned only once. Although index access is generally preferred to a FTS, there are cases where a FTS is the best way to access the data. Like the situation with NUPIs and USIs, full table scans can sometimes be more efficient than a NUSI. Since V2R5, a cylinder read retrieves all the data blocks in a cylinder with a single I/O. A FTS will only read each data block once, whereas a NUSI would read the index subtable and then access the data blocks; this is efficient only when the NUSI is nearly unique. Statistics give the optimizer the demographic information to make the right decision to select a FTS.

Secondary index:
Secondary indexes require at least two and possibly all AMPs, depending on the index and the type of operation. A benefit is that searches will typically be less expensive than a full table scan. However, secondary indexes add overhead to the table, both in terms of disk space and maintenance (due to subtables), but they may be dropped when not needed and recreated whenever they would be helpful.
A Secondary Index is like a Primary Index in that it allows the user to locate rows, but unlike a Primary Index it has no influence on the way rows are distributed among AMPs. A unique secondary index can be used to maintain uniqueness within a column or set of columns.
Secondary Indexes are used to improve performance by providing an alternate path to the rows of a table, allowing the user to avoid scanning the entire table. One table can have from 0 to 32 secondary indexes.

Locking

Teradata Database locks have two dimensions: Level and Severity. 
Locks are automatically acquired during the processing of a request and released at the termination of the request. In addition, users can specify locks.
Types
There are four types of locks in Teradata:
Exclusive - prevents any other type of concurrent access.
Write - prevents other reads, writes, and exclusive locks.
Read - prevents other writes and exclusive locks.
Access - prevents only exclusive locks.
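
Locks are normally acquired automatically, but a request can ask for a lower-severity lock explicitly with the LOCKING modifier. A common case is an ACCESS lock for a "dirty read"; for example (database and table names hypothetical):

LOCKING TABLE personnel.employee FOR ACCESS
SELECT last_name, first_name
FROM personnel.employee;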

Levels
Locks can be applied at four levels:
Database level - applies to all tables/views in the database.
Table/view level - applies to all rows in the table/view.
Partition level - applies to all rows within the specified partition(s).
Row hash level - applies to all rows that share the same row hash value (the row hash is the output of the hashing algorithm that determines which AMP stores a row).


Data Integration Tools

Teradata Parallel Transporter

Teradata Parallel Transporter, or TPT, is client software that performs data extraction, data transformation, and data loading functions in a scalable, parallel processing environment. Jobs in TPT are run using operators, which define the type of activity to be performed. The process flow is shown below.
Producer - producer operators extract data from a data source by reading the data and writing it to the data stream.
Consumer - consumer operators read the data from the data stream and write or load it to the target, which might be Teradata tables or an external data store.








Teradata Intelligent Memory
Whenever data on the very hot data list is read from disk during query processing, a copy is kept in Teradata Intelligent Memory's (TIM) extended memory area for future use.

Teradata Compression




Join index

TD:
Shared nothing - each AMP works only on its assigned portion of the database; nothing is shared.
Hashing - Teradata uses hashing to distribute data across the AMPs, and the Primary Index determines whether the distribution is even or not. Even distribution is desirable because, in operations that touch all rows (for example, a full table scan), each AMP carries an even share of the workload, which improves performance.


Hashing


Normalization
Normalization is a set of rules and a methodology for making sure that the attributes in a design are carried in the correct entity to map accurately to reality, eliminate data redundancy and minimize update abnormalities.

Referential Integrity

Referential integrity refers to the accuracy and consistency of data within a relationship and is a reliable mechanism to prevent accidental database corruption.


===========================
Performance tuning


SHOW TABLE, EXPLAIN, HELP STATISTICS
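
For example (database and table names hypothetical), these statements are a typical starting point when tuning a query:

SHOW TABLE personnel.employee;        -- returns the table's current DDL, including its indexes
EXPLAIN
SELECT *
FROM personnel.employee
WHERE department_number = 403;        -- returns the Optimizer's plan without running the query
HELP STATISTICS personnel.employee;   -- summarizes the statistics collected on the table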




===================================================================

Teradata Database Architecture Overview






























=======================================================================

 Structured Query Language (SQL) is the industry language for communicating with Relational Database Management Systems (RDBMS) 

In an RDBMS, data is stored in tables in which a row represents one instance of an entity and the columns represent the data fields that make up a row.

 

SQL is used to construct queries to communicate with the Teradata Vantage Advanced SQL Engine and define the answer set that is returned to the user.

-----------------------------------------------------

SQL STATEMENT TYPE


1. Data Definition Language (DDL) consists of all statements that support the definition of database objects.

 

DDL is used to create and modify database objects such as tables, views, macros, and databases, as well as to restructure the database.

DDL Statements:

 

CREATE

DROP

ALTER

RENAME

MODIFY

REPLACE

COLLECT STATISTICS

SET

BEGIN

END

FLUSH

CHECKPOINT

HELP

SHOW
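
A small, hypothetical DDL sequence:

CREATE TABLE payroll.employee_bonus
 (employee_number INTEGER
 ,bonus_amount    DECIMAL(10,2))
PRIMARY INDEX (employee_number);

ALTER TABLE payroll.employee_bonus ADD bonus_date DATE;

DROP TABLE payroll.employee_bonus;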


2. Data Manipulation Language (DML) consists of all SQL statements that support the manipulation and processing of database objects.

 

DML is used to work with data, including inserting data, updating existing data or performing queries on the data.

 

The DML subset will be the focal point of this fundamentals course.


DML Statements:

 

SELECT

INSERT

UPDATE

DELETE
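
Hypothetical examples of each DML statement:

SELECT last_name, salary_amount
FROM personnel.employee
WHERE department_number = 403;

INSERT INTO personnel.employee (employee_number, last_name) VALUES (1010, 'Chan');

UPDATE personnel.employee
SET salary_amount = salary_amount * 1.05
WHERE employee_number = 1010;

DELETE FROM personnel.employee
WHERE employee_number = 1010;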


3. Data Control Language (DCL) is a subset of the SQL language that defines and supports the security authorization for accessing database objects.

 

DCL statements are used for administrative tasks such as granting / revoking privileges or controlling object ownership.


DCL Statements:

 

 

GRANT

REVOKE

GIVE
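
Hypothetical examples (user and database names are made up):

GRANT SELECT, INSERT ON personnel.employee TO hr_user;
REVOKE INSERT ON personnel.employee FROM hr_user;
GIVE payroll TO hr_admin;   -- transfers ownership of the payroll database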


4. Query and Workload Analysis statements are used for collecting and analyzing query and data demographics.

 

Some statements in this category are also functionally aligned with the DDL SQL category.

 

For more information on SQL statements in this category, see references in help section.

SQL Statements in this category:

 

 

BEGIN/END QUERY LOGGING

SHOW QUERY LOGGING

COLLECT DEMOGRAPHICS

COLLECT STATISTICS

DROP STATISTICS

DUMP EXPLAIN

INITIATE INDEX ANALYSIS

INITIATE PARTITION ANALYSIS

INSERT EXPLAIN

RESTART INDEX ANALYSIS
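
For example (column and table names hypothetical):

BEGIN QUERY LOGGING ON ALL;   -- start logging query activity for all users
END QUERY LOGGING ON ALL;     -- stop logging
COLLECT STATISTICS COLUMN (department_number) ON personnel.employee;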


-------------------------------------------------------------

 HELP


The HELP command displays information about database objects (databases, tables, views, macros, etc.).

 

This information is pulled from the data dictionary and includes information such as object attributes, function parameters, and column data types.
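
Typical forms (object names hypothetical):

HELP DATABASE personnel;         -- lists the objects in the database
HELP TABLE personnel.employee;   -- lists the columns and their attributes
HELP SESSION;                    -- describes the current session settings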



-----------------------------------------------------------------------------------------------------------------

The SHOW command displays the current data definition language (DDL) of a database object.

 

It returns the last definition statement performed against the object (which may not be the same DDL used to initially create it).
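
Typical forms (object names hypothetical):

SHOW TABLE personnel.employee;
SHOW VIEW personnel.emp_v;
SHOW MACRO personnel.new_hire_report;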


------------------------------------------------

EXPLAIN
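
EXPLAIN precedes any SQL request and returns the Optimizer's execution plan in English text without actually executing the request; for example (names hypothetical):

EXPLAIN
SELECT last_name, first_name
FROM personnel.employee
WHERE department_number = 403;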


=========================================================

Introduction to the Teradata Database

The Teradata database is a Relational Database Management System (RDBMS) that drives a company's data warehouse.

Teradata Parallel Architecture

      
There are many reasons customers choose the Teradata database as the preferred platform for enterprise data warehousing:

  • Supports more warehouse data than all competitors combined. There are over 5400 1TB or larger warehouses in the field.
  • Supports scalability from small (10GB) to massive (50+PB) databases.
  • Provides a parallel-aware Optimizer that makes it unnecessary to perform complicated tuning to get a query to run efficiently.
  • Automatic data distribution eliminates complex indexing schemes and time-consuming reorganizations.
  • Designed and built on a parallel architecture from the start (not a parallel retrofit).
  • Supports ad-hoc queries using ANSI-standard Structured Query Language (SQL) which is used to communicate with an RDBMS. This allows Teradata to interface with 3rd party BI tools and submit queries from other database systems.
  • Single operational view of the entire MPP system via the Service Workstation (SWS) and single point of control for the DBA with Teradata Viewpoint.

Teradata Corporation has been doing data warehousing longer than any other vendor. The Teradata database was built to do Decision Support.

Relational Database Advantages

      
The Teradata database provides customers with unlimited, proven scalability. As the example on the previous page showed, the Teradata database can scale from 10 gigabytes to over 100 petabytes of data on a single system. When we talk about scalability, it isn't just about being able to grow very large and handle enormous amounts of data. It's about growth without losing any performance capabilities. This scalability provides investment protection for customer's growth and application development.

As is proven in the benchmarks we perform, the Teradata database can handle the most concurrent users, who are often running complex queries.

The parallelism of the Teradata database is unlimited. The Teradata database performs every task in parallel, including joins, sorts, and aggregations.

The optimizer is parallel-aware and the most robust in the industry, able to handle multiple complex queries, joins per query, and unlimited ad-hoc processing.

The Teradata File System manages data and disk space automatically. This is what makes the rebuilding of indexes unnecessary.

The Teradata database provides a low total cost of ownership (TCO) due to:

  • Ease of setup and maintenance
  • No reorganization of data needed
  • Automatic data distribution
  • Most robust utilities in the industry
  • Low cost of disk to data ratio
  • Ease in expanding the system

High availability is also a major advantage because with the Teradata database architecture, there is no single point of failure - fault tolerance is built-in to the system. The Teradata database is the only database that is truly linearly scalable, and this extends to data loading with the use of our parallel utilities.

Teradata Database Features

      
The Teradata database is a high-performance database system that processes enormous quantities of detail data that are beyond the capability of conventional systems.

The system is specifically designed for large data warehouses. From the beginning, the Teradata database was created to store, access and manage large amounts of data.

Hundreds of terabytes of storage capacity are currently available, making it an ideal solution for enterprise data warehouses and even for smaller data marts.

Parallel processing distributes data across multiple processors evenly. The system is designed such that the components divide the work up into approximately equal pieces. This keeps all the parts busy all the time and enables the system to accommodate a larger number of users and/or more data.

Open architecture adapts readily to new technology. As higher-performance industry standard computer chips and disk drives are made available, they are easily incorporated into the architecture. Teradata runs on industry standard operating systems as well.

Linear scalability enables the system to grow to support more users, data, queries, and query complexity without experiencing performance degradation. As the configuration grows, the performance increase is linear (a slope of 1).

The Teradata database currently runs as a database server on a variety of hardware platforms for single node or Symmetric Multi-Processor (SMP) systems, and on Teradata hardware for multi-node Massively Parallel Processing (MPP) systems.

What is a Database?

      
A database is a collection of permanently stored data that is used by an application or enterprise.

A database contains logically related data, which means that the database was created with a specific purpose in mind. A database supports shared access by many users. One characteristic of a database is that many people use it, often for many different purposes. A database is protected to control access and managed to retain its value and integrity.

One example of a database is payroll data that includes the names of the employees, their employee numbers, and their salary history. This database is logically related - it's all about payroll. It must have shared access, since it will be used by the payroll department to generate checks, and also by management to make decisions. This database must be protected; much of the information is confidential and must be managed to ensure the accuracy of the records.

The Teradata database is a relational database. Relational databases are based on the relational model, which is founded on mathematical Set Theory. The relational model uses and extends many principles of Set Theory to provide a disciplined approach to data management. Users and applications access data in an RDBMS using industry-standard SQL statements. SQL is a set-oriented language for relational database management.

Later in this course we will provide another definition of database that is specific to the Teradata database.

Logical/Relational Modeling

      
The logical model should be independent of usage. A variety of front-end tools can be accommodated so that the database can be created quickly.

The design of the data model is the same regardless of data volume.

An enterprise model is one that provides the ability to look across functional processes.

Normalization is the process of reducing a complex data structure into a simple, stable one. Generally this process involves removing redundant attributes, keys, and relationships from the conceptual data model.

Normalization theory is constructed around the concept of normal forms that define a system of constraints. If a relation meets the constraints of a particular normal form, we say that relation is "in normal form."

The intent of normalizing a relational database is to put one fact in one place. By decomposing your relations into normalized forms, you can eliminate the majority of update anomalies that can occur when data is stored in de-normalized tables.

A slightly more detailed statement of this principle would be the definition of a relation (or table) in a normalized relational database: A relation consists of a primary key, which uniquely identifies any tuple (an ordered set of values), and zero or more additional attributes, each of which represents a single-valued (atomic) property of the entity type identified by the primary key. The separator for each value is often a comma. Common uses for the tuple as a data type are:

  1. For passing a string of parameters from one program to another
  2. Representing a set of value attributes in a relational database

The Logical Model

  • Should be designed without regard to usage
    • Accommodates a wide variety of front end tools
    • Allows the database to be created more quickly
  • Should be the same regardless of data volume
  • Data is organized according to what it represents - real world business data in table (relational) form
  • Includes all the data definitions within the scope of the application or enterprise
  • Is generic - the logical model is the template for physical implementation on any RDBMS platform

Normalization

  • Process of reducing a complex data structure into a simple, stable one
  • Involves removing redundant attributes, keys, and relationships from the conceptual data model

Relational Databases

      
Relational databases are solidly founded on Set Theory and are based on the Relational Model, which is a set of principles for relational databases formalized by Dr. E.F. Codd in the late 1960s. The relational model of data prescribes how the data should be represented in terms of structure, integrity, and manipulation. The rules of the relational model are widely accepted in the information technology industry, though actual implementations may vary.

The key to understanding relational databases is the concept of the table, which is made up of rows and columns. The relational model provides the system with rules and a structure to build on. The relational terms we will be discussing in this course are independent of the database you may be running on.

A column is an attribute of the entity that a table represents. A column always contains like data, such as part name, supplier name, or employee number. In the example below, the column named LAST NAME contains last names and never anything else. Columns should contain atomic data, so a phone number might be divided into three columns (area code, prefix, and suffix) so that drill-down analysis can be performed. Column position in the table is arbitrary. Missing data values are represented by "nulls", or the absence of a value.

A row is one instance of all the columns of a table. It is a single occurrence of an entity in the table. In our example, all information about a single employee is in one row. The sequence of the rows in a table is arbitrary.

In a relational database, tables are defined as a named collection of one or more named columns that can have zero or many rows of related information. Notice that the information in each row of the table refers to only one person. There are no rows with data on two people, nor are there rows with information on anything other than people. This may seem obvious, but the concept underlying it is very important.

Each row represents an occurrence of an entity defined by the table. An entity is defined as a person, place, thing, or event about which the table contains information. In this case, the entity is the employee.

Note: In relational math we use the term:

  • Table to mean a relation.
  • Row to mean a tuple.
  • Column to mean an attribute.
EMPLOYEE (Relation) - each row is a single entity (one employee); each column is an attribute.

EMPLOYEE | MANAGER  | DEPARTMENT | JOB    | LAST     | FIRST   | HIRE   | BIRTH  | SALARY
NUMBER   | EMPLOYEE | NUMBER     | CODE   | NAME     | NAME    | DATE   | DATE   | AMOUNT
         | NUMBER   |            |        |          |         |        |        |
---------|----------|------------|--------|----------|---------|--------|--------|--------
1006     | 1019     | 301        | 312101 | Stein    | John    | 861015 | 631015 | 3945000
1008     | 1019     | 301        | 312102 | Kanieski | Carol   | 870201 | 680517 | 3925000
1005     | 0801     | 403        | 431100 | Ryan     | Loretta | 861015 | 650910 | 4120000
1004     | 1003     | 401        | 412101 | Johnson  | Darlene | 861015 | 560423 | 4630000
1007     |          |            |        | Villegas | Arnando | 870102 | 470131 | 5970000
1003     | 0801     | 401        | 411100 | Trader   | James   | 860731 | 570619 | 4785000

The employee table has:

  • Nine columns of data
  • Six rows of data - one per employee
  • Missing data values represented by nulls
  • Column and row order are arbitrary

Normalized Data Model vs. Star Schema Model

      
As a model is refined, it passes through different states which can be referred to as normal forms. A normalized model includes:

  • Entities - One record in a table
  • Attributes - Columns
  • Relationships - Between tables

First normal form (1NF) rules state that each and every attribute within an entity instance has one and only one value. No repeating groups are allowed within entities.

Second normal form (2NF) requires that the entity conform to the first normal form rules. Every non-key attribute within an entity is fully dependent upon the entire key (key attributes) of the entity, not a subset of the key.

Third normal form (3NF) requires that the entity conform to the first and second normal form rules. In addition, no non-key attribute within an entity may be functionally dependent upon another non-key attribute within the same entity.

While the Teradata database can support any data model that can be processed via SQL, an advantage of a normalized data model is the ability to support previously unknown (ad hoc) questions.

Star Schema

The star schema (sometimes referenced as star join schema) is the simplest style of data warehouse schema. The star schema consists of a few fact tables (possibly only one, justifying the name) referencing any number of dimension tables. The star schema is considered an important special case of the snowflake schema.

Some characteristics of a Star Schema model include:

  • They tend to have fewer entities
  • They advocate a greater level of denormalization

Primary Keys

      
Tables are made up of rows and columns and represent entities and attributes. Entities are the people, places, things, or events that the tables model. A Primary Key is required for every logical model version of a table, because each entity holds only one type of tuple (i.e., a row about a person, place, thing or event), and each tuple is uniquely identified within an entity by a Primary Key (PK).

Primary Key Rules

  • A Primary Key uniquely identifies each tuple within an entity. A Primary Key is required, because each tuple within an entity must be able to be uniquely identified.

  • No duplicate values are allowed. The Primary Key for the EMPLOYEE table is the employee number, because no two employees can have the same number.

  • Because it is used for identification, the Primary Key cannot be null. There must be something in that field to uniquely identify each occurrence.

  • Primary Key values should not be changed. Historical information, as well as relationships with other entities, may be lost if a PK value is changed or re-used.

  • Primary Key can include more than one attribute. In fact, there is no limit to the number of attributes allowed in the PK.

  • Only one Primary Key is allowed per entity.

Selecting a Primary Key

Each column of a row is called an attribute of that row. A database designer can select any attribute to be a Primary Key but, as a result of the rules listed above, many attributes will not qualify as a Primary Key candidate. For example, we could have selected Last Name as the PK of the EMPLOYEE table, but as soon as the company hires two people with the same last name, the PK is no longer unique. Even if we made the PK Last Name and First Name (possible since PKs can be made up of more than one column) we could still have two employees with the same name. Moreover, some employees may choose to change their last names.

Many data modelers recommend using system-assigned sequential integers for Primary Keys. This assures uniqueness and gives users control over Primary Key values, but it also takes up database space.

Primary Key (PK) values uniquely identify each entry (tuple) of an entity.

EMPLOYEE

EMPLOYEE | MANAGER  | DEPARTMENT | JOB    | LAST     | FIRST   | HIRE   | BIRTH  | SALARY
NUMBER   | EMPLOYEE | NUMBER     | CODE   | NAME     | NAME    | DATE   | DATE   | AMOUNT
(PK)     | NUMBER   |            |        |          |         |        |        |
---------|----------|------------|--------|----------|---------|--------|--------|--------
1006     | 1019     | 301        | 312101 | Stein    | John    | 861015 | 631015 | 3945000
1008     | 1019     | 301        | 312102 | Kanieski | Carol   | 870201 | 680517 | 3925000
1005     | 0801     | 403        | 431100 | Ryan     | Loretta | 861015 | 650910 | 4120000
1004     | 1003     | 401        | 412101 | Johnson  | Darlene | 861015 | 560423 | 4630000
1007     |          |            |        | Villegas | Arnando | 870102 | 470131 | 5970000
1003     | 0801     | 401        | 411100 | Trader   | James   | 860731 | 570619 | 4785000

Foreign Keys

      
Relational databases permit data values to associate across more than one entity. A Foreign Key (FK) value identifies entity relationships.

Below you will see that the employee table has three FK attributes, one of which models the relationship between employees and their departments. A second FK attribute models the relationship between employees and their jobs.

A third FK attribute is used to model the relationship between employees and each other. This is called a recursive relationship.

Foreign Key Rules

  • Duplicate values are allowed in a FK attribute.
  • NULLs are allowed in a FK attribute.
  • Values may be changed in a FK attribute.
  • Each FK must exist elsewhere as a Primary Key.

Note that Department_Number is the Primary Key for the DEPARTMENT entity.

Remember, these terms are not Teradata specific - they are just general relational concepts.

EMPLOYEE (partial listing)

EMPLOYEE | MANAGER  | DEPARTMENT | JOB    | LAST     | FIRST   | HIRE   | BIRTH  | SALARY
NUMBER   | EMPLOYEE | NUMBER     | CODE   | NAME     | NAME    | DATE   | DATE   | AMOUNT
(PK)     | NUMBER   | (FK)       | (FK)   |          |         |        |        |
         | (FK)     |            |        |          |         |        |        |
---------|----------|------------|--------|----------|---------|--------|--------|--------
1006     | 1019     | 301        | 312101 | Stein    | John    | 861015 | 631015 | 3945000
1008     | 1019     | 301        | 312102 | Kanieski | Carol   | 870201 | 680517 | 3925000
1005     | 0801     | 403        | 431100 | Ryan     | Loretta | 861015 | 650910 | 4120000
1004     | 1003     | 401        | 412101 | Johnson  | Darlene | 861015 | 560423 | 4630000
1007     |          |            |        | Villegas | Arnando | 870102 | 470131 | 5970000
1003     | 0801     | 401        | 411100 | Trader   | James   | 860731 | 570619 | 4785000

Foreign Key (FK) values model relationships.

DEPARTMENT

DEPARTMENT | DEPARTMENT               | BUDGET   | MANAGER
NUMBER     | NAME                     | AMOUNT   | EMPLOYEE
(PK)       |                          |          | NUMBER (FK)
-----------|--------------------------|----------|------------
501        | marketing sales          | 80050000 | 1017
301        | research and development | 46560000 | 1019
302        | product planning         | 22600000 | 1016
403        | education                | 93200000 | 1005
402        | software support         | 30800000 | 1011
401        | customer support         | 98230000 | 1003
201        | technical operations     | 29380000 | 1025
  • Foreign Keys (FK) are optional.
  • An entity may have more than one FK.
  • A FK may consist of more than one attribute.
  • FK values may be duplicated.
  • FK values may be null.
  • FK values may be changed.
  • FK values must exist elsewhere as a PK.
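
A sketch of how these rules can be declared (Teradata supports standard referential integrity constraints; the names and data types here are hypothetical):

CREATE TABLE department
 (department_number INTEGER NOT NULL PRIMARY KEY
 ,department_name   VARCHAR(30));

CREATE TABLE employee
 (employee_number   INTEGER NOT NULL PRIMARY KEY
 ,department_number INTEGER REFERENCES department (department_number)   -- FK: may be null, may repeat
 ,last_name         VARCHAR(30));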

Relational Advantages

      
Advantages of a Relational Database (physical and logical) compared to other database methodologies include:
  • Flexible
    Flexibility provides substantial benefits. The user does not need to know the access path to the data; the RDBMS keeps track of where everything is. Relational databases use atomic data - breaking data down into its smallest parts to allow maximum flexibility in selecting and using data. In addition, a single copy of the data can serve multiple purposes.

  • Responds quickly
    In a non-relational database, adding a field means that all programs that use the database must be rewritten to become aware of the new data structure. In a relational database, programs do not need to be re-written when a field is added.

  • Data-driven
    Relational databases are designed to represent a business and its practices - not the application or the computer system.

  • Business-oriented
    The two tables we have looked at, EMPLOYEE and DEPARTMENT, are organized to reflect the way the business really works.

  • Simple and easy to use and understand
    Simplicity is useful not only to the people who ask the questions, but also to the people who have to figure out how to retrieve information from the database. Understanding how an RDBMS functions is not necessary.

  • Easier to build applications
    Relational databases make the data do more work. Programs and transactions are simpler, which makes it easier to build applications.

  • Support the trend toward end-user computing
    The trend is moving away from organizations funneling all data requests through the IT people who know how the system works. As systems get easier to use, more people have access to them. This is called the "democratization of data."

  • Set Theory
    Set theory is the mathematical science of the infinite. It studies properties of sets, abstract objects that pervade the whole of modern mathematics. The language of set theory, in its simplicity, is sufficiently universal to formalize all mathematical concepts and thus set theory, along with predicate calculus, constitutes the true foundations of mathematics.

Evolution of Data Processing

      
Traditionally, data processing has been divided into two categories: On-Line Transaction Processing (OLTP) and Decision Support Systems (DSS). For either, requests are handled as transactions. A transaction is a logical unit of work, such as a request to update an account.

Traditional

A transaction is a logical unit of work.
  • On-Line Transaction Processing (OLTP)
    OLTP is typified by a small number of rows (or records) or a few of many possible tables being accessed in a matter of seconds or less. Very little I/O processing is required to complete the transaction. This type of transaction takes place when we take out money at an ATM. Once our card is validated, a debit transaction takes place against our current balance to reflect the amount of cash withdrawn. This type of transaction also takes place when we deposit money into a checking account and the balance gets updated. We expect these transactions to be performed quickly. They must occur in real time.

  • Decision Support Systems (DSS)
    Decision support systems include batch reports, which roll-up numbers to give business the big picture, and over time, have evolved. Instead of pre-written scripts, users now require the ability to do ad hoc queries, analysis, and predictive what-if type queries that are often complex and unpredictable in their processing. These types of questions are essential for long range, strategic planning. DSS systems often process huge volumes of detail data and rely less on summary data.

Today

Functional Trends Rankings
  • On-line Analytical Processing (OLAP)
    OLAP is a modern form of analytic processing within a DSS environment. OLAP tools (e.g., from companies like MicroStrategy and Cognos) provide an easy to use Graphical User Interface to allow "slice and dice" analysis along multiple dimensions (e.g., products, locations, sales teams, inventories, etc.). With OLAP, the user may be looking for historical trends, sales rankings or seasonal inventory fluctuations for the entire corporation. Usually, this involves a lot of detail data to be retrieved, processed and analyzed. Therefore, response time can be in seconds or minutes.

  • Data Mining (DM)
    DM (predictive modeling) involves analyzing moderate to large amounts of detailed historical data to detect behavioral patterns (e.g., buying, attrition, or fraud patterns) that are used to predict future behavior. An "analytic model" is built from the historical data (Phase 1: minutes to hours) incorporating the detected patterns. The model is then run against current detail data ("scoring") to predict likely outcomes (Phase 2: seconds or less). Likely outcomes, for example, include scores on likelihood of purchasing a product, switching to a competitor, or being fraudulent.
OLTP
  Examples: Update a checking account to reflect a deposit. A debit transaction takes place against the current balance to reflect money withdrawn at an ATM.
  Rows accessed: Small
  Response time: Seconds

DSS
  Examples: How many child size blue jeans were sold across all of our Eastern stores in the month of March? What were the monthly shoe sales for retailer X?
  Rows accessed: Large
  Response time: Seconds or minutes

OLAP
  Examples: Show the top ten selling items across all stores for 1997. Show a comparison of sales from this week to last week.
  Rows accessed: Large (detail rows) or moderate (summary rows)
  Response time: Seconds or minutes

Data Mining
  Examples: Which customers are most likely to leave? Which customers are most likely to respond to this promotion?
  Rows accessed: Moderate to large (detailed historical rows)
  Response time: Phase 1: minutes or hours; Phase 2: seconds or less


Row vs. Set Processing

      
Both cursor and set processing define a set (or sets) of rows to process, but while a cursor processes the rows sequentially, set processing operates on the entire set at once. Both can be expressed with a single command.

What is Row-by-Row Processing?

Where there are many rows to process, one row is fetched at a time and all calculations are done on it, then it is updated or inserted. Then the next row is fetched and processed. This is row-by-row processing and it makes for a slow program.

A benefit of row processing is that there is less lock contention.

What is Set Processing?

A lot of data processing is set processing, which is what relational databases do best.

Instead of processing row-by-row sequentially, you can process relational data set-by-set, without a cursor. For example, to sum all payment rows with balances of $100 or less, a single SQL statement completely processes all rows that meet the condition as a set.

With sufficient rows to process, this can be 10 to 30 or more times faster than row-at-a-time processing.

Some characteristics of set processing include:

  • An update with all AMPs involved
  • Single session processing which takes advantage of parallel processing
  • Efficient updates of large amounts of data.
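
A minimal sketch of the set-based example above (table and column names hypothetical); one statement processes every qualifying row, with no cursor loop:

SELECT SUM(balance)
FROM payment
WHERE balance <= 100;

UPDATE payment
SET late_flag = 'N'
WHERE balance <= 100;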

Response Time vs. Throughput

      
When determining how fast something is, there are two kinds of measures. You can measure how long it takes to do something or you can measure how much gets done per unit time. The former is referred to as response time, access time, transmission time, or execution time depending on the context. The latter is referred to as throughput.

Response Time

This speed measure is specified by an elapsed time from the initiation of some activity until its completion. The phrase response time is often used in operating systems contexts.

Throughput

A throughput measure is an amount of something per unit time. For operating systems, throughput is often measured as tasks or transactions per unit time. For storage systems or networks, throughput is measured as bytes or bits per unit time. For processors, the number of instructions executed per unit time is an important component of performance.

What Does this Mean to Teradata?

Response Time                            | Throughput
-----------------------------------------|-------------------------------------------------------------------
Measures the average duration of queries | Measures the quantity of queries completed during a time interval
A measure of process completion          | A measure of the amount of work processed
How long the processing takes            | How many queries were processed
The elapsed time per query               | The number of queries executed in an hour

In order to improve both response time and throughput on a Teradata system, you could increase CPU power (i.e., add nodes), implement workload management to control resources, and decrease the number of concurrent users.


The Advantage of Using Detail Data

      
Until recently, most business decisions were based on summary data.

The problem is that summarized data is not as useful as detail data and cannot answer some questions with accuracy. With summarized data, peaks and valleys are leveled when the peaks fall at the end of a reporting period and are cut in half, as shown in the example below.

Here's another example. Think of your monthly bank statement that records checking account activity. If it only told you the total amount of deposits and withdrawals, would you be able to tell if a certain check had cleared? To answer that question you need a list of every check received by your bank. You need detail data.

Decision support, answering business questions, is the real purpose of databases. To answer business questions, decision-makers must have four things:

  • The right data
  • Enough detail data
  • Proper data structure
  • Enough computer power to access and produce reports on the data

Strategic workloads tend to rely heavily on detail data.

Consider your own business and how it uses data. Is that data detailed or summarized? If it's summarized, are there questions it cannot answer?

QUESTION: How effective was the national advertisement for jeans that ran June 6 through June 8?

DETAIL DATA vs. SUMMARY DATA

  • Detail data gives a more accurate picture.
  • Correct business decisions result.

The Data Warehouse

      
A data warehouse is a central, enterprise-wide database that contains information extracted from the operational systems (this is where the saying "Load once, use many" comes from). A data warehouse has a centrally located logical architecture, using a single data store, which minimizes data synchronization, utilizes less space, and provides a single version of the truth.

Data warehouses have become more common in corporations where enterprise-wide detail data may be used in on-line analytical processing to make strategic and tactical business decisions. Warehouses often carry many years' worth of detail data so that historical trends may be analyzed using the full power of the data.

Many data warehouses get their data directly from operational systems so that the data is timely and accurate, providing less disparate results. While data warehouses may begin somewhat small in scope and purpose, they often grow quite large as their utility becomes more fully exploited by the enterprise.

The data warehouse carries a massive volume of detail data, so it is well suited to the parallel architecture of the Teradata database. Because the Teradata database has traditionally specialized in large-scale, decision-support queries against huge volumes of detail data, its use for data warehousing becomes an obvious solution.

The Teradata database can run in single-node environments with a database of a terabyte or less. As requirements grow, the single node can easily be expanded to multiple nodes with the Teradata database's linear scalability. In the multi-node environment, the Teradata database can scale upward to systems supporting more than 50 petabytes of data and process hundreds of concurrent sessions. With the power of multiple CPUs propelling multiple AMPs over multiple disks, getting answers from the detail data in the warehouse becomes fast and efficient.

The Teradata database supports normalized logical models because it can perform 128 joins in a single query as well as large aggregations during very complex queries using its robust parser and optimizer. A key Teradata database strength is the ability to model the customer's business. With a business model, data is organized to what it represents. The Teradata database can do Star Schema and other forms of relational modeling - but Third Normal Form is the methodology we recommend to our customers. The Teradata database also has:

  • A Priority Scheduler and workload management tools that prioritize resources among competing applications based on importance
  • A powerful suite of load and maintenance utilities
  • Numerous third-party strategic and alliance partners

A central, enterprise-wide database that contains information extracted from operational systems.

  • Based on enterprise-wide model
  • Can begin small but may grow large rapidly
  • Populated by extraction/loading of data from operational systems
  • Responds to users "what if" queries
  • Minimizes data movement/ synchronization
  • Provides a single view of the business

Data Marts

      
A data mart is a special-purpose subset of data used by a particular department, function, or application. Data marts may have both summary and detail data for a particular use rather than for general use. Usually the data has been pre-aggregated or transformed in some way to better handle the particular type of requests of a specific user community. Data marts are often implemented when a department wants to own its own data, when a quick ROI is required, or, in some cases, when canned queries may offer users better performance.

Data Mart Types

  • Independent Data Marts
    Independent data marts are created directly from operational systems, just as a data warehouse is. In the data mart, the data is usually transformed as part of the load process. Data might be aggregated, dimensionalized or summarized historically, as the requirements of the data mart dictate.

  • Logical Data Marts
    Logical data marts are not separate physical structures or a data load from a data warehouse, but rather are an existing part of the data warehouse. Because in theory the data warehouse contains the detail data of the entire enterprise, a logical view of the warehouse might provide the specific information for a given user community, much as a physical data mart would. Without the proper technology, a logical data mart can be a slow and frustrating experience for end users. With the proper technology (Teradata), it removes the need for massive data loading and transformation, making a single data store available for all user needs.

  • Dependent Data Marts
    Dependent data marts are created from the detail data in the data warehouse. While having many of the advantages of the logical data mart, this approach still requires the movement and transformation of data but may provide a better vehicle for some performance-critical user queries.

Data Models - Enterprise vs. Application

To build an EDW, an enterprise data model should be leveraged. An enterprise data model serves as a neutral data model that is normalized to address all business areas and not specific to any function or group, whereas an application model is built for a specific business area. The application data model only looks at one aspect of the business whereas an enterprise logical data model integrates all aspects of the business.

In addition, an enterprise data model is more extensible than an application data model. It is intended to encompass the entire enterprise.

Teradata Functional Overview

      
The Parsing Engine is responsible for:
  • Managing individual sessions (up to 120 sessions per PE)
  • Parsing and optimizing your SQL requests
  • Building query plans with the parallel-aware, cost-based, intelligent Optimizer
  • Dispatching the optimized plan to the AMPs
  • EBCDIC/ASCII input conversion (if necessary)
  • Sending the answer set response back to the requesting client


Major Components of the Teradata Database

      
We have looked at the overall node, and now we will describe the components that make up a node in detail.

Parallel Database Extensions (PDE)

The Parallel Database Extensions is a software interface layer that lies between the operating system and database. It controls the virtual processors (vprocs).

Parsing Engine (PE)

The Parsing Engine is a virtual processor (vproc) that interprets SQL requests, receives input records, and passes data. To do that, it sends the messages over the BYNET to the AMPs (Access Module Processor).

BYNET

The BYNET is the message-passing layer. It determines which AMP(s) should receive a message.

Access Module Processor (AMP)

The AMP is a virtual processor (vproc) designed for and dedicated to managing a portion of the entire database. It performs all database management functions such as sorting, aggregating, and formatting data. The AMP receives data from the PE, formats rows, and distributes them to the disk storage units it controls. The AMP also retrieves the rows requested by the Parsing Engine.

Disks

Disks are disk drives associated with an AMP that store the data rows. On current systems, they are implemented using a disk array.

What Happens When a Query is Submitted?

  1. The Parser evaluates the SQL for proper syntax.
  2. The Optimizer develops the least expensive plan.
  3. The BYNET passes the steps from the Dispatcher to the AMPs.


The Parsing Engine (PE)

      
The Parsing Engine (PE) is a virtual processor (vproc). It is made up of the following software components: Session Control, the Parser, the Optimizer, and the Dispatcher.

Once a valid session has been established, the PE is the component that manages the dialogue between the client application and the RDBMS. Each PE can handle up to 120 sessions, and each session can handle multiple requests.

Session Control

The major functions performed by Session Control are logon and logoff. Logon takes a textual request for session authorization, verifies it, and returns a yes or no answer. Logoff terminates any ongoing activity and deletes the session's context.

Parsing

The Parser interprets SQL statements, checks them for proper SQL syntax and evaluates them semantically. The Parser also consults the Data Dictionary to ensure that all objects and columns exist and that the user has authority to access these objects.

Optimizing

The Optimizer is responsible for developing the least expensive plan to return the requested response set. Processing alternatives are evaluated and the fastest plan is chosen. This plan is converted to executable steps, to be performed by the AMPs, which are then passed to the dispatcher. In order to maximize throughput and minimize resource contention, the optimizer must know about system configuration, available units of parallelism (number of AMPs and PE's), and data demographics of the tables in the query. The Teradata Database Optimizer is robust and intelligent. The optimizer enables the Teradata database to handle multiple complex, ad hoc queries efficiently. It is parallel-aware and cost-based.

Dispatching

The Dispatcher controls the sequence in which the steps are executed and passes the steps on to the AMPs via the BYNET. It is composed of execution-control and response-control tasks. Execution control receives the step definitions from the Parser and transmits them to the appropriate AMP(s) for processing, receives status reports from the AMPs as they process the steps, and passes the results on to response control once the AMPs have completed processing. Response control returns the results to the user. The Dispatcher sees that all AMPs have finished a step before the next step is dispatched. Depending on the nature of the SQL request, a step will be sent to one AMP, or broadcast to all AMPs.

Note: By default, Teradata Gateway software can support up to 1200 sessions per processing node. Therefore a maximum of 10 Parsing Engines can be defined for a node using the Gateway. It is possible to configure a node to support more than 1200 sessions.

The Parsing Engine is responsible for:

  • Managing individual sessions (up to 120 sessions per PE)
  • Parsing and optimizing your SQL requests
  • Building query plans with the parallel-aware, cost-based, intelligent Optimizer
  • Dispatching the optimized plan to the AMPs
  • EBCDIC/ASCII input conversion (if necessary)
  • Sending the answer set response back to the requesting client


The BYNET

      
The BYNET (pronounced, "bye-net") is a high-speed interconnect (network) that enables multiple nodes in the system to communicate.

The BYNET handles the internal communication of the Teradata database. All communication between PEs and AMPs is done via the BYNET.

When the PE dispatches the steps for the AMPs to perform, they are dispatched onto the BYNET. Messages are routed to the appropriate AMP(s) where results sets and status information are generated. This response information is also routed back to the requesting PE via the BYNET.

Depending on the nature of the dispatch request, the communication between nodes may be a:

  • Broadcast - message is routed to all nodes in the system.
  • Point-to-point - message is routed to one specific node in the system.

Once the message is on a participating node, PDE directs the message to the appropriate AMPs on that node. All AMPs receive a broadcast message. With a point-to-point or multicast (multiple AMPs) message, the message is directed only to the appropriate AMP(s) on that node by PDE.

So, while a Teradata database system does do multicast messaging, the BYNET hardware alone cannot do it - the BYNET can only do point-to-point and broadcast between nodes.

The BYNET has several unique features:

  • Fault tolerant: each network has multiple connection paths. If the BYNET detects an unusable path in either network, it will automatically reconfigure that network so all messages avoid the unusable path. Additionally, in the rare case that BYNET 0 cannot be reconfigured, hardware on BYNET 0 is disabled and messages are re-routed to BYNET 1, and vice versa.

  • Load balanced: traffic is automatically and dynamically distributed between both BYNETs.

  • Scalable: as you add nodes to the system, overall network bandwidth scales linearly - meaning an increase in system size without loss of performance.

  • High Performance: an MPP system has two BYNET networks providing hardware redundancy. Because both networks are active, the system benefits from the full aggregate bandwidth.

The technology of the BYNET is what makes the scalable parallelism of the Teradata database possible.

BYNET merges final answer set

Dual redundant, fault-tolerant, bi-directional interconnect network that enables:
  • Automatic load balancing of message traffic
  • Automatic reconfiguration after fault detection
  • Scalable bandwidth as nodes are added

The BYNET connects and communicates with all the AMPs and PEs in the system:

  • Between nodes, the BYNET hardware carries broadcast and point-to-point communications
  • On a node, BYNET software and PDE together control which AMPs receive a multicast communication
  • Merges the final answer set


The Access Module Processor (AMP)

      
The Access Module Processor (AMP) is the virtual processor (vproc) in the architecture that is responsible for managing a portion of the database. Each AMP will manage some portion of each table on the system. AMPs do the physical work associated with generating an answer set, including sorting, aggregating, formatting, and converting. The AMPs retrieve and perform all database management functions on the requested rows from a table.

An AMP accesses data from its single associated vdisk, which is made up of multiple ranks of disks. An AMP responds to Parser/Optimizer steps transmitted across the BYNET by selecting data from or storing data to its disks. For some requests, the AMPs may redistribute a copy of the data to other AMPs.

The Database Manager subsystem resides on each AMP. This subsystem will:

  • Lock databases and tables.
  • Create, modify, or delete definitions of tables.
  • Insert, delete, or modify rows within the tables.
  • Retrieve information from definitions and tables.
  • Return responses to the Dispatcher.

Earlier in this course, we discussed the logical organization of data into tables. The Database Manager subsystem provides a bridge between that logical organization and the physical organization of the data on disks. The Database Manager performs space-management functions that control the use and allocation of space.

The Teradata database performs all tasks in parallel, providing exceptional performance. The greater the number of tasks processed in parallel, the better the system performance. Many databases call themselves "parallel," but they can only perform some tasks in parallel.

Shared Nothing Architecture

The Teradata database virtual processors, or vprocs (which are the PEs and AMPs), share the components of the nodes (memory and cpu). The main component of the "shared-nothing" architecture is that each AMP manages its own dedicated portion of the system's disk space (called the vdisk) and this space is not shared with other AMPs. Each AMP uses system resources independently of the other AMPs so they can all work in parallel for high system performance overall.

AMPs perform all tasks in parallel

The AMP is responsible for:
  • Storing rows to and retrieving rows from its vdisk
  • Lock management
  • Sorting rows and aggregating columns
  • Join processing
  • Output conversion and formatting (ASCII, EBCDIC)
  • Creating answer sets for clients
  • Disk space management and accounting
  • Special utility protocols
  • Recovery processing

What is a Node?

      
Up to this point, we have discussed relational databases in terms of how the user sees them - as a collection of tables that relate to one another. Now we'll look at the physical components of the system.

A node is made up of various hardware and software components. All applications run under Linux or Windows, and all the Teradata database software runs under Parallel Database Extensions (PDE). All share the resources of CPU and memory on the node.

Access Module Processors (AMPs) and Parsing Engines (PEs) are virtual processors (vprocs) running under control of the PDE. Their numbers are software configurable.

AMPs are associated with virtual disks (vdisks), which are configured as ranks of a disk array.

The "Shared Nothing" architecture of the Teradata database means that each vproc is responsible for its own portion of the database, and for all activities to be performed on that part of the system.

All AMPs and PEs communicate via the BYNET, which is a message passing system. In an SMP (Symmetric Multi-Processing) system, it is implemented as a BYNET driver, and in MPP (Massively Parallel Processing) systems, it is implemented as a software and hardware solution. The BYNET allows multiple vprocs on multiple nodes to communicate with each other. PDE actually controls message-passing activity, while the BYNET handles message queuing and flow control.

An application that runs under the control of PDE, such as the Teradata database, is considered a Trusted Parallel Application (TPA).

Note: The versatility of the Teradata database is based on virtual processors (vprocs) that eliminate dependency on specialized physical processors. These vprocs are a set of software processes that run on a node under Parallel Database Extensions (PDE) and the multitasking environment of the operating system.

In a massively parallel processing (MPP) system, there is hardware redundancy among both the multiple nodes and the BYNET.



Parallelism

      
Parallelism is at the very heart of the Teradata database. There is virtually no part of the system where parallelism has not been built in. Without the parallelism of the system, managing enormous amounts of data would either not be possible or, best case, would be prohibitively expensive and inefficient.

Each PE can support up to 120 user sessions in parallel. This could be 120 distinct users, or a single user harnessing the power of all 120 sessions for a single application. Each session may handle multiple requests concurrently. While only one request at a time may be active on behalf of a session, the session itself can manage the activities of 16 requests and their associated answer sets.

The Message Passing Layer was designed such that it can never be a bottleneck for the system. Because the MPL is implemented differently for different platforms, this means that it will always be well within the needed bandwidth for each particular platform's maximum throughput.

Each AMP can perform up to 80 tasks (or more) in parallel. This means that AMPs are not dedicated at any moment in time to the servicing of only one request, but rather are multi-threading multiple requests concurrently. The value 80 represents the number of AMP Worker Tasks and may be changed on some systems. Because AMPs are designed to operate on only one portion of the database, they must operate in parallel to accomplish their intended results.

In truth, parallelism is built into the Teradata database from the ground up!

Notes:

  • Each PE can handle up to 120 sessions in parallel.
  • Each Session can handle multiple REQUESTS.
  • The Message Passing Layer can handle all message activity in parallel.
  • Each AMP can perform up to 80 tasks in parallel (can be configured for more).
  • All AMPs can work together in parallel to service any request.
  • Each AMP can work on several requests in parallel.

Teradata Compression Comparison

      

Multi-value Compression

Multi-value compression has three variations:
  • COMPRESS – Nulls are compressed.
  • COMPRESS NULL – Nulls are compressed. (COMPRESS & COMPRESS NULL mean the same thing)
  • COMPRESS <constant> – Nulls and the specified <constant> value are compressed.
With algorithmic compression, you must specify a UDF for compression and a UDF for decompression.

Example:

CREATE TABLE bank_accounts
 (customer_id INTEGER
 ,primary_bank INTEGER COMPRESS -- only NULL is compressed
 ,account_type CHAR(10) COMPRESS 'Savings' -- both NULL and 'Savings' are compressed
 ,account_notes VARCHAR(500) CHARACTER SET UNICODE
	COMPRESS USING TD_SYSFNLIB.TransUnicodeToUTF8
	DECOMPRESS USING TD_SYSFNLIB.TransUTF8ToUnicode -- algorithmic compression
 . . . );

Multi-value Compression vs. VARCHAR

For character-based data, an alternative to Teradata compression is the VARCHAR(N) data type. The number of bytes used to store each field is the length of the data item plus two bytes. Contrast this with a fixed-length CHAR(N) data type, which takes N bytes per row regardless of the actual number of characters in each field. Combining Teradata compression with a fixed-length character data type can be a very effective space saver. MVC can compress up to 255 distinct values per column.

Data demographics can help determine whether variable-length character data type or fixed length plus compression is more efficient. The most important factors are the maximum field length, the average field length, and the compressibility of the data.
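
To illustrate the trade-off, here is a minimal sketch that stores a city value both as a fixed-length CHAR with multi-value compression and as a VARCHAR; the table and column names are made up for this example:

CREATE TABLE customer_address
 (customer_id INTEGER
 ,city_fixed  CHAR(20) COMPRESS ('Chicago', 'Atlanta', 'San Diego') -- fixed length plus MVC
 ,city_var    VARCHAR(20)                                           -- variable length, 2-byte overhead per value
 );

Whether city_fixed or city_var uses less space depends on the demographics described above: the maximum length, the average length, and how often the compressed values actually occur.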

Algorithmic Compression

When column values are mostly unique, Algorithmic Compression can provide better compression results than Multi-value Compression. You can define your own compression and decompression algorithms and apply them to data at the column level.

A user can create their own compression algorithms as external C/C++ scalar UDFs, and then specify them in the column definition of a CREATE TABLE/ALTER TABLE statement. Teradata invokes these algorithms to compress and decompress the data when the data is inserted into or retrieved from the table.

ALC allows you to implement the compression scheme that is most suitable for data in a particular column. The cost of compression and decompression depends on the algorithm chosen.

Block Level Compression

Performs compression on whole data blocks at the file system level before they are written to storage.

  • BLC will compress/decompress only data blocks but will not be used on any file system structures such as Master/Cylinder indexes, the WAL log and table headers.
  • Only primary data subtable and fallback copies can be compressed. Both objects are either compressed or neither is compressed.
  • Secondary Indexes (USI/NUSI) cannot be compressed but Join Indexes can be compressed since they are effectively user tables.
  • Only the compressed form of the data block is cached, so each block access must decompress the data block.
  • On initial delivery, a single compression algorithm will be supplied and used.
  • Spool data blocks can be compressed via a system-wide tunable, as well as Temporary, WORK (sort Work space), and permanent journal.
  • Once enabled, reversion to an earlier release for compressed tables isn't allowed.
  • Teradata 14.0 added Temperature Based Block Level Compression (TBBLC) to provide greater compression flexibility while mitigating potential performance trade-offs.

Comparison of Multi-value Compression (MVC), Algorithmic Compression (ALC), and Block Level Compression (BLC/TBBLC):

Ease of Use
  • MVC: Easy to apply to well-understood data columns and values.
  • ALC: Easy to apply on a column with CREATE TABLE.
  • BLC: Set once and forget.

Analysis Required
  • MVC: Need to analyze data for common values.
  • ALC: Use Teradata algorithms or user-defined compression algorithms to match unique data patterns.
  • BLC: Need to analyze the CPU overhead trade-off (can be turned on for all data on the system or applied on a per-table basis).

Flexibility
  • MVC: Works for a wide variety of data and situations.
  • ALC: Automatically invoked for values not replaced by MVC.
  • BLC: Automatically combined with other compression mechanisms.

Performance Impact
  • MVC: No or minimal CPU usage.
  • ALC: Depends on the compression algorithm used.
  • BLC: Reduced I/O due to compressed data blocks (CPU cycles are used to compress/decompress).

Applicability
  • MVC: Replaces common values.
  • ALC: Industry data, UNICODE, Latin data.
  • BLC: All data.

A customer can choose any combination or all three on a column/table.


Layers of the Data Warehouse Architecture

      
There are several "layers" that make up the EDW environment. These layers include:

  • Staging Layer
    The primary purpose of the staging layer is to perform data transformation and loading during either the ETL or ELT process.

  • Access Layer
    Access is often provided via views and business intelligence (BI) tools; whether a Teradata application or a third party tool.

  • Physical Layer
    The physical layer is where denormalizations that will make access more efficient occur; pre-aggregations, summary tables, join indexes, etc.

  • Data Lab
    This is Teradata's implementation of a controlled "sandbox" environment. It enables rapid loading and analysis of new data combined with existing warehouse data. It quickly assimilates untested data into a "non-production" portion of the warehouse, providing a self-provisioning, self-service environment for prototyping and analysis on new, un-cleansed data, which is temporary in nature.

    Data Lab is composed of two Viewpoint portlets.

Teradata Data Lab - Interface

Teradata Database Objects

      
A "database" or "user" in Teradata database systems is a collection of objects such as tables, views, macros, triggers, stored procedures, user-defined functionsor indexes (join and hash). Database objects are created and accessed using standard Structured Query Language or SQL. Starting with Teradata Release 14.10, extended object names feature, allows object names to be up to 128 characters where prior it was a 30-byte limit.

All database object definitions are stored in a system database called the Data Dictionary/Directory (DD/D).

Databases provide a logical grouping for information. They are also the foundation for space allocation and access control.

Tables

A table is the logical structure of data in a relational database. It is a two-dimensional structure made up of columns and rows.

Views

A view is a pre-defined subset of one or more tables or other views. It does not exist as a real table, but serves as a reference to existing tables or views. One way to think of a view is as a virtual table. Views have definitions in the data dictionary, but do not contain any physical rows. Views can be used to implement security and control access to the underlying tables. They can hide columns from users, insulate applications from changes, and simplify or standardize access techniques. They also provide a mechanism to include a locking modifier. Views provide well-defined, tested, high-performance access to data. X-Views are special views that limit the display of data to the user who accesses them.
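
For instance, a view that standardizes access and includes a locking modifier might look like the following sketch (the view, table, and column names are hypothetical):

REPLACE VIEW Customer_V AS
LOCKING ROW FOR ACCESS      -- allow reads while updates are in progress
SELECT Cust, Name, Phone
FROM   Customer;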

Macros

A macro is a predefined, stored set of one or more SQL commands and optionally, report formatting commands. Macros are used to simplify the execution of frequently used SQL commands. Macros may also be used to limit user access to data.
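
A minimal macro sketch (the macro, table, and parameter names are hypothetical) that wraps a frequently used query:

CREATE MACRO Get_Customer (cust_no INTEGER) AS
 (SELECT Cust, Name, Phone
  FROM   Customer
  WHERE  Cust = :cust_no; );

EXEC Get_Customer (45);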

Triggers

A trigger is a set of SQL statements usually associated with a column or a table and when that column changes, the trigger is fired - effectively executing the SQL statements.

Stored Procedures

A stored procedure is a program that is stored within Teradata and executes within the Teradata database. A stored procedure uses permanent disk space and may be used to limit a user's access to data.
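
A hedged sketch of a simple stored procedure (procedure, table, and parameter names are assumed for illustration):

CREATE PROCEDURE Add_Customer
 (IN p_cust  INTEGER
 ,IN p_name  CHAR(10)
 ,IN p_phone CHAR(8))
BEGIN
  -- parameters are referenced with a leading colon inside the procedure body
  INSERT INTO Customer (Cust, Name, Phone)
  VALUES (:p_cust, :p_name, :p_phone);
END;

CALL Add_Customer (99, 'Nguyen', '555-1212');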

User Defined Functions (UDF)

A User-Defined Function (UDF) allows authorized users to write external functions. Teradata allows users to create scalar functions to return single value results, aggregate functions to return summary results, and table functions to return tables. UDFs may be used to protect sensitive data such as personally identifiable information (PII), even from the DBA(s).

X-Views limit the display of results to the user accessing them.

Teradata Database Space

      
There are three types of space within the Teradata database:

  1. Perm Space
    All databases and users have a defined upper limit of Permanent Space. Permanent Space is used for storing the data rows of tables. Perm Space is not pre-allocated. It represents a maximum limit.

  2. Spool Space
    All databases also have an upper limit of Spool Space. If there is no limit defined for a particular database or user, limits are inherited from parents. Theoretically, a user could use all of the unallocated space in the system for their query. Spool Space is unused Perm Space that is used to hold intermediate query results or formatted answer sets for queries. Once the query is complete, the Spool Space is released.

    Example: You have a database with total disk space of 100GB. You have 10GB of user data and an additional 10GB of overhead. What is the maximum amount of Spool Space available for queries?

    Answer: 80GB. All of the remaining space in the system is available for spool.

  3. Temp Space
    The third type of space is Temporary Space. Temp Space is used for global temporary tables, and these results remain available to the user until their session is terminated. Tables created in Temp Space will survive a restart. Temp Space is permanent space currently not used.

Space Management

      
The way the Teradata database handles space management is different from other database implementations.

The initial Teradata database system comes with several users and databases:

  • Users: DBC, SysAdmin, SystemFE, TDPuser
  • Databases: Crashdumps, Default, All, Public

Initially, most of the space in the system belongs to user DBC, with some allocated to each of the users or databases listed above. Database DBC contains all of the RDBMS software components and system tables.

Before defining application users and databases, the Database Administrator should first create a special administrative user and assign most of the space in the system to that user. This space comes from user DBC. User DBC becomes the owner of the administrative user.

There should be enough space left in user DBC to accommodate the growth of system tables and logs.

As the administrative user creates additional users and databases, space assigned to those objects will be subtracted from the administrative user's space. As these users and databases create subordinate objects, they will give up some of their space to these new users and databases.

The clear boxes represent all the space available in the system. The shaded boxes represent space that is being used in the system by tables. Space that is not currently being used by tables is available for use as spool or temporary space. If a table is dropped, that space becomes available to the system again.

Database 1 and Database 2 contain the production tables. Database 3 contains views and macros to control access to the data by end users and does not consume any database space. User D is an example of an end user who has no Perm Space but has been granted rights on certain views and macros to accomplish work. This user can still execute queries and create views or macros. User A, User B, and User C are examples of application developers who need some Perm Space to hold sample data for program testing and validation.

Most database vendors do not handle space this way. Once space is allocated to a table, it cannot be made available again without the Database Administrator having to perform a re-organization and re-allocate the space and partition the data.

  • A new database or user must be created from an existing database or user.
  • All Perm Space limits are subtracted from the owner.
  • Perm Space is a zero-sum game - the total of all Perm Space limits must equal the total amount of disk space available.
  • Perm Space currently not being used is available as Spool Space or Temp Space.
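
As a sketch of how this might look in practice (user names, passwords, and byte counts are purely illustrative), an administrator could carve space out of DBC like this:

CREATE USER SysDBA FROM DBC AS
  PERM     = 800e9          -- limit subtracted from DBC's Perm Space
 ,SPOOL    = 200e9
 ,PASSWORD = secret123;

CREATE DATABASE Database1 FROM SysDBA AS
  PERM = 100e9;             -- limit subtracted from SysDBA's Perm Space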

How Does the Teradata Database Distribute Rows?

      
Each AMP is designed to hold a portion of the rows of each table. An AMP is responsible for the storage, maintenance and retrieval of the data under its control. The Teradata database uses hashing to dynamically distribute data across all AMPs.

The Teradata database's automatic hash distribution eliminates costly data maintenance tasks. There is no specific order to the placement of the data. The benefits of having unordered data are that they don't need any maintenance to preserve order, and they are independent of any query being submitted. There is never a need to reorganize data or index space which makes the database easier to manage and maintain. As a result, strategic business data is more accessible to the users.

The benefits of automatic data placement include:

  • Distribution is the same regardless of data volume.
  • Distribution is based on row content, not data demographics.
  • Rows are moved or re-blocked automatically.

Ideally, the rows of every table will be distributed among all of the AMPs. There may be some circumstances where this is not true. For example, what if there are fewer rows than AMPs? In this case, at least some AMPs will hold no rows from that table. (This example should be considered the exception, and not the rule.)

Ideally, the rows of each table will be evenly distributed across all of the AMPs. Even distribution is desirable because in an operation involving all rows of the table (such as a full-table scan), each AMP will have an equal portion of work to do. When workloads are not evenly distributed, the desired response can only be as fast as the slowest AMP.

Remember: The Teradata database uses a Shared-Nothing Architecture - each AMP is responsible for its own portion of the database and does not share it with any other AMP.

Data Management

Many other databases require you to manually partition the data. Teradata automates the placement and maintenance of physical data location.

With the Teradata database, your DBA can spend more time with users developing strategic applications to beat your competition.

Loading, inserting, updating, and deleting data affects manual data distribution schemes, thereby reducing query performance and requiring reorganization.

Primary Keys and Primary Indexes

      
While it is true that many tables use the same columns for both Primary Indexes and Primary Keys, indexes are conceptually different from keys. The table below summarizes those differences.

A Primary Key is a relational data modeling term that defines, in the logical model, the columns that uniquely identify a row. A Primary Index is a physical database implementation term that defines the actual columns used to distribute and access rows in a table.

It is also true that a significant percentage of the tables in any database will use the same column(s) for both the PI and the PK. However, one should expect that in any real-world scenario there would be some tables that will not conform to this simplistic rule. Only through a careful analysis of the type of processing that will take place can the tables be properly evaluated for PI candidates. Remember, changing your mind about the columns that comprise the PI means recreating (and reloading) the table.

  • Indexes are conceptually different from keys.
  • A PK is a relational modeling convention which allows each row to be uniquely identified.
  • A PI is a Teradata convention which determines how the row will be stored and accessed.
  • A significant percentage of tables MAY use the same columns for both the PK and the PI.
  • A well-designed database WILL use a PI that is different from the PK for some tables.
Primary Key (PK)                                        | Primary Index (PI)
Logical concept of data modeling                        | Physical mechanism for access and storage
Teradata does not need to recognize the PK              | Each table can have (at most) one Primary Index
No limit on the number of columns                       | 64-column limit
Documented in the data model (optional in CREATE TABLE) | Defined in the CREATE TABLE statement
Must be unique                                          | May be unique or non-unique
Identifies each row                                     | Identifies 1 row (UPI) or multiple rows (NUPI)
Values should not change                                | Values may be changed (Delete + Insert)
May not be NULL - requires a value                      | May be NULL
Does not imply an access path                           | Defines the most efficient access path
Chosen for logical correctness                          | Chosen for physical performance



Primary Key (PK) vs. Primary Index (PI)

      
Primary Indexes are conceptually different from Primary Keys.

  • Primary Key (PK) is a relational data-modeling term that defines the attribute(s) used to uniquely identify a tuple in an entity.

  • Primary Index (PI) is a physical database implementation term that defines the column(s) used to distribute the rows in a table. The Primary Index column(s) value may be unique or non-unique.

  • Unique Primary Index (UPI) the values in the Primary Index column(s) are unique which results in even row distribution and the elimination of duplicate row checks during data load operations.

  • Non-Unique Primary Index (NUPI) the values in the Primary Index column(s) are non-unique which results in potentially skewed row distribution - all duplicate Primary Index values go to the same AMP.

Accessing a row by its Primary Index value is the most efficient way to access a row and is always a one-AMP operation. Choosing a Primary Index is not an exact science. It requires analysis and thought for some tables and will be completely self-evident on others. Sometimes the obvious choice will be the Primary Key, which is known to be unique. Sometimes the choice of Primary Index may have to do with join performance and known access paths, and will therefore be a different choice than the Primary Key of its source entity.

Example: PK vs. PI. An insurance claims table may contain the following columns: Claim ID, Date, Insurance Provider ID, Patient ID, and Claim Amount. The Primary Key is defined as Claim ID and Date. The same claim number may be submitted on one date and then resubmitted on another date, so you need both columns to look up a claim and its details. Other users may often be analyzing provider fraud behavior and need to access the claim by the Provider ID. In this case, the Primary Index may be the Provider ID column. Even if data distribution may be skewed, a choice like this might be made to provide efficient access to a table.
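
A sketch of that claims table (the identifiers are abbreviated and hypothetical), with the Primary Key documented as a constraint while the Primary Index is chosen for access:

CREATE TABLE Claims
 (Claim_Id     INTEGER NOT NULL
 ,Claim_Date   DATE    NOT NULL
 ,Provider_Id  INTEGER
 ,Patient_Id   INTEGER
 ,Claim_Amount DECIMAL(10,2)
 ,CONSTRAINT claims_pk PRIMARY KEY (Claim_Id, Claim_Date))  -- enforced as a unique secondary index
PRIMARY INDEX (Provider_Id);                                -- NUPI chosen for access, not uniqueness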

Primary Key (PK)                        | Primary Index (PI)
Logical concept of data modeling        | Mechanism for row distribution and access
Teradata does not need the PK defined   | A table can have one (or no) Primary Index
No limit on the number of attributes    | May be from 1 to 64 columns
Documented in the logical data model    | Defined in the CREATE TABLE statement
Must be unique                          | Value may be unique or non-unique
Unique identifier                       | Used to place a row on an AMP
Value should not change                 | Value may be changed (updated)
May not be NULL                         | May be NULL
Does not imply an access path           | Defines the most efficient access path
Chosen for logical correctness          | Chosen for physical performance


Primary Indexes

      
With the Teradata database, the Primary Index (PI) is the physical mechanism for assigning a data row to an AMP and a location on its disks. Indexes are also used to access rows from a table without having to search the entire table. A table may have at most one Primary Index, and the Primary Index cannot be changed (i.e., altered, added, or dropped) once the table is created.

Choosing a Primary Index for a table is perhaps the most critical decision a database designer makes. The choice will affect the distribution of the rows in the table and the performance of the table in a production environment. Although many tables use combined columns as the Primary Index choice, the examples we use here are single-column indexes.

There are two types of primary indexes; unique (UPI) and non-unique (NUPI).

Unique Primary Index (UPI) is a column that has no duplicate values. UPIs are desirable because they guarantee even distribution of table rows. With a UPI, there is no duplicate row checking done during a load, which makes it a faster operation.

Because it is not always feasible to pick a Unique Primary Index, it is sometimes necessary to pick a column (or columns) which have non-unique, or duplicate values. This type of index is called a Non-Unique Primary Index or NUPI. A NUPI allows duplicate values as well as access to the non-unique columns. A NUPI can cause skewed data. While not a guarantor of uniform row distribution, the degree of uniqueness of the index will determine the degree of uniformity of the distribution. Because all rows with the same PI value end up on the same AMP, columns with a small number of distinct values that are repeated frequently do not make good PI candidates.

Accessing the row by its Primary Index value is the most efficient way to access a row and is always a one-AMP operation. Choosing a Primary Index is not an exact science. It requires analysis and thought for some tables and will be completely self-evident on others. Sometimes the obvious choice will be the Primary Key, which is known to be unique. Sometimes the choice of Primary Index may have to do with join performance (enabling AMP-local joins to other tables) and known access paths (allowing local joins to similar entities), and will therefore be a different choice than the Primary Key of the table.

The Teradata database is unique in hashing data directly to a physical address on disk, creating an even distribution of data. This allows a balanced application of parallelism, and avoids imbalance due to data skew. Rows are hashed based on the Primary Index.

Defining the Primary Index

      
When a table is created, the Primary Index should be specified. The Primary Index may consist of a single column or a combination up to 64 columns. Columns are assigned data types and may optionally be assigned attributes and constraints.

Once you choose the Primary Index for a table, it cannot be changed. The existing table must be dropped and a new table created to change the Primary Index definition.

Note: While you cannot change the Primary Index itself, values in a Primary Index column may be changed. The Teradata database simply rehashes that row to its new location, based on the Primary Index value.

The table definition is stored in the Data Dictionary (DD).

UPI Table:

CREATE TABLE Table1
  ( Col1  INTEGER
   ,Col2  INTEGER
   ,Col3  INTEGER )
UNIQUE PRIMARY INDEX (Col1);

NUPI Table:

CREATE TABLE Table2
  ( Col1  INTEGER
   ,Col2  INTEGER
   ,Col3  INTEGER )
PRIMARY INDEX (Col2);

  • The Primary Index is the mechanism used to assign a row to an AMP.
  • The Primary Index may be Unique (UPI) or Non-Unique (NUPI).
  • Unique Primary Indexes result in even row distribution and eliminate duplicate row checking.
  • Non-Unique Primary Indexes result in row distribution proportional to the number of duplicate values, which may cause skewed distribution.

Row Distribution via Hashing

      
The diagram below gives an overview of Primary Index hash distribution, which is the process by which all data is distributed in the Teradata database.

The Primary Index value is fed into the Hashing Algorithm, which produces the Row Hash. The row goes onto the Communications Layer. The Hash Map, in combination with the Row Hash determines which AMP gets the row. The Hash Map is a part of the Communications Layer interface.

The Teradata database's hash partitioning is the only approach which offers balanced processing and balanced data placement, while minimizing interconnect traffic and freeing the DBA from time-consuming reorganizations. The Teradata File System eliminates the need to rebuild indexes when tables are updated or structures change.

Hashing Down to the AMPs

The Teradata database is unique in hashing data directly to a physical address on disk, creating an even distribution of data. This allows a balanced application of parallelism, and avoids imbalance due to data skew. The Teradata database's hashed data distribution scheme provides optimized performance with minimal tuning and no manual reorganizations, resulting in lower administration costs and reduced development time.

The rows of all tables are distributed across the AMPs according to their primary index value. The primary index value goes into the hashing algorithm. The output is a 32-bit row hash. The high-order 16 bits are referred to as the hash bucket number and are used to identify a hash-map entry. This entry is used to identify the AMP to be targeted. The remaining 16 bits are not used to locate the AMP.

Teradata provides customers with two choices for the size and number of the hash buckets in a system:

  • 16 bits (65,536 hash buckets, the same as in previous releases)
  • 20 bits (1,048,576 hash buckets)

For larger systems, such as those with 1,000 AMPs or more, the 20-bit option will result in more hash buckets to allow workloads to be distributed more evenly over the AMPs. New systems are also candidates for the expanded hash bucket option.

The entire 32-bit row hash will be used by the selected AMP to locate the row within its disk space.

Hash maps are uniquely configured for each size of system. A 50-AMP system will have a different hash map than a 30-AMP system.

A hash map is simply an array that associates bucket numbers with specific AMPs. When a system grows, new AMPs are typically added. Adding AMPs requires a change to the hash map to reflect the new total number of AMPs.
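
The hash functions can be queried directly to see how rows would spread across AMPs. The sketch below (the table and column names are assumed from the examples that follow) counts rows per AMP for a candidate primary index column:

SELECT HASHAMP(HASHBUCKET(HASHROW(Order_Number))) AS amp_no
      ,COUNT(*)                                   AS row_count
FROM   Orders
GROUP BY 1
ORDER BY 1;

A roughly equal row_count per AMP indicates even distribution; large differences indicate skew.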

UPI Row Distribution

      
The hashing algorithm predictably distributes rows across all AMPs. The same value stored in the same data type will always produce the same hash value. If the Primary Index is unique, the Teradata database can distribute the rows evenly. If the Primary Index is non-unique, but there are a fairly even number of rows per index value, the table will still distribute evenly. But if there are hundreds or thousands of rows for some index values, the distribution will probably be skewed.

In the example below, the Order_Number is used as a unique primary index. Since the primary index value for Order_Number is unique, the distribution of rows among AMPs is uniform. This assures maximum efficiency because each AMP does approximately the same amount of work. No AMPs sit idle waiting for another AMP to finish a task.

This way of storing the data provides for maximum efficiency and makes the best use of the parallel features of the Teradata database.

Order table

Order_Number (PK, UPI)  Customer_Number  Order_Date  Order_Status
7325                    2                4/13        O
7324                    3                4/13        O
7415                    1                4/13        C
7103                    1                4/10        O
7225                    2                4/15        C
7384                    1                4/12        C
7402                    3                4/16        C
7188                    1                4/13        C
7202                    2                4/09        C
  • Order_Number values are unique (UPI).
  • The rows will distribute evenly across the AMPs.


Unique Primary Index (UPI) Access

      
Accessing a table using a Primary Index is the most efficient way to access the data and is always a one-AMP operation. In the case of a UPI, the one-AMP access can return, at most, one row. In the example below, we are looking for the row whose primary index value is 45. By specifying the PI value as part of our selection criteria, we are guaranteed that only the AMP containing the specified row will be searched.

The correct AMP is located by taking the PI value and passing it through the hashing algorithm. Hashing takes place in the Parsing Engine. The output of the hashing algorithm contains information that will point to a specific AMP. Once it has isolated the appropriate AMP, finding the row is quick and efficient.

SELECT *
FROM  Customer
WHERE Cust = 45;

CREATE TABLE Customer
  ( Cust  INTEGER
   ,Name  CHAR(10)
   ,Phone CHAR(8) )
UNIQUE PRIMARY INDEX (Cust);
       

CUSTOMER table

Cust (PK, UPI)  Name    Phone
37              White   555-4444
98              Brown   333-9999
74              Smith   555-6666
95              Peters  555-7777
27              Jones   222-8888
56              Smith   555-7777
45              Adams   444-6666
84              Rice    666-5555
49              Smith   111-6666
51              Marsh   888-2222
31              Adams   111-2222
62              Black   444-5555
12              Young   777-4444
77              Jones   777-6666


NUPI Row Distribution

      
In the example below, Customer_Number has been used as a non-unique Primary Index (NUPI). Note that row distribution among AMPs is uneven. All rows with the same primary index value (with the same customer number) are stored on the same AMP.

Customer_Number has three possible values, so all the rows are hashed to three AMPs, leaving the fourth AMP without rows from this table. While this distribution will work, it is not as efficient as spreading all rows among all AMPs.

AMP 2 has a disproportionate number of rows and AMP 3 has none. In an all-AMP operation, AMP 2 will take longer than the other AMPs. The operation cannot complete until AMP 2 completes its tasks. Overall operation time is increased and some AMPs are under-utilized.

This illustrates how NUPIs can create irregular distributions, called skewed distributions. AMPs that have more than an average number of rows will take longer for full-table operations than other AMPs. Because an operation is not complete until all AMPs have finished, the operation where distribution is skewed will take longer than it would if all AMPs were utilized evenly.

Order table

Order_Number (PK)  Customer_Number (NUPI)  Order_Date  Order_Status
7325               2                       4/13        O
7324               3                       4/13        O
7415               1                       4/13        C
7103               1                       4/10        O
7225               2                       4/15        C
7384               1                       4/12        C
7402               3                       4/16        C
7188               1                       4/13        C
7202               2                       4/09        C
  • Customer_Number values are non-unique (NUPI).
  • Rows with the same PI value distribute to the same AMP causing skewed row distribution.


Non-Unique Primary Index (NUPI) Access

      
Accessing a table using a Non-Unique Primary Index (NUPI) is also a one-AMP operation. In the case of a NUPI, the one-AMP access can return zero to many rows. In the example below, we are looking for the rows whose primary index value is 555-7777. By specifying the PI value as part of our selection criteria, we are once again guaranteeing that only the AMP containing the required rows will need to be searched.

The correct AMP is located by taking the PI value and passing it through the hashing algorithm executed in the Parsing Engine. The output of the hashing algorithm will once again point to a specific AMP. Once it has isolated the appropriate AMP, it must find all rows that have the specified value. In the example, the AMP returns two rows.

SELECT *
FROM  Customer
WHERE Phone = '555-7777';

CREATE TABLE Customer
  ( Cust  INTEGER
   ,Name  CHAR(10)
   ,Phone CHAR(8) )
PRIMARY INDEX (Phone);
       

CUSTOMER table

Cust (PK)  Name    Phone (NUPI)
37         White   555-4444
98         Brown   333-9999
74         Smith   555-6666
95         Peters  555-7777
27         Jones   222-8888
56         Smith   555-7777
45         Adams   444-6666
84         Rice    666-5555
49         Smith   111-6666
51         Marsh   888-2222
31         Adams   111-2222
62         Black   444-5555
12         Young   777-4444
77         Jones   777-6666
72         Adams   666-7777
40         Smith   222-3333

Single AMP access with 0 to n rows returned.


Highly Non-Unique NUPI Row Distribution

      
The example below uses Order_Status as a NUPI. Order_Status is a poor choice, because it yields the most uneven distribution. Because there are only two possible values for Order_Status, all rows are placed on two AMPs.

This is an example of a highly non-unique Primary Index.

When choosing a Primary Index, never pick a column with a severely-limited value set. The degree of uniqueness is critical to efficiency. Choose NUPIs that allow all AMPs to participate fairly equally.

Order table

Order_Number (PK)  Customer_Number  Order_Date  Order_Status (NUPI)
7325               2                4/13        O
7324               3                4/13        O
7415               1                4/13        C
7103               1                4/10        O
7225               2                4/15        C
7384               1                4/12        C
7402               3                4/16        C
7188               1                4/13        C
7202               2                4/09        C
  • Order_Status values are highly non-unique (NUPI).
  • Only two values exist. The rows will be distributed to two AMPs.
  • This table will not perform well in parallel operations.
  • Highly non-unique columns are poor PI choices.
  • The degree of uniqueness is critical to efficiency.



Partitioned Primary Index (PPI)

      
The Teradata database has a partitioning mechanism called Partitioned Primary Index (PPI). PPI is used to improve performance for large tables when you submit queries that specify a range constraint. PPI allows you to reduce the number of rows to be processed by using a technique called partition elimination. PPI will increase performance for incremental data loads, deletes, and data access when working with large tables with range constraints. PPI is useful for instantaneous dropping of old data and rapid addition of new data. Using PPI lets you avoid having to do a full-table scan without the overhead of a Secondary Index.

How Does PPI Work?

Data distribution with PPI is still based on the Primary Index:

  • Primary Index >> Hash Value >> Determines which AMP gets the row

With PPI, the ORDER in which the rows are stored on the AMP is affected. Using the traditional method, a Non-Partitioned Primary Index (NPPI), the rows are stored in row hash order.

Using PPI, the rows are stored first by partition and then in row hash/row ID order. In our example, there are four partitions. Within each partition, the rows are stored in row hash order.

Data Storage Using PPI

To store rows using PPI, specify Partitioning in the CREATE TABLE statement. The query will run through the hashing algorithm as normal, and come out with the Base Table ID, the Partition number(s), the Row Hash, and the Primary Index values.

Partition Elimination is used to access tables with a PPI. The term "partition elimination" refers to an automatic optimization in which the optimizer determines, based on query conditions, that some partitions can't contain qualifying rows, and causes those partitions to be skipped. Partitions that are skipped for a particular query are called excluded partitions. Generally, the greatest benefit of a PPI table is obtained from partition elimination.

If the partitioning expression references any columns that aren't part of the primary index, then PI access may be slower. When all partitioning expression columns are contained in the PI, then PI access is unchanged.

The Orders table defined with a Non-Partitioned Primary Index (NPPI) on Order_Number (O_#).

Partitioned Primary Indexes:

  • Improve performance on range constraint queries
  • Use partition elimination to reduce the number of rows accessed
  • May use a CASE_N or RANGE_N partitioning expression
The Orders table defined with a Primary Index on Order_Number (O_#) Partitioned By Order_Date (O_Date) (PPI).
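
As an illustrative sketch only (the table, columns, and date range are assumed rather than taken from the figures), an Orders table partitioned by month might be defined and queried like this:

CREATE TABLE Orders
 (O_Number  INTEGER NOT NULL
 ,O_CustNum INTEGER
 ,O_Date    DATE NOT NULL
 ,O_Status  CHAR(1))
PRIMARY INDEX (O_Number)
PARTITION BY RANGE_N (O_Date BETWEEN DATE '2022-01-01'
                             AND     DATE '2024-12-31'
                             EACH INTERVAL '1' MONTH);

-- Partition elimination: only the partition for April 2024 is scanned.
SELECT *
FROM   Orders
WHERE  O_Date BETWEEN DATE '2024-04-01' AND DATE '2024-04-30';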


Multilevel Partitioned Primary Index (MLPPI)

      
A PPI can be sub-partitioned, increasing the opportunities for partition elimination at the various additional levels (or combinations of levels). With a multilevel PPI, you create multiple access paths to the rows in the base table that the Optimizer can choose from. This improves response to business questions by improving the performance of queries which take advantage of partition elimination.

  • Allows multiple partitioning expressions on a table or a non-compressed join index - including base tables, Global Temp tables, and Volatile tables.
  • Multilevel partitioning allows each partition at a level to be sub-partitioned. Each partitioning level is defined independently using a RANGE_N or CASE_N expression.
  • A multilevel PPI allows efficient searches by using partition elimination at the various levels or combination of levels.
  • Allows flexibility on which partitioning expression to use when there are multiple choices of partitioning expressions.
For example, an insurance claims table could be partitioned by claim date and then sub-partitioned by state. The analysis performed for a specific state (such as Connecticut) within a date range that is a small percentage of the many years of claims history in the data warehouse (such as June 2012) would take advantage of partition elimination for faster performance.

Note: an MLPPI table must have at least two partition levels defined.

Advantages of partitioned tables:

  • They reduce the I/O for range constraint queries.
  • They take advantage of dynamic partition elimination.
  • They provide multiple access paths to the data, and an MLPPI provides even more partition elimination and more partitioning expression choices (i.e., you can use last name or some other value that is more readily available to query on).
  • The Primary Index may be either a UPI or a NUPI; a NUPI allows local joins to other similar entities.
  • Row hash locks are used for SELECT with equality conditions on the PI columns.
  • Partitioned tables allow for fast deletes of data in a partition.
  • They allow for range queries without a secondary index.
  • Specific partitions may be archived or deleted.
  • May be created on volatile tables, global temp tables, base tables, and non-compressed join indexes.
  • May replace a Value Ordered NUSI for access.

Disadvantages of partitioned tables:

  • Rows in a partitioned table are eight bytes longer
  • Access via the Primary Index may take longer
  • Full table joins to an NPPI table with the same PI may take longer

Syntax:
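
A minimal sketch of a two-level MLPPI definition, combining RANGE_N and CASE_N; the table, columns, states, and date range are hypothetical:

CREATE TABLE Claims_MLPPI
 (Claim_Id     INTEGER NOT NULL
 ,Claim_Date   DATE    NOT NULL
 ,State_Code   CHAR(2) NOT NULL
 ,Claim_Amount DECIMAL(10,2))
PRIMARY INDEX (Claim_Id)
PARTITION BY (RANGE_N (Claim_Date BETWEEN DATE '2010-01-01'
                                  AND     DATE '2014-12-31'
                                  EACH INTERVAL '1' MONTH)
             ,CASE_N (State_Code = 'CT'
                     ,State_Code = 'NY'
                     ,NO CASE, UNKNOWN));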

What is a NoPI Table?

      
A NoPI Table is simply a table without a primary index.

Prior to Teradata database 13.0, Teradata tables required a primary index. The primary index was used to hash and distribute rows to the AMPs according to hash ownership. The objective was to divide data as evenly as possible among the AMPs to make use of Teradata's parallel processing. Each row stored in a table has a RowID which includes the row hash that is generated by hashing the primary index value. For example, the optimizer can choose an efficient single-AMP execution plan for SQL requests that specify values for the columns of the primary index.

Starting with Teradata database 13.0, a table can be defined without a primary index. This feature is referred to as the NoPI Table feature. NoPI stands for No Primary Index.

Without a PI, the hash value as well as AMP ownership of a row is arbitrary. Within the AMP, there are no row-ordering constraints and therefore rows can be appended to the end of the table as if it were a spool table. Each row in a NoPI table has a hash bucket value that is internally generated. A NoPI table is internally treated as a hashed table; it is just that typically all the rows on one AMP will have the same hash bucket value.
With a NoPI table:

  • INSERTS/UPDATES to the table are always appended at the end of the table
  • The table must always be Multi-set
  • There are no duplicate row checks performed upon load

If you use FastLoad to load rows into a NoPI table, the table cannot have secondary indexes, join indexes, CHECK constraints, or referential integrity constraints, because FastLoad cannot load a table defined with any of these features. You can, however, use FastLoad to load rows into unpartitioned NoPI tables with row-level security defined.

When FastLoading to a NoPI table, the rows are distributed a block at a time.

Example of creating a NoPI table:

CREATE TABLE table_x
 	(col_x INTEGER
 	,col_y CHAR(10)
 	,col_z DATE)
NO PRIMARY INDEX;


Teradata Columnar

      
Teradata Columnar, or Column Partitioning (CP), is a physical database design implementation option (introduced in Teradata 14.0) that allows single columns or sets of columns of a NoPI table to be stored in separate partitions. Column partitioning can also be applied to join indexes.

Columnar is simply a new approach for organizing the data of a user-defined table or join index on disk.

Columnar offers the ability to partition a table or join index by column. It can be used alone or in combination with row partitioning in multilevel partitioning definitions. Column partitions may be stored using traditional 'ROW' storage or alternatively stored using the new 'COLUMN' storage option. In either case, columnar can automatically compress physical rows where appropriate.

The key benefit in defining row-partitioned (PPI) tables is when queries access a subset of rows based on constraints on one or more partitioning columns. The major advantage of using column partitioning is to improve the performance of queries that access a subset of the columns from a table, either for predicates (e.g., WHERE clause) or projections (i.e., SELECTed columns).

Because sets of one or more columns can be stored in separate column partitions, only the column partitions that contain the columns needed by the query need to be accessed. Just as row-partitioning can eliminate rows that need not be read, column partitioning eliminates columns that are not needed.

The advantages of both can be combined, meaning even less data moved and thus reduced I/O. Fewer data blocks need to be read since more data of interest is packed together into fewer data blocks.

Columnar makes more sense in CPU-rich environments because CPU cycles are needed to "glue" columns back together into rows, for compression and for different join strategies (mainly hash joins).

Example of creating a Columnar table:

CREATE TABLE table_y
	(col_a CHAR(25) NOT NULL
	,col_b CHAR(25) NOT NULL
	,col_c DATE NOT NULL
	,col_d CHAR(7) NOT NULL
	,col_e INTEGER)
PARTITION BY COLUMN;


Teradata Columnar Benefits

      
Benefits of using the Teradata Columnar feature include:
  • Improved query performance: Column partitioning can be used to improve query performance via column partition elimination. Column partition elimination reduces the need to access all the data in a row while row partition elimination reduces the need to access all the rows.

  • Reduced disk space: The feature also allows for the possibility of using a new auto-compression capability which allows data to be automatically (as applicable) compressed as physical rows are inserted into a column-partitioned table or join index.

  • Increased flexibility: Provides a new physical database design option to improve performance for suitable classes of workloads.

  • Reduced I/O: Allows fast and efficient access to selected data from column partitions, thus reducing query I/O.

  • Ease of use: Provides simple default syntax for the CREATE TABLE and CREATE JOIN INDEX statements. No change is needed to queries.

Columnar Join Indexes

A join index can also be created as column-partitioned for either a columnar table or a non-columnar table. Conversely, a join index can be created as non-columnar for either type of table as well.

Sometimes within a mixed workload, some queries perform better when the data is not column partitioned and others perform better when it is. Or, perhaps some queries perform better with one type of partitioning on a table, whereas other queries do better with another type. Join indexes allow creation of alternate physical layouts for the data, with the optimizer automatically choosing whether to access the base table and/or one of its join indexes.

A column-partitioned join index must be a single-table, non-aggregate, non-compressed, join index with no primary index, and no value-ordering, and must include RowID of the base table. A column-partitioned join index may optionally be row partitioned. It may also be a sparse join index.

Secondary Indexes

      
A secondary index is an alternate path to the data. Secondary Indexes are used to improve performance by allowing the user to avoid scanning the entire table. A Secondary Index is like a Primary Index in that it allows the user to locate rows. It is unlike a Primary Index in that it has no influence on the way rows are distributed among AMPs. A database designer typically chooses a secondary index because it provides faster set selection.

There are three general ways to access a table:

  • Primary Index access (one AMP access)
  • Secondary Index access (two or all AMP access)
  • Full Table Scan (all AMP access)

Primary Index requests require the services of only one AMP to access rows, while secondary indexes require at least two and possibly all AMPs, depending on the index and the type of operation. A secondary index search will typically be less expensive than a full table scan.

Secondary indexes add overhead to the table, both in terms of disk space and maintenance; however they may be dropped when not needed, and recreated whenever they would be helpful.

  • A secondary index provides an alternate path to the rows of a table.
  • A secondary index can be used to maintain uniqueness within a column or set of columns.
  • A table can have from 0 to 32 secondary indexes (each may have 1 - 64 columns).
  • Secondary Indexes:
    • Do not affect table distribution.
    • Add overhead, both in terms of disk space (sub-tables) and maintenance.
    • May be added or dropped dynamically as needed.
    • Are only chosen to improve query performance.

Defining a Secondary Index

      
Just as with primary indexes, there are two types of secondary indexes – unique (USI) and non-unique (NUSI). Secondary Indexes may be specified at table creation or at any time during the life of the table. A secondary index may consist of up to 64 columns; however, to get the benefit of the index, the query must specify values for all of the index's columns.

Unique Secondary Indexes (USI) have two possible purposes. They can speed up access to a row which otherwise might require a full table scan without having to rely on the primary index. Additionally, they can be used to enforce uniqueness on a column or set of columns. This is the case when a PRIMARY KEY or UNIQUE constraint is defined and the table has a specified Primary Index or No Primary Index option. Making it a USI has the effect of enforcing the uniqueness of a PRIMARY KEY or UNIQUE constraint.

Non-Unique Secondary Indexes (NUSI) are usually specified in order to prevent full table scans. However, a NUSI does activate all AMPs – after all, the value being sought might well exist on many different AMPs (only the Primary Index value determines which AMP a row is stored on). If the optimizer decides that the cost of using the secondary index is greater than a full table scan, it opts for the table scan.

All secondary indexes cause an AMP subtable to be built and maintained as rows are inserted, updated, or deleted. Secondary index subtables consist of rows which associate the secondary index value with one or more rows in the base table. When the index is dropped, the subtable is physically removed.

A Secondary Index may be defined ...

  • at table creation (CREATE TABLE)
  • following table creation (CREATE INDEX)
  • may be up to 64 columns
USI (Unique Secondary Index)
If the index choice of column(s) is unique, it is called a USI.
Accessing a row via a USI is a two-AMP operation.

CREATE UNIQUE INDEX
(EmpNumber) ON Employee;

NUSI (Non-Unique Secondary Index)
If the index choice of column(s) is non-unique, it is called a NUSI.
Accessing row(s) via a NUSI is an all-AMP operation.

CREATE INDEX
(LastName) ON Employee;

Notes:

  • Creating a Secondary Index causes an internal sub-table to be built.
  • Dropping a Secondary Index causes the sub-table to be deleted.


Other Types of Secondary Indexes

      
There are a few other Secondary Indexes available with Teradata:

Sparse Index

Any join index, whether simple or aggregate, multi-table or single-table, can be sparse. A sparse join index uses a constant expression in the WHERE clause of its definition to narrowly filter its row population. This is known as a Sparse Join Index.

Hash Index

Hash indexes are used for the same purpose as single-table join indexes. Hash indexes create a full or partial replication of a base table with a primary index on a foreign key column to facilitate joins of very large tables by hashing them to the same AMP.

You can define a hash index on a single table only. Hash indexes are not indexes in the usual sense of the word. They are base tables that cannot be accessed directly by a query.

Value-Ordered NUSI

Value-ordered NUSIs are efficient for range conditions and conditions with an inequality on the secondary index column set. Because the NUSI rows are sorted by data value, it is possible to search only a portion of the index subtable for a given range of key values. Thus the major advantage of a value-ordered NUSI is in the performance of range queries.

Value-ordered NUSIs have the following limitations.

  • The sort key is limited to a single numeric column.
  • The sort key column cannot exceed four bytes.
  • They count as two indexes against the total of 32 non-primary indexes you can define on a base or join index table.
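
Hedged sketches of two of the index types above; the table, column, and index names are hypothetical:

-- Value-ordered NUSI: NUSI rows are sorted by Order_Date for efficient range access
CREATE INDEX (Order_Date)
ORDER BY VALUES (Order_Date)
ON Orders;

-- Sparse join index: only open orders are indexed
CREATE JOIN INDEX Open_Orders_JI AS
SELECT O_Number, O_CustNum, O_Date
FROM   Orders
WHERE  O_Status = 'O'
PRIMARY INDEX (O_CustNum);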

Full-Table Scans

      
A full table scan is another way to access data without using any Primary or Secondary Indexes.

In evaluating an SQL request, the Parser examines all possible access methods and chooses the one it believes to be the most efficient. The coding of the SQL request, the demographics of the table, and the availability of indexes all play a role in the Parser's decision. Some coding constructs, listed below, always cause a full table scan. In other cases, a full table scan might be chosen because it is the most efficient method. In general, if the number of physical reads would exceed the number of data blocks, the optimizer may decide that a full-table scan is faster.

With a full table scan, each data block is found using the Master and Cylinder Indexes and each data row is accessed only once.

As long as the choice of Primary Index has caused the table rows to distribute evenly across all of the AMPs, the parallel processing of the AMPs can complete the full table scan quickly. The file system keeps each table on as few cylinders as practical to help reduce the cost of full table scans.

While full table scans are impractical and even disallowed on some systems, the Vantage Advanced SQL Engine routinely executes ad hoc queries with full table scans.

Every row of the table must be read. All AMPs scan their portion of the table in parallel.

  • Fast and efficient on Teradata due to parallelism.
Full table scans typically occur when either:
  • An index is not used in the query
  • An index is used in a non-equality test

CUSTOMER

CustID (USI)  CustName  CustPhone (NUPI)

Examples of Full Table Scans:

SELECT * FROM Customer WHERE CustPhone LIKE '858-485-_ _ _ _ ';

SELECT * FROM Customer WHERE CustName = 'Lund';

SELECT * FROM Customer WHERE CustID > 1000;

Locks

      
Locking serializes access and prevents multiple users who are trying to update the same data at the same time from violating data integrity. This concurrency control is implemented by locking the desired data. Locks are automatically acquired during the processing of a request and released at the termination of the request. In addition, users can specify locks. There are four types of locks:
  1. Exclusive Locks - prevents any other type of concurrent access. Exclusive locks are applied only to databases or tables, never to rows. They are the most restrictive types of lock; all other users are locked out. Exclusive locks are typically used only when structural changes are being made to the database or table, such as changing the block size on a table, as in the case of a DDL statement.

  2. Write Locks - prevents other Read, Write, Exclusive locks. Write locks enable users to modify data while locking out all other users except readers not concerned about data consistency (access lock readers). Until a write lock is released, no new read or write locks are allowed. When you update a table without a WHERE clause, the system places a write lock on the table.

  3. Read Locks - prevents Write and Exclusive locks. Read locks are used to ensure consistency during read operations. Several users may hold concurrent read locks on the same data, during which no modification of the data is permitted.

  4. Access Locks - prevents Exclusive locks only. Users who are not concerned about data consistency can specify access locks. Using an access lock allows for reading data while modifications are in process. Access locks are designed for decision support on large tables that are updated only by small, single-row changes. Access locks are sometimes called stale read or dirty read locks. You may get stale data that hasn't been updated, or has been overwritten.

Lock Levels

Three levels of database locking are provided:

  1. Database - applies to all tables/views in the database. Locks all objects in the database.
  2. Table/View - applies to all rows in the table/view. Locks all rows in the table or view.
  3. Row Hash - applies to all rows with the same row hash. Locks all rows with the same row hash (primary and fallback rows, and secondary index subtable rows).

The type and level of locks are automatically chosen based on the type of SQL command issued. The user has, in some cases, the ability to upgrade or downgrade the lock.

  • SELECT - requests a Read lock
  • UPDATE - requests a Write lock
  • CREATE TABLE - requests an Exclusive lock
Lock requests may be upgraded or downgraded:

LOCKING TABLE Table1 FOR ACCESS . . .
LOCKING TABLE Table1 FOR EXCLUSIVE . . .
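
For example, a reporting query might downgrade its default Read lock to an Access lock; the table and column names here are illustrative:

LOCKING TABLE Table1 FOR ACCESS
SELECT Col1, COUNT(*)
FROM   Table1
GROUP BY 1;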


Data Protection Features

      
Facilities that provide system-level protection:
  • Disk Arrays
    • RAID data protection (e.g., RAID 1)
    • Redundant data buses and array controllers

  • Cliques and Vproc Migration
    • SMP or Linux failures; Vprocs can migrate to other nodes (or HSN) within the clique.
Facilities that provide Teradata database protection
  • Fallback - provides data access with a “down” AMP
  • Locks - provide data integrity
  • Transient Journal - automatic rollback of aborted transactions
  • Down AMP Recovery Journal - fast recovery of fallback rows for AMPs
  • Permanent Journal - optional before and after-image journaling
  • Unity - integrated portfolio that turns a multi-system environment into an orchestrated ecosystem
  • ARC - Archive/Restore facility
  • NetBackup and Tivoli - provide backup storage management, ARC script creation, and scheduling capabilities
Disk Arrays – Disk arrays provide RAID 1 or RAID 6 protection. If a disk drive fails, the array subsystem provides continuous access to the data.

Clique – a set of Teradata nodes that share a common set of disk arrays. In the event of node failure, all vprocs can migrate to another available node in the clique. All nodes in the clique must have access to the same disk arrays.

Locks – Locking prevents multiple users who are trying to change the same data at the same time from violating the data's integrity. This concurrency control is implemented by locking the desired data. Locks are automatically acquired and released during the processing of a request.

Fallback – protects your data by storing a second copy of each row of a table on an alternative “fallback AMP” and provides protection in the event of an AMP failure. Fallback provides AMP fault tolerance at the table level.

Down-AMP Recovery Journal – started automatically when the system has a failed or down AMP. Its purpose is to log any changes to rows which reside on the down AMP.

Transient Journal – the TJ is part of the WAL log and exists to permit the successful rollback of a failed transaction.

ARC/DSA and NetBackup/Tivoli – ARC and DSA command scripts provide the capability to backup and restore the Teradata database.

Transient Journal

      
The transient journal (or Online Transaction Journal) permits the successful rollback of a failed transaction (TXN). Transactions are not committed to the database until the AMPs have received an End Transaction request, either implicitly or explicitly. There is always the possibility that the transaction may fail. If so, data is returned to its original state after transaction failure.

The transient journal automatically maintains a copy on each AMP of before images of all rows affected by the transaction. In the event of a transaction failure, the before images are reapplied to the affected table, and a rollback operation is completed. In the event of transaction success, the before images for the transaction are discarded from the journal at the point of transaction commit.

Journal entries are made on and by the AMP making the change. Transient Journal activities are automatic and transparent to the user.

ALTER TABLE Misconception

An often repeated misconception is that when an ALTER TABLE command is executed, the “altered data” is written to the transient journal/WAL (Write-Ahead Logging or Write-Ahead Logic) log.

For example, assume an ALTER TABLE ADD COLUMN is executed. Assuming the altered table is a large table, this operation may be time consuming. Why?

The reason that ALTER TABLE ADD COLUMN may be time consuming is that each row is rebuilt, each block is rebuilt, and each cylinder is re-laid out. It does so one cylinder at a time in order to avoid having to make a full copy of the table. A new cylinder is allocated; each data block is read; the rows are reformatted and written to a new data block, which is placed on the new cylinder. When all the blocks are complete, the new cylinder is committed to disk. Then the old cylinder is released, and the process is repeated until all cylinders are completed. This work reformats the rows and blocks in the same way as an INSERT-SELECT.

The only thing written to the WAL log is a single record per cylinder as a cylinder is committed. This is used in case the ALTER TABLE is interrupted – e.g., by a Teradata restart. The last cylinder log record is used to restart the process from the last completed cylinder and complete it. This is also why an ALTER TABLE cannot be aborted once it has been started.
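
For reference, the kind of statement being discussed is simply the following (Customer and the new column name are placeholders):

ALTER TABLE Customer
  ADD MiddleInitial CHAR(1);
-- Every row, block, and cylinder of Customer is rebuilt, one cylinder at a time;
-- only one cylinder-commit record per cylinder is written to the WAL log.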

Fallback

      
Fallback protects your data by storing a second copy of each row of a table on an alternate, Fallback AMP in the same cluster. If an AMP fails, the system accesses the Fallback rows to meet requests. Fallback provides AMP fault tolerance at the table level. With Fallback tables, if one AMP fails, all table data is still available. Users may continue to use Fallback tables without any loss of available data.

During table creation or after a table is created, you may specify whether or not the system should keep a Fallback copy of the table. If Fallback is specified, it is automatic and transparent.

Fallback guarantees that the two copies of a row will always be on different AMPs in the same cluster. A cluster is a group of AMPs that act as a single Fallback unit. Clustering has no effect on primary row distribution of the table, but the Fallback row will always go to another AMP in the same cluster. If either AMP fails, the alternate row copy is still available on the other AMP.

There is a benefit to protecting your data, but there are costs associated with that benefit. With Fallback use, you need twice the disk space for storage and twice the I/O for INSERTs, UPDATEs, and DELETEs. (The Fallback option does not require any extra I/O for SELECT operations and the Fallback I/O will be performed in parallel with the primary I/O.)

The benefits of Fallback include:

  • Protects your data from an AMP failure. An AMP failure may occur because of physical disk failures - e.g., both disk drives in a RAID 1 drive group have failed or three drives in the same RAID 6 drive group. An AMP may also fail because of an AMP software failure or because access to the disk array has failed.
  • Protects your data from software (node) failure.
  • Automatically recovers with minimum recovery time, after repairs or fixes are complete.
  • May be specified at the table or database level
  • Permits access to table data during AMP off-line period
  • Adds a level of data protection beyond disk array RAID 1 & 5
  • Highest level of data protection is RAID 1 and Fallback
  • Critical for high availability applications

Loss of any two AMPs in a cluster causes the RDBMS to halt!
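
Fallback is requested in the CREATE TABLE options list or toggled later with ALTER TABLE. A minimal sketch, assuming a hypothetical Orders table:

CREATE TABLE Orders, FALLBACK
  ( OrderID   INTEGER NOT NULL,
    CustID    INTEGER,
    OrderDate DATE )
UNIQUE PRIMARY INDEX (OrderID);

ALTER TABLE Orders, NO FALLBACK;   -- remove the Fallback copy from an existing table
ALTER TABLE Orders, FALLBACK;      -- add it back (the Fallback subtable is rebuilt)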

Down AMP Recovery Journal

      
After the loss of any AMP, a down-AMP recovery journal is started automatically to log any changes to rows which reside on the down AMP. Any inserts, updates, or deletes affecting rows on the down AMP are applied to the Fallback copy within the cluster. The AMP that holds the Fallback copy logs the Row-ID in its recovery journal.

This process will continue until the down AMP is brought back on-line. As part of the restart activity, the recovery journal is read and changed rows are applied to the recovered AMP. When the journal has been exhausted, it is discarded and the AMP is brought on-line, fully recovered.

Recovery Journal is:

  • Automatically activated when an AMP is taken off-line
  • Maintained by other AMPs in the cluster
  • Totally transparent to users of the system
While AMP is off-line:
  • Journal is active
  • Table updates continue as normal
  • Journal logs Row-IDs of changed rows for down-AMP
When AMP is back on-line:
  • Restores rows on recovered AMP to current status
  • Journal discarded when recovery complete

Note: the Down-AMP Recovery Journal is used for Fallback tables only.

RAID Protection

      
There are many forms of disk array protection. The Teradata database supports the following protection schemes:

  • RAID 1
  • RAID 5
  • RAID S

RAID 1 (Mirroring) uses a disk-mirroring technique. Each physical disk is mirrored elsewhere in the array. This requires the array controllers to write all data to two separate locations, which means data can be read from two locations as well. In the event of a disk failure, the mirror disk becomes the primary disk to the array controller and performance is unchanged.

RAID 5 (Parity) uses a parity-checking technique. For every three blocks of data (spread over three disks), there is a fourth block on a fourth disk that contains parity information. This allows any one of the four blocks to be reconstructed by using the information on the other three. If two of the disks fail, the rank becomes unavailable. A rank is defined as a group of disks that have the same SCSI ID (the disks in the rank are not on the same channel, but are on different channels). The array controller does the recalculation of the information for the missing block. Recalculation will have some impact on performance, but at a much lower cost in terms of disk space.

RAID S uses a parity-checking technique similar to RAID 5 and is used with EMC disk arrays.

Teradata systems use disk arrays that employ dual redundant disk array controllers (DRDACs) and redundant disk drives. When a drive fails, the array switches to the mirror drive or uses parity to reconstruct the data.

Note: With Teradata databases, RAID 1 and RAID 5 are commonly used. Unless there is a reason not to do so, Teradata recommends that customers implement RAID 1.

Summary:

  • RAID 1 - Good performance with disk failures; higher cost in terms of disk space.
  • RAID 5 - Reduced performance with disk failures; lower cost in terms of disk space.

Hot Standby Nodes (HSN)

      
A Hot Standby Node (HSN) is a member of a clique that is not configured (initially) to execute any Teradata vprocs. If a node in the clique fails, Teradata restarts and the AMPs from the failed node move to the hot standby node. The performance degradation is 0%.

When the failed node is recovered/repaired and restarted, it becomes the new hot standby node. A second restart of Teradata is not needed.

Characteristics of a hot standby node are:

  • A node that is a member of a clique.
  • Does not normally participate in the trusted parallel application (TPA).
  • Will be brought into the configuration when a node fails in the clique.
  • Helps with unplanned outages.
  • Helps with planned outages
  • Eliminates the need for a restart to bring a failed node back into service.

Hot standby nodes are positioned as a performance continuity feature.


This example illustrates vproc migration using a Hot Standby Node.

  1. Performance Degradation is 0% as AMPs are moved to the Hot Standby Node.
  2. When Node 1 is recovered, it becomes the new Hot Standby Node.

Teradata Tools and Utilities Overview

      
Teradata has a robust suite of client utilities that enable users and system administrators to enjoy optimal response time and system manageability. Various client utilities are available for tasks from loading data to managing the system.

Teradata utilities leverage the Teradata database's high performance capabilities and are fully parallel and scalable. The same utilities run on smaller entry-level systems and on the largest MPP implementations.

Teradata database client utilities include the following, described in this section:

  • Query Submitting Utilities - Query tools that submit SQL queries
    • BTEQ
    • Teradata Studio Express

  • Load and Unload Utilities
    • Teradata Parallel Transporter (TPT)

  • Administrative Utilities - for managing workloads
    • Teradata Studio
    • Priority Scheduler
    • Database Query Log (DBQL)
    • Teradata Workload Analyzer
    • Teradata Integrated Workload Management (TIWM)
    • Teradata Active Systems Management (TASM)

  • Tools that support a Multi-System Environment
    • Viewpoint
    • Unity
    • Teradata Multi-Systems Manager (TMSM)
    • Data Mover

BTEQ

      
BTEQ stands for Basic Teradata Query utility. It is a SQL front-end tool that runs on all client platforms. It supports both interactive ad hoc and scripted batch queries and provides standard report writing and formatting functionality. It also provides a powerful import/export capability, even though other utilities are available for these functions as well, particularly when larger tables are involved. Before logging on with BTEQ, a user can set the number of sessions they need. Using multiple sessions may improve performance for certain operations; this feature exploits the parallelism of the Teradata database.

BTEQ uses the CLI application programming interface.

Key Points about BTEQ

  • Basic Teradata Query utility
  • SQL front-end
  • Report writing and formatting features
  • Interactive and scripted batch queries
  • Import/Export across all platforms
  • Supports ad-hoc querying
  • Supports table updates with import and export
  • Runs on all platforms

BTEQ Uses the CLI API
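
A minimal batch-style BTEQ script sketch (the TDP id, credentials, table, and file names are placeholders, not values taken from this material):

.SET SESSIONS 4
.LOGON mytdp/myuser,mypassword

.EXPORT REPORT FILE = customer_report.txt
SELECT CustID, CustName
FROM   Customer
ORDER  BY CustID;
.EXPORT RESET

.LOGOFF
.QUIT

Setting SESSIONS before the logon is what lets BTEQ open multiple parallel sessions, as described above.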


Teradata Parallel Transporter

      
Teradata Parallel Transporter (TPT) is the replacement for Teradata Warehouse Builder (TWB). This utility is effectively an object-oriented software system that executes multiple instances of data extraction, transformation, and load functions in a scalable, high-speed parallel processing environment.

TPT is scalable and enables end-to-end parallelism. The previous versions of utilities (like FastLoad) allow you to load data into Teradata in parallel, but you still have a single input stream. Teradata Parallel Transporter allows you to run multiple instances of the extract, optional transformation, and load. You can have as many loads as you have sources in the same job. With multiple sources of data coming from multiple platforms, integration is important in a parallel environment.

TPT eliminates the need for persistent storage. It stores data into data buffers so you no longer need to write data into a flat file.

TPT provides a single scripting language. You can do the extract, some transformation, and loads all in one SQL-like scripting language.

Once the dynamics of the language are learned, you can perform multiple tasks with a single script.

TPT supports FastLoad INMODs, FastExport OUTMODs, and Access Modules to provide access to all the data sources you use today. TPT also provides a Direct API interface that can be used by Third Party partners. They can effectively write 'C' code to directly load/unload data to/from Teradata.

It is effectively a parallel load and export utility. Characteristics of this utility include:

    High performance
    • Parallel Export and Load operators eliminate sequential bottlenecks.
    • Data Streams eliminate the overhead of intermediate (persistent) storage.
    • Scalable
    • End-to-end parallelism

    Easy to use

    • Single scripting language
    • Access to various data sources

    Extensible

    • Direct API enables Third Party partners to write 'C' code to directly load/unload data to/from Teradata.

TPT Operators

      
The operators are components that "plug" into the TPT infrastructure and actually perform the functions.

  • The FastLoad and FastExport operators support the current FastLoad and FastExport INMOD/OUTMOD features.
  • The DataConnector operator is an adapter for the Access Module or non-Teradata files.
  • The SQL Select and Insert operators submit the Teradata SELECT and INSERT commands.
  • The Load, Update, Export and Stream operators are similar to the FastLoad, MultiLoad, FastExport, and TPump utilities, but built for the TPT parallel environment.

The INMOD and OUTMOD adapters, DataConnector operator, and the SQL Select/Insert operators are included when you purchase the Infrastructure. The Load, Update, Export, and Stream operators are purchased separately.

Here is a comparison of the Teradata Parallel Transporter Operators with the classic utilities that we just covered.

TPT Operator | Teradata Utility | Description
LOAD | FastLoad | A consumer-type operator that uses the Teradata FastLoad protocol to load empty tables. Supports error limits, checkpoint/restart, multi-value compression, and PPI.
UPDATE | MultiLoad | Uses the Teradata MultiLoad protocol to update tables. This allows highly scalable and parallel inserts and updates to a populated table.
EXPORT | FastExport | A producer operator that emulates FastExport.
STREAM | TPump | Uses multiple sessions to perform DML transactions in near real-time.
DataConnector | N/A | Emulates the DataConnector API. Reads external data files, writes data to external data files, reads an unspecified number of data files.
ODBC | N/A | Reads data from an ODBC provider.
SQL INSERT | BTEQ | A consumer operator that inserts data into a table, similar to a BTEQ import operation.
SQL SELECT | BTEQ | A producer operator that submits SQL SELECT statements to retrieve data from a table.
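
To make the operator roles concrete, here is a heavily trimmed TPT script sketch in which a DataConnector producer feeds a Load consumer. All object names, credentials, and file attributes are hypothetical placeholders:

DEFINE JOB load_customer
DESCRIPTION 'Load the Customer table from a delimited file'
(
  DEFINE SCHEMA cust_schema
  ( CustID    VARCHAR(10),
    CustName  VARCHAR(30),
    CustPhone VARCHAR(12) );

  DEFINE OPERATOR file_reader          /* producer: reads the external file */
  TYPE DATACONNECTOR PRODUCER
  SCHEMA cust_schema
  ATTRIBUTES
  ( VARCHAR FileName      = 'customer.txt',
    VARCHAR Format        = 'Delimited',
    VARCHAR TextDelimiter = '|' );

  DEFINE OPERATOR load_op              /* consumer: FastLoad-style load into an empty table */
  TYPE LOAD
  SCHEMA *
  ATTRIBUTES
  ( VARCHAR TdpId        = 'mytdp',
    VARCHAR UserName     = 'myuser',
    VARCHAR UserPassword = 'mypassword',
    VARCHAR TargetTable  = 'Customer' );

  APPLY ('INSERT INTO Customer (:CustID, :CustName, :CustPhone);')
  TO OPERATOR (load_op)
  SELECT * FROM OPERATOR (file_reader);
);

Multiple instances of an operator can be requested in the APPLY/SELECT clauses (for example, file_reader[2]) so that several source files are read concurrently, which is the data parallelism described under TPT Architecture below.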


TPT Architecture

      
There are two types of parallelism used by TPT: pipeline parallelism and data parallelism.

Pipeline Parallelism

Pipeline parallelism (also called "functional parallelism") is achieved by connecting operator instances through data streams in a single job. A "producer" operator extracts data from a data source and writes it to the data stream; that is, it 'produces' a data stream. The "consumer" operator then 'consumes' the data from the data stream and starts writing it to the target.

There are two operators involved here, producer and consumer. Both operators can operate concurrently and independently of each other, thereby creating a pipeline effect without the need for intermediary files.


TPT Parallelism

      
Imagine you have the same job as described above, but it is handling a large quantity of data coming from several different input files. Each file can be handled by a different instance of the producer operator so that the files are read concurrently. The operator instances can then merge these files into a single data stream, which can then be input to multiple consumer operators.

Each source can be handled by a different instance of the producer operator, and these instances merge the files into a single data stream.

The data stream can then be input to multiple consumer operators.


===============================================================

Teradata SQL for Applications Development


Teradata Database has the ability to insert a row using only the DEFAULT VALUES keywords. For this feature to work successfully, one of the following statements must be true for each column of the table:


the column has defined default values

the column has a default system value specified

the column permits nulls

If none of these statements is true, an insert using DEFAULT VALUES will fail. Note that such an insert may be executed multiple times as long as no uniqueness attributes of the table are violated.



The SQL command INSERT INTO tablename DEFAULT VALUES;

  • Will insert defined default values into each column.
  • Will insert a null if no default is defined.
  • Will fail if no default is defined and a null is not allowed.


Some options for NULL and DEFAULT attributes:


NOT NULL - Nulls are not permitted for this column

DEFAULT n - Unless explicitly specified, assign the column a value of n. n must be of the same data type as the column.

Example: DEFAULT 'Unknown'


DEFAULT DATE 'xxxx-xx-xx' - Unless explicitly specified, assign the date xxxx-xx-xx

Example: DEFAULT DATE '2099-12-31'


WITH DEFAULT - Assign the system default (e.g. spaces for char strings, zero for numeric data types and current date for date data type)

DEFAULT TIME - Assign the current time to the integer data type

DEFAULT USER - Assign the user id of the session to the character string
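
Putting these options together, here is a minimal sketch; the Employee table and its columns are hypothetical:

CREATE TABLE Employee
  ( EmpID    INTEGER NOT NULL,            -- nulls not permitted, no default
    LastName CHAR(20) DEFAULT 'Unknown',  -- explicit default value
    HireDate DATE WITH DEFAULT,           -- system default (current date)
    Salary   DECIMAL(10,2) )              -- nullable, no default
UNIQUE PRIMARY INDEX (EmpID);

INSERT INTO Employee DEFAULT VALUES;
-- Fails: EmpID has no default and does not allow nulls.
-- If EmpID had a default (or allowed nulls), the other columns would receive
-- 'Unknown', the current date, and NULL respectively.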







  • ^ - A circumflex character indicates that the character (or characters) that follow the circumflex in the regular expression must occur at the beginning of the source string.
    Example: '^E' requires that the source string begin with an upper case E.
  • $ - A dollar sign character indicates that the preceding character (or characters) must occur at the end of the source string.
    Example: ':$' requires that the source string end with a colon.
  • () - Parentheses enclose a group of characters that is to be treated as a single regular expression entity.
    Example: '(ABC)+' would match one or more successive occurrences of ABC at the corresponding position in a source string.
  • [] - Square brackets enclose a subset of characters.
    Example: '[0-9]' would match any single digit at the corresponding position in a source string.
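
In Teradata SQL, patterns like these are typically passed to the regular-expression functions available in recent releases, such as REGEXP_SUBSTR. A hedged sketch; the Customer columns are placeholders:

SELECT CustName,
       REGEXP_SUBSTR(CustPhone, '[0-9]+') AS first_digit_run   -- first run of digits in the phone string
FROM   Customer
WHERE  REGEXP_SUBSTR(CustName, '^E') IS NOT NULL;              -- names beginning with an upper case E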












Data Models define how elements of data are connected or related


As a data model is refined, it passes through different states which can be referred to as normal forms. Another familiar term is third normal form (3NF). 3NF is a set of rules and guidelines about how the data model should look. In practice, this simply means deciding which columns should belong to which tables.


Broadly, there are three types of data models:


Relational

Dimensional

Logical and Physical


Normalization is the process of reducing a complex database schema into a simple, stable one. In order to normalize, a user must be able to uniquely identify each record. Generally, this process involves removing redundant attributes, keys, and relationships from the conceptual data model. Normalized data is optimized for entity-level transactions.


Dimension denormalization supports dimensional modeling's twin objectives of speed and simplicity. De-normalized data is optimized for answering business questions and driving decision making.


The Teradata Advanced SQL Engine performs space management differently from most other database implementations.

Types of Spaces

Generally, there are three types of space within the Teradata Advanced SQL Engine.





1. Permanent Space, also known as Perm space, is used for storing the data rows of tables and is the maximum storage assigned to a user and/or database. It is not pre-allocated or reserved ahead of time; it is available on demand, and space is deducted from the owner’s allocation as it is used.

2. Spool Space is unused Perm Space that holds intermediate query results or formatted answer sets for queries. Once the query is complete, the Spool Space is released. All databases have an upper limit of Spool Space. Theoretically, a user could use all the unallocated space in the system for their query.

3. Temporary Space is sometimes referred to as Temp Space and is used for global temporary tables. Tables created in Temp Space will survive a restart and remain available to the user until their session is terminated.
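
Perm, Spool, and Temp limits are assigned when a database or user is created. A minimal sketch with hypothetical names and sizes:

CREATE DATABASE Sandbox FROM DBC AS
  PERM      = 10E9,   -- maximum storage for table rows
  SPOOL     = 20E9,   -- upper limit for intermediate query results
  TEMPORARY = 5E9;    -- upper limit for global temporary tables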


What is Data Distribution?


The Teradata Advanced SQL Engine uses hashing to dynamically distribute data across all AMPs.


Hashing is the transformation of a string of characters into a usually shorter fixed-length value or key that represents the original string. The Teradata Advanced SQL Engine's hashed data distribution scheme provides optimized performance with minimal tuning and no manual reorganizations, resulting in lower administration costs and reduced development time.
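
The built-in hash functions can be used to see how evenly a primary index spreads rows across the AMPs. A sketch, assuming a hypothetical Orders table whose primary index is OrderID:

SELECT HASHAMP(HASHBUCKET(HASHROW(OrderID))) AS amp_no,   -- AMP that owns each row's hash value
       COUNT(*)                              AS row_count
FROM   Orders
GROUP  BY 1
ORDER  BY 2 DESC;

A roughly equal row_count per AMP indicates good distribution; a heavily skewed count points to a poor primary index choice.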

Teradata Database Locks


Locks have two dimensions: Level and Severity. 


Locks are automatically acquired during the processing of a request and released at the termination of the request. In addition, users can specify locks.


Types


There are four types of locks in Teradata:


EXCLUSIVE - Prevents any other type of concurrent access
WRITE - Prevents other reads, writes, and exclusive locks
READ - Prevents other writes and exclusive locks
ACCESS - Prevents Exclusive locks only