Teradata SQL



Teradata is a widely used Relational Database Management System (RDBMS) that is well suited for large data warehousing applications.

Features of Teradata

Here are some essential features of Teradata:


1) Linear Scalability

Teradata systems are linearly and highly scalable. They can handle large volumes of data efficiently and can be scaled up to a maximum of 2048 nodes, which enhances system performance.

2) Unlimited Parallelism

The architecture of Teradata is based on Massively Parallel Processing (MPP), which divides a large workload into smaller tasks that execute in parallel. This way of execution completes complex tasks quickly.

3) Mature Optimizer

The Teradata Optimizer is one of the most mature optimizers available in the market. It can handle up to 64 joins in a single query.

4) Shared Nothing Architecture

The Teradata architecture is also known as a Shared Nothing Architecture because the Teradata nodes, Access Module Processors (AMPs), and disks all work independently. These resources are not shared, so each one is fully dedicated to its own task, which enhances efficiency by decreasing execution time.

5) Connectivity

The parallel system of Teradata can connect to channel-attached systems such as mainframes, as well as to network-attached systems.

6) Low TCO

The Teradata system offers a low total cost of ownership, as it is easy to set up, maintain, and administer.

7) SQL

Teradata supports SQL for interacting with the data stored in tables, and it also provides its own extensions to the language.

8) Load & Unload utilities

Teradata provides load and unload utilities to move data into and out of the Teradata system.

9) Automatic Distribution

Teradata can efficiently distribute the data to its disks without any external intervention.

10) Robust Utilities

Teradata has many robust utilities to move data into and out of Teradata systems. FastLoad, MultiLoad, FastExport, and TPT are a few of them that help in importing and exporting data.

Components of Teradata

Teradata is an efficient, inexpensive, and high-quality Relational Database Management System.

Teradata is based on Massively Parallel Processing (MPP) architecture. It is made of Parsing Engine (PE), BYNET, Access Module Processors (AMPs), and other components such as nodes.

Below are some significant components of Teradata:


1. Parsing Engine: Parsing Engine is the base component that receives the queries from the clients and prepares the execution plan.

2. BYNET: This is the message-passing layer or simply the networking layer in Teradata. It receives the execution plan from the parsing engine and passes it to AMPs and the nodes. After that, it gets the processed output from the AMPs and sends it back to the parsing engine.

To maintain adequate availability, two BYNETs, BYNET 0 and BYNET 1, are provided. This ensures that a secondary BYNET is available in case the primary BYNET fails.

3. Access Module Processors (AMPs): These are the virtual processors of Teradata. They receive the execution plan and the data from the parsing engine. The data will undergo any required conversion, filtering, aggregation, sorting, etc., and will be further sent to the corresponding disks for storage.

Table rows are distributed across the AMPs for storage. Only the AMP that owns a disk has permission to read or write data on it.

4. Nodes: The basic unit of a Teradata system is called a Node. Each node has its own operating system, CPU, memory, a copy of the Teradata RDBMS software, and some disk space. A single cabinet can have one or more nodes in it.

Teradata Architecture

The architecture of Teradata is a Massively Parallel processing architecture. The Teradata system has four components.

  • Parsing Engine
  • BYNET
  • AMPs
  • Disk


Based on the principal functions of the Teradata system, the architecture can be categorized into two sections:

  1. Storage Architecture
  2. Retrieval Architecture

Storage Architecture

The storage architecture covers the Parsing Engine and the AMPs. When the client runs a query to insert records, the Parsing Engine sends the records to BYNET. BYNET forwards each row to the target AMP, and the AMP stores the rows on its disks.


1. Parsing Engine

When a user fires an SQL query, the session first connects to the Parsing Engine. Planning the query and distributing the data to the AMPs are done here, and the best plan for executing the query is chosen. The Parsing Engine performs the following processes:

  • Parser: The Parser checks the syntax of the query and, if it is valid, forwards the query to the Session Handler.
  • Session Handler: It does all the security checks, such as verifying login credentials and checking whether the user has permission to execute the query.
  • Optimizer: It finds out the best possible and optimized plan to execute the query.
  • Dispatcher: The Dispatcher forwards the query to the AMPs.

2. AMP

The Access Module Processor is a virtual processor that connects to the PE via BYNET. Each AMP has its own disk and is allowed to read and write only on that disk. This is called a 'Shared Nothing Architecture'.

When a query is fired, Teradata distributes the rows of the table across all the AMPs. An AMP executes an SQL request in the following steps:

  1. Lock the table.
  2. Execute the operation requested.
  3. End the transaction.

Retrieval Architecture

This section covers the remaining two components of the Teradata architecture, BYNET and the disks. When the client runs a query to retrieve records, the Parsing Engine sends a request to BYNET. BYNET sends the retrieval request to the appropriate AMPs. The AMPs then search their disks in parallel, identify the required records, and forward them to BYNET. BYNET sends the records to the Parsing Engine, which in turn sends them to the client.


1. BYNET

The BYNET acts as a channel between the PE and the AMPs. There are two BYNETs in Teradata: 'BYNET 0' and 'BYNET 1'.

  1. If one BYNET fails, the second one can take its place.
  2. When the data volume is large, both BYNETs can be made functional, which improves the communication between the PE and the AMPs, thus speeding up the process.

2. Disk

Teradata provides a set of virtual disks for each AMP. The storage area of each AMP is called a Virtual Disk, or vdisk. The following steps are involved in executing a query:

Step 1: The user submits a query, which is sent to the PE.

Step 2: PE checks the security and syntax, and finds out the best optimal plan to execute the query.

Step 3: The AMPs that hold the relevant table rows are contacted, and the data is retrieved from their disks.

Step 4: The AMP sends back the data through BYNET to PE.

Step 5: PE returns the data to the user.

Teradata Relational Concepts

Relational Database Management System is a DBMS software that helps to interact with databases. The relational database is a collection of objects in the form of tables, views, macros, stored procedures, and triggers.

Users interact with the data stored in the tables using Structured Query Language (SQL), either directly or through individually developed applications.

Relational Database Terminology

Relational databases are a generalization of the relations of mathematical set theory, although the correspondences between set theory and relational database terms are not always direct.

The following table shows the correspondence between set theory and relational database terms.

Set Theory Term      Relational Database Term
Relation             Table
Tuple                Row
Attribute            Column

Tables

Tables are the basic unit in RDBMS where the data is stored. Tables are two-dimensional objects consisting of rows and columns. Data is organized in tabular format and presented to the users of a relational database.

References between tables define the relationships and constraints of data inside the tables themselves.

For example, the student data is shown in the tabular format, such as:

Roll no     Student Name     Date of Birth
10001       Mike             1/2/1996
10002       James            5/1/1997
10003       Robert           15/7/1997
10004       Paul             16/2/1996
10005       Peter            5/9/1998

Table Constraints

We can define conditions that must be met before Teradata Database writes a given value to a column in a table. These conditions are called constraints.

Constraints can include value ranges, equality or inequality conditions, and intercolumn dependencies. Teradata Database supports constraints at both the column and table levels.

During table creation and modification, you can specify constraints on single-column values as part of a column definition or on multiple columns using the CREATE and ALTER TABLE statements.
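As a sketch of both kinds of constraint (the table, column, and constraint names below are hypothetical, not taken from the text above):

  CREATE TABLE Employee_Demo (
     Employee_Id INTEGER NOT NULL,
     Age         INTEGER CHECK (Age BETWEEN 18 AND 65),      -- column-level value-range constraint
     Base_Salary DECIMAL(10,2),
     Bonus       DECIMAL(10,2),
     CONSTRAINT bonus_chk CHECK (Bonus <= Base_Salary)       -- table-level (intercolumn) constraint
  )
  UNIQUE PRIMARY INDEX (Employee_Id);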

Columns

A column always contains the same kind of information or contains similar data.

For example, a student information table would have a column for date of birth, and nothing other than the student date of birth should be placed in that column.

Date of Birth
1/2/1996
5/1/1997
15/7/1997
16/2/1996
5/9/1998

Row

A row is one instance of all the columns in the table.

An entity is a person, place, or thing about which the table contains information. The table mentioned in the previous paragraphs includes information on the student entity. Each table holds only one kind of row.

The relational model requires that each row in a table be uniquely identified. To accomplish this, we define a uniqueness constraint to identify each row in the table.

For example, each row in the student table contains, among other things, the roll number, the student's name, and the date of birth for that student. The columns in a table represent entities, relationships, or attributes.

Roll no     Student Name     Date of Birth
10001       Mike             1/2/1996

Database

The database is a collection of logically related data. Many users access them for different purposes.

For example, a sales database contains complete information about sales, which is stored in many tables.

Primary Key

The primary key is used to identify a row in a table uniquely. No duplicate values are allowed in a primary key column, and they cannot accept NULL values. It is a mandatory field in a table.
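A minimal sketch of declaring a primary key (the table below is hypothetical):

  CREATE TABLE Student_PK (
     RollNo      INTEGER NOT NULL PRIMARY KEY,   -- unique, non-NULL identifier for each row
     StudentName VARCHAR(30)
  );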

Foreign Key

Foreign keys are used to build a relationship between tables. A foreign key in a child table references the primary key of the parent table.

A table can have more than one foreign key. It can accept duplicate values and also null values. Foreign keys are optional in a table.
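A minimal sketch of a foreign key, reusing the hypothetical Student_PK table from the sketch above:

  CREATE TABLE Enrollment (
     Enroll_Id INTEGER NOT NULL PRIMARY KEY,
     RollNo    INTEGER REFERENCES Student_PK (RollNo)   -- foreign key pointing at the parent table's primary key
  );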

Teradata Tables

Tables in the relational model are defined as collections of data represented as rows and columns.

Types of Table

Teradata supports the following types of tables:

  • Permanent Table: This is the default table, and it contains data inserted by the user and permanently stores the data.
  • Volatile Table: The data inserted into a volatile table is retained only during the user session. The table and data are dropped at the end of the session. These tables are mainly used to hold the intermediate data during data transformation.
  • Global Temporary Table: The definition of the Global Temporary table is persistent, but the data in the table is deleted at the end of the user session.
  • Derived Table: A derived table holds the intermediate results of a query. Its lifetime is within the query in which it is created, used, and dropped.

Set vs Multiset

Teradata classifies tables as SET or MULTISET tables based on how duplicate rows are handled.

A table defined as a SET table doesn't store duplicate rows, whereas a MULTISET table can store duplicate rows.
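A small sketch of the difference (the table names here are hypothetical):

  CREATE SET TABLE T_Set (col1 INTEGER) PRIMARY INDEX (col1);
  CREATE MULTISET TABLE T_Multi (col1 INTEGER) PRIMARY INDEX (col1);

  INSERT INTO T_Set   VALUES (1);
  INSERT INTO T_Set   VALUES (1);   -- rejected with a duplicate row error (SET table)
  INSERT INTO T_Multi VALUES (1);
  INSERT INTO T_Multi VALUES (1);   -- allowed: a MULTISET table keeps duplicate rows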

1. Create Table

CREATE Table command is used to create tables in Teradata.

Syntax

Following is the generic syntax of the CREATE TABLE statement.

  CREATE <SET/MULTISET> TABLE <Tablename>
  <Table Options>
  <Column Definitions>
  <Index Definitions>;
  • Table Options: It specifies the physical attributes of the table, such as Journal and Fallback.
  • Column Definition: It specifies the list of columns, data types, and attributes.
  • Index Definition: It is an additional indexing option such as Primary Index, Secondary Index, and Partitioned Primary Index.

Example

In the following example, we create a table called student with a FALLBACK option. The table contains four columns with rollno as the Unique Primary Index.

  CREATE SET TABLE STUDENT, FALLBACK (
     RollNo INTEGER,
     FirstName VARCHAR(10),
     LastName VARCHAR(10),
     DOB DATE FORMAT 'YYYY-MM-DD'
  )
  UNIQUE PRIMARY INDEX ( RollNo );

Once the table is created, we can use the SHOW TABLE command to view the Definition of the table.

  SHOW TABLE Student;

  CREATE SET TABLE STUDENT, FALLBACK,
     NO BEFORE JOURNAL,
     NO AFTER JOURNAL,
     CHECKSUM = DEFAULT,
     DEFAULT MERGEBLOCKRATIO
     (
        RollNo INTEGER,
        FirstName VARCHAR(10) CHARACTER SET LATIN NOT CASESPECIFIC,
        LastName VARCHAR(10) CHARACTER SET LATIN NOT CASESPECIFIC,
        DOB DATE FORMAT 'YYYY-MM-DD'
     )
  UNIQUE PRIMARY INDEX ( RollNo );

2. Alter Table

ALTER table command is used to add or drop columns from an existing table. We can also use an ALTER table command to modify the attributes of the existing columns.

Syntax

Following is the generic syntax for ALTER table.

  ALTER TABLE <tablename>
  ADD <columnname> <column attributes>
  DROP <columnname>;

Example

In the following example, we drop the column DOB and add a new column BirthDate.

  ALTER TABLE STUDENT
  ADD BirthDate DATE FORMAT 'YYYY-MM-DD',
  DROP DOB;

We can run the SHOW TABLE command to view the changes to the table. In the following output, we can see that the column DOB is removed, and BirthDate is added.

  SHOW TABLE student;

  CREATE SET TABLE student, FALLBACK,
     NO BEFORE JOURNAL,
     NO AFTER JOURNAL,
     CHECKSUM = DEFAULT,
     DEFAULT MERGEBLOCKRATIO
     (
        RollNo INTEGER,
        FirstName VARCHAR(10) CHARACTER SET LATIN NOT CASESPECIFIC,
        LastName VARCHAR(10) CHARACTER SET LATIN NOT CASESPECIFIC,
        BirthDate DATE FORMAT 'YYYY-MM-DD'
     )
  UNIQUE PRIMARY INDEX ( RollNo );

3. Drop Table

DROP TABLE command is used to drop a table. When the DROP TABLE is issued, data in the table is deleted, and the table is dropped.

Syntax

Following is the generic syntax for DROP TABLE.

  DROP TABLE <tablename>;

Example

In the following example, we drop the "student" table.

  DROP TABLE student;

If we run the SHOW TABLE command after this, we will get an error message stating that the table doesn't exist.

  SHOW TABLE student;
  *** Failure 3807 Object 'student' does not exist.
      Statement# 1, Info = 0

Teradata Data Manipulation

This section covers the SQL commands used to manipulate the data stored in Teradata tables.

Insert Records

INSERT INTO statement is used to insert records into the table.

Syntax

Following is the generic syntax for INSERT INTO:

  INSERT INTO <tablename>
  (column1, column2, column3, ...)
  VALUES
  (value1, value2, value3, ...);

Example

In the following example, we insert a record into the Student table.

  INSERT INTO Student (
     RollNo,
     FirstName,
     LastName,
     BirthDate
  )
  VALUES (
     10001,
     'Jack',
     'Nicholson',
     '1990-05-06'
  );

Once the above record is inserted, we can use the SELECT statement to view the records of the table.

RollNo     FirstName     LastName      BirthDate
10001      Jack          Nicholson     5/6/1990

Insert from another Table

INSERT SELECT statement is used to insert records from another table.

Syntax

Following is the generic syntax for INSERT ... SELECT.

  INSERT INTO <tablename>
  (column1, column2, column3, ...)
  SELECT
  column1, column2, column3, ...
  FROM
  <source table>;

Example

In the following example, we insert records from the Student table into the Student_Bkup table. Create a table called Student_Bkup with the same column definitions as the Student table before running the following insert query.

  INSERT INTO Student_Bkup (
     RollNo,
     FirstName,
     LastName,
     BirthDate
  )
  SELECT
     RollNo,
     FirstName,
     LastName,
     BirthDate
  FROM
     Student;

When the above query is executed, it will insert all records from the student table into the student_bkup table.

Rules

Here are some specific rules for inserting records:

  • The number of columns specified in the VALUES list should match with the columns specified in the INSERT INTO clause.
  • Values are mandatory for NOT NULL columns.
  • If no values are specified, then NULL is inserted for nullable fields.
  • The data types of columns specified in the VALUES clause should be compatible with the data types of columns in the INSERT clause.

Update Records

The UPDATE statement is used to update records in a table.

Syntax

Following is the generic syntax for UPDATE.

  UPDATE <tablename>
  SET <columnname> = <new value>
  [WHERE condition];

Example

In the following example, we update the BirthDate to 5/6/1997 for student 10001.

  UPDATE Student
  SET BirthDate = DATE '1997-05-06'
  WHERE RollNo = 10001;

The following output shows that the BirthDate is updated from 5/6/1990 to 5/6/1997 for RollNo 10001.

  SELECT RollNo, BirthDate FROM Student;
  /* Query completed. One row found. 2 columns returned. */
  /* Total elapsed time was 1 second. */

   RollNo      BirthDate
  --------     ---------
    10001       5/6/1997

Rules

Here are some specific rules for updating records in a table:

  • We can update one or more values of the table.
  • If the WHERE condition is not specified, then all rows of the table are impacted.
  • We can update a table with values from another table, as sketched below.
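A minimal sketch of updating one table from another, using Teradata's FROM-clause UPDATE and assuming the Student_Bkup table created earlier:

  UPDATE S
  FROM Student AS S, Student_Bkup AS B
  SET BirthDate = B.BirthDate        -- take the value from the matching backup row
  WHERE S.RollNo = B.RollNo;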

Delete Records

The DELETE FROM statement is used to delete records from a table.

Syntax

Following is the generic syntax for DELETE FROM.

  DELETE FROM <tablename>
  [WHERE condition];

Example

In the following example, we delete student 10001 from the Student table.

  DELETE FROM Student
  WHERE RollNo = 10001;

In the following output, we can see that student 10001 is deleted from the table.

  SELECT RollNo FROM Student;
  /* Query completed. No rows found. */
  /* Total elapsed time was 1 second. */

Rules

Here are some specific rules for deleting records from a table:

  • We can delete one or more records of the table.
  • If the WHERE condition is not specified, then all rows of the table are deleted.
  • We can delete rows based on values from another table, as sketched below.
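A minimal sketch of deleting rows based on another table, again assuming the Student_Bkup table:

  DELETE FROM Student
  WHERE RollNo IN (SELECT RollNo FROM Student_Bkup);   -- remove students that already exist in the backup table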

Teradata Set Operator

SET operators combine results from multiple SELECT statements. This may look similar to joins, but joins combine columns from multiple tables, whereas SET operators combine rows from multiple result sets.

Rules for Set Operator

Here are the rules for using set operators:

  • The number of columns from each SELECT statement should be the same.
  • The data types from each SELECT must be compatible.
  • ORDER BY should be included only in the final SELECT statement.

Teradata SQL Set Operators

Teradata SQL supports the following set operators:

Set Operator      Function
INTERSECT         Returns the rows that appear in all answer sets generated by the individual SELECT statements.
MINUS / EXCEPT    Returns the rows returned by the first SELECT except those also selected by the second SELECT.
UNION             Combines the results of two or more SELECT statements.

1. UNION

The UNION statement is used to combine results from multiple SELECT statements. It ignores duplicates.

Syntax

Following is the basic syntax of the UNION statement.

  SELECT col1, col2, col3...
  FROM
  <table T1>
  [WHERE condition]
  UNION
  SELECT col1, col2, col3...
  FROM
  <table T2>
  [WHERE condition];

Example

Consider the following student table as T1 and attendance table as T2.

RollNo     FirstName     LastName     BirthDate
1001       Mike          Richard      1/2/1996
1002       Robert        Williams     3/5/1995
1003       Peter         Collin       4/1/1994
1004       Alexa         Stuart       11/6/1995
1005       Robert        Peterson     12/1/1997

RollNo     Present     Absent     Percentage
1001       200         20         90%
1002       160         60         72%
1003       150         70         68%
1004       210         10         95%

The following UNION query combines the RollNo value from both T1 and T2 tables.

  SELECT RollNo
  FROM
  T1
  UNION
  SELECT RollNo
  FROM
  T2;

When the query is executed, it gives the following output.

RollNo
1001
1002
1003
1004
1005

2. UNION ALL

UNION ALL statement is similar to the UNION statement. It combines results from multiple tables, including duplicate rows.

Syntax

Following is the basic syntax of the UNION ALL statement.

  SELECT col1, col2, col3...
  FROM
  <table 1>
  [WHERE condition]
  UNION ALL
  SELECT col1, col2, col3...
  FROM
  <table 2>
  [WHERE condition];

Example

Following is an example for UNION ALL statement.

  SELECT RollNo
  FROM
  T1
  UNION ALL
  SELECT RollNo
  FROM
  T2;

When the above query is executed, it produces the following output, including the duplicate rows.

RollNo
1001
1002
1003
1004
1005
1001
1002
1003
1004

3. INTERSECT

INTERSECT command is also used to combine results from multiple SELECT statements.

It returns the rows from the first SELECT statement that has a corresponding match in the second SELECT statement.

Syntax

Following is the basic syntax of the INTERSECT statement.

  SELECT col1, col2, col3...
  FROM
  <table 1>
  [WHERE condition]
  INTERSECT
  SELECT col1, col2, col3...
  FROM
  <table 2>
  [WHERE condition];

Example

Following is an example of the INTERSECT statement. It returns the RollNo values that exist in both tables.

  SELECT RollNo
  FROM
  T1
  INTERSECT
  SELECT RollNo
  FROM
  T2;

When the above query is executed, it returns the following records. RollNo 1005 is excluded since it doesn't exist in the T2 table.

RollNo
1001
1002
1003
1004

4. MINUS/EXCEPT

The MINUS/EXCEPT commands combine rows from multiple tables and return the rows that are in the first SELECT but not in the second SELECT. They both return the same results.

Syntax

Following is the basic syntax of the MINUS statement.

  SELECT col1, col2, col3...
  FROM
  <table 1>
  [WHERE condition]
  MINUS
  SELECT col1, col2, col3...
  FROM
  <table 2>
  [WHERE condition];

Example

Following is an example of a MINUS statement.

  SELECT RollNo
  FROM
  T1
  MINUS
  SELECT RollNo
  FROM
  T2;

When this query is executed, it gives the following output.

RollNo
1005


Teradata String Manipulation

Teradata provides several functions to manipulate the strings. These functions are compatible with the ANSI standard.

Teradata supports most of the standard string functions, along with Teradata extensions to those functions.

String Function                        Explanation
Concat(string1, ..., stringN)          Returns the concatenation of two or more string values. Provides the same functionality as the SQL-standard concatenation operator (||).
Length(string)                         Returns the number of characters in the string.
Lower(string)                          Converts a string to lower case.
Upper(string)                          Converts a string to upper case.
Lpad(string, size, padstring)          Pads the left side of the string with characters to create a new string.
Rpad(string, size, padstring)          Pads the right side of the string with characters to create a new string.
Trim(string)                           Removes leading and trailing whitespace from the given string.
Ltrim(string)                          Removes leading whitespace from the string.
Rtrim(string)                          Removes trailing whitespace from the string.
Replace(string, search)                Removes the search string from the given string.
Replace(string, search, replace)       Replaces all instances of search with the replace string.
Reverse(string)                        Returns the string's characters in reverse order.
Split(string, delimiter)               Splits the given string on the delimiter and returns an array of strings.
Strpos(string, substring)              Returns the starting position of the first instance of a substring in the given string.
Position(substring IN string)          Returns the starting position of the first instance of a substring in the given string.
Substr(string, start, length)          Returns a substring of string that begins at the start position and is length characters long.
Chr(n)                                 Returns the character with the specified ASCII value.
To_utf8(string)                        Encodes a string into a UTF-8 varbinary representation.
From_utf8(binary)                      Decodes a UTF-8 encoded string from binary.
Translate(string, from, to)            Replaces any character in the string that matches a character in the from set with the corresponding character in the to set.
Index(string, substring)               Locates the position of a substring in a string (Teradata extension).

UPPER & LOWER Function

The UPPER and LOWER functions convert character values to all uppercase and all lowercase, respectively. UPPER and LOWER are ANSI compliant.

Syntax

  UPPER (expression) -> returns the expression as uppercase
  LOWER (expression) -> returns the expression as lowercase

Example

The following example converts the string 'Robert' to uppercase.

  SELECT UPPER('Robert');
  /* Query completed. One row found. One column returned */
  /* Total elapsed time was 0.00 second */

After executing the above code, it will give the following output.

ROBERT

Now we convert the string 'ROBERT' to lowercase.

  SELECT LOWER('ROBERT');
  /* Query completed. One row found. One column returned */
  /* Total elapsed time was 0.00 second */

Output

robert

CHARACTER_LENGTH Function

The CHARACTER_LENGTH function returns the number of characters in a character string expression.

  • The result is an integer number that represents the length.
  • For a fixed-length CHAR column, the result is always the defined length.
  • For a variable-length VARCHAR column, the result varies with the stored value.
  • Spaces are valid characters, so they are counted in the length.

Syntax

  CHARACTER_LENGTH (expression)

Example

The following example will return the number of characters of the "Robert" string.

  SELECT CHARACTER_LENGTH('Robert');
  /* Query completed. One row found. One column returned */
  /* Total elapsed time was 0.00 second */

Executing the above code returns the length of the string 'Robert', as shown below.

6

TRIM Function

The TRIM function is used to remove a particular set of leading and/or trailing characters (spaces by default) from an expression. TRIM is ANSI standard.

Syntax

  TRIM ([LEADING | BOTH | TRAILING] [trim_character] FROM expression)

Example

The following example removes the spaces from both ends of the string '   Robert    '.

  SELECT TRIM('   Robert    ');
  /* Query completed. One row found. One column returned */
  /* Total elapsed time was 0.00 second */

When we execute the above code, it trims the spaces from both ends of the string and gives the following output.

Robert
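The optional LEADING/TRAILING keywords and trim_character from the syntax above can also be used; a small sketch:

  SELECT TRIM(LEADING '0' FROM '00012345');   -- removes only leading zeros, returns '12345'
  SELECT TRIM(TRAILING FROM 'Robert   ');     -- removes only trailing spaces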

POSITION Function

The POSITION function is used to return the position of a substring inside a string. Only the position of the first occurrence of the substring is returned.

Syntax

  POSITION (substr IN str)

Example

The following example returns the position of 'e' in the string 'Robert'.

  SELECT POSITION('e' IN 'Robert');
  /* Query completed. One row found. One column returned */
  /* Total elapsed time was 0.00 second */

After executing the above code, it returns the position of the substring 'e' in 'Robert' as output.

4

SUBSTRING Function

The SUBSTRING function is used to return a specified number of characters from a particular position of a given string. SUBSTRING function is ANSI standard.

Syntax

  SUBSTRING (str FROM pos FOR len)

It returns a string (str), starting at the position (pos), and length (len) in characters.

Example

The following example returns the character from the 1st position for 3 characters.

  SELECT SUBSTRING('Robert' FROM 1 FOR 3);
  /* Query completed. One row found. One column returned */
  /* Total elapsed time was 0.00 second */

The above code returns 3 characters from the 1st position of the string "Robert" as the output.

Rob


Teradata Date/Time Functions

Date/Time functions operate on either Date/Time or Interval values and provide a Date/Time value as a result.

The supported Date/Time functions are:

  • CURRENT_DATE
  • CURRENT_TIME
  • CURRENT_TIMESTAMP
  • EXTRACT

To avoid synchronization problems, operations among these functions are guaranteed to use identical definitions for DATE, TIME, and TIMESTAMP, so the following equalities are always valid:

  • CURRENT_DATE = CURRENT_DATE
  • CURRENT_TIME = CURRENT_TIME
  • CURRENT_TIMESTAMP = CURRENT_TIMESTAMP
  • CURRENT_DATE and CURRENT_TIMESTAMP always identify the same DATE
  • CURRENT_TIME and CURRENT_TIMESTAMP always identify the same TIME

The values reflect the time when the request started and do not change during the request's duration.

Date Storage

Dates are stored internally as integers using the following formula.

  ((YEAR - 1900) * 10000) + (MONTH * 100) + DAY

To check how a date is stored, use the following query.

  SELECT CAST(CURRENT_DATE AS INTEGER);

Since the dates are stored as an integer, we can perform some arithmetic operations on them.
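For example, applying the formula to 22 May 2020 gives the stored integer value:

  ((2020 - 1900) * 10000) + (5 * 100) + 22 = 1200000 + 500 + 22 = 1200522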

Teradata supports most of the standard date functions. Some of the commonly used date functions are listed below:

Date Function       Explanation
LAST_DAY            Returns the last day of the given month. The input may contain timestamp values as well.
NEXT_DAY            Returns the date of the weekday that follows a particular date.
MONTHS_BETWEEN      Returns the number of months between two date (timestamp) values.
ADD_MONTHS          Adds months to the given date (timestamp) value and returns the resulting date value.
OADD_MONTHS         Oracle-compatible version of ADD_MONTHS; adds months to the given date (timestamp) value and returns the resulting date value.
TO_DATE             Converts a string value to a DATE value and returns the resulting date value.
TO_TIMESTAMP        Converts a string value to a TIMESTAMP value and returns the resulting timestamp value.
TRUNC               Returns a DATE value with the time portion truncated to the unit specified by a format string.
ROUND               Returns a DATE value with the time portion rounded to the unit specified by a format string.
NUMTODSINTERVAL     Converts a numeric value to an INTERVAL DAY TO SECOND value.
NUMTOYMINTERVAL     Converts a numeric value to an INTERVAL YEAR TO MONTH value.
TO_DSINTERVAL       Converts a string value to an INTERVAL DAY TO SECOND value.
TO_YMINTERVAL       Converts a string value to an INTERVAL YEAR TO MONTH value.
EXTRACT             Extracts portions of the day, month, and year from a given date value.
INTERVAL            Used to perform arithmetic operations on DATE and TIME values.

EXTRACT

EXTRACT function is used to extract portions of the day, month, and year from a DATE value. This function is also used to extract hour, minute, and second from TIME/TIMESTAMP value.

Examples

1. The following example shows how to extract Year value from Date and Timestamp values.

  SELECT EXTRACT(YEAR FROM CURRENT_DATE);
  EXTRACT(YEAR FROM Date)

Output

2020

2. The following example shows how to extract Month values from Date and Timestamp values.

  SELECT EXTRACT(MONTH FROM CURRENT_DATE);
  EXTRACT(MONTH FROM Date)

Output

5

3. The following example shows how to extract Day values from Date and Timestamp values.

  SELECT EXTRACT(DAY FROM CURRENT_DATE);
  EXTRACT(DAY FROM Date)

Output

22

4. The following example shows how to extract Hour values from Date and Timestamp values.

  SELECT EXTRACT(HOUR FROM CURRENT_TIMESTAMP);
  EXTRACT(HOUR FROM Current TimeStamp(6))

Output

6

5. The following example shows how to extract Minute values from Date and Timestamp values.

  SELECT EXTRACT(MINUTE FROM CURRENT_TIMESTAMP);
  EXTRACT(MINUTE FROM Current TimeStamp(6))

Output

46

6. The following example shows how to extract the Second values from Date and Timestamp values.

  SELECT EXTRACT(SECOND FROM CURRENT_TIMESTAMP);
  EXTRACT(SECOND FROM Current TimeStamp(6))

Output

25.150000

INTERVAL

Teradata provides the INTERVAL function to perform arithmetic operations on DATE and TIME values. There are two types of INTERVAL functions:

1. Year-Month Interval

  • YEAR
  • YEAR TO MONTH
  • MONTH

2. Day-Time Interval

  • DAY
  • DAY TO HOUR
  • DAY TO MINUTE
  • DAY TO SECOND
  • HOUR
  • HOUR TO MINUTE
  • HOUR TO SECOND
  • MINUTE
  • MINUTE TO SECOND
  • SECOND

Examples

1. The following example adds 4 years to the current date.

  SELECT CURRENT_DATE, CURRENT_DATE + INTERVAL '04' YEAR;

Output

05/22/2024

2. The following example adds 4 years and 03 months to the current date.

  SELECT CURRENT_DATE, CURRENT_DATE + INTERVAL '04-03' YEAR TO MONTH;

Output

08/22/2024

3. The following example adds 03 days, 05 hours, and 10 minutes to the current timestamp.

  SELECT CURRENT_TIMESTAMP, CURRENT_TIMESTAMP + INTERVAL '03 05:10' DAY TO MINUTE;

Output

05-25-2020 10:07:25.150000+00.00


Teradata Built-In Functions

Teradata provides built-in functions, which are extensions to SQL. The Built-in functions return information about the system.

Built-in functions are sometimes referred to as special registers. They can be used anywhere that a literal can appear.

If a SELECT statement that contains a built-in function references a table name, then the result of the query contains one row for every row of the table that satisfies the search condition.

Some common built-in functions are listed below with examples, such as:

S. No.   Function                    Example Result
1        SELECT DATE;                Date                   2018/02/10
2        SELECT CURRENT_DATE;        Date                   2020/05/23
3        SELECT TIME;                Time                   09:02:00
4        SELECT CURRENT_TIME;        Time                   10:01:13
5        SELECT CURRENT_TIMESTAMP;   Current TimeStamp(6)   2020-05-23 10:01:13.990000+00.00
6        SELECT DATABASE;            Database               TDUSER


DISTINCT Option

The DISTINCT option specifies that duplicate values are not to be used when an expression is processed.

The following SELECT returns the number of unique Deduction values in the Salary table.

  SELECT COUNT(DISTINCT Deduction) FROM Salary;

Output

If the distinct Deduction values in the table are 2000, 1000, 0, and 3000, the query returns 4.

A query can have multiple aggregate functions that use DISTINCT with the same expression, such as:

  SELECT SUM(DISTINCT x), AVG(DISTINCT x) FROM Salary;

A query can also have multiple aggregate functions that use DISTINCT with different expressions, such as:

  SELECT SUM(DISTINCT x), AVG(DISTINCT y) FROM Salary;

Teradata CASE & COALESCE

CASE and COALESCE are both used in Teradata, but for different purposes and with different functionality.

CASE Expression

The Teradata CASE expression provides the flexibility to return alternate values for a column based on the conditions specified in the expression.

The CASE expression evaluates each row against a condition or WHEN clause and returns the result of the first match. If there are no matches, the result from the ELSE part is returned.

Syntax

Following is the syntax of the CASE expression.

  CASE <expression>
  WHEN <expression> THEN result-1
  WHEN <expression> THEN result-2
  ...
  ELSE
     result-n
  END

Example

Consider the following Employee records in the table below.

Employee_Id     Age     Salary     Marital_Status
202001          22      20,000     1
202002          25      25,000     1
202003          30      35,000     2
202004          26      32,000     2
202005          32      40,000     2

In this example, we evaluate the Marital_Status column: the CASE expression returns 'Single' if the value is 1, 'Married' if the value is 2, and 'Not Sure' otherwise.

Now, we will apply the CASE statement on the Marital_Status column as follows:

  SELECT
     Employee_Id,
     Age,
     Salary,
     CASE WHEN Marital_Status = 1 THEN 'Single'
          WHEN Marital_Status = 2 THEN 'Married'
          ELSE 'Not Sure'
     END AS Marital_Status
  FROM Employee;

After executing the above code, it produces the following output.

Employee_Id     Age     Salary     Marital_Status
202001          22      20,000     Single
202002          25      25,000     Single
202003          30      35,000     Married
202004          26      32,000     Married
202005          32      40,000     Married

The above CASE expression can also be written in the following way, which will produce the same result as above.

  SELECT
     Employee_Id,
     Age,
     Salary,
     CASE Marital_Status WHEN 1 THEN 'Single'
                         WHEN 2 THEN 'Married'
                         ELSE 'Not Sure'
     END AS Marital_Status
  FROM Employee;

COALESCE Expression

Teradata COALESCE is used for NULL handling. COALESCE is an expression that returns its first non-NULL argument. It returns NULL if all of its arguments evaluate to NULL.

Syntax

Here is the basic syntax of the COALESCE function:

  COALESCE(expression-1, expression-2, ..., expression-n)

Example

Consider the following Student table.

(Student table with columns Roll_No, First_Name, Last_Name, Landline_no, and Mobile_no; some of the phone numbers are NULL.)

Now we can prioritize which phone number to select using COALESCE function as follows:

  SELECT
     Roll_No,
     First_Name,
     Last_Name,
     COALESCE(Landline_no, Mobile_no, 'phone# not available') AS phone_no
  FROM Student;

In the above example, COALESCE checks Landline_no first. If that is NULL, it checks Mobile_no. If both numbers are NULL, it returns the literal 'phone# not available'; otherwise it returns the first non-NULL phone number found.

When we execute the above query, it generates the following output.

(Output: one phone_no per student, taken from Landline_no, then Mobile_no, or the text 'phone# not available'.)

NULLIF

The NULLIF statement returns NULL if the arguments are equal.

Syntax

Following is the syntax of the NULLIF statement.

  NULLIF(expression-1, expression-2)

Example

The following example returns NULL if the Mobile_No is equal to 0. Otherwise, it returns the Mobile_No value.

  SELECT
     Roll_No,
     NULLIF(Mobile_no, 0) AS Mobile
  FROM Student;

The above query returns the following records. We can see that Roll_No 101 and 104 have Mobile as NULL.

(Output: Roll_No and Mobile, with Mobile shown as NULL for Roll_No 101 and 104.)


Teradata Primary Index

The primary index is used to specify where the data resides in Teradata. It is used to determine which AMP gets the data row.

In Teradata, each table is required to have a primary index defined. If the primary index is not defined, Teradata automatically assigns the primary index.

The primary index provides the fastest way to access the data. A primary index may have a maximum of 64 columns. The primary index is defined while creating a table, and it cannot be altered or modified afterwards.

The primary index is the most preferred and essential index for:

  • Data distribution
  • Known access path
  • Improves join performance

Rules for Primary Index

Here are some specific rules for the Primary Index:

Rule 1: One Primary index per table.

Rule 2: A Primary index value can be unique or non-unique.

Rule 3: A Primary index value can be NULL.

Rule 4: The Primary index of a populated table cannot be modified.

Rule 5: The Primary index value of a row can be modified.

Rule 6: A Primary index has a limit of 64 columns.

Types of Primary Index

There are two types of Primary Indexes.

  1. Unique Primary Index (UPI)
  2. Non-Unique Primary Index (NUPI)

1. Unique Primary Index (UPI)

In the Unique Primary Index table, the column should not have any duplicate values. If any duplicate values are inserted, they will be rejected. The Unique Primary index enforces uniqueness for a column.

A Unique Primary Index (UPI) will always spread the rows of the table evenly amongst the AMPs. UPI access is always a one-AMP operation.

How to create a Unique Primary Index?

In the following example, we create the Student table with columns Roll_no, First_name, and Last_name.

Roll_no     First_name     Last_name
1001        Mike           Richard
1002        Robert         Williams
1003        Peter          Collin
1004        Alexa          Stuart
1005        Robert         Peterson

We have selected Roll_no to be our Primary Index. Because we have designated Roll_no as a Unique Primary Index, there can be no duplicates of student roll numbers in the table.

  CREATE SET TABLE Student
  (
     Roll_no INTEGER,
     First_name VARCHAR(10),
     Last_name VARCHAR(10)
  )
  UNIQUE PRIMARY INDEX(Roll_no);

2. Non-Unique Primary Index (NUPI)

A Non-Unique Primary Index (NUPI) means that the values for the selected column can be non-unique.

A Non-Unique Primary Index will generally not spread the table rows evenly. An all-AMP operation will take longer if the data is unevenly distributed.

We might pick a NUPI over a UPI because the NUPI column may be more useful for query access and joins.

How to Create a Non-Unique Primary Index?

In the following example, we create the employee table with columns Employee_Id, Name, Department, and City.

Employee_Id     Name      Department       City
202001          Max       Sales            London
202002          Erika     Finance          Washington
202003          Nancy     Management       Paris
202004          Bella     Human Resource   London
202005          Sam       Marketing        London

Every employee has a different employee Id, name, and department, but many employees belong to the same city in this table. Therefore we have selected City to be our Non-Unique Primary Index.

  CREATE SET TABLE Employee
  (
     Employee_Id INTEGER,
     Name VARCHAR(10),
     Department VARCHAR(10),
     City VARCHAR(10)
  )
  PRIMARY INDEX(City);

Multi-Column Primary Indexes

Teradata allows more than one column to be designated as the Primary Index. It is still only one Primary Index, but it is made up of multiple columns combined.

Teradata Multi-column Primary index allows up to 64 combined columns to make up the one Primary Index required for a table.

Example

In the following example, we designate First_Name and Last_Name combined to make up the Primary Index (a sketch follows the list below).

This is very useful and beneficial for two reasons:

  1. To get better data distribution among the AMPs.
  2. And for users who often use multiple keys consistently to query.
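A minimal sketch of such a table (the table name and column sizes here are assumptions):

  CREATE SET TABLE Employee_Phone
  (
     First_Name VARCHAR(20),
     Last_Name  VARCHAR(20),
     Phone_No   VARCHAR(15)
  )
  PRIMARY INDEX (First_Name, Last_Name);   -- one Primary Index made up of two columns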

Data distribution using Primary Index

When a user submits an SQL request against a table using a Primary Index, the request becomes a one-AMP operation, which is the most direct and efficient way for the system to find a row. The process is explained below.

The complete process is explained in the steps below.

Hashing Process

The hashing process consists of the following steps:

Step 1: In the first step, the Primary index value goes into the hashing algorithm.

Step 2: The output of the hashing algorithm is the row hash value.

Step 3: The hash map points to the specific AMP where the row resides.

Step 4: The PE sends the request directly to the identified AMP.

Step 5: The AMP locates the row(s) on its vdisk.

Step 6: The data is sent to PE through BYNET, and PE sends the answer set to the client application.
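For example, a query that supplies the Primary Index value follows exactly this path and touches only one AMP (assuming the Student table defined earlier):

  SELECT *
  FROM Student
  WHERE Roll_no = 1001;   -- the PE hashes 1001, the hash map identifies the owning AMP, and only that AMP reads the row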

Duplicate Row Hash Values

The hashing algorithm can end up with the same row hash value for two different rows.

This can happen in two ways:

  1. Duplicate NUPI values: If a Non-Unique Primary Index is used, duplicate NUPI values will produce the same row hash value.
  2. Hash synonym: It is also called a hash collision. It occurs when the hashing algorithm calculates an identical row hash value for two different Primary Index values.

To differentiate each row in a table, every row is assigned a unique Row ID. The Row ID is the combination of the row hash value and unique value.

Row ID = Row Hash Value + Uniqueness Value

The uniqueness value is used to differentiate between rows whose Primary Index values generate identical row hash values. In most cases, only the row hash value portion of the Row ID is needed to locate the row.


When each row is inserted, the AMP adds the row ID, stored as a prefix of the row.

The first row inserted with a particular row hash value is assigned a uniqueness value of 1, and the uniqueness value is incremented by 1 for each additional row inserted with the same row hash value.


Teradata Joins

Join is used to combine records from more than one table. Tables are joined based on the common columns and values from these tables.

There are different types of Joins available in Teradata.

  • Inner Join
  • Left Outer Join
  • Right Outer Join
  • Full Outer Join
  • Self Join
  • Cross Join
  • Cartesian Product Join

INNER JOIN

Inner Join combines records from multiple tables and returns the values that exist in both the tables.

Syntax

Following is the syntax of the INNER JOIN statement.

  SELECT col1, col2, col3...
  FROM
  Table-1
  INNER JOIN
  Table-2
  ON (col1 = col2)
  <WHERE condition>;

Example

Consider the following two tables, such as the student table and attendance table.

Student Table:

RollNo     FirstName     LastName     BirthDate
1001       Mike          Richard      1/2/1996
1002       Robert        Williams     3/5/1995
1003       Peter         Collin       4/1/1994
1004       Alexa         Stuart       11/6/1995
1005       Robert        Peterson     12/1/1997

Attendance Table:

RollNo     Present     Absent     Percentage
1001       200         20         90%
1002       160         60         72%
1003       150         70         68%
1004       210         10         95%

The following query joins the Student table and the Attendance table on the common column RollNo. Each table is assigned an alias (A and B), and the columns are referenced with the correct alias.

  SELECT A.RollNo, B.Percentage
  FROM
  Student A
  INNER JOIN
  Attendance B
  ON (A.RollNo = B.RollNo);

When the above query is executed, it returns the following records. Rollno 1005 is not included in the result since it doesn't have matching records in the attendance table.

  RollNo     Percentage
  1001       90%
  1002       72%
  1003       68%
  1004       95%

OUTER JOIN

LEFT OUTER JOIN and RIGHT OUTER JOIN also combine the results from multiple tables.

  • LEFT OUTER JOIN: It returns all the records from the left table and only the matching records from the right table.
  • RIGHT OUTER JOIN: It returns all the records from the right table and only the matching rows from the left table.
  • FULL OUTER JOIN: It combines the results of the LEFT OUTER and RIGHT OUTER JOINs. It returns both matching and non-matching rows from the joined tables.

Syntax

Following is the syntax of the OUTER JOIN statement. We need to use one of the options from LEFT OUTER JOIN, RIGHT OUTER JOIN, or FULL OUTER JOIN.

  SELECT col1, col2, col3...
  FROM
  Table-1
  LEFT OUTER JOIN/RIGHT OUTER JOIN/FULL OUTER JOIN
  Table-2
  ON (col1 = col2)
  <WHERE condition>;

Example

Consider the following example of the LEFT OUTER JOIN query. It returns all the records from the Student table and matching records from the attendance table.

  SELECT A.RollNo, B.Percentage
  FROM
  Student A
  LEFT OUTER JOIN
  Attendance B
  ON (A.RollNo = B.RollNo)
  ORDER BY A.RollNo;

When the above query is executed, it produces the following output. For student 1005, the Percentage value is NULL, since there is no matching record in the Attendance table.

  RollNo     Percentage
  1001       90%
  1002       72%
  1003       68%
  1004       95%
  1005       -
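A FULL OUTER JOIN follows the same pattern; rows that exist on only one side appear with NULLs for the other side's columns. A sketch using the same tables:

  SELECT A.RollNo, B.Percentage
  FROM
  Student A
  FULL OUTER JOIN
  Attendance B
  ON (A.RollNo = B.RollNo)
  ORDER BY A.RollNo;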

CROSS JOIN

Cross Join joins every row from the left table to every row from the right table.

A Cross Join is equivalent to a product join, and there is no "ON" clause in a Cross Join.

Syntax

Following is the syntax of the CROSS JOIN statement.

  SELECT col1, col2, col3...
  FROM
  Table-1
  CROSS JOIN
  Table-2
  <WHERE condition>;

Example

Consider the following example of the CROSS JOIN query.

  SELECT A.RollNo, B.RollNo, B.Percentage
  FROM
  Student A
  CROSS JOIN
  Attendance B
  WHERE A.RollNo = 1001
  ORDER BY B.RollNo;

When the above query is executed, it produces the following output. RollNo 1001 from the Student table is joined with each and every record from Attendance Table.

RollNo     RollNo     Percentage
1001       1001       90%
1001       1002       72%
1001       1003       68%
1001       1004       95%


Teradata Substring

The Teradata SUBSTRING or SUBSTR function is one of the Teradata string functions, and it is used to cut a substring from a string based on its position.

SUBSTR and SUBSTRING work the same way in Teradata, but their syntax is different.

SUBSTRING uses the ANSI syntax, while SUBSTR uses the Teradata syntax. The ANSI syntax is designed to be compatible with other database systems.

Syntax

  SUBSTR (<column-name>, <start-location> [, <length>])

Or

  SUBSTRING (<column-name> FROM <start-location> [FOR <length>])
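Both forms return the same result for the same arguments. For example, reusing the 'Robert' string from the earlier SUBSTRING example:

  SELECT SUBSTR('Robert', 1, 3);              -- Teradata syntax, returns 'Rob'
  SELECT SUBSTRING('Robert' FROM 1 FOR 3);    -- ANSI syntax, returns 'Rob'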

Argument Types and Rules

SUBSTRING and SUBSTR operate on the following types of arguments:

  • Character
  • Byte
  • Numeric

If the string argument is numeric, it is implicitly converted to a character type. User-defined types (UDTs) are supported if they have an implicit cast to any of the following predefined types:

  • Character
  • Numeric
  • Byte
  • DATE

To define an implicit cast for a UDT, we use the CREATE CAST statement and specify the AS ASSIGNMENT clause.

Implicit type conversion of UDTs for system operators and functions, including SUBSTRING and SUBSTR, is a Teradata extension to the ANSI SQL standard. To disable this extension, set the DisableUDTImplCastForSysFuncOp field of the DBS Control Record to TRUE.

Result Type and Attributes

Here are the default result type and attributes for SUBSTR and SUBSTRING:

If the string argument is a:

  • BLOB, then the result type is BLOB(n).
  • Byte string other than BLOB, the result type is VARBYTE(n).
  • CLOB, then the result type is CLOB(n).
  • Numeric or character string other than CLOB, the result type is VARCHAR(n).

In ANSI mode, the value of n for the resulting BLOB(n), VARBYTE(n), CLOB(n), or VARCHAR(n) is the same as the original string.

In Teradata mode, the value of n for the result type depends on the number of characters or bytes in the resulting string. To get the data type of the resulting string, we can use the TYPE function.

Difference between SUBSTRING and SUBSTR

The SUBSTR function is the original Teradata substring operation. It is written to be compatible with DB/2.

  • It can be used in the SELECT list, to return any portion of the character data stored in a column to a client, or in the WHERE clause.
  • When we use the SUBSTR function, as with SUBSTRING, the name of the column needs to be provided along with the starting character location and the length (number of characters) to return.
  • The main difference is that commas are used as delimiters between these three parameters instead of FROM and FOR.

The length argument is optional. When it is not included, all remaining characters to the end of the column are returned. In earlier releases of Teradata, SUBSTR was much more restrictive in the values it allowed, which increased the chances of an SQL statement failing due to unexpected data.

  • Both SUBSTRING and SUBSTR allow for partial character data strings to be returned, even in ANSI mode. These functions only store the requested data in a spool, not the entire column. Therefore, the amount of spool space required can be reduced or tuned using the substring functions.
  • SUBSTR is now as compatible and tolerant regarding the parameter values passed to it as the newer SUBSTRING.
  • The SUBSTRING is the ANSI standard, and therefore, it is the better choice between these two functions.

Teradata Table Types

Teradata supports the following table types; several of them are designed to hold temporary data.

  1. ANSI Temporal: ANSI-compliant support for temporal tables. Using temporal tables, Teradata Database can process statements and queries that include time-based reasoning. Temporal tables record both system time (the time when the information was recorded in the database) and valid time (when the information is in effect or correct in a real-world application).
  2. Derived: A derived table is a type of temporary table obtained from one or more other tables as the result of a subquery. It is specified in an SQL SELECT statement. It avoids the need to use CREATE TABLE and DROP TABLE statements for storing retrieved information, and it is useful for more sophisticated and complex query code.
  3. Error Logging: Error logging tables store information about errors on an associated permanent table, including log information about insert and update errors.
  4. Global Temporary: The contents of a global temporary table are private to the session and are dropped automatically at the end of the session. The table has a persistent definition stored in the Data Dictionary. The saved definition may be shared by multiple users and sessions, with each session getting its own instance of the table.
  5. Global Temporary Trace: Global temporary trace tables store trace output for the length of the session. They have a persistent table definition stored in the Data Dictionary. They are useful for debugging SQL stored procedures (via a call to an external stored procedure written to the trace output) and external routines (UDFs, UDMs, and external stored procedures).
  6. NoPI: NoPI tables are permanent tables that do not have primary indexes defined on them. They provide a performance advantage when used as staging tables to load data from FastLoad or TPump Array INSERT. They can have secondary indexes defined on them to avoid full-table scans during row access.
  7. Permanent: Permanent tables allow different sessions and users to share table content.
  8. Queue: Queue tables are permanent tables with a timestamp column. The timestamp indicates when each row was inserted into the table. This establishes first-in first-out (FIFO) ordering of table contents, which is needed for customer applications requiring event processing.
  9. Volatile: Volatile tables are used only when:
     • One session needs the table.
     • Only the creator needs to access the table.
     • We want better performance than a global temporary table.
     • We do not need the table definition after the session ends.
     Note: The definition of a volatile table can survive across a system restart if it is contained in a macro.

Derived Table

Derived tables are created, used, and dropped within a query. These are used to store intermediate results within a query.

Example

Consider the following employee records stored in two tables.

Emp Table:

Employee_Id     First_Name     Last_Name     Department_No
202001          Mike           Richard       1
202002          Robert         Williams      2
202003          Peter          Collin        2
202004          Alexa          Stuart        1
202005          Robert         Peterson      1

Salary Table:

Employee_IdFirst_NameLast_NameDepartment_No
20200140,0004,00036,000
20200280,0006,00074,000
20200390,0007,00083,000
20200475,0005,00070,000
20200580,0000080,000

The following query creates a derived table EmpSal with the records of employees whose NetPay is 80,000 or higher.

  SELECT
     Emp.Employee_Id,
     Emp.First_Name,
     EmpSal.NetPay
  FROM
     Emp,
     (SELECT Employee_Id, NetPay
      FROM Salary
      WHERE NetPay >= 80000) EmpSal
  WHERE Emp.Employee_Id = EmpSal.Employee_Id;

When the above query is executed, it returns the employee's records with a salary higher than or equal to 80,000.

Employee_Id       First_Name       NetPay
202003            Peter            83,000
202005            Robert           80,000

Volatile Table

Volatile tables are created and dropped within a user session. Their definition is not stored in the data dictionary. They hold intermediate data of the query, which is frequently used.

Syntax

  1. CREATE [SET|MULTISET] VOLATILE TABLE tablename   
  2. <table definitions>   
  3. <column definitions>   
  4. <index definitions>   
  5. ON COMMIT [DELETE|PRESERVE] ROWS  

Example

In the following example, we create a VOLATILE table named dept_stat.

  1. CREATE VOLATILE TABLE dept_stat (   
  2.    dept_no INTEGER,   
  3.    avg_salary INTEGER,   
  4.    max_salary INTEGER,   
  5.    min_salary INTEGER   
  6. )   
  7. PRIMARY INDEX(dept_no)   
  8. ON COMMIT PRESERVE ROWS;  
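Creating the table does not load any data into it. As a minimal sketch, assuming the Emp and Salary tables shown earlier in the derived-table example, the volatile table could be populated with an INSERT ... SELECT such as:

    INSERT INTO dept_stat
    SELECT e.Department_No,
           AVG(s.NetPay),
           MAX(s.NetPay),
           MIN(s.NetPay)
    FROM Emp e
    INNER JOIN Salary s
    ON e.Employee_Id = s.Employee_Id
    GROUP BY e.Department_No;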

Once the table has been populated this way, querying dept_stat returns the average, maximum, and minimum salary per department, for example:

Dept_no     avg_salary     max_salary     min_salary 
 1           62,000         80,000         36,000
 2           78,500         83,000         74,000

Global Temporary Table

The definition of the Global Temporary table is stored in the data dictionary, and they can be used by many users and sessions. But the data loaded into a global temporary table is retained only during the session.

We can materialize up to 2000 global temporary tables per session.

Syntax

Following is the syntax of the Global Temporary table.

  1. CREATE [SET|MULTISET] GLOBAL TEMPORARY TABLE tablename   
  2. <table definitions>   
  3. <column definitions>   
  4. <index definitions>   

Example

The following query creates a Global Temporary table:

  1. CREATE SET GLOBAL TEMPORARY TABLE dept_stat (   
  2.    dept_no INTEGER,   
  3.    avg_salary INTEGER,   
  4.    max_salary INTEGER,   
  5.    min_salary INTEGER   
  6. )   
  7. PRIMARY INDEX(dept_no);  

When the above query is executed, it returns the following output.

A table has been created.
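Any session can now materialize its own private instance of dept_stat simply by referencing it. A rough sketch follows; note that because the definition above uses the default ON COMMIT DELETE ROWS, the inserted rows survive only until the end of the transaction, so ON COMMIT PRESERVE ROWS would be added to the definition if the rows should last for the whole session.

    -- Materializes a session-local instance of the global temporary table
    INSERT INTO dept_stat VALUES (1, 62000, 80000, 36000);

    -- Visible only in this session, and only while the rows are retained
    SELECT * FROM dept_stat;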

Teradata Space Concepts

Teradata is designed to reduce the DBA's administrative functions when it comes to space management. Space concepts are configured in the following ways in the Teradata system.

  1. Permanent Space
  2. Spool Space
  3. Temporary Space

Permanent Space (Perm Space)

Permanent space is the maximum amount of space available for the user and database to hold data rows. Permanent tables, journals, fallback tables, and secondary index sub-tables use permanent space.

Permanent space is not pre-allocated for the database and user. The amount of permanent space is divided by the number of AMPs. Whenever the per-AMP limit is exceeded, an error message is generated.

Equal distribution is necessary because there is a high probability that the objects will be spread across all the AMPs, and at the time of data retrieval, all AMPs will work in parallel to fetch the data.

Unlike other relational databases, the Teradata database does not physically define the Perm space at the time of object creation. Instead of that, it represents the upper limit for the Perm space, and then Perm space is used dynamically by the objects.

[Figure: Teradata Space Concepts]

Example

Suppose we have a Teradata system with 500 GB of perm space. Initially, DBC owns 100% of the perm space of a new system. Because everything in Teradata is parallel, perm space is distributed evenly across all AMPs.

With 500 GB of perm space and 5 AMPs, each AMP gets 100 GB of perm space so queries can execute in parallel. Now suppose DBC creates two users, sysdba and appdba, with 20 GB and 100 GB of perm space respectively.

This space is taken from the parent, i.e., DBC, so DBC is left with 380 GB of perm space. This mechanism ensures enough space is available to execute all processes in the Teradata system.
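A minimal sketch of the two CREATE USER statements implied above (the passwords and the exponential byte notation are illustrative; exact options vary by site):

    CREATE USER sysdba FROM DBC
    AS PERM = 20e9,          -- 20 GB of permanent space
       PASSWORD = sysdba_pwd;

    CREATE USER appdba FROM DBC
    AS PERM = 100e9,         -- 100 GB of permanent space
       PASSWORD = appdba_pwd;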

Spool Space

Spool space is the amount of space on the system that has not been allocated. It is used by the system to keep the intermediate results of the SQL query. Users without spool space cannot execute any query.

When executing a conditional query, all the qualifying rows that satisfy the given condition will be stored in the Spool space for further processing by the query. Any Perm space currently unassigned is available as Spool space.

Spool space defines the maximum amount of space the user can use. Establishing a Spool space limit is not required when Users and Databases are created. But it is highly recommended to define the upper limit of Spool space for any object.

[Figure: Teradata Space Concepts]

Example

Suppose 10 GB of permanent space has been allocated to a database, but the tables in it actually occupy only 50% of that perm space, i.e., 5 GB.

The remaining 5 GB is available system-wide as spool. Spool space for a child is not subtracted from its immediate parent, and a child database's spool space can be equal to that of its parent. For example, the spool space of sysdba and appdba can be the same as that of DBC.

Temp Space

Temp space is the unused permanent space that is used by Global Temporary tables. Temp space is also divided by the number of AMPs.

Unused Perm space is used as Temp space for storing the table and its data. If we specify a value for Temp space, it should not exceed the value of the parent database or user.

If we do not specify a value, then the maximum value is inherited from that of the parent.
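To see how these limits are being consumed, a commonly used check is to query the DBC.DiskSpaceV dictionary view, summed over all AMPs per database (a sketch; the column names come from the standard view):

    SELECT DatabaseName,
           SUM(MaxPerm)     AS MaxPermSpace,
           SUM(CurrentPerm) AS UsedPermSpace,
           SUM(MaxSpool)    AS MaxSpoolSpace
    FROM DBC.DiskSpaceV
    GROUP BY DatabaseName
    ORDER BY DatabaseName;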

Teradata Statistics

The Teradata optimizer generates an execution strategy for every SQL query. This execution strategy is based on the statistics collected on the tables used within the SQL query. Statistics on a table are collected using the COLLECT STATISTICS command.

The COLLECT STATISTICS (Optimizer form) statement collects demographic data for one or more columns of a base table, hash index, or join index, computes a statistical profile of the collected data, and stores the synopsis in the Data Dictionary.

The Optimizer uses the synopsis data when it generates its table access and join plans.

Environment Information

The environment information used by the Optimizer includes the following:

  • Number of Nodes, AMPs, and CPUs
  • Amount of memory

Data Demographics

Data demographics include the following:

  • Number of rows
  • Row size
  • Range of values in the table
  • Number of rows per value
  • Number of Nulls

Usage

We should collect statistics on newly created, empty data tables. An empty collection defines the columns, indexes, and synoptic data structures for later collections on the loaded table.

We can then easily re-collect statistics after the table is populated for prototyping, and again when it is in production.

We can collect statistics in the following ways.

  1. A unique index, which can be:
    • Primary or secondary
    • Single or multiple columns
    • Partitioned or non-partitioned
  2. A non-unique index, which can be:
    • Primary or secondary
    • Single or multiple columns
    • Partitioned or non-partitioned
    • With or without COMPRESS fields
  3. A non-indexed column or set of columns, which can be:
    • Partitioned or non-partitioned
    • With or without COMPRESS fields
  4. A temporary table
    • If we specify the TEMPORARY keyword but a materialized table does not exist, the system first materializes an instance based on the specified column names and indexes.
      This means that after a valid instance is created, we can re-collect statistics on the columns by entering COLLECT STATISTICS and the TEMPORARY keyword without having to specify the desired columns and index.
    • If we omit the TEMPORARY keyword, but the table is temporary, statistics are collected for an empty base table rather than the materialized instance.
  5. Sample (system-selected percentage) of the rows of a data table or index, to detect data skew and dynamically increase the sample size when found.
    • The system does not store both sampled and defined statistics for the same index or column set. Once sampled statistics have been collected, implicit recollection hits the same columns and indexes and operates in the same mode.
  6. Join index
  7. Hash index
  8. NoPI table

How to approach Collect Statistics

There are three approaches to collect statistics on the table.

  • Random AMP Sampling
  • Full statistics collection
  • Using the SAMPLE option
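As a rough illustration of the last two approaches (the SAMPLE percentage clause is an assumption and depends on the Teradata release), full and sampled collection on the Employee table used in the next example might look like this:

    -- Full statistics: reads every row for the column
    COLLECT STATISTICS COLUMN (Department_No) ON Employee;

    -- Sampled statistics: reads only a percentage of the rows
    COLLECT STATISTICS USING SAMPLE 10 PERCENT COLUMN (Department_No) ON Employee;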

Collecting Statistics on Table

COLLECT STATISTICS command is used to collect statistics on a table.

Syntax

Following is the basic syntax to collect statistics on a table.

  1. COLLECT [SUMMARY] STATISTICS   
  2. [ INDEX (indexname) | COLUMN (columnname) ]   
  3. ON <tablename>;  

Example

Consider an Employee table with the following records, such as:

Emp_Id    First_Name   Last_Name   Department_No
202001    Mike         Richard     1
202002    Robert       Williams    2
202003    Peter        Collin      2
202004    Alexa        Stuart      1
202005    Robert       Peterson    1

We are going to run the following query to collect statistics for the Emp_Id, First_Name columns of the Employee table.

  1. COLLECT STATISTICS COLUMN(Emp_Id, First_Name) ON Employee;  
  2.   
  3.           /* Total elapsed time was 0.00 second */  

When the above query is executed, it produces the following output.

Update completed. 2 rows changed.

Viewing Statistics

We can view the collected statistics using the HELP STATISTICS command.

Syntax

Following is the syntax to view the statistics collected.

  1. HELP STATISTICS <tablename>;  

Example

Following is an example to view the statistics collected on the Employee table.

  1. HELP STATISTICS Employee;  

When the above query is executed, it produces the following table with updated columns and their values.

Date        Time        Unique Values    Column Names
6/2/20     10:05:02      5                  *
6/2/20     10:05:02      5                  Emp_Id, First_Name

Teradata Compression

Compression reduces the physical size of the stored information. The goal of compression is to represent information accurately using the fewest number of bits.

The Compression methods are either logical or physical. Physical data compression re-encodes information independently of its meaning, and logical data compression substitutes one set of data with another, more compact set.

In Teradata, compression can compress up to 255 distinct values, including NULL. Since the storage is reduced, Teradata can store more records in a block. This results in improved query response time since any input operation can process more rows per block.

Compression can be added at table creation using the CREATE TABLE statement or after table creation using the ALTER TABLE statement.
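As a hedged sketch of adding multi-value compression to an existing column with ALTER TABLE (assuming the Employee table used elsewhere in this tutorial; the value list is illustrative):

    ALTER TABLE Employee
    ADD Department_No COMPRESS (1, 2, 3);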

Compression has the following essential reasons, such as:

  • To reduce storage costs.
  • To enhance system performance.

Compression reduces storage costs by storing more logical data per unit of physical capacity. Compression produces smaller rows, resulting in more rows stored per data block and fewer data blocks.

Compression enhances system performance because there is less physical data to retrieve per row for queries. Because compressed data remains compressed while in memory, the FSG cache can hold more rows, reducing disk input.

Rules

Teradata Compression method has the following rules to compress stored data, such as:

  • Only 255 values can be compressed per column.
  • The Primary Index column cannot be compressed.
  • Volatile tables cannot be compressed.

Types of Compression

Teradata Database uses several types of compression.

Column values: The storage of those values one time only in the table header, not in the row itself, pointing to them using an array of presence bits in the row header. It applies to:
  • Multi-value compression
  • Algorithmic compression
We cannot apply either multi-value compression or algorithmic compression to row-level security constraint columns.

Hash and Join indexes: A logical row compression in which multiple sets of non-repeating column values are appended to a single set of repeating column values.
This allows the system to store the repeating value set only once, while any non-repeating column values are stored as logical segmental extensions of the base repeating set.

Data blocks: The storage of primary table data, or join or hash index subtable data.

Partition containers: The auto-compression method set determined by Teradata Database to apply to a container of a column-partitioned table or join index when we have not specified the NO AUTO COMPRESS option at the time the object was created.

Multi-Value Compression (MVC)

Multi-value compression (MVC) compresses repeating values in a column when we specify the value in a compression list in the column definition.

When data in the column matches a value specified in the compression list, the database stores it only once in the table header, regardless of how many times it occurs as a field value for the column. The database then stores a smaller substitute value, often as small as 2 bits, in each row where the value occurs.

MVC provides the best cost and benefit ratio compared to other methods because it requires minimal resources to uncompress the data during query processing. We can use MVC for frequently used data without compromising query and load performance.

MVC is also considered the easiest to implement of all the compression methods. MVC is a logical data compression form.

Multi-value compression has the following performance impacts, such as:

  • It improves table scan response times for most configurations and workloads.
  • It provides moderate to little CPU savings.

Example

The following example compresses the field Department_No for the values 1, 2, and 3. When compression is applied on a column, the values for this column are not stored with the row.

Instead, the values are stored in the Table header in each AMP. And only presence bits are added to the row to indicate the value.

  1. CREATE SET TABLE Employee   
  2. (   
  3.    Employee_No integer,   
  4.    First_Name CHAR(10),   
  5.    Last_Name CHAR(10),   
  6.    BirthDate DATE FORMAT 'YYYY-MM-DD',   
  7.      
  8.    Department_No CHAR(02) COMPRESS(1,2,3)   
  9. )   
  10. UNIQUE PRIMARY INDEX(Employee_No);  

The system automatically compresses the specified values when data moves into the table and uncompresses them when the containing rows are accessed.

NOTE

  • The system automatically compresses NULLs when specifying COMPRESS.
  • MVC cannot be applied to row-level security columns.

Algorithmic Compression

Teradata Database software includes several standard compression algorithms, in the form of UDFs, which are used to compress many types of data by table column. We can also create custom compression and decompression algorithms in the UDF format.

When column values are unique, then algorithmic compression (ALC) provides better compression results than MVC.

If columns have repeated values, then we can use ALC and MVC concurrently on the same column, but the system does not apply ALC to any value covered by MVC.
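A hedged sketch of how ALC is specified on a column follows. The TransUnicodeToUTF8/TransUTF8ToUnicode pair is one of the Teradata-supplied compression UDF pairs; the table and column names here are hypothetical, and availability of the UDFs should be verified for the release in use.

    CREATE TABLE Product_Notes
    (
       Product_Id INTEGER,
       Note VARCHAR(1000) CHARACTER SET UNICODE
            COMPRESS USING TD_SYSFNLIB.TransUnicodeToUTF8
            DECOMPRESS USING TD_SYSFNLIB.TransUTF8ToUnicode
    )
    PRIMARY INDEX (Product_Id);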

NOTE

  • Specify only one set of compression or decompression UDFs for a particular column.
  • The system automatically compresses nulls when specifying COMPRESS.

Row Compression

Row compression is a form of logical data compression in which Teradata Database stores a repeating column value set only once.

And other non-repeating column values that belong to that set are stored as logical segmental extensions of the base repeating set. Row compression is a lossless method.

Like multi-value compression, there is no decompression necessary to access row compressed data values.

Block-Level Compression

A data block is a physical unit of input that defines how Teradata handles data. When we specify block-level compression (BLC), Teradata stores data blocks in a compressed format to save storage space.

The goals of block-level compression are to save storage space and to reduce disk input bandwidth. Block-level compression can use significantly more CPU to compress and decompress data dynamically. Thus, whether query performance is enhanced with block-level compression depends on whether performance is more limited by disk input bandwidth or CPU usage.

Teradata Database offers software-based and hardware-based BLC. We select the BLC method when we specify the compression algorithm in the Compression Algorithm field in DBS Control.

  1. Hardware-Based Block-Level Compression
    Hardware-based BLC creates little or no CPU resource contention with other database operations.
    Hardware-based block-level compression provides the same functions as software-based block-level compression.
  2. Software-Based Block-Level Compression
    Software-based BLC is designed for use on large volumes of seldom-accessed data. BLC can also achieve relatively good compression on unknown and unclassified data across different column types, including the internal data structures and the row header.
    Block-level compression can be enabled at the system level using the DBS Control utility and at the table level using SQL DDL.
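A hedged sketch of enabling BLC at the table level via a CREATE TABLE option (the BLOCKCOMPRESSION table option is assumed to be available in the release in use; the table itself is illustrative):

    CREATE TABLE Sales_Archive,
       BLOCKCOMPRESSION = MANUAL
    (
       Sale_Id   INTEGER,
       Sale_Date DATE FORMAT 'YYYY-MM-DD',
       Amount    DECIMAL(12,2)
    )
    PRIMARY INDEX (Sale_Id);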

Auto Compression

When we create a column-partitioned table or join index, Teradata Database attempts to use one or more methods to compress the data that is inserted into the physical rows of the object, unless we specify the NO AUTO COMPRESS option when we create it, or NO AUTO COMPRESS is set as the default.

The process of selecting and applying appropriate compression methods to the physical containers of a column-partitioned table or join index is referred to as Auto Compression.

Auto Compression is most useful for a column partition with a single column and COLUMN format.

Teradata Database applies Auto Compression only to column partitions with COLUMN format, and only when it reduces the size of a container. The compression methods chosen also satisfy the following requirements:

  • Minimal CPU resources are required to decompress the data for reading.
  • Teradata Database does not need to decompress many values to find a single value.

For some values, there are no applicable compression techniques that can reduce the size of the physical row, so Teradata Database does not compress the values for that physical row. Still, the system attempts to compress physical row values using one of the Auto Compression methods available to it.

Teradata Database automatically decompresses any compressed column partition values when we retrieve rows from a column-partitioned table.

Teradata Explain

The EXPLAIN command shows the Parsing Engine's (PE's) plan that would be sent to the AMPs. It returns the execution plan of the parsing engine in an English translation. It can be used with any SQL statement except another EXPLAIN command.

When a query is preceded with the EXPLAIN command, the execution plan of the Parsing Engine is returned to the user instead of being sent to the AMPs for execution. The explain plan gives a clear picture of the way the optimizer will execute the query.

It is a good practice to analyze the explain plan for a newly developed query before running it in the Teradata system.

The explain plan can be obtained in two ways: the first is to add the EXPLAIN keyword before the query, and the second is to simply press the "F6" key on the keyboard.

Explain plan is beneficial to analyze the performance issue of a query as it breaks down any complex query into the lowest level. The Explain plan provides a lot of information, such as:

  • Access path: Whether data is fetched using a full table scan or an index path such as the primary index, a secondary index, or any other index.
  • Confidence level: Whether the optimizer can use collected statistics or whether statistics are missing.
  • Join information: What kind of join is going to happen.
  • Time estimation: An estimated query completion time.

When we put EXPLAIN in front of a query statement (or simply press the F6 key), the optimizer reports one of the following estimated confidence levels to the user:

  • High Confidence: Statistics available on an Index or Column.
  • Low Confidence: Random sampling of INDEX, or Statistics available but AND/OR condition exists.
  • No Confidence: Random sampling based on AMP row count. No statistics are collected.

Explain Plan Keywords

To understand the EXPLAIN plan, we should know the following Keywords.

Locking Pseudo Table: Serial lock on a symbolic table. Every table has one. It is used to prevent deadlock situations between users.
Locking table for: Indicates that an ACCESS, READ, WRITE, or EXCLUSIVE lock has been placed on the table.
Locking rows for: Indicates that an ACCESS, READ, or WRITE lock is placed on the rows read or written.
Do an ABORT test: Guarantees a transaction is not in progress for this user.
All AMPs retrieve: All AMPs are receiving the AMP steps and are involved in providing the answer set.
By way of an all-rows scan: Rows are read sequentially on all AMPs.
By way of the primary index: Rows are read using the Primary Index column(s).
By way of index number: Rows are read using the Secondary Index; the number comes from HELP INDEX.
BMSMS: Bit Map Set Manipulation Step, an alternative direct access technique used when multiple NUSI columns are referenced in the WHERE clause.
Residual conditions: WHERE clause conditions, other than those of a join.
Eliminating duplicate rows: Providing unique values, usually the result of DISTINCT, GROUP BY, or a subquery.
Where unknown comparison will be ignored: Indicates that NULL values will not compare to TRUE or FALSE. Seen in a subquery using NOT IN or NOT = ALL because no rows will be returned on the ignored comparison.
Nested join: The fastest join possible. It uses a UPI to retrieve a single row after using a UPI or a USI in the WHERE clause to reduce the join to a single row.
Merge join: Rows of one table are matched to the other table on common domain columns after being sorted into the same sequence, normally Row Hash.
Product join: Rows of one table are matched to all rows of another table with no concern for a domain match.
ROWID join: A very fast join. It uses the ROWID of a UPI to retrieve a single row after using a UPI or a USI in the WHERE clause to reduce the join to a single row.
Duplicated on all AMPs: Participating rows for the table of a join are duplicated on all AMPs.
Hash redistributed on all AMPs: Participating rows of a join are hashed on the join column and sent to the same AMP that stores the matching row of the table to join.
SMS: Set Manipulation Step, the result of an INTERSECT, UNION, EXCEPT, or MINUS operation.
Last use: The SPOOL file is no longer needed after the step, and its space is released.
Built locally on the AMPs: As rows are read, they are put into SPOOL on the same AMP.
Aggregate Intermediate Results computed locally: The aggregation values are all on the same AMP; therefore, there is no need to redistribute them to work with rows on other AMPs.
Aggregate Intermediate Results computed globally: The aggregation values are not all on the same AMP and must be redistributed to one AMP to accompany the same value from the other AMPs.

How to EXPLAIN Plan Works

The EXPLAIN request modifier precedes any SQL request and causes Teradata Database to display the execution plan for that request. The request itself is not submitted for execution.

  • When we perform an EXPLAIN against any SQL request, that request is parsed and optimized.
  • The access and join plans generated by the Optimizer are returned in the form of a text file that explains the (possibly parallel) steps used in the execution of the request.
  • The plan also includes the relative cost required to complete the request, given the statistics with which the Optimizer had to work.
  • If the statistics are not reasonably accurate, the cost estimate may not be accurate.

Benefits of Using EXPLAIN

Here are some essential benefits of using the Explain Plan, such as:

  1. EXPLAIN helps to evaluate complex queries and to develop alternative, more efficient, processing strategies.
  2. We can get a better plan by collecting more statistics on more columns, or by defining additional indexes.
  3. Knowing the actual demographics information may allow identifying row count estimates that seem badly wrong and help pinpoint areas where additional statistics would be helpful.

Examples of EXPLAIN

Consider the Employee table with the following definition.

  1. CREATE SET TABLE EMPLOYEE, FALLBACK (   
  2.    Employee_Id INTEGER,   
  3.    First_Name VARCHAR(10),   
  4.    Last_Name VARCHAR(10),  
  5.    DOB DATE FORMAT 'YYYY-MM-DD',   
  6.      
  7.    Department_No BYTEINT   
  8. )   
  9. UNIQUE PRIMARY INDEX ( Employee_Id );  

Full Table Scan (FTS)

When no conditions are specified in the SELECT statement, then the optimizer may use the Full Table Scan where each and every row of the table is accessed.

Example

Following is a query where the optimizer may choose FTS.

  1. EXPLAIN SELECT * FROM Employee;  

When the above query is executed, it produces the following output. As can be seen, the optimizer chooses to access all AMPs and all rows within the AMP.

1. First, we lock a distinct TDUSER."pseudo table" for reading on a RowHash to prevent global deadlock for TDUSER.Employee.  
2. Next, we lock TDUSER.Employee to read.   
3. We do an all-AMPs RETRIEVE step from TDUSER.Employee by way of an all-rows scan with no residual conditions into Spool 1 (group_amps) built locally on the AMPs.  The size of Spool 1 is estimated with low confidence to be 2 rows (116 bytes). The estimated time for this step is 0.03 seconds.
4. Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request. 

//The contents of Spool 1 are sent back to the user as the result of statement 1.
//The total estimated time is 0.03 seconds.

Unique Primary Index

When the rows are accessed using the Unique Primary Index, it is a one-AMP operation.

  1. EXPLAIN SELECT * FROM Employee WHERE Employee_Id = 1001;  

When the above query is executed, it produces the following output. The optimizer uses the unique primary index for a single-AMP retrieval of the row.

1. First, we do a single-AMP RETRIEVE step from TDUSER.Employee by way of the unique primary index "TDUSER.Employee.Employee_Id = 1001" with no residual conditions.  

//The row is sent directly back to the user as the result of statement 1.  
//The total estimated time is 0.01 seconds.

Unique Secondary Index

When the rows are accessed using a Unique Secondary Index, it is a two-AMP operation.

Example

Consider the Salary table with the following definition.

  1. CREATE SET TABLE SALARY, FALLBACK   
  2. (   
  3.    Employee_Id INTEGER,   
  4.    Gross INTEGER,   
  5.    Deduction INTEGER,   
  6.    NetPay INTEGER   
  7. )  
  8. PRIMARY INDEX (Employee_Id)   
  9. UNIQUE INDEX (Employee_Id);  

Consider the following SELECT statement.

  1. EXPLAIN SELECT * FROM Salary WHERE Employee_Id = 1001;  

When the above query is executed, the plan shows that the optimizer retrieves the row in a two-AMP operation using the unique secondary index.

1. First, we do a two-AMP RETRIEVE step from TDUSER.Salary by way of unique index # 4 "TDUSER.Salary.Employee_Id = 1001" with no residual conditions.   

//The row is sent directly back to the user as the result of statement 1.  
//The total estimated time is 0.01 seconds.


Teradata Join Index

The Teradata Join index stores pre-joined tables. Join indexes are comparable to materialized views.

Join indexes are designed to permit queries to be resolved by accessing the index instead of accessing, and possibly joining, their underlying base tables.

A join index can contain one or more tables and can also provide pre-aggregated data. Join indexes are mainly used for improving performance.

Although join indexes are an index structure, they are better thought of as an added layer of pre-aggregated, pre-joined, permanent tables with their own row-level partitioning, Primary Index, and Secondary Indexes.

Even if a join index does not entirely cover a query, the optimizer can use it to join its underlying base tables in a way that provides better query optimization than scanning the base tables for all the columns specified in the request.

Join indexes are useful for queries that aggregate columns from tables with large cardinalities. These indexes play the role of pre-join and summary tables without denormalizing the logical design of the database, even though denormalization itself often enhances the performance of a particular query or queries.

There are three types of join indexes available in Teradata.

  1. Single Table Join Index (STJI)
  2. Multi-Table Join Index (MTJI)
  3. Aggregate Join Index (AJI)

Rules

Here are some essential rules of the Join Index, such as:

  • The Usage of the FULL OUTER JOIN is not allowed.
  • At least one non-null column must be selected on the inner side of the LEFT or RIGHT JOIN.
  • OUTER JOIN is preferable because it increases the likelihood that the join index will be used, but it is not allowed for the Aggregate Join Index.
  • HAVING and QUALIFY keywords are not allowed.
  • Set Operations are not allowed, such as UNION, INTERSECT, and MINUS.
  • No Subqueries are allowed.
  • Inequality conditions in ON clauses of join index definitions are not allowed.
  • Only <, <=, >=, > comparison operators are allowed.
  • During the restoration of a base table or database, the join index is marked as invalid.
  • A maximum of 64 columns per table per Join Index is allowed.
  • A maximum of 32 indexes can be defined on a table, including join indexes.
  • A UPI is allowed only for a single-table join index.
  • Value ordered NUSI and Partition Expression on the same Join Index columns are not allowed.

Single Table Join Index

Single-table join indexes are created from exactly one base table. Their purpose is to have the same table available with a different primary index or partitioning, or as a smaller table (the join index table) with fewer columns to be spooled.

This improves the performance of joins as no distribution or duplication is needed. The user will query on the base table, but PE will decide whether to access the base table or single table join index.

Syntax

Following is the syntax of a JOIN INDEX.

  1. CREATE JOIN INDEX <index name>   
  2. AS   
  3. <SELECT Query>   
  4. <Index Definition>;  

Example

Consider the following Employee and Salary tables.

  1. CREATE SET TABLE EMPLOYEE, FALLBACK   
  2. (   
  3.    Employee_Id INTEGER,   
  4.    First_Name VARCHAR(10),   
  5.    Last_Name VARCHAR(10),   
  6.    DOB DATE FORMAT 'YYYY-MM-DD',   
  7.      
  8.    DepartmentNo BYTEINT   
  9. )   
  10. UNIQUE PRIMARY INDEX (Employee_Id);  
  11.   
  12. CREATE SET TABLE SALARY, FALLBACK   
  13. (   
  14.    Employee_Id INTEGER,   
  15.    Gross INTEGER,   
  16.    Deduction INTEGER,   
  17.    NetPay INTEGER   
  18. )   
  19. PRIMARY INDEX (Employee_Id)   
  20. UNIQUE INDEX (Employee_Id);  

Following is an example of creating a Join index named Employee_JI on the Employee table.

  1. CREATE JOIN INDEX Employee_JI   
  2. AS   
  3. SELECT Employee_Id, First_Name, Last_Name,   
  4. DOB, DepartmentNo   
  5. FROM Employee   
  6. PRIMARY INDEX (First_Name);  

If the user submits a query with a WHERE clause on Employee_Id, then the system will query the Employee table using the unique primary index.

If the user queries the Employee table using First_Name, then the system can access the join index Employee_JI using First_Name.

The rows of the join index are hashed on the First_Name column. Without the join index (and with First_Name not defined as a secondary index), the system would perform a full table scan to access the rows, which is time-consuming.

We can run the following EXPLAIN and verify the optimizer's plan. In the following example, the optimizer uses the Join Index instead of the base Employee table when the table is queried using the First_Name column.

  1. EXPLAIN SELECT * FROM EMPLOYEE WHERE First_Name='Alexa';   
  2.   
  3.        /* Help information returned. 8 rows. */   
  4.        /* Total elapsed time was 1 second. */   

Explanation

  1. First, we do a single-AMP RETRIEVE operation from EMPLOYEE_JI by way of the primary index "EMPLOYEE_JI.First_Name = 'Alexa'" with no residual conditions into Spool 1 (one-amp), built locally on that AMP. The size of Spool 1 is estimated with low confidence to be 2 rows (232 bytes). The estimated time for this step is 0.01 seconds.  
  2.   
  3. //The contents of Spool 1 are sent back to the user as the result of statement 1.  
  4. //The total estimated time is 0.01 seconds.  

Multi-Table Join Index

A multi-table join index is created by joining more than one table. A multi-table join index can be used to store the result set of frequently joined tables to improve the performance.

A multi-table join index holds a pre-joined result set from two or more tables. During join processing, the PE may therefore decide to access data from the multi-table join index rather than joining the underlying base tables again. We should define a multi-table join index only after thorough analysis based on the frequency and cost of the join.

Multi-Table Join Indexes allow us to move resource-intensive joins from the online to the batch window.

Shifting the workload does not reduce the total workload, but it moves it to a point in time that is beneficial for the system's overall performance.

Example

The following example creates a JOIN INDEX named Employee_Salary_JI by joining Employee and Salary tables.

  1. CREATE JOIN INDEX Employee_Salary_JI   
  2. AS   
  3. SELECT a.Employee_Id,a.First_Name,a.Last_Name,   
  4. a.DOB,a.DepartmentNo,b.Gross,b.Deduction,b.NetPay   
  5. FROM Employee a   
  6. INNER JOIN Salary b   
  7. ON(a.Employee_Id = b.Employee_Id)   
  8. PRIMARY INDEX(First_Name);  

When the base tables Employee or Salary are updated, the Join index Employee_Salary_JI is also automatically updated.

If we are running a query joining Employee and Salary tables, then the optimizer may choose to access the data from Employee_Salary_JI directly instead of joining the tables. EXPLAIN plan on the query can be used to verify if the optimizer will choose the base table or Join index.

Aggregate Join Index

The aggregate join index method is used to resolve queries that frequently specify the same aggregate operations on the same column or columns.

When aggregate join indexes are available, then the system does not repeat aggregate calculations for every query. An aggregate join index can be defined on two or more tables or a single table.

If a table is consistently aggregated on specific columns, then an aggregate join index can be defined on the table to improve performance. One limitation of the aggregate join index is that it supports only the following operations.

  • SUM function
  • COUNT function
  • GROUP BY clause

Example

In the following example, the Employee and Salary are joined to identify the total salary per Department.

  1. CREATE JOIN INDEX Dept_Salary_JI   
  2. AS   
  3. SELECT a.DepartmentNo,SUM(b.NetPay) AS TotalPay   
  4. FROM Employee a   
  5. INNER JOIN Salary b   
  6. ON(a.Employee_Id = b.Employee_Id)  
  7. GROUP BY a.DepartmentNo   
  8. PRIMARY INDEX(DepartmentNo);   

Teradata Partitioned Primary Index

Partitioned Primary Index (PPI) is one of the powerful features of Teradata, which allows a user to access a part of the table rather than full table access. PPI is an indexing mechanism that is useful in improving the performance of specific queries. PPI works the same as Primary Index for the data distribution but creates partitions according to range or case as specified in the table.

A partitioned primary index (PPI) allows rows to be assigned to user-defined data partitions on the AMPs, enabling enhanced performance for range queries predicated on primary index values.

When rows are inserted into a table, they are stored in an AMP and arranged by their row hash order. And when a table is defined with PPI, then the rows are sorted by their partition number.

Within each partition, they are arranged by their row hash. Rows are assigned to a partition based on the partition expression defined. Below are the following significant points of the PPI, such as:

  • PPI is used to improve performance for large tables when we submit queries that specify a range constraint.
  • PPI allows reducing the number of rows to be processed by using partition elimination.
  • PPI will increase performance for incremental data loads, deletes, and data access when working with large tables with range constraints.

Commonly queried rows are stored together in partitions so that we can restrict data retrieval to those partitions that contain the rows required by your SQL statement. Other partitions that do not provide requested rows are not read.

Incidentally, other database systems, such as Netezza, go the other way round by storing the information about the data blocks where the requested rows are not stored. Netezza called this method "Data Skipping".

Partitioning will not help to avoid table scans, but it will reduce the number of data blocks that each AMP must move to its local memory (FSG cache) because only relevant partitions need to be accessed.

Characteristics of PPI

Here are some characteristics of the Partitioned Primary Index, such as:

  1. Partitioning can be done on volatile, global temporary, and permanent tables.
  2. The partitioning column does not have to be part of the Primary Index when the Primary Index is non-unique.
  3. If the Primary Index is unique, the partitioning column must be part of it.
  4. The partitioning column does not decide which AMP a row goes to; that still depends on the Primary Index.
  5. Rows within an AMP are sorted by partition number first, and then by row hash within each partition.
  6. PPI can be defined at multi-level.
  7. Teradata uses a partition elimination strategy to avoid a Full Table Scan (FTS) for partitioned tables.
  8. Partitioned tables may reduce input for range queries by reducing the number of rows to be processed.
  9. Details about the partitions, constraints & conditions can be retrieved from dbc.PartitioningConstraintsV view.

Advantages of PPI

Following are the advantages of the Partitioned Primary Index:

  • It avoids a full table scan for specific queries.
  • Unlike a secondary index, PPI does not require an additional physical subtable structure or the associated maintenance input.
  • It can access a subset of a large table quickly.
  • PPI makes it fast to drop old data and add new data.
  • It can be created on a global temp table, volatile table, and non-compressed join indexes.
  • For range-based queries, we can remove SIs and use PPI, which will reduce the space of an overhead SI Subtable.

Disadvantages

Partitioned Primary Index also has the following disadvantages, such as:

  • Partitioning can make single-row or primary index access to the table slower if the partitioning column is not part of the primary index.
  • An additional 2 bytes are added to each row, which increases perm space usage.
  • Joining a PPI table with a non-partitioned table can take longer.
  • Access on columns other than the partitioning column can take more time.

Types of PPI

There are four types of Partitioned Primary Index:

  1. Case partitioning
  2. Range-based partitioning
  3. Multi-level partitioning
  4. Character-based partitioning

Example

Consider the following Orders table with Primary Index on OrderNo.

StoreNo   OrderNo   OrderDate    OrderTotal
2001      75        2017-11-02   900
2001      99        2017-12-10   1,200
2002      77        2017-06-09   3,000
2002      18        2017-09-05   2,454
2001      71        2017-11-03   1,201
2003      70        2017-04-18   2,454
2001      80        2017-10-08   1,201
2001      65        2017-05-15   1,201

Assume that the records are distributed between AMPs as shown in the following tables. Records are stored in the AMPs, sorted by their row hash.

AMP 1

RowHash   OrderNo   OrderDate
1         71        2017-11-03
2         18        2017-09-05
3         75        2017-11-02
4         65        2017-05-15

AMP 2

RowHash   OrderNo   OrderDate
1         80        2017-10-08
2         99        2017-12-10
3         70        2017-04-18
4         77        2017-06-09

If we run a query to extract the orders for a particular date, then the optimizer may choose to use Full Table Scan, then all the records within the AMP may be accessed. We define the order date as Partitioned Primary Index to avoid the full table scan. When rows are inserted into the orders table, they are partitioned by the order date. Within each partition, they will be ordered by their row hash.

The following data shows how the records will be stored in AMPs if they are partitioned by Order Date. If a query is run to access the records by Order Date, only the partition that contains the records for that particular order will be obtained.

AMP 1

Partition   RowHash   OrderNo   OrderDate
0           3         75        2017-11-02
1           1         71        2017-11-03
1           2         18        2017-09-05
2           4         65        2017-05-15

AMP 2

Partition   RowHash   OrderNo   OrderDate
0           2         99        2017-12-10
0           4         77        2017-06-09
1           3         70        2017-04-18
2           1         80        2017-10-08

Following is an example of creating a table with a partition primary Index. PARTITION BY clause is used to define the partition.

  1. CREATE SET TABLE Orders   
  2. (  
  3.    StoreNo SMALLINT,   
  4.    OrderNo INTEGER,   
  5.    OrderDate DATE FORMAT 'YYYY-MM-DD',   
  6.    OrderTotal INTEGER   
  7. )   
  8. PRIMARY INDEX(OrderNo)   
  9. PARTITION BY RANGE_N    
  10. (  
  11.    OrderDate BETWEEN DATE '2015-01-01' AND DATE '2020-12-31' EACH INTERVAL '1' DAY  
  12. );  

In the above example, the table is partitioned by the OrderDate column. There will be one separate partition for each day.
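Before looking at further RANGE_N forms, here is a minimal sketch of case partitioning (the first type in the list above) using CASE_N on the same kind of Orders data; the table name and the OrderTotal bands are illustrative:

    CREATE SET TABLE Orders_ByValue
    (
       StoreNo SMALLINT,
       OrderNo INTEGER,
       OrderDate DATE FORMAT 'YYYY-MM-DD',
       OrderTotal INTEGER
    )
    PRIMARY INDEX (OrderNo)
    PARTITION BY CASE_N
    (
       OrderTotal <  1000,
       OrderTotal <  2000,
       OrderTotal >= 2000,
       NO CASE, UNKNOWN
    );

Rows whose OrderTotal matches none of the conditions go to the NO CASE partition, and rows with a NULL OrderTotal go to the UNKNOWN partition.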


PARTITION BY RANGE_N(totalorders BETWEEN *, 100, 1000 AND *,
                         UNKNOWN);

In the example, RANGE_N specifies four partitions to which a row can be assigned, based on the value of the totalorders column:

Partition 1: The value of the totalorders column is less than 100.
Partition 2: The value of the totalorders column is less than 1000, but greater than or equal to 100.
Partition 3: The value of the totalorders column is greater than or equal to 1000.
Partition 4: The totalorders column is NULL, so the range is UNKNOWN.

PARTITION BY (RANGE_N(totalorders BETWEEN *, 100, 1000 AND *)
                  ,RANGE_N(orderdate BETWEEN *, '2005-12-31' AND *) );

The example defines six partitions to which a row can be assigned. The first RANGE_N expression defines three partitions based on the value of the totalorders column. The second RANGE_N expression subdivides each of the three partitions into two partitions based on the value of the orderdate column.

Level 1 = 1, Level 2 = 1: totalorders < 100 and orderdate < '2005-12-31'.
Level 1 = 1, Level 2 = 2: totalorders < 100 and orderdate >= '2005-12-31'.
Level 1 = 2, Level 2 = 1: totalorders < 1000 but >= 100, and orderdate < '2005-12-31'.
Level 1 = 2, Level 2 = 2: totalorders < 1000 but >= 100, and orderdate >= '2005-12-31'.
Level 1 = 3, Level 2 = 1: totalorders >= 1000 and orderdate < '2005-12-31'.
Level 1 = 3, Level 2 = 2: totalorders >= 1000 and orderdate >= '2005-12-31'.
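Putting the two RANGE_N expressions together, a hedged sketch of a complete multi-level partitioned table follows; the table name and the non-partitioning column are assumptions for illustration:

    CREATE SET TABLE ordertbl
    (
       orderno     INTEGER,
       totalorders INTEGER,
       orderdate   DATE FORMAT 'YYYY-MM-DD'
    )
    PRIMARY INDEX (orderno)
    PARTITION BY (
       RANGE_N(totalorders BETWEEN *, 100, 1000 AND *),
       RANGE_N(orderdate BETWEEN *, DATE '2005-12-31' AND *)
    );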

Teradata Secondary Indexes

Secondary indexes are never required for Teradata Database tables, but they can often improve system performance.

A Secondary Index (SI) is an alternate data access path. It allows accessing the data without having to do a full-table scan.

The secondary index can be created explicitly using the CREATE TABLE and CREATE INDEX statements. Teradata Database can implicitly create unique secondary indexes.

For example, when we use a CREATE TABLE statement that specifies a primary index, Teradata Database implicitly creates unique secondary indexes on column sets that we specify using PRIMARY KEY or UNIQUE constraints.
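For instance, a hedged sketch of a table definition where a UNIQUE constraint causes the system to maintain a unique secondary index on the constrained column (the table is illustrative):

    CREATE TABLE Department
    (
       Department_No   INTEGER,
       Department_Name CHAR(20) NOT NULL UNIQUE
    )
    PRIMARY INDEX (Department_No);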

Secondary Indexes are stored in separate subtables that require additional disk space and maintenance, which is handled automatically by the system. These subtables are built on all AMPs.

These subtables contain the index rows, which add another set of rows that must be updated each time the base table changes.

Secondary index subtables are also duplicated whenever a table is defined with FALLBACK. The Secondary index is not involved in the data distribution.

The purpose of the secondary index subtable is to point back to the real row in the base table via the Row ID.

Secondary Index Rules

Here are some essential rules for the secondary index:

Rule 1: Secondary Indexes are optional.
Rule 2: Secondary Index values can be unique or non-unique.
Rule 3: Secondary Index values can be NULL.
Rule 4: Secondary Index values can be modified.
Rule 5: Secondary Indexes can be changed.
Rule 6: A Secondary Index has a limit of 64 columns.

Secondary Indexes Types

There are two types of secondary indexes in Teradata:

  • Unique Secondary Index (USI)
  • Non-Unique Secondary Index (NUSI)

Unique Secondary Index (USI)

A Unique Secondary Index allows unique values for the columns defined as USI.

A Unique Secondary Index (USI) serves two purposes.

  • Enforces uniqueness on a column or group of columns.
  • Speeds up access to a row (data retrieval speed).

When a USI is created, Teradata immediately builds a secondary index subtable on each AMP.

Each AMP will then hash the secondary index value for each of their rows in the base table.

Syntax

Below is the syntax to create a Unique Secondary Index:

  1. CREATE UNIQUE INDEX (column_list) ON <tablename>;  

Example

First create an Employee table with the following records, such as:

Emp_Id    First_Name   Last_Name   Department_No
202001    Mike         Richard     1
202002    Robert       Williams    2
202003    Peter        Collin      2
202004    Alexa        Stuart      1
202005    Robert       Peterson    1

The following example creates USI on the Emp_Id column of the employee table.

  1. CREATE UNIQUE INDEX(Emp_Id) on Employee;  

The output of the Emp_Id hash will utilize the hash map to point to a specific AMP and that AMP will hold the secondary index subtable row for the secondary index value.

That means the subtable row will hold the base table Row-ID, and Teradata will then find the base row immediately.

Non-Unique Secondary Index (NUSI)

A Non-Unique Secondary Index (NUSI) allows duplicate values for the columns defined as NUSI.

A Non-Unique Secondary Index is usually specified to prevent full-table scans, in which every row of a table is read.

A USI is always a two-AMP operation, so it is almost as fast as a Primary Index access. A NUSI, on the other hand, is an all-AMP operation, though not a full table scan.

When a NUSI is created, Teradata immediately builds a secondary index subtable on each AMP.

Each AMP will hold the secondary index values for their rows in the base table only. There can be up to 32 Secondary Indexes on a table.

Syntax

Below is the generic syntax to create a Non-Unique Secondary Index:

  1. CREATE INDEX (column_list) ON <tablename>;  

Example

The following example creates NUSI on the First_Name column of the employee table.

  1. CREATE INDEX(First_Name) on Employee;  

In the above example, each AMP holds the name column for all employee rows in the base table on their AMP (AMP local).

Each AMP Local name will have the Base Table Row-ID (pointer), so the AMP can retrieve it quickly if needed.

If an AMP contains duplicate first names, only one subtable row for that name is built with multiple Base Row-IDs.

Teradata Hashing Algorithm

The hashing algorithm is a piece of code that acts as a translation table. A row is assigned to a particular AMP based on the primary index value. Teradata uses a hashing algorithm to determine which AMP gets the row.

The Teradata Database hashing algorithms are proprietary mathematical functions that transform an input data value of any length into a 32-bit value referred to as a rowhash, which is used to assign a row to an AMP.

Whether the input is a combination of different column values, comes from differently-sized values from a variable-length column, or is composed of different data types, the hashing algorithm's output will always be a fixed size and format.

Dealing with uniformly sized row identifiers simplifies the database's effort when storing, aggregating, sorting, and joining rows.

Following is a high-level diagram on the hashing algorithm.

[Diagram: Teradata Hashing Algorithm]

Here is the explanation of the above diagram, how to insert the data:

  • First, the client submits a query.
  • Then, the parser receives the query and passes the PI value of the record to the hashing algorithm.
  • The hashing algorithm hashes the primary index value and returns a 32-bit number, called Row Hash.
  • The higher-order bits of the row hash (first 16 bits) are used to identify the hash map entry. The hash map contains one AMP #. Hash map is an array of buckets that contain specific AMP #.
  • BYNET sends the data to the identified AMP.
  • AMP uses the 32 bit Row hash to locate the row within its disk.
  • If there is any record with the same row hash, it increments the uniqueness ID, a 32-bit number. For new row hash, uniqueness ID is assigned as 1 and incremented whenever a record with the same row hash is inserted.
  • The combination of Row hash and Uniqueness ID is called Row ID.
  • Row ID prefixes each record in the disk.
  • Each table row in the AMP is logically sorted by their Row IDs.

Hashing Functions

Teradata SQL provides several functions that can be used to analyze the hashing characteristics of the existing indexes and candidate indexes.

These functions are documented fully in SQL Functions, Operators, Expressions, and Predicates. There are four types of hashing functions available in the Teradata.

1. HASHROW: It returns the hexadecimal row hash value for an expression. The query gives the same result every time it is run on the same data.

The HASHROW function produces the 32-bit binary row hash that is stored as part of the data row. It can return a maximum of 4,294,967,295 unique values.

Syntax

  1. HASHROW( [ < data-column-value > [..., < data-column-value >...] ] )  

2. HASHAMP: The HASHAMP function returns the identification number of the primary AMP for any Hash Bucket number.

When no value is passed to the HASHAMP function, it returns one less than the number of AMPs in the current system configuration.

Syntax

  1. HASHAMP ( [ <hash-bucket> ] )  

3. HASHBUCKET: The HASHBUCKET function produces the binary hash bucket number that is used with the Hash Map to determine the AMP that stores and retrieves the data row.

On systems that use 20-bit hash buckets, the values range from 0 to 1,048,575 (older 16-bit systems range from 0 to 65,535), not counting NULL as a possible result.

Syntax

  1. HASHBUCKET ( [ <row-hash> ] )  

4. HASHBAKAMP: The HASHBAKAMP function returns the identification number of the Fallback AMP for any Hash Bucket number.

Syntax

  1. HASHBAKAMP ( [ <hash-bucket> ] )  
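A common way to use these functions together is to check how evenly a candidate primary index column would distribute rows across the AMPs. A minimal sketch, assuming the Employee table used earlier in this tutorial:

    SELECT HASHAMP(HASHBUCKET(HASHROW(Employee_Id))) AS AMP_No,
           COUNT(*) AS Row_Count
    FROM Employee
    GROUP BY 1
    ORDER BY 2 DESC;

A heavily skewed Row_Count across AMP_No values would suggest that Employee_Id hashes unevenly and may be a poor primary index choice.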

Hash Collisions

The Hash Collision is a situation in which the rowhash value for different rows is identical, making it difficult for a system to discriminate among the hash synonyms when one unique row is requested for retrieval from a set of hash synonyms.

Minimizing Hash Collisions

To minimize the Hash Collision problem, Teradata Database defines 4.2 billion hash values. The AMP software adds a system-generated 32-bit uniqueness value to the rowhash value.

The resulting 64-bit value prefixed with an internal partition number is called the rowID. This value uniquely identifies each row in a system, making a scan to retrieve a particular row among several having the same rowhash a trivial task.

A scan must check each of the rows to determine if it has the searched-for value and not another value with the same rowhash value.

Hash Maps

A hash map is a mechanism for determining which AMP a row is stored on, or, in the case of the Open PDE hash map, the destination AMP for a message sent by the Parser subsystem.

Each cell in the map array corresponds to a hash bucket, and each hash bucket is assigned to an AMP. Hash map entries are maintained by the BYNET.


Teradata Join Strategies

Teradata join strategies are utilized by the optimizer to choose the minimum-cost plan and deliver better performance.

The strategy is chosen based on the information available to the optimizer, such as table size, Primary Index information, and statistics.

Teradata join strategies are of the following types:

  • Merge (Exclusion)
  • Nested
  • Row Hash
  • Product

Merge Join Strategies

Merge Join method takes place when the join is based on the equality condition.

Merge Join requires the joining rows to be on the same AMP. Rows are joined based on their row hash.

Merge Join uses four different merge join strategies based on the redistribution to bring the rows to the same AMP.

Strategy 1

The 1st merge join will utilize the Primary Index on both tables in the join condition.

[Figure: Teradata Join Strategies]

The inner join above focuses on returning all rows when there is a match between the two tables.

The ON clause is essential because this join establishes the join (equality) condition.

[Figure: Teradata Join Strategies]

Each matching row is joined where Emp = Emp, as stated in the ON clause of the JOIN.

EMP is the Primary Index for both tables. This first merge join type is extremely efficient because both columns in the ON clause are the Primary Indexes of their respective tables. When this occurs, NO data has to be moved into the spool, and the joins can be performed in what is called AMP LOCAL.

Teradata can perform this join rapidly because less data has to be moved to complete the join.
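As a concrete illustration using the Employee and Salary tables defined earlier in this tutorial (both have Employee_Id as their Primary Index), a join like the following can be satisfied AMP-locally with a merge join:

    SELECT e.Employee_Id,
           e.First_Name,
           s.NetPay
    FROM Employee e
    INNER JOIN Salary s
    ON e.Employee_Id = s.Employee_Id;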

Strategy 2

This strategy is used when a join is performed on a Primary Index column of one table to a non-primary indexed column of another table.

[Figure: Teradata Join Strategies]

Two tables are being joined based on the DEPT column. In the department table, the Primary Index column is DEPT.

The employee table has EMP as the Primary Index column. The primary objective is to bring the rows together from each table on the same AMPs.

There are several options that the Teradata Optimizer could choose to complete this task, such as:

  • The first option is to duplicate the smaller table on all AMPs.
  • The second option is to leave the Department table, whose Primary Index column is used in the equality condition, stationary on its AMPs.
  • The next step would be to move the rows from the Employee table into spool. This is accomplished by hashing the join column values of the Employee table rows and then transferring those rows in spool to the appropriate AMPs where the matching Department table rows reside.

[Figure: Teradata Join Strategies]

Strategy 3

This strategy is used when neither table is being joined on its Primary Index. In this case, Teradata will redistribute both tables into spool and sort them by hash code.

To redistribute and sort by hash code, the system hashes the join (non-primary-index) columns and moves the rows to spool on the target AMPs, where they are sorted by hash.

Once this is accomplished, then the appropriate rows are together in spool on all the AMPs.

[Figure: Teradata Join Strategies]

The Primary Index of the department table is DEPT, and the Primary Index for the manager table is LOC. In this case, both columns being utilized in this join equality are not part of the Primary Index columns.

Rows from both tables will need to be rehashed and redistributed into SPOOL because neither column selected in the ON Clause is the Primary Index of the respective tables. Therefore, both tables are redistributed based on the ON clause columns.

[Figure: Teradata Join Strategies]

The next step in this process is to redistribute the rows and locate them to the matching AMPs.

When this is completed, the rows from both tables will be located in two different spools. The rows in each spool will be joined together to bring back the matching rows.

Merge Join Strategy 4

The fourth merge join strategy is called the big table - small table join. If one of the tables being joined is small, then Teradata may choose a plan that will duplicate the smaller table across all the AMPs.

The key to this strategy is that, regardless of whether the join column is a Primary Index column or not, Teradata can still choose to duplicate the smaller table across all the AMPs.

[Figure: Teradata Join Strategies]

The two tables involved in the join are the Employee table and the Department table.

The Dept column is the join equality that is making the match between the two tables.

The DEPT column is the Primary Index Column in the Department table.

The Employee table has the Emp column as the Primary Index. The Department table is small.

To join these two tables together:

  • The first step is to get the rows together on the same AMP. In this case, since the Department table is small, Teradata will choose to duplicate the entire Department table on each AMP into a spool.
  • Then, each AMP joins its base Employee rows with the duplicated Department rows.

Nested Join

A nested join is designed to utilize a unique index type (either a Unique Primary Index or a Unique Secondary Index) from one of the tables in the join statement to retrieve a single row.

Then it matches that row to one or more rows on the other table being used in the join.

The nested join has the join equality (ON) condition based on the Dept column. The Dept column is the Primary Index Column on the department table.

The Dept column is the Secondary Index Column in the employee table. The nested join can move a single row into the spool and then match that row against the other table, which may contain several matching rows.

Analysis of this join statement indicates a new clause has been added to this statement. This is known as the WHERE option.

When utilized, the WHERE option allows for a single row to be retrieved from a table. A nested join will always use a unique index to isolate that single record and then join that record to another table.
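A hedged sketch of a query that could qualify for a nested join, assuming Dept is the Unique Primary Index of the hypothetical DeptTable and a secondary index exists on EmployeeTable.Dept:

-- the WHERE condition isolates a single DeptTable row through its unique index;
-- that row is then matched against the EmployeeTable rows for the same Dept
SELECT E.Emp, E.Name, D.Dept
FROM EmployeeTable E
INNER JOIN DeptTable D
ON E.Dept = D.Dept
WHERE D.Dept = 10;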

The other table may use an index, or it may not. The utilization of indexes in join statements will improve performance and utilize fewer resources, as shown in the below diagram.

(Diagram: Teradata Join Strategies)

Since there is only one row in the department table with a match for department =10, which is based on the AND option in the join statement, the Teradata Optimizer will choose a path to move the department table columns into the spool and duplicate them across all the AMP.

Once that is complete, the match proceeds with that single record (10 and SALES) against the second table, which did not move from its base AMP.

Nested Joins are great in an OLTP Environment because of the usage of both Unique and Non-Unique Indexes.

Row Hash Join

The Hash Join is part of the Merge Join. The key to a Merge Join is based on equality condition, such as E.Dept = D.Dept in the ON clause of the join statement.

A Hash Join can only occur if one or both of the tables on each AMP can fit entirely inside the AMP's memory.

  • The column's names can be different, but the row information has to be similar for the match to work.
  • Both the EMP and MGREMP columns have the same type of information, so therefore a join based on these column names will be successful.
  • EMP column is the Primary Index column on the employee table. However, the MGREMP column is not an index column in the department table.
  • The key to determining a Hash Join is if the SMALLER TABLE can be entirely held in each AMP's MEMORY.

The Hash Join process is where the smaller table is sorted by row hash and duplicated on every AMP.

Here the key is a smaller table that is required to be held entirely in each AMP's memory.

Teradata will use the join column of the larger table to search for a match. The row hash join is extremely efficient because it eliminates the sorting, redistribution, and copying of the larger table into the spool.

The rows that are duplicated into the AMP's memory yield increased performance because the rows never go into a spool. Rows that go into spool always have to involve disk activity.
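For illustration only (hypothetical tables; EMP is the Primary Index of EMPLOYEETABLE, while MGREMP is not an index of DEPTTABLE), a join of the kind described above might be written as:

-- if the smaller DEPTTABLE fits in each AMP's memory,
-- the optimizer may choose a hash join and avoid spooling the larger table
SELECT E.EMP, D.DEPT
FROM EMPLOYEETABLE E
INNER JOIN DEPTTABLE D
ON E.EMP = D.MGREMP;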

Product Joins

Product Joins compare every row of one table to every row of another table. They are called product joins because they are a product of the number of rows in table one multiplied by the number of rows in table two.

For example, if one table had six rows and the other had six rows, then the Product Join would compare 6 x 6 rows with a potential of 36 rows.

  1. SELECT E.EMP,D.DEPT   
  2. FROM EMPLOYEETABLE E,DEPTTABLE D  
  3. WHERE  
  4. EMP LIKE '_b%'  

Many times, product joins are major mistakes because all rows in both tables will be compared.

Teradata tables have the potential to contain millions of rows. If a user accidentally writes a product join against two tables that have 1 million rows each, the result set would return one trillion rows. To avoid a product join, check the syntax to ensure that the join is based on an EQUALITY condition.

In the above example, the WHERE clause reads WHERE EMP LIKE '_b%'. Because this clause is not based on a common domain condition between the two tables (e.dept = d.dept), the result is a product join.

Another cause of a product join is when aliases are not used after being established. So firstly, ensure the WHERE clause is not missing in the join syntax.
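As a sketch of the fix, adding an equality condition on the common domain column (while keeping the LIKE filter) turns the product join into a normal merge join; the table and column names follow the hypothetical example above:

-- the equality condition e.dept = d.dept gives the optimizer a join path,
-- so every row no longer has to be compared with every other row
SELECT E.EMP, D.DEPT
FROM EMPLOYEETABLE E, DEPTTABLE D
WHERE E.DEPT = D.DEPT
AND E.EMP LIKE '_b%';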

Cartesian product Join

A Cartesian Product Join will join every row in one table to every row in another table. The number of rows returned is the product of the number of rows in the two tables.

If one table had 5 rows and another had 10 rows, then we will always get 50 rows returned.

  1. SELECT E.EMP,D.DEPT   
  2. FROM EMPLOYEETABLE E,DEPTTABLE D;  

Most of the time, a Cartesian Product Join is a major problem because all rows in both tables will be joined.

To avoid a Cartesian Product Join, check the syntax to ensure that the join is based on an EQUALITY condition.

In the above example, the WHERE clause is missing. Because there is no common domain condition between the two tables (e.dept = d.dept), the result is a Cartesian product join.

Another cause of a product join is when aliases are not used after being established.

Exclusion Join

Exclusion Joins have one primary function. They exclude rows during a join.

  1. SELECT EMP, DEPT, NAME   
  2. FROM   
  3. EMPLOYEETABLE  
  4. WHERE DEPT=10 and   
  5. EMP NOT IN (SELECT MGREMP from DEPTTABLE WHERE MGREMP IS NOT NULL)  

In the above example, the join utilizes the NOT IN statement. Exclusion joins are used for finding rows that don't have a matching row in the other table.

Queries with the NOT IN operator are the types of queries that always provide exclusion joins results. In this case, this query will find all the employees who belong to department 10 who are NOT managers.

These joins will always involve a Full Table Scan because Teradata will need to compare every record to eliminate rows that will need to be excluded.

This type of join can be resource-intensive if the two tables in this comparison are large.

The biggest problem with Exclusion Joins arises when the NOT IN list contains a NULL. Because NULLs are treated as unknowns, the comparison cannot be resolved and the query returns no rows. There are two ways to avoid this problem:

  1. Define NOT IN columns as NOT NULL on the CREATE TABLE.
  2. Add the "AND WHERE Column IS NOT NULL" to the end of the JOIN.

Teradata Views

Views are database objects that are built by the query. Views can be created using a single table or multiple tables by way of joining.

Their definition is stored permanently in a data dictionary, but they don't save a copy. Data for the view is built dynamically.

A view may contain a subset of the table's rows or a subset of columns.

Create a View

Views are created using a special form of Data Definition Language (DDL). The CREATE requests a new VIEW, provides the name and the SELECT for the view. It is suggested the name of the view either start with "v_" or end with "_v" to identify it as a view name.

That way, it is visibly obvious to people that this is a view and not a table. The name of the view must be unique from the names of other objects in the database. The CREATE VIEW verifies that the name does not already exist and returns an error if it does.

Syntax

Following is the syntax for creating a view.

  1. CREATE/REPLACE VIEW <view_name>   
  2. AS    
  3. <select query>;   
  1. view_name: Name of the new view. If view_name is not fully qualified, the default database is used.
  2. database_name / user_name: Name of the database or user to contain view_name, if it is something other than the current database or user.
  3. column_name: Name of a view column. If more than one column is specified, list their names in the order in which each column is to be displayed for the view.
  4. AS: Introduces the view definition.

NOTE

  • Views that reference a row-level security table can include columns based on row-level security constraints, but it is not required. However, the view enforces any security constraints in the base table, whether or not they are included in the view definition.
  • A view can reference both row-level security tables and non-row-level security tables, but all referenced row-level security tables must contain the same security constraint columns or subsequent requests to access the view fail with an error.

Example

Consider the following Employee table.

Emp_Id    First_Name    Last_Name    Department_No    BirthDate
202001    Mike          Richard      1                4/8/1988
202002    Robert        Williams     2                8/5/1991
202003    Peter         Collin       2                5/9/1990
202004    Alexa         Stuart       1                15/11/1989
202005    Robert        Peterson     1                22/2/1990

The following example creates a view on the Employee table.

  1. CREATE VIEW Employee_View   
  2. AS   
  3. SELECT   
  4. Emp_Id,   
  5. First_Name,   
  6. Last_Name,  
  7. Department_No,  
  8. BirthDate   
  9. FROM    
  10. Employee;  

Using Views

We can use a regular SELECT statement to retrieve data from Views.

Example

The following example retrieves the records from Employee_View;

  1. SELECT Emp_Id,   
  2. First_Name,   
  3. Last_Name,   
  4. Department_No,  
  5. BirthDate   
  6. FROM Employee_View;  

Modifying Views

A current view can be modified using the REPLACE VIEW statement.

REPLACE VIEW redefines an existing view or, if the specified view does not exist, it creates a new view with the specified name.

Syntax

Following is the syntax to modify a view.

  1. REPLACE VIEW <viewname>   
  2. AS    
  3. <select query>;  

Example

The following example modifies the view Employee_View for adding additional columns.

  1. REPLACE VIEW Employee_View   
  2. AS   
  3. SELECT   
  4. Employee_Id,   
  5. First_Name,   
  6. Last_Name,  
  7. BirthDate,  
  8. JoinedDate,   
  9. Department_No   
  10. FROM    
  11. Employee;   

Drop View

A current view can be dropped using the DROP VIEW statement.

Syntax

Following is the syntax of DROP VIEW.

  1. DROP VIEW <viewname>;   

Example

Following is an example to drop the view Employee_View.

  1. DROP VIEW Employee_View;   

Advantages of Views

Here are some advantages of using Views in Teradata, such as:

  • Views provide an additional level of security by restricting the rows or columns of a table.
  • Users can be given access only to views instead of base tables.
  • Simplifies the use of multiple tables by pre-joining those using Views.
  • Views return only the required rows and columns from the tables.
  • Views can reduce network bandwidth because only the needed data is returned.
  • They provide a stable interface to the table data even if the underlying table structure changes.

Teradata Macros

Macro is a set of SQL statements that are stored and executed by calling the Macro name. The definition of Macros is stored in the Data Dictionary.

Users only need EXEC privilege to execute the Macro. Users don't need separate rights on the database objects used inside the Macro.

A macro consists of one or more statements that can be executed by performing a single statement. Each time the Macro is executed, one or more rows of data can be returned.

Macro statements are executed as a single transaction. If one of the SQL statements in Macro fails, then all the statements are rolled back. Macros can accept parameters. Macros can contain DDL statements, but that should be the last statement in Macro.

A frequently used SQL statement or series of statements can be incorporated into a macro and defined using the SQL CREATE MACRO statement. The contents of the Macro are executed using the EXECUTE statement. A macro can include an EXECUTE statement that executes another macro.

Create Macros

To create a macro, you need the CREATE MACRO privilege on the database or user in which the Macro is to be created. The creator automatically gets the DROP MACRO and EXECUTE privileges WITH GRANT OPTION. The immediate owner of the Macro:

  • Is the database in which it exists, not the user who created it.
  • Determines the access privileges of the Macro.
  • Must have all the appropriate privileges for executing the Macro, including WITH GRANT OPTION.

Syntax

Macros are created using the CREATE MACRO statement. Following is the generic syntax of the CREATE MACRO command.

  1. CREATE MACRO <macroname> [(parameter1, parameter2, ...)] AS   
  2. (   
  3.    <sql statements>   
  4. );  
  • macro_name: The name of the new Macro. If a fully qualified name is not specified, the default database or user is used. The name of a macro must be unique within its containing user or database.
Example

Consider the following Employee and Salary tables.

Employee Table:

Emp_Id    First_Name    Last_Name    BirthDate
202001    Mike          Richard      11/05/1990
202002    Robert        Williams     20/01/1988
202003    Peter         Collin       21/08/1985
202004    Alexa         Stuart       10/12/1992
202005    Robert        Peterson     19/03/1987

Salary Table:

Emp_Id    Gross     Deduction    NetPay
202001    40,000    4,000        36,000
202002    80,000    6,000        74,000
202003    90,000    7,000        83,000
202004    75,000    5,000        70,000
202005    80,000    0            80,000

The following example creates a Macro called Get_Emp. It contains a select statement to retrieve records from the employee table.

  1. CREATE MACRO Get_Emp AS   
  2. (   
  3.    SELECT   
  4.    Emp_Id,   
  5.    First_Name,   
  6.    Last_Name   
  7.    FROM    
  8.    employee   
  9.    ORDER BY Emp_Id;   
  10. );  

Executing Macros

The immediate owner of the Macro (the database in which the Macro resides) must have the necessary privileges on objects named in the request set for statements that are contained in the Macro.

Syntax

Macros are executed using the EXEC command. Following is the syntax of the EXECUTE MACRO command.

  1. EXEC <macroname>;  

Example

The following example executes the Macro named Get_Emp. When the following command is executed, it retrieves all records from the employee table.

  1. EXEC Get_Emp;   
  2.       /* Query completed. 5 rows found. 3 columns returned */  
  3.       /* Total elapsed time was 1 second */  
  4.   
  5. Emp_Id          First_Name           Last_Name   
  6. 202001           Mike                 Richard  
  7. 202002           Robert               Williams  
  8. 202003           Peter                Collin   
  9. 202004           Alexa                Stuart  
  10. 202005           Robert               Peterson    

Parameterized Macros

Teradata Macros can accept parameters. Within a Macro, these parameters are referenced with a colon (:).

Example

Following is an example of a Macro that accepts parameters.

  1. CREATE MACRO Get_Emp_Salary(Emp_Id INTEGER) AS   
  2. (   
  3.    SELECT   
  4.    Emp_Id,   
  5.    NetPay   
  6.    FROM    
  7.    Salary   
  8.    WHERE Emp_Id = :Emp_Id;   
  9. );  

Executing Parameterized Macros

Macros are executed using the EXEC command. We need EXEC privilege to execute the Macros.

Syntax

Following is the syntax of the EXECUTE MACRO statement.

  1. EXEC <macroname>(value);  

Example

The following example executes the Macro named Get_Emp_Salary. It accepts an employee Id as a parameter and extracts records from the Salary table for that employee.

  1. EXEC Get_Emp_Salary(202001);   
  2.   
  3.        /* Query completed. One row found. 2 columns returned */  
  4.        /* Total elapsed time was 1 second */  
  5.    
  6. Emp_Id                  NetPay   
  7. 202001                  36,000  

Replace a Macro

REPLACE privilege depends on whether the Macro being replaced already exists.

  • If the Macro already exists, you need the DROP MACRO privilege on the Macro or on its containing database or user.
  • If the Macro does not already exist, you need the CREATE MACRO privilege on its containing database or user.

NOTE

  • Once a macro has been replaced, its immediate owner is the database in which it exists, not the user who replaced it.
  • The immediately owning database must have all the appropriate privileges for executing the Macro, including WITH GRANT OPTION.

Teradata OLAP Functions

OLAP functions are similar to aggregate functions except that the aggregate functions will return only one value. In contrast, the OLAP function will provide the individual rows in addition to the aggregates.

The OLAP functions are built into the Teradata database to provide data mining capabilities and trend analysis.

These functions provide processing not available using the standard aggregation. The OLAP functions give the result of their operation and display the detail data values used in the function.

The detail row data is displayed as part of the answer set since the output is in row format and not in a report format like WITH.

The OLAP functions may be performed on all tables or views and used to populate tables in conjunction with INSERT/SELECT. The most significant difference is that these functions can be used in Queryman, unlike WITH.

Syntax

Following is the syntax of the OLAP function.

  1. <aggregate function> OVER    
  2. ([PARTITION BY columnname] [ORDER BY columnname]   
  3. [ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING])  

NOTE: Aggregation functions can be SUM, COUNT, MIN, MAX, and AVG.

Example

Consider the following Salary table of the employees.

Emp_Id    Gross     Deduction    NetPay
202001    40,000    4,000        36,000
202002    80,000    6,000        74,000
202003    90,000    7,000        83,000
202004    75,000    5,000        70,000

Following is an example to find the cumulative sum or running total of NetPay on the Salary table.

Records are sorted by Emp_Id, and the cumulative sum is calculated on the NetPay column.

  1. SELECT    
  2. Emp_Id, NetPay,   
  3. SUM(Netpay) OVER(ORDER BY Emp_Id ROWS    
  4. UNBOUNDED PRECEDING) as TotalSalary   
  5. FROM Salary;  

After executing the above query, it produces the following output.

  1. Emp_Id           NetPay     TotalSalary   
  2. 202001           36000        36000   
  3. 202002           74000        110000   
  4. 202003           83000        193000   
  5. 202004           70000        263000   

QUANTILE Function

A QUANTILE function is used to divide rows into several partitions of roughly the same number of rows in each partition. The percentile is the QUANTILE most commonly used in business.

By default, both the QUANTILE column and the QUANTILE value itself will be output in ascending sequence.

In some cases, the ORDER BY clause may be used to reorder the output for display. Here the order of the output does not change the meaning of the output, unlike a summation where the values are being added together and all need to appear in the proper sequence.

Syntax

Following is the syntax of the QUANTILE function.

  1. SELECT QUANTILE (<partitions>, <column_name> ,<sort_key> [DESC | ASC])  
  2. FROM <table_name>  
  3. [QUALIFY QUANTILE (<column_name>) {< | > | = | <= | >=} <number of rows>];  
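As a minimal sketch using the Salary table shown earlier, the query below buckets employees into 100 partitions (percentiles) by NetPay and keeps only the lowest quartile; the threshold value of 25 is purely illustrative:

-- QUANTILE(100, NetPay) assigns each row a percentile from 0 to 99;
-- QUALIFY then filters on that computed value
SELECT Emp_Id, NetPay, QUANTILE(100, NetPay)
FROM Salary
QUALIFY QUANTILE(100, NetPay) <= 25;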

RANK Function

RANK function orders the records based on the column provided. The RANK function can also filter the number of records returned based on the Rank.

The RANK function allows a column to be evaluated and compared based on either high or low order, against all other rows to create the output set.

The order will be sorted by default in the descending sequence of the ranking column, which correlates to descending Rank.

The output of the RANK function is the highest or the lowest data values in the column, depending on the sort requested.

Syntax

Following is the syntax to use the RANK function.

  1. SELECT RANK( <column_name> [DESC | ASC] )  
  2. FROM <table_name>  
  3. [GROUP BY <column_name>, [<column_number> ] ] [{ ASC | DESC } ]  
  4. [QUALIFY RANK(<column_name>) {< | <=} <number of rows> ];  

Example

Consider the following Employee table.

Emp_Id    First_Name    Last_Name    JoinedDate    Department_No    BirthDate
202001    Mike          Richard      3/27/2008     1                1/5/1980
202002    Robert        Williams     4/25/2013     1                3/5/1983
202003    Peter         Collin       3/21/2010     1                4/1/1983
202004    Alexa         Stuart       1/2/2017      2                11/6/1984
202005    Robert        Peterson     1/4/2015      2                12/1/1984

The following query orders the records of the employee table by Joined Date and assigns the ranking on Joined Date.

  1. SELECT Emp_Id, JoinedDate, RANK()   
  2. OVER(ORDER BY JoinedDate) as Seniority   
  3. FROM Employee;  

When the above query is executed, it produces the following output.

  1. Emp_Id        JoinedDate       Seniority    
  2. 202001        03/27/2008         1   
  3. 202003        03/21/2010         2   
  4. 202002        04/25/2013         3   
  5. 202005        01/04/2015         4   
  6. 202004        01/02/2017         5   

PARTITION BY clause groups the data by the columns defined in the PARTITION BY clause and performs the OLAP function within each group. Following is an example of the query that uses the PARTITION BY clause.

  1. SELECT Emp_Id, DepartmentNo, JoinedDate, RANK()   
  2. OVER(PARTITION BY DepartmentNo ORDER BY JoinedDate) as Seniority   
  3. FROM Employee;  

When the above query is executed, it produces the following output. We can see that the Rank is reset for each Department.

  1. Emp_Id         DepartmentNo       JoinedDate         Seniority    
  2. 202001           1                03/27/2008              1   
  3. 202003           1                03/21/2010              2   
  4. 202002           1                04/25/2013              3   
  5. 202005           2                01/04/2015              1   
  6. 202004           2                01/02/2017              2  


Teradata Data Protection

Teradata has many features to protect against potential data loss. Each of these features protects the data differently and on a different level of granularity. Some processes for protection are automatic, and some of them are optional.

Here are some features available for data protection in Teradata.

  1. RAID
  2. Fallback
  3. Cliques
  4. Journals
  5. Hot Standby Nodes

RAID

RAID mechanism is used to protect data from Disk Failures. Disk Array consists of a set of disks that are grouped as a logical unit. This unit looks like a single unit to the user, but it can be spread across several disks.

RAID 1 is commonly used in Teradata. In RAID 1, each disk is associated with a mirror disk. Any changes to the data in the primary disk are reflected in mirror copy as well. If the primary disk fails, then the data from the mirror disk can be accessed.

Although each AMP has exactly one virtual disk (VDISK) assigned, several physical disks make up this VDISK. Half of the disks keep the mirrored data.

As in the case of fallback protection, the cost is the usage of twice as much disk space.

The RAID 1 configuration ensures enhanced data accessibility and security. If a single disk fails, data will serve from the mirror disk. Only if both the primary disk and the mirror disk fail, and no fallback protection were enabled, the data would be lost.

Fallback

Fallback protects the table data by storing a second copy of the rows of a table on another AMP, called the Fallback AMP. If one AMP fails, then the fallback rows are accessed. With this, even if one AMP fails, data is still available through the fallback AMP. The fallback option can be used at table creation or after table creation.
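A minimal sketch of both approaches, using a hypothetical Emp_FB table (the column names are assumptions):

-- create the table without fallback first
CREATE TABLE tduser.Emp_FB
(
   EmpId INTEGER,
   FirstName VARCHAR(30)
)
UNIQUE PRIMARY INDEX (EmpId);

-- add fallback protection after table creation
ALTER TABLE tduser.Emp_FB, FALLBACK;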

Fallback ensures that the second copy of the table's rows is always stored in another AMP to protect the data from AMP failure. However, fallback occupies twice the storage, and Inserts, Deletes, and Updates require twice the I/O.

The advantage of fallback protection is that the data may be accessible in case of an AMP failure, as the auxiliary AMP will take over its tasks until the failed AMP comes back online.

Only in the unlikely event that the fallback AMP fails as well will the table ultimately become inaccessible.

The following diagram shows how a fallback copy of the rows is stored in another AMP.

(Diagram: Teradata Data Protection)

To achieve maximum protection, AMP's are grouped in clusters. Main AMP and fallback AMP always belong to the same cluster and protect each other.

The main AMP and the fallback AMP are never stored physically on the same node, a prudent design choice given the nature of hardware failures. Even if a complete node fails, the fallback protection still allows us to access the data.

Cliques

Clique is a mechanism used by Teradata to protect data from node failures. A clique is nothing but a set of Teradata nodes that share a common set of Disk Arrays. When a node fails, the AMPs of the failed node migrate to another node belonging to the same clique, where they stay fully functional and continue to access their disk arrays.

This protection mechanism adds another level of security to the Teradata System.

Before the AMP migration can occur, a system restart is required, followed by another restart as soon as the failed node comes back online. To remove this restriction, Teradata offers hot standby nodes to take over the AMPs of a failed node.

As hot standby node AMPs are not engaged in routine operations in the ordinary course of business, no performance degradation occurs.

Journals

Journals are kept on the system to provide data availability in the event of a component or process failure. Journals are the following types in the Teradata:

1. Recovery Journal: The Teradata Database automatically uses Recovery Journals to maintain data integrity in the case of:

  • An interrupted transaction.
  • An AMP failure.

Recovery Journals are created, maintained, and purged by the system automatically, so no DBA intervention is required.

Recovery Journals are tables stored on disk arrays like user data is, so they take up disk space on the system.

2. Down-AMP Recovery Journal: The Down-AMP Recovery Journal allows continued system operation while an AMP is down.

A Down-AMP Recovery Journal is used with Fallback-protected tables to maintain a record of write transactions (updates, creates, inserts, deletes, etc.) against the failed AMP while it is unavailable.

The Down-AMP Recovery Journal starts after the loss of an AMP in a cluster automatically. Any changes to the data on the failed AMP are logged into the Down-AMP Recovery Journal by the other AMPs in the group.

When the failed AMP is online again, the restart process includes applying the changes in the Down-AMP Recovery Journal to the recovered AMP. The journal is discarded when the process is complete, and the AMP is brought online and fully recovered.

3. Transient Journal: A Transient Journal maintains data integrity when in-flight transactions are interrupted. Data is returned to its original state after transaction failure.

A Transient Journal is used during regular system operation to keep "before images" of changed rows so the data can be restored to its previous state if the transaction is not completed. This happens on each AMP as changes occur. When a transaction is started, the system automatically stores a copy of all the rows affected by the transaction in the Transient Journal until the transaction is completed.

Once the transaction is complete, the "before images" are purged. In the event of a transaction failure, "before images" are reapplied to affected tables, deleted from the journal, and the "rollback" operation is completed.

4. Permanent Journal: Permanent Journals are an optional feature used to provide an additional level of data protection. We can specify the use of Permanent Journals at the table level. It offers full-table recovery for a specific point in time. It can also reduce the need for costly and time-consuming full-table backups.

Permanent Journals are tables stored on disk arrays like user data, so they take up additional disk space. The Database Administrator maintains the Permanent Journal entries (deleting, archiving, and so on).

The additional disk space required may be calculated in advance to ensure adequate resources. The Database Administrator must periodically archive the Permanent Journal to external media, which reduces the need for full-table backups since only changes are backed up rather than the entire database.
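A hedged sketch of enabling a permanent journal at the table level; this assumes the containing database already defines a DEFAULT JOURNAL TABLE, and the table and column names are hypothetical:

-- keep before-images and after-images of changed rows in the permanent journal
CREATE TABLE tduser.Emp_PJ,
   FALLBACK,
   BEFORE JOURNAL,
   AFTER JOURNAL
(
   EmpId INTEGER,
   FirstName VARCHAR(30)
)
UNIQUE PRIMARY INDEX (EmpId);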

Hot Standby Node

Hot Standby Node is a node that does not participate in the production environment.

Teradata Database can use spare nodes to improve availability and maintain performance levels in the event of a node failure. If a node fails, then the vprocs from the failed node will migrate to the hot standby node.

Hot Standby nodes are used to maintain performance in case of node failures. When the failed node is recovered and restarted, it becomes the new hot standby node; a second restart of Teradata is not needed.

Hot standby node has the following characteristics:

  • A node that is a member of a clique.
  • Hot Standby Node does not regularly participate in the trusted parallel application (TPA).
  • It can be brought into the configuration when a node fails in the clique.
  • It helps with unplanned outages.
  • It also eliminates the need for a restart to bring a failed node back into service.

Teradata Stored Procedure

A stored procedure contains a set of SQL statements and procedural statements. They consist of a set of control and condition handling statements that make SQL a computationally complete programming language. The definition of the stored procedure is stored in the database, and the parameters are stored in data dictionary tables.

These features provide a server-based procedural interface to Teradata Database for application programmers.

Stored procedures can be a great benefit for some tactical query applications. This section provides:

  • Some examples of using stored procedures to process complex updates.
  • A comparison of the relative efficiency of stored procedures and macros for different tactical query applications.

Advantages

Teradata Stored Procedure provides the following essential advantages, such as:

  • Stored procedures reduce the network load between the client and the server.
  • It provides better security since the data is accessed through stored procedures instead of accessing them directly.
  • It gives better maintenance since the business logic is tested and stored in the server.

Elements of Stored Procedures

The set of statements constituting the main tasks of the stored procedure is called the stored procedure body, consisting of a single statement or a compound statement or block.

A single statement stored procedure body can contain one control statement, such as LOOP or WHILE, or one SQL DDL, DML, or DCL statement, including dynamic SQL. The following statements are not allowed, such as:

  • Any declaration statement, such as local variable, cursor, condition, or condition handler.
  • A cursor statement (OPEN, FETCH, or CLOSE).

A compound statement stored procedure body consists of a BEGIN-END statement enclosing a set of declarations and statements, including:

  • Local variable declarations.
  • Cursor declarations.
  • Condition declarations.
  • Condition handler declaration statements.
  • Control statements.
  • SQL DML, DDL, and DCL statements supported by stored procedures, including dynamic SQL.
  • Multi-statement requests (including dynamic multi-statement requests) delimited by the keywords BEGIN REQUEST and END REQUEST.

Creating a Stored Procedure

Teradata stored procedure is created from the following commands:

  • BTEQ utility using the COMPILE command.
  • CLIv2 applications, ODBC, JDBC, and Teradata SQL Assistant (formerly called Queryman) using the SQL CREATE PROCEDURE or REPLACE PROCEDURE statement.
  • Stored Procedures are created using the CREATE PROCEDURE statement.

The procedures are stored in the user database space as objects and are executed on the server.

Syntax

Following is the syntax of the CREATE PROCEDURE statement.

  1. CREATE PROCEDURE <procedurename> ( [parameter 1 data type, parameter 2 data type..] )   
  2. BEGIN   
  3.    <SQL or SPL statements>;   
  4. END;  

Example

Consider the following Salary Table of the employees.

EmployeeId    Gross     Deduction    NetPay
202001        40,000    2,000        38,000
202002        80,000    3,000        77,000
202003        85,000    5,000        80,000
202004        90,000    2,000        88,000

The following example creates a stored procedure named InsertSalary to accept the values and insert them into the Salary Table.

  1. CREATE PROCEDURE InsertSalary   
  2. (   
  3.    IN in_EmployeeId INTEGER,   
  4.    IN in_Gross INTEGER,   
  5.    IN in_Deduction INTEGER,   
  6.    IN in_NetPay INTEGER   
  7. )   
  8. BEGIN   
  9.    INSERT INTO Salary   
  10. (   
  11.       EmployeeId,   
  12.       Gross,   
  13.       Deduction,   
  14.       NetPay   
  15.  )   
  16.    VALUES   
  17. (   
  18.       :in_EmployeeId,   
  19.       :in_Gross,   
  20.       :in_Deduction,   
  21.       :in_NetPay   
  22.   );  
  23.    
  24. END;  

Executing Stored Procedures

If we have sufficient privileges, we can execute a stored procedure from any supporting client utility or interface using the SQL CALL statement.

We can also execute a stored procedure from an external stored procedure written in C, C++, or Java. We have to specify arguments for all the parameters contained in the stored procedure.

Syntax

Following is the generic syntax of the CALL statement.

  1. CALL <procedure name> [(parameter values)];  

Example

The below example calls the stored procedure InsertSalary and inserts records to Salary Table of the employees.
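The CALL statement itself is not shown here; a call consistent with the resulting table below (it inserts the 202005 row) would be:

-- the four arguments map to the IN parameters of InsertSalary
CALL InsertSalary(202005, 20000, 1000, 19000);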

EmployeeId    Gross     Deduction    NetPay
202001        40,000    2,000        38,000
202002        80,000    3,000        77,000
202003        85,000    5,000        80,000
202004        90,000    2,000        88,000
202005        20,000    1,000        19,000

Teradata User Management

The Teradata Database Administrators are responsible for managing the users in the database. They will create the Teradata account for each user with limited space and access to the database objects.

Users

A user is created using the CREATE USER command. In Teradata, a user is also similar to a database. They both can be assigned space and contain database objects except that the user is assigned a password.

We must explicitly grant the CREATE USER or USER privilege to any user that requires it, except user DBC, which has these privileges by default. The creator of a user automatically receives the DROP USER privilege on the user.

Syntax

Following is the syntax for CREATE USER.

  1. CREATE USER username   
  2. AS    
  3. [PERMANENT/PERM] = n BYTES   
  4. PASSWORD = password   
  5. TEMPORARY = n BYTES   
  6. SPOOL = n BYTES;  
  • CREATE USER user_name: The name of an individual Teradata Database user. We cannot allow multiple users to share a username for accessing the database.
  • PERM: The space in bytes available to contain objects that the user creates or owns. Permanent space cannot exceed the current PERM of the user's immediate owner, which is usually the database or user specified in the FROM clause of the CREATE USER statement.
  • PASSWORD = password: A temporary password. If the system is set up according to the recommendations in creating the security administrator user, the user is prompted to change the temporary password to a private password at first login.
  • TEMPORARY: Only objects such as global temporary tables and volatile tables require temporary space. We should develop a strategy for using the database objects that need temp space before assigning temp space to users. We can add a temp space specification later with a MODIFY USER or MODIFY PROFILE statement.
  • SPOOL: Defines the maximum amount of space available to the user for intermediate query results, volatile tables, or formatted answer sets to queries. The system borrows spool space from unused system perm space, such as the spool reserve in creating the spool space reserve.
    The spool space specified in a profile takes precedence over spool space defined in a CREATE USER statement.
    Specify spool space as part of a profile unless the user requires unique spool considerations.
    The spool space allocation should be approximately 20% of the perm space available to the most extensive database the user accesses.

NOTE: While creating a user, the values for user name, Permanent space, and Password are mandatory; the other fields are optional.

Example

Following is an example to create user T001.

  1. CREATE USER T001   
  2. AS    
  3. PERMANENT = 1000000 BYTES   
  4. PASSWORD = ABC@999   
  5. TEMPORARY = 1000000 BYTES   
  6. SPOOL = 1000000 BYTES;  

Accounts

The ACCOUNT option in CREATE USER is used to assign an account. The default account for a new user is determined as follows:

  • A profile is assigned to the user, and the profile contains more than one account: the default is the first account in the string. The user can submit a SET SESSION ACCOUNT statement to set the default for the current session to one of the other accounts.
  • A profile is assigned to the user, and the profile contains only one account: the default is the account defined in the profile.
  • A profile is assigned to the user, but the profile does not include an account: the default is the account identifier of the immediate owner of the user.
  • No profile is assigned to the user: the default is the account identifier of the user's immediate owner.

NOTE

  1. If a user definition specifies an account but not a profile, the default account is the account string in the user definition.
  2. If you assign a profile to a user, and the profile specifies an account string, the profile account is in effect for AMPUsage and DBQL for that user. However, the system uses the value found in the DBC.DatabaseV view for space accounting.

Syntax

Following is the syntax for CREATE USER with the account option.

  1. CREATE USER username   
  2. AS   
  3. PERMANENT = n BYTES   
  4. PASSWORD = password   
  5. ACCOUNT = 'accountid';  

Example

The following example creates the user T002 and assigns the account as IT and Admin.

  1. CREATE USER T002   
  2. AS    
  3. PERMANENT = 1000000 BYTES   
  4. PASSWORD = ABC@123   
  5. TEMPORARY = 1000000 BYTES   
  6. SPOOL = 1000000 BYTES   
  7. ACCOUNT = ('IT','Admin');  

The user can specify the account id while logging into the Teradata system or after being logged into the system using the SET SESSION command.
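For illustration (a sketch only; the logon values reuse the example above, and the exact client syntax may vary slightly):

-- BTEQ logon that passes the account id 'IT' as the third element
.LOGON 192.168.1.102/T002,ABC@123,'IT';

-- switch to another of the user's accounts for the rest of the session
SET SESSION ACCOUNT = 'Admin' FOR SESSION;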

Grant Privileges

GRANT command is used to assign one or more privileges on the database objects to the user or database.

Syntax

Following is the syntax of the GRANT command.

  1. GRANT privileges ON objectname TO username;  

NOTE
Privileges can be INSERT, SELECT, UPDATE, REFERENCES.

Example

Following is an example of the GRANT statement.

  1. GRANT SELECT, INSERT, UPDATE ON Employee TO T001;  

Revoke Privileges

REVOKE command removes the privileges from the users or databases. The REVOKE command can only remove explicit privileges.

Syntax

Following is the basic syntax for the REVOKE command.

  1. REVOKE [ALL/privileges] ON objectname FROM username;  

Example

Below is an example of the REVOKE command.

  1. REVOKE INSERT, SELECT ON Employee FROM T001;  

Teradata FastLoad

Teradata FastLoad is used to load a large amount of data in an empty table on a Teradata System or load data from the client to the Database system.

It allows fast loading of a single empty table with no secondary indexes (SIs), join indexes (JIs), hash indexes (HIs), referential integrity (foreign keys), or column partitioning.

FastLoad loads large amounts of data into an empty table, with or without a primary index, on Teradata Database. However, row size and the number of columns, more than any other factor, affect performance.

Teradata FastLoad is a command-driven utility which can be invoked either in batch or interactive mode.

The Teradata FastLoad utility establishes multiple sessions (by default, one session per AMP), does not use the transient journal during its operation, and transfers data in blocks rather than row by row. That is why it is so fast.

It loads only one table at a time. If we want to load multiple tables simultaneously, invoke multiple FastLoad jobs. FastLoad can load data from:

  • Input files on a network-attached workstation.
  • Disk or tape files on a channel-attached client system.
  • Individual input module (INMOD) routines written to select, validate, and preprocess input data.
  • Any other device providing correctly formatted source data.

Teradata FastLoad Prerequisite

There are three key components specified to run any Teradata FastLoad.

1. Log Table: One log table is needed to track the status of every FastLoad session running on the system.

Teradata maintains one table called fastlog under the database SYSADMIN. To use this table, we need the INSERT, DELETE, and UPDATE privileges on it.

2. Empty Target Table: FastLoad needs an empty target table before inserting data into it. FastLoad does not care how this has been accomplished.

We can achieve this by deleting the data from the target table or by dropping and recreating the target table before starting the load.

3. Two ERROR Tables: Two error tables are required by FastLoad to capture any errors that occur during the FastLoad process. These tables are created automatically.

  • The First ERROR Table: It is used to capture any translation errors or constraint violations.
    For example, a row with the wrong data type in a column would be reported to the first error table.
  • The Second ERROR Table: It is used for errors caused by duplicate values for the Unique Primary Index (UPI).
    FastLoad will load only one instance for every UPI and stores the duplicate occurrences in the second error table. However, if the entire row is a duplicate, FastLoad counts it but does not store the row.

NOTE
Teradata will allow a maximum of 15 FastLoads, Multiloads, or FastExports at the same time. This protects the system from giving all the resources to only load utilities.

Phases of FastLoad

FastLoad divides the whole process into two phases:

Phase 1: Acquisition

  • This phase's main objective is to transfer the data from the Host computer to the Teradata environment for Access Module Processor (AMP) as quickly as possible.
  • Teradata Parsing Engine reads the records from the input file and sends a block to each AMP.
  • Then PE opens a session from the FastLoad client to the AMPs directly. By default, it will create one session per AMP.
  • One of the client sessions will pack the raw data into 64K blocks.
  • Then this large block of data will be sent to the AMP randomly without any concern for which AMPs get the data.
  • After receipt of each data block, every AMP hashes its rows based on the primary index and redistributes them to the proper AMP.
  • After redistribution of rows, data will be written to an internal Worktable on the AMPs but remain unsorted.
  • At the end of Phase 1, each AMP has its rows, but they are not in row hash sequence.

Phase 2: Application

  • The main objective of this phase is to write the data to the actual table space.
  • Phase 2 starts when FastLoad receives the END LOADING statement.
  • Each AMP will sort the rows in its work table by row hash.
  • After sorting, these rows will be written to the actual data block permanently in the disk.
  • Locks on the target table are released, and the error tables are dropped.

Example

Create a text file with the following records and name the file as emp.txt.

  1. 202001,James,1980/01/05,2010/03/01,1    
  2. 202002,Robert,1983/03/05,2010/09/01,1   
  3. 202003,Peter,1983/04/01,2009/02/12,2   
  4. 202004,Mike,1984/11/06,2014/01/01,2   
  5. 202005,Robert,1984/12/01,2015/03/09,3  

Following is a sample FastLoad script to load the above file into the Emp_Stg table.

  1. LOGON 192.168.1.102/dbc,dbc;    
  2.    DATABASE tduser;    
  3.    BEGIN LOADING tduser.Emp_Stg    
  4.       ERRORFILES Emp_ET, Emp_UV    
  5.       CHECKPOINT 10;    
  6.       SET RECORD VARTEXT ",";    
  7.      DEFINE   
  8.          in_EmpId (VARCHAR(10)),   
  9.          in_FirstName (VARCHAR(10)),   
  10.          in_BirthDate (VARCHAR(10)),   
  11.          in_JoinedDate (VARCHAR(10)),   
  12.          in_DepartmentNo (VARCHAR(05)),   
  13.          FILE = emp.txt;  
  14.       INSERT INTO Emp_Stg   
  15.     (  
  16.          EmpId,  
  17.          FirstName,  
  18.          BirthDate,  
  19.          JoinedDate,   
  20.          DepartmentNo  
  21.      )   
  22.   
  23.       VALUES   
  24.     (    
  25.          :in_EmpId,   
  26.          :in_FirstName,   
  27.          :in_BirthDate (FORMAT 'YYYY/MM/DD'),   
  28.          :in_JoinedDate (FORMAT 'YYYY/MM/DD'),  
  29.          :in_DepartmentNo  
  30.       );   
  31.    END LOADING;    
  32. LOGOFF;  

Executing a FastLoad Script

Once the input file emp.txt is created, and the FastLoad script is named as EmpLoad.fl, we can run the FastLoad script using the following command in UNIX and Windows.

  1. FastLoad < EmpLoad.fl;  

Once the above command is executed, the FastLoad script will run and produce a log. The log shows the number of records processed by FastLoad and the status code.

  1. /* 03:19:14 END LOADING COMPLETE   
  2.    Total Records Read = 5   
  3.    Total Error Table 1 = 0 (Table has been dropped)  
  4.    Total Error Table 2 = 0 (Table has been dropped)  
  5.    Total Inserts Applied = 5   
  6.    Total Duplicate Rows = 0   
  7.    Start: Sat June 27 05:20:14 2020   
  8.    End: Sat June 27 05:20:15 2020   
  9.   
  10. /* 05:20:15 Application Phase statistics:   
  11.               Elapsed time: 00:00:01 (in hh:mm:ss)   
  12. 0008  LOGOFF;   
  13. /* 05:20:16 Logging off all sessions  

Teradata MultiLoad

Teradata MultiLoad, or MLoad, is a command-driven utility that can do fast, high-volume data maintenance on multiple tables and views of a Teradata Relational Database Management System (RDBMS).

Using a single MultiLoad job, we can do several different imports and delete tasks on RDBMS tables and views:

  • Each MultiLoad import task can do multiple data insert, update, and delete functions on up to five different tables or views.
  • Each MultiLoad delete task can remove large numbers of rows from a single table.

We can use MultiLoad to import data from:

  • Disk or tape files on a channel-attached client system.
  • Input files on a network-attached workstation.
  • Individual input module (INMOD) programs written to select, validate, and preprocess input data.
  • Access modules.
  • Any device providing correctly formatted source data.

The table or view in the RDBMS receiving the data can be any existing table or view for which we have access privileges for the maintenance tasks we want.

MultiLoad Modes

Teradata has the two modes of the MultiLoad, such as:

1. MultiLoad Import

  • Each MultiLoad import task can perform multiple INSERT, UPDATE, DELETE, and UPSERT operations on up to five target tables in parallel.
  • It can also import data from the network-attached system or mainframe attached system using a custom access module.

2. MultiLoad DELETE

  • Each MultiLoad delete can remove large volumes of data from a single table.
  • The MultiLoad DELETE is used to perform global (all amps) delete operation on just one table.
  • The DELETE's main feature is that it bypasses the transient journal (TJ) and can be restarted if it fails before finishing.
  • We can't use the primary index in the MultiLoad DELETE operation as primary index access is AMP specific, but MultiLoad DELETE is built for global delete.

MultiLoad requires a work table, a log table, and two error tables in addition to the target table.

  • Log Table: Maintains the checkpoints taken during the load and the results from each phase of MultiLoad, which are used for restart purposes.
  • Error Tables: These tables are populated during the load when an error occurs. The first error table stores conversion errors, whereas the second error table stores duplicate records.
  • Work Table: MultiLoad creates one work table per target table. The work table is used to keep the DML tasks and the input data.

MultiLoad Features

Teradata MultiLoad has the following features, such as:

  • High-performance maintenance operations apply updates to multiple tables in a single pass.
  • Best suited when more than 1-2% of the rows in a table are changed.

Data Conversion Capabilities

If an input data field with character data type is targeted for a column with date data type specification, Teradata MultiLoad can change the input data specification to date before inserting it into the target table.

Following are the conversions supported by Teradata MultiLoad:

  • Numeric-to-numeric
  • Character-to-numeric
  • Character-to-date
  • Date-to-character

Teradata MultiLoad Limitations

Teradata MultiLoad has the following limitations, such as:

  • No Unique Secondary Index: MultiLoad does not support unique secondary indexes (USI), the same as FastLoad, because for a USI the subtable row may be created on a different AMP from the one where the actual row resides.
    But unlike FastLoad, it supports non-unique secondary indexes (NUSI) because, in this case, the subtable row is created on the same AMP. In MultiLoad, each AMP works in parallel and independently; that is why it supports NUSI but not USI.
  • No Referential Integrity: Referential Integrity (RI) on the target table is not supported by Teradata MultiLoad. RI requires too much system checking to maintain the referential constraints to a different table.
  • Triggers are not allowed: Triggers involve more than one table, while MultiLoad deals with only one target table. ALTER the Triggers to the DISABLED status before using MultiLoad.
  • No concatenation of the input files: It could impact the restart process if the data were concatenated in different sequences.
  • No Aggregate, exponential operator, or arithmetic functions: If we need data conversions, then we must use an INMOD to prepare the data before loading it.
  • MultiLoad does not support SELECT, Foreign Key references, Hash Indexes, Join Indexes, or NoPI tables.

How MultiLoad Works

Teradata MultiLoad import has five phases:

  • Phase 1 (Preliminary Phase): It performs basic setup activities.
  • Phase 2 (DML Transaction Phase): It verifies the syntax of DML statements and brings them to the Teradata system.
  • Phase 3 (Acquisition Phase): It brings the input data into work tables and locks it.
  • Phase 4 (Application Phase): It applies all DML operations.
  • Phase 5 (Cleanup Phase): It releases the table lock.

The following steps involved in a MultiLoad script:

Step 1: Set up the log table.

Step 2: Log on to Teradata.

Step 3: Then, specify the Target, Work, and Error tables.

Step 4: Define the INPUT file layout.

Step 5: Also, define the DML queries.

Step 6: Name the IMPORT file.

Step 7: And specify the LAYOUT to be used.

Step 8: Initiate the Load.

Step 9: And last, Finish the load and terminate the sessions.

Example

Create a text file with the following records and name the file as emp.txt.

  1. 202001,James,1980/01/05,2010/03/01,1    
  2. 202002,Robert,1983/03/05,2010/09/01,1   
  3. 202003,Peter,1983/04/01,2009/02/12,2   
  4. 202004,Mike,1984/11/06,2014/01/01,2   
  5. 202005,Robert,1984/12/01,2015/03/09,3  

The following example is a MultiLoad script that reads records from the employee table and loads into the Emp_Stg table.

  1. .LOGTABLE tduser.Emp_log;    
  2. .LOGON 192.168.1.102/dbc,dbc;   
  3.    .BEGIN MLOAD TABLES Emp_Stg;    
  4.       .LAYOUT Emp;    
  5.       .FIELD in_EmpId * VARCHAR(10);    
  6.       .FIELD in_FirstName * VARCHAR(10);     
  7.       .FIELD in_BirthDate * VARCHAR(10);   
  8.       .FIELD in_JoinedDate * VARCHAR(10);    
  9.       .FIELD in_DepartmentNo * VARCHAR(05);  
  10.   
  11.       .DML LABEL EmpLabel;   
  12.       INSERT INTO Emp_Stg   
  13.       (  
  14.          EmpId,  
  15.          FirstName,  
  16.          BirthDate,  
  17.          JoinedDate,  
  18.          DepartmentNo  
  19.       )    
  20.       VALUES   
  21.       (  
  22.          :in_EmpId,  
  23.          :in_FirstName,  
  24.          :in_BirthDate,  
  25.          :in_JoinedDate,  
  26.          :in_DepartmentNo  
  27.       );  
  28.       .IMPORT INFILE emp.txt    
  29.       FORMAT VARTEXT ','  
  30.       LAYOUT Emp  
  31.       APPLY EmpLabel;    
  32.    .END MLOAD;    
  33. .LOGOFF;  

Executing a MultiLoad Script

Once the input file emp.txt is created, and the MultiLoad script is named as EmpLoad.ml, then we can run the MultiLoad script using the following command in UNIX and Windows.

  1. Multiload < EmpLoad.ml;  

Teradata BTEQ

Teradata BTEQ stands for Basic Teradata Query. It is a command-driven utility that enables users to interact with one or more Teradata Database Systems.

BTEQ utility is used in both batch and interactive mode. It can be used to run any DML statement, DDL statement, create Macros, and stored procedures. BTEQ can be used to import data into Teradata tables from a flat file, and it can also be used to extract data from tables into files or reports.

BTEQ Terms

Below are some terms that commonly used in BTEQ scripts.

  • LOGON: It is used to log into the Teradata system.
  • ACTIVITYCOUNT: It returns the number of rows affected by the previous query.
  • ERRORCODE: It returns the status code of the previous query.
  • DATABASE: It sets the default database.
  • LABEL:It assigns a name to a set of SQL commands.
  • RUN FILE:It executes the query contained in a file.
  • GOTO:It transfers control to a label.
  • LOGOFF:It logs off from the database and terminates all sessions.
  • IMPORT:It specifies the input file path.
  • EXPORT:It specifies the output file path and initiates the export.

BTEQ Features

Even though Teradata BTEQ has some limitations, it provides the following salient features:

  • It can submit SQL in both modes, either in batch or interactive way. Interactive users can send SQL and get an answer set on the screen. Users also can submit BTEQ jobs from the batch scripts, have error checking and conditional logic, and allow for the work to be done in the background.
  • Teradata SQL Assistant (Queryman) outputs data in a spreadsheet format wherein using BTEQ users can get the output in the report format.
  • BTEQ is an excellent tool for importing and exporting data.

Importing Data

BTEQ can import the data from the source and incorporate it into SQL to modify the content of one or more tables.

Syntax of the IMPORT command:

  1. .IMPORT <mode> {FILE | DDNAME} = <filename> [, SKIP = n];  

Exporting Data

The data can be written to either mainframe or LAN attached computer using SELECT from Teradata. Data can be exported in a variety of formats such as:

  • Record/DATA mode: In this mode, data will be written to a flat file in native format, i.e., INTEGER data is written as a 4-byte binary field. It cannot be read and understood using a regular text editor. It is set by .EXPORT DATA.
  • Field/Report mode: This is the default mode of BTEQ. The output of this BTEQ export includes column headers for the fields and white space, and can be understood using a text editor (see the sketch after this list). It is set by .EXPORT REPORT.
  • Indicator mode: It is similar to DATA mode, but it also includes a bitmap to recognize NULL. It helps to acknowledge NULL values into another RDBMS. And it is set by .EXPORT INDICDATA.
  • DIF mode: Allows users to export data from Teradata that is directly utilized for spreadsheet applications such as Excel, FoxPro, and Lotus.
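A minimal Field/Report mode sketch (the file name is an assumption):

-- write the formatted answer set to a text file, then close the export
.EXPORT REPORT FILE = emp_report.txt;
SELECT * FROM Emp;
.EXPORT RESET;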

Example

Following is an example of the Teradata BTEQ script.

  1. .LOGON 192.168.1.102/dbc,dbc;   
  2.    DATABASE tduser;  
  3.   
  4.    CREATE TABLE emp_bkup   
  5.    (   
  6.       EmpId INTEGER,   
  7.       FirstName CHAR(10),   
  8.       LastName CHAR(10),   
  9.       DepartmentNo SMALLINT,   
  10.       NetPay INTEGER   
  11.    )  
  12.    Unique Primary Index(EmpId);  
  13.   
  14.    .IF ERRORCODE <> 0 THEN .EXIT ERRORCODE;  
  15.     
  16.    SELECT * FROM    
  17.    Emp  
  18.    Sample 1;   
  19.    .IF ACTIVITYCOUNT <> 0 THEN .GOTO InsertEmp;    
  20.   
  21.    DROP TABLE emp_bkup;  
  22.     
  23.    .IF ERRORCODE <> 0 THEN .EXIT ERRORCODE;   
  24.    
  25.    .LABEL InsertEmp   
  26.    INSERT INTO emp_bkup   
  27.    SELECT a.EmpId,   
  28.       a.FirstName,   
  29.       a.LastName,   
  30.       a.DepartmentNo,   
  31.       b.NetPay   
  32.    FROM    
  33.    Emp a INNER JOIN Salary b   
  34.    ON (a.EmpId = b.EmpId);    
  35.   
  36.    .IF ERRORCODE <> 0 THEN .EXIT ERRORCODE;   
  37. .LOGOFF;   

The above Teradata BTEQ script performs the above task in the following steps.

Step 1: Logs into Teradata System.

Step 2: Sets the Default Database.

Step 3: Creates a table called emp_bkup.

Step 4: Selects one record from the Emp table to check if the table has any records.

Step 5: Drops emp_bkup table if the table is empty.

Step 6: Transfers the control to a Label InsertEmp, which inserts records into the emp_bkup table.

Step 7: Then, check ERRORCODE to ensure that the statement is successful, following each SQL statement.

Step 8: ACTIVITYCOUNT returns the number of records selected/impacted by the previous SQL query.


Teradata Performance Tuning

In Teradata, performance tuning is done to identify all the bottlenecks and then resolve them.

A bottleneck is not a form of error, but it does cause a certain amount of delay.

The following techniques can be used to identify and resolve bottlenecks:

EXPLAIN Plan

The first step in performance tuning is the use of EXPLAIN on the query. When a query is preceded with the EXPLAIN modifier, the Parsing Engine returns the execution plan to the user instead of sending it to the AMPs.

The EXPLAIN plan gives a clear picture of the way the optimizer will execute a query. In the EXPLAIN plan, check for keywords such as:

  • Confidence level
  • Join strategy used
  • Spool file size
  • Redistribution, etc.
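
For example, prefixing a query with EXPLAIN returns the step-by-step plan, including confidence levels and any spool or redistribution steps, without executing the query (the Emp table and the filter value below are only illustrative):

EXPLAIN
SELECT EmpId, FirstName, LastName
FROM Emp
WHERE DepartmentNo = 401;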

Collect Statistics

The optimizer uses Data demographics to come up with an effective execution strategy. COLLECT STATISTICS command is used to collect data demographics of the table. Make sure that the statistics collected on the columns are up to date.

  • Collect statistics on the columns that are used in the WHERE clause and on the columns used in the joining condition.
  • Collect statistics on the Unique Primary Index columns.
  • Collect statistics on Non-Unique Secondary Index columns. The optimizer will decide if it can use NUSI or Full Table Scan.
  • Collect statistics on the Join Index columns, even though statistics on the base table have already been collected.
  • Collect statistics on the partitioning columns.

Environment Information

  • Number of Nodes, AMPs and CPUs
  • Amount of memory

Data Demographics

  • Number of rows
  • Row size
  • Range of values in the table
  • Number of rows per value
  • Number of Nulls

There are three approaches to collect statistics on the table.

  • Random AMP Sampling
  • Full statistics collection
  • Using SAMPLE option

Collecting Statistics

COLLECT STATISTICS command is used to collect statistics on a table.

Syntax

Following is the basic syntax to collect statistics on a table.

COLLECT [SUMMARY] STATISTICS   
[ INDEX (indexname) | COLUMN (columnname) ]
ON <tablename>;

Example

The following example collects statistics on EmployeeNo column of Employee table.

COLLECT STATISTICS COLUMN(EmployeeNo) ON Employee;

Viewing Statistics

You can view the collected statistics using HELP STATISTICS command.

Syntax

Following is the syntax to view the statistics collected.

HELP STATISTICS <tablename>; 

Example

Following is an example to view the statistics collected on Employee table.

HELP STATISTICS employee;

Using the THRESHOLD Options to Collect and Recollect Statistics

Statistics collection thresholds enable you to minimize the unnecessary collection of statistics. For a system-determined threshold, the Optimizer automatically determines whether a recollection is needed or whether extrapolation is adequate. You can collect and recollect threshold statistics on tables.

The THRESHOLD options enable you to skip recollecting statistics if the amount of data change since the last statistics collection or the age of the current statistics is under the thresholds in effect for the statistics.

The Optimizer can automatically determine the appropriate thresholds to apply based on previously collected statistics, column histories, change counts, and other factors, or you can explicitly specify the thresholds as part of a COLLECT STATISTICS request by specifying a change percentage, a number of days, or both.

You can submit collect statistics requests at regular intervals, and the Optimizer can use stored threshold and historical data to determine whether the specified recollection of statistics is necessary or not. The Optimizer can use the same data to determine when to recollect statistics for those columns that exceed the specified threshold. For example, if you determine that the threshold is to be a 10% data change and you submit a request to recollect statistics, the Optimizer does not recollect those statistics if the change in the data from the last collection is less than 10%.

If you specify a THRESHOLD option for first time statistics collections, the Optimizer collects the statistics and stores the THRESHOLD options you specified for future use without consulting the threshold values you specify. The Optimizer uses the saved collection thresholds to determine whether the current statistics collection can be skipped or not when you submit recollection requests at a later time.

Specifying a new THRESHOLD option for statistics recollection overrides the current setting, if any. The Optimizer applies the new THRESHOLD value to the current request, and remembers the updated THRESHOLD option for future statistics recollection.
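
As a sketch (the Employee table, the column, and the 10 percent / 7 day values are assumptions, and the exact USING syntax may vary by release), thresholds can be stated directly in the request:

COLLECT STATISTICS
USING THRESHOLD 10 PERCENT AND THRESHOLD 7 DAYS
COLUMN (LastName)
ON Employee;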

COLLECT STATS in Teradata collects statistics for one or more columns of the specified base table, and also for hash indexes, join indexes, etc. The primary purpose is to collect statistical profiles of the required columns and store them in the data dictionary.

What is Collect stats in Teradata?

Collect Stats is used to gather the statistics on various columns as per the requirement. Thereafter Teradata’s optimizer creates an execution strategy that is based on these statistics collected by the COLLECT STATS command.

This COLLECT STATS command gathers data demographics and environment information, which the optimizer utilizes in order to optimize the plan for the SQL based on these tables.

Environment information consists of the below set of info:

  • The amount of memory
  • The number of nodes
  • The number of AMPs
  • The number of CPUs

Data demographics consist of the below set of info:

  • The row size of the table
  • The number of rows in the table
  • The range of values present in the table
  • The number of rows per value
  • The number of NULL values in the table
You can specify the following kinds of thresholds to control recollecting statistics:

  • Change-based
  • Time-based

The COLLECT STATISTICS options that control sampling and thresholds are described below.

NO SAMPLE

Use a full-table scan to collect the specified statistics.

You can only specify this option if you also specify an explicit index or column set.

You cannot specify NO SAMPLE for a standard recollection of statistics on an implicitly specified index or column set.
  • For the first collection of statistics, NO SAMPLE overrides the default.
  • For recollections of statistics, NO SAMPLE overrides the previous SAMPLE options and collects full statistics.

NO THRESHOLD
  • Do not apply any thresholds to the collection of statistics.
  • Remove the existing threshold before collecting the statistics.

You can only specify this option if you also specify an explicit column or index.

You cannot specify NO THRESHOLD for a standard recollection of statistics on an implicitly specified index or column set.
  • For the first collection of statistics, NO THRESHOLD overrides the default settings.
  • For recollections of statistics, NO THRESHOLD overrides any previously specified THRESHOLD options and recollects the statistics without any thresholds.

NO THRESHOLD DAYS
  • Do not apply a DAYS threshold to the collection of statistics.
  • Remove the existing DAYS threshold before collecting the statistics.

You can only specify this option if you also specify an explicit index or column set.

You cannot specify NO THRESHOLD DAYS for a standard recollection of statistics on an implicitly specified index or column set.
  • For the first collection of statistics, NO THRESHOLD DAYS overrides the default setting.
  • For recollection of statistics, NO THRESHOLD DAYS overrides any previous DAYS threshold specification.

NO THRESHOLD PERCENT
  • Do not apply a PERCENT change threshold to the collection of statistics.
  • Remove the existing PERCENT change threshold before collecting the statistics.

You can only specify this option if you also specify an explicit column or index.

You cannot specify NO THRESHOLD PERCENT for a standard recollection of statistics on an implicitly specified index or column set.
  • For the first collection of statistics, NO THRESHOLD PERCENT overrides the default.
  • For recollections of statistics, NO THRESHOLD PERCENT overrides any previous change threshold percent specification.

SAMPLE

Scan a system-determined percentage of table rows to collect the specified statistics.

SAMPLE has the same meaning as SYSTEM SAMPLE and is only provided for backward compatibility to enable existing COLLECT STATISTICS scripts that specify the USING SAMPLE option to continue to run.

You should use the SYSTEM SAMPLE option instead of SAMPLE.

SAMPLE n PERCENT

Scans the percentage of table rows that you specify, rather than scanning all of the rows in the table, to collect statistics.

The value for n can be a decimal number or integer from 2 through 100.

Specifying SAMPLE 100 PERCENT is equivalent to collecting full statistics.

You can only specify this option if you also specify an explicit index or column set.

You cannot specify SAMPLE n PERCENT for a standard recollection of statistics on an implicitly specified index or column set.
  • For the first collection of statistics, the specified sample percentage overrides the default.
  • For recollection of statistics, SAMPLE n PERCENT overrides any previous SAMPLE option specifications and instead scans n percent of the rows in the table.

THRESHOLD n DAYS

Do not recollect statistics if the age of the statistics is less than the number of days specified.

The value for n must be an integer number that represents the number of days.

The valid range for n is 1 - 9999.

You can only specify this option if you also specify an explicit index or column set.

You cannot specify THRESHOLD n DAYS for a standard recollection of statistics on an implicitly specified index or column set.
  • For the first collection of statistics, collecting statistics is not skipped because no current histogram exists to use to determine the age of the current statistics.

    Instead, the specified number of days overrides the default setting.

  • For recollection of statistics, THRESHOLD n DAYS overrides the previous THRESHOLD n DAYS specification and instead applies the specified number of days threshold to the collection.

The DBS Control field DefaultTimeThreshold in the STATISTICS field group contains the default for this option. For more information, see Teradata Vantage™ - Database Utilities, B035-1102.

THRESHOLD n PERCENT

Do not recollect statistics if the percentage of data change since the last collection is less than the specified percentage.

The value for n can be either a decimal number or an integer number.

The valid range of n is 1 - 9999.99.

You can only specify this option if you also specify an explicit column or index.

You cannot specify THRESHOLD n PERCENT for a standard recollection of statistics on an implicitly specified index or column set.
  • For the first collection of statistics, THRESHOLD n PERCENT overrides the default setting.
  • For recollection of statistics, THRESHOLD n PERCENT overrides the previous threshold change percentage specification and instead applies the specified threshold percentage.

The DBS Control field DefaultUserChangeThreshold in the STATISTICS field group contains the default for this option. For more information, see Teradata Vantage™ - Database Utilities, B035-1102.
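
For instance, a sampled collection might be requested as in the sketch below (the table and column names are assumptions):

COLLECT STATISTICS
USING SAMPLE 10 PERCENT
COLUMN (EmployeeNo)
ON Employee;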

The three collection approaches are typically applied as follows:

  • Using the SAMPLE option: this includes unique index columns and nearly unique indexes or columns.
  • Full statistics collection: this usually includes non-indexed columns, the PARTITION column of all tables (whether partitioned or not), and full statistics on other relevant columns.
  • It also covers most NUPI (non-unique primary index) columns, single-column join constraints, and NUSIs. (UPI stands for unique primary index.)
  • Random AMP sampling: this involves USIs or UPIs when they are used only with equality predicates.

How to Collect Stats in Teradata?

Below is the syntax of COLLECT STATS statement in Teradata:

COLLECT [SUMMARY] STATISTICS
[ INDEX (name_of_the_index) | COLUMN (col_name) ]
ON <table_name>;

Here the keyword SUMMARY is optional, and the user may skip it if not required.

Let’s take up some examples to understand how to collect stats in Teradata in detail:

COLLECT STATISTICS COLUMN(roll_number) ON Student_Table;
  • This will collect stats on the roll_number column of the student_table

Let’s take an example to understand the optimization in detail.

Suppose we have two tables table1 and table2.

table1 contains the details of the students, such as id, name, age, and marks,

whereas table2 contains the geographic information of the students, such as address and location, along with the primary key ID.

SELECT
   a.id,
   a.name,
   a.age,
   a.marks,
   b.address,
   b.location
FROM table1 AS a
LEFT JOIN table2 AS b
ON a.id = b.id;

Let’s consider two cases wherein the above-mentioned query gets executed.

CASE 1: When we do not have any statistics for the columns of table1 and table2. In this case, the execution plan for the above query will be more costly.

CASE 2: When we do have the required statistics for the columns of table1 and table2. In this case, the execution plan for the above query will be less costly.

The reason is that, for the join on the id column, the matching rows from table1 and table2 must be on the same AMP before the data can be joined.

Suppose table1 contains 100 records having IDs from 1 to 100, distributed evenly over 10 AMPs in the following fashion:

  • The records having the ID from 1 to 10 in AMP1
  • The records having the ID from 11 to 20 in AMP2
  • The records having the ID from 21 to 30 in AMP3
  • The records having the ID from 31 to 40 in AMP4
  • and so on…
  • The records having the ID from 91 to 100 in AMP10

And table2 has only 80 records, with IDs ranging from 1 to 100 and 20 IDs missing:

  • The records having the ID from 1 to 8 in AMP1
  • The records having the ID from 9 to 10 and 15 to 20 in AMP2
  • The records having the ID from 21 to 28 in AMP3
  • The records having the ID from 29 to 36 in AMP4
  • and so on…
  • The records having the ID from 92 to 100 in AMP8

Now, for the join to happen, matching IDs must be available on the same AMP.

Generally, the data from the smaller table is redistributed, so here the redistribution will happen for table2.
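
In a case like this, collecting statistics on the join column of both tables, as sketched below with the hypothetical table1 and table2, gives the optimizer the row counts and value ranges it needs to choose the cheaper redistribution plan:

COLLECT STATISTICS COLUMN (id) ON table1;
COLLECT STATISTICS COLUMN (id) ON table2;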

How can we see the statistics collected for the tables in Teradata?

The collected stats can be seen using the below query:

HELP STATISTICS <table_name>;

Let’s see the stats collected on Student_table

HELP STATISTICS student_table;

When the above query is executed, the result lists each collected statistic along with its collection date and time, the number of unique values, and the column name.

Data Types

Each column in a table is associated with a data type. Data types specify what kind of values will be stored in the column. Every data value belongs to an SQL data type.

Make sure that proper data types are used. This avoids using more storage than required.

Conversion

Ensure that the data types of the columns used in join conditions are compatible to avoid explicit data conversions.

Sort

Remove ORDER BY clauses unless they are required.

Spool Space Issue

A spool space error is generated if the query exceeds the per-AMP spool space limit for that user.

Verify the EXPLAIN plan and identify the step that consumes the most spool space. Such queries can be split into smaller pieces that build temporary tables.

Primary Index

The primary index is used to specify where the data resides in Teradata. Each table is required to have a primary index defined, so make sure the Primary Index is correctly set for the table.

The primary index column should evenly distribute the data and should be frequently used to access the data.

Partition Primary Index

If a Partitioned Primary Index is defined on a table, try to use it: include the partitioning column in the filter condition, otherwise performance will degrade, as in the sketch below.
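
A minimal sketch, assuming a hypothetical Orders table partitioned by month on order_date: the query filters on the partitioning column, so the optimizer can eliminate partitions instead of scanning the whole table.

CREATE TABLE Orders
(
   OrderId INTEGER,
   CustomerId INTEGER,
   order_date DATE
)
PRIMARY INDEX (OrderId)
PARTITION BY RANGE_N (order_date BETWEEN DATE '2023-01-01'
                      AND DATE '2024-12-31' EACH INTERVAL '1' MONTH);

-- filtering on the partitioning column enables partition elimination
SELECT OrderId, CustomerId
FROM Orders
WHERE order_date BETWEEN DATE '2024-01-01' AND DATE '2024-01-31';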

SET Table

If we define a SET table, then the optimizer will check whether the row is a duplicate for each and every row inserted. To avoid this duplicate row check, define a Unique Secondary Index on the table.

UPDATE on Large Table

Updating a large table will be time-consuming. Instead of updating the table, we can delete the records and insert the records with the modified values.

Dropping Temporary Tables

Drop the temporary tables (staging tables) and volatile tables if they are no longer needed. This will free up permanent space and spool space.

MULTISET Table

If we have confirmed that the input records will not contain duplicate rows, we can define the target table as a MULTISET table to avoid the duplicate row check performed by a SET table, as sketched below.
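
A minimal sketch of the difference (the staging table names and columns are assumptions): the SET table checks every insert for an exact duplicate row, while the MULTISET table skips that check.

-- duplicate row check on every insert
CREATE SET TABLE emp_stage_set
(
   EmpId INTEGER,
   FirstName CHAR(10)
)
PRIMARY INDEX (EmpId);

-- no duplicate row check; duplicate rows are allowed
CREATE MULTISET TABLE emp_stage_multi
(
   EmpId INTEGER,
   FirstName CHAR(10)
)
PRIMARY INDEX (EmpId);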

IN Clause

Avoid the IN clause in filter conditions when there can be a huge number of values in the WHERE condition. In that case it is better to insert such values into a volatile table and join the volatile table to the main query with an INNER JOIN, as sketched below.
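
A sketch of this rewrite, assuming a hypothetical Customer table and a customer_id filter list:

CREATE VOLATILE TABLE filter_ids
(
   customer_id INTEGER
)
PRIMARY INDEX (customer_id)
ON COMMIT PRESERVE ROWS;

INSERT INTO filter_ids VALUES (1001);
INSERT INTO filter_ids VALUES (1002);

-- join to the volatile table instead of using a long IN list
SELECT c.customer_id, c.customer_name
FROM Customer c
INNER JOIN filter_ids f
ON c.customer_id = f.customer_id;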

UPDATE Clause

Avoid UPDATE statements that have only a SET clause and no WHERE clause. Even if the target and source have only one row, add a WHERE clause on the PI column, as in the sketch below.
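
A sketch, assuming emp_bkup from the earlier BTEQ example with EmpId as its Primary Index:

-- avoid: updates every row in the table
UPDATE emp_bkup
SET NetPay = 0;

-- prefer: restrict the update with a WHERE clause on the PI column
UPDATE emp_bkup
SET NetPay = 0
WHERE EmpId = 1001;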


 Structured Query Language (SQL) is the industry language for communicating with Relational Database Management Systems (RDBMS) 

In an RDBMS, data is stored in tables in which a row represents one instance of an entity and the columns represent the data fields that make up a row.

 

SQL is used to construct queries to communicate with the Teradata Vantage Advanced SQL Engine and define the answer set that is returned to the user.

-----------------------------------------------------

SQL STATEMENT TYPE


1. Data Definition Language (DDL) consists of all statements that support the definition of database objects.

 

DDL is used to create, modify database objects such as tables, views, macros, databases as well as restructure the database.

DDL Statements:

 

CREATE

DROP

ALTER

RENAME

MODIFY

REPLACE

COLLECT STATISTICS

SET

BEGIN

END

FLUSH

CHECKPOINT

HELP

SHOW
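
For example (the object names below are illustrative only):

CREATE TABLE tduser.emp_contact
(
   EmpId INTEGER,
   Email VARCHAR(100)
)
UNIQUE PRIMARY INDEX (EmpId);

-- add a column to the existing table
ALTER TABLE tduser.emp_contact
ADD Phone VARCHAR(20);

-- remove the table when it is no longer needed
DROP TABLE tduser.emp_contact;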


2. Data Manipulation Language (DML) consists of all SQL statements that support the manipulation and processing of database objects.

 

DML is used to work with data, including inserting data, updating existing data or performing queries on the data.

 

The DML subset will be the focal point of this fundamentals course.


DML Statements:

 

SELECT

INSERT

UPDATE

DELETE
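
For example, against the emp_bkup table created in the earlier BTEQ script (the values are illustrative):

INSERT INTO emp_bkup (EmpId, FirstName, LastName, DepartmentNo, NetPay)
VALUES (1001, 'Peter', 'Parker', 401, 45000);

UPDATE emp_bkup
SET NetPay = 47000
WHERE EmpId = 1001;

SELECT EmpId, FirstName, NetPay
FROM emp_bkup
WHERE DepartmentNo = 401;

DELETE FROM emp_bkup
WHERE EmpId = 1001;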


3. Data Control Language (DCL) is a subset of the SQL language that defines and supports the security authorization for accessing database objects.

 

DCL statements are used for administrative tasks such as granting / revoking privileges or controlling object ownership.


DCL Statements:

 

 

GRANT

REVOKE

GIVE
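
For example (the user name analyst_user and the tduser database are assumptions):

GRANT SELECT, INSERT ON tduser TO analyst_user;

REVOKE INSERT ON tduser FROM analyst_user;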


4. Query and Workload Analysis statements are used for collecting and analyzing query and data demographics.

 

Some statements in this category are also functionally aligned with the DDL SQL category.

 

For more information on SQL statements in this category, see references in help section.

SQL Statements in this category:

 

 

BEGIN/END QUERY LOGGING

SHOW QUERY LOGGING

COLLECT DEMOGRAPHICS

COLLECT STATISTICS

DROP STATISTICS

DUMP EXPLAIN

INITIATE INDEX ANALYSIS

INITIATE PARTITION ANALYSIS

INSERT EXPLAIN

RESTART INDEX ANALYSIS


-------------------------------------------------------------

 HELP


The HELP command displays information about database objects (databases, tables, views, macros, etc.).

 

This information is pulled from the data dictionary and includes information such as object attributes, function parameters, and column data types.
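
For example (the object names are taken from earlier examples):

HELP DATABASE tduser;
HELP TABLE tduser.emp_bkup;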



-----------------------------------------------------------------------------------------------------------------

The SHOW command displays the current data definition language (DDL) of a database object.

 

It returns the last definition statement performed against the object (which may not be the same DDL used to initially create it).
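
For example (emp_bkup is from the earlier BTEQ script; emp_view is a hypothetical view):

SHOW TABLE tduser.emp_bkup;
SHOW VIEW tduser.emp_view;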


------------------------------------------------

EXPLAIN

EXPLAIN precedes any SQL request and returns the Optimizer's execution plan for that request to the user instead of executing it.


=========================================================

Introduction to the Teradata Database

The Teradata database is a Relational Database Management System (RDBMS) that drives a company's data warehouse.

Teradata Parallel Architecture

      
There are many reasons customers choose the Teradata database as the preferred platform for enterprise data warehousing:

  • Supports more warehouse data than all competitors combined. There are over 5400 1TB or larger warehouses in the field.
  • Supports scalability from small (10GB) to massive (50+PB) databases.
  • Provides a parallel-aware Optimizer that makes it unnecessary to perform complicated tuning to get a query to run efficiently.
  • Automatic data distribution eliminates complex indexing schemes and time-consuming reorganizations.
  • Designed and built on a parallel architecture from the start (not a parallel retrofit).
  • Supports ad-hoc queries using ANSI-standard Structured Query Language (SQL) which is used to communicate with an RDBMS. This allows Teradata to interface with 3rd party BI tools and submit queries from other database systems.
  • Single operational view of the entire MPP system via the Service Workstation (SWS) and single point of control for the DBA with Teradata Viewpoint.

Teradata Corporation has been doing data warehousing longer than any other vendor. The Teradata database was built to do Decision Support.

Relational Database Advantages

      
The Teradata database provides customers with unlimited, proven scalability. The Teradata database can scale from 10 gigabytes to over 100 petabytes of data on a single system. When we talk about scalability, it isn't just about being able to grow very large and handle enormous amounts of data. It's about growth without losing any performance capabilities. This scalability provides investment protection for customers' growth and application development.

As is proven in the benchmarks we perform, the Teradata database can handle the most concurrent users, who are often running complex queries.

The parallelism of the Teradata database is unlimited. The Teradata database performs every task in parallel, including joins, sorts, and aggregations.

The optimizer is parallel-aware and the most robust in the industry, able to handle multiple complex queries, joins per query, and unlimited ad-hoc processing.

The Teradata File System manages data and disk space automatically. This is what makes the rebuilding of indexes unnecessary.

The Teradata database provides a low total cost of ownership (TCO) due to:

  • Ease of setup and maintenance
  • No reorganization of data needed
  • Automatic data distribution
  • Most robust utilities in the industry
  • Low cost of disk to data ratio
  • Ease in expanding the system

High availability is also a major advantage because with the Teradata database architecture, there is no single point of failure - fault tolerance is built-in to the system. The Teradata database is the only database that is truly linearly scalable, and this extends to data loading with the use of our parallel utilities.

Teradata Database Features

      
The Teradata database is a high-performance database system that processes enormous quantities of detail data that are beyond the capability of conventional systems.

The system is specifically designed for large data warehouses. From the beginning, the Teradata database was created to store, access and manage large amounts of data.

Hundreds of terabytes of storage capacity are currently available, making it an ideal solution for enterprise data warehouses and even for smaller data marts.

Parallel processing distributes data across multiple processors evenly. The system is designed such that the components divide the work up into approximately equal pieces. This keeps all the parts busy all the time and enables the system to accommodate a larger number of users and/or more data.

Open architecture adapts readily to new technology. As higher-performance industry standard computer chips and disk drives are made available, they are easily incorporated into the architecture. Teradata runs on industry standard operating systems as well.

Linear scalability enables the system to grow to support more users/data/queries/query complexity, without experiencing performance degradation. As the configuration grows, the performance increase is linear, with a slope of 1.

The Teradata database currently runs as a database server on a variety of hardware platforms for single node or Symmetric Multi-Processor (SMP) systems, and on Teradata hardware for multi-node Massively Parallel Processing (MPP) systems.

What is a Database?

      
A database is a collection of permanently stored data that is used by an application or enterprise.

A database contains logically related data, which means that the database was created with a specific purpose in mind. A database supports shared access by many users. One characteristic of a database is that many people use it, often for many different purposes. A database is protected to control access and managed to retain its value and integrity.

One example of a database is payroll data that includes the names of the employees, their employee numbers, and their salary history. This database is logically related - it's all about payroll. It must have shared access, since it will be used by the payroll department to generate checks, and also by management to make decisions. This database must be protected; much of the information is confidential and must be managed to ensure the accuracy of the records.

The Teradata database is a relational database. Relational databases are based on the relational model, which is founded on mathematical Set Theory. The relational model uses and extends many principles of Set Theory to provide a disciplined approach to data management. Users and applications access data in an RDBMS using industry-standard SQL statements. SQL is a set-oriented language for relational database management.

Later in this course we will provide another definition of database that is specific to the Teradata database.

Logical/Relational Modeling

      
The logical model should be independent of usage. A variety of front-end tools can be accommodated so that the database can be created quickly.

The design of the data model is the same regardless of data volume.

An enterprise model is one that provides the ability to look across functional processes.

Normalization is the process of reducing a complex data structure into a simple, stable one. Generally this process involves removing redundant attributes, keys, and relationships from the conceptual data model.

Normalization theory is constructed around the concept of normal forms that define a system of constraints. If a relation meets the constraints of a particular normal form, we say that relation is "in normal form."

The intent of normalizing a relational database is to put one fact in one place. By decomposing your relations into normalized forms, you can eliminate the majority of update anomalies that can occur when data is stored in de-normalized tables.

A slightly more detailed statement of this principle would be the definition of a relation (or table) in a normalized relational database: A relation consists of a primary key, which uniquely identifies any tuple (an ordered set of values), and zero or more additional attributes, each of which represents a single-valued (atomic) property of the entity type identified by the primary key. The separator for each value is often a comma. Common uses for the tuple as a data type are:

  1. For passing a string of parameters from one program to another
  2. Representing a set of value attributes in a relational database

The Logical Model

  • Should be designed without regard to usage
    • Accommodates a wide variety of front end tools
    • Allows the database to be created more quickly
  • Should be the same regardless of data volume
  • Data is organized according to what it represents - real world business data in table (relational) form
  • Includes all the data definitions within the scope of the application or enterprise
  • Is generic - the logical model is the template for physical implementation on any RDBMS platform

Normalization

  • Process of reducing a complex data structure into a simple, stable one
  • Involves removing redundant attributes, keys, and relationships from the conceptual data model

Relational Databases

      
Relational databases are solidly founded on Set Theory and are based on the Relational Model, which is a set of principles for relational databases formalized by Dr. E.F. Codd in the late 1960s. The relational model of data prescribes how the data should be represented in terms of structure, integrity, and manipulation. The rules of the relational model are widely accepted in the information technology industry, though actual implementations may vary.

The key to understanding relational databases is the concept of the table, which is made up of rows and columns. The relational model provides the system with rules and a structure to build on. The relational terms we will be discussing in this course are independent of the database you may be running on.

A column is an attribute of the entity that a table represents. A column always contains like data, such as part name, supplier name, or employee number. In the example below, the column named LAST NAME contains last name and never anything else. Columns should contain atomic data, so a phone number might be divided into three columns: area code, prefix, and suffix, so drill-down analysis can be performed. Column position in the table is arbitrary. Missing data values are represented by "nulls", or the absence of a value.

A row is one instance of all the columns of a table. It is a single occurrence of an entity in the table. In our example, all information about a single employee is in one row. The sequence of the rows in a table is arbitrary.

In a relational database, tables are defined as a named collection of one or more named columns that can have zero or many rows of related information. Notice that the information in each row of the table refers to only one person. There are no rows with data on two people, nor are there rows with information on anything other than people. This may seem obvious, but the concept underlying it is very important.

Each row represents an occurrence of an entity defined by the table. An entity is defined as a person, place, thing, or event about which the table contains information. In this case, the entity is the employee.

Note: In relational math we use the term:

  • Table to mean a relation.
  • Row to mean a tuple.
  • Column to mean an attribute.
The EMPLOYEE table below is a relation. Each row is a single entity (one employee) and each column is an attribute.

EMPLOYEE (Relation)

EMPLOYEE | MANAGER  | DEPARTMENT | JOB    | LAST     | FIRST   | HIRE   | BIRTH  | SALARY
NUMBER   | EMPLOYEE | NUMBER     | CODE   | NAME     | NAME    | DATE   | DATE   | AMOUNT
         | NUMBER   |            |        |          |         |        |        |
---------|----------|------------|--------|----------|---------|--------|--------|--------
1006     | 1019     | 301        | 312101 | Stein    | John    | 861015 | 631015 | 3945000
1008     | 1019     | 301        | 312102 | Kanieski | Carol   | 870201 | 680517 | 3925000
1005     | 0801     | 403        | 431100 | Ryan     | Loretta | 861015 | 650910 | 4120000
1004     | 1003     | 401        | 412101 | Johnson  | Darlene | 861015 | 560423 | 4630000
1007     |          |            |        | Villegas | Arnando | 870102 | 470131 | 5970000
1003     | 0801     | 401        | 411100 | Trader   | James   | 860731 | 570619 | 4785000

The employee table has:

  • Nine columns of data
  • Six rows of data - one per employee
  • Missing data values represented by nulls
  • Column and row order are arbitrary

Normalized Data Model vs. Star Schema Model

      
As a model is refined, it passes through different states which can be referred to as normal forms. A normalized model includes:

  • Entities - One record in a table
  • Attributes - Columns
  • Relationships - Between tables

First normal form (1NF) rules state that each and every attribute within an entity instance has one and only one value. No repeating groups are allowed within entities.

Second normal form (2NF) requires that the entity must conform to the first normal form rules. Every non-key attribute within an entity is fully dependent upon the entire key (key attributes) of the entity, not a subset of the key.

Third normal form (3NF) requires that the entity must conform to the first and second normal form rules. In addition, no non-key attributes within an entity is functionally dependent upon another non-key attribute within the same entity.

While the Teradata database can support any data model that can be processed via SQL, an advantage of a normalized data model is the ability to support previously unknown (ad-hoc) questions.

Star Schema

The star schema (sometimes referenced as star join schema) is the simplest style of data warehouse schema. The star schema consists of a few fact tables (possibly only one, justifying the name) referencing any number of dimension tables. The star schema is considered an important special case of the snowflake schema.

Some characteristics of a Star Schema model include:

  • They tend to have fewer entities
  • They advocate a greater level of denormalization

Primary Keys

      
Tables are made up of rows and columns and represent entities and attributes. Entities are the people, places, things, or events that the tables model. A Primary Key is required for every logical model version of a table, because each entity holds only one type of tuple (i.e., a row about a person, place, thing or event), and each tuple is uniquely identified within an entity by a Primary Key (PK).

Primary Key Rules

  • A Primary Key uniquely identifies each tuple within an entity. A Primary Key is required, because each tuple within an entity must be able to be uniquely identified.

  • No duplicate values are allowed. The Primary Key for the EMPLOYEE table is the employee number, because no two employees can have the same number.

  • Because it is used for identification, the Primary Key cannot be null. There must be something in that field to uniquely identify each occurrence.

  • Primary Key values should not be changed. Historical information, as well as relationships with other entities, may be lost if a PK value is changed or re-used.

  • Primary Key can include more than one attribute. In fact, there is no limit to the number of attributes allowed in the PK.

  • Only one Primary Key is allowed per entity.

Selecting a Primary Key

Each column of a row is called an attribute of that row. A database designer can select any attribute to be a Primary Key but, as a result of the rules listed above, many attributes will not qualify as a Primary Key candidate. For example, we could have selected Last Name as the PK of the EMPLOYEE table, but as soon as the company hires two people with the same last name, the PK is no longer unique. Even if we made the PK Last Name and First Name (possible since PKs can be made up of more than one column) we could still have two employees with the same name. Moreover, some employees may choose to change their last names.

Many data modelers recommend using system-assigned sequential integers for Primary Keys. This assures uniqueness and gives users control over Primary Key values, but it also takes up database space.

Primary Key (PK) values uniquely identify each entry (tuple) of an entity.

Employee

EMPLOYEE | MANAGER  | DEPARTMENT | JOB    | LAST     | FIRST   | HIRE   | BIRTH  | SALARY
NUMBER   | EMPLOYEE | NUMBER     | CODE   | NAME     | NAME    | DATE   | DATE   | AMOUNT
(PK)     | NUMBER   |            |        |          |         |        |        |
---------|----------|------------|--------|----------|---------|--------|--------|--------
1006     | 1019     | 301        | 312101 | Stein    | John    | 861015 | 631015 | 3945000
1008     | 1019     | 301        | 312102 | Kanieski | Carol   | 870201 | 680517 | 3925000
1005     | 0801     | 403        | 431100 | Ryan     | Loretta | 861015 | 650910 | 4120000
1004     | 1003     | 401        | 412101 | Johnson  | Darlene | 861015 | 560423 | 4630000
1007     |          |            |        | Villegas | Arnando | 870102 | 470131 | 5970000
1003     | 0801     | 401        | 411100 | Trader   | James   | 860731 | 570619 | 4785000

Foreign Keys

      
Relational databases permit data values to associate across more than one entity. A Foreign Key (FK) value identifies entity relationships.

Below you will see that the employee table has three FK attributes, one of which models the relationship between employees and their departments. A second FK attribute models the relationship between employees and their jobs.

A third FK attribute is used to model the relationship between employees and each other. This is called a recursive relationship.

Foreign Key Rules

  • Duplicate values are allowed in a FK attribute.
  • NULLs are allowed in a FK attribute.
  • Values may be changed in a FK attribute.
  • Each FK must exist elsewhere as a Primary Key.

Note that Department_Number is the Primary Key for the DEPARTMENT entity.

Remember, these terms are not Teradata specific - they are just general relational concepts.

EMPLOYEE (partial listing)

EMPLOYEE | MANAGER  | DEPARTMENT | JOB    | LAST     | FIRST   | HIRE   | BIRTH  | SALARY
NUMBER   | EMPLOYEE | NUMBER     | CODE   | NAME     | NAME    | DATE   | DATE   | AMOUNT
(PK)     | NUMBER   | (FK)       | (FK)   |          |         |        |        |
         | (FK)     |            |        |          |         |        |        |
---------|----------|------------|--------|----------|---------|--------|--------|--------
1006     | 1019     | 301        | 312101 | Stein    | John    | 861015 | 631015 | 3945000
1008     | 1019     | 301        | 312102 | Kanieski | Carol   | 870201 | 680517 | 3925000
1005     | 0801     | 403        | 431100 | Ryan     | Loretta | 861015 | 650910 | 4120000
1004     | 1003     | 401        | 412101 | Johnson  | Darlene | 861015 | 560423 | 4630000
1007     |          |            |        | Villegas | Arnando | 870102 | 470131 | 5970000
1003     | 0801     | 401        | 411100 | Trader   | James   | 860731 | 570619 | 4785000

Foreign Key (FK) values model relationships.

DEPARTMENT

DEPARTMENT | DEPARTMENT               | BUDGET   | MANAGER
NUMBER     | NAME                     | AMOUNT   | EMPLOYEE
(PK)       |                          |          | NUMBER (FK)
-----------|--------------------------|----------|------------
501        | marketing sales          | 80050000 | 1017
301        | research and development | 46560000 | 1019
302        | product planning         | 22600000 | 1016
403        | education                | 93200000 | 1005
402        | software support         | 30800000 | 1011
401        | customer support         | 98230000 | 1003
201        | technical operations     | 29380000 | 1025
  • Foreign Keys (FK) are optional.
  • An entity may have more than one FK.
  • A FK may consist of more than one attribute.
  • FK values may be duplicated.
  • FK values may be null.
  • FK values may be changed.
  • FK values must exist elsewhere as a PK.
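
As a sketch of how these rules can be declared in SQL (the column names follow the sample tables above; the data types are assumptions):

CREATE TABLE Department
(
   Department_Number       INTEGER NOT NULL PRIMARY KEY,
   Department_Name         VARCHAR(30),
   Budget_Amount           DECIMAL(10,2),
   Manager_Employee_Number INTEGER
);

CREATE TABLE Employee
(
   Employee_Number         INTEGER NOT NULL PRIMARY KEY,
   Manager_Employee_Number INTEGER,
   Department_Number       INTEGER,
   Job_Code                INTEGER,
   Last_Name               CHAR(20),
   First_Name              VARCHAR(30),
   -- each FK value must exist as a PK value in the referenced table
   FOREIGN KEY (Department_Number) REFERENCES Department (Department_Number)
);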

Relational Advantages

      
Advantages of a Relational Database (physical and logical) compared to other database methodologies include:
  • Flexible
    Flexibility provides substantial benefits. The user does not need to know the access path to the data; the RDBMS keeps track of where everything is. Relational databases use atomic data - breaking data down into its smallest parts to allow maximum flexibility in selecting and using data. In addition, a single copy of the data can serve multiple purposes.

  • Responds quickly
    In a non-relational database, adding a field means that all programs that use the database must be rewritten to become aware of the new data structure. In a relational database, programs do not need to be re-written when a field is added.

  • Data-driven
    Relational databases are designed to represent a business and its practices - not the application or the computer system.

  • Business-oriented
    The two tables we have looked at, EMPLOYEE and DEPARTMENT, are organized to reflect the way the business really works.

  • Simple and easy to use and understand
    Simplicity is useful not only to the people who ask the questions, but also to the people who have to figure out how to retrieve information from the database. Understanding how an RDBMS functions is not necessary.

  • Easier to build applications
    Relational databases make the data do more work. Programs and transactions are simpler, which makes it easier to build applications.

  • Support the trend toward end-user computing
    The trend is moving away from organizations funneling all data requests through the IT people who know how the system works. As systems get easier to use, more people have access to them. This is called the "democratization of data."

  • Set Theory
    Set theory is the mathematical science of the infinite. It studies properties of sets, abstract objects that pervade the whole of modern mathematics. The language of set theory, in its simplicity, is sufficiently universal to formalize all mathematical concepts and thus set theory, along with predicate calculus, constitutes the true foundations of mathematics.

Evolution of Data Processing

      
Traditionally, data processing has been divided into two categories: On-Line Transaction Processing (OLTP) and Decision Support Systems (DSS). For either, requests are handled as transactions. A transaction is a logical unit of work, such as a request to update an account.

Traditional

A transaction is a logical unit of work.
  • On-Line Transaction Processing (OLTP)
    OLTP is typified by a small number of rows (or records) or a few of many possible tables being accessed in a matter of seconds or less. Very little I/O processing is required to complete the transaction. This type of transaction takes place when we take out money at an ATM. Once our card is validated, a debit transaction takes place against our current balance to reflect the amount of cash withdrawn. This type of transaction also takes place when we deposit money into a checking account and the balance gets updated. We expect these transactions to be performed quickly. They must occur in real time.

  • Decision Support Systems (DSS)
    Decision support systems include batch reports, which roll-up numbers to give business the big picture, and over time, have evolved. Instead of pre-written scripts, users now require the ability to do ad hoc queries, analysis, and predictive what-if type queries that are often complex and unpredictable in their processing. These types of questions are essential for long range, strategic planning. DSS systems often process huge volumes of detail data and rely less on summary data.

Today

  • On-line Analytical Processing (OLAP)
    OLAP is a modern form of analytic processing within a DSS environment. OLAP tools (e.g., from companies like MicroStrategy and Cognos) provide an easy to use Graphical User Interface to allow "slice and dice" analysis along multiple dimensions (e.g., products, locations, sales teams, inventories, etc.). With OLAP, the user may be looking for historical trends, sales rankings or seasonal inventory fluctuations for the entire corporation. Usually, this involves a lot of detail data to be retrieved, processed and analyzed. Therefore, response time can be in seconds or minutes.

  • Data Mining (DM)
    DM (predictive modeling) involves analyzing moderate to large amounts of detailed historical data to detect behavioral patterns (e.g., buying, attrition, or fraud patterns) that are used to predict future behavior. An "analytic model" is built from the historical data (Phase 1: minutes to hours) incorporating the detected patterns. The model is then run against current detail data ("scoring") to predict likely outcomes (Phase 2: seconds or less). Likely outcomes, for example, include scores on likelihood of purchasing a product, switching to a competitor, or being fraudulent.
Type: OLTP
Examples: Update a checking account to reflect a deposit. A debit transaction takes place against the current balance to reflect the amount of money withdrawn at an ATM.
Number of Rows Accessed: Small
Response Time: Seconds

Type: DSS
Examples: How many child-size blue jeans were sold across all of our Eastern stores in the month of March? What were the monthly shoe sales for retailer X?
Number of Rows Accessed: Large
Response Time: Seconds or minutes

Type: OLAP
Examples: Show the top ten selling items across all stores for 1997. Show a comparison of sales from this week to last week.
Number of Rows Accessed: Large number of detail rows or moderate number of summary rows
Response Time: Seconds or minutes

Type: Data Mining
Examples: Which customers are most likely to leave? Which customers are most likely to respond to this promotion?
Number of Rows Accessed: Moderate to large number of detailed historical rows
Response Time: Phase 1: Minutes or hours; Phase 2: Seconds or less


Row vs. Set Processing

      
Both cursor and set processing define a set (or sets) of rows of data to process; but while a cursor processes the rows sequentially, set processing operates on the entire set at once. Both can be processed with a single command.

What is Row-by-Row Processing?

Where there are many rows to process, one row is fetched at a time and all calculations are done on it, then it is updated or inserted. Then the next row is fetched and processed. This is row-by-row processing and it makes for a slow program.

A benefit of row processing is that there is less lock contention.

What is Set Processing?

A lot of data processing is set processing, which is what relational databases do best.

Instead of processing row-by-row sequentially, you can process relational data set-by-set, without a cursor. For example, to sum all payment rows with balances of $100 or less, a single SQL statement completely processes all rows that meet the condition as a set.

With sufficient rows to process, this can be 10 to 30 or more times faster than row-at-a-time processing.

Some characteristics of set processing include:

  • An update with all AMPs involved
  • Single session processing which takes advantage of parallel processing
  • Efficient updates of large amounts of data.
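
The payments example above can be written as single set-based statements, as in the sketch below (the Payments table and its column names are assumptions):

-- sums every qualifying row as one set, in parallel across the AMPs
SELECT SUM(payment_amount)
FROM Payments
WHERE balance_amount <= 100;

-- a set-based update touches all qualifying rows in one statement
UPDATE Payments
SET payment_status = 'PROCESSED'
WHERE balance_amount <= 100;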

Response Time vs. Throughput

      
When determining how fast something is, there are two kinds of measures. You can measure how long it takes to do something or you can measure how much gets done per unit time. The former is referred to as response time, access time, transmission time, or execution time depending on the context. The latter is referred to as throughput.

Response Time

This speed measure is specified by an elapsed time from the initiation of some activity until its completion. The phrase response time is often used in operating systems contexts.

Throughput

A throughput measure is an amount of something per unit time. For operating systems, throughput is often measured as tasks or transactions per unit time. For storage systems or networks, throughput is measured as bytes or bits per unit time. For processors, the number of instructions executed per unit time is an important component of performance.

What Does this Mean to Teradata?

Response Time                            | Throughput
-----------------------------------------|------------------------------------------------------------------
Measures the average duration of queries | Measures the quantity of queries completed during a time interval
A measure of process completion          | A measure of the amount of work processed
How long the processing takes            | How many queries were processed
The elapsed time per query               | The number of queries executed in an hour

In order to improve both response time and throughput on a Teradata system, you could increase CPU power (i.e., add nodes), implement workload management to control resources, and decrease the number of concurrent users.


The Advantage of Using Detail Data

      
Until recently, most business decisions were based on summary data.

The problem is that summarized data is not as useful as detail data and cannot answer some questions with accuracy. With summarized data, peaks and valleys are leveled when the peaks fall at the end of a reporting period and are cut in half, as shown in the example below.

Here's another example. Think of your monthly bank statement that records checking account activity. If it only told you the total amount of deposits and withdrawals, would you be able to tell if a certain check had cleared? To answer that question you need a list of every check received by your bank. You need detail data.

Decision support, answering business questions, is the real purpose of databases. To answer business questions, decision-makers must have four things:

  • The right data
  • Enough detail data
  • Proper data structure
  • Enough computer power to access and produce reports on the data

Strategic workloads tend to rely heavily on detail data.

Consider your own business and how it uses data. Is that data detailed or summarized? If it's summarized, are there questions it cannot answer?

QUESTION: How effective was the national advertisement for jeans that ran June 6 through June 8?

DETAIL DATA vs. SUMMARY DATA

  • Detail data gives a more accurate picture.
  • Correct business decisions result.

The Data Warehouse

      
A data warehouse is a central, enterprise-wide database that contains information extracted from the operational systems (this is where the saying, "Load once, use many" comes from). A data warehouse has a centrally located logical architecture, using a single data store, which minimizes data synchronization, utilizes less space, and provides a single version of the truth.

Data warehouses have become more common in corporations where enterprise-wide detail data may be used in on-line analytical processing to make strategic and tactical business decisions. Warehouses often carry many years' worth of detail data so that historical trends may be analyzed using the full power of the data.

Many data warehouses get their data directly from operational systems so that the data is timely and accurate, providing less disparate results. While data warehouses may begin somewhat small in scope and purpose, they often grow quite large as their utility becomes more fully exploited by the enterprise.

The data warehouse carries a massive volume of detail data, so it is well suited to the parallel architecture of the Teradata database. Because the Teradata database has traditionally specialized in large-scale, decision-support queries against huge volumes of detail data, its use for data warehousing becomes an obvious solution.

The Teradata database can run in single-node environments with a database of a terabyte or less. As requirements grow, the single node can easily be expanded to multiple nodes with the Teradata database's linear scalability. In the multi-node environment, the Teradata database can scale upward to systems supporting more than 50 petabytes of data and process hundreds of concurrent sessions. With the power of multiple CPUs propelling multiple AMPs over multiple disks, getting answers from the detail data in the warehouse becomes fast and efficient.

The Teradata database supports normalized logical models because it can perform 128 joins in a single query as well as large aggregations during very complex queries using its robust parser and optimizer. A key Teradata database strength is the ability to model the customer's business. With a business model, data is organized to what it represents. The Teradata database can do Star Schema and other forms of relational modeling - but Third Normal Form is the methodology we recommend to our customers. The Teradata database also has:

  • A Priority Scheduler and workload management tools that prioritize resources among competing applications based on importance
  • A powerful suite of load and maintenance utilities
  • Numerous third-party strategic and alliance partners

A central, enterprise-wide database that contains information extracted from operational systems.

  • Based on enterprise-wide model
  • Can begin small but may grow large rapidly
  • Populated by extraction/loading of data from operational systems
  • Responds to users "what if" queries
  • Minimizes data movement/ synchronization
  • Provides a single view of the business

Data Marts

      
A data mart is a special-purpose subset of data used by a particular department, function, or application. Data marts may have both summary and detail data for a particular use rather than for general use. Usually the data has been pre-aggregated or transformed in some way to better handle the particular type of requests of a specific user community. Some reasons data marts are implemented are that a department wants to own its own data, a quick ROI is required, and, in some cases, canned queries may offer users better performance.

Data Mart Types

  • Independent Data Marts
    Independent data marts are created directly from operational systems, just as a data warehouse is. In the data mart, the data is usually transformed as part of the load process. Data might be aggregated, dimensionalized or summarized historically, as the requirements of the data mart dictate.

  • Logical Data Marts
    Logical data marts are not separate physical structures or a data load from a data warehouse, but rather are an existing part of the data warehouse. Because in theory the data warehouse contains the detail data of the entire enterprise, a logical view of the warehouse might provide the specific information for a given user community, much as a physical data mart would. Without the proper technology, a logical data mart can be a slow and frustrating experience for end users. With the proper technology (Teradata), it removes the need for massive data loading and transformation, making a single data store available for all user needs.

  • Dependent Data Marts
    Dependent data marts are created from the detail data in the data warehouse. While having many of the advantages of the logical data mart, this approach still requires the movement and transformation of data but may provide a better vehicle for some performance-critical user queries.

Data Models - Enterprise vs. Application

To build an EDW, an enterprise data model should be leveraged. An enterprise data model serves as a neutral data model that is normalized to address all business areas and not specific to any function or group, whereas an application model is built for a specific business area. The application data model only looks at one aspect of the business whereas an enterprise logical data model integrates all aspects of the business.

In addition, an enterprise data model is more extensible than an application data model. It is intended to encompass the entire enterprise.

Teradata Functional Overview

      
The Parsing Engine is responsible for:
  • Managing individual sessions (up to 120 sessions per PE)
  • Parsing and optimizing your SQL requests
  • Building query plans with the parallel-aware, cost-based, intelligent Optimizer
  • Dispatching the optimized plan to the AMPs
  • EBCDIC/ASCII input conversion (if necessary)
  • Sending the answer set response back to the requesting client

The Parsing Engine (PE) is a virtual processor (vproc). It is made up of the following software components: Session Control, the Parser, the Optimizer, and the Dispatcher.

Once a valid session has been established, the PE is the component that manages the dialogue between the client application and the RDBMS. Each PE can handle up to 120 sessions, and each session can handle multiple requests.

Session Control

The major functions performed by Session Control are logon and logoff. Logon takes a textual request for session authorization, verifies it, and returns a yes or no answer. Logoff terminates any ongoing activity and deletes the session's context.

Parsing

The Parser interprets SQL statements, checks them for proper SQL syntax and evaluates them semantically. The Parser also consults the Data Dictionary to ensure that all objects and columns exist and that the user has authority to access these objects.

Optimizing

The Optimizer is responsible for developing the least expensive plan to return the requested response set. Processing alternatives are evaluated and the fastest plan is chosen. This plan is converted to executable steps, to be performed by the AMPs, which are then passed to the dispatcher. In order to maximize throughput and minimize resource contention, the optimizer must know about system configuration, available units of parallelism (number of AMPs and PE's), and data demographics of the tables in the query. The Teradata database Optimizer is robust and intelligent. The optimizer enables the Teradata database to handle multiple complex, ad hoc queries efficiently. It is parallel-aware and cost-based.

Dispatching

The Dispatcher controls the sequence in which the steps are executed and passes the steps on to the AMPs via the BYNET. It is composed of execution-control and response-control tasks. Execution control receives the step definitions from the Parser and transmits them to the appropriate AMP(s) for processing, receives status reports from the AMPs as they process the steps, and passes the results on to response control once the AMPs have completed processing. Response control returns the results to the user. The Dispatcher sees that all AMPs have finished a step before the next step is dispatched. Depending on the nature of the SQL request, a step will be sent to one AMP, or broadcast to all AMPs.

Note: By default, Teradata Gateway software can support up to 1200 sessions per processing node. Therefore a maximum of 10 Parsing Engines can be defined for a node using the Gateway. It is possible to configure a node to support more than 1200 sessions.

Major Components of the Teradata Database

      
We have looked at the overall node, and now we will describe the components that make up a node in detail.

Parallel Database Extensions (PDE)

The Parallel Database Extensions is a software interface layer that lies between the operating system and database. It controls the virtual processors (vprocs).

Parsing Engine (PE)

The Parsing Engine is a virtual processor (vproc) that interprets SQL requests, receives input records, and passes data. To do that, it sends the messages over the BYNET to the AMPs (Access Module Processor).

BYNET

The BYNET is the message-passing layer. It determines which AMP(s) should receive a message.

Access Module Processor (AMP)

The AMP is a virtual processor (vproc) designed for and dedicated to managing a portion of the entire database. It performs all database management functions such as sorting, aggregating, and formatting data. The AMP receives data from the PE, formats rows, and distributes them to the disk storage units it controls. The AMP also retrieves the rows requested by the Parsing Engine.

Disks

Disks are disk drives associated with an AMP that store the data rows. On current systems, they are implemented using a disk array.

What Happens When a Query is Submitted?

  1. The Parser evaluates the SQL for proper syntax.
  2. The Optimizer develops the least expensive plan.
  3. The BYNET passes the steps from the Dispatcher to the AMPs.


The Parsing Engine (PE)

      
Parsing Engine (PE) is a virtual processor (vproc). It is made up of the following software components: Session Control, the Parser, the Optimizer, and the Dispatcher.

Once a valid session has been established, the PE is the component that manages the dialogue between the client application and the RDBMS. Each PE can handle up to 120 sessions, and each session can handle multiple requests.

Session Control

The major functions performed by Session Control are logon and logoff. Logon takes a textual request for session authorization, verifies it, and returns a yes or no answer. Logoff terminates any ongoing activity and deletes the session's context.

Parsing

The Parser interprets SQL statements, checks them for proper SQL syntax and evaluates them semantically. The Parser also consults the Data Dictionary to ensure that all objects and columns exist and that the user has authority to access these objects.

Optimizing

The Optimizer is responsible for developing the least expensive plan to return the requested response set. Processing alternatives are evaluated and the fastest plan is chosen. This plan is converted into executable steps, which are then passed to the Dispatcher and performed by the AMPs. In order to maximize throughput and minimize resource contention, the Optimizer must know the system configuration, the available units of parallelism (number of AMPs and PEs), and the data demographics of the tables in the query. The Teradata Database Optimizer is robust and intelligent; it enables the Teradata database to handle multiple complex, ad hoc queries efficiently. It is parallel-aware and cost-based.
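
You can see the plan the Optimizer builds by prefixing a request with the EXPLAIN modifier. The example below is only an illustration (the Orders table is the sample table used later in this document):

EXPLAIN
SELECT   Order_Status
        ,COUNT(*)
FROM     Orders
GROUP BY Order_Status;

-- Returns the step-by-step plan (locks, scans, aggregation, spool usage)
-- that the Dispatcher will send to the AMPs, without executing the query.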

Dispatching

The Dispatcher controls the sequence in which the steps are executed and passes the steps on to the AMPs via the BYNET. It is composed of execution-control and response-control tasks. Execution control receives the step definitions from the Parser and transmits them to the appropriate AMP(s) for processing, receives status reports from the AMPs as they process the steps, and passes the results on to response control once the AMPs have completed processing. Response control returns the results to the user. The Dispatcher sees that all AMPs have finished a step before the next step is dispatched. Depending on the nature of the SQL request, a step will be sent to one AMP, or broadcast to all AMPs.

Note: By default, Teradata Gateway software can support up to 1200 sessions per processing node. Therefore a maximum of 10 Parsing Engines can be defined for a node using the Gateway. It is possible to configure a node to support more than 1200 sessions.

The Parsing Engine is responsible for:

  • Managing individual sessions (up to 120 sessions per PE)
  • Parsing and optimizing your SQL requests
  • Building query plans with the parallel-aware, cost-based, intelligent Optimizer
  • Dispatching the optimized plan to the AMPs
  • EBCDIC/ASCII input conversion (if necessary)
  • Sending the answer set response back to the requesting client


The BYNET

      
The BYNET (pronounced, "bye-net") is a high-speed interconnect (network) that enables multiple nodes in the system to communicate.

The BYNET handles the internal communication of the Teradata database. All communication between PEs and AMPs is done via the BYNET.

When the PE dispatches the steps for the AMPs to perform, they are dispatched onto the BYNET. Messages are routed to the appropriate AMP(s) where results sets and status information are generated. This response information is also routed back to the requesting PE via the BYNET.

Depending on the nature of the dispatch request, the communication between nodes may be a:

  • Broadcast - message is routed to all nodes in the system.
  • Point-to-point - message is routed to one specific node in the system.

Once the message is on a participating node, PDE directs the message to the appropriate AMPs on that node. All AMPs receive a broadcast message. With a point-to-point or multicast (multiple AMPs) message, the message is directed only to the appropriate AMP(s) on that node by PDE.

So, while a Teradata database system does do multicast messaging, the BYNET hardware alone cannot do it - the BYNET can only do point-to-point and broadcast between nodes.

The BYNET has several unique features:

  • Fault tolerant: each network has multiple connection paths. If the BYNET detects an unusable path in either network, it will automatically reconfigure that network so all messages avoid the unusable path. Additionally, in the rare case that BYNET 0 cannot be reconfigured, hardware on BYNET 0 is disabled and messages are re-routed to BYNET 1, and vice versa.

  • Load balanced: traffic is automatically and dynamically distributed between both BYNETs.

  • Scalable: as you add nodes to the system, overall network bandwidth scales linearly - meaning an increase in system size without loss of performance.

  • High Performance: an MPP system has two BYNET networks providing hardware redundancy. Because both networks are active, the system benefits from the full aggregate bandwidth.

The technology of the BYNET is what makes the scalable parallelism of the Teradata database possible.

BYNET merges final answer set

Dual redundant, fault-tolerant, bi-directional interconnect network that enables:
  • Automatic load balancing of message traffic
  • Automatic reconfiguration after fault detection
  • Scalable bandwidth as nodes are added

The BYNET connects and communicates with all the AMPs and PEs in the system:

  • Between nodes, the BYNET hardware carries broadcast and point-to-point communications
  • On a node, BYNET software and PDE together control which AMPs receive a multicast communication
  • Merges the final answer set


The Access Module Processor (AMP)

      
The Access Module Processor (AMP) is the virtual processor (vproc) in the architecture that is responsible for managing a portion of the database. Each AMP will manage some portion of each table on the system. AMPs do the physical work associated with generating an answer set, including sorting, aggregating, formatting, and converting. The AMPs retrieve and perform all database management functions on the requested rows from a table.

An AMP accesses data from its single associated vdisk, which is made up of multiple ranks of disks. An AMP responds to Parser/Optimizer steps transmitted across the BYNET by selecting data from or storing data to its disks. For some requests, the AMPs may redistribute a copy of the data to other AMPs.

The Database Manager subsystem resides on each AMP. This subsystem will:

  • Lock databases and tables.
  • Create, modify, or delete definitions of tables.
  • Insert, delete, or modify rows within the tables.
  • Retrieve information from definitions and tables.
  • Return responses to the Dispatcher.

Earlier in this course, we discussed the logical organization of data into tables. The Database Manager subsystem provides a bridge between that logical organization and the physical organization of the data on disks. The Database Manager performs space-management functions that control the use and allocation of space.

The Teradata database performs all tasks in parallel, providing exceptional performance. The greater the number of tasks processed in parallel, the better the system performance. Many databases call themselves "parallel," but they can only perform some tasks in parallel.

Shared Nothing Architecture

The Teradata database virtual processors, or vprocs (which are the PEs and AMPs), share the components of the nodes (memory and cpu). The main component of the "shared-nothing" architecture is that each AMP manages its own dedicated portion of the system's disk space (called the vdisk) and this space is not shared with other AMPs. Each AMP uses system resources independently of the other AMPs so they can all work in parallel for high system performance overall.

AMPs perform all tasks in parallel

The AMP is responsible for:
  • Storing rows to and retrieving rows from its vdisk
  • Lock management
  • Sorting rows and aggregating columns
  • Join processing
  • Output conversion and formatting (ASCII, EBCDIC)
  • Creating answer sets for clients
  • Disk space management and accounting
  • Special utility protocols
  • Recovery processing

What is a Node?

      
Up to this point, we have discussed relational databases in terms of how the user sees them - as a collection of tables that relate to one another. Now we'll look at the physical components of the system.

A node is made up of various hardware and software components. All applications run under Linux or Windows, and all the Teradata database software runs under Parallel Database Extensions (PDE). All share the resources of CPU and memory on the node.

Access Module Processors (AMPs) and Parsing Engines (PEs) are virtual processors (vprocs) running under control of the PDE. Their numbers are software configurable.

AMPs are associated with virtual disks (vdisks), which are configured as ranks of a disk array.

The "Shared Nothing" architecture of the Teradata database means that each vproc is responsible for its own portion of the database, and for all activities to be performed on that part of the system.

All AMPs and PEs communicate via the BYNET, which is a message passing system. In an SMP (Symmetric Multi-Processing) system, it is implemented as a BYNET driver, and in MPP (Massively Parallel Processing) systems, it is implemented as a software and hardware solution. The BYNET allows multiple vprocs on multiple nodes to communicate with each other. PDE actually controls message-passing activity, while the BYNET handles message queuing and flow control.

An application that runs under the control of PDE, such as the Teradata database, is considered a Trusted Parallel Application (TPA).

Note: The versatility of the Teradata database is based on virtual processors (vprocs) that eliminate dependency on specialized physical processors. These vprocs are a set of software processes that run on a node under Parallel Database Extensions (PDE) and the multitasking environment of the operating system.

In a massively parallel processing (MPP) system, there is hardware redundancy among both the multiple nodes and the BYNET.



Parallelism

      
Parallelism is at the very heart of the Teradata database. There is virtually no part of the system where parallelism has not been built in. Without the parallelism of the system, managing enormous amounts of data would either not be possible or, best case, would be prohibitively expensive and inefficient.

Each PE can support up to 120 user sessions in parallel. This could be 120 distinct users, or a single user harnessing the power of all 120 sessions for a single application. Each session may handle multiple requests concurrently. While only one request at a time may be active on behalf of a session, the session itself can manage the activities of 16 requests and their associated answer sets.

The Message Passing Layer was designed such that it can never be a bottleneck for the system. Because the MPL is implemented differently for different platforms, this means that it will always be well within the needed bandwidth for each particular platform's maximum throughput.

Each AMP can perform up to 80 tasks (or more) in parallel. This means that AMPs are not dedicated at any moment in time to the servicing of only one request, but rather are multi-threading multiple requests concurrently. The value 80 represents the number of AMP Worker Tasks and may be changed on some systems. Because AMPs are designed to operate on only one portion of the database, they must operate in parallel to accomplish their intended results.

In truth, parallelism is built into the Teradata database from the ground up!

Notes:

  • Each PE can handle up to 120 sessions in parallel.
  • Each Session can handle multiple REQUESTS.
  • The Message Passing Layer can handle all message activity in parallel.
  • Each AMP can perform up to 80 tasks in parallel (can be configured for more).
  • All AMPs can work together in parallel to service any request.
  • Each AMP can work on several requests in parallel.

Teradata Compression Comparison

      

Multi-value Compression

Multi-value compression has three variations:
  • COMPRESS – Nulls are compressed.
  • COMPRESS NULL – Nulls are compressed. (COMPRESS & COMPRESS NULL mean the same thing)
  • COMPRESS <constant> – Nulls and the specified <constant> value are compressed.
With algorithmic compression, you must specify a UDF for compression and a UDF for decompression.

Example:

CREATE TABLE bank_accounts
 (customer_id INTEGER
 ,primary_bank INTEGER COMPRESS -- only NULL is compressed
 ,account_type CHAR(10) COMPRESS 'Savings' -- both NULL and 'Savings' are compressed
 ,account_notes VARCHAR(500) CHARACTER SET UNICODE
	COMPRESS USING TD_SYSFNLIB.TransUnicodeToUTF8
	DECOMPRESS USING TD_SYSFNLIB.TransUTF8ToUnicode -- algorithmic compression
 . . . );

Multi-value Compression vs. VARCHAR

For character-based data, an alternative to Teradata compression is the VARCHAR(N) data type. The number of bytes used to store each field is the length of the data item plus two bytes. Contrast this with a fixed-length CHAR(N) data type that takes N bytes per row, regardless of the actual number of characters in each field. Combining Teradata compression with a fixed-length character data type can be a very effective space saver. MVC can compress up to 255 distinct values in a column.

Data demographics can help determine whether variable-length character data type or fixed length plus compression is more efficient. The most important factors are the maximum field length, the average field length, and the compressibility of the data.
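
As a rough sketch of the trade-off (the table, column names, and city values are illustrative, not from the course material):

CREATE TABLE customer_address
 (cust_id    INTEGER
 ,city_fixed CHAR(20) COMPRESS ('Chicago', 'New York', 'Los Angeles')  -- fixed width + MVC for frequent values
 ,city_var   VARCHAR(20));                                             -- actual length + 2 bytes per value, no compression

Whether the fixed-length-plus-compression column or the VARCHAR column uses less space depends on the demographics described above.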

Algorithmic Compression

When column values are mostly unique, Algorithmic Compression can provide better compression results than Multi-value Compression. You can define your own compression and decompression algorithms and apply them to data at the column level.

A user can create their own compression algorithms as external C/C++ scalar UDFs, and then specify them in the column definition of a CREATE TABLE/ALTER TABLE statement. Teradata invokes these algorithms to compress and decompress the data when the data is inserted into or retrieved from the table.

ALC allows you to implement the compression scheme that is most suitable for data in a particular column. The cost of compression and decompression depends on the algorithm chosen.

Block Level Compression

Performs compression on whole data blocks at the file system level before they are written to storage.

  • BLC will compress/decompress only data blocks but will not be used on any file system structures such as Master/Cylinder indexes, the WAL log and table headers.
  • Only primary data subtable and fallback copies can be compressed. Both objects are either compressed or neither is compressed.
  • Secondary Indexes (USI/NUSI) cannot be compressed but Join Indexes can be compressed since they are effectively user tables.
  • Only the compressed form of the data block is cached, so each block access must decompress the data block.
  • On initial delivery, a single compression algorithm will be supplied and used.
  • Spool data blocks can be compressed via a system-wide tunable, as well as Temporary, WORK (sort Work space), and permanent journal.
  • Once enabled, reversion to an earlier release for compressed tables isn't allowed.
  • Teradata 14.0 added Temperature Based Block Level Compression (TBBLC) to enable greater compression flexibility while minimizing the potential performance impact.
Multi-value Compression (MVC) vs. Algorithmic Compression (ALC) vs. Block Level Compression (BLC / TBBLC)

Ease of Use
  • MVC: Easy to apply to well-understood data columns and values
  • ALC: Easy to apply on a column with CREATE TABLE
  • BLC: Set once and forget

Analysis Required
  • MVC: Need to analyze data for common values
  • ALC: Use Teradata algorithms or user-defined compression algorithms to match unique data patterns
  • BLC: Need to analyze the CPU overhead trade-off (can be turned on for all data on the system or applied on a per-table basis)

Flexibility
  • MVC: Works for a wide variety of data and situations
  • ALC: Automatically invoked for values not replaced by MVC
  • BLC: Automatically combined with other compression mechanisms

Performance Impact
  • MVC: No or minimal CPU usage
  • ALC: Depends on the compression algorithm used
  • BLC: Reduced I/O due to compressed data blocks (CPU cycles are used to compress/decompress)

Applicability
  • MVC: Replaces common values
  • ALC: Industry data, UNICODE, Latin data
  • BLC: All data

A customer can choose any combination or all three on a column/table.


Layers of the Data Warehouse Architecture

      
There are several "layers" that make up the EDW environment. These layers include:

  • Staging Layer
    The primary purpose of the staging layer is to perform data transformation and loading during either the ETL or ELT process.

  • Access Layer
    Access is often provided via views and business intelligence (BI) tools, whether a Teradata application or a third-party tool.

  • Physical Layer
    The physical layer is where denormalizations that make access more efficient occur: pre-aggregations, summary tables, join indexes, etc.

  • Data Lab
    This is Teradata's implementation of a controlled "sandbox" environment. It enables rapid loading and analysis of new data combined with existing warehouse data. It quickly assimilates untested data into a "non-production" portion of the warehouse, providing a self-provisioning, self-service environment for prototyping and analysis on new, un-cleansed data, which is temporary in nature.

    Data Lab is composed of two Viewpoint portlets.

Teradata Data Lab - Interface

Teradata Database Objects

      
A "database" or "user" in Teradata database systems is a collection of objects such as tables, views, macros, triggers, stored procedures, user-defined functionsor indexes (join and hash). Database objects are created and accessed using standard Structured Query Language or SQL. Starting with Teradata Release 14.10, extended object names feature, allows object names to be up to 128 characters where prior it was a 30-byte limit.

All database object definitions are stored in a system database called the Data Dictionary/Directory (DD/D).

Databases provide a logical grouping for information. They are also the foundation for space allocation and access control.

Tables

A table is the logical structure of data in a relational database. It is a two-dimensional structure made up of columns and rows.

Views

A view is a pre-defined subset of one or more tables or other views. It does not exist as a real table, but serves as a reference to existing tables or views. One way to think of a view is as a virtual table. Views have definitions in the data dictionary, but do not contain any physical rows. Views can be used to implement security and control access to the underlying tables. They can hide columns from users, insulate applications from changes, and simplify or standardize access techniques. They provide a mechanism to include a locking modifier. Views provide well-defined and tested high-performance access to data. X-Views are special views that limit the display of data to the user who accesses them.
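
A minimal sketch of a view over the Customer table used later in this document, hiding one column and adding a locking modifier (the view name is illustrative):

CREATE VIEW Customer_V AS
LOCKING ROW FOR ACCESS        -- locking modifier included in the view
SELECT Cust
      ,Name                   -- Phone is not exposed through the view
FROM   Customer;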

Macros

A macro is a predefined, stored set of one or more SQL commands and optionally, report formatting commands. Macros are used to simplify the execution of frequently used SQL commands. Macros may also be used to limit user access to data.
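
A minimal sketch of a macro, assuming the Customer table used later in this document (the macro and parameter names are illustrative):

CREATE MACRO Get_Customer (cust_no INTEGER) AS
 (SELECT Cust, Name, Phone
  FROM   Customer
  WHERE  Cust = :cust_no;);   -- parameter is referenced with a colon

EXEC Get_Customer (45);       -- executes the stored SQL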

Triggers

A trigger is a set of SQL statements usually associated with a table or column; when that column (or the table) changes, the trigger is fired, effectively executing the SQL statements.

Stored Procedures

A stored procedure is a program that is stored within Teradata and executes within the Teradata database. A stored procedure uses permanent disk space and may be used to limit a user's access to data.

User Defined Functions (UDF)

A User-Defined Function (UDF) allows authorized users to write external functions. Teradata allows users to create scalar functions to return single value results, aggregate functions to return summary results, and table functions to return tables. UDFs may be used to protect sensitive data such as personally identifiable information (PII), even from the DBA(s).

X-Views limit the display of results to the user accessing them.

Teradata Database Space

      
There are three types of space within the Teradata database:

  1. Perm Space
    All databases and users have a defined upper limit of Permanent Space. Permanent Space is used for storing the data rows of tables. Perm Space is not pre-allocated. It represents a maximum limit.

  2. Spool Space
    All databases also have an upper limit of Spool Space. If there is no limit defined for a particular database or user, limits are inherited from parents. Theoretically, a user could use all of the unallocated space in the system for their query. Spool Space is unused Perm Space that is used to hold intermediate query results or formatted answer sets for queries. Once the query is complete, the Spool Space is released.

    Example: You have a database with total disk space of 100GB. You have 10GB of user data and an additional 10GB of overhead. What is the maximum amount of Spool Space available for queries?

    Answer: 80GB. All of the remaining space in the system is available for spool.

  3. Temp Space
    The third type of space is Temporary Space. Temp Space is used for global temporary tables, and these results remain available to the user until their session is terminated. Tables created in Temp Space will survive a restart. Temp Space is permanent space currently not used.

Space Management

      
The way the Teradata database handles space management is different from other database implementations.

The initial Teradata database system comes with several users and databases:

  • Users: DBC, SysAdmin, SystemFE, TDPuser
  • Databases: Crashdumps, Default, All, Public

Initially, most of the space in the system belongs to user DBC, with some allocated to each of the users or databases listed above. Database DBC contains all of the RDBMS software components and system tables.

Before defining application users and databases, the Database Administrator should first create a special administrative user and assign most of the space in the system to that user. This space comes from user DBC. User DBC becomes the owner of the administrative user.

There should be enough space left in user DBC to accommodate the growth of system tables and logs.

As the administrative user creates additional users and databases, space assigned to those objects will be subtracted from the administrative user's space. As these users and databases create subordinate objects, they will give up some of their space to these new users and databases.

The clear boxes represent all the space available in the system. The shaded boxes represent space that is being used in the system by tables. Space that is not currently being used by tables is available for use as spool or temporary space. If a table is dropped, that space becomes available to the system again.

Database 1 and Database 2 contain the production tables. Database 3 contains views and macros to control access to the data by end users and does not consume any database space. User D is an example of an end user who has no Perm Space but has been granted rights on certain views and macros to accomplish work. This user can still execute queries and create views or macros. User A, User B, and User C are examples of application developers who need some Perm Space to hold sample data for program testing and validation.

Most database vendors do not handle space this way. Once space is allocated to a table, it cannot be made available again without the Database Administrator having to perform a re-organization and re-allocate the space and partition the data.

  • A new database or user must be created from an existing database or user.
  • All Perm Space limits are subtracted from the owner.
  • Perm Space is a zero-sum game - the total of all Perm Space limits must equal the total amount of disk space available.
  • Perm Space currently not being used is available as Spool Space or Temp Space.
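
A hedged sketch of how an administrative user and a database might be created with explicit space limits (names and sizes are illustrative):

CREATE USER Sysdba FROM DBC AS
  PERM      = 500e9           -- subtracted from DBC's Perm Space limit
 ,SPOOL     = 200e9           -- upper limit for intermediate results
 ,TEMPORARY = 100e9           -- upper limit for global temporary tables
 ,PASSWORD  = temp_pwd;

CREATE DATABASE Database_1 FROM Sysdba AS
  PERM = 100e9;               -- subtracted from Sysdba's Perm Space limit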

How Does the Teradata Database Distribute Rows?

      
Each AMP is designed to hold a portion of the rows of each table. An AMP is responsible for the storage, maintenance and retrieval of the data under its control. The Teradata database uses hashing to dynamically distribute data across all AMPs.

The Teradata database's automatic hash distribution eliminates costly data maintenance tasks. There is no specific order to the placement of the data. The benefits of having unordered data are that they don't need any maintenance to preserve order, and they are independent of any query being submitted. There is never a need to reorganize data or index space which makes the database easier to manage and maintain. As a result, strategic business data is more accessible to the users.

The benefits of automatic data placement include:

  • Distribution is the same regardless of data volume.
  • Distribution is based on row content, not data demographics.
  • Rows are moved or re-blocked automatically.

Ideally, the rows of every table will be distributed among all of the AMPs. There may be some circumstances where this is not true. For example, what if there are fewer rows than AMPs? In this case, at least some AMPs will hold no rows from that table. (This example should be considered the exception, and not the rule.)

Ideally, the rows of each table will be evenly distributed across all of the AMPs. Even distribution is desirable because in an operation involving all rows of the table (such as a full-table scan), each AMP will have an equal portion of work to do. When workloads are not evenly distributed, the desired response can only be as fast as the slowest AMP.

Remember: The Teradata database uses a Shared-Nothing Architecture - each AMP is responsible for its own portion of the database and does not share it with any other AMP.

Data Management

Many other databases require you to manually partition the data. Teradata automates the placement and maintenance of physical data location.

With the Teradata database, your DBA can spend more time with users developing strategic applications to beat your competition.

Loading, inserting, updating, and deleting data affect manual data distribution schemes, thereby reducing query performance and requiring reorganization.

Primary Keys and Primary Indexes

      
While it is true that many tables use the same columns for both Primary Indexes and Primary Keys, Indexes are conceptually different from Keys. The table on this slide summarizes those differences.

A Primary Key is a relational data modeling term that defines, in the logical model, the columns that uniquely identify a row. A Primary Index is a physical database implementation term that defines the actual columns used to distribute and access rows in a table.

It is also true that a significant percentage of the tables in any database will use the same column(s) for both the PI and the PK. However, one should expect that in any real-world scenario there would be some tables that will not conform to this simplistic rule. Only through a careful analysis of the type of processing that will take place can the tables be properly evaluated for PI candidates. Remember, changing your mind about the columns that comprise the PI means recreating (and reloading) the table.

  • Indexes are conceptually different from keys.
  • A PK is a relational modeling convention which allows each row to be uniquely identified.
  • A PI is a Teradata convention which determines how the row will be stored and accessed.
  • A significant percentage of tables MAY use the same columns for both the PK and the PI.
  • A well-designed database WILL use a PI that is different from the PK for some tables.

Primary Key (PK) vs. Primary Index (PI)
  • PK: Logical concept of data modeling | PI: Physical mechanism for access and storage
  • PK: Teradata does not need to recognize the PK | PI: Each table can have (at most) one primary index
  • PK: No limit on number of columns | PI: 64-column limit
  • PK: Documented in the data model (optional in CREATE TABLE) | PI: Defined in the CREATE TABLE statement
  • PK: Must be unique | PI: May be unique or non-unique
  • PK: Identifies each row | PI: Identifies 1 (UPI) or multiple rows (NUPI)
  • PK: Values should not change | PI: Values may be changed (Delete + Insert)
  • PK: May not be NULL - requires a value | PI: May be NULL
  • PK: Does not imply an access path | PI: Defines most efficient access path
  • PK: Chosen for logical correctness | PI: Chosen for physical performance



Primary Key (PK) vs. Primary Index (PI)

      
Primary Indexes are conceptually different from Primary Keys.

  • Primary Key (PK) is a relational data-modeling term that defines the attribute(s) used to uniquely identify a tuple in an entity.

  • Primary Index (PI) is a physical database implementation term that defines the column(s) used to distribute the rows in a table. The Primary Index column(s) value may be unique or non-unique.

  • Unique Primary Index (UPI) the values in the Primary Index column(s) are unique which results in even row distribution and the elimination of duplicate row checks during data load operations.

  • Non-Unique Primary Index (NUPI) the values in the Primary Index column(s) are non-unique which results in potentially skewed row distribution - all duplicate Primary Index values go to the same AMP.

Accessing a row by its Primary Index value is the most efficient way to access a row and is always a one-AMP operation. Choosing a Primary Index is not an exact science. It requires analysis and thought for some tables and will be completely self-evident on others. Sometimes the obvious choice will be the Primary Key, which is known to be unique. Sometimes the choice of Primary Index may have to do with join performance and known access paths, and will therefore be a different choice than the Primary Key of its source entity.

Example: PK vs. PI. An insurance claims table may contain the following columns: Claim ID, Date, Insurance Provider ID, Patient ID, and Claim Amount. The Primary Key is defined as Claim ID and Date. The same claim number may be submitted on one date and then resubmitted on another date, so you need both columns to look up a claim and its details. Other users may often be analyzing provider fraud behavior and need to access the claim by the Provider ID. In this case, the Primary Index may be the Provider ID column. Even if data distribution may be skewed, a choice like this might be made to provide efficient access to a table.
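
A hedged sketch of how such a claims table might be defined, with the Primary Key carried as a constraint and a different column chosen as the Primary Index (column names and data types are assumptions):

CREATE TABLE Claims
 (Claim_Id     INTEGER NOT NULL
 ,Claim_Date   DATE    NOT NULL
 ,Provider_Id  INTEGER
 ,Patient_Id   INTEGER
 ,Claim_Amount DECIMAL(10,2)
 ,PRIMARY KEY (Claim_Id, Claim_Date))   -- enforced by Teradata as a unique index
PRIMARY INDEX (Provider_Id);            -- NUPI chosen for provider-based access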

Primary Key (PK) vs. Primary Index (PI)
  • PK: Logical concept of data modeling | PI: Mechanism for row distribution and access
  • PK: Teradata does not need the PK defined | PI: A table can have one (or none) Primary Index
  • PK: No limit on the number of attributes | PI: May be from 1 to 64 columns
  • PK: Documented in the logical data model | PI: Defined in the CREATE TABLE statement
  • PK: Must be unique | PI: Value may be unique or non-unique
  • PK: Unique identifier | PI: Used to place a row on an AMP
  • PK: Value should not change | PI: Value may be changed (Updated)
  • PK: May not be NULL | PI: May be NULL
  • PK: Does not imply access path | PI: Defines the most efficient access path
  • PK: Chosen for logical correctness | PI: Chosen for physical performance


Primary Indexes

      
With the Teradata database, the Primary Index (PI) is the physical mechanism for assigning a data row to an AMP and a location on its disks. Indexes are also used to access rows from a table without having to search the entire table. A table may have a Primary Index. The Primary Index of the table cannot be changed (i.e., altered, added, or dropped).

Choosing a Primary Index for a table is perhaps the most critical decision a database designer makes. The choice will affect the distribution of the rows in the table and the performance of the table in a production environment. Although many tables use combined columns as the Primary Index choice, the examples we use here are single-column indexes.

There are two types of primary indexes: unique (UPI) and non-unique (NUPI).

Unique Primary Index (UPI) is a column that has no duplicate values. UPIs are desirable because they guarantee even distribution of table rows. With a UPI, there is no duplicate row checking done during a load, which makes it a faster operation.

Because it is not always feasible to pick a Unique Primary Index, it is sometimes necessary to pick a column (or columns) which have non-unique, or duplicate values. This type of index is called a Non-Unique Primary Index or NUPI. A NUPI allows duplicate values as well as access to the non-unique columns. A NUPI can cause skewed data. While not a guarantor of uniform row distribution, the degree of uniqueness of the index will determine the degree of uniformity of the distribution. Because all rows with the same PI value end up on the same AMP, columns with a small number of distinct values that are repeated frequently do not make good PI candidates.

Accessing the row by its Primary Index value is the most efficient way to access a row and is always a one-AMP operation. Choosing a Primary Index is not an exact science. It requires analysis and thought for some tables and will be completely self-evident on others. Sometimes the obvious choice will be the Primary Key, which is known to be unique. Sometimes the choice of Primary Index may have to do with join performance (enabling AMP-local joins to other tables) and known access paths (allowing local joins to similar entities), and will therefore be a different choice than the Primary Key of the table.

The Teradata database is unique in hashing data directly to a physical address on disk, creating an even distribution of data. This allows a balanced application of parallelism, and avoids imbalance due to data skew. Rows are hashed based on the Primary Index.

Defining the Primary Index

      
When a table is created, the Primary Index should be specified. The Primary Index may consist of a single column or a combination up to 64 columns. Columns are assigned data types and may optionally be assigned attributes and constraints.

Once you choose the Primary Index for a table, it cannot be changed. The existing table must be dropped and a new table created to change the Primary Index definition.

Note: While you cannot change the Primary Index itself, values in a Primary Index column may be changed. The Teradata database simply rehashes that row to its new location, based on the Primary Index value.

The table definition is stored in the Data Dictionary (DD).

UPI Table:

CREATE TABLE Table1
  ( Col1  INTEGER
   ,Col2  INTEGER
   ,Col3  INTEGER )
UNIQUE PRIMARY INDEX (Col1);

NUPI Table:

CREATE TABLE Table2
  ( Col1  INTEGER
   ,Col2  INTEGER
   ,Col3  INTEGER )
PRIMARY INDEX (Col2);

  • The Primary Index is the mechanism used to assign a row to an AMP.
  • The Primary Index may be Unique (UPI) or Non-Unique (NUPI).
  • Unique Primary Indexes result in even row distribution and eliminate duplicate row checking.
  • Non-Unique Primary Indexes result in row distribution proportional to the number of duplicate values, which may cause skewed distribution.

Row Distribution via Hashing

      
The diagram below gives an overview of Primary Index hash distribution, which is the process by which all data is distributed in the Teradata database.

The Primary Index value is fed into the Hashing Algorithm, which produces the Row Hash. The row goes onto the Communications Layer. The Hash Map, in combination with the Row Hash, determines which AMP gets the row. The Hash Map is a part of the Communications Layer interface.

The Teradata database's hash partitioning is the only approach which offers balanced processing and balanced data placement, while minimizing interconnect traffic and freeing the DBA from time-consuming reorganizations. The Teradata File System eliminates the need to rebuild indexes when tables are updated or structures change.

Hashing Down to the AMPs

The Teradata database is unique in hashing data directly to a physical address on disk, creating an even distribution of data. This allows a balanced application of parallelism, and avoids imbalance due to data skew. The Teradata database's hashed data distribution scheme provides optimized performance with minimal tuning and no manual reorganizations, resulting in lower administration costs and reduced development time.

The rows of all tables are distributed across the AMPs according to their primary index value. The primary index value goes into the hashing algorithm. The output is a 32-bit row hash. The high-order 16 bits are referred to as the hash bucket number and are used to identify a hash-map entry. This entry is used to identify the AMP to be targeted. The remaining 16 bits are not used to locate the AMP.

Teradata provides customers with two choices for the size and number of the hash buckets in a system:

  • 16 bits (65,536 hash buckets, the same as in previous releases)
  • 20 bits (1,048,576 hash buckets)

For larger systems, such as those with 1,000 AMPs or more, the 20-bit option will result in more hash buckets to allow workloads to be distributed more evenly over the AMPs. New systems are also candidates for the expanded hash bucket option.

The entire 32-bit row hash will be used by the selected AMP to locate the row within its disk space.

Hash maps are uniquely configured for each size of system. A 50-AMP system will have a different hash map than a 30-AMP system.

A hash map is simply an array that associates bucket numbers with specific AMPs. When a system grows, new AMPs are typically added. Adding AMPs requires a change to the hash map to reflect the new total number of AMPs.
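
One way to observe the result of hash distribution is to ask how many rows of a table each AMP owns, using the standard hash functions (the table and column names are illustrative):

SELECT HASHAMP(HASHBUCKET(HASHROW(Order_Number))) AS AMP_No     -- AMP that owns each row
      ,COUNT(*)                                   AS Row_Count
FROM   Orders
GROUP BY 1
ORDER BY 1;

An even Row_Count across AMPs indicates good distribution; a few AMPs holding most of the rows indicates skew.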

UPI Row Distribution

      
The hashing algorithm predictably distributes rows across all AMPs. The same value stored in the same data type will always produce the same hash value. If the Primary Index is unique, the Teradata database can distribute the rows evenly. If the Primary Index is non-unique, but there are a fairly even number of rows per index value, the table will still distribute evenly. But if there are hundreds or thousands of rows for some index values, the distribution will probably be skewed.

In the example below, the Order_Number is used as a unique primary index. Since the primary index value for Order_Number is unique, the distribution of rows among AMPs is uniform. This assures maximum efficiency because each AMP does approximately the same amount of work. No AMPs sit idle waiting for another AMP to finish a task.

This way of storing the data provides for maximum efficiency and makes the best use of the parallel features of the Teradata database.

Orders table

Order_Number   Customer_Number   Order_Date   Order_Status
(PK, UPI)
7325           2                 4/13         O
7324           3                 4/13         O
7415           1                 4/13         C
7103           1                 4/10         O
7225           2                 4/15         C
7384           1                 4/12         C
7402           3                 4/16         C
7188           1                 4/13         C
7202           2                 4/09         C
  • Order_Number values are unique (UPI).
  • The rows will distribute evenly across the AMPs.


Unique Primary Index (UPI) Access

      
Accessing a table using a Primary Index is the most efficient way to access the data and is always a one-AMP operation. In the case of a UPI, the one-AMP access can return, at most, one row. In the example below, we are looking for the row whose primary index value is 45. By specifying the PI value as part of our selection criteria, we are guaranteed that only the AMP containing the specified row will be searched.

The correct AMP is located by taking the PI value and passing it through the hashing algorithm. Hashing takes place in the Parsing Engine. The output of the hashing algorithm contains information that will point to a specific AMP. Once it has isolated the appropriate AMP, finding the row is quick and efficient.

SELECT *
FROM  Customer
WHERE Cust = 45;

CREATE TABLE Customer
  ( Cust  INTEGER
   ,Name  CHAR(10)
   ,Phone CHAR(8) )
UNIQUE PRIMARY INDEX (Cust);
       
CUSTOMER table
Cust   Name     Phone
(PK, UPI)
37     White    555-4444
98     Brown    333-9999
74     Smith    555-6666
95     Peters   555-7777
27     Jones    222-8888
56     Smith    555-7777
45     Adams    444-6666
84     Rice     666-5555
49     Smith    111-6666
51     Marsh    888-2222
31     Adams    111-2222
62     Black    444-5555
12     Young    777-4444
77     Jones    777-6666


NUPI Row Distribution

      
In the example below, Customer_Number has been used as a non-unique Primary Index (NUPI). Note that row distribution among AMPs is uneven. All rows with the same primary index value (with the same customer number) are stored on the same AMP.

Customer_Number has three possible values, so all the rows are hashed to three AMPs, leaving the fourth AMP without rows from this table. While this distribution will work, it is not as efficient as spreading all rows among all AMPs.

AMP 2 has a disproportionate number of rows and AMP 3 has none. In an all-AMP operation, AMP 2 will take longer than the other AMPs. The operation cannot complete until AMP 2 completes its tasks. Overall operation time is increased and some AMPs are under-utilized.

This illustrates how NUPIs can create irregular distributions, called skewed distributions. AMPs that have more than an average number of rows will take longer for full-table operations than other AMPs. Because an operation is not complete until all AMPs have finished, the operation where distribution is skewed will take longer than it would if all AMPs were utilized evenly.

Orders table

Order_Number   Customer_Number   Order_Date   Order_Status
(PK)           (NUPI)
7325           2                 4/13         O
7324           3                 4/13         O
7415           1                 4/13         C
7103           1                 4/10         O
7225           2                 4/15         C
7384           1                 4/12         C
7402           3                 4/16         C
7188           1                 4/13         C
7202           2                 4/09         C
  • Customer_Number values are non-unique (NUPI).
  • Rows with the same PI value distribute to the same AMP causing skewed row distribution.


Non-Unique Primary Index (NUPI) Access

      
Accessing a table using a Non-Unique Primary Index (NUPI) is also a one-AMP operation. In the case of a NUPI, the one-AMP access can return zero to many rows. In the example below, we are looking for the rows whose primary index value is 555-7777. By specifying the PI value as part of our selection criteria, we are once again guaranteeing that only the AMP containing the required rows will need to be searched.

The correct AMP is located by taking the PI value and passing it through the hashing algorithm executed in the Parsing Engine. The output of the hashing algorithm will once again point to a specific AMP. Once it has isolated the appropriate AMP, it must find all rows that have the specified value. In the example, the AMP returns two rows.

SELECT *
FROM  Customer
WHERE Phone = '555-7777';

CREATE TABLE Customer
  ( Cust  INTEGER
   ,Name  CHAR(10)
   ,Phone CHAR(8) )
PRIMARY INDEX (Phone);
       
CUSTOMER table
Cust   Name     Phone
(PK)            (NUPI)
37     White    555-4444
98     Brown    333-9999
74     Smith    555-6666
95     Peters   555-7777
27     Jones    222-8888
56     Smith    555-7777
45     Adams    444-6666
84     Rice     666-5555
49     Smith    111-6666
51     Marsh    888-2222
31     Adams    111-2222
62     Black    444-5555
12     Young    777-4444
77     Jones    777-6666
72     Adams    666-7777
40     Smith    222-3333

Single AMP access with 0 to n rows returned.


Highly Non-Unique NUPI Row Distribution

      
The example below uses Order_Status as a NUPI. Order_Status is a poor choice, because it yields the most uneven distribution. Because there are only two possible values for Order_Status, all rows are placed on two AMPs.

This is an example of a highly non-unique Primary Index.

When choosing a Primary Index, never pick a column with a severely-limited value set. The degree of uniqueness is critical to efficiency. Choose NUPIs that allow all AMPs to participate fairly equally.

Orders table

Order_Number   Customer_Number   Order_Date   Order_Status
(PK)                                          (NUPI)
7325           2                 4/13         O
7324           3                 4/13         O
7415           1                 4/13         C
7103           1                 4/10         O
7225           2                 4/15         C
7384           1                 4/12         C
7402           3                 4/16         C
7188           1                 4/13         C
7202           2                 4/09         C
  • Order_Status values are highly non-unique (NUPI).
  • Only two values exist. The rows will be distributed to two AMPs.
  • This table will not perform well in parallel operations.
  • Highly non-unique columns are poor PI choices.
  • The degree of uniqueness is critical to efficiency.



Partitioned Primary Index (PPI)

      
The Teradata database has a partitioning mechanism called Partitioned Primary Index (PPI). PPI is used to improve performance for large tables when you submit queries that specify a range constraint. PPI allows you to reduce the number of rows to be processed by using a technique called partition elimination. PPI will increase performance for incremental data loads, deletes, and data access when working with large tables with range constraints. PPI is useful for instantaneous dropping of old data and rapid addition of new data. Using PPI lets you avoid having to do a full-table scan without the overhead of a Secondary Index.

How Does PPI Work?

Data distribution with PPI is still based on the Primary Index:

  • Primary Index >> Hash Value >> Determines which AMP gets the row

With PPI, the ORDER in which the rows are stored on the AMP is affected. Using the traditional method, a Non-Partitioned Primary Index (NPPI), the rows are stored in row hash order.

Using PPI, the rows are stored first by partition and then in row hash/row ID order. In our example, there are four partitions. Within each partition, the rows are stored in row hash order.

Data Storage Using PPI

To store rows using PPI, specify Partitioning in the CREATE TABLE statement. The query will run through the hashing algorithm as normal, and come out with the Base Table ID, the Partition number(s), the Row Hash, and the Primary Index values.

Partition Elimination is used to access tables with a PPI. The term "partition elimination" refers to an automatic optimization in which the optimizer determines, based on query conditions, that some partitions can't contain qualifying rows, and causes those partitions to be skipped. Partitions that are skipped for a particular query are called excluded partitions. Generally, the greatest benefit of a PPI table is obtained from partition elimination.

If the partitioning expression references any columns that aren't part of the primary index, then PI access may be slower. When all partitioning expression columns are contained in the PI, then PI access is unchanged.

The Orders table defined with a Non-Partitioned Primary Index (NPPI) on Order_Number (O_#).

Partitioned Primary Indexes:

  • Improve performance on range constraint queries
  • Use partition elimination to reduce the number of rows accessed
  • May use a CASE_N or RANGE_N partitioning expression
The Orders table defined with a Primary Index on Order_Number (O_#) Partitioned By Order_Date (O_Date) (PPI).
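
A hedged sketch of such a definition, partitioning the Orders table by month of Order_Date (the date range and interval are illustrative):

CREATE TABLE Orders
 (Order_Number    INTEGER NOT NULL
 ,Customer_Number INTEGER
 ,Order_Date      DATE NOT NULL
 ,Order_Status    CHAR(1))
PRIMARY INDEX (Order_Number)
PARTITION BY RANGE_N (Order_Date BETWEEN DATE '2012-01-01'
                                     AND DATE '2012-12-31'
                                     EACH INTERVAL '1' MONTH);  -- one partition per month

A query constrained on Order_Date can then read only the partitions for the requested months (partition elimination).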


Multilevel Partitioned Primary Index (MLPPI)

      
A PPI can be sub-partitioned, increasing the opportunities for partition elimination at the various additional levels (or combinations of levels). With a multilevel PPI, you create multiple access paths to the rows in the base table that the Optimizer can choose from. This improves response to business questions by improving the performance of queries which take advantage of partition elimination.

  • Allows multiple partitioning expressions on a table or a non-compressed join index - including base tables, Global Temp tables, and Volatile tables.
  • Multilevel partitioning allows each partition at a level to be sub-partitioned. Each partitioning level is defined independently using a RANGE_N or CASE_N expression.
  • A multilevel PPI allows efficient searches by using partition elimination at the various levels or combination of levels.
  • Allows flexibility on which partitioning expression to use when there are multiple choices of partitioning expressions.
For example, an insurance claims table could be partitioned by claim date and then sub-partitioned by state. The analysis performed for a specific state (such as Connecticut) within a date range that is a small percentage of the many years of claims history in the data warehouse (such as June 2012) would take advantage of partition elimination for faster performance.

Note: an MLPPI table must have at least two partition levels defined.

Advantages of partitioned tables:

  • They reduce the I/O for range constraint queries.
  • They take advantage of dynamic partition elimination.
  • They provide multiple access paths to the data, and an MLPPI provides even more partition elimination and more partitioning expression choices (i.e., you can use last name or some other value that is more readily available to query on).
  • The Primary Index may be either a UPI or a NUPI; a NUPI allows local joins to other similar entities.
  • Row hash locks are used for SELECT with equality conditions on the PI columns.
  • Partitioned tables allow for fast deletes of data in a partition.
  • They allow for range queries without a secondary index.
  • Specific partitions may be archived or deleted.
  • May be created on volatile tables, global temp tables, base tables, and non-compressed join indexes.
  • May replace a Value Ordered NUSI for access.

Disadvantages of partitioned tables:

  • Rows in a partitioned table are eight bytes longer
  • Access via the Primary Index may take longer
  • Full table joins to an NPPI table with the same PI may take longer

Syntax:
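
The original syntax illustration is not reproduced here; the following is a hedged sketch of a two-level definition for the insurance claims example above (column names, ranges, and states are illustrative):

CREATE TABLE Claims_History
 (Claim_Id     INTEGER NOT NULL
 ,Claim_Date   DATE    NOT NULL
 ,State_Code   CHAR(2) NOT NULL
 ,Claim_Amount DECIMAL(10,2))
PRIMARY INDEX (Claim_Id)
PARTITION BY (RANGE_N (Claim_Date BETWEEN DATE '2008-01-01'
                                      AND DATE '2012-12-31'
                                      EACH INTERVAL '1' MONTH)   -- level 1: claim month
             ,CASE_N (State_Code = 'CT'
                     ,State_Code = 'NY'
                     ,NO CASE, UNKNOWN));                        -- level 2: state sub-partitions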

What is a NoPI Table?

      
A NoPI Table is simply a table without a primary index.

Prior to Teradata database 13.0, Teradata tables required a primary index. The primary index was used to hash and distribute rows to the AMPs according to hash ownership. The objective was to divide data as evenly as possible among the AMPs to make use of Teradata's parallel processing. Each row stored in a table has a RowID which includes the row hash that is generated by hashing the primary index value. For example, the optimizer can choose an efficient single-AMP execution plan for SQL requests that specify values for the columns of the primary index.

Starting with Teradata database 13.0, a table can be defined without a primary index. This feature is referred to as the NoPI Table feature. NoPI stands for No Primary Index.

Without a PI, the hash value as well as AMP ownership of a row is arbitrary. Within the AMP, there are no row-ordering constraints and therefore rows can be appended to the end of the table as if it were a spool table. Each row in a NoPI table has a hash bucket value that is internally generated. A NoPI table is internally treated as a hashed table; it is just that typically all the rows on one AMP will have the same hash bucket value.
With a NoPI table:

  • INSERTS/UPDATES to the table are always appended at the end of the table
  • The table must always be defined as MULTISET
  • There are no duplicate row checks performed upon load

If you use FastLoad to load rows into a NoPI table, you cannot have secondary or join indexes, CHECK constraints, or referential integrity constraints on the table, as FastLoad cannot load a table defined with any of these features. You can, however, use FastLoad to load rows into un-partitioned NoPI tables with row-level security defined.

When FastLoading to a NoPI table, the rows are distributed a block at a time.

Example of creating a NoPI table:

CREATE TABLE table_x
 	(col_x INTEGER
 	,col_y CHAR(10)
 	,col_z DATE)
NO PRIMARY INDEX;


Teradata Columnar

      
Teradata Columnar, or Column Partitioning (CP), is a physical database design implementation option (starting with Teradata 14.0) that allows single columns or sets of columns of a NoPI table to be stored in separate partitions. Column partitioning can also be applied to join indexes.

Columnar is simply a new approach for organizing the data of a user-defined table or join index on disk.

Columnar offers the ability to partition a table or join index by column. It can be used alone or in combination with row partitioning in multilevel partitioning definitions. Column partitions may be stored using traditional 'ROW' storage or alternatively stored using the new 'COLUMN' storage option. In either case, columnar can automatically compress physical rows where appropriate.

The key benefit in defining row-partitioned (PPI) tables is when queries access a subset of rows based on constraints on one or more partitioning columns. The major advantage of using column partitioning is to improve the performance of queries that access a subset of the columns from a table, either for predicates (e.g., WHERE clause) or projections (i.e., SELECTed columns).

Because sets of one or more columns can be stored in separate column partitions, only the column partitions that contain the columns needed by the query need to be accessed. Just as row-partitioning can eliminate rows that need not be read, column partitioning eliminates columns that are not needed.

The advantages of both can be combined, meaning even less data moved and thus reduced I/O. Fewer data blocks need to be read since more data of interest is packed together into fewer data blocks.

Columnar makes more sense in CPU-rich environments because CPU cycles are needed to "glue" columns back together into rows, for compression and for different join strategies (mainly hash joins).

Example of creating a Columnar table:

CREATE TABLE table_y
	(col_a CHAR(25) NOT NULL
	,col_b CHAR(25) NOT NULL
	,col_c DATE NOT NULL
	,col_d CHAR(7) NOT NULL
	,col_e INTEGER)
PARTITION BY COLUMN;
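
Column partitioning can also be combined with row partitioning in one definition; a hedged sketch (the table, columns, and date range are illustrative):

CREATE TABLE table_z
 (col_a CHAR(25) NOT NULL
 ,col_b CHAR(25) NOT NULL
 ,col_c DATE     NOT NULL
 ,col_d INTEGER)
NO PRIMARY INDEX
PARTITION BY (COLUMN                                            -- column partitions
             ,RANGE_N (col_c BETWEEN DATE '2014-01-01'
                                 AND DATE '2014-12-31'
                                 EACH INTERVAL '1' MONTH));     -- plus row partitions by month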


Teradata Columnar Benefits

      
Benefits of using the Teradata Columnar feature include:
  • Improved query performance: Column partitioning can be used to improve query performance via column partition elimination. Column partition elimination reduces the need to access all the data in a row while row partition elimination reduces the need to access all the rows.

  • Reduced disk space: The feature also allows for the possibility of using a new auto-compression capability which allows data to be automatically (as applicable) compressed as physical rows are inserted into a column-partitioned table or join index.

  • Increased flexibility: Provides a new physical database design option to improve performance for suitable classes of workloads.

  • Reduced I/O: Allows fast and efficient access to selected data from column partitions, thus reducing query I/O.

  • Ease of use: Provides simple default syntax for the CREATE TABLE and CREATE JOIN INDEX statements. No change is needed to queries.

Columnar Join Indexes

A join index can also be created as column-partitioned for either a columnar table or a non-columnar table. Conversely, a join index can be created as non-columnar for either type of table as well.

Sometimes within a mixed workload, some queries perform better when the data is not column partitioned and others perform better when it is. Or perhaps some queries perform better with one type of partitioning on a table, whereas other queries do better with another type of partitioning. Join indexes allow creation of alternate physical layouts for the data, with the Optimizer automatically choosing whether to access the base table and/or one of its join indexes.

A column-partitioned join index must be a single-table, non-aggregate, non-compressed, join index with no primary index, and no value-ordering, and must include RowID of the base table. A column-partitioned join index may optionally be row partitioned. It may also be a sparse join index.

Secondary Indexes

      
A secondary index is an alternate path to the data. Secondary Indexes are used to improve performance by allowing the user to avoid scanning the entire table. A Secondary Index is like a Primary Index in that it allows the user to locate rows. It is unlike a Primary Index in that it has no influence on the way rows are distributed among AMPs. A database designer typically chooses a secondary index because it provides faster set selection.

There are three general ways to access a table:

  • Primary Index access (one AMP access)
  • Secondary Index access (two or all AMP access)
  • Full Table Scan (all AMP access)

Primary Index requests require the services of only one AMP to access rows, while secondary indexes require at least two and possibly all AMPs, depending on the index and the type of operation. A secondary index search will typically be less expensive than a full table scan.

Secondary indexes add overhead to the table, both in terms of disk space and maintenance; however they may be dropped when not needed, and recreated whenever they would be helpful.

  • A secondary index provides an alternate path to the rows of a table.
  • A secondary index can be used to maintain uniqueness within a column or set of columns.
  • A table can have from 0 to 32 secondary indexes (each can have 1 to 64 columns).
  • Secondary Indexes:
    • Do not affect table distribution.
    • Add overhead, both in terms of disk space (sub-tables) and maintenance.
    • May be added or dropped dynamically as needed.
    • Are only chosen to improve query performance.

Defining a Secondary Index

      
Just as with primary indexes, there are two types of secondary indexes – unique (USI) and nonunique (NUSI). Secondary Indexes may be specified at table creation or at any time during the life of the table. A secondary index may consist of up to 64 columns; however, to get the benefit of the index, the query must specify a value for every column in the index.

Unique Secondary Indexes (USI) have two possible purposes. They can speed up access to a row which otherwise might require a full table scan without having to rely on the primary index. Additionally, they can be used to enforce uniqueness on a column or set of columns. This is the case when a PRIMARY KEY or UNIQUE constraint is defined and the table has a specified Primary Index or No Primary Index option. Making it a USI has the effect of enforcing the uniqueness of a PRIMARY KEY or UNIQUE constraint.

Non-Unique Secondary Indexes (NUSI) are usually specified in order to prevent full table scans. However, a NUSI does activate all AMPs – after all, the value being sought might well exist on many different AMPs (only Primary Indexes have same values on same AMPs). If the optimizer decides that the cost of using the secondary index is greater than a full table scan would be, it opts for the table scan.

All secondary indexes cause an AMP subtable to be built and maintained as rows are inserted, updated, or deleted. Secondary index subtables consist of rows which associate the secondary index value with one or more rows in the base table. When the index is dropped, the subtable is physically removed.

A Secondary Index may be defined ...

  • at table creation (CREATE TABLE) (see the sketch after the examples below)
  • following table creation (CREATE INDEX)
  • with up to 64 columns
If the index choice of column(s) is unique, it is called a USI.
Unique Secondary Index
Accessing a row via a USI is a 2 AMP operation.

CREATE UNIQUE INDEX
(EmpNumber) ON Employee;

If the index choice of column(s) is nonunique, it is called a NUSI.
Non-Unique Secondary Index
Accessing row(s) via a NUSI is an all AMP operation.

CREATE INDEX
(LastName) ON Employee;
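
Secondary indexes can also be declared when the table is created rather than afterwards. A minimal sketch, assuming a hypothetical Employee-style table (names illustrative):

CREATE TABLE Employee_2
	(EmpNumber INTEGER NOT NULL
	,LastName CHAR(25)
	,DeptNumber INTEGER)
PRIMARY INDEX (DeptNumber)
UNIQUE INDEX (EmpNumber)	-- USI defined at table creation
INDEX (LastName);		-- NUSI defined at table creation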

Notes:

  • Creating a Secondary Index causes an internal sub-table to be built.
  • Dropping a Secondary Index causes the sub-table to be deleted.



Other Types of Secondary Indexes

      
There are a few other Secondary Indexes available with Teradata:

Sparse Index

Any join index, whether simple or aggregate, multi-table or single-table, can be sparse. A sparse join index uses a constant expression in the WHERE clause of its definition to narrowly filter its row population. This is known as a Sparse Join Index.
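
As a hedged sketch of a sparse join index (the table, index, and column names are hypothetical), the constant condition in the WHERE clause limits which rows are stored in the index:

CREATE JOIN INDEX Sales_2024_JI AS
	SELECT store_id, sales_date, amount
	FROM Sales
	WHERE sales_date >= DATE '2024-01-01';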

Hash Index

Hash indexes are used for the same purpose as single-table join indexes. Hash indexes create a full or partial replication of a base table with a primary index on a foreign key column to facilitate joins of very large tables by hashing them to the same AMP.

You can define a hash index on a single table only. Hash indexes are not indexes in the usual sense of the word. They are base tables that cannot be accessed directly by a query.

Value-Ordered NUSI

Value-ordered NUSIs are efficient for range conditions and conditions with an inequality on the secondary index column set. Because the NUSI rows are sorted by data value, it is possible to search only a portion of the index subtable for a given range of key values. Thus the major advantage of a value-ordered NUSI is in the performance of range queries.

Value-ordered NUSIs have the following limitations.

  • The sort key is limited to a single numeric column.
  • The sort key column cannot exceed four bytes.
  • They count as two indexes against the total of 32 non-primary indexes you can define on a base or join index table.
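
A hedged sketch of a value-ordered NUSI follows (table and column names are hypothetical); because the subtable is sorted by value, a range predicate on the sort column can probe only the relevant portion of the index:

CREATE INDEX (sales_date) ORDER BY VALUES (sales_date) ON Daily_Sales;

SELECT *
FROM Daily_Sales
WHERE sales_date BETWEEN DATE '2024-06-01' AND DATE '2024-06-30';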


Full-Table Scans

      
A full table scan is another way to access data without using any Primary or Secondary Indexes.

In evaluating an SQL request, the Parser examines all possible access methods and chooses the one it believes to be the most efficient. The coding of the SQL request, along with the demographics of the table and the availability of indexes, all play a role in the decision of the Parser. Some coding constructs, such as those in the examples below, always cause a full table scan. In other cases, it might be chosen because it is the most efficient method. In general, if the number of physical reads exceeds the number of data blocks, then the optimizer may decide that a full-table scan is faster.

With a full table scan, each data block is found using the Master and Cylinder Indexes and each data row is accessed only once.

As long as the choice of Primary Index has caused the table rows to distribute evenly across all of the AMPs, the parallel processing of the AMPs can do the full table scan quickly. The file system keeps each table on as few cylinders as practical to help reduce the cost of full table scans.

While full table scans are impractical and even disallowed on some systems, the Vantage Advanced SQL Engine routinely executes ad hoc queries with full table scans.

Every row of the table must be read. All AMPs scan their portion of the table in parallel.

  • Fast and efficient on Teradata due to parallelism.
Full table scans typically occur when either:
  • An index is not used in the query
  • An index is used in a non-equality test
Example table: CUSTOMER with columns CustID (USI), CustName, and CustPhone (NUPI).

Examples of Full Table Scans:

SELECT * FROM Customer WHERE CustPhone LIKE '858-485-_ _ _ _ ';

SELECT * FROM Customer WHERE CustName = 'Lund';

SELECT * FROM Customer WHERE CustID > 1000;

Locks

      
Locking serializes access and prevents multiple users who are trying to update the same data at the same time from violating data integrity. This concurrency control is implemented by locking the desired data. Locks are automatically acquired during the processing of a request and released at the termination of the request. In addition, users can specify locks. There are four types of locks:
  1. Exclusive Locks - prevents any other type of concurrent access. Exclusive locks are applied only to databases or tables, never to rows. They are the most restrictive types of lock; all other users are locked out. Exclusive locks are typically used only when structural changes are being made to the database or table, such as changing the block size on a table, as in the case of a DDL statement.

  2. Write Locks - prevents other Read, Write, Exclusive locks. Write locks enable users to modify data while locking out all other users except readers not concerned about data consistency (access lock readers). Until a write lock is released, no new read or write locks are allowed. When you update a table without a WHERE clause, the system places a write lock on the table.

  3. Read Locks - prevents Write and Exclusive locks. Read locks are used to ensure consistency during read operations. Several users may hold concurrent read locks on the same data, during which no modification of the data is permitted.

  4. Access Locks - prevents Exclusive locks only. Users who are not concerned about data consistency can specify access locks. Using an access lock allows for reading data while modifications are in process. Access locks are designed for decision support on large tables that are updated only by small, single-row changes. Access locks are sometimes called stale read or dirty read locks. You may get stale data that hasn't been updated, or has been overwritten.

Lock Levels

Three levels of database locking are provided:

  1. Database - applies to all tables/views in the database. Locks all objects in the database.
  2. Table/View - applies to all rows in the table/view. Locks all rows in the table or view.
  3. Row Hash - applies to all rows with same row hash. Locks all rows with the same row hash (primary and Fallback rows, and Secondary Index subtable rows).

The type and level of locks are automatically chosen based on the type of SQL command issued. The user has, in some cases, the ability to upgrade or downgrade the lock.

  • SELECT - requests a Read lock
  • UPDATE - requests a Write lock
  • CREATE TABLE - requests an Exclusive lock
Lock requests may be upgraded or downgraded:

LOCKING TABLE Table1 FOR ACCESS . . .
LOCKING TABLE Table1 FOR EXCLUSIVE . . .
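
As a hedged example of downgrading the default READ lock to an ACCESS lock for a dirty read (table and column names are hypothetical):

LOCKING TABLE Sales FOR ACCESS
SELECT store_id, SUM(amount) AS total_amount
FROM Sales
GROUP BY 1;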


Data Protection Features

      
Facilities that provide system-level protection:
  • Disk Arrays
    • RAID data protection (e.g., RAID 1)
    • Redundant data buses and array controllers

  • Cliques and Vproc Migration
    • SMP or Linux failures; Vprocs can migrate to other nodes (or HSN) within the clique.
Facilities that provide Teradata database protection
  • Fallback - provides data access with a “down” AMP
  • Locks - provide data integrity
  • Transient Journal - automatic rollback of aborted transactions
  • Down AMP Recovery Journal - fast recovery of fallback rows for AMPs
  • Permanent Journal - optional before and after-image journaling
  • Unity - integrated portfolio that turns a multi-system environment into an orchestrated ecosystem
  • ARC - Archive/Restore facility
  • NetBackup and Tivoli - provide backup storage management, ARC script creation, and scheduling capabilities
Disk Arrays – Disk arrays provide RAID 1 or RAID 6 protection. If a disk drive fails, the array subsystem provides continuous access to the data.

Clique – a set of Teradata nodes that share a common set of disk arrays. In the event of node failure, all vprocs can migrate to another available node in the clique. All nodes in the clique must have access to the same disk arrays.

Locks – Locking prevents multiple users who are trying to change the same data at the same time from violating the data's integrity. This concurrency control is implemented by locking the desired data. Locks are automatically acquired and released during the processing of a request.

Fallback – protects your data by storing a second copy of each row of a table on an alternative “fallback AMP” and provides protection in the event of an AMP failure. Fallback provides AMP fault tolerance at the table level.

Down-AMP Recovery Journal – started automatically when the system has a failed or down AMP. Its purpose is to log any changes to rows which reside on the down AMP.

Transient Journal – the TJ is part of the WAL log and exists to permit the successful rollback of a failed transaction.

ARC/DSA and NetBackup/Tivoli – ARC and DSA command scripts provide the capability to backup and restore the Teradata database.

Transient Journal

      
The transient journal (or Online Transaction Journal) permits the successful rollback of a failed transaction (TXN). Transactions are not committed to the database until the AMPs have received an End Transaction request, either implicitly or explicitly. There is always the possibility that the transaction may fail. If so, data is returned to its original state after transaction failure.

The transient journal automatically maintains a copy on each AMP of before images of all rows affected by the transaction. In the event of a transaction failure, the before images are reapplied to the affected table, and a rollback operation is completed. In the event of transaction success, the before images for the transaction are discarded from the journal at the point of transaction commit.

Journal entries are made on and by the AMP making the change. Transient Journal activities are automatic and transparent to the user.

ALTER TABLE Misconception

An often repeated misconception is that the when an ALTER TABLE command is executed, the “altered data” is written to the transient journal/WAL (Write-Ahead Logging or Write-Ahead Logic) log.

For example, assume an ALTER TABLE ADD COLUMN is executed. Assuming the altered table is a large table, this operation may be time consuming. Why?

The reason that ALTER TABLE ADD COLUMN may be time consuming is that each row is rebuilt, each block is rebuilt, and the cylinder is rebuilt (re-lays out each cylinder). It does so one cylinder at a time in order to avoid having to make a full copy of the table. A new cylinder is allocated; each datablock is read; the records are reformatted and written to a new datablock which is placed on the new cylinder. When all the blocks are complete, the new cylinder is committed to disk. Then the old cylinder is released and the process is repeated until all cylinders are completed. This work reformats the rows and blocks in the same way as an INSERT-SELECT.

The only thing written to the WAL log is a single record per cylinder as a cylinder is committed. This is used in case the ALTER TABLE is interrupted – e.g., by a Teradata restart. The last cylinder log record is used to restart the process from the last completed cylinder and complete it. This is also why an ALTER TABLE cannot be aborted once it has been started.

Fallback

      
Fallback protects your data by storing a second copy of each row of a table on an alternate, Fallback AMP in the same cluster. If an AMP fails, the system accesses the Fallback rows to meet requests. Fallback provides AMP fault tolerance at the table level. With Fallback tables, if one AMP fails, all table data is still available. Users may continue to use Fallback tables without any loss of available data.

During table creation or after a table is created, you may specify whether or not the system should keep a Fallback copy of the table. If Fallback is specified, it is automatic and transparent.
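
A minimal sketch of specifying Fallback, using a hypothetical table (names illustrative); the protection can also be changed after creation:

CREATE TABLE Employee_FB, FALLBACK
	(EmpNumber INTEGER NOT NULL
	,LastName CHAR(25))
UNIQUE PRIMARY INDEX (EmpNumber);

ALTER TABLE Employee_FB, NO FALLBACK;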

Fallback guarantees that the two copies of a row will always be on different AMPs in the same cluster. A cluster is a group of AMPs that act as a single Fallback unit. Clustering has no effect on primary row distribution of the table, but the Fallback row will always go to another AMP in the same cluster. If either AMP fails, the alternate row copy is still available on the other AMP.

There is a benefit to protecting your data, but there are costs associated with that benefit. With Fallback use, you need twice the disk space for storage and twice the I/O for INSERTs, UPDATEs, and DELETEs. (The Fallback option does not require any extra I/O for SELECT operations and the Fallback I/O will be performed in parallel with the primary I/O.)

The benefits of Fallback include:

  • Protects your data from an AMP failure. An AMP failure may occur because of physical disk failures - e.g., both disk drives in a RAID 1 drive group have failed or three drives in the same RAID 6 drive group. An AMP may also fail because of an AMP software failure or because access to the disk array has failed.
  • Protects your data from software (node) failure.
  • Automatically recovers with minimum recovery time, after repairs or fixes are complete.
  • May be specified at the table or database level
  • Permits access to table data during AMP off-line period
  • Adds a level of data protection beyond disk array RAID 1 & 5
  • Highest level of data protection is RAID 1 and Fallback
  • Critical for high availability applications

Loss of any two AMPs in a cluster causes RDBMS to halt!

Down AMP Recovery Journal

      
After the loss of any AMP, a down-AMP recovery journal is started automatically to log any changes to rows which reside on the down AMP. Any inserts, updates, or deletes affecting rows on the down AMP are applied to the Fallback copy within the cluster. The AMP that holds the Fallback copy logs the Row-ID in its recovery journal.

This process will continue until the down AMP is brought back on-line. As part of the restart activity, the recovery journal is read and changed rows are applied to the recovered AMP. When the journal has been exhausted, it is discarded and the AMP is brought on-line, fully recovered.

Recovery Journal is:

  • Automatically activated when an AMP is taken off-line
  • Maintained by other AMPs in the cluster
  • Totally transparent to users of the system
While AMP is off-line:
  • Journal is active
  • Table updates continue as normal
  • Journal logs Row-IDs of changed rows for down-AMP
When AMP is back on-line:
  • Restores rows on recovered AMP to current status
  • Journal discarded when recovery complete

For Fallback Tables

RAID Protection

      
There are many forms of disk array protection. The Teradata database supports the following protection schemes:

  • RAID 1
  • RAID 5
  • RAID S

RAID 1 (Mirroring) uses a disk-mirroring technique. Each physical disk is mirrored elsewhere in the array. This requires the array controllers to write all data to two separate locations, which means data can be read from two locations as well. In the event of a disk failure, the mirror disk becomes the primary disk to the array controller and performance is unchanged.

RAID 5 (Parity) uses a parity-checking technique. For every three blocks of data (spread over three disks), there is a fourth block on a fourth disk that contains parity information. This allows any one of the four blocks to be reconstructed by using the information on the other three. If two of the disks fail, the rank becomes unavailable. A rank is defined as a group of disks that have the same SCSI ID (the disks in the rank are not on the same channel, but are on different channels). The array controller does the recalculation of the information for the missing block. Recalculation will have some impact on performance, but at a much lower cost in terms of disk space.

RAID S uses a parity-checking technique similar to RAID 5 and is used with EMC disk arrays.

Teradata systems use disk arrays that employ redundant controllers (DRDAC) and redundant disk drives. When a drive fails, the array switches to the mirror drive or uses parity to reconstruct the data.

Note: With Teradata databases, RAID 1 and RAID 5 are commonly used. Unless there is a reason not to do so, Teradata recommends that all our customers implement RAID 1.

Summary:

  • RAID-1 - Good performance with disk failures; higher cost in terms of disk space
  • RAID-5 - Reduced performance with disk failures; lower cost in terms of disk space

Hot Standby Nodes (HSN)

      
A Hot Standby Node (HSN) is a node which is a member of a clique that is not configured (initially) to execute any Teradata vprocs. If a node in the clique fails, Teradata restarts and the AMPs from the failed node move to the hot standby node. The performance degradation is 0%.

When the failed node is recovered/repaired and restarted, it becomes the new hot standby node. A second restart of Teradata is not needed.

Characteristics of a hot standby node are:

  • A node that is a member of a clique.
  • Does not normally participate in the trusted parallel application (TPA).
  • Will be brought into the configuration when a node fails in the clique.
  • Helps with unplanned outages.
  • Helps with planned outages
  • Eliminates the need for a restart to bring a failed node back into service.

Hot standby nodes are positioned as a performance continuity feature.


This example illustrates vproc migration using a Hot Standby Node.

  1. Performance Degradation is 0% as AMPs are moved to the Hot Standby Node.
  2. When Node 1 is recovered, it becomes the new Hot Standby Node.

Teradata Tools and Utilities Overview

      
Teradata has a robust suite of client utilities that enable users and system administrators to enjoy optimal response time and system manageability. Various client utilities are available for tasks from loading data to managing the system.

Teradata utilities leverage the Teradata database's high performance capabilities and are fully parallel and scalable. The same utilities run on smaller entry-level systems, and the largest MPP implementations.

Teradata database client utilities include the following, described in this section:

  • Query Submitting Utilities - Query tools that submit SQL queries
    • BTEQ
    • Teradata Studio Express

  • Load and Unload Utilities
    • Teradata Parallel Transporter (TPT)

  • Administrative Utilities - for managing workloads
    • Teradata Studio
    • Priority Scheduler
    • Database Query Log (DBQL)
    • Teradata Workload Analyzer
    • Teradata Integrated Workload Management (TIWM)
    • Teradata Active Systems Management (TASM)

  • Tools that support a Multi-System Environment
    • Viewpoint
    • Unity
    • Teradata Multi-Systems Manager (TMSM)
    • Data Mover

BTEQ

      
BTEQ stands for Basic Teradata Query utility. It is a SQL front-end tool that runs on all client platforms. It supports both interactive ad hoc and scripted batch queries and provides standard report writing and formatting functionality. It also provides a powerful import/export capability even though there are other utilities available for these functions as well, particularly when larger tables are involved. When a user logs on with BTEQ, they can set the number of sessions they need before logging on. Using multiple sessions may improve performance for certain operations. This feature exploits the parallelism of the Teradata database.

BTEQ uses the CLI application programming interface.
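
A minimal sketch of a BTEQ session follows; the logon string, file name, table name, and session count are placeholders, not working values:

.SET SESSIONS 4
.LOGON tdpid/dbuser,dbpassword

SELECT COUNT(*) FROM Employee;

.EXPORT REPORT FILE = emp_report.txt
SELECT LastName, DeptNumber FROM Employee ORDER BY 1;
.EXPORT RESET

.LOGOFF
.QUIT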

Key Points about BTEQ

  • Basic Teradata Query utility
  • SQL front-end
  • Report writing and formatting features
  • Interactive and scripted batch queries
  • Import/Export across all platforms
  • Supports ad-hoc querying
  • Supports table updates with import and export
  • Runs on all platforms

BTEQ Uses the CLI API


Teradata Parallel Transporter

      
Teradata Parallel Transporter (TPT) is the replacement for Teradata Warehouse Builder (TWB). This utility is effectively an object-oriented software system that executes multiple instances of data extraction, transformation, and load functions in a scalable, high-speed parallel processing environment.

TPT is scalable and enables end-to-end parallelism. The previous versions of utilities (like FastLoad) allow you to load data into Teradata in parallel, but you still have a single input stream. Teradata Parallel Transporter allows you to run multiple instances of the extract, optional transformation, and load. You can have as many loads as you have sources in the same job. With multiple sources of data coming from multiple platforms, integration is important in a parallel environment.

TPT eliminates the need for persistent storage. It stores data into data buffers so you no longer need to write data into a flat file.

TPT provides a single scripting language. You can do the extract, some transformation, and loads all in one SQL-like scripting language.

Once the dynamics of the language are learned, you can perform multiple tasks with a single script.

TPT supports FastLoad INMODs, FastExport OUTMODs, and Access Modules to provide access to all the data sources you use today. TPT also provides a Direct API interface that can be used by Third Party partners. They can effectively write 'C' code to directly load/unload data to/from Teradata.

It is effectively a parallel load and export utility. Characteristics of this utility include:

    High performance
    • Parallel Export and Load operators eliminate sequential bottlenecks.
    • Data Streams eliminate the overhead of intermediate (persistent) storage.
    • Scalable
    • End-to-end parallelism

    Easy to use

    • Single scripting language
    • Access to various data sources

    Extensible

    • Direct API enables Third Party partners to write 'C' code to directly load/unload data to/from Teradata.

TPT Operators

      
The operators are components that "plug" into the TPT infrastructure and actually perform the functions.

  • The FastLoad and FastExport operators support the current FastLoad and FastExport INMOD/OUTMOD features.
  • The DataConnector operator is an adapter for the Access Module or non-Teradata files.
  • The SQL Select and Insert operators submit the Teradata SELECT and INSERT commands.
  • The Load, Update, Export and Stream operators are similar to the FastLoad, MultiLoad, FastExport, and TPump utilities, but built for the TPT parallel environment.

The INMOD and OUTMOD adapters, DataConnector operator, and the SQL Select/Insert operators are included when you purchase the Infrastructure. The Load, Update, Export, and Stream operators are purchased separately.

Here is a comparison of the Teradata Parallel Transporter Operators with the classic utilities that we just covered.

TPT Operator (Teradata Utility) - Description

  • LOAD (FastLoad) - A consumer-type operator that uses the Teradata FastLoad protocol to load empty tables. Supports error limits, checkpoint/restart, multi-value compression, and PPI.
  • UPDATE (MultiLoad) - Uses the Teradata MultiLoad protocol to update tables. This allows highly scalable and parallel inserts and updates to a populated table.
  • EXPORT (FastExport) - A producer operator that emulates FastExport.
  • STREAM (TPump) - Uses multiple sessions to perform DML transactions in near real-time.
  • DataConnector (N/A) - Emulates the DataConnector API. Reads external data files, writes data to external data files, and reads an unspecified number of data files.
  • ODBC (N/A) - Reads data from an ODBC provider.
  • SQL INSERT (BTEQ) - A consumer operator that inserts data into a table, similar to a BTEQ import operation.
  • SQL SELECT (BTEQ) - A producer operator that submits SQL SELECT statements to retrieve data from a table.


TPT Architecture

      
There are two types of parallelism used by TPT: pipeline parallelism and data parallelism.

Pipeline Parallelism

Pipeline parallelism (also called "functional parallelism") is achieved by connecting operator instances through data streams in a single job. A "producer" operator extracts data from a data source and writes it to the data stream, that is, it 'produces' a data stream. The "consumer" operator then 'consumes' the data from the data stream and writes the data to the target.

There are two operators involved here, producer and consumer. Both operators can operate concurrently and independently of each other thereby creating a pipeline effect without the need for intermediary files.

For example:

A "producer" operator extracts data from a data source and writes ("produces") it to a data stream. A "consumer" operator reads ("consumes") the data from the data stream and writes or loads data to the target. Both operators can operate concurrently and independently of each other. This creates a pipeline effect without the need for intermediary files.

TPT Parallelism

      
Imagine you have the same job as described above, but that it is handling a large quantity of data coming from several different input files. Each file can be handled by a different instance of the producer operator so that the files are read concurrently. The operator instances can then merge these files into a single data stream, which can then be input to multiple consumer operators.

Each source can be handled by a different instance of the producer operator; these instances merge the files into a single data stream.

The data stream can then be input to multiple consumer operators.


===============================================================

Teradata SQL for Applications Development


Teradata Database has the ability to insert a row using only the DEFAULT VALUES keywords. For this feature to work successfully, one of the following statements must be true for each column of the table:


the column has defined default values

the column has a default system value specified

the column permits nulls

If none of these statements is true, an insert using DEFAULT VALUES will fail. Note that such an insert may be executed multiple times as long as no uniqueness attributes of the table are violated.



The SQL command INSERT INTO tablename DEFAULT VALUES;


Will insert defined default values into each column.

Will insert a null if no default is defined.

Will fail if no default is defined and a null is not allowed.


Some options for NULL and DEFAULT attributes:


NOT NULL - Nulls are not permitted for this column

DEFAULT n - Unless explicitly specified, assign the column a value of n. n must be of the same data type as the column.

Example: DEFAULT 'Unknown'


DEFAULT DATE 'xxxx-xx-xx' - Unless explicitly specified, assign the date xxxx-xx-xx

Example: DEFAULT DATE '2099-12-31'


WITH DEFAULT - Assign the system default (e.g. spaces for char strings, zero for numeric data types and current date for date data type)

DEFAULT TIME - Assign the current time to the integer data type

DEFAULT USER - Assign the user id of the session to the character string
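
A hedged sketch of these attributes in use (the table and column names are hypothetical):

CREATE TABLE Contact
	(ContactId INTEGER DEFAULT 0
	,FullName CHAR(30) DEFAULT 'Unknown'
	,AddedDate DATE WITH DEFAULT	-- system default: the current date
	,Phone CHAR(12))		-- nullable, no default: NULL is inserted
PRIMARY INDEX (ContactId);

INSERT INTO Contact DEFAULT VALUES;	-- inserts (0, 'Unknown', current date, NULL)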







^ - A circumflex character indicates that the character (or characters) that follow the circumflex in the regular expression must occur at the beginning of the source string.
Example: '^E' requires that the source string begin with an upper case E.
$ - A dollar sign character indicates that the preceding character (or characters) must occur at the end of the source string.
Example ':$' requires that the source string ends with a colon.
() - Parentheses enclose a group of characters that is to be treated as a single regular expression entity.
Example: '(ABC)+' would match one or more successive occurrences of ABC at the corresponding position in a source string.
[] - Square brackets enclose a subset of characters.
Example: '[0-9]' would match any single digit at the corresponding position in a source string.
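
These metacharacters can be exercised with Teradata's regular-expression functions; a hedged sketch (table and column names are hypothetical):

SELECT LastName
FROM Employee
WHERE REGEXP_SIMILAR(LastName, '^E.*', 'c') = 1;	-- last names beginning with an upper case E

SELECT REGEXP_SUBSTR(CustPhone, '[0-9]+')
FROM Customer;						-- first run of digits in the phone column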



















How many partitions can you have?
The maximum number of partitions is the same as a row-partitioned table: 9,223,372,036,854,775,807. The maximum number of column partitions is limited by the limit on number of columns (2048).









Teradata Advanced SQL Engine 17.10
A database provides an organized mechanism for storing, managing, and retrieving data. Most modern databases are made up of tables.

Types of Databases

A relational database is structured to recognize connections between stored items of information. Some of the more common relational databases are Teradata and Oracle. The standard user and application program interface to a relational database is the structured query language (SQL).

If the goal is only to retrieve information, a NoSQL database will likely suffice. DynamoDB and MongoDB are two well-known examples of NoSQL databases today.


Relational Database?

If the database is relational, which most databases are, it can cross-reference records in different tables. This means that you can create relationships between tables.
Cardinality refers to the number of rows in a table; degree refers to the number of columns.

The term schema refers to organizing the data as a blueprint that shows how the database is divided into tables. The formal definition of a database schema is a set of formulas or sentences called integrity constraints imposed on a database.



A foreign key is a column or group of columns in a relational database table that defines the relationship between the data in two tables. Foreign keys and their implementation require planning.

data model:

Data Models define how elements of data are connected or related.
As a data model is refined, it passes through different states which can be referred to as normal forms. Another familiar term is third normal form (3NF), a set of rules and guidelines about how the data model should look. In practice, this simply means which columns should belong to which tables.
There are three main types of data models:
  • Relational
  • Dimensional
  • Logical and Physical

Relational Data Model

It is the most common type of data model. A relational data model should reflect the business rules. 

Dimensional Data Models

A dimensional model is specially designed for reporting, as it provides navigation paths and aligns with the work processes of business users. A dimensional model emphasizes usability because the tables are:

  • Defined in a way that a business uses the data
  • Organized in a way that a given user or group of users can easily understand the data

Its purpose is to allow efficient, user-friendly filtering, sorting and summing of data specific to the needs of one or more business area(s).
Every dimensional model is composed of one table with a multipart key, called a fact table, and a set of smaller tables called dimension tables. Each dimension table has a single-part primary key that corresponds exactly to one of the components of the multipart key in the fact table.
Normalization vs. Denormalization

Normalization is the process of reducing a complex database schema into a simple, stable one. In order to normalize, a user must be able to uniquely identify each record. Generally, this process involves removing redundant attributes, keys, and relationships from the conceptual data model. Normalized data is optimized for entity level transactions.

Dimension denormalization supports dimensional modeling's twin objectives of speed and simplicity. De-normalized data is optimized for answering business questions and driving decision making.

Dimensional Model

In a dimensional model, fact tables always represent M:M (many to many) relationships. According to the model, a fact table should contain one or more numerical measures (the "facts" of the fact table) that occur for the combination of keys that define each tuple in the table.

DBMS

A DBMS is the application and system software used for creating and managing databases. The DBMS provides users and programmers with a systematic way to create, retrieve, update and manage data. A DBMS creates reports, enforces database rules and constraints, and maintains the database schema. Without a DBMS, a database is just a collection of bits and bytes. The most common type of database management system is called a Relational Database Management System (RDBMS).

SQL?

Relational Database Management Systems use Structured Query Language (SQL) as a database query language. Just as humans communicate with the help of language, SQL is the language that a database understands. SQL is used to create, manage, manipulate, and query database objects such as tables.

Data Warehouse

A data warehouse is a specially constructed data repository where data is organized so that it can be easily accessed by end users for various applications. Data warehouses are common in corporations where enterprise-wide detailed data is analyzed to make strategic and tactical business decisions. This is because data warehouses often carry many years’ worth of detailed data so historical trends may be analyzed. Data warehouses may begin as small in scope and purpose, but often grow quite large if their utility becomes more fully exploited by the enterprise.

Data Marts

Data marts can be independent of, dependent on, or part of a data warehouse. Data warehouses have an enterprise-wide depth. The information in data marts pertains to a single department.

Independent data marts are isolated entities entirely separate from the enterprise data warehouse. The independent data mart data is derived from independent sources and should be viewed as data pirates in the context of the enterprise data warehouse. The independent inputs, which are entirely separate from the enterprise data warehouse, have a high likelihood of producing data that does not match that of the warehouse.

Dependent data marts are derived from the enterprise data warehouse. Depending on how a dependent data mart is configured, it might or might not be useful. The recommended process uses only data that is derived from the enterprise data warehouse data store.  This permits users to have full access to the enterprise data store when the need to investigate enterprise-wide issues arises.

The logical data mart is a form of dependent data mart that is constructed virtually from the physical data warehouse. Data is presented to users of the mart using a series of SQL views that make it appear that a physical data mart underlies the data available for analysis.


What is Analytic Processing?

A data warehouse was originally defined as a decision support system (DSS). As earlier discussed, data warehouses are computer-based information systems that support business or organizational decision-making activities. OLAP and OLTP are sets of operations that can be performed on a data set.
OLAP (On-Line Analytic Processing) operations comprise the complex analyses that are performed on a data set within a data warehouse. These operations include data mining, querying, pivoting, slicing, dicing, drilling, reporting and other decision-support applications. A data warehouse can be used without OLAP by just running reports.


OLTP (On-Line Transaction Processing) manages business applications and collects the business’s day-to-day data like order entry, financial transactions, or sales. This is different than a data warehouse. OLTP accesses a small number of rows (or records), or a few tables, in a matter of seconds.


Reasons to Deploy a Data Warehouse

For companies to be successful they must make good business decisions. Making good business decisions requires businesses to have the ability to analyze their related, relevant business data. A data warehouse is an indispensable tool for businesses. While business data can be useful, it means little if it cannot be stored and analyzed in a useful way. A key driver of analytics is strategic queries – business questions that are intended to provide a strategic advantage through forecasting. Businesses can also learn more about their customers, transactions, and profits – studying these items to find patterns which can assist them in avoiding costly mistakes, enhance business productivity, and manage customer relationships.


Data Warehouse Architecture - Layers

The reference information architecture is broken down into three major data layers:

  • Acquisition - As the first point of entry into the data warehouse, raw data is acquired from various source systems such as mainframes, servers, etc.
  • Integration - The integration layer is primarily responsible for integrating data from multiple systems, both normalized and potentially de-normalized. It also may create common metrics and summaries which are widely used within an organization.
  • Access - The access layer's primary responsibility is to provide easy access to the data using various analytic methods.

Types of implementation possibilities

Only a few of the multiple types of implementations of a data warehouse are listed below
A centralized data warehouse is useful for small and mid-size data warehouses. It is a single physical repository that can contain organizational data for a specific utility area, department, branch, division or the whole organization. Centralized data warehouses serve the needs of several separate business units within an organization at the same time using a single data model.
A federated data warehouse shares information among a number of different systems. Critical master files will be shared, and the other systems will be able to use this information. An example of this would be adding financial information to advertising data. This will allow an organization to better market its products to customers. If the data is federated, organizations can reduce response time.
A data mart exists within a single organizational data warehouse repository. Ideally, it is a condensed and more focused version of the data warehouse that stores data dedicated to a specific business function or department within the organization. This subset of data may span across many, or all, of the organization’s functional subject areas. It is common for multiple data marts to be used in order to serve the needs of each individual business unit such as accounting, marketing, sales, etc.

Deployments Options

-->The on-premises deployment option involves buying the software and hardware from a data warehouse company. Some examples of these companies would be Teradata, Oracle, or IBM. The on-premises (sometimes abbreviated as on-prem) data warehouse is installed, runs, and is overseen on the premises of the organization using it. Benefits of this option are that it gives the organization total control, flexibility, accessibility, and predictability, and dependable performance. Some consider this option to also be the most secure and might even be required to use on-premises depending on the sensitivity of the organization’s data. Examples of this might be highly regulated markets such as healthcare or financial. There are also some countries that restrict access to internet content so on-premises would be the only viable solution.
-->If a company chooses to implement its data warehouse in the public or private cloud then it will run on a computing platform that belongs to the cloud provider, and access to it is granted, as a service, by a provider. Some examples of companies that offer this deployment option are Amazon (AWS), or Microsoft (Azure), or Google cloud platform. A public cloud solution can be quick to set up, is scalable, accessible, and easy to use. As mentioned, while discussing the on-premises solution, security might be an issue for some organizations depending on the market they are in. Trust issues are often the most difficult to overcome for IT executives when considering whether or not to move to a cloud-based data warehouse. One consideration that organizations need to take into account, is that cloud options require high bandwidth which may not be an option for some smaller organizations. A private cloud, such as Teradata’s IntelliCloud option, can provide a more custom, secure virtual environment than a public cloud.

Benefits of TD

-->Unlimited Parallelism means that every request is broken down into components which are all worked on at the same time (parallel). Parallelism is at the very heart of the Teradata Advanced SQL Engine and is built into the system from the ground up. There is virtually no part of the system where parallelism has not been built in.
-->Because the Teradata Advanced SQL Engine is highly scalable it is available everywhere.
-->The Teradata Advanced SQL Engine is highly scalable which makes it easy for customers to expand the system as they need to for their growing businesses.
-->
With its high availability, there is no single point of failure. The Teradata Advanced SQL Engine has the ability to make data available in the event of errors or failures, because both hardware and software provide fault tolerance. For example, fallback tables are duplicate copies of primary tables. Each fallback row in a fallback table is stored on an AMP different from the one to which the primary row hashes. This storage technique maintains availability should the system lose an AMP and its associated disk storage in a cluster. In that event, the system would access data in the fallback rows.

Partitioning

Partitioning is a database process where very large tables are divided into multiple smaller parts. By splitting a large table into smaller individual tables, queries that access only a fraction of the data can run faster because there is less data to scan.

Object Management

In order to manage the database objects, Teradata needs to store information about all of the objects. This information is referred as metadata. Metadata can include:

  • Means of creation of the data
  • Purpose of the data
  • Time and date of creation along with the creator and author of the data
  • Location on a computer network where the data was created
  • Standards used and file size

Data Dictionary

The data dictionary is owned by system user DBC and is composed of tables and views. The tables are reserved for use by the system and contain the metadata about database objects such as privileges, system events, and system usage. Data dictionary views provide access to the information in the tables.
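For example, a hedged sketch of querying a dictionary view (the database name is hypothetical):

SELECT DatabaseName, TableName, TableKind
FROM DBC.TablesV
WHERE DatabaseName = 'Sales_DB'
ORDER BY TableName;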

What is Data Distribution?
The Teradata Advanced SQL Engine uses hashing to dynamically distribute data across all AMPs. Hashing is the transformation of a string of characters into a usually shorter fixed-length value or key that represents the original string.
Teradata Advanced SQL Engine generates a row hash by hashing the values of the PI columns. The row hash and a sequence number, which is assigned to distinguish between rows with the same row hash within a table, are collectively called a row identifier and uniquely identify each row in a table.
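The hashing functions can be used to inspect how the rows of an existing table spread across AMPs; a hedged sketch (table and column names are hypothetical):

SELECT HASHAMP(HASHBUCKET(HASHROW(OrderId))) AS amp_no
	,COUNT(*) AS row_count
FROM Orders
GROUP BY 1
ORDER BY 2 DESC;	-- a heavily skewed column shows a few AMPs holding most of the rows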

Metadata

Metadata is stored in tables that belong to user DBC and is reserved for use by the system. This information is stored in the Data Dictionary. The Data Dictionary contains metadata about the objects in the system like privileges, system events, and system usage. Views provide access to the information in the tables.
Index

Indexes can provide an easier and faster way of accessing and locating data and thus reducing unwanted inputs and outputs.
The use of Unique Primary Index (UPI) and Unique Secondary Index (USI) ensures effective maintenance of uniqueness on the table.

Ideally, the rows of every large table will be distributed among all of the AMPs. There may be some circumstances where this is not true. A column-partitioned table or join index may have a primary AMP index as an alternative to having no primary index. A primary AMP index (PA) allows for single-AMP access when the values of the primary AMP index column(s) are specified or to provide better distribution of the rows.

 

Choosing a Primary Index for a table is perhaps the most critical decision a database designer makes. The choice will affect the distribution of the rows of the table and, consequently, the performance of the table in a production environment. Unique Primary Indexes (UPI's) are often desirable because they guarantee the uniform distribution of the rows of that table.

 

Because it is not always feasible to pick a Unique Primary Index, it is sometimes necessary to pick a column (or columns) which have non-unique values; that is, there are duplicate values. This type of index is called a Non-Unique Primary Index or NUPI. While not a guarantor of uniform row distribution, the degree of uniqueness of the index will determine the degree of uniformity of the distribution.

Improve Performance

Accessing data with equality on the Primary Index results in a one-AMP operation, which is extremely fast and efficient. Similar improvements can be seen with the use of other indexes.

Types and Usage

The Teradata Advanced SQL Engine has several types of indexes that are used to distribute data, retrieve data, and enhance performance.

-->A Primary Index (PI) is a column or set of columns that provide a mechanism for physical row storage and access. 


The PI is defined when the table is created and is there for two major reasons:

  • To determine data distribution
  • Enhance performance

-->A No Primary Index (NOPI) table is simply a table without a primary index. NOPI tables are typically used as staging tables for the initial data load.
-->
A Secondary Index (SI) can be used to enforce uniqueness on a column. The SI also provides an alternate path to access data.
-->
A join index is an indexing structure containing columns from multiple tables, specifically the resulting columns from one or more tables. The join Index (JI) is an optional index which may be created by the user to improve performance of complex queries.

There are three general ways to access a table; two of which have been covered already:
--Primary Index access (one AMP access)
--Full Table Scan (all AMP access)
The third option is via a Secondary Index access (two or all AMP access). As we discussed earlier, there are two types of secondary indexes USI (unique) and NUSI (non unique).

UPI

Guarantees uniform distribution of rows, has no duplicate values, and is made up of one or more columns.


--Efficient way to access data. Accessing data through a UPI is a one-AMP operation that leads directly to the single row with the desired UPI value

--The same value stored in the same data type will always produce the same hash value. If the Primary Index is unique, Teradata can distribute the rows evenly.


NUPI

Non-unique Primary Index

Can cause skewed data, allows duplicate values, and is made up of one or more columns.
--Accessing data through a NUPI is a one-AMP operation that may lead to multiple rows with the desired NUPI value. The system creates a spool file during a NUPI access, if needed. NUPI access is efficient if the number of physical block reads is small.


--If, however, the order number was used as a non-unique Primary Index (NUPI), then the row distribution among AMPs will possibly be uneven. All rows with the same primary index value (in other words, with the same order number) will be stored on the same AMP. If the Primary Index is slightly non-unique, that is, there are only four or five rows per index value; the table will still distribute evenly. But what if there are hundreds or thousands of rows for some index values? The rows with the same primary index value would all be sent to the same AMP resulting in skewed data.
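
A hedged sketch contrasting the two choices (table and column names are hypothetical):

-- UPI: unique order numbers guarantee even row distribution
CREATE TABLE Orders_UPI
	(OrderId INTEGER NOT NULL
	,CustId INTEGER)
UNIQUE PRIMARY INDEX (OrderId);

-- NUPI: duplicate customer ids hash to the same AMP and may skew the data
CREATE TABLE Orders_NUPI
	(OrderId INTEGER NOT NULL
	,CustId INTEGER)
PRIMARY INDEX (CustId);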

USI
Unique Secondary Index
Not involved in data distribution, has no duplicate values, enables data access via an alternate path.
--Efficient way to access data. Data access through a USI is always a two-AMP operation, which leads directly to the single row with the desired USI value.
USI retrieval will always retrieve one row at most, whereas the number of rows a NUPI will retrieve depends upon the selectivity of the PI value. The performance of a NUPI will depend upon how non-unique the PI is.

NUSI
Non-unique secondary index
One or more columns, may have duplicate values and is used to improve access.
--(NUSI) is efficient only if the number of rows accessed is a small percentage of the total data rows in the table. NUSI access is an all-AMPs operation since the NUSI subtables must be scanned on each AMP. It is a multiple rows operation since there can be many rows per NUSI value. A spool file will be created, if needed
PPI
Partition Primary Index
Can quickly improve the performance of complex range-based queries. Partitioned Primary Index (PPI) is an indexing mechanism that is useful in improving the performance of certain queries. When rows are inserted into a table, they are stored in an AMP and arranged by their row hash order. When a table is defined with PPI, the rows are sorted by their partition number. Within each partition, they are arranged by their row hash. Rows are assigned to a partition based on the partition expression defined.
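
A hedged sketch of a PPI definition (the table, column names, and date range are hypothetical):

CREATE TABLE Sales_PPI
	(store_id INTEGER NOT NULL
	,sales_date DATE NOT NULL
	,amount DECIMAL(10,2))
PRIMARY INDEX (store_id)
PARTITION BY RANGE_N(sales_date BETWEEN DATE '2024-01-01'
                                AND     DATE '2024-12-31'
                                EACH INTERVAL '1' MONTH);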
NOPI
No Primary Index
--Simply a table without a primary index. As rows are inserted into a NoPI table, they are appended at the end of the table and never inserted in the middle of a hash sequence. Retrievals are likely to be full table scans (FTS) unless other index or table structures are in place e.g. columnar
--Supports faster loads and inserts into tables with no primary index defined.

Data is effectively appended to an AMP in a NoPI table. In addition, a NoPI table:
--Has no duplicate row checking. Duplicate rows can exist in a NoPI table after a TPT LOAD operation.
--A table without a primary index. Data is distributed to the AMPs by methods other than the primary index (i.e., Random Generator).  
--Is useful in staging and in data labs when the data is not known and there is no obvious primary index.
--Is useful as a staging table and can be loaded quickly using Teradata Parallel Transporter (TPT) LOAD.
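
A minimal sketch of a NoPI staging table (names hypothetical):

CREATE TABLE Stage_Sales
	(store_id INTEGER
	,sales_date DATE
	,amount DECIMAL(10,2))
NO PRIMARY INDEX;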

Full table scan
A full table scan is efficient in that each row is scanned only once. Although index access is generally preferred to a FTS, there are cases where a FTS is the best way to access the data. Like the situation with NUPIs and USIs, full table scans can sometimes be more efficient than a NUSI. Since V2R5, a cylinder read retrieves all the data blocks in a cylinder with a single I/O. A FTS will only read each data block once, whereas a NUSI would read the index subtable and then access the data blocks; this is efficient only when the NUSI is nearly unique. Statistics give the optimizer the demographic information to make the right decision to select a FTS.

Secondary index:
Secondary indexes require at least two and possibly all AMPs, depending on the index and the type of operation. A benefit is that searches will typically be less expensive than a full table scan. However, secondary indexes add overhead to the table, both in terms of disk space and maintenance (due to subtables), but they may be dropped when not needed and recreated whenever they would be helpful.
A Secondary Index is like a Primary Index in that it allows the user to locate rows, but unlike a Primary Index it has no influence on the way rows are distributed among AMPs. A unique secondary index can be used to maintain uniqueness within a column or set of columns.
Secondary Indexes are used to improve performance by providing an alternate path to the rows of a table, allowing the user to avoid scanning the entire table. One table can have from 0 to 32 secondary indexes.

Locking

Teradata Database locks have two dimensions: Level and Severity. 
Locks are automatically acquired during the processing of a request and released at the termination of the request. In addition, users can specify locks.
Types
There are four types of locks in Teradata:
Exclusive - prevents any other type of concurrent access.
Write - prevents other reads, writes, and exclusive locks.
Read - prevents other writes and exclusive locks.
Access - prevents only exclusive locks.
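
Locks are normally acquired automatically, but a request can ask for a lower-severity lock explicitly with the LOCKING modifier. A common case is an ACCESS lock for a "dirty read"; for example (database and table names hypothetical):

LOCKING TABLE personnel.employee FOR ACCESS
SELECT last_name, first_name
FROM personnel.employee;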

Levels
Locks can be applied at four levels:
Database level - applies to all tables/views in the database.
Table/view level - applies to all rows in the table/view.
Partition level - applies to all rows within the specified partition(s).
Row hash level - applies to all rows that share the same row hash value (the row hash is the output of the hashing algorithm that determines which AMP stores a row).


Data Integration Tools

Teradata Parallel Transporter

Teradata Parallel Transporter, or TPT, is client software that performs data extraction, data transformation, and data loading functions in a scalable, parallel processing environment. Jobs in TPT are run using operators, which define the type of activity to be performed. The process flow is shown below.
Producer - producer operators extract data from a data source by reading the data and writing it to the data stream.
Consumer - consumer operators read the data from the data stream and write or load it to the target, which might be Teradata tables or an external data store.








Teradata Intelligent Memory
Whenever data on the very hot data list is read from disk during query processing, a copy is kept in Teradata Intelligent Memory's (TIM) extended memory area for future use.

Teradata Compression




Join index

TD:
Shared nothing - each AMP works only on its assigned portion of the database; nothing is shared.
Hashing - Teradata uses hashing to distribute data across the AMPs, and the Primary Index determines whether the distribution is even or not. Even distribution is desirable because, in operations that touch all rows (for example, a full table scan), each AMP carries an even share of the workload, which improves performance.


Hashing


Normalization
Normalization is a set of rules and a methodology for making sure that the attributes in a design are carried in the correct entity to map accurately to reality, eliminate data redundancy and minimize update abnormalities.

Referential Integrity

Referential integrity refers to the accuracy and consistency of data within a relationship and is a reliable mechanism to prevent accidental database corruption.


===========================
Performance tuning


SHOW TABLE, EXPLAIN, HELP STATISTICS
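
For example (database and table names hypothetical), these statements are a typical starting point when tuning a query:

SHOW TABLE personnel.employee;        -- returns the table's current DDL, including its indexes
EXPLAIN
SELECT *
FROM personnel.employee
WHERE department_number = 403;        -- returns the Optimizer's plan without running the query
HELP STATISTICS personnel.employee;   -- summarizes the statistics collected on the table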




===================================================================

Teradata Database Architecture Overview






























=======================================================================

 Structured Query Language (SQL) is the industry language for communicating with Relational Database Management Systems (RDBMS) 

In an RDBMS, data is stored in tables in which a row represents one instance of an entity and the columns represent the data fields that make up a row.

 

SQL is used to construct queries to communicate with the Teradata Vantage Advanced SQL Engine and define the answer set that is returned to the user.

-----------------------------------------------------

SQL STATEMENT TYPE


1. Data Definition Language (DDL) consists of all statements that support the definition of database objects.

 

DDL is used to create and modify database objects such as tables, views, macros, and databases, as well as to restructure the database.

DDL Statements:

 

CREATE

DROP

ALTER

RENAME

MODIFY

REPLACE

COLLECT STATISTICS

SET

BEGIN

END

FLUSH

CHECKPOINT

HELP

SHOW
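
A small, hypothetical DDL sequence:

CREATE TABLE payroll.employee_bonus
 (employee_number INTEGER
 ,bonus_amount    DECIMAL(10,2))
PRIMARY INDEX (employee_number);

ALTER TABLE payroll.employee_bonus ADD bonus_date DATE;

DROP TABLE payroll.employee_bonus;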


2. Data Manipulation Language (DML) consists of all SQL statements that support the manipulation and processing of database objects.

 

DML is used to work with data, including inserting data, updating existing data or performing queries on the data.

 

The DML subset will be the focal point of this fundamentals course.


DML Statements:

 

SELECT

INSERT

UPDATE

DELETE
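
Hypothetical examples of each DML statement:

SELECT last_name, salary_amount
FROM personnel.employee
WHERE department_number = 403;

INSERT INTO personnel.employee (employee_number, last_name) VALUES (1010, 'Chan');

UPDATE personnel.employee
SET salary_amount = salary_amount * 1.05
WHERE employee_number = 1010;

DELETE FROM personnel.employee
WHERE employee_number = 1010;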


3. Data Control Language (DCL) is a subset of the SQL language that defines and supports the security authorization for accessing database objects.

 

DCL statements are used for administrative tasks such as granting / revoking privileges or controlling object ownership.


DCL Statements:

 

 

GRANT

REVOKE

GIVE
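
Hypothetical examples (user and database names are made up):

GRANT SELECT, INSERT ON personnel.employee TO hr_user;
REVOKE INSERT ON personnel.employee FROM hr_user;
GIVE payroll TO hr_admin;   -- transfers ownership of the payroll database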


4. Query and Workload Analysis statements are used for collecting and analyzing query and data demographics.

 

Some statements in this category are also functionally aligned with the DDL SQL category.

 

For more information on SQL statements in this category, see references in help section.

SQL Statements in this category:

 

 

BEGIN/END QUERY LOGGING

SHOW QUERY LOGGING

COLLECT DEMOGRAPHICS

COLLECT STATISTICS

DROP STATISTICS

DUMP EXPLAIN

INITIATE INDEX ANALYSIS

INITIATE PARTITION ANALYSIS

INSERT EXPLAIN

RESTART INDEX ANALYSIS
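
For example (column and table names hypothetical):

BEGIN QUERY LOGGING ON ALL;   -- start logging query activity for all users
END QUERY LOGGING ON ALL;     -- stop logging
COLLECT STATISTICS COLUMN (department_number) ON personnel.employee;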


-------------------------------------------------------------

 HELP


The HELP command displays information about database objects (databases, tables, views, macros, etc.).

 

This information is pulled from the data dictionary and includes information such as object attributes, function parameters, and column data types.
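
Typical forms (object names hypothetical):

HELP DATABASE personnel;         -- lists the objects in the database
HELP TABLE personnel.employee;   -- lists the columns and their attributes
HELP SESSION;                    -- describes the current session settings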



-----------------------------------------------------------------------------------------------------------------

The SHOW command displays the current data definition language (DDL) of a database object.

 

It returns the last definition statement performed against the object (which may not be the same DDL used to initially create it).
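
Typical forms (object names hypothetical):

SHOW TABLE personnel.employee;
SHOW VIEW personnel.emp_v;
SHOW MACRO personnel.new_hire_report;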


------------------------------------------------

EXPLAIN
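
EXPLAIN precedes any SQL request and returns the Optimizer's execution plan in English text without actually executing the request; for example (names hypothetical):

EXPLAIN
SELECT last_name, first_name
FROM personnel.employee
WHERE department_number = 403;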


=========================================================

Introduction to the Teradata Database

The Teradata database is a Relational Database Management System (RDBMS) that drives a company's data warehouse.

Teradata Parallel Architecture

      
There are many reasons customers choose the Teradata database as the preferred platform for enterprise data warehousing:

  • Supports more warehouse data than all competitors combined. There are over 5400 1TB or larger warehouses in the field.
  • Supports scalability from small (10GB) to massive (50+PB) databases.
  • Provides a parallel-aware Optimizer that makes it unnecessary to perform complicated tuning to get a query to run efficiently.
  • Automatic data distribution eliminates complex indexing schemes and time-consuming reorganizations.
  • Designed and built on a parallel architecture from the start (not a parallel retrofit).
  • Supports ad-hoc queries using ANSI-standard Structured Query Language (SQL) which is used to communicate with an RDBMS. This allows Teradata to interface with 3rd party BI tools and submit queries from other database systems.
  • Single operational view of the entire MPP system via the Service Workstation (SWS) and single point of control for the DBA with Teradata Viewpoint.

Teradata Corporation has been doing data warehousing longer than any other vendor. The Teradata database was built to do Decision Support.

Relational Database Advantages

      
The Teradata database provides customers with unlimited, proven scalability. As the example on the previous page showed, the Teradata database can scale from 10 gigabytes to over 100 petabytes of data on a single system. When we talk about scalability, it isn't just about being able to grow very large and handle enormous amounts of data. It's about growth without losing any performance capabilities. This scalability provides investment protection for customer's growth and application development.

As is proven in the benchmarks we perform, the Teradata database can handle the most concurrent users, who are often running complex queries.

The parallelism of the Teradata database is unlimited. The Teradata database performs every task in parallel, including joins, sorts, and aggregations.

The optimizer is parallel-aware and the most robust in the industry, able to handle multiple complex queries, joins per query, and unlimited ad-hoc processing.

The Teradata File System manages data and disk space automatically. This is what makes the rebuilding of indexes unnecessary.

The Teradata database provides a low total cost of ownership (TCO) due to:

  • Ease of setup and maintenance
  • No reorganization of data needed
  • Automatic data distribution
  • Most robust utilities in the industry
  • Low cost of disk to data ratio
  • Ease in expanding the system

High availability is also a major advantage because with the Teradata database architecture, there is no single point of failure - fault tolerance is built-in to the system. The Teradata database is the only database that is truly linearly scalable, and this extends to data loading with the use of our parallel utilities.

Teradata Database Features

      
The Teradata database is a high-performance database system that processes enormous quantities of detail data that are beyond the capability of conventional systems.

The system is specifically designed for large data warehouses. From the beginning, the Teradata database was created to store, access and manage large amounts of data.

Hundreds of terabytes of storage capacity are currently available, making it an ideal solution for enterprise data warehouses and even for smaller data marts.

Parallel processing distributes data across multiple processors evenly. The system is designed such that the components divide the work up into approximately equal pieces. This keeps all the parts busy all the time and enables the system to accommodate a larger number of users and/or more data.

Open architecture adapts readily to new technology. As higher-performance industry standard computer chips and disk drives are made available, they are easily incorporated into the architecture. Teradata runs on industry standard operating systems as well.

Linear scalability enables the system to grow to support more users, data, queries, and query complexity without experiencing performance degradation. As the configuration grows, the performance increase is linear (a slope of 1).

The Teradata database currently runs as a database server on a variety of hardware platforms for single node or Symmetric Multi-Processor (SMP) systems, and on Teradata hardware for multi-node Massively Parallel Processing (MPP) systems.

What is a Database?

      
A database is a collection of permanently stored data that is used by an application or enterprise.

A database contains logically related data, which means that the database was created with a specific purpose in mind. A database supports shared access by many users. One characteristic of a database is that many people use it, often for many different purposes. A database is protected to control access and managed to retain its value and integrity.

One example of a database is payroll data that includes the names of the employees, their employee numbers, and their salary history. This database is logically related - it's all about payroll. It must have shared access, since it will be used by the payroll department to generate checks, and also by management to make decisions. This database must be protected; much of the information is confidential and must be managed to ensure the accuracy of the records.

The Teradata database is a relational database. Relational databases are based on the relational model, which is founded on mathematical Set Theory. The relational model uses and extends many principles of Set Theory to provide a disciplined approach to data management. Users and applications access data in an RDBMS using industry-standard SQL statements. SQL is a set-oriented language for relational database management.

Later in this course we will provide another definition of database that is specific to the Teradata database.

Logical/Relational Modeling

      
The logical model should be independent of usage. A variety of front-end tools can be accommodated so that the database can be created quickly.

The design of the data model is the same regardless of data volume.

An enterprise model is one that provides the ability to look across functional processes.

Normalization is the process of reducing a complex data structure into a simple, stable one. Generally this process involves removing redundant attributes, keys, and relationships from the conceptual data model.

Normalization theory is constructed around the concept of normal forms that define a system of constraints. If a relation meets the constraints of a particular normal form, we say that relation is "in normal form."

The intent of normalizing a relational database is to put one fact in one place. By decomposing your relations into normalized forms, you can eliminate the majority of update anomalies that can occur when data is stored in de-normalized tables.

A slightly more detailed statement of this principle would be the definition of a relation (or table) in a normalized relational database: A relation consists of a primary key, which uniquely identifies any tuple (an ordered set of values), and zero or more additional attributes, each of which represents a single-valued (atomic) property of the entity type identified by the primary key. The separator for each value is often a comma. Common uses for the tuple as a data type are:

  1. For passing a string of parameters from one program to another
  2. Representing a set of value attributes in a relational database

The Logical Model

  • Should be designed without regard to usage
    • Accommodates a wide variety of front end tools
    • Allows the database to be created more quickly
  • Should be the same regardless of data volume
  • Data is organized according to what it represents - real world business data in table (relational) form
  • Includes all the data definitions within the scope of the application or enterprise
  • Is generic - the logical model is the template for physical implementation on any RDBMS platform

Normalization

  • Process of reducing a complex data structure into a simple, stable one
  • Involves removing redundant attributes, keys, and relationships from the conceptual data model

Relational Databases

      
Relational databases are solidly founded on Set Theory and are based on the Relational Model, which is a set of principles for relational databases formalized by Dr. E.F. Codd in the late 1960s. The relational model of data prescribes how the data should be represented in terms of structure, integrity, and manipulation. The rules of the relational model are widely accepted in the information technology industry, though actual implementations may vary.

The key to understanding relational databases is the concept of the table, which is made up of rows and columns. The relational model provides the system with rules and a structure to build on. The relational terms we will be discussing in this course are independent of the database you may be running on.

A column is an attribute of the entity that a table represents. A column always contains like data, such as part name, supplier name, or employee number. In the example below, the column named LAST NAME contains last names and never anything else. Columns should contain atomic data, so a phone number might be divided into three columns (area code, prefix, and suffix) so that drill-down analysis can be performed. Column position in the table is arbitrary. Missing data values are represented by "nulls", or the absence of a value.

A row is one instance of all the columns of a table. It is a single occurrence of an entity in the table. In our example, all information about a single employee is in one row. The sequence of the rows in a table is arbitrary.

In a relational database, tables are defined as a named collection of one or more named columns that can have zero or many rows of related information. Notice that the information in each row of the table refers to only one person. There are no rows with data on two people, nor are there rows with information on anything other than people. This may seem obvious, but the concept underlying it is very important.

Each row represents an occurrence of an entity defined by the table. An entity is defined as a person, place, thing, or event about which the table contains information. In this case, the entity is the employee.

Note: In relational math we use the term:

  • Table to mean a relation.
  • Row to mean a tuple.
  • Column to mean an attribute.
EMPLOYEE (Relation) - each row is a single entity (one employee); each column is an attribute.

EMPLOYEE | MANAGER  | DEPARTMENT | JOB    | LAST     | FIRST   | HIRE   | BIRTH  | SALARY
NUMBER   | EMPLOYEE | NUMBER     | CODE   | NAME     | NAME    | DATE   | DATE   | AMOUNT
         | NUMBER   |            |        |          |         |        |        |
---------|----------|------------|--------|----------|---------|--------|--------|--------
1006     | 1019     | 301        | 312101 | Stein    | John    | 861015 | 631015 | 3945000
1008     | 1019     | 301        | 312102 | Kanieski | Carol   | 870201 | 680517 | 3925000
1005     | 0801     | 403        | 431100 | Ryan     | Loretta | 861015 | 650910 | 4120000
1004     | 1003     | 401        | 412101 | Johnson  | Darlene | 861015 | 560423 | 4630000
1007     |          |            |        | Villegas | Arnando | 870102 | 470131 | 5970000
1003     | 0801     | 401        | 411100 | Trader   | James   | 860731 | 570619 | 4785000

The employee table has:

  • Nine columns of data
  • Six rows of data - one per employee
  • Missing data values represented by nulls
  • Column and row order are arbitrary

Normalized Data Model vs. Star Schema Model

      
As a model is refined, it passes through different states which can be referred to as normal forms. A normalized model includes:

  • Entities - One record in a table
  • Attributes - Columns
  • Relationships - Between tables

First normal form (1NF) rules state that each and every attribute within an entity instance has one and only one value. No repeating groups are allowed within entities.

Second normal form (2NF) requires that the entity conform to the first normal form rules. Every non-key attribute within an entity is fully dependent upon the entire key (key attributes) of the entity, not a subset of the key.

Third normal form (3NF) requires that the entity conform to the first and second normal form rules. In addition, no non-key attribute within an entity may be functionally dependent upon another non-key attribute within the same entity.

While the Teradata database can support any data model that can be processed via SQL, an advantage of a normalized data model is the ability to support previously unknown (ad hoc) questions.

Star Schema

The star schema (sometimes referenced as star join schema) is the simplest style of data warehouse schema. The star schema consists of a few fact tables (possibly only one, justifying the name) referencing any number of dimension tables. The star schema is considered an important special case of the snowflake schema.

Some characteristics of a Star Schema model include:

  • They tend to have fewer entities
  • They advocate a greater level of denormalization

Primary Keys

      
Tables are made up of rows and columns and represent entities and attributes. Entities are the people, places, things, or events that the tables model. A Primary Key is required for every logical model version of a table, because each entity holds only one type of tuple (i.e., a row about a person, place, thing or event), and each tuple is uniquely identified within an entity by a Primary Key (PK).

Primary Key Rules

  • A Primary Key uniquely identifies each tuple within an entity. A Primary Key is required, because each tuple within an entity must be able to be uniquely identified.

  • No duplicate values are allowed. The Primary Key for the EMPLOYEE table is the employee number, because no two employees can have the same number.

  • Because it is used for identification, the Primary Key cannot be null. There must be something in that field to uniquely identify each occurrence.

  • Primary Key values should not be changed. Historical information, as well as relationships with other entities, may be lost if a PK value is changed or re-used.

  • Primary Key can include more than one attribute. In fact, there is no limit to the number of attributes allowed in the PK.

  • Only one Primary Key is allowed per entity.

Selecting a Primary Key

Each column of a row is called an attribute of that row. A database designer can select any attribute to be a Primary Key but, as a result of the rules listed above, many attributes will not qualify as a Primary Key candidate. For example, we could have selected Last Name as the PK of the EMPLOYEE table, but as soon as the company hires two people with the same last name, the PK is no longer unique. Even if we made the PK Last Name and First Name (possible since PKs can be made up of more than one column) we could still have two employees with the same name. Moreover, some employees may choose to change their last names.

Many data modelers recommend using system-assigned sequential integers for Primary Keys. This assures uniqueness and gives users control over Primary Key values, but it also takes up database space.

Primary Key (PK) values uniquely identify each entry (tuple) of an entity.

EMPLOYEE

EMPLOYEE | MANAGER  | DEPARTMENT | JOB    | LAST     | FIRST   | HIRE   | BIRTH  | SALARY
NUMBER   | EMPLOYEE | NUMBER     | CODE   | NAME     | NAME    | DATE   | DATE   | AMOUNT
(PK)     | NUMBER   |            |        |          |         |        |        |
---------|----------|------------|--------|----------|---------|--------|--------|--------
1006     | 1019     | 301        | 312101 | Stein    | John    | 861015 | 631015 | 3945000
1008     | 1019     | 301        | 312102 | Kanieski | Carol   | 870201 | 680517 | 3925000
1005     | 0801     | 403        | 431100 | Ryan     | Loretta | 861015 | 650910 | 4120000
1004     | 1003     | 401        | 412101 | Johnson  | Darlene | 861015 | 560423 | 4630000
1007     |          |            |        | Villegas | Arnando | 870102 | 470131 | 5970000
1003     | 0801     | 401        | 411100 | Trader   | James   | 860731 | 570619 | 4785000

Foreign Keys

      
Relational databases permit data values to associate across more than one entity. A Foreign Key (FK) value identifies entity relationships.

Below you will see that the employee table has three FK attributes, one of which models the relationship between employees and their departments. A second FK attribute models the relationship between employees and their jobs.

A third FK attribute is used to model the relationship between employees and each other. This is called a recursive relationship.

Foreign Key Rules

  • Duplicate values are allowed in a FK attribute.
  • NULLs are allowed in a FK attribute.
  • Values may be changed in a FK attribute.
  • Each FK must exist elsewhere as a Primary Key.

Note that Department_Number is the Primary Key for the DEPARTMENT entity.

Remember, these terms are not Teradata specific - they are just general relational concepts.

EMPLOYEE (partial listing)

EMPLOYEE | MANAGER  | DEPARTMENT | JOB    | LAST     | FIRST   | HIRE   | BIRTH  | SALARY
NUMBER   | EMPLOYEE | NUMBER     | CODE   | NAME     | NAME    | DATE   | DATE   | AMOUNT
(PK)     | NUMBER   | (FK)       | (FK)   |          |         |        |        |
         | (FK)     |            |        |          |         |        |        |
---------|----------|------------|--------|----------|---------|--------|--------|--------
1006     | 1019     | 301        | 312101 | Stein    | John    | 861015 | 631015 | 3945000
1008     | 1019     | 301        | 312102 | Kanieski | Carol   | 870201 | 680517 | 3925000
1005     | 0801     | 403        | 431100 | Ryan     | Loretta | 861015 | 650910 | 4120000
1004     | 1003     | 401        | 412101 | Johnson  | Darlene | 861015 | 560423 | 4630000
1007     |          |            |        | Villegas | Arnando | 870102 | 470131 | 5970000
1003     | 0801     | 401        | 411100 | Trader   | James   | 860731 | 570619 | 4785000

Foreign Key (FK) values model relationships.

DEPARTMENT

DEPARTMENT | DEPARTMENT               | BUDGET   | MANAGER
NUMBER     | NAME                     | AMOUNT   | EMPLOYEE
(PK)       |                          |          | NUMBER (FK)
-----------|--------------------------|----------|------------
501        | marketing sales          | 80050000 | 1017
301        | research and development | 46560000 | 1019
302        | product planning         | 22600000 | 1016
403        | education                | 93200000 | 1005
402        | software support         | 30800000 | 1011
401        | customer support         | 98230000 | 1003
201        | technical operations     | 29380000 | 1025
  • Foreign Keys (FK) are optional.
  • An entity may have more than one FK.
  • A FK may consist of more than one attribute.
  • FK values may be duplicated.
  • FK values may be null.
  • FK values may be changed.
  • FK values must exist elsewhere as a PK.
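
A sketch of how these rules can be declared (Teradata supports standard referential integrity constraints; the names and data types here are hypothetical):

CREATE TABLE department
 (department_number INTEGER NOT NULL PRIMARY KEY
 ,department_name   VARCHAR(30));

CREATE TABLE employee
 (employee_number   INTEGER NOT NULL PRIMARY KEY
 ,department_number INTEGER REFERENCES department (department_number)   -- FK: may be null, may repeat
 ,last_name         VARCHAR(30));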

Relational Advantages

      
Advantages of a Relational Database (physical and logical) compared to other database methodologies include:
  • Flexible
    Flexibility provides substantial benefits. The user does not need to know the access path to the data; the RDBMS keeps track of where everything is. Relational databases use atomic data - breaking data down into its smallest parts to allow maximum flexibility in selecting and using data. In addition, a single copy of the data can serve multiple purposes.

  • Responds quickly
    In a non-relational database, adding a field means that all programs that use the database must be rewritten to become aware of the new data structure. In a relational database, programs do not need to be re-written when a field is added.

  • Data-driven
    Relational databases are designed to represent a business and its practices - not the application or the computer system.

  • Business-oriented
    The two tables we have looked at, EMPLOYEE and DEPARTMENT, are organized to reflect the way the business really works.

  • Simple and easy to use and understand
    Simplicity is useful not only to the people who ask the questions, but also to the people who have to figure out how to retrieve information from the database. Understanding how an RDBMS functions is not necessary.

  • Easier to build applications
    Relational databases make the data do more work. Programs and transactions are simpler, which makes it easier to build applications.

  • Support the trend toward end-user computing
    The trend is moving away from organizations funneling all data requests through the IT people who know how the system works. As systems get easier to use, more people have access to them. This is called the "democratization of data."

  • Set Theory
    Set theory is the mathematical science of the infinite. It studies properties of sets, abstract objects that pervade the whole of modern mathematics. The language of set theory, in its simplicity, is sufficiently universal to formalize all mathematical concepts and thus set theory, along with predicate calculus, constitutes the true foundations of mathematics.

Evolution of Data Processing

      
Traditionally, data processing has been divided into two categories: On-Line Transaction Processing (OLTP) and Decision Support Systems (DSS). For either, requests are handled as transactions. A transaction is a logical unit of work, such as a request to update an account.

Traditional

A transaction is a logical unit of work.
  • On-Line Transaction Processing (OLTP)
    OLTP is typified by a small number of rows (or records) or a few of many possible tables being accessed in a matter of seconds or less. Very little I/O processing is required to complete the transaction. This type of transaction takes place when we take out money at an ATM. Once our card is validated, a debit transaction takes place against our current balance to reflect the amount of cash withdrawn. This type of transaction also takes place when we deposit money into a checking account and the balance gets updated. We expect these transactions to be performed quickly. They must occur in real time.

  • Decision Support Systems (DSS)
    Decision support systems include batch reports, which roll-up numbers to give business the big picture, and over time, have evolved. Instead of pre-written scripts, users now require the ability to do ad hoc queries, analysis, and predictive what-if type queries that are often complex and unpredictable in their processing. These types of questions are essential for long range, strategic planning. DSS systems often process huge volumes of detail data and rely less on summary data.

Today

Functional Trends Rankings
  • On-line Analytical Processing (OLAP)
    OLAP is a modern form of analytic processing within a DSS environment. OLAP tools (e.g., from companies like MicroStrategy and Cognos) provide an easy to use Graphical User Interface to allow "slice and dice" analysis along multiple dimensions (e.g., products, locations, sales teams, inventories, etc.). With OLAP, the user may be looking for historical trends, sales rankings or seasonal inventory fluctuations for the entire corporation. Usually, this involves a lot of detail data to be retrieved, processed and analyzed. Therefore, response time can be in seconds or minutes.

  • Data Mining (DM)
    DM (predictive modeling) involves analyzing moderate to large amounts of detailed historical data to detect behavioral patterns (e.g., buying, attrition, or fraud patterns) that are used to predict future behavior. An "analytic model" is built from the historical data (Phase 1: minutes to hours) incorporating the detected patterns. The model is then run against current detail data ("scoring") to predict likely outcomes (Phase 2: seconds or less). Likely outcomes, for example, include scores on likelihood of purchasing a product, switching to a competitor, or being fraudulent.
OLTP
  Examples: Update a checking account to reflect a deposit. A debit transaction takes place against the current balance to reflect money withdrawn at an ATM.
  Rows accessed: Small
  Response time: Seconds

DSS
  Examples: How many child size blue jeans were sold across all of our Eastern stores in the month of March? What were the monthly shoe sales for retailer X?
  Rows accessed: Large
  Response time: Seconds or minutes

OLAP
  Examples: Show the top ten selling items across all stores for 1997. Show a comparison of sales from this week to last week.
  Rows accessed: Large (detail rows) or moderate (summary rows)
  Response time: Seconds or minutes

Data Mining
  Examples: Which customers are most likely to leave? Which customers are most likely to respond to this promotion?
  Rows accessed: Moderate to large (detailed historical rows)
  Response time: Phase 1: minutes or hours; Phase 2: seconds or less


Row vs. Set Processing

      
Both cursor and set processing define a set (or sets) of rows to process, but while a cursor processes the rows sequentially, set processing operates on the entire set at once. Both can be expressed with a single command.

What is Row-by-Row Processing?

Where there are many rows to process, one row is fetched at a time and all calculations are done on it, then it is updated or inserted. Then the next row is fetched and processed. This is row-by-row processing and it makes for a slow program.

A benefit of row processing is that there is less lock contention.

What is Set Processing?

A lot of data processing is set processing, which is what relational databases do best.

Instead of processing row-by-row sequentially, you can process relational data set-by-set, without a cursor. For example, to sum all payment rows with balances of $100 or less, a single SQL statement completely processes all rows that meet the condition as a set.

With sufficient rows to process, this can be 10 to 30 or more times faster than row-at-a-time processing.

Some characteristics of set processing include:

  • An update with all AMPs involved
  • Single session processing which takes advantage of parallel processing
  • Efficient updates of large amounts of data.
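
A minimal sketch of the set-based example above (table and column names hypothetical); one statement processes every qualifying row, with no cursor loop:

SELECT SUM(balance)
FROM payment
WHERE balance <= 100;

UPDATE payment
SET late_flag = 'N'
WHERE balance <= 100;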

Response Time vs. Throughput

      
When determining how fast something is, there are two kinds of measures. You can measure how long it takes to do something or you can measure how much gets done per unit time. The former is referred to as response time, access time, transmission time, or execution time depending on the context. The latter is referred to as throughput.

Response Time

This speed measure is specified by an elapsed time from the initiation of some activity until its completion. The phrase response time is often used in operating systems contexts.

Throughput

A throughput measure is an amount of something per unit time. For operating systems, throughput is often measured as tasks or transactions per unit time. For storage systems or networks, throughput is measured as bytes or bits per unit time. For processors, the number of instructions executed per unit time is an important component of performance.

What Does this Mean to Teradata?

Response Time                            | Throughput
-----------------------------------------|-------------------------------------------------------------------
Measures the average duration of queries | Measures the quantity of queries completed during a time interval
A measure of process completion          | A measure of the amount of work processed
How long the processing takes            | How many queries were processed
The elapsed time per query               | The number of queries executed in an hour

In order to improve both response time and throughput on a Teradata system, you could increase CPU power (i.e., add nodes), implement workload management to control resources, and decrease the number of concurrent users.


The Advantage of Using Detail Data

      
Until recently, most business decisions were based on summary data.

The problem is that summarized data is not as useful as detail data and cannot answer some questions with accuracy. With summarized data, peaks and valleys are leveled when the peaks fall at the end of a reporting period and are cut in half, as shown in the example below.

Here's another example. Think of your monthly bank statement that records checking account activity. If it only told you the total amount of deposits and withdrawals, would you be able to tell if a certain check had cleared? To answer that question you need a list of every check received by your bank. You need detail data.

Decision support, answering business questions, is the real purpose of databases. To answer business questions, decision-makers must have four things:

  • The right data
  • Enough detail data
  • Proper data structure
  • Enough computer power to access and produce reports on the data

Strategic workloads tend to rely heavily on detail data.

Consider your own business and how it uses data. Is that data detailed or summarized? If it's summarized, are there questions it cannot answer?

QUESTION: How effective was the national advertisement for jeans that ran June 6 through June 8?

DETAIL DATA vs. SUMMARY DATA

  • Detail data gives a more accurate picture.
  • Correct business decisions result.

The Data Warehouse

      
A data warehouse is a central, enterprise-wide database that contains information extracted from the operational systems (this is where the saying "Load once, use many" comes from). A data warehouse has a centrally located logical architecture, using a single data store, which minimizes data synchronization, utilizes less space, and provides a single version of the truth.

Data warehouses have become more common in corporations where enterprise-wide detail data may be used in on-line analytical processing to make strategic and tactical business decisions. Warehouses often carry many years' worth of detail data so that historical trends may be analyzed using the full power of the data.

Many data warehouses get their data directly from operational systems so that the data is timely and accurate, providing less disparate results. While data warehouses may begin somewhat small in scope and purpose, they often grow quite large as their utility becomes more fully exploited by the enterprise.

The data warehouse carries a massive volume of detail data, so it is well suited to the parallel architecture of the Teradata database. Because the Teradata database has traditionally specialized in large-scale, decision-support queries against huge volumes of detail data, its use for data warehousing becomes an obvious solution.

The Teradata database can run in single-node environments with a database of a terabyte or less. As requirements grow, the single node can easily be expanded to multiple nodes with the Teradata database's linear scalability. In the multi-node environment, the Teradata database can scale upward to systems supporting more than 50 petabytes of data and process hundreds of concurrent sessions. With the power of multiple CPUs propelling multiple AMPs over multiple disks, getting answers from the detail data in the warehouse becomes fast and efficient.

The Teradata database supports normalized logical models because it can perform 128 joins in a single query as well as large aggregations during very complex queries using its robust parser and optimizer. A key Teradata database strength is the ability to model the customer's business. With a business model, data is organized to what it represents. The Teradata database can do Star Schema and other forms of relational modeling - but Third Normal Form is the methodology we recommend to our customers. The Teradata database also has:

  • A Priority Scheduler and workload management tools that prioritize resources among competing applications based on importance
  • A powerful suite of load and maintenance utilities
  • Numerous third-party strategic and alliance partners

A central, enterprise-wide database that contains information extracted from operational systems.

  • Based on enterprise-wide model
  • Can begin small but may grow large rapidly
  • Populated by extraction/loading of data from operational systems
  • Responds to users "what if" queries
  • Minimizes data movement/ synchronization
  • Provides a single view of the business

Data Marts

      
A data mart is a special-purpose subset of data used by a particular department, function, or application. Data marts may have both summary and detail data for a particular use rather than for general use. Usually the data has been pre-aggregated or transformed in some way to better handle the particular type of requests of a specific user community. Data marts are often implemented when a department wants to own its own data, when a quick ROI is required, or, in some cases, when canned queries may offer users better performance.

Data Mart Types

  • Independent Data Marts
    Independent data marts are created directly from operational systems, just as a data warehouse is. In the data mart, the data is usually transformed as part of the load process. Data might be aggregated, dimensionalized or summarized historically, as the requirements of the data mart dictate.

  • Logical Data Marts
    Logical data marts are not separate physical structures or a data load from a data warehouse, but rather are an existing part of the data warehouse. Because in theory the data warehouse contains the detail data of the entire enterprise, a logical view of the warehouse might provide the specific information for a given user community, much as a physical data mart would. Without the proper technology, a logical data mart can be a slow and frustrating experience for end users. With the proper technology (Teradata), it removes the need for massive data loading and transformation, making a single data store available for all user needs.

  • Dependent Data Marts
    Dependent data marts are created from the detail data in the data warehouse. While having many of the advantages of the logical data mart, this approach still requires the movement and transformation of data but may provide a better vehicle for some performance-critical user queries.

Data Models - Enterprise vs. Application

To build an EDW, an enterprise data model should be leveraged. An enterprise data model serves as a neutral data model that is normalized to address all business areas and not specific to any function or group, whereas an application model is built for a specific business area. The application data model only looks at one aspect of the business whereas an enterprise logical data model integrates all aspects of the business.

In addition, an enterprise data model is more extensible than an application data model. It is intended to encompass the entire enterprise.

Teradata Functional Overview

      
The Parsing Engine is responsible for:
  • Managing individual sessions (up to 120 sessions per PE)
  • Parsing and optimizing your SQL requests
  • Building query plans with the parallel-aware, cost-based, intelligent Optimizer
  • Dispatching the optimized plan to the AMPs
  • EBCDIC/ASCII input conversion (if necessary)
  • Sending the answer set response back to the requesting client


Major Components of the Teradata Database

      
We have looked at the overall node, and now we will describe the components that make up a node in detail.

Parallel Database Extensions (PDE)

The Parallel Database Extensions is a software interface layer that lies between the operating system and database. It controls the virtual processors (vprocs).

Parsing Engine (PE)

The Parsing Engine is a virtual processor (vproc) that interprets SQL requests, receives input records, and passes data. To do that, it sends the messages over the BYNET to the AMPs (Access Module Processor).

BYNET

The BYNET is the message-passing layer. It determines which AMP(s) should receive a message.

Access Module Processor (AMP)

The AMP is a virtual processor (vproc) designed for and dedicated to managing a portion of the entire database. It performs all database management functions such as sorting, aggregating, and formatting data. The AMP receives data from the PE, formats rows, and distributes them to the disk storage units it controls. The AMP also retrieves the rows requested by the Parsing Engine.

Disks

Disks are disk drives associated with an AMP that store the data rows. On current systems, they are implemented using a disk array.

What Happens When a Query is Submitted?

  1. The Parser evaluates the SQL for proper syntax.
  2. The Optimizer develops the least expensive plan.
  3. The BYNET passes the steps from the Dispatcher to the AMPs.


The Parsing Engine (PE)

      
The Parsing Engine (PE) is a virtual processor (vproc). It is made up of the following software components: Session Control, the Parser, the Optimizer, and the Dispatcher.

Once a valid session has been established, the PE is the component that manages the dialogue between the client application and the RDBMS. Each PE can handle up to 120 sessions, and each session can handle multiple requests.

Session Control

The major functions performed by Session Control are logon and logoff. Logon takes a textual request for session authorization, verifies it, and returns a yes or no answer. Logoff terminates any ongoing activity and deletes the session's context.

Parsing

The Parser interprets SQL statements, checks them for proper SQL syntax and evaluates them semantically. The Parser also consults the Data Dictionary to ensure that all objects and columns exist and that the user has authority to access these objects.

Optimizing

The Optimizer is responsible for developing the least expensive plan to return the requested response set. Processing alternatives are evaluated and the fastest plan is chosen. This plan is converted to executable steps, to be performed by the AMPs, which are then passed to the dispatcher. In order to maximize throughput and minimize resource contention, the optimizer must know about system configuration, available units of parallelism (number of AMPs and PE's), and data demographics of the tables in the query. The Teradata Database Optimizer is robust and intelligent. The optimizer enables the Teradata database to handle multiple complex, ad hoc queries efficiently. It is parallel-aware and cost-based.

Dispatching

The Dispatcher controls the sequence in which the steps are executed and passes the steps on to the AMPs via the BYNET. It is composed of execution-control and response-control tasks. Execution control receives the step definitions from the Parser and transmits them to the appropriate AMP(s) for processing, receives status reports from the AMPs as they process the steps, and passes the results on to response control once the AMPs have completed processing. Response control returns the results to the user. The Dispatcher sees that all AMPs have finished a step before the next step is dispatched. Depending on the nature of the SQL request, a step will be sent to one AMP, or broadcast to all AMPs.

Note: By default, Teradata Gateway software can support up to 1200 sessions per processing node. Therefore a maximum of 10 Parsing Engines can be defined for a node using the Gateway. It is possible to configure a node to support more than 1200 sessions.

The Parsing Engine is responsible for:

  • Managing individual sessions (up to 120 sessions per PE)
  • Parsing and optimizing your SQL requests
  • Building query plans with the parallel-aware, cost-based, intelligent Optimizer
  • Dispatching the optimized plan to the AMPs
  • EBCDIC/ASCII input conversion (if necessary)
  • Sending the answer set response back to the requesting client


The BYNET

      
The BYNET (pronounced, "bye-net") is a high-speed interconnect (network) that enables multiple nodes in the system to communicate.

The BYNET handles the internal communication of the Teradata database. All communication between PEs and AMPs is done via the BYNET.

When the PE dispatches the steps for the AMPs to perform, they are dispatched onto the BYNET. Messages are routed to the appropriate AMP(s) where results sets and status information are generated. This response information is also routed back to the requesting PE via the BYNET.

Depending on the nature of the dispatch request, the communication between nodes may be a:

  • Broadcast - message is routed to all nodes in the system.
  • Point-to-point - message is routed to one specific node in the system.

Once the message is on a participating node, PDE directs the message to the appropriate AMPs on that node. All AMPs receive a broadcast message. With a point-to-point or multicast (multiple AMPs) message, the message is directed only to the appropriate AMP(s) on that node by PDE.

So, while a Teradata database system does do multicast messaging, the BYNET hardware alone cannot do it - the BYNET can only do point-to-point and broadcast between nodes.

The BYNET has several unique features:

  • Fault tolerant: each network has multiple connection paths. If the BYNET detects an unusable path in either network, it will automatically reconfigure that network so all messages avoid the unusable path. Additionally, in the rare case that BYNET 0 cannot be reconfigured, hardware on BYNET 0 is disabled and messages are re-routed to BYNET 1, and vice versa.

  • Load balanced: traffic is automatically and dynamically distributed between both BYNETs.

  • Scalable: as you add nodes to the system, overall network bandwidth scales linearly - meaning an increase in system size without loss of performance.

  • High Performance: an MPP system has two BYNET networks providing hardware redundancy. Because both networks are active, the system benefits from the full aggregate bandwidth.

The technology of the BYNET is what makes the scalable parallelism of the Teradata database possible.

BYNET merges final answer set

Dual redundant, fault-tolerant, bi-directional interconnect network that enables:
  • Automatic load balancing of message traffic
  • Automatic reconfiguration after fault detection
  • Scalable bandwidth as nodes are added

The BYNET connects and communicates with all the AMPs and PEs in the system:

  • Between nodes, the BYNET hardware carries broadcast and point-to-point communications
  • On a node, BYNET software and PDE together control which AMPs receive a multicast communication
  • Merges the final answer set


The Access Module Processor (AMP)

      
The Access Module Processor (AMP) is the virtual processor (vproc) in the architecture that is responsible for managing a portion of the database. Each AMP will manage some portion of each table on the system. AMPs do the physical work associated with generating an answer set, including sorting, aggregating, formatting, and converting. The AMPs retrieve and perform all database management functions on the requested rows from a table.

An AMP accesses data from its single associated vdisk, which is made up of multiple ranks of disks. An AMP responds to Parser/Optimizer steps transmitted across the BYNET by selecting data from or storing data to its disks. For some requests, the AMPs may redistribute a copy of the data to other AMPs.

The Database Manager subsystem resides on each AMP. This subsystem will:

  • Lock databases and tables.
  • Create, modify, or delete definitions of tables.
  • Insert, delete, or modify rows within the tables.
  • Retrieve information from definitions and tables.
  • Return responses to the Dispatcher.

Earlier in this course, we discussed the logical organization of data into tables. The Database Manager subsystem provides a bridge between that logical organization and the physical organization of the data on disks. The Database Manager performs space-management functions that control the use and allocation of space.

The Teradata database performs all tasks in parallel, providing exceptional performance. The greater the number of tasks processed in parallel, the better the system performance. Many databases call themselves "parallel," but they can only perform some tasks in parallel.

Shared Nothing Architecture

The Teradata database virtual processors, or vprocs (which are the PEs and AMPs), share the components of the nodes (memory and cpu). The main component of the "shared-nothing" architecture is that each AMP manages its own dedicated portion of the system's disk space (called the vdisk) and this space is not shared with other AMPs. Each AMP uses system resources independently of the other AMPs so they can all work in parallel for high system performance overall.

AMPs perform all tasks in parallel

The AMP is responsible for:
  • Storing rows to and retrieving rows from its vdisk
  • Lock management
  • Sorting rows and aggregating columns
  • Join processing
  • Output conversion and formatting (ASCII, EBCDIC)
  • Creating answer sets for clients
  • Disk space management and accounting
  • Special utility protocols
  • Recovery processing

What is a Node?

      
Up to this point, we have discussed relational databases in terms of how the user sees them - as a collection of tables that relate to one another. Now we'll look at the physical components of the system.

A node is made up of various hardware and software components. All applications run under Linux or Windows, and all the Teradata database software runs under Parallel Database Extensions (PDE). All share the resources of CPU and memory on the node.

Access Module Processors (AMPs) and Parsing Engines (PEs) are virtual processors (vprocs) running under control of the PDE. Their numbers are software configurable.

AMPs are associated with virtual disks (vdisks), which are configured as ranks of a disk array.

The "Shared Nothing" architecture of the Teradata database means that each vproc is responsible for its own portion of the database, and for all activities to be performed on that part of the system.

All AMPs and PEs communicate via the BYNET, which is a message passing system. In an SMP (Symmetric Multi-Processing) system, it is implemented as a BYNET driver, and in MPP (Massively Parallel Processing) systems, it is implemented as a software and hardware solution. The BYNET allows multiple vprocs on multiple nodes to communicate with each other. PDE actually controls message-passing activity, while the BYNET handles message queuing and flow control.

An application that runs under the control of PDE, such as the Teradata database, is considered a Trusted Parallel Application (TPA).

Note: The versatility of the Teradata database is based on virtual processors (vprocs) that eliminate dependency on specialized physical processors. These vprocs are a set of software processes that run on a node under Parallel Database Extensions (PDE) and the multitasking environment of the operating system.

In a massively parallel processing (MPP) system, there is hardware redundancy among both the multiple nodes and the BYNET.



Parallelism

      
Parallelism is at the very heart of the Teradata database. There is virtually no part of the system where parallelism has not been built in. Without the parallelism of the system, managing enormous amounts of data would either not be possible or, best case, would be prohibitively expensive and inefficient.

Each PE can support up to 120 user sessions in parallel. This could be 120 distinct users, or a single user harnessing the power of all 120 sessions for a single application. Each session may handle multiple requests concurrently. While only one request at a time may be active on behalf of a session, the session itself can manage the activities of 16 requests and their associated answer sets.

The Message Passing Layer was designed such that it can never be a bottleneck for the system. Because the MPL is implemented differently for different platforms, this means that it will always be well within the needed bandwidth for each particular platform's maximum throughput.

Each AMP can perform up to 80 tasks (or more) in parallel. This means that AMPs are not dedicated at any moment in time to the servicing of only one request, but rather are multi-threading multiple requests concurrently. The value 80 represents the number of AMP Worker Tasks and may be changed on some systems. Because AMPs are designed to operate on only one portion of the database, they must operate in parallel to accomplish their intended results.

In truth, parallelism is built into the Teradata database from the ground up!

Notes:

  • Each PE can handle up to 120 sessions in parallel.
  • Each Session can handle multiple REQUESTS.
  • The Message Passing Layer can handle all message activity in parallel.
  • Each AMP can perform up to 80 tasks in parallel (can be configured for more).
  • All AMPs can work together in parallel to service any request.
  • Each AMP can work on several requests in parallel.

Teradata Compression Comparison

      

Multi-value Compression

Multi-value compression has three variations:
  • COMPRESS – Nulls are compressed.
  • COMPRESS NULL – Nulls are compressed. (COMPRESS & COMPRESS NULL mean the same thing)
  • COMPRESS <constant> – Nulls and the specified <constant> value are compressed.
With algorithmic compression, you must specify a UDF for compression and a UDF for decompression.

Example:

CREATE TABLE bank_accounts
 (customer_id INTEGER
 ,primary_bank INTEGER COMPRESS -- only NULL is compressed
 ,account_type CHAR(10) COMPRESS 'Savings' -- both NULL and 'Savings' are compressed
 ,account_notes VARCHAR(500) CHARACTER SET UNICODE
	COMPRESS USING TD_SYSFNLIB.TransUnicodeToUTF8
	DECOMPRESS USING TD_SYSFNLIB.TransUTF8ToUnicode -- algorithmic compression
 . . . );

Multi-value Compression vs. VARCHAR

For character-based data, an alternative to Teradata compression is the VARCHAR(N) data type. The number of bytes used to store each field is the length of the data item plus two bytes. Contrast this with a fixed-length CHAR(N) data type, which takes N bytes per row regardless of the actual number of characters in each field. Combining Teradata compression with a fixed-length character data type can be a very effective space saver. MVC can compress up to 255 distinct values per column.

Data demographics can help determine whether variable-length character data type or fixed length plus compression is more efficient. The most important factors are the maximum field length, the average field length, and the compressibility of the data.
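
To illustrate the trade-off, here is a minimal sketch that stores a city value both as a fixed-length CHAR with multi-value compression and as a VARCHAR; the table and column names are made up for this example:

CREATE TABLE customer_address
 (customer_id INTEGER
 ,city_fixed  CHAR(20) COMPRESS ('Chicago', 'Atlanta', 'San Diego') -- fixed length plus MVC
 ,city_var    VARCHAR(20)                                           -- variable length, 2-byte overhead per value
 );

Whether city_fixed or city_var uses less space depends on the demographics described above: the maximum length, the average length, and how often the compressed values actually occur.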

Algorithmic Compression

When column values are mostly unique, Algorithmic Compression can provide better compression results than Multi-value Compression. You can define your own compression and decompression algorithms and apply them to data at the column level.

A user can create their own compression algorithms as external C/C++ scalar UDFs, and then specify them in the column definition of a CREATE TABLE/ALTER TABLE statement. Teradata invokes these algorithms to compress and decompress the data when the data is inserted into or retrieved from the table.

ALC allows you to implement the compression scheme that is most suitable for data in a particular column. The cost of compression and decompression depends on the algorithm chosen.

Block Level Compression

Performs compression on whole data blocks at the file system level before they are written to storage.

  • BLC will compress/decompress only data blocks but will not be used on any file system structures such as Master/Cylinder indexes, the WAL log and table headers.
  • Only primary data subtable and fallback copies can be compressed. Both objects are either compressed or neither is compressed.
  • Secondary Indexes (USI/NUSI) cannot be compressed but Join Indexes can be compressed since they are effectively user tables.
  • Only the compressed form of the data block is cached, so each block access must decompress the data block.
  • On initial delivery, a single compression algorithm will be supplied and used.
  • Spool data blocks can be compressed via a system-wide tunable, as well as Temporary, WORK (sort Work space), and permanent journal.
  • Once enabled, reversion to an earlier release for compressed tables isn't allowed.
  • Teradata 14.0 added Temperature Based Block Level Compression (TBBLC) to provide greater compression flexibility while mitigating potential performance trade-offs.

Comparison of Multi-value Compression (MVC), Algorithmic Compression (ALC), and Block Level Compression (BLC/TBBLC):

Ease of Use
  • MVC: Easy to apply to well-understood data columns and values.
  • ALC: Easy to apply on a column with CREATE TABLE.
  • BLC: Set once and forget.

Analysis Required
  • MVC: Need to analyze data for common values.
  • ALC: Use Teradata algorithms or user-defined compression algorithms to match unique data patterns.
  • BLC: Need to analyze the CPU overhead trade-off (can be turned on for all data on the system or applied on a per-table basis).

Flexibility
  • MVC: Works for a wide variety of data and situations.
  • ALC: Automatically invoked for values not replaced by MVC.
  • BLC: Automatically combined with other compression mechanisms.

Performance Impact
  • MVC: No or minimal CPU usage.
  • ALC: Depends on the compression algorithm used.
  • BLC: Reduced I/O due to compressed data blocks (CPU cycles are used to compress/decompress).

Applicability
  • MVC: Replaces common values.
  • ALC: Industry data, UNICODE, Latin data.
  • BLC: All data.

A customer can choose any combination or all three on a column/table.


Layers of the Data Warehouse Architecture

      
There are several "layers" that make up the EDW environment. These layers include:

  • Staging Layer
    The primary purpose of the staging layer is to perform data transformation and loading during either the ETL or ELT process.

  • Access Layer
    Access is often provided via views and business intelligence (BI) tools; whether a Teradata application or a third party tool.

  • Physical Layer
    The physical layer is where denormalizations that will make access more efficient occur; pre-aggregations, summary tables, join indexes, etc.

  • Data Lab
    This is Teradata's implementation of a controlled "sandbox" environment. It enables rapid loading and analysis of new data combined with existing warehouse data. It quickly assimilates untested data into a "non-production" portion of the warehouse, providing a self-provisioning, self-service environment for prototyping and analysis on new, un-cleansed data, which is temporary in nature.

    Data Lab is composed of two Viewpoint portlets.

Teradata Data Lab - Interface

Teradata Database Objects

      
A "database" or "user" in Teradata database systems is a collection of objects such as tables, views, macros, triggers, stored procedures, user-defined functionsor indexes (join and hash). Database objects are created and accessed using standard Structured Query Language or SQL. Starting with Teradata Release 14.10, extended object names feature, allows object names to be up to 128 characters where prior it was a 30-byte limit.

All database object definitions are stored in a system database called the Data Dictionary/Directory (DD/D).

Databases provide a logical grouping for information. They are also the foundation for space allocation and access control.

Tables

A table is the logical structure of data in a relational database. It is a two-dimensional structure made up of columns and rows.

Views

A view is a pre-defined subset of one or more tables or other views. It does not exist as a real table, but serves as a reference to existing tables or views. One way to think of a view is as a virtual table. Views have definitions in the data dictionary, but do not contain any physical rows. Views can be used to implement security and control access to the underlying tables. They can hide columns from users, insulate applications from changes, and simplify or standardize access techniques. They also provide a mechanism to include a locking modifier. Views provide well-defined, tested, high-performance access to data. X-Views are special views that limit the display of data to the user who accesses them.
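
For instance, a view that standardizes access and includes a locking modifier might look like the following sketch (the view, table, and column names are hypothetical):

REPLACE VIEW Customer_V AS
LOCKING ROW FOR ACCESS      -- allow reads while updates are in progress
SELECT Cust, Name, Phone
FROM   Customer;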

Macros

A macro is a predefined, stored set of one or more SQL commands and optionally, report formatting commands. Macros are used to simplify the execution of frequently used SQL commands. Macros may also be used to limit user access to data.
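
A minimal macro sketch (the macro, table, and parameter names are hypothetical) that wraps a frequently used query:

CREATE MACRO Get_Customer (cust_no INTEGER) AS
 (SELECT Cust, Name, Phone
  FROM   Customer
  WHERE  Cust = :cust_no; );

EXEC Get_Customer (45);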

Triggers

A trigger is a set of SQL statements usually associated with a column or a table and when that column changes, the trigger is fired - effectively executing the SQL statements.

Stored Procedures

A stored procedure is a program that is stored within Teradata and executes within the Teradata database. A stored procedure uses permanent disk space and may be used to limit a user's access to data.
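
A hedged sketch of a simple stored procedure (procedure, table, and parameter names are assumed for illustration):

CREATE PROCEDURE Add_Customer
 (IN p_cust  INTEGER
 ,IN p_name  CHAR(10)
 ,IN p_phone CHAR(8))
BEGIN
  -- parameters are referenced with a leading colon inside the procedure body
  INSERT INTO Customer (Cust, Name, Phone)
  VALUES (:p_cust, :p_name, :p_phone);
END;

CALL Add_Customer (99, 'Nguyen', '555-1212');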

User Defined Functions (UDF)

A User-Defined Function (UDF) allows authorized users to write external functions. Teradata allows users to create scalar functions to return single value results, aggregate functions to return summary results, and table functions to return tables. UDFs may be used to protect sensitive data such as personally identifiable information (PII), even from the DBA(s).

X-Views limit the display of results to the user accessing them.

Teradata Database Space

      
There are three types of space within the Teradata database:

  1. Perm Space
    All databases and users have a defined upper limit of Permanent Space. Permanent Space is used for storing the data rows of tables. Perm Space is not pre-allocated. It represents a maximum limit.

  2. Spool Space
    All databases also have an upper limit of Spool Space. If there is no limit defined for a particular database or user, limits are inherited from parents. Theoretically, a user could use all of the unallocated space in the system for their query. Spool Space is unused Perm Space that is used to hold intermediate query results or formatted answer sets for queries. Once the query is complete, the Spool Space is released.

    Example: You have a database with total disk space of 100GB. You have 10GB of user data and an additional 10GB of overhead. What is the maximum amount of Spool Space available for queries?

    Answer: 80GB. All of the remaining space in the system is available for spool.

  3. Temp Space
    The third type of space is Temporary Space. Temp Space is used for global temporary tables, and these results remain available to the user until their session is terminated. Tables created in Temp Space will survive a restart. Temp Space is permanent space currently not used.

Space Management

      
The way the Teradata database handles space management is different from other database implementations.

The initial Teradata database system comes with several users and databases:

  • Users: DBC, SysAdmin, SystemFE, TDPuser
  • Databases: Crashdumps, Default, All, Public

Initially, most of the space in the system belongs to user DBC, with some allocated to each of the users or databases listed above. Database DBC contains all of the RDBMS software components and system tables.

Before defining application users and databases, the Database Administrator should first create a special administrative user and assign most of the space in the system to that user. This space comes from user DBC. User DBC becomes the owner of the administrative user.

There should be enough space left in user DBC to accommodate the growth of system tables and logs.

As the administrative user creates additional users and databases, space assigned to those objects will be subtracted from the administrative user's space. As these users and databases create subordinate objects, they will give up some of their space to these new users and databases.

The clear boxes represent all the space available in the system. The shaded boxes represent space that is being used in the system by tables. Space that is not currently being used by tables is available for use as spool or temporary space. If a table is dropped, that space becomes available to the system again.

Database 1 and Database 2 contain the production tables. Database 3 contains views and macros to control access to the data by end users and does not consume any database space. User D is an example of an end user who has no Perm Space but has been granted rights on certain views and macros to accomplish work. This user can still execute queries and create views or macros. User A, User B, and User C are examples of application developers who need some Perm Space to hold sample data for program testing and validation.

Most database vendors do not handle space this way. Once space is allocated to a table, it cannot be made available again without the Database Administrator having to perform a re-organization and re-allocate the space and partition the data.

  • A new database or user must be created from an existing database or user.
  • All Perm Space limits are subtracted from the owner.
  • Perm Space is a zero-sum game - the total of all Perm Space limits must equal the total amount of disk space available.
  • Perm Space currently not being used is available as Spool Space or Temp Space.
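
As a sketch of how this might look in practice (user names, passwords, and byte counts are purely illustrative), an administrator could carve space out of DBC like this:

CREATE USER SysDBA FROM DBC AS
  PERM     = 800e9          -- limit subtracted from DBC's Perm Space
 ,SPOOL    = 200e9
 ,PASSWORD = secret123;

CREATE DATABASE Database1 FROM SysDBA AS
  PERM = 100e9;             -- limit subtracted from SysDBA's Perm Space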

How Does the Teradata Database Distribute Rows?

      
Each AMP is designed to hold a portion of the rows of each table. An AMP is responsible for the storage, maintenance and retrieval of the data under its control. The Teradata database uses hashing to dynamically distribute data across all AMPs.

The Teradata database's automatic hash distribution eliminates costly data maintenance tasks. There is no specific order to the placement of the data. The benefits of having unordered data are that they don't need any maintenance to preserve order, and they are independent of any query being submitted. There is never a need to reorganize data or index space which makes the database easier to manage and maintain. As a result, strategic business data is more accessible to the users.

The benefits of automatic data placement include:

  • Distribution is the same regardless of data volume.
  • Distribution is based on row content, not data demographics.
  • Rows are moved or re-blocked automatically.

Ideally, the rows of every table will be distributed among all of the AMPs. There may be some circumstances where this is not true. For example, what if there are fewer rows than AMPs? In this case, at least some AMPs will hold no rows from that table. (This example should be considered the exception, and not the rule.)

Ideally, the rows of each table will be evenly distributed across all of the AMPs. Even distribution is desirable because in an operation involving all rows of the table (such as a full-table scan), each AMP will have an equal portion of work to do. When workloads are not evenly distributed, the desired response can only be as fast as the slowest AMP.

Remember: The Teradata database uses a Shared-Nothing Architecture - each AMP is responsible for its own portion of the database and does not share it with any other AMP.

Data Management

Many other databases require you to manually partition the data. Teradata automates the placement and maintenance of physical data location.

With the Teradata database, your DBA can spend more time with users developing strategic applications to beat your competition.

Loading, inserting, updating, and deleting data affects manual data distribution schemes, thereby reducing query performance and requiring reorganization.

Primary Keys and Primary Indexes

      
While it is true that many tables use the same columns for both Primary Indexes and Primary Keys, indexes are conceptually different from keys. The table below summarizes those differences.

A Primary Key is a relational data modeling term that defines, in the logical model, the columns that uniquely identify a row. A Primary Index is a physical database implementation term that defines the actual columns used to distribute and access rows in a table.

It is also true that a significant percentage of the tables in any database will use the same column(s) for both the PI and the PK. However, one should expect that in any real-world scenario there would be some tables that will not conform to this simplistic rule. Only through a careful analysis of the type of processing that will take place can the tables be properly evaluated for PI candidates. Remember, changing your mind about the columns that comprise the PI means recreating (and reloading) the table.

  • Indexes are conceptually different from keys.
  • A PK is a relational modeling convention which allows each row to be uniquely identified.
  • A PI is a Teradata convention which determines how the row will be stored and accessed.
  • A significant percentage of tables MAY use the same columns for both the PK and the PI.
  • A well-designed database WILL use a PI that is different from the PK for some tables.
Primary Key (PK)                                        | Primary Index (PI)
Logical concept of data modeling                        | Physical mechanism for access and storage
Teradata does not need to recognize the PK              | Each table can have (at most) one Primary Index
No limit on the number of columns                       | 64-column limit
Documented in the data model (optional in CREATE TABLE) | Defined in the CREATE TABLE statement
Must be unique                                          | May be unique or non-unique
Identifies each row                                     | Identifies 1 row (UPI) or multiple rows (NUPI)
Values should not change                                | Values may be changed (Delete + Insert)
May not be NULL - requires a value                      | May be NULL
Does not imply an access path                           | Defines the most efficient access path
Chosen for logical correctness                          | Chosen for physical performance



Primary Key (PK) vs. Primary Index (PI)

      
Primary Indexes are conceptually different from Primary Keys.

  • Primary Key (PK) is a relational data-modeling term that defines the attribute(s) used to uniquely identify a tuple in an entity.

  • Primary Index (PI) is a physical database implementation term that defines the column(s) used to distribute the rows in a table. The Primary Index column(s) value may be unique or non-unique.

  • Unique Primary Index (UPI) the values in the Primary Index column(s) are unique which results in even row distribution and the elimination of duplicate row checks during data load operations.

  • Non-Unique Primary Index (NUPI) the values in the Primary Index column(s) are non-unique which results in potentially skewed row distribution - all duplicate Primary Index values go to the same AMP.

Accessing a row by its Primary Index value is the most efficient way to access a row and is always a one-AMP operation. Choosing a Primary Index is not an exact science. It requires analysis and thought for some tables and will be completely self-evident on others. Sometimes the obvious choice will be the Primary Key, which is known to be unique. Sometimes the choice of Primary Index may have to do with join performance and known access paths, and will therefore be a different choice than the Primary Key of its source entity.

Example: PK vs. PI. An insurance claims table may contain the following columns: Claim ID, Date, Insurance Provider ID, Patient ID, and Claim Amount. The Primary Key is defined as Claim ID and Date. The same claim number may be submitted on one date and then resubmitted on another date, so you need both columns to look up a claim and its details. Other users may often be analyzing provider fraud behavior and need to access the claim by the Provider ID. In this case, the Primary Index may be the Provider ID column. Even if data distribution may be skewed, a choice like this might be made to provide efficient access to a table.
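
A sketch of that claims table (the identifiers are abbreviated and hypothetical), with the Primary Key documented as a constraint while the Primary Index is chosen for access:

CREATE TABLE Claims
 (Claim_Id     INTEGER NOT NULL
 ,Claim_Date   DATE    NOT NULL
 ,Provider_Id  INTEGER
 ,Patient_Id   INTEGER
 ,Claim_Amount DECIMAL(10,2)
 ,CONSTRAINT claims_pk PRIMARY KEY (Claim_Id, Claim_Date))  -- enforced as a unique secondary index
PRIMARY INDEX (Provider_Id);                                -- NUPI chosen for access, not uniqueness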

Primary Key (PK)                        | Primary Index (PI)
Logical concept of data modeling        | Mechanism for row distribution and access
Teradata does not need the PK defined   | A table can have one (or no) Primary Index
No limit on the number of attributes    | May be from 1 to 64 columns
Documented in the logical data model    | Defined in the CREATE TABLE statement
Must be unique                          | Value may be unique or non-unique
Unique identifier                       | Used to place a row on an AMP
Value should not change                 | Value may be changed (updated)
May not be NULL                         | May be NULL
Does not imply an access path           | Defines the most efficient access path
Chosen for logical correctness          | Chosen for physical performance


Primary Indexes

      
With the Teradata database, the Primary Index (PI) is the physical mechanism for assigning a data row to an AMP and a location on its disks. Indexes are also used to access rows from a table without having to search the entire table. A table may have at most one Primary Index, and the Primary Index cannot be changed (i.e., altered, added, or dropped) once the table is created.

Choosing a Primary Index for a table is perhaps the most critical decision a database designer makes. The choice will affect the distribution of the rows in the table and the performance of the table in a production environment. Although many tables use combined columns as the Primary Index choice, the examples we use here are single-column indexes.

There are two types of primary indexes; unique (UPI) and non-unique (NUPI).

Unique Primary Index (UPI) is a column that has no duplicate values. UPIs are desirable because they guarantee even distribution of table rows. With a UPI, there is no duplicate row checking done during a load, which makes it a faster operation.

Because it is not always feasible to pick a Unique Primary Index, it is sometimes necessary to pick a column (or columns) which have non-unique, or duplicate values. This type of index is called a Non-Unique Primary Index or NUPI. A NUPI allows duplicate values as well as access to the non-unique columns. A NUPI can cause skewed data. While not a guarantor of uniform row distribution, the degree of uniqueness of the index will determine the degree of uniformity of the distribution. Because all rows with the same PI value end up on the same AMP, columns with a small number of distinct values that are repeated frequently do not make good PI candidates.

Accessing the row by its Primary Index value is the most efficient way to access a row and is always a one-AMP operation. Choosing a Primary Index is not an exact science. It requires analysis and thought for some tables and will be completely self-evident on others. Sometimes the obvious choice will be the Primary Key, which is known to be unique. Sometimes the choice of Primary Index may have to do with join performance (enabling AMP-local joins to other tables) and known access paths (allowing local joins to similar entities), and will therefore be a different choice than the Primary Key of the table.

The Teradata database is unique in hashing data directly to a physical address on disk, creating an even distribution of data. This allows a balanced application of parallelism, and avoids imbalance due to data skew. Rows are hashed based on the Primary Index.

Defining the Primary Index

      
When a table is created, the Primary Index should be specified. The Primary Index may consist of a single column or a combination up to 64 columns. Columns are assigned data types and may optionally be assigned attributes and constraints.

Once you choose the Primary Index for a table, it cannot be changed. The existing table must be dropped and a new table created to change the Primary Index definition.

Note: While you cannot change the Primary Index itself, values in a Primary Index column may be changed. The Teradata database simply rehashes that row to its new location, based on the Primary Index value.

The table definition is stored in the Data Dictionary (DD).

UPI Table:

CREATE TABLE Table1
  ( Col1  INTEGER
   ,Col2  INTEGER
   ,Col3  INTEGER )
UNIQUE PRIMARY INDEX (Col1);

NUPI Table:

CREATE TABLE Table2
  ( Col1  INTEGER
   ,Col2  INTEGER
   ,Col3  INTEGER )
PRIMARY INDEX (Col2);

  • The Primary Index is the mechanism used to assign a row to an AMP.
  • The Primary Index may be Unique (UPI) or Non-Unique (NUPI).
  • Unique Primary Indexes result in even row distribution and eliminate duplicate row checking.
  • Non-Unique Primary Indexes result in row distribution proportional to the number of duplicate values, which may cause skewed distribution.

Row Distribution via Hashing

      
The diagram below gives an overview of Primary Index hash distribution, which is the process by which all data is distributed in the Teradata database.

The Primary Index value is fed into the Hashing Algorithm, which produces the Row Hash. The row goes onto the Communications Layer. The Hash Map, in combination with the Row Hash determines which AMP gets the row. The Hash Map is a part of the Communications Layer interface.

The Teradata database's hash partitioning is the only approach which offers balanced processing and balanced data placement, while minimizing interconnect traffic and freeing the DBA from time-consuming reorganizations. The Teradata File System eliminates the need to rebuild indexes when tables are updated or structures change.

Hashing Down to the AMPs

The Teradata database is unique in hashing data directly to a physical address on disk, creating an even distribution of data. This allows a balanced application of parallelism, and avoids imbalance due to data skew. The Teradata database's hashed data distribution scheme provides optimized performance with minimal tuning and no manual reorganizations, resulting in lower administration costs and reduced development time.

The rows of all tables are distributed across the AMPs according to their primary index value. The primary index value goes into the hashing algorithm. The output is a 32-bit row hash. The high-order 16 bits are referred to as the hash bucket number and are used to identify a hash-map entry. This entry is used to identify the AMP to be targeted. The remaining 16 bits are not used to locate the AMP.

Teradata provides customers with two choices for the size and number of the hash buckets in a system:

  • 16 bits (65,536 hash buckets, the same as in previous releases)
  • 20 bits (1,048,576 hash buckets)

For larger systems, such as those with 1,000 AMPs or more, the 20-bit option will result in more hash buckets to allow workloads to be distributed more evenly over the AMPs. New systems are also candidates for the expanded hash bucket option.

The entire 32-bit row hash will be used by the selected AMP to locate the row within its disk space.

Hash maps are uniquely configured for each size of system. A 50-AMP system will have a different hash map than a 30-AMP system.

A hash map is simply an array that associates bucket numbers with specific AMPs. When a system grows, new AMPs are typically added. Adding AMPs requires a change to the hash map to reflect the new total number of AMPs.
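
The hash functions can be queried directly to see how rows would spread across AMPs. The sketch below (the table and column names are assumed from the examples that follow) counts rows per AMP for a candidate primary index column:

SELECT HASHAMP(HASHBUCKET(HASHROW(Order_Number))) AS amp_no
      ,COUNT(*)                                   AS row_count
FROM   Orders
GROUP BY 1
ORDER BY 1;

A roughly equal row_count per AMP indicates even distribution; large differences indicate skew.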

UPI Row Distribution

      
The hashing algorithm predictably distributes rows across all AMPs. The same value stored in the same data type will always produce the same hash value. If the Primary Index is unique, the Teradata database can distribute the rows evenly. If the Primary Index is non-unique, but there are a fairly even number of rows per index value, the table will still distribute evenly. But if there are hundreds or thousands of rows for some index values, the distribution will probably be skewed.

In the example below, the Order_Number is used as a unique primary index. Since the primary index value for Order_Number is unique, the distribution of rows among AMPs is uniform. This assures maximum efficiency because each AMP does approximately the same amount of work. No AMPs sit idle waiting for another AMP to finish a task.

This way of storing the data provides for maximum efficiency and makes the best use of the parallel features of the Teradata database.

Order table

Order_Number (PK, UPI)  Customer_Number  Order_Date  Order_Status
7325                    2                4/13        O
7324                    3                4/13        O
7415                    1                4/13        C
7103                    1                4/10        O
7225                    2                4/15        C
7384                    1                4/12        C
7402                    3                4/16        C
7188                    1                4/13        C
7202                    2                4/09        C
  • Order_Number values are unique (UPI).
  • The rows will distribute evenly across the AMPs.


Unique Primary Index (UPI) Access

      
Accessing a table using a Primary Index is the most efficient way to access the data and is always a one-AMP operation. In the case of a UPI, the one-AMP access can return, at most, one row. In the example below, we are looking for the row whose primary index value is 45. By specifying the PI value as part of our selection criteria, we are guaranteed that only the AMP containing the specified row will be searched.

The correct AMP is located by taking the PI value and passing it through the hashing algorithm. Hashing takes place in the Parsing Engine. The output of the hashing algorithm contains information that will point to a specific AMP. Once it has isolated the appropriate AMP, finding the row is quick and efficient.

SELECT *
FROM  Customer
WHERE Cust = 45;

CREATE TABLE Customer
  ( Cust  INTEGER
   ,Name  CHAR(10)
   ,Phone CHAR(8) )
UNIQUE PRIMARY INDEX (Cust);
       

CUSTOMER table

Cust (PK, UPI)  Name    Phone
37              White   555-4444
98              Brown   333-9999
74              Smith   555-6666
95              Peters  555-7777
27              Jones   222-8888
56              Smith   555-7777
45              Adams   444-6666
84              Rice    666-5555
49              Smith   111-6666
51              Marsh   888-2222
31              Adams   111-2222
62              Black   444-5555
12              Young   777-4444
77              Jones   777-6666


NUPI Row Distribution

      
In the example below, Customer_Number has been used as a non-unique Primary Index (NUPI). Note that row distribution among AMPs is uneven. All rows with the same primary index value (with the same customer number) are stored on the same AMP.

Customer_Number has three possible values, so all the rows are hashed to three AMPs, leaving the fourth AMP without rows from this table. While this distribution will work, it is not as efficient as spreading all rows among all AMPs.

AMP 2 has a disproportionate number of rows and AMP 3 has none. In an all-AMP operation, AMP 2 will take longer than the other AMPs. The operation cannot complete until AMP 2 completes its tasks. Overall operation time is increased and some AMPs are under-utilized.

This illustrates how NUPIs can create irregular distributions, called skewed distributions. AMPs that have more than an average number of rows will take longer for full-table operations than other AMPs. Because an operation is not complete until all AMPs have finished, the operation where distribution is skewed will take longer than it would if all AMPs were utilized evenly.

Order table

Order_Number (PK)  Customer_Number (NUPI)  Order_Date  Order_Status
7325               2                       4/13        O
7324               3                       4/13        O
7415               1                       4/13        C
7103               1                       4/10        O
7225               2                       4/15        C
7384               1                       4/12        C
7402               3                       4/16        C
7188               1                       4/13        C
7202               2                       4/09        C
  • Customer_Number values are non-unique (NUPI).
  • Rows with the same PI value distribute to the same AMP causing skewed row distribution.


Non-Unique Primary Index (NUPI) Access

      
Accessing a table using a Non-Unique Primary Index (NUPI) is also a one-AMP operation. In the case of a NUPI, the one-AMP access can return zero to many rows. In the example below, we are looking for the rows whose primary index value is 555-7777. By specifying the PI value as part of our selection criteria, we are once again guaranteeing that only the AMP containing the required rows will need to be searched.

The correct AMP is located by taking the PI value and passing it through the hashing algorithm executed in the Parsing Engine. The output of the hashing algorithm will once again point to a specific AMP. Once it has isolated the appropriate AMP, it must find all rows that have the specified value. In the example, the AMP returns two rows.

SELECT *
FROM  Customer
WHERE Phone = '555-7777';

CREATE TABLE Customer
  ( Cust  INTEGER
   ,Name  CHAR(10)
   ,Phone CHAR(8) )
PRIMARY INDEX (Phone);
       

CUSTOMER table

Cust (PK)  Name    Phone (NUPI)
37         White   555-4444
98         Brown   333-9999
74         Smith   555-6666
95         Peters  555-7777
27         Jones   222-8888
56         Smith   555-7777
45         Adams   444-6666
84         Rice    666-5555
49         Smith   111-6666
51         Marsh   888-2222
31         Adams   111-2222
62         Black   444-5555
12         Young   777-4444
77         Jones   777-6666
72         Adams   666-7777
40         Smith   222-3333

Single AMP access with 0 to n rows returned.


Highly Non-Unique NUPI Row Distribution

      
The example below uses Order_Status as a NUPI. Order_Status is a poor choice, because it yields the most uneven distribution. Because there are only two possible values for Order_Status, all rows are placed on two AMPs.

This is an example of a highly non-unique Primary Index.

When choosing a Primary Index, never pick a column with a severely-limited value set. The degree of uniqueness is critical to efficiency. Choose NUPIs that allow all AMPs to participate fairly equally.

Order table

Order_Number (PK)  Customer_Number  Order_Date  Order_Status (NUPI)
7325               2                4/13        O
7324               3                4/13        O
7415               1                4/13        C
7103               1                4/10        O
7225               2                4/15        C
7384               1                4/12        C
7402               3                4/16        C
7188               1                4/13        C
7202               2                4/09        C
  • Order_Status values are highly non-unique (NUPI).
  • Only two values exist. The rows will be distributed to two AMPs.
  • This table will not perform well in parallel operations.
  • Highly non-unique columns are poor PI choices.
  • The degree of uniqueness is critical to efficiency.



Partitioned Primary Index (PPI)

      
The Teradata database has a partitioning mechanism called Partitioned Primary Index (PPI). PPI is used to improve performance for large tables when you submit queries that specify a range constraint. PPI allows you to reduce the number of rows to be processed by using a technique called partition elimination. PPI will increase performance for incremental data loads, deletes, and data access when working with large tables with range constraints. PPI is useful for instantaneous dropping of old data and rapid addition of new data. Using PPI lets you avoid having to do a full-table scan without the overhead of a Secondary Index.

How Does PPI Work?

Data distribution with PPI is still based on the Primary Index:

  • Primary Index >> Hash Value >> Determines which AMP gets the row

With PPI, the ORDER in which the rows are stored on the AMP is affected. Using the traditional method, a Non-Partitioned Primary Index (NPPI), the rows are stored in row hash order.

Using PPI, the rows are stored first by partition and then in row hash/row ID order. In our example, there are four partitions. Within each partition, the rows are stored in row hash order.

Data Storage Using PPI

To store rows using PPI, specify Partitioning in the CREATE TABLE statement. The query will run through the hashing algorithm as normal, and come out with the Base Table ID, the Partition number(s), the Row Hash, and the Primary Index values.

Partition Elimination is used to access tables with a PPI. The term "partition elimination" refers to an automatic optimization in which the optimizer determines, based on query conditions, that some partitions can't contain qualifying rows, and causes those partitions to be skipped. Partitions that are skipped for a particular query are called excluded partitions. Generally, the greatest benefit of a PPI table is obtained from partition elimination.

If the partitioning expression references any columns that aren't part of the primary index, then PI access may be slower. When all partitioning expression columns are contained in the PI, then PI access is unchanged.

The Orders table defined with a Non-Partitioned Primary Index (NPPI) on Order_Number (O_#).

Partitioned Primary Indexes:

  • Improve performance on range constraint queries
  • Use partition elimination to reduce the number of rows accessed
  • May use a CASE_N or RANGE_N partitioning expression
The Orders table defined with a Primary Index on Order_Number (O_#) Partitioned By Order_Date (O_Date) (PPI).
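
As an illustrative sketch only (the table, columns, and date range are assumed rather than taken from the figures), an Orders table partitioned by month might be defined and queried like this:

CREATE TABLE Orders
 (O_Number  INTEGER NOT NULL
 ,O_CustNum INTEGER
 ,O_Date    DATE NOT NULL
 ,O_Status  CHAR(1))
PRIMARY INDEX (O_Number)
PARTITION BY RANGE_N (O_Date BETWEEN DATE '2022-01-01'
                             AND     DATE '2024-12-31'
                             EACH INTERVAL '1' MONTH);

-- Partition elimination: only the partition for April 2024 is scanned.
SELECT *
FROM   Orders
WHERE  O_Date BETWEEN DATE '2024-04-01' AND DATE '2024-04-30';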


Multilevel Partitioned Primary Index (MLPPI)

      
A PPI can be sub-partitioned, increasing the opportunities for partition elimination at the various additional levels (or combinations of levels). With a multilevel PPI, you create multiple access paths to the rows in the base table that the Optimizer can choose from. This improves response to business questions by improving the performance of queries which take advantage of partition elimination.

  • Allows multiple partitioning expressions on a table or a non-compressed join index - including base tables, Global Temp tables, and Volatile tables.
  • Multilevel partitioning allows each partition at a level to be sub-partitioned. Each partitioning level is defined independently using a RANGE_N or CASE_N expression.
  • A multilevel PPI allows efficient searches by using partition elimination at the various levels or combination of levels.
  • Allows flexibility on which partitioning expression to use when there are multiple choices of partitioning expressions.
For example, an insurance claims table could be partitioned by claim date and then sub-partitioned by state. The analysis performed for a specific state (such as Connecticut) within a date range that is a small percentage of the many years of claims history in the data warehouse (such as June 2012) would take advantage of partition elimination for faster performance.

Note: an MLPPI table must have at least two partition levels defined.

Advantages of partitioned tables:

  • They reduce the I/O for range constraint queries.
  • They take advantage of dynamic partition elimination.
  • They provide multiple access paths to the data, and an MLPPI provides even more partition elimination and more partitioning expression choices (i.e., you can use last name or some other value that is more readily available to query on).
  • The Primary Index may be either a UPI or a NUPI; a NUPI allows local joins to other similar entities.
  • Row hash locks are used for SELECT with equality conditions on the PI columns.
  • Partitioned tables allow for fast deletes of data in a partition.
  • They allow for range queries without a secondary index.
  • Specific partitions may be archived or deleted.
  • May be created on volatile tables, global temp tables, base tables, and non-compressed join indexes.
  • May replace a Value Ordered NUSI for access.

Disadvantages of partitioned tables:

  • Rows in a partitioned table are eight bytes longer
  • Access via the Primary Index may take longer
  • Full table joins to an NPPI table with the same PI may take longer

Syntax:
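
A minimal sketch of a two-level MLPPI definition, combining RANGE_N and CASE_N; the table, columns, states, and date range are hypothetical:

CREATE TABLE Claims_MLPPI
 (Claim_Id     INTEGER NOT NULL
 ,Claim_Date   DATE    NOT NULL
 ,State_Code   CHAR(2) NOT NULL
 ,Claim_Amount DECIMAL(10,2))
PRIMARY INDEX (Claim_Id)
PARTITION BY (RANGE_N (Claim_Date BETWEEN DATE '2010-01-01'
                                  AND     DATE '2014-12-31'
                                  EACH INTERVAL '1' MONTH)
             ,CASE_N (State_Code = 'CT'
                     ,State_Code = 'NY'
                     ,NO CASE, UNKNOWN));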

What is a NoPI Table?

      
A NoPI Table is simply a table without a primary index.

Prior to Teradata database 13.0, Teradata tables required a primary index. The primary index was used to hash and distribute rows to the AMPs according to hash ownership. The objective was to divide data as evenly as possible among the AMPs to make use of Teradata's parallel processing. Each row stored in a table has a RowID which includes the row hash that is generated by hashing the primary index value. For example, the optimizer can choose an efficient single-AMP execution plan for SQL requests that specify values for the columns of the primary index.

Starting with Teradata database 13.0, a table can be defined without a primary index. This feature is referred to as the NoPI Table feature. NoPI stands for No Primary Index.

Without a PI, the hash value as well as AMP ownership of a row is arbitrary. Within the AMP, there are no row-ordering constraints and therefore rows can be appended to the end of the table as if it were a spool table. Each row in a NoPI table has a hash bucket value that is internally generated. A NoPI table is internally treated as a hashed table; it is just that typically all the rows on one AMP will have the same hash bucket value.
With a NoPI table:

  • INSERTS/UPDATES to the table are always appended at the end of the table
  • The table must always be Multi-set
  • There are no duplicate row checks performed upon load

If you use FastLoad to load rows into a NoPI table, the table cannot have secondary indexes, join indexes, CHECK constraints, or referential integrity constraints, because FastLoad cannot load a table defined with any of these features. You can, however, use FastLoad to load rows into unpartitioned NoPI tables with row-level security defined.

When FastLoading to a NoPI table, the rows are distributed a block at a time.

Example of creating a NoPI table:

CREATE TABLE table_x
 	(col_x INTEGER
 	,col_y CHAR(10)
 	,col_z DATE)
NO PRIMARY INDEX;


Teradata Columnar

      
Teradata Columnar, or Column Partitioning (CP), is a physical database design implementation option (introduced in Teradata 14.0) that allows single columns or sets of columns of a NoPI table to be stored in separate partitions. Column partitioning can also be applied to join indexes.

Columnar is simply a new approach for organizing the data of a user-defined table or join index on disk.

Columnar offers the ability to partition a table or join index by column. It can be used alone or in combination with row partitioning in multilevel partitioning definitions. Column partitions may be stored using traditional 'ROW' storage or alternatively stored using the new 'COLUMN' storage option. In either case, columnar can automatically compress physical rows where appropriate.

The key benefit in defining row-partitioned (PPI) tables is when queries access a subset of rows based on constraints on one or more partitioning columns. The major advantage of using column partitioning is to improve the performance of queries that access a subset of the columns from a table, either for predicates (e.g., WHERE clause) or projections (i.e., SELECTed columns).

Because sets of one or more columns can be stored in separate column partitions, only the column partitions that contain the columns needed by the query need to be accessed. Just as row-partitioning can eliminate rows that need not be read, column partitioning eliminates columns that are not needed.

The advantages of both can be combined, meaning even less data moved and thus reduced I/O. Fewer data blocks need to be read since more data of interest is packed together into fewer data blocks.

Columnar makes more sense in CPU-rich environments because CPU cycles are needed to "glue" columns back together into rows, for compression and for different join strategies (mainly hash joins).

Example of creating a Columnar table:

CREATE TABLE table_y
	(col_a CHAR(25) NOT NULL
	,col_b CHAR(25) NOT NULL
	,col_c DATE NOT NULL
	,col_d CHAR(7) NOT NULL
	,col_e INTEGER)
PARTITION BY COLUMN;


Teradata Columnar Benefits

      
Benefits of using the Teradata Columnar feature include:
  • Improved query performance: Column partitioning can be used to improve query performance via column partition elimination. Column partition elimination reduces the need to access all the data in a row while row partition elimination reduces the need to access all the rows.

  • Reduced disk space: The feature also allows for the possibility of using a new auto-compression capability which allows data to be automatically (as applicable) compressed as physical rows are inserted into a column-partitioned table or join index.

  • Increased flexibility: Provides a new physical database design option to improve performance for suitable classes of workloads.

  • Reduced I/O: Allows fast and efficient access to selected data from column partitions, thus reducing query I/O.

  • Ease of use: Provides simple default syntax for the CREATE TABLE and CREATE JOIN INDEX statements. No change is needed to queries.

Columnar Join Indexes

A join index can also be created as column-partitioned for either a columnar table or a non-columnar table. Conversely, a join index can be created as non-columnar for either type of table as well.

Sometimes within a mixed workload, some queries perform better when the data is not column partitioned and others perform better when it is. Or, perhaps some queries perform better with one type of partitioning on a table, whereas other queries do better with another type. Join indexes allow creation of alternate physical layouts for the data, with the optimizer automatically choosing whether to access the base table and/or one of its join indexes.

A column-partitioned join index must be a single-table, non-aggregate, non-compressed, join index with no primary index, and no value-ordering, and must include RowID of the base table. A column-partitioned join index may optionally be row partitioned. It may also be a sparse join index.

Secondary Indexes

      
A secondary index is an alternate path to the data. Secondary Indexes are used to improve performance by allowing the user to avoid scanning the entire table. A Secondary Index is like a Primary Index in that it allows the user to locate rows. It is unlike a Primary Index in that it has no influence on the way rows are distributed among AMPs. A database designer typically chooses a secondary index because it provides faster set selection.

There are three general ways to access a table:

  • Primary Index access (one AMP access)
  • Secondary Index access (two or all AMP access)
  • Full Table Scan (all AMP access)

Primary Index requests require the services of only one AMP to access rows, while secondary indexes require at least two and possibly all AMPs, depending on the index and the type of operation. A secondary index search will typically be less expensive than a full table scan.

Secondary indexes add overhead to the table, both in terms of disk space and maintenance; however they may be dropped when not needed, and recreated whenever they would be helpful.

  • A secondary index provides an alternate path to the rows of a table.
  • A secondary index can be used to maintain uniqueness within a column or set of columns.
  • A table can have from 0 to 32 secondary indexes (each may have 1 - 64 columns).
  • Secondary Indexes:
    • Do not affect table distribution.
    • Add overhead, both in terms of disk space (sub-tables) and maintenance.
    • May be added or dropped dynamically as needed.
    • Are only chosen to improve query performance.

Defining a Secondary Index

      
Just as with primary indexes, there are two types of secondary indexes – unique (USI) and non-unique (NUSI). Secondary Indexes may be specified at table creation or at any time during the life of the table. A secondary index may consist of up to 64 columns; however, to get the benefit of the index, the query must specify values for all of the index's columns.

Unique Secondary Indexes (USI) have two possible purposes. They can speed up access to a row which otherwise might require a full table scan without having to rely on the primary index. Additionally, they can be used to enforce uniqueness on a column or set of columns. This is the case when a PRIMARY KEY or UNIQUE constraint is defined and the table has a specified Primary Index or No Primary Index option. Making it a USI has the effect of enforcing the uniqueness of a PRIMARY KEY or UNIQUE constraint.

Non-Unique Secondary Indexes (NUSI) are usually specified in order to prevent full table scans. However, a NUSI does activate all AMPs – after all, the value being sought might well exist on many different AMPs (only the Primary Index value determines which AMP a row is stored on). If the optimizer decides that the cost of using the secondary index is greater than a full table scan, it opts for the table scan.

All secondary indexes cause an AMP subtable to be built and maintained as rows are inserted, updated, or deleted. Secondary index subtables consist of rows which associate the secondary index value with one or more rows in the base table. When the index is dropped, the subtable is physically removed.

A Secondary Index may be defined ...

  • at table creation (CREATE TABLE)
  • following table creation (CREATE INDEX)
  • may be up to 64 columns
USI (Unique Secondary Index)
If the index choice of column(s) is unique, it is called a USI.
Accessing a row via a USI is a two-AMP operation.

CREATE UNIQUE INDEX
(EmpNumber) ON Employee;

NUSI (Non-Unique Secondary Index)
If the index choice of column(s) is non-unique, it is called a NUSI.
Accessing row(s) via a NUSI is an all-AMP operation.

CREATE INDEX
(LastName) ON Employee;

Notes:

  • Creating a Secondary Index causes an internal sub-table to be built.
  • Dropping a Secondary Index causes the sub-table to be deleted.


Other Types of Secondary Indexes

      
There are a few other Secondary Indexes available with Teradata:

Sparse Index

Any join index, whether simple or aggregate, multi-table or single-table, can be sparse. A sparse join index uses a constant expression in the WHERE clause of its definition to narrowly filter its row population. This is known as a Sparse Join Index.

Hash Index

Hash indexes are used for the same purpose as single-table join indexes. Hash indexes create a full or partial replication of a base table with a primary index on a foreign key column to facilitate joins of very large tables by hashing them to the same AMP.

You can define a hash index on a single table only. Hash indexes are not indexes in the usual sense of the word. They are base tables that cannot be accessed directly by a query.

Value-Ordered NUSI

Value-ordered NUSIs are efficient for range conditions and conditions with an inequality on the secondary index column set. Because the NUSI rows are sorted by data value, it is possible to search only a portion of the index subtable for a given range of key values. Thus the major advantage of a value-ordered NUSI is in the performance of range queries.

Value-ordered NUSIs have the following limitations.

  • The sort key is limited to a single numeric column.
  • The sort key column cannot exceed four bytes.
  • They count as two indexes against the total of 32 non-primary indexes you can define on a base or join index table.
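
Hedged sketches of two of the index types above; the table, column, and index names are hypothetical:

-- Value-ordered NUSI: NUSI rows are sorted by Order_Date for efficient range access
CREATE INDEX (Order_Date)
ORDER BY VALUES (Order_Date)
ON Orders;

-- Sparse join index: only open orders are indexed
CREATE JOIN INDEX Open_Orders_JI AS
SELECT O_Number, O_CustNum, O_Date
FROM   Orders
WHERE  O_Status = 'O'
PRIMARY INDEX (O_CustNum);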

Full-Table Scans

      
A full table scan is another way to access data without using any Primary or Secondary Indexes.

In evaluating an SQL request, the Parser examines all possible access methods and chooses the one it believes to be the most efficient. The coding of the SQL request, the demographics of the table, and the availability of indexes all play a role in the Parser's decision. Some coding constructs, listed below, always cause a full table scan. In other cases, a full table scan might be chosen because it is the most efficient method. In general, if the number of physical reads would exceed the number of data blocks, the optimizer may decide that a full-table scan is faster.

With a full table scan, each data block is found using the Master and Cylinder Indexes and each data row is accessed only once.

As long as the choice of Primary Index has caused the table rows to distribute evenly across all of the AMPs, the parallel processing of the AMPs can complete the full table scan quickly. The file system keeps each table on as few cylinders as practical to help reduce the cost of full table scans.

While full table scans are impractical and even disallowed on some systems, the Vantage Advanced SQL Engine routinely executes ad hoc queries with full table scans.

Every row of the table must be read. All AMPs scan their portion of the table in parallel.

  • Fast and efficient on Teradata due to parallelism.
Full table scans typically occur when either:
  • An index is not used in the query
  • An index is used in a non-equality test

CUSTOMER

CustID (USI)  CustName  CustPhone (NUPI)

Examples of Full Table Scans:

SELECT * FROM Customer WHERE CustPhone LIKE '858-485-_ _ _ _ ';

SELECT * FROM Customer WHERE CustName = 'Lund';

SELECT * FROM Customer WHERE CustID > 1000;

Locks

      
Locking serializes access and prevents multiple users who are trying to update the same data at the same time from violating data integrity. This concurrency control is implemented by locking the desired data. Locks are automatically acquired during the processing of a request and released at the termination of the request. In addition, users can specify locks. There are four types of locks:
  1. Exclusive Locks - prevents any other type of concurrent access. Exclusive locks are applied only to databases or tables, never to rows. They are the most restrictive types of lock; all other users are locked out. Exclusive locks are typically used only when structural changes are being made to the database or table, such as changing the block size on a table, as in the case of a DDL statement.

  2. Write Locks - prevents other Read, Write, Exclusive locks. Write locks enable users to modify data while locking out all other users except readers not concerned about data consistency (access lock readers). Until a write lock is released, no new read or write locks are allowed. When you update a table without a WHERE clause, the system places a write lock on the table.

  3. Read Locks - prevents Write and Exclusive locks. Read locks are used to ensure consistency during read operations. Several users may hold concurrent read locks on the same data, during which no modification of the data is permitted.

  4. Access Locks - prevents Exclusive locks only. Users who are not concerned about data consistency can specify access locks. Using an access lock allows for reading data while modifications are in process. Access locks are designed for decision support on large tables that are updated only by small, single-row changes. Access locks are sometimes called stale read or dirty read locks. You may get stale data that hasn't been updated, or has been overwritten.

Lock Levels

Three levels of database locking are provided:

  1. Database - applies to all tables/views in the database. Locks all objects in the database.
  2. Table/View - applies to all rows in the table/view. Locks all rows in the table or view.
  3. Row Hash - applies to all rows with the same row hash. Locks all rows with the same row hash (primary and fallback rows, and secondary index subtable rows).

The type and level of locks are automatically chosen based on the type of SQL command issued. The user has, in some cases, the ability to upgrade or downgrade the lock.

  • SELECT - requests a Read lock
  • UPDATE - requests a Write lock
  • CREATE TABLE - requests an Exclusive lock
Lock requests may be upgraded or downgraded:

LOCKING TABLE Table1 FOR ACCESS . . .
LOCKING TABLE Table1 FOR EXCLUSIVE . . .
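
For example, a reporting query might downgrade its default Read lock to an Access lock; the table and column names here are illustrative:

LOCKING TABLE Table1 FOR ACCESS
SELECT Col1, COUNT(*)
FROM   Table1
GROUP BY 1;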


Data Protection Features

      
Facilities that provide system-level protection:
  • Disk Arrays
    • RAID data protection (e.g., RAID 1)
    • Redundant data buses and array controllers

  • Cliques and Vproc Migration
    • SMP or Linux failures; Vprocs can migrate to other nodes (or HSN) within the clique.
Facilities that provide Teradata database protection
  • Fallback - provides data access with a “down” AMP
  • Locks - provide data integrity
  • Transient Journal - automatic rollback of aborted transactions
  • Down AMP Recovery Journal - fast recovery of fallback rows for AMPs
  • Permanent Journal - optional before and after-image journaling
  • Unity - integrated portfolio that turns a multi-system environment into an orchestrated ecosystem
  • ARC - Archive/Restore facility
  • NetBackup and Tivoli - provide backup storage management, ARC script creation, and scheduling capabilities
Disk Arrays – Disk arrays provide RAID 1 or RAID 6 protection. If a disk drive fails, the array subsystem provides continuous access to the data.

Clique – a set of Teradata nodes that share a common set of disk arrays. In the event of node failure, all vprocs can migrate to another available node in the clique. All nodes in the clique must have access to the same disk arrays.

Locks – Locking prevents multiple users who are trying to change the same data at the same time from violating the data's integrity. This concurrency control is implemented by locking the desired data. Locks are automatically acquired and released during the processing of a request.

Fallback – protects your data by storing a second copy of each row of a table on an alternative “fallback AMP” and provides protection in the event of an AMP failure. Fallback provides AMP fault tolerance at the table level.

Down-AMP Recovery Journal – started automatically when the system has a failed or down AMP. Its purpose is to log any changes to rows which reside on the down AMP.

Transient Journal – the TJ is part of the WAL log and exists to permit the successful rollback of a failed transaction.

ARC/DSA and NetBackup/Tivoli – ARC and DSA command scripts provide the capability to backup and restore the Teradata database.

Transient Journal

      
The transient journal (or Online Transaction Journal) permits the successful rollback of a failed transaction (TXN). Transactions are not committed to the database until the AMPs have received an End Transaction request, either implicitly or explicitly. There is always the possibility that the transaction may fail. If so, data is returned to its original state after transaction failure.

The transient journal automatically maintains a copy on each AMP of before images of all rows affected by the transaction. In the event of a transaction failure, the before images are reapplied to the affected table, and a rollback operation is completed. In the event of transaction success, the before images for the transaction are discarded from the journal at the point of transaction commit.

Journal entries are made on and by the AMP making the change. Transient Journal activities are automatic and transparent to the user.

ALTER TABLE Misconception

An often repeated misconception is that when an ALTER TABLE command is executed, the “altered data” is written to the transient journal/WAL (Write-Ahead Logging or Write-Ahead Logic) log.

For example, assume an ALTER TABLE ADD COLUMN is executed. Assuming the altered table is a large table, this operation may be time consuming. Why?

The reason that ALTER TABLE ADD COLUMN may be time consuming is that each row is rebuilt, each block is rebuilt, and each cylinder is re-laid out. It does so one cylinder at a time in order to avoid having to make a full copy of the table. A new cylinder is allocated; each data block is read; the rows are reformatted and written to a new data block, which is placed on the new cylinder. When all the blocks are complete, the new cylinder is committed to disk. Then the old cylinder is released, and the process is repeated until all cylinders are completed. This work reformats the rows and blocks in the same way as an INSERT-SELECT.

The only thing written to the WAL log is a single record per cylinder as a cylinder is committed. This is used in case the ALTER TABLE is interrupted – e.g., by a Teradata restart. The last cylinder log record is used to restart the process from the last completed cylinder and complete it. This is also why an ALTER TABLE cannot be aborted once it has been started.
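
For reference, the kind of statement being discussed is simply the following (Customer and the new column name are placeholders):

ALTER TABLE Customer
  ADD MiddleInitial CHAR(1);
-- Every row, block, and cylinder of Customer is rebuilt, one cylinder at a time;
-- only one cylinder-commit record per cylinder is written to the WAL log.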

Fallback

      
Fallback protects your data by storing a second copy of each row of a table on an alternate, Fallback AMP in the same cluster. If an AMP fails, the system accesses the Fallback rows to meet requests. Fallback provides AMP fault tolerance at the table level. With Fallback tables, if one AMP fails, all table data is still available. Users may continue to use Fallback tables without any loss of available data.

During table creation or after a table is created, you may specify whether or not the system should keep a Fallback copy of the table. If Fallback is specified, it is automatic and transparent.

Fallback guarantees that the two copies of a row will always be on different AMPs in the same cluster. A cluster is a group of AMPs that act as a single Fallback unit. Clustering has no effect on primary row distribution of the table, but the Fallback row will always go to another AMP in the same cluster. If either AMP fails, the alternate row copy is still available on the other AMP.

There is a benefit to protecting your data, but there are costs associated with that benefit. With Fallback use, you need twice the disk space for storage and twice the I/O for INSERTs, UPDATEs, and DELETEs. (The Fallback option does not require any extra I/O for SELECT operations and the Fallback I/O will be performed in parallel with the primary I/O.)

The benefits of Fallback include:

  • Protects your data from an AMP failure. An AMP failure may occur because of physical disk failures - e.g., both disk drives in a RAID 1 drive group have failed or three drives in the same RAID 6 drive group. An AMP may also fail because of an AMP software failure or because access to the disk array has failed.
  • Protects your data from software (node) failure.
  • Automatically recovers with minimum recovery time, after repairs or fixes are complete.
  • May be specified at the table or database level
  • Permits access to table data during AMP off-line period
  • Adds a level of data protection beyond disk array RAID 1 & 5
  • Highest level of data protection is RAID 1 and Fallback
  • Critical for high availability applications

Loss of any two AMPs in a cluster causes the RDBMS to halt!
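
Fallback is requested in the CREATE TABLE options list or toggled later with ALTER TABLE. A minimal sketch, assuming a hypothetical Orders table:

CREATE TABLE Orders, FALLBACK
  ( OrderID   INTEGER NOT NULL,
    CustID    INTEGER,
    OrderDate DATE )
UNIQUE PRIMARY INDEX (OrderID);

ALTER TABLE Orders, NO FALLBACK;   -- remove the Fallback copy from an existing table
ALTER TABLE Orders, FALLBACK;      -- add it back (the Fallback subtable is rebuilt)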

Down AMP Recovery Journal

      
After the loss of any AMP, a down-AMP recovery journal is started automatically to log any changes to rows which reside on the down AMP. Any inserts, updates, or deletes affecting rows on the down AMP are applied to the Fallback copy within the cluster. The AMP that holds the Fallback copy logs the Row-ID in its recovery journal.

This process will continue until the down AMP is brought back on-line. As part of the restart activity, the recovery journal is read and changed rows are applied to the recovered AMP. When the journal has been exhausted, it is discarded and the AMP is brought on-line, fully recovered.

Recovery Journal is:

  • Automatically activated when an AMP is taken off-line
  • Maintained by other AMPs in the cluster
  • Totally transparent to users of the system
While AMP is off-line:
  • Journal is active
  • Table updates continue as normal
  • Journal logs Row-IDs of changed rows for down-AMP
When AMP is back on-line:
  • Restores rows on recovered AMP to current status
  • Journal discarded when recovery complete

Note: the Down-AMP Recovery Journal is used for Fallback tables only.

RAID Protection

      
There are many forms of disk array protection. The Teradata database supports the following protection schemes:

  • RAID 1
  • RAID 5
  • RAID S

RAID 1 (Mirroring) uses a disk-mirroring technique. Each physical disk is mirrored elsewhere in the array. This requires the array controllers to write all data to two separate locations, which means data can be read from two locations as well. In the event of a disk failure, the mirror disk becomes the primary disk to the array controller and performance is unchanged.

RAID 5 (Parity) uses a parity-checking technique. For every three blocks of data (spread over three disks), there is a fourth block on a fourth disk that contains parity information. This allows any one of the four blocks to be reconstructed by using the information on the other three. If two of the disks fail, the rank becomes unavailable. A rank is defined as a group of disks that have the same SCSI ID (the disks in the rank are not on the same channel, but are on different channels). The array controller does the recalculation of the information for the missing block. Recalculation will have some impact on performance, but at a much lower cost in terms of disk space.

RAID S uses a parity-checking technique similar to RAID 5 and is used with EMC disk arrays.

Teradata systems use disk arrays that employ dual redundant disk array controllers (DRDACs) and redundant disk drives. When a drive fails, the array switches to the mirror drive or uses parity to reconstruct the data.

Note: With Teradata databases, RAID 1 and RAID 5 are commonly used. Unless there is a reason not to do so, Teradata recommends that customers implement RAID 1.

Summary:

  • RAID 1 - Good performance with disk failures; higher cost in terms of disk space.
  • RAID 5 - Reduced performance with disk failures; lower cost in terms of disk space.

Hot Standby Nodes (HSN)

      
A Hot Standby Node (HSN) is a member of a clique that is not configured (initially) to execute any Teradata vprocs. If a node in the clique fails, Teradata restarts and the AMPs from the failed node move to the hot standby node. The performance degradation is 0%.

When the failed node is recovered/repaired and restarted, it becomes the new hot standby node. A second restart of Teradata is not needed.

Characteristics of a hot standby node are:

  • A node that is a member of a clique.
  • Does not normally participate in the trusted parallel application (TPA).
  • Will be brought into the configuration when a node fails in the clique.
  • Helps with unplanned outages.
  • Helps with planned outages
  • Eliminates the need for a restart to bring a failed node back into service.

Hot standby nodes are positioned as a performance continuity feature.


This example illustrates vproc migration using a Hot Standby Node.

  1. Performance Degradation is 0% as AMPs are moved to the Hot Standby Node.
  2. When Node 1 is recovered, it becomes the new Hot Standby Node.

Teradata Tools and Utilities Overview

      
Teradata has a robust suite of client utilities that enable users and system administrators to enjoy optimal response time and system manageability. Various client utilities are available for tasks from loading data to managing the system.

Teradata utilities leverage the Teradata database's high performance capabilities and are fully parallel and scalable. The same utilities run on smaller entry-level systems and on the largest MPP implementations.

Teradata database client utilities include the following, described in this section:

  • Query Submitting Utilities - Query tools that submit SQL queries
    • BTEQ
    • Teradata Studio Express

  • Load and Unload Utilities
    • Teradata Parallel Transporter (TPT)

  • Administrative Utilities - for managing workloads
    • Teradata Studio
    • Priority Scheduler
    • Database Query Log (DBQL)
    • Teradata Workload Analyzer
    • Teradata Integrated Workload Management (TIWM)
    • Teradata Active Systems Management (TASM)

  • Tools that support a Multi-System Environment
    • Viewpoint
    • Unity
    • Teradata Multi-Systems Manager (TMSM)
    • Data Mover

BTEQ

      
BTEQ stands for Basic Teradata Query utility. It is a SQL front-end tool that runs on all client platforms. It supports both interactive ad hoc and scripted batch queries and provides standard report writing and formatting functionality. It also provides a powerful import/export capability, even though other utilities are available for these functions as well, particularly when larger tables are involved. Before logging on with BTEQ, a user can set the number of sessions they need. Using multiple sessions may improve performance for certain operations; this feature exploits the parallelism of the Teradata database.

BTEQ uses the CLI application programming interface.

Key Points about BTEQ

  • Basic Teradata Query utility
  • SQL front-end
  • Report writing and formatting features
  • Interactive and scripted batch queries
  • Import/Export across all platforms
  • Supports ad-hoc querying
  • Supports table updates with import and export
  • Runs on all platforms

BTEQ Uses the CLI API
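
A minimal batch-style BTEQ script sketch (the TDP id, credentials, table, and file names are placeholders, not values taken from this material):

.SET SESSIONS 4
.LOGON mytdp/myuser,mypassword

.EXPORT REPORT FILE = customer_report.txt
SELECT CustID, CustName
FROM   Customer
ORDER  BY CustID;
.EXPORT RESET

.LOGOFF
.QUIT

Setting SESSIONS before the logon is what lets BTEQ open multiple parallel sessions, as described above.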


Teradata Parallel Transporter

      
Teradata Parallel Transporter (TPT) is the replacement for Teradata Warehouse Builder (TWB). This utility is effectively an object-oriented software system that executes multiple instances of data extraction, transformation, and load functions in a scalable, high-speed parallel processing environment.

TPT is scalable and enables end-to-end parallelism. The previous versions of utilities (like FastLoad) allow you to load data into Teradata in parallel, but you still have a single input stream. Teradata Parallel Transporter allows you to run multiple instances of the extract, optional transformation, and load. You can have as many loads as you have sources in the same job. With multiple sources of data coming from multiple platforms, integration is important in a parallel environment.

TPT eliminates the need for persistent storage. It stores data into data buffers so you no longer need to write data into a flat file.

TPT provides a single scripting language. You can do the extract, some transformation, and loads all in one SQL-like scripting language.

Once the dynamics of the language are learned, you can perform multiple tasks with a single script.

TPT supports FastLoad INMODs, FastExport OUTMODs, and Access Modules to provide access to all the data sources you use today. TPT also provides a Direct API interface that can be used by Third Party partners. They can effectively write 'C' code to directly load/unload data to/from Teradata.

It is effectively a parallel load and export utility. Characteristics of this utility include:

    High performance
    • Parallel Export and Load operators eliminate sequential bottlenecks.
    • Data Streams eliminate the overhead of intermediate (persistent) storage.
    • Scalable
    • End-to-end parallelism

    Easy to use

    • Single scripting language
    • Access to various data sources

    Extensible

    • Direct API enables Third Party partners to write 'C' code to directly load/unload data to/from Teradata.

TPT Operators

      
The operators are components that "plug" into the TPT infrastructure and actually perform the functions.

  • The FastLoad and FastExport operators support the current FastLoad and FastExport INMOD/OUTMOD features.
  • The DataConnector operator is an adapter for the Access Module or non-Teradata files.
  • The SQL Select and Insert operators submit the Teradata SELECT and INSERT commands.
  • The Load, Update, Export and Stream operators are similar to the FastLoad, MultiLoad, FastExport, and TPump utilities, but built for the TPT parallel environment.

The INMOD and OUTMOD adapters, DataConnector operator, and the SQL Select/Insert operators are included when you purchase the Infrastructure. The Load, Update, Export, and Stream operators are purchased separately.

Here is a comparison of the Teradata Parallel Transporter Operators with the classic utilities that we just covered.

TPT Operator | Teradata Utility | Description
LOAD | FastLoad | A consumer-type operator that uses the Teradata FastLoad protocol to load empty tables. Supports error limits, checkpoint/restart, multi-value compression, and PPI.
UPDATE | MultiLoad | Uses the Teradata MultiLoad protocol to update tables. This allows highly scalable and parallel inserts and updates to a populated table.
EXPORT | FastExport | A producer operator that emulates FastExport.
STREAM | TPump | Uses multiple sessions to perform DML transactions in near real-time.
DataConnector | N/A | Emulates the DataConnector API. Reads external data files, writes data to external data files, reads an unspecified number of data files.
ODBC | N/A | Reads data from an ODBC provider.
SQL INSERT | BTEQ | A consumer operator that inserts data into a table, similar to a BTEQ import operation.
SQL SELECT | BTEQ | A producer operator that submits SQL SELECT statements to retrieve data from a table.
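
To make the operator roles concrete, here is a heavily trimmed TPT script sketch in which a DataConnector producer feeds a Load consumer. All object names, credentials, and file attributes are hypothetical placeholders:

DEFINE JOB load_customer
DESCRIPTION 'Load the Customer table from a delimited file'
(
  DEFINE SCHEMA cust_schema
  ( CustID    VARCHAR(10),
    CustName  VARCHAR(30),
    CustPhone VARCHAR(12) );

  DEFINE OPERATOR file_reader          /* producer: reads the external file */
  TYPE DATACONNECTOR PRODUCER
  SCHEMA cust_schema
  ATTRIBUTES
  ( VARCHAR FileName      = 'customer.txt',
    VARCHAR Format        = 'Delimited',
    VARCHAR TextDelimiter = '|' );

  DEFINE OPERATOR load_op              /* consumer: FastLoad-style load into an empty table */
  TYPE LOAD
  SCHEMA *
  ATTRIBUTES
  ( VARCHAR TdpId        = 'mytdp',
    VARCHAR UserName     = 'myuser',
    VARCHAR UserPassword = 'mypassword',
    VARCHAR TargetTable  = 'Customer' );

  APPLY ('INSERT INTO Customer (:CustID, :CustName, :CustPhone);')
  TO OPERATOR (load_op)
  SELECT * FROM OPERATOR (file_reader);
);

Multiple instances of an operator can be requested in the APPLY/SELECT clauses (for example, file_reader[2]) so that several source files are read concurrently, which is the data parallelism described under TPT Architecture below.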


TPT Architecture

      
There are two types of parallelism used by TPT: pipeline parallelism and data parallelism.

Pipeline Parallelism

Pipeline parallelism (also called "functional parallelism") is achieved by connecting operator instances through data streams in a single job. A "producer" operator extracts data from a data source and writes it to the data stream; that is, it 'produces' a data stream. The "consumer" operator then 'consumes' the data from the data stream and starts writing it to the target.

There are two operators involved here, producer and consumer. Both operators can operate concurrently and independently of each other, thereby creating a pipeline effect without the need for intermediary files.


TPT Parallelism

      
Imagine you have the same job as described above, but it is handling a large quantity of data coming from several different input files. Each file can be handled by a different instance of the producer operator so that the files are read concurrently. The operator instances can then merge these files into a single data stream, which can then be input to multiple consumer operators.

Each source can be handled by a different instance of the producer operator, and these instances merge the files into a single data stream.

The data stream can then be input to multiple consumer operators.


===============================================================

Teradata SQL for Applications Development


Teradata Database has the ability to insert a row using only the DEFAULT VALUES keywords. For this feature to work successfully, one of the following statements must be true for each column of the table:


the column has defined default values

the column has a default system value specified

the column permits nulls

If none of these statements is true, an insert using DEFAULT VALUES will fail. Note that such an insert may be executed multiple times as long as no uniqueness attributes of the table are violated.



The SQL command INSERT INTO tablename DEFAULT VALUES;

  • Will insert defined default values into each column.
  • Will insert a null if no default is defined.
  • Will fail if no default is defined and a null is not allowed.


Some options for NULL and DEFAULT attributes:


NOT NULL - Nulls are not permitted for this column

DEFAULT n - Unless explicitly specified, assign the column a value of n. n must be of the same data type as the column.

Example: DEFAULT 'Unknown'


DEFAULT DATE 'xxxx-xx-xx' - Unless explicitly specified, assign the date xxxx-xx-xx

Example: DEFAULT DATE '2099-12-31'


WITH DEFAULT - Assign the system default (e.g. spaces for char strings, zero for numeric data types and current date for date data type)

DEFAULT TIME - Assign the current time to the integer data type

DEFAULT USER - Assign the user id of the session to the character string
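
Putting these options together, here is a minimal sketch; the Employee table and its columns are hypothetical:

CREATE TABLE Employee
  ( EmpID    INTEGER NOT NULL,            -- nulls not permitted, no default
    LastName CHAR(20) DEFAULT 'Unknown',  -- explicit default value
    HireDate DATE WITH DEFAULT,           -- system default (current date)
    Salary   DECIMAL(10,2) )              -- nullable, no default
UNIQUE PRIMARY INDEX (EmpID);

INSERT INTO Employee DEFAULT VALUES;
-- Fails: EmpID has no default and does not allow nulls.
-- If EmpID had a default (or allowed nulls), the other columns would receive
-- 'Unknown', the current date, and NULL respectively.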







  • ^ - A circumflex character indicates that the character (or characters) that follow the circumflex in the regular expression must occur at the beginning of the source string.
    Example: '^E' requires that the source string begin with an upper case E.
  • $ - A dollar sign character indicates that the preceding character (or characters) must occur at the end of the source string.
    Example: ':$' requires that the source string end with a colon.
  • () - Parentheses enclose a group of characters that is to be treated as a single regular expression entity.
    Example: '(ABC)+' would match one or more successive occurrences of ABC at the corresponding position in a source string.
  • [] - Square brackets enclose a subset of characters.
    Example: '[0-9]' would match any single digit at the corresponding position in a source string.
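
In Teradata SQL, patterns like these are typically passed to the regular-expression functions available in recent releases, such as REGEXP_SUBSTR. A hedged sketch; the Customer columns are placeholders:

SELECT CustName,
       REGEXP_SUBSTR(CustPhone, '[0-9]+') AS first_digit_run   -- first run of digits in the phone string
FROM   Customer
WHERE  REGEXP_SUBSTR(CustName, '^E') IS NOT NULL;              -- names beginning with an upper case E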












Data Models define how elements of data are connected or related


As a data model is refined, it passes through different states which can be referred to as normal forms. Another familiar term is third normal form (3NF). 3NF is a set of rules and guidelines about how the data model should look. In practice, this simply means deciding which columns should belong to which tables.


Broadly, there are three types of data models:


Relational

Dimensional

Logical and Physical


Normalization is the process of reducing a complex database schema into a simple, stable one. In order to normalize, a user must be able to uniquely identify each record. Generally, this process involves removing redundant attributes, keys, and relationships from the conceptual data model. Normalized data is optimized for entity-level transactions.


Dimension denormalization supports dimensional modeling's twin objectives of speed and simplicity. De-normalized data is optimized for answering business questions and driving decision making.


The Teradata Advanced SQL Engine performs space management differently from most other database implementations.

Types of Spaces

Generally, there are three types of space within the Teradata Advanced SQL Engine.





1. Permanent Space, also known as Perm space, is used for storing the data rows of tables and is the maximum storage assigned to a user and/or database. It is not pre-allocated or reserved ahead of time; it is available on demand, and space is deducted from the owner’s allocation as it is used.

2. Spool Space is unused Perm Space that holds intermediate query results or formatted answer sets for queries. Once the query is complete, the Spool Space is released. All databases have an upper limit of Spool Space. Theoretically, a user could use all the unallocated space in the system for their query.

3. Temporary Space is sometimes referred to as Temp Space and is used for global temporary tables. Tables created in Temp Space will survive a restart and remain available to the user until their session is terminated.
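
Perm, Spool, and Temp limits are assigned when a database or user is created. A minimal sketch with hypothetical names and sizes:

CREATE DATABASE Sandbox FROM DBC AS
  PERM      = 10E9,   -- maximum storage for table rows
  SPOOL     = 20E9,   -- upper limit for intermediate query results
  TEMPORARY = 5E9;    -- upper limit for global temporary tables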


What is Data Distribution?


The Teradata Advanced SQL Engine uses hashing to dynamically distribute data across all AMPs.


Hashing is the transformation of a string of characters into a usually shorter fixed-length value or key that represents the original string. The Teradata Advanced SQL Engine's hashed data distribution scheme provides optimized performance with minimal tuning and no manual reorganizations, resulting in lower administration costs and reduced development time.
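
The built-in hash functions can be used to see how evenly a primary index spreads rows across the AMPs. A sketch, assuming a hypothetical Orders table whose primary index is OrderID:

SELECT HASHAMP(HASHBUCKET(HASHROW(OrderID))) AS amp_no,   -- AMP that owns each row's hash value
       COUNT(*)                              AS row_count
FROM   Orders
GROUP  BY 1
ORDER  BY 2 DESC;

A roughly equal row_count per AMP indicates good distribution; a heavily skewed count points to a poor primary index choice.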

Teradata Database Locks


Locks have two dimensions: Level and Severity. 


Locks are automatically acquired during the processing of a request and released at the termination of the request. In addition, users can specify locks.


Types


There are four types of locks in Teradata:


EXCLUSIVE - Prevents any other type of concurrent access
WRITE - Prevents other reads, writes, and exclusive locks
READ - Prevents other writes and exclusive locks
ACCESS - Prevents Exclusive locks only