Features of Teradata
Here are some essential features of Teradata:

1) Linear Scalability
Teradata systems are linearly and highly scalable, and can efficiently handle large volumes of data at a time. A system can be scaled up to a maximum of 2048 nodes, which enhances performance.
2) Unlimited Parallelism
The architecture of Teradata is based on massively parallel processing (MPP), which divides a large volume of work into smaller processes. These smaller processes execute in parallel, so complex tasks complete quickly.
3) Mature Optimizer
Teradata Optimizer is one of the most mature optimizers available in the market. It can handle up to 64 joins in a single query.
4) Shared Nothing Architecture
The Teradata architecture is also known as a Shared Nothing Architecture because the Teradata nodes, Access Module Processors (AMPs), and disks all work independently. These resources are not shared, so each one is fully available for a given task. The 100% availability of every resource enhances efficiency by decreasing process execution time.
5) Connectivity
The parallel system of Teradata can connect to channel-attached systems such as mainframes, as well as to network-attached systems.
6) Low TCO
The Teradata system offers a low total cost of ownership, as it is easy to set up, maintain, and administer.
7) SQL
Teradata systems allow the use of SQL to interact with the data in the database easily. Teradata also provides its own extensions to SQL.
8) Load & Unload utilities
Teradata provides load and unload utilities to move data into and out of the Teradata system.
9) Automatic Distribution
Teradata can efficiently distribute the data to its disks without any external intervention.
10) Robust Utilities
Teradata has many robust utilities to handle data movement into and out of Teradata systems. FastLoad, MultiLoad, FastExport, and TPT are a few of them that help in the import and export of data to and from the systems.
Components of Teradata
Teradata is a very efficient, less expensive, and high-quality Relational Database management System.
Teradata is based on Massively Parallel Processing (MPP) architecture. It is made of Parsing Engine (PE), BYNET, Access Module Processors (AMPs), and other components such as nodes.
Below are some significant components of Teradata:

1. Parsing Engine: Parsing Engine is the base component that receives the queries from the clients and prepares the execution plan.
2. BYNET: This is the message-passing layer or simply the networking layer in Teradata. It receives the execution plan from the parsing engine and passes it to AMPs and the nodes. After that, it gets the processed output from the AMPs and sends it back to the parsing engine.
To maintain adequate availability, two BYNETs, BYNET 0 and BYNET 1, are available. This ensures that a secondary BYNET can take over in case of failure of the primary BYNET.
3. Access Module Processors (AMPs): These are the virtual processors of Teradata. They receive the execution plan and the data from the parsing engine. The data will undergo any required conversion, filtering, aggregation, sorting, etc., and will be further sent to the corresponding disks for storage.
Table records are distributed across the AMPs for data storage. Only the AMP that has access permission for a disk can read or write data on that disk.
4. Nodes: The basic unit of a Teradata system is called a Node. Each node has its own operating system, CPU, memory, a copy of the RDBMS software, and some disk space. A single cabinet can have one or more nodes in it.
Teradata Architecture
The architecture of Teradata is a Massively Parallel processing architecture. The Teradata system has four components.
- Parsing Engine
- BYNET
- AMPs
- Disk
Based on the principal functions of the Teradata system, the architecture can be categorized into two sections:
- Storage Architecture
- Retrieval Architecture
Storage Architecture
The storage architecture involves two of these components: the Parsing Engine and the AMPs. When the client runs queries to insert records, the Parsing Engine sends the records to BYNET. BYNET retrieves the records and sends each row to the target AMP. The AMP stores these records on its disks.
1. Parsing Engine
When a user fires an SQL query, it first reaches the Parsing Engine. Planning the execution and distributing the data to the AMPs is done here, and the Parsing Engine finds the best optimal plan for query execution. The Parsing Engine performs the following processes:
- Parser: The Parser checks the syntax of the query and, if it is valid, forwards the query to the Session Handler.
- Session Handler: It does all the security checks, such as checking of logging credentials and whether the user has permission to execute the query or not.
- Optimizer: It finds out the best possible and optimized plan to execute the query.
- Dispatcher: The Dispatcher forwards the query to the AMPs.
2. AMP
An Access Module Processor is a virtual processor that connects to the PE via BYNET. Each AMP has its own disk and is allowed to read and write only on that disk. This is called the 'Shared Nothing Architecture'.
When a query is fired, Teradata distributes the rows of a table across all the AMPs. An AMP executes an SQL request in the following steps:
- Lock the table.
- Execute the operation requested.
- End the transaction.
Retrieval Architecture
This section of the architecture involves the remaining two components: BYNET and the disks. When the client runs queries to retrieve records, the Parsing Engine sends a request to BYNET. BYNET sends the retrieval request to the appropriate AMPs. The AMPs then search their disks in parallel, identify the required records, and forward them to BYNET. BYNET sends the records to the Parsing Engine, which in turn sends them to the client.
1. BYNET
The BYNET acts as a channel between PE and AMPs. There are two BYNETs in Teradata 'BYNET 0' and 'BYNET 1'.
- If one BYNET fails, the second one can take its place.
- When the data volume is large, both BYNETs can be made functional, which improves the communication between PE and AMPs, thus speeding up the process.
2. Disk
Teradata provides a set of virtual disks for each AMP. The storage area of each AMP is called a Virtual Disk or Vdisk. The following steps describe how a query is executed:
Step 1: The user submits the query, which is sent to the PE.
Step 2: PE checks the security and syntax, and finds out the best optimal plan to execute the query.
Step 3: The table rows are distributed on the AMP, and the data is retrieved from the disk.
Step 4: The AMP sends back the data through BYNET to PE.
Step 5: PE returns the data to the user.
Teradata Relational Concepts
Relational Database Management System is a DBMS software that helps to interact with databases. The relational database is a collection of objects in the form of tables, views, macros, stored procedures, and triggers.
They use Structured Query Language (SQL) to interact with the data stored in tables, either directly or through individually developed applications.
Relational Database Terminology
Relational databases are a generalization of the mathematics of set theory relations. Therefore, the correspondences between set theory and relational databases are not always direct.
The following table shows the correspondence between set theory and relational database terms.
Set Theory Terms | Relational Database Term |
---|---|
Relation | Table |
Tuple | Row |
Attribute | Column |
Tables
Tables are the basic unit in RDBMS where the data is stored. Tables are two-dimensional objects consisting of rows and columns. Data is organized in tabular format and presented to the users of a relational database.
References between tables define the relationships and constraints of data inside the tables themselves.
For example, the student data is shown in the tabular format, such as:
Roll no | Student Name | Date of Birth |
---|---|---|
10001 | Mike | 1/2/1996 |
10002 | James | 5/1/1997 |
10003 | Robert | 15/7/1997 |
10004 | Paul | 16/2/1996 |
10005 | Peter | 5/9/1998 |
Table Constraints
We can define conditions that must be met before Teradata Database writes a given value to a column in a table. These conditions are called constraints.
Constraints can include value ranges, equality or inequality conditions, and intercolumn dependencies. Teradata Database supports constraints at both the column and table levels.
During table creation and modification, you can specify constraints on single-column values as part of a column definition or on multiple columns using the CREATE and ALTER TABLE statements.
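As a hedged illustration (the table and column names here are hypothetical), a constraint can be declared at the column level or at the table level:

```sql
CREATE TABLE student_marks
(
   rollno  INTEGER NOT NULL,
   marks   INTEGER CHECK (marks BETWEEN 0 AND 100),          -- column-level constraint
   grade   CHAR(1),
   CONSTRAINT chk_grade CHECK (grade IN ('A','B','C','F'))   -- table-level constraint
)
UNIQUE PRIMARY INDEX (rollno);
```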
Columns
A column always contains the same kind of information or contains similar data.
For example, a student information table would have a column for date of birth, and nothing other than the student date of birth should be placed in that column.
Date of Birth |
---|
1/2/1996 |
5/1/1997 |
15/7/1997 |
16/2/1996 |
5/9/1998 |
Row
A row is one instance of all the columns in the table.
An entity is a person, place, or thing about which the table contains information. The table mentioned in the previous paragraphs includes information on the student entity. Each table holds only one kind of row.
The relational model requires that each row in a table be uniquely identified. To accomplish this, we define a uniqueness constraint to identify each row in the table.
For example, each row in the student table would contain, among other things, the roll number, name, and date of birth for that student. The columns in a table represent entities, relationships, or attributes.
Roll no | Student Name | Date of Birth |
---|---|---|
10001 | Mike | 1/2/1996 |
Database
The database is a collection of logically related data. Many users access them for different purposes.
For example, a sales database contains complete information about sales, which is stored in many tables.
Primary Key
The primary key is used to identify a row in a table uniquely. No duplicate values are allowed in a primary key column, and they cannot accept NULL values. It is a mandatory field in a table.
Foreign Key
Foreign keys are used to build a relationship between the tables. A foreign key in a child table is defined as the primary key in the parent table.
A table can have more than one foreign key. It can accept duplicate values and also null values. Foreign keys are optional in a table.
Teradata Tables
Tables in the Relational model are defined as a collection of data. They are represented in the rows and columns.
Types of Table
Teradata supports the following types of tables:
- Permanent Table: This is the default table, and it contains data inserted by the user and permanently stores the data.
- Volatile Table: The data inserted into a volatile table is retained only during the user session. The table and data are dropped at the end of the session. These tables are mainly used to hold the intermediate data during data transformation.
- Global Temporary Table: The definition of the Global Temporary table is persistent, but the data in the table is deleted at the end of the user session.
- Derived Table: A derived table holds intermediate results within a query. Its lifetime is within the query in which it is created, used, and dropped.
Set vs Multiset
Teradata classifies tables as SET or MULTISET tables based on how duplicate rows are handled.
A table defined as a SET table doesn't store duplicate rows, whereas a MULTISET table can store duplicate rows.
1. Create Table
CREATE Table command is used to create tables in Teradata.
Syntax
Following is the generic syntax of the CREATE TABLE statement.
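A hedged sketch of the general form (angle-bracketed items are placeholders, not literal syntax):

```sql
CREATE TABLE <databasename>.<tablename>, <table options>
(
   <column name> <data type> <column attributes>,
   ...
)
[UNIQUE] PRIMARY INDEX (<column list>);
```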
- Table Options: It specifies the physical attributes of the table, such as Journal and Fallback.
- Column Definition: It specifies the list of columns, data types, and attributes.
- Index Definition: It is an additional indexing option such as Primary Index, Secondary Index, and Partitioned Primary Index.
Example
In the following example, we create a table called student with a FALLBACK option. The table contains four columns with rollno as the Unique Primary Index.
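A minimal sketch of such a definition (data types, lengths, and the date-of-birth column name are assumptions):

```sql
CREATE TABLE student, FALLBACK
(
   RollNo       INTEGER,
   FirstName    VARCHAR(30),
   LastName     VARCHAR(30),
   student_dob  DATE FORMAT 'YYYY-MM-DD'
)
UNIQUE PRIMARY INDEX (RollNo);
```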
Once the table is created, we can use the SHOW TABLE command to view the Definition of the table.
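For instance:

```sql
SHOW TABLE student;
```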
2. Alter Table
ALTER table command is used to add or drop columns from an existing table. We can also use an ALTER table command to modify the attributes of the existing columns.
Syntax
Following is the generic syntax for ALTER table.
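A hedged sketch of the general form:

```sql
ALTER TABLE <tablename>
   ADD  <column name> <data type>,
   DROP <column name>;
```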
Example
In the following example, we drop the column student_dob and add a new column BirthDate.
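A sketch of the statement (the date format is an assumption):

```sql
ALTER TABLE student
   DROP student_dob,
   ADD  BirthDate DATE FORMAT 'YYYY-MM-DD';
```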
We can run the SHOW table command to view the changes to the table. In the following output, we can see that the column student_dob is removed, and BirthDate is added.
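For instance:

```sql
SHOW TABLE student;
```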
3. Drop Table
DROP TABLE command is used to drop a table. When the DROP TABLE is issued, data in the table is deleted, and the table is dropped.
Syntax
Following is the generic syntax for DROP TABLE.
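A sketch of the general form:

```sql
DROP TABLE <tablename>;
```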
Example
In the following example, we drop the "student" table.
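For instance:

```sql
DROP TABLE student;
```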
If we run the SHOW TABLE command after this, we will get an error message stating that the table doesn't exist.
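For instance:

```sql
SHOW TABLE student;   -- now fails because the table no longer exists
```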
Teradata Data Manipulation
This section covers the SQL commands used to manipulate the data stored in Teradata tables.
Insert Records
INSERT INTO statement is used to insert records into the table.
Syntax
Following is the generic syntax for INSERT INTO, such as:
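A sketch of the general form:

```sql
INSERT INTO <tablename> (<column 1>, <column 2>, ...)
VALUES (<value 1>, <value 2>, ...);
```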
Example
In the following example, we insert a record into the student table.
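A sketch matching the record shown below (the date format assumes day/month order for 5/6/1990):

```sql
INSERT INTO student (RollNo, FirstName, LastName, BirthDate)
VALUES (10001, 'Jack', 'Nicholson', DATE '1990-06-05');
```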
Once the above query is executed, we can use the SELECT statement to view the records from the table.
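For instance:

```sql
SELECT * FROM student;
```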
RollNo | FirstName | LastName | BirthDate |
---|---|---|---|
10001 | Jack | Nicholson | 5/6/1990 |
Insert from another Table
INSERT SELECT statement is used to insert records from another table.
Syntax
Following is the generic syntax for INSERT INTO.
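A sketch of the general form:

```sql
INSERT INTO <target table>
SELECT <column list>
FROM   <source table>;
```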
Example
In the following example, we insert records from the student table into a backup table. Create a table called Student_Bkup with the same column definition as the student table before running the following insert query.
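A sketch of the statement, assuming the column layout used above:

```sql
INSERT INTO Student_Bkup
SELECT RollNo, FirstName, LastName, BirthDate
FROM   student;
```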
When the above query is executed, it will insert all records from the student table into the student_bkup table.
Rules
Here are some specific rules to insert records from another table, such as:
- The number of columns specified in the VALUES list should match with the columns specified in the INSERT INTO clause.
- Values are mandatory for NOT NULL columns.
- If no values are specified, then NULL is inserted for nullable fields.
- The data types of columns specified in the VALUES clause should be compatible with the data types of columns in the INSERT clause.
Update Records
The UPDATE statement is used to update records from the table.
Syntax
Following is the generic syntax for UPDATE.
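A sketch of the general form:

```sql
UPDATE <tablename>
SET    <column> = <new value>
[WHERE <condition>];
```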
Example
In the following example, we update the student BirthDate to 5/6/1997 for student 10001.
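A sketch of the statement (date literal assumes day/month order), followed by a check query:

```sql
UPDATE student
SET    BirthDate = DATE '1997-06-05'   -- 5/6/1997
WHERE  RollNo = 10001;

SELECT RollNo, BirthDate FROM student WHERE RollNo = 10001;   -- to verify the change
```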
The following output shows that the BirthDate is updated from 5/6/1990 to 5/6/1997 for RollNo 10001.
Rules
Here are some specific rules to update the records from the table, such as:
- We can update one or more values of the table.
- If the WHERE condition is not specified, then all rows of the table are impacted.
- We can update a table with the values from another table.
Delete Records
The DELETE FROM statement is used to delete records from the table.
Syntax
Following is the generic syntax for DELETE FROM.
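A sketch of the general form:

```sql
DELETE FROM <tablename>
[WHERE <condition>];
```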
Example
In the following example, we delete the student 10001 from the table student.
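For instance:

```sql
DELETE FROM student
WHERE  RollNo = 10001;
```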
In the following output, we can see that student 10001 is deleted from the table.
Rules
Here are some specific rules to delete records from the tables, such as:
- We can delete one or more records of the table.
- If the WHERE condition is not specified, then all rows of the table are deleted.
- We can delete records from a table based on the values from another table.
Teradata Set Operator
SET operators combine results from multiple SELECT statements. This may look similar to joins, but joins combine columns from multiple tables, whereas SET operators combine rows from multiple result sets.
Rules for Set Operator
Here are the following rules to specify the Set operator, such as:
- The number of columns from each SELECT statement should be the same.
- The data types from each SELECT must be compatible.
- ORDER BY should be included only in the final SELECT statement.
Teradata SQL Set Operators
Teradata SQL supports the following set operators:
Set Operator | Function |
---|---|
INTERSECT | It returns result in rows that appear in all answer sets generated by the individual SELECT statements. |
MINUS / EXCEPT | The result is those rows returned by the first SELECT except for those also selected by the second SELECT. |
UNION | It combines the results of two or more SELECT statements. |
1. UNION
The UNION statement is used to combine results from multiple SELECT statements. It ignores duplicates.
Syntax
Following is the basic syntax of the UNION statement.
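A sketch of the general form:

```sql
SELECT <column list> FROM <table 1>
UNION
SELECT <column list> FROM <table 2>
[ORDER BY <column>];
```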
Example
Consider the following student table as T1 and attendance table as T2.
RollNo | FirstName | LastName | BirthDate |
---|---|---|---|
1001 | Mike | Richard | 1/2/1996 |
1002 | Robert | Williams | 3/5/1995 |
1003 | Peter | Collin | 4/1/1994 |
1004 | Alexa | Stuart | 11/6/1995 |
1005 | Robert | Peterson | 12/1/1997 |
RollNo | present | Absent | % |
---|---|---|---|
1001 | 200 | 20 | 90% |
1002 | 160 | 60 | 72% |
1003 | 150 | 70 | 68% |
1004 | 210 | 10 | 95% |
The following UNION query combines the RollNo value from both T1 and T2 tables.
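A sketch of such a query:

```sql
SELECT RollNo FROM Student
UNION
SELECT RollNo FROM Attendance;
```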
When the query is executed, it gives the following output, such as:
RollNo
1001
1002
1003
1004
1005
2. UNION ALL
UNION ALL statement is similar to the UNION statement. It combines results from multiple tables, including duplicate rows.
Syntax
Following is the basic syntax of the UNION ALL statement.
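A sketch of the general form:

```sql
SELECT <column list> FROM <table 1>
UNION ALL
SELECT <column list> FROM <table 2>;
```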
Example
Following is an example for UNION ALL statement.
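A sketch of such a query:

```sql
SELECT RollNo FROM Student
UNION ALL
SELECT RollNo FROM Attendance;
```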
When the above query is executed, it produces the following output. And it returns the duplicates also.
RollNo
1001
1002
1003
1004
1005
1001
1002
1003
1004
3. INTERSECT
INTERSECT command is also used to combine results from multiple SELECT statements.
It returns the rows from the first SELECT statement that has a corresponding match in the second SELECT statement.
Syntax
Following is the basic syntax of the INTERSECT statement.
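A sketch of the general form:

```sql
SELECT <column list> FROM <table 1>
INTERSECT
SELECT <column list> FROM <table 2>;
```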
Example
Following is an example of the INTERSECT statement. It returns the RollNo values that exist in both tables.
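A sketch of such a query:

```sql
SELECT RollNo FROM Student
INTERSECT
SELECT RollNo FROM Attendance;   -- returns 1001, 1002, 1003, 1004
```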
When the above query is executed, it returns the following records. RollNo 1005 is excluded since it doesn't exist in the T2 table.
4. MINUS/EXCEPT
MINUS/EXCEPT commands combine rows from multiple tables and return the rows, which are in the first SELECT but not in the second SELECT. They both return the same results.
Syntax
Following is the basic syntax of the MINUS statement.
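A sketch of the general form:

```sql
SELECT <column list> FROM <table 1>
MINUS
SELECT <column list> FROM <table 2>;
```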
Example
Following is an example of a MINUS statement.
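A sketch of such a query:

```sql
SELECT RollNo FROM Student
MINUS
SELECT RollNo FROM Attendance;
```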
When this query is executed, it gives the following output.
RollNo 1005
Teradata String Manipulation
Teradata provides several functions to manipulate the strings. These functions are compatible with the ANSI standard.
Teradata also supports most of the standard string functions, along with Teradata extensions to those functions.
String function | Explanation |
---|---|
Concat (string1, ..., stringN) | It returns the concatenation of two or more string values. This function provides the same functionality as the SQL-standard concatenation operator (||). |
Length (string) | It returns the number of characters in the string. |
Lower (string) | It converts a string to lower case. |
Upper (string) | It converts a string to upper case. |
Lpad (string, size, padstring) | Pads the left side of the string with characters to create a new string. |
Rpad (string, size, padstring) | Pads the right side of the string with characters to create a new string. |
Trim (string) | It removes leading and trailing whitespace from the given string. |
Ltrim (string) | It removes leading whitespaces from the string. |
Rtrim (string) | It removes trailing whitespaces from the string. |
Replace (string, search) | It removes the search string from the given string. |
Replace (string, search, replace) | It replaces all instances of search with replacing string. |
Reverse (string) | It returns string characters in reverse order. |
Split (string, delimiter) | Split given string on the delimiter. This function returns an array of string. |
Strpos (string, substring) | It returns the starting position of the first instance of a substring in a given string. |
Position (substring IN string) | It returns the starting position of the first instance of a substring in a given string. |
Substr (string, start, length) | It returns a substring of string that begins at the start position and is length characters long. |
Chr (n) | It returns the character with the specified ASCII value. |
to_utf8 (string) | It encodes a string into a UTF-8 varbinary representation. |
from_utf8 (binary) | It decodes a UTF-8 encoded string from binary. |
Translate (string, from, to) | It replaces any character in the string that matches a character in the from set with the corresponding character in the to set. |
Index (string) | It locates the position of a character in a string (Teradata extension). |
UPPER & LOWER Function
The UPPER and LOWER functions convert the character column values all in uppercase and lowercase, respectively. UPPER and LOWER are ANSI compliant.
Syntax
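A sketch of the general forms:

```sql
UPPER(character_expression)
LOWER(character_expression)
```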
Example
The following example converts the string "Robert" to uppercase.
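For instance:

```sql
SELECT UPPER('Robert');
```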
After executing the above code, it will give the following output.
ROBERT
Now, in the same way, we convert the string "ROBERT" to lowercase.
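For instance:

```sql
SELECT LOWER('ROBERT');
```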
Output
robert
CHARACTER_LENGTH Function
The CHARACTER_LENGTH function returns the number of characters in a character string expression.
- The result will be an integer number that represents the length.
- The result will be the same for the fixed-length character.
- The result will vary for the variable-length character.
- Spaces are valid characters, so they are counted in the length.
Syntax
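A sketch of the general form:

```sql
CHARACTER_LENGTH(character_expression)
```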
Example
The following example will return the number of characters of the "Robert" string.
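For instance:

```sql
SELECT CHARACTER_LENGTH('Robert');
```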
Execute the above code, and it tells the length of the "Robert" string as the output shown below.
6
TRIM Function
The TRIM function is used to remove leading spaces, trailing spaces, or both from an expression. TRIM is ANSI standard.
Syntax
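A sketch of the general form:

```sql
TRIM([BOTH | LEADING | TRAILING] FROM character_expression)
```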
Example
The following example removes the spaces from both ends of a padded "Robert" string.
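For instance:

```sql
SELECT TRIM('   Robert   ');
```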
When we execute the above code, it will trim the existing space from both ends of the string and gives the following output.
Robert
POSITION Function
The POSITION function is used to return the position of a substring inside a string. Only the position of the first occurrence of the substring is returned.
Syntax
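A sketch of the general form:

```sql
POSITION(substring IN string_expression)
```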
Example
The following example will return the occurrence of "e" in the "Robert" string.
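For instance:

```sql
SELECT POSITION('e' IN 'Robert');
```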
After executing the above code, it will find the position of "e" substring in the "Robert" string as output.
4
SUBSTRING Function
The SUBSTRING function is used to return a specified number of characters from a particular position of a given string. SUBSTRING function is ANSI standard.
Syntax
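A sketch of the general form:

```sql
SUBSTRING(str FROM pos [FOR len])
```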
It returns a string (str), starting at the position (pos), and length (len) in characters.
Example
The following example returns 3 characters starting from the 1st position of the string.
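For instance:

```sql
SELECT SUBSTRING('Robert' FROM 1 FOR 3);
```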
The above code returns 3 characters from the 1st position of the string "Robert" as the output.
Rob
Teradata Date/Time Functions
Date/Time functions operate on either Date/Time or Interval values and provide a Date/Time value as a result.
The supported Date/Time functions are:
- CURRENT_DATE
- CURRENT_TIME
- CURRENT_TIMESTAMP
- EXTRACT
To avoid any synchronization problems, operations among these functions are guaranteed to use identical definitions for DATE, TIME, and TIMESTAMP, so the following relationships are always valid:
- CURRENT_DATE = CURRENT_DATE
- CURRENT_TIME = CURRENT_TIME
- CURRENT_TIMESTAMP = CURRENT_TIMESTAMP
- CURRENT_DATE and CURRENT_TIMESTAMP always identify the same DATE
- CURRENT_TIME and CURRENT_TIMESTAMP always identify the same TIME
The values reflect the time when the request starts and do not change during the request's duration.
Date Storage
Dates are stored internally as integers using the following formula.
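Teradata's internal date encoding is commonly described by this formula (shown here as a sketch):

```sql
((YEAR - 1900) * 10000) + (MONTH * 100) + DAY
```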
To check how a date is stored, run the following query.
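One hedged way to inspect the stored integer, assuming a Teradata-mode session:

```sql
SELECT CAST(DATE '2020-05-22' AS INTEGER);   -- (2020-1900)*10000 + 5*100 + 22 = 1200522
```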
Since the dates are stored as an integer, we can perform some arithmetic operations on them.
Teradata supports most of the standards date functions. Some of the commonly used date functions are listed below, such as:
Date Function | Explanation |
---|---|
LAST_DAY | It returns the last day of the given month. It may contain the timestamp values as well. |
NEXT_DAY | It returns the date of the weekday that follows a particular date. |
MONTHS_BETWEEN | It returns the number of months between two date (timestamp) values. The result is always an integer value. |
ADD_MONTHS | It adds a month to the given date (timestamp) value and return resulting date value. |
OADD_MONTHS | It adds a month to the given date (timestamp) value and return resulting date value. |
TO_DATE | It converts a string value to a DATE value and returns the resulting date value. |
TO_TIMESTAMP | It converts a string value to a TIMESTAMP value and returns resulting timestamp value. |
TRUNC | It returns a DATE value with the time portion truncated to the unit specified by a format string. |
ROUND | It returns a DATE value with the time portion rounded to the unit specified by a format string. |
NUMTODSINTERVAL | It converts a numeric value to interval days to seconds. |
NUMTOYMINTERVAL | It converts a numeric value to interval years to the month. |
TO_DSINTERVAL | It converts a string value to interval days to second. |
TO_YMINTERVAL | It converts a string value to interval year to a month. |
EXTRACT | It extracts portions of the day, month, and year from a given date value. |
INTERVAL | INTERVAL function is used to perform arithmetic operations on DATE and TIME values. |
EXTRACT
EXTRACT function is used to extract portions of the day, month, and year from a DATE value. This function is also used to extract hour, minute, and second from TIME/TIMESTAMP value.
Examples
1. The following example shows how to extract Year value from Date and Timestamp values.
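A sketch using literal values consistent with the output below:

```sql
SELECT EXTRACT(YEAR FROM DATE '2020-05-22');
SELECT EXTRACT(YEAR FROM TIMESTAMP '2020-05-22 06:46:25.150000');
```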
Output
2020
2. The following example shows how to extract Month values from Date and Timestamp values.
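A sketch using the same literal date:

```sql
SELECT EXTRACT(MONTH FROM DATE '2020-05-22');   -- returns 5
```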
Output
5
3. The following example shows how to extract Day values from Date and Timestamp values.
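A sketch using the same literal date:

```sql
SELECT EXTRACT(DAY FROM DATE '2020-05-22');
```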
Output
22
4. The following example shows how to extract Hour values from Date and Timestamp values.
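A sketch using a literal timestamp consistent with the output below:

```sql
SELECT EXTRACT(HOUR FROM TIMESTAMP '2020-05-22 06:46:25.150000');
```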
Output
6
5. The following example shows how to extract Minute values from Date and Timestamp values.
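A sketch using the same literal timestamp:

```sql
SELECT EXTRACT(MINUTE FROM TIMESTAMP '2020-05-22 06:46:25.150000');
```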
Output
46
6. The following example shows how to extract the Second values from Date and Timestamp values.
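A sketch using the same literal timestamp:

```sql
SELECT EXTRACT(SECOND FROM TIMESTAMP '2020-05-22 06:46:25.150000');
```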
Output
25.150000
INTERVAL
Teradata provides INTERVAL function to perform arithmetic operations on DATE and TIME values. There are two types of INTERVAL functions, such as:
1. Year-Month Interval
- YEAR
- YEAR TO MONTH
- MONTH
2. Day-Time Interval
- DAY
- DAY TO HOUR
- DAY TO MINUTE
- DAY TO SECOND
- HOUR
- HOUR TO MINUTE
- HOUR TO SECOND
- MINUTE
- MINUTE TO SECOND
- SECOND
Examples
1. The following example adds 4 years to the current date.
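A sketch of such a query:

```sql
SELECT CURRENT_DATE + INTERVAL '4' YEAR;
```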
Output
05/22/2024
2. The following example adds 4 years and 03 months to the current date.
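A sketch of such a query:

```sql
SELECT CURRENT_DATE + INTERVAL '4-03' YEAR TO MONTH;
```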
Output
08/22/2024
3. The following example adds 03 days, 05 hours, and 10 minutes to the current timestamp.
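A sketch of such a query:

```sql
SELECT CURRENT_TIMESTAMP + INTERVAL '03 05:10' DAY TO MINUTE;
```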
Output
05-25-2020 10:07:25.150000+00.00
Teradata Built-In Functions
Teradata provides built-in functions, which are extensions to SQL. The Built-in functions return information about the system.
Built-in functions are sometimes referred to as special registers. They can be used anywhere that a literal can appear.
If a SELECT statement that contains a built-in function references a table name, then the result of the query contains one row for every row of the table that satisfies the search condition.
Some common built-in functions are listed below with examples, such as:
S. No. | Function | Example |
---|---|---|
1 | SELECT DATE; | Date- 2018/02/10 |
2 | SELECT CURRENT_DATE; | Date- 2020/05/23 |
3 | SELECT TIME; | Time- 09:02:00 |
4 | SELECT CURRENT_TIME; | Time- 10:01:13 |
5 | SELECT CURRENT_TIMESTAMP; | Current TimeStamp(6)- 2020-05-23 10:01:13.990000+00.00 |
6 | SELECT DATABASE; | Database- TDUSER |
DISTINCT Option
The DISTINCT option specifies that duplicate values are not to be used when an expression is processed.
The following SELECT returns the distinct job codes in a table.
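A hedged sketch, assuming an employee table with a hypothetical job_code column:

```sql
SELECT DISTINCT job_code
FROM   employee;
```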
Output
2000
1000
00
3000
A query can have multiple aggregate functions that use DISTINCT with the same expression, such as:
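A hedged sketch (column names are illustrative):

```sql
SELECT COUNT(DISTINCT department_no), SUM(DISTINCT department_no)
FROM   employee;
```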
A query can also have multiple aggregate functions that use DISTINCT with different expressions, such as:
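A hedged sketch (column names are illustrative):

```sql
SELECT COUNT(DISTINCT department_no), SUM(DISTINCT salary)
FROM   employee;
```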
Teradata CASE & COALESCE
CASE and COALESCE are both used in Teradata, but for different purposes and with different functionality.
CASE Expression
The Teradata CASE expression provides the flexibility to fetch alternate values for a column based on the condition specified in the expression.
The CASE expression evaluates each row against a condition or WHEN clause and returns the result of the first match. If there are no matches, the result from the ELSE part is returned.
Syntax
Following is the syntax of the CASE expression.
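A sketch of the general form:

```sql
CASE
   WHEN <condition 1> THEN <result 1>
   WHEN <condition 2> THEN <result 2>
   ...
   ELSE <default result>
END
```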
Example
Consider the following Employee record in the below table.
Employee_Id | Age | Salary | Marital_Status |
---|---|---|---|
202001 | 22 | 20,000 | 1 |
202002 | 25 | 25,000 | 1 |
202003 | 30 | 35,000 | 2 |
202004 | 26 | 32,000 | 2 |
202005 | 32 | 40,000 | 2 |
In this example, we evaluate the Marital_Status column. It returns 'Single' if the marital status is 1 and 'Married' if the marital status is 2. Otherwise, it returns the value 'Not Sure'.
Now, we will apply the CASE statement on Marital_Status column as follows:
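A hedged sketch of such a query on the Employee table above:

```sql
SELECT Employee_Id, Age, Salary,
       CASE
          WHEN Marital_Status = 1 THEN 'Single'
          WHEN Marital_Status = 2 THEN 'Married'
          ELSE 'Not Sure'
       END AS Marital_Status
FROM   Employee;
```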
After executing the above code, it produces the following output.

The above CASE expression can also be written in the following way, which will produce the same result as above.
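A hedged sketch of the equivalent valued-CASE form:

```sql
SELECT Employee_Id, Age, Salary,
       CASE Marital_Status
          WHEN 1 THEN 'Single'
          WHEN 2 THEN 'Married'
          ELSE 'Not Sure'
       END AS Marital_Status
FROM   Employee;
```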
COALESCE Expression
Teradata COALESCE is used for NULL handling. COALESCE is an expression that returns the first non-NULL value in its argument list. It returns NULL if all the arguments evaluate to NULL.
Syntax
Here is the basic syntax of the COALESCE function:
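A sketch of the general form:

```sql
COALESCE(expression_1, expression_2, ..., expression_n)
```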
Example
Consider the following Student table.

Now we can prioritize which phone number to select using COALESCE function as follows:
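A hedged sketch, assuming the Student table has character columns Roll_No, Landline_no, and Mobile_no:

```sql
SELECT Roll_No,
       COALESCE(Landline_no, Mobile_no, 'Not available') AS Contact_No
FROM   Student;
```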
In the above example, COALESCE checks Landline_no first. If that is NULL, it checks Mobile_no. If both numbers are NULL, it returns the default value 'Not available'.
When we execute the above query, it generates the following output.

NULLIF
The NULLIF statement returns NULL if the arguments are equal.
Syntax
Following is the syntax of the NULLIF statement.
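A sketch of the general form:

```sql
NULLIF(expression_1, expression_2)
```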
Example
The following example returns NULL if the Mobile_No is equal to 0. Otherwise, it returns the Mobile_No value.
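A hedged sketch, assuming Roll_No and Mobile_No columns on the Student table:

```sql
SELECT Roll_No,
       NULLIF(Mobile_No, 0) AS Mobile_No
FROM   Student;
```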
The above query returns the following records. We can see that Roll_No 101 and 104 have Mobile as NULL.

Teradata Primary Index
The primary index is used to specify where the data resides in Teradata. It is used to determine which AMP gets the data row.
In Teradata, each table is required to have a primary index defined. If the primary index is not defined, Teradata automatically assigns the primary index.
The primary index provides the fastest way to access the data. A primary index may have a maximum of 64 columns. The primary index is defined while creating a table, and it cannot be altered or modified afterwards.
The primary index is the most preferred and essential index for:
- Data distribution
- Known access path
- Improves join performance
Rules for Primary Index
Here are some specific rules for the Primary index, such as:
Rule 1: One Primary index per table.
Rule 2: A Primary index value can be unique or non-unique.
Rule 3: The Primary index value can be NULL.
Rule 4: The Primary index of a populated table cannot be modified.
Rule 5: The Primary index value can be modified.
Rule 6: A Primary index has a limit of 64 columns.
Types of Primary Index
There are two types of Primary Indexes.
- Unique Primary Index (UPI)
- Non-Unique Primary Index (NUPI)
1. Unique Primary Index (UPI)
In the Unique Primary Index table, the column should not have any duplicate values. If any duplicate values are inserted, they will be rejected. The Unique Primary index enforces uniqueness for a column.
A Unique Primary Index (UPI) will always spread the rows of the table evenly amongst the AMPs. UPI access is always a one-AMP operation.
How to create a Unique Primary Index?
In the following example, we create the Student table with columns Roll_no, First_name, and Last_name.
Roll_no | First_name | Last_name |
---|---|---|
1001 | Mike | Richard |
1002 | Robert | Williams |
1003 | Peter | Collin |
1004 | Alexa | Stuart |
1005 | Robert | Peterson |
We have selected Roll_no to be our Primary Index. Because we have designated Roll_no as a Unique Primary Index, there can be no duplicates of student roll numbers in the table.
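A hedged sketch of the corresponding definition (data types are assumptions):

```sql
CREATE TABLE Student
(
   Roll_no      INTEGER,
   First_name   VARCHAR(30),
   Last_name    VARCHAR(30)
)
UNIQUE PRIMARY INDEX (Roll_no);
```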
2. Non-Unique Primary Index (NUPI)
A Non-Unique Primary Index (NUPI) means that the values for the selected column can be non-unique.
A Non-Unique Primary Index will never spread the table rows evenly. An All-AMP operation will take longer if the data is unevenly distributed.
We might pick a NUPI over a UPI because the NUPI column may be more useful for query access and joins.
How to Create a Non-Unique Primary Index?
In the following example, we create the employee table with columns Employee_Id, Name, Department, and City.
Employee_Id | Name | Department | City |
---|---|---|---|
202001 | Max | Sales | London |
202002 | Erika | Finance | Washington |
202003 | Nancy | Management | Paris |
202004 | Bella | Human Recourse | London |
202005 | Sam | Marketing | London |
Every employee has a different employee Id, name, and department, but many employees belong to the same city in this table. Therefore we have selected City to be our Non-Unique Primary Index.
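A hedged sketch of the corresponding definition (data types are assumptions):

```sql
CREATE TABLE Employee
(
   Employee_Id  INTEGER,
   Name         VARCHAR(30),
   Department   VARCHAR(30),
   City         VARCHAR(30)
)
PRIMARY INDEX (City);
```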
Multi-Column Primary Indexes
Teradata allows more than one column to be designated as the Primary Index. It is still only one Primary Index, but it is merely made up of combining multiple columns.
Teradata Multi-column Primary index allows up to 64 combined columns to make up the one Primary Index required for a table.
Example
In the following example, we have designated First_Name and Last_Name combined to make up the Primary Index.
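A hedged sketch of such a definition (data types are assumptions):

```sql
CREATE TABLE Student
(
   Roll_no      INTEGER,
   First_Name   VARCHAR(30),
   Last_Name    VARCHAR(30)
)
PRIMARY INDEX (First_Name, Last_Name);
```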
This is very useful and beneficial for two reasons:
- To get better data distribution among the AMPs.
- And for users who often use multiple keys consistently to query.
Data distribution using Primary Index
When a user submits an SQL request against a table using a Primary Index, the request becomes a one-AMP operation, which is the most direct and efficient way for the system to find a row. The process is explained below.
Hashing Process
The process of Hashing is defined in the following steps, such as:
Step 1: In the first step, the Primary index value goes into the hashing algorithm.
Step 2: The output of the hashing algorithm is the row hash value.
Step 3: The hash map points to the specific AMP where the row resides.
Step 4: The PE sends the request directly to the identified AMP.
Step 5: The AMP locates the row(s) on its vdisk.
Step 6: The data is sent to PE through BYNET, and PE sends the answer set to the client application.
Duplicate Row Hash Values
The hashing algorithm can end up with the same row hash value for two different rows. This can happen in two ways:
- Duplicate NUPI values: If a Non-Unique Primary Index is used, duplicate NUPI values will produce the same row hash value.
- Hash synonym: It is also called a hash collision. It occurs when the hashing algorithm calculates an identical row hash value for two different Primary Index values.
To differentiate each row in a table, every row is assigned a unique Row ID. The Row ID is the combination of the row hash value and a uniqueness value.
The uniqueness value is used to differentiate between rows whose Primary Index values generate identical row hash values. In most cases, only the row hash value portion of the Row ID is needed to locate the row.

When each row is inserted, the AMP adds the row ID, stored as a prefix of the row.
The first row inserted with a particular row hash value is assigned a uniqueness value of 1. The uniqueness value is incremented by 1 for each additional row inserted with the same row hash value.
Teradata Joins
Join is used to combine records from more than one table. Tables are joined based on the common columns and values from these tables.
There are different types of Joins available in Teradata.
- Inner Join
- Left Outer Join
- Right Outer Join
- Full Outer Join
- Self Join
- Cross Join
- Cartesian Product Join
INNER JOIN
Inner Join combines records from multiple tables and returns the values that exist in both the tables.
Syntax
Following is the syntax of the INNER JOIN statement.
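A sketch of the general form:

```sql
SELECT <column list>
FROM   <table1> A
INNER JOIN <table2> B
ON     A.<common column> = B.<common column>;
```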
Example
Consider the following two tables, such as the student table and attendance table.
Student Table:
RollNo | FirstName | LastName | BirthDate |
---|---|---|---|
1001 | Mike | Richard | 1/2/1996 |
1002 | Robert | Williams | 3/5/1995 |
1003 | Peter | Collin | 4/1/1994 |
1004 | Alexa | Stuart | 11/6/1995 |
1005 | Robert | Peterson | 12/1/1997 |
Attendance Table:
RollNo | present | Absent | % |
---|---|---|---|
1001 | 200 | 20 | 90% |
1002 | 160 | 60 | 72% |
1003 | 150 | 70 | 68% |
1004 | 210 | 10 | 95% |
The following query joins the student table and attendance table on the common column Rollno. Each table is assigned an alias A & B, and the columns are referenced with the correct alias.
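A hedged sketch of such a query:

```sql
SELECT A.RollNo, A.FirstName, A.LastName, B.Present, B.Absent
FROM   Student A
INNER JOIN Attendance B
ON     A.RollNo = B.RollNo;
```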
When the above query is executed, it returns the following records. Rollno 1005 is not included in the result since it doesn't have matching records in the attendance table.
OUTER JOIN
LEFT OUTER JOIN and RIGHT OUTER JOIN also combine the results from multiple tables.
- LEFT OUTER JOIN: It returns all the records from the left table and returns only the matching records from the right table.
- RIGHT OUTER JOIN: It returns all the records from the right table and returns only matching rows from the left table.
- FULL OUTER JOIN: It combines the results from both LEFT OUTER and RIGHT OUTER JOINS. It returns both matching and non-matching rows from the joined tables.
Syntax
Following is the syntax of the OUTER JOIN statement. We need to use one of the options from LEFT OUTER JOIN, RIGHT OUTER JOIN, or FULL OUTER JOIN.
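A sketch of the general form:

```sql
SELECT <column list>
FROM   <left table> A
LEFT OUTER JOIN <right table> B   -- or RIGHT OUTER JOIN / FULL OUTER JOIN
ON     A.<common column> = B.<common column>;
```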
Example
Consider the following example of the LEFT OUTER JOIN query. It returns all the records from the Student table and matching records from the attendance table.
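A hedged sketch (the "%" column of the attendance table is assumed here to be named Attendance_Pct):

```sql
SELECT A.RollNo, A.FirstName, B.Attendance_Pct
FROM   Student A
LEFT OUTER JOIN Attendance B
ON     A.RollNo = B.RollNo;
```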
When the above query is executed, it produces the following output. For student 1005, % value is NULL, since it doesn't have matching records in the attendance table.
CROSS JOIN
Cross Join joins every row from the left table to every row from the right table.
Cross Join is equivalent to the product join. There is no "ON" clause in a Cross Join.
Syntax
Following is the syntax of the CROSS JOIN statement.
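A sketch of the general form:

```sql
SELECT <column list>
FROM   <table1>
CROSS JOIN <table2>;
```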
Example
Consider the following example of the CROSS JOIN query.
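A hedged sketch (the "%" column is assumed to be named Attendance_Pct; the WHERE clause restricts the result to the rows shown in the sample output below):

```sql
SELECT A.RollNo, B.RollNo, B.Attendance_Pct
FROM   Student A
CROSS JOIN Attendance B
WHERE  A.RollNo = 1001;
```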
When the above query is executed, it produces the following output. RollNo 1001 from the Student table is joined with each and every record from Attendance Table.
Rollno | Rollno | % |
---|---|---|
1001 | 1001 | 90% |
1001 | 1002 | 72% |
1001 | 1003 | 68% |
1001 | 1004 | 95% |
Teradata Substring
The Teradata SUBSTRING or SUBSTR function is one of the Teradata string functions, and it is used to cut a substring from a string based on its position.
SUBSTR or SUBSTRING will work the same in Teradata. But the syntax may be different.
We use ANSI syntax for Teradata SUBSTRING and Teradata syntax for Teradata SUBSTR. The ANSI syntax is designed to be compatible with other database systems.
Syntax
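A sketch of the ANSI form:

```sql
SUBSTRING(string_expression FROM start_position [FOR length])
```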
Or
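A sketch of the Teradata form:

```sql
SUBSTR(string_expression, start_position [, length])
```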
Argument Types and Rules
SUBSTRING and SUBSTR operate on the following types of arguments:
- Character
- Byte
- Numeric
If the string_expression argument is a user-defined type (UDT), it is implicitly cast to one of the following predefined types:
- Character
- Numeric
- Byte
- DATE
To define an implicit cast for a UDT, we use the CREATE CAST statement and specify the AS ASSIGNMENT clause.
Implicit type conversion of UDTs for system operators and functions, including SUBSTRING and SUBSTR, is a Teradata extension to the ANSI SQL standard. To disable this extension, set the DisableUDTImplCastForSysFuncOp field of the DBS Control Record to TRUE.
Result Type and Attributes
Here are the default result type and attributes for SUBSTR and SUBSTRING, such as:
If the string argument is a:
- BLOB, then the result type is BLOB(n).
- Byte string other than BLOB, the result type is VARBYTE(n).
- CLOB, then the result type is CLOB(n).
- Numeric or character string other than CLOB, the result type is VARCHAR(n).
In ANSI mode, the value of n for the resulting BLOB(n), VARBYTE(n), CLOB(n), or VARCHAR(n) is the same as the original string.
In Teradata mode, the value of n for the result type depends on the number of characters or bytes in the resulting string. To get the data type of the resulting string, we can use the TYPE function.
Difference between SUBSTRING and SUBSTR
The SUBSTR function is the original Teradata substring operation. It is written to be compatible with DB/2.
- It can be used in the SELECT list to return any portion of the character data stored in a column to a client, or in the WHERE clause.
- When we use the SUBSTR function, such as SUBSTRING, the name of the column needs to be provided along with the starting character location and the length or number of characters to return.
- The main difference is that commas are used as delimiters between these three parameters instead of FROM and FOR.
The SUBSTRING function length is optional. When it is not included, all remaining characters to the end of the column are returned. In the earlier releases of Teradata, the SUBSTR was much more restrictive in the values allowed. This situation increased the chances of the SQL statement failing due to unexpected data or costs.
- Both SUBSTRING and SUBSTR allow for partial character data strings to be returned, even in ANSI mode. These functions only store the requested data in a spool, not the entire column. Therefore, the amount of spool space required can be reduced or tuned using the substring functions.
- SUBSTR is now as compatible and tolerant regarding the parameter values passed to it as the newer SUBSTRING.
- The SUBSTRING is the ANSI standard, and therefore, it is the better choice between these two functions.
Teradata Table Types
Teradata supports the following types of tables to hold temporary data.
S. No. | Table Types | Description |
---|---|---|
1 | ANSI Temporal | ANSI-compliant support for temporal tables. Using temporal tables, Teradata Database can process statements and queries that include time-based reasoning. Temporal tables record both system time (the time when the information was recorded in the database) and valid time (when the information is in effect or correct in a real-world application). |
2 | Derived | A derived table is a type of temporary table obtained from one or more other tables as the result of a subquery. It is specified in an SQL SELECT statement. It avoids the need to use the CREATE and DROP TABLE statements for storing retrieved information, and it is useful when we are writing more sophisticated, complex queries. |
3 | Error Logging | Error logging tables store information about errors on an associated permanent table. They also store log information about insert and update errors. |
4 | Global Temporary | Global temporary tables are private to the session. It is dropped automatically at the end of a session. It has a persistent table definition stored in the Data Dictionary. The saved description may be shared by multiple users and sessions, with each session getting its instance of the table. |
5 | Global Temporary Trace | Global temporary trace tables store trace output for the length of the session. They have a persistent table definition stored in the Data Dictionary. They are useful for debugging SQL stored procedures (via a call to an external stored procedure written to the trace output) and external routines (UDFs, UDMs, and external stored procedures). |
6 | NoPI | NoPI tables are permanent tables that do not have primary indexes defined on them. They provide a performance advantage when used as staging tables to load data from FastLoad or TPump Array INSERT. They can have secondary indexes defined on them to avoid full-table scans during row access. |
7 | Permanent | Permanent tables allow different sessions and users to share table content. |
8 | Queue | Queue tables are permanent tables with a timestamp column. The timestamp indicates when each row was inserted into the table. This establishes first-in, first-out (FIFO) ordering of table contents, which is needed for customer applications requiring event processing. |
9 | Volatile | Volatile tables are used when only a single session needs the table, and only for the duration of that session. The table definition is not stored in the Data Dictionary, and both the definition and the contents are dropped automatically at the end of the session. |
Derived Table
Derived tables are created, used, and dropped within a query. These are used to store intermediate results within a query.
Example
Consider the following employee record in the form of two tables.
Emp Table:
Employee_Id | First_Name | Last_Name | Department_No |
---|---|---|---|
202001 | Mike | Richard | 1 |
202002 | Robert | Williams | 2 |
202003 | Peter | Collin | 2 |
202004 | Alexa | Stuart | 1 |
202005 | Robert | Peterson | 1 |
Salary Table:
Employee_Id | GrossPay | Deduction | NetPay |
---|---|---|---|
202001 | 40,000 | 4,000 | 36,000 |
202002 | 80,000 | 6,000 | 74,000 |
202003 | 90,000 | 7,000 | 83,000 |
202004 | 75,000 | 5,000 | 70,000 |
202005 | 80,000 | 00 | 80,000 |
The following query creates a derived table EmpSal holding the records of employees with a net salary of 80,000 or higher.
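A hedged sketch of such a query, using the Emp and Salary tables above:

```sql
SELECT Employee_Id, First_Name, NetPay
FROM (
   SELECT E.Employee_Id, E.First_Name, S.NetPay
   FROM   Emp E
   INNER JOIN Salary S ON E.Employee_Id = S.Employee_Id
   WHERE  S.NetPay >= 80000
) AS EmpSal;
```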
When the above query is executed, it returns the employee's records with a salary higher than or equal to 80,000.
Employee_Id | First_Name | NetPay |
---|---|---|
202003 | Peter | 83,000 |
202005 | Robert | 80,000 |
Volatile Table
Volatile tables are created and dropped within a user session. Their definition is not stored in the data dictionary. They hold intermediate data of the query, which is frequently used.
Syntax
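A sketch of the general form:

```sql
CREATE VOLATILE TABLE <tablename>
(
   <column definitions>
)
ON COMMIT [PRESERVE | DELETE] ROWS;
```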
Example
In the following example, we create a VOLATILE table named dept_stat.
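A hedged sketch, loading the volatile table from the Emp and Salary tables used earlier:

```sql
CREATE VOLATILE TABLE dept_stat
(
   dept_no      INTEGER,
   avg_salary   INTEGER,
   max_salary   INTEGER,
   min_salary   INTEGER
)
ON COMMIT PRESERVE ROWS;

INSERT INTO dept_stat
SELECT E.Department_No, AVG(S.NetPay), MAX(S.NetPay), MIN(S.NetPay)
FROM   Emp E
INNER JOIN Salary S ON E.Employee_Id = S.Employee_Id
GROUP  BY 1;

SELECT * FROM dept_stat;
```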
After executing the above code, it returns the min, avg, and max salary according to the departments in the output.
Dept_no | avg_salary | max_salary | min_salary |
---|---|---|---|
1 | 186,000 | 80,000 | 36,000 |
2 | 157,000 | 83,000 | 74,000 |
Global Temporary Table
The definition of the Global Temporary table is stored in the data dictionary, and they can be used by many users and sessions. But the data loaded into a global temporary table is retained only during the session.
We can materialize up to 2000 global temporary tables per session.
Syntax
Following is the syntax of the Global Temporary table.
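A sketch of the general form:

```sql
CREATE GLOBAL TEMPORARY TABLE <tablename>
(
   <column definitions>
)
ON COMMIT [PRESERVE | DELETE] ROWS;
```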
Example
The below query creates a Global Temporary table:
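A hedged sketch (column definitions mirror the volatile-table example above):

```sql
CREATE GLOBAL TEMPORARY TABLE dept_stat
(
   dept_no      INTEGER,
   avg_salary   INTEGER,
   max_salary   INTEGER,
   min_salary   INTEGER
)
ON COMMIT PRESERVE ROWS;
```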
When the above query is executed, it returns the following output.
A table has been created.
Teradata Space Concepts
Teradata is designed to reduce the DBA's administrative functions when it comes to space management. Space concepts are configured in the following ways in the Teradata system.
- Permanent Space
- Spool Space
- Temporary Space
Permanent Space (Perm Space)
Permanent space is the maximum amount of space available for the user and database to hold data rows. Permanent tables, journals, fallback tables, and secondary index sub-tables use permanent space.
Permanent space is not pre-allocated for the database and user. The amount of permanent space is divided by the number of AMPs. Whenever the per-AMP limit is exceeded, an error message is generated.
Equal distribution is necessary because there is a high probability that the objects will be spread across all the AMPs, and at the time of data retrieval, all AMPs will work in parallel to fetch the data.
Unlike other relational databases, the Teradata database does not physically define the Perm space at the time of object creation. Instead of that, it represents the upper limit for the Perm space, and then Perm space is used dynamically by the objects.

Example
Suppose we have a Teradata system with 500 GB of perm space. Initially, DBC owns 100% of the perm space of a new system. As everything in Teradata is parallel, perm space is distributed evenly across all AMPs.
With 500 GB of perm space and 5 AMPs, each AMP gets 100 GB of perm space to work with in parallel. Now suppose DBC creates two users, sysdba and appdba, with 20 GB and 100 GB of perm space respectively.
This space is taken from the parent, i.e., DBC, so DBC is left with 380 GB of perm space. This mechanism ensures there is enough space for all processes in the Teradata system.
Spool Space
Spool space is the amount of space on the system that has not been allocated. It is used by the system to keep the intermediate results of the SQL query. Users without spool space cannot execute any query.
When executing a conditional query, all the qualifying rows that satisfy the given condition are stored in the Spool space for further processing by the query. Any Perm space currently unassigned is available as Spool space.
Spool space defines the maximum amount of space the user can use. Establishing a Spool space limit is not required when Users and Databases are created. But it is highly recommended to define the upper limit of Spool space for any object.

Example
Suppose a 10 GB permanent space has been allocated to a database, but the tables in it actually occupy only 50% of that perm space, i.e., 5 GB.
The remaining 5 GB is available system-wide as spool. Spool space for a child is not subtracted from its immediate parent, and the child database's spool space can be equal to that of its parent. The spool space for sysdba and appdba is the same as that of DBC.
Temp Space
Temp space is the unused permanent space that is used by Global Temporary tables. Temp space is also divided by the number of AMPs.
Unused Perm space is used as Temp space for storing the table and the data in it. If we are specifying some value for Temp space, it should not exceed the value of the parent database and user.
If we do not specify a value, then the maximum value is inherited from that of the parent.
Teradata Statistics
Teradata optimizer gives an execution strategy for every SQL query. This execution strategy is based on the statistics collected on the tables used within the SQL query. Statistics on the table are collected using COLLECT STATISTICS command.
The COLLECT STATISTICS (Optimizer form) statement collects demographic data for one or more columns of a base table, hash index, or join index, computes a statistical profile of the collected data, and stores the synopsis in the Data Dictionary.
The Optimizer uses the synopsis data when it generates its table access and joins plans.
Environment Information
Teradata Statistics environment needs the following:
- Number of Nodes, AMPs, and CPUs
- Amount of memory
Data Demographics
Data Demographics consider the following:
- Number of rows
- Row size
- Range of values in the table
- Number of rows per value
- Number of Nulls
Usage
We should collect statistics on newly created, empty data tables. An empty collection defines the columns, indexes, and synoptic data structure for loaded groups.
We can easily collect statistics again after the table is populated for prototyping, and back when it is in production.
We can collect statistics in the following ways.
- A unique index, which can be:
- Primary or secondary
- Single or multiple columns
- Partitioned or non-partitioned
- A non-unique index, which can be:
- Primary or secondary
- Single or multiple columns
- Partitioned or non-partitioned
- With or without COMPRESS fields
- A non-indexed column or set of columns, which can be:
- Partitioned or non-partitioned
- With or without COMPRESS fields
- A temporary table
- If we specify the TEMPORARY keyword but a materialized table does not exist, the system first materializes an instance based on the specified column names and indexes. This means that after a valid instance is created, we can re-collect statistics on the columns by entering COLLECT STATISTICS and the TEMPORARY keyword without having to specify the desired columns and index.
- If we omit the TEMPORARY keyword, but the table is temporary, statistics are collected for an empty base table rather than the materialized instance.
- Sample (system-selected percentage) of the rows of a data table or index, to detect data skew and dynamically increase the sample size when found.
- The system does not store both sampled and defined statistics for the same index or column set. Once sampled statistics have been collected, implicit recollection hits the same columns and indexes and operates in the same mode.
- Join index
- Hash index
- NoPI table
How to approach Collect Statistics
There are three approaches to collect statistics on the table.
- Random AMP Sampling
- Full statistics collection
- Using the SAMPLE option
Collecting Statistics on Table
COLLECT STATISTICS command is used to collect statistics on a table.
Syntax
Following is the basic syntax to collect statistics on a table.
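A sketch of the general form:

```sql
COLLECT [SUMMARY] STATISTICS
   [INDEX (<index column list>)]
   [COLUMN (<column list>)]
ON <tablename>;
```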
Example
Consider an Employee table with the following records, such as:
Emp_Id | First_Name | Last_Name | Department_No |
---|---|---|---|
202001 | Mike | Richard | 1 |
202002 | Robert | Williams | 2 |
202003 | Peter | Collin | 2 |
202004 | Alexa | Stuart | 1 |
202005 | Robert | Peterson | 1 |
We are going to run the following query to collect statistics for the Emp_Id, First_Name columns of the Employee table.
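A sketch of the statement:

```sql
COLLECT STATISTICS COLUMN (Emp_Id, First_Name) ON Employee;
```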
When the above query is executed, it produces the following output.
Update completed. 2 rows changed.
Viewing Statistics
We can view the collected statistics using the HELP STATISTICS command.
Syntax
Following is the syntax to view the statistics collected.
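A sketch of the general form:

```sql
HELP STATISTICS <tablename>;
```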
Example
Following is an example to view the statistics collected on the Employee table.
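For instance:

```sql
HELP STATISTICS Employee;
```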
When the above query is executed, it produces the following table with updated columns and their values.
Date | Time | Unique Values | Column Names |
---|---|---|---|
6/2/20 | 10:05:02 | 5 | * |
6/2/20 | 10:05:02 | 5 | Emp_Id, First_Name |
Teradata Compression
Compression reduces the physical size of the stored information. The goal of compression is to represent information accurately using the fewest number of bits.
The Compression methods are either logical or physical. Physical data compression re-encodes information independently of its meaning, and logical data compression substitutes one set of data with another, more compact set.
In Teradata, compression can compress up to 255 distinct values, including NULL. Since the storage is reduced, Teradata can store more records in a block. This results in improved query response time since any input operation can process more rows per block.
Compression can be added at table creation using CREATE TABLE or after table creation using ALTER TABLE command.
Compression has the following essential reasons, such as:
- To reduce storage costs.
- To enhance system performance.
Compression reduces storage costs by storing more logical data per unit of physical capacity. Compression produces smaller rows, resulting in more rows stored per data block and fewer data blocks.
Compression enhances system performance because there is less physical data to retrieve per row for queries. And because compressed data remains compressed while in memory, the FSG cache can hold more rows, reducing disk I/O.
Rules
Teradata Compression method has the following rules to compress stored data, such as:
- Only 255 values can be compressed per column.
- The Primary Index column cannot be compressed.
- Volatile tables cannot be compressed.
Types of Compression
Teradata Database uses several types of compression.
Database element | Explanation |
---|---|
Column values | The storage of those values one time only in the table header, not in the row itself, pointing to them with an array of presence bits in the row header. This applies to multi-value compression (MVC) and algorithmic compression (ALC) of column values. |
Hash and Join indexes | A logical row compression in which multiple sets of non-repeating column values are appended to a single set of repeating column values. This allows the system to store the repeating value set only once, while any non-repeating column values are stored as logical segmental extensions of the base repeating set. |
Data blocks | The storage of primary table data, or join or hash index subtable data. |
Partition containers | The auto-compression method set determined by Teradata Database to apply to a container of a column-partitioned table or join index when we have not specified the NO AUTO COMPRESS option at the time the object was created. |
Multi-Value Compression (MVC)
Multi-value compression (MVC) compresses repeating values in a column when we specify the value in a compression list in the column definition.
When data in the column matches a value specified in the compression list, the database stores it only once in the table header, regardless of how many times it occurs as a field value for the column. The database then stores a smaller substitute value, often as small as 2 bits, in each row where the value occurs.
MVC provides the best cost and benefit ratio compared to other methods because it requires minimal resources to uncompress the data during query processing. We can use MVC for frequently used data without compromising query and load performance.
MVC is also considered the easiest to implement of all the compression methods. MVC is a logical data compression form.
Multi-value compression has the following performance impacts, such as:
- It improves table scan response times for most configurations and workloads.
- It provides moderate to little CPU savings.
Example
The following example compresses the field Department_No for values 1, 2, and 3. When compression is applied on a column, the values for this column are not stored with the row.
Instead, the values are stored in the table header in each AMP, and only presence bits are added to the row to indicate the value.
The system automatically compresses the specified values when data moves into the table and uncompresses them when the containing rows are accessed.
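A minimal sketch of such a table definition, reusing the Employee columns shown earlier; the COMPRESS value list is the part that enables MVC:
```sql
CREATE SET TABLE Employee (
   Emp_Id        INTEGER,
   First_Name    VARCHAR(30),
   Last_Name     VARCHAR(30),
   Department_No INTEGER COMPRESS (1, 2, 3)  -- MVC on the listed values (and NULL)
)
UNIQUE PRIMARY INDEX ( Emp_Id );
```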
NOTE
- The system automatically compresses NULLs when specifying COMPRESS.
- MVC cannot be applied to row-level security columns.
Algorithmic Compression
Teradata Database software includes several standard compression algorithms, in the form of UDFs, which are used to compress many types of data by table column. We can also create custom compression and decompression algorithms in the UDF format.
When column values are unique, then algorithmic compression (ALC) provides better compression results than MVC.
If columns have repeated values, then we can use ALC and MVC concurrently on the same column, but the system does not apply ALC to any value covered by MVC.
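A hedged sketch of ALC on a hypothetical text column, using the Unicode-to-UTF8 compression UDF pair that ships with Teradata in TD_SYSFNLIB:
```sql
CREATE SET TABLE Customer_Notes (
   Note_Id   INTEGER,
   Note_Text VARCHAR(1000) CHARACTER SET UNICODE
      COMPRESS USING TD_SYSFNLIB.TransUnicodeToUTF8
      DECOMPRESS USING TD_SYSFNLIB.TransUTF8ToUnicode
)
PRIMARY INDEX ( Note_Id );
```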
NOTE
- Specify only one set of compression or decompression UDFs for a particular column.
- The system automatically compresses nulls when specifying COMPRESS.
Row Compression
Row compression is a form of logical data compression in which Teradata Database stores a repeating column value set only once.
And other non-repeating column values that belong to that set are stored as logical segmental extensions of the base repeating set. Row compression is a lossless method.
Like multi-value compression, there is no decompression necessary to access row compressed data values.
Block-Level Compression
A data block is the physical unit of input that defines how Teradata handles data. When we specify block-level compression (BLC), Teradata stores data blocks in a compressed format to save storage space.
The goals of block-level compression are to save storage space and to reduce disk input bandwidth. Block-level compression can use significantly more CPU to compress and decompress data dynamically. Thus, whether query performance is enhanced with block-level compression depends on whether performance is more limited by disk input bandwidth or CPU usage.
Teradata Database offers software-based and hardware-based BLC. We select the BLC method when we specify the compression algorithm in the Compression Algorithm field in DBS Control.
- Hardware-Based Block-Level Compression
Hardware-based BLC creates little or no CPU resource contention with other database operations.
Hardware-based block-level compression provides the same functions as software-based block-level compression.
- Software-Based Block-Level Compression
Software-based BLC is designed for use on large volumes of seldom-accessed data. BLC can also achieve relatively good compression on unknown and unclassified data across different column types, including the internal data structures and the row header.
Block-level compression can be enabled at the system level using the DBS Control utility and at the table level using SQL DDL.
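At the table level, BLC is requested through the BLOCKCOMPRESSION table option; the table below is a hypothetical sketch, and the option values (DEFAULT, MANUAL, AUTOTEMP, NEVER) may vary by release:
```sql
CREATE SET TABLE Sales_History,
   BLOCKCOMPRESSION = MANUAL   -- table-level BLC option
(
   Sale_Id   INTEGER,
   Sale_Date DATE,
   Amount    DECIMAL(10,2)
)
PRIMARY INDEX ( Sale_Id );
```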
Auto Compression
When we create a column-partitioned table or join index, Teradata Database attempts to use one or more methods to compress the data that is inserted into the physical rows of the object, unless we specify the NO AUTO COMPRESS option when we create it, or NO AUTO COMPRESS is set as the default.
The process of selecting and applying appropriate compression methods to the physical containers of a column-partitioned table or join index is referred to as Auto Compression.
Auto Compression is most useful for a column partition with a single column and COLUMN format.
Teradata Database applies Auto Compression only to column partitions with COLUMN format, and only when it reduces the size of a container. The methods chosen must meet the following requirements:
- Minimal CPU resources are required to decompress the data for reading.
- Teradata Database does not need to decompress many values to find a single value.
For some values, there are no applicable compression techniques that can reduce the size of the physical row, so Teradata Database does not compress the values for that physical row. Still, the system attempts to compress physical row values using one of the Auto Compression methods available to it.
Teradata Database automatically decompresses any compressed column partition values when we retrieve rows from a column-partitioned table.
Teradata Explain
The EXPLAIN command shows the Parsing Engine's (PE's) plan for the AMPs. It returns the execution plan of the parsing engine in an English translation. It can be used with any SQL statement except another EXPLAIN command.
When a query is preceded with the EXPLAIN command, the execution plan of the Parsing Engine is returned to the user instead of being sent to the AMPs. The explain plan gives a clear picture of the way the optimizer will execute a query.
It is a good practice to analyze the explain plan for a newly developed query before running it in the Teradata system.
The explain plan can be obtained in two ways. First, add an "EXPLAIN" keyword before any queries, and the second is simply pressing "F6" from the keyboard.
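For example, the first approach simply prefixes the request (the query itself is illustrative):
```sql
EXPLAIN SELECT * FROM Employee WHERE Department_No = 1;
```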
Explain plan is beneficial to analyze the performance issue of a query as it breaks down any complex query into the lowest level. The Explain plan provides a lot of information, such as:
- Access path: If data is fetching using full table scan or using any index like primary index path, secondary index path, or any other indexes.
- Confidence level: If the optimizer can use the statistics or any statistics is missing.
- Joins information: What kind of join is going to happen.
- Time estimation: Provides an estimated query completion time.
If we place the EXPLAIN command in front of a query statement or simply press the F6 key, the optimizer relays one of the following confidence messages to the user:
- High Confidence: Statistics available on an Index or Column.
- Low Confidence: Random sampling of INDEX, or Statistics available but AND/OR condition exists.
- No Confidence: Random sampling based on AMP row count. No statistics are collected.
Explain Plan Keywords
To understand the EXPLAIN plan, we should know the following Keywords.
Keyword | Explanation |
---|---|
Locking Pseudo Table | Serial lock on a symbolic table. Every table has one. It is used to prevent deadlock situations between users. |
Locking table for | This indicates that ACCESS, READ, WRITE, or the EXCLUSIVE lock has been placed on the table. |
Locking rows for | Indicates that an ACCESS, READ or WRITE, the lock is placed on rows read or written. |
Do an ABORT test | Guarantees a transaction is not in progress for this user. |
All AMPs retrieve | All AMPs are receiving the AMP steps and are involved in providing the answer set. |
By way of an all-rows scan | Rows are read sequentially on all AMPs. |
By way of the primary index | Rows are read using the Primary index column(s). |
By way of index number | Rows are read using the Secondary index: number from HELP INDEX. |
BMSMS | Bit Map Set Manipulation Step, alternative direct access technique when multiple NUSI columns are referenced in the WHERE clause. |
Residual conditions | WHERE clause conditions, other than those of a join. |
Eliminating duplicate rows | Providing unique values, usually the result of DISTINCT, GROUP BY, or a subquery. |
Where unknown comparison will be ignored | This indicates that NULL values will not compare to TRUE or FALSE. Seen in a subquery using NOT IN or NOT = ALL because no rows will be returned on the ignored comparison. |
Nested join | The fastest join possible. It uses a UPI to retrieve a single row after using a UPI or a USI in the WHERE clause to reduce the join to a single row. |
Merge join | Rows of one table are matched to the other table on common domain columns after being sorted into the same sequence, normally Row Hash. |
Product join | Rows of one table are matched to all rows of another table with no concern for domain match. |
ROWID join | A very fast join. It uses ROWID of a UPI to retrieve a single row after using a UPI or a USI in the WHERE to reduce the join to a single row. |
Duplicated on all AMPs | Participating rows for the table of a join are duplicated on all AMPS. |
Hash redistributed on all AMPs. | Participating rows of a join are hashed on the join column and sent to the same AMP that stores the matching row of the table to join. |
SMS | Set Manipulation Step, the result of an INTERSECT, UNION, EXCEPT, or MINUS operation. |
Last use | The SPOOL file is no longer needed after the step, and its space is released. |
Built locally on the AMPs | As rows are read, they are put into SPOOL on the same AMP. |
Aggregate Intermediate Results computed locally. | The aggregation values are all on the same AMP and, therefore, no need to redistribute them to work with rows on other AMPs. |
Aggregate Intermediate Results computed globally. | The aggregation values are not all on the same AMP and must be redistributed on one AMP to accompany the same value with the other AMPs. |
How the EXPLAIN Plan Works
The EXPLAIN request modifier that precedes any SQL request causes Teradata Database to display the execution plan for that request. The request itself is not submitted for execution.
- When we perform an EXPLAIN against any SQL request, that request is parsed and optimized.
- The access and join plans generated by the Optimizer are returned in the form of a text file that explains the (possibly parallel) steps used in the execution of the request.
- It also includes the relative cost required to complete the request, given the statistics with which the Optimizer had to work.
- If the statistics are not reasonably accurate, the cost estimate may not be accurate.
Benefits of Using EXPLAIN
Here are some essential benefits of using the Explain Plan, such as:
- EXPLAIN helps to evaluate complex queries and to develop alternative, more efficient, processing strategies.
- We can get a better plan by collecting more statistics on more columns, or by defining additional indexes.
- Knowing the actual demographics information may allow identifying row count estimates that seem badly wrong and help pinpoint areas where additional statistics would be helpful.
Examples of EXPLAIN
Consider the Employee table with the following definition.
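A plausible definition that matches the EXPLAIN output quoted below (the exact column list is assumed):
```sql
CREATE SET TABLE TDUSER.Employee (
   Employee_Id   INTEGER,
   First_Name    VARCHAR(30),
   Last_Name     VARCHAR(30),
   Department_No INTEGER
)
UNIQUE PRIMARY INDEX ( Employee_Id );
```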
Full Table Scan (FTS)
When no conditions are specified in the SELECT statement, then the optimizer may use the Full Table Scan where each and every row of the table is accessed.
Example
Following is a query where the optimizer may choose FTS.
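A sketch of such a query, selecting every row with no WHERE clause:
```sql
SELECT * FROM TDUSER.Employee;
```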
When the above query is executed, it produces the following output. As can be seen, the optimizer chooses to access all AMPs and all rows within the AMP.
1. First, we lock a distinct TDUSER."pseudo table" for reading on a RowHash to prevent global deadlock for TDUSER.Employee.
2. Next, we lock TDUSER.Employee to read.
3. We do an all-AMPs RETRIEVE step from TDUSER.Employee by way of an all-rows scan with no residual conditions into Spool 1 (group_amps) built locally on the AMPs. The size of Spool 1 is estimated with low confidence to be 2 rows (116 bytes). The estimated time for this step is 0.03 seconds.
4. Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request.
- The contents of Spool 1 are sent back to the user as the result of statement 1.
- The total estimated time is 0.03 seconds.
Unique Primary Index
When the rows are accessed using Unique Primary Index, then it is one AMP operation.
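For example, a query such as the following (assuming Employee_Id is the UPI, as in the table definition above):
```sql
SELECT * FROM TDUSER.Employee WHERE Employee_Id = 1001;
```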
When the above query is executed, it produces a single-AMP retrieval plan in which the optimizer uses the unique primary index to access the row.
1. First, we do a single-AMP RETRIEVE step from TDUSER.Employee by way of the unique primary index "TDUSER.Employee.Employee_Id = 1001" with no residual conditions.
- The row is sent directly back to the user as the result of statement 1.
- The total estimated time is 0.01 seconds.
Unique Secondary Index
When the rows are accessed using a Unique Secondary Index, it is a two-AMP operation.
Example
Consider the Salary table with the following definition.
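A plausible definition consistent with the EXPLAIN output below: Employee_Id is a non-unique primary index, and a unique secondary index is then created on it (shown as unique index # 4 in the plan):
```sql
CREATE SET TABLE TDUSER.Salary (
   Employee_Id INTEGER,
   Gross       INTEGER,
   Deduction   INTEGER,
   NetPay      INTEGER
)
PRIMARY INDEX ( Employee_Id );

CREATE UNIQUE INDEX ( Employee_Id ) ON TDUSER.Salary;
```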
Consider the following SELECT statement.
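For instance:
```sql
SELECT * FROM TDUSER.Salary WHERE Employee_Id = 1001;
```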
When the above query is executed, the plan shows that the optimizer retrieves the row in a two-AMP operation using the unique secondary index.
1. First, we do a two-AMP RETRIEVE step from TDUSER.Salary by way of unique index # 4 "TDUSER.Salary.Employee_Id = 1001" with no residual conditions.
- The row is sent directly back to the user as the result of statement 1.
- The total estimated time is 0.01 seconds.
Teradata Join Index
The Teradata Join index stores pre-joined tables. Join indexes are comparable to materialized views.
Join indexes are designed to permit queries to be resolved by accessing the index instead of accessing, and possibly joining, their underlying base tables.
A join index contains one or more tables and can also provide pre-aggregated data. Join indexes are mainly used for improving performance.
The join indexes are an index structure. They have to be considered more as an added layer of pre-aggregated, pre-joined, or permanent tables with its row-level partitioning, Primary Index, and Secondary Indexes.
Even if a join index does not entirely cover a query, the optimizer can use it to join its underlying base tables in a way that provides better query optimization than scanning the base tables for all the columns specified in the request.
Join indexes are useful for queries that aggregate columns from tables with large cardinalities. These indexes play the role of pre-join and summary tables without denormalizing the logical design of the database, while still delivering the performance benefit that denormalization often brings to particular queries.
There are three types of join indexes available in Teradata.
- Single Table Join Index (STJI)
- Multi-Table Join Index (MTJI)
- Aggregate Join Index (AJI)
Rules
Here are some essential rules of the Join Index, such as:
- The Usage of the FULL OUTER JOIN is not allowed.
- At least one non-null column must be selected on the inner side of the LEFT or RIGHT JOIN.
- An OUTER JOIN increases the likelihood that the optimizer will use the Join Index, but it is not allowed for the Aggregate Join Index.
- HAVING and QUALIFY keywords are not allowed.
- Set Operations are not allowed, such as UNION, INTERSECT, and MINUS.
- No Subqueries are allowed.
- Inequality conditions in ON clauses of join index definitions are not allowed.
- Only <, <=, >=, > comparison operators are allowed.
- During the restoration of a base table or database, the join index is marked as invalid.
- A maximum of 64 columns per table per Join Index is allowed.
- A maximum of 32 indexes can be defined on a table, including join indexes.
- A UPI is allowed only for a single-table join index.
- Value ordered NUSI and Partition Expression on the same Join Index columns are not allowed.
Single Table Join Index
Single-Table join indexes are created from exactly one base table. Their purpose is to have the same table available with a different primary index, partitioning, or smaller table (the join index table) with fewer columns to be spooled.
This improves the performance of joins as no distribution or duplication is needed. The user will query on the base table, but PE will decide whether to access the base table or single table join index.
Syntax
Following is the syntax of a JOIN INDEX.
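A generic sketch of the statement:
```sql
CREATE JOIN INDEX join_index_name AS
SELECT column_list
FROM table_name
[PRIMARY INDEX ( column_name )];
```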
Example
Consider the following Employee and Salary tables.
Following is an example of creating a Join index named Employee_JI on the Employee table.
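A minimal sketch, assuming the Employee table carries Employee_Id, Employee_Name, and Department_No columns as described below:
```sql
CREATE JOIN INDEX Employee_JI AS
SELECT Employee_Id, Employee_Name, Department_No
FROM Employee
PRIMARY INDEX ( Employee_Name );
```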
If the user submits a query with a WHERE clause on Employee_Id, then the system will query the Employee table using the unique primary index.
If the user queries the employee table using employee_name, then the system can access the join index Employee_JI using employee_name.
The rows of the join index are hashed on the employee_name column. Without the join index, and with employee_name not defined as a secondary index, the system would perform a full table scan to access the rows, which is time-consuming.
We can run the following EXPLAIN plan and verify the optimizer plan. In the following example, the optimizer uses the Join Index instead of the base Employee table when the table queries using the Employee_Name column.
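A sketch of such a check (the literal value is illustrative):
```sql
EXPLAIN SELECT * FROM Employee WHERE Employee_Name = 'Mike';
```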
Explanation: the EXPLAIN output shows a RETRIEVE step from the join index Employee_JI rather than from the base Employee table.
Multi-Table Join Index
A multi-table join index is created by joining more than one table. A multi-table join index can be used to store the result set of frequently joined tables to improve the performance.
A multi-table join index holds a pre-join result set of two or more tables. During join processing, the PE may decide to access data from the multi-table join index rather than joining the underlying base tables again. We should define a multi-table join index only after careful analysis of the frequency and cost of the join.
Multi-Table Join Indexes allow us to move resource-intensive joins from the online to the batch window.
Shifting the workload does not reduce the total workload, but it moves it to a point in time that is beneficial for the system's overall performance.
Example
The following example creates a JOIN INDEX named Employee_Salary_JI by joining Employee and Salary tables.
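A minimal sketch, assuming both tables join on Employee_Id:
```sql
CREATE JOIN INDEX Employee_Salary_JI AS
SELECT E.Employee_Id, E.Employee_Name, E.Department_No,
       S.Gross, S.Deduction, S.NetPay
FROM Employee E
INNER JOIN Salary S
   ON E.Employee_Id = S.Employee_Id
PRIMARY INDEX ( Employee_Id );
```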
When the base tables Employee or Salary are updated, the Join index Employee_Salary_JI is also automatically updated.
If we are running a query joining Employee and Salary tables, then the optimizer may choose to access the data from Employee_Salary_JI directly instead of joining the tables. EXPLAIN plan on the query can be used to verify if the optimizer will choose the base table or Join index.
Aggregate Join Index
The aggregate join index method is used to resolve queries that frequently specify the same aggregate operations on the same column or columns.
When aggregate join indexes are available, then the system does not repeat aggregate calculations for every query. An aggregate join index can be defined on two or more tables or a single table.
If a table is consistently aggregated on specific columns, then the aggregate join index can be defined on the table to improve the performance. One limitation of the aggregate join index is that it supports only the following functions.
- SUM function
- COUNT function
- GROUP BY clause
Example
In the following example, the Employee and Salary are joined to identify the total salary per Department.
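A hedged sketch of such an aggregate join index; the index name is assumed, and some releases require the summed column to be typed as FLOAT:
```sql
CREATE JOIN INDEX Department_Salary_JI AS
SELECT E.Department_No, SUM(S.NetPay) AS Total_NetPay
FROM Employee E
INNER JOIN Salary S
   ON E.Employee_Id = S.Employee_Id
GROUP BY E.Department_No
PRIMARY INDEX ( Department_No );
```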
Teradata Partitioned Primary Index
Partitioned Primary Index (PPI) is one of the powerful features of Teradata, which allows a user to access a part of the table rather than full table access. PPI is an indexing mechanism that is useful in improving the performance of specific queries. PPI works the same as Primary Index for the data distribution but creates partitions according to range or case as specified in the table.
A partitioned primary index (PPI) allows rows to be assigned to user-defined data partitions on the AMPs, enabling enhanced performance for range queries predicated on primary index values.
When rows are inserted into a table, they are stored in an AMP and arranged by their row hash order. And when a table is defined with PPI, then the rows are sorted by their partition number.
Within each partition, they are arranged by their row hash. Rows are assigned to a partition based on the partition expression defined. Below are the following significant points of the PPI, such as:
- PPI is used to improve performance for large tables when we submit queries that specify a range constraint.
- PPI allows reducing the number of rows to be processed by using partition elimination.
- PPI will increase performance for incremental data loads, deletes, and data access when working with large tables with range constraints
Commonly queried rows are stored together in partitions so that we can restrict data retrieval to those partitions that contain the rows required by your SQL statement. Other partitions that do not provide requested rows are not read.
Incidentally, other database systems, such as Netezza, go the other way round by storing the information about the data blocks where the requested rows are not stored. Netezza called this method "Data Skipping".
Partitioning will not help to avoid table scans, but it will reduce the number of data blocks that each AMP must move to its local memory (FSG cache) because only relevant partitions need to be accessed.
Characteristics of PPI
Here are some characteristics of the Partitioned Primary Index, such as:
- Partitioning can be done on volatile, global temporary, and permanent tables.
- The partitioning column doesn't have to be part of the non-unique Primary Index.
- The partitioning column must be part of a unique Primary Index.
- The partitioning column does not decide which AMP a row goes to; that is still determined by the Primary Index.
- Rows within an AMP are sorted by partition first, not by row ID alone.
- PPI can be defined at multi-level.
- Teradata uses a partition elimination strategy to avoid a Full Table Scan (FTS) for partitioned tables.
- Partitioned tables may reduce input for range queries by reducing the number of rows to be processed.
- Details about the partitions, constraints & conditions can be retrieved from dbc.PartitioningConstraintsV view.
Advantages of PPI
Following are the advantages of the Partitioned Primary Index:
- It avoids a full table scan for specific queries.
- PPI does not use a secondary index that requires additional physical structure and other input maintenance.
- It can access a subset of a large table quickly.
- PPI makes it quick to drop old data and add new data.
- It can be created on a global temp table, volatile table, and non-compressed join indexes.
- For range-based queries, we can remove SIs and use PPI, which will reduce the space of an overhead SI Subtable.
Disadvantages
Partitioned Primary Index also has the following disadvantages, such as:
- Partitioning can make single-row or primary index access to the table slower if the partitioning column is not part of the primary index.
- An additional 2 bytes are added to each row, which increases the perm space.
- Joining a PPI table with a non-partitioned table can take longer.
- Access on columns other than the PPI column takes more time.
Types of PPI
There are four types of Partitioned Primary Index:
- Case partitioning
- Range-based partitioning
- Multi-level partitioning
- Character-based partitioning
Example
Consider the following Orders table with Primary Index on OrderNo.
StoreNo | OrderNo | OrderDate | OrderTotal |
---|---|---|---|
2001 | 75 | 2017-11-02 | 900 |
2001 | 99 | 2017-12-10 | 1,200 |
2002 | 77 | 2017-06-09 | 3,000 |
2002 | 18 | 2017-09-05 | 2,454 |
2001 | 71 | 2017-11-03 | 1201 |
2003 | 70 | 2017-04-18 | 2,454 |
2001 | 80 | 2017-10-08 | 1201 |
2001 | 65 | 2017-05-15 | 1201 |
Assume that the records are distributed between AMPs, as shown in the following tables. Records are stored in AMPs, sorted by their row hash.
AMP 1
RowHash | OrderNo | OrderDate |
---|---|---|
1 | 71 | 2017-11-03 |
2 | 18 | 2017-09-05 |
3 | 75 | 2017-11-02 |
4 | 65 | 2017-05-15 |
AMP 2
RowHash | OrderNo | OrderDate |
---|---|---|
1 | 80 | 2017-10-08 |
2 | 99 | 2017-12-10 |
3 | 70 | 2017-04-18 |
4 | 77 | 2017-06-09 |
If we run a query to extract the orders for a particular date, the optimizer may choose a Full Table Scan, in which case all the records within each AMP may be accessed. To avoid the full table scan, we define the order date as the Partitioned Primary Index. When rows are inserted into the Orders table, they are partitioned by the order date. Within each partition, they are ordered by their row hash.
The following data shows how the records will be stored in AMPs if they are partitioned by Order Date. If a query is run to access the records by Order Date, only the partition that contains the records for that particular date will be read.
AMP 1
Partition | RowHash | OrderNo | OrderDate |
---|---|---|---|
0 | 3 | 75 | 2017-11-02 |
1 | 1 | 71 | 2017-11-03 |
1 | 2 | 18 | 2017-09-05 |
2 | 4 | 65 | 2017-05-15 |
AMP 2
Partition | RowHash | OrderNo | OrderDate |
---|---|---|---|
0 | 2 | 99 | 2017-12-10 |
0 | 4 | 77 | 2017-06-09 |
1 | 3 | 70 | 2017-04-18 |
2 | 1 | 80 | 2017-10-08 |
Following is an example of creating a table with a partition primary Index. PARTITION BY clause is used to define the partition.
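A minimal sketch, assuming the 2017 date range shown in the sample data; EACH INTERVAL '1' DAY produces one partition per day:
```sql
CREATE SET TABLE Orders (
   StoreNo    INTEGER,
   OrderNo    INTEGER,
   OrderDate  DATE,
   OrderTotal INTEGER
)
PRIMARY INDEX ( OrderNo )
PARTITION BY RANGE_N (
   OrderDate BETWEEN DATE '2017-01-01' AND DATE '2017-12-31' EACH INTERVAL '1' DAY
);
```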
In the above example, the table is partitioned by the OrderDate column. There will be one separate partition for each day.
Another partitioning expression, using RANGE_N on a totalorders column, looks like this:
PARTITION BY RANGE_N(totalorders BETWEEN *, 100, 1000 AND *, UNKNOWN);
In the example, RANGE_N specifies four partitions to which a row can be assigned, based on the value of the totalorders column:
Partition Number | Condition |
---|---|
1 | The value of the totalorders column is less than 100. |
2 | The value of the totalorders column is less than 1000, but greater than or equal to 100. |
3 | The value of the totalorders column is greater than or equal to 1000. |
4 | The totalorders column is NULL, so the range is UNKNOWN. |
Partitioning can also be defined at multiple levels by combining RANGE_N expressions:
PARTITION BY (RANGE_N(totalorders BETWEEN *, 100, 1000 AND *)
,RANGE_N(orderdate BETWEEN *, '2005-12-31' AND *) );
The example defines six partitions to which a row can be assigned. The first RANGE_N expression defines three partitions based on the value of the totalorders column. The second RANGE_N expression subdivides each of the three partitions into two partitions based on the value of the orderdate column.
Level 1 Partition Number | Level 2 Partition Number | Condition |
---|---|---|
1 | 1 | The value of the totalorders column is less than 100 and the value of the orderdate column is less than '2005-12-31'. |
2 | The value of the totalorders column is less than 100 and the value of the orderdate column is greater than or equal to '2005-12-31'. | |
2 | 1 | The value of the totalorders column is less than 1000 but greater than or equal to 100, and the value of the orderdate column is less than '2005-12-31'. |
2 | The value of the totalorders column is less than 1000 but greater than or equal to 100, and the value of the orderdate column is greater than or equal to '2005-12-31'. | |
3 | 1 | The value of the totalorders column is greater than or equal to 1000 and the value of the orderdate column is less than '2005-12-31'. |
2 | The value of the totalorders column is greater than or equal to 1000 and the value of the orderdate column is greater than or equal to '2005-12-31'. |
Teradata Secondary Indexes
Secondary indexes are never required for Teradata Database tables, but they can often improve system performance.
A Secondary Index (SI) is an alternate data access path. It allows accessing the data without having to do a full-table scan.
The secondary index can be created explicitly using the CREATE TABLE and CREATE INDEX statements. Teradata Database can implicitly create unique secondary indexes.
For example, when we use a CREATE TABLE statement that specifies a primary index, Teradata Database implicitly creates unique secondary indexes on column sets that we specify using PRIMARY KEY or UNIQUE constraints.
Secondary Indexes are stored in separate subtables that require additional disk space and maintenance, which is handled automatically by the system. These subtables are built on all AMPs.
The subtables contain the index rows, which add another set of rows that must be updated each time the base table changes.
Secondary index subtables are also duplicated whenever a table is defined with FALLBACK. The Secondary index is not involved in the data distribution.
The purpose of the secondary index subtable is to point back to the real row in the base table via the Row-ID.
Secondary Index Rules
Here are some essential rules for the secondary index:
Rule 1: Secondary Indexes are optional.
Rule 2: Secondary Index values can be unique or non-unique.
Rule 3: Secondary Index values can be NULL.
Rule 4: Secondary Index values can be modified.
Rule 5: Secondary Indexes can be changed.
Rule 6: A Secondary Index has a limit of 64 columns.
Secondary Indexes Types
There are two types of secondary indexes in Teradata:
- Unique Secondary Index (USI)
- Non-Unique Secondary Index (NUSI)
Unique Secondary Index (USI)
A Unique Secondary Index allows unique values for the columns defined as USI.
A Unique Secondary Index (USI) serves two purposes.
- Enforces uniqueness on a column or group of columns.
- Speeds up access to a row (data retrieval speed).
When a USI is created, Teradata immediately builds a secondary index subtable on each AMP.
Each AMP will then hash the secondary index value for each of their rows in the base table.
Syntax
Below is the syntax to create a Unique Secondary Index:
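A generic sketch:
```sql
CREATE UNIQUE INDEX [index_name] (column_name [, column_name ...]) ON table_name;
```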
Example
First create an Employee table with the following records, such as:
Emp_Id | First_Name | Last_Name | Department_No |
---|---|---|---|
202001 | Mike | Richard | 1 |
202002 | Robert | Williams | 2 |
202003 | Peter | Collin | 2 |
202004 | Alexa | Stuart | 1 |
202005 | Robert | Peterson | 1 |
The following example creates USI on the Emp_Id column of the employee table.
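A minimal sketch; it assumes Emp_Id is not already the table's primary index, since a USI is meant to provide a different access path:
```sql
CREATE UNIQUE INDEX (Emp_Id) ON Employee;
```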
The output of the Emp_Id hash will utilize the hash map to point to a specific AMP and that AMP will hold the secondary index subtable row for the secondary index value.
That means the subtable row will hold the base table Row-ID, and Teradata will then find the base row immediately.
Non-Unique Secondary Index (NUSI)
A Non-Unique Secondary Index (NUSI) allows duplicate values for the columns defined as NUSI.
A Non-Unique Secondary Index is usually specified to prevent full-table scans, in which every row of a table is read.
A USI is always a two-AMP operation, so it is almost as fast as a Primary Index. A NUSI is an all-AMP operation, but not a full table scan.
When a NUSI is created, Teradata immediately builds a secondary index subtable on each AMP.
Each AMP will hold the secondary index values for their rows in the base table only. There can be up to 32 Secondary Indexes on a table.
Syntax
Below is the generic syntax to create a Non-Unique Secondary Index:
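A generic sketch:
```sql
CREATE INDEX [index_name] (column_name [, column_name ...]) ON table_name;
```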
Example
The following example creates NUSI on the First_Name column of the employee table.
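A minimal sketch on the table used above:
```sql
CREATE INDEX (First_Name) ON Employee;
```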
In the above example, each AMP holds the name column for all employee rows in the base table on their AMP (AMP local).
Each AMP Local name will have the Base Table Row-ID (pointer), so the AMP can retrieve it quickly if needed.
If an AMP contains duplicate first names, only one subtable row for that name is built with multiple Base Row-IDs.
Teradata Hashing Algorithm
The hashing algorithm is a piece of code that acts as a translation table. A row is assigned to a particular AMP based on the primary index value. Teradata uses a hashing algorithm to determine which AMP gets the row.
The Teradata Database hashing algorithms are proprietary mathematical functions that transform an input data value of any length into a 32-bit value referred to as a rowhash, which is used to assign a row to an AMP.
Whether the input is a combination of different column values, comes from differently-sized values from a variable-length column, or is composed of different data types, the hashing algorithm's output will always be a fixed size and format.
Dealing with uniformly sized row identifiers simplifies the database's effort during storing, aggregating, sorting, and joining rows.
Following is a high-level diagram on the hashing algorithm.
Here is the explanation of the above diagram, how to insert the data:
- First, the client submits a query.
- Then, the parser receives the query and passes the PI value of the record to the hashing algorithm.
- The hashing algorithm hashes the primary index value and returns a 32-bit number, called Row Hash.
- The higher-order bits of the row hash (the first 16 bits) are used to identify the hash map entry. The hash map is an array of buckets, and each bucket contains one specific AMP number.
- BYNET sends the data to the identified AMP.
- AMP uses the 32 bit Row hash to locate the row within its disk.
- If there is any record with the same row hash, it increments the uniqueness ID, a 32-bit number. For new row hash, uniqueness ID is assigned as 1 and incremented whenever a record with the same row hash is inserted.
- The combination of Row hash and Uniqueness ID is called Row ID.
- Row ID prefixes each record in the disk.
- Each table row in the AMP is logically sorted by their Row IDs.
Hashing Functions
Teradata SQL provides several functions that can be used to analyze the hashing characteristics of the existing indexes and candidate indexes.
These functions are documented fully in SQL Functions, Operators, Expressions, and Predicates. There are four types of hashing functions available in the Teradata.
1. HASHROW: It returns the hexadecimal rowhash value for an expression. The query gives the same result each time it is run.
The HASHROW function produces the 32-bit binary Row Hash that is stored as part of the data row. It can return a maximum of 4,294,967,295 unique values.
Syntax
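A sketch of the call, with a usage example on the Employee table used earlier:
```sql
-- HASHROW ( expression [, expression ...] )
SELECT HASHROW(Emp_Id) AS Row_Hash FROM Employee;
```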
2. HASHAMP: The HASHAMP function returns the identification number of the primary AMP for any Hash Bucket number.
When no value is passed to the HASHAMP function, it returns one less than the number of AMPs in the current system configuration.
Syntax
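A sketch; wrapping HASHBUCKET(HASHROW(...)) is the usual way to see which AMP a value maps to, and grouping by it shows the row distribution:
```sql
-- HASHAMP ( hash_bucket )
SELECT HASHAMP(HASHBUCKET(HASHROW(Emp_Id))) AS AMP_No,
       COUNT(*) AS Row_Count
FROM Employee
GROUP BY 1;
```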
3. HASHBUCKET: The HASHBUCKET function produces the binary Hash Bucket (16 or 20 bits, depending on the system's hash bucket size) that is used with the Hash Map to determine the AMP that stores and retrieves the data row.
With 20-bit hash buckets, the values range from 0 to 1,048,575, not counting NULL as a possible result.
Syntax
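A sketch:
```sql
-- HASHBUCKET ( row_hash )
SELECT HASHBUCKET(HASHROW(Emp_Id)) AS Bucket_No FROM Employee;
```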
4. HASHBAKAMP: The HASHBAKAMP function returns the identification number of the Fallback AMP for any Hash Bucket number.
Syntax
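A sketch:
```sql
-- HASHBAKAMP ( hash_bucket )
SELECT HASHBAKAMP(HASHBUCKET(HASHROW(Emp_Id))) AS Fallback_AMP_No FROM Employee;
```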
Hash Collisions
The Hash Collision is a situation in which the rowhash value for different rows is identical, making it difficult for a system to discriminate among the hash synonyms when one unique row is requested for retrieval from a set of hash synonyms.
Minimizing Hash Collisions
To minimize the Hash Collision problem, Teradata Database defines 4.2 billion hash values. The AMP software adds a system-generated 32-bit uniqueness value to the rowhash value.
The resulting 64-bit value prefixed with an internal partition number is called the rowID. This value uniquely identifies each row in a system, making a scan to retrieve a particular row among several having the same rowhash a trivial task.
A scan must check each of the rows to determine if it has the searched-for value and not another value with the same rowhash value.
Hash Maps
A hash map is a mechanism for determining which AMP a row is stored on, or, in the case of the Open PDE hash map, the destination AMP for a message sent by the Parser subsystem.
Each cell in the map array corresponds to a hash bucket, and each hash bucket is assigned to an AMP. Hash map entries are maintained by the BYNET.
Teradata Join Strategies
Teradata Join Strategies are utilized by the optimizer to choose the minimum cost plan and better performance.
The strategy will be chosen based on the available information to the optimizer, such as Table size, PI information, and Stats Information.
Teradata joins strategies are the following types:
- Merge (Exclusion)
- Nested
- Row Hash
- Product
Merge Join Strategies
Merge Join method takes place when the join is based on the equality condition.
Merge Join requires the joining rows to be on the same AMP. Rows are joined based on their row hash.
Merge Join uses four different merge join strategies based on the redistribution to bring the rows to the same AMP.
Strategy 1
The 1st merge join will utilize the Primary Index on both tables in the join condition.
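A hedged sketch of such a join; the table names are assumed, and Emp is the Primary Index of both tables:
```sql
SELECT *
FROM Employee_Table E
INNER JOIN Department_Table D
   ON E.Emp = D.Emp;
```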
The inner join above focuses on returning all rows when there is a match between the two tables.
The ON clause is essential because this join establishes the join (equality) condition.
Each matching row is joined where Emp = Emp, as stated in the ON clause of the JOIN.
EMP is the Primary Index for both tables. This first merge join type is extremely efficient because both columns in the ON clause are the Primary Indexes of their respective tables. When this occurs, NO data has to be moved into the spool, and the joins can be performed in what is called AMP LOCAL.
Teradata can perform this join with rapid speed and fewer data to be moved to complete a join.
Strategy 2
This strategy is used when a join is performed on a Primary Index column of one table to a non-primary indexed column of another table.
Two tables are being joined based on the DEPT column. In the department table, the Primary Index column is DEPT.
The employee table has EMP as the Primary Index column. The primary objective is to bring the rows together from each table on the same AMPs.
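A hedged sketch of the join being described (table and column names assumed):
```sql
SELECT *
FROM Employee_Table E
INNER JOIN Department_Table D
   ON E.Dept = D.Dept;   -- Dept is the PI of Department_Table but not of Employee_Table
```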
There are several options that the Teradata Optimizer could choose to complete this task, such as:
- The first option is to duplicate the smaller table on all AMPs.
- The second option could be to leave the department table with an equality condition match on the Primary Index Column stationary on the AMP.
- The next step would be to move the rows from the Employee table into a spool. This would be accomplished by hashing (locating) the columns in the employee table and then transferring these rows into spool to the appropriate AMPs where the department table rows reside.
Strategy 3
This is used where neither table is being joined on the Primary Index of either table. In this case, Teradata will redistribute both tables into the spool and sort them by hash code.
When we want to redistribute and sort by hash code, we merely hash the non-primary index columns and move them to the AMPs spool where they are sorted by hash.
Once this is accomplished, then the appropriate rows are together in spool on all the AMPs.
The Primary Index of the department table is DEPT, and the Primary Index for the manager table is LOC. In this case, both columns being utilized in this join equality are not part of the Primary Index columns.
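A hedged sketch of such a join; Mgr_Emp is a hypothetical column that is not the Primary Index of either table:
```sql
SELECT *
FROM Department_Table D
INNER JOIN Manager_Table M
   ON D.Mgr_Emp = M.Mgr_Emp;
```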
Rows from both tables will need to be rehashed and redistributed into SPOOL because neither column selected in the ON Clause is the Primary Index of the respective tables. Therefore, both tables are redistributed based on the ON clause columns.
The next step in this process is to redistribute the rows and locate them to the matching AMPs.
When this is completed, the rows from both tables will be located in two different spools. The rows in each spool will be joined together to bring back the matching rows.
Merge Join Strategy 4
The fourth merge join strategy is called the big table - small table join. If one of the tables being joined is small, then Teradata may choose a plan that will duplicate the smaller table across all the AMPs.
The key to this strategy is that, regardless of whether the join column is part of the Primary Index or not, Teradata can still choose to duplicate the smaller table across all the AMPs.
The two tables involved in the join are the Employee table and the Department table.
The Dept column is the join equality that is making the match between the two tables.
The DEPT column is the Primary Index Column in the Department table.
The Employee table has the Emp column as the Primary Index. The Department table is small.
To join these two tables together:
- The first step is to get the rows together on the same AMP. In this case, since the Department table is small, Teradata will choose to duplicate the entire Department table on each AMP into a spool.
- Then, the AMPs join the base Employee rows with the duplicated Department rows.
Nested Join
A nested join is designed to utilize a unique index type (either a Unique Primary Index or a Unique Secondary Index) from one of the tables in the join statement to retrieve a single row.
Then it matches that row to one or more rows on the other table being used in the join.
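A hedged sketch of the nested join being described; table and column names are assumed, with Dept the unique index of Department_Table and a secondary index on Employee_Table.Dept:
```sql
SELECT E.Emp, E.Dept, D.Dept_Name
FROM Employee_Table E
INNER JOIN Department_Table D
   ON E.Dept = D.Dept
WHERE D.Dept = 10;   -- isolates the single department row (10, SALES)
```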
The nested join has the join equality (ON) condition based on the Dept column. The Dept column is the Primary Index Column on the department table.
The Dept column is the Secondary Index column in the employee table. The nested join can move a single row into the spool and then match that row with another table that contains several matches.
Analysis of this join statement indicates a new clause has been added to this statement. This is known as the WHERE option.
When utilized, the WHERE option allows for a single row to be retrieved from a table. A nested join will always use a unique index to isolate that single record and then join that record to another table.
The other table may use an index, or it may not. The utilization of indexes in join statements will improve performance and utilize fewer resources, as shown in the below diagram.
Since there is only one row in the department table with a match for department = 10, which is based on the AND option in the join statement, the Teradata Optimizer will choose a path to move the department table columns into the spool and duplicate them across all the AMPs.
Once this is complete, the match proceeds with that single record (10 and SALES) against the second table, which did not move from its base AMP.
Nested Joins are great in an OLTP Environment because of the usage of both Unique and Non-Unique Indexes.
Row Hash Join
The Hash Join is part of the Merge Join. The key to a Merge Join is based on equality condition, such as E.Dept = D.Dept in the ON clause of the join statement.
A Hash Join can only occur if one or both of the tables on each AMP can fit entirely inside the AMP's memory.
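A hedged sketch of the join described below (table and column names assumed):
```sql
SELECT *
FROM Employee_Table E
INNER JOIN Department_Table D
   ON E.Emp = D.MgrEmp;   -- Emp is the PI of Employee_Table; MgrEmp is not indexed
```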
- The column's names can be different, but the row information has to be similar for the match to work.
- Both the EMP and MGREMP columns have the same type of information, so therefore a join based on these column names will be successful.
- EMP column is the Primary Index column on the employee table. However, the MGREMP column is not an index column in the department table.
- The key to determining a Hash Join is if the SMALLER TABLE can be entirely held in each AMP's MEMORY.
The Hash Join process is where the smaller table is sorted by row hash and duplicated on every AMP.
Here the key is a smaller table that is required to be held entirely in each AMP's memory.
Teradata will use the join column of the larger table to search for a match. The row hash join is extremely efficient because it eliminates the sorting, redistribution, and copying of the larger table into the spool.
The rows that are duplicated into the AMP's memory yield increased performance because the rows never go into a spool. Rows that go into spool always have to involve disk activity.
Product Joins
Product Joins compare every row of one table to every row of another table. They are called product joins because they are a product of the number of rows in table one multiplied by the number of rows in table two.
For example, if one table had six rows and the other had six rows, then the Product Join would compare 6 x 6 rows with a potential of 36 rows.
Many times, product joins are major mistakes because all rows in both tables will be compared.
Teradata tables have the potential to contain millions of rows. If a user accidentally writes a product join against two tables that have 1 million rows each, the result set would return one trillion rows.
To avoid a product join, check the syntax to ensure that the join is based on an EQUALITY condition.
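A hedged sketch of the kind of statement that triggers a product join (names assumed):
```sql
SELECT E.Emp, D.Dept
FROM Employee_Table E, Department_Table D
WHERE E.Emp LIKE '_b%';   -- no e.dept = d.dept equality, so every row pairs with every row
```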
In the above example, the only condition reads WHERE EMP LIKE '_b%'. Because this clause is not based on a common domain condition between the two tables (e.dept = d.dept), the result is a product join.
Another cause of a product join is when aliases are not used after being established. So firstly, ensure the WHERE clause is not missing in the join syntax.
Cartesian product Join
A Cartesian Product Join will join every row in one table to every row in another table. The number of rows returned is simply the product of the row counts of both tables.
If one table had 5 rows and another had 10 rows, then we will always get 50 rows returned.
Most of the time, a Cartesian Product Join is a major problem because all rows in both tables will be joined.
To avoid a Cartesian Product Join, check the syntax to ensure that the join is based on an EQUALITY condition.
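A hedged sketch of such a statement (names assumed); note that the WHERE clause is absent altogether:
```sql
SELECT E.Emp, D.Dept
FROM Employee_Table E, Department_Table D;
```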
In the above example, the WHERE clause is missing. Because there is no common domain condition between the two tables (e.dept = d.dept), every row of one table is joined to every row of the other.
Another cause of a product join is when aliases are not used after being established.
Exclusion Join
Exclusion Joins have one primary function. They exclude rows during a join.
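A hedged sketch of the exclusion join discussed below (table and column names assumed):
```sql
SELECT Emp
FROM Employee_Table
WHERE Dept = 10
  AND Emp NOT IN (SELECT MgrEmp FROM Department_Table);
```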
In the above example, the join utilizes the NOT IN statement. Exclusion joins are used for finding rows that don't have a matching row in the other table.
Queries with the NOT IN operator are the types of queries that always provide exclusion joins results. In this case, this query will find all the employees who belong to department 10 who are NOT managers.
These joins will always involve a Full Table Scan because Teradata will need to compare every record to eliminate rows that will need to be excluded.
This type of join can be resource-intensive if the two tables in this comparison are large.
The biggest problem with exclusion joins arises when the NOT IN list contains NULLs: because NULLs are treated as unknowns, the comparison cannot be resolved and the answer set comes back empty. There are two ways to avoid this problem:
- Define NOT IN columns as NOT NULL on the CREATE TABLE.
- Add the "AND WHERE Column IS NOT NULL" to the end of the JOIN.
Teradata Views
Views are database objects that are built by the query. Views can be created using a single table or multiple tables by way of joining.
Their definition is stored permanently in a data dictionary, but they don't save a copy. Data for the view is built dynamically.
A view may contain a subset of the table's rows or a subset of columns.
Create a View
Views are created using a special form of Data Definition Language (DDL). The CREATE requests a new VIEW, provides the name and the SELECT for the view. It is suggested the name of the view either start with "v_" or end with "_v" to identify it as a view name.
That way, it is visibly obvious to people that this is a view and not a table. The name of the view must be unique from the names of other objects in the database. The CREATE VIEW verifies that the name does not already exist and returns an error if it does.
Syntax
Following is the syntax for creating a view.
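A generic sketch of the statement:
```sql
CREATE VIEW [database_name.]view_name [( column_name [, column_name ...] )] AS
SELECT column_list
FROM table_name
[WHERE condition];
```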
- view_name: Name of the new view. If view_name is not fully qualified, the default database is used.
- database_name | user_name: Name of the database or user to contain view_name if something other than the current database or user.
- column_name: Name of a view column. If more than one column is specified, list their names in the order in which each column is to be displayed for the view.
- AS: An introduction to the view definition.
NOTE
- Views that reference a row-level security table can include columns based on row-level security constraints, but it is not required. However, the view enforces any security constraints in the base table, whether or not they are included in the view definition.
- A view can reference both row-level security tables and non-row-level security tables, but all referenced row-level security tables must contain the same security constraint columns or subsequent requests to access the view fail with an error.
Example
Consider the following Employee table.
Emp_Id | First_Name | Last_Name | Department_No | BirthDate |
---|---|---|---|---|
202001 | Mike | Richard | 1 | 4/8/1988 |
202002 | Robert | Williams | 2 | 8/5/1991 |
202003 | Peter | Collin | 2 | 5/9/1990 |
202004 | Alexa | Stuart | 1 | 15/11/1989 |
202005 | Robert | Peterson | 1 | 22/2/1990 |
The following example creates a view on the Employee table.
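A minimal sketch that exposes only a few columns of the base table:
```sql
CREATE VIEW Employee_View AS
SELECT Emp_Id, First_Name, Last_Name
FROM Employee;
```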
Using Views
We can use a regular SELECT statement to retrieve data from Views.
Example
The following example retrieves the records from Employee_View;
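For instance:
```sql
SELECT * FROM Employee_View;
```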
Modifying Views
A current view can be modified using the REPLACE VIEW statement.
REPLACE VIEW redefines an existing view or, if the specified view does not exist, it creates a new view with the specified name.
Syntax
Following is the syntax to modify a view.
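A generic sketch:
```sql
REPLACE VIEW [database_name.]view_name [( column_name [, column_name ...] )] AS
SELECT column_list
FROM table_name;
```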
Example
The following example modifies the view Employee_View for adding additional columns.
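A minimal sketch, adding the remaining columns of the Employee table to the view defined earlier:
```sql
REPLACE VIEW Employee_View AS
SELECT Emp_Id, First_Name, Last_Name, Department_No, BirthDate
FROM Employee;
```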
Drop View
A current view can be dropped using the DROP VIEW statement.
Syntax
Following is the syntax of DROP VIEW.
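A generic sketch:
```sql
DROP VIEW view_name;
```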
Example
Following is an example to drop the view Employee_View.
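For instance:
```sql
DROP VIEW Employee_View;
```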
Advantages of Views
Here are some advantages of using Views in Teradata, such as:
- Views provide an additional level of security by restricting the rows or columns of a table.
- Users can be given access only to views instead of base tables.
- Simplifies the use of multiple tables by pre-joining those using Views.
- Only the required rows and columns are taken from the tables.
- Views can be used to reduce network bandwidth.
- They provide better binding between the table and the view data.
Teradata Macros
Macro is a set of SQL statements that are stored and executed by calling the Macro name. The definition of Macros is stored in the Data Dictionary.
Users only need EXEC privilege to execute the Macro. Users don't need separate rights on the database objects used inside the Macro.
A macro consists of one or more statements that can be executed by performing a single statement. Each time the Macro is executed, one or more rows of data can be returned.
Macro statements are executed as a single transaction. If one of the SQL statements in Macro fails, then all the statements are rolled back. Macros can accept parameters. Macros can contain DDL statements, but that should be the last statement in Macro.
A frequently used SQL statement or series of statements can be incorporated into a macro and defined using the SQL CREATE MACRO statement. The contents of the Macro are performed using the EXECUTE statement. A macro can include an EXECUTE statement that executes another macro.
Create Macros
Creating a macro requires the CREATE MACRO privilege on the database or user in which the Macro is to be created. The creator automatically gets the DROP MACRO and EXECUTE privileges WITH GRANT OPTION. The immediate owner of the Macro:
- It is the database in which it exists, not the user who created it.
- Determines the macro's access privileges; the Macro itself does not.
- Must have all the appropriate privileges for executing the Macro, including WITH GRANT OPTION
Syntax
Macros are created using the CREATE MACRO statement. Following is the generic syntax of the CREATE MACRO command.
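A generic sketch of the statement; the parameter list is optional:
```sql
CREATE MACRO macro_name [( parameter_name data_type [, ...] )] AS
(
   SQL_statement;
   [SQL_statement; ...]
);
```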
- macro_name: The name of the new Macro. If a fully qualified name is not specified, the default database or user is used. The name of a macro must be unique within its containing user or database.
Emp_Id | First_Name | Last_Name | BirthDate |
---|---|---|---|
202001 | Mike | Richard | 11/05/1990 |
202002 | Robert | Williams | 20/01/1988 |
202003 | Peter | Collin | 21/08/1985 |
202004 | Alexa | Stuart | 10/12/1992 |
202005 | Robert | Peterson | 19/03/1987 |
Salary Table:
Emp_Id | Gross | Deduction | NetPay |
---|---|---|---|
202001 | 40,000 | 4,000 | 36,000 |
202002 | 80,000 | 6,000 | 74,000 |
202003 | 90,000 | 7,000 | 83,000 |
202004 | 75,000 | 5,000 | 70,000 |
202005 | 80,000 | 00 | 80,000 |
The following example creates a Macro called Get_Emp. It contains a select statement to retrieve records from the employee table.
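A minimal sketch of such a macro:
```sql
CREATE MACRO Get_Emp AS
(
   SELECT *
   FROM Employee
   ORDER BY Emp_Id;
);
```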
Executing Macros
The immediate owner of the Macro (the database in which the Macro resides) must have the necessary privileges on objects named in the request set for statements that are contained in the Macro.
Syntax
Macros are executed using the EXEC command. Following is the syntax of the EXECUTE MACRO command.
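A generic sketch:
```sql
EXEC macro_name;
```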
Example
The following example executes the Macro named Get_Emp. When the following command is executed, it retrieves all records from the employee table.
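For instance:
```sql
EXEC Get_Emp;
```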
Parameterized Macros
Teradata Macros can accept parameters. Within a Macro, these parameters are referenced with a leading colon (:).
Example
Following is an example of a Macro that accepts parameters.
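A hedged sketch; the macro and parameter names are assumed for illustration, and the parameter is referenced inside the body with a leading colon:
```sql
CREATE MACRO Get_Emp_Details (EmpNo INTEGER) AS
(
   SELECT *
   FROM Employee
   WHERE Emp_Id = :EmpNo;   -- :EmpNo is the macro parameter
);
```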
Executing Parameterized Macros
Macros are executed using the EXEC command. We need EXEC privilege to execute the Macros.
Syntax
Following is the syntax of the EXECUTE MACRO statement.
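A generic sketch:
```sql
EXEC macro_name (parameter_value [, parameter_value ...]);
```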
Example
The following example executes the parameterized Macro defined above. It accepts the employee Id as a parameter and extracts records from the employee table for that employee.
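Using the hypothetical macro sketched above:
```sql
EXEC Get_Emp_Details (202001);
```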
Replace a Macro
The privilege required to replace a Macro depends on whether the Macro being replaced already exists.
- If the Macro already exists, the DROP MACRO privilege on the Macro or its containing database or user is required.
- If the Macro does not already exist, the CREATE MACRO privilege on its containing database or user is required.
NOTE
- Once a macro has been replaced, its immediate owner is the database in which it exists, not the user who replaced it.
- The immediately owning database must have all the appropriate privileges for executing the Macro, including WITH GRANT OPTION.
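A minimal sketch of replacing the macro defined earlier (the new body is illustrative):
```sql
REPLACE MACRO Get_Emp AS
(
   SELECT Emp_Id, First_Name, Last_Name
   FROM Employee
   ORDER BY Emp_Id;
);
```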
Teradata OLAP Functions
OLAP functions are similar to aggregate functions except that the aggregate functions will return only one value. In contrast, the OLAP function will provide the individual rows in addition to the aggregates.
The OLAP functions are built into the Teradata database to provide data mining capabilities and trend analysis.
These functions provide processing not available using the standard aggregation. The OLAP functions give the result of their operation and display the detail data values used in the function.
The detail row data is displayed as part of the answer set since the output is in row format and not in a report format like WITH.
The OLAP functions may be performed on all tables or views and used to populate tables in conjunction with INSERT/SELECT. The most significant difference is that these functions can be used in Queryman, unlike WITH.
Syntax
Following is the syntax of the OLAP function.
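A generic sketch of the window form used for OLAP aggregation:
```sql
SELECT column_list,
       SUM(column_name) OVER (
          [PARTITION BY column_name]
          ORDER BY column_name
          ROWS UNBOUNDED PRECEDING
       )
FROM table_name;
```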
NOTE: Aggregation functions can be SUM, COUNT, MIN, MAX, and AVG.
Example
Consider the following Salary table of the employees.
Emp_Id | Gross | Deduction | NetPay |
---|---|---|---|
202001 | 40,000 | 4,000 | 36,000 |
202002 | 80,000 | 6,000 | 74,000 |
202003 | 90,000 | 7,000 | 83,000 |
202004 | 75,000 | 5,000 | 70,000 |
Following is an example to find the cumulative sum or running total of NetPay on the Salary table.
Records are sorted by Emp_Id, and the cumulative sum is calculated on the NetPay column.
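A minimal sketch of that query; with the four sample rows above, the running total works out to 36,000, 110,000, 193,000, and 263,000:
```sql
SELECT Emp_Id, NetPay,
       SUM(NetPay) OVER (ORDER BY Emp_Id ROWS UNBOUNDED PRECEDING) AS Running_NetPay
FROM Salary;
```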
Executing the above query returns each row of the Salary table together with the running total of NetPay, ordered by Emp_Id.
QUANTILE Function
A QUANTILE function is used to divide rows into several partitions of roughly the same number of rows in each partition. The percentile is the QUANTILE most commonly used in business.
By default, both the QUANTILE column and the QUANTILE value itself will be output in ascending sequence.
In some cases, the ORDER BY clause may be used to reorder the output for display. Here the order of the output does not change the meaning of the output, unlike a summation where the values are being added together and all need to appear in the proper sequence.
Syntax
Following is the syntax of the QUANTILE function.
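A generic sketch; QUANTILE(100, ...) is the usual way to compute percentiles, and QUALIFY can filter on the result:
```sql
SELECT column_list,
       QUANTILE (partition_count, sort_column [ASC | DESC])
FROM table_name
[QUALIFY quantile_condition];
```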
RANK Function
RANK function orders the records based on the column provided. The RANK function can also filter the number of records returned based on the Rank.
The RANK function allows a column to be evaluated and compared based on either high or low order, against all other rows to create the output set.
The order will be sorted by default in the descending sequence of the ranking column, which correlates to descending Rank.
The output of the RANK function is the highest or the lowest data values in the column, depending on the sort requested.
Syntax
Following is the syntax to use the RANK function.
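A generic sketch of both the Teradata form and the ANSI window form:
RANK(sort_expression [ASC | DESC])
RANK() OVER ([PARTITION BY column_name] ORDER BY column_name [ASC | DESC])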
Example
Consider the following Employee table.
| Emp_Id | First_Name | Last_Name | JoinedDate | Department_No | BirthDate |
|---|---|---|---|---|---|
| 202001 | Mike | Richard | 3/27/2008 | 1 | 1/5/1980 |
| 202002 | Robert | Williams | 4/25/2013 | 1 | 3/5/1983 |
| 202003 | Peter | Collin | 3/21/2010 | 1 | 4/1/1983 |
| 202004 | Alexa | Stuart | 1/2/2017 | 2 | 11/6/1984 |
| 202005 | Robert | Peterson | 1/4/2015 | 2 | 12/1/1984 |
The following query orders the records of the employee table by Joined Date and assigns the ranking on Joined Date.
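A sketch of such a query using the ANSI window form and the column names from the Employee table above (the alias name is illustrative):
SELECT Emp_Id, JoinedDate,
   RANK() OVER (ORDER BY JoinedDate) AS Seniority_Rank
FROM Employee;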
When the above query is executed, it produces the following output.
PARTITION BY clause groups the data by the columns defined in the PARTITION BY clause and performs the OLAP function within each group. Following is an example of the query that uses the PARTITION BY clause.
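A sketch of the partitioned version of the same query (again with an illustrative alias name):
SELECT Emp_Id, Department_No, JoinedDate,
   RANK() OVER (PARTITION BY Department_No ORDER BY JoinedDate) AS Seniority_Rank
FROM Employee;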
When the above query is executed, it produces the following output. We can see that the Rank is reset for each Department.
Teradata Data Protection
Teradata has many features to protect against potential data loss. Each of these features protects the data differently and on a different level of granularity. Some processes for protection are automatic, and some of them are optional.
Here are some features available for data protection in Teradata.
- RAID
- Fallback
- Cliques
- Journals
- Hot Standby Nodes
RAID
RAID mechanism is used to protect data from Disk Failures. Disk Array consists of a set of disks that are grouped as a logical unit. This unit looks like a single unit to the user, but it can be spread across several disks.
RAID 1 is commonly used in Teradata. In RAID 1, each disk is associated with a mirror disk. Any changes to the data in the primary disk are reflected in mirror copy as well. If the primary disk fails, then the data from the mirror disk can be accessed.
Although each AMP has exactly one virtual disk (VDISK) assigned, several physical disks make up this VDISK. Half of the disks keep the mirrored data.
As in the case of fallback protection, the cost is the usage of twice as much as disk space.
The RAID 1 configuration ensures enhanced data accessibility and security. If a single disk fails, data is served from the mirror disk. Only if both the primary disk and the mirror disk fail, and no fallback protection is enabled, would the data be lost.
Fallback
Fallback protects the table data by storing a second copy of each row of a table on another AMP, called the fallback AMP. If one AMP fails, the fallback rows are accessed, so the data is still available through the fallback AMP. The fallback option can be specified at table creation or after table creation.
Fallback ensures that the second copy of the table's rows is always stored on another AMP to protect the data from AMP failure. However, fallback occupies twice the storage and doubles the I/O for Insert, Delete, and Update operations.
The advantage of fallback protection is that the data may be accessible in case of an AMP failure, as the auxiliary AMP will take over its tasks until the failed AMP comes back online.
Only in the unlikely event that the fallback AMP fails as well will the table ultimately become inaccessible.
The following diagram shows how a fallback copy of the rows is stored in another AMP.
To achieve maximum protection, AMPs are grouped into clusters. The primary AMP and the fallback AMP always belong to the same cluster and protect each other.
The main AMP and the fallback AMP are never physically located on the same node, a prudent design choice given the nature of hardware failures. Even if a complete node fails, fallback protection still allows us to access the data.
Cliques
Clique is a mechanism used by Teradata to protect data from node failures. A clique is a set of Teradata nodes that share a common set of disk arrays. When a node fails, the AMPs of the failed node are migrated to another node in the same clique, stay fully functional, and continue to access their disk arrays.
This protection mechanism adds another level of security to the Teradata System.
Migrating AMPs within a clique keeps the data available, but the surviving nodes must carry the extra AMPs, and restarts are needed. To remove this restriction, Teradata offers hot standby nodes to take over the AMPs of a failed node.
Because a hot standby node is not engaged in routine operations in the ordinary course of business, no performance degradation occurs. Without a hot standby node, a system restart is required before the AMP migration can occur, followed by another restart as soon as the failed node goes online again.
Journals
Journals are kept on the system to provide data availability in the event of a component or process failure. Teradata has the following types of journals:
1. Recovery Journal: The Teradata Database uses Recovery Journals to automatically maintain data integrity in the case of:
- An interrupted transaction.
- An AMP failure.
Recovery Journals are created, maintained, and purged by the system automatically, so no DBA intervention is required.
Recovery Journals are tables stored on disk arrays like user data is, so they take up disk space on the system.
2. Down-AMP Recovery Journal: The Down-AMP Recovery Journal allows continued system operation while an AMP is down.
A Down-AMP Recovery Journal is used with fallback-protected tables to maintain a record of write transactions (updates, creates, inserts, deletes, etc.) intended for the failed AMP while it is unavailable.
The Down-AMP Recovery Journal starts automatically after the loss of an AMP in a cluster. Any changes to the data on the failed AMP are logged into the Down-AMP Recovery Journal by the other AMPs in the cluster.
When the failed AMP is online again, the restart process includes applying the changes in the Down-AMP Recovery Journal to the recovered AMP. The journal is discarded when the process is complete, and the AMP is brought online and fully recovered.
3. Transient Journal: A Transient Journal maintains data integrity when in-flight transactions are interrupted. Data is returned to its original state after a transaction failure.
A Transient Journal is used during regular system operation to keep "before images" of changed rows so the data can be restored to its previous state if the transaction is not completed. This happens on each AMP as changes occur. When a transaction is started, the system automatically stores a copy of all the rows affected by the transaction in the Transient Journal until the transaction is completed.
Once the transaction is complete, the "before images" are purged. In the event of a transaction failure, "before images" are reapplied to affected tables, deleted from the journal, and the "rollback" operation is completed.
4. Permanent Journal: Permanent Journals are an optional feature used to provide an additional level of data protection. We can specify the use of Permanent Journals at the table level. It offers full-table recovery for a specific point in time. It can also reduce the need for costly and time-consuming full-table backups.
Permanent Journals are tables stored on disk arrays like user data, so they take up additional disk space. The Database Administrator maintains the Permanent Journal entries (deleting, archiving, and so on).
The additional disk space required may be calculated in advance to ensure adequate resources. The Database Administrator must archive the Permanent Journal to external media, which reduces the need for full-table backups since only changes are backed up rather than the entire database.
Hot Standby Node
Hot Standby Node is a node that does not participate in the production environment.
Teradata Database can use spare nodes to improve availability and maintain performance levels in the event of a node failure. If a node fails, the vprocs (AMPs) from the failed node migrate to the hot standby node.
Hot standby nodes are used to maintain performance in case of node failures. When the failed node is recovered and restarted, it becomes the new hot standby node, so a second restart of Teradata is not needed.
Hot standby node has the following characteristics:
- A node that is a member of a clique.
- Hot Standby Node does not regularly participate in the trusted parallel application (TPA).
- It can be brought into the configuration when a node fails in the clique.
- It helps with unplanned outages.
- It also eliminates the need for a restart to bring a failed node back into service.
Teradata Stored Procedure
A stored procedure contains a set of SQL statements and procedural statements. They consist of a set of control and condition-handling statements that make SQL a computationally complete programming language. The definition of the stored procedure is stored in the database, and the parameters are stored in data dictionary tables.
These features provide a server-based procedural interface to Teradata Database for application programmers.
Stored procedures can be a great benefit for some tactical query applications. This section provides:
- Some examples of using stored procedures to process complex updates.
- A comparison of the relative efficiency of stored procedures and macros for different tactical query applications.
Advantages
Teradata Stored Procedure provides the following essential advantages, such as:
- Stored procedures reduce the network load between the client and the server.
- It provides better security since the data is accessed through stored procedures instead of accessing them directly.
- It gives better maintenance since the business logic is tested and stored in the server.
Elements of Stored Procedures
The set of statements constituting the main tasks of the stored procedure is called the stored procedure body, consisting of a single statement or a compound statement or block.
A single statement stored procedure body can contain one control statement, such as LOOP or WHILE, or one SQL DDL, DML, or DCL statement, including dynamic SQL. The following statements are not allowed, such as:
- Any declaration statement, such as local variable, cursor, condition, or condition handler.
- A cursor statement (OPEN, FETCH, or CLOSE).
A compound statement stored procedure body consists of a BEGIN-END statement enclosing a set of declarations and statements, including:
- Local variable declarations.
- Cursor declarations.
- Condition declarations.
- Condition handler declaration statements.
- Control statements.
- SQL DML, DDL, and DCL statements supported by stored procedures, including dynamic SQL.
- Multi-statement requests (including dynamic multi-statement requests) delimited by the keywords BEGIN REQUEST and END REQUEST.
Creating a Stored Procedure
A Teradata stored procedure can be created from the following interfaces:
- BTEQ utility using the COMPILE command.
- CLIv2 applications, ODBC, JDBC, and Teradata SQL Assistant (formerly called Queryman) using the SQL CREATE PROCEDURE or REPLACE PROCEDURE statement.
- Stored Procedures are created using the CREATE PROCEDURE statement.
The procedures are stored in the user database space as objects and are executed on the server.
Syntax
Following is the syntax of the CREATE PROCEDURE statement.
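A generic sketch of the statement:
CREATE PROCEDURE procedure_name ( [ IN | OUT | INOUT ] parameter_name data_type, ... )
BEGIN
   SQL_or_procedural_statements;
END;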
Example
Consider the following Salary Table of the employees.
| EmployeeId | Gross | Deduction | NetPay |
|---|---|---|---|
| 202001 | 40,000 | 2,000 | 38,000 |
| 202002 | 80,000 | 3,000 | 77,000 |
| 202003 | 85,000 | 5,000 | 80,000 |
| 202004 | 90,000 | 2,000 | 88,000 |
The following example creates a stored procedure named InsertSalary to accept the values and insert them into the Salary Table.
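The procedure body is not reproduced in the source; the following is a minimal sketch, assuming the Salary table uses the column names shown above:
CREATE PROCEDURE InsertSalary (
   IN in_EmployeeId INTEGER,
   IN in_Gross INTEGER,
   IN in_Deduction INTEGER,
   IN in_NetPay INTEGER
)
BEGIN
   /* insert one row into the Salary table using the input parameters */
   INSERT INTO Salary (EmployeeId, Gross, Deduction, NetPay)
   VALUES (:in_EmployeeId, :in_Gross, :in_Deduction, :in_NetPay);
END;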
Executing Stored Procedures
If we have sufficient privileges, we can execute a stored procedure from any supporting client utility or interface using the SQL CALL statement.
We can also execute a stored procedure from an external stored procedure written in C, C++, or Java. We have to specify arguments for all the parameters contained in the stored procedure.
Syntax
Following is the generic syntax of the CALL statement.
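A generic sketch of the statement:
CALL procedure_name ( [ argument1, argument2, ... ] );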
Example
The below example calls the stored procedure InsertSalary and inserts records to Salary Table of the employees.
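Assuming the InsertSalary sketch above, the call that would produce the additional row 202005 shown below might look like this:
CALL InsertSalary(202005, 20000, 1000, 19000);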
| EmployeeId | Gross | Deduction | NetPay |
|---|---|---|---|
| 202001 | 40,000 | 2,000 | 38,000 |
| 202002 | 80,000 | 3,000 | 77,000 |
| 202003 | 85,000 | 5,000 | 80,000 |
| 202004 | 90,000 | 2,000 | 88,000 |
| 202005 | 20,000 | 1,000 | 19,000 |
Teradata User Management
The Teradata Database Administrators are responsible for managing the users in the database. They will create the Teradata account for each user with limited space and access to the database objects.
Users
A user is created using the CREATE USER command. In Teradata, a user is similar to a database: both can be assigned space and contain database objects, except that a user is also assigned a password.
We must explicitly grant the CREATE USER or USER privilege to any user that requires it, except user DBC, which has these privileges by default. The creator of a user automatically receives the DROP USER privilege on the user.
Syntax
Following is the syntax for CREATE USER.
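A generic sketch of the statement (only the user name, PERMANENT space, and PASSWORD are mandatory):
CREATE USER user_name
AS
   PERMANENT = number_of_bytes,
   PASSWORD = password
   [ , TEMPORARY = number_of_bytes ]
   [ , SPOOL = number_of_bytes ]
   [ , ACCOUNT = 'account_id' ];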
- CREATE USER user_name: The name of an individual Teradata Database user. We cannot allow multiple users to share a username for accessing the database.
- PERM: The space in bytes available to contain objects that the user creates or owns. Permanent space cannot exceed the current PERM of the user's immediate owner, which is usually the database or user specified in the FROM clause of the CREATE USER statement.
- PASSWORD = password: A temporary password. If the system is set up according to the recommendations in creating the security administrator user, the user is prompted to change the temporary password to a private password at first login.
- TEMPORARY: Only objects such as global temporary tables and volatile tables require temporary space. We should develop a strategy for using the database objects that need temp space before assigning temp space to users. We can add a temp space specification later with a MODIFY USER or MODIFY PROFILE statement.
- SPOOL: Defines the maximum amount of space available to the user for intermediate query results, volatile tables, or formatted answer sets to queries. The system borrows spool space from unused system perm space, such as the spool reserve in creating the spool space reserve.
The spool space specified in a profile takes precedence over spool space defined in a CREATE USER statement.
Specify spool space as part of a profile unless the user requires unique spool considerations.
The spool space allocation should be approximately 20% of the perm space available to the largest database the user accesses.
NOTE: While creating a user, the values for user name, permanent space, and password are mandatory. The other fields are optional.
Example
Following is an example to create user T001.
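A minimal sketch (the space sizes and password are arbitrary illustrative values):
CREATE USER T001
AS
   PERMANENT = 1000000,
   PASSWORD = abc$123,
   TEMPORARY = 1000000,
   SPOOL = 1000000;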
Accounts
ACCOUNT option in CREATE USER is used to assign the account. While creating a new user, the default account is determined as follows, depending on the profile assigned:
- A profile is assigned to the user, and the profile contains more than one account: the default is the first account in the string. The user can submit a SET SESSION ACCOUNT statement to set the default for the current session to one of the other accounts.
- A profile is assigned to the user, and the profile contains only one account: the default is the account defined in the profile.
- A profile is assigned to the user, but the profile does not include an account: the default is the account identifier of the immediate owner of the user.
- No profile is assigned to the user: the default is the account identifier of the user's immediate owner.
NOTE
- If a user definition specifies an account but not a profile, the default account is the account string in the user definition.
- If you assign a profile to a user, and the profile specifies an account string, the profile account is in effect for AMPUsage and DBQL for that user. However, the system uses the value found in the DBC.DatabaseV view for space accounting.
Syntax
Following is the syntax for CREATE USER with the account option.
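A sketch of the general form:
CREATE USER user_name
AS
   PERMANENT = number_of_bytes,
   PASSWORD = password,
   ACCOUNT = ('account_id1', 'account_id2');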
Example
The following example creates the user T002 and assigns the account as IT and Admin.
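A minimal sketch using illustrative values for the space size and password:
CREATE USER T002
AS
   PERMANENT = 1000000,
   PASSWORD = abc$123,
   ACCOUNT = ('IT', 'Admin');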
The user can specify the account id while logging into the Teradata system or after being logged into the system using the SET SESSION command.
Grant Privileges
GRANT command is used to assign one or more privileges on the database objects to the user or database.
Syntax
Following is the syntax of the GRANT command.
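A generic sketch of the statement:
GRANT privileges ON object_name TO user_name;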
NOTE
Privileges can be INSERT, SELECT, UPDATE, REFERENCES.
Example
Following is an example of the GRANT statement.
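For example, the following hypothetical statement grants SELECT, INSERT, and UPDATE on the Employee table to the user T001 created earlier:
GRANT SELECT, INSERT, UPDATE ON Employee TO T001;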
Revoke Privileges
REVOKE command removes the privileges from the users or databases. The REVOKE command can only remove explicit privileges.
Syntax
Following is the basic syntax for the REVOKE command.
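A generic sketch of the statement:
REVOKE [ ALL | privileges ] ON object_name FROM user_name;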
Example
Below is an example of the REVOKE command.
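For example, the following hypothetical statement removes the INSERT and SELECT privileges on the Employee table from T001:
REVOKE INSERT, SELECT ON Employee FROM T001;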
Teradata FastLoad
Teradata FastLoad is used to load a large amount of data in an empty table on a Teradata System or load data from the client to the Database system.
It allows fast loading of a single empty table that has no secondary indexes (SIs), join indexes (JIs), hash indexes (HIs), referential integrity (foreign keys), or column partitioning.
FastLoad loads large amounts of data into an empty table, with or without a primary index, on Teradata Database. However, row size and the number of columns, more than any other factor, affect performance.
Teradata FastLoad is a command-driven utility which can be invoked either in batch or interactive mode.
The Teradata FastLoad utility establishes multiple sessions (one session per AMP by default), does not use the transient journal during operation, and transfers data in blocks rather than row by row. That is why it is so fast.
It loads only one table at a time. If we want to load multiple tables simultaneously, invoke multiple FastLoad jobs. FastLoad can load data from:
- Input files on a network-attached workstation.
- Disk or tape files on a channel-attached client system.
- Input module (INMOD) routines written to select, validate, and preprocess input data.
- Any other device providing correctly formatted source data.
Teradata FastLoad Prerequisite
There are three key components specified to run any Teradata FastLoad.
1. Log Table: One log table is needed to track the status of every FastLoad session running on the system.
Teradata maintains one table called fastlog under the database SYSADMIN. To use this table, we need INSERT, DELETE, and UPDATE privileges on it.
2. Empty Target Table: FastLoad needs one empty target table before inserting data into it. FastLoad does not care about how this has been accomplished.
We can achieve this by deleting the data from the target table or by dropping and recreating the target table before starting the load.
3. Two ERROR Tables: FastLoad requires two error tables to capture any errors that occur during the FastLoad process. These tables are created automatically.
- The First ERROR Table: It is used to capture any translation errors or constraint violations. For example, a row with the wrong data type in a column would be reported to the first error table.
- The Second ERROR Table: It is used for errors caused by duplicate values for the Unique Primary Index (UPI). FastLoad will load only one instance for every UPI and store the duplicate occurrences in the second table. However, if the entire row is a duplicate, FastLoad counts it but does not store the row.
NOTE
Teradata will allow a maximum of 15 FastLoads, Multiloads, or FastExports at the same time. This protects the system from giving all the resources to only load utilities.
Phases of FastLoad
FastLoad divides the whole process into two phases:
Phase 1: Acquisition
- This phase's main objective is to transfer the data from the host computer to the Teradata Access Module Processors (AMPs) as quickly as possible.
- Teradata Parsing Engine reads the records from the input file and sends a block to each AMP.
- Then PE opens a session from the FastLoad client to the AMPs directly. By default, it will create one session per AMP.
- One of the client sessions will pack the raw data into 64K blocks.
- Then this large block of data will be sent to the AMP randomly without any concern for which AMPs get the data.
- After receipt of each data block, every AMP hashes its rows based on the primary index and redistributes them to the proper AMP.
- After redistribution of rows, data will be written to an internal Worktable on the AMPs but remain unsorted.
- At the end of Phase 1, each AMP has its rows, but they are not in row hash sequence.
Phase 2: Application
- The main objective of this phase is to write the data to the actual table space.
- Phase 2 starts when FastLoad receives the END LOADING statement.
- Each AMP starts sorting the rows it received in its worktable by row hash.
- After sorting, these rows will be written to the actual data block permanently in the disk.
- Locks on the target table are released, and the error tables are dropped.
Example
Create a text file with the following records and name the file as emp.txt.
Following is a sample FastLoad script to load the above file into the Emp_Stg table.
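The original script is not reproduced in the source; the following is a hedged sketch of a typical FastLoad script for a comma-delimited emp.txt, assuming illustrative column names, an Emp_Stg staging table with matching columns, and placeholder logon credentials:
LOGON 192.168.1.102/dbc,dbc;
DATABASE tduser;
BEGIN LOADING tduser.Emp_Stg
   ERRORFILES Emp_Stg_ET, Emp_Stg_UV
   CHECKPOINT 10;
SET RECORD VARTEXT ",";
DEFINE
   in_EmpNo      (VARCHAR(10)),
   in_FirstName  (VARCHAR(30)),
   in_LastName   (VARCHAR(30)),
   in_DeptNo     (VARCHAR(5)),
FILE = emp.txt;
INSERT INTO Emp_Stg (
   EmpNo, FirstName, LastName, DeptNo
)
VALUES (
   :in_EmpNo, :in_FirstName, :in_LastName, :in_DeptNo
);
END LOADING;
LOGOFF;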
Executing a FastLoad Script
Once the input file emp.txt is created, and the FastLoad script is named as EmpLoad.fl, we can run the FastLoad script using the following command in UNIX and Windows.
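Assuming the script is saved as EmpLoad.fl, the invocation is typically of the form (the executable name may vary slightly by platform):
fastload < EmpLoad.fl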
Once the above command is executed, the FastLoad script runs and produces a log. The log shows the number of records processed by FastLoad and the status code.
Teradata MultiLoad
Teradata MultiLoad, or MLoad, is a command-driven utility that can do fast, high-volume data maintenance on multiple tables and views of a Teradata Relational Database Management System (RDBMS).
Using a single MultiLoad job, we can do several different import and delete tasks on RDBMS tables and views:
- Each MultiLoad import task can do multiple data insert, update, and delete functions on up to five different tables or views.
- Each MultiLoad delete task can remove large numbers of rows from a single table.
We can use MultiLoad to import data from:
- Disk or tape files on a channel-attached client system.
- Input files on a network-attached workstation.
- Input module (INMOD) programs written to select, validate, and preprocess input data.
- Access modules.
- Any device providing correctly formatted source data.
The table or view in the RDBMS receiving the data can be any existing table or view for which we have access privileges for the maintenance tasks we want.
MultiLoad Modes
Teradata MultiLoad has two modes:
1. MultiLoad Import
- Each MultiLoad import task can perform multiple INSERT, UPDATE, DELETE, and UPSERT operations on up to five target tables in parallel.
- It can also import data from the network-attached system or mainframe attached system using a custom access module.
2. MultiLoad DELETE
- Each MultiLoad delete can remove large volumes of data from a single table.
- The MultiLoad DELETE is used to perform a global (all-AMP) delete operation on just one table.
- The DELETE's main feature is that it bypasses the transient journal (TJ) and can be restarted if it fails before finishing.
- We can't use the primary index in the MultiLoad DELETE operation as primary index access is AMP specific, but MultiLoad DELETE is built for global delete.
MultiLoad requires a work table, a log table, and two error tables in addition to the target table.
- Log Table: Maintains the checkpoints taken during the load and the results from each phase of MultiLoad, which are used for restart purposes.
- Error Tables: These tables are populated during the load when an error occurs. The first error table stores conversion errors, whereas the second error table stores duplicate records.
- Worktable: The MultiLoad script creates one work table per target table. The work table is used to keep the DML tasks and the input data.
MultiLoad Features
Teradata MultiLoad has the following features, such as:
- High-performance maintenance operations apply updates to multiple tables in a single pass.
- Best suited when more than 1-2% of the rows are changed.
Data Conversion Capabilities
If an input data field with character data type is targeted for a column with date data type specification, Teradata MultiLoad can change the input data specification to date before inserting it into the target table.
Following are the conversions supported by Teradata MultiLoad:
- Numeric-to-numeric
- Character-to-numeric
- Character-to-date
- Date-to-character
Teradata MultiLoad Limitations
Teradata MultiLoad has the following limitations, such as:
- No Unique Secondary Index: Like FastLoad, MultiLoad does not support unique secondary indexes (USI), because the USI subtable may be created on a different AMP from the one where the actual row resides. Unlike FastLoad, it does support non-unique secondary indexes (NUSI), because the NUSI subtable is created on the same AMP. In MultiLoad, each AMP works in parallel and independently; that is why it supports NUSI but not USI.
- No Referential Integrity: Referential Integrity (RI) on the target table is not supported by Teradata MultiLoad. RI requires too much system checking to maintain the referential constraints to a different table.
- Triggers are not allowed: Triggers involve more than one table, while MultiLoad deals with only one target table. ALTER the triggers to the DISABLED status before using MultiLoad.
- No concatenation of the input files: It could impact the restart process if the data were concatenated in a different sequence.
- No aggregate, exponential operator, or arithmetic functions: If we need data conversions, we can use an INMOD to prepare the data before loading it.
- MultiLoad does not support SELECT statements, foreign key references, hash indexes, join indexes, or NoPI tables.
How MultiLoad Works
Teradata MultiLoad import has five phases:
- Phase 1 (Preliminary Phase): It performs basic setup activities.
- Phase 2 (DML Transaction Phase): It verifies the syntax of DML statements and brings them to the Teradata system.
- Phase 3 (Acquisition Phase): It brings the input data into the work tables and locks the target tables.
- Phase 4 (Application Phase): It applies all DML operations.
- Phase 5 (Cleanup Phase): It releases the table lock.
The following steps are involved in a MultiLoad script:
Step 1: Set up the log table.
Step 2: Log on to Teradata.
Step 3: Then, specify the Target, Work, and Error tables.
Step 4: Define the INPUT file layout.
Step 5: Also, define the DML queries.
Step 6: Name the IMPORT file.
Step 7: And specify the LAYOUT to be used.
Step 8: Initiate the Load.
Step 9: And last, Finish the load and terminate the sessions.
Example
Create a text file with the following records and name the file as emp.txt.
The following example is a MultiLoad script that reads records from the emp.txt file and loads them into the Emp_Stg table.
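The original script is not reproduced in the source; the following is a hedged sketch of a typical MultiLoad import script for a comma-delimited emp.txt, assuming illustrative column and table names and placeholder logon credentials:
.LOGTABLE tduser.Emp_Stg_Log;
.LOGON 192.168.1.102/dbc,dbc;
.BEGIN MLOAD TABLES Emp_Stg;
.LAYOUT Emp_Layout;
.FIELD in_EmpNo     * VARCHAR(10);
.FIELD in_FirstName * VARCHAR(30);
.FIELD in_LastName  * VARCHAR(30);
.FIELD in_DeptNo    * VARCHAR(5);
.DML LABEL Emp_DML;
INSERT INTO Emp_Stg (
   EmpNo, FirstName, LastName, DeptNo
)
VALUES (
   :in_EmpNo, :in_FirstName, :in_LastName, :in_DeptNo
);
.IMPORT INFILE emp.txt
   FORMAT VARTEXT ','
   LAYOUT Emp_Layout
   APPLY Emp_DML;
.END MLOAD;
.LOGOFF;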
Executing a MultiLoad Script
Once the input file emp.txt is created, and the MultiLoad script is named as EmpLoad.ml, then we can run the MultiLoad script using the following command in UNIX and Windows.
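Assuming the script is saved as EmpLoad.ml, the typical invocation is:
mload < EmpLoad.ml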
Teradata BTEQ
Teradata BTEQ stands for Basic Teradata Query. It is a command-driven utility that enables users to interact with one or more Teradata Database Systems.
BTEQ utility is used in both batch and interactive mode. It can be used to run any DML statement, DDL statement, create Macros, and stored procedures. BTEQ can be used to import data into Teradata tables from a flat file, and it can also be used to extract data from tables into files or reports.
BTEQ Terms
Below are some terms that are commonly used in BTEQ scripts.
- LOGON: It is used to log into the Teradata system.
- ACTIVITYCOUNT: It returns the number of rows affected by the previous query.
- ERRORCODE: It returns the status code of the previous query.
- DATABASE: It sets the default database.
- LABEL: It assigns a name to a set of SQL commands.
- RUN FILE: It executes the query contained in a file.
- GOTO: It transfers control to a label.
- LOGOFF: It logs off from the database and terminates all sessions.
- IMPORT: It specifies the input file path.
- EXPORT: It specifies the output file path and initiates the export.
BTEQ Limitations
Teradata BTEQ has the following limitations, such as:
- It can submit SQL in both modes, either batch or interactive. Interactive users can submit SQL and get the answer set on the screen, while users can also submit BTEQ jobs from batch scripts, with error checking and conditional logic, allowing the work to be done in the background.
- Teradata SQL Assistant (Queryman) outputs data in a spreadsheet format wherein using BTEQ users can get the output in the report format.
- BTEQ is an excellent tool for importing and exporting data.
Importing Data
BTEQ can import the data from the source and incorporate it into SQL to modify the content of one or more tables.
Syntax of the IMPORT command:
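A generic sketch of the command (the exact options vary; DATA, INDICDATA, REPORT, and VARTEXT select the input format):
.IMPORT { DATA | INDICDATA | REPORT | VARTEXT 'delimiter' } FILE = file_name [, SKIP = n];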
Exporting Data
The data can be written to either a mainframe or a LAN-attached computer using a SELECT from Teradata. Data can be exported in a variety of formats, such as:
- Record/DATA mode: In this mode, data is written to a flat file in native format, i.e., INTEGER data is written as a 4-byte binary field. It cannot be read and understood using a regular text editor. It is set by .EXPORT DATA.
- Field/Report mode: This is the default mode of BTEQ. The output of this BTEQ export includes column headers for the fields, white space, and can be understood using a text editor. It is set by .EXPORT REPORT.
- Indicator mode: It is similar to DATA mode, but it also includes a bitmap to identify NULLs. It helps carry NULL values into another RDBMS. It is set by .EXPORT INDICDATA.
- DIF mode: Allows users to export data from Teradata that is directly utilized for spreadsheet applications such as Excel, FoxPro, and Lotus.
Example
Following is an example of the Teradata BTEQ script.
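The script itself is not reproduced in the source; the following is a hedged sketch that follows the steps listed below, assuming an Emp table with the columns shown earlier in this document and placeholder logon credentials (a small .QUIT is added after the DROP so the script does not fall through into the insert):
.LOGON 192.168.1.102/dbc,dbc;
DATABASE tduser;
CREATE TABLE emp_bkup (
   Emp_Id INTEGER,
   First_Name VARCHAR(30),
   Last_Name VARCHAR(30),
   JoinedDate DATE,
   Department_No SMALLINT,
   BirthDate DATE
)
UNIQUE PRIMARY INDEX (Emp_Id);
.IF ERRORCODE <> 0 THEN .EXIT ERRORCODE;
/* check whether the Emp table has any records */
SELECT * FROM Emp SAMPLE 1;
.IF ACTIVITYCOUNT <> 0 THEN .GOTO InsertEmp;
/* Emp is empty: drop the backup table and stop */
DROP TABLE emp_bkup;
.QUIT 0;
.LABEL InsertEmp
INSERT INTO emp_bkup
SELECT * FROM Emp;
.IF ERRORCODE <> 0 THEN .EXIT ERRORCODE;
.LOGOFF;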
The above Teradata BTEQ script performs the task in the following steps.
Step 1: Logs into Teradata System.
Step 2: Sets the Default Database.
Step 3: Creates a table called emp_bkup.
Step 4: Selects one record from the Emp table to check if the table has any records.
Step 5: Drops emp_bkup table if the table is empty.
Step 6: Transfers the control to a Label InsertEmp, which inserts records into the emp_bkup table.
Step 7: Checks ERRORCODE after each SQL statement to ensure that the statement was successful.
Step 8: ACTIVITYCOUNT returns the number of records selected/impacted by the previous SQL query.
Teradata Performance Tuning
In Teradata, Performance tuning is done to identify all the bottlenecks and then resolve them.
A bottleneck is not a form of error, but it certainly causes a certain amount of delay.
The following are ways to identify a bottleneck and resolve it:
EXPLAIN Plan
The first step in performance tuning is the use of EXPLAIN on the query. When a query is preceded with the EXPLAIN command, the execution plan of the Parsing Engine is returned to the user instead of AMPs.
The EXPLAIN plan gives a clear picture of how the optimizer will execute a query (a sample request is shown after this list). In the EXPLAIN plan, check for keywords such as:
- Confidence level
- Join strategy used
- Spool file size
- Redistribution, etc.
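A minimal sketch of such a request, assuming the Employee table used earlier in this document:
EXPLAIN SELECT * FROM Employee WHERE Emp_Id = 202001;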
Collect Statistics
The optimizer uses Data demographics to come up with an effective execution strategy. COLLECT STATISTICS command is used to collect data demographics of the table. Make sure that the statistics collected on the columns are up to date.
- Collect statistics on the columns that are used in the WHERE clause and on the columns used in the joining condition.
- Collect statistics on the Unique Primary Index columns.
- Collect statistics on Non-Unique Secondary Index columns. The optimizer will decide if it can use NUSI or Full Table Scan.
- Collect statistics on the Join Index even though statistics on the base table are collected.
- Collect statistics on the partitioning columns.
Environment Information
- Number of Nodes, AMPs and CPUs
- Amount of memory
Data Demographics
- Number of rows
- Row size
- Range of values in the table
- Number of rows per value
- Number of Nulls
There are three approaches to collect statistics on the table.
- Random AMP Sampling
- Full statistics collection
- Using SAMPLE option
Collecting Statistics
COLLECT STATISTICS command is used to collect statistics on a table.
Syntax
Following is the basic syntax to collect statistics on a table.
COLLECT [SUMMARY] STATISTICS INDEX (indexname) COLUMN (columnname) ON <tablename>;
Example
The following example collects statistics on EmployeeNo column of Employee table.
COLLECT STATISTICS COLUMN(EmployeeNo) ON Employee;
Viewing Statistics
You can view the collected statistics using HELP STATISTICS command.
Syntax
Following is the syntax to view the statistics collected.
HELP STATISTICS <tablename>;
Example
Following is an example to view the statistics collected on Employee table.
HELP STATISTICS employee;
Using the THRESHOLD Options to Collect and Recollect Statistics
Statistics collection thresholds enable you to minimize the unnecessary collection of statistics. For a system-determined threshold, the Optimizer automatically determines whether a recollection is needed or whether extrapolation is adequate. You can collect and recollect threshold statistics on tables.
The THRESHOLD options enable you to skip recollecting statistics if the amount of data change since the last statistics collection or the age of the current statistics is under the thresholds in effect for the statistics.
The Optimizer can automatically determine the appropriate thresholds to apply based on previously collected statistics, column histories, change counts, and other factors, or you can explicitly specify the thresholds as part of a COLLECT STATISTICS request by specifying a change percentage, a number of days, or both.
You can submit collect statistics requests at regular intervals, and the Optimizer can use stored threshold and historical data to determine whether the specified recollection of statistics is necessary or not. The Optimizer can use the same data to determine when to recollect statistics for those columns that exceed the specified threshold. For example, if you determine that the threshold is to be a 10% data change and you submit a request to recollect statistics, the Optimizer does not recollect those statistics if the change in the data from the last collection is less than 10%.
If you specify a THRESHOLD option for first time statistics collections, the Optimizer collects the statistics and stores the THRESHOLD options you specified for future use without consulting the threshold values you specify. The Optimizer uses the saved collection thresholds to determine whether the current statistics collection can be skipped or not when you submit recollection requests at a later time.
Specifying a new THRESHOLD option for statistics recollection overrides the current setting, if any. The Optimizer applies the new THRESHOLD value to the current request, and remembers the updated THRESHOLD option for future statistics recollection.
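For illustration, a hedged sketch of such a request (the 10 percent / 7 day values and the column are arbitrary, and the USING clause syntax may vary by release):
COLLECT STATISTICS
   USING THRESHOLD 10 PERCENT AND THRESHOLD 7 DAYS
   COLUMN (EmployeeNo)
ON Employee;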
COLLECT STATS in Teradata collects statistics for one or more columns of the specified base table, as well as for hash indexes, join indexes, etc. The primary purpose is to collect statistical profiles of the required columns and store them in the data dictionary.
What is Collect stats in Teradata?
Collect Stats is used to gather the statistics on various columns as per the requirement. Thereafter Teradata’s optimizer creates an execution strategy that is based on these statistics collected by the COLLECT STATS command.
This COLLECT STATS command gathers data demographics and environment information, which the optimizer utilizes in order to optimize the plan for the SQL based on these tables.
Environment information consists of the below set of info:
- The amount of memory required
- The number of nodes
- AMPs
- CPUs
Data demographics consist of the below set of info:
- The row size of the table
- Number of rows in that table
- The entire range of values present in that table
- The number of rows per value
- Number of Null values in that table
You can specify the following kinds of thresholds to control recollecting statistics:
- Change-based
- Time-based
| Option | Description |
|---|---|
| NO SAMPLE | Use a full-table scan to collect the specified statistics. You can only specify this option if you also specify an explicit index or column set. You cannot specify NO SAMPLE for a standard recollection of statistics on an implicitly specified index or column set. For the first collection of statistics, NO SAMPLE overrides the default. For recollections of statistics, NO SAMPLE overrides the previous SAMPLE options and collects full statistics. |
| NO THRESHOLD | Do not apply any thresholds to the collection of statistics. Remove the existing threshold before collecting the statistics. You can only specify this option if you also specify an explicit column or index. You cannot specify NO THRESHOLD for a standard recollection of statistics on an implicitly specified index or column set. For the first collection of statistics, NO THRESHOLD overrides the default settings. For recollections of statistics, NO THRESHOLD overrides any previously specified THRESHOLD options and recollects the statistics without any thresholds. |
| NO THRESHOLD DAYS | Do not apply a DAYS threshold to the collection of statistics. Remove the existing DAYS threshold before collecting the statistics. You can only specify this option if you also specify an explicit index or column set. You cannot specify NO THRESHOLD DAYS for a standard recollection of statistics on an implicitly specified index or column set. For the first collection of statistics, NO THRESHOLD DAYS overrides the default setting. For recollection of statistics, NO THRESHOLD DAYS overrides any previous DAYS threshold specification. |
| NO THRESHOLD PERCENT | Do not apply a PERCENT change threshold to the collection of statistics. Remove the existing PERCENT change threshold before collecting the statistics. You can only specify this option if you also specify an explicit column or index. You cannot specify NO THRESHOLD PERCENT for a standard recollection of statistics on an implicitly specified index or column set. For the first collection of statistics, NO THRESHOLD PERCENT overrides the default. For recollections of statistics, NO THRESHOLD PERCENT overrides any previous change threshold percent specification. |
| SAMPLE | Scan a system-determined percentage of table rows to collect the specified statistics. SAMPLE has the same meaning as SYSTEM SAMPLE and is only provided for backward compatibility to enable existing COLLECT STATISTICS scripts that specify the USING SAMPLE option to continue to run. You should use the SYSTEM SAMPLE option instead of SAMPLE. |
| SAMPLE n PERCENT | Scans the percentage of table rows that you specify rather than scanning all of the rows in the table to collect statistics. The value for n can be a decimal number or integer from 2 through 100. Specifying SAMPLE 100 PERCENT is equivalent to collecting full statistics. You can only specify this option if you also specify an explicit index or column set. You cannot specify SAMPLE n PERCENT for a standard recollection of statistics on an implicitly specified index or column set. For the first collection of statistics, the specified sample percentage overrides the default. For recollection of statistics, SAMPLE n PERCENT overrides any previous SAMPLE option specifications and instead scans n percent of the rows in the table. |
| THRESHOLD n DAYS | Do not recollect statistics if the age of the statistics is less than the number of days specified. The value for n must be an integer representing the number of days; the valid range is 1 - 9999. You can only specify this option if you also specify an explicit index or column set. You cannot specify THRESHOLD n DAYS for a standard recollection of statistics on an implicitly specified index or column set. The DBS Control field DefaultTimeThreshold in the STATISTICS field group contains the default for this option. For more information, see Teradata Vantage™ - Database Utilities, B035-1102. |
| THRESHOLD n PERCENT | Do not recollect statistics if the percentage of data change since the last collection is less than the specified percentage. The value for n can be either a decimal number or an integer; the valid range is 1 - 9999.99. You can only specify this option if you also specify an explicit column or index. You cannot specify THRESHOLD n PERCENT for a standard recollection of statistics on an implicitly specified index or column set. The DBS Control field DefaultUserChangeThreshold in the STATISTICS field group contains the default for this option. For more information, see Teradata Vantage™ - Database Utilities, B035-1102. |
- Using the SAMPLE option: This includes unique index columns and nearly unique indexes or columns.
- Full statistics collection: This usually includes non-indexed columns, the PARTITION column of all tables (whether partitioned or not), most NUPIs (non-unique primary index columns), single-column join constraints, and NUSIs.
- Random AMP sampling: This involves USIs or UPIs (unique primary index columns) when they are used only with equality predicates.
How to Collect Stats in Teradata?
Below is the syntax of COLLECT STATS statement in Teradata:
COLLECT [SUMMARY] STATISTICS
INDEX (name_of_the_index)
COLUMN (col_name)
ON <table_name>;
Here the keyword SUMMARY is optional, and the user may skip it if not required.
Let’s take up some examples to understand how to collect stats in Teradata in detail:
COLLECT STATISTICS COLUMN(roll_number) ON Student_Table;
- This will collect stats on the roll_number column of the Student_Table.
- When the above query is executed, it produces a summary of the statistics collected.
Let’s take an example to understand the optimization in detail.
Suppose we have two tables table1 and table2.
table1 contains the details of the students, such as id, name, age, marks, etc.,
whereas table2 contains the geographic info of the students, such as address and location, along with the primary key ID.
SELECT
   a.id,
   a.name,
   a.age,
   a.marks,
   b.address,
   b.location
FROM table1 AS a
LEFT JOIN table2 AS b
   ON a.id = b.id;
Let’s consider two cases wherein the above-mentioned query gets executed.
CASE 1: When we do not have any information regarding the statistics of the columns from table1 and table2, the execution plan for the above query will be more costly.
CASE 2: When we do have the required statistics on the columns from table1 and table2, the execution plan for the above query will be less costly.
The reason is that, during the join on the id column, the matching rows from table1 and table2 need to be on the same AMP in order to be joined.
Suppose table1 contains 100 records having ID from 1 to 100 distributed evenly over 10 AMP in the below fashion.
- The records having the ID from 1 to 10 in AMP1
- The records having the ID from 11 to 20 in AMP2
- The records having the ID from 21 to 30 in AMP3
- The records having the ID from 31 to 40 in AMP4
- and so on…
- The records having the ID from 91 to 100 in AMP10
And table2 has only 80 records having ID from 1 to 100 with 20 missing ID’s
- The records having the ID from 1 to 8 in AMP1
- The records having the ID from 9 to 10 and 15 to 20 in AMP2
- The records having the ID from 21 to 28 in AMP3
- The records having the ID from 29 to 36 in AMP4
- and so on…
- The records having the ID from 92 to 100 in AMP8
Now, for the join to happen, matching IDs must be available on the SAME AMP.
Generally, the data from the smaller table is redistributed, so here the redistribution will happen for table2.
How can we see the statistics collected for the tables in Teradata?
The collected stats can be seen using the below query:
HELP STATISTICS <table_name>;
Let’s see the stats collected on Student_table
HELP STATISTICS student_table;
When the above query is executed, the result shows the statistics collected on student_table.
Data Types
Each column in a table is associated with a data type. Data types specify what kind of values will be stored in the column. Every data value belongs to an SQL data type.
Make sure that proper data types are used. This avoids using more storage than required.
Conversion
Ensure that the data types of the columns used in join conditions are compatible to avoid explicit data conversions.
Sort
Remove unnecessary ORDER BY clauses unless required.
Spool Space Issue
Spool space error is generated if the query exceeds per AMP spool space limit for that user.
Verify the explain plan and identify the steps that consume the most spool space. Such queries can be split into smaller parts that build temporary tables.
Primary Index
The primary index is used to specify where the data resides in Teradata. Each table is required to have a primary index defined, so make sure the Primary Index is correctly set for the table.
The primary index column should evenly distribute the data and should be frequently used to access the data.
Partition Primary Index
If a Partitioned Primary Index is defined on a table, try to use it in the filter condition; not using it there will degrade performance.
SET Table
If we define a SET table, the optimizer will check whether the record is a duplicate for each and every record inserted. To remove the duplicate check condition, we can define a Unique Secondary Index for the table.
UPDATE on Large Table
Updating a large table will be time-consuming. Instead of updating the table, we can delete the records and insert the rows with the modified values.
Dropping Temporary Tables
Drop the temporary tables (staging tables) and volatiles if they are no longer needed. This will free up permanent space and spool space.
MULTISET Table
If we have confirmed that the input records will not contain duplicates, we can define the target table as a MULTISET table to avoid the duplicate row check used by the SET table (see the sketch below).
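A minimal sketch of the difference, using hypothetical staging tables:
/* SET table: Teradata checks for and rejects duplicate rows */
CREATE SET TABLE Emp_Stg_Set (Emp_Id INTEGER, Net_Pay INTEGER) PRIMARY INDEX (Emp_Id);
/* MULTISET table: duplicate rows are allowed, so no duplicate row check is performed */
CREATE MULTISET TABLE Emp_Stg_Ms (Emp_Id INTEGER, Net_Pay INTEGER) PRIMARY INDEX (Emp_Id);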
IN Clause
Avoid the IN clause in filter conditions when there can be a huge number of values; it is better to insert such values into a volatile table and join the volatile table to the main query with an INNER JOIN, as sketched below.
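A sketch of the pattern, using hypothetical names and the Employee table from earlier in this document:
CREATE VOLATILE TABLE Filter_Vals (Emp_Id INTEGER)
ON COMMIT PRESERVE ROWS;
INSERT INTO Filter_Vals VALUES (202001);
INSERT INTO Filter_Vals VALUES (202002);
/* ... one insert (or a bulk load) per filter value ... */
SELECT e.*
FROM Employee e
INNER JOIN Filter_Vals f
   ON e.Emp_Id = f.Emp_Id;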
UPDATE Clause
Avoid using an UPDATE statement with only a SET clause and no WHERE clause. Even if the target and source have only one row, add a WHERE clause on the PI column.
Structured Query Language (SQL) is the industry language for communicating with Relational Database Management Systems (RDBMS)
In an RDBMS, data is stored in tables in which a row represents one instance of an entity and the columns represent the data fields that make up a row.
SQL is used to construct queries to communicate with the Teradata Vantage Advanced SQL Engine and define the answer set that is returned to the user.
-----------------------------------------------------
SQL STATEMENT TYPE
1. Data Definition Language (DDL) consists of all statements that support the definition of database objects.
DDL is used to create, modify database objects such as tables, views, macros, databases as well as restructure the database.
DDL Statements:
CREATE
DROP
ALTER
RENAME
MODIFY
REPLACE
COLLECT STATISTICS
SET
BEGIN
END
FLUSH
CHECKPOINT
HELP
SHOW
2. Data Manipulation Language (DML) consists of all SQL statements that support the manipulation and processing of database objects.
DML is used to work with data, including inserting data, updating existing data or performing queries on the data.
The DML subset will be the focal point of this fundamentals course.
DML Statements:
SELECT
INSERT
UPDATE
DELETE
3. Data Control Language (DCL) is a subset of the SQL language that defines and supports the security authorization for accessing database objects.
DCL statements are used for administrative tasks such as granting / revoking privileges or controlling object ownership.
DCL Statements:
GRANT
REVOKE
GIVE
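A few hedged, illustrative statements (the table and user names are hypothetical), one from each of the DDL, DML, and DCL categories above:
CREATE TABLE Demo_Tbl (Id INTEGER, Name VARCHAR(30)) PRIMARY INDEX (Id);   /* DDL */
INSERT INTO Demo_Tbl VALUES (1, 'Sample');                                 /* DML */
GRANT SELECT ON Demo_Tbl TO some_user;                                     /* DCL */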
4. Query and Workload Analysis statements are used for collecting and analyzing query and data demographics.
Some statements in this category are also functionally aligned with the DDL SQL category.
For more information on SQL statements in this category, see references in help section.
SQL Statements in this category:
BEGIN/END QUERY LOGGING
SHOW QUERY LOGGING
COLLECT DEMOGRAPHICS
COLLECT STATISTICS
DROP STATISTICS
DUMP EXPLAIN
INITIATE INDEX ANALYSIS
INITIATE PARTITION ANALYSIS
INSERT EXPLAIN
RESTART INDEX ANALYSIS
-------------------------------------------------------------
HELP
HELP command displays information about database objects (databases, tables, views, macros etc.).
This information is pulled from the data dictionary and includes information such as object attributes, function parameters, and column data types.
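For example (the object names are illustrative):
HELP DATABASE tduser;
HELP TABLE Employee;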
-----------------------------------------------------------------------------------------------------------------
SHOW command displays the current data definition language (DDL) of a database object.
It returns the last definition statement performed against the object (which may not be same DDL used to initially create it).
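For example (the object name is illustrative):
SHOW TABLE Employee;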
------------------------------------------------
EXPLAIN
The EXPLAIN modifier, placed in front of any SQL request, returns the plan the optimizer would use to execute that request instead of running it (see the EXPLAIN Plan discussion in the performance tuning section above).
=========================================================
Introduction to the Teradata Database
- Teradata Parallel Architecture
- Relational Database Advantages
- Teradata Database Features
- What is a Database?
- Logical/Relational Modeling
- Relational Databases
- Normalized Data Model vs. Star Schema Model
- Primary Keys
- Foreign Keys
- Relational Advantages
- Evolution of Data Processing
- Row vs. Set Processing
- Response Time vs. Throughput
- The Advantage of Using Detail Data
- The Data Warehouse
- Data Marts
- Teradata Functional Overview
- Major Components of the Teradata Database
- The Parsing Engine (PE)
- The BYNET
- The Access Module Processor (AMP)
- What is a Node?
- Parallelism
- Teradata Compression Comparison
- Layers of the Data Warehouse Architecture
- Teradata Database Objects
- Teradata Database Space
- Space Management
- How Does the Teradata Database Distribute Rows?
- Primary Keys and Primary Indexes
- Primary Key (PK) vs. Primary Index (PI)
- Primary Indexes
- Defining the Primary Index
- Row Distribution via Hashing
- UPI Row Distribution
- Unique Primary Index (UPI) Access
- NUPI Row Distribution
- Non-Unique Primary Index (NUPI) Access
- Highly Non-Unique NUPI Row Distribution
- Partitioned Primary Index (PPI)
- Multilevel Partitioned Primary Index (MLPPI)
- What is a NoPI Table?
- Teradata Columnar
- Teradata Columnar Benefits
- Secondary Indexes
- Defining a Secondary Index
- Defining a Secondary Index
- Other Types of Secondary Indexes
- Other Types of Secondary Indexes
- Full-Table Scans
- Locks
- Data Protection Features
- Transient Journal
- Fallback
- Down AMP Recovery Journal
- RAID Protection
- Hot Standby Nodes (HSN)
- Teradata Tools and Utilities Overview
- BTEQ
- Teradata Parallel Transporter
- TPT Operators
- TPT Architecture
- TPT Parallelism
===============================================================
Teradata SQL for Applications Development
Teradata Database has the ability to insert a row using only the DEFAULT VALUES keywords. For this feature to work successfully, one of the following statements must be true for each column of the table:
- the column has defined default values
- the column has a default system value specified
- the column permits nulls
If none of these statements is true, an insert using DEFAULT VALUES will fail. Note that such an insert may be executed multiple times as long as no uniqueness attributes of the table are violated.
The SQL command INSERT INTO tablename DEFAULT VALUES;
- Will insert defined default values into each column.
- Will insert a null if no default is defined.
- Will fail if no default is defined and a null is not allowed.
Some options for NULL and DEFAULT attributes:
NOT NULL - Nulls are not permitted for this column
DEFAULT n - Unless explicitly specified, assign the column a value of n. n must be of the same data type as the column.
Example: DEFAULT 'Unknown'
DEFAULT DATE 'xxxx-xx-xx' - Unless explicitly specified, assign the date xxxx-xx-xx
Example: DEFAULT DATE '2099-12-31'
WITH DEFAULT - Assign the system default (e.g. spaces for char strings, zero for numeric data types and current date for date data type)
DEFAULT TIME - Assign the current time to the integer data type
DEFAULT USER - Assign the user id of the session to the character string
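A minimal sketch combining the options above, using a hypothetical employee_contact table (table and column names are illustrative, not from the original notes):

CREATE TABLE employee_contact
( emp_id      INTEGER DEFAULT 0               -- explicit default value
, last_name   CHARACTER(20) DEFAULT 'Unknown'
, region      CHARACTER(10) WITH DEFAULT      -- system default (spaces)
, hire_date   DATE WITH DEFAULT               -- system default (current date)
, created_by  CHARACTER(30) DEFAULT USER      -- user id of the session
);

INSERT INTO employee_contact DEFAULT VALUES;  -- succeeds: every column has a default

/* If emp_id were declared INTEGER NOT NULL with no default,
   the same INSERT would fail. */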
() - Parentheses enclose a group of characters that is to be treated as a single regular expression entity.
[] - Square brackets enclose a subset of characters.
Types of Databases
If the goal is only to retrieve information, a NoSQL database will likely suffice. DynamoDB and MongoDB are two well-known examples of NoSQL databases today.
Relational Database?
CARDINALITY = the number of rows in a table; DEGREE = the number of columns.
As a data model is refined, it passes through different states which can be referred to as normal forms. Another familiar term is third normal form (3NF). 3NF is a set of rules and guidelines about how the data model should look. In practice, this simply means deciding which columns should belong to which tables.
- Relational
- Dimensional
- Logical and Physical
Relational Data Model
It is the most common type of data model. A relational data model should reflect the business rules.
Dimensional Data Models
A dimensional model is specially designed for reporting, as it provides navigation paths and aligns with the work processes of business users. A dimensional model emphasizes usability because the tables are:
- Defined in a way that a business uses the data
- Organized in a way that a given user or group of users can easily understand the data
Its purpose is to allow efficient, user-friendly filtering, sorting and summing of data specific to the needs of one or more business area(s).
Every dimensional model is composed of one table with a multipart key, called a fact table, and a set of smaller tables called dimension tables. Each dimension table has a single-part primary key that corresponds exactly to one of the components of the multipart key in the fact table.
Normalization vs. Denormalization
Normalization is the process of reducing a complex database schema into a simple, stable one. In order to normalize, a user must be able to uniquely identify each record. Generally, this process involves removing redundant attributes, keys, and relationships from the conceptual data model. Normalized data is optimized for entity level transactions.
Dimension denormalization supports dimensional modeling's twin objectives of speed and simplicity. De-normalized data is optimized for answering business questions and driving decision making.
Dimensional Model
In a dimensional model, fact tables always represent M:M (many to many) relationships. According to the model, a fact table should contain one or more numerical measures (the "facts" of the fact table) that occur for the combination of keys that define each tuple in the table.
DBMS
A DBMS is the application and system software used for creating and managing databases. The DBMS provides users and programmers with a systematic way to create, retrieve, update and manage data. A DBMS creates reports, enforces database rules and constraints, and maintains the database schema. Without a DBMS, a database is just a collection of bits and bytes. The most common type of database management system is called a Relational Database Management System (RDBMS).
SQL?
Relational Database Management Systems use Structured Query Language (SQL) as a database query language. Just as humans communicate with the help of language, SQL is the language that a database understands. SQL is used to create, manage, manipulate, and query database objects such as tables.
Data Warehouse
A data warehouse is a specially constructed data repository where data is organized so that it can be easily accessed by end users for various applications. Data warehouses are common in corporations where enterprise-wide detailed data is analyzed to make strategic and tactical business decisions. This is because data warehouses often carry many years’ worth of detailed data so historical trends may be analyzed. Data warehouses may begin as small in scope and purpose, but often grow quite large if their utility becomes more fully exploited by the enterprise.
Data Marts
Data marts can be independent of, dependent on, or part of a data warehouse. Data warehouses have an enterprise-wide depth. The information in data marts pertains to a single department.
Independent data marts are isolated entities entirely separate from the enterprise data warehouse. The independent data mart data is derived from independent sources and should be viewed as data pirates in the context of the enterprise data warehouse. The independent inputs, which are entirely separate from the enterprise data warehouse, have a high likelihood of producing data that does not match that of the warehouse.
Dependent data marts are derived from the enterprise data warehouse. Depending on how a dependent data mart is configured, it might or might not be useful. The recommended process uses only data that is derived from the enterprise data warehouse data store. This permits users to have full access to the enterprise data store when the need to investigate enterprise-wide issues arises.
The logical data mart is a form of dependent data mart that is constructed virtually from the physical data warehouse. Data is presented to users of the mart using a series of SQL views that make it appear that a physical data mart underlies the data available for analysis.
What is Analytic Processing?
A data warehouse was originally defined as a decision support system (DSS). As earlier discussed, data warehouses are computer-based information systems that support business or organizational decision-making activities. OLAP and OLTP are sets of operations that can be performed on a data set.
OLAP (On-Line Analytic Processing) operations comprise the complex analyses that are performed on a data set within a data warehouse. These operations include data mining, querying, pivoting, slicing, dicing, drilling, reporting and other decision-support applications. A data warehouse can be used without OLAP by just running reports.
OLTP (On-Line Transaction Processing) manages business applications and collects the business’s day-to-day data like order entry, financial transactions, or sales. This is different than a data warehouse. OLTP accesses a small number of rows (or records), or a few tables, in a matter of seconds.
Reasons to Deploy a Data Warehouse
For companies to be successful they must make good business decisions. Making good business decisions requires businesses to have the ability to analyze their related, relevant business data. A data warehouse is an indispensable tool for businesses. While business data can be useful, it means little if it cannot be stored and analyzed in a useful way. A key driver of analytics is strategic queries – business questions that are intended to provide a strategic advantage through forecasting. Businesses can also learn more about their customers, transactions, and profits – studying these items to find patterns which can assist them in avoiding costly mistakes, enhance business productivity, and manage customer relationships.
Data Warehouse Architecture - Layers
The reference information architecture is broken down into three major data layers, tracing the data from its sources through to the end users:
Acquisition - As the first point of entry into the data warehouse, raw data is acquired from various source systems such as mainframes, servers, etc.
Integration - The integration layer is primarily responsible for integrating data from multiple systems, both normalized and potentially de-normalized. It also may create common metrics and summaries which are widely used within an organization.
Access - The access layer's primary responsibility is to provide easy access to the data using various analytic methods.
Types of implementation possibilities
Only a few of the many possible types of data warehouse implementations are listed below.
A centralized data warehouse is useful for small and mid-size data warehouses. It is a single physical repository that can contain organizational data for a specific utility area, department, branch, division or the whole organization. Centralized data warehouses serve the needs of several separate business units within an organization at the same time using a single data model.
A federated data warehouse shares information among a number of different systems. Critical master files will be shared, and the other systems will be able to use this information. An example of this would be adding financial information to advertising data. This will allow an organization to better market its products to customers. If the data is federated, organizations can reduce response time.
A data mart exists within a single organizational data warehouse repository. Ideally, it is a condensed and more focused version of the data warehouse that stores data dedicated to a specific business function or department within the organization. This subset of data may span across many, or all, of the organization's functional subject areas. It is common for multiple data marts to be used in order to serve the needs of each individual business unit such as accounting, marketing, sales, etc.
Deployment Options
-->The on-premises deployment option involves buying the software and hardware from a data warehouse company such as Teradata, Oracle, or IBM. The on-premises (sometimes abbreviated as on-prem) data warehouse is installed, runs, and is overseen on the premises of the organization using it. Benefits of this option are that it gives the organization total control, flexibility, accessibility, and predictable, dependable performance. Some consider this option to also be the most secure, and an organization might even be required to deploy on-premises depending on the sensitivity of its data; examples are highly regulated markets such as healthcare or finance. There are also some countries that restrict access to internet content, so on-premises would be the only viable solution there.
-->If a company chooses to implement its data warehouse in the public or private cloud, then it will run on a computing platform that belongs to the cloud provider, and access to it is granted, as a service, by that provider. Some examples of companies that offer this deployment option are Amazon (AWS), Microsoft (Azure), and Google Cloud Platform. A public cloud solution can be quick to set up, and is scalable, accessible, and easy to use. As mentioned while discussing the on-premises solution, security might be an issue for some organizations depending on the market they are in; trust issues are often the most difficult to overcome for IT executives when considering whether or not to move to a cloud-based data warehouse. One consideration organizations need to take into account is that cloud options require high bandwidth, which may not be available to some smaller organizations. A private cloud, such as Teradata's IntelliCloud option, can provide a more custom, secure virtual environment than a public cloud.
Benefits of TD
-->Unlimited Parallelism means that every request is broken down into components which are all worked on at the same time (parallel). Parallelism is at the very heart of the Teradata Advanced SQL Engine and is built into the system from the ground up. There is virtually no part of the system where parallelism has not been built in.
-->Because the Teradata Advanced SQL Engine is highly scalable, it is available everywhere.
-->The Teradata Advanced SQL Engine is highly scalable which makes it easy for customers to expand the system as they need to for their growing businesses.
-->With its high availability, there is no single point of failure. The Teradata Advanced SQL Engine has the ability to make data available in the event of errors or failures, because both hardware and software provide fault tolerance. For example, fallback tables are duplicate copies of primary tables. Each fallback row in a fallback table is stored on an AMP different from the one to which the primary row hashes. This storage technique maintains availability should the system lose an AMP and its associated disk storage in a cluster. In that event, the system would access data in the fallback rows.
Partitioning
Partitioning is a database process where very large tables are divided into multiple smaller parts. By splitting a large table into smaller individual tables, queries that access only a fraction of the data can run faster because there is less data to scan.
Object Management
In order to manage the database objects, Teradata needs to store information about all of the objects. This information is referred to as metadata. Metadata can include:
- Means of creation of the data
- Purpose of the data
- Time and date of creation along with the creator and author of the data
- Location on a computer network where the data was created
- Standards used and file size
Data Dictionary
The data dictionary is owned by system user DBC and is composed of tables and views; data dictionary views provide access to the information in the tables. These tables and views are reserved for use by the system and contain the metadata about database objects as well as privileges, system events, and system usage.
What is Data Distribution?
The Teradata Advanced SQL Engine uses hashing to dynamically distribute data across all AMPs. Hashing is the transformation of a string of characters into a usually shorter fixed-length value or key that represents the original string.
Teradata Advanced SQL Engine generates a row hash by hashing the values of the PI columns. The row hash and a sequence number, which is assigned to distinguish between rows with the same row hash within a table, are collectively called a row identifier and uniquely identify each row in a table.
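As an illustration of how the row hash drives distribution, the sketch below counts rows per AMP for a hypothetical orders table whose PI column is order_id (HASHROW, HASHBUCKET, and HASHAMP are the standard Teradata hashing functions); roughly equal row counts per AMP indicate even distribution:

SELECT HASHAMP(HASHBUCKET(HASHROW(order_id))) AS amp_number
     , COUNT(*)                               AS row_count
FROM   orders          -- hypothetical table; order_id is its PI column
GROUP BY 1
ORDER BY 2 DESC;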
Metadata
Metadata is stored in tables that belong to user DBC and is reserved for use by the system. This information is stored in the Data Dictionary. The Data Dictionary contains metadata about the objects in the system like privileges, system events, and system usage. Views provide access to the information in the tables.
Index
Indexes can provide an easier and faster way of accessing and locating data, thus reducing unnecessary input and output operations.
The use of Unique Primary Index (UPI) and Unique Secondary Index (USI) ensures effective maintenance of uniqueness on the table.
Ideally, the rows of every large table will be distributed among all of the AMPs. There may be some circumstances where this is not true. A column-partitioned table or join index may have a primary AMP index as an alternative to having no primary index. A primary AMP index (PA) allows for single-AMP access when the values of the primary AMP index column(s) are specified or to provide better distribution of the rows.
Choosing a Primary Index for a table is perhaps the most critical decision a database designer makes. The choice will affect the distribution of the rows of the table and, consequently, the performance of the table in a production environment. Unique Primary Indexes (UPI's) are often desirable because they guarantee the uniform distribution of the rows of that table.
Because it is not always feasible to pick a Unique Primary Index, it is sometimes necessary to pick a column (or columns) which have non-unique values; that is, there are duplicate values. This type of index is called a Non-Unique Primary Index or NUPI. While not a guarantor of uniform row distribution, the degree of uniqueness of the index will determine the degree of uniformity of the distribution.
Improve Performance
Accessing data with equality on the Primary Index results in a one-AMP operation, which is extremely fast and efficient. Similar improvements can be seen with the use of other indexes.
Types and Usage
The Teradata Advanced SQL Engine has several types of indexes that are used to distribute data, retrieve data, and enhance performance.
-->A Primary Index (PI) is a column or set of columns that provide a mechanism for physical row storage and access.
The PI is defined when the table is created and is there for two major reasons:
- To determine data distribution
- To enhance performance
-->A No Primary Index (NOPI) table is simply a table without a primary index. NOPI tables are typically used as staging tables for the initial data load.
-->A Secondary Index (SI) can be used to enforce uniqueness on a column. The SI also provides an alternate path to access data.
-->A join index is an indexing structure containing columns from multiple tables, specifically the resulting columns from one or more tables. The join Index (JI) is an optional index which may be created by the user to improve performance of complex queries.
There are three general ways to access a table; two of which have been covered already:
--Primary Index access (one AMP access)
--Full Table Scan (all AMP access).
The third option is via Secondary Index access (a two-AMP or all-AMP access). As discussed earlier, there are two types of secondary indexes: USI (unique) and NUSI (non-unique).
UPI
Guarantees uniform distribution of rows, has no duplicate values, and is made up of one or more columns.
--Efficient way to access data. Accessing data through a UPI is a one-AMP operation that leads directly to the single row with the desired UPI value
--The same value stored in the same data type will always produce the same hash value. If the Primary Index is unique, Teradata can distribute the rows evenly.
NUPI
Non-unique Primary Index
Can cause skewed data, allows duplicate values, and is made up of one or more columns.
--Accessing data through a NUPI is a one-AMP operation that may lead to multiple rows with the desired NUPI value. The system creates a spool file during a NUPI access, if needed. NUPI access is efficient if the number of physical block reads is small.
--If, however, the order number was used as a non-unique Primary Index (NUPI), then the row distribution among AMPs will possibly be uneven. All rows with the same primary index value (in other words, with the same order number) will be stored on the same AMP. If the Primary Index is slightly non-unique, that is, there are only four or five rows per index value; the table will still distribute evenly. But what if there are hundreds or thousands of rows for some index values? The rows with the same primary index value would all be sent to the same AMP resulting in skewed data.
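A minimal sketch of the two choices, using hypothetical order tables (all names are illustrative):

-- UPI: order_id is unique, so rows distribute evenly across the AMPs.
CREATE TABLE orders
( order_id     INTEGER NOT NULL
, customer_id  INTEGER
, order_date   DATE )
UNIQUE PRIMARY INDEX (order_id);

-- NUPI: many rows can share a customer_id, so all of those rows hash to
-- the same AMP and distribution may become skewed.
CREATE TABLE order_history
( order_id     INTEGER NOT NULL
, customer_id  INTEGER
, order_date   DATE )
PRIMARY INDEX (customer_id);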
USI
Unique Secondary Index
Not involved in data distribution, has no duplicate values, enables data access via an alternate path.
--Efficient way to access data. Data access through a USI is always a two-AMP operation, which leads directly to the single row with the desired USI value.
USI retrieval will always retrieve one row at most, whereas the number of rows a NUPI will retrieve depends upon the selectivity of the PI value. The performance of a NUPI will depend upon how non-unique the PI is.
NUSI
Non-unique secondary index
One or more columns, may have duplicate values and is used to improve access.
--(NUSI) is efficient only if the number of rows accessed is a small percentage of the total data rows in the table. NUSI access is an all-AMPs operation since the NUSI subtables must be scanned on each AMP. It is a multiple rows operation since there can be many rows per NUSI value. A spool file will be created, if needed
PPI
Partition Primary Index
Can quickly improve the performance of complex range-based queries. Partitioned Primary Index (PPI) is an indexing mechanism that is useful in improving the performance of certain queries. When rows are inserted into a table, they are stored in an AMP and arranged by their row hash order. When a table is defined with PPI, the rows are sorted by their partition number. Within each partition, they are arranged by their row hash. Rows are assigned to a partition based on the partition expression defined.
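A sketch of a PPI definition, assuming a hypothetical sales table partitioned by month; a query that constrains sales_date can then touch only the qualifying partitions instead of the whole table:

CREATE TABLE sales
( store_id    INTEGER NOT NULL
, sales_date  DATE NOT NULL
, amount      DECIMAL(10,2) )
PRIMARY INDEX (store_id)
PARTITION BY RANGE_N (sales_date BETWEEN DATE '2023-01-01'
                                     AND DATE '2023-12-31'
                                     EACH INTERVAL '1' MONTH);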
NOPI
No Primary Index
--Simply a table without a primary index. As rows are inserted into a NoPI table, they are appended at the end of the table and never inserted in the middle of a hash sequence. Retrievals are likely to be full table scans (FTS) unless other index or table structures are in place e.g. columnar
--Supports faster loads and inserts into tables with no primary index defined.
Data is effectively appended to an AMP in a NoPI table. In addition, a NoPI table:
--Has no duplicate row checking. Duplicate rows can exist in a NoPI table after a TPT LOAD operation.
--A table without a primary index. Data is distributed to the AMPs by methods other than the primary index (i.e., Random Generator).
--Is useful in staging and in data labs when the data is not known and there is no obvious primary index.
--Is useful as a staging table and can be loaded quickly using Teradata Parallel Transporter (TPT) LOAD.
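A minimal NoPI staging-table sketch (hypothetical names):

CREATE TABLE stage_sales
( store_id    INTEGER
, sales_date  DATE
, amount      DECIMAL(10,2) )
NO PRIMARY INDEX;    -- rows are appended; no hash distribution by a PI column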
Full table scan
A full table scan is efficient in that each row is scanned only once. Although index access is generally preferred to a FTS, there are cases where a FTS is the best way to access the data. As with NUPIs and USIs, a full table scan can sometimes be more efficient than a NUSI. Since V2R5, a cylinder read retrieves all the data blocks in a cylinder with a single I/O. A FTS will only read each data block once, whereas a NUSI would read the index subtable and then access the data blocks; this is efficient only when the NUSI is nearly unique. Statistics give the optimizer the demographic information needed to make the right decision to select a FTS.
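One common way to give the Optimizer those demographics is to collect statistics on the relevant columns; a sketch against the hypothetical employee table (one of several accepted COLLECT STATISTICS forms):

COLLECT STATISTICS COLUMN (last_name) ON employee;   -- column demographics for the Optimizer
HELP STATISTICS employee;                            -- review what has been collected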
Secondary index:
Secondary indexes require at least two and possibly all AMPs, depending on the index and the type of operation. A benefit is that searches will typically be less expensive than a full table scan. However, secondary indexes add overhead to the table, both in terms of disk space and maintenance (due to subtables), but they may be dropped when not needed and recreated whenever they would be helpful.
A Secondary Index is like a Primary Index in that it allows the user to locate rows, but different in that it has no influence on the way rows are distributed among AMPs. A unique secondary index can be used to maintain uniqueness within a column or set of columns.
Secondary Indexes are used to improve performance by allowing the user to avoid scanning the entire table providing an alternate path to the rows of a table. One table can have from 0 to 32 secondary indexes.
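A sketch of creating and dropping secondary indexes on the hypothetical employee table (badge_number is an illustrative column assumed not to be the table's PI):

-- USI: enforces uniqueness and provides a two-AMP access path.
CREATE UNIQUE INDEX (badge_number) ON employee;

-- NUSI: non-unique alternate path; its subtable is scanned on every AMP.
CREATE INDEX (last_name) ON employee;

-- Secondary indexes can be dropped when they are no longer helpful.
DROP INDEX (last_name) ON employee;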
Locking
Teradata Database locks have two dimensions: Level and Severity.
Locks are automatically acquired during the processing of a request and released at the termination of the request. In addition, users can specify locks.
Types
There are four types of locks in Teradata.
Exclusive - Prevents any other type of concurrent access.
Write - Prevents other reads, writes, and exclusives.
Read - Prevents other writes and exclusives.
Access - Prevents exclusive locks only.
Levels
Locks can be applied at four levels.
Level 1
The database level applies to all tables/views in the database.
Level 2
Table/view level applies to all rows in the table/view.
Level 3
Partition level which applies to all rows within the specified partition(s).
Level 4
The row hash level locks all rows that share the same row hash value. (The row hash is the result of the hashing algorithm that also determines which AMP gets the row.)
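Users can override the default with the LOCKING modifier; for example, an ACCESS lock lets a query read a table without blocking, or being blocked by, concurrent writers (hypothetical table):

LOCKING TABLE employee FOR ACCESS
SELECT last_name, department_number
FROM   employee;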
Data Integration Tools
Producer - Producer operators extract data from a data source by reading the data and writing it to the data stream.
Consumer - Consumer operators read the data from the data stream and write or load it to the target, which might be Teradata tables or an external data store.
Teradata Compression
TD:
- Shared nothing - each AMP works only on its assigned portion of the database; nothing is shared.
- Hashing - Teradata uses hashing to distribute data across the AMPs; the Primary Index determines whether the distribution is even or not.
- Even distribution is desirable because, in operations that touch all rows (e.g., a full table scan), each AMP carries an equal workload, which increases performance.
Hashing
Normalization is a set of rules and a methodology for making sure that the attributes in a design are carried in the correct entity to map accurately to reality, eliminate data redundancy and minimize update abnormalities.
Teradata Database Architecture Overview
Structured Query Language (SQL) is the industry language for communicating with Relational Database Management Systems (RDBMS)
In an RDBMS, data is stored in tables in which a row represents one instance of an entity and the columns represent the data fields that make up a row.
SQL is used to construct queries to communicate with the Teradata Vantage Advanced SQL Engine and define the answer set that is returned to the user.
-----------------------------------------------------
SQL STATEMENT TYPE
1. Data Definition Language (DDL) consists of all statements that support the definition of database objects.
DDL is used to create and modify database objects such as tables, views, macros, and databases, as well as to restructure the database.
DDL Statements:
CREATE
DROP
ALTER
RENAME
MODIFY
REPLACE
COLLECT STATISTICS
SET
BEGIN
END
FLUSH
CHECKPOINT
HELP
SHOW
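A few illustrative DDL statements against a hypothetical project table (names are assumptions, not from the original notes):

CREATE TABLE project
( project_id    INTEGER NOT NULL
, project_name  CHARACTER(30) )
UNIQUE PRIMARY INDEX (project_id);

ALTER TABLE project ADD budget_amount DECIMAL(12,2);   -- add a column

RENAME TABLE project TO project_archive;               -- rename the object

DROP TABLE project_archive;                            -- remove the table and its data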
2. Data Manipulation Language (DML) consists of all SQL statements that support the manipulation and processing of database objects.
DML is used to work with data, including inserting data, updating existing data or performing queries on the data.
The DML subset will be the focal point of this fundamentals course.
DML Statements:
SELECT
INSERT
UPDATE
DELETE
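Illustrative DML against the hypothetical employee table used elsewhere in these notes:

SELECT   last_name, salary_amount
FROM     employee
WHERE    department_number = 401;

INSERT INTO employee (employee_number, last_name, department_number)
VALUES (1010, 'Chan', 401);

UPDATE   employee
SET      salary_amount = salary_amount * 1.05
WHERE    department_number = 401;

DELETE FROM employee
WHERE    employee_number = 1010;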
3. Data Control Language (DCL) is a subset of the SQL language that defines and supports the security authorization for accessing database objects.
DCL statements are used for administrative tasks such as granting / revoking privileges or controlling object ownership.
DCL Statements:
GRANT
REVOKE
GIVE
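A sketch of granting and revoking privileges (database, table, and user names are hypothetical):

GRANT SELECT, INSERT ON sales_db.orders TO analyst01;

REVOKE INSERT ON sales_db.orders FROM analyst01;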
4. Query and Workload Analysis statements are used for collecting and analyzing query and data demographics.
Some statements in this category are also functionally aligned with the DDL SQL category.
For more information on SQL statements in this category, see references in help section.
SQL Statements in this category:
BEGIN/END QUERY LOGGING
SHOW QUERY LOGGING
COLLECT DEMOGRAPHICS
COLLECT STATISTICS
DROP STATISTICS
DUMP EXPLAIN
INITIATE INDEX ANALYSIS
INITIATE PARTITION ANALYSIS
INSERT EXPLAIN
RESTART INDEX ANALYSIS
-------------------------------------------------------------
HELP
The HELP command displays information about database objects (databases, tables, views, macros, etc.).
This information is pulled from the data dictionary and includes information such as object attributes, function parameters, and column data types.
-----------------------------------------------------------------------------------------------------------------
The SHOW command displays the current data definition language (DDL) of a database object.
It returns the last definition statement performed against the object (which may not be the same DDL used to initially create it).
------------------------------------------------
EXPLAIN
The EXPLAIN modifier precedes any SQL request and returns the Optimizer's execution plan for that request, in English, without actually executing it.
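Quick illustrations of the three, against the hypothetical employee table:

HELP TABLE employee;      -- column names, types, and attributes from the data dictionary

SHOW TABLE employee;      -- the current CREATE TABLE DDL for the object

EXPLAIN
SELECT last_name
FROM   employee
WHERE  employee_number = 1006;   -- returns the Optimizer's plan; the SELECT is not executed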
=========================================================
Introduction to the Teradata Database
The Teradata database is a Relational Database Management System (RDBMS) that drives a company's data warehouse.
Teradata Parallel Architecture
There are many reasons customers choose the Teradata database as the preferred platform for enterprise data warehousing:
- Supports more warehouse data than all competitors combined. There are over 5,400 warehouses of 1 TB or larger in the field.
- Supports scalability from small (10GB) to massive (50+PB) databases.
- Provides a parallel-aware Optimizer that makes it unnecessary to perform complicated tuning to get a query to run efficiently.
- Automatic data distribution eliminates complex indexing schemes and time-consuming reorganizations.
- Designed and built on a parallel architecture from the start (not a parallel retrofit).
- Supports ad-hoc queries using ANSI-standard Structured Query Language (SQL) which is used to communicate with an RDBMS. This allows Teradata to interface with 3rd party BI tools and submit queries from other database systems.
- Single operational view of the entire MPP system via the Service Workstation (SWS) and single point of control for the DBA with Teradata Viewpoint.
Teradata Corporation has been doing data warehousing longer than any other vendor. The Teradata database was built to do Decision Support.
Relational Database Advantages
The Teradata database provides customers with unlimited, proven scalability. As noted earlier, the Teradata database can scale from 10 gigabytes to over 100 petabytes of data on a single system. When we talk about scalability, it isn't just about being able to grow very large and handle enormous amounts of data; it's about growth without losing any performance capabilities. This scalability provides investment protection for customers' growth and application development. As is proven in the benchmarks we perform, the Teradata database can handle the most concurrent users, who are often running complex queries.
The parallelism of the Teradata database is unlimited. The Teradata database performs every task in parallel, including joins, sorts, and aggregations.
The optimizer is parallel-aware and the most robust in the industry, able to handle multiple complex queries, joins per query, and unlimited ad-hoc processing.
The Teradata File System manages data and disk space automatically. This is what makes the rebuilding of indexes unnecessary.
The Teradata database provides a low total cost of ownership (TCO) due to:
- Ease of setup and maintenance
- No reorganization of data needed
- Automatic data distribution
- Most robust utilities in the industry
- Low cost of disk to data ratio
- Ease in expanding the system
High availability is also a major advantage because with the Teradata database architecture, there is no single point of failure - fault tolerance is built-in to the system. The Teradata database is the only database that is truly linearly scalable, and this extends to data loading with the use of our parallel utilities.
Teradata Database Features
The Teradata database is a high-performance database system that processes enormous quantities of detail data that are beyond the capability of conventional systems. The system is specifically designed for large data warehouses. From the beginning, the Teradata database was created to store, access and manage large amounts of data.
Hundreds of terabytes of storage capacity are currently available, making it an ideal solution for enterprise data warehouses and even for smaller data marts.
Parallel processing distributes data across multiple processors evenly. The system is designed such that the components divide the work up into approximately equal pieces. This keeps all the parts busy all the time and enables the system to accommodate a larger number of users and/or more data.
Open architecture adapts readily to new technology. As higher-performance industry standard computer chips and disk drives are made available, they are easily incorporated into the architecture. Teradata runs on industry standard operating systems as well.
Linear scalability enables the system to grow to support more users, data, queries, and query complexity without experiencing performance degradation. As the configuration grows, the performance increase is linear (a slope of 1).
The Teradata database currently runs as a database server on a variety of hardware platforms for single node or Symmetric Multi-Processor (SMP) systems, and on Teradata hardware for multi-node Massively Parallel Processing (MPP) systems.
What is a Database?
A database is a collection of permanently stored data that is used by an application or enterprise. A database contains logically related data, which means that the database was created with a specific purpose in mind. A database supports shared access by many users. One characteristic of a database is that many people use it, often for many different purposes. A database is protected to control access and managed to retain its value and integrity.
One example of a database is payroll data that includes the names of the employees, their employee numbers, and their salary history. This database is logically related - it's all about payroll. It must have shared access, since it will be used by the payroll department to generate checks, and also by management to make decisions. This database must be protected; much of the information is confidential and must be managed to ensure the accuracy of the records.
The Teradata database is a relational database. Relational databases are based on the relational model, which is founded on mathematical Set Theory. The relational model uses and extends many principles of Set Theory to provide a disciplined approach to data management. Users and applications access data in an RDBMS using industry-standard SQL statements. SQL is a set-oriented language for relational database management.
Later in this course we will provide another definition of database that is specific to the Teradata database.
Logical/Relational Modeling
The logical model should be independent of usage. A variety of front-end tools can be accommodated so that the database can be created quickly. The design of the data model is the same regardless of data volume.
An enterprise model is one that provides the ability to look across functional processes.
Normalization is the process of reducing a complex data structure into a simple, stable one. Generally this process involves removing redundant attributes, keys, and relationships from the conceptual data model.
Normalization theory is constructed around the concept of normal forms that define a system of constraints. If a relation meets the constraints of a particular normal form, we say that relation is "in normal form."
The intent of normalizing a relational database is to put one fact in one place. By decomposing your relations into normalized forms, you can eliminate the majority of update anomalies that can occur when data is stored in de-normalized tables.
A slightly more detailed statement of this principle would be the definition of a relation (or table) in a normalized relational database: A relation consists of a primary key, which uniquely identifies any tuple (an ordered set of values), and zero or more additional attributes, each of which represents a single-valued (atomic) property of the entity type identified by the primary key. The separator for each value is often a comma. Common uses for the tuple as a data type are:
- For passing a string of parameters from one program to another
- Representing a set of value attributes in a relational database
The Logical Model
- Should be designed without regard to usage
- Accommodates a wide variety of front end tools
- Allows the database to be created more quickly
- Should be the same regardless of data volume
- Data is organized according to what it represents - real world business data in table (relational) form
- Includes all the data definitions within the scope of the application or enterprise
- Is generic - the logical model is the template for physical implementation on any RDBMS platform
Normalization
- Process of reducing a complex data structure into a simple, stable one
- Involves removing redundant attributes, keys, and relationships from the conceptual data model
Relational Databases
Relational databases are solidly founded on Set Theory and are based on the Relational Model, which is a set of principles for relational databases formalized by Dr. E.F. Codd in the late 1960s. The relational model of data prescribes how the data should be represented in terms of structure, integrity, and manipulation. The rules of the relational model are widely accepted in the information technology industry, though actual implementations may vary. The key to understanding relational databases is the concept of the table, which is made up of rows and columns. The relational model provides the system with rules and a structure to build on. The relational terms we will be discussing in this course are independent of the database you may be running on.
A column is an attribute of the entity that a table represents. A column always contains like data, such as part name, supplier name, or employee number. In the example below, the column named LAST NAME contains last names and never anything else. Columns should contain atomic data, so a phone number might be divided into three columns: area code, prefix, and suffix, so drill-down analysis can be performed. Column position in the table is arbitrary. Missing data values are represented by "nulls", or the absence of a value.
A row is one instance of all the columns of a table. It is a single occurrence of an entity in the table. In our example, all information about a single employee is in one row. The sequence of the rows in a table is arbitrary.
In a relational database, tables are defined as a named collection of one or more named columns that can have zero or many rows of related information. Notice that the information in each row of the table refers to only one person. There are no rows with data on two people, nor are there rows with information on anything other than people. This may seem obvious, but the concept underlying it is very important.
Each row represents an occurrence of an entity defined by the table. An entity is defined as a person, place, thing, or event about which the table contains information. In this case, the entity is the employee.
Note: In relational math we use the term:
- Table to mean a relation.
- Row to mean a tuple.
- Column to mean an attribute.
Example - the EMPLOYEE relation (each row is a single entity; each column is an attribute):

EMPLOYEE  MANAGER   DEPARTMENT  JOB     LAST      FIRST    HIRE    BIRTH   SALARY
NUMBER    EMPLOYEE  NUMBER      CODE    NAME      NAME     DATE    DATE    AMOUNT
          NUMBER
1006      1019      301         312101  Stein     John     861015  631015  3945000
1008      1019      301         312102  Kanieski  Carol    870201  680517  3925000
1005      0801      403         431100  Ryan      Loretta  861015  650910  4120000
1004      1003      401         412101  Johnson   Darlene  861015  560423  4630000
1007                                    Villegas  Arnando  870102  470131  5970000
1003      0801      401         411100  Trader    James    860731  570619  4785000
The employee table has:
- Nine columns of data
- Six rows of data - one per employee
- Missing data values represented by nulls
- Column and row order are arbitrary
Normalized Data Model vs. Star Schema Model
As a model is refined, it passes through different states which can be referred to as normal forms. A normalized model includes:
- Entities - One record in a table
- Attributes - Columns
- Relationships - Between tables
First normal form (1NF) rules state that each and every attribute within an entity instance has one and only one value. No repeating groups are allowed within entities.
Second normal form (2NF) requires that the entity must conform to the first normal form rules. Every non-key attribute within an entity is fully dependent upon the entire key (key attributes) of the entity, not a subset of the key.
Third normal form (3NF) requires that the entity must conform to the first and second normal form rules. In addition, no non-key attribute within an entity may be functionally dependent upon another non-key attribute within the same entity.
While the Teradata database can support any data model that can be processed via SQL; an advantage of a normalized data model is the ability to support previously unknown (ad-hoc) questions.
Star Schema
The star schema (sometimes referenced as star join schema) is the simplest style of data warehouse schema. The star schema consists of a few fact tables (possibly only one, justifying the name) referencing any number of dimension tables. The star schema is considered an important special case of the snowflake schema. Some characteristics of a Star Schema model include:
- They tend to have fewer entities
- They advocate a greater level of denormalization
Primary Keys
Tables are made up of rows and columns and represent entities and attributes. Entities are the people, places, things, or events that the tables model. A Primary Key is required for every logical model version of a table, because each entity holds only one type of tuple (i.e., a row about a person, place, thing or event), and each tuple is uniquely identified within an entity by a Primary Key (PK).
Primary Key Rules
- A Primary Key uniquely identifies each tuple within an entity. A Primary Key is required, because each tuple within an entity must be able to be uniquely identified.
- No duplicate values are allowed. The Primary Key for the EMPLOYEE table is the employee number, because no two employees can have the same number.
- Because it is used for identification, the Primary Key cannot be null. There must be something in that field to uniquely identify each occurrence.
- Primary Key values should not be changed. Historical information, as well as relationships with other entities, may be lost if a PK value is changed or re-used.
- Primary Key can include more than one attribute. In fact, there is no limit to the number of attributes allowed in the PK.
- Only one Primary Key is allowed per entity.
Selecting a Primary Key
Each column of a row is called an attribute of that row. A database designer can select any attribute to be a Primary Key but, as a result of the rules listed above, many attributes will not qualify as a Primary Key candidate. For example, we could have selected Last Name as the PK of the EMPLOYEE table, but as soon as the company hires two people with the same last name, the PK is no longer unique. Even if we made the PK Last Name and First Name (possible since PKs can be made up of more than one column) we could still have two employees with the same name. Moreover, some employees may choose to change their last names. Many data modelers recommend using system-assigned sequential integers for Primary Keys. This assures uniqueness and gives users control over Primary Key values, but it also takes up database space.
Primary Key (PK) values uniquely identify each entry (tuple) of an entity.
Employee table (EMPLOYEE NUMBER is the PK):

EMPLOYEE  MANAGER   DEPARTMENT  JOB     LAST      FIRST    HIRE    BIRTH   SALARY
NUMBER    EMPLOYEE  NUMBER      CODE    NAME      NAME     DATE    DATE    AMOUNT
(PK)      NUMBER
1006      1019      301         312101  Stein     John     861015  631015  3945000
1008      1019      301         312102  Kanieski  Carol    870201  680517  3925000
1005      0801      403         431100  Ryan      Loretta  861015  650910  4120000
1004      1003      401         412101  Johnson   Darlene  861015  560423  4630000
1007                                    Villegas  Arnando  870102  470131  5970000
1003      0801      401         411100  Trader    James    860731  570619  4785000
Foreign Keys
Relational databases permit data values to associate across more than one entity. A Foreign Key (FK) value identifies entity relationships. Below you will see that the employee table has three FK attributes, one of which models the relationship between employees and their departments. A second FK attribute models the relationship between employees and their jobs.
A third FK attribute is used to model the relationship between employees and each other. This is called a recursive relationship.
Foreign Key Rules
- Duplicate values are allowed in a FK attribute.
- NULLs are allowed in a FK attribute.
- Values may be changed in a FK attribute.
- Each FK must exist elsewhere as a Primary Key.
Note that Department_Number is the Primary Key for the DEPARTMENT entity.
Remember, these terms are not Teradata specific - they are just general relational concepts.
EMPLOYEE (partial listing) - EMPLOYEE NUMBER is the PK; MANAGER EMPLOYEE NUMBER, DEPARTMENT NUMBER, and JOB CODE are FKs:

EMPLOYEE  MANAGER   DEPARTMENT  JOB     LAST      FIRST    HIRE    BIRTH   SALARY
NUMBER    EMPLOYEE  NUMBER      CODE    NAME      NAME     DATE    DATE    AMOUNT
(PK)      NUMBER
          (FK)      (FK)        (FK)
1006      1019      301         312101  Stein     John     861015  631015  3945000
1008      1019      301         312102  Kanieski  Carol    870201  680517  3925000
1005      0801      403         431100  Ryan      Loretta  861015  650910  4120000
1004      1003      401         412101  Johnson   Darlene  861015  560423  4630000
1007                                    Villegas  Arnando  870102  470131  5970000
1003      0801      401         411100  Trader    James    860731  570619  4785000

Foreign Key (FK) values model relationships.

DEPARTMENT - DEPARTMENT NUMBER is the PK; MANAGER EMPLOYEE NUMBER is a FK:

DEPARTMENT  DEPARTMENT                BUDGET    MANAGER
NUMBER      NAME                      AMOUNT    EMPLOYEE
(PK)                                            NUMBER (FK)
501         marketing sales           80050000  1017
301         research and development  46560000  1019
302         product planning          22600000  1016
403         education                 93200000  1005
402         software support          30800000  1011
401         customer support          98230000  1003
201         technical operations      29380000  1025
- Foreign Keys (FK) are optional.
- An entity may have more than one FK.
- A FK may consist of more than one attribute.
- FK values may be duplicated.
- FK values may be null.
- FK values may be changed.
- FK values must exist elsewhere as a PK.
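These are logical-model concepts, but they can also be declared physically; a sketch with hypothetical DEPARTMENT and EMPLOYEE tables (in Teradata, a declared PRIMARY KEY is implemented internally as a unique index):

CREATE TABLE department
( department_number  INTEGER NOT NULL PRIMARY KEY
, department_name    CHARACTER(30) );

CREATE TABLE employee
( employee_number    INTEGER NOT NULL PRIMARY KEY
, department_number  INTEGER REFERENCES department (department_number)  -- FK: must match a department PK value
, last_name          CHARACTER(20) );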
Relational Advantages
Advantages of a Relational Database (physical and logical) compared to other database methodologies include:
- Flexible
Flexibility provides substantial benefits. The user does not need to know the access path to the data; the RDBMS keeps track of where everything is. Relational databases use atomic data - breaking data down into its smallest parts to allow maximum flexibility in selecting and using data. In addition, a single copy of the data can serve multiple purposes.
- Responds quickly
In a non-relational database, adding a field means that all programs that use the database must be rewritten to become aware of the new data structure. In a relational database, programs do not need to be re-written when a field is added.
- Data-driven
Relational databases are designed to represent a business and its practices - not the application or the computer system.
- Business-oriented
The two tables we have looked at, EMPLOYEE and DEPARTMENT, are organized to reflect the way the business really works.
- Simple and easy to use and understand
Simplicity is useful not only to the people who ask the questions, but also to the people who have to figure out how to retrieve information from the database. Understanding how an RDBMS functions is not necessary.
- Easier to build applications
Relational databases make the data do more work. Programs and transactions are simpler, which makes it easier to build applications.
- Support the trend toward end-user computing
The trend is moving away from organizations funneling all data requests through the IT people who know how the system works. As systems get easier to use, more people have access to them. This is called the "democratization of data."
- Set Theory
Set theory is the mathematical science of the infinite. It studies properties of sets, abstract objects that pervade the whole of modern mathematics. The language of set theory, in its simplicity, is sufficiently universal to formalize all mathematical concepts and thus set theory, along with predicate calculus, constitutes the true foundations of mathematics.
Evolution of Data Processing
Traditionally, data processing has been divided into two categories: On-Line Transaction Processing (OLTP) and Decision Support Systems (DSS). For either, requests are handled as transactions. A transaction is a logical unit of work, such as a request to update an account.
Traditional
A transaction is a logical unit of work.
- On-Line Transaction Processing (OLTP)
OLTP is typified by a small number of rows (or records) or a few of many possible tables being accessed in a matter of seconds or less. Very little I/O processing is required to complete the transaction. This type of transaction takes place when we take out money at an ATM. Once our card is validated, a debit transaction takes place against our current balance to reflect the amount of cash withdrawn. This type of transaction also takes place when we deposit money into a checking account and the balance gets updated. We expect these transactions to be performed quickly. They must occur in real time.
- Decision Support Systems (DSS)
Decision support systems include batch reports, which roll-up numbers to give business the big picture, and over time, have evolved. Instead of pre-written scripts, users now require the ability to do ad hoc queries, analysis, and predictive what-if type queries that are often complex and unpredictable in their processing. These types of questions are essential for long range, strategic planning. DSS systems often process huge volumes of detail data and rely less on summary data.
Today
- On-Line Analytical Processing (OLAP)
OLAP is a modern form of analytic processing within a DSS environment. OLAP tools (e.g., from companies like MicroStrategy and Cognos) provide an easy to use Graphical User Interface to allow "slice and dice" analysis along multiple dimensions (e.g., products, locations, sales teams, inventories, etc.). With OLAP, the user may be looking for historical trends, sales rankings or seasonal inventory fluctuations for the entire corporation. Usually, this involves a lot of detail data to be retrieved, processed and analyzed. Therefore, response time can be in seconds or minutes.
- Data Mining (DM)
DM (predictive modeling) involves analyzing moderate to large amounts of detailed historical data to detect behavioral patterns (e.g., buying, attrition, or fraud patterns) that are used to predict future behavior. An "analytic model" is built from the historical data (Phase 1: minutes to hours) incorporating the detected patterns. The model is then run against current detail data ("scoring") to predict likely outcomes (Phase 2: seconds or less). Likely outcomes, for example, include scores on likelihood of purchasing a product, switching to a competitor, or being fraudulent.

Type: OLTP
Examples: Update a checking account to reflect a deposit. A debit transaction takes place against the current balance to reflect the amount of money withdrawn at an ATM.
Number of rows accessed: Small
Response time: Seconds

Type: DSS
Examples: How many child size blue jeans were sold across all of our Eastern stores in the month of March? What were the monthly shoe sales for retailer X?
Number of rows accessed: Large
Response time: Seconds or minutes

Type: OLAP
Examples: Show the top ten selling items across all stores for 1997. Show a comparison of sales from this week to last week.
Number of rows accessed: Large number of detail rows or moderate number of summary rows
Response time: Seconds or minutes

Type: Data Mining
Examples: Which customers are most likely to leave? Which customers are most likely to respond to this promotion?
Number of rows accessed: Moderate to large number of detailed historical rows
Response time: Phase 1: minutes or hours; Phase 2: seconds or less
Row vs. Set Processing
Both cursor and set processing define a set (or sets) of rows to process; but while a cursor processes the rows sequentially, one at a time, set processing operates on the entire set at once. In both cases the work can be requested with a single command.
What is Row-by-Row Processing?
Where there are many rows to process, one row is fetched at a time and all calculations are done on it; then it is updated or inserted. Then the next row is fetched and processed. This is row-by-row processing, and it makes for a slow program. A benefit of row processing is that there is less lock contention.
What is Set Processing?
A lot of data processing is set processing, which is what relational databases do best. Instead of processing row-by-row sequentially, you can process relational data set-by-set, without a cursor. For example, to sum all payment rows with balances of $100 or less, a single SQL statement completely processes all rows that meet the condition as a set.
With sufficient rows to process, this can be 10 to 30 or more times faster than row-at-a-time processing.
Some characteristics of set processing include:
- An update with all AMPs involved
- Single session processing which takes advantage of parallel processing
- Efficient updates of large amounts of data.
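The payment example above as a single set-based statement (table and column names are hypothetical):

SELECT SUM(balance)
FROM   payment
WHERE  balance <= 100;   -- all qualifying rows are processed as one set, in parallel across the AMPs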
Response Time vs. Throughput
When determining how fast something is, there are two kinds of measures. You can measure how long it takes to do something or you can measure how much gets done per unit time. The former is referred to as response time, access time, transmission time, or execution time depending on the context. The latter is referred to as throughput.
Response Time
This speed measure is specified by an elapsed time from the initiation of some activity until its completion. The phrase response time is often used in operating systems contexts.
Throughput
A throughput measure is an amount of something per unit time. For operating systems, throughput is often measured as tasks or transactions per unit time. For storage systems or networks throughput is measured as bytes or bits per unit time. For processors, the number of instructions executed per unit time is an important component of performance.
What Does this Mean to Teradata?
Response Time                               Throughput
Measures the average duration of queries    Measures the quantity of queries completed during a time interval
A measure of process completion             A measure of the amount of work processed
Is how long the processing takes            How many queries were processed
The elapsed time per query                  The number of queries executed in an hour
In order to improve both response time and throughput on a Teradata system, you could increase CPU power (i.e., add nodes), implement workload management to control resources, and decrease the number of concurrent users.
The Advantage of Using Detail Data
Until recently, most business decisions were based on summary data. The problem is that summarized data is not as useful as detail data and cannot answer some questions with accuracy. With summarized data, peaks and valleys are leveled when the peaks fall at the end of a reporting period and are cut in half, as shown in the example below.
Here's another example. Think of your monthly bank statement that records checking account activity. If it only told you the total amount of deposits and withdrawals, would you be able to tell if a certain check had cleared? To answer that question you need a list of every check received by your bank. You need detail data.
Decision support, answering business questions, is the real purpose of databases. To answer business questions, decision-makers must have four things:
- The right data
- Enough detail data
- Proper data structure
- Enough computer power to access and produce reports on the data
Strategic workloads tend to rely heavily on detail data.
Consider your own business and how it uses data. Is that data detailed or summarized? If it's summarized, are there questions it cannot answer?
QUESTION: How effective was the national advertisement for jeans that ran June 6 through June 8?
DETAIL DATA vs. SUMMARY DATA
- Detail data gives a more accurate picture.
- Correct business decisions result.
The Data Warehouse
A data warehouse is a central, enterprise-wide database that contains information extracted from the operational systems (this is where the saying, "Load once, use many" comes from). A data warehouse has a centrally located logical architecture, using a single data store, which minimizes data synchronization, utilizes less space, and provides a single version of the truth. Data warehouses have become more common in corporations where enterprise-wide detail data may be used in on-line analytical processing to make strategic and tactical business decisions. Warehouses often carry many years' worth of detail data so that historical trends may be analyzed using the full power of the data.
Many data warehouses get their data directly from operational systems so that the data is timely and accurate, providing less disparate results. While data warehouses may begin somewhat small in scope and purpose, they often grow quite large as their utility becomes more fully exploited by the enterprise.
The data warehouse carries a massive volume of detail data, so it is well suited to the parallel architecture of the Teradata database. Because the Teradata database has traditionally specialized in large-scale, decision-support queries against huge volumes of detail data, its use for data warehousing becomes an obvious solution.
The Teradata database can run in single-node environments with a database of a terabyte or less. As requirements grow, the single node can easily be expanded to multiple nodes with the Teradata database's linear scalability. In the multi-node environment, the Teradata database can scale upward to systems supporting more than 50 petabytes of data and process hundreds of concurrent sessions. With the power of multiple CPUs propelling multiple AMPs over multiple disks, getting answers from the detail data in the warehouse becomes fast and efficient.
The Teradata database supports normalized logical models because it can perform 128 joins in a single query as well as large aggregations during very complex queries using its robust parser and optimizer. A key Teradata database strength is the ability to model the customer's business. With a business model, data is organized according to what it represents. The Teradata database can do Star Schema and other forms of relational modeling - but Third Normal Form is the methodology we recommend to our customers. The Teradata database also has:
- A Priority Scheduler and workload management tools that prioritize resources among competing applications based on importance
- A powerful suite of load and maintenance utilities
- Numerous third-party strategic and alliance partners

A central, enterprise-wide database that contains information extracted from operational systems.
- Based on enterprise-wide model
- Can begin small but may grow large rapidly
- Populated by extraction/loading of data from operational systems
- Responds to users' "what if" queries
- Minimizes data movement/ synchronization
- Provides a single view of the business
|
|
Data Marts
A data mart is a special purpose subset of data used by a particular department, function or application. Data marts may have both summary and detail data for a particular use rather than for general use. Usually the data has been pre-aggregated or transformed in some way to better handle the particular type of requests of a specific user community. Data marts are often implemented when a department wants to own its own data, when a quick ROI is required, and, in some cases, when canned queries may offer users better performance.
|
|
Data Mart Types
- Independent Data Marts
Independent data marts are created directly from operational systems, just as a data warehouse is. In the data mart, the data is usually transformed as part of the load process. Data might be aggregated, dimensionalized or summarized historically, as the requirements of the data mart dictate.
- Logical Data Marts
Logical data marts are not separate physical structures or a data load from a data warehouse, but rather are an existing part of the data warehouse. Because in theory the data warehouse contains the detail data of the entire enterprise, a logical view of the warehouse might provide the specific information for a given user community, much as a physical data mart would. Without the proper technology, a logical data mart can be a slow and frustrating experience for end users. With the proper technology (Teradata), it removes the need for massive data loading and transformation, making a single data store available for all user needs.
- Dependent Data Marts
Dependent data marts are created from the detail data in the data warehouse. While having many of the advantages of the logical data mart, this approach still requires the movement and transformation of data but may provide a better vehicle for some performance-critical user queries.

|
Data Models - Enterprise vs. Application
To build an EDW, an enterprise data model should be leveraged. An enterprise data model serves as a neutral data model that is normalized to address all business areas and not specific to any function or group, whereas an application model is built for a specific business area. The application data model only looks at one aspect of the business, whereas an enterprise logical data model integrates all aspects of the business. In addition, an enterprise data model is more extensible than an application data model. It is intended to encompass the entire enterprise.
|
Teradata Functional Overview
The Parsing Engine is responsible for:
- Managing individual sessions (up to 120 sessions per PE)
- Parsing and optimizing your SQL requests
- Building query plans with the parallel-aware, cost-based, intelligent Optimizer
- Dispatching the optimized plan to the AMPs
- EBCDIC/ASCII input conversion (if necessary)
- Sending the answer set response back to the requesting client

|
Major Components of the Teradata Database
We have looked at the overall node, and now we will describe the components that make up a node in detail.
|
|
Parallel Database Extensions (PDE)
Parallel Database Extensions (PDE) is a software interface layer that lies between the operating system and the database. It controls the virtual processors (vprocs).
|
Parsing Engine (PE)
The Parsing Engine is a virtual processor (vproc) that interprets SQL requests, receives input records, and passes data. To do that, it sends the messages over the BYNET to the AMPs (Access Module Processor).
|
BYNET
The BYNET is the message-passing layer. It determines which AMP(s) should receive a message.
|
Access Module Processor (AMP)
The AMP is a virtual processor (vproc) designed for and dedicated to managing a portion of the entire database. It performs all database management functions such as sorting, aggregating, and formatting data. The AMP receives data from the PE, formats rows, and distributes them to the disk storage units it controls. The AMP also retrieves the rows requested by the Parsing Engine.
|
Disks
Disks are disk drives associated with an AMP that store the data rows. On current systems, they are implemented using a disk array.
|
What Happens When a Query is Submitted?
- The Parser evaluates the SQL for proper syntax.
- The Optimizer develops the least expensive plan.
- The BYNET passes the steps from the Dispatcher to the AMPs.

|
The Parsing Engine (PE)
A Parsing Engine (PE) is a virtual processor (vproc). It is made up of the following software components: Session Control, the Parser, the Optimizer, and the Dispatcher. Once a valid session has been established, the PE is the component that manages the dialogue between the client application and the RDBMS. Each PE can handle up to 120 sessions, and each session can handle multiple requests.
|
|
Session Control
The major functions performed by Session Control are logon and logoff. Logon takes a textual request for session authorization, verifies it, and returns a yes or no answer. Logoff terminates any ongoing activity and deletes the session's context.
|
Parsing
The Parser interprets SQL statements, checks them for proper SQL syntax and evaluates them semantically. The Parser also consults the Data Dictionary to ensure that all objects and columns exist and that the user has authority to access these objects.
|
Optimizing
The Optimizer is responsible for developing the least expensive plan to return the requested response set. Processing alternatives are evaluated and the fastest plan is chosen. This plan is converted to executable steps to be performed by the AMPs, and the steps are then passed to the Dispatcher. In order to maximize throughput and minimize resource contention, the Optimizer must know about the system configuration, the available units of parallelism (number of AMPs and PEs), and the data demographics of the tables in the query. The Teradata Database Optimizer is robust and intelligent; it is parallel-aware and cost-based, which enables the Teradata database to handle multiple complex, ad hoc queries efficiently.
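The plan the Optimizer produces can be inspected by prefixing any request with the EXPLAIN modifier. A minimal illustration, using a hypothetical Customer table:
EXPLAIN
SELECT *
FROM Customer
WHERE Cust = 45;
The output is an English-language description of the steps the AMPs will perform, which is useful for verifying that the expected access path (for example, a single-AMP primary index retrieval) is being used.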
|
Dispatching
The Dispatcher controls the sequence in which the steps are executed and passes the steps on to the AMPs via the BYNET. It is composed of execution-control and response-control tasks. Execution control receives the step definitions from the Parser and transmits them to the appropriate AMP(s) for processing, receives status reports from the AMPs as they process the steps, and passes the results on to response control once the AMPs have completed processing. Response control returns the results to the user. The Dispatcher sees that all AMPs have finished a step before the next step is dispatched. Depending on the nature of the SQL request, a step will be sent to one AMP, or broadcast to all AMPs.
Note: By default, Teradata Gateway software can support up to 1200 sessions per processing node. Therefore a maximum of 10 Parsing Engines can be defined for a node using the Gateway. It is possible to configure a node to support more than 1200 sessions.
The Parsing Engine is responsible for:
- Managing individual sessions (up to 120 sessions per PE)
- Parsing and optimizing your SQL requests
- Building query plans with the parallel-aware, cost-based, intelligent Optimizer
- Dispatching the optimized plan to the AMPs
- EBCDIC/ASCII input conversion (if necessary)
- Sending the answer set response back to the requesting client

|
The BYNET
The BYNET (pronounced "bye-net") is a high-speed interconnect (network) that enables multiple nodes in the system to communicate. The BYNET handles the internal communication of the Teradata database. All communication between PEs and AMPs is done via the BYNET.
When the PE dispatches the steps for the AMPs to perform, they are dispatched onto the BYNET. Messages are routed to the appropriate AMP(s) where results sets and status information are generated. This response information is also routed back to the requesting PE via the BYNET.
Depending on the nature of the dispatch request, the communication between nodes may be a:
- Broadcast - message is routed to all nodes in the system.
- Point-to-point - message is routed to one specific node in the system.
Once the message is on a participating node, PDE directs the message to the appropriate AMPs on that node. All AMPs receive a broadcast message. With a point-to-point or multicast (multiple AMPs) message, the message is directed only to the appropriate AMP(s) on that node by PDE.
So, while a Teradata database system does do multicast messaging, the BYNET hardware alone cannot do it - the BYNET can only do point-to-point and broadcast between nodes.
The BYNET has several unique features:
- Fault tolerant: each network has multiple connection paths. If the BYNET detects an unusable path in either network, it will automatically reconfigure that network so all messages avoid the unusable path. Additionally, in the rare case that BYNET 0 cannot be reconfigured, hardware on BYNET 0 is disabled and messages are re-routed to BYNET 1, and vice versa.
- Load balanced: traffic is automatically and dynamically distributed between both BYNETs.
- Scalable: as you add nodes to the system, overall network bandwidth scales linearly - meaning an increase in system size without loss of performance.
- High Performance: an MPP system has two BYNET networks providing hardware redundancy. Because both networks are active, the system benefits from the full aggregate bandwidth.
The technology of the BYNET is what makes the scalable parallelism of the Teradata database possible.

BYNET merges final answer set
Dual redundant, fault-tolerant, bi-directional interconnect network that enables:
- Automatic load balancing of message traffic
- Automatic reconfiguration after fault detection
- Scalable bandwidth as nodes are added
The BYNET connects and communicates with all the AMPs and PEs in the system:
- Between nodes, the BYNET hardware carries broadcast and point-to-point communications
- On a node, BYNET software and PDE together control which AMPs receive a multicast communication
- Merges the final answer set
|
|
The Access Module Processor (AMP)
The Access Module Processor (AMP) is the virtual processor (vproc) in the architecture that is responsible for managing a portion of the database. Each AMP will manage some portion of each table on the system. AMPs do the physical work associated with generating an answer set, including sorting, aggregating, formatting, and converting. The AMPs retrieve and perform all database management functions on the requested rows from a table. An AMP accesses data from its single associated vdisk, which is made up of multiple ranks of disks. An AMP responds to Parser/Optimizer steps transmitted across the BYNET by selecting data from or storing data to its disks. For some requests, the AMPs may redistribute a copy of the data to other AMPs.
The Database Manager subsystem resides on each AMP. This subsystem will:
- Lock databases and tables.
- Create, modify, or delete definitions of tables.
- Insert, delete, or modify rows within the tables.
- Retrieve information from definitions and tables.
- Return responses to the Dispatcher.
Earlier in this course, we discussed the logical organization of data into tables. The Database Manager subsystem provides a bridge between that logical organization and the physical organization of the data on disks. The Database Manager performs space-management functions that control the use and allocation of space.
The Teradata database performs all tasks in parallel, providing exceptional performance. The greater the number of tasks processed in parallel, the better the system performance. Many databases call themselves "parallel," but they can only perform some tasks in parallel.
|
|
Shared Nothing Architecture
The Teradata database virtual processors, or vprocs (which are the PEs and AMPs), share the components of the nodes (memory and cpu). The main component of the "shared-nothing" architecture is that each AMP manages its own dedicated portion of the system's disk space (called the vdisk) and this space is not shared with other AMPs. Each AMP uses system resources independently of the other AMPs so they can all work in parallel for high system performance overall.
AMPs perform all tasks in parallel
The AMP is responsible for:
- Storing rows to and retrieving rows from its vdisk
- Lock management
- Sorting rows and aggregating columns
- Join processing
- Output conversion and formatting (ASCII, EBCDIC)
- Creating answer sets for clients
- Disk space management and accounting
- Special utility protocols
- Recovery processing
|
What is a Node?
Up to this point, we have discussed relational databases in terms of how the user sees them - as a collection of tables that relate to one another. Now we'll look at the physical components of the system. A node is made up of various hardware and software components. All applications run under Linux or Windows, and all the Teradata database software runs under Parallel Database Extensions (PDE). All share the resources of CPU and memory on the node.

Access Module Processors (AMPs) and Parsing Engines (PEs) are virtual processors (vprocs) running under control of the PDE. Their numbers are software configurable.
AMPs are associated with virtual disks (vdisks), which are configured as ranks of a disk array.
The "Shared Nothing" architecture of the Teradata database means that each vproc is responsible for its own portion of the database, and for all activities to be performed on that part of the system.
All AMPs and PEs communicate via the BYNET, which is a message passing system. In an SMP (Symmetric Multi-Processing) system, it is implemented as a BYNET driver, and in MPP (Massively Parallel Processing) systems, it is implemented as a software and hardware solution. The BYNET allows multiple vprocs on multiple nodes to communicate with each other. PDE actually controls message-passing activity, while the BYNET handles message queuing and flow control.
An application that runs under the control of PDE, such as the Teradata database, is considered a Trusted Parallel Application (TPA).
Note: The versatility of the Teradata database is based on virtual processors (vprocs) that eliminate dependency on specialized physical processors. These vprocs are a set of software processes that run on a node under Parallel Database Extensions (PDE) and the multitasking environment of the operating system.
In a massively parallel processing (MPP) system, there is hardware redundancy among both the multiple nodes and the BYNET.
|
|
Parallelism
Parallelism is at the very heart of the Teradata database. There is virtually no part of the system where parallelism has not been built in. Without the parallelism of the system, managing enormous amounts of data would either not be possible or, best case, would be prohibitively expensive and inefficient. Each PE can support up to 120 user sessions in parallel. This could be 120 distinct users, or a single user harnessing the power of all 120 sessions for a single application. Each session may handle multiple requests concurrently. While only one request at a time may be active on behalf of a session, the session itself can manage the activities of 16 requests and their associated answer sets.
The Message Passing Layer was designed such that it can never be a bottleneck for the system. Because the MPL is implemented differently for different platforms, this means that it will always be well within the needed bandwidth for each particular platform's maximum throughput.
Each AMP can perform up to 80 tasks (or more) in parallel. This means that AMPs are not dedicated at any moment in time to the servicing of only one request, but rather are multi-threading multiple requests concurrently. The value 80 represents the number of AMP Worker Tasks and may be changed on some systems. Because AMPs are designed to operate on only one portion of the database, they must operate in parallel to accomplish their intended results.
In truth, parallelism is built into the Teradata database from the ground up!

Notes:
- Each PE can handle up to 120 sessions in parallel.
- Each Session can handle multiple REQUESTS.
- The Message Passing Layer can handle all message activity in parallel.
- Each AMP can perform up to 80 tasks in parallel (can be configured for more).
- All AMPs can work together in parallel to service any request.
- Each AMP can work on several requests in parallel.
|
|
Teradata Compression Comparison
|
Multi-value Compression
Multi-value compression has three variations:
- COMPRESS – Nulls are compressed.
- COMPRESS NULL – Nulls are compressed. (COMPRESS & COMPRESS NULL mean the same thing)
- COMPRESS <constant> – Nulls and the specified <constant> value are compressed.
With algorithmic compression, you must specify a UDF for compression and a UDF for decompression.
Example:
CREATE TABLE bank_accounts
(customer_id INTEGER
,primary_bank INTEGER COMPRESS -- only NULL is compressed
,account_type CHAR(10) COMPRESS 'Savings' -- both NULL and 'Savings' are compressed
,account_notes VARCHAR(500) CHARACTER SET UNICODE
COMPRESS USING TD_SYSFNLIB.TransUnicodeToUTF8
DECOMPRESS USING TD_SYSFNLIB.TransUTF8ToUnicode -- algorithmic compression
. . . );
|
Multi-value Compression vs. VARCHAR
For character-based data, an alternative to Teradata compression is the VARCHAR(N) data type. The number of bytes used to store each field is the length of the data item plus two bytes. Contrast this to a fixed-length CHAR(N) data type that takes N bytes per row, regardless of the actual number of characters in each field. Combining Teradata compression with a fixed-length character data type can be a very effective space saver. MVC can compress up to 255 values in a column. Data demographics can help determine whether a variable-length character data type or fixed length plus compression is more efficient. The most important factors are the maximum field length, the average field length, and the compressibility of the data.
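As a rough sketch of the two alternatives (the table and column names here are illustrative, not taken from an original example), the same character column could be defined either way:
-- Fixed-length CHAR plus multi-value compression (effective when a few values dominate):
CREATE TABLE customer_addr_fixed
( customer_id INTEGER
 ,city        CHAR(20) COMPRESS ('Chicago', 'New York', 'Dallas') );
-- Variable-length alternative (each value costs its actual length plus two bytes):
CREATE TABLE customer_addr_var
( customer_id INTEGER
 ,city        VARCHAR(20) );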
|
Algorithmic Compression
When column values are mostly unique, Algorithmic Compression can provide better compression results than Multi-value Compression. You can define your own compression and decompression algorithms and apply them to data at the column level. A user can create their own compression algorithms as external C/C++ scalar UDFs, and then specify them in the column definition of a CREATE TABLE/ALTER TABLE statement. Teradata invokes these algorithms to compress and decompress the data when the data is inserted into or retrieved from the table.
ALC allows you to implement the compression scheme that is most suitable for data in a particular column. The cost of compression and decompression depends on the algorithm chosen.
|
Block Level Compression
Performs compression on whole data blocks at the file system level before they are written to storage.
- BLC will compress/decompress only data blocks but will not be used on any file system structures such as Master/Cylinder indexes, the WAL log and table headers.
- Only primary data subtable and fallback copies can be compressed. Both objects are either compressed or neither is compressed.
- Secondary Indexes (USI/NUSI) cannot be compressed but Join Indexes can be compressed since they are effectively user tables.
- Only the compressed form of the data block is cached, so each block access must decompress the data block.
- On initial delivery, a single compression algorithm will be supplied and used.
- Spool data blocks can be compressed via a system-wide tunable, as well as Temporary, WORK (sort Work space), and permanent journal.
- Once enabled, reversion to an earlier release for compressed tables isn't allowed.
- Teradata 14.0 added the capability of Temperature Based Block Level Compression (TBBLC) to enable greater compression flexibility while negating potential performance offsets.
Multi-value Compression (MVC) vs. Algorithmic Compression (ALC) vs. Block Level Compression (BLC/TBBLC):
Ease of Use
- MVC: Easy to apply to well understood data columns and values
- ALC: Easy to apply on a column with CREATE TABLE
- BLC (TBBLC): Set once and forget
Analysis Required
- MVC: Need to analyze data for common values
- ALC: Use Teradata algorithms or user-defined compression algorithms to match unique data patterns
- BLC (TBBLC): Need to analyze the CPU overhead trade-off (can be turned on for all data on the system or applied on a per-table basis)
Flexibility
- MVC: Works for a wide variety of data and situations
- ALC: Automatically invoked for values not replaced by MVC
- BLC (TBBLC): Automatically combined with other compression mechanisms
Performance Impact
- MVC: No or minimal CPU usage
- ALC: Depends on compression algorithm used
- BLC (TBBLC): Reduced I/O due to compressed data blocks (CPU cycles are used to compress/decompress)
Applicability
- MVC: Replaces common values
- ALC: Industry data, UNICODE, Latin data
- BLC (TBBLC): All data
A customer can choose any combination or all three on a column/table.
|
Layers of the Data Warehouse Architecture
There are several "layers" that make up the EDW environment. These layers include:
- Staging Layer
The primary purpose of the staging layer is to perform data transformation and loading during either the ETL or ELT process.
- Access Layer
Access is often provided via views and business intelligence (BI) tools, whether a Teradata application or a third-party tool.
- Physical Layer
The physical layer is where denormalizations that will make access more efficient occur: pre-aggregations, summary tables, join indexes, etc.
- Data Lab
This is Teradata's implementation of a controlled "sandbox" environment. It enables rapid loading and analysis of new data combined with existing warehouse data. It quickly assimilates untested data into a "non-production" portion of the warehouse, providing a self-provisioning, self-service environment for prototyping and analysis on new, un-cleansed data, which is temporary in nature. Data Lab is comprised of two Viewpoint portlets.


Teradata Data Lab - Interface
|
|
Teradata Database Objects
A "database" or "user" in Teradata database systems is a collection of objects such as tables, views, macros, triggers, stored procedures, user-defined functions, or indexes (join and hash). Database objects are created and accessed using standard Structured Query Language, or SQL. Starting with Teradata Release 14.10, the extended object names feature allows object names to be up to 128 characters, where previously there was a 30-byte limit. All database object definitions are stored in a system database called the Data Dictionary/Directory (DD/D).
Databases provide a logical grouping for information. They are also the foundation for space allocation and access control.
|
|
Tables
A table is the logical structure of data in a relational database. It is a two-dimensional structure made up of columns and rows.
|
Views
A view is a pre-defined subset of one or more tables or other views. It does not exist as a real table, but serves as a reference to existing tables or views. One way to think of a view is as a virtual table. Views have definitions in the data dictionary, but do not contain any physical rows. Views can be used to implement security and control access to the underlying tables. They can hide columns from users, insulate applications from changes, and simplify or standardize access techniques. They provide a mechanism to include a locking modifier. Views provide well-defined, tested, high-performance access to data. X-Views are special Views that limit the display of data to the user who accesses them.
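A minimal sketch of a view (the Customer table and the choice to hide the Phone column are illustrative) that also includes a locking modifier:
CREATE VIEW Customer_V AS
LOCKING ROW FOR ACCESS
SELECT Cust
      ,Name
FROM Customer;
Users granted SELECT on Customer_V can query customer names without ever seeing the Phone column.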
|
Macros
A macro is a predefined, stored set of one or more SQL commands and optionally, report formatting commands. Macros are used to simplify the execution of frequently used SQL commands. Macros may also be used to limit user access to data.
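A minimal sketch of a macro and its execution (the Customer table and parameter are illustrative):
CREATE MACRO Get_Customer (in_cust INTEGER) AS
( SELECT Cust, Name, Phone
  FROM Customer
  WHERE Cust = :in_cust; );
EXEC Get_Customer (45);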
|
Triggers
A trigger is a set of SQL statements usually associated with a column or a table and when that column changes, the trigger is fired - effectively executing the SQL statements.
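A minimal sketch, assuming a hypothetical Customer table and a Customer_Log table that records phone number changes:
CREATE TRIGGER Customer_Phone_Trig
AFTER UPDATE OF (Phone) ON Customer
REFERENCING OLD AS oldrow NEW AS newrow
FOR EACH ROW
( INSERT INTO Customer_Log
  VALUES (newrow.Cust, oldrow.Phone, newrow.Phone, CURRENT_TIMESTAMP); );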
|
Stored Procedures
A stored procedure is a program that is stored within Teradata and executes within the Teradata database. A stored procedure uses permanent disk space and may be used to limit a user's access to data.
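A minimal sketch of a stored procedure and its invocation (the Customer table and parameter names are illustrative):
CREATE PROCEDURE Add_Customer
( IN in_cust  INTEGER
 ,IN in_name  CHAR(10)
 ,IN in_phone CHAR(8) )
BEGIN
  INSERT INTO Customer (Cust, Name, Phone)
  VALUES (:in_cust, :in_name, :in_phone);
END;
CALL Add_Customer (88, 'Nguyen', '555-1212');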
|
User Defined Functions (UDF)
A User-Defined Function (UDF) allows authorized users to write external functions. Teradata allows users to create scalar functions to return single value results, aggregate functions to return summary results, and table functions to return tables. UDFs may be used to protect sensitive data such as personally identifiable information (PII), even from the DBA(s).
X-Views limit the display of results to the user accessing them.
|
Teradata Database Space
There are three types of space within the Teradata database:
- Perm Space
All databases and users have a defined upper limit of Permanent Space. Permanent Space is used for storing the data rows of tables. Perm Space is not pre-allocated. It represents a maximum limit.
- Spool Space
All databases also have an upper limit of Spool Space. If there is no limit defined for a particular database or user, limits are inherited from parents. Theoretically, a user could use all of the unallocated space in the system for their query. Spool Space is unused Perm Space that is used to hold intermediate query results or formatted answer sets for queries. Once the query is complete, the Spool Space is released.
Example: You have a database with total disk space of 100GB. You have 10GB of user data and an additional 10GB of overhead. What is the maximum amount of Spool Space available for queries?
Answer: 80GB. All of the remaining space in the system is available for spool.
- Temp Space
The third type of space is Temporary Space. Temp Space is used for global temporary tables, and these results remain available to the user until their session is terminated. Tables created in Temp Space will survive a restart. Temp Space is permanent space currently not used.
|
|
Space Management
The way the Teradata database handles space management is different from other database implementations. The initial Teradata database system comes with several users and databases:
- Users: DBC, SysAdmin, SystemFE, TDPuser
- Databases: Crashdumps, Default, All, Public
Initially, most of the space in the system belongs to user DBC, with some allocated to each of the users or databases listed above. Database DBC contains all of the RDBMS software components and system tables. Before defining application users and databases, the Database Administrator should first create a special administrative user and assign most of the space in the system to that user. This space comes from user DBC. User DBC becomes the owner of the administrative user.
There should be enough space left in user DBC to accommodate the growth of system tables and logs.
As the administrative user creates additional users and databases, space assigned to those objects will be subtracted from the administrative user's space. As these users and databases create subordinate objects, they will give up some of their space to these new users and databases.
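A minimal sketch of this step (the user name, space sizes, and password are purely illustrative):
CREATE USER SysDBA FROM DBC AS
  PERMANENT = 500000000000  -- Perm limit, subtracted from DBC
 ,SPOOL     = 200000000000  -- upper limit for intermediate results
 ,TEMPORARY = 100000000000  -- upper limit for global temporary tables
 ,PASSWORD  = temp_password;
Application databases and users would then be created FROM SysDBA, each subtracting its Perm limit from SysDBA's allocation.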
The clear boxes represent all the space available in the system. The shaded boxes represent space that is being used in the system by tables. Space that is not currently being used by tables is available for use as spool or temporary space. If a table is dropped, that space becomes available to the system again.
Database 1 and Database 2 contain the production tables. Database 3 contains views and macros to control access to the data by end users and does not consume any database space. User D is an example of an end user who has no Perm Space but has been granted rights on certain views and macros to accomplish work. This user can still execute queries and create views or macros. User A, User B, and User C are examples of application developers who need some Perm Space to hold sample data for program testing and validation.
Most database vendors do not handle space this way. Once space is allocated to a table, it cannot be made available again without the Database Administrator having to perform a re-organization and re-allocate the space and partition the data.

- A new database or user must be created from an existing database or user.
- All Perm Space limits are subtracted from the owner.
- Perm Space is a zero-sum game - the total of all Perm Space limits must equal the total amount of disk space available.
- Perm Space currently not being used is available as Spool Space or Temp Space.
|
|
How Does the Teradata Database Distribute Rows?
Each AMP is designed to hold a portion of the rows of each table. An AMP is responsible for the storage, maintenance and retrieval of the data under its control. The Teradata database uses hashing to dynamically distribute data across all AMPs. The Teradata database's automatic hash distribution eliminates costly data maintenance tasks. There is no specific order to the placement of the data. The benefits of having unordered data are that they don't need any maintenance to preserve order, and they are independent of any query being submitted. There is never a need to reorganize data or index space, which makes the database easier to manage and maintain. As a result, strategic business data is more accessible to the users.
The benefits of automatic data placement include:
- Distribution is the same regardless of data volume.
- Distribution is based on row content, not data demographics.
- Rows are moved or re-blocked automatically.
Ideally, the rows of every table will be distributed among all of the AMPs. There may be some circumstances where this is not true. For example, what if there are fewer rows than AMPs? In this case, at least some AMPs will hold no rows from that table. (This example should be considered the exception, and not the rule.)
Ideally, the rows of each table will be evenly distributed across all of the AMPs. Even distribution is desirable because in an operation involving all rows of the table (such as a full-table scan), each AMP will have an equal portion of work to do. When workloads are not evenly distributed, the desired response can only be as fast as the slowest AMP.
Remember: The Teradata database uses a Shared-Nothing Architecture - each AMP is responsible for its own portion of the database and does not share it with any other AMP.

|
|
Data Management
Many other databases require you to manually partition the data. Teradata automates the placement and maintenance of physical data location. With the Teradata database, your DBA can spend more time with users developing strategic applications to beat your competition.
Loading, inserting, updating, and deleting data affects manual data distribution schemes, thereby reducing query performance and requiring reorganization.
|
Primary Keys and Primary Indexes
While it is true that many tables use the same columns for both Primary Indexes and Primary Keys, Indexes are conceptually different from Keys. The table below summarizes those differences. A Primary Key is a relational data modeling term that defines, in the logical model, the columns that uniquely identify a row. A Primary Index is a physical database implementation term that defines the actual columns used to distribute and access rows in a table.
It is also true that a significant percentage of the tables in any database will use the same column(s) for both the PI and the PK. However, one should expect that in any real-world scenario there would be some tables that will not conform to this simplistic rule. Only through a careful analysis of the type of processing that will take place can the tables be properly evaluated for PI candidates. Remember, changing your mind about the columns that comprise the PI means recreating (and reloading) the table.
- Indexes are conceptually different from keys.
- A PK is a relational modeling convention which allows each row to be uniquely identified.
- A PI is a Teradata convention which determines how the row will be stored and accessed.
- A significant percentage of tables MAY use the same columns for both the PK and the PI.
- A well-designed database WILL use a PI that is different from the PK for some tables.
Primary Key (PK) vs. Primary Index (PI):
- PK: Logical concept of data modeling / PI: Physical mechanism for access and storage
- PK: Teradata does not need the PK to be defined / PI: Each table can have (at most) one primary index
- PK: No limit on number of columns / PI: 64-column limit
- PK: Documented in the data model (optional in CREATE TABLE) / PI: Defined in the CREATE TABLE statement
- PK: Must be unique / PI: May be unique or non-unique
- PK: Identifies each row / PI: Identifies 1 (UPI) or multiple rows (NUPI)
- PK: Values should not change / PI: Values may be changed (Delete + Insert)
- PK: May not be NULL - requires a value / PI: May be NULL
- PK: Does not imply an access path / PI: Defines the most efficient access path
- PK: Chosen for logical correctness / PI: Chosen for physical performance
|
|
Primary Key (PK) vs. Primary Index (PI)
Primary Indexes are conceptually different from Primary Keys.
- Primary Key (PK) is a relational data-modeling term that defines the attribute(s) used to uniquely identify a tuple in an entity.
- Primary Index (PI) is a physical database implementation term that defines the column(s) used to distribute the rows in a table. The Primary Index column(s) value may be unique or non-unique.
- Unique Primary Index (UPI) - the values in the Primary Index column(s) are unique, which results in even row distribution and the elimination of duplicate row checks during data load operations.
- Non-Unique Primary Index (NUPI) - the values in the Primary Index column(s) are non-unique, which results in potentially skewed row distribution - all duplicate Primary Index values go to the same AMP.
Accessing a row by its Primary Index value is the most efficient way to access a row and is always a one-AMP operation. Choosing a Primary Index is not an exact science. It requires analysis and thought for some tables and will be completely self-evident on others. Sometimes the obvious choice will be the Primary Key, which is known to be unique. Sometimes the choice of Primary Index may have to do with join performance and known access paths, and will therefore be a different choice than the Primary Key of its source entity.
Example: PK vs. PI. An insurance claims table may contain the following columns: Claim ID, Date, Insurance Provider ID, Patient ID, and Claim Amount. The Primary Key is defined as Claim ID and Date. The same claim number may be submitted on one date and then resubmitted on another date, so you need both columns to look up a claim and its details. Other users may often be analyzing provider fraud behavior and need to access the claim by the Provider ID. In this case, the Primary Index may be the Provider ID column. Even if data distribution may be skewed, a choice like this might be made to provide efficient access to a table.
Primary Key (PK) vs. Primary Index (PI):
- PK: Logical concept of data modeling / PI: Mechanism for row distribution and access
- PK: Teradata does not need the PK defined / PI: A table can have one (or none) Primary Index
- PK: No limit on the number of attributes / PI: May be from 1 to 64 columns
- PK: Documented in the logical data model / PI: Defined in the CREATE TABLE statement
- PK: Must be unique / PI: Value may be unique or non-unique
- PK: Unique identifier / PI: Used to place a row on an AMP
- PK: Value should not change / PI: Value may be changed (Updated)
- PK: May not be NULL / PI: May be NULL
- PK: Does not imply access path / PI: Defines the most efficient access path
- PK: Chosen for logical correctness / PI: Chosen for physical performance
|
|
Primary Indexes
With the Teradata database, the Primary Index (PI) is the physical mechanism for assigning a data row to an AMP and a location on its disks. Indexes are also used to access rows from a table without having to search the entire table. A table may have a Primary Index. The Primary Index of the table cannot be changed (i.e., altered, added, or dropped). Choosing a Primary Index for a table is perhaps the most critical decision a database designer makes. The choice will affect the distribution of the rows in the table and the performance of the table in a production environment. Although many tables use combined columns as the Primary Index choice, the examples we use here are single-column indexes.
There are two types of primary indexes; unique (UPI) and non-unique (NUPI).
A Unique Primary Index (UPI) is a column that has no duplicate values. UPIs are desirable because they guarantee even distribution of table rows. With a UPI, there is no duplicate row checking done during a load, which makes it a faster operation.
Because it is not always feasible to pick a Unique Primary Index, it is sometimes necessary to pick a column (or columns) which have non-unique, or duplicate values. This type of index is called a Non-Unique Primary Index or NUPI. A NUPI allows duplicate values as well as access to the non-unique columns. A NUPI can cause skewed data. While not a guarantor of uniform row distribution, the degree of uniqueness of the index will determine the degree of uniformity of the distribution. Because all rows with the same PI value end up on the same AMP, columns with a small number of distinct values that are repeated frequently do not make good PI candidates.
Accessing the row by its Primary Index value is the most efficient way to access a row and is always a one-AMP operation. Choosing a Primary Index is not an exact science. It requires analysis and thought for some tables and will be completely self-evident on others. Sometimes the obvious choice will be the Primary Key, which is known to be unique. Sometimes the choice of Primary Index may have to do with join performance (enabling AMP-local joins to other tables) and known access paths (allowing local joins to similar entities), and will therefore be a different choice than the Primary Key of the table.
The Teradata database is unique in hashing data directly to a physical address on disk, creating an even distribution of data. This allows a balanced application of parallelism, and avoids imbalance due to data skew. Rows are hashed based on the Primary Index.
|
|
Defining the Primary Index
When a table is created, the Primary Index should be specified. The Primary Index may consist of a single column or a combination of up to 64 columns. Columns are assigned data types and may optionally be assigned attributes and constraints. Once you choose the Primary Index for a table, it cannot be changed. The existing table must be dropped and a new table created to change the Primary Index definition.
Note: While you cannot change the Primary Index itself, values in a Primary Index column may be changed. The Teradata database simply rehashes that row to its new location, based on the Primary Index value.
The table definition is stored in the Data Dictionary (DD).
UPI Table:
CREATE TABLE Table1
( Col1 INTEGER
,Col2 INTEGER
,Col3 INTEGER )
UNIQUE PRIMARY INDEX (Col1);
NUPI Table:
CREATE TABLE Table2
( Col1 INTEGER
,Col2 INTEGER
,Col3 INTEGER )
PRIMARY INDEX (Col2);
- The Primary Index is the mechanism used to assign a row to an AMP.
- The Primary Index may be Unique (UPI) or Non-Unique (NUPI).
- Unique Primary Indexes result in even row distribution and eliminate duplicate row checking.
- Non-Unique Primary Indexes result in row distribution proportional to the number of rows per value. This may cause skewed distribution.
|
|
Row Distribution via Hashing
The diagram below gives an overview of Primary Index hash distribution, which is the process by which all data is distributed in the Teradata database. The Primary Index value is fed into the Hashing Algorithm, which produces the Row Hash. The row goes onto the Communications Layer. The Hash Map, in combination with the Row Hash, determines which AMP gets the row. The Hash Map is a part of the Communications Layer interface.
The Teradata database's hash partitioning is the only approach which offers balanced processing and balanced data placement, while minimizing interconnect traffic and freeing the DBA from time-consuming reorganizations. The Teradata File System eliminates the need to rebuild indexes when tables are updated or structures change.

|
|
Hashing Down to the AMPs
The Teradata database is unique in hashing data directly to a physical address on disk, creating an even distribution of data. This allows a balanced application of parallelism, and avoids imbalance due to data skew. The Teradata database's hashed data distribution scheme provides optimized performance with minimal tuning and no manual reorganizations, resulting in lower administration costs and reduced development time. The rows of all tables are distributed across the AMPs according to their primary index value. The primary index value goes into the hashing algorithm. The output is a 32-bit row hash. The high-order 16 bits are referred to as the hash bucket number and are used to identify a hash-map entry. This entry is used to identify the AMP to be targeted. The remaining 16 bits are not used to locate the AMP.
Teradata provides customers with two choices for the size and number of the hash buckets in a system:
- 16 bits (65,536 hash buckets, the same as in previous releases)
- 20 bits (1,048,576 hash buckets)
For larger systems, such as those with 1,000 AMPs or more, the 20-bit option will result in more hash buckets to allow workloads to be distributed more evenly over the AMPs. New systems are also candidates for the expanded hash bucket option.
The entire 32-bit row hash will be used by the selected AMP to locate the row within its disk space.
Hash maps are uniquely configured for each size of system. A 50-AMP system will have a different hash map than a 30-AMP system.
A hash map is simply an array that associates bucket numbers with specific AMPs. When a system grows, new AMPs are typically added. Adding AMPs requires a change to the hash map to reflect the new total number of AMPs.
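One way to see how a candidate primary index would spread rows across the AMPs is to combine the hashing functions in a query (the Customer table and Cust column are illustrative):
SELECT HASHAMP (HASHBUCKET (HASHROW (Cust))) AS target_amp
      ,COUNT(*)                              AS row_count
FROM Customer
GROUP BY 1
ORDER BY 1;
A roughly equal row_count per AMP indicates good distribution; a few AMPs with very large counts indicate skew.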
|
UPI Row Distribution
The hashing algorithm predictably distributes rows across all AMPs. The same value stored in the same data type will always produce the same hash value. If the Primary Index is unique, the Teradata database can distribute the rows evenly. If the Primary Index is non-unique, but there are a fairly even number of rows per index value, the table will still distribute evenly. But if there are hundreds or thousands of rows for some index values, the distribution will probably be skewed. In the example below, Order_Number is used as a unique primary index. Since the primary index value for Order_Number is unique, the distribution of rows among AMPs is uniform. This assures maximum efficiency because each AMP does approximately the same amount of work. No AMPs sit idle waiting for another AMP to finish a task.
This way of storing the data provides for maximum efficiency and makes the best use of the parallel features of the Teradata database.
|
|
Orders table
Order_Number (PK, UPI)   Customer_Number   Order_Date   Order_Status
7325                     2                 4/13         O
7324                     3                 4/13         O
7415                     1                 4/13         C
7103                     1                 4/10         O
7225                     2                 4/15         C
7384                     1                 4/12         C
7402                     3                 4/16         C
7188                     1                 4/13         C
7202                     2                 4/09         C
- Order_Number values are unique (UPI).
- The rows will distribute evenly across the AMPs.

|
Unique Primary Index (UPI) Access
Accessing a table using a Primary Index is the most efficient way to access the data and is always a one-AMP operation. In the case of a UPI, the one-AMP access can return, at most, one row. In the example below, we are looking for the row whose primary index value is 45. By specifying the PI value as part of our selection criteria, we are guaranteed that only the AMP containing the specified row will be searched. The correct AMP is located by taking the PI value and passing it through the hashing algorithm. Hashing takes place in the Parsing Engine. The output of the hashing algorithm contains information that will point to a specific AMP. Once it has isolated the appropriate AMP, finding the row is quick and efficient.
SELECT *
FROM Customer
WHERE Cust = 45;
CREATE TABLE Customer
( Cust INTEGER
,Name CHAR(10)
,Phone CHAR(8) )
UNIQUE PRIMARY INDEX (Cust);
|
|
CUSTOMER table
Cust (PK, UPI)   Name     Phone
37               White    555-4444
98               Brown    333-9999
74               Smith    555-6666
95               Peters   555-7777
27               Jones    222-8888
56               Smith    555-7777
45               Adams    444-6666
84               Rice     666-5555
49               Smith    111-6666
51               Marsh    888-2222
31               Adams    111-2222
62               Black    444-5555
12               Young    777-4444
77               Jones    777-6666

|
NUPI Row Distribution
In the example below, Customer_Number has been used as a non-unique Primary Index (NUPI). Note that row distribution among AMPs is uneven. All rows with the same primary index value (with the same customer number) are stored on the same AMP. Customer_Number has three possible values, so all the rows are hashed to three AMPs, leaving the fourth AMP without rows from this table. While this distribution will work, it is not as efficient as spreading all rows among all AMPs.
AMP 2 has a disproportionate number of rows and AMP 3 has none. In an all-AMP operation, AMP 2 will take longer than the other AMPs. The operation cannot complete until AMP 2 completes its tasks. Overall operation time is increased and some AMPs are under-utilized.
This illustrates how NUPIs can create irregular distributions, called skewed distributions. AMPs that have more than an average number of rows will take longer for full-table operations than other AMPs. Because an operation is not complete until all AMPs have finished, the operation where distribution is skewed will take longer than it would if all AMPs were utilized evenly.
|
|
Orders table
Order_Number (PK)   Customer_Number (NUPI)   Order_Date   Order_Status
7325                2                        4/13         O
7324                3                        4/13         O
7415                1                        4/13         C
7103                1                        4/10         O
7225                2                        4/15         C
7384                1                        4/12         C
7402                3                        4/16         C
7188                1                        4/13         C
7202                2                        4/09         C
- Customer_Number values are non-unique (NUPI).
- Rows with the same PI value distribute to the same AMP causing skewed row distribution.

|
Non-Unique Primary Index (NUPI) Access
Accessing a table using a Non-Unique Primary Index (NUPI) is also a one-AMP operation. In the case of a NUPI, the one-AMP access can return zero to many rows. In the example below, we are looking for the rows whose primary index value is 555-7777. By specifying the PI value as part of our selection criteria, we are once again guaranteeing that only the AMP containing the required rows will need to be searched. The correct AMP is located by taking the PI value and passing it through the hashing algorithm executed in the Parsing Engine. The output of the hashing algorithm will once again point to a specific AMP. Once it has isolated the appropriate AMP, it must find all rows that have the specified value. In the example, the AMP returns two rows.
SELECT *
FROM Customer
WHERE Phone = '555-7777';
CREATE TABLE Customer
( Cust INTEGER
,Name CHAR(10)
,Phone CHAR(8) )
PRIMARY INDEX (Phone);
|
|
CUSTOMER table
Cust (PK)   Name     Phone (NUPI)
37          White    555-4444
98          Brown    333-9999
74          Smith    555-6666
95          Peters   555-7777
27          Jones    222-8888
56          Smith    555-7777
45          Adams    444-6666
84          Rice     666-5555
49          Smith    111-6666
51          Marsh    888-2222
31          Adams    111-2222
62          Black    444-5555
12          Young    777-4444
77          Jones    777-6666
72          Adams    666-7777
40          Smith    222-3333

Single AMP access with 0 to n rows returned.
|
Highly Non-Unique NUPI Row Distribution
The example below uses Order_Status as a NUPI. Order_Status is a poor choice, because it yields the most uneven distribution. Because there are only two possible values for Order_Status, all rows are placed on two AMPs. This is an example of a highly non-unique Primary Index.
When choosing a Primary Index, never pick a column with a severely-limited value set. The degree of uniqueness is critical to efficiency. Choose NUPIs that allow all AMPs to participate fairly equally.
|
|
Orders table
Order_Number (PK)   Customer_Number   Order_Date   Order_Status (NUPI)
7325                2                 4/13         O
7324                3                 4/13         O
7415                1                 4/13         C
7103                1                 4/10         O
7225                2                 4/15         C
7384                1                 4/12         C
7402                3                 4/16         C
7188                1                 4/13         C
7202                2                 4/09         C
- Order_Status values are highly non-unique (NUPI).
- Only two values exist. The rows will be distributed to two AMPs.
- This table will not perform well in parallel operations.
- Highly non-unique columns are poor PI choices.
- The degree of uniqueness is critical to efficiency.

|
Partitioned Primary Index (PPI)
The Teradata database has a partitioning mechanism called Partitioned Primary Index (PPI). PPI is used to improve performance for large tables when you submit queries that specify a range constraint. PPI allows you to reduce the number of rows to be processed by using a technique called partition elimination. PPI will increase performance for incremental data loads, deletes, and data access when working with large tables with range constraints. PPI is useful for instantaneous dropping of old data and rapid addition of new data. Using PPI lets you avoid having to do a full-table scan without the overhead of a Secondary Index.
|
|
How Does PPI Work?
Data distribution with PPI is still based on the Primary Index:
- Primary Index >> Hash Value >> Determines which AMP gets the row
With PPI, the ORDER in which the rows are stored on the AMP is affected. Using the traditional method, a Non-Partitioned Primary Index (NPPI), the rows are stored in row hash order.
Using PPI, the rows are stored first by partition and then by row hash /row id order. In our example, there are four partitions. Within the partitions, the rows are stored in row hash order.
|
Data Storage Using PPI
To store rows using PPI, specify partitioning in the CREATE TABLE statement. The query will run through the hashing algorithm as normal, and come out with the Base Table ID, the Partition number(s), the Row Hash, and the Primary Index values. Partition Elimination is used to access tables with a PPI. The term "partition elimination" refers to an automatic optimization in which the optimizer determines, based on query conditions, that some partitions can't contain qualifying rows, and causes those partitions to be skipped. Partitions that are skipped for a particular query are called excluded partitions. Generally, the greatest benefit of a PPI table is obtained from partition elimination.
If the partitioning expression references any columns that aren't part of the primary index, then PI access may be slower. When all partitioning expression columns are contained in the PI, then PI access is unchanged.
The Orders table defined with a Non-Partitioned Primary Index (NPPI) on Order_Number (O_#).

Partitioned Primary Indexes:
- Improve performance on range constraint queries
- Use partition elimination to reduce the number of rows accessed
- May use a CASE_N or RANGE_N partitioning expression
The Orders table defined with a Primary Index on Order_Number (O_#) Partitioned By Order_Date (O_Date) (PPI).
|
Multilevel Partitioned Primary Index (MLPPI)
A PPI can be sub-partitioned, increasing the opportunities for partition elimination at the various additional levels (or combinations of levels). With a multilevel PPI, you create multiple access paths to the rows in the base table that the Optimizer can choose from. This improves response to business questions by improving the performance of queries which take advantage of partition elimination.
- Allows multiple partitioning expressions on a table or a non-compressed join index - including base tables, Global Temp tables, and Volatile tables.
- Multilevel partitioning allows each partition at a level to be sub-partitioned. Each partitioning level is defined independently using a RANGE_N or CASE_N expression.
- A multilevel PPI allows efficient searches by using partition elimination at the various levels or combination of levels.
- Allows flexibility on which partitioning expression to use when there are multiple choices of partitioning expressions.
For example, an insurance claims table could be partitioned by claim date and then sub-partitioned by state. The analysis performed for a specific state (such as Connecticut) within a date range that is a small percentage of the many years of claims history in the data warehouse (such as June 2012) would take advantage of partition elimination for faster performance.
Note: An MLPPI table must have at least two partition levels defined.
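A minimal sketch of such a claims table, with illustrative column names, date ranges, and state codes (not taken from the original example):
CREATE TABLE Claims
( Claim_ID     INTEGER NOT NULL
 ,Claim_Date   DATE    NOT NULL
 ,State_Code   CHAR(2) NOT NULL
 ,Provider_ID  INTEGER
 ,Claim_Amount DECIMAL(10,2) )
PRIMARY INDEX (Claim_ID)
PARTITION BY ( RANGE_N (Claim_Date BETWEEN DATE '2010-01-01'
                                   AND     DATE '2014-12-31'
                                   EACH INTERVAL '1' MONTH)
              ,CASE_N (State_Code = 'CT'
                      ,State_Code = 'NY'
                      ,NO CASE, UNKNOWN) );
A query constrained to June 2012 and State_Code = 'CT' would read only the intersection of those two partitions.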
|
|
Advantages of partitioned tables:
- They reduce the I/O for range constraint queries.
- They take advantage of dynamic partition elimination.
- They provide multiple access paths to the data, and an MLPPI provides even more partition elimination and more partitioning expression choices (i.e., you can use last name or some other value that is more readily available to query on).
- The Primary Index may be either a UPI or a NUPI; a NUPI allows local joins to other similar entities.
- Row hash locks are used for SELECT with equality conditions on the PI columns.
- Partitioned tables allow for fast deletes of data in a partition.
- They allow for range queries without a secondary index.
- Specific partitions may be archived or deleted.
- May be created on Volatile tables, global temp tables, base tables, and non-compressed join indexes.
- May replace a Value Ordered NUSI for access.
|
Disadvantages of partitioned tables:
- Rows in a partitioned table are eight bytes longer
- Access via the Primary Index may take longer
- Full table joins to an NPPI table with the same PI may take longer
|
Syntax:
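The general shape of the partitioning clause is outlined below (table, column, and expression names are placeholders, not runnable as-is):
CREATE TABLE tablename
( column_definitions )
PRIMARY INDEX (index_columns)
PARTITION BY RANGE_N (partitioning_column BETWEEN ... AND ... EACH ...);    -- single-level PPI

CREATE TABLE tablename
( column_definitions )
PRIMARY INDEX (index_columns)
PARTITION BY ( RANGE_N (...), CASE_N (...) );                               -- multilevel PPI (MLPPI)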
|
What is a NoPI Table?
A NoPI table is simply a table without a primary index.
Prior to Teradata Database 13.0, Teradata tables required a primary index. The primary index was used to hash and distribute rows to the AMPs according to hash ownership. The objective was to divide data as evenly as possible among the AMPs to make use of Teradata's parallel processing. Each row stored in a table has a RowID, which includes the row hash generated by hashing the primary index value. The primary index also gives the optimizer an efficient access path; for example, it can choose a single-AMP execution plan for SQL requests that specify values for the columns of the primary index.
Starting with Teradata database 13.0, a table can be defined without a primary index. This feature is referred to as the NoPI Table feature. NoPI stands for No Primary Index.
Without a PI, the hash value as well as AMP ownership of a row is arbitrary. Within the AMP, there are no row-ordering constraints and therefore rows can be appended to the end of the table as if it were a spool table. Each row in a NoPI table has a hash bucket value that is internally generated. A NoPI table is internally treated as a hashed table; it is just that typically all the rows on one AMP will have the same hash bucket value.
With a NoPI table:
- INSERTS/UPDATES to the table are always appended at the end of the table
- The table must always be MULTISET
- There are no duplicate row checks performed upon load
If you use FastLoad to load rows into a NoPI table, you cannot have secondary indexes, join indexes, CHECK constraints, or referential integrity constraints on the table, as FastLoad cannot load a table defined with any of these features. You can, however, use FastLoad to load rows into unpartitioned NoPI tables with row-level security defined.
When FastLoading to a NoPI table, the rows are distributed a block at a time.
Example of creating a NoPI table:
CREATE TABLE table_x
(col_x INTEGER
,col_y CHAR(10)
,col_z DATE)
NO PRIMARY INDEX;
|
|
Teradata Columnar
Teradata Columnar, or Column Partitioning (CP), is a physical database design implementation option (starting with Teradata 14.0) that allows single columns or sets of columns of a NoPI table to be stored in separate partitions. Column partitioning can also be applied to join indexes. Columnar is simply a new approach for organizing the data of a user-defined table or join index on disk.
Columnar offers the ability to partition a table or join index by column. It can be used alone or in combination with row partitioning in multilevel partitioning definitions. Column partitions may be stored using traditional 'ROW' storage or alternatively stored using the new 'COLUMN' storage option. In either case, columnar can automatically compress physical rows where appropriate.
The key benefit in defining row-partitioned (PPI) tables is when queries access a subset of rows based on constraints on one or more partitioning columns. The major advantage of using column partitioning is to improve the performance of queries that access a subset of the columns from a table, either for predicates (e.g., WHERE clause) or projections (i.e., SELECTed columns).
Because sets of one or more columns can be stored in separate column partitions, only the column partitions that contain the columns needed by the query need to be accessed. Just as row-partitioning can eliminate rows that need not be read, column partitioning eliminates columns that are not needed.
The advantages of both can be combined, meaning even less data moved and thus reduced I/O. Fewer data blocks need to be read since more data of interest is packed together into fewer data blocks.
Columnar makes more sense in CPU-rich environments because CPU cycles are needed to "glue" columns back together into rows, for compression and for different join strategies (mainly hash joins).
Example of creating a Columnar table:
CREATE TABLE table_y
(col_a CHAR(25) NOT NULL
,col_b CHAR(25) NOT NULL
,col_c DATE NOT NULL
,col_d CHAR(7) NOT NULL
,col_e INTEGER)
PARTITION BY COLUMN;
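Column partitioning can also be combined with row partitioning in a multilevel definition. A minimal sketch, with illustrative table and column names:
CREATE TABLE table_z
(col_a CHAR(25) NOT NULL
,col_b DATE NOT NULL
,col_c INTEGER)
NO PRIMARY INDEX
PARTITION BY (COLUMN
             ,RANGE_N (col_b BETWEEN DATE '2012-01-01'
                                 AND DATE '2012-12-31'
                       EACH INTERVAL '1' MONTH));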
|
|
Teradata Columnar Benefits
Benefits of using the Teradata Columnar feature include:
- Improved query performance: Column partitioning can be used to improve query performance via column partition elimination. Column partition elimination reduces the need to access all the data in a row, while row partition elimination reduces the need to access all the rows.
- Reduced disk space: The feature also allows for the possibility of using a new auto-compression capability which allows data to be automatically (as applicable) compressed as physical rows are inserted into a column-partitioned table or join index.
- Increased flexibility: Provides a new physical database design option to improve performance for suitable classes of workloads.
- Reduced I/O: Allows fast and efficient access to selected data from column partitions, thus reducing query I/O.
- Ease of use: Provides simple default syntax for the CREATE TABLE and CREATE JOIN INDEX statements. No change is needed to queries.
|
|
Columnar Join Indexes
A join index can also be created as column-partitioned for either a columnar table or a non-columnar table. Conversely, a join index can be created as non-columnar for either type of table as well.
Within a mixed workload, some queries might perform better when the data is not column partitioned and others when it is. Or, perhaps some queries perform better with one type of partitioning on a table, whereas other queries do better with another type of partitioning. Join indexes allow creation of alternate physical layouts for the data, with the optimizer automatically choosing whether to access the base table and/or one of its join indexes.
A column-partitioned join index must be a single-table, non-aggregate, non-compressed, join index with no primary index, and no value-ordering, and must include RowID of the base table. A column-partitioned join index may optionally be row partitioned. It may also be a sparse join index.
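A minimal sketch of a column-partitioned join index on the table_y example above (the index name is illustrative); note that it selects ROWID from the base table, specifies no primary index, and is partitioned by column:
CREATE JOIN INDEX table_y_cji AS
SELECT ROWID, col_a, col_c
FROM table_y
PARTITION BY COLUMN;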
|
Secondary Indexes
A secondary index is an alternate path to the data. Secondary indexes are used to improve performance by allowing the user to avoid scanning the entire table. A secondary index is like a primary index in that it allows the user to locate rows. It is unlike a primary index in that it has no influence on the way rows are distributed among AMPs. A database designer typically chooses a secondary index because it provides faster set selection.
There are three general ways to access a table:
- Primary Index access (one AMP access)
- Secondary Index access (two or all AMP access)
- Full Table Scan (all AMP access)
Primary Index requests require the services of only one AMP to access rows, while secondary indexes require at least two and possibly all AMPs, depending on the index and the type of operation. A secondary index search will typically be less expensive than a full table scan.
Secondary indexes add overhead to the table, both in terms of disk space and maintenance; however they may be dropped when not needed, and recreated whenever they would be helpful.
- A secondary index provides an alternate path to the rows of a table.
- A secondary index can be used to maintain uniqueness within a column or set of columns.
- A table can have from 0 to 32 secondary indexes (each with 1 to 64 columns).
- Secondary Indexes:
- Do not affect table distribution.
- Add overhead, both in terms of disk space (sub-tables) and maintenance.
- May be added or dropped dynamically as needed.
- Are only chosen to improve query performance.
|
|
Defining a Secondary Index
Just as with primary indexes, there are two types of secondary indexes – unique (USI) and nonunique (NUSI). Secondary indexes may be specified at table creation or at any time during the life of the table. A secondary index may consist of up to 64 columns; however, to get the full benefit of the index, the query would have to specify a value for every column in the index.
Unique Secondary Indexes (USI) have two possible purposes. They can speed up access to a row which otherwise might require a full table scan, without having to rely on the primary index. Additionally, they can be used to enforce uniqueness on a column or set of columns. This is the case when a PRIMARY KEY or UNIQUE constraint is defined and the table has a specified Primary Index or the No Primary Index option; making the constraint a USI has the effect of enforcing its uniqueness.
Non-Unique Secondary Indexes (NUSI) are usually specified in order to prevent full table scans. However, a NUSI does activate all AMPs; after all, the value being sought might well exist on many different AMPs (only a primary index guarantees that rows with the same index value land on the same AMP). If the optimizer decides that the cost of using the secondary index is greater than a full table scan would be, it opts for the table scan.
All secondary indexes cause an AMP subtable to be built and maintained as rows are inserted, updated, or deleted. Secondary index subtables consist of rows which associate the secondary index value with one or more rows in the base table. When the index is dropped, the subtable is physically removed.
A Secondary Index may be defined ...
- at table creation (CREATE TABLE)
- following table creation (CREATE INDEX)
- may be up to 64 columns
USI (Unique Secondary Index) - If the index choice of column(s) is unique, it is called a USI. Accessing a row via a USI is a two-AMP operation.
CREATE UNIQUE INDEX (EmpNumber) ON Employee;
NUSI (Non-Unique Secondary Index) - If the index choice of column(s) is nonunique, it is called a NUSI. Accessing row(s) via a NUSI is an all-AMP operation.
CREATE INDEX (LastName) ON Employee;
Notes:
- Creating a Secondary Index causes an internal sub-table to be built.
- Dropping a Secondary Index causes the sub-table to be deleted.
|
|
Other Types of Secondary Indexes
There are a few other Secondary Indexes available with Teradata:
|
|
Sparse Index
Any join index, whether simple or aggregate, multi-table or single-table, can be sparse. A sparse join index uses a constant expression in the WHERE clause of its definition to narrowly filter its row population; such an index is known as a Sparse Join Index. A sample definition is shown below.
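A minimal sketch, reusing the illustrative Orders columns from earlier (names are assumptions): the constant WHERE condition restricts the join index to a subset of the base-table rows.
CREATE JOIN INDEX orders_2012_ji AS
SELECT O_Number, O_Date, O_Total
FROM Orders
WHERE O_Date BETWEEN DATE '2012-01-01' AND DATE '2012-12-31'
PRIMARY INDEX (O_Number);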
|
Hash Index
Hash indexes are used for the same purpose as single-table join indexes. A hash index creates a full or partial replication of a base table, with a primary index on a foreign key column, to facilitate joins of very large tables by hashing them to the same AMP. You can define a hash index on a single table only. Hash indexes are not indexes in the usual sense of the word; they are base tables that cannot be accessed directly by a query.
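A minimal sketch on the Customer table used later in this document (the index name is illustrative), following the general CREATE HASH INDEX form:
CREATE HASH INDEX cust_hx
(CustID, CustName) ON Customer
BY (CustID)
ORDER BY HASH (CustID);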
|
Value-Ordered NUSI
Value-ordered NUSIs are efficient for range conditions and conditions with an inequality on the secondary index column set. Because the NUSI rows are sorted by data value, it is possible to search only a portion of the index subtable for a given range of key values; the major advantage of a value-ordered NUSI is therefore in the performance of range queries. Value-ordered NUSIs have the following limitations (an example definition follows the list):
- The sort key is limited to a single numeric column.
- The sort key column cannot exceed four bytes.
- They count as two indexes against the total of 32 non-primary indexes you can define on a base or join index table.
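A minimal sketch, again using the illustrative Orders table and its date column: the ORDER BY VALUES clause makes the NUSI value-ordered.
CREATE INDEX (O_Date) ORDER BY VALUES (O_Date) ON Orders;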
|
|
Full-Table Scans
A full table scan is another way to access data without using any primary or secondary indexes.
In evaluating an SQL request, the Parser examines all possible access methods and chooses the one it believes to be the most efficient. The coding of the SQL request, the demographics of the table, and the availability of indexes all play a role in the Parser's decision. Some coding constructs, such as those in the examples below, always cause a full table scan. In other cases, a full table scan might be chosen simply because it is the most efficient method. In general, if the number of physical reads would exceed the number of data blocks, the optimizer may decide that a full-table scan is faster.
With a full table scan, each data block is found using the Master and Cylinder Indexes and each data row is accessed only once.
As long as the choice of Primary Index has caused the table rows to distribute evenly across all of the AMPs, the parallel processing of the AMPs can complete the full table scan quickly. The file system keeps each table on as few cylinders as practical to help reduce the cost of full table scans.
While full table scans are impractical and even disallowed on some systems, the Vantage Advanced SQL Engine routinely executes ad hoc queries with full table scans.
Every row of the table must be read. All AMPs scan their portion of the table in parallel.
- Fast and efficient on Teradata due to parallelism.
Full table scans typically occur when either:
- An index is not used in the query
- An index is used in a non-equality test
|
|
CUSTOMER table: CustID (USI), CustName (no index), CustPhone (NUPI)
Examples of Full Table Scans:
SELECT * FROM Customer WHERE CustPhone LIKE '858-485-____';
SELECT * FROM Customer WHERE CustName = 'Lund';
SELECT * FROM Customer WHERE CustID > 1000;
|
Locks
Locking serializes access and prevents multiple users who are trying to update the same data at the same time from violating data integrity. This concurrency control is implemented by locking the desired data. Locks are automatically acquired during the processing of a request and released at the termination of the request. In addition, users can specify locks. There are four types of locks:
- Exclusive Locks - prevent any other type of concurrent access. Exclusive locks are applied only to databases or tables, never to rows. They are the most restrictive type of lock; all other users are locked out. Exclusive locks are typically used only when structural changes are being made to the database or table (such as changing the block size on a table), as in the case of a DDL statement.
- Write Locks - prevents other Read, Write, Exclusive locks. Write locks enable users to modify data while locking out all other users except readers not concerned about data consistency (access lock readers). Until a write lock is released, no new read or write locks are allowed. When you update a table without a WHERE clause, the system places a write lock on the table.
- Read Locks - prevents Write and Exclusive locks. Read locks are used to ensure consistency during read operations. Several users may hold concurrent read locks on the same data, during which no modification of the data is permitted.
- Access Locks - prevents Exclusive locks only. Users who are not concerned about data consistency can specify access locks. Using an access lock allows for reading data while modifications are in process. Access locks are designed for decision support on large tables that are updated only by small, single-row changes. Access locks are sometimes called stale read or dirty read locks. You may get stale data that hasn't been updated, or has been overwritten.
|
|
Lock Levels
Three levels of database locking are provided:
- Database - applies to all tables/views in the database. Locks all objects in the database.
- Table/View - applies to all rows in the table/view. Locks all rows in the table or view.
- Row Hash - applies to all rows with the same row hash. Locks all rows with the same row hash (primary and Fallback rows, and Secondary Index subtable rows).
The type and level of locks are automatically chosen based on the type of SQL command issued. The user has, in some cases, the ability to upgrade or downgrade the lock.
- SELECT - requests a Read lock
- UPDATE - requests a Write lock
- CREATE TABLE - requests an Exclusive lock
Lock requests may be upgraded or downgraded:
LOCKING TABLE Table1 FOR ACCESS . . .
LOCKING TABLE Table1 FOR EXCLUSIVE . . .
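For example, a query that can tolerate a dirty read might downgrade the default READ lock to an ACCESS lock (the table name is illustrative):
LOCKING TABLE Table1 FOR ACCESS
SELECT * FROM Table1;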
|
Data Protection Features
Facilities that provide system-level protection:
- Disk Arrays
- RAID data protection (e.g., RAID 1)
- Redundant data buses and array controllers
- Cliques and Vproc Migration
- SMP or Linux failures; Vprocs can migrate to other nodes (or HSN) within the clique.
Facilities that provide Teradata database protection:
- Fallback - provides data access with a “down” AMP
- Locks - provide data integrity
- Transient Journal - automatic rollback of aborted transactions
- Down AMP Recovery Journal - fast recovery of fallback rows for AMPs
- Permanent Journal - optional before and after-image journaling
- Unity - integrated portfolio that turns a multi-system environment into an orchestrated ecosystem
- ARC - Archive/Restore facility
- NetBackup and Tivoli - provide backup storage management, ARC script creation, and scheduling capabilities
Disk Arrays – Disk arrays provide RAID 1 or RAID 6 protection. If a disk drive fails, the array subsystem provides continuous access to the data.
Clique – a set of Teradata nodes that share a common set of disk arrays. In the event of node failure, all vprocs can migrate to another available node in the clique. All nodes in the clique must have access to the same disk arrays.
Locks – Locking prevents multiple users who are trying to change the same data at the same time from violating the data's integrity. This concurrency control is implemented by locking the desired data. Locks are automatically acquired and released during the processing of a request.
Fallback – protects your data by storing a second copy of each row of a table on an alternative “fallback AMP” and provides protection in the event of an AMP failure. Fallback provides AMP fault tolerance at the table level.
Down-AMP Recovery Journal – started automatically when the system has a failed or down AMP. Its purpose is to log any changes to rows which reside on the down AMP.
Transient Journal – the TJ is part of the WAL log and exists to permit the successful rollback of a failed transaction.
ARC/DSA and NetBackup/Tivoli – ARC and DSA command scripts provide the capability to backup and restore the Teradata database.
|
|
Transient Journal
The transient journal (or Online Transaction Journal) permits the successful rollback of a failed transaction (TXN). Transactions are not committed to the database until the AMPs have received an End Transaction request, either implicitly or explicitly. There is always the possibility that the transaction may fail; if so, data is returned to its original state after the transaction failure.
The transient journal automatically maintains on each AMP a copy of the before-images of all rows affected by the transaction. In the event of a transaction failure, the before-images are reapplied to the affected tables and a rollback operation is completed. In the event of transaction success, the before-images for the transaction are discarded from the journal at the point of transaction commit.
Journal entries are made on and by the AMP making the change. Transient Journal activities are automatic and transparent to the user.
|
|
ALTER TABLE Misconception
An often repeated misconception is that when an ALTER TABLE command is executed, the “altered data” is written to the transient journal/WAL (Write-Ahead Logging or Write-Ahead Logic) log.
For example, assume an ALTER TABLE ADD COLUMN is executed. Assuming the altered table is a large table, this operation may be time consuming. Why?
The reason that ALTER TABLE ADD COLUMN may be time consuming is that each row is rebuilt, each block is rebuilt, and the cylinder is rebuilt (re-lays out each cylinder). It does so one cylinder at a time in order to avoid having to make a full copy of the table. A new cylinder is allocated; each datablock is read; the records are reformatted and written to a new datablock which is placed on the new cylinder. When all the blocks are complete, the new cylinder is committed to disk. Then the old cylinder is released and the process is repeated until all cylinders are completed. This work reformats the rows and blocks in the same way as an INSERT-SELECT.
The only thing written to the WAL log is a single record per cylinder as a cylinder is committed. This is used in case the ALTER TABLE is interrupted – e.g., by a Teradata restart. The last cylinder log record is used to restart the process from the last completed cylinder and complete it. This is also why an ALTER TABLE cannot be aborted once it has been started.
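For reference, a hypothetical statement of this kind (table and column names are illustrative) is simply:
ALTER TABLE Orders ADD O_Comment VARCHAR(200);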
|
Fallback
Fallback protects your data by storing a second copy of each row of a table on an alternate, Fallback AMP in the same cluster. If an AMP fails, the system accesses the Fallback rows to meet requests. Fallback provides AMP fault tolerance at the table level. With Fallback tables, if one AMP fails, all table data is still available, and users may continue to use Fallback tables without any loss of available data.
During table creation or after a table is created, you may specify whether or not the system should keep a Fallback copy of the table (an example follows below). If Fallback is specified, it is automatic and transparent.
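As a minimal sketch (the table name and columns are illustrative), Fallback can be requested in the DDL either at creation time or afterwards:
CREATE TABLE Sales_History, FALLBACK
(Sale_Id INTEGER NOT NULL
,Sale_Date DATE
,Sale_Amount DECIMAL(10,2))
PRIMARY INDEX (Sale_Id);

ALTER TABLE Sales_History, FALLBACK;      -- add Fallback protection to an existing table
ALTER TABLE Sales_History, NO FALLBACK;   -- remove Fallback protection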
Fallback guarantees that the two copies of a row will always be on different AMPs in the same cluster. A cluster is a group of AMPs that act as a single Fallback unit. Clustering has no effect on the primary row distribution of the table, but the Fallback row will always go to another AMP in the same cluster. If either AMP fails, the alternate row copy is still available on the other AMP.
There is a benefit to protecting your data, but there are costs associated with that benefit. With Fallback use, you need twice the disk space for storage and twice the I/O for INSERTs, UPDATEs, and DELETEs. (The Fallback option does not require any extra I/O for SELECT operations and the Fallback I/O will be performed in parallel with the primary I/O.)
The benefits of Fallback include:
- Protects your data from an AMP failure. An AMP failure may occur because of physical disk failures - e.g., both disk drives in a RAID 1 drive group have failed or three drives in the same RAID 6 drive group. An AMP may also fail because of an AMP software failure or because access to the disk array has failed.
- Protects your data from software (node) failure.
- Automatically recovers with minimum recovery time, after repairs or fixes are complete.
- May be specified at the table or database level
- Permits access to table data during AMP off-line period
- Adds a level of data protection beyond disk array RAID 1 & 5
- Highest level of data protection is RAID 1 and Fallback
- Critical for high availability applications

Loss of any two AMPs in a cluster causes RDBMS to halt!
|
|
Down AMP Recovery Journal
After the loss of any AMP, a down-AMP recovery journal is started automatically to log any changes to rows which reside on the down AMP. Any inserts, updates, or deletes affecting rows on the down AMP are applied to the Fallback copy within the cluster. The AMP that holds the Fallback copy logs the Row-ID in its recovery journal.
This process continues until the down AMP is brought back on-line. As part of the restart activity, the recovery journal is read and the changed rows are applied to the recovered AMP. When the journal has been exhausted, it is discarded and the AMP is brought on-line, fully recovered.
Recovery Journal is:
- Automatically activated when an AMP is taken off-line
- Maintained by other AMPs in the cluster
- Totally transparent to users of the system
While the AMP is off-line:
- Journal is active
- Table updates continue as normal
- Journal logs Row-IDs of changed rows for the down AMP
When the AMP is back on-line:
- Restores rows on the recovered AMP to current status
- Journal discarded when recovery complete

For Fallback Tables
|
|
RAID Protection
There are many forms of disk array protection. The Teradata database supports the following protection schemes:
- RAID 1
- RAID 5
- RAID S
RAID 1 (Mirroring) uses a disk-mirroring technique. Each physical disk is mirrored elsewhere in the array. This requires the array controllers to write all data to two separate locations, which means data can be read from two locations as well. In the event of a disk failure, the mirror disk becomes the primary disk to the array controller and performance is unchanged.
RAID 5 (Parity) uses a parity-checking technique. For every three blocks of data (spread over three disks), there is a fourth block on a fourth disk that contains parity information. This allows any one of the four blocks to be reconstructed by using the information on the other three. If two of the disks fail, the rank becomes unavailable. A rank is defined as a group of disks that have the same SCSI ID (the disks in the rank are not on the same channel, but are on different channels). The array controller does the recalculation of the information for the missing block. Recalculation will have some impact on performance, but at a much lower cost in terms of disk space.

RAID S uses a parity-checking technique similar to RAID 5 and is used with EMC disk arrays.
Teradata systems use disk arrays that employ redundant controllers (DRDAC) and redundant disk drives. When a drive fails, the array switches to the mirror drive or uses parity to reconstruct the data.
Note: With Teradata databases, RAID 1 and RAID 5 are commonly used. Unless there is a reason not to do so, Teradata recommends that customers implement RAID 1.
Summary:
- RAID 1: Good performance with disk failures; higher cost in terms of disk space.
- RAID 5: Reduced performance with disk failures; lower cost in terms of disk space.
|
|
Hot Standby Nodes (HSN)
A Hot Standby Node (HSN) is a node which is a member of a clique but is not configured (initially) to execute any Teradata vprocs. If a node in the clique fails, Teradata restarts and the AMPs from the failed node move to the hot standby node. The performance degradation is 0%.
When the failed node is recovered/repaired and restarted, it becomes the new hot standby node. A second restart of Teradata is not needed.
Characteristics of a hot standby node are:
- A node that is a member of a clique.
- Does not normally participate in the trusted parallel application (TPA).
- Will be brought into the configuration when a node fails in the clique.
- Helps with unplanned outages.
- Helps with planned outages
- Eliminates the need for a restart to bring a failed node back into service.
Hot standby nodes are positioned as a performance continuity feature.


This example illustrates vproc migration using a Hot Standby Node.
- Performance Degradation is 0% as AMPs are moved to the Hot Standby Node.
- When Node 1 is recovered, it becomes the new Hot Standby Node.
|
|
Teradata Tools and Utilities Overview
Teradata has a robust suite of client utilities that enable users and system administrators to enjoy optimal response time and system manageability. Various client utilities are available for tasks ranging from loading data to managing the system.
Teradata utilities leverage the Teradata database's high performance capabilities and are fully parallel and scalable. The same utilities run on smaller entry-level systems and on the largest MPP implementations.
Teradata database client utilities include the following, described in this section:
- Query Submitting Utilities - Query tools that submit SQL queries
- BTEQ
- Teradata Studio Express
- Load and Unload Utilities
- Teradata Parallel Transporter (TPT)
- Administrative Utilities - for managing workloads
- Teradata Studio
- Priority Scheduler
- Database Query Log (DBQL)
- Teradata Workload Analyzer
- Teradata Integrated Workload Management (TIWM)
- Teradata Active Systems Management (TASM)
- Tools that support a Multi-System Environment
- Viewpoint
- Unity
- Teradata Multi-Systems Manager (TMSM)
- Data Mover
|
|
BTEQ
BTEQ stands for Basic Teradata Query utility. It is a SQL front-end tool that runs on all client platforms. It supports both interactive ad hoc and scripted batch queries and provides standard report writing and formatting functionality. It also provides a powerful import/export capability, even though other utilities are available for these functions as well, particularly when larger tables are involved. When users log on with BTEQ, they can set the number of sessions they need before logging on. Using multiple sessions may improve performance for certain operations; this feature exploits the parallelism of the Teradata database. BTEQ uses the CLI application programming interface.
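A minimal BTEQ script sketch is shown below. The system name, credentials, and output file name are placeholders; the dot commands (.LOGON, .EXPORT, .LOGOFF, .QUIT) are standard BTEQ commands, while the statements between them are ordinary SQL against the Employee table used in the earlier index examples.
.LOGON tdpid/username,password
.EXPORT REPORT FILE = employee_report.txt
SELECT LastName, EmpNumber
FROM Employee
ORDER BY LastName;
.EXPORT RESET
.LOGOFF
.QUIT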
|
|
Key Points about BTEQ
- Basic Teradata Query utility
- SQL front-end
- Report writing and formatting features
- Interactive and scripted batch queries
- Import/Export across all platforms
- Supports ad-hoc querying
- Supports table updates with import and export
- Runs on all platforms

BTEQ Uses the CLI API

|
Teradata Parallel Transporter
Teradata Parallel Transporter (TPT) is the replacement for Teradata Warehouse Builder (TWB). This utility is effectively an object-oriented software system that executes multiple instances of data extraction, transformation, and load functions in a scalable, high-speed parallel processing environment.
TPT is scalable and enables end-to-end parallelism. The previous versions of utilities (like FastLoad) allow you to load data into Teradata in parallel, but you still have a single input stream. Teradata Parallel Transporter allows you to run multiple instances of the extract, optional transformation, and load. You can have as many loads as you have sources in the same job. With multiple sources of data coming from multiple platforms, integration is important in a parallel environment.
TPT eliminates the need for persistent storage. It stores data into data buffers so you no longer need to write data into a flat file.
TPT provides a single scripting language. You can do the extract, some transformation, and loads all in one SQL-like scripting language.
Once the dynamics of the language are learned, you can perform multiple tasks with a single script.
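As a rough sketch only (operator attribute names, credentials, file and table names are assumptions and would need to match an actual environment), a TPT job that reads a delimited file and loads it with the Load operator has the following general shape:
DEFINE JOB load_example
DESCRIPTION 'Load a delimited file into a Teradata table'
(
  DEFINE SCHEMA source_schema
  ( col_a VARCHAR(25), col_b VARCHAR(25) );

  DEFINE OPERATOR file_reader                  /* producer: reads the flat file */
  TYPE DATACONNECTOR PRODUCER
  SCHEMA source_schema
  ATTRIBUTES ( VARCHAR FileName = 'source.txt'
             , VARCHAR Format = 'Delimited'
             , VARCHAR TextDelimiter = '|' );

  DEFINE OPERATOR load_op                      /* consumer: FastLoad-style load */
  TYPE LOAD
  SCHEMA *
  ATTRIBUTES ( VARCHAR TdpId = 'mysystem'
             , VARCHAR UserName = 'user'
             , VARCHAR UserPassword = 'password'
             , VARCHAR TargetTable = 'table_x' );

  APPLY ( 'INSERT INTO table_x (col_a, col_b) VALUES (:col_a, :col_b);' )
  TO OPERATOR ( load_op )
  SELECT * FROM OPERATOR ( file_reader );
);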
TPT supports FastLoad INMODs, FastExport OUTMODs, and Access Modules to provide access to all the data sources you use today. TPT also provides a Direct API interface that can be used by Third Party partners. They can effectively write 'C' code to directly load/unload data to/from Teradata.
It is effectively a parallel load and export utility. Characteristics of this utility include:
High performance
- Parallel Export and Load operators eliminate sequential bottlenecks.
- Data Streams eliminate the overhead of intermediate (persistent) storage.
- Scalable
- End-to-end parallelism
Easy to use
- Single scripting language
- Access to various data sources
Extensible
- Direct API enables Third Party partners to write 'C' code to directly load/unload data to/from Teradata.
|
|
TPT Operators
The operators are components that "plug" into the TPT infrastructure and actually perform the functions.
- The FastLoad and FastExport operators support the current FastLoad and FastExport INMOD/OUTMOD features.
- The DataConnector operator is an adapter for the Access Module or non-Teradata files.
- The SQL Select and Insert operators submit the Teradata SELECT and INSERT commands.
- The Load, Update, Export and Stream operators are similar to the FastLoad, MultiLoad, FastExport, and TPump utilities, but built for the TPT parallel environment.
The INMOD and OUTMOD adapters, DataConnector operator, and the SQL Select/Insert operators are included when you purchase the Infrastructure. The Load, Update, Export, and Stream operators are purchased separately.
Here is a comparison of the Teradata Parallel Transporter Operators with the classic utilities that we just covered.
TPT Operator / Teradata Utility / Description:
- LOAD / FastLoad - A consumer-type operator that uses the Teradata FastLoad protocol to load empty tables. Supports error limits, checkpoint/restart, multi-value compression, and PPI.
- UPDATE / MultiLoad - Utilizes the Teradata MultiLoad protocol to update tables. This allows highly scalable and parallel inserts and updates to a populated table.
- EXPORT / FastExport - A producer operator that emulates FastExport.
- STREAM / TPump - Uses multiple sessions to perform DML transactions in near real-time.
- DataConnector / N/A - Emulates the DataConnector API. Reads external data files, writes data to external data files, reads an unspecified number of data files.
- ODBC / N/A - Reads data from an ODBC provider.
- SQL INSERT / BTEQ - A consumer operator that inserts data into a table, similar to a BTEQ import operation.
- SQL SELECT / BTEQ - A producer operator that submits SQL SELECT statements to retrieve data from a table.
|
|
TPT Architecture
There are two types of parallelism used by TPT: pipeline parallelism and data parallelism.
|
|
Pipeline Parallelism
Pipeline parallelism (also called "functional parallelism") is achieved by connecting operator instances through data streams in a single job. A "producer" operator extracts data from a data source and writes it to the data stream, that is, it 'produces' a data stream. The "consumer" operator then 'consumes' the data from the data stream and starts writing data to the target.
There are two operators involved here, producer and consumer. Both operators can operate concurrently and independently of each other, thereby creating a pipeline effect without the need for intermediary files.
For example:

A "producer" operator extracts data from a data source and writes ("produces") it to a data stream. A "consumer" operator reads ("consumes") the data from the data stream and writes or loads data to the target. Both operators can operate concurrently and independently of each other. This creates a pipeline effect without the need for intermediary files.
|
TPT Parallelism
Imagine you have the same job as described above, but handling a large quantity of data coming from several different input files. Each file can be handled by a different instance of the producer operator so that the files are read concurrently. The operator instances can then merge these files into a single data stream which can then be input to multiple consumer operators. This concept is described below.
Each source can be handled by a different instance of the producer operator; these instances merge the files into a single data stream.
The data stream can then be input to multiple consumer operators.
===============================================================
Teradata SQL for Applications Development
Teradata Database has the ability to insert a row using only the DEFAULT VALUES keywords. For this feature to work successfully, one of the following statements must be true for each column of the table:
the column has defined default values
the column has a default system value specified
the column permits nulls
If none of these statements is true, an insert using DEFAULT VALUES will fail. Note that such an insert may be executed multiple times as long as no uniqueness attributes of the table are violated.
The SQL command INSERT INTO tablename DEFAULT VALUES; will:
- Insert defined default values into each column.
- Insert a null if no default is defined.
- Fail if no default is defined and a null is not allowed.
Some options for NULL and DEFAULT attributes:
NOT NULL - Nulls are not permitted for this column
DEFAULT n - Unless explicitly specified, assign the column a value of n. n must be of the same data type as the column.
Example: DEFAULT 'Unknown'
DEFAULT DATE 'xxxx-xx-xx' - Unless explicitly specified, assign the date xxxx-xx-xx
Example: DEFAULT DATE '2099-12-31'
WITH DEFAULT - Assign the system default (e.g. spaces for char strings, zero for numeric data types and current date for date data type)
DEFAULT TIME - Assign the current time to the integer data type
DEFAULT USER - Assign the user id of the session to the character string
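A minimal sketch tying these options together (table and column names are illustrative); given this definition, the DEFAULT VALUES insert succeeds because every column either permits nulls or has a default:
CREATE TABLE Employee_Defaults
(Emp_Number  INTEGER                        -- no default, nulls allowed: NULL is inserted
,Last_Name   CHAR(20) DEFAULT 'Unknown'     -- explicit default value
,Hire_Date   DATE WITH DEFAULT              -- system default (current date for DATE)
,Salary      DECIMAL(10,2) DEFAULT 0        -- numeric default
,Added_By    CHAR(30) DEFAULT USER);        -- user id of the session

INSERT INTO Employee_Defaults DEFAULT VALUES;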

The following metacharacters are used in regular expressions:
^ - A circumflex character indicates that the character (or characters) that follow the circumflex in the regular expression must occur at the beginning of the source string. Example: '^E' requires that the source string begin with an upper case E.
$ - A dollar sign character indicates that the preceding character (or characters) must occur at the end of the source string. Example: ':$' requires that the source string end with a colon.
() - Parentheses enclose a group of characters that is to be treated as a single regular expression entity. Example: '(ABC)+' would match one or more successive occurrences of ABC at the corresponding position in a source string.
[] - Square brackets enclose a subset of characters. Example: '[0-9]' would match any single digit at the corresponding position in a source string.
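As a hedged illustration (assuming the Teradata REGEXP_SIMILAR function available in recent releases, and the Employee/LastName objects from the earlier index examples), such patterns can be applied in SQL like this:
SELECT LastName
FROM Employee
WHERE REGEXP_SIMILAR(LastName, '^E', 'c') = 1;   -- last names beginning with an upper-case E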
Data models define how elements of data are connected or related.
As a data model is refined, it passes through different states which can be referred to as normal forms. Another familiar term is third normal form (3NF). 3NF is a set of rules and guidelines about how the data model should look. In practice, this simply means deciding which columns should belong to which tables.
Broadly, there are three types of data models:
Relational
Dimensional
Logical and Physical
Normalization is the process of reducing a complex database schema into a simple, stable one. In order to normalize, a user must be able to uniquely identify each record. Generally, this process involves removing redundant attributes, keys, and relationships from the conceptual data model. Normalized data is optimized for entity-level transactions.
Dimension denormalization supports dimensional modeling's twin objectives of speed and simplicity. De-normalized data is optimized for answering business questions and driving decision making.
The Teradata Advanced SQL Engine performs space management differently from most other database implementations.
|
|
Types of Spaces
Generally, there are three types of space within the Teradata Advanced SQL Engine.
1. Permanent Space, also known as Perm space, is used for storing the data rows of tables and is the maximum storage assigned to a user and/or database. It is not pre-allocated or reserved ahead of time; it is available on demand, and space is deducted from the owner's allocation as it is used.
2. Spool Space is unused Perm Space that holds intermediate query results or formatted answer sets for queries. Once the query is complete, the Spool Space is released. All databases have an upper limit of Spool Space; theoretically, a user could use all the unallocated space in the system for a query.
3. Temporary Space, sometimes referred to as Temp Space, is used for global temporary tables. Tables created in Temp Space will survive a restart and remain available to the user until the session is terminated.
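A minimal sketch of how these limits are assigned when creating a user (the user name, owning database, limits, and password are illustrative):
CREATE USER analyst_1 FROM dbc AS
  PERM = 10e9          -- maximum Permanent Space
 ,SPOOL = 50e9         -- upper limit of Spool Space
 ,TEMPORARY = 20e9     -- limit for global temporary tables
 ,PASSWORD = secret1;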
What is Data Distribution?
The Teradata Advanced SQL Engine uses hashing to dynamically distribute data across all AMPs.
Hashing is the transformation of a string of characters into a usually shorter, fixed-length value or key that represents the original string. The Teradata Advanced SQL Engine's hashed data distribution scheme provides optimized performance with minimal tuning and no manual reorganizations, resulting in lower administration costs and reduced development time.
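The built-in hash functions can be used to see how a candidate primary index column would distribute rows across the AMPs; this example uses the Customer table and CustID column from the full-table-scan illustration earlier:
SELECT HASHAMP(HASHBUCKET(HASHROW(CustID))) AS amp_number
     , COUNT(*) AS row_count
FROM Customer
GROUP BY 1
ORDER BY 2 DESC;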
Teradata Database Locks
Locks have two dimensions: Level and Severity.
Locks are automatically acquired during the processing of a request and released at the termination of the request. In addition, users can specify locks.
Types
There are four types of locks in Teradata:
- EXCLUSIVE - prevents any other type of concurrent access
- WRITE - prevents other reads, writes, and exclusives
- READ - prevents other writes and exclusives
- ACCESS - prevents exclusive locks only