Physical Data Organization and Indexing

Article

Introduction

This article is intended to explain the main storage structures used in database systems through examples in Microsoft SQL Server and Oracle Database. Choosing the right structure is an important strategy for database developers and architects in order to access data efficiently and have a good quality of service in applications such as performance and availability.

Thus we're going to analyze the types of tables and indexes we may face when designing the data requirements of an enterprise solution in order to result in long-live databases.

Tables in SQL Server

1.1 Heap tables

Heap tables are the simplest data storage structure and also the most common creation of table through the SQL statement ''create table"; however it might change over time when you create an index on it.

Heap tables use a classic data structure called heap which is basically an area of space (in disk or memory) where data rows are placed where they fit best. The most important characteristic is that the data rows are unordered. Data rows are grouped by pages logically inside the table thus the logical address of a row in a data file is given by a row id that comprises a page number within the file and a slot number identifying the row inside the page. The number of rows that will fit on a page is not fixed.

A new row is inserted by appending it to the end of the pages, and it is deleted by setting up the slot it ocuppies as empty. When another new row is inserted, the empties slots are checked to see if one of them fits or not then the row is appended to the end of the last page. As this process goes over time, then some gaps may appear and sometimes these gaps may represent an amount of wasted space. Thus, a defragmentation technique must be applied to the pages in the data file.

Let's demonstrate the unordered behavior of heap tables using some simple sequence of SQL code which can be tested in any RDBMS such as Oracle Database and Microsoft SQL Server. I've tested this code in Microsoft SQL Server 2005 (see Listing 1).

create table tbl_heap (nid int)
Messages: Command(s) completed successfully.

insert into tbl_heap(nid) values(1)
Messages: (1 row(s) affected).

insert into tbl_heap(nid) values(2)
Messages: (1 row(s) affected).

insert into tbl_heap(nid) values(3)
Messages: (1 row(s) affected).

delete from tbl_heap where nid=2
Messages: (1 row(s) affected).
 
insert into tbl_heap(nid) values(4)
Messages: (1 row(s) affected).

select nid from tbl_heap
Output: order showed is 1,4,3

Listing 1. SQL sequence of code showing the unordered behavior of heap tables

It is remarkable to say that this storage structure has an overhead associated with the number of Input/Output (I/O) operations required to do selections of rows because when you need to access specific rows in the table, the database system must scan all the pages looking for the required rows.

1.2 Sorted tables

Because of the drawbacks of heap tables, you need a more sophisticated storage structure. Then if the rows are stored in an arbitrary order by sorting them based on the values of one or more attributes of the table, you have the advantage when searching rows to stop the scan process at the point at which those are located, and therefore the query performance increases. But we have an overhead associated with this solution and it's that when a new row is inserted, this storage structure may be changed over and over depending on the position where the row is going to reside.

A solution to this problem is to leave empty spaces on each page for being used in future insertions.

The fill factor parameter refers to the percentage of space in a page that is initially filled with living room for new rows. This is not a definitive solution because a new row might be inserted and it gets too large to fit in the left space by the fill factor, therefore the database systems must set up another page for it.

The implementation of sorted tables in Oracle Database is by the index-organized tables (IOT) where the data s physically stored and sorted by the primary key.

Let's create an IOT table using Oracle SQL (see Listing 2)

create table tbl_iot
(id int primary key,
value int,
change date)
organization index

including change
overflow

Listing 2. The creation of the tbl_iot index-organized table

As you can see in Listing 2, the organization index clause is required for the creation of the IOT table while overflow and including clause are optional. The overflow clause specifies the database systems to set up another segment, thus making the IOT tables a multisegment object, where the rows can overflow when it gets too large. The including clause tells the database systems to include the specified columns on the index block and the remaining columns are stored in the overflow.

The implementation of sorted tables in SQL Server is by using a clustered index on the primary key of the table (see Listing 3)

create table ctbl_facts
(id int not null,

value int)
go

alter table ctbl_facts
add constraint PK_ ctbl_facts primary key clustered (Id)

go

Listing 3. The creation of a clustered index on the primary key of one table

Indexes

An index on a table provides a mechanism for locating one or more rows without searching the entire table. The property to be located is specified by a set of columns of the indexed table called search key or index key which is a way of the faster access path to data and increases the overall performance of query execution.

The structure of an index consists of a set of entries together with a mechanism for locating the rows based on the search key value. The use of an index can reduce the number of data file pages retrieve, although accessing the index itself is another overhead. In the case of thousands of millions of rows using an index, it might be possible to access to a row in one or two I/O operations. It is remarkable to say that index needs maintenance and when a row is inserted or updated in the table, the index must be modified in order to be adapted to the changes.

There are two types of indexes: clustered and unclustered. In a clustered index, the entries which are close imply proximity among the corresponding data rows. A sorted index is clustered if the index entries and data rows are sorted on the same key; otherwise, the index is unclustered.

In a clustered index, the leaf node of the index tree has the actual data pages itself, while in an unclustered index, the leaf node contains pointers to data pages or clustered index data pages. For this reason of the physical order of rows in the data according to the index, you can have only one clustered index on a given table. Clustered index concept corresponds to the index-organized table (IOT) in Oracle Database and clustered index on a table in Microsoft SQL Server which we discussed in the last section.

There are fundamentally two types of index structures implemented in most database systems: B*Tree and Bitmap indexes. A B*Tree index or Balance Tree index is the most commonly used index structure and default when you create an index using the create index SQL statement. B*Tree is similar in structure to a binary tree. One property of B*Tree is that the tree is balanced, and it implies that despite the insertion and deletion of rows, any path from the tree root to a leaf node has the same length, and from the point of view of database systems the selection of a row has the same performance overhead independently of the searched row. Most B*Tree indexes will have a length of 2 or 3 levels even for thousands of millions of rows.

A bitmap index is a sort of one or more bits that stores pointers to many rows with a single index key entry and typically we find a very small number of entries pointing to many rows. A bitmap index is mainly designed for OLAP databases to be used in data warehouse and data mining applications; and not for OLTP solutions that change the data frequently. The bitmap index is appropriate for table attributes that can take on only a small number of values. A Bitmap index should not be selective, unlike a B*Tree index that should be selective for example attributes storing boolean values. Oracle Databases only implement this type of index, unlike SQL Server databases.

2.1 SQL statement for the creation of an index

The basic statement for the creation of an index is (see Listing 4)

create index [schema].index_name on [schema].table_name
({column}[ASC|DESC] [,{column}[ASC|DESC]])

Listing 4. SQL statement for the creation of an index

Microsoft SQL Server and Oracle Database have their own extension to the former SQL statement as it is resumed in the following tables.

Argument	Description
Unique	To enforce the uniqueness of the index key
clustered \| unclustered	To specify if the index is clustered or unclustered
asc \| desc	To specify the column's order of indexing

Table 1. Microsoft SQL Server extension to the creation of an index

Argument	Description
Unique	To enforce the uniqueness of the index key
Bitmap	To specify that the index is created as a bitmap rather than using the normal structure based on B*Tree
asc \| desc	To specify the column's order of indexing
Compress \| uncompress	To specify that key compression is enable

Table 2. Oracle Database extension to the creation of an index

2.3 Examples of index creation

In this section, we are going to analyze some type of index and its underlying implementation in Microsoft SQL Server and Oracle Database.

Creation of a simple index

In order to create a simple index in Microsoft SQL Server and Oracle Database, you must use the CREATE INDEX statement (see Listing 5).

create index ndx_myindex on table_name(attr1_name, attr2_name, ...)

Listing 5. Generic form for creating an index using SQL

The creation of a bitmap index

To create a bitmap index in Oracle Database based on the attribute job of the table emp, you must use the CREATE BITMAP INDEX statement (see Listing 6)

create bitmap index bndx_dept_dname on dept(dname)

Listing 6. The creation of a bitmap index on the Oracle Database

Creation of a unique index

The creation of a unique index on the table in Oracle Database and in Microsoft SQL Server is similar to using the CREATE UNIQUE INDEX statement. Let's illustrate it with a generic example (see Listing 7).

create unique index on tbl_table_name on (attr1_name, attr2_name, ...)

Listing 7. The creation of a unique index in the Oracle database.

The creation of composite indexes

You can create an index on one or more attributes of a table as well as set up the order for each attribute. Let's suppose that we have a table named tbl_composite with three attributes attr1, attr2, and attr3. This table has thousands of millions of rows. We are going to query this table to display its columns and filter some rows (see Listing 8).

select attr1, attr2, attr3
from tbl_composite
where attr1=value1 and attr2=value2 and attr3=value3

Listing 8. Expensive query to the table tbl_composite

As you can see this query is very expensive because we are using different filtering criteria. If you want to improve the overall performance, you should create a composite index on the table tbl_composite and on attributes attr1, attr2, and attr3 and specify the order (see Listing 9).

create index on tbl_composite (attr1 asc, attr2 asc, attr3 desc)

Listing 9. The creation of a composite index

There are common situations when we've found a lot of repetitive values in composite values. You can use the compress clause in Oracle Database to eliminate these repetitive values (see Listing 10).

create index on tbl_composite (attr1 asc, attr2 asc, attr3 desc) compress &1;

Listing 10. A composite index with the index key's compression enable

The & 1 clause indicates creating the index using compression on just the leading columns which are the first part of the index key values and thus will have many repetitive values.

Join indexes in Oracle Database

A join index is used to increase the performance of the join of several tables. You can think of a join index as a pre-computed join. An index is normally created on a table but a join index allows indexing a given table using the columns from another table. This index can be organized as a B*Tree or as a Bitmap. Oracle Database only implements bitmap join indexes.

This type of index can be used particularly in a data-warehouse environment where the main schema is star-based thus having a lot of relationships between entities. Although, it's recommended not to use this type of index in OLTP databases because of the overhead associated with its maintenance.

Let's illustrate these concepts with an example. Suppose that we have a data warehouse that contains a star schema with a fact table named sales and a dimension table named product. We want to create a report (see Listing 11) to display the sales amount given a product category.

select sum(sales.amount)
from sales, product

where sales.prod_id = product.prod_id
and product.category = 'dbms';

Listing 11. Report displaying sales amount by product category

Because the sales table might be very large as any other transaction table, the performance of this query might be not good at all. In order to eliminate some key interactions and merge work, we might define a bitmap join index to increase the response time and improve the overall application performance (see Listing 12). You can notice that in the index creation, we can refer to the clause where using the index creation syntax on sales(product.category).

create bitmap index bidx_sales_product
on sales(product.category)

from sales, product
where sales.prod_id = product.prod_id

Listing 12. The Oracle SQL code for the creation of join indexes

Function-based index

This type of index is used to store computed columns value and it's only implemented in Oracle Database.

Let's suppose that we write a SELECT statement for a report in order to display information about a particular employee using the emp table (see Listing 13).

select *
from emp
where upper(ename)='JOHN'

Listing 13. Query displaying information about the employee whose name is JOHN

You can improve this query response time by creating a function-based index on the string function upper (see Listing 14).

Create index f_ndx_dept_loc_upper on emp( upper(ename) )

Listing 14. The creation of function based index

Conclusion

This article has illustrated the main concepts about the two most important storage structures in a database system such as tables and indexes. We have discussed structures used to store rows in tables as well as index mechanisms and several index types used to implement some strategies to access rows in tables in order to improve the overall performance of the application and decrease the overhead of I/O operations.

After you read this article, you are able to apply these concepts and techniques to your own business scenario.