By Leandro Sarmiento - Data Engineer
-
With the advent of new technologies during the last decades, the amount and types of data that organizations have at their disposal to exploit in order to provide better products, services and experience to their users has increased.
Thus, more and more organizations are opting for the use of Big Data and Artificial Intelligence in the cloud; however, it is still necessary to continue working with traditional on premises solutions, either because applications must be supported while they are being migrated to the cloud or because certain applications will not be migrated. This means that we must be prepared to interact with such applications, such as those that use relational database engines.
In this article we will address some concepts and tools specific to SQL Server in order to perform a first analysis and query optimization.
Concepts
Indexes
It is a fundamental concept when talking about query optimization, hereinafter we will see the most important aspects without going into too much detail.
SQL Server organizes the data of a table in logical structures known as Heaps or Balanced Trees. A table that is organized as a Balanced Tree is also known as Clustered Table or Clustered Index. The type of logical organization of the table will influence the number of physical reads (reads to disk) that the database engine will have to perform to return a row or set of rows which will affect query performance. The ultimate goal will be to reduce the number of physical reads.
Heaps:
This is a very simple structure where the data has no logical order.
In each query the complete Heap must be scanned even if you want to return only one row of the table. This operation is called Table Scan.
Clustered Index (Balanced Trees):
A table is organized as a Balanced Tree when we explicitly create an index using the CREATE CLUSTERED INDEX command or implicitly when we create a Primary Key or Unique Constraint on the table.
This structure is similar to an inverted tree with a root at the top level, intermediate levels and leaves at its final level, it is at this level where the table data resides.
As the table data is in the Clustered Index there can only be a single Clustered Index.
The data in the leaf levels are logically ordered according to the Clustering Key, i.e. the column or columns that form the index.
A clustered index only implies the logical organization of the table and not a physical organization.
The Clustering Key can consist of one or more columns. A column with repeated values can be used as a Clustered Index but SQL Server will maintain the uniqueness of the key internally.
To return the rows of a query SQL Server traverses the tree to its leaves which is very efficient, this operation is known as Index Seek.
If SQL Server has to traverse the entire index to return the data of a query it performs an Index Scan.
NonClustered Index:
Allows to sort the table in a different way than the one present in the Clustered Index, this sorting is stored in a separate structure so there can be several NonClustered Index.
This structure is also sorted like a Balanced Tree, except that the leaves do not contain the table data (as in the Clustered Index), but rather pointers to the base table data known as row locator.
They can be created for a Heap or a Balanced Tree, the only difference will be that in the case of a Heap the row locator is a RID (Row Identifier) and in the case of a Balanced Tree the row locator is the Clustering Key.
To return the rows of a query, SQL Server goes through the entire NonClustered Index, gets to the leaves and then uses the RID or the Clustering Key to look for the data in the Heap or the Clustered Index. This operation is known as RID Lookup or Key Lookup depending on whether the data is being searched in a Heap or a Clustered Index. These are efficient operations as long as the query returns a limited number of rows.
You can add columns of data to the NonClustered Index using the INCLUDE command this will allow SQL Server to find the query data in the leaves of the Non Clustered Index, this way the RID Lookup and Key Lookup operations are avoided, this type of queries are known as covered queries.
If SQL Server has to traverse the entire index to return the query data it performs an Index Scan.
To know the indexes that a table has we can consult the system view sys.indexes:
SELECT OBJECT_NAME(object_id) AS table_name, name AS index_name, type, type_desc
FROM sys.indexes
WHERE object_id = OBJECT_ID(N'[TablaAConsultar]’, N'U');
[TableQueryTable] must be replaced by the name of the table whose indexes we want to know.
The SalesLT.Product table has 4 indexes, the Clustered Index created from the primary key and 3 NonClustered Index.
Statistics
Another important aspect of a query is to know the number of read operations needed to get the results, the fewer reads the more efficient the query is. In SQL Server we can know this using the SET STATISTICS IO ON command.
To obtain information about the query execution we first run the SET STATISTICS IO ON command and then the query we want to evaluate:
The above query returns these SET STATISTICS values:
Scan count: Number of Index Scan or Table Scan performed.
Logical Reads: Number of pages read from the cache (a page is a physical unit of 8 KB where the table data is stored, a table can have many pages assigned to it).
Physical reads: Number of pages read from disk.
Note: only part of the information returned by SET STATISTICS IO is shown, that which is relevant for this article.
So far we have covered the first concepts to perform query optimization, in the second part we will see other concepts and examples that apply them.
Comments