Tuesday, March 2, 2010

Using GUID (Unique Identifiers) as table main keys

Using What?!
Unique Identifier (which from now on will be referred to as ‘guid’) is an algorithm-based 32-byte hex string (128 bit integer) which is supposed to generate completely unique values.

I’d like to compare the chances of two identical generated guids to the chance earth will be destroyed by a meteor in the next five minutes. Now if that happens – you won’t be here to prove me wrong! J

Anyway, here’s one:  5B5D6F63-1FF0-4574-9B77-BD77D716FD66. It is probably unique and there’s no other like it in the world. (See more here - http://en.wikipedia.org/wiki/Globally_Unique_Identifier)
With today’s increasing demand for fully-distributed systems, GUID usage is increasing accordingly, as it allows managing two disconnected databases which later on needs to merge, knowing that each row on desired shared tables is unique.

Here’s a simple example –
Consider a very large company with many branches across the world.
Each branch has local users, and there’s a sync process which merges the entire users list.
The ‘traditional’ way is to maintain a range of numbers for each branch, use some prefix on the username to maintain uniqueness, or just make a combined key of UserId+Another column contains the branch id.
This is a classic sample for using GUID instead – make the UserId column guid and problem is solved.

Sounds like fun! Any reason why I shouldn’t use it?
Well, yes. Before you get all excited and change your “int’s” and “bigint’s” to guid, consider the following:
-          Guid column takes 16 bytes, which is double than Bigint. This means that index on the column will also take more space
-          Index Fragmentation is an issue and it is most likely to happen.
(See more about it on the Tips section below)

Some Best-Practices Tips:

[Tip #1]
The first, and probably the most important tip is to avoid using CLUSTERED INDEX on guids.
This is because the guid values are random – a physical sort will cause the table to be sorted on every insert.
 Still, in many cases you’d want to avoid having HEAP tables, which means another ID (probably an internal identity) has to be used.
When possible - use the NEWSEQUENTIALID function. This will create unique guids on a table but will make sure the new generated guid is greater than the last. (It is still globally unique as long as the machine has its own MAC address).
Let’s see a quick sample of the differences between the NEWID() function and the NEWSEQUENTIALID default value:
Let’s create two tables; each uses one of the techniques:
      -- Creating two tables:
     
      IF OBJECT_ID('UsersNewId') IS NULL
      BEGIN
            CREATE TABLE UsersNewId
            (
                  UserId UNIQUEIDENTIFIER PRIMARY KEY NONCLUSTERED DEFAULT NEWID(),
                  UserName VarChar(255)
            )
      END
     
      IF OBJECT_ID('UsersNewSeqId') IS NULL
      BEGIN
            CREATE TABLE UsersNewSeqId
            (
                  UserId UNIQUEIDENTIFIER PRIMARY KEY NONCLUSTERED DEFAULT NEWSEQUENTIALID(),
                  UserName VarChar(255)
            )
      END;
     
      -- Now, let's fill the tables with some data
      -- Note: I'm using SQL2008+ syntax here; On SQL2005 separate the insert statements
      INSERT INTO UsersNewId(UserName) VALUES ('User1'),('User2'),('User3'),('User4'),('User5')
      INSERT INTO UsersNewSeqId(UserName) VALUES ('User1'),('User2'),('User3'),('User4'),('User5')
     
      -- Now, let's see the tables
      SELECT * FROM UsersNewId
      SELECT * FROM UsersNewSeqId

The results:


Note the guid on the 2nd table looks the same on first sight, but actually it is not. Those are increasing values that still maintain uniqueness.
This is obviously much better for index fragmentation of any time!

[Tip #2]
 Related to the above tip – as a DBA, try to keep the guid generation on your side!  othewise
If the guid’s are generated by a different layer in your application (DAL or so), the guids cannot be sequential.  
Besides, try to think as if the guid’s are actually a new version of your INT identities; You wouldn’t let anyone control the auto-increase now would you?
This is practically the same.

In some cases you may want to consider surrogate keys, which will be used as the foreign key between the tables. To use the sample at the beginning of this article – each user will have its own GUID, but with an additional “UserInternalId” column (int) which will be the one to connect the main table to all the rest.

[Tip #4]
Don’t use Guid if you don’t have to! When a table needs an ID which does not have to be unique across the enterprise, keep using the regular numeric id.


Final Words:
Guids are great and simple solution to maintain uniqueness across the enterprise. However, don’t overuse it when not required as it might hurt your databases performances.
This is a very wide subject. I was trying to limit myself to a blog-post size; I might write a follow-up article about this subject