Tuesday, March 2, 2010

Using GUID (Unique Identifiers) as table main keys

Using What?!
Unique Identifier (which from now on will be referred to as ‘guid’) is an algorithm-based 32-byte hex string (128 bit integer) which is supposed to generate completely unique values.

I’d like to compare the chances of two identical generated guids to the chance earth will be destroyed by a meteor in the next five minutes. Now if that happens – you won’t be here to prove me wrong! J

Anyway, here’s one:  5B5D6F63-1FF0-4574-9B77-BD77D716FD66. It is probably unique and there’s no other like it in the world. (See more here - http://en.wikipedia.org/wiki/Globally_Unique_Identifier)
With today’s increasing demand for fully-distributed systems, GUID usage is increasing accordingly, as it allows managing two disconnected databases which later on needs to merge, knowing that each row on desired shared tables is unique.

Here’s a simple example –
Consider a very large company with many branches across the world.
Each branch has local users, and there’s a sync process which merges the entire users list.
The ‘traditional’ way is to maintain a range of numbers for each branch, use some prefix on the username to maintain uniqueness, or just make a combined key of UserId+Another column contains the branch id.
This is a classic sample for using GUID instead – make the UserId column guid and problem is solved.

Sounds like fun! Any reason why I shouldn’t use it?
Well, yes. Before you get all excited and change your “int’s” and “bigint’s” to guid, consider the following:
-          Guid column takes 16 bytes, which is double than Bigint. This means that index on the column will also take more space
-          Index Fragmentation is an issue and it is most likely to happen.
(See more about it on the Tips section below)

Some Best-Practices Tips:

[Tip #1]
The first, and probably the most important tip is to avoid using CLUSTERED INDEX on guids.
This is because the guid values are random – a physical sort will cause the table to be sorted on every insert.
 Still, in many cases you’d want to avoid having HEAP tables, which means another ID (probably an internal identity) has to be used.
When possible - use the NEWSEQUENTIALID function. This will create unique guids on a table but will make sure the new generated guid is greater than the last. (It is still globally unique as long as the machine has its own MAC address).
Let’s see a quick sample of the differences between the NEWID() function and the NEWSEQUENTIALID default value:
Let’s create two tables; each uses one of the techniques:
      -- Creating two tables:
     
      IF OBJECT_ID('UsersNewId') IS NULL
      BEGIN
            CREATE TABLE UsersNewId
            (
                  UserId UNIQUEIDENTIFIER PRIMARY KEY NONCLUSTERED DEFAULT NEWID(),
                  UserName VarChar(255)
            )
      END
     
      IF OBJECT_ID('UsersNewSeqId') IS NULL
      BEGIN
            CREATE TABLE UsersNewSeqId
            (
                  UserId UNIQUEIDENTIFIER PRIMARY KEY NONCLUSTERED DEFAULT NEWSEQUENTIALID(),
                  UserName VarChar(255)
            )
      END;
     
      -- Now, let's fill the tables with some data
      -- Note: I'm using SQL2008+ syntax here; On SQL2005 separate the insert statements
      INSERT INTO UsersNewId(UserName) VALUES ('User1'),('User2'),('User3'),('User4'),('User5')
      INSERT INTO UsersNewSeqId(UserName) VALUES ('User1'),('User2'),('User3'),('User4'),('User5')
     
      -- Now, let's see the tables
      SELECT * FROM UsersNewId
      SELECT * FROM UsersNewSeqId

The results:


Note the guid on the 2nd table looks the same on first sight, but actually it is not. Those are increasing values that still maintain uniqueness.
This is obviously much better for index fragmentation of any time!

[Tip #2]
 Related to the above tip – as a DBA, try to keep the guid generation on your side!  othewise
If the guid’s are generated by a different layer in your application (DAL or so), the guids cannot be sequential.  
Besides, try to think as if the guid’s are actually a new version of your INT identities; You wouldn’t let anyone control the auto-increase now would you?
This is practically the same.

In some cases you may want to consider surrogate keys, which will be used as the foreign key between the tables. To use the sample at the beginning of this article – each user will have its own GUID, but with an additional “UserInternalId” column (int) which will be the one to connect the main table to all the rest.

[Tip #4]
Don’t use Guid if you don’t have to! When a table needs an ID which does not have to be unique across the enterprise, keep using the regular numeric id.


Final Words:
Guids are great and simple solution to maintain uniqueness across the enterprise. However, don’t overuse it when not required as it might hurt your databases performances.
This is a very wide subject. I was trying to limit myself to a blog-post size; I might write a follow-up article about this subject

2 comments:

  1. I am by far not a data base expert, but it seems to me that using a GUID to maintain the uniqueness of rows is going in the wrong direction. Why not just have a count column, and for each duplicate record, increment the count in the count column. I realize this would require implementation in the database engine, but it would save a lot of space. The up front cost for inserting a row would be higher since one would have to check the entire table for a duplicate record, but there is some overhead in any method, and there would be savings in space, searching and retrieval after the row is written.

    I don't even see the need for keeping knowledge of the server in the table since with a record that was identical on two different servers, if any column of a table on one server had a different meaning that the same column on a different server, it would be a conversion issue, not a uniqueness issue. One exception that I can think of is time, which could be addressed by using Universal Time. The other that could raise issues would be language: if translations were to be required, there would have to be something to address that, but not at the row or even the table level unless in a merge, the rows were to maintain their original language.

    Respectfully,
    Alex Scianna

    ReplyDelete
  2. One more thing I think is worth adding -
    Running numbers (simple identity) often is often a path for security-related errors.
    For example - consider the following line:
    http://www.mysite.com/products/orders/14235
    (Assume Orders.OrderId column behind the code is an identity)
    If a authentication/security model on this website is not strict enough, a user can alter the URL to .../orders/14236 and possibly see the next order in line, even if it's not his own

    In such case, if OrderId was GUID, the problem would have not existed. as it was very hard for one to guess which Guids are actually being used.

    ReplyDelete