Referential Integrity
 |
|
| You are reading part 3 from "Physical data storage in SQL Server 2005 and 2008," excerpted from A Developer's Guide to Data Modeling for SQL Server, Covering SQL Server 2005 and 2008, by Eric Johnson and Joshua Jones, copyright 2008, printed with permission from Addison-Wesley Professional. |
|
|
 |
 |
We discussed referential integrity (RI) in Chapter 2. Now we look specifically
at how you implement referential integrity in a physical database.
In general, data integrity is the concept of keeping your data consistent
and helping to ensure that your data is an accurate representation of the
real world and that it is easy to retrieve. There are various kinds of integrity;
referential integrity ensures that the relationships between tables
are adhered to when you insert or update data. For example, suppose you
have two tables: one called Employee and one called Vehicle. You require
that each vehicle be assigned to an employee; this is done via a relationship,
and the rule is maintained with RI. You physically implement this relationship
using primary and foreign keys.
Primary Keys
A primary key constraint in SQL Server works in the same way as a primary
key does in your logical model. A primary key is made up of the column or
columns that uniquely identify the row in any given table.
The first step in creating a PK is to identify the columns on which to
create the key; most of the time this is decided during logical modeling.
What makes a good primary key in SQL Server, and, more importantly,
what makes a poor key? Any column or combination of columns in your
table that can uniquely identify the row are known as candidate keys.
Often there are multiple candidate keys in a table. Our first tip for PK selection
is to avoid string columns. When you join two tables, SQL Server
must compare the data in the primary key to the data in the other table's
foreign key. By their nature, strings take more time and processing power
to compare than do numeric data types.
That leaves us with numeric data. But what kind of numeric should you
use? Integers are always good candidates, so you could use any of the int data types as long as they are large enough to be unique given the table's
potential row count. Also, you can create a composite PK (a PK that uses
more than one column), but we do not recommend using composite PKs
if you can avoid it. The reason? If you have four columns in your PK, then
each table that references this table will require the same four columns.
Not only does it take longer to build a join on four columns, but also you
have a lot of duplicate data storage that would otherwise be avoided.
To recap, here are the rules you should follow when choosing a PK
from your candidate keys:
- Avoid using string columns.
- Use integer data when possible.
- Avoid composite primary keys.
Given these rules, let's look at a table and decide which columns to use
as our PK. Figure 3.1 shows a table called Products. This table has a couple
of candidate keys, the first being the model number. However, model
numbers are unique only to a specific manufacturer. So the best option
here would be a composite key containing both Model Number and
Manufacturer. The other candidate key in this table is the SKU. An
SKU (stock-keeping unit) number is usually an internal number that can
uniquely identify any product a company buys and sells regardless of
manufacturer.

Figure 3.1 A Products table in need of a primary key
Let's look at each of the candidates and see whether it violates a rule.
The first candidate (Model Number and Manufacturer) violates all the
rules; the data is a string, and it would be a composite key. So that leaves
us with SKU, which is perfect; it identifies the row, it's an integer, and it is
a single column.
Now that we have identified our PK, how do we go about configuring
it in SQL Server? There are several ways to make PKs, and the method you
use depends on the state of the table. First, let's see how to do it at the
same time you create the table. Here is the script to create the table, complete
with the PK.
CREATE TABLE Products(
sku |
int |
NOT NULL |
PRIMARY KEY, |
modelnumber |
varchar(25) |
NOT NULL, |
|
name |
varchar(100) |
NOT NULL, |
|
manufacturer |
varchar(25) |
NOT NULL, |
|
description |
varchar(255) |
NOT NULL, |
|
warrantydetails |
varchar(500) |
NOT NULL, |
|
price |
money(10,0) |
NOT NULL, |
|
weight |
decimal(5,2) |
NOT NULL, |
|
shippingweight |
decimal(5,2) |
NOT NULL, |
|
height |
decimal(4,2) |
NOT NULL, |
|
width |
decimal(4,2) |
NOT NULL, |
|
depth |
decimal(4,2) |
NOT NULL, |
|
isserialized |
bit |
NOT NULL, |
|
status |
tinyint |
NOT NULL, |
|
)
You will notice the PRIMARY KEY statement following the definition of the sku column. That statement adds a PK to the table on the sku column,
something that is simple and quick.
However, this method has one inherent problem. When SQL Server
creates a PK in the database, every PK has a name associated with it. Using
this method, we don't specify a name, so SQL Server makes one up. In this
case it was PK_Products_30242045. The name is based on the table name
and some random numbers. On the surface, this doesn't seem to be a big
problem, but what if you later need to delete the PK from this table? If you
have proper change control in your environment, then you will create a
script to drop the key and you will drop the key from a quality assurance
server first. Once tests confirm that nothing else will break when this key is dropped, you go ahead and run the script in production. The problem is
that if you create the table using the script shown here, the PK will have a
different name on each server and your script will fail.
How do you name the key when you create it? What you name your
keys is mostly up to you, but we provide some naming guidelines in
Chapter 7. In this case we use pk_product_sku as the name of our PK. As
a best practice, we suggest that you always explicitly name all your primary
keys in this manner. In the following script we removed the PRIMARY KEY
statement from the sku column definition and added a CONSTRAINT statement
at the end of the table definition.
CREATE TABLE Products(
sku |
int |
NOT NULL |
PRIMARY KEY, |
modelnumber |
varchar(25) |
NOT NULL, |
|
name |
varchar(100) |
NOT NULL, |
|
manufacturer |
varchar(25) |
NOT NULL, |
|
description |
varchar(255) |
NOT NULL, |
|
warrantydetails |
varchar(500) |
NOT NULL, |
|
price |
money(10,0) |
NOT NULL, |
|
weight |
decimal(5,2) |
NOT NULL, |
|
shippingweight |
decimal(5,2) |
NOT NULL, |
|
height |
decimal(4,2) |
NOT NULL, |
|
width |
decimal(4,2) |
NOT NULL, |
|
depth |
decimal(4,2) |
NOT NULL, |
|
isserialized |
bit |
NOT NULL, |
|
status |
tinyint |
NOT NULL, |
|
CONSTRAINT pk_product_sku PRIMARY KEY (sku)
)
Last, but certainly not least, what if the table already exists and you
want to add a primary key? First, you must make sure that any data already
in the column conforms to the rules of a primary key. It cannot contain
NULLs, and each row must be unique. After that, another simple script
will do the trick.
ALTER TABLE Products
ADD CONSTRAINT pk_product_sku PRIMARY KEY (sku)
But wait—there's more. Using the sku column as we have done here is
fine, but there are other PK options we need to discuss. If you were to go
through your entire database and define PKs as we have done on the
Products table, you would likely end up with a different column name in each table that holds the primary key. This is not necessarily a bad thing,
but it means that you must look up the data type and column name whenever
you want to add another column with a foreign key or you need to
write a piece of code to join tables.
Wouldn't it be nice if all your tables had their PKs in columns having
the same name? For example, every table in your database could be given
a column named objectid and that column could simply have an arbitrary
unique integer. In this case, you can use an identity column in SQL Server
to manage your integer PK value. An identity column is one that automatically
increments a number with each insert into the table. When you
make a column an identity, you specify a seed, or starting value, and an increment, which is the number to add each time a new record is added.
Most commonly, the seed and increment are both set to 1, meaning that
each new row will be given an identity value that is 1 higher than the preceding
row.
Another option for an arbitrary PK is a GUID. GUIDs are most often
used as PKs when you need to copy data between databases and you need
to be sure that data copied from another database does not conflict with
existing data. If you were instead to use identities, you would have to play
with the seed values to avoid conflicts; for example, the number 1,000,454
could easily have been used in two databases, creating a conflict when the
data is copied. The disadvantages of GUIDs are that they are larger than
integers and they are not easily readable for humans. Also, PKs are often
clustered, meaning that they are stored in order. Because GUIDs are random,
each time you add data it ends up getting inserted into the middle of
the PK, and this adds overhead to the operation. In Chapter 10 we talk
more about clustered versus nonclustered PKs.
Of all the PK options we have discussed, we most often use identity
columns. They are easy to set up and they provide consistency across tables.
No matter what method you use, carefully consider the pros and cons.
Implementing a PK in the wrong way not only will make it difficult to write
code against your database but also could lead to degraded performance.
Foreign Keys
As with primary keys, foreign keys in SQL Server work in the same way as
they do in logical design. A foreign key is the column or columns that correspond
to a primary key and establish a relationship. Exactly the same
columns with the same data as the primary key exist in the foreign key. It is for this reason that we strongly advise against using composite primary
keys; not only does it mean a lot of data duplication, but also it adds overhead
when you join tables. Going back to our employee and vehicle example,
take a look at Figure 3.2, which shows the tables with some sample
data.

FIGURE 3.2 Data from the employee and vehicle tables showing the relationship between the tables
As you can see, both tables have objid columns. These are identity
columns and serve as our primary key. Additionally, notice that the vehicle
table has an employee_objid column. This column holds the objid of the
employee to whom the car is assigned. In SQL Server, the foreign key is
set up on the vehicle table, and its job is to ensure that the value you enter
in the employee_objid column is in fact a valid value that has a corresponding
record in the employee table.
The following script creates the vehicle table. You will notice a few
things that are different from the earlier table creation script. First, when
we set up the objid column, we use the IDENTITY(1,1)statement to create an identity, with a seed and increment of 1 on the column. Second, we
have a second CONSTRAINT statement to add the foreign key relationship. When creating a foreign key, you specify the column or columns in the referencing table that contain the foreign key as well as the referenced table
and columns that contain the primary key.
CREATE TABLE dbo.vehicle(
objid int IDENTITY(1,1) NOT NULL,
make varchar(50) NOT NULL,
model varchar(50)NOT NULL,
year char(4) NOT NULL,
employee_objid int NOT NULL,
CONSTRAINT PK_vehicle PRIMARY KEY (objid),
CONSTRAINT FK_vehicle_employee
FOREIGN KEY(employee_objid)
REFERENCES employee (objid)
)
Once your primary keys are in place, the creation of the foreign keys is
academic. You simply create the appropriate columns on the referencing
table and add the foreign key. As stated in Chapter 2, if your design requires
it, the same column in a table can be in both the primary key and a
foreign key.
When you create foreign keys, you can also specify what to do if an update
or delete is issued on the parent table. By default, if you attempt to
delete a record in the parent table, the delete will fail because it would result
in orphaned rows in the referencing table. An orphaned row is a row
that exists in a child table that has no corresponding parent. This can cause
problems in some data models. In our employee and vehicle tables, a
NULL in the vehicle table means that the vehicle has not been assigned to
an employee. However, consider a table that stores orders and order details;
in this case, an orphaned record in the order detail table would be
useless. You would have no idea which order the detail line belonged to.
Instead of allowing a delete to fail, you have options. First, you can
have the delete operation cascade, meaning that SQL Server will delete
all the child rows along with the parent row you are deleting. Be very careful
when using this option. If you have several levels of relationships with
cascading delete enabled, you could wipe out a large chunk of data by issuing
a delete on a single record.
Your second option is to have SQL Server set the foreign key column
to NULL in the referencing table. This option creates orphaned records,
as discussed. Third, you can have SQL Server set the foreign key column
back to the default value of the column, if it has one. Similar options are
also available if you try to update the primary key value itself. Again, SQL
Server can either (1) cascade the update so that the child rows still point to
the correct parent rows with the new key, (2) set the foreign key to NULL,
or (3) set the foreign key back to its default value.
Changing the values of primary keys isn't something we recommend
you do often, but in some situations you may find yourself needing to do
just that. If you find yourself in that situation often, you might consider setting
up an update rule on your foreign keys.
Constraints
SQL Server contains several types of constraints to enforce data integrity.
Constraints, as the name implies, are used to constrain the values that can
be entered into columns. We have talked about two of the constraints in
SQL Server: primary keys and foreign keys. Primary keys constrain the
data so that duplicates and NULLs cannot exist in the columns, and foreign
keys ensure that the entered value exists in the referenced table.
There are several other constraints you can implement to ensure data integrity
or enforce business rules.
Unique Constraints
Unique constraints are similar to primary keys; they ensure that no duplicates exist in a column or collection of columns. They are configured on
columns that do not participate in the primary key. How does a unique constraint
differ from a primary key? From a technical standpoint, the only difference
is that a unique constraint allows you to enter NULL values;
however, because the values must be unique, you can enter only one NULL
value for the entire column. When we talked about identifying primary
keys, we talked about candidate keys. Because candidate keys should also
be able to uniquely identify the row, you should probably place unique constraints
on your candidate keys. You add a unique constraint in much the
same way as you add a foreign key, using a constraint statement such as
CONSTRAINT UNQ_vehicle_vin UNIQUE NONCLUSTERED (vin_number)
Check Constraints
Check constraints limit the values that can be entered into a column by
using a logical expression. A logical expression is any SQL expression
that can evaluate to TRUE or FALSE. The expression can be any valid
SQL expression, from simple comparisons to something more complex
such as calling a function. For example, say we want to limit the values that
can be entered for salary in our employee table. The expression we would
use to evaluate the data would be something like this:
salary >= 10000 and salary <=150000
This line rejects any value less than 10,000 or greater than 150,000.
Each column can have multiple check constraints, or you can reference
multiple columns with a single check. When it comes to NULL values,
check constraints can be overridden. When a check constraint does its
evaluation, it allows any value that does not evaluate to false. This means
that if your check evaluates to NULL, the value will be accepted. Thus, if
you enter NULL into the salary column, the check constraint returns unknown
and the value is inserted. This feature is by design, but it can lead
to unexpected results, so we want you to be aware of this.
Check constraints are created in much the same way as keys or unique
constraints; the only caveat is that they tend to contain a bit more meat.
That is, the expression used to evaluate the check can be lengthy and
therefore hard to read when viewed in T-SQL. We recommend you create
your tables first and then issue ALTER statements to add your check constraints. The following sample code adds a constraint to the Products table
to ensure that certain columns do not contain negative values.
ALTER TABLE dbo.Products
ADD CONSTRAINT chk_non_negative_values
CHECK
(
weight >= 0
AND (shippingweight >= 0 AND shippingweight >= weight)
AND height >= 0
AND width >= 0
AND depth >= 0
)
Because it doesn't make sense for any of these columns to contain negative
numbers (items cannot have negative weights or heights), we add this
constraint to ensure data integrity. Now when you attempt to insert data
with negative numbers, SQL Server simply returns the following error and
the insert is denied. This constraint also prevents a shipping weight from
being less than the product's actual weight.
The INSERT statement conflicted with the CHECK constraint
"chk_non_negative_values"
As you can see, we created one constraint that looks at all the columns
that must contain non-negative values. The only downfall to this method is that it can be hard to find the data that violated the constraint. In this case,
it's pretty easy to spot a negative number, but imagine if the constraint were
more complex and contained more columns. You would know only that
some column in the constraint was in violation, and you would have to go
over your data to find the problem. On the other hand, we could have created
a constraint for each column, making it easier to track down problems.
Which method you use depends on complexity and personal preference.
Implementing Referential Integrity
Now that we have covered PKs, FKs, and constraints, the final thing we
need to discuss is how to use them to implement referential integrity.
Luckily it's straightforward once you understand how to create each of the
objects we've discussed.
One-to-Many Relationships
One-to-many relationships are the most common kind of relationship you
will use in a database, and they are also what you get with very little additional
work when you create a foreign key on a table. To make the relationship
required, you must make sure that the column that contains your
foreign key is set to not allow NULLs. Not allowing NULLs requires that
a value be entered in the column, and adding the foreign key requires that
the value be in the related table's primary key. This type of relationship implements
a cardinality of "one or more to one." In other words, you can
have a single row but you are not limited to the total number of rows you
can have. (Later in this chapter we look at ways to implement advanced
cardinality.) Allowing NULL in the foreign key column makes the relationship
optional—that is, the data is not required to be related to the
reference table. If you were tracking computers in a table and using
a relationship to define which person was using the computer, a NULL
in your foreign key would denote a computer that is not in use by an
employee.
One-to-One Relationships
One-to-one relationships are implemented in exactly the same way as oneto-
many relationships—sort of. You still create a primary key and a foreign
key; the problem is that at this point SQL Server still allows users to insert
many rows into the foreign key table that reference the primary key table. There is no way, by default, to constrain the data to one-to-one. To implement
a one-to-one relationship that is enforced, you must get a little
creative.
The first option is to write a stored procedure (more on stored procedures
later in this chapter) to do all your inserting, and then add logic to
prevent a second row from being added to the table. This method works in
most cases, but what if you need to load data directly to tables without a
stored procedure? Another option to implement one-to-one relationships
is to use a trigger, which we also look at shortly. Basically, a trigger is a
piece of code that can be executed after or instead of the actual insert
statement. Using this method, you could roll back any insert that would violate
the one-to-one relationship.
Additionally—and this is probably the easiest method—you can add a
unique constraint on the foreign key columns. This would mean that the
data in the foreign key would have to be a value from the primary key, and
each value could appear only once in the referencing table. This approach
effectively creates a one-to-one relationship that is managed and enforced
by SQL Server.
Many-to-Many Relationships
One of the most complex relationships when it comes to implementation
is the many-to-many relationship. Even though you can have a many-tomany
relationship between two entities, you cannot create a many-to-many
relationship between only two tables. To implement this relationship, you
must create a third table, called a junction table, and two one-to-many
relationships.
Let's walk through an example to see how it works. You have two tables—
one called Student and one called Class—and both contain an identity
called objid as their PK. In this situation you need a many-to-many
relationship, because each student can be in more than one class and each
class will have more than one student. To implement the relationship, you
create a junction table that has only two columns: one containing the
student_objid, and the other containing the class_objid. You then create a
one-to-many relationship from this junction table to the Student table, and
another to the Class table. Figure 3.3 shows how this relationship looks.

FIGURE 3.3 Many-to-many relationship between a Student and a Class table
You will notice a few things about this configuration. First, in addition
to being foreign keys, these columns are used together as the primary key
for the Student_Class junction table. How does this implement a many-tomany
relationship? The junction table can contain rows as long as they do not violate the primary key. This means that you can relate each student to
all the classes he attends, and you can relate all the students in a particular
class to that class. This gives you a many-to-many relationship.
It may sound complex, but once you create a many-to-many relationship
and add some data to the tables, it becomes pretty clear. The best way
to really understand it is to do it. When we build our physical model in
Chapter 9, we look more closely at many-to-many relationships, including
ways to make them most useful.
Implementing Advanced Cardinality
In Chapter 2, we talk about cardinality. Cardinality simply describes the
number of rows in a table that can relate to rows in another table.
Cardinality is often derived from your customer's business rules. As with
one-to-one relationships, SQL Server does not have a native method to
support advanced cardinality. Using primary and foreign keys, you can easily
enforce one-or-more-to-many, zero-or-more-to-many, or one-to-one
cardinality as we have described previously.
What if you want to create a relationship whereby each parent can contain
only a limited number of child records? For example, using our employee
and vehicle tables, you might want to limit your data so that each
employee can have no more than five cars assigned. Additionally, employees
are not required to have a car at all. The cardinality of this relationship
is said to be zero-to-five-to-many. To enforce this requirement, you need
to be creative. In this scenario you could use a trigger that counts the number
of cars assigned to an employee. If the additional car would put the
employee over five, the insert could be reversed or rolled back.
Each situation is unique. In some cases you might be able to use check
constraints or another combination of PKs, FKs, and constraints to implement
your cardinality. You need to examine your requirements closely to
decide on the best approach.
Check out the authors' website where they co-host a podcast show for IT professionals.