Definition of Normalization

Normalization is the word coined by E.F.Codd (who conceived of the relational model) and his colleagues to refer to the organization of the logical structure of a database so as to facilitate both ad-hoc query as well as data update. In a 1993 interview with DBMS magazine, Codd stated that he got the inspiration for this phrase in 1972 when he read about President Nixon "normalizing relations" with China. (Codd used the word "Relation" as a synonym for "Table".) It implies "logical" (proper) design of a database with minimal redundancy of data.

Normalization is achieved through the application of some simple rules that are best illustrated by example. Letís try to capture information about students who are taking courses in a college during a particular semester. Apart from recording information about a student per se (name, contact information), we would want to record the particulars of each course (course description and the student's grade in each course.)

The WRONG way of attempting to capture this information is to attempt to capture everything in a SINGLE table as shown in the diagram below:

Student_Course_Information
Social Sec No
Last Name
First Name
Sex
Year
Dorm
Course ID 1
Course Name 1
Teacher 1
Grade 1
Course ID 2
Course Name 2
Teacher 2
Grade 2
Course ID 3
Course Name 3
Teacher 3
Grade 3
Course ID 4
Course Name 4
Teacher 4
Grade 4
Course ID 5
Course Name 5
Teacher 5
Grade 5
Course ID 6
Course Name 6
Teacher 6
Grade 6

There are two major problems with this table. (Its design reflects what I like to call "spreadsheet-think", because it reflects the way that many people organize their data in spreadsheets. )

Here, the information about teachers, courses and grades is a " repeating group", because a student can take multiple courses, i.e., the course/grade/teacher information repeats.

Unfortunately, a database table is not a spreadsheet.

  1. Unlike a spreadsheet, which stretches (almost) indefinitely to the right, the number of columns in a database table have to be fixed in advance. Therefore, you have to allow for the maximum number of courses any student is likely to take. In case you have a highly talented and energetic student who decides to take seven or even twelve courses, you canít store information about the extra courses (unless you decide to kludge it and create an additional record for the student). If you decide to go all the way and create fields like "Course 15" and "Teacher 15", there will be a lot of unused space per record when entering data for the average student who rarely takes more than four courses in a semester.
  2. The problem of query asymmetry arises. The above table is adequate if you simply want to find out what courses a student is taking. But what if you want to know which students who are taking a particular course? If you know the name of the course (e.g. "CS101"), then you have to search for all records where Course1="CS101" or Course2="CS101"... or Course6="CS101". If you have allowed for something like a maximum of 12 courses, this is far too tedious.

Hence Codd's first rule of Normalization: A table shall contain no repeating groups. The first step for proper table design is to remove  repeating groups, and place them in a separate table.A table without repeating groups is said to be in first normal form. The modified database structure is shown below.

Notice that we replicate the studentís social security number in the Courses table. This is essentially the guts of relational database design; we achieve flexibility through limited redundancy (that is, duplication) of data. Specifically, the only field that is duplicated is the primary key of the table: placed in another table, it becomes a foreign key there.

The primary key of  the Courses table is (SSN + course ID). Given a course, we can search for all the students taking the course, and given the student, we can find out the courses taken.

We have now split the original table into two: one records information that applies to a Student independent of the courses s/he is taking, and the other to record course information.

The second rule of normalization

If a table has a compound primary key, and one or more fields in a table depend on only part of the primary key for that table, move them to a separate table along with that part of the key. To illustrate here:

We already know that the course name depends not on the combination of Student and Course ID, but on Course ID alone. If it so happens that a particular course is always taken by a single instructor (as is generally the case) in this semester, we donít need to repeat the instructorís name along with the course for every single student who happens to be taking that course. In other words, the value of the Course Name and Teacher depends, not on the primary key for the Courses table, but on part of the primary key (the course). Therefore, it makes more sense to move the Course Name and Teacher information into a separate table, along with the course ID to link back to the information about the student's courses and grades.

When we do so, it is worthwhile to consider the use of a naming convention.

  1. The table with Course ID,Course Name and Teacher records information about Courses that are offered irrespective of whether any students are taking a particular course, and the name Courses is best used to describe it.
  2. On the other hand, the table with the SSN, Course_ID and Grade records a student's performance in a given course, and it is best named Students_Courses. Using a combination name like this indicates that this table forms a kind of "bridge" between the Students table and the Courses table. The underscore is used in the name for the following reasons:

The third rule of normalization

If one or more fields in a table do not depend at all on the primary key for that table (or any part of it), move them to a separate table along with copies of the fields on which they do depend.

Suppose that, in a particular college, the (non-co-ed) residence where a student stays is decided solely by which year the student is in, and by the sex of the student. Then, looking at the Student table, where the primary key is the Social Security No, it is clear that we donít need to store Dorm information against each and every student, because Dorm does not depend on Social Security No at all (it depends only on Sex and Year). Therefore, we create an additional table called Residences, and our database now has the following structure.

For the Residences table, the primary key is Sex+Year. Note that that though the term "key" is singular, it can consist of more than one field.

Rules two and three can be synthesized as follows: Identify the primary key field/s in a table. Then ensure that the rest of the fields all depend on all of the primary key fields (rule 2) and on nothing but the primary key fields (rule 3). In their book "Introducing the new SQL" (Morgan Kaufman publishers), Jim Melton and Alan Simon condense these two rules into the phrase "all fields in a table should depend on the key, the whole key and nothing but the key" (to which someone has added "so help me, Codd").

What do we achieve with Normalization?

The purpose of normalization is to eliminate redundancy. The major benefit here, apart from space savings, is ease of update. Suppose a teacher who has been scheduled to teach a particular course this semester is unavailable (or resigns), and someone else is designated to take that teacher's place. With the original design, the new teacher's name would have to replace the old one in numerous places - against every student who happens to have signed up for that course, to be precise. With the normalized design, we need to change only a single record in the Courses table.

Also, in a large database, the same table can be used for multiple purposes. We might also record fee payments or athletic activities for Students, and the other tables that contain such information can also connect to the Students table. If a Student changes her/his last name, we need to make the change in only a single place. 

Our example has been somewhat simplified. In a real-world situation, we might record detailed information on Teachers (such as Salary, Benefits, etc.), and instead of recording the teacher's name in the Courses table, we might instead choose to record the teacher's Employee ID, which is looked up in an Employee table.

User Interface Concerns

The Student_Courses table records SSN and Course ID, and you might expect that to create a new row in this table, you would have to look up these values for a given student and course first. (If you have set referential integrity, you are prevented from entering Course IDs or SSNs that do not exist in the Courses and Students tables.)

Forcing users of a database to remember or look up IDs manually is a terrible idea, and data entry mistakes are almost certainly likely to occur. Microsoft Access has a feature called a "Lookup Wizard" where you can specify that the SSN is looked up (by the computer) in the Students table, and you can choose the Last Name and the First Name to be displayed in addition to, (or even instead of) the SSN. In the user interface, the SSN column then appears as a pull-down list. You can also specify that this list be sorted in order of Last Name and then First Name. Typing ahead with the first few letters of the Last Name will then cause the list to scroll to the first instance of the Last name whose letters match what you have typed, and then you can use the mouse to scroll down further and choose the student you want.  

Similarly, you can specify that instead of the Course ID, the Course Name and Teacher be displayed, and that the list be sorted in order of Course Name.

In high-end databases, some degree of programming is required to provide the same level of functionality that Access provides out of the box. 

Further Normalization

Applying the first three rules of normalization is adequate for the vast majority of databases. However, itís important to recognize the few occasions where you need to go further.

Letís now consider details of a course. (This example is modified from a similar one in "Principles of Database Systems " by C.J.Date, (Addison Wesley)) Letís say that more than one teaching assistant can assist in the teaching of a course. Further, there may be several texts recommended for that course. Letís say that we design a table, consisting of the columns: Course Number, Text Book Description and Teaching Assistant Name. Such a structure obeys all the three rules given above (no repeating fields, both Teaching Assistant and Text Book dependent on the primary key, Course Number) and yet it is obviously nonsensical, because there is no direct association between Text Books and Teaching Assistantsóthe two, while both dependent on Course Number, are clearly independent of each other. There could be five TAs and three books, or vice versa. 

In other words: even if the rule about "the whole key and nothing but the key" is not always insurance against nonsensical tables. For Boyce-Codd normalization, you should design a table such that it does not attempt to combine different, independent relationships. Each independent relationship that exists deserves its own table. Here, we need to split up such a table into two (Course Number, Teaching Assistant) and (Course Number, Text Book Description).

The fourth rule of normalization can be stated formally: If a table appears to combine information that on closer inspection, is composed of independent logical units, (both dependent on the primary key for the table) it should be split (decomposed) into separate tables reflecting this independence, replicating the primary key for the parent table in each of the daughter tables.

Now consider a table that records which Teaching Assistants (TAs) assist which Students in which Courses. We assume that a TA can assist more than one course. To keep track, we can create a table with the fields (TA, Student, Course).

This can introduce a potential problem. The table (TA, Student, Course) is essential, but it is not necessarily enough. For example, if a TA named Jones is assigned to assist two students in Calculus 101, and both these students subsequently drop out of Calculus 101, we can delete the two records holding this information. However, if Jones was assigned to these two students only, we will also lose the information that Jones had once TA'd this course. Therefore, to record the fact that certain TAs assist (or have assisted) in certain courses, whether or not they have students currently assigned to them, we need an extra table (TA, Course).

Therefore, the fifth rule of normalization is: Create extra tables as needed to prevent loss of ancillary information when data is updated, added or deleted, if an analysis of the problem indicates that such loss is likely to occur. The database literature refers to the design solution as "preventing update anomalies".

The Importance of Common Sense and Domain Knowledge

From the above normalization exercise, you have probably realized that it is important to understand the domain for which the system is being designed before designing. The knowledge of the domain brings in constraints that make it better to design the database one way rather than another for a particular set of circumstances.

For example, in the previous exercise, the knowledge of the relationship of Dorm to Year and Sex also helped us to normalize, but this example was a bit contrived. If we wish to record the studentís room number within the dorm, we must certainly keep it in the Student table. If this room number has, say, a 3 letter prefix which records the name of the Dorm as well, (e.g., SIL320), we have not gained very much from splitting off Dorm information into a separate table (though the knowledge of the dependence of Dorm on Sex and Year might be used in a validation routine to prevent inadvertent data entry errors).

In a situation where a significant proportion of students stay off-campus, we would certainly wish to record a full address against each student, even at the cost of repeating dorm information for students living in dorms.

One can use a more complex design which saves some disk space (i.e., maintain the dorms table, and add another table for street addresses of off-campus students, with a Boolean in the students table specifying whether the student is off-campus). This way, to look up where a student is staying, we check whether s/he is off-campus. If so, we consult a table of street addresses which is indexed by student Social Security Number. If the student is not off-campus, we consult the dorms table. However, such a design cannot be navigated except with considerable programming, and for a modest-sized database, the savings in space may not be worth the extra programming overhead.

When to Depart from Normalization

Normalization is fine, but, as for any set of rules, it is necessary to know when to break them.

Overkill

Normalization can sometimes be carried out to unnecessary extremes. The most obvious case cited in the  literature is that of Zip Code. Technically, if you know the zip code for an address, you also know the city and the state, and therefore the latter two fields are theoretically redundant. However, this statement assumes that:

  1. A table of zip codes with matching city and state data is available. (Commercially available zip code tables exist and are extensively used for validation:  if a zip code is entered from hard copy, and the City and State that pop up on the screen are not the same as those on hard-copy, there is most probably a data entry error. However, not all sites possess zip code tables.)
  2. Zip codes occasionally change, and, if you have an updated zip codes table, you also have zip codes in your Individuals table that point to nowhere. In such a case, the name of the City and State, while theoretically redundant, serves as additional insurance in case of change.

Efficiency

Sometimes one chooses to deliberately depart from normalization for efficiency reasons. Relational systems are slowest when it comes to processing queries which involve fields from multiple tables If a particular query is going to be asked often enough, it may make sense to modify the structure of the database so that certain tables may redundantly hold critical fields from tables several links removed, so that a potential multi-table query becomes a one-table or two-table query. This is a classic space-for-time tradeoff; we consume extra disk space to save time.

However, do not underestimate the power of a database engine (or the power of your hardware). Modern engines have been optimized so that even queries involving twenty or more tables are answered remarkably quickly. The golden rule is that you should not try to denormalize unless you have determined that performance of the system is terrible, and you have eliminated other possible causes of the problem, such as insufficient indexing of tables, or insufficient RAM on your machine. (While you can't solve all problems by throwing more hardware at them, you should be aware that insufficient RAM can make a high-end server perform like a low-end PC.) Denormalization, like brain surgery, must only be done by experts.

Non-First-Normal Form Tables (NF-NF, or NF2)

With newer object-relational technology, as exemplified by Oracle version 8 or Informix Universal Server, advanced developers can, when the situation calls for it, represent the "repeating group" form of a table that I so severely criticized at the start of this section. This representation is typically used to represent data such as waveforms (e.g., X-Y pairs in an electrocardiogram). In this case, the hundreds of Y values that represent a wave tracing can be stored as an array, a common data structure in programming languages. If we know the interval between consecutive X values (e.g., 0.01 seconds for an EKG), we can save space by avoiding the storage of the X values.

Note that we cannot get away with doing this in a conventional relational table because there is no implicit ordering of the data in a table. That is, records are not necessarily stored in the order created. ("Holes" left in the table from previously deleted records may be recycled and filled in, so that the order of records is generally arbitrary and cannot be relied upon. Instead, we must physically order the records on a particular value whenever we need to.)

NF2 tables are an advanced programmer's technique, and are chiefly used when individual values are less important than the data as a whole. (Hence their use to represent course data, as in our example, would not be appropriate.)