[Home] [Papers] [Courses] [Contact]
A Synthesis Method for Database Normalization

Jamal Munshi, Sonoma State Univesity, 1995
All rights reserved

Although the advantage of RDBMS in terms of query flexibility, ad hoc report generation, and dramatically lower application maintenance costs have been long recognized, their adoption in mainstream MIS applications has been retarded by performance problems when compared with traditional `hard wired' hierarchical systems such as IMS. However, a new generation of RDBMS with improved search algorithms and search optimizers has largely overcome these performance problems. Clearly relational databases are now entering the mainstream of MIS even in large production systems aided not only by their inherent data flexibility but also by the increased productivity of end-users and application developers alike.

In the RDBMS metaphor, a database is a collection of tables that together represent a cohesive data environment of the business enterprise. Each table has a unique name within the database and consists of columns and rows. Each column has a unique name within the table and represents an attribute relevant to the user. Each row contains an instance of the table. One or more columns form a key structure whose value appears only once in each table and uniquely identifies a row in the table. Tables may also contain one or more foreign keys which appear as keys in other tables. It is through these foreign keys that row to row linkages are established between related tables. It is a property of the relational model that the conceptual data structure is established by the semantic content of the data rather than the preconceived or so called `hard wired' reporting requirements. Therefore, theoretically, if the set of tables is properly constructed, all queries that are meaningful within the semantic constraints are possible using SQL.

The construction is proper if all non-key attributes are functionally dependent on the key, the whole key, and nothing but the key and if no part of the key is functionally dependent on any other part of the key. This state of the tables is referred to as the `normalized set' and the normalization process consists of procedures to produce this set.

Conventionally, the normalization concept and procedure are explained in stages called `normal forms'. The process begins with one single table that contains all attributes (columns) that are relevant to the database being designed. The single large table is then decomposed stepwise into normal forms. Each step addresses a single normalization issue.

In the first step repeating groups are removed and the resultant tables are said to be in the first normal form. The first normal relations are then decomposed stagewise by addressing each of the normalization criteria. When dependencies on part of the key are removed, the relations are said to be in the `second normal form'. Similarly, when transitive dependencies (dependencies on non-key attributes) are removed by further decomposition the result is the third normal form or the so called Boyce Codd normal form.

Further decomposition to remove multivalued dependencies may be used to produce a set of relations in the fourth normal form. Some designers recognize a Domain Key Normal Form (Fagin 1981) in which every constraint on the relations is the result of only two variables: the key and the domains of the non-key attributes. No generalized method for achieving this state has been proposed.

The decomposition method of normalization is difficult to use and confusing to students and analysts alike and it is based on the unrealistic notion that we begin the design process with the largest possible tables. In the synthesis method proposed in this paper, we reverse the process and begin with the smallest possible tables which are normalized by definition since each of these tables states a single dependency equation.

We call these relations `elemental tables. They are produced by the following procedure: first list all the entity-types and their attributes; select the attributes one at a time and determine the functional dependency and key structure needed to identify an instance of this entity type; pick those attributes from the attribute list and construct the key for that one single non-key (or partial key) attribute; and place the attribute and its key structure into an elemental table. Then go on to the next attribute and keep constructing elemental tables until all attributes have been accounted for.

Although we now have a normalized set of relations, their use is cumbersome since the number of tables that must be referenced by SQL queries will be unnecessarily large. The database will also suffer severe degradation problems. To enhance database performance and simplify queries, we now synthesize larger tables from the elemental tables. In the synthesis process we delete elemental relations that are semantically redundant and combine relations that have exactly the same key structure. The result will be a set of tables that are as normalized as any analysis method is able to achieve but the procedure is immensely simpler. In the resultant tables every non-key attribute is FD on the key, the whole key, and nothing but the key and no attribute of the composite key is FD on any other part of the key.

As an example, consider Chens famous example (Chen, P., ACM TODS, vol. 1, no. 1)

EMPLOYEES (SSN, NAME,AGE,DEPT,BUDGET,PROJECT,PROJNAME,PCTIME)

The functional dependencies are described in the paper. First we produce the elemental tables as follows:

ER1(SSN,NAME) ER2(SSN,AGE) ER3(SSN,DEPT) ER4(SSN,PROJECT) ER5(DEPT,BUDGET) ER6(PROJECT,PROJNAME) ER7(SSN,PROJECT,PCTIME)

ER7 contains all the information in ER4. Therefore ER4 is semantically redundant and is dropped. ER1, ER2, and ER3 contain identical keys. They are collapsed into a single table. The resultant normalized tables are:

NORMAL1(SSN,NAME,AGE,DEPT) NORMAL2(DEPT,BUDGET) NORMAL3(PROJECT,PROJNAME) NORMAL4(SSN,PROJECT,PCTIME)