Toan Hoang

SQL103 / Database Modelling

In part one of this SQL series, we introduced the concepts of the Relational Database Management System (RDBMS) and installed Microsoft SQL Server Express, SQL Server Management Studio, and the AdventureWorks database. In part two, we introduced the SELECT statement and the various clauses for returning data from the RDBMS: SELECT, FROM, WHERE, GROUP BY, HAVING and ORDER BY. In part three, we are going to take a step back and look at the theory of data modelling; understanding data modelling is essential and far too often overlooked.

Data Modelling

Data Modelling in software engineering is the process of creating a data model for an information system by applying certain formal techniques; a Data Model is an abstract model that organizes elements of data and standardizes how they relate to one another and to the properties of real-world entities.

The main aim of data models is to support the development of information systems by providing the definition and format of data. According to West and Fowler (1999), “if this is done consistently across systems then compatibility of data can be achieved. If the same data structures are used to store and access data then different applications can share data.”

Data models are commonly represented as Entity-Relationship Models (ERMs), or Entity-Relationship Diagrams (ERDs).

Entity-Relationship Models (ERM)

An Entity-Relationship Model is the result of systematic analysis to define and describe what is important to the processes in an area of a business. It does not define the business processes; it only presents a business data schema in graphical form. It is usually drawn as boxes (entities) connected by lines (relationships) which express the associations and dependencies between entities. Entities may be characterized not only by relationships but also by additional properties (attributes).

For example:
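A minimal text sketch of an ER fragment (the Customer and Order entities here are hypothetical and not taken from the AdventureWorks database) might look like this:

Customer (CustomerID, Name, Email) --- places ---> Order (OrderID, OrderDate, TotalAmount)

Customer and Order are the entities, “places” is the relationship between them, and the items in parentheses are their attributes.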

Peter Chen, the father of ER modelling, said in his seminal paper:

“The entity-relationship model adopts the more natural view that the real world consists of entities and relationships. It incorporates some of the important semantic information about the real world.”

Types of Entity-Relationship Diagrams

There are three types of data models, as recognised by ANSI in 1975: the Conceptual Data Model, the Logical Data Model and the Physical Data Model.

When joining a new project, the Conceptual Data Model will give you an overview of how data is represented, interconnected and constrained. Once you are given a business area, you may want to view the Logical Data Model, where you can see which entities and attributes exist and how they relate to one another. Finally, when you are looking to write SQL SELECT statements, you will want to look at the Physical Data Model, which reflects the actual table designs you will be querying.

Conceptual Data Model (CDM)

A conceptual model’s primary objective is to convey the fundamental principles and basic functionality of the system which it represents. Also, a conceptual model must be developed in such a way as to provide an easily understood system interpretation for the model’s users. A conceptual model, when implemented properly, should satisfy four fundamental objectives.

The conceptual model plays an important role in the overall system development life cycle.

Logical Data Model (LDM)

Logical data models represent the abstract structure of a domain of information. They are often diagrammatic in nature and are most typically used in business processes that seek to capture things of importance to an organization and how they relate to one another. Once validated and approved, the logical data model can become the basis of a physical data model and form the design of a database.

Logical data models should be based on the structures identified in a preceding conceptual data model since this describes the semantics of the information context, which the logical model should also reflect. Even so, since the logical data model anticipates implementation on a specific computing system, the content of the logical data model is adjusted to achieve certain efficiencies.

The term ‘Logical Data Model’ is sometimes used as a synonym of ‘domain model’ or as an alternative to the domain model. While the two concepts are closely related and have overlapping goals, a domain model is more focused on capturing the concepts in the problem domain rather than the structure of the data associated with that domain.

The main reasons for building a Logical Data Model (LDM) are to establish a common understanding of the business data elements and their relationships, and to provide the foundation from which the physical database design is derived.

Physical Data Model (PDM)

A physical data model (or database design) is a representation of a data design as implemented, or intended to be implemented, in a database management system. In the lifecycle of a project, it typically derives from a logical data model, though it may be reverse-engineered from a given database implementation. A complete physical data model will include all the database artefacts required to create relationships between tables or to achieve performance goals, such as indexes, constraint definitions, linking tables, partitioned tables or clusters.

Analysts can usually use a physical data model to calculate storage estimates; it may include specific storage allocation details for a given database system.
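As a rough sketch of what a physical data model ultimately becomes (the Department and Employee tables below are hypothetical, not part of AdventureWorks), the physical level is where data types, keys, constraints and indexes are spelled out as DDL:

CREATE TABLE Department (
    DepartmentID   INT           NOT NULL PRIMARY KEY,
    DepartmentName NVARCHAR(100) NOT NULL
);

CREATE TABLE Employee (
    EmployeeID   INT           NOT NULL PRIMARY KEY,
    EmployeeName NVARCHAR(100) NOT NULL,
    DepartmentID INT           NOT NULL
        REFERENCES Department (DepartmentID)  -- foreign key back to Department
);

-- An index is a purely physical artefact, added only for performance.
CREATE INDEX IX_Employee_DepartmentID ON Employee (DepartmentID);

None of these artefacts would normally appear in the conceptual or logical models; they only exist at the physical level.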

Once you start looking at Logical and Physical Data Models, the topic of Normalisation starts to appear, but what is this?

Database Normalisation

Database normalization is the process of restructuring a relational database in accordance with a series of so-called normal forms in order to reduce data redundancy and improve data integrity. It was first proposed by Edgar F. Codd as an integral part of his relational model.

Normalization entails organizing the columns (attributes) and tables (relations) of a database to ensure that their dependencies are properly enforced by database integrity constraints. It is accomplished by applying some formal rules either by a process of synthesis (creating a new database design) or decomposition (improving an existing database design).

A lot of Entity-Relationship Diagrams (ERDs) will be in various states of normalisation, so you will need to understand the basics of this.

Unnormalized Form (UNF), also known as an unnormalized relation or non-first normal form (NF2), is a simple database data model (an organization of data in a database) lacking the efficiency of database normalization. An unnormalized data model will suffer the pitfalls of data redundancy, where multiple values and/or complex data structures may be stored within a single field or attribute, or where fields may be replicated within a single table (a way of subverting the first normal form rule of one value per field or attribute).

When an attempt is made to modify (update, insert into, or delete from) a relation, the following undesirable side-effects may arise in relations that have not been sufficiently normalized: update anomalies, where the same piece of information must be changed in many rows; insertion anomalies, where a fact cannot be recorded until some other, unrelated fact exists; and deletion anomalies, where removing one fact unintentionally removes another.
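As a hypothetical sketch (not taken from AdventureWorks), an unnormalized order table might cram a list of product codes into a single column and repeat customer details on every row:

CREATE TABLE OrderUNF (
    OrderID       INT           NOT NULL,
    CustomerName  NVARCHAR(100) NOT NULL,
    CustomerPhone NVARCHAR(20)  NULL,
    ProductCodes  NVARCHAR(400) NULL  -- e.g. 'P1, P2, P3' stored in a single field
);

Here, updating a customer's phone number means touching every one of their order rows (an update anomaly), and deleting a customer's last order also deletes their phone number (a deletion anomaly). The normal forms below remove these problems step by step.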

First Normal Form (1NF) is a property of a relation in a relational database. A relation is in first normal form if and only if the domain of each attribute contains only atomic (indivisible) values, and the value of each attribute contains only a single value from that domain. The first definition of the term, in a 1971 conference paper by Edgar Codd, defined a relation to be in first normal form when none of its domains has any sets as elements.

First Normal Form, 1NF, is an essential property of a relation in a relational database. Database normalization is the process of representing a database in terms of relations in standard normal forms, where first normal form is a minimal requirement.

First normal form enforces these criteria: every attribute contains only atomic (indivisible) values, every attribute holds a single value from its domain, and there are no repeating groups of columns within a table.

Full Techopedia tutorial on 1NF: https://www.techopedia.com/definition/25955/first-normal-form-1nf
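Continuing the hypothetical order example from above, putting it into 1NF means storing one atomic product code per row instead of a comma-separated list:

CREATE TABLE OrderLine1NF (
    OrderID       INT           NOT NULL,
    CustomerName  NVARCHAR(100) NOT NULL,
    CustomerPhone NVARCHAR(20)  NULL,
    ProductCode   NVARCHAR(10)  NOT NULL,
    ProductName   NVARCHAR(100) NOT NULL,
    ProductPrice  DECIMAL(10,2) NOT NULL,
    PRIMARY KEY (OrderID, ProductCode)  -- composite key: one row per order/product pair
);

Every column now holds a single value, but customer and product details are still repeated across rows; the higher normal forms address that.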

Second Normal Form (2NF) is a normal form used in database normalization. 2NF was originally defined by E.F. Codd in 1971.

A relation that is in first normal form (1NF) must meet additional criteria if it is to qualify for second normal form. Specifically, a relation is in 2NF if it is in 1NF and no non-prime attribute (an attribute that is not part of any candidate key) depends on only a part of a candidate key; in other words, there are no partial dependencies on a composite key.

Full 2NF tutorial on Techopedia: https://www.techopedia.com/definition/21980/second-normal-form-2nf
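In the hypothetical 1NF table above, ProductName and ProductPrice depend only on ProductCode, and the customer columns depend only on OrderID, i.e. on parts of the composite key. A 2NF sketch moves each group into its own table:

CREATE TABLE Product (
    ProductCode  NVARCHAR(10)  NOT NULL PRIMARY KEY,
    ProductName  NVARCHAR(100) NOT NULL,
    ProductPrice DECIMAL(10,2) NOT NULL
);

CREATE TABLE OrderHeader (
    OrderID       INT           NOT NULL PRIMARY KEY,
    CustomerName  NVARCHAR(100) NOT NULL,
    CustomerPhone NVARCHAR(20)  NULL
);

CREATE TABLE OrderLine (
    OrderID     INT          NOT NULL REFERENCES OrderHeader (OrderID),
    ProductCode NVARCHAR(10) NOT NULL REFERENCES Product (ProductCode),
    PRIMARY KEY (OrderID, ProductCode)  -- each order/product pair appears once
);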

Third Normal Form (3NF) is a normal form that is used in normalizing a database design to reduce the duplication of data and ensure referential integrity by ensuring that the relation is in second normal form (2NF) and that every non-key attribute depends directly on the key rather than on another non-key attribute; in other words, there are no transitive dependencies.

Full Techopedia tutorial on 3NF: https://www.techopedia.com/definition/22561/third-normal-form-3nf

Most 3NF relations are free of insertion, update, and deletion anomalies.
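To take the same hypothetical design to 3NF, suppose each order belongs to a customer identified by CustomerID; CustomerName and CustomerPhone then depend on OrderID only transitively, through CustomerID, so the customer details move to their own table:

CREATE TABLE Customer (
    CustomerID    INT           NOT NULL PRIMARY KEY,
    CustomerName  NVARCHAR(100) NOT NULL,
    CustomerPhone NVARCHAR(20)  NULL
);

CREATE TABLE OrderHeader3NF (
    OrderID    INT NOT NULL PRIMARY KEY,
    CustomerID INT NOT NULL REFERENCES Customer (CustomerID)  -- customer details now live on Customer
);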

Additional Normal Forms

It does not stop with third normal form; there are further normal forms that you can read up about, such as Boyce-Codd Normal Form (BCNF), Fourth Normal Form (4NF), Fifth Normal Form (5NF) and Domain-Key Normal Form (DKNF).

Note: you can keep normalising until you are blue in the face; normalization does have its purposes and is useful where data is frequently modified. However, when it comes to reporting, how you structure your data depends on the physical storage layer, and sometimes a denormalised data structure is preferred; this all depends on how the physical database system works and what the optimal structure is for your workload.
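As a hypothetical illustration of the reporting point, a denormalised reporting table might deliberately repeat customer and product attributes on every row so that a report can be produced without joins:

CREATE TABLE SalesReport (
    OrderID      INT           NOT NULL,
    OrderDate    DATE          NOT NULL,
    CustomerName NVARCHAR(100) NOT NULL,  -- repeated on every row, on purpose
    ProductName  NVARCHAR(100) NOT NULL,
    ProductPrice DECIMAL(10,2) NOT NULL,
    Quantity     INT           NOT NULL
);

Whether this trade-off is worthwhile depends on the database engine and the workload, as noted above.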

Summary

In this part of the series, we have been introduced to some data modelling concepts. Do not worry, we will go over these in more detail later, but you should now be somewhat familiar with the following terms: Data Modelling, Entity-Relationship Models (ERMs) and Diagrams (ERDs), the Conceptual, Logical and Physical Data Models, Database Normalisation, and First, Second and Third Normal Form (1NF, 2NF and 3NF).

In the next part of this series, we will look at writing queries across multiple tables in our AdventureWorks Database, which is a normalized data model.

As always, please let me know what you think, leave a comment below or reach out to me on Twitter @Tableau_Magic.

