In the first part of this tutorial series, SQL101 / Relational Database Management Systems, we went through the process of installing a Relational Database Management System (SQL Server Express), installing SQL Server Management Studio (SSMS), and finally restoring a sample database for us to explore (AdventureWorks). In this episode, we will be going through and introducing you to the Structure Query Language, better known as SQL.
Structure Query Language (SQL)
Structured Query Language, commonly referred to as SQL (pronounced “sequel”), is a language used in programming and designed for managing data held in a relational database management system (RDBMS).
Originally based upon relational algebra and tuple relational calculus, SQL consists of many types of statements, which may be informally classed as sublanguages, commonly:
- Data Query Language (DQL) – for linking and retrieving data from relational databases.
- Data Definition Language (DDL) – for schema creation and modification.
- Data Control Language (DCL) – for permissions and data acess control.
- Data Manipulation Language (DML) – for inserting, updating or deleting data.
SQL was one of the first commercial languages for Edgar F. Codd’s relational model, as described in his influential 1970 paper, “A Relational Model of Data for Large Shared Data Banks”. Despite not entirely adhering to the relational model as described by Codd, it became the most widely used database language.
SQL became a standard of the American National Standards Institute (ANSI) in 1986, and of the International Organization for Standardization (ISO) in 1987. Since then, the standard has been revised to include a larger set of features. Despite the existence of such standards, most SQL code is not completely portable among different database systems without slight adjustments.
Database providers have also included extensions to Standard SQL to add procedural programming language functionality, such as control-of-flow constructs. These include:
Source | Common Name | Full Name |
ANSI/ISO Standard | SQL/PSM | SQL/Persistent Stored Modules |
Interbase / Firebird | PSQL | Procedural SQL |
IBM DB2 | SQL PL | SQL Procedural Language (implements SQL/PSM) |
IBM Informix | SPL | Stored Procedural Language |
IBM Netezza | NZPLSQL | based on Postgres PL/pgSQL |
Invantive | PSQL | Invantive Procedural SQL (implements SQL/PSM and PL/SQL) |
Microsoft / Sybase | T-SQL | Transact-SQL |
Mimer SQL | SQL/PSM | SQL/Persistent Stored Module (implements SQL/PSM) |
MySQL | SQL/PSM | SQL/Persistent Stored Module (implements SQL/PSM) |
MonetDB | SQL/PSM | SQL/Persistent Stored Module (implements SQL/PSM) |
NuoDB | SSP | Starkey Stored Procedures |
Oracle | PL/SQL | Procedural Language/SQL (based on Ada) |
PostgreSQL | PL/pgSQL | Procedural Language/PostgreSQL |
SAP R/3 | ABAP | Advanced Business Application Programming |
SAP HANA | SQLScript | SQLScript |
Sybase | Watcom-SQL | SQL Anywhere Watcom-SQL Dialect |
Teradata | SPL | Stored Procedural Language |
* We will be focusing on Microsoft SQL Server and in later tutorials will be using T-SQL to create functions and stored procedures.
For the remainder of this tutorial, we will be focused on learning the basic structure and syntax around Data Query Language (DQL), more specifically the SELECT Statement.
SELECT FROM
The main statement in the Data Query Language (DQL) for SQL is the SELECT Statement; the SELECT Statement is used to retrieve information from relational database management systems (RDBMS). We will use SQL Server Management Studio (SSMS) to issue SELECT Statements to the AdventureWorks database (set up in the previous tutorial); we will gradually build our knowledge about the various aspects of the SELECT Statement. So let us get started:
- Open SQL Server Management Studio.
- In the Connect to Server, click Connect.
- In the Object Explorer
- Click on the + next to Databases.
- Right-click on AdventureWorks2017, select New Query.
- A new SQL Query window will open
- Type the following command and press F5 or click on the Execute button:
SELECT TOP 100 *
FROM Person.Person
- The Results panel will show you all columns for the first 100 records from the Person table in the Person Schema.
Let us understand a bit more about the various Clauses.
SELECT CLAUSE
The SELECT Clause of the SELECT Statement tells the Relational Database Management System (RDBMS) what data is to be returned.
SELECT TOP 100 *
FROM CLAUSE
The FROM Clause of the statement tells the Relational Database Management System (RDBMS) where to get the information from.
FROM Person.Person
Database Schemas
A Database Schema is a way to group related objects in a database together. In our example, we are querying the Person table within the Person schema; in AdventureWorks, the Person schema also has tables containing addresses, phone numbers as well as other Person related information.
FROM [Schema Name].[Table Name]
Database Tables
A database table is a collection of related data held in a structured format within a database. It consists of columns and rows.
TOP 100
This part of the SELECT Clause tells the RDBMS to return the first 100 records; this is used to limit our data and especially useful when exploring data.
Asterix (*)
The asterisk (*) tells the Relational Database Management System (RDBMS) to return all columns.
Now, let us resume with our tutorial.
While looking at the results you can see that there are columns within our returned dataset that we may or may not care about, for example, I do not care about the demographics or rowguid (for now); actually the more I think about it, I want to be able to specify the exact columns that I want, so let us modify our query to do exactly that in the SELECT Clause:
SELECT TOP 100 BusinessEntityID, Title, FirstName, MiddleName, LastName, PersonType, EmailPromotion
FROM Person.Person
Execute this query and have a look at the results; notice that in the SELECT Clause you can have your columns in whatever order you want.
Note: SQL Server Management Studio (SSMS) comes with intellisense which will try to autocomplete your SQL Statements, this is truly a godsend.
SELECT FROM WHERE
Now that we have a SQL SELECT Statement that instructs the RDBMS to return specific columns from a specific table in a specific schema, I may also want to as the RDBMS to filter the data based on provided condition. Let us role play for a moment… you are working on getting a list of people that you can actively market to, and your marketing team has informed you that you can only market to Persons who have their EmailPromotion set to 1, so let us amend our SQL Statement:
SELECT BusinessEntityID, Title, FirstName, MiddleName, LastName, PersonType
FROM Person.Person
WHERE EmailPromotion = 1
As we want a complete list, we have removed the TOP 100; we have also removed the EmailPromotion from the SELECT Clause as we know that it is going to be 1.
Note: On the bottom right-hand side of SSMS you can see the number of records returned. In our case, we are returning 5044 rows.
You now want to send the results back to your marketing team for validation, you do this by right-clicking in the Results Panel and selecting Save Results As. Unfortunately, you get an email response saying that you were miss informed, they want a list of names where Email Promotion is either 1 or 2. As such we have to amend our query again:
SELECT BusinessEntityID, Title, FirstName, MiddleName, LastName, PersonType
FROM Person.Person
WHERE EmailPromotion = 1 OR EmailPromotion = 2
You run this query which now returns 8814 rows, you save this and send this to the marketing team.
COMMON WHERE CLAUSE OPERATORS
Below are the common where clause operators that you can use to filter your data.
Operator | Description | Example |
= | Equals | EmailPromotion = 1 |
<> or != | Not equals | EmailPromotion <> 0 |
> | Greater than | EmailPromotion > 0 |
< | Less than | EmailPromotion < 1 |
>= | Greater than or equal to | EmailPromotion >=1 |
<= | Less than or equal to | EmailPromotion <=1 |
BETWEEN | Between an inclusive range | EmailPromotion BETWEEN 1 AND 2 |
LIKE | Searches for a pattern | FirstName LIKE ‘A%’. The % is a wildcard sign, and therefore this will bring back any record where the first name starts with A.
FirstName LIKE ‘%a%d%’ will bring back records where there is an a followed by d within the first name column. |
IN | Matches one of the values in a list | EmailPromotion IN (1,2) |
OR | True is one of two conditions is true | EmailPromotion = 1 OR EmailPromotion = 2 |
AND | True if both conditions are true | EmailPromotion = 1 AND FirstName LIKE ‘A%’ |
IS NULL | True if there is no value | Title IS NULL |
IS NOT NULL | True if there is a value | Title IS NOT NULL |
Please take a moment to experiment with your WHERE Clause and test out the operators above.
Note: When comparing text columns you need to use single quotes. When you are comparing numbers you do not need any quotations. Do not use double Quotes in the WHERE Clause.
AGGREGATION AND GROUP BY CLAUSE
The marketing team has made an additional request of you, they are unsure of the data you have provided, and asked you to provide them with a summary so they can see the high-level numbers. As such, we will have to modify the SQL Statement to bring back summary (aggregated) information. The marketing team wants to see how many people there are with EmailPromotion 0, 1 and 2. Let us create a new SQL Statement:
SELECT EmailPromotion, COUNT(BusinessEntityID)
FROM Person.Person
GROUP BY EmailPromotion
Execute this SQL Statement and look at the results.
We want to include the EmailPromotion in our results, we also want the COUNT of the BusinessEntityID which is a unique key allocated to each person. Given that we are aggregating require a GROUP BY Clause; when we are using an Aggregate function like COUNT, SUM, MIN or MAX, we must include the non aggregated columns in a GROUP BY clause.
The marketing team comes back with another request, be careful of being too helpful at work. Given that it is October, they want to see a list of people who have a first name starting with the letter O, and as a follow up they also want to see the same results for every letter. Let us go about writing our two SQL Statements:
SELECT COUNT(BusinessEntityID)
FROM Person.Person
WHERE LEFT(FirstName,1) = 'O'
As we do not have any non-aggregated columns in our SELECT Clause, we do not need a GROUP BY, we have also used the LEFT Function (more on functions in later chapters) to get the first letter of the FirstName. This result a count of 164. Now let us get the entire alphabet:
SELECT LEFT(FirstName,1), COUNT(BusinessEntityID)
FROM Person.Person
GROUP BY LEFT(FirstName,1)
ORDER BY 1
We have now put the LEFT(FirstName,1) into the SELECT Clause and therefore require this to also be reflected in the GROUP BY Clause. We have also removed the WHERE Clause as we want all the values of the alphabet. Lastly, we have introduced an ORDER BY Clause which will instruct the RDBMS to sort the result set by column one (default is ascending).
They love what you are doing for them, but have one last request, they promise. They would like to order by the second column (highest first) and only return letter that has more than 500 people.
SELECT LEFT(FirstName,1), COUNT(BusinessEntityID)
FROM Person.Person
GROUP BY LEFT(FirstName,1)
HAVING COUNT(BusinessEntityID) > 500
ORDER BY 2 DESC
We have now added a HAVING BY Clause that will filter the aggregation. WHERE Clauses filter the dataset rows, whereas, HAVING filters the aggregated results. We have also changed the ORDER BY Clause to sort on the second column and in descending order.
Execute the statement and review the results.
Summary
This has been a long article, but it covers the basics of the SELECT Statement and its various Clauses. To summarise:
- SELECT – What data will be returned
- FROM – Where the data is coming from
- WHERE – Record level filtering of data
- GROUP BY – When you are using aggregate functions
- HAVING – When you want to put a condition on aggregated values
- ORDER BY – When you want to sort your data
We have covered a lot so far and in the next part of this series, we will go through concepts of relationship modelling and how data is designed and stored in Relational Database Management Systems (RDBMS), we will go through some more theory before looking at connecting (joining) tables.
As always, please let me know what you think, leave a comment below or reach out to me on Twitter @Tableau_Magic.
Sources
- Wikipedia