# Database 2, année 2023

Charles Paperman and Mikaël Monet

# Foreword

**Prerequisites to this course:** The syntax and
semantics of first-order logic, the basis of complexity theory (a little
bit of algorithms, the notion of reduction
between computational problems, the definition of Turing machines,
the class PTIME, the class NP, being able to prove by yourself a simple
NP-completeness result, having already seen and understood an
undecidability proof, the class PSPACE).

**Supplementary material**: This course is inspired from
the book “Database Theory” (work in progress, available here), and from Dan
Suciu’s lecture notes (here). Do not hesitate
to consult these.

\(\newcommand{\calQ}{\mathcal{Q}}\) \(\newcommand{\Consts}{\mathsf{Consts}}\) \(\newcommand{\vars}{\mathrm{vars}}\)

# Part I: Relational databases, queries and first-order logic, complexity of model checking and evaluation

## Relational databases: definitions

### The unnamed perspective

This is the notation that we will use most often in this course.

Let us fix a countably infinite set \(\mathsf{Consts}\), called the
*domain*, of constants (also called *data values*).

A *tuple* \(\bar{t}\) is an
element of \(\mathsf{Consts}^k\) for a
certain \(k\), that is called the
*arity* of \(\bar{t}\).

If \(\mathsf{Consts}\) contains (among other things) the values \(a\), \(b\), and \(\text{toto}\), then \((a,\text{toto},a)\) is a tuple of arity 3.

A *relation* is a finite set of tuples of same arity (that is
then called the *arity* of the relation)

\(\{(a,a,a), (a,b,a), (a,\text{toto},c)\}\), or \(\{(a), (42)\}\), or ${()}, or \(\emptyset\) (the empty set) are relations. \(\{(a),(a,b)\}\) is not a relation.

A *(relational) schema* \(\Sigma\) is a set of *relational
symbols* \(R_1,...,R_k\), each with
its *arity* (a natural number), written \(\mathrm{ar}(R_i)\).

A *relational database* (DB for short) over \(\Sigma\) and \(\mathsf{Consts}\) associates to each
relational symbol \(R\) of \(\Sigma\) a relation, that we will write
\(R^D\) (or simply \(R\) when there is no risk of
confusion).

\(\Sigma = \{R,S,T\}\), of arity, respectively, 2,3 and 1. Then a DB could consist of the following relations: \(R^D = \{(a,a), (a,42)\}\), \(S^D = \{(b,b,a), (\text{toto},b,c), (c,c,c)\}\), and \(T^D = \{(a), (\text{toto})\}\). To see clearer, we often simply draw \(D\) as a set of tables in the usual way.

We call a *fact* of \(D\)
something of the form \(R(a_1,...,a_{\mathrm{ar}(R)})\). Hence we
can equivalently see a DB as a set of facts.

The previous example but where we see \(D\) as a set of facts, i.e., \(D = \{R(a,a), R(a,42), S(b,b,a), S(\text{toto},b,c), S(c,c,c), T(a), T(\text{toto})\}\).

We write \(|D|\) the number of facts
of a database \(D\), and call this its
*size*.

The *active domain* of a database \(D\), written \(\mathrm{ADom}(D)\), is the set of constants
that occur in the database.

For the database of the previous example, \(\mathrm{ADom}(D) = \{a,b,c,\text{toto},42\}\). It might be the case that \(\mathrm{ADom}(D)\) is a strict subset of \(\mathsf{Consts}\).

### The named perspective

This is the one you are used to when you manipulate relational
databases management systems (RDBMSs, such as MySQL, PostgreSQL, etc.),
where columns in tables have *attribute names*. This contrasts
with the unnamed perspective we just saw, where **we do not
care** about the names of attributes and only the position inside
tuples is important.

In the named perspective, we could have for instance a relation `Employee(int id, varchar(50) name, float salary).`

In the unnamed, we simply see this as a relation of arity 3. The domain
\(\mathsf{Consts}\) can contain
integers, strings, floats, whatever we want: for the type of problems we
will study in this course, this will be of no importance.

Why do we use the unnamed perspective? Because this is closer to the syntax of first-order logic, and also because it is less notationally heavy and hence less annoying when we want to write queries or proofs (see examples below).

### Differences between this course and “real” databases

In real databases data values have a type (int, string, etc). As already mentioned, here we do not care and put everything in \(\mathsf{Consts}\).

The other main difference is that in this course we consider what is
called the *set semantics* (a relation is a *set* of
tuples, so we cannot have repeated tupes), whereas typical RDBMSs allow
for duplicates, which is refered to as the *bag semantics*. This
subtle distinction can make the complexity of some problems differ. The
set semantics is less painful to work with, and it is already
interesting enough to make it worth studying.

## Queries: definitions

A *query* \(q\) (on \(\Sigma\)) is a function that associates to
each DB \(D\) a relation, called the
*result of q on D* and written \(q(D)\). Note that when the arity of that
relation is zero, only two results are possible: the emptyset \(\emptyset\), and \(\{()\}\), the set containing only the empty
tuple. We traditionally associate the first to FALSE and the second to
TRUE, and we then call the query *Boolean*. A database can then
either *satisfy the query*, written \(D
\models q\), or not satisfy it, written \(D \not\models q\).

Queries are usually written in specific *query languages*. In
this course we will see first-order queries and restrictions thereof
such as conjunctive queries (CQs) or unions of conjunctive queries
(UCQs). More powerfull query languages exist, such as second-order
logical queries, or Datalog queries, that
will probably not be covered in this course. As always, more
expressivity often comes with a cost, in the sense that typical problems
have higher complexity.

### First-order logic

You should already know the syntax and semantics of first-order logic (cf. prerequisites, see wikipedia if you need a refresher (the difference being that we do not use “function symbols” in this class, and all “structures” correspond to our databases and are thus finite)). Just a few notational reminders here so that we are on the same page.

Let \(V\) be a countably infinite
set of *variables*, disjoint from \(\mathsf{Consts}\).

An *atom* is either of the form \(t_1 = t_2\) (called an *equality
atom*), or of the form \(R(t_1,...,t_{\mathrm{ar}(R)})\) where each
\(t_i\) is either a constant or a
variable.

Formulas can be built from atoms by combining them using logical connectives (\(\land\), \(\lor\), \(\lnot\), \(\implies\)) and existential (\(\exists\)) and universal (\(\forall\)) quantifiers.

A variable can be *free* or *bound* in a formula.

A *sentence* is a formula with no free variables.

(Taken from Dan Suciu’s lecture notes). In English, what do these queries say?

- \(\exists x \exists y \exists y (x \neq y) \land (x\neq z) \land (y \neq z)\)
- \(\exists x \exists y \forall z (z = x) \lor (z = y)\)
- If the schema contains a single binary relation \(E\) (i.e., of arity two), one can see a DB on this schema as a directed graph. What does the following formula say about such graphs? \(\forall x \exists y E(x,y) \lor E(y,x)\)
- \(\forall x \forall y \exists z E(x,z) \land E(z,y)\)
- \(\exists x y z E(x,y) \land E(y,z) \land E(z,x)\)

A *first-order query* \(q\)
(or simply *FO query*) is the given of a first-order formula and
of a tuple that contains all its free variables. The result \(q(D)\) over a DB is defined as we expect
(if you are in doubt, see “First-Order Queries” around page 29 of the book).

\(q(x,x,y) = \exists z E(x,z) \land E(z,y)\). The formula is \(\phi(x,y) = \exists z E(x,z) \land E(z,y)\), with \(x\) and \(y\) the two free variables, and the tuple is \((x,x,y)\). Example of a database and \(q(D)\) in class.

Proposition: In terms of expressivity, first-order queries and relational algebra (Wikipedia reminder) and “vanilla SQL” are equivalent. (This should have been covered in the M1 database course.)

### Conjunctive queries (CQs)

A *conjunctive query* (CQ) is an FO query in which we only use
conjunction of atoms and existential quantifiers. In all generality,
they are thus of the form \(q(\bar{x}) =
\exists \bar{y} \bigwedge_{i=1}^n R_i(\bar{z_i})\) (we write
\(\bar{x}\) for a tuple, of the form
\(\bar{x} = (x_1,...,x_n)\), where the
\(x_i\)s are not necessarily distinct)
where the \(R_i\)s are not necessarily
distinct, \(\bar{x}\) and \(\bar{y}\) are tuples of variables that do
not intersect, each \(\bar{z_i}\) is a
tuple of variables and/or constants, and the union of the variables
occuring in the \(z_i\)s is equal to
\(\bar{x} \cup \bar{y}\).

\(q(x,y,y,z) = \exists t,u [ A(z,y,t) \land A(t,u,u) \land B(u,x) ]\)

(we write \(\exists t,u\) instead of \(\exists t \exists u\)).

If the database contains facts \(A(a,b,a)\), \(A(a,c,c)\) and \(B(c,a)\) for example, then the tuple \((a,b,b,a)\) will be in the result \(q(D)\).

CQs correspond to the SELECT FROM WHERE fragment of SQL, where we only use equality in the WHERE clause. So they are quite an important fragment of FO queries.

Find the name of the customers that have bought chocolate, with the
following schema (named perspective): \(\Sigma
=\) Customer(int id, varchar(50) name)

Orders(int clientId, int productID)

Products(int id, varchar(50) name)

In SQL, in the named perspective:

```
SELECT Customers.name FROM Customers, Orders, Products WHERE Customer.id =
Order.clientId AND Order.productId = Products.id AND Products.name = 'Chocolate'
```

With a CQ, in the unnamed perspective: \(q(x) = \exists y \exists z \mathrm{Customer}(y,x) \land \mathrm{Order}(y,z) \land \mathrm{Products}(z,\text{'chodolate'})\)

This is much easier to read: we see directly the “structure” of the query, we don’t care about the attribute names, and we do not need to chase down the equalities as they are directly visible.

== To move further, maybe for a TD? ### Unions of conjunctive queries (UCQs)

UCQ = a disjunction of CQs that all have the same free variables

\(q(x,y) = (\exists z R(x,z) \land S(z,y)) \lor (\exists z1 \exists z2 S(z1, x) \land S(z2,y))\)

==

## Model checking and query evaluation, notions of combined and data complexity

Query evaluation problem: given a query \(q\) and a DB \(D\), compute \(q(D)\).

Model checking: just like query evaluation, but for Boolean queries (the result is simply TRUE or FALSE).

If our query language is Turing-complete (e.g., Python), this task is going to be undecidable. But what about FO queries for instance?

When faced with a new problem, we usually:

- First, try to understand in which complexity class it lies (is it PTIME, NP-complete, PSPACE-complete, EXPSPACE-complete, or is it maybe undecidable?)
- If we are lucky and it is in PTIME, then we might want to obtain “efficient” algorithms (linear-time, or quadratic, or O(n n), etc.)

In this first part of the course we will mostly study classical problems and try to determine their complexity class. If they are in PTIME, we will not try to optimize the degree of the polynomial.

As usual, to talk about complexity, we have to decide how the input is represented (e.g., on a Turing machine input tape). For DBs and queries, we will consider any reasonable encoding (see, e.g., Appendix C of the book).

Then, there are two measures of complexity that are commonly used in
database theory: the *combined complexity* and the *data
complexity*. Intuitively, in the combined complexity the query
*and* the DB are considered to be part of the input, whereas for
the data complexity the query is fixed and only the DB is part of the
input. Data complexity is motivated by the fact that in real life
queries are much smaller than databases. Let’s see some examples of what
we mean exactly by combined and data complexity.

**ModelChecking(**FO**)** is the following
decision problem:

**INPUT**: A Boolean FO query q (notice that this is just a sentence) and a DB \(D\).**OUTPUT**: Accept if \(D\) satisfies \(q\), reject otherwise

(We define **QueryEval(**FO**)** similarly
for arbitrary FO queries, where the output is \(q(D)\).)

Here, we measure the combined complexity, that is, the complexity as a function \(f(|\phi|,|D|)\). We will soon see what is the complexity of this specific problem, but for now let’s see an example of a problem where we measure the data complexity.

In data compexity, we typically have **one computational
problem per query**. For instance:

**ModelChecking(**\(\exists x
y z R(x,y) \land R(y,z)\)**)** is the following
decision problem:

**INPUT**: A database \(D\)**OUTPUT**: Accept if \(D\) satisfies that query, reject otherwise

Here, we measure the data complexity. It might be the case that
**ModelChecking(**\(q_1\)**)** and
**ModelChecking(**\(q_2\)**)** have a different
complexity if \(q_1\) and \(q_2\) are different.

**Theorem 1:** For every FO Boolean query \(q\), the problem
**ModelChecking(**\(q\)**)** is in PTIME.
Informally, we can say that “the data complexity of FO is PTIME”

(informal proof). We first prove this theorem for the specific query
(borrowed from Dan Suciu’s lecture notes): \(q
= \exists x [A(x) \land \forall y ( B(y) \implies C(x,y) ) ]\).
We propose the following algorithm for solving
**ModelChecking(**\(q\)**)** in time \(O(|D|^2)\), in pseudocode:

```
def model_checking(q, D):
# this can be done in linear time
compute_adom(D) = False
foundX for x in ADom(D):
if not A(x): # shorthand for "if A(x) \in A^D"
continue # recall that this breaks one iteration of the loop
= True
allY for y in ADom(D):
if B(y) and not C(x,y):
= False
allY if allY:
= True
foundX return foundX
```

This algorithm is clearly correct. What is its complexity, as a function of \(|D|\)? Let \(n\) be \(|\mathrm{ADom}(D)|\). Clearly, we have \(n \leq 2 |D|\), because the relations have arity at most two, so there can be at most \(2|D|\) distinct values in \(\mathrm{ADom}(D)\). Let us be conservative and consider that checking whether \(A(c)\) is in \(A^D\) for a specific value c is in linear time in \(|D|\) (we can read the whole database in linear time, and if we encounter the fact \(A(c)\) we return yes). Then the algorithm has complexity \(O(n^3)\), which is \(O(|D|^3)\) by what precedes. This is indeed PTIME.

-> This can be generalized to any FO Boolean query, with complexity roughly \(O(|D|^k)\), with \(k\) the number of variables in the query. Indeed, we can always rewrite an FO formula as a bunch of quantifiers, plus at the end a propositional formula (with no quantifiers) (see This wikipedia page). Then we can have an algorithm with the same structure as the one above, where we will have roughly \(k\) nested for-loops, plus a bit of code to test the instantiated propositional formla.

**Exercice:** For any FO query (not necessarily
Boolean), what is the complexity of **QueryEval(**\(q\)**)**?

We now introduce an important notion for Boolean CQs that we will use a lot: homomorphisms.

Set \(q\) be a Boolean CQ and \(D\) a DB. We write \(\mathrm{vars}(q)\) the set of variables of
\(q\). A *homomorphism from \(q\) to \(D\)* is a function

\(h: \mathrm{vars}(x) \cup \mathsf{Consts} \to \mathsf{Consts}\) such that:

- Constants are preserved: for every \(c \in \mathsf{Consts}\), we have \(h(c) = c\)
- Atoms of \(q\) are mapped to facts of \(D\): for every atom \(R(\bar{x})\) of \(q\), \(R(h(\bar{x})) \in R^D\)

(here we write \(h(\bar{x})\) as a shorthand for the tuple \((h(x_1),...,h(x_n))\).)

\(q = \exists x y z R(x,x,y) \land R(y,x,z) \land S(z,y)\), D = … TODO in class

Important observation: \(D\models q\) iff there exists a homomorphism from \(q\) to \(D\).

**Theorem 2**: the problem
**ModelChecking(**CQs**)** is NP-complete.
Informally, we say that “CQs have NP-complete combined complexity”.

Before doing the proof, notice how this contrasts with Theorem 1 above: the data complexity of any FO query is PTIME, but when we look at the combined complexity, even restricted to the fragment of FO with only conjunctions and existential quantifiers (CQs), it becomes NP-complete!

This also shows that it is important to be precise when one defines a problem, in particular to specify exactly what is part of the input and what is not.

This problem is in NP: why? To show hardness, let’s reduce from 3SAT: Let \(\phi\) be a 3CNF. We build (in PTIME in \(|\phi|\)) a CQ \(q_\phi\) and a DB \(D\) such that \(\phi\) is satisfiable if and only if \(D\) satisfies \(q_\phi\).

The schema contains 4 relational symbols of arity 3, say \(A0\), \(A1\), \(A2\), and \(A3\). The query \(q_\phi\) contains one atom per clause of \(\phi\), in the following way:

- if \(C\) is a clause with no negated variables, so of the form \(x_i \lor x_j \lor x_k\), then \(q_\phi\) contains the atom \(A0(x_i,x_j,x_k)\)
- if \(C\) is a clause with exactly one negated variable, so of the form \(x_i \lor x_j \lor \lnot x_k\), then \(q_\phi\) contains the atom \(A1(x_i,x_j,x_k)\)
- … exactly two negated variables, … \(x_i \lor \lnot x_j \lor \lnot x_k\), then … \(A2(x_i,x_j,x_k)\)
- … all 3 variables negated, then … \(A4(x_i,x_j,x_k)\)

(\(q_\phi\) is the conjunction of all those atoms, with all variables existentially quantified)

The database consists of the following 4 relations \(A0^D\), \(A1^D\)… over the domain \(\{0,1\}\). In \(A0\) there are all possible facts except \(A0(0,0,0)\), in A1 there are all possible facts except…, etc. (whiteboard drawing)

We can construct \(q_\phi\) and \(D\) in PTIME? -> yes

\(\phi\) is satisfiable iff \(D\models q_\phi\)? -> yes (check in class, using homomorphisms)

Question (at home): what is the complexity of
**QueryEval(**CQs**)**?

**Theorem 3**: The problem
**ModelChecking(**FO**)** is
PSPACE-complete.

Same proof as for Theorem 2, but we reduce from the PSPACE-hard problem TQBF, and we add to \(q_\phi\) the appropriate quantifiers.

This shows that if we restrict problem (from FO to CQs), its complexity can decrease (here from PSPACE to NP)!

We also present an alternative proof of Theorem 2, simpler, by reduction from the 3 colorability problem (of non-directed graphs).

Alternative proof of Theorem 2. We reduce from 3 coloring, which we recall is the following problem:

**3 colorability**:

**INPUT**: A non directed graph \(G = (V,E)\)

**OUTPUT**: Accept if we can color the nodes of the
graph with 3 colors such that no two adjacent nodes have the same
color.

This problem is NP-complete. The reduction is then as follows.

The schema consists of a single binary relation \(E\).

The database is \(\{E(1,2), E(1,3), E(2,1), E(2,3), E(3,1), E(3,2)\}\), i.e., all facts \(E(a,b)\) for \(a,b \in \{0,1\}\) with \(a\neq b\).

The CQ \(q\) contains, for every edge \(\{u,v\}\) of \(G\), the two atoms \(E(u,v)\) and \(E(v,u)\). So the variables of \(q\) are the nodes of \(G\), and they are all existentially quantified.

This construction can be done in PTIME. We only need to check that \(G\) is 3-colorable iff \(D \models q\): in class, using homomorphisms

# Lab

- Fill the TODO of the following python code
- Fill the TODO of the following python code, with first corrections
- Fill the TODO of the following python code, with second corrections

Compiled the: mar. 22 oct. 2024 11:40:58 CEST