# Database 2, année 2023

# Part II: static analysis of classical problems

\(\newcommand{\calQ}{\mathcal{Q}}\) \(\newcommand{\Consts}{\mathsf{Consts}}\) \(\newcommand{\vars}{\mathrm{vars}}\)

## Satisfiability

Let \(\mathcal{Q}\) be a class of Boolean queries. We write \(\mathrm{SAT}_{\mathrm{fin}}(\mathcal{Q})\) the following problem:

INPUT: \(q \in \mathcal{Q}\) OUTPUT: accept if there exists a DB that satisfies \(q\), reject otherwise.

The “fin” in \(\mathrm{SAT}_{\mathrm{fin}}\) stands for “finite”. This is because classical model theory considers structures (databases) that are not necessarily finite, whereas in our case we are only interested in databases that have a finite number of facts.

Question: what is the complexity of \(\mathrm{SAT}_{\mathrm{fin}}(CQs)\) and \(\mathrm{SAT}_{\mathrm{fin}}(UCQs)\)?

-> This is PTIME, since any (non-empty) CQ \(q\) is satisfiable by the *canonical
database* associated with \(q\),
which is the database where we see the variables of \(q\) as constants, and we take all the atoms
of \(q\) as facts.

**Theorem 4 [Trakhtenbrot, 1950]**: \(\mathrm{SAT}_{\mathrm{fin}}(FO)\) is
undecidable.

In class, following Theorem 8.1 of the book. We started the proof together and you had to finish reading it from the book.

## Containment and equivalence

To simplify, we will only consider Boolean queries in this section, but all concepts generalize to non-Boolean queries.

Let \(q_1,q_2\) be two Boolean
queries. We say that \(q_1\) is
*contained* in \(q_2\), written
\(q_1 \subseteq q_2\), if, for every DB
\(D\), if \(D\models q_1\) then \(D\models q_2\).

(If we draw two potatoes, where the first potatoe represents the databases that satisfy \(q_1\) and the second potatoe the databases that satisfy \(q_2\), the first potatoe will be included in the second one, hence the name.)

For a class \(\calQ\) of Boolean
queries, define the *containment problem* for \(\calQ\) as follows:

**Containment(**\(\calQ\)**)**:

**INPUT**: Two queries \(q_1,q_2 \in \calQ\)

**OUPUT**: Yes if \(q_1
\subseteq q_2\).

Two Boolean queries \(q_1,q_2\) are equivalent, written \(q_1 \equiv q_2\) if, for every database \(D\), we have that \(D \models q_1\) if and only if \(D \models q_2\).

**Equivalence(**\(\calQ\)**)**:

**INPUT**: Two queries \(q_1,q_2 \in \calQ\)

**OUPUT**: Yes if \(q_1 \equiv
q_2\).

\(q_1\) is *strictly
contained* in \(q_2\), written
\(q_1 \subsetneq q_2\), if we have
\(q_2 \subseteq q_2\) but not \(q_2 \subseteq q_1\).

**Question**: For each pair of CQs among the following,
determine the relationship between them in terms of containment and
equivalence (and explain why).

\(q_1 = \exists x y E(x,y)\)

\(q_2 = \exists x y E(x,y) \land E(z,y)\)

\(q_3 = \exists x y z E(x,y) \land E(y,z)\)

\(q_4 = \exists x y z t E(x,y) \land E(x,z) \land E(y,t) \land E(z,t)\)

**Question**: show that for any class of Boolean queries
\(\calQ\), the problem
**Equivalence(**\(\calQ\)**)** reduces in PTIME
to **Containment(**\(\calQ\)**)**.

Why are we interested in these problems? For optimization: when we have query \(q\), it might be interesting to “minimize it”, i.e., to find a smaller query \(q'\) that is equivalent to \(q\) and that is smaller. This is interesting because, as we have seen, the running time of evaluating such a query is roughly \(O(|D|^k)\) where \(k\) is the number of variables of the query; so we can find a smaller equivalent query we might as well evaluate that one, which gives us a better complexity. We encourage you to have a look at the “Optimizing a Simple Query” Example (Example 15.1) in the book.

Let us start with bad news:

**Theorem 5**:
**Containment(**FO**)** and
**Equivalence(**FO**)** are both
undecidable.

We reduce from **Satisfiability(**FO**)**,
which we have shown is undecidable. Given an Boolean FO query \(q\), we want to decide if it is
satisfiable, that is, if there exists a database \(D\) such that \(D\models q\).

Consider the query \(q' = \exists x x\neq x\). This query is not satisfied by any database. We can then check that \(q\) is unsatisfiable iff \(q\equiv q'\) (iff \(q\subseteq q'\)).

We will then study these problems for conjunctive queries.

Homomorphisms play a central role to analyze the containment and
equivalence for conjunctive queries. Above, we defined homomorphisms to
be from CQs to databases, but we can adapt this definition to be from
CQs to CQS. Formally, given two Boolean CQs \(q_2\) and \(q_2\), a **homomorphism** from
\(q_1\) to \(q_2\) is a function \(h: \vars(q_1) \cup \Consts \to \vars(q_2) \cup
\Consts\) such that:

- Constants are preserved, i.e., for every \(c\in \Consts\) we have \(h(c) = c\).
- Atoms of \(q_1\) are mapped to atoms of \(q_2\): for every atom \(R(\bar{z_i})\) of \(q_1\), \(R(h(\bar{z_i}))\) is an atom of \(q_2\).

**Theorem 6 [Chandra & Merlin, 1977]**: For any two
Boolean CQs \(q_1, q_2\), we have \(q_1 \subseteq q_2\) if and only if there
exists a homomorphism from \(q_2\) to
\(q_1\).

Pay attention here that the “direction” is reversed: \(q_1\) is contained in \(q_2\) iff there is a homomorphism that goes
*from* \(q_2\) *to* \(q_1\).

Let us prove this result.

We prove both directions in turn.

- Assume that there exists a homomorphism \(h\) from \(q_2\) to \(q_1\), and let us show then that \(q_1 \subseteq q_2\). Let \(D\) be a database that satisfies \(q_1\). By the “important observation” above, this implies that there exists a homomorphism, call it \(h'\), from \(q_1\) to \(D\). Now, you can check that \(h \circ h'\) is a homomorphism from \(q_2\) to \(D\) (really – do it!), so that we have \(D\models q_2\) as well.
- Assume that \(q_1 \subseteq q_2\). Consider the canonical database \(D_{q_1}\) associated with \(q_1\). By definition of the canonical database, it should be clear that we have \(D_{q_1} \models q_1\). Therefore, since \(q_1 \subseteq q_2\), we also have that \(D_{q_1} \models q_2\). By the important observation, this implies that there exists a homomorphism \(h\) from \(q_2\) to \(D_{q_1}\). But it is clear that this is also a homomorphism from \(q_2\) to \(q_1\).

**Theorem 7**:
**Containment(**CQ**)** is NP-complete.

The problem is in NP: why?

To show NP-hardness, we reduce from 3-colorability. This is essentially the same proof as for the “alternative” proof of Theorem 2. The graph is coded by one of the CQs, the colors by the other CQ.

**Theorem 8**:
**Equivalence(**CQ**)** is NP-complete.

The problem is in NP: why?

To show NP-hardness, we reduce from
**Containment(**CQ**)**, which we just showed
was NP-complete. Let \(q_1,q_2\) be two
CQs, that we assume are over disjoint sets of variables. Construct the
query \(q_3\) to be the conjunction of
\(q_1\) and \(q_2\). Then we showed in class that \(q_1 \subseteq q_2\) iff \(q_1 \equiv q'\), which concludes the
reduction.

## Minimization

**CQ minimization:** given a Boolean CQ (BCQ) \(q\), we would like to find an equivalent
BCQ \(q'\) that is “smaller” (so
that we can evaluate it faster on a database).

Define \(A_q\) to be the set of atoms that \(q\) contains.

A *minimization* of a BCQ \(q\) is a BCQ \(q'\) such that:

- \(q' \equiv q\); and
- For every BCQ \(q''\), if \(q'' \equiv q\) then \(|A_{q''}| \geq |A_{q'}|\).

We present next one candiate procedure to compute such a minimization of \(q\).

### The core of a BCQ

Idea: Start from a BCQ \(q\), and
try to remove an atom, calling the resulting CQ \(q'\). Obviously, we have \(q \subseteq q'\) (why?). But if we are
lucky and, additionally, we have \(q'
\subseteq q\), then \(q \equiv
q'\), and we have found a CQ that is smaller and equivalent
to \(q'\). We can then continue
doing this, and at some point we will find a CQ, call it \(q_m\), that is equivalent to \(q\) and to which we cannot remove any atom
while preserving equivalence. We call such a CQ \(q_m\) a *core* of \(q\). Let us define this formally.

We then consider the following algorithm:

If there exists an atom \(R(\overline{x})\) in q such that, considering \(q'\) the query \(q\) where we have removed this atom, we have \(q' \subseteq q\), then return ComputeCore(q’) else return q

What happens on \(q = \exists x,y R(x,y) \land R(x,z)\) ?

On \(q' = \exists x,y R(x,y) \land R(y,z)\) ?

We note that by doing different choices on the atoms to remove, we might end up on different queries.

A *core* of \(q\) is then an
query \(q'\) that is possible to
obtain via this algorithm, i.e., there is a choice of atoms to remove
that ends up on \(q'\).

The next proposition says that all those choices essentially end up on the same query:

**Proposition:** Every core of \(q\) is also a minimization of \(q\). Moreover, for any two cores \(q_1,q_2\) of \(q\), \(q_1\) and \(q_2\) are isomorphic, i.e., they are the
same up to renaming of variables.

See Chapter 16 of the book.

Compiled the: mar. 22 oct. 2024 11:40:58 CEST