Database 2, année 2024
Part II: static analysis of classical problems
\(\newcommand{\calQ}{\mathcal{Q}}\) \(\newcommand{\Consts}{\mathsf{Consts}}\) \(\newcommand{\vars}{\mathrm{vars}}\)
Satisfiability
Let \(\mathcal{Q}\) be a class of Boolean queries. We write \(\mathrm{SAT}_{\mathrm{fin}}(\mathcal{Q})\) the following problem:
INPUT: \(q \in \mathcal{Q}\) OUTPUT: accept if there exists a DB that satisfies \(q\), reject otherwise.
The “fin” in \(\mathrm{SAT}_{\mathrm{fin}}\) stands for “finite”. This is because classical model theory considers structures (databases) that are not necessarily finite, whereas in our case we are only interested in databases that have a finite number of facts.
Question: what is the complexity of \(\mathrm{SAT}_{\mathrm{fin}}(CQs)\) and \(\mathrm{SAT}_{\mathrm{fin}}(UCQs)\)?
-> This is PTIME, since any (non-empty) CQ \(q\) is satisfiable by the canonical database associated with \(q\), which is the database where we see the variables of \(q\) as constants, and we take all the atoms of \(q\) as facts. Formally, let us introduce the concept of the canonical database of a CQ, which will be useful later.
Let \(q\) be a CQ. The canonical database of \(q\), denoted \(\mathrm{Canon}(q)\), contains one fact \(R(\bar{z'})\) for every atom \(R(\bar{z})\) of \(q\), where: - \(z'_i\) is simply \(z_i\) if \(z_i\) is a constant; - \(z'_i\) is the fresh constant \(c_{z_i}\) if \(z_i\) is a variable.
TODO example of canonical database of a CQ in class.
Theorem 4 [Trakhtenbrot, 1950]: \(\mathrm{SAT}_{\mathrm{fin}}(FO)\) is undecidable.
In class, following Theorem 8.1 of the book. We started the proof together and you had to finish reading it from the book.
Containment and equivalence
To simplify, we will only consider Boolean queries in this section, but all concepts generalize to non-Boolean queries.
Let \(q_1,q_2\) be two Boolean queries. We say that \(q_1\) is contained in \(q_2\), written \(q_1 \subseteq q_2\), if, for every DB \(D\), if \(D\models q_1\) then \(D\models q_2\).
(If we draw two potatoes, where the first potatoe represents the databases that satisfy \(q_1\) and the second potatoe the databases that satisfy \(q_2\), the first potatoe will be included in the second one, hence the name.)
For a class \(\calQ\) of Boolean queries, define the containment problem for \(\calQ\) as follows:
Containment(\(\calQ\)):
INPUT: Two queries \(q_1,q_2 \in \calQ\)
OUPUT: Yes if \(q_1 \subseteq q_2\).
Two Boolean queries \(q_1,q_2\) are equivalent, written \(q_1 \equiv q_2\) if, for every database \(D\), we have that \(D \models q_1\) if and only if \(D \models q_2\).
Equivalence(\(\calQ\)):
INPUT: Two queries \(q_1,q_2 \in \calQ\)
OUPUT: Yes if \(q_1 \equiv q_2\).
\(q_1\) is strictly contained in \(q_2\), written \(q_1 \subsetneq q_2\), if we have \(q_2 \subseteq q_2\) but not \(q_2 \subseteq q_1\).
Question: For each pair of CQs among the following, determine the relationship between them in terms of containment and equivalence (and explain why).
\(q_1 = \exists x y E(x,y)\)
\(q_2 = \exists x y z E(x,y) \land E(z,y)\)
\(q_3 = \exists x y z E(x,y) \land E(y,z)\)
\(q_4 = \exists x y z t E(x,y) \land E(x,z) \land E(y,t) \land E(z,t)\)
Question: show that for any class of Boolean queries \(\calQ\), the problem Equivalence(\(\calQ\)) reduces in PTIME to Containment(\(\calQ\)).
Why are we interested in these problems? For optimization: when we have query \(q\), it might be interesting to “minimize it”, i.e., to find a smaller query \(q'\) that is equivalent to \(q\) and that is smaller. This is interesting because, as we have seen, the running time of evaluating such a query is roughly \(O(|D|^k)\) where \(k\) is the number of variables of the query; so we can find a smaller equivalent query we might as well evaluate that one, which gives us a better complexity. We encourage you to have a look at the “Optimizing a Simple Query” Example (Example 15.1) in the book.
Let us start with bad news:
Theorem 5: Containment(FO) and Equivalence(FO) are both undecidable.
We reduce from Satisfiability(FO), which we have shown is undecidable. Given an Boolean FO query \(q\), we want to decide if it is satisfiable, that is, if there exists a database \(D\) such that \(D\models q\).
Consider the query \(q' = \exists x x\neq x\). This query is not satisfied by any database. We can then check that \(q\) is unsatisfiable iff \(q\equiv q'\) (iff \(q\subseteq q'\)).
We will then study these problems for conjunctive queries.
Homomorphisms play a central role to analyze the containment and equivalence for conjunctive queries. Above, we defined homomorphisms to be from CQs to databases, but we can adapt this definition to be from CQs to CQS. Formally, given two Boolean CQs \(q_2\) and \(q_2\), a homomorphism from \(q_1\) to \(q_2\) is a function \(h: \vars(q_1) \cup \Consts \to \vars(q_2) \cup \Consts\) such that:
- Constants are preserved, i.e., for every \(c\in \Consts\) we have \(h(c) = c\).
- Atoms of \(q_1\) are mapped to atoms of \(q_2\): for every atom \(R(\bar{z_i})\) of \(q_1\), \(R(h(\bar{z_i}))\) is an atom of \(q_2\).
Theorem 6 [Chandra & Merlin, 1977]: For any two Boolean CQs \(q_1, q_2\), we have \(q_1 \subseteq q_2\) if and only if there exists a homomorphism from \(q_2\) to \(q_1\).
Pay attention here that the “direction” is reversed: \(q_1\) is contained in \(q_2\) iff there is a homomorphism that goes from \(q_2\) to \(q_1\).
Let us prove this result.
We prove both directions in turn.
- Assume that there exists a homomorphism \(h\) from \(q_2\) to \(q_1\), and let us show then that \(q_1 \subseteq q_2\). Let \(D\) be a database that satisfies \(q_1\). By the “important observation” above, this implies that there exists a homomorphism, call it \(h'\), from \(q_1\) to \(D\). Now, you can check that \(h \circ h'\) is a homomorphism from \(q_2\) to \(D\) (really – do it!), so that we have \(D\models q_2\) as well.
- Assume that \(q_1 \subseteq q_2\). Consider the canonical database \(\mathrm{Canon}(q_1)\) associated with \(q_1\). By definition of the canonical database, it should be clear that we have \(\mathrm{Canon}(q_1) \models q_1\). Therefore, since \(q_1 \subseteq q_2\), we also have that \(\mathrm{Canon}(q_1) \models q_2\). By the important observation, this implies that there exists a homomorphism \(h\) from \(q_2\) to \(\mathrm{Canon}(q_1)\). But it is clear that this is also a homomorphism from \(q_2\) to \(q_1\).
Theorem 7: Containment(CQ) is NP-complete.
The problem is in NP: why?
To show NP-hardness, we reduce from 3-colorability. This is essentially the same proof as for the “alternative” proof of Theorem 2. The graph is coded by one of the CQs, the colors by the other CQ.
Theorem 8: Equivalence(CQ) is NP-complete.
The problem is in NP: why?
To show NP-hardness, we reduce from Containment(CQ), which we just showed was NP-complete. Let \(q_1,q_2\) be two CQs, that we assume are over disjoint sets of variables. Construct the query \(q_3\) to be the conjunction of \(q_1\) and \(q_2\). Then we showed in class that \(q_1 \subseteq q_2\) iff \(q_1 \equiv q_3\), which concludes the reduction.
Minimization
CQ minimization: given a Boolean CQ (BCQ) \(q\), we would like to find an equivalent BCQ \(q'\) that is “smaller” (so that we can evaluate it faster on a database).
Define \(A_q\) to be the set of atoms that \(q\) contains.
A minimization of a BCQ \(q\) is a BCQ \(q'\) such that:
- \(q' \equiv q\); and
- For every BCQ \(q''\), if \(q'' \equiv q\) then \(|A_{q''}| \geq |A_{q'}|\).
We present next one candiate procedure to compute such a minimization of \(q\).
The core of a BCQ
Idea: Start from a BCQ \(q\), and try to remove an atom, calling the resulting CQ \(q'\). Obviously, we have \(q \subseteq q'\) (why?). But if we are lucky and, additionally, we have \(q' \subseteq q\), then \(q \equiv q'\), and we have found a CQ that is smaller and equivalent to \(q'\). We can then continue doing this, and at some point we will find a CQ, call it \(q_m\), that is equivalent to \(q\) and to which we cannot remove any atom while preserving equivalence. We call such a CQ \(q_m\) a core of \(q\). Let us define this formally.
We then consider the following algorithm:
If there exists an atom \(R(\overline{x})\) in q such that, considering \(q'\) the query \(q\) where we have removed this atom, we have \(q' \subseteq q\), then return ComputeCore(q’) else return q
What happens on \(q = \exists x,y,z R(x,y) \land R(x,z)\) ?
On \(q' = \exists x,y R(x,y) \land R(y,z)\) ?
We note that by doing different choices on the atoms to remove, we might end up on different queries.
A core of \(q\) is then an query \(q'\) that is possible to obtain via this algorithm, i.e., there is a choice of atoms to remove that ends up on \(q'\).
The next proposition says that all those choices essentially end up on the same query:
Proposition: Every core of \(q\) is also a minimization of \(q\). Moreover, for any two cores \(q_1,q_2\) of \(q\), \(q_1\) and \(q_2\) are isomorphic, i.e., they are the same up to renaming of variables.
See Chapter 16 of the book.
Compiled the: mar. 21 janv. 2025 22:04:19 CET