# Part III: Probabilistic databases, provenance, knowledge compilation

\(\newcommand{\calQ}{\mathcal{Q}}\) \(\newcommand{\Consts}{\mathsf{Consts}}\) \(\newcommand{\vars}{\mathrm{vars}}\) \(\newcommand{\Pr}{\mathrm{Pr}}\) \(\newcommand{\calD}{\mathcal{D}}\) \(\newcommand{\calH}{\mathcal{H}}\) \(\newcommand{\pqetid}{\textbf{PQE}_\text{TID}}\) \(\newcommand{\pqe}{\textbf{PQE}}\) \(\newcommand{\atoms}{\textrm{Atoms}}\)

# Probabilistic databases and probabilitic query evaluation: definitions

## Probabilistic databases: definition

### Probabilistic databases and semantics of query evaluation

We fix as usual a database schema \(\Sigma\) and set of constants \(\Consts\).

A *probabilistic database* \(\calD =
(W,\Pr)\) consists of a *finite* set \(W=\{D_1,\ldots,D_n\}\) of databases over
\(\Sigma\) (called the *possible
worlds* of \(\calD\)), each having
a probability \(Pr(D_i) \in [0,1]\),
with \(\sum_{D\in W} Pr(D) = 1\).

\(W = \{D_1,D_2,D_3\}\) with \(\Pr(D_1) = 0.5\), \(\Pr(D_2) = 0.4\), \(\Pr(D_1) = 0.1\), and \(D_1\), \(D_2\), \(D_3\) being DBs (TODO whiteboard)

Let \(q\) be a Boolean query. Define
*the probability that \(\calD\)
satisfies \(q\)*, written \(Pr(\calD \models q)\), to be \(Pr(\calD \models q) = \sum_{D\in W, D\models q}
\Pr(D)\).

Continuing the above example, consider the query \(q = \exists x,y: R(x,y) \land S(y)\). Then \(Pr(\calD \models q) = \ldots\).

We can generalize this to non-Boolean queries: for a given tuple
\(\bar{t}\), define *the probability
that \(\bar{t}\) is in the result of
\(q\) on \(\calD\)* as \(\Pr(\bar{t}\in q(\calD)) = \sum_{D\in W,
\bar{t}\in q(D)} \Pr(D)\)

The *probabilistic query evaluation problem* (PQE) is the
problem of, given a Boolean query \(q\)
and PDB \(\calD\), computing \(\Pr(\calD \models q)\). Here again we can
consider the data or the combined complexity. In data complexity we have
one problem \(\pqe(q)\) for each
Boolean query \(q\). The complexity
will depend on how \(\calD\) is
represented as input.

Assume that \(\calD\) is simply
represented by the list of all its possible worlds together with their
probabilities, and let \(q\) be a
Boolean query. Show that \(\pqe(q)\)
reduces in PTIME to **ModelChecking(**\(q\)**)**, and vice-versa.

In general however probabilistic data do not come in this form but instead are given by some kind of implicit factorization. We’ll see next too examples of probabilistic database formalisms that are closer to real-life applications

### Tuple-independent PDBs (TIDs)

One way to represent a probabilistic database in a compact manner is
to start from a database \(D\), and
then annotate each of its facts \(f=R(\bar{t})\) by a probability value \(\pi(f)\), with the intended semantics that
\(f\) is present in the database with
probability \(\pi(f)\) (and absent with
probability \(1-\pi(f)\)), independent
of the other facts. This gives rise to the *tuple-independent
database* (TID) model.

Formally, a TID \(T = (D,\pi)\) consists of a database \(D\) and of a function \(\pi:D \to [0,1]\). It defines a PDB \(\calD_T = (W_T,\Pr_T)\) in the following way:

- The possible worlds \(W_T\) is the set of all subsets of \(D\) (seing \(D\) as a set of facts)
- For \(D' \in W_T\), we define \(\Pr_T(D') = (\prod_{f\in D'} \pi(f)) \times (\prod_{f\in D\setminus D'} (1-\pi(f))\)

Show that this defines a valid probabilistic database, i.e., that the sum of probabilities of the possible worlds is always equal to \(1\).

\(T\) containing the following facts: Teaches(Charles, Databases) with probability \(0.8\), Teaches(Mikaël, Databases) with probability \(0.5\), Teaches(Sylvain, Functional Programming) with probability \(1\). What are the possible worlds? If \(q\) is the Boolean query “there are two different people teaching the same course, what is \(\Pr(T\models q)\)? Same question if we add the fact Teaches(Jean, Databases).

This is a much more natural way of representing probabilistic databases! Note that if we have \(n\) facts, then this concisely represents \(2^n\) possible worlds.

Unfortunately this conciseness comes at a price: first, we will see (exercices) that not all PDBs can be represented as a TID. Second, we will see that the complexity of PQE can quickly become high even for very simple queries.

### Bloc-independent PDBs (BIDs)

TODO define if time

# Complexity of PQE on TIDs

Here we will give a quick overview on PQE over TIDs, in data complexity. For a Boolean query \(q\), let us denote \(\pqetid(q)\) the following problem:

\(\pqetid(q)\) is the following problem:

**INPUT**: A TID \(T = (D,\pi)\).**OUTPUT**: \(Pr_\pi(T\models q)\).

For some queries, for instance CQs, this problem is in polynomial time:

Show that for \(q = \exists x: R(x,x)\), \(\pqetid(q)\) is in PTIME. (Done in class)

However we will see that for some other queries the problem is (highly) intractable. To do this, we need to introduce the complexity class #P.

## Interlude on the complexity class #P

Informally, #P is the class that corresponds to counting the number of solutions of problems that are in NP.

Formally, #P is the set of all functions \(f:\{0,1\}^* \to \mathbb{N}\) such that there exists a PTIME nondeterministic Turing machine \(M\) such that for all \(x\in \{0,1\}^*\), \(f(x)\) is precisely the number of accepting runs of \(M\) on \(x\).

See the Wikipedia page for more details.

You already know that SAT, or even 3-CNF SAT, is NP-complete. It turns out that this carries over to the counting versions and to #P-completeness.

#SAT is the following problem:

**INPUT**: A propositional formula \(\varphi\) over variables \(X\)**OUTPUT**: The number of valuations \(\tau:X\to \{0,1\}\) that satisfy \(\varphi\).

**Theorem:** #SAT is #P-complete, even when restricted
to 3-CNF formulas. (This means that #SAT is in #P, and that it is
#P-hard, i.e., for every other problem A in #P, there exists a
polynomial-time reduction from A to #SAT.)

Explain (done in class) why the problems #SAT, #VERTEX-COVER, #CLIQUES are in #P. (It turns out that these are also #P-complete.)

To show hardness of a simle query for PQE, we will need a stronger result than #SAT for 3-CNFs is #P-hard.

Define #PP2DNF (for *Partitionned Positive 2-DNFs*) to be the
following problem:

#PP2DNF is the following problem:

**INPUT**: A PP2DNF formula. This is a DNF formula over variables \(X,Y\) with \(X\cap Y = \emptyset\) in which every clause of the formula uses one variable from \(X\) and one variable from \(Y\), without negation. So in all generality such a formula can be written as \(\varphi = \bigvee_{i=1}^m (x_{i,j} \land y_{i,j})\), with \(x_{i,j} \in X\) and \(y_{i,j} \in Y\).**OUTPUT**: The number of valuations \(\tau:X\to \{0,1\}\) that satisfy \(\varphi\).

**Theorem:** #PP2DNF is #P-complete. (We won’t prove it
in this course though.)

## Example of a CQ for which PQE is #P-hard

**Proposition:** Let \(q_{\text{RST}}\) be the BCQ \(q_{\text{RST}} = \exists x,y: R(x) \land S(x,y)
\land T(y)\). Then \(\pqetid(q_{\text{RST}})\) is #P-hard.

Done in class.

What is interesting is that the model checking problem for this query is PTIME (since this is a FO query), but the probabilistic variant is intractable.

## Dichotomy theorems

**Theorem: ( Dichotomy theorem for UCQs [Dalvi &
Suviu, 2012])** For any Boolean UCQ \(q\), the problem

**PQE(**\(q\)

**)**is either PTIME or #P-hard.

This is an important theorem in the field, whose proof is about 100 pages long, so we will not cover it this course. Note that the result is not obvious, since it is not the case that all counting problems in #P are either in PTIME or #P-complete (similarly to Ladner’s theorem.

We will however prove a (much) simpler version of it, for so-called
*self-join free Boolean CQs*.

We need a few definitions.

A BCQ \(q\) is *self-join
free* (written then SJFBCQ) if all its relational symbols are
distinct. For instance the query \(q_\text{RST}\) from above is self-join
free, whereas the query \(\exists x,y
R(x,y)\land R(y,x)\) is not.

For \(q\) a BCQ and \(x\in \vars(q)\), denote by \(\atoms(x)\) the set of atoms of \(q\) in which \(x\) occurs. For instance for the query \(q_{\text{RST}}\) we have \(\atoms(x) = \{ R(x), S(x,y)\}\) and \(\atoms(y) = \{S(x,y),T(y)\}\).

Call a SJFBCQ *hierarchical* if, for every \(x,y\in \vars(q)\), we have either \(\atoms(x)\cap \atoms(y) = \emptyset\), or
\(\atoms(x)\subseteq \atoms(y)\), or
\(\atoms(y)\subseteq \atoms(x)\). For
instance \(q_\text{RST}\) is not
hierarchical whereas the query \(\exists x,y
R(x) \land S(x,y)\) is.

We will prove:

**Theorem: (dichotomy theorem for SJFBCQs)** Let \(q\) be a SJFBCQ. If \(q\) is hierarchical then \(\pqetid(q)\) is PTIME, otherwise it is
#P-hard.

For the hardness part, assuming that \(q\) is not hierarchical, then we know that there are variables \(x,y\in \vars(q)\) and an atom \(R(\bar{t_1})\) in \(q\) such that \(x\) occurs in \(\bar{t_1}\) but not \(y\), another atom \(S(\bar{t_1})\) such that both \(x\) and \(y\) occur in \(\bar{t_2}\), and another atom \(T(\bar{t_3})\) such that \(y\) occurs in \(\bar{t_3}\) but not \(x\). Then we can do a reduction from #PP2DNF using the same idea that we used for the query \(q_{\text{RST}}\).

Work out the details of the reduction by yourself. (Hint: for the positions that do not correspond to \(x\) or \(y\), use the same constant in the construction.)

To show the other direction, let us first define the hypergraph associated to \(q\), denoted \(\calH_q = (V,E)\), and defined as follows:

- The nodes \(V\) of \(\calH_q\) are the atoms of \(q\);
- For every variable \(x\in \vars(q)\), we have an hyperedge that is \(\atoms(x)\).

Then, when we try to draw the hypergraph of a hierarchical SJFBCQ, we understand why these queries are called hierarchical! From there, we can see how an algorithm is going to work, using the fact that there is no self-join and the fact that the facts are independent of each others (similar to what we did on the examples with the relation Teaches).

Compiled the: mar. 22 oct. 2024 11:40:58 CEST