# Topological Sorting under Regular Constraints

# Topological Sorting under Regular Constraints

By Antoine Amarilli and Charles Paperman.

This page presents the constrained topological sorting and constrained shuffle problems, and some of our results and open questions related to these problems. It is a complement to our paper, which was presented at ICALP’18.

## Problem definitions

Fix an alphabet \(A\). An *\(A\)-DAG* is a directed
acyclic graph \(G\) where each
vertex is labeled by a letter of \(A\).
A topological
sort of \(G\) is a linear ordering
of the vertices that respects the edges of the DAG, i.e., for every edge
\((u, v)\) of \(G\), the vertex \(u\) is enumerated before \(v\). The topological sort *achieves*
the word of \(A^*\) formed by
concatenating the labels of the vertices in the order where they are
enumerated.

Fix a language \(L \subseteq A^*\).
The *constrained topological sort problem for \(L\)*, written \(\text{CTS}\!\left[L\right]\) asks, given an
\(A\)-DAG \(G\), whether there is a topological sort of
\(G\) that achieves a word of \(L\).

One problem variant is the *multi-letter setting* where the
input DAG is an *\(A^*\)-DAG*,
where the vertices are labeled by a word of \(A^*\), i.e., a topological sort achieves
the word obtained by concatenating the labels of the vertices, but the
words labeling each vertex cannot be interleaved with anything else.
However in this page we mostly focus on the *single-letter*
setings, i.e., \(A\)-DAGs.

Our current main results on the CTS-problem are presented in our paper. We show that \(\text{CTS}\!\left[L\right]\) is in NL for some regular languages \(L\), and is NP-hard for some other regular languages.

**Main dichotomy conjecture:** For every regular
language \(L\), either \(\text{CTS}\!\left[L\right]\) is in NL or
\(\text{CTS}\!\left[L\right]\) is
NP-hard.

## Restrictions on the input DAG

When the input DAG \(G\) is an union
of paths, the
problem is called *constrained shuffle problem* (CSh), because a
topological sort of \(G\) corresponds
to an interleaving of the strings represented by the paths.

We can consider the problem where the input DAG has bounded *height*, where
the height of a DAG is defined as the length of the longest directed
path.

We can consider the problem where the input DAG has bounded
*width*, where the width
of a DAG is the size of its largest *antichain*, i.e.,
subset of pairwise incomparable vertices. In the case of the CSh
problem, the width is the number of paths.

**Theorem:** For any constant \(k\) and regular language \(L\), the problem \(\text{CTS}\!\left[L\right]\) on DAGs of
width \(k\) is in NL.

*Proof idea:* see Proposition C.2 in the preprint.

**Theorem:** For any constant \(k\) and regular language \(L\), the problem \(\text{CSh}\!\left[L\right]\) on height
\(k\) is in NL.

*Proof idea:* there are only constantly many possible chains,
do a dynamic algorithm on the number of occurrences of each type.

**Theorem:** The problem \(\text{CTS}\!\left[(ab)^*\right]\) on input
DAGs of height \(3\) is NP-hard.

*Proof idea:* reduction from the \(k\)-clique problem similar to
Proposition 94 of amarilli2017possible.

**Remark:** Following the same idea, we can design a
language such that the CTS-problem for \(L\) is hard on input DAGs of height \(2\), e.g., \((a
a')^* (b b')^* (a | b)^*\).

**Open question:** Is \(\text{CTS}\!\left[(ab)^*\right]\) hard on
input DAGs of height \(2\)? In
particular, is it hard for input DAGs where the top layer is labeled by
\(a\) and the bottom layer is labeled
by \(b\)?

**Open direction:** What about the case where the input
DAG is a tree or a forest?

## Classifying regular languages

### Hardness results

**Theorem:** The CSh problem for \((ab)^*\) is NP-hard

*Proof:* see preprint, Theorem 3.6.

**Theorem:** The CSh problem for \((ab+a)^*\) is NP-hard

*Proof:* Reduce from \((ab)^*\) on DAGs with as many \(a\)’s as \(b\)’s. This is Proposition 3.9 of the preprint

**Theorem:** The CSh problem for \((aa+bb)^*\) is NP-hard

*Proof:* see preprint, Proposition
3.8.

**Theorem:** For any word \(u\) containing two different letters, the
problem \(\text{CSh}\!\left[u^*\right]\) is
NP-hard.

*Proof:* see preprint, Proposition
3.7.

**Conjecture:** Let \(F\) be a finite language. Assume that there
is a letter \(a\) such that \(F\) contains a word that contains \(a\), but \(F\) does not contain any power of \(F\) (i.e., \(F
\cap a^+ = \emptyset\)). Then CSh is NP-hard for the language
\(F^*\). (This is Conjecture 3.10 in
the preprint.)

*Proof idea:* Reason on the maximal consumption rate of \(a\)’s in the language, and do a so-called
“shuffle reduction” (see preprint) that enforces this
consumption rate. Could be extended to a proof on automata of a specific
shape.

### Tractable cases

#### General results

A *monomial* is a language of the form \(a_1 A_1^* \cdots a_{n-1} A_{n-1}^* a_n\),
where \(a_i \in A\) and \(A_i \subseteq A\) for all \(i\).

**Theorem:** The CTS problem for monomials is in NL.

*Proof:* see preprint, Theorem 4.3.

A *group language* is a language whose syntactic
monoid is a group. A
*district group monomial* is a language of the form \(a_1 G_1 \cdots a_{n-1} G_{n-1} a_n\), where
\(a_i \in A\) for all \(i\) and where each \(G_i\) is a group language on some alphabet
\(A_i \subseteq A\) (note that this
generalizes monomials).

**Theorem:** The CSh problem is tractable for group
languages and for district group monomials.

*Proof:* see preprint, Theorem 6.2.

**Conjecture:** The CTS problem is tractable for group
languages and for district group monomials.

#### Width-based results

**Theorem:** For any regular language \(L\), for any integers \(i\) and \(j\), the problem \(\text{CTS}\!\left[L+(a^i+b^j)^*\right]\) is
in NL.

*Proof:* see preprint, Proposition 4.5,
essentially the problem is trivial unless the DAG width is bounded.

**Corollary:** For any depth bound \(k\), the complement of the Dyck language of
depth bounded by \(k\) is
tractable.

These tractable languages are all in the class W of monoids (see this paper), and they traverse the dot-depth hierarchy (see this paper), implying that there are tractable languages that are arbitrary high in the dot-depth hierarchy.

**Theorem:** The problem \(\text{CTS}\!\left[(ab)^*+A^*aaA^*\right]\)
is in NL.

*Proof:* see preprint, Proposition
4.4.

**Conjecture:** For any regular language \(L\) and integer \(i\), the problem \(\text{CTS}\!\left[L+A^*a^iA^*\right]\) is
NP-hard.

*Proof idea:* see below.

#### Ad-hoc cases

**Theorem:** The CSh problem for \((aa+b)^*\) is in NL.

*Proof idea:* see preprint, Proposition
4.6.

**Open problem:** Is the same true of the CTS problem?
Is the same true of languages \((a^i+b)^*\)?

**Theorem:** The CTS-problem is in PTIME for \((ab)^*(\epsilon+bA^*)\).

*Proof idea:* see Proposition D.3 in preprint (ad-hoc greedy
algorithm).

**Open problem:** is this also in NL?

**Theorem:** The CSh-problem for \((a^+b^+a^+b^+)^*\) is in NL.

*Proof idea:* see Proposition 6.3 in preprint.

There are still various examples of ad hoc languages that we do not know how to classify, for instance \((aa+bb+ab)^*\), which we conjecture to be tractable for CSh, but where we do not know the complexity for CTS:

**Conjecture:** The CSh-problem for \((aa+bb+ab)^*\) is in PTIME.

*Possible proof direction:* assume that the number of elements
in the input instance is even as the problem is easy otherwise. If there
is no dominating chain (i.e., of length larger than the sum of the other
lengths), then we can always consume one element from the largest chain
and one from another chain and go to a situation with no dominating
chain, and when we consume two incomparable elements we can always avoid
\(ab\). If there is a dominating chain,
we can do a greedy algorithm following this chain, making insertions
from the other chains when necessary, until the chain is no longer
dominating.

## Beyond regular languages

**Theorem:** The CTS-problem for the Dyck
language with two delimiters is NP-hard.

*Proof idea:* should be similar to the proof for DAGs of
height 3.

**Theorem:** The CTS-problem for the Dyck language with
one delimiter plus a neutral letter is NP-hard.

*Proof idea:* see conclusion of preprint.

**Corollary:** The *multi-letter* CTS problem for
the Dyck language (i.e., the language of well-parenthesized words on
\(a, b\)) is NP-hard.

**Theorem:** The *multi-letter* CSh problem for
the Dyck language is in PTIME.

*Proof idea:* This follows for tractability of the problem on
*series-parallel* DAGs, see reference to Garey-Johson given in
conclusion of preprint.

**Open problem:** Is the CTS problem for the Dyck
language NP-hard? or is it in PTIME? In other words, do we need the
neutral letter to show hardness for the Dyck language?

## DAG decompositions

Fix the alphabet \(A = \{a, b\}\).
Given an \(A\)-DAG \(G\), the *min-width* of an antichain
of \(G\) is the number of occurrences
of the least frequent symbol. The min-width of \(G\) is the maximum min-width of an
antichain of \(G\). We would now need
an analogue of Dilworth’s
theorem for labeled DAGs:

**Open problem:** For any integer \(k\), how can we characterize the \(A\)-DAGs with min-width less than \(k\)?

*Ideas:* there is a constant-width part and a “quasi”-serial
part that alternates between levels of \(a\)-labeled vertices and \(b\)-labeled vertices, where quasi-serial
means that every sufficiently large antichain is almost completely
included in one level. See this CStheory
question and the draft linked there for details. However, we do not
know how to generalize this to larger alphabets.

The goal of this approach is to use it to solve the problem by a case
study: either the input DAG has a large min-width, or we can decompose
it using this result to parts of small \(a\)-width or small \(b\)-width, where the *\(a\)-width* of an \(A\)-DAG is the maximum number of \(a\)’s in an antichain, and *\(b\)-width* is similarly defined.

**Open problem:** For any integer \(k\) and regular language \(L\), is the CTS-problem in NL on \(A\)-DAGs of \(a\)-width \(\leq
k\)?

Obviously this is the case for the CSh problem as we can do a dynamic algorithm on the constant number of chains that contain an \(a\), but we do not know how to show it for CTS.

The case \(k=1\) is solved on CStheory question through a dynamic programming approach using the transition monoid of the automaton. This implies that the problem \(\text{CTS}\!\left[L+\Sigma^*aa\Sigma^*\right]\) is tractable for any regular language \(L\). However, the case \(k=2\) is still open and we do not know whether it is in PTIME or NP-hard.

## Main open problems

The main open questions would be to show (or disprove) the main dichotomy conjecture. Here are some intermediate steps towards this end

Is the CTS problem tractable for groups? (There is a CStheory question about this.) Is it tractable for district group monomials?

Characterizing the \(A\)-DAGs with small min-width (see here)

Is the CTS problem tractable for any language having the “big antichain” property, i.e., whenever there is a large antichain containing all letters of the alphabet then we can achieve all words? We could show this for CSh, but we do not know how to show it for CTS.

Are there languages on \(\{a, b\}\) that are hard even on DAGs with bounded \(a\)-width

Can we show hardness for a general family of languages or automata? E.g., the languages having a “bounded consumption rate” of some letter?

What is the complexity of the ad-hoc language examples for which we do not know it?

What is the complexity of CTS and CSh when the input DAGs is restricted to be a tree or a forest?

## Other observations

Tractable languages for CSh are closed under union but are not closed under intersection, complement, inverse of morphism, and quotient; see preprint, Proposition 5.1. The same is true of CTS except that closure under quotient holds in this case. In addition to this, we now know the following:

**Theorem:** The languages that are tractable for the
CTS-problem are not closed under the concatenation operation.

*Proof idea:* take \(L_1 = (caa)^∗ +
A^∗ ccA^∗\) and \(L_2 = (bbc)^∗ (a +
b)^∗ + A^∗ ccA^∗\). The CTS-problem for \(L_1\) and for \(L_2\) are in NL by a greedy algorithm,
however the problem for \(L_1 L_2\) is
NP-hard using Proposition 94 of this paper.

**Open problem:** Is the same true for the
CSh-problem?

## Problem variants

We can consider a version where, instead of fixing one language, we fix a semiautomaton, and can choose sets of initial-final states, to make tractability closed under several operations. See preprint, Section 5.

We can also consider two alternative problem phrasings, one where we
look for a *path* of the DAG realizing the input language, and
one where we look for a *sub-DAG*. However, these two problems
are tractable:

**Theorem:** Fixing a regular language \(L\), given an \(A\)-DAG, we can decide in linear time
whether \(G\) has a *path*
realizing a word of \(L\).

*Proof idea:* do the product of \(G\) with an automaton for \(L\).

**Theorem:** Fixing a regular language \(L\), given an \(A\)-DAG, we can decide in NL whether there
is a sub-DAG of \(G\) which is a
positive instance to \(\text{CTS}\!\left[L\right]\).

*Proof idea:* assume wlog that \(L\) is closed under supersequences, and use
Higman’s
lemma to argue that \(L\) can be
rewritten as a sum of \(A^*w_i A^*\)
for some fixed words \(w_i\).

Compiled the: dim. 07 janv. 2024 23:19:29 CET