Topological Sorting under Regular Constraints

By Antoine Amarilli and Charles Paperman.

This page presents the constrained topological sorting and constrained shuffle problems, and some of our results and open questions related to these problems. It is a complement to our paper, which will be presented at ICALP’18.

Problem definitions

Fix an alphabet A. An A-DAG is a directed acyclic graph G where each vertex is labeled by a letter of A. A topological sort of G is a linear ordering of the vertices that respects the edges of the DAG, i.e., for every edge (u, v) of G, the vertex u is enumerated before v. The topological sort achieves the word of A* formed by concatenating the labels of the vertices in the order where they are enumerated.

Fix a language L ⊆ A*. The constrained topological sort problem for L, written CTS​[L] asks, given an A-DAG G, whether there is a topological sort of G that achieves a word of L.

One problem variant is the multi-letter setting where the input DAG is an A*-DAG, where the vertices are labeled by a word of A*, i.e., a topological sort achieves the word obtained by concatenating the labels of the vertices, but the words labeling each vertex cannot be interleaved with anything else. However in this page we mostly focus on the single-letter setings, i.e., A-DAGs.

Our current main results on the CTS-problem are presented in our paper. We show that CTS​[L] is in NL for some regular languages L, and is NP-hard for some other regular languages.

Main dichotomy conjecture: For every regular language L, either CTS​[L] is in NL or CTS​[L] is NP-hard.

Restrictions on the input DAG

When the input DAG G is an union of paths, the problem is called constrained shuffle problem (CSh), because a topological sort of G corresponds to an interleaving of the strings represented by the paths.

We can consider the problem where the input DAG has bounded height, where the height of a DAG is defined as the length of the longest directed path.

We can consider the problem where the input DAG has bounded width, where the width of a DAG is the size of its largest antichain, i.e., subset of pairwise incomparable vertices. In the case of the CSh problem, the width is the number of paths.

Theorem: For any constant k and regular language L, the problem CTS​[L] on DAGs of width k is in NL.

Proof idea: see Proposition C.2 in the preprint.

Theorem: For any constant k and regular language L, the problem CSh​[L] on height k is in NL.

Proof idea: there are only constantly many possible chains, do a dynamic algorithm on the number of occurrences of each type.

Theorem: The problem CTS​[(ab)*] on input DAGs of height 3 is NP-hard.

Proof idea: reduction from the k-clique problem similar to Proposition 94 of amarilli2017possible.

Remark: Following the same idea, we can design a language such that the CTS-problem for L is hard on input DAGs of height 2, e.g., (aa′)*(bb′)*(a|b)*.

Open question: Is CTS​[(ab)*] hard on input DAGs of height 2? In particular, is it hard for input DAGs where the top layer is labeled by a and the bottom layer is labeled by b?

Open direction: What about the case where the input DAG is a tree or a forest?

Classifying regular languages

Hardness results

Theorem: The CSh problem for (ab)* is NP-hard

Proof: see preprint, Theorem 3.6.

Theorem: The CSh problem for (ab + a)* is NP-hard

Proof: Reduce from (ab)* on DAGs with as many a’s as b’s. This is Proposition 3.9 of the preprint

Theorem: The CSh problem for (aa + bb)* is NP-hard

Proof: see preprint, Proposition 3.8.

Theorem: For any word u containing two different letters, the problem CSh​[u*] is NP-hard.

Proof: see preprint, Proposition 3.7.

Conjecture: Let F be a finite language. Assume that there is a letter a such that F contains a word that contains a, but F does not contain any power of F (i.e., F ∩ a+ = ∅). Then CSh is NP-hard for the language F*. (This is Conjecture 3.10 in the preprint.)

Proof idea: Reason on the maximal consumption rate of a’s in the language, and do a so-called “shuffle reduction” (see preprint) that enforces this consumption rate. Could be extended to a proof on automata of a specific shape.

Tractable cases

General results

A monomial is a language of the form a1A1*an − 1An − 1*an, where ai ∈ A and Ai ⊆ A for all i.

Theorem: The CTS problem for monomials is in NL.

Proof: see preprint, Theorem 4.3.

A group language is a language whose syntactic monoid is a group. A district group monomial is a language of the form a1G1an − 1Gn − 1an, where ai ∈ A for all i and where each Gi is a group language on some alphabet Ai ⊆ A (note that this generalizes monomials).

Theorem: The CSh problem is tractable for group languages and for district group monomials.

Proof: see preprint, Theorem 6.2.

Conjecture: The CTS problem is tractable for group languages and for district group monomials.

Width-based results

Theorem: For any regular language L, for any integers i and j, the problem CTS​[L+(ai+bj)*] is in NL.

Proof: see preprint, Proposition 4.5, essentially the problem is trivial unless the DAG width is bounded.

Corollary: For any depth bound k, the complement of the Dyck language of depth bounded by k is tractable.

These tractable languages are all in the class W of monoids (see this paper), and they traverse the dot-depth hierarchy (see this paper), implying that there are tractable languages that are arbitrary high in the dot-depth hierarchy.

Theorem: The problem CTS​[(ab)*+A*aaA*] is in NL.

Proof: see preprint, Proposition 4.4.

Conjecture: For any regular language L and integer i, the problem CTS​[L+A*aiA*] is NP-hard.

Proof idea: see below.

Ad-hoc cases

Theorem: The CSh problem for (aa + b)* is in NL.

Proof idea: see preprint, Proposition 4.6.

Open problem: Is the same true of the CTS problem? Is the same true of languages (ai + b)*?

Theorem: The CTS-problem is in PTIME for (ab)*(ϵ + bA*).

Proof idea: see Proposition D.3 in preprint (ad-hoc greedy algorithm).

Open problem: is this also in NL?

Theorem: The CSh-problem for (a+b+a+b+)* is in NL.

Proof idea: see Proposition 6.3 in preprint.

There are still various examples of ad hoc languages that we do not know how to classify, for instance (aa + bb + ab)*, which we conjecture to be tractable for CSh, but where we do not know the complexity for CTS:

Conjecture: The CSh-problem for (aa + bb + ab)* is in PTIME.

Possible proof direction: assume that the number of elements in the input instance is even as the problem is easy otherwise. If there is no dominating chain (i.e., of length larger than the sum of the other lengths), then we can always consume one element from the largest chain and one from another chain and go to a situation with no dominating chain, and when we consume two incomparable elements we can always avoid ab. If there is a dominating chain, we can do a greedy algorithm following this chain, making insertions from the other chains when necessary, until the chain is no longer dominating.

Beyond regular languages

Theorem: The CTS-problem for the Dyck language with two delimiters is NP-hard.

Proof idea: should be similar to the proof for DAGs of height 3.

Theorem: The CTS-problem for the Dyck language with one delimiter plus a neutral letter is NP-hard.

Proof idea: see conclusion of preprint.

Corollary: The multi-letter CTS problem for the Dyck language (i.e., the language of well-parenthesized words on a, b) is NP-hard.

Theorem: The multi-letter CSh problem for the Dyck language is in PTIME.

Proof idea: This follows for tractability of the problem on series-parallel DAGs, see reference to Garey-Johson given in conclusion of preprint.

Open problem: Is the CTS problem for the Dyck language NP-hard? or is it in PTIME? In other words, do we need the neutral letter to show hardness for the Dyck language?

DAG decompositions

Fix the alphabet A = {a, b}. Given an A-DAG G, the min-width of an antichain of G is the number of occurrences of the least frequent symbol. The min-width of G is the maximum min-width of an antichain of G. We would now need an analogue of Dilworth’s theorem for labeled DAGs:

Open problem: For any integer k, how can we characterize the A-DAGs with min-width less than k?

Ideas: there is a constant-width part and a “quasi”-serial part that alternates between levels of a-labeled vertices and b-labeled vertices, where quasi-serial means that every sufficiently large antichain is almost completely included in one level. See this CStheory question and the draft linked there for details. However, we do not know how to generalize this to larger alphabets.

The goal of this approach is to use it to solve the problem by a case study: either the input DAG has a large min-width, or we can decompose it using this result to parts of small a-width or small b-width, where the a-width of an A-DAG is the maximum number of a’s in an antichain, and b-width is similarly defined.

Open problem: For any integer k and regular language L, is the CTS-problem in NL on A-DAGs of a-width  ≤ k?

Obviously this is the case for the CSh problem as we can do a dynamic algorithm on the constant number of chains that contain an a, but we do not know how to show it for CTS.

Ideas: For L = (ab)* and k = 1, we know that this holds, but we do not know how to generalize it. For k = 1 we should have an NL algorithm for L = u*. Otherwise we hope it’s PTIME but we don’t have an algorithm yet. For k = 2 we don’t know, maybe it’s NP-hard.

Main open problems

The main open questions would be to show (or disprove) the main dichotomy conjecture. Here are some intermediate steps towards this end

Other observations

Tractable languages for CSh are closed under union but are not closed under intersection, complement, inverse of morphism, and quotient; see preprint, Proposition 5.1. The same is true of CTS except that closure under quotient holds in this case. In addition to this, we now know the following:

Theorem: The languages that are tractable for the CTS-problem are not closed under the concatenation operation.

Proof idea: take L1 = (caa) + AccA and L2 = (bbc)(a + b) + AccA. The CTS-problem for L1 and for L2 are in NL by a greedy algorithm, however the problem for L1L2 is NP-hard using Proposition 94 of this paper.

Open problem: Is the same true for the CSh-problem?

Problem variants

We can consider a version where, instead of fixing one language, we fix a semiautomaton, and can choose sets of initial-final states, to make tractability closed under several operations. See preprint, Section 5.

We can also consider two alternative problem phrasings, one where we look for a path of the DAG realizing the input language, and one where we look for a sub-DAG. However, these two problems are tractable:

Theorem: Fixing a regular language L, given an A-DAG, we can decide in linear time whether G has a path realizing a word of L.

Proof idea: do the product of G with an automaton for L.

Theorem: Fixing a regular language L, given an A-DAG, we can decide in NL whether there is a sub-DAG of G which is a positive instance to CTS​[L].

Proof idea: assume wlog that L is closed under supersequences, and use Higman’s lemma to argue that L can be rewritten as a sum of A*wiA* for some fixed words wi.