How to implicit hitting set

A while ago, a colleague of mine presented a paper on minimum-size decision trees, and we started discussing its lower bounding strategies. At some point, we had a conversation that went approximately like this:1

— So, they [authors of the paper] employ a few pruning rules, all of them corresponding to some set cover problem.
— Wait, we have already played these games :) I am pretty sure we have seen a strategy somewhere for solving these with set covering.
— Of course we did, it is a common approach to solving MaxSAT problems.

After we worked through the paper, we recollected a few more applications of this idea from previous seminars: the broad strategy of using implicit hitting sets. While that specific conversation eventually petered out, it got me thinking: for such a conceptually clear idea with so many ways to apply it, there are precious few resources for learning it! So, today, I would like to distill the conversations I had into what I hope will be a clear explanation of how to implicit hitting set.

A motivating example: solving the knapsack with IHS

Implicit hitting set is a rich idea that generalizes to various NP (and even some beyond-NP) problems. But first, let’s start with something much simpler than decision trees: the 0-1 knapsack problem. As a refresher, in this problem, given a set of items with weights and values, you have to select a subset of items such that:

  • the total weight does not exceed the knapsack’s capacity,
  • and the total value is maximized.

There are many ways to solve this problem, but for now, we will try doing it by trial and error. For example, let’s try solving the following problem (capacity is 6 units of weight):

Item | Weight | Value
A    | 4      | 5
B    | 2      | 4
C    | 5      | 8

We want to get as much value as possible out of this. You would be justified in exploring valid solutions in some orderly manner, moving from one solution to a better one. But what if, instead of building a solution from scratch, we ask: “What went wrong last time?” That’s the idea behind implicit hitting sets: a dual search approach that solves problems by repeatedly refining its guesses based on counterexamples.

Back to our example; without any extra information, we can try being greedy:

Wouldn’t it be great to take all three items? We would get 17 units of value.

Unfortunately, no.

…but why?

Well, because you cannot take eleven units of weight, you only have six! In fact, you cannot even take B and C together.

Too bad, but good to know that B and C are incompatible. Can I at least take A and C then? 13 units of value is also good.

Still no, but I know what you will ask me; please don’t take A and C again.

Hm, so I have two options now. I can keep C, but now I cannot take anything else. I can drop C; would it be good now to take A and B?

Good job. Enough reading, though; try this logic for yourself! The interactive demo below provides an implementation; you are also welcome to play around with various conflict choices (which are yours to make). What is the smallest collection of conflicts you need to build en route to the optimal solution?


Demo: IHS knapsack explorer

[Interactive demo: pick a capacity, then explore item subsets and the conflicts they produce.]

The general idea

First, let’s have a look at what we have just done. In this approach to the knapsack problem, we maintained a set of known conflicts and repeated the following:

  1. Find the best solution that avoids all known conflicts.
  2. Is it feasible? Then stop, because it is also optimal.
  3. Otherwise, our solution is infeasible: inspect it, extract a conflict, add it to the conflict set, and repeat.
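To make the loop concrete, here is a minimal sketch in Python for the knapsack example above. Brute-force enumeration stands in for the “best solution avoiding known conflicts” oracle, so treat it as an illustration, not an efficient solver; all names are my own.

```python
from itertools import combinations

def ihs_knapsack(items, capacity):
    """0-1 knapsack via the implicit-hitting-set loop.
    `items` maps a name to a (weight, value) pair."""
    weight = lambda s: sum(items[i][0] for i in s)
    value = lambda s: sum(items[i][1] for i in s)
    conflicts = []  # sets of items that cannot all be taken together
    while True:
        # 1. Best-value subset avoiding every known conflict.
        best = max(
            (set(sub) for r in range(len(items) + 1)
             for sub in combinations(items, r)
             if not any(c <= set(sub) for c in conflicts)),
            key=value,
        )
        # 2. Feasible? Then it is also optimal.
        if weight(best) <= capacity:
            return best
        # 3. Minimize the conflict: drop items while still overweight.
        conflict = set(best)
        for i in list(conflict):
            if weight(conflict - {i}) > capacity:
                conflict.discard(i)
        conflicts.append(conflict)

items = {"A": (4, 5), "B": (2, 4), "C": (5, 8)}
print(ihs_knapsack(items, 6))  # the optimum {'A', 'B'}: value 9, weight 6
```

On this instance the loop learns the conflicts {B, C} and {A, C} from the dialogue above and then stops at {A, B}.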

And now the best part: this approach does not rely on anything knapsack-specific. All it requires is that:

  • You can state the problem in terms of “valid” versus “invalid” configurations.
  • Given an invalid configuration, you can extract a conflict out of it: a set of elements that cannot be added together.
  • Given a set of conflicts, you can find the “largest” configuration that avoids all of them.

The “duality” comes from this loop: instead of searching over the space of solutions (primal) until you exhaust it, you explore the space of explanations (dual) until you know enough of it. This workflow is especially powerful when your problem has no compact formulation but can provide you with counterexamples when you ask for them. You don’t need to model the whole thing up front: just keep learning what doesn’t work.

Here is a graphical summary of this technique:

You can also narrate it in the minimization form:

  • You can state the problem in terms of “valid” versus “invalid” configurations. (same)
  • Given an invalid configuration, you can extract a core out of it: a set of elements that cannot all be kept together.
  • Given a set of cores, you can find the “smallest” set of discarded elements that covers (hits) all the cores.

If you can fit your problem into this scheme, you can also solve it: at least, all the ingredients are in place!

But how can I apply this?

Where IHS shows up

In more than a few domains, actually! To give a few examples:

Optimization

First, let’s consider one of the most cited success stories of the implicit hitting set approach: the MaxSAT problem (Davies and Bacchus 2011). The problem has the following structure: given \(m\) clauses with weights \(w_1, \dotsc, w_m > 0\) over Boolean variables, find an assignment that minimizes the total weight of violated clauses.2

The problem structure suggests a natural conflict description: an unsatisfiable subset of the input clauses. (In the SAT literature, such a set is referred to as a core.) That, in turn, suggests a way to extract conflicts: given a set of clauses that are supposed to be satisfied, feed it into a SAT solver; here is what can happen:

  • The SAT solver returns a satisfying assignment: great, this is the optimal solution.
  • The SAT solver returns an UNSAT verdict: then we request an UNSAT core of the formula and add it to the core set.

Once we set up the same conflict-avoidant logic as in our knapsack example, we get the following IHS loop:

  1. Find the minimum-weight hitting set of all known cores.
  2. Feed the clauses outside the hitting set to a SAT solver; if a satisfying assignment is found, halt: it is optimal, and the hitting set’s weight is the optimum.
  3. Otherwise, recover an UNSAT core, add it to the core set, and repeat.
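The loop above fits in a page of Python if we let brute force stand in for both the SAT oracle and the hitting-set solver (which is exactly what a real implementation would replace with a SAT solver and an ILP solver). The names and the toy instance below are my own illustration.

```python
from itertools import combinations, product

def ihs_maxsat(clauses, weights, n_vars):
    """Minimum total weight of violated clauses, via the IHS loop.
    Clauses are tuples of integer literals (negative = negated)."""

    def sat(idxs):
        # Toy SAT oracle: try every assignment of the n_vars variables.
        return any(
            all(any(bits[abs(l) - 1] == (l > 0) for l in clauses[i])
                for i in idxs)
            for bits in product([False, True], repeat=n_vars)
        )

    def shrink(idxs):
        # Deletion-based core minimization: drop clauses while still UNSAT.
        core = set(idxs)
        for i in list(core):
            if not sat(core - {i}):
                core.discard(i)
        return frozenset(core)

    all_idx, cores = range(len(clauses)), []
    while True:
        # 1. Minimum-weight hitting set of all known cores (brute force).
        hs = min(
            (set(s) for r in range(len(clauses) + 1)
             for s in combinations(all_idx, r)
             if all(core & set(s) for core in cores)),
            key=lambda s: sum(weights[i] for i in s),
        )
        # 2. Are the clauses outside the hitting set satisfiable?
        rest = set(all_idx) - hs
        if sat(rest):
            return sum(weights[i] for i in hs)
        # 3. No: learn a new core and repeat.
        cores.append(shrink(rest))

# (x1) weight 2, (NOT x1) weight 3, (x2) weight 1: best is to violate (x1).
print(ihs_maxsat([(1,), (-1,), (2,)], [2, 3, 1], 2))  # 2
```

On the toy instance, the first iteration learns the core {(x1), (¬x1)}, and the second hits it with the cheaper clause (x1), leaving a satisfiable remainder.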

Model debugging

Just like you can find the largest satisfiable set, you can also find the smallest unsatisfiable one. This is an important problem when you, for example, are debugging a constraint model that is unsatisfiable but should not be, for some extra-modeling reasons (e.g., domain knowledge). More precisely, the problem, known as the smallest minimal unsatisfiable set (MUS) problem, is as follows: given \(m\) clauses3 \(\Omega = \{ \omega_1, \dotsc, \omega_m \}\) over Boolean variables, find its smallest unsatisfiable subset \(\Omega^* \subseteq \Omega\) (Ignatiev et al. 2015).

With a few examples under our belt, the dual of this problem is clear:

  • we need to avoid every satisfiable subset \(\Omega' \subset \Omega\),
  • so any unsatisfiable set has to contain at least one of the clauses in the set \(C := \Omega \setminus \Omega'\) of omitted clauses. Any subset of clauses \(C\) with this property (discarding it leaves a satisfiable formula) is called a correction subset. If discarding any clause from \(C\) breaks this property, like a reverse Jenga puzzle, then it is a minimal correction set (MCS).

With that said, here is the IHS approach we get:

  1. Find the minimum-weight hitting set of all known MCSes.
  2. Treat the hitting set itself as a SAT formula; if a solver declares it UNSAT, halt: it is the smallest MUS.4
  3. Recover an MCS and add it to the set of known MCSes; repeat.
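Step 3 hides the interesting part: how to recover an MCS. A standard way is to “grow” the satisfiable clause set into a maximal satisfiable subset and return its complement. A minimal sketch, with a toy brute-force SAT oracle standing in for a real solver (all names are my own):

```python
from itertools import product

def brute_sat(clauses, n_vars=2):
    # Toy SAT oracle over tuple-of-literals clauses (illustration only).
    return any(
        all(any(bits[abs(l) - 1] == (l > 0) for l in cl) for cl in clauses)
        for bits in product([False, True], repeat=n_vars)
    )

def grow_to_mcs(sat_subset, all_clauses, satisfiable):
    """Grow a satisfiable subset S of the clause set Omega into a maximal
    satisfiable subset; the complement of that subset is then an MCS."""
    s = set(sat_subset)
    for cl in all_clauses:
        if cl not in s and satisfiable(s | {cl}):
            s.add(cl)
    return set(all_clauses) - s

# Omega = {x1, NOT x1, x2, NOT x2}; growing from {x1} reaches the maximal
# satisfiable subset {x1, x2}, so the returned MCS is {NOT x1, NOT x2}.
omega = [(1,), (-1,), (2,), (-2,)]
print(grow_to_mcs([(1,)], omega, brute_sat))
```

Note the reverse-Jenga property in action: adding either clause of the returned MCS back to {x1, x2} makes the formula unsatisfiable.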

Inferring functional dependencies

You can also discover IHS formulations for enumeration problems, rather than just optimization ones. For a widely known example of such a problem, consider the situation described in (Mannila and Räihä 1994): you have a relational table, for example, like this one:

Title            | Release year | Length
West Lion        | 2006         | 159
Sassy Packer     | 2005         | 154
Jingle Sagebrush | 2005         | 159

Clearly, some of the columns are going to functionally determine the others; for example, if you know Title and Release year, you should know Length. However, not all of the functional dependencies are going to be “natural,” such as the relation (Length, Release year) → Title. Knowing the relations that hold in this table, even if not marked explicitly, can be helpful for all sorts of reasons:

  • Pattern mining: knowing that some if-else pattern holds unconditionally on your dataset is as good as it gets, if you are looking for dependencies.
  • Database design: if a relation has a valid functional dependency that does not correspond to some table key, making that dependency explicit is helpful for data integrity.
  • Query optimization: similarly, if a query optimizer knows that some column is defined by another, it can leverage such information to short-circuit some table joins or even avoid table lookups altogether (Siegel, Sciore, and Salveter 1992).

A good start would be to enumerate all such dependencies. Again, by the IHS playbook, we need to determine the following:

  1. When is \((a_1, \dotsc, a_n) \to b\) not a functional relation? That happens when some two rows agree on the columns \(a_1, \dotsc, a_n\) but disagree on \(b\); in other words, when \(p(a_1, \dotsc, a_n) = q(a_1, \dotsc, a_n)\) but \(p(b) \neq q(b)\) for some rows \(p, q\).
  2. How to define a conflict? Given the “witness” rows \(p\) and \(q\) and the dependent column \(b\), we cannot discard every column on which \(p\) and \(q\) differ from the set \(\Omega\) of all columns: at least one of them must appear on any valid left-hand side (similarly to the MUS example above).
  3. How to trim a conflict? Similarly to before, try adding more columns to the left-hand side \((a_1, \dotsc, a_n)\) without ending up discriminating the rows \(p\) and \(q\); once this is no longer possible, we have found a maximal non-functional left-hand side, and its complement is a minimal conflict.
  4. What do valid solutions look like? They should include at least one column from every conflict discovered as described above.
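These ingredients are easy to prototype. In the sketch below (my own illustration, reusing the table above), a conflict for a target column is the set of columns on which a pair of target-disagreeing rows differs; a left-hand side is valid exactly when it hits every such set:

```python
from itertools import combinations

def conflicts_for(rows, target):
    """For every pair of rows disagreeing on `target`, a valid left-hand
    side must contain at least one column on which the pair also differs;
    these 'difference sets' are the conflicts."""
    return {
        frozenset(a for a in p if a != target and p[a] != q[a])
        for p, q in combinations(rows, 2)
        if p[target] != q[target]
    }

def fd_holds(rows, lhs, target):
    # (lhs -> target) holds iff lhs hits every conflict.
    return all(set(lhs) & c for c in conflicts_for(rows, target))

rows = [
    {"Title": "West Lion", "Release year": 2006, "Length": 159},
    {"Title": "Sassy Packer", "Release year": 2005, "Length": 154},
    {"Title": "Jingle Sagebrush", "Release year": 2005, "Length": 159},
]
print(fd_holds(rows, {"Title", "Release year"}, "Length"))  # True
print(fd_holds(rows, {"Release year"}, "Length"))           # False
```

The second check fails because Sassy Packer and Jingle Sagebrush share the release year 2005 but have different lengths, so Release year alone cannot determine Length.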

Well, that was a lot; here is a short summary to match problems with their IHS ingredients:

Problem | Conflict structure | Conflict extraction | Hitting set interpretation
MaxSAT | UNSAT core | SAT solver run | A subset of clauses avoiding all UNSAT cores
Smallest MUS | Minimal correction subsets | “Grow” procedure (add clauses to a satisfiable subset, then complement it) | A subset of clauses that contains a clause from every MCS
Functional relation inference | Non-discriminating column sets | Pairwise row testing | A subset of columns that does not collate any pair of rows disagreeing on the right-hand side

Miscellaneous tricks

Of course, while this scheme is simple to describe, executing it well is not so simple. Effective implementations of the IHS workflow incorporate a variety of extra techniques; for example:

Conflict/core extraction heuristics. Better core extraction = fewer iterations. For example, one way to speed up MUS extraction is to observe that:

  • an MCS not only states that some subset is invalid, but also comes with an assignment that any MUS has to falsify (something that did not happen on the IHS iteration of interest);
  • and similar assignments (e.g., ones obtained by flipping a single bit) could yield different MCSes. This technique is known as model rotation and, while giving no formal guarantees, it is known to be helpful for MUS extraction (Marques-Silva and Lynce 2011).

Smarter hitting set solvers. Minimum hitting set computation is NP-hard; even if it is easier than handling most of the problems here directly, it is still not trivial. Therefore, hitting set solvers commonly employ tricks such as greedy approximations (a cheap bound between exact calls), incremental (I)LP solving that reuses work across iterations, and seeding the loop with several disjoint cores before the first exact call.
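The classic greedy approximation, for instance, repeatedly picks the element that hits the most not-yet-hit sets; it runs fast and its answer upper-bounds the optimum (with the usual logarithmic approximation factor of greedy set cover). A minimal sketch:

```python
def greedy_hitting_set(sets):
    """Greedy approximation of a minimum hitting set: repeatedly pick
    the element occurring in the most remaining (not-yet-hit) sets."""
    remaining = [set(s) for s in sets if s]
    hs = set()
    while remaining:
        # Count element occurrences across the remaining sets.
        counts = {}
        for s in remaining:
            for e in s:
                counts[e] = counts.get(e, 0) + 1
        best = max(counts, key=counts.get)
        hs.add(best)
        # Keep only the sets that are still not hit.
        remaining = [s for s in remaining if best not in s]
    return hs

print(greedy_hitting_set([{1, 2}, {2, 3}, {3, 4}]))  # e.g. {2, 3}
```

In an IHS solver, a result like this can serve as a candidate between expensive exact hitting-set calls: if the greedy candidate already yields a feasible (or satisfiable) configuration, the exact solver run can be skipped entirely.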

Conclusion

IHS dualization elegantly offloads search (candidate generation) onto explanation of infeasibility (conflict learning). Like any other versatile idea, it has many relatives in other fields; here are a few:

  • conflict-driven clause learning from SAT solving,
  • nogood learning in constraint programming,
  • cutting plane approaches in integer programming,
  • counterfactual explanations in machine learning,
  • counterexample-guided abstraction refinement in program synthesis.

Implicit Hitting Set dualization is a unifying perspective for tackling hard problems. Whether you’re debugging a program, optimizing a Boolean formula, or solving a puzzle, if it is easier to diagnose non-solutions than to enumerate valid solutions, IHS is a tool you want to consider.

  1. I do not recollect the conversation word for word, so the source for the quoted dialogue is “I made it up.” That said, I do recollect that we back-referenced from this paper to previous IHS applications. 

  2. The more common problem definition also discriminates between hard clauses that have to be satisfied unconditionally, and soft clauses that can be violated for a penalty. However, it is equivalent to the text definition: just set all hard clause weights to a number larger than the sum of soft clause weights, and declare the problem UNSAT if any of the hard clauses is violated in the optimal solution. In the blog text, I stick to the soft-only definition, as it makes the exposition of the approach more accessible. 

  3. But it might as well be an arbitrary constraint, albeit with less space for domain-specific techniques for correction set extraction. 

  4. Yes, this is the exact opposite of the MaxSAT condition; this only makes sense given the duality between those two problems. 

References

  1. Solving MAXSAT by Solving a Sequence of Simpler SAT Instances
    Jessica Davies, and Fahiem Bacchus
    In Principles and Practice of Constraint Programming – CP 2011, 2011
  2. Smallest MUS Extraction with Minimal Hitting Set Dualization
    Alexey Ignatiev, Alessandro Previti, Mark Liffiton, and 1 more author
    In Principles and Practice of Constraint Programming, 2015
  3. Algorithms for inferring functional dependencies from relations
    Heikki Mannila, and Kari-Jouko Räihä
    Data & Knowledge Engineering, 1994
  4. A method for automatic rule derivation to support semantic query optimization
    Michael Siegel, Edward Sciore, and Sharon Salveter
    ACM Trans. Database Syst., Dec 1992
  5. On Improving MUS Extraction Algorithms
    Joao Marques-Silva, and Ines Lynce
    In Theory and Applications of Satisfiability Testing - SAT 2011, Dec 2011
  6. Witty: An Efficient Solver for Computing Minimum-Size Decision Trees
    Luca Pascal Staus, Christian Komusiewicz, Frank Sommer, and 1 more author
    Dec 2024
  7. Implicit hitting set algorithms for reasoning beyond NP
    Paul Saikko, Johannes P. Wallner, and Matti Järvisalo
    In Proceedings of the Fifteenth International Conference on Principles of Knowledge Representation and Reasoning, Cape Town, South Africa, Dec 2016