How to implicit hitting set

A while ago, a colleague of mine presented a paper on minimum-size decision trees, and we started discussing its lower bounding strategies. At some point, we had a conversation that went approximately like this:1

— So, they [authors of the paper] employ a few pruning rules, all of them corresponding to some set cover problem.
— Wait, we have already played these games :) I am pretty sure we have seen a strategy somewhere for solving these with set covering.
— Of course we did, it is a common approach to solving MaxSAT problems.

After we worked through the paper, we recollected a few more applications of this idea from previous seminars: the broad strategy of using implicit hitting sets. While that specific conversation eventually petered out, it got me thinking: for such a conceptually clear idea with so many ways to apply it, there are precious few resources for learning it! So, today, I would like to distill the conversations I had into what I hope will be a clear explanation of how to implicit hitting set.

A motivating example: solving the knapsack with IHS

Implicit hitting set is a rich idea that generalizes to various NP (and even some beyond-NP) problems. But first, let’s start with something much simpler than decision trees: the 0-1 knapsack problem. As a refresher, in this problem, given a set of items with weights and values, you have to select a subset of items such that:

  • the total weight does not exceed the knapsack’s capacity,
  • and the total value is maximized.

There are many ways to solve this problem, but for now, we will try doing it by trial and error. For example, let’s try solving the following problem (capacity is 6 units of weight):

Item | Weight | Value
A    | 4      | 5
B    | 2      | 4
C    | 5      | 8

We want to get as much value as possible out of this. You would be justified in exploring valid solutions in some orderly manner, moving from one solution to a better one. But what if, instead of building a solution from scratch, we ask: “What went wrong last time?” That’s the idea behind implicit hitting sets: a dual search approach that solves problems by repeatedly refining its guesses based on counterexamples.

Back to our example; without any extra information, we can try being greedy:

Wouldn’t it be great to take all three items? We would get 17 units of value.

Unfortunately, no.

…but why?

Well, because you cannot take eleven units of weight, you only have six! In fact, you cannot even take B and C together.

Too bad, but good to know that B and C are incompatible. Can I at least take A and C then? 13 units of value is also good.

Still no, but I know what you will ask me; please don’t take A and C again.

Hm, so I have two options now. I can keep C, but now I cannot take anything else. I can drop C; would it be good now to take A and B?

Good job. Enough reading, though; try this logic for yourself! The interactive demo below provides an implementation; you are also welcome to play around with various conflict choices (which are yours to make). What is the smallest collection of conflicts you need to build en route to the optimal solution?


Demo: IHS knapsack explorer

[Interactive demo: pick a capacity, then explore item subsets and the conflicts they produce.]

The general idea

First, let’s have a look at what we have just done. In this approach to the knapsack problem, we maintained a set of known conflicts and repeated the following:

  1. Find the best solution that avoids all known conflicts.
  2. Is it feasible? Then stop, because it is also optimal.
  3. Otherwise, our solution is infeasible: inspect it, extract a conflict, add it to the conflict set, and repeat.
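To make the loop concrete, here is a minimal sketch in Python for the knapsack example above. Brute-force enumeration stands in for the “best solution avoiding known conflicts” oracle, so treat it as an illustration, not an efficient solver; all names are my own.

```python
from itertools import combinations

def ihs_knapsack(items, capacity):
    """0-1 knapsack via the implicit-hitting-set loop.
    `items` maps a name to a (weight, value) pair."""
    weight = lambda s: sum(items[i][0] for i in s)
    value = lambda s: sum(items[i][1] for i in s)
    conflicts = []  # sets of items that cannot all be taken together
    while True:
        # 1. Best-value subset avoiding every known conflict.
        best = max(
            (set(sub) for r in range(len(items) + 1)
             for sub in combinations(items, r)
             if not any(c <= set(sub) for c in conflicts)),
            key=value,
        )
        # 2. Feasible? Then it is also optimal.
        if weight(best) <= capacity:
            return best
        # 3. Minimize the conflict: drop items while still overweight.
        conflict = set(best)
        for i in list(conflict):
            if weight(conflict - {i}) > capacity:
                conflict.discard(i)
        conflicts.append(conflict)

items = {"A": (4, 5), "B": (2, 4), "C": (5, 8)}
print(ihs_knapsack(items, 6))  # the optimum {'A', 'B'}: value 9, weight 6
```

On this instance the loop learns the conflicts {B, C} and {A, C} from the dialogue above and then stops at {A, B}.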

And now the best part: this approach does not rely on anything knapsack-specific. All it requires is that:

  • You can state the problem in terms of “valid” versus “invalid” configurations.
  • Given an invalid configuration, you can extract a conflict out of it: a set of elements that cannot be added together.
  • Given a set of conflicts, you can find the “largest” configuration that avoids all of them.

The “duality” comes from this loop: instead of searching over the space of solutions (primal) until you exhaust it, you explore the space of explanations (dual) until you know enough of it. This workflow is especially powerful when your problem has no compact formulation but can provide you with counterexamples when you ask for them. You don’t need to model the whole thing up front: just keep learning what doesn’t work.

Here is a graphical summary of this technique:

You can also narrate it in the minimization form:

  • You can state the problem in terms of “valid” versus “invalid” configurations. (same)
  • Given an invalid configuration, you can extract a core out of it: a set of elements that cannot all be kept together.
  • Given a set of cores, you can find the “smallest” set of discarded elements that covers (hits) all the cores.

If you can fit your problem into this scheme, you can also solve it: at least, all the ingredients are in place!

But how can I apply this?

Where IHS shows up

In more than a few domains, actually! To give a few examples:

Optimization

First, let’s consider one of the most cited success stories of the implicit hitting set approach: the MaxSAT problem (Davies and Bacchus 2011). The problem has the following structure: given \(m\) clauses with weights \(w_1, \dotsc, w_m > 0\) over Boolean variables, find an assignment that minimizes the total weight of violated clauses.2

The problem structure suggests a natural conflict description: an unsatisfiable subset of the input clauses. (In the SAT literature, such a set is referred to as a core.) That, in turn, suggests a way to extract conflicts: given a set of clauses that are supposed to be satisfied, feed it into a SAT solver; here is what can happen:

  • The SAT solver returns a satisfying assignment: great, this is the optimal solution.
  • The SAT solver returns an UNSAT verdict: then we request an UNSAT core of the formula and add it to the core set.

Once we set up the same conflict-avoidant logic as in our knapsack example, we get the following IHS loop:

  1. Find the minimum-weight hitting set of all known cores.
  2. Feed the clauses outside the hitting set to a SAT solver; if a satisfying assignment is found, halt: it is optimal, and the hitting set’s weight is the optimum.
  3. Otherwise, recover an UNSAT core, add it to the core set, and repeat.
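The loop above fits in a page of Python if we let brute force stand in for both the SAT oracle and the hitting-set solver (which is exactly what a real implementation would replace with a SAT solver and an ILP solver). The names and the toy instance below are my own illustration.

```python
from itertools import combinations, product

def ihs_maxsat(clauses, weights, n_vars):
    """Minimum total weight of violated clauses, via the IHS loop.
    Clauses are tuples of integer literals (negative = negated)."""

    def sat(idxs):
        # Toy SAT oracle: try every assignment of the n_vars variables.
        return any(
            all(any(bits[abs(l) - 1] == (l > 0) for l in clauses[i])
                for i in idxs)
            for bits in product([False, True], repeat=n_vars)
        )

    def shrink(idxs):
        # Deletion-based core minimization: drop clauses while still UNSAT.
        core = set(idxs)
        for i in list(core):
            if not sat(core - {i}):
                core.discard(i)
        return frozenset(core)

    all_idx, cores = range(len(clauses)), []
    while True:
        # 1. Minimum-weight hitting set of all known cores (brute force).
        hs = min(
            (set(s) for r in range(len(clauses) + 1)
             for s in combinations(all_idx, r)
             if all(core & set(s) for core in cores)),
            key=lambda s: sum(weights[i] for i in s),
        )
        # 2. Are the clauses outside the hitting set satisfiable?
        rest = set(all_idx) - hs
        if sat(rest):
            return sum(weights[i] for i in hs)
        # 3. No: learn a new core and repeat.
        cores.append(shrink(rest))

# (x1) weight 2, (NOT x1) weight 3, (x2) weight 1: best is to violate (x1).
print(ihs_maxsat([(1,), (-1,), (2,)], [2, 3, 1], 2))  # 2
```

On the toy instance, the first iteration learns the core {(x1), (¬x1)}, and the second hits it with the cheaper clause (x1), leaving a satisfiable remainder.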

Model debugging

Just like you can find the largest satisfiable set, you can also find the smallest unsatisfiable one. This is an important problem when you, for example, are debugging a constraint model that is unsatisfiable but should not be, for some extra-modeling reasons (e.g., domain knowledge). More precisely, the problem, known as the smallest minimal unsatisfiable set (MUS) problem, is as follows: given \(m\) clauses3 \(\Omega = \{ \omega_1, \dotsc, \omega_m \}\) over Boolean variables, find its smallest unsatisfiable subset \(\Omega^* \subseteq \Omega\) (Ignatiev et al. 2015).

With a few examples under our belt, the dual of this problem is clear:

  • we need to avoid every satisfiable subset \(\Omega' \subset \Omega\),
  • so any unsatisfiable set has to contain at least one of the clauses in the set \(C := \Omega \setminus \Omega'\) of omitted clauses. Any subset of clauses \(C\) with this property (discarding it leaves a satisfiable formula) is called a correction subset. If discarding any clause from \(C\) breaks this property, like a reverse Jenga puzzle, then it is a minimal correction set (MCS).

With that said, here is the IHS approach we get:

  1. Find the minimum-weight hitting set of all known MCSes.
  2. Treat the hitting set itself as a SAT formula; if a solver declares it UNSAT, halt: it is the smallest MUS.4
  3. Recover an MCS and add it to the set of known MCSes; repeat.
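Step 3 hides the interesting part: how to recover an MCS. A standard way is to “grow” the satisfiable clause set into a maximal satisfiable subset and return its complement. A minimal sketch, with a toy brute-force SAT oracle standing in for a real solver (all names are my own):

```python
from itertools import product

def brute_sat(clauses, n_vars=2):
    # Toy SAT oracle over tuple-of-literals clauses (illustration only).
    return any(
        all(any(bits[abs(l) - 1] == (l > 0) for l in cl) for cl in clauses)
        for bits in product([False, True], repeat=n_vars)
    )

def grow_to_mcs(sat_subset, all_clauses, satisfiable):
    """Grow a satisfiable subset S of the clause set Omega into a maximal
    satisfiable subset; the complement of that subset is then an MCS."""
    s = set(sat_subset)
    for cl in all_clauses:
        if cl not in s and satisfiable(s | {cl}):
            s.add(cl)
    return set(all_clauses) - s

# Omega = {x1, NOT x1, x2, NOT x2}; growing from {x1} reaches the maximal
# satisfiable subset {x1, x2}, so the returned MCS is {NOT x1, NOT x2}.
omega = [(1,), (-1,), (2,), (-2,)]
print(grow_to_mcs([(1,)], omega, brute_sat))
```

Note the reverse-Jenga property in action: adding either clause of the returned MCS back to {x1, x2} makes the formula unsatisfiable.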

Inferring functional dependencies

You can also discover IHS formulations for enumeration problems, rather than just optimization ones. For a widely known example of such a problem, consider the situation described in (Mannila and Räihä 1994): you have a relational table, for example, like this one:

Title            | Release year | Length
West Lion        | 2006         | 159
Sassy Packer     | 2005         | 154
Jingle Sagebrush | 2005         | 159

Clearly, some of the columns are going to functionally determine the others; for example, if you know Title and Release year, you should know Length. However, not all of the functional dependencies are going to be “natural,” such as the relation (Length, Release year) → Title. Knowing the relations that hold in this table, even if not marked explicitly, can be helpful for all sorts of reasons:

  • Pattern mining: knowing that some if-else pattern holds unconditionally on your dataset is as good as it gets, if you are looking for dependencies.
  • Database design: if a relation has a valid functional dependency that does not correspond to some table key, making that dependency explicit is helpful for data integrity.
  • Query optimization: similarly, if a query optimizer knows that some column is defined by another, it can leverage such information to short-circuit some table joins or even avoid table lookups altogether (Siegel, Sciore, and Salveter 1992).

A good start would be to enumerate all such dependencies. Again, by the IHS playbook, we need to determine the following:

  1. When is \((a_1, \dotsc, a_n) \to b\) not a functional relation? That happens when some two rows agree on the columns \(a_1, \dotsc, a_n\) but disagree on \(b\); in other words, when \(p(a_1, \dotsc, a_n) = q(a_1, \dotsc, a_n)\) but \(p(b) \neq q(b)\) for some rows \(p, q\).
  2. How to define a conflict? Given the “witness” rows \(p\) and \(q\) and the dependent column \(b\), we cannot discard every column on which \(p\) and \(q\) differ from the set \(\Omega\) of all columns: at least one of them must appear on any valid left-hand side (similarly to the MUS example above).
  3. How to trim a conflict? Similarly to before, try adding more columns to the left-hand side \((a_1, \dotsc, a_n)\) without ending up discriminating the rows \(p\) and \(q\); once this is no longer possible, we have found a maximal non-functional left-hand side, and its complement is a minimal conflict.
  4. What do valid solutions look like? They should include at least one column from every conflict discovered as described above.
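These ingredients are easy to prototype. In the sketch below (my own illustration, reusing the table above), a conflict for a target column is the set of columns on which a pair of target-disagreeing rows differs; a left-hand side is valid exactly when it hits every such set:

```python
from itertools import combinations

def conflicts_for(rows, target):
    """For every pair of rows disagreeing on `target`, a valid left-hand
    side must contain at least one column on which the pair also differs;
    these 'difference sets' are the conflicts."""
    return {
        frozenset(a for a in p if a != target and p[a] != q[a])
        for p, q in combinations(rows, 2)
        if p[target] != q[target]
    }

def fd_holds(rows, lhs, target):
    # (lhs -> target) holds iff lhs hits every conflict.
    return all(set(lhs) & c for c in conflicts_for(rows, target))

rows = [
    {"Title": "West Lion", "Release year": 2006, "Length": 159},
    {"Title": "Sassy Packer", "Release year": 2005, "Length": 154},
    {"Title": "Jingle Sagebrush", "Release year": 2005, "Length": 159},
]
print(fd_holds(rows, {"Title", "Release year"}, "Length"))  # True
print(fd_holds(rows, {"Release year"}, "Length"))           # False
```

The second check fails because Sassy Packer and Jingle Sagebrush share the release year 2005 but have different lengths, so Release year alone cannot determine Length.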

Well, that was a lot; here is a short summary to match problems with their IHS ingredients:

Problem | Conflict structure | Conflict extraction | Hitting set interpretation
MaxSAT | UNSAT core | SAT solver run | A subset of clauses avoiding all UNSAT cores
Smallest MUS | Minimal correction subsets | “Grow” procedure (add clauses to a satisfiable subset, then complement it) | A subset of clauses that contains a clause from every MCS
Functional relation inference | Non-discriminating column sets | Pairwise row testing | A subset of columns that does not collate any pair of rows disagreeing on the right-hand side

Miscellaneous tricks

Of course, while this scheme is simple to describe, executing it well is not so simple. Effective implementations of the IHS workflow incorporate a variety of extra techniques; for example:

Conflict/core extraction heuristics. Better core extraction = fewer iterations. For example, one way to speed up MUS extraction is to observe that:

  • an MCS not only states that some subset is invalid, but also comes with an assignment that any MUS has to falsify (something that did not happen on the IHS iteration of interest);
  • and similar assignments (e.g., ones obtained by flipping a single bit) could yield different MCSes. This technique is known as model rotation and, while giving no formal guarantees, it is known to be helpful for MUS extraction (Marques-Silva and Lynce 2011).

Smarter hitting set solvers. Minimum hitting set computation is NP-hard; even if it is easier than handling most of the problems here directly, it is still not trivial. Therefore, hitting set solvers commonly employ tricks such as greedy approximations (a cheap bound between exact calls), incremental (I)LP solving that reuses work across iterations, and seeding the loop with several disjoint cores before the first exact call.
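The classic greedy approximation, for instance, repeatedly picks the element that hits the most not-yet-hit sets; it runs fast and its answer upper-bounds the optimum (with the usual logarithmic approximation factor of greedy set cover). A minimal sketch:

```python
def greedy_hitting_set(sets):
    """Greedy approximation of a minimum hitting set: repeatedly pick
    the element occurring in the most remaining (not-yet-hit) sets."""
    remaining = [set(s) for s in sets if s]
    hs = set()
    while remaining:
        # Count element occurrences across the remaining sets.
        counts = {}
        for s in remaining:
            for e in s:
                counts[e] = counts.get(e, 0) + 1
        best = max(counts, key=counts.get)
        hs.add(best)
        # Keep only the sets that are still not hit.
        remaining = [s for s in remaining if best not in s]
    return hs

print(greedy_hitting_set([{1, 2}, {2, 3}, {3, 4}]))  # e.g. {2, 3}
```

In an IHS solver, a result like this can serve as a candidate between expensive exact hitting-set calls: if the greedy candidate already yields a feasible (or satisfiable) configuration, the exact solver run can be skipped entirely.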

Conclusion

IHS dualization elegantly offloads search (candidate generation) onto explanation of infeasibility (conflict learning). Like any other versatile idea, it has many relatives in other fields; here are a few:

  • conflict-driven clause learning from SAT solving,
  • nogood learning in constraint programming,
  • cutting plane approaches in integer programming,
  • counterfactual explanations in machine learning,
  • counterexample-guided abstraction refinement in program synthesis.

Implicit Hitting Set dualization is a unifying perspective for tackling hard problems. Whether you’re debugging a program, optimizing a Boolean formula, or solving a puzzle, if it is easier to diagnose non-solutions than to enumerate valid solutions, IHS is a tool you want to consider.

  1. I do not recollect the conversation word for word, so the source for the quoted dialogue is “I made it up.” That said, I do recollect that we back-referenced from this paper to previous IHS applications. 

  2. The more common problem definition also discriminates between hard clauses that have to be satisfied unconditionally, and soft clauses that can be violated for a penalty. However, it is equivalent to the text definition: just set all hard clause weights to a number larger than the sum of soft clause weights, and declare the problem UNSAT if any of the hard clauses is violated in the optimal solution. In the blog text, I stick to the soft-only definition, as it makes the exposition of the approach more accessible. 

  3. But it might as well be an arbitrary constraint, albeit with less space for domain-specific techniques for correction set extraction. 

  4. Yes, this is the exact opposite of the MaxSAT condition; this only makes sense given the duality between those two problems. 

References

  1. Solving MAXSAT by Solving a Sequence of Simpler SAT Instances
    Jessica Davies, and Fahiem Bacchus
    In Principles and Practice of Constraint Programming – CP 2011, 2011
  2. Smallest MUS Extraction with Minimal Hitting Set Dualization
    Alexey Ignatiev, Alessandro Previti, Mark Liffiton, and 1 more author
    In Principles and Practice of Constraint Programming, 2015
  3. Algorithms for inferring functional dependencies from relations
    Heikki Mannila, and Kari-Jouko Räihä
    Data & Knowledge Engineering, 1994
  4. A method for automatic rule derivation to support semantic query optimization
    Michael Siegel, Edward Sciore, and Sharon Salveter
    ACM Trans. Database Syst., Dec 1992
  5. On Improving MUS Extraction Algorithms
    Joao Marques-Silva, and Ines Lynce
    In Theory and Applications of Satisfiability Testing - SAT 2011, Dec 2011
  6. Witty: An Efficient Solver for Computing Minimum-Size Decision Trees
    Luca Pascal Staus, Christian Komusiewicz, Frank Sommer, and 1 more author
    Dec 2024
  7. Implicit hitting set algorithms for reasoning beyond NP
    Paul Saikko, Johannes P. Wallner, and Matti Järvisalo
    In Proceedings of the Fifteenth International Conference on Principles of Knowledge Representation and Reasoning, Cape Town, South Africa, Dec 2016