1. INTRODUCTION
The success of constraint programming takes its roots in its unique combination of modeling facilities and solving efficiency. However, the use of constraint programming is often limited by the knowledge of the constraints that may be appropriate to represent a given problem. It can happen that a model involves a constraint that is only partially known, for example, when the constraint represents a concept we do not know, or do not want to define in extension. It can be the set of mammals in a description of animals, solar systems inside astronomical data, a preference between possibilities in a configuration problem, the notion of “good” wine, or a habit such as the usually free time slots in somebody's diary. It may also happen that the user does not know which constraint can be used to model the problem because of lack of knowledge in constraint programming, but can easily express examples or counterexamples for it. This situation appears in design (Chandrasekaran, 1999; O'Sullivan, 2002) when a requirement is difficult or impossible to model. For example, when designing a bicycle, the angle α of the fork (see Fig. 1) has an impact on measurable concepts like the turning circle, but also on less well-defined concepts like the ability to go straight when riding with no hands, or the user's feeling of comfort. Some of these properties are mutually exclusive and need to be adjusted according to the use of the bicycle. For example, for a mountain bike, a greater importance will be given to maneuverability than to comfort, resulting in a small α value.
Such a constraint can be modeled by examples and counterexamples and impose requirements on the shape of the item when designing a new fork. In addition, other concepts like consumer satisfaction can be modeled alike by giving examples for already built objects.
In this paper, we propose to use partially defined finite domain constraints. In a partially defined constraint, some tuples are known to be true, some others are known to be false, and some are just unknown. We make use of this partial knowledge for learning the concept that is behind the partially defined constraint. Given positive and negative examples of the concept, the task consists in completing the definition in such a way that new examples never seen by the system will be correctly classified. This framework has been extensively studied in machine learning (Mitchell, 1997) under the name of supervised classification. In the context studied in this paper, we assume that the full constraint is not available to the system, even by asking other agents or the environment. Hence, there is not a single way to complete a partially defined constraint, just like different people may agree on a set of examples but may have different concepts in mind. In addition, the definition may be revised when the system gets more experience or knowledge. In this paper, we are only concerned with the acquisition of a single constraint. However, its arity may be large.
Partially defined constraints can be learned whenever examples and counterexamples of the relation are available. For example, complex physical systems are often very difficult to model, even if each component is well described and understood. Interactions and uncertainty on some components may cause changes in the behavior of the system. In this case, examples can be provided by a limited number of experimentations or simulations. Then, the approximated subsystem can be used as part of a bigger system. Another example comes from the field of sensorial analysis, where a set of products with known characteristics is presented to a panel of users. The users rate the products according to different factors, and the set of observations can be used to model global preferences of potential customers. This kind of analysis is often used for designing products by the agrofood industry, but can be applied to any product. Sensorial information can be of various kinds, like taste, comfort, ergonomics, and so forth. By treating this information as a constraint, it could be possible to handle it from the early stages of design of a product, thus limiting the impact of changes at later stages. Let us take an example in which partially defined constraints occur naturally:
Example 1. The design of a car dashboard is of critical importance for several reasons. It must be clear and readable for safety, and not tiring. However, these criteria are only one aspect of the problem. Other characteristics like the position, the color, or the size of the counters are major issues in the perception the driver has of his car, and of himself as a driver. For example, it is hardly possible for a sports car to have a small speedometer and no rpm counter. Hence, producing a good design for a dashboard is a clever mixture between functionality, design, and anticipation of consumer taste. In order to integrate these features in the design procedure, we can assume that there is some degree of freedom in the characteristics of the dashboard components, like position, color, or size of the counters. A database of customer tastes has been created by presenting examples of dashboards to a panel of drivers and asking them to select the “good” ones. However, because the number of possible dashboards is very large, it would be impossible to rank each possible one individually. Hence, it is better to ask the users about only a few examples and to model the concept of “good dashboard” as a partially defined constraint.
For the example we propose to use throughout this paper, we have built a database of dashboards (using our own taste, which should not be considered representative of the common taste of the average driver). The database is composed of the following fields:
- the shape of the speedometer: circle, ellipsoid, half-circle, quarter-circle;
- size of the speedometer: small, medium, large;
- position of the speedometer: left, center, right;
- size of the rpm counter: small, medium, large;
- position of the rpm counter: left, center, right;
- instantaneous gas consumption: yes/no;
- all indicators in one group: yes/no;
- the mileage counter: mechanical, LCD;
- light color: white, blue, red;
- type of fuel gauge: linear, circular;
- the position of the fuel gauge: left, center, right;
- type of water temperature indicator: linear, circular;
- position of the water temperature indicator: left, center, right;
- type of turn signal indicator: light arrow, thick arrow;
- position of the turn signal indicator: high, middle, low;
- character font: Arial, Lucid-Sans, Helvetica, Impact;
- overall color: white, black, grey, brown;
- icon color: black, white, red, orange;
- frame color: black, white, chrome;
- counter face color: black, white, grey;
- text color: black, white, red, orange;
- color of needles: white, red, orange;
- driver's opinion: positive, negative.
The positioning of items is relative to their respective size. For example, the speedometer and the rpm counter are approximately the same size, which means that they cannot be both in the same place. The same does not hold for the fuel gauge and the temperature indicator because they are small items. However, it is possible to have both the rpm counter and the water temperature indicator on the left.
We can consider this database as a partially defined 22-ary constraint (by using the first 22 fields of the database). The modeling of customer preferences as a partially defined constraint allows the system to “invent” new dashboards that are acceptable within the learned concept. One interesting point that justifies the use of constraints in design is that the design of a dashboard for a given car is not governed only by readability and customer preference constraints. Other requirements also arise, such as consistency constraints: the speedometer and the rpm counter cannot both be in the central position (a physical constraint), and the fuel gauge and water temperature indicator should be the same color (a readability constraint). Finding a dashboard meeting all requirements and following customer preferences is a constraint satisfaction problem. In addition, the dashboard has to fit into a space that is different for every car. By adding information on the surface occupied by each item, we can, for example, find the best dashboard that has the minimum item surface. This is a constraint optimization problem. █
The idea of the technique we use for learning comes directly from the classical constraint solver model computing a chaotic iteration of reduction operators (Apt, 1999). We begin by learning the constraint. However, instead of learning it by a classifier that takes as input all its variables and answers “yes” if the tuple belongs to the constraint and “no” otherwise, we choose to learn the support function of the constraint for each value of its variables' domains. A tuple is part of the constraint if accepted by all classifiers for each of its values and rejected as soon as it gets rejected by one. This method is nonstandard in machine learning, but we show in Section 4 that it can achieve a low error ratio—comparable to well-established learning methods—when new tuples are submitted, which proves its validity experimentally.
As is, a classifier is only able to perform satisfiability checks for a partially defined constraint. If added to a constraint satisfaction problem (CSP), this constraint would not contribute to the reduction of variables' domains, and it would yield a “generate and test” behavior that could quickly ruin the performance of the system. Hence, partially defined constraints need a solver and not only a satisfiability test in order to meet the standards of efficiency of constraint programming. The classifiers we learn are expressed by functions, and we turn them into propagators by taking their extension to intervals. This formal transformation does not involve any further learning techniques, thus preserving the properties of the first part. Then the classifiers can be used with variable domains as input. We also show that the consistency they enforce, although weaker than arc consistency, is nevertheless interesting and yields strong pruning of the search space.
2. PRELIMINARIES: BUILDING CONSISTENCIES
We first recall the basic notion of consistency in order to present the approximation scheme we use for learning. For a set E, we denote by P(E) its powerset and by |E| its cardinality. Let V be a set of variables and D = (DX)X∈V be their family of (finite) domains. For W ⊆ V, we denote by DW the set of tuples on W, namely ∏X∈W DX. Projection of a tuple or a set of tuples on a set of variables is denoted by |. A constraint c is a couple (W, T), where W ⊆ V are the variables of c (denoted by var(c)) and T ⊆ DW is the set of solutions of c (denoted by sol(c)). Given a set of variables and their respective domains, a CSP is a set of constraints. A solution is a tuple that satisfies all constraints. In this paper, we use the common framework combining search and domain reduction by propagation. This propagation computes a property called consistency.
A search state is a set of yet possible values for each variable: for W ⊆ V, it is a family s = (sX)X∈W such that ∀X ∈ W, sX ⊆ DX. The corresponding search space is SW = ∏X∈W P(DX). The set SW, ordered by pointwise inclusion ⊆, is a complete lattice. Some search states, which we call singletonic, represent a single tuple and play a special role as representatives of possible solutions. A singletonic search state s is such that |∏s| = 1.
A consistency can be modeled as the greatest fixpoint of a set of so-called propagators and is computed by a chaotic iteration (Apt, 1999). For a constraint c = (W, T), a propagator is an operator f on SW satisfying the following properties:
- monotonicity: ∀s, s′ ∈ SW, s ⊆ s′ ⇒ f(s) ⊆ f(s′);
- contractance: ∀s ∈ SW, f(s) ⊆ s;
- singleton equivalence: let s ∈ SW be a singletonic state, then f(s) = s ⇔ ∏s ⊆ sol(c).
When iterating operators for constraints on different sets of variables, a classical cylindrification on V is applied.
Singleton equivalence means that the operator is also a satisfiability test for a single tuple embedded in a singletonic state. Correctness is also a major property of propagators and it means that a solution tuple never gets rejected across the search space:
Proposition 1. Let c = (W, T) be a constraint and f be a propagator for c. Then f is correct for c: ∀s ∈ SW, ∏s ∩ sol(c) ⊆ ∏f(s) ∩ sol(c). █
Proof: Suppose that f is not correct. Then ∃s ∈ SW and t ∈ sol(c) such that t ∈ ∏s and t ∉ ∏f(s). We note st = ({tX})X∈W ∈ SW the singletonic state corresponding to t. Thus, we have st ⊈ f(s). Because t ∈ sol(c) and f is singleton equivalent, we have st = f(st). Hence, we have st ⊆ s and f(st) ⊈ f(s), which contradicts the fact that f is monotonic. █
When search begins, the initial search state s0 is initialized to the entire set of values: s0 = (DX)X∈V.
Let us now define some consistencies associated to a constraint c = (W, T). The well-known arc-consistency operator (acc) is defined by

acc(s) = s′, where ∀X ∈ W, s′X = sX ∩ (∏s ∩ T)|X.

If we suppose that each variable domain DX is equipped with a total ordering ≤, we denote by [a ··· b] the interval {e ∈ DX | a ≤ e ≤ b}. For A ⊆ DX, we denote by [A] the set [min(A) ··· max(A)]. By extension to Cartesian products, for s = (sX)X∈W ∈ SW, we denote by [s] the family ([sX])X∈W in which each variable domain is extended to its smallest enclosing interval. The bound-consistency operator (bcc) is defined by

bcc(s) = s′, where ∀X ∈ W, s′X = sX ∩ [(∏[s] ∩ T)|X].
Bound consistency only enforces consistency for the bounds of the domain by shifting them to the next consistent value in the suitable direction. Consistencies are partially ordered according to their pruning power and we have f ⊆ f′ if ∀s ∈ SW,f (s) ⊆ f′(s).
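As an illustration, the two operators can be sketched over an explicit tuple table (a minimal Python sketch, not the paper's implementation; the search state is represented as a hypothetical dict mapping each variable to its set of remaining values, and domain values are assumed to be integers so that interval hulls are easy to form):

```python
from itertools import product

def ac(state, W, T):
    """Arc consistency over a table: keep only values supported by some
    solution tuple within the current state."""
    tuples = [t for t in product(*(sorted(state[X]) for X in W)) if t in T]
    return {X: {t[i] for t in tuples} for i, X in enumerate(W)}

def hull(values):
    """Smallest enclosing interval [min .. max] of an integer value set."""
    if not values:
        return set()
    return set(range(min(values), max(values) + 1))

def bc(state, W, T):
    """Bound consistency: look for supports inside the interval hull of
    the state, then only shrink the bounds (interval hull of supports)."""
    hulled = {X: hull(state[X]) for X in W}
    reduced = ac(hulled, W, T)
    return {X: state[X] & hull(reduced[X]) for X in W}
```

On the equality constraint over {0, 1, 2} with X restricted to {0, 2}, arc consistency removes 1 from Y's domain whereas bound consistency cannot, since 1 lies strictly inside Y's bounds.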
Because only variables' domains are reduced, a consistency operator f for a constraint c = (W, T) can be split into |W| projection operators (fX)X∈W, one for each variable of the constraint. By confluence of chaotic iterations (Apt, 1999), these operators can be scheduled independently as long as they satisfy the first three properties of consistency operators. In order to represent the same constraint, they have to be singleton complete collectively. It is worth noticing that there is an asymmetry between rejection and acceptance: for satisfiability, a nonsolution tuple must be rejected by (at least) one of these operators, while correctness imposes that a solution tuple is accepted by all operators.
The role of a consistency operator fX is to eliminate from the domain of its target variable X values that are unsupported by the constraint. Arc consistency eliminates all inconsistent values. Thus, it has to find a support for each considered value a (a solution tuple whose projection on the target variable is a) in order to allow the value to remain in the variable's domain. This task has been proved to be NP-complete in general (Bessière, Coletta, et al., 2004) for n-ary constraints. While many useful constraints have polynomial-time arc-consistency propagators, some exist for which this task is intractable. Because we are dealing here with constraints expressed by examples, this case must be taken seriously and motivates an approximation scheme.
At a finer level of granularity, the way a support is computed has motivated constant improvement in arc-consistency algorithms. The first technique consists in keeping a table that records every tuple of the constraint and sweeping this table every time a support is needed. This technique has been highly optimized, and optimal algorithms check each value only once and keep minimal data structures, like the general arc-consistency (GAC) schema instantiated by a table (Bessière & Régin, 1997). These techniques are well suited to highly irregular constraints. Figure 2 depicts the projections of a ternary constraint c(X, Y, Z) on the three values of X's domain for different types of constraints. On each projection, black squares denote allowed tuples. Figure 2a illustrates the type of constraint for which table-based techniques are well suited; the problem is that their space requirements grow with domain size and constraint arity.
Conversely, if the relation is very regular, it can happen that a single expression in a given language captures the notion of support for every value of the domain. This opportunity is exploited by the indexical language (van Hentenryck et al., 1991). For example, in Figure 2c, all supports for the variable X of the constraint Z = X + Y can be computed using the same language expression [in this case, X in dom(Z) − dom(Y)]. The “regularity” of the constraint can be measured as the length of the shortest expression for an operator in a given language. Although indexicals are powerful enough to represent any constraint operator, in some cases this comes at the price of an expression proportional to the length of the table itself. Indexicals are thus best suited for operators that enjoy a short expression. Another advantage of the indexical language is that it makes it easy to express weaker consistencies such as bound consistency, by associating an expression directly with the bounds of the variable. The issue of computing an approximation of a consistency using indexicals has been tackled in Lallouet et al. (2003).
However, there exists an intermediate situation depicted in Figure 2b where it is possible to find a compact representation for finding supports, but this representation is different for every value of the domain of the variable to be reduced. In other words, in order to represent an operator fX, a Boolean function is associated to each value a of X's domain. This function evaluates to false if the value has no support and to true if one is found or if an incomplete search is not able to decide. We call this function an elementary reduction function (ERF) and denote it by fX=a. By combining elementary reduction functions, we are able to build an operator for the target variable: all we have to do is to collect the answers of the functions and intersect the produced domain with the variable's current domain. Hence, for a search state s, if we assume that we have a set of ERFs {fX=a|a ∈ sX}, we build the operator fX as fX(s) = s′, where sX′ = sX ∩ {a | fX=a(s)}. For bound consistency, instead of sweeping the entire domain, an upward while loop can be used to move the minimum bound to its next supported value (respectively downward for the max bound).
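The combination of ERFs into a propagator for the target variable can be sketched as follows (illustrative Python, with a hypothetical interface in which each ERF is a Boolean function of the search state):

```python
def make_operator(X, erfs):
    """Combine per-value elementary reduction functions (ERFs) into a
    propagator for variable X: a value a stays in X's domain only if
    its ERF answers True on the current state.  `erfs[a]` is assumed
    to answer True when a support for a may exist in state `s`."""
    def f_X(state):
        new = dict(state)
        new[X] = {a for a in state[X] if erfs[a](state)}
        return new
    return f_X
```

For instance, with ERFs encoding the constraint X = Y (value a of X is supported iff a remains in Y's domain), the operator prunes X's domain down to Y's.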
Compared to the first technique (Fig. 2a), this representation is likely to take less space because we can bound the size of each ERF. Then, the space requirement grows only according to the domain sizes. However, it puts some limits to the expressivity of the functions that can be used and arc consistency may not be representable for some constraints. Compared to the second technique (Fig. 2c), it is more versatile because different functions can be used to compute the support of different values.
By using ERFs, we are able to give each value its own support function. A function fX=a takes as input the current domain of the constraint's other variables. However, it may make full use of this information or just use it partially. For example, it can only use the bounds of the domains. From these combinations, we get four consistencies: two are classical and two are intermediate, denoted by ac− and bc+ (see Table 1).
In other words, if we give each domain value an ERF but assume that this ERF takes as input only the bounds of the other variables' domains, we get an intermediate consistency we call ac−:

ac−c(s) = s′, where ∀X ∈ W, s′X = sX ∩ (∏[s] ∩ T)|X.
It does not have the full power of arc consistency because it makes use of less input information, but it may reduce more than bound consistency because values other than the bounds can be removed. The counterpart, bc+, is obtained when the bounds are reduced by a function taking as input all information available in the whole domain of the other variables:

bc+c(s) = s′, where ∀X ∈ W, s′X = sX ∩ [(∏s ∩ T)|X].
Proposition 2. ac ⊆ bc+ ⊆ bc and ac ⊆ ac− ⊆ bc. █
Proposition 3. ac− and bc+ are incomparable. █
3. PARTIALLY DEFINED CONSTRAINTS
In this section, we give the definition of partially defined constraints and introduce the notion of extension, which provides a closure of the constraint.
A classical constraint c = (W, T) is supposed to be known in totality. The underlying closed world assumption (CWA) states that what is not explicitly declared as true is false. Hence, the complement DW − T is the set of tuples that do not belong to c. In the following, we call ordinary constraints under CWA closed or classical constraints. When dealing with incomplete information, it may happen that some parts of the constraint are unknown (see Fig. 3):
Definition 1 (partially defined constraint). A partially defined constraint is a triple c = (W, c+, c−) where c+ ⊆ DW,c− ⊆ DW, and c+ ∩ c− = Ø. █
In a partially defined constraint c = (W, c+, c−), c+ represents the allowed tuples and c− the forbidden ones. The remaining tuples, DW − (c+ ∪ c−), are simply unknown. Note that a classical constraint c = (W, T) is a particular partially defined constraint c = (W, T, DW − T) for which the negative part is the complement of the positive part.
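In code, the three-valued status of a tuple under a partially defined constraint can be sketched as follows (an illustrative helper, not part of the paper's system; tuples are plain Python tuples and c+, c− are sets):

```python
def classify_tuple(t, pos, neg):
    """Status of tuple t under a partially defined constraint (W, c+, c-):
    known-true, known-false, or unknown.  c+ and c- must be disjoint."""
    assert not (pos & neg), "c+ and c- must be disjoint"
    if t in pos:
        return "true"
    if t in neg:
        return "false"
    return "unknown"
```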
Partially defined constraints need a special treatment in order to be used in a CSP because little propagation can be done without knowing the constraint in its entirety. Hence, a partially defined constraint needs to be closed to be usable in a constraint solving environment. The closure of a partially defined constraint c is done by choosing a class (belonging or not belonging to the constraint) for every unknown tuple. We call the resulting classical constraint an extension of the partially defined constraint.
Definition 2 (extension). Let c = (W, c+, c−) be a partially defined constraint. A (classical) constraint c′ = (W, T) is an extension of c if c+ ⊆ T and c− ∩ T = Ø. █
In other terms, an extension is a classical constraint compatible with the known part (positive and negative) of the partially defined constraint. A partially defined constraint allows us to catch a glimpse of a hidden reality, and one of its extensions corresponds to the genuine relation of the world. In most cases, this constraint is impossible to obtain, and all that can be done is to compute an approximation of it. In general, many extensions can be considered; let us introduce three of them. Let c = (W, c+, c−) be a partially defined constraint. We denote an extension of c by [c].
- Cautious extension: [c]c = (W, DW − c−). All unknown tuples are assumed to be true (Fig. 4b). A solver generated according to this extension is cautious in the sense that it will not prune the search space for any unknown tuple. Thus, the consequence of a bad choice for this tuple will not compromise the correctness of a solution. The counterpart is that if the unknown part of the constraint is large, it will yield weak pruning and the user will be provided with many unwanted false solutions.
- Brave extension: [c]b = (W, c+). All unknown tuples are assumed to be false (Fig. 4c), as in the classical CWA. A solver generated according to this extension is brave because it will prune the search space as soon as possible. If a tuple was incorrectly classified as false, correctness is lost and a nonmonotonic restoration mechanism is needed, as in dynamic CSPs (Verfaillie & Jussien, 2003).
Note that we choose the terminology of cautious and brave with respect to the solver and not to the constraint. Indeed, making all unknown tuples true can be considered as brave with respect to the constraint.
- Algorithmic extension: let A : DW → {true, false} be a tuple classification algorithm such that ∀t ∈ c+, A(t) = true and ∀t ∈ c−, A(t) = false. Then [c]A = (W, {t ∈ DW | A(t) = true}) (Fig. 4d).
We are mostly interested in the last extension in which the unknown part is completed by a learning algorithm.
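The three extensions amount to simple set operations, which can be sketched as follows (hypothetical helpers for illustration; `universe` stands for DW and `A` for the classification algorithm):

```python
def cautious(pos, neg, universe):
    """[c]_c: every unknown tuple is assumed to be true."""
    return universe - neg

def brave(pos, neg):
    """[c]_b: every unknown tuple is assumed to be false (CWA)."""
    return set(pos)

def algorithmic(universe, A):
    """[c]_A: tuples are classified by an algorithm A that agrees with
    the known positive and negative parts."""
    return {t for t in universe if A(t)}
```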
This kind of extension is obtained by supervised classification: it consists in the induction of a function that associates a class to every tuple, from a set of examples given with their respective classes. Machine learning puts strong requirements on what counts as a good algorithmic extension. The first and perhaps most important is that the correct class for unknown tuples should be forecast with the highest possible accuracy. The ratio between the number of correctly classified tuples and the number of presented tuples defines the correctness ratio of the generalization. Most techniques used in machine learning perform much better than random classification, and more than 90% success in prediction is not unusual. In order to achieve this, we assume that the known part of the partially defined constraint is representative of the whole underlying constraint and that the underlying constraint has some regularities that can be approximated. The representation of the classification function is then searched in a space of possible functions called the hypothesis space. A learning algorithm finds the best possible function in this space by optimizing criteria such as accuracy, simplicity, or generalization.
4. CONSTRAINT ACQUISITION
We first address the problem of constructing a good extension for a partially defined constraint. In order to represent a relation, the first idea is to build a classifier taking as input an instantiation of all variables of the relation and returning a Boolean stating whether the tuple belongs to the relation. Unfortunately, although learning is effective with this technique (see Rossi & Sperduti, 2004), it would be difficult to extract a solver from this representation. Motivated by the equivalence between a constraint and a correct and singleton complete solver, we propose to acquire a partially defined constraint c = (W, c+, c−) by learning the support function for all values of the variables' domains. More precisely, we build an independent classifier for each value a of the domain of each variable X ∈ W, in the spirit of the ERFs introduced in Section 2. This classifier computes a Boolean function stating whether the value a should remain in the current domain (output value 1) or can be removed (output value 0). We call it an elementary classifier. It takes as input the value of every other variable in W − {X}.
We propose to use as representation an artificial neural network (ANN) with an intermediate hidden layer. This representation has been chosen for its good learning properties and the possibility of a further transformation into a solver. Other kinds of classifiers could be used, but we cannot describe them for lack of space. For W ⊆ V, a neuron is a function n : DW → [0 ··· 1] computing the weighted sum of its inputs followed by a threshold unit. A dummy input, assigned to 1, is added to tune the threshold. The sigmoid function σ(x) = 1/(1 + e−x) is often chosen for the threshold unit because derivability is an important property for the learning algorithm. Let (wX)X∈W be the weights associated to each input variable and w0 be the adjustment weight for the dummy input. Hence, the function computed by a neuron taking as input a = (aX)X∈W is

n(a) = σ(w0 + ∑X∈W wX aX).
For a constraint c = (W, c+, c−), the classifier we build for X = a is a tree of neurons with one hidden layer as depicted in Figure 5. Let (ni)i∈I be the intermediary nodes and out be the output node. All neurons of the hidden layer have as input a value for each variable in W − {X} and are connected to the output node.
Let us call n〈X=a〉 the network concerning X = a. Because neurons are continuous by nature, we use an analog coding of the domains. Let D be a finite domain and < a total order on D (natural or arbitrary), then we can write D as {a0,…, an} with ∀i ∈ [1 ··· n],ai−1 < ai. According to this order, we can map D onto [0 ··· 1] by coding ai by i/n. In a similar way, the output is in the interval [0 ··· 1] and we choose as convention that the value a should be removed from the domain of X if out ≤ 0.5. This threshold is the last level of the network depicted in Figure 5.
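The analog coding and the thresholded network output can be sketched as follows (a schematic Python rendering with made-up weights; `hidden` is a hypothetical list of (weights, bias) pairs, one per hidden neuron):

```python
import math

def encode(value, ordered_domain):
    """Analog coding: map the i-th value of an ordered finite domain
    {a0, ..., an} onto [0, 1] by coding ai as i / n."""
    n = len(ordered_domain) - 1
    return ordered_domain.index(value) / n

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def neuron(weights, bias, inputs):
    """Weighted sum followed by a sigmoid threshold unit; `bias` plays
    the role of the adjustment weight for the dummy input fixed to 1."""
    return sigmoid(bias + sum(w * a for w, a in zip(weights, inputs)))

def classify(hidden, out_weights, out_bias, inputs):
    """One-hidden-layer network n<X=a>: value a is kept iff the output
    exceeds the 0.5 threshold."""
    h = [neuron(w, b, inputs) for w, b in hidden]
    return neuron(out_weights, out_bias, h) > 0.5
```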
The global classifier for the partially defined constraint is composed of all of these elementary classifiers for all values in the domain of all variables {n〈X=a〉|X ∈ W, a ∈ DX}. Following the intuition of ERFs for solving, we can use these elementary classifiers to decide if a tuple belongs to the extension of the constraint or not by checking whether the tuple gets rejected by one of the classifiers. Let t ∈ DW be a candidate tuple and let (n〈X=t|X〉(t|W−{X}))X∈W be the family of 0/1 answers of the elementary classifiers for all values. We can interpret the answers according to two points of view:
- vote with veto: the tuple is accepted if and only if it is accepted by all elementary classifiers;
- majority vote: the tuple is accepted if accepted by a majority of elementary classifiers.
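Both voting schemes are one-liners over the family of 0/1 answers (illustrative sketch; `answers` is the list of Boolean answers of the elementary classifiers):

```python
def veto_vote(answers):
    """Accept the tuple only if every elementary classifier accepts it."""
    return all(answers)

def majority_vote(answers):
    """Accept the tuple if a strict majority of classifiers accept it."""
    return sum(answers) * 2 > len(answers)
```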
In order to produce the extension of the partially defined constraint, these classifiers are trained on examples and counterexamples selected from the projection of the known part of the constraint on the subspace orthogonal to a variable's value. For E ⊆ DW, X ∈ W, and a ∈ DX, we denote by E〈X=a〉 the selection of tuples of E having a as value on X: E〈X=a〉 = {t ∈ E | t|X = a}. Thus, in order to build the classifier n〈X=a〉, we take as examples c+〈X=a〉|W−{X} and as counterexamples c−〈X=a〉|W−{X}.
For example, for a partially defined constraint defined by W = {X, Y, Z}, c+ = {(1, 1, 0), (1, 0, 1)}, and c− = {(1, 1, 1)}, we get, for the classifier n〈X=1〉, the examples c+〈X=1〉|{Y,Z} = {(1, 0), (0, 1)} and the counterexamples c−〈X=1〉|{Y,Z} = {(1, 1)}.
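The selection-and-projection step can be sketched as follows (tuples are Python tuples ordered like W; `training_sets` is a hypothetical helper name):

```python
def training_sets(W, pos, neg, X, a):
    """Build examples/counterexamples for the classifier n<X=a>:
    select the tuples whose X-component equals a, then project them
    on the remaining variables W - {X}."""
    i = W.index(X)
    select = lambda E: {t[:i] + t[i+1:] for t in E if t[i] == a}
    return select(pos), select(neg)
```

On the three-variable constraint of the running example, this reproduces the training sets given above.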
The networks are trained by the classical backpropagation algorithm (Rumelhart et al., 1986), which finds values for the weights using gradient descent. The algorithm is stopped when all examples and counterexamples are correctly classified. This requirement comes from the need for correctness in constraint programming, but it may be adjusted according to the application and to how noisy the training set is. In general, the choice of the architecture of a neural network is a complex problem that has no theoretical solution. Similarly, the number of examples required to learn a relation with sufficient accuracy depends on the learning technique and the difficulty of the concept. It is nevertheless known that a multilayer perceptron with only one hidden layer is a universal approximator (Hornik et al., 1989), but the theory does not make it possible to determine how many neurons in the hidden layer are necessary to approximate a given function. Thus, this number is found empirically, starting from a relatively small value, in order to minimize the complexity of the classifier and to improve the quality of generalization. In many cases, it is better to keep a small network in order to preserve its generalization capabilities.
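As a schematic illustration of gradient descent with this stopping criterion, here is a single sigmoid unit trained by the delta rule until every sample falls on the right side of the 0.5 threshold (the actual system uses one-hidden-layer networks and full backpropagation; this simplified sketch only shows the stopping condition at work):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train(samples, n_inputs, rate=0.5, max_epochs=10000):
    """Delta-rule gradient descent for one sigmoid unit.  `samples` is a
    list of (inputs, target) pairs with targets in {0.0, 1.0}; training
    stops once all samples are classified correctly (output vs. 0.5)."""
    w = [0.0] * (n_inputs + 1)          # w[0] is the dummy-input weight
    for _ in range(max_epochs):
        done = True
        for inputs, target in samples:
            x = [1.0] + list(inputs)
            out = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
            if (out > 0.5) != (target > 0.5):
                done = False
            delta = rate * (target - out) * out * (1 - out)
            w = [wi + delta * xi for wi, xi in zip(w, x)]
        if done:
            return w
    return w
```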
Because the technique we propose for learning relations is not classical in machine learning, we present validation results to show that the concept lying behind the partially defined constraint is actually learned. This framework has been implemented in a system called Solar and tested on the dashboard example presented in the introduction and on various machine learning databases taken from the UCI Machine Learning Repository (http://www.ics.uci.edu/∼mlearn).
We compare the generalization performance of our technique to the popular decision tree learning system C5.0 (RuleQuest Research, 2004). For each base, we have performed a cross-validation using the following method: we cut the base into 10 parts, use 9 of them for learning and the 10th for validation. This process is repeated 10 times, each part of the base being used in turn for validation. The cutoff is identical for the tests of all methods. Then, the whole test is repeated over five sessions with different cutoffs, yielding 50 tests for each technique. The generalization ratio is the percentage of correctly classified tuples.
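The validation protocol can be sketched as follows (a schematic reimplementation; `learn` and `accuracy` are caller-supplied stand-ins, not the actual Solar or C5.0 interfaces):

```python
import random

def cross_validate(data, learn, accuracy, folds=10, sessions=5, seed=0):
    """Repeated k-fold cross-validation: in each session, shuffle the
    base, cut it into `folds` parts, and let each part serve once as
    the validation set.  Returns the mean generalization ratio."""
    rng = random.Random(seed)
    ratios = []
    for _ in range(sessions):
        d = data[:]
        rng.shuffle(d)                      # a different cutoff per session
        parts = [d[i::folds] for i in range(folds)]
        for k in range(folds):
            train = [t for i, p in enumerate(parts) if i != k for t in p]
            model = learn(train)
            ratios.append(accuracy(model, parts[k]))
    return sum(ratios) / len(ratios)
```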
Table 2 contains three parts. The first contains a description of the data: database name, arity, size, and size of the variables' domains. Then follows information about the learned classifiers: the number of neurons in the hidden layer, the number of individual classifiers learned, and the learning time. In comparison, the learning time for C5.0 is very low, typically a few seconds, but the interest of our technique is not only classification, as described in the next section. The last part presents the generalization results: the generalization ratio and standard deviation (SD) for Solar with veto vote, for Solar with majority vote, for C5.0, and for C5.0 with boosting (with the number of trials equal to the arity of the constraint, in order to balance our number of classifiers). The mushroom database is very easy to learn; hence, we only used 500 examples out of 8124 for all techniques, otherwise they would all have reached 100%. Let us note that machine learning techniques such as decision trees and ANNs require a relatively large set of examples to be effective; more generally, this is the case for any statistical approach to learning. Unfortunately, in certain real cases it can be difficult (expensive or just impossible) to obtain a large database of examples. In these cases, our technique could be ineffective in terms of quality of learning and generalization. Nevertheless, the approach is well adapted to situations where enough data are provided by simulation or when sufficient experimental data are available.
The technique we propose challenges powerful techniques such as boosting (Freund & Schapire, 1999), both in generalization performance and in the scattering of results as measured by SD and error. Nevertheless, the vote of elementary classifiers cannot be viewed as a variant of boosting; an important difference is that we partition the set of examples. In veto mode, the learned concept is more focused on the core of the real concept, as we require more elementary classifiers to agree on a tuple. Thus, it is not surprising that veto mode performs less well than majority mode. The tuples accepted unanimously are in some sense the "purest" ones with respect to the real concept, and the error reported in Table 2 corresponds to rejected positive tuples, never to accepted negative ones. For optimization purposes, this can even be an advantage, because the solution space is smaller and the correctness of the answer is better preserved.
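The two voting schemes can be sketched as follows; the names are ours, and each classifier stands for any Boolean function of a tuple:

```python
def veto_vote(classifiers, tuple_):
    """Accept a tuple only if every elementary classifier accepts it
    (unanimity): a single rejection vetoes the tuple."""
    return all(c(tuple_) for c in classifiers)

def majority_vote(classifiers, tuple_):
    """Accept a tuple if strictly more than half of the elementary
    classifiers accept it."""
    votes = sum(1 for c in classifiers if c(tuple_))
    return votes * 2 > len(classifiers)
```

In veto mode, one rejection suffices to discard a tuple, which is why the accepted region shrinks toward the core of the concept.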
5. FROM CLASSIFIERS TO SOLVERS
When added to a CSP, a constraint must have an active behavior: it should contribute to domain reduction. This is why simply keeping the classifier as the definition of the constraint is not an interesting option; the induced "generate and test" behavior would not prune the search space at all. Another idea could be to first generate offline the solutions of the extension of the constraint and use them for solving with a standard but efficient arc-consistency propagation algorithm like the GAC schema (Bessière & Régin, 1997). Unfortunately, this method suffers from two major drawbacks. First, the generation time is prohibitive, because some constraints have a huge number of solutions. For example, 3 h of "generate and test" computation on the mushroom database could hardly produce 76,835 solutions out of 1.5 × 10⁷ tries. Second, a problem comes from the representation size of the relation: the extension of the 22-ary mushroom constraint contains more than 4 × 10⁶ solutions and would thus need more than 88 MB of memory to be stored. In contrast, our representation is rather economical. For a constraint of arity n, if we assume that the hidden layer contains m neurons and the size of the domains is d, it requires n(n + 1)dm + 1 floats (10 bytes each) to store the weights. For the dashboard constraint (n = 22, m = 3, d = 4), we reach a size of 60 kB; for mushroom (n = 22, m = 3, d = 12), 180 kB.
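The storage estimate above can be checked in a few lines; this is a sketch, and `weight_storage_bytes` is our name:

```python
def weight_storage_bytes(n, m, d, float_size=10):
    """Bytes needed to store the weights of the classifiers of an
    n-ary constraint, following the formula in the text:
    n(n + 1)dm + 1 floats of `float_size` bytes each."""
    return (n * (n + 1) * d * m + 1) * float_size

# dashboard: n=22, m=3, d=4  -> 60,730 bytes, about 60 kB
# mushroom:  n=22, m=3, d=12 -> 182,170 bytes, about 180 kB
```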
We propose to use the learned classifiers also for solving. To do so, let us recall some notions of interval analysis (Moore, 1966). We call 𝕀 the interval lattice built on the set ℝ of real numbers. All functions have extensions to intervals.
Definition 3 (extension to intervals). Let f : ℝⁿ → ℝ be a function. A function F : 𝕀ⁿ → 𝕀 is an extension to intervals of f if, for every box A ∈ 𝕀ⁿ, {f(a) | a ∈ A} ⊆ F(A). █
An extension F is monotonic if A ⊆ B ⇒ F(A) ⊆ F(B). Among all extensions to intervals of f, there is a smallest one, called the canonical extension to intervals: f̂(A) = [min{f(a) | a ∈ A}, max{f(a) | a ∈ A}]. The canonical extension is monotonic. Here are the canonical extensions to intervals of the operators we use in classifiers:

[a, b] + [c, d] = [a + c, b + d]
[a, b] − [c, d] = [a − d, b − c]
[a, b] × [c, d] = [min P, max P], where P = {ac, ad, bc, bd}
1/[c, d] = [1/d, 1/c] (0 ∉ [c, d])
exp([a, b]) = [exp(a), exp(b)]
Division is not a problem in our setting because no interval contains 0 (see the sigmoid denominator). If e is an expression using these operators and E is the same expression obtained by replacing each operator by a monotonic extension, then {e(a) | a ∈ A} ⊆ E(A) for every box A. This property of monotonic extensions is called "the fundamental theorem of interval arithmetic" (Moore, 1966). It also holds when domains are replaced by Cartesian products of intervals. By taking the canonical extension of all basic operators in an expression e, we do not always obtain an extension E that is canonical; we call it instead the natural extension. Multiplication is only subdistributive in interval arithmetic (Moore, 1966), that is, A × (B + C) ⊆ (A × B) + (A × C). Hence, the natural extension is canonical only for expressions with single variable occurrences (single-occurrence theorem; Moore, 1966).
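These canonical extensions of +, −, ×, /, and exp can be implemented directly. A minimal sketch, with intervals represented as (lo, hi) pairs and all function names ours:

```python
import math

# Intervals are pairs (lo, hi) with lo <= hi.

def iadd(a, b): return (a[0] + b[0], a[1] + b[1])
def isub(a, b): return (a[0] - b[1], a[1] - b[0])

def imul(a, b):
    # P = {ac, ad, bc, bd}: the four endpoint products
    p = [a[0]*b[0], a[0]*b[1], a[1]*b[0], a[1]*b[1]]
    return (min(p), max(p))

def iinv(a):
    # valid only when 0 is outside the interval, as in our setting
    assert a[0] > 0 or a[1] < 0
    return (1.0 / a[1], 1.0 / a[0])

def iexp(a):
    # exp is increasing, so the extension maps endpoints to endpoints
    return (math.exp(a[0]), math.exp(a[1]))

def isigmoid(a):
    # sigmoid(x) = 1 / (1 + exp(-x)), composed from the extensions
    # above; the denominator is always > 0, so division is safe
    denom = iadd((1.0, 1.0), iexp(isub((0.0, 0.0), a)))
    return iinv(denom)
```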
An elementary classifier n〈X=a〉 naturally defines a Boolean function of its input variables. Let N〈X=a〉 be its natural interval extension, obtained by taking the canonical extension of each basic operator +, −, ×, /, exp. Then, using the current domains of the variables as input, we can obtain a range for its output: we compute the interval range of every neuron of the hidden layer and use these results to feed the output neuron and compute its range. Because a 0.5 threshold follows the output neuron, we can reject the value a for X if the maximum of the output range is less than 0.5, which means that all tuples in the current domain intervals are rejected. Otherwise, the value remains in the domain.
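The pruning test just described can be sketched as follows. The sketch is self-contained, the names `neuron_range` and `may_accept` are ours, and the per-neuron range computation exploits the fact that a sigmoid neuron is monotonic in each input, so endpoint choices give its range:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def neuron_range(weights, bias, input_ranges):
    """Interval output of one sigmoid neuron: each weight picks the
    endpoint of its input interval that minimizes (resp. maximizes)
    the weighted sum; sigmoid is increasing, so it maps the endpoints
    of the sum to the endpoints of the output."""
    lo = hi = bias
    for w, (a, b) in zip(weights, input_ranges):
        lo += w * (a if w >= 0 else b)
        hi += w * (b if w >= 0 else a)
    return (sigmoid(lo), sigmoid(hi))

def may_accept(hidden, output, input_ranges, threshold=0.5):
    """True iff the value must stay in the domain: the maximum of the
    classifier's output range reaches the 0.5 threshold. `hidden` is a
    list of (weights, bias) pairs, `output` a single (weights, bias)."""
    hidden_ranges = [neuron_range(w, b, input_ranges) for w, b in hidden]
    out_lo, out_hi = neuron_range(output[0], output[1], hidden_ranges)
    return out_hi >= threshold
```

When `may_accept` returns False, every tuple in the current box is rejected by the classifier, so the value can safely be removed.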
Proposition 4. N〈X=a〉 is an ERF. █
Proof: We only need to check the correctness of the function applied to a search state s with respect to the tuples accepted by the partially defined constraint. If a tuple t such that t|X = a belongs to the solutions of the learned constraint, then n〈X=a〉((t|Y)Y∈W−{X}) = 1. Hence, if t ∈ ∏s, we have max(N〈X=a〉(s|W−{X})) = 1 because N is a monotonic extension. █
By doing this for every value of every variable's domain, we can define a consistency operator fX that gathers the results of the ERFs: for s ∈ SW, fX(s) = s′, where s′X = sX ∩ {a ∈ DX | max(N〈X=a〉(s|W−{X})) = 1} and s′Y = sY for Y ≠ X.
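A sketch of the operator fX, assuming an oracle `out_max(X, a, domains)` that returns the maximum of the output range of N〈X=a〉 on the current domains (both names are ours):

```python
def f_X(X, domains, out_max, threshold=0.5):
    """Consistency operator for variable X: keep a value a in the
    domain of X only if the interval extension of the elementary
    classifier n<X=a> can still exceed the threshold on the current
    domains. Domains of the other variables are left unchanged."""
    new = dict(domains)
    new[X] = {a for a in domains[X]
              if out_max(X, a, domains) >= threshold}
    return new
```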
Proposition 5. The operators (fX)X∈W define a consistency for c. █
Proof: Each operator fX is monotonic, contracting, and correct (by the fundamental theorem of interval arithmetic). Together they are singleton complete (because the extension of the partially defined constraint is defined by them). █
We call learned consistency (lc) the consistency defined by the learned propagators. Because we use multiple occurrences of the same variable, the lc computes an approximation of ac−.
Proposition 6. ac− ⊆ lc. █
Proof: Because multiple occurrences of variables yield a less precise interval, it follows that the maximum of the output interval of the last neuron of an ERF N〈X=a〉 may exceed 0.5 even if there is no support for X = a. Thus, the value is not suppressed as it would be by ac−. █
Note that if we had used single-layer perceptrons, the extension would have been exact and we would have been closer to ac−. However, single-layer perceptrons have severe limitations in learning (Mitchell, 1997). The propagators for each variable are independent; thus, the generalization obtained when using the solver is the one obtained with the veto vote. This is due to the independent scheduling of the consistency operators for each variable in the fixpoint computed by chaotic iteration (Apt, 1999). The exact ac− consistency could be reached only if all elementary classifiers agreed on each value, which is not the case in general. On the one hand, this allows us to strengthen the learning power, as exemplified in the preceding section; on the other hand, it weakens the propagation a little.
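The fixpoint computation by chaotic iteration can be sketched generically; the operators stand for any monotonic, contracting functions from domains to domains:

```python
def propagate(domains, operators):
    """Apply the consistency operators in any order until no domain
    changes. Because the operators are monotonic and contracting,
    chaotic iteration reaches the same greatest fixpoint regardless
    of the scheduling order."""
    changed = True
    while changed:
        changed = False
        for op in operators:
            new = op(domains)
            if new != domains:
                domains = new
                changed = True
    return domains
```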
The Solar system takes a partially defined constraint as input and outputs a set of consistency operators that can be adapted to any solver. In our experiments, we used a custom-made solver. We made two sets of experiments in order to evaluate the learned consistency.
The first one is summarized in Table 3 and describes the search for all solutions using lc. It is done by taking a CSP containing the partially defined constraint alone. The results are averages over 10 runs for the dashboard, breast-cancer-wisconsin, and house-votes-84 databases; only one run was done for the mushroom database. Each run consists of learning a constraint followed by the resolution. Indeed, two runs of the backpropagation algorithm with different random initializations of the weights produce slightly different results. For every partially defined constraint, we use our system to count the number of solutions (#Sol). Because we do not have arc consistency, we record the number of failures lc makes while finding these solutions (#Fail lc). Then we compute the ratio lc = #Fail lc/#Sol. If we had arc consistency, there would not be any backtracks. The purpose of this experiment is to compare lc to what ac could have done, were ac available for partially defined constraints. In terms of failures, the average ratio over all experiments is 1.6 failures per solution. This result should be balanced against the huge number of failures "generate and test" would have produced. We also report the time lc needs to find these solutions (only the resolution time; see Table 2 for the learning time). The mushroom constraint has a very large space and a medium tightness, and we could not obtain its full extension. For the other constraints, however, this is the only method to get all solutions: the Cartesian products of the domains are so large that they prevent the use of generate and test with the classifier.
Our second set of experiments is a random sampling of the reductions made by the different consistencies on random search states (the domains of the variables are arbitrary sets, not intervals). These results are depicted in Tables 4, 5, 6, and 7.
For each constraint, we give the number of tuples of the initial search state (|∏s|) and the number of tuples after an application of each of the operators lc, bc, bc+, ac−, and ac. The data are averages of 1000 experiments with the same average size of search state. In order to compute exactly the consistencies bc, bc+, ac−, and ac, we first computed all solutions included in s in a table with the help of lc; in a second step, we computed all needed projections from the solution table. For example, the last column of Table 4 for the dashboard example shows that, starting from a search space containing 166,563 tuples (in its Cartesian product), the learned consistency reduces it to a search space of 9037 tuples, whereas arc consistency reduces it to 1326 tuples (all these values are averages). This shows that the learned consistency is weaker than more classical consistencies but still notably reduces the search space. Figures 6, 7, 8, and 9 give a graphical view of lc compared to the other consistencies for all examples (using the data of Tables 4, 5, 6, and 7).
In addition, the 22-ary partially defined constraint dashboard has been tested in different optimization problems, as described in Example 1. As expected, the best solution found is an arrangement of the different items invented by the system that was not in the database. Every problem comprises the set of 22 variables describing a dashboard, the dashboard constraint, and some constraints specific to each problem. The optimization criterion consists in minimizing the overall surface, subject to physical and preference constraints. The computed surface corresponds to the global surface of the actual items and not of the dashboard they define, which is larger. For this purpose, an elementary surface is associated with each item, as well as with some particular combinations that take into account a reasonable separating space. Hence, the objective function is more complex than just a sum of the individual surfaces. For every problem, a classical branch and bound search has been used. We have compared the time needed to solve the optimization problem in three contexts.
- Learned: The learned constraint and its propagators are used as described in this paper.
- GAC: All solutions of the constraint have been produced using our technique and put in a table. The table is processed during search using a hyperarc-consistency algorithm.
- Classifier: Only the classifier has been used for the partially defined constraint during search. This means that, most of the time, a dashboard is generated and rejected only at the very end by the classifier (generate and test behavior).
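The classical branch and bound search used for these comparisons can be sketched generically; this is our sketch, not the custom-made solver, and all names are hypothetical:

```python
def branch_and_bound(domains, consistent, cost, lb=lambda a: 0.0):
    """Depth-first branch and bound for minimization: check the
    partial assignment, prune when the lower bound `lb` cannot
    improve on the best cost found so far, branch on the next
    unassigned variable, and record improving complete solutions."""
    best = [None, float('inf')]   # best assignment, best cost

    def search(assign, rest):
        if not consistent(assign):
            return                       # dead end: constraint violated
        if lb(assign) >= best[1]:
            return                       # bound: cannot improve the best
        if not rest:
            c = cost(assign)
            if c < best[1]:
                best[0], best[1] = dict(assign), c
            return
        var = rest[0]
        for val in domains[var]:
            assign[var] = val
            search(assign, rest[1:])
            del assign[var]

    search({}, list(domains))
    return best[0], best[1]
```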
The results follow.
Problem 1. The CSP is composed of four constraints: the learned dashboard constraint, position of speedometer ≠ position of rpm counter, text color ≠ face color, and icons color ≠ face color.
- learned: 2.23 s.
- GAC: 31.34 s.
- classifier: 2 h, 55 min, 46 s.
Minimal surface = 158 cm².
Problem 2. The CSP is composed of three constraints: the learned dashboard constraint, rpm counter size = large, and rpm counter position = center.
- learned: 2.94 s.
- GAC: 25.67 s.
- classifier: >4 h (time out).
Minimal surface = 217 cm². An artistic view of the solution of this example is depicted in Figure 10.
Problem 3. The CSP is composed of six constraints: the learned dashboard constraint, rpm counter size = large, rpm counter position = center, fuel indicator = circular, temperature indicator = circular, and frame color = chrome.
- learned: 0.21 s.
- GAC: 5.61 s.
- classifier: 1 h, 53 min, 23 s.
Minimal surface = 233 cm².
As expected, the use of a generated propagator improves the resolution time. The observed improvement is an order of magnitude over arc consistency computed on a table, and more over generate and test. However, it should be noted that the generation of the table itself requires a propagator in order to be obtained in reasonable time (by avoiding a sweep of the whole Cartesian product). In addition, some constraints are too large to be stored in extension in memory, which is a point in favor of our approach.
6. RELATED WORK AND CONCLUSION
Partially defined constraints were introduced in Lallouet et al. (2004). In Faltings and Macho-Gonzalez (2002), the comparable concept of open constraint was proposed in the context of distributed reasoning but with the goal of minimizing the number of requests needed to complete the definition. They were similarly used in the framework of interactive constraint satisfaction (Alberti et al., in press). Solver learning was introduced in Apt and Monfroy (1999) with a special rule system, but the generation algorithm was a bottleneck to handle large constraints. This work was extended by Abdennadher and Rigotti (2004) and Lallouet et al. (2003), but still in the context of closed constraints. None of these methods can combine generalization and solver efficiency. Partially defined constraints are also related to uncertainty because an uncertain constraint (Yorke-Smith & Gervet, 2003) can be viewed as a limited form of partially defined constraint for which it is assumed that only a few tuples are missing. The idea of learning constraints, extended to the learning of a preference instead of just a Boolean for a tuple, was used in Rossi and Sperduti (2004) in the context of soft constraints. They use an ad hoc neural network to represent the constraint. Although the learning is effective, the problem of building a solver for the constraint was not tackled in this work. In Coletta et al. (2003) and Bessière, Hebrard, et al. (2004), a CSP composed of predefined constraints like = or ≤ was learned. The constraints were discovered by a version-space algorithm that reduces the possible constraints during the learning process. ANNs were considered for solving CSPs in the GENET system (Davenport et al., 1994) but with a completely different approach.
SUMMARY
Partially defined constraints allow the use of constraints partially defined by examples and counterexamples in decision and optimization problems. In this work, we propose a new technique for learning partially defined constraints by using classifiers. Not only does the generalization we obtain have remarkable properties from a machine learning point of view, but it can also be turned into a very efficient solver that gives an active behavior to the learned constraint. In a design perspective, partially defined constraints can be used to represent complex requirements for which a precise definition is either too complex or impossible to get.
ACKNOWLEDGMENTS
The authors thank Patrick Sebastian for suggesting the link to sensorial analysis.