
Probability models for detecting transgenic plants

Published online by Cambridge University Press:  01 June 2008

Carlos M. Hernández-Suárez
Affiliation:
Facultad de Ciencias, Universidad de Colima, Bernal Díaz del Castillo No. 340 Col. Villas San Sebastián, C.P. 28045 Colima, Colima, México
Osval A. Montesinos-López
Affiliation:
Facultad de Telemática, Universidad de Colima, Bernal Díaz del Castillo No. 340 Col. Villas San Sebastián, C.P. 28045 Colima, Colima, México
Graham McLaren
Affiliation:
Biometrics and Bioinformatics Unit, Crop Informatics Laboratory (CRIL), International Rice Research Institute (IRRI), DAPO Box 7777, Manila, Philippines
José Crossa*
Affiliation:
Biometrics and Statistics Unit, Crop Informatics Laboratory (CRIL), International Maize and Wheat Improvement Center (CIMMYT), Apdo. Postal 6-641, México DF, México
*Correspondence j.crossa@cgiar.org

Abstract

When detecting the adventitious presence of transgenic plants (AP), it is important to use an appropriate testing method in the laboratory. Dorfman's group testing method is effective for reducing the number of laboratory analyses, but does not consider the case where AP is diluted below the sensitivity of the analyses, which causes the rate of false negatives to increase. The objective of this study is to propose binomial and negative binomial probabilistic models for determining the required sample size (n), number of pools (g), and size of the pool (k) for detecting individuals possessing AP with a probability ≥ (1 − α) (for a small α) given: (1) pool size (k); (2) estimated proportion of individuals with AP in the population (p); (3) concentration of the trait of interest (AP) in individual seeds (w); and (4) detection limit of the test (c) (AP concentration in a pool below which it cannot be detected). The proposed models consider the different rates of false positives (δ) and false negatives (λ), and the assessment of consumer and producer risks. Results have shown that when using the negative binomial, a required sample size n can be determined that guarantees a high probability that m individuals or g pools containing AP will be found. The pools formed have an optimum size, such that one element with AP will be detected at a low cost. The negative binomial distribution should be used when it is known that the proportion of individuals with AP in the population is p < 0.1; thus, it is guaranteed that m individuals or g pools of individuals with AP will be detected with high probability.

Type
Research Article
Copyright
Copyright © Cambridge University Press 2008

Introduction

The presence of genetically modified plants, hereafter named the adventitious presence of unwanted transgenic plants (AP), is becoming common in modern crop production systems. This reality has created concerns regarding possible gene flow through outcrossing between AP crops and their landraces and wild relatives. This is especially important in a country such as Mexico, a centre of diversity for maize, where the effects of AP maize outcrossing with traditional maize landraces and wild relatives, such as tripsacum and teosinte, are unknown. Recently, different authors have reported contrasting results in terms of detecting AP maize in Mexico. Quist and Chapela (2001, 2002) were the first to report AP in landraces collected in the Sierra Juarez region of the Mexican State of Oaxaca; they specifically identified genes from Bacillus thuringiensis (Bt), a soil bacterium whose genes are used to create maize that is resistant to some insects. In contrast, 4 years later, Ortiz-García et al. (2005a, b) sampled maize landraces in the same region of Oaxaca State and failed to detect AP.

When testing for AP, two distinct activities should be emphasized. The first is determining the optimal sample size (n) and sampling strategy to be used when taking seeds at random from a seed lot (Cleveland et al., 2005; Ortiz-García et al., 2005c); the second is determining the sample preparation and testing method to be used in the laboratory (Remund et al., 2001). The sensitivity of the analyses and specificity of the tests are important factors that may affect the rates of false-negative and false-positive results (Remund et al., 2001). Usually, it will be necessary to collect a large number of seeds from a reference population located in the region of interest, since the frequency of AP is likely to be very low.

Because laboratory tests are expensive, it is not feasible to analyse all n individual seeds collected from a lot. There are several testing plans for reducing the number of samples to be analysed (Montgomery, 1997). One plan consists of testing pooled seed samples (Remund et al., 2001). Conditions listed by Federer (1991) for pooled samples are that: (1) the trait is discrete, i.e. it can be measured as presence or absence, or some countable quantity; (2) the proportion of positives (e.g. AP) is relatively small; and (3) pooling the samples does not alter the characteristics of individual samples.

The group testing method of Dorfman (1943) is effective for reducing the number of laboratory analyses and can result in up to 80% savings in the number of laboratory analyses (Federer, 1991). This method consists of dividing n individual samples (e.g. seeds) into g groups (or pools), each of size k. If a group tests positive, then at least one individual in the pool is positive; the author gives an approximate solution for the optimal value of k. A formula for determining the sample size (n) required for detecting AP can be derived from the Dorfman method. However, a major disadvantage of the Dorfman testing plan is that it is insensitive to the dilution that arises when the group is formed. This is particularly true for large group sizes where the number of AP kernels in the pool can be diluted below the sensitivity of the analyses, which causes the rate of false negatives to increase.

All quality laboratory methods have false positives (δ = probability of falsely detecting a seed with impurity) and false negatives (λ = probability of failing to detect seed with impurity). Furthermore, these two types of errors, which commonly occur in any testing plan, can be integrated in an overall consumer and producer risk assessment. Remund et al. (2001) proposed testing plans that integrate a given lower quality limit (LQL) and acceptable quality limit (AQL) for the consumer and producer risks, respectively.

For sampling seeds from a seed lot (USDA/GIPSA, 2000a, b) and testing seeds in the laboratory (Kay and Van den Eede, 2001; Kay and Paoletti, 2002), a uniform distribution of the number of individuals with AP in the seed lot is assumed. Therefore, when the size of the seed sample (n) is small in relation to the size of the reference population (N), the acceptance sampling method and testing plan generally use a binomial probability distribution with parameter p (frequency of AP in the population). However, when AP is rare, i.e. p ≤ 0.1, using a binomial distribution may not provide an unbiased and precise estimate of p (Haldane, 1945; Cochran, 1980); in this case, using a negative binomial distribution is suggested (Gerrard and Cook, 1972; Kalton and Anderson, 1986).

The main objective of this research is to propose probabilistic models for determining the required (1) sample size, n; (2) number of pools, g; and (3) size of the pool, k, that will detect individuals containing AP with a probability ≥ (1 − α) (for small α). The proposed models were developed within the framework of the Dorfman model, but considering: (1) the dilution effect when forming groups (pools) of seeds to be tested; (2) the detection limit of the laboratory test; (3) the different rates of false positives and false negatives; and (4) the assessment of consumer and producer risks. The probability distributions used in this study were binomial and negative binomial distributions for: (1) pool size, k; (2) estimated proportion of individuals with AP in the population, p; (3) known concentration of the trait of interest (AP) in individual seeds, w; and (4) the detection limit of the test, c (AP concentration in a pool below which it cannot be detected). These models can be used for detecting the presence/absence of AP or any other trait of interest.

The Dorfman model

The procedure proposed by Dorfman (1943) consists of dividing n individuals into g groups or pools, each of size k. Each group is tested; if a group has the AP, then at least one individual has the AP. All k individuals in that pool must then be examined to identify individuals with AP or to estimate the proportion of individuals with AP. This may not be necessary if the objective is simply to know if any individuals have AP. The probabilistic model in this procedure is useful for determining sample size, n, in order to detect individuals possessing AP with acceptable probability, and for determining the optimum pool size, k, and the number of pools, g (note that the sample size n = gk).

Assume a population of size N in which a fraction p has AP [say type (+)]. We consider the problem of determining the optimum values of n and k such that the probability of detecting at least one individual with AP is greater than (1 − α) (for a given α). For sample size n and group size k, g = n/k pools can be formed. If X is the number of (+) individuals in a pool, then P(X = j) (j = 0, 1, 2,…, k) follows a binomial distribution X~Bin(k, p). The probability that a group is (+) is one minus the probability that k randomly selected individuals are all negative:

P ( X > 0) = 1 - (1 -  p )^{ k }

The probability of a pool testing negative ( − ) is P ( X = 0) = (1 - p )^{ k }. Because there are g = n/k pools, the probability of detecting only ( − ) groups, given that the proportion of (+) individuals in the population is p, is

(1)
\left [(1 - p )^{ k }\right ]^{ n / k } = (1 - p )^{ n }

If a small probability, α, of detecting only ( − ) individuals is required, given that there is a proportion p of (+) individuals in the population, then equation (1) can be written as

(2)
(1 - p )^{ n }< \alpha

It should be pointed out that the Dorfman model was not developed with the objective of determining the sample size n, but rather for determining the required number of pools, g, and the size of the pools, k, that will minimize the number of laboratory tests, T. Under these premises, the expected value of T, E(T) = g + kgp', is the number of pools (g = n/k) plus the expected number of individuals in the positive pools that need to be analysed, where p' = 1 - (1 - p)^{k} is the probability of a pool being detected positive. Therefore, the ratio between the expected number of laboratory tests required (T) and the required sample size is a measure of the expected relative cost of the method, E(T)/n = (g + kgp')/n = (1/k) + p' (Dorfman, 1943). Thus, minimizing the number of lab tests is equivalent to finding the minimum relative cost. However, Dorfman's model assumes that when k individuals in a pool are mixed, the AP concentration is not diluted. Under this assumption, the value of n that satisfies equation (2) can be obtained as

(3)
n = \frac{\log(\alpha)}{\log(1 - p)}

The expression given in equation (3) is used by the United States Department of Agriculture (USDA/GIPSA, 2000a, b) to determine sample sizes for detecting AP seeds. It is mentioned that detecting AP is not different from detecting seeds with other discrete traits. This analysis would suggest a single group; however, in practice AP cannot always be detected when the proportion of AP seeds in the group is very small because the analytical methods used may not be sensitive enough. It has been suggested that for a pool size of k = 400 grains, standard analytical procedures in the laboratory should be able to detect the presence of one AP grain. Equation (3) does not give any guidelines as to how the concentration of the trait of interest (AP) (impurity) in individual seeds (w) could affect the pool size k, or how the dilution effect could make AP undetectable by standard analytical procedures in the laboratory (c).
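As a rough illustration of equations (2) and (3) and of Dorfman's relative cost E(T)/n = (1/k) + p', the following Python sketch may help. The function names are hypothetical and do not come from the paper. Rounding n up to the next integer, and restricting the pool-size search to 2..k_max, are assumptions made for the illustration; dilution is deliberately ignored here, as in the standard Dorfman model.

```python
import math


def dorfman_sample_size(p, alpha):
    """Smallest integer n with (1 - p)**n < alpha (equations (2) and (3))."""
    n = math.ceil(math.log(alpha) / math.log(1.0 - p))
    # Guard against the boundary case where the ratio is numerically an integer.
    while (1.0 - p) ** n >= alpha:
        n += 1
    return n


def dorfman_relative_cost(k, p):
    """Expected tests per individual, E(T)/n = 1/k + p', p' = 1 - (1 - p)**k."""
    return 1.0 / k + (1.0 - (1.0 - p) ** k)


def dorfman_best_pool_size(p, k_max=100):
    """Pool size in 2..k_max with the lowest relative cost (assumed search range)."""
    return min(range(2, k_max + 1), key=lambda k: dorfman_relative_cost(k, p))


if __name__ == "__main__":
    # p = 0.01 and alpha = 0.05 give n = 299, the figure quoted in the Results.
    print(dorfman_sample_size(0.01, 0.05))
    print(dorfman_best_pool_size(0.01))
```

Because this sketch has no detection limit, the pool size it returns is not expected to match the pool sizes reported for the dilution-adjusted model.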

Binomial sampling with the dilution effect

When k individuals that form a pool are mixed or homogenized, the AP will be diluted; this dilution effect increases with the size of the pool, and may decrease the AP concentration in the pool below the test's detection limit (c), thereby increasing the number of false negatives [i.e. seed(s) with AP not detected when, in fact, it is present in the group].

We propose a model that considers the dilution effect as well as the laboratory's detection limit for a pool sampling method based on the Dorfman (1943) model. We assume a reference population of size N, with a proportion p of individuals with AP [or type (+)]. We also assume that the concentration of AP per individual, w, is known (i.e. transgenic DNA as % of the total DNA in the seed). When g pools are formed from a total of n individuals, the AP concentration in a single (+) individual in a pool is reduced to wg/n = w/k. If c is the laboratory detection limit, it is required that (w/k) ≥ c, in which case the probability of detecting AP in a pool with at least one (+) individual is 1, and zero otherwise. If a pool has X (+) individuals, then we require [(wX)/k] ≥ c. Note that in this study, the units of AP concentration, w, can be given in % DNA, whereas the units of w for other traits of interest, such as unwanted diseases in the grain, may be given in % kernel (Laffont et al., 2005).
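The detectability condition (wX)/k ≥ c can be sketched in a few lines of Python. The helper names are hypothetical, and the numerical values below are only illustrative (they reuse w, c and k values that appear later in the Results).

```python
def pool_concentration(x, k, w):
    """AP concentration in a pool of k individuals, x of which are (+): w * x / k."""
    return w * x / k


def pool_is_detectable(x, k, w, c):
    """True when the diluted concentration is at or above the detection limit c."""
    return pool_concentration(x, k, w) >= c


if __name__ == "__main__":
    # One (+) seed among 15 with w = 0.01 stays above c = 0.0001.
    print(pool_is_detectable(1, 15, 0.01, 0.0001))
    # One (+) seed among 3 with w = 0.0002 is diluted below c = 0.0001.
    print(pool_is_detectable(1, 3, 0.0002, 0.0001))
```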

The question is: what is the required sample size n and pool size k such that the probability of detecting individuals of type (+) in the population is equal to or greater than (1 − α)? Variable X = number of (+) individuals in the pool of size k (X = 0, 1, 2,…, k) is a binomial variable with parameters k and p, that is, X~Bin(k, p). Hence, the probability that a group will be detected (+) is

(4)
P\left(X \geq \frac{ck}{w}\right) = \sum_{j = ck/w}^{k} P(X = j) = \sum_{j = ck/w}^{k} \binom{k}{j} p^{j} (1 - p)^{k - j}

To compute more precise probability values from equation (4) and avoid rounding errors when calculating ck/w, and because the binomial distribution computes probability for discrete values between 0 and n, we will use the relationship between the binomial and beta distributions given by

\sum_{x = a}^{k} \binom{k}{x} p^{x} (1 - p)^{k - x} = \frac{1}{B(a, b)} \int_{0}^{p} x^{a - 1} (1 - x)^{k - a} \, dx = \frac{\Gamma(k + 1)}{\Gamma(a)\Gamma(k - a + 1)} \int_{0}^{p} x^{a - 1} (1 - x)^{k - a} \, dx,

where the beta function is related to the gamma function by

\frac{1}{B(a, b)} = \frac{1}{\Gamma(a)\Gamma(b)/\Gamma(a + b)} = \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)} = \frac{\Gamma(k + 1)}{\Gamma(a)\Gamma(k - a + 1)} \quad \text{for } a > 0 \text{ and } b = (k - a + 1) > 0.

Thus, P ( X \geq a ) = P ( Y \leq p ), where X~Bin(k, p) and Y~Beta(x|a, b = k − a+1). This has the advantage that, for the beta distribution, a>0 and b>0, which is not possible with the binomial distribution. Therefore, equation (4) can be rewritten as

(5)
P\left(X \geq \frac{ck}{w}\right) = \sum_{j = ck/w}^{k} P(X = j) = \frac{\Gamma(k + 1)}{\Gamma(ck/w)\Gamma(k - ck/w + 1)} \int_{0}^{p} j^{ck/w - 1} (1 - j)^{k - ck/w} \, dj
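The binomial–beta relationship used above can be checked numerically. The sketch below is restricted to integer a, an assumption that keeps it within the Python standard library (the integrand is then a polynomial and Simpson's rule is adequate). For the non-integer values of ck/w that motivate equation (5), a regularized incomplete beta routine would be needed instead. Function names are hypothetical.

```python
import math


def binomial_tail(k, a, p):
    """P(X >= a) for X ~ Bin(k, p), with integer a (left side of the identity)."""
    return sum(math.comb(k, j) * p**j * (1 - p) ** (k - j) for j in range(a, k + 1))


def beta_form(k, a, p, steps=2000):
    """Gamma(k+1)/(Gamma(a)Gamma(k-a+1)) * int_0^p x^(a-1) (1-x)^(k-a) dx.

    The integral is approximated with composite Simpson's rule; `steps` must be even.
    """
    coef = math.gamma(k + 1) / (math.gamma(a) * math.gamma(k - a + 1))

    def f(x):
        return x ** (a - 1) * (1 - x) ** (k - a)

    h = p / steps
    total = f(0.0) + f(p)
    total += 4 * sum(f((2 * i - 1) * h) for i in range(1, steps // 2 + 1))
    total += 2 * sum(f(2 * i * h) for i in range(1, steps // 2))
    return coef * total * h / 3


if __name__ == "__main__":
    # The two values should agree to within numerical integration error.
    print(binomial_tail(10, 3, 0.2), beta_form(10, 3, 0.2))
```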

For the detection of AP, two types of error rates are important. One is the proportion of false positives, δ, which is the probability that one individual (or group) is detected as (+) even though it is ( − ) (1 − δ is the test specificity); the other is the rate of false negatives, λ, which is the probability of an individual or pool testing ( − ) even though it is (+) (1 − λ is the test sensitivity) (Remund et al., 2001). Therefore, the adjusted probability of detecting AP in a group is given by p_{a} = P[(+) \vert (+)] P[(+)] + P[(+) \vert (-)] P[(-)], where P[(+) \vert (+)] = 1 - \lambda, P[(+)] = p_{b}, P[(-)] = 1 - p_{b}, and P[(+) \vert (-)] = \delta, with

(6)
p_{b} = \frac{\Gamma(k + 1)}{\Gamma(ck/w)\Gamma(k - ck/w + 1)} \int_{0}^{p} j^{ck/w - 1} (1 - j)^{k - ck/w} \, dj

Therefore,

(7)
p _{a} = (1 - \lambda ) p _{b} + \delta (1 - p _{b})

When the rates of false positives (δ) and false negatives (λ) are introduced into these equations, and since the pools are formed independently, the number of pools testing (+) can be considered a random variable (Y) with a binomial distribution Y~Bin(g, p_{a}). Therefore, the probability of finding at least one (+) pool is

(8)
P(Y \geq 1) = 1 - \binom{g}{0} p_{a}^{0} (1 - p_{a})^{g} = 1 - (1 - p_{a})^{g}

We want this probability to be ≥ (1 − α); alternatively, we want the probability that none is detected (+) to be < α

(9)
P(Y = 0) = \binom{g}{0} p_{a}^{0} (1 - p_{a})^{g} < \alpha

Note that equation (9) determines the required sample size n with k = [w/c] − 1 and with p_{a} = (1 - \lambda) p_{b} + \delta (1 - p_{b}). We used k = [w/c] − 1 because it is the maximum possible value of the pool size; however, this choice of n and k achieves neither the minimum number of laboratory tests [E(T) = g + kgp'] nor the minimum relative cost of the tests [E(T)/n = (g + kgp')/n = (1/k) + p_{a}] (Dorfman, 1943). The strategy for minimizing the number of laboratory tests is to find, for a given n, a value of k between 1 and min(n, k1 = [w/c] − 1) that satisfies equation (9) with the minimum relative cost.
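A minimal Python sketch of equations (4), (7) and (9) follows. It makes one loud assumption: the lower summation limit ck/w in equation (4) is rounded up to the next integer so that an ordinary binomial sum can be used, whereas the paper evaluates the beta form of equation (5) with the non-integer value. The numbers it produces are therefore not expected to reproduce Table 1. All function names are hypothetical.

```python
import math


def pool_positive_prob(k, p, w, c):
    """p_b via equation (4), ASSUMING the lower limit ck/w is rounded up to an integer."""
    a = max(1, math.ceil(c * k / w - 1e-12))  # small tolerance for float round-off
    if a > k:
        return 0.0
    return sum(math.comb(k, j) * p**j * (1 - p) ** (k - j) for j in range(a, k + 1))


def adjusted_prob(p_b, lam=0.0, delta=0.0):
    """Equation (7): p_a = (1 - lambda) * p_b + delta * (1 - p_b)."""
    return (1 - lam) * p_b + delta * (1 - p_b)


def pools_needed(p_a, alpha):
    """Smallest g with (1 - p_a)**g < alpha, i.e. equation (9)."""
    if not 0.0 < p_a < 1.0:
        raise ValueError("p_a must lie strictly between 0 and 1")
    g = max(1, math.ceil(math.log(alpha) / math.log1p(-p_a)))
    while (1 - p_a) ** g >= alpha:  # guard the boundary case
        g += 1
    return g


if __name__ == "__main__":
    k = 15
    p_b = pool_positive_prob(k, p=0.01, w=0.01, c=0.0001)
    p_a = adjusted_prob(p_b, lam=0.0, delta=0.0)
    g = pools_needed(p_a, alpha=0.05)
    print(g, g * k)  # number of pools and implied sample size n = g * k
```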

Negative binomial sampling with the dilution effect considering false positives and false negatives

Haldane (1945) proposed the inverse sampling method (or negative binomial sampling or inverse binomial sampling) for cases where p is small (i.e. p ≤ 0.1). In this method, sampling continues until m individuals with AP are obtained. Assume a finite population U of size N is divided into two disjoint and complementary sets, C_{i} (i = 1, 2), of size N_{i} (U = C_{1} \cup C_{2}, C_{1} \cap C_{2} = \emptyset) and N = \sum_{i = 1}^{2} N_{i}, where class C_{1} contains individuals with AP and class C_{2} individuals without AP. Then, p = N_{1}/N and q = N_{2}/N = (1 − p) such that N = N_{1} + N_{2} = Np + Nq.

To estimate p, individuals are sequentially sampled until the mth individual of set C_{1} is obtained. The total sample size, n, is a random variable with a probability distribution given by

P(n) = P(E) P(F \vert E)

where E is the event that in a sample of size n − 1, exactly m − 1 individuals belong to set C_{1}, and F is the event where the last individual belongs to set C_{1} (Guenther, 1969). This probability distribution is

P(n) = \begin{cases} \dfrac{\binom{Np}{m - 1} \binom{Nq}{n - m}}{\binom{N}{n - 1}} \left[ \dfrac{Np - (m - 1)}{N - (n - 1)} \right] & (n = m, m + 1, \ldots, m + Nq) \\ 0 & \text{otherwise} \end{cases}

which is a negative hypergeometric distribution with expectation E(n) = ((N + 1)m)/(Np + 1) and variance Var(n) = (m(N + 1)(Np - m + 1)(N - Np))/((Np + 1)^{2}(Np + 2)). However, when trying to detect AP plants, the size of the reference population N is unknown, but very large. In this case, it is reasonable to assume that N → ∞. For this case, it is well known (Kotz et al., 1988) that the random variable n has a negative binomial distribution

(10)
P(n) = \begin{cases} \binom{n - 1}{m - 1} p^{m} q^{n - m} & n = m, m + 1, \ldots \\ 0 & \text{otherwise} \end{cases}

with E(n) = m/p and Var(n) = m(1 - p)/p^{2}. An unbiased estimate of p if m > 1 is \hat{p} = (m - 1)/(n - 1) and an unbiased estimate of the variance of \hat{p} is Var(\hat{p}) = \hat{p}(1 - \hat{p})/(n - 2). Furthermore, the required probabilities from the negative binomial can be computed directly from the binomial distribution due to the following equality (Patil, 1960; Bartko, 1962; Morris, 1963):

\sum_{i = m}^{n^{\ast}} \binom{n^{\ast}}{i} p^{i} q^{n^{\ast} - i} = \sum_{n = m}^{n^{\ast}} \binom{n - 1}{m - 1} p^{m} q^{n - m}
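The equality between the binomial upper tail and the cumulative negative binomial can be verified numerically with a short Python sketch. The function names and the test values of n*, m and p are illustrative assumptions.

```python
import math


def binomial_upper_tail(n_star, m, p):
    """Left side: probability of at least m successes in n_star Bernoulli trials."""
    q = 1 - p
    return sum(math.comb(n_star, i) * p**i * q ** (n_star - i)
               for i in range(m, n_star + 1))


def negative_binomial_cdf(n_star, m, p):
    """Right side: probability that the m-th success occurs on or before trial n_star."""
    q = 1 - p
    return sum(math.comb(n - 1, m - 1) * p**m * q ** (n - m)
               for n in range(m, n_star + 1))


if __name__ == "__main__":
    # Both sides should agree up to floating-point error.
    print(binomial_upper_tail(50, 3, 0.05), negative_binomial_cdf(50, 3, 0.05))
```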

The inverse sampling method suggested by Haldane (1945) is more precise than binomial sampling because when m > 1, the coefficient of variation of p decreases. Therefore, when the dilution effect is considered, and g = n/k independent pools are formed from the sample of n individuals, it is necessary to consider the number of pools with at least one element carrying AP as a random variable with a binomial distribution Y~Bin(g, p_{a}), where p_{a} = (1 - \lambda) p_{b} + \delta (1 - p_{b}) [equation (7)] is the probability that at least one positive element is in a pool adjusted by the rate of false positives and false negatives. Therefore, the probability of finding at least m positive pools out of g (g = n/k) is

(11)
P(Y \geq m) = \sum_{j = m}^{g} \binom{g}{j} p_{a}^{j} (1 - p_{a})^{g - j} = 1 - \sum_{j = 0}^{m - 1} \binom{g}{j} p_{a}^{j} (1 - p_{a})^{g - j}

where g = n/k and k = [w/c] − 1.

We wish to find n such that P(Y \geq m) \geq 1 - \alpha. Using equation (11), this is equivalent to

(12)
\sum_{j = 0}^{m - 1} \binom{n/k}{j} p_{a}^{j} (1 - p_{a})^{n/k - j} < \alpha

Given m, p_{a}, and α, equation (12) can be solved numerically for n. Equation (12) is used to compute the required n with k = [w/c] − 1, g = n/k, and p_{a} = (1 - \lambda) p_{b} + \delta (1 - p_{b}). Similar to the binomial case, for this value of n, values of k between 1 and min(n, k1 = [w/c] − 1) that satisfy equation (12) should be found that will have a minimum relative cost [E(T)/n = (g + kgp')/n = (1/k) + p_{a}] (Dorfman, 1943). Note that for m = 1, equation (12) reduces to

(13)
P(Y \geq 1) = 1 - \binom{g}{0} p_{a}^{0} (1 - p_{a})^{g} > 1 - \alpha

Thus, the binomial sampling method with the dilution effect proposed in equations (13) and (8) is a particular case of the inverse sampling method shown in equation (12). It is also worth noting that for δ = λ > 0, p_{a} will increase and thus n will decrease, compared to the case where δ = λ = 0. However, p_{a} counts all individuals that will be detected as (+), even if they are not truly (+). For determining n, only the probability of the true (+) positives should be considered, as is the case with δ = 0, that is, when p_{a} = (1 - \lambda) p_{b}.
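One way to solve equation (12) numerically is a simple search over the number of pools g, with the sample size then taken as n = gk. The sketch below assumes p_{a} has already been computed from equations (6) and (7) and is passed in directly. The linear search and the function names are illustrative choices, not the authors' program.

```python
import math


def binom_cdf(x, g, p_a):
    """P(Y <= x) for Y ~ Bin(g, p_a)."""
    return sum(math.comb(g, j) * p_a**j * (1 - p_a) ** (g - j)
               for j in range(0, min(x, g) + 1))


def pools_for_m_positives(m, p_a, alpha):
    """Smallest g such that P(Y <= m - 1) < alpha, i.e. equation (12) with g = n/k."""
    if not 0.0 < p_a <= 1.0:
        raise ValueError("p_a must be in (0, 1]")
    g = m  # at least m pools are needed to observe m positive pools
    while binom_cdf(m - 1, g, p_a) >= alpha:
        g += 1  # linear search; adequate for a sketch
    return g


if __name__ == "__main__":
    k = 15           # illustrative pool size
    p_a = 0.05       # illustrative adjusted pool-positive probability
    for m in (1, 3, 5):
        g = pools_for_m_positives(m, p_a, alpha=0.05)
        print(m, g, g * k)  # required pools and implied sample size
```

For m = 1 the search reduces to the condition (1 − p_{a})^{g} < α of equation (9).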

Testing seed plans with the dilution effect considering false positives, false negatives, lower quality limit (LQL), and a given acceptable quality limit (AQL)

The AP testing plan previously outlined has the aim of computing the required sample size (n), pool size (k), and number of pools (g) to guarantee, with probability ≥ (1 − α), that at least one AP plant (or AP pool) will be in the sample. These types of testing plans have zero tolerance because they focus only on the limiting quality level (LQL = p), and do not consider the acceptable quality level (AQL) (i.e. AQL = 0), which refers to different seed production levels under normal conditions. Zero-tolerance testing plans generally have a high producer risk (Remund et al., 2001). In practice, tolerance testing plans have two main parameters: (1) the number of individual seeds (or seed pools); and (2) the maximum number of unacceptable seeds (or seed pools) that can be tolerated in the sample before the seed lot is rejected. Thus, in practice, testing plans should consider the consumer's and the producer's interests and risks by assessing the LQL and AQL. More details on the definition and description of LQL and AQL for consumer and producer risks can be found in Remund et al. (2001).

When n individuals (or g seed pools) are chosen from a seed lot, and it is decided to reject the lot if more than m AP plants are observed, it is necessary to construct an operating characteristic (OC) curve that plots the true AP proportion (p) versus the probability of accepting the lot. OC curves are useful for evaluating whether or not a given testing plan satisfies the testing objectives. In this case, the probability that the lot will be accepted, given its true AP proportion (p), λ, δ, and considering w and c is given by

(14)
P(\text{Accept Lot} \vert p, n, k, c, w) = P(Y \leq m) = \sum_{j = 0}^{m} \binom{n/k}{j} p_{a}^{j} (1 - p_{a})^{n/k - j}

where p_{a} = (1 - \lambda) p_{b} + \delta (1 - p_{b}). Equation (14) can be used to estimate the consumer and producer risks for a given lower quality limit (LQL) and a given acceptable quality limit (AQL), such as

(15)
\text{Consumer Risk} = P(Y \leq m \vert p = \text{LQL}, n, k, c, w) = \sum_{j = 0}^{m} \binom{n/k}{j} p_{a}^{j} (1 - p_{a})^{n/k - j}

and

(16)
\text{Producer Risk} = P(Y > m \vert p = \text{AQL}, n, k, c, w) = \sum_{j = m + 1}^{n/k} \binom{n/k}{j} p_{a}^{j} (1 - p_{a})^{n/k - j}

Equations (15) and (16) are adapted from Remund et al. (2001), including the dilution effect and the testing limit c. If the value of k from equations (14), (15) and (16) is > [w/c] − 1, the program will search for a k between 1 and min(n, k1 = [w/c] − 1) that will guarantee that the AP concentration (w) in the pool is larger than the detection limit (c) and that the size of the pool has the minimum relative cost.
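Equations (14) to (16) can be sketched in Python as follows. The sketch assumes that the adjusted probability p_{a} has already been evaluated at the proportion of interest (p = LQL or p = AQL) through equations (6) and (7), and takes it as an input. The function names and the example values are hypothetical.

```python
import math


def accept_prob(m, g, p_a):
    """Equation (14): P(Y <= m) for Y ~ Bin(g, p_a), with g = n/k pools."""
    return sum(math.comb(g, j) * p_a**j * (1 - p_a) ** (g - j)
               for j in range(0, min(m, g) + 1))


def consumer_risk(m, g, p_a_at_lql):
    """Equation (15): probability of accepting a lot whose true p equals the LQL."""
    return accept_prob(m, g, p_a_at_lql)


def producer_risk(m, g, p_a_at_aql):
    """Equation (16): probability of rejecting a lot whose true p equals the AQL."""
    return 1.0 - accept_prob(m, g, p_a_at_aql)


if __name__ == "__main__":
    g, m = 60, 1  # illustrative plan: 60 pools, accept if at most 1 pool is (+)
    # A few points of an operating characteristic curve, indexed by p_a.
    for p_a in (0.005, 0.01, 0.05, 0.10):
        print(p_a, round(accept_prob(m, g, p_a), 4))
```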

Results

The binomial sampling method with the dilution effect

For different values of p and w, Table 1 shows the sample size, n, number of pools, g, and pool size, k, required to achieve 95% and 99% probability of detecting AP in the sample using the modified Dorfman method with the dilution effect. Clearly, as p increases and w increases, the required n decreases. When w = 0.0002, all individuals must be tested, g = n and k = 1; as w increases, a smaller number of groups (g) of larger sizes (k) must be tested.

Table 1 Sample size (n), group size (k), and number of groups (g) for various values of p and w required for achieving a 95% and 99% probability of detecting at least one individual with AP using the binomial distribution with λ=0 (rate of false negatives), δ=0 (rate of false positives) and the dilution effect (c=0.0001)

Sample sizes, n, for different values of p and w, are very different for the standard Dorfman model, as compared with those obtained from the modified Dorfman model with the dilution effect. For example, consider that the laboratory detection limit is c = 0.0001; then for p = 0.01 and w = 0.0002, the required sample size to guarantee, with 0.95 probability, the detection of at least one (+) individual is 299 individuals using the standard Dorfman method of equation (3). However, the modified Dorfman method with the dilution effect required testing 7036 individuals [equation (8)] (Table 1). For p = 0.01 and w = 0.01, the required sample size from equation (3) is 299 individuals, while using the modified Dorfman method of equation (8), the required sample size is 892 individuals (Table 1); however, while the modified Dorfman method with the dilution effect recommends performing 60 laboratory tests with 15 grains each (g = 60, k = 15), the traditional Dorfman method without the dilution effect does not say how many groups (g) (i.e. g, laboratory tests) and group sizes (k) are required.

Therefore, the two methods give rise to different sample sizes, n, but the modified Dorfman method with the dilution effect has the advantage that it gives a precise value for g and k. The Dorfman model, which emphasizes pool size, was proposed with the objective of minimizing the number of laboratory tests. However, it disregards the dilution effect and the laboratory detection limit, thus increasing the probability of false negatives (i.e. detecting no AP grains in a sample when, in fact, there are some present). The modified Dorfman model adjusted for the dilution effect considers pool size as well as the laboratory detection limit. The problem is that when pool size increases, the dilution of AP increases; that is, AP concentration, w, decreases and may become smaller than the laboratory detection limit, c (i.e. w < c). It is important to point out that w is considered a fixed quantity for a given grain, but this concentration decreases when grains in the pool are mixed because only a fraction, p, contain AP.

When p increases, n, k and g decrease, but for a given value of p, when w increases, k may increase or decrease (Table 1). For c = 0.0001, p = 0.01, and w = 0.0002, the sample size for α = 0.05 was n = 7036, and the pool size was k = 1. This result indicates that each individual must be tested separately, otherwise the concentration in the pool would fall below the detection limit of 0.0001. For the same case, but assuming w = 0.01 with α = 0.05, then n = 892, k = 15. The increase in AP concentration should reduce the total cost of laboratory testing because now we have an optimum group size that minimizes the number of laboratory tests without diluting the AP below the laboratory detection limit. Even if only one element of the 15 had AP, the AP concentration in the group would be 0.01/15 ≈ 0.000667, which is still detectable.

Figure 1 plots the relative cost of laboratory tests versus the group size (k) under different values of n and p; it shows, under the binomial distribution with dilution, that the optimum pool size is between 1 and min(n, k1 = [w/c] − 1). For example, for p = 0.01, n = 892, the optimum group size that minimizes the relative cost is k = 15; for p = 0.03, n = 100, the optimum pool size for minimizing the relative cost is k = 9; for p = 0.05, n = 90, the best pool size is k = 6; for p = 0.07, n = 56, the best size is k = 5; and for p = 0.09 and 0.11, n = 40 and 31, respectively, the best size for both is k = 4.

Figure 1 The relative costs of laboratory tests as a function of optimum group size (k) and sample size (n) for different values of p [proportion of the adventitious presence of transgenic plants (AP) in the population] under the binomial distribution with the dilution effect for c = 0.0001, λ = 0, δ = 0, w = 0.01 and (1 − α) = 0.95.

The negative binomial sampling method with the dilution effect considering false positives and false negatives

Table 2 shows, for different values of m and p, the sample size, n, group size, k, and number of pools, g, needed to be able to detect the AP with a 0.95 probability obtained using equation (11) or (12); this assumes that the rate of false negatives is λ = 0.02, and the proportion of false positives is δ = 0. Results show that as p increases, sample size (n) decreases. When more AP individuals must be detected (m), the sample size increases for all levels of the other factors. As already shown in the theoretical sections, an increase in the number of individuals to be detected increases the precision of the estimate of p. For example, for p = 0.01, w = 0.0008, c = 0.0001, and m = 1, the required n to detect an individual containing AP with a 95% probability is 6553, whereas for p = 0.01, w = 0.0008, c = 0.0001, and m = 3 and m = 11, the required sample sizes are 13,777 and 37,143, respectively, and the pool size is 7. However, for the same value of p, but w = 0.01, the n required for m = 3 and m = 11 are n = 2179 and n = 6139, respectively, with a pool size of k = 15. On the other hand, assuming p = 0.03, w = 0.0008 in order to detect m = 1, 3, 5, 7, 9, and 11 AP individuals with a 95% probability, the required sample sizes are n = 883, 1863, 2717, 3515, 4285, and 5041, respectively, with k = 7 for each pool. However, if w = 0.006, the required sample sizes are n = 237, 532, 827, 1122, 1358, and 1653, with a required pool size of k = 8.

Table 2 Sample size (n), group size (k), and number of groups (g) for various values of p and w required for achieving a probability of (1−α)=95% of detecting different numbers of individuals (m) with the trait of interest using the negative binomial distribution with the dilution effect and a probability of λ=0.002 of false negatives and δ=0.00 of false positives

When p is modelled by the negative binomial, its coefficient of variation does not change much with different values of p (Fig. 2). However, it decreases considerably when more individuals with AP must be detected. With m ≥ 20, the coefficient of variation of p decreases at least 30% compared to the case where m = 1; however, this will significantly increase the sample size and, therefore, the total cost of sampling and testing.

Figure 2 Coefficient of variation (CV%) of p under the negative binomial distribution with the dilution effect for different values of m and p for obtaining sample sizes with c = 0.0001, λ = 0, δ = 0, w = 0.0008 with 95% probability of detecting individuals with the adventitious presence of transgenic plants (AP).

Results using the negative binomial distribution indicate that an important increase in detection precision (decrease in the coefficient of variation of p) can be achieved in the testing plan by increasing the number of individuals with the AP that need to be detected, m. However, this implies an important increase in the total cost of sampling and testing. When λ>0 and δ = 0, the sample sizes tended to increase, whereas when both error rates (λ and δ) increase, sample size tends to decrease, but group size remains constant (data not shown).

Testing seed plans with the dilution effect considering false positives, false negatives, lower quality limit (LQL), and a given acceptable quality limit (AQL)

Figure 3 depicts the operating characteristic (OC) curves for four different laboratory testing plans with different values of g and k (g = 80, k = 40; g = 54, k = 60; g = 40, k = 80; and g = 32, k = 100, with n = 3200, m = 8, c = 0.0001 and w = 0.014) for an LQL of 1.0% and an AQL of 0.5% using equations (14), (15) and (16). The consumer risks for the four plans are 0.0060, 0.0248, 0.0759 and 0.1526, whereas the producer risks are 0.3787, 0.1863, 0.0764 and 0.0317. Results show that the best testing plan is g = 40 and k = 80, because it gives the best balance of low consumer (7.59%) and producer (7.64%) risks; that is, when g = 40 and k = 80, seed lots with 1.0% of AP plants will be accepted 7.59% of the time, and seed lots with 0.5% AP plants will be rejected 7.64% of the time. Figure 3 shows the influence of m, c, and w on producer and consumer risks. Specific testing plans guarantee that if there is at least one pool with AP, this will be detected. However, these specific testing plans do not guarantee low consumer and producer risks.
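The comparison of the four plans can be made explicit with a small Python sketch. It uses only the risk values reported above. The selection rule (choose the plan whose larger risk is smallest) is an assumption made here for illustration, not a criterion stated by the authors, and the names `plans` and `most_balanced` are hypothetical.

```python
# Consumer and producer risks reported in the text for the four plans
# (n = 3200, m = 8, c = 0.0001, w = 0.014, LQL = 1.0%, AQL = 0.5%).
plans = [
    {"g": 80, "k": 40, "consumer": 0.0060, "producer": 0.3787},
    {"g": 54, "k": 60, "consumer": 0.0248, "producer": 0.1863},
    {"g": 40, "k": 80, "consumer": 0.0759, "producer": 0.0764},
    {"g": 32, "k": 100, "consumer": 0.1526, "producer": 0.0317},
]


def most_balanced(candidates):
    """Return the plan whose worse risk is smallest (an assumed minimax rule)."""
    return min(candidates, key=lambda plan: max(plan["consumer"], plan["producer"]))


best = most_balanced(plans)
print(best["g"], best["k"])  # prints: 40 80
```

Under this rule the plan with g = 40 and k = 80 is selected, because neither of its risks exceeds 7.64%, while every other plan has at least one risk above 15%.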

Figure 3 Operating characteristic curves for four different seed lot testing plans (with n = 3200, m = 8, λ = 0, δ = 0, c = 0.0001 and w = 0.014) with consumer risk (lower quality limit, LQL) = 1.0% and producer risk (acceptable quality limit, AQL) = 0.5%.

Figure 4 shows the effect of the detection limit (c) on the OC curves. The first three OC curves (from right to left) have detection limit values of c = 0.00014, c = 0.00012 and c = 0.00010, with n = 1820, k = 15, m = 6, and w = 0.0088. The OC curve farthest to the left shows the standard testing plan (for details, see Remund et al., 2001), which does not include the dilution effect; no values of c and w are used. If the LQL and AQL are set at 1.1% and 0.4% and c = 0.00014, 0.00012, 0.00010 and without considering c (standard testing plan), consumer risks are 0.0868, 0.0514, 0.0281 and 0.0003, whereas producer risks are 0.0419, 0.0692, 0.1102 and 0.5727. The best testing plan is c = 0.00012, because it gives the best balance of low consumer (5.14%) and producer (6.92%) risks. It is interesting to note that, in the standard testing plan, the consumer risk is 0.03%, and the producer risk is 57.27%. Results show the differences among the four testing plans and the important role that the detection limit (c) plays in designing balanced testing plans, because different values of c and w greatly affect the laboratory testing plans.
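For orientation, the following Python sketch shows how the risks of a standard pooled testing plan without dilution are commonly computed. It is not the authors' equations (14) to (16), which are not reproduced here. It assumes that a lot is accepted when at most m of g pools test positive, that pools are independent, and that a pool of size k is truly positive with probability 1 − (1 − p)^k. The way the error rates enter, the function names, and the choice g = 121 (the integer part of 1820/15) are all assumptions, so the output need not match the reported values exactly.

```python
from math import comb


def pool_positive_prob(p, k, fn_rate=0.0, fp_rate=0.0):
    """Probability that a pool of k seeds tests positive (no dilution effect).

    Assumption: a truly positive pool is missed with probability fn_rate and a
    truly negative pool is flagged with probability fp_rate.
    """
    truly_positive = 1.0 - (1.0 - p) ** k
    return (1.0 - fn_rate) * truly_positive + fp_rate * (1.0 - truly_positive)


def acceptance_prob(p, g, k, m, fn_rate=0.0, fp_rate=0.0):
    """Probability of accepting the lot: at most m of g pools test positive."""
    pi = pool_positive_prob(p, k, fn_rate, fp_rate)
    return sum(comb(g, i) * pi**i * (1.0 - pi) ** (g - i) for i in range(m + 1))


def risks(lql, aql, g, k, m, fn_rate=0.0, fp_rate=0.0):
    """Consumer risk (accepting at LQL) and producer risk (rejecting at AQL)."""
    consumer = acceptance_prob(lql, g, k, m, fn_rate, fp_rate)
    producer = 1.0 - acceptance_prob(aql, g, k, m, fn_rate, fp_rate)
    return consumer, producer


# Plan from the text: n = 1820, k = 15, m = 6, LQL = 1.1%, AQL = 0.4%.
# g = 121 is an assumed rounding of n / k.
consumer, producer = risks(lql=0.011, aql=0.004, g=121, k=15, m=6)
print(f"consumer risk = {consumer:.4f}, producer risk = {producer:.4f}")
```

Evaluating `acceptance_prob` over a grid of p values would trace the OC curve itself; the two risks are simply that curve read at the LQL and the AQL.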

Figure 4 Operating characteristic curves for four different seed lot testing plans with detection limits of c = 0.00010, 0.00012, 0.00014, m = 6, λ = 0, δ = 0 and w = 0.0088 (with n = 1820, k = 15) and for the standard testing plan with consumer risk (lower quality limit, LQL) = 1.1% and producer risk (acceptable quality limit, AQL) = 0.4%.

Conclusions

A sample size that guarantees a high probability (1 − α) that at least one individual with AP will be detected can be obtained by applying the modified Dorfman method using the binomial distribution and considering the dilution effect and the laboratory detection limit. When using the negative binomial and performing inverse sampling, a required sample size n can be determined that guarantees a high probability (1 − α) that m individuals or groups containing AP will be found. The method offers a strategy for forming groups (pools) from the sample that will be subjected to laboratory tests, and for determining an optimal number of pools that guarantees that if there is at least one group with AP, there is a high probability that it will be detected. The groups formed have an optimum size, such that one element with AP will be detected at a low cost.

The binomial distribution should be used (modified Dorfman method) when it is known that the proportion of AP in the population is large, p > 0.1; otherwise, the inverse sampling method is recommended, because it guarantees that m individuals or pools of individuals with AP will be detected with high probability. It is important to point out that the precision with which p is estimated for detecting the AP is related to the value of m. This research shows that good precision is achieved for estimates of p with m > 11, since this leads to a coefficient of variation < 25% for any value of p ≤ 0.1. However, the increase in precision is accompanied by a significant increase in the cost of laboratory testing. Larger values of concentration of the AP in the seeds, w, will require fewer laboratory tests and, therefore, the overall cost of testing will decrease. When the probability of false positives is equal to zero, performing laboratory tests pool by pool is recommended until AP is detected in the gth pool. This is sufficient to conclude that there are individuals with AP in the lot.

The approach used in this study considers the proportion of false negatives, which is never zero in practice. Assuming a low proportion of false negatives, the models proposed in this study facilitate computing a more precise sample size for detecting AP that is larger than that obtained when it is assumed that the proportion of false negatives is zero. Furthermore, for designing sampling testing plans, the authors propose incorporating the dilution effect and considering the rates of false positives, false negatives, the lower quality limit (LQL), and a given acceptable quality limit (AQL). The LQL and AQL give the OC curves and the producer and consumer risks, which facilitate making decisions on important practical matters.

Possible disadvantages of the method are that (1) it does not provide a closed-form solution for sample size, pool size, and total cost (however, a computer program in MatLab is available); and (2) the value of w may be difficult to obtain (but an average value given by the % DNA per grain may be used). In the case of genetically modified plants, the weight of the enzyme or other protein formed by that specific DNA sequence may be estimated as a proportion of the total weight of the grain or as a percentage of DNA.

Program in MatLab

The second author of this paper has developed a program in MatLab for computing the optimal sample size, n, number of pools, g, and pool size, k, for different cases, and for calculating the consumer and producer risks. The software can be downloaded from http://docente.ucol.mx/oamontes1 and has two windows, sample size (window 1) and OC curves (window 2).

The sample size window computes the optimal sample size (n), number of groups (g), and size of the groups (k), for different cases. The following information must be given: an estimate of the proportion of individuals with AP in the population (p); an estimate of AP concentration in the seed (w); the laboratory detection limit (c); number of pools or individuals with AP that should be detected (m); rate of false negatives (λ); and the value of alpha (α). The program will then provide n, k, and g. Note that when m = 1, sampling is based on the binomial distribution (since this is a particular case of the inverse sampling method); when m>1, sampling is based on the negative binomial. When the rate of false negatives, λ, is unknown, a value of 0 should be given.

Window 2 draws the OC curves and calculates the producer and the consumer risks. The computer software that generates the proposed testing plans requires the following information: the value of the sample size (n); the pool size (k); the limit of AP seeds (m) that will be accepted; the value of LQL (proportion of individuals with AP in the population); the value of AQL; an estimate of AP concentration in the seed (w); the laboratory detection limit (c); the rate of false negatives (λ); the rate of false positives (δ); and the value of alpha (α). The values of λ and δ may be ≥ 0. The program will then generate the OC curves with consumer and producer risks that are useful for evaluating whether or not a given testing plan satisfies the testing objectives. If the value of k proposed by the consumer and producer is ≤ min(n, [w/c] − 1), the program automatically draws the curves and calculates the consumer and producer risks; otherwise, the values used will be between one and min(n, k1 = [w/c] − 1), with the aim of guaranteeing that AP concentration (w) in the pool is larger than the detection limit (c), and that the size of the pool represents a minimum relative cost.
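The bound on the pool size described above can be illustrated with a short Python sketch. It is not the authors' MatLab code. It assumes that [w/c] denotes the integer part of w/c, and it falls back to individual testing (k = 1) when that bound drops below one; the function names are hypothetical.

```python
import math
from fractions import Fraction


def max_pool_size(n, w, c):
    """Largest pool size allowed by the bound min(n, k1), with k1 = [w/c] - 1.

    Assumptions: [w/c] is read as the integer part of w/c, and the result is
    never allowed to fall below 1 (individual testing).
    """
    k1 = math.floor(Fraction(str(w)) / Fraction(str(c))) - 1
    return max(1, min(n, k1))


def admissible_pool_size(k, n, w, c):
    """True when a proposed pool size k lies between 1 and min(n, k1)."""
    return 1 <= k <= max_pool_size(n, w, c)


print(max_pool_size(7036, 0.0002, 0.0001))  # w/c = 2, so only k = 1 is allowed
print(max_pool_size(6553, 0.0008, 0.0001))  # w/c = 8, so k can be at most 7
print(admissible_pool_size(15, 892, 0.01, 0.0001))
```

Exact rational arithmetic is used because ratios such as 0.01/0.0001 sit on an integer boundary, where floating-point rounding could shift the integer part by one.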

References

Bartko, J.J. (1962) A note on the negative binomial distribution. Technometrics 4, 609–610.
Cleveland, D.A., Soleri, D., Aragón-Cuevas, F., Crossa, J. and Gepts, P. (2005) Detecting (trans)gene flow to landraces in centers of crop origin: lessons from the case of maize in Mexico. Environmental Biosafety Research 4, 197–208.
Cochran, W.G. (1980) Técnicas de Muestreo. México, Editorial Continental.
Dorfman, R. (1943) The detection of defective members of large populations. Annals of Mathematical Statistics 14, 436–440.
Federer, W.T. (1991) Statistics and society. Data collection and interpretation. New York, Marcel Dekker.
Gerrard, D.J. and Cook, R.D. (1972) Inverse binomial sampling as a basis for estimating negative binomial population densities. Biometrics 28, 971–980.
Guenther, W.C. (1969) Modified sampling, binomial and hypergeometric cases. Technometrics 11, 639–647.
Haldane, J.B.S. (1945) On a method of estimating frequencies. Biometrika 33, 222–225.
Kalton, G. and Anderson, D.W. (1986) Sampling rare populations. Journal of the Royal Statistical Society. Series A (General) 149, 65–82.
Kay, S. and Paoletti, C. (2002) Sampling strategies for GMO detection and/or quantification. European Commission Report, Code EUR20239EN, Joint Research Centre Publication Office. Available online at http://bgmo.jrc.ec.europa.eu/home/docs.htm#articles2002 (accessed 10 October 2007).
Kay, S. and Van den Eede, G. (2001) The limits of GMO detection. Nature Biotechnology 19, 405.
Kotz, S., Johnson, N.L. and Read, C.B. (1988) Encyclopedia of statistical sciences, Vol. 3 (1st edition). Toronto, Wiley.
Laffont, J.-L., Remund, K.M., Wright, D., Simpson, R.D. and Grégoire, S. (2005) Testing for adventitious presence of transgenic material in conventional seed or grain lots using quantitative laboratory methods: statistical procedures and their implementation. Seed Science Research 15, 197–204.
Montgomery, D.C. (1997) Introduction to statistical quality control (3rd edition). New York, Wiley.
Morris, K.W. (1963) A note on direct and inverse binomial sampling. Biometrika 50, 544–545.
Ortiz-García, S., Ezcurra, E., Schoel, B., Acevedo, F., Soberón, J. and Snow, A.A. (2005a) Absence of detectable transgenes in local landraces of maize in Oaxaca, Mexico (2003–2004). Proceedings of the National Academy of Sciences, USA 102, 12338–12343.
Ortiz-García, S., Ezcurra, E., Schoel, B., Acevedo, F., Soberón, J. and Snow, A.A. (2005b) Correction. Proceedings of the National Academy of Sciences, USA 102, 18242.
Ortiz-García, S., Ezcurra, E., Schoel, B., Acevedo, F., Soberón, J. and Snow, A.A. (2005c) Reply to Cleveland et al.'s ‘Detecting (trans)gene flow to landraces in centers of crop origin: lessons from the case of maize in Mexico’. Environmental Biosafety Research 4, 209–215.
Patil, G.P. (1960c) On the evaluation of the negative binomial distribution with examples. Technometrics 2, 501–505.
Quist, D. and Chapela, I.H. (2001) Transgenic DNA introgressed into traditional maize landraces in Oaxaca, Mexico. Nature 414, 541–543.
Quist, D. and Chapela, I.H. (2002) Biodiversity (Communications arising (reply)): suspect evidence of transgenic contamination. Maize transgene results in Mexico are artefacts. Nature 416, 602.
Remund, K.M., Dixon, D.A., Wright, D.L. and Holden, R.L. (2001) Statistical considerations in seed purity testing for transgenic traits. Seed Science Research 11, 101–119.
USDA/GIPSA (2000a) Sampling for the detection of biotech grains. Washington, DC, United States Department of Agriculture. Available at website http://archive.gipsa.usda.gov/biotech/sample2.htm (accessed 20 November 2007).
USDA/GIPSA (2000b) Practical application of sampling for the detection of biotech grains. Washington, DC, United States Department of Agriculture. Available at website http://archive.gipsa.usda.gov/biotech/sample1.htm (accessed 20 November 2007).
Table 1 Sample size (n), group size (k), and number of groups (g) for various values of p and w required for achieving a 95% and 99% probability of detecting at least one individual with AP using the binomial distribution with λ=0 (rate of false negatives), δ=0 (rate of false positives) and the dilution effect (c=0.0001)
