INTRODUCTION
Despite radiocarbon laboratories’ continuous efforts to increase the accuracy and precision of measurements, uncertainty regarding the general reliability of 14C dates to correctly date past human activity has long been one of the primary concerns of archaeologists (e.g. van der Plicht and Bruins 2001; Pettitt et al. 2003; Mellars 2006; Buck et al. 2007; Faught 2008; Graf 2009). Although archaeologists are well aware that uncertainty of 14C dates is inevitable and dating results should be understood probabilistically, there remains a strong desire to obtain exact dates for target events.
Sources of uncertainty in 14C dating can be divided into two components: those derived from random errors and those from systematic errors (Ward and Wilson 1978; Scott et al. 2007). The distinction between the two components is important to archaeologists, because they must be dealt with in different ways. Random errors are to be treated statistically, and by increasing sample size, uncertainty can be decreased. When using an accelerator mass spectrometer (AMS), precision is enhanced when extra time is taken to count the numbers of 13C and 14C isotopes. On the other hand, systematic errors should be controlled before results are interpreted. Systematic errors can further be subdivided into archaeological and nonarchaeological systematic errors. The former includes erroneous stratigraphic interpretations during fieldwork, failure to detect later inclusion of materials, contamination of samples during sampling, and the so-called old-wood effect (Schiffer 1986), all of which should be avoided or carefully controlled by archaeologists. Nonarchaeological systematic errors are caused by physical and chemical factors affecting concentration of 14C in dated materials, and many of them have been reported, such as the Suess effect (Stuiver and Suess 1966), marine reservoir effect (Keith and Anderson 1963; Stuiver et al. 1986), hardwater effect (Shotton 1972), contamination through exposure to volcanic ash (Pichler and Friedrich 1976), and bone apatite diagenesis (Price et al. 1992; Nielsen-Marsh and Hedges 2000a, 2000b).
From the perspective of archaeologists, nontaphonomic systematic errors are harder to determine and may include contamination of materials during pretreatment or erroneous measurement of calibration standards. These errors may systematically lead to anomalous results even if the reported dates were statistically treated and taphonomic contamination effects were controlled.
One of the problems that archaeologists face in practice is that when receiving dating results from laboratories, they are rarely able to critically assess whether differences between multiple 14C dates of materials that are expected to be the same age are caused by random or systematic errors, or whether the error is in their expectations of the temporal accumulation of archaeological deposits of the site. In such cases, archaeologists are often at a loss as to whether the results should be statistically treated or controlled in different ways or merely discarded. Although many statistical methods have been developed to deal with random errors (e.g. Ward and Wilson 1978; Christen 1994; Christen and Buck 1998; Buck and Millard 2004; Scott et al. 2007; Bronk Ramsey 2009; Scott 2011), unless archaeologists are able to distinguish systematic errors affecting the amount of 14C during measurement from random errors, statistical treatments of conflicting 14C dates (e.g. statistically combining multiple dates) are not meaningful.
To a certain degree, archaeologists’ practical problems can be mitigated if they can distinguish purely random errors and possible systematic errors. One way to distinguish the two types of errors is to measure a sample believed to be taphonomically consistent under various conditions to test whether systematic errors occur by comparing the results. If the comparison of results indicates that certain conditions repeatedly and consistently produce different results, they may be viewed as possible causes of systematic errors, which should be considered before undertaking a statistical analysis of the ages.
This paper reports the results of an experiment designed to check possible causes of errors of 14C dating of charcoal, by dating samples from single archaeological contexts under a variety of conditions. The experiment attempted to check four possible sources of variability:
(1) Repeatability under identical conditions: When one object is dated under presumably identical conditions in the same laboratory at the same point of time, how different are the results? What is the range of random errors of different aliquots?
(2) Interbatch differences in a laboratory: When multiple subsamples from the same bulk sample are submitted to a laboratory at different points in time, how much does the difference in timing of the analysis affect the outcomes? Do possible differences in measurement background and laboratory settings significantly affect the results?
(3) Interlaboratory difference: The International Radiocarbon Intercomparison (IRI) has been carried out five times thus far (Rozanski et al. 1992; Scott et al. 2003, 2010), but all of the laboratories participating in the experiments were aware that their dating results would be compared with other laboratories. There could be a temptation to treat IRI samples differently than commercial samples if a source of systematic error is suspected. Laboratories may repeat dating the IRI sample, choose some dates considered to be close to the “consensus date” and report them. What if laboratories measure the same sample following their normal protocols, without knowing that they are participating in an interlaboratory experiment (cf. Potter and Reuther 2012)?
(4) Difference between inner and outer rings of wood: When dating long-lived wood from archaeological sites, it is conventional wisdom to select near-surface outer rings rather than inner rings in order to get results closer to an archaeological target event (Bowman 1990). However, in many parts of the world where preservation of organic material is poor, it can be difficult to discriminate from which aspect of a tree the sampled charred wood pieces are derived. How much does this factor affect the results in a given context? Does it lie beyond or within the statistical error range? Although this difference is not related to the uncertainty of 14C dates per se but is context dependent, here we test how much it affects the results in the Korean context because long-lived wood pieces are common in archaeological deposits and are usually used for 14C dating due to the difficulty of identifying in situ seeds that have not been bioturbated.
To examine these possible sources of uncertainty, a blind test was carried out at five AMS laboratories across the world, none of which had prior knowledge of the experiment, so that the results from a suite of 80 samples could be compared statistically. Five bulk charcoal samples from two archaeological sites in the central Korean Peninsula were divided into multiple subsamples and submitted blindly to the laboratories. This article reports the results of the experiment as they pertain to the four potential sources of systematic errors in 14C dating described above.
SITES AND SAMPLES
Samples for the experiment were collected from two archaeological sites in the central Korean Peninsula (Figure 1).
Namgye, Yeoncheon (37°00′22″N, 127°05′53″E)
Namgye is a settlement with four subterranean houses previously known to date to the Proto Three Kingdoms period (100 BC–AD 300) of Korea (Seoul National University Museum 2014). The site is located on a sandy river terrace in the Hantan River Valley. This site was excavated by Seoul National University Museum in 2013.
Hongryeonbong, Seoul (37°33′07″N, 127°00′54″E)
Hongryeonbong is a fortress of the Koguryeo (37 BC–AD 668), an ancient state in northern Korea and northeast China (Choi 2014). This fortress is located on a hilltop on the north bank of the Han River. Historical documents and archaeological evidence indicate that it was constructed around and occupied by Koguryeo’s southernmost frontline troops until the mid-6th century AD, and then reused by Silla, another ancient state competing with Koguryeo, between the late 6th and 7th centuries AD (Choi 2014). The site was excavated by Korea University from 2007 to 2013 (Choi et al. 2007; Korea Institute for Archaeology and Environment 2012).
The sites are located within the humid continental/subtropical climate of the central Korean Peninsula, which includes cold, dry winters and warm, humid summers (Kim et al. 2012). The bedrock is composed primarily of Tertiary granites uplifted as the result of the formation of backarc basins that formed distally to the continental arc as the Pacific Plate subducted orthogonally under the Asian continent (Chough 2013). Therefore, the present-day landscape includes high topographic relief with strongly seasonal monsoonal rainfall. The resulting vegetation mosaic is dominated by coniferous trees that grow on the northern aspects of the mountains with deciduous hardwood species located on the southern aspects (Kim et al. 2012).
Three bulk charcoal samples were collected from Namgye (Namgye 1, 2, 3; hereafter N1, N2, N3) and two from Hongryeonbong (Hongryeonbong 1 and 2; hereafter H1, H2). N1, N2, and N3 (Figure 2) were charred wood from a subterranean house feature abandoned following a fire (House No. 3). These three bulk samples are inferred to have been used as support beams for the wall installed when the house was constructed and are expected to have the same dates as one another within the standard range of error. However, we cannot eliminate the possibility that the beams might have been reused from earlier contexts. H1 and H2 were support beams for one of the inner stone walls of the fortress (Figure 3), likely installed during reinforcement. H1 and H2 are also charred wood and are likewise expected to be contemporaneous at an archaeological timescale. The samples were collected using rubber gloves and trowels, scooping charcoal into clean aluminum foil during excavations carried out in 2013 in collaboration between the excavation teams (Seoul National University Museum for Namgye and Korea University for Hongryeonbong) and our team.
All bulk samples from the two sites were identified as variants of oak (Quercus sp.), which is an abundant genus in Korea (Table 1). Because our aim was to compare the dating results measured under various conditions by dividing samples into many aliquots, we selected large pieces of wood charcoal as bulk samples, although we are aware of possible problems that may arise during pretreatment of charcoal (Gillespie 1997; Bird et al. 1999), homogeneity issues (Scott et al. 2004), age differences from archaeological target events, and the “old wood problem” (Schiffer 1986). Because of the readily available sources of standing hardwood and humid summers present in the region, “old” or recycled wood is not a taphonomic situation commonly considered in dating archaeological sites in Korea.
METHODS
The purpose of our blind test was to check the four potential sources of uncertainty of 14C dates discussed in the previous section, by dating the same samples under multiple conditions. Materially coherent bulk samples were divided into smaller aliquots using clean tweezers and knives for simultaneous and staggered submission to the five AMS laboratories being tested. Each subsample was assigned a subsample identification number to anonymize its relationship to the bulk sample within the suite of materials submitted to the laboratories. Depending on size, each bulk charcoal sample was divided into either 20 (N1, H1, and H2) or 10 (N2 and N3) subsamples; thus, a total of 80 subsamples were sent to the laboratories. Clear division between inner and outer rings of samples was only possible for H1, and the age difference between the inner and outer rings was considered to be approximately 10 to 15 yr, although the number of rings between the two parts was not exactly counted. For N1, N2, N3, and H2, outer parts of the bulk samples were sampled. When dividing bulk samples into aliquots, we were careful to avoid possible contamination. As part of our sampling protocol, rings of similar age were assayed to homogenize the aliquots from each bulk sample as much as possible (N1: 25.6–45.0 mg; N2: 42.6–59.3 mg; N3: 53.2–67.6 mg; H1 inner: 76.7–96.4 mg; H1 outer: 50.6–75.6 mg; H2: 47.3–51.4 mg) to avoid introducing systematic errors from dating different aspects of tree wood (sensu Scott et al. 2003, 2004). We did not pulverize samples because archaeologists rarely do so when they submit samples to laboratories for dating.
Samples were submitted to five AMS laboratories: two in the USA, one in the UK, one in Korea, and one in Japan. We do not specify names of the laboratories subjected to the test here; instead, we randomly assign the laboratory codes as A, B, C, D, and E. Each laboratory measured 16 samples. Among the five laboratories, four (Labs A to D) measured samples twice within a 2-month interval, while Lab E received its 16 samples at one time. Only site names, locations, and subsample identification numbers assigned by the research team were provided to the laboratories, and we did not inform staff at the laboratories that a test was being performed (see footnote 1).
Following receipt of the results of 14C dating from the respective laboratories, dates were analyzed using Bayesian methods. First, medians of uncalibrated BP dates were estimated for each bulk sample by inferring a posterior distribution with the Markov chain Monte Carlo technique. Then, for a subset of samples that consistently showed different dates, the Bayesian p value (Bayarri and Berger 2000), which determines the probability of occurrence of data more deviant than the observed data for relevant statistics, was calculated to assess whether the difference was statistically significant. Non-Bayesian chi-squared tests (Ward and Wilson 1978; Bronk Ramsey 2009) were also carried out for these samples to supplement the Bayesian p value calculation.
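The chi-squared test of Ward and Wilson (1978) mentioned above can be sketched as follows. This is a minimal illustration rather than the procedure actually run for this study: the function names are ours, the example dates are invented, and the 5% critical values are standard table entries hard-coded for up to 10 degrees of freedom. The test pools the dates as an inverse-variance weighted mean and compares the statistic T with the chi-squared distribution on n − 1 degrees of freedom.

```python
import math

# Upper 5% critical values of the chi-squared distribution, keyed by
# degrees of freedom (standard table values; only df 1..10 are covered).
CHI2_CRIT_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070,
                6: 12.592, 7: 14.067, 8: 15.507, 9: 16.919, 10: 18.307}


def ward_wilson(ages, sigmas):
    """Return (pooled_age, pooled_sigma, T, df) for uncalibrated BP dates."""
    weights = [1.0 / s ** 2 for s in sigmas]
    # Inverse-variance weighted mean and its standard deviation.
    pooled_age = sum(w * a for w, a in zip(weights, ages)) / sum(weights)
    pooled_sigma = math.sqrt(1.0 / sum(weights))
    # T sums the squared deviations from the pooled mean, each scaled
    # by the variance of the individual date.
    t_stat = sum((a - pooled_age) ** 2 / s ** 2 for a, s in zip(ages, sigmas))
    return pooled_age, pooled_sigma, t_stat, len(ages) - 1


def is_consistent(ages, sigmas):
    """True when T does not exceed the 5% critical value for its df."""
    _, _, t_stat, df = ward_wilson(ages, sigmas)
    return t_stat <= CHI2_CRIT_05[df]


# Hypothetical dates, invented for illustration (not taken from Tables 2 and 3).
ages = [1860, 1845, 1875, 1850]
sigmas = [30, 25, 30, 20]
pooled_age, pooled_sigma, t_stat, df = ward_wilson(ages, sigmas)
print(f"pooled: {pooled_age:.0f} +/- {pooled_sigma:.0f} BP, T={t_stat:.2f}, df={df}")
print("consistent at 5%:", is_consistent(ages, sigmas))
```

A set of dates that fails this test should not simply be pooled, because the scatter exceeds what the quoted measurement errors alone would produce.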
RESULTS AND DISCUSSION
Among the 80 samples submitted, one sent to Lab A was determined to be undatable; therefore, we report dating results of 79 samples (Tables 2 and 3). Detailed statistical and physical analyses of the results are now in progress; thus, we briefly comment on some aspects of the experiment here.
Precisions of the dates, presented as the standard deviations of uncalibrated BP dates, vary with laboratories, ranging from 15 to 60 yr at 1σ, probably due to differences in isotope counting procedures among laboratories. In general, most dates from each site are in good statistical agreement with one another and the repeatability of measurement under identical conditions appears to be met. Statistically significant differences among subsamples assayed from bulk samples [i.e. what Ward and Wilson (1978) call “Case II error”] were not detected in this experiment, as we expected during collection, although it was not perfectly certain that all the bulk samples from each site were contemporaneous when we collected them in the field. A statistical estimation of median dates using the Markov chain Monte Carlo technique does not show significant differences among bulk samples from each site (N1=1866.59, N2=1856.51, N3=1855.06, H1=1504.87, and H2=1493.27). Therefore, multiple bulk samples from each site can be seen as statistically overlapping between analyses even between different laboratories. The estimated median of 39 subsamples from Namgye settlement (N1 to N3) is 1859±14 BP, and that of the other 40 samples from Hongryeonbong (H1 and H2) is 1492±15 BP (Lee et al. 2014).
There are a few outliers in the data, which warrant discussion. One date of the N1 bulk sample (MRR2013-4: 2170±60 BP) measured by Lab A and one of N2 (MRR2013-53: 2275±25 BP) by Lab C fall outside the 2σ interval of the aggregate estimate generated from all samples. When the two anomalous values are manually removed, the median Markov chain Monte Carlo age of Namgye becomes 1853±12 BP.
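The outlier check described above can be illustrated with a short sketch. The text does not spell out the exact criterion, so the rule used here is an assumption on our part: a date is flagged when its own 2σ range does not overlap the 2σ range of the aggregate estimate. The function names are hypothetical. The two flagged dates and the Namgye aggregate (1859±14 BP) are taken from the text, and the third date is invented for contrast.

```python
def two_sigma_range(age, sigma):
    """Return the (lower, upper) bounds of age +/- 2 sigma."""
    return age - 2 * sigma, age + 2 * sigma


def outside_aggregate(age, sigma, agg_age, agg_sigma):
    """True when the date's 2-sigma range misses the aggregate's 2-sigma range."""
    low, high = two_sigma_range(age, sigma)
    agg_low, agg_high = two_sigma_range(agg_age, agg_sigma)
    return high < agg_low or low > agg_high


# Aggregate estimate for Namgye reported in the text.
NAMGYE = (1859, 14)

dates = {
    "MRR2013-4": (2170, 60),    # Lab A, reported as anomalous
    "MRR2013-53": (2275, 25),   # Lab C, reported as anomalous
    "hypothetical": (1880, 30), # invented date for comparison
}

for label, (age, sigma) in dates.items():
    flagged = outside_aggregate(age, sigma, *NAMGYE)
    print(f"{label}: {age}+/-{sigma} BP, flagged={flagged}")
```

Under this reading both reported outliers are flagged, while a date such as the invented 1880±30 BP is not.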
The interlaboratory variance does not seem significant in general. However, the results from Lab B tend to be younger than the other laboratories’ results (Table 4). A closer look at the results suggests that this tendency results from Lab B’s interbatch differences: dates measured from Batch 1 were consistently younger than those measured in Batch 2 two months later, for all samples regardless of site and bulk sample (Figures 4 and 5). Comparison of the dates with those measured by the other laboratories indicates that the dates of Lab B Batch 2 are in closer agreement with the dates generated from the other laboratories, unlike those of Batch 1 (Figures 6, 7, 8, and 9).
To assess the amount of possible bias with Lab B Batch 1, we calculated the Bayesian p value. In our study, $y_{\mathrm{obs}}$ was the observed data, $y_{\mathrm{rep}}$ represented replicated data, $\theta$ was the parameter, and $T$ was the statistic representing deviation of the data. While the classical p value is defined by $p_c = P(T(y_{\mathrm{rep}}) \le T(y_{\mathrm{obs}}) \mid \theta)$ for fixed $\theta$, the Bayesian posterior predictive p value is defined by $p_b = P(T(y_{\mathrm{rep}}) \le T(y_{\mathrm{obs}}) \mid y_{\mathrm{obs}})$, which is the probability that the replicated data deviate from the current model more than the observed data when using all the information available. Although it is convenient to use and is consequently popular, the Bayesian posterior predictive p value has been criticized for double-using data to calculate both the test statistic and the posterior probability (Tsui and Weerahandi 1989; Berger and Boos 1994). To avoid this problem, we calculate the partial posterior predictive p value (Bayarri and Berger 2000) defined by $P_{\mathrm{ppp}} = P(T(y_{\mathrm{rep}}) \le t_{\mathrm{obs}} \mid y_{\mathrm{obs}} \setminus t_{\mathrm{obs}})$, where $t_{\mathrm{obs}}$ is the observed test statistic and $y_{\mathrm{obs}} \setminus t_{\mathrm{obs}}$ is the part of the data not involved in calculating $t_{\mathrm{obs}}$. By dividing $y_{\mathrm{obs}}$ into $t_{\mathrm{obs}}$ and $y_{\mathrm{obs}} \setminus t_{\mathrm{obs}}$, the partial posterior predictive p value avoids the issue of circular validation.
In practice, the division of $y_{\mathrm{obs}}$ into $t_{\mathrm{obs}}$ and the rest is often not obvious. In the current analysis, however, the division is straightforward, because $t_{\mathrm{obs}}$ is a test statistic based on Lab B Batch 1. We set $y_{\mathrm{obs}} \setminus t_{\mathrm{obs}}$ as all the data except Lab B Batch 1. We estimated parameters related to Namgye and Hongryeonbong dates without using the eight dates from Lab B Batch 1, and eliminated the influence on parameters by using Monte Carlo integration. Then, Bayesian p values of three test statistics (mean, minimum, and maximum) for uncalibrated BP dates from Namgye and Hongryeonbong were calculated. Specifically, for Namgye and Hongryeonbong separately, we calculated (1) the probability that the mean of four replicated data points is smaller than the mean of the four observed data points (i.e. mean dates from Lab B Batch 1; Namgye=1742.5 and Hongryeonbong=1362.5); (2) the probability that the minimum of four replicated data points is smaller than that of the four observed data points (Namgye=1690; Hongryeonbong=1340); and (3) the probability that the maximum of four replicated data points is smaller than that of the four observed data points (Namgye=1790; Hongryeonbong=1390).
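The three probabilities above can be approximated by simple Monte Carlo simulation. The sketch below is a simplified stand-in, not the model fitted in this study: it assumes a normal distribution for the site age and for each replicated date, the function name is hypothetical, the per-date standard deviation of 30 yr is an arbitrary value within the 15 to 60 yr range reported by the laboratories, and the Namgye estimate of 1859±14 BP is used in place of the estimate obtained without Lab B Batch 1, which is not quoted in the text. The observed statistics are the Namgye values given above.

```python
import random


def predictive_p_values(observed, n_dates, mu, mu_sd, date_sd,
                        n_draws=20000, seed=0):
    """Estimate P(T(y_rep) <= t_obs) for the mean, minimum, and maximum.

    observed: dict with keys "mean", "min", "max" (observed statistics)
    n_dates:  number of dates in each replicated batch
    mu, mu_sd: assumed estimate of the site age and its uncertainty
    date_sd:  assumed measurement standard deviation of a single date
    """
    rng = random.Random(seed)
    stats = {"mean": lambda v: sum(v) / len(v), "min": min, "max": max}
    hits = dict.fromkeys(stats, 0)
    for _ in range(n_draws):
        # Draw the site age first so that parameter uncertainty is
        # averaged over (Monte Carlo integration), then draw a batch.
        site_age = rng.gauss(mu, mu_sd)
        rep = [rng.gauss(site_age, date_sd) for _ in range(n_dates)]
        for name, func in stats.items():
            if func(rep) <= observed[name]:
                hits[name] += 1
    return {name: hits[name] / n_draws for name in stats}


# Observed statistics of Lab B Batch 1 for Namgye, as given in the text.
namgye_obs = {"mean": 1742.5, "min": 1690, "max": 1790}

# mu, mu_sd, and date_sd are illustrative assumptions (see lead-in).
p_values = predictive_p_values(namgye_obs, n_dates=4,
                               mu=1859, mu_sd=14, date_sd=30)
for name, p in p_values.items():
    print(f"{name}: p = {p:.4f}")
```

Very small values for all three statistics indicate that a batch as young as the observed one would rarely arise from random measurement error alone under these assumptions.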
Posterior distributions of all three Bayesian p values reject the null hypothesis that the variant statistical distribution of 14C ages generated by Lab B Batch 1 for both sites is the product of random errors (Table 5). This suggests that the eight measurements of Lab B Batch 1 are likely to have a systematic error in some aspect of the taphonomic, handling, or analytical measurement of the samples. Based on the available data, it cannot be determined whether this consistent difference resulted from contamination during collection or handling of the sample, pretreatment, erroneous measurement of standards, or some changes in background of measurement. Taphonomic circumstances for the nonmatching age sets are not suspected since postdepositional contamination would have likely affected the samples equally. The same is true about handling, storage, and shipping, but, given the size of the artifact, it is plausible that one portion was inadvertently mishandled despite the protocols.
Mainly due to this statistically significant difference between batches, there is a nonrandom variance in the agreement of dates measured by Lab B compared to those of other laboratories that participated in this test, and non-Bayesian chi-squared tests (Ward and Wilson 1978) using OxCal v 4.2 demonstrate similar results (Tables 6 and 7). In the case of Namgye, only Lab B’s T value is significant at the 0.05 level, with 77.6% agreement. Hongryeonbong dates measured by Lab B also demonstrate a high T value and low agreement, although some other labs’ results also have T values significant at the 0.05 level.
An experiment on the potential differences between inner and outer rings was carried out only on the H1 sample, and consistent differences in age outcomes were not detected (Figure 10). The number of rings in H1 was not rigorously counted by a botanical specialist, but our observation during aliquot division suggests the age difference between the two parts of H1 was only 10 to 15 yr. Also, the diameter of the bulk sample (oak tree) was 15 cm, suggesting that the age of the tree would not have been older than 25 yr in the typical central Korean environment (Byun et al. 2010). Thus, although outer rings should theoretically provide a younger age than inner rings (Bowman 1990), the difference appears to lie within the statistical error range in this case, probably owing to the young age of the tree at the time it was felled for use as construction material.
Overall, our blind tests demonstrate generally good concordance in the results and present acceptable errors at an archaeological scale, but interbatch differences may potentially result in uncertainty of dating results, although this was detected for only one laboratory out of five that were subject to the tests. Bayesian p values and chi-squared tests reject the null hypothesis that the errors randomly occurred.
CONCLUSION
A total of 79 14C samples were analyzed from five macrosamples recovered from two separate archaeological sites showing a narrow distribution of uncalibrated 14C ages. One batch of samples produced results consistently outside the 2σ distribution of the remaining 75 samples and was determined to be the result of a systematic error, but the source of the error is not specifically known. The tests performed in this experiment were not designed to highlight deficiencies or successes of individual laboratories or identify unreliable laboratories, but rather to determine potential anomalies in data generation for 14C dating, in general. Users of 14C dating should be aware of different sources of potential uncertainty resulting from the metabolic lifecycle of the organism, burial, taphonomy, recovery, handling, shipping, and laboratory treatment of the sample in order to relevantly interpret the results. Uncertainty derived from random errors can be decreased by increasing sample size, insisting on more robust isotopic counting procedures and using appropriate statistical techniques. However, repeated errors could signal more significant problems in either pre-laboratory handling of samples or laboratory procedures, and it would be inappropriate to include them in the statistical sample of unaffected samples. Without determining whether differences among 14C ages are caused by random errors or possible systematic errors, it can be difficult to properly understand the uncertainty of measurements provided to consumers. However, when dating results are obtained from laboratories, most end-users will not be aware of potential errors in their data because sample sizes tend to be small, and our results suggest that results with systematic errors are erroneously included and reported in the archaeological literature.
Certainly, whether a data set is subject to either random errors or systematic errors is not clear-cut unless large numbers of samples are taken from one context. Even then, the division between random and systematic errors can be heuristic, depending on one’s perspective. 14C laboratories may view, for example, interbatch differences detected in this experiment as an uncontrollable random error that can possibly happen as a mass spectrometer runs many times. At the same time, a systematic error may be suspected because the error repeatedly occurs outside the statistical boundaries of a truly random distribution. In such cases, archaeologists face a dilemma in interpreting the veracity of their samples. In this case, we were able to statistically determine the presence of a systematic error in the results of Lab B’s Batch 1 based on a large data set of samples generated. However, few archaeological research projects can afford to generate so many ages from single macrosamples in order to identify potential sources of error. Even when identified, the source of the error is not obvious.
Large sample sizes are important for archaeologists to get accurate dates of archaeological events, but simply increasing sample size does not automatically guarantee a decrease in uncertainty unless possible systematic errors are relevantly controlled. If multiple samples are dated under the same conditions, it is possible for all results to be affected by the same systematic errors. This risk can be mitigated when samples are dated under multiple conditions and results are compared by users before ultimate age determination, although this may be costly and time consuming. Although there is no universal method for separating random and systematic errors of dating results, our experiment suggests that it is necessary for archaeologists to establish an organized strategy for dating sites before submitting samples to laboratories, which can avoid the inclusion of possible systematic errors.
ACKNOWLEDGMENTS
This work was supported by the National Research Foundation of Korea Grant funded by the Korean Government (NRF-2013S1A5B6043901). We thank the participating laboratories for their cooperation with the test and understanding of our motivation to advance the scientific potential of the discipline. We also thank Tim Jull, Mark McClure, Alex Bayliss, and one anonymous reviewer for valuable comments.