Anthropological archaeologists are committed to achieving scientific understandings of complex social processes that operate on centennial or millennial scales, notably including segments of societies that are absent from or underreported in recorded history. Many, including us, also believe that this research on the past should have the potential to inform social action in the present for the future.
Recent articles in the Proceedings of the National Academy of Sciences and American Antiquity (Kintigh et al. 2014a, 2014b) propose a set of 25 grand challenges for archaeology intended to represent the most compelling questions facing our discipline. The challenges include, for example, “Why and how do social inequalities emerge, grow, persist, and diminish, and with what consequences?” and “How do humans perceive and react to changes in climate and the natural environment over short and long terms?” The challenges do not focus on new discoveries, nor are they peculiarly archaeological; rather, they address major issues in the social sciences. Answers to these challenges will not and cannot emerge through intensive study of individual cases. Instead, they require research that synthesizes data and information (from a region, a hemisphere, or even the globe) to achieve knowledge that includes novel understandings of fundamentally important social processes (Kintigh et al. 2014a:5). Other scholars working at regional and macroregional scales have similarly recognized the need to integrate diverse sources of data (e.g., Arbuckle et al. 2014; McKechnie et al. 2014; Manning et al. 2016; Mills et al. 2013).
WHAT SYNTHESIS REQUIRES
Achieving the kinds of synthesis envisioned here requires the resolution of both technical and social problems (explored initially in Kintigh 2006 and Kintigh et al. 2015 and in more detail in Altschul et al. 2017, nd). This article focuses on one particular problem: achieving effective integration of data across multiple datasets. By data integration, we mean the process of transforming datasets that were recorded in different ways into a single, unified dataset with analytically comparable observations.
The Need to Integrate Primary Data
Synthetic research on the scale that grand challenges require entails deriving and comparing data-driven interpretations of primary data recovered by other archaeological projects. Today, syntheses are too often based on the data summaries and conclusions drawn by the original researchers. While this mode of synthesis is efficient and has undeniably been important, it also has liabilities. Conclusions that are erroneous or based on inconsistent premises become entrenched in the literature as “facts” that persist as faulty premises in subsequent scientific arguments. For example, for several decades archaeologists cited DiPeso's (1974) dating of the cultural chronology for Casas Grandes, a major center of the late prehistoric American Southwest/northwest Mexico region. The errors in dating were not corrected until Dean and Ravesloot (1993) reinterpreted the primary data, namely the tree-ring dates from Paquimé (see also Whalen and Minnis 2001). Equally important, the potential to explore multiple, large sets of primary data provides the opportunity to discover important cross-dataset patterns that could never be seen when comparing higher-level interpretations.
The Need for Discovery and Access to Data
The explosion in the quantity and complexity of archaeological data has led to large databases, obtained at great expense, with immense potential to contribute to science. Nonetheless, datasets that could be extremely useful for synthesis are often unknown or not readily accessible to scientists. Fortunately, the needed technical infrastructure is now available through digital repositories that provide effective discovery, access, and long-term preservation of datasets, notably the Archaeology Data Service (Richards 2017) in the United Kingdom and the Digital Archaeological Record (tDAR; McManamon et al. 2017) in the United States. While that preservation and access infrastructure is now well established, only a tiny fraction of the potentially useful datasets developed in recent decades have been deposited in these repositories or are otherwise accessible.
The Need to Integrate Data across Projects and Areas
Comprehensive, regional-scale data are never collected by a single research team; data must be compiled from many projects. Integrating data across projects is essential to archaeologists’ efforts to recognize phenomena operating on large spatiotemporal scales and to conduct crucial comparative studies.
The Need for Comparable Observations
Although large-scale and synthetic research demands the integration of data across projects, recorded observations from different projects are often not directly comparable. This issue may be due to the variables chosen, inconsistent measurement techniques, evolving or conflicting taxonomies, or differing collection intensities. In the absence of tools to resolve these discrepancies systematically, researchers rely on text descriptions or verbal communication with the original investigators; or (too often) they proceed with analyses unaware of the implicit difficulties, thereby inviting spurious results.
The Need for Adequate Metadata
Adequate metadata for each variable in a dataset are essential to assess the comparability of observations in different datasets and to the task of aligning those observations to make them comparable (Kansa and Kansa 2013; Kansa et al. 2014). Metadata include technical information, such as file formats and character sets used. They also include semantic documentation of individual tables, columns, and nominal values in a relational database or spreadsheet. Is a variable a count, a measurement, or a nominal value? If it is a measurement, what are the units, and how were they measured? If it is a code, what does each different value of the code represent, and how were the values distinguished?
Kintigh (2013) has elsewhere argued that to be considered adequate, metadata for databases must include sufficient information for an archaeologist not familiar with the project to make meaningful scientific use of the data. While meeting this standard demands considerable effort, it is necessary for datasets to be used in data integration. In addition to documenting the variables represented in the dataset, it is also important to provide key contextual metadata that typically do not appear anywhere in the dataset itself, such as dates, location, sampling intensity, or recovery technique.
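The questions posed above (count, measurement, or nominal value? what units? what does each code mean?) can be made concrete with a small record type. The following is only a sketch under our own assumptions: the class name, field names, and checks are hypothetical and do not describe tDAR's actual metadata schema.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ColumnMetadata:
    """Semantic documentation for one column (all field names are hypothetical)."""
    name: str
    kind: str                      # "count", "measurement", or "nominal"
    description: str
    units: Optional[str] = None    # expected for measurements
    method: Optional[str] = None   # how a measurement was taken
    code_meanings: dict = field(default_factory=dict)  # code -> meaning, for nominal columns

    def problems(self) -> list:
        """Return gaps that would hinder reuse by an analyst outside the project."""
        issues = []
        if self.kind not in {"count", "measurement", "nominal"}:
            issues.append(f"{self.name}: unknown kind {self.kind!r}")
        if self.kind == "measurement" and not self.units:
            issues.append(f"{self.name}: measurement lacks units")
        if self.kind == "nominal" and not self.code_meanings:
            issues.append(f"{self.name}: nominal codes are undocumented")
        return issues


# Two illustrative columns: one under-documented, one adequately documented.
length = ColumnMetadata("length", "measurement", "Greatest length of specimen")
taxon = ColumnMetadata("taxon", "nominal", "Taxonomic identification",
                       code_meanings={"100": "unidentified lagomorph"})

for column in (length, taxon):
    print(column.name, column.problems())
```

Contextual metadata such as dates, location, sampling intensity, and recovery technique would be documented at the dataset level rather than per column; they are omitted here for brevity.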
The Need for General-Purpose Data Integration Tools
For decades, archaeologists, including us, have integrated multiple datasets. This process typically involved examining the representation of each variable under consideration in every one of the subject datasets and recoding all of these variables to an ad hoc standard in all of the datasets. Most who have done this would agree that it is an often frustrating and nearly always time-consuming endeavor. Furthermore, these efforts are typically tailored to the specific datasets involved and not readily generalized or extended.
QUERY-DRIVEN, ON-THE-FLY DATA INTEGRATION
Starting almost 20 years ago, archaeologists at Arizona State University were frustrated in their efforts to move beyond their individual areas of study to examine regional patterns. We envisioned a cyberinfrastructure tool that would facilitate the broad integration of data across our separate regional cases. Beginning in 1999, we sought National Science Foundation (NSF) funding to this end and in 2004 received our first award, whose goal was to assess disciplinary needs for cyberinfrastructure (reported on in Kintigh 2006).
Subsequent awards from the NSF and the Andrew W. Mellon Foundation funded the development of tDAR. It was, in fact, the research need for data integration and the associated demands for discovery and access that drove tDAR's initial development. The preservation component that is now integral to tDAR was soon added as a natural and important complement.
The ontology-based approach to data integration described here was developed over the last 15 years as a product of close collaboration among archaeologists, computer and information scientists, and software engineers.¹ In most cases, the refinements that we have implemented were direct responses to researcher requests. Thus far, the most intensive use has been by the community of archaeological faunal analysts, and our examples are drawn from that experience (Spielmann and Kintigh 2011).
As noted above, the standard approach to the integration of extant data sources is to do an ex post facto normalization of the subject datasets to a project-specific standard. In this approach, datasets that do not meet minimal data standards are rejected. When the best datasets have a precision that exceeds the set standard, that resolution is effectively discarded.
We chose instead to reconcile data source observations with the data requirements of the query under consideration rather than attempt global reconciliation of data sources. Because nominal variables (e.g., ceramic type, floral or faunal taxon, lithic tool type) are central to most archaeological analyses, reconciling nominal variables recorded using different classification systems is a central challenge for data integration. In this framework of query-driven data integration, each classification system used in the original recording of a nominal variable (e.g., butchering or faunal taxon) is represented by a set of values, each of which is explicitly linked to a node in a concept hierarchy, that is, an ontology (see below). In responding to a query, datasets using different coding systems can be used together as long as each separate classification system is linked to a shared ontology.
The source datasets (e.g., in Microsoft Excel or Access) are always maintained in tDAR in their original form (as well as in open-standard preservation formats). This policy is important because we do not want to lose the ability to see the data as they were originally recorded.
Ontologies
The integration software depends on agreed-upon ontologies for the database variables (columns) that are to be integrated. For our purposes, an ontology is a treelike hierarchy of concepts of increasing specificity. Figure 1 shows an ontology for the faunal variable butchering.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20180208083826670-0667:S232637681700033X:S232637681700033X_fig1g.jpeg?pub-status=live)
FIGURE 1. Butchering ontology.
Ideally, ontologies are designed by a user community (in this case, archaeological faunal analysts) to capture the diversity of concepts used within the specialist community. Ontologies arrange concepts hierarchically, enabling more and less specific assignments (and the human lumpers and splitters) to peacefully coexist.
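To make the idea of a treelike concept hierarchy concrete, the sketch below stores an ontology as child-to-parent links and walks upward from a specific concept to more general ones. The node names are a small, simplified slice of a taxon ontology chosen for illustration (intermediate ranks such as family are omitted), and the function names are our own; none of this is tDAR's internal representation.

```python
# Child -> parent links for a tiny, simplified slice of a taxon ontology.
PARENT = {
    "Mammalia": None,
    "Artiodactyla": "Mammalia",
    "Odocoileus": "Artiodactyla",
    "Antilocapra": "Artiodactyla",
    "Lagomorpha": "Mammalia",
    "Sylvilagus": "Lagomorpha",
    "Lepus": "Lagomorpha",
}


def ancestors(node):
    """Yield the node itself, then each successively more general concept."""
    while node is not None:
        yield node
        node = PARENT[node]


def is_within(node, concept):
    """True if `node` is `concept` or lies anywhere beneath it in the hierarchy."""
    return concept in ancestors(node)


print(list(ancestors("Lepus")))
print(is_within("Lepus", "Lagomorpha"), is_within("Lagomorpha", "Lepus"))
```

Because a "lumper" can assign a specimen to Lagomorpha while a "splitter" assigns it to Lepus, both recordings fit in the same structure; the second is simply more specific than the first.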
Coding Sheets
In many cases, the original datasets employ arbitrary numeric or textual codes to represent the individual values for a column. To document the meaning of these codes, tDAR allows the contributor of a dataset to enter a “coding sheet” that provides a translation of the codes to meaningful values (along with optional documentation of each value). Thus, the coding sheet might associate the code “100” in the taxon column of a particular database with the meaningful value “unidentified lagomorph.” These coding sheets can be unique to a specific dataset or shared across multiple datasets.
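A coding sheet is, in essence, a lookup table from codes to meanings. The sketch below assumes a simple dictionary representation; only the pairing of code "100" with "unidentified lagomorph" comes from the example above, and the other codes, the row layout, and the function name are hypothetical.

```python
# Hypothetical coding sheet for the taxon column of one dataset: code -> meaning.
TAXON_CODING_SHEET = {
    "100": "unidentified lagomorph",
    "101": "cottontail",
    "102": "jackrabbit",
}


def decode(rows, column, coding_sheet):
    """Return copies of rows with coded values replaced by their meanings.

    Codes absent from the sheet are left unchanged so that they remain visible.
    The input rows are not modified, so the data as originally recorded survive.
    """
    return [{**row, column: coding_sheet.get(row[column], row[column])}
            for row in rows]


rows = [{"site": "A", "taxon": "100"}, {"site": "A", "taxon": "102"}]
decoded = decode(rows, "taxon", TAXON_CODING_SHEET)
print(decoded)
```

Because the sheet is a separate object, the same sheet can be applied to several datasets that share a coding scheme.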
Mapping Coding Sheet Values to Ontologies
The analyst contributing the database and coding sheet then maps the individual coding sheet or database values to specific nodes in the ontology associated with that variable, as illustrated in Figure 2. In this way, any number of datasets recorded in different ways can be represented within the unified framework provided by the ontology.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20180208083826670-0667:S232637681700033X:S232637681700033X_fig2g.jpeg?pub-status=live)
FIGURE 2. Faunal taxon variable in two datasets mapped to the taxon ontology (partial).
The tDAR ontologies accommodate synonyms that can assist in the mapping process. For example, coding sheets that employ common names for taxa can be easily mapped to the taxon ontology because common names are maintained as synonyms to the names in the Linnaean taxonomy.
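One way synonyms could assist mapping is by proposing an ontology node for each coding-sheet value whose text matches a node name or one of its synonyms, leaving the rest for the analyst to resolve by hand. The sketch below is an assumption about how such a helper might work; the synonym table and function name are illustrative and are not taken from tDAR.

```python
# Ontology node -> synonyms (common names). Entries are illustrative only.
SYNONYMS = {
    "Lagomorpha": {"lagomorph", "unidentified lagomorph"},
    "Sylvilagus": {"cottontail", "rabbit"},
    "Lepus": {"jackrabbit", "hare"},
}


def suggest_mapping(coding_sheet_values):
    """Map each coding-sheet value to an ontology node by name or synonym.

    Matching ignores case and surrounding whitespace. Unmatched values map
    to None so the analyst can assign them manually.
    """
    lookup = {}
    for node, names in SYNONYMS.items():
        lookup[node.lower()] = node
        for name in names:
            lookup[name.lower()] = node
    return {value: lookup.get(value.strip().lower())
            for value in coding_sheet_values}


mapping = suggest_mapping(
    ["Jackrabbit", "sylvilagus", "unidentified lagomorph", "big rabbit"])
print(mapping)
```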
Data Integration
Using datasets whose columns for the relevant variables have been mapped to the relevant ontologies, the stage is set for data integration. Assume that we want to compare artiodactyl indices (an indicator of reliance on large game) across sites in Southwest US datasets. The artiodactyl index is defined as the number of identified specimens (NISP) of artiodactyls (e.g., deer and antelope) divided by the total NISP of artiodactyls plus lagomorphs (rabbits and hares).
In that case, the integration process selects the subset of cases (rows) from all observations in all of the source (mapped) datasets in which the taxon variable is recorded as artiodactyl, or lagomorph, or any subcategory of either of those higher-level values. It returns a single dataset with the source dataset taxon values from all source datasets transformed into “Artiodactyla” or “Lagomorpha” as appropriate. The individual posing the original query can read the unified output database in Excel and use a pivot table or count occurrences of each value by site to easily calculate the artiodactyl index.
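The selection and roll-up just described can be sketched in a few lines. The code below assumes each row has already been mapped to an ontology node, rolls every row up to Artiodactyla or Lagomorpha where possible, drops the rest, and computes the index per site as artiodactyl NISP / (artiodactyl NISP + lagomorph NISP). The hierarchy, the row layout, and the counts are invented for illustration.

```python
from collections import defaultdict

# Simplified child -> parent links; orders are treated as roots here.
PARENT = {
    "Artiodactyla": None, "Odocoileus": "Artiodactyla", "Antilocapra": "Artiodactyla",
    "Lagomorpha": None, "Sylvilagus": "Lagomorpha", "Lepus": "Lagomorpha",
    "Rodentia": None,
}


def roll_up(node, targets):
    """Return the first target at or above `node`, or None if there is none."""
    while node is not None:
        if node in targets:
            return node
        node = PARENT[node]
    return None


def artiodactyl_index(rows):
    """rows: iterable of (site, ontology node, NISP). Returns {site: index}."""
    counts = defaultdict(lambda: defaultdict(int))
    for site, node, nisp in rows:
        target = roll_up(node, {"Artiodactyla", "Lagomorpha"})
        if target is not None:          # e.g., rodents are not selected
            counts[site][target] += nisp
    return {site: c["Artiodactyla"] / (c["Artiodactyla"] + c["Lagomorpha"])
            for site, c in counts.items()}


# Invented rows: genus-level and order-level identifications mixed together.
rows = [("Site 1", "Odocoileus", 2), ("Site 1", "Artiodactyla", 1),
        ("Site 1", "Lepus", 6), ("Site 1", "Lagomorpha", 3),
        ("Site 1", "Rodentia", 5)]
index = artiodactyl_index(rows)
print(index)
```

Note that rows recorded only to order contribute to this index on equal footing with rows recorded to genus, because the query needs nothing finer than order.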
Now, let us say that same analyst wants to calculate the lagomorph indices for the same sites. The lagomorph index (indicating aspects of the local environment) is the ratio of the Sylvilagus (rabbit) NISP to the sum of the Sylvilagus and Lepus (hare or jackrabbit) NISPs. In this integration, we select from all the rows in all the datasets only the rows that are mapped to the genus (Sylvilagus or Lepus) level or below and produce a combined dataset as described above.
In this lagomorph index–directed integration, datasets that only classify bones by taxonomic order (Lagomorpha, Artiodactyla, Rodentia, etc.) would be ignored altogether because the information they contain does not address this integration query. However, those datasets would be used in calculations for the artiodactyl index integration query discussed above because they encode all the taxonomic specificity that is needed. The lagomorph index integration will also ignore (fail to select) any rows in the source datasets mapped directly to the Lagomorpha node (corresponding to an original dataset value of “unidentified lagomorph”) because these rows do not inform this particular query, which requires distinguishing the genus.
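The contrast with the artiodactyl query can be illustrated with a sketch in which only rows resolved to genus (or below) are selected. Here the lagomorph index is Sylvilagus NISP / (Sylvilagus NISP + Lepus NISP). The hierarchy is simplified, and the counts and function names are invented for illustration.

```python
# Simplified child -> parent links for the lagomorph branch.
PARENT = {
    "Lagomorpha": None,
    "Sylvilagus": "Lagomorpha", "Lepus": "Lagomorpha",
    "Sylvilagus audubonii": "Sylvilagus", "Lepus californicus": "Lepus",
}


def genus_of(node):
    """Return 'Sylvilagus' or 'Lepus' if node is at or below that genus, else None."""
    while node is not None:
        if node in ("Sylvilagus", "Lepus"):
            return node
        node = PARENT.get(node)
    return None


def lagomorph_index(rows):
    """rows: iterable of (ontology node, NISP). Rows not resolved to genus are skipped.

    Returns None when no row informs the query.
    """
    nisp = {"Sylvilagus": 0, "Lepus": 0}
    for node, count in rows:
        genus = genus_of(node)
        if genus is not None:
            nisp[genus] += count
    total = nisp["Sylvilagus"] + nisp["Lepus"]
    return nisp["Sylvilagus"] / total if total else None


# A finely classified dataset and one recorded only to order (invented counts).
fine = [("Sylvilagus audubonii", 3), ("Lepus", 1), ("Lagomorpha", 10)]
coarse = [("Lagomorpha", 25)]
print(lagomorph_index(fine), lagomorph_index(coarse))
```

The ten "unidentified lagomorph" specimens in the first dataset are not selected, and the order-level dataset contributes nothing to this query, although both would count toward the artiodactyl index.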
By retaining all the specificity contained in the original datasets and performing the integration on the fly at query time, we are able to take advantage of those datasets using less specific classifications where they are relevant to the specific query while retaining the ability to use the finely classified data to satisfy those queries that demand such refinement.
A convenient data management and data exploration by-product of this integration process is that in a single operation it permits the specification of complex selections and hierarchical data aggregation across a great number of databases. Thus, if one wished to find all the macaws identified in these datasets, it would not be necessary to understand the coding and search 20 different datasets; one would simply look in an integrated dataset at the taxon column for the genus Ara.
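The macaw example can be sketched as a search over several datasets that use different codes but are mapped to one ontology. Everything in this sketch is hypothetical (dataset names, codes, proveniences, and the simplified hierarchy); it shows only that the person searching needs the ontology concept, not each project's coding scheme.

```python
# Simplified child -> parent links for a few bird taxa.
PARENT = {
    "Aves": None, "Psittacidae": "Aves", "Ara": "Psittacidae",
    "Ara macao": "Ara", "Ara militaris": "Ara", "Meleagris": "Aves",
}

# Each hypothetical dataset has its own codes and its own code -> node mapping.
DATASETS = {
    "project_a": {"mapping": {"17": "Ara macao", "20": "Meleagris"},
                  "rows": [("room 3", "17"), ("room 5", "20")]},
    "project_b": {"mapping": {"MACAW": "Ara", "TURKEY": "Meleagris"},
                  "rows": [("midden", "MACAW"), ("midden", "TURKEY")]},
}


def at_or_below(node, concept):
    """True if `node` equals `concept` or descends from it."""
    while node is not None:
        if node == concept:
            return True
        node = PARENT[node]
    return False


def find(concept):
    """Return (dataset, provenience, original code) for every matching row."""
    return [(name, provenience, code)
            for name, dataset in DATASETS.items()
            for provenience, code in dataset["rows"]
            if at_or_below(dataset["mapping"][code], concept)]


macaws = find("Ara")
print(macaws)
```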
Example: Artiodactyl Index
We illustrate key features of the data integration process with a simple example from the Cíbola (Zuni) area of the northern Southwest United States and then present a substantive result based on our ongoing research. Following the artiodactyl index discussion above, an integration of six faunal databases from Pueblo III and Pueblo IV sites (roughly AD 1200–1350) yielded a table with 31,700 rows representing 34,948 bones (in some cases a single row represented more than one bone with all the same characteristics). Table 1 summarizes these data by project area. The table shows considerable variation in the artiodactyl index, which we might expect to relate to the project's proximity to relevant habitats and perhaps to the kinds of sites investigated.
TABLE 1. Artiodactyl Index Computation for Cíbola-Area Projects.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20180208083826670-0667:S232637681700033X:S232637681700033X_tab1.gif?pub-status=live)
Table 1 shows an extremely high value (nearly as many artiodactyls as lagomorphs) for the El Morro Valley Prehistory Project, which primarily investigated a local post-Chacoan center close to the Zuni Mountains (a productive habitat for deer). The Ojo Bonito Archaeological Project, which primarily investigated another post-Chacoan center relatively distant from any mountains, had a very low proportion of artiodactyls (only 3% of the combined assemblage). The Cibola Archaeological Research Project and the Rudd Creek Archaeology Project, both close to major mountains, show moderate values, with artiodactyls representing about a third of both classes combined. However, a similar value is seen for the Upper Little Colorado Prehistory Project, which is somewhat farther from deer habitats. Seemingly anomalous is the Heshotauthla Archaeological Research Project, with a very low index value (0.05) despite being not much farther from the Zuni Mountains than the Cibola Archaeological Research Project and El Morro Valley Prehistory Project sites.
APPLICATION
Datasets
Spielmann led a large, NSF-supported collaborative synthetic effort through which 13 faunal analysts and other archaeologists contributed 42 datasets for faunal assemblages from 59 sites in the northern Southwest United States. Combined, these datasets contain more than 364,000 individually identified faunal specimens. The analysis presented here includes 297,839 specimens from 33 of these sites dating to the Pueblo III or Pueblo IV period (ca. AD 1150–1500). Datasets varied considerably with respect to which variables they recorded, though a core set was consistently recorded. All but seven of the 42 datasets (https://core.tdar.org/collection/16056) are now freely available in tDAR for anyone to use; the remainder are temporarily embargoed.
Ontologies
Faunal analysts mapped the dataset columns to a set of 24 ontologies (Table 2) for faunal variables developed by a series of working groups. (In most cases, this mapping was done by the original analysts.) With the exception of the taxon ontology, these are general-purpose ontologies devised to cover most nominal faunal variables recorded for prehistoric contexts in the United States and the United Kingdom. We used the Integrated Taxonomic Information System as a standard for the faunal taxa, including only taxa appearing in archaeological contexts in the US Southwest and adding indeterminate categories that were not encompassed by the taxonomic hierarchy (e.g., “large mammal”). All these ontologies are freely available for anyone to use in tDAR (https://core.tdar.org/collection/15376).
TABLE 2. Shared Faunal Variable Ontologies.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20180208083826670-0667:S232637681700033X:S232637681700033X_tab2.gif?pub-status=live)
A broad-based working group of faunal analysts devised the initial drafts of the faunal ontologies. Ontologies were comparatively easy to develop for faunal variables that have well-defined categories and are generally recorded in similar ways (e.g., taxon, element, dorsal/ventral, and side). Other variables exhibit considerable diversity in how they are recorded (e.g., butchering, condition, completeness, and gnawing), but it was nonetheless possible to achieve agreements on ontologies.
The draft ontologies have been refined through interactions with working groups focused on the northern Southwest United States, on the United Kingdom, and on the Archaic period in the eastern United States. As analysts mapped their datasets to the draft ontologies and were unable to fit their categories, the draft ontologies were refined to the point that a large fraction of the analysts with whom we have interacted believe that their recordings can be reasonably represented within the system. This agreement was possible, in part, because analysts did not have to abandon their own coding schemes to use the ontologies. Mapping coded values to ontologies enables effective standardization without forcing analysts to adopt the shared categories at the level of their individual analyses.
While working groups of users developed and uploaded the faunal ontologies used here, any data contributor can upload, use, and share an ontology. However, the more that an ontology represents shared agreement within a user community, the more it is possible to share and synthesize data across datasets recorded using different systematics.
If a research community shares a general approach to recording a variable, it may well be possible to reach a consensus on an ontology, using the hierarchical nature of the ontologies to accommodate disputes and differences among analysts. For example, there may be widespread agreement on the upper levels of a ceramic typology but variation in how “lumpers” and “splitters” deal with the finer points. In other cases, a community may be split with different factions recording certain variables in fundamentally incompatible ways, as is the case with some approaches to lithic typology. Even in these cases, there is value in employing ontologies to integrate data on the contested variables within each subgroup and sharing ontologies across the community on variables on which agreement is possible (e.g., lithic material).
Hypothesis
One substantive goal for this collaboration was to examine the hypothesis: human population persistence and concentration on the landscape result in large-mammal resource depression. As a part of this investigation, we needed to explore an alternative: Are environmental differences a significant factor in resource abundance (i.e., regardless of human demography, do more mesic environments favor larger game and drier environments, smaller game)? In the US Southwest, higher elevations tend to be more mesic, and elevation is a reasonable proxy for habitat productivity (e.g., Schollmeyer and Driver 2013). In this region, the large mammals used for food are overwhelmingly artiodactyls, with lagomorphs being the other major faunal food resource. Therefore, large-mammal resource depression is indicated by a decrease in the artiodactyl index.
Results
An initial view of the data (Figure 3) indicates a positive, statistically significant relationship between elevation and the artiodactyl index (R² = 0.15; p = 0.02), suggesting that differences in the artiodactyl index may be due more to elevation than to human predation. This finding, of course, is not unexpected, as higher elevations are favored habitats for deer. However, further investigation revealed a bimodal distribution of elevation of the sites and projects investigated, as shown in Figure 4.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20180208083826670-0667:S232637681700033X:S232637681700033X_fig3g.jpeg?pub-status=live)
FIGURE 3. Scatterplot of site elevation (averaged over an 18-km buffer) vs. artiodactyl index. Regression R² = 0.15; p = 0.02.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20180208083826670-0667:S232637681700033X:S232637681700033X_fig4g.jpeg?pub-status=live)
FIGURE 4. Histogram of site elevation (averaged over an 18-km buffer) showing bimodal elevation distribution.
If we plot only those project areas located above 1,900 m elevation (Figure 5), it is clear that there is no relationship (R² < 0.01) between elevation and artiodactyl index at these higher-elevation sites. Rejecting the idea that the artiodactyl index is simply a function of elevation for the higher-elevation sites, we are able to proceed with further analysis of the original hypothesis with the higher-elevation projects. The key point here is that by looking at only one or a few cases, we would never have been able to see this patterning.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20180208083826670-0667:S232637681700033X:S232637681700033X_fig5g.jpeg?pub-status=live)
FIGURE 5. Scatterplot of average site/project elevation vs. artiodactyl index for sites above 1,900 m. Regression R² < 0.01; p = 0.72.
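The comparison between the full sample and the higher-elevation subset amounts to fitting a simple least-squares line twice and comparing the coefficients of determination. The sketch below shows that computation in plain Python. The elevation and index pairs are placeholders invented for illustration; they are not the study's data, which are embargoed, and the function names are our own.

```python
def r_squared(x, y):
    """Coefficient of determination for a simple least-squares line y = a + b*x.

    For a single predictor this equals the squared Pearson correlation.
    Assumes x and y each contain at least two distinct values.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return (sxy ** 2) / (sxx * syy)


def r_squared_above(sites, min_elevation):
    """sites: list of (elevation in m, artiodactyl index). Refit above the cutoff only."""
    kept = [(elev, idx) for elev, idx in sites if elev > min_elevation]
    return r_squared([elev for elev, _ in kept], [idx for _, idx in kept])


# Placeholder values for illustration only; NOT the study's data.
sites = [(1500, 0.10), (1600, 0.05), (2000, 0.30), (2100, 0.45), (2200, 0.30)]

all_sites = r_squared([e for e, _ in sites], [a for _, a in sites])
high_only = r_squared_above(sites, 1900)
print(round(all_sites, 2), round(high_only, 2))
```

A relationship that appears across a bimodal sample can vanish within one mode, which is why the subset is refit separately rather than read off the overall regression.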
Employing integrated data from these 34 sites, we are currently exploring taphonomic processes and other confounding factors. A full treatment of this synthetic research is in preparation and will appear separately. The data underlying Figures 3–5 are embargoed until that study appears but will be available in tDAR (at https://core.tdar.org/dataset/438729).
THE DATA INTEGRATION PROCESS IN tDAR
Dataset Ingest and Metadata Documentation
Data integration necessarily includes a number of steps. First, of course, the datasets to be integrated must be documented with appropriate column metadata, with coding sheets for nominal variables mapped to shared ontologies. tDAR provides software that interactively guides the user through each of these steps.
Thoroughly documenting a large dataset is a substantial undertaking. This process can be expected to go smoothly if the dataset is “clean” and was developed using good practices (e.g., Archaeology Data Service and Digital Antiquity 2013:73–84). It is further simplified to the extent that the same coding sheets (each of which is a separate tDAR resource) are used by multiple datasets because a coding sheet needs to be mapped to the corresponding ontology only once. tDAR also allows one documented dataset to serve as a template for others, so an analyst using a consistent coding scheme will find that while uploading and documenting the first dataset will take considerable time, uploads of subsequent datasets will go very quickly.
However, tDAR's data ingest and metadata documentation process can also reveal problems in the dataset design (e.g., having the interpretation of one column depend upon the value of a different column) or errors in coding that can be time-consuming to correct. For example, a dataset may contain numeric values that do not appear in the coding sheet for a particular variable. In that case, the analyst needs to determine whether this situation represents an omission in the coding sheet (in which case the coding sheet needs to be corrected) or whether the value was miscoded. If the value is miscoded, the analyst would attempt to ascertain whether this mistake was an error in transcription from a paper form (in which case the digital dataset can be corrected and reuploaded) or whether the value was initially coded incorrectly (in which case the value would be converted to the code for a missing value when it is impractical to reanalyze the specific specimen, as is usually the case).
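The check for values that do not appear in the coding sheet, and the fallback of recoding unrecoverable values as missing, can be sketched as follows. The codes, the missing-value code "999", and the function names are assumptions made for illustration and do not describe tDAR's ingest software.

```python
def undocumented_codes(rows, column, coding_sheet):
    """Return {code: number of rows} for values absent from the coding sheet."""
    missing = {}
    for row in rows:
        code = row[column]
        if code not in coding_sheet:
            missing[code] = missing.get(code, 0) + 1
    return missing


def recode_as_missing(rows, column, bad_codes, missing_code):
    """Return copies of rows in which bad codes are replaced by the missing-value code."""
    return [{**row, column: missing_code if row[column] in bad_codes else row[column]}
            for row in rows]


# Hypothetical coding sheet and rows; "107" is not documented anywhere.
coding_sheet = {"100": "unidentified lagomorph", "999": "missing"}
rows = [{"taxon": "100"}, {"taxon": "107"}, {"taxon": "107"}]

report = undocumented_codes(rows, "taxon", coding_sheet)
cleaned = recode_as_missing(rows, "taxon", set(report), "999")
print(report, cleaned)
```

Whether an undocumented code signals an incomplete coding sheet or a miscoded value remains the analyst's judgment; the report only locates the rows in question.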
Data Integration
Data integration proceeds by first selecting the datasets to be integrated. The user then chooses the variables to be integrated by selecting from a list of ontologies represented in the selected datasets (the process can only integrate variables that are mapped to shared ontologies). The user can then select one or more “display variables” for each dataset. These variables, for example, site identifier, provenience identifier, time period, or project name, are included in the output dataset but are not otherwise processed. The user also has the option of identifying a “count” variable (indicating that this row of the database represents that number of identical observations with respect to the variables recorded) used to statistically weight a case.
Finally, the user has the opportunity to control how the integration operates for each integration variable. For each variable in turn, the software displays all ontology values with check marks indicating which values are present in which datasets. The user then selects the ontology nodes to be output (Figure 6). Whenever a node is selected, tDAR automatically aggregates into the selected node all cases with unselected values that are hierarchically below it.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20180208083826670-0667:S232637681700033X:S232637681700033X_fig6g.jpeg?pub-status=live)
FIGURE 6. Screenshot of the Digital Archaeological Record data integration window (partial).
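The steps described in this section (choosing datasets, an integration variable, display variables, an optional count variable, and the ontology nodes to output) can be sketched as a single function. This is a simplified illustration under our own assumptions, not tDAR's implementation: the integration variable is hard-coded as "taxon", and the hierarchy, codes, and column names are invented.

```python
# Simplified child -> parent links; orders are treated as roots here.
PARENT = {"Artiodactyla": None, "Odocoileus": "Artiodactyla",
          "Lagomorpha": None, "Sylvilagus": "Lagomorpha", "Lepus": "Lagomorpha"}


def nearest_selected(node, selected):
    """Return the closest selected node at or above `node`, or None."""
    while node is not None:
        if node in selected:
            return node
        node = PARENT[node]
    return None


def integrate(datasets, selected, display_vars, count_var=None):
    """Combine mapped datasets into one list of output rows.

    datasets: {name: {"mapping": code -> ontology node, "rows": [dict, ...]}}
    Unselected values are aggregated into the nearest selected node above them;
    rows with no selected node at or above their value are dropped.
    Display variables are copied through unprocessed.
    """
    out = []
    for name, dataset in datasets.items():
        for row in dataset["rows"]:
            node = nearest_selected(dataset["mapping"].get(row["taxon"]), selected)
            if node is None:
                continue
            merged = {"dataset": name, "taxon": node,
                      "count": row.get(count_var, 1) if count_var else 1}
            merged.update({var: row.get(var) for var in display_vars})
            out.append(merged)
    return out


# Two hypothetical datasets with different coding schemes.
datasets = {
    "a": {"mapping": {"1": "Odocoileus", "2": "Lepus", "3": "Sylvilagus"},
          "rows": [{"site": "S1", "taxon": "1", "n": 2},
                   {"site": "S1", "taxon": "2", "n": 4},
                   {"site": "S1", "taxon": "3", "n": 1}]},
    "b": {"mapping": {"RAB": "Lagomorpha"},
          "rows": [{"site": "S2", "taxon": "RAB"}]},
}

result = integrate(datasets, {"Lagomorpha", "Sylvilagus"}, ["site"], count_var="n")
for row in result:
    print(row)
```

With Lagomorpha and Sylvilagus selected, Sylvilagus rows keep their genus, Lepus rows (unselected) are aggregated into Lagomorpha, and the deer row is dropped because nothing at or above it was selected.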
DATA REUSE, DATA INTEGRATION, AND ANALYTICAL COMPARABILITY
If, as a discipline, we are to accomplish syntheses that advance our scholarly and public objectives, we need to become more serious about sharing data in ways that promote data reuse (Kansa and Kansa 2013).
Data Reuse
The reuse of datasets depends on a number of factors:
• Relevance. The relevance of the dataset to the research questions, geographically, temporally, in terms of material and sample size, and on other substantive grounds.
• Discoverability. The ability of a potential user to discover the existence of a dataset and evaluate its relevance.
• Accessibility. Once a dataset is discovered, the ability to acquire a copy of it or to otherwise analyze it and the related costs.
• Adequacy of Metadata. Datasets can be rendered useless by inadequate metadata, that is, the information that documents the content of the dataset at the level of the dataset as a whole, the individual tables, the columns, and the nominal values that appear. For example, if the dataset uses arbitrary codes to represent nominal values, such as species or ceramic type, and the coding key is not documented in the dataset or its metadata, there is no meaningful access to the data.
• Availability of Contextual Information. The availability of key contextual information is important for establishing the analytical comparability of datasets and their constituent observations. This information includes spatial location, temporal assignment, depositional context, sampling intensity, and recovery technique. Too often, this contextual information is not a part of stand-alone analytical datasets.
• Ease of Use. Datasets may be difficult to use because they are stored in obsolete or obscure formats or because they are structured in a way that makes them difficult to employ in quantitative analyses without extensive data cleaning and reorganization. In contrast, datasets that are encoded in widely used formats and that employ or are linked to standards or other shared vocabularies or ontologies are more easily and effectively reused.
Data Integration and Data Reuse
Use of the data integration tools described here greatly facilitates data reuse. Because each data integration is responsive to the specific demands of the query, the datasets are exploited to their maximum potential. While there is a substantial one-time investment in developing the ontologies and mapping datasets to them, the payoffs can be enormous. Consider the examples presented above. Had we started with only a copy of each original dataset, in whatever form it was last used, along with a PDF of its coding key, the analyses that we completed in minutes would have taken months of effort. At any time, new or revised queries can quickly and easily be run on all the datasets mapped to the same ontologies. As new datasets are added to tDAR and mapped to the ontologies, saved data integration queries can easily be rerun to incorporate the new data. In this way, research communities can, over time, build ever more powerful integrated datasets.
User communities can range tremendously in scale. For example, faunal analysis lends itself to broad generalization because it deals mostly with biological characteristics, some of which (notably taxon and element) have established standards.
At the other end of the scale, Kintigh has worked with a number of students and close colleagues on several survey and excavation projects in the Cíbola area. At any particular time, the research teams shared a single coding sheet for recording ceramic type and form. However, over time, the forms evolved as they investigated new areas and refined some of their observations. To combine the results of the ceramic recordings for these projects, the easiest (and best) solution was to upload them separately to tDAR with their original coding sheets. In this way, tDAR preserves and maintains the data as originally recorded. It was then easy to develop type and form ontologies that captured the variation in the coding across the projects. Having mapped the project-specific coding sheets to the ontologies, the data integration tool made it easy to obtain a unified dataset with any desired aggregation of categories. In this case, the integration involved six projects and 11 datasets (survey and excavation were sometimes in different datasets), with about 240,000 individually recorded potsherds. This integrated database not only is easier for Kintigh (2016) and his immediate colleagues to use but is freely available for reuse.
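The logic of mapping project-specific coding sheets to a shared ontology and then aggregating at a chosen level can be sketched as follows. This is an illustrative simplification, not tDAR's implementation: the ontology, the project names, the codes, and the functions `roll_up` and `integrate` are all hypothetical.

```python
from collections import Counter

# Hypothetical ontology, stored as child -> parent (None marks the root).
ONTOLOGY_PARENT = {
    "Ceramic": None,
    "Decorated": "Ceramic",
    "Undecorated": "Ceramic",
    "Black-on-white": "Decorated",
    "Polychrome": "Decorated",
    "Plain gray": "Undecorated",
    "Corrugated": "Undecorated",
}

# Hypothetical coding sheets: each project keeps its own codes, and each
# code is mapped once to a node in the shared ontology.
MAPPINGS = {
    "project_a": {1: "Black-on-white", 2: "Plain gray", 3: "Polychrome"},
    "project_b": {"BW": "Black-on-white", "CG": "Corrugated", "PG": "Plain gray"},
}


def roll_up(node, level_nodes):
    """Walk from node toward the root and return the first node that is in
    level_nodes, or None if no such node is reached."""
    while node is not None:
        if node in level_nodes:
            return node
        node = ONTOLOGY_PARENT.get(node)
    return None


def integrate(datasets, mappings, level_nodes):
    """Count records across datasets at the requested ontology level."""
    counts = Counter()
    for project, codes in datasets.items():
        for code in codes:
            node = mappings[project].get(code)  # None if the code is unmapped
            target = roll_up(node, level_nodes)
            counts[target if target is not None else "unmapped"] += 1
    return counts


# Hypothetical sherd records, stored exactly as originally coded.
DATASETS = {
    "project_a": [1, 1, 2, 3],
    "project_b": ["BW", "CG", "PG", "CG"],
}

# The same mapped data answer queries at different levels of aggregation.
coarse = integrate(DATASETS, MAPPINGS, {"Decorated", "Undecorated"})
fine = integrate(DATASETS, MAPPINGS, {"Black-on-white", "Polychrome", "Undecorated"})
```

The original records are never recoded in this sketch. Only the mappings are consulted at query time, so a new query with a different aggregation requires no further data preparation.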
Establishing Analytical Comparability
If the datasets have strong column- and value-level metadata and the mappings of dataset values to ontologies are reasonably consistent, data integration is highly effective at the variable level. However, data comparability also depends on contextual (including time, space, depositional context, sampling intensity) and taphonomic characteristics of the datasets as a whole—information that is often not directly documented in the datasets. For example, if an excavation project screened all deposits and a testing project recovered large numbers of artifacts from backhoe trenches without screening, then quantitative comparability is lost. More subtly, datasets may differ in their mix of contexts—one might have excavated largely room contexts, and another, mostly midden contexts. In this case, one would need to determine whether observed differences between the datasets are due to the different kinds of contexts investigated or to actual differences in the sites themselves.
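One simple way to guard against these two problems, assuming the contextual information is available in the integrated records, is to exclude unscreened proveniences and to compare proportions only within like depositional contexts. The sketch below is an illustration under that assumption; the field names, site labels, and counts are hypothetical and it is not the analytical protocol used in our study.

```python
from collections import defaultdict

# Hypothetical integrated records with contextual fields attached.
RECORDS = [
    {"dataset": "site_a", "context": "room", "screened": True, "taxon": "lagomorph", "count": 30},
    {"dataset": "site_a", "context": "room", "screened": True, "taxon": "artiodactyl", "count": 10},
    {"dataset": "site_a", "context": "midden", "screened": True, "taxon": "lagomorph", "count": 20},
    {"dataset": "site_a", "context": "midden", "screened": True, "taxon": "artiodactyl", "count": 20},
    {"dataset": "site_b", "context": "midden", "screened": True, "taxon": "lagomorph", "count": 45},
    {"dataset": "site_b", "context": "midden", "screened": True, "taxon": "artiodactyl", "count": 15},
    {"dataset": "site_b", "context": "trench", "screened": False, "taxon": "artiodactyl", "count": 50},
]


def proportion_by_context(records, taxon, require_screened=True):
    """Return {(dataset, context): proportion of specimens that are taxon}.

    When require_screened is True, unscreened proveniences are dropped
    before any totals are computed.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for rec in records:
        if require_screened and not rec["screened"]:
            continue
        key = (rec["dataset"], rec["context"])
        totals[key] += rec["count"]
        if rec["taxon"] == taxon:
            hits[key] += rec["count"]
    return {key: hits[key] / totals[key] for key in totals}


lagomorph_share = proportion_by_context(RECORDS, "lagomorph")
```

With these invented counts, the two middens can be compared directly, while the room context at the first site and the unscreened trench at the second are kept out of that comparison.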
Datasets, or contexts within them, may also differ in terms of the taphonomic processes that shaped the formation of the collections. Because of taphonomic differences, even consistently recorded datasets representing similar mixes of depositional contexts may not be comparable with respect to some kinds of questions. While these comparability problems can never be ignored, being able to integrate datasets in the way that we have proposed enormously facilitates their resolution. For our analysis of the collections from 33 Southwestern sites described above, we developed an analytical protocol that uses variables recorded in most or all datasets to evaluate statistically the taphonomic comparability of the datasets for different purposes (Clark 2014).
What too often limits the assessment of analytical comparability in practice is that datasets, especially those derived from specialist analyses, often do not contain contextual data about the proveniences investigated, because the analysts did not have this information in the first place or it was never later integrated with the specialist data. The absence of contextual data makes the datasets less useful than they otherwise would be. Of course, this issue is not a technical limitation of the databases or of tDAR's data integration tool. It indicates a serious deficiency in the workflows that produce, analyze, and archive digital data (McManamon et al. 2017).
Costs and Responsibilities
Making datasets suitable for reuse entails planning, effort, and some direct costs. The direct cost of depositing a dataset in tDAR is low ($10 or less for a 10-MB database). The larger cost is in the effort devoted to properly preparing and documenting a dataset in a way that it can responsibly be used by others. Making datasets accessible and suitable for reuse is an ethical responsibility according to the “Society for American Archaeology Principles of Archaeological Ethics” (Society for American Archaeology 1996; discussed in Kintigh 2006). Similar ethical responsibilities are laid out by the Chartered Institute for Archaeologists, the Register of Professional Archaeologists, and the European Association of Archaeologists. In many cases, there is also a legal obligation to make publicly funded data accessible (Cultural Heritage Partners 2012).
Depositing a dataset in tDAR immediately provides for easy discovery, both through tDAR's Web interface and through Google and other search engines. While tDAR users need to register (at no cost), all use of data in tDAR, including downloads of datasets, is free. Although tDAR does not force every depositor to provide ideal metadata with a dataset, its interactive interface prompts the depositor to provide thorough metadata at the dataset, column, and value levels. Not only are those metadata available to any subsequent user; they are directly exploited by the data integration tools. Mapping data values to shared ontologies so they can be used in data integration constitutes another level of metadata. That is, each mapping constitutes an assertion that a given value of this particular variable is reasonably equivalent to the value in this node of the shared ontology. We are not aware of any other repository that provides such comprehensive tools to gather critical dataset metadata or the ease of use and analytical power of tDAR's data integration software.
CONCLUSIONS
To answer many of the most pressing questions of concern to archaeologists, to scientists more generally, to policy makers, and to the broader publics to which we are responsible, archaeology needs to conduct synthetic research. That synthetic research requires that we integrate primary data from multiple projects that do not typically collect data in completely consistent ways. As a result, we must have means of integrating observations across datasets in ways that maintain their semantic integrity. However data integration is accomplished, it places heavy demands on the metadata that document not only the tables, columns, and values but also collection procedures and other information often not contained in the datasets themselves.
As data integration has traditionally been done, it is a highly time-consuming and often frustrating endeavor. The efforts are typically one-off and are not readily built upon. Through this article we have sought to draw attention to data integration as an important component of our disciplinary analytical processes. We have also sought to highlight what we believe are unique tools in tDAR that make it possible, with reasonable, incremental efforts, to integrate very large numbers of datasets in ways that are directly expandable. tDAR's query-directed, on-the-fly data integration tools not only enable the kinds of synthetic research that we need; they facilitate many other kinds of analysis, and they greatly foster data reuse.
Acknowledgments
This article draws on the collective efforts of Arizona State University archaeology and computer science colleagues who contributed to our initial National Science Foundation proposals. In addition to the authors, these individuals include John Anderies, Chitta Baral, Huiping Cao, George Cowgill, Hasan Davulcu, James DeVos, Michelle Hegmon, John Howard, Subbarao Kambhampati, Allen Lee, Peter McCartney, Francis McManamon, Ben Nelson, Margaret Nelson, Charles Redman, Arleyn Simon, and Sander van der Leeuw. Recent conversations with Jeffrey Altschul have also helped frame this discussion. The article was substantially improved based on comments from the editor and three anonymous reviewers.
The members of the Southwest Faunal Working Group contributed data used in this article and helped refine the data integration workflow: Nancy Akins, Robin Cordero, Kathy Roler, Vincent LaMotta, Barnet Pavao-Zuckerman, Karen Gust Schollmeyer, and Christine Szuter. In addition, Linda Cordell, Jonathan Driver, Barbara Mills, Alison Rautman, Kari Schmidt, the Arizona State Museum, the Crow Canyon Archaeological Center, Los Alamos National Laboratory, and the School for Advanced Research provided datasets used in the analyses. We are very grateful to them and the members of the North American, Eastern US Archaic, and UK faunal working groups, which together developed the faunal ontologies used here.
This article is an extended version of a paper presented in the invited session “The Future of Big Data in Archaeology,” organized by Erick Nolan Robinson at the 82nd Annual Meeting of the Society for American Archaeology, Vancouver, British Columbia, on March 31, 2017.
This material is based upon work supported by the National Science Foundation under grant numbers 0433959, 0624341, 1016921, and 1153115 awarded to Arizona State University and grant number 1353727 awarded to Indiana University of Pennsylvania. It is also based on work supported by a joint award, PX-50022-09, from the National Endowment for the Humanities and the Joint Information Systems Committee (UK). Permits were not required for this work.
Data Availability Statement
All data used in this article are available through tDAR: The Digital Archaeological Record. Links to specific data resource references are supplied at appropriate places within the article text.