Socks Over Boots: History, Data Quality, and the Creation of Large Scale Information Systems
Eric Ingbar, Gnomon, Inc.
Mary Hopkins, Wyoming State Historic Preservation Office
Tim Seaman, New Mexico State Historic Preservation Office
Paper prepared for the 64th Annual Meeting of the Society for American Archaeology, Chicago, Illinois, March 24-28, 1999.
DRAFT - 3.24.99 DO NOT CITE WITHOUT WRITTEN PERMISSION OF AUTHORS
COPYRIGHT RETAINED BY AUTHORS
Organized bodies of observations about "cultural resources" (cf. National Register Bulletin 15 for a definition) have been with us for a long time. Cultural resource information systems, as we shall use the term in this paper, refers to organized sets of observations made available to the professional public. Once upon a time all such systems were in file cabinets. Today, some agencies still use paper records, but most are automated or in the process of automating. The term "cultural resource information system" (CRIS) usually implies an automated set of records; much of what we have to say applies equally to paper and electronic information systems.
Many paper CRIS’s have been, and continue to be, useful. Typically, a successful paper CRIS has paper maps with inventory areas and resources marked on them. These maps serve as the index into file cabinets (for resource observations, e.g., site record forms) and bookcases (for investigation reports). Electronic information systems usually replicate this mode of organization in some way – by including spatial references in a database (e.g., search by township, range, and section) and by using GIS technology to replace the paper maps. There is a continuum from electronic index at the simplest end to automated analytical geographic and attribute database at the other. Because of the National Historic Preservation Act of 1966, most CRIS’s are the inventories of state historic preservation offices. Thus, they are organized on a state-by-state basis. This adds variation. Further, each state archive developed out of a unique confluence of historical events. New Mexico’s Records Management Section is the direct descendant of H.P. Mera’s research. Wyoming’s archive grew out of WPA-era collections information.
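The "electronic index" end of the continuum described above can be sketched very simply: a lookup from a legal description (township, range, and section) to the resources filed there, replacing the paper map as the entry point into the archive. The class and field names below are illustrative only, not drawn from any actual state system.

```python
# A minimal sketch of a legal-location index: the electronic
# equivalent of pulling the paper quad map to see what has been
# recorded in a given section. Names are hypothetical.

from collections import defaultdict

class LegalLocationIndex:
    def __init__(self):
        self._index = defaultdict(list)

    def register(self, township, range_, section, resource_id):
        """File a resource under its legal description."""
        self._index[(township, range_, section)].append(resource_id)

    def search(self, township, range_, section):
        """Return resource IDs recorded in the given section."""
        return self._index.get((township, range_, section), [])

index = LegalLocationIndex()
index.register("T12N", "R69W", 24, "48AB123")
index.register("T12N", "R69W", 24, "48AB456")
print(index.search("T12N", "R69W", 24))  # ['48AB123', '48AB456']
```

A GIS replaces the (township, range, section) key with true geometry, but the organizing idea, location as the index into the records, is the same.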
Historic preservation legislation and its appearance in the environmental permit process have driven a broadly similar body of work and work processes in the western U.S. This has had a leavening effect on archive differences from one state to another. Archives have evolved in most western states to be repositories of the information generated by the Section 106/110 investigation, evaluation, and review process. So, most western U.S. CRIS’s contain observations made for similar reasons and in similar ways. Management is the most important reason for these observations to be automated. Automation for research use is generally limited to creating enough of an index that a given resource’s "context" (temporal, spatial, functional, or thematic) can be derived, because this sort of context is a management need.
Paper records form the basis for all CRIS’s. Paper archives usually contain site recordation forms and usually reports (mostly CRM reports). Some archives also contain correspondence pertaining to National Register status and 106/110 review. Many archives maintain paper maps of resources and areas of inventory. These vary in scale and currency. Often, the built environment part of historic preservation is not filed with the archaeological information, and in the western U.S. these records are not consulted as frequently as the archaeological records (except perhaps in California; the situation is the reverse in the eastern U.S.).
Information systems developed from these paper records are affected by the difficulties posed by any automation project. Strategic factors include the cost of data entry and data storage, reliability of transcribed data (both the ambiguity with which information is recorded and whether it can be meaningfully encoded), affordable technology, and the utility or relevance of particular pieces of information. The last factor is the most important and the first decision taken: "Wouldn’t it be more efficient (useful, practical, etc.) to have this in a database?" Like any automation project, the outcome is achieved by using time and budgets to whittle away desirable categories of information and service. The square peg of an ideal system gets shaped to fit (however poorly) the circular hole of practical application. For example, the Wyoming SHPO began automating records in the late 1970’s. Data were highly encoded (1=camp site, 2 = kill site, etc.), because the available (affordable) technology at the time used 40 column punch cards.
Because of the context in which CRIS's developed, they usually contain the following sorts of information:
It is interesting to consider some of the things that tend not to be in CRIS's:
There is a distinct bias toward survey (discovery) of cultural resources and the first recordings (survey-level investigations) of resources in almost every CRIS with which we are familiar. Many systems track further investigations as they generate new reports, but it is less common for systems to update the resource records with new results.
Shortcomings in CRIS's reflect the relative strength of the different channels within archaeology, history, and historic preservation. The "Section 106/110" channel broadcasts continually at high strength. On the other hand, student research is rather like the third community access channel, sporadic at about 5 watts. CRIS's are tuned into the "powerful" signals. In short, CRIS’s are a knowledge base, but only of a narrow sort.
Data quality and the quality assurance process: a CRIS perspective
The quality of primary observations is the basis of information value. How well, or how poorly, field observations are made and recorded in a comprehensible form determines the utility of any derived information products. CRIS managers are usually archivists, working within State Historic Preservation Offices. As archivists, CRIS staff do not control fieldwork quality, either by statute or mandate. To some degree, SHPO colleagues may control fieldwork quality, but their sieve is necessarily pretty wide because of time constraints: they simply do not read every resource record in great and tedious detail, let alone check things such as directions to a resource, legal location, or coordinates.
It seems to us that fieldwork quality is actually controlled (or predicted) by personal professional ethic first, and secondarily by mandated review (e.g., the Section 106 process). Field observation quality depends upon individual knowledge and effort, not institutionalized procedures. The ecology of human endeavor will always yield a niche for those willing to do fieldwork poorly but who can still get their work accepted.
Relevance is another important aspect of CRIS quality. Being a good observer and communicator is only part of the fieldwork process. Part of communicating is having someone who wants to listen. Not infrequently, one is offered a very high quality information set, but it is not (currently) relevant to our information systems. For example, a researcher may wish to share a database of fractal quantifications of biface flake scars, gathered at great time and expense. This is great stuff, but maintaining heterogeneous datasets of this sort is difficult within an "IS" type framework. At the same time, next year we might all want to record bifaces in fractal ways: relevance is a moving target. This happens all the time with site recording formats. Most states change their site recording formats every 10 to 15 years, because the old site form is perceived as collecting too much irrelevant information.
Variable quality and changing relevance of particular observations are immutable factors from the CRIS perspective, and hence require coping strategies on the part of CRIS designers and managers. Coping strategies seem to take several forms, and of course often fall outside of the CRIS venue itself. Basically, we see three ways of coping with the fundamental quality issues:
A. In the field
B. In the cultural resource information system (paper or electronic; esp. electronic)
C. In the users of information systems
From the above, we think it is clear that at present, there is no ideal single strategy for ensuring data quality in large scale information systems. There is simply too much variation in professional practice, in interpretation, and in information use.
There are two areas in which CRIS's can ensure data quality internally: locational information, especially coordinate-based locations, and comprehensive record submittal. CRIS's routinely check locational attributes of resources as part of their data entry process. Locations (coordinates, cadastral, or both) are checked using maps required as part of the resource recording process. These are tangible facts, and though time-consuming to verify, they are not subject to interpretation. CRIS's can also ensure, in most states, that anyone who registers a resource (is assigned a number by the CRIS) actually submits the records for that resource. While this is feasible to do, and is a requirement of most state and Federal fieldwork permits, it is not often done. Of course, ensuring that the records come in does nothing to ensure their quality, but just having the records could be considered a minimum standard.
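The locational check described above is mechanical enough to sketch in code: confirm that a submitted coordinate falls inside the extent of the map cited on the site form. The function and the bounding-box values below are hypothetical illustrations, not part of any actual CRIS data entry system.

```python
# Illustrative data-entry check: does a submitted UTM coordinate fall
# inside the extent of the quad map cited on the site form?
# The extent values are hypothetical.

def utm_within_extent(easting, northing, extent):
    """extent = (min_easting, min_northing, max_easting, max_northing)."""
    min_e, min_n, max_e, max_n = extent
    return min_e <= easting <= max_e and min_n <= northing <= max_n

quad_extent = (440000, 4560000, 460000, 4580000)  # hypothetical 7.5' quad
print(utm_within_extent(451230, 4571880, quad_extent))  # True
print(utm_within_extent(990000, 4571880, quad_extent))  # False: flag it
```

A failed check does not prove the record is wrong, only that the coordinate and the cited map disagree and a person should look at both.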
We have tried many of the strategies described above for enforcing quality. A few have been successful, or partially successful. We cannot control the acceptance of the records that we must incorporate, so we try to interpret those records as consistently as possible. To be blunt, archive staff remain the front-line of quality assurance in our systems. Even so, one can have some doubts about how far information in our systems should be "pushed".
Because relevance and reliability are so variable, the vast majority of electronic CRIS's function as indexes to paper records. The index has some interesting ancillary information, but it is not sufficiently reliable to allow large scale tabulation of detailed information (mostly on archaeological sites, since these are the majority of resources in our states). For example, tallying the frequency of bifaces, let alone bifaces of a particular reduction stage, is probably not advisable.
Coping Strategies: Creating Better CRIS's
As CRIS managers and designers, creating better CRIS's is of deep and abiding interest to us. Archives are made to be used; indeed, the purpose of "Criterion D" eligibility to the National Register is to add to scientific knowledge. Reports and resource observations are very much a part of that legacy, as are information systems built upon them. It is in our interest and the interest of the professional community and public that we create better information systems.
Some of the ways that the next half-generation (nobody has the money to start anew) of CRIS's can be improved include seeking consensus on relevance and reliability, putting our databases on a diet, documenting variation in attribute values more explicitly, and broadening the scope of information held within electronic formats. We will discuss each of these in more detail below.
Agree on consistently observable, relevant, phenomena
Agreement on consistently observable, relevant phenomena to include in CRIS’s is the place to start. One of the problems with the heritage of CRIS’s is that they grew out of ideas about what would be "important" to record. Important attributes were not well assessed for their feasibility – it might be interesting to observe something but could it be observed reliably, consistently? An excellent example of this is the Intermountain Antiquities Computer System (IMACS) site record format for topography. A major landform is requested, followed by a secondary landform. If I am standing on a side spur halfway down into a drainage, am I on a "RIDGE->SPUR"? A "VALLEY->RIDGE"? The scheme cannot be used consistently.
We have been seeking consensus from our colleagues on what "core attributes" should be in every cultural resources information system (Figure 2). Figure 2 indicates the broad categories of information. Suffice it to say that the majority of core attributes for resources, investigations, and resource aggregations (the basic entities in our model of cultural resource business processes, see Figure 3) are simple and observable. They are probably not very satisfying for most researchers; again we seek a reliable index database, not a knowledge base.
The concept of site is a good example of a lack of consensus. We are familiar, within our own states, with variation in the same agency’s definition of what constitutes a "site" (and hence gets full recording) and what constitutes an "isolated find". While one would expect that within a single state agencies could at least set a statewide criterion, even this basic standardization is lacking.
Re-examine field recording strategies
As we have discussed above, field recording strategies are essential to the entire CRIS enterprise. This part of the data creation process is often ignored in discussions of CRIS’s. We think that fieldwork standards should be practical -- that is, one could actually put them into practice. There have been lots of unworkable fieldwork standards; all should be subject to periodic review by regional professionals.
Practical field recording strategies must be realistic about who is doing the fieldwork. Most archaeological crews are not staffed by professionals of long experience. So, recording strategies need to be simple enough that relatively inexperienced crews can record phenomena reliably.
GPS, in the western U.S., must be incorporated into field recording strategies. Uncorrected resource grade (code phase) GPS may be little better than good map reader accuracy in terrain with moderate or high relief. But, in low relief terrain it is better than most map readers. Corrected GPS (inaccuracies averaging less than 10m) exceeds 1:24,000 scale map accuracy. Spatial accuracy will become more important as GIS becomes a component of all electronic CRIS’s.
Evaluate reliability in existing data
Existing datasets contain lots of information that experts would consider unreliable. These categories of information should be archived and removed from a CRIS: put the database on a diet. In almost every information system we know of, there are cellulite candidates – they add bulk, but no beauty, to the database.
Trimming unreliable data may not always be the wisest strategy, and it may be better to encode an affiliated attribute that describes the quality of a piece of information. For example, what if the unreliable data is something one thinks crucial, like UTM coordinates? One could create an associated set of descriptive attributes stating whether the UTM was checked, unchecked, what the UTM source was (map scale, GPS accuracy), and so forth. We have begun instituting this sort of record level metadata in our databases.
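The record-level metadata described above amounts to a small addition to the coordinate schema: keep the possibly unreliable value, but attach attributes saying how it was obtained and whether it has been verified. The field names below are illustrative, not those of any actual CRIS database.

```python
# Sketch of record-level metadata for a UTM coordinate: the value
# itself plus attributes describing its provenance and verification
# status. Field names are hypothetical.

from dataclasses import dataclass
from typing import Optional

@dataclass
class UTMRecord:
    zone: int
    easting: float
    northing: float
    checked: bool = False            # verified against the submitted map?
    source: str = "unknown"          # e.g. "1:24,000 map", "corrected GPS"
    est_accuracy_m: Optional[float] = None  # stated accuracy, if known

loc = UTMRecord(zone=13, easting=451230, northing=4571880,
                checked=True, source="corrected GPS", est_accuracy_m=5.0)
# A user can now decide how far to "push" this coordinate:
print(loc.checked and (loc.est_accuracy_m or 999) <= 10)  # True
```

The point is that the quality judgment moves out of the archivist's head and into the record, where every later user can see it.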
Do a better job of tracking results, not just paper
One of the things that CRIS’s do not do well is to track what has been learned. CRIS’s generally track what has been done. Notwithstanding that CRIS’s are indexes first and not knowledge bases, they still need to tune into some of the other broadcast channels more effectively. To do this, electronic CRIS’s need to provide avenues by which researchers can summarize their work easily, and these summaries can be linked to appropriate records in the CRIS. Even just including (and requiring) 200 word project abstracts within an electronic system might be helpful.
Make CRIS datasets easier to use and more consistent regionally
Prehistoric and many historic people did not honor the political boundaries by which our current CRIS’s are organized. Indexes need to straddle political boundaries (especially state lines) seamlessly. We and colleagues at other CRIS’s and in a variety of agencies have begun to develop sharing standards. As well, technological means to share attributes from different databases "painlessly" are becoming affordable (in the CRIS scheme of budgets). Technologies like XML (Extensible Markup Language) and internet GIS servers will be necessary, but not sufficient, parts of sharing.
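The attraction of XML for this kind of sharing is that a record travels with its own labels, so a receiving system can read the attributes it understands and ignore the rest. The element names below are invented for illustration; an actual sharing standard would define its own schema.

```python
# Hedged sketch of attribute sharing via XML: a resource record
# serialized by one state's system and parsed by another's.
# Element and attribute names are hypothetical.

import xml.etree.ElementTree as ET

shared_record = """
<resource id="48AB123" state="WY">
  <type>prehistoric site</type>
  <legal township="T12N" range="R69W" section="24"/>
</resource>
"""

root = ET.fromstring(shared_record)
print(root.get("id"), root.get("state"))   # 48AB123 WY
print(root.find("legal").get("township"))  # T12N
```

Parsing is the easy part; agreeing on what the elements mean, the consensus problem discussed above, is the hard part.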
Socks over boots: CRIS’s, data quality, and knowledge
The title of this paper comes from an old cowboy expression. Getting one’s socks over boots means doing things in the wrong order, to the detriment of hosiery, footwear, and the cowboy. Trolling large scale information systems for "research" is not going to work well, in our opinion. Detailed data are simply too variable in quality, heritage, and observer effects. The role of CRIS’s, at present, is to serve as an index by which one can increase one’s knowledge of the past by finding appropriate reports, resource recordings, and by providing a coherent spatial framework for these things.
This does not mean that CRIS’s are not useful for research. Electronic information systems do contain appropriate information, but one must use them intelligently. This caution seems so obvious that perhaps we should apologize for making it; yet, we receive information requests every month that seem inappropriate or naïve given the state of our data systems. On the other hand, the Colorado Council of Professional Archaeologists has just revised the state’s regional research agendas using the Colorado SHPO data to examine broad patterns in different parts of the state, assess gaps in knowledge about certain time periods, and assess bias in inventory coverage. These are appropriate and useful roles for CRIS’s in sparking research.
Large scale CRIS’s are not going away. They are (we hope) becoming more efficient, more service-oriented, and more spatially enabled through GIS. We have taken some pains to point out the hazards and limits within our datasets. Knowledge of these shortcomings will become paramount as access to CRIS data through on-line mechanisms becomes widespread. CRIS’s have been, are, and will be dynamic systems. They are always going to be somewhat out of date with contemporary needs, but we think that rational, practical design/redesign will extend the lifespan of CRIS databases by making them more usable, maintainable, and migratable. Part of that design cycle is reaching consensus between "management" and "research". Such a consensus is never easy, but creating better information systems is possible, is feasible, and is worthwhile.
FIGURE 1. GRAPHIC exploding the term cultural resource information system
CULTURAL RESOURCE - cf. National Register Bulletin 15
INFORMATION - A meaningful observation, not the thing itself (which lacks inherent meaning) or a random attribute of the thing.
SYSTEM - An organized set of relationships between definable phenomena (nodes)
FIGURE 2. CORE ATTRIBUTES OF THE FGDC DRAFT MODEL
FIGURE 3. ENTITY RELATIONSHIPS, ENTITIES, CORE MODEL