|
Among CXP's findings concerning the production of microfilm from digital files, KENNEY reported that the digital files for the same Reed lecture were used to produce sample film using an electron beam recorder. The resulting film was faithful to the image capture of the digital files, and while CXP felt that the text and image pages of the Reed lecture were superior to those of the light-lens film, the resolution readings for the 600-dpi film were not as high as those for standard microfilming. KENNEY argued that the standards defined for light-lens technology are not wholly transferable to a digital environment. Moreover, they are based on a definition of quality for a preservation copy. Although making this case will prove to be a long, uphill struggle, CXP plans to continue to investigate the issue over the course of the next year.
KENNEY concluded this portion of her talk with a discussion of the advantages of creating film: it can serve as a primary backup and as a preservation master to the digital file; it could then become the print or production master and service copies could be paper, film, optical disks, magnetic media, or on-screen display.
Finally, KENNEY presented details re production:
* Development and testing of a moderately high-resolution production scanning workstation represented a third goal of CXP; to date, 1,000 volumes have been scanned, or about 300,000 images.
* The resulting digital files are stored and used to produce hard-copy replacements for the originals and additional prints on demand; although the initial costs are high, scanning technology offers an affordable means for reformatting brittle material.
* A technician in production mode can scan 300 pages per hour when performing single-sheet scanning, which is a necessity when working with truly brittle paper; this figure is expected to increase significantly with subsequent iterations of the software from Xerox; a three-month time-and-cost study of scanning found that the average 300-page book would take about an hour and forty minutes to scan (this figure included the time for setup, which involves keying in primary bibliographic data, going into quality control mode to define page size, establishing front-to-back registration, and scanning sample pages to identify a default range of settings for the entire book—functions not dissimilar to those performed by filmers or those preparing a book for photocopy).
* The final step in the scanning process involved rescans, which happily were few and far between, representing well under 1 percent of the total pages scanned.
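The time figures reported above can be combined into a quick back-of-the-envelope model. This is a sketch: the 300-pages-per-hour production rate comes from the talk, and the roughly forty-minute setup allowance is inferred from the reported hour-and-forty-minute total for a 300-page book; the function name and structure are illustrative.

```python
# Back-of-the-envelope model of CXP's reported scanning times.
# 300 pages/hour is the stated production rate; the ~40-minute setup
# allowance is inferred from the reported 1h40m total for a 300-page book.
def scan_time_minutes(pages, pages_per_hour=300, setup_minutes=40):
    """Total minutes to scan a book: fixed setup (keying bibliographic
    data, defining page size, registration, sample pages) plus per-page time."""
    return setup_minutes + pages / pages_per_hour * 60

print(scan_time_minutes(300))  # → 100.0 minutes, about an hour and forty minutes
```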
In addition to technician time, CXP costed out the equipment (amortized over four years), the cost of storing and refreshing the digital files every four years, and the cost of printing and binding a paper reproduction in book-cloth binding. The total amounted to a little under $65 per single 300-page volume, with 30 percent overhead included, a figure competitive with the prices currently charged by photocopy vendors.
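A minimal sketch of the cost arithmetic: the 30 percent overhead rate and the roughly $65 total per 300-page volume come from the talk, but the component breakdown below is purely illustrative placeholder data, not CXP's actual figures.

```python
# Sketch of the per-volume cost model described above. The 30 percent
# overhead and the ~$65 total come from the talk; the component figures
# below are illustrative placeholders, not CXP's actual breakdown.
def cost_per_volume(direct_costs, overhead_rate=0.30):
    """Total cost per volume: sum of direct costs plus overhead."""
    return sum(direct_costs) * (1 + overhead_rate)

# Hypothetical direct costs per 300-page volume (dollars):
direct = {
    "technician_time": 20.0,   # scanning labor
    "equipment": 10.0,         # amortized over four years
    "storage_refresh": 5.0,    # storing/refreshing digital files every four years
    "print_and_bind": 15.0,    # paper reproduction with book-cloth binding
}
print(round(cost_per_volume(direct.values()), 2))  # → 65.0, "a little under $65"
```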
Of course, with scanning, in addition to the paper facsimile, one is left with a digital file from which subsequent copies of the book can be produced for a fraction of the cost of photocopy, with readers afforded choices in the form of these copies.
KENNEY concluded that digital technology offers an electronic means for a library preservation effort to pay for itself. If a brittle-book program included the means of disseminating reprints of books that are in demand by libraries and researchers alike, the initial investment in capture could be recovered and used to preserve additional but less popular books. She disclosed that an economic model for a self-sustaining program could be developed for CXP's report to the Commission on Preservation and Access (CPA).
KENNEY stressed that the focus of CXP has been on obtaining high quality in a production environment. The use of digital technology is viewed as an affordable alternative to other reformatting options.
******
+++++++++++++ ANDRE * Overview and history of NATDP * Various agricultural CD-ROM products created inhouse and by service bureaus * Pilot project on Internet transmission * Additional products in progress * +++++++++++++
Pamela ANDRE, associate director for automation, National Agricultural Text Digitizing Program (NATDP), National Agricultural Library (NAL), presented an overview of NATDP, which had been under way at NAL for the last four years, before Judith ZIDAR discussed the technical details. ANDRE defined agricultural information as a broad range of material, from basic and applied research in the hard sciences to the one-page pamphlets distributed by the cooperative state extension services on such things as how to grow blueberries.
NATDP began in late 1986 with a meeting of representatives from the land-grant library community to deal with the issue of electronic information. NAL and forty-five of these libraries banded together to establish this project—to evaluate the technology for converting what were then source documents in paper form into electronic form, to provide access to that digital information, and then to distribute it. Distributing that material to the community—the university community as well as the extension service community, potentially down to the county level—constituted the group's chief concern.
Since January 1988 (when the microcomputer-based scanning system was installed at NAL), NATDP has done a variety of things, concerning which ZIDAR would provide further details. For example, the first technology considered in the project's discussion phase was digital videodisc, which indicates how long ago it was conceived.
Over the four years of this project, four separate CD-ROM products on four different agricultural topics were created, two at a scanning-and-OCR station installed at NAL and two by service bureaus. Thus, NATDP has gained comparative information about the relative costs. Each of these products contained the full ASCII text as well as page images of the material, amounting to between 4,000 and 6,000 pages per disk. Topics included aquaculture; food, agriculture, and science (i.e., international agriculture and research); acid rain; and Agent Orange, which was the final product distributed (approximately eighteen months before the Workshop).
The third phase of NATDP focused on delivery mechanisms other than CD-ROM. At the suggestion of Clifford LYNCH, who was a technical consultant to the project at this point, NATDP became involved with the Internet and initiated a project with the help of North Carolina State University, in which fourteen of the land-grant university libraries are transmitting digital images over the Internet in response to interlibrary loan requests—a topic for another meeting. At this point, the pilot project had been completed for about a year and the final report would be available shortly after the Workshop. In the meantime, the project's success had led to its extension. (ANDRE noted that one of the first things done under the program title was to select a retrieval package to use with subsequent products; Windows Personal Librarian was the package of choice after a lengthy evaluation.) Three additional products had been planned and were in progress:
1) An arrangement with the American Society of Agronomy—a professional society that has published the Agronomy Journal since about 1908—to scan and create bit-mapped images of its journal. ASA granted permission first to put this material in electronic form and then to distribute it, to hold it at NAL, and to use the electronic images as a mechanism to deliver documents or print out material for patrons, among other uses. Effectively, NAL has the right to use this material in support of its program. (Significantly, this arrangement offers a potential cooperative model for working with other professional societies in agriculture to do the same thing—put the journals of particular interest to agricultural research into electronic form.)
2) An extension of the earlier product on aquaculture.
3) The George Washington Carver Papers—a joint project with Tuskegee University to scan and convert from microfilm some 3,500 images of Carver's papers, letters, and drawings.
It was anticipated that all of these products would appear no more than six months after the Workshop.
******
+++++++++++++ ZIDAR * (A separate arena for scanning) * Steps in creating a database * Image capture, with and without performing OCR * Keying in tracking data * Scanning, with electronic and manual tracking * Adjustments during scanning process * Scanning resolutions * Compression * De-skewing and filtering * Image capture from microform: the papers and letters of George Washington Carver * Equipment used for a scanning system * +++++++++++++
Judith ZIDAR, coordinator, National Agricultural Text Digitizing Program (NATDP), National Agricultural Library (NAL), illustrated the technical details of NATDP, including her primary responsibility, scanning and creating databases on a topic and putting them on CD-ROM.
(ZIDAR remarked on a separate arena from the CD-ROM projects, although the processing of the material is nearly identical: NATDP is also scanning material and loading it on a NeXT microcomputer, which in turn is linked to NAL's integrated library system. Thus, searches in NAL's bibliographic database will enable people to pull up actual page images and text for any documents that have been entered.)
In accordance with the session's topic, ZIDAR focused her illustrated talk on image capture, offering a primer on the three main steps in the process: 1) assemble the printed publications; 2) design the database (database design occurs in the process of preparing the material for scanning; this step entails reviewing and organizing the material, defining the contents—what will constitute a record, what kinds of fields will be captured in terms of author, title, etc.); 3) perform a certain amount of markup on the paper publications. NAL performs this task record by record, preparing work sheets or some other sort of tracking material and designing descriptors and other enhancements to be added to the data that will not be captured from the printed publication. Part of this process also involves determining NATDP's file and directory structure: NATDP attempts to avoid putting more than approximately 100 images in a directory, because placing more than that on a CD-ROM would reduce the access speed.
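The directory rule described above (no more than about 100 images per directory, so CD-ROM access speed is not degraded) can be sketched as a simple mapping from a sequential image number to a path. The naming convention here is hypothetical, not NATDP's actual scheme.

```python
# Sketch of the file/directory layout rule described above: keep no more
# than ~100 images per directory so CD-ROM access stays fast. The naming
# scheme is illustrative, not NATDP's actual convention.
def image_path(image_number, per_dir=100):
    """Map a sequential image number to a directory/file path."""
    directory = image_number // per_dir   # a new directory every `per_dir` images
    return f"dir{directory:03d}/img{image_number:06d}.tif"

print(image_path(0))    # → dir000/img000000.tif
print(image_path(250))  # → dir002/img000250.tif
```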
This up-front process takes approximately two weeks for a 6,000-7,000-page database. The next step is to capture the page images. How long this process takes is determined by the decision whether or not to perform OCR. Skipping OCR speeds the process, whereas text capture requires greater care with image quality: the image must be straighter, and allowance must be made for the text on a page, not just for the capture of photographs.
NATDP keys in tracking data, that is, a standard bibliographic record including the title of the book and the title of the chapter, which will later either become the access information or will be attached to the front of a full-text record so that it is searchable.
Images are scanned from a bound or unbound publication, chiefly from bound publications in NATDP's case, because these are often the only copies and the publications are returned to the shelves. NATDP usually scans one record at a time, because its database tracking system tracks the document that way and does not require further logical separation of the images. After performing optical character recognition, NATDP moves the images off the hard disk and maintains a volume sheet. Though the system tracks electronically, all the processing steps are also tracked manually with a log sheet.
ZIDAR next illustrated the kinds of adjustments that one can make when scanning from paper and microfilm, for example, redoing images that need special handling, setting for dithering or gray scale, and adjusting for brightness or for the whole book at one time.
NATDP is scanning at 300 dots per inch, a standard scanning resolution. Though adequate for capturing text of a standard size, 300 dpi is unsuitable for photographic material or for very small text. Many scanners allow for different image formats, TIFF, of course, being the de facto standard. But if one intends to exchange images with other people, the ability to produce other image formats, even less common ones, becomes highly desirable.
CCITT Group 4 is the standard compression for normal black-and-white images, JPEG for gray scale or color. ZIDAR recommended 1) using the standard compressions, particularly if one attempts to make material available and to allow users to download images and reuse them from CD-ROMs; and 2) maintaining the ability to output an uncompressed image, because in image exchange uncompressed images are more likely to be able to cross platforms.
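CCITT Group 4 achieves its compression with two-dimensional run-length coding. The toy encoder below shows only the underlying one-dimensional idea: bitonal scan lines are mostly long runs of white pixels, so storing run lengths is far more compact than storing every pixel. It is a sketch of the principle, not Group 4 itself.

```python
# CCITT Group 4 is a two-dimensional run-length scheme; this toy encoder
# illustrates only the one-dimensional idea behind it. A bitonal scan
# line is mostly long runs of white, so run lengths are far smaller
# than the raw pixels. This is a sketch, not the Group 4 algorithm.
def run_lengths(scanline):
    """Encode a bitonal scan line (0 = white, 1 = black) as run lengths,
    starting with a white run, as the fax conventions do."""
    runs, current, count = [], 0, 0
    for pixel in scanline:
        if pixel == current:
            count += 1
        else:
            runs.append(count)
            current, count = pixel, 1
    runs.append(count)
    return runs

line = [0] * 40 + [1] * 5 + [0] * 55   # 100 pixels: white, a black stroke, white
print(run_lengths(line))  # → [40, 5, 55], three numbers instead of 100 pixels
```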
ZIDAR emphasized the importance of de-skewing and filtering as requirements on NATDP's upgraded system. For instance, scanning bound books, particularly books published by the federal government whose pages are skewed, and trying to scan them straight if OCR is to be performed, is extremely time-consuming. The same holds for filtering of poor-quality or older materials.
ZIDAR described image capture from microform, using as an example three reels from a sixty-seven-reel set of the papers and letters of George Washington Carver that had been produced by Tuskegee University. These resulted in approximately 3,500 images, which NATDP had scanned by its service contractor, Science Applications International Corporation (SAIC), since NATDP did not have such specialized equipment as a microfilm scanner. NATDP also created bibliographic records for access.
Unfortunately, the process of scanning from microfilm was not an unqualified success, ZIDAR reported: because microfilm frame sizes vary, occasionally some frames were missed, which without spending much time and money could not be recaptured.
OCR could not be performed from the scanned images of the frames. When OCR was run, bleeding in the text simply produced output that could not even be edited. NATDP tested negative versus positive images, landscape versus portrait orientation, and single- versus dual-page microfilm; none of these seemed to affect the quality of the image, but on none of them could OCR be performed.
In selecting the microfilm they would use, therefore, NATDP had other factors in mind. ZIDAR noted two factors that influenced the quality of the images: 1) the inherent quality of the original and 2) the amount of size reduction on the pages.
The Carver papers were selected because they are informative and visually interesting, treat a single subject, and are valuable in their own right. The images were scanned and divided into logical records by SAIC, then delivered, and loaded onto NATDP's system, where bibliographic information taken directly from the images was added. Scanning was completed in summer 1991 and by the end of summer 1992 the disk was scheduled to be published.
Problems encountered during processing included the following: Because the microfilm scanning had to be done in a batch, adjustment for individual page variations was not possible. The frame size varied on account of the nature of the material, and therefore some of the frames were missed while others were just partial frames. The only way to go back and capture this material was to print out the page with the microfilm reader from the missing frame and then scan it in from the page, which was extremely time-consuming. The quality of the images scanned from the printout of the microfilm compared unfavorably with that of the original images captured directly from the microfilm. The inability to perform OCR also was a major disappointment. At the time, computer output microfilm was unavailable to test.
The equipment used for a scanning system was the last topic addressed by ZIDAR. The type of equipment one would purchase for a scanning system includes: a microcomputer, at least a 386 but preferably a 486; a large hard disk, 380 megabytes at minimum; a multitasking operating system that allows one to run some things in batch in the background while scanning or editing text, for example, Unix or OS/2 and, theoretically, Windows; a high-speed scanner and scanning software that allows one to make the various adjustments mentioned earlier; a high-resolution monitor (150 dpi); OCR software and hardware to perform text recognition; an optical disk subsystem on which to archive all the images as the processing is done; and file management and tracking software.
ZIDAR opined that the software one purchases was more important than the hardware and might also cost more than the hardware, but it was likely to prove critical to the success or failure of one's system. In addition to a stand-alone scanning workstation for image capture, then, text capture requires one or two editing stations networked to this scanning station to perform editing. Editing the text takes two or three times as long as capturing the images.
Finally, ZIDAR stressed the importance of buying an open system that allows for more than one vendor, complies with standards, and can be upgraded.
******
+++++++++++++ WATERS *Yale University Library's master plan to convert microfilm to digital imagery (POB) * The place of electronic tools in the library of the future * The uses of images and an image library * Primary input from preservation microfilm * Features distinguishing POB from CXP and key hypotheses guiding POB * Use of vendor selection process to facilitate organizational work * Criteria for selecting vendor * Finalists and results of process for Yale * Key factor distinguishing vendors * Components, design principles, and some estimated costs of POB * Role of preservation materials in developing imaging market * Factors affecting quality and cost * Factors affecting the usability of complex documents in image form * +++++++++++++
Donald WATERS, head of the Systems Office, Yale University Library, reported on the progress of a master plan for a project at Yale to convert microfilm to digital imagery, Project Open Book (POB). Stating that POB was in an advanced stage of planning, WATERS detailed, in particular, the process of selecting a vendor partner and several key issues under discussion as Yale prepares to move into the project itself. He commented first on the vision that serves as the context of POB and then described its purpose and scope.
WATERS sees the library of the future not necessarily as an electronic library but as a place that generates, preserves, and improves for its clients ready access to both intellectual and physical recorded knowledge. Electronic tools must find a place in the library in the context of this vision. Several roles for electronic tools include serving as: indirect sources of electronic knowledge or as "finding" aids (the on-line catalogues, the article-level indices, registers for documents and archives); direct sources of recorded knowledge; full-text images; and various kinds of compound sources of recorded knowledge (the so-called compound documents of Hypertext, mixed text and image, mixed-text image format, and multimedia).
POB is looking particularly at images and an image library: the uses to which images will be put (e.g., storage, printing, browsing, and then use as input for other processes), OCR as a process subsequent to image capture or to the creation of an image library, and also possibly the generation of microfilm.
While input will come from a variety of sources, POB is considering especially input from preservation microfilm. A possible outcome is that the film and paper which provide the input for the image library eventually may go off into remote storage, and that the image library may be the primary access tool.
The purpose and scope of POB focus on imaging. Though related to CXP, POB has two features which distinguish it: 1) scale—conversion of 10,000 volumes into digital image form; and 2) source—conversion from microfilm. Given these features, several key working hypotheses guide POB, including: 1) Since POB is using microfilm, it is not concerned with the image library as a preservation medium. 2) Digital imagery can improve access to recorded knowledge through printing and network distribution at a modest incremental cost over microfilm. 3) Capturing and storing documents in digital image form is necessary to further improvements in access. (POB distinguishes between the imaging, or digitizing, process and OCR, which at this stage it does not plan to perform.)
Currently in its first or organizational phase, POB found that it could use a vendor selection process to facilitate a good deal of the organizational work (e.g., creating a project team and advisory board, confirming the validity of the plan, establishing the cost of the project and a budget, selecting the materials to convert, and then raising the necessary funds).
POB developed numerous selection criteria, including: a firm committed to image-document management, the ability to serve as systems integrator in a large-scale project over several years, interest in developing the requisite software as a standard rather than a custom product, and a willingness to invest substantial resources in the project itself.
Two vendors, DEC and Xerox, were selected as finalists in October 1991, and with the support of the Commission on Preservation and Access, each was commissioned to generate a detailed requirements analysis for the project and then to submit a formal proposal for the completion of the project, which included a budget and costs. The terms were that POB would pay the loser. The results for Yale of involving a vendor included: broad involvement of Yale staff across the board at a relatively low cost, which may have long-term significance in carrying out the project (twenty-five to thirty university people are engaged in POB); better understanding of the factors that affect corporate response to markets for imaging products; a competitive proposal; and a more sophisticated view of the imaging markets.
The most important factor that distinguished the vendors under consideration was their identification with the customer. The size and internal complexity of the company also was an important factor. POB was looking at large companies that had substantial resources. In the end, the process generated for Yale two competitive proposals, with Xerox's the clear winner. WATERS then described the components of the proposal, the design principles, and some of the costs estimated for the process.
Components are essentially four: a conversion subsystem, a network-accessible storage subsystem for 10,000 books (and POB expects 200 to 600 dpi storage), browsing stations distributed on the campus network, and network access to the image printers.
Among the design principles, POB wanted conversion at the highest possible resolution. Assuming TIFF files, TIFF files with Group 4 compression, TCP/IP, and ethernet network on campus, POB wanted a client-server approach with image documents distributed to the workstations and made accessible through native workstation interfaces such as Windows. POB also insisted on a phased approach to implementation: 1) a stand-alone, single-user, low-cost entry into the business with a workstation focused on conversion and allowing POB to explore user access; 2) movement into a higher-volume conversion with network-accessible storage and multiple access stations; and 3) a high-volume conversion, full-capacity storage, and multiple browsing stations distributed throughout the campus.
The costs proposed for start-up assumed the existence of the Yale network and its two DocuTech image printers. Other start-up costs are estimated at $1 million over the three phases. At the end of the project, the annual operating costs estimated primarily for the software and hardware proposed come to about $60,000, but these exclude costs for labor needed in the conversion process, network and printer usage, and facilities management.
Finally, the selection process produced for Yale a more sophisticated view of the imaging markets: the management of complex documents in image form is not a preservation problem, not a library problem, but a general problem in a broad, general industry. Preservation materials are useful for developing that market because of the qualities of the material. For example, much of it is out of copyright. The resolution of key issues such as the quality of scanning and image browsing also will affect development of that market.
The technology is readily available but changing rapidly. In this context of rapid change, several factors affect quality and cost, to which POB intends to pay particular attention, for example, the various levels of resolution that can be achieved. POB believes it can bring resolution up to 600 dpi, but an interpolation process from 400 to 600 dpi is more likely. The variation in the quality of microfilm will prove to be a highly important factor. POB may reexamine the standards used to film in the first place by looking at this process as a follow-on to microfilming.
Other important factors include: the techniques available to the operator for handling material, the ways of integrating quality control into the digitizing work flow, and a work flow that includes indexing and storage. POB's requirement was to be able to deal with quality control at the point of scanning. Thus, thanks to Xerox, POB anticipates having a mechanism which will allow it not only to scan in batch form, but to review the material as it goes through the scanner and control quality from the outset.
The standards for measuring quality and costs depend greatly on the uses of the material, including subsequent OCR, storage, printing, and browsing. But especially at issue for POB is the facility for browsing. This facility, WATERS said, is perhaps the weakest aspect of imaging technology and the most in need of development.
A variety of factors affect the usability of complex documents in image form, among them: 1) the ability of the system to handle the full range of document types, not just monographs but serials, multi-part monographs, and manuscripts; 2) the location of the database of record for bibliographic information about the image document, which POB wants to enter once and in the most useful place, the on-line catalog; 3) a document identifier for referencing the bibliographic information in one place and the images in another; 4) the technique for making the basic internal structure of the document accessible to the reader; and finally, 5) the physical presentation on the CRT of those documents. POB is ready to complete this phase now. One last decision involves deciding which material to scan.
******
+++++++++++++ DISCUSSION * TIFF files constitute de facto standard * NARA's experience with image conversion software and text conversion * RFC 1314 * Considerable flux concerning available hardware and software solutions * NAL through-put rate during scanning * Window management questions * +++++++++++++
In the question-and-answer period that followed WATERS's presentation, the following points emerged:
* ZIDAR's statement about using TIFF files as a standard meant de facto standard. This is what most people use and typically exchange with other groups, across platforms, or even occasionally across display software.
* HOLMES commented on the unsuccessful experience of NARA in attempting to run image-conversion software or to exchange between applications: files that are supposedly TIFF go into other software that is supposed to accept TIFF but cannot recognize the format and cannot deal with it, thus rendering the exchange useless. Re text conversion, he noted the different recognition rates obtained by substituting the make and model of scanners in NARA's recent test of an "intelligent" character-recognition product from a new company. In the selection of hardware and software, HOLMES argued, software no longer constitutes the overriding factor it did until about a year ago; rather, it is now important to look at both.
* Danny Cohen and Alan Katz of the University of Southern California Information Sciences Institute began circulating as an Internet RFC (RFC 1314) about a month ago a standard for a TIFF interchange format for Internet distribution of monochrome bit-mapped images, which LYNCH said he believed would be used as a de facto standard.
* FLEISCHHAUER's impression from hearing these reports and thinking about AM's experience was that there is considerable flux concerning available hardware and software solutions. HOOTON agreed and commented at the same time on ZIDAR's statement that the equipment employed affects the results produced. One cannot draw a complete conclusion by saying it is difficult or impossible to perform OCR from scanning microfilm, for example, with that device, that set of parameters, and system requirements, because numerous other people are accomplishing just that, using other components, perhaps. HOOTON opined that both the hardware and the software were highly important. Most of the problems discussed today have been solved in numerous different ways by other people. Though it is good to be cognizant of various experiences, this is not to say that it will always be thus.
* At NAL, the through-put rate of the scanning process for paper, page by page, performing OCR, ranges from 300 to 600 pages per day; not performing OCR is considerably faster, although how much faster is not known. This is for scanning from bound books, which is much slower.
* WATERS commented on window management questions: DEC proposed an X-Windows solution which was problematical for two reasons. One was POB's requirement to be able to manipulate images on the workstation and bring them down to the workstation itself and the other was network usage.
******
+++++++++++++ THOMA * Illustration of deficiencies in scanning and storage process * Image quality in this process * Different costs entailed by better image quality * Techniques for overcoming various deficiencies: fixed thresholding, dynamic thresholding, dithering, image merge * Page edge effects * +++++++++++++
George THOMA, chief, Communications Engineering Branch, National Library of Medicine (NLM), illustrated several of the deficiencies discussed by the previous speakers. He introduced the topic of special problems by noting the advantages of electronic imaging. For example, it is regenerable because it is a coded file, and real-time quality control is possible with electronic capture, whereas in photographic capture it is not.
One of the difficulties discussed in the scanning and storage process was image quality which, without belaboring the obvious, means different things for maps, medical X-rays, or broadcast television. In the case of documents, THOMA said, image quality boils down to legibility of the textual parts, and fidelity in the case of gray or color photo print-type material. Legibility boils down to scan density, the standard in most cases being 300 dpi. Increasing the resolution with scanners that perform 600 or 1200 dpi, however, comes at a cost.
Better image quality entails at least four different kinds of costs: 1) equipment costs, because a CCD (i.e., charge-coupled device) with a greater number of elements costs more; 2) time costs, which translate into actual capture costs, because manual labor is involved (the time also depends on the fact that more data has to be moved around in the scanning and network devices as well as in storage); 3) media costs, because at high resolutions larger files have to be stored; and 4) transmission costs, because there is simply more data to be transmitted.
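The media and transmission costs above follow directly from file size: for a bitonal (1-bit) scan, uncompressed size grows with the square of the resolution. A quick sketch, assuming a letter-size page (the page dimensions and the loop are illustrative):

```python
# Why higher resolution raises media and transmission costs: for a
# bitonal (1-bit) scan, uncompressed size grows with the square of the
# resolution. Letter-size page dimensions are assumed for illustration.
def uncompressed_bytes(width_in, height_in, dpi, bits_per_pixel=1):
    """Uncompressed image size in bytes for a scanned page."""
    return width_in * dpi * height_in * dpi * bits_per_pixel // 8

for dpi in (300, 600, 1200):
    mb = uncompressed_bytes(8.5, 11, dpi) / 1_000_000
    print(f"{dpi} dpi: {mb:.1f} MB per page")
```

Doubling the resolution from 300 to 600 dpi quadruples the raw file size, which is why compression such as CCITT Group 4 matters so much at higher scan densities.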
But while resolution takes care of the issue of legibility in image quality, other deficiencies have to do with contrast and elements on the page scanned or the image that needed to be removed or clarified. Thus, THOMA proceeded to illustrate various deficiencies, how they are manifested, and several techniques to overcome them.
Fixed thresholding was the first technique described, suitable for black-and-white text, when the contrast does not vary over the page. One can have many different threshold levels in scanning devices. Thus, THOMA offered an example of extremely poor contrast, which resulted from the fact that the stock was a heavy red. This is the sort of image that when microfilmed fails to provide any legibility whatsoever. Fixed thresholding is the way to change the black-to-red contrast to the desired black-to-white contrast.
Other examples included material that had been browned or yellowed by age. This was also a case of contrast deficiency, and correction was done by fixed thresholding. A final example presented the same problem, slight variability in contrast, but it was not significant; fixed thresholding solved it as well. The microfilm equivalent is certainly legible, but it comes with dark areas. Though THOMA did not have a slide of the microfilm in this case, he did show the reproduced electronic image.
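The fixed-thresholding scheme THOMA described reduces to a very small computation. The following is a minimal illustrative sketch, not NLM's actual software, assuming an 8-bit grayscale page held as a list of pixel rows:

```python
# Fixed thresholding: one global cutoff for the whole page, suitable
# when the contrast does not vary over the page.
def fixed_threshold(image, threshold=128):
    """Map an 8-bit grayscale image (a list of pixel rows) to black/white.

    Pixels at or above the threshold become white (255); the rest black (0).
    """
    return [[255 if px >= threshold else 0 for px in row] for row in image]

# A toy page of the "heavy red stock" sort: light background values
# with dark ink values; ink maps to 0, background to 255.
page = [[200, 190, 40],
        [185, 30, 35],
        [210, 200, 195]]
print(fixed_threshold(page))
```

Because a single threshold applies to the whole page, the technique works only when contrast is uniform over the page.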
When one has variable contrast over a page or the lighting over the page area varies, especially in the case where a bound volume has light shining on it, the image must be processed by a dynamic thresholding scheme. One scheme, dynamic averaging, allows the threshold level not to be fixed but to be recomputed for every pixel from the neighboring characteristics. The neighbors of a pixel determine where the threshold should be set for that pixel.
THOMA showed an example of a page that had been made deficient by a variety of techniques, including a burn mark, coffee stains, and a yellow marker. Application of a fixed-thresholding scheme, THOMA argued, might take care of several deficiencies on the page but not all of them. Performing the calculation for a dynamic threshold setting, however, removes most of the deficiencies so that at least the text is legible.
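A dynamic-averaging scheme of the kind THOMA described might be sketched as follows. This is an illustrative toy, assuming each pixel's threshold is set from the mean of a small neighborhood; the `radius` and `bias` values are arbitrary assumptions, not figures from the talk:

```python
def dynamic_threshold(image, radius=1, bias=10):
    """Binarize with a per-pixel threshold taken from the local mean.

    Each pixel is compared against the mean of its neighbors, so a page
    whose lighting or contrast varies (e.g., a bound volume under a
    lamp) is still separated into ink and background.
    """
    h, w = len(image), len(image[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            # Gather the neighborhood around (x, y), clipped at the page edges.
            neighbors = [image[j][i]
                         for j in range(max(0, y - radius), min(h, y + radius + 1))
                         for i in range(max(0, x - radius), min(w, x + radius + 1))]
            local_mean = sum(neighbors) / len(neighbors)
            # A pixel clearly darker than its surroundings is treated as ink.
            row.append(0 if image[y][x] < local_mean - bias else 255)
        out.append(row)
    return out

# A "stained" page: the background level varies, but ink is darker
# than its own surroundings, so only the ink pixel comes out black.
stained = [[100, 100, 100],
           [100, 40, 100],
           [60, 60, 60]]
print(dynamic_threshold(stained))
```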
Another problem is representing a gray level with black-and-white pixels, by a process known as dithering or electronic screening. Dithering is well suited to photoprint material, which is the reason it is used, but it does not provide good image quality for pure black-and-white textual material, and it cannot be applied to every compound image. THOMA illustrated this point with examples: in the document that was distributed by CXP, the dithered image of the IEEE test chart evinced some deterioration in the text. He then presented an extreme example of deterioration in which compound documents had to be set right by other techniques. The technique illustrated in this example was an image merge, in which the page is scanned twice, once with a fixed threshold and once with the dithering matrix; the resulting images are merged to give the best results of each technique.
THOMA illustrated how dithering is also used in nonphotographic or nonprint materials with an example of a grayish page from a medical text, which was reproduced to show all of the gray that appeared in the original. Dithering provided a reproduction of all the gray in the original of another example from the same text.
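Ordered dithering, one common form of the electronic screening THOMA discussed, can be sketched with the classic 2x2 Bayer matrix. The matrix size here is an illustrative assumption; production devices use larger matrices:

```python
# Classic 2x2 Bayer matrix; real scanners use larger dither matrices.
BAYER_2X2 = [[0, 2],
             [3, 1]]

def ordered_dither(image):
    """Render gray levels with black/white pixels (electronic screening).

    Each pixel is compared against a position-dependent threshold tiled
    from the matrix, so uniform gray areas become patterns of dots that
    the eye averages back to gray.
    """
    out = []
    for y, row in enumerate(image):
        out_row = []
        for x, px in enumerate(row):
            # Scale the matrix entry into the 0-255 pixel range.
            threshold = (BAYER_2X2[y % 2][x % 2] + 0.5) * 255 / 4
            out_row.append(255 if px >= threshold else 0)
        out.append(out_row)
    return out

# A uniform mid-gray patch dithers to half white, half black pixels.
print(ordered_dither([[128, 128],
                      [128, 128]]))
```

Applied to crisp text rather than gray material, the same position-dependent thresholds break up character edges, which is the deterioration THOMA noted.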
THOMA finally illustrated the problem of bordering, or page-edge, effects. Books and bound volumes that are placed on a photocopy machine or a scanner produce page-edge effects that are undesirable for two reasons: 1) the aesthetics of the image; after all, if the image is to be preserved, one does not necessarily want to keep all of its deficiencies; 2) compression (with the bordering problem THOMA illustrated, the compression ratio deteriorated tremendously). One way to eliminate this more serious problem is to have the operator at the point of scanning window the part of the image that is desirable and automatically turn all of the pixels out of that picture to white.
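The operator windowing THOMA described, turning every pixel outside the selected region white, is equally simple to sketch; the coordinates and page contents below are illustrative assumptions:

```python
def window_to_white(image, top, left, bottom, right):
    """Keep the operator-selected window; force everything outside to white.

    This suppresses dark page-edge borders, improving both the
    aesthetics and the compression ratio of the resulting image.
    """
    return [[px if top <= y < bottom and left <= x < right else 255
             for x, px in enumerate(row)]
            for y, row in enumerate(image)]

# A page whose right and bottom edges scanned black (0):
bordered = [[255, 255, 0],
            [255, 255, 0],
            [0, 0, 0]]
print(window_to_white(bordered, 0, 0, 2, 2))
```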
******
+++++++++++++ FLEISCHHAUER * AM's experience with scanning bound materials * Dithering * +++++++++++++
Carl FLEISCHHAUER, coordinator, American Memory, Library of Congress, reported AM's experience with scanning bound materials, which he likened to the problems involved in using photocopying machines. Very few devices in the industry offer book-edge scanning, let alone book cradles. The problem may be unsolvable, FLEISCHHAUER said, because a large enough market does not exist for a preservation-quality scanner. AM is using a Kurzweil scanner, which is a book-edge scanner now sold by Xerox.
Devoting the remainder of his brief presentation to dithering, FLEISCHHAUER related AM's experience with a contractor who was using unsophisticated equipment and software to reduce moire patterns from printed halftones. AM took the same image and used the dithering algorithm that forms part of the same Kurzweil Xerox scanner; it disguised moire patterns much more effectively.
FLEISCHHAUER also observed that dithering produces a binary file which is useful for numerous purposes, for example, printing it on a laser printer without having to "re-halftone" it. But it tends to defeat efficient compression, because the very thing that dithers to reduce moire patterns also tends to work against compression schemes. AM thought the difference in image quality was worth it.
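FLEISCHHAUER's point about compression can be made concrete with run lengths, the basis of schemes such as G3/G4 fax coding. A hypothetical sketch, not AM's software:

```python
def run_lengths(row):
    """Run-length encode a bilevel row (the basis of G3/G4 fax coding)."""
    runs, count = [], 1
    for prev, cur in zip(row, row[1:]):
        if cur == prev:
            count += 1
        else:
            runs.append(count)
            count = 1
    runs.append(count)
    return runs

solid = [255] * 8                              # a plain white stretch
dithered = [255, 0, 255, 0, 255, 0, 255, 0]    # a dither pattern
print(len(run_lengths(solid)))     # one long run: compresses well
print(len(run_lengths(dithered)))  # eight short runs: compression defeated
```

The alternating dots that disguise moire patterns break long uniform runs into many short ones, which is why the dithered binary file compresses poorly.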
******
+++++++++++++ DISCUSSION * Relative use as a criterion for POB's selection of books to be converted into digital form * +++++++++++++
During the discussion period, WATERS noted that one of the criteria for selecting books among the 10,000 to be converted into digital image form would be how much relative use they would receive—a subject still requiring evaluation. One approach would be to digitize materials that are already heavily used, in order to make them more accessible and decrease wear on them. Another approach would be to provide a large body of intellectually coherent material that may be used more in digital form than it is currently used in microfilm. In either case, POB would seek material that was out of copyright.
******
+++++++++++++ BARONAS * Origin and scope of AIIM * Types of documents produced in AIIM's standards program * Domain of AIIM's standardization work * AIIM's structure * TC 171 and MS23 * Electronic image management standards * Categories of EIM standardization where AIIM standards are being developed * +++++++++++++
Jean BARONAS, senior manager, Department of Standards and Technology, Association for Information and Image Management (AIIM), described the not-for-profit association and the national and international programs for standardization in which AIIM is active.
Accredited for twenty-five years as the nation's standards development organization for document image management, AIIM began life in a library community developing microfilm standards. Today the association maintains both its library and business-image management standardization activities—and has moved into electronic image-management standardization (EIM).
BARONAS defined the program's scope. AIIM deals with: 1) the terminology of standards and of the technology it uses; 2) methods of measurement for the systems, as well as quality; 3) methodologies for users to evaluate and measure quality; 4) the features of apparatus used to manage and edit images; and 5) the procedures used to manage images.
BARONAS noted that three types of documents are produced in the AIIM standards program: the first two, accredited by the American National Standards Institute (ANSI), are standards and standard recommended practices. Recommended practices differ from standards in that they contain more tutorial information. A technical report is not an ANSI standard. Because AIIM's policies and procedures for developing standards are approved by ANSI, its standards are labeled ANSI/AIIM, followed by the number and title of the standard.
BARONAS then illustrated the domain of AIIM's standardization work. For example, AIIM is the administrator of the U.S. Technical Advisory Group (TAG) to the International Standards Organization's (ISO) technical committee, TC 171, Micrographics and Optical Memories for Document and Image Recording, Storage, and Use. AIIM officially works through ANSI in the international standardization process.
BARONAS described AIIM's structure, including its board of directors, its standards board of twelve individuals active in the image-management industry, its strategic planning and legal admissibility task forces, and its National Standards Council, which is composed of the members of a number of organizations who vote on every AIIM standard before it is published. BARONAS pointed out that AIIM's liaisons deal with numerous other standards developers, including the optical disk community, office and publishing systems, image-codes-and-character set committees, and the National Information Standards Organization (NISO).
BARONAS illustrated the procedures of TC 171, which covers all aspects of image management. When AIIM's national program has conceptualized a new project, it is usually submitted to the international level, so that the member countries of TC 171 can simultaneously work on the development of the standard or the technical report. BARONAS also illustrated a classic microfilm standard, MS23, which deals with numerous imaging concepts that apply to electronic imaging. Originally developed in the 1970s, revised in the 1980s, and revised again in 1991, this standard is scheduled for another revision. MS23 is an active standard whereby users may propose new density ranges and new methods of evaluating film images in the standard's revision.
BARONAS detailed several electronic image-management standards, for instance, ANSI/AIIM MS44, a quality-control guideline for scanning 8.5" by 11" black-and-white office documents. This standard is used with the IEEE fax image—a continuous tone photographic image with gray scales, text, and several continuous tone pictures—and AIIM test target number 2, a representative document used in office document management.
BARONAS next outlined the four categories of EIM standardization in which AIIM standards are being developed: transfer and retrieval, evaluation, optical disk and document scanning applications, and design and conversion of documents. She detailed several of the main projects of each: 1) in the category of image transfer and retrieval, a bi-level image transfer format, ANSI/AIIM MS53, which is a proposed standard that describes a file header for image transfer between unlike systems when the images are compressed using G3 and G4 compression; 2) the category of image evaluation, which includes the AIIM-proposed TR26 tutorial on image resolution (this technical report will treat the differences and similarities between classical or photographic and electronic imaging); 3) design and conversion, which includes a proposed technical report called "Forms Design Optimization for EIM" (this report considers how general-purpose business forms can be best designed so that scanning is optimized; reprographic characteristics such as type, rules, background, tint, and color will likewise be treated in the technical report); 4) optical disk and document scanning applications, which includes a project a) on planning platters and disk management, b) on generating an application profile for EIM when images are stored and distributed on CD-ROM, and c) on evaluating SCSI2, and how a common command set can be generated for SCSI2 so that document scanners are more easily integrated. (ANSI/AIIM MS53 will also apply to compressed images.)
******
+++++++++++++ BATTIN * The implications of standards for preservation * A major obstacle to successful cooperation * A hindrance to access in the digital environment * Standards a double-edged sword for those concerned with the preservation of the human record * Near-term prognosis for reliable archival standards * Preservation concerns for electronic media * Need for reconceptualizing our preservation principles * Standards in the real world and the politics of reproduction * Need to redefine the concept of archival and to begin to think in terms of life cycles * Cooperation and the La Guardia Eight * Concerns generated by discussions on the problems of preserving text and image * General principles to be adopted in a world without standards * +++++++++++++
Patricia BATTIN, president, the Commission on Preservation and Access (CPA), addressed the implications of standards for preservation. She listed several areas where the library profession and the analog world of the printed book had made enormous contributions over the past hundred years—for example, in bibliographic formats, binding standards, and, most important, in determining what constitutes longevity or archival quality.
Although standards have lightened the preservation burden through the development of national and international collaborative programs, nevertheless, a pervasive mistrust of other people's standards remains a major obstacle to successful cooperation, BATTIN said.
The zeal to achieve perfection, regardless of the cost, has hindered rather than facilitated access in some instances, and in the digital environment, where no real standards exist, has brought an ironically just reward.
BATTIN argued that standards are a double-edged sword for those concerned with the preservation of the human record, that is, the provision of access to recorded knowledge in a multitude of media as far into the future as possible. Standards are essential to facilitate interconnectivity and access, but, BATTIN said, as LYNCH pointed out yesterday, if set too soon they can hinder creativity, expansion of capability, and the broadening of access. The characteristics of standards for digital imagery differ radically from those for analog imagery. And the nature of digital technology implies continuing volatility and change. To reiterate, precipitous standard-setting can inhibit creativity, but delayed standard-setting results in chaos.
Since in BATTIN'S opinion the near-term prognosis for reliable archival standards, as defined by librarians in the analog world, is poor, two alternatives remain: standing pat with the old technology, or reconceptualizing.
Preservation concerns for electronic media fall into two general domains. One is the continuing assurance of access to knowledge originally generated, stored, disseminated, and used in electronic form. This domain contains several subdivisions, including 1) the closed, proprietary systems discussed the previous day, bundled information such as electronic journals and government agency records, and electronically produced or captured raw data; and 2) the application of digital technologies to the reformatting of materials originally published on a deteriorating analog medium such as acid paper or videotape.
The preservation of electronic media requires a reconceptualizing of our preservation principles during a volatile, standardless transition which may last far longer than any of us envision today. BATTIN urged the necessity of shifting focus from assessing, measuring, and setting standards for the permanence of the medium to the concept of managing continuing access to information stored on a variety of media and requiring a variety of ever-changing hardware and software for access—a fundamental shift for the library profession.
BATTIN offered a primer on how to move forward with reasonable confidence in a world without standards. Her comments fell roughly into two sections: 1) standards in the real world and 2) the politics of reproduction.
In regard to real-world standards, BATTIN argued the need to redefine the concept of archive and to begin to think in terms of life cycles. In the past, the naive assumption that paper would last forever produced a cavalier attitude toward life cycles. The transient nature of the electronic media has compelled people to recognize and accept upfront the concept of life cycles in place of permanency.
Digital standards have to be developed and set in a cooperative context to ensure efficient exchange of information. Moreover, during this transition period, greater flexibility concerning how concepts such as backup copies and archival copies in the CXP are defined is necessary, or the opportunity to move forward will be lost.
In terms of cooperation, particularly in the university setting, BATTIN also argued the need to avoid going off in a hundred different directions. The CPA has catalyzed a small group of universities called the La Guardia Eight—because La Guardia Airport is where meetings take place—Harvard, Yale, Cornell, Princeton, Penn State, Tennessee, Stanford, and USC, to develop a digital preservation consortium to look at all these issues and develop de facto standards as we move along, instead of waiting for something that is officially blessed. Continuing to apply analog values and definitions of standards to the digital environment, BATTIN said, will effectively lead to forfeiture of the benefits of digital technology to research and scholarship.
Under the second rubric, the politics of reproduction, BATTIN reiterated an oft-made argument concerning the electronic library, namely, that it is more difficult to transform than to create, and nowhere is that belief expressed more dramatically than in the conversion of brittle books to new media. Preserving information published in electronic media involves making sure the information remains accessible and that digital information is not lost through reproduction. In the analog world of photocopies and microfilm, the issue of fidelity to the original becomes paramount, as do issues of "Whose fidelity?" and "Whose original?"
BATTIN elaborated these arguments with a few examples from a recent study conducted by the CPA on the problems of preserving text and image. Discussions with scholars, librarians, and curators in a variety of disciplines dependent on text and image generated a variety of concerns, for example: 1) Copy what is, not what the technology is capable of. This is very important for the history of ideas. Scholars wish to know what the author saw and worked from; but make available at the workstation the opportunity to erase all the defects and enhance the presentation. 2) The fidelity of reproduction—what is good enough, what can we afford, and the difference it makes—issues of subjective versus objective resolution. 3) The differences between primary and secondary users. Restricting the definition of primary user to the one in whose discipline the material has been published runs one headlong into the reality that these printed books have had a host of other users from a host of other disciplines, who not only were looking for very different things, but who also shared values very different from those of the primary user. 4) The relationship of the standard of reproduction to new capabilities of scholarship—the browsing standard versus an archival standard. How good must the archival standard be? Can a distinction be drawn between potential users in setting standards for reproduction? Archival storage, use copies, browsing copies—ought an attempt to set standards even be made? 5) Finally, costs. How much are we prepared to pay to capture absolute fidelity? What are the trade-offs between vastly enhanced access, degrees of fidelity, and costs?
These standards, BATTIN concluded, serve to complicate further the reproduction process, and add to the long list of technical standards that are necessary to ensure widespread access. Ways to articulate and analyze the costs that are attached to the different levels of standards must be found.
Given the chaos concerning standards, which promises to linger for the foreseeable future, BATTIN urged adoption of the following general principles:
* Strive to understand the changing information requirements of scholarly disciplines as more and more technology is integrated into the process of research and scholarly communication in order to meet future scholarly needs, not to build for the past. Capture deteriorating information at the highest affordable resolution, even though the dissemination and display technologies will lag.
* Develop cooperative mechanisms to foster agreement on protocols for document structure and other interchange mechanisms necessary for widespread dissemination and use before official standards are set.
* Accept that, in a transition period, de facto standards will have to be developed.
* Capture information in a way that keeps all options open and provides for total convertibility: OCR, scanning of microfilm, producing microfilm from scanned documents, etc.
* Work closely with the generators of information and the builders of networks and databases to ensure that continuing accessibility is a primary concern from the beginning.
* Piggyback on standards under development for the broad market, and avoid library-specific standards; work with the vendors, in order to take advantage of that which is being standardized for the rest of the world.
* Concentrate efforts on managing permanence in the digital world, rather than perfecting the longevity of a particular medium.
******
+++++++++++++ DISCUSSION * Additional comments on TIFF * +++++++++++++
During the brief discussion period that followed BATTIN's presentation, BARONAS explained that TIFF was not developed in collaboration with or under the auspices of AIIM. TIFF is a company product, not a standard, is owned by two corporations, and is always changing. BARONAS also observed that ANSI/AIIM MS53, a bi-level image file transfer format that allows unlike systems to exchange images, is compatible with TIFF as well as with DEC's architecture and IBM's MODCA/IOCA.
******
+++++++++++++ HOOTON * Several questions to be considered in discussing text conversion * +++++++++++++
HOOTON introduced the final topic, text conversion, by noting that it is becoming an increasingly important part of the imaging business. Many people now realize that it enhances their system to be able to have more and more character data as part of their imaging system. Re the issue of OCR versus rekeying, HOOTON posed several questions: How does one get text into computer-readable form? Does one use automated processes? Does one attempt to eliminate the use of operators where possible? Standards for accuracy, he said, are extremely important: it makes a major difference in cost and time whether one sets as a standard 98.5 percent acceptance or 99.5 percent. He mentioned outsourcing as a possibility for converting text. Finally, what one does with the image to prepare it for the recognition process is also important, he said, because such preparation changes how recognition is viewed, as well as facilitates recognition itself.
******
+++++++++++++ LESK * Roles of participants in CORE * Data flow * The scanning process * The image interface * Results of experiments involving the use of electronic resources and traditional paper copies * Testing the issue of serendipity * Conclusions * +++++++++++++
Michael LESK, executive director, Computer Science Research, Bell Communications Research, Inc. (Bellcore), discussed the Chemical Online Retrieval Experiment (CORE), a cooperative project involving Cornell University, OCLC, Bellcore, and the American Chemical Society (ACS).
LESK spoke on 1) how the scanning was performed, including the unusual feature of page segmentation, and 2) the use made of the text and the image in experiments.
Working with the chemistry journals (because ACS has been saving its typesetting tapes since the mid-1970s and thus has a significant back-run of the most important chemistry journals in the United States), CORE is attempting to create an automated chemical library. Approximately a quarter of the pages by square inch are made up of images of quasi-pictorial material; dealing with the graphic components of the pages is extremely important. LESK described the roles of participants in CORE: 1) ACS provides copyright permission, journals on paper, journals on microfilm, and some of the definitions of the files; 2) at Bellcore, LESK chiefly performs the data preparation, while Dennis Egan performs experiments on the users of chemical abstracts, and supplies the indexing and numerous magnetic tapes; 3) Cornell provides the site of the experiment; 4) OCLC develops retrieval software and other user interfaces. Various manufacturers and publishers have furnished other help.
Concerning data flow, Bellcore receives microfilm and paper from ACS; the microfilm is scanned by outside vendors, while the paper is scanned inhouse on an Improvision scanner, twenty pages per minute at 300 dpi, which provides sufficient quality for all practical uses. LESK would prefer to have more gray level, because one of the ACS journals prints on some colored pages, which creates a problem.
Bellcore performs all this scanning, creates a page-image file, and also selects from the pages the graphics, to mix with the text file (which is discussed later in the Workshop). The user is always searching the ASCII file, but she or he may see a display based on the ASCII or a display based on the images.
LESK illustrated how the program performs page analysis, and the image interface. (The user types several words, is presented with a list—usually of the titles of articles contained in an issue—that derives from the ASCII, clicks on an icon, and receives an image that mirrors an ACS page.) LESK also illustrated an alternative interface based on the ASCII text, the so-called SuperBook interface from Bellcore.
LESK next presented the results of an experiment conducted by Dennis Egan and involving thirty-six students at Cornell, one third of them undergraduate chemistry majors, one third senior undergraduate chemistry majors, and one third graduate chemistry students. A third of them received the paper journals, the traditional paper copies and chemical abstracts on paper. A third received image displays of the pictures of the pages, and a third received the text display with pop-up graphics.
The students were given several questions made up by some chemistry professors. The questions fell into five classes, ranging from very easy to very difficult, and included questions designed to simulate browsing as well as a traditional information retrieval-type task.
LESK furnished the following results. In the straightforward question search—the question being, what is the phosphorus-oxygen bond distance in hydroxy phosphate?—the students were told that they could take fifteen minutes and then, if they wished, give up. The students with paper took more than fifteen minutes on average, and most of them gave up. The students with either electronic format, text or image, received good scores in reasonable time, hardly ever had to give up, and usually found the right answer.
In the browsing study, the students were given a list of eight topics, told to imagine that an issue of the Journal of the American Chemical Society had just appeared on their desks, and were also told to flip through it and to find topics mentioned in the issue. The average scores were about the same. (The students were told to answer yes or no about whether or not particular topics appeared.) The errors, however, were quite different. The students with paper rarely said that something appeared when it had not. But they often failed to find something actually mentioned in the issue. The computer people found numerous things, but they also frequently said that a topic was mentioned when it was not. (The reason, of course, was that they were performing word searches. They were finding that words were mentioned and they were concluding that they had accomplished their task.)
This question also contained a trick to test the issue of serendipity. The students were given another list of eight topics and instructed, without taking a second look at the journal, to recall how many of this new list of eight topics were in this particular issue. This was an attempt to see if they performed better at remembering what they were not looking for. They all performed about the same, paper or electronics, about 62 percent accurate. In short, LESK said, people were not very good when it came to serendipity, but they were no worse at it with computers than they were with paper.
(LESK gave a parenthetical illustration of the learning curve of students who used SuperBook.)
The students using the electronic systems started off worse than the ones using print, but by the third of the three sessions in the series had caught up to print. As one might expect, electronics provide a much better means of finding what one wants to read; reading speeds, once the object of the search has been found, are about the same.
Almost none of the students could perform the hard task, the analogous transformation. (It would require the expertise of organic chemists to complete.) An interesting result, however, was that the students using the text search system performed terribly, while those using the image system did best. The explanation is that the text search system is driven by text: everything is focused on the text, and to see the pictures one must press on an icon. Many students found the right article containing the answer to the question, but they did not click on the icon to bring up the right figure and see it. They did not know that they had found the right place, and thus got it wrong.
The short answer demonstrated by this experiment was that in the event one does not know what to read, one needs the electronic systems; the electronic systems hold no advantage at the moment if one knows what to read, but neither do they impose a penalty.
LESK concluded by commenting that, on the one hand, the image system was easy to use. On the other hand, the text display system, which represented twenty man-years of programming and polishing, was not winning, because the text was not being read, just searched. The much easier image system is highly competitive as well as remarkably effective for the actual chemists.
******
+++++++++++++ ERWAY * Most challenging aspect of working on AM * Assumptions guiding AM's approach * Testing different types of service bureaus * AM's requirement for 99.95 percent accuracy * Requirements for text-coding * Additional factors influencing AM's approach to coding * Results of AM's experience with rekeying * Other problems in dealing with service bureaus * Quality control the most time-consuming aspect of contracting out conversion * Long-term outlook uncertain * +++++++++++++
To Ricky ERWAY, associate coordinator, American Memory, Library of Congress, the constant variety of conversion projects taking place simultaneously represented perhaps the most challenging aspect of working on AM. Thus, the challenge was not to find a solution for text conversion but a tool kit of solutions to apply to LC's varied collections that need to be converted. ERWAY limited her remarks to the process of converting text to machine-readable form, and the variety of LC's text collections, for example, bound volumes, microfilm, and handwritten manuscripts.
Two assumptions have guided AM's approach, ERWAY said: 1) A desire not to perform the conversion inhouse. Because of the variety of formats and types of texts, to capitalize the equipment and have the talents and skills to operate them at LC would be extremely expensive. Further, the natural inclination to upgrade to newer and better equipment each year made it reasonable for AM to focus on what it did best and seek external conversion services. Using service bureaus also allowed AM to have several types of operations take place at the same time. 2) AM was not a technology project, but an effort to improve access to library collections. Hence, whether text was converted using OCR or rekeying mattered little to AM. What mattered were cost and accuracy of results.
AM considered different types of service bureaus and selected three to perform several small tests in order to acquire a sense of the field. The sample collections with which they worked included handwritten correspondence, typewritten manuscripts from the 1940s, and eighteenth-century printed broadsides on microfilm. On none of these samples was OCR performed; they were all rekeyed. AM had several special requirements for the three service bureaus it had engaged. For instance, any errors in the original text were to be retained. Working from bound volumes or anything that could not be sheet-fed also constituted a factor eliminating companies that would have performed OCR.
AM requires 99.95 percent accuracy, which, though it sounds high, often means one or two errors per page. The initial batch of test samples contained several handwritten materials for which AM did not require text-coding. The results, ERWAY reported, were in all cases fairly comparable: for the most part, all three service bureaus achieved 99.95 percent accuracy. AM was satisfied with the work but surprised at the cost.
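ERWAY's observation that 99.95 percent "often means one or two errors per page" follows directly from the arithmetic, assuming the average page length of 2,700 characters that she cites later for AM's collections:

```python
# 99.95 percent character accuracy, applied to an average page.
# The 2,700-character page length is ERWAY's own figure for AM's
# materials; this is only a check of her claim.
chars_per_page = 2_700
accuracy = 0.9995

errors_per_page = chars_per_page * (1 - accuracy)
print(f"about {errors_per_page:.2f} errors per page")
```

So a page that just meets the requirement may still contain one or two wrong characters.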
As AM began converting whole collections, it retained the requirement for 99.95 percent accuracy and added requirements for text-coding. AM began this work more than three years ago, before LC requirements for SGML applications had been established. Since AM's goal was simply to retain any of the intellectual content represented by the formatting of the document (which would be lost in a straight ASCII conversion), AM used "SGML-like" codes. These codes resembled SGML tags but were used without the benefit of document-type definitions. AM found that many service bureaus were not yet SGML-proficient.
Additional factors influencing the approach AM took with respect to coding included: 1) the inability of any known microcomputer-based user-retrieval software to take advantage of SGML coding; and 2) the multiple inconsistencies in format of the older documents, which confirmed AM in its desire not to force the different formats to conform to a single document-type definition (DTD), since doing so would in practice have required a separate DTD for nearly every document.
The five text collections that AM has converted or is in the process of converting include a collection of eighteenth-century broadsides, a collection of pamphlets, two typescript document collections, and a collection of 150 books.
ERWAY next reviewed the results of AM's experience with rekeying, noting again that because the bulk of AM's materials are historical, the quality of the text often does not lend itself to OCR. While non-English-speaking keyers are less likely to guess, elaborate, or correct typos in the original text, they are also less able to infer what a native reader would, and they are nearly incapable of converting handwritten text. Another disadvantage of working with overseas keyers is that they are much less likely to telephone with questions, especially on the coding, with the result that they develop their own rules as they encounter new situations.
Government contracting procedures and time frames posed a major challenge to performing the conversion. Many service bureaus are not accustomed to retaining the image, even if they perform OCR. Thus, questions of image format and storage media were somewhat novel to many of them. ERWAY also remarked on other problems in dealing with service bureaus, for example, their inability to perform text conversion from the kind of microfilm that LC uses for preservation purposes.
But quality control, in ERWAY's experience, was the most time-consuming aspect of contracting out conversion. AM has been attempting to perform a 10-percent quality review, looking at either every tenth document or every tenth page to make certain that the service bureaus are maintaining 99.95 percent accuracy. But even if they are complying with the requirement for accuracy, finding errors produces a desire to correct them and, in turn, to clean up the whole collection, which defeats the purpose to some extent. Even double keying requires a character-by-character comparison to the original to meet the accuracy requirement. LC is not accustomed to publishing imperfect texts, which makes attempting to deal with the industry standard an emotionally fraught issue for AM. As was mentioned in the previous day's discussion, going from 99.95 to 99.99 percent accuracy usually doubles costs and means a third keying or another complete run-through of the text.
Although AM has learned much from its experiences with various collections and various service bureaus, ERWAY concluded pessimistically that no breakthrough has been achieved. Incremental improvements have occurred in some of the OCR technology, some of the processes, and some of the standards acceptances, which, though they may lead to somewhat lower costs, do not offer much encouragement to many people who are anxiously awaiting the day that the entire contents of LC are available on-line.
******
+++++++++++++ ZIDAR * Several answers to why one attempts to perform full-text conversion * Per page cost of performing OCR * Typical problems encountered during editing * Editing poor copy OCR vs. rekeying * +++++++++++++
Judith ZIDAR, coordinator, National Agricultural Text Digitizing Program (NATDP), National Agricultural Library (NAL), offered several answers to the question of why one attempts to perform full-text conversion: 1) Text in an image can be read by a human but not by a computer, so of course it is not searchable and there is not much one can do with it. 2) Some material simply requires word-level access. For instance, the legal profession insists on full-text access to its material; with taxonomic or geographic material, which entails numerous names, one virtually requires word-level access. 3) Full text permits rapid browsing and searching, something that cannot be achieved in an image with today's technology. 4) Text stored as ASCII and delivered in ASCII is standardized and highly portable. 5) People just want full-text searching, even those who do not know how to do it. NAL, for the most part, is performing OCR at an actual cost per average-size page of approximately $7. NAL scans the page to create the electronic image and passes it through the OCR device.
ZIDAR next rehearsed several typical problems encountered during editing. Praising the celerity of her student workers, ZIDAR observed that editing requires approximately five to ten minutes per page, assuming that there are no large tables to audit. Confusion among the three characters I, 1, and l constitutes perhaps the most common problem encountered. Zeroes and O's also are frequently confused. Double M's create a particular problem, even on clean pages. They are so wide in most fonts that they touch, and the system simply cannot tell where one letter ends and the other begins. Complex page formats occasionally fail to columnate properly, which entails rescanning as though one were working with a single column, entering the ASCII, and decolumnating for better searching. With proportionally spaced text, OCR can have difficulty discerning which spaces fall between words and which merely fall between letters, and therefore will merge text or break up words where it should not.
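The I/1/l and O/0 confusions ZIDAR describes are mechanical enough to sketch in a few lines. The snippet below is an illustration only, with a stand-in word list; NAL's editing was done by student workers, not by software like this.

```python
# Generate candidate corrections for a token by swapping commonly
# confused OCR characters, keeping any candidate found in a lexicon.
# The lexicon here is a hypothetical stand-in.
from itertools import product

CONFUSIONS = {"1": "1Il", "l": "l1I", "I": "I1l", "0": "0O", "O": "O0"}
WORDS = {"Oil", "boil"}  # hypothetical lexicon

def candidates(token):
    options = [CONFUSIONS.get(ch, ch) for ch in token]
    return {"".join(p) for p in product(*options)}

def correct(token):
    hits = candidates(token) & WORDS
    return min(hits) if hits else token  # deterministic pick on ties

print(correct("0i1"))  # -> Oil
```

A real system would also need the frequency information ZIDAR alludes to, since many confusions (e.g., "Ill" versus "111") cannot be resolved by a word list alone.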
ZIDAR said that it can often take longer to edit a poor-copy OCR than to key it from scratch. NAL has also experimented with partial editing of text, whereby project workers go into and clean up the format, removing stray characters but not running a spell-check. NAL corrects typos in the title and authors' names, which provides a foothold for searching and browsing. Even extremely poor-quality OCR (e.g., 60-percent accuracy) can still be searched, because numerous words are correct, while the important words are probably repeated often enough that they are likely to be found correct somewhere. Librarians, however, cannot tolerate this situation, though end users seem more willing to use this text for searching, provided that NAL indicates that it is unedited. ZIDAR concluded that rekeying of text may be the best route to take, in spite of numerous problems with quality control and cost.
******
+++++++++++++ DISCUSSION * Modifying an image before performing OCR * NAL's costs per page * AM's costs per page and experience with Federal Prison Industries * Elements comprising NATDP's costs per page * OCR and structured markup * Distinction between the structure of a document and its representation when put on the screen or printed * +++++++++++++
HOOTON prefaced the lengthy discussion that followed with several comments about modifying an image before one reaches the point of performing OCR. For example, in regard to an application containing a significant amount of redundant data, such as form-type data, numerous companies today are working on various kinds of form removal, prior to going through a recognition process, by using dropout colors. Thus, acquiring access to form design or using electronic means is worth considering. HOOTON also noted that conversion usually makes or breaks one's imaging system. It is extremely important, extremely costly in terms of either capital investment or service, and determines the quality of the remainder of one's system, because it determines the character of the raw material used by the system.
Concerning the four projects undertaken by NAL, two inside and two performed by outside contractors, ZIDAR revealed that an in-house service bureau executed the first at a cost between $8 and $10 per page for everything, including building of the database. The project undertaken by the Consultative Group on International Agricultural Research (CGIAR) cost approximately $10 per page for the conversion, plus some expenses for the software and building of the database. The Acid Rain Project—a two-disk set produced by the University of Vermont, consisting of Canadian publications on acid rain—cost $6.70 per page for everything, including keying of the text, which was double keyed, scanning of the images, and building of the database. The in-house project offered considerable convenience and greater control of the process. On the other hand, the service bureaus know their job and perform it expeditiously, because they have more people.
As a useful comparison, ERWAY revealed AM's costs as follows: $0.75 to $0.85 per thousand characters, with an average page containing 2,700 characters. Requirements for coding and imaging increase the costs. Thus, conversion of the text, including the coding, costs approximately $3 per page. (This figure does not include the imaging and database-building included in the NAL costs.) AM also enjoyed a happy experience with Federal Prison Industries, which precluded the necessity of going through the request-for-proposal process to award a contract, because it is another government agency. The prisoners performed AM's rekeying just as well as other service bureaus and proved handy as well. AM shipped them the books, which they would photocopy on a book-edge scanner. They would perform the markup on photocopies, return the books as soon as they were done with them, perform the keying, and return the material to AM on WORM disks.
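ERWAY's per-page and per-character figures, both taken from the passage above, can be checked against each other:

```python
# AM's rekeying rates: $0.75 to $0.85 per thousand characters,
# with an average page of 2,700 characters (ERWAY's figures).
chars_per_page = 2_700
low_rate, high_rate = 0.75, 0.85   # dollars per 1,000 characters

low = low_rate * chars_per_page / 1_000
high = high_rate * chars_per_page / 1_000
print(f"${low:.2f} to ${high:.2f} per page for keying alone")
```

Keying alone comes to roughly $2.03 to $2.30 per page; the coding requirements account for the remainder of the approximately $3 per page she cites.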
ZIDAR detailed the elements that constitute the previously noted cost of approximately $7 per page. Most significant are the editing, correction of errors, and spell-checking, which, though they may sound easy to perform, in fact require a great deal of time. Reformatting text also takes a while, but a significant amount of NAL's expenses are for equipment, which was extremely expensive when purchased because it was one of the few systems on the market. The costs of equipment are being amortized over five years but are still quite high, nearly $2,000 per month.
HOCKEY raised a general question concerning OCR and the amount of editing required (substantial in her experience) to generate the kind of structured markup necessary for manipulating the text on the computer or loading it into any retrieval system. She wondered if the speakers could extend the previous question about the cost-benefit of adding or inserting structured markup. ERWAY noted that several OCR systems retain italics, bolding, and other spatial formatting. While the material may not be in the format desired, these systems possess the ability to remove the original materials quickly from the hands of the people performing the conversion, as well as to retain that information so that users can work with it. HOCKEY rejoined that the current thinking on markup is that one should not say that something is italic or bold so much as why it is that way. To be sure, one needs to know that something was italicized, but how can one get from one to the other? One can map from the structure to the typographic representation.
FLEISCHHAUER suggested that, given the 100 million items the Library holds, it may not be possible for LC to do more than report that a thing was in italics as opposed to why it was italics, although that may be desirable in some contexts. Promising to talk a bit during the afternoon session about several experiments OCLC performed on automatic recognition of document elements, and which they hoped to extend, WEIBEL said that in fact one can recognize the major elements of a document with a fairly high degree of reliability, at least as good as OCR. STEVENS drew a useful distinction between standard, generalized markup (i.e., defining for a document-type definition the structure of the document), and what he termed a style sheet, which had to do with italics, bolding, and other forms of emphasis. Thus, two different components are at work, one being the structure of the document itself (its logic), and the other being its representation when it is put on the screen or printed.
******
SESSION V. APPROACHES TO PREPARING ELECTRONIC TEXTS
+++++++++++++ HOCKEY * Text in ASCII and the representation of electronic text versus an image * The need to look at ways of using markup to assist retrieval * The need for an encoding format that will be reusable and multifunctional * +++++++++++++
Susan HOCKEY, director, Center for Electronic Texts in the Humanities (CETH), Rutgers and Princeton Universities, announced that one talk (WEIBEL's) was moved into this session from the morning and that David Packard was unable to attend. The session would attempt to focus more on what one can do with a text in ASCII and the representation of electronic text rather than just an image, what one can do with a computer that cannot be done with a book or an image. It would be argued that one can do much more than just read a text, and from that starting point one can use markup and methods of preparing the text to take full advantage of the capability of the computer. That would lead to a discussion of what the European Community calls REUSABILITY, what may better be termed DURABILITY, that is, how to prepare or make a text that will last a long time and that can be used for as many applications as possible, which would lead to issues of improving intellectual access.
HOCKEY urged the need to look at ways of using markup to facilitate retrieval, not just for referencing or to help locate an item that is retrieved, but also to put markup tags in a text to help retrieve the thing sought either with linguistic tagging or interpretation. HOCKEY also argued that little advancement had occurred in the software tools currently available for retrieving and searching text. She pressed the desideratum of going beyond Boolean searches and performing more sophisticated searching, which the insertion of more markup in the text would facilitate. Thinking about electronic texts as opposed to images means considering material that will never appear in print form, or for which print will not be the primary form, that is, material which appears only in electronic form. HOCKEY alluded to the history of and need for markup and tagging of electronic text, which developed through the use of computers in the humanities; as MICHELSON had observed, Father Busa had started in 1949 to prepare the first-ever text on the computer.
HOCKEY remarked on several large projects, particularly in Europe, for the compilation of dictionaries, language studies, and language analysis, in which people have built up archives of text and have begun to recognize the need for an encoding format that will be reusable and multifunctional, that can be used not just to print the text, which may be assumed to be a byproduct of what one wants to do, but to structure it inside the computer so that it can be searched, built into a Hypertext system, etc.
******
+++++++++++++ WEIBEL * OCLC's approach to preparing electronic text: retroconversion, keying of texts, more automated ways of developing data * Project ADAPT and the CORE Project * Intelligent character recognition does not exist * Advantages of SGML * Data should be free of procedural markup; descriptive markup strongly advocated * OCLC's interface illustrated * Storage requirements and costs for putting a lot of information on line * +++++++++++++
Stuart WEIBEL, senior research scientist, Online Computer Library Center, Inc. (OCLC), described OCLC's approach to preparing electronic text. He argued that the electronic world into which we are moving must accommodate not only the future but the past as well, and to some degree even the present. Thus, starting out at one end with retroconversion and keying of texts, one would like to move toward much more automated ways of developing data.
For example, Project ADAPT had to do with automatically converting document images into a structured document database with OCR text as indexing and also a little bit of automatic formatting and tagging of that text. The CORE project, hosted by Cornell University, Bellcore, OCLC, the American Chemical Society, and Chemical Abstracts, constitutes WEIBEL's principal concern at the moment. This project is an example of converting text for which one already has a machine-readable version into a format more suitable for electronic delivery and database searching. (Since Michael LESK had previously described CORE, WEIBEL would say little concerning it.) Borrowing a chemical phrase, de novo synthesis, WEIBEL cited the Online Journal of Current Clinical Trials as an example of de novo electronic publishing, that is, a form in which the primary form of the information is electronic.
Project ADAPT, then, which OCLC completed a couple of years ago and in fact is about to resume, is a model in which one takes page images either in paper or microfilm and converts them automatically to a searchable electronic database, either on-line or local. The operating assumption is that accepting some blemishes in the data, especially for retroconversion of materials, will make it possible to accomplish more. Not enough money is available to support perfect conversion.
WEIBEL related several steps taken to perform image preprocessing (processing on the image before performing optical character recognition), as well as image postprocessing. He denied the existence of intelligent character recognition and asserted that what is wanted is page recognition, which is a long way off. OCLC has experimented with merging of multiple optical character recognition systems that will reduce errors from an unacceptable rate of 5 characters out of every 1,000 to a rate of 2 characters out of every 1,000, but even that is not good enough. It will never be perfect.
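The multi-engine merging WEIBEL mentions can be sketched as a per-character majority vote. The sketch assumes the engine outputs are already aligned character-for-character, which in practice is the hard part; the example strings are invented.

```python
# Majority vote per character position across three OCR outputs.
# Assumes pre-aligned, equal-length readings; real merging must
# first align outputs that differ in length.
from collections import Counter

def merge(readings):
    merged = []
    for chars in zip(*readings):
        merged.append(Counter(chars).most_common(1)[0][0])
    return "".join(merged)

a = "the qu1ck fox"
b = "the quick f0x"
c = "the quiok fox"
print(merge([a, b, c]))  # -> the quick fox
```

Where all three engines make the same mistake, the vote passes the error through, which is one reason WEIBEL says the merged rate remains unacceptable.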
Concerning the CORE Project, WEIBEL observed that Bellcore is taking the typography files, extracting the page images, and converting those typography files to SGML markup. LESK hands that data off to OCLC, which builds that data into a Newton database, the same system that underlies the on-line system in virtually all of the reference products at OCLC. The long-term goal is to make the systems interoperable so that not just Bellcore's system and OCLC's system can access this data, but other systems can as well, and the key to that is the Z39.50 common command language and the full-text extension. Z39.50 is fine for MARC records, but is not enough to do it for full text (that is, make full texts interoperable).
WEIBEL next outlined the critical role of SGML for a variety of purposes, for example, as noted by HOCKEY, in the world of extremely large databases, using highly structured data to perform field searches. WEIBEL argued that by building the structure of the data in (i.e., the structure of the data originally on a printed page), it becomes easy to look at a journal article even if one cannot read the characters and know where the title or author is, or what the sections of that document would be. OCLC wants to make that structure explicit in the database, because it will be important for retrieval purposes.
The second big advantage of SGML is that it gives one the ability to build structure into the database that can be used for display purposes without contaminating the data with instructions about how to format things. The distinction lies between procedural markup, which tells one where to put dots on the page, and descriptive markup, which describes the elements of a document.
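The distinction between procedural and descriptive markup can be illustrated with a small fragment; the tag names below are hypothetical, invented for illustration, and are not drawn from the CORE data.

```sgml
<!-- Procedural markup: instructions about appearance -->
A review of <italic>de novo</italic> synthesis appeared in
<bold>Vol. 12</bold>.

<!-- Descriptive markup: the same passage tagged for what the
     elements are; rendering is left to the display device -->
A review of <foreign>de novo</foreign> synthesis appeared in
<volume>Vol. 12</volume>.
```

Only the second form lets a later system decide, say, to render foreign phrases and volume numbers differently, which is the flexibility WEIBEL argues for.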
WEIBEL believes that there should be no procedural markup in the data at all, that the data should be completely unsullied by information about italics or boldness. That should be left up to the display device, whether that display device is a page printer or a screen display device. By keeping one's database free of that kind of contamination, one can make decisions down the road, for example, reorganize the data in ways that are not cramped by built-in notions of what should be italic and what should be bold. WEIBEL strongly advocated descriptive markup. As an example, he illustrated the index structure in the CORE data. With subsequent illustrated examples of markup, WEIBEL acknowledged the common complaint that SGML is hard to read in its native form, although markup decreases considerably once one gets into the body. Without the markup, however, one would not have the structure in the data. One can pass markup through a LaTeX processor and convert it relatively easily to a printed version of the document.
WEIBEL next illustrated an extremely cluttered screen dump of OCLC's system, in order to show as much as possible the inherent capability on the screen. (He noted parenthetically that he had become a supporter of X-Windows as a result of the progress of the CORE Project.) WEIBEL also illustrated the two major parts of the interface: 1) a control box that allows one to generate lists of items, which resembles a small table of contents based on key words one wishes to search, and 2) a document viewer, which is a separate process in and of itself. He demonstrated how to follow links through the electronic database simply by selecting the appropriate button and bringing them up. He also noted problems that remain to be accommodated in the interface (e.g., as pointed out by LESK, what happens when users do not click on the icon for the figure).
Given the constraints of time, WEIBEL omitted a large number of ancillary items in order to say a few words concerning storage requirements and what will be required to put a lot of things on line. Since it is extremely expensive to reconvert all of this data, especially if it is just in paper form (and even if it is in electronic form in typesetting tapes), he advocated building journals electronically from the start. In that case, if one only has text, graphics, and indexing (which is all that one needs with de novo electronic publishing, because there is no need to go back and look at bit-maps of pages), one can get 10,000 journals of full text, or almost 6 million pages per year. These pages can be put in approximately 135 gigabytes of storage, which is not all that much, WEIBEL said. For twenty years, something less than three terabytes would be required. WEIBEL calculated the costs of storing this information as follows: If a gigabyte costs approximately $1,000, then a terabyte costs approximately $1 million to buy in terms of hardware. One also needs a building to put it in and a staff like OCLC to handle that information. So, to support a terabyte, multiply by five, which gives $5 million per year for a supported terabyte of data.
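WEIBEL's storage and cost figures hang together arithmetically:

```python
# WEIBEL's estimate for de novo electronic publishing
# (text, graphics, and indexing only; no page bit-maps).
gb_per_year = 135                 # ~6 million pages from 10,000 journals
years = 20
total_tb = gb_per_year * years / 1_000   # "less than three terabytes"

hardware_per_tb = 1_000 * 1_000          # $1,000/GB -> $1M per terabyte
supported_per_tb = 5 * hardware_per_tb   # building and staff: multiply by five
print(f"{total_tb} TB over {years} years; "
      f"${supported_per_tb:,} per year for a supported terabyte")
```

Twenty years of output comes to 2.7 terabytes, consistent with his "something less than three terabytes."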
******
+++++++++++++ DISCUSSION * Tapes saved by ACS are the typography files originally supporting publication of the journal * Cost of building tagged text into the database * +++++++++++++
During the question-and-answer period that followed WEIBEL's presentation, these clarifications emerged. The tapes saved by the American Chemical Society are the typography files that originally supported the publication of the journal. Although they are not tagged in SGML, they are tagged in very fine detail. Every single sentence is marked, all the registry numbers, all the publication issues, dates, and volumes. No cost figures on tagging material on a per-megabyte basis were available. Because ACS's typesetting system runs from tagged text, there is no extra cost per article. It was unknown what it costs ACS to keyboard the tagged text rather than just keyboard the text in the cheapest process. In other words, since one intends to publish things and will need to build tagged text into a typography system in any case, if one does that in such a way that it can drive not only typography but an electronic system (which is what ACS intends to do—move to SGML publishing), the marginal cost is zero. The marginal cost represents the cost of building tagged text into the database, which is small.
******
+++++++++++++ SPERBERG-McQUEEN * Distinction between texts and computers * Implications of recognizing that all representation is encoding * Dealing with complicated representations of text entails the need for a grammar of documents * Variety of forms of formal grammars * Text as a bit-mapped image does not represent a serious attempt to represent text in electronic form * SGML, the TEI, document-type declarations, and the reusability and longevity of data * TEI conformance explicitly allows extension or modification of the TEI tag set * Administrative background of the TEI * Several design goals for the TEI tag set * An absolutely fixed requirement of the TEI Guidelines * Challenges the TEI has attempted to face * Good texts not beyond economic feasibility * The issue of reproducibility or processability * The issue of images as simulacra for the text redux * One's model of text determines what one's software can do with a text and has economic consequences * +++++++++++++
Prior to speaking about SGML and markup, Michael SPERBERG-McQUEEN, editor, Text Encoding Initiative (TEI), University of Illinois-Chicago, first drew a distinction between texts and computers: Texts are abstract cultural and linguistic objects while computers are complicated physical devices, he said. Abstract objects cannot be placed inside physical devices; with computers one can only represent text and act upon those representations.
The recognition that all representation is encoding, SPERBERG-McQUEEN argued, leads to the recognition of two things: 1) The topic description for this session is slightly misleading, because there can be no discussion of pros and cons of text-coding unless what one means is pros and cons of working with text with computers. 2) No text can be represented in a computer without some sort of encoding; images are one way of encoding text, ASCII is another, SGML yet another. There is no encoding without some information loss, that is, there is no perfect reproduction of a text that allows one to do away with the original. Thus, the question becomes, What is the most useful representation of text for a serious work? This depends on what kind of serious work one is talking about.
The projects demonstrated the previous day all involved highly complex information and fairly complex manipulation of the textual material. In order to use that complicated information, one has to calculate it slowly or manually and store the result. It needs to be stored, therefore, as part of one's representation of the text. Thus, one needs to store the structure in the text. To deal with complicated representations of text, one needs somehow to control the complexity of the representation of a text; that means one needs a way of finding out whether a document and an electronic representation of a document is legal or not; and that means one needs a grammar of documents.
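SPERBERG-McQUEEN's "grammar of documents" can be made concrete in miniature: a grammar states which children each element may contain, and a document is legal if every element obeys its rule. The grammar and document below are hypothetical toys, far simpler than real SGML content models, which allow repetition and alternation.

```python
# A toy document grammar: each rule lists the exact sequence of
# child elements an element must contain. Elements not named in
# the grammar are unconstrained.
GRAMMAR = {
    "book": ["front", "body"],
    "front": ["title"],
    "body": ["chapter", "chapter"],
}

def is_legal(element, children):
    return GRAMMAR.get(element) == [name for name, _ in children]

def validate(node):
    name, children = node
    if name in GRAMMAR and not is_legal(name, children):
        return False
    return all(validate(child) for child in children)

doc = ("book", [("front", [("title", [])]),
                ("body", [("chapter", []), ("chapter", [])])])
print(validate(doc))  # -> True
```

A document missing its front matter, say, would be rejected, which is exactly the kind of legality check SPERBERG-McQUEEN argues a grammar makes possible.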
SPERBERG-McQUEEN discussed the variety of forms of formal grammars, implicit and explicit, as applied to text, and their capabilities. He argued that these grammars correspond to different models of text that different developers have. For example, one implicit model of the text is that there is no internal structure, but just one thing after another, a few characters and then perhaps a start-title command, and then a few more characters and an end-title command. SPERBERG-McQUEEN also distinguished several kinds of text that have a sort of hierarchical structure that is not very well defined, which, typically, corresponds to grammars that are not very well defined, as well as hierarchies that are very well defined (e.g., the Thesaurus Linguae Graecae) and extremely complicated things such as SGML, which handle strictly hierarchical data very nicely.