Logos customers like books. Lots of books. In the past couple of years, we've been publishing new electronic titles at breakneck speed through our pre-publication program. Since 1991, we've made available more than 9,000 titles in electronic format, all of which work with the Libronix Digital Library System.

But this is only a drop in the bucket when you consider the tens of thousands of Bible reference titles and journal volumes that sit on the shelves of a typical seminary library. Recently we started to explore ways to free thousands of these public domain books from the library shelves and get them back into circulation.

What we needed was some efficient way to scan books at the rate of thousands of pages per day...while preserving even fragile volumes.

New technology has brought this vision into the realm of the achievable. During the summer of 2004, a robotic book scanning machine was delivered to the Logos office in Bellingham. This machine, purchased from Kirtas Technologies, Inc., is capable of scanning books at the rate of 1,200 pages per hour. What's more, unlike with conventional scanners, there's no need to destroy the bindings: it can safely scan even fragile antique books.

The scanner has a robotic arm that uses gentle suction to turn each page of the book loaded in its cradle. A high-resolution camera mounted atop the unit snaps a photo of each page, using a clever system of mirrors to photograph each side and correct for the angle of the open book. Each image is then processed by character recognition software, creating a rough full-text digital edition.

Once we finished scanning our favorite public domain works from our personal rare book collections :-), we parked the book scanner at a seminary library to run through thousands of public domain titles from their collection. This is already generating terabytes of new digital files for rare and hard-to-find books...more than 3.2 million pages representing more than 7,700 books as of September 2006!
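
To put those numbers in perspective, here is a rough back-of-the-envelope sketch in Python. It uses only the figures quoted in this article and assumes, optimistically, that the scanner sustains its rated 1,200 pages per hour with no time lost to loading books or pausing. Real-world throughput is surely lower, so treat the result as a floor rather than a measurement.

```python
# Figures quoted in the article; both are "more than", so results are lower bounds.
PAGES_SCANNED = 3_200_000
BOOKS_SCANNED = 7_700
RATED_PAGES_PER_HOUR = 1_200  # the scanner's rated speed


def average_pages_per_book(pages: int, books: int) -> float:
    """Mean number of pages per scanned book."""
    return pages / books


def scanner_hours(pages: int, pages_per_hour: int) -> float:
    """Hours of scanning, ASSUMING the rated speed is sustained throughout."""
    return pages / pages_per_hour


if __name__ == "__main__":
    avg = average_pages_per_book(PAGES_SCANNED, BOOKS_SCANNED)
    hours = scanner_hours(PAGES_SCANNED, RATED_PAGES_PER_HOUR)
    print(f"Average book length: about {avg:.0f} pages")
    print(f"Scanner time at rated speed: about {hours:.0f} hours")
```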

Many details of how these digital files will be used are yet to be determined. But stay tuned...the floodgates are about to open!

(See the Community Pricing Program page, which lists a number of public domain titles scanned with the book scanner and now available for pre-order via a new community-bidding program.)

From Paper to Bits

As the APT BookScan 1200 flips through each book, its 16-megapixel digital camera (a Canon EOS-1Ds Mark II) snaps photos at a high resolution and saves them as full-color JPGs (see Figure 1.1). Due to the method of securing the book in the cradle, there are some artifacts on every page, namely the clear plastic clamps that secure the book after each page is turned.

Gospel According to St. Matthew - Color Photo
Figure 1.1 — The camera produces a raw, uncorrected color photo of each page.

The scanner comes with a dedicated desktop computer running proprietary software to process each image. For our purposes, each photo is cleaned up and converted into a high-resolution, compressed TIFF file (these average 80 KB each; see Figure 1.2). The software automatically removes artifacts like the clamps and any discoloration that may appear on the page. The result is an impressively crisp and detailed digital reproduction of the page...even clearer than the original in some cases!
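
For a sense of scale, the short sketch below estimates how much disk space the cleaned-up TIFFs alone would take for the 3.2 million pages mentioned earlier. It assumes the 80 KB average holds across the whole collection and uses decimal units (1 GB = 1,000,000 KB). It leaves out the raw color photos, whose file sizes are not given here.

```python
PAGES_SCANNED = 3_200_000  # "more than 3.2 million pages" as of September 2006
AVG_TIFF_KB = 80           # average compressed TIFF size quoted above
KB_PER_GB = 1_000_000      # assumption: decimal units


def tiff_storage_gb(pages: int, avg_kb: float = AVG_TIFF_KB) -> float:
    """Estimated total size of the TIFF files, in gigabytes."""
    return pages * avg_kb / KB_PER_GB


if __name__ == "__main__":
    print(f"TIFFs alone: roughly {tiff_storage_gb(PAGES_SCANNED):.0f} GB")
```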

Gospel According to St. Matthew - TIFF version
Figure 1.2 — The software cleans up the image and saves as a hi-res, compressed TIFF.

From here, it's a matter of performing optical character recognition (OCR) on the images, extracting full text wherever possible and falling back to an image of the page wherever the OCR software has trouble recognizing characters. We expect approximately 90% accuracy on this first pass, at least for Latin characters. As with any other digital book project, hand-correcting these raw files is a major expense. The correction of texts with a complex mix of scripts (Hebrew, Greek, Arabic, etc.) is especially time-consuming.
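
The "text where possible, image where not" idea can be sketched in a few lines of Python. This is only an illustration, not the OCR software we actually use: the `Region` structure, the confidence scores, the `image_ref` labels and the 0.8 cutoff are all hypothetical stand-ins for whatever a real OCR engine reports.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """One recognized chunk of a page (hypothetical structure)."""
    text: str          # the OCR engine's best guess
    confidence: float  # 0.0 to 1.0, as a hypothetical engine might report it
    image_ref: str     # hypothetical label for the cropped page image


def assemble_page(regions, threshold=0.8):
    """Keep confident text; substitute an image placeholder everywhere else."""
    parts = []
    for region in regions:
        if region.confidence >= threshold:
            parts.append(region.text)
        else:
            # Low confidence: point at the scan instead of guessing.
            parts.append(f"[image: {region.image_ref}]")
    return " ".join(parts)


if __name__ == "__main__":
    page = [
        Region("The book of the generation of", 0.97, "p1-r1"),
        Region("?????", 0.35, "p1-r2"),  # e.g. a word in an unrecognized script
    ]
    print(assemble_page(page))
```

Regions that fall below the cutoff are the ones a human proofreader would check against the scanned page.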

Mix of scripts on a page
Figure 2.1 — Some texts include a dizzying mix of scripts on a single page.

Being able to call up a high-resolution digital image with the click of a mouse is extremely valuable when trying to verify the text on a given page. Scanned images from the APT BookScan 1200 make it possible to zoom right in and see an amazing amount of detail, even when the print book is not handy.

Text in Full Color at 100%
Text in Grayscale at 100%
Text in Black & White at 100%

Figure 2.2 — The scanner preserves a remarkable level of detail. Characters are displayed here at 100%. The top image is from the color photo, middle one from a grayscale TIFF and the lower one from a black & white TIFF.
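
For readers curious how a color photo becomes a grayscale or black & white image, here is a minimal Python sketch. It is not the scanner's proprietary software, whose methods we don't describe here. It assumes the standard ITU-R BT.601 luminance weights for the grayscale step and a simple fixed cutoff of 128 for the black & white step, and it represents a page as plain nested lists of pixel values.

```python
def to_grayscale(rgb_rows):
    """Convert rows of (r, g, b) tuples to rows of 0-255 brightness values."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
        for row in rgb_rows
    ]


def to_black_and_white(gray_rows, threshold=128):
    """Map each brightness value to pure white (255) or pure black (0)."""
    return [
        [255 if value >= threshold else 0 for value in row]
        for row in gray_rows
    ]


if __name__ == "__main__":
    # A made-up two-pixel "page": cream-colored paper next to dark ink.
    photo = [[(250, 245, 230), (30, 25, 20)]]
    gray = to_grayscale(photo)
    print(gray)
    print(to_black_and_white(gray))
```

A fixed cutoff is the simplest possible choice. Yellowed or unevenly lit pages would likely need something smarter, which is part of why the black & white version loses detail that the color and grayscale versions keep.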

