Project Status

A detailed analysis of the Copyright records has been completed, and test scanning has been done to determine the best digitization parameters for the several formats of the records. For preservation, the records will generally be scanned as uncompressed Tagged Image File Format (TIFF) files at a minimum of 300 pixels per inch (ppi) in 24-bit color. For routine access to the digitized records, derivative files will be created in JPEG or JPEG 2000 format at 50:1 compression. Production scanning has begun on three sets of records.
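To give a rough sense of what these parameters imply for storage, the sketch below estimates the size of one preservation master and its access derivative. The card dimensions (a standard 3 x 5 inch catalog card) are an assumption for illustration and are not stated in this report; the function names are likewise hypothetical, and TIFF header overhead is ignored.

```python
# Assumed card size: a standard 3 x 5 inch catalog card (not stated in the report).
CARD_WIDTH_IN = 5.0
CARD_HEIGHT_IN = 3.0


def uncompressed_bytes(width_in, height_in, ppi=300, bits_per_pixel=24):
    """Approximate pixel-data size of an uncompressed scan (no file header)."""
    width_px = round(width_in * ppi)
    height_px = round(height_in * ppi)
    return width_px * height_px * bits_per_pixel // 8


def derivative_bytes(master_bytes, compression_ratio=50):
    """Approximate size of an access copy at the given compression ratio."""
    return master_bytes // compression_ratio


master = uncompressed_bytes(CARD_WIDTH_IN, CARD_HEIGHT_IN)
access = derivative_bytes(master)
print(f"master: {master / 1e6:.2f} MB, access copy: {access / 1e3:.0f} KB")
```

Under these assumptions a single card image is about 4.05 MB as a preservation master and about 81 KB as a 50:1 access copy. Actual sizes will vary with the record format and the resolution chosen.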

First, the 2.5 million catalog cards that constitute the indexes to assignments and transfers of copyrights from 1870 to 1977 have been digitized, and the images have been placed in archival storage at the Madison Building data center and at an alternate computing facility. Four Metadata Specialists are working part-time to capture the index terms from the assignment and transfer card images in local databases, which will be a source for creating a publicly available online index to the digitized records. Data has already been captured from more than 145,000 images.

Second, the 7.7 million registration catalog cards from the 1971 to 1977 period, the 9.8 million cards from the 1955 to 1970 period, and the 2.5 million cards from the 1946 to 1954 period have been digitized, bringing the total to more than 22 million.

Third, the bound volumes of the Catalog of Copyright Entries (CCE) have been scanned at the Internet Archive center in the Library of Congress Adams Building. This is the same center and process being used to scan works from the Library's collections. So far, 667 CCE volumes have been scanned and are now available at http://www.archive.org/details/copyrightrecords/. They range from the very first publication in 1891 through 1978 and cover all classes of works and all renewals. A few volumes are still in process because their size requires additional preparation prior to scanning. A limited search capability is available for the online volumes, based on the results of optical character recognition (OCR) of the scanned text.

A high level of quality assurance is required in all scanning of Copyright records. A tool has been designed and built to allow the Copyright Office to inspect the scanned images using a sampling process to ensure the highest possible quality in the digitized records.
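The sampling approach described above can be illustrated with a short sketch. This is not the Office's inspection tool, whose design is not described here; the function name, the file naming pattern, and the 2 percent sampling rate are all assumptions chosen for the example.

```python
import random


def draw_inspection_sample(image_ids, sample_fraction=0.01, seed=None):
    """Pick a random subset of scanned images for manual quality review.

    Hypothetical helper: sample_fraction and seed are illustrative
    parameters, not values taken from the project.
    """
    if not 0 < sample_fraction <= 1:
        raise ValueError("sample_fraction must be in the range (0, 1]")
    image_ids = list(image_ids)
    if not image_ids:
        return []
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    sample_size = max(1, round(len(image_ids) * sample_fraction))
    return rng.sample(image_ids, sample_size)


# Made-up batch of 1,000 image file names.
batch = [f"card_{n:06d}.tif" for n in range(1, 1001)]
to_inspect = draw_inspection_sample(batch, sample_fraction=0.02, seed=42)
print(len(to_inspect), "images selected for inspection")
```

Fixing the seed lets a reviewer reproduce the same sample later, which can be useful when a batch is rescanned and needs to be checked again.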

Detailed planning is underway to determine how to capture information from the records to enable effective searching that is widely available via the web. Strategies being considered include optical and intelligent character recognition, double-blind data capture, crowdsourcing, faceted search engines, and, as an interim measure, a virtual card catalog that would enable searching the card images online in a way that mimics searching the actual cards. A special project team made up of staff from all parts of the Office has been formed to assist in this effort by assessing the options and analyzing the record formats and the data patterns within the content.
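As one illustration of these strategies, double-blind data capture is read here as double-key entry: two people transcribe the same card independently, and any field on which they disagree is sent for review. That reading is an assumption, and the sketch below is not a description of the project's actual workflow. The field names and sample values are invented.

```python
def normalize(value):
    """Collapse whitespace and ignore case so trivial differences do not count."""
    return " ".join(value.split()).casefold()


def compare_keyings(first, second):
    """Return the field names on which two independent keyings disagree."""
    fields = sorted(set(first) | set(second))
    return [
        name
        for name in fields
        if normalize(first.get(name, "")) != normalize(second.get(name, ""))
    ]


# Invented sample data: two operators keyed the same card image.
keying_a = {"title": "Example Title", "claimant": "Doe, Jane", "year": "1956"}
keying_b = {"title": "example  title", "claimant": "Doe, Jane", "year": "1958"}

disagreements = compare_keyings(keying_a, keying_b)
print(disagreements)  # fields that need a third look
```

Here only the year is flagged, because the two titles differ only in case and spacing. How strictly to normalize before comparing is a design choice that would depend on the index terms being captured.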

A project blog has also been launched at http://blogs.loc.gov/copyrightdigitization/ as a means for posting information about plans and progress on the project and to seek input from interested parties.