Overview:
*********

LOC data on copyright registrations is organized as a relational database in the MARC 21 format for bibliographic data, which is described in detail in the resources linked below. The data in these files have been converted into long-form .csv files, and the MARC 21 variable naming convention has been retained. Each .csv file contains three columns:

1) "row_id" contains a unique ID for each record in the file.
2) "v_" contains the value of each variable, which can be string or numeric.
3) "var_nm" contains the name of each variable, according to the MARC 21 naming convention.

MARC 21 naming convention:
**************************

There are three types of information provided in the data, corresponding to three types of variables:

1) "Leader" is an alphanumeric code that provides information for the processing of the record, as laid out in the MARC 21 documentation. Leader fields are labeled "leader".

2) "Control Fields" provide identifiers and other coded information about the record. Specific control fields are identified by a three-digit number, corresponding to the MARC 21 documentation. Each control field may have subfields (although this is uncommon). Control fields are identified with a "cf" prefix, followed by 1) a three-digit code indicating which control field it is, and 2) an alphanumeric subfield code, when applicable. For example, the variable name for control field 005 is "cf_005".

3) "Data Fields" contain data entries associated with the record (e.g., the title of a work). Specific data fields are identified by a three-digit code, corresponding to the MARC 21 documentation. Each data field may (and typically does) have subfields, and each subfield has two alphanumeric "indicators". Certain fields are repeatable, so the variable name must also mark the repetition.
Data fields are identified with a "df" prefix, followed by 1) a three-digit code indicating which data field it is, 2) an alphanumeric subfield ("sf") code, when applicable, 3) indicator-1, 4) indicator-2, and 5) a value indicating the repetition (again noting that some data fields are repeatable). For example, the variable name for the third repetition of data field 245, subfield z, with indicator-1 = a and indicator-2 = na, is "df_245_sf_z_i1_a_i2_na_3".

Converting to Wide Form:
************************

Some users may want to convert the data into wide form, with each row containing information on one record. The conversion from the MARC 21 format to a two-dimensional data table is not necessarily straightforward. Most data fields have many subfields, and those subfields have "indicators" that provide further information about individual entries within the field. Additionally, some fields and subfields are repeatable (i.e., a single registration record may have several, or even hundreds, of entries for a given field). Transforming data with many dimensions to a two-dimensional structure can result in very large data tables, so it may be prudent for users to omit information that is not relevant to their work.

Resources:
**********

1) https://www.copyright.gov/policy/women-in-copyright-system/LOC-Copyright-Data-as-Distributed-in-the-MARC%2021-Format.pdf
2) https://www.loc.gov/marc/bibliographic/ecbdlist.html

Questions or comments can be directed to economist@copyright.gov.
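
Example (illustrative sketch only):

The wide-form conversion described above could be approached roughly as follows, using only the Python standard library. This is a minimal sketch and not an official tool. It assumes that each .csv file has a header row with the column names "row_id", "v_", and "var_nm", and that data-field names follow the component order listed under the naming convention (df prefix, three-digit field code, optional "sf" subfield, "i1" and "i2" indicators, repetition number). The function name long_to_wide, the keep_fields parameter, and the sample rows are hypothetical and were invented for illustration; check the variable names in your own files before relying on the pattern.

```python
import csv
import io
import re
from collections import defaultdict

# ASSUMED layout of data-field names, e.g. "df_245_sf_z_i1_a_i2_na_3".
# Verify against the var_nm values in the actual files.
DF_NAME = re.compile(
    r"^df_(?P<field>\d{3})"                # three-digit data field code
    r"(?:_sf_(?P<subfield>[0-9a-z]+))?"    # optional subfield code
    r"_i1_(?P<ind1>[0-9a-z]+)"             # indicator-1
    r"_i2_(?P<ind2>[0-9a-z]+)"             # indicator-2
    r"_(?P<rep>\d+)$"                      # repetition number
)


def long_to_wide(csv_text, keep_fields=None):
    """Pivot long-form rows (row_id, v_, var_nm) into one dict per record.

    keep_fields: optional set of three-digit data-field codes to retain.
    Names that do not match DF_NAME (leader, control fields, or anything
    unrecognized) are always kept.
    """
    records = defaultdict(dict)
    for row in csv.DictReader(io.StringIO(csv_text)):
        name = row["var_nm"]
        match = DF_NAME.match(name)
        # Drop data fields the user did not ask for, to limit table width.
        if keep_fields is not None and match and match["field"] not in keep_fields:
            continue
        records[row["row_id"]][name] = row["v_"]
    # One column per distinct variable name that survived the filter.
    columns = sorted({name for rec in records.values() for name in rec})
    return columns, dict(records)


# Hypothetical sample rows, made up for illustration (not real registrations).
SAMPLE = """row_id,v_,var_nm
1,20200101120000.0,cf_005
1,An Example Title,df_245_sf_a_i1_1_i2_0_1
1,Example Publisher,df_260_sf_b_i1_na_i2_na_1
2,Another Title,df_245_sf_a_i1_1_i2_0_1
"""

columns, wide = long_to_wide(SAMPLE, keep_fields={"245"})
for row_id, rec in wide.items():
    # Missing variables are written as empty cells.
    print(row_id, [rec.get(col, "") for col in columns])
```

Restricting keep_fields to the data fields of interest keeps the number of columns manageable, since every distinct combination of field, subfield, indicators, and repetition becomes its own column. For full-size files, reading the .csv in a streaming fashion rather than from a single string would likely be preferable.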