class: center, middle, inverse, title-slide # Processing MARC ## … with open source tools ### 114. BiblioCon 2026 Berlin --- ## Links Slides: https://jorol.github.io/2026-bibliocon/slides Exercises: https://jorol.github.io/2026-bibliocon/#/Exercises Files: https://jorol.github.io/2026-bibliocon/files/processing-marc.zip Software: https://jorol.github.io/2026-bibliocon/#/Software --- class: middle ## "When MARC was created, the Beatles were a hot new group ..." .right[Roy Tennant] --- ## MARC Must Die In 2002 Roy Tennant declared "[MARC Must Die](https://www.libraryjournal.com/?detailStory=marc-must-die)". Today the [MARC 21](https://www.loc.gov/marc/) format is still the workhorse of library metadata. Even our "Next Generation Library Systems" heavily rely on this standard from the ‘60s. Since we will continue to work with MARC 21 in the coming years, this tutorial will give an introduction to MARC 21. --- ## Agenda - MARC 21 - Introduction - Record elements - Serializations - Get MARC 21 data - Validation of MARC 21 records and common errors - Statistical analysis of MARC 21 data sets - Conversion of MARC 21 records - Metadata extraction from MARC 21 records --- ## MARC 21 Format for Bibliographic Data [MARC 21 format for Bibliographic Data](https://www.loc.gov/marc/bibliographic/) is a standard designed to be a carrier for bibliographic information about printed and manuscript textual materials, computer files, maps, music, continuing resources, visual materials, and mixed materials. Bibliographic data commonly includes titles, names, subjects, notes, publication data, and information about the physical description of an item. 
The standard defines [formats](https://www.loc.gov/marc/marcdocz.html) for the representation and exchange of [bibliographic](https://www.loc.gov/marc/bibliographic/), [authority](https://www.loc.gov/marc/authority/ecadhome.html), [holdings](https://www.loc.gov/marc/holdings/echdhome.html), [classification](https://www.loc.gov/marc/classification/eccdhome.html) and [community information](https://www.loc.gov/marc/community/eccihome.html) data in machine-readable form.

---

## A MARC record is composed of three elements:

* *Record structure*: an implementation of the international standard Format for Information Exchange (ISO 2709) and its American counterpart, Bibliographic Information Interchange (ANSI/NISO Z39.2).
* *Content designation*: the codes and conventions established explicitly to identify and further characterize the data elements within a record.
* *Data content of the record*: the content of the data elements that comprise a MARC record is usually defined by standards outside the formats (e.g. [ISBD](https://www.ifla.org/publications/international-standard-bibliographic-description), [AACR2](http://www.aacr2.org/), [RDA](http://www.rda-jsc.org/archivedsite/rdaprospectus.html)).

---

## Code lists

The MARC 21 standard also provides [lists of source codes](https://www.loc.gov/standards/sourcelist/index.html) for vocabularies, rules and schemes.

---

## Agency

The MARC 21 standard is maintained by the [Network Development and MARC Standards Office](https://www.loc.gov/marc/ndmso.html) and documented in detail at https://www.loc.gov/marc/marcdocz.html.

---

## Introduction

For a short introduction to MARC 21 see OCLC's ["Introduction"](https://www.oclc.org/bibformats/en/introduction.html), or ["Understanding MARC Bibliographic: Machine-Readable Cataloging"](https://www.loc.gov/marc/umb/) for a more detailed one. The history of MARC is documented in ["MARC, its history and implications"](https://babel.hathitrust.org/cgi/pt?id=mdp.39015034388556).
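
---

## The record model as plain data

The record model just described — a leader plus control fields and data fields with indicators and subfield code/value pairs — can be sketched as a plain data structure. This is a hypothetical Python illustration only (no MARC library involved); the values are taken from the Code4Lib journal record used as an example later in this tutorial:

```python
# A MARC record's logical model: leader, control fields (tag + value)
# and data fields (tag + two indicators + repeatable subfields).
record = {
    "leader": "00251nas a2200121 c 4500",
    "fields": [
        {"tag": "001", "value": "987874829"},          # control field
        {"tag": "022", "ind1": " ", "ind2": " ",       # data field
         "subfields": [("a", "1940-5758")]},
        {"tag": "245", "ind1": "0", "ind2": "0",
         "subfields": [("a", "Code4Lib journal"), ("b", "C4LJ")]},
    ],
}

def subfield_values(rec, tag, code):
    """Collect all values of a given subfield code in a given tag."""
    return [value
            for field in rec["fields"] if field["tag"] == tag
            for c, value in field.get("subfields", []) if c == code]

print(subfield_values(record, "245", "a"))  # ['Code4Lib journal']
```

Every serialization covered in the next section is just a different encoding of this same model.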
---
class: middle

## MARC 21 serializations

---

## MARC (ISO 2709)

A "MARC (ISO 2709)" record ([ISO 2709:2008](https://www.iso.org/standard/41319.html) & [ANSI/NISO Z39.2-1994](https://www.niso.org/publications/ansiniso-z392-1994-r2016)) consists of three parts:

* leader
* directory
* variable fields

---

## Leader

The [leader](https://www.loc.gov/marc/specifications/specrecstruc.html#leader) has a fixed length of 24 ASCII characters which provide some basic information for processing the record. Data elements are positionally defined, see https://www.loc.gov/marc/bibliographic/bdleader.html.

Leader positions 00-04 define the length of the record. The total length of a "MARC (2709)" record is limited to 99999 bytes. Position 09 defines the "character coding scheme" ([MARC-8](https://www.loc.gov/marc/specifications/specchartables.html) or [Unicode](https://www.iso.org/standard/69119.html)).

---

## Directory

The [directory](https://www.loc.gov/marc/specifications/specrecstruc.html#direct) is a variable sequence of entries describing the tag, the length and the starting position of each field. Each directory entry has a length of 12 characters:

* tag: 00-02
* length of field: 03-06
* starting position: 07-11

The length of a "MARC (2709)" record field is limited to 9999 bytes.

---

## Variable fields

The [variable fields](https://www.loc.gov/marc/specifications/specrecstruc.html#varifields) are [control fields](https://www.loc.gov/marc/bibliographic/bd00x.html) followed by data fields. Data fields consist of two indicators and a sequence of subfields. Indicators can be used to interpret or supplement the data found in the field. Their meaning varies by field. Each subfield consists of a subfield code and the corresponding value. Data fields and subfields can be repeated.

---

## Separators

A MARC record is terminated with a record terminator (Unicode character 'INFORMATION SEPARATOR THREE' [U+001D](https://www.fileformat.info/info/unicode/char/001d/index.htm)).
The directory and each variable field are terminated with a field terminator (Unicode character 'INFORMATION SEPARATOR TWO' [U+001E](https://www.fileformat.info/info/unicode/char/001e/index.htm)).

Within a data field, each subfield is introduced by a subfield delimiter (Unicode character 'INFORMATION SEPARATOR ONE' [U+001F](https://www.fileformat.info/info/unicode/char/001f/index.htm)) followed by its subfield code.

---

## Example "MARC (ISO 2709)" record

```no-highlight
00998nas a2200325 c 4500001001000000003000700010005001700017 007001500034008004100049016002200090016002200112022001400134 035002500148035002100173040002800194041000800222082002400230 245002700254246000900281264001800290300002100308336002600329 337003200355338003700387362001300424363001900437655009900456 856005300555856006400608^^987874829^^DE-101^^20171201121143. 0^^cr||||||||||||^^080311c20079999|||u||p|o ||| 0||||1eng c^ ^7 ^_2DE-101^_a987874829^^7 ^_2DE-600^_a2415107-5^^ ^_a1940 -5758^^ ^_a(DE-599)ZDB2415107-5^^ ^_a(OCoLC)502377032^^ ^ _a8999^_bger^_cDE-101^_d9999^^ ^_aeng^^74^_a020^_qDE-600^_2 22sdnb^^00^_aCode4Lib journal^_bC4LJ^^3 ^_aC4LJ^^31^_a[S.l.]
^_c2007-^^ ^_aOnline-Ressource^^ ^_aText^_btxt^_2rdaconten t^^ ^_aComputermedien^_bc^_2rdamedia^^ ^_aOnline-Ressource ^_bcr^_2rdacarrier^^0 ^_a1.2007 -^^01^_81.1\x^_a1^_i2007^^ 7 ^_0(DE-588)4067488-5^_0http://d-nb.info/gnd/4067488-5^_0(DE- 101)040674886^_aZeitschrift^_2gnd-content^^4 ^_uhttp://journ al.code4lib.org/^_xVerlag^_zkostenfrei^^4 ^_uhttp://www.bibl iothek.uni-regensburg.de/ezeit/?2415107^_xEZB^^^]
```

---

## Leader, directory and fields

```no-highlight
00251nas a2200121 c 4500
```

```no-highlight
001001000000
007001500010
022001400025
041000800039
245002700047
246000900074
362001300083
856003300096^^
```

```no-highlight
987874829^^
cr||||||||||||^^
  ^_a1940-5758^^
  ^_aeng^^
00^_aCode4Lib journal^_bC4LJ^^
3 ^_aC4LJ^^
0 ^_a1.2007 -^^
4 ^_uhttp://journal.code4lib.org/^^
^]
```

---

## MARC XML

The Library of Congress provides a [framework](https://www.loc.gov/standards/marcxml/) for working with MARC data in XML environments. The framework consists of an XML schema for MARC data ([XSD](https://www.loc.gov/standards/marcxml/schema/MARC21slim.xsd), [XSD illustration](https://www.loc.gov/standards/marcxml/xml/spy/spy.html)), [XSL stylesheets](https://www.loc.gov/standards/marcxml/#stylesheets) and some [tools](https://www.loc.gov/standards/marcxml/marcxml.zip) for transformation and validation of "MARC XML" data.

"MARC XML" is often used to provide MARC data via APIs like [SRU](https://www.loc.gov/standards/sru/index.html) & [OAI](https://www.openarchives.org/pmh/).

The framework defines several ["MARC XML design considerations"](https://www.loc.gov/standards/marcxml/marcxml-design.html); one of them is the "roundtripability from XML back to MARC". The schema doesn't limit the length of records and fields, so many data providers use "MARC XML" to circumvent the length restrictions of "MARC (2709)".

---

## Example "MARC XML" record

```xml
<record xmlns="http://www.loc.gov/MARC21/slim">
  <leader>00251nas a2200121 c 4500</leader>
  <controlfield tag="001">987874829</controlfield>
  <controlfield tag="007">cr||||||||||||</controlfield>
  <datafield tag="022" ind1=" " ind2=" ">
    <subfield code="a">1940-5758</subfield>
  </datafield>
  <datafield tag="041" ind1=" " ind2=" ">
    <subfield code="a">eng</subfield>
  </datafield>
  <datafield tag="245" ind1="0" ind2="0">
    <subfield code="a">Code4Lib journal</subfield>
    <subfield code="b">C4LJ</subfield>
  </datafield>
  ...
</record>
``` --- ## Turbomarc [Index Data](https://www.indexdata.com/) developed "Turbomarc", another XML serialization for MARC data. The primary development goal of "Turbomarc" was [to speed up](https://www.indexdata.com/turbomarc-faster-xml-marc-records/) the processing of MARC data. --- ## Example "Turbomarc" record ```xml
<r xmlns="http://www.indexdata.com/turbomarc">
  <l>00251nas a2200121 c 4500</l>
  <c001>987874829</c001>
  <c007>cr||||||||||||</c007>
  <d022 i1=" " i2=" ">
    <sa>1940-5758</sa>
  </d022>
  <d041 i1=" " i2=" ">
    <sa>eng</sa>
  </d041>
  <d245 i1="0" i2="0">
    <sa>Code4Lib journal</sa>
    <sb>C4LJ</sb>
  </d245>
  ...
</r>
```

---

## Line-based MARC formats

There are several line-based MARC formats. These formats offer a more human-readable serialization of MARC records and are often used to examine, create or update MARC records. Records are separated by a blank line. The formats differ slightly in the representation of MARC tags, indicators and subfields.

---

## MARC Line

"MARC Line" is a simple line-by-line format also developed by Index Data. It is suitable for display but not recommended for further (machine) processing.

```no-highlight
00251nas a2200121 c 4500
001 987874829
007 cr||||||||||||
022 $a 1940-5758
041 $a eng
245 00 $a Code4Lib journal $b C4LJ
246 3 $a C4LJ
362 0 $a 1.2007 -
856 4 $u http://journal.code4lib.org/
```

---

## MARCMaker

This format was developed to create MARC records without having to use a MARC-based system. It is the most widely used line-based format and is supported by several software tools (e.g. Catmandu, MarcEdit) and libraries (e.g. marc4j, pymarc).

```no-highlight
=LDR  00251nas a2200121 c 4500
=001  987874829
=007  cr||||||||||||
=022  \\$a1940-5758
=041  \\$aeng
=245  00$aCode4Lib journal$bC4LJ
=246  3\$aC4LJ
=362  0\$a1.2007 -
=856  4\$uhttp://journal.code4lib.org/
```

---

## MicroLIF

"[MicroLIF](http://web.sonoma.edu/users/h/huangp/MARC_MicroLIF.htm)" is a MARC-compatible record format created by a group of publishers and vendors in the '80s.

```no-highlight
LDR00251nas a2200121 c 4500^
001987874829^
007cr||||||||||||^
022  _a1940-5758^
041  _aeng^
24500_aCode4Lib journal_bC4LJ^
2463 _aC4LJ^
3620 _a1.2007 -^
8564 _uhttp://journal.code4lib.org/^
```

---

## Aleph Sequential

"Aleph Sequential" is a line-based serialization format used by the Ex Libris integrated library system "[Aleph](https://exlibrisgroup.com/products/aleph-integrated-library-system/)".
```no-highlight
987874829 FMT   L BK
987874829 LDR   L 00251nas^a2200121^c^4500
987874829 001   L 987874829
987874829 007   L cr||||||||||||
987874829 022   L $$a1940-5758
987874829 041   L $$aeng
987874829 24500 L $$aCode4Lib journal$$bC4LJ
987874829 2463  L $$aC4LJ
987874829 3620  L $$a1.2007 -
987874829 8564  L $$uhttp://journal.code4lib.org/
```

---

## MARC in JSON (MiJ)

[JSON](https://www.json.org/) is a common lightweight data-interchange format which is also easy for humans to read and write. "MARC in JSON" (MiJ) defines a standard for representing MARC records as JSON objects.

---

## Example "MARC in JSON" record

```json
{
  "leader": "00251nas a2200121 c 4500",
  "fields": [
    { "001": "987874829" },
    {
      "245": {
        "subfields": [
          { "a": "Code4Lib journal" },
          { "b": "C4LJ" }
        ],
        "ind1": "0",
        "ind2": "0"
      }
    }
  ]
}
```

---

## Catmandu JSON

The [Catmandu](http://librecat.org/Catmandu/) data toolkit represents MARC records internally as an "[array of arrays](https://metacpan.org/pod/Catmandu::Importer::MARC#EXAMPLE-ITEM)", which can be exported as JSON or YAML objects.

---

## Example "Catmandu JSON" record

```json
{
  "_id": "987874829",
  "record": [
    [ "LDR", " ", " ", "_", "00251nas a2200121 c 4500" ],
    [ "245", "0", "0", "a", "Code4Lib journal", "b", "C4LJ" ]
  ]
}
```

---

class: middle

## Get MARC 21 data

---

## Open Data

Several libraries and library networks publish their data as "[open data](https://en.wikipedia.org/wiki/Open_data)". [Péter Király](https://github.com/pkiraly) created a list of international open MARC 21 data sets at
. The Internet Archive's [Open Library](http://openlibrary.org/) project is making thousands of library records freely available for anyone's use, see
. You can download the data sets via the command line, e.g.:

```bash
$ wget http://ered.library.upenn.edu/data/opendata/pau.zip
$ unzip pau.zip
```

---

## API

Many libraries offer MARC 21 data via public [APIs](https://en.wikipedia.org/wiki/API) like Z39.50, SRU and OAI.

---

## Z39.50

Z39.50 is a standard ([ANSI/NISO Z39.50-2003](https://www.loc.gov/z3950/agency/Z39-50-2003.pdf)) that defines a client/server based service and protocol for information retrieval. Like MARC 21, Z39.50 has a long history ([Lynch, 1997](http://www.dlib.org/dlib/april97/04lynch.html)) and is maintained by the Library of Congress.

Many libraries offer access to their Online Public Access Catalogues (OPACs) via a Z39.50 server, e.g. the [Library of Congress](https://www.loc.gov/z3950/lcserver.html) or [kobv](https://www.kobv.de/services/recherche/z39-50/). See the ["Bath Profile"](http://www.ukoln.ac.uk/interop-focus/activities/z3950/int_profile/bath/draft/stable1.html#5.A.1.%20Functional%20Area%20A:%20Level%201%20Basic%20Bibliographic%20Search%20and%20Retrieval%20Emphasizing%20Precision) or the ["Bib-1 Attribute Set"](https://software.indexdata.com/yaz/doc/bib1.html) for common search and retrieval operations and attribute sets.

---

## Z39.50 - yaz-client

To retrieve data from Z39.50 servers you need client software like `yaz-client` from [Index Data](https://www.indexdata.com/), which is part of the free open source toolkit "[YAZ](https://www.indexdata.com/resources/software/yaz/)".
```bash
# open client
$ yaz-client
# connect to database
Z> open lx2.loc.gov/LCDB
# set format to MARC
Z> format 1.2.840.10003.5.10
# set element set
Z> element F
# append retrieved records to file
Z> set_marcdump z3950_loc.mrc
# find records for subject
Z> find @attr 5=100 @attr 1=21 "Perl"
# get first 50 records
Z> show 1+50
# close client
Z> exit
```

---

## Z39.50 - yaz-client command file

```
# show command file
$ cat z3950.cmdfile
open lx2.loc.gov/LCDB
format 1.2.840.10003.5.10
element F
set_marcdump z3950_loc.mrc
find @attr 5=100 @attr 1=21 "Perl"
show 1+50
exit

# run command file
$ yaz-client -f z3950.cmdfile

# show records
$ cat -v z3950_loc.mrc
```

---

## Z39.50 - catmandu

The Catmandu toolkit provides a Z39.50 client "[Catmandu::Importer::Z3950](https://metacpan.org/pod/Catmandu::Importer::Z3950)":

```bash
$ catmandu convert Z3950 \
    --host z3950.kobv.de \
    --port 210 \
    --databaseName k2 \
    --preferredRecordSyntax usmarc \
    --queryType PQF \
    --query '@attr 5=100 @attr 1=1003 "Tempest, Kae"' \
    --handler USMARC \
    to MARC --type MARCMaker
```

---

## SRU

[SRU](https://www.loc.gov/standards/sru/) (Search/Retrieve via URL) is another standard protocol for information retrieval. It uses HTTP as the application layer protocol and XML for data serialization. Search queries are expressed in [CQL](https://www.loc.gov/standards/sru/cql/index.html) (Contextual Query Language), a formal language for representing queries.

---

## SRU - yaz-client command file

```
# show command file
$ cat sru.cmdfile
open http://sru.k10plus.de/opac-de-627
set_marcdump sru_k10p.mrc.xml
format marcxml
find pica.per="Tempest, Kae"
show 1+50
exit

# run command file
$ yaz-client -f sru.cmdfile

# show records.
# problem: file contains several XML documents
$ xmllint --format sru_k10p.mrc.xml
```

---

## SRU - catmandu

The Catmandu toolkit also provides an SRU client "[Catmandu::Importer::SRU](https://metacpan.org/pod/Catmandu::Importer::SRU)":

```bash
$ catmandu convert SRU \
    --base https://sru.kobv.de/k2 \
    --recordSchema MARCXML \
    --query 'dc.creator = "Tempest, Kae"' \
    --parser marcxml \
    to MARC --type Line
```

---

## OAI-PMH

[OAI-PMH](https://www.openarchives.org/OAI/openarchivesprotocol.html) (Open Archives Initiative Protocol for Metadata Harvesting) is a protocol for metadata replication and distribution. _Data providers_ host metadata records and their changes over time, so _service providers_ can harvest them. Like SRU, it uses HTTP as the application layer protocol and XML for data serialization.

---

## OAI-PMH - catmandu

The Catmandu toolkit provides an OAI-PMH harvester "[Catmandu::Importer::OAI](https://metacpan.org/pod/Catmandu::Importer::OAI)":

```bash
$ catmandu convert OAI \
    --url http://tudigit.ulb.tu-darmstadt.de/cgi-bin/digioai.cgi \
    --from 2026-01-01 \
    --until 2026-01-07 \
    --metadataPrefix oai_dc \
    --handler oai_dc \
    to YAML
```

---

class: middle

## MARC 21 validation

---

## ... with yaz-marcdump

The command-line tool `yaz-marcdump` can be used for several MARC-related tasks.
To validate the structure of MARC records use the option `-n`, which will omit any other output:

```bash
# validate MARC ISO records
$ yaz-marcdump -n loc.mrc

# validate MARC XML records
$ yaz-marcdump -n -i marcxml loc.mrc.xml
```

---

If `yaz-marcdump` finds any errors it will output an error message:

```bash
$ yaz-marcdump -np bad_hathi_records.mrc
```

---

## [Common structural problems](https://bibwild.wordpress.com/2010/02/02/structural-marc-problems-you-may-encounter/) in MARC records:

- invalid leader bytes
- record exceeds the maximum length
- record field exceeds the maximum length
- invalid subfield element
- MARC control character in internal data value
- wrongly encoded characters

---

## ... with xmllint

Use `xmllint` to validate "MARC XML" data against the MARC [XSD schema](https://www.loc.gov/standards/marcxml/schema/MARC21slim.xsd). If you just want to validate the structure of "MARC XML" records, use the options `--noout` (which will omit any other output) and `--schema` (path to the XSD file):

```bash
$ xmllint --noout \
    --schema MARC21slim.xsd \
    loc.mrc.xml
loc.mrc.xml validates

$ xmllint --noout \
    --schema MARC21slim.xsd \
    chabon-bad-subfields-element.xml
chabon-bad-subfields-element.xml:8: element subfields: Schemas validity error : Element '{http://www.loc.gov/MARC21/slim}subfields': This element is not expected. Expected is ( {http://www.loc.gov/MARC21/slim}subfield ).
chabon-bad-subfields-element.xml fails to validate
```

---

## ... with marcvalidate

While `yaz-marcdump` and `xmllint` are useful to identify structural problems within MARC records, `marcvalidate` can be used to validate MARC tags and subfields against an [Avram](https://format.gbv.de/schema/avram/specification) specification. The default specification was built by [Péter Király](https://pkiraly.github.io/2018/01/28/marc21-in-json/) based on the MARC documentation of the Library of Congress. The specification can be extended with locally defined fields.
```bash
# validate MARC ISO records
$ marcvalidate loc.mrc
12360325 906 unknown field
1180649 035 unknown subfield 9
...

# validate MARC XML records
$ marcvalidate --type XML loc.mrc.xml
12360325 906 unknown field
1180649 035 unknown subfield 9
...

# validate against custom schema
$ marcvalidate --schema my_schema.json loc.mrc
```

---

class: middle

## MARC statistics

---

## ... with marcstats.pl

To generate statistics for tags and subfield codes of "MARC (ISO 2709)" records use `marcstats.pl`.

```bash
$ marcstats.pl loc.mrc
Statistics for 50 records

Tag Rep. Occ.,%
001      100.00
005      100.00
006        2.00
020       76.00
  a       76.00
  q        2.00
035 [Y]   48.00
  9 [Y]   18.00
  a [Y]   30.00
...
```

---

## ... with Catmandu

If you want to generate statistics for other MARC serializations use [Catmandu::Breaker](https://metacpan.org/pod/Catmandu::Breaker). First you need to "break" the MARC records into pieces. Afterwards you can calculate statistics for MARC tags and subfield codes.

```bash
$ catmandu convert MARC --type XML to Breaker --handler marc \
    < loc.mrc.xml > loc.breaker
$ catmandu breaker loc.breaker
```

With the option `--fields` you can calculate statistics for specific tags and subfield codes:

```bash
$ catmandu breaker --fields 245a,020a loc.breaker
| name | count | zeros | zeros% | min | max | mean | variance | stdev | uniq~ | uniq% | entropy |
|------|-------|-------|--------|-----|-----|------|----------|-------|-------|-------|---------|
| #    | 50    |       |        |     |     |      |          |       |       |       |         |
| 245a | 50    | 0     | 0.0    | 1   | 1   | 1    | 0.0      | 0.0   | 45    | 90.1  | 5.4/5.6 |
| 020a | 52    | 12    | 24.0   | 0   | 4   | 1.04 | 0.8      | 0.9   | 51    | 98.2  | 5.3/6.0 |
```

Use the option `--as` to specify a tabular output format (CSV, TSV, XLS(X)):

```bash
$ catmandu breaker --as XLSX loc.breaker > loc.xlsx
```

---

class: middle

## Unicode

---

## MARC-8 and Unicode

"MARC (ISO 2709)" records can be encoded in one of two character coding schemes: [MARC-8](https://www.loc.gov/marc/specifications/specchartables.html) or
[UCS/Unicode](https://www.iso.org/standard/69119.html).

Use `yaz-marcdump` to convert the encoding of MARC records. Specify the encodings with the options `-f` and `-t`. With the option `-l` you can set the character coding scheme in MARC leader position 09.

```bash
$ yaz-marcdump -f MARC-8 -t UTF-8 -o marc -l 9=97 marc21.raw \
    > marc21.utf8.raw
```

A conversion from UTF-8 to MARC-8 is not recommended, because it could be "lossy".

---

## Unicode normalization

Unicode provides single code points for many characters that could be viewed as combinations of two or more characters, e.g. German umlauts:

| Composed/NFC | Decomposed/NFD |
|----------|------------|
| ä ([Latin Small Letter A with Diaeresis](https://www.compart.com/en/unicode/U+00E4) U+00E4) | a ([Latin Small Letter A](https://www.compart.com/en/unicode/U+0061) U+0061) + ◌̈ ([Combining Diaeresis](https://www.compart.com/en/unicode/U+0308) U+0308) |

---

## uconv

With the command-line utility `uconv` you can transliterate data between different Unicode [normalization forms](https://unicode.org/reports/tr15/#Norm_Forms):

```bash
$ uconv -x NFC marc21.nfd.xml > marc21.nfc.xml
$ uconv -x NFD marc21.nfc.xml > marc21.nfd.xml
```

You should only normalize "MARC XML" data; normalizing "MARC (ISO 2709)" records would corrupt them, because the field lengths recorded in the directory would no longer match.

Use the option `-x Any-Name` to show the Unicode names of characters:

```bash
$ echo -en 'ÅÅ' | uconv -x Any-Name
\N{ANGSTROM SIGN}\N{LATIN CAPITAL LETTER A WITH RING ABOVE}
```

---

class: middle

## Transformation of MARC data

---

## ... with yaz-marcdump

`yaz-marcdump` can be used to transform MARC data between different serializations. Use the options `-i` and `-o` to specify the input and output formats.
```bash
# MARC (ISO 2709) to MARC XML
$ yaz-marcdump -i marc -o marcxml code4lib.mrc > code4lib.xml

# MARC (ISO 2709) to Turbomarc
$ yaz-marcdump -i marc -o turbomarc code4lib.mrc > code4lib.turbo.xml

# MARC (ISO 2709) to MARC Line
$ yaz-marcdump -i marc -o line code4lib.mrc > code4lib.line

# MARC XML to MARC-in-JSON
$ yaz-marcdump -i marcxml -o json code4lib.mrc.xml > code4lib.json
```

---

## ... with Catmandu

The command-line interface of the Catmandu toolkit also offers several transformations of MARC data. The default MARC serialization is "MARC (ISO 2709)".

```bash
# MARC (ISO 2709) to MARC XML
$ catmandu convert MARC to MARC --type XML < code4lib.mrc \
    > code4lib.xml

# MARC XML to MARC (ISO 2709)
$ catmandu convert MARC --type XML to MARC < code4lib.xml \
    > code4lib.mrc

# MARC (ISO 2709) to MARCMaker
$ catmandu convert MARC to MARC --type MARCMaker < code4lib.mrc \
    > code4lib.mrk

# MARC XML to MARC-in-JSON
$ catmandu convert MARC --type XML to MARC --type MiJ \
    < code4lib.xml > code4lib.json

# MARC (ISO 2709) to YAML
$ catmandu convert MARC to YAML < code4lib.mrc \
    > code4lib.yml
```

---

## Breaker

The [Catmandu::Breaker](https://metacpan.org/pod/Catmandu::Breaker) module "breaks" data into smaller components and exports them line by line:

```bash
$ catmandu convert MARC to Breaker --handler marc < code4lib.mrc
987874829 LDR 01031nas a2200337 c 4500
987874829 001 987874829
987874829 003 DE-101
987874829 005 20200306093601.0
987874829 007 cr||||||||||||
987874829 008 080311c20079999|||u||p|o ||| 0||||1eng c
987874829 0162 DE-101
987874829 016a 987874829
987874829 0162 DE-600
987874829 016a 2415107-5
987874829 022a 1940-5758
987874829 035a (DE-599)ZDB2415107-5
...
```

---

You can process this output with other command-line utilities like `grep`, `sort` and `uniq`.
For example, to extract all ISBNs from a MARC data set, we can build a command-line [pipeline](https://en.wikipedia.org/wiki/Pipeline_(Unix)) like this:

```bash
$ catmandu convert MARC to Breaker --handler marc < loc.mrc \
    | grep -P '\t020a' | cut -f 3 | grep -oP '^[\dX]+' | sort | uniq -c
1 0072123397
1 0130284181
1 0201422190
1 0470176431
2 0596002270
...
```

---

## Generic file formats

With Catmandu you can export data to generic data formats like CSV, JSON, TSV, XLSX and YAML.

MARC serializations are "complex/nested data structures" which cannot be stored in flat data structures like tables. You can export MARC records to nested formats like JSON and YAML:

```bash
$ catmandu convert MARC to YAML < code4lib.mrc
$ catmandu convert MARC to JSON < code4lib.mrc
```

This will **not** work:

```bash
$ catmandu convert MARC to CSV < code4lib.mrc
$ catmandu convert MARC to TSV < code4lib.mrc
$ catmandu convert MARC to XLSX < code4lib.mrc
```

You need to use "[Catmandu::Fix](https://metacpan.org/pod/Catmandu::Fix)" to extract and map your data to a tabular data structure:

```bash
$ catmandu convert MARC to CSV \
    --fix 'marc_map(245abc,dc_title,join:" ");retain_field(dc_title)' \
    < code4lib.mrc
```

---

## ... with XSLT

If you want to transform MARC records to other formats, you have to map MARC (sub)fields to corresponding fields of the other format.
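
The core of such a transformation is a lookup table from MARC tags and subfield codes to target fields. A minimal sketch of the idea in Python (hypothetical code for illustration only; the mapping shown is a tiny subset of a real crosswalk, which would also consider indicators):

```python
# A tiny MARC -> Dublin Core mapping table (illustrative subset only).
CROSSWALK = {
    ("245", "a"): "dc:title",
    ("100", "a"): "dc:creator",
    ("020", "a"): "dc:identifier",
    ("650", "a"): "dc:subject",
}

def map_record(fields):
    """Map (tag, subfield code, value) triples to target elements."""
    out = {}
    for tag, code, value in fields:
        target = CROSSWALK.get((tag, code))
        if target:
            out.setdefault(target, []).append(value)
    return out

fields = [
    ("245", "a", "Perl :"),
    ("245", "b", "the complete reference /"),
    ("650", "a", "Perl (Computer program language)"),
]
print(map_record(fields))
# {'dc:title': ['Perl :'], 'dc:subject': ['Perl (Computer program language)']}
```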
The Library of Congress provides several crosswalks:

* [MARC to MODS](https://www.loc.gov/standards/mods/mods-mapping.html)
* [MODS to MARC](https://www.loc.gov/standards/mods/v3/mods2marc-mapping.html)
* [MARC to Dublin Core](https://www.loc.gov/marc/marc2dc.html)
* [Dublin Core to MARC](https://www.loc.gov/marc/dccross.html)
* [ONIX to MARC](https://www.loc.gov/marc/onix2marc.html)

Based on these crosswalks the Library of Congress published several [XSL stylesheets](https://www.loc.gov/standards/marcxml/#stylesheets), which can be used with an XSLT processor to transform "MARC XML" records to other formats like BIBFRAME, HTML, MODS, OAI-DC and RDF.

---

## xsltproc

```bash
# MARC XML to HTML
$ xsltproc MARC21slim2HTML.xsl loc.mrc.xml > loc.html

# MARC XML to OAI-DC
$ xsltproc MARC21slim2OAIDC.xsl loc.mrc.xml > loc.oaidc.xml

# MARC XML to RDF-DC
$ xsltproc MARC21slim2RDFDC.xsl loc.mrc.xml > loc.rdfdc.xml

# MARC XML to BIBFRAME (https://github.com/lcnetdev/marc2bibframe2)
$ xsltproc bibframe-xsl/marc2bibframe2.xsl loc.mrc.xml \
    > loc.bibframe.xml
```

---

class: middle

## Extract data from MARC records

---

## ... with xmllint

First check if an [XML namespace](https://www.w3.org/TR/xml-names/) is declared in the document:

```bash
$ head loc.mrc.xml
<?xml version="1.0" encoding="UTF-8"?>
<collection xmlns="http://www.loc.gov/MARC21/slim">
  <record>
    <leader>01227cam a22002894a 4500</leader>
    <controlfield tag="001">12360325</controlfield>
    <controlfield tag="005">20070126075126.0</controlfield>
    <controlfield tag="008">010327s2001 nyua 001 0 eng</controlfield>
    <datafield tag="906" ind1=" " ind2=" ">
      <subfield code="a">7</subfield>
      <subfield code="b">cbc</subfield>
      <subfield code="c">orignew</subfield>
```

---

If a namespace is set, use the "local" XML element name in the [XPath](https://www.w3.org/TR/2017/REC-xpath-31-20170321/) expression:

```bash
# no XML namespace
$ xmllint --xpath '//controlfield/@tag' \
    loc.mrc.xml

# with XML namespace
$ xmllint --xpath '//*[local-name()="controlfield"]/@tag' \
    loc.mrc.xml
```

---

## xmllint --xpath

```bash
# extract all tags and count them
$ xmllint --xpath '//@tag' loc.mrc.xml | sort | uniq -c

# extract all IDs from MARC 001
$ xmllint --xpath '//*[local-name()="controlfield"][@tag="001"]/text()' loc.mrc.xml

# extract all subfields from MARC 245 fields
$ xmllint --xpath '//*[local-name()="datafield"][@tag="245"]' loc.mrc.xml

# extract subfield "a" from MARC 245 fields
$ xmllint --xpath '//*[local-name()="datafield"][@tag="245"]/*[local-name()="subfield"][@code="a"]' loc.mrc.xml

# extract content from subfield "a" from MARC 245 fields
$ xmllint --xpath '//*[local-name()="datafield"][@tag="245"]/*[local-name()="subfield"][@code="a"]/text()' loc.mrc.xml

# extract all ISBNs
$ xmllint --xpath '//*[local-name()="datafield"][@tag="020"]/*[local-name()="subfield"][@code="a"]/text()' loc.mrc.xml

# extract all DDC numbers
$ xmllint --xpath '//*[local-name()="datafield"][@tag="082"]/*[local-name()="subfield"][@code="a"]/text()' loc.mrc.xml
```

---

## ... Catmandu

Catmandu uses a [domain specific language](https://en.wikipedia.org/wiki/Domain-specific_language) (DSL) called "fix" to extract, map and transform data. Several "fixes" for library-specific data formats like [MARC](https://metacpan.org/pod/Catmandu::MARC) and [PICA](https://metacpan.org/pod/Catmandu::PICA) are available. The most common "fixes" are documented in the [cheat sheet](https://librecat.org/assets/catmandu_cheat_sheet.pdf).
"Fixes" can be used as command-line options or stored in a "fix" file:

```bash
$ catmandu convert MARC to CSV \
    --fix 'marc_map(001,id); retain_field(id)' < loc.mrc

$ catmandu convert MARC to YAML --fix marc2dc.fix < loc.mrc
```

---

## marc_map

With [`marc_map`](https://metacpan.org/pod/Catmandu::Fix::marc_map) you can extract (sub)fields from MARC records and map them to your own data model:

```no-highlight
marc_map(001,dc_identifier)

# {"dc_identifier":"12360325"}
```

---

## Extract part of a field

MARC uses several "fixed-length" fields, where data elements are positionally defined. E.g. if you want to extract the language code from MARC 008, specify the positions with `/35-37`:

```no-highlight
marc_map(008/35-37,dc_language)

# {"dc_language":"eng"}
```

---

## Extract fields with specific indicators

If you want to extract fields with certain indicators, specify them within square brackets `[1,4]`:

```no-highlight
marc_map("246[1,4]",marc_varyingFormOfTitle)

# {"marc_varyingFormOfTitle":"Games, diversions & Perl culture"}
```

---

## Extract subfields

To extract certain subfields from a MARC data field use the subfield codes. By default, all extracted subfields are joined into one string. Use the option `join` to join them with a custom separator. With the option `split:1` you can split the subfields into a list. Use the option `pluck` if you want to extract the subfields in a certain order.

```no-highlight
marc_map(245ab,dc_title,join:' ')
# {"dc_title":"Perl : the complete reference /"}

marc_map(245ab,dc_title,split:1)
# {"dc_title":["Perl :","the complete reference /"]}

marc_map(245ba,dc_title,split:1,pluck:1)
# {"dc_title":["the complete reference /","Perl :"]}
```

---

## Extract repeatable fields

MARC data fields can be repeated. Use the option `split:1` to create a list from all fields.
```no-highlight
marc_map(650a,dc_subject,split:1)

# {"dc_subject":["Data mining.","Text processing (Computer science)","Perl (Computer program language)"]}
```

---

## Extract repeatable subfields

MARC subfields can be repeated within a MARC data field. Use the option `split:1` to create a list from all fields. To create a list of the subfields within each data field, use the option `nested_arrays:1`, which will return a "list of lists" of subfields, one list for each data field.

```no-highlight
marc_map(655ay,marc_indexTermGenre,split:1)
# {"marc_indexTermGenre":["Portrait photographs","1910-1920.","Photographic prints","1910-1920."]}

marc_map(655ay,marc_indexTermGenre,split:1,nested_arrays:1)
# {"marc_indexTermGenre":[["Portrait photographs","1910-1920."],["Photographic prints","1910-1920."]]}
```

---

## Extract subfields by value

To extract a subfield only if another subfield in the same data field has a certain value, use a [loop](https://metacpan.org/pod/Catmandu::Fix::Bind::marc_each) with a [condition](https://metacpan.org/pod/Catmandu::Fix::Condition).

```no-highlight
=856 4\$uhttp://journal.code4lib.org/$xVerlag$zkostenfrei
=856 4\$uhttp://www.bibliothek.uni-regensburg.de/ezeit/?2415107$xEZB
```

```no-highlight
do marc_each()
  if marc_match(856x,EZB)
    marc_map(856u,ezb_uri)
  end
end

# {"ezb_uri":"http://www.bibliothek.uni-regensburg.de/ezeit/?2415107"}
```

---

## Conditions

Use the conditions [`marc_has`](https://metacpan.org/pod/Catmandu::Fix::Condition::marc_has), [`marc_has_many`](https://metacpan.org/pod/Catmandu::Fix::Condition::marc_has_many) or [`marc_match`](https://metacpan.org/pod/Catmandu::Fix::Condition::marc_match) to check if a record has certain fields or matches certain conditions.
```no-highlight
set_array(errors)

# Check if a 245 field is present
unless marc_has('245')
  set_field(errors.$append,"no 245 field")
end

# Check if there is more than one 245 field
if marc_has_many('245')
  set_field(errors.$append,"more than one 245 field?")
end

# Check if 008 positions 07 to 10 contain a
# 4-digit number ('\d' means digit)
unless marc_match('008/07-10','\d{4}')
  set_field(errors.$append,"no 4-digit year in 008 positions 07-10")
end
```

---

## Add fields to a record

You can add fields to MARC records with [`marc_add`](https://metacpan.org/pod/Catmandu::Fix::marc_add):

```no-highlight
marc_add(999,a,my,b,local,c,field)
marc_add(900,a,$.my.field)
```

---

## Append values to (sub)fields

Use [`marc_append`](https://metacpan.org/pod/Catmandu::Fix::marc_append) to append values to a (sub)field:

```no-highlight
marc_append(001,'-X')
marc_append(100a,' [author]')
```

---

## Assign a value to (sub)fields

Assign a new value to a MARC (sub)field with [`marc_set`](https://metacpan.org/pod/Catmandu::Fix::marc_set):

```no-highlight
marc_set(001,123456789)
marc_set(245a,'Perl - battle tested.')
```

---

## Remove (sub)fields

Use [`marc_remove`](https://metacpan.org/pod/Catmandu::Fix::marc_remove) to remove (sub)fields from MARC records:

```no-highlight
marc_remove(991)
marc_remove(9..)
marc_remove(0359)
```

---

## Replace strings in (sub)fields

Use [`marc_replace_all`](https://metacpan.org/pod/Catmandu::Fix::marc_replace_all) to replace a string in MARC (sub)fields:

```no-highlight
marc_replace_all(001,1,X)
marc_replace_all(245a,Perl,"Perl [programming language]")
```

---

## Filter MARC records

You can filter MARC records from a dataset with [`reject`](https://metacpan.org/pod/Catmandu::Fix::reject) or `select`:
```no-highlight
reject marc_has_many(245)
select marc_match(245a,Perl)
```

---

## Validate MARC records

You can [`validate`](https://metacpan.org/pod/Catmandu::Fix::validate) MARC records and collect the error messages, or filter [`valid`](https://metacpan.org/pod/Catmandu::Fix::Condition::valid) records:

```no-highlight
validate(.,MARC,error_field: errors)
select valid(.,MARC)
```

---

## Dictionaries

MARC uses codes for [languages](https://www.loc.gov/marc/languages/language_code.html) and [countries](https://www.loc.gov/marc/countries/countries_code.html). You can build dictionaries based on these lists and [lookup](https://metacpan.org/pod/Catmandu::Fix::lookup) names for these codes.

```csv
$ less languages.csv
eng,English
enm,"English, Middle (1100-1500)"
epo,Esperanto
esk,Eskimo languages
est,Estonian
...
```

```no-highlight
# { "dc_language": "eng" }

lookup(dc_language,languages.csv)
lookup(dc_language,languages.csv,default:English)
lookup(dc_language,languages.csv,delete:1)

# { "dc_language": "English" }
```

---

## Normalize ISBNs and ISSNs

Use [`issn`](https://metacpan.org/pod/Catmandu::Fix::issn), [`isbn10`](https://metacpan.org/pod/Catmandu::Fix::isbn10) or [`isbn13`](https://metacpan.org/pod/Catmandu::Fix::isbn13) to normalize international identifiers:
```no-highlight
# { "issn" : "1553667x" }
issn(issn)
# { "issn" : "1553-667X" }

# { "isbn" : "1565922573" }
isbn10(isbn)
# { "isbn" : "1-56592-257-3" }
isbn13(isbn)
# { "isbn" : "978-1-56592-257-0" }
```

---

## Links

- [Avram schema for MARC 21](https://pkiraly.github.io/2018/01/28/marc21-in-json/)
- [Catmandu cheat sheet](http://librecat.org/assets/catmandu_cheat_sheet.pdf)
- [Catmandu mapping rules](https://github.com/LibreCat/Catmandu-MARC/wiki/Mapping-rules)
- [Catmandu::MARC::Tutorial](https://metacpan.org/dist/Catmandu-MARC/view/lib/Catmandu/MARC/Tutorial.pod)
- [MARC Standards](https://www.loc.gov/marc/)
- [MARC 21 format for Bibliographic Data](https://www.loc.gov/marc/bibliographic/)
- [Tutorial "Processing MARC ... with open source tools"](https://jorol.github.io/processing-marc/#/)

---

## Literature

- Henriette Avram (1975): *MARC; its History and implications.*
- Bernhard Eversberg (1999): *Was sind und was sollen Bibliothekarische Datenformate* [urn:nbn:de:gbv:084-11032313237](https://nbn-resolving.org/urn%3Anbn%3Ade%3Agbv%3A084-11032313237)
- Roy Tennant (2002): *MARC Must Die.*
- William E. Moen, Penelope Benardino (2003): *Assessing Metadata Utilization: An Analysis of MARC Content Designation Use*
- Karen Smith-Yoshimura, Catherine Argus, Timothy J. Dickey, Chew Chiat Naun, Lisa Rowlinson de Ortiz & Hugh Taylor (2010): *Implications of MARC Tag Usage on Library Metadata Practices*
- Roy Tennant (2013-2018): *MARC Usage in WorldCat*
(no longer available)
- Péter Király (2019): *Validating 126 million MARC records* [10.1145/3322905.3322929](https://doi.org/10.1145/3322905.3322929)
- Péter Király (2019): *Measuring Metadata Quality* [10.13140/RG.2.2.33177.77920](https://doi.org/10.13140/RG.2.2.33177.77920)

---

## Contact details

Johann Rolschewski

johann.rolschewski@sbb.spk-berlin.de

Moritz Gadischke

moritz.gadischke@sbb.spk-berlin.de

Staatsbibliothek zu Berlin

https://staatsbibliothek-berlin.de/