Catmandu - a data toolkit

class: center middle
![LibreCat](./img/librecat.png "Logo LibreCat")
# Catmandu
### a data toolkit

#### csv,conf,v2 3-4 May 2016 in Berlin, Germany
##### Johann Rolschewski

---
class: middle
## Libraries collect data ...

* books
* journals
* articles
* maps
* manuscripts
* sheets of music
* ...

---
class: middle
## Libraries create metadata ... 
  * bibliographic descriptions
  * holdings information
  * references
  * patron data
  * ...

---
class: middle center
## Metadata
![Katalog](./img/code4lib.png "Katalog")

---
class: middle
## Metadata

... catalogued in library-specific formats (MARC, MAB2, PICA, ...)

... provided via library-specific APIs (OAI, SRU, Z39.50, ...)

... used in diverse systems (OPACs, discovery systems, institutional repositories, link resolvers, ...)

---
class: center middle
## Demand

... for a library-specific metadata toolkit

---
class: middle
## Catmandu

... created in 2012 as an open collaboration ("LibreCat") of the three university libraries of **Bielefeld**, **Gent** and **Lund**

... built up an international community with a dozen active contributors

... used by university libraries, archives and commercial implementers

---
class: middle
## Catmandu

... supports __"Extract, Transform, Load"__ (ETL) processes

... to *extract* metadata records from various sources, *transform* them into new formats and *load* them into databases, NoSQL stores or search engines
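
---
class: middle
## ETL - sketched in Python

... a minimal sketch of the three ETL steps in plain Python, not Catmandu itself. The sample data and the function names `extract`, `transform` and `load` are assumptions made for illustration.

```python
import csv
import io
import json

# Hypothetical input: a tiny semicolon-separated "source" held in a string.
SOURCE = "id;title\n1;perl basics\n2;learning r\n"

def extract(text):
    """Extract: parse the CSV text into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter=";"))

def transform(record):
    """Transform: return a copy of the record with a title-cased title."""
    return {**record, "title": record["title"].title()}

def load(records):
    """Load: serialise to JSON lines (a stand-in for a store or index)."""
    return "\n".join(json.dumps(r) for r in records)

records = [transform(r) for r in extract(SOURCE)]
print(load(records))
```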

---
class: middle
## Tools

- __Command line tool__: catmandu
- __Importers__: CSV, DBI, JSON, LDAP, MAB2, MARC, OAI-PMH, PICA, RDF, RIS, SRU, Text, TSV, Twitter, Wikidata, XLS(X), YAML, Z39.50, ...
- __Exporters__: CSV, JSON, MAB2, MARC, PICA, RDF, RIS, Template, Text, TSV, XLS(X), XML, YAML, ...
- __Stores__:  Aleph, CouchDB, DBI, Elasticsearch, FedoraCommons, MongoDB, Solr, ...
- __Transformation__: Fix or any program that can read and write JSON
- __API__: Perl
- __Web development__: Dancer, PSGI

---
class: middle
## Fix

... a small __domain-specific language__ (DSL) for manipulation of data

... consists of:

* __paths__ to refer to particular parts of an item
* __functions__ to manipulate (parts of) an item
* __conditionals__ to control when to apply which fix functions
* __binds__ to manipulate the execution of fix functions

---
class: middle
## Nested Data Structures

```json
{
  "preferredName" : "Larry Wall",
  "surname" : "Wall",
  "forename" : "Larry",
  "describedBy" : {
    "valid" : "2016-04-14T11:19:01+0200",
    "license" : "http://creativecommons.org/publicdomain/zero/1.0/legalcode",
    "id" : "http://hub.culturegraph.org/entityfacts/138937079"
  },
  "dateOfBirth" : "1954",
  "professionOrOccupation" : [ {
    "id" : "http://d-nb.info/gnd/4139395-8",
    "value" : "Informatiker"
  } ],
  "depiction" : {
    "image" : "https://commons.wikimedia.org/wiki/Special:FilePath/...",
    "thumbnail" : "https://commons.wikimedia.org/wiki/Special:FilePath/...",
    "url" : "https://commons.wikimedia.org/wiki/..."
  }
}
```

---
class: middle
## Paths

... to reference data within deeply nested data structures

... uses "__dot notation__"

```json
surname → "Wall"
describedBy.id → "http://hub.culturegraph.org/entityfacts/138937079"
professionOrOccupation.0.value → "Informatiker"
```
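
---
class: middle
## Paths - sketched in Python

... a rough model of how a dot-notation path could be resolved, assuming numeric segments index arrays and all other segments are hash keys. The helper `get_path` is hypothetical and is not how Catmandu implements paths.

```python
def get_path(data, path):
    """Follow a dot-separated path through nested dicts and lists."""
    current = data
    for key in path.split("."):
        if isinstance(current, list):
            current = current[int(key)]  # numeric segment: array index
        else:
            current = current[key]       # any other segment: hash key
    return current

# Excerpt of the record from the "Nested Data Structures" slide.
record = {
    "surname": "Wall",
    "describedBy": {"id": "http://hub.culturegraph.org/entityfacts/138937079"},
    "professionOrOccupation": [
        {"id": "http://d-nb.info/gnd/4139395-8", "value": "Informatiker"}
    ],
}

print(get_path(record, "surname"))
print(get_path(record, "describedBy.id"))
print(get_path(record, "professionOrOccupation.0.value"))
```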

---
class: middle
## Paths

... to add data to deeply nested data structures

```perl
add_field(foo.bar.0.test.1.key,value)
```

```json
{
  "foo" : {
    "bar" : [
      { "test" : [
        null,
        {"key" : "value"}
        ]
      }
    ]
  }
}
```
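
---
class: middle
## Paths - creating structure (sketch)

... one way to picture why `add_field` on a deep path yields the structure shown before: missing hashes and arrays are created on the way, and arrays are padded with `null`. The helper `set_path` is a hypothetical Python model, not Catmandu code.

```python
def set_path(data, path, value):
    """Set value at a dot path, creating dicts and lists on the way."""
    keys = path.split(".")
    current = data
    for key, next_key in zip(keys, keys[1:] + [None]):
        last = next_key is None
        # A numeric next segment means the child has to be an array.
        child = value if last else ([] if next_key.isdigit() else {})
        if isinstance(current, list):
            index = int(key)
            # Pad with None (JSON null) until the index exists.
            current.extend([None] * (index + 1 - len(current)))
            if last or current[index] is None:
                current[index] = child
            current = current[index]
        else:
            if last or key not in current:
                current[key] = child
            current = current[key]
    return data

print(set_path({}, "foo.bar.0.test.1.key", "value"))
```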

---
class: middle
## Paths

```bash
$append   - Add a new item at the end of an array
$prepend  - Add a new item at the start of an array
$first    - Syntactic sugar for index '0' (the head of the array)
$last     - Syntactic sugar for index '-1' (the tail of the array)
*         - Wildcard for all array elements
```

```perl
# {}
add_field(dc_creator.$append,"Wall, Larry")
# { "dc_creator" : [ "Wall, Larry" ] }
set_field(dc_creator.$first,"Christiansen, Tom")
# { "dc_creator" : [ "Christiansen, Tom" ] }

```

---
class: middle
## Fix functions - field

```perl
# {}
add_field(name,'Christiansen, Tom')
# { "name" : "Christiansen, Tom" }
set_field(name,'Wall, Larry')
# { "name" : "Wall, Larry" }
copy_field(name,dc.creator)
# { "name" : "Wall, Larry", "dc" : { "creator" : "Wall, Larry" } }
remove_field(name)
# { "dc" : { "creator" : "Wall, Larry" } }
move_field(dc.creator,dc_creator)
# { "dc" : {}, "dc_creator" : "Wall, Larry" }
retain_field(dc_creator)
# { "dc_creator" : "Wall, Larry" }
```

```perl
# { "subjects" : "Perl,R,JavaScript,Perl,R" }
split_field(subjects,',')
sort_field(subjects)
uniq(subjects)
# { "subjects" : ["JavaScript", "Perl", "R"] }
join_field(subjects,'; ')
# { "subjects" : "JavaScript; Perl; R" }
```
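
---
class: middle
## Fix functions - field (sketch)

... the split, sort, uniq and join chain above, retraced step by step in plain Python. This only mirrors the effect on the sample record and assumes `uniq` keeps the first occurrence of each value.

```python
record = {"subjects": "Perl,R,JavaScript,Perl,R"}

# split_field: string -> list of strings
record["subjects"] = record["subjects"].split(",")

# sort_field: order the list
record["subjects"] = sorted(record["subjects"])

# uniq: drop duplicates, keeping the first occurrence of each value
record["subjects"] = list(dict.fromkeys(record["subjects"]))
subjects_list = list(record["subjects"])

# join_field: list -> string
record["subjects"] = "; ".join(record["subjects"])

print(subjects_list)
print(record)
```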

---
class: middle
## Fix functions - string

```perl
# { "name" : "Wall" }
upcase(name);
# { "name" : "WALL" }
downcase(name);
# { "name" : "wall" }
capitalize(name);
# { "name" : "Wall" }
append(name,', Larry');
# { "name" : "Wall, Larry" }
prepend(name,'Dr. ');
# { "name" : "Dr. Wall, Larry" }
```

---
class: middle
## Fix functions - string

```perl
# { "name" : " Christiansen,  " }
trim(name);
# { "name" : "Christiansen," }
trim(name,'nonword');
# { "name" : "Christiansen" }
substring(name, 0, 1);
# { "name" : "C" }
```

```perl
# { "format" : "MARC21"}
replace_all(format, '\d', '')
# { "format" : "MARC"}

# { "id" : [ "123-4", "567-X" ] }
replace_all(id.*, '-[0-9xX]$', '')
# { "id" : [ "123", "567" ] }
```
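
---
class: middle
## Fix functions - regular expressions (sketch)

... the two `replace_all` calls above, mirrored with Python's `re.sub` so the patterns can be tried outside Catmandu. The list comprehension stands in for the `id.*` wildcard.

```python
import re

record = {"format": "MARC21", "id": ["123-4", "567-X"]}

# replace_all(format, '\d', ''): remove every digit
record["format"] = re.sub(r"\d", "", record["format"])

# replace_all(id.*, '-[0-9xX]$', ''): strip a trailing check character
# from every element of the array
record["id"] = [re.sub(r"-[0-9xX]$", "", value) for value in record["id"]]

print(record)
```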

---
class: middle
## Fix functions - numbers

```perl
# { "numbers" : [ 1, 2, 3 ] }
copy_field(numbers,count)
count(count)
copy_field(numbers,sum)
sum(sum)
copy_field(numbers,mean)
stat_mean(mean)
copy_field(numbers,variance)
stat_variance(variance)
# { "numbers" : [ 1, 2, 3 ], "count" : 3, "sum" : 6,
# "mean" : 2, "variance" : 0.67 }

```
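
---
class: middle
## Fix functions - numbers (sketch)

... the same figures recomputed with Python's standard library. The population variance is used here on the assumption that it is what produces the 0.67 shown above (the sample variance of these numbers would be 1).

```python
import statistics

numbers = [1, 2, 3]

summary = {
    "count": len(numbers),
    "sum": sum(numbers),
    "mean": statistics.mean(numbers),
    # Population variance, rounded to two decimals as on the slide.
    "variance": round(statistics.pvariance(numbers), 2),
}

print(summary)
```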

---
class: middle
## Fix functions - special data formats

```perl
marc_map(008_/35-37,language)
marc_map(245[10]a,title)
```
```perl
mab_map(331[ ],title)
mab_map(406jk,coverage.$append, -join => ' - ')
```
```perl
pica_map(009Qa,primaryTopicOf.$append)
pica_map(027A[01]a,varyingFormOfTitle)
```

---
class: middle
## Fix functions - identifier

```perl
# { "issn" : "1553667x" }
issn(issn)
# { "issn" : "1553-667X" }

# { "isbn" : "1565922573" }
isbn13(isbn)
# { "isbn" : "978-1-56592-257-0" }

# { }
uuid(id)
# { "id" : "4162F712-1DD2-11B2-B17E-C09EFE1DC403" }
```
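
---
class: middle
## Fix functions - identifier (sketch)

... what the normalisation amounts to arithmetically. The helpers `normalize_issn` and `isbn10_to_isbn13` are hypothetical; they do no validation, and the ISBN is returned without hyphens because hyphenation needs the ISBN range tables.

```python
def normalize_issn(raw):
    """Upper-case an 8-character ISSN and insert the hyphen."""
    compact = raw.replace("-", "").upper()
    return f"{compact[:4]}-{compact[4:]}"

def isbn10_to_isbn13(isbn10):
    """Convert an ISBN-10 to an unhyphenated ISBN-13."""
    # Prefix 978 and drop the old ISBN-10 check digit.
    core = "978" + isbn10.replace("-", "")[:9]
    # ISBN-13 check digit: weights alternate 1, 3, 1, 3, ...
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(core))
    return core + str((10 - total % 10) % 10)

print(normalize_issn("1553667x"))
print(isbn10_to_isbn13("1565922573"))
```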

---
class: middle
## Fix functions - dictionaries
```bash
$ cat dict.csv
004,Informatik
310,Statistik
510,Mathematik
```

```perl
# { ddc => '004' }
lookup('ddc', 'dict.csv', -default=>'Allgemeines')
# or
lookup('ddc', 'dict.csv', -delete=>'1')
# { ddc => 'Informatik' }

# large dictionaries in stores
lookup_in_store('ddc', 'MongoDB', -database_name => 'lookups')
```
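
---
class: middle
## Fix functions - dictionaries (sketch)

... a small Python model of the lookup behaviour, assuming an unmatched value is left as is unless a default or the delete option is given. The function `lookup` and its keyword arguments are illustrative names only.

```python
import csv
import io

# Same content as dict.csv above, held in a string for the sketch.
DICT_CSV = "004,Informatik\n310,Statistik\n510,Mathematik\n"
table = dict(csv.reader(io.StringIO(DICT_CSV)))

def lookup(record, field, table, default=None, delete=False):
    """Replace record[field] by its dictionary entry."""
    key = record.get(field)
    if key in table:
        record[field] = table[key]
    elif delete:
        record.pop(field, None)   # like -delete: drop the unmatched value
    elif default is not None:
        record[field] = default   # like -default: use the fallback value
    return record

print(lookup({"ddc": "004"}, "ddc", table))
print(lookup({"ddc": "999"}, "ddc", table, default="Allgemeines"))
print(lookup({"ddc": "999"}, "ddc", table, delete=True))
```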

---
class: middle
## Fix functions - external sources

```perl
# passes a JSON object to an external process over stdin
# and reads a JSON object from its stdout
cmd("jq -c -M {title}")

# fetch data from a JSON API
get_json("http://example.com/json", path: path.key)

# geocode address
geocode('Johannisstraße 2, 10117 Berlin')

# Add all author values to a MongoDB database. 
add_to_store(authors.*, MongoDB, database_name: catalog, bag: authors)

# logging
log('not a valid ISSN' , level:Warning);

```
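
---
class: middle
## Fix functions - external programs (sketch)

... what an external JSON filter might look like in Python, doing the same as the `jq` call above (keep only `title`). It assumes one JSON object per line; the function name `run_filter` and the sample record are made up for the demo.

```python
import io
import json

def run_filter(infile, outfile):
    """Read one JSON object per line, keep only 'title', write it back."""
    for line in infile:
        if not line.strip():
            continue  # ignore blank lines
        record = json.loads(line)
        json.dump({"title": record.get("title")}, outfile)
        outfile.write("\n")

# Demo with in-memory streams instead of real stdin and stdout.
source = io.StringIO('{"title": "Programming Perl", "year": 1991}\n')
sink = io.StringIO()
run_filter(source, sink)
print(sink.getvalue(), end="")
```

A real filter would call `run_filter(sys.stdin, sys.stdout)` instead of the in-memory demo.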

---
class: middle
## Fix conditions

```perl
if exists(ddc)
    lookup(ddc, 'dict.csv', -default=>'Miscellaneous')
else
    add_field(ddc, 'Miscellaneous')
end
```
```perl
if any_match(ddc, '004')
    set_field(subject, 'Informatik')
end
```
```perl
if is_uri(uri_field)
  get_json(uri_field, path: path.key)
end
```
```perl
unless is_valid_issn(issn_field)
  issn(issn_field)
end
```

---
class: middle
## Fix binds

... a wrapper for fixes

```perl
do list(path:colors.*, var:c)
  upcase(c)
  append(c," is a nice color")
  copy_field(c,result.$append)
end
```

```perl
do maybe()
  download_from_internet() 
  process_results() # skipped when download_from_internet fails
end
```

```bash
#!/usr/bin/env catmandu run
do importer(OAI,url: "http://lib.ugent.be/oai") 
  retain(_id)
  add_to_exporter(.,YAML)
end
```

---
class: middle
## CLI

```bash
$ catmandu
  commands: list the application's commands
      help: show help
    config: export the Catmandu config
   convert: convert objects
      copy: copy objects to another store
     count: count the number of objects in a store
      data: store, index, search, import, export or convert (deprecated)
    delete: delete objects from a store
      drop: drop a store or one of the bags
    export: export objects from a store
    import: import objects into a store
      info: list installed Catmandu modules
      repl: interactive shell for Catmandu
       run: run a fix command
```

---
class: middle
## CLI - convert

```bash
$ catmandu convert CSV --sep_char ';' to JSON --pretty 1 \
    < eu_elections_2014.csv

$ catmandu convert MARC to CSV --fix marc.fix --file journals.csv \
    --fields dc_identifier,dc_title,dc_language < journals.mrc

$ catmandu convert XLSX to Template --template journals.tt < journals.xlsx
```

---
class: middle
## CLI - import

```bash
$ catmandu import JSON to MongoDB --database_name journals \
    < journals.json

$ catmandu import XLSX to ElasticSearch --index_name journals \
    < journals.xlsx

$ catmandu import PICA to CouchDB --fix pica.fix --index_name mab \
    < pica.dat

```

---
class: middle
## CLI - export

```bash
$ catmandu export MongoDB --database_name journals to JSON

$ catmandu export CouchDB to CSV --fix export.fix

$ catmandu export ElasticSearch --index_name mab --query 'city:"Berlin"'

```

---
class: middle
## CLI - count

```bash
$ catmandu count MongoDB --database_name journals

$ catmandu count ElasticSearch --index_name journals --query \
    'city:"Berlin"'
```

---
class: middle
## CLI - delete

```bash
$ catmandu delete MongoDB --database_name journals

$ catmandu delete ElasticSearch --index_name journals --query \
    'city:"Berlin"'
```

---
class: middle
## CLI - copy

```bash
$ catmandu copy MongoDB --database_name journals to ElasticSearch \
    --index_name journals

$ catmandu copy ElasticSearch --index_name journals \
    --query 'city:"Berlin"' to MongoDB --database_name berlin
```

---
class: middle
## CLI - APIs

```bash
$ catmandu convert Atom --url http://my.example.org/feed.atom to YAML

$ catmandu convert OAI --url http://pub.uni-bielefeld.de/oai to JSON

$ catmandu convert SRU --base http://sru.gbv.de/gvk --recordSchema \
    marcxml --parser marcxml --query "issn=0939-4362" to YAML

$ catmandu import getJSON --from http://example.org/alice.json to \
    MongoDB --database_name import --fix import.fix

$ catmandu convert RDF --url http://data.linkeddatafragments.org/viaf \
    --sparql 'SELECT * {?s ?p "Wall, Larry"}' to JSON

$ catmandu convert Wikidata --site enwiki --title "Larry Wall" to \
    JSON --pretty 1

```

---
class: middle
## Config

```bash
$ cat catmandu.yml
---
store:
  mdb:
   package: MongoDB
   options:
    database_name: mydb
    fix: 'my.fix'
  els:
   package: ElasticSearch
   options:
    index_name: mydb
    fix: 'my.fix'

$ catmandu import JSON to mdb < records.json
$ catmandu import MARC to els < records.mrc
$ catmandu export mdb to JSON
$ catmandu export els to JSON
```

---
class: middle
## Extension

... you can extend Catmandu via its Perl API with your own

* fixes
* commands
* importers & exporters
* ...

---
class: middle
## Catmandu @ Bielefeld
![Bielefeld](./img/pubbielefeld.png "Bielefeld")

---
class: middle
## Catmandu @ Gent
![Gent](./img/libugentbe.png "Gent")

---
class: middle
## Catmandu @ Koha
![Koha](./img/koha.png "Koha")

---
class: middle
## Catmandu @ LinkedDataFragments
![linked data fragments](./img/linkeddatafragmentsorg.png)

---
class: middle
## Catmandu @ OpenRefine
![openrefine](./img/openrefine.png)

---
class: middle
## Getting started

* Virtual machine https://librecatproject.wordpress.com/get-catmandu/
* Introduction https://librecatproject.wordpress.com/2014/12/01/day-1-getting-catmandu/
* Cheat sheet http://librecat.org/Catmandu/#fixes-cheat-sheet

---
class: middle
## Links

- http://librecat.org - home page
- https://librecatproject.wordpress.com - blog
- https://github.com/LibreCat - code
- librecat-dev@lists.uni-bielefeld.de - mailing list

---
class: middle
.center[![XKCD](./img/xkcd_perl.png "xkcd.com/519/")]
.center[<small>\[Comic by [Randall Munroe](http://xkcd.com/519/), [CC BY-NC 2.5](https://creativecommons.org/licenses/by-nc/2.5/)\]</small>]