class: center middle

# Catmandu

### a data toolkit

#### csv,conf,v2 3-4 May 2016 in Berlin, Germany

##### Johann Rolschewski

---
class: middle

## Libraries collect data ...

* books
* journals
* articles
* maps
* manuscripts
* sheets of music
* ...

---
class: middle

## Libraries create metadata ...

* bibliographic descriptions
* holding information
* references
* patron data
* ...

---
class: middle center

## Metadata



---
class: middle

## Metadata

... catalogued in library-specific formats (MARC, MAB2, PICA, ...)

... provided via library-specific APIs (OAI, SRU, Z39.50, ...)

... used in diverse systems (OPACs, discovery systems, institutional repositories, link resolvers, ...)

---
class: center middle

## Demand

... for a library-specific metadata toolkit

---
class: middle

## Catmandu

... created in 2012 as an open collaboration ("LibreCat") of the three university libraries of **Bielefeld**, **Gent** and **Lund**

... built up an international community with a dozen active contributors

... used by university libraries, archives and commercial implementers

---
class: middle

## Catmandu

... supports __"Extract, Transform, Load"__ (ETL) processes

... to *extract* metadata records from various sources, *transform* them into new formats and *load* them into databases, NoSQL stores or search engines

---
class: middle

## Tools

- __Command line tool__: catmandu
- __Importers__: CSV, DBI, JSON, LDAP, MAB2, MARC, OAI-PMH, PICA, RDF, RIS, SRU, Text, TSV, Twitter, Wikidata, XLS(X), YAML, Z39.50, ...
- __Exporters__: CSV, JSON, MAB2, MARC, PICA, RDF, RIS, Template, Text, TSV, XLS(X), XML, YAML, ...
- __Stores__: Aleph, CouchDB, DBI, Elasticsearch, FedoraCommons, MongoDB, Solr, ...
- __Transformation__: Fix or any program that can read and write JSON
- __API__: Perl
- __Web development__: Dancer, PSGI

---
class: middle

## Fix

... a small __domain-specific language__ (DSL) for manipulation of data

...
consists of:

* __paths__ to refer to particular parts of an item
* __functions__ to manipulate (parts of) an item
* __conditionals__ to control when to apply which fix functions
* __binds__ to manipulate the execution of fix functions

---
class: middle

## Nested Data Structures

```json
{
  "preferredName" : "Larry Wall",
  "surname" : "Wall",
  "forename" : "Larry",
  "describedBy" : {
    "valid" : "2016-04-14T11:19:01+0200",
    "license" : "http://creativecommons.org/publicdomain/zero/1.0/legalcode",
    "id" : "http://hub.culturegraph.org/entityfacts/138937079"
  },
  "dateOfBirth" : "1954",
  "professionOrOccupation" : [
    {
      "id" : "http://d-nb.info/gnd/4139395-8",
      "value" : "Informatiker"
    }
  ],
  "depiction" : {
    "image" : "https://commons.wikimedia.org/wiki/Special:FilePath/...",
    "thumbnail" : "https://commons.wikimedia.org/wiki/Special:FilePath/...",
    "url" : "https://commons.wikimedia.org/wiki/..."
  }
}
```

---
class: middle

## Paths

... to reference data within deeply nested data structures

... uses "__dot notation__"

```json
surname → "Wall"
describedBy.id → "http://hub.culturegraph.org/entityfacts/138937079"
professionOrOccupation.0.value → "Informatiker"
```

---
class: middle

## Paths

...
to add data to deeply nested data structures

```perl
add_field(foo.bar.0.test.1.key,value)
```

```json
{
  "foo" : {
    "bar" : [
      {
        "test" : [
          null,
          {"key" : "value"}
        ]
      }
    ]
  }
}
```

---
class: middle

## Paths

```bash
$append  - Add a new item at the end of an array
$prepend - Add a new item at the start of an array
$first   - Syntactic sugar for index '0' (the head of the array)
$last    - Syntactic sugar for index '-1' (the tail of the array)
*        - Wildcard for all array elements
```

```perl
# {}
add_field(dc_creator.$append,"Wall, Larry")
# { "dc_creator" : [ "Wall, Larry" ] }
set_field(dc_creator.$first,"Christiansen, Tom")
# { "dc_creator" : [ "Christiansen, Tom" ] }
```

---
class: middle

## Fix functions - field

```perl
# {}
add_field(name,'Christiansen, Tom')
# { "name" : "Christiansen, Tom" }
set_field(name,'Wall, Larry')
# { "name" : "Wall, Larry" }
copy_field(name,dc.creator)
# { "name" : "Wall, Larry", "dc" : { "creator" : "Wall, Larry" } }
remove_field(name)
# { "dc" : { "creator" : "Wall, Larry" } }
move_field(dc.creator,dc_creator)
# { "dc" : {}, "dc_creator" : "Wall, Larry" }
retain_field(dc_creator)
# { "dc_creator" : "Wall, Larry" }
```

```perl
# { "subjects" : "Perl,R,JavaScript,Perl,R" }
split_field(subjects,',')
sort_field(subjects)
uniq(subjects)
# { "subjects" : ["JavaScript", "Perl", "R"] }
join_field(subjects,'; ')
# { "subjects" : "JavaScript; Perl; R" }
```

---
class: middle

## Fix functions - string

```perl
# { "name" : "Wall" }
upcase(name);
# { "name" : "WALL" }
downcase(name);
# { "name" : "wall" }
capitalize(name);
# { "name" : "Wall" }
append(name,', Larry');
# { "name" : "Wall, Larry" }
prepend(name,'Dr. ');
# { "name" : "Dr. Wall, Larry" }
```

---
class: middle

## Fix functions - string

```perl
# { "name" : " Christiansen, " }
trim(name);
# { "name" : "Christiansen," }
trim(name,'nonword');
# { "name" : "Christiansen" }
substring(name, 0, 1);
# { "name" : "C" }
```

```perl
# { "format" : "MARC21"}
replace_all(format, '\d', '')
# { "format" : "MARC"}

# { "id" : [ "123-4", "567-X" ] }
replace_all(id.*, '-[0-9xX]$', '')
# { "id" : [ "123", "567" ] }
```

---
class: middle

## Fix functions - numbers

```perl
# { "numbers" : [ 1, 2, 3 ] }
copy_field(numbers,count)
count(count)
copy_field(numbers,sum)
sum(sum)
copy_field(numbers,mean)
stat_mean(mean)
copy_field(numbers,variance)
stat_variance(variance)
# { "numbers" : [ 1, 2, 3 ], "count" : 3, "sum" : 6,
#   "mean" : 2, "variance" : 0.67 }
```

---
class: middle

## Fix functions - special data formats

```perl
marc_map(008_/35-38,language)
marc_map(245[10]a,title)
```

```perl
mab_map(331[ ],title)
mab_map(406jk,coverage.$append, -join => ' - ')
```

```perl
pica_map(009Qa,primaryTopicOf.$append)
pica_map(027A[01]a,varyingFormOfTitle)
```

---
class: middle

## Fix functions - identifier

```perl
# { "issn" : "1553667x" }
issn(issn)
# { "issn" : "1553-667X" }

# { "isbn" : "1565922573" }
isbn13(isbn)
# { "isbn" : "978-1-56592-257-0" }

# { }
uuid(id)
# { "id" : "4162F712-1DD2-11B2-B17E-C09EFE1DC403" }
```

---
class: middle

## Fix functions - dictionaries

```bash
$ cat dict.csv
004,Informatik
310,Statistik
510,Mathematik
```

```perl
# { ddc => '004' }
lookup('ddc', 'dict.csv', -default=>'Allgemeines')
# or
lookup('ddc', 'dict.csv', -delete=>'1')
# { ddc => 'Informatik' }

# large dictionaries in stores
lookup_in_store('ddc', 'MongoDB', -database_name => 'lookups')
```

---
class: middle

## Fix functions - external sources

```perl
# passes JSON object to an external process over stdin
# and reads a JSON object from its stdout
cmd("jq -c -M {title}")

# fetch data from a JSON API
get_json("http://example.com/json", path: path.key)

# geocode address
geocode('Johannisstraße 2, 10117 Berlin')

# Add all author values to a MongoDB database.
add_to_store(authors.*, MongoDB, database_name: catalog, bag: authors)

# logging
log('not a valid ISSN' , level:Warning);
```

---
class: middle

## Fix conditions

```perl
if exists(ddc)
  lookup(ddc, 'dict.csv', -default=>'Miscellaneous')
else
  add_field(ddc, 'Miscellaneous')
end
```

```perl
if any_match(ddc, '004')
  set_field(subject, 'Informatik')
end
```

```perl
if is_uri(uri_field)
  get_json(uri_field, path: path.key)
end
```

```perl
unless is_valid_issn(issn_field)
  issn(issn)
end
```

---
class: middle

## Fix binds

... a wrapper for fixes

```perl
do list(path:colors.*, var:c)
  upcase(c)
  append(c," is a nice color")
  copy_field(c,result.$append)
end
```

```perl
do maybe()
  download_from_internet()
  process_results() # skipped when download_from_internet fails
end
```

```bash
#!/usr/bin/env catmandu run
do importer(OAI,url: "http://lib.ugent.be/oai")
  retain(_id)
  add_to_exporter(.,YAML)
end
```

---
class: middle

## CLI

```bash
$ catmandu
  commands: list the application's commands
      help: show help
    config: export the Catmandu config
   convert: convert objects
      copy: copy objects to another store
     count: count the number of objects in a store
      data: store, index, search, import, export or convert (deprecated)
    delete: delete objects from a store
      drop: drop a store or one of the bags
    export: export objects from a store
    import: import objects into a store
      info: list installed Catmandu modules
      repl: interactive shell for Catmandu
       run: run a fix command
```

---
class: middle

## CLI - convert

```bash
$ catmandu convert CSV --sep_char ';' to JSON --pretty 1 ↩
    < eu_elections_2014.csv
$ catmandu convert MARC to CSV --fix marc.fix --file journals.csv ↩
    --fields dc_identifier,dc_title,dc_language < journals.mrc
$ catmandu convert XLSX to Template --template journals.tt < journals.xlsx
```

---
class: middle

## CLI - import

```bash
$ catmandu import JSON to MongoDB --database_name journals ↩
    < journals.json
$ catmandu import XLSX to ElasticSearch --index_name journals ↩
    < journals.xlsx
$ catmandu import PICA to CouchDB --fix pica.fix --index_name mab ↩
    < pica.dat
```

---
class: middle

## CLI - export

```bash
$ catmandu export MongoDB --database_name journals to JSON
$ catmandu export CouchDB to CSV --fix export.fix
$ catmandu export ElasticSearch --index_name mab --query 'city:"Berlin"'
```

---
class: middle

## CLI - count

```bash
$ catmandu count MongoDB --database_name journals
$ catmandu count Elasticsearch --index_name journals --query ↩
    'city:"Berlin"'
```

---
class: middle

## CLI - delete

```bash
$ catmandu delete MongoDB --database_name journals
$ catmandu delete Elasticsearch --index_name journals --query ↩
    'city:"Berlin"'
```

---
class: middle

## CLI - copy

```bash
$ catmandu copy MongoDB --database_name journals to ElasticSearch ↩
    --index_name journals
$ catmandu copy ElasticSearch --index_name journals ↩
    --query 'city:"Berlin"' to MongoDB --database_name berlin
```

---
class: middle

## CLI - APIs

```bash
$ catmandu convert Atom --url http://my.example.org/feed.atom to YAML
$ catmandu convert OAI --url http://pub.uni-bielefeld.de/oai to JSON
$ catmandu convert SRU --base http://sru.gbv.de/gvk --recordSchema ↩
    marcxml --parser marcxml --query "issn=0939-4362" to YAML
$ catmandu import getJSON --from http://example.org/alice.json to ↩
    MongoDB --database_name import --fix import.fix
$ catmandu convert RDF --url http://data.linkeddatafragments.org/viaf ↩
    --sparql 'SELECT * {?s ?p "Wall, Larry"}' to JSON
$ catmandu convert Wikidata --site enwiki --title "Larry Wall" to ↩
    JSON --pretty 1
```

---
class: middle

## Config

```bash
$ cat catmandu.yml
---
store:
  mdb:
    package: MongoDB
    options:
      database_name: mydb
      fix: 'my.fix'
  els:
    package: Elasticsearch
    options:
      index_name: mydb
      fix: 'my.fix'
$ catmandu import JSON to mdb < records.json
$ catmandu import MARC to els < records.mrc
$ catmandu export mdb to JSON
$ catmandu export els to JSON
```

---
class: middle

## Extension

... you can extend Catmandu via its Perl API with your own

* fixes
* commands
* importers & exporters
* ...

---
class: middle

## Catmandu @ Bielefeld



---
class: middle

## Catmandu @ Gent



---
class: middle

## Catmandu @ Koha



---
class: middle

## Catmandu @ LinkedDataFragments



---
class: middle

## Catmandu @ OpenRefine



---
class: middle

## Getting started

* Virtual machine https://librecatproject.wordpress.com/get-catmandu/
* Introduction https://librecatproject.wordpress.com/2014/12/01/day-1-getting-catmandu/
* Cheat sheet http://librecat.org/Catmandu/#fixes-cheat-sheet

---
class: middle

## Links

- http://librecat.org - home page
- https://librecatproject.wordpress.com - blog
- https://github.com/LibreCat - code
- librecat-dev@lists.uni-bielefeld.de - mailing list

---
class: middle

.center[]

.center[
\[Comic by [Randall Munroe](http://xkcd.com/519/), [CC BY-NC 2.5](https://creativecommons.org/licenses/by-nc/2.5/)\]
]
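
---
class: middle

## Appendix - paths, sketched in Python

The dot notation from the "Paths" slides can be mimicked in a few lines of Python. This is only an illustrative sketch, not how Catmandu implements Fix paths: `get_path` is a hypothetical helper, and it assumes that numeric segments index arrays, that `$first` and `$last` stand for the indexes 0 and -1, and that `*` fans out over all array elements. The sample record is a trimmed copy of the "Nested Data Structures" example.

```python
def get_path(record, path):
    """Return the list of values selected by a dot-notation path.

    Hypothetical helper for illustration; not part of Catmandu.
    """
    nodes = [record]
    for segment in path.split("."):
        matches = []
        for node in nodes:
            if isinstance(node, list):
                if segment == "*":
                    # wildcard: continue with every array element
                    matches.extend(node)
                    continue
                if segment == "$first":
                    index = 0
                elif segment == "$last":
                    index = -1
                elif segment.isdigit():
                    index = int(segment)
                else:
                    continue  # a hash key cannot address an array
                if node and -len(node) <= index < len(node):
                    matches.append(node[index])
            elif isinstance(node, dict) and segment in node:
                matches.append(node[segment])
        nodes = matches
    return nodes


record = {
    "surname": "Wall",
    "describedBy": {"id": "http://hub.culturegraph.org/entityfacts/138937079"},
    "professionOrOccupation": [
        {"id": "http://d-nb.info/gnd/4139395-8", "value": "Informatiker"}
    ],
}

print(get_path(record, "surname"))                         # ['Wall']
print(get_path(record, "professionOrOccupation.0.value"))  # ['Informatiker']
print(get_path(record, "professionOrOccupation.*.id"))
```

The helper always returns a list, so a wildcard path and a plain path can be handled the same way, and a path that matches nothing yields an empty list instead of raising an error.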