Data Transfer Strategies

Jewish Buffalo Archives Project, University at Buffalo (SUNY) and Bureau of Jewish Education of Greater Buffalo (Buffalo, NY)

Acquiring the Jewish Buffalo Archives Project data was straightforward, and the process outlines a scalable, automated way to obtain data from HTML finding aids published from a digital asset management (DAM) system. Buffalo provided a list of formulaic links to its HTML finding aids, and the HTML produced by Buffalo's DAM, XTF, is regular and well structured. We could therefore scrape all the HTML and extract the needed data with XQuery.

Each collection has a MARC record in the University at Buffalo system, and most have HTML finding aids that are linked from the MARC record. Buffalo provided AJHS and CJH with a list of collections and their record numbers. From the Cygwin command-line shell[1], the CJH archivist used a single “curl” command to scrape all the HTML finding aids into one file. The list of IDs was taken from the spreadsheet provided by Buffalo. The “curl” command allowed us to use one link, with variable IDs, to request the URL of each finding aid, and it wrote the HTML found at each URL into one large file.

Fig. 1: Curl command
curl {ms0217,ms0150_5,msq0150_2,ms0150_4,ms0150_1,ms0150_3,ms0225,ms0206,ms203,ms0204,ms0209,ms0202,ms201,ms0200_17,ms0200_22,ms0200_26,ms0200_15,ms0200_27,ms0225_1,ms0200_23,ms0200_5,ms0200_28,ms0200_11,ms0200_16,ms0200_31,ms0200_21}.xml > buffalo_html.txt
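The same fetch-and-concatenate step can be sketched in Python for readers who do not have curl available. This is a minimal sketch under stated assumptions, not the command the project ran: BASE_URL is a placeholder because the URL prefix is not shown in Fig. 1, and the function names build_urls and fetch_all are hypothetical.

```python
import urllib.request

# Placeholder only: the real URL prefix does not appear in Fig. 1.
BASE_URL = "https://example.org/findingaids/"


def build_urls(ids, base_url=BASE_URL, suffix=".xml"):
    """Return one URL per finding aid ID, like curl's {a,b,c} expansion."""
    return [f"{base_url}{fa_id}{suffix}" for fa_id in ids]


def fetch_all(urls, out_path, fetch=None):
    """Write the body of every URL, one after another, into a single file."""
    if fetch is None:
        # Default fetcher: a plain HTTP GET using the standard library.
        def fetch(url):
            with urllib.request.urlopen(url) as resp:
                return resp.read()

    with open(out_path, "wb") as out:
        for url in urls:
            out.write(fetch(url))
            out.write(b"\n")
    return len(urls)
```

Calling fetch_all(build_urls(["ms0217", "ms0225"]), "buffalo_html.txt") would request each finding aid in turn and collect the responses in one file, as the curl command did. The optional fetch argument exists only so the function can be tried without network access.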

Fig. 2: HTML <meta> data from Buffalo finding aid

The resulting document was opened in the oXygen XML editor[2], where the doctype declarations were removed and an XML declaration and a root element were added. The resulting well-formed XML file was imported into the XML database software BaseX[3]. After some study of the HTML, the following XQuery was written to extract the metadata fields required by the Portal's Collective Access software and those used for public display. This process produced CSV data that could be imported using Excel.

Fig. 3: XQuery Code, Buffalo
xquery version "3.0";
<results>{
  for $findingaid in /records/html
  let $titleproper := $findingaid/head/meta[@name="dc.title"]/@content
  let $creator := $findingaid/head/meta[@name=""]/@content
  let $subject := $findingaid/head/meta[@name="dc.subject"]/@content
  let $dates := $findingaid//h2[@class="tp_titleproper"]
  let $abstract := $findingaid/head/meta[@name="description"]/@content
  let $unitid := $findingaid//h3[@class="tp_collnum"]
  let $link := $findingaid//div[@class="level-1"][1]/a/@href/substring-before(., "&amp;")
  let $extent := $findingaid//div[span/a[@name="node."]]/text()
  return
    (: one element per finding aid; the wrapper name "record" is an assumption :)
    <record>
      <dates>{data(substring-after($dates, ","))}</dates>
      <publisher>Jewish Buffalo Archives Project, at University Archives, University at Buffalo</publisher>
      <abstract>{data($abstract)}</abstract>
      <link>{data($link)}</link>
      <extent>{$extent}</extent>
    </record>
}</results>
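The preparation and extraction steps described before Fig. 3 can also be sketched outside oXygen and BaseX. The Python below is an illustration only, assuming each scraped page is already well-formed XHTML with no namespace on the html element. The root element name "records" comes from the path used in Fig. 3; the CSV column names and the function names are hypothetical. Only the three meta names that are fully visible in Fig. 3 are read.

```python
import csv
import io
import re
import xml.etree.ElementTree as ET

# Matches simple doctype declarations such as <!DOCTYPE html PUBLIC "..." "...">.
DOCTYPE_RE = re.compile(r"<!DOCTYPE[^>]*>", re.IGNORECASE)

# Hypothetical CSV column names mapped to the meta names used in Fig. 3.
META_FIELDS = {
    "title": "dc.title",
    "subject": "dc.subject",
    "abstract": "description",
}


def wrap_as_xml(concatenated_html, root="records"):
    """Drop doctype declarations and wrap all pages in a single root element."""
    body = DOCTYPE_RE.sub("", concatenated_html).strip()
    return f'<?xml version="1.0" encoding="UTF-8"?>\n<{root}>\n{body}\n</{root}>\n'


def meta_content(html, name):
    """Return the content of the first <meta name="..."> in <head>, or ''."""
    for meta in html.findall("./head/meta"):
        if meta.get("name") == name:
            return meta.get("content", "")
    return ""


def records_to_csv(xml_text):
    """Write a header row plus one CSV row per <html> child of the root."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(list(META_FIELDS))
    for html in root.findall("html"):
        writer.writerow([meta_content(html, n) for n in META_FIELDS.values()])
    return buf.getvalue()
```

Passing the contents of the scraped file through wrap_as_xml and then records_to_csv yields text that can be saved with a .csv extension and opened in Excel. A page that lacks one of the meta elements produces an empty cell rather than an error.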

Jewish Historical Society of Greater Hartford (West Hartford, CT)

The Jewish Historical Society of Greater Hartford provided EAD files by email. The EAD files were exported from Archivists' Toolkit and were therefore uniform in tag and data placement. The CJH archivist analyzed the EAD files, imported them into the XML database software BaseX, and ran the following XQuery to extract metadata fields for the Portal's Collective Access software.

Fig. 4: XQuery Code, Hartford
xquery version "3.0";
for $ead in /ead
let $titleproper := $ead/eadheader/filedesc/titlestmt/titleproper[1]/text()
let $creator := $ead/archdesc/did/unittitle
let $dates := $ead/archdesc/did/unitdate
let $publisher := $ead/eadheader/filedesc/publicationstmt/publisher
let $abstract := $ead/archdesc/scopecontent/p
let $unitid := $ead/archdesc/did/unitid
return
  (: one element per EAD file; the wrapper name "record" is an assumption :)
  <record>
    <publisher>{data($publisher)}</publisher>
    <abstract>{data($abstract)}</abstract>
  </record>
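Because the Hartford EAD files place each field at a fixed path, the same lookups can be sketched with Python's standard library. This is an illustration, not the project's workflow. It assumes the EAD elements carry no XML namespace, as the unprefixed paths in Fig. 4 imply, and it reads only the first match for each path, so a scope note with several paragraphs would be truncated to the first one. The dictionary keys and the function name extract_ead_fields are hypothetical.

```python
import xml.etree.ElementTree as ET

# Field name -> path below the <ead> root, following the paths in Fig. 4.
EAD_PATHS = {
    "titleproper": "eadheader/filedesc/titlestmt/titleproper",
    "creator": "archdesc/did/unittitle",
    "dates": "archdesc/did/unitdate",
    "publisher": "eadheader/filedesc/publicationstmt/publisher",
    "abstract": "archdesc/scopecontent/p",
    "unitid": "archdesc/did/unitid",
}


def extract_ead_fields(ead_xml):
    """Return a dict with the text of the first match for each EAD path."""
    root = ET.fromstring(ead_xml)
    record = {}
    for field, path in EAD_PATHS.items():
        node = root.find(path)
        if node is None:
            record[field] = ""
        else:
            # Join text from nested elements and collapse runs of whitespace.
            record[field] = " ".join("".join(node.itertext()).split())
    return record
```

Running extract_ead_fields on the text of each EAD file gives one dictionary per collection, which could then be written out with csv.DictWriter for import into a spreadsheet.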

Jewish Historical Society of Fairfield County (Stamford, CT)

The Jewish Historical Society of Fairfield County provided AJHS and CJH with very rich container list information in Excel format, exported from the description management software PastPerfect. The records were not organized at the collection level, but they included title information that indicated the provenance of each container. For example, one container-level row contained this information:

Temple Sinai of Stamford, CT Bulletins from 2000-2001 Temple Sinai Bulletins Bulletins 478

The CJH and AJHS archivists suggested manually creating collection-level records, and the Historical Society agreed. The final collection-level record, which was imported into the Portal, looked like this (record slightly abbreviated):

Materials on Temple Sinai of Stamford, CT. Includes monthly bulletins from 1959-1960, 1978-2014 (with gaps and missing issues). Also includes booklets, annual meeting reports, various historical documents. Temple Sinai archival and bulletin collection Temple Sinai 1954-2012 JHSFC_473-524 473-524, 516b

This manual process was time-consuming, but the resulting records were contextually appropriate for the Portal.
