2012-01-05. Frozen archive - links may not resolve - see directory of files at MoinMoin wiki archive

> DataConversion

Some notes on converting bibliographic metadata between formats.

LOC Metadata

Using MARC data from: http://www.archive.org/details/marc_records_scriblio_net

For details of the RDF conversion work done by Alistair Miles in 2008/09, see http://code.google.com/p/code4rda/wiki/MilestoneOne


Statistics

MODS XML

Below are statistics on the MODS XML representation of the LOC dataset.

element path | count | average count per record
mods 6927835
mods/abstract 326867 0.05
mods/abstract@xlink:href 5929 0
mods/abstract@xmlns:xlink 5929 0
mods/accessCondition 719 0
mods/accessCondition@type 719 0
mods/accessCondition@type='restrictionOnAccess' 170 0
mods/accessCondition@type='useAndReproduction' 549 0
mods/classification 11084966 1.6
mods/classification@authority 11084966 1.6
mods/classification@authority= 59572 0.01
mods/classification@authority='20' 1 0
mods/classification@authority='Bremen <Geschichte>' 1 0
mods/classification@authority='Politische Kultur' 1 0
mods/classification@authority='ardocs' 153 0
mods/classification@authority='azdocs' 262 0
mods/classification@authority='blsrissc' 6 0
mods/classification@authority='cadocs' 535 0
mods/classification@authority='candoc' 1273 0
mods/classification@authority='codocs' 920 0
mods/classification@authority='ddc' 3695274 0.53
mods/classification@authority='ex.sum.' 1 0
mods/classification@authority='fldocs' 300 0
mods/classification@authority='gadocs' 13 0
mods/classification@authority='iadoc' 2 0
mods/classification@authority='iadocs' 100 0
mods/classification@authority='ildocs' 1 0
mods/classification@authority='ksdocs' 151 0
mods/classification@authority='kssb/7' 2 0
mods/classification@authority='ladocs' 49 0
mods/classification@authority='lcc' 7162071 1.03
mods/classification@authority='midocs' 87 0
mods/classification@authority='modocs' 131 0
mods/classification@authority='moys' 2 0
mods/classification@authority='msdocs' 66 0
mods/classification@authority='nbdocs' 186 0
mods/classification@authority='ncdocs' 613 0
mods/classification@authority='njb' 293 0
mods/classification@authority='njb/9' 1 0
mods/classification@authority='nlm' 94173 0.01
mods/classification@authority='nmdocs' 81 0
mods/classification@authority='nvdocs' 87 0
mods/classification@authority='nydocs' 1136 0
mods/classification@authority='ohdocs' 254 0
mods/classification@authority='okdocs' 98 0
mods/classification@authority='ordocs' 317 0
mods/classification@authority='padocs' 129 0
mods/classification@authority='ridocs' 63 0
mods/classification@authority='rswk' 1 0
mods/classification@authority='scdocs' 316 0
mods/classification@authority='sddocs' 45 0
mods/classification@authority='sdocs' 1 0
mods/classification@authority='sudocs' 62582 0.01
mods/classification@authority='txdocs' 643 0
mods/classification@authority='udc' 1500 0
mods/classification@authority='undoc' 37 0
mods/classification@authority='undocs' 71 0
mods/classification@authority='undocs.' 1 0
mods/classification@authority='utdocs' 291 0
mods/classification@authority='wadocs' 289 0
mods/classification@authority='widocs' 491 0
mods/classification@authority='wydocs' 241 0
mods/classification@authority='z' 51 0
mods/classification@displayLabel 3 0
mods/classification@edition 2304523 0.33
mods/genre 4262848 0.62
mods/genre@authority 4262848 0.62
mods/genre@authority= 1537 0
mods/genre@authority='DLC' 3 0
mods/genre@authority='LCSH' 2 0
mods/genre@authority='aat' 90 0
mods/genre@authority='aat.' 6 0
mods/genre@authority='gafd' 1 0
mods/genre@authority='gasfd' 1 0
mods/genre@authority='gmgpc' 394 0
mods/genre@authority='gsaf' 1 0
mods/genre@authority='gsafd' 130627 0.02
mods/genre@authority='gsafd.' 26 0
mods/genre@authority='gsasf' 1 0
mods/genre@authority='gsfd' 5 0
mods/genre@authority='gtlm' 1 0
mods/genre@authority='gtlm.' 3 0
mods/genre@authority='lcsh' 32137 0
mods/genre@authority='lcsh.' 12 0
mods/genre@authority='lcshac' 2 0
mods/genre@authority='lctgm' 1 0
mods/genre@authority='lgsafd' 1 0
mods/genre@authority='local' 70 0
mods/genre@authority='marcgt' 4096511 0.59
mods/genre@authority='mesh' 95 0
mods/genre@authority='migfg' 2 0
mods/genre@authority='mim' 1 0
mods/genre@authority='radfg' 2 0
mods/genre@authority='rbbin' 57 0
mods/genre@authority='rbgen' 1 0
mods/genre@authority='rbgenr' 1062 0
mods/genre@authority='rbgenrb' 1 0
mods/genre@authority='rbpri' 156 0
mods/genre@authority='rbprov [from old catalog]' 1 0
mods/genre@authority='rbprov' 26 0
mods/genre@authority='rbprov.' 2 0
mods/genre@authority='rbpub' 9 0
mods/genre@authority='safd' 1 0
mods/identifier 10857175 1.57
mods/identifier/identifier 1 0
mods/identifier/identifier@invalid 1 0
mods/identifier/identifier@type 1 0
mods/identifier/identifier@type= 1 0
mods/identifier@invalid 72849 0.01
mods/identifier@type 10857175 1.57
mods/identifier@type= 33 0
mods/identifier@type='hdl' 2344 0
mods/identifier@type='isbn' 3543617 0.51
mods/identifier@type='ismn' 42 0
mods/identifier@type='isrc' 14 0
mods/identifier@type='issn' 163 0
mods/identifier@type='issue number' 47 0
mods/identifier@type='lccn' 6939150 1
mods/identifier@type='matrix number' 1 0
mods/identifier@type='music plate' 19 0
mods/identifier@type='music publisher' 109 0
mods/identifier@type='sici' 49 0
mods/identifier@type='stock number' 117501 0.02
mods/identifier@type='upc' 460 0
mods/identifier@type='uri' 253625 0.04
mods/identifier@type='videorecording identifier' 1 0
mods/language 7540276 1.09
mods/language/languageTerm 7540276 1.09
mods/language/languageTerm@authority 7540276 1.09
mods/language/languageTerm@authority='iso639-2b' 7540276 1.09
mods/language/languageTerm@type 7540276 1.09
mods/language/languageTerm@type='code' 7540276 1.09
mods/language@objectPart 237455 0.03
mods/location 437 0
mods/location/physicalLocation 437 0
mods/location/shelfLocation 236 0
mods/name 9809658 1.42
mods/name/affiliation 2 0
mods/name/namePart 13777932 1.99
mods/name/namePart@type 3016579 0.44
mods/name/namePart@type='date' 2879576 0.42
mods/name/namePart@type='termsOfAddress' 137003 0.02
mods/name/role 6340532 0.92
mods/name/role/roleTerm 6340532 0.92
mods/name/role/roleTerm@authority 5792141 0.84
mods/name/role/roleTerm@authority='marcrelator' 5792141 0.84
mods/name/role/roleTerm@type 6340532 0.92
mods/name/role/roleTerm@type='code' 6296 0
mods/name/role/roleTerm@type='text' 6334236 0.91
mods/name@type 9809655 1.42
mods/name@type='conference' 159644 0.02
mods/name@type='corporate' 1644781 0.24
mods/name@type='personal' 8005230 1.16
mods/note 13430193 1.94
mods/note@type 5799781 0.84
mods/note@type='action' 926 0
mods/note@type='additional physical form' 18959 0
mods/note@type='citation/reference' 35359 0.01
mods/note@type='original version' 80 0
mods/note@type='performers' 19 0
mods/note@type='reproduction' 109368 0.02
mods/note@type='restrictions' 170 0
mods/note@type='statement of responsibility' 5616926 0.81
mods/note@type='system details' 17967 0
mods/note@type='venue' 7 0
mods/note@xlink:href 60 0
mods/note@xmlns:xlink 60 0
mods/originInfo 6927835 1
mods/originInfo/copyrightDate 28052 0
mods/originInfo/copyrightDate@encoding 28052 0
mods/originInfo/copyrightDate@encoding='marc' 28052 0
mods/originInfo/dateCaptured 3 0
mods/originInfo/dateCaptured@encoding 3 0
mods/originInfo/dateCaptured@encoding='iso8601' 3 0
mods/originInfo/dateCaptured@point 3 0
mods/originInfo/dateCreated 9417 0
mods/originInfo/dateIssued 10473189 1.51
mods/originInfo/dateIssued@encoding 3559330 0.51
mods/originInfo/dateIssued@encoding='marc' 3559330 0.51
mods/originInfo/dateIssued@point 527500 0.08
mods/originInfo/dateIssued@qualifier 51879 0.01
mods/originInfo/edition 1226948 0.18
mods/originInfo/frequency 192 0
mods/originInfo/issuance 6927835 1
mods/originInfo/place 14224531 2.05
mods/originInfo/place/placeTerm 14224531 2.05
mods/originInfo/place/placeTerm@authority 6845116 0.99
mods/originInfo/place/placeTerm@authority='iso3166' 26 0
mods/originInfo/place/placeTerm@authority='marccountry' 6845090 0.99
mods/originInfo/place/placeTerm@type 14224531 2.05
mods/originInfo/place/placeTerm@type='code' 6845116 0.99
mods/originInfo/place/placeTerm@type='text' 7379415 1.07
mods/originInfo/publisher 7089288 1.02
mods/physicalDescription 6927835 1
mods/physicalDescription/extent 6920111 1
mods/physicalDescription/form 7128863 1.03
mods/physicalDescription/form@authority 7128687 1.03
mods/physicalDescription/form@authority='gmd' 112322 0.02
mods/physicalDescription/form@authority='marcform' 6890374 0.99
mods/physicalDescription/form@authority='smd' 125991 0.02
mods/physicalDescription/internetMediaType 30 0
mods/physicalDescription/reformattingQuality 194 0
mods/recordInfo 6927834 1
mods/recordInfo/descriptionStandard 4189083 0.6
mods/recordInfo/languageOfCataloging 22044 0
mods/recordInfo/languageOfCataloging/languageTerm 22044 0
mods/recordInfo/languageOfCataloging/languageTerm@authority 22044 0
mods/recordInfo/languageOfCataloging/languageTerm@authority='iso639-2b' 22044 0
mods/recordInfo/languageOfCataloging/languageTerm@type 22044 0
mods/recordInfo/languageOfCataloging/languageTerm@type='code' 22044 0
mods/recordInfo/recordChangeDate 6927834 1
mods/recordInfo/recordChangeDate@encoding 6927834 1
mods/recordInfo/recordChangeDate@encoding='iso8601' 6927834 1
mods/recordInfo/recordContentSource 5421300 0.78
mods/recordInfo/recordContentSource@authority 5421300 0.78
mods/recordInfo/recordContentSource@authority='marcorg' 5421300 0.78
mods/recordInfo/recordCreationDate 6927834 1
mods/recordInfo/recordCreationDate@encoding 6927834 1
mods/recordInfo/recordCreationDate@encoding='marc' 6927834 1
mods/recordInfo/recordIdentifier 6927834 1
mods/recordInfo/recordIdentifier@source 6927834 1
mods/relatedItem 2809549 0.41
mods/relatedItem/identifier 2290 0
mods/relatedItem/identifier@type 2290 0
mods/relatedItem/identifier@type='issn' 116 0
mods/relatedItem/identifier@type='local' 2174 0
mods/relatedItem/language 55 0
mods/relatedItem/language/languageTerm 55 0
mods/relatedItem/language/languageTerm@authority 55 0
mods/relatedItem/language/languageTerm@authority='iso639-2b' 55 0
mods/relatedItem/language/languageTerm@type 55 0
mods/relatedItem/language/languageTerm@type='code' 55 0
mods/relatedItem/location 237599 0.03
mods/relatedItem/location/url 237599 0.03
mods/relatedItem/location/url@displayLabel 229357 0.03
mods/relatedItem/location/url@note 2560 0
mods/relatedItem/name 322782 0.05
mods/relatedItem/name/namePart 620561 0.09
mods/relatedItem/name/namePart@type 96677 0.01
mods/relatedItem/name/namePart@type='date' 90197 0.01
mods/relatedItem/name/namePart@type='termsOfAddress' 6480 0
mods/relatedItem/name/role 1620 0
mods/relatedItem/name/role/roleTerm 1620 0
mods/relatedItem/name/role/roleTerm@authority 4 0
mods/relatedItem/name/role/roleTerm@authority='marcrelator' 4 0
mods/relatedItem/name/role/roleTerm@type 1620 0
mods/relatedItem/name/role/roleTerm@type='code' 4 0
mods/relatedItem/name/role/roleTerm@type='text' 1616 0
mods/relatedItem/name@type 321803 0.05
mods/relatedItem/name@type='conference' 2118 0
mods/relatedItem/name@type='corporate' 143072 0.02
mods/relatedItem/name@type='personal' 176613 0.03
mods/relatedItem/note 35368 0.01
mods/relatedItem/originInfo 843 0
mods/relatedItem/originInfo/edition 519 0
mods/relatedItem/originInfo/publisher 393 0
mods/relatedItem/part 201 0
mods/relatedItem/part/text 201 0
mods/relatedItem/physicalDescription 15 0
mods/relatedItem/physicalDescription/extent 9 0
mods/relatedItem/physicalDescription/form 6 0
mods/relatedItem/titleInfo 2535650 0.37
mods/relatedItem/titleInfo/partName 122568 0.02
mods/relatedItem/titleInfo/partNumber 41869 0.01
mods/relatedItem/titleInfo/title 2535650 0.37
mods/relatedItem/titleInfo@type 27 0
mods/relatedItem/titleInfo@type='uniform' 27 0
mods/relatedItem@displayLabel 49 0
mods/relatedItem@type 2588944 0.37
mods/relatedItem@type='constituent' 101257 0.01
mods/relatedItem@type='host' 269 0
mods/relatedItem@type='isReferencedBy' 35359 0.01
mods/relatedItem@type='original' 80 0
mods/relatedItem@type='otherFormat' 940 0
mods/relatedItem@type='otherVersion' 96654 0.01
mods/relatedItem@type='preceding' 418 0
mods/relatedItem@type='series' 2353548 0.34
mods/relatedItem@type='succeeding' 419 0
mods/subject 15881133 2.29
mods/subject/cartographics 27 0
mods/subject/cartographics/scale 27 0
mods/subject/genre 1085782 0.16
mods/subject/geographic 6671510 0.96
mods/subject/geographicCode 3117377 0.45
mods/subject/geographicCode@authority 3117377 0.45
mods/subject/geographicCode@authority= 123 0
mods/subject/geographicCode@authority='iso3166' 2 0
mods/subject/geographicCode@authority='local' 1 0
mods/subject/geographicCode@authority='marcgac' 3117249 0.45
mods/subject/geographicCode@authority='n-mx-ch' 1 0
mods/subject/geographicCode@authority='n-mx-za' 1 0
mods/subject/hierarchicalGeographic 14933 0
mods/subject/hierarchicalGeographic/city 14756 0
mods/subject/hierarchicalGeographic/country 14934 0
mods/subject/hierarchicalGeographic/county 32 0
mods/subject/hierarchicalGeographic/state 2028 0
mods/subject/name 1573667 0.23
mods/subject/name/namePart 2665153 0.38
mods/subject/name/namePart@type 931678 0.13
mods/subject/name/namePart@type='date' 807951 0.12
mods/subject/name/namePart@type='termsOfAddress' 123727 0.02
mods/subject/name/role 625 0
mods/subject/name/role/roleTerm 625 0
mods/subject/name/role/roleTerm@authority 1 0
mods/subject/name/role/roleTerm@authority='marcrelator' 1 0
mods/subject/name/role/roleTerm@type 625 0
mods/subject/name/role/roleTerm@type='code' 1 0
mods/subject/name/role/roleTerm@type='text' 624 0
mods/subject/name@type 1573667 0.23
mods/subject/name@type='conference' 11364 0
mods/subject/name@type='corporate' 474123 0.07
mods/subject/name@type='personal' 1088180 0.16
mods/subject/occupation 1 0
mods/subject/temporal 992761 0.14
mods/subject/temporal@encoding 627 0
mods/subject/temporal@encoding='iso8601' 627 0
mods/subject/temporal@point 426 0
mods/subject/titleInfo 112072 0.02
mods/subject/titleInfo/partName 38231 0.01
mods/subject/titleInfo/partNumber 188 0
mods/subject/titleInfo/title 112072 0.02
mods/subject/topic 17062222 2.46
mods/subject@authority 13032144 1.88
mods/subject@authority= 208 0
mods/subject@authority='DB' 42 0
mods/subject@authority='France.' 1 0
mods/subject@authority='History.' 1 0
mods/subject@authority='Kupu' 1 0
mods/subject@authority='SWD' 17 0
mods/subject@authority='SWD.' 3 0
mods/subject@authority='[ram]' 2 0
mods/subject@authority='aat' 1 0
mods/subject@authority='abne' 6 0
mods/subject@authority='bidex' 213 0
mods/subject@authority='csh' 891 0
mods/subject@authority='cuweb' 1 0
mods/subject@authority='dtict' 20 0
mods/subject@authority='embn' 2 0
mods/subject@authority='fmesh' 44 0
mods/subject@authority='gsafd' 20 0
mods/subject@authority='gtt' 29 0
mods/subject@authority='itrt' 19 0
mods/subject@authority='kupu' 4 0
mods/subject@authority='larpcal' 148 0
mods/subject@authority='lcsh' 12421421 1.79
mods/subject@authority='lcshac' 412771 0.06
mods/subject@authority='lctgm' 86 0
mods/subject@authority='ltcsh' 57 0
mods/subject@authority='mesh' 176026 0.03
mods/subject@authority='nal' 1145 0
mods/subject@authority='nasat' 278 0
mods/subject@authority='pmbok' 2 0
mods/subject@authority='ram' 2890 0
mods/subject@authority='rbgenr' 1 0
mods/subject@authority='renib' 57 0
mods/subject@authority='ruvkp' 12 0
mods/subject@authority='rvm' 15468 0
mods/subject@authority='sears' 44 0
mods/subject@authority='sigle' 17 0
mods/subject@authority='sigle.' 1 0
mods/subject@authority='swd' 31 0
mods/subject@authority='tlsh' 2 0
mods/subject@authority='trt' 87 0
mods/subject@authority='umitrist' 7 0
mods/subject@authority='unbisn' 3 0
mods/subject@authority='unbist' 59 0
mods/subject@authority='wot' 1 0
mods/subject@authority='wpicsh' 5 0
mods/tableOfContents 284974 0.04
mods/tableOfContents@xlink:href 4 0
mods/tableOfContents@xmlns:xlink 4 0
mods/targetAudience 246935 0.04
mods/targetAudience@authority 239025 0.03
mods/targetAudience@authority='marctarget' 239025 0.03
mods/titleInfo 8043387 1.16
mods/titleInfo/nonSort 1475257 0.21
mods/titleInfo/partName 32171 0
mods/titleInfo/partNumber 11737 0
mods/titleInfo/subTitle 3006802 0.43
mods/titleInfo/title 8043387 1.16
mods/titleInfo@displayLabel 35623 0.01
mods/titleInfo@lang 302 0
mods/titleInfo@type 1115552 0.16
mods/titleInfo@type='abbreviated' 2 0
mods/titleInfo@type='alternative' 798079 0.12
mods/titleInfo@type='translated' 2042 0
mods/titleInfo@type='uniform' 315429 0.05
mods/typeOfResource 6927835 1
mods/typeOfResource@collection 4235 0
mods/typeOfResource@manuscript 394 0
mods@version 6927835 1
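
The "average count per record" column above appears to be each count divided by the number of `mods` elements (one per record), rounded to two decimal places. A minimal sketch of that arithmetic, assuming this definition; the counts are copied from the table, and the helper name `average_per_record` is made up for illustration:

```python
# Counts copied from the MODS XML table above.
RECORDS = 6927835  # the "mods" row: one element per record

counts = {
    "mods/classification": 11084966,
    "mods/classification@authority='ddc'": 3695274,
    "mods/classification@authority='lcc'": 7162071,
    "mods/abstract@xlink:href": 5929,
}


def average_per_record(count, records=RECORDS):
    # Assumed definition: occurrences divided by records, rounded to 2 places.
    return round(count / records, 2)


averages = {path: average_per_record(n) for path, n in counts.items()}
for path, avg in averages.items():
    print(path, avg)
```

Under this definition, any path that occurs in fewer than about 0.5% of records rounds down to 0, which would explain why most rows in the table show an average of 0.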

MARC XML

Below are statistics on the MARC XML representation of the LOC dataset. (These stats would be better if you could see the count of subfield codes in the context of specific datafield tags.)

element path | count | average count per record
record 6925631
record/controlfield 27910041 4.03
record/controlfield@tag 27910041 4.03
record/controlfield@tag='001' 6925631 1
record/controlfield@tag='003' 6925631 1
record/controlfield@tag='005' 6925631 1
record/controlfield@tag='006' 2302 0
record/controlfield@tag='007' 205215 0.03
record/controlfield@tag='008' 6925631 1
record/datafield 94243608 13.61
record/datafield/subfield 175332800 25.32
record/datafield/subfield@code 175332800 25.32
record/datafield/subfield@code=' ' 2 0
record/datafield/subfield@code='2' 2505915 0.36
record/datafield/subfield@code='3' 240314 0.03
record/datafield/subfield@code='4' 6301 0
record/datafield/subfield@code='5' 107889 0.02
record/datafield/subfield@code='6' 7 0
record/datafield/subfield@code='7' 21 0
record/datafield/subfield@code='a' 94965652 13.71
record/datafield/subfield@code='b' 21651923 3.13
record/datafield/subfield@code='c' 26665904 3.85
record/datafield/subfield@code='d' 9612093 1.39
record/datafield/subfield@code='e' 773137 0.11
record/datafield/subfield@code='f' 160300 0.02
record/datafield/subfield@code='g' 14407 0
record/datafield/subfield@code='h' 207378 0.03
record/datafield/subfield@code='i' 35486 0.01
record/datafield/subfield@code='j' 11 0
record/datafield/subfield@code='k' 52062 0.01
record/datafield/subfield@code='l' 160300 0.02
record/datafield/subfield@code='m' 502 0
record/datafield/subfield@code='n' 162128 0.02
record/datafield/subfield@code='o' 27 0
record/datafield/subfield@code='p' 233989 0.03
record/datafield/subfield@code='q' 671736 0.1
record/datafield/subfield@code='r' 9601 0
record/datafield/subfield@code='s' 8507 0
record/datafield/subfield@code='t' 440574 0.06
record/datafield/subfield@code='u' 262245 0.04
record/datafield/subfield@code='v' 2880491 0.42
record/datafield/subfield@code='w' 2175 0
record/datafield/subfield@code='x' 7635556 1.1
record/datafield/subfield@code='y' 995198 0.14
record/datafield/subfield@code='z' 4870968 0.7
record/datafield@ind1 94243608 13.61
record/datafield@ind2 94243608 13.61
record/datafield@tag 94243608 13.61
record/datafield@tag='010' 6925631 1
record/datafield@tag='015' 820018 0.12
record/datafield@tag='016' 12766 0
record/datafield@tag='017' 793 0
record/datafield@tag='020' 3986248 0.58
record/datafield@tag='022' 176 0
record/datafield@tag='024' 2096 0
record/datafield@tag='025' 266861 0.04
record/datafield@tag='026' 1 0
record/datafield@tag='027' 374 0
record/datafield@tag='028' 210 0
record/datafield@tag='030' 16 0
record/datafield@tag='033' 3 0
record/datafield@tag='034' 18 0
record/datafield@tag='035' 2041705 0.29
record/datafield@tag='036' 3 0
record/datafield@tag='037' 111908 0.02
record/datafield@tag='040' 5419097 0.78
record/datafield@tag='041' 492742 0.07
record/datafield@tag='042' 2065366 0.3
record/datafield@tag='043' 2754867 0.4
record/datafield@tag='044' 121 0
record/datafield@tag='045' 14403 0
record/datafield@tag='046' 14 0
record/datafield@tag='047' 1 0
record/datafield@tag='048' 3 0
record/datafield@tag='049' 1 0
record/datafield@tag='050' 6901718 1
record/datafield@tag='051' 36807 0.01
record/datafield@tag='052' 430 0
record/datafield@tag='055' 11346 0
record/datafield@tag='060' 94193 0.01
record/datafield@tag='061' 1 0
record/datafield@tag='066' 17 0
record/datafield@tag='070' 9234 0
record/datafield@tag='072' 14733 0
record/datafield@tag='074' 28379 0
record/datafield@tag='080' 1500 0
record/datafield@tag='082' 3696726 0.53
record/datafield@tag='084' 341 0
record/datafield@tag='086' 67766 0.01
record/datafield@tag='088' 9548 0
record/datafield@tag='100' 5205321 0.75
record/datafield@tag='110' 459587 0.07
record/datafield@tag='111' 119522 0.02
record/datafield@tag='130' 33100 0
record/datafield@tag='210' 2 0
record/datafield@tag='222' 1 0
record/datafield@tag='240' 226103 0.03
record/datafield@tag='241' 1 0
record/datafield@tag='242' 2042 0
record/datafield@tag='243' 20 0
record/datafield@tag='245' 6925630 1
record/datafield@tag='246' 343621 0.05
record/datafield@tag='247' 71 0
record/datafield@tag='250' 1225716 0.18
record/datafield@tag='254' 1 0
record/datafield@tag='255' 27 0
record/datafield@tag='256' 176 0
record/datafield@tag='257' 1 0
record/datafield@tag='260' 6925272 1
record/datafield@tag='263' 75567 0.01
record/datafield@tag='265' 32 0
record/datafield@tag='270' 34 0
record/datafield@tag='300' 6917906 1
record/datafield@tag='306' 1 0
record/datafield@tag='310' 192 0
record/datafield@tag='350' 241852 0.03
record/datafield@tag='351' 36 0
record/datafield@tag='362' 971 0
record/datafield@tag='400' 1620 0
record/datafield@tag='410' 23855 0
record/datafield@tag='411' 172 0
record/datafield@tag='440' 1069851 0.15
record/datafield@tag='490' 1221202 0.18
record/datafield@tag='500' 4148026 0.6
record/datafield@tag='501' 6350 0
record/datafield@tag='502' 59148 0.01
record/datafield@tag='504' 3208344 0.46
record/datafield@tag='505' 284867 0.04
record/datafield@tag='506' 170 0
record/datafield@tag='507' 6 0
record/datafield@tag='508' 26 0
record/datafield@tag='510' 35359 0.01
record/datafield@tag='511' 19 0
record/datafield@tag='513' 451 0
record/datafield@tag='514' 1 0
record/datafield@tag='515' 13 0
record/datafield@tag='516' 141 0
record/datafield@tag='518' 7 0
record/datafield@tag='520' 324040 0.05
record/datafield@tag='521' 7910 0
record/datafield@tag='522' 8 0
record/datafield@tag='524' 18 0
record/datafield@tag='525' 1582 0
record/datafield@tag='530' 18956 0
record/datafield@tag='533' 109209 0.02
record/datafield@tag='534' 80 0
record/datafield@tag='535' 2 0
record/datafield@tag='536' 923 0
record/datafield@tag='538' 17968 0
record/datafield@tag='540' 548 0
record/datafield@tag='541' 1385 0
record/datafield@tag='544' 1 0
record/datafield@tag='545' 29 0
record/datafield@tag='546' 192052 0.03
record/datafield@tag='547' 6 0
record/datafield@tag='550' 381 0
record/datafield@tag='555' 73 0
record/datafield@tag='556' 2 0
record/datafield@tag='561' 5673 0
record/datafield@tag='562' 2 0
record/datafield@tag='563' 3 0
record/datafield@tag='565' 4 0
record/datafield@tag='580' 137 0
record/datafield@tag='581' 35 0
record/datafield@tag='583' 927 0
record/datafield@tag='585' 32 0
record/datafield@tag='586' 428 0
record/datafield@tag='600' 1088662 0.16
record/datafield@tag='610' 474260 0.07
record/datafield@tag='611' 11366 0
record/datafield@tag='630' 112066 0.02
record/datafield@tag='650' 9457632 1.37
record/datafield@tag='651' 1893884 0.27
record/datafield@tag='652' 4 0
record/datafield@tag='653' 72698 0.01
record/datafield@tag='654' 315 0
record/datafield@tag='655' 166390 0.02
record/datafield@tag='656' 1 0
record/datafield@tag='657' 12 0
record/datafield@tag='700' 2921390 0.42
record/datafield@tag='710' 1219439 0.18
record/datafield@tag='711' 40532 0.01
record/datafield@tag='720' 3 0
record/datafield@tag='730' 65300 0.01
record/datafield@tag='740' 465240 0.07
record/datafield@tag='752' 14939 0
record/datafield@tag='753' 23 0
record/datafield@tag='760' 29 0
record/datafield@tag='765' 2 0
record/datafield@tag='767' 3 0
record/datafield@tag='770' 10 0
record/datafield@tag='772' 15 0
record/datafield@tag='773' 255 0
record/datafield@tag='774' 14 0
record/datafield@tag='775' 115 0
record/datafield@tag='776' 936 0
record/datafield@tag='777' 13 0
record/datafield@tag='780' 418 0
record/datafield@tag='785' 419 0
record/datafield@tag='787' 77 0
record/datafield@tag='800' 53680 0.01
record/datafield@tag='810' 107176 0.02
record/datafield@tag='811' 1500 0
record/datafield@tag='830' 547519 0.08
record/datafield@tag='840' 4284 0
record/datafield@tag='850' 961 0
record/datafield@tag='852' 440 0
record/datafield@tag='856' 256449 0.04
record/datafield@tag='865' 1 0
record/datafield@tag='880' 1 0
record/datafield@tag='886' 100 0
record/datafield@tag='987' 32007 0
record/leader 6925631 1
record@xmlns 6925631 1
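
As noted before the table, these stats would be more useful if subfield codes were counted in the context of their datafield tag. Below is a rough sketch of how that could be done with a SAX handler. It is not the code used to produce the table above. The class name `MarcSubfieldHandler` and the key format are assumptions chosen for illustration, and the sample input is a fragment of the sample record shown later on this page.

```python
import xml.sax
from collections import Counter


class MarcSubfieldHandler(xml.sax.ContentHandler):
    """Counts subfield codes keyed by the tag of the enclosing datafield."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()
        self.tag = None  # tag of the datafield currently open, if any

    def startElement(self, name, attrs):
        if name == "datafield":
            self.tag = attrs.get("tag")
        elif name == "subfield" and self.tag is not None:
            key = "record/datafield@tag='%s'/subfield@code='%s'" % (
                self.tag, attrs.get("code"))
            self.counts[key] += 1

    def endElement(self, name):
        if name == "datafield":
            self.tag = None


SAMPLE = b"""<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="040" ind1=" " ind2=" ">
    <subfield code="a">DLC</subfield>
    <subfield code="c">CU</subfield>
    <subfield code="d">CU</subfield>
    <subfield code="d">DLC</subfield>
  </datafield>
  <datafield tag="050" ind1="0" ind2="0">
    <subfield code="a">PK2579.R255</subfield>
    <subfield code="b">K3</subfield>
  </datafield>
</record>"""

handler = MarcSubfieldHandler()
xml.sax.parseString(SAMPLE, handler)
for key, n in sorted(handler.counts.items()):
    print("LongValueSum:%s\t%d" % (key, n))
```

The printed lines use the same `LongValueSum:` prefix as the mapper in the Processing Notes, so a handler along these lines could in principle be dropped into the same streaming job.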

Sample Record (Part 29, Record 1)

Record 1 from part 29.

MARC XML

Generated by yaz-marcdump:

<record xmlns="http://www.loc.gov/MARC21/slim">
  <leader>00519cam a22002171 4500</leader>
  <controlfield tag="001">sa 64009056 </controlfield>
  <controlfield tag="003">DLC</controlfield>
  <controlfield tag="005">20050711190518.0</controlfield>
  <controlfield tag="008">941008s1962 ii b 000 0 orio </controlfield>
  <datafield tag="010" ind1=" " ind2=" ">
    <subfield code="a">sa 64009056 </subfield>
  </datafield>
  <datafield tag="025" ind1=" " ind2=" ">
    <subfield code="a">PL480:I-O-205</subfield>
  </datafield>
  <datafield tag="035" ind1=" " ind2=" ">
    <subfield code="a">(OCoLC)31249133</subfield>
  </datafield>
  <datafield tag="040" ind1=" " ind2=" ">
    <subfield code="a">DLC</subfield>
    <subfield code="c">CU</subfield>
    <subfield code="d">CU</subfield>
    <subfield code="d">DLC</subfield>
  </datafield>
  <datafield tag="042" ind1=" " ind2=" ">
    <subfield code="a">premarc</subfield>
  </datafield>
  <datafield tag="050" ind1="0" ind2="0">
    <subfield code="a">PK2579.R255</subfield>
    <subfield code="b">K3</subfield>
  </datafield>
  <datafield tag="100" ind1="1" ind2=" ">
    <subfield code="a">Rautroy, Sachi,</subfield>
    <subfield code="d">1916-</subfield>
  </datafield>
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">Kabita&#772;.</subfield>
  </datafield>
  <datafield tag="260" ind1=" " ind2=" ">
    <subfield code="c">[1962]</subfield>
  </datafield>
  <datafield tag="300" ind1=" " ind2=" ">
    <subfield code="a">8, 304 p.</subfield>
    <subfield code="c">25 cm.</subfield>
  </datafield>
  <datafield tag="500" ind1=" " ind2=" ">
    <subfield code="a">In Oriya.</subfield>
  </datafield>
  <datafield tag="504" ind1=" " ind2=" ">
    <subfield code="a">Bibliographical footnotes.</subfield>
  </datafield>
</record>

MODS XML

Generated from MARC XML by MARC21slim2MODS3.xsl:

<?xml version="1.0" encoding="UTF-8"?>
<mods xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.loc.gov/mods/v3" version="3.3" xsi:schemaLocation="http://www.loc.gov/mods/v3 http://www.loc.gov/standards/mods/v3/mods-3-3.xsd">
  <titleInfo>
    <title>Kabita&#772;</title>
  </titleInfo>
  <name type="personal">
    <namePart>Rautroy, Sachi</namePart>
    <namePart type="date">1916-</namePart>
    <role>
      <roleTerm authority="marcrelator" type="text">creator</roleTerm>
    </role>
  </name>
  <typeOfResource>text</typeOfResource>
  <genre authority="marcgt">bibliography</genre>
  <originInfo>
    <place>
      <placeTerm type="code" authority="marccountry">ii</placeTerm>
    </place>
    <dateIssued>[1962]</dateIssued>
    <dateIssued encoding="marc">1962</dateIssued>
    <issuance>monographic</issuance>
  </originInfo>
  <language>
    <languageTerm authority="iso639-2b" type="code">ori</languageTerm>
  </language>
  <physicalDescription>
    <form authority="marcform">print</form>
    <extent>8, 304 p. 25 cm.</extent>
  </physicalDescription>
  <note>In Oriya.</note>
  <note>Bibliographical footnotes.</note>
  <classification authority="lcc">PK2579.R255 K3</classification>
  <identifier type="lccn">sa 64009056</identifier>
  <recordInfo>
    <recordContentSource authority="marcorg">DLC</recordContentSource>
    <recordCreationDate encoding="marc">941008</recordCreationDate>
    <recordChangeDate encoding="iso8601">20050711190518.0</recordChangeDate>
    <recordIdentifier source="DLC">sa 64009056 </recordIdentifier>
  </recordInfo>
</mods>

MODS RDF

Generated from MODS XML by Simile marcmods2rdf/stylesheets/mods2rdf.xslt then tidied up by hand because the stylesheet expects a modsCollection root element:

<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:modsrdf="http://simile.mit.edu/2006/01/ontologies/mods3#"
  xmlns:owl="http://www.w3.org/2002/07/owl#" xmlns:dc="http://purl.org/dc/elements/1.1/"
  xmlns:dcterms="http://purl.org/dc/terms/" xmlns:role="http://simile.mit.edu/2006/01/roles#">

  <modsrdf:Record rdf:about="http://libraries.mit.edu/barton/DLC/sa_64009056_">
    <modsrdf:records rdf:resource="info:lccn/sa_64009056"/>
    <modsrdf:origin rdf:resource="info:marcorg/DLC"/>
    <modsrdf:created>
      <modsrdf:Date modsrdf:value="941008" modsrdf:encoding="marc"/>
    </modsrdf:created>
    <modsrdf:changed>
      <modsrdf:Date modsrdf:value="20050711190518.0" modsrdf:encoding="iso8601"/>
    </modsrdf:changed>
  </modsrdf:Record>
  <rdf:Description rdf:about="info:lccn/sa 64009056">
    <rdf:type rdf:resource="http://simile.mit.edu/2006/01/ontologies/mods3#Text"/>
    <modsrdf:title>
      <rdf:Description>
        <rdf:type rdf:resource="http://simile.mit.edu/2006/01/ontologies/mods3#Title"/>
        <modsrdf:value>Kabita&#772;</modsrdf:value>
      </rdf:Description>
    </modsrdf:title>
    <role:creator rdf:resource="http://simile.mit.edu/2006/01/Entity#Rautroy_Sachi_1916"/>
    <modsrdf:genre>
      <modsrdf:Genre rdf:about="http://simile.mit.edu/2006/01/genre/marcgt/bibliography"
        modsrdf:authority="marcgt" modsrdf:value="bibliography"/>
    </modsrdf:genre>

    <modsrdf:dateIssued>
      <modsrdf:Date modsrdf:value="[1962]"/>
    </modsrdf:dateIssued>
    <modsrdf:dateIssued>
      <modsrdf:Date modsrdf:value="1962" modsrdf:encoding="marc"/>
    </modsrdf:dateIssued>
    <modsrdf:issuance>monographic</modsrdf:issuance>

    <modsrdf:language>
      <modsrdf:Language rdf:about="http://simile.mit.edu/2006/01/language/iso639-2b/ori"
        modsrdf:authority="iso639-2b" modsrdf:value="ori"/>
    </modsrdf:language>
    <modsrdf:physicalDescription>
      <modsrdf:Description>
        <modsrdf:form>
          <modsrdf:Form rdf:about="http://simile.mit.edu/2006/01/form/marcform/print"
            modsrdf:value="print" modsrdf:authority="marcform"/>
        </modsrdf:form>
        <modsrdf:extent>8, 304 p. 25 cm.</modsrdf:extent>
      </modsrdf:Description>
    </modsrdf:physicalDescription>
    <modsrdf:note>In Oriya.</modsrdf:note>
    <modsrdf:note>Bibliographical footnotes.</modsrdf:note>
    <modsrdf:classification>
      <modsrdf:Classification modsrdf:value="PK2579.R255 K3" modsrdf:authority="lcc"/>
    </modsrdf:classification>
  </rdf:Description>
  <rdf:Description rdf:about="http://simile.mit.edu/2006/01/Entity#Rautroy_Sachi_1916">
    <rdf:type rdf:resource="http://simile.mit.edu/2006/01/ontologies/mods3#Person"/>
    <modsrdf:fullName>Rautroy, Sachi</modsrdf:fullName>
    <modsrdf:dates>1916-</modsrdf:dates>

  </rdf:Description>
</rdf:RDF>


Downloads

Downloads of this dataset, in MODS XML and MARC XML representations.

Each download is a gzipped tar containing a *set* of up to 25 XML files. Each of these files is a 10,000-record split of the data in the corresponding part. I broke each part into 10,000-record splits so I could process the transformations more easily.

N.B. there is a bug in part 13, split 25: for some reason the MARC XML output was incomplete, so up to 10,000 records could be missing.
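
The page does not say how the splits were produced. As an illustration only, the following sketch shows one way to break a stream of records into splits of at most 10,000. The function name `splits` and the part size used here are hypothetical.

```python
from itertools import islice


def splits(records, size=10000):
    """Yield successive lists of at most `size` records."""
    it = iter(records)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


# Hypothetical part size, for illustration only; stand-in integers play
# the role of parsed records.
part = range(243217)
sizes = [len(chunk) for chunk in splits(part)]
print(len(sizes), sizes[-1])
```

With splits of 10,000 and at most 25 files per download, a part can hold up to 250,000 records. The last split of a part is usually shorter than the rest.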


Processing Notes

The statistics were generated using Hadoop 0.17.2 running in pseudo-distributed mode on a single EC2 c1.xlarge instance. The job took 50 minutes to complete.

Assuming hadoop binaries are on the path...

$ hadoop jar /usr/local/hadoop/contrib/streaming/hadoop-0.17.2.1-streaming.jar -input input -output output -mapper mapper7.py -reducer aggregate -file mapper7.py

Contents of mapper7.py ...

#!/usr/bin/python

# Hadoop streaming mapper: emits one "LongValueSum:<path>\t1" line for every
# element, attribute, and selected attribute value seen in the input XML.
# The built-in "aggregate" reducer then sums these per path.

import sys
import xml.sax


def emit(key):
    # The "LongValueSum:" prefix tells the aggregate reducer to sum the values.
    sys.stdout.write("LongValueSum:" + key + "\t1\n")


class ModsHandler(xml.sax.ContentHandler):

    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.stack = []  # names of the currently open elements, root first

    def startElement(self, name, attrs):
        self.stack.append(name)
        path = "/".join(self.stack)
        emit(path)
        for (k, v) in attrs.items():
            emit(path + "@" + k)
            # Break counts down by value only for these attributes.
            if k in ("type", "encoding", "authority"):
                emit(path + "@" + k + "='" + v + "'")

    def endElement(self, name):
        self.stack.pop()


def main(argv):
    parser = xml.sax.make_parser()
    parser.setContentHandler(ModsHandler())
    # Cache all of stdin before parsing (see the note below the listing).
    # Errors are left to propagate so the task fails visibly instead of
    # writing error text into the mapper output.
    lines = sys.stdin.readlines()
    for line in lines:
        parser.feed(line)
    parser.close()


if __name__ == "__main__":
    main(sys.argv)

N.B. mapper7.py caches the lines of each file before feeding them to the SAX parser. I tried feeding stdin directly to the SAX parser, but got broken pipe errors when running Hadoop in distributed mode (standalone works fine).
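
The `-reducer aggregate` option in the command above hands the mapper output to Hadoop's built-in aggregate reducer, which sums the values of keys prefixed with `LongValueSum:`. For checking mapper output on a small sample without Hadoop, a rough local stand-in might look like the sketch below. The function `aggregate` is a hypothetical helper, not part of Hadoop, and it handles only the `LongValueSum` case used on this page.

```python
from collections import defaultdict


def aggregate(mapper_lines):
    # Sum the counts of "LongValueSum:<key><TAB><n>" lines per key.
    totals = defaultdict(int)
    prefix = "LongValueSum:"
    for line in mapper_lines:
        line = line.rstrip("\n")
        if not line.startswith(prefix):
            continue  # skip anything that is not a LongValueSum record
        key, _, value = line[len(prefix):].rpartition("\t")
        totals[key] += int(value)
    return dict(totals)


# Mapper output for two small records, written out by hand.
mapper_output = [
    "LongValueSum:mods\t1\n",
    "LongValueSum:mods/titleInfo\t1\n",
    "LongValueSum:mods/titleInfo/title\t1\n",
    "LongValueSum:mods\t1\n",
    "LongValueSum:mods/titleInfo\t1\n",
    "LongValueSum:mods/titleInfo\t1\n",
]
totals = aggregate(mapper_output)
for key in sorted(totals):
    print(key + "\t" + str(totals[key]))
```

Splitting on the last tab with `rpartition` keeps keys intact even when an attribute value contains unusual characters.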

Contents of hadoop-site.xml ...

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>localhost:9000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>8</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/mnt/tmp/hadoop/data</value>
  </property>
</configuration>