x-definition-beginner-xml Wiki

Uses X-definition to perform XML streaming processing

Brought to you by: curtthomas

Calling methods from other Java language programs in your .xdef files

I stated in this Wiki's general introduction that much of this project is about letting end users take advantage of X-definition's considerable functionality without having to write and compile Java source code. You may nevertheless possess other compiled Java code (libraries, as it were) that has proved useful in the past or stands to prove useful in future: indeed, as soon as you download this and this, you do (at least where using X-definition is concerned)!

In this article, we'll learn how to call existing Java code from our X-definition files. It might not sound very important, but it can be relevant when choosing between X-definition and XSLT 3.0, because the Saxon processor currently offers the equivalent capability only as a premium feature (in its paid Professional and Enterprise editions). Whether being able to put compiled Java classes or types to use gives X-definition certain advantages will be for you to decide.

The only two files we will need besides xdef-beginner.jar (sourceforge.net) and the XML source files are Isbn.jar (sourceforge.net) and IsbnInfo.class (sourceforge.net), also linked above leading up to the exclamation mark. The files should be saved in the same directory as your copy of xdef-beginner.jar (sourceforge.net). You still do not need to have the Java Development Kit (JDK) installed on your computer.

In the general introduction, I briefly mentioned the British National Bibliography and the corresponding download from the British Library. (I did not mention it anywhere in the tutorial because its XML markup can entail challenges not necessarily encountered elsewhere.) Below, what we'll use instead is the British Library Integrated Catalogue, which is downloadable but really enormous. I'll furnish links to particular files extracted from the archive as we go along, although I think that downloading the whole thing is well worth it.

The following X-definition code can be saved as a file to the same directory as the Isbn.jar (sourceforge.net) and IsbnInfo.class (sourceforge.net) files downloaded earlier:

<?xml version="1.0" encoding="UTF-8"?>
<xd:def xmlns:xd="http://www.xdef.org/xdef/4.0" name="rdf" root="rdf:RDF" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:isbd="http://iflastandards.info/ns/isbd/elements/" xmlns:owlt="http://www.w3.org/2006/time#" xmlns:skos="http://www.w3.org/2004/02/skos/core#" xmlns:bibo="http://purl.org/ontology/bibo/" xmlns:rda="http://rdvocab.info/Elements/" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
<xd:declaration>
external method {
String IsbnInfo.getAgencyName(String);
String IsbnInfo.getPrefix(String);
}
boolean flag;
</xd:declaration>
<rdf:RDF>
<rdf:Description xd:script="occurs +;options ignoreOther;
init {flag = false;}forget;">
<bibo:isbn10>
onTrue {if (flag EQ false) {outln(getAgencyName(getText()) + " " + getPrefix(getText())); flag = true;}}
</bibo:isbn10>
<bibo:isbn13>
onTrue {if (flag EQ false) {outln(getAgencyName(getText()) + " " + getPrefix(getText())); flag = true;}}
</bibo:isbn13>
</rdf:Description>
</rdf:RDF>
</xd:def>

It is rudimentary code. The flag needs to be set in order to limit the number of International Standard Book Numbers obtained to one per bibliographic record.

Like the flag, the methods located in compiled Java code external to X-definition's own are declared in the xd:declaration element. The return type is the first part of each declaration, and the name of each method is qualified by the name of its declaring class, as X-definition requires (the qualification is no longer needed once you begin calling the method in your X-definition scripting). Note that neither the return types nor the method parameters are types that the Java programming language defines: they are instead data types (e.g. String) defined by X-definition itself, which may coincide with Java's, as they have here, but will not always do so.

Per the user documentation, in order to be able to declare and ultimately use methods like those we are discussing here in X-definition, the Java class or type to which they belong needs to have declared them both public and static.
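As a minimal sketch of what such a class might look like, here is a hypothetical helper. The names IsbnTidy and stripSeparators() are made up for illustration and are not part of Isbn.jar or IsbnInfo.class. Following the pattern of the declarations in the example above, its method would presumably be declared as `String IsbnTidy.stripSeparators(String);`.

```java
// Hypothetical helper class, for illustration only.
// It is package-private so that the sketch compiles under any file name;
// declare it public (in IsbnTidy.java) if your tooling requires that.
final class IsbnTidy {

    private IsbnTidy() {
        // No instances: the class only offers static methods.
    }

    /** Removes hyphens and whitespace from an ISBN string. */
    public static String stripSeparators(String isbn) {
        // public + static, as the user documentation is said to require.
        return isbn.replaceAll("[\\s-]", "");
    }

    public static void main(String[] args) {
        // Sample value only.
        System.out.println(stripSeparators("978-0-14 044913 6")); // prints 9780140449136
    }
}
```

An instance method, or one that is not public, would not meet the requirement stated above.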

As to syntax, braces were used because more than one method was declared. The keywords external and method are required before the braces. They distinguish these declarations from methods that are both declared and defined in the xd:declaration element's text using X-definition's built-in scripting language exclusively, which is also possible.

The getPrefix() method analyzes each International Standard Book Number (ISBN) obtained from the bibliographic records by X-definition and identifies the initial digits indicating the specific jurisdiction of the International ISBN Agency under which the ISBN falls. (I've said jurisdiction, but an analogy exists with domains and subdomains: for instance, there are two possible prefixes for Lebanon.) It does so by applying rules that it obtains from Isbn.jar's code, and prints "Can't identify" if there is an error.

The getAgencyName() method obtains nearly the same information in natural language rather than as a numeric code. Where getPrefix() produces "9780", it will produce "English language" (a description assigned by the International ISBN Agency to both "9780" and "9781").
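To make the relationship between the two methods concrete, here is a rough sketch. It is not the code in IsbnInfo.class, whose rules come from Isbn.jar and are not reproduced here. The class name PrefixLookupSketch and its small lookup table are assumptions, with the table filled from a few of the results listed further down this page. It only handles 13-digit ISBNs without hyphens and does not validate check digits.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; the real IsbnInfo class works differently.
final class PrefixLookupSketch {

    // A handful of prefix/agency pairs taken from the results on this page.
    private static final Map<String, String> AGENCIES = new LinkedHashMap<>();
    static {
        AGENCIES.put("9780", "English language");
        AGENCIES.put("9781", "English language");
        AGENCIES.put("9782", "French language");
        AGENCIES.put("9783", "German language");
        AGENCIES.put("978956", "Chile");
    }

    private PrefixLookupSketch() {
    }

    /** Returns the known prefix that the ISBN starts with, as a numeric code. */
    public static String getPrefix(String isbn13) {
        for (String prefix : AGENCIES.keySet()) {
            if (isbn13.startsWith(prefix)) {
                return prefix;
            }
        }
        return "Can't identify"; // the fallback text mentioned in the article
    }

    /** Returns the same information as a natural-language name. */
    public static String getAgencyName(String isbn13) {
        String name = AGENCIES.get(getPrefix(isbn13));
        // Assumption: the sketch reuses the same fallback text for names.
        return name != null ? name : "Can't identify";
    }

    public static void main(String[] args) {
        String isbn = "9780620000000"; // made-up digits
        System.out.println(getAgencyName(isbn) + " " + getPrefix(isbn));
    }
}
```

The main method prints the agency name followed by the prefix, which is the same order used by the outln() call in the X-definition example.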

The Library's Integrated Catalogue is defined on the download site as records that do not qualify for the British National Bibliography, and comprises publications from lands all around the world. To obtain some idea as to which, one can run the example (I've assigned it the filename lands.xdef):

java -cp .:xdef-beginner.jar:Isbn.jar xdef lands.xdef BLICBasicB_202105_f97.rdf

The XML file identified at the very end of the command can be downloaded by clicking here, and should be extracted to the same directory as the files downloaded earlier. I was reluctant to zip up this file and the other one we shall use: the problem isn't so much the file's size in bytes (only 7.5M) as it is its breadth of markup. If your browser opens it after you click instead of saving it to disc or to the cloud, it could attempt to parse the XML prior to displaying it (as a collapsible tree, in the tradition of Internet Explorer) and hang.

We are running X-definition slightly differently from the way it was done in either the tutorial or the general introduction. Because X-definition requires access to the Java classpath this time around, we need to specify xdef-beginner.jar's main class ("xdef") ourselves rather than relying on the jar archive's manifest as we were doing before. The -cp option is mandatory, and it must be supplied with the locations of all of our compiled code, starting with the current directory ("."). The colon separates classpath entries on Unix-like systems; on Windows, the separator is a semicolon.

The terminal may have seemed to scroll quite a lot once the example was run, but the file in question is only the most recent update to the Integrated Catalogue and is modest in comparison with most of the archive's others. It is possible to pipe the output to other commands and thereby do a lot more than scroll:

java -cp .:xdef-beginner.jar:Isbn.jar xdef lands.xdef BLICBasicB_202105_f97.rdf | sort | uniq -c | less

The command ends by piping to the less pager, although the output isn't large:

      3 Albania 97899927
     23 Brazil 97865
    101 Brazil 97885
     13 Chile 978956
      1 Denmark 97887
      2 Egypt 978977
    188 English language 9780
    533 English language 9781
     58 former U.S.S.R 9785
      4 France 97910
     56 French language 9782
     13 German language 9783
      1 International NGO Publishers and EU Organizations 97892
      2 Italy 97888
      1 Kosova 9789951
      5 Malaysia 978967
      1 Moldova 9789975
      5 Netherlands 97890
      1 Netherlands 97894
      2 Philippines 978971
      1 Portugal 978972
     14 Singapore 978981
      3 Thailand 978616
     46 Thailand 978974
      2 Tunisia 9789938
      1 Turkey 978975

If you try the command with an older update (e.g. this one), the output will amount to a little more.

The uniq command consolidated (as well as formatted) the results, giving an approximation of the number of bibliographic records in the file for each of the lands where their ISBNs were minted. It is only an approximation, not least because ISBNs have been assigned to books on publication only since the 1960s, and because that started in some lands much sooner than it did in others. The original bibliographic records could also have contained multiple ISBNs for the same item if ISBNs for different areas were printed on (or attached to) the cover of the book or the verso of the book's title page. We only retrieved the first one to be found.
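For readers who would rather see the consolidation step spelled out, the following is a small sketch of what `sort | uniq -c` amounts to: counting identical lines and listing them in order. The class name TallySketch is made up for illustration. One caveat is that a TreeMap orders keys by plain String comparison, which can differ from the locale-aware ordering that the sort command may use on your system.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of the counting done by "sort | uniq -c".
final class TallySketch {

    private TallySketch() {
    }

    /** Counts how often each distinct line occurs, with keys kept sorted. */
    public static Map<String, Integer> tally(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            counts.merge(line, 1, Integer::sum); // add 1, or start at 1
        }
        return counts;
    }

    public static void main(String[] args) {
        // Sample lines in the "agency name, prefix" format printed earlier.
        List<String> sample = Arrays.asList(
                "English language 9781",
                "Brazil 97885",
                "English language 9781");
        for (Map.Entry<String, Integer> entry : tally(sample).entrySet()) {
            // Count first, then the line, as uniq -c does.
            System.out.printf("%7d %s%n", entry.getValue(), entry.getKey());
        }
    }
}
```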

Any glimpse at all into a huge set of data like the Integrated Catalogue is certainly worthwhile, as could be any kind of an overview given the amount of your computer's resources the Integrated Catalogue uses even when it's only being stored.

I'm still not certain that demonstrating X-definition's extensibility need end there, particularly when the preponderance of records grouped in our results under the International ISBN Agency's "English language" rubric suggests a further experiment.

As well as encompassing two prefixes ("registrant groups": 978-0 and 978-1), "English language" encompasses many parts of the world (and can even be applicable to books printed in Afrikaans). The Integrated Catalogue facilitates distinguishing between books with ISBNs bearing the 978-0 and 978-1 prefixes inasmuch as each bibliographic record contains a code indicating the book's country of origin. Records that are similar can therefore be grouped together on that basis, but at least one scenario that has greater potential empirically (greater "real world" potential) is conceivable.

The code is as follows:

<?xml version="1.0" encoding="UTF-8"?>
<xd:def xmlns:xd="http://www.xdef.org/xdef/4.0" name="rdf" root="rdf:RDF" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:isbd="http://iflastandards.info/ns/isbd/elements/" xmlns:owlt="http://www.w3.org/2006/time#" xmlns:skos="http://www.w3.org/2004/02/skos/core#" xmlns:bibo="http://purl.org/ontology/bibo/" xmlns:rda="http://rdvocab.info/Elements/" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
<xd:declaration>
external method {
String IsbnInfo.getAgencyName(String);
String IsbnInfo.getPrefix(String);
int IsbnInfo.getRegistrantElementSizeAsInt(String);
String IsbnInfo.getQualifiedRegistrantElement(String);
}
boolean flagone;
boolean flagtwo;
String publisher;
String registrant;
String prefix;
String agency;
String land = "sa";
</xd:declaration>
<rdf:RDF>
<rdf:Description xd:script='occurs +;options ignoreOther;
init {flagone = false;flagtwo = false;publisher = "";prefix = "";} finally {if (flagone EQ true AND flagtwo EQ true) outln(registrant + " " +
"(" + prefix + ", " + agency + ") " + publisher);}forget;'>
<rda:placeOfPublication>
<rdf:Description>
<rdfs:label>
onTrue {if (getText() EQ land) flagtwo = true;}
</rdfs:label>
</rdf:Description>
</rda:placeOfPublication>
<dcterms:publisher>
<rdf:Description>
<rdfs:label>
onTrue {publisher = getText();}
</rdfs:label>
</rdf:Description>
</dcterms:publisher>
<bibo:isbn10>
onTrue {if (flagone EQ false) {flagone = true;registrant = getQualifiedRegistrantElement(getText());
prefix = getPrefix(getText());agency = getAgencyName(getText());}}
</bibo:isbn10>
<bibo:isbn13>
onTrue {if (flagone EQ false) {flagone = true;registrant = getQualifiedRegistrantElement(getText());
prefix = getPrefix(getText());agency = getAgencyName(getText());}}
</bibo:isbn13>
</rdf:Description>
</rdf:RDF>
</xd:def>

The code is not very different from before. There are two flags where there was previously just one, because we're testing not only whether the record included an ISBN but also which country of origin (place of publication) was indicated each time, keeping the ISBN only if the land identified was South Africa (abbreviated "sa").

There are more String variables, and two of them (prefix and agency) are assigned using the same method calls as in the preceding example. The land variable is assigned its value where it is declared, and that value has global scope. The publisher variable is assigned verbatim from the text found under the dcterms:publisher element and its children.

The registrant variable is assigned as the result of a method call, as prefix and agency were. The method used to determine its value is getQualifiedRegistrantElement(), also from the IsbnInfo class. The qualified registrant element is practically the same as the prefix (978-0, for example), except that it's followed by the number that the publisher's local ISBN agency put at its disposal (e.g., 978-014 for Penguin, where 14 is the registrant element and 978-0 is the part required in order for it to be unique).

I've declared one method, getRegistrantElementSizeAsInt(), although it wasn't required, because it seemed worth reminding anyone who glances at the code that ISBNs are weighted. If an ISBN assigned by Penguin to one of its publications were passed to it, it would return 2, which is the number of digits in "14". The registrant element size goes up to 7. A publisher can be granted a registrant element short enough to accommodate up to one million distinct publications, as Penguin was, or one so long (i.e., seven digits) that it can only ever be used on ten distinct occasions. Weighing ISBNs according to their respective registrant elements is therefore weighing them according to "less is more".
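The arithmetic behind those figures can be sketched as follows. A 13-digit ISBN ends in one check digit, so whatever digits remain after the qualified registrant element (978 or 979, the registration group and the registrant element) are left for numbering publications. The class and method names here are hypothetical and are not taken from IsbnInfo.class.

```java
// Hypothetical illustration of how registrant length limits a publisher's range.
final class RegistrantCapacitySketch {

    private static final int ISBN13_LENGTH = 13;
    private static final int CHECK_DIGITS = 1;

    private RegistrantCapacitySketch() {
    }

    /**
     * Returns how many distinct publications can be numbered under a
     * qualified registrant element given without hyphens, e.g. "978014".
     */
    public static long publicationsAvailable(String qualifiedRegistrantElement) {
        int publicationDigits =
                ISBN13_LENGTH - CHECK_DIGITS - qualifiedRegistrantElement.length();
        if (publicationDigits < 1) {
            throw new IllegalArgumentException(
                    "No digits left for publications: " + qualifiedRegistrantElement);
        }
        long capacity = 1;
        for (int i = 0; i < publicationDigits; i++) {
            capacity *= 10; // each remaining digit multiplies the range by ten
        }
        return capacity;
    }

    public static void main(String[] args) {
        System.out.println(publicationsAvailable("978014"));      // prints 1000000
        System.out.println(publicationsAvailable("9780620"));     // prints 100000
        System.out.println(publicationsAvailable("97809922363")); // prints 10
    }
}
```

The first call corresponds to the Penguin example (a two-digit registrant element), and the last to a seven-digit registrant element.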

We can try running our current example with the filename SouthAfrica.xdef and just permit it to scroll at first (the XML source is here):

java -cp .:xdef-beginner.jar:Isbn.jar xdef SouthAfrica.xdef BLICBasicB_202102_f50.rdf

This is all that's output:

97807983 (9780, English language) Africa Institute of South Africa
9780620 (9780, English language) uHlanga Press
97809922433 (9780, English language) Gender Links
9780620 (9780, English language) uHlanga
9780620 (9780, English language) uHlanga
9780620 (9780, English language) uHlanga
9781928215 (9781, English language) Modjaji Books
9781928331 (9781, English language) African Minds
9781928331 (9781, English language) JET Education Services
9781928215 (9781, English language) Modjaji books
97809946744 (9780, English language) Face2Face
9781928215 (9781, English language) Modjaji Books
9780620 (9780, English language) Gender Links
9781928331 (9781, English language) African Minds
9781920597 (9781, English language) AFSUN
9781928215 (9781, English language) Modjaji Books
97809922363 (9780, English language) African Perspectives Publishing
9780620 (9780, English language) Uhlanga
9780620 (9780, English language) Charlotte Wiener

It's pretty obvious that these are mostly to do with South Africa, but false drops are of course a possibility.

For the best results, you want to loop through everything:

for i in {1..9}
do
java -cp .:xdef-beginner.jar:Isbn.jar xdef SouthAfrica.xdef BLICBasicB_202105_f0$i.rdf >> SouthAfrica
done
for i in {10..94}
do
java -cp .:xdef-beginner.jar:Isbn.jar xdef SouthAfrica.xdef BLICBasicB_202105_f$i.rdf >> SouthAfrica
done

and save it to a file so the cat command can retrieve the data and pipe it to other commands.

But back to our current results. They can be grouped using the uniq command, as we did in the preceding example:

java -cp .:xdef-beginner.jar:Isbn.jar xdef SouthAfrica.xdef BLICBasicB_202102_f50.rdf | sort | uniq -c | less

The result is:

      1 9780620 (9780, English language) Charlotte Wiener
      1 9780620 (9780, English language) Gender Links
      3 9780620 (9780, English language) uHlanga
      1 9780620 (9780, English language) Uhlanga
      1 9780620 (9780, English language) uHlanga Press
      1 97807983 (9780, English language) Africa Institute of South Africa
      1 97809922363 (9780, English language) African Perspectives Publishing
      1 97809922433 (9780, English language) Gender Links
      1 97809946744 (9780, English language) Face2Face
      1 9781920597 (9781, English language) AFSUN
      1 9781928215 (9781, English language) Modjaji books
      3 9781928215 (9781, English language) Modjaji Books
      2 9781928331 (9781, English language) African Minds
      1 9781928331 (9781, English language) JET Education Services

Proceeding this way preserves the opportunity to glimpse the publishers' names. The grep command can remove them:

java -cp .:xdef-beginner.jar:Isbn.jar xdef SouthAfrica.xdef BLICBasicB_202102_f50.rdf | grep -P -o '[^)]*?\) ' | sort | uniq -c | less

The uniq command subsequently effected still more consolidation:

      7 9780620 (9780, English language) 
      1 97807983 (9780, English language) 
      1 97809922363 (9780, English language) 
      1 97809922433 (9780, English language) 
      1 97809946744 (9780, English language) 
      1 9781920597 (9781, English language) 
      4 9781928215 (9781, English language) 
      3 9781928331 (9781, English language)
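The pattern passed to grep above can be tried in isolation. The sketch below applies the same regular expression with java.util.regex and shows which part of a line survives; the class name GrepSketch is made up for illustration. Unlike grep -o, which prints every match on a line, this sketch returns only the first one.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of the pattern used with grep -P -o.
final class GrepSketch {

    // As few non-")" characters as possible, then ")" and a single space.
    private static final Pattern UP_TO_PAREN = Pattern.compile("[^)]*?\\) ");

    private GrepSketch() {
    }

    /** Returns the first match in the line, or null if there is none. */
    public static String firstMatch(String line) {
        Matcher matcher = UP_TO_PAREN.matcher(line);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        String line = "9780620 (9780, English language) uHlanga Press";
        // Brackets make the trailing space that is kept visible.
        System.out.println("[" + firstMatch(line) + "]");
    }
}
```

Because the match ends with the space after the closing parenthesis, the publisher's name is dropped and a trailing space is kept.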

These were the totals for the above prefixes/registrant elements once I had looped through the whole Integrated Catalogue as recommended earlier:

    1524 9780620 (9780, English language)
     125 97807983 (9780, English language)
       1 97809922363 (9780, English language)
       2 97809922433 (9780, English language)
       1 97809946744 (9780, English language)
       1 9781920597 (9781, English language)
      19 9781928215 (9781, English language)
       8 9781928331 (9781, English language)

There is a qualitative difference between having bibliographic records at your disposal, as you certainly have consequent on downloading the Integrated Catalogue and extracting the files, and possessing metadata like that we have been busy making here. The initial parts of the ISBNs, once tallied, can subsequently be used to look for books elsewhere than in the Integrated Catalogue. Clicking here will demonstrate exactly how that can work on the Better World Books homepage.

The file that I obtained by looping is available here with the publishers' names (likewise, here for Australia), and here grouped only by ISBN prefix (and here for Australia, likewise ...). I pointed out previously that ISBNs are always weighted. The farther along one is in either file, particularly in the one where the publishers' names are dropped in order to permit the uniq command to produce fewer groups, the more significant the numbers in the leftmost column become. 50 followed by an ISBN prefix six digits in length (including the 978 or 979) is 50 out of a possible million, but 50 in the leftmost column followed by an ISBN prefix 10 digits in length (again, with the 978 or 979 included) is 50 out of a possible one hundred. 5 in the leftmost column followed by an ISBN prefix 11 digits in length really is 5 out of a possible 10.

To run our example with any land you like in mind, simply go to the xd:declaration element and change the value of the land variable. All of the codes (e.g., "sz" for Switzerland, "le" for Lebanon) are available at https://id.loc.gov/vocabulary/countries.html.