Exist to QXML migration

QXML is a lightweight XML querying Python script, with an equivalent C tool, built on top of Berkeley DB XML (BDBXML). QXML is hosted in the env repository to reflect the intention to keep it a general tool for XML querying.

The exist2qxml.py script migrates eXist backup dumps or the contents of live eXist servers into BDBXML containers. QXML is currently ancillary to the main workflow of the heprez machinery, but because querying BDBXML is faster than querying eXist, the containers are very useful for rapidly querying the content of large numbers of XML files.

Usage for full ingests:

./exist2qxml.py

The required QXML_CONFIG points to the config file:

simon:qxml blyth$ echo $QXML_CONFIG
/Users/blyth/env/db/bdbxml/qxml/hfagc.cfg
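
A script that depends on this variable can fail early when it is missing. The sketch below only illustrates that check; `resolve_config_path` is a hypothetical helper, not part of QXML.

```python
import os


def resolve_config_path(environ=os.environ):
    """Return the config file path named by QXML_CONFIG, or raise if it is unset."""
    # Hypothetical helper: QXML's own lookup logic is not shown in this document.
    path = environ.get("QXML_CONFIG", "")
    if not path:
        raise RuntimeError("QXML_CONFIG is not set; it must point to a config file")
    return path
```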

For selective ingests, e.g. of a single document into the container with tag ‘sys’:

EXIST2QXML_SELECT=sys@@http://localhost/servlet/db/hfagc_system/v2qtags.xml ./exist2qxml.py
EXIST2QXML_SELECT=sys@@http://localhost/servlet/db/hfagc_system/qtag2latex.xml ./exist2qxml.py
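
Judging from the examples above, the value of EXIST2QXML_SELECT appears to be a container tag and a source URL joined by `@@`. A minimal sketch of splitting such a value, assuming that format holds; `parse_select` is a hypothetical name and the script's real parsing may differ:

```python
def parse_select(value):
    """Split a 'tag@@source' selector into its (tag, source) parts."""
    # Assumption: exactly one '@@' separates the container tag from the source.
    tag, sep, source = value.partition("@@")
    if not sep or not tag or not source:
        raise ValueError("expected 'tag@@source', got %r" % value)
    return tag, source


tag, source = parse_select(
    "sys@@http://localhost/servlet/db/hfagc_system/v2qtags.xml"
)
print(tag)
print(source)
```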

QXML Config

QXML is configured by the file that QXML_CONFIG points to.

General Config

Crucial settings include the default collection that querying addresses and the search path for XQuery modules that can be included into queries.

[dbxml]
environment_dir = /tmp/dbxml
default_collection = dbxml:////tmp/hfagc/hfagc.dbxml
baseuri = dbxml:/
xqmpath = /Users/blyth/heprez/qxml/lib:/Users/blyth/env/db/bdbxml/xq
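
Because every key in the [dbxml] section is unique, Python's standard configparser can read it. The sketch below is not QXML's own loader; it assumes that xqmpath is a colon-separated list of directories, as the example value suggests.

```python
import configparser

GENERAL = """
[dbxml]
environment_dir = /tmp/dbxml
default_collection = dbxml:////tmp/hfagc/hfagc.dbxml
baseuri = dbxml:/
xqmpath = /Users/blyth/heprez/qxml/lib:/Users/blyth/env/db/bdbxml/xq
"""

# Only '=' is accepted as a delimiter so the colons inside values stay untouched;
# interpolation is disabled so values are returned exactly as written.
parser = configparser.ConfigParser(delimiters=("=",), interpolation=None)
parser.read_string(GENERAL)
dbxml = parser["dbxml"]

# Assumption: xqmpath is split on ':' into individual module directories.
module_dirs = dbxml["xqmpath"].split(":")

print(dbxml["default_collection"])
for directory in module_dirs:
    print(directory)
```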

Container Config

The container config defines the locations of the Berkeley DB XML containers and the source of the XML for each of them, which can be a local directory or an eXist instance:

[container.source]
source = 
source = http://localhost/servlet/db/hfagc_system/
source = /data/heprez/data/backup/part/localhost/last/db/hfagc
source = http://cms01.phys.ntu.edu.tw/servlet/db/hfagc/

[container.path]
path = /tmp/hfagc/scratch.dbxml
path = /tmp/hfagc/hfagc_system.dbxml
path = /tmp/hfagc/hfagc.dbxml
path = /tmp/hfagc/remote.dbxml

[container.tag]
tag = tmp
tag = sys
tag = hfc
tag = rem
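
The three sections above repeat the same key, and the entries line up by position: the second source, the second path and the second tag describe one container. Python's standard configparser rejects duplicate keys by default, so the sketch below uses a small hand-written reader instead. It is an illustration of the layout, not QXML's actual parser, and `read_repeated` is a hypothetical name.

```python
from collections import defaultdict

CONTAINER_CONFIG = """\
[container.source]
source =
source = http://localhost/servlet/db/hfagc_system/
source = /data/heprez/data/backup/part/localhost/last/db/hfagc
source = http://cms01.phys.ntu.edu.tw/servlet/db/hfagc/

[container.path]
path = /tmp/hfagc/scratch.dbxml
path = /tmp/hfagc/hfagc_system.dbxml
path = /tmp/hfagc/hfagc.dbxml
path = /tmp/hfagc/remote.dbxml

[container.tag]
tag = tmp
tag = sys
tag = hfc
tag = rem
"""


def read_repeated(text):
    """Collect the values of every 'key = value' line per section, in order."""
    sections = defaultdict(list)
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]
        elif "=" in line and current is not None:
            # Split on the first '=' only; an empty value is kept as ''.
            _key, _sep, value = line.partition("=")
            sections[current].append(value.strip())
    return sections


sections = read_repeated(CONTAINER_CONFIG)
# Pair the entries by position: one (tag, path, source) triple per container.
containers = list(
    zip(
        sections["container.tag"],
        sections["container.path"],
        sections["container.source"],
    )
)
for tag, path, source in containers:
    print(tag, path, source or "(no source)")
```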

To suppress the leading slash in db names, supply a trailing slash in the source. This is useful for the sys container as it contains no sub-collections and not having slashes in names affords shortcut doc access.
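
The effect of the trailing slash can be pictured as plain prefix stripping. This is an assumption about how names are derived, made only for illustration; `document_name` is a hypothetical helper and the script's real logic may differ.

```python
def document_name(source, resource):
    """Name a document by removing the source prefix from its full location."""
    # Assumption: the name is whatever remains after the configured source string.
    if not resource.startswith(source):
        raise ValueError("%r is not beneath %r" % (resource, source))
    return resource[len(source):]


resource = "http://localhost/servlet/db/hfagc_system/qtag2latex.xml"

# Without a trailing slash on the source, the name keeps its leading slash.
print(document_name("http://localhost/servlet/db/hfagc_system", resource))

# With a trailing slash on the source, the name is the bare file name.
print(document_name("http://localhost/servlet/db/hfagc_system/", resource))
```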

All non-metadata XML files beneath the source directory (srcdir) are ingested into the dbxml container at the specified path. Queries made with qxml then refer to that container by its configured tag or alias, e.g.:

collection('avg')/dbxml:metadata('dbxml:name')

Namespace Config

Shorthand names for the namespace URIs used when querying:

[namespace.name]
name = rez 
name = exist
name = qxml

[namespace.uri]
uri = http://hfag.phys.ntu.edu.tw/hfagc/rez
uri = http://exist.sourceforge.net/NS/exist
uri = http://dayabay.phys.ntu.edu.tw/qxml
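
One way to see what these shorthands stand for is to write out the equivalent XQuery prolog declarations. How QXML actually registers the namespaces is not stated here, so the sketch below is only an illustration and `namespace_prolog` is a hypothetical helper.

```python
NAMESPACES = [
    ("rez", "http://hfag.phys.ntu.edu.tw/hfagc/rez"),
    ("exist", "http://exist.sourceforge.net/NS/exist"),
    ("qxml", "http://dayabay.phys.ntu.edu.tw/qxml"),
]


def namespace_prolog(pairs):
    """Render (name, uri) pairs as XQuery 'declare namespace' statements."""
    return "\n".join(
        'declare namespace {0} = "{1}";'.format(name, uri) for name, uri in pairs
    )


prolog = namespace_prolog(NAMESPACES)
print(prolog)
```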

Map Config

Some snippets of XML need to be accessed very frequently, e.g. the LaTeX string corresponding to a particle code or to a decay quantity tag. To optimize access to these snippets, a map is created at QXML startup that can subsequently be accessed very rapidly:

[map.name]
name = code2latex
name = qtag2latex

[map.query]      
query = for $glyph in collection('sys')/*[dbxml:metadata('dbxml:name')='pdgs.xml' or dbxml:metadata('dbxml:name')='extras.xml' ]//glyph return (data($glyph/@code), data($glyph/@latex)) 
query = for $qtag in doc("sys/qtag2latex.xml")//qtag return ($qtag/@value/string(),$qtag/latex/string())
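
Each query above returns a flat sequence in which keys and values alternate. A minimal sketch of folding such a sequence into a lookup table follows; the sample values are made up for illustration and `pairs_to_map` is a hypothetical helper, not QXML's implementation.

```python
def pairs_to_map(flat):
    """Turn [key1, value1, key2, value2, ...] into a {key: value} dict."""
    if len(flat) % 2 != 0:
        raise ValueError("expected an even number of items, got %d" % len(flat))
    # Even positions hold the keys, odd positions hold the matching values.
    return dict(zip(flat[0::2], flat[1::2]))


# Illustrative values only, standing in for the result of the code2latex query.
sample = ["211", r"\pi^{+}", "321", r"K^{+}"]
code2latex = pairs_to_map(sample)
print(code2latex["211"])
```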

Issues

When the ingest attempts to read from an eXist server that is not running, as below, the error is uninformative and an empty container is left behind that must be deleted manually before rerunning:

simon:~ blyth$ rm /tmp/hfagc/hfagc_system.dbxml 
simon:~ blyth$ heprez-exist2qxml
INFO:__main__:using srcpfx_ None 
INFO:__main__:ingest sys creating /tmp/hfagc/hfagc_system.dbxml from xml files from http://localhost/servlet/db/hfagc_system/ 
XmlException ( 4 ):  Error: XML Indexer: Fatal Parse error in document at line 1, char 50. Parser message: whitespace expected
WARNING:__main__:tag hfc dbxml "/tmp/hfagc/hfagc.dbxml" exists already : delete it and rerun to update from src "/data/heprez/data/backup/part/localhost/last/db/hfagc"  
WARNING:__main__:tag rem dbxml "/tmp/hfagc/remote.dbxml" exists already : delete it and rerun to update from src "http://cms01.phys.ntu.edu.tw/servlet/db/hfagc/"  
simon:~ blyth$
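
One possible mitigation is to probe each source before creating its container, so that an unreachable server is reported clearly and no empty container is left behind. The sketch below is a suggestion, not something the script is known to do; `source_available` is a hypothetical helper.

```python
import os
import urllib.request


def source_available(source, timeout=5.0, opener=urllib.request.urlopen):
    """Return True if a local source directory exists or an HTTP source answers."""
    if not source.startswith(("http://", "https://")):
        # Local sources: the directory only has to exist.
        return os.path.isdir(source)
    try:
        # Remote sources: any successful response counts as available.
        with opener(source, timeout=timeout):
            return True
    except OSError:
        # URLError and HTTPError are OSError subclasses, so a refused
        # connection, a timeout or an HTTP error status all land here.
        return False
```

A caller could skip the ingest for a tag, and log the source URL, whenever this returns False.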