Introduction
This document provides a guide to using the wkimporting package to port electronic dictionaries for display on mobile phones through Wunderkammer. There are four major steps to this process:
- Ensuring that the original dictionary file is in a suitable format
- Writing a configuration file for the import routine
- Customising the dictionary theme file (optional)
- Running the conversion and bundling script
The wkimporting package requires several other software packages to run. The wkimport program requires a Java runtime environment, the bundling script requires ant and the import.py script that co-ordinates the importing process requires Python 2 to run (note that it will not run without modification under the recently released Python 3). All of these packages can be downloaded for most major computer platforms for free from the websites linked to here if they are not already installed.
Formatting the source dictionary
Source dictionaries in a variety of machine-readable formats can be converted into the format used by Wunderkammer by the program wkimport. Built in to wkimport are routines that can convert dictionaries stored in the backslash coded format used by Shoebox/Toolbox produced by SIL and those stored in XML files that can be read by Kirrkirr.
wkimport has been designed to make it easy to write extension modules to convert other formats. We would like to encourage Java programmers who write their own extension modules to send them to us so that we can make them available on the Project for Free Electronic Dictionaries website.
This guide discusses only how to use the built-in Shoebox and XML import routines, although much of the information should also apply to extension import routines. However, when using an extension import routine, refer to the documentation that came with the routine for any additional information relevant to it.
There are very few restrictions on how dictionaries must be formatted for importing. The standard routines should be able to process any Shoebox dictionary where each entry starts with a lemma or headword or any XML dictionary where the entire content of each entry is under a single parent node.
Wunderkammer does not have any real support for multiple senses within a single entry. It is possible to have multiple fields of a single type within an entry, such as having two part of speech or definition fields in one entry, but there is no way to group fields within an entry to show they belong to a single sense. This limitation was a deliberate design decision aimed at ensuring entry data is simple enough to be displayed on small screens. The best strategy for formatting dictionaries that make use of multiple senses is probably either to divide the senses into separate homonymous entries or to edit the senses so that they will be interpretable when they are combined into a single sense during the conversion process.
Most import routines for wkimport automatically uniquify the entries in the input dictionary. If there are any homographic entries in the input dictionary - that is, if there are any entries that have identical lemmas - then a number will be added after each of the lemmas to distinguish them. The numbering starts from 1 for each group of identical lemmas. For example, in a dictionary that contains two entries, each of which has the lemma turla, the lemmas will be renamed to turla 1 and turla 2.
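The numbering scheme described above can be sketched in a few lines of Python. This is an illustration only, not the code used by wkimport; the function name uniquify and the sample lemma "other" are assumptions made for the example.

```python
from collections import Counter


def uniquify(lemmas):
    """Number every lemma that occurs more than once, starting from 1."""
    totals = Counter(lemmas)   # how often each lemma occurs overall
    seen = Counter()           # how many copies have been numbered so far
    renamed = []
    for lemma in lemmas:
        if totals[lemma] > 1:
            seen[lemma] += 1
            renamed.append("%s %d" % (lemma, seen[lemma]))
        else:
            renamed.append(lemma)  # unique lemmas are left unchanged
    return renamed


print(uniquify(["turla", "other", "turla"]))
```

Lemmas that occur only once keep their original form, so only genuinely homographic entries are renamed.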
If a link points to a lemma that is uniquified when imported then the link will be broken. The standard uniquifying and link verification routines print lists of the lemmas uniquified and the broken links to the interactive output so it is possible to see which lemmas were uniquified and which links were broken.
When the standard uniquify method renames a lemma, it does not check whether a lemma with the new name already exists. Dictionary makers should avoid lemmas that have the same format as those produced by the uniquifying method (a lemma followed by a space and a number), so that uniquification cannot create new homographic entries.
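Both problems described above, broken links and accidental new homographs, can be checked for after renaming. The sketch below is an assumption about how such a check might look; it is not the verification routine shipped with wkimport, and the function name and sample data are hypothetical.

```python
def check_renamed(renamed, links):
    """Return (broken_links, collisions) for a list of renamed lemmas."""
    names = set(renamed)
    # A link is broken when no entry carries the linked lemma any more.
    broken = [link for link in links if link not in names]
    # A collision is a name that occurs more than once after renaming.
    seen, collisions = set(), []
    for name in renamed:
        if name in seen and name not in collisions:
            collisions.append(name)
        seen.add(name)
    return broken, collisions


# Two entries "turla" were renamed, but the source already had a "turla 1".
renamed = ["turla 1", "turla 2", "turla 1", "other"]
links = ["turla", "other"]
print(check_renamed(renamed, links))
```

In this example the link to turla is reported as broken, and turla 1 is reported as a collision because the source dictionary already contained a lemma of that form.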
Configuration files for import routines
Every import routine requires a configuration file for each dictionary it processes so that it knows how to interpret the various fields in the dictionary. Writing a configuration file is probably the most difficult part of the conversion process. This section goes through the structure of standard configuration files in detail. At the end are two complete example configuration files.
In standard configuration files each line of the file describes how to set a parameter required by the import routine. The name of the parameter to set is written first, followed by a semicolon, and then the value that the parameter should be set to. For example:
map; ./ENTRY/LX%lemma%true
The above line says that a mapping should be set up from the path ./ENTRY/LX in the source file to the Wunderkammer dictionary field type lemma in the output entry files, and that it is true that this field should be included in entries (a more detailed discussion of mappings can be found below).
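As an illustration of the line format, the following Python sketch splits a configuration line into its parameter name and value. It is not taken from wkimport; in particular, the trimming of surrounding whitespace is an assumption.

```python
def parse_line(line):
    """Split 'name; value' into a (name, value) pair."""
    name, separator, value = line.partition(";")
    if not separator:
        raise ValueError("no semicolon in configuration line: %r" % line)
    # Assumption: whitespace around the name and the value is not significant.
    return name.strip(), value.strip()


print(parse_line("map; ./ENTRY/LX%lemma%true"))
```

Only the first semicolon is treated as the separator, so the value itself is passed on unchanged for the parameter-specific parsing described below.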
The following is a list of the parameters recognised by standard import routines with a description of each. The parameter settings can appear in any order in the configuration file.
root
This is the name of the first menu that should be loaded when Wunderkammer is started. The root field should only appear once in each configuration file.
index
The index parameter is used to specify a path for the menus in Wunderkammer. The path can go to any depth. The nodes in the path should be separated with the forward slash character (/). Each node must specify the dictionary field to populate the menu with and the name of the menu. These are separated by a percent sign (%). For example:
index; sd%Semantic domains/lemma%Kaurna
This line says to create a menu path that starts with a menu populated by the semantic domain field sd. The menu is called 'Semantic domains'. Under this menu is another menu populated from the lemma field called 'Kaurna'. That is, the user will first see a menu called 'Semantic domains' that consists of a list of all the semantic domains contained in the entries in the dictionary. When they select an item from this first menu they will be taken to a second menu called 'Kaurna' that is a list of the lemmas of entries within the selected semantic domain. When they select an item from this menu they will be taken to the corresponding entry.
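The structure of a menu path can be made explicit by splitting it into its levels. The Python below is a sketch under the assumption that neither field names nor menu names contain / or %; the function name parse_index is hypothetical and the code is not part of wkimport.

```python
def parse_index(value):
    """Split a menu path into a list of (field, menu name) pairs."""
    levels = []
    for node in value.split("/"):
        field, separator, menu = node.partition("%")
        if not separator:
            raise ValueError("node needs a field and a menu name: {!r}".format(node))
        levels.append((field.strip(), menu.strip()))
    return levels


print(parse_index("sd%Semantic domains/lemma%Kaurna"))
```

Each pair corresponds to one menu level: the first item names the field that populates the menu and the second is the title shown to the user.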
order is an optional parameter that can be used to specify a custom sort order for a menu. The menu to sort must be specified by name, followed by the order to use when sorting the menu. Items to the left of less than signs (<) are sorted before those to the right. Items separated by commas are treated as being the same in the sort order. Graphemes made up of more than one character, such as digraphs, can be included in the sort order. Below is an example sort order that specifies a reverse alphabetic sort order for the 'Semantic domains' menu.
order; Semantic domains%<z,Z< y,Y< x,X< w,W< v,V< u,U< t,T< s,S< r,R< q,Q < o,O< n,N< m,M< l,L< k,K< j,J< i,I< h,H< g,G < f,F< e,E< d,D< c,C< b,B< a,A
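One way to read such a sort order is as a table that assigns a rank to each grapheme, with comma-separated items sharing a rank. The Python sketch below builds that table and a sort key that matches the longest listed grapheme first, so that digraphs are honoured. It is an illustration, not wkimport's implementation; the placement of unlisted characters after all listed ones is an assumption, and the sample words are invented.

```python
def parse_order(value):
    """Split 'menu%<a,A<b,B' into the menu name and a grapheme -> rank table."""
    menu, _, rule = value.partition("%")
    ranks = {}
    groups = [group for group in rule.split("<") if group.strip()]
    for rank, group in enumerate(groups):
        for item in group.split(","):
            if item.strip():
                ranks[item.strip()] = rank  # comma-separated items share a rank
    return menu.strip(), ranks


def sort_key(text, ranks):
    """Build a key by matching the longest listed grapheme at each position."""
    longest = max((len(grapheme) for grapheme in ranks), default=1)
    key, pos = [], 0
    while pos < len(text):
        for size in range(longest, 0, -1):
            chunk = text[pos:pos + size]
            if chunk in ranks:
                key.append((0, ranks[chunk], ""))
                pos += len(chunk)
                break
        else:
            # Assumption: unlisted characters sort after all listed graphemes.
            key.append((1, 0, text[pos]))
            pos += 1
    return key


menu, ranks = parse_order("Semantic domains%<c,C< b,B< a,A")
items = sorted(["apple", "Banana", "cherry"], key=lambda s: sort_key(s, ranks))
print(menu, items)
```

With the shortened reverse order used here, cherry sorts first and apple last, and upper and lower case letters are treated as equal.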
entry
The entry parameter is the path that should be followed to find entries in the input dictionary file. When using the XML import routine, the value of this parameter must be an XPath to entries in the dictionary, as in the following example:
entry; /DICTIONARY/ENTRY
The entry parameter should only appear once in each configuration file.
The Shoebox import routine does not require this parameter to be set, since there is no hierarchical structure in Shoebox files. The Shoebox routine simply processes the input file serially and starts a new entry every time it encounters a Shoebox field that is mapped to lemma.
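The serial processing described above can be sketched as follows. This Python is illustrative only and is not the Shoebox routine in wkimport: it assumes one field per line, ignores continuation lines, and uses invented sample data apart from the lemma turla.

```python
def parse_shoebox(lines, mappings):
    """Group backslash-coded lines into entries.

    mappings maps a backslash code (without the backslash) to a
    (target field, include) pair, as in the map parameter.
    """
    lemma_codes = {code for code, (target, _) in mappings.items()
                   if target == "lemma"}
    entries, current = [], None
    for line in lines:
        line = line.strip()
        if not line.startswith("\\"):
            continue  # this sketch skips blank and continuation lines
        parts = line[1:].split(None, 1)
        if not parts:
            continue
        code = parts[0]
        text = parts[1].strip() if len(parts) > 1 else ""
        if code in lemma_codes:
            current = []          # a lemma field starts a new entry
            entries.append(current)
        if current is not None and code in mappings:
            target, include = mappings[code]
            current.append((target, text, include))
    return entries


mappings = {"lx": ("lemma", True), "ps": ("pos", True), "ge": ("glossdef", True)}
lines = ["\\lx turla", "\\ps n", "\\ge first gloss", "",
         "\\lx other", "\\ge second gloss"]
for entry in parse_shoebox(lines, mappings):
    print(entry)
```

Fields that appear before the first lemma, or whose code has no mapping, are simply dropped in this sketch.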
The map parameter is used to set up a mapping between a field in the input file and a Wunderkammer entry field. Each mapping must have a source field, a target field and a boolean value indicating whether the field should be included in entries. Each value in the mapping must be separated by a percent sign (%). For example:
map; ./SENSE/DE%glossdef%true
In this example mapping, the text content of the XML element at the path ./SENSE/DE is mapped to the Wunderkammer entry field type glossdef. Here . represents the entry path, e.g. /DICTIONARY/ENTRY, the value given in the example for the entry parameter above. Note that in the XML import routine all XPaths must refer to elements; they cannot refer to attributes. The value true indicates that the field should be included in the entry file output by the import routine. Some fields are only included in the source dictionary for the purpose of making indexes and should not appear in the output entry; these are mapped with the value false. For example, a source dictionary might have a reverse index field that contains values that are simply transformations of those in a gloss field or are identical to those in a gloss field, e.g. from gloss 'swamp grass' to 'grass, swamp' or 'cockatoo' to 'cockatoo'.
As can be seen above, in the XML import routine the source mapping must be an XPath from the entry node to the element that contains the field data. In the Shoebox import routine the source mapping should be the backslash code to map from without the preceding backslash, e.g.
map; lx%lemma%true
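For the XML case, the effect of the entry and map parameters can be approximated with Python's standard xml.etree.ElementTree module. This is a sketch rather than the wkimport routine: the function names and the sample document are invented, and because ElementTree does not accept absolute paths on an element, the sketch assumes that the first step of the entry path names the root element.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<DICTIONARY>
  <ENTRY>
    <LX>turla</LX>
    <SENSE><DE>first definition</DE><RI>index only</RI></SENSE>
  </ENTRY>
</DICTIONARY>"""


def parse_map(value):
    """Split 'source%target%true' into (source, target, include)."""
    source, target, include = [part.strip() for part in value.split("%")]
    return source, target, include == "true"


def find_entries(root, entry_path):
    """Resolve an absolute entry path relative to the root element."""
    steps = entry_path.strip("/").split("/")
    if steps[0] != root.tag:
        return []
    return root.findall("/".join(steps[1:]) or ".")


def extract(xml_text, entry_path, mappings):
    """Return one list of (target, text) pairs per entry."""
    root = ET.fromstring(xml_text)
    entries = []
    for node in find_entries(root, entry_path):
        fields = []
        # Fields are collected in mapping order, not document order.
        for source, target, include in mappings:
            for element in node.findall(source):
                if include:  # index-only fields are left out of the entry
                    fields.append((target, (element.text or "").strip()))
        entries.append(fields)
    return entries


mappings = [parse_map(value) for value in
            ("./LX%lemma%true", "./SENSE/DE%glossdef%true", "./SENSE/RI%ri%false")]
print(extract(SAMPLE, "/DICTIONARY/ENTRY", mappings))
```

The RI field is mapped with false, so it is available for building an index but does not appear among the fields returned for the entry.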
There is a closed list of possible target entry fields. Each of these fields has a conventional association to a particular type of data that is typically stored in dictionaries. These associations are spelt out in the list below.
- lemma - the lemma or headword
- sd - the semantic domain
- pos - the part of speech
- glossdef - a gloss or a definition
- sound - the name of the sound file to play in this entry
- image - the name of an image to include in an entry
- link - a field that provides a link to another entry in the dictionary
- ri - an additional field that has no conventional association
- rii - an additional field that has no conventional association
- riii - an additional field that has no conventional association
Note that even though each of the fields has a conventional association, the data in sd, pos, glossdef, ri, rii and riii fields is treated simply as plain text by Wunderkammer. This means that any type of data intended to be displayed as text could be stored in these fields. The way that text should be rendered in each of these fields and the link field is determined by the theme. See below under Customising the dictionary theme file for information on how to modify themes. All of these fields can be repeated within a single entry.
The other fields are treated specially by Wunderkammer or wkimport and must contain specific types of data. The lemma field must contain the lemma, or headword, of the entry. The sound and image fields must contain the names of sound and image files that should be played or shown in the entry. The link field must contain the value of the lemma of the entry that it links to. Each entry can have only one lemma field and only one sound field. The image and link fields can be repeated in a single entry.
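The constraints on target fields lend themselves to a simple check before importing. The Python below is a hypothetical helper, not something provided by wkimporting; it assumes an entry is represented as a list of (target field, text) pairs and that every entry needs a lemma.

```python
from collections import Counter

# The closed list of target fields given in this guide.
TARGET_FIELDS = {"lemma", "sd", "pos", "glossdef", "sound",
                 "image", "link", "ri", "rii", "riii"}
# Fields that may occur only once in an entry.
SINGLE_FIELDS = {"lemma", "sound"}


def validate_entry(fields):
    """Return a list of problems found in one entry."""
    problems = []
    counts = Counter(target for target, _ in fields)
    for target in sorted(counts):
        if target not in TARGET_FIELDS:
            problems.append("unknown field: " + target)
    if counts["lemma"] == 0:
        problems.append("missing lemma")  # assumption: a lemma is required
    for target in sorted(SINGLE_FIELDS):
        if counts[target] > 1:
            problems.append("repeated field: " + target)
    return problems


entry = [("lemma", "turla"), ("sound", "a.wav"), ("sound", "b.wav"),
         ("link", "other"), ("link", "another")]
print(validate_entry(entry))
```

Repeated link fields pass the check, while the second sound field is reported.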
It is possible to map more than one source field to a single target field. The following configuration would be possible, for example:
map; ./SENSE/SUB%link%true
map; ./SENSE/UN%link%true
In this code the ./SENSE/SUB and ./SENSE/UN fields in the source dictionary are both mapped to the link field in the Wunderkammer output dictionary.
Below is an example configuration file for an XML dictionary. This file can also be found in the inputfiles/condic directory of the wkimporting distribution. Note that the data for the parameter order must all be on a single line, even though the line is wrapped in the example shown below.
root; Main menu
index; lemma%Kaurna to English
index; ri%English to Kaurna
index; sd%Semantic domains/lemma%Kaurna
order; Semantic domains%<z,Z< y,Y< x,X< w,W< v,V< u,U< t,T< s,S< r,R< q,Q < o,O< n,N<
m,M< l,L< k,K< j,J< i,I< h,H< g,G < f,F< e,E< d,D< c,C< b,B< a,A
entry; /DICTIONARY/ENTRY
map; ./LX%lemma%true
map; ./SENSE/DE%glossdef%true
map; ./SENSE/GE%glossdef%true
map; ./SENSE/SD%sd%true
map; ./SENSE/PS%pos%true
map; ./SENSE/UN%link%true
map; ./SENSE/SUB%link%true
map; ./SENSE/SYN%link%true
map; ./SENSE/RI%ri%false
map; ./SOUND%sound%true
map; ./IMAGE%image%true
Below is an example configuration file for a Shoebox dictionary. The differences between this and the XML configuration file are that there is no entry parameter and that the source mappings are Shoebox backslash codes rather than XPaths. Note that the data for the parameter order must all be on a single line, even though the line is wrapped in the example shown below.
root; Main menu
index; lemma%Kaurna to English
index; ri%English to Kaurna
index; sd%Semantic domains/lemma%Kaurna
order; Semantic domains%<z,Z< y,Y< x,X< w,W< v,V< u,U< t,T< s,S< r,R< q,Q < o,O< n,N<
m,M< l,L< k,K< j,J< i,I< h,H< g,G < f,F< e,E< d,D< c,C< b,B< a,A
map; lx%lemma%true
map; de%glossdef%true
map; ge%glossdef%true
map; sd%sd%true
map; ps%pos%true
map; un%link%true
map; sub%link%true
map; syn%link%true
map; ri%ri%false
map; sound%sound%true
map; image%image%true
Customising the dictionary theme file
It is possible to change the appearance and localisation settings of Wunderkammer by bundling the program with a modified resource file. The default resource file is located at wkimporting/inputfiles/themes/wunderkammertheme.res. The resource file can be edited with the ResourceEditor application, which is bundled with the LWUIT library. ResourceEditor is located at LWUIT/util/ResourceEditor.jar in the package. There is documentation included with ResourceEditor on how to use the program.
To change the general appearance of Wunderkammer, the theme, images and animations stored within the resource file need to be edited. To modify the localisation settings or change the additional text that is added to fields within entries, the localisation settings need to be edited.
Running the conversion and bundling script
The dictionary data files need to be converted and bundled with the Wunderkammer resource file and the Wunderkammer binaries in a jar file before the dictionary can be run. To do this, put the files listed below into the appropriate directories and then run the script import.py.
- dictionary configuration file in the directory wkimporting/inputfiles/condic
- source dictionary Shoebox or XML file in the directory wkimporting/inputfiles/condic
- the directory containing images for the dictionary (if present) in the directory wkimporting/inputfiles/imagedirs
- the directory containing sounds for the dictionary (if present) in the directory wkimporting/inputfiles/sounddirs
- the theme file (if a new one has been created) in the directory wkimporting/inputfiles/themes
- the dictionary icon (if one has been created for the dictionary) in the directory wkimporting/inputfiles/icons
Run the import.py script by typing python import.py at the command line from the wkimporting directory. The script will provide several prompts at which the relevant dictionary files can be selected. The prompts should be fairly self-explanatory.
Once the conversion and bundling are finished the output dictionary files should be in the directory wkimporting/dist. For information on how to run these files, see the User's guide to Wunderkammer.
Importing tips
Here are some tips for importing dictionaries into Wunderkammer:
- Make sure that all input text files are saved as 'UTF-8 no BOM'. This option may not be available in all text editors. Use an advanced text editor, such as TextWrangler for Mac or Notepad++ for Windows.
- Make sure there are no spaces in any file names and do not put spaces in the name of the dictionary or vendor when wkimporting prompts for these.
- On Windows machines make sure that ant is in the command search path. Go to Control Panel > System and Maintenance > Advanced System Settings > Environment Variables > System variables. In the list of System Variables find Path and click on Edit. In the field for Variable value, go to the end of the line and add a semicolon followed by the full path to the "bin" folder of the ant installation on your system (e.g., ";C:\Program Files\ant\bin\"), then press Save. (Tip from Dmitry Idiatov.)
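The first two tips can be checked automatically before running the import. The following Python is a sketch of such a check and is not part of wkimporting; the function name check_input_file and the demonstration file are invented for the example.

```python
import codecs
import os
import tempfile


def check_input_file(path):
    """Return a list of problems with an input text file."""
    problems = []
    if " " in os.path.basename(path):
        problems.append("space in file name")
    with open(path, "rb") as handle:
        data = handle.read()
    if data.startswith(codecs.BOM_UTF8):
        problems.append("file starts with a UTF-8 BOM")
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        problems.append("file is not valid UTF-8")
    return problems


# Demonstration on a temporary file that breaks both rules.
folder = tempfile.mkdtemp()
bad_path = os.path.join(folder, "my dictionary.txt")
with open(bad_path, "wb") as handle:
    handle.write(codecs.BOM_UTF8 + b"\\lx turla\n")
print(check_input_file(bad_path))
```

A file that passes returns an empty list. The check only looks at the file name itself, not at the directories leading to it.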
Version 1.3 of Guide to importing dictionaries into Wunderkammer, James McElvenny (first name followed by the at sign and then pfed dot info), 4 March 2010