Updating Abstract Sources

CLDR-8802 in version 39 added abstract data to the Survey Tool, meaning that DBPedia abstracts (extracted from Wikipedia) are presented to the user to aid in translation. An example is shown to the right.

How does it work?

On startup, the Survey Tool performs several SPARQL queries and caches their results.

The user's browser is passed the resource URL, such as https://dbpedia.org/resource/French_language. When the user clicks on the French row in the Survey Tool, DBPedia is queried for the article abstract. The result of this query is cached, so the query won't be performed again if the user clicks French a second time.
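
The "query once, then serve from cache" behavior described above can be sketched as follows. This is only an illustration and not the Survey Tool's actual code: the class name AbstractLookupSketch, the method getAbstract, and the placeholder fetcher are all hypothetical, and a real fetcher would query DBPedia over the network.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Hypothetical sketch of a memoizing abstract lookup; not a CLDR class. */
class AbstractLookupSketch {
    // Maps a resource URL to its abstract text.
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    // Stand-in for the real network query (an assumption for illustration).
    private final Function<String, String> fetcher;

    AbstractLookupSketch(Function<String, String> fetcher) {
        this.fetcher = fetcher;
    }

    /** Returns the cached abstract, calling the fetcher only on the first request. */
    String getAbstract(String resourceUrl) {
        return cache.computeIfAbsent(resourceUrl, fetcher);
    }

    public static void main(String[] args) {
        AtomicInteger fetches = new AtomicInteger();
        AbstractLookupSketch lookup = new AbstractLookupSketch(url -> {
            fetches.incrementAndGet();
            return "Placeholder abstract for " + url;
        });
        String french = "https://dbpedia.org/resource/French_language";
        lookup.getAbstract(french); // first click: fetcher runs
        lookup.getAbstract(french); // second click: served from the cache
        System.out.println("fetches performed: " + fetches.get()); // prints "fetches performed: 1"
    }
}
```

Keying the cache on the resource URL means each distinct article is fetched at most once, however many times its row is clicked.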

How do I add a new type of data?

  1. Identify the data type to add, and be familiar with the XPath structure. For example, scripts have the path //ldml/localeDisplayNames/scripts/script[@type="Latn"] where Latn is the script code.
  2. Here is the difficult part. Construct a SPARQL query on DBPedia that retrieves both the "resource" entries and a code of some sort. Note that you can clean up the data in Java afterwards. For example, after much trial and error, the following seemed to work for scripts, producing tuples such as (?resource: http://dbpedia.org/resource/Avestan_alphabet, ?iso: Avst)
    PREFIX dbp: <http://dbpedia.org/property/>
    PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX yago: <http://dbpedia.org/class/yago/>
    
    SELECT ?resource ?iso WHERE
    {
        ?resource rdf:type yago:CharacterSet106488880 .
        ?resource dbp:iso ?iso
    }
    1. Try the fora at https://forum.dbpedia.org
    2. Visit a few pages, such as http://dbpedia.org/page/Cyrillic_script and try to isolate common factors.
  3. Create a new class, org.unicode.cldr.rdf.ScriptMapper in the cldr-rdf project. (You might copy from this class or from LanguageMapper).
  4. Rewrite the query above using the Where Builder. This will isolate it a bit from changes to the syntax and could allow reuse:
        @Override
        public int addEntries(AbstractCache cache) throws ParseException {
            int newAdd = 0;
            final CLDRConfig config = CLDRConfig.getInstance();
            ResultSet rs = queryScriptResources();

            // Template path; the "type" attribute is filled in for each result row.
            XPathParts xpp = XPathParts.getFrozenInstance("//ldml/localeDisplayNames/scripts/script")
                .cloneAsThawed();
            while (rs.hasNext()) {
                final QuerySolution qs = rs.next();
                String code = QueryClient.getStringOrNull(qs, O_ISO.substring(1));
                final String res = QueryClient.getResourceOrNull(qs, O_RESOURCE.substring(1));
                // Skip rows whose code is missing or is not a four-letter script code.
                if (code == null || code.length() != 4) {
                    continue;
                }
                // DEBUG: You can print out "code" and "res" here for debugging!
                xpp.setAttribute(-1, "type", code);
                final String xpath = xpp.toString();
                if (!cache.add(xpath, res)) {
                    newAdd++;
                }
            }
            return newAdd;
        }
    
        public static ResultSet queryScriptResources() throws ParseException {
            final String resType = O_RESOURCE;
            final SelectBuilder builder = new SelectBuilder()
                .addPrefix("dbp", QueryClient.PREFIX_DBP)
                .addPrefix("rdf", QueryClient.PREFIX_RDF)
                .addPrefix("yago", QueryClient.PREFIX_YAGO)
                .addVar("*")
                .addWhere(resType, "rdf:type", "yago:CharacterSet106488880")
                .addWhere(resType, "dbp:iso", O_ISO);
            System.out.println(builder.buildString());
            Query q = builder.build();
            ResultSet results = QueryClient.getInstance().execSelect(q);
            return results;
        }

  5. Add the mapper to the MapAll.MapAll() constructor:
    public MapAll() {
        // add all mappers here
        mappers.add(new LanguageMapper());
        mappers.add(new ScriptMapper());
    }
  6. Now, to test it, run the TestMapAll JUnit test with -DCLDR_TEST_ENABLE_NET=true set in the VM arguments.  (By default, network queries are disabled during tests.)  As usual, you will need to set the CLDR_DIR property as well.
    1. If your new mapper class has debugging output enabled (see the DEBUG comment above), you may be able to see the results directly. Otherwise, the test will print:
      ScriptMapper + 75
      # AbstractCache wrote to /var/folders/tb/6847b2200cad4e35955f00b82baf01ac/T/154662182-0

    2.  There will be a properties file in the above named temporary directory containing the mappings, such as:
      //ldml/localeDisplayNames/scripts/script[@type\="Perm"]=http\://dbpedia.org/resource/Old_Permic_alphabet
  7. That's it! The Survey Tool will run this query when needed.
    1. Currently, the cache may not be invalidated when the code changes (to be fixed in CLDR-14323). To work around this on production or development servers, delete /etc/tomcat/cldr/abstracts/xpath-to-resource.properties; it will be recreated when the Survey Tool restarts. (The 'abstracts' directory is a peer of cldr.properties.)
  8. You are now ready to test this out locally, and then open a PR! For this example (Scripts), you can see the PR here: https://github.com/unicode-org/cldr/pull/869
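
When checking the properties file from step 6, note that the backslashes before "=" and ":" are standard java.util.Properties escapes and are not part of the XPath or the URL. The following is a minimal sketch for inspecting such a file, assuming it is in the ordinary Properties format shown above; the class MappingFileSketch and its helper methods are hypothetical and not part of cldr-rdf.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical helper for inspecting an xpath-to-resource mapping; not a CLDR class. */
class MappingFileSketch {
    /** Parses Properties-format text; load() resolves the "\=" and "\:" escapes. */
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    /** Builds the script display-name XPath for a script code such as "Latn". */
    static String scriptXPath(String code) {
        return "//ldml/localeDisplayNames/scripts/script[@type=\"" + code + "\"]";
    }

    public static void main(String[] args) {
        // The sample mapping line from step 6, written as a Java string literal.
        String line = "//ldml/localeDisplayNames/scripts/script[@type\\=\"Perm\"]"
            + "=http\\://dbpedia.org/resource/Old_Permic_alphabet\n";
        Properties props = parse(line);
        // prints "http://dbpedia.org/resource/Old_Permic_alphabet"
        System.out.println(props.getProperty(scriptXPath("Perm")));
    }
}
```

To inspect a real file, the same Properties.load call can be given a reader opened on the temporary file that the test reports.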

