WikiParser

Overview

Definitions of words from Wiktionary can be downloaded from Wikimedia dumps.
The file enwiktionary-latest-pages-articles.xml.bz2 contains definitions, templates, modules and other information in XML format.

The first step before parsing Wiktionary is to split the dump into plain text files, which simplifies further processing. This step is performed by running the WikiSplitter tool, which generates the following files:
- file wiki.dat containing all the definitions in a plain text file;
- file templates.dat containing all templates in a plain text file;
- file modules.dat containing all modules in a plain text file.
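
As a rough illustration of the splitting idea, the sketch below reads page titles from a MediaWiki-style XML export and decides which of the three files each page would belong to. It is not the WikiSplitter implementation: the class DumpRoutingSketch and its methods are hypothetical, the routing by the "Template:" and "Module:" title prefixes is an assumption, and the on-disk layout of the *.dat files is deliberately left out because it is not described here.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class DumpRoutingSketch {

    /** Hypothetical rule: choose the output file from the page title prefix. */
    static String targetFile(String title) {
        if (title.startsWith("Template:")) return "templates.dat";
        if (title.startsWith("Module:")) return "modules.dat";
        // Assumption: every other page is treated as a definition.
        return "wiki.dat";
    }

    /** Streams over the XML and maps each page title to the file it would go to. */
    static Map<String, String> route(String xml) {
        Map<String, String> routed = new LinkedHashMap<>();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "title".equals(reader.getLocalName())) {
                    // getElementText() consumes the text up to the closing </title>.
                    String title = reader.getElementText();
                    routed.put(title, targetFile(title));
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Malformed XML input", e);
        }
        return routed;
    }

    public static void main(String[] args) {
        // Tiny in-memory stand-in for the real dump, which is far larger and bz2-compressed.
        String xml = "<mediawiki>"
                + "<page><title>apple</title><revision><text>==English==</text></revision></page>"
                + "<page><title>Template:en-noun</title><revision><text>...</text></revision></page>"
                + "<page><title>Module:headword</title><revision><text>...</text></revision></page>"
                + "</mediawiki>";
        route(xml).forEach((title, file) -> System.out.println(title + " -> " + file));
    }
}
```

A streaming reader is used here because the real dump is too large to load into memory as a whole; decompressing the .bz2 file would additionally require a third-party library and is omitted from this sketch.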

Once you have generated the *.dat files, you can parse wikitext, expand it and render it to HTML in two steps:

- parse and expand the wikitext with tp.parse();
- format the expanded text to HTML with WikiFormatter.formatWikiText().

Before performing these steps, you must instantiate the helper WikiPage, which guides the parser during wiki expansion with information such as the date, locale, templates and modules.

Wikitext parsing and rendering to HTML

Here is typical code that performs wikitext parsing/expansion and rendering to HTML:


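// Load the templates, modules and definitions produced by the splitting step.
readfile(name2template, "templates.dat", false);
readfile(name2module, "modules.dat", false);
readfile(name2content, "wiki.dat", true);

// Look up the raw wikitext of the requested word.
String definition = name2content.get(keyword);

// Create the parser and the WikiPage helper that supplies the date, locale, templates and modules.
TemplateParser tp = new TemplateParser();
WikiPage wp = new WikiPage(keyword, date, Locale.ENGLISH, tp, name2template, name2module, false, name2content, true);

// Step 1: parse and expand the wikitext.
String expanded = tp.parse(definition, wp);

// Step 2: format the expanded text to HTML.
String formatted = WikiFormatter.formatWikiText(new StringBuilder(keyword), new StringBuilder(expanded), linkBaseURL);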

The string formatted then contains the HTML for the given definition.
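
One possible way to use the result is to save it as a page that a browser can open. The sketch below assumes that the formatted string is an HTML fragment rather than a complete document, which is not stated here; the class HtmlPageWriter and its methods are hypothetical and not part of WikiParser, and the fragment in main is a placeholder rather than real formatter output.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class HtmlPageWriter {

    /** Escapes the characters that are unsafe inside the HTML title element. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Wraps an HTML fragment in a minimal UTF-8 document titled with the keyword. */
    static String wrap(String keyword, String formatted) {
        return "<!DOCTYPE html>\n<html>\n<head>\n<meta charset=\"utf-8\">\n<title>"
                + escape(keyword) + "</title>\n</head>\n<body>\n"
                + formatted + "\n</body>\n</html>\n";
    }

    public static void main(String[] args) throws IOException {
        String keyword = "apple";
        // Placeholder fragment; in a real application this would be the formatted string.
        String formatted = "<p>A common, round fruit.</p>";

        // Write to a temporary file so the sketch does not touch the working directory.
        Path out = Files.createTempFile(keyword + "-", ".html");
        Files.write(out, wrap(keyword, formatted).getBytes(StandardCharsets.UTF_8));
        System.out.println("Wrote " + out.toAbsolutePath());
    }
}
```

If the formatter already returns a complete document, the wrapping step is unnecessary and the string can be written to the file directly.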

The WikiFind class, available in the WikiParser repository on GitHub, can be used as a starting point for writing your own application class.