Overview
Word definitions from Wiktionary can be downloaded from the Wikimedia dumps. The file enwiktionary-latest-pages-articles.xml.bz2 contains definitions, templates, modules and other information in XML format.

The first step before parsing Wiktionary is to split the dump into plain text files to facilitate further processing. This step is performed by running the WikiSplitter tool, which generates the following files:
- wiki.dat, containing all the definitions;
- templates.dat, containing all the templates;
- modules.dat, containing all the modules.

Once you have generated the *.dat files with WikiExtract, you can parse the wikitext, expand it and render it to HTML in two steps:
1. Create a WikiPage to guide the parser during wiki expansion with information such as the date, locale, templates and modules.
2. Parse the wikitext and render it to HTML.
Here is typical code for parsing and expanding wikitext and rendering it to HTML:
readfile(name2template, "templates.dat", false);
The string formatted will contain the HTML for the given definition.
The WikiFind class, available from the WikiParser repository on GitHub, can be used as a starting point for writing your own application class.
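The splitting step described in the overview can be illustrated with a minimal Python sketch. It is not the WikiSplitter implementation, and it only yields pages instead of writing the .dat files, whose on-disk layout is not described here. It assumes the standard MediaWiki export layout (page elements with title, ns and text children) and the MediaWiki namespace numbers 0 for entries, 10 for templates and 828 for modules. The names NS_TO_KIND, iter_pages and split_dump are hypothetical and chosen for illustration only.

```python
import bz2
import io
import xml.etree.ElementTree as ET

# MediaWiki namespace numbers: 0 = main entries, 10 = Template, 828 = Module.
# The labels mirror the .dat file names from the overview (illustrative only).
NS_TO_KIND = {"0": "wiki", "10": "templates", "828": "modules"}


def local(tag):
    """Strip the '{namespace-uri}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (kind, title, text) for each entry, template or module page."""
    context = ET.iterparse(stream, events=("start", "end"))
    _, root = next(context)  # first event is the start of the root element
    for event, elem in context:
        if event != "end" or local(elem.tag) != "page":
            continue
        title = ns = text = None
        for node in elem.iter():
            name = local(node.tag)
            if name == "title":
                title = node.text
            elif name == "ns":
                ns = node.text
            elif name == "text":
                text = node.text or ""
        kind = NS_TO_KIND.get(ns)
        if kind is not None:
            yield kind, title, text
        root.clear()  # drop finished pages so memory stays bounded


def split_dump(path):
    """Stream pages straight from the compressed dump without unpacking it."""
    with bz2.open(path, "rb") as stream:
        yield from iter_pages(stream)


# Tiny in-memory sample so the sketch runs without the multi-gigabyte dump.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
<page><title>cat</title><ns>0</ns><revision><text>==English==</text></revision></page>
<page><title>Template:en-noun</title><ns>10</ns><revision><text>{{head}}</text></revision></page>
<page><title>Module:headword</title><ns>828</ns><revision><text>return {}</text></revision></page>
<page><title>Talk:cat</title><ns>1</ns><revision><text>hi</text></revision></page>
</mediawiki>"""

for kind, title, text in iter_pages(io.BytesIO(SAMPLE)):
    print(kind, title, len(text))
```

With the real dump, iterating over split_dump("enwiktionary-latest-pages-articles.xml.bz2") visits the same three kinds of pages. Streaming with iterparse and clearing the root after each page avoids loading the whole file into memory.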