Limburgish Corpus

noun (countable) /ˈlɪmˌbɜː(r)ɡɪʃ ˈkɔː(r)pəs/

a collection of written Limburgish texts stored on a computer and used for language research and writing dictionaries


We cannot do it without you!

Help us by sending your own texts in Limburgish! The more texts we are allowed to use, the better our free digital language products will cater to all dialects.

For the Limburgish Corpus we are looking for texts in all Limburgish dialects, published or not. We collect all kinds of Limburgish writing, such as stories, (Saint Nicholas) poems, newspaper clippings, carnival writings, columns, letters, diaries, educational material, speeches, shopping lists, scripts of plays or musicals, and much more. We also welcome digital texts, such as articles from the Limburgish Wikipedia, WhatsApp conversations and tweets.

Partners in Limburg, such as the Stichting Boeken voor Mensen, provide texts for the Limburgish Corpus. Many authors have already sent texts. We ask authors for written permission to use their texts in our non-commercial project. Copyright remains with the writers. Publishers such as Veldeke Limburg, various local Veldeke chapters and the Huis voor de Kunsten Limburg have also given permission.

For the Limburgish Corpus we collect texts from 1775 to the present. More information is given below in an interview with L1 TV (in Limburgish) and in an article in the newspaper De Limburger (in Dutch).

Access to the Limburgish Corpus for researchers

Our Limburgish Corpus is in principle accessible to language researchers. Because it contains copyrighted material and some personal data, we provide access on the basis of contractually agreed conditions. Among other things, the materials may not be made public or distributed. Contact us if you would like to use the Limburgish Corpus for research.

NLP processing of the Limburgish Corpus

The texts are further processed for the Digital Library of Limburgish and analysed for language use for the Limburgish Dictionary. At the Limburgish Academy we develop Natural Language Processing (NLP) software to process the Limburgish Corpus digitally, in collaboration with other researchers. We are also affiliated with the European Lexicographic Infrastructure eLexis, which keeps us informed of, and gives us access to, the latest developments in lexicographic software.

First, spelling variation is normalized: the texts are converted to a single spelling to facilitate further processing. This is followed by tokenization (splitting the text into individual tokens), lemmatization (adding the dictionary form to each word) and PoS tagging (adding the part of speech to each word). We use Sketch Engine for lexicographic analyses. Our lexicographic approach is explained in more detail in an article for eLex.
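The four processing steps can be illustrated with a minimal Python sketch. It is not the Limburgish Academy's software: the function names, the lookup tables and the word forms in them are hypothetical placeholders chosen for illustration, and a real pipeline would rely on much larger resources and on statistical models rather than dictionary lookups.

```python
import re

# Hypothetical lookup tables; the word forms are illustrative placeholders,
# not verified corpus data.
SPELLING_VARIANTS = {"sjoen": "sjoon", "huus": "hoes"}
LEXICON = {
    "hoes": ("hoes", "NOUN"),
    "hoezer": ("hoes", "NOUN"),
    "sjoon": ("sjoon", "ADJ"),
}

WORD_PATTERN = re.compile(r"\w+")
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def normalize_spelling(text):
    """Step 1: replace known spelling variants with one normalized form."""
    def replace(match):
        word = match.group(0)
        return SPELLING_VARIANTS.get(word.lower(), word)
    return WORD_PATTERN.sub(replace, text)


def tokenize(text):
    """Step 2: split the text into word tokens and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


def annotate(text):
    """Steps 3 and 4: attach a lemma and a PoS tag to every token."""
    annotated = []
    for token in tokenize(normalize_spelling(text)):
        if not any(char.isalnum() for char in token):
            annotated.append((token, token, "PUNCT"))
            continue
        # Unknown words keep their own form as lemma and get the tag "X".
        lemma, pos = LEXICON.get(token.lower(), (token.lower(), "X"))
        annotated.append((token, lemma, pos))
    return annotated
```

Calling `annotate` on a short string returns one `(token, lemma, tag)` tuple per token, which is the kind of enriched record that lexicographic tools can then query.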

Digital applications for Limburgish

The Limburgish Corpus is enriched during the NLP processing stage. This makes it suitable for lexicographic purposes, digital linguistic research and further digital applications. One such application is the predictive language model of the Limburgish keyboard for mobile devices, developed by Microsoft SwiftKey in collaboration with the Limburgish Academy. With the NLP-processed Limburgish Corpus we are laying the digital foundation on which other language products for Limburgish can be built, such as spelling checkers, applications for automatic speech recognition, text-to-speech and speech-to-text, language courses and computer-supported methods for learning Limburgish. This makes it possible to adapt the language to contemporary and future digital requirements, keeping it alive and extending the ways in which it can be used.
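To give an idea of how a tokenized corpus can feed a predictive keyboard, here is a minimal sketch of a bigram next-word predictor in Python. It is a deliberately simple stand-in and makes no claim about how the SwiftKey model actually works; the function names `train_bigram_model` and `predict_next` are hypothetical.

```python
from collections import Counter, defaultdict


def train_bigram_model(sentences):
    """Count, for every word, which words follow it in the tokenized sentences."""
    followers = defaultdict(Counter)
    for tokens in sentences:
        for current_word, next_word in zip(tokens, tokens[1:]):
            followers[current_word][next_word] += 1
    return followers


def predict_next(followers, word, k=3):
    """Return up to k words that most often followed `word` in the training data."""
    counts = followers.get(word, Counter())
    return [candidate for candidate, _ in counts.most_common(k)]
```

The model is trained on a list of token lists, for example the output of the tokenization step, and `predict_next` then suggests the most frequent continuations of the word just typed. The more texts in more dialects the corpus contains, the better such counts reflect actual usage.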