In the past fifteen years, many Arabic–Islamic heritage texts have been digitized and made available to the public (with or without infringement of copyrights). This digitization wave brings with it the huge benefit of fast and easy text search and the potential of computerized text analytics. The first products to exploit this potential were countless CD-ROMs of the Quran and the canonical ḥadīth collections, easy to browse and search. Soon such narrow-scoped applications moved to the web, while more advanced digital libraries were brought to the desktop. To this category belongs the tremendously popular al-Maktaba al-Shāmila (MSh; available at www.shamela.ws), containing, besides the Quran, over 6,500 titles from the fields of exegesis, tradition, Islamic law and jurisprudence, theology, history, etc. The program offers easy full-text search, and the results refer back to printed editions of the texts. Now, thanks to two researchers from the ERC research project “The early Islamic Empire at work – the view from the regions toward the center”, there is a new alternative available, Jedli, geared towards the academic world. I used MSh (version 3.61) and Jedli side by side for a month, and present the results of this comparison here. Both programs are available only for MS Windows, and were installed on a Windows 10 machine.
According to its user manual, “Jedli is designed with the texts of al-Maktaba al-Shamela in mind”. In fact, Jedli uses the same text files as MSh, but converted from .bok or .epub to .txt. The obvious advantage of this is that the first version of Jedli already comes with an enormous book collection, and new books found in online repositories intended for MSh can easily be added. This also means, however, that weaknesses inherent in MSh's collection are now also a part of Jedli. A pronounced Sunni bias is one example. Another drawback of the MSh corpus is that many of the digital texts are copied from editions that do not meet academic standards of critical text edition, even when much better editions exist. A third shortcoming is the lack of independent verification. To some extent, these shortcomings may be bypassed by manually adding or editing texts, but managing the quality of a library's worth of texts is something one cannot do alone.
At the heart of both programs lies the search functionality. Here, Jedli's advantage over MSh may not be as great as I expected it to be, but it is clearly there. Both applications allow easy selection of titles to perform the search in, including selecting all works of a certain category or genre, e.g. all ḥadīth compilations. Both allow the user to save such selections for later use. With MSh's collection editing tool one can quickly (re)categorize a title, for instance if one wants Mālik's Muwaṭṭaʾ to appear under ḥadīth as well as fiqh. Doing the same in Jedli, however, requires manually editing a spreadsheet. In Jedli one can define a virtually endless list of search terms, each with a selectable Boolean operator (AND, OR, NOT). MSh's setup, with up to five AND search terms and up to five OR search terms, will suffice in most cases, but when it falls short Jedli is the way to go. There are other reasons to prefer it, too. MSh offers the choice either to allow prefixes and suffixes or not. Jedli has the same option, but it is smarter and more versatile, allowing the user to choose between four levels of prefix (strict, all allowed, nominal, verbal) and suffix (likewise) restriction. Additionally, the user can manually define prefixes and suffixes. So when looking for the word ṣūra, for instance, a non-strict search in MSh will also return maqṣūra, whereas Jedli, with the right amount of prefix restriction, will return al-ṣūra, wa-bi-ṣūra, and so on, but not maqṣūra. Similarly, in MSh one can choose to ignore the distinction between final hāʾ and tāʾ marbūṭa, final yāʾ and alif maqṣūra, and alif and alif-hamza, whereas in Jedli these options can be set independently. More importantly, through the use of regular expressions (regex) Jedli allows the user to search for different words or variants of a word at once. The right search term (kt?b) will not only find kitāb but also kutub, a huge advantage when dealing with an Arabic corpus.
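The prefix restriction and the variant matching described above can be illustrated with Python's standard re module. What follows is a minimal sketch and not Jedli's actual implementation: the short prefix list, the helper name build_pattern and the sample strings are assumptions made for illustration only. In Arabic script, the kitāb/kutub case amounts to making the alif optional.

```python
import re

# Hypothetical, deliberately short list of prefixes (wa-, fa-, bi-, li-, ka-, al-).
# Jedli's real prefix lists are not reproduced here.
PREFIXES = ["و", "ف", "ب", "ل", "ك", "ال"]


def build_pattern(word, prefixes=PREFIXES):
    """Match `word` as a whole token, optionally preceded by listed prefixes."""
    prefix_group = "(?:" + "|".join(re.escape(p) for p in prefixes) + ")*"
    # The lookarounds stop a match from starting or ending inside a longer word.
    return re.compile(r"(?<!\w)" + prefix_group + re.escape(word) + r"(?!\w)")


# al-ṣūra, wa-bi-ṣūra, maqṣūra
sample = " ".join(["الصورة", "وبصورة", "مقصورة"])
matches = build_pattern("صورة").findall(sample)

# Optional alif: one pattern covers both kitāb and kutub.
variant = re.compile(r"(?<!\w)كتا?ب(?!\w)")
variant_matches = variant.findall(" ".join(["كتاب", "كتب"]))
```

With this sample, matches holds the first two words only: maqṣūra is rejected because the letters before ṣūra are not in the prefix list.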
One feature offered by MSh but not by Jedli is the option to limit search results to instances where the search terms occur in the user-defined order, so that searching for Muḥammad AND ʿAlī will return Muḥammad b. ʿAlī but not ʿAlī b. Muḥammad. A major drawback of MSh is that it treats the page as a meaningful unit of text, which it is not. The result is that the same search will not return Muḥammad b. ʿAlī if the name is split across two pages. Jedli's “context search”, on the other hand, only takes the (user-definable) distance between two words into account, not the page on which they occur, so it will return results split across pages. Overall, Jedli's advanced search capabilities greatly surpass those of MSh.
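The difference between page-based and distance-based matching can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code of either program: the page-marker format "[page N]", the function name context_search and its parameters are all hypothetical. The optional ordered flag mimics the order restriction that MSh offers.

```python
import re

# Hypothetical page-break marker; the real text files may mark pages differently.
PAGE_MARKER = re.compile(r"\[page \d+\]")


def context_search(text, first, second, max_distance=3, ordered=False):
    """Return (i, j) word positions where both terms lie within max_distance words."""
    # Drop page markers so that a page break does not separate neighbouring words.
    tokens = PAGE_MARKER.sub(" ", text).split()
    hits = []
    for i, token in enumerate(tokens):
        if token != first:
            continue
        # With ordered=True only words after `first` are considered.
        start = i + 1 if ordered else max(0, i - max_distance)
        end = min(len(tokens), i + max_distance + 1)
        for j in range(start, end):
            if j != i and tokens[j] == second:
                hits.append((i, j))
    return hits


# "... Muḥammad b. [page 12] ʿAlī ʿan ʿAlī b. Muḥammad"
text = " ".join(
    ["حدثنا", "محمد", "بن", "[page 12]", "علي", "عن", "علي", "بن", "محمد"]
)
any_order = context_search(text, "محمد", "علي", max_distance=2)
in_order = context_search(text, "محمد", "علي", max_distance=2, ordered=True)
```

Here any_order finds both Muḥammad b. ʿAlī (across the page break) and ʿAlī b. Muḥammad, while in_order keeps only the former.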
Where Jedli has not quite reached the level of MSh is in the user interface. MSh is not aesthetically pleasing and takes a while to get used to, but it is also highly customizable and does a lot of things right that Jedli does wrong, or not at all. Take, for instance, the way in which search results are displayed. In MSh every search returns a table of results showing the title of the book, the chapter, and the volume and page. Clicking on a search result displays the page in question, with the search terms highlighted, after which one can easily go back or forward a page. Irrelevant items can be deleted, the list can be saved and recalled later, and a new search can be performed within the search results. Users can keep notes on certain titles, authors or search results, and at any time a button can be clicked to retrieve bibliographical information about the text and the print edition on which it is based. Finally, there are two special display and search modes tailored to biographical and tafsīr works, of which the latter especially is a great addition to the program.
Jedli, on the other hand, offers three search “tools”. Each uses an internet browser instead of a built-in reader to display the results. The “Index it!” function renders a list of pages on which one of the search terms occurs – indeed, it ignores any AND operators, and effectively does an OR search. “Highlight it!” shows the full text (all volumes) and highlights each search term in a different colour. It allows one to see the search results in their context, but scrolling long texts is cumbersome compared to flipping pages and one-click search result browsing, as MSh lets one do. “Context search” is undoubtedly what most users want, as it only lists results where the search terms entered by the user appear in the vicinity of each other. Unfortunately, it displays results as small fragments of text in which the search terms are highlighted, so that browsing or reading the wider context is not possible. Other minor glitches include the fact that Arabic text is sometimes aligned to the left instead of the right, a rather annoying effect of which is that long book titles, for instance, are cut off at the beginning instead of the end.
Whatever Jedli lacks in user-friendliness it makes up for by its superior search capabilities. In addition, being designed by academics for academics, Jedli does not impose legal restrictions on its users, whereas MSh's user agreement states “it cannot be used to publish anything that conflicts with the ways of Sunni Islam”. One can only hope, therefore, that future versions bring new features and a better interface. There is reason to be hopeful: Jedli is written in the accessible programming language Python and released under the Apache 2.0 licence (which allows the redistribution of modified versions on condition of a disclaimer and copyright notice); its source code will be published, and the development team welcomes contributions from others. A more fundamental question is whether the developers will keep their product tied to the MSh corpus, or in other words, to what extent they plan to address the issues inherent in this corpus. For now, they seem to be less invested in maintaining a corpus than in creating a toolbox for text mining.
Finally, a remark on that last point is in order. Since the rise of (and increased funding for) the digital humanities, buzzwords such as “text mining” are in vogue, and are more often than not used for something they are not. Text mining implies analytics in addition to heuristics, relies on any combination of statistical, linguistic and machine learning techniques, and is achieved by building and calibrating a model on a set of texts and validating it on an independent set of texts. The result would be a tool that analyses the text and thereby produces information not readily available in, but distilled from, the texts. Contrary to the claims on its website and in its user manual that Jedli is a “text mining” and “data mining tool box”, the program does none of that. Nonetheless, it is a very good text search tool that has the potential to greatly enhance the Islamicist's workflow.