We need to talk about Arabic!

A practical critique of the hostility towards the second most common human writing system built into our quotidian digital infrastructures

Till Grallert

Humboldt-Universität zu Berlin

Methods Innovation Lab (NFDI 4Memory)

Exploring Epistemic Virtues and Vices

2024-03-16

https://tillgrallert.github.io/slides/dh/2024-03-luxembourg/

Introduction

My research interests

… or what I would want to do

Figure 1: Undirected network of authors in al-Ḥaqāʾiq, al-Ḥasnāʾ, Lughat al-ʿArab, and al-Muqtabas. Colour of nodes: betweenness centrality; size of nodes: number of periodicals; width of edges: number of articles.
Figure 2: Directed network of periodicals referenced in al-Ḥaqāʾiq, al-Ḥasnāʾ, Lughat al-ʿArab, al-Muqtabas, and al-Zuhūr

… and what I spend my time on

Project Jarāʾid (2012–)
Closing the knowledge <gap/>

  • Bibliographic record of all Arabic periodical titles published between 1798 and 1929
    • websites and open datasets (TEI/XML) for more than 3500 periodicals
    • additional authority files for c.2700 persons, 220 places, 180 libraries
  • Unfunded collaboration with Adam Mestyan (Duke), “crowd”-sourcing
  • Networking and reconciling existing information:
    • Integration of holding information from library catalogues such as ZDB, AUB, BnF, HathiTrust
    • Publish everything as Linked Open Data on Wikidata
Figure 3: Periodicals by places of publication. Size of circles corresponds to the number of periodicals. Colour indicates the number of new titles per period.

… and what I spend my time on

Open Arabic Periodical Editions (OpenArabicPE, 2015–)
Closing the infrastructural <gap/>

  • Digital scholarly editions
    • 6 Arabic magazines from Baghdad, Cairo, Damascus with c.800 issues and more than 9 million words.
    • full text and facsimiles modelled in TEI/XML
    • Bibliographic metadata (MODS, BibTeX, Zotero RDF)
    • Open licenses: CC BY-SA 4.0
  • Infrastructure:
  • Workflows and tools

Background

Late-Ottoman Eastern Mediterranean
Diversity across the board

Figure 6: Map showing the colonial spheres of interest agreed upon by France and UK. Signed by Sir Mark Sykes and Fr[ançois] Georges-Picot, 8 May 1916. Source: Royal Geographical Society (“English” 1916)

Languages

  • Administrative: Ottoman, Arabic, Persian
  • Quotidian: Turkic languages, Arabic, Greek, Slavic languages, Armenian, Ladino …
  • Liturgical: Arabic, Greek, Armenian, Coptic, Russian, Hebrew …
  • Educational: Ottoman, Arabic, French, English, Russian …

Scripts

  • From right to left:
    • Arabic, Hebrew, Assyrian
  • From left to right:
    • Greek, Armenian, Latin, Cyrillic, Coptic

Religions

  • Muslims: Sunni, Shi’ite
  • Christians: div. Orthodox, Eastern Catholic, Western Catholic, Assyrian, Protestant …
  • Jews: Sephardic, Ashkenazic
  • Zoroastrians

Calendars

  • Islamic (hijri): lunar, observed, epoch begins with Muḥammad’s exodus from Mecca
  • Reformed Julian: solar; year begins on 1 January, epoch begins with Christ’s birth
  • Ottoman fiscal (mālī): lunisolar; year begins on 1 March, epoch begins with Muḥammad’s exodus from Mecca
  • Gregorian: solar; year begins on 1 January, epoch begins with Christ’s birth
  • Jewish: lunisolar; epoch begins with the creation of the world

Days, hours

  • alla turca: day begins with sundown, 12 unequal hours each for day and night
  • alla franca: day begins at midnight, 24 equinoctial hours

“You are as beautiful as an additional hour of electricity!”


Destruction

Figure 7: Meeting room at the Orient-Institut Beirut damaged by the Beirut Port explosion on 4 August 2020. Source: OIB
Figure 8: The National Archives of Syria in Damascus the day after their conflagration on 16 July 2023. Source: “«أصبح رمادًا وركاما»... حريق كبير يدمر سوقًا في قلب دمشق” (2023)

Absences and exclusions

Figure 9: Global distribution of DH centres. Source: DH centerNet

We need to talk about Arabic!

Arabic

Script

  • Second most important script after Latin
    • currently used by 14 languages: Arabic, Persian, Urdu, Pashto, Uzbek, Uighur …

Language

  • Fifth most important language
    • One of six official languages of the United Nations
    • Official language in 26 countries
    • >420 million speakers
  • Liturgical language of 1.6 billion Muslims
Figure 11: Approximate distribution of Arabic script use along current national boundaries (Nemeth Arabic Type-Making in the Machine Age 2017, fig 1.1)

Arabic Script Grammar

  • Written from right to left (RTL)
  • Letters (graphemes)
    • mostly connected in direction of writing
    • letterform depends on position within the string (allographs): ج جـ ـجـ ـج
    • combination of basic letterforms (archigraphemes, Arab. rasm) and diacritic marks (iʿjām)
  • Diacritics
    • reduce semantic ambiguity
    • subject to regional preferences and change over time
  • Vocalisation (tashkīl) is optional and changes the semantics
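How these layers surface in Unicode can be sketched with Python's standard `unicodedata` module. The sketch rests on a few assumptions that are not part of the slides: positional allographs are represented by the legacy presentation-form code points, tashkīl is approximated by the general category `Mn` (which also covers a few other combining marks), and the sample word is chosen only for illustration. The last line hints at why a rasm cannot be obtained through normalisation: dotted letters such as bāʾ are encoded as atomic characters.

```python
import unicodedata

# Legacy presentation forms of jīm: isolated, initial, medial, final
JEEM_FORMS = "\uFE9D\uFE9F\uFEA0\uFE9E"


def base_letter(ch: str) -> str:
    """Fold a positional presentation form back to its abstract letter (NFKC)."""
    return unicodedata.normalize("NFKC", ch)


def strip_tashkil(text: str) -> str:
    """Drop combining marks (category Mn), a rough stand-in for tashkīl."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


for form in JEEM_FORMS:
    print(unicodedata.name(form), "->", unicodedata.name(base_letter(form)))

# Illustrative sample: k-t-b with a fatḥa on every letter
vocalised = "\u0643\u064E\u062A\u064E\u0628\u064E"
print(len(vocalised), len(strip_tashkil(vocalised)))

# bāʾ has no decomposition into rasm + iʿjām
print(repr(unicodedata.decomposition("\u0628")))
```

In everyday text only the abstract letters are stored and the font selects the allograph, so the presentation forms mostly turn up in legacy data.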
Figure 12: Beginning of Zakham (“Amīrkā wa-ʿulamāʾ al-ʿArab” 1907). Some ligatures are highlighted.
Figure 13: Pseudo-rasm of the text in fig. 12. Automatically generated with Pohl (“Rasmifize” [2020] 2022).

multilinguality and linguistic imperialism


Indigenous peoples have the right to revitalize, use, develop and transmit to future generations their histories, languages, oral traditions, philosophies, writing systems and literatures, and to designate and retain their own names for communities, places and persons.

(United Nations “UNDRIP” 2007, sec. 13)

‘Linguistic imperialism’ is shorthand for a multitude of activities, ideologies, and structural relationships. Linguistic imperialism takes place within an overarching structure of asymmetrical North/South relations, where language interlocks with other dimensions, cultural (particularly in education, science, and the media), economic and political

(Phillipson “Realities and Myths of Linguistic Imperialism” 1997, 239)

The basis for the codes, languages, methodologies, and technical instruments of the digital humanities is English; the written and spoken language of all the main conferences, the most prestigious journals, the institutions that control the discipline, the organizations and international consortia, and the central authorities of knowledge is, with few exceptions, some dialect of British or American English.

(Fiormonte “Taxation Against Overrepresentation?” 2021, 334–35)

Technical affordances

Figure 14: Arabic Linotype, 1910s. Source: (Nemeth Arabic Type-Making in the Machine Age 2017, fig. 2.7)

Encoding characters

Unicode is awesome …

… but many contemporary and historical human writing systems are not supported even in its latest iteration.

Figure 15: Supported scripts in Unicode v1.0.0. Source: https://www.worldswritingsystems.org
Figure 16: Currently unsupported scripts. Source: https://www.worldswritingsystems.org

Unicode is awesome …

… but standards depend on implementation and software support

Encoding nightmares

Figure 17: 32 variants of encoding “Meccan” (مكية) (Milo “Visually Misleading Characters in the Arabic URL” 2014, 4)
Figure 18: In-browser search for “مك” in the Wikidata entry for “Mecca” (Q5806)
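One likely reason for such search failures can be sketched in a few lines: strings that look alike on screen may consist of different code points, and Unicode normalisation alone does not unify them. The folding table below is a hypothetical and deliberately tiny illustration. It is neither Milo's list of 32 variants nor what any browser or Wikidata implements, and some of its folds (alif maqṣūra to yāʾ, for instance) are lossy.

```python
import unicodedata

# Hypothetical folding table, for illustration only
FOLD = str.maketrans({
    "\u06A9": "\u0643",  # ARABIC LETTER KEHEH     -> ARABIC LETTER KAF
    "\u06CC": "\u064A",  # ARABIC LETTER FARSI YEH -> ARABIC LETTER YEH
    "\u0649": "\u064A",  # ARABIC LETTER ALEF MAKSURA -> ARABIC LETTER YEH (lossy)
    "\u0640": None,      # ARABIC TATWEEL (elongation stroke) is dropped
})


def fold(text: str) -> str:
    """Normalise, drop combining marks, then apply the illustrative folds."""
    text = unicodedata.normalize("NFKC", text)
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
    return text.translate(FOLD)


standard = "\u0645\u0643\u064A\u0629"  # mīm, kāf, yāʾ, tāʾ marbūṭa
variant = "\u0645\u06A9\u06CC\u0629"   # same look in many fonts, other code points

print(standard == variant)                                  # raw comparison
print(unicodedata.normalize("NFKC", variant) == standard)   # NFKC does not help
print(fold(standard) == fold(variant))                      # custom folding does
print("\u0645\u0643" in variant, "\u0645\u0643" in fold(variant))
```

Any real search system has to decide which of these folds are acceptable for its languages; what is harmless for Arabic may merge distinct letters in Persian or Urdu.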

Unicode is awesome …

… but standards depend on implementation and software support

Rendering nightmares

This ought to be the perfect example:

As Ramsey Nasser notes in the overview of his programming language ب ل ق [pre-existing digital technologies] are almost exclusively based on the ASCII character set

(Isasi et al. “A Model for Multilingual and Multicultural Digital Scholarship Methods Publishing: The Case of Programming Historian” 2023, 19)

ب ل ق should have been قلب
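The printed string consists of the three letters of the intended word in reverse logical order, separated so that they cannot join. A minimal sketch that reproduces this garbling follows. How the publisher's pipeline actually produced it is not known; the function `garble` is a hypothetical illustration of one plausible failure (a tool without right-to-left support reordering the characters).

```python
def garble(word: str) -> str:
    """Reverse the logical order and insert spaces, which prevents joining."""
    return " ".join(reversed(word))


intended = "\u0642\u0644\u0628"  # qāf, lām, bāʾ in logical (reading) order
garbled = garble(intended)

print([f"U+{ord(ch):04X}" for ch in intended])
print([f"U+{ord(ch):04X}" for ch in garbled])
```

Arabic text is stored in reading order and only reordered for display, so a correct pipeline never needs to reverse the stored characters.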

Unicode is awesome …

… did I mention industry consortia?

HTML elements all have names that only use ASCII alphanumerics (Web Hypertext Application Technology Working Group “HTML: Living Standard” 2023, sec. 13.1.2)

Figure 19: Rendered HTML with the built-in default CSS for Zakham (“Amīrkā wa-ʿulamāʾ al-ʿArab” 1907). The test file is available online
Figure 20: HTML code for fig. 19 in Visual Studio Code. By default, non-ASCII characters are visually highlighted

Bidirectional texts

The mandatory XML declaration <?xml version="1.0" encoding="UTF-8"?> sets left-to-right as the base direction.
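Why mixed markup is hard to read can be illustrated with the bidirectional character classes that Unicode assigns to every code point. The sketch below uses only the standard library. The `<title>` wrapper is a hypothetical stand-in for TEI markup, and the function merely lists runs of strongly directional characters; it is not an implementation of the Unicode Bidirectional Algorithm.

```python
import unicodedata
from itertools import groupby


def strong_direction(ch: str):
    """Return 'LTR' or 'RTL' for strongly directional characters, else None."""
    cls = unicodedata.bidirectional(ch)
    if cls == "L":
        return "LTR"
    if cls in ("R", "AL"):
        return "RTL"
    return None  # digits, punctuation, whitespace, combining marks


def direction_runs(text: str) -> list:
    """Collapse the strong characters of a string into alternating runs."""
    strong = [d for d in map(strong_direction, text) if d is not None]
    return [direction for direction, _ in groupby(strong)]


# Hypothetical element wrapping an Arabic periodical title
snippet = "<title>\u0645\u0631\u0622\u0629 \u0627\u0644\u0634\u0631\u0642</title>"
print(direction_runs(snippet))
print(unicodedata.bidirectional("<"), unicodedata.bidirectional("\u0645"))
```

Angle brackets and slashes are directionally neutral or weak, so their on-screen position depends on the surrounding runs and on the base direction chosen by the editor.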

Figure 21: Bi-directional TEI/XML at the beginning of Dammūs (“Ṣiḥāfat Sūriyya wa-Lubnān” 1911). Arrows indicate the reading direction. Numbers indicate reading order.
Figure 22: The TEI/XML from fig. 21 in oXygen’s author mode. Styling relies on CSS.

Transliteration, the undead solution of yore

Transliteration into Latin script served the need of colonial administrations and academics with the technological affordances of the time.

مرآة الشرق

The Arabic original

Meraat al-Sherk

The official transcription provided by the paper’s masthead

Figure 23: Front page of Mirʾāt al-Sharq #192, 22 Nov. 1922, Jerusalem. Source: EAP.

Mirʾāt al-Sharq

Following the system of the International Journal of Middle East Studies (IJMES)

Mirʾāt aš-Šarq

Following the system of the Deutsche Morgenländische Gesellschaft (DMG)

mrMp Alcrq

Buckwalter transliteration
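Character-by-character schemes such as Buckwalter are simple lookup tables. The sketch below covers only the eight letters needed for this title and is an assumption on my part: the values are chosen to reproduce the string shown above, which matches what is sometimes called the XML-safe variant of Buckwalter (original Buckwalter writes `|` for alif with madda and `$` for shīn, as far as I am aware). The name `PARTIAL_TABLE` is hypothetical.

```python
# Partial, illustrative mapping: only the letters of the title above
PARTIAL_TABLE = {
    "\u0645": "m",  # mīm
    "\u0631": "r",  # rāʾ
    "\u0622": "M",  # alif with madda
    "\u0629": "p",  # tāʾ marbūṭa
    "\u0627": "A",  # alif
    "\u0644": "l",  # lām
    "\u0634": "c",  # shīn
    "\u0642": "q",  # qāf
}


def transliterate(text: str, table=PARTIAL_TABLE) -> str:
    """Replace each character found in the table; leave all others untouched."""
    return "".join(table.get(ch, ch) for ch in text)


title = "\u0645\u0631\u0622\u0629 \u0627\u0644\u0634\u0631\u0642"
print(transliterate(title))
```

The mapping is reversible and ASCII-only, which is why it survives in NLP tooling, but it ignores pronunciation entirely and is unreadable for anyone who has not memorised the table.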

Arabic textual heritage online

The long tail of ASCII in discovery systems

الجنة?

No Arabic script

Figure 24: Search in ZDB for “الجنة”

al-Ǧanna?

Which Latinized transcription was used?

Figure 25: Search in ZDB for “al-Ǧanna”

Ganna!

What are the normalization rules for the search algorithm?

Figure 26: Search in ZDB for “Ganna”
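The catalogue's actual normalisation rules are not documented here, which is the point of the question above. As a guess at what such rules might look like, the sketch below strips diacritics from Latin transcription, ignores case, and drops a leading article. The function `search_key` and all of its rules are hypothetical and not ZDB's implementation.

```python
import unicodedata


def search_key(term: str) -> str:
    """Hypothetical search key: no diacritics, no case, no leading 'al-'."""
    decomposed = unicodedata.normalize("NFD", term)
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    key = stripped.casefold()
    if key.startswith("al-"):
        key = key[3:]
    return key


queries = ["al-\u01E6anna", "Ganna", "\u0627\u0644\u062C\u0646\u0629"]
for query in queries:
    print(query, "->", search_key(query))
```

Under these assumed rules the two Latin-script queries collapse into the same key, while the query in Arabic script can never match a record that holds only a transcription.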

Digitisation bias

Collection biases perpetuated

Figure 27: Periodicals and their holding institutions
Table 1: Periodical holdings and digitization
periodicals            | –1918 | –1929
-----------------------|-------|------
published              | 2054  | 3550
known holdings         | 540   | 775
  % of total           | 26.29 | 21.83
-----------------------|-------|------
digitized              | 156   | 233
  % of total           | 7.59  | 6.56
-----------------------|-------|------
multiple digitisations | 51    | 66
  % of total           | 2.48  | 1.86
  % of digitised       | 32.69 | 28.33
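The percentages in Table 1 follow directly from the counts. A short sketch that recomputes them is given below; the dictionary layout and the helper `pct` are my own choices, and the numbers are taken from the table.

```python
def pct(part: int, whole: int) -> float:
    """Share of `part` in `whole` as a percentage, rounded to two decimals."""
    return round(100 * part / whole, 2)


# Counts from Table 1, keyed by the cut-off year of each column
counts = {
    1918: {"published": 2054, "holdings": 540, "digitised": 156, "multiple": 51},
    1929: {"published": 3550, "holdings": 775, "digitised": 233, "multiple": 66},
}

for year, c in counts.items():
    print(
        year,
        pct(c["holdings"], c["published"]),   # known holdings, % of total
        pct(c["digitised"], c["published"]),  # digitised, % of total
        pct(c["multiple"], c["published"]),   # multiple digitisations, % of total
        pct(c["multiple"], c["digitised"]),   # multiple digitisations, % of digitised
    )
```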

Digitisation bias

mind the <gap/>!

Table 2: Comparison of digitized periodicals between the Global South and the Global North
            | Arabic periodicals (1798–1918)  | WWI as mirrored by Hessian regional papers
------------|---------------------------------|-------------------------------------------
community   | c. 420 million Arabic speakers  | c. 6.2 million inhabitants
periodicals | 2054 newspapers and journals    | 125 newspapers
digitized   | 156 periodicals                 | 125 newspapers with more than 1.5 million pages
type        | mostly facsimiles               | facsimiles and full text
access      | paywalls, geo-fencing           | open access
interface   | mostly foreign languages only   | local and foreign languages
Figure 28: Map of Arabic dialects. Source: reddit
Figure 29: Map of Hesse in Europe. Source: https://www.iz.sk/sk/projekty/regiony-eu/DE7

mind the <gap/>!

Interfaces

Figure 30: Interface of the Translatio project (Bonn). Facsimile of Arabic original on the left. Yellow = English UI; purple = Arabic metadata in DMG transcription; green = German metadata

mind the <gap/>!

Figure 31: al-Muqtabas 6 on HathiTrust (Original in Princeton) outside the USA

cataloguing rules and algorithmic copyright detection cause further inaccessibilities

Figure 32: The page from fig. 31 with a US-IP

Quality of metadata

Bibliographic metadata is faulty throughout, mostly unstructured, and subject to linguistic imperialism

Figure 33: al-Muqtabas (“al-laban al-rāʾib” 1911) on Shamela as it appeared in 2019
Figure 34: Facsimile of the same section of al-Muqtabas (“al-laban al-rāʾib” 1911) from EAP

mind the <gap/>!

Traditional OCR

language [is] not currently OCRable.

Archive.org’s item description for (Kurd ʿAlī Gharāʾib al-Gharb 1923)

Table 3: Evaluation of traditional OCR software for Arabic font types from (Alghamdi and Teahan “Experimental Evaluation of Arabic OCR Systems” 2017, table IV). Values show percentage of correctly recognised characters
Font type          | Sakhr (%) | ABBYY (%) | RDI (%) | Tesseract (%)
-------------------|-----------|-----------|---------|--------------
Traditional Arabic | 48.54     | 67.66     | 51.88   | 47.04
Tahoma             | 10.52     | 69.91     | 26.38   | 38.37
Simplified Arabic  | 52.97     | 67.69     | 44.94   | 46.75
M Unicode Sara     | 36.03     | 59.40     | 25.92   | 33.72
Diwani letter      | 18.13     | 18.47     | 18.13   | 23.32
DecoType Thuluth   | 36.12     | 37.71     | 24.26   | 32.48
DecoType Naskh     | 48.88     | 50.22     | 41.63   | 40.92
Arabic transparent | 51.56     | 75.19     | 46.00   | 48.61
Andalus            | 28.07     | 37.53     | 21.68   | 25.34
AdvertisingBold    | 57.35     | 70.26     | 27.20   | 39.39
Figure 35: al-Bashīr 9 Jan. 1880 (#487), p.1 on GPA, quality of the OCR layer

machine-learning approaches to OCR

For old prints, there’s […] kraken/calamari for coders, Transkribus if you’ve got money and just want to have the results[,] and OCR-D if you’ve got an IT department.

(Winkler and @awinkler@openbiblio.social Mastodon post 2023)

Table 4: Evaluation of our Transkribus models
training set   | al-Ustādh | al-Muqtabas
---------------|-----------|------------
words          | 192829    | 11116
lines          | 18732     | 1013
epochs         | 200       | 200
CER train      | 2.01      | 0.07
CER validation | 2.09      | 8.40
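For readers unfamiliar with the metric: the character error rate (CER) is commonly defined as the edit distance between the recognised text and the ground truth, divided by the length of the ground truth. The sketch below implements this common definition; how exactly Transkribus computes its figures is not stated here, so the code is an assumption about the metric, not a reproduction of the values in Table 4. The sample strings are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions from a to b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate relative to the reference (ground truth)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


truth = "\u0645\u0643\u064A\u0629"       # mīm, kāf, yāʾ, tāʾ marbūṭa
recognised = "\u0645\u06A9\u06CC\u0629"  # look-alike kāf and yāʾ variants
print(f"{cer(truth, recognised):.2%}")
```

A transcription that looks identical to a human reader can therefore still count as two errors in four characters, which ties the evaluation of OCR back to the encoding problems shown earlier.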
Figure 36: Transkribus web-app showing results of our model for al-Ḥasnāʾ 1(1)

Conclusion

build the digital commons we need
with what we have at hand

  1. Do it yourself
    • but not alone
  2. keep it simple
    • for the sake of the people and spaceship earth
  3. there will be a future

Contemporary research instrumentation in our field, from natural language processing to network analysis, involves complex mechanisms. Their inner workings often lie beyond the full comprehension of the casual user. To use such tools well, we must, in some real sense, understand them better than the tool makers. At the very least, we should know them well enough to comprehend their biases and limitations.

(Tenen “Blunt Instrumentalism: On Tools and Methods” 2016, 85)

this implies learning how to produce, disseminate, and preserve digital scholarship ourselves, without the help we can’t get, even as we fight to build the infrastructures we need at the intersection of, with, and beyond institutional libraries and schools.

(Gil and Ortega “Global Outlooks in Digital Humanities” 2016, 29)

Thank you!

  • Contributors to OpenArabicPE: Jasper Bernhofer, Dimitar Dragnev, Patrick Funk, Talha Güzel, Hans Magne Jaatun, Daniel Kolland, Jakob Koppermann, Xaver Kretzschmar, Daniel Lloyd, Klara Mayer, Tobias Sick, Manzi Tanna-Händel, and Layla Youssef
  • Contributors to Project Jarāʾid: Hala Auji, Philippe Chevrant, Marina Demetriadou, Lamia Eid, Stacy Fahrenthold, Ulrike Freitag, Till Grallert, Rana Issa, Nicole Khayat, Peter Magierski, Leyla von Mende, Adam Mestyan, Christian Meier, Daniel Newman, Geoffrey Roper, Sinai Rusinek, Philip Sadgrove, Ola Seif, and Rogier Visser

References

Alghamdi, Mansoor, and William Teahan. 2017. “Experimental Evaluation of Arabic OCR Systems.” PSU Research Review 1 (3): 229–41. https://doi.org/gh4457.
al-Muqtabas. 1911. “Akhbār wa afkār: al-laban al-rāʾib” [News and thoughts: Yogurt] 6 (2), February 1, 1911. https://OpenArabicPE.github.io/journal_al-muqtabas/tei/oclc_4770057679-i_61.TEIP5.xml#div_21.d1e2838.
Dammūs, Ḥalīm Ibrāhīm. 1911. “Ṣiḥāfat Sūriyya wa-Lubnān” [The Press of Syria and Lebanon]. al-Zuhūr 2 (4), June 1, 1911. https://openarabicpe.github.io/journal_al-zuhur/tei/oclc_1034545644-i_15.TEIP5.xml#div_1.d2e634.
Digital Humanities Uni Potsdam, and @dh_potsdam@hcommons.social. 2023. “The Geography of #DH2023 Participants ...” Mastodon post. Mastodon. https://hcommons.social/@dh_potsdam/110696533594288097.
Fiormonte, Domenico. 2021. “Taxation Against Overrepresentation? The Consequences of Monolingualism for Digital Humanities.” In Alternative Historiographies of the Digital Humanities, edited by Dorothy Kim and Adeline Koh, 333–76. Earth: punctum books. https://doi.org/10.53288/0274.1.00.
Gil, Alex, and Élika Ortega. 2016. “Global Outlooks in Digital Humanities: Multilingual Practices and Minimal Computing.” In Doing Digital Humanities: Practice, Training, Research, edited by Constance Crompton, Richard J Lane, and Ray Siemens, 22–34. Abingdon: Routledge.
Isasi, Jennifer, Riva Quiroga, Nabeel Siddiqui, Joana Vieira Paulino, and Alex Wermer-Colan. 2023. “A Model for Multilingual and Multicultural Digital Scholarship Methods Publishing: The Case of Programming Historian.” In Multilingual Digital Humanities, edited by Lorella Viola and Paul Spence, 17–30. London: Routledge. https://doi.org/10.4324/9781003393696-3.
Kurd ʿAlī, Muḥammad. 1923. Gharāʾib al-Gharb [The Oddities of the West]. 2nd ed. Vol. 1. Miṣr: al-Maṭbaʿa al-Raḥmāniyya. http://archive.org/details/1_20191109_20191109_1843.
Milo, Thomas. 2014. “Visually Misleading Characters in the Arabic URL.” DecoType.
Nemeth, Titus. 2017. Arabic Type-Making in the Machine Age: The Influence of Technology on the Form of Arabic Type, 1908–1993. Leiden: Brill. https://doi.org/10.1163/9789004349308.
Phillipson, Robert. 1997. “Realities and Myths of Linguistic Imperialism.” Journal of Multilingual and Multicultural Development 18 (3): 238–48. https://doi.org/db3cnb.
Pohl, Oliver. (2020) 2022. “Rasmifize.” TypeScript. https://github.com/suchmaske/rasmifize.
Royal Geographical Society. 1916. “Sykes Picot Agreement Map Signed 8 May 1916.” MPK 1/426. PRO. https://commons.wikimedia.org/wiki/File:MPK1-426_Sykes_Picot_Agreement_Map_signed_8_May_1916.jpg.
Tenen, Dennis. 2016. “Blunt Instrumentalism: On Tools and Methods.” In Debates in the Digital Humanities 2016, edited by Matthew K. Gold and Lauren F. Klein, 83–91. Debates in the Digital Humanities 2. Minneapolis: University of Minnesota Press. https://doi.org/10.5749/j.ctt1cn6thb.12.
United Nations. 2007. “United Nations Declaration on the Rights of Indigenous Peoples.” A/RES/61/295. United Nations. https://undocs.org/A/RES/61/295.
Web Hypertext Application Technology Working Group. 2023. “HTML: Living Standard.” March 24, 2023. https://html.spec.whatwg.org/multipage/.
Winkler, Alexander, and @awinkler@openbiblio.social. 2023. Mastodon post. Mastodon. https://openbiblio.social/@awinkler/109981107178749600.
Zakham, Yūsuf. 1907. “Amīrkā wa-ʿulamāʾ al-ʿArab” [America and Arab Scholars]. al-Muqtabas 2 (1), February 14, 1907. https://OpenArabicPE.github.io/journal_al-muqtabas/tei/oclc_4770057679-i_13.TEIP5.xml#div_8.d1e1249.
“«أصبح رمادًا وركاما»... حريق كبير يدمر سوقًا في قلب دمشق.” 2023. Newspaper. الشرق الاوسط. July 16, 2023. https://aawsat.com/node/4436121.