We need to talk about Arabic!

A practical critique of the hostility towards the second most common human writing system built into our quotidian digital infrastructures

Till Grallert

Humboldt-Universität zu Berlin

Methods Innovation Lab (NFDI 4Memory)

Exploring Epistemic Virtues and Vices

2024-03-16

https://tillgrallert.github.io/slides/dh/2024-03-luxembourg/

Introduction

My research interests

… or what I would want to do

Figure 1: Undirected network of authors in *al-Ḥaqāʾiq*, *al-Ḥasnāʾ*, *Lughat al-ʿArab*, and *al-Muqtabas*. Colour of nodes: betweenness centrality; size of nodes: number of periodicals; width of edges: number of articles.

Figure 2: Directed network of periodicals referenced in *al-Ḥaqāʾiq*, *al-Ḥasnāʾ*, *Lughat al-ʿArab*, *al-Muqtabas*, and *al-Zuhūr*

… and what I spend my time on

Project Jarāʾid (2012–)
Closing the knowledge `<gap/>`

Bibliographic record of all Arabic periodical titles published between 1798 and 1929
- websites and open datasets (TEI/XML) for more than 3500 periodicals
- additional authority files for c.2700 persons, 220 places, 180 libraries
Unfunded collaboration with Adam Mestyan (Duke), “crowd”-sourcing
Networking and reconciling existing information:
- Integration of holding information from library catalogues such as ZDB, AUB, BnF, HathiTrust
- Publish everything as Linked Open Data on Wikidata

Figure 3: Periodicals by places of publication. Size of circles corresponds to the number of periodicals. Colour indicates the number of new titles per period.

… and what I spend my time on

Open Arabic Periodical Editions (OpenArabicPE, 2015–)
Closing the infrastructural `<gap/>`

Digital scholarly editions
- 6 Arabic magazines from Baghdad, Cairo, Damascus with c.800 issues and more than 9 million words.
- full text and facsimiles modelled in TEI/XML
- Bibliographic metadata (MODS, BibTeX, Zotero RDF)
- Open licenses: CC BY-SA 4.0
Infrastructure:
- TEI Boilerplate: static websites. No need for backend, database or internet connection
- GitHub / Zenodo: free hosting and archiving with DOIs
- Zotero group as gateway to search/browse the corpus
Workflows and tools

Figure 4: Webview of *al-Zuhur* 1(1)

Figure 5: Webview of *al-Muqtabas* 3(2)

Background

Late-Ottoman Eastern Mediterranean
Diversity across the board

Languages

Administrative: Ottoman, Arabic, Persian
Quotidian: Turkic languages, Arabic, Greek, Slavic languages, Armenian, Ladino …
Liturgical: Arabic, Greek, Armenian, Coptic, Russian, Hebrew …
Educational: Ottoman, Arabic, French, English, Russian …

Scripts

From right to left:
- Arabic, Hebrew, Assyrian
From left to right:
- Greek, Armenian, Latin, Cyrillic, Coptic

Religions

Muslims: Sunni, Shi’ite
Christians: div. Orthodox, Eastern Catholic, Western Catholic, Assyrian, Protestant …
Jews: Sephardic, Ashkenazic
Zoroastrians

Calendars

Islamic (hijri): lunar, observed, epoch begins with Muḥammad’s exodus from Mecca
Reformed Julian: solar; year begins on 1 January, epoch begins with Christ’s birth
Ottoman fiscal (mālī): lunisolar; year begins on 1 March, epoch begins with Muḥammad’s exodus from Mecca
Gregorian: solar; year begins on 1 January, epoch begins with Christ’s birth
Jewish: lunisolar; epoch begins with the creation of the world

Days, hours

alla turca: day begins with sundown, 12 unequal hours each for day and night
alla franca: day begins at midnight, 24 equinoctial hours

“You are as beautiful as an additional hour of electricity!”

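"حبيبتي، انت جميلة، كساعة اضافية من الكهرباء"
[“My darling, you are as beautiful as an additional hour of electricity”]

هذا غزل أحد المتظاهرين في ساحة التحرير اليوم. رائعة حقيقة!
[This is how one of the protesters on Tahrir Square declared his love today. Truly wonderful!]
pic.twitter.com/KI8sAkY719
aya mansour (@aya_mansour_11_), July 31, 2015

مريم .. أنتِ جميلة كساعة إضافية من الكهرباء ..
[Maryam .. you are as beautiful as an additional hour of electricity ..]

كتبها عاشق في فلسطين - غزة
[Written by a lover in Palestine, Gaza]
pic.twitter.com/W3QvpmaE3O
Jawdat Alsaleh (@JawdatAlsaleh), June 27, 2017

#سأكتب_على_الجدار [#I_will_write_on_the_wall]
أنتِ جميلة كساعة إضافية من الكهرباء
[You are as beautiful as an additional hour of electricity]
pic.twitter.com/jKpLnnlorR
A - M .. Syria (@Azrael90), January 17, 2018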

“You are as beautiful as an additional hour of electricity!”

very unequal access to the means of digital production
sustainable development goals (SDG) of the UN
electricity
- 800 mio have no access
  - almost exclusively in the global south
  - vast majority in subsaharan Africa
- by 2030 according to projections of the International Energy Agency (IEA):
  - 600 mio
  - 33 per cent of all Africans
- access:
  - 250–500 kWh per year and household
  - less than 14 hours of a 100W lightbulb per day
internet
- 36.6 per cent of the world population, or 2.93 billion people, do not participate
- 85 per cent of them live in Africa, South, East and South-East Asia
- lower speed
- higher latency
- higher cost per unit of traffic
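
The lightbulb comparison in the electricity figures above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes the annual household budget is spread evenly over 365 days; the function name is purely illustrative.

```python
def bulb_hours_per_day(kwh_per_year: float, bulb_watts: float = 100.0) -> float:
    """Hours per day a bulb of `bulb_watts` can run on an annual budget in kWh."""
    wh_per_day = kwh_per_year * 1000 / 365  # convert kWh per year to Wh per day
    return wh_per_day / bulb_watts


# Lower and upper bound of the access threshold quoted above
for budget in (250, 500):
    hours = bulb_hours_per_day(budget)
    print(f"{budget} kWh per year: {hours:.1f} hours of a 100 W bulb per day")
```

Even the upper bound of 500 kWh stays below 14 hours of light per day, with nothing left for any other appliance.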

Destruction

Figure 7: Meeting room at the Orient-Institut Beirut damaged by the Beirut Port explosion on 4 August 2020. Source: OIB

Figure 8: The National Archives of Syria in Damascus the day after their conflagration on 16 July 2023. Source: “«أصبح رمادًا وركاما»... حريق كبير يدمر سوقًا في قلب دمشق” (2023)

Absences and exclusions

Figure 9: Global distribution of DH centres. Source: DH centerNet

Figure 10: Global distribution of attendees at DH2023 in Graz. Source: Digital Humanities Uni Potsdam and @dh_potsdam@hcommons.social (“The Geography of #DH2023 Participants ...” 2023)

We need to talk about Arabic!

Arabic

Script

Second most important script after Latin
- currently used by 14 languages: Arabic, Persian, Urdu, Pashto, Uzbek, Uighur …

Language

Fifth most important language
- One of six official languages of the United Nations
- Official language in 26 countries
- >420 million speakers
Liturgical language of 1.6 billion Muslims

Figure 11: Approximate distribution of Arabic script use along current national boundaries (Nemeth *Arabic Type-Making in the Machine Age* 2017, fig 1.1)

Arabic Script Grammar

Written from right to left (RTL)
Letters (graphemes)
- mostly connected in direction of writing
- letterform depends on position within the string (allographs): ج جـ ـجـ ـج
- combination of basic letterforms (archigraphemes, Arab. rasm) and diacritic marks (iʿjām)
Diacritics
- reduce semantic ambiguity
- subject to regional preferences and change over time
Vocalisation (tashkīl) is optional and changes the semantics
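
Two of the points above can be made tangible with Python's standard `unicodedata` module. The sketch below is an illustration under stated assumptions: it uses the legacy positional code points from the Unicode block "Arabic Presentation Forms-B" to stand in for the four allographs of jīm, and it treats the range U+064B to U+0652 as the tashkīl marks. The helper name `strip_tashkil` and the sample word are hypothetical choices, not part of the original slides.

```python
import unicodedata

# Isolated, final, initial and medial form of jīm as separate legacy code points
JIM_FORMS = ["\uFE9D", "\uFE9E", "\uFE9F", "\uFEA0"]

for form in JIM_FORMS:
    # Compatibility normalisation folds every allograph into the one letter U+062C
    letter = unicodedata.normalize("NFKC", form)
    print(unicodedata.name(form), "->", unicodedata.name(letter))


def strip_tashkil(text: str) -> str:
    """Remove the optional vocalisation marks (fathatan to sukun, U+064B to U+0652)."""
    return "".join(ch for ch in text if not "\u064B" <= ch <= "\u0652")


vocalised = "\u0645\u064F\u062D\u064E\u0645\u0651\u064E\u062F"  # مُحَمَّد
plain = strip_tashkil(vocalised)                                  # محمد
print(len(vocalised), "code points ->", len(plain), "code points")
```

The letter is encoded once and shaped by the renderer, while vocalisation lives in separate combining code points that a text may or may not carry.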

Figure 12: Beginning of Zakham (“Amīrkā wa-ʿulamāʾ al-ʿArab” 1907). Some ligatures are highlighted.

Figure 13: Pseudo-rasm of the text in fig. 12. Automatically generated with Pohl (“Rasmifize” [2020] 2022).

multilinguality and linguistic imperialism

Indigenous peoples have the right to revitalize, use, develop and transmit to future generations their histories, languages, oral traditions, philosophies, writing systems and literatures, and to designate and retain their own names for communities, places and persons.

(United Nations “UNDRIP” 2007, sec. 13)

‘Linguistic imperialism’ is shorthand for a multitude of activities, ideologies, and structural relationships. Linguistic imperialism takes place within an overarching structure of asymmetrical North/South relations, where language interlocks with other dimensions, cultural (particularly in education, science, and the media), economic and political

(Phillipson “Realities and Myths of Linguistic Imperialism” 1997, 239)

The basis for the codes, languages, methodologies, and technical instruments of the digital humanities is English; the written and spoken language of all the main conferences, the most prestigious journals, the institutions that control the discipline, the organizations and international consortia, and the central authorities of knowledge is, with few exceptions, some dialect of British or American English.

(Fiormonte “Taxation Against Overrepresentation?” 2021, 334–35)

Technical affordances

Figure 14: Arabic Linotype, 1910s. Source: (Nemeth *Arabic Type-Making in the Machine Age* 2017, fig. 2.7)

Encoding characters

Unicode is awesome …

… but many contemporary and historical human writing systems are not supported even in its latest iteration.

Figure 15: Supported scripts in Unicode v1.0.0. Source: https://www.worldswritingsystems.org

Figure 16: Currently unsupported scripts. Source: https://www.worldswritingsystems.org

Unicode can be traced back to the 1980s
Unicode has become the dominant encoding standard in the 2000s
almost universal support across operating systems has been driven by people’s fondness for emojis
Arabic has been part of Unicode since v 1.0.0
linguistic imperialism
- consortium: Adobe, Airbnb, Amazon, Apple, Yat, Google, ETCO, Meta, Microsoft, Netflix, SAP and Salesforce
- character encoding is part of the history of a global hegemonic technology stack bound up in historically contingent cultural traditions of the Global North.
- Mechanically and, later, electronically recording information in scripts other than Latin, particularly complex scripts with a much larger number of graphemes and different writing directions, was never considered sufficiently important or profitable to be supported out-of-the-box.
- Character encoding enforces Latin script grammar
  - Unicode insufficiently distinguishes between languages and scripts
- The standard is written in English
- Current v 15:
  - we know at least 300 writing systems
  - 127 are currently not encoded

Unicode is awesome …

… but standards depend on implementation and software support

Encoding nightmares

Figure 17: 32 variants of encoding “Meccan” (مكية) (Milo “Visually Misleading Characters in the Arabic URL” 2014, 4)

Figure 18: In-browser search for “مك” in the Wikidata entry for “Mecca” (Q5806)
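
Why a literal search such as the one in fig. 18 can miss strings that look identical on screen becomes clearer with a small sketch. It builds a handful of spellings of مكية from look-alike code points and compares a naive substring search with a search over folded text. The choice of look-alikes and the folding table are illustrative assumptions; they do not reproduce Milo's 32 variants.

```python
from itertools import product

# Candidate code points per letter position of مكية (illustrative selection)
ALTERNATIVES = [
    ["\u0645"],            # MEEM
    ["\u0643", "\u06A9"],  # KAF, KEHEH: near-identical in medial position
    ["\u064A", "\u06CC"],  # YEH, FARSI YEH: near-identical in medial position
    ["\u0629"],            # TEH MARBUTA
]

# Hypothetical folding table that maps the look-alikes onto one canonical letter
FOLD = str.maketrans({"\u06A9": "\u0643", "\u06CC": "\u064A"})

variants = ["".join(letters) for letters in product(*ALTERNATIVES)]
query = "\u0645\u0643"  # مك typed with MEEM and KAF

naive_hits = [v for v in variants if query in v]
folded_hits = [v for v in variants if query in v.translate(FOLD)]

print(f"{len(variants)} spellings, {len(naive_hits)} found by literal search, "
      f"{len(folded_hits)} found after folding")
```

Folding is a design choice with a cost: it also erases distinctions that matter in Persian or Urdu text, so it belongs in the search index rather than in the stored data.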

Unicode is awesome …

… but standards depend on implementation and software support

Rendering nightmares

This ought to be the perfect example:

As Ramsey Nasser notes in the overview of his programming language ب ل ق [pre-existing digital techonologies] are almost exclusively based on the ASCII character set

(Isasi et al. “A Model for Multilingual and Multicultural Digital Scholarship Methods Publishing: The Case of Programming Historian” 2023, 19)

ب ل ق should have been قلب
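
The difference between the two strings is visible at the level of code points. Unicode text is stored in logical (reading) order and the renderer is responsible for direction and joining. The sketch below assumes, without knowing the actual cause in the publication pipeline, that the word was reversed and its letters separated; it only shows that this operation reproduces the printed result.

```python
import unicodedata

word = "\u0642\u0644\u0628"  # قلب in logical order: qāf, lām, bāʾ

for ch in word:
    # Every letter carries the bidirectional class AL (Arabic letter, right to left)
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.bidirectional(ch))

# Reversing the stored order and inserting spaces prevents joining and yields
# the sequence bāʾ, lām, qāf
broken = " ".join(reversed(word))
print([unicodedata.name(ch) for ch in broken if ch != " "])
```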

Unicode is awesome …

… did I mention industry consortia?

HTML elements all have names that only use ASCII alphanumerics (Web Hypertext Application Technology Working Group “HTML: Living Standard” 2023, sec. 13.1.2)

Figure 19: Rendered HTML with the built-in default CSS for Zakham (“Amīrkā wa-ʿulamāʾ al-ʿArab” 1907). The test file is available online

Figure 20: HTML-Code for fig. 19 in Visual Studio Code. By default, non-ASCII characters are visually highlighted

Bidirectional texts

The mandatory XML declaration <?xml version="1.0" encoding="UTF-8"?> sets left-to-right as the base direction.
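
How quickly a single line of TEI/XML with Arabic content breaks into runs of opposing direction can be shown with the bidirectional classes recorded in the Unicode character database. This is a simplified sketch, not an implementation of the Unicode Bidirectional Algorithm: it only groups characters into strongly left-to-right, strongly right-to-left, and neutral runs. The element and the function name are illustrative.

```python
import unicodedata
from itertools import groupby


def strong_direction(ch: str) -> str:
    """Reduce a character's bidirectional class to LTR, RTL or neutral."""
    bidi = unicodedata.bidirectional(ch)
    if bidi in ("R", "AL"):
        return "RTL"
    if bidi == "L":
        return "LTR"
    return "neutral"


# A title element with the Arabic word الزهور as content
snippet = '<title xml:lang="ar">\u0627\u0644\u0632\u0647\u0648\u0631</title>'

runs = [(direction, "".join(chars))
        for direction, chars in groupby(snippet, key=strong_direction)]

for direction, text in runs:
    print(f"{direction:8}{text!r}")
```

The neutral characters (angle brackets, quotation marks, spaces) take their direction from their surroundings, which is why markup around right-to-left content so often appears to jump around in an editor.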

Figure 21: Bi-directional TEI/XML at the beginning of Dammūs (“Ṣiḥāfat Sūriyya wa-Lubnān” 1911). Arrows indicate the reading direction. Numbers indicate reading order.

Figure 22: The TEI/XML from fig. 21 in oXygen’s author mode. Styling relies on CSS.

Transliteration, the undead solution of yore

Transliteration into Latin script served the need of colonial administrations and academics with the technological affordances of the time.

مرآة الشرق

The Arabic original

Meraat al-Sherk

The official transcription provided by the paper’s masthead

Mirʾāt al-Sharq

Following the system of the International Journal of Middle East Studies (IJMES)

Mirʾāt aš-Šarq

Following the system of the Deutsche Morgenländische Gesellschaft (DMG)

mrMp Alcrq

Buckwalter transliteration
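
Buckwalter-style transliteration is a strict one-to-one mapping between Arabic code points and ASCII characters, which makes it trivial to compute and reverse but opaque to human readers. The sketch below covers only the letters of this one title. The values "M" and "c" are assumed to follow the XML-safe variant of the scheme, inferred from the string shown above; the table is not a complete implementation.

```python
# Partial mapping, limited to the letters of مرآة الشرق
SAFE_BUCKWALTER = {
    "\u0645": "m",  # MEEM
    "\u0631": "r",  # REH
    "\u0622": "M",  # ALEF WITH MADDA ABOVE (assumed XML-safe variant)
    "\u0629": "p",  # TEH MARBUTA
    "\u0627": "A",  # ALEF
    "\u0644": "l",  # LAM
    "\u0634": "c",  # SHEEN (assumed XML-safe variant)
    "\u0642": "q",  # QAF
}
REVERSE = {latin: arabic for arabic, latin in SAFE_BUCKWALTER.items()}


def transliterate(text: str, table: dict) -> str:
    """Replace every character found in `table`; leave everything else untouched."""
    return "".join(table.get(ch, ch) for ch in text)


title = "\u0645\u0631\u0622\u0629 \u0627\u0644\u0634\u0631\u0642"  # مرآة الشرق
encoded = transliterate(title, SAFE_BUCKWALTER)
print(encoded)
print(transliterate(encoded, REVERSE) == title)
```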

Arabic textual heritage online

The long tail of ASCII in discovery systems

الجنة?

No Arabic script

al-Ǧanna?

Which Latinized transcription was used?

Ganna!

What are the normalization rules for the search algorithm?

catalogue could be searched in Arabic but the data is missing
catalogues are historical artefacts
- digitisation of catalogues: NOT re-cataloguing of original material
  - card catalogue
  - ASCII OPAC
  - automated transcription of the card catalogue
  - human cataloguers depend on the technology they have at hand, which means they might be unable to enter the correct string
  - errors perpetuate
Latin input is mostly reduced to ASCII
- Hamza and ʿAyn escape this algorithm on ZDB
the definite article is not automatically removed
The choices are not transparently documented
no software on-screen keyboards provided
additional problems
- catalogues are inherently local documents
- aggregated, if at all, on a national level
- frequently accessible only through Web interfaces and not APIs
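
The reduction of Latin input to ASCII described above can be approximated in a few lines. This sketch is an assumption about how such normalisation commonly works (canonical decomposition followed by removal of combining marks); it is not the documented algorithm of ZDB or any other catalogue, and both function names are hypothetical.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Decompose characters (NFD) and drop all combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def strip_article(text: str) -> str:
    """Remove a leading definite article written as 'al-'."""
    return text[3:] if text.lower().startswith("al-") else text


query = "al-\u01E6anna"  # al-Ǧanna
print(strip_diacritics(query))                 # al-Ganna
print(strip_article(strip_diacritics(query)))  # Ganna

title = "Mir\u02BE\u0101t a\u0161-\u0160arq"   # Mirʾāt aš-Šarq
print(strip_diacritics(title))                 # Mirʾat as-Sarq
```

Hamza (ʾ, U+02BE) and ʿAyn (ʿ, U+02BF) are encoded as modifier letters rather than as combining marks, so a procedure of this kind leaves them in place. Whether a search then matches depends on whether the catalogue data and the query were treated in the same way.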

Digitisation bias

Collection biases perpetuated

Figure 27: Periodicals and their holding institutions

Table 1: Periodical holdings and digitization
periodicals	–1918	–1929
published	2054	3550
known holdings	540	775
% of total	26.29	21.83
digitized	156	233
% of total	7.59	6.56
multiple digitisations	51	66
% of total	2.48	1.86
% of digitised	32.69	28.33
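
The percentages in Table 1 follow directly from the absolute counts. The short sketch below recomputes them; the dictionary layout and the helper `pct` are illustrative choices, while the numbers are taken from the table.

```python
# Absolute counts from Table 1, keyed by the end of the period
counts = {
    "-1918": {"published": 2054, "holdings": 540, "digitized": 156, "multiple": 51},
    "-1929": {"published": 3550, "holdings": 775, "digitized": 233, "multiple": 66},
}


def pct(part: int, whole: int) -> float:
    """Share of `part` in `whole` as a percentage with two decimals."""
    return round(100 * part / whole, 2)


for period, c in counts.items():
    print(period)
    print("  known holdings, % of total:   ", pct(c["holdings"], c["published"]))
    print("  digitized, % of total:        ", pct(c["digitized"], c["published"]))
    print("  multiple, % of total:         ", pct(c["multiple"], c["published"]))
    print("  multiple, % of digitised:     ", pct(c["multiple"], c["digitized"]))
```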

Digitisation bias

mind the `<gap/>`!

Table 2: Comparison of digitized periodicals between the Global South and the Global North
	Arabic periodicals (1798–1918)	WWI as mirrored by Hessian regional papers
community	c. 420 mio. Arabic speakers	c. 6.2 mio. inhabitants
periodicals	2054 newspapers and journals	125 newspapers
digitized	156 periodicals	125 newspapers with more than 1.5 million pages
type	mostly facsimiles	facsimiles and full text
access	paywalls, geo-fencing	open access
interface	mostly foreign languages only	local and foreign languages

Figure 28: Map of Arabic dialects. Source: reddit

Figure 29: Map of Hesse in Europe. Source: https://www.iz.sk/sk/projekty/regiony-eu/DE7

mind the `<gap/>`!

Interfaces

Figure 30: Interface of the Translatio project (Bonn). Facsimile of Arabic original on the left. Yellow = English UI; purple = Arabic metadata in DMG transcription; green = German metadata

mind the `<gap/>`!

copyright regimes, paywalls, and geo-fencing

Figure 31: *al-Muqtabas* 6 on HathiTrust (Original in Princeton) outside the USA

cataloguing rules and algorithmic copyright detection cause further inaccessibilities

Figure 32: The page from fig. 31 with a US-IP

Quality of metadata

Bibliographic metadata is faulty throughout, mostly unstructured, and subject to linguistic imperialism

Figure 33: *al-Muqtabas* (“al-laban al-rāʾib” 1911) on Shamela as it appeared in 2019

Figure 34: Facsimile of the same section of *al-Muqtabas* (“al-laban al-rāʾib” 1911) from EAP

mind the `<gap/>`!

Traditional OCR

language [is] not currently OCRable.

Archive.org’s item description for (Kurd ʿAlī Gharāʾib al-Gharb 1923)

Table 3: Evaluation of traditional OCR software for Arabic font types from (Alghamdi and Teahan “Experimental Evaluation of Arabic OCR Systems” 2017, table IV). Values show percentage of correctly recognised characters
Font Type	Sakhr (%)	ABBYY (%)	RDI (%)	Tesseract (%)
Traditional Arabic	48.54	67.66	51.88	47.04
Tahoma	10.52	69.91	26.38	38.37
Simplified Arabic	52.97	67.69	44.94	46.75
M Unicode Sara	36.03	59.40	25.92	33.72
Diwani letter	18.13	18.47	18.13	23.32
DecoType Thuluth	36.12	37.71	24.26	32.48
DecoType Naskh	48.88	50.22	41.63	40.92
Arabic transparent	51.56	75.19	46.00	48.61
Andalus	28.07	37.53	21.68	25.34
AdvertisingBold	57.35	70.26	27.20	39.39

Figure 35: *al-Bashīr* 9 Jan. 1880 (#487), p.1 on GPA, quality of the OCR layer

machine-learning approaches to OCR

For old prints, there’s […] kraken/calamari for coders, Transkribus if you’ve got money and just want to have the results[,] and OCR-D if you’ve got an IT department.

(Winkler and @awinkler@openbiblio.social Mastodon post 2023)

Table 4: Evaluation of our Transkribus models
training set	al-Ustādh	al-Muqtabas
words	192829	11116
lines	18732	1013
epochs	200	200
CER train	2.01	0.07
CER validation	2.09	8.40
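
The character error rate (CER) reported in Table 4 is commonly defined as the edit distance between the recognised text and the ground truth, divided by the length of the ground truth. The sketch below implements that common definition with a plain Levenshtein distance; it is an assumption that Transkribus computes its figures in the same way, and the sample strings are invented for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate in per cent, relative to the reference length."""
    return 100 * levenshtein(reference, hypothesis) / len(reference)


reference = "\u0643\u062A\u0627\u0628"   # كتاب (ground truth)
hypothesis = "\u0643\u0646\u0627\u0628"  # كناب (one misplaced dot turns tāʾ into nūn)
print(cer(reference, hypothesis))
```

A single misread dot already costs 25 per cent on a four-letter word, which is why the gap between the training and validation CER for al-Muqtabas in Table 4 deserves attention: the small training set is fitted almost perfectly but generalises far less well.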

Figure 36: Transkribus web-app showing results of our model for *al-Ḥasnāʾ* 1(1)

Conclusion

build the digital commons we need
with what we have at hand

Do it yourself
- but not alone
keep it simple
- for the sake of the people and spaceship earth
there will be a future

Contemporary research instrumentation in our field, from natural language processing to network analysis, involves complex mechanisms. Their inner workings often lie beyond the full comprehension of the casual user. To use such tools well, we must, in some real sense, understand them better than the tool makers. At the very least, we should know them well enough to comprehend their biases and limitations.

(Tenen “Blunt Instrumentalism: On Tools and Methods” 2016, 85)

this implies learning how to produce, disseminate, and preserve digital scholarship ourselves, without the help we can’t get, even as we fight to build the infrastructures we need at the intersection of, with, and beyond institutional libraries and schools.

(Gil and Ortega “Global Outlooks in Digital Humanities” 2016, 29)

Thank you!

Contributors to OpenArabicPE: Jasper Bernhofer, Dimitar Dragnev, Patrick Funk, Talha Güzel, Hans Magne Jaatun, Daniel Kolland, Jakob Koppermann, Xaver Kretzschmar, Daniel Lloyd, Klara Mayer, Tobias Sick, Manzi Tanna-Händel, and Layla Youssef
Contributors to Project Jarāʾid: Hala Auji, Philippe Chevrant, Marina Demetriadou, Lamia Eid, Stacy Fahrenthold, Ulrike Freitag, Till Grallert, Rana Issa, Nicole Khayat, Peter Magierski, Leyla von Mende, Adam Mestyan, Christian Meier, Daniel Newman, Geoffrey Roper, Sinai Rusinek, Philip Sadgrove, Ola Seif, and Rogier Visser

Links:
- Slides: https://tillgrallert.github.io/slides/dh/2024-03-luxembourg/
- Project blog: https://openarabicpe.github.io
- Papers: http://digitalhumanities.org/dhq/vol/16/2/000593/000593.html, https://doi.org/10/gkhrjr
- Mastodon: @tillgrallert@digitalcourage.social

References

Alghamdi, Mansoor, and William Teahan. 2017. “Experimental Evaluation of Arabic OCR Systems.” PSU Research Review 1 (3): 229–41. https://doi.org/gh4457.

al-Muqtabas. 1911. “Akhbār wa afkār: al-laban al-rāʾib” [News and thoughts: Yogurt] 6 (2), February 1, 1911. https://OpenArabicPE.github.io/journal_al-muqtabas/tei/oclc_4770057679-i_61.TEIP5.xml#div_21.d1e2838.

Dammūs, Ḥalīm Ibrāhīm. 1911. “Ṣiḥāfat Sūriyya wa-Lubnān” [The Press of Syria and Lebanon]. al-Zuhūr 2 (4), June 1, 1911. https://openarabicpe.github.io/journal_al-zuhur/tei/oclc_1034545644-i_15.TEIP5.xml#div_1.d2e634.

Digital Humanities Uni Potsdam, and @dh_potsdam@hcommons.social. 2023. “The Geography of #DH2023 Participants ...” Mastodon post. Mastodon. https://hcommons.social/@dh_potsdam/110696533594288097.

Fiormonte, Domenico. 2021. “Taxation Against Overrepresentation? The Consequences of Monolingualism for Digital Humanities.” In Alternative Historiographies of the Digital Humanities, edited by Dorothy Kim and Adeline Koh, 333–76. Earth: punctum books. https://doi.org/10.53288/0274.1.00.

Gil, Alex, and Élika Ortega. 2016. “Global Outlooks in Digital Humanities: Multilingual Practices and Minimal Computing.” In Doing Digital Humanities: Practice, Training, Research, edited by Constance Crompton, Richard J Lane, and Ray Siemens, 22–34. Abingdon: Routledge.

Isasi, Jennifer, Riva Quiroga, Nabeel Siddiqui, Joana Vieira Paulino, and Alex Wermer-Colan. 2023. “A Model for Multilingual and Multicultural Digital Scholarship Methods Publishing: The Case of Programming Historian.” In Multilingual Digital Humanities, edited by Lorella Viola and Paul Spence, 17–30. London: Routledge. https://doi.org/10.4324/9781003393696-3.

Kurd ʿAlī, Muḥammad. 1923. Gharāʾib al-Gharb [The Oddities of the West]. 2nd ed. Vol. 1. Miṣr: al-Maṭbaʿa al-Raḥmāniyya. http://archive.org/details/1_20191109_20191109_1843.

Milo, Thomas. 2014. “Visually Misleading Characters in the Arabic URL.” Deco Type.

Nemeth, Titus. 2017. Arabic Type-Making in the Machine Age: The Influence of Technology on the Form of Arabic Type, 1908-1993. Leiden: Brill. https://doi.org/10.1163/9789004349308.

Phillipson, Robert. 1997. “Realities and Myths of Linguistic Imperialism.” Journal of Multilingual and Multicultural Development 18 (3): 238–48. https://doi.org/db3cnb.

Pohl, Oliver. (2020) 2022. “Rasmifize.” TypeScript. https://github.com/suchmaske/rasmifize.

Royal Geographical Society. 1916. “Sykes Picot Agreement Map Signed 8 May 1916.” MPK 1/426. PRO. https://commons.wikimedia.org/wiki/File:MPK1-426_Sykes_Picot_Agreement_Map_signed_8_May_1916.jpg.

Tenen, Dennis. 2016. “Blunt Instrumentalism: On Tools and Methods.” In Debates in the Digital Humanities 2016, edited by Matthew K. Gold and Lauren F. Klein, 83–91. Debates in the Digital Humanities 2. Minneapolis: University of Minnesota Press. https://doi.org/10.5749/j.ctt1cn6thb.12.

United Nations. 2007. “United Nations Declaration on the Rights of Indigenous Peoples.” A/RES/61/295. United Nations. https://undocs.org/A/RES/61/295.

Web Hypertext Application Technology Working Group. 2023. “HTML: Living Standard.” March 24, 2023. https://html.spec.whatwg.org/multipage/.

Winkler, Alexander, and @awinkler@openbiblio.social. 2023. Mastodon post. Mastodon. https://openbiblio.social/@awinkler/109981107178749600.

Zakham, Yūsuf. 1907. “Amīrkā wa-ʿulamāʾ al-ʿArab” [America and Arab Scholars]. al-Muqtabas 2 (1), February 14, 1907. https://OpenArabicPE.github.io/journal_al-muqtabas/tei/oclc_4770057679-i_13.TEIP5.xml#div_8.d1e1249.

“«أصبح رمادًا وركاما»... حريق كبير يدمر سوقًا في قلب دمشق” [“It Has Turned to Ashes and Rubble” ... A Large Fire Destroys a Market in the Heart of Damascus]. 2023. Newspaper. الشرق الاوسط. July 16, 2023. https://aawsat.com/node/4436121.
