Till Grallert
Humboldt-Universität zu Berlin
Methods Innovation Lab (NFDI 4Memory)
Exploring Epistemic Virtues and Vices
2024-03-16
https://tillgrallert.github.io/slides/dh/2024-03-luxembourg/
<gap/>
<gap/>
"حبيبتي، انت جميلة، كساعة اضافية من الكهرباء"
["My love, you are beautiful, like an extra hour of electricity"]
— aya mansour (@aya_mansour_11_) July 31, 2015
هذا غزل أحد المتظاهرين في ساحة التحرير اليوم.
رائعة حقيقة! pic.twitter.com/KI8sAkY719
[This is the love verse of one of the protesters in Tahrir Square today. Truly wonderful!]
مريم .. أنتِ جميلة كساعة إضافية من الكهرباء ..
[Maryam .. you are as beautiful as an extra hour of electricity ..]
— Jawdat Alsaleh (@JawdatAlsaleh) June 27, 2017
كتبها عاشق في فلسطين - غزة pic.twitter.com/W3QvpmaE3O
[Written by a lover in Palestine - Gaza]
#سأكتب_على_الجدار
[#I_will_write_on_the_wall]
— A - M .. Syria (@Azrael90) January 17, 2018
أنتِ جميلة كساعة إضافية من الكهرباء pic.twitter.com/jKpLnnlorR
[You are as beautiful as an extra hour of electricity]
Indigenous peoples have the right to revitalize, use, develop and transmit to future generations their histories, languages, oral traditions, philosophies, writing systems and literatures, and to designate and retain their own names for communities, places and persons.
(United Nations “UNDRIP” 2007, sec. 13)
‘Linguistic imperialism’ is shorthand for a multitude of activities, ideologies, and structural relationships. Linguistic imperialism takes place within an overarching structure of asymmetrical North/South relations, where language interlocks with other dimensions, cultural (particularly in education, science, and the media), economic and political
(Phillipson “Realities and Myths of Linguistic Imperialism” 1997, 239)
The basis for the codes, languages, methodologies, and technical instruments of the digital humanities is English; the written and spoken language of all the main conferences, the most prestigious journals, the institutions that control the discipline, the organizations and international consortia, and the central authorities of knowledge is, with few exceptions, some dialect of British or American English.
(Fiormonte “Taxation Against Overrepresentation?” 2021, 334–35)
… but many contemporary and historical human writing systems are not supported even in the latest iteration of Unicode.
… but standards depend on implementation and software support
This ought to be the perfect example:
As Ramsey Nasser notes in the overview of his programming language ب ل ق, [pre-existing digital technologies] are almost exclusively based on the ASCII character set
(Isasi et al. “A Model for Multilingual and Multicultural Digital Scholarship Methods Publishing: The Case of Programming Historian” 2023, 19)
ب ل ق should have been قلب
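The misprint can be made explicit at the level of code points. The following is a minimal Python sketch, assuming (as one plausible reading, not something the cited article states) that the letters ended up isolated and stored in visual left-to-right order instead of logical reading order. The variable and function names are illustrative only.

```python
import unicodedata

# The string as printed: three isolated letters separated by spaces.
printed = "\u0628 \u0644 \u0642"   # BEH, LAM, QAF
# The intended name of the language, in logical (reading) order.
intended = "\u0642\u0644\u0628"    # QAF, LAM, BEH

def letter_names(text):
    """List the Unicode names of all non-space characters in storage order."""
    return [unicodedata.name(ch) for ch in text if not ch.isspace()]

# Dropping the spaces and reversing the stored order restores the word.
repaired = "".join(reversed(printed.split()))

print(letter_names(printed))
print(letter_names(intended))
print(repaired == intended)
```

Both strings contain the same three letters. Only their stored order and the inserted spaces differ, which is enough to break the cursive joining and turn the word into a string of disconnected glyphs.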
… did I mention industry consortia?
HTML elements all have names that only use ASCII alphanumerics (Web Hypertext Application Technology Working Group “HTML: Living Standard” 2023, sec. 13.1.2)
The customary XML declaration (strictly required only from XML 1.1 onwards)
<?xml version="1.0" encoding="UTF-8"?>
sets left-to-right as the base direction.
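Why a Latin-script first line tips the balance can be illustrated with the "first strong character" heuristic that the Unicode Bidirectional Algorithm uses to pick a paragraph's base direction when none is declared. The sketch below is a simplified Python version of that heuristic (it ignores isolates and embeddings). Whether a particular editor or browser derives its base direction in exactly this way is an assumption, and the Arabic element name is hypothetical.

```python
import unicodedata

def first_strong_direction(text: str) -> str:
    """Return the direction implied by the first strongly directional character."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":            # strong left-to-right (e.g. Latin letters)
            return "ltr"
        if bidi_class in ("R", "AL"):    # strong right-to-left (Hebrew, Arabic)
            return "rtl"
    return "neutral"                     # only digits, punctuation, whitespace

declaration = '<?xml version="1.0" encoding="UTF-8"?>'
# Hypothetical element with an Arabic name and Arabic content.
arabic_element = (
    "<\u0645\u0642\u0627\u0644\u0629>\u0642\u0644\u0628"
    "</\u0645\u0642\u0627\u0644\u0629>"
)

print(first_strong_direction(declaration))                   # ltr
print(first_strong_direction(arabic_element))                # rtl
print(first_strong_direction(declaration + arabic_element))  # ltr
```

The angle bracket and question mark are directionally neutral, so the `x` of `xml` is the first strong character and the text is treated as left-to-right, however much Arabic follows.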
Transliteration into Latin script served the needs of colonial administrations and academics within the technological affordances of their time.
The Arabic original
The official transcription provided by the paper’s masthead
Following the system of the International Journal of Middle East Studies (IJMES)
Following the system of the Deutsche Morgenländische Gesellschaft (DMG)
Buckwalter transliteration
periodicals | –1918 | –1929 |
---|---|---|
published | 2054 | 3550 |
known holdings | 540 | 775 |
% of total | 26.29 | 21.83 |
digitized | 156 | 233 |
% of total | 7.59 | 6.56 |
multiple digitizations | 51 | 66 |
% of total | 2.48 | 1.86 |
% of digitized | 32.69 | 28.33 |
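The percentage rows follow directly from the counts. A short Python sketch that recomputes them, using only the figures from the table; the dictionary keys (cut-off years) and the helper name are illustrative:

```python
# Counts from the table, keyed by cut-off year.
published = {1918: 2054, 1929: 3550}
known_holdings = {1918: 540, 1929: 775}
digitized = {1918: 156, 1929: 233}
multiple = {1918: 51, 1929: 66}

def share(part: int, whole: int) -> float:
    """Percentage of `part` in `whole`, rounded to two decimals."""
    return round(100 * part / whole, 2)

for year in (1918, 1929):
    print(
        year,
        share(known_holdings[year], published[year]),  # holdings, % of total
        share(digitized[year], published[year]),       # digitized, % of total
        share(multiple[year], published[year]),        # multiple, % of total
        share(multiple[year], digitized[year]),        # multiple, % of digitized
    )
```

Read this way, the table says that only about one in four known titles survives in any collection, and fewer than one in thirteen has been digitized at all.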
<gap/>
 | Arabic periodicals (1798–1918) | WWI as mirrored by Hessian regional papers |
---|---|---|
community | c. 420 million Arabic speakers | c. 6.2 million inhabitants |
periodicals | 2054 newspapers and journals | 125 newspapers |
digitized | 156 periodicals | 125 newspapers with more than 1.5 million pages |
type | mostly facsimiles | facsimiles and full text |
access | paywalls, geo-fencing | open access |
interface | mostly foreign languages only | local and foreign languages |
<gap/>
<gap/>
Cataloguing rules and algorithmic copyright detection cause further inaccessibilities
Bibliographic metadata is faulty throughout, mostly unstructured, and subject to linguistic imperialism
<gap/>
language [is] not currently OCRable.
Archive.org’s item description for (Kurd ʿAlī Gharāʾib al-Gharb 1923)
Font Type | Sakhr (%) | ABBYY (%) | RDI (%) | Tesseract (%) |
---|---|---|---|---|
Traditional Arabic | 48.54 | 67.66 | 51.88 | 47.04 |
Tahoma | 10.52 | 69.91 | 26.38 | 38.37 |
Simplified Arabic | 52.97 | 67.69 | 44.94 | 46.75 |
M Unicode Sara | 36.03 | 59.40 | 25.92 | 33.72 |
Diwani letter | 18.13 | 18.47 | 18.13 | 23.32 |
DecoType Thuluth | 36.12 | 37.71 | 24.26 | 32.48 |
DecoType Naskh | 48.88 | 50.22 | 41.63 | 40.92 |
Arabic transparent | 51.56 | 75.19 | 46.00 | 48.61 |
Andalus | 28.07 | 37.53 | 21.68 | 25.34 |
AdvertisingBold | 57.35 | 70.26 | 27.20 | 39.39 |
For old prints, there’s […] kraken/calamari for coders, Transkribus if you’ve got money and just want to have the results[,] and OCR-D if you’ve got an IT department.
(Winkler, @awinkler@openbiblio.social, “Mastodon post” 2023)
training set | al-Ustādh | al-Muqtabas |
---|---|---|
words | 192829 | 11116 |
lines | 18732 | 1013 |
epochs | 200 | 200 |
CER train | 2.01 | 0.07 |
CER validation | 2.09 | 8.40 |
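The character error rate (CER) in the last two rows is commonly defined as the edit distance between the recognized text and the ground truth, divided by the length of the ground truth. The slides do not say how the figures were computed or in which unit, so the Python sketch below is an assumption about the metric, expressed in percent. The function names are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate in percent, relative to the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return 100 * levenshtein(reference, hypothesis) / len(reference)

# Ground truth qaf-lam-ba versus an output that misreads the final letter as ta.
print(round(cer("\u0642\u0644\u0628", "\u0642\u0644\u062A"), 2))
```

On this definition, the gap between 0.07 on the training data and 8.40 on the validation data for al-Muqtabas suggests that the model trained on the much smaller set has largely memorized its training lines and generalizes poorly.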
Contemporary research instrumentation in our field, from natural language processing to network analysis, involves complex mechanisms. Their inner workings often lie beyond the full comprehension of the casual user. To use such tools well, we must, in some real sense, understand them better than the tool makers. At the very least, we should know them well enough to comprehend their biases and limitations.
(Tenen “Blunt Instrumentalism: On Tools and Methods” 2016, 85)
this implies learning how to produce, disseminate, and preserve digital scholarship ourselves, without the help we can’t get, even as we fight to build the infrastructures we need at the intersection of, with, and beyond institutional libraries and schools.
(Gil and Ortega “Global Outlooks in Digital Humanities” 2016, 29)