Looking at the iceberg from below the waterline

Stylometric authorship attribution for anonymous articles in Arabic periodicals from the early twentieth century

Till Grallert

Scholarly Makerspace

Universitätsbibliothek

Humboldt-Universität zu Berlin

outline

  1. Background
    • Arab periodical studies
    • Research question
  2. Method: stylometric authorship attribution
  3. Corpus, data sets
  4. Results

Background

Arabic periodicals

  • Periodical press as agent of change
    • first mass medium
    • central medium of the literary and cultural Arabic renaissance (nahḍa)
    • medium of linguistic change
    • central forum for negotiations over modernity, nationalism, Islamism etc.
  • Periodicals as source but not a subject
  • Research is dominated by
    • national(ist) narratives
    • bias towards two places and a small number of titles
    • implicit hypotheses
Figure 1: Distribution of new Arabic periodical titles, 1799–1929

Arabic periodicals

Figure 2: First page of the journal al-Muqtabas 1(1), 1906, Cairo
Figure 3: Front page of the newspaper al-Iqbāl #1, 9 April 1902, Beirut
Figure 4: Front page of the newspaper Kawkab Amīrkā #1, 15 April 1892, New York

Research interest: intellectual networks

Figure 5: Undirected network of authors in al-Ḥaqāʾiq, al-Ḥasnāʾ, Lughat al-ʿArab, and al-Muqtabas. Colour of nodes: betweenness centrality; size of nodes: number of periodicals; width of edges: number of articles.

Aims

  • empirical testing of hypotheses
  • evaluate existing literature

Observations

  • very limited overlap between periodicals from the same place
  • core network (14 of 319 nodes):
    • absent from the literature
    • surprising composition: many Iraqis (6), few Syrians (2), few Christians (2)

Problem: missing bylines

  • About 4/5 of all articles or 2/3 of all words carry no byline
  • Commonly ignored in scholarship
  • Implicit hypothesis is implausible and untested
  • Stylometric authorship attribution is untested for this material

Method

Stylometric authorship attribution

The authorship signal is prevalent in the most frequent words, i.e. function words

comparative method

  • steps:
    1. compute frequencies for every text
    2. compare every text with every text
    3. validate through voting (consensus) of multiple iterations
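The three steps above can be sketched in a few lines of Python. This is an illustration only and not the stylo implementation: the toy corpus, the function names and the use of plain Manhattan distance on relative frequencies are all assumptions made for the example. Each "iteration" here is one setting for the number of most frequent words, and every text votes for its nearest neighbour.

```python
from collections import Counter

def most_frequent_features(corpus, n):
    """Return the n most frequent words across the whole corpus."""
    pooled = Counter()
    for tokens in corpus.values():
        pooled.update(tokens)
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(pooled.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:n]]

def relative_frequencies(tokens, features):
    """Step 1: relative frequency of each feature in one text."""
    counts = Counter(tokens)
    return [counts[f] / len(tokens) for f in features]

def manhattan(a, b):
    """Sum of absolute differences between two frequency profiles."""
    return sum(abs(x - y) for x, y in zip(a, b))

def nearest_neighbours(corpus, n_features):
    """Step 2: compare every text with every other text, keep the closest."""
    features = most_frequent_features(corpus, n_features)
    profiles = {name: relative_frequencies(tokens, features)
                for name, tokens in corpus.items()}
    nearest = {}
    for name, profile in profiles.items():
        others = [o for o in profiles if o != name]
        nearest[name] = min(others, key=lambda o: (manhattan(profile, profiles[o]), o))
    return nearest

def consensus_votes(corpus, feature_counts):
    """Step 3: count how often each pair is linked across all iterations."""
    votes = Counter()
    for n in feature_counts:
        for text, neighbour in nearest_neighbours(corpus, n).items():
            votes[tuple(sorted((text, neighbour)))] += 1
    return votes

# Hypothetical toy corpus: two "authors" with different function-word habits.
corpus = {
    "A1": ["wa"] * 6 + ["fi"] * 2 + ["min"] * 2,
    "A2": ["wa"] * 5 + ["fi"] * 3 + ["min"] * 2,
    "B1": ["wa"] * 2 + ["fi"] * 6 + ["min"] * 2,
    "B2": ["wa"] * 2 + ["fi"] * 5 + ["min"] * 3,
}
votes = consensus_votes(corpus, feature_counts=[2, 3])
for pair, count in sorted(votes.items()):
    print(pair, count)
```

Pairs that are linked in every iteration collect the most votes; in a consensus network these would become the heaviest edges.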

challenges

  • novel application to Arabic and this genre
  • comparison depends on input
  • reliability depends on a minimal length of texts

Stylometry

  • In R with the stylo package (Eder, Rybicki, and Kestemont ‘Stylometry with R’ 2016)
  • Based on parameter settings established in our tests (Romanov and Grallert ‘Parameters for Stylometric Authorship Attribution’ 2022)

stylo() settings

  • Tokens: words
  • Sampling: 2500 tokens
  • Most Frequent Features: 200–500 tokens, incremented by 100
  • Culling: 0
  • distance measure: Eder’s simple delta
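To make these settings concrete, the following Python sketch shows what fixed-length sampling and the distance measure amount to. It is not the stylo code. Two points are assumptions: that a remainder shorter than a full sample is dropped, and that Eder's simple delta is the Manhattan distance between square-root-transformed relative frequencies, which is how the measure is commonly described. The function names are hypothetical.

```python
import math

def split_into_samples(tokens, sample_size=2500):
    """Cut a token list into consecutive samples of equal length.

    Assumption: an incomplete final sample is discarded.
    """
    n_full = len(tokens) // sample_size
    return [tokens[i * sample_size:(i + 1) * sample_size] for i in range(n_full)]

def eders_simple_delta(freqs_a, freqs_b):
    """Manhattan distance between square roots of relative frequencies."""
    return sum(abs(math.sqrt(a) - math.sqrt(b)) for a, b in zip(freqs_a, freqs_b))

# Most frequent features: 200 to 500 in increments of 100.
mff_settings = list(range(200, 501, 100))

# A hypothetical text of 6000 tokens yields two full samples.
samples = split_into_samples(list(range(6000)))
print(len(samples), [len(s) for s in samples])
print(mff_settings)
```

The square root dampens the influence of the very highest frequencies, so a handful of extremely common words does not dominate the comparison.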

Analysis

  • edges (and nodes) tables from stylo()
  • computing network measures with the tidygraph and igraph packages
    • centrality
    • community detection
  • plotting results with the ggraph and ggplot2 packages
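As a rough illustration of working from an edges table, the Python sketch below computes a simple centrality score and groups nodes. It deliberately swaps in simpler techniques than those used in the analysis: weighted degree instead of betweenness centrality, and connected components as a crude stand-in for community detection. The edge list and node names are invented for the example.

```python
from collections import defaultdict

# Hypothetical edges table: (source, target, weight).
edges = [
    ("text_01", "text_02", 3),
    ("text_02", "text_03", 1),
    ("text_04", "text_05", 2),
]

def weighted_degree(edges):
    """Sum of edge weights attached to each node."""
    strength = defaultdict(int)
    for source, target, weight in edges:
        strength[source] += weight
        strength[target] += weight
    return dict(strength)

def connected_components(edges):
    """Group nodes that are reachable from one another."""
    adjacency = defaultdict(set)
    for source, target, _ in edges:
        adjacency[source].add(target)
        adjacency[target].add(source)
    seen, components = set(), []
    for start in sorted(adjacency):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        components.append(sorted(component))
    return components

print(weighted_degree(edges))
print(connected_components(edges))
```

On a dense consensus network, connected components will usually be too coarse, which is why dedicated community detection algorithms are used in practice.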

Parameter testing

(Romanov and Grallert ‘Parameters for Stylometric Authorship Attribution’ 2022)

  • corpus
    • 300 books from 28 authors
    • 19th and early 20th century
  • parameters
    • MFF: 100–500 tokens and character n-grams in increments of 100
    • culling: 0–50% in increments of 10
    • distance measure: all 14
    • sample length: 100 to 12000 tokens in increments of 100
  • testing
    • all possible combinations
    • Ward’s clustering (ward.D2 in hclust) for authors and works
  • infrastructure
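Testing "all possible combinations" means walking a full parameter grid. The Python sketch below enumerates such a grid from the ranges listed above. It assumes that tokens and character n-grams are two feature types crossed with every other parameter, and it uses placeholder names for the 14 distance measures; the resulting count follows from these assumptions and is not a figure reported in the source.

```python
from itertools import product

mff_counts = range(100, 501, 100)       # 100, 200, ..., 500
feature_types = ["tokens", "character n-grams"]
culling_levels = range(0, 51, 10)       # 0%, 10%, ..., 50%
distance_measures = [f"distance_{i:02d}" for i in range(1, 15)]  # placeholders
sample_lengths = range(100, 12001, 100) # 100, 200, ..., 12000

def parameter_grid():
    """Yield every combination of the five parameters lazily."""
    return product(mff_counts, feature_types, culling_levels,
                   distance_measures, sample_lengths)

n_combinations = (len(mff_counts) * len(feature_types) * len(culling_levels)
                  * len(distance_measures) * len(sample_lengths))
first_combination = next(parameter_grid())

print(n_combinations)
print(first_combination)
```

Each combination would then be passed to one clustering run, which is why the size of the grid matters for the computing infrastructure.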

Parameter testing

Figure 6: Plot of results from the parameter testing. Source: Romanov and Grallert (‘Parameters for Stylometric Authorship Attribution’ 2022)

Corpus and data sets

Corpus

Table 1: Our corpus from “Open Arabic Periodical Editions”

| Periodical | Place | Dates¹ | Vol.s | No.s | Words | Articles | with author (%) | 2500+ words | words/article | Authors | DOI |
|---|---|---|---|---|---|---|---|---|---|---|---|
| al-Ḥaqāʾiq | Damascus | 1910–13 | 3 | 35 | 298090 | 389 | 41.90 | 22 | 832.66 | 104 | 10.5281/zenodo.1232016 |
| al-Muqtabas | Cairo, Damascus | 1906–18 | 9 | 96 | 1981081 | 2964 | 12.72 | 241 | 873.34 | 140 | 10.5281/zenodo.597319 |
| al-Zuhūr | Cairo | 1910–13 | 4 | 39 | 292333 | 436 | 41.51 | 6 | 695.09 | 112 | 10.5281/zenodo.3580606 |
| Lughat al-ʿArab | Baghdad | 1911–14 | 3 | 34 | 373832 | 939 | 16.19 | 21 | 485.21 | 53 | 10.5281/zenodo.3514384 |
| total | | | 19 | 204 | 2945336 | 4728 | | 290 | 622.96 | | |
| al-Ustādh | Cairo | 1892–93 | 1 | 42 | 221447 | 435 | 5.52 | 13 | 582.21 | 8 | 10.5281/zenodo.3581028 |

Data sets

Plain text files of >2500 words

  • data set 1: 303 individual articles
    • 113 texts by 76 unique authors
    • 190 anonymous texts
  • data set 2: 88 sections of anonymous articles from 2 journals: al-Muqtabas and Lughat al-ʿArab
  • data set 3: 6 books by Muḥammad Kurd ʿAlī
  • data set 4: 246 full issues from 5 journals
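Selecting texts for data set 1 comes down to a length filter followed by a split on whether a byline is present. The Python sketch below shows that logic on invented records. The record structure, the whitespace-based word count and the function names are assumptions, and the threshold is read as "more than 2500 words".

```python
def word_count(text):
    """Count whitespace-separated words."""
    return len(text.split())

def build_data_set(articles, min_words=2500):
    """Keep articles longer than min_words and split them by byline."""
    long_enough = [a for a in articles if word_count(a["text"]) > min_words]
    attributed = [a for a in long_enough if a["author"]]
    anonymous = [a for a in long_enough if not a["author"]]
    return attributed, anonymous

# Hypothetical records; "kalima" is repeated only to reach a given length.
articles = [
    {"id": "article_1", "author": "Author A", "text": "kalima " * 3000},
    {"id": "article_2", "author": None, "text": "kalima " * 2600},
    {"id": "article_3", "author": None, "text": "kalima " * 400},
]
attributed, anonymous = build_data_set(articles)
print([a["id"] for a in attributed], [a["id"] for a in anonymous])
```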

Results
data set 1

Introducing the spaghetti monster!

Figure 7: Bootstrap consensus network of data set 1, coloured by author
Figure 8: fig. 7, coloured by community

Zooming in:
Individual authors

Kāẓim al-Dujaylī

Anonymous travelogue in Lughat al-ʿArab most likely written by the magazine’s editor Kāẓim al-Dujaylī

Figure 9: Detail from fig. 7
Figure 10: Radar plot for the unattributed article (20) in fig. 9, “Riḥla ilá Shufāthā”, Lughat al-ʿArab 3(1), Aug. 1913

Ibn al-Muqaffaʿ?

A cluster of texts potentially written by Ibn al-Muqaffaʿ (d. 759) and edited by Ṭāhir al-Jazāʾirī

Figure 11: Detail from fig. 7
Figure 12: Radar plot for the unattributed article (34) in fig. 11, “al-Adab al-ṣaghīr”, al-Muqtabas 3(1), Sep. 1911

William Shakespeare

Unmarked translations of Shakespeare’s “Julius Caesar” in al-Zuhūr

Figure 13: Detail from fig. 7
Figure 14: Radar plot for the attributed article in fig. 13, Shakespeare, “Yūliyūs Qayṣar”, al-Zuhūr 3(4), Oct. 1912

Shukrī al-ʿAsalī: resolving acronyms

Texts by Shukrī al-ʿAsalī, later MP for Damascus and co-editor of one of Muḥammad Kurd ʿAlī’s newspapers

Figure 15: Detail from fig. 7
Figure 16: Radar plot for the attributed article (164) in fig. 15, al-ʿAsalī, “al-Jabāya fī al-Islām”, al-Muqtabas 4(2, 3), Feb., Mar. 1909

Charles Seignobos: threshold for distance measures?

Historical texts by Charles Seignobos translated by Muḥammad Kurd ʿAlī.

When does the distance measure become unreliable?

Figure 17: Detail from fig. 7
Figure 18: Radar plot for the attributed article (201) in fig. 17, Seignobos, “al-Yūnān”, al-Muqtabas 2(5), June 1907

Results
data set 2: owners-cum-editors as authors?

owners-cum-editors as authors?

al-Muqtabas

Anonymous sections and editors, coloured by author (blue = Muḥammad Kurd ʿAlī, red = Kāẓim al-Dujaylī, green = Anastās al-Karmalī)

Muḥammad Kurd ʿAlī (blue) most likely not the author

owners-cum-editors as authors?

al-Muqtabas

Anonymous sections and editors, coloured by community

Multiple anonymous candidates?

owners-cum-editors as authors?

Lughat al-ʿArab

Anonymous sections and editors, coloured by author (blue = Muḥammad Kurd ʿAlī, red = Kāẓim al-Dujaylī, green = Anastās al-Karmalī)

Authorship of Anastās Mārī al-Karmalī and Kāẓim al-Dujaylī more likely

owners-cum-editors as authors?

Lughat al-ʿArab

Anonymous sections and editors, coloured by community

Authorship of Anastās Mārī al-Karmalī and Kāẓim al-Dujaylī more likely

Results
data sets 3 and 4: do periodicals speak with a single voice?

stylistic differences between journals

Auctorial voices?

Issues of 5 periodicals from Cairo, Damascus, and Baghdad
  • periodicals show distinct stylistic features
  • some similarity between al-Muqtabas and al-Zuhūr

stylistic differences between journals

al-Ḥaqāʾiq, Lughat al-ʿArab, and al-Muqtabas

Figure 19: PCA covariance matrix for the 100 MFWs in a corpus of al-Ḥaqāʾiq, Lughat al-ʿArab, and al-Muqtabas
  • Lughat al-ʿArab and al-Muqtabas are indistinguishable
  • al-Ḥaqāʾiq is different
  • some issues of al-Muqtabas are very different
Figure 20: PCA covariance matrix for the 900 MFWs in a corpus of al-Ḥaqāʾiq, Lughat al-ʿArab, and al-Muqtabas

stylistic differences between journals

Lughat al-ʿArab, al-Muqtabas, and al-Zuhūr

Figure 21: PCA covariance matrix for the 100 MFWs in a corpus of Lughat al-ʿArab, al-Muqtabas, and al-Zuhūr
  • Strong stylistic similarities between all three periodicals
  • some issues of al-Muqtabas are very different
Figure 22: PCA covariance matrix for the 900 MFWs in a corpus of Lughat al-ʿArab, al-Muqtabas, and al-Zuhūr

stylistic differences between journals

Importance of genre

The same 5 periodicals + 6 works by one of the editors
  • very limited similarity between al-Muqtabas and its editor Muḥammad Kurd ʿAlī

Thank you!


  • Maxim Romanov for his work on parameter testing
  • Contributors to OpenArabicPE: Jasper Bernhofer, Dimitar Dragnev, Patrick Funk, Talha Güzel, Hans Magne Jaatun, Daniel Kolland, Jakob Koppermann, Xaver Kretzschmar, Daniel Lloyd, Klara Mayer, Tobias Sick, Manzi Tanna-Händel, and Layla Youssef
  • Links:
  • Licence: slides and images are licensed under CC BY-SA 4.0

References

Eder, Maciej, Jan Rybicki, and Mike Kestemont. 2016. ‘Stylometry with R: A Package for Computational Text Analysis’. The R Journal 8 (1): 107–21. https://doi.org/10.32614/RJ-2016-007.
Grallert, Till. 2021. ‘Catch Me If You Can! Approaching the Arabic Press of the Late Ottoman Eastern Mediterranean Through Digital History’. Edited by Simone Lässig. Geschichte Und Gesellschaft 47 (1, Digital History): 58–89. https://doi.org/gkhrjr.
———. 2022. ‘Open Arabic Periodical Editions: A Framework for Bootstrapped Scholarly Editions Outside the Global North’. Edited by Roopika Risam and Alex Gil. Digital Humanities Quarterly 16 (2, "Minimal Computing"). http://digitalhumanities.org/dhq/vol/16/2/000593/000593.html.
Romanov, Maxim. 2021. ‘A Corpus of Arabic Literature (19-20th centuries) for Stylometric Tests’. Zenodo. https://doi.org/10.5281/zenodo.5772261.
Romanov, Maxim, and Till Grallert. 2022. ‘Establishing Parameters for Stylometric Authorship Attribution of 19th-Century Arabic Books and Periodicals’. In Digital Humanities 2022: Conference Abstracts, 346–48. Tokyo: The University of Tokyo. https://dh2022.dhii.asia/dh2022bookofabsts.pdf.

  1. The current cut-off date is 1918.