Reviewing Sorting Phase Data: Talk Board Tags

In addition to sorting the Geniza, volunteers tagged subjects and talked about their work. What do these tags tell us?

By Emily Esten, Judaica Digital Humanities Coordinator at the University of Pennsylvania

To celebrate our volunteers’ hard work and review the data produced in the sorting phase, we’re sharing a series of blog posts that answer some questions about the project. Part 1 reviews whether a subject was written in Hebrew or Arabic script. Part 2 reviews whether a subject was written in formal or informal script. Part 3 reviews the presence of various visual characteristics. This final part reviews classification tags from the talk boards.

How many classification tags were used on the talk boards?

At the outset of the sorting phase, we defined a few tags for volunteers to use. Over the course of the project, we encouraged volunteers to come up with new tags to best classify the subjects. Volunteers used tags 13,938 times on the Talk boards, commenting on 6,646 subjects (16.5% of the subjects in the project), in addition to general conversation on the project’s discussion forums. In all, 710 unique tags were used on the Talk boards.

We used OpenRefine to find clusters of tags, or tags that looked similar but were spelt differently. For example, #song_of_songs and #songofsongs are different tags, but they refer to one of the scrolls in the Hebrew Bible. Then, we merged plural tags — so #drawing and #drawings would become #drawings — and misspellings — so #aphabet would become #alphabet. Finally, we manually clustered and merged tags we decided referred to similar things: #arabic and #arabicscript are both #arabicscript in this dataset, #hebrew and #hebrewscript are both #hebrewscript. After cleaning the data, we had 632 distinct tags used on the Talk boards.
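The merging steps described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the OpenRefine workflow we actually ran: the `tag_key` and `cluster_tags` functions and the `MANUAL_MERGES` table are hypothetical names, and the table holds only the merges named in this post.

```python
from collections import defaultdict

# Hypothetical merge table (assumption): only the pairs mentioned in the post.
MANUAL_MERGES = {
    "aphabet": "alphabet",      # misspelling
    "drawing": "drawings",      # singular merged into plural
    "arabic": "arabicscript",   # manual merge
    "hebrew": "hebrewscript",   # manual merge
}

def tag_key(tag: str) -> str:
    """Reduce a tag to a comparison key: lowercase, letters and digits only."""
    key = "".join(ch for ch in tag.lower() if ch.isalnum())
    return MANUAL_MERGES.get(key, key)

def cluster_tags(tags):
    """Group raw tags that share the same comparison key."""
    clusters = defaultdict(set)
    for tag in tags:
        clusters[tag_key(tag)].add(tag)
    return dict(clusters)

raw_tags = [
    "#song_of_songs", "#songofsongs",
    "#drawing", "#drawings",
    "#aphabet", "#alphabet",
    "#arabic", "#arabicscript",
]
clusters = cluster_tags(raw_tags)
for key, members in sorted(clusters.items()):
    print(key, sorted(members))
```

Stripping separators catches variants like #song_of_songs and #songofsongs automatically, while plurals, misspellings, and judgment calls still need an explicit table, which mirrors the manual step described above.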

The 10 most common tags used on the talk boards were:

  • name_of_god (used 792 times on 745 subjects)
  • nikkud (used 675 times on 663 subjects), diacritical signs for vowels or pronunciation
  • arabic_script (used 591 times on 516 subjects)
  • faded (used 590 times on 585 subjects)
  • microfragment (used 538 times on 516 subjects)
  • damaged (used 485 times on 480 subjects)
  • asktheexperts (used 454 times on 409 subjects)
  • hebrew_script (used 430 times on 409 subjects)
  • large_letters (used 427 times on 409 subjects)
  • colons (used 394 times on 392 subjects)
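Each entry above gives two numbers because a tag can be applied to the same subject more than once: "times" counts every use, while "subjects" counts distinct subjects. A minimal sketch of that distinction, using made-up rows rather than project data (the `rows` list and its subject IDs are assumptions for illustration):

```python
from collections import Counter, defaultdict

# Hypothetical (subject_id, tag) rows, one per tag use; not project data.
rows = [
    (101, "faded"), (101, "faded"), (101, "nikkud"),
    (102, "faded"), (103, "nikkud"), (103, "colons"),
]

# Total uses per tag: every row counts.
uses = Counter(tag for _, tag in rows)

# Distinct subjects per tag: repeated uses on one subject count once.
subjects = defaultdict(set)
for subject_id, tag in rows:
    subjects[tag].add(subject_id)

for tag, n_uses in uses.most_common():
    print(f"{tag} (used {n_uses} times on {len(subjects[tag])} subjects)")
```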

Below are some of the fragments tagged with these popular tags.

From left to right: Subject 12506514: ENA 2841, Library of the Jewish Theological Seminary; Subject 11507384: MS L522, Library of the Jewish Theological Seminary; Subject 12499257: ENA 624, Library of the Jewish Theological Seminary
From left to right: Subject 21708735: MS-MOSSERI-V-00080, Genizah Research Unit, Cambridge University Library; Subject 11528868: ENA NS 77 0698.2, Library of the Jewish Theological Seminary; Subject 12501469: ENA 2082, Library of the Jewish Theological Seminary
Subject 11609423: ENA 2331, Library of the Jewish Theological Seminary; Subject 11583643: Halper 116, University of Pennsylvania, Herbert D. Katz Center for Advanced Judaic Studies Library, Cairo Genizah Collection; Subject 21952503: MS-TS-00012–00453, Genizah Research Unit, Cambridge University Library

We categorized most tags into 6 major categories, as discussed below:

Project: These tags refer to comments about the project (interface, notgenizah, mismatched) or subjective tags (weird, unusual, beautiful). There were 62 tags in this category, used 863 times, and on 731 subjects.

Language/Script: These tags refer to the scripts (arabic_script, hebrew_script, latin) or languages (coptic, english, italian, ladino) on the fragment. There were 25 tags in this category, used 1,562 times, and on 951 subjects.

Condition: These tags refer to the condition of the subject, such as microfragment, faded, damaged, or reuse. There were 37 tags in this category, used 1,687 times, and on 1,264 subjects.

Feature: These tags refer to specific markings (aleph, charakteres, strikethrough), visual characteristics (colons) or distinctive features (binding, diagonal_text, marginalia) of a fragment. There were 167 tags in this category, used 5,407 times, and on 3,357 subjects.

Type: If it’s not a religious text, these tags help identify what type of fragment it might be. These might be themes/terms referenced in the text (agriculture, magical, literary) or types of fragments (titlepage, reed-trial, legal_document). There were 71 tags in this category, used 1,599 times, and on 1,351 subjects.

Judaica: These tags make up the bulk of the Cairo Geniza: they vary from historical persons to Biblical references, and from specific types of literary or religious texts to holidays. There were 273 tags in this category, used 2,718 times, and on 1,799 subjects.

We do want to note that a tag on a subject does not mean the tagged feature is actually present. Tags are volunteer-generated, and were used to ask questions about a subject as often as to describe it. We have indicated some cases in which a tag was used but the feature is not present; a thorough analysis of talk board tags would require reviewing every use of a tag to confirm its accuracy.


Which subjects received the most tags?

From left to right: Subject 21715403: MS-TS-NS-00322–00019, Genizah Research Unit, Cambridge University Library; Subject 12499257: ENA 624.23, Library of the Jewish Theological Seminary; Subject 11634222: L 594.38, Library of the Jewish Theological Seminary
From left to right: Subject 12499254: ENA 624, Library of the Jewish Theological Seminary; Subject 21707958: MS-OR-01080-J-00289, Genizah Research Unit, Cambridge University Library.

Subject 21715403 was tagged 19 times with 18 different tags, including charakteres, diagrams, magical and mixed_languages.

Subject 12499257, an 18th-century ledger, was tagged 17 times with 12 different tags, including seal, ledger, judeo-arabic, and arabic.

Subject 11634222, a bible scroll, was tagged 16 times with 14 different tags, including bible, scroll, marginalia, secondary_use, and book_of_Isaiah.

Subject 12499254, an 18th-century ledger, was tagged 16 times with 10 different tags, including seal, asktheexperts, and numerals.

Subject 21707958, a 12th-century ketubah, was tagged 16 times with 13 different tags, including ketubah, decorations, drawing, and damaged.

395 subjects were tagged 6 or more times, suggesting strong engagement with or interest in their contents.
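A count like this comes from tallying tag uses per subject and keeping the subjects at or above a threshold. A small sketch under assumed data (the subject IDs in `tag_uses` are invented for illustration and are not from the project):

```python
from collections import Counter

# Hypothetical tag-use log: one subject ID per tag use; not project data.
tag_uses = [201] * 7 + [202] * 6 + [203] * 5 + [204] * 2

THRESHOLD = 6  # "6 or more times", as in the post

# Tally uses per subject, then keep subjects that meet the threshold.
per_subject = Counter(tag_uses)
heavily_tagged = sorted(s for s, n in per_subject.items() if n >= THRESHOLD)

print(f"{len(heavily_tagged)} subjects were tagged {THRESHOLD} or more times")
```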


This report concludes our overview of sorting phase data. As we move into the transcription phase, we hope to provide more insights with a #DataDeepDive series.