Multilingual Vandalism Detection Using Language-Independent & Ex Post Facto Evidence

dc.contributor.author: West, Andrew G.
dc.contributor.author: Lee, Insup
dc.date: 2023-05-17T06:33:13.000
dc.date.accessioned: 2023-05-22T12:48:48Z
dc.date.available: 2023-05-22T12:48:48Z
dc.date.issued: 2011-09-21
dc.date.submitted: 2011-09-21T09:41:08-07:00
dc.description.abstract: There is much literature on Wikipedia vandalism detection. However, this writing addresses two facets given little treatment to date. First, prior efforts emphasize zero-delay detection, classifying edits the moment they are made. If classification can be delayed (e.g., compiling offline distributions), it is possible to leverage ex post facto evidence. This work describes/evaluates several features of this type, which we find to be overwhelmingly strong vandalism indicators. Second, English Wikipedia has been the primary test-bed for research. Yet, Wikipedia has 200+ language editions and use of localized features impairs portability. This work implements an extensive set of language-independent indicators and evaluates them using three corpora (German, English, Spanish). The work then extends to include language-specific signals. Quantifying their performance benefit, we find that such features can moderately increase classifier accuracy, but significant effort and language fluency are required to capture this utility. Aside from these novel aspects, this effort also broadly addresses the task, implementing 65 total features. Evaluation produces 0.840 PR-AUC on the zero-delay task and 0.906 PR-AUC with ex post facto evidence (averaging languages). Performance matches the state-of-the-art (English), sets novel baselines (German, Spanish), and is validated by a first-place finish over the 2011 PAN-CLEF test set.
dc.description.comments: PAN-CLEF '11: Notebook Papers on Uncovering Plagiarism, Authorship, and Social Software Misuse, Amsterdam, the Netherlands. September 2011. http://www.uni-weimar.de/medien/webis/research/events/pan-11/pan11-web/about.html
dc.identifier.uri: https://repository.upenn.edu/handle/20.500.14332/6534
dc.legacy.articleid: 1515
dc.legacy.fulltexturl: https://repository.upenn.edu/cgi/viewcontent.cgi?article=1515&context=cis_papers&unstamped=1
dc.source.issue: 479
dc.source.journal: Departmental Papers (CIS)
dc.source.journaltitle: PAN-CLEF '11: Notebook Papers on Uncovering Plagiarism, Authorship, and Social Software Misuse
dc.source.peerreviewed: true
dc.source.status: published
dc.subject.other: CPS Internet of Things
dc.subject.other: Wikipedia
dc.subject.other: vandalism
dc.subject.other: collaborative software
dc.subject.other: collaborative security
dc.subject.other: social software misuse
dc.subject.other: feature selection
dc.subject.other: machine learning
dc.subject.other: Databases and Information Systems
dc.subject.other: Numerical Analysis and Scientific Computing
dc.subject.other: Other Computer Sciences
dc.title: Multilingual Vandalism Detection Using Language-Independent & Ex Post Facto Evidence
dc.type: Presentation
digcom.identifier: cis_papers/479
digcom.identifier.contextkey: 2249249
digcom.identifier.submissionpath: cis_papers/479
digcom.type: conference
dspace.entity.type: Publication
relation.isAuthorOfPublication: 5584daf2-ea60-4404-a98f-6077e4d91d24
relation.isAuthorOfPublication: 45a9eed5-3211-4c36-b40d-6394302dfdce
relation.isAuthorOfPublication.latestForDiscovery: 5584daf2-ea60-4404-a98f-6077e4d91d24
upenn.schoolDepartmentCenter: Departmental Papers (CIS)
Files
Original bundle
Name: pan_11_final.pdf
Size: 124.25 KB
Format: Adobe Portable Document Format