Framework of Hate Speech Identification for Formal and Informal Text Using Lexical Approach

Husnain Saleem; Muhammad Javed; Syed Muhammad Ali Haider; Hamid Masood Khan; Muhammad Ahmad Jan; Asad Ullah

doi:10.53555/ks.v12i1.3249

Keywords:

NLP, Hate Speech, Toxic Speech, Roman Urdu, Lexicon Based Approach

Abstract

Social media refers to digital platforms and online venues where individuals and organizations share content and broadcast information. Prominent social networking sites such as Facebook, Instagram, Twitter, and YouTube allow users to produce, share, and exchange multimedia material such as text, photographs, videos, and links. Sentiment analysis is a computational process for determining and categorizing the emotional tone of textual material on social networking sites, such as messages, comments, or tweets, and it is a significant problem in the field of Natural Language Processing (NLP). In this context, hate speech or toxic speech is language that expresses hostile attitudes, insulting statements, or destructive intent directed against a person or a group of individuals. In this study, we used a lexicon-based approach at the sentence level to detect toxic speech in bilingual text, specifically text published in English (formal) and Roman Urdu (informal). We concentrated on three domains in particular: race, religion, and nationality. We extracted our dataset from Twitter via the Twitter API; it comprises 3,030 tweets, 1,010 for each of these three domains. The proposed framework attained average accuracies of 92.52%, 93.03%, and 93.35% for the race, religion, and nationality domains, respectively.
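To make the idea of sentence-level, lexicon-based detection concrete, the following is a minimal sketch and not the framework described in the study. The abstract does not give the lexicons, preprocessing steps, or decision rule, so everything here is an assumption for illustration: the `LEXICONS` entries are neutral placeholder tokens standing in for English and Roman Urdu terms, and `tokenize`, `classify_sentence`, `accuracy`, and the `threshold` parameter are hypothetical names.

```python
import re
from typing import Dict, List, Set

# Hypothetical placeholder lexicons, one per domain named in the abstract.
# Real lexicons would hold English and Roman Urdu terms; these tokens are
# stand-ins and are NOT taken from the study.
LEXICONS: Dict[str, Set[str]] = {
    "race": {"race_slur_en", "race_slur_ur"},
    "religion": {"religion_slur_en", "religion_slur_ur"},
    "nationality": {"nation_slur_en", "nation_slur_ur"},
}

TOKEN_RE = re.compile(r"[a-z0-9_']+")


def tokenize(sentence: str) -> List[str]:
    """Lowercase the sentence and split it into simple word tokens."""
    return TOKEN_RE.findall(sentence.lower())


def classify_sentence(
    sentence: str,
    lexicons: Dict[str, Set[str]] = LEXICONS,
    threshold: int = 1,  # assumed rule: at least one lexicon hit flags a domain
) -> dict:
    """Count lexicon hits per domain and flag domains that reach the threshold."""
    tokens = tokenize(sentence)
    hits = {
        domain: sum(1 for token in tokens if token in terms)
        for domain, terms in lexicons.items()
    }
    flagged = sorted(domain for domain, count in hits.items() if count >= threshold)
    return {"toxic": bool(flagged), "domains": flagged, "hits": hits}


def accuracy(predictions: List[bool], labels: List[bool]) -> float:
    """Fraction of predictions that match the gold labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        return 0.0
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)
```

In use, each tweet would be passed through `classify_sentence`, and the resulting `toxic` flags for a domain's tweets would be compared against manual labels with `accuracy`. Exact token matching is a deliberate simplification: Roman Urdu has no standard spelling, so a practical system would likely need spelling normalization before lookup.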