A hybrid approach for paraphrase identification based on knowledge-enriched semantic heuristics

In this paper, we propose a hybrid approach for sentence paraphrase identification. The proposal addresses the problem of evaluating sentence-to-sentence semantic similarity when the sentences contain named entities. The essence of the proposal is to compute the semantic similarity of named-entity tokens separately from that of the rest of the sentence text. Specifically, the approach integrates word semantic similarity derived from WordNet taxonomic relations with named-entity semantic relatedness inferred from Wikipedia entity co-occurrences and underpinned by Normalized Google Distance. In addition, the WordNet similarity measure is enriched with word part-of-speech (PoS) conversion aided by the Categorial Variation database (CatVar), which enhances the lexico-semantics of words. We validated our hybrid approach on two datasets: the Microsoft Research Paraphrase Corpus (MSRPC) and TREC-9 Question Variants. In our empirical evaluation, the system outperforms the baselines and most of the related state-of-the-art systems for paraphrase detection. We also conducted a misidentification analysis to identify the primary sources of our system's errors.
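
As an illustration of the named-entity component, the following is a minimal sketch of Normalized Google Distance computed from occurrence counts, using the standard formulation by Cilibrasi and Vitanyi. It is not the paper's implementation: the function names, the use of article counts as the frequency source, and the exp(-NGD) mapping from distance to relatedness are assumptions made for this example.

```python
import math


def normalized_google_distance(fx, fy, fxy, n):
    """Normalized Google Distance from occurrence counts.

    fx, fy : number of documents (e.g. Wikipedia articles) mentioning each entity
    fxy    : number of documents mentioning both entities
    n      : total number of documents in the collection
    """
    if fx <= 0 or fy <= 0:
        raise ValueError("fx and fy must be positive")
    if n <= min(fx, fy):
        raise ValueError("n must exceed the smaller of fx and fy")
    if fxy <= 0:
        # Entities that never co-occur are treated as infinitely distant.
        return math.inf
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    numerator = max(log_fx, log_fy) - log_fxy
    denominator = math.log(n) - min(log_fx, log_fy)
    return numerator / denominator


def relatedness(fx, fy, fxy, n):
    """Map the distance to a (0, 1] relatedness score.

    The exp(-NGD) mapping is an assumption of this sketch, not taken
    from the paper.
    """
    return math.exp(-normalized_google_distance(fx, fy, fxy, n))
```

With hypothetical counts, two entities that always appear together get a distance of 0 and a relatedness of 1, while entities that never co-occur get a relatedness of 0.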

Authors:
Mohamed Muhidin, Oussalah Mourad

Publication type:
A1 Journal article – refereed

Keywords:
Named-entity semantic relatedness, Paraphrase identification, Wikipedia, Word category subsumption, WordNet

Published:
2020

Full citation:
Mohamed, M., Oussalah, M. A hybrid approach for paraphrase identification based on knowledge-enriched semantic heuristics. Lang Resources & Evaluation 54, 457–485 (2020). https://doi.org/10.1007/s10579-019-09466-4

DOI:
https://doi.org/10.1007/s10579-019-09466-4

Read the publication here:
http://urn.fi/urn:nbn:fi-fe202001202588