Identifying paraphrased text has business value in many use cases. For example, a text summarization system can remove redundant information by identifying sentence paraphrases. Another application is detecting plagiarized documents. In this post, we walk through the few steps needed to fine-tune a Hugging Face transformer on Amazon SageMaker to identify paraphrased sentence pairs.
A truly robust model can identify paraphrases even when the wording is completely different, and can also tell sentences apart when they share most of their words but differ in meaning. In this post, we focus on the latter. Specifically, we look at whether we can train a model to distinguish two sentences that have high lexical overlap but very different or opposite meanings. For example, the following sentences contain exactly the same words but have opposite meanings:

I took a flight from New York to Paris
I took a flight from Paris to New York
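
To make "high lexical overlap" concrete, the following minimal sketch measures word overlap between the two sentences with a simple Jaccard score over lowercased, whitespace-split tokens. The helper names `tokenize` and `jaccard_overlap` are illustrative and not part of any library, whitespace tokenization is a deliberate simplification, and the second sentence is assumed to end with "New York" so that both sentences use the same words.

```python
def tokenize(sentence):
    """Lowercase and split on whitespace (a simplification of real tokenization)."""
    return sentence.lower().split()


def jaccard_overlap(sentence_a, sentence_b):
    """Return the Jaccard similarity of the two sentences' word sets."""
    words_a = set(tokenize(sentence_a))
    words_b = set(tokenize(sentence_b))
    if not words_a and not words_b:
        return 1.0  # two empty sentences are treated as identical
    return len(words_a & words_b) / len(words_a | words_b)


# Example pair from the text (second sentence assumed to end with "New York")
s1 = "I took a flight from New York to Paris"
s2 = "I took a flight from Paris to New York"

overlap = jaccard_overlap(s1, s2)
same_order = tokenize(s1) == tokenize(s2)

print(f"Word overlap: {overlap:.2f}")       # identical word sets
print(f"Same word order: {same_order}")     # the order differs
```

A bag-of-words measure like this scores the pair as identical, because it ignores word order. That is why a model needs to capture word order and context, rather than word overlap alone, to tell such sentences apart.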

