How does anti plagiarism work?

Posted in August 2023 by the Copyfind team


What is anti plagiarism all about?

When checking originality, everyone relies on text-uniqueness checkers (anti-plagiarism systems), but few people understand how they actually work.
So what do anti-plagiarism systems do? Let's figure it out.
Anti-plagiarism systems only check the uniqueness of the text
It's funny, but the notorious "anti-plagiarism" check has little to do with detecting plagiarism as such. Judge for yourself: plagiarism is theft, taking someone else's work and passing it off as your own. If you give any anti-plagiarism system two texts (an original and a rewrite), none of them will tell you which one is the original. By their technical nature, they can only determine the share of borrowed text, that is, text that also appears on the Internet or in their own databases.


What about uniqueness?


Uniqueness is the inverse of this share of borrowed text: the less borrowing, the higher the uniqueness.

There are two types of uniqueness:
Technical uniqueness. When people talk about the uniqueness of a text, this is usually what they mean. It is assessed by technical indicators such as the structure of the text and the set and order of words. At the same time, the meaning of the text may be non-unique and appear in other materials in one form or another.
Semantic uniqueness. If a text contains information that did not exist before, it has high semantic uniqueness. Anti-plagiarism systems ignore it, because evaluating semantic uniqueness requires understanding the content. This kind of uniqueness matters on the Internet, in texts read by thousands of people; when study papers are checked, it is simply disregarded.

 

How do systems check the uniqueness of a text?


Anti-plagiarism systems use sophisticated algorithms to check for uniqueness. The most common is the shingle algorithm.

 

The essence of this method is to find verbatim matches.

The program splits the text into small pieces (shingles) consisting of a fixed number of words. These pieces share words with one another, overlapping like scales laid on top of each other, so that not a single word goes unchecked.

For each shingle, a hash is calculated: a short string of letters and digits that encodes the contents of that piece of text. Identical shingles always produce identical hashes, so matching hashes indicate matching pieces of text.

And this is how the check works: the service takes your text and a text it considers similar, then compares the hashes of their individual shingles. The more matches there are, the lower the uniqueness and the higher the probability that one text is a copy of the other (perhaps not completely, but partially).
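The steps above can be sketched in a few lines of Python. This is a toy illustration, not any real service's implementation: the shingle size (4 words) and the hash function (MD5) are assumptions, and real checkers also normalize punctuation and word forms before shingling.

```python
import hashlib

def shingles(text, size=4):
    """Split text into overlapping word n-grams (shingles)."""
    words = text.lower().split()
    return [" ".join(words[i:i + size]) for i in range(len(words) - size + 1)]

def shingle_hashes(text, size=4):
    """Hash each shingle so comparison works on short fingerprints."""
    return {hashlib.md5(s.encode()).hexdigest() for s in shingles(text, size)}

def similarity(a, b, size=4):
    """Share of matching shingle hashes between two texts (Jaccard similarity)."""
    ha, hb = shingle_hashes(a, size), shingle_hashes(b, size)
    if not ha or not hb:
        return 0.0
    return len(ha & hb) / len(ha | hb)
```

Identical texts score 1.0, unrelated texts score 0.0, and a text with one swapped word lands somewhere in between, because only the shingles touching the changed word stop matching.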

Using the shingle algorithm, you can even find slightly modified texts, which is why it is so popular. But it is not without drawbacks: the method fails when quotations, idioms, and other set expressions need to be excluded from the check. So if you check a study paper on, say, medicine, anti-plagiarism systems are likely to show low uniqueness, simply because the text uses the set expressions characteristic of the field. Everyone understands that such expressions are unavoidable, yet they still drag the uniqueness score down.

In addition to the standard shingle method, services are being refined with extra checks. For example, a lexical matching algorithm searches for similar terms and concepts across texts, and detectors of artificially boosted uniqueness help identify text that has been run through a "uniqueness enhancement" service.

More recent approaches assess content similarity with neural networks and achieve significantly higher accuracy, but at a high computational cost. A typical neural approach embeds both pieces of content into semantic vector embeddings and computes their similarity, most often as cosine similarity.
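The final comparison step can be shown with a minimal sketch in plain Python. The vectors below are made-up toy embeddings; in a real system they would come from a trained model, and the dimensionality would be in the hundreds.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors.

    Returns 1.0 for vectors pointing the same way (same meaning),
    0.0 for orthogonal vectors (unrelated meaning).
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```

Because the score depends only on the direction of the vectors, not on surface wording, a paraphrase that keeps the meaning intact still lands close to 1.0, which is exactly the rewrite case that shingle matching misses.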

The easiest way to get a good result when checking a text for uniqueness is to write it yourself and rely on other materials as little as possible. However, when writing an essay, term paper, or thesis, it is difficult to avoid relying on books, manuals, and other sources.