Fighting with Copyscape and Plagiarism Checkers
I’ve always been an honest student, barring one incident in the first grade when I cheated on a spelling test and got ratted out by the girl I had a crush on, so getting in trouble for plagiarism on an assignment was never on my radar. That commitment to honesty has stuck with me into the professional world. It wasn’t until I was in college that sophisticated digital plagiarism checkers became a commonplace tool in academia. On the first day of class, one of my most respected teachers went on a very long tangent about how sophisticated the plagiarism checker was that she submitted all of our papers to. I was a bit terrified at the thought of it. I figured there is a finite number of words in the English language, which means any given thought can only be expressed with so many different arrangements of the words that apply to it. I was also concerned that my own writing style would come back to haunt me if my phrasing patterns made it look like I had copied and pasted text between my own assignments. After hundreds of thousands of papers were indexed by the checker, I presumed that original work would eventually start getting flagged as plagiarism. As an honest student, this terrified me, but my teacher assured me that I would not get in trouble. She was right. I never got in trouble.
However, my fears would eventually come true as I entered the realm of professional writing. Fortunately, every editor I’ve worked with has been aware that original work can get flagged for plagiarism simply because of the sheer number of existing articles and blogs on the same topic. It’s usually not a problem with blog writing, but it comes up frequently in any sort of sales pitch writing where you’re stuck working with things like boilerplate text and feature lists. Ethics aside, repeating the same text on pages across multiple websites is known to cause major problems with content SEO. The search engines aren’t about to point the finger for plagiarism, but they do group pages that feature substantially similar content as “repeated” or “copied” pages and rank them very low on search engine results pages. Basically, the search engine sees the repeated text as adding no additional value to the web search because the original page containing that information already covered it. So from a pure business standpoint, providing content that gets flagged by search engines as repeat, low-value material is not something you want to do.
I frequently work with the Copyscape plagiarism checker and have done some pretty interesting things to adjust flagged content. When it comes to boilerplate text, you really don’t have a lot of good options: you can either pull the text from the article or bite the bullet on the Copyscape flag. There are typically legal and policy rules that won’t allow the writer to rephrase the text, so all your years of education and work experience don’t mean a thing. The first few times you use boilerplate text, it won’t get flagged, but it’ll eventually become an issue.
Plagiarism checkers become a real pain when you’re working with product feature lists in sales-oriented articles. Specifications are important, even if the typical reader may only understand what a few of them mean, because they provide quantifiable, honest information (minus that recent debacle with the Volkswagen Diesel engine emission issue). When you’re writing about something like a car, computer, smartphone, tablet, appliance, or tool, there are only so many feature descriptors to work with on each. For example, it’s very important to note the processor clock speed and system memory size on computers, smartphones, and tablets. Those specs are like how much horsepower a car engine puts out: much of the audience may not care, but it’s pretty important to go on the record about stuff like that. The audience may be more likely to notice when specs are missing and take that as a sign you’re trying to hide something negative about the product. So with computer device memory, you can list it as 4GB memory, 4 Gigabytes of RAM, 4GB DDR3 system memory, and a small number of other combinations of the same information. This text will eventually get flagged as the Copyscape database and the number of articles about a specific product both grow. So you’ll start with 4GB Memory, get flagged, change it to 4 Gigabytes of DDR3 RAM, and get flagged again. Your next step is to shuffle the order in which you list the other specs in the hope that Copyscape will see enough of a difference in how this required information is presented.
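The handful of memory phrasings mentioned above can be enumerated mechanically. Here is a minimal Python sketch of that idea; the function name spec_variants and the word lists are my own illustrative assumptions, and nothing here says which phrasings Copyscape will actually accept.

```python
from itertools import product

def spec_variants(amounts, labels):
    """Return every 'amount label' pairing as a candidate phrasing.

    Hypothetical helper for illustration only.
    """
    return [f"{amount} {label}" for amount, label in product(amounts, labels)]

# Assumed wordings, based on the memory example in the text.
amounts = ["4GB", "4 Gigabytes of"]
labels = ["memory", "RAM", "DDR3 system memory", "DDR3 RAM"]

for phrase in spec_variants(amounts, labels):
    print(phrase)
```

Two amount wordings and four labels give only eight phrasings, which is exactly why this approach runs out of road so quickly.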
A thesaurus can be a big help when rephrasing the same information. Unfortunately, the thesaurus only gets you so far, especially when you’re working with trademarked names where synonyms don’t apply. In many cases, I rely on the Text Mechanic Sort Text Lines tool to randomize the order of text elements in a list to satisfy Copyscape. Eventually, with enough randomization, feature swapping, and synonym application, I’m able to get text that was never plagiarism to begin with past the plagiarism checker. It’s all worth it, though, as the final product is usually much better for your client.
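Randomizing the order of a feature list is easy to do yourself if you don’t have an online tool handy. The following is a rough Python sketch, not a description of how the Text Mechanic tool works; the function name shuffle_lines, the seed parameter, and the sample spec lines (apart from the memory example) are made up for illustration.

```python
import random

def shuffle_lines(text, seed=None):
    """Return the non-empty lines of text in a random order.

    Passing a seed makes the order repeatable; leaving it as None
    gives a different order on each run.
    """
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    rng = random.Random(seed)
    rng.shuffle(lines)  # shuffles the list in place
    return "\n".join(lines)

# Made-up feature list, one spec per line.
specs = """4GB DDR3 system memory
2.4GHz quad-core processor
64GB storage
10.1-inch display"""

print(shuffle_lines(specs))
```

Every spec still appears exactly once; only the order changes, so the required information stays intact.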
Dan S is a former news journalist turned web developer and freelance writer. He has a penchant for all things tech and believes the person using the machine is the most important element.