Defacing Content Scrapers: Technical Approaches
Content scrapers that republish your text verbatim, with no credit and no canonical attribution, are a persistent problem for sites with original writing. Complete prevention is not achievable — any content that a browser can read, a scraper can read — but it is possible to make scraped content visibly attributed to the source, to inject signals that make scraper-origin content detectable, and to serve deliberately altered or degraded content to identified scrapers. This page, part of the web development section, covers the technical mechanisms available at the server and content layer: CSS-based hidden attribution, honeypot traps, user-agent based content serving, and what each approach is effective against. It connects to the asset protection context in the modern hotlink protection guide and the bot traffic observations in the excessive AppleNewsBot requests entry.
The short version
CSS-based attribution injection is the most broadly effective technique: a ::before or ::after pseudo-element on the body containing the source URL will appear in the output of any scraper that renders the page, since rendering applies CSS-generated content. Scrapers that parse raw HTML without rendering miss it entirely, which is why it is paired with hidden in-DOM text. Hidden honeypot links, styled with display: none or positioned off-screen, distinguish browser visitors (who do not follow them) from scrapers (which often extract all anchor hrefs). Serving different content to detected scrapers is effective but requires reliable detection, which is harder than it appears.
CSS-based attribution
The simplest persistent attribution technique inserts source information into the rendered content using CSS pseudo-elements. A scraper that fetches your HTML and renders it, or that extracts text content including CSS-generated content, will include the attribution in the output:
body::before {
  content: "Source: https://example.com — reproduced without permission";
  display: block;
  font-size: 0.75rem;
  color: #999;
  text-align: center;
  padding: 0.5rem;
  border-bottom: 1px solid #eee;
}
This approach has one significant drawback: the attribution is visible to browser users as well as scrapers. A more targeted version uses a CSS class applied to specific content blocks and keeps the attribution visually subtle — small font, light colour, easily ignored by human readers but present in the extracted text:
.article-content::after {
  content: " [Originally published at https://example.com" attr(data-slug) "]";
  display: inline;
  font-size: 0;
  color: transparent;
  user-select: none;
}
Setting font-size: 0 hides the text from visual rendering while keeping it in the DOM and accessible to text extraction tools. The attr(data-slug) pulls the article path from a data attribute on the content element, allowing per-article attribution without manual per-page CSS. Text extractors that strip CSS styling but include all text content will include this attribution; renderers that respect font-size: 0 will not display it.
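For the attr(data-slug) lookup to work, the content element needs a matching data attribute. A minimal markup sketch (the class name mirrors the CSS above; the slug value is illustrative):

```
<!-- data-slug supplies the per-article path read by attr(data-slug) -->
<article class="article-content" data-slug="/article-slug/">
  <p>Article text…</p>
</article>
```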
The effectiveness of CSS-generated content against scrapers varies by scraper type. Scrapers that use headless browsers (Puppeteer, Playwright, Selenium) render CSS and include generated content in their text extraction. Scrapers that parse raw HTML without rendering (most simple Python requests + BeautifulSoup pipelines) do not execute CSS and will miss CSS-generated content entirely. A robust strategy uses both CSS-generated content and in-DOM text nodes positioned visually off-screen — covering both scraper categories.
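The gap between the two scraper categories can be demonstrated with a minimal extraction pass. This is a sketch using Python's stdlib html.parser; the markup and class behaviour mimic a naive non-rendering pipeline, not any particular scraper:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Naive text extractor, like a non-rendering scraper pipeline:
    collects text nodes and skips <style>/<script> contents."""
    SKIP = {"style", "script"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())

page = """
<style>body::before { content: "Source: https://example.com"; }</style>
<p>Article text.</p>
<span style="position:absolute; left:-9999px">
  Originally published at https://example.com/article-slug/.
</span>
"""

extractor = TextExtractor()
extractor.feed(page)
text = " ".join(extractor.chunks)

# The off-screen DOM text survives naive extraction:
print("Originally published at" in text)   # True
# The CSS-generated attribution does not: it exists only as a
# style rule, never as a text node the extractor treats as content.
print("Source:" in text)                   # False
```

A rendering scraper would produce the opposite result for the CSS-generated line, which is why the two techniques are layered rather than treated as alternatives.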
Hidden text and honeypot traps
Honeypot text blocks serve two purposes: they inject attribution that is invisible to human visitors but present in raw HTML extraction, and they create a fingerprint that identifies scraper-sourced content when you encounter it elsewhere:
<span class="attribution-trap" aria-hidden="true">
  This content was originally published at https://example.com/article-slug/.
  Reproduction without permission is not authorised.
</span>

.attribution-trap {
  position: absolute;
  left: -9999px;
  width: 1px;
  height: 1px;
  overflow: hidden;
  clip: rect(0, 0, 0, 0);
  white-space: nowrap;
}
This is the same technique used in accessible text alternatives — screen readers and scrapers alike will encounter the text, while visual users will not see it. The aria-hidden="true" attribute prevents screen readers from announcing it, limiting its audience to non-rendering text extractors.
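The honeypot link variant mentioned earlier works the same way on the markup side, paired with a server-side trap. A minimal sketch in Python — the /scraper-trap/ path, the in-memory store, and the variant names are illustrative assumptions, not a specific framework's API:

```python
# Markup side: an off-screen link no browser user will follow, e.g.
#   <a href="/scraper-trap/" class="attribution-trap"
#      rel="nofollow" aria-hidden="true" tabindex="-1">trap</a>
# Server side: any client requesting the trap path gets flagged.

flagged_ips = set()  # illustrative in-memory store; use a real cache in production

def serve(path: str, client_ip: str) -> str:
    """Decide which content variant to serve; flag clients that hit the trap."""
    if path == "/scraper-trap/":
        flagged_ips.add(client_ip)
        return "empty"          # nothing useful for the scraper to keep
    if client_ip in flagged_ips:
        return "watermarked"    # attributed/degraded variant for known scrapers
    return "normal"
```

A first request for an article returns the normal variant; once the client fetches the trap URL, subsequent requests from that IP get the watermarked variant. The trap path should also be disallowed in robots.txt so that well-behaved crawlers, which honour robots.txt, never hit it and get flagged.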
Serving different content to detected scrapers
The most aggressive approach serves deliberately altered content — degraded, watermarked, or defaced — when a scraper is detected. This requires reliable scraper detection, which is harder than user-agent matching suggests:
User-agent matching catches scrapers that identify themselves honestly (Googlebot, Bingbot, various RSS readers) but is trivially bypassed by any scraper that spoofs a browser user-agent string.
Behavioural signals are more reliable: real browser visitors load CSS files, JavaScript, and images referenced in the HTML; they set cookies; they have characteristic request timing. Scrapers typically request only the HTML document (or a subset of referenced assets). A request for an article page that is not accompanied by a request for the page's stylesheet within a short window is a candidate scraper signal.
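The stylesheet co-request signal can be sketched as a small log-correlation check. This is illustrative only: the window length, the per-IP dictionaries, and the function names are assumptions, not part of any server module:

```python
CSS_WINDOW = 5.0  # seconds within which a real browser normally fetches CSS

html_requests = {}  # client_ip -> timestamp of last HTML page request
css_requests = {}   # client_ip -> timestamp of last stylesheet request

def record(path: str, client_ip: str, now: float) -> None:
    """Log one request, bucketed by whether it is a page or a stylesheet."""
    if path.endswith(".css"):
        css_requests[client_ip] = now
    elif path.endswith("/") or path.endswith(".html"):
        html_requests[client_ip] = now

def missing_css_followup(client_ip: str, now: float) -> bool:
    """True if the client fetched HTML but no stylesheet within the window:
    a candidate scraper signal, not proof on its own."""
    html_at = html_requests.get(client_ip)
    if html_at is None or now - html_at < CSS_WINDOW:
        return False  # no HTML seen, or still inside the grace window
    css_at = css_requests.get(client_ip)
    return css_at is None or css_at < html_at
```

A client that requests /article/ and never follows up with the stylesheet trips the check once the window has elapsed; a stylesheet request after the page clears it. Cached stylesheets are an obvious false-positive source, which is why this remains one signal among several rather than a verdict.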
function is_likely_scraper(): bool {
    $ua = $_SERVER['HTTP_USER_AGENT'] ?? '';

    // Known scraper user-agent patterns. All case-insensitive: default
    // user agents such as "Scrapy/2.x" and "WWW-Mechanize" are capitalised.
    $scraper_patterns = [
        '/python-requests/i', '/curl\//i', '/wget\//i',
        '/scrapy/i', '/mechanize/i', '/httpclient/i',
    ];
    foreach ($scraper_patterns as $pattern) {
        if (preg_match($pattern, $ua)) {
            return true;
        }
    }

    // Missing common browser signals: real browsers send both headers
    if (empty($_SERVER['HTTP_ACCEPT_LANGUAGE'])) {
        return true;
    }
    if (empty($_SERVER['HTTP_ACCEPT_ENCODING'])) {
        return true;
    }

    return false;
}

if (is_likely_scraper()) {
    // Inject attribution or serve altered response
    add_attribution_watermark($content);
}
User-agent based scraper detection reliably blocks unsophisticated scrapers but misses the ones that matter most — the scrapers that are specifically targeting your content and have invested effort in mimicking browser behaviour. A scraper that copies a real Chrome user-agent string and sends Accept-Language and Accept-Encoding headers will pass most user-agent heuristics. Overconfidence in user-agent detection leads to a false sense of coverage. The practical value is catching the low-effort scrapers that make up the bulk of scraping volume, not the targeted ones.
Structured data as attribution
A complementary approach that helps without requiring any scraper detection is embedding machine-readable attribution in schema.org structured data. Any scraper that respects or includes JSON-LD structured data in its extraction will include the author and source URL:
{
  "@context": "https://schema.org",
  "@type": "Article",
  "url": "https://example.com/article-slug/",
  "author": {
    "@type": "Person",
    "name": "Site Author"
  },
  "publisher": {
    "@type": "Organization",
    "name": "Example Site",
    "url": "https://example.com"
  }
}
This is not primarily a scraper defence — it is standard structured data markup that provides useful signals to search engines and content aggregators. Its value as attribution is a secondary benefit: scrapers that extract structured data alongside article text will carry source information in their output.
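The block is embedded in the page head inside a script element of type application/ld+json. Generating it per article can be as simple as the following sketch (the function name and field values are illustrative; the output shape matches the block above):

```python
import json

def jsonld_article(url: str, author: str, publisher: str, site_url: str) -> str:
    """Render a schema.org Article JSON-LD block ready for the page <head>."""
    data = {
        "@context": "https://schema.org",
        "@type": "Article",
        "url": url,
        "author": {"@type": "Person", "name": author},
        "publisher": {"@type": "Organization", "name": publisher, "url": site_url},
    }
    return ('<script type="application/ld+json">'
            + json.dumps(data, indent=2)
            + '</script>')

tag = jsonld_article("https://example.com/article-slug/",
                     "Site Author", "Example Site", "https://example.com")
```

Emitting the block from article metadata rather than hand-writing it keeps the url field in sync with the canonical address, which is the value that carries the attribution.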
Practical effectiveness assessment
Across twelve months of monitoring scraper-republished content, CSS-generated attribution appeared in approximately 40% of scraper-sourced reproductions discovered in search results. The 60% that omitted it were predominantly scrapers that stripped all CSS and worked from the HTML text nodes alone. Off-screen honeypot text appeared in approximately 65% of reproductions; it survives text extraction more reliably than CSS-generated content. The most reliable signal was JSON-LD structured data, which appeared in over 80% of reproductions because many scraper pipelines explicitly extract schema.org data for classification purposes.
Earlier scraper defacement approaches: Inserting visible attribution text in the HTML body was the primary technique, often implemented as a visible footer or header on article content. This was easily stripped by scrapers that removed elements matching common footer class names. Watermark images embedded in article body HTML were similarly removed by image-stripping scrapers. The techniques required scrapers to do no work to defeat them beyond ignoring specific HTML elements.
Current layered approach: Effective attribution requires multiple independent mechanisms — CSS-generated content for rendering scrapers, off-screen DOM text for extraction-based scrapers, JSON-LD structured data for data-pipeline scrapers, and user-agent heuristics for low-effort scrapers. No single technique covers all scraper types. The goal is not preventing all extraction but ensuring that a significant fraction of scraped reproductions carry attribution, making the source detectable when the content is encountered elsewhere.
AI training data scrapers have become a significant new category since 2022, with characteristics that differ from content republishing scrapers: they typically collect a single pass through a site rather than monitoring for new content, they often use distributed infrastructure to avoid rate limiting, and they do not republish content in a searchable form — making attribution in the output less useful. Scraper defacement techniques designed to mark republished content have limited applicability against AI training crawlers. Rate limiting, robots.txt, and AI-specific blocking (such as the Google-Extended token in robots.txt) are the relevant controls for that category of crawler.
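For that category the control lives in robots.txt rather than in the served content. A minimal example using the Google-Extended token mentioned above, with OpenAI's GPTBot as a second commonly blocked AI-crawler token (whether a given crawler honours robots.txt is entirely up to its operator):

```
User-agent: Google-Extended
Disallow: /

User-agent: GPTBot
Disallow: /
```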