
Blocking Some Web Crawlers and Scrapers by Identifying the User Agent

Submitted by James Qi on December 4, 2017 - 09:58

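  Ever since we started running websites, the flood of crawlers that automatically scrape our content has been a problem, and defending against scraping is a long-term task. I wrote a blog post about it five years ago: "Setting Up IP Address and URL Blocking in Apache to Prevent Scraping". In addition, you can inspect the User Agent to identify and block some scrapers. An example of the code to set this up in Apache follows: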

RewriteCond %{HTTP_USER_AGENT} ^(.*)(DTS\sAgent|Creative\sAutoUpdate|HTTrack|YisouSpider|SemrushBot)(.*)$
RewriteRule .* - [F,L]
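
  The directives above only take effect when mod_rewrite is loaded and the rewrite engine is switched on. As a minimal sketch, assuming the rule lives in an .htaccess file and that AllowOverride permits mod_rewrite directives there, the same rule could be wrapped as shown below. The IfModule guard and the shortened pattern are illustrative choices, not part of the original rule: an unanchored pattern already matches anywhere in the header, so the leading ^(.*) and trailing (.*)$ are not required.

```apache
<IfModule mod_rewrite.c>
    # Turn the rewrite engine on; without this the rules below are ignored.
    RewriteEngine On

    # Return 403 Forbidden when the User-Agent contains any listed keyword.
    # \s matches the space inside names such as "DTS Agent".
    RewriteCond %{HTTP_USER_AGENT} (DTS\sAgent|Creative\sAutoUpdate|HTTrack|YisouSpider|SemrushBot)
    RewriteRule .* - [F]
</IfModule>
```

  The F flag ends rule processing for the request on its own, so the additional L flag is optional.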

  Code to block requests with an empty User Agent:

RewriteCond %{HTTP_USER_AGENT} ^$
RewriteRule .* - [F]

  Code to block requests in which both the Referer and the User Agent are empty:

RewriteCond %{HTTP_REFERER} ^$
RewriteCond %{HTTP_USER_AGENT} ^$
RewriteRule .* - [F]
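
  Consecutive RewriteCond lines are combined with a logical AND, which is why the rule above fires only when both headers are empty. The sketch below spells that out and adds one more condition. The robots.txt exemption is an assumption borrowed from the temporary-blocking rule later in this post, not something the original rule contains.

```apache
RewriteEngine On

# Conditions are ANDed by default: both headers must be empty to match.
RewriteCond %{HTTP_REFERER} ^$
RewriteCond %{HTTP_USER_AGENT} ^$
# Assumed exemption: keep robots.txt reachable for every client.
RewriteCond %{REQUEST_URI} !^/robots\.txt$
RewriteRule .* - [F]
```

  Appending the [OR] flag to a condition would combine it with the next one using a logical OR instead. That is usually too aggressive here, because many legitimate visitors send no Referer.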

  For reference, here are characteristic User Agent keywords of some common scraping tools and bots that can be blocked:

  • User-Agent
  • DTS Agent
  • HttpClient
  • Owlin
  • Kazehakase
  • Creative AutoUpdate
  • HTTrack
  • YisouSpider
  • baiduboxapp
  • Python-urllib
  • python-requests
  • SemrushBot
  • SearchmetricsBot
  • MegaIndex
  • Scrapy
  • EMail Exractor
  • 007ac9
  • ltx71

  Others that are also worth considering for blocking:

  • Mail.RU_Bot: http://go.mail.ru/help/robots
  • Feedly
  • ZumBot
  • Pcore-HTTP
  • Daum
  • your-server
  • Mobile/12A4345d
  • PhantomJS/2.1.1
  • archive.org_bot
  • AcooBrowser
  • Go-http-client
  • Jakarta Commons-HttpClient
  • Apache-HttpClient
  • BDCbot
  • ECCP
  • Nutch
  • cr4nk
  • MJ12bot
  • MOT-MPx220
  • Y!OASIS/TEST
  • libwww-perl

  Signatures of mainstream search engines that generally should not be blocked:

  • Google
  • Baidu
  • Yahoo
  • Slurp
  • yandex
  • YandexBot
  • MSN

  Keywords from common browsers, or generic tokens, should not be blocked lightly either:

  • FireFox
  • Apple
  • PC
  • Chrome
  • Microsoft
  • Android
  • Mail
  • Windows
  • Mozilla
  • Safari
  • Macintosh
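
  One way to act on the two "do not block" lists is to add a negated condition in front of the blocking rule, so that a request whose User Agent looks like a mainstream search engine is never rejected. The following is only a sketch: the crawler tokens Googlebot, Baiduspider and bingbot are assumed examples that do not appear in the lists above, and the scraper keywords are a small sample taken from the earlier lists.

```apache
RewriteEngine On

# Skip the block when the User-Agent looks like a mainstream search engine.
# The crawler tokens here are examples, not a complete or verified list.
RewriteCond %{HTTP_USER_AGENT} !(Googlebot|Baiduspider|bingbot|Slurp|YandexBot) [NC]
# Otherwise block a sample of the scraper keywords listed earlier.
RewriteCond %{HTTP_USER_AGENT} (HTTrack|Scrapy|python-requests|MegaIndex) [NC]
RewriteRule .* - [F]
```

  Because the User Agent header is supplied by the client, a scraper can imitate a search engine and slip through a whitelist like this, so it is best treated as a safeguard against accidental blocking rather than as proof of identity.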

  Sometimes a scraper sets a User Agent of its own, and this too can be blocked once you have analyzed it, for example:

RewriteCond %{HTTP_USER_AGENT} ^(.*)(\'Mozilla\/5\.0|\'Mozilla\'|\'Moz\'|\'Mozil\'|\'(.+)\'|Mobile\/13G34|Chrome\/53\.0\.2785\.143)(.*)$
RewriteRule .* - [F,L]

  Alternatively, combine HTTP_USER_AGENT with other factors to detect and block, for example:

RewriteCond %{REQUEST_METHOD} POST
RewriteCond %{HTTP_USER_AGENT} ^(.*)(Firefox\/44\.0|Safari\/537\.36)(.*)$
RewriteCond %{REQUEST_URI} ^(.*)\/comment\/reply\/(.*)$
RewriteRule .* - [F,L]

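  The rule above was written for a case of repeated POST submissions of comments; it blocks requests that match those characteristics.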

  I also found some other code online, listed here for reference:

RewriteCond %{HTTP_USER_AGENT} "(^$|FeedDemon|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|CoolpadWebkit|Java|Feedly|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|oBot|jaunty|Python-urllib|FlightDeckReports Bot|YYSpider|DigExt|HttpClient|MJ12bot|heritrix|EasouSpider|Ezooms)" [NC]
RewriteRule ^(.*)$ - [F]

  Besides editing the .htaccess file, the same thing can be done by editing the httpd.conf configuration file:

DocumentRoot /home/wwwroot/xxx
<Directory "/home/wwwroot/xxx">
    SetEnvIfNoCase User-Agent ".*(FeedDemon|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|CoolpadWebkit|Java|Feedly|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|oBot|jaunty|Python-urllib|FlightDeckReports Bot|YYSpider|DigExt|HttpClient|MJ12bot|heritrix|EasouSpider|Ezooms)" BADBOT
    Order allow,deny
    Allow from all
    Deny from env=BADBOT
</Directory>
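
  The Order, Allow and Deny directives above are Apache 2.2 syntax. On Apache 2.4 they are provided only by mod_access_compat and are deprecated in favor of Require. As a sketch, assuming Apache 2.4 with mod_setenvif and mod_authz_core loaded, the same idea could be written as follows. The keyword list is shortened for readability, and the directory path is the same placeholder used above.

```apache
<Directory "/home/wwwroot/xxx">
    # Flag the request when the User-Agent contains a listed keyword
    # (case-insensitive). BADBOT is just an environment variable name.
    SetEnvIfNoCase User-Agent "(FeedDemon|Indy Library|Python-urllib|MJ12bot)" BADBOT

    # Allow everyone except requests that carry the BADBOT flag.
    <RequireAll>
        Require all granted
        Require not env BADBOT
    </RequireAll>
</Directory>
```

  A negated Require line cannot grant access by itself, which is why it is paired with "Require all granted" inside a RequireAll block.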

  After making this change, Apache must be restarted. Signatures that others have listed as worth blocking:

 

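  • FeedDemon             content scraping
  • BOT/0.1 (BOT for JCE) SQL injection
  • CrawlDaddy            SQL injection
  • Java                  content scraping
  • Jullo                 content scraping
  • Feedly                content scraping
  • UniversalFeedParser   content scraping
  • ApacheBench           CC attack tool
  • Swiftbot              useless crawler
  • YandexBot             useless crawler
  • AhrefsBot             useless crawler
  • YisouSpider           useless crawler (since acquired by UC Shenma Search; this spider can be allowed!)
  • MJ12bot               useless crawler
  • ZmEu                  phpMyAdmin vulnerability scanning
  • WinHttp               scraping, CC attacks
  • EasouSpider           useless crawler
  • HttpClient            TCP attacks
  • Microsoft URL Control scanning
  • YYSpider              useless crawler
  • jaunty                WordPress brute-force scanner
  • oBot                  useless crawler
  • Python-urllib         content scraping
  • Indy Library          scanning
  • FlightDeckReports Bot useless crawler
  • Linguee Bot           useless crawler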
     

  Further additions:

WinHttp|WebZIP|FetchURL|node-superagent|java/|FeedDemon|Jullo|JikeSpider|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|Java|Feedly|Apache-HttpAsyncClient|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|oBot|jaunty|Python-urllib|lightDeckReports Bot|YYSpider|DigExt|HttpClient|MJ12bot|heritrix|EasouSpider|Ezooms|BOT/0.1|YandexBot|FlightDeckReports|Linguee Bot

  And also:

Aboundex
80legs
^Java
^Cogentbot
^Alexibot
^asterias
^attach
^BackDoorBot
^BackWeb
Bandit
^BatchFTP
^Bigfoot
^Black.Hole
^BlackWidow
^BlowFish
^BotALot
Buddy
^BuiltBotTough
^Bullseye
^BunnySlippers
^Cegbfeieh
^CheeseBot
^CherryPicker
^ChinaClaw
Collector
Copier
^CopyRightCheck
^cosmos
^Crescent
^Custo
^AIBOT
^DISCo
^DIIbot
^DittoSpyder
^Download\ Demon
^Download\ Devil
^Download\ Wonder
^dragonfly
^Drip
^eCatch
^EasyDL
^ebingbong
^EirGrabber
^EmailCollector
^EmailSiphon
^EmailWolf
^EroCrawler
^Exabot
^Express\ WebPictures
Extractor
^EyeNetIE
^Foobot
^flunky
^FrontPage
^Go-Ahead-Got-It
^gotit
^GrabNet
^Grafula
^Harvest
^hloader
^HMView
^HTTrack
^humanlinks
^IlseBot
^Image\ Stripper
^Image\ Sucker
Indy\ Library
^InfoNaviRobot
^InfoTekies
^Intelliseek
^InterGET
^Internet\ Ninja
^Iria
^Jakarta
^JennyBot
^JetCar
^JOC
^JustView
^Jyxobot
^Kenjin.Spider
^Keyword.Density
^larbin
^LexiBot
^lftp
^libWeb/clsHTTP
^likse
^LinkextractorPro
^LinkScan/8.1a.Unix
^LNSpiderguy
^LinkWalker
^lwp-trivial
^LWP::Simple
^Magnet
^Mag-Net
^MarkWatch
^Mass\ Downloader
^Mata.Hari
^Memo
^Microsoft.URL
^Microsoft\ URL\ Control
^MIDown\ tool
^MIIxpc
^Mirror
^Missigua\ Locator
^Mister\ PiX
^moget
^Mozilla/3.Mozilla/2.01
^Mozilla.*NEWT
^NAMEPROTECT
^Navroad
^NearSite
^NetAnts
^Netcraft
^NetMechanic
^NetSpider
^Net\ Vampire
^NetZIP
^NextGenSearchBot
^NG
^NICErsPRO
^niki-bot
^NimbleCrawler
^Ninja
^NPbot
^Octopus
^Offline\ Explorer
^Offline\ Navigator
^Openfind
^OutfoxBot
^PageGrabber
^Papa\ Foto
^pavuk
^pcBrowser
^PHP\ version\ tracker
^Pockey
^ProPowerBot/2.14
^ProWebWalker
^psbot
^Pump
^QueryN.Metasearch
^RealDownload
Reaper
Recorder
^ReGet
^RepoMonkey
^RMA
Siphon
^SiteSnagger
^SlySearch
^SmartDownload
^Snake
^Snapbot
^Snoopy
^sogou
^SpaceBison
^SpankBot
^spanner
^Sqworm
Stripper
Sucker
^SuperBot
^SuperHTTP
^Surfbot
^suzuran
^Szukacz/1.4
^tAkeOut
^Teleport
^Telesoft
^TurnitinBot/1.5
^The.Intraformant
^TheNomad
^TightTwatBot
^Titan
^True_Robot
^turingos
^TurnitinBot
^URLy.Warning
^Vacuum
^VCI
^VoidEYE
^Web\ Image\ Collector
^Web\ Sucker
^WebAuto
^WebBandit
^Webclipping.com
^WebCopier
^WebEMailExtrac.*
^WebEnhancer
^WebFetch
^WebGo\ IS
^Web.Image.Collector
^WebLeacher
^WebmasterWorldForumBot
^WebReaper
^WebSauger
^Website\ eXtractor
^Website\ Quester
^Webster
^WebStripper
^WebWhacker
^WebZIP
Whacker
^Widow
^WISENutbot
^WWWOFFLE
^WWW-Collector-E
^Xaldon
^Xenu
^Zeus
ZmEu
^Zyborg
Acunetix
FHscan
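
  Each entry in the list above is a regular-expression fragment, with spaces already escaped by a backslash so that it can be used unquoted. One possible way to use the entries, sketched here with only a handful of them, is to give each its own RewriteCond and chain the conditions with the [OR] flag. The selection of entries is arbitrary and only for illustration.

```apache
RewriteEngine On

# Entries with a leading ^ must appear at the start of the User-Agent;
# entries without it may appear anywhere in the header.
RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Download\ Demon [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebZIP [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Acunetix [NC,OR]
# The last condition in the chain carries no OR flag.
RewriteCond %{HTTP_USER_AGENT} ZmEu [NC]
RewriteRule .* - [F]
```

  One condition per line is longer than a single alternation, but it keeps the anchored and unanchored entries exactly as listed and makes it easy to add or remove a single signature.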

 

  Code for temporary blocking (returning a 503 error) rather than long-term blocking:

RewriteCond %{HTTP_USER_AGENT} ^.*(bot|crawl|spider).*$ [NC]
RewriteCond %{REQUEST_URI} !^/robots\.txt$
RewriteRule .* - [R=503,L]

 

 

 

Comments

-- Posted from IP address: 58.49.166.204

RewriteCond %{HTTP_USER_AGENT} ^(.*)(User-Agent|DTS\sAgent|HttpClient|Owlin|Kazehakase|Creative\sAutoUpdate|HTTrack|YisouSpider|Python-urllib|python-requests|SemrushBot|SearchmetricsBot|MegaIndex|Scrapy|EMail\sExractor|007ac9|ltx71|Feedly|ZumBot|Pcore-HTTP|Daum|Mobile\/12A4345d|PhantomJS\/2\.1\.1|archive\.org_bot|AcooBrowser|Go-http-client|Jakarta\sCommons-HttpClient|Apache-HttpClient|BDCbot|Nutch|cr4nk|MJ12bot|MOT-MPx220|Y!OASIS\/TEST|libwww-perl|Indy\sLibrary|Alexa\sToolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|UniversalFeedParser|ApacheBench|Microsoft\sURL\sControl|Swiftbot|ZmEu|oBot|jaunty|lightDeckReports\sBot|YYSpider|DigExt|heritrix|EasouSpider|Ezooms|FeedDemon|BOT\sfor\sJCE|Jullo|UniversalFeedParser|WinHttp|FlightDeckReports|Linguee\sBot|JikeSpider|node-superagent|WebZIP|FetchURL|Apache-HttpAsyncClient|Aboundex|80legs|Cogentbot|Alexibot|asterias|BackDoorBot|BackWeb|Bandit|BatchFTP|Bigfoot|Black\.Hole|BlackWidow|BlowFish|BotALot|BuiltBotTough|Bullseye|BunnySlippers|Cegbfeieh|CheeseBot|CherryPicker|ChinaClaw|Collector|Copier|CopyRightCheck|cosmos|Crescent|Custo|AIBOT|DISCo|DIIbot|DittoSpyder|Download\sDemon|Download\sDevil|Download\sWonder|dragonfly|Drip|eCatch|EasyDL|ebingbong|EirGrabber|EmailCollector|EmailSiphon|EmailWolf|EroCrawler|Exabot|Express\sWebPictures|Extractor|EyeNetIE|Foobot|flunky|FrontPage|Go-Ahead-Got-It|gotit|GrabNet|Grafula|Harvest|hloader|HMView|humanlinks|IlseBot|Image\sStripper|Image\sSucker|InfoNaviRobot|InfoTekies|Intelliseek|InterGET|Internet\sNinja|Iria|Jakarta|JennyBot|JetCar|JOC|JustView|Jyxobot|Kenjin\.Spider|Keyword\.Density|larbin|LexiBot|lftp|libWeb\/clsHTTP|likse|LinkextractorPro|LinkScan|LNSpiderguy|LinkWalker|lwp-trivial|LWP::Simple|Magnet|Mag-Net|MarkWatch|Mass\sDownloader|Mata\.Hari|Memo|Microsoft\.URL|Microsoft\sURL\sControl|MIDown\stool|MIIxpc|Mirror|Missigua\sLocator|Mister\sPiX|moget|NAMEPROTECT|Navroad|NearSite|NetAnts|Netcraft|NetMechanic|NetSpider|Net\sVampire|NetZIP|NextGenSearchBot|NICErsPRO|niki-bot|NimbleCrawler|NPbot|Octopus|Offline\sExplorer|Offline\sNavigator|Openfind|OutfoxBot|PageGrabber|Papa\sFoto|pavuk|PHP\sversion\stracker|Pockey|ProPowerBot\/2\.14|ProWebWalker|psbot|Pump|QueryN\.Metasearch|RealDownload|Reaper|Recorder|ReGet|RepoMonkey|RMA|Siphon|SiteSnagger|SlySearch|SmartDownload|Snake|Snapbot|Snoopy|SpaceBison|SpankBot|spanner|Sqworm|Stripper|Sucker|SuperBot|SuperHTTP|Surfbot|suzuran|Szukacz\/1\.4|tAkeOut|Teleport|Telesoft|TurnitinBot\/1\.5|The\.Intraformant|TheNomad|TightTwatBot|Titan|True_Robot|turingos|TurnitinBot|URLy\.Warning|Vacuum|VCI|VoidEYE|Web\sImage\sCollector|Web\sSucker|WebAuto|WebBandit|Webclipping\.com|WebCopier|WebEMailExtrac|WebEnhancer|WebFetch|WebGo\sIS|Web\.Image\.Collector|WebLeacher|WebmasterWorldForumBot|WebReaper|WebSauger|Website\seXtractor|Website\sQuester|Webster|WebStripper|WebWhacker|Whacker|Widow|WISENutbot|WWWOFFLE|WWW-Collector-E|Xaldon|Xenu|Zeus|Zyborg|Acunetix|FHscan)(.*)$
RewriteRule .* - [F,L]

 

James Qi / 祁劲松