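Ever since we started running websites, crawlers that automatically scrape our content in bulk have been a problem, and guarding against scraping is a long-term job. I covered one approach in a blog post five years ago, "Setting up IP address and URL blocking in Apache to stop scraping" (《Apache中设置屏蔽IP地址和URL网址来禁止采集》). Beyond that, you can also identify and block some scrapers by their User Agent. Example code for Apache:
RewriteCond %{HTTP_USER_AGENT} ^(.*)(DTS\sAgent|Creative\sAutoUpdate|HTTrack|YisouSpider|SemrushBot)(.*)$
RewriteRule .* - [F,L]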
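The rules in this post only take effect when mod_rewrite is loaded and the rewrite engine is switched on. As a minimal sketch of a complete .htaccess block (assuming .htaccess overrides are allowed for the directory), the same check can be written with an unanchored pattern, since a RewriteCond pattern matches anywhere in the string unless it is anchored:

```apache
<IfModule mod_rewrite.c>
    # Rewrite rules are ignored unless the engine is switched on.
    RewriteEngine On

    # Unanchored match: a keyword may appear anywhere in the User-Agent.
    # [NC] makes the comparison case-insensitive.
    RewriteCond %{HTTP_USER_AGENT} (DTS\sAgent|Creative\sAutoUpdate|HTTrack|YisouSpider|SemrushBot) [NC]
    # "-" means no substitution; [F] answers with 403 Forbidden.
    RewriteRule ^ - [F]
</IfModule>
```

Dropping the leading `^(.*)` and trailing `(.*)$` does not change which requests match; it only makes the pattern shorter to read and maintain.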
Code to block requests with an empty User Agent:
RewriteCond %{HTTP_USER_AGENT} ^$
RewriteRule .* - [F]
Code to block requests where both the Referer and the User Agent are empty:
RewriteCond %{HTTP_REFERER} ^$ [NC]
RewriteCond %{HTTP_USER_AGENT} ^$ [NC]
RewriteRule .* - [F]
Below are characteristic User Agent keywords of common scraping software and bots that can be blocked, for reference:
- User-Agent
- DTS Agent
- HttpClient
- Owlin
- Kazehakase
- Creative AutoUpdate
- HTTrack
- YisouSpider
- baiduboxapp
- Python-urllib
- python-requests
- SemrushBot
- SearchmetricsBot
- MegaIndex
- Scrapy
- EMail Exractor
- 007ac9
- ltx71
Others you might also consider blocking:
- Mail.RU_Bot: http://go.mail.ru/help/robots
- Feedly
- ZumBot
- Pcore-HTTP
- Daum
- your-server
- Mobile/12A4345d
- PhantomJS/2.1.1
- archive.org_bot
- AcooBrowser
- Go-http-client
- Jakarta Commons-HttpClient
- Apache-HttpClient
- BDCbot
- ECCP
- Nutch
- cr4nk
- MJ12bot
- MOT-MPx220
- Y!OASIS/TEST
- libwww-perl
Mainstream search engine signatures that generally should not be blocked:
- Baidu
- Yahoo
- Slurp
- yandex
- YandexBot
- MSN
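One way to keep a broad rule from catching the search engines listed above is to add a negated condition, since consecutive RewriteCond lines are combined with AND. A sketch, assuming those keywords appear in the crawlers' User-Agent strings; the broad `(bot|crawl|spider)` pattern here is only an illustration:

```apache
RewriteEngine On

# Condition 1: the User-Agent looks like some kind of crawler.
RewriteCond %{HTTP_USER_AGENT} (bot|crawl|spider) [NC]
# Condition 2: ...but it is NOT one of the search engines we want to keep.
# The leading "!" negates the pattern.
RewriteCond %{HTTP_USER_AGENT} !(Baidu|Yahoo|Slurp|yandex|MSN) [NC]
# Both conditions must hold for the request to be refused with 403.
RewriteRule ^ - [F]
```

Keep in mind that a User-Agent string is easy to forge, so an exception like this also lets through any scraper that pretends to be one of these engines.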
Common browser names or generic tokens should not be blocked lightly either:
- FireFox
- Apple
- PC
- Chrome
- Microsoft
- Android
- Windows
- Mozilla
- Safari
- Macintosh
Sometimes scrapers set a custom User Agent of their own, which can also be blocked once you have analyzed it, for example:
RewriteCond %{HTTP_USER_AGENT} ^(.*)(\'Mozilla\/5\.0|\'Mozilla\'|\'Moz\'|\'Mozil\'|\'(.+)\'|Mobile\/13G34|Chrome\/53\.0\.2785\.143)(.*)$
RewriteRule .* - [F,L]
Or combine HTTP_USER_AGENT with other factors to detect and block, for example:
RewriteCond %{REQUEST_METHOD} POST
RewriteCond %{HTTP_USER_AGENT} ^(.*)(Firefox\/44\.0|Safari\/537\.36)(.*)$
RewriteCond %{REQUEST_URI} ^(.*)\/comment\/reply\/(.*)$
RewriteRule .* - [F,L]
The rule above handles a case of repeated POST comment submissions, blocking on the characteristics observed in those requests.
I also found some other code online, listed here for reference (the pattern is quoted because it contains spaces):
RewriteCond %{HTTP_USER_AGENT} "(^$|FeedDemon|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|CoolpadWebkit|Java|Feedly|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|oBot|jaunty|Python-urllib|lightDeckReports Bot|YYSpider|DigExt|HttpClient|MJ12bot|heritrix|EasouSpider|Ezooms)" [NC]
RewriteRule ^(.*)$ - [F]
Besides editing the .htaccess file, the same thing can be done in the httpd.conf configuration file:
DocumentRoot /home/wwwroot/xxx
<Directory "/home/wwwroot/xxx">
    SetEnvIfNoCase User-Agent ".*(FeedDemon|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|CoolpadWebkit|Java|Feedly|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|oBot|jaunty|Python-urllib|lightDeckReports Bot|YYSpider|DigExt|HttpClient|MJ12bot|heritrix|EasouSpider|Ezooms)" BADBOT
    Order allow,deny
    Allow from all
    deny from env=BADBOT
</Directory>
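The Order/Allow/Deny directives are Apache 2.2 syntax; on Apache 2.4 they are only available through mod_access_compat. A sketch of an equivalent using the 2.4 Require directives, with the keyword list shortened here for readability:

```apache
<Directory "/home/wwwroot/xxx">
    # Flag the request when the User-Agent contains any listed keyword
    # (case-insensitive). The list is shortened for this example.
    SetEnvIfNoCase User-Agent "(FeedDemon|Python-urllib|MJ12bot|EasouSpider)" BADBOT

    # Allow everyone except requests that carry the BADBOT flag.
    <RequireAll>
        Require all granted
        Require not env BADBOT
    </RequireAll>
</Directory>
```

A negated `Require not` cannot grant access on its own, which is why it sits inside `<RequireAll>` next to `Require all granted`.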
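Apache must be restarted after this change. Features that others have listed as worth blocking:
- FeedDemon: content scraping
- BOT/0.1 (BOT for JCE): SQL injection
- CrawlDaddy: SQL injection
- Java: content scraping
- Jullo: content scraping
- Feedly: content scraping
- UniversalFeedParser: content scraping
- ApacheBench: CC attack tool
- Swiftbot: useless crawler
- YandexBot: useless crawler
- AhrefsBot: useless crawler
- YisouSpider: useless crawler (since acquired by UC's Shenma search, so this spider can be allowed!)
- MJ12bot: useless crawler
- ZmEu: phpMyAdmin vulnerability scanning
- WinHttp: scraping, CC attacks
- EasouSpider: useless crawler
- HttpClient: TCP attacks
- Microsoft URL Control: scanning
- YYSpider: useless crawler
- jaunty: WordPress brute-force scanner
- oBot: useless crawler
- Python-urllib: content scraping
- Indy Library: scanning
- FlightDeckReports Bot: useless crawler
- Linguee Bot: useless crawler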
Further additions:
WinHttp|WebZIP|FetchURL|node-superagent|java/|FeedDemon|Jullo|JikeSpider|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|Java|Feedly|Apache-HttpAsyncClient|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|oBot|jaunty|Python-urllib|lightDeckReports Bot|YYSpider|DigExt|HttpClient|MJ12bot|heritrix|EasouSpider|Ezooms|BOT/0.1|YandexBot|FlightDeckReports|Linguee Bot
And more:
Aboundex 80legs ^Java ^Cogentbot ^Alexibot ^asterias ^attach ^BackDoorBot ^BackWeb Bandit ^BatchFTP ^Bigfoot ^Black.Hole ^BlackWidow ^BlowFish ^BotALot Buddy ^BuiltBotTough ^Bullseye ^BunnySlippers ^Cegbfeieh ^CheeseBot ^CherryPicker ^ChinaClaw Collector Copier ^CopyRightCheck ^cosmos ^Crescent ^Custo ^AIBOT ^DISCo ^DIIbot ^DittoSpyder ^Download\ Demon ^Download\ Devil ^Download\ Wonder ^dragonfly ^Drip ^eCatch ^EasyDL ^ebingbong ^EirGrabber ^EmailCollector ^EmailSiphon ^EmailWolf ^EroCrawler ^Exabot ^Express\ WebPictures Extractor ^EyeNetIE ^Foobot ^flunky ^FrontPage ^Go-Ahead-Got-It ^gotit ^GrabNet ^Grafula ^Harvest ^hloader ^HMView ^HTTrack ^humanlinks ^IlseBot ^Image\ Stripper ^Image\ Sucker Indy\ Library ^InfoNaviRobot ^InfoTekies ^Intelliseek ^InterGET ^Internet\ Ninja ^Iria ^Jakarta ^JennyBot ^JetCar ^JOC ^JustView ^Jyxobot ^Kenjin.Spider ^Keyword.Density ^larbin ^LexiBot ^lftp ^libWeb/clsHTTP ^likse ^LinkextractorPro ^LinkScan/8.1a.Unix ^LNSpiderguy ^LinkWalker ^lwp-trivial ^LWP::Simple ^Magnet ^Mag-Net ^MarkWatch ^Mass\ Downloader ^Mata.Hari ^Memo ^Microsoft.URL ^Microsoft\ URL\ Control ^MIDown\ tool ^MIIxpc ^Mirror ^Missigua\ Locator ^Mister\ PiX ^moget ^Mozilla/3.Mozilla/2.01 ^Mozilla.*NEWT ^NAMEPROTECT ^Navroad ^NearSite ^NetAnts ^Netcraft ^NetMechanic ^NetSpider ^Net\ Vampire ^NetZIP ^NextGenSearchBot ^NG ^NICErsPRO ^niki-bot ^NimbleCrawler ^Ninja ^NPbot ^Octopus ^Offline\ Explorer ^Offline\ Navigator ^Openfind ^OutfoxBot ^PageGrabber ^Papa\ Foto ^pavuk ^pcBrowser ^PHP\ version\ tracker ^Pockey ^ProPowerBot/2.14 ^ProWebWalker ^psbot ^Pump ^QueryN.Metasearch ^RealDownload Reaper Recorder ^ReGet ^RepoMonkey ^RMA Siphon ^SiteSnagger ^SlySearch ^SmartDownload ^Snake ^Snapbot ^Snoopy ^sogou ^SpaceBison ^SpankBot ^spanner ^Sqworm Stripper Sucker ^SuperBot ^SuperHTTP ^Surfbot ^suzuran ^Szukacz/1.4 ^tAkeOut ^Teleport ^Telesoft ^TurnitinBot/1.5 ^The.Intraformant ^TheNomad ^TightTwatBot ^Titan ^True_Robot ^turingos ^TurnitinBot ^URLy.Warning ^Vacuum ^VCI 
^VoidEYE ^Web\ Image\ Collector ^Web\ Sucker ^WebAuto ^WebBandit ^Webclipping.com ^WebCopier ^WebEMailExtrac.* ^WebEnhancer ^WebFetch ^WebGo\ IS ^Web.Image.Collector ^WebLeacher ^WebmasterWorldForumBot ^WebReaper ^WebSauger ^Website\ eXtractor ^Website\ Quester ^Webster ^WebStripper ^WebWhacker ^WebZIP Whacker ^Widow ^WISENutbot ^WWWOFFLE ^WWW-Collector-E ^Xaldon ^Xenu ^Zeus ZmEu ^Zyborg Acunetix FHscan
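Entries in the list above that start with `^` match only at the beginning of the User-Agent, while the others match anywhere in it. One way to use them, sketched here with a handful of entries and the BADBOT variable name from the httpd.conf example, is one SetEnvIfNoCase line per entry:

```apache
# Anchored entries: the User-Agent must start with the name.
SetEnvIfNoCase User-Agent "^BlackWidow" BADBOT
SetEnvIfNoCase User-Agent "^Download Demon" BADBOT
SetEnvIfNoCase User-Agent "^WebZIP" BADBOT

# Unanchored entries: the name may appear anywhere in the User-Agent.
SetEnvIfNoCase User-Agent "Collector" BADBOT
SetEnvIfNoCase User-Agent "Acunetix" BADBOT

# Apache 2.2-style access control, as in the httpd.conf example.
Order allow,deny
Allow from all
Deny from env=BADBOT
```

Inside a quoted pattern a plain space is enough, so the `\ ` escapes from the raw list are not needed. One line per entry is longer than a single alternation, but individual entries are easier to add, remove, or comment out.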
Code for blocking temporarily (returning a 503 error) rather than permanently:
RewriteCond %{HTTP_USER_AGENT} ^.*(bot|crawl|spider).*$ [NC]
RewriteCond %{REQUEST_URI} !^/robots\.txt$
RewriteRule .* - [R=503,L]
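A 503 response is more useful to well-behaved crawlers when it carries a Retry-After header telling them when to come back. A minimal sketch, assuming Apache 2.4 with mod_headers enabled; the one-hour delay is an arbitrary placeholder and not part of the rule above:

```apache
# Assumes Apache 2.4 with mod_rewrite and mod_headers loaded.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (bot|crawl|spider) [NC]
RewriteCond %{REQUEST_URI} !^/robots\.txt$
RewriteRule ^ - [R=503,L]

# "always" is needed so the header is also added to error responses.
# 3600 seconds is a placeholder; choose a delay that suits your site.
Header always set Retry-After "3600" "expr=%{REQUEST_STATUS} == 503"
```

The `expr=` condition restricts the header to responses whose status is 503, so normal pages are left untouched.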
Comment 1
What I am blocking as of today:
RewriteCond %{HTTP_USER_AGENT} ^(.*)(User-Agent|DTS\sAgent|HttpClient|Owlin|Kazehakase|Creative\sAutoUpdate|HTTrack|YisouSpider|Python-urllib|python-requests|SemrushBot|SearchmetricsBot|MegaIndex|Scrapy|EMail\sExractor|007ac9|ltx71|Feedly|ZumBot|Pcore-HTTP|Daum|Mobile\/12A4345d|PhantomJS\/2\.1\.1|archive\.org_bot|AcooBrowser|Go-http-client|Jakarta\sCommons-HttpClient|Apache-HttpClient|BDCbot|Nutch|cr4nk|MJ12bot|MOT-MPx220|Y!OASIS\/TEST|libwww-perl|Indy\sLibrary|Alexa\sToolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|UniversalFeedParser|ApacheBench|Microsoft\sURL\sControl|Swiftbot|ZmEu|oBot|jaunty|lightDeckReports\sBot|YYSpider|DigExt|heritrix|EasouSpider|Ezooms|FeedDemon|BOT\sfor\sJCE|Jullo|UniversalFeedParser|WinHttp|FlightDeckReports|Linguee\sBot|JikeSpider|node-superagent|WebZIP|FetchURL|Apache-HttpAsyncClient|Aboundex|80legs|Cogentbot|Alexibot|asterias|BackDoorBot|BackWeb|Bandit|BatchFTP|Bigfoot|Black\.Hole|BlackWidow|BlowFish|BotALot|BuiltBotTough|Bullseye|BunnySlippers|Cegbfeieh|CheeseBot|CherryPicker|ChinaClaw|Collector|Copier|CopyRightCheck|cosmos|Crescent|Custo|AIBOT|DISCo|DIIbot|DittoSpyder|Download\sDemon|Download\sDevil|Download\sWonder|dragonfly|Drip|eCatch|EasyDL|ebingbong|EirGrabber|EmailCollector|EmailSiphon|EmailWolf|EroCrawler|Exabot|Express\sWebPictures|Extractor|EyeNetIE|Foobot|flunky|FrontPage|Go-Ahead-Got-It|gotit|GrabNet|Grafula|Harvest|hloader|HMView|humanlinks|IlseBot|Image\sStripper|Image\sSucker|InfoNaviRobot|InfoTekies|Intelliseek|InterGET|Internet\sNinja|Iria|Jakarta|JennyBot|JetCar|JOC|JustView|Jyxobot|Kenjin\.Spider|Keyword\.Density|larbin|LexiBot|lftp|libWeb\/clsHTTP|likse|LinkextractorPro|LinkScan|LNSpiderguy|LinkWalker|lwp-trivial|LWP::Simple|Magnet|Mag-Net|MarkWatch|Mass\sDownloader|Mata\.Hari|Memo|Microsoft\.URL|Microsoft\sURL\sControl|MIDown\stool|MIIxpc|Mirror|Missigua\sLocator|Mister\sPiX|moget|NAMEPROTECT|Navroad|NearSite|NetAnts|Netcraft|NetMechanic|NetSpider|Net\sVampire|NetZIP|NextGenSearchBot|NICErsPRO|niki-bot|NimbleCrawler|NPbot|Octopus|Offline\sExplorer|Offline\sNavigator|Openfind|OutfoxBot|PageGrabber|Papa\sFoto|pavuk|PHP\sversion\stracker|Pockey|ProPowerBot\/2\.14|ProWebWalker|psbot|Pump|QueryN\.Metasearch|RealDownload|Reaper|Recorder|ReGet|RepoMonkey|RMA|Siphon|SiteSnagger|SlySearch|SmartDownload|Snake|Snapbot|Snoopy|SpaceBison|SpankBot|spanner|Sqworm|Stripper|Sucker|SuperBot|SuperHTTP|Surfbot|suzuran|Szukacz\/1\.4|tAkeOut|Teleport|Telesoft|TurnitinBot\/1\.5|The\.Intraformant|TheNomad|TightTwatBot|Titan|True_Robot|turingos|TurnitinBot|URLy\.Warning|Vacuum|VCI|VoidEYE|Web\sImage\sCollector|Web\sSucker|WebAuto|WebBandit|Webclipping\.com|WebCopier|WebEMailExtrac|WebEnhancer|WebFetch|WebGo\sIS|Web\.Image\.Collector|WebLeacher|WebmasterWorldForumBot|WebReaper|WebSauger|Website\seXtractor|Website\sQuester|Webster|WebStripper|WebWhacker|Whacker|Widow|WISENutbot|WWWOFFLE|WWW-Collector-E|Xaldon|Xenu|Zeus|Zyborg|Acunetix|FHscan)(.*)$
RewriteRule .* - [F,L]