What is it?
The following PHP script records bot and crawler visits to your site in Google Analytics 4 (GA4).
Why do you want this?
- Finding out how indexing works for Google, Bing, etc.
- Analyzing how fast new content is indexed, or whether it is indexed at all.
- Finding out what AI bots are crawling.
- Seeing if you have crawler traps on your site, such as relative URLs that loop.
- Detecting some small attempts by bots to attack your site.
- Verifying that the rules in robots.txt are working.
- You might find that your hosting is blocking some bots.
- Advanced: you can export this data to Google BigQuery and then calculate how quickly search engines index different parts of the site and how often they revisit them.
Who is this script for?
Currently it’s for people who know at least a little bit of PHP and have a website built on PHP.
What if I don’t fall into this group?
You can program a similar thing in your own language; it’s really not hard. ChatGPT and other AIs can certainly help you with that.
How does it actually work?
In principle, the code consists of only a few parts:
- A table of known bot user-agent strings.
- A check whether the request comes from a bot.
- Building an anonymized hit for GA4.
- Sending the hit to GA4.
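The four parts above can be sketched in a few lines. This is only a minimal illustration, not the full script: the three-entry bot table, the function names `sketch_detect_bot` and `sketch_build_hit_url`, and the example URL are assumptions made for this example, and the last step (sending) is only indicated in a comment.

```php
<?php
// 1. A (tiny, illustrative) table of user-agent fragments.
const SKETCH_BOTS = ['Googlebot', 'bingbot', 'GPTBot'];

// 2. The condition: return the first matching bot name, or '' for non-bots.
function sketch_detect_bot(string $userAgent): string
{
    foreach (SKETCH_BOTS as $name) {
        if (stripos($userAgent, $name) !== false) {
            return $name;
        }
    }
    return '';
}

// 3. Build the hit URL. Only a handful of the script's parameters are shown.
function sketch_build_hit_url(string $botName, string $pageUrl): string
{
    $params = [
        'v'           => '2',
        'tid'         => 'G-XXXXX',   // placeholder measurement ID
        'cid'         => '5555',      // fixed client ID: no visitor is identified
        'en'          => 'page_view',
        'dl'          => $pageUrl,
        'ep.bot_name' => $botName,
    ];
    // http_build_query() URL-encodes every value.
    return 'https://region1.google-analytics.com/g/collect?' . http_build_query($params);
}

// 4. Sending would be one HTTP request to that URL (the full script uses cURL).
$bot = sketch_detect_bot('Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)');
if ($bot !== '') {
    echo sketch_build_hit_url($bot, 'https://example.com/article'), "\n";
}
```

For a normal browser user agent `sketch_detect_bot()` returns an empty string, so nothing is built or sent.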
What does the output look like?
It will look however you build it. 🙂
For inspiration, I made a menu like this. You have to set everything up yourself, so there is no limit to creativity here.
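You can use the following extra dimensions when building reports (these are the ep.* event parameters the script sends): bot_name, http_code, referrer, user_agent and user_agent2.
To explain user_agent2: it is there because some user-agent strings are longer than 100 characters (the limit in free GA4), so I send them in multiple parameters.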
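The same idea can be generalized so that user agents longer than 200 characters are covered as well. The sketch below assumes a hypothetical helper `split_user_agent()` and numbered parameter names (`user_agent`, `user_agent2`, `user_agent3`, ...); the script itself only sends the first two.

```php
<?php
// Split a user-agent string into chunks of at most $limit characters and name
// them user_agent, user_agent2, user_agent3, ... (the naming beyond
// user_agent2 is an assumption made for this example).
function split_user_agent(string $userAgent, int $limit = 100): array
{
    $parts = [];
    // str_split() counts bytes; user agents are normally plain ASCII.
    foreach (str_split($userAgent, $limit) as $i => $chunk) {
        $key = $i === 0 ? 'user_agent' : 'user_agent' . ($i + 1);
        $parts[$key] = $chunk;
    }
    return $parts;
}

// A 130-character string ends up as one 100-character and one 30-character chunk.
$ua = str_repeat('a', 100) . str_repeat('b', 30);
print_r(array_map('strlen', split_user_agent($ua)));
```

Each chunk would then be added to the hit as its own `ep.` parameter.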
Example installation on WordPress
This is at your own risk!
Paste the code into functions.php
In WordPress, go to Admin > Appearance > Theme File Editor > functions.php and paste the code from my GitHub at the end of that file.
function is_bot($sistema) {
    // User-agent fragments to look for. Order matters: the first match is returned,
    // so the generic 'bot', 'spider' and 'crawl' entries must stay at the end.
    $bots = array(
        'Googlebot','Baiduspider','ia_archiver','R6_FeedFetcher','NetcraftSurveyAgent','Sogou web spider','bingbot','Yahoo! Slurp','facebookexternalhit','PrintfulBot','msnbot','Twitterbot','UnwindFetchor','urlresolver','Butterfly','TweetmemeBot','PaperLiBot',
        'MJ12bot','AhrefsBot','Exabot','Ezooms','YandexBot','SearchmetricsBot','picsearch','TweetedTimes Bot','QuerySeekerSpider','ShowyouBot','woriobot','merlinkbot','BazQuxBot','Kraken','SISTRIX Crawler','R6_CommentReader','magpie-crawler','GrapeshotCrawler',
        'PercolateCrawler','MaxPointCrawler','NetSeer crawler','grokkit-crawler','SMXCrawler','PulseCrawler','Y!J-BRW','80legs','Mediapartners-Google','InAGist','Python-urllib','NING','TencentTraveler','Feedfetcher-Google','mon.itor.us','spbot','Feedly','bitlybot',
        'ADmantX','Niki-Bot','Pinterest','python-requests','DotBot','HTTP_Request2','linkdexbot','A6-Indexer','TwitterFeed','Microsoft Office','Pingdom','BTWebClient','KatBot','SiteCheck','proximic','Sleuth','Abonti','(BOT for JCE)','Baidu','Tiny Tiny RSS',
        'newsblur','updown_tester','linkdex','baidu','searchmetrics','genieo','majestic12','spinn3r','profound','domainappender','VegeBot','terrykyleseoagency.com','CommonCrawler Node','AdlesseBot','metauri.com','libwww-perl','rogerbot-crawler','ltx71','Qwantify',
        'Traackr.com','Re-Animator Bot','Pcore-HTTP','BoardReader','omgili','okhttp','CCBot','Java/1.8','semrush.com','feedbot','CommonCrawler','MetaURI','ibwww-perl','rogerbot','MegaIndex','BLEXBot','FlipboardProxy','techinfo@ubermetrics-technologies.com',
        'trendictionbot','Mediatoolkitbot','trendiction','ubermetrics','ScooperBot','TrendsmapResolver','Nuzzel','Go-http-client','Applebot','LivelapBot','GroupHigh','SemrushBot','commoncrawl','istellabot','DomainCrawler','cs.daum.net','StormCrawler','GarlikCrawler',
        'The Knowledge AI','getstream.io/winds','YisouSpider','archive.org_bot','semantic-visions.com','FemtosearchBot','360Spider','linkfluence.com','glutenfreepleasure.com','Gluten Free Crawler','YaK/1.0','Cliqzbot','app.hypefactors.com','axios','webdatastats.com',
        'schmorp.de','SEOkicks','DuckDuckBot','Barkrowler','ZoominfoBot','Linguee Bot','Mail.RU_Bot','OnalyticaBot','admantx-adform','Zombiebot','Nutch','SemanticScholarBot','Jetslide','scalaj-http','XoviBot','sysomos.com','PocketParser','newspaper','serpstatbot',
        'MetaJobBot','SeznamBot/3.2','VelenPublicWebCrawler/1.0','WordPress.com mShots','adscanner','BacklinkCrawler','netEstate NE Crawler','Astute SRM','GigablastOpenSource/1.0','DomainStatsBot','Winds: Open Source RSS & Podcast','dlvr.it','BehloolBot','7Siters',
        'AwarioSmartBot','Apache-HttpClient/5','Seekport Crawler','AHC/2.1','eCairn-Grabber','mediawords bot','PHP-Curl-Class','Scrapy','curl/7','Blackboard','NetNewsWire','node-fetch','admantx','metadataparser','Domains Project','SerendeputyBot','Moreover',
        'DuckDuckGo','monitoring-plugins','Selfoss','Adsbot','acebookexternalhit','SpiderLing','Cocolyzebot','TTD-Content','superfeedr','Twingly','Google-Apps-Scrip','LinkpadBot','CensysInspect','Reeder','tweetedtimes','Amazonbot','MauiBot','Symfony BrowserKit',
        'DataForSeoBot','GoogleProducer','TinEye-bot-live','sindresorhus/got','CriteoBot','Down/5','Yahoo Ad monitoring','MetaInspector','PetalBot','MetadataScraper','Cloudflare SpeedTest','aiohttp','AppEngine-Google','heritrix','sqlmap','Buck','wp_is_mobile',
        '01h4x.com','404checker','404enemy','AIBOT','ALittle Client','ASPSeek','Aboundex','Acunetix','AfD-Verbotsverfahren','AiHitBot','Aipbot','Alexibot','AllSubmitter','Alligator','AlphaBot','Anarchie','Anarchy','Anarchy99','Ankit','Anthill','Apexoo','Aspiegel',
        'Asterias','Atomseobot','Attach','AwarioRssBot','BBBike','BDCbot','BDFetch','BackDoorBot','BackStreet','BackWeb','Backlink-Ceck','Badass','Bandit','BatchFTP','Battleztar Bazinga','BetaBot','Bigfoot','Bitacle','BlackWidow','Black Hole','Blow','BlowFish',
        'Boardreader','Bolt','BotALot','Brandprotect','Brandwatch','Buddy','BuiltBotTough','BuiltWith','Bullseye','BunnySlippers','BuzzSumo','CATExplorador','CODE87','CSHttp','Calculon','CazoodleBot','Cegbfeieh','CheTeam','CheeseBot','CherryPicker','ChinaClaw',
        'Chlooe','Citoid','Claritybot','Cloud mapping','Cogentbot','Collector','Copier','CopyRightCheck','Copyscape','Cosmos','Craftbot','Crawling at Home Project','CrazyWebCrawler','Crescent','CrunchBot','Curious','Custo','CyotekWebCopy','DBLBot','DIIbot',
        'DSearch','DTS Agent','DataCha0s','DatabaseDriverMysqli','Demon','Deusu','Devil','Digincore','DigitalPebble','Dirbuster','Disco','Discobot','Discoverybot','Dispatch','DittoSpyder','DnBCrawler-Analytics','DnyzBot','DomCopBot','DomainAppender',
        'DomainSigmaCrawler','Dotbot','Download Wonder','Dragonfly','Drip','ECCP/1.0','EMail Siphon','EMail Wolf','EasyDL','Ebingbong','Ecxi','EirGrabber','EroCrawler','Evil','Express WebPictures','ExtLinksBot','Extractor','ExtractorPro','Extreme Picture Finder',
        'EyeNetIE','FDM','FHscan','Fimap','Firefox/7.0','FlashGet','Flunky','Foobot','Freeuploader','FrontPage','Fuzz','FyberSpider','Fyrebot','G-i-g-a-b-o-t','GT::WWW','GalaxyBot','Genieo','GermCrawler','GetRight','GetWeb','Getintent','Gigabot','Go!Zilla',
        'Go-Ahead-Got-It','GoZilla','Gotit','GrabNet','Grabber','Grafula','GrapeFX','GridBot','HEADMasterSEO','HMView','HTMLparser','HTTP::Lite','HTTrack','Haansoft','HaosouSpider','Harvest','Havij','Hloader','HonoluluBot','Humanlinks','HybridBot','IDBTE4M',
        'IDBot','IRLbot','Iblog','Id-search','IlseBot','Image Fetch','Image Sucker','IndeedBot','Indy Library','InfoNaviRobot','InfoTekies','Intelliseek','InterGET','InternetSeer','Internet Ninja','Iria','Iskanie','IstellaBot','JOC Web Spider','JamesBOT','Jbrofuzz',
        'JennyBot','JetCar','Jetty','JikeSpider','Joomla','Jorgee','JustView','Jyxobot','Kenjin Spider','Keybot Translation-Search-Machine','Keyword Density','Kinza','Kozmosbot','LNSpiderguy','LWP::Simple','Lanshanbot','Larbin','Leap','LeechFTP','LeechGet','LexiBot',
        'Lftp','LibWeb','Libwhisker','LieBaoFast','Lightspeedsystems','Likse','LinkScan','LinkWalker','Linkbot','LinkextractorPro','LinksManager','LinqiaMetadataDownloaderBot','LinqiaRSSBot','LinqiaScrapeBot','Lipperhey','Lipperhey Spider','Litemage_walker','Lmspider',
        'MFC_Tear_Sample','MIDown tool','MIIxpc','MQQBrowser','MSFrontPage','MSIECrawler','MTRobot','Mag-Net','Magnet','Majestic-SEO','Majestic12','Majestic SEO','MarkMonitor','MarkWatch','Mass Downloader','Masscan','Mata Hari','Mb2345Browser','MeanPath Bot',
        'Meanpathbot','Metauri','MicroMessenger','Microsoft Data Access','Microsoft URL Control','Minefield','Mister PiX','Moblie Safari','Mojeek','Mojolicious','MolokaiBot','Morfeus Fucking Scanner','Mozlila','Mr.4x3','Msrabot','Musobot','NICErsPRO','NPbot',
        'Name Intelligence','Nameprotect','Navroad','NearSite','Needle','Nessus','NetAnts','NetLyzer','NetMechanic','NetSpider','NetZIP','Net Vampire','Netcraft','Nettrack','Netvibes','NextGenSearchBot','Nibbler','Niki-bot','Nikto','NimbleCrawler','Nimbostratus',
        'Ninja','Nmap','Nuclei','Octopus','Offline Explorer','Offline Navigator','OnCrawl','OpenLinkProfiler','OpenVAS','Openfind','Openvas','OrangeBot','OrangeSpider','OutclicksBot','OutfoxBot','PECL::HTTP','PHPCrawl','POE-Component-Client-HTTP','PageAnalyzer',
        'PageGrabber','PageScorer','PageThing.com','Page Analyzer','Pandalytics','Panscient','Papa Foto','Pavuk','PeoplePal','Petalbot','Pi-Monster','Picscout','Picsearch','PictureFinder','Piepmatz','Pimonster','Pixray','PleaseCrawl','Pockey','ProPowerBot','ProWebWalker',
        'Probethenet','Psbot','Pu_iN','Pump','PxBroker','PyCurl','QueryN Metasearch','Quick-Crawler','RSSingBot','RankActive','RankActiveLinkBot','RankFlex','RankingBot','RankingBot2','Rankivabot','RankurBot','Re-re','ReGet','RealDownload','Reaper','RebelMouse','Recorder',
        'RedesScrapy','RepoMonkey','Ripper','RocketCrawler','Rogerbot','SBIder','SEOlyticsCrawler','SEOprofiler','SEOstats','SISTRIX','SMTBot','SalesIntelligent','ScanAlert','Scanbot','ScoutJet','Screaming','ScreenerBot','ScrepyBot','Searchestate','Seekport','SemanticJuice',
        'Semrush','SentiBot','SeoSiteCheckup','SeobilityBot','Seomoz','Shodan','Siphon','SiteCheckerBotCrawler','SiteExplorer','SiteLockSpider','SiteSnagger','SiteSucker','Site Sucker','Sitebeam','Siteimprove','Sitevigil','SlySearch','SmartDownload','Snake','Snapbot',
        'Snoopy','SocialRankIOBot','Sociscraper','Sosospider','Sottopop','SpaceBison','Spammen','SpankBot','Spanner','Spbot','SputnikBot','Sqlmap','Sqlworm','Sqworm','Steeler','Stripper','Sucker','Sucuri','SuperBot','SuperHTTP','Surfbot','SurveyBot','Suzuran',
        'Swiftbot','Szukacz','T0PHackTeam','T8Abot','Teleport','TeleportPro','Telesoft','Telesphoreo','Telesphorep','TheNomad','The Intraformant','Thumbor','TightTwatBot','Titan','Toata','Toweyabot','Tracemyfile','Trendiction','Trendictionbot','True_Robot','Turingos',
        'Turnitin','TurnitinBot','TwengaBot','Twice','Typhoeus','URLy.Warning','URLy Warning','UnisterBot','Upflow','V-BOT','VB Project','VCI','Vacuum','Vagabondo','VelenPublicWebCrawler','VeriCiteCrawler','VidibleScraper','Virusdie','VoidEYE','Voil','Voltron',
        'WASALive-Bot','WBSearchBot','WEBDAV','WISENutbot','WPScan','WWW-Collector-E','WWW-Mechanize','WWW::Mechanize','WWWOFFLE','Wallpapers','Wallpapers/3.0','WallpapersHD','WeSEE','WebAuto','WebBandit','WebCollage','WebCopier','WebEnhancer','WebFetch','WebFuck',
        'WebGo IS','WebImageCollector','WebLeacher','WebPix','WebReaper','WebSauger','WebStripper','WebSucker','WebWhacker','WebZIP','Web Auto','Web Collage','Web Enhancer','Web Fetch','Web Fuck','Web Pix','Web Sauger','Web Sucker','Webalta','WebmasterWorldForumBot',
        'Webshag','WebsiteExtractor','WebsiteQuester','Website Quester','Webster','Whack','Whacker','Whatweb','Who.is Bot','Widow','WinHTTrack','WiseGuys Robot','Wonderbot','Woobot','Wotbox','Wprecon','Xaldon WebSpider','Xaldon_WebSpider','Xenu','YoudaoBot','Zade',
        'Zauba','Zermelo','Zeus','Zitebot','ZmEu','ZoomBot','ZumBot','ZyBorg','arquivo-web-crawler','arquivo.pt','autoemailspider','backlink-check','cah.io.community','check1.exe','clark-crawler','coccocbot','cognitiveseo','com.plumanalytics','crawl.sogou.com',
        'crawler.feedback','crawler4j','dataforseo.com','demandbase-bot','domainsproject.org','eCatch','evc-batch','facebookscraper','gopher','instabid','internetVista monitor','ips-agent','isitwp.com','iubenda-radar','lwp-request','lwp-trivial','meanpathbot',
        'mediawords','muhstik-scan','oBot','page scorer','pcBrowser','plumanalytics','polaris version','probe-image-size','ripz','s1z.ru','satoristudio.net','scan.lol','seobility','seocompany.store','seoscanners','seostar','sexsearcher','sitechecker.pro',
        'siteripz','sogouspider','sp_auditbot','spyfu','sysscan','tAkeOut','trendiction.com','trendiction.de','ubermetrics-technologies.com','voyagerx.com','webgains-bot','webmeup-crawler','webpros.com','webprosbot','x09Mozilla','x22Mozilla','xpymep1.exe','zauba.io',
        'zgrab','petalsearch','protopage','Miniflux','Feeder','Semanticbot','ImageFetcher','Mastodon','Neevabot','Pleroma','Akkoma','koyu.space','Embedly','Mjukisbyxor','Giant Rhubarb','GozleBot','Friendica','WhatsApp','XenForo','Yeti','MuckRack','PhxBot','Bytespider',
        'GPTBot','SummalyBot','LinkedInBot','SpiderWeb','SpaceCowboys','LCC','Paqlebot','SeznamBot','SeznamHomepage','WP Fastest Cache',
        'ChatGPT','Google-Extended','GoogleOther','anthropic','Claude-Web','cohere-ai','Diffbot','FacebookBot','ImagesiftBot','PerplexityBot','Omigili','yacybot','RepoLookoutBot','StractBot','IABot','rss-is-dead','Slackbot',
        'Google-InspectionTool','Storebot-Google','Google-InspectionTool','APIs-Google','AdsBot-Google','Mediapartners-Google','Google-Safety','WellKnownBot','ArchiveBot','Sogou','iaskspider','Qwantbot','keys-so-bot','OAI-SearchBot',
        'bot','spider','crawl',
    );
    // Return the first list entry found in the user agent (case-insensitive), or "" if none matches.
    foreach ($bots as $b) {
        if (stripos($sistema, $b) !== false) {
            return $b;
        }
    }
    return "";
}

function ga4bottracking() {
    // Not every request carries a User-Agent or Referer header, so default to "".
    $userAgent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
    $botname = is_bot($userAgent);
    if ($botname == "") {
        return; // not a listed bot: nothing to send
    }
    $documentReferer = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '';

    // Values are left raw here; http_build_query() URL-encodes them below.
    $ga4Params = array();
    $ga4Params['v'] = "2";
    $ga4Params['tid'] = 'G-XXXXX'; // <----- your GA4 measure ID
    $ga4Params['gcs'] = 'G101';
    $ga4Params['gcd'] = '13t3t3t2t5';
    $ga4Params['npa'] = '0';
    $ga4Params['dma_cps'] = 'sypham';
    $ga4Params['dma'] = '1';
    $ga4Params['_rdi'] = '0';
    $ga4Params['tt'] = 'antispam'; // <------ your unique antispam traffic_type
    $ga4Params['cid'] = "5555";
    $ga4Params['ecid'] = "5555";
    $ga4Params['uid'] = 'anonymous';
    $ga4Params['ul'] = 'en-us';
    $ga4Params['sr'] = '1x1';
    $ga4Params['ur'] = '';
    $ga4Params['pscdl'] = 'noapi';
    $ga4Params['sid'] = floor(microtime(true) * 1000);
    $ga4Params['_p'] = rand(1000000000, 2147483647);
    $ga4Params['dt'] = 'anonymous';
    $ga4Params['dl'] = "http://" . $_SERVER['HTTP_HOST'] . $_SERVER['REQUEST_URI'];
    $ga4Params['dr'] = $documentReferer;
    $ga4Params['cs'] = $botname;
    $ga4Params['cn'] = $botname;
    $ga4Params['cm'] = 'bot';
    $ga4Params['seg'] = '0';
    $ga4Params['_ss'] = '1';
    $ga4Params['_fv'] = '1';
    $ga4Params['en'] = 'page_view';
    $ga4Params['ep.bot_name'] = $botname;
    $ga4Params['ep.http_code'] = http_response_code();
    $ga4Params['ep.referrer'] = $documentReferer;
    $ga4Params['ep.user_agent'] = $userAgent;
    if (strlen($userAgent) > 100) {
        // Free GA4 limits values to 100 characters, so the rest goes into a second parameter.
        $ga4Params['ep.user_agent2'] = substr($userAgent, 100);
    }

    $gurl = 'https://region1.google-analytics.com/g/collect';
    $utmUrl = $gurl . "?" . http_build_query($ga4Params);

    $ch = curl_init();
    curl_setopt($ch, CURLOPT_USERAGENT, "notset");
    curl_setopt($ch, CURLOPT_URL, $utmUrl);
    curl_setopt($ch, CURLOPT_HTTPHEADER, array('Content-Type: text/plain'));
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // keep the response out of the page output
    curl_setopt($ch, CURLOPT_TIMEOUT, 2);           // do not let a slow request hold up the page
    curl_exec($ch);
    curl_close($ch);
}
Here you need to replace the GA4 measurement ID with one that you created just for this purpose in a separate Google Analytics 4 property. Do not mix this data with the normal data collected from your website.
$ga4Params['tid'] = 'G-XXXXX';
Optionally, you can also change the antispam phrase on the following line, which names the traffic type (traffic_type) in GA4.
$ga4Params['tt'] = 'antispam';
When you update the script later, you can add new crawlers to the array in the is_bot function. (Bots list – GitHub)
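One detail to watch when extending the list: is_bot returns the first entry that matches, and the list ends with the generic fragments 'bot', 'spider' and 'crawl'. A new name appended after those will be reported under the generic label whenever the user agent also contains "bot". The sketch below shows the effect with a shortened list; `first_match()` is a stand-in that mirrors the loop in is_bot, and `ExampleBot` is a made-up crawler name.

```php
<?php
// Same matching rule as is_bot(): case-insensitive, first match wins.
function first_match(string $userAgent, array $names): string
{
    foreach ($names as $name) {
        if (stripos($userAgent, $name) !== false) {
            return $name;
        }
    }
    return '';
}

$ua = 'Mozilla/5.0 (compatible; ExampleBot/1.0)'; // made-up user agent

// Appended after the generic entries: the generic 'bot' matches first.
$appended = ['Googlebot', 'bot', 'spider', 'crawl', 'ExampleBot'];
echo first_match($ua, $appended), "\n"; // bot

// Inserted before the generic entries: the specific name is reported.
$inserted = ['Googlebot', 'ExampleBot', 'bot', 'spider', 'crawl'];
echo first_match($ua, $inserted), "\n"; // ExampleBot
```

So new crawler names are best added anywhere before the final 'bot','spider','crawl' entries.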
Adding code to header.php
Next, insert the following code into header.php. It calls the function you added to functions.php, so the code runs on every page of the site.
<?php ga4bottracking(); ?>
Is it possible to put it on PHP sites other than WordPress?
Of course. Most of the time you put the same code in the main template of your PHP site. The whole thing is built from basic PHP functions and, apart from the cURL extension, has no other dependencies.
Questions and Answers
What are the risks?
If you’re not proficient with PHP, adding code inappropriately can crash the frontend of the site. With WP, just remove the code and everything is back to normal.
Why don’t I need consent to measure?
I deliberately measure only bots that identify themselves by user-agent as bots / crawlers of my site content. I don’t use any IP addresses etc. All I use is the user-agent. Because it is demonstrably robots and not humans that are measured, no consent is needed.
Known problems
Updating templates: if you update a template, the code may be deleted along with it, so after every template update check that the code is still there. That’s why I recommend setting up a watchdog in GA4 that sends you an email notification when the code disappears or traffic drops by 80% or more. You set this up at the bottom of the GA4 home page.
I need help with this.
Sorry, I don’t provide any support for this; take it as inspiration.
Web speed
Because this runs in PHP on the server and does nothing for visitors that are not bots beyond the user-agent check, the slowdown of the site is almost nil.
I’m not getting much data or some pages are missing
If your site serves cached content, those hits are not measured, because the cache delivers the page, not PHP. This is neither good nor bad. It mainly affects sites that are completely hidden behind a proxy; if you have normal caching on your site, it measures just fine.
Does it measure all pages?
It measures pages with HTTP status codes 200 and 404. It does not measure server errors (50x) or redirects (30x).
Why did I choose PHP and not JavaScript / GTM?
In most cases robots don’t run JavaScript, so you can’t use GTM for this. You can build something similar on a GTM server and measure it there via an embedded image, but that’s a different league, a different sport. This was meant to be simple and for everyone.
Conclusion
It’s not complicated, but it can help you quite a bit when dealing with site indexing issues, which was the reason I wrote this.
Update 2024-07-19
Data from the GA4 BigQuery export: a sample of how quickly this article was found by crawlers and how many times it has already been downloaded.
Btw, SeznamHomepage bot visits me, but their SeznamBot crawler does not and I am not even indexed on Seznam for the given article.
And the entire website is built on WordPress and is fast.
After those 17 hours, I put it on social media. Until then, it was kind of a soft launch.
And since I think more people should play with it, here’s an SQL query to build a similar table:
select
  event_name,
  event_date,
  event_timestamp,
  (select as struct * from (
      select
        ep.key key,
        concat(
          coalesce(ep.value.string_value, ""),
          coalesce(cast(ep.value.int_value as string), ""),
          coalesce(cast(ep.value.float_value as string), ""),
          coalesce(cast(ep.value.double_value as string), "")
        ) value
      from unnest(event_params) ep
    ) ep
    pivot (string_agg(value) for ep.key in (
      "page_location", "bot_name", "user_agent", "user_agent2", "page_referrer"
    ))
  ) params
from `analytics_99999999.events_*`
where _table_suffix between '20240711' and '20270719'
  and event_name LIKE '%page_view%'
Before running it, replace 99999999 in analytics_99999999.events_* with your GA4 property ID and set your own dates.
The output will be a simple flat table with columns:
event_name
event_date
event_timestamp
page_location
bot_name
user_agent
user_agent2
page_referrer
From this you can easily build a report, for example in Google Sheets connected to Google BigQuery.
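As one example of such a report, the sketch below finds the first time each bot requested each page. It assumes the rows have already been loaded into PHP as arrays keyed by the column names above (for instance from a CSV export), that event_timestamp is in microseconds since the Unix epoch as in the GA4 BigQuery export, and that PHP runs on a 64-bit system. The function name `first_crawl_per_bot()` and the sample rows are hypothetical.

```php
<?php
// Return [page_location][bot_name] => earliest event_timestamp (microseconds).
function first_crawl_per_bot(array $rows): array
{
    $first = [];
    foreach ($rows as $row) {
        $page = $row['page_location'];
        $bot  = $row['bot_name'];
        $ts   = (int) $row['event_timestamp'];
        if (!isset($first[$page][$bot]) || $ts < $first[$page][$bot]) {
            $first[$page][$bot] = $ts;
        }
    }
    return $first;
}

// Made-up sample rows, only to show the shape of the input.
$rows = [
    ['page_location' => 'https://example.com/a', 'bot_name' => 'Googlebot', 'event_timestamp' => '1721400000000000'],
    ['page_location' => 'https://example.com/a', 'bot_name' => 'bingbot',   'event_timestamp' => '1721403600000000'],
    ['page_location' => 'https://example.com/a', 'bot_name' => 'Googlebot', 'event_timestamp' => '1721390000000000'],
];

foreach (first_crawl_per_bot($rows) as $page => $bots) {
    foreach ($bots as $bot => $ts) {
        // intdiv() converts microseconds to whole seconds for date formatting.
        echo $page, ' ', $bot, ' ', gmdate('Y-m-d H:i:s', intdiv($ts, 1000000)), "\n";
    }
}
```

Comparing these first-crawl times with the time a page was published gives a rough measure of how quickly each crawler picks up new content.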
The last update was 7/19/2024 6:04:32 PM, so we’ll see how quickly the search engines find it.