
Counting website visitors is hard

Counting visitors to your website is surprisingly hard. [footnote 1]

In the past 30 days, Cloudflare says there were about 1,170 unique visitors to bobbiechen.com. Over that same time period, Squarespace Analytics says there were just 51 unique visitors. That's more than a 20x difference. What the heck?

Understanding visitor count metrics

Usually when we want to know how many visitors view our website, we really mean real human eyeballs looking at our page. But when a human visits a website, it triggers a Rube Goldberg machine of computers talking to other computers and doing things in response, which can be hard to distinguish from an automated script ("bot") visiting that same website. Although Cloudflare and Squarespace both use the name "unique visitors", they actually measure different parts of that complex website-fetching process, which of course results in wildly different final numbers.

Cloudflare defines unique visitors as the number of "unique IP addresses requesting content from your site". Every internet-connected device has an IP address - it is used to send messages to that device using the Internet Protocol. This includes bots like the web crawlers search engines use to discover website content. We don't care about those, so this number overcounts our desired metric of human views. A person's IP address can also change due to dynamic address allocation by their mobile carrier or ISP, but this is usually infrequent - maybe weekly, maybe monthly, or whenever a Wi-Fi router is rebooted.

Meanwhile, Squarespace defines "unique visitors" as an "estimate of the total number of actual visitors that reached your site", identified using Javascript and a browser cookie with a two-year expiration. Almost all real humans browse with Javascript and cookies enabled. [footnote 2] Cookies are little pieces of data stored locally by your browser on your device, so if you switch from your laptop to your phone, from Chrome to Safari, clear your cookies, or use your browser's Private Browsing ("Incognito Mode"), you will be counted as a separate visitor each time. These cookies expire in two years, so we shouldn't see major effects from cookie expiration in the one-month window we're looking at. Most robots do not execute Javascript, so they will not be counted by this method. So far, so good - sounds like this is exactly what we're looking for. What's the catch?

Well, Javascript-based analytics are often blocked by adblockers like uBlock Origin. Estimates of how many people use an adblocker vary from 27% to 47%, according to Statista, Hootsuite [email required], and GWI [email required]. It's likely lower on mobile, since Chrome does not offer adblocking (EDITED: see [correction 1]). When I load my own website and view the uBlock Origin logs, I can see that Squarespace's analytics script is blocked thanks to the default Easy Privacy blocklist, which also blocks Google Analytics, the Facebook tracking pixel, and lesser-known names like Woopra and KissMetrics. [footnote 3] So, our Squarespace numbers are undercounted due to adblocking users, and we should multiply them by (100 / (100 - 27)) to (100 / (100 - 47)), or about 1.3-1.9x. That's for the general population, and anecdotally I expect that younger and more tech-savvy users are even more likely to block ads.
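That correction can be sketched in a few lines; this is illustrative arithmetic only, with the 27-47% survey range as the sole input:

```python
# Back-of-envelope correction for adblock undercounting.
# Assumption: adblock users are completely invisible to
# Javascript-based analytics (a simplification).
def adblock_factor(adblock_rate_pct):
    """Multiplier to scale a JS-analytics count up to all visitors."""
    return 100 / (100 - adblock_rate_pct)

low = adblock_factor(27)   # ~1.37, rounded to 1.3x in the text
high = adblock_factor(47)  # ~1.89, rounded to 1.9x in the text
```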

Here's a quick summary of some differences between the two methods:

Type of traffic                          Unique IP address   Javascript + 2-year-expiration cookie
Automated bots                           Counted             Not counted (mostly)
User with an adblocker                   Counted             Not counted (mostly)
Two devices on the same home Wi-Fi       Counts one          Counts two
One incognito, one normal session        Counts one          Counts two
Two different devices, different Wi-Fi   Counts two          Counts two

Narrowing the range of possibilities

Where does that leave us? We know that due to the 1.3x adblock factor, the 51 visitors measured by Squarespace probably represent at least 66 real human visitors. This will overcount some people who visit from multiple browsers, devices, Wi-Fi networks, or incognito/private browsing sessions, but we can't distinguish them using this method. [footnote 4] We also know that due to bot traffic, the 1,170 visitors measured by Cloudflare is definitely an overcount. This implies that the real number of humans who visited my website in the last 30 days is between 66 and 1,170. That's a 17x difference between the min and max values, which is pretty unsatisfying.
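The bounds so far, as a quick sanity check (using the low-end 1.3x factor from the surveys above):

```python
# Lower bound: Squarespace's count scaled by the low-end 1.3x
# adblock factor. Upper bound: Cloudflare's raw unique-IP count.
squarespace_visitors = 51
cloudflare_visitors = 1170

lower_bound = round(squarespace_visitors * 1.3)  # 66 real humans, minimum
spread = cloudflare_visitors / lower_bound       # ~17.7x min-to-max spread
```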

Luckily, I managed to accidentally create conditions earlier this month that will help narrow this range. Here's an extremely rigorous SWAG (scientific wild-ass guess) for a tighter bound on monthly visitors. Normally, I don't actively promote my blog at all. But on December 4th, I spotted a Hacker News post where someone was trying to teach their kids to code, and linked my post Opportunities in the comments because I thought it was relevant. This caused a noticeable spike in my modest traffic numbers:

The Cloudflare Analytics dashboard above shows about 60 visitors on a normal day. I can take the peak traffic (112) on December 4th, subtract the baseline traffic (about 60), and see that posting on Hacker News apparently brought about 50 new visitors that day, as measured by Cloudflare.

The Squarespace Analytics dashboard above lets me drill down and see that five requests in the orange "Social" category were referred from Hacker News. [footnote 5] Combined with the Cloudflare data, what can we conclude?

Of the 50 Cloudflare visitors, we know 6 of them are from non-Hacker-News sources (based on Squarespace), so let's remove them, leaving 44. There are probably a few bots that crawl Hacker News comments - for the sake of having a nice round number let's say there were 4 of those. That leaves 40 visitors as measured by Cloudflare who are presumably real people from Hacker News. 5 of those visitors had Javascript and cookies and didn't adblock Squarespace Analytics; that implies that the remaining 35 people (7 out of 8, or 87.5%) were not tracked using Squarespace Analytics.
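The spike-day arithmetic above, laid out with the guessed numbers labeled as guesses:

```python
# Spike-day breakdown. The bot count is a pure guess, chosen
# to leave a round number, as described in the text.
peak, baseline = 112, 60   # Cloudflare visitors: Dec 4 vs. a normal day
spike = peak - baseline    # ~52, rounded to ~50 below
non_hn = 6                 # spike-day Squarespace requests not from HN
assumed_bots = 4           # guessed, for the sake of a round number

hn_humans = 50 - non_hn - assumed_bots   # 40 presumed-real HN visitors
tracked = 5                              # HN referrals Squarespace saw
untracked_share = (hn_humans - tracked) / hn_humans  # 0.875, i.e. 7 of 8
```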

87.5% is quite high compared to the 27-47% figures I mentioned earlier. But I argue that it is plausible: Hacker News is among the most technical spaces on the Internet, with commenters frequently discussing adblockers and NoScript. [footnote 6] This lets me narrow my upper bound on website visitors. Assuming my visitors use adblockers at most as much as Hacker News users do, then at most about 51 * (100 / (100 - 87.5)) = 408 of my monthly visitors are real humans.
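This is the same scaling formula as the adblock correction earlier, just with a much higher rate:

```python
# New upper bound: scale the monthly Squarespace count by the
# HN-derived 87.5% untracked share.
measured = 51
untracked_pct = 87.5
upper_bound = measured * 100 / (100 - untracked_pct)  # 408.0
```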

I'll drop some significant figures to represent the incredible handwaving I've done here, to present my new bounds:

  • My website sees 1200 unique IP addresses per month (as measured by Cloudflare)
  • 60 to 400 of them are real humans, which means that
  • 800 to 1140 of them are bots of some kind

Was this useful?

Honestly, not really, besides satisfying some of my curiosity. I did manage to decrease the spread from min to max values to just 7x (down from 17x) by reducing the upper bound from 1200 to 400. I could've directly used the 1.9x figure for adblock use to get an upper bound of 97, but I was a bit skeptical that only 47% of my site visitors use an adblocker. I also could have just made up a number for maximum (un)reasonable adblock prevalence and gotten similar results instead of actually looking at the data. But I am much more confident in these results (say, 95% confident that the true value lies in this range) versus using the 1.9x adblock factor (which I'd guesstimate at 85% confidence) or taking the original Cloudflare upper bound (which is extremely confident, but only by being way too large of a range to really say anything).

If I were willing to put actual time, money, and effort into checking this, I might stand in the street and pay people to actually visit my website, and then compare the resulting analytics numbers with the ground truth of how many people I paid. This would probably be skewed because 1) adblocking is rarer on mobile, and 2) the audience would necessarily be people who will talk to a crazy person in the street (me) and do what I ask them to. To be fair, 2) is probably an issue with industry adblocking surveys that I cited above, so maybe that's not a big loss. The good news is that I am not willing to do that, so we'll just have to wonder forever. Thanks for reading.


I posted this to Hacker News, which led to a larger traffic spike. I wrote about that and its new and exciting insights in this follow-up post.

I've probably done some questionable things in this post. If you spot anything wrong, whether it's a trivial typo or that my approach is fundamentally flawed, please let me know via the contact form on my home page (or whatever method you normally contact me by).

Thanks to Yee Aun and Joanne for inspiration and feedback on this post. Thanks HN user geuis for the correction re: adblocking actually being available on mobile browsers.


Corrections

[correction 1] (back to content)

Originally this sentence incorrectly said that "Firefox is the only major browser that has any sort of adblocker integration". But as HN user geuis points out, numerous mobile browsers do offer adblockers. iOS Safari, Samsung Internet, Opera, and UC Browser all offer adblock integration, just like Firefox.

The remaining major mobile browser, Chrome, does not offer any adblocking solution, and it makes up 40% of US mobile browsers (source: Statista). I think this is where I got my wrong impression; I initially switched to Firefox on mobile specifically so I could continue using uBlock Origin on my phone.

40% of mobile users in the US is still a lot, but it's nowhere near "every browser besides Firefox" (which would be something like 99%). Thanks for the correction, and I'll check statements like this more closely in the future.


Footnotes

[footnote 1] (back to content)

I didn't intend to have any analytics here; as I've mentioned before, I use Squarespace because it makes it easy to share writing and music without messing with technology too much. The website is behind Cloudflare because it slightly improves load times from what I could measure. I only reluctantly discovered the analytics features by clicking around on all the settings one day.

This is really a mixed bag: my safety bubble of imagining no one reads my blog has been popped by these easily-accessible and semi-useful metrics, and I put literally zero effort into setting them up. I've already been surprised in real life by people mentioning they've read my blog, so it's okay; I'll just not let self-conscious feelings hold me back from writing (spoilers) 2000-word blog posts that may or may not actually have any useful conclusion.


[footnote 2] (back to content)

Honestly [citation needed] on this statement, but my impression is that it's incredibly likely. Both Javascript and cookies are enabled by default in all major browsers, and many websites today will just not work with Javascript disabled. There does exist a certain type of power user that disables Javascript by default through NoScript or similar, but they are a small minority (and I say this as someone who runs Linux on my personal laptop).


[footnote 3] (back to content)

If you're curious, the mechanism is that a tracking script sends data to an analytics server somewhere. In-browser adblockers like uBlock Origin block those requests by matching them against filter lists before they're ever sent; network-level blockers like Pi-hole instead answer DNS lookups (the "address book" of the internet) for tracker domains with a dead-end address, so the data goes nowhere.
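A toy illustration of list-based blocking (the hostnames and the matching logic here are deliberately simplified; real filter lists like EasyPrivacy use a much richer rule syntax):

```python
from urllib.parse import urlparse

# Toy list-based request blocking. Real adblockers match each
# outgoing request against thousands of rules inside the browser,
# before the request is ever sent.
BLOCKED_HOSTS = {"www.google-analytics.com", "connect.facebook.net"}

def should_block(url):
    """Return True if the request's hostname is on the blocklist."""
    return urlparse(url).hostname in BLOCKED_HOSTS

should_block("https://www.google-analytics.com/analytics.js")  # True
should_block("https://bobbiechen.com/blog")                    # False
```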

There are some uncommon tricks that websites can use to sneak their analytics scripts past user adblockers, like CNAME cloaking and proxying the requests through the website itself. Iain Bean explains these better than I can in their blog post, The shady world of Google Analytics proxying.


[footnote 4] (back to content)

I do mean "will" here: I know for a fact because my partner told me she does some of these things when reading my blog, so she makes up at least a few of those 51. Hi Yee Aun!

And no, I will not attempt to correct for this known same-user, different-devices scenario. I respect Yee Aun's dedication to inflating my visitor numbers, and I think the real solution here is to simply increase traffic until those are just rounding errors.


[footnote 5] (back to content)

The other six requests that day are broken down as:

  • 2 from Facebook (definitely Yee Aun)
  • 3 from direct requests to bobbiechen.com, and
  • 1 from a Google search (these tend to be related to the semaphore decoder)

[footnote 6] (back to content)

This is mainly anecdotal, but here are some search terms with similar comment volume (as measured by HN's Algolia search result count).

The below table is for the time frame one month before this post, i.e. late November through late December 2021. Approximately 100,000 comments were made in this time frame across the entire site.

Comment count   Relevant term    Terms with similar count
240             ublock origin    angular, dark mode, smart contract
200             adblock          ansible, jquery, ocaml
100             pihole           browser extension, opera, qr code
70              noscript         bitwarden, dark pattern, rss reader

To me, this paints a picture that adblocking tools are not uncommon discussion topics - at least, no more uncommon than the other terms listed. I tried to pick search terms that fairly distinctly referred to a specific thing (vs. something like "React", which in addition to being a technology is also a real English word, believe it or not).

As an absolute number or proportion of comments, it's not nearly as large as hot topics like Microsoft (2,300), IntelliJ (1,700), or log4j (830); but the topic comes up often enough that I would expect almost all regular HN users to be aware of adblockers and at least consider their use. This should increase usage compared to the general population, many of whom are entirely unaware that it is possible to block ads client-side.
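These counts could be reproduced with the public Algolia Hacker News search API. The sketch below only builds the query URL; fetching it returns JSON whose nbHits field is the comment count. The endpoint and parameter names follow the documented API, but treat the exact query shape as an assumption:

```python
from datetime import datetime, timezone
from urllib.parse import urlencode

def hn_comment_count_url(query, start, end):
    """Build an Algolia HN search URL counting comments matching
    `query` between two datetimes (nbHits in the JSON response)."""
    params = {
        "query": query,
        "tags": "comment",
        "numericFilters": "created_at_i>{},created_at_i<{}".format(
            int(start.timestamp()), int(end.timestamp())),
    }
    return "https://hn.algolia.com/api/v1/search?" + urlencode(params)

# The one-month window before this post, late Nov through late Dec 2021.
url = hn_comment_count_url(
    "ublock origin",
    datetime(2021, 11, 22, tzinfo=timezone.utc),
    datetime(2021, 12, 22, tzinfo=timezone.utc),
)
```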



Bobbie Chen