NEVER click on a link that looks like that

Every time one of my posts on this journal ends up somewhere on Reddit, Twitter, Nostr or Hacker News, lots of people seem to be irritated by the site’s URL. Hence, let me give a quick introduction to what’s called Punycode, and explain why I’m using this domain name.


Whenever you launch your favorite browser and type in e.g. soylentnews.org, the browser needs to look up the server from which it can request the website. On the internet there are a gazillion servers running 24/7, each hosting different websites. Unfortunately, those servers aren’t reached by actual names like soylentnews.org, but instead by something called an IP address. It’s sort of an internet phone number. Sort of. Anyway.

In order for the browser to know a website’s IP address, it needs to look it up first, using something called the Domain Name System, or DNS for short. The DNS is basically the phone book of the internet and contains a huge table in which every website’s domain is mapped to the IP address of its server(s). In other words, the DNS translates a website’s domain, e.g. soylentnews.org, into its IP address, e.g. 23.239.29.31.

When this internet phone book (DNS) was initially created, it only allowed for a limited set of ASCII characters to be used in host (www) and domain (soylentnews.org) names. With the growth of the internet and its reach into non-English-speaking countries, however, the need arose for international domain names that could contain Unicode characters like á or ț.

To represent these characters throughout the ASCII-based services that form the internet, an ASCII representation of such Unicode domains was needed, one that systems like the DNS could use. This encoding is called Punycode, an instance of the more general algorithm called Bootstring, as described in RFC 3492. Punycode “allows strings composed from a small set of ‘basic’ code points to uniquely represent any string of code points drawn from a larger set.”

So how does this look in practice? Assuming you type málaga.es, which is the website of the Málaga municipality in Spain, then your browser will convert this into xn--mlaga-xqa.es, which is a valid ASCII-character domain, and it will continue to send the converted domain to the DNS in order to retrieve its IP address and open the website.
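The conversion can be reproduced with Python's built-in idna codec. This is only a minimal sketch and not what any particular browser does internally: the codec follows the older IDNA 2003 rules, and the helper name to_ascii is made up for illustration.

```python
def to_ascii(domain: str) -> str:
    """Convert a Unicode domain name to its ASCII (xn--) form.

    Relies on Python's built-in "idna" codec (IDNA 2003), which encodes
    each dot-separated label separately and leaves ASCII labels untouched.
    """
    return domain.encode("idna").decode("ascii")


print(to_ascii("málaga.es"))        # xn--mlaga-xqa.es
print(to_ascii("soylentnews.org"))  # soylentnews.org (already ASCII)
```

Only the label that contains non-ASCII characters changes; the .es part stays as it is.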

Rule of thumb: the Punycode-encoded parts of an international domain name are prefixed with xn--, meaning that whenever you see a domain (or one of its dot-separated parts) starting with xn-- you can be sure that it’s an international domain name that contains non-ASCII characters.

You could, however, also type xn--mlaga-xqa.es into your browser’s address bar manually, in which case the browser would detect that you’ve provided an already encoded international domain and simply proceed with the DNS request and page loading.
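Spotting an encoded domain and turning it back into its Unicode form can be sketched in a few lines of Python. The function names has_encoded_label and to_unicode are hypothetical, and the decoding again leans on the built-in idna codec (IDNA 2003), so treat it as an illustration rather than a reference implementation.

```python
ACE_PREFIX = "xn--"  # marks a Punycode-encoded label


def has_encoded_label(domain: str) -> bool:
    """Return True if any dot-separated label starts with the xn-- prefix."""
    return any(
        label.lower().startswith(ACE_PREFIX) for label in domain.split(".")
    )


def to_unicode(domain: str) -> str:
    """Decode xn-- labels back to Unicode; plain ASCII labels pass through."""
    return domain.encode("ascii").decode("idna")


print(has_encoded_label("xn--mlaga-xqa.es"))  # True
print(to_unicode("xn--mlaga-xqa.es"))         # málaga.es
```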

Browsers have different default settings for displaying Unicode domain names. Some browsers choose to display the actual Unicode representation, e.g. málaga.es, while others will display the ASCII domain xn--mlaga-xqa.es instead. Some browsers offer an option for you to choose what to display. The main reason for browsers to not display the Unicode representations and instead go with the ASCII format is the fraud potential that Unicode domains bring.

Unicode domains allow for URL spoofing. For example, the Latin letter “h” is virtually indistinguishable from the Cyrillic character “һ” (Shha in Unicode). This makes it possible for fraudsters to register e.g. the domain mcbseycһelles.com in an attempt to impersonate the actual Mauritius Commercial Bank in the Seychelles, and to trick customers into logging in to the fake bank with their actual credentials.
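A rough way to catch this kind of look-alike is to check whether a single part of the domain mixes letters from different scripts. The sketch below guesses the script from the first word of each character's Unicode name, which is a crude stand-in because Python's standard library does not expose the Unicode Script property. The helpers rough_script, scripts_per_label and looks_mixed are hypothetical names, and real browsers apply more elaborate rules than this.

```python
import unicodedata


def rough_script(ch: str) -> str:
    """Guess a character's script from the first word of its Unicode name."""
    return unicodedata.name(ch, "UNKNOWN").split()[0]


def scripts_per_label(domain: str) -> list:
    """Collect the rough scripts of the letters in each dot-separated label."""
    return [
        {rough_script(ch) for ch in label if ch.isalpha()}
        for label in domain.split(".")
    ]


def looks_mixed(domain: str) -> bool:
    """Return True if any single label mixes letters from several scripts."""
    return any(len(scripts) > 1 for scripts in scripts_per_label(domain))


spoofed = "mcbseyc\u04bbelles.com"  # \u04bb is the Shha look-alike for "h"
print(unicodedata.name("\u04bb"))        # CYRILLIC SMALL LETTER SHHA
print(looks_mixed(spoofed))              # True
print(looks_mixed("mcbseychelles.com"))  # False
```

Note that this heuristic would also flag perfectly legitimate names that combine several scripts in one label, so it is a warning signal at best.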

Effectively this means that the Unicode representation of international domain names is actually more dangerous, and that the Punycode representation (xn--...) should be preferred for the sake of clarity.

PS: In the case of málaga.es it just so happens that the web server automatically redirects people to malaga.es, which is a separate domain that the municipality also owns and uses as its primary domain. Obviously for non-Spanish speakers, the plain ASCII domain malaga.es is easier to read, remember and type – even though that might be irrelevant these days (see below).

“But why are you using such an impossible to remember Punycode domain, that only Japanese speakers can possibly type out?!”

I’m a software architect and engineer by trade and a hacker by heart; I like to provoke chaos in order to see how systems react. Even though Punycode is probably older than the average reader of this site, it turns out that many modern, widely used systems still cannot handle it properly.

For example, I cannot log into my Patreon account anymore, ever since I had their support change my email address from a regular ASCII domain to this Unicode domain, because the login form doesn’t recognize my email address as a valid address.
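One way a form could cope with such an address is to convert the domain part to its xn-- form before validating or storing it. The following is a minimal sketch under stated assumptions: the helper email_to_ascii and the address someone@マリウス.com are made up for illustration, the local part is assumed to be plain ASCII, and none of this says anything about how Patreon actually validates input.

```python
def email_to_ascii(address: str) -> str:
    """Rewrite the domain part of an email address in its xn-- form.

    Assumes the local part (before the last "@") is already plain ASCII;
    it is passed through unchanged.
    """
    local, sep, domain = address.rpartition("@")
    if not sep or not local or not domain:
        raise ValueError(f"not an email address: {address!r}")
    # Built-in "idna" codec (IDNA 2003) encodes each label of the domain.
    return f"{local}@{domain.encode('idna').decode('ascii')}"


print(email_to_ascii("someone@マリウス.com"))  # someone@xn--gckvb8fzb.com
```

After this step the address consists of ASCII characters only, so an ASCII-only validator further down the line would have nothing left to trip over.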

Similarly it was impossible for a company that I purchased something from to create an order for a replacement part in their system with this email address added to my account. Their system would simply not allow for the order to be created.

These are just two of many comical situations that I keep running into. They expose flaws that sometimes puzzle even me, and they lead me to consider cases in the infrastructures and systems I’m building for my clients that I otherwise might not have considered.

Okay but aren’t you making it impossible for people to find you?

Not at all. Domain names are dead. Unless you’re one of the conglomerates running what is today’s internet, there’s no point in spending any effort in finding a short and representative domain name. The majority of internet users don’t type domain names anymore. They follow the URL that was linked in a post on a social media app, or they use their browser’s search engine to look for a specific topic, or they ask ChatGPT. People don’t even type conglomerates’ domain names anymore. Adding the .com part to amazon has already become too much of an effort, even with smartphone keyboards offering dedicated .com buttons and the overall round-trip time being significantly shorter than going through a search engine.

Especially with the content that I’m creating, there’s no real benefit in having a memorable domain name. The average Joe simply isn’t interested in this site, while people searching for awkwardly specific things will usually find it right away, regardless of the domain name. Whether they will actually click a URL that says マリウス.com or even xn--gckvb8fzb.com is a whole other story. However, judging by my privacy-friendly analytics tool, this site is not doing too badly.

Last but not least, I’m happy to partake in making the average internet user more conscious of the fact that there are different types of writing systems and that large parts of the world use e.g. logographic systems and not solely Latin script. Not only does the domain name I’m using stir up a bit of discussion, it also gives people the opportunity to learn something new, especially with great explanatory comments like this one:

My Japanese is rusty, but it should display as: マリウス “ma” “ri” “u” “su”, which looks to be a foreign name (katakana), “Marius.” That matches the username on this user’s github email address, also at the same domain, which means their email address should be pronounced, “Marius at Marius dot com”

Well done, Marius. Well done.

zeta0134

If I have gotten you curious and you’d like to learn more about Punycode domains, check out the Wikipedia page as well as this implementation of RFC 3492 and RFC 5891 in pure JavaScript. If you’d like to register your own international domain name, I recommend the privacy-friendly folks over at 1337 Services LLC (read: Njalla) in Nevis.

The title for this post was inspired by mrtweetyhack’s comment on Hacker News.
