In message <1AA95AD8-3729-4BBB-A921-1535429A9658@ripe.net>, Edward Shryane <eshryane@ripe.net> wrote:
DB-WG: should we allow non-ASCII addresses in the RIPE database?
Do you mean email addresses or street addresses as well?
I mean to continue to allow non-ASCII (i.e. Latin-1 encoded) IDN email addresses, such as the example mentioned. Or should we automatically encode non-ASCII characters as punycode?
I want to be crystal clear here. Street addresses, person names, city names, or any other data values (except for ASNs, IP addresses, ISO 3166 country codes, and domain names) that are encoded in full 8-bit ISO-8859-1 within the data base do not present any great problem for me personally because, generally speaking, I don't anticipate that I will ever be trying to parse those person names, street names, city names, etc. I will just use them "as is", in whatever encoding they happen to be in when I receive them.

Quite certainly, within the RIPE region there are millions upon millions of person names, street names, and city names that cannot be accurately represented in US-ASCII, nor even, I must note, in ISO-8859-1. (I am thinking of your fellow RIPE members in places where Cyrillic is used, and also your fellow RIPE members in Israel and elsewhere.)

In ancient times (e.g. prior to the issuance of RFC3490 in March 2003) 7-bit US-ASCII was used fairly exclusively within the data bases of all of the Regional Internet Registries. And I, for one, am greatly appreciative of all of the effort and contortions, over so many years, that so many people have gone through in order to try, as best as they could, to anglicize person, street, and city names, especially those that were not really amenable to that process, and to convert them all into some 7-bit ASCII approximation of the actual "native" strings. Even though this conversion process has often rendered the resulting anglicized versions substantially inaccurate, it has served to keep processing code simple, at least up until now.

Now, however, I see that 8-bit ISO-8859-1 encodings are creeping in, at least into the RIPE data base. I am torn by this. On the one hand, this new development augurs a sea change which will likely end up complicating a lot of tools, and not only my own.
On the other hand, the benefits are clear: more accurate representations of person, street, and city names within the data base... BUT still quite limited to names that can be accurately represented within ISO-8859-1, a character set which excludes some very large swaths of RIPE territory. Even at the risk of making my own life more complicated, I have to say that I personally place a higher value on accuracy than I do on simplicity. For this reason, it is my feeling that the data base should evolve in the direction of UTF-8 and *not* in the rather different and far more limiting direction of ISO-8859-1.

That having been said, however, domain names are a special and rather different concern. I personally am not aware of any standard which suggests that domain names should ever be written in ISO-8859-1. Rather, for domain names, the available choices of representation seem to be either (a) 7-bit US-ASCII or else (b) punycode (RFC3492) or else (c) UTF-8. Obviously, plain 7-bit US-ASCII alone is no longer an option, and hasn't been ever since the publication of RFC3490 in 2003. At the present moment, punycode can be used, and can represent all domain names with 100% accuracy, even while allowing the evolution of the encoding of other data base fields to proceed and to be debated independently.

The bottom line is that in the short term, and for the immediate future, I believe that there is no other sensible choice except to decree that all domain names within the data base shall be represented in punycode form.
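To make the punycode proposal concrete, here is a rough sketch of how a tool might convert only the domain part of an email address to and from its punycode form. It assumes Python and its built-in "idna" codec (which follows RFC3490); the function names and the example address are purely illustrative and are not taken from any RIPE tool. The local part (left of the "@") is deliberately left untouched, since that is a separate question.

```python
def to_ascii_email(address: str) -> str:
    """Return the address with its domain part in punycode (ASCII) form."""
    local, sep, domain = address.rpartition("@")
    if not sep:
        raise ValueError(f"not an email address: {address!r}")
    # The built-in "idna" codec converts each non-ASCII label to "xn--..." form.
    ascii_domain = domain.encode("idna").decode("ascii")
    return f"{local}@{ascii_domain}"


def to_unicode_email(address: str) -> str:
    """Return the address with its punycode domain decoded for display."""
    local, sep, domain = address.rpartition("@")
    if not sep:
        raise ValueError(f"not an email address: {address!r}")
    # Decoding is the inverse step: "xn--..." labels become Unicode again.
    unicode_domain = domain.encode("ascii").decode("idna")
    return f"{local}@{unicode_domain}"


if __name__ == "__main__":
    stored = to_ascii_email("info@bücher.example")
    print(stored)                    # form that would be kept in the data base
    print(to_unicode_email(stored))  # form a person would want to read
```

An address whose domain is already plain ASCII passes through both functions unchanged, so the conversion can be applied uniformly to every address.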
Whois could automatically translate to and from the punycode format, if an IDN format address is encountered.
Yes, but please just leave this to the WHOIS *client* to handle. It is less desirable, I think, to perform this conversion on the server side.

Regards,
rfg