Internationalized domain names in the data base?


Ronald F. Guilmette

2 Nov 2019

9:55 p.m.

Sorry if I am interrupting any ongoing discussion, but I just have a quick and simple question...

Is it permitted to have internationalized domain names appear within the database?

By that I really mean to ask if it is permissible to have there appear in the data base IDNs which are written in their UTF-8 encoded forms, rather than, say, in punycode?

I have found at least one specific case where an IDN does appear in the data base as a UTF-8 encoded string, but since I had never seen that before, I just wanted to know if that was an anomalous mistake or if it was considered normal, acceptable, and routine.


Edward Shryane

5 Nov

9:58 a.m.

Hello Ronald,

...

On 2 Nov 2019, at 21:55, Ronald F. Guilmette via db-wg <db-wg@ripe.net> wrote:

Sorry if I am interrupting any ongoing discussion, but I just have a quick and simple question...

Is it permitted to have internationalized domain names appear within the database?

Currently the RIPE database supports the Latin-1 (ISO-8859-1) character set only.

There was previous discussion in April - May 2015 to support UTF-8:
https://www.ripe.net/ripe/mail/archives/db-wg/2015-April/004516.html
https://www.ripe.net/ripe/mail/archives/db-wg/2015-May/004542.html

The proposal was to allow UTF-8 in free-text attributes, except for primary keys.

...

By that I really mean to ask if it is permissible to have there appear in the data base IDNs which are written in their UTF-8 encoded forms, rather than, say, in punycode?

The RIPE database only contains reverse domain objects (i.e. to register reverse delegations).

...

I have found at least one specific case where an IDN does appear in the data base as a UTF-8 encoded string, but since I had never seen that before, I just wanted to know if that was an anomalous mistake or if it was considered normal, acceptable, and routine.

Please let me know in which object you found this.

The DB team recently spent some effort improving (non-) Latin-1 character handling (in updates and queries), so there shouldn't be any non-Latin-1 characters remaining.

Regards
Ed Shryane
RIPE NCC

Ronald F. Guilmette

6 Nov

12:13 a.m.

In message <B335DD85-CED0-41A3-A504-E0A7E6E41D2B@ripe.net>, Edward Shryane <eshryane@ripe.net> wrote:

...

...
Is it permitted to have internationalized domain names appear within the database?

Currently the RIPE database supports the Latin-1 (ISO-8859-1) character set only.

Yes. Please forgive me. I asked the Wrong Question entirely. See below.

...

...
I have found at least one specific case where an IDN does appear in the data base as a UTF-8 encoded string, but since I had never seen that before, I just wanted to know if that was an anomalous mistake or if it was considered normal, acceptable, and routine.

Mea culpa! I misspoke.

What I found was *not* an internationalized domain name, per se. Well, maybe it was/is and maybe it wasn't/isn't. I'll let you all decide, and then you can tell me if I have used improper terminology to describe what I found.

The issue came up as I was performing some automated processing relating to certain abuse contact email addresses relating to certain RIPE ASNs. More specifically, one of my automated tools got rather badly confused by the abuse reporting addresses for AS5464 and AS42486, both of which consist of the email address:

abuse@zürich.email

The domain name portion of this address may or may not be a proper sort of internationalized domain name. I am frankly not sure about that now, one way or the other. I just saw a character that was not a traditional 7-bit ASCII character and then I improperly leapt to the conclusion that this must be one of those internationalized domain names that have bedeviled some of my other home-brew tools in the past.

The problem, of course, is that one lower-case letter "u" with the associated umlaut above it. On my system here, the "od -c" command indicates that this one character is encoded NOT as any kind of UTF-8 sequence, but rather that it is simply encoded as a single byte with the value 374 (octal).

As I now know, that byte value, when construed in accordance with ISO-8859-1, does in fact represent a lower-case "u" with an umlaut. So at least in this limited sense I now know what the person who put that domain name into the data base had intended. However I am not yet persuaded that simply using ISO-8859-1 encoding was either the best choice or even an entirely appropriate choice in this instance.

It was certainly convenient for the writer that a lower-case "u" with an umlaut could be represented within ISO-8859-1, thus making it unnecessary to resort to UTF-8 in this particular instance, but it does cause me to wonder a bit about what may transpire on the day when some RIPE member finds it appropriate and necessary to add to the data base some contact email address consisting in part of an IDN, where said IDN is, in its native form, something in Arabic, Farsi, Hebrew or Chinese.

For my own part, I am merely an out-of-date and ancient relic of a happier and simpler time, here in the United States, when 7-bit ASCII was sufficient for anything and everything. As such, I cannot help but long for a return to that level of simplicity, parochial as it might be. But since that is not going to happen anytime soon, I can only hope that RIPE and other regions will come to some agreement regarding the proper representation of IDNs within their respective data bases. If ISO-8859-1 is the standard chosen, I will certainly adjust my tools accordingly. If however some other standard is set, then I merely hope that I will be on the circulation list when that memo is issued.

Regards,
rfg

P.S. Not that anybody should really care, but for this one lone researcher it would be maximally convenient if all domain names represented within the data base were encoded as punycode, where necessary. In fact, it is my belief that 99.99% of them already are, which thus renders the "transition" to that standard essentially pain free.
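The byte-level observation in this message can be reproduced with a minimal Python sketch. The raw bytes below are an assumption reconstructed from the description (a single byte with octal value 374 in place of the "u" with an umlaut); they were not read from the RIPE database.

```python
# Assumed raw bytes of the address as described in the message: the "u" with
# an umlaut is the single byte 0xFC (octal 374), i.e. ISO-8859-1, not UTF-8.
raw = b"abuse@z\xfcrich.email"

# Interpreted as ISO-8859-1 (Latin-1), the bytes decode cleanly.
latin1_text = raw.decode("latin-1")

# The same text encoded as UTF-8 needs two bytes (0xC3 0xBC) for that letter.
utf8_bytes = latin1_text.encode("utf-8")

# Interpreting the original bytes as UTF-8 fails, because 0xFC can never
# start a valid UTF-8 sequence.
try:
    raw.decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False

print(latin1_text)        # the address with the umlaut restored
print(oct(raw[7]))        # octal value of the non-ASCII byte
print(utf8_bytes)         # UTF-8 form of the same address
print(decodes_as_utf8)    # whether the raw bytes are valid UTF-8
```

A tool that assumes either 7-bit ASCII or UTF-8 input would stumble on such a byte, which would be consistent with the confusion described.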

Edward Shryane

4:04 p.m.

Hello Ronald, DB-WG,

...

On 6 Nov 2019, at 00:13, Ronald F. Guilmette via db-wg <db-wg@ripe.net> wrote:

...
...
I have found at least one specific case where an IDN does appear in the data base as a UTF-8 encoded string, but since I had never seen that before, I just wanted to know if that was an anomalous mistake or if it was considered normal, acceptable, and routine.

Mea culpa! I misspoke.

Thanks for clarifying!

...

What I found was *not* an internationalized domain name, per se. Well, maybe it was/is and maybe it wasn't/isn't. I'll let you all decide, and then you can tell me if I have used improper terminology to describe what I found.

The email address you found is the only IDN (i.e. non-ASCII) email address in the RIPE database (so far).

It's currently considered a valid value in the RIPE database, as it's composed of Latin-1 characters, and the attribute syntax check passes.

There is also an MX record for the domain (although the host dc-eb0309b6496a.xn--zrich-kva.email is currently unreachable for me).

However, it may cause interoperability issues, as the sending mail server needs to handle IDN addresses correctly.

DB-WG: should we allow non-ASCII addresses in the RIPE database?

...

P.S. Not that anybody should really care, but for this one lone researcher it would be maximally convenient if all domain names represented within the data base were encoded as punycode, where necessary. In fact, it is my belief that 99.99% of them already are, which thus renders the "transition" to that standard essentially pain free.

DB-WG: is punycode for domain names a viable alternative for encoding non-ASCII email addresses?

For example, the punycode equivalent abuse@xn--zrich-kva.email is already a valid value for the e-mail (or abuse-c) attribute.

Regards
Ed Shryane
RIPE NCC
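As an illustration of the punycode equivalent mentioned in this message, here is a minimal Python sketch that converts only the domain part of an address. The helper name `email_domain_to_punycode` is hypothetical, and Python's built-in `idna` codec follows the older IDNA 2003 rules (RFC 3490), so this shows the idea rather than how the RIPE database would do it.

```python
def email_domain_to_punycode(address: str) -> str:
    """Return the address with its domain part in punycode form.

    Hypothetical helper for illustration: the local part (before the "@")
    is left untouched; only the domain is converted.
    """
    local, at, domain = address.rpartition("@")
    if not at:
        raise ValueError(f"not an email address: {address!r}")
    # The built-in "idna" codec converts each non-ASCII label to its
    # "xn--" form and leaves ASCII-only labels as they are.
    ascii_domain = domain.encode("idna").decode("ascii")
    return f"{local}@{ascii_domain}"


print(email_domain_to_punycode("abuse@zürich.email"))
print(email_domain_to_punycode("abuse@example.net"))
```

An address whose domain is already plain ASCII passes through unchanged, so applying the conversion uniformly would not disturb existing values.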

Piotr Strzyzewski

9:17 p.m.

On Wed, Nov 06, 2019 at 04:04:11PM +0100, Edward Shryane via db-wg wrote: Hi!

...

DB-WG: should we allow non-ASCII addresses in the RIPE database?

Do you mean email addresses or street addresses as well?

...

...
P.S. Not that anybody should really care, but for this one lone researcher it would be maximally convenient if all domain names represented within the data base were encoded as punycode, where necessary. In fact, it is my belief that 99.99% of them already are, which thus renders the "transition" to that standard essentially pain free.

DB-WG: is punycode for domain names a viable alternative for encoding non-ASCII email addresses?

Works for me.

...

For example, the punycode equivalent abuse@xn--zrich-kva.email is already a valid value for the e-mail (or abuse-c) attribute.

-- Piotr Strzyżewski

Edward Shryane

9:47 p.m.

Hi Piotr, DB-WG,

...

On 6 Nov 2019, at 21:17, Piotr Strzyzewski <Piotr.Strzyzewski@polsl.pl> wrote:

...
DB-WG: should we allow non-ASCII addresses in the RIPE database?

Do you mean email addresses or street addresses as well?

I mean to continue to allow non-ASCII (i.e. Latin-1 encoded) IDN email addresses, such as the example mentioned. Or should we automatically encode non-ASCII characters as punycode?

...

...
...
P.S. Not that anybody should really care, but for this one lone researcher it would be maximally convenient if all domain names represented within the data base were encoded as punycode, where necessary. In fact, it is my belief that 99.99% of them already are, which thus renders the "transition" to that standard essentially pain free.

DB-WG: is punycode for domain names a viable alternative for encoding non-ASCII email addresses?

Works for me.

We can add explicit support for the punycode format, and allow (full) IDN email addresses to be used (as this syntax should be interchangeable with the normal form). Whois could automatically translate to and from the punycode format, if an IDN format address is encountered.

...

...
For example, the punycode equivalent abuse@xn--zrich-kva.email is already a valid value for the e-mail (or abuse-c) attribute.

-- Piotr Strzyżewski

Regards Ed Shryane RIPE NCC
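The idea of translating to and from the punycode format can be sketched in Python as a round trip. The function names are hypothetical, and the built-in `idna` codec (IDNA 2003) stands in for whatever conversion a Whois server or client would actually use.

```python
def domain_to_ascii(domain: str) -> str:
    """Convert a (possibly non-ASCII) domain name to its punycode form."""
    return domain.encode("idna").decode("ascii")


def domain_to_unicode(domain: str) -> str:
    """Convert a punycode domain name back to its Unicode form."""
    return domain.encode("ascii").decode("idna")


stored = domain_to_ascii("zürich.email")   # form that could be stored as ASCII
shown = domain_to_unicode(stored)          # form that could be displayed

print(stored)
print(shown)
```

Because the two forms convert into each other without loss for this example, the choice of where to translate (server or client) does not change what is stored.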

Piotr Strzyzewski

10:05 p.m.

On Wed, Nov 06, 2019 at 09:47:33PM +0100, Edward Shryane wrote: Hi Edward, DB-WG,

...

...
On 6 Nov 2019, at 21:17, Piotr Strzyzewski <Piotr.Strzyzewski@polsl.pl> wrote:

...
DB-WG: should we allow non-ASCII addresses in the RIPE database?

Do you mean email addresses or street addresses as well?

I mean to continue to allow non-ASCII (i.e. Latin-1 encoded) IDN email addresses, such as the example mentioned. Or should we automatically encode non-ASCII characters as punycode?

I do not object to having properly coded non-ASCII email addresses in the database.

...

...
...
...
P.S. Not that anybody should really care, but for this one lone researcher it would be maximally convenient if all domain names represented within the data base were encoded as punycode, where necessary. In fact, it is my belief that 99.99% of them already are, which thus renders the "transition" to that standard essentially pain free.

DB-WG: is punycode for domain names a viable alternative for encoding non-ASCII email addresses?

Works for me.

We can add explicit support for the punycode format, and allow (full) IDN email addresses to be used (as this syntax should be interchangeable with the normal form).

Whois could automatically translate to and from the punycode format, if an IDN format address is encountered.

...
...
For example, the punycode equivalent abuse@xn--zrich-kva.email is already a valid value for the e-mail (or abuse-c) attribute.

And what about RDAP? Piotr -- Piotr Strzyżewski

Ronald F. Guilmette

11:46 p.m.

In message <20191106210554.GB5460@hydra.ck.polsl.pl>, Piotr Strzyzewski <Piotr.Strzyzewski@polsl.pl> wrote:

...

I do not object to having properly coded non-ASCII email addresses in the database.

First, just to be clear, we are really only discussing the representation of domain names within the data base. Of course, any email address contains one of those, but we are specifically -not- discussing the representation of the user-ID portion of any email address in the data base.

Second, it is nice that you are OK with "properly coded non-ASCII" domain names in the data base. So am I. That's not the question. The question is how should IDNs be -represented- within the data base.

As I have stated, it is my opinion that the only two viable options at the present time are either (a) punycode or else (b) UTF-8.

ISO-8859-1 is not, as far as I know, a standardized or appropriate way of encoding IDNs in any context. If I am wrong about that, then please do correct me and please do point me at the RFC which states otherwise.

Regards,
rfg

Edward Shryane

8 Nov

11:10 a.m.

Hello Ronald, DB-WG,

...

On 6 Nov 2019, at 23:46, Ronald F. Guilmette via db-wg <db-wg@ripe.net> wrote:

In message <20191106210554.GB5460@hydra.ck.polsl.pl>, Piotr Strzyzewski <Piotr.Strzyzewski@polsl.pl> wrote:

...
I do not object to having properly coded non-ASCII email addresses in the database.

First, just to be clear, we are really only discussing the representation of domain names within the data base. Of course, any email address contains one of those, but we are specifically -not- discussing the representation of the user-ID portion of any email address in the data base.

Understood. The user-ID (local) portion of an email address is not affected, only the domain.

...

Second, it is nice that you are OK with "properly coded non-ASCII" domain names in the data base. So am I. That's not the question. The question is how should IDNs be -represented- within the data base.

As I have stated, it is my opinion that the only two viable options at the present time are either (a) punycode or else (b) UTF-8.

DB-WG:

- if (a), should the RIPE database automatically convert IDN domain names in email addresses into punycode?

- or if (b), should the RIPE database support UTF-8 for the domain part of IDN email addresses? This is technically possible on the Whois server side, but it's a large change for clients.

...

ISO-8859-1 is not, as far as I know, a standardized or appropriate way of encoding IDNs in any context. If I am wrong about that, then please do correct me and please do point me at the RFC which states otherwise.

Using ISO-8859-1 to encode IDN email addresses in the RIPE database does cause some issues:

- Only a small subset of the UTF-8 character set is supported; characters outside ISO-8859-1 are substituted with a '?' on Whois update.

- ISO-8859-1 encoded email addresses may not be handled properly by Whois clients or mail servers.

...

Regards, rfg

Regards Ed Shryane RIPE NCC
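The first issue listed in this message (characters outside ISO-8859-1 being replaced with '?') can be illustrated with a short Python sketch. Python's `errors="replace"` option happens to substitute '?' when encoding, which mirrors the behaviour described; this is not the Whois server's actual code, and the sample strings are chosen only for illustration.

```python
def fits_latin1(text: str) -> bool:
    """Return True if every character can be represented in ISO-8859-1."""
    try:
        text.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


def to_latin1_lossy(text: str) -> str:
    """Replace every character outside ISO-8859-1 with '?'."""
    return text.encode("latin-1", errors="replace").decode("latin-1")


# Illustrative samples: the first fits in Latin-1, the other two do not.
for sample in ["zürich.email", "Strzyżewski", "пример.рф"]:
    print(sample, fits_latin1(sample), to_latin1_lossy(sample))
```

The substitution is lossy: once a character has become '?', the original value cannot be recovered from what was stored.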

Nick Hilliard

11:19 a.m.

Edward Shryane via db-wg wrote on 08/11/2019 10:10:

...

- if (a), should the RIPE database automatically convert IDN domain names in email addresses into punycode?

Where though? Only in fields which contain email addresses? Or free-text fields too? Nick

Edward Shryane

11:25 a.m.

Hi Nick,

...

On 8 Nov 2019, at 11:19, Nick Hilliard <nick@foobar.org> wrote:

Edward Shryane via db-wg wrote on 08/11/2019 10:10:

...
- if (a), should the RIPE database automatically convert IDN domain names in email addresses into punycode?

Where though? Only in fields which contain email addresses? Or free-text fields too?

I suggest only in fields which contain email addresses: upd-to, mnt-nfy, notify, e-mail.

...

Nick

Regards Ed Shryane RIPE NCC
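The suggestion to convert only in the fields that contain email addresses could look roughly like the following Python sketch. The object text is a made-up example, the function name is hypothetical, and the parsing is deliberately simplified (one attribute per line, no continuation lines or end-of-line comments); only the attribute names upd-to, mnt-nfy, notify and e-mail come from the message.

```python
# Attribute names taken from the suggestion in the message.
EMAIL_ATTRIBUTES = {"upd-to", "mnt-nfy", "notify", "e-mail"}


def punycode_email_attributes(object_text: str) -> str:
    """Convert the domain of email-valued attributes to punycode.

    Simplified sketch: free-text attributes such as "remarks" or
    "address" are passed through unchanged.
    """
    out = []
    for line in object_text.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() in EMAIL_ATTRIBUTES and "@" in value:
            # Keep the original padding between the colon and the value.
            padding = value[: len(value) - len(value.lstrip())]
            local, _, domain = value.strip().rpartition("@")
            ascii_domain = domain.encode("idna").decode("ascii")
            line = f"{name}:{padding}{local}@{ascii_domain}"
        out.append(line)
    return "\n".join(out)


# Made-up object for illustration only.
sample = "\n".join([
    "role:           Example Abuse Desk",
    "address:        Zürich",
    "e-mail:         abuse@zürich.email",
    "notify:         noc@example.net",
    "remarks:        contact abuse@zürich.email",
])

converted = punycode_email_attributes(sample)
print(converted)
```

In this sketch the e-mail value is rewritten while the same address inside the free-text remarks attribute is left alone, which is the distinction raised in the question about free-text fields.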

Piotr Strzyzewski

4:06 p.m.

On Fri, Nov 08, 2019 at 11:10:10AM +0100, Edward Shryane via db-wg wrote:

...

DB-WG:

- if (a), should the RIPE database automatically convert IDN domain names in email addresses into punycode?

That is just a workaround for the general problem of having UTF8 in the DB.

...

- or if (b), should the RIPE database support UTF-8 for the domain part of IDN email addresses? This is technically possible on the Whois server side, but it's a large change for clients.

We should make up our minds about UTF8. One way or another.

-- Piotr Strzyżewski

Ronald F. Guilmette

11:23 p.m.

In message <0A69C7DE-D5E2-4B95-9643-82103F87B92B@ripe.net>, Edward Shryane <eshryane@ripe.net> wrote:

...

Using ISO-8859-1 to encode IDN email addresses in the RIPE database does cause some issues:

We agree on that point, 100%.

...

- Only a small subset of the UTF-8 character set is supported, characters outside ISO-8859-1 are substituted with a '?' on Whois update.

Yes. And this is really rather entirely sub-optimal.

...

- ISO-8859-1 encoded email addresses may not be handled properly by Whois clients or mail servers.

I personally am not too concerned about WHOIS client tools. They can adapt or die. :-)

It is certainly the case however that most or all existing WHOIS clients do not contain any UTF-8 decoding logic, and that they thus will display only 7-bit US-ASCII or, in some cases, that and also ISO-8859-1 encoded single byte characters.

For all of these existing clients & tools it would be maximally convenient to be able to cut-and-paste email addresses out of the WHOIS data base records, as these tools render them, and directly into mail clients. Either a UTF-8 encoding or a punycode encoding (of domain name) -might- possibly work for that.

I personally prefer punycode because it is effectively the lowest common denominator. It does not force WHOIS clients or tools to support anything beyond simple and primitive 7-bit US-ASCII, and yet it can still express 100% of all modern IDNs.

Regards,
rfg

Piotr Strzyzewski

4:02 p.m.

On Wed, Nov 06, 2019 at 02:46:13PM -0800, Ronald F. Guilmette via db-wg wrote:

...

In message <20191106210554.GB5460@hydra.ck.polsl.pl>, Piotr Strzyzewski <Piotr.Strzyzewski@polsl.pl> wrote:

...
I do not object to having properly coded non-ASCII email addresses in the database.

First, just to be clear, we are really only discussing the representation of domain names within the data base. Of course, any email address contains one of those, but we are specifically -not- discussing the representation of the user-ID portion of any email address in the data base.

Second, it is nice that you are OK with "properly coded non-ASCII" domain names in the data base. So am I. That's not the question. The question is how should IDNs be -represented- within the data base.

Properly. As I said. The specification for that is in the relevant RFC. I do not see any reason for not allowing people to use proper email addresses. -- Piotr Strzyżewski

Ronald F. Guilmette

6 Nov

11:35 p.m.

In message <1AA95AD8-3729-4BBB-A921-1535429A9658@ripe.net>, Edward Shryane <eshryane@ripe.net> wrote:

...

...
...
DB-WG: should we allow non-ASCII addresses in the RIPE database?

Do you mean email addresses or street addresses as well?

I mean to continue to allow non-ASCII (i.e. Latin-1 encoded) IDN email addresses, such as the example mentioned. Or should we automatically encode non-ASCII characters as punycode?

I want to be crystal clear here. Street addresses, person names, city names, or any other data value (except for ASNs, IP addresses, ISO 3166 country codes, and domain names) that are encoded in full 8-bit ISO-8859-1 within the data base do not present any terrific problems for me personally because, generally speaking, I don't anticipate that I will ever be trying to parse those person names, street names, city names, etc. I will just use them "as is" and in whatever encoding they happen to be in when I receive them.

Quite certainly, within the RIPE region there are billions upon billions of person names, street names, and city names that cannot be accurately represented in US-ASCII, nor even, I must note, in ISO-8859-1. (I am thinking of your fellow RIPE members in places where Cyrillic is used, and also your fellow RIPE members in Israel and elsewhere.)

In ancient times (e.g. prior to the issuance of, for example, RFC3490 in March, 2003) 7-bit US-ASCII was used fairly exclusively within the data bases of all of the Regional Internet Registries. And I, for one, am greatly appreciative of all of the effort and contortions, over so many years, that so many people have gone through in order to try, as best as they could, to anglicize person, street, and city names, especially those that were not really amenable to that process, and to convert them all into some 7-bit ASCII approximation of the actual "native" strings. Even though this conversion process has often rendered the resulting anglicized versions substantially inaccurate, it has served to keep processing code simple, at least up until now.

Now however I see that 8-bit ISO-8859-1 encodings are creeping in, at least to the RIPE data base. I am torn by this. On the one hand this new development augurs a sea change which will likely end by complicating a lot of tools, and not only my own.

On the other hand, the benefits are clear; more accurate representations of person, street, and city names within the data base... BUT still quite limited to names that can be accurately represented within ISO-8859-1, a character set which excludes some very large swaths of RIPE territory.

Even at the risk of making my own life more complicated, I have to say that I personally place a higher value on accuracy than I do on simplicity. For this reason, it is my feeling that the data base should evolve in the direction of UTF-8 and *not* in the rather different and far more limiting direction of ISO-8859-1.

That having been said however, domain names are a really very special and different concern. I personally am not aware of any standard which suggests that domain names should ever be written in ISO-8859-1. Rather, for domain names, the available choices of representation seem to be either (a) 7-bit US-ASCII or else (b) punycode (RFC3492) or else (c) UTF-8.

Obviously, 7-bit US-ASCII is really no longer an option, and hasn't been ever since the publication of RFC3490 in 2003. At the present moment, punycode can be used, and can represent all domain names with 100% accuracy, even while allowing the evolution of the encoding of other data base fields to proceed and to be debated independently.

The bottom line is that in the short term, and for the immediate future, I believe that there is no other sensible choice except to decree that all domain names within the data base shall be represented in punycode form.

...

Whois could automatically translate to and from the punycode format, if an IDN format address is encountered.

Yes, but please just leave this to the WHOIS *client* to handle. It is less desirable, I think, to perform this conversion on the server side. Regards, rfg

Ronald F. Guilmette

10:22 p.m.

In message <DBB71EC8-7564-4AAB-B490-5A894B39AF72@ripe.net>, Edward Shryane <eshryane@ripe.net> wrote:

...

...
What I found was *not* an internationalized domain name, per se. Well, maybe it was/is and maybe it wasn't/isn't. I'll let you all decide, and then you can tell me if I have used improper terminology to describe what I found.

The email address you found, is the only IDN (i.e. non-ASCII) email address in the RIPE database (so far).

What I found is definitely *not* "US-ASCII" i.e. 7-bit ASCII.

It is a separate question as to whether or not what I found qualifies, properly, under the relevant RFCs, as being a proper sort of a representation of an "IDN". (I suspect it does not.)

The relevant current RFCs appear to be RFC5890 and possibly RFC5891, RFC5892, and RFC5894, but I'm sorry to say that each of these is rather complex, and I do not have time available right now to dredge into them and learn the real current rules. All I can say is that a brief glance at these RFCs seems to indicate that RFC5892 is the most directly relevant, and that RFC5892 appears to say that Unicode must be used for representation of IDNs. The domain name I found *is* ISO-8859-1 (Latin-1) but does not appear to me to be Unicode.

...

It's currently considered a valid value in the RIPE database, as it's composed of Latin-1 characters, and the attribute syntax check passes.

Yes.

...

There is also an MX record for the domain (although the host dc-eb0309b6496a.xn--zrich-kva.email is currently unreachable for me).

However, it may cause inter-operability issues, as the sending mail server needs to handle IDN addresses correctly.

Yes.

...

DB-WG: should we allow non-ASCII addresses in the RIPE database?

More precisely, the question should be, I think: (a) Should characters that are non-US-ASCII be allowed in the data base generally, and separately (b) how should IDNs be represented in the data base?

...

DB-WG: is punycode for domain names a viable alternative for encoding non-ASCII email addresses?

I think that in order to be comprehensive, domain names appearing in the data base *must* be encoded *either* as punycode *or* else as UTF-8. I don't believe that ISO-8859-1 (Latin-1) will be able to do the job entirely, but the other two options will.

...

For example, the punycode equivalent abuse@xn--zrich-kva.email is already a valid value for the e-mail (or abuse-c) attribute.

Yes, and the same can be said generally, i.e. the (punycoded) domain name xn--zrich-kva.email is in all respects a substitute for its Unicode equivalent. Thus, xn--zrich-kva.email may be used, for example, as the argument to the "dig" command, and/or in all other contexts where a fully qualified domain name may be used.

Regards,
rfg

