Re: [db-wg] db-wg Digest, Vol 44, Issue 7
Hi Piotr

Just to be clear, you refer to free text attributes. This has a specific meaning in terms of database syntax checks. It applies to those attributes where no syntax checks are done, for example "address:", "descr:", "remarks:". Is your proposal only referring to these attributes? I trust you do not mean all attributes other than primary keys.

Incidentally, although "person:", "role:" and "org-name:" are not primary keys, they are not free text either. Currently there are syntax checks done on these values. If you allow these in UTF8 then all these syntax checks will have to be dropped.

cheers
denis
independent netizen

Date: Fri, 17 Apr 2015 12:18:04 +0200
From: Piotr Strzyzewski <Piotr.Strzyzewski@polsl.pl>
To: db-wg@ripe.net
Subject: [db-wg] Proposal to allow UTF8
Message-ID: <20150417101804.GD7031@hydra.ck.polsl.pl>
Content-Type: text/plain; charset=utf-8

Dear DB-WG Members

Proposal:
I propose to allow UTF8 in all free text attributes of all DB objects except in primary keys.

Description:
The RIPE NCC service region covers Europe, the Middle East and parts of Central Asia. Moreover, we have users from outside of this region. This means that the WHOIS DB stores data for people and organizations from a number of different countries using a number of different alphabets. At this moment, all data in the RIPE WHOIS DB have to be stored using the 7-bit plain US ASCII character set.

[As a side note: It is technically possible to store some UTF8 content in some attributes, but the answer to a whois query (both terminal and web based) returns the "?" character in this case.]

Lack of full support for national character sets leads to problems that include, but are not limited to:
1. Mistakes in person/organization names due to national->english and english->national (based mostly on guess) conversion.
2. Mistakes in person/organization address due to national->english and english->national (based mostly on guess) conversion.
3. Conflict of converted words with other correct words (most visible in latin-based character sets).
4. Possible offensive word formation due to national->english conversion of names and/or addresses of person/organization.

[As a side note to points no 1-3: This could lead to some problems when an LEA tries to find out precisely who should be contacted in case of abuse.]

On the other hand, community members need to know who is responsible for a certain resource without the necessity of understanding all the other character sets. Moreover, some objects are filled with data that has to be provided in the ASCII character set due to business rules (like ORGANISATION object details for LIRs). RIPE NCC has a policy to insist on latin based names for organisation objects that it verifies (allocated, and sponsored end-user space).

Taking this into account, I propose to allow UTF8 in all free text attributes of all DB objects except in primary keys.

Some possible issues to be addressed:
1. If this proposal is supported by the DB-WG, it then has to be discussed at least with AA-WG and AP-WG.
2. UTF8 may cause problems for client code. Comment: A proper implementation plan and announcement schedule should be prepared.
3. UTF8 may result in contact addresses and names that are not readable by a large part of the community. Comment: Primary keys (mostly names) still have to be in the ASCII character set. Moreover, LIR data are also in the ASCII character set due to business rules.
4. At this moment there are no major technical issues blocking UTF8 support in the RIPE DB back-end. However, thorough checks have to be done.

Looking for your comments.

Piotr
--
gucio -> Piotr Strzyżewski
E-mail: Piotr.Strzyzewski@polsl.pl
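The side note in the proposal about non-ASCII content coming back as "?" can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that the lossy step is a plain encode to 7-bit ASCII with replacement; the thread does not say how the whois server actually produces the "?", and the helper name `to_ascii_lossy` is hypothetical.

```python
def to_ascii_lossy(value: str) -> str:
    """Encode to 7-bit US-ASCII, substituting '?' for every code point outside it.

    Assumption: this mimics the observable behaviour described in the thread,
    not the actual RIPE Database implementation.
    """
    return value.encode("ascii", errors="replace").decode("ascii")


# U+017C is LATIN SMALL LETTER Z WITH DOT ABOVE, written as an escape so the
# example does not depend on the encoding of this source text.
name = "Piotr Strzy\u017cewski"
print(to_ascii_lossy(name))  # Piotr Strzy?ewski
```

The replacement is not reversible: once the "?" is stored or displayed, the original letter cannot be recovered, which is the kind of name mangling listed in points 1 and 2 of the proposal.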
On Fri, May 01, 2015 at 01:53:27PM +0000, denis walker wrote: Dear Denis Thanks for your valuable input.
Just to be clear, you refer to free text attributes. This has a specific meaning in terms of database syntax checks. It applies to those attributes where no syntax checks are done, for example "address:", "descr:", "remarks:". Is your proposal only referring to these attributes? I trust you do not mean all attributes other than
I have deliberately used the "free text" characteristic instead of <freeform> grammar element used in RIPE Database Documentation. So, to be clear - yes, I meant also "person:", "role:" and "org-name:".
primary keys. Incidentally, although "person:", "role:" and "org-name:" are not primary keys, they are not free text either.
Taking the above into account, one can observe that, according to the RIPE Database Documentation, the "person:" attribute is somehow less restricted than the "address:", "descr:" and "remarks:" attributes (limited to Latin1) ;-) In contrast to <role-name> and <organisation-name>, which use the "alphanumeric characters" characteristic, <person-name> uses the "letter" one. And since "letter" is not defined anywhere, my understanding of this word _could_ be different from yours. ;-)
Currently there are syntax checks done on these values. If you allow these in UTF8 then all these syntax checks will have to be dropped.
I disagree that all of them will have to be dropped. For example, the attribute length or the number of words separated by spaces is quite independent of the character set. Moreover, we can restrict UTF8 in attributes which are not currently defined as <freeform> to only those subsets of UTF8 which cover the alphabets used in the RIPE NCC service region. I'm open to discussing this.

Best regards,
Piotr
--
gucio -> Piotr Strzyżewski
E-mail: Piotr.Strzyzewski@polsl.pl
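The two kinds of check mentioned in this reply can be sketched in Python as follows. Everything here is an assumption made for illustration: `MAX_LEN` and `ALLOWED_SCRIPTS` are hypothetical values, not taken from the RIPE Database Documentation, and the script test uses the prefix of the Unicode character name as a rough stand-in for a real script property, which the Python standard library does not expose.

```python
import unicodedata

MAX_LEN = 80  # hypothetical limit, for illustration only
# Illustrative allow-list; which scripts would really be needed is an open question.
ALLOWED_SCRIPTS = ("LATIN", "CYRILLIC", "GREEK", "ARABIC", "HEBREW")


def word_count(value: str) -> int:
    """Count whitespace-separated words; works the same for any script."""
    return len(value.split())


def within_length(value: str, limit: int = MAX_LEN) -> bool:
    """Length check counted in code points rather than bytes."""
    return len(value) <= limit


def uses_allowed_scripts(value: str, allowed: tuple = ALLOWED_SCRIPTS) -> bool:
    """Return True if every letter's Unicode name starts with an allowed script name.

    This is a heuristic: digits, spaces and punctuation are ignored.
    """
    for ch in value:
        if not ch.isalpha():
            continue
        if not unicodedata.name(ch, "").startswith(allowed):
            return False
    return True
```

With these helpers, "Piotr Strzy\u017cewski" counts as two words and passes the script check, while a name written in CJK ideographs fails the script check. That failure is exactly the restriction questioned later in the thread.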
Hi Piotr

Thanks for the clarification. I don't think it makes sense to restrict the UTF8 to only character sets defined within the RIPE region. (Not sure it is even technically possible.) But if a Chinese person lives and works in this region why would they not be able to enter their correct name? Just for argument's sake, changing my name into Chinese with Google translate changes the space to a '.'. If that is correct then the current syntax check fails.

Also "person:", "role:" and "org-name:" are all defined as 'lookup keys'. That means you can enter their values in a query as the query string and that will be searched on in the database. The individual 'words' from these attribute values are stored in index tables in the database and searched as part of the query to return objects with matching values. I believe it is problematic to do string comparison in UTF8.

Also the Full Text Search allows searches on all these attributes as well as "address:", "descr:" and "remarks:". Again all the component parts of these values are indexed for this search.

So allowing any attribute in UTF8 only may require software changes and may put restrictions on some of the services the database currently provides. If you cannot rely on a search returning the correct objects then you cannot allow those searches.

There was a Labs article written some time ago on UTF8
https://labs.ripe.net/Members/kranjbar/internationalisation-of-ripe-database

This article put forward the idea of keeping all existing attributes in ASCII (but really meant Latin1) and allowing additional optional attributes for name and contact details in local language. I think that would be a good first step to provide additional benefits of localisation without breaking any of the current functionality. Even if it was only an interim step it would allow time to assess any issues and monitor the usefulness of these new attributes.
cheers
Denis Walker
Independent Netizen
On Wed, May 06, 2015 at 09:13:28AM +0000, denis walker wrote: Dear Denis
Thanks for the clarification. I don't think it makes sense to restrict the UTF8 to only character sets defined within the RIPE region. (Not sure it is even technically possible.) But if a Chinese person lives and works in this region why would they not be able to enter their
This idea came from the fact that someone who lives in this region probably has some documents issued by local authorities. Of course, in some cases this may not be true. So, good point.
correct name? Just for argument's sake, changing my name into Chinese with Google translate changes the space to a '.'. If that is correct then the current syntax check fails.
Well spotted.
Also "person:", "role:" and "org-name:" are all defined as 'lookup keys'. That means you can enter their values in a query as the query string and that will be searched on in the database. The individual
This could introduce some inconveniences when using the CLI.
'words' from these attribute values are stored in index tables in the database and searched as part of the query to return objects with matching values. I believe it is problematic to do string comparison in UTF8.
I really doubt that. Have you used Google search recently? ;-) More seriously, I believe that most countries with their own alphabets use internet tools and webpages without translating all the names, addresses and other things to US-ASCII or Latin1.
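Both sides of this exchange can be seen in a small Python sketch. Comparing raw UTF8 strings can indeed give surprising results, because the same visible name may be stored as different code point sequences. Normalising the words before they are indexed and before they are looked up removes that particular problem. The function name `index_key` is hypothetical, and NFC plus case folding is only one possible choice; nothing in the thread says how the RIPE Database index tables would handle this.

```python
import unicodedata


def index_key(word: str) -> str:
    """Build a comparison key: canonical composition (NFC), then case folding.

    Sketch only; a production index might also need locale-aware collation.
    """
    return unicodedata.normalize("NFC", word).casefold()


composed = "Strzy\u017cewski"      # z-with-dot-above as one code point
decomposed = "Strzyz\u0307ewski"   # 'z' followed by COMBINING DOT ABOVE

# The raw strings differ even though they render identically...
assert composed != decomposed
# ...but their index keys match, so a lookup would find the same entry.
assert index_key(composed) == index_key(decomposed)
```

The same key function would have to be applied both when indexing attribute values and when processing the query string, otherwise lookups would still miss.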
Also the Full Text Search allows searches on all these attributes as well as "address:", "descr:" and "remarks:". Again all the component parts of these values are indexed for this search.
So to allow any attribute in UTF8 only, may require software changes and may put restrictions on some of the services the database currently provides. If you cannot rely on a search returning the correct objects then you cannot allow those searches.
I'm aware that any modification may require software changes. I hope you are not suggesting that we should abandon an improvement just because it requires some work.
There was a Labs article written some time ago on UTF8 https://labs.ripe.net/Members/kranjbar/internationalisation-of-ripe-database
This article put forward the idea of keeping all existing attributes in ASCII (but really meant Latin1) and allowing additional optional attributes for name and contact details in local language. I think that would be a good first step to provide additional benefits of localisation without breaking any of the current functionality. Even if it was only an interim step it would allow time to assess any issues and monitor the usefulness of these new attributes.
It was back in 2010 during RIPE61 when I proposed person-idn: and other similar attributes. Although I understand your point of view, I believe that the situation has changed over the years.

Best regards,
Piotr
--
gucio -> Piotr Strzyżewski
E-mail: Piotr.Strzyzewski@polsl.pl
Hi all, On Wed, May 06, 2015 at 02:07:56PM +0200, Piotr Strzyzewski wrote:
correct name? Just for argument's sake, changing my name into Chinese with Google translate changes the space to a '.'. If that is correct then the current syntax check fails.
Well spotted.
The syntax check might need fixing then. The assumption that every name consists of at least two strings separated by a space is based on nothing. I would consider this a bug. Who will file the issue on github? :-)
Also "person:", "role:" and "org-name:" are all defined as 'lookup keys'.
This could introduce some inconveniences when using the CLI.
Why would this be an issue with UTF8? Can someone from RIPE NCC comment on how this looks from the technical side of things?
There was a Labs article written some time ago on UTF8 https://labs.ripe.net/Members/kranjbar/internationalisation-of-ripe-database
This article put forward the idea of keeping all existing attributes in ASCII (but really meant Latin1) and allowing additional optional attributes for name and contact details in local language.
It was back in 2010 during RIPE61 when I proposed person-idn: and other similar attributes. Although I understand your point of view, I believe that the situation has changed over the years.
So you two are leaning towards allowing UTF8 in some fields, and in other places add an optional new attribute (such as person-idn) if people want to describe more clearly what their actual name is? If this is the case it would be good if you go over all fields/attributes the database currently knows, and compile a full list of attributes that should receive an idn-sibling or should accept UTF8 instead of whatever they currently accept. Kind regards, Job
Hi all

On 06/05/2015 14:46, Job Snijders wrote:
Hi all,
On Wed, May 06, 2015 at 02:07:56PM +0200, Piotr Strzyzewski wrote:
correct name? Just for argument's sake, changing my name into Chinese with Google translate changes the space to a '.'. If that is correct then the current syntax check fails.

Well spotted.

The syntax check might need fixing then. The assumption that every name consists of at least two strings separated by a space is based on nothing. I would consider this a bug. Who will file the issue on github? :-)
I agree there is no reason to keep that specific syntax check. But my point was that changing any attribute to UTF8 only may affect syntax checks or business rules and these need to be considered.
Also "person:", "role:" and "org-name:" are all defined as 'lookup keys'.

This could introduce some inconveniences when using the CLI.

Why would this be an issue with UTF8? Can someone from RIPE NCC comment on how this looks from the technical side of things?
There was a Labs article written some time ago on UTF8 https://labs.ripe.net/Members/kranjbar/internationalisation-of-ripe-database

This article put forward the idea of keeping all existing attributes in ASCII (but really meant Latin1) and allowing additional optional attributes for name and contact details in local language.

It was back in 2010 during RIPE61 when I proposed person-idn: and other similar attributes. Although I understand your point of view, I believe that the situation has changed over the years.

So you two are leaning towards allowing UTF8 in some fields, and in other places add an optional new attribute (such as person-idn) if people want to describe more clearly what their actual name is?
If this is the case it would be good if you go over all fields/attributes the database currently knows, and compile a full list of attributes that should receive an idn-sibling or should accept UTF8 instead of whatever they currently accept.
This is what I suggested a few years ago. Someone (maybe a task force?) needs to look at every attribute in every object and choose one of three categories for it:

- Latin1 only: some attributes make no sense in local language, e.g. status, import
- Duplicated: some attributes may need to be available in Latin1 for registry consistency, legal reasons, or simply maintaining a database for the whole region to make use of, but could also be duplicated in local language, e.g. org-name, abuse-mailbox
- UTF8 only: some attributes could be open to any character set, e.g. remarks, notify (only relevant to maintainer of object)

This requires a bit more preparation work and introducing new attributes, but in the end it allows much more of the database to be opened up to the possibility of UTF8 without restricting any of its value or usage throughout the whole region.

cheers
denis
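The three categories above could be recorded in a simple lookup table, sketched here in Python. Only the six example attributes named in the message are listed; a real classification would have to cover every attribute of every object type. The names `Charset`, `CATEGORY` and `accepts` are hypothetical, and the sketch assumes that a "Duplicated" attribute keeps its Latin1 restriction while the local-language form lives in a separate sibling attribute.

```python
from enum import Enum


class Charset(Enum):
    LATIN1_ONLY = "latin1-only"
    DUPLICATED = "duplicated"   # Latin1 here, local language in a sibling attribute
    UTF8_ONLY = "utf8-only"     # open to any character set


# Examples taken from the message; everything else is still to be classified.
CATEGORY = {
    "status": Charset.LATIN1_ONLY,
    "import": Charset.LATIN1_ONLY,
    "org-name": Charset.DUPLICATED,
    "abuse-mailbox": Charset.DUPLICATED,
    "remarks": Charset.UTF8_ONLY,
    "notify": Charset.UTF8_ONLY,
}


def accepts(attribute: str, value: str) -> bool:
    """Check whether a value fits the character set of a classified attribute."""
    if CATEGORY[attribute] is Charset.UTF8_ONLY:
        return True
    try:
        value.encode("latin-1")  # both other categories stay within Latin1
    except UnicodeEncodeError:
        return False
    return True
```

Under these assumptions `accepts("remarks", ...)` allows any text, whereas `accepts("org-name", ...)` rejects a value containing letters outside Latin1, which would instead go into the duplicated local-language attribute.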
On Wed, May 06, 2015 at 06:06:56PM +0200, denis wrote: Hi
There was a Labs article written some time ago on UTF8 https://labs.ripe.net/Members/kranjbar/internationalisation-of-ripe-database

This article put forward the idea of keeping all existing attributes in ASCII (but really meant Latin1) and allowing additional optional attributes for name and contact details in local language.

It was back in 2010 during RIPE61 when I proposed person-idn: and other similar attributes. Although I understand your point of view, I believe that the situation has changed over the years.

So you two are leaning towards allowing UTF8 in some fields, and in other places add an optional new attribute (such as person-idn) if people want to describe more clearly what their actual name is?
If this is the case it would be good if you go over all fields/attributes the database currently knows, and compile a full list of attributes that should receive an idn-sibling or should accept UTF8 instead of whatever they currently accept.
This is what I suggested a few years ago. Someone (maybe a task force?) needs to look at every attribute in every object and choose one of three categories for it:
- Latin1 only: some attributes make no sense in local language, e.g. status, import
- Duplicated: some attributes may need to be available in Latin1 for registry consistency, legal reasons, or simply maintaining a database for the whole region to make use of, but could also be duplicated in local language, e.g. org-name, abuse-mailbox
- UTF8 only: some attributes could be open to any character set, e.g. remarks, notify (only relevant to maintainer of object)
This requires a bit more preparation work and introducing new attributes, but in the end it allows much more of the database to be opened up to the possibility of UTF8 without restricting any of its value or usage throughout the whole region.
Although I like the idea of setting up the task force, I would like to wait a moment to gather some more ideas and points of view from the community. :)
From the above it looks like the TF should be set up to discuss the implementation details [*], whereas we still do not know if the community at large wants UTF8 in DB at all.
[*] This is in fact the role of the NCC itself, but I believe that we as the community have the mandate to provide some directions here.

Piotr
--
gucio -> Piotr Strzyżewski
E-mail: Piotr.Strzyzewski@polsl.pl
On Wed, May 06, 2015 at 02:46:05PM +0200, Job Snijders wrote: Hi
On Wed, May 06, 2015 at 02:07:56PM +0200, Piotr Strzyzewski wrote:
correct name? Just for argument's sake, changing my name into Chinese with Google translate changes the space to a '.'. If that is correct then the current syntax check fails.
Well spotted.
The syntax check might need fixing then. The assumption that every name consists of at least two strings separated by a space is based on nothing. I would consider this a bug. Who will file the issue on github? :-)
Thanks for volunteering. ;-)
There was a Labs article written some time ago on UTF8 https://labs.ripe.net/Members/kranjbar/internationalisation-of-ripe-database
This article put forward the idea of keeping all existing attributes in ASCII (but really meant Latin1) and allowing additional optional attributes for name and contact details in local language.
It was back in 2010 during RIPE61 when I proposed person-idn: and other similar attributes. Although I understand your point of view, I believe that the situation has changed over the years.
So you two are leaning towards allowing UTF8 in some fields, and in other places add an optional new attribute (such as person-idn) if people want to describe more clearly what their actual name is?
At first, no. The proposal sent a few weeks ago was to introduce full support for UTF8 without any new attributes. Taking into account my old idea and Denis' point of view, we can reach some consensus here. I hope that other members of the community will also present some views and ideas here.
If this is the case it would be good if you go over all fields/attributes the database currently knows, and compile a full list of attributes that should receive an idn-sibling or should accept UTF8 instead of whatever they currently accept.
What I see as options right now are:
1. Full support for UTF8 in current attributes (excluding primary keys).
2. Full support for UTF8 in complementary attributes (a list has to be made; interim solution).

Option 2 could lead to server behaviour controlled by a new option. This option could control which one of those complementary attributes is returned (Latin1 or UTF8) by the server. Moreover, the default behaviour could be changed after some transitory period of time.

Piotr
--
gucio -> Piotr Strzyżewski
E-mail: Piotr.Strzyzewski@polsl.pl
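The server behaviour described for option 2 could look roughly like the Python sketch below. It assumes that complementary attributes are named by appending "-idn" to the existing attribute name, following the person-idn: idea mentioned in the thread. The function name, the `prefer_utf8` flag and the sample nic-hdl value are hypothetical; no such query option exists in the thread beyond the idea itself.

```python
IDN_SUFFIX = "-idn"  # assumed naming scheme for complementary attributes


def select_attributes(obj: list[tuple[str, str]],
                      prefer_utf8: bool = False) -> list[tuple[str, str]]:
    """Return one attribute of each Latin1/UTF8 pair, depending on the flag.

    Attributes without a counterpart are always returned unchanged.
    """
    names = {name for name, _ in obj}
    result = []
    for name, value in obj:
        if name.endswith(IDN_SUFFIX):
            base = name[: -len(IDN_SUFFIX)]
            # Keep the UTF8 form when requested, or when it is the only form.
            if prefer_utf8 or base not in names:
                result.append((name, value))
        elif prefer_utf8 and name + IDN_SUFFIX in names:
            continue  # the UTF8 sibling is returned instead
        else:
            result.append((name, value))
    return result


person = [
    ("person", "Piotr Strzyzewski"),
    ("person-idn", "Piotr Strzy\u017cewski"),
    ("nic-hdl", "PS1-EXAMPLE"),  # made-up handle; primary keys stay ASCII
]
```

Calling `select_attributes(person)` returns the Latin1 "person" line plus the handle, while `select_attributes(person, prefer_utf8=True)` returns "person-idn" instead. Changing the default after a transition period would then amount to flipping the default value of the flag.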
participants (4)
- denis
- denis walker
- Job Snijders
- Piotr Strzyzewski