Laurent & Enke:

First of all, thanks for your work, guys. Now for my opinion about how to sync the data:

First, when RIPE gives us their latest data run, you will see *no* differences in the 193 data, since we will throw out all the 193 nets we currently have.

Second, if there are conflicts in the data, they will be on the 192 nets and class B's. This would be a more interesting report. To make this report even more interesting, there are various levels of errors that are not really reported here and that need to be synced up. IMHO, here are my levels of error, in order of severity (based on an IP network match):

        1) Organizational name differences
*tied*  2) Organization addresses
        2) netnames

One should easily be able to pull this from any database transfer file (yours, ours, RIPE's).

Finally, I think the comparison should be between Merit data and InterNIC data....

Thanks,
Mark

PS: About the weekly dumps, I think we first need to work out how to point out the problems that need to be fixed, and then solve them. You have a good start with our recent dump of the InterNIC database, which can be easily separated.
Thanks to the InterNIC folks, the InterNIC data for the network numbers is available as a flat file on merit.edu. We ran a comparison between the InterNIC and the Merit data, and between the RIPE data and the InterNIC data. Since RIPE is authoritative for the European information and the InterNIC is authoritative for everything else, it seems most important to address the RIPE/InterNIC discrepancies.
Here is the result of the first comparison between RIPE and InterNIC. (The data may be old for some networks, and our program may have some bugs... The results for the Merit/InterNIC data are not provided.)
Number of entries:        10423
Unregistered in NIC DB:    5298 (50%)
No difference:              108  (1%)
Small differences:         2047 (19%)
    Substrings:            1604 (15%)
    Typos:                  510  (4%)
    Order:                  356  (3%)
    Punctuation:             21  (0%)
    Abbreviations:          186  (1%)
    One Word only:            5  (0%)
Differences:               2970 (28%)
    100%:                   758  (7%)
     80%:                   498  (4%)
     60%:                   494  (4%)
     40%:                   723  (6%)
     20%:                   497  (4%)
Remarks:
- 10423 is the number of networks and blocks in the RIPE DB; each block is counted as 1 network.
- The addresses for the RIPE networks are taken from the 'Administrative Contact' entries. Of the 10423 networks, 224 do not have an administrative contact entry in the person database. In such a case we use the 'description' attribute to get the address. Some entries are duplicated in the person DB, which creates some strange addresses (e.g. 193.84.64.0 -> Alexandr Modry).
- 'Unregistered in NIC DB' means the network is part of the RIPE block in the NIC DB, but does not have its own entry.
- 'Substrings' means all the words of one address appear in the second address (e.g. "Celisoft Data AB, Box 718, S-941 28 Piteaa, Sweden" and "Celisoft Data AB, Celisoft Data AB, BOX 718, S-941 28 PITEAA, SWEDEN").
- The differences are given as the percentage of words that differ. 100% means 80% to 100% of the words are different. We also take into account typos, abbreviations, punctuation, etc., which are not a 'difference' per se.
- The program which compares the addresses takes about 20s for 10000 entries. The program which reads the InterNIC data and formats it for comparison takes about 20 minutes.
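To make the 'Substrings' and percentage remarks above concrete, here is a minimal sketch in Python of how such a word-based comparison could work. It is not the program we actually ran. The function names, the tokenization (lowercase alphanumeric words), the use of the union of both word sets as the denominator, and the exact bucket boundaries are all assumptions made for illustration.

```python
import re


def words(address):
    """Lowercase the address and return its set of alphanumeric words.

    Assumption: case and punctuation are ignored; the real program may
    tokenize differently.
    """
    return set(re.findall(r"[a-z0-9]+", address.lower()))


def is_substring_match(a, b):
    """True when every word of one address also appears in the other."""
    wa, wb = words(a), words(b)
    return wa <= wb or wb <= wa


def difference_percent(a, b):
    """Percentage of distinct words found in only one of the two addresses.

    Assumption: the denominator is the union of both word sets.
    """
    wa, wb = words(a), words(b)
    union = wa | wb
    if not union:
        return 0.0
    return 100.0 * len(wa ^ wb) / len(union)


def bucket(percent):
    """Map a percentage to a report bucket (20, 40, 60, 80 or 100).

    Assumption: each bucket covers [upper - 20, upper), so '100%' holds
    everything from 80% up to and including 100%.
    """
    for upper in (20, 40, 60, 80):
        if percent < upper:
            return upper
    return 100


ripe = "Celisoft Data AB, Box 718, S-941 28 Piteaa, Sweden"
nic = "Celisoft Data AB, Celisoft Data AB, BOX 718, S-941 28 PITEAA, SWEDEN"
print(is_substring_match(ripe, nic))
print(bucket(difference_percent("Box 718, Piteaa", "Main Street 5, Lulea")))
```

With this tokenization the Celisoft pair from the remarks counts as a 'Substrings' match, since both addresses reduce to the same set of words. A fuller version would also need the typo, abbreviation and word-order checks, which are left out here.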
We'd like to set up a plan with RIPE and the InterNIC to eliminate the discrepancies. Merit has a list of the 5298 networks without an entry in the NIC DB and can provide a detailed list of the address differences for the other inconsistencies. Do you all have any suggestions as to how we can proceed to resolve this problem?
Laurent & Enke.
PS: Could InterNIC provide its flat file periodically? (once a week?)
PS: Any comments are welcome.