"International" StrCmpLogicalW

I've made a version of the native StrCmpLogicalW in C# that sorts the non-digit parts in a culturally-sensitive way (or, if you like, in a non-culturally-sensitive way). You can download the code here.

The algorithm is quite simple, and there's plenty of room for improvement. Basically, it just splits each string into non-digit and digit parts, then compares the non-digit parts using a normal StringComparer and the digit parts by first converting them to integers and then doing a simple integer comparison.
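To make that concrete, here is a minimal sketch of the split-then-compare idea. It is not the downloadable code: the class name `NaturalComparerSketch` and the helper names are made up for illustration, and I'm assuming that a digit part compared against a non-digit part simply falls back to the text comparer.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical illustration only; not the code from the download.
public sealed class NaturalComparerSketch : IComparer<string>
{
    private readonly StringComparer _textComparer;

    public NaturalComparerSketch(StringComparer textComparer)
    {
        _textComparer = textComparer ?? throw new ArgumentNullException(nameof(textComparer));
    }

    private static bool IsAsciiDigit(char c) => c >= '0' && c <= '9';

    // Splits "file10b" into ["file", "10", "b"]: alternating runs of
    // non-digits and digits. Parts are never empty.
    private static List<string> Split(string s)
    {
        var parts = new List<string>();
        int start = 0;
        while (start < s.Length)
        {
            bool digits = IsAsciiDigit(s[start]);
            int end = start;
            while (end < s.Length && IsAsciiDigit(s[end]) == digits) end++;
            parts.Add(s.Substring(start, end - start));
            start = end;
        }
        return parts;
    }

    public int Compare(string x, string y)
    {
        if (x == null) return y == null ? 0 : -1;
        if (y == null) return 1;

        List<string> a = Split(x), b = Split(y);
        int shared = Math.Min(a.Count, b.Count);
        for (int i = 0; i < shared; i++)
        {
            int result;
            if (IsAsciiDigit(a[i][0]) && IsAsciiDigit(b[i][0]))
            {
                // 32-bit parse: throws OverflowException on very long digit runs.
                int left = int.Parse(a[i], CultureInfo.InvariantCulture);
                int right = int.Parse(b[i], CultureInfo.InvariantCulture);
                result = left.CompareTo(right);
            }
            else
            {
                // Assumption: mixed or text parts go to the text comparer.
                result = _textComparer.Compare(a[i], b[i]);
            }
            if (result != 0) return result;
        }
        // All shared parts are equal, so the string with more parts sorts last.
        return a.Count.CompareTo(b.Count);
    }
}
```

Passing `StringComparer.CurrentCulture` gives the culturally-sensitive ordering, and `StringComparer.Ordinal` gives the non-culturally-sensitive one.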

There are a couple of areas where it could (should) be improved. First of all, it splits the entire string, even though the first few characters might be all that decides the difference. The solution here would be to split and compare at the same time, but that would have complicated things a little bit too much.
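For what it's worth, a split-and-compare-at-the-same-time version might look roughly like the sketch below. The names (`StreamingNaturalCompare`, `RunEnd`) are hypothetical, and I've parsed digit runs as 64-bit values here purely to keep the example short. It walks both strings one run at a time and returns as soon as a pair of runs differs, so the tail of the string is never touched.

```csharp
using System;
using System.Globalization;

// Hypothetical illustration only; not the code from the download.
public static class StreamingNaturalCompare
{
    private static bool IsAsciiDigit(char c) => c >= '0' && c <= '9';

    // Returns the index just past the run (digits or non-digits) starting at 'start'.
    private static int RunEnd(string s, int start)
    {
        bool digits = IsAsciiDigit(s[start]);
        int end = start;
        while (end < s.Length && IsAsciiDigit(s[end]) == digits) end++;
        return end;
    }

    public static int Compare(string x, string y, StringComparison textComparison)
    {
        int i = 0, j = 0;
        while (i < x.Length && j < y.Length)
        {
            int iEnd = RunEnd(x, i), jEnd = RunEnd(y, j);
            string left = x.Substring(i, iEnd - i);
            string right = y.Substring(j, jEnd - j);

            int result;
            if (IsAsciiDigit(left[0]) && IsAsciiDigit(right[0]))
            {
                // Still overflows for runs that do not fit into 64 bits.
                long a = long.Parse(left, CultureInfo.InvariantCulture);
                long b = long.Parse(right, CultureInfo.InvariantCulture);
                result = a.CompareTo(b);
            }
            else
            {
                result = string.Compare(left, right, textComparison);
            }

            // Early exit: later runs are never split or compared.
            if (result != 0) return result;

            i = iEnd;
            j = jEnd;
        }
        // At least one string is exhausted; the one with text left over sorts last.
        return (x.Length - i).CompareTo(y.Length - j);
    }
}
```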

Another problem is that it assumes the digit-part will fit into a 32-bit integer. It would be easy enough to change that to a 64-bit integer, but a truly scalable solution would be to do some "fancy" string compares, with the knowledge that you're dealing in numbers only.
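One way the "fancy" string compare could work, sketched under the assumption that both inputs contain only the digits 0-9: strip leading zeros, let the longer run win, and fall back to an ordinal comparison when the lengths match. The `DigitRunComparer` name is made up for this example.

```csharp
using System;

// Hypothetical illustration only; not the code from the download.
public static class DigitRunComparer
{
    // Compares two runs of ASCII digits by numeric value without converting
    // them to an integer type, so the length of the run is not limited.
    public static int Compare(string a, string b)
    {
        // Leading zeros do not change the value: "007" and "7" are equal.
        string x = a.TrimStart('0');
        string y = b.TrimStart('0');

        // Without leading zeros, more digits always means a larger number.
        if (x.Length != y.Length) return x.Length.CompareTo(y.Length);

        // Same length: ordinal order of '0'..'9' matches numeric order.
        return Math.Sign(string.CompareOrdinal(x, y));
    }
}
```

Note that this treats "007" and "7" as equal, so a real implementation would probably want some tie-breaker for leading zeros.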

Finally, one of the questions Michael Kaplan posed in his recent post "What would it mean to internationalize StrCmpLogicalW?" was "how would you deal with non-ASCII digits?" At the moment, my code only cares about the digits 0-9, but it wouldn't be impossible to extend it so that it correctly handles all digits defined in Unicode. Well, all 0-9-based numbering systems, anyway; I'd hate to have to implement Ethiopic numbering in it :)
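As a rough idea of how that extension could start, .NET already exposes the Unicode decimal digit value of a character through `CharUnicodeInfo.GetDecimalDigitValue`, which returns -1 for anything that isn't a decimal digit. The sketch below (with the made-up name `UnicodeDigits`) maps a run of such digits onto 0-9 so the rest of the comparison can stay unchanged. It only looks at single `char` values, so digits outside the Basic Multilingual Plane are an assumption it does not cover.

```csharp
using System;
using System.Globalization;
using System.Text;

// Hypothetical illustration only; not the code from the download.
public static class UnicodeDigits
{
    // True for any character with a Unicode decimal digit value (0-9),
    // e.g. ASCII '7' or ARABIC-INDIC DIGIT SEVEN (U+0667).
    public static bool IsDecimalDigit(char c) =>
        CharUnicodeInfo.GetDecimalDigitValue(c) >= 0;

    // Rewrites a run of decimal digits from any 0-9-based script as ASCII digits.
    public static string ToAsciiDigits(string run)
    {
        var sb = new StringBuilder(run.Length);
        foreach (char c in run)
        {
            int value = CharUnicodeInfo.GetDecimalDigitValue(c);
            if (value < 0)
                throw new ArgumentException("Not a decimal digit: " + c, nameof(run));
            sb.Append((char)('0' + value));
        }
        return sb.ToString();
    }
}
```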
