Handling accented characters searches

I build dual language systems – English and Spanish. It’s my first time building a system in any language other than English and I came across this interesting issue the other day.

Let’s take the name Martinez for this example. I’ve seen it entered with different case alphabets (upper, lower, mixed, Pascal) and also with Latin characters (for instance, accented Í – aka culture sensitive accent – instead of I in either case). So my two extremes for this would be “martinez” and “MARTÍNEZ”.  My constraint being that the data operators and the people who search this data could use the extremes or anything in between. So if I had a clean solution to the extremes, this would largely fix the issue.

So I set about solving this problem in code (C# in this instance)

First I run a simple equality operator (==) on the two extremes

[TestMethod]
public void TestAccented()
{
string plainName = “martinez”;
string accentedName = “MARTÍNEZ”;
Assert.IsTrue(plainName == accentedName);//this fails
}

This test fails. So we’ll approach the problem by splitting into obvious and not-s0-obvious issues:

  • Obvious
    • Casing: Use ToUpper() on both to account for upper and lower case differences – this fails so some other difference exists
    • Internationalization: Could an upper case “i” in English “I” be equal to an upper case “i” in Latin character “Í” if you make the comparison culture insensitive? ToUpperInvariant() fails – which brings me to the next point
  • Not-so-obvious: Obviously the accented Í is more complicated to solve than simple casing and cultural difference.

Following on this chain of through, the accent on the character is also known as a “diacritic” – a term I learned when I started learning Spanish. Searching the Internet for “c# string diacritic” provided a bunch of useless results centered around replacing characters which went against my criteria of a non-hack”y” solution. However, searching for the term “C# string diacritic compare” did the trick.

I came across a seldom used (by me, at least) switch – CompareOptions – an enum in C#, which included this blurb that sounded promising:

Using this in code

[TestMethod]
public void TestAccented()
{
string plainName = “martinez”;
string accentedName = “MARTÍNEZ”;
System.Globalization.CompareInfo compareInfo = System.Globalization.CultureInfo.InvariantCulture.CompareInfo;
//first switch accounts for upper case-lower case, second switch ignore diacritics
System.Globalization.CompareOptions options = System.Globalization.CompareOptions.IgnoreCase | System.Globalization.CompareOptions.IgnoreNonSpace;
var compareResult = compareInfo.Compare(plainName, accentedName, options);
Assert.AreEqual(compareResult, 0); //this works!
}

Now in a MVC web application scenario, I put this incorporated this code in the model to prevent rewriting this in every compare operation for the relevant data entry text fields.

So, this got me thinking is there another approach I could use do at a deeper level? I store my user data in SQL Server. What if I wanted to run a SQL query with the above name as part of the last name which is a common scenario in that part of the world)?

SELECT count(*)
FROM <table name>
where LastName like ‘%MARTÍNEZ%’

The above query gave me a number, let’s say 75. But I was afraid I was missing out on results because of similar issues as the original problem statement. Looking at the collation settings for that column revealed it to be on database default: SQL_Latin1_General_CP1_CI_AS

Breaking this down:

  1. latin1 makes the server treat strings using charset latin 1, basically ascii
  2. CP1 stands for Code Page 1252
  3. CI case insensitive comparisons so ‘ABC’ would equal ‘abc’
  4. AS accent sensitive, so ‘ü’ does not equal ‘u’

So, basically SQL Server was, by default, configuring these columns to use the ASCII character set (latin1) and use a particular lookup table to map the hex value to a character (CP1). The flags: CI for case insensitive and AS for accent sensitive – perked my interest. It seems like CI would do the job for casing but AS was holding me back.

Looking back at the collation settings manual, it seemed like changing _AS to _AI would do the trick.

So, I picked the SQL_Latin1_General_CP1_CI_AI setting from the list (above current default setting).

When I reran the SELECT COUNT(*) query, I got almost 5 times the number of results compared to the previous which means the data entry operators were entering accented character (diacritics) in an unpredictable manner. Additionally, making this change at the data storage level required little to no change in any code that accessed this database.

Addendum 1: If this issue was systemic, i.e., happening across a wide section of tables column, I could fix this by changing the collation setting at the database level.

Addendum 2: I discovered that Azure CosmosDB SQL – DocumentDB does not have collation settings but you can harness ASCII folding filters with Custom analyzers in Azure Search (you should probably be using Azure Search with DocumentDB anyway) to achieve culture/accent insensitive searches – probably a blogpost for the future.

If you liked this post, please share it on social media and follow me on Twitter @_s_hariIs there anything I’m leaving out? Feel free to share in the comments below.

2 thoughts on “Handling accented characters searches”

  1. Hi there. Just to clarify a few things:

    1. The Stack Overflow answer you linked to is incorrect regarding the first item (the meaning of “Latin1_General”). The first part of a Collation name refers to the culture / locale that is used mainly to determine the main linguistic sorting and comparison rules.
    2. The Code Page (noted in the names of only some of the SQL Server collations , those starting with “SQL_*” ) is a property of the collation. This is the “charset”, BUT it only affects VARCHAR / CHAR / TEXT datatypes. NVARCHAR / NCHAR / NTEXT datatypes are Unicode, and that is a singular character set. Yes, the culture / locale does imply the Code Page, and yes, Code Page 1252 is for Latin1, but they are not the same thing.
    3. While the SQL_Latin1_General_CP_CI_AS is the (very) unfortunate default collation for installations on OSs using an English language, it is not a very good choice. These collations are not being updated, and they have very simplistic sorting rules that can cause performance issues when comparing indexes on VARCHAR columns using them against NVARCHAR data (please see: https://sqlquantumleap.com/2017/07/10/impact-on-indexes-when-mixing-varchar-and-nvarchar-types/ ).
    4. The datatype of your column is not shown, and that might make a difference in how things are handled given that you are using a SQL Server Collation. Your query shows a string literal without the upper-case “N” prefix used to denote a Unicode string, but that would be converted to NVARCHAR if the column in the comparison is NVARCHAR / NCHAR.
    5. You do not need to change the collation of the column to change how a particular query works with that column. And you might not want to change it anyway. The collation should be set to however the comparisons and/or sorting are done most often. For operations that happen less frequently, you can control that behavior by adding the COLLATE keyword to any predicate (e.g. column = ‘value’ ). For example, you could have left the column with the SQL_Latin1_General_CP1_CI_AS collation and then updated your query to use: ” where LastName like ‘%MARTÍNEZ%’ COLLATE Latin1_General_100_CI_AI “.
    6. It is also best to use the newest version of a collation that you have access to, unless you need to join to existing tables that already use a different collation. But for new projects, you should use the 100 level collations on systems that are SQL Server 2008 or newer (at least for the Latin1_General_* collations).

    Take care, Solomon..

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.