Some problems with 'first name' and 'last name' fields in data (utcc.utoronto.ca)
37 points by ingve 12 months ago | 67 comments



Instead of complicated schemes that attempt to be locale aware, it seems simpler to use a general scheme that will apply regardless.

Collect two values:

1. What do you want to be called? e.g. if we send you an email we'll say Hello {name}

2. What is your legally recognised name i.e. what's on your passport and official documents.

#2 is only relevant in some domains; in many cases #1 alone is sufficient.
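
A minimal sketch of that two-value scheme, in Python. The class and function names (`PersonName`, `greeting`, `name_for_legal_document`) are made up for illustration; the only idea taken from the comment is that #1 is always collected and #2 is optional.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PersonName:
    # 1: what the person wants to be called; always collected.
    preferred: str
    # 2: the name on official documents; only collected where the domain needs it.
    legal: Optional[str] = None


def greeting(name: PersonName) -> str:
    # Everyday communication only ever uses the preferred name.
    return f"Hello {name.preferred}"


def name_for_legal_document(name: PersonName) -> str:
    # Refuse to guess: never derive a legal name from the preferred one.
    if name.legal is None:
        raise ValueError("no legal name on file")
    return name.legal
```

Nothing here splits the name into parts, which is the point: the user types both strings and the system only stores and echoes them.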


> What is your legally recognised name i.e. what's on your passport and official documents.

1) People can have multiple passports with different names.

2) A single passport can contain multiple names for the same person.


What standard allows multiple aliases on one passport?

AFAIK, the MRZ and chip don’t support alias fields?


Russian passports contain a transliterated Latinized version, which also (usually?) omits the patronymic.


And now we need a "Falsehoods developers believe about passports" doc...


ICAO 9303, part 10, section 4.7.11, data group 11 allows for a list of other names.


You're making the parent's point. Why bother trying to model the entire concept of a "name"? Just have something to display on the screen (get the user to type it in and they can't complain when you display it back) and something to put on legal documents.


As mentioned at the tail end of TFA:

3. A string that will never be displayed, but will result in proper sorting. It's not simply a concatenation of all name parts, though. Omitting spaces and punctuation is also critical for a nice sort, so that apostrophes and the like don't shunt things to the top or bottom. But don't just filter to Latin letters, keep all [0] letters!

[0] https://stackoverflow.com/a/3617818
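
One way such a sort string could be derived as a default, sketched in Python with only the standard library. The function name `sort_key` is hypothetical, and this version also drops combining accents and case, which is an assumption on my part and not something TFA asks for; real locale-aware collation needs more than this.

```python
import unicodedata


def sort_key(name: str) -> str:
    # Decompose accented letters and fold case so "Martí" and "marti" compare equal.
    folded = unicodedata.normalize("NFKD", name).casefold()
    # Keep every character whose Unicode category is a letter (L*), in any script;
    # spaces, apostrophes, hyphens and combining marks are dropped.
    return "".join(ch for ch in folded if unicodedata.category(ch).startswith("L"))


names = ["O'Neill", "Oakes", "Olsen"]
ordered = sorted(names, key=sort_key)
```

With a plain string sort the apostrophe pushes "O'Neill" ahead of "Oakes"; with the key it lands after "Olsen". Letting the user override the derived value would cover the cases this gets wrong.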


Which is very cool until you have to interface with systems expecting first/last name :-(

When my last company did phone numbers we just let it be free text…


I'm always surprised there's no (widely used) international standard here. It seems like we have international standards for nearly everything. But for some reason we still use the western first-middle-last naming structure as a de-facto standard and as soon as you have to accommodate people from one of the many cultures that don't use this, you're in trouble.

I've even run into trouble occasionally as a Western person with fairly normal first and last but no middle name. More than once I've been confronted with online forms that require a middle name (not for many years though, thankfully).


We basically had three fields. First and last, but that was because we had to interface with tons of internal and external entities that required it. And we had a freetext “spoken name” that would be used all over the site.

Good times.

There is an international standard for phone numbers, actually (ITU-T E.164). There are even libraries to parse phone numbers into that format. That being said, for “non machine readable” phone numbers (say for a contacts app) some say you should keep it free text because people plug all kinds of shit into the field and if you aren’t using it to send messages/voice/whatever just let the user do whatever…


Please don't call it "western". Most Latin languages, the majority of the western culture, don't have just two names. This is exclusively Anglo-Saxon (or something), not "western"


I stated three names in my comment: first-middle-last. It's also common to put more than one name, or hyphenated names, in any of those fields so this format suits anywhere more than two names, as long as they roughly follow the format first ... nth ... last. I think this covers all western names. If not, please give me examples.

The problem arises when the order is reversed (e.g. Vietnamese), or when there's only one name and the form requires at least two names. And probably for other name formats around the world that I'm not aware of.


In Spain, people usually have one or two names, then two last (family) names. No hyphens. In Catalonia, the two family names are sometimes "anded". Example: Miquel Martí i Pol.


That still fits in a form with first-<middle>-last fields. And officials reading the form will know to call the person Miquel.

First: Miquel Middle: empty Last: Martí i Pol

The problem is for names that don't work in that format.


Where I am from, we don't have first and last names. It's usually a name people call you by and your full name. The full name can have multiple parts (usually 2-3, but it can be more). The caste or family name usually comes at the end. Some caste-related names are at the beginning as prefixes.

E.g. the first part of my name is Muhammad, but no one ever calls me by this in my country (Pakistan). This is one of the most common prefixes here. I am usually surprised when someone here in the UK correctly calls me by my name instead of Muhammad.

Filling in an official form with Muhammad as my "First Name" is a weird experience, knowing that this is what I will be called in communications.


Funnily enough, I have the same problem and I'm from the UK. I now live in Southeast Asia where officials will call me by my middle name which is a really weird experience for me.


I think a better format for forms can be "name", which is the name you would like to be called by, and "full name", which should cover a lot more cases.


It's always been a forced input format that doesn't stand up to scrutiny even a bit outside of the western world.


Also fair to note that Western people who live in places with other name formats will suffer the reverse problem.


thought: when half the country has a "muhammed" appendage to their name, what does it even mean? and might it perhaps be time to just not consider it as such?


It seems a bit presumptive to me to suggest that entire other cultures change their definition of what a name is just because you don't understand them.

If someone tells me their name, I think that it is right to believe them unconditionally. No matter if it starts with Wang (100M people!) or ends with -son/-dottir (~all of Iceland) or contains affixes or other common parts. Names are deeply significant and have more value than just contributing information entropy to a unique identifier.


at no point did I suggest we should not believe someone when they say their name. But what purpose is a complete unit in a name if it is what 1/2 of everyone else in a country has too?

the icelandic case is different, as it does after all function as a lastname, just a suffix on the lastname.

I see no value in this, it's kinda like putting a salutation into the name. I would receive NO value in having an additional firstname of "Mr"


This is a common prefix among Muslims. It is not serving a calculated purpose. It is added out of respect for the prophet they believe in. There are certain sects which append or prepend the name of some other religious figure important to them. I have been told that some people even used to (and maybe still do) add the names of their teachers to their own names in older times, out of deep respect / reverence.

Names are part of identity the same way origin is. Can't just ask to purge it because everyone around has some of it in common.


if everyone has it, it ceases to have any significance, and can be ignored.

this is why someone might have been called "johnny 1 leg", because he only had one leg. You do not hear things like "2 legs 2 arms 10 fingers 1 heart 1 bellybutton johnny"


This may drive DBAs nuts, but in some cases, perhaps you could have a naming_convention variable that is filled out first, and it has a certain number of limited values, and it determines the number and order of "name" fields that get filled out. Then you have a catch-all naming_convention that just produces one "name" label that's very long, and you put the whole name in there.

Then you can have, say, "Western" naming convention, "Asian" (reverse order), "Spanish", "Catalan", "Hungarian", "Ancient Roman", and some of these are going to look very similar. And then if Madonna Ciccone or Beyonce Knowles comes around, you just use the catch-all for the mononymous.

It's not an elegant solution and it's not going to work for user-oriented forms, but perhaps on the backend, I don't know, for Wikipedia or something. Wikipedia already tags many biographies with the naming convention that is used.
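
Roughly what I have in mind, as a Python sketch. The convention labels come from the list above, but the field names and the exact field order per convention are illustrative assumptions, not a vetted model of any culture's naming rules.

```python
# Each convention maps to the ordered list of fields a form would show.
# The field lists are illustrative assumptions.
NAMING_CONVENTIONS = {
    "western": ["given", "middle", "family"],
    "asian": ["family", "given"],  # reverse order
    "spanish": ["given", "first_family", "second_family"],
    "catch_all": ["name"],  # one long free-text field
}


def fields_for(convention: str) -> list[str]:
    # Unknown conventions fall back to the single catch-all field.
    return NAMING_CONVENTIONS.get(convention, NAMING_CONVENTIONS["catch_all"])


def display_name(convention: str, parts: dict[str, str]) -> str:
    # Join whichever fields were filled in, in the convention's own order.
    return " ".join(parts[f] for f in fields_for(convention) if parts.get(f))
```

The mononymous case is then just `display_name("catch_all", {"name": "Madonna"})`, and an empty middle name is skipped instead of being required.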


I'd go further and say we need to stop passing around names as language primitive strings. Serious work on Name types in the standard library would go a long way towards making life easier for developers, in the same way the Calendar, Date, Time, and Duration types have made dealing with time.


I’m skeptical. Calendar, date, time all have a finite set of values. Sure, it’s a complex, maddening morass, but ultimately it boils down to a small set of integers, and a small set of standards to implement.

Names are effectively a collection of arbitrary Unicode values with a fair number of complications (sorting, multiple names, etc), and there is no set of standards to draw upon for implementation.


> there is no set of standards to draw upon for implementation

Fair point. I used those examples mostly to acknowledge that programming has moved on from treating them as strings and integers. It's true that in the US there are few, if any, restrictions on naming. Only in 2017 did California, by law, allow names with diacriticals, as in José. However, in other countries there are interesting restrictions and patterns. Icelandic law, to give one example.

> Names are effectively a collection of arbitrary Unicode values

Well, yes, for just the string, but names are more than a sequence of characters. The structure of what a name is, and how it's presented, matters: for example, given vs surname order, patronymics, Mongolian clan names, supported characters in CJK, and so forth. Those are things that are hard to get right, and having every organization doing i18n create its own one-off name handling code has led to chaos and a lack of interoperability internationally.


Names can sometimes be standardized at the worst possible times.

For example, Ellis Island was a huge source of novel spellings, loss of diacritics, Anglicizations, and other rather atrocious word crimes. However, since perhaps a small, finite number of civil servants were the ones transcribing the names and putting them on official US documentation, this had the effect of basically standardizing a lot of names for families now living in these United States, and possibly reducing the variation that was found among them. While I haven't directly seen many Ellis Island records, I've seen a lot of US Census records, and it's obvious that the census workers always wrote down sort of what they heard, and when it comes to names, there could be all kinds of phonetic trickery going on.

This still goes on today, and anywhere you have a bored, underpaid civil servant entering names into a rigid database system, you'll have a rather unpleasant standardization on the lowest common denominator. People who speak Arabic or Indic languages may often choose, on-the-spot, how to romanize their names. There are some romanization standards, of course, for most popular scripts, but why not follow multiple ones in the same word?

I imagine that it may go the other way, as well. Whatever country people are immigrating to, whatever written language that database is implemented in, names will get standardized to that lowest-common-denominator.


I'm not talking about standardizing the characters. I'm talking about naming, the structure of the entire identifier people use as their names: family name, given name, clan name, nickname, patronymic, and so forth.

Focusing on the characters is being stuck in the name-as-string: the exact problem I'm advocating we give up.


It's the difference between having only a basic typeface for Arabic and having software that can properly lay out Arabic right-to-left, properly handle the variant letter forms, and generally format the text correctly. There's Arabic as the core 28 letterforms, and there's Arabic writing with proper initial, medial, and final forms, collation, and presentation on the page.


I think there are a few possible outcomes of an attempt to standardize how computers deal with names.

- Every programming language, database, etc implements something, but overall they’re not the same solution, leaving us mostly where we are today.

- Some computer standard emerges, but fails because it’s not flexible enough for all the business and cultural needs, leaving us mostly where we are today.

- Some computer standard emerges, is good enough for most needs, but leaves people bitter because it leads to yet more cultural standardization in a world already suffering from cultural globalization.

I’m inherently skeptical of all change, I suppose, and my belief is we’ll just muddle along forever.


You could say more or less the same thing about how calendars and date/time are implemented today, as well as general language support when including bidi, CJK, and other non-roman languages. Unicode is widespread but we still have a variety of encodings, not just UTF-8. Geographical naming isn't exactly a solved problem, but localization exists. We have stuff that works, there are rough edges and plenty of places it could improve, sure.

Maybe my experiences in the industry have been particularly plagued by the handling of names, sometimes so badly that I have to roll my eyes. The handling of human names just seems like a big opportunity to remove friction across international boundaries.


Not a bad idea. Form builders and the like increasingly do offer fancy string data types (like "Link" for example, which captures the desired link text and URL as a tuple, and sometimes even more detail than that), so it does seem reasonable for "Name of a person" to be dealt with similarly.


I just had an issue related to this: my name ends in "Filho" meaning son, to differentiate me from my father.

I ran into some stupid company policy where I had to be identified by first and last name...

Resulting in people thinking my name is "son".

Then I had to explain that no, son is not a name, nor a surname. It is an adjective, not a substantive; its sole purpose is to differentiate me from my father, when needed. Calling me just "son" makes no sense.

I pointed out to that company that if they applied that policy to Bill Gates, who is named William Gates III, they would end up calling him number three, and it is obvious that number three is not his name.

Then there is the fact that it's my "middle" name that matters. My first name was just to honor a famous Catholic priest and is not even a valid word in my native language. Nobody calls me by my first name... unless there is some bureaucracy somewhere, then they insist on calling me by my first name, and son as my surname.


Shouldn't your name (using the Gates example) be "Gates_III" rather than just "III"?


Side note to your example: Bill Gates’ nickname growing up was Trey, as in 3.


Reminds me of two stories/articles:

1. The "Y2Gay" problem

Was a problem because databases were designed with the idea that marriage could only be between a male and a female.

2. Falsehoods developers believe about names

From HN's legend: patio11

https://www.kalzumeus.com/2010/06/17/falsehoods-programmers-...


Names are much more complicated than this. As a starting point, pretty much all systems processing names need to consider all possible permutations across all cultures, and select some constraints to operate by. There is no silver bullet here, and in all honesty the author's suggestions seem well intentioned, but don't address any real problem space.

When it comes to these considerations for instance, it's possible to have a first name, but no last name and vice-versa.

Also, never mind how many words make up a first or last name, but what's the minimum number of characters allowed for each?

What characters do they allow, and which shouldn't they?

Also, what version of the person are they attached to? Is Dr. J. Stevenson the same identity as Jonathan Stevenson, or Jon Adam Stevenson?


I’m trying to figure out what the issue is here. The Spanish example given isn’t even a real problem: even Spanish speaking countries only use two columns (“first names” and “last names” plural). It really feels like it’s trying to invent a problem that doesn’t exist, or to try to cater for the extreme edge case.

Once you have the data, how you present it is a separate i18n thing. Do you want to address an <ethnicity> crowd by their last name? Cool! That’s a rendering issue, not a database problem.


I understand that the author's point is that we should replace "first name" with "given name(s)" and "last name" with "family name(s)" if we are going to have two fields instead of just one for the full name. It's true that it would be more clear in some cases, since in many cultures the "first name" is the family name, which is probably not what the system expects.

Though I do agree that it's mostly a matter of i18n. In Spanish you should ask for "nombre" (name) and "apellido(s)" (surname), if you say "given name" and "family name" users will be confused.


Ideally it would just ask for each actual need. If the system needs to render something brief for a salutation, something verbose for a certificate of achievement, and something suitable for sorting purposes, then it could just ask the user to provide all of those exact things. Then the potential for misalignment with "what the system expects" completely disappears.
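
As a sketch of what "ask for each actual need" could look like in Python: one user-supplied string per purpose, with nothing derived from anything else. The type name, the field names, and the sample `sort_as` values are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NameForPurposes:
    salutation: str  # brief, e.g. for "Hello ..."
    formal: str      # verbose, e.g. for a certificate of achievement
    sort_as: str     # never displayed, only used for ordering


# Sample records; every value is typed in by the user, including sort_as.
people = [
    NameForPurposes("Jan", "Jan de Wit", "wit jan de"),
    NameForPurposes("Miquel", "Miquel Martí i Pol", "marti i pol miquel"),
]

# Ordering uses only the string the user supplied for that purpose.
ordered = sorted(people, key=lambda p: p.sort_as)
```

The cost is a longer form; the benefit is that the system never has to guess which part of a name is the one to sort or greet by.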


Last names are also not limited to ASCII. There are all sorts of special characters depending on the language, like üáàãéèç-' (the single quote is especially important because it's normally blocked due to SQL injection, but it's a common character in Italian/French/Spanish names).

Preventing people from writing their own names because you decide to use a regexp from stackoverflow is also bad.
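
For the SQL injection worry specifically, blocking the apostrophe isn't needed if the query is parameterized. A small sketch using Python's built-in `sqlite3` module; the table and column names are invented for the example.

```python
import sqlite3

# In-memory database, just for the demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (family_name TEXT)")


def add_person(family_name: str) -> None:
    # The ? placeholder passes the value separately from the SQL text,
    # so an apostrophe is stored as data rather than parsed as syntax.
    conn.execute("INSERT INTO people (family_name) VALUES (?)", (family_name,))


add_person("O'Neill")
add_person("D'Angelo")

rows = [r[0] for r in conn.execute("SELECT family_name FROM people ORDER BY rowid")]
```

Both names round-trip unchanged, so the input form has no reason to reject them.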


Don't forget Irish names for using the apostrophe!


I worked at a place tied to 60s US legacy technology from the airline industry, and yes, everything was first name/last name. I don't know how it works interoperating with airlines serving Asia or anywhere with non-US naming. The same system was strictly ASCII, too (I think very technically EBCDIC deep in the inner workings). Also, because of TSA rules and the security line, the names are used for more than personalization; they are legally required to meet certain criteria.

Previous to that I worked at a place with a key customer in Japan, so everything was pretty well i18n and I got used to seeing given name/family name or similar. I discovered I had muscle memory for typing 'surname' instead of 'lastname'.


It took me ages to settle on how to fill in those two fields on websites. I have five names, four of which are family names. To identify me, you should have them all, but that division, first and last, has literally no meaning for me. My literal last name is so common here that I share it with the current president, a number of sportsmen and local celebrities, and probably half the people living in the same region as me. If that's what you're asking for, well, I can give it to you but I already know it's worthless.


So if you're dealing internationally, and names actually matter, you may need to think a bit harder about names.

Although the proposal still doesn't solve all edge cases. And the current edge cases have presumably found a solution, like my girlfriend, who has 2 last names and is from a 'western' country.

My parents' address is stupidly long and doesn't fit into most address boxes, so you also need to allow arbitrarily long addresses.

Oh, and your email validator is wrong too.


Speaking of email validation, recently I've come across email inputs that reject addresses that would definitely match even the most restrictive pattern you can imagine (Latin letters and Arabic numerals @gmail.com) but which I created by mashing my keyboard or just made up using my imagination and extremely common (in the USA) names. For example, in the captive portal for Starbucks WiFi, it rejects things like "iebfisldndu38256" or "jennsmith122499" suffixed with "@gmail.com".

Now, I'll admit: it actually did thwart my attempt at not providing my actual email address, for a good several attempts. I had to increase my creativity to a level I truly didn't anticipate, before finding something it accepted. And I definitely had enough randomness that it wasn't just disguising an "already exists" error with an "invalid address" error for privacy purposes, I am certain.

What in the world kind of validation is this? Do people really end up resorting to entering their actual email address often enough to make this at all useful?


This isn’t about validation. They check to see if it is a real email address.


Ah, interesting. I wonder why everyone doesn't do that instead of poor attempts at validation.


which is what mailinator is for


Even bog standard Dutch names are often a problem e.g. “Jan de Wit” being addressed as “Jan de”, “Jan Wit” or sorted under the letter d.


Is the "de" somehow different to a "regular" middle name?


It's not a "middle name", rather it's an "infix". Some countries/languages have this feature where there is a generic "infix" part of the last/family name. This generic part is fully part of the last name but is so common that it would cause an imbalance in the distribution of names.

To understand this as an English speaker, imagine people aren't called John Smith, but rather John the Smith, and not Bill Boston but Bill of Boston. Now imagine that a large percentage of people had an infix in their name. It would be impractical to order things by last name and find a massive index entry at the letter "o" for all the last names starting with "of", or at "t" for those starting with "the". To balance this out, parts that would translate to words like "from", "the", "of", "to" or "on" are treated as an infix: part of the last name, but not considered part of the name index. John the Smith would be found at the letter "S", usually structured as "Smith, John the".
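
A rough Python sketch of that index form. The infix list is a tiny illustrative assumption (a real list would be longer and language-specific), the function name `index_form` is made up, and conventions for this vary by country, so treat it as an illustration of the idea and not a rule.

```python
# Illustrative, incomplete set of infix words (assumption).
INFIXES = {"de", "van", "der", "den", "the", "of"}


def index_form(given: str, family: str) -> str:
    words = family.split()
    i = 0
    # Skip leading infix words, but always leave at least one word as the main part.
    while i < len(words) - 1 and words[i].lower() in INFIXES:
        i += 1
    infix = " ".join(words[:i])
    main = " ".join(words[i:])
    # The infix moves behind the given name: "Smith, John the".
    tail = f"{given} {infix}".strip()
    return f"{main}, {tail}"


entries = [("Jan", "de Wit"), ("John", "the Smith"), ("Bill", "of Boston")]
index = sorted(index_form(g, f) for g, f in entries)
```

The stored family name stays "de Wit" in full; only the string used for the index is rearranged.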


Ah ok. So the issue is not so much how to store it, but how to sort it.

It's the iTunes "The ..." problem.

Thanks for clarifying.


That sounds like the user’s fault: if their name is “Bill of Boston”, then their surname is “of Boston”… and if they prefer to be called “Mr. Boston” or similar, then they should omit the “of” when filling in a form.


If someone's family name is O'Neill and their name gets sorted to the top of the O section because of the apostrophe, does that make it their fault? If they prefer to be sorted correctly, should they omit the apostrophe from their name?


Think "da Vinci". It's not exactly the same, but similar enough.


It’s part of the last name.


The author prefers "given name," but that does not include one's surname, and it may also not be one's current name.

I'm constantly annoyed when, having been asked for both my legal name and my preferred name, companies choose not to address me by my preferred name. That's why I gave them my preferred name, after all.


As an Irish person I'm often accused of SQL injection.


I recently filled out a web form on a state government site that asked me to enter the name of my immediate supervisor at my previous job. His last name has an apostrophe (starts with "O'"). The site gave me an error message:

"Your immediate supervisor's name must only contain letters (A-Z), hyphens, or spaces."

(At least it didn't actually impose the apparent restriction that the name must be ALL CAPS.)


W3C has a good article about personal names around the world: https://www.w3.org/International/questions/qa-personal-names


Not all cultures have a family name. Myanmar for example, has no family names. There is only a given name, which is often many words.

Just store a name field, and if you need to interface with other systems with multiple name fields, ask the user to fill those fields out.


Like, human names are entirely culture-dependent, and in most cultures (including the Western Hemisphere culture) completely free-form?

Even when people adhere to that norm rigorously (which they don't), it's completely unusable as a unique identifier, and is of dubious utility to disambiguate two distinct humans in many scenarios?

That there's not even any real agreement on spelling (especially as, with some, it will need to be spelled in multiple alphabets)?

If this stuff's not old news to you, you haven't ever worked with it.

The one that will really make you want to curl up in a corner and whimper are the problems with birthdays.


I think it's all about your target audience. I've never bothered with Native American or Middle Eastern edge cases because all of my clients serve white Europeans who fit into the two-name scheme 99.999% of the time. So I won't go the extra mile to optimise for the 1 in 100,000. Why should I? There will be no ROI.



