This is funny, but it's also a real-world example of the kind of encoding nightmare that made SOAP RPC encoding really awkward. Various SOAP toolkits used to serialize a missing value as the empty string, or a literal value like "null" or 0, or all sorts of awfulness. I think the correct thing per the spec is to set xsi:nil="true" as an attribute on the XML tag in question, but IIRC about half the toolkits didn't understand that.
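For anyone who hasn't seen it, here is a rough Python sketch of what honoring xsi:nil looks like on the consumer side, using only the standard library. The element names (order, sku, note, comment) and the read_text helper are made up for illustration; only the XSI namespace URI is the real one.

```python
import xml.etree.ElementTree as ET

# Fully qualified attribute name for xsi:nil (standard XSI namespace).
XSI_NIL = "{http://www.w3.org/2001/XMLSchema-instance}nil"

# Hypothetical payload: one real value, one explicit nil, one empty string.
DOC = """
<order xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <sku>12345</sku>
  <note xsi:nil="true"/>
  <comment></comment>
</order>
"""

def read_text(parent, tag):
    """Return None for a missing or xsi:nil element, else its text ('' if empty)."""
    el = parent.find(tag)
    if el is None or el.get(XSI_NIL) in ("true", "1"):
        return None
    return el.text or ""

root = ET.fromstring(DOC.strip())
print(read_text(root, "sku"))      # a real value
print(read_text(root, "note"))     # explicit nil
print(read_text(root, "comment"))  # empty string, not nil
```

The point is only that "nil", "empty" and "absent" are three different states, and a toolkit that collapses any two of them will eventually lose data.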
(I speak in the past tense of SOAP because I am an optimist.)
I worked at a company where we were replacing the user-facing component of our giant, ugly PHP storefront with a Rails version; in doing so, our developers implemented a JSON bridge between the two, allowing the frontend and backend to operate separately, using separate databases (and actually, they were in separate data centres).
As we were testing, we found that some products in our database would cause a JSON decoding error on the Rails side. After a few minutes, we realized the problem. We had a string field for something (product IDs, manufacturer SKU, etc). On the PHP side, the JSON encoder was using PHP's is_numeric() for each field to see if the field was a number (to determine how to encode it). Some of the SKUs, however, happened to be composed entirely of digits, and for those, PHP encoded them into the JSON as integer values. This, of course, broke on the Rails end, because Rails was expecting a string value and got an integer value.
In the end, we had to write a surprising amount of code to work around the brain damage involved, since regardless of what we tried to do PHP wanted, by default, to send things as integers whenever possible. I believe the final fix was to actually patch the JSON encoder library and special-case that field.
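To make the failure mode concrete, here is a small Python sketch of the same mistake (not the PHP code from the story). sniffing_encode guesses types from content the way an is_numeric()-style check does; schema_encode decides by field name instead. The function names and the STRING_FIELDS set are hypothetical.

```python
import json

def sniffing_encode(record):
    """Mimics the bug: any all-digit string is emitted as a JSON number."""
    guessed = {
        key: int(value)
        if isinstance(value, str) and value.isascii() and value.isdigit()
        else value
        for key, value in record.items()
    }
    return json.dumps(guessed)

# Hypothetical schema: fields that must always be sent as strings.
STRING_FIELDS = {"sku"}

def schema_encode(record):
    """Chooses the JSON type from the field's declared type, not its content."""
    fixed = {
        key: str(value) if key in STRING_FIELDS else value
        for key, value in record.items()
    }
    return json.dumps(fixed)

product = {"sku": "00123", "qty": 2}
print(sniffing_encode(product))  # sku becomes a number and loses its leading zeros
print(schema_encode(product))    # sku stays a string
```

Note that the sniffing version also silently drops leading zeros, which is a second bug hiding behind the first.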
Heh. I recently had a password reset function break. Problem not reproducible on the test system. Turns out the reset email is produced by the FreeMarker template engine, which is "smart" about datatypes: "oh, a number! I need to format that nicely with commas to separate the thousands!" Too bad that number was the user ID. Not a problem on the test system with its 300 users, but in production...
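The same trap is easy to reproduce outside FreeMarker. This is a Python analogue, not FreeMarker syntax, and the user IDs are invented: display formatting only changes the output once the number reaches four digits, which is why a small test system never shows it.

```python
test_user_id = 300      # typical ID on a small test system (made-up value)
prod_user_id = 1234567  # typical ID in production (made-up value)

# "Helpful" display formatting with thousands separators.
pretty_test = f"{test_user_id:,}"
pretty_prod = f"{prod_user_id:,}"

# An identifier should be rendered as a plain string instead.
raw_prod = str(prod_user_id)

print(pretty_test)  # unchanged, so the bug is invisible in testing
print(pretty_prod)  # now contains commas and no longer matches the ID
print(raw_prod)
```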
is_int() checks the type of the field, while is_numeric() looks for strings that look like numbers.
You will also need to use settype() when getting your data from the database since integers from the database will pass through as strings (since the database range and PHP range aren't necessarily the same, use a float if you need unsigned ints).
(I assume that "what." is a request for an explanation.)
A float can store an exact integer of up to 53 bits even on a 32 bit machine.
PHP only has signed ints. If you need to store an unsigned int you can either store it internally as signed and only convert it to unsigned with printf() when you output it (and deal with the complexity of comparisons), or use a float and limit yourself to 53 bits.
If you have a 64 bit machine then of course you can easily fit an unsigned 32 bit int in that range. But it's wise not to rely on that at least for another few years.
In short, if you need more than 32 signed bits of range, and you want to make sure your code will run on any machine, then use a float. If you know you only use 64 bit machines then you have more flexibility. (You can use PHP_INT_SIZE and PHP_INT_MAX to check.)
If you need even more range than that then use the built in GMP library.
Also, PHP will automatically convert numbers that are too large from ints to float, so normally you don't see any of this. It's only if you use settype() to force an int that you have to pay attention to this.
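The 53-bit claim is easy to check in any language with IEEE 754 doubles. A quick Python sketch (Python rather than PHP, since Python ints are arbitrary precision and make the comparison easy; the variable names are mine):

```python
import sys

EXACT_LIMIT = 2 ** 53   # largest power of two below which every integer is exact
U32_MAX = 2 ** 32 - 1   # largest unsigned 32-bit value

# An unsigned 32-bit value survives a round trip through a double.
u32_roundtrips = int(float(U32_MAX)) == U32_MAX

# 2**53 itself is still exact...
limit_roundtrips = int(float(EXACT_LIMIT)) == EXACT_LIMIT

# ...but 2**53 + 1 is not representable and rounds to a neighbour.
beyond_limit_roundtrips = int(float(EXACT_LIMIT + 1)) == EXACT_LIMIT + 1

# Number of mantissa bits in the platform's float type.
mantissa_bits = sys.float_info.mant_dig

print(u32_roundtrips, limit_roundtrips, beyond_limit_roundtrips, mantissa_bits)
```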
Could you have added some text at the end of the ID before sending it, and then removed the extra bits on the receiving side? Something like a parity value.
That would probably work, but it means that the problem + workaround is spread out across two systems rather than being contained in just one. Also, working around it on the consumer side means that any new consumers (or any new string fields!) will need to use the workaround too, further spreading out the problem. Better to keep it encapsulated in one system if possible.
If you have a method that breaks when an int is passed in, just seems like good defensive programming to call to_s in Ruby. Any other caller could make the same mistake. But, I also understand/agree with fixing the root issue for the sake of other clients.
This is really a fundamental problem: how do you indicate operation failure? This relies on two things: the range (the valid output values) of the operation itself, and the range of the datatype you're mapping the operation's result to.
If the operation and the datatype's range are not equal, then you can indicate failure inside the return value by applying special meaning to invalid values. But if the operation and the datatype's range are equal, then you need another distinct value to indicate failure. The difficulty is in recognizing which situation you're in, and as you point out, this is one where, effectively, the operation and the datatype have the same range.
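Both situations show up in everyday Python, so here is a small sketch. str.find can use -1 because no valid index is negative; dict.get cannot use None unambiguously when None is itself a legal stored value, so a distinct sentinel is needed. The names MISSING, find_index, lookup and profile are illustrative only.

```python
# A unique object that can never collide with a real stored value.
MISSING = object()

def find_index(text, needle):
    """Operation range (0..len) is smaller than int's range, so -1 can mean failure."""
    return text.find(needle)

def lookup(mapping, key):
    """Operation range equals the datatype's range, so failure needs its own value."""
    return mapping.get(key, MISSING)

profile = {"middle_name": None}  # None is a legitimate stored value here

print(find_index("abc", "z"))                      # failure encoded in-band
print(profile.get("middle_name"), profile.get("nope"))  # indistinguishable
print(lookup(profile, "middle_name") is MISSING)   # stored None, key exists
print(lookup(profile, "nope") is MISSING)          # key really is absent
```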
"I trust that the guys who wrote this have been shot." :-)
People who all run the same version of Visual Studio think SOAP is awesome. Get handed somebody else's "whiz-dull" a few times, and see how much fun it is to generate a working client using a different brand/version client stack.
I had the pleasure of writing a client for a SOAP service in a Titanium/JavaScript app not too long ago. If you don't have visual studio generating those proxy classes for you then indeed it's a huge pain.
Well, at least SOAP uses XML which has defined the basic formats. I hate that there are at least three different datetime formats in JSON and they are all used. WTF!
SOAP isn't that bad if you stay away from WS- extensions
A lot of protocols and standards containing the word 'simple' aren't. Most of them, in fact! I have a suspicion that this is because these designs start as antitheses to existing complex designs. 'Aha!' say the designers. 'We won't repeat those mistakes!' But because they proceed from the same basic assumptions as the complex designs they try to replace, they always produce something complex in the end, because they never really understood simplicity.
It's hard for people to imagine now, but at the time the Internet was just one of many competing network standards. Had this been developed after the rise of the web, I'm sure it would have been a very different protocol.
Definitely not true. The first LDAP implementation was published in 1993, and was worked on for a while before that internally at the University of Michigan:
Yeah -- before the enterprise types got a hold of it, SOAP was actually fairly pleasant to work with. Sigh. Oh well.
You can kind of get a flavor of what pre-enterprise-jackassery SOAP was like to work with by looking at Dave Winer's XML-RPC (spec: http://xmlrpc.scripting.com/spec.html), which was one of the precursors of SOAP.
That's a good point - didn't CORBA also start fairly straightforward (I can't believe I said that) but then grew extra layers of mind numbing complexity for transactions, security etc. - pretty much like the various weird WS-* specifications that most people seem to ignore?
Ran into this with a REST XML API recently where someone was trying to do some reflection-type serialization of XML. The API had longitude and latitude of all train stations, and some genius decided to call the tags 'lat' and 'long'. 'long' conflicted with the datatype Long and it wasn't fun. Version 2 of the API has fixed this issue, luckily.
"Lat" and "Long" seem like great tag names for this purpose. It sounds like the problem wasn't this guy, but the "reflection-type serialization of XML".
I think the absence of the element/attribute is the best way to define null assuming your XSD is set up properly. Many XML marshalling libraries work well with this approach.
I've had the joy of working with a SOAP endpoint that doesn't recognise <element /> syntax, which left me having to create attributes assigned to '' in my Python code, so SUDS would generate the <element></element> syntax for me.
They also massively over-engineered the endpoint, constantly wrapping elements within elements, for no real reason.
That's always annoyed me: sure, XML can be heavyweight but if we're going to use it we should at least get the benefits.
Naturally that line of reasoning didn't get very far with the “maintainers” of an internal purported-SSO system with a SOAP endpoint which crashed on non-ASCII data or SQL special characters in the submitted username / password values.
I have joked that I might change my name to Sample User, develop a piece of land in the country, and name my road Example Avenue, taking address 123. This would make me impervious to datamining, because my results would always be thrown out.
I've done data-mining on customers, and truth be told, they'll send that mail without human intervention. You wouldn't be impervious to ye olde mail merge!
When AOL first started allowing screen names longer than 8 characters, I knew someone who registered the name "My Documents". That got some ... interesting emails from people trying to save their downloads.
If a patient with the last name of "Mouse" ever checks in to the hospital where I work, I have doubts about whether any of his labs will be performed. Standard practice is when creating a test user in production or placing a test order, name him Anything Mouse and people know to simply delete the request from the system.
Oh man, reminds me of the time I was working on an intranet app for a big furniture company and all test user signups were coming through as 01/01/1970. After several frustrating hours trying to track down the source of the error I had the client enter a new user in front of me to see why it was happening for them & not me. I watched in horror as he set the birth date to January 1st 1970.
He had some limited exposure to development in the past and had got into his head that this date was the Universal Developer Test Date.
Careful about picking a low-populated area like this. I used to live in a town with population of about 2,000 and the post office clerks knew most everyone by name. One time I signed up for a site and just used "123 Blah St." as a placeholder address. Months later, some letter was mailed to that address, but the mail clerk, recognizing my name, just helpfully put it in my PO Box anyway!
I once worked for a medical records software company. We received a bug report that a particular patient's record could not be viewed. Our support engineer remoted into the client's site and asked the secretary for the patient's name. It was Bobby Null. You can imagine what sort of underlying assumption about String serialization led to this issue. [A preemptive aside: We had proper confidentiality agreements in place. No HIPAA rules were violated.]
Good question. I recall having done a Google search and noting that there were not an insignificant number of people with the last name Null in the US, so I wasn't too concerned about posting this. Probably a HIPAA violation, but not a major one.
In this case, probably yes. Might want to remove the post, it's a fairly major violation.
Often, names alone wouldn't necessarily constitute a violation as names are generally not sufficient to count as personally identifiable information... but a name like 'Bobby Null' is, I think, quite unique.
When I was being trained on HIPAA compliance I was told that sole first names are generally perfectly fine, and sole last names can often be fine but should be avoided for very common names. But I should also say that I am not an expert on HIPAA compliance.
I don't know the ins and outs of HIPAA, largely because I don't have to deal with them at all, but I don't see how this should be a violation. That's not to say that it's not, but rather that it seems like an odd rule.
All the post tells us is that a person named "Bobby Null" exists and has medical records, as do most people. It doesn't say anything about this person's medical issues/history at all.
I could learn more about someone by sitting a touch too close to the reception area at a doctor's office.
Also not an expert, but I agree. The violation is only if there is PHI - personal health information released. Stating that John Doe was present at X Clinic is a problem; stating that he exists is not.
Having a record implies that you were present at X Clinic. If it's a specialist clinic, then confirming the existence of a patient record could allow someone to infer the condition or a range of conditions. Most clinics won't confirm or deny that a patient is there (or has records) without a release. In this case, though, we don't know where the record was stored.
Good point. My training said no full names, but that was because we were directly associated with a specific product/analysis, so any full names would associate the patient with a particular health... thing.
A name by itself, you are quite right, is not PHI. Thanks for the reminder!
XML is self-describing, it just so happens that XML's data model is not identical (and actually not even close) to SOAP's data model, or the typical programming language's data model.
XML itself only describes a text encoding, XML infoset describes node labeled trees, possibly graphs through xml:id and idref.
Unlike JSON it doesn't have a concept of null, it only has absence of a node. The authors of SOAP just invented a truly terrible way of mapping XML into a programming language's constructs (which are typically edge labeled trees with typed nodes).
XML is actually a decent data format for markup. Using it for other purposes (RPC format, configuration files, ...) usually doesn't end well.
Well, my last name has an ñ in it. So, for example, my credit card shows a weird character like "&" in its place. Others just change it to n. My last name crashed an educational site when I registered.
Using a name other than your official legal name is frowned upon in many contexts. In some countries, although not the US, it's actually illegal under many circumstances.
Actually, what worries you is not Perl per se, but people that write Perl code and don't know what they want to test for. The code shown interrogates $lastname for values that represent truth in Perl, while it ought to be checking for definedness:
if(defined $lastname) { ... }
The two are totally different cases. I would also argue that the problem lies elsewhere if you have values for a 'lastname' field in your data set that consist of a single letter.
Considering that there are plenty of people with no last name at all, I don't find it at all hard to imagine that there might also be people with a real last name consisting of only one letter.
There is a town famously called simply "Y" in France.
It's not Perl. I recently had a conversation with a friend who didn't like me using "if myvar == 0:" or "if myvar is 0:" in Python code rather than "if myvar:". Call me paranoid, but I like to be as explicit as possible in my checks; you never know when magic conversion tricks (which are often platform- or implementation-dependent) will end up biting you in the ass.
Python is a little better than Perl or PHP on this; it won't treat the string "0" as false. It does, however, treat both 0 and None as false, and also 0.0 == 0 == False, which is the same kind of potential bug.
Generally I find that Python's avoidance of implicit string type conversions means that I almost never have this kind of bug in my Python.
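A quick sketch of those truthiness rules, plus the explicit checks that avoid the ambiguity. The helper names is_missing and is_zero are mine, not anything standard.

```python
# Values that commonly get confused in "if myvar:" style checks.
candidates = ["0", "", 0, 0.0, None, False]

# Map each value's repr to its truthiness.
truthiness = {repr(value): bool(value) for value in candidates}

def is_missing(value):
    """True only for None, never for 0, '' or False."""
    return value is None

def is_zero(value):
    """True only for a numeric zero; bool is excluded even though False == 0."""
    return (
        isinstance(value, (int, float))
        and not isinstance(value, bool)
        and value == 0
    )

for name, truthy in truthiness.items():
    print(name, truthy)
print(0.0 == 0 == False)  # the chained equality mentioned above
```

Only the string "0" comes out truthy in that list; everything else is falsy, which is exactly why a bare "if myvar:" cannot tell a zero from a missing value.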
On another note, the result of `myvar is 0` is implementation-dependent; Python implementations can perfectly legitimately return False for that even if myvar is, in fact, the integer 0. Try this, in Python 2.7.3:
>>> x = 257
>>> x is 257
False
>>> x = 257; x is 257
True
>>> 257 is 2**8 + 1
False
>>> 256 is 2**8
True
>>> x = 256
>>> x is 256
True
That's because `is` denotes object identity, not value equality, and for immutable objects like integers, strings, and tuples of immutable objects, object identity is fair game for optimization. In the above, "is" gives us a fascinating window into the particular optimization decisions taken by the CPython 2.7.3 interpreter. But, child, if you want your code's behavior to depend on some problem domain instead of interpreter optimizations, don't use "is" to compare integers!
I agree, but it's still more explicit in terms of what process is used to evaluate the content of myvar and what it should match, especially in a context where you're expecting a numeric value rather than a boolean.
17 years ago, when I got my second Internet account with my ISP, I filled in these 3 names for my choice of email address on their paper signup form.
root@ , nobody@ and daemon@
They gave me "daemon". I've terminated that account long ago, but last I checked (6 years ago?), I could still retrieve emails and dial in using a modem using that account.
I believe that these errors are so common they represent a cognitive bias on the part of programmers. At some point every developer wants to execute a one-line command and have the system "do something". If they cannot get that one line, then they have two options: wrap up more abstraction code until one line executes (the SOAP solution), or think deeply about what you are trying to do and take things away until one line is clear and obvious (the REST solution).
Dear god, so many deleted answers from people trying to be funny instead of informative! (I am counting 4 from the last hour and 3 more from the previous years)
I think this is a joke, possibly inspired by the XKCD comic (link posted in the stackoverflow comments). The string "Null" would not cause this behavior.
I have personally encountered "Null" as a surname in a system used by job applicants world-wide. The system's session layer encodes absent values as the string "null" at some point. The Null clan is the only problematic case, and they are numerous enough that the maintainers are aware of the problem but not numerous enough to get the session layer fixed.
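A minimal sketch of that kind of session layer, to show why the surname gets eaten. lossy_encode and lossy_decode are hypothetical stand-ins, not the actual system's code; the JSON round trip is shown only as one encoding that keeps the two cases apart.

```python
import json

def lossy_encode(value):
    """Hypothetical session layer: absent values become the string 'null'."""
    return "null" if value is None else value

def lossy_decode(text):
    """The matching decoder cannot tell a real 'Null' from an absent value."""
    return None if text.lower() == "null" else text

# The surname collides with the in-band marker and comes back as "absent".
surname_after_roundtrip = lossy_decode(lossy_encode("Null"))

# JSON keeps the two apart: a quoted string versus the bare literal null.
json_surname = json.loads(json.dumps("Null"))
json_absent = json.loads(json.dumps(None))

print(surname_after_roundtrip)
print(json_surname, json_absent)
```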
I wish I could listen in on a dinner conversation at Null house. They must have an interesting perspective about computers.
It could very easily be real. I was in a Fantasy Football league on Yahoo a few years ago, and there was a player named Keith Null who played briefly after another player was injured. His name just showed up as Keith.
At least your employee has a last name. I had an Indonesian hacker in my team, he had no last name...
It is all about assumptions. OP assumed nobody would be called Null. The MusicBrainz index assumes no band chose to name themselves "Various Artists" or "[unknown]". Those assumptions are reasonable, but how do you avoid assuming that people have last names?
Naturally the very first thing I did when opening this discussion thread was search the page for “bobby tables” and “xkcd”. Of course, there were already three separate mentions of that comic in the thread.
And this is why it is a bad idea to look for null or nil as a value representation in place of text or number. Instead, use a different representation, like an empty or non-existent element/attribute, etc.
Nan is actually a legitimate name (http://en.wikipedia.org/wiki/Nan) so if you're putting everything into upper case and not sanitizing, you're gonna have a bad time.
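This one is reproducible in Python, where float() accepts "nan" in any letter case. A sketch, with looks_numeric standing in for the sort of "is it a number?" sniffing that causes trouble and is_plain_decimal as one stricter alternative (both names are mine):

```python
import math
import re

def looks_numeric(text):
    """Naive sniffing: anything float() accepts counts as a number."""
    try:
        float(text)
        return True
    except ValueError:
        return False

# Stricter check: optional sign, digits, optional fractional part.
_PLAIN_DECIMAL = re.compile(r"[+-]?[0-9]+(\.[0-9]+)?")

def is_plain_decimal(text):
    """Accepts only plain decimal notation, so 'Nan' and 'Inf' are rejected."""
    return _PLAIN_DECIMAL.fullmatch(text) is not None

print(looks_numeric("Nan"))        # the name parses as a float
print(math.isnan(float("NAN")))    # and the result is not-a-number
print(is_plain_decimal("Nan"))     # the stricter check rejects it
print(is_plain_decimal("3.14"))
```

Better still, don't sniff name fields for numbers at all.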