Mistaken Identifiers: Gene name errors introduced inadvertently using Excel (biomedcentral.com)
73 points by youngerdryas on April 21, 2013 | 46 comments



Another comment asks,

why are we seeing so many articles about Excel?

The main reason is that Excel was recently in the news: a paper about economic policy across countries turned out to have been based on data kept in an Excel spreadsheet that was poorly enough documented that errors in the data weren't found right away. (This is the highly condensed version of the story.)

These stories resonate here on HN because many, many, many of us have had occasion to use Excel as a tool. Members of the general public who share information with me (for example, contact lists for youth soccer teams) have learned in a corporate environment to treat Excel as the "universal" data exchange format. So I will be dealing with Excel spreadsheets for years even if I never create another one.

One expects Excel to operate like a tool, a way to manipulate data in some straightforward, well defined ways. I don't expect Excel to do what is described in the article submitted here: treat any text data value with certain embedded strings as special data types that the program can rewrite without explicit user command. That turns Excel from being a tool in the workplace to being a surly co-worker who habitually messes up other workers' projects. I intentionally minimize my use of Excel because I don't like its artificial intelligence turning into artificial stupidity while I try to get my work done. To find out that Excel is actually actively impeding medical research by rewriting data cells in spreadsheets is a dismaying example of why I can't treat Excel as a useful tool, so of course I was glad to upvote this informative submission.

AFTER EDIT:

I shared the link that opened the thread here among my Facebook friends, and one friend commented,

"This is (luckily) old news and no bioinformatician worth his keyboard uses Excel any more. It's just too much of a wild card."

He followed up after another friend's comment with

"Microsoft is squarely in the wrong here. The tool aggressively reformats highly technical data fields and the behavior is remarkably hard to keep turned off. I've been working in this specific field for 15 years, and I can guarantee that the power users do know their tools. What they know these days is to go use R or even one of the OSS applications like LibreCalc. Unfortunately, more naive users new to bioinformatics analysis routinely get tripped up by this and other overly assistive features of the Office suite."


Excel is not exactly rewriting data cells in spreadsheets. When importing other formats, it applies some automatic conversions that people apparently find convenient. It's a little bit pedantic, but a text file with some sort of structured format is not the same thing as a spreadsheet.

It probably doesn't make any sense to apply automatic conversions to formats that are more or less defined by convention, but if you mark a column in an Excel file as text, Excel won't apply magic to that column.

The bigger problem is that it is considered acceptable to not keep a record of the changes being made to the data (a script can serve nicely as both a data processor and a record of the processing being done).
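
A minimal sketch of that script idea in Python (the file names and the cleanup step are hypothetical; the point is that the script is both the processor and the record):

    import csv

    # Read the raw export and write a cleaned copy. The csv module returns
    # every field as a string, so no type guessing happens here, and the
    # script itself is a record of exactly what was done to the data.
    with open("raw_export.csv", newline="") as src, \
         open("cleaned.csv", "w", newline="") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            writer.writerow([field.strip() for field in row])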


> if you mark a column in an Excel file as text, Excel won't apply magic to that column.

The problem is that the CSV file format has no reliable mechanism to mark a column as text (*). Known workarounds include inserting a single quote at the beginning of a field, but this single quote may remain in the field forever, polluting the data downstream of the import.

(*) The obvious approach, enclosing CSV text fields in quotes and leaving non-text fields unquoted, won't work: too many Excel versions strip out all the quotes while importing, without considering the implications of their selective application. Also, in many Excel versions "DEC1" is converted into a date whether or not it's quoted.


I posted this the last time there was a discussion about CSV and Excel:

If you are generating the CSV yourself, save yourself some agony and just wrap the text in ="...":

    $ cat test.csv
    ="12.34567890124312341234123412341234",="1-5"
    $ open -a Microsoft\ Excel test.csv

https://news.ycombinator.com/item?id=5514587

Explanation:

It uses another automatic conversion: if the leading character is '=', the field is treated as a formula.

Thus, you can actually write formulas in your CSV and Excel will interpret them appropriately:

    $ cat test.csv
    =1+1,="1-5"
    $ open -a Microsoft\ Excel test.csv
You should see the value '2' (with content `=1+1`)

https://news.ycombinator.com/item?id=5514704


An alternative solution that works in some situations is to format the column as "Text". However, you must do this before pasting the data into the cells, or it will already have been converted. I use this technique for zip codes, which can have leading zeros.
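
If you generate the file rather than paste into it, the ="..." trick from the earlier comment can be scripted too; a minimal Python sketch (the zip codes are made up):

    # Writing each zip code as ="00501" makes Excel evaluate the cell as a
    # formula whose result is text, so the leading zero survives the import.
    zips = ["00501", "02134", "90210"]
    with open("zips.csv", "w") as f:
        for z in zips:
            f.write('="%s"\n' % z)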


"The problem is that the CSV file format has no reliable mechanism to mark a column as text( )."*

The problem is that there is NO such thing as "the CSV file format". There's only a lot of different interpretations of how to store data in the form of rows with fields separated by commas or semicolons or whatnot, with every software package having slightly different ideas.

Save the world, use S-expressions or JSON if you write something new.


CSV is not the issue here; Excel's automagic conversions are. If it could open JSON, it'd extract the values, then apply its magic. So your JSON string "DEC1" would still be converted (JSON doesn't have a date type).


The point of my comment was mainly that "by rewriting data cells in spreadsheets" is a poor way to describe the bad import behavior of Excel. A CSV is an ill-specified structured text file, not a spreadsheet (most spreadsheets have some sort of type information for each cell; who knows whether it is intentional and accurate, but it is there).

In that context, "in an Excel file" meant "in a native Excel file".


> most spreadsheets have some sort of type information for each cell ...

Yes, but the problem is that, without some knowledge and care on the part of the operator, the data type will be established during the import, not in advance, and on a cell-by-cell basis.


I am surprised that this is a surprise. Anyone who has used Excel to open a CSV file of any complexity has surely encountered incorrect, automatic "interpretations" of values as dates or numbers.


Why wouldn't an import tool have a "make automated conversions" option, or the ability to _simply_ turn automated conversions off, with clear state feedback (e.g. a checkable button, like the overwrite/insert modes¹)?

Surely MS research has shown that this is one of the great pain points with Excel/Word?

[¹ I haven't used MS Word for real in over a decade; IIRC it used to display OVR in a bottom panel when overwrite mode was on.]


> Surely MS research has shown that this is one of the great pain points with Excel/Word?

Perhaps, but Microsoft is infamous for ignoring Excel's many faults. An article linked yesterday described a litany of very serious statistical errors of long standing, without any indication of a timetable to repair the defects.


Yeah, that. Excel's CSV behavior is worse-than-useless, which is a shame since CSV is generally treated as the easiest way to get data into Excel.


"Another comment asks, why are we seeing so many articles about Excel?"

There's a "reply" link next to comments. It's easy to answer the person asking the question by clicking it and typing a reply.


I love that they published a paper on this. For most of us, this is one of those things that surprises you once and then convinces you not to use Excel for genomic work again, since no magic should be applied without explicit instructions from the scientist. The authors, though, realized that this Excel/gene-name error had affected almost everyone at least once, and probably reasoned that journal referees would be quite sympathetic despite the otherwise bland nature of the story.

The type of auto-conversion that's going on, as mentioned in the article, is e.g. DEC1 (text) to 1-DEC (date), etc.
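
For a handful of well-known symbols the damage can even be reversed after the fact; a minimal Python sketch (the mapping is deliberately incomplete, and the exact date strings Excel produces depend on version and locale):

    # Partial reverse map from Excel's date forms back to gene symbols.
    DATE_TO_GENE = {
        "1-Dec": "DEC1",
        "2-Sep": "SEPT2",
        "1-Mar": "MARCH1",
    }

    def restore(symbol):
        # Return the original gene symbol if the value is a known casualty.
        return DATE_TO_GENE.get(symbol, symbol)

    print(restore("2-Sep"))   # SEPT2
    print(restore("BRCA1"))   # unchanged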


I just changed the format of the column to "text" from "auto" and kept using it for my thesis work. It's too useful to bail on because of a quirk like this.


> I just changed the format of the column to "text" from "auto" and kept using it for my thesis work.

Yes, but for an existing spreadsheet, that won't work ex post facto. A spreadsheet in which some gene names have selectively been converted to dates won't be repaired by changing the column's data type. Such a remedy must be applied in advance of the import.


I was chewing through many datasets at the time, so I needed automation anyway (which was easier to do in R or PHP/Perl, given my background). Even then, I still produced Excel output (with R scripts in an attached zip) when I wanted to share the final results with my PI and collaborators.


This article was published in 2004. It was still causing problems in 2012 [1], when a paper published in the prestigious journal Cell had a few gene names written as dates [2]. Even the GEO [3] had problems with this, according to [1]. The culprit is automatic type conversion, which the user has to understand in order to avoid. Usually the way around this problem is to format the columns as text before the data is entered. But it is much harder to work with someone else's munged data or spreadsheet, especially when it arrives as a .csv file. Even though there is an RFC standard for CSV files, RFC 4180 [4], it describes how existing software handles CSV files rather than prescribing a format.

[1] https://dontuseexcel.wordpress.com/2013/02/07/dont-use-excel...

[2] https://dontuseexcel.wordpress.com/2013/02/07/dont-use-excel...

[3] http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL13667

[4] https://tools.ietf.org/html/rfc4180


I don't know why people are blaming Excel.

The problem is that these researchers, and even the economists behind the global-austerity paper, aren't properly trained in what is essentially a computer programming task, and they don't do things like validate their data. These are basically bugs in their spreadsheets that they didn't catch.

Sure, Excel may behave unexpectedly for their particular uses, but for the vast majority of finance people, it works very predictably. If they had spent time validating the data, they would have realized that the names had been modified, and they could have corrected for it.
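
Validation along those lines doesn't need to be elaborate; a minimal Python sketch that flags identifiers parsing as the short date forms Excel tends to emit (the format list here is an assumption, and %b is locale-dependent):

    from datetime import datetime

    def looks_like_excel_date(value, formats=("%d-%b", "%b-%y")):
        # True if the value parses as one of Excel's short date forms.
        for fmt in formats:
            try:
                datetime.strptime(value, fmt)
                return True
            except ValueError:
                pass
        return False

    for name in ["DEC1", "2-Sep", "BRCA2", "Jun-04"]:
        if looks_like_excel_date(name):
            print("suspicious identifier:", name)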


It's easy to blame the tools. It's harder to admit that you don't understand the tools you are working with.


... or when the wrong tools become the de facto standard people are expected to use.


I am amazed by the diplomacy shown in the linked article. Nowhere does the article suggest that Excel not be used for such important scientific work, work that has a bearing on human health and well-being.

If the paper had been about drunk driving, or guns in the hands of children, it could be expected to suggest obvious remedial steps along with the data. But as to Excel, it's as though it's the only available tool for data reduction and communication. In fact, it's one of the more expensive of the alternatives, many of which are much better suited to the task being described.


This article is from 2004, but I suspect that the problem is still present in the latest version.

A few months ago I had a similar problem with a list of usernames. One of them was something like julio-90 (in Spanish, "julio" is both a person's name and the name of a month), and Excel changed it to the date jul-90 (i.e. 1990-07-01).


If you're importing from a CSV, there's a workaround that, unlike the Microsoft recommendation linked in this article, doesn't require adding a spurious apostrophe to the front of your data: http://stackoverflow.com/questions/165042/stop-excel-from-au...


You might have singled out the one workaround that actually works among the linked conversation's many suggestions that don't: add a space in front of each field of the original CSV data that you don't want converted. This is more effective and reliable if each CSV field is quoted and the space is inside the quotes.
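
A minimal Python sketch of that space-inside-the-quotes workaround (the gene symbols are just examples; note the space does remain in the imported values):

    rows = [["DEC1", "SEPT2"], ["MARCH1", "BRCA1"]]
    with open("genes.csv", "w") as f:
        for row in rows:
            # Quote each field and put a space inside the quotes so Excel
            # no longer recognises the value as a date.
            f.write(",".join('" %s"' % field for field in row) + "\n")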


What am I missing? Why are we seeing so many articles about Excel?


A lot of people use Excel. Some of those people are technical and do wonderful things with it. Some of those people are not technical, and they do wonderful things with it.

But some of the things done with Excel are positively scary, especially when you see just how widespread its use is, and how little scrutiny is given to the spreadsheets, the data going in, and the stuff coming out.

Ray Panko has websites about human error and about spreadsheet error (and there's obviously crossover): http://panko.shidler.hawaii.edu/

And there's the EUropean SPreadsheet Risk Interest Group (http://www.eusprig.org/)

There are a lot of programmers on HN. Seeing how people get work done allows these programmers to spot niches in the market that they can turn into opportunity. There are needs for better data entry validation; better auditing of spreadsheets; better use of databases rather than spreadsheets; etc etc.


Don't forget the tools that would (help to) extract a 'proper app' from your average corporate Excel sheet.

I'm pretty sure there's a huge opportunity there ...


This One Microsoft Excel Error Helped Spark Mass Riots Across Europe http://www.policymic.com/articles/36071/this-one-microsoft-e...


Because Reinhart and Rogoff was such a massive deal.


But the R&R error had almost nothing to do with Excel. The same mistake could have been made with R or pen and paper. They were careless in a way I wouldn't expect a high schooler to be careless. Not in some way that Excel did something unexpected.

It would be like blaming C# because you did foreach over collection A, rather than collection B.


You're right, of course. The problem in the R&R case is not with Excel itself, but that it's being used with a lack of rigour.

If it were being done in C# or R I would expect unit tests and so on. I'm not saying that would make such a mistake impossible, but programmers have processes and tools for a reason.

In the OP's case, though, the problem actually is with Excel.


Basically, this alone may have significantly changed the course of several European nations for the next decade, because of policy made on the basis of that paper. NPR's Planet Money team just released a podcast [1] that goes into the basics of that.

Last year my company was having supply-chain issues. We have a complicated 3-tier supply chain for a critical component, with odd shipping restrictions and a mix of batch and continuous processes, among other things. We don't control the vendors, but we pay on yield and know what material enters and leaves each supplier. We desperately needed insight into why we had delivery issues, as the suppliers were, understandably, not very forthcoming.

I was tasked with writing an app that let analysts run dozens of scenarios and allowed them to tweak all aspects of our models to gain insight into yield and schedule. I could have written a Python script or a Java or C# app (since I don't have SCM or MES software), or I could write an Excel spreadsheet. In 2 weeks, I wrote a spreadsheet that modeled the basics, including complicated recycling, and exposed everything to analysts. We debottlenecked several logistics issues with it and kept the business running, and that spreadsheet is still updated daily. I can guarantee that writing an engine + UI + reports in any language, with the level of flexibility Excel provides, would have taken me 6 months or more. The SCM or MES software would ultimately serve the company better, but who has $200K-$1M+ to spend on these things for every issue when $200 + 2 weeks can get you 80% of the way?

[1] http://www.npr.org/blogs/money/2013/04/19/177999020/episode-...


Not really.

The R&R paper was criticized from the start, but it suited a mindset that was happy to use whatever was convenient. Many hack economic studies that suit the dominant agenda are published each month.

By happenstance, I know that Kenneth Rogoff is pretty much a professional liar, having made a survey of his predictions before and after the bubble (I found a pre-bubble interview with him deriding a doomsayer, and a post-bubble interview with him being a doomsayer and expressing anger at the people, such as his pre-bubble incarnation, who said everything was great). But I'm sure there are many, many Kenneth Rogoffs in the economics field, since much of it involves validating existing policy.

So the R&R paper probably influenced nothing; it was just icing on a cake that was already baked. The discovery of the paper's errors, on the other hand, is more of an outré event. Unlike your average mediocre product, it just happened that this paper had errors so bad they couldn't be waved away. Well, these hacks are done, to be replaced by other hacks. The error discovery may force a short backtracking on austerity idiocy, but I wouldn't count on it.


It's a "broken damn" effect. Once the first one got through, it resonated with everyone's previous knowledge.


nah, the "broken, damn" effect is me taking my lunch break. You're of course talking of the "broken dam" effect :-)


Another problem I've seen with dates and Excel is copying date information from one workbook to another: the workbooks may be using different underlying date systems. Workbooks originating on a Macintosh use the 1904 date system, while Windows defaults to the 1900 date system, so copying dates between workbooks can give you shifted dates. http://support.microsoft.com/kb/180162
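
The workaround the KB article describes is a fixed offset, since the two systems differ by 1,462 days; a minimal Python sketch:

    # Per Microsoft KB 180162, serial dates in the 1900 and 1904 systems
    # differ by 1,462 days (four years, plus one day for Excel's phantom
    # leap day, 1900-02-29).
    DAYS_BETWEEN_SYSTEMS = 1462

    def serial_1904_to_1900(serial):
        # A date copied from a 1904-system (Mac) workbook into a
        # 1900-system (Windows) workbook displays about four years too
        # early unless this offset is added.
        return serial + DAYS_BETWEEN_SYSTEMS

    def serial_1900_to_1904(serial):
        return serial - DAYS_BETWEEN_SYSTEMS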


Anyone know why such a straightforward article has eight authors? What could each of them possibly have contributed?

I have noticed in the past that bioscience articles tend to have lots of authors but always thought it was due to their inherent complexity requiring lots of different skills.

Perhaps it is really just a way to get more published papers for more people to help their academic career. This may explain some of the super long CVs these guys often have.


It is a way to get more published papers. It's very common for a research lab director or faculty member to be listed as an author on all papers produced by their organizations. But the practice also recognizes people who may have been involved in the production of important data in the paper, even if those people didn't write any part of the paper directly or were acting in a "service provider" capacity and are not part of the larger research project that the paper describes.


Makes sense. I have read many more math, stats, and econ papers than bioscience ones. In those areas there tend to be only a few authors. Often there is an acknowledgement of help received, either in the paper or at the bottom of the first page. In these areas papers are written by individuals rather than organizations. Bioscience takes much more infrastructure and coordination, as well as being much more expensive.

Still seems crazy that eight "authors" should get credit for this article.


I have to say this problem would go away in a flash if a database were used instead of a spreadsheet program. The reason is that each database field has a strict type, and "DEC1" in a database field identified as text would never be converted, or in any way changed.

The problem is that Excel spreadsheet fields don't have any specific type -- they're defined on the fly by the data that's inserted into them. And worse, different data are interpreted in different ways in the same column, where you would expect some consistency within the column.

Those accustomed to database work, using tables having strictly defined data types, may be surprised to learn that, during an Excel import, successive fields in the same column can be interpreted in a dozen different ways, based on the data being read, not on the field's defined data type (which doesn't exist in a spreadsheet).
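
A minimal illustration using SQLite (whose typing is looser than most databases', yet which still never rewrites a stored string):

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE genes (symbol TEXT)")
    con.execute("INSERT INTO genes VALUES (?)", ("DEC1",))
    # The type lives in the schema, not in per-cell guesses, so the
    # value comes back exactly as it was stored.
    print(con.execute("SELECT symbol FROM genes").fetchone()[0])  # DEC1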

What I find sad about these recent Excel stories is that few seem to be willing to dump the program and choose an alternative.


I guess I'd just assumed that everyone was aware of Excel's penchant for converting alphanumerics to scientific notation (any numeric sequence that contains what looks like an exponent). It also converts long numeric sequences (say, a 20-digit ID number) to scientific notation. Any alpha sequence that could be interpreted as a date ("02-01") will of course be converted thusly.

Most of these are avoidable, except the long-ID-number problem. Even with careful formatting as text, the last time I ran into this Excel was still performing implicit conversions in ways that weren't immediately apparent, which rendered the whole experiment worthless.
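
The long-ID case is fundamental: Excel stores numbers as 64-bit IEEE 754 floats, which carry only about 15 significant digits. Python floats are the same type, so the loss is easy to demonstrate (the ID is made up):

    id_text = "12345678901234567890"   # a 20-digit identifier
    as_number = float(id_text)
    # The nearest double to the true value; the tail digits are garbage.
    print("%.0f" % as_number)          # 12345678901234567168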

The recent problems with bad formulas are easily solvable using the built-in auditing features, or formula arrays, or just discipline. The shortcomings reported for statistical functions ("Computational Statistics and Data Analysis", June, 2008) are another issue altogether.

Jeff


As noted, this article is from 2004. Unfortunately such errors are still around, as Neil Saunders points out in a recent blog post[1].

1. http://nsaunders.wordpress.com/2012/10/22/gene-name-errors-a...


This article was published in 2004. I would hope that bioinformatics researchers have ways of dealing with these issues now.


As the paper points out, this is an old problem. Excel is an excellent tool, unlike most Microsoft applications. You learn to live with the quirks. When you see the "September" genes, you just file it under "do a final cleanup and be sure this doesn't wind up in the final submission of the paper".





