Excel.vim (github.com/yakiang)
143 points by johannh on Oct 10, 2014 | 57 comments



»Works best on excel files that contain English characters only.«

That's quite a sad statement to make nowadays. I guess the old binary format might be worse regarding character sets, but at least the newer ones should use Unicode exclusively, which makes this a very odd restriction.


I hate this sentiment. I speak English and French and have a limited amount of time to hack on software I'm giving away for free. My day job is conducted entirely in English. I have neither the experience nor the inclination for internationalization of software. Don't like it? Don't use it. Or fork it and add it yourself, because obviously you have more free time than I have.


> I have neither the experience nor the inclination for internationalization of software.

Taking a piece of software and localizing all of its UI text is one thing. Making sure your program doesn't blow up if it encounters UTF-8 is another. Nowadays, if your program chokes on UTF-8, I think it's safe to just consider it broken.

In any case, it looks like this is really where the issue lies:

  # for non-English characters
  def getRealLengh(str):
      length = len(str)
      for s in str:
          if ord(s) > 256:
              length += 1
      return length
and:

          for val in shn.row_values(n):
              try: val = val.replace('\n',' ')
              except: pass
              val = isinstance(val,  basestring) and val.strip() or str(val).strip()
              line += val + ' ' * (30 - getRealLengh(val))
          vim.current.buffer.append(line)
Both are attempts to account for a fixed-width column layout in the presence of non-ASCII characters.
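
For what it's worth, a more robust width calculation would probably use the Unicode east-asian-width property rather than ord() > 256. A rough sketch (the function name is mine, not the plugin's), assuming Python 2 and that the value is already a unicode object:

  import unicodedata

  def display_width(text):
      # Count fullwidth ('F') and wide ('W') characters as two columns,
      # everything else as one -- closer to how fixed-width output lays
      # out CJK text than the ord() > 256 heuristic above.
      width = 0
      for ch in text:
          width += 2 if unicodedata.east_asian_width(ch) in ('F', 'W') else 1
      return width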


Are UTF-8 encoded Excel documents actually common? Do they even exist? I thought Excel used CP 1252 on English Windows and the corresponding code pages on other language versions?


There are two types of formats generally recognized as XLS: Excel 5.0/95 "BIFF5" and Excel 97-2003 "BIFF8". The former uses a language-specific codepage like 1252 and the latter can use a language-specific codepage or the more general 1200 (UTF16LE).

Here is the master list of codepages used by Excel: https://github.com/SheetJS/js-codepage/blob/master/excel.csv (disclaimer: I built this as part of the in-browser XLS parser https://github.com/SheetJS/js-xls)


I'm pretty sure that xlrd decodes it all to unicode() in Python, so that should be a moot point. You would only need to worry about passing it as utf-8 to Vim at that point.
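
Something along these lines, I'd guess (the sheet and row index names are just for illustration; this isn't the plugin's actual code):

  # xlrd hands back unicode objects, so on Python 2 the only boundary
  # work left is encoding to UTF-8 before the text reaches the buffer.
  for val in sheet.row_values(row_index):
      if isinstance(val, unicode):
          val = val.encode('utf-8')
      else:
          val = str(val)
      vim.current.buffer.append(val)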


How would it save a document containing multiple languages, then?


Excel 97-2003 (XLS) actually uses UTF16LE in that case, not UTF8. Excel 2007+ XLSB exclusively uses UTF16LE -- there is no way to force it to use a codepage.


Interesting definition of broken. It seems to work perfectly for me and the creator.


> Interesting definition of broken.

Consider this: a medical device that people's lives depend on. It only fails in 1 out of 1000 cases, causing death. Many people could state, "it works for me, so it can't be broken!" On the other hand, the families of the dead could argue that it is broken. Who is right?

Obviously this isn't such an extreme case. No one has their life depending on a Vim plugin, but it illustrates a point. "Works for me" doesn't necessarily imply "isn't broken."


I think the point is that for many people building free software for fun, "works for me" is all that matters. Testing use cases that you know you will never encounter is not interesting or challenging (at least in this case), but it takes time, and you're not making a product; nobody is relying on you. Why bother?

The author of this plugin isn't trying to make a spreadsheet competitor, they just released it publicly because other people might find it useful or interesting.


I think that we can still consider it broken without demanding that the author 'do it better.' People make broken things all of the time.


We can consider it incomplete, or disagree with the design choices, but broken means it doesn't do what it's supposed to do. Here, it does everything the author intended it to do, and everything the description says it does. When you say it's broken, it sounds at least to me like you're imposing your own requirements on a project you have nothing to do with. It's like calling Microsoft Office broken because it can't handle Open Office files. It's not broken, it just doesn't have all the features you would have included.


It's totally cool to release some software that has some bugs. It's also sad to program in an environment that isn't Unicode-friendly.


This is an imperfect analogy. The medical device is not being given away for free.


The analogy may well be imperfect. This particular distinction doesn't seem all that relevant, though.


Perhaps you could take into account that emoji and other fancy characters are multi-byte UTF-8 characters. UTF-8 support doesn't usually mean "prepare for Swahili" so much as "don't choke on the characters".


The thing is, if you build for Unicode support from the start, these conversations don't need to be had. The problem is that not enough people treat text as a black box from the start (I can understand unwillingness to support bigger things like RTL).


Okay... let's go into this... how are the strings in excel encoded anyway?

I'd be willing to bet money that at least some of the formats in question aren't UTF-8; they are likely encoded against a language-specific character set or code page.

Then you have to read that codepage, convert the necessary characters to their Unicode equivalents, and from there encode down to UTF-8?

Does the language this library is written in support that translation? Are there modules to do that? Is the license for those module(s) necessarily compatible?

Who's going to go through the different document versions to confirm, and adjust for the various encodings for non-ascii characters?

It's not as simple as saying "don't choke on Unicode".


> how are the strings in excel encoded anyway?

Length-prefixed byte arrays encoded using various code pages. There are a small number that Excel uses: https://github.com/SheetJS/js-codepage/blob/master/excel.csv (the columns are CP#, mapping, single/double-byte)
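
To make that concrete, here's a toy illustration of decoding a length-prefixed, codepage-encoded string in Python (the real BIFF records carry more flags and options than this, so treat it as a sketch, not the actual layout):

  import struct

  def decode_prefixed_string(buf, codepage='cp1252'):
      # 2-byte little-endian character count, followed by the encoded bytes.
      (count,) = struct.unpack_from('<H', buf, 0)
      return buf[2:2 + count].decode(codepage)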

> Does the language this library is written in support that translation? Are there modules to do that? Is the license for those module(s) necessarily compatible?

If we can put together an Apache2-licensed module in JS in an afternoon (https://github.com/SheetJS/js-codepage) it can be done in python.

> Who's going to go through the different document versions to confirm, and adjust for the various encodings for non-ascii characters?

Someone already did that: https://github.com/SheetJS/test_files/tree/master/biff5 has artifacts for every language type


> If we can put together an Apache2-licensed module in JS in an afternoon (https://github.com/SheetJS/js-codepage) it can be done in python.

I thought Python 2 was Unicode-unfriendly. So not as easy as JS.


> Does the language this library is written in support that translation? Are there modules to do that? Is the license for those module(s) necessarily compatible?

It's written in Python, which comes with support for pretty much every major encoding¹ out of the box, so yes.

¹: https://hg.python.org/cpython/file/cb94764bf8be/Lib/encoding...
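
For example (code page numbers taken from that Excel list, the byte strings chosen just to illustrate):

  # Western European (code page 1252) and Japanese (code page 932),
  # decoded with nothing but the standard library:
  assert '\xe9\xe8'.decode('cp1252') == u'\u00e9\u00e8'   # e-acute, e-grave
  assert '\x82\xa0'.decode('cp932') == u'\u3042'          # hiragana 'a'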


There just isn't a lot of pervasive experience in the development community with multi-language Unicode development. Also, xlrd is fairly old, although I don't know if that tool is part of what limits this to English.

In ten years it might be better.


Joel Spolsky said that ten years ago. The problem is that devs are afraid to learn Unicode. They treat it like learning a foreign language. It's not even a fun problem, like learning a new programming language, so nobody makes time for it. The only people who learn it are those who make it a point of pride to implement something correctly and handle corner cases.

Unicode isn't even hard: Use UTF-8. Don't try to measure the length of a string unless you're rendering that string and measuring the length in screen units like pixels. If you do those two things, that's 90% of the effort of making Unicode-safe software.
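
In practice that boils down to decoding once at the input boundary, working in Unicode internally, and encoding once at the output boundary. A minimal sketch (file names made up):

  raw = open('names.txt', 'rb').read()
  text = raw.decode('utf-8')      # bytes -> unicode at the edge
  result = text.upper()           # all internal work on unicode objects
  open('out.txt', 'wb').write(result.encode('utf-8'))   # unicode -> bytes on the way out
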

I think both views are valid. Those who don't know how to write Unicode-safe software shouldn't feel shamed into learning Unicode before releasing open source work. Those who already know Unicode should feel happy that they're making other people's lives easier.


But these are file formats that may well not be encoded in UTF-8. The formats already exist; it isn't like he's creating a new spreadsheet format here. Some of them may well be encoded in something that maps cleanly to Unicode/UTF-8, others not so much.


So you write FooToUTF8() and UTF8ToFoo(), where Foo is whatever the encoding is in the external format. Done.

As far as I know, UTF-8 will work 100% of the time, and is almost always the best internal representation for software you write due to how simple and uniform it is. If something is encoded in some other format, you can probably find a conversion function online.
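
In Python that's a couple of one-liners over the built-in codecs machinery. A sketch, with cp1252 standing in for "Foo":

  def foo_to_utf8(raw_bytes, foo_codec='cp1252'):
      # Re-encode bytes from the external ("Foo") encoding into UTF-8.
      return raw_bytes.decode(foo_codec).encode('utf-8')

  def utf8_to_foo(utf8_bytes, foo_codec='cp1252'):
      # And back again, for writing out in the external format.
      return utf8_bytes.decode('utf-8').encode(foo_codec)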


Okay, so why don't you fork the project, and create your simple Foo/UTF8 methods, and confirm that they are the correct Foo/UTF8 methods for each of the document formats supported.

I'm not saying that it's really all that hard, but there are multiple document formats, and versions of those formats. The author obviously didn't need unicode support, so didn't test for it. I'm sure test cases, and a pull request would be welcome.


The grandparent probably wrote the comment in his spare time, too. Does that limit your entitlement to voice criticism or suggestions? Don't like it, don't read it.


Good for you for saying so! I'm always "impressed" by folks who immediately dump on FREE software that doesn't meet their exact needs.


It goes both ways. I hate developers that write some code, dump it on GitHub and say "It's open source, you can always fork it." Whatever happened to taking pride in your work and making it work the best it can?

Like other people have pointed out, handling unicode properly does not mean internationalization. Handling utf-8 isn't even difficult if you just keep it in mind.


I do take pride in my work, and every piece of software I write handles every use case I need it for explicitly. I take offense that you would imply otherwise knowing absolutely nothing of me and my craft. This isn't about UTF-8, it's about an illogical premise where some very shortsighted individuals would rather have only fully fleshed-out, fully baked products in open source. This is highly illogical, and would bury ideas before others had a chance to see them and help develop them. You aren't the judge of what others find difficult, but that doesn't mean that those who struggle with a concept cannot add tremendous value in other areas.

By the way, it's not utf-8, it's UTF-8. Whatever happened to taking pride in your writing and making it the best it can be?

Everyone has their own criteria for quality, and you can't hope to satisfy everyone. Everyone with even a mildly successful project in open source knows this. Scratch your own itch, make it work, accept any request that fits your vision, and keep a permissive licence so those that don't fit can fork. Otherwise, the arrogance being asserted, that you can somehow determine whether my contribution is worthy of existing, is baffling.


I agree with you. Anything else just reeks of over-entitlement. I mean, if someone spends their free time to make something useful, that's great! If it doesn't quite meet your standards, that's your problem, and you can either fork it or send a pull request (i.e. fix it yourself), or pay someone (the original author or someone else) to do it.

I'd rather people who build stuff for themselves release it to the rest of us than keep it to themselves.


You are just being baited. I agree with what you say, but did you really want to burn your energy in this discussion?


It's one thing to internationalize software. That's hard. But not being able to handle UTF-8 in the 21st century is downright shameful.

The notion of non-ASCII characters in users' Excel documents IS NOT something rare, even for English-speaking nations. There are tons of people with foreign names, addresses and other personal information which is commonly stored in Excel documents. And the funny thing is: it's usually not a lot of additional work to support UTF-8 if you START correctly.


Calling the work of someone who releases free software that they may well have hacked together in their free time "downright shameful" seems quite offensive and disregards the work they put into the project. Perhaps the OP had no need to use this code for non-English alphabets or is simply in the early stages and hasn't had the opportunity to fully test and implement this.

Regardless, I think people should be applauded for releasing their work instead of shamed. And of course, if it's not a lot of work, a pull request would likely be appreciated ;)


Hmm, I do not agree with you that the sole act of open-sourcing a piece of software should place it above criticism. Any developer who takes a bit of pride in his work should at least hold himself to some minimal standards, and I think supporting UTF-8 isn't an unreasonable baseline for software released right now.


A few snippets from the Show HN guidelines:

> A Show HN needn't be complicated or look slick. HN users are comfortable with work that's at an early stage.

> Be respectful. Anyone sharing work is making a contribution, however modest.

> When something isn't good, you needn't pretend that it is. But don't be gratuitously negative.

I'm not saying people shouldn't be open to criticism, but I think terms such as "downright shameful" fall under the category of gratuitously negative.


criticism != shaming


Someone recently posted a thread referencing Teddy Roosevelt's quote on 'critics'. That seems to apply well here. http://www.goodreads.com/quotes/7-it-is-not-the-critic-who-c...


Yes, this destroys all of my use cases.


Totally agree. UTF-8 is all over the world, yet we are living like it's the ASCII-only '90s. But since it's Vim, I'm not really surprised; it might actually be tricky to make something work there (though I don't know what the problem really is).


Thanks for sharing this. To the author, if you're reading this, thanks for putting your work up on Github.


Working in a corporate world where an Excel file can often be how you receive bug reports, this will actually fit very well into my workflow.


If the author is reading this:

> For vim 7.3 and less, it works well for almost all kinds of file formats,

> ie. .xls,.xlam,.xla,.xlsb,.xlsx,.xlsm,.xltx,.xltm,.xlt etc

Someone already pointed it out (https://github.com/yakiang/excel.vim/issues/5) on GitHub: xlrd does not support the XLSB format (and the xlrd authors have expressed no interest in adding it).


I just found out about XLSB... if it really saves me so much time/space, I might just start using it for all large spreadsheets.


XLSB is significantly faster to process because it does not require an XML parser to blast through the data. Numbers are stored directly as IEEE754 doubles. Text may end up larger because XLSB uses UTF16LE for everything, including pure-ASCII strings (since Excel uses that format internally, there is no conversion involved and XLSB will still be faster).

The problem you will encounter is that most programs (Numbers, Google Docs) do not support XLSB.

Shameless plug: https://github.com/SheetJS/js-xlsx supports both XLSX and XLSB (AFAICT the only liberally licensed project that handles the format)
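
To illustrate the two encodings mentioned above (this is not the actual XLSB record framing, just the raw value encodings):

  import struct

  # An IEEE754 double, stored directly as 8 little-endian bytes:
  assert struct.unpack('<d', '\x00\x00\x00\x00\x00\x00\xf0\x3f')[0] == 1.0
  # Text stored as UTF-16LE, even for pure-ASCII content:
  assert 'A\x00B\x00C\x00'.decode('utf-16-le') == u'ABC'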


I could actually really use a nice display of CSV and TSV files (with editing).


There's code for a decent plugin here:

http://vim.wikia.com/wiki/Working_with_CSV_files

I'm not sure if it's what you're looking for, but I've found it very useful.


Why make a Vim plugin instead of a standalone program since it’s read-only?


As a Vim plugin, you get a lot of Vim's features for free (e.g. search, familiar bindings). And the code is only 56 lines. A stand-alone program would likely have fewer features and be much longer.


Not everything has to be useful or even sane. Especially on _Hacker_ News.

Edit: To clarify, I think this shows what hacking is all about. Playful cleverness and curiosity.


Most of the functionality is in the xlrd library; think of this as just a frontend.


It's possible they eventually plan to add write support using xlwt.


because.... vim!


I think this could be greatly useful, as sc could have been, too.


I'll definitely try it... good idea


Wow. How far will Vim go if it can now replicate something like Excel?



