Dateparser: Python parser for human readable dates (github.com/scrapinghub)
89 points by juanriaza on Nov 24, 2014 | 31 comments



In the same vein, check out Arrow for a big improvement over Python's standard time/date libraries. As a bonus, it also generates human-readable dates (though I don't think it parses them like this lib does): http://crsmithdev.com/arrow/


Except that arrow is simple-minded when trying to parse date strings (compared to dateparser or delorean[1]). By default, it only tries to match a few patterns. You'll see a lot of this:

    arrow.parser.ParserError: Could not match input to any of
    ['YYYY-MM-DD', 'YYYY-MM', 'YYYY'] on '01-06-17'
Here's a list of (US English–centric) test dates that dateparser ((ddp.get_date_data(date)['date_obj']).date()) and delorean (delorean.parse(date, dayfirst=False, yearfirst=False).date) both parse correctly, nearly all of which arrow fails on:

    01-06-2017
    01-06-17
    2017-01-06

    01/06/2017
    01/06/17
    2017/01/06

    Jan 6, 2017
    Jan 6 2017

    2017, Jan 6
    2017 Jan 6

    2017, January 6
    2017 January 6

    January 6, 2017
    January 6 2017


    January 6nd, 2017
    January 6rd, 2017
    January 6st, 2017
    January 6th, 2017

    January 6nd 2017
    January 6rd 2017
    January 6st 2017
    January 6th 2017


    2017, January 6nd
    2017, January 6rd
    2017, January 6st
    2017, January 6th

    2017 January 6nd
    2017 January 6rd
    2017 January 6st
    2017 January 6th


    01//06/2017
    01//06//2017
    01--06-2017
    01--06--2017

    01/06-2017
    01-06/2017
I like Delorean's API better than arrow's (strictly personal preference) but think dateparser's language detection is interesting.

[1] http://delorean.readthedocs.org/en/latest/quickstart.html
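
For anyone wondering what it takes to accept a list like the one above with only the standard library, here is a minimal sketch. It is not how dateparser or delorean work internally; the `FORMATS` list, `normalize` and `parse_lenient` are names made up for illustration, the formats are month-first (US) by assumption, and the month names rely on an English locale.

```python
import re
from datetime import date, datetime

# Month-first (US) formats, tried in order. This list is an assumption
# made for the sketch, not the list any of the libraries above uses.
FORMATS = [
    "%m %d %Y", "%m %d %y", "%Y %m %d",
    "%b %d %Y", "%B %d %Y", "%Y %b %d", "%Y %B %d",
]


def normalize(text: str) -> str:
    # Drop ordinal suffixes after a digit, even mismatched ones like "6nd".
    text = re.sub(r"(?<=\d)(?:st|nd|rd|th)\b", "", text, flags=re.IGNORECASE)
    # Collapse any run of separators (-, /, comma, whitespace) into one space.
    return re.sub(r"[-/,\s]+", " ", text).strip()


def parse_lenient(text: str) -> date:
    cleaned = normalize(text)
    for fmt in FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date()
        except ValueError:
            continue  # this format did not fit, try the next one
    raise ValueError(f"no known format matches {text!r}")
```

Unlike the dateutil behaviour discussed elsewhere in the thread, this raises `ValueError` on input it cannot match instead of guessing.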


>01-06-17 //

What's the correct parsing of that date? Is it 2001 or 2017 or 1917 or ... is it June or January ...?
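
To make the ambiguity concrete, here is a small stdlib-only sketch that tries the three common field orders on the same string. `READINGS` and `candidate_dates` are illustrative names, not part of any library mentioned here. The century comes from `strptime`'s `%y` rule, which maps 00-68 to 2000-2068 and 69-99 to 1969-1999.

```python
from datetime import date, datetime

# Three plausible field orders for a dd-dd-dd string (illustrative only).
READINGS = {
    "month-day-year": "%m-%d-%y",
    "day-month-year": "%d-%m-%y",
    "year-month-day": "%y-%m-%d",
}


def candidate_dates(text: str) -> dict:
    """Return every reading of `text` that yields a valid calendar date."""
    results = {}
    for label, fmt in READINGS.items():
        try:
            results[label] = datetime.strptime(text, fmt).date()
        except ValueError:
            pass  # this field order is impossible for this input
    return results
```

For `01-06-17` all three readings are valid and give three different dates, so a parser has to be told (or has to guess) the convention; a string like `31-12-99` only survives the day-first reading.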


I was curious what Haskellers were using and found this:

https://github.com/codygman/git-date-haskell

Had to do a small bugfix from bitrot, but this is a nice package that wraps the git date handling code which the author claims[0] was the only sane implementation he could find.

0: http://stackoverflow.com/questions/9831956/parsing-fuzzy-dat...


Arrow is far from "production" ready: it has severe limitations, bugs, and incompatibilities, and the maintainers appear to be unresponsive. It made a big splash here, but an unwarranted one.


Huh! Just last week I did a survey of NLP date parsing libraries. If you're looking for something similar in other languages, see:

https://docs.google.com/spreadsheets/d/1dKt0R247B8Mx5sFXd7ht...


Python also has dateutil, which can do similar things and has been around a long time: https://pypi.python.org/pypi/python-dateutil


If you look at the features section of the GitHub project, it says it is based on dateutil and adds features on top of it.


Yeah, dateutil is cool, but it has a few problems:

  >>> from dateutil import parser
  >>> parser.parse('')
  datetime.datetime(2014, 11, 24, 0, 0)
It gets worse with fuzzy parsing:

  >>> parser.parse('something meaningless', fuzzy=True)
  datetime.datetime(2014, 11, 24, 0, 0)
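
One possible workaround, sketched under assumptions: if the parser fills missing fields from a `default` datetime (dateutil's `parser.parse` takes such a keyword), parsing twice with two different defaults shows whether the text actually determined the date. `parse_strict` and `lenient_stub` are hypothetical names; the stub only mimics the behaviour shown above so that the example runs without dateutil installed.

```python
from datetime import datetime


def parse_strict(text, parse):
    """Reject input whose date came from the parser's default.

    `parse` is any callable with a dateutil-style signature,
    parse(text, default=<datetime>).
    """
    first = parse(text, default=datetime(2000, 1, 1))
    second = parse(text, default=datetime(2001, 2, 2))
    if first != second:
        # Some of year/month/day was taken from the default, not the text.
        raise ValueError(f"{text!r} does not fully specify a date")
    return first


def lenient_stub(text, default):
    """Hypothetical stand-in: falls back to `default` when nothing parses."""
    try:
        return datetime.strptime(text.strip(), "%Y-%m-%d")
    except ValueError:
        return default
```

With python-dateutil installed, something like `functools.partial(dateutil.parser.parse, fuzzy=True)` could presumably be passed as `parse` instead of the stub. Note the trade-off: this also rejects partial dates such as "January 6", because the year would come from the default.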


Can this not be reported as a bug on the project's issue tracker? I don't understand why people trash things in public instead of at least filing a polite issue.


I don't mean to trash anything; these are known bugs (there are lots of them filed there: https://bugs.launchpad.net/dateutil).

It seems that dateutil has just not been receiving much love from its developers lately.


I had no idea! Thanks for sharing this.


So quick question to anyone who's used this lib. The README cites an example: it can give you the date for text like: '1 min ago', '2 weeks ago', '3 months, 1 weeks and 1 day ago', etc

Does it handle proper grammar for singular values (i.e., 1 week vs. 1 weeks)?


Well, it is meant to be very forgiving. Right now it outputs the same thing for both "1 week ago" and "1 weeks ago" (even though the latter is grammatically incorrect).
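
As a rough illustration of that kind of forgiving matching (not dateparser's actual implementation), an optional plural `s` in the pattern is enough to treat both spellings the same. `parse_ago` and the unit table are made up for this sketch, and months are deliberately left out because their length varies, so inputs containing them raise `ValueError`.

```python
import re
from datetime import datetime, timedelta

# Fixed-length units only; months and years are left out on purpose.
SECONDS = {"minute": 60, "min": 60, "hour": 3600, "day": 86400, "week": 604800}

_TERM = r"\d+\s*(?:minute|min|hour|day|week)s?"  # the trailing "s" is optional
WHOLE = re.compile(
    rf"\s*{_TERM}(?:\s*(?:,\s*and|,|and)\s*{_TERM})*\s+ago\s*", re.IGNORECASE
)
TERM = re.compile(r"(\d+)\s*(minute|min|hour|day|week)s?", re.IGNORECASE)


def parse_ago(text: str, now: datetime) -> datetime:
    """Turn '2 weeks ago' or '1 weeks and 1 day ago' into a datetime."""
    if not WHOLE.fullmatch(text):
        raise ValueError(f"unsupported relative date: {text!r}")
    seconds = sum(int(n) * SECONDS[unit.lower()] for n, unit in TERM.findall(text))
    return now - timedelta(seconds=seconds)
```

Passing `now` explicitly keeps the function deterministic and easy to test; a caller would normally pass `datetime.now()`.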

Can you elaborate on what you mean by "proper grammar for singular values"?


I think the line you copied and pasted here answers your own question – '3 months, 1 weeks and 1 day ago.'


Sort of related, I'm the author of ago.py (https://pypi.python.org/pypi/ago/0.0.6) which generates human readable timedeltas that this parser reverses.


I'm ashamed to say that, in the few Python projects I've done, I have resorted to delegating date parsing out to PHP in the past given its amazing date parser. Aside from how silly that sounds, it's actually a pretty fast solution. I'll give this a look and see how it compares. I've found that a lot of Python libraries seem to add an obscene amount of bloat for the functionality I'm looking for.


If you're happy using PHP for this, I don't want to get in the way of your happiness - but if you applied the same standard to that practice as you do to Python libraries you'd certainly see that invoking a separate PHP process is "an obscene amount of bloat for the functionality".


They needn't be invoking a separate process. Could be just calling out to an API provided by a running PHP process, perhaps over HTTP.


FYI the ruby equivalent is chronic: https://github.com/mojombo/chronic




FYI in Clojure (with a live demo): http://duckling-lib.org


How would you propose using that from Python?


With a mini service. You feed it a line of text; it replies with the parsed date in a standard format.
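
A sketch of what that could look like from the Python side, assuming a line-based TCP protocol: one line of text in, one ISO 8601 date (or `ERROR`) out. The wire format and the names `LineParseHandler` and `parse_remote` are invented for this example. The handler is only a stand-in that understands `YYYY-MM-DD`; in the setup described above the server would be written in the other language and wrap the real parser.

```python
import socket
import socketserver
from datetime import date, datetime


class LineParseHandler(socketserver.StreamRequestHandler):
    """Stand-in server: answers each input line with an ISO date or ERROR."""

    def handle(self):
        for raw in self.rfile:  # one request per line until the client hangs up
            text = raw.decode("utf-8", errors="replace").strip()
            try:
                reply = datetime.strptime(text, "%Y-%m-%d").date().isoformat()
            except ValueError:
                reply = "ERROR"
            self.wfile.write(f"{reply}\n".encode("utf-8"))


def parse_remote(text: str, host: str, port: int):
    """Send one line to the service; return a date, or None if it gave up."""
    line = " ".join(text.split())  # keep the request on a single line
    with socket.create_connection((host, port), timeout=5) as conn:
        with conn.makefile("rw", encoding="utf-8", newline="\n") as stream:
            stream.write(line + "\n")
            stream.flush()
            reply = stream.readline().strip()
    if reply in ("", "ERROR"):
        return None
    return date.fromisoformat(reply)
```

To try it locally, start `socketserver.TCPServer(("127.0.0.1", 0), LineParseHandler)` in a background thread and pass its `server_address` to `parse_remote`. Replying in ISO 8601 means the client never needs to know which language or library did the parsing.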


Interesting. However, it doesn't solve what I would argue is the harder problem: how to identify a time in the document.

For example, as I write this, this HN URL says that it is 8 hours old. Without knowing the exact format, how can I extract these sorts of dates out of random text/HTML documents?


This is a hard problem -- there's a bunch of research in NLP on it, where it's sometimes called temporal tagging. HeidelTime is a system that does this; some examples on their webpage, https://code.google.com/p/heideltime/
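
For a feel of the problem, here is a deliberately crude sketch that scans free text for a few hard-coded shapes of time expression. It is nowhere near what a real temporal tagger does and is not based on HeidelTime's rules; the pattern names and `tag_times` are made up for illustration, and anything outside these three English-only shapes is missed.

```python
import re

_MONTH = r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\.?"

# (kind, pattern) pairs; each kind becomes a named group in one big regex.
PATTERNS = [
    ("iso_date", r"\b\d{4}-\d{2}-\d{2}\b"),
    ("month_day_year", r"\b" + _MONTH + r"\s+\d{1,2},?\s+\d{4}\b"),
    ("relative", r"\b\d+\s+(?:minute|hour|day|week|month|year)s?\s+(?:ago|old)\b"),
]
TAGGER = re.compile(
    "|".join(f"(?P<{kind}>{pattern})" for kind, pattern in PATTERNS),
    re.IGNORECASE,
)


def tag_times(text: str):
    """Return (kind, start, end, matched_text) for each expression found."""
    return [
        (m.lastgroup, m.start(), m.end(), m.group())
        for m in TAGGER.finditer(text)
    ]
```

The spans only say where a time expression sits; each matched string would still have to be handed to a parser (and relative ones resolved against the document's own timestamp) to get an actual date.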


Check out the NLP support in parsedatetime;

    https://github.com/bear/parsedatetime/blob/master/parsedatetime/tests/TestNlp.py

may be what you're after.


related, for parsing durations: https://github.com/thraxil/simpleduration/


interesting. it seems to support only English dates, sadly.


open a ticket.



