
Python 3.7+: Counter(elems).values()

But we can play this game both ways. What's the K for (Counter(a) & Counter(b)).most_common(3)?
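
For anyone playing along: that expression intersects the two multisets (keeping the lower of the two counts per key) and returns the three most frequent results. A small illustrative run (the data here is made up):

    >>> from collections import Counter
    >>> a = [1]*5 + [2]*3 + [3]*1 + [4]*2
    >>> b = [1]*2 + [2]*4 + [3]*3
    >>> (Counter(a) & Counter(b)).most_common(3)
    [(2, 3), (1, 2), (3, 1)]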




It's a little funny comparing a language with another language plus its entire ecosystem, but k fares remarkably well, I think.

Counter could be #:'=: (count each group)

    c:#:'=:
Values would just be dot.

Counter(x) & Counter(y) is a bit tricky to write, because while the documentation says set intersection, what it really means is the intersection of the keys, keeping the lower of the two values.
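
In plain Python terms, that behaviour amounts to taking the common keys with the minimum of the two counts (a rough equivalence check; it ignores Counter's handling of non-positive counts):

    >>> from collections import Counter
    >>> a, b = Counter("aabbbc"), Counter("abbd")
    >>> dict(a & b) == {k: min(a[k], b[k]) for k in a.keys() & b.keys()}
    True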

This is entirely clear in k. First, understand that I have a common "inter" function that I use frequently (it's actually called "inter" in q):

    n:{x@&x in y}
and from there, I can implement this weird function with this:

    i:{(?,/n[!x;!y])#x&y}
I've never needed this particular function, but I can read out the characters: distinct raze intersection of keys-of-x and keys-of-y, taking [from] x and y. It sounds very similar to the definition of the function I gave above (if you realise the relationship between "and" and "lower of the values"), and this was painless to write.

most_common could be {k!x k:y#>x}, but if I'm going to want the sort (in q) I might write y#desc x, which is shorter. Here I can save a character by reversing the arguments:

    m:{k!y k:x#>y}
so our finished program (once we've defined all our "helper functions" in a module) is:

    m[3] i . c'(x;y)
So we're looking at 13 lexemes, versus 19 - a minor victory unless you count the cost of:

    from collections import Counter
and start counting by bytes! But there's more value here:

All of these functions operate on data, so whether the data lives on the disk or not makes no difference to k.

That's really powerful.

In Python I need more code to get it off the disk if that's where it is.

I may also have to deal with the possibility that the data "lives" in another object (like a numpy array), so either I trust the (in)efficiency of the numpy-to-list conversion, or I have to do something else (in which case really understanding how Counter(a) & Counter(b) works will be important!).
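
For example, if the data shows up as a numpy array, you either pay for that conversion or sidestep Counter entirely; a sketch of the two options (`arr` is just a stand-in for whatever array you actually have):

    import numpy as np
    from collections import Counter

    arr = np.array([1, 1, 2, 3, 3, 3])
    counts = Counter(arr.tolist())                             # convert, then count
    counts2 = dict(zip(*np.unique(arr, return_counts=True)))   # stay in numpy instead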

I might also struggle to make this work on a big data set in python, but k will do just fine even if you give it millions of values to eat.

These things are valuable. Now if you're interested in playing games, let's try something hard:

    "j"$`:x?`f
It creates (if necessary) a new enumeration in a file called "x" which persists the unique index of f, so:

    q)"j"$`:x?`f
    0
    q)"j"$`:x?`g
    1
    q)"j"$`:x?`h
    2
    q)"j"$`:x?`f
    0
How would you do this in Python?


> It's a little funny comparing a language with another language plus its entire ecosystem

This is pretty much my point: any ‘reimplement this weird built-in thing in $OTHER_LANG’ is going to involve overheads unless $OTHER_LANG also has that thing. I don't think keeping most types out of prelude should be a downside; you can always ‘from myprelude import * ’ if you disagree.

> How would you do this in Python?

I don't really know what "persists the unique index of f" means, but this seems similar to shelve. I can misuse that to give the same effect as what you showed.

    >>> import shelve
    >>> with shelve.open('x') as db: db.setdefault("f", len(db))
    0
    >>> with shelve.open('x') as db: db.setdefault("g", len(db))
    1
    >>> with shelve.open('x') as db: db.setdefault("h", len(db))
    2
    >>> with shelve.open('x') as db: db.setdefault("f", len(db))
    0


> I don't think keeping most types out of prelude should be a downside

I don't understand what that means.

> I don't really know what "persists the unique index of f" means, but this seems similar to shelve. I can misuse that to give the same effect as what you showed.

I think you got it, but it seems like a lot of typing!

How many characters do you have to change if you want it purely in memory? In k I just write:

    `x?`f


‘Prelude’ is the set of standard functions and types that you don't have to import to use. Python comes with a rich standard library, but most require imports to use.

> How many characters do you have to change if you want it purely in memory?

When the question is whether ‘K can reduce line count by a factor of 1000’, shorter keywords hardly cut it. ‘setdefault’ is wordier than ‘?’, but as something you're only going to use on occasion, so what? And d.setdefault(_, len(d)) is something you'll use maybe once a year.

shelve acts like a dict, so you can do the setdefault dance the same way on in-memory dicts.
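
For instance, the purely in-memory version of your enumeration is the same dance on a plain dict:

    >>> d = {}
    >>> d.setdefault("f", len(d))
    0
    >>> d.setdefault("g", len(d))
    1
    >>> d.setdefault("h", len(d))
    2
    >>> d.setdefault("f", len(d))
    0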


> ‘Prelude’ is the set of standard functions and types that you don't have to import to use. Python comes with a rich standard library, but most require imports to use.

I might not understand this about python.

    $ python3
    Python 3.7.6 (default, Dec 30 2019, 19:38:26) 
    [Clang 11.0.0 (clang-1100.0.33.16)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> Counter
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    NameError: name 'Counter' is not defined
Should that have worked? Is there something wrong with my Python installation?

> but as something you're only going to use on occasion, so what? And d.setdefault(_, len(d)) is something you'll use maybe once a year.

If it takes that many keystrokes, you're definitely only going to use it once per year! This is the normal way you make enumerations in q/k.

The low number of source-code bytes is the major feature of k. I've written some on why this is valuable, but at the end of the day I don't really have anything more compelling than "try it, you might like it", and it works for me. This is something I'm working on, though.

- https://news.ycombinator.com/item?id=8476113

- https://news.ycombinator.com/item?id=10876975


Counter is not in Python's prelude. Therefore you have to import it in order to use it.

My comment about prelude was in response to you saying I was using Python's "language plus its entire ecosystem". This is not true; Counter is part of the standard library that comes with the language. It is not from a third-party package.

> If it takes that many keystrokes, you're definitely only going to use it once per year! This is the normal way you make enumerations in q/k.

You don't use this frequently in other languages because you almost never care about the index of a value in a hash table (because it's ridiculous), and you almost never want to brute-force unique a list (because it's a footgun).


> you saying I was using Python's "language plus its entire ecosystem". This is not true; Counter is part of the standard library that comes with the language. It is not from a third-party package.

I'm not sure I agree "third-party package" is a good/useful place to draw the line, but I don't think it's terribly important. Sorry.

> you almost never care about the index of a value in a hash table (because it's ridiculous)

So I want to query all of the referrer URLs I've seen recently. There are only a few distinct ones, so I want a table of all the URLs, and another table with one row per event (a landing on my page). In SQL you might use a foreign key, but in q I can also make the enumeration directly. I don't really care about the value of the index, just that I have one.
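
Roughly, the same idea in Python would be an interning dict used as the foreign key (a sketch; the names here are made up):

    url_ids = {}        # url -> small integer index (the "enumeration")
    events = []         # one row per landing on the page

    def record(url, metric):
        uid = url_ids.setdefault(url, len(url_ids))   # enumerate the url
        events.append((uid, metric))                  # store the index, not the string

    record("google.com", "clk")
    record("google.com", "imp")
    record("python.org", "imp")
    # url_ids -> {"google.com": 0, "python.org": 1}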


I wouldn't want to use an index as a long-lived key like that, since it makes ops like deletion a footgun.
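
(For example, if the index is just a position, deleting an entry shifts everything after it and stored references silently point at the wrong rows; a toy illustration:)

    urls = ["google.com", "python.org", "example.net"]
    event = ("spam", 1)      # refers to urls[1] == "python.org"
    del urls[0]              # drop google.com
    urls[event[1]]           # now "example.net": the stored index is silently wrong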

For temporary calculations I'd just `{u: Url(u) for u in url_strings}` for the table of URLs, and stick the `Url` objects in the table of events.


I don’t understand any of this. I don’t know what a footgun is. I have no idea what a “URL” object is or how you stick it in a table of events (on disk? in memory? in a remote process called a “database”?). I don’t know what you mean by temporary calculations.

I have fifty billion of these events to handle every day, and I do this on one server (that is actually doing other things as well). It is not an option to “just” do anything at this scale if I want to keep costs under control.


Sorry, it's a bit weird not sharing a lexicon. I get the impression Q is a whole different genealogy of programmers.

In Python, although you can just use shelve to store data on disk, in practice this is considered a bad idea beyond very simple cases. Valuable data wants the guarantees that real databases provide, like ACID. shelve doesn't provide this, and IIUC nor does kdb+.

So if you're handling 50 billion events a day, live, and you need these to persist, you'd use SQL or something similar. That would then ultimately determine how you add and manipulate records.

If you don't care that much whether you lose the data on a crash, that's when we're talking about temporary calculations. In Python, rather than having two tables like (e.g.)

URLs:

    url_id  url          count
       1    google.com    300
       2    python.org    400
       3    example.net   200
Requests:

    request_id   data   url_id
        1       'spam'     2
        2       'eggs'     2
       ...
       900       'ham'     1
you would make a custom type, e.g. ‘Url’, containing ‘url’ and ‘count’ as object fields, and then store your requests as a list holding references to those ‘Url’s:

    from collections import namedtuple
    Url = namedtuple("Url", "url count")          # the custom type described above
    Request = namedtuple("Request", "data url")   # event data plus a reference to a Url

    urls = {
        "google.com":  Url("google.com",  300),
        "python.org":  Url("python.org",  400),
        "example.net": Url("example.net", 200),
    }

    requests = [
        Request("spam", urls["python.org"]),
        Request("eggs", urls["python.org"]),
        ...,
        Request("ham",  urls["google.com"]),
    ]


> ... like ACID. shelve doesn't provide this, and IIUC nor does kdb+. So if you're handling 50 billion events a day, live, and you need these to persist, you'd use SQL or something similar. That would then ultimately determine how you add and manipulate records. …

ACID is overrated. You can get atomicity, consistency, isolation and durability easily with kdb as I'll illustrate. I appreciate you won't understand everything I am saying though, so I hope you'll be able to ask a few questions and get the gist.

First, I write my program in g.q and start a logged process:

    q g -L
This process receives every event in a function like this:

    upd:{r[`u?y`url;y`metric]+:1}
There's my enumeration, saved in the variable "u". "r" is a keyed table where the keys are that enumeration, and the metric is whatever metric I'm tracking.
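
Very roughly, a Python analogue of that upd (purely illustrative; a dict of Counters stands in for the keyed table, and the event y is a plain dict):

    from collections import Counter, defaultdict

    u = {}                                  # the enumeration: url -> index
    r = defaultdict(Counter)                # "keyed table": index -> metric -> count

    def upd(y):
        i = u.setdefault(y["url"], len(u))  # `u?y`url
        r[i][y["metric"]] += 1              # r[...;y`metric]+:1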

I checkpoint daily:

    eod:{.Q.dd[p:.Q.dd[`:.;.z.d];`r] set r;.Q.dd[p;`u] set u;r::0#r;system"l"}
This creates a directory structure with one directory per date, e.g. 2020.03.11, holding the files (u and r) with the snapshots I took. I truncate my keyed table (since it's a new day), then tell q the logfile can be truncated, and processing continues! Here's an (emptyish) tree right after a forced checkpoint:

    total 24
    drwxr-xr-x  4 geocar  staff  128 11 Mar 16:59 2020.03.11
    -rw-r--r--  1 geocar  staff    8 11 Mar 16:59 g.log
    -rw-r--r--  1 geocar  staff  206 11 Mar 16:55 g.q
    -rw-r--r--  1 geocar  staff  130 11 Mar 16:59 g.qdb

    geocar@gcmba a % ls -l 2020.03.11 
    total 16
    -rw-r--r--  1 geocar  staff  120 11 Mar 16:59 r
    -rw-r--r--  1 geocar  staff   31 11 Mar 16:59 u
The g.q file is the source code we've been exploring, but the rest are binary files in q's "native format" (it's basically the same as in memory; that's why q can get this data with mmap).

If I've made a mistake and something crashes, I can edit g.q and restart it; the log replays and no data is lost. If I want to do some testing, I can copy g.log off the server and load it into a local process running on my laptop. This can be really helpful!

I can kill the process, turn off the server, add more disks in it, turn it back on, and resume the process from the last checkpoint.

You can see some of these qualities are things only databases seem to have, and it's for that reason that kdb is marketed as a database. But that's just because management has a hard time thinking of SQL as a programming language (even though it's there in the name! I can't fault them, it is a pretty horrible language), and while nobody wants to write stored procedures in SQL, that's one way to think about how your entire application is built in q.

That's basically it. There's a little bit more code to load state from the checkpoint and set up the initial day's schema for r:

    u:@[get;.Q.dd[.Q.dd[`:.;last key `:.];`u];{0#`}];
    r:([url:`u$()]; req:0#0; imp:0#0; clk:0#0; vt:0#0);
but there's no material difference between "temporary calculations" and ones that will later become permanent: all of my input was in the event log, and I just have to decide how to process it.
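
For readers more comfortable in Python, here is a bare-bones sketch of the same log, checkpoint, and replay idea (the file names, formats and update rule are all made up; this is not how q/kdb stores anything):

    import json, os, datetime

    LOG = "g.log"
    state = {}                                   # stands in for r and u above

    def apply_event(state, event):
        state[event["url"]] = state.get(event["url"], 0) + 1   # placeholder update rule

    def record(event):
        with open(LOG, "a") as f:                # append to the log first...
            f.write(json.dumps(event) + "\n")
        apply_event(state, event)                # ...then apply it in memory

    def replay():
        if os.path.exists(LOG):                  # after a restart, rebuild state
            with open(LOG) as f:
                for line in f:
                    apply_event(state, json.loads(line))

    def checkpoint():
        day = datetime.date.today().isoformat()
        os.makedirs(day, exist_ok=True)
        with open(os.path.join(day, "state.json"), "w") as f:
            json.dump(state, f)                  # daily snapshot...
        open(LOG, "w").close()                   # ...then truncate the log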


I mean, sure, if your problem is such that a strategy like that works for you, I'm not going to tell you otherwise. You can log incoming messages and dump data out to files easily with Python too. I wouldn't want to call that a ‘database’, though, since it's no more than a daily archive.


Yes! "databases" are all overrated too. Slow expensive pieces of shit. No way you could do 50 billion inserts from Python to SQL server on a single core in a day!

I'm so glad k isn't a "database" like that.


It's a bit unfair to compare the speed of wildly inequivalent things. RocksDB would be more comparable, but even there it is offering much stronger resilience guarantees, multicore support, and gives you access to all your data at once.

Calling them expensive is ironic AF. Most of them are free and open source.


> It's a bit unfair to compare the speed of wildly inequivalent things.

Yes, but I understand you keep doing it because you don't understand this stuff very well yet.

> RocksDB would be more comparable

How do you figure that? RocksDB is not a programming language.

If you combine it with a programming language like C++, it can, with only 4x the hardware, just about keep up with 1/6th of my MacBook Air[1].

RocksDB might be more comparable to gdbm, but it's not even remotely like q or k.

[1]: And that's taking facebook's benchmarks at their word here, ignoring how utterly synthetic these benchmarks read: https://github.com/facebook/rocksdb/wiki/Performance-Benchma...

> much stronger resilience guarantees,

You're mistaken. There is no resilience guarantee offered by RocksDB. In q I can back up the checkpoints and the logs independently. It is trivial to get whatever level of resilience I want out of q just by copying regular files around. RocksDB requires more programming.

> gives you access to all your data at once

You're mistaken. This is no problem in q. All of the data is mmap'd as soon as I access it (if it isn't mmap'd already).

> Calling them expensive is ironic AF. Most of them are free and open source.

If they require 4x the servers, they're at least 4x as expensive. If it takes 20 days to implement instead of 5 minutes, then it's over 5000x as expensive.

No, calling that "free" is what's ironic, and believing it is moronic.


> How do you figure that? RocksDB is not a programming language.

I'm comparing to the code you showed. You're using the file system to dump static rows of data. All your data munging is on memory-sized blocks at program-level. Key-value stores are the comparable database for that.

> You're mistaken. This is no problem in q. All of the data is mmap'd as soon as I access it (if it isn't mmap'd already).

Yes, because you're working on the tail end of small, immutable data tables, rather than an actually sizable database with elements of heterogeneous sizes.

> In q I can back up the checkpoints and the logs independently. It is trivial to get whatever level of resilience I want out of q just by copying regular files around.

Yes, because you don't want much resilience.

---

What you're doing here is incredibly simplistic. It's not proper resiliency, it's not scalable to more complex problems, and it's not scalable to larger workloads. An mmap'ed table and an actual database are different things.

It works fine for you, but it won't for many other people.


> You're using the file system to dump static rows of data

That's what MySQL, PostgreSQL, SQL Server, and Oracle all do. They write to a logfile (called the "write-ahead log"), then periodically (and concurrently) process it into working sets that are checkpointed (checked) in much the same way. It's a lot slower because they don't know what is actually important in a statement except what they can deduce from analysis. That analysis is slow, but they do it so that structural concerns can be handed off to a data expert (often called a DBA), since most programmers have no fucking clue how to work with data.

That can work for small data, but it doesn't scale past around the 5bn inserts/day mark currently, without some very special processing strategies, and even then, you don't get close to 50bn.

> All your data munging is on memory-sized blocks at program-level.

That is literally all a computer can do. If you think otherwise, I think you need a more remedial education than the one I've been providing.

> What you're doing here is incredibly simplistic. It's not proper resiliency, it's not scalable to more complex problems, and it's not scalable to larger workloads. An mmap'ed table and an actual database are different things.

Yes, except that nothing you said is true in the way that you meant it.

Google.com does not query a "database" that looks any different from the one I'm describing; Bigtable was based on Arthur's work. So were Apache's Kafka and Amazon's Kinesis. Stream processing is undoubtedly the future, but it started here:

https://pdfs.semanticscholar.org/15ec/7dd999f291a38b3e7f455f...

Not only does this strategy get used for some of the hardest problems and biggest workloads, it's quite possibly the only strategy that can be used for some of these problems.

Resiliency? Simplistic? I'm not even sure you know what those words mean. Putting "proper" in front of them is just weasel wording...


> I'm not even sure you know what those words mean.

I don't think you are doing your case any favors with such rhetoric.


It wasn't rhetorical.


Ight imma head out, this is clearly not a good direction for the discussion.



