indatawetrust's comments | Hacker News

An experimental project: https://github.com/lfex/jlfe (Erlang on the JVM).





Neo4j, like all graph databases I've tried, is only okay with small data.

Suppose I want to import a medium-sized graph into Neo4j. Medium-sized as in "fits on a hard disk and doesn't quite fit in RAM". One example would be importing DBPedia.

Some people have come up with not-very-supported hacks for loading DBPedia into Neo4j. Some StackOverflow comments such as [1] will point you to them, and the GitHub pages will generally make it clear that this is not a process that really generalizes, just something that worked once for the developer.

[1] http://stackoverflow.com/questions/12212015/how-to-setup-neo...

Now suppose you want to load different medium-sized graph-structured data into Neo4j. You're basically going to have to reinvent these hacks for your data.

And the last time I tried to load my multi-million-edge dataset into Neo4j through its documented API, I estimated that it would have taken several weeks to finish.

Don't tell me that I need some sort of enterprise distributed system to import a few million edges. Right now I keep these edges in a hashtable that I wrote myself, in not-very-optimized Python, that shells out to "sort" for the computationally expensive part of indexing. It's not a very good DB but it gets the job done for now. It takes about 2 hours to import.
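
For concreteness, a rough sketch of that kind of setup (simplified; the file names and the tab-separated layout here are just placeholders): shell out to sort(1) for the expensive part of indexing, then record per-node byte offsets so lookups can seek straight to a node's block of edges.

  import subprocess

  RAW_EDGES = "edges.tsv"          # placeholder: "source<TAB>target<TAB>props" per line
  SORTED_EDGES = "edges.sorted.tsv"

  # Shell out to sort(1) for the computationally expensive part:
  # ordering all edges by their source node.
  subprocess.run(
      ["sort", "-t", "\t", "-k1,1", "-o", SORTED_EDGES, RAW_EDGES],
      check=True,
  )

  # Build an index mapping each source node to the byte offset and count
  # of its (now contiguous) edges in the sorted file.
  index = {}
  with open(SORTED_EDGES, "rb") as f:
      offset = 0
      current = None
      for line in f:
          source = line.split(b"\t", 1)[0]
          if source != current:
              index[source] = [offset, 0]
              current = source
          index[source][1] += 1
          offset += len(line)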


>>And the last time I tried to load my multi-million-edge dataset into Neo4j through its documented API, I estimated that it would have taken several weeks to finish.

Use the Import tool; it can do a million writes a second. Here is how to import Hacker News into Neo4j using it: https://maxdemarzi.com/2015/04/14/importing-the-hacker-news-...
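
The import tool reads CSV files with special header fields (:ID, :LABEL, :START_ID, :END_ID, :TYPE). A rough sketch of generating them from a plain edge list (the label, relationship type and sample edges are made up; the command at the end is the Neo4j 2.x invocation, newer versions call it neo4j-admin import):

  import csv

  # Hypothetical input: an iterable of (source_id, target_id) string pairs.
  edges = [("dbpedia:Berlin", "dbpedia:Germany"),
           ("dbpedia:Paris", "dbpedia:France")]
  nodes = sorted({n for edge in edges for n in edge})

  # nodes.csv with the importer's ":ID" / ":LABEL" header fields.
  with open("nodes.csv", "w", newline="") as f:
      w = csv.writer(f)
      w.writerow(["id:ID", ":LABEL"])
      for n in nodes:
          w.writerow([n, "Resource"])

  # rels.csv with ":START_ID" / ":END_ID" / ":TYPE" referencing those IDs.
  with open("rels.csv", "w", newline="") as f:
      w = csv.writer(f)
      w.writerow([":START_ID", ":END_ID", ":TYPE"])
      for src, dst in edges:
          w.writerow([src, dst, "LINKS_TO"])

  # Then run the offline importer against an empty database, e.g.:
  #   neo4j-import --into graph.db --nodes nodes.csv --relationships rels.csv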


Thanks, I'll put that on my list of things to try, although I've spent more than enough time banging my head against graph DBs for today.

This isn't insurmountable, but I'm just going to gripe about it: I'm annoyed by the idea that I need to make a table of nodes and load it in first. Every graph DB tutorial seems to do this, because it looks like what you'd do if you were moving your relational data into a graph DB. But I have RDF-like data where nodes are just inconsequential string IDs.

Hm, this indicates that I should definitely be looking at Cayley, which directly supports loading RDF quads.


Let me preface this by saying that I have not used Neo4j in ~1 year.

But before that I used it all the time. I imported the whole US patent DB into it. Its performance was very solid, both for importing and for querying with the Cypher query language, which is well tailored to graph representations and graph algorithms. The graph definitely had millions of edges, with ~8M nodes.

That said, if you are having import issues, you must not be using batched writes. If that's the case, you are creating significant overhead in your write process (every write acquires certain locks and syncs the DB to avoid concurrent-modification issues).
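
For example, batching from Python with the official neo4j driver looks roughly like this (connection details, labels and batch size are placeholders); each UNWIND transaction amortizes the locking and syncing over thousands of edges instead of paying it once per write:

  from neo4j import GraphDatabase

  driver = GraphDatabase.driver("bolt://localhost:7687",
                                auth=("neo4j", "password"))

  # One parameterized statement per batch instead of one write per edge.
  BATCH_QUERY = """
  UNWIND $rows AS row
  MERGE (a:Node {id: row.src})
  MERGE (b:Node {id: row.dst})
  MERGE (a)-[:LINKS_TO]->(b)
  """

  def write_edges(edges, batch_size=10000):
      with driver.session() as session:
          batch = []
          for src, dst in edges:
              batch.append({"src": src, "dst": dst})
              if len(batch) >= batch_size:
                  session.run(BATCH_QUERY, rows=batch).consume()
                  batch = []
          if batch:
              session.run(BATCH_QUERY, rows=batch).consume()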


Very interesting. I think the reality is that in a lot of situations people don't really need the full feature set that graph databases provide.

I ran into a similar problem trying to explore Wikidata's json dumps. It's a lot simpler to load it into MongoDB and create indices where you need them, rather than figuring out how to interface to a proprietary system that you may or may not end up using in the long run.
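
Roughly, the load-and-index step looks like this with pymongo (simplified; the batch size, collection name and second index are just examples, and the per-line parsing assumes the usual one-entity-per-line array format of the dumps):

  import gzip
  import json
  from pymongo import MongoClient

  entities = MongoClient().wikidata.entities   # hypothetical local MongoDB

  def load_dump(path, batch_size=1000):
      # The dump is one big JSON array with one entity per line, so strip
      # the surrounding brackets and trailing commas and parse line by line.
      batch = []
      with gzip.open(path, "rt", encoding="utf-8") as f:
          for line in f:
              line = line.strip().rstrip(",")
              if line in ("[", "]", ""):
                  continue
              batch.append(json.loads(line))
              if len(batch) >= batch_size:
                  entities.insert_many(batch)
                  batch = []
      if batch:
          entities.insert_many(batch)

  # Index only what you actually query; every extra index competes for RAM.
  entities.create_index("id")                       # e.g. "Q42"
  entities.create_index("sitelinks.enwiki.title")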

I'm still having trouble keeping my indices in memory though, and would be keen to know what sort of latency you encounter hitting an on-disk hashtable.


What I end up with is a 4 GB index, whose contents are byte offsets into an 8 GB file containing the properties of the edges.

When I mmap these, an in-memory lookup takes about 1 ms, but I can have unfortunate delays of like 100 ms per lookup if I'm starting cold or hitting things that are swapped out of memory.
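
Roughly, the lookup path looks like this (a simplified sketch; the fixed-width (hash, offset) record layout, the CRC32 hashing and the newline-terminated data records are stand-ins rather than the exact format):

  import mmap
  import struct
  import zlib

  RECORD = struct.Struct("<QQ")   # (key hash, byte offset) -- assumed layout

  class EdgeStore:
      def __init__(self, index_path, data_path):
          self._idx = open(index_path, "rb")
          self._dat = open(data_path, "rb")
          self.index = mmap.mmap(self._idx.fileno(), 0, access=mmap.ACCESS_READ)
          self.data = mmap.mmap(self._dat.fileno(), 0, access=mmap.ACCESS_READ)
          self.n = len(self.index) // RECORD.size

      def lookup(self, key):
          # Binary search over fixed-width records in the mmapped index.
          # The OS page cache decides what is actually resident, which is
          # where the ~100 ms cold lookups come from.
          target = zlib.crc32(key.encode("utf-8"))
          lo, hi = 0, self.n
          while lo < hi:
              mid = (lo + hi) // 2
              h, offset = RECORD.unpack_from(self.index, mid * RECORD.size)
              if h == target:
                  # Data records are assumed to be newline-terminated.
                  end = self.data.find(b"\n", offset)
                  return self.data[offset:end]
              if h < target:
                  lo = mid + 1
              else:
                  hi = mid
          return None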


Also, yeah, I don't really need most of the things that graph DBs are offering. They seem to focus a lot on making computationally unreasonable things possible -- such as arbitrary SPARQL queries.

I'm not the kind of madman who wants to do arbitrary SPARQL queries. I just want to be able to load data quickly, look up the edges around a node quickly, and then once I can do that, I'd also like to be able to run the occasional algorithm that propagates across the entire graph, like finding the degree-k core.
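
The degree-k core, for example, only needs the standard peeling algorithm over a plain adjacency structure; a minimal in-memory sketch:

  from collections import defaultdict

  def k_core(edges, k):
      """Return the nodes of the degree-k core: the maximal subgraph in
      which every node has at least k neighbours."""
      adj = defaultdict(set)
      for a, b in edges:
          adj[a].add(b)
          adj[b].add(a)

      # Peel: repeatedly remove any node whose degree has fallen below k.
      queue = [n for n, nbrs in adj.items() if len(nbrs) < k]
      removed = set()
      while queue:
          n = queue.pop()
          if n in removed:
              continue
          removed.add(n)
          for m in adj[n]:
              adj[m].discard(n)
              if m not in removed and len(adj[m]) < k:
                  queue.append(m)
      return set(adj) - removed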


I use MongoDB for a Wikidata replica and index performance is quite good. I use some hacks to keep the size of indexed values low (see https://github.com/ProjetPP/WikibaseEntityStore/blob/master/... ). It helps a lot with keeping the indexes in memory.
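
For illustration, one generic way to keep indexed values small (a simplified sketch, not necessarily what WikibaseEntityStore does; the field names are made up): index a short hash of the value and re-check the full value after the lookup.

  import hashlib
  from pymongo import MongoClient

  entities = MongoClient().wikidata.entities   # hypothetical collection

  def short_hash(value):
      # Eight bytes of a digest instead of a full, possibly long, string.
      return hashlib.sha1(value.encode("utf-8")).digest()[:8]

  # The small hashed field is what gets indexed (field name is made up).
  entities.create_index("label_hash")

  def find_by_label(label):
      # The hash hits the compact index; the equality check on the full
      # value filters out any hash collisions.
      return entities.find({"label_hash": short_hash(label),
                            "labels.en.value": label})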

It powers https://askplatyp.us quite well.


I loaded the full Freebase dump into Cayley a year or so ago (Freebase probably already counts as big). It took around one week on a pretty beefy machine, the problem being mainly the way the data was structured in the dump.

But after the import, it worked pretty well, without issues and with decent performance. Decent meaning: not good enough for a front end, but good enough for running analysis in a back end with a little bit of caching.


FYI, we have had extremely good performance with the LOAD CSV feature of Cypher.
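
For reference, a typical LOAD CSV statement looks roughly like this (the labels, file name and commit size are placeholders; USING PERIODIC COMMIT is the Neo4j 2.x/3.x way to batch the commits):

  # Cypher statement, runnable from the Neo4j shell/browser or any driver.
  LOAD_CSV = """
  USING PERIODIC COMMIT 10000
  LOAD CSV WITH HEADERS FROM 'file:///edges.csv' AS row
  MERGE (a:Node {id: row.src})
  MERGE (b:Node {id: row.dst})
  MERGE (a)-[:LINKS_TO]->(b)
  """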



Great course, by the way. +1


Is it funny?




> Note: We are only accepting applications from programmers.

Significant.


This will be a bit off topic, but what are some examples of applications developed with Rust?


There's a lot of OSS stuff as well (well, "a lot" for how old Rust is), but we just recently added https://www.rust-lang.org/friends.html to the website to showcase production users.


The idea of using Docker is excellent.

