GraphGPT: Extrapolating knowledge graphs from unstructured text

bitforger · on Feb 1, 2023

Pretty cool.

I once worked on AI Dungeon and we had a similar idea to parse the story so far into a graph, so that we could manage long-term memory outside of the context window (which was only 2048 tokens).

Coreference is hard. ("he took the sword"... who is he?) Updating the graph is also hard. (As the story progresses, new facts contradict old facts. Jenny was dating Tom, but now she's dating Mike.)

And knowing what to do with the knowledge graph is hard too, especially if you don't know the schema up front. The only thing we could think to use it for was... programmatically turning relevant sections back into text and prepending it to the context window. (There were easier ways to get a similar effect.)
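
A minimal sketch of that last idea (turning relevant parts of the graph back into text and prepending them to the prompt). It assumes facts are stored as plain (subject, relation, object) triples and that "relevant" simply means "mentions an entity from the recent story"; the function names are hypothetical and this is not how AI Dungeon actually did it.

```python
from typing import Iterable, List, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object); an assumed storage format


def relevant_triples(triples: Iterable[Triple], entities: Iterable[str]) -> List[Triple]:
    """Keep triples whose subject or object is one of the given entities."""
    wanted = {e.lower() for e in entities}
    return [t for t in triples if t[0].lower() in wanted or t[2].lower() in wanted]


def triples_to_text(triples: Iterable[Triple]) -> str:
    """Render each triple as a short sentence, e.g. 'Jenny is dating Mike.'"""
    return " ".join(f"{s} {r} {o}." for s, r, o in triples)


def build_prompt(triples: Iterable[Triple], entities: Iterable[str], recent_story: str) -> str:
    """Prepend the rendered facts to the recent story text."""
    memory = triples_to_text(relevant_triples(triples, entities))
    return f"Known facts: {memory}\n\n{recent_story}" if memory else recent_story


graph = [
    ("Jenny", "is dating", "Mike"),
    ("Tom", "owns", "a sword"),
    ("Mike", "lives in", "Oakvale"),
]
prompt = build_prompt(graph, ["Jenny"], "Jenny walked in.")
```

Entity matching here is exact string comparison, so it sidesteps the coreference problem rather than solving it.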

Agentlien · on Feb 1, 2023

It's really fascinating hearing about this and what the issues were. I have played a lot of AI Dungeon on and off and this always felt like part of what was missing: some way for it to keep a structured view of the story to help consistency. The biggest problem has always been that it keeps contradicting itself or loses track of the plot. It's gotten a bit better with the manageable context being fed back each step, but it's still not nearly good enough.

varunshenoy · on Feb 1, 2023

Handling state (especially long-term) is really a struggle for LLMs right now. This issue should become easier to work with as context windows scale up in the next couple years (or months, who knows!).

dm3 · on Feb 1, 2023

People are already making progress on this, e.g. the H3 project[1].

[1] https://arxiv.org/abs/2212.14052

inciampati · on Feb 1, 2023

This is the most excited I've ever been about sequence models! If the claims of the H3 (and S4) authors are true, then we are on the cusp of something very big that will provide another quantum leap in LLM performance. I worry that the claims may come with a hidden catch, but we just have to work with these systems to know.

I'll venture that once truly long range correlations can be managed (at scales 100-1000x what's possible with current GPTs), all the issues about logical reasoning can be answered by training on the right corpus and applying the right kinds of human guided reinforcement.

machiaweliczny · on Feb 1, 2023

Google scaled context to 40K tokens

contravariant · on Feb 1, 2023

Using tokens as context still sounds to me like you're asking someone to read back text that someone else wrote and continue the story. It might work but it's not the best way to get a coherent narrative.

inciampati · on Feb 1, 2023

How can you have a coherent narrative if you can't link things across very large contexts?

contravariant · on Feb 1, 2023

I'm saying the context should consist of more than just tokens.

Dwolb · on Feb 1, 2023

The new facts contradicting old facts thing is fascinating to me.

Why can’t graphs properly model time or sequences?

yorwba · on Feb 1, 2023

It's possible to model by annotating facts in the database with a timestamp (Wikidata has this, as well as qualifiers for e.g. the source of a statement, or that it applies within a restricted context) but you still need to somehow integrate the information if you want to know the state right now. E.g. if you have (Jenny, date, Tom) from a year ago and (Jenny, date, Mike) from yesterday, does that mean (Jenny, date, Tom) is no longer valid? Or are both simultaneously true? Or is (Jenny, date, Mike) invalid too, because yesterday was like ages ago?

You could have some heuristics to handle this and then you add another relation "has met" and suddenly you need a whole new set of heuristics.

gryn · on Feb 1, 2023

you can have a date_start and date_end to handle this ambiguity. but yes the complexity lies in the interpreter/reasoner that has to deal with these facts and evolution of this (meta)schema.
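
A rough sketch of the date_start/date_end idea, assuming one hand-written heuristic: relations listed in a hypothetical EXCLUSIVE_RELATIONS set can hold only one open fact per subject, so asserting a new one closes the old one. Everything here (names, the set, the closing rule) is illustrative; it mainly shows where the per-relation heuristics end up living.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import List, Optional

# Assumption: these relations allow only one open fact per subject at a time.
EXCLUSIVE_RELATIONS = {"dating"}


@dataclass(frozen=True)
class Fact:
    subject: str
    relation: str
    obj: str
    date_start: date
    date_end: Optional[date] = None  # None means "still valid as far as we know"


def add_fact(facts: List[Fact], new: Fact) -> List[Fact]:
    """Return a new list with `new` added; close conflicting open facts."""
    updated = []
    for f in facts:
        conflicts = (
            new.relation in EXCLUSIVE_RELATIONS
            and f.subject == new.subject
            and f.relation == new.relation
            and f.date_end is None
        )
        updated.append(replace(f, date_end=new.date_start) if conflicts else f)
    updated.append(new)
    return updated


def valid_on(facts: List[Fact], day: date) -> List[Fact]:
    """Facts whose [date_start, date_end) interval contains `day`."""
    return [
        f for f in facts
        if f.date_start <= day and (f.date_end is None or day < f.date_end)
    ]


facts: List[Fact] = []
facts = add_fact(facts, Fact("Jenny", "dating", "Tom", date(2022, 2, 1)))
facts = add_fact(facts, Fact("Jenny", "dating", "Mike", date(2023, 1, 31)))
facts = add_fact(facts, Fact("Jenny", "has met", "Tom", date(2021, 1, 1)))
facts = add_fact(facts, Fact("Jenny", "has met", "Mike", date(2022, 12, 1)))
```

Note that "has met" is left out of the exclusive set, so both of those facts stay open; every new relation needs a decision like this.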

But RDF-style and labeled property graph data modeling approaches have multiple ways of dealing with this.

babelchips · on Feb 1, 2023

The way Datomic handles facts, accumulating them and providing point-in-time queries, is very effective.

Facts can contradict each other. Old facts are not lost. Querying requires a notion of time - “as of”.
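
To make the "accumulate facts, query as of a point in time" idea concrete, here is a toy append-only log in Python. It is only loosely inspired by that model and is not Datomic's API; the class and method names (FactLog, transact, as_of, history) are made up for this sketch.

```python
from typing import Iterable, List, NamedTuple, Set, Tuple


class Entry(NamedTuple):
    entity: str
    attribute: str
    value: str
    tx: int      # transaction number, increases by one per transact() call
    added: bool  # True = assertion, False = retraction


class FactLog:
    """Append-only log: nothing is overwritten, old entries stay queryable."""

    def __init__(self) -> None:
        self._log: List[Entry] = []
        self._tx = 0

    def transact(self, changes: Iterable[Tuple[str, str, str, bool]]) -> int:
        """Append (entity, attribute, value, added) changes; return the tx number."""
        self._tx += 1
        for entity, attribute, value, added in changes:
            self._log.append(Entry(entity, attribute, value, self._tx, added))
        return self._tx

    def as_of(self, tx: int) -> Set[Tuple[str, str, str]]:
        """Replay the log up to and including `tx` and return the facts that hold."""
        current: Set[Tuple[str, str, str]] = set()
        for entry in self._log:
            if entry.tx > tx:
                break  # the log is ordered by tx, so we can stop here
            key = (entry.entity, entry.attribute, entry.value)
            if entry.added:
                current.add(key)
            else:
                current.discard(key)
        return current

    def history(self) -> List[Entry]:
        """Every entry ever written, including retractions."""
        return list(self._log)


log = FactLog()
t1 = log.transact([("Jenny", "dating", "Tom", True)])
t2 = log.transact([("Jenny", "dating", "Tom", False), ("Jenny", "dating", "Mike", True)])
```

Here log.as_of(t1) still answers with Tom even after the second transaction, which is the point: the contradiction is resolved by the query time, not by deleting the old fact.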

zcw100 · on Feb 5, 2023

It's a combination of reification and bitemporal modeling.

jaygray0919 · on Feb 1, 2023

Correctomundo. See RDF-Star for progress about state-in-time. During summer 2022 there was extensive discussion/consideration in the W3C working group about different state-conditions.

visualphoenix · on Feb 1, 2023

Cool story! Feeding context back into the 0 shot is the hotness. I’ve had a lot of success with that.

Curious what other (easier) ways you found to accomplish the same effect?

groestl · on Feb 1, 2023

> programmatically turning relevant sections back into text

I can't help but think, is this the voice in our heads?

varunshenoy · on Feb 1, 2023

Hey everyone, author of the repo here.

Never expected to see this near the top of HN, but here we are! Super cool to see so much excitement around my weekend hack. Happy to answer any questions on the project.

I posted a couple demo videos on Twitter, in case anyone is interested: https://twitter.com/varunshenoy_/status/1620511932930490372?...

varunshenoy · on Feb 1, 2023

If anyone wants to play around without having to set everything up:

https://graphgpt.vercel.app/

Just bring your own OpenAI API Key.

cloudking · on Feb 1, 2023

Nice work and cool example of a one-shot prompt https://github.com/varunshenoy/GraphGPT/blob/main/public/pro...

thomasahle · on Feb 1, 2023

I remember writing chatbots just 5 years ago; a major challenge was how to infer structured output using a neural network.

Now you just use text output to generate raw json and parse that. Crazy times.
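
A small sketch of that "generate raw JSON and parse it" step. The output shape is an assumption (a JSON array of [subject, relation, object] string triples, possibly wrapped in extra prose); it is not taken from GraphGPT's actual prompt, and parse_triples is a hypothetical helper.

```python
import json
from typing import List, Tuple


def parse_triples(model_output: str) -> List[Tuple[str, str, str]]:
    """Extract a JSON array of 3-string lists from free-form model output."""
    # Models often wrap the JSON in prose, so take the outermost [...] span.
    start = model_output.find("[")
    end = model_output.rfind("]")
    if start == -1 or end < start:
        raise ValueError("no JSON array found in model output")
    data = json.loads(model_output[start:end + 1])  # raises ValueError on bad JSON

    triples = []
    for item in data:
        # Silently skip entries that do not look like [subject, relation, object].
        if (
            isinstance(item, list)
            and len(item) == 3
            and all(isinstance(part, str) for part in item)
        ):
            triples.append((item[0], item[1], item[2]))
    return triples


raw = (
    "Sure, here is the graph:\n"
    '[["Newman", "enemy of", "Jerry"], ["Newman", "neighbour of", "Kramer"], ["bad"]]'
)
triples = parse_triples(raw)
```

Skipping malformed entries instead of failing is a design choice for the sketch; a stricter caller might prefer to retry the request.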

axiom92 · on Feb 1, 2023

It has been possible to generate impressive graphs from text since GPT-2. Though you need a few tricks to make it work.

Here's an example (my work): https://aclanthology.org/2021.naacl-main.67.pdf

TLDR of the input/output: https://madaan.github.io/res/tldr/graph_gen_tldr.jpg

Some work from AllenAI: https://proscript.allenai.org/

teruakohatu · on Feb 1, 2023

Very impressive.

joaomacp · on Feb 1, 2023

"Newman is Jerry's enemy. He lives in the same building as Jerry and Kramer"

I was expecting this would make Newman, Jerry and Kramer all neighbours of each other, but it only did it for Newman and Kramer.

varunshenoy · on Feb 1, 2023

Agreed. Definitely a drawback of this technique — you might not get the exact specificity you want.

In general, GraphGPT tends to be very conservative in adding nodes/relationships. Not sure why, but probably deserves more investigation.

machiaweliczny · on Feb 1, 2023

Someone needs to figure out how to create a triplet store (I guess that's Google Knowledge Graph) using LLMs and then use that RETRO-style. It would be cool if it didn't add facts that aren't consistent with current knowledge. I guess that's the way to AGI. So basically, finally a connection between expert systems, natural language and reasoning; add usage of Python and boom. I would also throw a Tsetlin machine into the mix somehow so we could interpret stuff.

eurasiantiger · on Feb 1, 2023

This will be a gamechanger for Domain-Driven Design.

t0mk · on Feb 3, 2023

If I had credit on OpenAI, I'd paste in some Bob Dylan song, maybe Tangled Up in Blue. Would like to see what that graph would look like!

redgetan · on Feb 1, 2023

Looks nice! What are the potential use cases for this, if I may ask?

MirelesJ · on Feb 1, 2023

Extract graph description

Does anyone know how to extract the nodes and links (edges and vertices) in text (JSON perhaps) or tabular form to input into other systems (like, say, Neo4j)?

guskel · on Feb 1, 2023

I wonder if you could build an OWL 2 graph with this.

rkuodys · on Feb 1, 2023

Looks really great. I wonder about applying such a solution to documentation: a graph as documentation for the end user.

brianjking · on Feb 1, 2023

I saw this on twitter, really great work! Thanks for sharing!

ayejaytwo · on Feb 1, 2023

Can someone do this for the cosmere?

captn3m0 · on Feb 1, 2023

I get an OpenAI API key error.

varunshenoy · on Feb 1, 2023

We've been messing with a Vercel deployment so you might've seen that :)

https://graphgpt.vercel.app/

It's not quite battle tested, but think I gotta sleep and take a look at it tomorrow.

qup · on Feb 1, 2023

you need to sign up for your own key

LarsDu88 · on Feb 1, 2023

Wow, this is incredible

RHJ · on Feb 1, 2023

K-Drama from Netflix

jackson1442 · on Feb 1, 2023

unrelated: what browser is that in the screenshot?

egrefen · on Feb 1, 2023

Looks like the Arc Browser, in developer mode, with the left sidebar hidden.

varunshenoy · on Feb 1, 2023

Yup, this is correct. Huge fan of Arc!

tsaitoh · on Feb 3, 2023

Acute myeloid leukemia (AML) is a heterogeneous hematologic malignancy characterized by expansion of myeloid blasts that fail to differentiate normally. This leads to hematopoietic failure resulting in granulocytopenia, thrombocytopenia, or anemia. (Leuk Res 2021 Lancet 2006 NEJM 1999). Chromosomal abnormalities and genetic mutations in AML cells are associated with aberrant proliferation and/or blockade of normal differentiation of hematopoietic cells. (J Mol Med (Berl). 2020)

DNA damage repair mechanisms are essential for maintaining genomic integrity, and abnormalities in DNA repair genes may increase the risk of developing cancer. Recently, we reported the association between 8-oxoG glycosylase1 (OGG1), a key player in the base excision repair (BER) pathway, and the prognosis of AML. In this paper, we further investigate the relationship between BER pathway-related genes and the pathogenesis of AML. Genomic DNA is constantly damaged by endogenous reactive oxygen species and metabolites, as well as by various environmental agents such as ionizing radiation, ultraviolet light, and chemical compounds. The BER pathway is responsible for repairing most endogenous small DNA damage, including deamination, depurination, alkylation, and oxidative damage, which occurs approximately 30,000 times per cell per day. (Nature 1993. Cancer Lett 2012.) Four proteins are involved in BER pathway: DNA glycosylases, AP endonuclease or AP DNA lyase, DNA polymerase, and DNA ligase. In this study, we focused on four BER genes: APEX1, MUTYH, OGG1, and XRCC1. APEX1 has multiple functions and acts as an AP endonuclease in BER; MUTYH and OGG1 are DNA glycosylases; XRCC1 is a scaffold protein that binds DNA polymerase and DNA ligase (Cell. Mol. Life Sci. 66 (2009)).

Numerous studies have reported the association between functional BER gene polymorphisms and the risk of various tumors, including lung, digestive system, bladder, and breast cancers. (Med Oncol. 2015, Mol Biol Rep 2014, PLoS One 2013, Diagn Pathol 2012). Previous studies have also reported associations between XRCC1 polymorphisms and AML risk. However, few studies have comprehensively investigated the impact of BER polymorphisms on the pathogenesis of AML. (Mol Carcinog. 2017, Blood 2007, Int J Cancer 2011). In this study, we focused on six BER polymorphisms and their impact on the development and clinical features of AML: APEX1 -656T>G (rs1760944), APEX1 D148E (rs1130409), MUTYH Q324H (rs3219489), OGG1 S326C (rs1052133), XRCC1 R194W (rs1799782), and XRCC1 R399Q (rs25487).

In the present study, we found an association between the risk of developing AML and APEX1 polymorphisms. Therefore, we further investigated the association between APEX1 and AML pathogenesis, considering that APEX1 has a positive influence on AML. APEX1, also known as redox factor-1 (Ref-1), is a multifunctional protein with both DNA repair and transcriptional regulatory activities (Antioxid Redox Signal. 2009, Antioxid Redox Signal. 2014). APEX1 is frequently upregulated in various tumors and its expression is associated with clinical stage and poor prognosis in several tumors, including lung cancer, breast cancer, hepatocellular carcinoma, bladder cancer and multiple myeloma (Lung Cancer 2009, Biochem Biophys Res Commun 2012, Clin Cancer Res 2005, Mol Med 2007, Clin Lymphoma Myeloma Leuk 2010). In AML, Vascotto et al. and López et al. reported the interaction of APEX1 and NPM1, which are frequently mutated in AML cells. However, little is known about the role of APEX1 in the pathogenesis of AML.

The aim of this study was to investigate the impact of BER polymorphisms on susceptibility, prognosis, and clinical features in AML patients. In addition to the analysis of these polymorphisms, the impact of APEX1, a multifunctional protein, on the pathogenesis of AML was also investigated. Our study was divided into two main parts: the analysis of BER polymorphisms and the investigation of APEX1 expression and its impact on knockdown.

RHJ · on Feb 1, 2023

K-drama from Neflix

RHJ · on Feb 1, 2023

K-dramas from Netflix

RHJ · on Feb 1, 2023

Drama