Catastrophic sabotage as a major threat model for human-level AI systems (alignmentforum.org)
5 points by speckx 21 days ago

Would catching AIs trying to escape convince AI devs to slow down or undeploy? (alignmentforum.org)
2 points by rntn 79 days ago

AGI Safety and Alignment at Google DeepMind: A Summary of Recent Work (alignmentforum.org)
1 point by sebg 86 days ago

Mysteries of mode collapse – AI Alignment Forum (2022) (alignmentforum.org)
1 point by Bluestein 4 months ago

Opinionated Annotated List of Favourite Mechanistic Interpretability Papers v2 (alignmentforum.org)
2 points by thunderbong 4 months ago

Transformers Represent Belief State Geometry in Their Residual Stream (alignmentforum.org)
3 points by HR01 6 months ago

LLMs for Alignment Research: a safety priority? (alignmentforum.org)
1 point by rntn 7 months ago

Modern Transformers Are AGI, and Human-Level (alignmentforum.org)
2 points by rntn 7 months ago

Larger language models may disappoint you [or, an eternally unfinished draft] (alignmentforum.org)
2 points by behnamoh 10 months ago

AGI safety from first principles: Superintelligence (alignmentforum.org)
2 points by warkanlock 10 months ago

Anthropic Fall 2023 Debate Progress Update (alignmentforum.org)
2 points by EvgeniyZh 11 months ago

When do "brains beat brawn" in chess? An experiment (alignmentforum.org)
124 points by andrewljohnson 12 months ago | 79 comments

Critique of some recent philosophy of LLMs' minds (alignmentforum.org)
2 points by behnamoh on Oct 9, 2023

Mesa-Optimization (alignmentforum.org)
1 point by reqo on Sept 5, 2023

Glitch Tokens (alignmentforum.org)
1 point by peter_d_sherman on June 8, 2023

The Unsolved Technical Alignment Problem in LeCun's A Path Towards AGI (alignmentforum.org)
4 points by sandinmyjoints on June 6, 2023

A Mechanistic Interpretability Analysis of Grokking (alignmentforum.org)
202 points by og_kalu on May 30, 2023 | 54 comments

AI Will Not Want to Self-Improve (alignmentforum.org)
3 points by behnamoh on May 22, 2023 | 3 comments

GPTs are Predictors, not Imitators or Simulators (alignmentforum.org)
2 points by og_kalu on April 22, 2023

Concrete Open Problems in Mechanistic Interpretability (alignmentforum.org)
1 point by raviparikh on April 11, 2023

Imitation Learning from Language Feedback (alignmentforum.org)
2 points by tim_sw on April 5, 2023

Othello-GPT Has a Linear Emergent World Representation (alignmentforum.org)
2 points by todsacerdoti on April 1, 2023

Glitch Tokens in GPT (SolidGoldMagikarp) (alignmentforum.org)
1 point by gwd on March 8, 2023

Data, not size, is the current active constraint on language model performance (alignmentforum.org)
4 points by satvikpendem on Dec 22, 2022

Central AI alignment problem: capabilities generalization and sharp left turn (alignmentforum.org)
1 point by kvee on Nov 8, 2022

Some Lessons Learned from Studying Indirect Object Identification in GPT-2 Small (alignmentforum.org)
2 points by g0xA52A2A on Nov 2, 2022

A Mechanistic Interpretability Analysis of Grokking (alignmentforum.org)
1 point by apetresc on Sept 20, 2022

OpenAI: Common Misconceptions (alignmentforum.org)
1 point by O__________O on Sept 20, 2022 | 2 comments

Gödel, Escher, Bach: an in-depth explainer (alignmentforum.org)
430 points by behnamoh on Sept 13, 2022 | 248 comments

A Mechanistic Interpretability Analysis of Grokking (alignmentforum.org)
1 point by caprock on Aug 24, 2022