It also has requirements on the box tree structure and ordering, and on which boxes are required or optional. On top of that there are per-codec mappings (see https://mp4ra.org/) that add further requirements on boxes and on how the samples should be encoded.
For example, PNG in MP4 is mapped by having an stsd (sample description) box containing an mp4v entry, which in turn should include an esds (MPEG elementary stream descriptor) box whose decoder configuration states that the stream type is video and the object type is PNG.
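To make the box tree idea concrete, here is a minimal sketch of walking the boxes in an ISO-BMFF file. It assumes plain 32-bit box sizes (real files also use size==1 with a 64-bit largesize, and size==0 meaning "to end of file"), and the list of container boxes is just a small illustrative subset, not exhaustive:

```python
import struct

# Container boxes whose payload is itself a sequence of boxes
# (illustrative subset only; the spec defines many more).
CONTAINERS = {b"moov", b"trak", b"mdia", b"minf", b"stbl"}

def parse_boxes(data, depth=0):
    """Yield (depth, box_type, size) for each box in the byte stream.

    Sketch: each box starts with a 4-byte big-endian size (including
    the 8-byte header) followed by a 4-byte type; container boxes are
    recursed into. 64-bit sizes and uuid boxes are not handled.
    """
    offset = 0
    while offset + 8 <= len(data):
        size, btype = struct.unpack_from(">I4s", data, offset)
        if size < 8:
            break  # malformed for this simplified sketch; bail out
        yield depth, btype.decode("latin-1"), size
        if btype in CONTAINERS:
            payload = data[offset + 8:offset + size]
            yield from parse_boxes(payload, depth + 1)
        offset += size
```

Running this over a real file shows the moov/trak/mdia/.../stbl/stsd nesting described above, with the codec-specific sample entry sitting inside stsd.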
Since most of us associate mp4 with moving images: what exactly is usually wrapped in the mp4 container, and is there a good introduction to the topic available?
If you want to go deep, ISO 14496-12 is probably what you're looking for. You can either pay ISO to read it, or hypothetically you can google "filetype:pdf ISO-14496-12". For how things actually work in practice, https://github.com/FFmpeg/FFmpeg/blob/master/libavformat/mov... is a good resource.
Generally a video stream (usually H.264, but it can be anything), one or more audio streams (typically AAC or MP3), and a timestamp index that keeps them aligned and makes seeking easier.
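That "timestamp index" is mostly the sample tables in each track, e.g. the stts (decoding time-to-sample) box, which stores run-length-encoded (sample_count, sample_delta) pairs. A small sketch of expanding one into per-sample times (`stts_to_timestamps` is a hypothetical helper, not a real library API):

```python
def stts_to_timestamps(entries, timescale):
    """Expand stts-style run-length entries into per-sample decode times.

    entries:   list of (sample_count, sample_delta) pairs, as stored in
               an stts box.
    timescale: ticks per second, from the track's mdhd box.
    Returns decode times in seconds, one per sample.
    """
    t = 0
    times = []
    for count, delta in entries:
        for _ in range(count):
            times.append(t / timescale)
            t += delta  # deltas accumulate into decode timestamps
    return times

# A constant-frame-rate track is a single run: e.g. 3 samples,
# 512 ticks apart, at a 15360 Hz timescale is 30 fps.
print(stts_to_timestamps([(3, 512)], 15360))
```

Composition (presentation) times additionally apply ctts offsets on top of these decode times, which is how B-frame reordering is expressed.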
It's not especially easy to find good documentation; people generally don't write software that touches it directly, they use the OS media library or ffmpeg instead.
It could easily contain a video consisting of a series of images in PNG or other formats. For example, Motion JPEG files are just a series of JPEG images, and Motion JPEG used to be a standard capture and "intermediate master" video interchange format. https://en.wikipedia.org/wiki/Motion_JPEG
fq https://github.com/wader/fq has mp4 support and is quite visual, though it's a CLI tool (for now). It has a REPL and a query language to poke around with. Disclaimer: I'm the author.
Does anyone have a longer overview of the different components and how they interact (tracks/streams/...), and of the different things you can put in there? Something a little more in-depth that is not a full specification, I guess.