Ask HN: Anyone have any cool Open Source Projects and looking for contributors?
15 points by 3a2d29 on July 16, 2022 | 14 comments
Just thought it'd be nice to have a similar "who's hiring" type thread.



https://concise-encoding.org/ is looking for help!

I'm planning to release v1 later this year, and there are still a number of things to finish:

- Finish upgrading the portable testing rig (whereby the tests are defined in CTE format so that they can be run against any implementation).

- Bring the Antlr grammar files up to date and make sure they're as easy as possible to build CTE parsers from.

- Add schema validation support to https://github.com/kstenerud/enctool for Concise Encoding documents (using https://cuelang.org/)

- Critiques of the format itself (passages that are unclear or don't make sense, features that shouldn't be there or need more work, etc.)

- Implementations in other languages & platforms (CBE is more important to start because one can always use enctool to convert between CBE and CTE).


It is an interesting idea (there are some good things in there), but I think there are some problems with it:

- It uses Unicode.

- It seems to be rather complicated, more than it should be.

- I think it would be better if comments were not considered part of the structure.

- I think that astronomical year numbering might be better, since then there will be a year zero.

- I am not sure about the media encoding.

(However, I also dislike XML and JSON (they have the flaw of using Unicode, and other problems too; Concise Encoding fixes some of them, and adds problems of its own). I have had some ideas for my own designs too, though. I think that the PostScript object format is not too bad, although it also has problems (e.g. the binary format cannot use numbers longer than 32 bits, the binary format does not have dictionaries (except as a non-standard extension which is deprecated anyway), etc).)


Hi, thanks for the comments! I've agonized to excess over so many of the design decisions, but here is the reasoning for the concerns you've raised:

There is no avoiding Unicode since that is what the world has settled upon. Anything else would limit international support.

As soon as comments are allowed in a text format, they become part of the structure, and need rules for where they can and cannot occur. I've tried to make them as forgiving as possible (basically they can be placed between any two tokens). I think the main difference is that most formats only imply rather than specify where they're allowed and not allowed (for example, you can't put a comment in the middle of a number: 1234/blah/5678 does not represent 12345678).
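
Here is a minimal Python sketch of that idea, using a hypothetical slash-delimited comment syntax matching the 1234/blah/5678 example above (not the real CTE grammar): comments are tokens in their own right, so they can sit between any two tokens but never inside one.

    import re

    # Toy grammar: numbers, slash-delimited comments, whitespace.
    TOKEN = re.compile(r"(?P<number>\d+)|(?P<comment>/[^/]*/)|(?P<ws>\s+)")

    def tokens(text):
        pos = 0
        while pos < len(text):
            m = TOKEN.match(text, pos)
            if m is None:
                raise ValueError(f"bad input at offset {pos}")
            pos = m.end()
            if m.lastgroup == "number":  # comments and whitespace are dropped
                yield m.group()

    # "1234/blah/5678" lexes as two numbers with a comment between them,
    # never as the single number 12345678:
    print(list(tokens("1234/blah/5678")))  # ['1234', '5678']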

Astronomical year numbering is essential in internal runtime representations, but terrible for anything human-facing (which is why standard library date APIs present them in era form). I've observed time and time again people making off-by-one mistakes trying to input BC dates in ISO 8601 (which uses astronomical year numbering); it's just not user-friendly, and doesn't add anything useful to an interchange format since a computer can make the adjustment automatically and without error every time it converts to/from its internal representation.

Media is useful for embedding objects that the receiving side will just know how to handle. Media types were added to email and HTTP and various other formats so that one could embed foreign objects into documents in a consistent, compatible way that the operating system could automatically display. For example, an identification card document could include a picture of the person, but if it were just a series of bytes, a Concise Encoding reader wouldn't know what to do with it; only the receiving application would know. However, as a media object "image/jpeg", any CE reader could display the image inline when viewing the document.
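
A rough Python sketch of why the type tag matters (the handler table is hypothetical): a generic reader dispatches on the media type string and falls back to a hex dump for bytes it doesn't recognize.

    # Hypothetical handler table: the media type tag, not the bytes themselves,
    # tells a generic reader what to do with an embedded object.
    HANDLERS = {
        "image/jpeg": lambda data: print(f"render {len(data)}-byte JPEG inline"),
        "text/plain": lambda data: print(data.decode("utf-8", "replace")),
    }

    def show(media_type, data):
        handler = HANDLERS.get(media_type)
        if handler is None:
            # Untagged or unknown bytes: all we can do is dump them.
            print(f"unknown type {media_type!r}; hex dump:", data[:16].hex(" "))
        else:
            handler(data)

    show("image/jpeg", b"\xff\xd8\xff\xe0" + b"\x00" * 96)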

I think the format looks complicated because I've been explicit to the point of paranoia about codec behavior in order to ensure that no two implementations have subtle behavioral differences (which would open attack vectors). Internally, all of the types follow 4-5 data layout strategies to keep the binary codec size small. The text format will likely see 2% of usage in the real world, so for it my primary concerns are user friendliness and compatibility with the binary format.


I still consider them to be problems, although Concise Encoding has advantages in addition to its disadvantages.

> There is no avoiding Unicode since that is what the world has settled upon. Anything else would limit international support.

Actually, that is not true. I believe that avoiding Unicode and treating strings as sequences of bytes (without requiring any specific character encoding) will have better international support, and better support for other programs. Some character encodings cannot be effectively converted to/from Unicode, or will have wrong character properties (or other problems such as Han unification, etc). Some programs might otherwise mistakenly use the Unicode string type for things that shouldn't be, and I see this problem in more modern programs than I can count (assuming that environment variables, file names, command-line arguments, all locales, etc. are UTF-8; that is VERY WRONG and you should fix your program if you do that). I do not wish to compound that problem, so all of my own designs avoid Unicode as much as I can. This often means inventing my own file formats (meaning many existing programs cannot deal with them), or using existing file formats in an unintended way (which can also confuse existing programs trying to deal with them). Even one internal thing in Concise Encoding assumes Unicode text where it shouldn't: keys. This isn't very good if you want to store non-Unicode keys.
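
As a quick Python illustration of the pitfall (assuming a POSIX system with a UTF-8 locale): file names are byte sequences with no guaranteed encoding, and the str type can only carry them via surrogate escapes.

    import os

    raw = b"caf\xe9.txt"             # Latin-1 bytes; not valid UTF-8

    # Python smuggles the undecodable byte through str as a surrogate escape:
    name = os.fsdecode(raw)
    print(ascii(name))               # 'caf\udce9.txt' -- not real Unicode text
    print(os.fsencode(name) == raw)  # True: the bytes round-trip unchanged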

I know, because I have dealt with character encoding and have come to the conclusion that non-Unicode is better for international text. There are also problems with assuming that something is text even though it isn't text in any character encoding but is merely a null-terminated sequence of bytes. In some of my own formats that need to use international text and need to know the character encoding within the file format (which is not always necessary, but sometimes it is), the encoding is specified as a 23-bit code page number (an extension of IBM's 16-bit code page numbers), where zero means the data is not interpreted as text in any encoding.
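
As a sketch of how such a header could be packed (this layout is my own guess for illustration, not the actual format): the low 23 bits of a 4-byte prefix carry the code page number, and zero marks uninterpreted bytes.

    import struct

    CP_MASK = (1 << 23) - 1  # 23-bit code page number; 0 = not text

    def pack_tagged(code_page: int, data: bytes) -> bytes:
        assert 0 <= code_page <= CP_MASK
        return struct.pack("<I", code_page) + data

    def unpack_tagged(blob: bytes):
        (header,) = struct.unpack_from("<I", blob)
        return header & CP_MASK, blob[4:]

    print(unpack_tagged(pack_tagged(437, b"hello")))   # IBM code page 437
    print(unpack_tagged(pack_tagged(0, b"\x00\x01")))  # 0 -> opaque bytes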

> As soon as comments are allowed in a text format, they become part of the structure, and need rules for where they can and cannot occur.

In text formats, yes, but not in the general structure, which is independent of the file format, I think.

> Astronomical year numbering is essential in internal runtime representations, but terrible for anything human-facing (which is why standard library date APIs present them in era form). I've observed time and time again people making off-by-one mistakes trying to input BC dates in ISO 8601 (which uses astronomical year numbering); it's just not user-friendly

I disagree that it isn't user-friendly; some people will expect astronomical year numbering. One alternative might be that the text format allows year numbers to optionally specify "BC", "BCE", "AD", or "CE" (where "BC" is equivalent to "BCE" and "AD" is equivalent to "CE"); if none of these is specified then astronomical year numbering is used, and when converting to the binary format they are converted to astronomical year numbering.
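
A small Python sketch of that parsing rule (the function name is mine, for illustration): era suffixes are folded into astronomical numbering, and a bare number passes through unchanged.

    def parse_year(text: str) -> int:
        # Suggested rule: optional era suffix; BC == BCE and AD == CE.
        # N BC maps to astronomical year 1 - N (so 1 BC is year 0).
        for suffix in ("BCE", "BC"):
            if text.endswith(suffix):
                return 1 - int(text[: -len(suffix)].strip())
        for suffix in ("CE", "AD"):
            if text.endswith(suffix):
                return int(text[: -len(suffix)].strip())
        return int(text)  # no suffix: already astronomical numbering

    assert parse_year("44 BC") == -43
    assert parse_year("1 BC") == 0
    assert parse_year("2022") == 2022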

What I have done in some of my own designs is use a UNIX timestamp (or, alternatively, a TRON timestamp, which has a different epoch and may sometimes make variable-length encodings shorter), where the number of nanoseconds is allowed to exceed one billion in case of leap seconds.
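
For example, a sketch of that representation in Python (field names are mine): a positive leap second rides on the preceding second by letting the nanosecond field pass one billion.

    from dataclasses import dataclass

    @dataclass
    class Timestamp:
        seconds: int  # whole seconds since the UNIX epoch
        nanos: int    # 0 <= nanos < 2_000_000_000 to absorb leap seconds

    # The real leap second 2016-12-31T23:59:60.5Z rides on the last normal
    # second of the day (23:59:59Z), with nanos pushed past one billion:
    leap = Timestamp(seconds=1483228799, nanos=1_500_000_000)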

> Media is useful for embedding objects that the receiving side will just know how to handle. Media types were added to email and HTTP and various other formats so that one could embed foreign objects into documents in a consistent, compatible way that the operating system could automatically display.

While it is useful for that purpose, it is not always sufficient. There are a few different ways of identifying file formats, including MIME and UTI, and my own "unordered-labels"-based format. MIME is the most common one (and is the one you used here), but it cannot always identify files accurately, especially if they can be interpreted as more than one format, etc. (The use of + helps a little bit, but that is just an add-on to a badly designed format.) Using MIME does have the advantage that other programs are more likely to support it, though.

My own "unordered-labels"-based format has some advantages that I had considered. For example, you can have "text[1209]:plain:markdown+commonmark:document", "text[367]:html:document", "epub:zip:document", "image:jpeg", etc. You can also identify audio/video codecs, e.g. "ogg<vorbis,theora>:audio:video". A format such as PostScript can be text or binary, and can be treated as a program or as a document, so you can specify this. And then, there are also polyglot files that can be interpreted as many kinds of formats. (The plus sign is an abbreviation, so "markdown+commonmark" is the same as "markdown:markdown.commonmark". Labels can be in any order, so "image:jpeg" is equivalent to "jpeg:image".)
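
A minimal Python sketch of comparing such labels, based only on the rules stated above (it ignores the [] and <> parameter syntax): expand the plus abbreviation, then compare as unordered sets.

    def expand(label):
        # "a+b" abbreviates "a:a.b"; multiple plus signs are not handled here.
        if "+" in label:
            head, tail = label.split("+", 1)
            return [head, f"{head}.{tail}"]
        return [label]

    def normalize(fmt):
        labels = set()
        for part in fmt.split(":"):
            labels.update(expand(part))
        return frozenset(labels)

    assert normalize("markdown+commonmark") == normalize("markdown:markdown.commonmark")
    assert normalize("image:jpeg") == normalize("jpeg:image")  # order-insensitive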

> if it were just a series of bytes, a Concise Encoding reader wouldn't know what to do with it

Yes, although a user could still display the hex dump, or copy out the bytes and display them using an external program, etc.

> Internally, all of the types follow 4-5 data layout strategies to keep the binary codec size small. The text format will likely see 2% of usage in the real world

It is a good idea, although all of the different requirements of the different types and their relations can make it messy, and there is then the decision of which ones to add and which to leave out. For example, someone wanted to add a currency type, but I think that it would be excessive (and problematic in other ways, too). But you cannot be sure how much usage the text format will see compared with the binary format, I think.


My open source project is a puzzle game engine called Free Hero Mesh. I am looking for contributors, whatever kind of contribution you might like to make; please post on the NNTP and/or IRC if you are interested.

The IRC channel is #freeheromesh on Libera; there is also a Matrix bridge (although I have not tested it). The newsgroup is un2.org.zzo38computer.soft.freeheromesh and the NNTP server is zzo38computer.org. There is also a web page http://zzo38computer.org/freeheromesh/ with some explanations.

I do not use GitHub for my own projects; the repository is Fossil, and is mirrored on Chisel. (In the future I might add a mirror on GitHub too, but currently there isn't one.)


Yeah, I've been working on a Reddit/HN clone called Comment Castles [0] for a couple of years.

[0] https://github.com/ferg1e/comment-castles


There are new forks just starting up for the beloved Cutefish OS and its "Cutefish DE" desktop environment.

https://www.reddit.com/r/linux/comments/vwd0m8/i_am_about_to...

https://www.debugpoint.com/cutefish-os-development-halts/


Are you interested in accounting? There is GNUKhata (https://gitlab.com/gnukhata/gkapp), which is looking for contributors to localize it to various countries around the world. It is currently India-only, but if anyone is willing to help, join up.



ContainerSSH is always happy to have contributors, but it's a bit of a hard project to get into. :) https://containerssh.io


RedwoodJS is always looking for contributors!


What programming language are you most capable of writing at this moment?


Not OP, but any suggestions for Python projects?


https://github.com/hofstadter-io/hof

CUE, Go, code gen & data modeling

Lots of good first issues ready for your contribution

There's also the opportunity to write in any tech/language to build examples, demos, or community modules.



