Hi, author of the aho-corasick crate here. Your use of it piqued my interest and caused me to take a closer look.

I believe your use of `unsafe` on this line is unsound: https://gist.github.com/daaku/58557e2545612df8f40b13b66b7d3b...

Namely, there is no guarantee that the bytes between `<page>` and `</page>` will be valid UTF-8. It may be the case that you only run this program with UTF-8 input, in which case, UB is never triggered. But it's worth pointing out here since there is nothing actually stopping your program from hitting UB.
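To make the concern concrete, here is a minimal sketch using only the standard library. The helper name `page_text` and the sample byte strings are hypothetical and are not taken from the gist. It shows the checked conversion rejecting a slice that the unchecked one would silently accept.

```rust
use std::str;

/// Hypothetical helper (not from the gist): converts the bytes found
/// between `<page>` and `</page>` into a `&str`, validating them first.
fn page_text(bytes: &[u8]) -> Result<&str, str::Utf8Error> {
    // `from_utf8` scans the slice and returns an error on invalid UTF-8
    // instead of handing back a `&str` that breaks the type's invariant.
    str::from_utf8(bytes)
}

fn main() {
    // Well-formed input converts without trouble.
    let good: &[u8] = b"<title>Rust</title>";
    assert_eq!(page_text(good).unwrap(), "<title>Rust</title>");

    // 0xFF can never appear in valid UTF-8, so this slice is rejected.
    let bad: &[u8] = &[b'<', 0xFF, b'>'];
    assert!(page_text(bad).is_err());

    // By contrast, `unsafe { str::from_utf8_unchecked(bad) }` would compile
    // and return a `&str` over invalid bytes, which is undefined behavior.
}
```

The validation costs one extra pass over the bytes, and that pass is what the `unsafe` version skips.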

Also, as long as you're bringing in the twoway crate, you might as well use it on lines 43 and 48 since you're just searching for a single needle.




The bytes are assumed to be UTF-8 (I was using the safer `from_utf8` until I confirmed the data was valid UTF-8).

I brought in `twoway` when I couldn't find a way to `rfind` using `aho-corasick`. I'll switch the use over for consistency.
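For anyone following along, this is roughly what a reverse search for a single needle looks like. It is a naive standard-library stand-in, not the Two-Way algorithm that `twoway` implements, and the function name `rfind_bytes_naive` and the sample input are made up for illustration.

```rust
/// Naive stand-in for a reverse single-needle search: returns the start
/// index of the last occurrence of `needle` in `haystack`, if any.
/// Worst case is O(n * m); a dedicated substring searcher does better.
fn rfind_bytes_naive(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    if needle.is_empty() {
        // `windows(0)` panics, so handle the empty needle explicitly.
        return Some(haystack.len());
    }
    // `windows` is double-ended and exact-sized, so `rposition` can scan
    // from the back and still report an index counted from the front.
    haystack.windows(needle.len()).rposition(|w| w == needle)
}

fn main() {
    let data = b"a</page>b</page>c";
    // The last `</page>` starts at byte 9; the first one starts at byte 1.
    assert_eq!(rfind_bytes_naive(data, b"</page>"), Some(9));
    assert_eq!(rfind_bytes_naive(data, b"<page>"), None);
}
```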

Thanks for the quick code review!

PS: Thanks for ripgrep too!


Ah, gotcha. Yeah, I haven't added reverse searching to aho-corasick yet. Ran out of steam.

Either way, my point here is to be a counter-balance. To be fair, you did say, "But with Rust I managed to safely use." But the code you posted is technically unsound. It's not a huge deal if you know you'll always be feeding the program valid UTF-8. But it is worth mentioning here in this HN thread that is specifically comparing the safety properties of competing programming languages. :-)


Correct and fair. Updated the code to remove the safety issue.


Thank you. :-)



