I'm glad to see Remote as a location, but due to the free-form writing in the original posts, there are errors. For example, "Haskell dev at Standard Chartered Bank" is listed under Remote, but the post itself says "Remote work isn’t an option". The post for Button similarly doesn't allow remote, but uses "Remote - no" to convey that.
I've been planning on building some filtering for the Who is Hiring threads, and I've pretty much determined that some degree of manual review will be needed. In the most recent thread, I found a huge number of posts containing "remote" which don't actually allow remote working. "No remote" is fairly common and easy to filter out, but there are any number of variations that you can't anticipate a priori.
> I've pretty much determined that some degree of manual review will be needed
You're spot on with everything. I did a lot of manual review, and the site already filters out "NO REMOTE", "REMOTE no", "Remote not" and "No Remote" entries. I did spot the "Remote work isn’t an option" post, but I decided not to write that kind of completely ad-hoc filtering rule; it's just ugly.
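For anyone curious what that kind of filter might look like, here is a minimal sketch in Python. It is not the site's actual code: the regex and the `allows_remote` name are my own assumptions, covering only the four phrasings listed above, matched case-insensitively.

```python
import re

# Hypothetical pattern for the four negated phrasings mentioned above:
# "NO REMOTE", "No Remote", "REMOTE no", "Remote not".
NEGATED_REMOTE = re.compile(
    r"\bno\s+remote\b|\bremote\s+(?:no|not)\b",
    re.IGNORECASE,
)

# Any mention of the word "remote" at all.
MENTIONS_REMOTE = re.compile(r"\bremote\b", re.IGNORECASE)


def allows_remote(post: str) -> bool:
    """Guess whether a post allows remote work.

    True only if the post mentions "remote" and none of the
    known negated phrasings match.
    """
    if not MENTIONS_REMOTE.search(post):
        return False
    return NEGATED_REMOTE.search(post) is None
```

With this sketch, "REMOTE no" and "No Remote" are rejected, while "Remote work isn't an option" and "Remote - no" still slip through as remote-friendly, which is exactly the kind of miss described in this thread.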
All these strategies are interesting, but I'm afraid we are over-engineering the problem here. The fairly simplistic strategy I'm using now is basically just pattern matching, and so far I've had only 4 misplaced posts out of the 840 for April alone: that is < 0.5%. And it's blazing fast! I can rebuild the entire db in less than 30 seconds.
Given these numbers, I believe pretty much anything more complicated than that would be total overkill... Good food for thought, though!
In my experience with data quality management, manual translation of these edge cases is not pleasant. Yet it can be very valuable. It's a bit like "online learning" in machine learning - each time an error is found, you provide the correct answer. Yes, you might end up with a long array of phrases/regexes to check against. However, it scales just right for the amount of data you have and provides high quality results.
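As a rough illustration of that correction loop, here is a small Python sketch. The `RemoteClassifier` class, its method names, and the starting pattern are all hypothetical; the idea is only that each misclassified post found during review adds one more regex to the list.

```python
import re


class RemoteClassifier:
    """Pattern list that grows as reviewers report misclassified posts."""

    def __init__(self) -> None:
        # Hypothetical starting point: a single common negation.
        self.negations = [re.compile(r"\bno\s+remote\b", re.IGNORECASE)]

    def add_correction(self, pattern: str) -> None:
        """Record a newly discovered 'not remote' phrasing as a regex."""
        self.negations.append(re.compile(pattern, re.IGNORECASE))

    def allows_remote(self, post: str) -> bool:
        """True if the post mentions remote and no known negation matches."""
        if not re.search(r"\bremote\b", post, re.IGNORECASE):
            return False
        return not any(p.search(post) for p in self.negations)


clf = RemoteClassifier()
post = "Haskell dev at a bank. Remote work isn't an option."
before = clf.allows_remote(post)  # misclassified as remote-friendly
# A reviewer spots the error and adds the phrasing (straight or curly apostrophe).
clf.add_correction(r"remote\s+work\s+isn['’]t\s+an\s+option")
after = clf.allows_remote(post)   # now rejected
```

The list only ever grows by one entry per reported error, so the maintenance cost stays proportional to the number of odd phrasings actually seen in the threads.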
A better option would be to require job postings to make location and remote-ability explicit at the top, in a standard format/layout. Quite often I'm Cmd+F-ing through a thread and landing on a ton of "no remote" posts, which is frustrating.