I'm glad to see Remote as a location, but due to the free-form writing in the original posts, there are errors. For example, "Haskell dev at Standard Chartered Bank" is listed under Remote, but the post itself says "Remote work isn’t an option". The post for Button similarly doesn't allow remote, but uses "Remote - no" to convey that.
I've been planning on building some filtering for the Who is Hiring threads, and I've pretty much determined that some degree of manual review will be needed. In the most recent thread, I found a huge number of posts containing "remote" which don't actually allow remote working. "No remote" is fairly common and easy to filter out, but there are any number of variations that you can't anticipate a priori.
> I've pretty much determined that some degree of manual review will be needed
You're spot on with everything. I did a lot of manual review, and the site already filters out "NO REMOTE", "REMOTE no", "Remote not" and "No Remote" entries. I did spot the "Remote work isn’t an option" post, but I decided I'm not going to write that kind of completely ad-hoc filtering rule; it's just ugly.
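For anyone curious, a filter along these lines can be sketched in a few lines of Python. This is only an illustration, not the site's actual code: the function name `allows_remote` and the choice to match case-insensitively are assumptions, and the patterns cover just the four phrases listed above.

```python
import re

# Covers only the four phrases named above: "NO REMOTE" / "No Remote"
# and "REMOTE no" / "Remote not". Case-insensitive matching is an assumption.
NEGATED_REMOTE = re.compile(
    r"\bno\s+remote\b|\bremote\s+(?:no|not)\b",
    re.IGNORECASE,
)
MENTIONS_REMOTE = re.compile(r"\bremote\b", re.IGNORECASE)


def allows_remote(post: str) -> bool:
    """Return True if the post mentions remote and no known negation matches."""
    if not MENTIONS_REMOTE.search(post):
        return False
    return NEGATED_REMOTE.search(post) is None
```

Free-form negations such as "Remote work isn’t an option" or "Remote - no" still slip through a filter like this, which is exactly the limitation discussed above.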
All these strategies are interesting, but I'm afraid we are over-engineering the problem here. The pretty simplistic strategy I'm using now is basically just pattern matching, and so far I had only 4 misplaced posts out of the 840 for April alone: that is < 0.5%. And it's blazing fast! I can rebuild the entire db in less than 30 seconds.
Given these numbers, I believe pretty much everything more complicated than that would be total overkill... Good food for thought though!
In my experience with data quality management, manual translation of these edge cases is not pleasant. Yet it can be very valuable. It's a bit like "online learning" in machine learning - each time an error is found, you provide the correct answer. Yes, you might end up with a long array of phrases/regexes to check against. However, it scales just right for the amount of data you have and provides high quality results.
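One way to picture this is a small table of manual corrections that is consulted before the automatic classifier. The sketch below is hypothetical: the post ids, the name `MANUAL_OVERRIDES`, and the `auto` callable are all made up for illustration.

```python
from typing import Callable, Dict

# Hypothetical manual corrections, keyed by post id (the ids are invented).
# Each time a misclassified post is found, its correct answer is added here.
MANUAL_OVERRIDES: Dict[int, bool] = {
    1001: False,  # e.g. a post that says remote is not an option in free-form text
}


def classify_remote(post_id: int, text: str, auto: Callable[[str], bool]) -> bool:
    """Prefer a recorded manual answer; otherwise fall back to the automatic rule."""
    if post_id in MANUAL_OVERRIDES:
        return MANUAL_OVERRIDES[post_id]
    return auto(text)
```

The automatic rule stays simple, and the override table grows only as fast as errors are actually reported.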
A better option would be to require job postings to make location and remote-ability explicit at the top, in a standard format/layout. Because quite often I'm Cmd+F-ing through a thread and landing on a ton of "no remote" posts, which is frustrating.
The only way to deal with the unstructured nature of the "Who is hiring" posts is to have some sort of schema that can be processed. I also wanted to do something similar (imagine it with dc.js, for example!), but the data is too diverse.
All posts could have a METADATA: compressed_json entry that can be processed by the site and displayed/filtered accordingly. Perhaps it could be built manually at the beginning until it catches up.
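As a rough sketch of what that could look like, assuming "compressed_json" means zlib-compressed JSON encoded as URL-safe base64 on a single line (that encoding, the field names, and both function names are my own assumptions, nothing HN supports today):

```python
import base64
import json
import zlib
from typing import Optional

PREFIX = "METADATA: "


def encode_metadata(fields: dict) -> str:
    """Build the METADATA line a poster would paste into the comment."""
    raw = json.dumps(fields, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return PREFIX + base64.urlsafe_b64encode(zlib.compress(raw)).decode("ascii")


def extract_metadata(post: str) -> Optional[dict]:
    """Return the decoded metadata from the first METADATA line, or None."""
    for line in post.splitlines():
        line = line.strip()
        if line.startswith(PREFIX):
            try:
                blob = base64.urlsafe_b64decode(line[len(PREFIX):])
                return json.loads(zlib.decompress(blob))
            except (ValueError, zlib.error):
                return None  # malformed entry: treat the post as unstructured
    return None
```

A post would then carry its free-form text plus one line produced by `encode_metadata({"location": "Toronto", "remote": False})`, and the site could filter on the decoded fields.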
This is cool, nice to look at other projects analyzing this data. I publish HN Hiring Trends, http://www.ryan-williams.net/hacker-news-hiring-trends/ , that watches the various technology terms being mentioned in the postings.
One thing I changed was to include only top-level comments and no replies/discussion of the posting. Do you handle it similarly?
HN Hiring Trends is really sweet, I didn't know about it. I do include only top level comments; and incidentally, digging through HN's HTML code was... uhm... let's say a bit messy.
Great project, however I've noticed some listings are missing. For example, from April, https://news.ycombinator.com/item?id=9303396 there was a posting from Questrade. It's missing on this website.
Thanks for the feedback! That particular post is missing because the city is written in all caps (TORONTO). I used a bunch of tricks to be able to catch as many cities as possible (say NY and NYC and Manhattan and even NEW YORK all go under "New York City"), but not every city is tested against its "all caps" equivalent atm because I wanted to be able to rebuild the db as fast as possible during the development phase. I should probably fix that now.
Unfortunately no, 1) you can't trim the whitespace if you're matching cities, 2) matching the cities lowercase opens up too many false positives. At the same time, matching every city with its uppercase equivalent doubles the time required to build the db but only adds a tiny handful of posts. That's why (for now) I settled on a tradeoff where I catch the uppercase equivalent for the biggest cities only.
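To make the tradeoff concrete, here is a minimal sketch of case-sensitive alias matching where only the biggest cities also get an all-caps variant. It is not the actual code: the alias table, `BIG_CITIES`, and the function names are hypothetical.

```python
import re
from typing import Dict, Iterable, List, Optional

# Hypothetical alias table; the real list is much longer.
ALIASES: Dict[str, List[str]] = {
    "New York City": ["New York", "NYC", "NY", "Manhattan"],
    "Toronto": ["Toronto"],
}
# Only these cities are also matched in all caps.
BIG_CITIES = {"New York City"}


def build_patterns(aliases: Dict[str, List[str]], big: Iterable[str]) -> Dict[str, "re.Pattern[str]"]:
    big = set(big)
    patterns = {}
    for city, names in aliases.items():
        variants = list(names)
        if city in big:
            variants += [n.upper() for n in names if n.upper() != n]
        # Longest first, so "New York" is tried before "NY".
        variants.sort(key=len, reverse=True)
        joined = "|".join(re.escape(v) for v in variants)
        # Deliberately case-sensitive: lowercase matching gives false positives.
        patterns[city] = re.compile(r"\b(?:" + joined + r")\b")
    return patterns


PATTERNS = build_patterns(ALIASES, BIG_CITIES)


def find_city(text: str) -> Optional[str]:
    """Return the first canonical city whose alias appears in the text."""
    for city, pattern in PATTERNS.items():
        if pattern.search(text):
            return city
    return None
```

With this table, "NEW YORK" is caught but "TORONTO" is not, which is the same kind of miss as the Questrade post; adding "Toronto" to `BIG_CITIES` would fix it at the cost of a slightly larger pattern.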
Love this. I was able to get more useful information about remote working than I could from reading the original Who is Hiring April post. This is probably a good time to suggest changing the format of the Who is Hiring posts. It would be great if the use of a standard form template were encouraged. That would make parsing the data much easier, would make the original post easier to read, and would probably make it easier for companies to create their posts too. Win for all?
I'm surprised nobody else has requested this, but any chance for a state category? If you really want to impress me, perhaps a warm and cold climate section or maybe have to shovel snow vs. unlikely to shovel snow. :P Gotta set my priorities straight…
I actually thought about it, but it's a total mess to parse :( Think of posts like "We're sorry we can't sponsor H1B at the moment, but we might get you a VISA for our London office".
On the other hand, once you're already browsing New York City, good ol' ctrl-f for VISA will probably serve you well enough.
Back when the web looked (to me) like it was going to move to XHTML2 rather than HTML5, data encoding using Microformats looked promising: http://microformats.org/wiki/job-listing. One doesn't hear "semantic web" much nowadays though.
Although I'm sure companies won't complain about the additional publicity, are there any concerns of copyright issues with scraping and republishing the text from other HN posts?
If everyone cared that much about copyright I think no one would ever make anything. There's always a way to sue someone over something, especially copyright.
Yes. You set yourself up as a contractor and handle country specific taxes etc yourself. The book "Remote: Office Not Required"[1] has a lot of additional information, it's a great read.
Is it hard to find remote work in Europe at the moment? Just curious, I haven't had to look, but was under the impression there ought to be a reasonable amount available.
Very cool idea, fun name, practical interface. I like it.
Perhaps add a simple tagging system where users can add tags to hiring posts. That way you don't need to comb through every post and hopefully you crowdsource some helpful taxonomic data.
Useful, but falls short for surrounding areas of Los Angeles like Venice (neighborhood of LA) & Santa Monica (adjacent and much a part of LA). I imagine there are issues like this for other cities and regions too.
As an LA-area resident, I definitely feel like an overall "Los Angeles" category is important. Santa Monica and Venice are LA; leaving them out is like leaving out Palo Alto or Mountain View from a "Silicon Valley" category.
i believe its not its own city. 2 posts for venice should be part of los angeles. you can still have venice its own section if u want, but excluding it from los angeles doesn't make sense.
I realize that you may be using a limited input device, but surely you can afford a few spare grams of pressure for the shift key, or y and o? Anyway, I can see what you mean re: Venice/LA (Venice is not its own city), but as someone who used to work in Santa Monica there was a time when the ONLY places I would consider were SM, Venice, and just maybe El Segundo (the beach bike path did make for an amazing commute). Having WeHo or Northridge jobs mixed in there could be a pain.
having the problem of a few more posts to sift through is much better than the problem of missing something that might have been categorized as "Los Angeles" where the job is actually in venice or santa monica <- happens all the time.
these things are just a starting point - further investigation on the company's site and actual location are always necessary.
I was surprised at how few are hiring in Los Angeles. I'm thinking about going to grad school there. Can anyone in LA comment on the state of your tech economy?
The LA-area tech community has been growing incredibly over the last few years. Lots of early stage startups; some are now maturing, like Dollar Shave Club and Lynda.com.
Ah ha, that's why. I publish whoishiring technology trends every month[1] and was curious about the change. Fortunately the API allows a list of users, making it an easy thing to handle.
I assume that was just a DST issue. The bot posted it an hour later than usual, and the DST transition (for the US) took place between the March and April posts.
Very cool, and interesting to see the results. I immediately looked for a 'trends' feature to see how cities' ranks change over time, or maybe this could be plotted?
I would suggest three buckets: San Francisco, including San Bruno, Millbrae, and Burlingame; Mid-Peninsula, covering San Mateo, Foster City, Belmont, San Carlos, Redwood City, Menlo Park, and Palo Alto; and South Bay, covering Mountain View, Sunnyvale, Santa Clara, San Jose, Cupertino, Campbell, Los Gatos, and Milpitas (and maybe Fremont?).
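Expressed as data, the three buckets would be something like the sketch below. The names `REGIONS` and `region_for` are just for illustration, and Fremont is left out because it is only a "maybe".

```python
from typing import Optional

# The three suggested buckets, taken from the list above.
REGIONS = {
    "San Francisco": ["San Francisco", "San Bruno", "Millbrae", "Burlingame"],
    "Mid-Peninsula": [
        "San Mateo", "Foster City", "Belmont", "San Carlos",
        "Redwood City", "Menlo Park", "Palo Alto",
    ],
    "South Bay": [
        "Mountain View", "Sunnyvale", "Santa Clara", "San Jose",
        "Cupertino", "Campbell", "Los Gatos", "Milpitas",
    ],
}

# Reverse lookup: city name -> bucket name.
CITY_TO_REGION = {
    city: region for region, cities in REGIONS.items() for city in cities
}


def region_for(city: str) -> Optional[str]:
    """Return the bucket for a city, or None if it is not in any bucket."""
    return CITY_TO_REGION.get(city)
```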
Yep, plain Cambridge is "Cambridge, MA", the other one is listed as "Cambridge, UK". In the same fashion, "Venice" is actually "Venice, CA". It hurt a bit but it was the right thing to do ;)
Not really, but if you want to do it, all it takes is replacing the URLs passed to wget (i.e. you want to wget all the "Who's looking for work" pages instead of the "Who is hiring?" ones).
You can easily build your local SQLite database like that. I wrote some more instructions about it in the README.md on GitHub.
I should mention that I and other hiring managers I've talked with are moving away from posting on the "Who is hiring?" post.
It was pretty useful ~6 months ago. But, the amount of spam generated from recruiting and sourcing firms, various startups trying to push their revolutionary new online coding tools, etc. is pretty ridiculous and many of them, especially the SV-area startups, have been quite aggressive (e.g., phone calls and switching to my personal e-mail address after I told them I was not interested).
Posting jobs on twitter has been a far more effective sourcing tool than HN "Who is hiring" has become recently, at least in the free space.
I've had some ok luck with the Who's hiring threads, but, what really bothered me was some of the practices from these companies.
One company, allowing remote work, sent me to do a personality inventory without even talking to me first -- which really bothered me. (They're still posting looking for DevOps and Developers in Indianapolis.)
One company scheduled an introduction phone call on the 25th of the month, and then didn't show up on time and attempted to reschedule on the 15th of the following month. (Apparently, they didn't understand "Hire fast, fire faster.")
Finally, one company wasn't up-front or honest about their salary expectations until after I had spent almost a month in their system -- even taking a week off of work to do one of their "trial weeks" only to discover that they were going to offer me approximately 50% less than what I was making now and that they had a standard 'formula' for salaries...things that, had I known them, would have kept me from wasting their time (and mine) going forward.
Don't get me wrong -- HN has brought me a lot of great things: context, opportunities, viewpoints, and friends. Unfortunately, the "Who is Hiring" has morphed into traditional HR -- where you send a resume and don't hear back anything from anyone, versus the near-immediate feedback that you would once get in 2012.
Maybe have a special tag that you can add to the text for info that you only want karma users of a certain level or higher to be able to see -- ex. (karma>300)[Contact me at my@email.com], or instead of direct numbers, target people that are able to downvote, or have a moderate level of karma. People would post a link to the recruitment page/job description page on their website for all other users, which would hopefully work at deterring spammers from contacting them personally.
I think something like this would help you focus your recruitment efforts on those who have at least contributed to the community in some way, which should filter out people spamming every single email in the thread.
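A minimal sketch of how such a tag could be rendered per viewer, assuming the `(karma>N)[...]` syntax from my example above (the function name and the "[hidden]" placeholder are made up, and HN has no such feature):

```python
import re

# Matches the hypothetical tag syntax: (karma>300)[text shown to high-karma users]
KARMA_TAG = re.compile(r"\(karma>(\d+)\)\[([^\]]*)\]")


def render_for(text: str, viewer_karma: int, placeholder: str = "[hidden]") -> str:
    """Reveal gated spans only to viewers whose karma exceeds the threshold."""
    def reveal(match: "re.Match[str]") -> str:
        threshold = int(match.group(1))
        return match.group(2) if viewer_karma > threshold else placeholder
    return KARMA_TAG.sub(reveal, text)
```

Everyone else would see the placeholder plus whatever public link the poster included in the rest of the text.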
Another idea is to mask emails with a craigslist-like mailing address, which would give the end-user the ability to report an email as spam, and therefore tie that email to the offending party's hacker news account.
Edit: What I mean is that each Hacker News account would see a different email address, so when someone emails that address, it uniquely identifies the account that originally viewed it. So, Spammer A sees Poster B's email address as hn-49384932842@ycombinator.com, and Legitimate Candidate C sees Poster B's address as hn-4494838943842@ycombinator.com. When either one emails that address, Poster B can report the email as spam, and if enough reports accumulate, the HN account sending the spam can be docked karma, dropping it below the threshold required to view further posts.
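Here is a rough sketch of the per-viewer alias idea, assuming the alias is derived with an HMAC over the viewer and poster names and remembered in a lookup table. The class name, the secret, and the alias length are all hypothetical; the domain is just the one from my example.

```python
import hashlib
import hmac
from typing import Dict, Optional, Tuple


class AliasRegistry:
    """Hands out a distinct alias per (viewer, poster) pair and remembers it."""

    def __init__(self, secret: bytes, domain: str = "ycombinator.com") -> None:
        self._secret = secret
        self._domain = domain
        self._by_alias: Dict[str, Tuple[str, str]] = {}

    def alias_for(self, viewer: str, poster: str) -> str:
        # The NUL separator keeps ("ab", "c") distinct from ("a", "bc").
        message = f"{viewer}\x00{poster}".encode("utf-8")
        digest = hmac.new(self._secret, message, hashlib.sha256).hexdigest()[:16]
        alias = f"hn-{digest}@{self._domain}"
        self._by_alias[alias] = (viewer, poster)
        return alias

    def resolve(self, alias: str) -> Optional[Tuple[str, str]]:
        """Return (viewer, poster) for a known alias, else None."""
        return self._by_alias.get(alias)
```

When Poster B reports mail sent to an alias, `resolve` gives the account that was shown that address, which is what the karma penalty would be applied to.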
Nothing. HN is not a job board. Whatever you do to squelch recruiter spam on "Who's Hiring" threads is bound to have unintended consequences. Meanwhile: if the big problem is that recruiters use job posts as spam targets, everyone who posts an ad can come up with their own solution (if it's karma-locked, it can just be "mail your HN username here") and the best one will spread.
I'm open to that in this case, though in general people tend to object to karma requirements.
The Who Is Hiring threads belong to this community. If something needs to be done to protect them for the community, we'll do it. But we'd ideally like to see a consensus emerge.
We should probably discuss this in a separate thread (and probably not today, as I'm about to be traveling). And I feel bad for taking a Show HN further off-topic, so will mark this subthread as such (which lowers it), even though it's obviously an important question.
There is little correlation between the domain of a submission and the amount of points it receives on average. (The exception is the more niche posts by more renowned programmers)
The stories don't have to be popular to generate karma; as long as the domains are, some articles will get karma from other people submitting the same links and from manual upvotes too. Yesterday someone autosubmitted everything a dozen big tech sites published and got 100 - 200 karma without hitting the front page.
Won't they find you on Twitter as well? Even LinkedIn has the same problem, from what I hear from hiring managers, who are becoming increasingly frustrated by the number of recruiters that contact them when they post a job there.