mfrye0's comments

Hey HN. This is a follow-up to a post from earlier this year that drew polarized reactions: https://news.ycombinator.com/item?id=35977057.

Basically, we're addressing the community's feedback and providing general updates on the project. Open to any and all thoughts.


I 2nd this, minus Salesforce being hard to customize. It's probably too customizable.

At a certain stage of a company, sales reps expect Salesforce. So much so that even when we finally caved and got it, I had reps specifically turn off Lightning and stick with Classic mode. It's like Bloomberg - it's what they know and can move fast in.

As much as it pains me to say this, it may make sense to have an option that can mirror the Salesforce UI vs reinventing the wheel. Or maybe even an integration / escape hatch to migrate everything off of Salesforce to this. Basically, if you want reps to adopt it, make it as close as possible to what they know.


In terms of UI, Notion has been our main source of inspiration. The bucket of salespeople who enjoy the Notion UI is definitely smaller than the bucket of those who know how to use Salesforce, but I feel like it might be growing faster. It's a good point that this will become harder as we move to larger companies with a higher share of experienced sales reps who don't want to change.


I've been tempted to do something similar over the years. I've worked with Salesforce quite a bit, and I dislike almost everything about it.

I settled on doing open source company data: https://news.ycombinator.com/item?id=35977057. I think an open source CRM with all the world's companies could be pretty killer.

I created an account and will play around with it. Pretty cool so far.


Nice! Agree this could be great. Is this hosted as an API somewhere?


Apologies. We've had a huge problem with bots, so we have a number of security measures in place. The Google Captcha component is probably flagging you as a bot.

Try disabling your VPN if you're on one, or use a different IP.


Oh the irony.


Hey dang. That's fair.

To be frank, we went back and forth on this, but in the end, thought the original title was ok. The only other large "open source" dataset we could find had 9M records. So after researching, we came to the conclusion that the title sounded clickbaity, but was likely accurate.

And yes on "open source". We fully intend for this to be "open source" in the full meaning of the word, but it seems we were moving too fast and missed adding the formal license.


Fair points. I agree the wording could be better.

No, the source code is not available. This dataset is a subset of the raw data our system collects. Our final product made available via the API does a variety of processing steps on the raw data (dedupes, joins, ML predictions, etc). The final, processed data is the piece that is proprietary / subject to the terms.

We will update the site terms to reference this dataset, as we aim to continue releasing an updated version each quarter. I'll have to double check with the lawyers, but it will most likely be MIT licensed.


What exactly does "open-source" mean to you? Because it sounds like there's absolutely nothing open about this other than a small scraping of LinkedIn data (which you should probably ask your lawyers if you're even allowed to license out).

The wording isn't just misleading, it's a complete lie.

EDIT: Never mind, the title has been updated to accurately reflect this being a small data dump.


How did you compute this? I just did another check to verify (wc -l) and it's coming to 15,980,531.


I used wc -l at first, but I've just imported into SQLite and the count(*) is 15,263,246 - updated my previous comment (which had said 15,263,251).

I downloaded the CSV and ran:

    sqlite-utils insert companies.db company companies-dataset-2023-02-ckgENv.csv --csv
    sqlite-utils enable-fts companies.db company name specialties
    sqlite-utils analyze-tables --save companies.db

This lets me run searches against the name and specialties columns, and gives me those aggregate stats too.
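
For anyone who wants to see roughly what a search against that index looks like without downloading the full CSV, here is a minimal Python sketch. It builds a tiny in-memory table with three made-up rows (not from the dataset) and an FTS5 index over name and specialties. The `company_fts` table name is my assumption about what `enable-fts` creates by default, and the sketch assumes your Python's SQLite build includes FTS5.

```python
import sqlite3

# In-memory stand-in for companies.db; the three rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE company (handle TEXT, name TEXT, specialties TEXT)")
conn.executemany(
    "INSERT INTO company VALUES (?, ?, ?)",
    [
        ("acme", "Acme Robotics", "industrial automation, robotics"),
        ("globex", "Globex", "energy, logistics"),
        ("initech", "Initech", "software, consulting"),
    ],
)

# External-content FTS5 index over the two searchable columns.
# The name company_fts is an assumption, not taken from the thread.
conn.execute(
    "CREATE VIRTUAL TABLE company_fts "
    "USING fts5(name, specialties, content='company')"
)
conn.execute(
    "INSERT INTO company_fts (rowid, name, specialties) "
    "SELECT rowid, name, specialties FROM company"
)


def search(term):
    """Return handles of companies whose name or specialties match term."""
    rows = conn.execute(
        "SELECT handle FROM company "
        "WHERE rowid IN ("
        "  SELECT rowid FROM company_fts WHERE company_fts MATCH ?"
        ") ORDER BY handle",
        (term,),
    )
    return [row[0] for row in rows]


matches = search("robotics")
print(matches)
```

Against the real database the same query shape should work once the index exists, but check the actual FTS table name first.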


Ok. I'm not sure how this happened, but I think the dataset was somehow mislabeled. It appears that this dataset is the Q1 version, not the latest Q2. Can you please try re-downloading it?

We're probably going to have to make a public announcement about this...


OK, that one has 15,948,996 rows.

Here's what I got from running the same "sqlite-utils analyze-tables companies2.db company" command against it:

    company.handle: (1/10)

      Total rows: 15948996
      Null rows: 0
      Blank rows: 0

      Distinct values: 15948996

    company.type: (2/10)

      Total rows: 15948996
      Null rows: 0
      Blank rows: 5253878

      Distinct values: 92

      Most common:
        5311279: Privately Held
        5253878: 
        1290064: Self-Owned
        1055857: Partnership
        987045: Public Company
        828643: Self-Employed
        799655: Nonprofit
        334552: Educational
        87681: Government Agency
        35: De financiación privada

    company.name: (3/10)

      Total rows: 15948996
      Null rows: 0
      Blank rows: 1591

      Distinct values: 15549439

      Most common:
        1591: 
        1098: .
        277: A
        246: -
        164: None
        155: X
        142: N/A
        132: ...
        128: x
        122: 1

    company.website: (4/10)

      Total rows: 15948996
      Null rows: 0
      Blank rows: 3249552

      Distinct values: 11272926

      Most common:
        3249552: 
        86957: facebook.com
        57769: instagram.com
        46404: business.site
        31397: linktr.ee
        27882: indiamart.com
        21852: wixsite.com
        19008: negocio.site
        17366: linkedin.com
        13224: yelp.com

    company.founded: (5/10)

      Total rows: 15948996
      Null rows: 0
      Blank rows: 8040264

      Distinct values: 1561

      Most common:
        8040264: 
        524236: 2020
        451742: 2017
        441748: 2018
        426575: 2019
        418318: 2021
        411391: 2016
        389487: 2015
        339212: 2014
        299038: 2013

    company.industry: (6/10)

      Total rows: 15948996
      Null rows: 0
      Blank rows: 1334901

      Distinct values: 421

      Most common:
        1334901: 
        820156: IT Services and IT Consulting
        651746: Construction
        651557: Advertising Services
        465857: Software Development
        455111: Business Consulting and Services
        447922: Real Estate
        435151: Retail
        355049: Financial Services
        312937: Wellness and Fitness Services

    company.size: (7/10)

      Total rows: 15948996
      Null rows: 0
      Blank rows: 2655086

      Distinct values: 123

      Most common:
        6646929: 2-10
        3584483: 11-50
        2655086: 
        1197530: 51-200
        1091094: 1 employee
        421595: 201-500
        150053: 501-1,000
        129373: 1,001-5,000
        44755: 10,001+
        27742: 5,001-10,000

    company.city: (8/10)

      Total rows: 15948996
      Null rows: 0
      Blank rows: 3155391

      Distinct values: 410985

      Most common:
        3155391: 
        269708: London
        124059: Paris
        113135: New York
        99314: São Paulo
        75428: Los Angeles
        69691: Madrid
        67328: Toronto
        63738: Dubai
        63456: New Delhi

    company.state: (9/10)

      Total rows: 15948996
      Null rows: 0
      Blank rows: 4524015

      Distinct values: 58563

      Most common:
        4524015: 
        691167: England
        584141: California
        329584: Texas
        291639: New York
        286723: Florida
        222552: São Paulo
        185925: Maharashtra
        171885: Ontario
        171657: Île-de-France

    company.country_code: (10/10)

      Total rows: 15948996
      Null rows: 0
      Blank rows: 2961064

      Distinct values: 272

      Most common:
        4059985: US
        2961064: 
        1232403: GB
        885302: IN
        756411: FR
        664235: BR
        467501: DE
        414433: NL
        410372: ES
        389535: CA


The thing I find most interesting is this:

        269708: London
        124059: Paris
        113135: New York
        99314: São Paulo
        75428: Los Angeles
        69691: Madrid
        67328: Toronto
        63738: Dubai
        63456: New Delhi

I would not have expected São Paulo to come fourth in this list, after New York but in front of Los Angeles. I just learned it's the 4th largest city in the world https://en.wikipedia.org/wiki/List_of_largest_cities - after Tokyo, Delhi, Shanghai - but I guess it has much more of a representation on LinkedIn than those other cities.


wc -l counts every newline in the file, including newlines inside quoted CSV fields and the header line. Perhaps some company descriptions contain newlines?
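
A quick way to see the gap between a raw newline count and a CSV record count is the Python sketch below. The three-line sample is made up for illustration (it is not from the dataset), and the column names are only assumed to resemble the real file.

```python
import csv
import io

# Hypothetical sample: a header plus two records, where the first record
# has a newline inside a quoted field.
sample = (
    "handle,name,specialties\n"
    'acme,Acme Corp,"widgets\ngadgets"\n'
    "globex,Globex,energy\n"
)


def count_lines(text):
    """Roughly what `wc -l` reports: the number of newline characters."""
    return text.count("\n")


def count_records(text):
    """Number of CSV data records, excluding the header row."""
    reader = csv.reader(io.StringIO(text, newline=""))
    next(reader, None)  # skip the header
    return sum(1 for _ in reader)


line_count = count_lines(sample)
record_count = count_records(sample)
print(line_count, record_count)
```

Here the newline count comes out higher than the record count, both because of the header and because of the embedded newline, which is the same kind of gap as between the wc -l figure and the SQLite count(*).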


Hey HN, we're thrilled to announce our latest project - the World's Largest Open Source Company Dataset. Our team has been working hard on this product for the past few months, and we're excited to finally share it with you all.

We started off years ago trying to build a B2B app, but getting basic company data at scale was a huge barrier for us. This 15M+ record dataset attempts to solve that and has all the key company fields like name, industry, size, location, LinkedIn handle, etc. We aim to update it quarterly to ensure that you always have the most up-to-date information.

Disclaimer: Okay, we have to admit, we didn't exactly comb through every dataset out there to verify that ours is the world's largest, but we did our research, and we're pretty sure it might be. Whether or not that's true, we believe this dataset is a robust and invaluable resource for anyone interested in company data.


Out of curiosity: Would you be willing to share how you acquired this data? Website scraping or other means?


Data is very likely to be from LinkedIn if you look at the field descriptions and stats. The only field that is 100% available is the one based on the LinkedIn URL. I would guess scraping unless LinkedIn provides an API for this data that I can't find.


They (Microsoft) have APIs for app/website integration, but that's all I know about.

https://developer.linkedin.com/


Most likely from scraping (Crunchbase, Yahoo, etc.), unless they bought it from somewhere. In most countries you can get it from the chamber of commerce, or from Dun and Bradstreet and similar companies. Some of these data aggregators have partnerships with other companies, and you can also (illegally) scrape it from there.


So this blew up today. Reviewing all the comments now.

1. Yes, this is scraped from public sources.

2. Yes, this is free to use / is open source in the broadest sense. Apologies for the confusion on the lack of a license and no mention about this in our TOS. We probably should update our TOS to be clearer here.

3. This is a raw dump of companies from all over the world by LinkedIn handle. The handles are deduped, but the website is not.


> Yes, this is free to use ...

Including for commercial purposes?


Yes.


Seems at odds with your tos. https://bigpicture.io/terms


What's the license the dataset is released under? I poked around the documentation a bit but couldn't find it - apologies if it's in there!


The data is not licensed as open source (only non-commercial use): "The Service and its entire contents, features and functionality (including but not limited to all information, software, text, displays, images, video and audio, and the design, selection and arrangement thereof), are owned by the Company, its licensors or other providers of such material and are protected by United States and international copyright, trademark, patent, trade secret and other intellectual property or proprietary rights laws. You are permitted to use the Service for Your personal, non-commercial use, or legitimate business purposes related to Your role as a customer of BigPicture.io." - https://bigpicture.io/terms


This looks great. I'll have to play around with it.

Related, we built a developer-oriented Zapier clone for event scale automations a while back for our internal product. We've since pivoted and have been debating open sourcing the engine as well.

We built ours using Rust with a DSL for all the triggers, actions, and action inputs/outputs. The actions themselves are defined as APIs, which makes it easy to add functionality in any language. Most of our actions have been built in Typescript.

Is there interest from anyone in potentially using it?


Yes, definitely. Put it on GitHub, create a Discord, and see if a community forms around it.


I'm always interested in seeing alternative solutions to this problem.


I told my mom about the breakthrough. Her immediate question was, "will this help to save money on energy costs for the house?".

I replied, "this will eventually result in nearly limitless energy that will fundamentally change humanity".

She responded, clearly disappointed, "oh, okay".

I think for the average person, unless it has an immediate impact on their life, it's difficult for them to visualize future potential benefits.


Especially when it's described along the lines of "nearly limitless energy that will fundamentally change humanity". That's the sort of utopian promise that is automatically very suspect on its face. Combine that with "at some unknown point in the future", and the whole thing gets too ephemeral to get all worked up about.


Fair point.

