mfrye0's comments

As others have mentioned, converting HTML to Markdown works pretty well.

With that said, we've noticed that for some sites with nested lists or tables, we get better results by reducing those elements to simplified HTML instead of Markdown. Essentially, that gives the model explicit markers for where the structures start and stop.
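
For illustration, a minimal sketch of that approach in TypeScript with the Turndown library (the exact element list is an assumption; tune it per site):

```typescript
// Minimal sketch: convert HTML to Markdown, but leave table/list
// elements as raw HTML so the model sees explicit start/end tags.
// The element list here is an assumption; tune it per site.
import TurndownService from "turndown";

const turndown = new TurndownService();
// keep() emits matching elements as-is instead of converting them.
turndown.keep(["table", "thead", "tbody", "tr", "th", "td", "ul", "ol", "li"]);

const html = "<p>Intro</p><table><tr><td>a</td><td>b</td></tr></table>";
const markdown = turndown.turndown(html);
console.log(markdown); // "Intro" as Markdown, the table left as HTML
```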

It's also been helpful for chunking docs, ensuring that lists and tables aren't broken apart across different chunks.
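
And a rough sketch of the chunking side, treating blank-line-separated blocks as indivisible units (the max length is an arbitrary assumption):

```typescript
// Rough chunker sketch: treat blank-line-separated blocks (paragraphs,
// or whole tables/lists kept as HTML) as indivisible units, so a table
// never gets cut mid-structure. maxLen is an arbitrary assumption.
function chunkDoc(text: string, maxLen = 2000): string[] {
  const blocks = text.split(/\n{2,}/);
  const chunks: string[] = [];
  let current = "";
  for (const block of blocks) {
    if (current && current.length + block.length > maxLen) {
      chunks.push(current);
      current = "";
    }
    current += (current ? "\n\n" : "") + block;
  }
  if (current) chunks.push(current);
  return chunks;
}
```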


Congrats on the launch. This looks awesome.

I'm actually working with a number of companies that are exploring this space; many of them are in the current YC batch. We're helping provide the core business data, and we're exploring how we can leverage our scraping infrastructure in other ways to bring costs down.

I'm open to chat if you're interested: michael (a t) bigpicture.io


Thanks! Great to know, just emailed you!


I've been looking for this exact thing for a while now. I'm just starting to dig into the docs and examples, and I have a question on workflows.

I have an existing pipeline that runs tasks across two K8s clusters that share a DB. Is it possible to define steps in a workflow where the step's run logic is set up to run elsewhere? Essentially, not having an inline run function defined, and having another worker process listening for that step name.


This depends on the SDK - both the TypeScript and Golang SDKs support a `registerAction` method on the worker, which basically lets you register a single step to run only on that worker. You would then call `putWorkflow` programmatically before starting the worker. Steps are distributed by default, so they run on the workers that have registered them. Happy to provide a more concrete example for the language you're using.
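
For TypeScript, the shape is roughly this. Treat it as a hypothetical sketch: only `registerAction` and `putWorkflow` are the methods described above; the import path, the `admin` accessor, and all names and shapes here are assumptions:

```typescript
// Hypothetical sketch of the split-worker pattern. Only registerAction
// and putWorkflow are the SDK methods described; the import, the admin
// accessor, and every name/shape below are assumptions.
import Hatchet from "@hatchet-dev/typescript-sdk";

const hatchet = Hatchet.init();

async function main() {
  // Register the workflow definition (steps without inline run
  // functions) before starting the worker.
  await hatchet.admin.putWorkflow({
    id: "my-pipeline",
    steps: [{ name: "transform" }, { name: "load" }], // assumed shape
  });

  // This worker claims only the "transform" step; a worker in the
  // other K8s cluster would register the remaining steps.
  const worker = await hatchet.worker("cluster-a-worker");
  worker.registerAction("my-pipeline:transform", async (ctx: any) => {
    // step logic local to this cluster
    return { transformed: true };
  });

  await worker.start();
}

main().catch(console.error);
```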


Perfect. Yeah, we're using both, but mainly TS. We'll test that out.


My thought here is that Google is still haunted by their previous AI that classified black people as gorillas, so they overcompensated this time.

https://www.wsj.com/articles/BL-DGB-42522


I think it depends on what you're trying to do. For any given project, what's the goal? Who are you building it for? Are you trying to build a business around it?

There's a big difference between working on a project for fun and building something to solve a problem for users. In the article, I don't see any mention of talking to users, customers, etc. and asking what they want. It's a lot more motivating to keep working on a project when you're building something people want that solves real problems.


If you're already on AWS, I recommend switching to Postgres for now. For context, I have 3 RDS instances, each multi-AZ, with the biggest instance storing several billion records. My total bill for all 3 last month was $661.

Postgres has full text search, vector search (via the pgvector extension), and jsonb. With jsonb, you can store and index JSON documents much as you would in Mongo.

- https://www.postgresql.org/docs/current/textsearch.html
- https://aws.amazon.com/about-aws/whats-new/2023/05/amazon-rd...
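
To make the jsonb point concrete, a minimal sketch using the node-postgres ("pg") client; the table and field names are hypothetical:

```typescript
// Minimal sketch: jsonb storage + full text search in plain Postgres,
// via the node-postgres ("pg") client. Table and field names are
// hypothetical.
import { Client } from "pg";

async function main() {
  const client = new Client(); // connection taken from PG* env vars
  await client.connect();

  // A GIN index on the jsonb column makes document fields queryable,
  // much like a Mongo collection.
  await client.query(`
    CREATE TABLE IF NOT EXISTS companies (
      id  bigserial PRIMARY KEY,
      doc jsonb NOT NULL
    );
    CREATE INDEX IF NOT EXISTS companies_doc_gin
      ON companies USING gin (doc);
  `);

  // Full text search over a field inside the document:
  const { rows } = await client.query(
    `SELECT id, doc->>'name' AS name
       FROM companies
      WHERE to_tsvector('english', doc->>'description')
            @@ plainto_tsquery('english', $1)`,
    ["business data"]
  );
  console.log(rows);
  await client.end();
}

main().catch(console.error);
```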


You can even do Elastic-level full text search in Postgres with pg_bm25 (disclaimer: I am one of the makers of pg_bm25). Postgres truly rules, agree on the rec :)


I have trouble seeing how this is possible.

$220 per instance gets you 8 GB of RAM, which is way, way below the index size if you are indexing billions of vectors. Back of the envelope: a billion vectors at, say, 768 float32 dimensions is roughly 3 TB of raw data before any index overhead.


How big is the disk for the biggest instance?


Pretty small still, at 500 GB. It only stores hot data right now and a subset of what's important. Most of our data is in S3.


I know a guy who built his whole product out on HubSpot using their free CRM and APIs. I believe he said he's storing a few hundred million records there, and there have been no issues so far.


I used to work for a division of State Farm Insurance that ran into similar issues with addresses on or close to state lines. Insurance is also one of those businesses that's very particular about location, as we were only approved to do business in certain states.

The solution we ended up going with was to use the free TIGER shapefiles from the US Census (https://www.census.gov/geographies/mapping-files/time-series...), which enabled us to statically "lock" our geocoordinates. Apart from filing our approach and getting it approved, the logic was that we could always point back to it being a government database if there was ever a problem.
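
The lookup itself is straightforward once the TIGER shapefiles are converted to GeoJSON (e.g. with ogr2ogr). A sketch with Turf.js, where the file name and the NAME property are assumptions:

```typescript
// Sketch of the point-in-polygon lookup described above. Assumes the
// TIGER state shapefile was converted to GeoJSON beforehand (e.g. with
// ogr2ogr); the file name and the NAME property are assumptions.
import { readFileSync } from "fs";
import { point } from "@turf/helpers";
import booleanPointInPolygon from "@turf/boolean-point-in-polygon";

const states = JSON.parse(readFileSync("tiger_states.geojson", "utf8"));

function stateFor(lon: number, lat: number): string | undefined {
  const pt = point([lon, lat]);
  const match = states.features.find((f: any) =>
    booleanPointInPolygon(pt, f)
  );
  return match?.properties?.NAME;
}

// A coordinate near a state line still resolves to exactly one polygon:
console.log(stateFor(-90.1994, 38.627));
```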


Any chance you could link to where you see that?

The back story is that we started as a B2B ABM product. We relied on another provider for data, then got hit with a bill one month that nearly bankrupted us. We ended up building our own system and pivoted to making this data more cost-effective for others to access at scale.

Basically, there are still bits and pieces of old marketing material out there that need to be removed and/or cleaned up.


https://bigpicture.io/ip-to-company-data-api

It's one of the H2 headings just over halfway down the page.


Awesome, thanks. We're due to update the whole marketing site.


To be frank, we and others interpreted that entire thread as resolved and not needing further attention. Scraping is clearly a grey area of the law, and this is all public data.

- https://techcrunch.com/2022/04/18/web-scraping-legal-court/
- https://en.m.wikipedia.org/wiki/HiQ_Labs_v._LinkedIn
- https://www2.staffingindustry.com/Editorial/IT-Staffing-Repo...


Did you read your own links? The HiQ Labs decision in favor of scraping was vacated by the Supreme Court and then settled. It's not clear-cut case law, but it definitely ended on LinkedIn's terms.


Yes. This was because HiQ was creating fake accounts to scrape member profiles, which was deemed to be in violation of the TOS.

Legal has reviewed our approach in comparison, and we're good.

