As others have mentioned, converting HTML to Markdown works pretty well.
With that said, we've noticed that for some sites with nested lists or tables, we get better results by reducing those elements to simplified HTML instead of Markdown. Essentially, the tags provide explicit context for where those structures start and stop.
It's also been helpful for chunking docs, since it ensures that lists and tables aren't broken apart across different chunks.
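For illustration, here's a minimal sketch of that idea in TypeScript, assuming the turndown HTML-to-Markdown library (the comment above doesn't name a specific converter):

```typescript
// Minimal sketch, assuming the turndown library. Kept elements retain
// explicit open/close tags marking where the structure starts and stops.
import TurndownService from "turndown";

const turndown = new TurndownService();

// Tables have no built-in Markdown rule in turndown, so keep() passes
// them through as raw HTML.
turndown.keep(["table"]);

// Lists DO have a built-in rule, so an added rule (which takes
// precedence) is needed to emit nested lists as HTML. A real
// implementation would also strip attributes to "simplify" the HTML.
turndown.addRule("nestedListsAsHtml", {
  filter: (node) =>
    (node.nodeName === "UL" || node.nodeName === "OL") &&
    node.querySelector("ul, ol") !== null,
  replacement: (_content, node) =>
    `\n\n${(node as HTMLElement).outerHTML}\n\n`,
});

// Everything else converts to Markdown as usual.
export const toLlmMarkdown = (html: string): string => turndown.turndown(html);
```

One gotcha: turndown's `keep()` only wins for elements without a standard conversion rule (like tables), which is why the nested lists go through `addRule` instead.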
I'm actually working with a number of companies that are exploring this space. Many of them are in the current YC batch. We help provide the core business data, and we're exploring how we can leverage our scraping infrastructure in other ways to bring costs down.
I'm open to chat if you're interested: michael (a t) bigpicture.io
I've been looking for this exact thing for a while now. I'm just starting to dig into the docs and examples, and I have a question on workflows.
I have an existing pipeline that runs tasks across two K8s clusters that share a DB. Is it possible to define steps in a workflow where the step's run logic is set up to run elsewhere? Essentially, not defining an inline run function, and instead having another worker process listen for that step name.
This depends on the SDK - both the TypeScript and Golang SDKs support a `registerAction` method on the worker, which basically lets you register a single step to only run on that worker. You would then call `putWorkflow` programmatically before starting the worker. Steps are distributed by default, so they run on the workers that have registered them. Happy to provide a more concrete example for the language you're using.
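For what it's worth, here's a rough TypeScript sketch of that pattern. Only `registerAction` and `putWorkflow` come from the comment above; the client setup, the admin accessor, and the workflow payload shape are my assumptions, not the SDK's actual surface.

```typescript
// Rough sketch: "hatchet" stands in for a client initialized per the
// SDK docs; the admin accessor and payload shape are assumptions.
declare const hatchet: any;

async function main() {
  // Register the workflow definition up front, before starting any
  // workers. Neither step has an inline run function here; each is
  // just a named action.
  await hatchet.admin.putWorkflow({
    id: "cross-cluster-pipeline",
    steps: [
      { name: "extract", action: "pipeline:extract" },
      { name: "load", action: "pipeline:load", parents: ["extract"] },
    ],
  });

  // This worker, deployed in cluster A, registers only the extract
  // action, so only it receives "pipeline:extract" runs.
  const worker = await hatchet.worker("cluster-a-worker");
  worker.registerAction("pipeline:extract", async (_ctx: any) => {
    return { rows: 1000 };
  });
  await worker.start();

  // A second worker process in cluster B would call
  // registerAction("pipeline:load", ...) the same way, and the engine
  // routes each step to whichever worker registered it.
}

main();
```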
I think it depends on what you're trying to do. For any given project, what's the goal? Who are you building it for? Are you trying to build a business around it?
There's a big difference between working on a project for fun and building something to solve a problem for users. In the article, I don't see any mention of talking to users, customers, etc. and asking what they want. It's a lot more motivating to keep working on a project when you're building something people want that solves real problems.
If you're already on AWS, I recommend switching to Postgres for now. For context, I have 3 RDS instances, each Multi-AZ, with the biggest instance storing several billion records. My total bill for all 3 last month was $661.
Postgres has full text search, vector search, and jsonb. With jsonb you can store and index JSON documents like you would in Mongo.
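To make that concrete, here's a small self-contained sketch via node-postgres (the table and column names are made up) showing jsonb storage with a GIN index, a Mongo-style containment query, and built-in full text search:

```typescript
import { Client } from "pg";

async function demo() {
  const client = new Client(); // connection settings from PG* env vars
  await client.connect();

  await client.query(`
    CREATE TABLE IF NOT EXISTS events (
      id  bigserial PRIMARY KEY,
      doc jsonb NOT NULL
    )`);

  // A GIN index is what makes @> containment queries cheap at scale.
  await client.query(`
    CREATE INDEX IF NOT EXISTS events_doc_idx
      ON events USING GIN (doc jsonb_path_ops)`);

  // node-postgres serializes plain objects to JSON for jsonb columns.
  await client.query(`INSERT INTO events (doc) VALUES ($1)`, [
    { type: "signup", plan: "pro", note: "referred by newsletter" },
  ]);

  // Query by document shape, like a Mongo find().
  const { rows } = await client.query(
    `SELECT doc FROM events WHERE doc @> '{"plan": "pro"}'`
  );

  // Built-in full text search over one field of the document.
  const fts = await client.query(`
    SELECT doc FROM events
     WHERE to_tsvector('english', doc->>'note')
           @@ plainto_tsquery('english', 'newsletter')`);

  console.log(rows, fts.rows);
  await client.end();
}

demo();
```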
You can even do Elastic-level full text search in Postgres with pg_bm25 (disclaimer: I am one of the makers of pg_bm25). Postgres truly rules, agree on the rec :)
I know a guy who built his whole product on HubSpot using their free CRM and APIs. I believe he said he's storing a few hundred million records there, and there have been no issues so far.
I used to work for a division of State Farm insurance that ran into some similar issues with addresses on or close to state lines. Insurance is also one of those businesses that's very particular about location, as we were only approved to do business in certain states.
The solution we ended up going with was to use the free TIGER shapefiles from the US Census (https://www.census.gov/geographies/mapping-files/time-series...), which enabled us to statically "lock" our geocoordinates. Beyond filing our approach and getting it approved, the reasoning was that we could always point back to it being a government database if there was ever a problem.
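A hedged sketch of that lookup in TypeScript (not what we actually ran; the shapefile and @turf libraries, and the 2023 file names, are assumptions for illustration):

```typescript
// Resolve a coordinate to a state by point-in-polygon lookup against
// the TIGER/Line state boundary shapefile.
import * as shapefile from "shapefile";
import { point } from "@turf/helpers";
import booleanPointInPolygon from "@turf/boolean-point-in-polygon";

async function stateFor(lon: number, lat: number): Promise<string | null> {
  // The .dbf sidecar carries the attributes; STUSPS is the
  // two-letter USPS state code.
  const states = await shapefile.read(
    "tl_2023_us_state.shp",
    "tl_2023_us_state.dbf"
  );
  const p = point([lon, lat]);
  for (const feature of states.features) {
    if (booleanPointInPolygon(p, feature as any)) {
      return (feature.properties as any)?.STUSPS ?? null;
    }
  }
  return null; // offshore, or outside US coverage
}
```

Because the shapefile is a static, versioned government file, the same coordinate always resolves to the same state, which is the "lock" mentioned above.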
The back story is that we started as a B2B ABM product. We relied on another provider for data, then got hit with a bill one month that nearly bankrupted us. We ended up building our own system and pivoted to making this data more cost effective for others to access at scale.
Basically, there are still bits and pieces of old marketing material out there that need to be removed and/or cleaned up.
To be frank, we and others interpreted that entire thread as resolved and not needing to be addressed. Scraping is clearly a grey area of the law. This is all public data.
Did you read your own links? The hiQ Labs decision in favor of scraping was vacated by the Supreme Court and then settled. Not clear-cut case law, but it definitely ended on LinkedIn's terms.