
In Elixir, I select the `<body>`, remove all `script` and `style` tags, then extract the text.

This gives roughly what a browser's `innerText` returns: compact and well suited for passing to LLMs.

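    # Returns the text content of the page body as a single line.
    defp extract_inner_text(html) do
      html
      |> Floki.parse_document!()
      |> Floki.find("body")
      # Returning nil from the callback removes that node from the tree.
      |> Floki.traverse_and_update(fn
        {tag, _attrs, _children} when tag in ["script", "style"] ->
          nil

        node ->
          node
      end)
      |> Floki.text(sep: " ")
      |> String.trim()
      # Collapse runs of whitespace into single spaces.
      |> String.replace(~r/\s+/, " ")
    end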



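An example of where this approach is problematic: many e-commerce product pages embed JSON that is used to dynamically update sections of the page. That JSON typically sits inside script tags, so stripping every script tag throws the data away along with the code.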
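One possible workaround, as a rough sketch only: collect the embedded JSON first, then strip the script tags as before. This assumes the data sits in `<script type="application/ld+json">` tags, which is just one common convention; plenty of sites use other script types or inline assignments that this would miss. `PageExtract`, `extract/1` and `embedded_json/1` are made-up names for illustration, and Jason is simply the decoder picked here.

```elixir
# Standalone script sketch; Mix.install fetches the two Hex packages.
Mix.install([:floki, :jason])

defmodule PageExtract do
  # Hypothetical helper: returns the body text plus any decoded JSON-LD payloads.
  def extract(html) do
    doc = Floki.parse_document!(html)
    %{text: inner_text(doc), json: embedded_json(doc)}
  end

  # Assumption: embedded data lives in <script type="application/ld+json"> tags.
  # Payloads that fail to decode are silently skipped.
  defp embedded_json(doc) do
    for {"script", _attrs, children} <-
          Floki.find(doc, ~s(script[type="application/ld+json"])),
        raw = children |> Enum.filter(&is_binary/1) |> Enum.join(),
        {:ok, decoded} <- [Jason.decode(raw)] do
      decoded
    end
  end

  # Same steps as the original snippet, applied to an already parsed document.
  defp inner_text(doc) do
    doc
    |> Floki.find("body")
    |> Floki.traverse_and_update(fn
      {tag, _attrs, _children} when tag in ["script", "style"] -> nil
      node -> node
    end)
    |> Floki.text(sep: " ")
    |> String.trim()
    |> String.replace(~r/\s+/, " ")
  end
end
```

Calling `PageExtract.extract(html)` returns a map with `:text` and `:json` keys, so the decoded data can be passed to the LLM next to the text or dropped when it is not needed.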



