Hacker News | adulion's comments

https://datasignal.uk

I've been working on an exciting project called DataSignal, where I'm developing a sophisticated Named Entity Recognition (NER) and enrichment system. This system is designed to identify and extract key entities from blocks of text, such as names, locations, or organizations, and then provide rich contextual information about them. The goal is to transform raw text into meaningful data that users can easily understand and leverage in various applications.

The real magic of DataSignal lies in how it delivers this contextual information. It can be seamlessly integrated into your workflow through a browser plugin, which instantly highlights and explains entities as you browse the web. For developers, DataSignal can be embedded directly into a webpage via JavaScript, enhancing the content on the page with dynamic insights. Additionally, it offers a REST API, allowing for flexible and customizable integration into other software systems.

Whether you're a developer looking to enhance your application with smart text analysis or someone who wants to streamline research and data gathering, DataSignal is poised to bring a new level of insight and efficiency to text analysis. This project aims to make complex data more accessible and actionable, transforming how we interact with and interpret information.


I recently expanded a project that allows for easy benchmarking of local language models, specifically for Named Entity Recognition (NER) tasks. By using a simple JSON configuration, you can now evaluate and compare the performance of different models installed through Ollama without the need for extensive coding.

The idea is to make it easier for developers and data scientists to test various models and see how they perform on specific tasks, all within a streamlined framework.
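For context, the configuration-driven approach looks roughly like this. This is a minimal sketch, not the repo's actual code: the config field names, the naive recall scoring, and the helper functions are all illustrative; only the Ollama `/api/generate` endpoint is the real API.

```python
import json
import urllib.request

# Hypothetical config -- the real framework's schema may differ.
CONFIG = json.loads("""
{
  "models": ["llama3", "mistral"],
  "task": "Extract all PERSON, ORG and LOC entities from the text.",
  "samples": [
    {"text": "Tim Cook announced Apple's results in Cupertino.",
     "expected": ["Tim Cook", "Apple", "Cupertino"]}
  ]
}
""")

def build_prompt(task: str, text: str) -> str:
    """Combine the task instruction and the input text into one prompt."""
    return f"{task}\n\nText: {text}"

def query_ollama(model: str, prompt: str, host: str = "http://localhost:11434") -> str:
    """Call Ollama's /api/generate endpoint and return the generated text."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(f"{host}/api/generate", data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

def score(expected: list, output: str) -> float:
    """Naive recall: fraction of expected entities found verbatim in the output."""
    if not expected:
        return 1.0
    return sum(1 for e in expected if e in output) / len(expected)

# The benchmark loop is then just: for each configured model, run every
# sample through query_ollama() and print score(sample["expected"], output).
```

Adding a new model to compare is then a one-line change to the JSON, with no code edits.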

If you're interested in learning more about the approach and trying it out, you can check out the full details in my blog post: Creating a JSON Framework to Test Local Language Models for NER — https://datasignal.uk/blog/ner-framework.html

The code is open source and available on GitHub: NER-llm-blog Repo — https://github.com/DataSignalOrg/NER-llm-blog/tree/master

I’d love to hear your thoughts or suggestions for improvements!


only one remote role?


Can you build a website with analytics nowadays without using cookies or violating GDPR?


Yes you can. See for example https://plausible.io/, which does analytics without using cookies, and without collecting any personal data.
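For the unique-visitor part specifically, Plausible's documented approach is a daily-rotating salted hash of IP address and user agent, so nothing is ever stored on the visitor's device. A sketch (the exact inputs and salt handling here are assumptions, not Plausible's actual code):

```python
import hashlib
import secrets
from datetime import date

# The salt rotates daily and old salts are discarded, so visitor IDs
# from different days can never be linked together.
DAILY_SALT = secrets.token_hex(16)

def visitor_id(ip: str, user_agent: str, site: str) -> str:
    """Pseudonymous per-day visitor ID; no cookie, nothing stored client-side."""
    raw = f"{DAILY_SALT}|{site}|{ip}|{user_agent}|{date.today().isoformat()}"
    return hashlib.sha256(raw.encode()).hexdigest()

# Counting unique visitors is then a set of IDs per day:
uniques = {
    visitor_id("203.0.113.7", "Mozilla/5.0", "example.com"),   # same visitor...
    visitor_id("203.0.113.7", "Mozilla/5.0", "example.com"),   # ...counted once
    visitor_id("198.51.100.2", "Mozilla/5.0", "example.com"),
}
print(len(uniques))  # 2
```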


In my understanding, the most important part is to not share user information with third parties. IIUC, Google can use Google analytics data to join your behavior from multiple sites and then use that to serve targeted ads.

The next level is to not store PII unless there's a specific reason in the user's interest (improving site quality doesn't count, logging in does). Therefore, you can see how many people visited a page, aggregates of device types etc. Just not anything that identifies an individual.


Of course you can. What is it you want to do that you think you can't do?


You can use something like Plausible Analytics which does not use cookies.


The best way is to self-host your analytics; the main thing about GDPR is not sending your data to third parties or using it for marketing/targeting purposes.

By not sending the data to third parties, you already comply with most of the GDPR's requirements.


Certainly one aspect of GDPR is about how you share data with third-parties. But self-hosted analytics are still subject to GDPR and/or ePrivacy restrictions if you process full (unredacted) IP addresses, any user-identifying tokens, or anything else deemed as PII (Personally Identifiable Information) for purposes such as analytics without seeking user consent.
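One common mitigation, used by several self-hosted analytics tools, is truncating addresses before they are ever written to disk. A sketch (the /24 and /48 prefix lengths are a typical convention, not a legal requirement):

```python
import ipaddress

def anonymize_ip(ip: str) -> str:
    """Zero the host bits before storage: keep /24 for IPv4, /48 for IPv6."""
    addr = ipaddress.ip_address(ip)
    prefix = 24 if addr.version == 4 else 48
    net = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
    return str(net.network_address)

print(anonymize_ip("203.0.113.77"))           # 203.0.113.0
print(anonymize_ip("2001:db8:abcd:1234::1"))  # 2001:db8:abcd::
```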


That's true, but the "analytics" purpose is ambiguous. It could be for security: most servers already have access logs by default, which store IP addresses anyway, and those are often used for DDoS protection or fail2ban login attempts, for example.


The ambiguity of this legislation is one of the biggest problems with it.

This ambiguity leads to companies implementing cookie warning popups based on a risk-averse interpretation of the law.


You can track the number of visits without using cookies, but it's practically impossible to track the number of unique visitors without using cookies.

The number of unique visitors is a very useful metric (both in itself, and combined with the number of visits).

The EU has made it impossible to track this simple and harmless metric without inconveniencing all users with awful UX.

Under the GDPR / ePrivacy Directive, ANY user-based unique identifier used for advertising, analytics, or tracking will trigger the need for consent.

---

General Data Protection Regulation (GDPR)

Article 4(1) defines personal data as "any information relating to an identified or identifiable natural person ('data subject'); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person."

Article 6(1) outlines the lawfulness of processing and states that processing is only lawful if and to the extent that at least one of the following applies: "the data subject has given consent to the processing of his or her personal data for one or more specific purposes."

---

ePrivacy Directive (Directive 2002/58/EC)

Article 5(3) requires prior informed consent for the storage of or access to information stored on a user's device: "Member States shall ensure that the storing of information, or the gaining of access to information already stored, in the terminal equipment of a subscriber or user is only allowed on condition that the subscriber or user concerned has given his or her consent, having been provided with clear and comprehensive information, in accordance with Directive 95/46/EC, inter alia, about the purposes of the processing. This shall not prevent any technical storage or access for the sole purpose of carrying out the transmission of a communication over an electronic communications network, or as strictly necessary in order for the provider of an information society service explicitly requested by the subscriber or user to provide the service."


Agreed - I worked at a company a few years back that built a product using PostgREST and an Angular frontend for everything, with nothing in between, which felt wrong as well.


It's not the UK, it's Great Britain - it doesn't include Northern Ireland data.


I was working full time in McDonald's, and when the Facebook API launched around 2009 I created a few blog posts integrating the API with PHP and CodeIgniter.

This led to picking up a few projects from agencies, which led to a full-time job, and the rest is history.


why would you use this over duckdb?


Here is an example: DB Pilot recently switched from DuckDB to chDB, mostly for faster queries and broader data format support: https://dbpilot.io/changelog#embedded-clickhouse-and-standal...

Disclaimer: I work at ClickHouse


I don't have an opinion, but this article gives a fairly unbiased comparison of DuckDB vs Clickhouse-local, which I imagine exhibits similar performance characteristics as ChDB, just without the embedded part: https://www.vantage.sh/blog/clickhouse-local-vs-duckdb


Disclaimer: I am a chdb maintainer! DuckDB is currently thinner and has lots of active contributors and mature integrations, while chdb is still in its early stages. BUT if you already love ClickHouse (like we do), chdb is a great choice, as it inherits all of ClickHouse's stability and performance and, more importantly, all 70+ supported formats for the embedded use case, without any of the server/client requirements, making it perfect for fast in-process and serverless OLAP execution.

Note that chdb is based on the ClickHouse codebase but is completely community powered, so there's no feud with DuckDB (I'm a quackhead, too!), which actually offers lots of great inspiration and many integration opportunities with ClickHouse/chdb for combined compute and processing of datasets. I personally love both and use them together all the time in my colab "OLAPps".


Where can I find an SVG version of this logo? https://github.com/chdb-io/chdb/raw/main/docs/_static/snake-...



Does chdb have support for recursive CTEs?

That, and the ability to do `from 'https://example.com/*.csv'`, is why I love duckdb.


ClickHouse 10 years before DuckDB existed:

SELECT * FROM url('https://example.com/*.csv')


Not as simple as `from 'a.csv'`, and 10 years without recursive CTEs?


One reason would be if you're already fluent in ClickHouse's SQL dialect. Although they maintain great standard SQL compatibility, they also have a great deal of special functions/aggregates/etc that are ClickHouse specific.

Other reasons include their wide range of input formats, special table functions (e.g. query a URL).


I think it makes sense if you are just considering whether or not to use ClickHouse, it's a very easy place to start. If you then outgrow embedded you won't have to move to another database afterwards, you can probably just attach the tables to a "real" ClickHouse instance and continue using it without having to do lengthy data migration.


I suppose if you had data in a format that DuckDB doesn't work with, like Protobuf, Avro, ORC, Arrow, etc. ClickHouse reads and writes data in over 70 formats.


One reason is that it supports a wide range of formats.


I've seen him speak in Ireland twice in the past 20 years, so I assume he got here by plane.


I work in ecommerce, and we have a conditional for the order of the fields based on the delivery country of the user checking out.

