Python library for univariate regression, interpolation, and smoothing (github.com/brendanartley)
102 points by codeboy7432 on Aug 29, 2022 | 27 comments



I authored this package as I needed to generate confidence intervals for time series data without using SciPy. Sharing this here, as this could be a useful package for others :)

Included Models:
- Linear regression
- Ridge regression
- Linear spline
- Isotonic regression
- Bin regression
- Cubic spline
- Natural cubic spline
- Exponential moving average
- Kernel functions (Gaussian, KNN, Weighted average)
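
To give a feel for the simplest smoother in the list above, here is a minimal NumPy-only sketch of an exponential moving average. The function name and the `alpha` parameter are illustrative assumptions, not the package's API.

```python
import numpy as np

def exponential_moving_average(y, alpha=0.3):
    """Smooth a 1-D series; alpha in (0, 1], larger alpha tracks the data more closely."""
    y = np.asarray(y, dtype=float)
    out = np.empty_like(y)
    out[0] = y[0]  # seed the recursion with the first observation
    for t in range(1, y.size):
        # New value is a weighted mix of the current point and the previous smoothed value
        out[t] = alpha * y[t] + (1.0 - alpha) * out[t - 1]
    return out

smoothed = exponential_moving_average([1.0, 2.0, 3.0], alpha=0.5)
```

Seeding with the first observation is one common choice; other conventions (for example, seeding with a short average) give slightly different early values.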


Thanks for releasing this. I was just wondering if something like this exists. I've worked on a few projects where scipy was banned due to the large dependencies it pulls in.


Have you compared these to StatsModels? Or R? (which includes most of these ootb)


I did look into statsmodels, but it is a large library with more than I needed. I did not look into R. Are there any lightweight alternatives in that language?


I think most or all of these models are in the R standard library.


Not all, as far as I know. For ridge regression you will want to install glmnet, for example. mgcv, which usually ships with R, provides access to a few common fast kernels, which seem to be the ones Python programmers are familiar with.


You can use lm.ridge from MASS instead of glmnet, but yeah, there are going to be some smoothers that are not in the R standard library.
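
For readers following along in Python, here is a rough NumPy-only sketch of what a ridge fit computes: the closed-form solution with an unpenalized intercept. The names `ridge_fit` and `lam` are illustrative assumptions. This does not reproduce lm.ridge or glmnet, which apply their own scaling conventions, so coefficients for the same numeric penalty may not match.

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Closed-form ridge regression; the intercept is left unpenalized by centering."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    if X.ndim == 1:
        X = X[:, None]  # treat a 1-D input as a single feature column
    x_mean = X.mean(axis=0)
    y_mean = y.mean()
    Xc = X - x_mean
    yc = y - y_mean
    # Solve (Xc^T Xc + lam * I) coef = Xc^T yc without forming an explicit inverse
    A = Xc.T @ Xc + lam * np.eye(X.shape[1])
    coef = np.linalg.solve(A, Xc.T @ yc)
    intercept = y_mean - x_mean @ coef
    return intercept, coef

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])
intercept, coef = ridge_fit(x, y, lam=5.0)
```

With `lam=0` this reduces to ordinary least squares; increasing `lam` shrinks the slope toward zero.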


mgcv also provides many more varieties of splines than base R.


Am I correct to assume that you could not use sklearn/scikit either, because it depends on SciPy? (I am under the impression that sklearn/scikit is the dominant library for implementations of these algorithms.)


That is correct. I had to generate confidence intervals on over 8000 univariate data sets using very small VMs, so I needed to limit large dependencies as much as I could. This package was the result of this!


Two questions:

Why the requirement that you can't use scipy?

Have you heard of the package statsmodels?


Well, SciPy depends heavily on NumPy, which, as a CPython-specific extension, generally won't run on other Python interpreters. That said, there is ulab for MicroPython, which replicates part of NumPy, and PyPy has a compatibility layer for CPython extensions.

Edit: well, Regressio itself also depends on NumPy, but it might be able to run on top of ulab, whereas I really doubt SciPy would.


The repo in the OP also depends on NumPy.


Responding that there's something out there called ulab doesn't really answer my question, which was: where does OP's requirement to not use SciPy come from?


(Same as comment above)

"... I had to generate confidence intervals on over 8000 univariate data sets using very small VMs, so I needed to limit large dependencies as much as I could. This package was the result of this!"

Based on the comments in this thread, it may be worth trying to make this package not dependent on NumPy as well?


Nice job -- I have examples using statsmodels for similar (not time series) data [1,2]. I typically use this for EDA before regression modelling, so dependencies in that scenario are not a big deal. But I might weep if someone told me no scipy in production.

[1] https://github.com/apwheele/Blog_Code/tree/master/Python/Smo...

[2] https://andrewpwheeler.com/2020/09/20/making-smoothed-scatte...


Interesting. How does this compare to statsmodels? https://www.statsmodels.org/stable/index.html


It looks pretty good, but I’d love to see you make better use of the routines already in NumPy. In particular, I see you are solving the OLS problem using direct inversion when you already have QR and SVD available to just call. There are other simple things that can do a lot for your results too, such as centering and scaling. I guess it works well enough for small, well-behaved problems as is, though.
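
To make the suggestion above concrete, here is a small sketch contrasting the two approaches for a univariate linear fit. The function names are hypothetical and are not taken from the package; `np.linalg.lstsq` is NumPy's SVD-based least-squares solver.

```python
import numpy as np

def fit_ols_inverse(x, y):
    """Normal equations with an explicit inverse (the approach being questioned)."""
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.inv(X.T @ X) @ X.T @ y

def fit_ols_lstsq(x, y):
    """Least squares via np.linalg.lstsq on centered and scaled x."""
    mu, sigma = x.mean(), x.std()
    z = (x - mu) / sigma  # centering and scaling improves conditioning
    Z = np.column_stack([np.ones_like(z), z])
    (b0, b1), *_ = np.linalg.lstsq(Z, y, rcond=None)
    # Map the coefficients back to the original x scale
    slope = b1 / sigma
    intercept = b0 - slope * mu
    return np.array([intercept, slope])

# x values with a large offset make X^T X badly conditioned for the inverse approach
rng = np.random.default_rng(0)
x = 1e6 + np.linspace(0.0, 1.0, 50)
y = 2.0 + 3.0 * (x - 1e6) + rng.normal(scale=0.01, size=x.size)
intercept, slope = fit_ols_lstsq(x, y)
```

On small, well-behaved inputs the two functions agree; the difference shows up when the columns of the design matrix are nearly collinear or badly scaled, as in the offset example.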


Nice and slim. Does your cubic spline support clamping and monotonicity as well, by any chance?


Do you have an example or reference describing how clamping and monotonicity in the context of cubic splines are implemented? Thanks.


https://en.wikipedia.org/wiki/Monotone_cubic_interpolation has some reference for monotone cubic splines.

In theory they should be useful when you know that the underlying process should be monotone. I think in the past I found them more sensitive to noise, and I wondered whether monotone approximation might be better than monotone interpolation for that reason.
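
For reference, a minimal sketch of the Fritsch-Carlson scheme described in that Wikipedia article, using NumPy only. The function names are illustrative, and this is not part of the package under discussion. It assumes strictly increasing `x`.

```python
import numpy as np

def fritsch_carlson_slopes(x, y):
    """Tangent slopes that keep a cubic Hermite interpolant monotone on monotone data."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    delta = np.diff(y) / np.diff(x)  # secant slope of each interval
    m = np.empty_like(y)
    m[0], m[-1] = delta[0], delta[-1]
    m[1:-1] = 0.5 * (delta[:-1] + delta[1:])
    # Flatten the tangent wherever neighbouring secants change sign or vanish
    m[1:-1] = np.where(delta[:-1] * delta[1:] <= 0.0, 0.0, m[1:-1])
    for k in range(delta.size):
        if delta[k] == 0.0:
            m[k] = m[k + 1] = 0.0
            continue
        a, b = m[k] / delta[k], m[k + 1] / delta[k]
        r2 = a * a + b * b
        if r2 > 9.0:
            # Rescale so (a, b) lies inside the circle of radius 3
            tau = 3.0 / np.sqrt(r2)
            m[k] = tau * a * delta[k]
            m[k + 1] = tau * b * delta[k]
    return m

def monotone_cubic_interpolate(x, y, x_new):
    """Evaluate the cubic Hermite interpolant with Fritsch-Carlson slopes."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_new = np.asarray(x_new, dtype=float)
    m = fritsch_carlson_slopes(x, y)
    i = np.clip(np.searchsorted(x, x_new) - 1, 0, x.size - 2)
    h = x[i + 1] - x[i]
    t = (x_new - x[i]) / h
    h00 = 2 * t**3 - 3 * t**2 + 1
    h10 = t**3 - 2 * t**2 + t
    h01 = -2 * t**3 + 3 * t**2
    h11 = t**3 - t**2
    return h00 * y[i] + h10 * h * m[i] + h01 * y[i + 1] + h11 * h * m[i + 1]
```

Because the curve passes through every data point exactly, noise in the data goes straight into the result, which is consistent with the sensitivity mentioned above.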


I added comments to each class which explain the high-level implementation details. Clamping is supported with natural cubic splines, and this is done by taking the slopes at each endpoint.

Monotonicity is currently not supported (for cubic splines).
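
For readers unfamiliar with the terms, here is a textbook-style sketch of a cubic spline with either natural end conditions (zero second derivative at both ends) or clamped end conditions (prescribed end slopes). The function names and the `end_slopes` parameter are assumptions for illustration; how the package itself handles the endpoints is not shown in this thread.

```python
import numpy as np

def cubic_spline_second_derivs(x, y, end_slopes=None):
    """Solve for the spline's second derivatives M at the knots.

    end_slopes=None gives a natural spline; end_slopes=(s0, s1) clamps the
    first derivative at the two endpoints.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    h = np.diff(x)
    delta = np.diff(y) / h
    A = np.zeros((n, n))
    b = np.zeros(n)
    for i in range(1, n - 1):
        # Continuity of the first derivative at each interior knot
        A[i, i - 1] = h[i - 1]
        A[i, i] = 2.0 * (h[i - 1] + h[i])
        A[i, i + 1] = h[i]
        b[i] = 6.0 * (delta[i] - delta[i - 1])
    if end_slopes is None:
        A[0, 0] = A[-1, -1] = 1.0  # natural: M[0] = M[-1] = 0
    else:
        s0, s1 = end_slopes
        A[0, 0], A[0, 1] = 2.0 * h[0], h[0]
        b[0] = 6.0 * (delta[0] - s0)
        A[-1, -2], A[-1, -1] = h[-1], 2.0 * h[-1]
        b[-1] = 6.0 * (s1 - delta[-1])
    return np.linalg.solve(A, b)  # dense solve; fine for a small sketch

def cubic_spline_eval(x, y, M, x_new):
    """Evaluate the piecewise cubic defined by knots (x, y) and second derivatives M."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_new = np.asarray(x_new, dtype=float)
    i = np.clip(np.searchsorted(x, x_new) - 1, 0, x.size - 2)
    h = x[i + 1] - x[i]
    left, right = x[i + 1] - x_new, x_new - x[i]
    return ((M[i] * left**3 + M[i + 1] * right**3) / (6.0 * h)
            + (y[i] / h - M[i] * h / 6.0) * left
            + (y[i + 1] / h - M[i + 1] * h / 6.0) * right)
```

A quick sanity check of the clamped case: interpolating y = x^3 with the exact end slopes should reproduce the cubic between the knots.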


Sorry if this is too off topic, but this looks great and I was just in need of something like it for C++ instead. Does anyone know of similar libs?


In addition to the already mentioned GSL, there are two more that I know of and have used for spline interpolation in C/C++: John Burkardt's spline library[1] and Netlib[2]. The pppack library from Netlib is Fortran code, so you have to write a wrapper when using it from C++.

[1] https://people.sc.fsu.edu/~jburkardt/cpp_src/spline/spline.h...

[2] https://netlib.org/pppack/

EDIT: A quick internet search (in my case for cubic splines) gave more relevant results, so depending on your needs, just look around: https://www.bing.com/search?q=cubic+spline+c%2B%2B


My first bet would be the GSL library: https://www.gnu.org/software/gsl/doc/html/


Does this support 3d data?


Nice. Well done.



