
Binary search is only faster if access to the elements is constant-time, which is not true for a file with variable-length records (lines). Line 1000 (counting from 0) could be at offset 2,000 if each line is a single character plus a newline, or at offset 2,000,000 if each line is 1,999 characters plus a newline, or anywhere in between.



You don't need to know the number of the line you're looking at. You jump to the middle of the file, and you compare the nearest line to the search pattern. Then you recurse on the top or bottom half.
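A minimal sketch of that idea in Python, under some assumptions: the file is sorted in plain byte order and newline-terminated, and the function name and the word-list path in the trailing comment are made up for illustration (this is not the actual look(1) source).

  import io

  def find_first_ge(path: str, needle: bytes) -> bytes:
      """Return the first line whose content is >= needle, or b'' if there is none."""
      with open(path, "rb") as f:
          f.seek(0, io.SEEK_END)
          lo, hi = 0, f.tell()          # byte offsets, not line numbers
          while lo < hi:
              mid = (lo + hi) // 2
              f.seek(mid)
              f.readline()              # discard the (possibly partial) line mid landed in
              line = f.readline()       # first full line starting after offset mid
              if line and line.rstrip(b"\r\n") < needle:
                  lo = mid + 1          # that line is too small: answer lies further down
              else:
                  hi = mid              # that line (or EOF) is >= needle: answer is at or above
          f.seek(lo)
          if lo > 0:
              f.readline()              # lo is the start of the last line < needle; skip it
          return f.readline()           # first line >= needle, or b'' if we hit EOF

  # Example prefix lookup against a sorted word list (illustrative path):
  # line = find_first_ge("/usr/share/dict/words", b"binar")
  # match = line if line.startswith(b"binar") else None

Each probe costs one seek plus reading at most two lines, so the number of lines it touches grows with log of the file size rather than with the file size itself.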


That still requires random seeking; it's not so bad on SSDs, but abysmal on rotating disks, and most of the filesystem and OS caching code is optimised for linear forward reads, so things like prefetching are not going to help at all.


At some point the cost of scanning a long file will exceed the cost of doing log(n) seeks. On short files it really doesn't matter what you do; searching will be fast regardless.
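A rough back-of-envelope illustration of where that crossover might land. Every number here is an assumption, not a measurement: a 1 GB sorted file of ~10 million lines on a rotating disk with ~10 ms per random seek and ~150 MB/s sequential reads.

  import math

  n_lines   = 10_000_000
  file_size = 1_000_000_000       # bytes
  seek_cost = 0.010               # seconds per random seek (assumed)
  scan_rate = 150_000_000         # bytes/second sequential read (assumed)

  binary_search = math.ceil(math.log2(n_lines)) * seek_cost   # ~24 seeks -> ~0.24 s
  linear_scan   = file_size / scan_rate                       # ~6.7 s

  print(f"binary search ~{binary_search:.2f} s, linear scan ~{linear_scan:.1f} s")

With those assumptions the seeks win by more than an order of magnitude; shrink the file or put it on a warm page cache and the gap closes quickly.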


It's probably much faster for most text files, though.

I also think your constant-time claim sounds too strong. You can eat a lot of end-of-line search time after the binary search and still beat a linear/regex search.

As someone else pointed out, look can also exit sooner.



