Hacker News new | past | comments | ask | show | jobs | submit login
ZMap: Internet scanner maps all of IPv4 in 45 minutes (zmap.io)
207 points by sweis on Aug 16, 2013 | hide | past | favorite | 63 comments



I thought the approach they took to permuting the address space was particularly clever, but was quickly schooled on Twitter by how bog-standard the approach apparently is (I very much concede the point!). In the interest of honesty about how easily impressed I am, and because I still think it's a neat little trick:

You want to generate a permutation of the entire IPv4 address space, but you don't simply want to shuffle every possible IP address because that would require you to keep an insane amount of state. So instead, work in the multiplicative group modulo p for prime p > 2^32, find an appropriate generator, and iterate by multiplying with the generator mod p. Remember the prime, the generator, the starting address, and your current address and you can detect a complete traversal of the space when the starting address recurs.

There are a number of simpler ways to do this (after sheepishly conceding that this is pretty fundamental stuff, I played with using PRFs and card shuffling to do it; DrHoney suggested Gray codes), but I liked how immediately obvious the multiplicative group solution was, and that I could code it from a simple description.


The problem of generating a non-repeating (2^n)-1 n-bit numbers permutation is equivalent to generating a m-sequence (http://en.wikipedia.org/wiki/Maximum_length_sequence) and the simplest way is a n-bit LFSR with a primitive poly, called maximal LFSR. It's so much simpler and faster (some XORs and n bits of memory), I cannot see why they chose that weird prime multiplication algorithm, in fact right now I'm not completely sure that algorithm is correct.


Furthermore, is not that a way to describe a Linear Congruential Generator (You know, rand()?) http://en.wikipedia.org/wiki/Linear_congruential_generator by looking at the source (https://github.com/zmap/zmap/blob/master/src/cyclic.c#L164) it really looks like it, specifically a Lehmer random number generator. That would mean the period of the generator is not exactly 2^32 but it repeats 15 IPs... EDIT: Now I see it, it actually generates numbers over 2^32 but the code just skips them...nasty! I'm surely missing something, or else that's a usenix paper wasting 2 pages explaining rand()


It's not literally rand, but is a form of LCG, just tuned for IP addresses.

I agree with you about the formal impact of the "math" here. The irony is that I'm not impressed by the scanner stuff, and easily impressed by simple math!


A 32-bit LFSR generates a sequence of bits with period 2^32-1; what you really want is a Lehmer-like generator using arithmetic in GF(2^32).


You mathematicians, with your... understanding of the basic mathematics behind things we developers already use every day, you all think you're better than us, don't you? DON'T YOU?!

Can't I just be left to be impressed by a direct use of the math behind a crappy textbook random number generator in peace?

-.-


I'm not 100% sure if that was tongue-in-cheek or not, so I'll clarify: I was disagreeing with aortega. This is not a typical PRNG, where the seed generally marks the initial point of the same sequence. Choosing a random generator matters, because we want a different permutation every time, not simply a different starting point. It is akin to running a block cipher in OFB mode with a different key each time.

What they are doing modulo 2^32 + 15 is IMO the simplest way to achieve that goal. Yes, you could do it without overflowing 2^32 using a binary field; you could also cook your own mini block-cipher. But that increases the complexity of the code and is not really faster everywhere.


It was 100% tongue-in-cheek, but I'm happy to be extracting even more information from you by having my poor attempts at humor misconstrued. :)


>Choosing a random generator matters, because we want a different permutation every time, not simply a different starting point.

Ah thanks, now it makes more sense, I knew I was missing something. What about choosing different prime polys for a galois lfsr? I believe you will get the same result.


Also you could XOR the LFSR output with any number to get a different permutation each time.


What does "GF(2^32)" mean in this context?


It's the finite field with 2^32 elements. There are several equivalent representations, but we typically work with it as the polynomials with coefficients modulo 2 modulo an irreducible polynomial of degree 32, e.g., x^32 + x^7 + x^3 + x^2 + 1. For more, see [1].

[1] https://en.wikipedia.org/wiki/Finite_field_arithmetic


I used skip32.c, but I like their approach too.


+1 for skip32.c since it's O(1) space and O(1) time. With the prime approach, you may need to call it multiple times to reject numbers greater than 2^32-1. Also, a generalized Feistel algorithm may be used to generate permutations with fewer than 32 bits with the same constant space/time requirement and may be helpful in skipping specific IP ranges.


Thanks, the theory of generating secure-ish permutations of arbitrary size (at least, small ones) isn't that well developed and I hadn't considered generalizing feistel structures to other bit sizes. It's an interesting research topic IMHO. '

I have to admit that I'm slightly chagrined by the disparity of approach between a lame-ass pentester and an associate professor and two PHD candidates. Reverse elitism, it lives.

I still find the decision to release to be an interesting choice; I didn't because I felt that it would lead to an increase in Internet background radiation, and that basically it's an obvious approach and anyone working in this area will come to exactly the same conclusions. Beer soaked napkin calculations will lead everyone to exactly the same scanner. From my POV there was little benefit in giving this crap to people who wouldn't make the same calculations and write the same code.

What amused me for a while was how long we could get away with being the top "malicious" source on DShield [hint, this requires a ridiculously high packet rate - but it was less than 50k PPS :]. But if you use zmq and distribute your targets that way between a bunch of scan agents, you can get off the top 10 list. Also you can do a scanner like this with a lot less SLOC. The correct architecture is "ip distributor", "scan agent", "listener". They're sort of conflating.


FWIW:

http://www.cs.ucdavis.edu/~rogaway/papers/subset.pdf

(Marsh Ray tweeted this yesterday; it's an easy read)


Thanks for the link. Awhile back, I had implemented an n-bit cipher (where 1 <= n <=64) with an unbalanced Feistel network and used hash function as the basis of the round functions. I then wrap it within a cycle-walking function to form a cipher for any number smaller than 2^64. I did all this so I can generate scrambled 5-char path IDs of short URLs (e.g. test.com/xA7bc) with domain size of 62^5.


work in the multiplicative group modulo p for prime p > 2^32, find an appropriate generator, and iterate by multiplying with the generator mod p

Is that the same as a PRNG with period 2^32-1? I thought I read that idea in the Warhol Worm paper, but looking again it uses a block cipher so I was probably thinking of the Witty worm.


Well, it's obviously a PRNG. :)

The Sapphire worm used a similar approach, and got the math so distinctively wrong that Vern Paxson was able to fingerprint it, reverse the generator, and take a stab at "patient zero" (I think I'm remembering this correctly).


What kind of math knowledge is required to understand this? Abstract Algebra?

I've noticed recently I'm coming unstuck on particular CS problems (notably crypto as well), and have realised I need to further my math knowledge greatly.


When they say "scanning the whole Internet in 45 min" they mean scanning only one port of every IP address (for example sending a short GET request to port 80/tcp) over a Gigabit link:

2^32 (IP addresses) * 1 (port per IP) * 80 (bytes per packet) * 8 (bits per byte) / 1e9 (throughput in bit/sec) / 60 (sec per min) = 46 minutes (note: excluding multicast space, RFC 1918 space, etc, scanning time would be reduced down to ~35 min)

That's equivalent to "scanning all 65,535 ports of a /16 subnet in 45 min" which does sound less impressive...


  scanning all 65,535 ports of a /16 subnet in 45 min
...or in other words:

  scanning all ports in the reserved Class C range, 
  from 192.168.0.0 to 192.168.255.255, in 45 min


> ... "class C" ...

I realize lots of people are simply in the habit of saying "Class C" when what they really mean is a /24, "Class B" for a /16, etc. but classless routing[0] has been around for 20 years now and these terms need to go away.

[0]: http://en.wikipedia.org/wiki/Classless_Inter-Domain_Routing


Except that qwerty_asdf wasn't referring to a /24 subnet, (s)he was talking about one of the three address spaces defined in RFC 1918 and described thusly:

"Note that (in pre-CIDR notation) the first block is nothing but a single class A network number, while the second block is a set of 16 contiguous class B network numbers, and third block is a set of 256 contiguous class C network numbers."

So it is common for crusty old network engineers and sysadmins to refer to 192.168/16 as "the class C" private block, even when they understand that you can subnet it however you'd like.


> ... (s)he was talking about one of the three address spaces defined ...

Right, I realized that when s/he said "reserved Class C range". It was more of a general observation. I always forget I have to be extremely specific here on HN.


Why should the terms go away? Obviously the networks have been widely broken up and shuffled around, but I see nothing wrong with calling 11.5.0.0/16 a class B network.


Because classful networking went out the window 20 years ago.

Case in point: 11.5.0.0/16 is not a Class B network and never was. (11.0.0.0/8, however, was/"is" a Class A network).

/16 != Class B network.


I used 11 for a reason, yes.

It's not a class B network under the old system. Because under the old system 11.5.0.0/16 didn't exist.

But you can make such a network now, and if you were to classify its size the appropriate letter is 'B'.


In order to send a GET request you may want to first establish a TCP connection, so it's going to be at least three packets. Otherwise you're not going to receive any response.

Also, for these scans is quite common to just send a SYN packet and wait for the SYN/ACK to decide if the port is "open" or not.


> That's equivalent to "scanning all 65,535 ports of a /16 subnet in 45 min" which does sound less impressive...

Actually, the above is a much harder problem. Scanning a limited subnet requires congestion control in a way that scanning the whole Internet does not.


I'm somewhat surprised there is no reference to scanrand[1][2], a fast stateless syn scanner by Dan Kaminsky in 2002. It wasn't directly geared towards scanning the entire Internet, but instead scanning large subnets (like for a pen test of a /16 network... takes a long time to scan).

It is a bit obscure, but it did do tricks like encoding encrypted data in extra mutable fields (just the sequence number for scanrand) for validation purposes. Actually, scanrand 2.0 can apparently measure latency (without state!) by encoding timing information in the source port field, which zmap doesn't currently do.

I think this research is great, but I just hate to see interesting old projects get forgotten.

[1] http://dankaminsky.com/2002/11/18/77/ [2] http://www.sans.org/security-resources/idfaq/scanrand.php [3] http://s3.amazonaws.com/dmk/SBO_Hiver.ppt


I currently scan the entire internet once a week using a re-implementation of scanrand I did myself. (and will be switching to Zmap shortly)

There aren't as many people using it as you'd think because 1) finding a working download link is quite an exercise and 2) compiling paketto is near impossible except on Dan's machine. :)


check out dscan from dugsong, which was built around 2003 or so to address that problem, that dan doesn't often write portable code.

https://github.com/dugsong/dscan


The anonymously published whole Internet survey scanned 100 ports on every address as well as a few other things. It took something like 30,000 devices and months of work though so I guess this is pretty impressive.

http://internetcensus2012.bitbucket.org/paper.html


A friend of mine conveniently built a search around the data:

http://exfiltrated.com/querystart.php



As a phun side note. Reverse mapping government censorship of DNS. Since I think governments are starting to go down the slippery slope road of censorship here is how map out the censorship.

If you have the list of the whole internet servers which answers on http port 80. Then you can reverse map government censorship dns list. Ie you can find out what the government wants to censor by doing lookups in the censored dns and for example opendns on the servers ip that answers on port 80, then you diff the results from the dns servers and if you get different answers you find out the government black list.


"How to get dropped by your ISP in under an hour."

There has been a very positive trend recently in the quality of documentation, a move away from dry, man-style listing of options to more operational descriptions, tutorial, examples, a bit of hand-holding. Here's a Docker tutorial[1], still on the front page.

[1] https://www.docker.io/gettingstarted/


A move away from having man pages isn't necessarily a thing to be celebrated - a well-written man page is an extremely useful thing.

In the ideal case, you might have a quality man page that provides usage information and links to more detailed documentation (that would include tutorials, implementation info, etc.) on the web somewhere.


I suppose the internet generation never really took the time to read man pages (nor to write one for that matter).


That's too bad. When I was getting started, the best documentation available to me was the man pages (this was before I had an "always-on" broadband connection) and TLDP's "Linux HOWTO's". I printed many of them and took them with me to high school to study in class.

It's great that we have blogs and such nowadays where anyone and everyone can contribute their own documentation, guides, tutorials, etc., but there was something awesome about having a single, centralized, authoritative HOWTO covering a particular topic.


I've always thought tmux had a great man page. http://www.openbsd.org/cgi-bin/man.cgi?query=tmux&sektion=1


I pick distros based on how nice the man pages are. It's one reason I love the BSDs.


Oh man. OpenBSD's man pages. Whoever was the author. He was the best man page writer.


I have gotten so used to just googling for information that when I'm working on OpenBSD I have to consciously remind myself that the best, most useful, and authoritative references are already installed locally. And all ad- and tracking-free.


This is very cool, but I'm curious as to why they stuck with ZMap. Nmap's graphical interface, Zenmap, sounds very similar and has a huge following among security researchers.

For the technically inclined, a good white paper describing the advantages of IPv4-wide scanning for security reconaissance and the advantages of ZMap vs other tools like NMap can be found here:

https://www.usenix.org/system/files/conference/usenixsecurit...


Probably because the author's first name is Zakir.


Wonder why they didn't mention unicornscan in their paper?

User-land Distributed Portscanner released in 2005: http://unicornscan.com

Defcon Presentation Introducing Unicornscan from 2005: http://www.youtube.com/watch?v=ZdCEo6yoEWA


1) "ZMap supports both blacklisting and whitelisting network prefixes. If ZMap is not provided with blacklist or whitelist parameters, ZMap will scan all IPv4 address (including local, reserved, and multicast addresses)."

What an absolutely stupid default setting. Thanks for giving a bunch of noobs a simple IPMC DOS application. If this thing gets popular it will soon be the bane of network admins everywhere.

2) There are at least 2 obvious omissions from their default blacklist file. There might be more but these are the obvious ones that come to mind. class-E 240.0.0.0/4 CGN 100.64.0.0/10

3) Can someone explain to me why I wouldn't just want to use nmap to do this same thing? Why do we need a new tool for this?


In case anyone is wondering if this worked on IPv6, and the rate was constant (45 minutes to do 2^32) it would take about 2 octillion days to scan all of the IPv6 addresses.


gonna try it over university network


Make sure to do a write-up when you regain internet access what it was like to be expelled for 'running just a script'.


Yeah our school basically had one big LAN (everything got a publicly routable IPv4 address, but nothing within the university was firewalled from anything else). There were networked printers though in the computer labs, some of these labs being on the first floor of the dorms. If you needed to print slides before class (and professors always posted the slides only like an hour before class), you had to go to the lab, hit print, and wait 15 minutes for the printers to get to your job in the queue and the staff to actually fetch it and lay it out. Definitely added to the time it took to get to class.

Since I was already accessing my dorm computer via Samba in the labs (I know, dumb idea in hindsight, even with a password, but this was 2004), I decided to figure out a way to print directly from my room and then just grab it on the way to class. Long story short I ran a port scan on the computer lab to find the printer IPs and had my network port turned off within minutes (I had the IPs though!). Ended up having to go to some office and explain what I was doing. Got turned back on a few days later.

The upside was that I eventually was able to print from my room as long as I converted whatever it was to postscript first. The downside was that I didn't need to know the printer IPs after all (the university's unix server already had the printers setup... just piped it through ssh to it).


Now I'm curious. It's not safe to run this at home, on any sort of paid hosting, on a university network, where is it safe to run this program?

If you work for an ISP?

If you ask the University nicely?


On a colocation facility that you own using a net port that you pay for yourself?

As soon as you expose someone downstream to stuff like this you're asking for being disconnected. If you ask your University nicely they'll likely refuse unless you state a goal you wish to achieve.

If using a script to download documents qualifies as hacking then hitting all of the internet with a portscanner is likely going to get you network administrator attention of the entirely wrong kind. And that's because they in turn will get some flak from the outside world.


I did something similar when I was researching search technology at my university[0]. Then when coming back from lunch some weeks later I found two gays from the it-department had locked them self into our office to investigating what we where doing. Apparently we had hit a lot of honeypots :)

0: http://www.boitho.com/


Why would you assume they were gay?


Lol. I meant guys :) English is not my first languages ( gays/guys, "let's have a beer/bear" etc don't get caught in most spell checkers... ).


Just for a moment I thought I was reading a plot for a gay porno. LOL


Good thing TCP/IP legalized same-sex connections. NCP was not so open minded.

The old NCP networking protocol required that connect and listen sockets must have different parity gender (one even, the other odd -- I can't remember which was which, or if it mattered -- they just had to be different). The act of trying to connect an even socket to another even socket, or an odd socket to another odd socket, was called "homosocketuality", and it was strictly forbidden by internet protocols, and was called the "Anita Bryant feature".

http://www.saildart.org/IMPSER.DOC[SS,SYS][1]

Illegal gender in RFC, host hhh/iii, link 0

The host is trying to engage us in homosocketuality. Since this is against the laws of God and ARPA, we naturally refuse to consent to it.

http://www.saildart.org/FTP.OLD[S,NET]1[2]

    ; Try to initiate connection

    loginj:
            init log,17
            sixbit /IMP/
            0
            jrst noinit
            setzm conecb
            setom conecb+lsloc
            move ac3,hostno
            movem ac3,conecb+hloc
            setom conecb+wfloc
            movei ac3,40
            movem ac3,conecb+bsloc
            move ac3,consck
            trnn ac3,1
                jrst gayskt            ; only heterosocketuals can win!
             movem ac3,conecb+fsloc
             mtape log,[
                   =15
                    byte (6) 2,24,0,7,7
                         ]          ; Time out CLS, RFNM, RFC, and INPut

    [...]

    gayskt:    outstr [asciz/Homosocketuality is prohibited (the Anita Bryant feature)

    /]

        ife rsexec,<jrst rstart;>exit       1,
(The code above adds the connect and listen socket numbers together, which results in bit 0 being 0 if they are the same gender, then TRNN is "test bits right, no change, skip if non zero", which skips the next instruction (jrst gayskt) if they different sex.)


Hitting a lot of honeypots sounds pretty straight to me. Unless you're female of course.


How about your neighbor's public wifi? (Neighbor could be a few cities over...)


It's interesting to see that this supports pushing to redis lists, which is clarified as being an 'output module.'

https://zmap.io/documentation.html#extending

Very good, very extensible.




Consider applying for YC's W25 batch! Applications are open till Nov 12.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: