The Australian Square Kilometre Array Pathfinder hits the big-data highway (blog.csiro.au)
126 points by dgtlmoon on Jan 16, 2017 | 49 comments



Fun fact: the CSIRO is the only Australian organisation to have a second-level domain name directly under .au, with no .com in between. This is because they were the first organisation to use a domain name in Australia and got in before regulation required names to follow the .{com|gov|edu|etc}.au pattern.


That is a surprisingly fun fact.


Actually, it generates 5.2 TB/sec.

"Its antennas are now churning out 5.2 terabytes of data per second"


In either case, I doubt it is "15% of the internet's current data rate" as they claim. E.g. the newish submarine cable between the US and Japan has 60 Tb/s of capacity. And that's just one cable.


A large chunk of that capacity will be for private use, not for the capital-I Internet. 280 Tbps is about right for interdomain Internet traffic; it lines up with Cisco's predictions/measurements: http://www.cisco.com/c/en/us/solutions/collateral/service-pr...
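
Back-of-envelope, treating that ~280 Tbps interdomain figure as given, the "15%" claim does line up:

    # Sanity check of the "15%" claim, assuming ~280 Tb/s of interdomain
    # Internet traffic as cited above.
    askap_tbps = 5.2 * 8              # 5.2 TB/s expressed in terabits per second
    internet_tbps = 280               # assumed aggregate interdomain traffic
    print(askap_tbps / internet_tbps) # ~0.149, i.e. roughly 15%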


> A large chunk of that capacity will be for private use,

What private use requires that much bandwidth?


Inter-region traffic of large cloud providers.


I'd call the internal movement of data within Google/Facebook private networks part of the internet. When I do a Google search or watch a YouTube video, I know I am generating more traffic than just the HTTP request and response at the edge of Google's network. The inner workings of the largest private networks often service the internet, if not by name, and therefore should be included in its total bandwidth. But good luck measuring it.


Nah, it is mostly traffic to manage the computing activity of large enterprises, not directly related to what we consider the Internet.


While it may be that a good chunk of private traffic is in the service of internet services, I think calling it 'the internet' is wrong technically and in spirit.

In both spirit and technical definition, the internet is fundamentally a connection between private networks. Initially it was some universities, ARPA and so on. Now it's Google, Facebook, Verizon, Amazon, AT&T and so on.

What happens on these networks is private network traffic. If some of that traffic travels long distances, that doesn't make it any less private. If you count the inter-region traffic of cloud providers, why not count the inter-AZ traffic within AWS regions? Why not the inter-rack traffic, and so on? The difference between the internet and private networks is not about distance travelled, it's about ownership. For instance, if traffic goes between Google and AWS at a peering location in CA, then that is internet traffic even if the data doesn't go very far.


To clarify definitions, data rate will always be less than or equal to capacity. In practice, average rate will be much less than peak capacity.


Also, that is the raw output from the antenna array. This is then analysed before storage, compressing it down to a much more manageable few GB/s.


It's analysed and stored in Melbourne, Victoria. The SKA Pathfinder is in Western Australia, outside Geraldton. Hence the data transfer statistics, as the two sites are roughly 4,000 km from one another.


How do you even begin to deal with that much data per second? What do they do with it? Storing more than a few minutes of data isn't practical, is it? Do they process it in real time?


You're right on the last point: the raw data rate is so high that they have to process it on the fly, which has its drawbacks.

The reason these instruments output so much data is that correlating the signal between each unique pair of antennas is O(n^2) in the number of antennas. Now 36 antennas (when complete) is not very large, approximately the size of the VLA[1], but because of the very unique phased array feeds on ASKAP mentioned in the article, each antenna has not just one receiver but a little camera of receivers. So if it's an 8x8 feed/receiver array, that's 64 times the data rate of a 36-antenna system with single receivers. For comparison, I've had projects on the VLA that are about 10 GB/s, but it has only a single feed/receiver per antenna.

At ASKAP I believe they have to process/"reduce" the data on the fly, which in the most extreme case (likely necessary for the real Square Kilometer Array) means not storing the raw correlations between antennas at all but converting them to images immediately, which is lossy. Normally this reduction to images is optimized by carefully tuning the imaging parameters and iterating to find the best, but that can't be done if the "visibilities"/correlations have been thrown away.

[1] Edit: saying that the Very Large Array does not have a very large number of antennas is actually pretty ironic.
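
A minimal sketch of the scaling described above, using the hypothetical 8x8 feed array from my comment rather than ASKAP's actual PAF layout:

    # Unique antenna pairs grow as n(n-1)/2; a phased array feed multiplies
    # the number of visibility streams again. The 8x8 feed is hypothetical.
    n_ant = 36
    baselines = n_ant * (n_ant - 1) // 2  # 630 unique pairs
    feed_elements = 8 * 8                 # hypothetical PAF size
    print(baselines)                      # 630 for single-feed dishes
    print(baselines * feed_elements)      # ~64x the streams with PAFs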


Whoops! I misremembered in my post above: data rates on the VLA should have been around 20 MB/s, not 10 GB/s, but I can't seem to edit my response above now.


This from CSIRO has a lot of detail answering your question – CSIRO ASKAP Science Data Archive: Overview, Requirements and Use Cases

http://www.atnf.csiro.au/projects/askap/ASKAP_SW_0017_v1.0_d...


Perfect, and it shows how the 36 antennas get correlated down to 2.5 gigabytes per second.


From the article:

> Once out of the telescope, the data is going through a new, almost automatic data-processing system we’ve developed.

Of course, they don't go into detail (instead making a very odd analogy to a bread machine), but I think the implication is that the data is processed immediately before being recorded. A similar approach is used at the LHC.

A better analogy might be: when you take a picture on your phone, it comes out of the camera chip in a form equivalent to a bitmap file. But those are big and a pain to deal with, so it's converted to JPEG or something else reasonable before being stored, perhaps losing a bit of irrelevant noise in the process.


In parallel :-) Divide that number by the number of telescopes in operation (in this case 12) and you're fetching about 433 GB/sec per telescope. Imagine that is I/Q data coming from the antenna. Now you're only looking for neutral hydrogen, which has a much narrower spectrum, so you process that data with a band-pass decimating filter; that takes you down to a few GB/sec. Then take the redshift into account and process everything that is in your 'target range' (a redshift of 0.26) plus or minus a bit, and now you're down to perhaps megabytes per second per telescope.
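
A minimal sketch of that filter-and-decimate step, with made-up sample rates and a plain low-pass decimator standing in for the band-pass version I described:

    # Filter and decimate to cut the data rate; numbers are illustrative only.
    import numpy as np
    from scipy import signal

    fs = 1_000_000                      # pretend input sample rate (Hz)
    x = np.random.randn(fs)             # one second of fake antenna voltages
    decimated = signal.decimate(x, 10)  # low-pass filter, keep every 10th sample
    print(len(x), len(decimated))       # 1000000 -> 100000: 10x lower rate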


The example of reducing the data rate by configuring the system to look only in a narrow frequency range of the spectrum is perceptive! Unfortunately, though, you cannot divide the data rate down by the number of antennas by processing each antenna in parallel. Interferometers such as this must correlate the signals between each unique pair of antennas, and in approximately real time (in the 1 sec to 1 min range). The data rate comes from sending the signal of each individual antenna by fiber or waveguide to a special FPGA/ASIC supercomputer called a correlator, which computes the correlated signal between each unique pair of antennas. So it all goes into the hopper at once.

Now what you're saying can actually be done, and is done on very long baseline facilities like the VLBA or the Event Horizon Telescope (EHT). As an example, the EHT uses hydrogen atomic clocks to essentially time-stamp the signal from a single antenna onto a 50 TB disk pack, where the antennas are off by themselves everywhere from Hawaii and Spain to the South Pole. They then bring all the disk packs to MIT, where the correlation is done from the time-stamped data streams. This process is very complex, however, and it requires more total processing in the end. The extra computation comes from having to repeatedly process the data, searching for subtle time offsets and antenna position offsets to find the correlation (this is how interferometers are used to measure those 1 cm/year continental drifts).
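
For a sense of what that pairwise correlation looks like in the simplest possible terms, here's a toy "FX" correlator sketch with fake data and tiny sizes; real correlators are the specialised FPGA/ASIC hardware described above:

    # Toy FX correlator: FFT each antenna stream ("F"), then cross-multiply
    # every unique pair ("X") to form visibilities. Illustrative only.
    import numpy as np
    from itertools import combinations

    n_ant, n_samples = 4, 1024
    streams = np.random.randn(n_ant, n_samples)   # fake voltage data per antenna
    spectra = np.fft.rfft(streams, axis=1)        # the "F" step
    visibilities = {
        (i, j): spectra[i] * np.conj(spectra[j])  # the "X" step, per pair
        for i, j in combinations(range(n_ant), 2)
    }
    print(len(visibilities))                      # n_ant*(n_ant-1)//2 = 6 pairs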


If it's per unique pair of telescopes, then will the data rate scale O(n^2) with the number of dishes? If that is the case, then holy balls, the final SKA will produce a lot of data.


It's pretty crazy! And yes, the number of measurements scales as N(N-1)/2, so O(N^2). The funny part is that the SKA has been in design for >15 years at this point, and they had to bet that computational hardware would be good enough to make it work by the time they actually start building it! The hardware in 2000 wouldn't have been anywhere close to capable of powering the SKA's correlator. This is also a game they had to play with the Large Synoptic Survey Telescope (LSST)[1], breaking ground now, with its relatively pedestrian 20 TB a night for archiving.

[1] https://lsst.org
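
To see how quickly that N(N-1)/2 term blows up (the antenna counts below are just round illustrative numbers, not the SKA's actual design):

    # Number of unique antenna pairs (baselines) for a few array sizes.
    def n_baselines(n):
        return n * (n - 1) // 2

    for n in (36, 100, 200, 500):
        print(n, n_baselines(n))   # 630, 4950, 19900, 124750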


For those interested in the architecture used to process this, take a look at http://www.slideshare.net/SparkSummit/spark-at-nasajplchris-..., especially slide 29


Still waiting for us to put radio telescopes at L3 and L2, giving mankind a 14 Tm interferometry baseline. The current 12 Mm baseline is nice, but Earth is so small.


Radio scintillation from the interstellar medium becomes an issue; I daresay the radio equivalent of "active optics" could help.

I did mention this at an outreach talk with Astroblack morphologies[0] and Tim O'Brien[1], who pointed out that ISM scintillation was on his slides.

I did think up hundreds of radio telescopes in a birdcage orbit at the distance of Jupiter; impractical of course, but space VLBI does reward thinking big.

Isn't it worth having telescopes well out of the ecliptic plane? I seem to recall VLBI is about filling in the UV plane. It was a long time ago, though, and only a second-year physics project.

[0] - http://www.artscatalyst.org/astro-black-morphologies-flow-mo...

[1] - http://www.jb.man.ac.uk/~tob/


We have something along those lines, albeit not quite as long-baseline. Spektr-R is a radio satellite in an elliptical orbit that gives baselines up to 350,000 km (0.35 Gm), linked to Earth-based facilities.

https://en.wikipedia.org/wiki/Spektr-R


Note that the actual collecting area will be 4072 square meters [1]. 12 out of 36 antennas are currently working.

That's roughly 31 Kbps per square millimeter.

Or 1 bit per second per 33 square microns.

[1] http://www.atnf.csiro.au/projects/askap/specs.html
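
A quick back-of-envelope reproduction of those figures, assuming the per-area numbers use only the 12 working antennas (about a third of the full 4072 m^2):

    # 5.2 TB/s spread over roughly a third of the full collecting area.
    rate_bits = 5.2e12 * 8          # 5.2 TB/s in bits per second
    area_mm2 = (4072 / 3) * 1e6     # ~1357 m^2 in square millimetres
    per_mm2 = rate_bits / area_mm2
    print(per_mm2)                  # ~3.1e4 bit/s per mm^2, i.e. ~31 Kbps
    print(1e6 / per_mm2)            # ~33 square microns per bit/s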


For Americans, that's 2.00773TB/s per Square Mile.


Are those Metric terabytes (1000^4), or in freedom units (1024^4)?


Should be tebibytes (TiB) for the latter, though binary prefixes don't have much adoption outside of the software world.

http://www.physics.nist.gov/cuu/Units/binary.html


Or inside of the software world, for that matter.


Huh? 1024^4 is how Windows defines a terabyte; accordingly, that's what most people think a terabyte is. OS X was the first OS to switch to 1000^4, with Snow Leopard (2009). Different Linux system tools use one standard or the other.
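
For the 5.2 "terabytes" in question, the gap between the two definitions is about 10%:

    TB = 1000**4    # SI terabyte
    TiB = 1024**4   # tebibyte (the 1024^4 unit Windows reports as a "terabyte")
    print(5.2 * TB / TiB)   # ~4.73, so 5.2 TB/s is only about 4.7 TiB/s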


I was referring to the abbreviation "TiB" versus "TB" for 1024^4.


And 6.353 x 10^16 nibbles per furlong^2 per fortnight


I think you mean 13.468 TB/s per square mile.
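
The arithmetic, assuming the rate is normalised over one nominal square kilometre of collecting area:

    sq_km_per_sq_mile = 2.589988
    print(5.2 * sq_km_per_sq_mile)   # ~13.468 TB/s per square mile
    print(5.2 / sq_km_per_sq_mile)   # ~2.008 -- the earlier figure looks like
                                     # the conversion applied the wrong way round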


I wonder how much of that can be compressed for transfer and storage.


Please note, this is for a Square Kilometer Array pathfinder (the Australian Square Kilometre Array Pathfinder, ASKAP), not the Square Kilometer Array. Construction hasn't even begun on the latter, AFAIK.




Cisco went to court... and lost. Here's an article where I learned that there's a stylized version of the Golden Gate in Cisco's logo and that CSIRO's logo is the shape of Australia.

http://www.itnews.com.au/news/csiro-beats-cisco-in-fight-ove...


I'm embarrassed to admit that I never realized that "Cisco" was, in addition to the company's name, short/slang for San Francisco, nor that their logo was meant to evoke the Golden Gate Bridge.


Seconded. Never really thought about the word "Cisco", and assumed the logo was intended to evoke RF on an oscilloscope or some other visualization of electronic signaling.


Honestly, I never linked the Cisco logo to the Golden Gate. It looked like the meters of an audio output or something else electronic, like an o-scope. The proportions seem off, probably because I never see the bridge from the side.

I'm betting they want to play down the link to the city name. "Frisco" would be a weak trademark because of its common use; "Cisco" is akin to the "Syfy" channel. You might win on the letters, but not the spoken sounds... as opposed to the famous Canadian case of "MikeRoweSoft.com", where sound trumped letters.


You win on letters, you lose on i18n. "Syfy" in Polish is plural of "syf", which means disorder, dirt, something ugly, or (probably also the origin of the word) syphilis.

Obviously, this made telling some friends about some shows I watched rather awkward. ;).


And that is why I read HN.


FWIW, the CSIRO logo is in the shape of Australia (the dot representing Tasmania), as it's an Australian organisation. It also sort of looks like WiFi signal bars to me, and I guess the CSIRO has some claim to that, seeing as they were involved in inventing WiFi: https://www.csiro.au/en/Research/D61/Areas/Wireless-and-netw...


5.2 TB/s, not Tb/s. BIG difference!


We've reverted the title from the submitted “New Australian Square Kilometer Array Generates 5.2Tb/s”. One of the reasons the guidelines ask us to prefer original titles is that it's surprisingly difficult to generate new ones that remain accurate.



