Sysadmin war story: the network ate my font (verticalsysadmin.com)
222 points by atsaloli on Sept 15, 2017 | 66 comments



Yeah? Try dealing with a printer that has to have a font file sent to it with an lp command from cron every 15 minutes, or else checks won't be formatted correctly. AND there is a physical device (either USB on the printer or inline on the Ethernet in front of the printer) which has the MICR fonts stored.

And then the printer you are migrating the MICR device to has decided it doesn't understand the PCL in the font file anymore.

This is 2017 and I just dealt with this today.
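
The recurring job described above is roughly this shape - a sketch meant to be invoked from cron; the printer name and font path are invented:

    #!/usr/bin/env python3
    # Re-send the MICR font file to the cheque printer as a raw job so
    # the printer's copy never goes stale. Run from cron every 15 minutes.
    import subprocess

    subprocess.run(
        ["lp", "-d", "check_printer", "-o", "raw", "/opt/micr/micr_font.pcl"],
        check=True,
    )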


I should point out that a co-worker tried to convince people we could easily do it with a generated PDF and throw away the damn $500 MICR device but that was swiftly deemed 'impractical'.


The engineering time dealing with that printer is probably more expensive than just buying a simpler/better one.


Close. To be fair the printer is supposed to support PCL5, and this is the first time I've witnessed this happen.


If this is so mission critical, why not get a proper PostScript printer with an onboard filesystem for font storage? It doesn't make much sense to run a business on consumer junk.


This sort of stuff happened all the time when I worked in tech support for ad agencies. Printing font problems seemed to be 99% of the job, some days. PDFs helped — but only if the person making them remembered to embed the fonts. Usually you would simply make an EPS (Encapsulated PostScript file), which would work most of the time — as long as the receiving printer had sufficient memory for huge print files. Can't tell you how many thousands of times art directors who should've known better would just send Quark or PageMaker or (later) InDesign files without the fonts. In fact, there were (are?) preflighting programs designed just to solve this exact problem. They'd look at your files and determine: Are your fonts there? Are your images there? How will this print? etc., etc.


There are indeed PDF preflight tools to this day. E.g. on Linux you can run `pdffonts` and check that everything is embedded. I believe it's still very common, to the point that my local university's PhD thesis print delivery checklist includes it.


Scribus (a DTP program for Linux that I love) has it built right in, and it's good.

They also pride themselves on the quality of their PDF output (and, IIRC, full support for PDF/A).


There is nothing popular and free though. I built a simple one myself some time ago: https://bitbucket.org/qznc/vorflug

The font check alone can be done in a few lines of Python: http://beza1e1.tuxen.de/articles/preflight.html
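
A minimal version of that check, assuming poppler's pdffonts is on the PATH (the column parsing is approximate):

    import subprocess
    import sys

    def fonts_all_embedded(pdf_path: str) -> bool:
        # pdffonts prints one row per font after two header lines; the
        # "emb" column ("yes"/"no") is the 5th field from the right,
        # ahead of "sub", "uni" and the two-token object ID.
        out = subprocess.run(["pdffonts", pdf_path], capture_output=True,
                             text=True, check=True).stdout
        return all(line.split()[-5] == "yes"
                   for line in out.splitlines()[2:] if line.strip())

    if __name__ == "__main__":
        sys.exit(0 if fonts_all_embedded(sys.argv[1]) else 1)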


This leads to an obvious question: "Why did the programmers not always just embed the font?"

And it turns out that the answer will probably be, "well, we asked the lawyers, and you need rights from the font owner to do so, and it makes files bigger, and most stuff is in whatever Microsoft Office uses as their default font this year anyway."

And so we're torn between "yes, of course the designer who built the font should get paid" and "a sane default became a checkbox because of US copyright law".


So here is how it happens...

Graphic designer designs cheque. For the design to be signed off, he/she includes 'lorem ipsum' placeholder text for the special numbers at the bottom.

Design gets signed off, a template is made for the programmers to use.

In the code a third party library is used to make it 'easy' to create a PDF. This process consists of opening the template, adding a line of text to it and printing it as a PDF to another file, ready for the printer.
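
That pattern looks something like this - a sketch using pypdf and reportlab; the template name, text and coordinates are all invented:

    import io

    from pypdf import PdfReader, PdfWriter
    from reportlab.pdfgen import canvas

    # Draw the one line of text onto an in-memory overlay page.
    buf = io.BytesIO()
    c = canvas.Canvas(buf)
    c.setFont("Helvetica", 10)
    c.drawString(72, 60, "0123456789")  # position of the line is invented
    c.save()
    buf.seek(0)

    # Stamp the overlay onto the designer's template and write the result.
    template = PdfReader("cheque_template.pdf").pages[0]
    template.merge_page(PdfReader(buf).pages[0])
    writer = PdfWriter()
    writer.add_page(template)
    with open("cheque_out.pdf", "wb") as f:
        writer.write(f)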

A little while later the graphic designer edits the template to make a few amends. The file is re-saved, and this fresh copy no longer contains fonts that aren't used in the document. The placeholder text being long gone, its font is not saved. The other fonts for the cheque are; they moved across and were catered for by the software in the updated template.

The software runs exactly as before; just the template file has been updated. However, now the font is not found unless it is installed or cached on the computer or printer.

The programmer never had to embed the font; his/her third-party library abstracted that requirement away. The programmer had worked with the library before and knew that it was best to use Helvetica, because PDF treats that as a built-in font and therefore does not need to bloat the document with embedded font data. Any other font would add megabytes to the document. So there was probably no oversight by the programmer.
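
Concretely, in reportlab terms (a sketch; 'micr.ttf' is a hypothetical font file):

    from reportlab.pdfgen import canvas
    from reportlab.pdfbase import pdfmetrics
    from reportlab.pdfbase.ttfonts import TTFont

    c = canvas.Canvas("cheque.pdf")

    # Helvetica is one of the standard PDF base fonts: every viewer and
    # printer must supply it, so nothing gets embedded and the file
    # stays small.
    c.setFont("Helvetica", 12)
    c.drawString(72, 720, "Pay to the order of")

    # Any other font must be registered, and its font program is
    # embedded into the file, adding roughly the size of the font itself.
    pdfmetrics.registerFont(TTFont("MICR", "micr.ttf"))  # hypothetical file
    c.setFont("MICR", 12)
    c.drawString(72, 700, "0123456789")

    c.save()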

However there may have been a micro-manager who was 'responsible' for micro-managing the update to the template. This probably involved meetings and conference calls and deadlines for 'the project' on the whiteboard. Not wishing to overstretch programmer resources, the micro-manager took it upon himself to make sure the programmer was not 'interrupted', so he got another lackey to upload the new cheque template. This all worked fine initially.

Had there not been a micro-manager, the graphic designer would have had to work with the programmer, without the micro-manager or his lackey. The programmer would have picked up on the smaller file size, as this would be a noticeable change. Instinctively the programmer would have made a test run and, not having the fonts on his/her dev box, would have detected the problem right away. Meanwhile, the lackey, with no knowledge of things like version control, just uploaded the template as told, blissfully unaware of the requirement to 'check your work'.

Why didn't the micro-manager check whether the font was embedded? That is what I want to know.


That one hit way too close to home. I'm the lackey in that story 40 hours a week.


Maybe because PDF has been around a long time and it used to be that embedding even a basic font would have been a MASSIVE increase in file size?


While we're on the subject, PDF software typically supports embedding just a subset of the font, containing only the characters used in the document, to save space.
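
For example, with the fontTools library (a sketch; the input font file is hypothetical):

    from fontTools import subset
    from fontTools.ttLib import TTFont

    font = TTFont("SomeFont.ttf")  # hypothetical input font
    subsetter = subset.Subsetter()
    # Keep only the glyphs needed for the text actually on the page.
    subsetter.populate(text="Pay to the order of 0123456789.,$")
    subsetter.subset(font)
    font.save("SomeFont.subset.ttf")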


Really? Neat.


Why don't PDFs just render the text and store it as some form of vector image file?


PDFs come from PostScript. PostScript was invented in 1984. The Apple LaserWriter, introduced in 1985, was the biggest selling early PostScript printer. It had a 12 MHz processor, 1.5 MB of memory, and communicated at 0.225 Mbps.

To make this work effectively, printers would cache fonts. That saved on overall file size, which was important for storage and transmission. But the real driver was rendering speed. Most documents are pages of text in a small number of sizes, drawn from a small set of letters.

If you're going to print 300 lower-case "e" characters, all in 10.5 point Times New Roman, it would have been ridiculous to do the hard work of rendering the bitmap from vector each time. You render it once, cache the bitmap, and then just plop the bitmap in the right spot.
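
The idea, reduced to a sketch (render_glyph is a stand-in for the real, expensive rasterizer):

    # Cache rendered glyph bitmaps keyed by (font, size, character), so
    # the vector-to-bitmap work happens once per combination.
    _cache: dict = {}

    def render_glyph(font: str, size: float, ch: str) -> bytes:
        # Stand-in for the real rasterization step.
        return f"{font}@{size}:{ch}".encode()

    def get_glyph_bitmap(font: str, size: float, ch: str) -> bytes:
        key = (font, size, ch)
        if key not in _cache:
            _cache[key] = render_glyph(font, size, ch)
        return _cache[key]

    # Printing 300 identical "e"s rasterizes once and blits 299 times.
    for _ in range(300):
        get_glyph_bitmap("Times New Roman", 10.5, "e")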

I know this because circa 1993 one client had me build a custom font that varied letterforms slightly to mimic a hand-lettered effect. They ran these weekly newspaper ads for their big wine store, and they were paying a guy to hand-draw the whole thing. They wanted to keep the casual look, but save on the cost. (And I presume the guy was kinda tired of writing the same things over and over, but I never met him.)

I learned enough PostScript to make it happen, decomposed each letter into strokes, and then drew the strokes with slightly different alignment each time. It worked fine in simple tests, but the first time I rendered a page, I thought I had broken the printer. Instead of the printer's top speed of 8 pages/minute, a full page took over 15 minutes.

So as usual with "why don't they just" questions, the answer is, "because 'just' is sweeping some things under the rug". It was harder than it looked at first glance.


Fonts can have different glyphs/hinting at different sizes and you don't know how big a PDF will be rendered.


That also takes more space. Understandably, when memory was counted in kilobytes and disks in megabytes, file formats were designed differently than they are today.


Usually you store the text anyway, so you can select and search it, or have it read by accessibility tools.

And most of the time the letter spacing is the default given by the font, so you only have to encode the position of the beginning of each run of text. That makes it pretty space efficient, too.
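
A conceptual sketch of why that's compact (plain Python, not actual PDF syntax):

    # One origin per run of text; per-glyph positions fall out of the
    # font's own advance widths, so they are never stored.
    run = {"x": 72.0, "y": 720.0, "font": "Helvetica", "size": 12,
           "text": "Pay to the order of"}

    # A flattened vector image would instead carry geometry for every
    # glyph occurrence - many times the size of the string itself.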


A font is some sort of vector image file.


Not always. Bitmap fonts are a thing.


> This leads to an obvious question: "why did the programmers not always just embed the font?".

Historically the size of hard disks has been very limited, so it was important to save files with as few bytes as possible.

Also, the serial transfer speed to the printer was very low, so it was important to limit how much data was sent for each print job.


I love it when you are troubleshooting and question how it ever worked in the first place.


The average person is amazed when tech breaks. Those who understand it are amazed it works at all.



For anyone who hasn't heard of that one before: it should not be confused with a Heisenbug, which is approximately the opposite thing.

http://catb.org/jargon/html/H/heisenbug.html


The famous Six Stages of Debugging https://news.ycombinator.com/item?id=6477187

    1. That can't happen.
    2. That doesn't happen on my machine.
    3. That shouldn't happen.
    4. Why does that happen?
    5. Oh, I see.
    6. How did that ever work?


I've never felt so alone as when I'm worried that a thing that shouldn't work, does.


I know.. blows my mind sometimes.


I just spent all day yesterday debugging why my web app's font-awesome/material design icons weren't showing on IE while they were showing in Chrome, FF, and Edge. Turned out to be corporate policy to disable font downloading in IE. https://www.stigviewer.com/stig/microsoft_internet_explorer_...


Ouch.

To be fair, fonts really do pave the way for malware into your kernel - https://googleprojectzero.blogspot.no/2015/07/one-font-vulne...


I once (as a junior developer) had to write some software to format a massive text file for printing to an IBM line-printer (1403, model N1) .. this (1M) text file had to be checked a few times before it was printed, and that meant opening it up in an editor and verifying the data.

I got pretty sick of opening this file after half a day, because it had tons and tons of CR/LF's and my editor at the time rendered these with strange characters on the screen. I didn't like that, and, being the junior guy, I just replaced the LF's with 00's. For some reason, this just worked fine in my local environment, and I was able to validate the data in the file before sending it off to the spool for printing later in the afternoon.
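
The munge amounted to this (a sketch; the filename is invented):

    # The editor showed the LF's as odd glyphs, so strip them: CR/LF
    # pairs become CR/NUL. The carriage still returns, but the paper
    # will no longer advance...
    data = open("report.txt", "rb").read()
    with open("report.txt", "wb") as f:
        f.write(data.replace(b"\n", b"\x00"))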

About 3 minutes after I closed the job, the building got a fire alarm, and we all had to exit. Apparently there had been smoke detected in the operations room, where the printer was located, so the Halon systems went off, and we went into full-blown "Ops Reset" mode.

After an hour of hanging around the parking lot, I was called in by the head honchos in the Ops Room, sat down in front of the printer, and told to explain myself.

Well, turns out, I was responsible. The lack of LF's in the text file meant that the printer was printing - as fast as it could - every single line on top of the very first line: the CR's returned the carriage, but the paper never advanced. After a few minutes of that, the printer simply caught fire.

Oh man, since that day (mid-80's), I've eschewed any job that requires me to deal with printers, and I've been anti-printer ever since. ;)


Love those stories. Read some similar ones in the past. Nowadays, the first thing I look into when debugging is the cache.


Thanks for the kind words!


The darkest ring of hell would be tech support for wireless (Bluetooth?) printers.


In a casino full of smokers and devices interfering with those signals.


We had a printer that did checks and kept the signature on a USB drive. At some point the printer started printing a distorted signature - it ended up using the BMP signature and not the PNG file. It changed out of the blue somehow... both images were fine, but the printer had a bug in how it decoded BMP files.


> 508 Resource Limit Is Reached

The network ate my webpage...


Update: should be good now. The engineer logged in and killed a couple of stuck processes. In 2017, folks.


> Turns out the printer had a cache for fonts and was using the font cached from the earlier check image which included the font!

Dang cache.


I upvoted simply because the title was too good not to.


Thanks!! The network ate the font is actually how the problem was reported. :-) That was a fun break from not breaking the HP-UX server running the 24/7 factory and from babysitting various Linux web app server issues at the colo. :-)


I usually love these kinds of war stories, but this one was anticlimactic. This would have been the first instinct of anyone who's ever tried to open a PowerPoint they got over e-mail – "oh, one printer shows the wrong font? I bet the font's not bundled, and one machine just happens to already have it."


I thought of that, but I have to say I wouldn't have thought about the printer's font cache. I didn't even know they had one. Then again, I try to avoid printers as much as I can :)


Same! You live, you learn. Thanks!


Author here. I've opened a ton of PowerPoint presentations (God help me) but this was the first time I've run into this issue. :) I usually crawl around inside of Web services / infrastructures while troubleshooting.


Yeah, I think it's interesting because this maybe isn't obvious when approaching with a network-guy mindset. A workstation guy or printer guy would probably check this first thing.


Indeed, my background was in administering Internet services (Web, mail, DNS, netnews, etc.). This was my first time troubleshooting a printing issue professionally. I was the only UNIX sysadmin in the shop, and I was hired in as a "Network consultant" because they didn't have a slot for a Unix sysadmin -- didn't know what one was, even. But the (new) Director of Engineering specifically hired a sysadmin.


There will always be people who read a solution and say "oh that's obvious". Hindsight is 20/20 and all that.


Thanks! I am delighted the post was well received and sparked a little discussion. Thank you so much for your support.


It's "pored over", not "poured over".


500 mile email comment in 3... 2... 1...



The Daily WTF has a trove of similar stories. https://thedailywtf.com


If you enjoy that, come and chat with us on the LOPSA mailing lists or on IRC. :) (LOPSA is the League of Professional System Administrators, Mr. Harris and I are both members, though my little story is nowhere near as awesome.) https://lopsa.org/Chat-maillists-and-more Plus we always welcome new members.


cf. related war stories on HN https://news.ycombinator.com/item?id=4709438

My own personal strangest experience was having two autonegotiating 10/100baseT devices that wouldn't speak to each other but would via any other switch.


Mine was about NFS.

One or two years ago, I was tasked to solve the following issue:

Installing an obsolete RHEL 4 on a brand-new computer, with all the driver issues that entails. Virtualization was not an option.

The solution I chose was to "backport" the latest RHEL 5 kernel into RHEL 4. A few weeks of work later - repackaging the kernel, adapting the mkinitrd script, and a lot of headaches around the install ISO (actually the hardest part, with a lot of hacks around anaconda) - I finally managed to have a working server.

Then a few days later, I was notified of a regression.

There was a somewhat crazy application managing various configuration files on various devices of this particular infrastructure, and it had stopped working.

Digging into this application, I discovered that it was "pushing" the updated conf files through NFS. More exactly, it notified a service on the targeted device, and then this device mounted an NFS share on the RHEL 4 server, retrieved the conf file, and unmounted the NFS share (I told you, "crazy").

The targeted devices were using RH 7 (RH, not RHEL, I'm talking about the one with a 2.4 kernel).

Strangely enough, mounting the share by hand and doing an ls showed that the files were indeed present, and there were no permission issues.

Reading the source code a little further, and looking at quite an old Qt version (the service on the device was Qt based), I finally managed to find which line of code was not working: it was a simple readdir().

So I created a simple C program that just did the readdir, and I finally managed to reproduce the bug: indeed, readdir was not listing the files present in the directory.
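
The repro, sketched in Python rather than the original C (os.listdir() is built on readdir(3); the mount point is invented):

    import os

    # On the affected RH 7 client, readdir() returned no entries even
    # though the files were demonstrably present on the server.
    print(os.listdir("/mnt/rhel4_share"))  # [] == bug reproduced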

But it was not helping me much... So I decided to do some network captures; everything looked OK. I did the same network captures with the old RHEL 4 kernel, and they looked exactly the same.

After hours of staring at my screen with the two captures side by side, I finally spotted a subtle difference. The fileids (64 bits) were padded with zeros in the first 32 bits by the old kernel (e.g. 0x0000000000000000A489097654456F97), and that was not the case with the new kernel (e.g. 0xFC871902B9086456A489097654456F97).

Basically, the old RH 7 was violating the NFS RFC by only handling 32-bit fileids (by the way, the RFC predates RH 7 by 5 years).

As it was impossible to upgrade the old RH 7, I finally backported the "bug" into my "RHEL 5 kernel on an RHEL 4" by adding a small two-line patch to the RHEL 5 kernel that ensured 32-bit fileids.
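
The gist of that workaround, illustrated in Python (the real patch was two lines of kernel C; this only shows the masking):

    def clamp_fileid(fileid: int) -> int:
        # Keep only the low 32 bits, so clients that truncate NFS
        # fileids to 32 bits always see a stable value.
        return fileid & 0xFFFFFFFF

    assert clamp_fileid(0xA489097654456F97) == 0x54456F97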


That server behaviour change is an excellent illustration of why "backwards compatible" isn't a simple black-and-white concept.


Wow. Impressive, but I sure wish that drive and creative energy could have been put to getting off RHEL 4... My hat's off to you.


There were various reasons why it was difficult to get off RHEL 4.

It was a system with a lot of cruft accumulated over time, and a lot of domain-specific applications that needed to be ported to a newer environment (new OSes, migrating from Qt3 to Qt4/Qt5 and other newer libraries, millions of lines of code). They were actually planning to move to a newer base system when I finally left the company. We did do it for other parts of the system, and it was a several-year process.

I also have another horror story about NFS and RHEL 4 (from nearly the same time as the first one).

I was tasked to update a small internal infrastructure used by another project (replacing an old Active Directory, an old Exchange server, adding a few other services, renewing the hardware, reworking the backups, etc.).

For the most part, I managed to completely replace the old stuff with newer things (postfix+dovecot, openldap, samba4, CentOS 7, bind, bacula, everything hosted in VMs).

There were a few things that were too hard to migrate without blocking the users too much.

So, to at least migrate away from a 10+ year old server, I made the choice to just transform it into a VM.

This server was hosting various things: 300 subversion repositories, a custom tracker using PHP 4, a crazily deployed viewvc (apache listened on 10 different ports, with some static pages pointing to the different ports), probably some other stuff I didn't know about, and, you guessed it, NFS.

There was a lot of stuff relying on this particular NFS server (lots of scripts with the server name and paths hardcoded).

Also, the company had decided that it was completely unacceptable to risk losing one line of code on any developer desktop, so they put their home directories on this NFS server. (And if you are wondering: compiling code on an old NFS server, over a 100 Mb/s network, with 10 other devs doing the same thing, is just miserable...) The amount of data (~1TB) was quite large for a server that old. And yeah, it was also acting as a NIS server.

So I started preparing the migration:

* boot on a livecd

* assign a temporary IP to the new server

* create the partitions by hand

* rsync the system from the live, old system

* and a few other things, like installing grub

(basically a Gentoo installation minus the compiling and package-choosing parts)

Then I checked the services, and everything was working correctly.

So I scheduled a final downtime in the late afternoon for the final synchronization, and at the scheduled hour I shut down all the services except ssh and did the final rsync. I then shut down the old server and switched the new server from its temporary IP to the old one.

A final check showed me that most services seemed to basically work.

I arrived the next morning and everyone was panicking: the NFS server was behaving badly. I logged on to the server and saw that the partition used by NFS was mounted read-only. I rebooted the server, and again the partition became read-only.

I urgently shut down the new VM and restarted the old server.

Then I investigated the issue and finally found a likely candidate: https://access.redhat.com/security/cve/CVE-2006-3468

The good thing was that, given it was a CVE, I managed to find an exploit, and it did reproduce the issue on my system.

The issue was triggered because some desktops were not shut down during the migration; they saw the old server disappear and a new one reappear under the same IP. Consequently, they were sending old file handles from the old server to the new server. As described in the CVE, the NFS server didn't handle bogus file handles correctly, remounting the partition read-only.

I updated the kernel, planned another maintenance window, and everything went fine.

A final note: the kernel version installed was 2.6.9-41, and the one that fixed the issue was 2.6.9-42. Had this machine been updated ONCE in its life, the migration would have gone smoothly...

I left this company a few months later, with somewhat of a deep hatred for NFS ^^.


Whoa. You're brave. This is what I love about sysadmins. They keep the show on the road, even in trying circumstances.


Let me guess: Sun talking to Cisco, around late 1990s to early 2000s? A known bug, known to both vendors, neither would fix it, Solaris admins would just have to know about it, usually via having been bitten by it? Yeah.


Strangely it wasn't, but one of the devices was an experimental bit of hardware which simply did autonegotiation wrong.

We found this out with an oscilloscope.


I had to solve this one with a small hub or switch between the Sun and Cisco boxes, so it sounded very familiar. Possibly 10 vs 100 was harder than we thought.


I discreetly removed my comment about it.

But, yes, still my favorite one ^^.



