The Hidden Race Condition in file-type That Corrupted Our Parallel Image Detection

The problem: a logo candidate list that kept shrinking

We caught this one because our logo candidate list kept randomly shrinking and we couldn't figure out why. A given domain would yield twelve image candidates on one run, four on the next, nine on the one after. Same site, same scrape, wildly different shortlists. The downstream ranker only sees what survives the MIME-type filter, so when a PNG silently came back labeled image/jpeg (or worse, as a type we don't accept), it got dropped before scoring ever happened.
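To make the failure mode concrete, here is a minimal sketch of that kind of filter. The allow-list, the field names, and the URLs are hypothetical stand-ins, not our production code; the point is only that a wrong label removes a candidate without raising any error.

```javascript
// Hypothetical allow-list; the real pipeline's accepted types may differ.
const ACCEPTED_MIME = new Set(['image/png', 'image/jpeg', 'image/svg+xml']);

// Keep only candidates whose detected MIME type is on the allow-list.
const filterCandidates = (candidates) =>
	candidates.filter((c) => ACCEPTED_MIME.has(c.detectedMime));

const candidates = [
	{ url: 'https://example.com/logo.png', detectedMime: 'image/png' },
	// A valid PNG that the detector mislabeled as a type we reject:
	{ url: 'https://example.com/mark.png', detectedMime: 'video/mp4' },
];

const shortlist = filterCandidates(candidates);
console.log(shortlist.length); // 1: the mislabeled PNG is gone, and nothing threw
```

Nothing in this path logs or throws, which is why the only visible symptom was a shorter list.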

We blamed the scraper. Then we blamed GCS. Then we blamed ourselves for blaming GCS. Nothing fit. The files in the bucket were valid. Our database said otherwise. The function in between was a five-line wrapper around the file-type npm package, the same one half the Node.js ecosystem uses to sniff MIME types from raw buffers. It's supposed to be boring. It's supposed to be pure.

It was neither.

The investigation: it only breaks under load

Locally, we couldn't reproduce it. We fed file-type a thousand PNGs in a tight loop and it cheerfully returned image/png a thousand times. Mixed batches, run sequentially, all correct. We even fed it the exact buffer from a misclassified production request. Still PNG.

The only thing different in production was concurrency. Our brand ingestion pipeline pulls down logos, favicons, hero images, and OG images in parallel, usually six to ten buffers hitting fileTypeFromBuffer at the same moment via Promise.all(). That was the missing variable.

We wrote a reproduction:

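import { fileTypeFromBuffer } from 'file-type';
import { readFile } from 'fs/promises';

// Three known-good fixtures, one per format.
const png = await readFile('./logo.png');
const jpg = await readFile('./hero.jpg');
const gif = await readFile('./icon.gif');

// 150 buffers, interleaved png/jpg/gif, all detected concurrently.
const buffers = Array(50)
	.fill(0)
	.flatMap(() => [png, jpg, gif]);
// Explicit arrow so map's index argument is not passed along as options.
const results = await Promise.all(buffers.map((buf) => fileTypeFromBuffer(buf)));

// Index i holds format i % 3, so any other extension is a misdetection.
results.forEach((r, i) => {
	const expected = ['png', 'jpg', 'gif'][i % 3];
	if (r?.ext !== expected) console.log(i, 'expected', expected, 'got', r?.ext);
});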

The first run printed twenty mismatches. The next run printed thirty. PNGs came back as JPEGs. GIFs came back as PNGs. The detector was returning answers from the wrong buffer entirely.

The cause: shared tokenizer state

Digging into the package source, the cause became obvious. file-type uses strtok3 under the hood and constructs a FileTypeParser instance whose detection methods read and rewind a shared tokenizer position field as they walk through dozens of magic-byte checks. When you call fileTypeFromBuffer twice concurrently, both calls end up mutating the same internal cursor on the same parser instance. One detection seeks forward to check for an MP4 atom; the other then reads four bytes at that offset, sees ftyp-shaped garbage from the other file, and confidently reports the wrong type.

It's not a bug in any single check. The whole API is stateful parsing dressed up as a pure function.
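The general failure pattern is easy to model in isolation. The sketch below is a toy, not file-type's source: ToyParser, its fields, and the four-byte sample buffers are all invented for illustration. It assumes only that per-call state lives on an instance and that the detection method awaits at least once before reading that state back.

```javascript
// Toy model only: this is NOT file-type's code. It shows how per-call state
// kept on a shared instance breaks as soon as calls interleave.
class ToyParser {
	async detect(buf) {
		this.buffer = buf; // state lives on the instance, not on the call
		this.position = 0;
		await Promise.resolve(); // any await lets other calls run in between
		const magic = this.buffer[this.position]; // may be another call's buffer by now
		if (magic === 0x89) return 'png';
		if (magic === 0xff) return 'jpg';
		if (magic === 0x47) return 'gif';
		return undefined;
	}
}

// First bytes of the real PNG, JPEG, and GIF signatures.
const png = Uint8Array.of(0x89, 0x50, 0x4e, 0x47);
const jpg = Uint8Array.of(0xff, 0xd8, 0xff, 0xe0);
const gif = Uint8Array.of(0x47, 0x49, 0x46, 0x38);
const samples = [png, jpg, gif];

// Sequential: each call finishes before the next starts, so nothing interleaves.
async function runSequential(parser) {
	const out = [];
	for (const buf of samples) out.push(await parser.detect(buf));
	return out;
}

// Concurrent: all three calls start before any of them resumes.
const runConcurrent = (parser) =>
	Promise.all(samples.map((buf) => parser.detect(buf)));

runSequential(new ToyParser()).then((r) => console.log('sequential:', r)); // ['png', 'jpg', 'gif']
runConcurrent(new ToyParser()).then((r) => console.log('concurrent:', r)); // ['gif', 'gif', 'gif']
```

In the toy, the last caller to write the shared fields wins, so every concurrent call reports the last buffer's type. That matches what we saw: correct in a sequential loop, wrong under Promise.all.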

The fix: one parser per buffer

The fix is one line: never share a parser. Construct a fresh FileTypeParser for every buffer, so each detection owns its own tokenizer and cursor and there is no instance left to share:

import { FileTypeParser } from 'file-type';

// A new parser per call, so no two detections can share a cursor.
const detect = (buf) => new FileTypeParser().fromBuffer(buf);
const results = await Promise.all(buffers.map(detect));

Mismatches went to zero. Candidate counts are stable from run to run, PNGs are PNGs again, and Promise.all is safe to use here as long as every call gets its own parser.
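One way to keep it that way is a small concurrency check that can run in CI. The helper below is a sketch: findMismatches and the shape of its samples argument are hypothetical names, and it assumes only that the detector is an async function returning an object with an ext field, or undefined.

```javascript
// Hypothetical helper: run `detect` concurrently over `rounds` interleaved
// copies of the samples and return every result that disagrees with its label.
// Each sample is { buffer, ext }, where ext is the extension we expect back.
async function findMismatches(detect, samples, rounds = 50) {
	const jobs = Array.from({ length: rounds }).flatMap(() => samples);
	const results = await Promise.all(jobs.map((s) => detect(s.buffer)));
	return results
		.map((r, i) => ({ index: i, expected: jobs[i].ext, got: r?.ext }))
		.filter((m) => m.got !== m.expected);
}
```

Passing the per-call detect function from above, along with the png, jpg, and gif fixtures labeled with their expected extensions, should resolve to an empty array. A non-empty result lists the index, expected extension, and returned extension for each misdetection.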

