30 Jun 08

‘Sparse’ Document Collections

I was recently talking with some friends about how to handle digital document collections that contain hundreds of thousands of documents, but where only a tiny fraction (let’s say < 30) are heavily referenced. Furthermore, let’s suppose the collection was built by OCRing many disparate sources, which introduces subtle errors that are computationally expensive to detect, if detecting them is computationally feasible at all.

My question is this: as a user, would you find it acceptable to be given a way to flag a document as needing review, as a substitute for hiring a huge team of interns to go over every document by hand? Do you have any experience with similar collections? How did you handle the problem there? Any other ideas?
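
To make the question a bit more concrete, here is a rough sketch of what such a flagging mechanism could look like. Everything in it is an assumption for illustration: the `ReviewQueue` class, its method names, and the idea of ranking documents by how many distinct readers flagged them are hypothetical, not a description of any existing system.

```python
from collections import Counter


class ReviewQueue:
    """Hypothetical sketch: collects reader flags and lists the most-flagged documents first."""

    def __init__(self):
        self._flags = Counter()  # doc_id -> number of distinct readers who flagged it
        self._seen = set()       # (doc_id, user_id) pairs, so each reader counts once per document

    def flag(self, doc_id, user_id):
        """Record that a reader thinks a document needs review.

        Returns False if this reader already flagged this document.
        """
        if (doc_id, user_id) in self._seen:
            return False
        self._seen.add((doc_id, user_id))
        self._flags[doc_id] += 1
        return True

    def pending(self, limit=10):
        """Return up to `limit` document ids, most-flagged first."""
        return [doc_id for doc_id, _ in self._flags.most_common(limit)]

    def resolve(self, doc_id):
        """Drop a document from the queue once someone has reviewed it."""
        self._flags.pop(doc_id, None)
        self._seen = {pair for pair in self._seen if pair[0] != doc_id}


queue = ReviewQueue()
queue.flag("doc-17", "alice")
queue.flag("doc-17", "bob")
queue.flag("doc-42", "alice")
print(queue.pending())
```

Since only a handful of documents are heavily read, a scheme like this would concentrate review effort on exactly those documents, while the rarely opened ones would stay unreviewed until somebody actually needs them.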

I promise that a ‘real’ post is soon to follow.
