Much information is trapped inside PDFs, and if you want to analyze it you'll need a tool that extracts the text contents. If you're processing many PDFs (XX millions), this takes time but parallelizes naturally. I've only seen this done on the JVM, and decided to do a proof of concept with new Javascript tools: it runs Node.js as a backend and uses PDF.js, from Mozilla Labs, to parse the PDFs. A full-text index is also built, the beginning of a larger ingestion process.

I run a separate server for each task – I'm not sure whether the Node.js community has a preferred architecture, but this feels like a natural fit. There is a fourth server, a Python server, which serves the static Javascript pages that coordinate the work; outside of development this would run as a console application with PhantomJS. The Node.js servers all run as virtual machines on my developer workstation, configured using Vagrant and Virtualbox – these could easily be moved onto separate hosts. The virtual machines communicate over fixed IPs, and each is assigned a different port on the same physical host, so that any laptop on the network can join in the fun.

Once each of the simple servers is running, the coordination code can be written in a browser, which lets you work using the Chrome developer tools. This comes at a cost: you have to configure all the servers to send Access-Control-Allow-Origin headers to permit cross-domain requests, a nuisance that would not be present in other architectures. One alternative is to put everything behind a proxy like Squid, but I haven't tried this yet.

The PDFs are an extract from PACER (court cases), stored on a NAS – about 1.2 MM PDFs, ½ TB of storage.

Tools: Vagrant + Virtualbox, Node.js, node-static, Lunr.js, node-lazy, PhantomJS