
Add ExternalReader that delegates all reading to an external function #57

Open · wants to merge 1 commit into master
Conversation

@mbklein commented Mar 9, 2021

Description

This PR adds an option, externalReader, which takes over the task of reading files/chunks from the built-in readers.

Example

In the following example, the input to exifr is an object pointing to a file stored in an Amazon S3 bucket. The custom reader function uses the AWS SDK to grab the chunks exifr needs. With this functionality, the input given to exifr can be literally anything, as long as the supplied externalReader function knows what to do with it.

const AWS = require("aws-sdk");
const exifr = require("exifr");
const s3 = new AWS.S3();

// Reads `length` bytes starting at `offset` from the S3 object described by
// `input` ({Bucket, Key}); when no offset is given, fetches the whole object.
function chunkReader(input, offset, length) {
  return new Promise((resolve, _reject) => {
    let params = {...input};

    if (typeof offset === 'number') {
      // HTTP Range headers are inclusive, hence the `- 1`; an open-ended
      // range (`bytes=offset-`) reads to the end of the object.
      let end = length ? offset + length - 1 : undefined;
      params.Range = `bytes=${[offset, end].join('-')}`;
    }

    s3.getObject(params, (err, data) => {
      if (err) {
        // resolve with undefined rather than rejecting, so a failed
        // chunk read surfaces to exifr as "no data"
        console.error(err);
        resolve(undefined);
      } else {
        resolve(data.Body);
      }
    });
  });
}

exifr.parse({ Bucket: "my-aws-bucket", Key: "path/to/some.tif" }, { externalReader: chunkReader })
  .then((exif) => console.log(exif))
  .catch((err) => console.error(err));

@mbklein (Author) commented Mar 21, 2021

@MikeKovarik Just popping in to say that we’ve been using this code in production for a couple weeks now, and have used it to process more than 200,000 TIFF images of various sizes from 100MB up to almost 4GB. It’s been rock solid and remarkably fast.

@MikeKovarik (Owner) commented Mar 24, 2021

Hello @mbklein. I'm so sorry for responding this late. Glad to hear it's working for you. It's impressive!

I actually saw the PR on the day you submitted it and was about to merge it right away. You've hit on a great idea with the externalReader, and I really like the direct use of AWS instead of just making UrlFetcher work in Node.js. But because of this shift I started thinking about improving the idea further, because the user-facing API could be better aligned with how exifr does modularity.
So I got to work, changing a few things here and there, thinking "I'll hold off on responding until there's something to bring to the table" and "I'll only do this one more change". And then the weekend came. And the next week. And here we are.

Anyway, back to the externalReader. I really like the idea, but I don't think passing it through the options object is the ideal way. Exifr already has groundwork for modularity in plugins.mjs, in the form of Maps to which all the loaded parsers/readers are added (like this: fileReaders.set('url', UrlFetcher)).

My idea is to implement a custom reader class and add it to exifr's plugin mechanism. This way, I could also just ship the AWS reader. Maybe I could finally turn exifr into a monorepo and break it up into separate packages like exifr-parser-jpeg and exifr-reader-aws.
All reader classes, such as UrlFetcher, inherit from ChunkedReader. The static canHandle method is used to verify whether the input is parseable by the reader. When exifr.parse(input) is called, exifr goes through all loaded readers to pick the one best suited for the input format. If the input is an object with Key and Bucket, then AwsReader will be used without the need to pass an options object.
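
In other words, the selection step might look roughly like this (a minimal sketch; the fileReaders Map is the one from plugins.mjs, while pickReader is just an illustrative name, not an actual exifr internal):

function pickReader(input) {
  // walk the registered readers in insertion order and use the first
  // one whose static canHandle() accepts this input
  for (let Reader of fileReaders.values()) {
    if (Reader.canHandle(input)) return Reader
  }
  throw new Error('Unknown input format')
}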

So this is the API I came up with, but I'm open to suggestions.

import exifr from '../src/bundles/full.mjs'

export class ExternalReader extends exifr.ChunkedReader {

  // non-chunked mode: fetch the entire file in a single _read() call
  async readWhole() {
    this.chunked = false
    let arrayBuffer = await this._read()
    this._swapArrayBuffer(arrayBuffer)
  }

  // chunked mode: the subclass's _read(offset, length) does the actual I/O
  async _readChunk(offset, length) {
    let abChunk = await this._read(offset, length)
    let bytesRead = abChunk.byteLength
    // a short read means we hit the end of the file and now know its size
    if (bytesRead !== length) this.size = offset + bytesRead
    return this.set(abChunk, offset, true)
  }

}

class AwsReader extends ExternalReader {

  static canHandle(input) {
    return input !== undefined
      && typeof input.Bucket === 'string'
      && typeof input.Key === 'string'
  }

  _read(offset, length) {
    // the aws code here
  }

}

exifr.use(AwsReader)
// or
fileReaders.set('aws', AwsReader)

exifr.parse({ Bucket: "my-aws-bucket", Key: "path/to/some.tif" })
  .then(console.log)
  .catch(console.error)

For now it's still open to changes, since I'm still reworking the logic of ChunkedReader, specifically the readWhole, readChunked and _readChunk methods. These were internal methods until now, but they would need to make more sense for users to implement.


PS: This is off topic now, but another thing that delayed me was exploring other ways of making UrlFetcher work in Node.js. I ended up with a simple shim of fetch implemented with Node's built-in http and https modules. Here's a snippet:

// dynamicImport, platform and concatBuffers are exifr's own internal helpers:
// dynamicImport lazily loads a Node module, platform.g is the global object,
// and concatBuffers joins the response chunks into a single buffer
let httpPromise  = dynamicImport('http',  http => http)
let httpsPromise = dynamicImport('https', https => https)

let fetch = platform.g.fetch || function(url, {headers}) {
  let {port, hostname, pathname, protocol} = new URL(url)
  const options = {
    method: 'GET',
    hostname,
    path: encodeURI(pathname),
    headers
  }
  if (port !== '') options.port = Number(port)
  return new Promise(async (resolve, reject) => {
    let lib = protocol === 'https:' ? await httpsPromise : await httpPromise
    const req = lib.request(options, res => {
      resolve({
        status: res.statusCode,
        arrayBuffer: () => new Promise(resolveAb => {
          let buffers = []
          res.on('data', buffer => buffers.push(buffer))
          res.on('end', () => resolveAb(concatBuffers(buffers)))
        })
      })
    })
    req.on('error', reject)
    req.end()
  })
}
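
For reference, here's how the shim above might be called to fetch a byte range, the way chunked reading over HTTP would (the URL is purely illustrative):

let res = await fetch('https://example.com/image.jpg', {headers: {Range: 'bytes=0-65535'}})
console.log(res.status) // 206 for a partial response, 200 for the whole file
let arrayBuffer = await res.arrayBuffer()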

@mbklein (Author) commented Mar 24, 2021

I really like where you've gone with this, especially the idea of breaking up the repo and allowing people to require only the bits they need. It would certainly make it way easier to extend – just write and release a module! – compared to the current all-in-one approach.

One small suggestion I have would be to allow the caller to pass the name of a registered reader (or something) when they call parse() (e.g., exifr.parse(input, { reader: 'aws' })), allowing them to bypass all the decision logic if they know they want to use a specific implementation. I like “just do the right thing” code, but I also appreciate when there's a way to impose my will on the process. 😄
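
Continuing the earlier pickReader sketch, the override could be checked before auto-detection (the reader option name and the fileReaders Map are assumptions here, not an existing API):

function pickReader(input, options = {}) {
  // explicit override: use the registered reader the caller named
  if (options.reader) {
    let Reader = fileReaders.get(options.reader)
    if (!Reader) throw new Error(`Unknown reader: ${options.reader}`)
    return Reader
  }
  // otherwise fall back to canHandle-based auto-detection
  for (let Reader of fileReaders.values()) {
    if (Reader.canHandle(input)) return Reader
  }
  throw new Error('Unknown input format')
}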

@mbklein (Author) commented May 6, 2021

Hi @MikeKovarik – just thought I'd check in and see if there's anything I can do to help with where you'd like this PR/issue/refactor to go. Our application is chugging along nicely here using our fork, but it'd be great to get back to the main release at some point.

@schonert commented
Any news on this? 🤓
