Add workaround for JISC newspapers #24

Open
thobson88 opened this issue Mar 22, 2022 · 2 comments

@thobson88

The JISC newspapers dataset, after pre-processing with the jisc-wrangler tool, exhibits a slightly different directory structure to that assumed by alto2txt.

Instead of a single newspaper issue subdirectory of the form mmdd/, the JISC structure splits the month and day into nested subdirectories: mm/dd/.

This results in warnings like:

extract_text.xml_to_text:19683:WARNING:Unexpected directory: 02

and the XML files in the subdirectory are not processed.

The task is to implement a simple workaround to accommodate the non-standard JISC directory structure.
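
One possible shape for such a workaround is sketched below. It is not the code on any alto2txt branch: the function name `iter_issue_dirs`, the regular expressions, and the assumption that issue directories sit directly under a per-year directory are all hypothetical, chosen only to illustrate accepting both mmdd/ and mm/dd/ layouts.

```python
import re
from pathlib import Path
from typing import Iterator, Tuple

# Hypothetical patterns: a four-digit "mmdd" name, or a two-digit "mm" / "dd" name.
MMDD = re.compile(r"\d{4}")
TWO_DIGITS = re.compile(r"\d{2}")


def iter_issue_dirs(year_dir: Path) -> Iterator[Tuple[str, Path]]:
    """Yield (mmdd, path) for every issue directory under year_dir.

    Handles the standard layout (year/mmdd/) and the JISC layout
    (year/mm/dd/). Directories matching neither pattern are skipped.
    """
    for child in sorted(p for p in year_dir.iterdir() if p.is_dir()):
        if MMDD.fullmatch(child.name):
            # Standard layout: the directory name is already mmdd.
            yield child.name, child
        elif TWO_DIGITS.fullmatch(child.name):
            # JISC layout: descend one level and join mm + dd.
            for day in sorted(p for p in child.iterdir() if p.is_dir()):
                if TWO_DIGITS.fullmatch(day.name):
                    yield child.name + day.name, day
```

With this approach the caller always receives a normalised mmdd key, so the downstream XML processing would not need to know which layout it came from.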

@thobson88 thobson88 added the enhancement New feature or request label Mar 22, 2022
@thobson88 thobson88 self-assigned this Mar 22, 2022
@thobson88
Author

thobson88 commented Mar 22, 2022

Fix added on branch 24-jisc-fix-catch.

I've tested this successfully on a single JISC title but have not yet tested on non-JISC data (to make sure the fix hasn't introduced any bugs).

@andrewphilipsmith andrewphilipsmith added this to the v0.5 milestone Jun 30, 2022
@kallewesterling
Contributor

This looks related to #30?
