All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog.
- Support for deleting Odinson Documents (and all associated Lucene documents capturing metadata, etc.) from an index via `OdinsonIndex.deleteOdinsonDoc(odinsonDocId)`.
- Support for updating Odinson Documents (and all associated Lucene documents capturing metadata, etc.) in an index via `OdinsonIndex.updateOdinsonDoc(doc)`.
- `OdinsonIndex.usingIndex(conf: Config)` context manager (ensure an index gets closed)
- `ExtractorEngine.usingEngine(conf: Config)` context manager (ensure an engine's index and state get closed)
- Added documentation about supported types of fields for metadata (and info warnings for unsupported types). (#329)
- Added support for live indexing (#341)
- Use Lucene's `KeywordAnalyzer` for parsed documents rather than `WhitespaceAnalyzer`.
- Refactored code related to Lucene (#341)
- Negated lookaround queries return the correct spans (not off-by-one).
- Rest API moved to lum-ai/odinson-rest repo (#357)
- Added mechanisms for adding metadata to Documents easily, including an app (#319)
- Added tags-vocabulary endpoint to API for index-specific part-of-speech tags.
- Added tests for metadata and parent API calls in backend.
- Added metadata query language. Includes support for dates and nested objects. (#305)
- Parent document filenames are stored by default.
- Added a histogram endpoint for term frequencies.
- Enhanced term-freq endpoint to allow filtering as well as grouping by a second field.
- Added ability to Mentions to populate their lexical content (#274)
- Added tests for parent queries in core and backend.
- Dependencies now stored as BinaryDocValuesField (previously SortedDocValuesField) to allow for larger graphs (#283).
- Moved responsibility for getting lexical content from ExtractorEngine to DataGatherer (#274)
- Metadata is now indexed as TokensFields instead of StringFields.
- Added :mkDoc command to shell (#272)
- Added ability to serialize Mentions verbosely (with displayField or all storedFields) (#265)
- Added project-wide formatting settings and a PR check for linting
- Added a file that accompanies index (`settings.json`) that describes settings used in creating the index. Currently storing `storedFields`. (#255)
- Added REST API endpoint for returning frequencies of token-based annotations in a corpus.
- Added `ai.lum.odinson.utils.TestUtils` and the corresponding `OdinsonText` in the main project for using the test utils in dependent projects (#232)
- Added some additional methods to ExtractorEngine to access tokens from different fields of a Lucene Doc (#231)
- Added json serialization and deserialization of Mentions and OdinsonMatches to core (#226)
- Added argument promotion, i.e., arguments specified for promotion or underspecified will be added to the state (#218)
- Add tests for REST API endpoints
- Grammar files now support imports of rules and variables, from both resources and filesystem; absolute and relative paths (#175, #180).
- Validation of tokens to ensure they are compatible with Lucene (#170)
- Add priority as String to `Rule` and as `Priority` to `Extractor`
- Add `MentionFactory` to be optionally passed during construction of the `ExtractorEngine` so that custom `Mentions` can be produced. Include a `DefaultMentionFactory` to be used if one isn't provided. Change `Mention` to be a regular class instead of a case class to facilitate subclassing.
- Use added `State.addMentions` now instead of `State.addMention` with help of new `OdinResultsIterator` by @kwalcock
- Add `State` and `StateFactory` integration into `reference.conf` and integrate extras into `application.conf`
- Code coverage report.
- REST API endpoints for retrieving metadata and parent document; OpenAPI data model for `OdinsonDocument`, etc.
- Containerized Odinson
  - Docker images for `extra` and the REST API using the `sbt-native-packager` plugin.
- Added `ExtractorEngine.inMemory(...)` to help build an index in memory.
- Added `disableMatchSelector` to `ExtractorEngine.extractMentions()` to retrieve all spans of tokens that could be matched by the query. In other words, it skips the `MatchSelector`.
- Added `buildinfo.json` file to the index to store versions and build info.
- Added ability to express rule vars as lists, in addition to the current string representation.
- Put indexing docs in a method to be used by external projects. (#90)
- Started documentation at http://gh.lum.ai/odinson/ (#97)
- JsonSerializer is now a class, and has the ability to serialize verbose detail about Mentions (#265)
- Updated version of CluLab processors in `extra/` to 8.2.3 (#241)
- Using whole config to create ExtractorEngine and its components (rather than subconfigs) (#231)
- Removed the MentionFactory; renamed OdinMentionsIterator to MentionsIterator (#228)
- Different organization for tests. Now every test extends a `BaseSpec` class and there are 6 categories of tests.
- Turn `State` into a trait with very basic `SqlState` and even more basic `MemoryState` and placeholder `FileState` implementations by @kwalcock
- REST API: `/api/parent` -> `/api/parent/by-document-id` & `/api/parent/by-sentence-id`
- REST API: `sentId` param for `/api/sentence` -> `sentenceId`
- REST API: `rules` param for `/api/execute/grammar` -> `grammar`
- Retrieval of OdinsonSentence JSON via REST API
- `extra/AnnotateText` writes compressed json files
- Reduce number of array allocations
- All strings are normalized with NFKC, except the norm field which uses NFKC with casefolding, diacritic stripping, and some extra character mappings. This is the case both at index time and query time. This means you should reindex if you upgrade to this version.
- Use temporary directories for /extra and /backend tests to avoid the main index (`data/odinson/index`) being overwritten during testing
- Accept underscore at identifier start (#209)
- Nullpointer exception related to event arguments.
- Size of roots array in `UnsafeSerializer`
- Added option to allow arguments that overlap with the trigger in event mentions (disallowed by default)
- Added optional label to rules and mentions
- Added Lucene segment information to `Mention`
- Added optional `label` support to named capture syntax, i.e. `(?<name:label> ... )`
- Added `QueryUtils.quantifier()` to make a quantifier string from some requirements, e.g. min and max repetitions.
- Enforce quantifier semantics in `event` rules.
- Replace variables in rule names