Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug-Fix: Add negative tags for RegexMultiplicationAST with min=0. #41

Open
wants to merge 110 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 91 commits
Commits
Show all changes
110 commits
Select commit Hold shift + click to select a range
a6274ec
Bug-fix for unicode array sizes
SharafMohamed Sep 12, 2024
186d239
Merge remote-tracking branch 'upstream/main' into nfa-cleanup-pr
SharafMohamed Sep 12, 2024
4f122c6
Move LexicalRule to its own class; Change name to variable_id; Change…
SharafMohamed Sep 12, 2024
c24f6e1
Additional fix for swapping meaning of tag
SharafMohamed Sep 12, 2024
33582da
Another additional fix for swapping meaning of tag
SharafMohamed Sep 12, 2024
3338ec7
Fix up some comments
SharafMohamed Sep 12, 2024
3cd3c0f
Fix comment grammar
SharafMohamed Sep 12, 2024
e05acbb
Add tags to AST; Serialize AST for testing; Add unit-test for testing…
SharafMohamed Sep 13, 2024
5e61e83
Use using to condense code; Use a unique schema object for each test …
SharafMohamed Sep 13, 2024
082090d
Add has_capture_groups(); Add unit-test for has_capture_groups()
SharafMohamed Sep 13, 2024
2c6d94e
Create and use RegexASTEmpty to split RegexASTgroup with min=0 into R…
SharafMohamed Sep 13, 2024
4e02f24
Add unit-test for 0 repetition regex
SharafMohamed Sep 13, 2024
bb3c543
Add more tests for repetition regex
SharafMohamed Sep 13, 2024
54027ad
Return by value in literal getters; Use const instead of const& for l…
SharafMohamed Sep 16, 2024
e58274f
Refactor new_state()
SharafMohamed Sep 16, 2024
1321871
Rename get_first_matching_variable_ids() to get_matching_variable_ids…
SharafMohamed Sep 16, 2024
c904755
Remove redundant docstrings
SharafMohamed Sep 16, 2024
ffe9a0f
Remove has_capture_groups()
SharafMohamed Sep 16, 2024
913ed1a
Const and auto changes
SharafMohamed Sep 16, 2024
7aa8a92
Changed AST add functions to indicate the AST are being added to the …
SharafMohamed Sep 17, 2024
77e44a5
Merge branch 'nfa-cleanup-pr' into comment-cleanup
SharafMohamed Sep 17, 2024
d1d87e7
Merged with previous PR
SharafMohamed Sep 17, 2024
f386a3b
Merge branch 'tagged-ast' into pre-tagged-nfa-cleanup
SharafMohamed Sep 17, 2024
0c600d7
Merge branch 'pre-tagged-nfa-cleanup' into regex-ast-empty
SharafMohamed Sep 17, 2024
bedad75
Change add in RegexASTEmpty to add_to_nfa
SharafMohamed Sep 17, 2024
053d057
Update src/log_surgeon/finite_automata/RegexAST.hpp
SharafMohamed Sep 18, 2024
a822307
updated examples to use
SharafMohamed Sep 18, 2024
0b9603a
Merge branch 'nfa-cleanup-pr' into comment-cleanup
SharafMohamed Sep 18, 2024
2ef84d1
TODO to clarify RegexAST class is actually nodes in the AST
SharafMohamed Sep 18, 2024
83bd518
Merge branch 'main' into comment-cleanup
SharafMohamed Sep 18, 2024
d3d815e
Merge branch 'comment-cleanup' of https://github.com/SharafMohamed/lo…
SharafMohamed Sep 18, 2024
168adb0
Grammar fix
SharafMohamed Sep 18, 2024
20b3421
Typo fix
SharafMohamed Sep 18, 2024
5231a4a
Fix var references in new comments
SharafMohamed Sep 18, 2024
e4ac215
Move DFA comment to RegexDFA.hpp
SharafMohamed Sep 19, 2024
4b8b13e
Merge branch 'comment-cleanup' into tagged-ast
SharafMohamed Sep 19, 2024
e2d05fa
Merge branch 'main' into tagged-ast
SharafMohamed Sep 19, 2024
660eb9b
Remove duplicate comments from child classes
SharafMohamed Sep 19, 2024
731b9fe
Change capture groups to assign tag when its originally added to the …
SharafMohamed Sep 19, 2024
81b4ffa
Assign negative tags during construction of AST nodes
SharafMohamed Sep 23, 2024
a4d29a5
Added docstring for RegexAST explaining what the class does
SharafMohamed Sep 23, 2024
9379447
Removed serializing AST without tags as this intermediate representat…
SharafMohamed Sep 23, 2024
6c4d933
Fix docstring tense; Add [[nodiscard]] and const to serialize functio…
SharafMohamed Sep 23, 2024
a2fdbf1
Fix linter error
SharafMohamed Sep 23, 2024
b0485f5
Make serialize_with_negative_tags() protected
SharafMohamed Sep 23, 2024
64da95c
Merge branch 'main' into tagged-ast
SharafMohamed Sep 23, 2024
e5bda43
Make it explicitly clear when parent method is used
SharafMohamed Sep 23, 2024
0ac7c43
Use fmt in serialize
SharafMohamed Sep 23, 2024
529dcb2
Have fmt fetched if its not found
SharafMohamed Sep 23, 2024
2e17bee
Remove fetching fmt
SharafMohamed Sep 23, 2024
013765b
Fetch fmt with QUIET
SharafMohamed Sep 23, 2024
14cbe97
Added debug prints to cmake
SharafMohamed Sep 23, 2024
20864d6
Find after fetch
SharafMohamed Sep 23, 2024
72e61f6
Modified fmt fetch to match GSL
SharafMohamed Sep 23, 2024
aadb290
Fix fmt tag in fetch
SharafMohamed Sep 23, 2024
c9d5510
Revert fmt fetch back to how it was before
SharafMohamed Sep 23, 2024
f138527
Switched approach for generating unique capture group ids
SharafMohamed Sep 23, 2024
4cc5e2a
Update fetch fmt
SharafMohamed Sep 23, 2024
e08e345
fmt debug cmake
SharafMohamed Sep 23, 2024
ea5121a
Fixed order of declaring and making available for fetched content in …
SharafMohamed Sep 23, 2024
2ff09b5
Force fmt to generate install rules
SharafMohamed Sep 23, 2024
69cdd1f
Add docstring for FMT_INSTALL; Remove debug prints
SharafMohamed Sep 23, 2024
f580cdd
Copy GSL when fetching fmt
SharafMohamed Sep 23, 2024
c0dd3fe
Fix type in ftm include
SharafMohamed Sep 23, 2024
df12855
Cmake test
SharafMohamed Sep 23, 2024
be00753
Cmake test2
SharafMohamed Sep 23, 2024
2f47770
Cmake test3
SharafMohamed Sep 23, 2024
a15365b
Cmake test4
SharafMohamed Sep 23, 2024
239ff76
Cmake test5
SharafMohamed Sep 23, 2024
2826461
Fix cmake indentation
SharafMohamed Sep 23, 2024
db42bd8
Fix cmake indentation
SharafMohamed Sep 23, 2024
d5eeb5a
Remove space in cmake
SharafMohamed Sep 23, 2024
ce2fc76
Use std::move for set assignment
SharafMohamed Sep 25, 2024
9643138
Update docstrings; Use move and merge for sets.
SharafMohamed Sep 25, 2024
550961e
For serialize functions use const and make nullptr checks explicit; F…
SharafMohamed Sep 25, 2024
74c46ad
Update RegexAST docstring.
SharafMohamed Sep 25, 2024
29e0777
Return raw const* instead of unique_ptr
SharafMohamed Sep 25, 2024
cde19a6
Fix compiler errors in previous commit
SharafMohamed Sep 25, 2024
9837c10
Remove <format>
SharafMohamed Sep 25, 2024
dac2122
Replace bind with a lambda function
SharafMohamed Sep 25, 2024
aa4a4e4
Add fmt to clan-format header libraries; Run clang-format
SharafMohamed Sep 25, 2024
e00fead
Add description of new test
SharafMohamed Sep 25, 2024
63fd9da
Use GIT_SHALLOW ON for fmt in cmakelists
SharafMohamed Sep 25, 2024
a5eae39
Handle 32bit unicode in AST node serialize()
SharafMohamed Sep 26, 2024
7b73929
Remove irrelevent comment from clang-tidy (also fixes the previously …
SharafMohamed Sep 26, 2024
1c62d8c
Merge branch 'main' into tagged-ast
SharafMohamed Sep 26, 2024
033928f
Merge with previous PR
SharafMohamed Sep 26, 2024
1b6cd08
Update to have Catch2 give a nice output for u32strings; Update to me…
SharafMohamed Sep 26, 2024
bc4444a
Use better terminology in new comments
SharafMohamed Sep 26, 2024
48cea41
Merge with previous PR
SharafMohamed Oct 1, 2024
547bef4
Fix formatting
SharafMohamed Oct 1, 2024
c73c72c
Use boost instead of implementing our own u32_to_u8 method; Remove re…
SharafMohamed Oct 1, 2024
f364aac
Fix long string indentation
SharafMohamed Oct 1, 2024
e93ea4f
Remove unused include; Fix typo
SharafMohamed Oct 1, 2024
c2933f3
add default constructor to RegexASTEmpty
SharafMohamed Oct 1, 2024
f195955
move default constructors into class declaration
SharafMohamed Oct 1, 2024
37dfdca
Add docstring for RegexASTEmpty
SharafMohamed Oct 1, 2024
3e64e7c
Add test case for validity; Fix typo in test case name; Fix docstrin…
SharafMohamed Oct 1, 2024
69af073
Add [[maybe_unused]]; Compilation fix for swapping from reference to …
SharafMohamed Oct 1, 2024
14f9f69
Install boost in cmake if not found.
SharafMohamed Oct 1, 2024
e49421a
Remove REQUIRED from boost cmake line so it continues to the install.
SharafMohamed Oct 1, 2024
5ca7b4e
Correct tag for boost.
SharafMohamed Oct 1, 2024
2c72410
Correct tag for boost.
SharafMohamed Oct 1, 2024
f113c19
Fix formatting of GSL linking; Link against Boost::locale.
SharafMohamed Oct 1, 2024
064379c
Enable boost in cmake.
SharafMohamed Oct 1, 2024
bf06d33
Fix typo.
SharafMohamed Oct 1, 2024
34f9e4f
Make sure locale is found.
SharafMohamed Oct 1, 2024
af61ee1
Make sure boost is built.
SharafMohamed Oct 1, 2024
9329cc8
Remove boost::locale and use std::locale instead.
SharafMohamed Oct 1, 2024
ede2a16
Undo cmake changes.
SharafMohamed Oct 1, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
17 changes: 15 additions & 2 deletions src/log_surgeon/SchemaParser.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -30,6 +30,8 @@ using RegexASTCatByte = log_surgeon::finite_automata::RegexASTCat<
log_surgeon::finite_automata::RegexNFAByteState>;
using RegexASTCaptureByte = log_surgeon::finite_automata::RegexASTCapture<
log_surgeon::finite_automata::RegexNFAByteState>;
using RegexASTEmptyByte = log_surgeon::finite_automata::RegexASTEmpty<
log_surgeon::finite_automata::RegexNFAByteState>;

using std::make_unique;
using std::string;
Expand Down Expand Up @@ -196,8 +198,11 @@ static auto regex_or_rule(NonTerminal* m) -> unique_ptr<ParserAST> {

static auto regex_match_zero_or_more_rule(NonTerminal* m) -> unique_ptr<ParserAST> {
auto& r1 = m->non_terminal_cast(0)->get_parser_ast()->get<unique_ptr<RegexASTByte>>();
return unique_ptr<ParserAST>(new ParserValueRegex(
unique_ptr<RegexASTByte>(new RegexASTMultiplicationByte(std::move(r1), 0, 0))

// To handle negative tags we treat `R{0,N}` as `R{1,N} | ∅`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// To handle negative tags we treat `R{0,N}` as `R{1,N} | ∅`.
// To handle negative tags we treat `R*` as `R+ | ∅`.

return make_unique<ParserValueRegex>(make_unique<RegexASTOrByte>(
make_unique<RegexASTEmptyByte>(),
make_unique<RegexASTMultiplicationByte>(std::move(r1), 1, 0)
));
}

Expand Down Expand Up @@ -238,6 +243,14 @@ static auto regex_match_range_rule(NonTerminal* m) -> unique_ptr<ParserAST> {
max += r5_ptr->get_digit(i) * (uint32_t)pow(10, r5_size - i - 1);
}
auto& r1 = m->non_terminal_cast(0)->get_parser_ast()->get<unique_ptr<RegexASTByte>>();

if (min == 0) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (min == 0) {
if (0 == min) {

// To handle negative tags we treat `R{0,N}` as `R{1,N} | ∅`.
return make_unique<ParserValueRegex>(make_unique<RegexASTOrByte>(
make_unique<RegexASTEmptyByte>(),
make_unique<RegexASTMultiplicationByte>(std::move(r1), 1, max)
));
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
}
}

return unique_ptr<ParserAST>(new ParserValueRegex(
unique_ptr<RegexASTByte>(new RegexASTMultiplicationByte(std::move(r1), min, max))
));
Expand Down
59 changes: 59 additions & 0 deletions src/log_surgeon/finite_automata/RegexAST.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -124,6 +124,49 @@ class RegexAST {
std::set<uint32_t> m_negative_tags;
};

template <typename NFAStateType>
class RegexASTEmpty : public RegexAST<NFAStateType> {
public:
RegexASTEmpty();
SharafMohamed marked this conversation as resolved.
Show resolved Hide resolved

/**
* Used for cloning a unique_pointer of type RegexASTEmpty
* @return RegexASTEmpty*
*/
[[nodiscard]] auto clone() const -> gsl::owner<RegexASTEmpty*> override {
return new RegexASTEmpty(*this);
}

/**
* Sets is_possible_input to specify which utf8 characters are allowed in a
* lexer rule containing RegexASTEmpty at a leaf node in its AST, which is nothing
* @param is_possible_input
*/
auto set_possible_inputs_to_true(
[[maybe_unused]] std::array<bool, cSizeOfUnicode>& is_possible_input
) const -> void override {}

/**
* Transforms '.' to to be any non-delimiter in a lexer rule, which does
* nothing as RegexASTEmpty is a leaf node that is not a RegexASTGroup
* @param delimiters
*/
auto remove_delimiters_from_wildcard([[maybe_unused]] std::vector<uint32_t>& delimiters
) -> void override {
// Do nothing
}

/**
* Add the needed RegexNFA::states to the passed in nfa to handle a
* RegexASTEmpty before transitioning to an accepting end_state
* @param nfa
* @param end_state
*/
auto add_to_nfa(RegexNFA<NFAStateType>* nfa, NFAStateType* end_state) const -> void override;

[[nodiscard]] auto serialize() const -> std::u32string override;
};

template <typename NFAStateType>
class RegexASTLiteral : public RegexAST<NFAStateType> {
public:
Expand Down Expand Up @@ -655,6 +698,22 @@ class RegexASTCapture : public RegexAST<NFAStateType> {
uint32_t m_tag;
};

template <typename NFAStateType>
RegexASTEmpty<NFAStateType>::RegexASTEmpty() = default;

template <typename NFAStateType>
void RegexASTEmpty<NFAStateType>::add_to_nfa(
[[maybe_unused]] RegexNFA<NFAStateType>* nfa,
[[maybe_unused]] NFAStateType* end_state
) const {
// DO NOTHING
}

template <typename NFAStateType>
[[nodiscard]] auto RegexASTEmpty<NFAStateType>::serialize() const -> std::u32string {
return fmt::format(U"{}", RegexAST<NFAStateType>::serialize_negative_tags());
}

template <typename NFAStateType>
RegexASTLiteral<NFAStateType>::RegexASTLiteral(uint32_t character) : m_character(character) {}

Expand Down
66 changes: 65 additions & 1 deletion tests/test-lexer.cpp
Original file line number Diff line number Diff line change
@@ -1,4 +1,6 @@
#include <cstdint>
#include <numeric>
SharafMohamed marked this conversation as resolved.
Show resolved Hide resolved
#include <ranges>
#include <string>
#include <vector>

Expand All @@ -10,6 +12,7 @@
#include <log_surgeon/SchemaParser.hpp>

using std::string;
using std::u32string;
using std::vector;

using RegexASTCatByte = log_surgeon::finite_automata::RegexASTCat<
Expand All @@ -26,6 +29,46 @@ using RegexASTOrByte
= log_surgeon::finite_automata::RegexASTOr<log_surgeon::finite_automata::RegexNFAByteState>;
using log_surgeon::SchemaVarAST;

auto test_regex_ast(string const& regex, u32string const& expected_serialized_ast) -> void {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Let's create an anon namespace for local helpers
  2. In the anon namespace, let's add the function declaration at the top with a doc string explaining what this function is testing, and move the implementation to the bottom
  3. I think we can use std::string_view. This might require changes to the signature of Schema::add_variable to take a string_view, but I'm ok to put it in this PR

log_surgeon::Schema schema;
schema.add_variable("capture", regex, -1);
auto const schema_ast = schema.release_schema_ast_ptr();
auto const& capture_rule_ast = dynamic_cast<SchemaVarAST&>(*schema_ast->m_schema_vars[0]);

auto u32_to_u8 = [](char32_t c) -> std::string {
std::string u8;
if (c <= 0x7F) {
u8 += static_cast<char>(c);
} else if (c <= 0x7FF) {
u8 += static_cast<char>((c >> 6) | 0xC0);
u8 += static_cast<char>((c & 0x3F) | 0x80);
} else if (c <= 0xFFFF) {
u8 += static_cast<char>((c >> 12) | 0xE0);
u8 += static_cast<char>(((c >> 6) & 0x3F) | 0x80);
u8 += static_cast<char>((c & 0x3F) | 0x80);
} else {
u8 += static_cast<char>((c >> 18) | 0xF0);
u8 += static_cast<char>(((c >> 12) & 0x3F) | 0x80);
u8 += static_cast<char>(((c >> 6) & 0x3F) | 0x80);
u8 += static_cast<char>((c & 0x3F) | 0x80);
}
return u8;
};
SharafMohamed marked this conversation as resolved.
Show resolved Hide resolved

auto const actual_u32string = capture_rule_ast.m_regex_ptr->serialize();
auto const actual_string = fmt::format(
"{}",
fmt::join(actual_u32string | std::ranges::views::transform(u32_to_u8), "")
);

auto const expected_string = fmt::format(
"{}",
fmt::join(expected_serialized_ast | std::ranges::views::transform(u32_to_u8), "")
);

REQUIRE(actual_string == expected_string);
}

TEST_CASE("Test the Schema class", "[Schema]") {
SECTION("Add a number variable to schema") {
log_surgeon::Schema schema;
Expand Down Expand Up @@ -92,7 +135,11 @@ TEST_CASE("Test the Schema class", "[Schema]") {
// This test validates the serialization of a regex AST with named capture groups. The
// serialized output includes tags (<n> for positive matches, <~n> for negative matches) to
// indicate which capture groups are matched or unmatched at each node.

test_regex_ast(
"Z|(A(?<letter>((?<letter1>(a)|(b))|(?<letter2>(c)|(d))))B(?<containerID>\\d+)C)",
U"(Z<~0><~1><~2><~3>)|(A((((a)|(b))<0><~1>)|(((c)|(d))<1><~0>))<2>B([0-9]{1,inf})<"
"3>C)"
);
log_surgeon::Schema schema;
schema.add_variable(
// clang-format off
Expand Down Expand Up @@ -126,4 +173,21 @@ TEST_CASE("Test the Schema class", "[Schema]") {
REQUIRE(capture_rule_ast.m_regex_ptr->serialize()
== std::u32string(cExpectedSerializedU32StringWithTags));
}

SECTION("Test reptition regex") {
// Repetition without capture groups untagged and tagged AST are the same
test_regex_ast("a{0,10}", U"()|(a{1,10})");
test_regex_ast("a{5,10}", U"a{5,10}");
test_regex_ast("a*", U"()|(a{1,inf})");
test_regex_ast("a+", U"a{1,inf}");

// Repetition with capture groups untagged and tagged AST are different
test_regex_ast("(?<letter>a){0,10}", U"(<~0>)|((a)<0>{1,10})");
test_regex_ast("(?<letter>a){5,10}", U"(a)<0>{5,10}");
test_regex_ast("(?<letter>a)*", U"(<~0>)|((a)<0>{1,inf})");
test_regex_ast("(?<letter>a)+", U"(a)<0>{1,inf}");

// Capture group with repetition
test_regex_ast("(?<letter>a{0,10})", U"(()|(a{1,10}))<0>");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a more complicated test case like this?

test_regex_ast(
            "(((?<letterA>a)|(?<letterB>b))*)|(((?<letterC>c)|(?<letterD>d)){0,10})",
            U"((<~0><~1>)|(((a)<0><~1>)|((b)<1><~0>){1,inf})<~2><~3>)|((<~2><~3>)|(((c)<2><~3>)"
            U"|((d)<3><~2>){1,10})<~0><~1>)"
 );

}
}
Loading