-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bug-Fix: Add negative tags for RegexMultiplicationAST
with min=0
.
#41
base: main
Are you sure you want to change the base?
Conversation
… tag to matching_variable_id; Use full names for vars (r->rule); Clarify if states are NFA or DFA
…for clairty that nothing is shared b/w tests
…egexASTgroup with min = 1 OR'd with RegexASTEmpty
…iteral arguments; Use const& for non-literals; Use auto where possible; Use uint32_t over int for ids; replace begin() and end() with cbegin() and cend()
…(); Add docstrign to RegexDFAStatePair
…NFA; Made add to nfa functions const
Co-authored-by: Lin Zhihao <[email protected]>
Warning Rate limit exceeded@SharafMohamed has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 5 minutes and 43 seconds before requesting another review. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. WalkthroughThe changes enhance the project's code formatting and dependency management. The Changes
Possibly related PRs
Suggested reviewers
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
@coderabbitai full review |
✅ Actions performedFull review triggered. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (2)
tests/test-lexer.cpp (1)
177-192
: LGTM: Comprehensive test cases for repetition regexThese test cases excellently cover various scenarios of regex repetition, both with and without capture groups. They align perfectly with the PR objectives, especially in verifying the correct handling of patterns where the minimum repetition is zero (e.g.,
{0,10}
,*
).The distinction between repetition with and without capture groups is particularly valuable, as it demonstrates the new behaviour of adding negative tags when a capture group may not be matched.
For completeness, consider adding a test case for a regex with multiple capture groups and mixed repetitions, such as:
test_regex_ast("(?<a>x)?(?<b>y)+(?<c>z){0,2}", U"((<~0>)|((x)<0>))((y)<1>{1,inf})((<~2>)|((z)<2>{1,1})|((z)<2>{2,2}))");This would further validate the correct interaction of different repetition types within a single regex.
src/log_surgeon/finite_automata/RegexAST.hpp (1)
127-128
: Consider adding explanatory comments for RegexASTEmptyWhile the implementation of
RegexASTEmpty
is correct, it would be beneficial to add a brief comment explaining its purpose and how it fits into the overall regex parsing strategy. This would improve code readability and maintainability.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (3)
- src/log_surgeon/SchemaParser.cpp (3 hunks)
- src/log_surgeon/finite_automata/RegexAST.hpp (2 hunks)
- tests/test-lexer.cpp (5 hunks)
🔇 Additional comments (11)
tests/test-lexer.cpp (4)
15-15
: LGTM: Appropriate addition of u32stringThe addition of
using std::u32string;
is appropriate as it's used in the newtest_regex_ast
function. This change enhances code readability by avoiding the need to prefixu32string
withstd::
throughout the file.
32-70
: LGTM: Well-structured test helper functionThe
test_regex_ast
function is a well-structured helper for testing regex AST serialization. It encapsulates the logic for creating a Schema, adding a variable, and comparing the serialized output, which promotes code reuse across multiple test cases.Consider refactoring the UTF-8 conversion lambda
The lambda function
u32_to_u8
manually convertschar32_t
to a UTF-8 string. As mentioned in a previous review, consider using standard library facilities or a well-tested library function for UTF-8 encoding to handle all edge cases properly and enhance maintainability.Enhance test failure diagnostics
To improve the clarity of test failures, consider adding an error message to the REQUIRE statement. This will help quickly identify which part of the test failed when running the test suite.
Here's a suggested improvement:
- REQUIRE(actual_string == expected_string); + REQUIRE(actual_string == expected_string, fmt::format("Expected: {}\nActual: {}", expected_string, actual_string));
138-142
: LGTM: Comprehensive test for AST serialization with tagsThis test case is an excellent addition that validates the serialization of a complex regex AST with named capture groups. It effectively covers the PR objectives by including tags for positive () and negative (<~n>) matches, ensuring correct handling of regex repetition constructs with capture groups.
The use of a complex regex pattern here helps to verify that the serialization works correctly for nested and alternating capture groups, which is crucial for the robustness of the implementation.
2-2
: Verify the necessity of the headerThe header has been added, but it's not clear where it's being used in the visible code. Please ensure that this header is necessary for the changes made, or remove it if it's not needed to keep the includes clean and minimize compilation time.
✅ Verification successful
Removal of the
<numeric>
header is recommendedThe
<numeric>
header is not used intests/test-lexer.cpp
. Please remove it to keep the includes clean and minimize compilation time.🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Check if any functions from <numeric> are used in the file # Test: Search for common functions from <numeric> rg --type cpp '\b(accumulate|inner_product|partial_sum|adjacent_difference|iota)\b' tests/test-lexer.cppLength of output: 106
src/log_surgeon/SchemaParser.cpp (4)
33-34
: LGTM: New type alias for empty regex AST nodeThis addition of
RegexASTEmptyByte
aligns well with the PR objective. It introduces a type to represent an empty node in the regex Abstract Syntax Tree, which is crucial for handling regex repetition constructs with a minimum value of zero.
201-206
: Verify multiplication node parameters in zero-or-more ruleThe implementation of
regex_match_zero_or_more_rule
now correctly treatsR*
asR+ | ∅
, which aligns with the PR objectives. However, the parameters for theRegexASTMultiplicationByte
(1, 0) seem counterintuitive. Could you please verify if these values are correct, or if they should be (1, UINT32_MAX) to represent one or more occurrences?
246-253
: LGTM: Consistent implementation of range rule for min=0The modification to
regex_match_range_rule
correctly implements the PR objective of treatingR{0,N}
asR{1,N} | ∅
when the minimum value is zero. This change is consistent with the implementation in the zero-or-more rule and properly handles the case of optional matches in regex patterns.
Line range hint
1-1000
: Summary: Effective implementation of zero-minimum regex repetition handlingThe changes in this file successfully implement the handling of regex repetition constructs with a minimum value of zero, aligning well with the PR objectives. The introduction of
RegexASTEmptyByte
and the consistent use ofRegexASTOrByte
to combine empty and non-empty matches provide a solid foundation for handling optional regex patterns.A few points to consider:
- The implementation correctly treats
R{0,N}
asR{1,N} | ∅
across different regex rules.- The approach is consistent and well-integrated with the existing code structure.
- There's a potential issue with the multiplication node parameters in the zero-or-more rule that needs verification.
Overall, these changes enhance the regex parsing capabilities as intended, with just a minor point needing clarification.
src/log_surgeon/finite_automata/RegexAST.hpp (3)
127-168
: Implementation of RegexASTEmpty looks goodThe new
RegexASTEmpty
class is well-implemented and consistent with other AST node classes. It correctly represents an empty node in the Regex AST, with appropriate implementations for all required methods.
Line range hint
832-863
: Improved handling of zero minimum repetitions in RegexASTMultiplicationThe modifications to the
add_to_nfa()
method inRegexASTMultiplication
correctly implement the handling of regex repetitions with a minimum value of zero. The addition of an epsilon transition to the end state whenm_min
is 0 is an appropriate solution.
Line range hint
1-1155
: Changes align well with PR objectivesThe modifications in this file, particularly the addition of the
RegexASTEmpty
class and the changes to theRegexASTMultiplication
class, successfully implement the handling of regex repetition constructs with a minimum value of zero. The changes are consistent throughout the file and align well with the stated PR objectives.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Outside diff range and nitpick comments (3)
tests/test-lexer.cpp (2)
36-36
: Ensure safe usage ofdynamic_cast
Using
dynamic_cast
to a reference without verification can lead to astd::bad_cast
exception if the cast fails. Consider casting to a pointer and checking fornullptr
to prevent potential runtime errors.-auto const& capture_rule_ast = dynamic_cast<SchemaVarAST&>(*schema_ast->m_schema_vars[0]); +auto const* capture_rule_ast = dynamic_cast<SchemaVarAST*>(schema_ast->m_schema_vars[0].get()); +REQUIRE(capture_rule_ast != nullptr);
177-177
: Correct typo in section nameThere's a typographical error in the section name. It should be "Test repetition regex" instead of "Test reptition regex."
-SECTION("Test reptition regex") { +SECTION("Test repetition regex") {src/log_surgeon/finite_automata/RegexAST.hpp (1)
704-711
: Redundant implementation ofadd_to_nfa
in source file.The
add_to_nfa
method is defined in the source file with an empty body. If the method is intended to be empty, consider omitting its implementation entirely, as the default behaviour is sufficient.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (3)
- src/log_surgeon/SchemaParser.cpp (3 hunks)
- src/log_surgeon/finite_automata/RegexAST.hpp (2 hunks)
- tests/test-lexer.cpp (5 hunks)
🔇 Additional comments (6)
tests/test-lexer.cpp (1)
176-192
: Well-structured addition of comprehensive test casesThe new test cases for various repetition regex patterns significantly enhance test coverage. They effectively validate the handling of optional and repeated capture groups, ensuring the regex AST serialization functions as intended.
src/log_surgeon/SchemaParser.cpp (3)
33-34
: Addition ofRegexASTEmptyByte
Enhances Regex ASTIntroducing
RegexASTEmptyByte
provides a clear representation for empty regex nodes, improving the readability and maintainability of the code.
201-205
: Potential issue withmin
greater thanmax
inRegexASTMultiplicationByte
constructorThe concern about
min
being greater thanmax
inRegexASTMultiplicationByte
is still applicable. Instantiating it withmin=1
andmax=0
may lead to logical errors or undefined behaviour.
246-253
: Possible incorrect handling ofmin == 0
in repetition rangeThe previous comment regarding the handling of cases where
min == 0
and ensuring thatmin
is less than or equal tomax
inRegexASTMultiplicationByte
remains relevant here.src/log_surgeon/finite_automata/RegexAST.hpp (2)
165-165
: Confirm intended behaviour ofadd_to_nfa
method.The
add_to_nfa
method inRegexASTEmpty
is declared but lacks an implementation in the class definition. Ensure that this method intentionally does nothing. If so, consider adding a comment to clarify this for future maintainers.Add a comment to indicate intentional empty implementation:
auto add_to_nfa(RegexNFA<NFAStateType>* nfa, NFAStateType* end_state) const -> void override { + // Intentionally empty: RegexASTEmpty does not contribute to the NFA. }
713-715
: Ensure correct serialization ofRegexASTEmpty
.The
serialize
method forRegexASTEmpty
returns only the negative tags. Verify that this is the expected serialization output for an empty node. If additional information should be included, adjust the method accordingly.Run the following script to review serialization usages:
✅ Verification successful
Serialization of
RegexASTEmpty
is correctly implemented.No external calls to
RegexASTEmpty::serialize()
were found, confirming that returning only the negative tags is appropriate for an empty node.🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Check where serialize() is called and ensure it handles RegexASTEmpty appropriately. # Search for serialize method calls in the codebase. rg --type cpp --word-regexp 'serialize\(\)'Length of output: 2989
Script:
#!/bin/bash # Description: Locate all direct calls to RegexASTEmpty::serialize() to verify their usage. rg --type cpp 'RegexASTEmpty<[^>]+>::serialize\(\)'Length of output: 184
…dundant unit-test
…s for methods; Combine empty definitio nwith declaration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (2)
CMakeLists.txt (1)
63-67
: LGTM! Consider updating target_link_libraries.The addition of
FetchContent_MakeAvailable(Boost)
is correct and necessary to ensure Boost is available for the project if it wasn't found byfind_package
.However, I noticed that the
target_link_libraries
command for thelog_surgeon
target hasn't been updated to include Boost. You should add Boost to the list of linked libraries to ensure your project can use it.Add Boost to the
target_link_libraries
command. Here's a suggested change:target_link_libraries(log_surgeon PUBLIC fmt::fmt + $<$<NOT:$<BOOL:${Boost_FOUND}>>:Boost::boost> + $<$<BOOL:${Boost_FOUND}>:Boost::locale> )This change will link against the fetched Boost libraries if Boost wasn't found, or just the Boost::locale component if it was found on the system.
src/log_surgeon/finite_automata/RegexAST.hpp (1)
129-129
: Correct typo in documentationIn the comment for
RegexASTEmpty
, the word "repeition" should be "repetition".Apply this diff to fix the typo:
- * repeition with a minimum repetition of 0. Namely, we treat `R{0,N}` as `R{1,N} | ∅`. Then, the + * repetition with a minimum repetition of 0. Namely, we treat `R{0,N}` as `R{1,N} | ∅`. Then, the
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (4)
- .clang-format (1 hunks)
- CMakeLists.txt (2 hunks)
- src/log_surgeon/finite_automata/RegexAST.hpp (3 hunks)
- tests/test-lexer.cpp (5 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- .clang-format
🔇 Additional comments (10)
CMakeLists.txt (1)
43-50
: LGTM! Consider using a more recent Boost version.The addition of Boost as a dependency is well-implemented. The use of
find_package
with a fallback toFetchContent_Declare
is a good practice. However, you might want to consider using a more recent version of Boost, as 1.74 is from August 2020. The latest version as of October 2024 would offer more features and bug fixes.Also, ensure that the 'locale' component is sufficient for your needs, as only this component is specified in the
find_package
call.To verify the Boost version availability, you can run:
✅ Verification successful
Boost Version Update Recommendation
The current Boost version specified is 1.74, while the latest available version is 1.86.0. It's advisable to update to Boost 1.86.0 to leverage the latest features and bug fixes.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Check for the latest Boost version curl -s https://www.boost.org/users/download/ | grep -oP 'Version \K\d+\.\d+\.\d+'Length of output: 91
tests/test-lexer.cpp (6)
6-6
: LGTM: Addition of Boost locale headerThe inclusion of the Boost locale header is appropriate for handling UTF-8 conversions in the test cases. This aligns well with the PR objectives for improved regex pattern handling.
15-15
: LGTM: Addition of u32string using declarationThe addition of the u32string using declaration is appropriate for handling Unicode characters in the regex patterns. This supports the PR objectives for improved regex handling.
32-55
: LGTM: Addition of test_regex_ast helper functionThe new test_regex_ast function effectively encapsulates the logic for testing regex AST serialization. This improves code reusability and readability, making it easier to add new test cases for different regex patterns.
39-41
: Consider using a standard library function for UTF-8 encodingThe lambda function
u32_to_u8
manually convertschar32_t
to a UTF-8 string. As previously suggested, consider using standard library facilities or a well-tested library function for UTF-8 encoding to handle all edge cases properly and improve maintainability.
Line range hint
123-143
: LGTM: Comprehensive test case for AST with tagsThis new test case effectively validates the serialization of complex regex patterns with named capture groups and alternation. The use of tags for positive and negative matches in the expected output ensures correct handling of optional matches, which aligns well with the PR objectives.
145-160
: LGTM: Comprehensive test cases for repetition regexThese new test cases effectively validate various repetition regex patterns, both with and without capture groups. They cover different repetition quantifiers and demonstrate the correct handling of optional matches and negative tags. This set of tests directly addresses the PR objectives, particularly the handling of regex repetition constructs with a minimum value of zero.
src/log_surgeon/finite_automata/RegexAST.hpp (3)
127-159
: Implementation ofRegexASTEmpty
is appropriateThe
RegexASTEmpty
class correctly implements all necessary virtual methods, properly handling the empty node scenarios in the regex AST.
270-270
: Default constructor forRegexASTGroup
definedDefining the default constructor for
RegexASTGroup
enhances consistency and ensures proper initialization.
693-695
: Verify the serialization output ofRegexASTEmpty
The
serialize
method inRegexASTEmpty
returns only the serialized negative tags. Please ensure that this behaviour aligns with the expected serialization format for an empty node in your system.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (2)
CMakeLists.txt (2)
43-50
: LGTM! Consider adding a comment for clarity.The addition of Boost as a dependency is well-implemented. The version check and FetchContent declaration are correct and follow best practices.
Consider adding a brief comment explaining why Boost is required, e.g.:
# Boost is required for Unicode support in regex operations find_package(Boost 1.74 COMPONENTS locale)
63-69
: LGTM! Consider adding error handling.The implementation for fetching and finding Boost is correct and follows best practices.
Consider adding error handling in case Boost is still not found after fetching:
if(NOT Boost_FOUND) set(BOOST_ENABLE_CMAKE ON CACHE BOOL "Enable CMake support in Boost") FetchContent_MakeAvailable(Boost) find_package(Boost 1.74 COMPONENTS locale) if(NOT Boost_FOUND) message(FATAL_ERROR "Boost not found even after fetching. Please check your internet connection or Boost installation.") endif() endif()
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (2)
- CMakeLists.txt (4 hunks)
- src/log_surgeon/finite_automata/RegexAST.hpp (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- src/log_surgeon/finite_automata/RegexAST.hpp
🔇 Additional comments (1)
CMakeLists.txt (1)
142-142
: LGTM! Verify public linkage necessity.The linking of Boost::locale and fmt::fmt to log_surgeon is correctly implemented.
Please verify that both Boost::locale and fmt::fmt are used in the public interface of log_surgeon. If they are only used internally, consider changing the linkage to PRIVATE:
RegexMultiplicationAST
with min=0
.
@@ -26,6 +30,33 @@ using RegexASTOrByte | |||
= log_surgeon::finite_automata::RegexASTOr<log_surgeon::finite_automata::RegexNFAByteState>; | |||
using log_surgeon::SchemaVarAST; | |||
|
|||
auto test_regex_ast(string const& regex, u32string const& expected_serialized_ast) -> void { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Let's create an anon namespace for local helpers
- In the anon namespace, let's add the function declaration at the top with a doc string explaining what this function is testing, and move the implementation to the bottom
- I think we can use
std::string_view
. This might require changes to the signature ofSchema::add_variable
to take a string_view, but I'm ok to put it in this PR
auto const* capture_rule_ast = dynamic_cast<SchemaVarAST*>(schema_ast->m_schema_vars[0].get()); | ||
REQUIRE(capture_rule_ast != nullptr); | ||
|
||
auto u32_to_u8 = [](char32_t const u32_char) -> std::string { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- How about define this method as a local helper? Seem like we might also need it in other test cases in the future, and making it a lambda might introduce code duplication unintentionally by other developers.
- Instead of having a
u32_char
tou8_char
conversion, can we have a methodstd::u32string
->std::string
:
[[nodiscard]] auto u32string_to_utf8(std::u32string const& u32_str) -> string {
std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> converter;
return converter.to_bytes(u32_str);
}
test_regex_ast("(?<letter>a)+", U"(a)<0>{1,inf}"); | ||
|
||
// Capture group with repetition | ||
test_regex_ast("(?<letter>a{0,10})", U"(()|(a{1,10}))<0>"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add a more complicated test case like this?
test_regex_ast(
"(((?<letterA>a)|(?<letterB>b))*)|(((?<letterC>c)|(?<letterD>d)){0,10})",
U"((<~0><~1>)|(((a)<0><~1>)|((b)<1><~0>){1,inf})<~2><~3>)|((<~2><~3>)|(((c)<2><~3>)"
U"|((d)<3><~2>){1,10})<~0><~1>)"
);
@@ -238,6 +243,14 @@ static auto regex_match_range_rule(NonTerminal* m) -> unique_ptr<ParserAST> { | |||
max += r5_ptr->get_digit(i) * (uint32_t)pow(10, r5_size - i - 1); | |||
} | |||
auto& r1 = m->non_terminal_cast(0)->get_parser_ast()->get<unique_ptr<RegexASTByte>>(); | |||
|
|||
if (min == 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (min == 0) { | |
if (0 == min) { |
make_unique<RegexASTEmptyByte>(), | ||
make_unique<RegexASTMultiplicationByte>(std::move(r1), 1, max) | ||
)); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
} | |
} | |
return unique_ptr<ParserAST>(new ParserValueRegex( | ||
unique_ptr<RegexASTByte>(new RegexASTMultiplicationByte(std::move(r1), 0, 0)) | ||
|
||
// To handle negative tags we treat `R{0,N}` as `R{1,N} | ∅`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// To handle negative tags we treat `R{0,N}` as `R{1,N} | ∅`. | |
// To handle negative tags we treat `R*` as `R+ | ∅`. |
return new RegexASTEmpty(*this); | ||
} | ||
|
||
// Do nothing as an empty node contains no utf8 characters. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we don't need this comment. If you want to keep it, we should move it inside the function body since it's an inline comment.
References
Description
Previously,
RegexASTMultiplication
was missing negative tags needed for generating a tagged-NFA. Namely, for regex repetition (e.g. R{0,N} or R*} containing a capture group, the 0 case indicates the capture group is not matched. In this case we need to add a negative tag. As a result we do the following:R{0,N}
asR{1,N} | ∅
R*
asR+|∅
Validation performed