Bug-Fix: Add negative tags for `RegexMultiplicationAST` with `min=0`. #41

SharafMohamed · 2024-09-13T18:03:46Z

References

Depends on PR#40.

Description

Previously, RegexASTMultiplication was missing the negative tags needed to generate a tagged NFA. Namely, for a regex repetition (e.g. R{0,N} or R*) containing a capture group, the zero-repetition case means the capture group is not matched, so a negative tag must be added. To handle this, we do the following:

Create an empty regex AST node, ∅.
Treat R{0,N} as R{1,N} | ∅
Treat R* as R+|∅
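
The rewrite above can be sketched with a toy AST. This is a minimal illustration under stated assumptions: `Node`, `Literal`, `Empty`, `Or`, `Repeat` and `make_range` are hypothetical names, not log_surgeon's `RegexAST` classes, and the serialization format is only loosely modelled on the strings used in this PR's tests. Tags are left out to keep the sketch short.

```cpp
#include <cstdint>
#include <memory>
#include <string>
#include <utility>

// Hypothetical, simplified AST node interface (not log_surgeon's RegexAST).
struct Node {
    virtual ~Node() = default;
    [[nodiscard]] virtual auto serialize() const -> std::string = 0;
};

// Matches a single literal character.
struct Literal : Node {
    explicit Literal(char c) : m_c{c} {}
    [[nodiscard]] auto serialize() const -> std::string override { return std::string(1, m_c); }
    char m_c;
};

// The empty node (∅ in the description): matches the empty string.
struct Empty : Node {
    [[nodiscard]] auto serialize() const -> std::string override { return ""; }
};

// Alternation: left | right.
struct Or : Node {
    Or(std::unique_ptr<Node> left, std::unique_ptr<Node> right)
            : m_left{std::move(left)},
              m_right{std::move(right)} {}
    [[nodiscard]] auto serialize() const -> std::string override {
        return "(" + m_left->serialize() + ")|(" + m_right->serialize() + ")";
    }
    std::unique_ptr<Node> m_left;
    std::unique_ptr<Node> m_right;
};

// Bounded repetition: operand{min,max}.
struct Repeat : Node {
    Repeat(std::unique_ptr<Node> operand, uint32_t min, uint32_t max)
            : m_operand{std::move(operand)},
              m_min{min},
              m_max{max} {}
    [[nodiscard]] auto serialize() const -> std::string override {
        return m_operand->serialize() + "{" + std::to_string(m_min) + ","
               + std::to_string(m_max) + "}";
    }
    std::unique_ptr<Node> m_operand;
    uint32_t m_min;
    uint32_t m_max;
};

// Builds R{min,max}. When min is 0, returns ∅ | R{1,max} instead, so that the
// "zero repetitions" case is an explicit branch of its own.
// Assumes max >= 1 whenever min is 0.
[[nodiscard]] auto make_range(std::unique_ptr<Node> operand, uint32_t min, uint32_t max)
        -> std::unique_ptr<Node> {
    if (0 == min) {
        return std::make_unique<Or>(
                std::make_unique<Empty>(),
                std::make_unique<Repeat>(std::move(operand), 1, max)
        );
    }
    return std::make_unique<Repeat>(std::move(operand), min, max);
}
```

With this sketch, `make_range(std::make_unique<Literal>('a'), 0, 10)` serializes to `()|(a{1,10})`. The explicit empty branch is the place where a negative tag could be attached.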

Validation performed

Create unit-tests for repetition regex.

coderabbitai · 2024-10-01T15:09:57Z

📥 Commits

Files that changed from the base of the PR and between 34f9e4f and ede2a16.

Walkthrough

The changes enhance the project's code formatting and dependency management. The .clang-format file has been updated to include additional library headers and various formatting options. The CMakeLists.txt file now includes the Boost library as a dependency, with checks for its presence and commands to fetch it if necessary. Improvements have been made to regex parsing in SchemaParser.cpp and RegexAST.hpp, introducing a new class for empty nodes in the regex abstract syntax tree (AST). Additionally, the test suite has been expanded to cover new regex functionalities.

Changes

File(s)	Change Summary
`.clang-format`	Updated `IncludeCategories` regex to include "boost", adjusted various formatting options including alignment settings, brace wrapping rules, and spacing configurations.
`CMakeLists.txt`	Added Boost library as a dependency with checks for version `1.74`, using `FetchContent` to fetch Boost if not found. Refined existing checks for `Microsoft.GSL` and `fmt`.
`src/log_surgeon/SchemaParser.cpp`	Introduced `RegexASTEmptyByte` type alias and updated `regex_match_zero_or_more_rule` and `regex_match_range_rule` functions to handle cases where the minimum value is zero. Updated method signatures accordingly.
`src/log_surgeon/finite_automata/RegexAST.hpp`	Added `RegexASTEmpty` class to represent an empty node in the regex AST, implementing necessary virtual methods and serialization. Updated existing derived classes for compatibility with new tagging and serialization mechanisms.
`tests/test-lexer.cpp`	Added `test_regex_ast` helper function for testing regex AST serialization. Expanded existing test cases to validate regex patterns with named capture groups and repetition patterns, ensuring correct serialized outputs. Refactored test code.

Possibly related PRs

Add support for tracking matched and unmatched capture groups in RegexAST nodes using integer-based tags; Add support for serializing RegexAST nodes. #38: This PR involves updates to the CMakeLists.txt file to include the fmt library, which is relevant to the main PR's modifications in the .clang-format file and enhancements related to regex and AST serialization.

Suggested reviewers

LinZhihao-723

SharafMohamed · 2024-10-01T15:10:42Z

@coderabbitai full review

coderabbitai · 2024-10-01T15:10:48Z

✅ Actions performed

Full review triggered.

coderabbitai

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (2)

tests/test-lexer.cpp (1)
177-192: LGTM: Comprehensive test cases for repetition regex

These test cases excellently cover various scenarios of regex repetition, both with and without capture groups. They align perfectly with the PR objectives, especially in verifying the correct handling of patterns where the minimum repetition is zero (e.g., {0,10}, *).

The distinction between repetition with and without capture groups is particularly valuable, as it demonstrates the new behaviour of adding negative tags when a capture group may not be matched.

For completeness, consider adding a test case for a regex with multiple capture groups and mixed repetitions, such as:
test_regex_ast("(?<a>x)?(?<b>y)+(?<c>z){0,2}", U"((<~0>)|((x)<0>))((y)<1>{1,inf})((<~2>)|((z)<2>{1,1})|((z)<2>{2,2}))");
This would further validate the correct interaction of different repetition types within a single regex.
src/log_surgeon/finite_automata/RegexAST.hpp (1)

127-128: Consider adding explanatory comments for RegexASTEmpty

While the implementation of RegexASTEmpty is correct, it would be beneficial to add a brief comment explaining its purpose and how it fits into the overall regex parsing strategy. This would improve code readability and maintainability.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between a092206 and 547bef4.

📒 Files selected for processing (3)

src/log_surgeon/SchemaParser.cpp (3 hunks)
src/log_surgeon/finite_automata/RegexAST.hpp (2 hunks)
tests/test-lexer.cpp (5 hunks)

🔇 Additional comments (11)

tests/test-lexer.cpp (4)
15-15: LGTM: Appropriate addition of u32string

The addition of using std::u32string; is appropriate as it's used in the new test_regex_ast function. This change enhances code readability by avoiding the need to prefix u32string with std:: throughout the file.

32-70: LGTM: Well-structured test helper function

The test_regex_ast function is a well-structured helper for testing regex AST serialization. It encapsulates the logic for creating a Schema, adding a variable, and comparing the serialized output, which promotes code reuse across multiple test cases.

Consider refactoring the UTF-8 conversion lambda

The lambda function u32_to_u8 manually converts char32_t to a UTF-8 string. As mentioned in a previous review, consider using standard library facilities or a well-tested library function for UTF-8 encoding to handle all edge cases properly and enhance maintainability.

Enhance test failure diagnostics

To improve the clarity of test failures, consider adding an error message to the REQUIRE statement. This will help quickly identify which part of the test failed when running the test suite.

Here's a suggested improvement (Catch2's REQUIRE takes only the asserted expression, so the message is attached with INFO):
+    INFO(fmt::format("Expected: {}\nActual: {}", expected_string, actual_string));
     REQUIRE(actual_string == expected_string);
138-142: LGTM: Comprehensive test for AST serialization with tags

This test case is an excellent addition that validates the serialization of a complex regex AST with named capture groups. It effectively covers the PR objectives by including tags for positive (<n>) and negative (<~n>) matches, ensuring correct handling of regex repetition constructs with capture groups.

The use of a complex regex pattern here helps to verify that the serialization works correctly for nested and alternating capture groups, which is crucial for the robustness of the implementation.

2-2: Verify the necessity of the <numeric> header

The <numeric> header has been added, but it's not clear where it's being used in the visible code. Please ensure that this header is necessary for the changes made, or remove it if it's not needed to keep the includes clean and minimize compilation time.

✅ Verification successful

Removal of the <numeric> header is recommended

The <numeric> header is not used in tests/test-lexer.cpp. Please remove it to keep the includes clean and minimize compilation time.
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Description: Check if any functions from <numeric> are used in the file

# Test: Search for common functions from <numeric>
rg --type cpp '\b(accumulate|inner_product|partial_sum|adjacent_difference|iota)\b' tests/test-lexer.cpp
Length of output: 106
src/log_surgeon/SchemaParser.cpp (4)

33-34: LGTM: New type alias for empty regex AST node

This addition of RegexASTEmptyByte aligns well with the PR objective. It introduces a type to represent an empty node in the regex Abstract Syntax Tree, which is crucial for handling regex repetition constructs with a minimum value of zero.

201-206: Verify multiplication node parameters in zero-or-more rule

The implementation of regex_match_zero_or_more_rule now correctly treats R* as R+ | ∅, which aligns with the PR objectives. However, the parameters for the RegexASTMultiplicationByte (1, 0) seem counterintuitive. Could you please verify if these values are correct, or if they should be (1, UINT32_MAX) to represent one or more occurrences?
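
One possible reading of the `(1, 0)` arguments, offered as an assumption rather than a confirmed fact: the pre-change code in this PR built `R*` as a multiplication node with `(0, 0)`, which suggests that a `max` of 0 acts as a sentinel for "no upper bound". Under that reading, `(1, 0)` means "one or more". The sketch below uses a hypothetical `RepetitionBounds` type to show the convention; it is not the library's class.

```cpp
#include <cstdint>
#include <string>

// Hypothetical type illustrating the assumed "max == 0 means unbounded" convention.
struct RepetitionBounds {
    uint32_t min;
    uint32_t max;  // Assumed convention: 0 means "no upper bound".

    [[nodiscard]] auto is_unbounded() const -> bool { return 0 == max; }

    // Returns whether `count` repetitions fall within the bounds.
    [[nodiscard]] auto accepts(uint32_t count) const -> bool {
        return count >= min && (is_unbounded() || count <= max);
    }

    // Serializes as {min,max}, printing "inf" for an unbounded maximum.
    [[nodiscard]] auto serialize() const -> std::string {
        return "{" + std::to_string(min) + ","
               + (is_unbounded() ? std::string{"inf"} : std::to_string(max)) + "}";
    }
};
```

Under this convention `RepetitionBounds{1, 0}` serializes to `{1,inf}`, which matches the `{1,inf}` suffix in the expected test strings for `+`. A named constant or `std::optional<uint32_t>` for the upper bound would make the intent harder to misread than a bare 0.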

246-253: LGTM: Consistent implementation of range rule for min=0

The modification to regex_match_range_rule correctly implements the PR objective of treating R{0,N} as R{1,N} | ∅ when the minimum value is zero. This change is consistent with the implementation in the zero-or-more rule and properly handles the case of optional matches in regex patterns.

Line range hint 1-1000: Summary: Effective implementation of zero-minimum regex repetition handling

The changes in this file successfully implement the handling of regex repetition constructs with a minimum value of zero, aligning well with the PR objectives. The introduction of RegexASTEmptyByte and the consistent use of RegexASTOrByte to combine empty and non-empty matches provide a solid foundation for handling optional regex patterns.

A few points to consider:

The implementation correctly treats R{0,N} as R{1,N} | ∅ across different regex rules.

The approach is consistent and well-integrated with the existing code structure.

There's a potential issue with the multiplication node parameters in the zero-or-more rule that needs verification.

Overall, these changes enhance the regex parsing capabilities as intended, with just a minor point needing clarification.

src/log_surgeon/finite_automata/RegexAST.hpp (3)

127-168: Implementation of RegexASTEmpty looks good

The new RegexASTEmpty class is well-implemented and consistent with other AST node classes. It correctly represents an empty node in the Regex AST, with appropriate implementations for all required methods.

Line range hint 832-863: Improved handling of zero minimum repetitions in RegexASTMultiplication

The modifications to the add_to_nfa() method in RegexASTMultiplication correctly implement the handling of regex repetitions with a minimum value of zero. The addition of an epsilon transition to the end state when m_min is 0 is an appropriate solution.

Line range hint 1-1155: Changes align well with PR objectives

The modifications in this file, particularly the addition of the RegexASTEmpty class and the changes to the RegexASTMultiplication class, successfully implement the handling of regex repetition constructs with a minimum value of zero. The changes are consistent throughout the file and align well with the stated PR objectives.

coderabbitai

Actionable comments posted: 2

🧹 Outside diff range and nitpick comments (3)

tests/test-lexer.cpp (2)
36-36: Ensure safe usage of dynamic_cast

Using dynamic_cast to a reference without verification can lead to a std::bad_cast exception if the cast fails. Consider casting to a pointer and checking for nullptr to prevent potential runtime errors.
-auto const& capture_rule_ast = dynamic_cast<SchemaVarAST&>(*schema_ast->m_schema_vars[0]);
+auto const* capture_rule_ast = dynamic_cast<SchemaVarAST*>(schema_ast->m_schema_vars[0].get());
+REQUIRE(capture_rule_ast != nullptr);
177-177: Correct typo in section name

There's a typographical error in the section name. It should be "Test repetition regex" instead of "Test reptition regex."
-SECTION("Test reptition regex") {
+SECTION("Test repetition regex") {
src/log_surgeon/finite_automata/RegexAST.hpp (1)

704-711: Redundant implementation of add_to_nfa in source file.

The add_to_nfa method is defined in the source file with an empty body. If the method is intended to be empty, consider omitting its implementation entirely, as the default behaviour is sufficient.

🔇 Additional comments (6)

tests/test-lexer.cpp (1)

176-192: Well-structured addition of comprehensive test cases

The new test cases for various repetition regex patterns significantly enhance test coverage. They effectively validate the handling of optional and repeated capture groups, ensuring the regex AST serialization functions as intended.

src/log_surgeon/SchemaParser.cpp (3)

33-34: Addition of RegexASTEmptyByte Enhances Regex AST

Introducing RegexASTEmptyByte provides a clear representation for empty regex nodes, improving the readability and maintainability of the code.

201-205: Potential issue with min greater than max in RegexASTMultiplicationByte constructor

The concern about min being greater than max in RegexASTMultiplicationByte is still applicable. Instantiating it with min=1 and max=0 may lead to logical errors or undefined behaviour.

246-253: Possible incorrect handling of min == 0 in repetition range

The previous comment regarding the handling of cases where min == 0 and ensuring that min is less than or equal to max in RegexASTMultiplicationByte remains relevant here.
src/log_surgeon/finite_automata/RegexAST.hpp (2)
165-165: Confirm intended behaviour of add_to_nfa method.

The add_to_nfa method in RegexASTEmpty is declared but lacks an implementation in the class definition. Ensure that this method intentionally does nothing. If so, consider adding a comment to clarify this for future maintainers.

Add a comment to indicate intentional empty implementation:
 auto add_to_nfa(RegexNFA<NFAStateType>* nfa, NFAStateType* end_state) const -> void override {
+    // Intentionally empty: RegexASTEmpty does not contribute to the NFA.
 }
713-715: Ensure correct serialization of RegexASTEmpty.

The serialize method for RegexASTEmpty returns only the negative tags. Verify that this is the expected serialization output for an empty node. If additional information should be included, adjust the method accordingly.

Run the following script to review serialization usages:

✅ Verification successful

Serialization of RegexASTEmpty is correctly implemented.

No external calls to RegexASTEmpty::serialize() were found, confirming that returning only the negative tags is appropriate for an empty node.
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Description: Check where serialize() is called and ensure it handles RegexASTEmpty appropriately.

# Search for serialize method calls in the codebase.
rg --type cpp --word-regexp 'serialize'
Length of output: 2989

Script:
#!/bin/bash
# Description: Locate all direct calls to RegexASTEmpty::serialize() to verify their usage.

rg --type cpp 'RegexASTEmpty<[^>]+>::serialize'
Length of output: 184

coderabbitai

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (2)

CMakeLists.txt (1)
63-67: LGTM! Consider updating target_link_libraries.

The addition of FetchContent_MakeAvailable(Boost) is correct and necessary to ensure Boost is available for the project if it wasn't found by find_package.

However, I noticed that the target_link_libraries command for the log_surgeon target hasn't been updated to include Boost. You should add Boost to the list of linked libraries to ensure your project can use it.

Add Boost to the target_link_libraries command. Here's a suggested change:
target_link_libraries(log_surgeon 
    PUBLIC 
    fmt::fmt
+   $<$<NOT:$<BOOL:${Boost_FOUND}>>:Boost::boost>
+   $<$<BOOL:${Boost_FOUND}>:Boost::locale>
)
This change will link against the fetched Boost libraries if Boost wasn't found, or just the Boost::locale component if it was found on the system.
src/log_surgeon/finite_automata/RegexAST.hpp (1)
129-129: Correct typo in documentation

In the comment for RegexASTEmpty, the word "repeition" should be "repetition".

Apply this diff to fix the typo:
- * repeition with a minimum repetition of 0. Namely, we treat `R{0,N}` as `R{1,N} | ∅`. Then, the
+ * repetition with a minimum repetition of 0. Namely, we treat `R{0,N}` as `R{1,N} | ∅`. Then, the

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 547bef4 and 2c72410.

📒 Files selected for processing (4)

.clang-format (1 hunks)
CMakeLists.txt (2 hunks)
src/log_surgeon/finite_automata/RegexAST.hpp (3 hunks)
tests/test-lexer.cpp (5 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

.clang-format

🔇 Additional comments (10)

CMakeLists.txt (1)
43-50: LGTM! Consider using a more recent Boost version.

The addition of Boost as a dependency is well-implemented. The use of find_package with a fallback to FetchContent_Declare is a good practice. However, you might want to consider using a more recent version of Boost, as 1.74 is from August 2020. The latest version as of October 2024 would offer more features and bug fixes.

Also, ensure that the 'locale' component is sufficient for your needs, as only this component is specified in the find_package call.

To verify the Boost version availability, you can run:

✅ Verification successful

Boost Version Update Recommendation

The current Boost version specified is 1.74, while the latest available version is 1.86.0. It's advisable to update to Boost 1.86.0 to leverage the latest features and bug fixes.
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Check for the latest Boost version
curl -s https://www.boost.org/users/download/ | grep -oP 'Version \K\d+\.\d+\.\d+'
Length of output: 91
tests/test-lexer.cpp (6)

6-6: LGTM: Addition of Boost locale header

The inclusion of the Boost locale header is appropriate for handling UTF-8 conversions in the test cases. This aligns well with the PR objectives for improved regex pattern handling.

15-15: LGTM: Addition of u32string using declaration

The addition of the u32string using declaration is appropriate for handling Unicode characters in the regex patterns. This supports the PR objectives for improved regex handling.

32-55: LGTM: Addition of test_regex_ast helper function

The new test_regex_ast function effectively encapsulates the logic for testing regex AST serialization. This improves code reusability and readability, making it easier to add new test cases for different regex patterns.

39-41: Consider using a standard library function for UTF-8 encoding

The lambda function u32_to_u8 manually converts char32_t to a UTF-8 string. As previously suggested, consider using standard library facilities or a well-tested library function for UTF-8 encoding to handle all edge cases properly and improve maintainability.

Line range hint 123-143: LGTM: Comprehensive test case for AST with tags

This new test case effectively validates the serialization of complex regex patterns with named capture groups and alternation. The use of tags for positive and negative matches in the expected output ensures correct handling of optional matches, which aligns well with the PR objectives.

145-160: LGTM: Comprehensive test cases for repetition regex

These new test cases effectively validate various repetition regex patterns, both with and without capture groups. They cover different repetition quantifiers and demonstrate the correct handling of optional matches and negative tags. This set of tests directly addresses the PR objectives, particularly the handling of regex repetition constructs with a minimum value of zero.

src/log_surgeon/finite_automata/RegexAST.hpp (3)

127-159: Implementation of RegexASTEmpty is appropriate

The RegexASTEmpty class correctly implements all necessary virtual methods, properly handling the empty node scenarios in the regex AST.

270-270: Default constructor for RegexASTGroup defined

Defining the default constructor for RegexASTGroup enhances consistency and ensures proper initialization.

693-695: Verify the serialization output of RegexASTEmpty

The serialize method in RegexASTEmpty returns only the serialized negative tags. Please ensure that this behaviour aligns with the expected serialization format for an empty node in your system.

coderabbitai

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (2)

CMakeLists.txt (2)
43-50: LGTM! Consider adding a comment for clarity.

The addition of Boost as a dependency is well-implemented. The version check and FetchContent declaration are correct and follow best practices.

Consider adding a brief comment explaining why Boost is required, e.g.:
# Boost is required for Unicode support in regex operations
find_package(Boost 1.74 COMPONENTS locale)
63-69: LGTM! Consider adding error handling.

The implementation for fetching and finding Boost is correct and follows best practices.

Consider adding error handling in case Boost is still not found after fetching:
if(NOT Boost_FOUND)
    set(BOOST_ENABLE_CMAKE ON CACHE BOOL "Enable CMake support in Boost")
    FetchContent_MakeAvailable(Boost)
    find_package(Boost 1.74 COMPONENTS locale)
    if(NOT Boost_FOUND)
        message(FATAL_ERROR "Boost not found even after fetching. Please check your internet connection or Boost installation.")
    endif()
endif()

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 2c72410 and 34f9e4f.

📒 Files selected for processing (2)

CMakeLists.txt (4 hunks)
src/log_surgeon/finite_automata/RegexAST.hpp (3 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

src/log_surgeon/finite_automata/RegexAST.hpp

🔇 Additional comments (1)

CMakeLists.txt (1)

142-142: LGTM! Verify public linkage necessity.

The linking of Boost::locale and fmt::fmt to log_surgeon is correctly implemented.

Please verify that both Boost::locale and fmt::fmt are used in the public interface of log_surgeon. If they are only used internally, consider changing the linkage to PRIVATE:

LinZhihao-723 · 2024-10-05T00:18:36Z

tests/test-lexer.cpp

@@ -26,6 +30,33 @@ using RegexASTOrByte
        = log_surgeon::finite_automata::RegexASTOr<log_surgeon::finite_automata::RegexNFAByteState>;
 using log_surgeon::SchemaVarAST;

+auto test_regex_ast(string const& regex, u32string const& expected_serialized_ast) -> void {


Let's create an anon namespace for local helpers

In the anon namespace, let's add the function declaration at the top with a doc string explaining what this function is testing, and move the implementation to the bottom

I think we can use std::string_view. This might require changes to the signature of Schema::add_variable to take a string_view, but I'm ok to put it in this PR

LinZhihao-723 · 2024-10-05T00:20:44Z

tests/test-lexer.cpp

+    auto const* capture_rule_ast = dynamic_cast<SchemaVarAST*>(schema_ast->m_schema_vars[0].get());
+    REQUIRE(capture_rule_ast != nullptr);
+
+    auto u32_to_u8 = [](char32_t const u32_char) -> std::string {


How about defining this method as a local helper? It seems we might also need it in other test cases in the future, and keeping it as a lambda might lead other developers to duplicate the code unintentionally.

Instead of having a u32_char to u8_char conversion, can we have a method std::u32string -> std::string:

[[nodiscard]] auto u32string_to_utf8(std::u32string const& u32_str) -> string {
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> converter;
    return converter.to_bytes(u32_str);
}
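
Note that `std::wstring_convert` and `std::codecvt_utf8` have been deprecated since C++17, so a helper built on them may trigger deprecation warnings. As an alternative, here is a hand-rolled sketch of the same `std::u32string` to `std::string` conversion. The name `encode_utf8` is hypothetical and this is not the code used in the PR (the commit list indicates Boost was used instead). It assumes every input value is a valid Unicode scalar value and does no validation of surrogates or values above U+10FFFF.

```cpp
#include <string>

// Encodes each code point as 1 to 4 UTF-8 bytes.
// Assumption: inputs are valid Unicode scalar values; nothing is validated.
[[nodiscard]] auto encode_utf8(std::u32string const& u32_str) -> std::string {
    std::string utf8;
    for (char32_t const code_point : u32_str) {
        if (code_point <= 0x7F) {
            // 1 byte: 0xxxxxxx
            utf8.push_back(static_cast<char>(code_point));
        } else if (code_point <= 0x7FF) {
            // 2 bytes: 110xxxxx 10xxxxxx
            utf8.push_back(static_cast<char>(0xC0 | (code_point >> 6)));
            utf8.push_back(static_cast<char>(0x80 | (code_point & 0x3F)));
        } else if (code_point <= 0xFFFF) {
            // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            utf8.push_back(static_cast<char>(0xE0 | (code_point >> 12)));
            utf8.push_back(static_cast<char>(0x80 | ((code_point >> 6) & 0x3F)));
            utf8.push_back(static_cast<char>(0x80 | (code_point & 0x3F)));
        } else {
            // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            utf8.push_back(static_cast<char>(0xF0 | (code_point >> 18)));
            utf8.push_back(static_cast<char>(0x80 | ((code_point >> 12) & 0x3F)));
            utf8.push_back(static_cast<char>(0x80 | ((code_point >> 6) & 0x3F)));
            utf8.push_back(static_cast<char>(0x80 | (code_point & 0x3F)));
        }
    }
    return utf8;
}
```

For the ASCII-only serialized ASTs in these tests (for example `U"(a)<0>{1,inf}"`), only the first branch is exercised; the other branches matter once non-ASCII literals appear in a regex.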

LinZhihao-723 · 2024-10-05T01:16:53Z

tests/test-lexer.cpp

+        test_regex_ast("(?<letter>a)+", U"(a)<0>{1,inf}");
+
+        // Capture group with repetition
+        test_regex_ast("(?<letter>a{0,10})", U"(()|(a{1,10}))<0>");


Can we add a more complicated test case like this?

test_regex_ast(
        "(((?<letterA>a)|(?<letterB>b))*)|(((?<letterC>c)|(?<letterD>d)){0,10})",
        U"((<~0><~1>)|(((a)<0><~1>)|((b)<1><~0>){1,inf})<~2><~3>)|((<~2><~3>)|(((c)<2><~3>)"
        U"|((d)<3><~2>){1,10})<~0><~1>)"
);

LinZhihao-723 · 2024-10-05T01:18:31Z

src/log_surgeon/SchemaParser.cpp

@@ -238,6 +243,14 @@ static auto regex_match_range_rule(NonTerminal* m) -> unique_ptr<ParserAST> {
        max += r5_ptr->get_digit(i) * (uint32_t)pow(10, r5_size - i - 1);
    }
    auto& r1 = m->non_terminal_cast(0)->get_parser_ast()->get<unique_ptr<RegexASTByte>>();
+
+    if (min == 0) {


Suggested change

if (min == 0) {

if (0 == min) {

LinZhihao-723 · 2024-10-05T01:18:47Z

src/log_surgeon/SchemaParser.cpp

+                make_unique<RegexASTEmptyByte>(),
+                make_unique<RegexASTMultiplicationByte>(std::move(r1), 1, max)
+        ));
+    }


Suggested change

}

}

LinZhihao-723 · 2024-10-05T01:19:07Z

src/log_surgeon/SchemaParser.cpp

-    return unique_ptr<ParserAST>(new ParserValueRegex(
-            unique_ptr<RegexASTByte>(new RegexASTMultiplicationByte(std::move(r1), 0, 0))
+
+    // To handle negative tags we treat `R{0,N}` as `R{1,N} | ∅`.


Suggested change

// To handle negative tags we treat `R{0,N}` as `R{1,N} | ∅`.

// To handle negative tags we treat `R*` as `R+ | ∅`.

LinZhihao-723 · 2024-10-05T01:19:52Z

src/log_surgeon/finite_automata/RegexAST.hpp

+        return new RegexASTEmpty(*this);
+    }
+
+    // Do nothing as an empty node contains no utf8 characters.


I think we don't need this comment. If you want to keep it, we should move it inside the function body since it's an inline comment.

SharafMohamed and others added 30 commits September 11, 2024 20:07

Bug-fix for unicode array sizes

a6274ec

Merge remote-tracking branch 'upstream/main' into nfa-cleanup-pr

186d239

Move LexicalRule to its own class; Change name to variable_id; Change…

4f122c6

… tag to matching_variable_id; Use full names for vars (r->rule); Clarify if states are NFA or DFA

Additional fix for swapping meaning of tag

c24f6e1

Another additional fix for swapping meaning of tag

33582da

Fix up some comments

3338ec7

Fix comment grammar

3cd3c0f

Add tags to AST; Serialize AST for testing; Add unit-test for testing…

e05acbb

… added tags

Use using to condense code; Use a unique schema object for each test …

5e61e83

…for clairty that nothing is shared b/w tests

Add has_capture_groups(); Add unit-test for has_capture_groups()

082090d

Create and use RegexASTEmpty to split RegexASTgroup with min=0 into R…

2c6d94e

…egexASTgroup with min = 1 OR'd with RegexASTEmpty

Add unit-test for 0 repetition regex

4e02f24

Add more tests for repetition regex

bb3c543

Return by value in literal getters; Use const instead of const& for l…

54027ad

…iteral arguments; Use const& for non-literals; Use auto where possible; Use uint32_t over int for ids; replace begin() and end() with cbegin() and cend()

Refactor new_state()

e58274f

Rename get_first_matching_variable_ids() to get_matching_variable_ids…

1321871

…(); Add docstrign to RegexDFAStatePair

Remove redundant docstrings

c904755

Remove has_capture_groups()

ffe9a0f

Const and auto changes

913ed1a

Changed AST add functions to indicate the AST are being added to the …

7aa8a92

…NFA; Made add to nfa functions const

Merge branch 'nfa-cleanup-pr' into comment-cleanup

77e44a5

Merged with previous PR

d1d87e7

Merge branch 'tagged-ast' into pre-tagged-nfa-cleanup

f386a3b

Merge branch 'pre-tagged-nfa-cleanup' into regex-ast-empty

0c600d7

Change add in RegexASTEmpty to add_to_nfa

bedad75

Update src/log_surgeon/finite_automata/RegexAST.hpp

053d057

Co-authored-by: Lin Zhihao <[email protected]>

updated examples to use

a822307

Merge branch 'nfa-cleanup-pr' into comment-cleanup

0b9603a

TODO to clarify RegexAST class is actually nodes in the AST

2ef84d1

Merge branch 'main' into comment-cleanup

83bd518

Fix formatting

547bef4

SharafMohamed marked this pull request as ready for review October 1, 2024 15:10

coderabbitai bot reviewed Oct 1, 2024

tests/test-lexer.cpp (outdated, resolved)

src/log_surgeon/finite_automata/RegexAST.hpp (outdated, resolved)

SharafMohamed added 13 commits October 1, 2024 11:39

Use boost instead of implementing our own u32_to_u8 method; Remove redundant unit-test

c73c72c

Fix long string indentation

f364aac

Remove unused include; Fix typo

e93ea4f

add default constructor to RegexASTEmpty

c2933f3

move default constructors into class declaration

f195955

Add docstring for RegexASTEmpty

37dfdca

Add test case for validity; Fix typo in test case name; Fix docstrings for methods; Combine empty definition with declaration.

3e64e7c

Add [[maybe_unused]]; Compilation fix for swapping from reference to pointer.

69af073

Install boost in cmake if not found.

14f9f69

Remove REQUIRED from boost cmake line so it continues to the install.

e49421a

Correct tag for boost.

5ca7b4e

Correct tag for boost.

2c72410

Fix formatting of GSL linking; Link against Boost::locale.

f113c19

coderabbitai bot reviewed Oct 1, 2024

SharafMohamed added 4 commits October 1, 2024 12:59

Enable boost in cmake.

064379c

Fix typo.

bf06d33

Make sure locale is found.

34f9e4f

Make sure boost is built.

af61ee1

coderabbitai bot reviewed Oct 1, 2024

SharafMohamed added 2 commits October 1, 2024 13:32

Remove boost::locale and use std::locale instead.

9329cc8

Undo cmake changes.

ede2a16

SharafMohamed changed the title ~~Bug-Fix: Add negative tags for RegexMultiplicationAST with min=0.~~ Bug-Fix: Add negative tags for RegexMultiplicationAST with min=0. Oct 1, 2024

LinZhihao-723 requested changes Oct 5, 2024

// To handle negative tags we treat `R{0,N}` as `R{1,N} | ∅`.
// To handle negative tags we treat `R*` as `R+ | ∅`.
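
For illustration, the rewrite described in the two comments above could be sketched roughly as follows. This is a minimal, self-contained sketch and not the code in this PR: `Node`, `Literal`, `Empty`, `Or`, `Repeat`, and `make_repetition` are hypothetical names, an unbounded repetition is assumed to be modelled with `std::nullopt`, and treating `R{0,0}` as just `∅` is a choice made for the sketch. The point it shows is that the zero-repetition case becomes its own alternation branch, which gives a later pass a concrete node on which a negative tag could be placed.

```cpp
#include <cstdint>
#include <memory>
#include <optional>
#include <string>
#include <utility>

// Hypothetical, simplified AST nodes for illustration only. These are NOT
// log_surgeon's actual RegexAST classes.
struct Node {
    virtual ~Node() = default;
    [[nodiscard]] virtual auto serialize() const -> std::string = 0;
};

struct Literal : Node {
    explicit Literal(char c) : m_char{c} {}
    [[nodiscard]] auto serialize() const -> std::string override {
        return std::string(1, m_char);
    }
    char m_char;
};

// The empty node: matches the empty string and contains no characters.
struct Empty : Node {
    [[nodiscard]] auto serialize() const -> std::string override { return "∅"; }
};

struct Or : Node {
    Or(std::unique_ptr<Node> left, std::unique_ptr<Node> right)
            : m_left{std::move(left)},
              m_right{std::move(right)} {}
    [[nodiscard]] auto serialize() const -> std::string override {
        return "(" + m_left->serialize() + ")|(" + m_right->serialize() + ")";
    }
    std::unique_ptr<Node> m_left;
    std::unique_ptr<Node> m_right;
};

// R{min,max}; max == std::nullopt is this sketch's way of saying "unbounded".
struct Repeat : Node {
    Repeat(std::unique_ptr<Node> operand, std::uint32_t min, std::optional<std::uint32_t> max)
            : m_operand{std::move(operand)},
              m_min{min},
              m_max{max} {}
    [[nodiscard]] auto serialize() const -> std::string override {
        auto const max_str = m_max.has_value() ? std::to_string(*m_max) : std::string{};
        return "(" + m_operand->serialize() + "){" + std::to_string(m_min) + "," + max_str + "}";
    }
    std::unique_ptr<Node> m_operand;
    std::uint32_t m_min;
    std::optional<std::uint32_t> m_max;
};

// Builds the node for R{min,max}. When min == 0 the result is R{1,max} | ∅, so
// the "matched zero times" case is an explicit branch of its own.
// Assumption of this sketch: callers pass min <= max; this is not validated.
[[nodiscard]] inline auto make_repetition(
        std::unique_ptr<Node> operand,
        std::uint32_t min,
        std::optional<std::uint32_t> max
) -> std::unique_ptr<Node> {
    if (0 == min) {
        if (max.has_value() && 0 == *max) {
            // R{0,0} can only match the empty string (sketch-specific choice).
            return std::make_unique<Empty>();
        }
        return std::make_unique<Or>(
                std::make_unique<Repeat>(std::move(operand), 1U, max),
                std::make_unique<Empty>()
        );
    }
    return std::make_unique<Repeat>(std::move(operand), min, max);
}
```

With these definitions, `make_repetition(std::make_unique<Literal>('a'), 0, 3)` produces an `Or` whose left branch is `a{1,3}` and whose right branch is `∅`, while a repetition with `min >= 1` is left as a plain `Repeat`.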
