Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug-Fix: Add negative tags for RegexMultiplicationAST with min=0. #41

Open
wants to merge 110 commits into
base: main
Choose a base branch
from

Conversation

SharafMohamed
Copy link
Contributor

@SharafMohamed SharafMohamed commented Sep 13, 2024

References

  • Depends on PR#40.

Description

Previously, RegexASTMultiplication was missing negative tags needed for generating a tagged-NFA. Namely, for regex repetition (e.g. R{0,N} or R*} containing a capture group, the 0 case indicates the capture group is not matched. In this case we need to add a negative tag. As a result we do the following:

  • Create an empty regex AST node, ∅.
  • Treat R{0,N} as R{1,N} | ∅
  • Treat R* as R+|∅

Validation performed

  • Create unit-tests for repetition regex.

SharafMohamed and others added 30 commits September 11, 2024 20:07
… tag to matching_variable_id; Use full names for vars (r->rule); Clarify if states are NFA or DFA
…for clairty that nothing is shared b/w tests
…egexASTgroup with min = 1 OR'd with RegexASTEmpty
…iteral arguments; Use const& for non-literals; Use auto where possible; Use uint32_t over int for ids; replace begin() and end() with cbegin() and cend()
Copy link

coderabbitai bot commented Oct 1, 2024

Warning

Rate limit exceeded

@SharafMohamed has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 5 minutes and 43 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📥 Commits

Files that changed from the base of the PR and between 34f9e4f and ede2a16.

Walkthrough

The changes enhance the project's code formatting and dependency management. The .clang-format file has been updated to include additional library headers and various formatting options. The CMakeLists.txt file now includes the Boost library as a dependency, with checks for its presence and commands to fetch it if necessary. Improvements have been made to regex parsing in SchemaParser.cpp and RegexAST.hpp, introducing a new class for empty nodes in the regex abstract syntax tree (AST). Additionally, the test suite has been expanded to cover new regex functionalities.

Changes

File(s) Change Summary
.clang-format Updated IncludeCategories regex to include "boost", adjusted various formatting options including alignment settings, brace wrapping rules, and spacing configurations.
CMakeLists.txt Added Boost library as a dependency with checks for version 1.74, using FetchContent to fetch Boost if not found. Refined existing checks for Microsoft.GSL and fmt.
src/log_surgeon/SchemaParser.cpp Introduced RegexASTEmptyByte type alias and updated regex_match_zero_or_more_rule and regex_match_range_rule functions to handle cases where the minimum value is zero. Updated method signatures accordingly.
src/log_surgeon/finite_automata/RegexAST.hpp Added RegexASTEmpty class to represent an empty node in the regex AST, implementing necessary virtual methods and serialization. Updated existing derived classes for compatibility with new tagging and serialization mechanisms.
tests/test-lexer.cpp Added test_regex_ast helper function for testing regex AST serialization. Expanded existing test cases to validate regex patterns with named capture groups and repetition patterns, ensuring correct serialized outputs. Refactored test code.

Possibly related PRs

Suggested reviewers

  • LinZhihao-723

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@SharafMohamed SharafMohamed marked this pull request as ready for review October 1, 2024 15:10
@SharafMohamed
Copy link
Contributor Author

@coderabbitai full review

Copy link

coderabbitai bot commented Oct 1, 2024

✅ Actions performed

Full review triggered.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (2)
tests/test-lexer.cpp (1)

177-192: LGTM: Comprehensive test cases for repetition regex

These test cases excellently cover various scenarios of regex repetition, both with and without capture groups. They align perfectly with the PR objectives, especially in verifying the correct handling of patterns where the minimum repetition is zero (e.g., {0,10}, *).

The distinction between repetition with and without capture groups is particularly valuable, as it demonstrates the new behaviour of adding negative tags when a capture group may not be matched.

For completeness, consider adding a test case for a regex with multiple capture groups and mixed repetitions, such as:

test_regex_ast("(?<a>x)?(?<b>y)+(?<c>z){0,2}", U"((<~0>)|((x)<0>))((y)<1>{1,inf})((<~2>)|((z)<2>{1,1})|((z)<2>{2,2}))");

This would further validate the correct interaction of different repetition types within a single regex.

src/log_surgeon/finite_automata/RegexAST.hpp (1)

127-128: Consider adding explanatory comments for RegexASTEmpty

While the implementation of RegexASTEmpty is correct, it would be beneficial to add a brief comment explaining its purpose and how it fits into the overall regex parsing strategy. This would improve code readability and maintainability.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between a092206 and 547bef4.

📒 Files selected for processing (3)
  • src/log_surgeon/SchemaParser.cpp (3 hunks)
  • src/log_surgeon/finite_automata/RegexAST.hpp (2 hunks)
  • tests/test-lexer.cpp (5 hunks)
🔇 Additional comments (11)
tests/test-lexer.cpp (4)

15-15: LGTM: Appropriate addition of u32string

The addition of using std::u32string; is appropriate as it's used in the new test_regex_ast function. This change enhances code readability by avoiding the need to prefix u32string with std:: throughout the file.


32-70: LGTM: Well-structured test helper function

The test_regex_ast function is a well-structured helper for testing regex AST serialization. It encapsulates the logic for creating a Schema, adding a variable, and comparing the serialized output, which promotes code reuse across multiple test cases.

Consider refactoring the UTF-8 conversion lambda

The lambda function u32_to_u8 manually converts char32_t to a UTF-8 string. As mentioned in a previous review, consider using standard library facilities or a well-tested library function for UTF-8 encoding to handle all edge cases properly and enhance maintainability.

Enhance test failure diagnostics

To improve the clarity of test failures, consider adding an error message to the REQUIRE statement. This will help quickly identify which part of the test failed when running the test suite.

Here's a suggested improvement:

-    REQUIRE(actual_string == expected_string);
+    REQUIRE(actual_string == expected_string, fmt::format("Expected: {}\nActual: {}", expected_string, actual_string));

138-142: LGTM: Comprehensive test for AST serialization with tags

This test case is an excellent addition that validates the serialization of a complex regex AST with named capture groups. It effectively covers the PR objectives by including tags for positive () and negative (<~n>) matches, ensuring correct handling of regex repetition constructs with capture groups.

The use of a complex regex pattern here helps to verify that the serialization works correctly for nested and alternating capture groups, which is crucial for the robustness of the implementation.


2-2: Verify the necessity of the header

The header has been added, but it's not clear where it's being used in the visible code. Please ensure that this header is necessary for the changes made, or remove it if it's not needed to keep the includes clean and minimize compilation time.

✅ Verification successful

Removal of the <numeric> header is recommended

The <numeric> header is not used in tests/test-lexer.cpp. Please remove it to keep the includes clean and minimize compilation time.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check if any functions from <numeric> are used in the file

# Test: Search for common functions from <numeric>
rg --type cpp '\b(accumulate|inner_product|partial_sum|adjacent_difference|iota)\b' tests/test-lexer.cpp

Length of output: 106

src/log_surgeon/SchemaParser.cpp (4)

33-34: LGTM: New type alias for empty regex AST node

This addition of RegexASTEmptyByte aligns well with the PR objective. It introduces a type to represent an empty node in the regex Abstract Syntax Tree, which is crucial for handling regex repetition constructs with a minimum value of zero.


201-206: Verify multiplication node parameters in zero-or-more rule

The implementation of regex_match_zero_or_more_rule now correctly treats R* as R+ | ∅, which aligns with the PR objectives. However, the parameters for the RegexASTMultiplicationByte (1, 0) seem counterintuitive. Could you please verify if these values are correct, or if they should be (1, UINT32_MAX) to represent one or more occurrences?


246-253: LGTM: Consistent implementation of range rule for min=0

The modification to regex_match_range_rule correctly implements the PR objective of treating R{0,N} as R{1,N} | ∅ when the minimum value is zero. This change is consistent with the implementation in the zero-or-more rule and properly handles the case of optional matches in regex patterns.


Line range hint 1-1000: Summary: Effective implementation of zero-minimum regex repetition handling

The changes in this file successfully implement the handling of regex repetition constructs with a minimum value of zero, aligning well with the PR objectives. The introduction of RegexASTEmptyByte and the consistent use of RegexASTOrByte to combine empty and non-empty matches provide a solid foundation for handling optional regex patterns.

A few points to consider:

  1. The implementation correctly treats R{0,N} as R{1,N} | ∅ across different regex rules.
  2. The approach is consistent and well-integrated with the existing code structure.
  3. There's a potential issue with the multiplication node parameters in the zero-or-more rule that needs verification.

Overall, these changes enhance the regex parsing capabilities as intended, with just a minor point needing clarification.

src/log_surgeon/finite_automata/RegexAST.hpp (3)

127-168: Implementation of RegexASTEmpty looks good

The new RegexASTEmpty class is well-implemented and consistent with other AST node classes. It correctly represents an empty node in the Regex AST, with appropriate implementations for all required methods.


Line range hint 832-863: Improved handling of zero minimum repetitions in RegexASTMultiplication

The modifications to the add_to_nfa() method in RegexASTMultiplication correctly implement the handling of regex repetitions with a minimum value of zero. The addition of an epsilon transition to the end state when m_min is 0 is an appropriate solution.


Line range hint 1-1155: Changes align well with PR objectives

The modifications in this file, particularly the addition of the RegexASTEmpty class and the changes to the RegexASTMultiplication class, successfully implement the handling of regex repetition constructs with a minimum value of zero. The changes are consistent throughout the file and align well with the stated PR objectives.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Outside diff range and nitpick comments (3)
tests/test-lexer.cpp (2)

36-36: Ensure safe usage of dynamic_cast

Using dynamic_cast to a reference without verification can lead to a std::bad_cast exception if the cast fails. Consider casting to a pointer and checking for nullptr to prevent potential runtime errors.

-auto const& capture_rule_ast = dynamic_cast<SchemaVarAST&>(*schema_ast->m_schema_vars[0]);
+auto const* capture_rule_ast = dynamic_cast<SchemaVarAST*>(schema_ast->m_schema_vars[0].get());
+REQUIRE(capture_rule_ast != nullptr);

177-177: Correct typo in section name

There's a typographical error in the section name. It should be "Test repetition regex" instead of "Test reptition regex."

-SECTION("Test reptition regex") {
+SECTION("Test repetition regex") {
src/log_surgeon/finite_automata/RegexAST.hpp (1)

704-711: Redundant implementation of add_to_nfa in source file.

The add_to_nfa method is defined in the source file with an empty body. If the method is intended to be empty, consider omitting its implementation entirely, as the default behaviour is sufficient.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between a092206 and 547bef4.

📒 Files selected for processing (3)
  • src/log_surgeon/SchemaParser.cpp (3 hunks)
  • src/log_surgeon/finite_automata/RegexAST.hpp (2 hunks)
  • tests/test-lexer.cpp (5 hunks)
🔇 Additional comments (6)
tests/test-lexer.cpp (1)

176-192: Well-structured addition of comprehensive test cases

The new test cases for various repetition regex patterns significantly enhance test coverage. They effectively validate the handling of optional and repeated capture groups, ensuring the regex AST serialization functions as intended.

src/log_surgeon/SchemaParser.cpp (3)

33-34: Addition of RegexASTEmptyByte Enhances Regex AST

Introducing RegexASTEmptyByte provides a clear representation for empty regex nodes, improving the readability and maintainability of the code.


201-205: Potential issue with min greater than max in RegexASTMultiplicationByte constructor

The concern about min being greater than max in RegexASTMultiplicationByte is still applicable. Instantiating it with min=1 and max=0 may lead to logical errors or undefined behaviour.


246-253: Possible incorrect handling of min == 0 in repetition range

The previous comment regarding the handling of cases where min == 0 and ensuring that min is less than or equal to max in RegexASTMultiplicationByte remains relevant here.

src/log_surgeon/finite_automata/RegexAST.hpp (2)

165-165: Confirm intended behaviour of add_to_nfa method.

The add_to_nfa method in RegexASTEmpty is declared but lacks an implementation in the class definition. Ensure that this method intentionally does nothing. If so, consider adding a comment to clarify this for future maintainers.

Add a comment to indicate intentional empty implementation:

 auto add_to_nfa(RegexNFA<NFAStateType>* nfa, NFAStateType* end_state) const -> void override {
+    // Intentionally empty: RegexASTEmpty does not contribute to the NFA.
 }

713-715: Ensure correct serialization of RegexASTEmpty.

The serialize method for RegexASTEmpty returns only the negative tags. Verify that this is the expected serialization output for an empty node. If additional information should be included, adjust the method accordingly.

Run the following script to review serialization usages:

✅ Verification successful

Serialization of RegexASTEmpty is correctly implemented.

No external calls to RegexASTEmpty::serialize() were found, confirming that returning only the negative tags is appropriate for an empty node.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check where serialize() is called and ensure it handles RegexASTEmpty appropriately.

# Search for serialize method calls in the codebase.
rg --type cpp --word-regexp 'serialize\(\)'

Length of output: 2989


Script:

#!/bin/bash
# Description: Locate all direct calls to RegexASTEmpty::serialize() to verify their usage.

rg --type cpp 'RegexASTEmpty<[^>]+>::serialize\(\)'

Length of output: 184

tests/test-lexer.cpp Outdated Show resolved Hide resolved
src/log_surgeon/finite_automata/RegexAST.hpp Outdated Show resolved Hide resolved
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (2)
CMakeLists.txt (1)

63-67: LGTM! Consider updating target_link_libraries.

The addition of FetchContent_MakeAvailable(Boost) is correct and necessary to ensure Boost is available for the project if it wasn't found by find_package.

However, I noticed that the target_link_libraries command for the log_surgeon target hasn't been updated to include Boost. You should add Boost to the list of linked libraries to ensure your project can use it.

Add Boost to the target_link_libraries command. Here's a suggested change:

target_link_libraries(log_surgeon 
    PUBLIC 
    fmt::fmt
+   $<$<NOT:$<BOOL:${Boost_FOUND}>>:Boost::boost>
+   $<$<BOOL:${Boost_FOUND}>:Boost::locale>
)

This change will link against the fetched Boost libraries if Boost wasn't found, or just the Boost::locale component if it was found on the system.

src/log_surgeon/finite_automata/RegexAST.hpp (1)

129-129: Correct typo in documentation

In the comment for RegexASTEmpty, the word "repeition" should be "repetition".

Apply this diff to fix the typo:

- * repeition with a minimum repetition of 0. Namely, we treat `R{0,N}` as `R{1,N} | ∅`. Then, the
+ * repetition with a minimum repetition of 0. Namely, we treat `R{0,N}` as `R{1,N} | ∅`. Then, the
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 547bef4 and 2c72410.

📒 Files selected for processing (4)
  • .clang-format (1 hunks)
  • CMakeLists.txt (2 hunks)
  • src/log_surgeon/finite_automata/RegexAST.hpp (3 hunks)
  • tests/test-lexer.cpp (5 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • .clang-format
🔇 Additional comments (10)
CMakeLists.txt (1)

43-50: LGTM! Consider using a more recent Boost version.

The addition of Boost as a dependency is well-implemented. The use of find_package with a fallback to FetchContent_Declare is a good practice. However, you might want to consider using a more recent version of Boost, as 1.74 is from August 2020. The latest version as of October 2024 would offer more features and bug fixes.

Also, ensure that the 'locale' component is sufficient for your needs, as only this component is specified in the find_package call.

To verify the Boost version availability, you can run:

✅ Verification successful

Boost Version Update Recommendation

The current Boost version specified is 1.74, while the latest available version is 1.86.0. It's advisable to update to Boost 1.86.0 to leverage the latest features and bug fixes.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check for the latest Boost version
curl -s https://www.boost.org/users/download/ | grep -oP 'Version \K\d+\.\d+\.\d+'

Length of output: 91

tests/test-lexer.cpp (6)

6-6: LGTM: Addition of Boost locale header

The inclusion of the Boost locale header is appropriate for handling UTF-8 conversions in the test cases. This aligns well with the PR objectives for improved regex pattern handling.


15-15: LGTM: Addition of u32string using declaration

The addition of the u32string using declaration is appropriate for handling Unicode characters in the regex patterns. This supports the PR objectives for improved regex handling.


32-55: LGTM: Addition of test_regex_ast helper function

The new test_regex_ast function effectively encapsulates the logic for testing regex AST serialization. This improves code reusability and readability, making it easier to add new test cases for different regex patterns.


39-41: Consider using a standard library function for UTF-8 encoding

The lambda function u32_to_u8 manually converts char32_t to a UTF-8 string. As previously suggested, consider using standard library facilities or a well-tested library function for UTF-8 encoding to handle all edge cases properly and improve maintainability.


Line range hint 123-143: LGTM: Comprehensive test case for AST with tags

This new test case effectively validates the serialization of complex regex patterns with named capture groups and alternation. The use of tags for positive and negative matches in the expected output ensures correct handling of optional matches, which aligns well with the PR objectives.


145-160: LGTM: Comprehensive test cases for repetition regex

These new test cases effectively validate various repetition regex patterns, both with and without capture groups. They cover different repetition quantifiers and demonstrate the correct handling of optional matches and negative tags. This set of tests directly addresses the PR objectives, particularly the handling of regex repetition constructs with a minimum value of zero.

src/log_surgeon/finite_automata/RegexAST.hpp (3)

127-159: Implementation of RegexASTEmpty is appropriate

The RegexASTEmpty class correctly implements all necessary virtual methods, properly handling the empty node scenarios in the regex AST.


270-270: Default constructor for RegexASTGroup defined

Defining the default constructor for RegexASTGroup enhances consistency and ensures proper initialization.


693-695: Verify the serialization output of RegexASTEmpty

The serialize method in RegexASTEmpty returns only the serialized negative tags. Please ensure that this behaviour aligns with the expected serialization format for an empty node in your system.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (2)
CMakeLists.txt (2)

43-50: LGTM! Consider adding a comment for clarity.

The addition of Boost as a dependency is well-implemented. The version check and FetchContent declaration are correct and follow best practices.

Consider adding a brief comment explaining why Boost is required, e.g.:

# Boost is required for Unicode support in regex operations
find_package(Boost 1.74 COMPONENTS locale)

63-69: LGTM! Consider adding error handling.

The implementation for fetching and finding Boost is correct and follows best practices.

Consider adding error handling in case Boost is still not found after fetching:

if(NOT Boost_FOUND)
    set(BOOST_ENABLE_CMAKE ON CACHE BOOL "Enable CMake support in Boost")
    FetchContent_MakeAvailable(Boost)
    find_package(Boost 1.74 COMPONENTS locale)
    if(NOT Boost_FOUND)
        message(FATAL_ERROR "Boost not found even after fetching. Please check your internet connection or Boost installation.")
    endif()
endif()
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 2c72410 and 34f9e4f.

📒 Files selected for processing (2)
  • CMakeLists.txt (4 hunks)
  • src/log_surgeon/finite_automata/RegexAST.hpp (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/log_surgeon/finite_automata/RegexAST.hpp
🔇 Additional comments (1)
CMakeLists.txt (1)

142-142: LGTM! Verify public linkage necessity.

The linking of Boost::locale and fmt::fmt to log_surgeon is correctly implemented.

Please verify that both Boost::locale and fmt::fmt are used in the public interface of log_surgeon. If they are only used internally, consider changing the linkage to PRIVATE:

@SharafMohamed SharafMohamed changed the title Bug-Fix: Add negative tags for RegexMultiplicationAST with min=0. Bug-Fix: Add negative tags for RegexMultiplicationAST with min=0. Oct 1, 2024
@@ -26,6 +30,33 @@ using RegexASTOrByte
= log_surgeon::finite_automata::RegexASTOr<log_surgeon::finite_automata::RegexNFAByteState>;
using log_surgeon::SchemaVarAST;

auto test_regex_ast(string const& regex, u32string const& expected_serialized_ast) -> void {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Let's create an anon namespace for local helpers
  2. In the anon namespace, let's add the function declaration at the top with a doc string explaining what this function is testing, and move the implementation to the bottom
  3. I think we can use std::string_view. This might require changes to the signature of Schema::add_variable to take a string_view, but I'm ok to put it in this PR

auto const* capture_rule_ast = dynamic_cast<SchemaVarAST*>(schema_ast->m_schema_vars[0].get());
REQUIRE(capture_rule_ast != nullptr);

auto u32_to_u8 = [](char32_t const u32_char) -> std::string {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. How about define this method as a local helper? Seem like we might also need it in other test cases in the future, and making it a lambda might introduce code duplication unintentionally by other developers.
  2. Instead of having a u32_char to u8_char conversion, can we have a method std::u32string -> std::string:
[[nodiscard]] auto u32string_to_utf8(std::u32string const& u32_str) -> string {
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> converter;
    return converter.to_bytes(u32_str);
}

test_regex_ast("(?<letter>a)+", U"(a)<0>{1,inf}");

// Capture group with repetition
test_regex_ast("(?<letter>a{0,10})", U"(()|(a{1,10}))<0>");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a more complicated test case like this?

test_regex_ast(
            "(((?<letterA>a)|(?<letterB>b))*)|(((?<letterC>c)|(?<letterD>d)){0,10})",
            U"((<~0><~1>)|(((a)<0><~1>)|((b)<1><~0>){1,inf})<~2><~3>)|((<~2><~3>)|(((c)<2><~3>)"
            U"|((d)<3><~2>){1,10})<~0><~1>)"
 );

@@ -238,6 +243,14 @@ static auto regex_match_range_rule(NonTerminal* m) -> unique_ptr<ParserAST> {
max += r5_ptr->get_digit(i) * (uint32_t)pow(10, r5_size - i - 1);
}
auto& r1 = m->non_terminal_cast(0)->get_parser_ast()->get<unique_ptr<RegexASTByte>>();

if (min == 0) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (min == 0) {
if (0 == min) {

make_unique<RegexASTEmptyByte>(),
make_unique<RegexASTMultiplicationByte>(std::move(r1), 1, max)
));
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
}
}

return unique_ptr<ParserAST>(new ParserValueRegex(
unique_ptr<RegexASTByte>(new RegexASTMultiplicationByte(std::move(r1), 0, 0))

// To handle negative tags we treat `R{0,N}` as `R{1,N} | ∅`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// To handle negative tags we treat `R{0,N}` as `R{1,N} | ∅`.
// To handle negative tags we treat `R*` as `R+ | ∅`.

return new RegexASTEmpty(*this);
}

// Do nothing as an empty node contains no utf8 characters.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we don't need this comment. If you want to keep it, we should move it inside the function body since it's an inline comment.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants