V3.1.0 #158
Conversation
Improved model and vectorizer loading with thread locking and file scanning functionality. Made `is_sensitive` return a reason for logging sensitive files. Improved the threading process and logging.
Add summary and visualization functions for neural network model
- Implement `summary` function to generate a detailed summary of the model
- Implement `visualize_model` function to create a directed graph of the model's layers and weights
- Save model summary and visualization to the 'Vectorizer features' directory
- Add progress tracking and file handling for vectorizer features
# Conflicts: # CODE/VulnScan/tools/_study_network.py
Saving files is now neater
Added plot.py, which shows a heatmap in bar-graph form of the model's best 1000 features, as well as an .html file with a 3D plot of losses. Fixed a minor bug in _study_network.py by restoring the old save_graph() function, which now gives the Gephi file proper node counts.
Added checks that directories and files exist before writing/appending to them
Merged _plot.py into _study_network.py; added activation, weight-distribution, and t-SNE plots, which are all special; fixed some bugs; made sure all data is either genuine or synthetic; and modified config.ini to allow paths to be set there.
# Conflicts: # CODE/config.ini
1. Changed deprecation versions. 2. Removed an old plan made in v3.0.0 or v3.1.0.
Fixed a minor bug and added dump_memory.py to Logicytics.py. dump_memory.py generates around 3 files with data from the system's RAM; one is in hex and therefore unreadable.
Walkthrough

The changes in this pull request involve multiple modifications across various files, primarily focusing on enhancing documentation, error handling, and configuration management.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant Script
    participant Model
    participant Vectorizer
    User->>Script: Run Script
    Script->>Model: Load Model
    Script->>Vectorizer: Load Vectorizer
    Script->>Script: Scan Files
    Script->>Model: Check Sensitivity
    Script->>User: Return Results
```
Signed-off-by: Shahm Najeeb <[email protected]>
This pull request includes various changes across multiple files, focusing on adding documentation, updating deprecated functions, and modifying some configurations. The most important changes are summarized below:

Documentation Improvements:
Deprecated Functions:
Configuration Updates:
Code Modifications:
Documentation Enhancements:
Actionable comments posted: 9
🧹 Outside diff range and nitpick comments (24)
CODE/Logicytics.py (1)
Line range hint 1-394: Consider splitting the file into focused modules

The file has grown to handle multiple responsibilities including script execution, file management, privilege checking, and action handling. This makes it harder to maintain and test.
Consider splitting into these modules:
- action_handler.py: for handling actions and sub-actions
- script_executor.py: for script execution logic
- file_manager.py: for file operations
- security.py: for privilege checking
- main.py: for orchestration

This would:
- Improve maintainability
- Make testing easier
- Allow for better separation of concerns
- Reduce cognitive load when working with the codebase
Would you like me to help create a detailed plan for this refactoring?
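To make the suggestion concrete, here is a minimal sketch of what the orchestrating module could look like after such a split. Every function below is a stub standing in for code that would live in its own module; none of these names exist in the current codebase:

```python
# Hypothetical sketch of the proposed split; every name below is
# illustrative and not taken from the current Logicytics codebase.
# In the real refactor each function would live in its own module
# (security.py, action_handler.py, file_manager.py, script_executor.py).

def require_admin() -> None:
    """security.py: fail fast if the process is not elevated."""
    print("privilege check (stub)")

def parse_actions() -> tuple[str, str | None]:
    """action_handler.py: resolve CLI flags into an action/sub-action pair."""
    return "scan", None

def prepare_output_dir() -> str:
    """file_manager.py: create or clean the destination folder."""
    return "./output"

def run_scripts(action: str, sub_action: str | None, output_dir: str) -> None:
    """script_executor.py: dispatch and execute the selected scripts."""
    print(f"running {action}/{sub_action} into {output_dir}")

def main() -> None:
    """main.py: thin orchestration over the four modules above."""
    require_admin()
    action, sub_action = parse_actions()
    run_scripts(action, sub_action, prepare_output_dir())

if __name__ == "__main__":
    main()
```

Keeping `main()` this thin is what makes the split testable: each stub can be unit-tested in isolation once it lives in its own module.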
CODE/vulnscan.py (4)
Lines 57-64: Avoid redundant loading of model and vectorizer

The current implementation may still load the model and vectorizer multiple times if multiple threads enter the locked section before `model_to_use` or `vectorizer_to_use` are set. Consider using double-checked locking to prevent redundant loading.

Apply this diff to enhance efficiency:

```python
def scan_path(model_path: str, scan_paths: str, vectorizer_path: str):
    global model_to_use, vectorizer_to_use
    try:
        if model_to_use is None:
            with model_lock:
                if model_to_use is None:
                    log.info(f"Loading model from {model_path}")
                    model_to_use = load_model(model_path)
        if vectorizer_to_use is None:
            with vectorizer_lock:
                if vectorizer_to_use is None:
                    log.info(f"Loading vectorizer from {vectorizer_path}")
                    vectorizer_to_use = joblib.load(vectorizer_path)
        vulnscan(model_to_use, scan_paths, vectorizer_to_use)
```
Line range hint 119-124: Handle non-text files appropriately in `scan_file`

Attempting to read non-text files in text mode can lead to unexpected behavior or data corruption. It's recommended to skip non-text files or handle them safely.

Apply this diff to skip non-text files:

```diff
 def scan_file(model: torch.nn.Module, vectorizer: TfidfVectorizer, file_path: str) -> tuple[bool, float, str]:
     mime_type, _ = mimetypes.guess_type(file_path)
     if mime_type and mime_type.startswith('text'):
         with open(file_path, 'r', encoding='utf-8', errors='ignore') as file:
             content = file.read()
         return is_sensitive(model, vectorizer, content)
-    else:
-        with open(file_path, 'r', errors='ignore') as file:
-            content = file.read()
-        return is_sensitive(model, vectorizer, content)
+    else:
+        log.info(f"Skipping non-text file: {file_path}")
+        return False, 0.0, ""
```

🧰 Tools
🪛 Ruff (0.8.0)
150-150: Loop control variable `dirs` not used within loop body. Rename unused `dirs` to `_dirs` (B007)
Line 150: Rename unused loop variable `dirs` to `_dirs`

The variable `dirs` is not used in the loop body. Renaming it to `_` or `_dirs` indicates that it is intentionally unused.

Apply this diff:

```python
for root, dirs, files_main in os.walk(base_path):
    for file_main in files_main:
        paths.append(os.path.join(root, file_main))
```

Change to:

```python
for root, _dirs, files_main in os.walk(base_path):
    for file_main in files_main:
        paths.append(os.path.join(root, file_main))
```

🧰 Tools
🪛 Ruff (0.8.0)
150-150: Loop control variable `dirs` not used within loop body. Rename unused `dirs` to `_dirs` (B007)
Line range hint 154-160: Use a thread pool to manage threads efficiently

Creating a new thread for each file can lead to excessive resource usage and degrade system performance. Consider using a thread pool to limit the number of concurrent threads.

Apply this diff to implement a thread pool:

```diff
+from concurrent.futures import ThreadPoolExecutor
 log.warning("Starting scan - This may take hours and consume memory!!")
-threads = []
-for path in paths:
-    thread = threading.Thread(target=scan_path,
-                              args=("VulnScan/Model SenseMini .3n3.pth", path, "VulnScan/Vectorizer .3n3.pkl"))
-    threads.append(thread)
-    thread.start()
-
-for thread in threads:
-    thread.join()
+with ThreadPoolExecutor(max_workers=10) as executor:
+    for path in paths:
+        executor.submit(scan_path, "VulnScan/Model SenseMini .3n3.pth", path, "VulnScan/Vectorizer .3n3.pkl")
```

Adjust `max_workers` according to the optimal number of concurrent threads for your environment.

🧰 Tools
🪛 Ruff (0.8.0)
150-150: Loop control variable `dirs` not used within loop body. Rename unused `dirs` to `_dirs` (B007)
CODE/dump_memory.py (1)

Line 16: Remove unnecessary `f` prefixes from strings without placeholders

The strings on lines 16, 49, and 105 do not contain placeholders, so the `f` prefix is unnecessary.

Apply this diff:

```diff
-    dump_file = f"Ram_Snapshot.txt"
+    dump_file = "Ram_Snapshot.txt"
 ...
-    with open(f"SystemRam_Info.txt", "w") as sys_file:
+    with open("SystemRam_Info.txt", "w") as sys_file:
```

Also applies to: 49-49, 105-105

🧰 Tools
🪛 Ruff (0.8.0)
16-16: f-string without any placeholders. Remove extraneous `f` prefix (F541)
CODE/VulnScan/tools/_study_network.py (5)
Lines 281-285: Reduce or remove print statements inside nested loops

Printing messages inside tight loops can significantly slow down execution due to I/O overhead, especially with large `GRID_SIZE`. Consider using a progress bar or reducing the frequency of status updates.

Apply this diff:

```diff
 for i, dx in enumerate(x):
-    print(f"Computing loss for row {i + 1}/{GRID_SIZE}...")
     for j, dy in enumerate(y):
-        print(f"  Computing loss for column {j + 1}/{GRID_SIZE}...")
         param.data += dx * u + dy * v  # Apply perturbation
         loss = 0
```

Alternatively, use `tqdm` for progress tracking:

```python
from tqdm import tqdm

for i, dx in enumerate(tqdm(x, desc="Rows")):
    for j, dy in enumerate(y):
        # computation
```
Lines 335-337: Use actual feature importance values instead of random values

Using random values for `feature_importance` does not provide meaningful insights. Consider calculating actual feature importances from the model.

Replace with code that computes feature importance:

```diff
-    feature_importance = np.random.rand(len(tokens[:NUMBER_OF_FEATURES]))  # Example random importance
+    feature_importance = np.abs(model.linear.weight.detach().cpu().numpy()[0, :NUMBER_OF_FEATURES])
```

Ensure that `model.linear` refers to the appropriate layer in your model.
Line 380: Simplify comparison using the `!=` operator

Replace `not (module == model_to_use)` with `module != model_to_use` for clarity and simplicity.

Apply this diff:

```diff
 if (
         not isinstance(module, nn.Sequential)
         and not isinstance(module, nn.ModuleList)
-        and not (module == model_to_use)
+        and module != model_to_use
 ):
     hooks.append(module.register_forward_hook(hook))
```

🧰 Tools
🪛 Ruff (0.8.0)
380-380: Use `module != model_to_use` instead of `not module == model_to_use`. Replace with `!=` operator (SIM201)
Lines 435-436: Combine nested `if` statements into a single condition

Simplify the nested `if` statements for better readability.

Apply this diff:

```diff
-if "trainable" in summaries[layer]:
-    if summaries[layer]["trainable"]:
+if summaries[layer].get("trainable"):
     trainable_params += summaries[layer]["nb_params"]
```

🧰 Tools
🪛 Ruff (0.8.0)
435-436: Use a single `if` statement instead of nested `if` statements (SIM102)
Line 623: Avoid raising ImportError when module is imported

Raising an `ImportError` when the module is imported prevents reuse of the code in other modules and is generally unnecessary. If the script is intended to be run as a standalone program, it's sufficient to place executable code under `if __name__ == '__main__':` without raising exceptions.

Apply this diff:

```diff
 else:
-    raise ImportError("This file cannot be imported")
+    pass  # Allow import without executing the main block
```

CODE/VulnScan/tools/_test_gpu_acceleration.py (1)

Line 27: Avoid raising ImportError when module is imported

Raising an `ImportError` upon import restricts the module's reusability. Instead, simply ensure that execution-specific code is under the `if __name__ == '__main__':` block.

Apply this diff:

```diff
 else:
-    raise ImportError("This file cannot be imported")
+    pass  # Allow import without executing the main block
```

CODE/VulnScan/tools/_vectorizer.py (2)
Line range hint 51-55: Consider making max_features configurable

The vectorizer's max_features is hardcoded to 10000. Consider making this configurable through config.ini for better flexibility.

```diff
-    return TfidfVectorizer(max_features=10000)
+    max_features = config.getint('VulnScan.vectorizer Settings', 'max_features', fallback=10000)
+    return TfidfVectorizer(max_features=max_features)
```

Lines 83-84: Consider using `__all__` instead of ImportError

Rather than raising ImportError, consider using `__all__` to explicitly define public exports. This provides better control over module usage.

```diff
-else:
-    raise ImportError("This file cannot be imported")
+__all__ = ['load_data', 'choose_vectorizer', 'main']
```
CODE/VulnScan/v2-deprecated/_generate_data.py (1)

Line range hint 5-8: Move configuration to config.ini

MAX_FILE_SIZE and SAVE_DIRECTORY should be configurable through config.ini rather than hardcoded.

```diff
-MAX_FILE_SIZE: int = 10 * 1024  # Example: Max file size is 10 KB
-SAVE_DIRECTORY: str = "PATH"
+from configparser import ConfigParser
+
+config = ConfigParser()
+config.read('../../config.ini')
+MAX_FILE_SIZE: int = config.getint('VulnScan.generate_data Settings', 'max_file_size', fallback=10 * 1024)
+SAVE_DIRECTORY: str = config.get('VulnScan.generate_data Settings', 'save_directory')
```

CODE/config.ini (2)
Lines 63-66: Standardize path handling across settings

The change to use PATH placeholders is good for flexibility, but ensure consistent path handling and validation across all settings.

Consider creating a configuration validation layer that:
- Validates all PATH placeholders
- Ensures directories exist
- Checks write permissions

A minimal sketch of such a layer follows this list.
Lines 91-103: Avoid hard-coded paths and magic numbers

The study settings contain several hard-coded values:
- Fixed path "NN features/"
- Magic number 3000 for the feature visualization limit

Consider:
- Making the features directory configurable
- Moving magic numbers to named constants
- Adding validation for the model and vectorizer paths

One way the constants piece might look is sketched below.
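A hedged sketch of named constants read from config.ini; the option names are assumptions, not actual keys in the file:

```python
# Hypothetical sketch; these option names are assumptions, not the actual
# keys in CODE/config.ini.
from configparser import ConfigParser

config = ConfigParser()
config.read("config.ini")  # a missing file simply leaves the fallbacks in place

# Named constants instead of magic numbers scattered through the code
MAX_FEATURES_TO_VISUALIZE = config.getint(
    "VulnScan.study Settings", "max_features_to_visualize", fallback=3000)
FEATURES_DIR = config.get(
    "VulnScan.study Settings", "features_dir", fallback="NN features/")
```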
CODE/VulnScan/Documentation.md (1)
Lines 110-138: Enhance documentation readability and consistency

The new documentation section is informative but has some minor issues:
- Missing articles in some sentences (e.g., "Is organized by" should be "It is organized by")
- Inconsistent punctuation in list items

Consider applying these improvements:
- Add missing articles
- Maintain consistent punctuation in lists
- Use complete sentences throughout
CODE/logicytics/FileManagement.py (1)
Lines 110-112: Improve tuple formatting for better readability

The excluded_prefixes tuple could be formatted more cleanly:

```diff
-    excluded_prefixes = ("config.ini", "SysInternal_Suite",
-                         "__pycache__", "logicytics", "VulnScan",
-                         "Vectorizer features")
+    excluded_prefixes = (
+        "config.ini",
+        "SysInternal_Suite",
+        "__pycache__",
+        "logicytics",
+        "VulnScan",
+        "Vectorizer features"
+    )
```

CODE/VulnScan/v2-deprecated/_train.py (2)
Line range hint 524-546: Remove hardcoded file paths.

The main block contains hardcoded file paths, which is a security risk and makes the code less portable.

Replace hardcoded paths with configuration or environment variables:

```diff
-    DATA = load_data(r"C:\Users\Hp\Desktop\Model Tests\Model Data\GeneratedData")
+    DATA = load_data(os.getenv('TRAINING_DATA_PATH'))
-    train_rfc(SAVE_DIR=r"PATH", EPOCHS=30, TEST_SIZE=0.2,
+    train_rfc(SAVE_DIR=os.getenv('MODEL_SAVE_PATH'), EPOCHS=30, TEST_SIZE=0.2,
               N_ESTIMATORS=100, RANDOM_STATE=42)
-    train_nn_svm(EPOCHS=50,
-                 MODEL="nn", SAVE_DIR=r"PATH", MAX_FEATURES=5000,
+    train_nn_svm(EPOCHS=50,
+                 MODEL="nn", SAVE_DIR=os.getenv('MODEL_SAVE_PATH'), MAX_FEATURES=5000,
```
Lines 204-214: Improve class documentation.

While the class is properly marked as deprecated, the documentation could be improved by adding descriptions for class attributes.

Add attribute descriptions to the class docstring:

```diff
     """
     Initializes the LSTM model.

     Args:
         vocab_size (int): Size of the vocabulary.
         embedding_dim (int): Dimension of the embedding layer.
         hidden_dim (int): Dimension of the hidden layer.
         output_dim (int): Dimension of the output layer.
+
+    Attributes:
+        embedding (nn.Embedding): Embedding layer for input vectorization
+        lstm (nn.LSTM): Bidirectional LSTM layer for sequence processing
+        fc (nn.Linear): Fully connected layer for output generation
+        sigmoid (nn.Sigmoid): Activation function for binary classification
     """
```

CODE/VulnScan/v3/_train.py (3)
Line range hint 181-346: Refactor suggestion: improve model training architecture

Several improvements could enhance the robustness and maintainability of the training implementation:
- Global variables create potential thread-safety issues
- No early stopping mechanism could lead to overfitting
- Fixed neural network architecture limits model flexibility

Consider implementing these improvements:

```diff
-def train_traditional_model(model_name: str,
-                            epochs: int,
-                            save_model_path: str):
-    global vectorizer, X_val, X_train
+class ModelTrainer:
+    def __init__(self, X_train, X_val, y_train, y_val):
+        self.X_train = X_train
+        self.X_val = X_val
+        self.y_train = y_train
+        self.y_val = y_val
+        self.early_stopping_patience = 5
+        self.best_val_loss = float('inf')
+        self.patience_counter = 0
+
+    def train_traditional_model(self, model_name: str,
+                                epochs: int,
+                                save_model_path: str):
```
Line range hint 247-346: Enhance neural network implementation

The current neural network implementation could benefit from several improvements:
- Fixed architecture limits flexibility
- No GPU memory management
- Static learning rate scheduling

Consider these enhancements:

```diff
+    def build_model(self, layer_sizes: list[int], dropout_rate: float = 0.5):
+        layers = []
+        for i in range(len(layer_sizes) - 1):
+            layers.extend([
+                nn.Linear(layer_sizes[i], layer_sizes[i + 1]),
+                nn.ReLU(),
+                nn.Dropout(dropout_rate)
+            ])
+        return nn.Sequential(*layers)

-    model = nn.Sequential(nn.Linear(X_train.shape[1], 128), nn.ReLU(), nn.Linear(128, 2))
+    model = build_model([X_train.shape[1], 256, 128, 64, 2])
```
Lines 347-395: Add configuration validation and resource management

While the main execution block has good error handling, it could benefit from configuration validation and resource management.

Consider adding:

```diff
+def validate_config(config):
+    required_fields = [
+        'model_name', 'train_data_path', 'epochs',
+        'batch_size', 'learning_rate', 'save_model_path'
+    ]
+    for field in required_fields:
+        if not config.get('VulnScan.train Settings', field):
+            raise ValueError(f"Missing required config field: {field}")

+    try:
+        validate_config(config)
         train_model(model_name=MODEL_NAME, epochs=EPOCHS,
                     batch_size=BATCH_SIZE, learning_rate=LEARN_RATE,
                     save_model_path=SAVE_PATH, use_cuda=CUDA)
+    finally:
+        # Clean up resources
+        if torch.cuda.is_available():
+            torch.cuda.empty_cache()
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (17)
- .gitignore (1 hunks)
- .idea/Logicytics.iml (2 hunks)
- CODE/Logicytics.py (2 hunks)
- CODE/VulnScan/Documentation.md (1 hunks)
- CODE/VulnScan/tools/_study_network.py (1 hunks)
- CODE/VulnScan/tools/_test_gpu_acceleration.py (1 hunks)
- CODE/VulnScan/tools/_vectorizer.py (4 hunks)
- CODE/VulnScan/v2-deprecated/_generate_data.py (5 hunks)
- CODE/VulnScan/v2-deprecated/_train.py (12 hunks)
- CODE/VulnScan/v3/_generate_data.py (4 hunks)
- CODE/VulnScan/v3/_train.py (2 hunks)
- CODE/config.ini (2 hunks)
- CODE/dump_memory.py (1 hunks)
- CODE/logicytics/FileManagement.py (1 hunks)
- CODE/vulnscan.py (5 hunks)
- PLANS.md (1 hunks)
- requirements.txt (1 hunks)
✅ Files skipped from review due to trivial changes (3)
- .gitignore
- .idea/Logicytics.iml
- requirements.txt
🧰 Additional context used
🪛 LanguageTool
CODE/VulnScan/Documentation.md
[uncategorized] ~121-~121: Possible missing preposition found. Context: "...ains the data used to train the models. Is organized by the file size and amount, ..." (AI_HYDRA_LEO_MISSING_IT)
[uncategorized] ~122-~122: Loose punctuation mark. Context: "...explicitly say text. - `Archived Models`: Contains the previously trained models...." (UNLIKELY_OPENING_PUNCTUATION)
[uncategorized] ~122-~122: Possible missing preposition found. Context: "...Contains the previously trained models. Is organized by the model type then versio..." (AI_HYDRA_LEO_MISSING_IT)
[uncategorized] ~122-~122: Possible missing comma found. Context: "...ained models. Is organized by the model type then version. - `NN features`: Contains..." (AI_HYDRA_LEO_MISSING_COMMA)
[uncategorized] ~123-~123: Loose punctuation mark. Context: "...model type then version. - `NN features`: Contains information about the model..." (UNLIKELY_OPENING_PUNCTUATION)
[grammar] ~123-~123: There seems to be a noun/verb agreement error. Did you mean "includes" or "included"? Context: "...3 and the vectorizer used. Information include: - `Documentation_Study_Network.md`: ..." (SINGULAR_NOUN_VERB_AGREEMENT)
[uncategorized] ~125-~125: Loose punctuation mark. Context: "...o. - `Neural Network Nodes Graph.gexf`: A Gephi file that contains the model no..." (UNLIKELY_OPENING_PUNCTUATION)
[uncategorized] ~126-~126: Loose punctuation mark. Context: "...ges. - `Nodes and edges (GEPHI).csv`: A CSV file that contains the model node..." (UNLIKELY_OPENING_PUNCTUATION)
[uncategorized] ~127-~127: Loose punctuation mark. Context: "...odel nodes and edges. - `Statistics`: Directories made by Gephi, containing t..." (UNLIKELY_OPENING_PUNCTUATION)
[uncategorized] ~128-~128: Loose punctuation mark. Context: "... and edges. - `Feature_Importance.svg`: A SVG file that contains the feature im..." (UNLIKELY_OPENING_PUNCTUATION)
[uncategorized] ~129-~129: Loose punctuation mark. Context: "... the model. - `Loss_Landscape_3D.html`: A HTML file that contains the 3D loss l..." (UNLIKELY_OPENING_PUNCTUATION)
[uncategorized] ~131-~131: Loose punctuation mark. Context: "...epochs. - `Model state dictionary.txt`: A text file that contains the model sta..." (UNLIKELY_OPENING_PUNCTUATION)
[uncategorized] ~132-~132: Loose punctuation mark. Context: "...tate dictionary. - `Model Summary.txt`: A text file that contains the model sum..." (UNLIKELY_OPENING_PUNCTUATION)
[uncategorized] ~133-~133: Loose punctuation mark. Context: "...l summary. - `Model Visualization.png`: A PNG file that contains the model visu..." (UNLIKELY_OPENING_PUNCTUATION)
[uncategorized] ~134-~134: Loose punctuation mark. Context: "...visualization. - `Top_90_Features.svg`: A SVG file that contains the top 90 fea..." (UNLIKELY_OPENING_PUNCTUATION)
[uncategorized] ~135-~135: Loose punctuation mark. Context: "...the model. - `Vectorizer features.txt`: A text file that contains the vectorize..." (UNLIKELY_OPENING_PUNCTUATION)
[uncategorized] ~136-~136: Loose punctuation mark. Context: "...features. - `Visualize Activation.png`: A PNG file that contains the visualizat..." (UNLIKELY_OPENING_PUNCTUATION)
[uncategorized] ~137-~137: Loose punctuation mark. Context: "...el activation. - `Visualize t-SNE.png`: A PNG file that contains the visualizat..." (UNLIKELY_OPENING_PUNCTUATION)
[uncategorized] ~138-~138: Loose punctuation mark. Context: "...del t-SNE. - `Weight Distribution.png`: A PNG file that contains the weight dis..." (UNLIKELY_OPENING_PUNCTUATION)
🪛 Ruff (0.8.0)

CODE/vulnscan.py
150-150: Loop control variable `dirs` not used within loop body. Rename unused `dirs` to `_dirs` (B007)

CODE/dump_memory.py
16-16: f-string without any placeholders. Remove extraneous `f` prefix (F541)
105-105: f-string without any placeholders. Remove extraneous `f` prefix (F541)

CODE/VulnScan/tools/_study_network.py
380-380: Use `module != model_to_use` instead of `not module == model_to_use`. Replace with `!=` operator (SIM201)
435-436: Use a single `if` statement instead of nested `if` statements (SIM102)
🔇 Additional comments (5)
PLANS.md (2)
Lines 12-14: Clarify versioning format change

The versioning format has changed from specific versions (v3.3x.0) to snapshots (snapshot-3.3.x) for parser implementations. Please document the reasoning behind this change and ensure it's consistently applied across the project.

Lines 10-11: Verify version compatibility with PR

The PR is labeled as "V3.1.0" but tasks are scheduled for v3.2.0 through v3.5.0. Please clarify if these tasks are future planning or if they should be included in the current v3.1.0 release.

Also applies to: 15-16
CODE/config.ini (1)
Line 33: Verify PATH placeholder validation

While using a placeholder improves flexibility, ensure that the application validates and sanitizes the PATH value before use to prevent potential security issues.

Consider implementing a path validation utility function that can be reused across the application; a minimal sketch follows.
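One hedged sketch of such a reusable validator; the name and behaviour here are illustrative, not existing Logicytics code:

```python
# Hypothetical sketch of the suggested reusable validator; the name and
# behaviour are illustrative, not taken from the Logicytics codebase.
from pathlib import Path

def sanitize_path(raw: str, must_exist: bool = True) -> Path:
    """Expand, resolve, and sanity-check a configured path before use."""
    path = Path(raw).expanduser().resolve()  # normalise ~, symlinks, and ..
    if must_exist and not path.exists():
        raise FileNotFoundError(f"Configured path does not exist: {path}")
    return path

print(sanitize_path(".", must_exist=True))  # resolves to the current directory
```

Resolving the path up front also defuses simple `..` traversal tricks before the value reaches any file operations.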
✅ Verification successful
Let me search for any validation logic around this PATH configuration.

Let me check how these paths are actually used in the code, particularly in the `_generate_data.py` file, since it directly uses the `save_path` configuration.
Path validation is implemented appropriately
The codebase already implements proper path validation and handling:
- In `_generate_data.py`, the save_path is validated and created using `os.makedirs(SAVE_PATH, exist_ok=True)`
- The application uses standard Python path libraries (`pathlib.Path` and `os.path`) throughout the codebase
- Directory existence checks and automatic creation are consistently implemented across different modules
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:

```bash
#!/bin/bash
# Check for path validation in the codebase
rg -A 3 "save_path\s*=\s*.*PATH"
```

Length of output: 239

Script:

```bash
#!/bin/bash
# Search for PATH validation or sanitization
rg -i "path.*valid|valid.*path|sanitize.*path|path.*sanitize" -A 3
# Look for path-related utility functions
ast-grep --pattern 'def $_($$$) {
  $$$
  path $$$
  $$$
}'
# Check if PATH is used in the code
rg "PATH\s*=|=\s*PATH" -A 3
# Look for path handling imports
rg "^from.*path|^import.*path"
```

Length of output: 2982

Script:

```bash
#!/bin/bash
# Check path handling in _generate_data.py
rg -A 10 "SAVE_PATH = config\.get\('save_path'" CODE/VulnScan/v3/_generate_data.py
# Look for Path validation using pathlib
ast-grep --pattern 'Path($_)'
# Check for directory existence validation
rg "os\.path\.exists|os\.makedirs|Path.*exists|Path.*mkdir"
```

Length of output: 4104
CODE/VulnScan/v3/_generate_data.py (2)
Lines 1-8: LGTM! Clean imports with proper type hints setup.

The imports are well-organized, and the use of `__future__.annotations` shows good practice for type hinting.

Lines 156-211: LGTM! Good practice with module execution control

The implementation properly guards the main execution block and prevents accidental imports.
DELETED: DUPLICATE
It keeps mistaking files as added and removed due to a trailing `"`
CodeRabbit Suggestions
Actionable comments posted: 4
🧹 Outside diff range and nitpick comments (7)
README.md (1)
Line 294: Quick formatting fix needed! ✍️

There's a double comma in the description that should be removed.

```diff
-| sensitive_data_miner.py | Copies all files that can be considered sensitive in a neat folder, , very slow and clunky - useful for depth scanning | |
+| sensitive_data_miner.py | Copies all files that can be considered sensitive in a neat folder, very slow and clunky - useful for depth scanning | |
```

🧰 Tools
🪛 LanguageTool
[typographical] ~294-~294: Two consecutive commas. Context: "...e considered sensitive in a neat folder, , very slow and clunky - useful for depth..." (DOUBLE_PUNCTUATION)
CODE/VulnScan/v3/_train.py (2)
Lines 406-433: Nice error handling! 🎉

Love how you've handled the GPU errors and added fallback to CPU! This matches what was suggested in the previous review.

One small suggestion - maybe add some more details to those error messages? Like this:

```diff
-        logger.error(f"File Not Found Error in training model: {e}")
+        logger.error(f"Training data or model files not found: {e}. Please check if all required files exist.")
-        logger.error(f"Attribute Error in training model: {e}")
+        logger.error(f"Invalid model configuration or missing attributes: {e}. Please verify model settings.")
```
Lines 434-435: Make that error message more helpful! 📝

```diff
-    raise ImportError("This file cannot be imported")
+    raise ImportError("This training script is meant to be run directly and cannot be imported. Please execute it as a standalone script.")
```

This way, anyone trying to import the file will know exactly what to do instead! 🚀
CODE/dump_memory.py (4)
Lines 1-10: Hey! Let's move those constants to the top! 🔝

Those constants you've got hiding at the bottom (like `PROCESS_QUERY_INFORMATION`) should be chillin' at the top of the file with their import buddies. It's way easier to find them there!

```diff
 import psutil
 from logicytics import Log, DEBUG

+# Constants
+PROCESS_QUERY_INFORMATION = 0x0400
+PROCESS_VM_READ = 0x0010
+MEM_COMMIT = 0x1000
+PAGE_READWRITE = 0x04
+
 if __name__ == "__main__":
     log = Log({"log_level": DEBUG})
```

Also applies to: 155-160

Lines 13-55: Quick fixes to make your code even cooler! 😎

1. That f-string on line 16 doesn't need the `f` - it's just a regular string!
2. Those byte conversions are all over the place. Let's make a helper function to keep it clean!

```diff
+def bytes_to_gb(bytes_value: int) -> float:
+    """Convert bytes to GB with 2 decimal places"""
+    return bytes_value / (1024 ** 3)

 def dump_ram_content():
     try:
-        dump_file = f"Ram_Snapshot.txt"
+        dump_file = "Ram_Snapshot.txt"
         ...
         dump_content = (
             f"RAM Snapshot - {datetime.datetime.now().strftime('%Y-%m-%d_%H-%M-%S')}\n"
             "===================================\n"
             f"{system_info}"
-            f"Total Memory: {memory_info.total / (1024 ** 3):.2f} GB\n"
-            f"Available Memory: {memory_info.available / (1024 ** 3):.2f} GB\n"
-            f"Used Memory: {memory_info.used / (1024 ** 3):.2f} GB\n"
+            f"Total Memory: {bytes_to_gb(memory_info.total):.2f} GB\n"
+            f"Available Memory: {bytes_to_gb(memory_info.available):.2f} GB\n"
+            f"Used Memory: {bytes_to_gb(memory_info.used):.2f} GB\n"
             f"Memory Usage: {memory_info.percent}%\n\n"
-            f"Swap Total: {swap_info.total / (1024 ** 3):.2f} GB\n"
-            f"Swap Used: {swap_info.used / (1024 ** 3):.2f} GB\n"
-            f"Swap Free: {swap_info.free / (1024 ** 3):.2f} GB\n"
+            f"Swap Total: {bytes_to_gb(swap_info.total):.2f} GB\n"
+            f"Swap Used: {bytes_to_gb(swap_info.used):.2f} GB\n"
+            f"Swap Free: {bytes_to_gb(swap_info.free):.2f} GB\n"
             f"Swap Usage: {swap_info.percent}%\n"
         )
```

🧰 Tools
🪛 Ruff (0.8.0)
16-16: f-string without any placeholders. Remove extraneous `f` prefix (F541)

Lines 57-72: Add some docs to explain what this class does! 📝

This class looks solid, but future you (or your teammates) might appreciate knowing what it's for!

```diff
 class SystemInfo(ctypes.Structure):
+    """Windows system information structure for GetSystemInfo API.
+
+    This structure contains system information like processor architecture,
+    memory addresses, and other system details.
+    """
     _fields_ = [
         ("wProcessorArchitecture", ctypes.c_ushort),
         # ... rest of the fields
     ]
```

Lines 74-85: Same deal - let's add some docs here too! 📚

```diff
 class BasicMemInfo(ctypes.Structure):
+    """Windows memory information structure for VirtualQueryEx API.
+
+    Contains information about memory regions like base address,
+    protection, and state.
+    """
     _fields_ = [
         ("BaseAddress", ctypes.c_void_p),
         # ... rest of the fields
     ]
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (7)
- CODE/Logicytics.py (2 hunks)
- CODE/VulnScan/v3/_generate_data.py (4 hunks)
- CODE/VulnScan/v3/_train.py (2 hunks)
- CODE/_dev.py (2 hunks)
- CODE/config.ini (3 hunks)
- CODE/dump_memory.py (1 hunks)
- README.md (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- CODE/config.ini
🧰 Additional context used
📓 Learnings (1)
CODE/dump_memory.py (1)
Learnt from: DefinetlyNotAI
PR: DefinetlyNotAI/Logicytics#158
File: CODE/dump_memory.py:93-152
Timestamp: 2024-12-11T09:54:12.940Z
Learning: In the 'Logicytics' project, scripts like `dump_memory.py` are not supposed to consider legality concerns during code reviews.
🪛 LanguageTool
README.md
[typographical] ~294-~294: Two consecutive commas. Context: "...e considered sensitive in a neat folder, , very slow and clunky - useful for depth..." (DOUBLE_PUNCTUATION)

🪛 Ruff (0.8.0)
CODE/dump_memory.py
16-16: f-string without any placeholders. Remove extraneous `f` prefix (F541)
🔇 Additional comments (8)
CODE/_dev.py (2)
Line 69: LGTM! Fixed docstring indentation.

The docstring indentation is now properly aligned with the function body.

Lines 91-93: Nice catch! Handling those pesky quotation marks in filenames.

The code now properly handles filenames by removing quotation marks, which could cause issues when working with files. This is a solid fix! 🎯
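Purely as an illustration of the kind of stripping described, not the actual `_dev.py` implementation:

```python
# Hypothetical illustration of stripping quotation marks from filenames;
# not the actual _dev.py code.
def clean_filename(name: str) -> str:
    return name.replace('"', "")  # drop stray quotation marks

print(clean_filename('"report.txt"'))  # -> report.txt
```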
CODE/VulnScan/v3/_generate_data.py (4)
Lines 1-18: Sweet logging setup! 📝

The new logging configuration with color formatting makes it way easier to track what's happening in the code.

Lines 55-79: Hey, just checking - are you sure about using real-looking sensitive data?

I noticed you're generating realistic-looking credit cards, SSNs, and passwords. A previous reviewer raised this concern, but you mentioned security isn't a priority here.

Lines 196-200: Heads up! SenseMacro mode might crash your computer! 💥

The code generates 100x more files without any memory management. I see you're planning to deprecate this in v3.2.0, which is good!
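Should memory need to be bounded before the deprecation lands, one hedged sketch is to write files in small batches so nothing accumulates in RAM; `generate_one_file` below is a stand-in, not the real generator:

```python
# Hedged sketch only: bound memory by writing files in small batches.
# generate_one_file is a stand-in, not the real generator.
import gc
import random
import string

BATCH_SIZE = 500

def generate_one_file(index: int) -> tuple[str, str]:
    content = "".join(random.choices(string.ascii_letters, k=1024))
    return f"file_{index}.txt", content

def generate_in_batches(total_files: int) -> None:
    for start in range(0, total_files, BATCH_SIZE):
        for i in range(start, min(start + BATCH_SIZE, total_files)):
            name, content = generate_one_file(i)
            with open(name, "w", encoding="utf-8") as fh:
                fh.write(content)  # write immediately; nothing accumulates
        gc.collect()  # release transient objects between batches
```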
Lines 222-223: Nice touch with the import protection! 🛡️

Raising an ImportError when someone tries to import this file is smart - it helps prevent accidental imports that could trigger unwanted data generation.
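For reference, the guard being praised has this general shape; a generic sketch, not the file's exact code:

```python
# Generic sketch of the guard pattern described above; not the exact
# _generate_data.py code.
def main() -> None:
    print("expensive data generation would run here")

if __name__ == "__main__":
    main()
else:
    raise ImportError("This file cannot be imported")
```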
CODE/Logicytics.py (2)
Lines 19-20: Global variables alert! 🚨

Having `ACTION` and `SUB_ACTION` as global variables makes the code harder to test and maintain. Consider using a class to manage this state; a rough sketch follows.
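A rough sketch of what that class-based state could look like; the names are illustrative, not the current `Logicytics.py` code:

```python
# Hypothetical sketch of replacing the ACTION/SUB_ACTION globals with a
# small state object; names are illustrative, not the current code.
from dataclasses import dataclass

@dataclass
class ActionState:
    action: str | None = None
    sub_action: str | None = None

    def require_action(self) -> str:
        if self.action is None:
            raise ValueError("No action has been set")
        return self.action

state = ActionState(action="scan")
print(state.require_action())  # -> scan
```

Passing `state` explicitly into the functions that need it is what makes the behaviour straightforward to unit-test.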
Lines 328-334: Much better error handling for None values! 👍

The code now properly handles cases where `ACTION` is None by including a timestamp in the error message. This makes debugging way easier!
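Illustratively, that kind of timestamped fallback might resemble the following; hypothetical code, not the actual implementation:

```python
# Hypothetical illustration of the timestamped None handling described above.
import datetime

ACTION = None
if ACTION is None:
    stamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    print(f"[{stamp}] No action specified - falling back to default behaviour")
```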
Minor bug fixes, grammatical fixes and code formatting
Code Climate has analyzed commit 62da785 and detected 0 issues on this pull request. View more on Code Climate.
Actionable comments posted: 3
🧹 Outside diff range and nitpick comments (6)
CODE/VulnScan/v3/_generate_data.py (4)
Lines 11-17: Hey! Let's make the logger path work everywhere! 🔧

The logger path uses forward slashes which might not work on Windows. Let's use `os.path.join()` to make it work on all operating systems!

```diff
-    "filename": "../../../ACCESS/LOGS/VulnScan_Train.log",
+    "filename": os.path.join("..", "..", "..", "ACCESS", "LOGS", "VulnScan_Train.log"),
```
Line range hint 153-161: Oops! We're using `variation_choice` before we create it! 🐛

The `variation_choice` variable is used in the `if` statements but it's only created when `SIZE_VARIATION != 0`. This could crash!

Let's fix this:

```diff
     size = random.randint(MIN_FILE_SIZE, MAX_FILE_SIZE)
+    variation_choice = 0
     if SIZE_VARIATION != 0:
         variation_choice = random.choice([1, 2, 3, 4])
         if variation_choice == 1:
```
Line range hint 153-161: This size variation code is kinda hard to read 🤔

The size variation logic could be simplified using a dictionary or a more straightforward calculation.

Here's a cleaner way to do it:

```diff
     size = random.randint(MIN_FILE_SIZE, MAX_FILE_SIZE)
-    variation_choice = 0
-    if SIZE_VARIATION != 0:
-        variation_choice = random.choice([1, 2, 3, 4])
-        if variation_choice == 1:
-            size = abs(int(size + (size * SIZE_VARIATION)))
-        elif variation_choice == 2:
-            size = abs(int(size - (size * SIZE_VARIATION)))
-        elif variation_choice == 3:
-            size = abs(int(size + (size / SIZE_VARIATION)))
-        elif variation_choice == 4:
-            size = abs(int(size - (size / SIZE_VARIATION)))
+    if SIZE_VARIATION != 0:
+        variations = {
+            1: lambda s: s * (1 + SIZE_VARIATION),
+            2: lambda s: s * (1 - SIZE_VARIATION),
+            3: lambda s: s * (1 + 1/SIZE_VARIATION),
+            4: lambda s: s * (1 - 1/SIZE_VARIATION)
+        }
+        size = abs(int(variations[random.choice(list(variations.keys()))](size)))
```
Lines 183-185: Let's make these magic numbers easier to understand! 📝

The default file sizes and counts are using magic numbers. Let's make them constants with clear names!

```diff
-    DEFAULT_FILE_NUM = 10000
-    DEFAULT_MIN_FILE_SIZE = 10 * 1024  # 10KB
-    DEFAULT_MAX_FILE_SIZE = 10 * 1024  # 10KB
+    DEFAULT_FILE_NUM = 10_000
+    KB = 1024
+    DEFAULT_FILE_SIZE_KB = 10
+    DEFAULT_MIN_FILE_SIZE = DEFAULT_FILE_SIZE_KB * KB
+    DEFAULT_MAX_FILE_SIZE = DEFAULT_FILE_SIZE_KB * KB
```

CODE/dump_memory.py (2)
Lines 17-67: Let's make those GB calculations easier to read! 🎯

The memory size calculations could be clearer with a constant.

```diff
+# Constants for memory size conversion
+GB = 1024 ** 3
+
     # Prepare content to dump
     dump_content = (
         f"RAM Snapshot - {datetime.datetime.now().strftime('%Y-%m-%d_%H-%M-%S')}\n"
         "===================================\n"
         f"{system_info}"
-        f"Total Memory: {memory_info.total / (1024 ** 3):.2f} GB\n"
-        f"Available Memory: {memory_info.available / (1024 ** 3):.2f} GB\n"
-        f"Used Memory: {memory_info.used / (1024 ** 3):.2f} GB\n"
+        f"Total Memory: {memory_info.total / GB:.2f} GB\n"
+        f"Available Memory: {memory_info.available / GB:.2f} GB\n"
+        f"Used Memory: {memory_info.used / GB:.2f} GB\n"
```
Line 71: Hey, do we need this PyCharm comment? 🤔

The `# noinspection PyUnresolvedReferences` comment is IDE-specific and might not be needed.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (9)
- CODE/VulnScan/tools/_study_network.py (1 hunks)
- CODE/VulnScan/tools/_test_gpu_acceleration.py (1 hunks)
- CODE/VulnScan/tools/_vectorizer.py (4 hunks)
- CODE/VulnScan/v2-deprecated/_generate_data.py (5 hunks)
- CODE/VulnScan/v2-deprecated/_train.py (12 hunks)
- CODE/VulnScan/v3/_generate_data.py (4 hunks)
- CODE/VulnScan/v3/_train.py (2 hunks)
- CODE/dump_memory.py (1 hunks)
- README.md (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
- CODE/VulnScan/tools/_test_gpu_acceleration.py
- CODE/VulnScan/tools/_vectorizer.py
- CODE/VulnScan/v2-deprecated/_generate_data.py
- CODE/VulnScan/v2-deprecated/_train.py
👮 Files not reviewed due to content moderation or server errors (3)
- CODE/VulnScan/v3/_train.py
- README.md
- CODE/VulnScan/tools/_study_network.py
🧰 Additional context used
🪛 Ruff (0.8.0)
CODE/VulnScan/tools/_study_network.py
380-380: Use `module != model_to_use` instead of `not module == model_to_use`. Replace with `!=` operator (SIM201)
435-436: Use a single `if` statement instead of nested `if` statements (SIM102)
🔇 Additional comments (4)
CODE/VulnScan/v3/_generate_data.py (1)
Lines 196-197: Thanks for the heads up about deprecation! 👍

Good job adding warning logs about the SenseMacro mode deprecation. This helps users prepare for the change.

CODE/dump_memory.py (3)

Lines 1-15: Nice job with the setup! 👍

The imports and constants are organized well, and you're using the right libraries for the job.

Lines 129-141: Looking good! 👌

The function is well-documented and does its job.

Lines 212-218: Sweet error handling! 🎯

The main block catches and logs errors properly. Nice job!
@DefinetlyNotAI: All right, your changes look good! Approving the PR now! ✅

Actions performed
Comments resolved and changes approved.
Pull Request Template
Prerequisites
- `--dev` flag, if required.

PR Type
- update

Description
CodeRabbit will supply this

Motivation and Context
To maintain v3

Credit
N/A

Issues Fixed
N/A

Summary by CodeRabbit

New Features
- `.gitignore` file for improved file management.

Bug Fixes

Documentation
- `README.md` for better clarity and usability.

Chores
- `requirements.txt` to add and remove several package dependencies.