Project To-Do List

This list outlines the steps required to complete the Torchvision Finetuning project, derived from prompt_plan.md.

Phase 1: Foundation & Setup

  • Initialize project structure (configs, data, models, utils, tests, scripts)
  • Initialize git repository
  • Configure .gitignore
  • Set up pyproject.toml with uv
  • Add dependencies (torch, torchvision with CUDA 12.4, ruff, numpy, Pillow, pytest, pre-commit)
  • Configure pre-commit with ruff (formatting, linting)
  • Create empty __init__.py files
  • Create placeholder files (train.py, test.py, configs/base_config.py, etc.)
  • Create basic README.md
  • Install pre-commit hooks
  • Verify PyTorch GPU integration (scripts/check_gpu.py)
  • Create data download script (scripts/download_data.sh)
  • Implement configuration system (configs/base_config.py, configs/pennfudan_maskrcnn_config.py)
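One way the configuration system could start out, assuming plain Python modules as configs (so train.py can import them dynamically); every name and value below is an illustrative assumption, not fixed by the plan:

```python
# configs/base_config.py -- minimal sketch of a plain-module config.
# Specific configs (e.g. configs/pennfudan_maskrcnn_config.py) can do
# `from configs.base_config import *` and override what they need.

# Data
DATA_ROOT = "data/PennFudanPed"  # assumed dataset location
NUM_CLASSES = 2                  # background + pedestrian

# Training
SEED = 42
BATCH_SIZE = 2
NUM_EPOCHS = 10
LEARNING_RATE = 0.005
MOMENTUM = 0.9
WEIGHT_DECAY = 0.0005

# LR scheduler (StepLR)
LR_STEP_SIZE = 3
LR_GAMMA = 0.1

# Checkpointing
CHECKPOINT_EVERY = 1  # save every N epochs
```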

Phase 2: Data Handling & Model

  • Implement PennFudanDataset class in utils/data_utils.py.
    • __init__: Load image and mask paths.
    • __getitem__: Load image/mask, parse masks, generate targets (boxes, labels, masks, image_id, area, iscrowd), apply transforms.
    • __len__: Return dataset size.
  • Implement get_transform(train) function in utils/data_utils.py (using torchvision.transforms.v2).
  • Implement collate_fn(batch) function in utils/data_utils.py.
  • Implement get_maskrcnn_model(num_classes, ...) function in models/detection.py.
    • Load pre-trained Mask R-CNN (maskrcnn_resnet50_fpn_v2).
    • Replace box predictor head (FastRCNNPredictor).
    • Replace mask predictor head (MaskRCNNPredictor).

Phase 3: Training Script & Core Logic

  • Set up basic train.py structure.
    • Add imports.
    • Implement argparse for --config argument.
    • Implement dynamic config loading (importlib).
    • Set random seeds.
    • Determine compute device (cuda or cpu).
    • Create output directory structure (outputs/<config_name>/checkpoints).
    • Instantiate PennFudanDataset (train).
    • Instantiate DataLoader (train) using collate_fn.
    • Instantiate model using get_maskrcnn_model.
    • Move model to device.
    • Add if __name__ == "__main__": guard.
  • Implement minimal training step in train.py.
    • Instantiate optimizer (torch.optim.SGD).
    • Set model.train().
    • Fetch one batch.
    • Move data to device.
    • Perform forward pass (loss_dict = model(...)).
    • Calculate total loss (sum(...)).
    • Perform backward pass (optimizer.zero_grad(), loss.backward(), optimizer.step()).
    • Print/log loss for the single step (and temporarily exit).
  • Implement logging setup in utils/log_utils.py (setup_logging function).
    • Configure logging.basicConfig for file and console output.
  • Integrate logging into train.py.
    • Call setup_logging.
    • Replace print with logging.info.
    • Log config, device, and training progress/losses.
  • Implement full training loop in train.py.
    • Remove single-step exit.
    • Add LR scheduler (torch.optim.lr_scheduler.StepLR).
    • Add epoch loop.
    • Add batch loop, integrating the single training step logic.
    • Log loss periodically within the batch loop.
    • Step the LR scheduler at the end of each epoch.
    • Log total training time.
  • Implement checkpointing in train.py.
    • Define checkpoint directory.
    • Implement logic to find and load the latest checkpoint (resume training).
    • Save checkpoints periodically (based on frequency or final epoch).
      • Include epoch, model state, optimizer state, scheduler state, config.
    • Log checkpoint loading/saving.
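The core training step described above can be sketched as one function; the signature is an assumption, but the loss handling reflects how torchvision's detection API works (the models return a dict of losses in train mode):

```python
import torch


def train_one_step(model, optimizer, images, targets, device):
    """Run a single optimisation step for a torchvision detection model."""
    model.train()
    images = [img.to(device) for img in images]
    targets = [{k: v.to(device) for k, v in t.items()} for t in targets]

    # In train mode, detection models return a dict such as
    # {"loss_classifier": ..., "loss_box_reg": ..., "loss_mask": ...}.
    loss_dict = model(images, targets)
    loss = sum(loss_dict.values())

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The full loop then wraps this in epoch/batch loops and steps the StepLR scheduler once per epoch.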
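For the checkpointing items, helpers along these lines would cover saving and resuming; the file-naming scheme and dict keys are assumptions:

```python
import glob
import os

import torch


def latest_checkpoint(ckpt_dir: str):
    """Return the most recently written epoch checkpoint, or None."""
    paths = glob.glob(os.path.join(ckpt_dir, "epoch_*.pth"))
    return max(paths, key=os.path.getmtime) if paths else None


def save_checkpoint(ckpt_dir, epoch, model, optimizer, scheduler, config):
    """Persist everything needed to resume training after `epoch`."""
    os.makedirs(ckpt_dir, exist_ok=True)
    torch.save(
        {
            "epoch": epoch,
            "model_state": model.state_dict(),
            "optimizer_state": optimizer.state_dict(),
            "scheduler_state": scheduler.state_dict(),
            "config": config,
        },
        os.path.join(ckpt_dir, f"epoch_{epoch:03d}.pth"),
    )
```

On startup, train.py can call latest_checkpoint and, if one is found, torch.load it and restore the saved state dicts before continuing from epoch + 1.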

Phase 4: Evaluation & Testing

  • Add evaluation dependencies (pycocotools, optional initially).
  • Create utils/eval_utils.py and implement evaluate function.
    • Set model.eval().
    • Use torch.no_grad().
    • Loop through validation/test dataloader.
    • Perform forward pass.
    • Calculate/aggregate metrics (start with average loss, potentially add mAP later).
    • Log evaluation metrics and time.
    • Return metrics.
  • Integrate evaluation into train.py.
    • Create validation Dataset and DataLoader (using torch.utils.data.Subset).
    • Call evaluate at the end of each epoch.
    • Log validation metrics.
    • (Later) Implement logic to save the best model based on validation metric.
  • Implement test.py script.
    • Reuse argument parsing, config loading, device setup, dataset/dataloader (test split), model creation from train.py.
    • Add --checkpoint argument.
    • Load model weights from the specified checkpoint.
    • Call evaluate function using the test dataloader.
    • Log/print final evaluation results.
    • Set up logging for testing (e.g., test.log).
  • Create unit tests in tests/ using pytest.
    • tests/test_config.py: Test config loading.
    • tests/test_model.py: Test model creation and head configuration.
    • tests/test_data_utils.py: Test dataset instantiation, length, and item format (requires data).
    • (Optional) Use fixtures in tests/conftest.py if needed.
  • Add pytest execution to .pre-commit-config.yaml.
  • Test pre-commit hooks (pre-commit run --all-files).
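One subtlety worth noting for the average-loss metric: torchvision detection models only return their loss dict in train mode, so a loss-based evaluate keeps model.train() under torch.no_grad() (and would switch to model.eval() when computing predictions for mAP). A sketch, with the signature and metric name as assumptions:

```python
import logging
import time

import torch


@torch.no_grad()
def evaluate(model, data_loader, device):
    """Average the detection losses over a validation/test loader."""
    # Detection models return losses only in train mode; gradients are
    # disabled by the decorator, so no parameters are updated here.
    model.train()
    start = time.time()
    total, batches = 0.0, 0
    for images, targets in data_loader:
        images = [img.to(device) for img in images]
        targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
        loss_dict = model(images, targets)
        total += sum(loss_dict.values()).item()
        batches += 1
    metrics = {"val_loss": total / max(batches, 1)}
    logging.info("Evaluation took %.1fs: %s", time.time() - start, metrics)
    return metrics
```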

Phase 5: Refinement & Documentation

  • Refine error handling in train.py and test.py (try...except).
  • Add configuration validation checks.
  • Improve evaluation metrics (e.g., implement mAP in evaluate function).
  • Add more data augmentations to get_transform(train=True).
  • Expand README.md significantly.
    • Goals
    • Detailed Setup
    • Configuration explanation
    • Training instructions (including resuming)
    • Testing instructions
    • Project Structure overview
    • Dependencies list
    • (Optional) Results section
  • Perform final code quality checks (ruff format ., ruff check . --fix).
  • Ensure all pre-commit hooks pass.