Create eval loop and use full train dataset

Craig
2025-04-12 10:55:10 +01:00
parent e9b97ac2b5
commit 0f3a96ca81
3 changed files with 266 additions and 50 deletions

todo.md

@@ -60,36 +60,36 @@ This list outlines the steps required to complete the Torchvision Finetuning pro
- [x] Call `setup_logging`.
- [x] Replace `print` with `logging.info`.
- [x] Log config, device, and training progress/losses.
- [ ] Implement full training loop in `train.py`.
- [ ] Remove single-step exit.
- [ ] Add LR scheduler (`torch.optim.lr_scheduler.StepLR`).
- [ ] Add epoch loop.
- [ ] Add batch loop, integrating the single training step logic.
- [ ] Log loss periodically within the batch loop.
- [ ] Step the LR scheduler at the end of each epoch.
- [ ] Log total training time.
- [ ] Implement checkpointing in `train.py`.
- [ ] Define checkpoint directory.
- [ ] Implement logic to find and load the latest checkpoint (resume training).
- [ ] Save checkpoints periodically (based on frequency or final epoch).
- [ ] Include epoch, model state, optimizer state, scheduler state, config.
- [ ] Log checkpoint loading/saving.
- [x] Implement full training loop in `train.py`.
- [x] Remove single-step exit.
- [x] Add LR scheduler (`torch.optim.lr_scheduler.StepLR`).
- [x] Add epoch loop.
- [x] Add batch loop, integrating the single training step logic.
- [x] Log loss periodically within the batch loop.
- [x] Step the LR scheduler at the end of each epoch.
- [x] Log total training time.
- [x] Implement checkpointing in `train.py`.
- [x] Define checkpoint directory.
- [x] Implement logic to find and load the latest checkpoint (resume training).
- [x] Save checkpoints periodically (based on frequency or final epoch).
- [x] Include epoch, model state, optimizer state, scheduler state, config.
- [x] Log checkpoint loading/saving.
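The training-loop and checkpointing items ticked off above can be sketched as follows. This is a minimal illustration, not the project's actual `train.py`: the model, loss, optimizer hyperparameters, and names like `ckpt_dir` and `save_freq` are assumptions, and a plain MSE loss stands in for the detection losses a torchvision model would return.

```python
import logging
import os
import time

import torch
from torch import nn, optim


def train(model, loader, num_epochs=2, ckpt_dir="checkpoints",
          save_freq=1, config=None):
    optimizer = optim.SGD(model.parameters(), lr=0.01)
    # StepLR decays the learning rate by `gamma` every `step_size` epochs.
    scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.1)
    os.makedirs(ckpt_dir, exist_ok=True)

    # Resume from the latest checkpoint, if any
    # (zero-padded filenames sort correctly).
    start_epoch = 0
    ckpts = sorted(f for f in os.listdir(ckpt_dir) if f.endswith(".pt"))
    if ckpts:
        state = torch.load(os.path.join(ckpt_dir, ckpts[-1]),
                           map_location="cpu")
        model.load_state_dict(state["model"])
        optimizer.load_state_dict(state["optimizer"])
        scheduler.load_state_dict(state["scheduler"])
        start_epoch = state["epoch"] + 1
        logging.info("Resumed from %s (epoch %d)", ckpts[-1], state["epoch"])

    start = time.time()
    for epoch in range(start_epoch, num_epochs):      # epoch loop
        model.train()
        for step, (x, y) in enumerate(loader):        # batch loop
            loss = nn.functional.mse_loss(model(x), y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            if step % 10 == 0:                        # periodic loss logging
                logging.info("epoch %d step %d loss %.4f",
                             epoch, step, loss.item())
        scheduler.step()                              # step LR once per epoch

        # Save periodically and always on the final epoch.
        if (epoch + 1) % save_freq == 0 or epoch == num_epochs - 1:
            path = os.path.join(ckpt_dir, f"ckpt_{epoch:04d}.pt")
            torch.save({"epoch": epoch,
                        "model": model.state_dict(),
                        "optimizer": optimizer.state_dict(),
                        "scheduler": scheduler.state_dict(),
                        "config": config}, path)
            logging.info("Saved checkpoint %s", path)

    elapsed = time.time() - start
    logging.info("Training took %.1fs", elapsed)      # total training time
    return elapsed
```

Because every checkpoint stores the epoch plus model, optimizer, and scheduler state, re-running `train` with a larger `num_epochs` simply resumes where the last run stopped.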
## Phase 4: Evaluation & Testing
- [ ] Add evaluation dependencies (`pycocotools` - optional initially).
- [ ] Create `utils/eval_utils.py` and implement `evaluate` function.
- [ ] Set `model.eval()`.
- [ ] Use `torch.no_grad()`.
- [ ] Loop through validation/test dataloader.
- [ ] Perform forward pass.
- [ ] Calculate/aggregate metrics (start with average loss, potentially add mAP later).
- [ ] Log evaluation metrics and time.
- [ ] Return metrics.
- [ ] Integrate evaluation into `train.py`.
- [ ] Create validation `Dataset` and `DataLoader` (using `torch.utils.data.Subset`).
- [ ] Call `evaluate` at the end of each epoch.
- [ ] Log validation metrics.
- [x] Create `utils/eval_utils.py` and implement `evaluate` function.
- [x] Set `model.eval()`.
- [x] Use `torch.no_grad()`.
- [x] Loop through validation/test dataloader.
- [x] Perform forward pass.
- [x] Calculate/aggregate metrics (start with average loss, potentially add mAP later).
- [x] Log evaluation metrics and time.
- [x] Return metrics.
- [x] Integrate evaluation into `train.py`.
- [x] Create validation `Dataset` and `DataLoader` (using `torch.utils.data.Subset`).
- [x] Call `evaluate` at the end of each epoch.
- [x] Log validation metrics.
- [ ] (Later) Implement logic to save the *best* model based on validation metric.
- [ ] Implement `test.py` script.
- [ ] Reuse argument parsing, config loading, device setup, dataset/dataloader (test split), model creation from `train.py`.
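The `evaluate` function completed in Phase 4 might look roughly like the sketch below. It is a generic version with an explicit `loss_fn` that reports an average loss; note that torchvision detection models return predictions rather than losses once `model.eval()` is set, so the real `eval_utils.py` may handle that differently.

```python
import logging
import time

import torch
from torch import nn
from torch.utils.data import DataLoader, Subset, TensorDataset


@torch.no_grad()                  # disable gradient tracking for evaluation
def evaluate(model, loader, loss_fn):
    model.eval()                  # eval mode: dropout off, BatchNorm frozen
    start = time.time()
    total_loss, batches = 0.0, 0
    for x, y in loader:           # loop over the validation/test dataloader
        total_loss += loss_fn(model(x), y).item()   # forward pass
        batches += 1
    metrics = {"avg_loss": total_loss / max(batches, 1)}
    logging.info("eval %s in %.2fs", metrics, time.time() - start)
    return metrics
```

A validation split can be carved out of the full training dataset with `torch.utils.data.Subset(dataset, indices)`, wrapped in its own `DataLoader`, and passed to `evaluate` at the end of each epoch.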