23
My whole week got wrecked by a bad AI training run
Last Thursday, I kicked off a new model training job on a custom dataset, expecting it to take maybe 12 hours. It ran for over 60 hours straight, using up all my cloud credits, and then crashed right at the end. The error log had just one line about a memory overflow, and the job had left no save point to resume from. I had to explain to my team lead why we had nothing to show for the budget. Has anyone else had a training job fail that badly after so much time and money?
3 comments
haydenc10 26d ago
Ouch, that's a brutal one. I feel your pain, though my version is more like driving a shipment across three states only to find the warehouse closed for a holiday. That total loss after so much time just sinks your stomach. I'd be staring at that error log for a week.
4
jaden698d ago
Actually @haydenc10, staring at the log IS the right move. You gotta find what broke before you run it again.
1