Have you looked into using a full MIME/mbox parser library, e.g. GMime [0] or MimeKit [1]? Both support parsing mbox files directly, and they should be able to handle the intricacies of parsing any messages/attachments you throw at them. Then you could write out the MIME representation of each message (including any attachments) into its own file and then check for new messages. That way you can be sure each “chunk” represents a single message in its entirety. Not sure if this is any better since your solution seems to work pretty well.
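A minimal sketch of that split-per-message idea, swapping in Python's standard-library `mailbox` module for GMime/MimeKit to keep it dependency-free. The function name `split_mbox` and the choice to name each file after the SHA-256 of its raw bytes are assumptions for illustration, not anything taken from the author's tool:

```python
import hashlib
import mailbox
from pathlib import Path


def split_mbox(mbox_path: str, out_dir: str) -> int:
    """Write each message to out_dir/<sha256>.eml and return how many files were new."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = 0
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for key in box.iterkeys():
            # Raw message bytes, without the mbox "From " separator line.
            raw = box.get_bytes(key)
            target = out / (hashlib.sha256(raw).hexdigest() + ".eml")
            if not target.exists():  # already seen in an earlier takeout
                target.write_bytes(raw)
                written += 1
    finally:
        box.close()
    return written
```

Because the file name depends only on the message content, running it again on a later takeout should only create files for messages that were not there before, whatever order the mbox is in.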
Gmail takeouts come in an arbitrarily-ordered mbox file; I wanted something a bit more backup friendly so I created a small tool for that purpose and wrote about it.
> if you want to back this file up regularly with something like restic, then you will quickly end up in a world of pain: since new mails are not even appended to the end of the file, each cycle of takeout-then-backup essentially produces a new giant file.
As I'm sure the author is aware, Restic will do hash-based chunking so that similar files can be efficiently backed up.
How similar are two successive Takeout mboxes?
If the order of messages within an mbox is stable, and new emails are inserted somewhere, the delta update might be tiny.
Even if the order of the mbox's messages is ~random, Restic's delta updates will forgo large attachments.
It would be great to see empirical figures here: how large is the incremental backup after a month's emails? How does that compare for each backup strategy?
The pro of sticking with restic is simplicity, and also avoiding the risk of your tool managing to screw up the data.
This risk isn't so bad if it's a mature tool that canonicalises mboxes (e.g. by ordering the messages by time), but it seems risky for something hand-rolled.
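For what "canonicalising" could look like, here is a rough sketch using Python's standard-library `mailbox` module. It only computes a stable order (by `Date` header, with the SHA-256 of the raw bytes as a tie-breaker) and returns the untouched raw messages; the names `message_timestamp` and `canonical_order` are made up for this example, and a real tool would need to decide how to write the result back out safely:

```python
import hashlib
import mailbox
from email.utils import mktime_tz, parsedate_tz


def message_timestamp(msg) -> float:
    """UTC timestamp from the Date header; undated or unparseable mail sorts last."""
    parsed = parsedate_tz(str(msg.get("Date", "")))
    if parsed is None:
        return float("inf")
    try:
        return float(mktime_tz(parsed))
    except (OverflowError, ValueError):
        return float("inf")


def canonical_order(mbox_path: str) -> list[bytes]:
    """Return raw messages sorted by (Date, SHA-256), independent of file order."""
    box = mailbox.mbox(mbox_path, create=False)
    try:
        keyed = []
        for key in box.iterkeys():
            raw = box.get_bytes(key)  # bytes as stored, not re-serialised
            stamp = message_timestamp(box.get_message(key))
            keyed.append((stamp, hashlib.sha256(raw).hexdigest(), raw))
    finally:
        box.close()
    keyed.sort()
    return [raw for _, _, raw in keyed]
```

Sorting on the raw bytes rather than re-serialising parsed messages is deliberate: it keeps the tool from rewriting message content, which is the failure mode worried about above.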
> As I'm sure the author is aware, Restic will do hash-based chunking so that similar files can be efficiently backed up.
> Even if the order of the mbox's messages is ~random, Restic's delta updates will forgo large attachments.
I forget the exact number, but the rolling hashes for Restic and Borg are tuned to produce chunk sizes on the order of an entire megabyte.
This means attachments need to be many megabytes in size for Restic to be of much use, since a full chunk has to fall within the attachment. You'd lose 0.5MB at both ends of each attachment on average, so a 5MB file would only be 80% deduped.
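The back-of-envelope estimate above, written out. This is only the simple model from this comment (half an average chunk lost at each end of the attachment), with a 1MB average chunk assumed as the default; real chunk boundaries vary, so treat the numbers as rough:

```python
def expected_dedup_fraction(attachment_mb: float, avg_chunk_mb: float = 1.0) -> float:
    """Fraction of an unchanged attachment covered by chunks lying wholly inside it."""
    lost_mb = 2 * (avg_chunk_mb / 2)  # half a chunk lost at each end, on average
    return max(0.0, attachment_mb - lost_mb) / attachment_mb


for size_mb in (1, 5, 20):
    print(f"{size_mb} MB attachment: ~{expected_dedup_fraction(size_mb):.0%} deduplicated")
```

Under this model a 1MB attachment gets essentially no deduplication, a 5MB one gets 80%, and a 20MB one gets 95%.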
Nothing against Restic, but it's tuned for file-level backup, and I'm sure it wouldn't be as performant if it used chunks that were small enough to pick apart individual e-mails.
I suggested the author check out ZPAQ, which has a user-tunable average fragment size, and is arguably even simpler than Restic.
The ZPAQ file can then itself be efficiently backed up by Restic.
For emails, here is my current simple backup setup. Of course, I’m also looking to do this without having to open Thunderbird, or I might have an old laptop running it. So, work-in-progress.
For the email accounts I want to back up, I set them to spew out POP3 without doing anything (don’t mark as read or delete). I set up Thunderbird with that POP3, so it has a backup copy of all the emails. I’ve had searchable emails since like 2004/2005, and I’ve occasionally replied to people and gotten back in touch with very old friends from the Internet.
I saw an open-source tool sometime back (I think, here on Hacker News) that backs up your IMAP mails with a nicely done interface. That would be nice to have.
Edit: Perhaps Bichon,[1] mentioned somewhere in the other comment threads,[2] was the one.
`zpaq add archive.zpaq new.mbox -fragment 0 -method 3` is great for this. It splits the input into fragments averaging 1024 bytes in size [0], which catches up to ~90% of redundancy. The remaining ~10% is packed and compressed into 64MB (max) blocks that are added to the .zpaq.
The resulting artifact is a single .zpaq file on disk. This file is only ever appended to, never overwritten, so it plays nice with Restic's own chunked deduplication. Plus it won't flood the filesystem with inodes and it suffers less small files overhead than TFA's solution.
Granted I suspect TFA splitting on the e-mail headers may be chunking more efficiently. Though, unless I skimmed the linked GitHub too fast, it looks like TFA's solution also doesn't use any solid compression to exploit redundancy across chunks. And I trust zpaq as a general purpose tool more than a one-off just for a single use case. The code does look clean, though, nice work.
[0] Average fragment size is 1024*2^N bytes. If most of the data is attachments that don't change, you can probably use a higher `-fragment N` to have less overhead keeping track of hashes. `-method 3` is a good middle ground for backups. `-m5` gets crazy high compression ratios, but also crazy slow speed. Old versions of ingested files are shadowed by default; use `-all` when you want to list/extract them.
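To get a feel for the trade-off in that footnote, here is a small sketch of the 1024*2^N formula and the rough number of fragment hashes it implies. The 10 GiB archive size is a made-up example, and the fragment count is a plain size/average estimate rather than anything zpaq itself reports:

```python
def avg_fragment_bytes(n: int) -> int:
    """Average fragment size in bytes for `-fragment N`, per the 1024*2^N formula."""
    return 1024 * 2 ** n


def approx_fragment_count(total_bytes: int, n: int) -> int:
    """Rough number of fragments (and hashes to track) for an input of total_bytes."""
    return total_bytes // avg_fragment_bytes(n)


mbox_bytes = 10 * 2 ** 30  # hypothetical 10 GiB takeout
for n in (0, 3, 6):
    print(f"-fragment {n}: avg {avg_fragment_bytes(n)} bytes, "
          f"~{approx_fragment_count(mbox_bytes, n):,} fragments")
```

Each step up in N halves the number of hashes to track, at the cost of coarser matching around small messages.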
How strictly do you define need? I've been living as an adult long enough that there have been countless times I've searched for photos and emails from one or two decades ago. I distinctly remember the first time I met an Inbox Zero person. It was so important to her to militantly delete everything she had dealt with, and to me, the disadvantages from that practice far outweigh the advantages.
Maybe not “need” in the strictest sense, but there have been more times than I can count where digging up old mail has either made things much faster and easier or helped me answer a random question that popped into my head about something that happened ages ago.
Old SMS, iMessage, Telegram etc messages have been useful from time to time too for similar reasons.
Both can also serve as exceptional time capsules that provide windows into past “eras” of life. I occasionally kick myself for not having archived mail and messages from a couple of defunct email addresses and chat apps… without them there’s a hole spanning a few years where visibility is limited.
All the time. I read an interesting thing about someone online, and that name strikes me as someone I have interacted with. I search my email archive, then reply to that thread or start a new one to catch up. All of them have been super happy, “wow! You replied to our email from 10 years ago!”
I do have “Clean Inbox”[1] because I don’t see or interact with them, but I keep them. The only emails I see are the actionable “Unread OR Flagged.”
I have, but very rarely. I could count on one hand how often I’ve needed to dig back more than half a decade.
Back when I used Gmail I just kept everything, personal and work related, but when I moved away and started paying for email storage I took a different approach. It didn’t make sense for me to pay considerably more for storage for something I almost never use.
I ended up backing up all of my emails older than 5 years and storing them on an offline drive where I can reference them as eml files if I ever need them.
Going forward once a year I’ll export and purge the oldest year in my account.
I backed up lots of emails that I deemed precious, but I still search through email first, because sometimes it's just easier to search email than to search my backups.
Also, oftentimes I search email not so much for the content, but to find the timestamp associated with a particular event. I have had to search old email metadata a few times when I got an unexpected question related to time (for example, Gmail will ask when you created the account as part of its account recovery process).
I've enjoyed digging up an old flight itinerary to see how much I paid back in 2015, or just looking at the replies a company's support sent me and realizing I'm not buying from them again because they didn't fix the problem.
[0] https://github.com/jstedfast/gmime
[1] https://github.com/jstedfast/MimeKit
https://github.com/rustmailer/bichon
1. https://github.com/rustmailer/bichon
2. https://news.ycombinator.com/item?id=46429250
I only save financial statements and contact information. Everything else gets deleted as soon as possible.
If she was hard-deleting everything, she wasn't just Inbox Zero, she was F---s Zero, too.
What's the advantage to deleting? It's easy to ignore anything old and disk space is cheap. Do you delete old photos?
1. https://brajeshwar.com/2024/email/
I let it pile up, rarely delete anything except marketing emails. Over 30K emails in my gmail inbox.