Importing old messages



mbbrutman
November 18th, 2019, 04:39 PM
Shawn provided me with a Zip file of the Yahoo! group that somebody scraped.

The messages are in "email" format, so they look like full emails with all of the headers. vBulletin has an API for creating threads. Writing some code to import each message and add it to a new thread (or an existing thread if the subject starts with "Re:") should not be too terrible.
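As a rough illustration of the "Re:" idea, here is a minimal Python sketch that groups messages into threads by normalized subject. It does not touch the vBulletin API at all; the function names `thread_key` and `group_into_threads` are made up for this example, and matching on subject alone is an assumption that will mis-group unrelated posts that happen to share a subject.

```python
import re
from collections import defaultdict

# Matches one or more leading "Re:" prefixes in any case, e.g. "Re: RE: foo".
_REPLY_PREFIX = re.compile(r"^(?:\s*re\s*:\s*)+", re.IGNORECASE)


def thread_key(subject):
    """Normalize a subject so that replies share a key with the original post."""
    stripped = _REPLY_PREFIX.sub("", subject or "")
    # Collapse runs of whitespace and ignore case differences.
    return " ".join(stripped.split()).lower()


def group_into_threads(messages):
    """Group (subject, payload) pairs by normalized subject, keeping input order."""
    threads = defaultdict(list)
    for subject, payload in messages:
        threads[thread_key(subject)].append(payload)
    return dict(threads)
```

Feeding the messages in sequence-number order should leave the original post first in each thread, assuming the sequence numbers follow posting order.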

For now I think I want to write a simple script to just get the date, subject, sender and the body text onto a static web page. That's no more than a few hours of work and it would be searchable, but it would not be threaded. Getting things threaded and/or importing into vBulletin is a longer term project.
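A simple script along those lines might look roughly like the sketch below, which uses Python's standard `email` package. This is only an illustration, not the script that was actually written; `summarize` and `render` are hypothetical names, and the HTML layout is invented. Real scraped mail may also carry broken charsets or missing headers that need extra handling.

```python
import html
from email import message_from_bytes, policy


def summarize(raw_bytes):
    """Return the date, subject, sender and plaintext body of one raw message."""
    msg = message_from_bytes(raw_bytes, policy=policy.default)
    # Ask only for a text/plain body; returns None if the message has none.
    body_part = msg.get_body(preferencelist=("plain",))
    body = body_part.get_content() if body_part is not None else ""
    return {
        "date": str(msg["Date"] or ""),
        "subject": str(msg["Subject"] or ""),
        "sender": str(msg["From"] or ""),
        "body": body,
    }


def render(summary):
    """Render one message as an HTML fragment, escaping every field."""
    esc = {key: html.escape(value) for key, value in summary.items()}
    return (
        '<div class="message">\n'
        f"<h3>{esc['subject']}</h3>\n"
        f"<p>{esc['sender']} | {esc['date']}</p>\n"
        f"<pre>{esc['body']}</pre>\n"
        "</div>\n"
    )
```

Escaping every field keeps stray angle brackets in headers or bodies from breaking the page.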


Thoughts?

ldkraemer
November 19th, 2019, 04:53 AM
What email file formats are these in?

.DBX, .PST, or .EML?


Larry

mbbrutman
November 19th, 2019, 08:27 AM
.EML - nice, human readable text. The filenames are just sequence numbers.
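Since the filenames are sequence numbers, one detail worth getting right is sorting them numerically rather than alphabetically (so 2 comes before 10). A small sketch, assuming names of the form `123.eml`; the exact naming in the Zip file is a guess on my part, and `eml_files_in_order` is a made-up helper:

```python
from pathlib import Path


def eml_files_in_order(directory):
    """Return the .eml files in `directory`, sorted by numeric filename."""
    paths = [
        p for p in Path(directory).iterdir()
        # Accept .eml or .EML, and skip anything whose name is not a number.
        if p.suffix.lower() == ".eml" and p.stem.isdigit()
    ]
    return sorted(paths, key=lambda p: int(p.stem))
```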

driph
December 11th, 2019, 02:38 PM
This is a great idea, and will be incredibly helpful for future searchers. vBulletin integration will be nice, but at least there will be something to refer to (with a stickied link at the top of the Grid forum?) when digging for old info.

Has a home been found for the files from the group?

mbbrutman
December 11th, 2019, 04:00 PM
I've been meaning to get to the message import but I got sidelined by something horrible. Trust me, it's a good excuse. It will happen in the next few weeks.

Hosting the files here is still possible; I just need to see what kind of copyright risk we would be taking on. One thing that works in our favor is that we are a registered 501(c)(3) with a real museum, so we have more latitude to protect and preserve software than I let on. It's the distribution part that we need to be careful about.

mbbrutman
December 15th, 2019, 11:19 AM
Some progress:

http://www.brutman.com/RuGRiD/

That directory has 8 HTML files which have 500 messages each. The messages have some light formatting on them.
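Splitting the messages into pages of 500 can be sketched like this. The helper name `paginate` is hypothetical, and the message count used in the usage note is made up; only the page size comes from the post above.

```python
def paginate(items, page_size=500):
    """Split a list into consecutive pages of at most `page_size` items."""
    if page_size < 1:
        raise ValueError("page_size must be at least 1")
    return [items[i:i + page_size] for i in range(0, len(items), page_size)]
```

With a made-up total of 3,750 messages this would give 8 pages, the last one holding 250.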

Known problems/limitations:

Many of the messages are "multi-part" and include an HTML version and a plaintext version; the HTML version of those messages is suppressed for now until I can properly sanitize the HTML in them. (It includes things like <head> and <body> tags which screw the overall page up.)
Attached pictures and files are not included yet.
This is a prototype. The final location will be on something owned by VCFed.org.
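For messages that only have an HTML version, one possible stopgap is to reduce the HTML to plain text instead of sanitizing it properly. The sketch below uses the standard library's `html.parser`; it is a swapped-in technique for illustration, not what the prototype does, and `TextOnly` and `html_to_text` are made-up names. It drops all tags, along with anything inside <head>, <script> and <style>.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect visible text, dropping tags and <head>/<script>/<style> content."""

    SKIP = {"head", "script", "style", "title"}
    BREAKS = {"br", "p", "div"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            # Recover if a malformed message never closed its <head>.
            self.skip_depth = 0
        elif tag in self.SKIP:
            self.skip_depth += 1
        elif tag in self.BREAKS:
            self.chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.chunks.append(data)


def html_to_text(markup):
    """Return the visible text of an HTML document as a plain string."""
    parser = TextOnly()
    parser.feed(markup)
    parser.close()
    return "".join(parser.chunks).strip()
```

The result still has to be HTML-escaped before it goes onto the page, and all formatting and links are lost, so this is only a fallback until real sanitizing is in place.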


Please have a look and let me know about outright bugs. Things like completely missing message bodies might still be happening. The formatting is rough, but it is as it appears in the originals.

Reading MIME emails has been more challenging than I expected. :-)

mbbrutman
December 16th, 2019, 01:48 PM
I updated the files again today; the files should be more complete and readable.

Still to come - attachments.

shawnerz
January 14th, 2020, 07:39 PM
Thanks for your work on this. What you have so far looks great! :)
-Shawn