
Improving WhatsApp chat history exports

‘Twas a brisk rainy morning in Vancouver…

…when I woke up to the news of WhatsApp’s updated privacy policy. If you are anything like me, your relationship with social media platforms is complicated. News of a privacy policy change automatically feels like being stabbed in the back, even though you likely haven’t even attempted to understand the intent of the actual changes. That is what we call distrust. And no other tech company in the world radiates distrust quite like Facebook does. As someone with Cambridge Analytica PTSD and a history of social media reluctance, I’m always looking for a reason to jump ship, and the news of WhatsApp’s policy changes was my signal to peace out for good.

Friendship ended with WhatsApp. Now Signal is my best friend.
“Use Signal” as Musk says

“How do I export and back up my data?” – me, a data hoarder

I’ve left Facebook a few times in my life (as I said, it’s a complicated relationship) and assumed that leaving WhatsApp would have similar, convenient features for downloading a copy of my data. All of my WhatsApp media is already backed up thanks to OneDrive automatically doing this for me on my phone, so I really only care about exporting my chat history – most importantly the chat history with my fiance during our 2-year phase of long-distance dating. Those messages are cherished, indispensable memories. Anyways, Google leads me to this WhatsApp support page detailing how to back up chat history. Sweet! A few clicks here and there and I’ll be on my way to greener pastures…

Problems abound

As soon as I exported my chat I knew something was fishy because of the small .txt file size and suspiciously plump scroll bar. Upon comparing the .txt file directly with the chats on my phone, my inner data hoarder wailed at the realization of obviously missing and redacted personal data. A final Google search reveals the scale of this problem. I shudder as I accept the reality of the matter…my data is missing and now I must take matters into my own hands.

My first world problems:

  • Chat history backups are limited to a maximum of 40,000 messages (which in my case only backed up 80% of my longest chat history)
  • The exports exclude/redact messages if they have media attachments and replace the message contents with “<Media omitted>”
  • The export file format is limited to a plain text file
  • WhatsApp has a local database you can save if you connect your phone to a PC, but it’s encrypted and can’t be conveniently read without a decryption key. There are some tools advertised online to help with getting this decryption key, but when they require you to root your phone and use their closed-source software…no thanks
  • This Chrome extension is an option for $5
Confused Travolta
Where are my messages?

That last option – the Chrome extension – is probably the best solution for most people, assuming it works and is trustworthy. Personally, my gut told me to stay away from it because $5 is clearly too expensive for my “indispensable” messages. I mean, what if it doesn’t work and then I’ve wasted like 10 minutes of my time and the same amount of money I spend daily on coffee? Our brains are weird like that. In a similar vein, automation development is for the truly afflicted who spend weeks or months automating a task that could be accomplished manually in a few hours or a day. So that raises the question…how on earth did I come to the conclusion that spending ~300 hours making a WhatsApp web scraper to back up a single chat was my best solution?

  1. Maybe I can make something better
  2. Maybe I like solving problems for people
  3. Maybe I wanted to learn more about Selenium and BeautifulSoup
  4. Maybe I felt my neglected blog needed some TLC
  5. Maybe, deep down, I’m craving recognition on social media – the thing I despise and regularly leave, then come crawling back to, like an abusive relationship
Scumbag brain - despises social media, but motivated to code for social media points
It’s complicated, OK?

Introducing WhatSoup 🍲

WhatSoup is a cross-platform web scraper that exports your entire WhatsApp chat history. It’s pretty simple and is far from perfect but it does solve the problems I discussed above…and totally saved me $5! To be clear, this tool is primarily meant for users who:

  • Want their text chat history exported (attachments/media aren’t supported…yet)
  • Have more than 40,000 messages in a single chat
  • Want to back up chats/messages that have been redacted with “<Media omitted>” by the default WhatsApp export feature
  • Want their chat history in a CSV or HTML format (although other tools may be more suitable for simply converting plain text to other formats)

Here’s the gist of how it works:

  1. Opens your browser, loads WhatsApp, and reads your chats in the left pane (each contact’s name and the last message sent)
  2. Presents you a numbered list of all your chats and asks you to select a number from the list
  3. Locates your selected chat and loads it in the right pane by continuously scrolling up through the chat history until all messages have loaded
  4. Once all messages have been loaded, it then scrapes/scrubs the HTML and extracts the message information we all care about: sender, date/time, message contents
  5. Finally, it exports the scraped data in a file format of your choosing: plain text, CSV, or HTML
Demo of exporting a chat to plain text file with WhatSoup
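
For a feel of what steps 4 and 5 involve, here is a minimal sketch. It is not WhatSoup’s actual code. It assumes each message element carries a `data-pre-plain-text` attribute shaped like `[10:30 AM, 1/15/2021] Alice: `, which is an assumption about WhatsApp Web’s markup that can change at any time. It also uses Python’s built-in `html.parser` as a stand-in for BeautifulSoup so it runs without installing anything. The sample HTML and the names `MessageParser`, `parse_messages`, and `to_csv` are made up for illustration.

```python
import csv
import io
import re
from html.parser import HTMLParser

# Assumed metadata format (hypothetical): "[10:30 AM, 1/15/2021] Alice: "
PREFIX = re.compile(
    r"^\[(?P<time>[^,\]]+),\s*(?P<date>[^\]]+)\]\s*(?P<sender>.*?):\s*$"
)
# Tags that never get a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "img", "hr", "input", "wbr"}


class MessageParser(HTMLParser):
    """Collects date, time, sender and text from elements with the assumed attribute."""

    def __init__(self):
        super().__init__()
        self.messages = []
        self._meta = None   # parsed prefix of the message currently being read
        self._depth = 0     # how many open tags deep we are inside that message
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._meta is not None:
            self._depth += 1
            return
        prefix = dict(attrs).get("data-pre-plain-text")
        match = PREFIX.match(prefix) if prefix else None
        if match:
            self._meta = match.groupdict()
            self._depth = 1
            self._text = []

    def handle_endtag(self, tag):
        if self._meta is None or tag in VOID_TAGS:
            return
        self._depth -= 1
        if self._depth == 0:  # the message element itself just closed
            text = "".join(self._text).strip()
            self.messages.append({**self._meta, "text": text})
            self._meta = None

    def handle_data(self, data):
        if self._meta is not None:
            self._text.append(data)


def parse_messages(html):
    """Step 4: pull sender, date/time and contents out of the loaded HTML."""
    parser = MessageParser()
    parser.feed(html)
    parser.close()
    return parser.messages


def to_csv(messages):
    """Step 5: serialize the scraped messages as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["date", "time", "sender", "text"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(messages)
    return buffer.getvalue()


# Made-up sample standing in for the HTML of a fully loaded chat.
SAMPLE_HTML = """
<div data-pre-plain-text="[10:30 AM, 1/15/2021] Alice: "><span>Good morning!</span></div>
<div data-pre-plain-text="[10:31 AM, 1/15/2021] Bob: "><span>Morning <b>you</b></span><br></div>
"""

if __name__ == "__main__":
    print(to_csv(parse_messages(SAMPLE_HTML)))
```

Steps 1 to 3 need a live browser session, so they are left out here. Keeping the parsing separate from the browser automation also means the slow part (loading the chat) only has to happen once, even if you export to several formats.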

One final admission to note about performance… 😬

Speed, what is it good for?

Absolutely everything (for web scraping).

I’m not going to sugarcoat it: if your chat is in the 50k message range, you’d better be RAM rich and patient. Those who have more than 50k messages: congratulations, you’ll be setting a new WhatSoup record 🏆

My desktop has 32GB of RAM which is plenty of computing memory for WhatSoup, but both Chrome and Firefox crawl at a snail’s pace around the 15k message mark (3-4GB of browser memory). The 50k message chat averages around 10GB of memory usage in Windows. Chrome Task Manager gives a more accurate view into WhatsApp memory usage, as it excludes the added weight of Windows and details browser-only tasks.

Chrome task manager
Chrome task manager after 8 hours of loading 50k messages in WhatsApp
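
For a rough sense of scale, the figures above can be turned into a back-of-envelope cost per loaded message. This is only a sketch: it takes 3.5GB as the midpoint of the “3-4GB” quoted for 15k messages (my simplification), treats the ~10GB Windows figure for 50k messages as if it were all browser memory (which overstates it), and the helper name `kb_per_message` is made up.

```python
# Figures quoted in the post: messages loaded -> approximate memory in GB.
# 3.5 is an assumed midpoint of the "3-4GB" range.
OBSERVATIONS = {15_000: 3.5, 50_000: 10.0}


def kb_per_message(messages, gigabytes):
    """Average memory per loaded message, in KB."""
    return gigabytes * 1024 * 1024 / messages


if __name__ == "__main__":
    for count, gigabytes in OBSERVATIONS.items():
        print(f"{count:>6} messages: ~{kb_per_message(count, gigabytes):.0f} KB each")
```

Both data points land somewhere in the low hundreds of KB per message, which is a useful ballpark for guessing whether your machine can hold a given chat in the browser. Treat it as an estimate only, since two measurements are not enough to say how memory really scales.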

The takeaway here is that WhatSoup will be bottlenecked by browser and memory limitations. I explored and tested out a number of different strategies to squeeze extra juice out of the browser but unfortunately, they all failed to make any meaningful difference:

  1. Chrome vs Firefox 👎
  2. Headless browsing 👎
  3. Disabling images in browser config 👎
  4. Removing elements from DOM 👎
  5. Changing ‘experimental’ browser settings to allocate more memory 👎

Another takeaway is about myself. I learned that when I’m unironically reading documents about WhatsApp security and reverse engineering WhatsApp encryption, I’ve hit the limits of my own abilities.

I have no idea what I'm doing
Me reading about state-of-the-art cryptography to solve web scraper performance issues

Next steps

Once I finish polishing WhatSoup and stop wasting dev time making outdated memes for my outdated blog, I’ll be sharing it more broadly to selfishly show off, get feedback, and make further improvements.

I hope other developers will feel welcome to contribute to the source and improve it with me. I’ll be available to tweak it as needed until May 15th, when WhatsApp’s new privacy policy goes into effect. On May 15th, users who have not accepted the policy will no longer be able to read or send messages and will have their accounts deleted 120 days later (TechCrunch).

If you don’t plan to accept the new privacy policy and want to back up your WhatsApp chat history, now is the time to do it.