Since I started covering WordPress in 2009, I’ve noticed that certain topics come up in cycles. One of these is the contention in the WordPress community over what data is sent to, stored on, and shared by WordPress.org. In a post published on Torquemag.io, Josh Pollock, founder of CalderaWP, argues that WordPress is a community-driven project and that, as such, data collected by WordPress.org should be shared with the community.
“If installing and updating themes via the WordPress dashboard wasn’t so easy, WordPress wouldn’t be what it is today. I understand and appreciate this.
“Here’s the part that doesn’t sit well with me: WordPress.org is collecting data on all of its users (as it should), but this information isn’t available in aggregate form to the community.”
Pollock says that as an entrepreneur, the information would help him make informed business decisions.
Data is Stored for Two Days
I spoke to Samuel ‘Otto’ Wood, who helps maintain WordPress.org, and discovered that some of the assumptions people have are not true.
“The data collection systems on w.org have been inconsistent at best, and re-written several times,” Wood said.
“But the general idea that there is some kind of treasure trove of information we’re storing is misguided, at best. The data is collected, aggregated for the things we display, then tossed. We don’t store it for any serious length of time. Just the results of the data like the counts.”
Gathering, sorting, and displaying the large amount of data associated with WordPress is a CPU-intensive job. The most recent example of WordPress.org sharing aggregate data is the active install counts for plugins and themes. Displaying the Active Install count only became possible after significant performance improvements by WordPress lead developer Dion Hulse. Without those improvements, the data collection would have overloaded CPUs and MySQL databases.
“Gathering that data is frickin’ difficult to start with,” Wood said. “For the longest time, we didn’t even have the actual system resources to pull off the ‘Active Installs’ count. We didn’t display that count because we couldn’t do it. The idea that we’re hiding things is ludicrous.”
Raw data is stored for two days and is then overwritten. “Basically, there’s too much data to store,” Wood said. “All of the data that w.org gathers is used to display the stats on w.org itself. Nothing special is hidden.”
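For readers who think in code, the flow Wood describes (collect raw data, aggregate it into the counts that get displayed, then discard the raw data) can be sketched roughly as follows. This is a hypothetical illustration and not WordPress.org's actual code: the `StatsPipeline` class, its methods, the site and plugin names, and the definition of an active install as one distinct site seen within the retention window are all assumptions made for the example. Only the two-day retention figure comes from Wood.

```python
from collections import Counter
from datetime import datetime, timedelta

# From the article: raw data is kept for two days.
RETENTION = timedelta(days=2)


class StatsPipeline:
    """Hypothetical sketch of an aggregate-then-discard stats pipeline."""

    def __init__(self):
        # Raw rows: (timestamp, site_id, plugin_slug). Names are assumptions.
        self.raw = []
        # Only the aggregated counts are meant to be kept long term.
        self.active_installs = Counter()

    def record(self, timestamp, site_id, plugin_slug):
        """Store one raw check-in from a site."""
        self.raw.append((timestamp, site_id, plugin_slug))

    def aggregate(self, now):
        """Discard raw rows older than the retention window, then recount."""
        cutoff = now - RETENTION
        self.raw = [row for row in self.raw if row[0] >= cutoff]
        # Count each (site, plugin) pair once, however often the site checks in.
        distinct = {(site, slug) for _, site, slug in self.raw}
        self.active_installs = Counter(slug for _, slug in distinct)
        return dict(self.active_installs)


if __name__ == "__main__":
    now = datetime(2016, 1, 10)
    pipeline = StatsPipeline()
    pipeline.record(now - timedelta(days=3), "site-a", "example-plugin")  # too old
    pipeline.record(now - timedelta(hours=5), "site-b", "example-plugin")
    pipeline.record(now - timedelta(hours=1), "site-b", "example-plugin")  # repeat
    pipeline.record(now, "site-c", "example-plugin")
    print(pipeline.aggregate(now))
```

The point of the sketch is that once `aggregate` has run, nothing older than the retention window remains to be shared; only the resulting counts survive.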
Data Accuracy is Hard
If developers like Pollock are going to make business decisions using public data, the data has to be accurate. Accuracy is a complex problem, but the team has slowly made progress over the years as legacy systems on w.org are phased out.
“A lot of the w.org systems are poorly made,” Wood said. “They’re old, have been modified dozens of times over the years, and badly in need of updating. For a long time, the data we gathered could not be processed fast enough so we simply threw over half of it away.
“Mostly, we phase out old useless systems and replace them with something better and newer which gives us things to display. Active Install counts was an entirely new system that replaced an older one which didn’t give any useful information.”
Wood confirms what I’ve believed to be true for a long time. WordPress.org is not storing data for an extended period of time and the information that is collected is likely on public display somewhere on the site. What types of data would you like to see on WordPress.org?