For those following along asynchronously, this is a scavenger hunt-style challenge intended to be completed (or attempted 🙈 ) within 10 minutes, on an unfamiliar dataset, during the Data Explorers Guild.
The event has passed, but data is forever: everyone is welcome to play with this public dataset and share any cool insights in the thread. Join us next month to explore live!
This month’s dataset is based on Reddit post data from 2019, available at:
Copy and paste the list below into a reply to this post, adding explore links for as many items as you get to in 10 minutes. No need to go in order; feel free to jump around.
PS: It’s usually nicer to filter to the top 10 or top 50 subreddits, since there’s a LOT of noise. Bonus points for an explore link that elegantly shows data about the entire ecosystem.
1: A replica of this visualization, showing the 100 most popular words across the top 10 subreddits. You can sort by one subreddit to pick out the outliers in others:
2: A visualization that shows the best time of day to post if you want to get a lot of upvotes:
3: An explore that shows whether having flair affects the number of upvotes/downvotes a post gets:
4: A visualization of the subreddits with the highest ratio of stopwords to full words:
5: An explore showing the words that are most “endemic”, i.e., appearing only or primarily in one subreddit:
6: The most popular post:
7: The most unlikely post to be successful (an outlier):
8: An explore showing what makes posts get lots of comments:
9: A visualization showing whether post length is correlated with score:
10: Something way better than what I’ve come up with here:
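If “endemic” in the list above feels fuzzy, here is one possible way to pin it down, sketched in plain Python rather than as an explore. It scores each word by the share of its occurrences that fall in its single most common subreddit, so a share of 1.0 means the word appears in only one subreddit. The sample rows, the `endemic_scores` name, and the `min_count` cutoff are all made up for illustration; they are not part of the dataset or of any official solution.

```python
from collections import Counter, defaultdict

# Hypothetical sample rows (subreddit, post title); NOT taken from the real dataset.
posts = [
    ("aww", "cute puppy sleeping"),
    ("aww", "cute kitten and puppy"),
    ("news", "election results today"),
    ("news", "cute story in the news today"),
]

def endemic_scores(rows, min_count=2):
    """Return {word: (top_subreddit, share)}.

    share is the fraction of a word's occurrences that fall in the
    subreddit where it appears most often. Words seen fewer than
    min_count times are skipped to cut down on noise (an assumed cutoff).
    """
    counts = defaultdict(Counter)  # word -> Counter(subreddit -> occurrences)
    for subreddit, title in rows:
        for word in title.lower().split():
            counts[word][subreddit] += 1

    scores = {}
    for word, by_sub in counts.items():
        total = sum(by_sub.values())
        if total < min_count:
            continue
        top_sub, top_n = by_sub.most_common(1)[0]
        scores[word] = (top_sub, top_n / total)
    return scores

scores = endemic_scores(posts)

# Print words from most to least endemic.
for word, (sub, share) in sorted(scores.items(), key=lambda kv: -kv[1][1]):
    print(f"{word}: {share:.2f} of occurrences in r/{sub}")
```

On real data you would probably want to drop stopwords first and raise `min_count` quite a bit, since rare words look endemic almost by accident.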