Saved (Public Feed)

#language #bots #data #twitter
Max Bittker: @nyt_first_said demo
This is a visualization of the process behind @nyt_first_said. Each day, a script scrapes new articles from nytimes.com. That text is tokenized, or split into words based on whitespace and punctuation. Each word must then pass several criteria. Containing a number or special character is grounds for disqualification. To avoid proper nouns, all capitalized words are filtered out. The most important check is against the New York Times' archive search service. The archive goes back to 1851 and contains more than 13 million articles. The paper publishes many thousands of words each day, but only a very few are firsts.
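The pipeline described above can be sketched roughly as follows. This is a minimal illustration, not the bot's actual code: the tokenization regex, the `candidate_new_words` name, and the `is_in_archive` callback (standing in for the archive search service) are all assumptions for the sake of the example.

```python
import re

def candidate_new_words(article_text, is_in_archive):
    """Yield words that could be 'firsts', per the filters described above."""
    # Tokenize: split on runs of whitespace and punctuation
    # (apostrophes are kept inside words).
    tokens = re.split(r"[^\w']+", article_text)
    seen = set()
    for word in tokens:
        if not word or word in seen:
            continue
        seen.add(word)
        # Disqualify words containing a number or special character.
        if not word.isalpha():
            continue
        # Skip capitalized words to avoid proper nouns.
        if word[0].isupper():
            continue
        # Final, most important check: absent from the archive search.
        if is_in_archive(word):
            continue
        yield word

# Toy stand-in for the 13-million-article archive lookup:
archive = {"the", "script", "scrapes", "new", "articles",
           "and", "finding", "daily"}
text = "The script scrapes new articles, finding zorblast and NYC2024 daily."
print(list(candidate_new_words(text, archive.__contains__)))  # → ['zorblast']
```

Here "The" is dropped as capitalized, "NYC2024" for containing digits, and every other word for already appearing in the toy archive, leaving only the invented word "zorblast".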
·maxbittker.github.io·