Background
Social media have
transformed the communications landscape. People increasingly obtain news and
health information online and via social media. Social media platforms also
serve as novel sources of rich observational data for health research
(including infodemiology, infoveillance, and digital disease detection).
While the number of studies using social data is growing rapidly,
very few of these studies transparently outline their methods for collecting,
filtering, and reporting those data. Keywords and search filters applied to
social data form the lens through which researchers may observe what and how
people communicate about a given topic. Without a properly focused lens,
research conclusions may be biased or misleading. Standards of reporting data
sources and quality are needed so that data scientists and consumers of social
media research can evaluate and compare methods and findings across studies.
Objective
We aimed to develop and
apply a framework of social media data collection and quality assessment and to
propose a reporting standard, which researchers and reviewers may use to
evaluate and compare the quality of social data across studies.
Methods
We propose a conceptual
framework consisting of three major steps in collecting social media data:
develop, apply, and validate search filters. This framework is based on two
criteria: retrieval precision (how much of the retrieved data is relevant) and
retrieval recall (how much of the relevant data is retrieved). We then discuss
two conditions on which the estimation of retrieval precision and recall
relies (accurate human coding and full data collection) and how to calculate
these statistics in cases that deviate from the two ideal conditions. Finally,
we apply the framework to a real-world example using approximately 4 million
tobacco-related tweets collected from the Twitter firehose.
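The two criteria can be illustrated with a minimal Python sketch. It assumes ideal conditions (no coding errors and full data collection) and uses cell labels in which a is the number of retrieved relevant messages, b the number of retrieved irrelevant messages, and c the number of relevant messages in the archive that the filter missed. The function names and the counts are hypothetical and chosen only for illustration; they are not taken from the study.

```python
def retrieval_precision(a: int, b: int) -> float:
    """Share of retrieved messages that are relevant: a / (a + b)."""
    retrieved = a + b
    if retrieved == 0:
        raise ValueError("no messages were retrieved")
    return a / retrieved


def retrieval_recall(a: int, c: int) -> float:
    """Share of relevant messages that were retrieved: a / (a + c)."""
    relevant = a + c
    if relevant == 0:
        raise ValueError("no relevant messages in the archive")
    return a / relevant


# Hypothetical counts, for illustration only (not data from the study).
a, b, c = 90, 10, 30
print(retrieval_precision(a, b))  # 0.9
print(retrieval_recall(a, c))     # 0.75
```

With these made-up counts, 90 of the 100 retrieved messages are relevant (precision 0.9), and 90 of the 120 relevant messages in the archive were retrieved (recall 0.75).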
Results
We developed and applied
a search filter to retrieve e-cigarette–related tweets from the archive based
on three keyword categories: devices, brands, and behavior. The search filter
retrieved 82,205 e-cigarette–related tweets from the archive and was validated.
Retrieval precision was above 95% in all cases. Retrieval recall was
86% assuming ideal conditions (no human coding errors and full data
collection), 75% when unretrieved messages could not be archived, 86% assuming
no false negative errors by coders, and 93% allowing both false negative and
false positive errors by human coders.
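In practice, precision and recall are usually estimated from human-coded samples rather than from a fully coded archive. The sketch below shows one simple way to do this by scaling sample proportions up to the full retrieved and unretrieved sets. It is an illustrative assumption, not necessarily the procedure used in the study, and it assumes that coders make no errors and that both samples are simple random samples. The function name, parameter names, and all counts are hypothetical.

```python
def estimate_recall(
    n_retrieved: int,
    n_unretrieved: int,
    sample_retrieved: int,
    relevant_in_retrieved_sample: int,
    sample_unretrieved: int,
    relevant_in_unretrieved_sample: int,
) -> float:
    """Estimate recall from two human-coded random samples.

    Assumes error-free coding and simple random sampling.
    """
    # Estimated precision among retrieved messages.
    precision = relevant_in_retrieved_sample / sample_retrieved
    # Estimated number of relevant retrieved messages.
    a_hat = precision * n_retrieved
    # Estimated share of unretrieved messages that are relevant.
    miss_rate = relevant_in_unretrieved_sample / sample_unretrieved
    # Estimated number of relevant messages the filter missed.
    c_hat = miss_rate * n_unretrieved
    return a_hat / (a_hat + c_hat)


# Hypothetical numbers, for illustration only (not data from the study).
recall = estimate_recall(
    n_retrieved=1000,
    n_unretrieved=9000,
    sample_retrieved=100,
    relevant_in_retrieved_sample=95,
    sample_unretrieved=300,
    relevant_in_unretrieved_sample=3,
)
print(round(recall, 3))
```

Because the unretrieved set is typically far larger than the retrieved set, even a small share of missed relevant messages can lower recall noticeably, which is why the size of the coded unretrieved sample matters.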
Conclusions
This paper sets forth a
conceptual framework for the filtering and quality evaluation of social data
that addresses several common challenges and moves toward establishing a
standard of reporting social data. Researchers should clearly delineate data
sources, how data were accessed and collected, the search filter building
process, and how retrieval precision and recall were calculated. The proposed
framework can be adapted to other public social media platforms.
Below: The archive (a+b+c+d), retrieved tweets (a+b), and relevant tweets (a+c+e) in Twitterverse
Below: The average limits of 95% confidence intervals for recall (vertical axis) as the sample size of unretrieved messages increases (horizontal axis), fixing the sample size of retrieved data at 3000
By: Health Media Collaboratory, Institute for
Health Research and Policy, University of Illinois at Chicago, Chicago, IL,
United States
Yoonsang Kim, Health Media Collaboratory, Institute for
Health Research and Policy, University of Illinois at Chicago, Westside
Research Office Building, M/C 275, 1747 W Roosevelt Rd, Chicago, IL, 60608,
United States, Phone: 1 312 413 7596, Fax: 1 312 996 2703
More at: https://twitter.com/hiv insight