sts {sts}    R Documentation

Variational EM for the Structural Topic and Sentiment-Discourse (STS) Model

Description

Estimation of the STS Model using variational EM. The function takes a sparse representation of a document-term matrix, covariates for each document, and an integer number of topics, and returns the fitted model parameters. For an overview of the functions in the package, see sts-package.

Usage

sts(
  X,
  X_seed,
  corpus,
  numTopics,
  maxIter = 100,
  initialization = "stm",
  estimation = "lasso",
  verbose = TRUE,
  parallelize = FALSE,
  stmSeed = NULL
)

Arguments

X

Data frame of document-specific content covariates that affect how much a topic is discussed (prevalence) and the way in which it is discussed (sentiment-discourse).

X_seed

A vector of length equal to the corpus size. This is the key experimental variable (e.g., a review rating or a binary indicator of experiment/control group).

corpus

The document-term matrix to be modeled, given as a sparse term count representation. The object must be a list with each element corresponding to a document. Each document is represented as an integer matrix with two rows, and columns equal to the number of unique vocabulary words in the document. The first row contains the 1-indexed vocabulary entry and the second row contains the number of times that term appears. This is the same format used in the stm package.

numTopics

A positive integer (2 or greater) representing the desired number of topics.

maxIter

A positive integer representing the maximum number of variational EM (VEM) iterations allowed.

initialization

Character argument that allows the user to specify an initialization method. The default choice, "stm", uses a fitted STM model (Roberts et al. 2014, 2016) to initialize coefficients related to prevalence and sentiment-discourse. One can also choose "anchor" to initialize prevalence according to anchor words and the key experimental covariate identified in argument X_seed.

estimation

A character input specifying how kappa should be estimated. "lasso" (default) applies a penalty on the L1 norm: a regularization path is estimated and the optimal shrinkage parameter is then selected using AIC. "adjusted" does not use the lasso penalty. All options use an approximation framework developed in Taddy (2013) called Distributed Multinomial Regression, which uses a factorized Poisson approximation to the multinomial. See Chen and Mankad (forthcoming) for details of the implementation.

verbose

A logical flag indicating whether information should be printed to the screen.

parallelize

A logical flag indicating whether to parallelize the estimation using all but one of the CPU cores on your local machine.

stmSeed

A prefit STM model object used to initialize the STS model. Note that this is ignored unless initialization = "stm".
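The input formats described for X, X_seed and corpus can be illustrated with a small hand-built example. This is only a sketch in base R: the toy tokens, the covariate values and the helper to_stm_doc() are hypothetical and not part of the sts or stm packages. It builds the per-document two-row integer matrices described under corpus; note that the Examples section passes the whole prepDocuments() output as corpus, so in practice that route is the safer one.

```r
# Toy tokenized corpus (hypothetical data, for illustration only)
tokens <- list(
  c("good", "food", "good", "service"),
  c("bad", "service", "slow"),
  c("good", "price", "food", "food")
)
vocab <- sort(unique(unlist(tokens)))

# Hypothetical helper: convert one tokenized document into a 2-row integer
# matrix (row 1 = 1-indexed vocabulary entry, row 2 = count of that term)
to_stm_doc <- function(doc_tokens, vocab) {
  counts <- table(match(doc_tokens, vocab))
  rbind(as.integer(names(counts)), as.integer(counts))
}
documents <- lapply(tokens, to_stm_doc, vocab = vocab)

# Hypothetical document-level covariates, one row per document
meta <- data.frame(treatment = c(1, 0, 1), pid_rep = c(0.2, 0.8, 0.5))

# Main effects plus interaction; [, -1] drops the intercept column
X <- model.matrix(~ treatment * pid_rep, data = meta)[, -1]
X_seed <- as.matrix(meta$treatment)

# All inputs must describe the same number of documents
stopifnot(length(documents) == nrow(X), nrow(X) == nrow(X_seed))
```

Each element of documents has one column per distinct term in that document, and the second row sums to the document length.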

Details

This is the main function for estimating the Structural Topic and Sentiment-Discourse (STS) Model. Users provide a corpus of documents and a number of topics. Each word in a document comes from exactly one topic and each document is represented by the proportion of its words that come from each of the topics. The document-specific content covariates affect how much (prevalence) and the way in which a topic is discussed (sentiment-discourse).
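The estimation argument refers to a factorized Poisson approximation to the multinomial. The idea can be seen in the simplest possible case with a base R sketch. It is not the package's implementation: the simulated data are hypothetical, and it assumes the document length enters each Poisson regression as a log offset. With intercept-only models, fitting one independent Poisson regression per term recovers the same term probabilities as the multinomial maximum likelihood estimate.

```r
set.seed(1)

# Simulate term counts for 50 documents over a 3-term vocabulary
n_docs <- 50
probs <- c(0.5, 0.3, 0.2)                        # true term probabilities
m <- sample(20:60, n_docs, replace = TRUE)       # document lengths
counts <- t(sapply(m, function(mi) rmultinom(1, mi, probs)))

# Multinomial MLE of the term probabilities (pooled over documents)
p_multinom <- colSums(counts) / sum(counts)

# Factorized alternative: one Poisson regression per term,
# with log document length as an offset (assumption of this sketch)
p_poisson <- sapply(seq_len(ncol(counts)), function(j) {
  fit <- glm(counts[, j] ~ 1, family = poisson(), offset = log(m))
  exp(unname(coef(fit)[1]))
})

# The two sets of estimates agree up to numerical tolerance
print(rbind(multinomial = p_multinom, poisson = p_poisson))
```

Because each term is fitted separately, the per-term regressions can be run independently of each other, which is what makes this kind of factorization attractive for large vocabularies.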

Value

An object of class sts with the following components:

alpha

Estimated prevalence and sentiment-discourse values for each document and topic

gamma

Estimated regression coefficients that determine prevalence and sentiment-discourse for each topic

kappa

Estimated kappa coefficients that determine sentiment-discourse and the topic-word distributions

sigma_inv

Inverse of the covariance matrix for the alpha parameters

sigma

Covariance matrix for the alpha parameters

elbo

The ELBO at each iteration of the estimation algorithm

mv

The baseline log-transformed occurrence rate of each word in the corpus

runtime

Time elapsed in seconds

vocab

Vocabulary vector used

mu

Mean (fitted) values for alpha for each document, computed as the document-level variables multiplied by the estimated gamma
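Since elbo holds one value per iteration, it can be used to judge whether maxIter was large enough. The following is a minimal sketch in base R. The ELBO values are made up for illustration (in practice they would come from the elbo component of a fitted object), and the tolerance tol is a hypothetical choice, not a package default.

```r
# Hypothetical ELBO trace, one value per VEM iteration
elbo <- c(-5200, -4100, -3950, -3920, -3915, -3914.5)

# Relative change in the ELBO between consecutive iterations
rel_change <- abs(diff(elbo)) / abs(head(elbo, -1))

# First iteration whose relative change falls below the tolerance
tol <- 1e-3  # hypothetical tolerance
converged_at <- which(rel_change < tol)[1] + 1

print(round(rel_change, 5))
print(converged_at)
```

If no iteration meets the tolerance, converged_at is NA, which suggests rerunning with a larger maxIter.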

References

Roberts, M., Stewart, B., Tingley, D., and Airoldi, E. (2013) "The structural topic model and applied social science." In Advances in Neural Information Processing Systems Workshop on Topic Models: Computation, Application, and Evaluation.

Roberts, M., Stewart, B., and Airoldi, E. (2016) "A model of text for experimentation in the social sciences." Journal of the American Statistical Association.

Chen, L. and Mankad, S. (forthcoming) "A Structural Topic and Sentiment-Discourse Model for Text Analysis." Management Science.

See Also

estimateRegnTables

Examples

# An example using the Gadarian data from the stm package: from raw text to a
# fitted model using textProcessor(), which leverages the tm package
library("tm"); library("stm"); library("sts")
temp <- textProcessor(documents = gadarian$open.ended.response,
                      metadata = gadarian, verbose = FALSE)
out <- prepDocuments(temp$documents, temp$vocab, temp$meta, verbose = FALSE)
# Covariates: treatment, party identification, and their interaction
# (the intercept column is dropped)
X <- model.matrix(~ 1 + out$meta$treatment + out$meta$pid_rep +
                    out$meta$treatment * out$meta$pid_rep)[, -1]
X_seed <- as.matrix(out$meta$treatment)
## low max iteration number just for testing
sts_estimate <- sts(X, X_seed, out, numTopics = 3, verbose = FALSE,
                    parallelize = FALSE, maxIter = 3, initialization = 'anchor')

[Package sts version 1.0 Index]