find.threshold.C {textreg} | R Documentation |
First determines what regularization will give the null model for the given labeling. Then permutes the labeling repeatedly, recording what regularization will give the null model for each permuted labeling. This allows for permutation-style inference on the relationship of the labeling to the text, and for appropriate selection of the tuning parameter.
find.threshold.C(corpus, labeling, banned = NULL, R = 0,
objective.function = 2, a = 1, verbosity = 0,
step.verbosity = verbosity, positive.only = FALSE,
binary.features = FALSE, no.regularization = FALSE,
positive.weight = 1, Lq = 2, min.support = 1, min.pattern = 1,
max.pattern = 100, gap = 0, token.type = "word",
convergence.threshold = 1e-04)
corpus |
A list of strings or a corpus from the |
labeling |
A vector of +1/-1 or TRUE/FALSE indicating which documents are considered relevant and which are baseline. The +1/-1 vector can also contain 0, which means drop the document. |
banned |
List of words that should be dropped from consideration. |
R |
Number of times to scramble the labeling. 0 means use the given labeling and find a single C value. |
objective.function |
2 is hinge loss. 0 is something. 1 is something else. |
a |
What fraction of the regularization should be L1 loss (a=1) vs L2 loss (a=0). |
verbosity |
Level of output. 0 is no printed output. |
step.verbosity |
Level of output for line searches. 0 is no printed output. |
positive.only |
Disallow negative features if true |
binary.features |
Just code presence/absence of a feature in a document rather than count of feature in document. |
no.regularization |
Do not renormalize the features at all. (Lq will be ignored.) |
positive.weight |
Scale the weight of all positively marked documents by this value. (The default is 1, i.e., no scaling.) NOT FULLY IMPLEMENTED |
Lq |
Rescaling to put on the features (2 is standard). Can be from 1 up. Values above 10 invoke an infinity-norm. |
min.support |
Only consider phrases that appear this many times or more. |
min.pattern |
Only consider phrases this long or longer |
max.pattern |
Only consider phrases this short or shorter |
gap |
Allow phrases that have wildcard words in them. Number is how many wildcards in a row. |
token.type |
"word" or "character" as tokens. |
convergence.threshold |
How to decide if descent has converged. (Will go for three steps at this threshold to check for flatness.) |
Important: use the same parameter values as used with the original textreg call!
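One way to keep the two calls in sync is to store the shared settings in a single list and pass that list to both functions. The sketch below is only an illustration: the name `shared.args` is hypothetical, the `C` argument of `textreg` is assumed rather than taken from this page, and using the largest permuted value as `C` is an arbitrary choice made for the example. The package calls run only if textreg is installed.

```r
# Settings that must match between find.threshold.C and textreg.
# `shared.args` is a hypothetical name used only for this sketch.
shared.args <- list(Lq = 2, min.support = 1, max.pattern = 100,
                    binary.features = FALSE)

if (requireNamespace("textreg", quietly = TRUE)) {
  data("testCorpora", package = "textreg")
  corpus   <- testCorpora$testI$corpus
  labeling <- testCorpora$testI$labelI

  # Threshold search with the shared settings.
  Cs <- do.call(textreg::find.threshold.C,
                c(list(corpus, labeling, R = 5), shared.args))

  # Model fit with the same settings. The C argument is an assumption,
  # and taking the largest permuted value is only one possible choice.
  C.use <- max(unlist(Cs)[-1])
  fit <- do.call(textreg::textreg,
                 c(list(corpus, labeling, C = C.use), shared.args))
}
```

Changing a setting in `shared.args` then changes it for both calls at once.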
A list of R+1 numbers (the Cs). The first number is always the C for the _passed_ labeling. The remainder are the Cs for the shuffled labelings.
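A minimal sketch of how the returned values might be used for permutation-style inference follows. The numbers in `Cs` are made up to stand in for a call with R = 5, and both the p-value formula and the 95% quantile are assumptions chosen for illustration, not rules stated by the package.

```r
# Hypothetical values standing in for the output of find.threshold.C
# with R = 5. unlist() covers the case where the result is a list.
Cs <- unlist(list(8.2, 3.1, 2.7, 4.0, 3.5, 2.9))

C.observed <- Cs[1]    # C for the passed labeling
C.permuted <- Cs[-1]   # Cs for the R shuffled labelings

# Permutation p-value (an assumed convention): the share of labelings,
# counting the observed one, whose C is at least the observed C.
p.value <- (1 + sum(C.permuted >= C.observed)) / (1 + length(C.permuted))

# One candidate tuning value (an assumption): an upper quantile of the
# permuted Cs, i.e. a penalty that most shuffled labelings cannot beat.
C.threshold <- quantile(C.permuted, 0.95, names = FALSE)

print(p.value)
print(C.threshold)
```

With only a handful of permutations the p-value is coarse, so in practice R would be set much larger than 5.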
data( testCorpora )
find.threshold.C( testCorpora$testI$corpus, testCorpora$testI$labelI, c(), R=5, verbosity=1 )