FAQ - Word Embedding Analysis

Do you have an API for public use?

We do not have a publicly available API. To limit the number of requests we receive and conserve compute resources, we offer these analyses only through this website. If you need to compute more comparisons than the website's current limits allow, you will need to write your own code or use an open-source implementation, such as one hosted on GitHub.

What is the maximum number of documents that I can submit for a matrix comparison?

Matrix comparisons are limited to 200 documents to conserve compute resources. If you need to compare more, you can submit multiple requests with different data. For a large number of documents, you may need to download and run your own software package (see below).
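For readers who outgrow the limit, here is a minimal sketch of what a matrix comparison can look like in your own code. It assumes each document has already been turned into a numeric vector and that similarity is measured with the cosine. The function names and toy vectors are hypothetical and are not taken from this website's backend.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch: empty vectors match nothing
    return dot / (norm_u * norm_v)


def similarity_matrix(doc_vectors):
    """Return the symmetric n x n matrix of pairwise cosine similarities."""
    n = len(doc_vectors)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):  # compute each pair once, then mirror it
            score = cosine(doc_vectors[i], doc_vectors[j])
            matrix[i][j] = score
            matrix[j][i] = score
    return matrix


# Hypothetical toy document vectors (real ones would come from an embedding model).
toy_vectors = [[1.0, 0.0, 1.0], [0.5, 0.5, 0.0], [0.0, 1.0, 0.0]]
matrix = similarity_matrix(toy_vectors)
```

Because the matrix is symmetric, n documents yield n(n-1)/2 distinct pairs, so 200 documents already amount to 19,900 comparisons.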

Can I have access to the texts used for creating the LSA semantic spaces?

Please direct this query to LSA-NLP.support@colorado.edu

What specific programming languages, packages, and models are used to generate the analyses?

The backend code for this website is written in Python. The LSA implementation was written from scratch, based on the original implementation on the lsa.colorado.edu website. The word2vec implementation primarily uses the Python package gensim with off-the-shelf vectors downloaded from here. The BERT model is the pre-trained bert-base-uncased model accessed via Hugging Face. See the word embedding overview page for more information on these models.

Can you add a different semantic space for LSA/word2vec/BERT?

We have chosen a limited number of widely used semantic spaces that suit a range of text comparison purposes.

Can you point me to a resource that explains the difference between term-term comparison and document-document comparison in LSA?

See chapters 3 and 4 of "Handbook of Latent Semantic Analysis", edited by Landauer, McNamara, Dennis & Kintsch (2007).
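As a rough illustration of the distinction, the sketch below compares two terms directly by their vectors, and compares two documents by first combining the vectors of the terms each one contains. It is a simplification under stated assumptions: the term vectors are hypothetical 3-dimensional toy values (real spaces use a few hundred dimensions), and a document vector is taken to be the plain sum of its term vectors, without the term weighting or singular-value scaling that a full LSA implementation applies.

```python
import math

# Hypothetical toy term vectors, invented for illustration only.
TERM_VECTORS = {
    "doctor": [0.9, 0.1, 0.0],
    "nurse": [0.8, 0.2, 0.1],
    "hospital": [0.7, 0.3, 0.0],
    "guitar": [0.0, 0.1, 0.9],
}


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def term_similarity(word_a, word_b):
    """Term-term comparison: compare the two term vectors directly."""
    return cosine(TERM_VECTORS[word_a], TERM_VECTORS[word_b])


def document_vector(text):
    """Sum the vectors of all known terms in the text (assumed simplification)."""
    dims = len(next(iter(TERM_VECTORS.values())))
    total = [0.0] * dims
    for word in text.lower().split():
        vec = TERM_VECTORS.get(word)
        if vec is None:
            continue  # words missing from the space are skipped
        total = [t + x for t, x in zip(total, vec)]
    return total


def document_similarity(text_a, text_b):
    """Document-document comparison: compare the summed document vectors."""
    return cosine(document_vector(text_a), document_vector(text_b))
```

For example, `term_similarity("doctor", "nurse")` compares two single vectors, while `document_similarity("doctor nurse", "hospital")` compares the summed vector of the first text against that of the second.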

What is the general reading corpus that is used for LSA?

The TASA corpus contains 10 million words of unmarked grade school and college-level English text on language arts, health, home economics, industrial arts, science, social studies, and business. It is divided into 37,600 text samples, contexts, or "documents" (average of 166 words/document).

Still have a question that was not answered here?

Please contact LSA-NLP.support@colorado.edu
