We do not have a publicly available API. To limit the number of requests we receive, and thus conserve our compute resources, we provide this service only through this website. If you wish to compute more comparisons than the website's current limits allow, you will need to write your own code or download an open-source implementation (e.g., from GitHub).
We limit each request to 200 documents in order to conserve compute resources. If you need to compare more, you can submit multiple requests with different data; if you have a very large amount of text to compare, you may need to download and run your own software package (see below).
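If you are preparing documents for multiple requests, one simple approach is to split your document list into batches no larger than the per-request limit. The helper below is purely illustrative (the website has no programmatic batch endpoint, and the function name is ours):

```python
def batch_documents(docs, limit=200):
    """Split a list of documents into consecutive batches of at most
    `limit` items, matching the site's 200-document per-request cap."""
    return [docs[i:i + limit] for i in range(0, len(docs), limit)]

# Example: 450 documents split into batches of 200, 200, and 50.
batches = batch_documents([f"doc{i}" for i in range(450)])
```

Each batch can then be submitted as a separate request through the website.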
Please direct this query to LSA-NLP.firstname.lastname@example.org
The backend code for this website is written in Python. The LSA implementation was programmed from scratch, based on the original implementation on the lsa.colorado.edu website. The word2vec implementation primarily uses the Python package gensim together with off-the-shelf vectors downloaded from here. The BERT model is the pre-trained bert-base-uncased model accessed via Hugging Face. See the word embedding overview page for more information on these models.
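As a rough sketch of how a word2vec comparison of this kind typically works (this is not the site's actual code): each document is mapped to the average of its word vectors, and two documents are scored by the cosine similarity of those averages. The toy three-dimensional vectors below stand in for real pre-trained embeddings, which would normally be loaded with gensim (e.g., `gensim.models.KeyedVectors`) and have hundreds of dimensions:

```python
import numpy as np

# Toy stand-ins for pre-trained word2vec embeddings (illustrative only).
vectors = {
    "cat": np.array([0.9, 0.1, 0.0]),
    "dog": np.array([0.8, 0.2, 0.1]),
    "car": np.array([0.0, 0.1, 0.9]),
}

def doc_vector(doc):
    """Represent a document as the mean of its in-vocabulary word vectors."""
    words = [w for w in doc.lower().split() if w in vectors]
    return np.mean([vectors[w] for w in words], axis=0)

def cosine(u, v):
    """Cosine similarity: 1.0 for identical directions, 0.0 for orthogonal."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

sim = cosine(doc_vector("cat dog"), doc_vector("dog car"))
```

With real embeddings the same recipe applies; semantically related documents (here, "cat" and "dog") score higher than unrelated ones (here, "cat" and "car").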
We have chosen a small set of semantic spaces that are widely used and that suit a range of text-comparison purposes.
See chapters 3 and 4 of the Handbook of Latent Semantic Analysis (Landauer, McNamara, Dennis, & Kintsch, Eds., 2007).
The TASA corpus contains 10 million words of unmarked grade-school and college-level English text on language arts, health, home economics, industrial arts, science, social studies, and business. It is divided into 37,600 text samples, contexts, or "documents" (averaging 166 words per document).
Please contact LSA-NLP.email@example.com