Exaptive applications are great for running and interacting with many types of analysis, and Natural Language Processing (NLP) models and methods are a prime example. NLP models tend to be complex, with deep result sets that are ripe for interactive exploration by your users. You may also want to crowdsource annotation or other types of active feedback from your users as they explore your models. This is where data applications really shine over static visualizations, and where the Exaptive platform can help you build a powerful NLP application.
Using NLP Components in a Xap
Getting to your data: there are a number of ways to make your text available.
- SQL: If you have your text data in a SQL database, there are components available that let you connect, run a general query, and use the results.
- Web APIs: several API components are available to read from search sources such as Google Books and PubMed. There is also a component that can call any standard web-based REST API and return the results as data you can use directly.
- Elasticsearch: if the text data you want to work with is in Elasticsearch, you can use Exaptive components to serve search requests directly inside your application.
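Whichever source you use, the result typically arrives as structured data that you reshape into text records for downstream components. A minimal sketch in Python, using only the standard library and an illustrative JSON payload (the field names here are hypothetical, not any specific API's schema):

```python
import json

# Hypothetical JSON payload, shaped like what a search API might return.
# The field names are illustrative, not a real API's schema.
raw_response = '''{
  "results": [
    {"id": "doc-1", "title": "Topic models in practice"},
    {"id": "doc-2", "title": "Entity extraction at scale"}
  ]
}'''

def parse_documents(payload):
    """Turn a raw JSON response into a list of text records
    that downstream NLP components can consume."""
    data = json.loads(payload)
    return [{"id": r["id"], "text": r["title"]} for r in data["results"]]

docs = parse_documents(raw_response)  # two {"id", "text"} records
```

The same reshaping step applies whether the text comes from SQL, a web API, or Elasticsearch.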
Analyzing your text: a number of NLP components are available out of the box.
- Text Clustering: turn your documents into an interactive landscape, with key terms positioned by their co-occurrence in your corpus and clustered into similar topics.
- LDA Topic Modeling: a powerful tool for discovering the abstract topics that describe your document set. The resulting model is deeply explorable, with a lot of context.
- Sentiment Analysis: classify each document or snippet on a sliding scale from positive through neutral to negative.
- Named Entity Extraction: by wrapping APIs like IBM Watson's Alchemy directly as components, powerful features like entity and concept recognition are easily integrated.
- String Distance: how similar are two given strings? Using metrics like edit distance and Jaccard shingle analysis, this tool can help identify where and how strings are similar, which is very useful in fuzzy-matching use cases.
- Text Network: imagine your documents as a network graph, with each document, author, and key term a node. Each document is connected to the key terms it includes, to the author who wrote it, and to any other document that is substantially similar to it. Similar metrics connect the other parts of the corpus into a highly enlightening network.
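As a concrete illustration of the string-distance metrics mentioned above, here is a minimal, self-contained sketch of Levenshtein edit distance and Jaccard similarity over character shingles. These are generic reference implementations, not the platform component's own code:

```python
def edit_distance(a, b):
    """Levenshtein edit distance: the minimum number of insertions,
    deletions, and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute
        prev = curr
    return prev[-1]

def jaccard_shingles(a, b, k=2):
    """Jaccard similarity over character k-shingles; assumes both
    strings are at least k characters long."""
    sa = {a[i:i + k] for i in range(len(a) - k + 1)}
    sb = {b[i:i + k] for i in range(len(b) - k + 1)}
    return len(sa & sb) / len(sa | sb)

# edit_distance("kitten", "sitting") -> 3
```

Edit distance counts character-level operations, while shingle-based Jaccard similarity captures shared substrings, which is often more robust for fuzzy matching of longer strings.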
Creating your own NLP Component
Libraries: the major NLP libraries make short work of much of this functionality. Any library that can be installed from a standard package repository (pip for Python, CRAN for R) is available. For Python this means you can use libraries like NLTK, gensim, and scikit-learn; for R, popular libraries like tm, stringr, and openNLP.
Unicode: in the Exaptive platform, all strings are passed from component to component as full Unicode. This makes integration with alternate character sets easy. However, if your component needs a specific encoding (e.g. UTF-8, UTF-16, Latin-1), conversions can easily be made in-component with standard conversion tools.
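In Python, for example, such a conversion is a one-line call on the string itself. A small sketch:

```python
text = "naïve café"  # strings arrive from other components as full Unicode

# Convert at the component boundary when an external tool needs bytes
# in a specific encoding.
utf8_bytes = text.encode("utf-8")      # non-ASCII characters become multi-byte
latin1_bytes = text.encode("latin-1")  # one byte per character here
round_trip = utf8_bytes.decode("utf-8")
assert round_trip == text  # lossless round trip
```

Note that `encode` raises an error if the target encoding cannot represent a character, so narrow encodings like Latin-1 only work for text within their character range.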
NLTK assets: one common special case you will likely run into if you use the popular NLTK library is the need to use its internally registered assets (corpora, tokenizers, and the like) as part of your processing. Unlike other common assets, NLTK assets are more than just files your code references: NLTK expects to find them on its own data path. They are simple to handle if you follow this pattern.
In your code, include lines similar to these:

```python
import os
import nltk

# Download the asset on first run, then register the download location
# on NLTK's data path so lookups can find it.
if not os.path.isfile('/tmp/corpora/which-one-you-want'):
    nltk.download('which-one-you-want', download_dir='/tmp/', raise_on_error=False)
nltk.data.path.append('/tmp/')
```
Exporting your model: once your model is built, you will want to export it from the component so it can be used by other components. For details, see the guide on the Exaptive data model. In short, your model will need to be constructed as a JSON-serializable object.
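For example, a topic model might be reduced to plain dicts, lists, and numbers before it leaves the component. A sketch with a hypothetical model structure (the field names are illustrative, not the platform's required schema):

```python
import json

# Hypothetical in-memory topic model output: keys and values must reduce
# to JSON-native types (dict, list, str, int, float, bool, None) before
# the model can leave the component.
model = {
    "topics": [
        {"id": 0, "terms": ["data", "text"], "weights": [0.62, 0.38]},
        {"id": 1, "terms": ["graph", "node"], "weights": [0.55, 0.45]},
    ],
    "num_topics": 2,
}

serialized = json.dumps(model)   # raises TypeError if anything is not serializable
restored = json.loads(serialized)
```

A quick `json.dumps` on your model object is an easy way to check that it will survive the trip between components.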