Clustering elements based on long-text properties

One way to make information findable is to tag it with keywords. The problem with having people tag their own work is that the keywords are often incomplete, miss relevant nuance, or reflect a very narrow perspective. The Termscape tool mines unstructured text to build a conceptual landscape in which hidden connections can be found.

Prerequisite

To use the Termscape tool, you must have Elements that contain a "reasonable" amount of information in a long-text property. "Description", "Abstract" or "Summary" fields are ideal for collecting this kind of information. As a reasonable amount, we suggest a minimum of 3-4 sentences from each of at least 5-6 Elements, though the technique also works with text as short as one sentence or as long as a whole document.
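
If your data is available outside the Cognitive City (for example, in a spreadsheet export), a quick check like the one below can tell you whether it meets the guideline before you start. This is only a sketch: the sample elements, the thresholds taken from the low end of the suggestion above, and the function names are assumptions for illustration and are not part of the Termscape tool.

```python
import re

# Hypothetical sample data: element name -> long-text property value.
elements = {
    "Talk A": "Robots are learning to walk. They fall often. "
              "Each fall teaches them something. Soon they may run.",
    "Talk B": "A short note.",
}

MIN_SENTENCES = 3  # low end of the 3-4 sentence suggestion (assumption)
MIN_ELEMENTS = 5   # low end of the 5-6 element suggestion (assumption)

def count_sentences(text: str) -> int:
    """Rough count: split on ., ! or ? followed by whitespace or end of text."""
    parts = re.split(r"[.!?]+(?:\s+|$)", text.strip())
    return len([p for p in parts if p])

def check_readiness(items: dict[str, str]) -> tuple[list[str], bool]:
    """Return the elements with enough text, and whether there are enough of them."""
    ready = [name for name, text in items.items()
             if count_sentences(text) >= MIN_SENTENCES]
    return ready, len(ready) >= MIN_ELEMENTS

ready, enough = check_readiness(elements)
print("Elements with enough text:", ready)
print("Enough elements to analyze:", enough)
```

The sentence splitter is deliberately crude (abbreviations such as "Dr." will be counted as sentence breaks), which is fine for a rough readiness check.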

Example Data

For this explainer article we'll use an example about TED Talks. In the screenshot below you'll see we've created a view of 99 TED Talks and selected one of them. You can see in the detail panel on the right that it has a property called "Description" which contains a short description of the talk. This article assumes you are already familiar with building views.

Previewing the results of the Termscape analysis

From within the View Builder interface, notice the "AI Assist" option that sits between "Query" and "Visualize", as shown in the screenshot below. You can access this interface by clicking on the icon with the three horizontal bars in the top right corner of the View Builder. If your Cognitive City doesn't have this option available, please click the blue help button in the lower right corner to request that it be enabled.

Clicking on the "AI Assist" icon will bring you to the Termscape interface. Select the element type you want to analyze and which field or set of fields you want to include in the analysis. If you pick multiple fields, the text from each will be combined. In this case we'll cluster the "Talks" based on their "Description" field. If you're experimenting with this functionality for the first time, we suggest you initially just use a single property for the analysis.

After you've selected an element type and a property with text stored in it, click the "Preview" button to see what a termscape looks like using the default analysis parameters. It will appear in a window below the "Preview" button:

You'll notice that if your cursor is placed inside the white termscape window, scrolling will move the termscape visualization up and down. You can also zoom in and out, and click and drag to pan. Each rectangle in the termscape represents one of the elements you analyzed. Location is critical in the termscape visualization: items that are closer together are more conceptually similar than items that are far apart. Items of the same color are in the same cluster and therefore use similar language, but they may also be similar in some ways to items outside their cluster. Click on an item to see the other items it is most similar to based on the language used. For example, a talk in the center that's part of the red cluster is most similar to other talks in the red cluster, but also to a talk in the blue cluster and to a talk in the purple cluster:
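
This article does not describe how Termscape measures similarity internally. As a rough illustration of what "similar based on the language used" can mean, the sketch below scores texts by the cosine similarity of their word counts, a common general-purpose technique. The function names and the three sample talks are invented for the example.

```python
import math
import re
from collections import Counter

def word_counts(text: str) -> Counter:
    """Lowercase the text and count each word."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine_similarity(a: Counter, b: Counter) -> float:
    """1.0 means identical word proportions, 0.0 means no shared words."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def most_similar(target: str, texts: dict[str, str], top_n: int = 3):
    """Rank every other text by its similarity to the target text."""
    target_counts = word_counts(texts[target])
    scores = [(name, cosine_similarity(target_counts, word_counts(text)))
              for name, text in texts.items() if name != target]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]

# Hypothetical descriptions, not taken from the real TED Talks data.
talks = {
    "cancer-detection": "a computer model that detects cancer early",
    "brain-computer": "how a computer can read signals from the brain",
    "ocean-life": "strange life in the deep ocean",
}
print(most_similar("cancer-detection", talks))
```

In this toy data, the talk about brains ranks closest to the cancer talk because the two share the word "computer", even though they might well land in different clusters.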

To understand what the different colored clusters represent, move your cursor outside the termscape window, into the grey area, so that scrolling moves the overall window. Scroll down to see the legend of the termscape, which has an entry for each cluster. If more than 4 clusters were identified, you may need to place your cursor inside the legend box and scroll to see all the clusters. To get a sense of how distinct the clusters are, click on a cluster in the legend and all the elements associated with that cluster will be highlighted in the termscape:

The words listed under each cluster give you a sense of what that cluster is about. They are listed in order of their importance in differentiating that cluster from the others. Click on any word in the legend to see snippets on the right that show the context in which that word is used. In the TED Talks example, clicking on the word "Cancer", the first word of the blue cluster, reveals three talks in the blue cluster that use that word:

Just because a word is a differentiating term of a cluster doesn't mean it isn't used by elements in other clusters. For example, clicking on the second word in the blue cluster, "computer", shows four talks using the word "computer", but one of them is in the green cluster:

Each snippet shown in the detail panel has a colored dot on its left to let you know which cluster the element with that snippet is in. The element in the green cluster is talking about computers and brains. We can find out more by clicking on the green rectangle in the termscape viewer, and then clicking the "View Text" button that appears in the right panel. This will open a window showing all the details about that element, where you can read the full text of the property that was analyzed.

If you look at the legend under the green cluster, you'll see that cluster is being differentiated by the words "health" and "care" among others. If you read the description of this talk, you'll see that it's not actually a talk about health care, but it's been placed in the green cluster because it's using language more similar to talks in the green cluster than those in the blue cluster that involve cancer and computers.
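
The article does not say how Termscape ranks the words under each cluster. One simple way to think about "differentiating" terms, offered here purely as an assumption, is to compare the share of a cluster's texts that use a word with the share of all other texts that use it. The sketch below does that on invented sample data; the function names and scoring rule are illustrative only.

```python
import re
from collections import Counter

def doc_words(text: str) -> set[str]:
    """The distinct lowercase words in one text."""
    return set(re.findall(r"[a-z']+", text.lower()))

def differentiating_terms(clusters: dict[str, list[str]], top_n: int = 3):
    """For each cluster, rank words by (share of texts inside) - (share outside)."""
    # Number of texts in each cluster that contain each word.
    doc_freq = {name: Counter(w for text in texts for w in doc_words(text))
                for name, texts in clusters.items()}
    sizes = {name: len(texts) for name, texts in clusters.items()}
    total_docs = sum(sizes.values())
    result = {}
    for name, freq in doc_freq.items():
        outside_docs = total_docs - sizes[name]
        scores = {}
        for word, inside in freq.items():
            outside = sum(doc_freq[other][word]
                          for other in doc_freq if other != name)
            inside_share = inside / sizes[name]
            outside_share = outside / outside_docs if outside_docs else 0.0
            scores[word] = inside_share - outside_share
        # Highest score first; ties broken alphabetically for stable output.
        result[name] = sorted(scores, key=lambda w: (-scores[w], w))[:top_n]
    return result

# Hypothetical clusters, loosely modeled on the example in this article.
clusters = {
    "blue": ["cancer screening with a computer",
             "a computer that finds cancer cells",
             "cancer research"],
    "green": ["health care for all",
              "a computer model of health care"],
}
terms = differentiating_terms(clusters)
print(terms)
```

Note that "computer" appears in both sample clusters, so it scores low for each of them under this rule. A word can be used in several clusters and still be more characteristic of one than another.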

If you are curious how a particular word is being used that isn't displayed in the legend, use the "Search" box in the upper left corner of the Termscape. Type in the word and then click on the "Full Text" search option in order to find all the places that word is used:
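
To make the idea of context snippets concrete, here is a minimal keyword-in-context sketch. It is a hypothetical stand-in for what a full-text search returns, not the tool's implementation; the function name, the snippet width, and the sample texts are assumptions.

```python
import re

def find_snippets(word: str, texts: dict[str, str], width: int = 30):
    """Return (element name, snippet) pairs for each whole-word, case-insensitive match."""
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    snippets = []
    for name, text in texts.items():
        for match in pattern.finditer(text):
            # Keep up to `width` characters on each side of the match.
            start = max(0, match.start() - width)
            end = min(len(text), match.end() + width)
            snippets.append((name, text[start:end].strip()))
    return snippets

# Hypothetical descriptions, not taken from the real TED Talks data.
talks = {
    "talk-1": "The computer beat the reigning champion at chess.",
    "talk-2": "Life in the deep ocean is stranger than fiction.",
    "talk-3": "Can a Computer learn to read signals from the brain?",
}
for name, snippet in find_snippets("computer", talks):
    print(f"{name}: ...{snippet}...")
```

Because the pattern uses word boundaries, searching for "comp" does not match "computer". Whether the tool's own search matches partial words is not stated in this article.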

Click on the cluster titles and cluster terms in the legend, and search for terms of interest, to get a sense of how the Termscape has grouped the elements. If you feel the default clustering isn't accurate or informative, you can tune the settings the algorithm uses by scrolling back up to the top of the page and clicking on the "Tune" button:

After you change a setting you'll need to click the "Preview" button again to re-run the analysis with the new settings. Once you have a preview in which the clusters are meaningful to you, scroll to the top and click the "Apply" button to store the clustering back to the Cognitive City database. After the termscape analysis has been applied, which could take a few minutes, you're ready to refine your view to incorporate the new data provided by the AI.

Modifying your view after applying a Termscape

After you've stored your analysis by clicking the "Apply" button, return to your visualization. You won't see any immediate changes. There are a few steps you'll need to take to incorporate the Termscape clustering into the visualization. We're working on adding features to make this process simpler, but for now, here are the steps you need to take:

1. Add the Termscape Clusters to the view. Use the legend to select all the elements of the type you ran through the Termscape analysis. Then click the "Expand" button in the right detail panel and select just the "Termscape Cluster" option. These clusters were added as a result of applying your Termscape analysis:

2. Style the clusters so that they are colored based on the colors that were in the Termscape preview. After you've clicked the "Add" button to bring the cluster nodes into the view, the elements will be connected to their clusters, but it won't be easy to tell which clusters are which without some styling:

Open the View Options by clicking the cog in the upper right corner of the View Builder. Select the Termscape Cluster element type. We suggest the following styling:

These settings will set the text color for each cluster based on its color in the Termscape preview, so that each talk is now connected to a colored cluster label:

If you click on one of these labels, you will see the number of that cluster and its hex color code in the detail panel:

If we want to color-code the elements themselves based on what cluster they are connected to, we'll need to do that via individual custom style rules for each cluster. Return to the Styles in the View Options, but this time click "+ Add a rule" under the "Custom Rules":

Using the cluster number and color code for each cluster, add a rule that looks like this:

By repeating this for each of the clusters you can color-code the view of the elements based on their clusters:

Since the talks are now different colors, we suggest turning off the "Show in legend" option for the Talk element type, since the color code will no longer match. You can also use custom styling to do a wide variety of other things based on the clusters. Once you've styled the view based on cluster, we suggest updating your query to incorporate a second dimension of data into the view and seeing how the colors of the clusters align, or not, with that second dimension. For example, in the screenshot below we've updated the view so that the nodes are rectangles that show the thumbnail image for the talk, with a border colored based on the cluster the talk belongs to. As a second dimension we've added the tags that the talks are connected to:

Can you use this technique to find structural holes in the networks you are visualizing? Those gaps are opportunities for innovation! For example, in the view above the tags in blue are independent of cluster, so we can find a tag like "identity" that has multiple talks across all the different clusters:

To make it easier to find these sorts of overlaps we suggest you learn how to make a parameterized view so that you can filter down to specific terms of interest:

After that, you can learn how to use the Matchmaker tool to automatically find groups like this for you that meet particular criteria of interest.
