Terra Cognita

Every place Wikipedia knows, painted by meaning.

What this is

Terra Cognita places roughly one million geotagged English Wikipedia articles at their real coordinates and colours each one by the meaning of what it describes. Most embedding atlases throw away geography and use a model's latent space for position. This does the opposite: it keeps the real Earth, and paints it by meaning.

The result is a semantic texture of the planet. You can read the hue of a coastline, a mountain range or a city and see what it is mostly about, then type a concept and watch matching places glow.

What this is not

This is not a map of the world. It is a map of what English Wikipedia has written down. A place with no article is invisible here, and that absence is rarely random.

The colours describe the English-language description of each place, not a universal or local truth. A French cathedral is coloured by how English Wikipedia describes it.

How the colour works

Each article's text is turned into a high-dimensional embedding by an open sentence-embedding model. Articles about similar things end up close together in that space.
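
As a rough illustration of what "close together" means, the sketch below ranks a few articles against a query vector by cosine similarity and keeps the score alongside each match. The three-dimensional vectors, the article titles and the function names are all hypothetical stand-ins; real embeddings have hundreds of dimensions, and the source does not say which similarity measure the project actually uses.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, articles, top_k=3):
    """Return up to top_k (title, score) pairs, best match first."""
    scored = [(title, cosine_similarity(query, vec)) for title, vec in articles.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors standing in for real embeddings (hypothetical values).
toy_articles = {
    "Lighthouse A": [0.9, 0.1, 0.0],
    "Lighthouse B": [0.8, 0.2, 0.1],
    "Football stadium": [0.0, 0.1, 0.9],
}
ranked = rank_by_similarity([1.0, 0.0, 0.0], toy_articles, top_k=2)
```

Returning the score with each title, rather than the title alone, is what lets an interface show how strong a match is instead of implying that two places are the same.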

We reduce that space to three dimensions with UMAP and quantile-normalise each axis. The three axes are then mapped onto the CIELAB colour space (designed to be approximately perceptually uniform), with lightness clamped to a legible band, and the result is converted to sRGB. Similar meanings therefore receive similar colours, and colour distance roughly tracks meaning distance.
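
The colour step can be sketched as follows, assuming the three UMAP coordinates are already available for each article. The lightness band (45 to 85), the a/b extent (60) and all function names are assumptions chosen for illustration, not the project's actual settings. The CIELAB to sRGB conversion uses the standard D65 white point and sRGB transfer curve, and out-of-gamut colours are simply clipped.

```python
D65_WHITE = (0.95047, 1.0, 1.08883)  # reference white used for CIELAB -> XYZ


def quantile_normalise(values):
    """Replace each value by its rank scaled to [0, 1]; tied values get consecutive ranks."""
    n = len(values)
    if n == 1:
        return [0.5]
    order = sorted(range(n), key=lambda i: values[i])
    out = [0.0] * n
    for rank, i in enumerate(order):
        out[i] = rank / (n - 1)
    return out


def _lab_f_inverse(t):
    """Inverse of the CIELAB companding function."""
    delta = 6.0 / 29.0
    return t ** 3 if t > delta else 3.0 * delta * delta * (t - 4.0 / 29.0)


def _gamma_encode(c):
    """Clip a linear-light channel to [0, 1] and apply the sRGB transfer curve."""
    c = min(1.0, max(0.0, c))
    return 12.92 * c if c <= 0.0031308 else 1.055 * c ** (1.0 / 2.4) - 0.055


def lab_to_srgb(lightness, a, b):
    """Convert one CIELAB colour (D65) to an 8-bit sRGB triple."""
    fy = (lightness + 16.0) / 116.0
    fx = fy + a / 500.0
    fz = fy - b / 200.0
    x, y, z = (w * _lab_f_inverse(f) for w, f in zip(D65_WHITE, (fx, fy, fz)))
    r = 3.2406 * x - 1.5372 * y - 0.4986 * z
    g = -0.9689 * x + 1.8758 * y + 0.0415 * z
    bl = 0.0557 * x - 0.2040 * y + 1.0570 * z
    return tuple(round(255 * _gamma_encode(c)) for c in (r, g, bl))


def colours_from_umap(points, l_min=45.0, l_max=85.0, ab_extent=60.0):
    """Map 3-D points to sRGB: axis 0 -> clamped lightness, axes 1 and 2 -> a and b."""
    columns = [quantile_normalise([p[k] for p in points]) for k in range(3)]
    colours = []
    for q0, q1, q2 in zip(*columns):
        lightness = l_min + q0 * (l_max - l_min)  # stays inside the legible band
        a = (2.0 * q1 - 1.0) * ab_extent
        b = (2.0 * q2 - 1.0) * ab_extent
        colours.append(lab_to_srgb(lightness, a, b))
    return colours


# Hypothetical UMAP output for three articles.
toy_points = [(0.1, 2.0, -1.0), (5.0, -3.0, 0.2), (2.2, 0.4, 7.0)]
toy_colours = colours_from_umap(toy_points)
```

Quantile normalisation spreads articles evenly along each axis, so a few outliers in the UMAP output cannot squeeze everything else into a narrow range of hues.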

Categories and clusters

In cluster mode, every article is assigned to one of sixteen curated top-level categories, plus an explicit Other or mixed bucket, derived from its Wikidata instance-of value. The palette is chosen to remain distinguishable for viewers with colour-vision deficiency, and every colour is always paired with a text label.
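
A minimal sketch of that assignment, assuming a hand-written lookup from Wikidata instance-of identifiers to category names. The table below is a small illustrative subset, not the project's real mapping, and the QIDs shown should be checked against Wikidata before being relied on. An article whose types point to no category, or to more than one, falls into the Other or mixed bucket.

```python
OTHER = "Other or mixed"

# Illustrative subset only; the project's actual lookup table is not shown in the source.
CATEGORY_BY_INSTANCE_OF = {
    "Q515": "Settlements",             # city
    "Q532": "Settlements",             # village
    "Q16970": "Religious sites",       # church building
    "Q3918": "Education and science",  # university
    "Q1248784": "Transport",           # airport
    "Q4022": "Rivers and lakes",       # river
    "Q23397": "Rivers and lakes",      # lake
    "Q8502": "Landforms",              # mountain
}


def categorise(instance_of_ids):
    """Return the single curated category an article's types agree on, else OTHER."""
    matches = {
        CATEGORY_BY_INSTANCE_OF[qid]
        for qid in instance_of_ids
        if qid in CATEGORY_BY_INSTANCE_OF
    }
    if len(matches) == 1:
        return matches.pop()
    return OTHER  # no known type, or types that span several categories


example_category = categorise(["Q515"])
```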

Beneath the top-level categories, an unsupervised clustering (HDBSCAN) finds finer groups, each named automatically from the most distinctive words in its articles. The sixteen top-level names are human-curated and never auto-generated.

  • Settlements: Villages, towns, cities and other inhabited places.
  • Administrative areas: Countries, regions, states, provinces and districts.
  • Religious sites: Churches, temples, mosques, monasteries and shrines.
  • Education and science: Schools, universities, observatories and research institutes.
  • Transport: Stations, airports, ports, roads, bridges and tunnels.
  • Rivers and lakes: Rivers, lakes, reservoirs, bays and waterfalls.
  • Landforms: Mountains, hills, valleys, islands, capes and peninsulas.
  • Parks and nature: National parks, nature reserves, gardens and protected areas.
  • Heritage and archaeology: Castles, ruins, monuments, historic districts and archaeological sites.
  • Culture and arts: Museums, theatres, galleries, stadiums and cultural venues.
  • Industry and energy: Factories, mines, power and water infrastructure.
  • Commerce and lodging: Hotels, shops, markets, offices and commercial buildings.
  • Military and conflict: Battles, military bases, fortifications and conflict sites.
  • Civic and health: Hospitals, government buildings, prisons and public institutions.
  • Sport and recreation: Sports grounds, golf courses, ski resorts and leisure sites.
  • Buildings and houses: Houses, residential buildings, towers and generic structures.
  • Other or mixed: Everything that does not fall into a single curated category.
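
The automatic naming of the finer clusters could look something like the sketch below, which scores each word by how frequent it is inside a cluster and how few other clusters use it (a class-level variant of TF-IDF). This is one common way to pick "most distinctive words" and is an assumption here; the source does not state the exact scoring, and the tokeniser, function names and sample texts are hypothetical.

```python
import math
import re
from collections import Counter


def tokenise(text):
    """Lower-case the text and keep runs of ASCII letters as words."""
    return re.findall(r"[a-z]+", text.lower())


def name_clusters(cluster_texts, words_per_name=2):
    """Name each cluster after its highest-scoring words.

    score(word) = (share of the cluster's words) * log(n_clusters / clusters using the word)
    Ties are broken alphabetically so the result is deterministic.
    """
    counts = {
        cid: Counter(word for text in texts for word in tokenise(text))
        for cid, texts in cluster_texts.items()
    }
    n_clusters = len(counts)
    clusters_using = Counter()
    for counter in counts.values():
        clusters_using.update(counter.keys())

    names = {}
    for cid, counter in counts.items():
        total = sum(counter.values()) or 1
        scores = {
            word: (n / total) * math.log(n_clusters / clusters_using[word])
            for word, n in counter.items()
        }
        best = sorted(scores, key=lambda w: (-scores[w], w))[:words_per_name]
        names[cid] = " / ".join(best)
    return names


# Hypothetical article snippets grouped by cluster id.
toy_clusters = {
    0: ["A lighthouse on the coast", "The lighthouse guards the harbour"],
    1: ["A windmill near the village", "The windmill grinds grain"],
}
toy_names = name_clusters(toy_clusters, words_per_name=1)
```

Words that appear in every cluster, such as "the", score zero and so never end up in a name, which removes the need for a separate stop-word list in this toy setting.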

Coverage bias

Geotagged English Wikipedia is profoundly uneven. Research from the Oxford Internet Institute found that around 84% of geotagged articles describe Europe and North America, Africa accounts for roughly 3%, and the Middle East and North Africa for under 2%. Wealthier, English-speaking and historically documented places are over-represented.

The density view exists to surface this honestly rather than hide it. Treat blank regions as gaps in the record, not as empty land. This is a coverage lens, not ground truth.

Citation: Graham, M. et al., Oxford Internet Institute — the uneven geography of Wikipedia knowledge.

Ethics and accessibility

Semantic similarity is fuzzy. Where we surface a twin or a match we show the score, and we never claim two places are identical.

Colour is the headline channel, so it is always backed by text: categories are labelled, search results show scores, and the cluster palette is colour-vision-deficiency safe. The interface is keyboard navigable and honours reduced-motion preferences.

Compute transparency

Embeddings are computed once at build time, not per visit. The development build covers a fifty-thousand-article subset and runs on a laptop; embedding the full corpus takes a few hours on a single GPU. Search runs entirely in your browser, so visiting the map costs no server compute.

Data and licences

Terra Cognita stands on four layers of open work, credited and licensed separately:

  • Article text and descriptions are from English Wikipedia, licensed CC BY-SA 4.0. Any description surfaced here carries that ShareAlike obligation.
  • Structured data (coordinates, Wikidata identifiers, instance-of types) is from Wikidata, released under CC0.
  • The geotagged GeoParquet packaging is by Shane98c (wiki-geoparquet), which made the source corpus tractable.
  • Embeddings use open models: nomic-embed-text-v1.5 and all-MiniLM-L6-v2, both under the Apache 2.0 licence.

Basemap tiles are served by CARTO using OpenStreetMap data, copyright OpenStreetMap contributors.