What is entity extraction?

Named entity recognition (NER), also known as entity extraction, is an information extraction technique. It identifies key elements in a text, classifies them, and assigns them to predefined categories. NER is a method from computational linguistics and belongs to the field of natural language processing (NLP).

The aim of entity extraction is to convert unstructured data into structured data, i.e. to make information machine-readable for further processing. 

In a text such as a news article on an online portal, the entities are primarily people (names), organisations, and places. In addition to such named entities, a text may contain items such as medical codes, times, quantities, percentages, or monetary values.
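To make these categories concrete, here is a minimal rule-based sketch in Python. It is an illustration only, not how production systems work: the lookup table GAZETTEER, the patterns in PATTERNS, the label names and the sample sentence are all assumptions made up for this example, and real systems typically rely on trained statistical models rather than hand-written rules.

```python
import re

# Hypothetical gazetteer: a tiny hand-made lookup table of known names.
GAZETTEER = {
    "Angela Merkel": "PERSON",
    "Siemens": "ORGANISATION",
    "Munich": "PLACE",
}

# Simplified patterns for non-name items (assumed formats, far from exhaustive).
PATTERNS = {
    "PERCENTAGE": re.compile(r"\b\d+(?:\.\d+)?\s?%"),
    "MONEY": re.compile(r"(?:€|\$|£)\s?\d+(?:[.,]\d+)*(?:\s(?:million|billion))?"),
    "TIME": re.compile(r"\b(?:[01]?\d|2[0-3]):[0-5]\d\b"),
}


def extract_entities(text):
    """Return one record per recognised item, sorted by position in the text."""
    found = []
    # Named entities: exact string lookup against the gazetteer.
    for name, label in GAZETTEER.items():
        for match in re.finditer(re.escape(name), text):
            found.append({"text": match.group(), "label": label,
                          "start": match.start(), "end": match.end()})
    # Other items: pattern matching.
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append({"text": match.group(), "label": label,
                          "start": match.start(), "end": match.end()})
    return sorted(found, key=lambda entity: entity["start"])


article = ("Angela Merkel visited Siemens in Munich at 14:30. "
           "Revenue rose 4.5% to €2.1 billion.")

for entity in extract_entities(article):
    print(entity)
```

Each record carries the matched text, its category and its character offsets, which is one simple way of turning unstructured text into structured, machine-readable data.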

During extraction, a document (for example, an HTML page) is scanned and the recognised entities are marked. Automatic entity recognition achieves a high success rate: even when the algorithms are confronted with linguistic ambiguities, human annotators score only a few percent higher.
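The text does not say how the success rate is measured. One common convention, assumed here, is to compare the automatically marked spans with a human-annotated reference and to report precision, recall and their harmonic mean (F1), counting a span as correct only if its boundaries and category both match. The function name score and the sample spans below are hypothetical.

```python
def score(predicted, gold):
    """Compare two collections of (start, end, label) spans by exact match."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Precision: share of predicted spans that are correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of reference spans that were found.
    recall = true_positives / len(gold) if gold else 0.0
    # F1: harmonic mean of precision and recall.
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical reference annotation made by a human.
gold = {(0, 13, "PERSON"), (22, 29, "ORGANISATION"),
        (33, 39, "PLACE"), (43, 48, "TIME")}

# Hypothetical system output: one wrong category, one missed span.
predicted = {(0, 13, "PERSON"), (22, 29, "ORGANISATION"),
             (33, 39, "ORGANISATION")}

precision, recall, f1 = score(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Exact matching is a strict choice; a more lenient scheme could give partial credit for overlapping spans, which would raise the reported numbers.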

 

Where is entity extraction applied?

Named entity recognition is used wherever large amounts of content are processed. News outlets and publishers, for example, generate large volumes of online content every day. Structuring the information in these articles is essential both for a good user experience and for monetising the content.

An entity extraction algorithm automatically scans entire articles and identifies the important people, organisations, and locations they contain. Once extracted, this information helps to categorise articles automatically into defined hierarchies. On this basis, search results can be compiled more precisely, content can be curated into thematic clusters, related articles can be shown to the user, and targeted advertising can be served.
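As an illustration of how extracted entities could drive a "related articles" feature, the following Python sketch ranks articles by how many entities they share. It is a simplified assumption, not a description of any particular portal: the article IDs, the entity sets in ARTICLES and the choice of Jaccard similarity as the measure are all invented for this example.

```python
def jaccard(a, b):
    """Overlap of two entity sets: shared entities divided by all entities."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def related_articles(target_id, entities_by_article, top_n=2):
    """Return the IDs of the articles sharing the most entities with the target."""
    target = entities_by_article[target_id]
    scored = [
        (jaccard(target, entities), other_id)
        for other_id, entities in entities_by_article.items()
        if other_id != target_id
    ]
    # Drop articles with no shared entity, then rank by score (ties by ID).
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other_id for _, other_id in scored[:top_n]]


# Hypothetical output of an entity extraction step, one set per article.
ARTICLES = {
    "a1": {"Angela Merkel", "Berlin", "European Union"},
    "a2": {"Angela Merkel", "European Union", "Brussels"},
    "a3": {"Siemens", "Munich"},
    "a4": {"Berlin", "Siemens"},
}

print(related_articles("a1", ARTICLES))
```

For article a1 this prints ['a2', 'a4']: a2 shares two entities with it, a4 shares one, and a3 shares none. The same entity sets could feed thematic clustering or category assignment.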

Beyond news portals, the recommendation features of other media services are also based on named entity recognition. Another field of application, outside the media industry, is the sorting of support requests received by e-mail or chat.

Sources

https://www.di-lab.tum.de/fileadmin/w00byz/www/Munichre_SS2018_Presentation.pdf

http://www.dfki.de/~neumann/esslli04/reader/ie-lec3-1.pdf