EPFL student creates a new language-analysis program

This story was originally published by EPFL on 12.01.2021.

Jonathan Besomi, a Master’s student at EPFL, has developed a program called Texthero that lets users generate representations of textual data with just a few lines of code, thereby simplifying the analysis of natural languages.

We now live in a data-filled age that has ushered in its own distinct challenges. One of the biggest is how to analyze vast reams of information. In response, Besomi, a Master’s student in data science, has developed Texthero, a program that simplifies the task of analyzing textual data. It was created in the spring of 2020 under the supervision of Kenneth Younge, Chair of Technology and Innovation Strategy at EPFL’s Management of Technology & Entrepreneurship Institute. Designed as open-source software and written in the Python programming language, Texthero swiftly won over developers around the world.

“Texthero has been downloaded over 23,000 times so far, and has been awarded 2,000 stars on the GitHub platform,” says Besomi. “It got a lot of attention as soon as we released it – people even began sharing it on social media, primarily Twitter and LinkedIn. This indicates that there was strong demand for such a program in the Python/NLP [Natural Language Processing] community.”

Rapid visual representations

Using Texthero, developers can quickly visualize and understand text-based datasets. “Our program takes a text made up of unstructured data, cleans it up, generates a representation of it by converting it into numerical form, and finally visualizes it. In other words, Texthero gives users an overall idea of the structure of a completely unfamiliar text,” explains Besomi.
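To make the clean-then-represent sequence described above concrete, here is a minimal sketch written with the Python standard library only. It is not Texthero's code or API: the function names `clean` and `tfidf`, the tiny stopword list and the sample sentences are all assumptions chosen for illustration, and the final plotting step is left out.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not Texthero's own list).
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "on"}


def clean(text):
    """Lowercase the text, keep only runs of letters, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def tfidf(docs):
    """Return the sorted vocabulary and one TF-IDF vector per tokenized document."""
    vocab = sorted({t for doc in docs for t in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(t for doc in docs for t in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid division by zero for empty documents
        vectors.append(
            [(counts[t] / total) * math.log(n_docs / doc_freq[t]) for t in vocab]
        )
    return vocab, vectors


# Hypothetical sample texts, invented for this sketch.
texts = [
    "The court ruled on the appeal.",
    "The court dismissed the appeal in 2019!",
    "Python is a programming language.",
]
cleaned = [clean(t) for t in texts]
vocab, vectors = tfidf(cleaned)
```

Each text ends up as a list of numbers of the same length, one weight per vocabulary term. A visualization step would typically reduce those vectors to two dimensions and draw them as a scatter plot, so that similar texts appear close together.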

The rudiments of Texthero first came to Besomi when he was working with Professor Younge on Fastlaw, a program for analyzing legal texts. “Fastlaw is a ‘word-embedding’ tool that was trained on a large corpus of legal data provided by Harvard University’s Caselaw Access Project (CAP) – a project to make every ruling published by US courts freely available,” says Besomi. He and Younge presented their program to the Harvard Law School Library.

“As I was developing Fastlaw, I realized there was a need for software that could quickly pre-process, represent and visualize textual data,” says Besomi. Before Texthero, developers who wanted to process natural language had to combine several separate libraries, such as spaCy, scikit-learn, Gensim and NLTK. The process was both time-consuming and complex. “Now, with Texthero, just a few lines of code are enough to plot a text to be processed.”

A new version

To date, 16 developers have contributed to Texthero through pull requests on GitHub. They’ve fixed bugs, introduced new features and improved the documentation. “We’re about to release a new version (1.1) that will boost text processing speeds even further,” says Besomi.

Besomi now wants to consolidate and expand the Texthero community through blog posts and tutorials, in order to increase uptake of his program. “When I think about the billions of pieces of data around us that we can’t assimilate, it would seem that text analysis – in all its forms – is the wave of the future,” says Besomi, who is currently completing an in-company internship at IBM Research Zurich and writing a thesis on text analysis. “I’m fascinated by these issues and pleased to have created a simple, straightforward program that makes natural language processing easier.”
