We are looking for a PhD candidate to join the Data Science group at Radboud University for an exciting new project on transfer learning for language modelling with an application for federated search. Transfer learning learns general purpose language models from huge datasets, such as web crawls, and then trains the models further on smaller datasets for a specific task. Transfer learning in NLP has successfully used pre-trained word-embeddings for several tasks. Although the success of word embeddings on search tasks has been limited, recently pre-trained general purpose language representations such as BERT and ELMo have been successful on several search tasks, including question answering tasks and conversational search tasks. Resource descriptions in federated search consist of samples of the full data that are sparser than full resource representations. This raises the question of how to infer vocabulary that is missing from the sampled data. A promising approach comes from transfer learning from pre-trained language representations. An open question is how to effectively and efficiently apply those pre-trained representations and how to adapt them to the domain of federated search. In this project, you will use pre-trained language models, and further train those models for a (federated) search task. You will evaluate the quality of those models as part of international evaluation conferences like the Text Retrieval Conference (TREC) and the Conference and Labs of the Evaluation Forum (CLEF).
The Data Management and Biometrics group and the Formal Methods & Tools group at the University of Twente seek a PhD candidate for SEQUOIA: Smart maintenance optimization via big data & fault tree analysis, a project funded by NWO Applied and Engineering Sciences and the companies ProRail and NS. ProRail is responsible for the Dutch railway network, including its construction, management, maintenance, and safety; NS has the same responsibility for the Dutch train fleet. The project is led by Mariëlle Stoelinga, Joost-Pieter Katoen and Djoerd Hiemstra.
SEQUOIA aims to improve the reliability of the Dutch railroads by deploying big data analytics to predict and prevent failures. Its scientific core is a novel combination of machine learning, fault tree analysis and stochastic model checking. The key idea is that big data analytics provide the statistics on failures, their correlations, dependencies, etc., while fault trees provide the domain knowledge needed to interpret these data. The project aims to develop explainable machine learning techniques that discover causal relations instead of statistical correlations: machine learning of fault trees or of other models that are normally designed top-down by domain experts. The techniques should help ProRail to decrease train disruptions and delays, to lower maintenance cost, and to increase passenger comfort.
The project involves intense cooperation with ProRail and RWTH Aachen University. The PhD candidate will spend a portion of their time at ProRail. Key project deliverables are efficient analysis algorithms and a workable tool to be used in the ProRail context. For more information, see:
Scientific programmer: folktale search and visualisation
The FACT project will investigate new possibilities for humanities researchers (folktale researchers, narratologists, documentalists, etc.) to study folktales based on annotations and relations that have been automatically assigned using data-driven methods. The Dutch Folktale Database (Nederlandse Volksverhalenbank) of the Meertens Institute is a very large and varied collection of Dutch Folktales. Within FACT, software will be developed to automatically enrich the folktales in this collection with metadata such as names, keywords, genre, a summary and type. An additional research goal is to investigate if automatic analysis of the folktale collection can reveal relations between folktales that are difficult to discover through human inspection. The annotation and clustering methods to be developed will be integrated in a user-friendly XML-based platform for the annotation and exploration of folktales, to support research on the variability of human oral and written transmission.
The University of Twente has vacancies for a PhD-student, a postdoc and a scientific programmer, who will be working together as a team to achieve the project goals. In addition there will be close cooperation with the Tunes & Tales project (funded under the Computational Humanities programme of KNAW) that is aimed at investigating sequences of motifs in, and variability of, melodies and folktales in oral transmission.
The scientific programmer will work on the development of user-friendly tools for folktale researchers that incorporate the annotation and clustering techniques developed by the postdoc and the PhD student. The annotation tool should allow for (semi) automatic annotation of folktales with language, genre, keywords, names, summary and type. The visualization tool should enable easy inspection of document clusters. In addition, the programmer will develop an XML-based search system that allows the general public to search for folktales in the Folktale Database based on their annotations.
Apply on-line (Deadline: 1 November 2011)
The Database Group of the University of Twente offers a PhD student position in the Dutch national project COMMIT, a 100M Euro project involving 10 universities and 70 companies. The program brings together leading researchers in search engines, parallel computing, databases, interaction in context, embedded systems and knowledge technology.
A large part of the web, the invisible web or deep web, cannot be indexed by web crawlers, for instance dynamic web pages that are returned in response to filling in a web form, or performing a search in a search engine. Instead of crawling deep web data, the approach will monitor web pages for certain (types of) queries. The objective is to develop approaches for monitoring web data that allow users to see a page's full history of relevant/important changes by identifying entities: people, organizations, products, geographic locations, events, etc. The approach should relate changes in multiple web sites, giving the user a data-warehouse-like overview of the pages they monitor; drilling down to time periods, persons, events, etc.
The research will be done in co-operation with WCC. WCC, started in 1996, is a successful software company based in Utrecht (NL) and Reston (USA). WCC's current focus areas are the Employment and Identification Security markets. Both commercial and government customers worldwide use WCC's smart search & match solutions to support their primary processes. Both WCC and the Database Group of the University of Twente have made significant advances in entity matching and entity ranking, applied for instance to Employment Matching and Expert Search. This project will extend this work to monitoring of deep web pages, such as social networking sites, micro-blogging sites, job sites, etc. The candidate will spend part of the time at WCC in Utrecht.
[official vacancy text] (deadline: July 3rd, 2011)
The digital library of the future will be a dynamic and highly networked entity, consisting of both the original documents and user-generated annotations and links to and from external resources. Among other things, the Human Media Interaction (HMI) group of the University of Twente investigates the possibilities for multimedia content analysis and information linking to support and provide facilities for navigating and exploring digital libraries with content in a variety of formats including text, audio, images and video. There is funding available for a PhD position starting from January 2010.
The PhD research will be carried out in the context of AXES, a multidisciplinary research project funded by the EU (FP7, Digital Libraries). The research will focus on deploying diverse, automatically generated, time-labeled annotations (for example, those coming from automatic speech recognition) for connecting heterogeneous data sources, and will be strongly evaluation-driven.
More information (deadline: 21 November)
Position: Distributed Information Retrieval
The Database Group of the University of Twente offers a job opening in the NWO Vidi Project “Distributed Information Retrieval by means of Keyword Auctions”. The project's aim is to distribute internet search functionality in such a way that communities of users and/or federations of small search systems provide search services in a collaborative way. Instead of getting all data to a centralized point and processing queries centrally, as is done by today's search systems, the project will distribute queries over many small autonomous search systems and process them locally. In this project, the PhD student will research a new approach to distributed search: distributed information retrieval by means of keyword auctions. Keyword auctions like Google's AdWords give advertisers the opportunity to provide targeted advertisements by bidding on specific keywords. Analogous to these keyword auctions, local search systems will bid for keywords at a central broker. They “pay” by serving queries for the broker. The broker will send queries to those local search systems that optimize the overall effectiveness of the system, i.e., local search systems that are willing to serve many queries, but are also able to provide high quality results. The PhD student will work within a small team of researchers that approaches the problem from three different angles: 1) modeling the local search system, including models for automatic bidding and multi-word keywords, 2) modeling the search broker's optimization using the bids, the quality of the answers, and click-through rates, and 3) integration of structured data typically available behind web forms of local search systems with text search.
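To make the broker idea concrete, here is a minimal Python sketch of how a broker might route a query using keyword bids. It is an illustration under stated assumptions, not the project's method: the `Bid` fields, the `route_query` function, the example systems, and the scoring rule (offered capacity multiplied by estimated quality) are all hypothetical choices made for this example.

```python
from dataclasses import dataclass


@dataclass
class Bid:
    system: str     # hypothetical name of a local search system
    keyword: str    # keyword the system bids on (lowercase)
    capacity: int   # number of queries the system offers to serve (its "payment")
    quality: float  # broker's estimate of result quality, in [0, 1] (assumed given)


def route_query(query: str, bids: list[Bid], top_k: int = 2) -> list[str]:
    """Return up to top_k systems with a bid on a query term, best score first."""
    terms = set(query.lower().split())
    scores: dict[str, float] = {}
    for bid in bids:
        if bid.keyword in terms:
            # Assumed scoring rule: willingness to serve times expected quality.
            score = bid.capacity * bid.quality
            scores[bid.system] = max(scores.get(bid.system, 0.0), score)
    # Sort by descending score; break ties by system name for determinism.
    ranked = sorted(scores, key=lambda s: (-scores[s], s))
    return ranked[:top_k]


bids = [
    Bid("folktales", "folktale", capacity=100, quality=0.9),
    Bid("jobs", "vacancy", capacity=500, quality=0.6),
    Bid("news", "folktale", capacity=300, quality=0.2),
]
print(route_query("Dutch folktale search", bids))
```

In this toy setting a system that offers many queries but returns poor results can still lose to a smaller, higher-quality system, which mirrors the trade-off between willingness to serve and result quality described above. A realistic broker would also have to estimate quality itself, for example from click-through rates.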
See official announcement. (Deadline: 19 April 2009)
Two positions: PuppyIR, Information Retrieval for Children
The Groups Human Media Interaction and Databases of the University of Twente offer two job openings in the European Project PuppyIR. Current Information Retrieval (IR) systems are designed for adults: they return information that is unsuitable for children, present information in lists that children find difficult to manage and make it difficult for children to ask for information. PuppyIR will create information search services that are tailored to the specific needs of children, giving children the opportunity to fully and safely exploit the power of the Internet. PuppyIR will develop new interaction paradigms to allow children to easily express their information need, to have results presented in an intuitive way and to engage children in system interaction. It will develop a set of Information Services: components to summarise textual and audiovisual content for children, to help children safely explore new information, to moderate information for children at different ages, to build new social networks and to intelligently aggregate and present information to children. PuppyIR will offer an open source platform that enables system designers to construct useful and usable information retrieval systems for children. The project will demonstrate the effectiveness of the PuppyIR modules through demonstrator systems constructed in collaboration with the Netherlands Public Library Association and the Emma Children's Hospital. At the University of Twente, a team of six senior researchers and three PhD students will cooperate in PuppyIR. One PhD student will work on user interaction design. The other two positions are described below.
Position 1: Analyzing and structuring textual information (at Human Media Interaction) Analyzing and structuring textual information studies how natural language processing tools can assist the organization of information in a way that enables children to easily access the information. The PhD student at Human Media Interaction will focus on information extraction, text classification, and story understanding and summarization on written and spoken data, for instance for questions or comments created by children (e.g., chats, blogs) and content created explicitly for children (e.g., stories).
Position 2: Multimedia content mining (at Databases) Multimedia content mining will develop database search technology that enables better understanding of the individual behavior of the child and consequently his/her information need. The PhD student at Databases will focus on concept retrieval, faceted search, query formulation assistance, and intuitive relevance feedback mechanisms that allow children to easily access the content of multimedia data sources, for instance for content sharing within online groups including moderated discovery.
See official announcement. (Deadline: 15 April 2009)
We have two job positions in the MultimediaN project.
Position 1: Speech Technology
SHoUT is an open source speech recognition toolkit developed at the University of Twente. SHoUT is a Dutch acronym for: “Spraak Herkennings Onderzoek Universiteit Twente”, or in English: “Speech Recognition Research at the University of Twente”. SHoUT is used to aid research on large vocabulary continuous speech recognition, including research into the application of statistical language models, audio segmentation and classification, speaker diarization and machine learning hyper parameter estimation for speech recognition.
Position 2: Search Engine Technology
PF/Tijah (Pathfinder/Tijah, pronounced “Pee Ef Teeja”) is a flexible open source text search system developed at the University of Twente in cooperation with CWI Amsterdam and TU München. The system is integrated in the Pathfinder XQuery compiler and can be downloaded as part of the MonetDB/XQuery database system. PF/Tijah is used to aid research in information retrieval at the University of Twente, including the application of language models to search, entity retrieval, and implementation of the W3C candidate recommendation XQuery Full-Text.
[Official Job Advertisement] (deadline: August 1, 2008)
We have an opening for a PhD position in the Effort project: a joint project with the Information and Language Processing Group of the University of Amsterdam, and funded by the Netherlands Organisation for Scientific Research (NWO). Deadline for application: 15 September.