Python Multilingual BibTeX Generator

This Python script generates multilingual_papers.bib by filtering the original anthology+abstracts.bib file for multilingual NLP research papers.

Features

  • Identical Logic: Uses the same filtering logic as the JavaScript web application
  • Comprehensive Detection: Detects multilingual papers using keywords and language names
  • LaTeX Cleaning: Properly handles LaTeX commands and formatting
  • Statistics: Provides detailed statistics about the filtering process
  • Safe Operation: Checks for existing files and asks for confirmation before overwriting
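
To give a feel for what LaTeX cleaning involves, here is a minimal sketch. It is not the script's actual implementation: the function name clean_latex and the small LATEX_ACCENTS table are hypothetical, and the real script may handle many more commands.

```python
import re

# Hypothetical, deliberately tiny accent table (assumption: the real script covers more).
LATEX_ACCENTS = {
    r"\'e": "é",
    r'\"u': "ü",
    r"\~n": "ñ",
}


def clean_latex(text):
    """Return text with some common LaTeX markup removed (illustrative only)."""
    # Replace accent macros, braced ({\'e}) or bare (\'e), with the accented character.
    for macro, char in LATEX_ACCENTS.items():
        text = text.replace("{" + macro + "}", char).replace(macro, char)
    # Unwrap simple formatting commands: \textbf{word} -> word
    text = re.sub(r"\\[a-zA-Z]+\{([^{}]*)\}", r"\1", text)
    # Drop braces used for case protection: {B}ERT -> BERT
    text = text.replace("{", "").replace("}", "")
    # Collapse runs of whitespace, including line breaks inside a field.
    return re.sub(r"\s+", " ", text).strip()
```

Cleaning titles and abstracts before matching keeps a protected capital such as {H}indi from hiding the word "Hindi" from the keyword search.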

Requirements

  • Python 3.6 or higher
  • No external dependencies (uses only standard library)

Usage

  1. Place your files: Ensure anthology+abstracts.bib is in the same directory as the script
  2. Run the script:
    python generate_multilingual_bib.py
    
  3. Follow prompts: The script will ask for confirmation if multilingual_papers.bib already exists

Output

The script will:

  • Generate multilingual_papers.bib containing only multilingual papers
  • Display statistics about the filtering process
  • Show the top 10 most common keywords found
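
The statistics and the top-keyword list could be produced along these lines. This is a sketch under assumptions, not the script's code: summarize and its arguments are hypothetical names, and the report layout simply mirrors the example output in this README.

```python
from collections import Counter


def summarize(total_papers, matched_terms_per_paper, top_n=10):
    """Build the statistics report as a string.

    matched_terms_per_paper holds one list of matched terms per multilingual paper.
    """
    counts = Counter()
    for terms in matched_terms_per_paper:
        counts.update(set(terms))  # count each term at most once per paper

    found = len(matched_terms_per_paper)
    # Guard against division by zero when the input file has no entries.
    pct = 100.0 * found / total_papers if total_papers else 0.0

    lines = [
        "Statistics:",
        f"  Total papers processed: {total_papers}",
        f"  Multilingual papers found: {found}",
        f"  Percentage multilingual: {pct:.1f}%",
        "",
        f"Top {top_n} keywords found:",
    ]
    lines += [f"  {term}: {n} papers" for term, n in counts.most_common(top_n)]
    return "\n".join(lines)
```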

Example Output

Reading anthology+abstracts.bib...
Parsing BibTeX entries...
Found 50000 total papers
Found 2500 multilingual papers
Generating BibTeX content...
Writing to multilingual_papers.bib...
Successfully generated multilingual_papers.bib with 2500 papers!

Statistics:
  Total papers processed: 50000
  Multilingual papers found: 2500
  Percentage multilingual: 5.0%

Top 10 keywords found:
  multilingual: 1200 papers
  chinese: 800 papers
  crosslingual: 600 papers
  hindi: 400 papers
  low-resource: 350 papers
  korean: 300 papers
  arabic: 250 papers
  japanese: 200 papers
  spanish: 180 papers
  french: 150 papers

Filtering Criteria

The script uses the same criteria as the web application:

Multilingual Keywords

  • multilingual, crosslingual, multi-lingual, cross-lingual
  • low-resource language, low resource language
  • low-resource, low resource

Language Names

  • 100+ language names including: Hindi, Chinese, Korean, Arabic, Spanish, French, German, Japanese, etc.
  • Regional language variations and dialects
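
The criteria above can be sketched as a single case-insensitive search over the title and abstract. This is an illustration only: the lists are abbreviated, the helper names are hypothetical, and whole-word matching is an assumption that the script itself may not share.

```python
import re

# Abbreviated lists for illustration; the script is described as having 100+ language names.
MULTILINGUAL_KEYWORDS = [
    "multilingual", "crosslingual", "multi-lingual", "cross-lingual",
    "low-resource", "low resource",
]
LANGUAGE_NAMES = ["hindi", "chinese", "korean", "arabic"]


def _compile(terms):
    # Try longer terms first so a longer phrase is preferred over its prefix.
    ordered = sorted(terms, key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in ordered)
    # \b restricts matches to whole words (assumption, see lead-in).
    return re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)


TERM_PATTERN = _compile(MULTILINGUAL_KEYWORDS + LANGUAGE_NAMES)


def find_terms(title, abstract=""):
    """Return the sorted, lowercased terms found in the title or abstract."""
    text = title + " " + abstract
    return sorted({m.group(0).lower() for m in TERM_PATTERN.finditer(text)})


def is_multilingual(title, abstract=""):
    """A paper counts as multilingual if at least one term matches."""
    return bool(find_terms(title, abstract))
```

For example, find_terms("Cross-lingual Transfer for Hindi and Korean") returns ['cross-lingual', 'hindi', 'korean'].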

Customization

You can modify the filtering criteria by editing the constants at the top of the script:

MULTILINGUAL_KEYWORDS = [
    'multilingual', 'crosslingual', 'multi lingual',
    # Add your custom keywords here
]

LANGUAGE_NAMES = [
    'afrikaans', 'albanian', 'amharic', 'arabic',
    # Add more language names here
]

Error Handling

The script includes robust error handling:

  • Checks for input file existence
  • Handles malformed BibTeX entries gracefully
  • Provides clear error messages
  • Asks for confirmation before overwriting existing files
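
The input check and the overwrite confirmation might look roughly like the sketch below. The function name check_paths, the ask parameter, and the prompt wording are hypothetical; only the "not found" message is taken from this README.

```python
import os


def check_paths(input_path, output_path, ask=input):
    """Return True if it is safe to proceed, False otherwise.

    ask is injectable so the prompt can be replaced in tests.
    """
    if not os.path.isfile(input_path):
        print(f"Error: {input_path} not found in current directory.")
        return False
    if os.path.exists(output_path):
        # Hypothetical prompt text; anything other than y/yes is treated as "no".
        answer = ask(f"{output_path} already exists. Overwrite? (y/N): ")
        if answer.strip().lower() not in ("y", "yes"):
            print("Aborted; existing file left untouched.")
            return False
    return True
```

Defaulting to "no" means that pressing Enter by accident never destroys an existing multilingual_papers.bib.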

Performance

  • Efficient regex-based parsing
  • Memory-efficient processing for large files
  • Fast keyword matching using set operations
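
As an illustration of regex-based parsing, the sketch below splits a BibTeX string into entries and pulls out their fields. It is a simplified stand-in, not the script's parser: the names are hypothetical, and braced values are only handled up to one level of nesting.

```python
import re

# Entry header at the start of a line, e.g. "@inproceedings{some-key,"
ENTRY_HEADER = re.compile(r"^@(\w+)\s*\{\s*([^,\s]+)\s*,", re.MULTILINE)

# A field is name = "value" or name = {value}; quoted values may contain \" escapes,
# braced values may contain one level of nested braces.
FIELD = re.compile(
    r'(\w+)\s*=\s*(?:"((?:[^"\\]|\\.)*)"|\{((?:[^{}]|\{[^{}]*\})*)\})',
    re.DOTALL,
)


def iter_entries(bibtex):
    """Yield one dict per entry, lazily, so callers can filter as they go."""
    headers = list(ENTRY_HEADER.finditer(bibtex))
    for i, head in enumerate(headers):
        # An entry body runs from the end of its header to the next header.
        end = headers[i + 1].start() if i + 1 < len(headers) else len(bibtex)
        fields = {}
        for match in FIELD.finditer(bibtex, head.end(), end):
            value = match.group(2) if match.group(2) is not None else match.group(3)
            # Normalize line breaks and indentation inside multi-line values.
            fields[match.group(1).lower()] = " ".join(value.split())
        yield {"type": head.group(1).lower(), "key": head.group(2), "fields": fields}
```

Entries that do not match the header pattern are simply skipped, which is one way to tolerate malformed input without aborting the whole run.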

Troubleshooting

File Not Found

Error: anthology+abstracts.bib not found in current directory.

Solution: Ensure anthology+abstracts.bib is in the directory from which you run the script.

No Papers Found

No multilingual papers found. Check your keywords and language lists.

Solution: Verify your BibTeX file contains papers with multilingual content, or adjust the keyword lists.

Encoding Issues

The script uses UTF-8 encoding. If you encounter encoding errors, make sure your BibTeX file is saved as UTF-8.
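
If you are unsure whether a file is valid UTF-8, a small standalone helper like the one below can tell you where decoding fails. It is not part of the script as described here; read_bibtex and the latin-1 fallback are assumptions chosen for illustration, and the right fallback depends on how your file was produced.

```python
def read_bibtex(path, fallback_encoding="latin-1"):
    """Read path as UTF-8, falling back to another encoding if that fails."""
    with open(path, "rb") as handle:
        raw = handle.read()
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the offset of the first undecodable byte in the file.
        print(f"{path}: invalid UTF-8 at byte {err.start}; "
              f"retrying as {fallback_encoding}")
        return raw.decode(fallback_encoding)
```

Writing the returned text back out with open(path, "w", encoding="utf-8") converts the file to UTF-8.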

Comparison with JavaScript Version

This Python script produces identical results to the JavaScript web application:

  • Same filtering logic
  • Same LaTeX cleaning
  • Same BibTeX output format
  • Same keyword detection

The main advantages are that it runs without a web browser and reports detailed statistics about the filtering process.