
Text Summarization Project Documentation

Introduction

Text summarization is the task of producing a concise, coherent summary of a longer text while preserving its key information. This project implements an abstractive text summarization system using natural language processing (NLP) techniques and deep learning models.

Features

  • Data Preprocessing: Input text is lowercased, stripped of punctuation, tokenized, and filtered of stop words to improve model performance (see the preprocessing sketch after this list).

  • Model Architecture: The summarization model uses an encoder-decoder framework with an attention mechanism augmented by coverage. Bidirectional LSTM layers encode the input text, and LSTM layers with attention decode it into a summary (a compact Keras sketch follows this list).

  • Pointer-Generator Network: The model incorporates a pointer-generator network that can copy tokens directly from the input text, which lets it handle out-of-vocabulary words and improves summary quality (see the final_distribution sketch below).

  • Custom Attention with Coverage: The attention mechanism maintains a coverage vector, the running sum of past attention distributions, which records where the model has already attended in the input and discourages it from attending there again, mitigating repetition in the generated summaries (see the decoder_step sketch below).

  • Training and Evaluation: The model is trained with the Adam optimizer and a sparse categorical cross-entropy loss, then evaluated on a held-out test dataset to measure loss and accuracy.

  • Prediction: Once trained, the model generates summaries for new input text by tokenizing, padding, and encoding the text, then decoding step by step to produce summary tokens (a decoding sketch appears under Usage).
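
The sketches below illustrate these components. They are minimal, hypothetical examples written for this documentation rather than the project's actual source; every function name, parameter name, and layer size in them (preprocess_text, decoder_step, final_distribution, VOCAB, UNITS, and so on) is an assumption.

Preprocessing, assuming nltk's punkt tokenizer and English stop-word list:

    import re
    import nltk
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize

    nltk.download("punkt", quiet=True)
    nltk.download("stopwords", quiet=True)
    STOP_WORDS = set(stopwords.words("english"))

    def preprocess_text(text):
        # Lowercase and strip punctuation before tokenizing.
        text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
        # Tokenize, then drop common stop words.
        return [t for t in word_tokenize(text) if t not in STOP_WORDS]

One decoder step of coverage-aware additive attention with the pointer-generator switch, written in plain NumPy to keep the arithmetic visible. The coverage vector is the running sum of past attention distributions, and p_gen is the probability of generating from the vocabulary rather than copying from the source:

    import numpy as np

    def softmax(x):
        z = np.exp(x - x.max())
        return z / z.sum()

    def decoder_step(enc_states, dec_state, x_t, coverage, p):
        # Attention scores: e_i = v . tanh(Wh h_i + Ws s_t + wc c_i + b),
        # where c_i is the coverage accumulated on input position i.
        scores = np.array([
            p["v"] @ np.tanh(p["Wh"] @ h + p["Ws"] @ dec_state
                             + p["wc"] * c + p["b"])
            for h, c in zip(enc_states, coverage)
        ])
        attn = softmax(scores)            # attention distribution a_t
        context = attn @ enc_states       # weighted sum of encoder states
        new_coverage = coverage + attn    # coverage grows where we attend
        # Pointer-generator switch from context, decoder state, and input.
        z = (p["w_ctx"] @ context + p["w_dec"] @ dec_state
             + p["w_x"] @ x_t + p["b_ptr"])
        p_gen = 1.0 / (1.0 + np.exp(-z))
        return attn, context, new_coverage, p_gen

Mixing the vocabulary distribution with the copy distribution over an extended vocabulary, so out-of-vocabulary source words stay reachable:

    def final_distribution(p_gen, vocab_dist, attn, src_ids, extended_size):
        # P(w) = p_gen * P_vocab(w) + (1 - p_gen) * attention mass on the
        # source positions where w occurs; OOV source words get ids beyond
        # the fixed vocabulary.
        dist = np.zeros(extended_size)
        dist[: len(vocab_dist)] = p_gen * vocab_dist
        for pos, word_id in enumerate(src_ids):
            dist[word_id] += (1.0 - p_gen) * attn[pos]
        return dist

A compact Keras version of the base encoder-decoder, compiled with the Adam optimizer and sparse categorical cross-entropy as described above. Coverage and the pointer-generator switch would need custom layers along the lines of the NumPy sketches:

    import tensorflow as tf
    from tensorflow.keras import layers

    VOCAB, EMB, UNITS = 20000, 128, 256   # hypothetical sizes

    # Encoder: embedding + bidirectional LSTM over the article tokens.
    enc_in = layers.Input(shape=(None,), name="article_tokens")
    enc_emb = layers.Embedding(VOCAB, EMB)(enc_in)
    enc_out, fh, fc, bh, bc = layers.Bidirectional(
        layers.LSTM(UNITS, return_sequences=True, return_state=True))(enc_emb)
    state_h = layers.Concatenate()([fh, bh])
    state_c = layers.Concatenate()([fc, bc])

    # Decoder: LSTM seeded with the encoder state, plus additive attention.
    dec_in = layers.Input(shape=(None,), name="summary_tokens")
    dec_emb = layers.Embedding(VOCAB, EMB)(dec_in)
    dec_out, _, _ = layers.LSTM(
        2 * UNITS, return_sequences=True, return_state=True)(
        dec_emb, initial_state=[state_h, state_c])
    context = layers.AdditiveAttention()([dec_out, enc_out])
    probs = layers.Dense(VOCAB, activation="softmax")(
        layers.Concatenate()([dec_out, context]))

    model = tf.keras.Model([enc_in, dec_in], probs)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])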

Usage

  1. Data Preparation: Prepare a dataset in CSV format with two columns: text, holding the full article text, and summary, holding the corresponding reference summary (see the loading sketch after this list).

  2. Data Preprocessing: Preprocess the text data by lowercasing, removing punctuation, tokenizing, and removing stop words, as in the preprocessing sketch under Features.

  3. Model Training: Train the summarization model on the preprocessed data. The architecture combines bidirectional LSTM layers for encoding, LSTM layers with attention for decoding, and a pointer-generator network for token generation.

  4. Model Evaluation: Evaluate the trained model on a separate test dataset to measure performance metrics such as loss and accuracy.

  5. Prediction: Use the trained model to generate summaries for new input text: preprocess and encode the text, then decode step by step to produce the summary tokens (see the sketch below).
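
A hypothetical end-to-end sketch of steps 1 and 5, assuming the CSV file name, a fitted Keras Tokenizer, and the start/end token ids; none of these names come from the project:

    import numpy as np
    import pandas as pd
    import tensorflow as tf

    # Step 1: load the article/summary pairs.
    df = pd.read_csv("articles.csv")          # hypothetical file name
    texts, summaries = df["text"], df["summary"]

    # Step 5: greedy decoding with the trained encoder-decoder model,
    # feeding the tokens generated so far back in as decoder input.
    def summarize(model, tokenizer, text,
                  max_len=50, start_id=1, end_id=2, max_input_len=400):
        seq = tokenizer.texts_to_sequences([text])
        seq = tf.keras.preprocessing.sequence.pad_sequences(
            seq, maxlen=max_input_len, padding="post")
        out = [start_id]
        for _ in range(max_len):
            probs = model.predict([seq, np.array([out])], verbose=0)
            next_id = int(np.argmax(probs[0, -1]))   # most likely next token
            if next_id == end_id:
                break
            out.append(next_id)
        return tokenizer.sequences_to_texts([out[1:]])[0]

Step 4 is then a single call along the lines of model.evaluate([test_articles, test_decoder_inputs], test_targets).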

Dependencies

  • Python 3.x
  • TensorFlow 2.x
  • pandas
  • numpy
  • nltk
  • re (part of the Python standard library; no separate installation needed)
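
Assuming a standard Python environment, the third-party packages can be installed with pip (re ships with Python itself):

    pip install tensorflow pandas numpy nltk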
