Ángel Ibáñez Hernández


.NET/SQL Analyst Developer +20 years experience

PdfLogicalExtractor

PdfLogicalExtractor Description

PdfLogicalExtractor is a piece of software designed to extract information from PDF documents in a logical and orderly manner, so that it can later be processed by the systems it integrates with.

The system is based on adaptable logic implemented in a template system that can process all documents of a certain type.

A template can adapt to the variations established in its definition, such as months with different numbers of days, displaced document areas, or differences between pages within the same document. In general, it can express any type of extraction logic that is needed.

Template extraction logic can perform result cleaning based on predefined rules, obtaining pure data types or removing parts of insignificant text.

Template extraction logic can perform calculations based on results obtained from different extractions or predefined values, such as calculating totals in a table based on price per unit.

Likewise, any functionality or exception that the template logic of a document type requires in order to be more effective can be programmed specifically.

Installation

The system is contained in a dynamic-link library (DLL), distributed as a NuGet package, which can be added to any type of platform or software project in the .NET ecosystem.

Direct integration into a .NET project is done by adding the NuGet package to the project itself and using the calls described in the documentation or surfaced by Visual Studio's IntelliSense.

Open the NuGet Package Manager in Visual Studio and search for 'Angelves' to find our packages.




Manual installation by console:

PM> NuGet\Install-Package Angelves.PdfLogicalExtractor -Version 1.0.2

Interface

The integration exposes a very simple software interface with a few overloaded calls and a response data model that can be accessed directly or easily processed as a JSON response.

namespace Angelves.PdfLogicalExtractor.PublicInterface
{
    public interface ILetsGoToExtraction
    {
        ExtractionResult Start(Template template, string filePath, DocumentType type);

        ExtractionResult Start(string templatePath, string filePath, DocumentType type);

        ExtractionResult Start(string templatePath, Stream fileStream, DocumentType type);
		
        ExtractionResult Start(Template template, Stream fileStream, DocumentType type);

        string GetWordsInPdf(string filePath, string? filter1 = null, int round = 0);
    }
}
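
As a minimal sketch of how the interface above might be called, the helper below runs one extraction and returns the result as JSON text. It uses only the Start overload that takes a template path and a file path. Several things are assumptions and not confirmed by this document: that ExtractionResult and DocumentType are reachable from the namespace shown, that the result serializes cleanly with Newtonsoft.Json, and how an ILetsGoToExtraction instance is obtained (here it is simply received as a parameter). The names ExtractionDemo and ExtractToJson are hypothetical.

```csharp
using Angelves.PdfLogicalExtractor.PublicInterface;
using Newtonsoft.Json;

public static class ExtractionDemo
{
    // The extractor instance and the DocumentType value are supplied by the caller:
    // this document does not show how to create the former or which values the latter has.
    public static string ExtractToJson(
        ILetsGoToExtraction extractor,
        string templatePath,
        string pdfPath,
        DocumentType type)
    {
        // Overload from the interface above: template path + PDF file path.
        ExtractionResult result = extractor.Start(templatePath, pdfPath, type);

        // Assumption: the response model can be serialized directly with Newtonsoft.Json.
        return JsonConvert.SerializeObject(result, Formatting.Indented);
    }
}
```

The overloads that accept a Stream would be called the same way, passing an open stream (for example from File.OpenRead) instead of the file path.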

Templates

A Template is a JSON file that implements extraction logic against a PDF document.

The goal of these templates is to obtain an organized result, capable of being processed by an application or system, from a plain-text reading of the document.


The following table describes the sections of a template, where you can define the extraction boxes.

Section | Description
Name | The name of the template is set in this property.
Settings | This section holds the template's configuration settings.
Offsets [pro] | In this section, control points are established to detect shifts of the text relative to the original layout.
Metaboxes [pro] | In this section the Metaboxes are defined. These elements are extraction boxes in themselves but have no impact on the output, as they are used in the calculation of formulas.
Boxes | In this section, the Boxes or extraction boxes are defined, which are the data that we extract from the document, calculations, tables, etc.
Renames | In this section the output field renamings are established. This section is useful for redefining the names generated automatically by extraction operations.

The following table contains a description of the commands that you can use within a Box in the template definition.

Functionality | Description
name | Box name. This property is important for calculations.
id (*1) [pro] | Offset identifier.
type | Defines the type of the Box:
    Type | Description | Parameters
    Literal | Exact value taken from the template. |
    Empty | Empty field. |
    Text | Extracted text. |
    Number | Whole number. |
    Decimal | Decimal number. |
    DateTime | Date or date and time. | The format in Format.
    TextSplit | Text divided by a break character. | Break character in Parameters[0].
    Table | Table in the document. | Definition in Header: array of JSON Box objects, which define the header.
    Array | List of values from a start position to an end position. | Parameter list:
        Parameter | Description
        additionaldata | Text or Number
        w | The width of the columns.
    ArraySwap (*2) [pro] | Array with flipped axes integrated into a Table. This type generates a row for each valid entry. | Parameter list:
        Parameter | Description
        additionaldata | Text or Number
        deletenullsinarray | Deletes null values from the original array.
        arrayindex | JSON object with y1 and y2, which define the row from which the index is extracted (days, for example, in the top row).
        w | The width of the columns.
idoffset | Links the Box with the id of the offset control.
required | Whether the Box is required or not.
metabox [pro] | Converts a normal Box into a Metabox that will not be reflected in the output.
x1, x2, y1 and y2 | Text extraction coordinates.
deviationbetweenpages [pro] | Deviation of the Y coordinate between pages.
master (*2) | Boolean that sets the column that masters the Table extraction.
additionaldata | Additional data required by some types of Box.
parameters | Array of parameters required by some types of Box.
alternativecoords [pro] | Alternative coordinates based on conditions. JSON array of conditions, each with a "condition" field in which a formula is compared with a result, and with 4 optional fields x1, x2, y1 and y2, where the new coordinates are established if the condition is met.
extractionrules | Rules in data extraction. JSON array with the fields "action", "target" and "parameters[]":
    action | Description | target | parameters
    Erase | Deletes the target text from the extraction result. | Text to delete |
    EraseFrom | Clears the result from the first occurrence of the character to the end. | Character |
    EraseTo | Clears the result from the beginning to the first occurrence of the character. | Character |
    MaxRightCharsFromChar | Extracts a number of characters to the right of a given character. | Character |
    QuitSpaces | Removes all spaces. | |
    QuitLineReturns | Removes all carriage returns and line breaks. | |
    Replace | Replaces the searched text with the replacement text. | Searched text | [0] = replacement text, [1] = "exact" if exact.
    ToLower | Converts to lowercase. | |
    ToUpper | Converts to uppercase. | |
    ToRoundInteger | Converts text to rounded integer format text, if possible. | |
    ToDecimal | Converts text into decimal-formatted text with two decimal places, if possible. | |
    ToDate | Converts text to DateTime formatted text, if possible. The format is taken from Format. | |
formula [pro] | Calculation formulas with numbers in the template and/or metaboxes or boxes. Formula syntax:
    Element | Description
    [SELF] | Original value extracted without modification.
    [BOX: name] | Value of a Box or Metabox identified by "name".
    [BOXROW: name] (*2) | Value of a Box or a Metabox belonging to the same row of a table, identified by "name".
    [SWAPVALUE] (*2) | The value extracted for each row of the ArraySwap.
    +, -, * and / | Supported mathematical operators. If no mathematical operator is used, the values are treated as literals and the result is text.
comments | Comments, not used in the process.
(*1): Applicable to the offsets section
(*2): Applicable only to Tables.
(*3): Applicable only to Renames.
[pro]: Professional version only.
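
The order of entries in extractionrules can matter. The sketch below is an illustration only, not the library's code: it imitates QuitSpaces and Erase with plain string operations, assuming that rules are applied in the order they are listed and that Erase removes every occurrence of its target. The class name, method names and the sample input text are hypothetical.

```csharp
public static class RuleOrderDemo
{
    // Stand-in for QuitSpaces: removes every space character.
    public static string QuitSpaces(string text) => text.Replace(" ", "");

    // Stand-in for Erase: removes every occurrence of the target text.
    public static string Erase(string text, string target) => text.Replace(target, "");

    // QuitSpaces first, then Erase with a target written without spaces.
    public static string SpacesThenErase(string raw, string target) =>
        Erase(QuitSpaces(raw), target);

    // Erase first, then QuitSpaces: the target must match the raw text exactly.
    public static string EraseThenSpaces(string raw, string target) =>
        QuitSpaces(Erase(raw, target));
}
```

With the hypothetical raw text "N. ORDER: 2024 / 001" and the target "N.ORDER:", SpacesThenErase yields "2024/001", while EraseThenSpaces yields "N.ORDER:2024/001" because the target does not match the raw text until its spaces have been removed. This is why a target written without spaces is listed after QuitSpaces.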

Example

{
  "templatename": "EXAMPLE TEMPLATE",
  "config": {
    "externalsfieldsintables": false,
    "decimalseparator": ","
  },
  "offsets": [
    {
      "id": 1,
      "text": "PROGRAM",
      "x": 82.92,
      "y": 153.95
    },
    {
      "id": 2,
      "text": "ADVERTISER:",
      "x": 82.92,
      "y": 101.99
    }
  ],
  "metaboxes": [
    {
      "name": "end_month_1",
      "type": "Number",
      "x1": 714,
      "y1": 146,
      "x2": 723,
      "y2": 153
    },
    {
      "name": "end_month_2",
      "type": "Number",
      "x1": 705,
      "y1": 146,
      "x2": 714,
      "y2": 153
    },
    {
      "name": "end_month_3",
      "type": "Number",
      "x1": 696,
      "y1": 146,
      "x2": 705,
      "y2": 153
    }
  ],
  "boxes": [
    {
      "name": "station",
      "required": true,
      "x1": 600,
      "y1": 60,
      "x2": 800,
      "y2": 80
    },
    {
      "name": "advertiser",
      "idoffset": 2,
      "required": true,
      "x1": 160,
      "y1": 90,
      "x2": 350,
      "y2": 110
    },
    {
      "name": "product",
      "idoffset": 2,
      "required": true,
      "x1": 160,
      "y1": 106,
      "x2": 350,
      "y2": 116
    },
    {
      "name": "campaign",
      "type": "Empty"
    },
    {
      "name": "reference",
      "idoffset": 2,
      "required": true,
      "x1": 600,
      "y1": 90,
      "x2": 800,
      "y2": 110,
      "extractionrules": [
        {
          "action": "QuitSpaces"
        },
        {
          "action": "Erase",
          "target": "N.ORDER:"
        }
      ]
    },
    {
      "name": "invoicedate",
      "type": "DateTime",
      "idoffset": 2,
      "required": true,
      "format": "dd/MM/yyyy",
      "x1": 600,
      "y1": 106,
      "x2": 800,
      "y2": 125,
      "extractionrules": [
        {
          "action": "Erase",
          "target": "DATE:"
        },
        {
          "action": "QuitSpaces"
        }
      ]
    },
    {
      "name": "table1",
      "type": "Table",
      "header": [
        {
          "name": "format",
          "idoffset": 1,
          "required": true,
          "master": true,
          "x1": 235,
          "y1": 171.10,
          "x2": 273,
          "y2": 178.41,
          "extractionrules": [
            {
              "action": "QuitSpaces"
            },
            {
              "action": "Erase",
              "target": "20"
            },
            {
              "action": "Erase",
              "target": "\""
            }
          ]
        },
        {
          "name": "duration",
          "idoffset": 1,
          "required": true,
          "x1": 235,
          "x2": 280,
          "extractionrules": [
            {
              "action": "QuitSpaces"
            },
            {
              "action": "Erase",
              "target": "CRADLE"
            },
            {
              "action": "Erase",
              "target": "\""
            }
          ]
        },
        {
          "name": "program",
          "idoffset": 1,
          "required": true,
          "x1": 70,
          "x2": 180
        },
        {
          "name": "startend_hour",
          "type": "TextSplit",
          "idoffset": 1,
          "parameters": [ "-" ],
          "x1": 180,
          "x2": 235,
          "extractionrules": [
            {
              "action": "QuitSpaces"
            },
            {
              "action": "Erase",
              "target": "("
            },
            {
              "action": "Erase",
              "target": "MF"
            },
            {
              "action": "Erase",
              "target": "S-U"
            },
            {
              "action": "Erase",
              "target": ")"
            }
          ]
        },
        {
          "name": "swap",
          "type": "ArraySwap",
          "idoffset": 1,
          "additionaldata": "Number",
          "deletenullsinarray": true,
          "x1": 444.21,
          "x2": 723.43,
          "w": 9.007,
          "arrayindex": {
            "y1": 146,
            "y2": 153
          },
          "alternativecoords": [
            {
              "condition": "[BOX:end_month_1] ! 31",
              "x2": 714.42
            },
            {
              "condition": "[BOX:end_month_2] ! 30",
              "x2": 705.41
            },
            {
              "condition": "[BOX:end_month_3] ! 29",
              "x2": 696.41
            }
          ]
        },
        {
          "name": "totalpasses",
          "type": "Decimal",
          "idoffset": 1,
          "x1": 310,
          "x2": 350
        },
        {
          "name": "unitprice",
          "type": "Decimal",
          "formula": "[BOXROW:totalprice] / [BOXROW:totalpasses]"
        },
        {
          "name": "discount",
          "type": "Decimal",
          "idoffset": 1,
          "x1": 380,
          "x2": 410,
          "extractionrules": [
            {
              "action": "Erase",
              "target": "%"
            },
            {
              "action": "QuitSpaces"
            }
          ]
        },
        {
          "name": "agencydiscount",
          "type": "Empty"
        },
        {
          "name": "totalprice",
          "type": "Decimal",
          "idoffset": 1,
          "required": true,
          "x1": 410,
          "x2": 445,
          "extractionrules": [
            {
              "action": "Erase",
              "target": "€"
            },
            {
              "action": "QuitSpaces"
            }
          ]
        }
      ]
    },
    {
      "name": "comments",
      "idoffset": 2,
      "x1": 60,
      "y1": 230,
      "x2": 720,
      "y2": 260,
      "extractionrules": [
        {
          "action": "QuitSpaces"
        },
        {
          "action": "Erase",
          "target": "COMMENTS:"
        }
      ]
    }
  ],
  "renames": [
    {
      "name": "startend_hour.INDEX[0]",
      "rename": "starthour",
      "exact": true,
      "casesensitive": false
    },
    {
      "name": "startend_hour.INDEX[1]",
      "rename": "endhour",
      "exact": true,
      "casesensitive": false
    },
    {
      "name": "swap.SWAP[",
      "rename": "passes",
      "exact": false,
      "casesensitive": false
    }
  ]
}
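
The alternativecoords entries of the "swap" box in the example show how a template can adapt to months of different lengths. As a rough check, and assuming that the number of ArraySwap columns equals the extraction width (x2 - x1) divided by the column width w, the arithmetic can be sketched as follows. The formula and the names SwapColumns and ColumnCount are assumptions made for illustration, not part of the library.

```csharp
using System;

public static class SwapColumns
{
    // Assumption: columns = (x2 - x1) / w, rounded to the nearest whole column.
    public static int ColumnCount(double x1, double x2, double w) =>
        (int)Math.Round((x2 - x1) / w);
}
```

With x1 = 444.21 and w = 9.007 from the example, the default x2 = 723.43 gives 31 columns, and the alternative values 714.42, 705.41 and 696.41 give 30, 29 and 28 columns. This is consistent with one column per day of the month, with the three metaboxes end_month_1, end_month_2 and end_month_3 used to decide which width applies.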

Commercial Use License

This software is provided under the terms of this Commercial Use License ("License"). By downloading, installing, or using this software, you agree to the terms and conditions of this License.

  1. License Grant:

    This software is a NuGet package that can be freely downloaded. The purpose of the software is to provide functionality for logically ordering the reading of a PDF in plain text provided by third-party software. The user is granted a limited, non-exclusive, non-transferable license to use this software for evaluation purposes during a 30-day trial period. After this period, the user must acquire a commercial license to continue using this software.

  2. Use Restrictions:

    The user may not decompile, modify, or resell this software. However, the user may redistribute the software when integrated into their own software under a commercial license. The user must acquire a valid license to use this software for commercial purposes.

  3. Disclaimer:

    The third-party software provided for PDF reading may not be able to read some types of PDFs, such as those from scanned images. The license holder shall not be liable for any loss or damage arising from the third-party software's inability to read a specific PDF.

  4. Ownership Rights:

    All ownership and intellectual property rights of this software are owned by the license holder. This software is protected by copyright laws and other applicable laws.

  5. Disclaimer of Warranties:

    This software is provided "as is," without warranties of any kind, whether express or implied. The license holder shall not be liable for any direct, indirect, incidental, special, exemplary, or consequential damages arising out of the use or inability to use this software.

  6. Governing Law:

    This License shall be governed and construed in accordance with the laws of the State of Spain without regard to its conflict of law principles.

  7. Additional Terms:

    The terms and conditions of this License may be subject to change without prior notice. It is the user's responsibility to periodically review the terms of this License.

  8. Third-party software license:

    PdfPig License
    Newtonsoft Json License

By downloading, installing, or using this software, you acknowledge that you have read and understood the terms and conditions of this License and agree to comply with them.